IJCNLP 2008
The 6th Workshop on Asian Language Resources (ALR 6)
Proceedings of the Workshop
11-12 January 2008, Indian School of Business, Hyderabad, India
© 2008 Asian Federation of Natural Language Processing
Sponsor
Special Coordination Funds for Promoting Science and Technology, Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.

Preface
This volume contains the papers presented at the sixth workshop on Asian Language Resources, held on 11–12 January 2008 in conjunction with the third International Joint Conference on Natural Language Processing (IJCNLP 2008).

Language resources have played an essential role in empirical approaches to natural language processing (NLP) for the last two decades. Previous concerted efforts on the construction of language resources, particularly in the US and European countries, have laid a solid foundation for pioneering NLP research in these two communities. In comparison, the availability and accessibility of many Asian language resources are still very limited, except for a few languages. Moreover, there is greater diversity among Asian languages with respect to character sets, grammatical properties and cultural background. Motivated by this context, we have organised a series of workshops on Asian language resources since 2001. This workshop series has contributed to the activation of NLP research in Asia, particularly in building and utilising corpora of various types and languages.

For this sixth workshop, we received 31 submissions encompassing 13 languages. The paper selection was highly competitive compared with the last five workshops. The program committee selected 10 regular papers, 3 short papers and 8 resource reports for presentation at the workshop. The workshop comprises two parts: technical sessions and a session devoted to reporting activities related to language resources in several languages. Following the resource report session, we have an open discussion on collaboration in building, standardising and exchanging language resources in Asia. We hope this workshop further accelerates the already thriving NLP research in Asia.
Chu-Ren Huang and Yoshiki Mikami
Workshop Co-chairs
Kôiti Hasida and Takenobu Tokunaga
Program Co-chairs
Organiser
Workshop chairs
Huang, Chu-Ren (Academia Sinica)
Mikami, Yoshiki (Nagaoka University of Technology)
Program chairs
Hasida, Kôiti (National Institute of Advanced Industrial Science and Technology)
Tokunaga, Takenobu (Tokyo Institute of Technology)
Program Committee
Bhattacharyya, Pushpak (IIT Bombay)
Fang, Alex Chengyu (City University of Hong Kong)
Riza, Hammam (IPTEKnet, BPPT)
Hasida, Kôiti (National Institute of Advanced Industrial Science and Technology)
He, Tingting (Huazhong Normal University)
Huang, Chu-Ren (Academia Sinica)
Hussain, Sarmad (National University of Computer & Emerging Sciences)
Itahashi, Shuichi (National Institute of Informatics)
Lu, Qin (Hong Kong Polytechnic University)
Luong, Chi Mai (National Center for Sciences and Technologies of Vietnam)
Mikami, Yoshiki (Nagaoka University of Technology)
Nandasara, Shakrange Turrance (University of Colombo, School of Computing)
Nguyen, Thi Minh Huyen (Hanoi University of Sciences)
Oo, Thein (Myanmar Computer Federation)
Rau, Victoria (Providence University)
Rim, Hae-Chang (Korea University)
Roxas, Rachel Edita O. (De La Salle University, Manila)
Shirai, Kiyoaki (Japan Advanced Institute of Science and Technology)
Sornlertlamvanich, Virach (Thai Computational Linguistics Laboratory, NICT)
Sui, Zhifang (Peking University)
Tokunaga, Takenobu (Tokyo Institute of Technology)
Vikas, Om (Indian Institute of Information Technology and Management)
Zhao, Jun (Chinese Academy of Sciences)
This workshop is supported by Special Coordination Funds for Promoting Science and Technology, Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.
Workshop Program
11-12 January 2008, Indian School of Business, Hyderabad, India
Day 1 (11 January)
9:00 Registration
9:20 Opening
9:30 Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems
     Asif Ekbal and Sivaji Bandyopadhyay
9:55 Gazetteer Preparation for Named Entity Recognition in Indian Languages
     Sujan Kumar Saha, Sudeshna Sarkar and Pabitra Mitra
10:20 Preliminary Chinese Term Classification for Ontology Construction
     Gaoying Cui, Qin Lu and Wenjie Li
10:45 Break
11:05 Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms
     Makiko Matsuda, Tomoe Takahashi, Hiroki Goto, Yoshikazu Hayase, Robin Lee Nagano and Yoshiki Mikami
11:30 Selection of XML tag set for Myanmar National Corpus
     Wunna Ko Ko and Thin Zar Phyo
11:55 Myanmar Word Segmentation using Syllable level Longest Matching
     Hla Hla Htay and Kavi Narayana Murthy
12:20 Lunch
13:50 The Link Structure of Language Communities and its Implication for Language-specific Crawling
     Rizza Caminero and Yoshiki Mikami
14:15 A Multilingual Multimedia Indian Sign Language Dictionary Tool
     Tirthankar Dasgupta, Sambit Shukla, Sandeep Kumar, Synny Diwakar and Anupam Basu
14:40 A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus
     Deniz Zeyrek and Bonnie Webber
15:05 Towards an Annotated Corpus of Discourse Relations in Hindi
     Rashmi Prasad, Samar Husain, Dipti Sharma and Aravind Joshi
15:30 Break
15:50 A Semantic Study on Yami Ontology in Traditional Songs
     Yin-Sheng Tai, D. Victoria Rau and Meng-Chien Yang
16:05 Assessment and Development of POS Tag Set for Telugu
     Rama Sree R.J., Uma Maheswara Rao G and Madhu Murthy K.V.
16:20 Designing a Common POS-Tagset Framework for Indian Languages
     Sankaran Baskaran, Kalika Bali, Tanmoy Bhattacharya, Pushpak Bhattacharyya, Girish Nath Jha, Rajendran S, Saravanan K, Sobha L and Subbarao K V
Day 2 (12 January)
9:00 Resources Report on Languages of Indonesia
     Hammam Riza
9:15 Confirmed Language Resource for Answering How Type Questions Developed by Using Mails Posted to a Mailing List
     Ryo Nishimura, Yasuhiko Watanabe and Yoshihiro Okada
9:30 Corpus building for Mongolian language
     Purev Jaimai and Odbayar Chimeddorj
9:45 Resources for Urdu Language Processing
     Sarmad Hussain
10:00 Balanced Corpus of Contemporary Written Japanese
     Kikuo Maekawa
10:15 Break
10:35 A Basic Framework to Build a Test Collection for the Vietnamese Text Categorization
     Viet Hoang-Anh, Thu Dinh-Thi-Phuong and Thang Huynh-Quyet
10:50 Enhanced Tools for Online Collaborative Language Resource Development
     Virach Sornlertlamvanich, Thatsanee Charoenporn, Suphanut Thayaboon, Chumpol Mokarat and Hitoshi Isahara
11:05 Japanese Effort Toward Sharing Text and Speech Corpora
     Shuichi Itahashi and Kôiti Hasida
11:20 Open Discussion
12:20 Closing
Table of Contents

⟨Regular papers⟩

Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems
  Asif Ekbal and Sivaji Bandyopadhyay ...... 1
Gazetteer Preparation for Named Entity Recognition in Indian Languages
  Sujan Kumar Saha, Sudeshna Sarkar and Pabitra Mitra ...... 9
Preliminary Chinese Term Classification for Ontology Construction
  Gaoying Cui, Qin Lu and Wenjie Li ...... 17
Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms
  Makiko Matsuda, Tomoe Takahashi, Hiroki Goto, Yoshikazu Hayase, Robin Lee Nagano and Yoshiki Mikami ...... 25
Selection of XML tag set for Myanmar National Corpus
  Wunna Ko Ko and Thin Zar Phyo ...... 33
Myanmar Word Segmentation using Syllable level Longest Matching
  Hla Hla Htay and Kavi Narayana Murthy ...... 41
The Link Structure of Language Communities and its Implication for Language-specific Crawling
  Rizza Caminero and Yoshiki Mikami ...... 49
A Multilingual Multimedia Indian Sign Language Dictionary Tool
  Tirthankar Dasgupta, Sambit Shukla, Sandeep Kumar, Synny Diwakar and Anupam Basu ...... 57
A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus
  Deniz Zeyrek and Bonnie Webber ...... 65
Towards an Annotated Corpus of Discourse Relations in Hindi
  Rashmi Prasad, Samar Husain, Dipti Sharma and Aravind Joshi ...... 73

⟨Short papers⟩

A Semantic Study on Yami Ontology in Traditional Songs
  Yin-Sheng Tai, D. Victoria Rau and Meng-Chien Yang ...... 81
Assessment and Development of POS Tag Set for Telugu
  Rama Sree R.J., Uma Maheswara Rao G and Madhu Murthy K.V. ...... 85
Designing a Common POS-Tagset Framework for Indian Languages
  Sankaran Baskaran, Kalika Bali, Tanmoy Bhattacharya, Pushpak Bhattacharyya, Girish Nath Jha, Rajendran S, Saravanan K, Sobha L and Subbarao K V ...... 89

⟨Resource reports⟩

Resources Report on Languages of Indonesia
  Hammam Riza ...... 93
Confirmed Language Resource for Answering How Type Questions Developed by Using Mails Posted to a Mailing List
  Ryo Nishimura, Yasuhiko Watanabe and Yoshihiro Okada ...... 95
Corpus building for Mongolian language
  Purev Jaimai and Odbayar Chimeddorj ...... 97
Resources for Urdu Language Processing
  Sarmad Hussain ...... 99
Balanced Corpus of Contemporary Written Japanese
  Kikuo Maekawa ...... 101
A Basic Framework to Build a Test Collection for the Vietnamese Text Categorization
  Viet Hoang-Anh, Thu Dinh-Thi-Phuong and Thang Huynh-Quyet ...... 103
Enhanced Tools for Online Collaborative Language Resource Development
  Virach Sornlertlamvanich, Thatsanee Charoenporn, Suphanut Thayaboon, Chumpol Mokarat and Hitoshi Isahara ...... 105
Japanese Effort Toward Sharing Text and Speech Corpora
  Shuichi Itahashi and Kôiti Hasida ...... 107
The 6th Workshop on Asian Language Resources, 2008
Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems
Asif Ekbal
Department of Computer Science and Engineering, Jadavpur University
Kolkata-700032, India
[email protected]

Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University
Kolkata-700032, India
[email protected]
Abstract

The rapid development of language tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. The date, location, reporter and agency tags present in the web pages have been automatically named entity (NE) tagged. A portion of this partially NE tagged corpus has been manually annotated with the sixteen NE tags with the help of Sanchay Editor[1], a text editor for Indian languages. This NE tagged corpus contains 150K wordforms. Additionally, 30K wordforms have been manually annotated with the twelve NE tags as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages[2]. A table driven semi-automatic NE tag conversion routine has been developed in order to convert the sixteen-NE tagged corpus to the twelve-NE tagged corpus. The 150K NE tagged corpus has been used to develop Named Entity Recognition (NER) systems in Bengali using a pattern directed shallow parsing approach, Hidden Markov Model (HMM), Maximum Entropy (ME) Model, Conditional Random Field (CRF) and Support Vector Machine (SVM). Experimental results of the 10-fold cross validation test have demonstrated that the SVM based NER system performs the best, with an overall F-Score of 91.8%.

1 Introduction

The mode of language technology work has changed dramatically over the last few years, with the web being used as a data source in a wide range of research activities. The web is anarchic, and its use is not in the familiar territory of computational linguistics. The web walked into the ACL meetings starting in 1999. The use of the web as a corpus for teaching and research on language technology has been proposed a number of times (Rundell, 2000; Fletcher, 2001; Robb, 2003; Fletcher, 2003). There is a long history of creating a standard for western language resources. The human language technology (HLT) society in Europe has been particularly zealous about standardization, making a series of attempts such as EAGLES[3], PAROLE/SIMPLE (Lenci et al., 2000), ISLE/MILE (Calzolari et al., 2003; Bertagna et al., 2004) and, more recently, multilingual lexical database generation from parallel texts in 20 European languages (Giguet and Luquet, 2006). On the other hand, in spite of having great linguistic and cultural diversity, Asian language resources have received much less attention than their western counterparts. A new project (Takenobu et al., 2006) has been started to create a common standard for Asian language resources. They have extended an existing description framework, the
1 sourceforge.net/project/nlp-sanchay
2 http://ltrc.iiit.ac.in/ner-ssea-08
3 http://www.ilc.cnr.it/Eagles96/home.html
MILE (Bertagna et al., 2004), to describe several lexical entries of Japanese, Chinese and Thai. India is a multilingual country with enormous cultural diversity. (Bharati et al., 2001) reports on efforts to create lexical resources such as transfer lexicon and grammar from English to several Indian languages, and a dependency tree bank of annotated corpora for several Indian languages. Corpus development work from the web can be found in (Ekbal and Bandyopadhyay, 2007d) for Bengali.

Named Entity Recognition (NER) is one of the core components in most Information Extraction (IE) and Text Mining systems. During the last few years, probabilistic machine learning methods have become the state of the art for NER (Bikel et al., 1999; Chieu and Ng, 2002) and for field extraction (McCallum et al., 2000). Most prominently, Hidden Markov Models (HMMs) have been used for the information extraction task (Bikel et al., 1999). Besides HMMs, there are other systems based on Support Vector Machines (Sun et al., 2003) and Naïve Bayes (De Sitter and Daelemans, 2003). Maximum Entropy (ME) conditional models like ME Markov models (McCallum et al., 2000) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) were reported to outperform the generative HMM models on several IE tasks.

The existing works in the area of NER are mostly in non-Indian languages. There has been very little work on NER in Indian languages (ILs). In ILs, particularly in Bengali, work on NER can be found in (Ekbal and Bandyopadhyay, 2007a; Ekbal and Bandyopadhyay, 2007b; Ekbal et al., 2007c). Other than Bengali, work on NER can be found in (Li and McCallum, 2003) for Hindi.

Newspapers are a huge source of readily available documents. In the present work, the corpus has been developed from the web archive of a very well known and widely read Bengali newspaper. Bengali is the seventh most popular language in the world, the second most popular in India, and the national language of Bangladesh. A portion of the corpus has been manually annotated with the twelve NE tags as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages. A table driven semi-automatic NE tag conversion routine has been developed in order to convert this corpus to a form tagged with the twelve NE tags. The NE tagged corpus has been used to develop Named Entity Recognition (NER) systems in Bengali using a pattern directed shallow parsing approach and the HMM, ME, CRF and SVM frameworks.

A number of detailed experiments have been conducted to identify the best set of features for NER in Bengali. The ME, CRF and SVM based NER models make use of language independent as well as language dependent features. The language independent features could be applicable for NER in other Indian languages also. The system has demonstrated the highest F-Score value of 91.8% with the SVM framework. One possible reason behind its best performance may be the flexibility of the SVM framework in handling the morphologically rich Indian languages.

2 Development of the Named Entity Tagged Bengali News Corpus

2.1 Language Resource Acquisition

A web crawler has been developed that retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive of a leading Bengali newspaper within a range of dates provided as input. The crawler generates the Universal Resource Locator (URL) address for the index (first) page of any particular date. The index page contains actual news page links and links to some other pages (e.g., Advertisement, TV schedule, Tender, Comics and Weather) that do not contribute to the corpus generation. The HTML files that contain news documents are identified, and the rest of the HTML files are not considered further. A code conversion routine has been written that converts the proprietary codes used in the newspaper into the standard Indian Script Code for Information Interchange (ISCII) form, which can be processed for various tasks. A separate code conversion routine has been developed for converting ISCII codes to UTF-8 codes. A portion of this corpus has been manually annotated with the sixteen NE tags.

2.2 Language Resource Creation

The HTML files that contain news documents are identified by the web crawler and require cleaning to extract the Bengali text to be stored in the corpus along with relevant details. The HTML file is scanned from the beginning to look for tags like
those of one of the Bengali font faces as defined in the news archive. The Bengali text enclosed within font tags is retrieved and stored in the database after appropriate tagging. Pictures, captions and tables may exist anywhere within the actual news. Tables are an integral part of the news item. The pictures, their captions and other HTML tags that are not relevant to our text processing tasks are discarded during the file cleaning. The Bengali news corpus has been developed in both ISCII and UTF-8 codes. The tagged news corpus contains 108,305 news documents covering about five (5) years (2001-2005) of news data. Some statistics about the tagged news corpus are presented in Table 1.

Total number of news documents in the corpus: 108,305
Total number of sentences in the corpus: 2,822,737
Average number of sentences in a document: 27
Total number of wordforms in the corpus: 33,836,736
Average number of wordforms in a document: 313
Total number of distinct wordforms in the corpus: 467,858

Table 1. Corpus Statistics

2.3 Language Resource Annotation

The Bengali news corpus collected from the web is annotated using a tagset that includes the type and subtype of the news, title, date, reporter or agency name, news location and the body of the news. A part of this corpus is then tagged with a tagset consisting of sixteen NE tags and one non-NE tag. Also, a part of the corpus has been tagged with the twelve NE tags defined for the IJCNLP-08 NER shared task[4]. The automatically inserted tags cover the date, reporter, agency and the various date expressions that appear in the fixed places of the newspaper. These tags are not able to recognize the various NEs that appear within the actual news body.

In order to identify NEs within the actual news body, we have defined a tagset consisting of seventeen tags. We have considered the four major NE classes, namely 'Person name', 'Location name', 'Organization name' and 'Miscellaneous name'. Miscellaneous names include the date, time, number, percentage and monetary expressions. The four major NE classes are further divided in order to properly denote each component of the multiword NEs. The NE tagset is shown in Table 3 with appropriate examples.

We have also tagged a portion of the corpus as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages. This tagset has twelve different tags. The underlying reason for adopting these tags was the necessity of a slightly finer tagset for various natural language processing (NLP) applications and particularly for machine translation. The IJCNLP-08 NER shared task tagset is shown in Table 4.

One important aspect of the IJCNLP-08 NER shared task was to annotate only the maximal NEs and not the structures of the entities. For example, mahatma gandhi road is annotated as a location and assigned the tag 'NEL' even if mahatma and gandhi are NE title person and person name, respectively, according to the IJCNLP-08 shared task tagset. These internal structures of the entities need to be identified during testing. So, mahatma gandhi road will be tagged as mahatma/NETP gandhi/NEP road/NEL. The structure of the tagged element using the Shakti Standard Format (SSF)[5] will be as follows:

1 (( NP
4 http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=3
5 http://shiva.iiit.ac.in/SPSAL2007/ssf.html
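The font-tag based cleaning described in Section 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the face name `BengaliFont` is a hypothetical placeholder, since the archive's actual proprietary font names are not given in the paper.

```python
from html.parser import HTMLParser

class FontTextExtractor(HTMLParser):
    """Collect text enclosed in <font> tags whose 'face' attribute matches
    a known Bengali font face (placeholder names; the real archive differs)."""

    BENGALI_FACES = {"bengalifont"}  # hypothetical face name

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside matching <font> tags
        self.chunks = []   # extracted text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            face = dict(attrs).get("face", "").lower()
            # increase depth if this font matches, or if we are already
            # inside a matching font (so nested tags do not end extraction)
            if face in self.BENGALI_FACES or self.depth:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_news_text(html):
    """Return the text found inside matching font tags, space-joined."""
    parser = FontTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Text outside the matching font tags (navigation, advertisements, captions in other fonts) is dropped, which matches the paper's account of discarding irrelevant HTML content during file cleaning.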
2.4 Partially Tagged News Corpus Development

A news document is stored in the corpus in XML format using the tagset mentioned in Table 2. In the HTML news file, the date is stored first and is divided into three parts: the Bengali date, the day and the English date. These dates are stored in the form 'day month year'. A sequence of four Bengali digits separates the Bengali date from the Bengali day. The English date starts with one or two digits in Bengali font. The Bengali date, the day and the English date can be distinguished by checking the appearance of the numerals, and these are tagged as bd, day and ed, respectively.

Tag | Definition
header | Header of the news document
title | Headline of the news document
t1 | 1st headline of the title
t2 | 2nd headline of the title
date | Date of the news document
bd | Bengali date
day | Day
ed | English date
reporter | Reporter name
agency | Agency providing news
location | News location
body | Body of the news document
p | Paragraph
table | Information in tabular form
tc | Table column
tr | Table row

Table 2. News Corpus Tagset

Tag | Meaning | Example
PER | Single word person name | sachin/PER, manmohan/PER
LOC | Single word location name | jadavpur/LOC, delhi/LOC
ORG | Single word organization name | infosys/ORG, tifr/ORG
MISC | Single word miscellaneous name | 100%/MISC, 100/MISC
B-PER, I-PER, E-PER | Beginning, internal or end of a multiword person name | sachin/B-PER ramesh/I-PER tendulkar/E-PER
B-LOC, I-LOC, E-LOC | Beginning, internal or end of a multiword location name | mahatma/B-LOC gandhi/I-LOC road/E-LOC
B-ORG, I-ORG, E-ORG | Beginning, internal or end of a multiword organization name | bhaba/B-ORG atomic/I-ORG research/I-ORG centre/E-ORG
B-MISC, I-MISC, E-MISC | Beginning, internal or end of a multiword miscellaneous name | 10 e/B-MISC magh/I-MISC 1402/E-MISC
NNE | Words that are not named entities | neta/NNE, bidhansabha/NNE

Table 3. Named Entity Tagset
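To illustrate how the boundary tags in Table 3 compose multiword names, the following helper (an illustrative sketch, not part of the authors' toolchain) merges B-/I-/E- tagged runs into single maximal entities:

```python
def merge_multiword_nes(tagged_tokens):
    """Group (word, tag) pairs so that a B-X ... I-X ... E-X run collapses
    into one (phrase, X) entity; single-word tags pass through unchanged."""
    entities, buffer, current = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            buffer, current = [word], tag[2:]       # open a new entity
        elif tag.startswith("I-") and current == tag[2:]:
            buffer.append(word)                     # extend the open entity
        elif tag.startswith("E-") and current == tag[2:]:
            buffer.append(word)                     # close and emit it
            entities.append((" ".join(buffer), current))
            buffer, current = [], None
        else:
            entities.append((word, tag))            # single-word tag or NNE
    return entities
```

Run on the mahatma gandhi road example from Table 3, this yields a single LOC entity spanning all three tokens, which is exactly the "maximal NE" view used by the IJCNLP-08 shared task annotation.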
Table 4. IJCNLP-08 shared task NE tagset (NE tag, meaning and example)

The NE tagged corpus is in the SSF form illustrated above; a further table lists original date patterns alongside their tagged forms.
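Section 2.4's rule of distinguishing Bengali dates, days and English dates "by checking the appearance of the numerals" might be sketched as follows; the function names and return labels are illustrative assumptions, not the paper's code:

```python
def digit_script(token):
    """Classify the numeral script of a token: 'bengali' if every character
    is a Bengali digit (U+09E6 to U+09EF), 'ascii' if every character is an
    ASCII digit, and None otherwise (e.g. a month name)."""
    if token and all("\u09e6" <= ch <= "\u09ef" for ch in token):
        return "bengali"
    if token.isascii() and token.isdigit():
        return "ascii"
    return None

def looks_like_bengali_year(token):
    """A run of exactly four Bengali digits, the cue the paper uses to
    separate the Bengali date from the Bengali day."""
    return digit_script(token) == "bengali" and len(token) == 4
```

A tagger built on this cue would emit bd for tokens around a four-Bengali-digit year, day for the weekday word, and ed where ASCII-style date digits appear, mirroring the bd/day/ed tags of Table 2.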
2.6 Tag Conversion

A tag conversion routine has been developed in order to convert the sixteen-NE tagged corpus of 150K wordforms to a corpus tagged with the IJCNLP-08 twelve NE tags. This conversion is a semi-automatic process. Some of our sixteen NE tags can be automatically mapped to some of the IJCNLP-08 shared task tags. The tags that represent person, location and organization names can be directly mapped to the NEP, NEL and NEO tags, respectively. Other IJCNLP-08 shared task tags can be obtained with the help of gazetteer lists and simple heuristics. In our earlier NER experiments, we had already developed a number of gazetteer lists, such as lists of person, location and organization names; a list of prefix words (e.g., sree, sriman etc.) that predict the left boundary of a person name; and a list of designation words (e.g., mantri, sangsad etc.) that helps to identify person names. The lists of prefix and designation words are helpful in finding the NETP and NED tags. The sixteen-NE tagged corpus is searched for the person name tags, and the previous word is matched against the lists of prefix and designation words. The previous word is tagged as NED or NETP if there is a match with the list of designation words or prefix words, respectively. The NEN and NETI tags can be obtained by looking at our miscellaneous tags and using some simple heuristics. The NEN tags can be obtained simply by checking whether the MISC tagged element consists of digits only. Lists of cardinal and ordinal numbers have also been kept to recognize NETI tags. A list of words that denote measurements (e.g., kilogram, taka, dollar etc.) has been kept in order to get the NEM tag. Lists of words denoting brand names, title-objects and terms have been prepared to get the NEB, NETO and NETE tags. The NEA tags can be obtained with the help of a gazetteer list and some simple heuristics (whether the word contains a dot and there is no space between the characters). The mapping from our NE tagset to the IJCNLP-08 NER shared task tagset is shown in Table 8.

NE Class | Number of wordforms | Number of distinct wordforms
Person name | 5,123 | 3,201
Location name | 1,675 | 1,119
Organization name | 168 | 131
Designation | 231 | 102
Abbreviation | 32 | 21
Brand | 15 | 12
Title-person | 79 | 51
Title-object | 63 | 42
Number | 324 | 126
Measure | 54 | 31
Time | 337 | 212
Terms | 46 | 29

Table 7. Statistics of the 30K-tagged Corpus

Sixteen-NE tagset | IJCNLP-08 tagset
PER, LOC, ORG | NEP, NEL, NEO
B-PER, I-PER, E-PER | NEP
B-LOC, I-LOC, E-LOC | NEL
B-ORG, I-ORG, E-ORG | NEO
MISC | NEN
B-MISC, I-MISC, E-MISC | NETI, NEM
Brand name gazetteer | NEB
Title-object gazetteer | NETO
Term gazetteer | NETE
Person prefix word + PER/B-PER, I-PER, E-PER | NETP
Designation word + PER/B-PER, I-PER, E-PER | NED
Abbreviation gazetteer + heuristics | NEA

Table 8. Tagset Mapping Table

3 Use of Language Resources

The NE tagged news corpus developed in this work has been used to develop Named Entity Recognition (NER) systems in Bengali using a pattern directed shallow parsing approach and the HMM, ME, CRF and SVM frameworks. We have considered the sixteen NE tags to develop these systems. Named entity recognition in Indian Languages (ILs) in general, and in Bengali in particular, is difficult and challenging, as there is no concept of capitalization in ILs.
The Bengali NER systems that use the pattern directed shallow parsing approach can be found in (Ekbal and Bandyopadhyay, 2007a) and (Ekbal and Bandyopadhyay, 2007b). An HMM-based Bengali NER system can be found in (Ekbal and Bandyopadhyay, 2007c). These systems have been trained and tested with the corpus tagged with the sixteen NE tags.

A number of experiments have been conducted in order to find the best feature set for NER in Bengali using the ME, CRF and SVM frameworks. In all these experiments, we have used a number of gazetteer lists, such as: first names (72,206 entries), middle names (1,491 entries) and surnames (5,288 entries) of persons; prefix words (245 entries) and designation words (947 entries) that help to detect person names; suffixes (45 and 23 entries) that help to identify person and location names; clue words (94 entries) that help to detect organization names; and location names (7,870 entries) and organization names (2,225 entries). Apart from these gazetteer lists, we have used prefix and suffix features (which may not be linguistically meaningful affixes), digit features, a first-word feature and part of speech information of the words. We have used the C++ based Maximum Entropy package[6], the C++ based CRF++ package[7] and the open source YamCha[8] tool for ME based NER, CRF based NER and SVM based NER, respectively. For the SVM based NER system, we have used the TinySVM[9] classifier, the pairwise multi-class decision method and the second-degree polynomial kernel function. Brief descriptions of all the models are given below:

- A: Pattern directed shallow parsing approach without linguistic knowledge.
- B: Pattern directed shallow parsing approach with linguistic knowledge.
- HMM based NER: Trigram model with additional context dependency and NE suffix lists for handling unknown words.
- ME based NER: Contextual window of size three (current, previous and next word), prefix and suffix of length up to three of the current word, POS information of the current word, NE information of the previous word (dynamic feature), different digit features and the various gazetteer lists.
- CRF based NER: Contextual window of size five (current word, previous two words and next two words), prefix and suffix of length up to three of the current word, POS information of a window of three (current, previous and next word), NE information of the previous word (dynamic feature), different digit features and the various gazetteer lists.
- SVM based NER: Contextual window of size six (current word, previous three words and next two words), prefix and suffix of length up to three of the current word, POS information of a window of three (current, previous and next word), NE information of the previous two words (dynamic feature), different digit features and the various gazetteer lists.

Evaluation results of the 10-fold cross validation test for all the models are presented in Table 9. The results clearly show that the SVM based NER model outperforms the other models due to its efficiency in handling the non-independent, diverse and overlapping features of the Bengali language.

Model | F-Score (in %)
A | 74.5
B | 77.9
HMM | 84.5
ME | 87.4
CRF | 90.7
SVM | 91.8

Table 9. Results of the 10-fold Cross Validation Test

4 Conclusion

In this work we have developed a Bengali news corpus of approximately 34 million wordforms from the web archive of a leading Bengali newspaper. The date, location, reporter and agency tags present in the web pages have been automatically NE tagged. Around 150K wordforms of this partially NE tagged corpus have been manually annotated with the sixteen NE tags. We have also tagged around 30K wordforms with the twelve NE tags defined for the IJCNLP-08 NER shared task. A tag conversion routine has also been developed in order to convert any sixteen-NE tagged corpus to the twelve-NE tagged corpus. The sixteen-NE tagged corpus of 150K wordforms has been used to develop the NER systems using various machine learning approaches.

6 http://homepages.inf.ed.ac.uk/s0450736/software/maxent/maxent-20061005.tar.bz2
7 http://crfpp.sourceforge.net
8 http://chasen.org/~taku/software/yamcha/
9 http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM
develop the NER systems using various machine-learning approaches.

This NE tagged corpus can be used for other NLP research activities such as machine translation, information retrieval, cross-lingual event tracking, automatic summarization etc.
Gazetteer Preparation for Named Entity Recognition in Indian Languages
Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra
Indian Institute of Technology Kharagpur, West Bengal, India - 721302
[email protected], [email protected], [email protected]
Abstract

This paper describes our approaches for the preparation of gazetteers for named entity recognition (NER) in Indian languages. We describe two methodologies for the preparation of gazetteers¹. Since the relevant gazetteer lists are more easily available in English, we use a transliteration based approach to convert available English name lists to Indian languages. The second approach is a context pattern induction based domain specific gazetteer preparation. This approach uses a domain specific raw corpus and a few seed entities to learn context patterns; the corresponding name lists are then generated by bootstrapping.

¹ A specialized list of names for a particular class of Named Entity (NE). For example, India is in the location gazetteer, Sachin is in the person first name gazetteer.

1 Introduction

Named entity recognition involves locating and classifying the names in text. NER is an important task, having applications in information extraction (IE), question answering (QA), machine translation and most other NLP applications.

NER systems have been developed for resource-rich languages like English with very high accuracies. But constructing an NER system for a resource-poor language is very challenging due to the unavailability of proper resources. Name dictionaries or gazetteers are very helpful NER resources, and for most Indian languages there is no publicly available list of reasonable size. The web contains many such resources, which can be used for Indian language NER development, but most of the web resources are in English. Our approach is to transliterate the relevant English resources and name dictionaries into Indian languages to make them useful for the Indian language NER task. But direct transliteration from English to an Indian language is not easy. A few attempts have been made to build English to Indian language transliteration systems, but the word agreement ratio (WAR) reached is up to 69.3% (Ekbal et al., 2006).

We have attempted to build a transliteration system which uses an intermediate alphabet. Both the English and the Indian language strings are transliterated to the intermediate alphabet, and for an English-Indian language pair, if the transliterated intermediate alphabet strings are the same then we conclude that the strings are transliterations of one another. We have transliterated the available English name lists into the intermediate alphabet so that they can be used as gazetteers. An Indian language word then only needs to be transliterated to the intermediate format to check whether it is in a gazetteer. This system does not transliterate the English name lists into Indian languages, but makes them usable in the Indian language NER task.

Transliteration based approaches are useful when English name lists are available. When relevant English name lists are not available, we can still prepare gazetteers from a raw corpus. We have defined a semi-automatic context pattern (CP) extraction based gazetteer preparation framework. This framework uses bootstrapping to prepare the gazetteers from a large raw corpus, starting from a few seed entities.
Firstly, fixed length patterns are formed using the surrounding words of the seeds. Depending on the pattern precision, the patterns are discarded or generalized by dropping tokens from the patterns. This set of high precision patterns extracts other named entities (NEs), which are added to the seed list for the next iteration of the process. Finally we are able to prepare the required gazetteers.

To prove the effectiveness of the gazetteer preparation approach, we have prepared some gazetteers, such as names of cricketers and names of tennis players, from a raw Hindi sports domain corpus. The details of the approaches are given in the following sections.

The paper is organized as follows. Usefulness of gazetteers in NER, transliteration approaches in general and for Indian languages in particular, and general pattern extraction methodologies are discussed in Section 2. Section 3 presents the architecture of the 2-phase transliteration system and the preparation of gazetteers using it. In Section 4 context pattern extraction based gazetteer preparation is discussed. Finally Section 5 concludes the paper.

2 Previous Work

The main approaches to NER are linguistic approaches and machine learning (ML) based approaches. The linguistic approach typically uses rule-based models manually written by linguists. ML based techniques make use of a large amount of annotated training data to acquire high-level language knowledge. Several ML techniques like Hidden Markov Models (HMM) (Bikel et al., 1997), Maximum Entropy Models (MaxEnt) (Borthwick, 1999) and Conditional Random Fields (CRF) (Li and McCallum, 2004) have been successfully used for the NER task. Both kinds of approaches may make use of gazetteer information, and there are many systems which use gazetteers to improve accuracy.

Ralph Grishman has developed a rule-based NER system which uses some specialized name dictionaries, including names of all countries, names of major cities, names of companies, common first names etc. (Grishman, 1995). Another rule-based NER system, developed by Wakao et al. (1996), has used several gazetteers like organization names, location names, person names, human titles etc.

We will now mention some ML based systems. MENE is a MaxEnt based system developed by Borthwick. This system has used 8 dictionaries (Borthwick, 1999): First names (1,245), Corporate names (10,300), Corporate names without suffix (10,300), Colleges and Universities (1,225), Corporate suffixes (244), Date and Time (51) etc., where the numbers in brackets indicate the sizes of the dictionaries. The hybrid system developed by Srihari et al. (2000) combines several modules built using MaxEnt, HMM and handcrafted rules. This system uses the following gazetteers: First name (8,000), Family name (14,000) and a big gazetteer of Locations (250,000). There are many other systems which have used name dictionaries to improve accuracy. Kozareva (2006) described a methodology to generate gazetteer lists automatically for Spanish and to build an NER system with labeled and unlabeled data. The location gazetteer is built by finding location patterns which look for specific prepositions, and the person gazetteer is constructed with a graph exploration algorithm.

Transliteration is also a very important topic, and many transliteration systems for different languages have been developed using different approaches. The basic approaches to transliteration are phoneme based or spelling based. A phoneme-based statistical transliteration system from Arabic to English was developed by Knight and Graehl (1998). This system uses a finite state transducer that implements transformation rules to do back-transliteration. A spelling-based model that directly maps English letter sequences into Arabic letters was developed by Al-Onaizan and Knight (2002). Several transliteration systems exist for English-Japanese, English-Chinese, English-Spanish and many other language pairs. But very few attempts have been reported on the development of transliteration systems between Indian languages and English. We can mention a system for Bengali-English transliteration developed by Ekbal et al. (2006). They have proposed different models modifying the joint source channel model. In that system a Bengali string is divided into transliteration units containing a vowel
modifier or matra at the end of each unit. Similarly the English string is also divided into units. They then defined various unigram, bigram and trigram models depending on the contexts of the units considered. They have also used linguistic knowledge in the form of possible conjuncts and diphthongs in Bengali and their representations in English. This system is capable of transliterating mainly person names. The highest transliteration accuracy achieved by them is 69.3% Word Agreement Ratio (WAR) for Bengali to English and 67.9% WAR for English to Bengali transliteration.

In the field of IE, patterns play a key role in identifying relevant pieces of information. Soderland et al. (1995), Riloff and Jones (1999), Lin et al. (2003), Downey et al. (2004) and Etzioni et al. (2005) described different approaches to context pattern induction. Talukdar et al. (2006) combined grammatical and statistical techniques to create high precision patterns specific to NE extraction. An approach to lexical pattern learning for Indian languages is described by Ekbal and Bandyopadhyay (2007). They used seed data and an annotated corpus to find the patterns for NER.

3 Transliteration based Gazetteer Preparation

Gazetteers or name dictionaries are helpful in NER. We have already discussed some English NER systems where the usefulness of gazetteers has been established. While developing NER systems in Indian languages, we tried to find relevant gazetteers, but we could not obtain openly available gazetteer lists for these languages. However, we found that a lot of resources of names of Indian persons, Indian places, organizations etc. are available on the web in English. In Table 1 we have mentioned some of the sources which contain relevant name lists.

    First Name:
        http://www.bsnl.co.in/onlinedirectory.htm
        http://web1.mtnl.net.in/directory/
        http://www.eci.gov.in/
        http://hiren.info/indian-baby-names/
        http://www.indiaexpress.com/specials/babynames/
    Surname:
        http://surnamedirectory.com/surname-index.html
        http://web1.mtnl.net.in/directory/
        http://en.wikipedia.org
    India Location:
        http://indiavilas.com/indiainfo/pincodes.asp
        http://www.indiapost.gov.in
        http://www.eci.gov.in/
    World Location:
        http://www.maxmind.com/app/worldcities
        http://en.wikipedia.org/wiki

    Table 1: Web sources for some relevant name lists

But it is not possible to use the available name lists directly in the Indian language NER task, as these are in English. We have decided to transliterate the English lists into Indian languages to make them useful in the Indian language NER task.

3.1 Transliteration

Transliteration from English to Hindi is quite difficult. The English alphabet contains 26 characters whereas the Hindi alphabet contains 52 characters, so the mapping is not trivial. We have already mentioned that a transliteration system for Bengali was developed by Ekbal et al. A similar approach can be used to develop transliteration systems for other Indian languages. But this approach uses a bilingual transliteration corpus, which requires much effort to build and is unavailable in proper size for all Indian languages. Also, using this approach the word agreement ratio obtained is below 70%.

To make the transliteration process easier and more accurate, we have decided to build a 2-phase transliteration module. Our goal is to decide whether a particular Indian language string is in an English gazetteer or not. We need not transliterate directly from Indian language strings to English, or English name lists into Indian languages. Our idea is to define an intermediate alphabet, and both English and Indian language strings will be transliterated to
the intermediate alphabet. For an English-Hindi string pair, if the intermediate alphabet strings are the same then we can conclude that one string is the transliteration of the other.

First of all we need to decide the size of the intermediate alphabet. Preserving the phonetic properties, we have defined our intermediate alphabet to consist of 34 characters. To indicate these 34 characters, we have given a unique character-id to each character.

3.2 English to Intermediate Alphabet Transliteration

For transliterating English strings into the intermediate state, we have built a phonetic map table. This phonetic map table maps an English n-gram to one intermediate character. A part of the map table is given in Table 2. The mapping is from strings of varying length in English to one character in the intermediate alphabet; in our table the length of the left hand side varies from 1 to 3.

    English        Intermediate
    A              ˆa
    EE, I, II      ˆı
    OO, U          ˆu
    B, W           ˆb
    BH, V          ˆv
    CH             ˆc
    R, RH          ˆr
    SH, S          ˆs

    Table 2: A part of the map table

In the following we describe the transliteration procedure.

Procedure 1: Transliteration
Source string: English. Output string: Intermediate.
1. Scan the source string (S) from left to right.
2. Extract the first n-gram (G) from the string (n = 3).
3. Find whether it is in the map-table.
4. If yes, insert its corresponding intermediate character into the target string T. Remove the n-gram from S: S = S − G. Go to step 2.
5. Else set n = n − 1. Go to step 3.

Here we take an Indian language name, 'surabhi', as an example to explain the procedure in detail. When the name is written in English, it may be spelled in several ways, like 'suravi', 'shuravi', 'surabhee', 'shurabhi' etc. The English string 'surabhi' is transliterated to 'ˆsˆuˆrˆaˆvˆı' by the transliterator. If we look at the transliteration of 'shuravi', the intermediate transliterated string is the same as the previous one.

3.3 Indian Language to Intermediate Alphabet Transliteration

This is a 2-phase process. The first phase transliterates the Indian language string into itrans. Itrans is a representation of the Indian language alphabets in terms of ASCII. Since Indian text is composed of syllabic units rather than individual alphabetic letters, itrans uses combinations of two or more letters of the English alphabet to represent an Indian language syllable. However, as there are multiple sounds in Indian languages corresponding to the same English letter, not all Indian syllables can be represented by logical combinations of the English alphabet. Hence, itrans also uses some non-alphabetic special characters in some of the syllables. A map table², with some heuristic knowledge, is used for the transliteration. For example, the Hindi word 'surabhi' is converted to 'sUrabhI' in itrans.

² The map table is available at www.aczoom.com/itrans.

In the second phase the itrans string is transliterated into the intermediate state using a procedure similar to the one described in Section 3.2. Here also we use a map-table, containing the mappings from itrans to the intermediate alphabet. This procedure transliterates the example itrans word 'sUrabhI' to 'ˆsˆuˆrˆaˆvˆı'.

3.4 Evaluation

In Sections 3.2 and 3.3 we have described the two-phase transliteration with an example word. We have shown that our transliteration system transliterates the Indian language name 'surabhi' and the corresponding English strings into the same intermediate string.
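Procedure 1 is a greedy longest-match scan over the source string. The following is a minimal sketch, not the authors' implementation: the map table is only the lowercase fragment shown in Table 2, the ASCII `^`-prefixed codes stand in for the paper's intermediate character ids, and the rule for skipping unmapped characters is our addition.

```python
# Procedure 1 as a greedy longest-match scan: always consume the longest
# n-gram (n = 3, 2, 1) found in the map table. Only the Table 2 fragment
# is encoded here; '^x' stands in for the paper's intermediate ids.
MAP_TABLE = {
    "a": "^a", "ee": "^i", "i": "^i", "ii": "^i", "oo": "^u", "u": "^u",
    "b": "^b", "w": "^b", "bh": "^v", "v": "^v", "ch": "^c",
    "r": "^r", "rh": "^r", "sh": "^s", "s": "^s",
}

def transliterate(source, table=MAP_TABLE, max_n=3):
    s, target = source.lower(), []
    while s:
        for n in range(min(max_n, len(s)), 0, -1):   # steps 2, 3 and 5
            if s[:n] in table:
                target.append(table[s[:n]])          # step 4
                s = s[n:]                            # S = S - G
                break
        else:
            s = s[1:]   # our addition: skip characters with no mapping
    return "".join(target)

# All spelling variants of the paper's example collapse to one string,
# as does the itrans form 'sUrabhI' produced in phase 1 (Section 3.3).
variants = ["surabhi", "suravi", "shuravi", "surabhee", "shurabhi", "sUrabhI"]
assert len({transliterate(v) for v in variants}) == 1
```

Because the variant spellings ('bh' vs 'v', 'sh' vs 's', 'ee' vs 'i') all map to the same intermediate characters, a gazetteer lookup in the intermediate alphabet is robust to these romanization differences.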
The system has limitations: for example, sometimes two different strings are mapped to the same intermediate alphabet string.

For the evaluation of the system we have applied the transliteration system to two languages, Hindi and Bengali. For evaluating the system for Hindi we have created a bilingual corpus containing 1070 English-Hindi word pairs, most of which are names. 980 of them are transliterated correctly by the system. The system accuracy is 980 × 100/1070 = 91.59%. For evaluating the system for Bengali, we have used a similar bilingual corpus, and the system transliterates with 89.3% accuracy.

3.5 Prepared Gazetteer Lists

Previously we have mentioned the web sources where some name lists are available. Names of a particular category are collected from different sources and merged to build an English name list for that category. Then we apply our transliteration procedure to the list and transliterate it into the intermediate alphabet. This intermediate alphabet list acts as a gazetteer in the NER task in Indian languages. When an Indian language NER system needs to access the gazetteer lists, it transliterates the Indian language strings into the intermediate alphabet and searches the list. In the following we describe the prepared gazetteer lists, which are useful for a general domain Indian language NER system.

First Name List: This list contains 10,200 first names collected from the web. Most of the collected first names are of Indian origin. Apart from the Indian names, we have also collected some non-Indian names. These non-Indian names are generally the names of famous persons, like sports stars, film stars, scientists and politicians, who are likely to occur in Indian language texts. Our first name list includes 1500 such names.

Surname List: This is a very important list which contains common surnames. We have prepared the surname list from different sources; it contains about 1500 Indian surnames and 400 other surnames.

Indian Locations: This list contains about 14,000 entities. The names of states, cities and towns, districts, important places in different cities and even lots of village names are collected in the list. The list needs to be processed into a list of unigrams (e.g., kolakAtA³ (Kolkata), bihAra (Bihar)), bigrams (e.g., nayI dillI (New Delhi), pashchima bA.nglA (West Bengal)) and trigrams (e.g., uttaara chabisha paraganA (North 24 Pargana)). Words are matched against the unigrams, sequences of two consecutive words are matched against the bigrams, and sequences of three consecutive words are matched against the trigrams.

³ The Indian language strings are written in italics and using the itrans transliteration.

World Location: This list contains the names of the countries, different state and city names in the world, and also the names of important rivers, mountains etc. The list contains about 4,000 location names. Similar to the Indian location list, this list also needs to be processed into unigrams, bigrams and trigrams.

4 Context Pattern Extraction based Gazetteer Preparation

Gazetteers can also be prepared by extracting context patterns. Transliteration based gazetteer preparation is applicable when an English or parallel language name list is available. If such relevant name lists are not available but a large raw corpus is, then we can use the context pattern extraction based methodology to prepare the gazetteers. This method seeks some high precision context patterns using some seed entities and applies the patterns to the raw corpus to prepare the gazetteers.

The overall methodology of extracting context patterns from a raw corpus is summarized as follows:

1. Find a large raw corpus and some seed entities (E) for each class of NEs.

2. For each seed entity e in E, find in the corpus the context strings (C) comprised of n tokens before e, a placeholder for the class instance, and n tokens after e. [We have used n = 3.] This set of strings forms the initial patterns.

3. Search for each pattern in the corpus and find its coverage and precision.

4. Discard the patterns having low precision.

5. Generalize the patterns by dropping one or more tokens to increase coverage.

6. Find the best patterns, having good precision and coverage.
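The steps above can be sketched as a small induction loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the corpus is a toy whitespace-tokenized fragment in itrans romanization, the sentences and seed are invented, one generalization is hard-coded, and the precision filtering (steps 3-4) is omitted.

```python
# Simplified sketch of context-pattern induction over a toy corpus.
# The itrans sentences and the seed below are illustrative only.
CORPUS = ("bhAratiya ballebAja sachina ne 100 rana banAye . "
          "bhAratiya ballebAja dravi.Da ne 80 rana banAye . "
          "gedabAja makagrA ne 3 vikeTa liye .").split()

def extract_patterns(tokens, seeds, n=3):
    """Step 2: form (n tokens, placeholder, n tokens) windows around seeds."""
    patterns = set()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            patterns.add((tuple(tokens[max(0, i - n):i]),
                          tuple(tokens[i + 1:i + 1 + n])))
    return patterns

def match(tokens, pattern):
    """Collect the tokens that a (left, right) context pattern extracts."""
    left, right = pattern
    found = set()
    for i in range(len(tokens)):
        if (tuple(tokens[max(0, i - len(left)):i]) == left and
                tuple(tokens[i + 1:i + 1 + len(right)]) == right):
            found.add(tokens[i])
    return found

seeds = {"sachina"}
patterns = extract_patterns(CORPUS, seeds)

# The full +/-3 pattern only re-finds the seed itself; step 5 generalizes
# it by dropping edge tokens so that it gains coverage.
generalized = {(left[1:], right[:1]) for left, right in patterns}

gazetteer = set(seeds)
for p in generalized:
    gazetteer |= match(CORPUS, p)

print(sorted(gazetteer))   # → ['dravi.Da', 'sachina']
```

In the toy run, the generalized pattern (ballebAja, CRIC-placeholder, ne) pulls in dravi.Da but not the bowler makagrA, which is the intended behavior of a high-precision pattern; in the paper's pipeline, the newly extracted names would become seeds for the next bootstrapping iteration.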
The details of the context pattern extraction based gazetteer preparation methodology are described in the following subsections. We have taken a Hindi sports domain raw corpus and prepared some gazetteers, such as names of cricketers and names of tennis players, to prove the effectiveness of the proposed methodology.

4.1 Selection of Seed Entities

The context pattern extraction based gazetteer preparation methodology is applied to a sports domain corpus which contains about 20 lakh words collected from the popular Hindi newspaper "Dainik Jagaran". We have worked on preparing a list of cricket players and a list of tennis players. We have collected the most frequent names to build the seed lists. For preparing the list of tennis players, we have taken 5 names as seed entities: Andre Agassi, Steffi Graf, Serena Williams, Roger Federer and Justine Henin. The seed list for the cricket players contains only 3 names: Sachin Tendulkar, Brian Lara and Glenn McGrath.

4.2 Context Extraction

To extract the patterns for a particular category, we select a part of the corpus where the target seeds are expected to occur with high frequency. For example, to get the patterns for the names of cricketers, we select a part of the corpus where most of the sentences are cricket related. To select the cricket related sentences, we prepared a list containing the most frequent words related to cricket, like rAna (run), ballebAja (batsman), gedabAja (bowler) etc. Depending upon the presence of such words we selected the 'part'. In our development the cricket 'part' contains 120K words. A similar 'part' is developed for the other gazetteers.

For a particular seed, we find the occurrences of the seed entity in the corresponding raw 'part' corpus. Then we extract the three tokens immediately preceding the seed and the three tokens immediately following it. A placeholder (CRIC for cricketers, TENS for tennis players) replaces the seed. The placeholder and the surrounding tokens (t−3 t−2 t−1 placeholder t+1 t+2 t+3) form the initial set of patterns.

For the seed sachina tedulkara (Sachin Tendulkar) we extract 92 initial patterns, some of which are:

• ki mAsTara blAsTara CRIC ko Tima ke
• dravi.Da aura CRIC 241 nAbAda ke
• bhAratiya ballebAja CRIC ne 100 rana
• mere vichAra se CRIC ko Takkara dene

4.3 Pattern Quality Measure

We measure the quality of a pattern by its precision and coverage. Precision is the ratio of correct identifications to total identifications. If the precision is high then we assume that the pattern is a good pattern; in our development we mark a pattern as good if its precision is 100%.

We search for the initial patterns in the corresponding 'part' corpus to measure precision and coverage. If the precision is less than 100% for a pattern then we reject the pattern. Otherwise we try to generalize it to increase its coverage. To generalize, we drop the leftmost and rightmost tokens one by one and check the pattern quality. If, for an initial pattern, several generalized patterns have 100% precision, then we select those patterns of which no subset is itself a good pattern. In this way we prepare a list of good patterns for a particular gazetteer type.

During 'good' pattern selection we made some interesting observations.

• There are some patterns which satisfy the 100% precision criterion but whose coverage is very poor in terms of new entity extraction. For example, "mAsTara blAsTara CRIC ko" is a pattern with 100% precision. The pattern has 24 instances in the 'part' corpus, but all the extracted entities are 'sachina tedulkara'. We have also examined the pattern in the total raw corpus; it is capable of extracting 'sachina tedulkara' only. So in spite of fulfilling all the criteria, the pattern is not a 'good' pattern.

• Another interesting observation is that there are some patterns which are 'good' patterns in
the context of the 'part' corpus, but when used on the total raw corpus extract non-relevant entities. For example, "mere vichAra se CRIC ko Takkara" is a 'good' pattern, so it should extract the names of cricketers. But when it is used on the total raw corpus it also extracts non-cricketer entities (e.g. tennis players, chess players). To make such patterns useful, we extract all the cricket related sentences in the same way as was used for selecting the 'part' corpus, and then the patterns are used to extract entities from these sentences.

• There are patterns with very high coverage whose precision is just below 100%. We have analyzed these patterns and manually identified the wrongly extracted entities. If the wrong entities can be grouped together and have some specific properties, then we add these entities to a 'pattern exception list'. The pattern is then used as a 'good' pattern, and the exception list is used to detect the wrong identifications.

In the following we give some examples of 'good' patterns which are useful in identifying the names of cricket players:

• ballebAja CRIC ko Tima ke
• ballebAja CRIC ne
• CRIC kA arddhashataka

4.4 Gazetteer Preparation

The extracted 'good' patterns are capable of identifying NEs in a raw corpus. These patterns are then used to prepare the gazetteers. The seeds form the initial gazetteer list for a particular gazetteer type. The 'good' patterns are used to extract entities from the total raw corpus, and the entities identified by the patterns are added to the corresponding gazetteer list. In that way we can add more entities to our first phase gazetteer list. These new entities are taken as seeds for the next phase, and the same procedure is repeated to develop a large gazetteer.

We have already mentioned that we have worked with a sports domain corpus and prepared some gazetteers. These gazetteers are prepared just to prove the efficiency of our approach. By using only 3 seed entities we were able to prepare a gazetteer which contains 412 names of cricketers. Even using only the single seed 'Sachin Tendulkar', this approach extracts 297 names after the second iteration. Similarly, we have collected 245 names of tennis players from 5 seed entities.

5 Conclusion

In this paper we have described our approaches for the preparation of gazetteers. We have also prepared some gazetteers using both approaches to show their effectiveness. These approaches are very useful for the NER task in resource-poor languages and also for domain specific NER tasks.

References

Al-Onaizan Y. and Knight K. 2002. Machine Transliteration of Names in Arabic Text. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages.

Bikel Daniel M., Miller Scott, Schwartz Richard and Weischedel Ralph. 1997. Nymble: A high performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194–201.

Borthwick Andrew. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.

Downey D., Etzioni O., Soderland S. and Weld D.S. 2004. Learning text patterns for Web information extraction and assessment. In AAAI-04 Workshop on Adaptive Text Extraction and Mining, pages 50–55.

Ekbal A. and Bandyopadhyay S. 2007. Lexical Pattern Learning from Corpus Data for Named Entity Recognition. In Proceedings of the International Conference on Natural Language Processing (ICON), 2007.

Ekbal A., Naskar S. and Bandyopadhyay S. 2006. A Modified Joint Source Channel Model for Transliteration. In Proceedings of the COLING/ACL 2006, Australia, pages 191–198.

Etzioni Oren, Cafarella Michael, Downey Doug, Popescu Ana-Maria, Shaked Tal, Soderland Stephen, Weld Daniel S. and Yates Alexander. 2005. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence, 165(1): 91–134.

Grishman Ralph. 1995. The New York University System MUC-6 or Where's the syntax? In Proceedings of
the Sixth Message Understanding Conference.

Knight K. and Graehl J. 1998. Machine Transliteration. Computational Linguistics, 24(4): 599–612.

Kozareva Zornitsa. 2006. Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists. In Proceedings of the EACL Student Session (EACL 2006).

Li Wei and McCallum Andrew. 2004. Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction (Short Paper). In ACM Transactions on Computational Logic.

Lin Winston, Yangarber Roman and Grishman Ralph. 2003. Bootstrapped learning of semantic classes from positive and negative examples. In Proceedings of the ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data.

Riloff E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049.

Srihari R., Niu C. and Li W. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing.

Soderland Stephen, Fisher David, Aseltine Jonathan and Lehnert Wendy. 1995. CRYSTAL: Inducing a Conceptual Dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.

Talukdar P. Pratim, Brants T., Liberman M. and Pereira F. 2006. A context pattern induction method for named entity extraction. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X).

Wakao T., Gaizauskas R. and Wilks Y. 1996. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of COLING-96.
Preliminary Chinese Term Classification for Ontology Construction
Gaoying Cui, Qin Lu, Wenjie Li Department of Computing, Hong Kong Polytechnic University {csgycui, csluqin, cswjli}@comp.polyu.edu.hk
Abstract

An ontology can be seen as a representation of concepts in a specific domain. Accordingly, ontology construction can be regarded as the process of organizing these concepts. If the terms which are used to label the concepts are classified before building an ontology, the work of ontology construction can proceed much more easily. Part-of-speech (PoS) tags usually carry some linguistic information about terms, so PoS tagging can be seen as a kind of preliminary classification that helps to construct concept nodes in an ontology, because the features or attributes related to concepts of different PoS types may be different. This paper presents a simple approach to tagging domain terms for the convenience of ontology construction, referred to as Term PoS (TPoS) Tagging. The proposed approach makes use of segmentation and tagging results from a general PoS tagging software to predict tags for extracted domain specific terms. This approach needs no training and no context information. The experimental results show that the proposed approach achieves a precision of 95.41% for extracted terms and can be easily applied to different domains. Compared with some existing approaches, our approach shows that for some specific tasks, a simple method can obtain very good performance and is thus a better choice.

Keywords: ontology construction, part-of-speech (PoS) tagging, Term PoS (TPoS) tagging.

1 Introduction

Ontology construction has two main issues: the acquisition of domain concepts and the acquisition of appropriate taxonomies of these concepts. These concepts are labeled by the terms used in the domain, which are described by different attributes. Since domain specific terms (terminology) are, among other things, labels of concepts, terminology extraction is the first and foremost important step of domain concept acquisition. Most of the existing algorithms in Chinese terminology extraction only produce a list of terms without much linguistic information or classification information (Yun Li and Qiangjun Wang, 2001; Yan He et al., 2006; Feng Zhang et al., 2006). This makes ontology construction difficult, as the fundamental features of these terms are missing. The acquisition of taxonomies is in fact the process of organizing domain specific concepts. These concepts in an ontology should be defined using a subclass hierarchy, by assigning and defining properties, and by defining relationships between concepts, etc. (Van Rees, R., 2003). These methods are all concept descriptions. The linguistic information associated with domain terms, such as PoS tags and semantic classification information, can also make up for the concept related features which are associated with concept labels. Terms with different PoS tags usually carry different semantic information. For example, a noun is usually a word naming a thing or an object. A verb is usually a word denoting an action, occurrence or state of existence, which are all associated with a time and a place. Thus in ontology construction, noun nodes and verb nodes should be described using different attributes with different discriminating characters. With this information, extracted terms can then be classified
accordingly to help in ontology construction and retrieval work. Thus PoS tags can help identify the different features needed for concept representation in domain ontology construction.

It should be pointed out that Term PoS (TPoS) tagging is different from general PoS tagging tasks. It is designed to do PoS tagging for a given list of terms extracted by some terminology extraction algorithms such as those presented in (Luning Ji et al., 2007). The granularity of general PoS tagging is smaller than what is targeted in this paper, because terms representing domain specific concepts are more likely to be compound words and sometimes even phrases, such as "文件管理器" (file manager), "并发描述" (description of concurrency), etc. Even though current general word segmentation and PoS tagging can achieve precisions of 99.6% and 97.58%, respectively (Huaping Zhang et al., 2003), their performance on domain specific corpora is much less satisfactory (Luning Ji et al., 2007), which is why terminology extraction algorithms need to be developed.

In this paper, a very simple but effective method is proposed for TPoS tagging which needs no training process or even context information. This method is based on the assumption that every term has a headword. For a given list of domain specific terms which are segmented and in which each word already has a PoS tag, the TPoS tagging algorithm identifies the position of the headword and takes the tag of the headword as the tag of the term. Experiments show that this method is quite effective, giving good precision with minimal computing time.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 gives the observations on the task and the corresponding corpus, then presents our method for TPoS tagging. Section 4 gives the evaluation details and discussions on the proposed method and reference methods. Section 5 concludes this paper.

2 Related Work

Although TPoS tagging is different from general PoS tagging, the general POS tagging methods are worth referencing. The many existing POS tagging studies can in general be classified into the following categories. A natural idea for solving this problem was to make use of the information from the words themselves. A number of features based on prefixes and suffixes and spelling cues like capitalization were adopted in these studies (Mikheev, A, 1997; Brants, Thorsten, 2000; Mikheev, A, 1996). Mikheev presented a technique for automatically acquiring rules to guess possible POS tags for unknown words using their starting and ending segments (Mikheev, A, 1997). Through an unsupervised process of rule acquisition, three complementary sets of word-guessing rules would be induced from a general purpose lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-guessing rules (Mikheev, A, 1996). Brants used the linear interpolation of a fixed-length suffix model for word handling in his POS tagger, named TnT. For example, an English word ending in the suffix -able was very likely to be an adjective (Brants, Thorsten, 2000).

Some existing methods were based on the analysis of word morphology but exploited more features besides morphology, or took morphology as a supplementary means (Toutanova et al., 2003; Huihsin Tseng et al., 2005; Samuelsson, Christer, 1993). Toutanova et al. demonstrated the use of both preceding and following tag contexts via a dependency network representation and made use of additional features, such as lexical features, including jointly conditioning on multiple consecutive words and other fine-grained modeling of word features (Toutanova et al., 2003). Huihsin et al. proposed a variety of morphological word features, such as the tag sequence features from both the left and right side of the current word for POS tagging, and implemented them in a Maximum Entropy Markov model (Huihsin Tseng et al., 2005). Samuelsson used n-grams of letter sequences ending and starting each word as word features. The main goal of using Bayesian inference was to investigate the influence of various information sources, and ways of combining them, on the ability to assign lexical categories to words. The Bayesian inference was used to find the tag assignment T with the highest probability P(T|M, S) given morphology M (word form) and syntactic context S (neighboring tags) (Samuelsson, Christer, 1993).

Other researchers were inclined to regard this POS tagging work as a multi-class classification problem. Many machine learning methods, such
as Decision Trees, Support Vector Machines (SVM) and k-Nearest-Neighbors (k-NN), were used for guessing possible POS tags of words (G. Orphanos and D. Christodoulakis, 1999; Nakagawa T, 2001; Maosong Sun et al., 2000). Orphanos and Christodoulakis presented a POS tagger for Modern Greek and focused on a data-driven approach for the induction of decision trees used as disambiguation or guessing devices (G. Orphanos and D. Christodoulakis, 1999). The system was based on a high-coverage lexicon and a tagged corpus capable of showing the behavior of all POS ambiguity schemes and characteristics of words. The Support Vector Machine is a widely used and effective classification approach for solving two-class pattern recognition problems. Selecting appropriate features and training effective classifiers are the main points of the SVM method. Nakagawa et al. used substrings and surrounding context as features and achieved high accuracy in POS tag prediction (Nakagawa T, 2001). Furthermore, Sun et al. presented a POS identification algorithm based on a k-nearest-neighbors (k-NN) strategy for Chinese word POS tagging. With auxiliary information such as an existing tagged lexicon, the algorithm can find the k nearest words that are most similar to the word needing tagging (Maosong Sun et al., 2000).

3 Algorithm Design

As pointed out earlier, TPoS tagging is different from general PoS tagging tasks. In this paper, it is assumed that a terminology extraction algorithm has already obtained the PoS tags of individual words. For example, in the segmented and tagged sentence "计算机/n 图形/n 学/v 中/f ,/w 物体/n 常常/d 用/v 多边形/a 网格/n 来/f 表示/v 。/w" (In computer graphics, objects are usually represented as polygonal meshes.), the term "多边形网格" (polygonal meshes) has been segmented into two individual words and tagged as "多边形/a" (polygonal/a) and "网格/n" (meshes/n). The terminology extraction algorithm would identify these two words "多边形/a" and "网格/n" as a single term in a specific domain. The proposed algorithm is to determine the PoS of this single term "多边形网格" (polygonal meshes); thus the algorithm is referred to as TPoS tagging. It can be seen that general purpose PoS tagging and term PoS tagging assign tags at different granularities. In principle, the context information of terms can help TPoS tagging, and the individual PoS tags may be good choices as classification features.

The proposed TPoS tagging algorithm consists of two modules. The first module is a terminology extraction preprocessing module. The second module carries out the TPoS tag assignment. In the terminology extraction module, if the result of the terminology extraction algorithm is a list of terms without PoS tags, a general purpose segmenter called ICTCLAS¹ will be used to give PoS tags to all individual words. ICTCLAS is developed by the Chinese Academy of Sciences, and its precision is 97.58% on tagging general words (Huaping Zhang et al., 2003). The output of this module is a list of terms, referred to as the TermList, produced using algorithms such as the method described in (Luning Ji et al., 2007).

In this paper, two simple schemes for the term PoS tag assignment module are proposed. The first scheme is called the blind assignment scheme. It simply assigns the noun tag to every term in the TermList. This is based on the assumption that most of the terms in a specific domain represent certain concepts that are most likely to be nouns. The result from this blind assignment scheme can be considered the baseline or the worst case scenario. Even in the general domain, it is observed that nouns are in the majority among Chinese words, with more than 50% among all different PoS tags (Hui Wang, 2006).

The second scheme is called the head-word-driven assignment scheme. Theoretically, it would take the tag of the head word of a term as the tag of the whole term. But here it simply takes the tag of the last word in a term. This is based on the assumption that each term has a headword which in most cases is the last word in the term (Hui Wang, 2006). One additional experiment has been done to verify this assumption. A manually annotated Chinese shallow Treebank in the general domain is used for the statistical work (Ruifeng Xu et al., 2005). There are 9 different structures of Chinese phrases (Yunfang Wu et al., 2003), but only 3 of them do not have their head words in the tail, accounting for about 6.56% of all phrases. Following the examples earlier,

¹ Copyright © Institute of Computing Technology, Chinese Academy of Sciences
the term "多边形/a 网格/n" (polygonal/a meshes/n) will be assigned the tag "/n" because the last word is labeled "/n".

There are many semanteme tags at the end of a term. For example, "/ng" denotes the single-character postfix of a noun, but it would be improper if a term were tagged as "/ng". For example, the term "决策器" (decision-making machine) contains two components, "决策/n" and "器/ng". It is obvious that "决策器/ng" is inappropriate. Thus the head-word-driven assignment scheme also includes some rules to correct this kind of problem. As will be discussed in the experiment, the current result of TPoS tagging is based on 2 simple induction rules applied in this algorithm.

4 Experiments and Discussions

The domain corpus used in this work contains 16 papers selected from different Chinese IT journals between 1998 and 2000, with over 1,500,000 characters. They cover topics in IT such as electronics, software engineering, telecom, and wireless communication. The same corpus is used by the terminology extraction algorithm developed in (Luning Ji et al., 2007). In the IT domain, two TermLists are used for the experiment. TermList1 is a manually collected and verified term list from the selected corpus containing a total of 3,343 terms. TermList1 is also referred to as the standard answer set to the corpus for evaluation purposes. TermList2 is produced by running the terminology extraction algorithm in (Van Rees, R, 2003). TermList2 contains 2,660 items, of which 929 are verified as terminology and 1,731 are not considered terminology according to the standard answer above.

To verify the validity of the proposed method in different domains, a term list containing 366 legal terms obtained from Google search results for "法律术语大全" (complete dictionary of legal terms) is selected for comparison; it is named TermList3.

4.1 Experiment on the Blind Assignment Scheme

The first experiment is designed to examine the proportion of nouns in TermList1 and TermList3, to validate the assumption of the blind assignment scheme. In the first part of this experiment, all 3,343 terms in TermList1 are tagged as nouns. The result shows that the precision of the blind assignment scheme is between 78.79% and 84.77%. The reason for the range is that there are about 200 terms in TermList1 which can be considered either as nouns, gerunds, or even verbs without reference to context. For example, the term "局域网远程访问" ("remote access of local area network" or "remote access to local area network") and the term "极化" (polarization or polarize) can be considered either as nouns, if they are regarded as courses of events, or as verbs, if they refer to the actions for completing certain work. The specific type is dependent on the context, which is not provided without the use of a corpus. However, the experiment result does show that in a specific domain, there is a much higher percentage of terms that are nouns than other tags in general (Hui Wang, 2006). As for TermList3, the precision of blind assignment is between 65.57% and 70.77% (19 mixed ones). TermList2 is the result of a terminology extraction algorithm and there are non-term items in the extraction result, so the blind assignment scheme is not applied to TermList2. The blue (lighter) colored bars in Figure 1 show the results for TermList1 and TermList3 using the blind assignment scheme, which gives the two worst results compared to our proposed approach to be discussed in Section 4.2.

4.2 Experiments on the Head-Word-Driven Assignment Scheme

The experiment in this section was designed to validate the proposed head-word-driven assignment scheme. The same experiment is conducted on the three term lists respectively, as shown in Figure 1 in purple (darker) color. The precision for assigning TPoS tags to TermList1 is 93.45%. Taking the result from a terminology extraction algorithm without regard to its potential error propagation, the precision of the head-word-driven assignment scheme for TermList2 is 94.32%. For TermList3, the precision of PoS tag assignment is 90.71%. Compared to the blind assignment scheme, this algorithm has reasonably good performance for all three term lists, with precisions of over 90%. It also gives 8.7% and 19.9% improvements for TermList1 and TermList3, respectively, as compared to the blind assignment scheme, a reasonably good improvement without a heavy cost. However, there are some abnormalities in these results. Supposedly, TermList1 is a hand-verified term list in the IT domain and thus its result should have less noise and should perform better than TermList2, which is not the case as shown in Figure 1.

Figure 1 Performance of the Two Assignment Schemes on the Three Term Lists
[bar chart: precision (0-100%) of the blind assignment and head-driven assignment schemes on term lists T1, T2 and T3]

By further analyzing the error results, for example for TermList1, among these 3,343 terms, about 219 were given improper tags, such as the term "图形学" (graphics). In this example, two individual words, "图形/n" and "学/v", form a term, so the output was "图形学/v", taking the tag of the last segment. It was a wrong tag because the whole term is a noun. In fact, the error is caused by the general word PoS tagging algorithm because, without context, the most likely tagging of "学", a semanteme, is a verb. This kind of error in semanteme tagging appeared in the results of all three term lists, with 169 from TermList1, 29 from TermList2 and 12 from TermList3, respectively. This is a kind of error which can be corrected by applying some simple induction rules. For example, for all semantemes with multiple tags (including noun, as in the example), the rule can be "tag terms with noun suffixes as nouns". For example, the terms "劳改/n 场/q" (reform-through-labor camp) and "计算机/n 图形/n 学/v" (computer graphics) were given different tags using the head-word-driven assignment scheme. They were assigned as "劳改场/q" and "计算机图形学/v", which can be corrected by this rule. Another kind of mistake is related to suffix tags such as "/ng" (noun suffix) and "/vg" (verb suffix). For example, "知识/n 产权/n 庭/ng" (intellectual property tribunal) and "数据/n 集/vg" (data set) will be tagged as "知识产权庭/ng" and "数据集/vg", respectively, which are obviously wrong. So the simple rule of mapping terms tagged "/ng" and "/vg" to "/n" is applied. The performance of TPoS tag assignment after applying these two fine-tuning induction rules is shown in Table 1 below.

Table 1 Influence of Induction Rules on Different Term Lists

Term Lists   Precision of tagging   Precision after adding induction rules   Improvement Percentage
TermList1    93.45%                 97.03%                                   3.83%
TermList2    94.32%                 95.41%                                   1.16%
TermList3    90.71%                 93.99%                                   3.62%

It is obvious that with the use of fine tuning using the induction rules, the results are much better. In fact the result for TermList1 reached 97.03%, which is quite close to PoS tagging of general domain data. The abnormality also disappeared, as TermList1 now has the best result. The improvement for TermList2 (1.16%) is not as obvious as that for TermList1 and TermList3, which are 3.83% and 3.62%, respectively. This, however, is reasonable, as TermList2 is produced directly from a terminology extraction algorithm using a corpus, and thus its results are noisier.

Further analysis is then conducted on the result of TermList2 to examine the influence of non-term items on this term list. The non-term items are items that are general words or items that cannot be considered terminology according to the standard answer sheet. For example, neither of the items "问题" (problem) and "模式训练是" (pattern training is) was considered a term, because the former is a general word and the latter should be considered a fragment rather than a word. In fact, of the 2,660 items extracted by the algorithm as terminology, only 929 are indeed terminology (34.92%), and the rest do not qualify as domain specific terms. The result of this analysis is listed in Table 2.
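The head-word-driven scheme and the two induction rules discussed above can be sketched as follows. This is an illustrative simplification, not the authors' implementation; in particular, the one-entry semanteme table is a stand-in for the full "semantemes with multiple tags" rule:

```python
# A term is a list of (word, tag) pairs produced by a general-purpose
# segmenter/tagger such as ICTCLAS. Both rule tables are simplified
# assumptions covering only the examples in the text.

NOUN_SEMANTEMES = {"学"}      # semantemes that mark a noun when used as a suffix
SUFFIX_TAGS = {"ng", "vg"}    # single-character suffix tags mapped to "n"

def tpos_tag(term):
    """Head-word-driven assignment: take the tag of the last (head) word,
    then apply the two fine-tuning induction rules."""
    word, tag = term[-1]
    if word in NOUN_SEMANTEMES:   # Rule 1: noun-suffix semantemes -> noun
        tag = "n"
    if tag in SUFFIX_TAGS:        # Rule 2: "/ng" and "/vg" -> "/n"
        tag = "n"
    return tag

print(tpos_tag([("多边形", "a"), ("网格", "n")]))              # n (head word)
print(tpos_tag([("计算机", "n"), ("图形", "n"), ("学", "v")]))  # n (rule 1)
print(tpos_tag([("数据", "n"), ("集", "vg")]))                 # n (rule 2)
```

The blind assignment baseline corresponds to replacing the function body with `return "n"` for every term.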
Table 2 Data Distribution Analysis on TermList2

                    Without Induction Rules         Induction Rules Applied
                    correctly tagged   precision    correctly tagged   precision
Terms (929)         879                94.62%       898                96.66%
Non-terms (1,731)   1,630              94.17%       1,640              94.74%
Total (2,660)       2,509              94.32%       2,538              95.41%

Results show that 31 and 50 of the 929 correct terms were assigned improper PoS tags by the proposed algorithm with and without the induction rules, respectively. That is, the precisions for correct data are comparable to those of TermList1 (93.45% and 97.03%, respectively). For the non-terms, 91 and 101 of the 1,731 items were assigned improper tags with and without the induction rules, respectively. Even though the precisions for terms and non-terms without the induction rules are quite similar (94.62% vs. 94.17%), the improvement for the non-terms using the induction rules is much less impressive than that for the terms. This is the reason for the relatively less impressive performance of the induction rules for TermList2. It is interesting to note that, even though the performance of the terminology extraction algorithm is quite poor, with a precision of only around 35% (929 out of 2,660 items), it does not affect the performance of the proposed TPoS tagging much. This is mainly because the items extracted are still legitimate words, compounds, or phrases, which are simply not necessarily domain specific.

The proposed algorithm in this paper uses minimum resources. It needs no training process and not even context information. But the performance of the proposed algorithm is still quite good and can be directly used as preparation work for domain ontology construction because of its precision of over 95%. Other PoS tagging algorithms reach good performance in processing general words. For example, a k-nearest-neighbors strategy to identify possible PoS tags for Chinese words can reach 90.25% for general word PoS tagging (Maosong Sun et al., 2000). Another method, based on SVM and applied to an English corpus, can reach 96.9% in PoS tagging of known and unknown words (Nakagawa T, 2001). These results show that the proposed method in this paper is comparable to these general PoS tagging algorithms in magnitude. Of course, one main reason for this is the difference in objectives. The proposed method is for the PoS tagging of domain specific terms, which have much less ambiguity than general text. Domain specific terms are more likely to be nouns and there are some rules in their word-formation patterns, while general PoS tagging algorithms usually need a training process in which large manually labeled corpora are involved. Experiment results also show that this simple method can be applied to data in different domains.

5 Conclusion and Future Work

In this paper, a simple but effective method for assigning PoS tags to domain specific terms was presented. This is a preliminary classification work on terms. It needs no training process and not even context information. Yet it obtains a relatively good result. The method itself is not domain dependent; thus it is applicable to different domains. Results show that in certain applications, a simple method may be more effective under similar circumstances. The algorithm can still be investigated over the use of more induction rules. Some context information and statistics of word/tag usage can also be explored.

Acknowledgments

This project is partially supported by CERG grants (PolyU 5190/04E and PolyU 5225/05E) and B-Q941 (Acquisition of New Domain Specific Concepts and Ontology Update).

References

Yun Li, Qiangjun Wang. 2001. Automatic Term Extraction in the Field of Information Technology. In proceedings of The Conference of 20th Anniversary for Chinese Information Processing Society of China.

Yan He, Zhifang Sui, Huiming Duan, and Shiwen Yu. 2006. Term Mining Combining Term Component Bank. In Computer Engineering and Applications, Vol.42 No.33, 4--7.

Feng Zhang, Xiaozhong Fan, and Yun Xu. 2006. Chinese Term Extraction Based on PAT Tree. Journal of Beijing Institute of Technology, Vol. 15, No. 2.

Van Rees, R. 2003. Clarity in the Usage of the Terms Ontology, Taxonomy and Classification. CIB73.
Mikheev, A. 1997. Automatic Rule Induction for Unknown Word Guessing. In Computational Linguistics, Vol. 23(3), ACL.

Hui Wang. Statistical studies on Chinese vocabulary (汉语词汇统计研究). http://www.huayuqiao.org/articles/wanghui/wanghui06.doc. Last checked: 2007-08-04; the date of publication is unknown from the online source.

Toutanova, Kristina, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In proceedings of HLT-NAACL.

Ruifeng Xu, Qin Lu, Yin Li and Wanyin Li. 2005. The Design and Construction of the PolyU Shallow Treebank. International Journal of Computational Linguistics and Chinese Language Processing, V.10 N.3.

Huihsin Tseng, Daniel Jurafsky, and Christopher Manning. 2005. Morphological Features Help POS Tagging of Unknown Words across Language Varieties. In proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.

Yunfang Wu, Baobao Chang and Weidong Zhan. 2003. Building Chinese-English Bilingual Phrase Database. Pages 41-45, Vol. 4.

Samuelsson, Christer. 1993. Morphological Tagging Based Entirely on Bayesian Inference. In proceedings of NCCL 9.

Brants, Thorsten. 2000. TnT: A Statistical Part-of-Speech Tagger. In proceedings of ANLP 6.

G. Orphanos and D. Christodoulakis. 1999. POS Disambiguation and Unknown Word Guessing with Decision Trees. In proceedings of EACL'99, 134--141.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In proceedings of the International Conference on New Methods in Language Processing.

Maosong Sun, Dayang Shen, and Changning Huang. 1997. CSeg&Tag1.0: a practical word segmenter and POS tagger for Chinese texts. In proceedings of the fifth conference on applied natural language processing.

Ying Liu. 2002. Analysing Chinese with Rule-based Method Combined with Statistic-based Method. In Computer Engineering and Applications, Vol.7.

Mikheev, A. 1996. Unsupervised Learning of Word-Category Guessing Rules. In proceedings of ACL-96.

Nakagawa T, Kudoh T, and Matsumoto Y. 2001. Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. In proceedings of NLPPRS 6, 325--331.

Maosong Sun, Zhengping Zuo, and B. K. Tsou. 2000. Part-of-Speech Identification for Unknown Chinese Words Based on K-Nearest-Neighbors Strategy. In Chinese Journal of Computers, Vol.23 No.2: 166--170.

Luning Ji, Mantai Sum, Qin Lu, Wenjie Li, Yirong Chen. 2007. Chinese Terminology Extraction using Window-based Contextual Information. In proceedings of CICLING.

Huaping Zhang et al. 2003. HHMM-based Chinese Lexical Analyzer ICTCLAS. Second SIGHAN workshop affiliated with 41st ACL, 184--187. Sapporo, Japan.
Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms
Makiko Matsuda, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Japan, [email protected]
Tomoe Takahashi, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Japan, [email protected]
Hiroki Goto, University of Toyama, Gofuku, Toyama, Japan, [email protected]
Yoshikazu Hayase, Toyama National College of Maritime Technology, 1-2 Ebie-Neriya, Imizu, Japan, [email protected]
Robin Lee Nagano, University of Miskolc, Miskolc-Egyetemváros, Hungary, [email protected]
Yoshiki Mikami, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Japan, [email protected]
Abstract

Terminology development in education, science and technology is a key to formulating a knowledge society. The authors are developing a multilingual engineering terminology dictionary consisting of more than ten thousand engineering terms each for ten Asian languages and English. The dictionary is primarily designed to support foreign students who are studying engineering subjects at Japanese higher educational institutions in the Japanese language. Analysis of the lexical terms could provide useful knowledge for language resource creators. There are two adoption approaches, "phonetic adoption" (transliteration of borrowed terms) and "semantic adoption" (where the meaning is expressed using native words). The proportion of the two options found in the terminology set of each country (or language) shows a substantial difference and seems to reflect the language policies of the country and the influence of foreign languages on the host language. This paper presents preliminary results of our investigation on this question based on a comparative study of three languages: Japanese, Vietnamese and Thai.

1 Introduction

1.1 Terminology in Development Context

For developing countries there is a strong necessity to establish an appropriate set of terminology to enable local people to learn science and technology in an efficient manner throughout their educational programs, using their mother tongues as the instructional medium. Instruction in the mother tongue is beneficial for students to acquire basic concepts in a subject (UNESCO, 2003).

International communities are fully aware of this necessity. The World Summit on the Information Society (WSIS) addressed this issue, saying that terminology development in education, science and culture is a key to developing knowledge societies.

1.2 Two Approaches to Adopting Terms

Unlike the situation with basic vocabulary, scientific or technical terms are usually not "home-grown", but are often imported from the outside world. And when a scientific or technical term is imported from one language (the source language) to another language (the host language), typically two approaches are found.

The first approach is to simply borrow a word by creating a phonetically equivalent word in the host language. For example, the equivalent of the
The English word "science" in Malay is "sains", pronounced /sains/, and "computer" in Japanese is written "コンピュータ" in the Japanese katakana1 syllabary and pronounced /konpju:ta/. We call this phonetic adoption in this paper. It is quite similar to transliteration, but not exactly the same, owing to differences in phonetic structure between the source and host languages.

The second approach is to create a new word by combining a semantically relevant word-root or word in the host language. For example, "science" in Thai is "วิทยาศาสตร", pronounced /wítthayasà:t/. This word is derived from the word "vidya" (knowledge) in Sanskrit, a source from which the Thai language has imported many words throughout its history. "Science" in Japanese is "科学", pronounced /kagaku/. This word was coined more than one hundred years ago as a combination of two Chinese characters, "科" and "学", meaning "a section, or a branch of something" and "learning", respectively. As these two cases show, it often happens that classical languages like Sanskrit or Chinese, rather than the host language itself, provide root-words in this creation process, just as Latin and Greek roots are used in scientific and technical terms in English. We call this approach semantic adoption in this paper.

1 In Japanese, four scripts are used to write the language: kanji (i.e., Chinese characters), hiragana, katakana and the Latin alphabet. The katakana script is conventionally used to represent phonetically adopted foreign words.

1.3 Pros and Cons of the Two Approaches

When we look back on the historical evolution of scientific terms from ancient civilizations to modern times, both types of adoption are found. Both approaches have advantages and disadvantages in terms of ease of creation and ease of understanding, as well as other issues. The pros and cons of semantic adoption, summarized in Table 1, are in principle the opposites of those of phonetic adoption.

Table 1. Pros and Cons of Phonetic and Semantic Approaches to Term Adoption

                    Advantages                         Disadvantages
Phonetic adoption   (1) easy to find connectivity      (1) disturbs the purity of the
                        to the original word               host language
                    (2) easy to create                 (2) not easy to guess what it means
Semantic adoption   (1) does not disturb the purity    (1) not easy to find connectivity
                        of the host language               to the original word
                    (2) relatively easy to guess       (2) not easy to create
                        what it means, or at least
                        what it relates to

But the choice of adoption approach is not so simple. Phonetic adoption is easy to implement, but carries the danger of disturbing language purity, which also has a high priority in many nations. When semantic adoption is chosen, language purity is maintained, but the adoption cost is high and it takes time to prepare a meaningful number of vocabulary items and to train teachers in their use.

1.4 Third Approach: Use of a Foreign Language

In addition to this dilemma, a completely different approach can be taken when the necessity is very urgent: to design and implement the whole educational program by means of an established foreign language. We call this the source language approach. In the real world today, this third approach is widely adopted in many developing countries, to various extents. The choice of foreign language depends on subject domains and on the economic and political condition of the country, as well as on historical ties.

1.5 Objectives of the Study

In summary, there are three basic options. However, little research has been carried out on how a language community can effectively formulate terminology, with a focus on Asian languages. The "Guidelines for Terminology Policy" (Infoterm 2005) discuss various issues relating to this subject, but make no mention of the question of adoption strategy. The authors believe that an analysis of existing terminology can provide practical and applicable lessons to terminology development practitioners and policy makers. This paper presents preliminary results of our investigation of this question, based on a comparative study of three languages: Japanese, Vietnamese and Thai.

2 Dictionary

In this paper, the main corpus is extracted from a multilingual engineering dictionary based on Babel (Hayase and Kawata 2002). Babel is a web technology dictionary for foreign students in Japan2. It is composed of the 11 academic domains given in Table 2 and three languages: English, Japanese and Thai.

2 http://www.toyama-cmt.ac.jp/%7Ehayase/Project/Babel/

Table 2. The Number of Words in the Japanese Lexicon by Subject Domain

Subject Domain           Words (TYPE)   Words (TOKEN)
architecture                    1,314           1,798
chemistry                         717             958
civil engineering                 766           1,263
telecommunications                650           1,410
computer science                  941           1,538
control engineering             1,081           1,545
electronics                       681           1,200
maritime science                  961           1,403
mathematics                       952           1,284
mechanical engineering          1,117           1,186
physics                           587             945
others*                         1,872               -
Total of all words             11,639          14,530
*there is some overlap with other domains

To this primary source, we added some terms in mathematics and physics and built a multilingual database, such as that shown in Table 3, by adding adoptions into eight languages: Vietnamese, Chinese, Korean, Filipino, Malay, Sinhalese, Myanmar and Mongolian. The data are coded in Unicode. At present the Vietnamese, Chinese, Korean and Sinhalese entries have been translated, and the other languages will be finished soon. The dictionary will be made available in April 2008. For this paper, we have selected three languages for analysis – Vietnamese, Thai and Japanese – because these languages have relatively rich vocabularies.

Table 3. A Sample of the Lexicon

English              Japanese   Chinese    Vietnamese        Thai
discriminant         判別式     判别式     Biệt thức         วิธีแบงแยกระหวางกัน
factor theorem       因数定理   因数定理   Định lý thừa số   ษฎีบทตัวประกอบ
circular measure     弧度法     弧度法     Số đo cung tròn   การวัดเชิงวงกลม
radical root         累乗根     根         Căn               ราก
limit value          極限値     极限值     Giá trị giới hạn  คาจํากัด
divergence           発散       发散       Phát tán          การลูออก
natural logarithm    自然対数   自然对数   Đối số tự nhiên   ลอการิทึมธรรมชาติ
point of inflection  変曲点     拐点       Điểm uốn          จุดเปลี่ยนความเวา

3 Comparison of Adoption Approaches

The approach to terminology adoption is examined in this section from various angles.

3.1 Comparison by Country

In order to investigate the portfolio of the two adoption approaches, we surveyed the entire lexicon for all mechanical engineering subjects. In addition, we took a random sample from our dictionary, choosing approximately 25 words from each of two other major subject fields: architecture and computer science. The portfolios of the three languages are shown in Table 4. Compounds, such as hybrids of native and loaned word-components, were counted separately from semantic adoption.

Table 4. Comparison of Terms in 3 Domains

Subject Domain           # of terms   Origin of words       JP     VN     TH
mechanical engineering        1,186   semantic            0.68   0.89   0.78
                                      English phonetic    0.32   0.09   0.21
                                      others                 -   0.02   0.01
computer science                 23   semantic            0.43   0.87   0.78
                                      English phonetic    0.57   0.13   0.21
architecture                     29   semantic            0.69   0.93   0.76
                                      English phonetic    0.31   0.03   0.24
                                      others                 -   0.03      -

Table 4 shows that the Japanese language has the highest percentage of phonetically translated words from English. Vietnamese has the lowest percentage of phonetically adopted words from English, but includes terms from French in mechanical engineering, plus a few from Russian. Thai shows a
roughly similar balance between semantic adoption and English-origin terms in all three domains.

The high number of Thai- and Vietnamese-origin words in computer science is rather remarkable, as it indicates a consistent effort to adapt the Thai language to every emerging technology. Note that over half of the computer science terms in Japanese are of English origin. Vietnamese draws on different languages to varying degrees; it is notable that the languages of origin vary by subject field.

3.2 Comparison by Subject Domain: A Case of Japanese Terms

As we saw in Table 4, Japanese technical terms are often phonetically translated, but less so in architecture than in the other two fields. To gain a better idea of the distribution of phonetic adoption, we surveyed the entire Japanese lexicon for all subject domains. Table 5 gives the rate of katakana usage, that is, the rate of words written in the katakana script used for transcribing foreign words.

Table 5. Rate of Katakana Usage

Subject Domain           # of terms   Katakana   Rate
architecture                  1,798        147   0.081
chemistry                       958        234   0.244
civil engineering             1,263        156   0.123
telecommunications            1,410        422   0.299
computer science              1,538        650   0.422
control engineering           1,545        497   0.321
electronics                   1,200        250   0.208
maritime science              1,403         51   0.036
mathematics                   1,284         50   0.038
mechanical engineering        1,186        385   0.324
physics                         945        120   0.126

The domains of computer science, control engineering and mechanical engineering include over 30% katakana words among their terms. In contrast, the percentage of katakana terms in the architecture, mathematics and maritime science domains is less than 10%, indicating that terms in these domains are largely semantically adopted, or that these domains have little need for imported vocabulary. We thus see that the academic domain affects the adoption of phonetically translated terms.

3.3 Mathematical Terms at Different Grade Levels

We have collected about 1,300 terms in mathematics for our multilingual engineering dictionary. Following the Japanese course of study3 for grades 1-12, we chose 57 terms from our dictionary and compared them with their counterparts in Thai and Vietnamese. For university-level mathematical terms, we chose 11 terms following the first-year syllabus of the Faculty of Mathematics at one Japanese university4, and surveyed those terms in the same way. Table 6 examines the terms used at various grade levels.

3 The Ministry of Education, Culture, Sports, Science and Technology, Japan (MEXT) determines the national curriculum.
4 We took terms from the first-year syllabus of the Faculty of Mathematical Science at Doshisya University.

Table 6. Comparison of Mathematical Terms by Grade Level

Grade Level          # of terms   Origin of words      JP       VN       TH
University                   11   Semantic           0.91     0.90     0.64
                                  (Sino)            (0.91)   (0.64)   (0.00)
                                  Phonetic           0.09     0.09     0.36
High school                  17   Semantic           1.00     1.00     0.88
                                  (Sino)            (1.00)   (0.47)   (0.00)
                                  Phonetic           0.00     0.00     0.12
Junior high school           18   Semantic           1.00     1.00     1.00
                                  (Sino)            (1.00)   (0.61)   (0.00)
                                  Phonetic           0.00     0.00     0.00
Elementary school            22   Semantic           1.00     1.00     1.00
                                  (Sino)            (1.00)   (0.50)   (0.00)
                                  Phonetic           0.00     0.00     0.00

These data indicate that all three languages use either original or semantically adopted terms from the elementary to the junior high school level. For Japanese, almost all such words are Sino-Japanese; native words may be used in oral instruction in Japan, but they are not used as keywords. At high school level, some words phonetically translated from English are used in Thai, while at university level all three languages use English-origin words. The rate of Sino-Vietnamese words in mathematics is considerably higher than in other categories. This suggests that the likelihood that Sino-origin words can be shared
effectively in other countries using Chinese characters is generally high in the field of mathematics.

3.4 Web Presence of English Terms and their Adopted Equivalents

We also examined the web presence of pairs of terms: an English word and its equivalent in the host language. This is an attempt to find out how widely the source language approach – the direct use of a foreign language – is used in technological topics. Four words from different subject domains were chosen, and a search was carried out for each term (as an exact phrase) using the Google search engine (Table 7).

Table 7. Web Presence of Technical Terms

Search through     Search term(s)                  # of hits*
Japanese pages     EN  ionization                     121,000
                   JP  電離                            419,000
                   EN  fusion welding                     387
                   JP  融接                             12,900
                   EN  automatic lathe                    258
                   JP  自動旋盤                          56,600
                   EN  operand                         32,500
                   JP  オペランド (katakana)            138,000
                   JP  演算数 (kanji)                   10,700
Vietnamese pages   EN  ionization                          89
                   VN  Điện ly                          13,900
                   EN  fusion welding                       6
                   VN  Hàn nóng chảy                       66
                   EN  automatic lathe                      2
                   VN  Máy tiện tự động                    48
                   EN  operand                            111
                   VN  Số tính toán                       810
Thai pages         EN  ionization                      10,600
                   TH  การแตกประจุ                         48
                   EN  fusion welding                     692
                   TH  การเชื่อมแบบหลอมละลาย                  6
                   EN  automatic lathe                     39
                   TH  เครื่องกลึงอัตโนมัติ                    195
                   EN  operand                            568
                   TH  ตัวถูกดําเนินการ                       24
*Accessed on Sept. 10, 2007

While the data are limited to just four terms, general trends can be identified. For Japanese pages, while there were many hits for English, the hits for Japanese terms were at least three times higher. Interestingly, for operand, the one term with two equivalents, the katakana (i.e., phonetically adopted) term was used far more frequently than the semantically adopted kanji term. In the Vietnamese pages, while far fewer total hits were found, the Vietnamese terms were overwhelmingly more frequent than the English terms. For Thai pages the opposite was found, with the exception of automatic lathe, a tool that may require marketing in the local language. These results suggest that the major language of technology for Thailand is English, for Vietnam it is the native language, and for Japan there is English use but Japanese dominates.

In order to find out whether there is a distinction between use in technological fields and general use, we made a comparison using three non-technical terms (Table 8).

Table 8. Web Presence of Non-technical Terms

Search through     Search term(s)       # of hits*
Japanese pages     EN  water             2,200,000
                   JP  水               24,200,000
                   EN  novel             2,242,000
                   JP  小説              9,870,000
                   EN  zoo               2,580,000
                   JP  動物園            2,650,000
Vietnamese pages   EN  water               295,000
                   VN  nước              4,310,000
                   EN  novel                68,400
                   VN  tiểu thuyết       2,130,000
                   EN  zoo                  41,500
                   VN  vườn bách thú**     615,000
                   VN  sở thú***         1,750,000
Thai pages         EN  water             1,850,000
                   TH  น้ํา               2,610,000
                   EN  novel               269,000
                   TH  วรรณกรรม          1,830,000
                   EN  zoo                 272,000
                   TH  สวนสัตว             620,000
*Accessed on Sept. 18, 2007  **Northern dialect  ***Southern dialect
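The comparison behind Tables 7 and 8 reduces to a per-term ratio of host-language hits to English hits. The sketch below is purely illustrative (it is not the authors' tooling); the hit counts are copied from Table 7, and a ratio above 1 means the native term dominates on the web:

```python
# Exact-phrase Google hit counts from Table 7 (accessed Sept. 10, 2007).
# Each entry: (hits for the English term, hits for the host-language term).
hits = {
    "ja": {"ionization": (121_000, 419_000), "fusion welding": (387, 12_900)},
    "vi": {"ionization": (89, 13_900), "fusion welding": (6, 66)},
    "th": {"ionization": (10_600, 48), "fusion welding": (692, 6)},
}

def native_ratio(en_hits, native_hits):
    """Ratio > 1 means the host-language term dominates on the web."""
    return native_hits / en_hits

ratios = {
    lang: {term: native_ratio(*pair) for term, pair in terms.items()}
    for lang, terms in hits.items()
}

# Japanese and Vietnamese terms dominate their English sources;
# for Thai, the English terms dominate instead.
assert ratios["ja"]["ionization"] > 1
assert ratios["vi"]["fusion welding"] > 1
assert ratios["th"]["ionization"] < 1
```

Computed this way, the three languages separate cleanly, matching the trend described in the text: native-term dominance for Japanese and Vietnamese, English dominance for Thai technical vocabulary.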
Again, we find that Japanese pages use the Japanese word far more frequently (although zoo is nearly equal – perhaps due to the exotic appeal of a foreign word, and one understood by almost every Japanese) but that English words have a sizable presence. For Vietnamese pages, the number of hits is substantially higher than for the technical terms, but the same trend holds: the native language dominates. Finally, for Thai pages, the hits for Thai words far exceed those for English words in all three cases, although English words are often used.

A comparison of the results in Tables 7 and 8 confirms that the role of languages differs between daily use and situations that require technical terminology, even in a country that has adopted the third approach of using a foreign language for science and technology.

4 Social and Academic Factors

4.1 High Rate of Semantic Adoption in Asian Countries

As can be imagined, it is not easy for Asian people to understand technical terms in other Asian languages, because semantic adoption is popular among Asian languages. The diversity of scripts in Asia is another reason for this difficulty. In European countries, term-sharing by phonetic adoption is easier, because almost all the languages derive from the same origin and share the same script. Therefore, in order to achieve mutual communication using technical terms in Asian countries, we need to make special efforts to overcome the variation and to harmonize terms with each other.

But we need to discuss why those languages prefer semantic adoption. Various social, political and historical factors influence language use, and this applies also to the adoption of technical terminology. Here, we discuss the influence of these factors on the approaches used to adopt technical terms, including the influence of technical domains and academic disciplines.

4.2 Language Policies and Historical Background

4.2.1 Vietnam

In Vietnam, adoption is strongly promoted under a strict language policy. The Political Resume of the Socialist Republic of Vietnam states that "[e]very nationality has the right to use its own language and system of writing, to preserve its national identity, and to promote its fine customs, habits, traditions and culture." Therefore phonetically adopted words as such are seldom used. One example of their limited use is "ôm", from the English "ohm"; another is proper nouns, where foreign words are used in newspapers. This political reason is the primary impetus for the semantic adoption of terms.

Secondly, as in other communist nations, English is not taught as an obligatory subject in Vietnam. This educational policy also brings about a low rate of phonetically adopted words from English, because many people cannot grasp their meaning.

Thirdly, we need to consider that the Vietnamese language has been historically influenced by several foreign languages, as we can infer from the results in Table 4. Vietnam was under the influence of China for approximately 1,000 years, and a number of Chinese terms were absorbed into Vietnamese as Sino-Vietnamese words. During the French colonial period, the Vietnamese language added French words for the manufacture of specific products such as automobiles and bicycles. Lastly, at the end of the 20th century, Russian technical terms were introduced by people who had studied in the Soviet Union or in Russia, or by technical advisers sent by the Soviet Union.

4.2.2 Japan

As is the case with Vietnam, Chinese characters and words were imported into Japan from China so long ago that they are regarded as part of the native Japanese language. At the end of the 19th century, a number of Japanese kango (words that use Chinese characters but were created by Japanese people) were invented as adoptions of Western technical terms, and many of these terms were re-imported into China.

After World War II, Japan was occupied by the Allied Forces led by the United States, and English became a compulsory subject from junior high school. English-origin words became easy for many Japanese to understand. This historical change caused a sharp rise in the percentage of phonetic adoptions from English among Japanese technical terms (see Table 5) (Hashimoto 2007). However, English is not usually used as an instructional medium in
Japanese institutions of higher education. Teachers and students know many English loan terms, but they use these loan terms not as English but as Japanese, and since they are pronounced according to Japanese phonetic rules, they may not even be intelligible to English speakers.

4.2.3 Thailand

Thai differs from the other two countries in that it is less influenced by Chinese. But the Thai language has been influenced by another foreign language, Sanskrit, the classical Indian language. We did not examine the rate of Sanskrit-origin words in the Thai language in this study, but it seems likely that a high rate of terms phonetically adopted from Sanskrit will be found in our engineering dictionary. Although Thai has native-language equivalents at the lexical level, English is used daily as an instructional medium in higher education. Students in Thailand start to learn English at 10 years old or earlier. Thus, even though they have native terms, they prefer to use terms phonetically adopted from English, or to use English itself.

4.3 Other Factors

4.3.1 When was the subject introduced?

In the Meiji era (1868-1912) in Japan, experts translated technical terms from Western languages semantically, into Japanese kango. Since World War II, however, transliteration into katakana has been the mainstream. Therefore a decisive factor in the adoption approach is when the subject was introduced into Japan. Table 9 shows the foundation years of major academic societies and institutes in engineering fields, which roughly indicate the time of introduction of each subject.

Table 9. Academic Societies in Japan

Discipline               Major Academic Society in Japan                   Est.
architecture             Architectural Institute of Japan                  1886
civil engineering        Japan Society of Civil Engineers                  1914
chemistry                Chemistry and Chemical Industry of Japan          1878
telecommunications,      Information Processing Society of Japan           1960
computer science         The Institute of Electronics, Information,        1987
                           and Communication Engineers
                         The Japan Society of Information and              1983
                           Communication Research
control engineering      The Institute of Systems, Control and             1957
                           Information Engineers
electronics              The Institute of Electrical Engineers of Japan    1888
maritime science         The Japan Society of Naval Architects and         1898
                           Ocean Engineers
mathematics              Mathematical Society of Japan                     1877
mechanical engineering   The Japan Society of Mechanical Engineers         1897
physics                  The Physical Society of Japan                     1877

As shown in Table 5, computer science is the field that uses technical terms in katakana most frequently, while the sampled vocabulary from telecommunications also includes nearly 30% katakana words. These disciplines were developed relatively recently in Japan; the societies or institutes related to these fields were established in the latter half of the 20th century.

The oldest societies, the Mathematical Society of Japan and The Physical Society of Japan, were established in 1877, and our study showed that only 3.8% of mathematical terms and 12.6% of terms in physics are phonetically adopted. Therefore, disciplines whose academic societies were established relatively early may have developed a great deal of their terminology during the Meiji era, and may still prefer Japanese kango over katakana. Newer disciplines, on the other hand, may have found it easier to use imported terms in a relatively direct way.

4.3.2 Is it indigenous?

Each country has specific regional characteristics, such as climate, folk customs, culture, lifestyle and other indigenous factors, and technical terms in the fields concerned are strongly connected with these regional characteristics. Before unfamiliar terms or new technologies were brought over to the host country, experts in those fields had already created and used their own technical terms in their native language, and they were already aware of the concepts indicated by unfamiliar terms in an alien language. All they had to do was to make a one-to-one correspondence and translate a foreign term into the term already in use in the mother tongue.
5 Conclusion

This preliminary study looks at the different approaches to adopting technical terms in three Asian languages. The three languages investigated prefer different methods: Vietnamese tends to adopt words into its language semantically, while Japanese adopts them phonetically, through transliteration. Thai, to a large extent, has chosen a third approach, that of using a source language (English).

In Japanese and in Vietnamese, many terms subject to semantic adoption were rendered in Chinese characters. This suggests that Sino-words may be an effective way of communicating concepts among certain language users in certain disciplines, such as mathematics.

The effect of grade level was also investigated; native-language terms were found to be used in mathematics almost exclusively until high school or university level.

Subject domain was found to have an effect on the adoption approach, often influenced by whether the domain had a long tradition and by when a discipline was established (as shown by the foundation dates of academic societies). A domain effect was also seen in Vietnamese, where Russian- or French-origin terms appeared in different domains.

While quite limited in scope, this study has revealed clear trends that deserve further investigation.

6 Future Research

We plan to investigate more fully the rate of phonetically translated words and the approaches to terminology adoption. We expect that Chinese and Korean will give us further evidence of the widespread use of Sino-words, and that Mongolian will provide data on phonetic adoption from Russian. Malay and Filipino are likely to be highly influenced by English, while we expect to find the influence of Sanskrit in Myanmar and Sinhalese.

References

Christian Galinski. 1995. Terminology and Standardization under a Machine Adoption Perspective. MT Summit V Proceedings.

Douglas Skuce and Ingrid Meyer. 1990. Concept Analysis and Terminology: A Knowledge-Based Approach to Documentation. Proceedings of the 13th International Conference on Computational Linguistics, Volume 1.

Giovanni Battista Varile and Antonio Zampolli. 1997. Survey of the State of the Art in Human Language Technology. Cambridge University Press.

Hashimoto Waka. 2007. A diachronic transition of phonetically translated words in Japanese: based on newspaper columns. Language, 36-6, Taishukan Publishers, Japan. (橋本和佳「外来語の通時的推移-新聞社説を素材として-」『言語』36-6, 大修館書店)

Hayase Yoshikazu and Shigeo Kawata. 2002. BABEL: A multilingual dictionary and hypertext formation system for engineers. Transactions of JSCES, Paper No. 20020023. (早勢欣和, 川田重夫「BABEL: エンジニア支援のための多言語辞書とハイパーテキスト生成」)

Infoterm. 2005. UNESCO Guidelines for Terminology Policies: Formulating and Implementing Terminology Policy in Language Communities. Paris.

Kageura Kyo. 2006. The status of borrowed items in Japanese terminology. Studies in the Japanese Language, Vol. 2, No. 4. (影浦峡「日本語専門語彙の構成における外来語語基の位置づけ」『日本語の研究』2006)

UNESCO. 2003. Education in a Multilingual World. Paris: UNESCO (ED-2003/WS/2). http://unesdoc.unesco.org/images/0012/001297/129728e.pdf
Acknowledgements The study was made possible by the sponsorship of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) through the Asian Language Resource Network project. The authors would like to express sincere gratitude to all lexicon contributors for their help.
Selection of XML tag set for Myanmar National Corpus
Wunna Ko Ko
AWZAR Co.
Mayangone Township, Yangon, Myanmar
[email protected]

Thin Zar Phyo
Myanmar Unicode and NLP Research Center
Myanmar Info-Tech, Hlaing Campus, Yangon, Myanmar
[email protected]
Abstract

In this paper, the authors describe the selection of an XML tag set for the Myanmar National Corpus (MNC). MNC will be a sentence-level annotated corpus. The validity of the XML tag set has been tested by manually tagging sample data.

Keywords: Corpus, XML, Myanmar, Myanmar Languages

1 Introduction

Myanmar (formerly known as Burma) is one of the South-East Asian countries. There are 135 ethnic groups living in Myanmar. These ethnic groups speak more than one language and use different scripts to write their respective languages. In total, 109 languages are spoken by the people living in Myanmar [Ethnologue, 2005].

There are seven major languages, by speaking population, in Myanmar: Kachin, Kayin/Karen, Chin, Mon, Burmese, Rakhine and Shan [Ko Ko & Mikami, 2005]. Among them, Burmese is the official language, spoken by about 69% of the population as their mother tongue [Ministry of Immigration and Population, 1995].

A corpus is a large and structured set of texts. Corpora are used for statistical analysis, for checking occurrences, or for validating linguistic rules within a specific universe.1

1 http://en.wikipedia.org/wiki/Text_corpus

In Myanmar, there is plenty of text for most of the languages, especially Burmese and the other major languages, dating back to stone inscriptions.

The Myanmar Language Commission and a number of scholars have collected a number of corpora for their specific uses [Htay et al., 2006], but until now there has been no national corpus collection, in either digital or non-digital format.

Since a number of languages are used in Myanmar, the national-level corpus to be built will include all languages and scripts used in Myanmar. It has been named the Myanmar National Corpus, or MNC in short form.

During the discussion of the choice of format for the corpus, XML (eXtensible Markup Language), a subset of SGML (Standard Generalized Markup Language), was chosen, since the XML format remains usable over the long term and makes it possible to keep the original format of the text [Burnard, 1996]. The range of software available for XML is increasing day by day, and more and more NLP-related tools and resources are produced in it. This in turn makes the selection of an XML tag set necessary before the building of MNC can start.

MNC will include not only written but also spoken texts. The written part will include regional and national newspapers, periodicals and journals of various interests, academic books, fiction, memoranda, essays, etc. The spoken part will include scripted formal and informal conversations, movies, etc.

During the selection of the XML tag set, samples of all the kinds of data that will be included in building MNC were studied.

2 Myanmar National Corpus

Myanmar is a country using 109 different languages and a number of different scripts [Ethnologue, 2005]. In order to do language processing for these languages and scripts, it becomes a necessity to build a corpus with
languages and scripts used in Myanmar, at least with the major languages and scripts, covering almost all areas of documents.

Among the different scripts used in Myanmar, the popular ones include the Burmese script (a Brahmi-based script) and the Latin script. Building MNC will be helpful for the development of Natural Language Processing (NLP) tools (such as grammar rules, spelling checkers, etc.) and also for linguistic research on these languages and scripts. Moreover, since the Burmese script is written without necessarily pausing between words with spaces, the corpus to be built is hoped to be useful for developing tools for automatic word segmentation.

2.1 XML based corpus

XML is a universal format for structured documents and data, and can provide highly standardized representation frameworks for NLP (Jin-Dong Kim et al. 2001), especially for annotated-corpus-based approaches, by providing them with knowledge representation frameworks for morphological, syntactic, semantic and/or pragmatic information structure. Important features are:

• XML is extensible; it does not consist of a fixed set of tags.

• XML documents must be well-formed according to a defined syntax.

• An XML document can be formally validated against a schema of some kind.

• XML is more interested in the meaning of data than in its presentation.

An XML document must have exactly one top-level, or root, element; all other elements must be nested within it. Elements must be properly nested [Young, 2001]: if an element starts within another element, it must also end within that same element. Each element must have both a start-tag and an end-tag, the element type name in a start-tag must exactly match the name in the corresponding end-tag, and element names are case sensitive.

Moreover, the advantages of XML for NLP include ontology extraction into XML-based structured languages using XML Schema. A great benefit of XML is that the document itself describes the structure of the data.2

Three characteristics of XML distinguish it from other markup languages:3

• its emphasis on descriptive rather than procedural markup;

• its notion of documents as instances of a document type; and

• its independence of any hardware or software system.

Since MNC is to be built in an XML-based format, the selection of the XML tag set becomes an important process. The XML-tagged corpus data should also keep the original format of the data. In order to select the XML tag set for MNC, sample data for the corpus had to be collected; the format of the sample corpus data was studied so that the XML tag set could be selected appropriately for the data format.

2.2 Structure of a data file at MNC

The structure of a data file at MNC will include two main parts: information about the corpus file and the corpus data.

The first part, the header part of a corpus file, describes the information of the corpus file. It provides for sensible use of the corpus information in machine-readable form. In this part, information such as the language usage and the description of the corpus file will be included.

The second part, the document part of a corpus file, will include the source description of the corpus data and the corpus data itself, the written or spoken text. Information about the corpus data, such as bibliographic information, authorship and publisher information, will be included in this section, together with the corpus data itself.

The hierarchical structure of a corpus file at MNC is shown in Figure 1.

2 http://www.tei-c.org/P5/Guidelines/index.html
3 http://www.w3.org/TR/xml/
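The well-formedness constraints listed in section 2.1 (exactly one root element, matching case-sensitive start- and end-tags, proper nesting) are exactly what any off-the-shelf XML parser enforces, so candidate corpus files can be machine-checked before further processing. A small illustration with Python's standard library (the tag names are arbitrary examples, not MNC tags):

```python
import xml.etree.ElementTree as ET

well_formed = "<text><s n='1'>one sentence</s></text>"
ill_formed = "<text><s n='1'>improperly nested</text></s>"

# A well-formed document parses without error.
root = ET.fromstring(well_formed)

# An improperly nested document is rejected by the parser.
try:
    ET.fromstring(ill_formed)
    accepted = True
except ET.ParseError:
    accepted = False
assert accepted is False
```

Validation against a schema (e.g. an XML Schema or DTD describing the MNC tag set) is a separate, stricter step on top of this basic well-formedness check.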
Figure 1. Hierarchical structure of a data file at MNC: a Data File divides into a Header (Language Usage, File Description) and a Document (Source Description, Written or Spoken Texts).
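The two-part structure of Figure 1 can be sketched in code. The tag names below (mncFile, header, langUsage, fileDesc, document, sourceDesc, text, s) are hypothetical stand-ins, since the paper's actual TEI-derived tag listing is not reproduced here; the sketch only mirrors the header/document split and the sentence-level numbering described in the text:

```python
import xml.etree.ElementTree as ET

def build_mnc_file(lang, file_desc, source_desc, sentences):
    # Hypothetical tag names; the real MNC tag set is TEI-derived.
    root = ET.Element("mncFile")

    # Header part: language usage and file description (Figure 1, left).
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "langUsage").text = lang
    ET.SubElement(header, "fileDesc").text = file_desc

    # Document part: source description plus the text itself,
    # annotated and numbered at sentence level only (no word
    # segmentation is assumed for the Burmese script).
    doc = ET.SubElement(root, "document")
    ET.SubElement(doc, "sourceDesc").text = source_desc
    text = ET.SubElement(doc, "text")
    for i, sentence in enumerate(sentences, start=1):
        ET.SubElement(text, "s", n=str(i)).text = sentence
    return root

root = build_mnc_file(
    lang="my",
    file_desc="Sample MNC data file",
    source_desc="Universal Declaration of Human Rights, Burmese version",
    sentences=["first sentence", "second sentence"],
)
xml_string = ET.tostring(root, encoding="unicode")
assert xml_string.startswith("<mncFile>")
```

Because the annotation stops at the sentence level, each sentence element carries only a number attribute; finer-grained (word-level) markup could be added later without changing this outer structure.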
3 Selection of necessary XML tag set

After studying the original formats and features of the texts to be used in the corpus, and once the structure of the corpus file had been determined, the selection procedure for the XML tag set was carried out with reference to the TEI guidelines for text encoding and interchange6. The TEI encoding scheme consists of a number of rules to which a document has to adhere in order to be accepted as a TEI document.

3.1 Header Part

The header part contains the language usage of the data file; the file
description part of MNC will be saved in this part.
3.2 Document Part
The document part will include the corpus data itself, which in turn can be divided into two parts. The first part is the source description part of the data. Since MNC is going to be annotated at the sentence level, each sentence will be annotated and numbered.
Figure 6. Element and Child tags of MNC
material because we expect the corpus to be
A sample MNC data file uses the Universal
- အြပည်ြပည်ဆိ င်ရာလူ ့အခွင့်အ ရး ကညာစာတမ်း (meaning: Universal Declaration of Human Rights) + စကားချီး (meaning: Preamble) - လူခပ်သိမ်း၏မျိုးရိ းဂ ဏ်သိက္ခာနှင့်တကွလူတိ င်းအညီအမ ခ စားခွင့်ရှိသည့်အခွင့်အ ရးများကိ အသိအမှတ်ြပုြခင်းသည်လူခပ်သိမ်း၏လွတ်လပ်မ ၊တရားမ တမ ၊ ငိမ်းချမ်းမ တိ ့၏အ ြခခ အ တ်ြမစ် ြဖစ် သာ ကာင့်လည်း ကာင်း၊ ……. (meaning: Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,…….) အပိ ဒ် ၁ (meaning: paragraph 1) - လူတိ င်းသည်တူညီလွတ်လပ် သာဂ ဏ်သိက္ခာြဖင့်လည်း ကာင်း၊တူညီလွတ်လပ် သာအခွင့်အ ရး များြဖင့်လည်း ကာင်း၊ မွးဖွားလာသူများြဖစ်သည်။ (meaning: All human beings are born free and equal in dignity and rights.)
ထိ သူတိ ့၌ပိ င်းြခား ဝဖန်တတ် သာဉာဏ်နှင့်ကျင့်ဝတ်သိတတ် သာစိတ်တိ ့ရှိ က၍ထိ သူတိ ့သည် အချင်းချင်း မတ္တာထား၍ဆက်ဆ ကျင့်သ းသင့်၏။ (meaning: They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.) -
4 Conclusion and Future work
In this paper, the authors have described the selection of the XML tag set for the building of MNC. Since word-level segmentation for Burmese script is not yet available, the corpus data will be annotated only up to the sentence level, in order to be in the same format for all Myanmar languages and scripts.
In order to check whether the selected XML tag set is sufficient and useful for tagging the corpus data, sample corpus data has been collected and manually tagged; it includes newspapers and periodicals, the Universal Declaration of Human Rights (UDHR), novels and essays.
Since the manual tagging of the sample corpus data proves that the selected XML tag set is enough to cover a variety of data sources, the next step is to develop an algorithm for automatically tagging the data.
Acknowledgement
This study was performed with the support of the Government of the Union of Myanmar through the Myanmar Natural Language Implementation Committee. Thanks and gratitude go to the members of the Myanmar Language Commission for providing the necessary information to write this paper.
References
Ethnologue. 2005. Languages of the World, 15th Edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/. Edited by Raymond G. Gordon, Jr.
Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy. 2006. Constructing English-Myanmar Parallel Corpora. The Fourth International Conference on Computer Application 2006 (ICCA 2006) Conference Program.
Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, Hideki Mima and Jun'ichi Tsujii. 2001. XML-based Linguistic Annotation of Corpus. In Proceedings of the First NLP and XML Workshop held at NLPRS 2001, pp. 47-53.
Lou Burnard. 1996. Using SGML for Linguistic Analysis: the case of the BNC. ACM Vol. 1, Issue 2 (Spring 1999), MIT Press, ISSN: 1099-6621, pp. 31-51.
Michael J. Young. 2001. Step by Step XML. Prentice Hall of India Private Limited Press. ISBN 81-203-1804-B.
Ministry of Immigration and Population. 1995. Myanmar Population Changes and Fertility Survey 1991. Immigration and Population Department.
Wunna Ko Ko and Yoshiki Mikami. 2005. Languages of Myanmar in Cyberspace. In Proceedings of TALN & RECITAL 2005 (NLP for Under-Resourced Languages Workshop), Dourdan, France, June 2005, pp. 269-278.
Myanmar Word Segmentation using Syllable level Longest Matching
Hla Hla Htay, Kavi Narayana Murthy Department of Computer and Information Sciences University of Hyderabad, India hla hla [email protected], [email protected]
Abstract
Key Words: Myanmar, Syllable, Words, Segmentation, Syllabification, Dictionary
In the Myanmar language, sentences are clearly delimited by a unique sentence boundary marker but are written without necessarily pausing between words with spaces. It is therefore non-trivial to segment sentences into words. Word tokenizing plays a vital role in most Natural Language Processing applications. We observe that word boundaries generally align with syllable boundaries. Working directly with characters does not help; it is therefore useful to syllabify texts first. Syllabification is also a non-trivial task in Myanmar. We have collected 4550 syllables from available sources. We have evaluated our syllable inventory on 2,728 sentences spread over 258 pages and observed a coverage of 99.96%. In the second part, we build word lists from available sources such as dictionaries, through the application of morphological rules, and by generating syllable n-grams as possible words and manually checking them. We have thus built a list of 800,000 words including inflected forms. We have tested our algorithm on a 5000-sentence test data set containing a total of 35049 words and manually checked the output for evaluating the performance. The program recognized 34943 words, of which 34633 words were correct, thus giving us a Recall of 98.81%, a Precision of 99.11% and an F-Measure of 98.95%.

1 Introduction

Myanmar (Burmese) is a member of the Burmese-Lolo group of the Sino-Tibetan languages, spoken by about 21 million people in Myanmar (Burma). It is a tonal language, that is to say, the meaning of a syllable or word changes with the tone. It has been classified by linguists as a mono-syllabic or isolating language with agglutinative features. According to history, the Myanmar script originated from the Brahmi script, which flourished in India from about 500 B.C. to over 300 A.D. (MLC, 2002). The script is syllabic in nature, and written from left to right.
Myanmar script is composed of 33 consonants, 11 basic vowels, 11 consonant combination symbols and extension vowels, vowel symbols, devowelizing consonants, diacritic marks, specified symbols and punctuation marks (MLC, 2002), (Thu and Urano, 2006). Myanmar script represents sequences of syllables, where each syllable is constructed from consonants, consonant combination symbols (i.e. Medials), vowel symbols related to the relevant consonants and diacritic marks indicating tone level.
Myanmar has mainly 9 parts of speech: noun, pronoun, verb, adjective, adverb, particle, conjunction, post-positional marker and interjection (MLC, 2005), (Judson, 1842).
In Myanmar script, sentences are clearly delimited by a sentence boundary marker but words are not always delimited by spaces. Although there is a general tendency to insert spaces between phrases, inserting spaces is more of a convenience rather than
a rule. Spaces may sometimes be inserted between words and even between a root word and the associated post-position. In fact, in the past spaces were rarely used. Segmenting sentences into words is therefore a challenging task.
Word boundaries generally align with syllable boundaries, and syllabification is therefore a useful strategy. In this paper we describe our attempts at syllabification and at segmenting Myanmar sentences into words. After a brief discussion of the corpus collection and pre-processing phases, we describe our approaches to syllabification and tokenization into words.
Computational and quantitative studies in Myanmar are relatively new. Lexical resources available are scanty. Development of electronic dictionaries and other lexical resources will facilitate Natural Language Processing tasks such as Spell Checking, Machine Translation, Automatic Text Summarization, Information Extraction, Automatic Text Categorization, Information Retrieval and so on (Murthy, 2006).
Over the last few years, we have developed monolingual text corpora totalling about 2,141,496 sentences and English-Myanmar parallel corpora amounting to about 80,000 sentences and sentence fragments, aligned at sentence and word levels. We have also collected word lists from these corpora and from available dictionaries. Currently our word list includes about 800,000 words including inflected forms.

2 Myanmar Words

Myanmar words are sequences of syllables. The syllable structure of Burmese is C(G)V((V)C), which is to say the onset consists of a consonant optionally followed by a glide, and the rhyme consists of a monophthong alone, a monophthong with a consonant, or a diphthong with a consonant1. Some representative words are:
¥ CV [mei] girl
¥ CVC [me ] crave
¥ CGV [mjei] earth
¥ CGVC [mje ] eye
¥ CVVC [maun] (term of address for young men)
¥ CGVVC [mjaun] ditch
Words in the Myanmar language can be divided into simple words, compound words and complex words (Tint, 2004), (MLC, 2005), (Judson, 1842). Some examples of compound words and loan words are given below.
¥ Compound Words
– head [u:] + pack [htou ] = hat [ou htou ]
– language [sa] + look, see [kji.] + building [tai ] = library [sa kji. dai ]
– sell [yaun:] + buy [we] = trading [yaun: we]
¥ Loan Words
– [kun pju ta] computer
– [hsa ko ma ti] sub-committee
– [che ri] cherry

3 Corpus Collection and Preprocessing

Development of lexical resources is a very tedious and time-consuming task and purely manual approaches are too slow. We have downloaded Myanmar texts from various web sites, including news sites with official newspapers, on-line magazines, trial e-books (over 300 full books) as well as free and trial texts from on-line book stores, covering a variety of genres, types and styles - modern and ancient, prose and poetry, and example sentences from dictionaries. As of now, our corpus includes 2,141,496 sentences.
The downloaded corpora need to be cleaned up to remove hypertext markup, and we need to extract text if it is in PDF format. We have developed the necessary scripts in Perl for this. Also, different sites use different font formats, and character encoding standards are not yet widely followed. We have mapped these various formats into the standard WinInnwa font format. We have stored the cleaned-up texts in ASCII format and these pre-processed corpora are seen to
1 http://en.wikipedia.org/wiki/Burmese language
4 Collecting Word Lists

Electronic dictionaries can be updated much more easily than published printed dictionaries, which need more time, cost and man power to bring out a fresh edition. Word lists and dictionaries in electronic form are of great value in computational linguistics and NLP. Here we describe our efforts in developing a large word list for Myanmar.

4.1 Independent Words

As we keep analyzing texts, we can identify some words that can appear independently, without combining with other words or suffixes. We build a list of such valid words and we keep adding new valid words as we progress through our segmentation process, gradually developing larger and larger lists of valid words. We have also collected words from sources such as Myanmar Orthography (MLC, 2003), the CD versions of the English-English-Myanmar (Student's Dictionary) (stu, 2000) and the English-Myanmar Dictionary (EMd, ), and the Myanmar-French Dictionary (damma sami, 2004). Currently our word list includes 800,000 words.

4.2 Stop Word Removal

Stop words include prepositions/post-positions, conjunctions, particles, inflections etc., which appear as suffixes added to other words. They form closed classes and hence can be listed. Preliminary studies therefore suggested that Myanmar words can be recognized by eliminating these stop words. Hopple (Hopple, 2003) also notices that particles ending phrases can be removed to recognize words in a sentence. We have collected stop words by analyzing official newspapers, Myanmar grammar text books and the CD versions of the English-English-Myanmar (Student's Dictionary) (stu, 2000), the English-Myanmar Dictionary (EMd, ) and The Khit Thit English-Myanmar dictionary (Saya, 2000). We have also looked at stop word lists in English (www.syger.com, ) and mapped them to equivalent stop words in Myanmar. See Table 1.

Table 1: Stop-words of English vs Myanmar
Nominative personal pronouns - I: [kjun do], [kja ma.], [nga], [kjou ], [kja no], [kjanou ]
Possessive pronouns and adjectives - my: [kjou i.], [kjun do i.], [kja ma. i.], [kja nou i.], [nga i.], [kjou je.], [kjun do je.], [kja ma. je.], [kja nou je.], [nga je.], [kjun do.], [kja no.]
Indefinite pronouns and adjectives - some: [a chou.], [a chou. tho:], [ta chou.], [ta chou. ta chou.], [ta chou. ta lei]

As of now, our stop words list contains 1216 entries. Stop words can be prefixes of other stop words, leading to ambiguities. However, usually the longest matching stop word is the right choice.
Identifying and removing stop words does not always necessarily lead to correct segmentation of sentences into words. Both under and over segmentation are possible. When stop-words are too short, over segmentation can occur. Under segmentation can occur when no stop-words occur between words. Examples of segmentation can be seen in Table 2. We have observed that over segmentation is more frequent than under segmentation.

Table 2: Removing stop-words for segmentation
[waing: win: chi: kyu: khan ya] received compliments (Vpp); [a nay khak] abashed (Vpast)
[kyaung: aop hsa ya kyi:] The headmaster (Nsubj); [a kyan: phak mhu] violence (Nobj); [sak sop] abhors (Vpresent)

4.3 Syllable N-grams

The Myanmar language uses a syllabic writing system, unlike English and many other western languages which use an alphabetic writing system. Interestingly, almost every syllable has a meaning in the Myanmar language. This can also be seen from the work of Hopple (Hopple, 2003).
The Myanmar Natural Language Processing Group has listed 1894 syllables that can appear in Myanmar texts (Htut, 2001). We have observed that there are more syllables in use, especially in foreign words, including Pali and Sanskrit words which are widely used in Myanmar. We have collected other possible syllables from the Myanmar-English dictionary (MLC, 2002).
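The longest-match heuristic for stop-word removal described in Section 4.2 can be sketched as follows; the romanized stop-word entries are illustrative stand-ins, not the actual 1216-entry list.

```python
# Illustrative stop-word suffix stripping: try longer suffixes first so
# that the longest matching stop word wins, as the text suggests.
STOP_WORDS = {"thi", "ko", "hnin.", "twin"}   # hypothetical romanized entries

def strip_stop_suffix(word, stop_words=STOP_WORDS):
    """Split off the longest stop-word suffix of a word, if any."""
    for cut in range(1, len(word)):           # cut=1 gives the longest suffix
        suffix = word[cut:]
        if suffix in stop_words:
            return word[:cut], suffix
    return word, None

print(strip_stop_suffix("saukthi"))  # ('sauk', 'thi')
print(strip_stop_suffix("kaun:"))    # ('kaun:', None)
```

Over-segmentation with very short stop words, as noted above, would show up here as a suffix match inside a word that is not actually inflected.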
Texts collected from the Internet show a lack of standard typing sequences. There are several possible typing sequences, and corresponding internal representations, for a given syllable. We include all of these possible variants in our list. Now we have over 4550 syllables.
We have developed scripts in Perl to syllabify words using our list of syllables as a base, and then generate n-gram statistics using Text::Ngrams, which was developed by Vlado Keselj (Keselj, 2006). This program is quite fast and it took only a few minutes on a desktop PC to process 3.5 MB of Myanmar texts. We have used the "-type=word" option, treating syllables as words. We had to modify this program a bit, since Myanmar uses zero (as the "(0) wa" letter) and other special characters (",", "<", ">", ".", "&", "[", "]" etc.) which were being ignored in the original Text::Ngrams software.
We collect all possible words composed of n-grams of syllables, up to 5-grams. Table 3 shows some words which were collected through n-gram analysis. Almost all monograms are meaningful words. Many bi-grams are also valid words, and as we move towards longer n-grams we generally get fewer and fewer valid words. Further, the frequency of occurrence of these n-grams is a useful clue. See Table 4.

Table 3: Examples of Collected N-grams
Bigram (bisyllables): lantern [hpan ein]; glassware [hpan tha:]; bank of lake [kan saun:]
Trigram (3-syllables): with a big sound [boun: ga ne:]; effortlessly [swei. ga ne:]; fuming with rage [htaun: ga ne:]
4-gram (4-syllables): whole-heartedly [hni hni ka ga]; outstanding [htu: htu: ke: ke:]; many, much [mja: mja: sa: za:]

By analyzing the morphological structure of words, we will be able to analyze inflected and derived word forms. A set of morphemes and morphological forms has been collected from (MLC, 2005) and (Judson, 1842). See Table 5. For example, the four-syllable word [htu: htu: ke: ke:] (outstanding) in Table 3 is an adverb derived from a verb.
Statistical construction of machine-readable dictionaries has many advantages. New words which appear from time to time, such as "Internet" or names of medicines, can also be detected. Compound words can also be seen. Common names, such as names of persons, cities, committees etc., can also be mined. Once sufficient data is available, statistical analysis can be carried out, and techniques such as mutual information and maximum entropy can be used to hypothesize possible words.

Table 4: Syllable Structure of Words (No. of syllables / No. of words / Example)
1 / 4550 / [kaun:] good (Adj)
2 / 59964 / [lei pja] butterfly, soul (N)
3 / 170762 / [b a din: bau ] window (N)
4 / 274775 / [pji dwin: htou koun] Domestic Product (N)
5 / 199682 / [hlja si ht a min: ou:] rice cooker (N)
6 / 99762 / [thu na bju. hs a ja ma.] nurse (female) (N)
7 / 41499 / [jin: hni: thwa: kya. pei to. thi] become friends (V)
8 / 14149 / [pji daun zu. mj a ma nain gan to ] Union of Myanmar (N)
9 / 4986 / [than jan za ta. a jin: a mji ] natural resources (N)
10 / 1876 / [chei ma kain mi. le ma kain mi. hpji thi] be agitated or shaken (V)

4.4 Words from Dictionaries

Collecting words using the above three methods has still not covered all the valid words in our corpus; we have got only 150,000 words, and picking the valid words out of the n-gram collections needs exhaustive human effort. We have therefore collected words from two on-line dictionaries - the English-English-Myanmar (Student's Dictionary) (stu, 2000) and the English-Myanmar Dictionary (EMd, ) - and from two e-books - the Myanmar-French dictionary (damma sami, 2004) and Myanmar Orthography (MLC, 2003). Fortunately, these texts can be transformed into the WinInnwa font, and we have written Perl scripts to convert them to the standard font.
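The syllable n-gram collection of Section 4.3 can be sketched as a simple counting pass over syllabified sentences, in the spirit of the Text::Ngrams run described above; the romanized syllables here are placeholders for real Myanmar syllables.

```python
from collections import Counter

# Toy syllabified sentences; each inner list is one sentence's syllables.
sentences = [
    ["sa", "kji.", "dai"],           # library
    ["sa", "kji.", "dai"],
    ["htu:", "htu:", "ke:", "ke:"],  # outstanding
]

def syllable_ngrams(syllables, n):
    """Yield successive n-grams of a syllable list as tuples."""
    for i in range(len(syllables) - n + 1):
        yield tuple(syllables[i:i + n])

counts = Counter()
for sent in sentences:
    for n in range(1, 6):            # 1-grams up to 5-grams
        counts.update(syllable_ngrams(sent, n))

# Frequent n-grams are candidate words, to be checked manually.
print(counts[("sa", "kji.", "dai")])  # 2
```

On real data the resulting frequency table is what feeds the manual filtering step the section describes: high-frequency n-grams are promoted to word-list candidates.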
(Columns: A = basic unit; B = Verb: A + [thi]; C = Noun: [a] + A; D = Negative: [ma.] + A + [bu:]; E = Noun: A + [mhu.])
[kaun:] good (Adj) | [kaun: thi] is good | [a kaun:] good | [ma. kaun: bu:] not good | [kaun: mhu.] good deeds
[hso:] bad (Adj) | [hso: thi] is bad | [a hso:] bad | [ma. hso: bu:] not bad | [hso: mhu.] bad deeds
[jaun:] sell (Verb) | [jaun: thi] sell | [a jaun:] sale | [ma. jaun: bu:] not sell | [jaun: mhu.] sale
[jei:] write (Verb) | [jei: thi] write | [a jei:] writing | [ma. jei: bu:] do not write | [jei: mhu.] writing
[pjo:] talk, speak (Verb) | [pjo: thi] talk, speak | [a pjo:] talk, speech | [ma. pjo: bu:] not talk, speak | [pjo: mhu.] talking
Table 5: Example patterns of Myanmar Morphological Analysis

The Myanmar Spelling Bible lists only lemmas (root words). We have suffixed some frequently used morphological forms to these root words.
There are lots of valid words which are not described in published dictionaries. The entries in the Myanmar-English dictionary produced by the Department of the Myanmar Language Commission, Ministry of Education, are mainly words of the common Myanmar vocabulary; most compound words have been omitted from the dictionary (MLC, 2002). This can be seen in the preface and guide to that dictionary. Four-syllable words like [htu: htu: zan: zan:] (strange), [htu: htu: ke: ke:] (outstanding) and [htu: htu: gja: gja:] (different) (see Table 3) are not listed in the dictionary, although we usually use those words in everyday life.
With all this, we have been able to collect a total of about 800,000 words. As we have collected words from various sources and techniques, we believe we have fairly good data for further work.

5 Syllabification and Word Segmentation

Since dictionaries and other lexical resources are not yet widely available in electronic form for the Myanmar language, we have collected 4550 possible syllables, including those used in Pali and foreign words, considering different typing sequences and corresponding internal representations, and we use the 800,000-strong Myanmar word list we have built. With the help of these stored syllables and word lists, we have carried out syllabification and word segmentation as described below. Many researchers have used longest string matching (Angell et al., 1983), (Ari et al., 2001) and we follow the same approach.
The first step in building a word hypothesizer is syllabification of the input text by looking up syllable lists. In the second step, we exploit lists of words (viewed as n-grams at the syllable level) for word segmentation from left to right.

5.1 Syllabification

As an initial attempt, we use longest string matching alone for Myanmar text syllabification. Examples are shown in Table 7.

Pseudo code. Here we go from left to right in a greedy manner:

sub syllabification{
    Load the set of syllables from syllable-file
    Load the sentences to be processed from sentence-file
    Store all syllables of length j in Nj, where j = 10..1
    for-each sentence do
        length <- length of the sentence
        pos <- 0
        while (length > 0) do
            for j = 10..1 do
                for-each syllable in Nj do
                    if string-match sentence(pos, pos+j) with syllable
                        Syllable found. Mark syllable
                        pos <- pos + j
                        length <- length - j
                    End if
                End for
            End for
        End while
        Print syllabified string
    End for
}

Table 6: Syllables with different typing sequences. Syllables that look identical on screen can differ in ASCII: MuD: vs BuD:, udk vs ukd.

We have evaluated our syllable list on a collection of 11 short novels entitled "Orchestra" [than zoun ti: wain:], written by "Nikoye" (Ye, 1997), which includes 2,728 sentences spread over 259 pages with a total of 70,384 syllables. These texts were syllabified using the longest matching algorithm over our syllable list, and we observed that only 0.04% of the actual syllables were not detected. Table 6 shows that different typing sequences of syllables were also detected. Some examples of failure are [rkdCf;] and [rkdvf;], which are seldom used in text; the typing sequence is also wrong. Failures are generally traced to
¥ differing combinations of writing sequences
¥ loan words borrowed from foreign languages
¥ rarely used syllables not listed in our list

5.2 Word Segmentation

We have carried out tokenization with longest syllable word matching, using our 800,000-strong stored word list. This word list has been built from available sources such as dictionaries, through the application of morphological rules, and by generating syllable n-grams and manually checking them. An example sentence and its segmentation are given in Table 8.

Load the set of words from word-file
for-each word do
    i <- syllabification(word)
    Store all words of syllable length i in Ni, where i = 10..1
End for
Load the sentences to be processed from sentence-file
for-each sentence do
    length <- syllabification(sentence)  # length of the sentence in terms of syllables
    pos <- 0
    while (length > 0) do
        for j = 10..1 do
            for-each word in Nj do
                if string-match sentence(pos, pos+j) with word
                    Word found. Mark word
                    pos <- pos + j
                    length <- length - j
                End if
            End for
        End for
    End while
    Print tokenized string
End for

6 Evaluation and Observations

We have segmented 5000 sentences, containing a total of 35049 words, with our programs and manually checked the output for evaluating the performance. These sentences are from part of the English-Myanmar parallel corpus being developed by us (Htay et al., 2006). The program recognized 34943 words, of which 34633 words were correct, thus giving us a Recall of 98.81% and a Precision of 99.11%. The F-Measure is 98.95%. The algorithm suffers in accuracy in two ways:

Out-of-vocabulary Words: Segmentation errors can occur when words are not listed in the dictionary. No lexicon contains every possible word of a language. There always exist out-of-vocabulary words such as newly derived words, new compound words, morphological variations of existing words and technical words (Park, 2002). In order to check the effect of out-of-vocabulary words, we took a new set of 1000 sentences (7343 words). We checked it manually and noticed 329 new words; that is, about 4% of the words are not found in our list, giving us a coverage of about 96%.

Limitations of left-to-right processing: Segmentation errors can also occur due to the limitations of the left-to-right processing. See example 1 in Table 9. The algorithm suffers most in recognizing sentences which have the word he [thu] followed by a negative verb starting with the particle [ma.]. The program wrongly segments "she" as "he". Our text collection is obtained from various sources, and the word "she" is used as [thu ma.] in modern novels and Internet text. Therefore, our
aumfzDaomuf&if;tefwDESihftvyovyajymaecJhonf
aumf zD aomuf &if; tef wD ESihf t v y o v y ajym ae cJh onf
[ko] [hpi] [thau ] [jin:] [an] [ti] [hnin.] [a] [la] [pa.] [tha.] [la] [pa.] [pjo] [nei] [khe.] [thi]
Having his coffee, he chit-chats with the lady.
Table 7: Example syllabification
[kyaung: aop hsa ya kyi:] | [thi] | [a kyan: phak mhu] | [ko] | [sak sop thi]
The headmaster | Particle | violence | Particle | abhors
Nsubj | Particle | Nobj | Particle | Vpresent
Table 8: A sentence being segmented into words
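The greedy left-to-right longest-match routine given in the pseudocode above can be sketched as runnable Python; the inventory entries here are romanized placeholders, not the WinInnwa-encoded strings and 800,000-entry lists the actual system uses.

```python
# Toy syllable and word inventories (hypothetical romanized forms).
SYLLABLES = {"kaun:", "thi", "ma.", "bu:", "a"}
WORDS = {"kaun:thi", "ma.kaun:bu:", "akaun:"}

def longest_match(text, pos, inventory):
    """Return the longest inventory entry starting at pos, or None."""
    max_len = max(map(len, inventory))
    for j in range(max_len, 0, -1):       # try longest candidates first
        if text[pos:pos + j] in inventory:
            return text[pos:pos + j]
    return None

def tokenize(text, inventory):
    """Greedy left-to-right segmentation; unmatched characters pass through."""
    tokens, pos = [], 0
    while pos < len(text):
        match = longest_match(text, pos, inventory)
        if match:
            tokens.append(match)
            pos += len(match)
        else:                             # no match: emit one character
            tokens.append(text[pos])
            pos += 1
    return tokens

print(tokenize("ma.kaun:bu:", SYLLABLES))  # ['ma.', 'kaun:', 'bu:']
print(tokenize("ma.kaun:bu:", WORDS))      # ['ma.kaun:bu:']
```

The same routine serves both stages: first with the syllable inventory (syllabification), then with the word list over the syllabified text (word segmentation). The greedy choice is also what produces the over-segmentation cases analyzed in Table 9.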
word list contains "she" [thu ma.]. This problem can be solved by standardization: the Myanmar Language Commission (MLC, 1993) has advised that the words "she" and "he" should both be written only as [thu], and that the word [thu ma.] representing a feminine pronoun should not be used. For example 2 in Table 9, the text [a: pei: thi] can be segmented in two ways: 1) [a: pei: thi], which means "encourage", and 2) [a:], a particle indicating the dative case, followed by [pei: thi], "give". Because of the greedy search from left to right, our algorithm will always segment it as [a: pei: thi], no matter what the context is.
In order to solve these problems, we plan to use machine learning techniques which 1) can detect real words dynamically (Park, 2002) while we are segmenting the words and 2) correct the greedy left-to-right cut using frequencies of the words in training samples.
Although the work presented here is for Myanmar, we believe that the basic ideas can be applied to any script which is primarily syllabic in nature.

7 Conclusions

Since words are not uniformly delimited by spaces in Myanmar script, segmenting sentences into words is an important task for Myanmar NLP. In this paper we have described the need and possible techniques for segmentation in Myanmar script. In particular, we have used a combination of stored lists, suffix removal, morphological analysis and syllable-level n-grams to hypothesize valid words with about 99% accuracy. The necessary scripts have been written in Perl. Over the last few years, we have collected monolingual text corpora totalling about 2,141,496 sentences and English-Myanmar parallel corpora amounting to about 80,000 sentences and sentence fragments, aligned at sentence and word levels. We have also built a list of 1216 stop words, 4550 syllables and 800,000 words from a variety of sources including our own corpora. We have used fairly simple and intuitive methods not requiring deep linguistic insights or sophisticated statistical inference. With this initial work, we now plan to apply a variety of machine learning techniques. We hope this work will help to accelerate work in the Myanmar language and that larger lexical resources will be developed soon.

References

Richard C. Angell, George W. Freund, and Peter Willett. 1983. Automatic spelling correction using a trigram similarity measure. Information Processing & Management, 19(4):255-261.
Pirkola Ari, Heikki Keskustalo, Erkka Leppänen, Antti-Pekka Känsälä, and Kalervo Järvelin. 2001. Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants. Information Research, 7(2), January.
U Damma Sami. 2004. Myanmar-French Dictionary. Ministry of Education, Union of Myanmar, CD version.
Paulette Hopple. 2003. The Structure of Nominalization in Burmese. Ph.D. thesis, May.
Hla Hla Htay, G. Bharadwaja Kumar, and Kavi Narayana Murthy. 2006. Building English-Myanmar parallel corpora. In Fourth International Conference on Computer Applications, pages 231-238, Yangon, Myanmar, Feb.
Example 1:
[damja. hmu.] | [twin] | [thu ma.] | [pa wun khe. bu:]
robbery | in | she | did not involve
N | Particle | Nsubj | Vpastneg

Example 2:
[mi. mi.] | [ma. lou chin tho:] | [ta wun] | [gou] | [thu daba:] | [a: pei: thi]
I, myself | don't want | duty, responsibility | (particle) | others | encourage
Nsubj | Vneg | Nobj1 | Particle | Nobj2 | V
Table 9: Analysis of Over-Segmentation
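The evaluation figures reported above can be reproduced directly from the raw counts (35049 test words, 34943 words recognized, 34633 of them correct):

```python
# Recomputing Recall, Precision and F-Measure from the raw counts
# reported in the evaluation section.
total_words = 35049   # words in the 5000-sentence test set
recognized = 34943    # words output by the segmenter
correct = 34633       # correctly segmented words

recall = correct / total_words
precision = correct / recognized
f_measure = 2 * precision * recall / (precision + recall)

print(f"Recall:    {recall:.2%}")     # 98.81%
print(f"Precision: {precision:.2%}")  # 99.11%
print(f"F-Measure: {f_measure:.2%}")  # 98.96% (the paper rounds this to 98.95%)
```

The harmonic mean comes out at roughly 98.96%; the 98.95% quoted in the paper reflects a slightly different rounding of the same counts.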
Zaw Htut. 2001. All possible Myanmar syllables. September.
Adoniram Judson. 1842. Grammatical Notices of the Burmese Language. Maulmain: American Baptist Mission Press.
Vlado Keselj. 2006. Text::Ngrams. http://search.cpan.org/ vlado/Text-Ngrams-1.8/, November.
MLC. 1993. Myanmar Words Commonly Misspelled and Misused. Department of the Myanmar Language Commission, Ministry of Education, Union of Myanmar.
Ye Kyaw Thu and Yoshiyori Urano. 2006. Text entry for Myanmar language SMS: proposal of 3 possible input methods, simulation and analysis. In Fourth International Conference on Computer Applications, Yangon, Myanmar, Feb.
U Tun Tint. 2004. Features of Myanmar language. May.
www.syger.com. http://www.syger.com/jsc/docs/stopwords/english.htm.
Ni Ko Ye. 1997. Orchestra. The Two Cats, June.
MLC. 2002. Myanmar-English Dictionary. Department of the Myanmar Language Commission, Ministry of Education, Union of Myanmar.
MLC. 2003. Myanmar Orthography. Department of the Myanmar Language Commission,Ministry of Educa- tion, Union of Myanmar, June.
MLC. 2005. Myanmar Grammer. Department of the Myanmar Language Commission, Ministry of Educa- tion,Union of Myanmar, June.
Kavi Narayana Murthy. 2006. Natural Language Pro- cessing - an Information Access Perspective. Ess Ess Publications, New Delhi, India.
Youngja Park. 2002. Identification of probable real words: an entropy-based approach. In ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 1-8, Morristown, NJ, USA. Association for Computational Linguistics.
U Soe Saya. 2000. The Khit Thit English-English- Myanmar Dictionary with Pronunciation. Yangon, Myanmar, Apr.
2000. Student’s english-english/myanmar dictionary. Ministry of Commerce and Myanmar Inforithm Ltd, Union of Myanmar, CD version, Version 1, April.
The Link Structure of Language Communities and its Implication for Language-specific Crawling
Rizza Camus Caminero Yoshiki Mikami Language Observatory Language Observatory Nagaoka University of Technology Nagaoka University of Technology Nagaoka, Niigata, Japan Nagaoka, Niigata, Japan [email protected] [email protected]
Abstract

Since its inception, the World Wide Web (WWW) has grown exponentially to shelter billions of monolingual and multilingual web pages that can be navigated through hyperlinks. Its structural properties provide useful information on the socio-linguistic properties of the web. In this study, about 26 million web pages under the South East Asian country code top-level domains (ccTLDs) are analyzed, several language communities are identified, and the graph structure of these communities is analyzed. The distances between language communities are calculated with a distance metric based on the number of outgoing links between web pages. Intermediary languages are identified by graph analysis. By creating a language subgraph, the size and diameter of its strongly-connected components are derived, as these values are useful parameters for language-specific crawling. Performing a link structure analysis of web pages can be a useful tool for socio-linguistic and technical research purposes.

1 Introduction

The World Wide Web contains interlinked documents, called web pages, navigable by hyperlinks. Since its creation, it has grown to contain billions of web pages and several billion hyperlinks. These web pages are created by millions of people from all parts of the world. Each web page contains a large amount of information that can be shared and disseminated to people with Internet access. Authors and creators of web pages come from different backgrounds, different cultures, and different languages. Thus, a web page is a resource of multilingual content, and a fertile source for socio-linguistic analysis.

1.1 Web Crawling

While search engines are the most important means of accessing the web for most users, automated systems for retrieving information from the web have also been developed. These systems are called web crawlers: a software agent is given a list of pages to visit, and as the crawler visits these pages, it follows their outgoing links and adds them to the list of pages to visit. Each page is visited recursively according to some set of policies (e.g. the type of pages to retrieve, direction, the maximum depth from the URL (uniform resource locator), etc.). The result of such a crawl is a vast amount of data, most of which may be irrelevant to certain individuals. Thus, a focused-crawling approach was implemented by some systems to limit the search to only a subset of the web.

Focused crawlers rely on classifiers to work effectively. Language-specific crawlers, for example, need a very good language identification module that properly identifies the language of the web pages. General crawlers can be extended to include focused-crawling capabilities by incorporating the classifiers. Another important requirement for efficiently crawling the desired domain is the list of initial web pages, called seed URLs. Each of these URLs will be enqueued into a list. The agent visits each URL in the list.
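The visiting policy just described can be sketched as a small breadth-first crawler; `get_links` is a hypothetical stand-in for fetching a page and extracting its outgoing links, and the toy link graph below is illustrative only:

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=3):
    """Visit each seed URL, then recursively follow outgoing links,
    skipping already-visited pages and stopping at max_depth."""
    visited, order = set(), []
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):          # outgoing links of this page
            if link not in visited:
                queue.append((link, depth + 1))
    return order

# Toy link graph standing in for the web: seed "a" already reaches
# "b", so also listing "b" as a seed would be redundant.
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
pages = crawl(["a"], lambda url: links.get(url, []), max_depth=2)
```

Because every page reachable from one seed within the depth limit is visited exactly once, listing additional seeds that are already reachable from it only adds redundant entries, which is the observation the next paragraph develops.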
Since the crawler recursively visits the outgoing links of each URL, it is possible that one seed URL is an outgoing link of another seed URL, or an outgoing link of an outgoing link of another seed URL. Listing such URLs as seed URLs will just waste the crawler's time in visiting them when they have already been visited. If several URLs can be reached from just one URL, the seed URL list size will decrease and the crawler will be more efficient. However, the maximum distance between these web pages must also be considered, since the crawler can have a policy to stop the crawling after reaching a certain depth.

1.2 The Web as Graph

A graph consists of a set of nodes, denoted by V, and a set of edges, denoted by E. Each edge is a pair of nodes (u, v) representing a connection between u and v. A path between two nodes is a sequence of edges that is passed through from one node to reach the other node.

In a directed graph, each edge is an ordered pair of nodes. A path from u to v does not imply a path from v to u. The distance between any two nodes is the number of edges in a shortest path connecting them. The diameter of the graph is the maximum distance between any two nodes.

In an undirected graph, each edge is an unordered pair of nodes. There is an edge between u and v if there is a link between u and v, regardless of which node is the source of the link.

A strongly-connected component (SCC) of a directed graph is a set of nodes such that for any pair of nodes u and v in the set, there is a path from u to v. The strongly-connected components of a graph consist of disjoint sets of nodes. A subgraph is a graph whose nodes and edges are subsets of another graph.

With its interlinked nature, the web can be represented as a graph, with the web pages as the nodes and the hyperlinks between the pages as the edges.

1.3 Languages

Languages are expressions of individuals and groups of individuals, essential to all forms of communication. Language is a fundamental medium of expressing one's self, whether in spoken or written form. Ethnologue (2005) lists 6,912 known living languages in the world. Only a small portion of these languages can be found on the web today.

A language community on the web is the group of web pages written in a language. The major language communities discovered in each country indicate the dominant language of the country's web space. How one language community is related to another can be shown by analyzing the hyperlinks between them. Thus, a language graph can be created with the language communities as nodes and the links between the language communities as edges.

2 Previous Studies

One of the earliest web surveys in Asia (Ciolek, 1998) presented statistical data on the Asian web space gathered with the AltaVista WWW search engine. In 2001, he wrote a paper presenting the trends in the volume of hyperlinks connecting websites in 10 East Asian countries.

Several studies have also been done regarding the representation of the web as a graph. Kumar et al. (2000) showed that a graph can be induced by the hyperlinks between pages. Measures of the connected component sizes and diameter were presented to show the high-level structure of the web. Broder et al. (2000) did experiments on the web on a larger scale and showed the web's macroscopic structure consisting of the SCC, IN, OUT, and TENDRILS. Balakrishnan and Deo (2006) observed that the number of edges grows superlinearly with the number of nodes, showing the degree distributions and diameter. Petricek et al. (2006) used web graph structural metrics to measure properties such as the distance between two random pages and the interconnectedness of e-government websites. Bharat et al. (2001) studied the macro-structure of the web, showing the linkage between web sites by creating the "hostgraph", with the nodes representing the hosts and the edges the counts of hyperlinks between pages on the corresponding hosts.

Chakrabarti et al. (1999) proposed a new approach to topic-specific web resource discovery by creating a focused crawler that selectively retrieves web pages relevant to a pre-defined set of topics. Stamatakis et al. (2003) created CROSSMARC, a focused web crawler that collects domain-specific web pages. Diligenti et al. (2000) presented a focused crawling algorithm that builds a model for the context within which pages relevant to a topic occur on the web.
Pingali et al. (2006) created an Indian search engine with a language identification module that returns a language only if the number of words in a web page is above a given threshold value. The web pages were first transliterated into UTF-8 encoding. Tamura et al. (2007) presented a simulation study of language-specific crawling and proposed a method for selectively collecting web pages written in a specific language by performing a linguistic graph analysis of real web data and then transforming the results into variations of link selection strategies.

Despite several studies on the web graph and on language-specific crawling, no study has been done showing the "language graph". Herring et al. (2007) showed a study on language networks of a selected web community, LiveJournal, but not on the web as a whole.

3 Scope and Objectives of the Study

This research was conducted on the 10 ccTLDs of the South East Asian countries. This paper, however, will only show the results for the Indonesian domain (.id).

This research aims to show the socio-linguistic properties of the language communities in each country at the macroscopic level. The web page distribution for each language community in a given ccTLD and its most frequently linked-to languages are shown. The distance is also computed and the language graph is illustrated.

This research also aims to show the graph properties of some Filipino language communities and their implication for crawling. Graph properties like the SCC size and the diameter will be presented to show these characteristics in a subset of the web.

Finally, this research demonstrates the usefulness of graph analysis approaches.

4 Methodology

This study was conducted by performing a series of steps from the collection of data to the presentation of the results through images.

4.1 UbiCrawler

UbiCrawler (Boldi et al., 2004) was used to download the web pages under the Asian ccTLDs. These pages were downloaded primarily for the purpose of assessing the usage level of each language in cyberspace, one of the objectives of the Language Observatory Project (LOP)1. The crawl was started on July 5, 2006, running for 14 days. 107,168,733 web pages were collected from 43 ccTLDs in Asia. Each page record contains several pieces of information, such as the character set and the outgoing links. For this study, the URL of a web page and the URLs of its outgoing links were used. Although there are many web pages of Asia that can be found in the generic domains, they were not included in this survey to limit the volume of the crawl data.

4.2 Language Identification

A language identification module (LIM) developed for LOP based on the n-gram method (Suzuki et al., 2002) was used to identify which language a page is written in. This method first creates byte sequences for each training text of a language. It then checks which byte sequences of a web page match the byte sequences of the training texts. The language having the highest matching rate is considered the language of the web page. The language identification module used the parallel corpus of the Universal Declaration of Human Rights (UDHR), translated into several languages.

After crawling, LIM was executed to identify the language of each downloaded web page. The identification result was stored in a LIM result file that contains the URL, the language, and the matching rate, among others. In this study, the issues regarding the accuracy of LIM will not be discussed.

4.3 Web Page Analysis

For this study, the web pages of the 10 South East Asian ccTLDs were selected for analysis. There were 26,196,823 web pages downloaded under these ccTLDs. The web pages for each country were grouped by language. The list of languages was narrowed down to 20 based on the number of pages, arranged from highest to lowest.

The link structure can be analyzed by traversing the outgoing links of each web page. For each web page, its outgoing links are retrieved. For each outgoing link, the LIM result file is checked for its language. The number of outgoing links in each language is counted. If the URL of an outgoing link is not in the file, it was not downloaded; therefore, the language of that outgoing link is unidentified.
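A minimal sketch of this pipeline, assuming hypothetical UDHR-like training snippets for the n-gram profiles (Section 4.2) and a toy link table in place of the crawled pages and the LIM result file (Section 4.3):

```python
from collections import Counter

def byte_ngrams(text: bytes, n: int = 3):
    """Byte n-grams of a text; the LIM operates on byte sequences."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def identify(page: bytes, profiles):
    """Pick the language whose training profile shares the largest
    fraction of the page's byte n-grams (the matching rate)."""
    grams = byte_ngrams(page)
    rates = {lang: len(grams & prof) / len(grams)
             for lang, prof in profiles.items()}
    best = max(rates, key=rates.get)
    return best, rates[best]

def count_links_by_language(pages, lim_result):
    """pages: {url: [outgoing urls]}; lim_result: {url: language}.
    Per-language counts of outgoing links; links whose URL is absent
    from the LIM result file are counted as unidentified."""
    counts = Counter()
    for url, links in pages.items():
        for link in links:
            counts[lim_result.get(link, "unidentified")] += 1
    return counts

# Hypothetical training snippets, standing in for the UDHR corpus.
profiles = {"eng": byte_ngrams(b"all human beings are born free and equal"),
            "ind": byte_ngrams(b"semua orang dilahirkan merdeka dan sama")}
lang, rate = identify(b"human beings are born free", profiles)

# "c.com" has no LIM entry, so its language stays unidentified.
counts = count_links_by_language(
    {"a.id": ["b.id", "c.com"], "b.id": ["a.id"]},
    {"a.id": "ind", "b.id": "jav"})
```

This is only an illustration of the matching-rate idea; the actual LIM of Suzuki et al. (2002) is more elaborate than a set-overlap score.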
1 http://www.language-observatory.org
This is usually the case for outgoing links under the generic TLDs (e.g., .com, .org, .gov, etc.) and non-Asian ccTLDs.

4.4 Language Graph

There is a link between two languages if there is at least one outgoing link from a web page in one language to a page in the other language. The language graph is created through a contraction procedure, where all edges linking pages of the same language are contracted.

4.5 Language Adjacency Matrix

Based on the number of web pages in a language and the number of outgoing links from one language to another, the language adjacency table N for each country is created. The row and column headers are the same: the top 20 languages based on the number of web pages. The value Nij is the number of outgoing links from language i to language j.

The language adjacency matrix P contains the ratio of the number of outgoing links to the total number of outgoing links found in the language adjacency table. Each cell value Pij is the probability that a web page in language i has an outgoing link to language j:

  Pij = Nij / Σk Nik

A link from language i to language j is not necessarily accompanied by a link from language j to i. Even if there is a link, the numbers of outgoing links are not equal. To show the relationship between two languages based on the link structure, the language distance is computed.

4.6 Distance between Languages

The distance between two languages measures their level of connectedness. It is the relationship between the number of outgoing links from language i to language j and vice versa. The distance is computed from the ratio of the number of outgoing links between the two languages to the total number of outgoing links of the two languages. The distance between language i and language j, Dij, is

  Dij = 1 / sqrt(Rij + α) − β

where Rij is the language link ratio, defined as

  Rij = (Nij + Nji) / (Σk Nik + Σk Njk)   for i ≠ j
  Rij = 1                                 for i = j

α is an adjusting parameter introduced to avoid division by zero, which may happen when Rij = 0, i.e. when there are no links between the two languages. We set α = 0.0001; thus, the maximum distance between languages becomes 99. β is another adjusting parameter that makes Dij a distance metric, and we set β = 1. The assumption behind this definition is based on a commonly-observed rule in our world: the interaction between two objects is widely observed to be proportional to the inverse square of the distance between them. The number of web links between two language communities is considered a kind of interaction. Languages with no links between them have a distance of 99, and the distance of a language to itself is 0.

Based on this distance metric, the macroscopic language graph is created. A distance limit of 15 is used to clearly show which languages are closely related by their link structure.

4.7 Intermediary Languages

Considering the direction of the outgoing links, the probability that language i will link to language j directly may be smaller than the probability that language i will link to language j by passing through an intermediary language k, such that PikPkj > Pij. The intermediary language is identified, as this would mean that there are better ways to reach one language from another.

4.8 Graph Analysis using JGraphT

JGraphT is a free Java graph library that provides graph objects and algorithms. The library provides classes that calculate and return the strongly-connected component subgraphs. To compute the distance between nodes, several graph searching algorithms are available, one of which is Dijkstra's algorithm, which computes the shortest path. A utility to export the graph into a format readable by most graph visualization tools is also available. A graph file, written in the DOT language (a plain text graph description language), was created, containing the nodes and edges of language pages. From this, the strongly-connected components and their properties (i.e. size, diameter) were determined.
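As an illustration, the definitions of Sections 4.5 to 4.7 and the DOT export of Section 4.8 can be sketched in Python; the link counts below are hypothetical, not the measured Indonesian figures:

```python
import math

# Hypothetical outgoing-link counts N[i][j] between three language
# communities; the real table is built from the LIM-labelled crawl.
N = {"jav": {"jav": 900, "ind": 80, "tat": 20},
     "ind": {"jav": 100, "ind": 700, "tat": 0},
     "tat": {"jav": 50,  "ind": 0,   "tat": 30}}
ALPHA, BETA = 0.0001, 1  # adjusting parameters from Section 4.6

def P(i, j):
    """P_ij: probability that a page in language i links to language j."""
    return N[i][j] / sum(N[i].values())

def R(i, j):
    """Language link ratio R_ij; 1 for a language to itself."""
    if i == j:
        return 1.0
    return (N[i][j] + N[j][i]) / (sum(N[i].values()) + sum(N[j].values()))

def D(i, j):
    """Distance D_ij = 1/sqrt(R_ij + alpha) - beta: 99 for unlinked
    pairs, about 0 for a language to itself."""
    return 1 / math.sqrt(R(i, j) + ALPHA) - BETA

def intermediaries(i, j):
    """Languages k through which i reaches j more easily: P_ik*P_kj > P_ij."""
    return [k for k in N if k not in (i, j) and P(i, k) * P(k, j) > P(i, j)]

def to_dot(limit=15):
    """Language graph in the DOT language, keeping only edges below
    the distance limit of 15 used in Section 4.6."""
    lines = ["graph languages {"]
    for i in N:
        for j in N:
            if i < j and D(i, j) < limit:     # undirected: one edge per pair
                lines.append('  %s -- %s [label="%.1f"];' % (i, j, D(i, j)))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot()
```

With these toy counts, the unlinked pair ind/tat gets the maximum distance of 99 and is dropped from the DOT output, while Javanese emerges as an intermediary between them, mirroring the behaviour described in the text.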
4.9 GraphViz: Visualization Tool

GraphViz is open-source graph visualization software that takes descriptions of graphs in a simple text language and makes diagrams in several formats, including images. The neato layout, which makes "spring model" layouts, was used to visualize the distance between languages. However, the calculated distance cannot be drawn exactly, and this visualization is only two-dimensional, so the images are distorted and illustrate only an approximation of the exact distance.

5 Results

This section shows some results on the Indonesian domain.

5.1 Link Structure

Indonesia is a country with one of the greatest language diversities in the world. According to Ethnologue2, 742 languages are spoken in the country. But the LIM results show that only five of these languages are listed among the top 10 languages in the country, i.e., Javanese, Indonesian, Malay, Sundanese, and Madurese.

No.  Language          # of Pages          # of Outgoing Links
1    Javanese (jav)    797,300  (28.01%)   33,411,032  (27.20%)
2    English (eng)     743,457  (26.11%)   16,645,014  (13.55%)
3    Indonesian (ind)  516,528  (18.14%)   20,783,793  (16.92%)
4    Thai (tha)        218,453  (7.67%)     8,952,101  (7.29%)
5    Malay (mly)       197,535  (6.94%)     4,990,402  (4.06%)
6    Sundanese (sun)    98,835  (3.47%)     5,349,194  (4.35%)
7    Luxemburg3 (ltz)   43,376  (1.52%)     2,307,602  (1.88%)
8    Occitan4 (lnc)     27,663  (0.97%)       351,318  (0.29%)
9    Madurese (mad)     22,121  (0.78%)       777,903  (0.63%)
10   Tatar (tat)        20,709  (0.73%)     3,334,651  (2.71%)
     Others            160,917  (5.65%)    25,930,885  (21.11%)
     Total           2,846,894            122,833,898

Table 1. LIM result for Indonesian Domain

Javanese is the most popular language in the web space of Indonesia, constituting 28% of the total number of web pages. It is followed by English and Indonesian. Although Indonesian is an official language of the country, it is ranked third. English, a major business language of Indonesia, has the second largest number of web pages in the domain. But Indonesian occupies the second rank in the number of outgoing links.

No.  Language          Languages Linked to (# of outgoing links)
1    Javanese (jav)    Javanese (22,641,844), Indonesian (3,761,205)
2    English (eng)     English (11,036,726), Javanese (1,443,054)
3    Indonesian (ind)  Indonesian (12,530,636), Javanese (4,168,734)
4    Thai (tha)        Thai (4,367,895), English (1,775,132)
5    Malay (mly)       Malay (1,944,430), Indonesian (1,516,778)
6    Sundanese (sun)   Javanese (2,004,316), Sundanese (1,641,623)
7    Luxemburg (ltz)   Luxemburg (1,128,342), Javanese (393,524)
8    Occitan (lnc)     Occitan (119,734), English (83,380)
9    Madurese (mad)    Madurese (387,585), Javanese (93,837)
10   Tatar (tat)       Javanese (1,802,601), Thai (397,988)

Table 2. Language Link for Indonesian Domain

The table above shows only the top 10 languages. Among these, 8 languages are most frequently linked to pages in the same language. The two other languages are Sundanese and Tatar, both mostly linked to Javanese.

The language graph (Figure 1) shows the languages as the nodes and the edges representing the distance between languages. In this figure, the six languages of Indonesia are found to be closely connected to each other.

2 Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Languages of the World, Fifteenth edition. Dallas, Tex.: SIL International.
3 Luxembourgish
4 Occitan Languedocien
Figure 1. Language Graph of Indonesia

5.2 Intermediary Languages

For languages with only a few outgoing links between them, there may exist a language that acts as an intermediary, making access between the two languages more convenient if passed through.

No.  Intermediary Language  Frequency  Percentage
1    English                89         17.73%
2    Javanese               43          8.57%
3    Tatar                  42          8.37%
4    Thai                   40          7.97%
5    Madurese               38          7.57%
6    Indonesian             34          6.77%
7    Latin                  34          6.77%
8    Sundanese              29          5.78%
9    Malay                  28          5.58%
10   Luxemburg              24          4.78%
     Others (top 11-20)    122         24.30%
     Total                 502        100.00%

Table 3. Intermediary Languages of Indonesia

The above table shows the number of language pairs having the given language as their intermediary language. English has the highest frequency as an intermediary language; however, it is likely that several pages were misidentified by LIM. The second, Javanese, is not surprising, since it is a major language of Indonesia. The table below shows selected language pairs for which one of the intermediary languages is Javanese.

Language i  Language j  Pij      Pik*Pkj (k = Javanese)
Tatar       Indonesian  0.01753  0.06085
Tatar       Luxemburg   0.00210  0.01193
Samoan      Balinese    0.00000  0.00052

Table 4. Selected language pairs of Indonesia whose intermediary language is Javanese

5.3 Graph Properties

This section discusses the size distribution of the SCCs and the diameter of the Filipino language community in Indonesia.

SCC size

The SCCs of a graph are those sets of nodes such that for every node, there is a path to all the other nodes. The size of each SCC refers to the number of nodes it contains. The distribution of the sizes of SCCs gives a good understanding of the graph structure of the web, and has important implications for crawling. If most components have large sizes, only a few nodes are needed as seed URLs (a list of starting pages for a crawler) to be able to reach all the other nodes. If all nodes are members of a single SCC, one URL is enough to crawl all pages.

SCC size    1  2   4  16  19  20  26  45  Total
# of SCCs   9  1   3   1   1   1   1   1     18
# of nodes  9  2  12  16  19  20  26  45    149

Table 5. SCC size distribution of the Filipino language community in Indonesia
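The SCC decomposition and per-component diameter used here can be sketched in Python (Kosaraju's algorithm; the authors used JGraphT, and the toy edge list below is hypothetical):

```python
from collections import defaultdict, deque

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: DFS finish order on the graph, then
    collection passes on the transpose graph in reverse finish order."""
    graph, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
    seen, order = set(), []
    for start in nodes:                      # first pass: finish order
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    comps, assigned = [], set()
    for u in reversed(order):                # second pass: transpose graph
        if u in assigned:
            continue
        comp, todo = set(), [u]
        assigned.add(u)
        while todo:
            node = todo.pop()
            comp.add(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        comps.append(comp)
    return comps

def diameter(comp, edges):
    """Longest shortest path between any two nodes of one SCC
    (0 for a single-node component)."""
    adj = defaultdict(list)
    for u, v in edges:
        if u in comp and v in comp:
            adj[u].append(v)
    best = 0
    for src in comp:                         # BFS from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best

# Hypothetical page graph: a 3-cycle plus one node outside it.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (4, 1)]
comps = strongly_connected_components(nodes, edges)
sizes = sorted(len(c) for c in comps)        # SCC size distribution
big = max(comps, key=len)
```

One seed URL per large SCC, crawled to a depth equal to the largest diameter, then suffices to reach every page of those components, which is the crawling implication drawn in Section 6.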
SCC diameter

The maximum distance between any two nodes is the diameter. For each component size, the diameter is calculated and plotted in the chart; the corresponding SCC graph is also shown.

[Figure 2: chart of diameter (0 to 4) against SCC size (0 to 50)]
Figure 2. Diameter distribution of the Filipino language community in Indonesia

For the Filipino language subgraph of Indonesia, the component with the largest node size also has the largest diameter. However, the largest diameter is only 3, which is a very small number. Most of the components have a diameter of 0.

6 Implications

For the Filipino language community, many SCCs can be found in the Philippines, since Filipino is one of its national languages. However, for some countries, there are not many SCCs. For example, Indonesia only has 18 SCCs, half of which consist of only one node. However, the largest component size is 45, and there are 4 more large components. By picking out just one node from each SCC with a large size and using it as a seed URL, many web pages can already be downloaded. Adding the depth parameter of 3, which is the largest diameter, these web pages can be downloaded within a short period of time.

The choice of seed URLs and the crawl depth are useful parameters for crawling. The analysis is done for each language community to obtain these parameters for language-specific crawling purposes. These parameters are different for each language community. This paper shows only the case of the Filipino language community of Indonesia as a sample illustration of the diameter metric.

7 Conclusion

The vastness and multilingual content of the web makes it a rich source of culturally-diversified information. Since web pages are connected by hyperlinks, information can be readily accessed by jumping from one page to another via the hyperlinks.

For each country domain, the web pages written in the same language form a language community. The link structure between language communities shows how connected one language community is with another. It can be assumed that close links between two language communities on the web imply the existence of multilingual speakers of the two languages; otherwise the linked pages would not be visited. In this context, the language graph analysis demonstrated in this study gives an effective tool to understand the linguistic scene of a country. If the same analysis is performed on secondary-level domain data, further insight into the socio-linguistic status of each language can be drawn. Secondary domains correspond to different social areas of language activity, such as "ac" or "edu" for the academic and education arena, "go" or "gov" for the government or public arena, and "co" or "com" for the commercial, business and occupational arena. Although this study does not extend its scope to secondary-level domain analysis, the effectiveness of the approach was demonstrated.

Another implication drawn from this study is that language graph analysis can identify intermediary languages in multilingual communities. In the real world, some languages act as a medium of communication among different language speakers. In most cases, such lingua francas are international languages such as English, French, Arabic, etc. But it is difficult to identify in detail which language is acting as such. On the web, however, the link structure among languages, captured in the language graph, can give us a clue. As shown in this paper, there are a number of languages acting as intermediaries between two languages having only a few hyperlinks between them. Although the results in this category are doubtful because of language misidentification, some cases show the expected result.

The second objective of the study is to give a microscopic-level structure of the web communities for much more practical and technical reasons, such as how to design a more effective crawling strategy and how to prepare starting URLs with minimal effort. The key issue in this context is to reveal the connectedness of the web.
To show the connectedness of language communities, several graph theory metrics, namely the sizes and numbers of strongly-connected components and their diameters, are calculated, and visual presentations of language communities are also given. This information can aid in defining parameters used for crawling, particularly language-specific crawling.

In summary, the link structure analysis of language graphs can be a useful tool for various spectrums of socio-linguistic and technical research purposes.

8 Limitations and Future Work

The results of this research are highly dependent on the language identification module (LIM). With a more improved LIM, more accurate results can be presented. Currently, there is an ongoing experiment that uses a new LIM.

This analysis will also be done on the secondary-level domains to show the language distribution for different social areas.

Future work also includes the creation of a language-specific crawler that will incorporate the results derived from the analysis of the SCC size and diameter of the language subgraphs.

Acknowledgment

The study was made possible by the financial support of the Japan Science and Technology Agency (JST) under the RISTEX program and the Asian Language Resource Network Project. We also thank UNESCO for giving official support to the project since its inception.

References

Balakrishnan, Hemant and Narsingh Deo. 2006. Evolution in Web graphs. In Proceedings of the 37th Southeastern International Conference on Combinatorics, Graph Theory, and Computing, Boca Raton, FL.

Bharat, Krishna, Bay-Wei Chang, Monika Henzinger, and Matthias Ruhl. 2001. Who links to whom: mining linkage between Web sites. In Proceedings of the 2001 IEEE International Conference on Data Mining, pages 51-58, San Jose, California.

Boldi, Paolo, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice &

Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the web. In Proceedings of the 9th World Wide Web Conference, pages 309-320, Amsterdam, Netherlands.

Chakrabarti, Soumen, Martin van den Berg, and Byron Dom. 1999. Focused Crawling: a new approach to topic-specific Web resource discovery. In Proceedings of the 8th International World Wide Web Conference, pages 1623-1640, Toronto, Canada.

Herring, Susan C., John C. Paolillo, Irene Ramos-Vielba, Inna Kouper, Elijah Wright, Sharon Stoerger, Lois Ann Scheidt, and Benjamin Clark. 2007. Language Networks on LiveJournal. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, Hawaii.

Kumar, Ravi, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew S. Tomkins, and Eli Upfal. 2000. The Web as a graph. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1-10, Dallas, Texas, United States.

Petricek, Vaclav, Tobias Escher, Ingemar J. Cox, and Helen Margetts. 2006. The web structure of e-government: developing a methodology for quantitative evaluation. In Proceedings of the 15th International World Wide Web Conference, pages 669-678, Edinburgh, Scotland.

Pingali, Prasad, Jagadeesh Jagarlamudi, and Vasudeva Varma. 2006. WebKhoj: Indian language IR from Multiple Character Encodings. In Proceedings of the 15th International World Wide Web Conference, pages 801-809, Edinburgh, Scotland.

Stamatakis, Konstantinos, Vangelis Karkaletsis, Georgios Paliouras, James Horlock, Claire Grover, James R. Curran, and Shipra Dingare. 2003. Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler. In Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), pages 75-78, Edinburgh, UK.

Suzuki, Izumi, Yoshiki Mikami, Ario Ohsato, and Yoshihide Chubachi. 2002. A language and character set determination method based on N-gram statistics. ACM Transactions on Asian Language Information Processing, 1(3):269-278.

Tamura, Takayuki, Kulwadee Somboonviwat, and Masaru Kitsuregawa. 2007. A method for language-specific Web crawling and its evaluation. Systems