A State of the Art of Thai Language Resources and Thai Language Behavior Analysis and Modeling

Asanee Kawtrakul, Mukda Suktarachan, Patcharee Varasai, Hutchatai Chanlekha
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand 10900.
E-mail: ak, mukda, pom, [email protected]

Abstract

As electronic communication increases, Natural Language Processing should be considered in the broader context of multi-language processing systems. Observing language behavior provides a good basis for designing computational language models and for creating cost-effective solutions to practical problems. Good language modeling, in turn, requires language resources for the language behavior analysis. This paper describes what we have and what we have done, motivated by the desire to build a bridge between languages and to share and make maximal use of existing lexica, corpora and tools. It focuses on three main topics: the state of the art of Thai language resources, Thai language behaviors, and their computational models.

1. Introduction

As electronic communication increases, Natural Language Processing should be considered in the broader context of multi-language processing systems. An important phase in the system development process is requirement engineering, which can be defined as the process of analyzing the problems in a certain language. An essential part of the requirement-engineering phase is computational language modeling, which is an abstract representation of the behavior of the language. To obtain a good language model for creating cost-effective solutions to practical problems, language resources are necessary for the language behavior analysis.

This paper describes what we have and what we have done, motivated by the desire to build a bridge between languages and to share and make maximal use of existing lexica, corpora and tools. It focuses on three main topics:
• The state of the art of Thai language resources, giving an overview of what we have in corpora, lexicons and tools for corpus processing and analysis.
• Thai language behaviors (at the word and phrase level only), analyzed from a variety of corpora: lexicon growth, new word formation and phrase/sentence construction.
• The computational models provided for those behaviors: unknown word extraction and named entity identification, new word generation, and noun phrase recognition.

The remainder of the paper is organized as follows. Section 2 gives a gateway to Thai language resources. Thai language behaviors are discussed in Section 3. Section 4 then provides Thai language computational modeling as a basis for creating cost-effective solutions to those practical problems.

2. A State of the Art of Thai Language Resources

This section surveys the state of the art of Thai language resources, consisting of corpora, lexicons and tools. We present only the resources that are open for public access.

2.1 Corpus

Existing Thai corpora are of two types, speech and text, and have been developed by many Thai universities. The Thai Language Audio Resource Center of Thammasart University (ThaiARC) (http://thaiarc.ac.th) developed a speech corpus that aims to provide digitized audio information for dissemination via the Internet. The project pioneers the production and collection of various types of audio information and various styles of Thai speech, such as royal speeches, academic lectures and oral literature.

For text corpora, collections were originally intended for use only inside the laboratory. Until 1996, the National Electronics and Computer Technology Center (NECTEC) and the Communications Research Laboratory (CRL) had a collaboration project with the purpose of preparing a Thai language corpus from technical proceedings for language study and application research. It was named the ORCHID corpus (NECTEC, 1997). The NAiST Corpus began in 1996 with the primary aim of collecting documents from magazines for training and testing programs in Written Production Assistance (Asanee, 1995). The existing corpora are summarized in Table 1.

Table 1: The List of Thai Corpora
Organization      Corpus         Type             Amount            Status
NECTEC            Orchid Corpus  POS-tagged text  2,560,000 words   Online
Kasetsart Univ.   NAiST Corpus   Text             60,511,974 words  Online
Thammasart Univ.  ThaiARC        Digitized audio  4000 words++      Online

2.2 Lexicon

A number of Thai lexicons have been developed, as shown in Table 2.

Table 2: The List of Thai Dictionaries
Dictionary                  Type   Size (word)                      Status   Web site
Royal Institute Dictionary  Mono   33,582                           Online   http://rirs3.royin.go.th/riThdict/lookup.html
Lexitron                    Bi     50,000                           Online   http://www.links.nectec.or.th/lexit/lex_t.html
NaiST Lexibase              Mono   15,000                           Online   http://beethoven.cpe.ku.ac.th/
So Sethaputra Dictionary    Bi     48,000 Eng words; 38,000 words   Online   http://www.thaisoftware.co.th/
Narin's Thailand homepage   Bi     -                                Online   http://www.wiwi.uni-frankfurt.de/~sascha/thailand/dictionary/dictionary_index.html
Saikam online               Bi     133,524                          Online   http://saikam.nii.ac.jp/
Lao-Thai-English Dic.       Multi  5,000                            Offline  -

Of the dictionaries in Table 2, only Lexitron (from NECTEC) and NAiST Lexibase (from Kasetsart University) have been applied to NLP. NAiST Lexibase is built on a relational model so that it can be managed and maintained easily in the future. It contains a list of 15,000 words with their syntactic information and their semantic concept information in the form of concept codes.

2.3 Corpus and Language Analysis Tools

A corpus is not only a resource of linguistic knowledge; it is also used for training, improving and evaluating NLP systems. Tools for corpus manipulation and knowledge acquisition therefore become necessary.

NAiST Lab. has developed a toolkit for sharing via the Internet. It is designed for collecting, annotating, maintaining and analyzing corpora. It is also designed as an engine that end users can run on their own data (see the service at http://naist.cpe.ku.ac.th).

3. Thai Language Behavior Analysis

To obtain a good language model for creating cost-effective solutions to practical problems in application development, language behavior must be observed. This section analyzes Thai language behavior based on the NAiST corpus, covering lexicon growth, Thai word formation and phrase construction.

3.1 Lexicon Growth

Lexicon growth is studied by using a word list extraction tool to extract word lists from a large-scale corpus and mapping them to the Royal Institute Dictionary (RID). Two types of lexicon are noticeable: common words and unknown words. Common words are words in the RID that occur in almost every Thai document and are used in daily life. They are primitive words, not proper names or colloquial words. Unknown or new words, such as proper names, colloquial words, abbreviations and foreign words, occur frequently in real documents.

Lexicon growth is observed at corpus sizes of 400,000, 2,154,700 and 60,511,974 words, drawn from newspaper, magazine and agricultural text. Common words increased from 111,954 to 839,522 and 49,136,408 words with corpus size, while unknown words increased from 288,046 to 1,315,178 and 11,375,566 words respectively, as shown in Table 3.

Table 3: Lexicon growth
Size of corpus (words)  Common words  Unknown words
400,000                 111,954       288,046
2,154,700               839,522       1,315,178
60,511,974              49,136,408    11,375,566

The 60,511,974-word corpus in Table 3 is composed of 35,127,012 words from newspapers, 18,359,724 words from magazines and 7,025,238 words from agricultural text. The unknown words in each genre are shown in Table 4.

Table 4: The Categories of Unknown Words according to the various corpus genres
Types of unknown word  Newspaper (words)  Magazine (words)  Agricultural Text (words)
Proper name            4,809,160          1,272,747         1,170,076
Spoken words           58,335             8,787             0
Abbreviation           70,109             43,056            0
Foreign words          304,519            239,107           3,399,670
Total                  5,242,123          1,563,697         4,569,746

According to Tables 3 and 4, not only unknown words but also common words increase, and the main categories of increasing unknown words are proper names and foreign words.

3.2.1 Basic Information about Thai

Thai words are multi-syllabic, and stringing words together can form a new word. Since Thai has no inflection and no word delimiters, Thai morphological processing is mainly a matter of recognizing word boundaries, rather than recognizing a lexical form from a surface form as in English.

Let C be a sequence of characters:
C = c_1 c_2 c_3 … c_n, n ≥ 1
Let W be a sequence of words:
W = w_1 w_2 w_3 … w_m, m ≥ 1
where w_i = c_{1i} c_{2i} … c_{ri}, i ≥ 1, r ≥ 2

Since Thai sentences are formed as a sequence of words written as a stream of characters c_1 c_2 c_3 … c_n, mostly without explicit delimiters, the word boundary in the pattern "c_1 c_2 c_3 c_4 c_5" shown in Figure 1 could have two ambiguous forms. One is "c_1 c_2" and "c_3 c_4 c_5". The other is "c_1 c_2 c_3" and "c_4 c_5" (Kawtrakul, 1997).

Stream of characters: c_1 c_2 c_3 c_4 c_5
Grouping 1: c_1 c_2 | c_3 c_4 c_5
Grouping 2: c_1 c_2 c_3 | c_4 c_5
Figure 1: Word Boundary Ambiguity

As Figure 1 shows, if the characters are grouped differently, the meaning of the words changes as well. For example, "กอดอก" can be grouped as "กอด-อก" (fold one's arms across the chest) or "กอ-ดอก" (a clump of flowers). In our corpus, we found that a sentence of 45 characters has 30 possible word sequences.

3.2.2 New word construction

Almost all Thai new words are formed by
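The word-boundary ambiguity described in Section 3.2.1 (Figure 1) can be illustrated with a small sketch. It is not the segmentation method used by the authors; it is a plain exhaustive enumeration, assuming a hypothetical four-entry toy lexicon (`TOY_LEXICON`) built only from the "กอดอก" example in the text. The function name `segmentations` is likewise an illustrative choice.

```python
from typing import Iterator


def segmentations(text: str, lexicon: set[str]) -> Iterator[list[str]]:
    """Yield every way to split `text` into consecutive words from `lexicon`."""
    if not text:
        # An empty remainder means the whole string has been covered.
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this candidate word and segment what is left.
            for rest in segmentations(text[end:], lexicon):
                yield [head] + rest


# Hypothetical toy lexicon: only the four words from the paper's example.
TOY_LEXICON = {"กอด", "อก", "กอ", "ดอก"}

for words in segmentations("กอดอก", TOY_LEXICON):
    print("-".join(words))
# prints:
# กอ-ดอก
# กอด-อก
```

Because every dictionary match at every position opens a new branch, the number of candidate word sequences can grow quickly with sentence length, which is consistent with the 30 combinations reported for a 45-character sentence.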
