Optimizing N-Gram Order of an N-Gram Based Language Identification

The International Journal on Advances in ICT for Emerging Regions 2009 02 (02) : 21 - 28

Optimizing n-gram Order of an n-gram Based Language Identification Algorithm for 68 Written Languages

Chew Y. Choong, Yoshiki Mikami, C. A. Marasinghe and S. T. Nandasara

Abstract: Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mix n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on the n-gram frequency. Instead, the algorithm is based on a Boolean method to determine the output of matching target n-grams to training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. It is important to identify these three properties because a language can be written in different types of scripts and encoded with different types of character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mix n-gram model of bigram and trigram. The mix n-gram model consumed less disk space and computing time than a trigram model.

Index Terms: Boolean Method, Character Encoding Scheme, Digital Language Divide, Language Identification, Mix n-gram Model, n-gram, Natural Language Processing, Language, Script.

Manuscript received April 2, 2009. Accepted October 20th, 2009. This work was sponsored by the Japan Science and Technology Agency (JST) through the Language Observatory Project (LOP) and by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) through the Asian Language Resource Network project.
Chew Y. Choong is with the Nagaoka University of Technology, Nagaoka, Niigata, Japan (e-mail: [email protected]).
Yoshiki Mikami and C. A. Marasinghe are also with the Nagaoka University of Technology, Nagaoka, Niigata, Japan (e-mail: [email protected], [email protected]).
S. T. Nandasara is with the University of Colombo School of Computing, Colombo, Sri Lanka (e-mail: [email protected]).

I. INTRODUCTION

A. Digital Language Divide

Ethnologue [1] claims that there are 6,912 living languages in the world. However, ISO 639-2, the second part of the ISO 639 standard, only adopted 464 codes for the representation of the names of languages [2]. In 1999, worried that about half of the world's languages were at risk of dying out, the United Nations Educational, Scientific and Cultural Organization (UNESCO) decided to launch and observe an International Mother Language Day on 21 February every year to honor all mother languages and promote linguistic diversity [3]. The United Nations' effort in promoting mother languages was recognized by Guinness World Records when its publication of the Universal Declaration of Human Rights (UDHR) was declared the "Most Translated Document" in the world. The UDHR had been translated into 329 languages as of March 2009. On the Web, the Google search engine allows users to refine their search based on one of the 45 languages it supports. As of November 2008, Microsoft's dominant (63.67%) Windows XP operating system had been released in only 44 localized language versions. All these facts lead us to conclude that access to the digital world is greatly divided by language.

B. Measure Languages on the Internet

In order to bridge the digital language divide, UNESCO has been emphasizing the concept of multilingualism and participation for all languages on the Internet. UNESCO, at its 2005 World Summit for the Information Society in Tunis, published a report entitled "Measuring Linguistic Diversity on the Internet", comprising articles on issues of language diversity on the Internet. However, UNESCO admitted that the volume does not present any final answer on how to measure languages on the Internet [4].

The Language Observatory Project (LOP), launched in 2003, aims to provide means for assessing the usage level of each language on the Internet. More specifically, the project is expected to produce a periodic statistical profile of language, script and character encoding scheme (LSE) usage on the Internet [5]. The LOP uses a language identifier to automatically detect the LSE of a web page. The algorithm described in this paper is used to construct the language identifier for LOP.

C. Language Identification

Language identification generally refers to a process that attempts to classify a text as belonging to one of a pre-defined set of known languages. It is a vital technique for Natural Language Processing (NLP), especially in manipulating and classifying text according to language. Many researchers [6] [7] [8] [9] [10] [11] [12] have achieved excellent results on language identification based on a few selected European languages. However, the majority of African and Asian languages remain untested. This is reflected in the fact that search engines have very limited support in their language-specific search ability for most African and Asian languages.

In this paper, a language is identified by its LSE properties. All LSE properties are important for precise language categorization. For example, the script detection ability allows one to measure the number of web pages that are written in a particular script, for instance, the Sinhala script. Furthermore, LSE detection is critical for determining the correct tool for text processing at a later stage. Table I shows sample texts of the Uzbek language written in three different types of scripts and character encoding schemes. A machine translation tool must first determine the script and character encoding scheme of the source text in order to select the proper translator to translate the source text into another language.

TABLE I
EXAMPLE OF UZBEK LANGUAGE USING DIFFERENT SCRIPTS AND CHARACTER ENCODING SCHEMES

Language   Script     Character Encoding Scheme   Sample Text
Uzbek      Arabic     UTF-8                       غفقكگڭلمنء
Uzbek      Cyrillic   Cyrillic                    лмпрстўфх
Uzbek      Latin      ISO 8859-1                  abchdefgg

D. N-gram

An n-gram can be viewed as a sub-sequence of N items from a longer sequence. An item can be a letter, word, syllable or any logical data type that is defined by the application. Due to its simplicity of implementation and high accuracy in predicting the next possible sequence from a known sequence, the n-gram probability model is one of the most popular methods in statistical NLP. The principal idea of using n-grams for language identification is that every language contains its own unique n-grams and tends to use certain n-grams more frequently than others, hence providing a clue about the language.

An n-gram of order 1 (i.e. n=1) is referred to as a monogram, an n-gram of order 2 as a bigram, and an n-gram of order 3 as a trigram. The rest are generally referred to as "n-grams". Using "No-456" as an example, if we define the basic unit of the desired n-gram as a "character", the valid lists of character-level bigrams and trigrams (each separated by a space) are as below:

Bigram: No o- -4 45 56
Trigram: No- o-4 -45 456

Several researchers [6] [7] [8] [9] [10] reported that using a trigram model on selected European languages produced the best language identification result. However, many African and Asian languages are not based on the Latin alphabet that many European languages employ. Thus, this study evaluates the performance of n-gram orders (n = 1, 2, …, 6) and a special mix n-gram model for language identification on selected languages.

The rest of the paper is structured as follows. In the next section the authors briefly discuss related work. The n-gram based language identification algorithm is introduced in Section III. In Section IV, the authors describe the datasets and experiments. Experimental results are presented and discussed in Section V. Section VI concludes the paper and mentions future work.

II. RELATED WORK

The task of identifying the language of a text has been relatively well studied over the past century. A variety of approaches and methods, such as the Dictionary method, Closed-class-model [11], Bayesian models [7], SVM [12] and n-gram [6] [7] [8] [9] [10] [13] [14], have been used. Two n-gram based algorithms are selected for detailed description. The Cavnar and Trenkle algorithm deserves special attention as it explains in depth how n-grams can be used for language identification. The Suzuki algorithm, which is implemented in the Language Observatory Project, is a benchmark for our algorithm.

A. Cavnar and Trenkle Algorithm

In 1994, Cavnar and Trenkle reported a very high (92.9–99.8%) correct classification rate on Usenet newsgroup articles written in eight different languages using rank-order statistics on n-gram profiles [8]. They reported that their system was relatively insensitive to the length of the string to be classified. In their experiment, the shortest text they used for classifying was 300 bytes, while their training sets were on the order of 20 Kilobytes to 120 Kilobytes in length. They classified documents by calculating the distances of a test document's n-gram profile from all the training languages' n-gram profiles and then taking the language corresponding
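The character-level bigram and trigram lists given for "No-456" in the N-gram subsection of the Introduction can be reproduced with a short sliding-window routine. The following is a minimal Python sketch for illustration only: the helper name `char_ngrams` is hypothetical and is not taken from the paper, and the sketch covers only n-gram extraction, not the Boolean matching method or the mix n-gram model evaluated by the authors.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all contiguous character n-grams of `text`, in order.

    Hypothetical helper for illustration; the basic unit is one character,
    as in the paper's "No-456" example.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the text; a text shorter than n
    # yields an empty list because the range is empty.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    sample = "No-456"
    print("Bigram: ", " ".join(char_ngrams(sample, 2)))
    print("Trigram:", " ".join(char_ngrams(sample, 3)))
```

A string of length L yields L - n + 1 n-grams of order n, so the six-character example produces five bigrams and four trigrams, matching the lists in the text.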
