Language identification: how to distinguish similar languages?

Nikola Ljubešić, Nives Mikelić, Damir Boras
Department of Information Sciences, Faculty of Philosophy, University of Zagreb
Ivana Lučića 3, 10000 Zagreb, Croatia
E-mail: nljubesi@ffzg.hr, nmikelic@ffzg.hr, dboras@ffzg.hr

Abstract. The goal of this paper is to discuss the language identification problem for Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian or Slovenian. We developed a tool that implements a list of the most frequent Croatian words with a threshold that each document needs to satisfy, added a specific-character elimination rule, applied second-order Markov model classification, and added a forbidden-words rule. The resulting tool outperforms current tools in discriminating between these similar languages.

Keywords. Written language identification, second-order Markov model, web corpus, most frequent words method, forbidden words method.

1. Introduction

Without basic knowledge of the language a document is written in, applications such as information retrieval and text mining cannot accurately process the data, potentially leading to a loss of critical information. The problem of written language identification has been studied for a long time, and various feature-based models have been developed for it.

Some authors used the presence of diacritics and special characters [17], some used syllable characteristics [16], and some used information about morphology and syntax [26]. Some used information about short words [10, 9, 13, 3, 20], while others used the frequency of character n-grams [4, 5, 6, 20, 15]. Some techniques used Markov models [23, 18, 7], while some used information-theoretic measures of entropy and document similarity [19, 1]. The application of support vector machines and kernel methods to the language identification task has been considered relatively recently [22, 14, 12].

Sibun & Reynar [19] applied relative entropy to language identification. Their work is important for us because they were the first to provide scientific results for Croatian, Serbian and Slovak. For Croatian they obtained a recall of 94%, while precision was 91.74%. Interestingly, their tool sometimes misidentified Croatian as Slovak, but it never confused Croatian and Serbian. On the other hand, Serbian and Slovak were likely to be identified as Croatian.

An improvement over Sibun & Reynar's method was Elworthy's algorithm [8], which achieved a recall of 96% and a precision of 97.96%; the remaining errors were caused by Serbian and sometimes Slovenian documents being identified as Croatian.

Automated written language identification tools are nowadays widely used; the best known are van Noord's TextCat [21], Basis Tech's Rosette Language Identifier [2], and web-based language identification services such as the Xerox Language Identifier [25]. TextCat is an implementation of the text categorization algorithm presented in [5]. Both TextCat and the Xerox Language Identifier are freely available and commonly used, and both perform language identification for Croatian and similar languages (Slovak, Serbian, Slovenian, Czech) as well. Basis Tech's Rosette Language Identifier also covers all these languages, but is available only when purchased.

Since Croatian and Serbian are similar languages that were considered a single Serbo-Croatian language for almost a century, language identifiers such as TextCat and Xerox do get confused and are likely to identify Croatian documents as Serbian and vice versa.

Moreover, the TextCat algorithm introduces Bosnian, which makes identification even harder. Bosnian is spoken in Bosnia and Herzegovina and in the region of Sandžak (in Serbia and Montenegro); it is based on the Western variant of the Štokavian dialect, and its vocabulary and grammar are based on both Croatian and Serbian. The largest differences that are important for distinguishing Croatian from Serbian in the process of language identification are useless when dealing with Bosnian documents, since the Bosnian language accepts all of them. Generally it prefers Croatian forms, but its vocabulary and grammar are a mix of Serbian and Croatian, apart from some Turkisms that are frequently in use only in Bosnian. For instance, with modal verbs such as ht(j)eti (want) or moći (can), the infinitive is prescribed in Croatian, while the construction da (that/to) + present tense is preferred in Serbian. Both alternatives are present and allowed in Bosnian. Therefore, in our research we did not try to distinguish Bosnian from Croatian, since it is hard even for native speakers to notice the difference between the two at a glance.

2. The first step in developing the language identifier for Croatian

Since Croatian, Serbian and Slovenian have proved to be the most difficult languages to distinguish, we collected our training and test corpora from the three most popular news portals in Croatia, Serbia and Slovenia [24]. We collected 67244 documents in Croatian, 30076 documents in Serbian and 5295 documents in Slovenian. Since the Slovenian corpus was the smallest, and considering the need for training corpora in the steps that follow, we took parts of the Croatian and Serbian corpora and built a balanced Croatian-Serbian-Slovenian test corpus consisting of 4364 documents per language (13092 in total).

Firstly, we extracted the list of most frequent words from our remaining Croatian corpus. We measured the frequency distribution of the documents in our test corpus (4364 documents in each of the 3 languages) with respect to the percentage of the N most frequent Croatian words each document contains. Since the experimental data showed obvious normality, we present these distributions in figure 1 as normal distributions. The normality of the three distributions was proved by the Shapiro-Wilk test, with the largest p-value of 9.12 * 10^-11 obtained for the Slovenian document distribution.

Figure 1. Normal distributions of the documents for Croatian, Serbian and Slovenian regarding the percentage of the 100 most frequent Croatian words (x-axis: percentage of the 100 most frequent Croatian words; y-axis: density).

From figure 1 it is obvious that this approach is not capable of distinguishing between these three languages, especially not between Croatian and Serbian, since their distributions overlap significantly. Nevertheless, the method is capable of distinguishing these three languages from all other languages, with the exception of the western Slavic languages.

One can notice that the distribution of the Slovenian documents is shifted leftwards compared to the distribution of the Croatian documents, which shows that Croatian and Slovenian are different languages. On the other hand, the Serbian distribution is shifted rightwards. The reason for this is the frequency of the construction da (that/to) + present tense in Serbian (da is one of the 100 most frequent words), which is replaced by the infinitive in Croatian. Although the figures show differences between the three languages, the overlap between them is still very high, especially between Croatian and Serbian.

Two values had to be chosen during the first step: the threshold T on the percentage of the N most frequent Croatian words, and the number N of most frequent Croatian words. Table 1 shows recall for these two values. T=15% never yielded satisfactory recall. On the other hand, choosing T=10% with N=100 yielded satisfactory recall, which is 0.13% lower than the recall with N=200; but with N=200 we would have exposed ourselves to the danger of introducing corpus-specific most frequent words. Therefore we decided to choose N=100 with T=10%.

Table 1. Change in recall when discriminating languages using T=15% and T=10% (rows) for N={25, 50, 75, 100, 200} (columns)

          25      50      75      100     200
15%     .9175   .9552   .9681   .9718   .9830
10%     .9853   .9913   .9931   .9943   .9956

Since the western Slavic languages share the same alphabet and similar most frequent words with the southern Slavic languages, we established on a mini-corpus of 10 documents each of Czech, Polish and Slovak (30 documents in total) that the average percentage of the 100 most frequent Croatian words in these documents was 10.64%. If we used T=15%, the number of Czech, Polish or Slovak documents classified as potentially Croatian would be lower, but we would irreversibly lose many Croatian documents as well. We solved that problem by introducing a special-character elimination rule. In Croatian documents, the maximum percentage of the 20 most frequent mini-corpus characters that are not part of the Croatian alphabet was 0.26%, and the average percentage was 0.00181%. On the other hand, in the small corpus of Czech, Polish and Slovak texts (30 documents), the smallest percentage of these characters was 4.50%, and the average was 6.7%. Since the Croatian sample is much bigger and therefore much more trustworthy, we composed a rule that eliminates all documents whose special-character percentage exceeds the threshold of 1%.

We can conclude that the method that uses the 100 most common words with a threshold of 10% gives good results in distinguishing Croatian and languages very similar to Croatian (Serbian and Slovenian) from all other languages, with one additional rule: eliminating documents whose percentage of the 20 most frequent specific characters of Czech, Polish and Slovak exceeds 1%.

Table 2 shows the number of documents below and above the threshold in our sample for the chosen T and N. All quantities reported in these tables are derived from the samples and not from the normal distributions that underlie our data.

Table 2. Number of documents for Croatian, Serbian and Slovenian above or below the 10% threshold for the 100 most frequent Croatian words

             above   below
Croatian      4339      25
Serbian       4364       0
Slovenian     3986     378

Since this method eliminated only 8.66% of the Slovenian documents and none of the Serbian ones, we needed to apply additional classification methods that were more efficient in distinguishing between similar languages.
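The two first-step filters described above can be sketched as follows. This is a minimal illustration, not the authors' code: the ten-word list, the foreign-character set and the whitespace tokenization are stand-ins for the paper's actual 100-word list and the 20 most frequent Czech/Polish/Slovak characters.

```python
# Sketch of the first-step filters: the most-frequent-words threshold and the
# special-character elimination rule. TOP_WORDS and FOREIGN_CHARS are small
# illustrative stand-ins for the real resources used in the paper.
TOP_WORDS = {"je", "i", "u", "da", "se", "na", "za", "su", "od", "a"}
# Czech/Polish/Slovak letters that do not occur in the Croatian alphabet:
FOREIGN_CHARS = set("ěřůýáíéúňťď" "ąęłńóśźż" "ľĺŕôä")

T_WORDS = 0.10  # at least 10% of tokens must be frequent Croatian words
T_CHARS = 0.01  # at most 1% of letters may be foreign special characters

def frequent_word_ratio(text: str) -> float:
    tokens = text.lower().split()
    return sum(t in TOP_WORDS for t in tokens) / len(tokens) if tokens else 0.0

def foreign_char_ratio(text: str) -> float:
    letters = [c for c in text.lower() if c.isalpha()]
    return sum(c in FOREIGN_CHARS for c in letters) / len(letters) if letters else 0.0

def passes_first_step(text: str) -> bool:
    """Keep a document only if it is Croatian-like and not western Slavic."""
    return frequent_word_ratio(text) >= T_WORDS and foreign_char_ratio(text) <= T_CHARS
```

Documents that pass both filters proceed to the next classification step; everything else is rejected as non-Croatian-like.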
3. The second step in developing the language identifier for Croatian

The second step involved a simple method of supervised machine learning. We developed a character-level trigram language model for each of the three languages (Croatian, Serbian and Slovenian), trained the models, and used them to estimate the probability that a particular string was generated by each of them.

We used a Markov model, a probabilistic model for sequential data that defines the probability distribution of a sequence of random variables P(X_1, X_2, ..., X_n), making N-th order Markov assumptions: the value of a specific random variable depends only on the values of the N preceding random variables. In the case of a second-order Markov model, the probability of a sequence and the conditional probabilities are calculated as

    p(x_1, x_2, ..., x_n) = prod_{i=1}^{n} p(x_i | x_{i-2}^{i-1})

    p(x_i | x_{i-2}^{i-1}) = c(x_{i-2}^{i}) / c(x_{i-2}^{i-1})

where c(x_j^k) is the number of times the sequence of random variables X_j ... X_k takes the values x_j ... x_k.
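The counting formulation above translates directly into code. The following is a minimal sketch of such a character-level second-order model, using the smoothing constant of 1 * 10^-10 for unseen trigrams mentioned later in the paper; the class name and data structures are our own.

```python
from collections import defaultdict
from math import log

class TrigramModel:
    """Character-level second-order Markov model: p(x_i | x_{i-2}, x_{i-1})
    is estimated as c(x_{i-2} x_{i-1} x_i) / c(x_{i-2} x_{i-1})."""

    def __init__(self, unseen=1e-10):
        self.tri = defaultdict(int)  # trigram counts c(x_{i-2} x_{i-1} x_i)
        self.bi = defaultdict(int)   # context counts c(x_{i-2} x_{i-1})
        self.unseen = unseen         # probability assigned to unseen trigrams

    def train(self, text: str) -> None:
        for i in range(len(text) - 2):
            self.tri[text[i:i + 3]] += 1
            self.bi[text[i:i + 2]] += 1

    def log_prob(self, text: str) -> float:
        # Sum log-probabilities instead of multiplying, to avoid underflow.
        total = 0.0
        for i in range(len(text) - 2):
            tri, bi = text[i:i + 3], text[i:i + 2]
            if self.tri[tri] and self.bi[bi]:
                total += log(self.tri[tri] / self.bi[bi])
            else:
                total += log(self.unseen)
        return total
```

A document is then assigned to the language whose model yields the highest log-probability.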

We used a second-order Markov model because, although higher-order Markov models make fewer assumptions, they have to fight data sparseness, especially in the case of language identification. Namely, as we will show, Markov models for language identification achieve optimal results on relatively small amounts of training data. In our case, we solved the data sparseness problem with the simplest smoothing method, defining the probability of unseen data as 1 * 10^-10. We also calculated the sum of the logs of the probabilities rather than the product of the probabilities, to avoid zero underflow.

Since the distinction between Croatian and Serbian is a much harder task than the one between Croatian and Slovenian, we first tried to distinguish Croatian and Serbian. Firstly, we observed the relationship between the size of the training corpus and the recall and precision measures. We increased the size of the training corpus up to 1,000,000 characters in steps of 10,000 characters. We used 4588 documents from each language as the training corpus and 21124 documents from each language as the validation corpus (since we had to estimate the optimal size of the training corpus, we used a validation corpus that did not overlap with our test corpus). We realized, as shown in figure 2, that second-order Markov models trained on 350,000 characters give optimal results. If trained on 240,000 characters for each language, precision reaches its peak but recall decreases drastically (our results vary too much at this point to be regarded as good predictors of future performance), and if trained on 400,000 characters, precision starts decreasing more rapidly than recall increases (as shown by the decline of the F1 measure). The precision of distinguishing Croatian documents in the Croatian-Serbian corpus using 350,000 characters as a training corpus is 99.08%, while the recall is 92.89%.

Figure 2. Relationship between the size of the training corpus and precision/recall measures in differentiating Croatian documents in a Croatian-Serbian validation corpus (curves: F1, recall, precision; x-axis: number of characters (*10000); y-axis: probability).

Now that we had obtained an optimal size of the training corpus for distinguishing Croatian from Serbian, we trained language models for all three languages (Croatian, Serbian and Slovenian) and tested them on the test corpus of 4364 documents for each language (13092 documents in total). The confusion matrix of the results of distinguishing between the three languages in the three-language corpus is shown in table 3.

Table 3.
Confusion matrix for the 13092 documents of the Croatian-Serbian-Slovenian test corpus (columns are the language identified)

             Croatian   Serbian   Slovenian
Croatian         4321        38           5
Serbian           309      4055           0
Slovenian           5         0        4359

The final result for distinguishing all three languages with respect to Croatian is a recall of 99.01% and a precision of 93.23%.

4. The final step in developing the language identifier for Croatian

The aim of the final step was to improve the precision of identifying Croatian documents, which was low primarily because of misclassifications between Croatian and Serbian documents. The additional classification was done with lists of forbidden words for Croatian and Serbian. Both the Serbian and the Croatian list consisted of words that appear at least 5 times in one corpus but do not exist in the other one at all.
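Building the two forbidden-word lists just described can be sketched as follows; the corpus representation (lists of whitespace-tokenized documents) and the constant name are our own illustrative choices.

```python
from collections import Counter

MIN_COUNT = 5  # a word must appear at least 5 times in its own corpus

def forbidden_words(own_corpus, other_corpus):
    """Words occurring at least MIN_COUNT times in own_corpus and never
    appearing in other_corpus."""
    own_counts = Counter(tok for doc in own_corpus for tok in doc.split())
    other_vocab = {tok for doc in other_corpus for tok in doc.split()}
    return {w for w, c in own_counts.items()
            if c >= MIN_COUNT and w not in other_vocab}
```

Running this once with the Croatian corpus as own_corpus and once with the Serbian corpus yields the two lists used in the final decision rule.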
Therefore, if a document identified as Serbian after the second step contained one or more words from the Croatian list and none from the Serbian one, the decision was changed and the document was identified as Croatian. There were 79827 such words in the Croatian corpus and 18733 in the Serbian one. The difference between these numbers lies primarily in the fact that the remaining part of the Croatian corpus was much bigger than that of the Serbian corpus.

If the list of Croatian-specific types is tailored down to 18733, the precision improves up to 99.84%. Since the danger of overfitting in this case is very high, we decided to take just the 1000 most frequent words from both lists, which improved the precision to 99.18%. Regardless of the number of words taken, recall improved up to 99.31%. The recall/precision measures through all three steps, where each step follows up on the results of the previous one, are shown in table 4.

Table 4. Recall/precision measures for identifying Croatian documents in the 13092-document test corpus through all three language identification steps

              Recall   Precision
First step     .9943       .3419
Second step    .9846       .9329
Third step     .9931       .9918

In the first step, Croatian, Serbian and Slovenian were efficiently distinguished from all other languages with a Croatian most-frequent-words threshold rule and a special-characters threshold rule. In the second step, these three languages were distinguished among themselves with a character-based second-order Markov model. In the third step, the classification between Croatian and Serbian was improved with a forbidden-word-list rule.

5. Conclusion

In this paper we presented a tool for language identification that outperforms existing tools in differentiating Croatian from Serbian and Slovenian.

The method of most frequent words proved to be most usable in differentiating similar languages from all other languages, where the special-character constraint also proved to be very handy. The character n-gram models proved to be quite efficient in distinguishing similar languages. The combination of these two methods proved to work best, since the n-gram method requires a language model for every possible language, while the most frequent words method efficiently strips the number of remaining languages down to a few. The method of forbidden words proved to improve results in distinguishing very similar languages.

Although some of the state-of-the-art language guessers distinguish Bosnian as a language, in our research we did not try to distinguish Bosnian from Croatian, since it is hard even for native speakers to notice the difference between the two at a glance.

References

[1] Aslam JA, Frost M. An information-theoretic measure for document similarity. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003.
[2] Basis Tech's Rosette Language Identifier. http://www.basistech.com/language-identification/ [07/01/2006]
[3] Batchelder EO. A learning experience: Training an artificial neural network to discriminate languages. Technical report, 1992.
[4] Beesley KR. Language identifier: A computer program for automatic natural language identification of on-line text. In Proceedings of the 29th Annual Conference of the American Translators Association, 1988. p. 47-54.
[5] Cavnar WB, Trenkle JM. N-gram-based text categorization. In Proceedings of SDAIR-94, the 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, USA, 1994. p. 161-175.
[6] Damashek M. Gauging similarity with n-grams: language-independent categorization of text. Science 1995; 267(5199):843-848.
[7] Dunning T. Statistical identification of language. Technical Report MCCS-94-273, New Mexico State University, 1994.
[8] Elworthy D. Language identification with confidence limits. CoRR: Computation and Language, 1999.
[9] Henrich P. Language identification for the automatic grapheme-to-phoneme conversion of foreign words in a German text-to-speech system. In Proceedings of Eurospeech 1989, European Speech Communication and Technology, 1989. p. 220-223.
[10] Ingle N. A language identification table. The Incorporated Linguist 1976; 15(4).
[11] Johnson S. Solving the problem of language recognition. Technical report, School of Computer Studies, University of Leeds, 1993.
[12] Kruengkrai C, Srichaivattana P, Sornlertlamvanich V, Isahara H. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT), 2005.
[13] Kulikowski S. Using short words: a language identification algorithm. Unpublished technical report, 1991.
[14] Lodhi H, Shawe-Taylor J, Cristianini N, Watkins CJCH. Text classification using string kernels. Journal of Machine Learning Research 2002; 2:419-444.
[15] McNamee P, Mayfield J. Character n-gram tokenization for European language text retrieval. Information Retrieval 2004; 7:73-97.
[16] Mustonen S. Multiple discriminant analysis in linguistic problems. In Statistical Methods in Linguistics. Skriptor Fack, Stockholm, 1965;(4).
[17] Newman P. Foreign language identification: first step in the translation process. In K. Kummer (editor), Proceedings of the 28th Annual Conference of the American Translators Association, 1987. p. 509-516.
[18] Schmitt JC. Trigram-based method of language identification. U.S. Patent 5062143, October 1991.
[19] Sibun P, Reynar JC. Language determination: Examining the issues. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, 1996. p. 125-135.
[20] Souter C et al. Natural language identification using corpus-based models. Hermes Journal of Linguistics 1994; 13:183-203.
[21] TextCat Language Guesser Demo. http://www.let.rug.nl/~vannoord/TextCat/Demo/ [07/01/2006]
[22] Teytaud O, Jalam R. Kernel-based text categorization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2001.
[23] Ueda Y, Nakagawa S. Prediction for phoneme/syllable/word-category and identification of language using HMM. In Proceedings of the 1990 International Conference on Spoken Language Processing, Kobe, Japan, November 1990. Volume 2, p. 1209-1212.
[24] News portals: http://www.net.hr, http://www.b92.net, http://novice.siol.net [4/26/2007]
[25] Xerox Language Identifier. http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html [07/01/2006]
[26] Ziegler DV. The automatic identification of languages using linguistic recognition signals. State University of New York at Buffalo, Buffalo, NY, 1992.