How to Distinguish Similar Languages?
Total Page:16
File Type:pdf, Size:1020Kb
Language identification: how to distinguish similar languages? Nikola Ljubešić, Nives Mikelić, Damir Boras Department of Information Sciences, Faculty of Philosophy, University of Zagreb Ivana Lučića 3, 10000 Zagreb, Croatia E-mail: nljubesi@ffzg.hr, nmikelic@ffzg.hr, dboras@ffzg.hr Abstract. The goal of this paper is to used Markov models [23, 18, 7], while some discuss the language identification problem of used information theoretic measures of en- Croatian, language that even state-of-the-art tropy and document similarity [19, 1]. The language identification tools find hard to application of support vector machines and distinguish from similar languages, such as kernel methods to the language identification Serbian, Slovenian or Slovak language. We task has been considered relatively recently developed the tool that implements the list [22, 14, 12]. of Croatian most frequent words with the Sibun & Reynar [19] applied relative en- threshold that each document needs to satisfy, tropy to language identification. Their work we added the specific characters elimination is important for us because they were first to rule, applied second-order Markov model provide the scientific results for Croatian, Ser- classification and a rule of forbidden words. bian and Slovak. For Croatian, they got recall Finally, we built up the tool that overperforms rate of 94%, while precision was 91.74%. In- current tools in discriminating between these teresting fact is that Sibun & Reynar’s tool similar languages. made error by identified Croatian as Slovak language, but it never confused Croatian and Keywords. Written language identifi- Serbian. On the other hand, Serbian and Slo- cation, Croatian language, second-order vak were likely to be identified as Croatian. Markov model, web-corpus, most frequent The improvement from Sibun & Reynar words method, forbidden words method. work was Elworthy’s algorithm [8], which achieved recall rate of 96%, and precision rate 1. Introduction of 97.96%, because Serbian and sometimes Slovenian were identified as Croatian. Without the basic knowledge of the Automated written language identification language the document is written in, applica- tools are nowadays widely used, such as the tions such as information retrieval and text best known van Noord’s TextCat [21], Ba- mining are not able to accurately process sis Tech’s Rosette Language Identifier [2], the data, potentially leading to a loss of and web based language identification ser- critical information. The problem of written vices such as Xerox Language Identifier [25]. language identification is attempted to be TextCat is an implementation of the text cat- solved for long time and various feature- egorization algorithm presented in[5]. Both based models were developed for written TextCat and Xerox Language Identifier are language identification. freely available and commonly used and do Some authors used the presence of diacrit- language identification for Croatian and sim- ics and special characters [17], some used syl- ilar languages (Slovak, Serbian, Slovenian, lable characteristics [16] and some used infor- Czech) as well. Basis Tech’s Rosette Lan- mation about morphology and syntax [26]. guage Identifier also includes all these lan- Some of them used information about short guages, but is available only when purchased. words [10, 9, 13, 3, 20], while some au- Since Croatian and Serbian are similar thors used the frequency of n-grams of char- languages that were considered as Serbo- acters [4, 5, 6, 20, 15]. Some techniques Croatian language for almost a century, lan- guage identifiers such as TextCat and Xe- Firstly, we extracted the list of most rox do get confused and are likely identify- frequent words from our remaining Croatian ing Croatian documents as Serbian and vice corpus. We measured the frequency distri- versa. bution of the documents in our test corpus Moreover, the TextCat algorithm intro- (4364 documents in each of 3 languages) duces Bosnian language that makes identifi- regarding the percentage of N most frequent cation even harder. The Bosnian is spoken Croatian words each document contains. by Bosniaks in Bosnia and Herzegovina and Since the experimental data proved obvious the region of Sandžak (in Serbia and Mon- normality, we presented these distributions tenegro), it is based on the Western variant in figure 1 as normal distributions. The of the Štokavian dialect, it has Latin alphabet normality of the three distributions was and has its vocabulary and grammar based on proved by the Shapiro-Wilk test with the −11 Croatian and Serbian language as well. The largest p-value of 9:12 ∗ 10 for the Slove- differences that are important for distinguish- nian documents distribution. From figure ing Croatian from Serbian in the process of 1 it is obvious that this approach is not language identification are useless when deal- capable of distinguishing between these three ing with Bosnian documents, since Bosnian languages, especially not between Croatian language accepts all of them. Generally, it and Serbian since their distributions overlap prefers Croatian, but vocabulary and gram- significantly. Nevertheless, this method is mar are both a mix of Serbian and Croatian, capable of distinguishing between these three apart from some turcisms that are frequently and all other languages with the exception of in use only in Bosnian. For instance, with western Slavic languages. modal verbs such as ht(j)eti (want) or moći (can), the infinitive is prescribed in Croatian, 12 Croatian while the construction da (that/to) + present Serbian Slovenian tense is preferred in Serbian. Both alterna- 10 tives are present and allowed in Bosnian. 8 Therefore, in our research we did not try 6 to distinguish Bosnian from Croatian, since Density it is hard even for native speakers to notice 4 the difference between the two of them at a 2 glance. 0 0.0 0.1 0.2 0.3 0.4 2. The first step in developing the language identifier for Croatian Percentage of Croatian 100 most frequent words Figure 1. Normal distributions of the documents Since Croatian, Serbian and Slovenian for Croatian, Serbian and Slovenian regarding are proved to be most difficult languages to the percentage of 100 most frequent Croatian distinguish, we collected our training and words test corpora from three most popular news portals in Croatia, Serbia and Slovenia [24]. One can notice that the distribution of We collected 67244 documents in Croatian, Slovenian documents is moved leftwards com- 30076 documents in Serbian and 5295 doc- paring to the distribution of Croatian docu- uments in Slovenian. Since Slovenian cor- ments, which shows that Croatian and Slove- pus was the smallest, considering the need nian are different languages. On the other for training corpora in the steps that follow, hand, the Serbian distribution is moved right- we took parts of the Croatian and Serbian wards. The reason for this is the frequency of corpus and built a smooth Croatian-Serbian- construction da (that/to) + present tense in Slovenian test corpus consisting of 4364 doc- Serbian (da is one of the 100 most frequent uments for each language (13092 in total). words), that is replaced by infinitive in Croa- tian. Although all figures show differences the number of Chech, Polish or Slovak docu- between three languages, the overlapping be- ments classified as potentially Croatian would tween them is still very high, especially be- be lower, but we would irreversibly lose many tween Croatian and Serbian. Croatian documents as well. We solved that Two values had to be chosen during the problem by introducing a special character first step - the threshold of percentage of elimination rule. The maximum percentage N most frequent Croatian words T and the of the 20 most frequent characters in the mini- number of Croatian most frequent words N. corpus that are not part of the Croatian al- Table 1 shows recall concerning these two phabet in Croatian documents was 0.26%. values. T =15% never yielded in satisfiable The average percentage was 0.00181%. On recall. On the other hand, choosing T the other hand, in the small corpus of Czech, as 10% for N=100 yielded in satisfiable Polish and Slovak texts (30 documents), the recall which is 0.13% lower than recall with smallest percentage of these characters was N=200, but with N=200 we would have 4.50%. The average one was 6.7%. Since the exposed ourselves to the danger of intro- Croatian sample is much bigger and therefore ducing corpus-specific most frequent words. much trustworthier, we composed a rule that Therefore we decided to choose N=100 with eliminates all documents whose special char- T =10%. acter percentage in documents exceeds the threshold of 1%. Table 1. Change in recall when discriminating We can conclude that the method that uses languages using T =15% and 10% (rows) for the 100 common words with the threshold of 10% N={25,50,75,100,200} (columns) gives good results in distinguishing Croatian 25 50 75 100 200 and languages very similar to Croatian (Ser- 15 .9175 .9552 .9681 .9718 .9830 bian and Slovenian) from all other languages, 10 .9853 .9913 .9931 .9943 .9956 with one additional rule: eliminating docu- ments whose percentage of 20 most frequent Table 2 shows the number of documents specific characters of Czech, Polish and Slo- below and above the threshold in our sample vak exceeds 1%. concerning the chosen T and N. All quan- Since this method eliminated only 8.66% tities that are reported in these tables are of Slovenian documents and none of the Ser- derived from the samples and not the normal bian ones, we needed to apply additional clas- distributions that underlie our data. sification methods that were more efficient in distinguishing between similar languages. Table 2. Number of documents for Croatian, Serbian and Slovenian that are above or below 3.