The Mysterious Letter J

The Mysterious Letter J Andelka¯ Zeceviˇ c´ Stasaˇ Vujiciˇ c´ Stankovic´ Faculty of Mathematics Faculty of Mathematics University of Belgrade University of Belgrade [email protected] [email protected] Abstract The choice of features might be linguistically mo- Ekavian and Ijekavian are two different tivated (diacritics and special characters) or more variants of the contemporary standard Ser- statistically oriented (word frequencies, n-grams bian language. The difference between of various lengths and types). The first tools them is related to the reflex of the old were based on the analyses of character n-grams: Slavic vowel jat and it influences both the Dunning (Dunning, 1994) introduced Markov speaking and writing language norms. The models while Cavnar and Trenkle (Cavnar and sensibility of existing language identifica- Trenkle, 1994) worked with 1-NN classification tion tools for both variants is of great im- algorithm. Nowadays the focus is on the diverse portance for building representative cor- set of (dis)similarity measures (Singh, 2006) and pora and development of relevant linguis- powerful algorithms as can be read in papers tics resources and tools underlying an au- discussing their performance and fields of the tomatic text processing. In this paper we application (Martins and Silva, 2005). present the results obtained after testing the three popular tools for language identi- The task of the language identification is con- fication on corpora containing documents sidered much harder if the document is of modest from each of the two variants. As it will length (for instance, e-Bay and Twitter messages be reported, the identification of Ijekavian or search engine queries) or the amount of avail- variant is a much more difficult task since able training data is limited. The same can be the observed tools are not adopted to it at said for the cases when the number of considered all. languages is huge or languages are similar to each other. All these conditions influence the success 1 Introduction rate as it is reported in Padróand Padró(Padró The language identification is a problem of and Padró, 2004), Lui and Baldwin (Baldwin and identifying the language a document is written Lui, 2010), and Milne et al. (Milne et al., 2012). in. It represents the fundamental step in tasks We are especially interested in the latter problem such as collecting the documents for corpora, since the Serbian language is closely related to the machine translation and information retrieval. languages spoken in former Yugoslavia. Because of its great importance, methodological approaches to the problem and submitted solu- The standard Serbian language is formed tions are numerous. In the basis, the problem on the basis of Ekavian and Ijekavian Neo- can be seen as a classification problem (Mitchell, Stokavianˇ South Slavic dialects and its form 1997): if collections of known language samples is determined by the reformer of the written represent classes, the problem of the language language of the Serbs, Vuk Karadˇzić(1787-1864) identification for the given document can be (Stanojˇcićand Popović, 2011). In the common seen as a problem of the document assignment state of Yugoslavia this language was officially to the one of the classes in respect to relevant encompassed by Serbo-Croatian, a name that classification features. implied a linguistic unity with the Croats (and later with other nations whose languages were Many sets of language features as well as based on Neo-Stokavianˇ dialects). In the last classification algorithms have been tested so far. decade of the 20th century in Serbia the name 40 Proceedings of the Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants, pages 40–44, Hissar, Bulgaria, 13 September 2013. Serbo-Croatian was replaced in general usage languages are singled out by the rule of 100 most by the name Serbian. As mentioned above, in frequent words and the rule of special character Serbian speaking countries two dialects coexist. elimination. In the next phase character based The Ekavian dialect is widespread in Serbia, second-order Markov model is developed aiming while the Serbian Ijekavian dialect is presented to distinguish languages among themselves. In in some parts of the northern Serbia, Croatia and order to improve the distinction between Croa- Montenegro as well as Bosnia and Herzegovina. tian and Serbian, in the final phase the lists of The difference among the dialects is related to the forbidden words are introduced. Those are the old Slavic vowel called jat and its conflation into lists containing words that appear in one language the vowel e or diphthongs ije and je. It is notable but not in others. The model is tested on the in both spoken and written language forms as news collection and the achieved accuracy of Serbian has a phonologically based orthography. 0.9918 is better than any reported for this group For instance, in respect to the Ekavian dialect of languages. the English word flower has forms cvet (long e) in nominative singular and cvetovi (short e) in The case of European and Brazilian Portuguese nominative plural while the appropriate forms in is discussed in (Zampieri and Gebre, 2012). As Ijekavian dialect are cvijet and cvjetovi respec- the differences between these two varieties can be tively. Therefore, Serbian and other languages of described at orthographic, lexical and syntactic Stokavianˇ provenance share the Ijekavian dialect level, the identifying algorithm analyse three in their standard forms which makes the task groups of features: character n-grams (n varying of the language identification very sensitive and from 2 to 6), word unigrams and word bi-grams. error-prone. From the other point of view, these The language models are calculated by using the languages overlapping can help in cooperative de- Laplace probability distribution and evaluated on velopment of tools and resources necessary for an the journalistic corpora containing texts from the automatic language processing (Vitas et al., 2011). both varieties further classified according to their length in tokens. The achieved accuracies are In this paper we present the results obtained af- 0.998 for 4-grams, 0.996 for word unigrams, and ter testing the three popular language identifica- 0.912 for word bi-grams. tion tools on the collection containing both Eca- vian and Ijekavian documents. Section 2 that In order to identify Spanish varieties, the au- refers to the related work and state-of-the-art ap- thors of (Zampieri et al., 2013) compared the clas- proaches is followed by two introductory sections sical character and word n-gram model to the numbered as 3 and 4 and related to tested tools and knowledge-rich model based on the morphosyn- specially created test corpora. The experiment is tactic information and parts of speech. The test- described in Section 5 while the results are pre- ing was done on the newspapers corpora from four sented in Section 6. The conclusions and ambi- Spanish speaking countries (Spain, Argentina, tions for future work are summarized in Section Mexico and Peru) and the reported results showed 7. the direct relationship between the performance using these two language models: for instance, the 2 Related Work Argentina-Spain classifier performed the worst in both cases (0.843 and 0.666 in terms of accuracy) There is a number of papers discussing the iden- while the Argentina-Mexico classifier generated tification of closely related languages, varieties the top results with characters and words (0.999) of polycentric languages and language dialects. as well as morphology and parts of speech (0.801). All these tasks are more advanced in comparison to classical ones and require application of more 3 Tested Language Identification Tools subtle techniques. In our experiment we have tested three tools for In the paper (Ljubeˇsićet al., 2007) the three- the language identification. A brief description phases model for differentiating Croatian from of tools and the motivation for the usage is given Slovenian and Serbian is presented. In the first below. phase the documents written in any of these three 41 Langid.py1 is a top-level tool developed by oci, djelimicnoˇ ). Lui and Baldwin (Lui and Baldwin, 2012). It is based on the multinomial naive Bayes classifier 4 Test corpus which operates on the set of features (byte level For testing purpose, we have created a corpus unigrams, bigrams and trigrams) selected so that which consists of documents in both Ekavian their information gain represents the characteris- and Ijekavian variant (Table 1). Since Serbian tics of the language rather than the characteristics can be written in Cyrillic or Latin script, all the of the training domains (Lui and Baldwin, 2011). documents are transliterated into Latin script. For the training phase the corpus, which encom- passes government documents, newswire, online encyclopedia, software documentations and an Size Size (in number of words) (in MB) internet crawl in 97 languages (Lui and Baldwin, Ekavian part 2. 078, 172 13.2 2011) is used. In the case of Serbian, the training Ijekavian part 528, 749 3.2 collection includes XML wiki dumps for the period July-August 2010 as well as the set of Table 1: The structure of the corpus manually translated content strings for a number of Debian software packages2. The Ekavian part of the corpus includes the CLD (Content Language Detection)3 is a li- articles from the daily newspaper Politika6 for the brary embedded in a Google’s Chromium browser years 2007 and 2010, the literary works written able to detect a language of a web page content. by the local authors and the translations of many Thanks to Michael McCandless, it is singled out popular novels. The list of all used materials is as a separated C++/Python module and ready reported in Table 2.

Load more