Automatic Detection of Arabicized Berber and Arabic Varieties

Wafia Adouane¹, Nasredine Semmar², Richard Johansson³, Victoria Bobicev⁴
Department of FLoV, University of Gothenburg, Sweden¹
CEA Saclay – Nano-INNOV, Institut CARNOT CEA LIST, France²
Department of CSE, University of Gothenburg, Sweden³
Technical University of Moldova⁴

Abstract

Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is a necessary first step for any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of written forms of these spoken languages, based on pronunciation. These languages are not well represented on the Web and are commonly referred to as under-resourced languages, and the currently available ALI tools fail to recognize them properly. In this paper, we revisit the problem of ALI with a focus on short texts in Arabicized Berber and dialectal Arabic. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.

1 Introduction

Automatic Language Identification (ALI) is a well-studied field in computational linguistics, dating back to the early 1960s, in which various methods have achieved successful results for many languages. ALI is commonly framed as a categorization¹ problem.
However, the rapid growth and wide dissemination of social media platforms and new technologies have contributed to the emergence of written forms of some varieties which are either minority or colloquial languages. These languages were not written before social media and mobile phone messaging services, and they are typically under-resourced. The state-of-the-art ALI tools fail to recognize them and assign them to a single category: the standard language. For instance, whatever is written in Arabic script, and is clearly not Persian, Pashto or Urdu, is considered Arabic, more precisely Modern Standard Arabic (MSA), even though there are many Arabic varieties which are considerably different from each other.

There are also other, less known languages which are written in Arabic script but are completely different from all Arabic varieties. In North Africa, for instance, Berber or Tamazight², which is widely used, is also written in Arabic script, mainly in Algeria, Libya and Morocco. Arabicized Berber (BER), or Berber written in Arabic script, is an under-resourced language and is unknown to all available ALI tools, which misclassify it as Arabic (MSA).³ Arabicized Berber does not use special characters and it coexists with Maghrebi Arabic, where the dialectal contact has made it hard for non-Maghrebi people to distinguish it from local Arabic dialects.⁴ For instance, each word in the Arabicized Berber sentence ‘AHml sAqwl mA$y dwl kAn’⁵, which means ‘love is from heart and not just a word’, has a false friend in MSA and all Arabic dialects. In MSA, the sentence means literally ‘I carry I will say going countries was’, which does not mean anything.

In this study, we deal with the automatic detection of Arabicized Berber and with distinguishing it from the most popular Arabic varieties. We consider only the seven most popular Arabic dialects, based on the geographical classification, plus MSA. There are many local dialects due to the linguistic richness of the Arab world, but it is hard to deal with all of them for two reasons: it is hard to get enough data, and it is hard to find reliable linguistic features, as these local dialects are very hard to describe and full of unpredictability and hybridization (Hassan R.S., 1992).

We begin with a brief overview of the related work on Arabicized Berber and dialectal Arabic ALI in Section 2. We then describe the process of building the linguistic resources (dataset and lexicons) used in this paper and motivate the adopted classification in Section 3. We next describe the experiments and analyze the results in Sections 4 and 5, and finally conclude with the findings and future plans.

¹ Assigning a predefined category to a given text based on the presence or absence of some features.
² An Afro-Asiatic language widely spoken in North Africa and different from Arabic. It has 13 varieties and each has formal and informal forms. It has its own script, called Tifinagh, but for convenience Latin and Arabic scripts are also used. The use of Arabic script to transliterate Berber dates back to the beginning of the Islamic Era (L. Souag, 2004).
³ Among the freely available language identification tools, we tried Google Translator, Open Xerox language and Translated labs at http://labs.translated.net.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, pages 63–72, Osaka, Japan, December 12 2016.

2 Related Work

Currently available automatic language identifiers rely on character n-gram models and statistics estimated from large training corpora to identify the language of an input text (Zampieri and Gebre, 2012).
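As an illustration of the character n-gram approach just mentioned, the following is a minimal sketch and not the implementation of any of the cited tools. It assumes add-one smoothing and scores a text by its log-likelihood under one n-gram distribution per label. The class name `NgramIdentifier`, the labels `lang_a` and `lang_b`, and the toy training strings are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a space-padded text."""
    padded = " " * (n - 1) + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramIdentifier:
    """Toy identifier: one smoothed character n-gram distribution per label."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.vocab = set()  # every n-gram seen during training

    def fit(self, samples):
        # samples: iterable of (text, label) pairs
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def score(self, text, label):
        # Add-one smoothed log-likelihood of the text under one label;
        # the extra +1 reserves probability mass for unseen n-grams.
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab) + 1
        return sum(math.log((counts[g] + 1) / total)
                   for g in char_ngrams(text, self.n))

    def predict(self, text):
        # Pick the label whose n-gram distribution best explains the text.
        return max(self.counts, key=lambda label: self.score(text, label))


# Hypothetical toy data (not from the paper), two labels with two texts each.
train = [
    ("the cat sat on the mat", "lang_a"),
    ("this is the other thing", "lang_a"),
    ("der hund und die katze", "lang_b"),
    ("das ist ein kleines haus", "lang_b"),
]
model = NgramIdentifier(n=3).fit(train)
print(model.predict("the thing on the mat"))
```

A model of this kind only separates labels whose n-gram statistics differ in the training data, so it has nothing to go on for a variety that is absent from that data.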
Such identifiers are mainly trained on standard languages and not on the varieties of each language. For instance, available language identification tools can easily distinguish Arabic from Persian, Pashto and Urdu based on character sets and topology. However, they fail to properly distinguish between languages which use the same character set. Goutte et al. (2016) and Malmasi et al. (2016) give a comprehensive bibliography of the recently published work on discriminating between similar languages and language varieties for different languages.

Some work has been done on identifying spoken Berber. For instance, Halimouche et al. (2014) discriminated between affirmative and interrogative Berber sentences using prosodic information, and Chelali et al. (2015) used speech signal information to automatically identify Berber speakers. We are not aware of any work which deals with the automatic identification of written Arabicized Berber.

Recently, there has been increasing interest in processing informal Arabic varieties (Arabic dialects) using various methods. The main challenge is the lack of freely available data (Benajiba and Diab, 2010). Most of the work focuses on distinguishing between Modern Standard Arabic (MSA) and dialectal Arabic (DA), where the latter is regarded as one class which consists mainly of Egyptian Arabic (Elfardy and Diab, 2013). Further, Zaidan and Callison-Burch (2014) distinguished between four Arabic varieties (MSA, Egyptian, Gulf and Levantine dialects) using n-gram models. The system was trained on a large dataset and achieved an accuracy of 85.7%. However, the performance of the system cannot be generalized to other domains and topics, especially since the data comes from a single domain (users' comments on selected newspaper websites). Sadat et al. (2014) distinguished between eighteen⁶ Arabic varieties using probabilistic models (a character n-gram Markov language model and Naive Bayes classifiers) across social media datasets.
The system was tested on 1,800 sentences (100 sentences for each Arabic variety) and the authors reported an overall accuracy of 98%. The small size of the test dataset makes it hard to generalize the performance of the system to all dialectal Arabic content. Saâdane (2015), in her PhD thesis, classified Maghrebi Arabic (Algerian, Moroccan and Tunisian dialects) using morpho-syntactic information. Furthermore, Malmasi et al. (2015) distinguished between six Arabic varieties, namely MSA, Egyptian, Tunisian, Syrian, Jordanian and Palestinian, at the sentence level, using a Parallel Multidialectal Corpus (Bouamor et al., 2014).

It is hard to compare the performance of the proposed systems, mainly because all of them were trained and tested on different datasets (different domains, topics and sizes). To the best of our knowledge, no work has evaluated the systems on one large multi-domain dataset. Hence, it is wrong to consider the automatic identification of Arabic varieties a solved task, especially since there is no available tool which can be used for further NLP tasks on dialectal Arabic.

⁴ In all polls about the hardest Arabic dialect to learn, Arabic speakers mention Maghrebi Arabic, which, unlike other Arabic dialects, has Berber and French words as well as words of unknown origin.
⁵ We use the Buckwalter Arabic transliteration scheme. For the complete chart see: http://www.qamus.org/transliteration.htm.
⁶ Egypt; Iraq; Gulf including Bahrain, Emirates, Kuwait, Qatar, Oman and Saudi Arabia; Maghrebi including Algeria, Tunisia, Morocco, Libya, Mauritania; Levantine including Jordan, Lebanon, Palestine, Syria; and Sudan.

In this paper, we propose an automatic language identifier which distinguishes between Arabicized Berber and the eight most popular high level Arabic variants (Algerian,