Dialectal Arabic Processing Using Deep Learning

Total Page:16

File Type:pdf, Size:1020Kb

Dialectal Arabic Processing Using Deep Learning DIALECTAL ARABIC PROCESSING USING DEEP LEARNING Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. phil.) durch die Philosophische Fakultät der Heinrich-Heine-Universität Düsseldorf vorgelegt von Younes Samih aus essen Betreuerin: Prof. Dr. Laura Kallmeyer Düsseldorf, Oktober 2017 ABSTRACT In the last few years, the advent of the phenomena of social media and the ubiquity of the internet access have created an unprecedented deluge of in- formation and textual data on the world wide web. This data brings in its wake new opportunities and poses many challenges for machine learning and Natural Language Processing (NLP) in particular. The sheer size, non- standard spelling, the poor quality, the informality, and the noise of this data, presents new challenges to standard NLP tools developed for traditional data. To make sense out of such data and exploit its value, novel NLP methods, resources, and efficient algorithms beyond rule-based deductive reasoning and "traditional" system engineering need to be created. Given the ground breaking results of Deep Learning (DL) models in solving hard natural lan- guage processing tasks, I argue in this dissertation that these models are well suited for processing social Media textual data. A case in point is Dialectal Arabic (DA) which is emerging as the language of informal communication on the web, in emails, Social Media platforms, blogs, etc. To systematically investigate the ability of DL models to process the less controlled and more speech-like nature of DA in Social Media, I choose to address two concrete, challenging tasks, namely linguistic Code-Switching (CS) identification and DA morphological segmentation. ii ZUSAMMENFASSUNG In den letzten Jahren ist durch die sozialen Medien und allgegenwärtigen Zugang zum Internet eine noch nie dagewesene Flut von Information und Textdaten im Internet entstanden. Diese Daten bringen neue Chancen und vielfältige Herausforderungen im Bereich des machinellen Lernens, und in- besondere in der maschinellen Sprachverarbeitung NLP. Die schiere Menge, nicht den gewohnten Normen entsprechende Rechtschreibung, die schlechte Qualität und Informalität, und das Rauschen in den Daten stellen die NLP- Standardwerkzeuge, die auf traditionellen Daten entwickelt worden sind, vor neue Probleme. Um solche Daten interpretieren und ihren Wert nutzen zu können, müs- sen neue NLP-Methoden, Resourcen und effiziente Algorithmen jenseits von regelbasiertem dekutiven Herangehensweisen und "traditionellen"Methoden der Systementwicklung geschaffen werden. Im Licht der bahnbrechenden Er- gebnisse von Modellen des Deep Learning (DL) argumentiere ich in dieser Dissertation, dass solche Modelle gut zur Verarbeitung von Text aus sozia- len Medien geeignet sind. Als Beispielfall dient das Dialectal Arabic (DA), das sich zur Sprache der informellen Kommunikation im Web, in E-Mails, auf sozialen Medien, und in Blogs entwickelt hat. Um die Möglichkeiten von DL-Modellen zu zeigen, das weniger kontrollierte und der gesproche- nen Sprache ähnliche DA in sozialen Medien zu verarbeiten, bearbeite ich zwei konkrete, herausfordernde Probleme, und zwar die Identifikation von linguistischem Code-Switching (CS) und die morphologische Segmentierung von DA. iii AUTHOR STATEMENT I would like here to acknowledge that Chapter 4 “Data and Resource Cre- ation” includes work that was published as Samih and Maier (2016a) and Samih and Maier (2016b). I was primarily responsible for the design and creation of the corpus described in these two papers. I was also an active col- laborator throughout the creation of the multi-dialectal segmentation corpus presented in the following papers, Samih et al. (2017a), Samih et al. (2017b) Samih, Maier, and Kallmeyer (2016), and Eldesouki et al. (2017). Chapter 5 “Code-switching identification” also includes work that was pub- lished as Samih and Maier (2016a,b) and Samih, Maier, and Kallmeyer (2016). I was primarily responsible all the experiments and the design and creation of the corpus used in this chapter. The other authors contributed to these papers primarily in an advisory role. In addition, Section 5.5 in this chapter presents work that is excerpted from Samih et al. (2016), which also forms the basis for much of this chapter. I was responsible for the design of the neural models presented in that paper. Chapter 6 “Dialectal Arabic Segmentation” also includes work that was pub- lished as Eldesouki et al. (2017) and Samih et al. (2017a,b). I was responsi- ble for the design of the neural models presented in these three papers. I substantially contributed to the implementation and testing of these models and in the development of the segmentation algorithm. Eldesouki Mohamed was responsible for the implementation of the Support Vector Machine (SVM) models. The other authors contributed to these papers primarily in an advi- sory role. iv Appendix A “SAWT” is an updated version of my paper titled “SAWT: Se- quence Annotation Web Tool” presented in the following paper Samih, Maier, and Kallmeyer (2016). v ACKNOWLEDGMENTS I owe an enormous debt of gratitude to my supervisor, Laura Kallmeyer, for many reasons. First she put me in the right track since Day One saving me a lot of time and effort exploring uncountable paths. Second she kept motivating and supporting me along the way. She encouraged me to write papers, and attend meetings and conferences. She also provided the financial funds needed for travelling to these research events. Third, at times when I was about to go off rail or lose confidence, her advice helped put me back in the right track. My deepest gratitude goes to my affectionate mother and my dearest father, for everything they have provided, and to my family (my wife Ikram and my son Adam), to whom this dissertation is dedicated, for their love, patience and unfailing support. Mohammed Attia has been my post-doc mentor even before my dissertation. I am so thankful for all the detailed comments he has made during all these years. He has always been willing to discuss various aspect of my dissertation or read a chapter. I am really grateful for his support. I am grateful that I could finish my work at the University of Düsseldorf among nice and supportive colleagues. In particular I would like to thank Wolfgang Maier and Miriam Kaeshammer. I would also like to thank James Kilbury, Kareem Darwish, Ahmed Abde- lali, Hamdy Mubarak, Mohamed Eldesouki, Timm Lichte, Christian Wurm, Simon Petitjean, and Andreas van Cranenburgh for discussing various as- vi pects of my dissertation with me and for giving me insightful comments. All remaining errors are my own! vii CONTENTS i introduction 1 1 introduction 2 1.1 Overview and Motivation ...................... 2 1.2 Contributions of this Dissertation ................. 4 1.3 Dissertation Overview ........................ 6 ii background 7 2 arabic and its dialects 8 2.1 The Arabic Language Varieties ................... 8 2.2 Linguistic Characteristics of the Arabic Dialects ......... 12 2.2.1 Morphology .......................... 12 2.2.2 Syntax ............................. 14 2.3 Processing of Dialectal Arabic .................... 15 2.4 Challenges of Processing Dialectal Arabic ............. 17 2.5 Concluding Remarks ......................... 19 3 deep learning background 20 3.1 Introduction .............................. 20 3.2 Basics of Neural Networks ..................... 23 3.2.1 Mathematical Notation ................... 23 3.2.2 The Artificial Neuron .................... 23 3.2.3 Multi-Layer Neural Networks ............... 25 3.3 Recurrent Neural Networks ..................... 27 3.4 Deep Neural Networks for NLP .................. 30 3.5 Concluding Remarks ......................... 31 viii contents ix iii resource creation for dialctal arabic 32 4 data and resource creation 33 4.1 Introduction .............................. 33 4.2 The Moroccan Code-switching Corpus .............. 35 4.2.1 Data Collection ........................ 35 4.2.2 Annotation .......................... 36 4.2.3 Related Work ......................... 40 4.3 The Multi-Dialectal Segmentation Corpus ............ 41 4.3.1 Data Creation ......................... 41 4.3.2 Annotation .......................... 42 4.4 Concluding Remarks ......................... 46 iv applictaions 47 5 code-switching identification 48 5.1 Introduction .............................. 48 5.2 Code-Switching ............................ 50 5.2.1 A Linguistic Analysis of Code-Switching ......... 50 5.2.2 Code-Switching in Morocco ................ 52 5.3 Related Work ............................. 54 5.4 Dataset ................................. 56 5.5 Code-switching Detection Approaches .............. 57 5.5.1 CRF Approch (Baseline) ................... 58 5.5.2 A Deep Neural Network Model .............. 63 5.6 Results and Experiments ....................... 66 5.7 Analysis ................................ 68 5.7.1 What is Captured in Char-Word Representations? . 68 5.7.2 CRF Model .......................... 70 5.7.3 Error Analysis ........................ 71 contents x 5.8 Concluding Remarks ......................... 74 6 dialectal arabic segmentation 75 6.1 Introduction .............................. 75 6.2 related work .............................. 76 6.3 Segmentation Datasets ........................ 78 6.4 Arabic Dialects ............................ 78 6.4.1 Similarities Between Dialects................ 78 6.4.2 Differences Between Dialects ................ 81 6.5 Learning Algorithms ......................... 85 Rank 6.5.1 SVM Approach
Recommended publications
  • “The Overuse of Italian Loanwords in the Daily Speech of Tripoli University Students: the Impact of Gender and Residential Place”
    العدد العشرون تاريخ اﻹصدار: 2 – ُحزيران – 2020 م ISSN: 2663-5798 www.ajsp.net “The Overuse of Italian Loanwords in the Daily Speech of Tripoli University Students: the Impact of Gender and Residential Place” Prepared by Jalal Al Dain Y. Abidah, English Dept., Faculty of Education - Janzour, Tripoli University 19 Arab Journal for Scientific Publishing (AJSP) ISSN: 2663-5798 العدد العشرون تاريخ اﻹصدار: 2 – ُحزيران – 2020 م ISSN: 2663-5798 www.ajsp.net Abstract The study tries to explore the impact of social factors of gender and residential place on the use of Italian loanwords by Libyan university students (using Tripoli University as an example) and how the mentioned social factors affect their daily speech. To answer the questions of the study, a sample of 60 Tripoli university students are selected randomly in the campus (A) of University of Tripoli. They were divided into two groups according to their Gender and residential place. In order to collect data, a questionnaire was developed for this purpose. It generated data regarding the use of 150 Italian loanwords by both groups. The mean of using Italian loanwords in both groups was analyzed and computed using SPSS. However, the study reveals the impact of residential area where Italian loanwords were more incorporated by rural students than urbanites. The results of the study revealed that there was a significant statistical difference at (α≤0.05) among the means of both groups regarding the use of Italian loanwords in daily speech due to residential area. In contrasts, gender emerges as insignificant. Keyword Italian loanwords, Colloquial Arabic, Gender, Libya.
    [Show full text]
  • The Historical Development of the Maltese Plural Suffixes -Iet and -(I)Jiet Having Spent the Last Millennium in Close Contact Wi
    The historical development of the Maltese plural suffixes -iet and -(i)jiet Having spent the last millennium in close contact with several Indo-European languages, Maltese, the modern descendant of Siculo-Arabic (Brincat 2011), possesses various pluralization strategies. In this paper I explore the historical development of two such strategies of Semitic- origin: the suffixes -iet and -(i)jiet (e.g. papa ‘pope’, pl. papiet; omm ‘mother’, pl. ommijiet). Specialists agree that both suffixes originated from the Arabic plural suffix -āt (Borg 1976; Mifsud 1995); however, no research has explained the development of -(i)jiet, nor connected its development to that of -iet. I argue that the development of -(i)jiet was driven by the influx of i- final words which resulted from contact with Italian: Maltese speakers affixed -iet to such words, triggering a glide-epenthesis that occurs elsewhere in Maltese (e.g. Mifsud 1996: 34) and in other varieties of Arabic (e.g. Erwin 1963; Cowell 1964; Owens 1984; Harrell 2004). With a large number of plurals now ending in ijiet, speakers reanalyzed this sequence as a unique plural suffix and began applying it to new non-i-final words as well. Since only Maltese experienced this influx of i-final words, it was only in Maltese that speakers reanalyzed this sequence as a separate suffix. Additionally, I argue against an explanation of the development of -(i)jiet that does not rely on epenthesis. With regard to the suffix -iet, I argue that two properties unique to it – the obligatory omission of stem-final vowels upon pluralization, and the near-universal tendency to pluralize only a-final words – emerged from a separate reanalysis of the pluralization of collective nouns that reflects a general weakening of the Semitic element in Maltese under Indo-European contact.
    [Show full text]
  • Different Dialects of Arabic Language
    e-ISSN : 2347 - 9671, p- ISSN : 2349 - 0187 EPRA International Journal of Economic and Business Review Vol - 3, Issue- 9, September 2015 Inno Space (SJIF) Impact Factor : 4.618(Morocco) ISI Impact Factor : 1.259 (Dubai, UAE) DIFFERENT DIALECTS OF ARABIC LANGUAGE ABSTRACT ifferent dialects of Arabic language have been an Dattraction of students of linguistics. Many studies have 1 Ali Akbar.P been done in this regard. Arabic language is one of the fastest growing languages in the world. It is the mother tongue of 420 million in people 1 Research scholar, across the world. And it is the official language of 23 countries spread Department of Arabic, over Asia and Africa. Arabic has gained the status of world languages Farook College, recognized by the UN. The economic significance of the region where Calicut, Kerala, Arabic is being spoken makes the language more acceptable in the India world political and economical arena. The geopolitical significance of the region and its language cannot be ignored by the economic super powers and political stakeholders. KEY WORDS: Arabic, Dialect, Moroccan, Egyptian, Gulf, Kabael, world economy, super powers INTRODUCTION DISCUSSION The importance of Arabic language has been Within the non-Gulf Arabic varieties, the largest multiplied with the emergence of globalization process in difference is between the non-Egyptian North African the nineties of the last century thank to the oil reservoirs dialects and the others. Moroccan Arabic in particular is in the region, because petrol plays an important role in nearly incomprehensible to Arabic speakers east of Algeria. propelling world economy and politics.
    [Show full text]
  • Arabic and Contact-Induced Change Christopher Lucas, Stefano Manfredi
    Arabic and Contact-Induced Change Christopher Lucas, Stefano Manfredi To cite this version: Christopher Lucas, Stefano Manfredi. Arabic and Contact-Induced Change. 2020. halshs-03094950 HAL Id: halshs-03094950 https://halshs.archives-ouvertes.fr/halshs-03094950 Submitted on 15 Jan 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Arabic and contact-induced change Edited by Christopher Lucas Stefano Manfredi language Contact and Multilingualism 1 science press Contact and Multilingualism Editors: Isabelle Léglise (CNRS SeDyL), Stefano Manfredi (CNRS SeDyL) In this series: 1. Lucas, Christopher & Stefano Manfredi (eds.). Arabic and contact-induced change. Arabic and contact-induced change Edited by Christopher Lucas Stefano Manfredi language science press Lucas, Christopher & Stefano Manfredi (eds.). 2020. Arabic and contact-induced change (Contact and Multilingualism 1). Berlin: Language Science Press. This title can be downloaded at: http://langsci-press.org/catalog/book/235 © 2020, the authors Published under the Creative Commons Attribution
    [Show full text]
  • Expression of Attributive Possession in Tunisian Arabic: the Role of Language Contact
    Expression of Attributive Possession in Tunisian Arabic: The Role of Language Contact Lotfi Sayahi 1 Introduction Possession is a universal concept that describes a particular link between two entities. It can denote a set of relationships that range from strict ownership to more loose connections that are open to interpretations. At the semantic level, a major distinction has been established between alienable possession and inalienable possession. Alienable possession refers to relationships that are temporary or which can be ended freely such as ownership of material objects. In the case of inalienable possession the possessed entity is intrinsi- cally linked to the possessor, as in the case of part-whole relations (e.g., body parts) or kinship relations. Languages, however, vary when it comes to what counts as alienable vs. inalienable possession (Heine 1997; Payne and Barshi 1999; Herslund and Baron 2001). At the structural level, two major types of pos- sessive structures exist. The first type, which is not the object of the current study, is predicative possession which uses verb forms to express the possessive link. The second type is attributive possession, also referred to as adnominal possession, which marks the possession in the noun phrase through different morphosyntactic structures. Despite being such a common category, only a few studies have described change in the expression of possession in cases of language contact. Weinreich (1963) in his seminal work mentions the case of Estonian, Amharic, and Modern Hebrew which developed or increased the use of analytic possession constructions as a result of contact with other languages (Weinreich 1963: 41–42).
    [Show full text]
  • Berber Etymologies in Maltese1
    Berber E tymologies in Maltese 1 Lameen Souag LACITO (CNRS – Paris Sorbonne – INALCO) France ﻣﻠﺧص : ﻻ ﯾزال ﻣدى ﺗﺄﺛﯾر اﻷﻣﺎزﯾﻐﯾﺔ ﻓﻲ اﻟﻣﺎﻟطﯾﺔ ﻏﺎﻣﺿﺎ، واﻟﻛﺛﯾر ﻣن اﻻﻗﺗراﺣﺎت اﻟﻣﻧﺷورة ﻓﻲ ھذا اﻟﻣﺟﺎل ﻟﯾﺳت ﻣؤﻛدة . ھذه اﻟﻣﻘﺎﻟﺔ ﺗﻘﯾم ﻛل ﻛﻠﻣﺔ ﻣﺎﻟطﯾﺔ اﻗ ُﺗرح ﻟﮭﺎ أﺻل أﻣﺎزﯾﻐﻲ ﻣن ﻗﺑل، ﻓﺗﺳﺗﺑﻌد ﺧﻣﺳﯾن ﻛﻠﻣﺔ وﺗﻘﺑل ﻋﺷرﯾن . ﻛﻣﺎ أﻧﮭﺎ ﺗﻘﺗرح ﻟﻠﻣرة اﻷوﻟﻰ أﺻوﻻ أﻣﺎزﯾﻐﯾﺔ ﻟﺳﺗﺔ ﻛﻠﻣﺎت ﻣﺎﻟطﯾﺔ ﻏﯾر ھذه اﻟﺳﺑﻌﯾن، وھﻲ " أﻛﻠﺔ ﻟزﺟﺔ " ، " ﺻﻐﯾر " ، " طﯾر اﻟوﻗواق " ، " ﻧﺑﺎت اﻟدﯾس " ، " أﻧﺛﻰ اﻟﺣﺑ ﺎر " ، واﻟﺗﺻﻐﯾر ﺑﻌﻼﻣﺔ اﻟﺗﺄﻧﯾث . ﺑﻧﺎء ﻋﻠﻰ ھذه اﻟﻧﺗﺎﺋﺞ، ﺗﻔﺣص اﻟﻣﻘﺎﻟﺔ اﻟﺗوزﯾﻊ اﻟدﻻﻟﻲ ﻟدى اﻟﻣﻔردات اﻟدﺧﯾﻠﺔ ﻣن اﻷﻣﺎزﯾﻐﯾﺔ ﻟﺗﻠﻘﻲ ﻧظرة ﻋﻠﻰ ﻧوﻋﯾﺔ اﻻﺣﺗﻛﺎك ﺑﯾن اﻟﻌرﺑﯾﺔ واﻷﻣﺎزﯾﻐﯾﺔ ﻓﻲ إﻓرﯾﻘﯾﺔ ﻗﺑل وﺻول ﺑﻧﻲ ھﻼل إﻟﻰ اﻟﻣﻧطﻘﺔ Abstract : The extent of Berber lexical influence on Maltese remains unclear, and many of the published etymological proposals are problematic. This article evaluates the reliability of existing proposed Maltese etymologies involving Berber, excluding 50 but accepting 20, and proposes new Berbe r etymologies for another six Maltese words ( bażina “overcooked, sticky food”, ċkejken “small”, daqquqa “cuckoo”, dis “dis - grass, sparto”, tmilla “female cuttlefish used as bait”), as well as a diminutive formation strategy using - a . Based on the results, it examines the distribution of the loans found, in the hope of casting light on the context of Arabic - Berber contact in Ifrīqiyah before the arrival of the Banū Hilāl. Keywords: Maltese, Berber, Amazigh, etymology, loanword s 1 The author thanks Marijn van Putten, Karim Bensoukas, and two anonymous reviewers for their helpful comments on earlier drafts of this paper, and Abdessalem Saied for providing Nabuel data. © The International Journal of Arabic Linguistics (IJAL) Vol.
    [Show full text]
  • The Historical Development of the Maltese Plural Suffixes -Iet and -(I)Jiet JONATHAN GEARY, UNIVERSITY of ARIZONA ICHL23, SAN ANTONIO, TX AUGUST 4, 2017
    7/20/2018 The Historical Development of the Maltese plural suffixes -iet and -(i)jiet JONATHAN GEARY, UNIVERSITY OF ARIZONA ICHL23, SAN ANTONIO, TX AUGUST 4, 2017 Introduction • This paper explores the development of two Maltese suffixes of Semitic origin: -iet (e.g. papa ~ papiet “pope(s)”) and -(i)jiet (e.g. omm ~ ommijiet “mother(s)”). • -(i)jiet developed from -iet due to extensive borrowing of i-final Italian words. ◦ Affixation triggers glide-epenthesis, producing ijiet (i.e. i+iet > ijiet). ◦ With many ijiet-final plurals, speakers reanalyze this as a suffix and extend it to new words. ◦ Glide-epenthesis occurs elsewhere in Maltese and in other varieties of Arabic, but without the contact-induced conditions for reanalysis similar reanalyses did not occur. • -iet acquired two unique properties – restriction to words having a-final singulars; obligatory omission of stem-final vowels upon suffixation – due to a subsequent reanalysis of the pluralization of collective nouns. ◦ This reanalysis reflects a more general weakening of the Semitic element in Maltese. 2 1 7/20/2018 History of Maltese • Maltese is a Semitic language which has been shaped by extensive contact with Indo-European languages over the last millennium (Brincat 2011). ◦ 870 – Arab invasion depopulates the island of Malta. ◦ 1048 – Individuals from Muslim Sicily migrate to Malta, bringing with them Siculo-Arabic (Agius 1996), which would ultimately develop into Maltese. ◦ 1090 – Norman Conquest brings Siculo-Arabic into contact with Sicilian and Latin in Malta, cuts the language off from Classical Arabic and other Arabic varieties. ◦ 1530 – Malta is given to the Knights of St.
    [Show full text]
  • Tekerlemeli Ve Sözlü Oyunlarin KKTC'deki Okul Öncesi Çocukların Türkçe Dil Becerilerine Kazanımları
    Ghezewi N., Akbarov A., Socıal Influences On The Choıce Of A Lınguıstıc Varıant In Lıbyan Arabıc Dıalects Motif Akademi Halkbilimi Dergisi / Cilt:9, Sayı:17 / 2016 (Ocak – Haziran), s.105-112 SOCIAL INFLUENCES ON THE CHOICE OF A LINGUISTIC VARIANT IN LIBYAN ARABIC DIALECTS Libya Arap Lehçelerinde Dil Variant Seçimi Üzerindeki Sosyal Etkileri Nesrin GHEZEWI* & Azamat AKBAROV** Öz: Bu çalışmanın amacı, Arap dilinin farklı dil çeşitleri ve Libya’daki ikidillilik durumunu incelemektir . Yüksek ve düşük agizlari belirlemek icin Libya ( Trablusgarp ve Bingazi ) farklı bölgelerden gelen iki farklı düşük agiz benzerlikleri ve farklılıkları karsilastirildi. Bu şehirler, Tripoli doğu tarafında be Bingazi batı tarafında olmak uzere, Libya haritasında bir birine ters tarafta bulunmaktadır. Benzerlikler ve farklılıklar, örnekler ve tablolarin yanı sıra , Modern Standart Arapça aracılığıyla da gösterilmektedir . Kolonizasyon ve göç gibi farklılıklara katkıda bulunan etkiler da açıklanmıştır. Libya Arapça ikidillilik kavramı oldukça karmaşık oldugundan, çok daha fazla çalışma gerektirmektedir . Anahtar Kelimeler : Ikidillilik , diyalekt , Yüksek Lehçe , Düşük Lehçe, Libya , Klasik Arapça , Modern Standart Arapça , Kolonizasyon . Abstract: The purpose of this paper is to look at the different linguistic varieties of the Arabic language, and the diglossic situation in Libya. In order to determine the high and low variety/ varieties, two different low varieties from different regions in Libya (Tripoli and Benghazi) were compared and contrasted. These cities are on opposite sides of the Libyan map, Tripoli being on the western side and Benghazi on the eastern side. The similarities and differences are shown through examples and tables as well as how those varieties measure up to the Modern Standard Arabic. The influences that contributed to the differences such as colonization and migration are explained.
    [Show full text]
  • Automatic Detection of Arabicized Berber and Arabic Varieties
    Automatic Detection of Arabicized Berber and Arabic Varieties Wafia Adouane1, Nasredine Semmar2, Richard Johansson3, Victoria Bobicev4 Department of FLoV, University of Gothenburg, Sweden1 CEA Saclay – Nano-INNOV, Institut CARNOT CEA LIST, France2 Department of CSE, University of Gothenburg, Sweden3 Technical University of Moldova4 [email protected], [email protected] [email protected], [email protected] Abstract Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the first necessary step to do any language-dependent natural language pro- cessing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically pro- cessed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facili- tated the emergence of written format for these spoken languages based on pronunciation. The latter are not well represented on the Web, commonly referred to as under-resourced languages, and the current available ALI tools fail to properly recognize them. In this paper, we revisit the problem of ALI with the focus on Arabicized Berber and dialectal Arabic short texts. We intro- duce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.
    [Show full text]
  • Mapping Linguistic Variations in Colloquial Arabic Through Twitter a Centroid-Based Lexical Clustering Approach
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020 Mapping Linguistic Variations in Colloquial Arabic through Twitter A Centroid-based Lexical Clustering Approach 1 Abdulfattah Omar * Hamza Ethleb2 Mohamed Elarabawy Hashem3 Department of English Translation Department Department of English College of Sciences and Humanities Faculty of Languages College of Science and Arts in Tabarjal, Prince Sattam Bin Abdulaziz University University of Tripoli Jouf University, KSA Department of English, Faculty of Arts Tripoli, Libya Faculty of Languages and Translation Port Said University Cairo, Al-Azhar University, Egypt Abstract—The recent years have witnessed the development regional dialectology is so far based on traditional methods; of different computational approaches to the study of linguistic therefore, it is difficult to provide a comprehensive mapping variations and regional dialectology in different languages of the dialect variation of all the colloquial dialects of Arabic. including English, German, Spanish and Chinese. These As thus, this study is concerned with proposing a approaches have proved effective in dealing with large corpora computational model for mapping the linguistic variation and and making reliable generalizations about the data. In Arabic, regional dialectology in Colloquial Arabic through Twitter however, much of the work on regional dialectology is so far based on the lexical choices of speakers. The purpose of the based on traditional methods; therefore, it is difficult to provide study is to explore the lexical patterns for generating regional a comprehensive mapping of the dialectal variations of all the dialect maps as derived from Twitter users. In order to map colloquial dialects of Arabic.
    [Show full text]
  • Tunisian and Libyan Arabic Dialects Common Trends – Recent Developments – Diachronic Aspects
    Veronika Ritt-Benmimoun (ed.) Tunisian and Libyan Arabic Dialects Common Trends – Recent Developments – Diachronic Aspects PRENSAS DE LA UNIVERSIDAD DE ZARAGOZA TUNISIAN and Libyan Arabic Dialects : Common Trends – Recent Developments – Diachronic Aspects / Veronika Ritt-Benmimoun (ed.). — Zaragoza : Prensas de la Universidad de Zaragoza, 2017 389 p. ; 24 cm. — (Estudios de Dialectología Árabe ; 13) ISBN 978-84-16933-98-3 1. Lengua árabe–Dialectos. 2. Lengua árabe–Túnez. 3. Lengua árabe–Libia RITT-BENMIMOUN, Veronika 811.411.21’282(611) 811.411.21’282(612) Cualquier forma de reproducción, distribución, comunicación pública o transformación de esta obra solo puede ser realizada con la autorización de sus titulares, salvo excepción prevista por la ley. Diríjase a CEDRO (Centro Español de Derechos Reprográficos, www.cedro.org) si necesita fotocopiar o escanear algún fragmento de esta obra. © Veronika Ritt-Benmimoun © De la presente edición, Prensas de la Universidad de Zaragoza (Vicerrectorado de Cultura y Política Social) 1.ª edición, 2017 Diseño gráfico: Victor M. Lahuerta Colección Estudios de Dialectología Árabe, n.º 13 Prensas de la Universidad de Zaragoza. Edificio de Ciencias Geológicas, c/ Pedro Cerbuna, 12, 50009 Zaragoza, España. Tel.: 976 761 330. Fax: 976 761 063 [email protected] http://puz.unizar.es Esta editorial es miembro de la UNE, lo que garantiza la difusión y comercialización de sus publicaciones a nivel nacional e internacional. Impreso en España Imprime: Servicio de Publicaciones. Universidad de Zaragoza D.L.: Z xxx-2017 Fī (‘in’) as a Marker of the Progressive Aspect in Tunisian Arabic ∗ Karen MCNEIL 1. Introduction In this work I will be discussing the preposition fī and its use as an aspectual marker in Tunisian Arabic.
    [Show full text]
  • RESOURCED LANGUAGES Dialectal Arabic Short Texts
    DEPARTMENT OF PHILOSOPHY, LINGUISTICS AND THEORY OF SCIENCE AUTOMATIC DETECTION OF UNDER- RESOURCED LANGUAGES Dialectal Arabic Short Texts Wafia Adouane Master’s Thesis: 30 credits Programme: Master’s Programme in Language Technology Level: Advanced level Semester and year: Spring, 2016 Supervisor: Richard Johansson, Nasredine Semmar and Alan Said Examiner: Staffan Larsson Report number: (number will be provided by the administrators) Keywords: Under-resourced languages, Discrimination between similar languages and language varieties, Arabic varieties, Linguistic resource building Abstract Automatic Language Identification (ALI) is the first necessary step to do any language-dependent natural language processing task. It is the identification of the natural language of the input content by a machine. Being a well-established task in computational linguistics since early 1960's, various methods have been successfully applied to a wide range of languages. The state-of-the-art automatic language identifiers are based on character n-gram models trained on huge corpora. However, there are many natural languages which are not yet automatically processed. For instance, minority languages or informal forms of standard languages (general purpose languages used only in media/administration and taught at schools). Some of these languages are only spoken and do not exist in a written format. The use of social media platforms and new technologies have facilitated the emergence of written format for these spoken languages based on pronunciation. These new written languages are under- resourced, hence the current ALI tools fail to properly recognize them. In this study, we revisit the problem of ALI with the focus on discriminating under-resourced similar languages.
    [Show full text]