Building a Corpus for Palestinian Arabic: a Preliminary Study

Total Page:16

File Type:pdf, Size:1020Kb

Building a Corpus for Palestinian Arabic: a Preliminary Study Building a Corpus for Palestinian Arabic: a Preliminary Study Mustafa Jarrar, *Nizar Habash, Diyam Akra, Nasser Zalmout Birzeit University, West Bank, Palestine {mjarrar,nzalmout}@birzeit.edu, [email protected] *New York University Abu Dhabi, United Arab Emirates [email protected] the Arab World. Using such tools to understand Abstract and process Arabic dialects (DAs) is a challenging task because of the phonological and This paper presents preliminary results in morphological differences between DAs and building an annotated corpus of the MSA. In addition, there is no standard Palestinian Arabic dialect. The corpus orthography for DAs. Moreover, DAs have consists of about 43K words, stemming limited standardized written resources, since from diverse resources. The paper most of the written dialectal content is the result discusses some linguistic facts about the of ad hoc and unstructured social conversations Palestinian dialect, compared with the or commentary, in comparison to MSA’s vast Modern Standard Arabic, especially in body of literary works. terms of morphological, orthographic, and lexical variations, and suggests some The rest of this paper is structured as follows: directions to resolve the challenges these We present important linguistic background in differences pose to the annotation goal. Section 2, followed by a survey of related work Furthermore, we present two pilot in Section 3. We then present the process of studies that investigate whether existing collecting the Curras Corpus (Section 4) and the tools for processing Modern Standard challenges of annotating it (Section 5). Arabic and Egyptian Arabic can be used to speed up the annotation process of our 2. Linguistic Background Palestinian Arabic corpus. In this section we summarize some important linguistic facts about PAL that influence the 1. Introduction and Motivation decisions we made in this project. For more This paper presents preliminary results towards information on PAL and Levantine Arabic in building a high-coverage well-annotated corpus general, see (Rice and Sa’id, 1960; Cowell, of the Palestinian Arabic dialect (henceforth 1964; Bateson, 1967; Brustad, 2000; Halloun, PAL), which is part of an ongoing project called 2000; Holes, 2004; Elihai, 2004). For a Curras. Building such a PAL corpus is a first discussion of differences between Levantine and important step towards developing natural Egyptian Arabic (EGY), see Omar (1976). language processing (NLP) applications, for 2.1 Arabic and its dialects searching, retrieving, machine-translating, spell- checking PAL text, etc. The importance of The Arabic language is a collection of variants processing and understanding such text is among which a standard variety (MSA) has a increasing due to the exponential growth of special status, while the rest are considered socially generated dialectal content at recent colloquial dialects (Bateson, 1967, Holes, 2004; Social Media and Web 2.0 breakthroughs. Habash, 2010). MSA is the official written language of government, media and education in Most Arabic NLP tools and resources were the Arab World, but it is not anyone’s native developed to serve Modern Standard Arabic language; the spoken dialects vary widely across (MSA), which is the official written language in the Arab World and are the true native varieties 18 Proceedings of the EMNLP 2014 Workshop on Arabic Natural Langauge Processing (ANLP), pages 18–27, October 25, 2014, Doha, Qatar. c 2014 Association for Computational Linguistics of Arabic, yet they have no standard orthography dialects. The Druze dialect retains the /q/ and are not taught in schools (Habash et al., pronunciation. Another example is the /k/ k), which ك Zribi et al., 2014). phoneme (corresponding to MSA ,2012 realizes as /tš/ in rural dialects. These difference qlb ‘heart’ to be ﻗﻠﺐ PAL is the dialect spoken by Arabic speakers cause the word for who live in or originate from the area of pronounced as /qalb/, /’alb/, /kalb/ and /galb/ and ﻛﻠﺐ Historical Palestine. PAL is part of the South to be ambiguous out of context with the word Levantine Arabic dialect subgroup (of which klb ‘dog’ /kalb/ and /tšalb/. And similarly to Jordanian Arabic is another dialect). PAL is EGY (but unlike Tunisian Arabic), the MSA θ) becomes /s/ or /t/, and the ث) /historically the result of interaction between phoneme /θ ð) becomes /z/ or /d/ in ذ) /Syriac and Arabic and has been influenced by MSA phoneme /ð kðb ﻛﺬب many other regional language such as Turkish, different lexical contexts, e.g., MSA Persian, English and most recently Hebrew. The /kaðib/ ‘lying’ is pronounced /kizib/ in PAL and Palestinian refugee problem has led to additional /kidb/ in EGY. mixing among different PAL sub-dialects as well as borrowing from other Arabic dialects. We Similar to many other dialects, e.g. EGY and discuss next some of the important Tunisian (Habash et al., 2012; Zribi et al., 2014), distinguishing features of PAL in comparison to the glottal stop phoneme that appears in many MSA as well as other Arabic dialects. We MSA words has disappeared in PAL: compare /bŷr /bi’r ﺑﺌﺮ rÂs /ra’s/ ‘head’ and رأس consider the following dimensions: phonology, MSA morphology, and lexicon. Like other Arabic ‘well’ with their Palestinian urban versions: /rās/ dialects, PAL has no standard orthography. and /bīr/. Also, the MSA diphthongs /ay/ and /aw/ generally become /ē/ and /ō/; this 2.2 Phonology transformation happens in EGY but not in other PAL consists of several sub-dialects that Levantine dialects such as Lebanese, e.g., MSA ./byt /bayt/ ‘house’ becomes PAL /bēt ﺑﯿﺖ generally vary in terms of phonology and lexicon preferences. Commonly identified sub- dialects include urban (which itself varies mostly PAL also elides many short vowels that appear phonologically among the major cities such as in the MSA cognates leading to heavier syllabic jibāl/ ‘mountains’ (and/ ﺟﺒﺎل Jerusalem, Jaffa, Gaza, Nazareth, Nablus and structure, e.g. MSA Hebron), rural, and Bedouin. The Druze EGY /gibāl/) becomes PAL /jbāl/. Additionally community has also some distinctive long vowels in unstressed positions in some PAL phonological features that set it apart. The sub-dialects shorten, a phenomenon shared with زاروا) /variations are a miniature version of the EGY but not MSA: e.g., compare /zāru زاروه) /variations in Levantine Arabic in general. zAr+uwA) ‘they visited’ with /zarū Perhaps the most salient variation is the zAr+uw+h) ‘they visited him’. Finally, PAL has pronunciation of the /q/ phoneme (corresponding commonly inserted epenthetic vowels q1), which realizes as /’/ in most urban (Herzallah, 1990), which are optional in some ق to MSA dialects, /k/ in rural dialects, and /g/ in Bedouin cases leading to multiple pronunciations of the klb ﻛﻠﺐ) /same word, e.g., /kalb/ and /kalib 1Arabic orthographic transliterations are provided in the ‘dog’). This multiplicity is not shared with MSA, Habash-Soudi-Buckwalter (HSB) scheme (Habash et al., which has a simpler syllabic structure and more 2007), except where indicated. HSB extends Buckwalter’s limited epenthesis than PAL. transliteration scheme (Buckwalter, 2004) to increase its readability while maintaining the 1-to-1 correspondence 2.3 Morphology with Arabic orthography as represented in standard encodings of Arabic, i.e., Unicode, etc. The following are PAL, like MSA and its dialects and other the only differences from Buckwalter’s scheme (indicated Semitic languages, makes extensive use of templatic morphology in addition to a large set ة ħ ,({) ئ ŷ ,(>) إ Ǎ ,(&) ؤ ŵ ,(<) أ Â ,(|) آ in parentheses): Ā of affixations and clitics. There are however ى g), ý) غ E), γ) ع Z), ς) ظ Ď ,($) ش š ,(*) ذ v), ð) ث p), θ) K). Orthographic transliterations are) ـ ٍ N), ĩ) ـ ٌ F), ũ) ـ ً Y), ã) presented in italics. For phonological transcriptions, we some important differences between MSA and follow the common practice of using ‘/.../’ to represent PAL in terms of morphology. First, like many phonological sequences and we use HSB choices with some other dialects, PAL lost nominal case and verbal extensions instead of the International Phonetic Alphabet mood, which remain in MSA. Additionally, PAL (IPA) to minimize the number of representations used, as was done by Habash (2010). in most of its sub-dialects collapses the feminine and masculine plurals and duals in verbs and 19 most nouns. Some specific inflections are 3. Related Work Hbyt ﺣﺒﯿﺖ ,.ambiguous in PAL but not MSA, e.g /Habbēt/ ‘I (or you [m.s.]) loved’. 3.1 Corpus Collection and Annotation There have been many contributions aiming to Second, some specific morphemes are slightly or develop annotated Arabic language corpora, with quite different in PAL from their MSA forms, the main objective of facilitating Arabic NLP e.g., the future marker is /sa/ in MSA but /Ha/ or applications. Notable contributions targeting /raH/ in PAL. Another prominent example is the MSA include the work of Maamouri and Cieri, feminine singular suffix morpheme (Ta (2002), Maamouri et al. (2004), Smrž and Hajič Marbuta), which in MSA is pronounced as /at/ (2006), and Habash and Roth (2009). These except at utterance final positions (where it is efforts developed annotation guidelines for /a/). In some PAL urban sub dialects, it has written MSA content producing large-scale multiple allomorphs that are phonologically and Arabic Treebanks. syntactically conditioned: /a/ (after non-front and emphatic consonants), /e/ (after front non- Contributions that are specific to DA include the emphatic consonants), /it/ (nouns in construct development of a pilot Levantine Arabic state such as before possessive pronouns) and /ā/ Treebank (LATB) of Jordanian Arabic, which bTħ contained morphological and syntactic ﺑﻄﺔ .in deverbals before direct objects): e.g) annotations of about 26,000 words (Maamouri et ﺑﻄﺘﻨﺎ ,’Hbħ /Habb+e/ ‘pill ﺣﺒﺔ ,’baTT+a/ ‘duck/ bTnA /baTT+it+na/ ‘our duck’ and /mdars+ā al., 2006). To speed up the process of creating +hum/ ‘she taught them’.
Recommended publications
  • Issues in the Syntax of Arabic
    Cambridge University Press 978-0-521-65017-5 - The Syntax of Arabic Joseph E. Aoun, Elabbas Benmamoun and Lina Choueiri Excerpt More information 1 Issues in the syntax of Arabic 1.1 The Arabic language(s) Arabic belongs to the Semitic branch of the Afro-Asiatic (Hamito- Semitic) family of languages, which includes languages like Aramaic, Ethiopian, South Arabian, Syriac, and Hebrew. A number of the languages in this group are spoken in the Middle East, the Arabian Peninsula, and Africa. It has been docu- mented that Arabic spread with the Islamic conquests from the Arabian Peninsula and within a few decades, it spread over a wide territory across North Africa and the Middle East. Arabic is now spoken by more than 200 million speakers excluding bilingual speakers (Gordon 2005). Although there is a debate about the history of Arabic (including that of the Standard variety and the spoken dialects) Arabic displays some of the typical characteristics of Semitic languages: root-pattern morphology, broken plurals in nouns, emphatic and glottalized consonants, and a verbal system with prefix and suffix conjugation. 1.1.1 The development of Arabic Classical Arabic evolved from the standardization of the language of the Qur’an and poetry. This standardization became necessary at the time when Arabic became the language of an empire, with the Islamic expansion starting in the seventh century. In addition to Classical Arabic, there were regional spo- ken Arabic varieties. It is a matter of intense debate what the nature of the historical relation between Classical Arabic and the spoken dialects is (Owens 2007).
    [Show full text]
  • Arabic Sociolinguistics: Topics in Diglossia, Gender, Identity, And
    Arabic Sociolinguistics Arabic Sociolinguistics Reem Bassiouney Edinburgh University Press © Reem Bassiouney, 2009 Edinburgh University Press Ltd 22 George Square, Edinburgh Typeset in ll/13pt Ehrhardt by Servis Filmsetting Ltd, Stockport, Cheshire, and printed and bound in Great Britain by CPI Antony Rowe, Chippenham and East bourne A CIP record for this book is available from the British Library ISBN 978 0 7486 2373 0 (hardback) ISBN 978 0 7486 2374 7 (paperback) The right ofReem Bassiouney to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. Contents Acknowledgements viii List of charts, maps and tables x List of abbreviations xii Conventions used in this book xiv Introduction 1 1. Diglossia and dialect groups in the Arab world 9 1.1 Diglossia 10 1.1.1 Anoverviewofthestudyofdiglossia 10 1.1.2 Theories that explain diglossia in terms oflevels 14 1.1.3 The idea ofEducated Spoken Arabic 16 1.2 Dialects/varieties in the Arab world 18 1.2. 1 The concept ofprestige as different from that ofstandard 18 1.2.2 Groups ofdialects in the Arab world 19 1.3 Conclusion 26 2. Code-switching 28 2.1 Introduction 29 2.2 Problem of terminology: code-switching and code-mixing 30 2.3 Code-switching and diglossia 31 2.4 The study of constraints on code-switching in relation to the Arab world 31 2.4. 1 Structural constraints on classic code-switching 31 2.4.2 Structural constraints on diglossic switching 42 2.5 Motivations for code-switching 59 2.
    [Show full text]
  • Arabic and Contact-Induced Change Christopher Lucas, Stefano Manfredi
    Arabic and Contact-Induced Change Christopher Lucas, Stefano Manfredi To cite this version: Christopher Lucas, Stefano Manfredi. Arabic and Contact-Induced Change. 2020. halshs-03094950 HAL Id: halshs-03094950 https://halshs.archives-ouvertes.fr/halshs-03094950 Submitted on 15 Jan 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Arabic and contact-induced change Edited by Christopher Lucas Stefano Manfredi language Contact and Multilingualism 1 science press Contact and Multilingualism Editors: Isabelle Léglise (CNRS SeDyL), Stefano Manfredi (CNRS SeDyL) In this series: 1. Lucas, Christopher & Stefano Manfredi (eds.). Arabic and contact-induced change. Arabic and contact-induced change Edited by Christopher Lucas Stefano Manfredi language science press Lucas, Christopher & Stefano Manfredi (eds.). 2020. Arabic and contact-induced change (Contact and Multilingualism 1). Berlin: Language Science Press. This title can be downloaded at: http://langsci-press.org/catalog/book/235 © 2020, the authors Published under the Creative Commons Attribution
    [Show full text]
  • A Tale of Two Morphologies
    A Tale of Two Morphologies Verb structure and argument alternations in Maltese Dissertation zur Erlangung des akademischen Grades eines Doktors der Philosophie vorgelegt von Spagnol, Michael an der Geisteswissenschaftliche Sektion Sprachwissenschaft 1. Referent: Prof. Dr. Frans Plank 2. Referent: Prof. Dr. Christoph Schwarze 3. Referent: Prof. Dr. Albert Borg To my late Nannu Kieli, a great story teller Contents Acknowledgments ............................................................................................................................. iii Notational conventions .................................................................................................................... v Abstract ............................................................................................................................................... viii Ch. 1. Introduction ............................................................................................................................. 1 1.1. A tale to be told ............................................................................................................................................. 2 1.2 Three sides to every tale ........................................................................................................................... 4 Ch. 2. Setting the stage ...................................................................................................................... 9 2.1. No language is an island .......................................................................................................................
    [Show full text]
  • The Routledge Handbook of Arabic Linguistics
    Review Copy - Not for Distribution Youssef A Haddad - University of Florida - 02/01/2018 THE ROUTLEDGE HANDBOOK OF ARABIC LINGUISTICS The Routledge Handbook of Arabic Linguistics introduces readers to the major facets of research on Arabic and of the linguistic situation in the Arabic-speaking world. The edited collection includes chapters from prominent experts on various fields of Arabic linguistics. The contributors provide overviews of the state of the art in their field and specifically focus on ideas and issues. Not simply an overview of the field, this handbook explores subjects in great depth and from multiple perspectives. In addition to the traditional areas of Arabic linguistics, the handbook covers computational approaches to Arabic, Arabic in the diaspora, neurolinguistic approaches to Arabic, and Arabic as a global language. The Routledge Handbook of Arabic Linguistics is a much-needed resource for researchers on Arabic and comparative linguistics, syntax, morphology, computational linguistics, psycholinguistics, sociolinguistics, and applied linguistics, and also for undergraduate and graduate students studying Arabic or linguistics. Elabbas Benmamoun is Professor of Asian and Middle Eastern Studies and Linguistics at Duke University, USA. Reem Bassiouney is Professor in the Applied Linguistics Department at the American University in Cairo, Egypt. Review Copy - Not for Distribution Youssef A Haddad - University of Florida - 02/01/2018 Review Copy - Not for Distribution Youssef A Haddad - University of Florida - 02/01/2018
    [Show full text]
  • AIDA Bibliographie
    George Grigore A Bibliography of AIDA Association Internationale de Dialectologie Arabe (1992-2017) Descrierea CIP a Bibliotecii Naţionale a României GRIGORE, GEORGE A bibliography of AIDA (Association Internationale de Dialectologie Arabe) : (1992-2017) / George Grigore ; pref.: Dominique Caubert, George Grigore, Stephan Procházka. - Iaşi : Ars Longa, 2016 ISBN 978-973-148-245-3 I. Caubert, Dominique (pref.) II. Procházka, Stephan (pref.) 061:811.411.21'28"1992-2017" © George Grigore © ARS LONGA, 2016 str. Elena Doamna, 2 700398 Iaşi, România Tel.: 0724 516 581 Fax: +40-232-215078 e-mail: [email protected]; [email protected] web: www.arslonga.ro All rights reserved. George Grigore A Bibliography of AIDA Association Internationale de Dialectologie Arabe (1992-2017) Foreword: AIDA – A brief history by Dominique Caubet George Grigore Stephan Procházka Ars Longa 2016 AIDA (Association Internationale de Dialectologie Arabe) – A brief history1 – AIDA (fr. Association Internationale de Dialectologie Arabe) – International Association of Arabic Dialectology / is an association of researchers – الرابطة الدولية لدراسة اللهجات العربية in Arabic dialects, Nowadays AIDA is the leading international association in this field of research and it has become a platform that joins scholars from all over the world, interested in various aspects of Arabic dialectology. Therefore, the idea of founding an association gathering the high-rated specialists in Arabic dialects was discussed, for the first time, between Dominique Caubet (the future founder of AIDA) and a number of dialectologists such as Clive Holes (Oriental Studies, Trinity Hall, Cambridge), Bruce Ingham (The School of Oriental and African Studies, London), Otto Jastrow (Heidelberg University), Catherine Miller (Director of IREMAM, Institut de Recherches et d’Etudes sur le Monde 1 Many thanks to our colleagues Catherine Miller and Martine Vanhove for helping us to clarify the beginnings of AIDA.
    [Show full text]
  • Current Research on Linguistic Variation in the Arabic-Speaking World
    Language and Linguistics Compass 10/8 (2016): 370–381, 10.1111/lnc3.12202 Current Research on Linguistic Variation in the Arabic-Speaking World Uri Horesh1* and William M. Cotter2 1Department of Language and Linguistics, University of Essex 2School of Anthropology and Department of Linguistics, The University of Arizona Abstract Given its abundance of dialects, varieties, styles, and registers, Arabic lends itself easily to the study of language variation and change. It is spoken by some 300 million people in an area spanning roughly from northwest Africa to the Persian Gulf. Traditional Arabic dialectology has dealt predominantly with geographical variation. However, in recent years, more nuanced studies of inter- and intra-speaker variation have seen the light of day. In some respects, Arabic sociolinguistics is still lagging behind the field compared to variationist studies in English and other Western languages. On the other hand, the insight presented in studies of Arabic can and should be considered in the course of shaping a crosslinguistic sociolinguistic theory. Variationist studies of Arabic-speaking speech communities began almost two de- cades after Labov’s pioneering studies of American English and have f lourished following the turn of the 21st century. These studies have sparked debates between more quantitatively inclined sociolinguists and those who value qualitative analysis. In reality, virtually no sociolinguistic study of Arabic that includes statistical modeling is free of qualitative insights. They are also not f lawless and not always cutting edge methodologically or theoretically, but the field is moving in a positive direction, which will likely lead to the recognition of its significance to sociolinguistics at large.
    [Show full text]
  • Chapter 4 Arabic in Iraq, Syria, and Southern Turkey Stephan Procházka University of Vienna
    Chapter 4 Arabic in Iraq, Syria, and southern Turkey Stephan Procházka University of Vienna This chapter covers the Arabic dialects spoken in the region stretching from the Turkish province of Mersin in the west to Iraq in the east, including Lebanon and Syria. The area is characterized by a high degree of linguistic diversity, and for about two and a half millennia Arabic has come into contact with various other Semitic languages, as well as with Indo-European languages and Turkish. Bilin- gualism, particularly with Aramaic, Kurdish, and Turkish, has resulted in numer- ous contact-induced changes in all realms of grammar, including morphology and syntax. 1 Current state and historical development The region discussed in this chapter is linguistically extremely heterogeneous: in it three different Arabic dialect groups, plus several other languages, are spo- ken. The two main Arabic dialect groups are Syrian and Iraqi, the distribution of which does not exactly correspond to the political boundaries of those two countries. Syrian-type dialects are also spoken in Lebanon, in three provinces of southern Turkey (Mersin, Adana,1 Hatay), and in one village on Cyprus (cf. Walter, this volume). In Iraq, Arabic is mainly spoken in Mesopotamia proper, whereas considerable parts of the mountainous parts of the country are Kurdish- speaking. Arabic dialects which are very akin to the Iraqi ones extend into north- eastern Syria and southeastern Anatolia (for the latter see Akkuş, this volume). These two groups are geographically divided by a third dialect group, which ar- rived in the region with an originally (semi-) nomadic population from northern 1The dialects spoken in Mersin and Adana provinces will henceforth referred to as Cilician Arabic.
    [Show full text]
  • Diacritization of Moroccan and Tunisian Arabic Dialects: a CRF Approach
    Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach Kareem Darwish∗, Ahmed Abdelali∗, Hamdy Mubarak∗, Younes Samihy, Mohammed Attia? ∗QCRI, yUniversity of Dusseldorf, ?Google Inc. fkdarwish, aabdelali, [email protected], [email protected], [email protected] Abstract Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though Automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty of mutual intelligibility between dialects. In this paper we present our research and benchmark results on the automatic diacritization of two Maghrebi sub-dialects, namely Tunisian and Moroccan, using Conditional Random Fields (CRF). Aside from using character n-grams as features, we also employ character-level Brown clusters, which are hierarchical clusters of characters based on the contexts in which they appear. We achieved word-level diacritization errors of 2.9% and 3.8% for Moroccan and Tunisian respectively. We also show that effective diacritization can be performed out-of-context for both sub-dialects. Keywords: Arabic, Dialects, Vowelization, Diacritization 1. Introduction • We explore the use of cross dialect and joint dialect train- ing between Moroccan and Tunisian, highlighting the or- Different varieties of Arabic are typically written with- thographic and phonetic similarities and dissimilarities of out diacritics (short vowels). Arabic readers disambiguate both sub-dialects.
    [Show full text]
  • Classifying ASR Transcriptions According to Arabic Dialect
    Classifying ASR Transcriptions According to Arabic Dialect Abualsoud Hanani, Aziz Qaroush Stephen Taylor Electrical & Computer Engineering Dept Computer Science Dept Birzeit University Fitchburg State University West Bank, Palestine Fitchburg, MA, USA ahanani,qaroush @birzeit.edu [email protected] { } Abstract We describe several systems for identifying short samples of Arabic dialects, which were pre- pared for the shared task of the 2016 DSL Workshop (Malmasi et al., 2016). Our best system, an SVM using character tri-gram features, achieved an accuracy on the test data for the task of 0.4279, compared to a baseline of 0.20 for chance guesses or 0.2279 if we had always chosen the same most frequent class in the test set. This compares with the results of the team with the best weighted F1 score, which was an accuracy of 0.5117. The team entries seem to fall into cohorts, with the all the teams in a cohort within a standard-deviation of each other, and our three entries are in the third cohort, which is about seven standard deviations from the top. 1 Introduction In 2016 the Distinguishing Similar Languages workshop (Malmasi et al., 2016) added a shared task to classify short segments of text as one of five Arabic dialects. The workshop organizers provided a training file and a schedule. After allowing the participants development time, they distributed a test file, and evaluated the success of participating systems. We built several systems for dialect classification, and submitted runs from three of them. Interestingly, our results on the workshop test data were not consistent with our tests on reserved training data.
    [Show full text]
  • Arabic Diglossia Within Palestinian-Arab Folk Narratives
    Portland State University PDXScholar University Honors Theses University Honors College 2017 Arabic Diglossia within Palestinian-Arab Folk Narratives Jordan M. Martinez Portland State University Follow this and additional works at: https://pdxscholar.library.pdx.edu/honorstheses Let us know how access to this document benefits ou.y Recommended Citation Martinez, Jordan M., "Arabic Diglossia within Palestinian-Arab Folk Narratives" (2017). University Honors Theses. Paper 421. https://doi.org/10.15760/honors.417 This Thesis is brought to you for free and open access. It has been accepted for inclusion in University Honors Theses by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: [email protected]. ARABIC DIGLOSSIA WITHIN PALESTINIAN-ARAB FOLK NARRATIVES Jordan Martinez Dr. Dirgham H. Sbait Honors 403: Undergraduate Thesis June 15, 2017 1 Acknowledgements I would like to give a special thanks to my thesis advisor, Dr. Sbait, for spending numerous hours working with me this past year. This project would have been impossible without his guidance and profound knowledge of the Arabic language. I wish you luck in your retirement! I would also like to acknowledge the work put forth by Ibrahim Muhawi and Sharif Kanaana. Their work with Speak Bird, Speak Again is unprecedented, and will always be appreciated. 2 Abstract The preservation of Palestinian folktales is a way to restore and preserve Palestinian culture in result to the declining state and displacement of Palestinian populations into other regions of the Arabic World (Muhawi and Kanaana 1989). Some academics have acknowledged that the translation of folklore and other similar tales have contributed to a growing consideration of foreign cultures and the overall understanding by non-native sources (Klaus Roth 1998) (Kanaana 2005).
    [Show full text]
  • Arabic Dialects (General Article)
    This is a repository copy of Arabic dialects (general article). White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/76443/ Version: Accepted Version Book Section: Watson, J (2011) Arabic dialects (general article). In: Weninger, S, Khan, G, Streck, M and Watson, JCE, (eds.) The Semitic Languages: An international handbook. Handbücher zur Sprach- und Kommunikationswissenschaft / Handbooks of Linguistics and Communication Science (HSK) . Walter de Gruyter , Berlin , pp. 851-896. ISBN 978-3-11-025158-6 This article is protected by copyright. Reproduced in accordance with De Gruyter self-archiving policy. https://www.degruyter.com/viewbooktoc/product/175227? rskey=gAMxxa&result=1 Reuse See Attached Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request. [email protected] https://eprints.whiterose.ac.uk/ 312 50. Arabic Dialects (general article) 841 278 McLoughlin, L. 279280 1972 Towards a definition of modern standard Arabic. Archivum Linguisticum (new series) 281 3, 57Ϫ73. 282 Monteil, V. 283284 1960 L’arabe moderne (Etudes arabes et islamiques 4). Paris: Klinksieck. 285 Parkinson, D. 286287 1991 Searching for modern fushâ: Real-life formal Arabic. Al-Arabiyya 24, 31Ϫ64. 288 Ryding, K. 289290 2005a A Reference Grammar of Modern Standard Arabic. Cambridge: Cambridge Univer- 291 sity Press. 292 Ryding, K. 293294 2005b Educated Arabic. Encyclopedia of Arabic Language and Linguistics, Vol. 1 (Leiden: 295 Brill) 666Ϫ671. 296 Ryding, K. 297298 2006 Teaching Arabic in the United States.
    [Show full text]