A Computational Model of Modern Standard Arabic Verbal Morphology Based on Generation
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Automatic Identification of Arabic Language Varieties and Dialects in Social Media
Automatic Identification of Arabic Language Varieties and Dialects in Social Media Fatiha Sadat Farnazeh Kazemi Atefeh Farzindar University of Quebec in NLP Technologies Inc. NLP Technologies Inc. Montreal, 201 President Ken- 52 Le Royer Street W., 52 Le Royer Street W., nedy, Montreal, QC, Canada Montreal, QC, Canada Montreal, QC, Canada [email protected] [email protected] [email protected] Abstract Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dia- lects (AD) or daily language differs from MSA especially in social media communication. However, most Arabic social media texts have mixed forms and many variations especially be- tween MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers with detailed examination of what models perform best under different condi- tions in social media context. Experimental results show that Naive Bayes classifier based on character bi-gram model can identify the 18 different Arabic dialects with a considerable over- all accuracy of 98%. 1 Introduction Arabic is a morphologically rich and complex language, which presents significant challenges for nat- ural language processing and its applications. It is the official language in 22 countries spoken by more than 350 million people around the world1. Moreover, the Arabic language exists in a state of diglossia where the standard form of the language, Modern Standard Arabic (MSA) and the regional dialects (AD) live side-by-side and are closely related (Elfardy and Diab, 2013). -
A Sociolinguistic View of "Hazl" in the Andalusian Arabic "Muwashshaḥ
A Sociolinguistic View of "hazl" in the Andalusian Arabic "muwashshaḥ" Author(s): David Hanlon Source: Bulletin of the School of Oriental and African Studies, University of London, Vol. 60, No. 1 (1997), pp. 35-46 Published by: Cambridge University Press on behalf of School of Oriental and African Studies Stable URL: http://www.jstor.org/stable/620768 Accessed: 31-08-2015 13:05 UTC Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/ info/about/policies/terms.jsp JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Cambridge University Press and School of Oriental and African Studies are collaborating with JSTOR to digitize, preserve and extend access to Bulletin of the School of Oriental and African Studies, University of London. http://www.jstor.org This content downloaded from 163.10.30.70 on Mon, 31 Aug 2015 13:05:02 UTC All use subject to JSTOR Terms and Conditions A sociolinguisticview of hazl in the Andalusian Arabic muwashshah DAVID HANLON BirkbeckCollege, London The documented history of the theory of the muwashshah and one of its constituent parts, the kharja, spans almost 800 years: from Ibn Sana' al-Mulk (d. 608/1211) to the present day. Apologists for the various theories broadly belong to one of two schools, which for the sake of convenience I shall label 'integralist' and 'partialist'. -
Towards Understanding the Status of the Dual in Pre-Islamic Arabic
Towards Understanding the Status of the Dual in Pre-Islamic Arabic MUHAMMAD AL-SHARKAWI (Wayne State University, Detroit) Abstract This article suggests that the dual suffix in pre-Islamic Arabic did not differentiate for case. Tamīm, one of the most trustworthy pre-Islamic dialects, treated the dual suffix invariably although it had a full case system. There are also tokens of the same invariable treatment in the Qurʾān. The article proposes that the suffix long vowel variation due to the phenomenon of ʾimāla makes the formal origin of the invariable dual suffix difficult to ascribe to the East and Northwest Semitic oblique dual allomorph. Keywords: Dual, pre-Islamic Arabic, ʾimāla, Classical Arabic, vowel harmony. Introduction This article discusses data on the dual suffix in pre-Islamic dialects from medieval Arab grammarians and manuals of qirāʾāt to suggest that the status of the dual suffix in the pre- Islamic Arabic linguistic situation was unique among the Semitic languages.1 The article does not, however, seek to take a comparative Semitic framework. It rather seeks to discuss the dual suffix behavior on the eve of the Arab conquests and probably immediately thereafter. Although attempts to understand particular structural concepts of pre-Islamic Arabic are forthcoming, the formal, functional and semantic shape of the dual system remains to be studied in detail. In addition, despite the limited and sporadic data about the morphological and syntactic aspects of pre-Islamic Arabic,2 the dual suffix3 is one of the features of pre-Islamic Arabic dialects that can shed light on both the position of grammati- cal case4 in the Arabic dialects in the peninsula, and how it came to be standardized after the emergence of Islam. -
Saudi Dialects: Are They Endangered?
Academic Research Publishing Group English Literature and Language Review ISSN(e): 2412-1703, ISSN(p): 2413-8827 Vol. 2, No. 12, pp: 131-141, 2016 URL: http://arpgweb.com/?ic=journal&journal=9&info=aims Saudi Dialects: Are They Endangered? Salih Alzahrani Taif University, Saudi Arabia Abstract: Krauss, among others, claims that languages will face death in the coming centuries (Krauss, 1992). Austin (2010a) lists 7,000 languages as existing and spoken in the world today. Krauss estimates that this figure could come down to 600. That is, most the world's languages are endangered. Therefore, an endangered language is a language that loses her speakers within a few generations. According to Dorian (1981), there is what is called ―tip‖ in language endangerment. He argues that a language's decline can start slowly but suddenly goes through a rapid decline towards the extinction. Thus, languages must be protected at much earlier stage. Arabic dialects such as Zahrani Spoken Arabic (ZSA), and Faifi Spoken Arabic (henceforth, FSA), which are spoken in the southern region of Saudi Arabia, have not been studied, yet. Few people speak these dialects, among many other dialects in the same region. However, the problem is that most these dialects' native speakers are moving to other regions in Saudi Arabia where they use other different dialects. Therefore, are these dialects endangered? What other factors may cause its endangerment? Have they been documented before? What shall we do? This paper discusses three main different points regarding this issue: language and endangerment, languages documentation and description and Arabic language and its family, giving a brief history of Saudi dialects comparing their situation with the whole existing dialects. -
A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech Dror Kamir Naama Soreq Yoni Neeman Melingo Ltd. Melingo Ltd. Melingo Ltd. 16 Totseret Haaretz st. 16 Totseret Haaretz st. 16 Totseret Haaretz st. Tel-Aviv, Israel Tel-Aviv, Israel Tel-Aviv, Israel [email protected] [email protected] [email protected] Abstract 1 Introduction This paper presents a comprehensive NLP sys- 1.1 The common Semitic basis from an NLP tem by Melingo that has been recently developed standpoint for Arabic, based on MorfixTM – an operational formerly developed highly successful comprehen- Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich sive Hebrew NLP system. morphology, based on consonantal roots (Jiðr / The system discussed includes modules for Šoreš)1, which depends on vowel changes and in morphological analysis, context sensitive lemmati- some cases consonantal insertions and deletions to zation, vocalization, text-to-phoneme conversion, create inflections and derivations.2 and syntactic-analysis-based prosody (intonation) For example, in MSA: the consonantal root model. It is employed in applications such as full /ktb/ combined with the vocalic pattern CaCaCa text search, information retrieval, text categoriza- derives the verb kataba ‘to write’. This derivation tion, textual data mining, online contextual dic- is further inflected into forms that indicate seman- tionaries, filtering, and text-to-speech applications tic features, such as number, gender, tense etc.: katab-tu ‘I wrote’, katab-ta ‘you (sing. masc.) in the fields of telephony and accessibility and wrote’, katab-ti ‘you (sing. fem.) wrote, ?a-ktubu could serve as a handy accessory for non-fluent ‘I write/will write’, etc. -
Arabic Alphabet - Wikipedia, the Free Encyclopedia Arabic Alphabet from Wikipedia, the Free Encyclopedia
2/14/13 Arabic alphabet - Wikipedia, the free encyclopedia Arabic alphabet From Wikipedia, the free encyclopedia َأﺑْ َﺠ ِﺪﯾﱠﺔ َﻋ َﺮﺑِﯿﱠﺔ :The Arabic alphabet (Arabic ’abjadiyyah ‘arabiyyah) or Arabic abjad is Arabic abjad the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually[1] stand for consonants, it is classified as an abjad. Type Abjad Languages Arabic Time 400 to the present period Parent Proto-Sinaitic systems Phoenician Aramaic Syriac Nabataean Arabic abjad Child N'Ko alphabet systems ISO 15924 Arab, 160 Direction Right-to-left Unicode Arabic alias Unicode U+0600 to U+06FF range (http://www.unicode.org/charts/PDF/U0600.pdf) U+0750 to U+077F (http://www.unicode.org/charts/PDF/U0750.pdf) U+08A0 to U+08FF (http://www.unicode.org/charts/PDF/U08A0.pdf) U+FB50 to U+FDFF (http://www.unicode.org/charts/PDF/UFB50.pdf) U+FE70 to U+FEFF (http://www.unicode.org/charts/PDF/UFE70.pdf) U+1EE00 to U+1EEFF (http://www.unicode.org/charts/PDF/U1EE00.pdf) Note: This page may contain IPA phonetic symbols. Arabic alphabet ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع en.wikipedia.org/wiki/Arabic_alphabet 1/20 2/14/13 Arabic alphabet - Wikipedia, the free encyclopedia غ ف ق ك ل م ن ه و ي History · Transliteration ء Diacritics · Hamza Numerals · Numeration V · T · E (//en.wikipedia.org/w/index.php?title=Template:Arabic_alphabet&action=edit) Contents 1 Consonants 1.1 Alphabetical order 1.2 Letter forms 1.2.1 Table of basic letters 1.2.2 Further notes -
Considerations About Semitic Etyma in De Vaan's Latin Etymological Dictionary
applyparastyle “fig//caption/p[1]” parastyle “FigCapt” Philology, vol. 4/2018/2019, pp. 35–156 © 2019 Ephraim Nissan - DOI https://doi.org/10.3726/PHIL042019.2 2019 Considerations about Semitic Etyma in de Vaan’s Latin Etymological Dictionary: Terms for Plants, 4 Domestic Animals, Tools or Vessels Ephraim Nissan 00 35 Abstract In this long study, our point of departure is particular entries in Michiel de Vaan’s Latin Etymological Dictionary (2008). We are interested in possibly Semitic etyma. Among 156 the other things, we consider controversies not just concerning individual etymologies, but also concerning approaches. We provide a detailed discussion of names for plants, but we also consider names for domestic animals. 2018/2019 Keywords Latin etymologies, Historical linguistics, Semitic loanwords in antiquity, Botany, Zoonyms, Controversies. Contents Considerations about Semitic Etyma in de Vaan’s 1. Introduction Latin Etymological Dictionary: Terms for Plants, Domestic Animals, Tools or Vessels 35 In his article “Il problema dei semitismi antichi nel latino”, Paolo Martino Ephraim Nissan 35 (1993) at the very beginning lamented the neglect of Semitic etymolo- gies for Archaic and Classical Latin; as opposed to survivals from a sub- strate and to terms of Etruscan, Italic, Greek, Celtic origin, when it comes to loanwords of certain direct Semitic origin in Latin, Martino remarked, such loanwords have been only admitted in a surprisingly exiguous num- ber of cases, when they were not met with outright rejection, as though they merely were fanciful constructs:1 In seguito alle recenti acquisizioni archeologiche ed epigrafiche che hanno documen- tato una densità finora insospettata di contatti tra Semiti (soprattutto Fenici, Aramei e 1 If one thinks what one could come across in the 1890s (see below), fanciful constructs were not a rarity. -
Ugaritic Studies and the Hebrew Bible, 1968-1998 (With an Excursus on Judean Monotheism and the Ugaritic Texts)
UGARITIC STUDIES AND THE HEBREW BIBLE, 1968-1998 (WITH AN EXCURSUS ON JUDEAN MONOTHEISM AND THE UGARITIC TEXTS) by MARK S. SMITH Philadelphia This presentation offers a synopsis of the major texts and tools as well as intellectual topics and trends that have dominated the field of Ugaritic-biblical studies since 1968. 1 This year marked a particu lar watershed with the publication of Ugaritica V 2 which included fourteen new Ugaritic texts and several polyglots. For the sake of convenience, this discussion is divided into three periods: 1968-1985, 1985 to the present, and prospects for the future. The texts and tools for each period are listed, followed by a brief discussion of the intellectual topics and trends. This survey is hardly exhaustive; it is intended instead to be representative. 1 Earlier technical surveys of Ugaritic-biblical studies include W.F. Albright, "The Old Testament and Canaanite Language and Literature", CBQ. 7 (1945), pp. 5-31; H. Donner, "Ugaritismen in der Psalmenforschung", ZAW 79 (1967), pp. 322-50; H.L. Ginsberg, "Ugaritic Studies and the Bible", BA 8 (1945), pp. 41-58; H. Ringgren, "Ugarit und das Alte Testament: einige methodolische Erwagungen", UF II (1979 = C.F.A. Scha4fer Festschrift ), pp. 719-21 ; and E. Ullendorff, "Ugaritic Studies Within Their Semitic and Eastern Mediterranean Setting" , BJRL 46 (1963), pp. 236-44. Popular surveys include: M. Baldacci, La scoperta di Ugarit: La citta-stato ai primordi della Bibbia (Piemme, 1996); P.C. Craigie, Ugarit and the Old Testament (Grand Rapids, MI, 1983); AHW. Curtis, Ugarit (Has Shamra) , Cities of the Biblical World (Cambridge, 1985); AS. -
DRAFT Arabtex a System for Typesetting Arabic User Manual Version 4.00
DRAFT ArabTEX a System for Typesetting Arabic User Manual Version 4.00 12 Klaus Lagally May 25, 1999 1Report Nr. 1998/09, Universit¨at Stuttgart, Fakult¨at Informatik, Breitwiesenstraße 20–22, 70565 Stuttgart, Germany 2This Report supersedes Reports Nr. 1992/06 and 1993/11 Overview ArabTEX is a package extending the capabilities of TEX/LATEX to generate the Perso-Arabic writing from an ASCII transliteration for texts in several languages using the Arabic script. It consists of a TEX macro package and an Arabic font in several sizes, presently only available in the Naskhi style. ArabTEX will run with Plain TEXandalsowithLATEX2e. It is compatible with Babel, CJK, the EDMAC package, and PicTEX (with some restrictions); other additions to TEX have not been tried. ArabTEX is primarily intended for generating the Arabic writing, but the stan- dard scientific transliteration can also be easily produced. For languages other than Arabic that are customarily written in extensions of the Perso-Arabic script some limited support is available. ArabTEX defines its own input notation which is both machine, and human, readable, and suited for electronic transmission and E-Mail communication. However, texts in many of the Arabic standard encodings can also be processed. Starting with Version 3.02, ArabTEX also provides support for fully vowelized Hebrew, both in its private ASCII input notation and in several other popular encodings. ArabTEX is copyrighted, but free use for scientific, experimental and other strictly private, noncommercial purposes is granted. Offprints of scientific publi- cations using ArabTEX are welcome. Using ArabTEX otherwise requires a license agreement. There is no warranty of any kind, either expressed or implied. -
Corpus Study of Tense, Aspect, and Modality in Diglossic Speech in Cairene Arabic
CORPUS STUDY OF TENSE, ASPECT, AND MODALITY IN DIGLOSSIC SPEECH IN CAIRENE ARABIC BY OLA AHMED MOSHREF DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics in the Graduate College of the University of Illinois at Urbana-Champaign, 2012 Urbana, Illinois Doctoral Committee: Professor Elabbas Benmamoun, Chair Professor Eyamba Bokamba Professor Rakesh M. Bhatt Assistant Professor Marina Terkourafi ABSTRACT Morpho-syntactic features of Modern Standard Arabic mix intricately with those of Egyptian Colloquial Arabic in ordinary speech. I study the lexical, phonological and syntactic features of verb phrase morphemes and constituents in different tenses, aspects, moods. A corpus of over 3000 phrases was collected from religious, political/economic and sports interviews on four Egyptian satellite TV channels. The computational analysis of the data shows that systematic and content morphemes from both varieties of Arabic combine in principled ways. Syntactic considerations play a critical role with regard to the frequency and direction of code-switching between the negative marker, subject, or complement on one hand and the verb on the other. Morph-syntactic constraints regulate different types of discourse but more formal topics may exhibit more mixing between Colloquial aspect or future markers and Standard verbs. ii To the One Arab Dream that will come true inshaa’ Allah! عربية أنا.. أميت دمها خري الدماء.. كما يقول أيب الشاعر العراقي: بدر شاكر السياب Arab I am.. My nation’s blood is the finest.. As my father says Iraqi Poet: Badr Shaker Elsayyab iii ACKNOWLEDGMENTS I’m sincerely thankful to my advisor Prof. Elabbas Benmamoun, who during the six years of my study at UIUC was always kind, caring and supportive on the personal and academic levels. -
Amharic-Arabic Neural Machine Translation
AMHARIC-ARABIC NEURAL MACHINE TRANSLATION Ibrahim Gashaw and H L Shashirekha Mangalore University, Department of Computer Science, Mangalagangotri, Mangalore-574199 ABSTRACT Many automatic translation works have been addressed between major European language pairs, by taking advantage of large scale parallel corpora, but very few research works are conducted on the Amharic-Arabic language pair due to its parallel data scarcity. Two Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) based Neural Machine Translation (NMT) models are developed using Attention-based Encoder-Decoder architecture which is adapted from the open-source OpenNMT system. In order to perform the experiment, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation of Amharic language text corpora available on Tanzile. LSTM and GRU based NMT models and Google Translation system are compared and found that LSTM based OpenNMT outperforms GRU based OpenNMT and Google Translation system, with a BLEU score of 12%, 11%, and 6% respectively. KEYWORDS Amharic, Arabic, Neural Machine Translation, OpenNMT 1. INTRODUCTION "Computational linguistics from a computational perspective is concerned with understanding written and spoken language, and building artifacts that usually process and produce language, either in bulk or in a dialogue setting." [1]. Machine Translation (MT), the task of translating texts from one natural language to another natural language automatically, is an important application of Computational Linguistics (CL) and Natural Language Processing (NLP). The overall process of invention, innovation, and diffusion of technology related to language translation drive the increasing rate of the MT industry rapidly [2]. The number of Language Service Provider (LSP) companies offering varying degrees of translation, interpretation, localization, language, and social coaching solutions are rising in accordance with the MT industry [2]. -
Translation of Religious Terminology: Al-Fat-H Al-Islami As a Model
International Journal of English Linguistics; Vol. 6, No. 3; 2016 ISSN 1923-869X E-ISSN 1923-8703 Published by Canadian Center of Science and Education Translation of Religious Terminology: al-fat-h al-islami as a Model Ali Al-Halawani1 1 Kulliyyah of Languages and Management (KLM), International Islamic University Malaysia (IIUM), Kuala Lumpur, Malaysia Correspondence: Ali Al-Halawani, Kulliyyah of Languages and Management (KLM), IIUM, Kuala Lumpur, Malaysia. E-mail: [email protected] or [email protected] Received: March 6, 2016 Accepted: March 27, 2016 Online Published: May 25, 2016 doi:10.5539/ijel.v6n3p136 URL: http://dx.doi.org/10.5539/ijel.v6n3p136 Abstract This paper is an attempt to illustrate the importance of understanding the religious and cultural background of the ST in the translation process in order to reach an accurate and precise translation product in the TL. The paper affirms that differences between cultures may cause complications which are even more serious for the translator than those arising from differences in language structures. The sample of the study is concerned with an Islamic term, namely al-fat-h al-Islami-commonly rendered into English as Islamic Conquest or Invasion- a religiously and culturally bound term/concept. The paper starts by defining culture, and then follows with an extensive lexical analysis of the selected term/concept. The study proves that it is difficult to translate this concept into the TL simply due to the lack of optimal or even near optimal cultural equivalents. The skill and the intervention of the translator are most crucial in this respect because, above all, translation is an act of communication.