A Computational Model of Modern Standard Arabic Verbal Morphology Based on Generation

by Alicia González Martínez, under the supervision of Dr Antonio Moreno Sandoval

Dissertation submitted in partial fulfillment of the requirements for the degree of PhD. Madrid, December 2012. Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Laboratorio de Lingüística Informática (LLI-UAM).

This dissertation was completed thanks to an FPU scholarship from the Universidad Autónoma de Madrid, and was carried out in the LLI-UAM Laboratory of Computational Linguistics, Department of Linguistics, under the supervision of Dr Antonio Moreno Sandoval.

For Tronspmkins, who showed me pigs flying.
For Su, who gave me a holm-oak grove and taught me to root.

"Why are you working so hard? Why don't you come and play?" But the stubborn bricklayer pig just said "no". "I shall finish my house first. It must be solid and sturdy. And then I'll come and play!" (Three little pigs)

Abstract

The computational handling of non-concatenative morphologies is still a challenge in the field of natural language processing. Amongst the various areas of research, Arabic morphology stands out due to its highly complex structure. We propose a model for Arabic verbal morphology based on a root-and-pattern approach, which satisfies both computational consistency and an elegant formalization. Our model defines an abstract representation of prosodic templates and a set of intertwined morphemes that operate at different phonological levels, as well as a separate module of rewrite rules to deal with morphophonological and orthographic alterations. Our verbal system model asserts that Arabic exhibits two conjugational classes.

The computational system, named Jabalín, is focused on generation: the program generates a fully annotated lexicon of verbal forms, which is subsequently used to develop a morphological analyzer and generator. The input of the system consists of a lexicon of 15,452 verb lemmas of both Classical Arabic and Modern Standard Arabic, taken from El-Dahdah (1991), comprising a total of 3,706 roots. The output of the system is a lexicon of 1,684,268 inflected verbal forms.

We carried out an evaluation against a lexicon of inflected verbs provided by the analyzer ElixirFM (Smrž, 2007a; 2007b), which we considered a gold standard, achieving a precision of 99.52%. Additionally, we compared our lexicon with a list of the most frequent verb lemmas, including the most frequent verbs from each conjugation, taken from Buckwalter and Parkinson (2010). The list includes 825 verbs, all of which are included in our lexicon, and they passed an evaluation test with an accuracy of 99.27%.

Jabalín is available under a GNU license, and can be accessed and tested through an online interface at http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM lab. The Jabalín interface provides different functionalities: analyze a form, generate the inflectional paradigm of a verb lemma, derive a root, show quantitative data, and explore the database, which includes data from the evaluation.

Key words: Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, Root-and-Pattern Morphology, Non-concatenative Morphology, Templatic Morphology, Root-and-Prosody Morphology, Computational Prosodic Morphology.

Resumen (translated from Spanish)

Non-concatenative morphological systems remain one of the greatest challenges for natural language processing. Among the various lines of research, the study of Arabic morphology stands out because of the great structural complexity of the system. This research project proposes a model of Arabic verbal morphology that is based on a root-and-pattern approach and is both formally elegant and computationally coherent. The proposed model rests fundamentally on an abstract formalization of prosodic templates and their interaction with the morphological material. In parallel, the system has a module of rules that handle the morphophonological and orthographic alterations of Arabic. The model of the verbal system proposes, and is built on, the idea that there are only two conjugational classes in Arabic.

The computational system, called Jabalín, is oriented towards generation: the program generates a lexicon of verbal forms with their associated linguistic information. The lexicon is then used to develop a morphological analyzer and generator. As input, the system takes a lexicon of 15,452 verb lemmas (taken from El-Dahdah, 1991), which combines vocabulary from both Classical Arabic and Modern Standard Arabic and contains a total of 3,706 roots. The output is a lexicon of 1,684,268 inflected verbal forms.

An evaluation was carried out against a lexicon of verbal forms extracted from the analyzer ElixirFM (Smrž, 2007a; 2007b), with a precision of 99.52%. The lexicon was also evaluated against a list of the most frequent verbs (including the most frequent lemmas of each conjugation type) taken from Buckwalter and Parkinson (2010). All 825 verbs in the list are included in our lexicon of verb lemmas, and they show a precision of 99.27%.

The Jabalín system, developed under a GNU license, also has a web interface where queries can be made in Arabic, http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM. The interface offers several functionalities: analyze a form, generate the inflection of a verb lemma, derive a root, show quantitative data, and explore the database, which includes the evaluation data.

Key words (translated from Spanish): Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, root-and-pattern morphology, non-concatenative morphology, templatic morphology, root-and-prosody morphology, computational prosodic morphology.

Acknowledgements (translated from Spanish)

Every pig has its Saint Martin's Day, and I was never going to be an exception. For when a thesis is already finished, the doctoral candidate, by now quite beside themselves, insists on rooting and rooting in search of more truffles and acorns: "It's not enough! It's not enough!" But however stubborn the candidate may be, and however daunting it may feel, the time comes to finish. One then has to accept that one is no longer a wild piglet but a glossy, contented wild boar, ready for the slaughter. And it is at that precise moment that the boar at last lifts its snout, serene, and stops to think, gratefully, of all those who kept it company on its trot and of those who taught it to forage for acorns.

The thesis writer is as self-absorbed an animal as they come, one whom the long nights of working on the thesis have turned into a withdrawn and surly creature. Those who stayed by its side have had much to suffer and put up with. To all of them, who have endured chronic grumbling and colossal tantrums, I give thanks with special affection.

To blessed Saint Barbara, you who are written in heaven with paper and holy water, for giving me perseverance and patience; like a good totem, you have been trolling me so that I could finally register this thesis on a 4 December.

To my philology teachers: despite my fit of defection for the sake of linguistics, you have left your mark on me forever. A philologist above all!

To Waleed, who, though he may not remember it, took on that neglected first-year course, one of those that teachers sometimes do not want to teach but that are the ones that leave their mark on students. His classes, and his dedication to his students, are not forgotten.

To Iñaqui, for his painstakingly prepared Arabic classes and his good-natured jokes. Thanks to his teaching I could get by in the language of the Dad on my trots around the Arab world and, like all good students, I was the pedantic joke of the Arab souks.

To Ana Ramos, whom I remembered while writing this thesis, for the harrowing suicide of the sick letters.

To my Syrian teachers, in that summer in Damascus when I learned so much.

To Theophile, the master of morphology, for offering me his valuable advice.

To Jirka Hana, I am grateful for the opportunity he gave me to stay those three months at UFAL.

To Antonio, my thesis supervisor, whose enormous patience during my training and unwavering support I will never forget, a privilege that very few thesis writers enjoy; for offering me his ideas and his experience, for welcoming me into the laboratory and encouraging me to do research, and for always giving me the greatest freedom.

Nor can I forget all those who have passed through the lab, with whom I have shared great moments and plenty of banter: Ana Ledesma and her parallel world of pragmatics. Marta and Manuel, whose stories never left us time to get bored. Doaa, to whom I owe invaluable help in my difficult beginnings; her advice and dedication were the best of welcomes into this thorny world of research. Conchi, that first computer scientist who opened Pandora's box. Antonio Pastor, who was always there to install whatever programs I needed. Leo, who has worked beside me all these years; together we kept each other afloat in the lab's low moments. Dalia, one of those people whom fond memory always keeps. That glutton Jordi Casanova, known in Japanese lands as Ryo Tzutahara, great guru of gastronomy! Carlos, "the one from Japanese", who brings fresh air to the ranks of the grant holders. And, of course, I could not forget Azucena, who kept me company on so many days when I stayed working late in the lab.

To Carlos, my teacher, who taught me to think, who lit the way for the work of this thesis and every one of its ideas, and who has always made me feel like pata negra.

To Susana, who is everything I would like to be.

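
To make the root-and-pattern idea from the abstract concrete, here is a minimal sketch in Python. It is not Jabalín's implementation. The function name `apply_pattern`, the digit-based template notation (1, 2, 3 for the root consonants) and the Latin transliteration are all assumptions made for illustration only. The sketch covers only the interleaving of a consonantal root with a vocalic template and does not attempt the rewrite rules for morphophonological and orthographic alterations that the thesis handles in a separate module.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Interleave root consonants into a template.

    Hypothetical notation: the ASCII digits 1-9 in `pattern` stand for the
    first, second, third... consonant of `root`; every other character
    (vowels, affixes) is copied unchanged.
    """
    out = []
    for ch in pattern:
        if ch.isascii() and ch.isdigit():
            idx = int(ch) - 1  # templates number radicals from 1
            if not 0 <= idx < len(root):
                raise ValueError(f"pattern refers to radical {ch}, "
                                 f"but root {root!r} has {len(root)} consonants")
            out.append(root[idx])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Root k-t-b ("write") in three illustrative templates.
    print(apply_pattern("ktb", "1a2a3a"))   # basic stem
    print(apply_pattern("ktb", "1a22a3a"))  # second radical doubled
    print(apply_pattern("ktb", "1a2a3tu"))  # stem plus a person suffix
```

Keeping the template as plain data means that one root can be run through many templates without changing the code, which is the property a generation-oriented system relies on when it expands a lemma lexicon into inflected forms.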
Recommended publications
  • Automatic Identification of Arabic Language Varieties and Dialects in Social Media
    Fatiha Sadat (University of Quebec in Montreal, 201 President Kennedy, Montreal, QC, Canada), Farnazeh Kazemi and Atefeh Farzindar (NLP Technologies Inc., 52 Le Royer Street W., Montreal, QC, Canada).
    Abstract: Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD) or daily language differs from MSA especially in social media communication. However, most Arabic social media texts have mixed forms and many variations especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers with detailed examination of what models perform best under different conditions in social media context. Experimental results show that Naive Bayes classifier based on character bi-gram model can identify the 18 different Arabic dialects with a considerable overall accuracy of 98%.
    1 Introduction: Arabic is a morphologically rich and complex language, which presents significant challenges for natural language processing and its applications. It is the official language in 22 countries spoken by more than 350 million people around the world. Moreover, the Arabic language exists in a state of diglossia where the standard form of the language, Modern Standard Arabic (MSA) and the regional dialects (AD) live side-by-side and are closely related (Elfardy and Diab, 2013).
  • A Sociolinguistic View of "Hazl" in the Andalusian Arabic "Muwashshaḥ"
    Author: David Hanlon (Birkbeck College, London). Source: Bulletin of the School of Oriental and African Studies, University of London, Vol. 60, No. 1 (1997), pp. 35-46. Published by Cambridge University Press on behalf of School of Oriental and African Studies. Stable URL: http://www.jstor.org/stable/620768
    The documented history of the theory of the muwashshah and one of its constituent parts, the kharja, spans almost 800 years: from Ibn Sana' al-Mulk (d. 608/1211) to the present day. Apologists for the various theories broadly belong to one of two schools, which for the sake of convenience I shall label 'integralist' and 'partialist'.
  • Towards Understanding the Status of the Dual in Pre-Islamic Arabic
    Muhammad Al-Sharkawi (Wayne State University, Detroit).
    Abstract: This article suggests that the dual suffix in pre-Islamic Arabic did not differentiate for case. Tamīm, one of the most trustworthy pre-Islamic dialects, treated the dual suffix invariably although it had a full case system. There are also tokens of the same invariable treatment in the Qurʾān. The article proposes that the suffix long vowel variation due to the phenomenon of ʾimāla makes the formal origin of the invariable dual suffix difficult to ascribe to the East and Northwest Semitic oblique dual allomorph.
    Keywords: Dual, pre-Islamic Arabic, ʾimāla, Classical Arabic, vowel harmony.
    Introduction: This article discusses data on the dual suffix in pre-Islamic dialects from medieval Arab grammarians and manuals of qirāʾāt to suggest that the status of the dual suffix in the pre-Islamic Arabic linguistic situation was unique among the Semitic languages. The article does not, however, seek to take a comparative Semitic framework. It rather seeks to discuss the dual suffix behavior on the eve of the Arab conquests and probably immediately thereafter. Although attempts to understand particular structural concepts of pre-Islamic Arabic are forthcoming, the formal, functional and semantic shape of the dual system remains to be studied in detail. In addition, despite the limited and sporadic data about the morphological and syntactic aspects of pre-Islamic Arabic, the dual suffix is one of the features of pre-Islamic Arabic dialects that can shed light on both the position of grammatical case in the Arabic dialects in the peninsula, and how it came to be standardized after the emergence of Islam.
  • Saudi Dialects: Are They Endangered?
    Salih Alzahrani (Taif University, Saudi Arabia). English Literature and Language Review, Academic Research Publishing Group, ISSN(e): 2412-1703, ISSN(p): 2413-8827, Vol. 2, No. 12, pp. 131-141, 2016. URL: http://arpgweb.com/?ic=journal&journal=9&info=aims
    Abstract: Krauss, among others, claims that languages will face death in the coming centuries (Krauss, 1992). Austin (2010a) lists 7,000 languages as existing and spoken in the world today. Krauss estimates that this figure could come down to 600. That is, most of the world's languages are endangered. An endangered language is therefore a language that loses its speakers within a few generations. According to Dorian (1981), there is what is called a "tip" in language endangerment: a language's decline can start slowly but then suddenly go through a rapid decline towards extinction. Thus, languages must be protected at a much earlier stage. Arabic dialects such as Zahrani Spoken Arabic (ZSA) and Faifi Spoken Arabic (henceforth, FSA), which are spoken in the southern region of Saudi Arabia, have not been studied yet. Few people speak these dialects, among many other dialects in the same region. The problem, however, is that most of these dialects' native speakers are moving to other regions of Saudi Arabia, where they use other dialects. Therefore, are these dialects endangered? What other factors may cause their endangerment? Have they been documented before? What shall we do? This paper discusses three main points regarding this issue: language and endangerment, language documentation and description, and the Arabic language and its family, giving a brief history of Saudi dialects and comparing their situation with that of the other existing dialects.
  • A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
    Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech. Dror Kamir, Naama Soreq and Yoni Neeman (Melingo Ltd., 16 Totseret Haaretz st., Tel-Aviv, Israel).
    Abstract: This paper presents a comprehensive NLP system by Melingo that has been recently developed for Arabic, based on Morfix™ – an operational formerly developed highly successful comprehensive Hebrew NLP system. The system discussed includes modules for morphological analysis, context sensitive lemmatization, vocalization, text-to-phoneme conversion, and syntactic-analysis-based prosody (intonation) model. It is employed in applications such as full text search, information retrieval, text categorization, textual data mining, online contextual dictionaries, filtering, and text-to-speech applications in the fields of telephony and accessibility and could serve as a handy accessory for non-fluent […]
    1 Introduction. 1.1 The common Semitic basis from an NLP standpoint: Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich morphology, based on consonantal roots (Jiðr / Šoreš), which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations. For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba 'to write'. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu 'I wrote', katab-ta 'you (sing. masc.) wrote', katab-ti 'you (sing. fem.) wrote', ?a-ktubu 'I write/will write', etc.
  • Arabic Alphabet - Wikipedia, the Free Encyclopedia
    The Arabic alphabet (Arabic: أَبْجَدِيَّة عَرَبِيَّة ’abjadiyyah ‘arabiyyah) or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad.
    Type: Abjad
    Languages: Arabic
    Time period: 400 to the present
    Parent systems: Proto-Sinaitic, Phoenician, Aramaic, Syriac, Nabataean, Arabic abjad
    Child systems: N'Ko alphabet
    ISO 15924: Arab, 160
    Direction: Right-to-left
    Unicode alias: Arabic
    Unicode range: U+0600 to U+06FF (http://www.unicode.org/charts/PDF/U0600.pdf), U+0750 to U+077F (http://www.unicode.org/charts/PDF/U0750.pdf), U+08A0 to U+08FF (http://www.unicode.org/charts/PDF/U08A0.pdf), U+FB50 to U+FDFF (http://www.unicode.org/charts/PDF/UFB50.pdf), U+FE70 to U+FEFF (http://www.unicode.org/charts/PDF/UFE70.pdf), U+1EE00 to U+1EEFF (http://www.unicode.org/charts/PDF/U1EE00.pdf)
    Note: This page may contain IPA phonetic symbols.
    Arabic alphabet: ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء
    Contents: 1 Consonants; 1.1 Alphabetical order; 1.2 Letter forms; 1.2.1 Table of basic letters; 1.2.2 Further notes
  • Considerations About Semitic Etyma in De Vaan's Latin Etymological Dictionary
    Ephraim Nissan. Philology, vol. 4/2018/2019, pp. 35–156. © 2019 Ephraim Nissan. DOI https://doi.org/10.3726/PHIL042019.2
    Abstract: In this long study, our point of departure is particular entries in Michiel de Vaan's Latin Etymological Dictionary (2008). We are interested in possibly Semitic etyma. Among the other things, we consider controversies not just concerning individual etymologies, but also concerning approaches. We provide a detailed discussion of names for plants, but we also consider names for domestic animals.
    Keywords: Latin etymologies, Historical linguistics, Semitic loanwords in antiquity, Botany, Zoonyms, Controversies.
    1. Introduction: In his article "Il problema dei semitismi antichi nel latino", Paolo Martino (1993) at the very beginning lamented the neglect of Semitic etymologies for Archaic and Classical Latin; as opposed to survivals from a substrate and to terms of Etruscan, Italic, Greek, Celtic origin, when it comes to loanwords of certain direct Semitic origin in Latin, Martino remarked, such loanwords have been only admitted in a surprisingly exiguous number of cases, when they were not met with outright rejection, as though they merely were fanciful constructs (note 1): "Following the recent archaeological and epigraphic findings that have documented a hitherto unsuspected density of contacts between Semites (above all Phoenicians, Arameans and […]"
    Note 1: If one thinks what one could come across in the 1890s (see below), fanciful constructs were not a rarity.
  • Ugaritic Studies and the Hebrew Bible, 1968-1998 (With an Excursus on Judean Monotheism and the Ugaritic Texts)
    Mark S. Smith, Philadelphia.
    This presentation offers a synopsis of the major texts and tools as well as intellectual topics and trends that have dominated the field of Ugaritic-biblical studies since 1968 (note 1). This year marked a particular watershed with the publication of Ugaritica V, which included fourteen new Ugaritic texts and several polyglots. For the sake of convenience, this discussion is divided into three periods: 1968-1985, 1985 to the present, and prospects for the future. The texts and tools for each period are listed, followed by a brief discussion of the intellectual topics and trends. This survey is hardly exhaustive; it is intended instead to be representative.
    Note 1: Earlier technical surveys of Ugaritic-biblical studies include W.F. Albright, "The Old Testament and Canaanite Language and Literature", CBQ 7 (1945), pp. 5-31; H. Donner, "Ugaritismen in der Psalmenforschung", ZAW 79 (1967), pp. 322-50; H.L. Ginsberg, "Ugaritic Studies and the Bible", BA 8 (1945), pp. 41-58; H. Ringgren, "Ugarit und das Alte Testament: einige methodologische Erwägungen", UF 11 (1979 = C.F.A. Schaeffer Festschrift), pp. 719-21; and E. Ullendorff, "Ugaritic Studies Within Their Semitic and Eastern Mediterranean Setting", BJRL 46 (1963), pp. 236-44. Popular surveys include: M. Baldacci, La scoperta di Ugarit: La citta-stato ai primordi della Bibbia (Piemme, 1996); P.C. Craigie, Ugarit and the Old Testament (Grand Rapids, MI, 1983); A.H.W. Curtis, Ugarit (Ras Shamra), Cities of the Biblical World (Cambridge, 1985); AS.
  • DRAFT Arabtex a System for Typesetting Arabic User Manual Version 4.00
    DRAFT. User Manual Version 4.00. Klaus Lagally, May 25, 1999. Report Nr. 1998/09, Universität Stuttgart, Fakultät Informatik, Breitwiesenstraße 20–22, 70565 Stuttgart, Germany. This Report supersedes Reports Nr. 1992/06 and 1993/11.
    Overview: ArabTeX is a package extending the capabilities of TeX/LaTeX to generate the Perso-Arabic writing from an ASCII transliteration for texts in several languages using the Arabic script. It consists of a TeX macro package and an Arabic font in several sizes, presently only available in the Naskhi style. ArabTeX will run with Plain TeX and also with LaTeX2e. It is compatible with Babel, CJK, the EDMAC package, and PicTeX (with some restrictions); other additions to TeX have not been tried. ArabTeX is primarily intended for generating the Arabic writing, but the standard scientific transliteration can also be easily produced. For languages other than Arabic that are customarily written in extensions of the Perso-Arabic script some limited support is available. ArabTeX defines its own input notation which is both machine, and human, readable, and suited for electronic transmission and E-Mail communication. However, texts in many of the Arabic standard encodings can also be processed. Starting with Version 3.02, ArabTeX also provides support for fully vowelized Hebrew, both in its private ASCII input notation and in several other popular encodings. ArabTeX is copyrighted, but free use for scientific, experimental and other strictly private, noncommercial purposes is granted. Offprints of scientific publications using ArabTeX are welcome. Using ArabTeX otherwise requires a license agreement. There is no warranty of any kind, either expressed or implied.
  • Corpus Study of Tense, Aspect, and Modality in Diglossic Speech in Cairene Arabic
    By Ola Ahmed Moshref. Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics in the Graduate College of the University of Illinois at Urbana-Champaign, 2012. Urbana, Illinois. Doctoral Committee: Professor Elabbas Benmamoun (Chair), Professor Eyamba Bokamba, Professor Rakesh M. Bhatt, Assistant Professor Marina Terkourafi.
    Abstract: Morpho-syntactic features of Modern Standard Arabic mix intricately with those of Egyptian Colloquial Arabic in ordinary speech. I study the lexical, phonological and syntactic features of verb phrase morphemes and constituents in different tenses, aspects and moods. A corpus of over 3000 phrases was collected from religious, political/economic and sports interviews on four Egyptian satellite TV channels. The computational analysis of the data shows that systematic and content morphemes from both varieties of Arabic combine in principled ways. Syntactic considerations play a critical role with regard to the frequency and direction of code-switching between the negative marker, subject, or complement on one hand and the verb on the other. Morpho-syntactic constraints regulate different types of discourse but more formal topics may exhibit more mixing between Colloquial aspect or future markers and Standard verbs.
    Dedication: To the One Arab Dream that will come true inshaa' Allah! عربية أنا.. أمتي دمها خير الدماء.. كما يقول أبي (الشاعر العراقي: بدر شاكر السياب). Arab I am.. My nation's blood is the finest.. As my father says (Iraqi poet: Badr Shaker Elsayyab).
    Acknowledgments: I'm sincerely thankful to my advisor Prof. Elabbas Benmamoun, who during the six years of my study at UIUC was always kind, caring and supportive on the personal and academic levels.
  • Amharic-Arabic Neural Machine Translation
    Ibrahim Gashaw and H L Shashirekha (Mangalore University, Department of Computer Science, Mangalagangotri, Mangalore-574199).
    Abstract: Many automatic translation works have been addressed between major European language pairs, by taking advantage of large scale parallel corpora, but very few research works are conducted on the Amharic-Arabic language pair due to its parallel data scarcity. Two Neural Machine Translation (NMT) models, one based on Long Short-Term Memory (LSTM) and one on Gated Recurrent Units (GRU), are developed using an Attention-based Encoder-Decoder architecture adapted from the open-source OpenNMT system. In order to perform the experiment, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation of Amharic language text corpora available on Tanzile. The LSTM- and GRU-based NMT models and the Google Translation system are compared, and the LSTM-based OpenNMT model is found to outperform the GRU-based OpenNMT model and the Google Translation system, with BLEU scores of 12%, 11%, and 6% respectively.
    Keywords: Amharic, Arabic, Neural Machine Translation, OpenNMT
    1. Introduction: "Computational linguistics from a computational perspective is concerned with understanding written and spoken language, and building artifacts that usually process and produce language, either in bulk or in a dialogue setting." [1]. Machine Translation (MT), the task of translating texts from one natural language to another natural language automatically, is an important application of Computational Linguistics (CL) and Natural Language Processing (NLP). The overall process of invention, innovation, and diffusion of technology related to language translation drives the rapid growth of the MT industry [2]. The number of Language Service Provider (LSP) companies offering varying degrees of translation, interpretation, localization, language, and social coaching solutions is rising in accordance with the MT industry [2].
  • Translation of Religious Terminology: Al-Fat-H Al-Islami As a Model
    Ali Al-Halawani (Kulliyyah of Languages and Management (KLM), International Islamic University Malaysia (IIUM), Kuala Lumpur, Malaysia). International Journal of English Linguistics; Vol. 6, No. 3; 2016. ISSN 1923-869X, E-ISSN 1923-8703. Published by Canadian Center of Science and Education. Received: March 6, 2016. Accepted: March 27, 2016. Online Published: May 25, 2016. doi:10.5539/ijel.v6n3p136 URL: http://dx.doi.org/10.5539/ijel.v6n3p136
    Abstract: This paper is an attempt to illustrate the importance of understanding the religious and cultural background of the ST in the translation process in order to reach an accurate and precise translation product in the TL. The paper affirms that differences between cultures may cause complications which are even more serious for the translator than those arising from differences in language structures. The sample of the study is concerned with an Islamic term, namely al-fat-h al-Islami (commonly rendered into English as Islamic Conquest or Invasion), a religiously and culturally bound term/concept. The paper starts by defining culture, and then follows with an extensive lexical analysis of the selected term/concept. The study proves that it is difficult to translate this concept into the TL simply due to the lack of optimal or even near optimal cultural equivalents. The skill and the intervention of the translator are most crucial in this respect because, above all, translation is an act of communication.