A Conventional Orthography for Tunisian Arabic

A Conventional Orthography for Tunisian Arabic Inès Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Belguith and †Nizar Habash ANLP Research group, MIRACL Lab., University of Sfax, Tunisia †Center for Computational Learning Systems, Columbia University, USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic. Keywords: Tunisian Arabic, Arabic Dialect, Orthography, CODA. convention. Our work is a continuation of the work of 1. Introduction Habash et al, (2012a) who proposed CODA, a The Arabic language in its modern form is a collection of Conventional Orthography for Dialectal Arabic, which is dialects with various degrees of differences in terms of designed for the purpose of developing computational phonology, morphology, syntax and lexicon among each models of Arabic dialects and provided a detailed other and between them and Modern Standard Arabic description of its guidelines as applied to Egyptian Arabic (MSA). While MSA is the language of official use, the (EGY). We do not expect this convention to be produced media and education, Dialectal Arabic (DA) is the by TUN speakers as input, but it is primarily for use in language of daily life, the true native form of Arabic. The development NLP systems. Spontaneously written TUN dialects are not taught in schools and have no standard will have to be converted automatically into its CODA orthography, although they have been for a long time the version (Habash et al., 2012b, Eskander et al., 2013). carriers of rich oral traditions. Tunisian Arabic In this paper, we first review some previous related work (henceforth, TUN) is the primary dialect spoken in the (Section 1). In Section 2, we present an overview of TUN. North African country of Tunisia. TUN has some unique In Section 3, we highlight the linguistic differences features that distinguish it from its direct neighboring between TUN and both MSA and EGY to motivate some dialects as well as other Arabic dialects. In the last decade, of our TUN CODA decisions. In Section 4, we present as has happened for many Arabic dialects, TUN has TUN CODA guidelines. And in Section 5, we briefly emerged as the language of informal communication discuss ongoing efforts by the authors which use TUN online: in emails, blogs, discussion forums, SMS, etc. CODA. However the development of natural language processing (NLP) tools and resources for TUN still lags behind other 2. Related Works dialects and is quite behind the state-of-the-art for MSA Efforts on modernizing Arabic orthography and NLP. With the increasing presence of TUN online and the developing orthographies for Arabic dialects have been increasing use of language technologies for many going on for many years (Habash et al, 2012a). Zawaydeh languages (e.g., Siri), the need for work on technologies et al. (2003) and Maamouri et al. (2004) developed a set such as speech recognition, speech synthesis, telephony, of rules for orthographic transcription and annotation of machine translation, etc., for TUN is more real than ever Levantine dialects in order to create a Levantine Arabic before. The absence of resources creates a pronounced corpus. The proposed transcription rules are based on two bottleneck for processing and building robust tools and levels of transcription: MSA-based transcription for the applications. Applying NLP tools designed for MSA purpose of language modeling and Arabic orthographic directly to TUN yields significantly low performance, system based transliteration for the purpose of acoustic making it imperative to build resources and dedicated modeling. Zribi et al, (2013a) presented OTTA, the tools for TUN processing (Diab et al, 2010, Boujelbane et Orthographic Transcription for Tunisian Arabic al., 2013b). convention. This convention proposed the use of some In this paper, we discuss an important basic technology rules based on MSA conventions and defined another set that is necessary for the efficient development of, and of rules which preserved the phonetic particularities of maximized synergy between, the various ongoing and TUN. Zribi et al, (2013a) presented also a set of rules for future efforts (both tools and resources) on TUN NLP: the annotation to use while transcribing spoken TUN. Habash design of an orthography to be used as a common standard et al. (2013a) introduced the concept of a conventional 2355 orthography for dialectal Arabic (CODA) and defined it then are due to trade between the non-Arabs living in for EGY. They identified five goals for CODA: (i) CODA North Africa and the Arabs who traveled. Later on, the is an internally consistent and coherent convention for Arabization of the Maghreb was connected to the Islamic writing DA; (ii) CODA is created for computational conquests from the east, which introduced the Arabic purposes; (iii) CODA uses the Arabic script; (iv) CODA is language on much larger scale in North Africa (Peirera, intended as a unified framework for writing all DAs; and 2011). The Ottoman Turkish political domination of finally, (v) CODA aims to strike an optimal balance North Africa roughly from the mid-fifteenth to the late between maintaining a level of dialectal uniqueness and nineteenth century and the French colonization from 1830 establishing conventions based on MSA-DA similarities. had an impact on the absorption of foreign vocabulary Their convention is used in many of their NLP tools and into the lexicon of local Arabic dialects (Holes, 2004). In resources for EGY (Habash et al, 2012a; Habash et al., addition to Turkish and French, we find numerous 2013; Eskander et al., 2013; Pasha et al., 2014). In this examples of the European lexical elements in TUN. We paper, we extend the CODA guidelines they created for can identify a significant number of expressions and EGY to TUN. We believe that the CODA goals, especially words from Spanish and Italian, and even Maltese (which the unified framework for all DAs, can help in is an Arabic dialect historically). Table 1 contains some maximizing synergy between, and encouraging examples of borrowed words in TUN. adaptation from, other dialects and TUN, when it comes to resource creation. Origin English Words Transliteration sense bar~iAkaħ Italian booth بشاكت An Overview of Tunisian Arabic .3 baAnkaħ Italian bank ببَكت .Arabic dialects are the vernacular of all Arabic speakers daAkuwrduw Italian okay داكٕسدٔ They are the native languages of peoples of various fiyšTaħ Italian party فٍططت Arabic countries, and these linguistic forms are miAkiynaħ Italian machine يبكٍُت .sometimes very different from one region to another kar~uwsaħ Italian stroller كشٔست .TUN is a dialect of the Arabic language spoken in Tunisia Kuwjiynaħ Italian kitchen كٕجٍُت ,ςaAm~iyaħ عبيٍت ,daArijaħ1 داسجت It is often referred to as baAbuwr Turkish ship بببٕس tuwnsiy which is considered as the low variety حَٕسً or sfin~aAriyaħ Turkish carrots سفُبسٌت given that it is neither codified nor standardized even kahwaAjiy Turkish waiter قٕٓاجً though it is the mother tongue and the variety spoken by Barnuws Berber burnous بشَٕط .(all the population in daily usage (Saidi, 2007 Kusksiy Berber couscous كسكسً Approximately 11 million people speak one or two of the baT~aAniy~aħ Berber blanket بطبٍَت many regional varieties of TUN; including the Tunis Sab~aAT Spanish Shoe صببط dialect (Capital), Sahil dialect, Sfax dialect, Northwestern buwsTaħ French post office بٕسطت Tunisian dialect, Southwestern Tunisian dialect, and Southeastern Tunisian dialect (Gibson, 1998, Talmoudi, blaASaħ French Space بﻻصت 1980). baAkuw French package ببكٕ TUN is considered an under-resourced language. It has sbiyTaAr French hospital سبٍطبس neither a standard orthography nor large collections of qaT~uws Maltese cat قطٕط written text and dictionaries. Actually, there is no strict separation between MSA and its dialects: they coexist on Table 1: The origin and the meaning of some borrowed a continuum dominated by mixed forms (MSA-DA). In words used in TUN. addition, TUN is distinguished by the presence of words from several other languages. The presence of these In addition to all these borrowed terms which have been languages mainly occurred due to historical facts. Indeed, integrated in the TUN morpho-phonology, Tunisians code they have rendered the linguistic situation in Tunisia switch often in daily conversations, particularly from rather complex. Lawson and Sachdev (2000) describe the French, e.g., “ça va”, “désolé”, “rendez-vous”, etc. All linguistic situation in Tunisia as “poly-glossic” where these expressions and words are used without being multiple languages and language varieties coexist. Before adapted to the phonology. describing Tunisian language situation, we present a brief historical overview of the TUN dialect. 4. Tunisian Arabic vs. MSA and Egyptian During the centuries before the Islamic conquests, the Arabic native languages of the Maghreb in general were varieties TUN, EGY and MSA differ at the phonological, of Berber. The few Arabic words that were part of Berber morphological and of course orthographic levels. We present in this section the main differences between the 1 Transliteration of Arabic will be presented in the TUN, EGY, and MSA. For further discussions of Arabic Habash-Soudi-Buckwalter (HSB) scheme (Habash et al, morphology and orthography, see (Habash, 2010).

Load more