Diacritics Restoration for Arabic Dialects Salima Harrat, Mourad Abbas, Karima Meftouh, Kamel Smaïli
Total Page:16
File Type:pdf, Size:1020Kb
Diacritics Restoration for Arabic Dialects Salima Harrat, Mourad Abbas, Karima Meftouh, Kamel Smaïli To cite this version: Salima Harrat, Mourad Abbas, Karima Meftouh, Kamel Smaïli. Diacritics Restoration for Arabic Dialects. INTERSPEECH 2013 - 14th Annual Conference of the International Speech Communication Association, ISCA, Aug 2013, Lyon, France. hal-00925815 HAL Id: hal-00925815 https://hal.inria.fr/hal-00925815 Submitted on 8 Jan 2014 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Diacritics restoration for Arabic dialect texts S. Harrat1, M. Abbas2, K. Meftouh3, K. Smaili4 1ENS Bouzareah, Algiers, Algeria 2CRSTDLA, Algiers, Algeria 3Badji Mokhtar University-Annaba, Algeria 4Campus scientifique LORIA , Nancy, France [email protected], m [email protected], [email protected], [email protected] Abstract nally, we worked first on MSA texts because of the available results for many works in this field. In this paper we present a statistical approach for automatic di- This paper is organised as follows: In section 2, we describe acritization of Algiers dialectal texts. This approach is based the Arabic language and present the main features of this lan- on statistical machine translation. We first investigate this ap- guage especially those related to diacritization. In section 3, proach on Modern Standard Arabic (MSA) texts using several we provide some specificities of Algiers’s dialect. In section 4, data sources and extrapolated the results on available dialectal we present related works in this field summarizing the main as- texts. For evaluation we used word and diacritization error rates pects of every work. In section 5, we present our approach and and also precision and recall. describe the used data (for training tuning and testing). Evalu- Index Terms: Machine translation system, Modern Standard ation results and experiments are reported in the next. Finally, Arabic, automatic diacritization, Algiers’s dialect We conclude by showing our main results and future work di- rections. 1. Introduction Vocalization, diactritization or diacritics restoration is one of 2. Arabic language the major challenges for the Arabic natural language process- Arabic is a Semitic Language with consonantal alphabet (which ing. Indeed, the absence of vowels in Arabic texts generates a denotes only consonants). Arabic alphabet consists of 28 letters considerable number of ambiguities in morphological, syntac- 1 denoting consonants and three long vowels ( a¯ , w and tic and semantic layer. For Arabic speakers, absence of vow- @ ð els does not pose problems for understanding texts, although ø y). Arabic script includes short vowels ( a, u, i ) in some cases vocalization is not intuitive. Automatic vo- and other phonetic symbol which are represented by diacritics calization for Arabic language (even for other languages) re- (strokes placed above or below consonants and long vowels). mains unresolved problem for which several works are dedi- Short vowels can appear anywhere in the word, the tanween cated [1][2][3][4]. Absence of diacritics in Arabic texts pro- represents doubled case endings, it consists of three cases ( duces serious problem for many applications such as grapheme- an, un, in ) it appears only in the end of the words and phoneme conversion. Algiers dialect is concerned by most of the problems of Arabic language processing. Although, absence expresses nominal indefiniteness. Arabic diacritics include also of diacritics in Algiers dialectal texts produces several chal- syllabification marks: the Shadda or germination mark denotes lenges for many automatic tasks on this language. The goal of a double consonant( it could be combined with short vowels this work is automatic diacritics restoration for Algiers’s dialect and doubled case endings), and the Sukun which denotes the text. Actually, this is an intermediate purpose, since our general absence of vowel (see table 1). According to the function of goal is speech translation between Modern Standard Arabic and the word in the sentence, an Arabic word could take as case Algerian dialect. In order to reach this aim, translation system endings the short vowels ( a, u, i ) respectively for must include a Text-to-Speech module. The first component accusative, nominative and genitive case. It is important to note of this module is a grapheme phoneme converter that we have that in some cases diacritics are not allowed such as in the case to develop. Results of grapheme phoneme conversion for Ara- of: bic texts (both classical and dialectal) could not be acceptable if these texts are not diacritized. That is why the first step for • The letter ø a¯ (Alif Maksoura) which never takes dia- developing our grapheme phoneme converter is an automatic critics such as in the word sa,a¯ (to endeavor). diacritics restoration for Algiers’s dialect texts. ùª • The long vowels w and y when preceded (in a In this paper, we present an automatic diacritization system for ð ø Arabic texts based on statistical approach. We attempt to use word) respectively by the short vowels u and i. available tools for statistical machine translation for building such a system which basically does not require any linguistic • The long vowel @ a¯ when preceded (in a word) by and knowledge. We began by working on MSA texts for many rea- in the definite article . sons: Algiers’s dialect is an Arabic language which obey to Ë@ almost the same rules of writing as MSA. Availability of di- 1Including 14 solar consonants which assimilate the of a preceding acritized texts in MSA allows to test our solution on a large È amount of data, which is not the case for Algiers’s dialect. Fi- definite article È@ and 14 lunar consonants which do not assimilate it. the Sukun (absence of vowel), this simplifies diacritization pro- Table 1: Arabic diacritics and their pronunciation cess but generates ambiguity at syntactic level. This conducts Diacritized consonant /b/ Name Pronunciation to an ambiguity concerning the identification of grammatical Fatha /ba/ J. role of certain words, for example, to specify the agent and the Damma /bu/ object in some simple sentences without using semantic knowl- J. edge (see in table 3 a detailed example). Instead of written Ara- Kasra /bi/ J . bic, diacritics in Algiers’s dialect are used for disambiguation, a Tanween Fatha /ban/ word without diacritics could accept many valid diacritizations AJ. and then if they are missing, it would be difficult to understand Tanween damma /bun/ I. its meaning(see table 4 for some examples). It is important to I. Tanween kasra /bin/ note that due mostly to dropping case endings of words in Al- Shadda /bb/ giers’s dialect, the number of multiple valid diacritizations for a J. word is less important than it is for a MSA word. Sukun /b/ J. 4. Related works • The consonant Ë could not take a diacritic in the definite article Ë@ when it is followed by a solar consonant. Most works for diacritization are dedicated for Modern stan- Arabic diacritics are used for disambiguation, although a word dard Arabic. Several approaches were adopted, in [1] the sys- without diacritics could have many valid diactritizations (de- tem is based on hierarchical search at sentence, phrase word and pending on its grammatical category) which generates several character level. From the sentence level, the system attempts to interpretations (see table 2 for some examples). retrieve diacritized examples from training data, if no suitable example is found for the given sentence, the system splits it into phrases and searches for fitting diacritics phrases in the training Table 2: Some possible diacritizations of Arabic word Qå¯ fsr data. If the search fails the phrase is split into words then into Diacritic form Meaning Grammatical category characters (if word search fails). The system uses n-grams to de- fine suitable examples. In [2] the diacritization was considered Qå ¯ fassara he explained verb (active voice) as a machine translating problem from non-diacritized texts to fussira was explained verb (passive voice) diacritized texts. The authors (IBM Egypt) adopted a rule based Qå¯ approach. The shortcomings of this system are those of machine fasir Qå ¯ so walk conjunction+ verb translation based rules: adding new cases requires the defini- tion of new rules which makes system maintenance expensive in Qå ¯ fasirrun and a secret conjunction+noun terms of human efforts. In [3] the authors combined weighted fi- Qå ¯ fasrun statement Noun nite state machines and language models. Their system is based on three language models (trained on LDC’s Arabic Treebank Data): word, character and clitics. Transducers transitions are obtained by probabilistic calculations from the three language 3. Algiers dialect models. In [4] two approaches are used, the first one a SMT In Algeria, as elsewhere, spoken Arabic differs from written based approach combined with rule based diacritizer, the second Arabic; Algerian Arabic has a vocabulary inspired from Arabic considers diacritization as a sequence labelling problem. The but the original words have been altered phonologically, with first approach operates at the whole sentence, word level and significant Berber substrates, and many new words and loan- then at the character level then a combination of the two (word words borrowed from French, Turkish and Spanish [5]. The and character) levels is achieved. A statistical phrase based sys- Algiers’s dialect represents the dialectical Arabic spoken in Al- tem is built by the alignment between the output of rule based giers and its periphery.