A Study of a Non-Resourced Language: The Case of One of the Algerian Dialects

Karima Meftouh, Najette Bouchemal, Kamel Smaïli

To cite this version:

Karima Meftouh, Najette Bouchemal, Kamel Smaïli. A Study of a Non-Resourced Language: The Case of one of the Algerian Dialects. The third International Workshop on Spoken Languages Technologies for Under-resourced Languages - SLTU'12, May 2012, Cape Town, South Africa. pp.1-7. hal-00727042

HAL Id: hal-00727042 https://hal.archives-ouvertes.fr/hal-00727042 Submitted on 14 Sep 2017

A STUDY OF A NON-RESOURCED LANGUAGE: THE CASE OF ONE OF THE ALGERIAN DIALECTS

K. Meftouh, N. Bouchemal
UBMA, Badji Mokhtar University, Informatic Department, BP 12, 23000 Annaba, Algeria

K. Smaili
LORIA, Campus scientifique, BP 139, 54500 Vandoeuvre-lès-Nancy Cedex, France

ABSTRACT

This paper presents a linguistic study of an Algerian dialect, namely the dialect of Annaba (AD). It also presents the methodology applied in the construction of an MSA-AD parallel corpus. This work is carried out with the future goal of developing a machine translation system from standard Arabic (MSA) to dialects.

Index Terms: Machine translation system, Standard Arabic, Algerian Arabic dialect, parallel corpus, dialect of Annaba, cosine similarity measure

1. INTRODUCTION

Arabic is a Semitic language. It is used by around 250 million people, but is understood by up to four times as many around the world [1]. Arabic is divided into three separate groups: Classical written Arabic, written modern standard Arabic and spoken Arabic.

Classical written Arabic is principally defined as the Arabic used in the Qur'an and in the earliest literature from the Arabian peninsula, but it also forms the core of much literature up until our time. Modern standard Arabic (MSA, also called Alfus'ha) is the form of Arabic most widely used in print media, official documents, correspondence, education, and as a liturgical language. It is essentially a modern variant of classical Arabic. Standard Arabic is not acquired as a mother tongue; rather, it is learned at school and through exposure to formal broadcast programs (such as the daily news), religious practice, and print media.

Spoken Arabic is often referred to as colloquial Arabic, dialects, or vernaculars. It is a mixed form, which has many variations and often a dominating influence from local languages (from before the introduction of Arabic). Differences between the variants of spoken Arabic can be large enough to make them mutually incomprehensible. Hence, given the large differences between such spoken languages, we can consider them as disparate languages or, more exactly, as different dialects depending on the geographical place in which they are practiced: Morocco, Algeria, ...

In this paper, we focus on Algerian dialect. The concept of dialect here is different from what is accepted in the West. In fact, people in their daily life do not use standard Arabic but dialect, which in most cases differs from standard Arabic. Consequently, people who are not educated cannot understand standard Arabic, which they consider a foreign language.

This work is part of the TORJMAN project (a national research project fully financed by the Algerian research ministry), which is dedicated to translating standard Arabic to Algerian Arabic dialect. Interest in such an extremely complicated problem may be surprising. The issue is difficult to grasp at first, but when we analyze the spoken language in different places in Algeria, for instance, we notice that almost nobody speaks standard Arabic, even though it is the official language of Algeria. Furthermore, this spoken language is not written. The idea of this project is twofold: first, to understand the function and the underlying structure of Algerian dialects, and then to provide the population and socio-economic actors with a tool enabling the average user to understand standard Arabic. Section 2 explains why we are interested in Arabic dialect.

2. WHY ARE WE INTERESTED IN COLLOQUIAL ARABIC?

Since September 11, 2001, international conferences have shown a growing enthusiasm for machine translation from standard Arabic to Indo-European languages. These studies are important when it comes to translating official documents; however, to develop applications for the average citizen, it is necessary to take into account his mother tongue, that is, his dialect.

The main dialectal division is between the Maghreb dialects and those of the Middle East, followed by that between sedentary dialects and Bedouin ones. Watson writes: "Dialects of Arabic form a roughly continuous spectrum of variation, with the dialects spoken in the eastern and western extremes of the Arab-speaking world being mutually unintelligible" [2]. Indeed, while Middle Easterners can generally understand one another, they often have trouble understanding Maghrebis (people from Tunisia, Algeria and Morocco). The converse is not true, owing to the popularity of Middle Eastern, especially Egyptian, films and other media. In some cases people from these countries are unable to understand each other, although at most a few words are unknown to them [3]. In other cases, people from one country find the grammatical structure used in the neighboring country somewhat understandable. Table 1 provides a simple, yet interesting, example of how spoken varieties of Arabic differ in intelligibility. The English sentence "I am going now" is given, in transliteration, in MSA and in the Egyptian, Syrian, Tunisian, Algerian and Moroccan dialects.

Table 1. Variants of Arabic dialects expressing the English sentence "I am going now"

Variety     Transliteration
MSA         ʾanā ḏāhibun al-ʾān
Egyptian    ʾanā rāyiḥ dilwʾty
Syrian      rāḥ rūḥ halla
Tunisian    bāš nimšy tawā
Algerian    rāḥ nrūḥ durk
Moroccan    ʾanā ġādy daba

These examples clearly reflect the distance between dialectal sentences expressing the same idea. If we consider only the word al-ʾān (now) in MSA, we note that its equivalent differs in each of the dialects considered: dilwʾty in Egyptian, halla in Syrian, tawā in Tunisian, durk in Algerian and daba in Moroccan.

Now let us consider the spoken languages of the Maghreb. There are clearly two native languages in Morocco and Algeria: Algerian or Moroccan Arabic, and Berber, a family of similar or closely related languages and dialects indigenous to North Africa (Berbers account for 40 to 50% of the population in Morocco and 25 to 30% in Algeria). In Tunisia, there are only a few Berbers (1 or 2%). In addition, the number of monolingual Berbers in rural areas is not negligible. On the other hand, the most optimistic estimates of illiteracy are 50% in Morocco, 26% in Algeria and 23% in Tunisia [4]. MSA is therefore still mastered by only a small minority. Much of the population is thus monolingual in Moroccan, Algerian or Tunisian Arabic, or bilingual in Berber and Moroccan or Algerian Arabic, with snippets of standard Arabic and French.

3. ALGERIAN ARABIC

In Algeria, as elsewhere, spoken Arabic differs from written Arabic. Algerian Arabic has a vocabulary inspired by Arabic, but the original words have been altered phonologically, with significant Berber substrates and many new words and loanwords from French, Turkish and Spanish. Like all Arabic dialects, Algerian Arabic has dropped the case endings of the written language. It is not used in schools, television or newspapers, which usually use standard Arabic or French; it is more likely to be heard in music, or simply in Algerian homes and on the street. Algerian Arabic is spoken daily by the vast majority of Algerians [5]. It is part of the Maghreb Arabic dialect continuum, and fades into Moroccan Arabic and Tunisian Arabic along the respective borders. Its vocabulary is much the same throughout Algeria, although easterners sound closer to Tunisians while westerners speak an Arabic closer to that of Moroccans.

In this paper we focus on one of the eastern dialects of Algeria: Annaba's dialect (AD). This choice is justified by the fact that this is the dialect we know best. We present its peculiarities in Section 4.

4. SPECIFICITIES OF ANNABA'S DIALECT

To develop any application based on a language, at least a basic linguistic study is necessary, even if a statistical model is used. In this section, we present the main features of the dialect of Annaba.

Annaba's dialect is spoken in the city of Annaba, located in the east of Algeria. It is spoken by more than one million people. As in the other Maghreb Arabic dialects, the most notable feature of this dialect is the collapse of short vowels in some positions. The MSA word kitāb (book) corresponds to ktāb: the short vowel i (kasra) on the first consonant k in MSA is deleted in the dialect and replaced by the sukūn.

In AD, the consonant q is generally pronounced v. For example, qāl (to say) is pronounced vāl. For some words both alternatives exist, like the word qṭaʿ, which can also be pronounced vṭaʿ. Table 2 lists other consonants whose pronunciation differs from standard Arabic, with their dialectal pronunciation.

Table 2. Arabic consonants and their dialectal pronunciation

Consonant   Pronunciation
ḏ           d
ṯ           t
ẓ           ḍ

The hamza, which is very present in standard Arabic, is avoided or bypassed by almost all dialects, including the one used in Annaba. This is practically systematic in the middle or at the end of a word. Either it disappears altogether in pronunciation, or it is replaced by y, as in MSA māʾidah and ʿāʾilah, which correspond respectively to maydah and ʿāylah in the dialect. At the beginning of a word, the hamza can be preserved, as in the imperative form, for example ʾudḫul (enter). However, it disappears automatically if it is preceded by the article al (the): ltnīn (al-ʾiṯnayn in MSA). In the following we give other dialectal characteristics, beginning with the personal pronouns.

4.1. Personal pronouns

The personal pronoun appears in two forms:

a. the separate form, which is used for the nominative "I", "he", etc.
b. the suffixed form, which is used for the possessive "my", "his", etc., or for the objective "me", "him", etc.

The first form stands alone; the second can only be used attached to a noun, verb, or certain particles.

• The personal pronoun: separate form.

Singular.
1. ʾanā, ʾanī (I).
2. masc. nta; fem. nti (You).
3. masc. huwa (He); fem. hiya (She).

Plural.
1. ḥnāyā, ʾiḥnā (We).
2. ntūmā or ʾintum (You), used for both masculine and feminine plural.
3. hūmā (They), also used for both masculine and feminine plural.

It is generally possible to omit the personal pronoun when it is obvious; thus, to ask someone "are you thirsty?", we just say ʿaṭšān? ("thirsty?").

Very often a personal pronoun is added to a word already defined, and this added pronoun may become necessary when the predicate is also defined. Thus used, the pronoun seems to be an equivalent of the verb "to be" [6]: ʾanā huwa lḥafāf (I am the hairdresser).

• The personal pronoun: suffixed form. As already mentioned, for the possessive ("my", "his", "our", etc.) or the objective ("me", "him", "us", etc.) a different system is employed: the pronoun is expressed by a shortened form which is added to the end of a noun, verb, or certain particles. The suffixes thus used are:

Singular.
1. y is used for "my", for example ktābī (my book).
2. k is used for "your", as in ktābik (your book).
3. masc. ū or h for "his", as in ktābū (his book), ḫūh (his brother); fem. hā for "her", as in ktābhā (her book).

Plural.
1. nā is used for "our", as in dārnā (our house).
2. kum is used for "your", as in dārkum (your house).
3. hum is used for "their", as in dārhum (their house).

In the case of feminine nouns ending with tāʾ marbūṭa (h), such as šamʿah, the suffixes are tī, tk, thā, ...

The form tāʿ, combined with the suffixed personal pronouns, is also used to denote possession. It is introduced after the noun to which the possessive refers; that noun must then be made definite by adding the definite article, as in lktāb tāʿī (my book), ad-dār tāʿkum (your house).

4.2. Interrogatives

Table 3 lists the commonest interrogative particles and pronouns used in the dialect of Annaba.

Table 3. Interrogative particles and pronouns in AD and their equivalents in MSA.

English   Annaba dialect   MSA
Who       škūn             man
Which     wanā             ʾayu
Where     wayn             ʾayna
What      wšiyā, wš        māḏā
When      waqtāš           matā
Why       waʿlāš           limāḏā
How       kifāš            kayfa

4.3. The interrogative sentence

Any dialectal sentence can be turned into a question in one of two ways.

1. It may be spoken in an interrogative tone of voice, as in rāḥ taqrā? (Will you revise?).
2. An interrogative pronoun or a compound derived from a pronoun may be used, as in wayniya dārkum? (Where is your house?).

4.4. The negative sentence

The form maš (not) is in general use as a negative particle and may be found with all persons. It can also be combined with the suffixed personal pronouns to form negatives: mašnī (I am not); mašk, maškum (You are not); mašnā (We are not); mašū (He is not); mašī (She is not) and mašhum (They are not). The negative sentence can also be obtained by adding the affixes mā (as a prefix) and š (as a suffix) to verbs. Table 4 gives examples of negative sentences.

Table 4. Negative sentences

English             Annaba dialect
I do not go         maš rāyaḥ
I do not remember   mašnī matfakar
You did not eat     maklītīš

4.5. Pluralization

Algerian Arabic uses broken and regular plurals. As in all other Arabic dialects, the suffix wn, used for the nominative in classical Arabic, is no longer in use in the regular plural. The suffix yn, used in classical Arabic for the accusative and the genitive, is used for all cases. For example, the plural of mūman (believer) is mūmnīn.

For feminine nouns, the plural is mostly regular (obtained by suffixing āt): the plural of bant (girl) is bnāt. For some words the broken plural is used, like ṭwābl, which is the plural of ṭāblah (table).

We have listed above the main features of the dialect of Annaba. We now present how we proceeded to develop corpora for use in a statistical translation system.

5. COLLECTING CORPORA

The statistical translation approach and the availability of ready-to-use tools make it possible to build a machine translation system quickly, given sufficient parallel training data. For translation to (or from) an under-resourced language, this type of parallel corpus does not always exist, or exists only in an amount insufficient for learning robust probabilistic models. In the case of Annaba's dialect, there is no corpus that can be used to develop a translation system. We started this project from scratch. For the construction of such a corpus, a first step is to establish a bilingual dictionary from standard Arabic to the Arabic dialect. The dictionary thus contains entries like this: ʾsriʿ → ʾizrib. This entry is the word corresponding to "act quickly", which is translated into the dialect of Annaba by ʾizrib. This dictionary is the foundation on which the corpus is subsequently built.

To build the dictionary and consequently the corpus, we made live recordings of discussions in different environments (medical offices, cafes, markets, ...) to ensure a large variety of vocabulary. We then transcribed these recordings manually and extracted all words. Subsequently, we assigned to each extracted word the Arabic form which best fits. This resulted in an MSA-Annaba's dialect dictionary and a written dialectal corpus. Table 5 gives a sample of this dictionary. To complete the construction of the parallel corpus, we translated the dialect of Annaba into MSA on the basis of the developed dictionary. A sample of this corpus is given in Figure 1.

Table 5. A sample of the dictionary MSA-Annaba's dialect.

Annaba dialect   MSA
ǧrīt             ǧaraytu
lǧnān            al-bustān
lǧnān            al-ḥadīqah
wrā              warāʾa
ḫallaṣ           saddada
ḫallīk minhā     daʿka minhā

Fig. 1. A sample of parallel corpus MSA-Annaba's dialect.

6. ENRICHING CORPORA

As noted above, a machine translation system requires a large amount of data. Therefore, in order to increase the size of our corpora, we propose to produce new sentences from the initial corpus. New sentences are produced by replacing each word in the original sentence by its different synonyms. Each time a word is replaced by a synonym, a new sentence is produced and added to the initial corpus. The development of such a tool necessarily starts with the development of two dictionaries: one containing synonyms in AD and the other synonyms in MSA. To this end, we used the MSA-AD dictionary. We assigned to each entry (dialect or MSA) one or several synonyms, if they exist. The tool uses the dictionaries of synonyms to produce all possible sentences by combination. Once the sentences are generated, they are added to the appropriate corpus.

7. IS THE MSA-AD CORPUS PARALLEL?

In this section we show that the corpus we have built is really parallel. To this end, we selected the most commonly used measure in this area, called "cosine similarity" [7]. The cosine of a null angle is 1, and it is less than 1 for any other angle; the lowest value of the cosine is −1. The cosine of the angle between two vectors thus determines whether the two vectors are pointing in roughly the same direction. This is often used to compare documents in text mining. The vectors used in this case consist of normalized word frequencies. So, we computed and normalized word frequencies for each of the corpora (AD and MSA) to constitute the vectors. We took vectors of a different size each time, to determine from what size the corpora become very close. In order to interpret these values, we compared them to those obtained with the BAF corpus [8]. The cosine values for our corpus and the BAF one were very close, so the curves overlapped. In order to have more telling curves, we chose to plot their respective angles (see Figure 2).

Fig. 2. The cosine similarity for BAF and MSA-AD corpora (angle values plotted against frequency vector size, from 0 to 1000, for the MSA-AD corpus and the BAF corpus).

We note that the curves are similar. The more we increase the size of the vectors, the more the angles tend to zero. We can therefore confirm that the MSA and AD corpora are parallel.

8. THE DIALECT'S VOCABULARY

In this section we focus on the study of the dialect's vocabulary. We notice that there are three types of words:

• Arabized borrowed words: words belonging to foreign vocabulary (most of them borrowed from French), which were introduced into the dialect after having been naturalized phonetically and/or morphologically. Examples of such words are given in Table 6.

Table 6. Examples of Arabized foreign words

English         Annaba dialect   Origin
Nurse           farmlī           French "infirmier"
Place           blāṣah           French "place"
That's enough   yizzī            Berber
Ship            babūr            Turkish

• Words of unknown origin, like ṣwārad (money), mahbūl (crazy), ...

• Arabic words: the dialect of Annaba is largely based on standard Arabic. However, the words of Arabic origin have undergone some distortions. In order to determine these distortions, we computed the Levenshtein distance. The results showed that the deformations applied to Arabic words are:

- In pronunciation: all consonants occur in the dialectal word but the short vowels are changed. In such cases, the Levenshtein distance is zero.
- By insertion, deletion or substitution of consonants.

Table 7 provides examples of dialect words, their equivalents in standard Arabic and their corresponding Levenshtein distance.

Table 7. Levenshtein distance for dialect words and their equivalents in MSA

MSA             Annaba dialect   Lev. dist.
taʿrif          taʿraf           0
takūn           tkūn             0
al-baḥr         lbḥar            1
yuḥāsibūnahu    yḥāsbūh          1
al-ʾayām        liyām            2
ʾaštarīh        nišrīh           2
yuʿīnuh         yʿāwnū           3
ʾakl            māklah           3

9. CONCLUSION

In this paper, we have presented the main features of the dialect of Annaba through a linguistic study. We believe we are the first to carry out this study.

As specified above, this work is part of the TORJMAN project, which is dedicated to translating standard Arabic to Algerian Arabic dialect. To build a machine translation system, sufficient parallel training data is necessary. In the case of Annaba's dialect, there is no corpus that can be used. So, to build the corpus, we made recordings of the dialect, which we transcribed. We subsequently developed an AD-MSA dictionary that we used to translate the dialect corpus into standard Arabic. We showed, using the cosine similarity measure, that the resulting corpus is parallel. We have also presented a study of the dialect's vocabulary, which has shown that it is mainly inspired by standard Arabic. The development of a machine translation system is the subject of our future work.

10. REFERENCES

[1] Abdel Monem A., Shaalan K., Rafea A., and Baraka H., "Generating Arabic text in multilingual speech-to-speech machine translation framework," in Machine Translation, Springer, 2009.

[2] Jeremy Palmer, "Arabic diglossia: Teaching only the standard variety is a disservice to students," http://w3.coh.arizona.edu/AWP/AWP14/Palmer.pdf, 2007.

[3] Barkat-Defradas M., Al-Tamimi J., and Benkirane T., "Phonetic variation in production and perception of speech: a comparative study of two Arabic dialects," in Proc. of the 15th International Congress of Phonetic Sciences (ICPhS), 2003.

[4] Dominique Caubet, "Arabe maghrebin," http://corpusdelaparole.in2p3.fr/spip.php.

[5] Boucherit A., L'Arabe parlé à Alger, ANEP Edition, 2002.

[6] De Lacy O'Leary, "Colloquial Arabic," http://www.archive.org/details/colloquialarabic00oleauoft, digitized by the Internet Archive in 2007 with funding from Microsoft Corporation.

[7] Salton G. and Mac Gill M.J., Introduction to Modern Information Retrieval, International Student Edition, 1983.

[8] Langlais P., Simard M., Veronis J. et al., "Arcade: A cooperative research project on parallel text alignment evaluation," http://www.lpl.univ-aix.fr/projects/arcade, 1998.
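As a supplementary illustration of the enrichment step described in Section 6, the following is a minimal sketch of how new sentences could be generated by synonym substitution. It is not the tool used in the project: the function name `expand_sentence`, the whitespace tokenization and the toy English synonym dictionary are all assumptions made for the example.

```python
from itertools import product


def expand_sentence(sentence, synonyms):
    """Return every sentence obtainable by swapping words for their synonyms.

    `synonyms` maps a word to a list of synonymous words (hypothetical format).
    The original sentence is always the first element of the result.
    """
    words = sentence.split()  # assumption: tokens are separated by whitespace
    # For each position: the original word followed by its listed synonyms.
    options = [[w] + [s for s in synonyms.get(w, []) if s != w] for w in words]
    # The Cartesian product enumerates all combinations of choices.
    return [" ".join(combo) for combo in product(*options)]


# Toy synonym dictionary with English placeholders (not taken from the corpus).
toy_synonyms = {"big": ["large"], "house": ["home", "dwelling"]}
generated = expand_sentence("the big house", toy_synonyms)
```

With one synonym for "big" and two for "house", the sketch yields 2 x 3 = 6 sentences, which shows how quickly the number of generated sentences grows with the number of synonyms per word.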
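As a supplementary illustration of the measure used in Section 7, the sketch below computes the cosine similarity, and the corresponding angle, between two vectors of normalized word frequencies. The paper does not state how the components of the two vectors are aligned or in which unit the angles are expressed, so the rank-ordered frequency profile, the degree unit and all function names here are assumptions.

```python
import math
from collections import Counter


def frequency_vector(tokens, size):
    """Normalized frequencies of the `size` most frequent words, by rank.

    Assumption: components are aligned by frequency rank, and `tokens`
    is non-empty.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:size]
    top += [0] * (size - len(top))  # pad when the vocabulary is smaller
    return [c / total for c in top]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def angle_degrees(u, v):
    """Angle between u and v in degrees (cosine clamped against rounding)."""
    c = max(-1.0, min(1.0, cosine(u, v)))
    return math.degrees(math.acos(c))


# Toy token lists standing in for the two sides of a parallel corpus.
side_a = "a a a b b c".split()
side_b = "x x x y y z".split()
angle = angle_degrees(frequency_vector(side_a, 3), frequency_vector(side_b, 3))
```

In this toy case both sides have the same frequency profile, so the angle is essentially zero. Repeating the computation for increasing values of `size` would give a curve of the kind described for Figure 2.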
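As a supplementary illustration of the comparison in Section 8, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. The paper does not say on which representation the distance was computed. Because a change of short vowels alone gives a distance of zero, the sketch assumes unvocalized Arabic-script spellings, reconstructed here from three transliterations in Table 7 and therefore to be treated as assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# (MSA, dialect) pairs in unvocalized Arabic script, written as escapes.
# These spellings are an assumed reading of the Table 7 transliterations.
pairs = [
    ("\u062a\u0643\u0648\u0646", "\u062a\u0643\u0648\u0646"),  # takun / tkun
    ("\u0627\u0644\u0628\u062d\u0631", "\u0644\u0628\u062d\u0631"),  # al-bahr / lbhar
    ("\u0627\u0644\u0623\u064a\u0627\u0645", "\u0644\u064a\u0627\u0645"),  # al-'ayam / liyam
]
distances = [levenshtein(msa, dialect) for msa, dialect in pairs]
```

Under these assumed spellings the three distances agree with the values reported for the same words in Table 7.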