
Transferring Egyptian Colloquial into Modern Standard Arabic

Khaled Shaalan
The Institute of Informatics, The British University in Dubai, PO Box 502216, Dubai, UAE

Hitham M. Abo Bakr
Computer & System Dept, Zagazig University

Ibrahim Ziedan
Computer & System Dept, Zagazig University

Abstract

Arabic is rooted in Classical or Qur'anic Arabic, but over the centuries the language has developed into what is now accepted as Modern Standard Arabic (MSA). Arabic colloquial dialects are generally only spoken languages, but recently the rate of colloquial written text has increased dramatically as a medium for expressing ideas, especially across the WWW, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text has been in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. We are able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words. The advantages of this lexical transfer are facilitating communication with colloquial Arabic speakers and restoring colloquial text to the standard language in use nowadays. This paper addresses the transfer techniques between colloquial Arabic and MSA, which have not been closely studied before. In particular, we present a rule-based lexical transfer approach for converting Egyptian colloquial words into their corresponding MSA words. This process involves morphological analysis and lexical acquisition of colloquial words.

Keywords

Colloquial Arabic dialects processing, and transferring Egyptian Arabic into Modern Standard Arabic.

1. Introduction

Colloquial Arabic is a collective term for the spoken languages or dialects of people throughout the Arab world. Although it is descended from Classical Arabic, it is considered a separate language. Speakers of some of these dialects are unable to understand speakers of other Arabic dialects. Recently, the rate of colloquial written text has increased dramatically. Modern Standard Arabic (MSA) is the official Arabic language, taught and understood all over the Arab world. MSA poses many challenges for the development of morphological and syntactic processing tools. These significant tools will become more complicated if they also have to handle Colloquial Arabic problems in parallel.

Today Egyptian Arabic, also known as Masri, is the dialect spoken in Egypt by more than 70 million people. It is understood across the Arab world due to the predominance of Egyptian media, making it one of the most widely spoken and most widely studied varieties of Arabic. For this reason we selected Egyptian Arabic to demonstrate the capability of our approach in transferring a Colloquial Arabic dialect into MSA.

In the literature, there are few studies that relate colloquial Arabic to MSA [6, 7]. These studies have focused on the spoken colloquial features of Arabic, while our research focuses on written colloquial Arabic. Our approach is to develop transfer techniques that are able to perform the lexical mapping between written colloquial Arabic and MSA. The resultant front-end module will make it easy to incorporate colloquial Arabic into existing MSA tools. This will widen the coverage of current Arabic natural language processing applications to include colloquial languages or dialects of Arabic. Our proposed research builds the linguistic transformation resources between colloquial Arabic and MSA using the rule-based method. The data collection process will gather colloquial words from Arabic websites across the Web.

The paper is structured as follows. Section 2 discusses the challenges in handling written colloquial Arabic. In Section 3, we propose solutions for these problems. Section 4 gives background information. Section 5 concentrates on handling the deviation of Egyptian Arabic from MSA. Section 6 gives some concluding remarks.

2. Challenges in Handling Written Colloquial Arabic with Regard to MSA

Language processing of colloquial Arabic is a difficult task. The reasons for this difficulty come from several sources:

1- There are two ways in which colloquial Arabic speakers write colloquial words. One way is to Romanize the colloquial word, i.e., to transliterate it from Arabic script into the Latin alphabet. Informal chatting across chat rooms and exchanged SMS messages in the Arab community are usually written in Romanized letters. The other way is to write colloquial words using normal Arabic letters.
2- Deviation from MSA. There are five main deviations from MSA (each example gives the colloquial form, then the MSA form it derives from):
   • Distortion of verbs (e.g. بليته from بللته; ضَرَبْتِيه from ضَرَبْتِه; حاكتب from سأكتب; ماتأعد from أما تقعد).
   • Distortion of nouns (e.g. الخِير from الخَير; ده from هذا; خايف from خائف; جَمهور from جُمهور; مين from مَنْ; فين from أين).
   • Distortion of pronouns and particles (e.g. عصايتي from عصاي; احنا from نحن; هوَّ from هُوَ).
   • Distortion of the structure of the word form (e.g. اتاوب from تثاءب; اتَّاوى from اوى; بغبغان from ببغاء; تلات شهور from ثلاثة شهور).
   • Replacement of characters and vowels (e.g. تِعبان from ثعبان; توم from ثوم; سقب from ثقب; شبط from شبث, meaning "to cling").
3- Lack of syntactic rules. There are no identified syntactic rules for colloquial dialects.
4- Lexical expansion rate. As colloquial Arabic is more popular than MSA, newly added expressions/words are observed much more often than in MSA.

3. The Proposed Approach

For each of the problems introduced in the previous section, we suggest a solution.

To solve the problem of writing colloquial Arabic in the Latin alphabet, we propose the following process:
• Detect Romanized words in the input and transliterate these words into Arabic letters,
• Normalize the words, for example by removing the repeated characters that are often used informally to indicate emotion, and
• Look up the Colloquial-to-MSA lexicon for the closest colloquial word match and return the corresponding colloquial entry.

As an example, the phrase “Meeeesh 3aweez 7agh” will be converted to “ميش عاوز حاجة” (I do not need anything).

To solve the problem of the deviation of Egyptian Arabic from MSA, which is the major contribution of this research, we used an existing mature MSA lexicon (Buckwalter lexicon version 2 [3]; see the description of Buckwalter's Arabic morphological analyzer at http://www.qamus.org/morphology.htm) to build the Colloquial-to-MSA lexicon such that both sets of entries coexist in one lexicon. We followed the same morphological analysis approach as this tool in analyzing the colloquial Arabic word. A rule-based lexical transfer approach is used to transform the analyzed colloquial Arabic word into MSA word(s).

To solve the problem of the lack of identified colloquial syntactic rules, we suggest using empirical corpus-based techniques from Example Based Machine Translation (EBMT) [8, 9]. This requires building a parallel corpus of colloquial and MSA text. The development of such a corpus is relatively new and will be published elsewhere.

To solve the problem of acquiring new colloquial words/expressions, we propose a process based on EBMT techniques that maintains the lexicon and keeps it up to date. This process will gather Arabic text from the Web. The text is analyzed in order to recognize unknown lexical items. An Arabic specialist then decides whether or not to add each unknown lexical item to the lexicon.

4. The Buckwalter Morphological Analyzer

We build our system on top of the Buckwalter Arabic Morphological Analyzer Version 2.0 [3]. Its morphological analysis depends on a dictionary of prefixes, a dictionary of suffixes, a stem dictionary, and three checking tables for testing the validity of a word analysis. The morphological analyzer tries to break down the input Arabic word into three elements: prefix, stem, and suffix. If all three word elements are found in their respective lexicons, then their respective morphological categories are used to determine whether they are compatible. If all the morphological category pairs are compatible, then the morphological analysis is valid.

Each entry in the three lexicon files consists of four tab-delimited fields:
1. the entry (prefix, stem, or suffix) without short vowels and diacritics,
2. the entry (prefix, stem, or suffix) with short vowels and diacritics,
3. its morphological category (used for checking the compatibility between prefixes, stems, and suffixes), and
4. its English gloss(es), including selective POS data within XML tags ...

Only fields 1 and 3 are required for morphological analysis. Fields 2 and 4 provide additional information once the morphological analysis has succeeded in producing the analyzed word(s). Arabic script data in the lexicons is provided in the Buckwalter transliteration scheme.

The following is a description of the three lexicon files:
• dictPrefixes contains all Arabic prefixes and their concatenations. Sample entry:
      w    wa    Pref-Wa    wa/CONJ
• dictSuffixes contains all Arabic suffixes and their concatenations. Sample entry:
      p    ap    NSuff-ap    [fem.sg]    ap/NSUFF_FEM_SG
• dictStems contains all Arabic stems. Sample entries:
      ktb    katab    PV    write
      ktb    kotub    IV    write

There are three compatibility tables, each of which lists pairs of compatible morphological categories:
• Compatibility table tableAB lists compatible Prefix and Stem morphological categories, such as:
      NPref-Al    N
      NPref-Al    N-ap
• Compatibility table tableAC lists compatible Prefix and Suffix morphological categories, such as:
      NPref-Al    Suff-0
      NPref-Al    NSuff-u
• Compatibility table tableBC lists compatible Stem and Suffix morphological categories, such as:
      PV    PVSuff-a

5. The Proposed Solution of Transferring Colloquial Arabic Dialect to MSA

Our proposed transfer techniques are based on previous studies of the transformations between MSA and colloquial Arabic [1, 2, 4, 5]. We used the indicated variations to acquire the lexical transfer rules that can be used to derive the MSA word from a corresponding colloquial Arabic word. Additional rules will be acquired and judged by an Arabic specialist during the lexical acquisition process. These rules are used to analyze the input colloquial word and produce the target MSA word(s).

5.1 Examples of Egyptian Colloquial Word to MSA Transformations

The colloquial Arabic word is normally derived from a well-formed MSA word. This process can be traced back to the distortion (transformation) that changed the MSA word into a colloquial Arabic word form. The relationship between well-formed MSA words and colloquial words has been discussed by many linguists [1, 2, 4, 5]. Table 1 shows distortion examples and how to transfer them into MSA words.

The transfer between the Egyptian Arabic dialect and MSA is a one-to-many transformation. This means some Egyptian Arabic words can be transferred in one or more steps through lexicon lookup, as the mapping involves more than one morpheme. For example, the Egyptian word ازيك "How are you?" is transformed to the two MSA words "كيف حالك؟". Other examples are:
• ماورد (Ma2 ward): ماء ورد
• كلشينكان (Koleshenkan): كل شيء كان
• أجرنك (2agranak): لا جرم انك (in colloquial usage, "أجرنك شاطر" means "لا جرم انك شاطر")
• أشمعنا (2eshMe3na): ايش المعني
• إكمنه (2kmeno): كما انه
• بسملة (Besmellah): بسم الله

In colloquial language processing, a word might be added to the lexicon which does not have a corresponding word in the formal language. This is also the case in Egyptian colloquial (e.g. the word "بقى" is used to indicate either an exclamation or an interrogation, such that both the symbols “?!” appear together at the end of the sentence). This is best explained by the following examples:
• “بقى أنت تعمل كدة؟”, which is transferred to MSA as “أنت تفعل هذا؟!” (Do you do this?!), and
• "ازيك بقى؟", which is transferred to MSA as “كيف حالك؟!” (How are you?!).

Table 1. Examples that illustrate the relation between MSA words and Egyptian Arabic words (EGW)

MSA | EGW | Distortion type | Handling method
يد | إيد | Replacement of vowels ("فتح الاول والعامية تكسره": the first letter takes a fatha in MSA, a kasra in the colloquial) | Add new stem and assign the same rules as the colloquial Arabic word (CAW)
وأنا; نحن | ونا; احنا | Distortion in pronouns and particles ("التحريف في الضمائر": distortion in pronouns) | Add new stem and assign the same rules as CAW
البارحة; أمس | امبارح | Distortion in pronouns and particles ("التحريف في حروف المعاني": distortion in particles, "ال" to "ام") | Add new stem and assign the same rules as CAW, mapping to "أمس" in MSA even though "البارحة" is the more suitable form
قال; يا ليت; متاع; تَمَطَّى; سلحفاء | أال; ياريت; بتاع; تمطع; زحلفة | Replacement of characters and vowels ("ابدال الحروف": substitution of letters) | Add new stem and assign the same rules as CAW
ابتل; ارتمى; ارتوى; اشتوى; افتضح | اتبل; اترمى; اتروى; اتشوى; اتفضح | Distortion in the structure of the word ("تقدم التاء علي فاء الفعل في صيغة أفتعل": the ta moves before the first radical in the افتعل pattern) | Add new stem and assign the same rules as CAW

5.2 Lexicon Structure

We enhanced Buckwalter's lexicon tables with new extra fields:
• ID: an identifier to distinguish each word segment. This field is used for indexing purposes.
• SegmentType: it can be either MSA (Ar-Ar), Egyptian dialect (Ar-Eg), or another dialect such as Jordanian (Ar-Jr) for future extension of the lexicon.
• NewSegmentPosition: the new position of the word segment, which indicates its proper order within the target MSA word or sentence. This field takes one of the following values:
   o same position (SP),
   o start of word (SoW),
   o end of word (EoW),
   o start of sentence (SoS), and
   o end of sentence (EoS), and the like.

For example, the Egyptian colloquial sentence "جيت امتى؟" (you came when?) is literally transformed to the MSA sentence "جئت متى؟" (you came when?). Given that the word “امتى” takes the value “SoS” for the NewSegmentPosition field, the transformation moves this word to the beginning of the sentence in order to get the target MSA sentence “متى جئت؟” (When did you come?).

5.3 Mapping Rules

A new database file, called the Mapping Table (MT), is introduced to encode the mapping rules from Egyptian Arabic to MSA. This table uses the value of the lexicon's ID field to cross-reference the lexical entries inside the rules. The mapping is either one-to-one or one-to-many. An entry of this table has three fields: source colloquial word, target MSA word, and the mapping mode. The mapping mode takes one of two values: 0 indicates one-to-one and 1 indicates one-to-many. In the following we present examples of mapping rules along with their related lexicon entries.

Example 1: mapping the colloquial interrogative “امتي” (when) to the MSA word "متي".

This rule will be represented in the MT by an entry with the values: source colloquial interrogative ID=79831, target MSA interrogative ID=64063, and mapping mode=0, where the source and target word entries in the lexicon are:

      متى: 64063  mtY  mataY  FW-Wa  when  mataY/INTERROG_PART  متَى  mataY_2  متَى/أداة استفهام  SP  Ar-Ar  متي (mty) (1) 1  متَى_2
      أمتى: 79831  >mtY  >mtY  FW-Wa  when  >mtY/INTERROG_PART  أمتَى  >mataY  أمتَى/أداة استفهام  SoS  Ar-Eg  أمتي  >mty  أمتَى

Example 2: mapping the colloquial prefix “عال” (on-the) to the MSA word "على" (on) and the prefix article “ال” (the).

These rules will be represented in the MT by two entries (one-to-many): 1) source colloquial prefix ID=79835, target MSA preposition ID=46196, and mapping mode=0, and 2) source colloquial prefix ID=79835, target MSA article ID=15, and mapping mode=1.

In addition to adding colloquial prefixes, stems and suffixes to the corresponding lexicon database file, the compatibility database files should also be modified to include entries that will verify the recognized prefix, stem, and suffix of the input Egyptian Arabic word. Consequently, the colloquial prefix “عال” (EAl) will also have entries in the respective compatibility tables, tableAB and tableAC. These entries will be treated in a similar way to the MSA prefix “بال” (BiAl). In order to distinguish between MSA and colloquial entries, we used the prefix "C_" as an indicator of a colloquial entry, e.g. the morphological category of “عال” (EAl) is “C_NPref-EAl” while the MSA category for “بال” (BiAl) is “NPref-BiAl”.

6. Conclusion

We have investigated the variations between Egyptian Arabic and MSA, and introduced lexical transfer techniques between these languages. These techniques reuse existing Arabic morphological analysis resources and enhance these resources with metadata of Egyptian Arabic. Our approach is able to transfer written Egyptian colloquial dialect into its corresponding MSA forms in order to cope with the dramatic increase of written colloquial dialects. This step showed that it is easy to incorporate colloquial Arabic dialects into existing MSA tools. We hope these techniques will be applied to other colloquial Arabic dialects such as Moroccan and Levantine. Moreover, using MSA as a hub language, into and out of which all transfer is done, will make the transfer among these Arabic colloquial dialects straightforward, such that speakers of one dialect are able to read and understand written material in other Arabic dialects.

References

1. Shawki Deef, Tahrifat Al Amiah Lil Fousah Fi El Kawaad wa Al Bonian we Al Horouf wa Al Harakat (تحريفات العامية للفصحى في القواعد والبِنْيات والحروف والحركات), Dar El Maaref, Egypt, 1994.
2. Socrates Spiro, "An Arabic – English Dictionary of the Colloquial Arabic of Egypt", Bookshop Publisher, Lebanon, 1973.
3. Tim Buckwalter, Buckwalter Arabic Morphological Analyzer Version 2.0, LDC Linguistic Data Consortium, University of Pennsylvania, 2004. Available at http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02
4. Ahmed Taymour, "Moaagam Taymour Al Kbir: volume 1, 2 & 3" (معجم تيمور الكبير: مجلد 1؛ 2؛ 3), Dar El Afak el Arabia, Egypt, 2003.
5. Ibn El hanbaly, "Bahr ul-awwam fi ma asaba fihil a'wam" (بحر العوَّام فيما أصاب فيه العوامّ), Ibn Zietoun, 1937.
6. Owen Rambow, David Chiang, Mona Diab and Nizar Habash, The final report: Parsing Arabic Dialects (version I), CLSP, JHU, Baltimore, USA, 2006.
7. Nizar Habash and Owen Rambow, MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects, In the Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, pp 681–688, July 2006.
8. Ralf D Brown, Example Based Machine Translation in the Pangloss System, In the Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, pp 169-174, 1996.
9. Ralf D Brown and Robert Frederking, Applying Statistical English Language Modeling to Symbolic Machine Translation, In the Proceedings of the 6th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-95), Leuven, Belgium, pp 221-239, 1995.
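
As an illustration of Sections 5.2 and 5.3, the following Python sketch shows one way the Mapping Table and the NewSegmentPosition field could work together on the sentence used in Section 5.2. It is not the authors' implementation. The names LexEntry, LEXICON, MAPPING_TABLE and transfer_sentence, the whitespace tokenization, and the IDs 90001 and 90002 are assumptions made for this example; only the IDs 79831 and 64063 come from Example 1. Morphological segmentation into prefix, stem and suffix is left out, so whole words are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    entry_id: int
    surface: str        # word segment in Arabic script
    segment_type: str   # "Ar-Ar" for MSA, "Ar-Eg" for Egyptian
    new_position: str   # NewSegmentPosition: "SP", "SoS" or "EoS" in this sketch


# Toy lexicon holding MSA and colloquial entries side by side.
LEXICON = {e.entry_id: e for e in [
    LexEntry(79831, "امتى", "Ar-Eg", "SoS"),   # IDs taken from Example 1
    LexEntry(64063, "متى", "Ar-Ar", "SP"),
    LexEntry(90001, "جيت", "Ar-Eg", "SP"),     # hypothetical IDs
    LexEntry(90002, "جئت", "Ar-Ar", "SP"),
]}

# Mapping Table rows: (source ID, target ID, mapping mode).
# A one-to-many rule would be several rows sharing one source ID.
MAPPING_TABLE = [
    (79831, 64063, 0),
    (90001, 90002, 0),
]

PUNCTUATION = "؟!?."


def lookup_colloquial(word):
    """Return the Egyptian lexicon entry whose surface form equals word, if any."""
    for entry in LEXICON.values():
        if entry.segment_type == "Ar-Eg" and entry.surface == word:
            return entry
    return None


def transfer_sentence(sentence):
    """Map each colloquial word to MSA, then reorder by NewSegmentPosition."""
    stripped = sentence.rstrip(PUNCTUATION)
    tail = sentence[len(stripped):]          # trailing punctuation is kept as is
    start, middle, end = [], [], []
    for word in stripped.split():
        source = lookup_colloquial(word)
        if source is None:
            middle.append(word)              # unknown word: leave unchanged
            continue
        targets = [LEXICON[t].surface
                   for s, t, _mode in MAPPING_TABLE if s == source.entry_id]
        bucket = {"SoS": start, "EoS": end}.get(source.new_position, middle)
        bucket.extend(targets or [word])     # no rule found: keep the word
    return " ".join(start + middle + end) + tail


print(transfer_sentence("جيت امتى؟"))
```

Calling transfer_sentence on "جيت امتى؟" first maps the two words through the table and then moves the word marked SoS to the front, which reproduces the target sentence “متى جئت؟” given in Section 5.2. Keeping the position on the source entry, as the paper's lexicon does, means the reordering needs no separate syntactic rule.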