
The Challenges and Pitfalls of Arabic Romanization and Arabization

Jack Halpern (春遍雀來) The CJK Dictionary Institute, Inc. (日中韓辭典研究所) 34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001, Japan [email protected]

Abstract

The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.

1 Introduction

The process of automatically transcribing Arabic to a Roman script representation, called romanization, is a tough computational task to which there is no definitive solution. The opposite operation of transcribing a non-Arabic script into Arabic, called arabization, is also difficult but for different reasons.

This paper briefly describes the algorithms and major linguistic issues encountered in the course of developing two automatic transcription systems: (1) Automatic Romanizer of Arabic Names (ARAN), which romanizes unvocalized Arabic names into various romanization systems, and (2) Non-Arabic Name Arabizer (NANA), which arabizes non-Arabic names written in the Roman and CJK scripts.

A novel feature of these systems is that they are fine tuned to transcribing personal names and placenames to and from Arabic, with special focus on the linguistic knowledge and rules required for transcribing CJK names written in their native script directly into Arabic, something probably never attempted. These systems are part of our ongoing efforts to develop Arabic resources for automatic transcription, machine translation and named entity extraction.

The following typographic conventions are used in this paper:

1. Phonemic transcriptions are indicated by slashes (/qaabuus/ < قابوس).
2. Phonetic transcriptions are indicated by square brackets ([qɑːbuːs] < قابوس).
3. Graphemic transliterations are indicated by back slashes (\qAbws\ < قابوس).
4. Popular transcriptions are indicated by italics (Qaboos < قابوس).

2 Motivation and Previous Work

Arabic transcription technology is playing an increasingly important role in a variety of practical applications such as named entity recognition, machine translation, cross-language information retrieval and various security applications such as anti-money laundering and terrorist watch lists. Despite the importance of these applications, Arabic transcription has not been the subject of sufficient studies that examine the linguistic issues. This paper attempts to fill that gap.

Several companies and researchers have developed automatic diacriticization software. Vergyri and Kirchhoff (2004) report the high error rate of these products. Gal used an HMM bigram model and achieved a 14% error rate, while AbdulJaleel and Larkey (2003) developed an n-gram based statistical system for arabizing English, with an error rate of 10%-20%. Elshafei et al. (2006) report a 5.5% error rate using an HMM approach, while Arbabi et al. (1994) developed a diacriticizer that combines a knowledge base with neural networks to achieve a low error rate of 3.1% but which rejects 55% of the names as unprocessable.

We have not used sophisticated statistical approaches. Our basic strategy has been to use conventional linguistic knowledge because we believe that ultimately statistical methods by themselves are inadequate. Kay (2004) argues that "statistics are a surrogate for knowledge of the world" and that "this is an alarming trend that computational linguists ... should resist with great determination." This was reinforced by Farghaly (2004) when he wrote "It is becoming increasingly evident that statistical and corpus-based approaches...are not sufficient..."

Our policy is that linguistic rules, based on deep analysis of the source and target scripts, are indispensable. To rephrase, many contemporary statistical methods involve brute-force mathematical techniques that exploit vast amounts of data, whereas a rule-based approach captures aspects of human intelligence because it is based on linguistic knowledge. We have combined linguistic rules with statistically derived mapping tables to build a flexible system that can be extended to other Arabic script based languages.

3 Basic Concepts

Much confusion surrounds the terms transliteration and transcription, with the former often misleadingly used in the sense of the latter even in academic papers (AbdulJaleel and Larkey, 2003). To discuss these concepts in an unambiguous manner it is necessary to understand these and related terms correctly.

Romanization is the representation of a language written in a non-Roman script using the Roman alphabet. This includes both transliteration and transcription, e.g. محمد is transliterated as \mHmd\ and transcribed as Mohammed, Muhammad, or Mohamad, among many others.

Transliteration is a representation of the script of a source language by using the characters of another script. Ideally, it unambiguously represents the graphemes, rather than the phonemes, of the source language. For example, محمد is transliterated as \mHmd\, in which each Arabic letter is unambiguously represented by one Roman letter, enabling round-trip conversion.

Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic correspondence. This includes the following subcategories:

1. A phonetic transcription represents the actual speech sounds, including allophones. The best known of these is IPA. For example, محمد is transcribed as [muħɛ̈mmɛ̈d].
2. A phonemic transcription represents the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is transcribed as /muHammad/, in which a represents the phoneme /a/, rather than the allophone [ɛ̈].
3. A popular transcription is a conventionalized rendering that roughly represents pronunciation. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Diacriticization is the process of adding vowel signs (called vocalization) and other diacritics. For example, محمد \mHmd\ is converted to the vocalized مُحَمَّد \muHam~ad\. Note the four diacritics that were added.

Arabization is the reverse of romanization; that is, the representation of a non-Arabic script, such as the Roman and CJK scripts, using the Arabic script, e.g., Muhammad → محمد, Clinton → كلينتون, 埼玉 Saitama → سايتاما.

4 Why is Arabic ambiguous?

A distinguishing feature of consonantal scripts in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels, referred to as unvocalized Arabic. Though diacritics can be used to indicate short vowels, they are used sparingly, while the use of consonants to indicate long vowels is ambiguous. On the whole, unvocalized Arabic is highly ambiguous and poses major challenges to Arabic information processing applications.

4.1 Morphological Ambiguity

Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 valid forms, including affixes and
clitics as well as inflectional syncretisms. For example, كاتب can represent any of the following seven wordforms: كَاتِب /kaatib/, كَاتَبَ /kaataba/, كَاتِبٍ /kaatibin/, كَاتِبٌ /kaatibun/, كَاتِبَ /kaatiba/, كَاتِبِ /kaatibi/, كَاتِبُ /kaatibu/.

4.2 Orthographical Ambiguity

On the orthographic level, Arabic is also highly ambiguous. For example, the string مو can theoretically represent 40 consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc., though in practice some may never be used. Humans can normally disambiguate this by context, but for a program the task is formidable.

Conventional wisdom has it that the Arabic script is ambiguous "due to non-representation of short vowels," while other features are often lightly passed over. In fact, a whole gamut of factors contribute to orthographical ambiguity.

The list of factors below is not intended to serve as a detailed treatment of Arabic orthographic ambiguity, but to demonstrate the principal linguistic issues that need to be addressed to achieve accurate transcription.

1. The greatest challenge is the omission of short vowels; e.g., the unvocalized كاتب \kAtb\ can represent seven wordforms such as كَاتِب /kaatib/ and كَاتِبَ /kaatiba/.
2. In contrast, some short vowels actually are represented. For example, taa' marbuuTa often indicates a short /a/, as in جامعة /jaami`a/, while in foreign names short and long vowels are normally written identically by adding ا, ي or و, as in روسيا \rwsyA\ 'Russia'.
3. Long /aa/ can be expressed in multiple ways, e.g., (1) by 'alif Tawiila (ا) as in سوريا, (2) by 'alif maduuda (آ) as in آسيا, and (3) by 'alif maqSuura (ى) as in آسيا الوسطى.
4. Long vowels are sometimes omitted too, as in هذا /haadha/. In this case, the 'alif qaSiira (ـٰ) is omitted.
5. Not all bare alifs represent long /a/. Some are silent (next item), while some are nunated; e.g., را in شكرا represents /ran/ (راً), not /raa/ (رَا).
6. 'alif alfaaSila (otiose alif), added to the third person masculine plural forms of the past tense, is a mere orthographic convention and is not pronounced. Though it must be transliterated, it must not be transcribed, e.g., كتبوا is transliterated as \ktbwA\, with 'alif at the end, but transcribed as /katabuu/, omitting the 'alif.
7. The shadda indicating consonant doubling (gemination) is normally omitted, e.g., the unvocalized محمد Muhammad (vocalized مُحَمَّد) provides no clues that the [m] should be doubled.
8. Another source of ambiguity is the omission of tanwiin diacritics for case endings, e.g., in شكرا \$ukrAF\ (vocalized شُكْرَاً), the fatHatayn is not written.
9. The rules for determining the hamza seat are of notorious complexity. In transcribing to Arabic, it is difficult to determine the hamza seat as well as the short vowel that follows; e.g., hamzated waaw (ؤ) could represent /'a/, /'u/ or even /'/ (no vowel).
10. In arabization, determining the hamza seat requires the application of complex rules based on the phonological environment, which is further complicated by the frequent omission and inconsistent use of hamza in foreign names (see Section 7).
11. Phonological alternation processes such as assimilation modify the phonetic realization. For example, the unvocalized الرجل الطويل 'the tall man' is realized as /'arrajulu-TTawiilu/ (اَلرَّجُلُ ٱلطَّوِيلُ), in which the ال is assimilated into طَّ /TTa/, not as /'alrajulu alTawiilu/.
12. Vowel shortening is sometimes lexically determined and thus cannot be predicted from the orthography; e.g., في القاهرة 'in Cairo' is pronounced /fi-lqaahira/, not /fii-lqaahira/. That is, /fii/ is shortened to /fi/.

5 Automatic Romanizer of Arabic Names

5.1 Overview

The Automatic Romanizer of Arabic Names (ARAN) consists of multiple modules for the transcription and transliteration of Arabic and related tasks such as variant generation and vocalization. The core problem that ARAN addresses is making an intelligent guess at determining the vowels of unvocalized Arabic names and generating romanized candidates based on statistically motivated linguistic rules derived from an in-depth analysis of Arabic orthography. The principal components of ARAN are:

1. ATAN: Automatic Transcriber of Arabic Names
2. AXAN: Automatic Transliterator of Arabic Names
3. APAN: Automatic Phoneticizer of Arabic Names
4. ADAN: Automatic Diacriticizer of Arabic Names
5. AVAN: Automatic Variant Generator for Arabic Names

Table 1 shows examples of how each module processes a string of unvocalized Arabic:

Table 1. Output from Various ARAN modules
Unvocalized (input) | Vocalized (ADAN) | Phonemic (ATAN) | Graphemic (AXAN) | Phonetic (APAN) | Popular (AVAN)*
محمد | مُحَمَّد | muHammad | mHmd | muħɛ̈mmɛ̈d | Muhammad
قابوس | قَابُوس | qaabuus | qAbws | qɑːbuːs | Qaboos
جمال | جَمَال | jamaal | jmAl | dʒɛ̈mɛ̈ːl | Jamal
مكة | مَكَّة | makka | mkp | mɛ̈kkɛ |
*Only one popular variant is shown, but in reality there could be dozens. For example, for قابوس AVAN generates Qabuus, Qabus, Qabous, Qabooss, … and many more.

5.2 Romanization Algorithm

The romanization algorithm accepts an Arabic string as input and generates a list of romanized candidates by combining lookup in the Database of Arabic Names (DAN), a database of about 180,000 romanized Arabic names and their variants, with a knowledge base of rules. ARAN can generate candidates in pure algorithmic mode, or it can access DAN to find explicit entries before resorting to algorithmic generation. Roughly, the algorithm works as follows:

1. Get an Arabic string from the input file.
2. Transliterate to Buckwalter for internal processing using the AXAN module.
3. Attempt to find an exact match in DAN.
4. If that fails, perform a fuzzy match to retrieve from DAN.
5. If that fails, generate romanization candidates algorithmically.
6. Output a list of romanized candidates.

For example, إبراهيم is first transliterated to \<brAhym\.

Fuzzy matching, such as ignoring hamza and collapsing 'alif with 'alif maqSuura, is a bit risky but might improve the match rate because fuzzily matched names could often be correct, whereas generated names could have incorrect short vowels. The user can set parameters to output any desired combination of three modes: exact match, fuzzy match or algorithmic generation.

5.3 Rules Knowledge Base

ARAN uses a knowledge base module for generating romanized strings from the Arabic input string. This is the central component of the algorithm but is independent of it for maximum flexibility. The rules can be modified by the user to further refine the accuracy or to adjust them to other Arabic-script based languages.

The knowledge base was created by in-depth analysis of the Arabic orthography using the results of statistical analysis of a large name corpus based on a bilingually aligned phone directory. A regular-expression-like mini-language for writing vocalization and romanization rules was developed in which LHS (left-hand side) and RHS (right-hand side) style rules are defined as declarative statements on a high level of abstraction independent of specific computer languages. These are then implemented by the appropriate functions in the romanization algorithm module. An example of such a rule is ":C1(?=[^Awyp]):&[aiu]" (colons are field separators).
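The lookup cascade of Section 5.2 (exact match, then fuzzy match, then algorithmic generation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the one-entry dictionary standing in for DAN, the function names, and the particular fuzzy normalization (hamza carriers reduced to their bare letters, 'alif maqSuura collapsed with 'alif). The sketch works directly on Arabic script rather than on the Buckwalter form that ARAN uses internally, and the rule-based generator is left as a stub.

```python
# Hypothetical stand-in for the Database of Arabic Names (DAN):
# Arabic name -> known romanized variants.
DAN = {
    "\u0623\u0633\u0627\u0645\u0629": ["Usama", "Osama"],  # أسامة
}

# Assumed fuzzy normalization: ignore hamza and collapse
# 'alif maqSuura with 'alif, as suggested in Section 5.2.
FUZZY_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0624": "\u0648",  # hamzated waaw -> waaw
    "\u0626": "\u064A",  # hamzated yaa' -> yaa'
    "\u0649": "\u0627",  # alif maqSuura -> alif
    "\u0621": None,      # standalone hamza is dropped
})


def fuzzy_key(name):
    """Return the normalized form used for fuzzy comparison."""
    return name.translate(FUZZY_MAP)


def generate_candidates(name):
    """Stub for rule-based generation (not implemented in this sketch)."""
    return []


def romanize(name, dan=DAN):
    """Return (mode, candidates): exact, then fuzzy, then generated."""
    if name in dan:
        return "exact", list(dan[name])
    key = fuzzy_key(name)
    fuzzy_hits = [
        variant
        for entry, variants in dan.items()
        if fuzzy_key(entry) == key
        for variant in variants
    ]
    if fuzzy_hits:
        return "fuzzy", fuzzy_hits
    return "generated", generate_candidates(name)
```

Under these assumptions, a spelling of أسامة written without the initial hamza still retrieves the stored variants through the fuzzy step, while a string with no entry at all falls through to the generator stub.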

6 Non-Arabic Name Arabizer

6.1 Overview

The Non-Arabic Name Arabizer (NANA) is designed to arabize non-Arabic names. This includes Roman-script names such as Bill Clinton to بيل كلينتون, as well as a technology probably never attempted before: transcribing CJK names directly into Arabic. We have developed language-dependent rules, mapping tables and algorithms for transcribing CJK names written in their native scripts. For example, the Japanese placename 埼玉 /saitama/ is arabized as سايتاما, the Chinese name 杨海洋 /yang haiyang/ as يانغ هاييانغ, and the Korean city 부산 /busan/ as بوسان.

Various papers, such as AbdulJaleel and Larkey (2003), describe systems for transcribing Roman-script names into Arabic. Although NANA also has this capability, it is beyond the scope of this paper. The issues for Chinese and Korean, the subject of a future paper, are similar in nature but require a different set of language-specific rules.

6.2 Arabization Policy

A fundamental problem in arabizing CJK names is that there are significant differences between the Arabic and CJK phonological systems, compounded by the lack of detailed transcription standards. Since these languages are not well known in the Arab-speaking world, CJK names are often arabized on the basis of their romanized transcriptions, rather than the native script, and it is sometimes erroneously assumed that the Roman letters are pronounced as in English. This is further complicated by the plethora of CJK romanization standards. We have established an arabization policy for Japanese based on a number of sometimes conflicting criteria:

1. How names are actually spelled on the Arabic web, atlases, maps and books.
2. Ensuring that same source syllables are spelled consistently taking into account phonological changes.
3. Treating Japanese names as a sequence of syllables, rather than of morae, since that is how they are commonly transcribed.
4. Using hamza to represent vowel sequences only in those cases where diphthongization is not possible or awkward (see Section 6.3).
5. Generating hamzated variants, such as سائيتاما for the more common سايتاما (埼玉 /saitama/), and other kinds of variants, such as كاناجاوا for the more common كاناغاوا (神奈川 /kanagawa/).

6.3 Vowel Sequence Ambiguity

Vowel sequences are difficult to transcribe because they could represent diphthongs, monophthongs (distinct vowels), or long vowels. Representing Japanese vowels accurately in Arabic is not possible. In cases where vowel sequences represent monophthongs, hamza is sometimes used and sometimes omitted.

Table 2. Diphthong Ambiguity for 福井 /fu-ku-i/
No. | Arabic | Google hits | Buckwalter
1 | فوكوئي | 468 | fwkw}y
2 | فوكوئ | 9 | fwkw}
3 | فوكوي | 1950 | fwkwy
4 | فوكويي | 335 | fwkwyy

Table 2 shows some of the variation to expect in arabization. Though phonologically (2) is the most accurate, it is the least used. As expected, the diphthongized (3) is the most common form because of the tendency to avoid hamza in foreign names. Some important vowel sequence issues are:

1. There is a strong tendency not to use non-initial hamza, as in (1) and (2) above, in foreign names. One reason for this is insufficient knowledge of the phonology of the source language, especially of such "exotic" languages as Japanese.
2. Japanese is especially problematic because it is moraic. Some Japanese mora sequences, such as あい /ai/ or うい /ui/, are often diphthongized in Arabic, though ideally the second vowel should be treated as a monophthong represented by hamza. That is, 福井 /fu-ku-i/ should be written as (1) فوكوئي or (2) فوكوئ, rather than the more common (3) فوكوي.
3. In theory, a vowel sequence like /ai/ as in さい /sa-i/ can be written in five ways: ساي, سي, سائ, سايي, سائي. To accurately transcribe a name like Saitama (埼玉) it is necessary to know that it consists of four morae (/sa-i-ta-ma/ さいたま), rather than three syllables (/sai-ta-ma/). Ideally it should be transcribed as سائيتاما, rather than the much more common سايتاما. That is, since /sa-i/ is a bimoraic syllable, the hamza
over yaa' should be used to represent /i/ as a distinct monophthong, as in سائ. In reality, Saitama is normally spelled سايتاما, so that /sa-i/ is diphthongized as ساي /say/.
4. In names like 福岡 /fu-ku-o-ka/ the sequence /ku-o/ represents distinct sounds that cannot be diphthongized. Following hamza rules, this should be written فوكوؤوكا, but in fact it is commonly spelled فوكوأوكا, in which أو, rather than ؤو, represents /u/. Omitting the hamza here would make little sense.

6.4 Long and Short Vowels

The treatment of Japanese vowels is complex and may have hamzated variants.

Table 3. Long and Short Vowels
No. | Kanji | Hiragana | Phonemic | Arab1 | Arab2 | Arab3
1 | 太田 | おおた | oota | أوتا | |
2 | 風馬 | ふうま | fuuma | فوما | |
3 | 敬子 | けいこ | keiko | كيكو | كييكو |
4 | 空野 | くうの | kuuno | كونو | |
5 | 久野 | くの | kuno | كونو | |
6 | 日枝 | ひえだ | hieda | هيئيدا | هيئدا | هييدا
7 | 芳江 | よしえ | yoshie | يوشيئي | يوشيئه | يوشيي

1. Japanese long vowels are expressed in various ways, such as by repeating the vowel as in (2) ふう /fuu/, or by adding う /u/ after /o/ as in (1). えい /ei/ is special because the ي may be repeated, as in (3).
2. Since short vowels are omitted in Arabic, short vowels in foreign names are normally transcribed as if they were long; that is, by adding ا for /a/, ي for /i/ and و for /u/. Thus both (4) and (5) are written identically as كونو and there is no way to distinguish vowel length.
3. Normally the vowel /e/ is not distinguished from /i/ and both are represented by ي. An extra complication is that at word end /e/ is sometimes expressed by ه, so that in transcribing such names as (6) and (7) it is necessary to consider hamza rules, whether to diphthongize, the position of the syllable in the word, and how these interact.

6.5 Arabization Algorithm

The arabization algorithm accepts a CJK string as input and generates a list of arabized candidates by combining lookup in the Japanese-English Proper Noun Database (JEP), a database of about 600,000 Japanese personal and place names, with a knowledge base of rules and mapping tables fine tuned to the Japanese and Arabic phonological systems. Roughly, the algorithm works as follows:

1. Get a string from the input file.
2. Determine if the string is Japanese.
3. Convert kanji to hiragana by looking up in JEP.
4. Convert hiragana to romanized Japanese by looking up in JEP.
5. If (3) fails, convert to hiragana algorithmically (difficult due to extreme ambiguity).
6. If (3) returns multiple strings, use criteria like frequency and semantic codes to eliminate unlikely candidates.
7. Determine whether to diphthongize or to use hamza by considering both the hiragana and the romanized Japanese.
8. Use the rules knowledge base, which is embedded in a comprehensive multi-option hiragana-to-Arabic mapping table, to convert to Arabic script.
9. The AVAN module generates variants if requested by user parameters.
10. Output arabized name (with or without variants as necessary).

We have not yet performed formal error rate testing, but our preliminary experiments indicate that the above algorithm can arabize a CJK name to its correct or legitimate variant form with a success rate of nearly 100%. This is because the algorithm is based on a thorough understanding of the Arabic and Japanese (as well as Chinese and Korean, though not discussed here) phonological systems, and a comprehensive mapping table designed to cover almost all possible Japanese-to-Arabic mappings, including positional variants and phonological changes resulting from liaison.
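The vowel rules of Sections 6.3 and 6.4 can be illustrated with a small Python sketch. It is not the NANA implementation: the function name, the tiny consonant and vowel tables, and the input format (a list of romanized morae) are assumptions made for illustration, and only consonant-vowel morae and non-initial bare vowels are handled. The sketch always diphthongizes a vowel sequence, which is the more common practice noted above, and never emits hamza.

```python
# Illustrative tables only; a real mapping table would be far larger.
CONSONANTS = {
    "k": "\u0643",  # kaaf
    "s": "\u0633",  # siin
    "t": "\u062A",  # taa'
    "m": "\u0645",  # miim
    "n": "\u0646",  # nuun
    "f": "\u0641",  # faa'
}
# Short and long vowels share one letter; /e/ is written like /i/,
# and /o/ like /u/ (compare rows 4 and 5 of Table 3).
VOWELS = {
    "a": "\u0627",  # alif
    "i": "\u064A",  # yaa'
    "e": "\u064A",  # yaa'
    "u": "\u0648",  # waaw
    "o": "\u0648",  # waaw
}


def arabize_morae(morae):
    """Map a list of romanized morae such as ['sa', 'i', 'ta', 'ma'] to Arabic."""
    out = []
    prev_vowel = None
    for mora in morae:
        if len(mora) == 2 and mora[0] in CONSONANTS and mora[1] in VOWELS:
            # Consonant-vowel mora: the vowel is always written as a long vowel.
            out.append(CONSONANTS[mora[0]] + VOWELS[mora[1]])
            prev_vowel = mora[1]
        elif mora in VOWELS and prev_vowel is not None:
            is_lengthening = mora == prev_vowel or (prev_vowel == "o" and mora == "u")
            if not is_lengthening:
                # Distinct second vowel: diphthongize instead of using hamza.
                out.append(VOWELS[mora])
            # A lengthening mora adds nothing, since length is not written.
            prev_vowel = mora
        else:
            raise ValueError(f"mora not covered by this sketch: {mora!r}")
    return "".join(out)
```

With these assumptions, `arabize_morae(["ku", "u", "no"])` and `arabize_morae(["ku", "no"])` return the same string, mirroring rows 4 and 5 of Table 3, and `["sa", "i", "ta", "ma"]` yields the common diphthongized spelling of Saitama rather than the hamzated one.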

7 Arabic Orthographic Variants

The number of personal names and their variants in the world is in the billions. Identifying names and their variants (named entity recognition) is a hot topic in computational linguistics. To enhance this technology, we added a variant generation module (AVAN) to both the ARAN and NANA systems, which is supported by comprehensive databases of CJK proper nouns.

7.1 Romanization Variants

The many popular transcriptions of Arabic names result in a large number of variants. One reason for this is that several Arabic consonants, such as ع [ʔˁ], ض [dˁ], ط [tˁ] and ظ [ðˁ], do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the bewildering variety of ways in which Arabic vowels are transcribed, partially due to dialectal influences. For example, the أ /'u/ in أسامة is transcribed in various ways as seen in Usama, Ousama, Osama and Oosama, while معمر \mEmr\ is spelled as Moammar, Muammar, Mu'ammar, Mo'ammar, Moamer, Moamar, and others.

7.2 Arabic Variants

Both Arab and foreign names have orthographic variants in Arabic. These are of two kinds:

1. Orthographic variants are non-standard ways to spell a specific variant of a name, like ابو ظبي instead of أبو ظبي for Abu Dhabi, in which the hamza is omitted.
2. Orthographic errors are frequently occurring, systematic spelling mistakes, like yaa' in ابو ظبي (Abu Dhabi) being replaced by 'alif maqSuura in ابو ظبى.

Table 4. Orthographic Variation in Arabic Names
Standard | Buckwalter | English | Variant | Error | Remarks
أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى | V: omit hamza
 | | | | ابو ظبى | E: 'alif maqsura replaces yaa'
الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | V: omit hamza
بالو ألتو | bAlw >ltw | Palo Alto | بالو آلتو | | V2: madda replaces hamza
طوكيو | Twkyw | Tokyo | | توكيو | E: taa' replaces Taa'

Table 4 shows examples of variants ("V") and errors ("E"). Though the difference between these cannot be rigorously defined, they are both of frequent occurrence based on statistical and linguistic analysis of MSA orthography. It should also be noted that "standard form," though linguistically correct, is not necessarily the most common form (we are gathering statistics for the occurrence of each form).

There are often many more variants than those shown above. For example, Alexandria can be written in about a dozen ways, the most frequent ones according to Google being الاسكندرية with 2,930,000, الإسكندرية with 690,000, and الاسكندريه with 89,200 occurrences respectively.

8 System Modules and Future Work

The principal components of ARAN (some of which are in progress) are briefly described below.

1. The Automatic Transcriber of Arabic Names (ATAN) is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names. Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems. For example, شولوخ, which is first transliterated to \$wlwx\ by the AXAN module, can then be transcribed as /shwlwkh/ in the ALA-LC system, as /šūlūḫ/ in the DIN system, as Shoulokh as a possible English spelling, etc. The AVAN module can then be used to return many popular variants.
2. The Automatic Transliterator of Arabic Names (AXAN) generates transliterations of
Arabic names or any other Arabic text. There are few strict transliteration systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems by adding custom mapping tables.
3. The Automatic Phoneticizer of Arabic Names (APAN) generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in MSA, including distinctions between the major allophones. For example, the name قابوس Qaboos is transcribed as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ only indicates the vowel length (/aa/), whereas the phonetic transcription also indicates the quality of the vowel (ɑː), distinguishing it from its more common realization [ɛ̈ː]. APAN can be configured to transcribe in various MSA flavors. This refers to regional variations in MSA pronunciation, not to Arabic dialects per se. For example, for جمال /jamaal/ APAN generates [dʒɛ̈mɛ̈ːl] for Gulf MSA, [gɛ̈mɛ̈ːl] for Egyptian MSA, and [ʒɛmɛ̈ːl] for Levantine MSA.
4. The Automatic Generator of Variants for Arabic Names (AVAN) supports the ARAN and NANA systems by generating a large number of variants and variant candidates both algorithmically and by retrieving from hardcoded databases, whose occurrences are then validated in Arabic corpora and the web. See Section 7 for details.
5. The Automatic Diacriticizer of Arabic Names (ADAN) automatically diacriticizes, or adds vowels and other diacritics (like fatha and shadda) to unvocalized or semi-vocalized Arabic. For example, محمد \mHmd\ and الرياض \AlryAD\ are converted to the vocalized مُحَمَّد and الرِّيَاض respectively. This is related to, but distinct from, the equally difficult task of phonemic transcription. See Table 1 for examples.
6. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are Farsi (official language of Iran), Pashto (western Pakistan and official language of Afghanistan), Dari (Afghan dialect of Farsi, official language of Afghanistan), Urdu (official language of Pakistan) and Kurdish (Turkey, Iraq, Iran, Syria, Armenia, Lebanon). Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China). ARAN will eventually be expanded to (1) romanize to/from the major ASBL languages, (2) automatically identify the language, (3) automatically detect legacy encodings and convert to Unicode.

9 Conclusion

As we have seen, the high level of ambiguity in the Arabic script makes it challenging to build automatic transcription systems that produce reliable results. In particular, we have seen the difficulties in arabizing CJK names due to the lack of standards and to the major phonological differences between the languages. We have also seen how important linguistic knowledge is in such areas as Japanese-to-Arabic transcription, resulting in a very high accuracy rate. Since Arabic transcription is playing an increasingly important role in a variety of practical applications, it is necessary to pursue efforts to develop more language-specific transcription systems based on linguistic knowledge.

References

Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statistical transliteration for English-Arabic Cross Language Information Retrieval. CIKM 2003: 139-146.

M. Arbabi, S.M. Fischthal, V.C. Cheng and E. Bart. 1994. Algorithms for Arabic name transliteration. IBM J. Res. Develop., 38(2).

Moustafa Elshafei, Husni Al-Muhtaseb, Mansour Al-Ghamdi. 2006. Machine Generation of Arabic Diacritical Marks. MLMTA 2006: 128-133.

Ali Farghaly. 2004. Computer Processing of Arabic Script-based Languages: Current State and Future Directions. COLING 2004.

Martin Kay, Stanford University. 2004. Arabic Script-based Languages deserve to be studied linguistically. COLING 2004.

D. Vergyri and K. Kirchhoff. 2004. Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition. COLING Workshop on Arabic-script Based Languages, Geneva, Switzerland, 2004.
