The Challenges and Pitfalls of Arabic Romanization and Arabization

Jack Halpern (春遍雀來)
The CJK Dictionary Institute, Inc. (日中韓辭典研究所)
34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001, Japan
[email protected]

Abstract

The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.

1 Introduction

The process of automatically transcribing Arabic to a Roman script representation, called romanization, is a tough computational task to which there is no definitive solution. The opposite operation of transcribing a non-Arabic script into Arabic, called arabization, is also difficult but for different reasons.

This paper briefly describes the algorithms and major linguistic issues encountered in the course of developing two automatic transcription systems: (1) Automatic Romanizer of Arabic Names (ARAN), which romanizes unvocalized Arabic names into various romanization systems, and (2) Non-Arabic Name Arabizer (NANA), which arabizes non-Arabic names written in the Roman and CJK scripts.

A novel feature of these systems is that they are fine-tuned to transcribing personal names and placenames to and from Arabic, with special focus on the linguistic knowledge and rules required for transcribing CJK names written in their native script directly into Arabic, something probably never attempted. These systems are part of our ongoing efforts to develop Arabic resources for automatic transcription, machine translation and named entity extraction.

The following typographic conventions are used in this paper:

1. Phonemic transcriptions are indicated by slashes (/qaabuus/ < قابوس).
2. Phonetic transcriptions are indicated by square brackets ([qɑːbuːs] < قابوس).
3. Graphemic transliterations are indicated by back slashes (\qAbws\ < قابوس).
4. Popular transcriptions are indicated by italics (Qaboos < قابوس).

2 Motivation and Previous Work

Arabic transcription technology is playing an increasingly important role in a variety of practical applications such as named entity recognition, machine translation, cross-language information retrieval and various security applications such as anti-money laundering and terrorist watch lists. Despite the importance of these applications, Arabic transcription has not been the subject of sufficient studies that examine the linguistic issues. This paper attempts to fill that gap.

Several companies and researchers have developed automatic diacriticization software. Vergyri and Kirchhoff (2004) report the high error rate of these products. Gal used an HMM bigram model and achieved a 14% error rate, while AbdulJaleel and Larkey (2003) developed an n-gram based statistical system for arabizing English, with an error rate of 10%-20%. Elshafei et al. (2006) report a 5.5% error rate using an HMM approach, while Arbabi et al. (1994) developed a diacriticizer that combines a knowledge base with neural networks to achieve a low error rate of 3.1% but which rejects 55% of the names as unprocessable.

We have not used sophisticated statistical approaches. Our basic strategy has been to use conventional linguistic knowledge because we believe that ultimately statistical methods by themselves are inadequate. Kay (2004) argues that "statistics are a surrogate for knowledge of the world" and that "this is an alarming trend that computational linguists ... should resist with great determination." This was reinforced by Farghaly (2004) when he wrote "It is becoming increasingly evident that statistical and corpus-based approaches...are not sufficient..."

Our policy is that linguistic rules, based on deep analysis of the source and target scripts, are indispensable. To rephrase, many contemporary statistical methods involve brute-force mathematical techniques that exploit vast amounts of data, whereas a rule-based approach captures aspects of human intelligence because it is based on linguistic knowledge. We have combined linguistic rules with statistically derived mapping tables to build a flexible system that can be extended to other Arabic-script-based languages.

3 Basic Concepts

Much confusion surrounds the terms transliteration and transcription, with the former often misleadingly used in the sense of the latter even in academic papers (AbdulJaleel and Larkey, 2003). To discuss these concepts in an unambiguous manner it is necessary to understand these and related terms correctly.

Romanization is the representation of a language written in a non-Roman script using the Roman alphabet. This includes both transliteration and transcription, e.g., محمد is transliterated as \mHmd\ and transcribed as Mohammed, Muhammad, or Mohamad, among many others.

Transliteration is a representation of the script of a source language by using the characters of another script. Ideally, it unambiguously represents the graphemes, rather than the phonemes, of the source language. For example, محمد is transliterated as \mHmd\, in which each Arabic letter is unambiguously represented by one Roman letter, enabling round-trip conversion.

Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic correspondence. This includes the following subcategories:

1. A phonetic transcription represents the actual speech sounds, including allophones. The best known of these is IPA. For example, محمد is transcribed as [muħɛ̈mmɛ̈d].
2. A phonemic transcription represents the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is transcribed as /muHammad/, in which a represents the phoneme /a/, rather than the phone [ɛ̈].
3. A popular transcription is a conventionalized orthography that roughly represents pronunciation. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Diacriticization is the process of adding vowel signs (called vocalization) and other diacritics. For example, محمد \mHmd\ is converted to the vocalized مُحَمَّد \muHam~ad\. Note the four diacritics that were added.

Arabization is the reverse of romanization; that is, the representation of a non-Arabic script, such as the Roman and CJK scripts, using the Arabic alphabet, e.g., Muhammad → محمد, Clinton → كلينتون, 埼玉, Saitama → سايتاما.

4 Why is Arabic ambiguous?

A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels, referred to as unvocalized Arabic. Though diacritics can be used to indicate short vowels, they are used sparingly, while the use of consonants to indicate long vowels is ambiguous. On the whole, unvocalized Arabic is highly ambiguous and poses major challenges to Arabic information processing applications.

4.1 Morphological Ambiguity

Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 valid forms, including affixes and clitics as well as inflectional syncretisms. For example, كاتب can represent any of the following seven wordforms: كَاتِب /kaatib/, كَاتَبَ /kaataba/, كَاتِبٍ /kaatibin/, كَاتِبٌ /kaatibun/, كَاتِبَ /kaatiba/, كَاتِبِ /kaatibi/, كَاتِبُ /kaatibu/.

4.2 Orthographical Ambiguity

On the orthographic level, Arabic is also highly ambiguous. For example, the string مو can theoretically represent 40 consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc., though in practice some may never be used. Humans can normally disambiguate this by context, but for a program the task is formidable.

Conventional wisdom has it that the Arabic script is ambiguous "due to non-representation of short vowels," while other features are often lightly passed over. In fact, a whole gamut of ...

... transliterated, it must not be transcribed, e.g., كتبوا is transliterated as \ktbwA\, with 'alif at the end, but transcribed as /katabuu/, omitting the 'alif.
7. The diacritic shadda indicating consonant gemination is normally omitted, e.g., the unvocalized محمد Muhammad (vocalized مُحَمَّد) provides no clues that the [m] should be doubled.
8. Another source of ambiguity is the omission of tanwiin diacritics for case endings, e.g., in شكرا \$ukrAF\ (vocalized شُكْرَاً), the fatHatayn is not written.
9. The rules for determining the hamza seat are of notorious complexity. In transcribing to Arabic, it is difficult to determine the hamza seat as well as the short vowel that follows; e.g., hamzated waaw (ؤ) could represent /'a/, /'u/ or even /'/ (no vowel).
10. In arabization, determining the hamza seat requires the application
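
The round-trip property of graphemic transliteration described under Basic Concepts can be illustrated with a minimal Python sketch. The mapping table below is an assumption: it is a small, partial table modelled on the back-slash examples in this paper (\mHmd\, \qAbws\, \ktbwA\), and the function names transliterate and detransliterate are hypothetical. It is not the table or code used in ARAN or NANA.

```python
# Partial, illustrative letter table (an assumption based on the paper's
# examples); a real system would need the full Arabic alphabet and diacritics.
ARABIC_TO_ROMAN = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # baa
    "\u062A": "t",  # taa
    "\u062D": "H",  # Haa
    "\u062F": "d",  # daal
    "\u0633": "s",  # siin
    "\u0642": "q",  # qaaf
    "\u0643": "k",  # kaaf
    "\u0645": "m",  # miim
    "\u0648": "w",  # waaw
}

# Because every Arabic letter maps to exactly one distinct Roman letter,
# the table can be inverted without loss.
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}


def transliterate(arabic: str) -> str:
    """Map each Arabic grapheme to its Roman counterpart (KeyError if unknown)."""
    return "".join(ARABIC_TO_ROMAN[ch] for ch in arabic)


def detransliterate(roman: str) -> str:
    """Invert transliterate(), recovering the original Arabic string."""
    return "".join(ROMAN_TO_ARABIC[ch] for ch in roman)


if __name__ == "__main__":
    muhammad = "\u0645\u062D\u0645\u062F"  # the unvocalized name Muhammad
    roman = transliterate(muhammad)
    print(roman)
    print(detransliterate(roman) == muhammad)
```

A transcription such as Mohammed or /muHammad/ cannot be inverted in this way, because it adds vowels and gemination that are not present in the unvocalized spelling.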
