The Challenges and Pitfalls of Arabic Romanization and Arabization


Jack Halpern (春遍雀來)
The CJK Dictionary Institute, Inc. (日中韓辭典研究所)
34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001, Japan
[email protected]

Abstract

The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.

1 Introduction

The process of automatically transcribing Arabic to a Roman script representation, called romanization, is a tough computational task to which there is no definitive solution. The opposite operation of transcribing a non-Arabic script into Arabic, called arabization, is also difficult but for different reasons.

This paper briefly describes the algorithms and major linguistic issues encountered in the course of developing two automatic transcription systems: (1) Automatic Romanizer of Arabic Names (ARAN), which romanizes unvocalized Arabic names into various romanization systems, and (2) Non-Arabic Name Arabizer (NANA), which arabizes non-Arabic names written in the Roman and CJK scripts.

A novel feature of these systems is that they are fine tuned to transcribing personal names and placenames to and from Arabic, with special focus on the linguistic knowledge and rules required for transcribing CJK names written in their native script directly into Arabic, something probably never attempted. These systems are part of our ongoing efforts to develop Arabic resources for automatic transcription, machine translation and named entity extraction.

The following typographic conventions are used in this paper:

1. Phonemic transcriptions are indicated by slashes (/qaabuus/ < قابوس).
2. Phonetic transcriptions are indicated by square brackets ([qɑːbuːs] < قابوس).
3. Graphemic transliterations are indicated by back slashes (\qAbws\ < قابوس).
4. Popular transcriptions are indicated by italics (Qaboos < قابوس).

2 Motivation and Previous Work

Arabic transcription technology is playing an increasingly important role in a variety of practical applications such as named entity recognition, machine translation, cross-language information retrieval and various security applications such as anti-money laundering and terrorist watch lists. Despite the importance of these applications, Arabic transcription has not been the subject of sufficient studies that examine the linguistic issues. This paper attempts to fill that gap.

Several companies and researchers have developed automatic diacriticization software. Vergyri and Kirchhoff (2004) report the high error rate of these products. Gal used an HMM bigram model and achieved a 14% error rate, while AbdulJaleel and Larkey (2003) developed an n-gram based statistical system for arabizing English, with an error rate of 10%-20%. Elshafei et al. (2006) report a 5.5% error rate using an HMM approach, while Arbabi et al. (1994) developed a diacriticizer that combines a knowledge base with neural networks to achieve a low error rate of 3.1% but which rejects 55% of the names as unprocessable.

We have not used sophisticated statistical approaches. Our basic strategy has been to use conventional linguistic knowledge because we believe that ultimately statistical methods by themselves are inadequate. Kay (2004) argues that "statistics are a surrogate for knowledge of the world" and that "this is an alarming trend that computational linguists ... should resist with great determination." This was reinforced by Farghaly (2004) when he wrote "It is becoming increasingly evident that statistical and corpus-based approaches...are not sufficient..."

Our policy is that linguistic rules, based on deep analysis of the source and target scripts, are indispensable. To rephrase, many contemporary statistical methods involve brute-force mathematical techniques that exploit vast amounts of data, whereas a rule-based approach captures aspects of human intelligence because it is based on linguistic knowledge. We have combined linguistic rules with statistically derived mapping tables to build a flexible system that can be extended to other Arabic script based languages.

3 Basic Concepts

Much confusion surrounds the terms transliteration and transcription, with the former often misleadingly used in the sense of the latter even in academic papers (AbdulJaleel and Larkey, 2003). To discuss these concepts in an unambiguous manner it is necessary to understand these and related terms correctly.

Romanization is the representation of a language written in a non-Roman script using the Roman alphabet. This includes both transliteration and transcription, e.g., محمد is transliterated as \mHmd\ and transcribed as Mohammed, Muhammad, or Mohamad, among many others.

Transliteration is a representation of the script of a source language by using the characters of another script. Ideally, it unambiguously represents the graphemes, rather than the phonemes, of the source language. For example, محمد is transliterated as \mHmd\, in which each Arabic letter is unambiguously represented by one Roman letter, enabling round-trip conversion.

Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic correspondence. This includes the following subcategories:

1. A phonetic transcription represents the actual speech sounds, including allophones. The best known of these is IPA. For example, محمد is transcribed as [muħɛ̈mmɛ̈d].
2. A phonemic transcription represents the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is transcribed as /muHammad/, in which a represents the phoneme /a/, rather than the phone [ɛ̈].
3. A popular transcription is a conventionalized orthography that roughly represents pronunciation. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Diacriticization is the process of adding vowel signs (called vocalization) and other diacritics. For example, محمد \mHmd\ is converted to the vocalized مُحَمَّد \muHam~ad\. Note the four diacritics that were added.

Arabization is the reverse of romanization; that is, the representation of a non-Arabic script, such as the Roman and CJK scripts, using the Arabic alphabet, e.g., Muhammad → محمد, Clinton → كلينتون, 埼玉, Saitama → سايتاما.

4 Why is Arabic ambiguous?

A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels, referred to as unvocalized Arabic. Though diacritics can be used to indicate short vowels, they are used sparingly, while the use of consonants to indicate long vowels is ambiguous. On the whole, unvocalized Arabic is highly ambiguous and poses major challenges to Arabic information processing applications.

4.1 Morphological Ambiguity

Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 valid forms, including affixes and clitics as well as inflectional syncretisms. For example, كاتب can represent any of the following seven wordforms: كَاتِب /kaatib/, كَاتَبَ /kaataba/, كَاتِبٍ /kaatibin/, كَاتِبٌ /kaatibun/, كَاتِبَ /kaatiba/, كَاتِبِ /kaatibi/, كَاتِبُ /kaatibu/.

4.2 Orthographical Ambiguity

On the orthographic level, Arabic is also highly ambiguous. For example, the string مو can theoretically represent 40 consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc., though in practice some may never be used. Humans can normally disambiguate this by context, but for a program the task is formidable.

Conventional wisdom has it that the Arabic script is ambiguous "due to non-representation of short vowels," while other features are often lightly passed over. In fact, a whole gamut of [...]

[...] transliterated, it must not be transcribed, e.g., كتبوا is transliterated as \ktbwA\, with 'alif at the end, but transcribed as /katabuu/, omitting the 'alif.
7. The diacritic shadda indicating consonant gemination is normally omitted, e.g., the unvocalized محمد Muhammad (vocalized مُحَمَّد) provides no clues that the [m] should be doubled.
8. Another source of ambiguity is the omission of tanwiin diacritics for case endings, e.g., in شكرا \$ukrAF\ (vocalized شُكْرَاً), the fatHatayn is not written.
9. The rules for determining the hamza seat are of notorious complexity. In transcribing to Arabic, it is difficult to determine the hamza seat as well as the short vowel that follows; e.g., hamzated waaw (ؤ) could represent /'a/, /'u/ or even /'/ (no vowel).
10. In arabization, determining the hamza seat requires the application [...]
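
The distinction drawn in Section 3 between graphemic transliteration and diacriticization can be illustrated with a short Python sketch. The mapping table is an assumption: the backslash examples in this paper (\mHmd\, \muHam~ad\, \qAbws\, \ktbwA\) are consistent with a Buckwalter-style scheme, but the paper does not name its scheme here, and the table below covers only the letters and diacritics those examples need. The function names are hypothetical and are not part of ARAN or NANA.

```python
# Partial Buckwalter-style table (an assumption; only the characters needed
# for the paper's examples are listed). Code points are written as escapes
# so that the right-to-left text cannot be reordered by an editor.
LETTERS = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # baa
    "\u062A": "t",  # taa
    "\u062D": "H",  # Haa
    "\u062F": "d",  # daal
    "\u0631": "r",  # raa
    "\u0633": "s",  # siin
    "\u0634": "$",  # shiin
    "\u0642": "q",  # qaaf
    "\u0643": "k",  # kaaf
    "\u0645": "m",  # miim
    "\u0648": "w",  # waaw
}
DIACRITICS = {
    "\u064B": "F",  # fatHatayn (tanwiin)
    "\u064C": "N",  # Dammatayn (tanwiin)
    "\u064D": "K",  # kasratayn (tanwiin)
    "\u064E": "a",  # fatHa
    "\u064F": "u",  # Damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukuun
}
TO_ROMAN = {**LETTERS, **DIACRITICS}
TO_ARABIC = {roman: arabic for arabic, roman in TO_ROMAN.items()}


def transliterate(arabic: str) -> str:
    """Map each Arabic character to exactly one Roman character."""
    return "".join(TO_ROMAN[ch] for ch in arabic)


def detransliterate(roman: str) -> str:
    """Invert transliterate(); possible because the table is one-to-one."""
    return "".join(TO_ARABIC[ch] for ch in roman)


def strip_diacritics(arabic: str) -> str:
    """Reduce vocalized text to the unvocalized form normally written."""
    return "".join(ch for ch in arabic if ch not in DIACRITICS)


# Vocalized "Muhammad": miim, Damma, Haa, fatHa, miim, shadda, fatHa, daal.
VOCALIZED_MUHAMMAD = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"

if __name__ == "__main__":
    bare = strip_diacritics(VOCALIZED_MUHAMMAD)
    print(transliterate(VOCALIZED_MUHAMMAD))             # muHam~ad
    print(transliterate(bare))                           # mHmd
    print(detransliterate(transliterate(bare)) == bare)  # True
```

Because the table is one-to-one, the unvocalized form survives a round trip, which is the property a transliteration is meant to guarantee. Stripping the four diacritics from the vocalized form shows how much information ordinary unvocalized text leaves out, and why a romanizer has to restore it by other means.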