Automatic Romanization of Arabic Bibliographic Records

Total Page:16

File Type:pdf, Size:1020Kb

Automatic Romanization of Arabic Bibliographic Records Automatic Romanization of Arabic Bibliographic Records Fadhl Eryani and Nizar Habash Computational Approaches to Modelling Language (CAMeL) Lab New York University Abu Dhabi, UAE {fadhl.eryani,nizar.habash}@nyu.edu Abstract Description Tag Romanized Entry Arabic Entry ﺗﯿﻤﻮر، اﺣﻤﺪ، ,Author 100a* Taymūr, Ah ̣mad ، ,International library standards require cata- details 100d 1871-1930 ﻣﺆﻟﻒ .loguers to tediously input Romanization of 100e author ﻗﺒﺮ اﻹﻣﺎم اﻟﺴﯿﻮطﻲ وﺗﺤﻘﯿﻖ their catalogue records for the benefit of li- Title 245a* Qabr al-imām al-Suyūṭī ﻣﻮﺿﻌﮫ / brary users without specific language exper- details wa-taḥqīq mawḍiʻihi ﺑﻘﻠﻢ اﻟﻔﻘﯿﺮ إﻟﯿﮫ ﺗﻌﺎﻟﻰ أﺣﻤﺪ ﺗﯿﻤﻮر 245c* bi-qalam al-faqīr ilayhi tise. In this paper, we present the first reported ta‘ālá Aḥmad Taymūr. اﻟﻘﺎھﺮة : results on the task of automatic Romanization Place of 264a* al-Qāhirah اﻟﻤﻄﺒﻌﺔ اﻟﺴﻠﻔﯿﺔ وﻣﻜﺘﺒﺘﮭﺎ، of undiacritized Arabic bibliographic entries. production 264b* al-Maṭbaʻah al-Salafīyah This complex task requires the modeling of wa-Maktabatuhā, ھـ ، ؟ م [.Arabic phonology, morphology, and even se- 264c 1346 [H.], [1927? M ﺻﻔﺤﺔ، ورﻗﺔ ﻟﻮﺣﺎت ﻏﯿﺮ Physical 300a 24 pages, 1 unnumbered ﻣﺮﻗﻤﺔ : mantics. We collected a 2.5M word corpus of description leaf of plates ﻣﺼﻮرات، ﺧﺮاﺋﻂ ؛ parallel Arabic and Romanized bibliographic 300b illustrations, maps ﺳﻢ entries, and benchmarked a number of models 300c اﻟﺴﯿﻮطﻲ، ,that vary in terms of complexity and resource Subject - 600a* Suyūt ̣ı̄ dependence. Our best system reaches 89.3% name 600d 1445-1505. اﻟﻘﺒﻮر exact word Romanization on a blind test set. Subject - 650a Tombs ﻣﺼﺮ We make our data and code publicly available. topic 650z Egypt اﻟﻘﺎھﺮة .650z Cairo 1 Introduction Figure 1: A bibliographic record for Taymur¯ (1927?) Library catalogues comprise a large number of bib- in Romanized and original Arabic forms. liographic records consisting of entries that provide specific descriptions of library holdings. Records for Arabic and other non-Roman-script language of complexity and resource dependence. Our best materials ideally include Romanized entries to help system reaches 89.3% exact word Romanization on a blind test set. We make our data and code researchers without language expertise, e.g., Fig- 1 ure1. There are many Romanization standards publicly available for researchers in Arabic NLP. such as the ISO standards used by French and other 2 Related Work European libraries, and the ALA-LC (American Library Association and Library of Congress) sys- Arabic Language Challenges Arabic poses a tem (Library of Congress, 2017) widely adopted by number of challenges for NLP in general, and the North American and UK affiliated libraries. These task of Romanization in particular. Arabic is mor- Romanizations are applied manually by librarians phologically rich, uses a number of clitics, and is across the world – a tedious error-prone task. written using an Abjad script with optional diacrit- In this paper, we present, to our knowledge, the ics, all leading to a high degree of ambiguity. The first reported results on automatic Romanization of Arabic script does not include features such as cap- undiacritized Arabic bibliographic entries. This is italization which is helpful for NLP in a range of a non-trivial task as it requires modeling of Arabic Roman script languages. There are a number of phonology, morphology and even semantics. We enabling technologies for Arabic that can help, e.g., collect and clean a 2.5M word corpus of parallel MADAMIRA (Pasha et al., 2014), Farasa (Abde- Arabic and Romanized bibliographic entries, and 1https://www.github.com/CAMeL-Lab/ evaluate a number of models that vary in terms Arabic_ALA-LC_Romanization 213 Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 213–218 Kyiv, Ukraine (Virtual), April 19, 2021. Split Bib Records Entries Words York University Abu Dhabi’s Arabic Collections Train 85,952 (80%) 479,726 ∼2M Online (ACO) (12K), amounting to 11.2 million Dev 10,744 (10%) 59,964 ∼250K records in total. Test 10,743 (10%) 59,752 ∼250K Extraction From these collections, we extracted Total 107,439 599,442 ∼2.5M 107,493 records that are specifically tagged with the Arabic language code (MARC 008 “ara”). Table 1: Corpus statistics and data splits. Filtering Within the extracted records we filter out some of the entries using two strategies. First, lali et al., 2016), and CAMeL Tools (Obeid et al., we used a list of 33 safe tags (determined us- 2020). In this paper we use MADAMIRA to pro- ing their definitions and with empirical sampling vide diacritics, morpheme boundaries and English check) to eliminate all entries that include a mix of gloss capitalization information as part of a rule- translations, control information, and dates. The based Romanization technique. star-marked tags in Figure1 are all included, while Machine Transliteration Transliteration refers the rest are filtered out. Second, we eliminated all to the mapping of text from one script to another. entries with mismatched numbers of tokens. This Romanization is specifically transliteration into the check was done after a cleaning step that corrected Roman script (Beesley, 1997). There are many for common errors and inconsistencies in many ways to transliterate and Romanize, varying in entries such as punctuation misalignment and in- 2 terms of detail, consistency, and usefulness. Com- correct separation of the conjunction +ð wa+ ‘and’ monly used name transliterations (Al-Onaizan and clitic. As a result of this filtering, a small number of Knight, 2002) and so-called Arabizi transliteration additional records are eliminated since all their en- (Darwish, 2014) tend to be lossy and inconsistent tries were eliminated. The total number of retained while strict orthographic transliterations such as records is 107,439. The full details on extraction Buckwalter’s (Buckwalter, 2004) tend to be exact and filtering are provided as part of the project’s but not easily readable. The ALA-LC transliter- public github repo (see footnote 1). ation is a relatively easy to read standard that re- Data Splits Finally, we split the remaining col- quires a lot of details on phonology, morphology lection of records into Train, Dev, and Test sets. De- and semantics. There has been a sizable amount tails on the number of records, entries, and words of work on mapping Arabizi to Arabic script using they contain is presented in Table1. We make our a range of techniques from rules to neural mod- data and data splits available (see footnote 1). els Chalabi and Gerges(2012); Darwish(2014); Al-Badrashiny et al.(2014); Guellil et al.(2017); 4 Task Definition and Challenges Younes et al.(2018); Shazal et al.(2020). In this paper we make use of a number of insights and As discussed above, there are numerous ways to techniques from work on Arabizi-to-Arabic script “transliterate” from one script to another. In this sec- transliteration, but apply them in the opposite di- tion we focus on the Romanization of undiacritized rection to map from Arabic script to a complex, Arabic bibliographic entries into the ALA-LC stan- detailed and strict Romanization. We compare dard. Our intention is to highlight the important rule-based and corpus-based techniques including challenges of this task in order to justify the design a Seq2Seq model based on the publicly available choices we make in our approaches. For a detailed code base of Shazal et al.(2020). reference of the ALA-LC Arabic Romanization standard, see (Library of Congress, 2012). 3 Data Collection Phonological Challenges While Romanizing Sources We collected bibliographic records from Arabic consonants is simple, the main challenge three publicly available xml dumps stored in the is in identifying unwritten phonological phenom- machine-readable cataloguing (MARC) standard, ena, e.g., short vowels, under-specified long vow- an international standard for storing and describing els, consonantal gemination, and nunnation, all of bibliographic information. The three data sources which require modeling Arabic diacritization. are the Library of Congress (LC) (10.5M), the Uni- 2Strict orthographic transliteration using the HSB scheme versity of Michigan (UMICH) (680K), and New (Habash et al., 2007). 214 Morphosyntactic Challenges Beyond basic di- in the transliteration. We strip diacritical morpho- acritization modeling, the task requires some mor- logical case endings, but keep other diacritics. We phosyntactic modeling: examples include (a) pro- utilize the CharTrans technique to finalize the Ro- clitics such as the definite article, prepositions and manization starting with the diacritized, hyphen- conjunctions are marked with a hyphen, (b) case ated and capitalization marked words. For words endings are dropped, except before pronominal unknown to the morphological analyzer, we simply enclitics, (c) the silent Alif, appearing in some mas- back off to the CharTrans technique. Model Rules culine plural verbal endings, is ignored, and (d) Morph uses MorphTrans with CharTrans backoff. the Ta-Marbuta ending can be written as h or t de- pending on the morphosyntactic state of the noun. MLE Technique Unlike the previous two tech- For more information on Arabic morphology, see niques, MLE (maximum likelihood estimate) re- (Habash, 2010). lies on the parallel training data we presented in Section3. This simple technique works on white- Semantic Challenges Proper nouns need to be space and punctuation tokenized entries and learns marked with capitalization on their first non-clitic simple one-to-one mapping from Arabic script to alphabetic letter. Since Arabic script does not have Romanization. The most common Romanization “capitalizations”, this effectively requires named- for a particular Arabic script input in the training entity recognition. The Romanization of the word data is used. The outputs are detokenized to allow AlqAhr¯h ‘Cairo’ as al-Qahirah¯ in Figure1 èQëA®Ë@ strict matching alignment with the input. Faced illustrates elements from all challenge types. with OOV (out of vocabulary), we back off to the Special Cases The Arabic ALA-LC guidelines MorphTrans technique (Model MLE Morph) or include a number of special cases, e.g., the word CharTrans Technique (Model MLE Simple).
Recommended publications
  • Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names
    ST/ESA/STAT/SER.M/87 Department of Economic and Social Affairs Statistics Division Technical reference manual for the standardization of geographical names United Nations Group of Experts on Geographical Names United Nations New York, 2007 The Department of Economic and Social Affairs of the United Nations Secretariat is a vital interface between global policies in the economic, social and environmental spheres and national action. The Department works in three main interlinked areas: (i) it compiles, generates and analyses a wide range of economic, social and environmental data and information on which Member States of the United Nations draw to review common problems and to take stock of policy options; (ii) it facilitates the negotiations of Member States in many intergovernmental bodies on joint courses of action to address ongoing or emerging global challenges; and (iii) it advises interested Governments on the ways and means of translating policy frameworks developed in United Nations conferences and summits into programmes at the country level and, through technical assistance, helps build national capacities. NOTE The designations employed and the presentation of material in the present publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The term “country” as used in the text of this publication also refers, as appropriate, to territories or areas. Symbols of United Nations documents are composed of capital letters combined with figures. ST/ESA/STAT/SER.M/87 UNITED NATIONS PUBLICATION Sales No.
    [Show full text]
  • Arabic Alphabet - Wikipedia, the Free Encyclopedia Arabic Alphabet from Wikipedia, the Free Encyclopedia
    2/14/13 Arabic alphabet - Wikipedia, the free encyclopedia Arabic alphabet From Wikipedia, the free encyclopedia َأﺑْ َﺠ ِﺪﯾﱠﺔ َﻋ َﺮﺑِﯿﱠﺔ :The Arabic alphabet (Arabic ’abjadiyyah ‘arabiyyah) or Arabic abjad is Arabic abjad the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually[1] stand for consonants, it is classified as an abjad. Type Abjad Languages Arabic Time 400 to the present period Parent Proto-Sinaitic systems Phoenician Aramaic Syriac Nabataean Arabic abjad Child N'Ko alphabet systems ISO 15924 Arab, 160 Direction Right-to-left Unicode Arabic alias Unicode U+0600 to U+06FF range (http://www.unicode.org/charts/PDF/U0600.pdf) U+0750 to U+077F (http://www.unicode.org/charts/PDF/U0750.pdf) U+08A0 to U+08FF (http://www.unicode.org/charts/PDF/U08A0.pdf) U+FB50 to U+FDFF (http://www.unicode.org/charts/PDF/UFB50.pdf) U+FE70 to U+FEFF (http://www.unicode.org/charts/PDF/UFE70.pdf) U+1EE00 to U+1EEFF (http://www.unicode.org/charts/PDF/U1EE00.pdf) Note: This page may contain IPA phonetic symbols. Arabic alphabet ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع en.wikipedia.org/wiki/Arabic_alphabet 1/20 2/14/13 Arabic alphabet - Wikipedia, the free encyclopedia غ ف ق ك ل م ن ه و ي History · Transliteration ء Diacritics · Hamza Numerals · Numeration V · T · E (//en.wikipedia.org/w/index.php?title=Template:Arabic_alphabet&action=edit) Contents 1 Consonants 1.1 Alphabetical order 1.2 Letter forms 1.2.1 Table of basic letters 1.2.2 Further notes
    [Show full text]
  • Arabic in Romanization
    Transliteration of Arabic 1/6 ARABIC Arabic script* DIN 31635 ISO 233 ISO/R 233 UN ALA-LC EI 1982(1.0) 1984(2.0) 1961(3.0) 1972(4.0) 1997(5.0) 1960(6.0) iso ini med !n Consonants! " 01 # $% &% ! " — (3.1)(3.2) — (4.1) — — 02 ' ( ) , * ! " #, $ (2.1) —, ’ (3.3) %, — (4.2) —, ’ (5.1) " 03 + , - . b b b b b b 04 / 0 1 2 t t t t t t 05 3 4 5 6 & & & th th th 06 7 8 9 : ' ' ' j j dj 07 ; < = > ( ( ( ) ( ( 08 ? @ * + + kh kh kh 09 A B d d d d d d 10 C D , , , dh dh dh 11 E F r r r r r r 12 G H I J z z z z z z 13 K L M N s s s s s s 14 O P Q R - - - sh sh sh 15 S T U V . / . 16 W X Y Z 0 0 0 d 1 0 0 17 [ \ ] ^ 2 2 2 3 2 2 18 _ ` a b 4 4 4 z 1 4 4 19 c d e f 5 5 5 6 6 5 20 g h i j 7 7 8 gh gh gh 21 k l m n f f f f f f 22 o p q r q q q q q 9 23 s t u v k k k k k k 24 w x y z l l l l l l 25 { | } ~ m m m m m m 26 • € • n n n n n n 27 h h h h h h 28 … " h, t (1.1) : ;, <(3.4) h, t (4.3) h, t (5.2) a, at (6.1) 29 w w w w w w 30 y y y y y y 31 ! = — y y ! • 32 s! l! la" l! l! l! l! 33 # al- (1.2) "#al (2.2) al- (3.5) al- (4.4) al- (5.3) al-, %l- (6.2) Thomas T.
    [Show full text]
  • Writing Arabizi: Orthographic Variation in Romanized
    WRITING ARABIZI: ORTHOGRAPHIC VARIATION IN ROMANIZED LEBANESE ARABIC ON TWITTER ! ! ! ! Natalie!Sullivan! ! ! ! TC!660H!! Plan!II!Honors!Program! The!University!of!Texas!at!Austin! ! ! ! ! May!4,!2017! ! ! ! ! ! ! ! _______________________________________________________! Barbara!Bullock,!Ph.D.! Department!of!French!&!Italian! Supervising!Professor! ! ! ! ! _______________________________________________________! John!Huehnergard,!Ph.D.! Department!of!Middle!Eastern!Studies! Second!Reader!! ii ABSTRACT Author: Natalie Sullivan Title: Writing Arabizi: Orthographic Variation in Romanized Lebanese Arabic on Twitter Supervising Professors: Dr. Barbara Bullock, Dr. John Huehnergard How does technology influence the script in which a language is written? Over the past few decades, a new form of writing has emerged across the Arab world. Known as Arabizi, it is a type of Romanized Arabic that uses Latin characters instead of Arabic script. It is mainly used by youth in technology-related contexts such as social media and texting, and has made many older Arabic speakers fear that more standard forms of Arabic may be in danger because of its use. Prior work on Arabizi suggests that although it is used frequently on social media, its orthography is not yet standardized (Palfreyman and Khalil, 2003; Abdel-Ghaffar et al., 2011). Therefore, this thesis aimed to examine orthographic variation in Romanized Lebanese Arabic, which has rarely been studied as a Romanized dialect. It was interested in how often Arabizi is used on Twitter in Lebanon and the extent of its orthographic variation. Using Twitter data collected from Beirut, tweets were analyzed to discover the most common orthographic variants in Arabizi for each Arabic letter, as well as the overall rate of Arabizi use. Results show that Arabizi was not used as frequently as hypothesized on Twitter, probably because of its low prestige and increased globalization.
    [Show full text]
  • A Sociolinguistic Analysis of the Use of Arabizi in Social Media Among Saudi Arabians
    International Journal of English Linguistics; Vol. 9, No. 6; 2019 ISSN 1923-869X E-ISSN 1923-8703 Published by Canadian Center of Science and Education A Sociolinguistic Analysis of the Use of Arabizi in Social Media Among Saudi Arabians Ashwaq Alsulami1 1 School of Languages, Literatures, and Linguistics, Bangor University, Wales, United Kingdom Correspondence: Ashwaq Alsulami, School of Languages, Literatures, and Linguistics, Bangor University, Wales, United Kingdom. E-mail: [email protected] Received: August 27, 2019 Accepted: September 20, 2019 Online Published: October 28, 2019 doi:10.5539/ijel.v9n6p257 URL: https://doi.org/10.5539/ijel.v9n6p257 Abstract The aim of this sociolinguistically-oriented study is to explore the Arabizi phenomenon which is characterized by spelling Arabic words using the Latin script. It is prevalent in the text-based computer-mediated communications among Saudi Arabians. The study focuses on why Arabizi is used, how, particularly in respect to with whom and in which topics, it is used, the attitudes of its users toward its use and the perceived advantages and disadvantages of its use. Using an online survey, data were collected from 241 participants, 72 of which were users of Arabizi. The findings revealed that the primary reasons for using Arabizi were its being a communication code among youths and a compensation for the lack of Arabic keyboard from technological devices as well as being more expressive than Arabic language. It was also found that Arabizi was primarily used to communicate with friends and individuals of the same age, but not with parents and older people or in formal relationships.
    [Show full text]
  • Romanization of Arabic 1 Romanization of Arabic
    Romanization of Arabic 1 Romanization of Arabic Arabic alphabet ﺍ ﺏ ﺕ ﺙ ﺝ ﺡ ﺥ ﺩ ﺫ ﺭ ﺯ ﺱ ﺵ ﺹ ﺽ ﻁ ﻅ ﻉ ﻍ ﻑ ﻕ ﻙ ﻝ ﻡ ﻥ ﻩ ﻭ ﻱ • History • Transliteration • Diacritics (ء) Hamza • • Numerals • Numeration Different approaches and methods for the romanization of Arabic exist. They vary in the way that they address the inherent problems of rendering written and spoken Arabic in the Latin script. Examples of such problems are the symbols for Arabic phonemes that do not exist in English or other European languages; the means of representing the Arabic definite article, which is always spelled the same way in written Arabic but has numerous pronunciations in the spoken language depending on context; and the representation of short vowels (usually i u or e o, accounting for variations such as Muslim / Moslem or Mohammed / Muhammad / Mohamed ). Method Romanization is often termed "transliteration", but this is not technically correct. Transliteration is the direct representation of foreign letters using Latin symbols, while most systems for romanizing Arabic are actually transcription systems, which represent the sound of the language. As an example, the above rendering is a transcription, indicating the pronunciation; an ﺍﻟﻌﺮﺑﻴﺔ ﺍﻟﺤﺮﻭﻑ ﻣﻨﺎﻇﺮﺓ :munāẓarat al-ḥurūf al-ʻarabīyah of the Arabic example transliteration would be mnaẓrḧ alḥrwf alʻrbyḧ. Romanization standards and systems This list is sorted chronologically. Bold face indicates column headlines as they appear in the table below. • IPA: International Phonetic Alphabet (1886) • Deutsche Morgenländische Gesellschaft (1936): Adopted by the International Convention of Orientalist Scholars in Rome. It is the basis for the very influential Hans Wehr dictionary (ISBN 0-87950-003-4).
    [Show full text]
  • The Challenges and Pitfalls of Arabic Romanization and Arabization
    The Challenges and Pitfalls of Arabic Romanization and Arabization Jack Halpern (春遍雀來) The CJK Dictionary Institute, Inc. (日中韓辭典研究所) 34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001, Japan [email protected] their native script directly into Arabic, something Abstract probably never attempted. These systems are part of our ongoing efforts to develop Arabic re- The high level of ambiguity of the Ara- sources for automatic transcription, machine bic script poses special challenges to translation and named entity extraction. developers of NLP tools in areas such as morphological analysis, named entity The following typographic conventions are used extraction and machine translation. in this paper: These difficulties are exacerbated by the lack of comprehensive lexical resources, 1. Phonemic transcriptions are indicated by . (/qaabuus/ < ﻗــــﺎﺑﻮس) such as proper noun databases, and the slashes multiplicity of ambiguous transcription 2. Phonetic transcriptions are indicated by . ([qɑːbuːs] < ﻗــــﺎﺑﻮس ) schemes. This paper focuses on some of square brackets the linguistic issues encountered in two subdisciplines that play an increasingly 3. Graphemic transliterations are indicated by . (\qAbws\ < ﻗــــﺎﺑﻮس ) important role in Arabic information back slashes processing: the romanization of Arabic 4. Popular transcriptions are indicated by italics .(Qaboos < ﻗــــﺎﺑﻮس ) -names and the arabization of non Arabic names. The basic premise is that linguistic knowledge in the form of lin- 2 Motivation and Previous Work guistic rules is essential for achieving Arabic transcription technology is playing an high accuracy. increasingly important role in a variety of practical applications such as named entity 1 Introduction recognition, machine translation, cross-language information retrieval and various security The process of automatically transcribing Arabic applications such as anti-money laundering and to a Roman script representation, called romani- terrorist watch lists.
    [Show full text]
  • Arabic Alphabet 1 Arabic Alphabet
    Arabic alphabet 1 Arabic alphabet Arabic abjad Type Abjad Languages Arabic Time period 400 to the present Parent systems Proto-Sinaitic • Phoenician • Aramaic • Syriac • Nabataean • Arabic abjad Child systems N'Ko alphabet ISO 15924 Arab, 160 Direction Right-to-left Unicode alias Arabic Unicode range [1] U+0600 to U+06FF [2] U+0750 to U+077F [3] U+08A0 to U+08FF [4] U+FB50 to U+FDFF [5] U+FE70 to U+FEFF [6] U+1EE00 to U+1EEFF the Arabic alphabet of the Arabic script ﻍ ﻉ ﻅ ﻁ ﺽ ﺹ ﺵ ﺱ ﺯ ﺭ ﺫ ﺩ ﺥ ﺡ ﺝ ﺙ ﺕ ﺏ ﺍ ﻱ ﻭ ﻩ ﻥ ﻡ ﻝ ﻙ ﻕ ﻑ • history • diacritics • hamza • numerals • numeration abjadiyyah ‘arabiyyah) or Arabic abjad is the Arabic script as it is’ ﺃَﺑْﺠَﺪِﻳَّﺔ ﻋَﺮَﺑِﻴَّﺔ :The Arabic alphabet (Arabic codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually[7] stand for consonants, it is classified as an abjad. Arabic alphabet 2 Consonants The basic Arabic alphabet contains 28 letters. Adaptations of the Arabic script for other languages added and removed some letters, such as Persian, Ottoman, Sindhi, Urdu, Malay, Pashto, and Arabi Malayalam have additional letters, shown below. There are no distinct upper and lower case letter forms. Many letters look similar but are distinguished from one another by dots (’i‘jām) above or below their central part, called rasm. These dots are an integral part of a letter, since they distinguish between letters that represent different sounds.
    [Show full text]
  • Language Choice and Romanization Online by Lebanese Arabic Speakers
    Treball de fi de màster Màster: Edició: Directors: Any de defensa: Col⋅lecció: Treballs de fi de màster Programa oficial de postgrau "Comunicació lingüística i mediació multilingüe" Departament de Traducció i Ciències del Llenguatge 1 TABLE OF CONTENTS 1. Introduction…………………………………………………………………..…4 2. Theoretical background………………………………………………………....6 2.1.Languages in Lebanon…………………………………………………...….6 2.2.Language choice online………………………………………………..……8 2.2.1. English…………………………………………………………..…..8 2.2.2. MSA…………………………………………………………..…….9 2.3.Romanization……………..………………………………………………..10 2.3.1. Romanization as an orthography…………………………………..10 2.3.2. Reasons for Romanization…………………………………………11 3. Informants……………………………………………………...………………14 4. Method…………………………………………………………………...…….15 5. Limitations……………………………………………………..........................17 6. Findings…………………………………………………………………..…....18 6.1.Language choice……………………………………………………..…….18 6.1.1. Emails and formal inquiries……………………………………..….18 6.1.2. Initiating contributions…………………………………………..….18 6.1.3. Responding contributions……………………………………….….19 6.1.4. Whatsapp……………………………………………………….…...20 6.2.Reasons for Romanization…………………………………………….…..22 6.3.The practice of Romanization……………………………………………..22 6.3.1. Phonology 6.3.1.1.Consonants………………………………………………….,...22 6.3.1.2. Vowels…………………………………………………….….23 6.3.2. Morphophonology………………………………………………....24 2 6.3.3. Morphology……………………………………………………….25 7. Conclusion…………………………………………………………………….26 8. References…………………………………………………………………….28 3 ABSTRACT This paper investigates language
    [Show full text]
  • Attitudes Towards Arabic Romanization and Student's Major: Evidence from the University of Jordan
    Arab World English Journal (AWEJ) Volume.7 Number.4 December, 2016 Pp.493-502 Attitudes towards Arabic Romanization and Student's Major: Evidence from the University of Jordan Hady Hamdan Department of Linguistics Faculty of Foreign Languages University of Jordan Amman, Jordan Abstract This paper seeks to unveil the attitudes of a sample of students at the University of Jordan towards the use of romanized Arabic in computer--mediated communication (CMC). In particular, it provides answers to two questions. First, do the subjects encode Arabic characters, including numbers, in a romanized version in their CMC? If yes, how often and why? Second, does the students' major and the language of instruction used therein (i.e. English or Arabic) affect their choice of Arabic romanization and their attitudes towards it? The data are collected by means of a questionnaire completed by students from four different majors: (1) Applied English, (2) Arabic, (3) Medicine, and (4) Islamic Sharia. While the majority of students of Applied English and Medicine tend to use Romanized Jordanian Arabic, the students of Arabic and Sharia show a clear preference for the use of Arabic letters. The users of Romanized Arabic cite a number of reasons for their choice. Some believe that Romanized letters are easier and faster to type than Arabic letters. Some posit that English is the language of the Internet and technology and, thus, the use of romanization gives communication a special flavor. A third group report that their devices do not support Arabic language. This study is expected to contribute to identifying the youth attitudes towards the use or avoidance of romanized Arabic, which in turn may help develop a better understanding of this issue and assist cyber Arabic users to make the right choice when interacting with others in Arabic online.
    [Show full text]
  • Arabizi: an Analysis of the Romanization of the Arabic Script from a Sociolinguistic Perspective
    Arab World English Journal AWEJ INTERNATIONAL PEER REVIEWED JOURNAL ISSN: 2229-9327 جمةل اللغة الانلكزيية يف العامل العريب AWEJ Volume.4 Number.3, 2013 Pp.52-62 Arabizi: An Analysis of the Romanization of the Arabic Script from a Sociolinguistic Perspective Wid H. Allehaiby King Saud bin Abdulaziz University for Health Sciences Riyadh, Saudi Arabia Abstract: The purpose of this paper is to present a sociolinguistic analysis of the Romanization of the Arabic Script phenomenon better known as Arabizi. Arabizi is a widely used alternative for the Arabic script in computer-mediated contexts (CMC) and social networking sites, which has emerged as a result of the Arabic script being unsupported in technological tools and Internet resources. Arabic speakers have taken different stances to using Arabizi. In this paper, the following issues are discussed: the historical emergence of Arabizi, a thorough explanation of the script, its characters and its features, the social contexts in which users utilize it, and the different attitudes of Arabic speakers towards its use. Keywords: Arabic Script, Arabizi, Romanization, sociolinguistics, speakers‟ attitudes Arab World English Journal www.awej.org 52 ISSN: 2229-9327 AWEJ Volume 4.Number. 3, 2013 Arabizi: An Analysis of the Romanization of the Arabic Allehaiby Henry Globalization has struck the world like a storm as the result of the spread of technology. During the 90s, most Arab countries, if not all, witnessed an increasing importance of the Pramoolsook & Qian English language, which rose to dominate many technological devices and realms, such as online chats, short message service (SMS), and mobile phones (Warschauer, Said, & Zohry, 2002).
    [Show full text]
  • Tarc: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus
    TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus Elisa Gugliotta1;2, Marco Dinarelli1 1. LIG, Bâtiment IMAG - 700 avenue Centrale - Domaine Universitaire de Saint-Martin-d’Hères, France 2. University of Rome Sapienza, Piazzale Aldo Moro 5, 00185 Roma, Italy [email protected], [email protected] Abstract This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media in order to facilitate the writing in the Computer-Mediated Communication (CMC) and text messaging informal frameworks. There is variety in the realization of Arabish amongst dialects, and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has considerably increased. Taking this into consideration, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for NLP tools training. In this article we will describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, in order to provide a complete overview of the challenges faced during the building process, we will present the main Tunisian dialect characteristics and their encoding in Tunisian Arabish. Keywords: Tunisian Arabish Corpus, Arabic Dialect, Arabizi 1. Introduction 2018) and taking into account the specific guidelines for Arabish is the romanization of Arabic Dialects (ADs) used TUN (CODA TUN) (Zribi et al., 2014).
    [Show full text]