<<

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

RESEARCH OPEN ACCESS

A New Approach to Romanize Words

*Hussein K. Khafaji, **Nada Adnan Taher *Al-Rafidain University College/ Computer Communications Engineering Dept. **Al-Rafidain University College Computer Engineering Dept.

ABSTRACT of Arabic words has been acquired the interest of the researchers due to its importance in many fields such as security and terrorism fighting, translation, religious purposes, etc. In this paper, a proposed method was presented to solve the drawbacks of available methods such as lack of reverse recognition, using of extra letters and punctuation characters, and neglecting the correlation of the letters in a word. This method was implemented and tested using a sample of 100 undergraduate Iraqi students and 150 Arabic words which romanized using five well-known methods in addition to the proposed one. The test showed that the proposed method dominants the rest method from the recognition and reverse recognition process in considerable ratio.

I. INTRODUCTION would be mnaẓrḧalḥrwfalʿrbyḧ [5]. Sometimes, it is impossible to translate some of Another case, which differentiates between the words and terms from one language to another normalization and transliteration, is caused by the due to the nature of a language itself or unavailability fact that written Arabic is normally unvocalized, i.e., of equivalent meaning word especially in the case of many of the are not written out, and must be names of peoples, places, countries, and local names supplied by the language [6]. Hence, unvocalized of things, ideas, and doctrines. Therefore, the need Arabic writing does not give a reader unfamiliar with for Romanization or transliteration was emerging [1]. the language sufficient information for accurate Transliteration is the conversion of a text from one pronunciation. As a result, a pure transliteration, e.g. as qṭr, is meaningless to an untrained ”قطش“ script to another. If the language characters are rendering Roman character, this process is called Romanization reader. For this reason, transcriptions are generally [1]. Romanization or latinization, in linguistics, is the used that add vowels, e.g. qaṭar [5]. conversion of writing from a different writing system to the Roman (Latin) script, or a system for doing so. II. ROMANIZATION SYSTEMS Methods of Romanization include transliteration, for The first attempts of Romanization began by representing written text, and transcription, for Orient lists based on personal reasoning in how to representing the spoken word, and combinations of write and pronounce the Arabic word in Latin both [2]. Transcription methods can be subdivided characters. This matter has several problems, into phonemic transcription, which records the including that the Arabic word can be romanized in a or units of semantic meaning in speech, number of ways depending on the different place or and more strict , which records origin, even if all of them took into account the speech sounds with precision [3]. eloquent speech as much as possible [7]. Add to this Most Arabic Romanization systems, which that the process of Romanization affected by the began by Orientals many centuries ago when the mother tongue of Orientals. English may use letter „a‟ Arab Islamic civilization contacted with European to denote the slot, sh denotes the character s, while civilization, are self-assiduousness and vary from one may use his German counterpart, crafts e to denote person to another. Romanization is often termed the slot, sch or ch to denote the letter s, etc. "transliteration", but this is not technically Therefore, many approaches were emerging to correct. Transliteration is the direct representation of Romanize Arabic words [8]. foreign letters using Latin symbols, while most \Currently, the Arab linguists, orient lists, and systems for Romanizing Arabic are offices follow manual process of Romanization in actually transcription systems, which represent most cases, which result in the followings: the sound of the language [4]. As an example, the 1. Lack of consistency in the process of rendering munāẓarat al-ḥurūf al-ʿarabiyyah of Romanization inside the text. For example, the is a transcription, Lawrence of Arabia in his famous book he wroteمناظشة الحشوف العشبيت :the Arabic indicating the pronunciation; an example of the same several different ways, for www.ijera.com 94|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

example, the word “Jeddah” and sometimes Jidda may not produce the original Arabic name [15]. Due (the name of the famous city in the Kingdom of to the mentioned drawbacks of the Romanization Saudi Arabia) as he wrote the name of - systems, the numerous scripting of Arabic names still Almoeen six different ways. [9] a big problem in documents, passports, hospital 2. Lack of consistency in the process of records, terrorist tracing by Interpol, Academic Romanization between various texts, whether for certificates, etc. This leads to various problems, the same person or a different group of people. especially when retrieving the Romanized names This is a fortiori, of course. The surveyed write from a database. Although the British Standard the name of Sharaf al-Din in many Romanized BS4280 is considered a reasonably consistent, but forms such as: they do not use on a large scale. In the research [8] SharafEldin, Sharaf El Din, SharafElDin, Sharaf- there is a comparison between these systems can be Eldin, Sharaf-El-Din, Sharafeldin, Sharafel Din, referenced. In spite of the presence of Standard Sharafelddin, Sharafelldin, Sharafudin, Audio International Phonetic Alphabet highly Sharafuddin, Sharaf Aldine, Sharaf Al Din, complex but it fits specialists in the field of Sharaf Al Dine, Sharaf-Al-Din, Sharaf El deen, linguistics only [8]. It may seem to be a natural etc[10]. alternative to this is to compensate for each character 3. Failure in the reverse process of Romanization, Arabic counterpart voice of English (Roman) , but i.e., recovery of the Arabic word of the this is not available because most of the Arabic letters Romanized word.[11] cannot be compensated by one character and that 4. Slow “Romanization” process. such symbols S. , for example, as to some of the 5. The possibility of human error due to the manual letters is unrivaled audio in the target language [16]. work [12]. Perhaps the logical solution to this is to develop a III. REQUIREMENT FOR GOOD uniform system for the binding process of ROMANIZATION SYSTEMS Romanization that means made standard specification The following basic characteristics of the for it. The exclamation point when you know that requirements necessary for the Romanization systems there are several specifications proposed and some can be concluded: are in circulation among the researchers, however, 1. Symmetry between each character in the source none of them has achieved sufficient deployment language (Arabic), the target language (English) and [13]. For example, there are currently on the scene this condition is difficult to fully abide by it to the several ways of Romanization, and perhaps following: surprising that many researchers, governments, and A) Arabic characters need to diacritical marks so they countries do not use any of these, but each may can be tuned and writes these signs above or below invent a way of Romanization. The followings are the the original character. Thus, the symmetry can be most known Romanization approaches [14]: automatically bound by it (as the formation is 1 - US Library of Congress (LC) and perhaps the independent characters) cannot be bound by it most common. manually (where the accent character is only one 2 - Specification of Romanization British BS4280 character). [17] (BSI). B) In most cases cannot pronounce the name in true 3 - Specification for the Department of Islamic way for experienced people only [18]. knowledge. 2. The possibility of recovery of the , 4 - Specification for ISO. which already Romanized to its original state without 5 - Specification of the international magazine for confusion [19]. Middle East Studies (IJMES). 3. Totalitarian in the sense that all the symbols and 6 - Specification of the Institute of Islamic Studies, sounds of language in the source language have McGill University, Canada. compared to the target language [20]. One of the disadvantages of the previous The aim of this research is design and Romanization systems is the need to add special implementation of a words symbols for regular letters of the alphabet. This in to avoid the mentioned drawbacks and support the turn leads to the difficulty of computing due to the requirements of robust Romanization systems. lack of these symbols on the keyboard directly (such as putting points higher character that is not already IV. The Proposed Romanization System exist)[14]. In addition, some of the Arabic letters are The proposed Romanization system consists of romanized by more than one Latin character, which the following components: Word's string handler, may be signed in confusion. Moreover, the difficulty Irregular-names database scanner, Word scanner, and of remembering these symbols in general. Romanizer, as shown in figure (1). These modules Furthermore, the reverse process to Romanized name will be explained in the next sub sections. www.ijera.com 95|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

The input to the RS is an Arabic word. The the input word may be irregular word. Usually the word should be written with all its HARACAT such number of irregular words is very small in as DAMMA, FATHA, SUKUN, etc. For example, comparison with the number of the language words. is different in A special database was constructed to hold the " َـس ْـوح" and "سوح" the pronunciation of spite of the similarity of the Arabic letters in these Romanization of the irregular words. The database is two words due to the difference of the HARACAT. scanned to find the Romanized word otherwise; the In addition, the word may be compound or simple control will be transferred to word scanner and the .Also, Romanizer to construct the English word ." َـ ِـ ْـ " or ," َـش َـج َـشةُـ ال ﱡذس" ," َـصال ُـح الذيه" word, such as

Arabic Word

Word's String Handler Word Scanner

Handler

Processed String Processed Romanization String Table Irregular- Names Characters Database Scanner Stream

Romanizer

English Word

Figure (1) the proposed Romanization System Architecture

1. Word's String Handler its pronunciation except some irregular name and are "الش ْْٰـح ْٰمه" and "هللا" This module removes extra white spaces, word. For example, the words punctuation, numbers, and special characters. After regarded as irregular words. Such these words, which this trimming process the word will be checked by a are not obeyed to the writing rules, were stored in the spell checker designed according to [21]. irregular-name database. The input to this module is Indeed, this module disintegrates the word to its a string processed by the previous module, word's characters and then reconstructs a word from the string handler, stored in out-string variable presented remaining letters because the stored words in the in figure (2). The output of this module is a string of irregular-name database are stored as string data type. zero or more characters. Empty string indicates Figure (2) shows a self-documented pseudo code unavailability of the word in the database, i.e., it is representing the duties of this module. not 'irregular' word; therefore the control will be 2. Irregular-name Database Scanner signaled to Romanization process. As it is mentioned in section 2, there is a correspondence between the Arabic word writing and

www.ijera.com 96|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

Input-string=right trim (left trim (input-string)); While not end of input-string do {

ch=getcharacter(input-string); out-string=""; Case ch of {

' ': { out-string+=' '; skip-spaces(input-string); } ـ ' ' : skip_TATWEEL_character(input-string); '0'..'9': skip_numbers(input-string); ;out-string+= ch : 'ي'..'ء' Default: skip_others(input-string);

} out-string=spell-checker(out-string); } …

Figure (2) Pseudo code of word's string handler module

should be articulated with "الـ" Word Scanner not "Al-roh". But .3 ."Al-Kanz" ,"ال َـن ْـنض" This module will be activated when the word "Al-Qameria" letters such as is regular word. It receives the processed word The output of this module according to first and its ['ح' ,'و',' ّـ ' ,'س' ,'ا'] obtained from the word's string handler module and example is the stream converts it to stream of letters and HARACAT to output for the second example is the stream One of the duties of this .['ص',' ْـ ','ن',' َـ ','ك','ه','ا'] enable the next module; romanizer, to keep track of some pointers because the pronunciation of a letter module is recognition of these two cases. Figure (3) may depend on some letters or HARACATs. shows a pseudo code of this process. Arabic letters are divided into two classes; The Shamsyan function invoked in the "Ashshamsia" and "Al-Qameria". Usually the pseudo code of figure (2) is a binary function When a letter returns true if the letter is from "Ashshamsia" class ."الـ" ;"words are prefixed with "Al the otherwise it returns false, i.e., it returns false if the ,"الـ" from "Ashshamsia" class is joined with هـ، م، ي، ق، ع، :is not articulated but the letter is repeated. For letter is one of the following letters "الـ" . ف، خ، و، ك، ج، ح، غ، ب، أ is written as "Arroh" and "الشوح" example, the word

… ch=getcharacter(input-string);

Case ch of ( 'ل'= =(if (ch=getcharacter(input-string :'ا' { ch=getcharacter(input-string) ;[' ّ ' ,ch ,'ا'] If (shamsyan(ch)) initialize the stream with

[ch,'ل','ا'] Else initialize the stream with } ;[ch ,'ا'] Else initialize the stream with

… Figure (3) Pseudo code of recognition of "Ashshamsia" and "Al-Qameria" letters

www.ijera.com 97|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

4. Romanizer Module The input of this module is the stream produced from the word scanner module. It depends mainly on the Romanization table presented in table (1).

Table (1) Romanization Table No. Pronunciation English Arabic Letter Letter or HARACAT )ء( "No thing " 0621 1 أ Like A in Apple A 0623 2 إ I 0625 3 آ A 0622 4 ؤ Like o in On U 0624 5 ئـ FE8C I 6 ا A 0627 7 الباء )ب( Like o in Baby B 0628 8 التاء )ث( 062A Like T in Tree T 9 الثاء )ث( 062B Like the Th in Theory Th 10 الجيم )ج( 062C Sometimes like the G in Girl or like the J in Jar J 11 الحاء )ح( 062D Like the h in he yet light in pronunciation H 12 الخاء )خ( 062E Like the Ch in the name Bach Kh 13 الذاه )د( 062F Like the D in Dad D 14 الزاه )ر( Like the Th in The Dh 0630 15 الشاء )س( Like the R in Ram R 0631 16 الضيه )ص( Like the Z in zoo Z 0632 17 السيه )ط( Like the S in See S 0633 18 الشيه )ش( Like the Sh in She Sh 0634 19 الصاد )ص( Like the S in Sad yet heavy in pronunciation S 0635 20 الضاد )ض( Like the D in Dead yet heavy in pronunciation D 0636 21 الطاء )ط( Like the T in Table yet heavy in pronunciation T 0637 22 الظاء )ظ( Like the Z in Zorro yet heavy in pronunciation Z 0638 23 العيه )ع( Has no real equivalent sometimes they replace its A 0639 24 sound with the A sound like for example the name /ali /ع Ali for العيه ) ـ(: ) ـ ِـ( E or 0639+0650 25 ) يـ( العيه ) ـ(: ) ـ َـ( 0639+064E A 26 العيه) ـ(: ) ا( Aa 0639+0627 27 العيه ) ـ(: ) ـ ُـ( 0639+064F O 28 العيه )ـعـ(: )ـعا( A 0639+0627 29 or)ـعـ َـ( or 0639+064E العيه )ـعـ(: )ـْـعـ ُـ( )ـْـعى( Nothing 0639+0647 30 العيه )ـعـ(: )ـَـ ْـعـ( 064E+0639 31 الغيه )غ( 063A Like the Gh in Ghandi Gh 32 الفاء )ف( Like the F in Fool F 0641 33 القاف )ق( Like the Q in Queen yet heavy velar sound in Q 0642 34 pronunciation الناف )ك( Like the K in Kate K 0643 35 الالم )ه( Like the L in Love L 0644 36 الميم )م( Like the M in Moon M 0645 37 النىن )ن( Like the N in Noon N 0646 38 الهاء )هـ( Like the H in He H 0647 39 الىاو )سامنت وما قب ها 064F+0648 Like U in soon U 40 مضمىم(: )ـُــ ْـى ( الىاو )و( Like the W in the reaction of astonishment saying: W 0648 41

www.ijera.com 98|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

WAW! الياء )سامنت وما قب ها 0650+064A I 42 منسىس(: )ـــِـيـْــ( الياء )ي( 064A Like the Y in you Y 43 األلف المقصىسة )ي( A 0649 44 التاء المشبىطت السامنت H 0629 45 )ـ ْـت ( الضمت )ــ ُـ( 064F U 46 التنىيه المضمىم )ــ ٌـ( 064C Un 48 الفتحت )ـــ َـ( 064E A 49 التنىيه المفتىح )ـــ ًـ( 064B An 50 النسشة )ــ ِـ( I 0650 51 التنىيه المنسىس )ــ ٍـ( 064D In 52 الحشف المشذد )ــ ّـ( Duplicate the 0651 53 letter before

In many cases, the romanizer maintains pointers ISO, Arabtex, ALA-LC, in addition to the suggested to the current letter and the previous three letters and method, i.e., each student guessed 900 Romanized HARACATs and this is the reason of constructing word presented in a random list. Fifty words of the the output of the scanner as a stream to achieve fast 150 are frequent words. The rest 100 words are movements for the romanizer over the word. classified as relatively frequent and rarely used words because the aim of the experiment is measuring the 5. Experimental Results ability of the persons to read and recognize the A sample of 100 UG Arabic students was original word and prevent him from guessing the selected to read and recognize 150 Romanized Arabic words. Table (2) shows the ratio of recognized words words, and then write the corresponding Arabic according to each approach. words. Each word is Romanized by IPA, UNGEGN,

Table (2) Recognition ratio of Romanized words IPA UNGEGN ISO Arabtex ALA-LC Suggested Method %60 %77 %79 %83 %69 %96

All the unrecognized words in the suggested according to our opinion, is the frequent use of the " ال " in comparison with " َـم ْـسؤوه" and " َـ ـ " Ain of word ,(ـعـ) method are containing the letter There are twelve students did not ." َـم ْـسعىه" Unicode=0639) with the letter Hamza, and) There are 86 students ."بَـ ْـع َـبَـل" Unicode=0621) or Alef, (unicod= 0623). For recognize the word) Aali" are recognized all the words Romanized by the suggested"-" ال " Ali" and"-" َـ ـ " example the word Ali" by nine method while table (3) presents the number of"-" َـ ـ " recognized as same word Masool" students who recognize all the words in the rest"-" َـم ْـسعىه" students, in addition to the words .approach -" َـم ْـسؤوه" Masaol" are recognized as"-" َـم ْـسؤوه" and "Masaol" by twelve student. The reason for that,

Table (3) Number of students who recognized all the words for each approach IPA UNGEGN ISO Arabtex ALA-LC Suggested Method

45 56 62 69 71 86

The average of recognition of the 100 students for each approach is presented in table (4) to achieve reliable results.

Table (4) Recognition average for each approach IPA UNGEGN ISO Arabtex ALA-LC Suggested Method 85.25 84.20 82.50 83 89.33 94.45

www.ijera.com 99|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

Consider table(3) and table (4); in spite the fact that [4]. Alghamdi, Mansour (2004) Analysis, there are 45 students recognized all the words Synthesis and Perception of Voicing in Romanized by using IPA, the average of Arabic. Al-Toubah Bookshop, Riyadh. recognition is higher than UNGEN, ISO, and (The book is now available on the web) Arabtex; also it is very close to ALA-LC [5]. Nahar, Khalid, Moustafa Elshafei, Wasfi recognition average. Al-Khatib, Husni Al-Muhtaseb, Mansour Alghamdi, "Statistical Analysis of Arabic V. Discussion and Conclusions Phonemes for Continuous Arabic Speech This work presents a new approach to Recognition", International Journal of Romanize Arabic words. This approach Computer and Information Technology, characterized by many characteristics such as: 2012. 1. This approach uses English letters only for [6]. http://www.alab.com/arab/language/roma Romanized word without any sound, n1. htm#TRANSCRIPTION a good punctuation, or special characters; therefore it discussion about the “standards” used for has got high readability according to the results Romanization of the Arabic text. presented in table (2). [7]. "Wikipedia: Manual of Style/Arabic". 2. It doesn't depend on letter to letter Retrieved 2013-06-03. correspondence, but it considers the correlation [8]. ArabEasy - transliterate Arabic webpages of a letter with former letters and HARACATs, [9]. http:// transliteration.eki.ee/pdf/Arabic_ this makes the Romanization very close to the 2.2.pdf actual Arabic pronunciation. This feature [10]. Sharaf Eldin, A., 6. Hanna, S. and N. simplifies the matching of Arabic word and its Greis, "Writing Arabic - A linguistic Romanization in case of availability or absence approach from sounds to script", E.J. Brill, of the Arabic words. Leiden, 1972. 3. It is the first approach which considers the “Computer-Assisted Arabic properties of the letters such as "Ashshamsia" Transliteration”, 7th.. ICAIA, 1999. and "Al-Qameria" features to simulate the [11]. Roochnik, P., "Computer-based solutions actual Arabic word articulate. to certain linguistic problems arising from 4. According to the experiment results, the the Romanization of Arabic names", Ph.D. students' incorrect recognition in the suggested dissertation, Georgetown university, approach occurred in the words containing the Washington DC, 1993. but in the rest approaches [12]. Al-Onaizan, K. Knight, “Transliteration of ,"ع" and "أ" letters most reading mistakes occurred in words Names in Arabic Text”, obtained via -http://www.isi.edu/natural ,"ض" ,"د" ,"ر" ,"ع" ,"أ" containing the letters language/projects/rewrite/yaser-acl02- ."ث" and ,"ظ" From the implementation point of view, the WS.ps preprocessing of the input string eases the [13]. Arbabi, M., S. Fischthal,V. Cheng and E. Romanization operation and makes it very close to Bort, "Algorithms for Arabic name Arabic word pronunciation. transliteration", IBM J. of research and development, Vol. 38, no. 2, Mar., 1994, Bibliography: pp. 183-193. [1] Toivonen, Jarmo, Ari Pirkola, [14]. "Arabic" (.pdf). ALA-LC Romanization Heikkikeshustalo, Kari Visala, and Tables. Library of Congress. p. 9. Kalervo Jarvelin, Translating cross lingual Retrieved 2013-06-14. spelling variants using transformation [15]. Stalls, B. and K. Knight, “Translating rules", Information Processing and Names and Technical Terms in Arabic Management, 2004 Text”, Proceedings of the COLING/ACL [2] Meng, H.M.; Wai:-Kitlo; Berlin Chen; Workshop on Computational Approaches Tang, K. Automatic Speech Recognition to Semitic Languages, 1998. and Understanding. ASRU’01. IEEE work [16]. Beesley, K. R. , “Arabic Finite-state shop on, 9-13 Dec. 2001. 311-214. 2001. Morphological Analysis and Generation”, [3] UNGEGN working Group on In COLING-96 proceedings, volume 1 , Romanization Systems, Report on the pages 89-94 Copenhagen, Center for UNITED NATIONS ROMANIZATION Sprogteknologi. The 16th International SYSTEMS CURRENT STATUS OF FOR Conference on Computational Linguistics, GEOGRAPHICAL NAMES, Arabic 1996. version 2.2, January 2003. www.ijera.com 100|P a g e

Hussein K. Khafaji et al. Int. Journal of Engineering Research and Application www.ijera.com ISSN: 2248-9622, Vol. 5, Issue 12, (Part - 2) December 2015, pp.94-101

[17]. Arbabi, M., S. Fischthal, V. Cheng and E. Bort, "Algorithms for Arabic name transliteration", IBM J. of research and development, Vol. 38, no. 2, Mar., 1994, pp. 183-193. [18]. Beesley, K. R., “Computer Analysis of Arabic Morphology: A two-level Approach with Detours”, In Comrie, B. and Eid, M., editors, perspectives on Arabic Linguistics III: papers from the third Annual Symposium on Arabic Linguistics, pages 155 –172, John Benjamins, Amsterdam, 1991. [19]. Ridley, M.J., "A code for bibliographic records transliterated from Greek", literary and Linguistic Computing, Vol. 7, pp. 27- 29. [20]. Knight, K. and J. Graehl, “Machine Transliteration”, Proceedings of the 35th. Annual Meeting of the Association for Computational Linguistics, pp. 128-135, 1997. [21]. Hussein K. AlKhafaji, Suhail A. Abdullah, Hanaa H. Merza, "A New Algorithm to Design and Implementation of Multilingual Spellchecker and Corrector", Journal of Al-Rafidain University College, ISSN 1681-6870, No. 32, 2013

www.ijera.com 101|P a g e