2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

Machine Transliteration Design for Old Malay Manuscript

Che Wan Shamsul Bahri Che Wan Ahmad, Khairuddin Omar, Mohammad Faidzul Nasrudin, Mohd Zamri Murah, Sanusi Mohd Azmi

Che Wan Shamsul Bahri C.W.Ahmad is with the International Islamic University College Selangor (KUIS), Bangi, 43000 Kajang, Selangor, Malaysia (phone: +60389254251; fax: +6038925447; e-mail: [email protected]).
Prof. Dr. Khairuddin Omar, Dr. Mohammad Faidzul Nasrudin and Mohd Zamri Murah are with the Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia, Bangi, 43600 Kajang, Selangor, Malaysia (e-mail: [email protected], [email protected], [email protected]).
Mohd Sanusi Azmi is with the Universiti Teknikal Malaysia (UTeM), Malacca, Malaysia (e-mail: [email protected]).

Abstract: Jawi is a script with Arabic influence. In the past it was widely used by the Malay community as well as by foreigners who had diplomatic, business, missionary and similar relations with the region. At that time, Malay was the lingua franca of this region. As a result, much of the Malay heritage, such as manuscripts, religious books, letters, documents and agreements, is written in the Jawi script. There is a significant need to transliterate the Jawi text of these materials into Roman (Rumi) Malay, and research on machine transliteration will support that effort. Much research on machine transliteration has been done for widely studied languages such as English, other European languages, and Asian languages such as Chinese, Japanese, Korean and Arabic. However, research in the context of the Malay language is still lacking, especially on the Romanized transliteration of Jawi. Jawi writing is quite different from Urdu and Arabic although they share the same characters. Modern Jawi uses more vowel letters than the old version. This paper discusses previous studies related to machine transliteration of the Malay language and the approaches that can be used to develop it.

Keywords: Jawi, machine transliteration, Malay, rule base

I. INTRODUCTION

JAWI writing today is a Malay script with Arabic influences that has been in use for nearly 700 years. This is evidenced by the discovery of the Terengganu Inscribed Stone, dated 1303 AD [1]. However, old Jawi differs from the Jawi script of today in its use of vowels, its writing techniques and its use of new letters. This is in line with the development of the Malay language itself, which began to include foreign words and technical terms, especially with today's advances in science and technology.

The new Jawi spelling system was introduced in 1986 under the name Pedoman Ejaan Jawi Yang Disempurnakan (Guidelines of the Enhanced Jawi Spelling) and is intended to meet present and future needs. This system is the result of a formulation drawn up at the National Jawi Convention held in 1984 at Kuala Terengganu, Malaysia. It takes the Jawi spelling system compiled by [2] as its basis. According to [3], the system involves five processes: maintaining confirmed words, perfecting the imperfect, creating the non-existent, clarifying the vague, and tidying up the loose. In this paper, we define old Jawi as the Jawi spelling system before the era of Za'ba (before 1949).

According to [4], there is a 30% difference between the old and new Jawi spelling. Examples of the differences between the old and new Jawi spelling systems are shown in Table I.

TABLE I
SAMPLE OLD AND NEW JAWI
Old Jawi: ﺑﮏ; New Jawi: ﺑﺎڬﻲ; Roman: bagi (for)
Old Jawi: ﺳڬﻞ; New Jawi: ﺳڬﺎﻻ; Roman: segala (all)
Old Jawi: ﺍﻭﻟﻬﻢ; New Jawi: ﺍﻭﻟﻬﻴﻤﻮ; Roman: olehmu (by you)
Old Jawi: ﺗﺎﮐﺔ; New Jawi: ﺗﺎﮐﻮﺓ; Roman: takut (fear)
Old Jawi: ﺳﻮﺩﺭﺍ; New Jawi: ﺳﺎﻭﺩﺍﺭﺍ; Roman: saudara (brother)

II. PROBLEM STATEMENT

In the context of the Malay language, transliteration is used to convert Jawi spelling to Rumi or vice versa. The main problem in transliteration arises when there are no matching characters in the target text [5]. The Jawi and Rumi (Roman) spelling systems are two different systems: Jawi is read and written from right to left, while Rumi runs in the opposite direction [6].

In addition, old Jawi spelling uses fewer letters than modern Jawi because it practises economy in the use of Jawi characters [7]. According to [8], old Jawi spelling is not consistent (it varies according to the author) and contains many more ambiguous words than the new Jawi spelling system. These problems are the main challenges in developing machine transliteration from old Jawi to Rumi.

III. RELATED WORK

Machine transliteration is an important task in natural language processing (NLP) applications, especially in translating entity names from one language to another [9].


Among the languages whose transliteration mechanisms have developed fastest are English, other European languages, and Asian languages such as Chinese, Japanese, Korean, Arabic, Urdu, Hindi, Punjabi, Taiwanese and Thai. Examples include [10], [11], [12], [13], [14], [15] and [5].

In Malaysia, research on machine transliteration has a long history, but its development for the Malay language has been quite slow compared with foreign languages.

Many studies of Malay machine transliteration have been carried out, such as [16], [17], [18], [19], [20], [21] and [22]. All of them deal with modern Jawi, except [19] and [20], which deal with old Jawi.

Several researchers, such as [16], [17] and [18], use a character mapping technique that directly matches ﺍ to a, ﺏ to b, and so on. As a result, many words cannot be converted correctly in the machine transliteration process, because some letters have more than one match in the target text.

In contrast to these studies, [21] uses a rule-based technique for Rumi to Jawi transliteration. Each Rumi word is processed following the Jawi spelling patterns. First, the word is divided into syllables, and then the syllables are matched with rules for conversion into Jawi spelling. According to [21], some words cannot be converted into Jawi precisely because of several problems: loan words from Arabic and English, the difficulty of distinguishing the e-taling and e-pepet vowels, and the difficulty of determining whether a glottal stop in a word is written with ﻕ or ک. In the Jawi vocabulary, several words are classified as exceptions that are exempt from the spelling laws and are not subject to the rules of Jawi. These are among the difficulties faced by [21], because the Jawi spelling system is so unique.

The studies of [19] and [20] are different in that they examine old Malay manuscripts. [19] uses a Kitab Nazam (an old Malay manuscript) that was digitized in advance for the study. [19] applies a stemming and filtering model, which extracts root words and separates old from new Jawi words based on a purpose-built Jawi corpus. [20] applies grapheme-based transliteration to the epic of an old Malay manuscript, Merong Mahawangsa. With the grapheme method, [20] successfully converted the letter ﻑ to ڤ in old Jawi words, so that the corresponding Rumi grapheme is p, not f. This is needed because most old Jawi script does not distinguish ڤ from ﻑ. In this way, [20] reduces the gap between old and new Jawi.

Many other studies focus on Rumi to new Jawi transliteration. Transliteration from old Jawi to Rumi is different because it has its own rules and methods. Many studies have been done on transliteration for new Jawi but not for old Jawi.

IV. CHALLENGES

In transliterating old Malay manuscripts, several challenges arise because the manuscripts use old Jawi spelling. These challenges are discussed here.

A. Homographs and Fewer Vowels

A homograph is one of two or more words that have the same spelling but different meanings [23]. A homograph can also be a homonym (same sound), for example the word mereka (to create) and mereka (they), which sound alike [24]. In the Jawi script, most homographs occur because the Jawi writing system itself uses only four vowel letters (ﺍ ، ﻭ، ﻱ، ﻯ), compared with six vowels (a, e (e-pepet), e (e-taling), i, o, u) in Rumi writing.

There are more homograph problems in old Jawi spelling because of the economy principle, under which fewer vowel letters are used [7]. The following old Jawi sentences contain homographs.

ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ ، ﺗﻴﻤﺒﻖ ﺗﻮﭬﻲ ﺑﺮﺳﺎﻡ ﺍﻧﭽﺊ ﺟﻮﻫﺮ ﺩ ﺟﻮﻫﺮ
Ayah pakai topi, tembak tupai bersama En. Johar di Johor.
(Dad wears a hat, shoots squirrels with Mr Johar in Johor)

ﺍﻳﻪ ﻣﺎﻛﻦ ڬﻮﻟﻲ، ﺳﺪڠﻜﻦ ﺍﻧﻖ ﺑﺮﻣﺎءﻳﻦ ڬﻮﻟﻲ
Ayah makan gulai, sedangkan anak bermain guli.
(Dad eats curry, while the children play marbles)

B. Inconsistent Spelling

In old Malay manuscripts, the spelling of some words varies even within the same book. The spelling of some words also varies between authors. There are several classes of old Jawi writers, including religiously educated people, anonymous palace writers, scholars and the general public [8]. Some of these writers were not trained in the method of Jawi writing. Thus their writing does not follow the rules, is untidy and contains uncertain sentences. Examples identified by [8] are found in pamphlets on ilmu wafaq (wafaq knowledge), ilmu hikmat (wisdom knowledge), mantera (mantras), traditional medicine and others.

C. Sentence Structure

Most old Jawi texts and old manuscripts do not use punctuation [25]. A new sentence usually begins with the word bahawasanya (whereas), maka (therefore) or lagi (again), which acts as a sentence delimiter signal, even though the words maka (then) and lagi (again) sometimes act as conjunctions. The word dan (and) is also used at the beginning of a sentence. In addition, a hadith or Quranic verse can serve as a sentence delimiter.

D. Spelling System

Most old spelling does not put dots on the letter ﻱ (ya) at the end of a word. This is because many of the old books were printed in Arabia [25]. In Arabic, the letter ﻯ (ya) carries no dots at the end of a word. This is unlike the other Malay letters, for example چ (ca), ڠ (nga) and ڽ (nya), which carry three dots, and ڬ (ga), which carries one dot. Sometimes the dot of ڬ (ga) is placed under the letter.

E. Arabic Loanwords

Foreign words absorbed from Arabic are used where a kitab (book) or manuscript was written in Mecca. Arabic words are synonymous with Islam itself. Sometimes the author of a kitab, for example Kitab Hidayah al-Salikin,
finds it difficult to find a corresponding word in the Malay vocabulary because the original book, Kitab Bidayatul Hidayat by al-Imam al-Ghazali, was written in Arabic. Even where a Malay word exists, it may not convey the original intent of the Arabic word. Therefore, most scholars who wrote such books were more comfortable using Arabic loanwords, such as those in Table II, because their meaning is more accurate and they blend with Malay culture.

TABLE II
ARABIC LOANWORD
Columns: Rumi (Malay), English, Jawi
nabi, prophet, ﻧﺒﻰ
syafaat, mediation, ﺷﻔﺎﻋﺔ
nubuwwah, prophetship, ﻧﺒﻮﺓ
alim, pious, ﻋﺎﻟﻢ
fasal, clause, ﻓﺼﻞ
fadhilat, virtues, ﻓﻀﻠﺔ
quran, quran, ﻗﺮﺍﻥ
hadith, hadith, ﺣﺪﻳﺚ
ulamak, scholars, ﻋﻠﻤﺎء
afdal, nice, ﺍﻓﻀﻞ
manfaat, benefit, ﻣﻨﻔﻌﺔ
ilmu, knowledge, ﻋﻠﻢ

V. ARCHITECTURE OF OLD MALAY TRANSLITERATION

Fig. 1 shows the proposed architecture for old Malay transliteration. A hybrid model is adopted in the proposed architecture in view of the unique nature of old Jawi spelling. Each word in the old Malay manuscript is first searched for in the word list; character mapping is carried out only when no matching word is found, that is, for Out of Vocabulary (OOV) words.

[Fig. 1 Proposed architecture for old Malay transliteration. Components: Old Jawi Word; Database Lookup (Malay Wordlist); Transliteration Rule (Insertion Vowel) with Mapping Tables; Homograph Word (Search Engine); Rumi Word]

Database lookup is used for old Jawi words (input), which are looked up in the Jawi-Rumi word list provided. If the word is found in the list, its Rumi pair is returned.

On the other hand, if the word is not available, the rule-based approach is applied to the old Jawi word. Whereas [14, 26] use the collapsed-vowel (CV) model, this study uses a different approach called the Insertion-Vowel (IV) model. See Fig. 2 for the rule-based transliteration process. This is the reverse of the process in [21].

Old Jawi spelling from before Za'ba (1949) mostly does not use vowel letters at the end of open and closed syllables.

[Fig. 2 Transliteration process based on rules. Input = old Jawi ﻛﺎﺕ; check pattern (rules): ka + t; two syllables: cv + c becomes cv + ca, giving ka ta; Output = Rumi kata]

Words that are still not resolved by the list after this test are also tested using a search engine. Take, for example, the sentence "ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ". The phrase "ﭬﺎﻛﻲ ﺗﻮﭬﻲ" is not available in the list, so an on-line resource is queried through a search engine API. The candidate with the highest number of results is selected as the outcome of the transliteration, as shown in Table III.

TABLE III
ARABIC LOANWORD
Jawi: ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ
Columns: Rumi (Malay), English, Result (based on Google search engine)
Ayah pakai topi, Dad wears a hat, 1,040,000 (√)
Ayah pakai tupai, Dad wears a squirrel, 430,000 (X)

Some rules of old Jawi writing, such as the Shift Law, the Insert Law and the vowel-character materialization methods, have to be taken into consideration in the design of old Jawi to Rumi transliteration. The vowel-character materialization methods comprise the Kāf-Ga Law and the out-of-Deranglu Law (ﺩ ﺭ ڠ ﻝ ﻭ). This is important because several laws are used in old Jawi spelling but are no longer used in new Jawi spelling. For example, the Shift and Insert Laws are no longer used in new Jawi spelling, while the Kāf-Ga Law and the out-of-Deranglu Law (ﺩ ﺭ ڠ ﻝ ﻭ) remain in use today. The Deranglu Law uses full vowels for each of its syllables.

VI. CONCLUSION

From the above discussion, we can see some of the differences between the old and new Jawi spelling systems. Developing machine transliteration from old Jawi to Rumi is not as easy as from new Jawi to Rumi. The work becomes more complicated because the characters used are different. The accuracy of the results obtained from the proposed design will depend on how closely the laws and rules discussed above are followed.
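As an illustration of the lookup-then-rules flow of Section V, the following Python sketch first consults a word list and falls back to character mapping plus vowel insertion for the single pattern shown in Fig. 2 (cv + c becomes cv + ca). It is a minimal sketch under stated assumptions, not the system proposed in this paper: the word list entry, the five-letter mapping table and the function names are hypothetical, and a real rule set would need to choose among several vowels and apply the spelling laws discussed above.

```python
import re

# Words are written with Unicode escapes so the code points are unambiguous.
KATA_OLD = "\u0643\u0627\u062a"  # kaf, alif, ta: the Fig. 2 input
DAN = "\u062f\u0627\u0646"       # dal, alif, nun: "dan" (and)

# Hypothetical Jawi-Rumi word list with a single illustrative entry.
WORDLIST = {DAN: "dan"}

# Hypothetical mapping table covering only the letters used in this sketch.
MAPPING = {
    "\u0643": "k",  # kaf
    "\u0627": "a",  # alif
    "\u062a": "t",  # ta
    "\u062f": "d",  # dal
    "\u0646": "n",  # nun
}


def map_characters(jawi_word: str) -> str:
    """Map each Jawi letter to one Rumi letter (raises KeyError if unmapped)."""
    return "".join(MAPPING[ch] for ch in jawi_word)


def insert_vowel(skeleton: str) -> str:
    """Apply only the Fig. 2 pattern: cv + c becomes cv + ca."""
    if re.fullmatch(r"[^aeiou][aeiou][^aeiou]", skeleton):
        return skeleton + "a"
    return skeleton


def transliterate(jawi_word: str) -> str:
    """Word list lookup first; rule-based fallback for out-of-vocabulary words."""
    if jawi_word in WORDLIST:
        return WORDLIST[jawi_word]
    return insert_vowel(map_characters(jawi_word))


if __name__ == "__main__":
    print(transliterate(KATA_OLD))           # rule-based path
    print(transliterate(DAN))                # word list path
    print(insert_vowel(map_characters(DAN))) # what the rule alone would give
```

The last call shows why the lookup comes first in this sketch: applied blindly, the toy rule would turn the closed-syllable word dan into dana, so known words are better served by the word list.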

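The sentence delimiter words named in Section IV.C can be used for a rough segmentation of unpunctuated text. The Python sketch below works on already romanized tokens and starts a new sentence at each delimiter word. It is only an illustration: the function name and the sample sentence are invented for this example, and because maka and lagi can also act as conjunctions, this simple rule will over-segment real manuscripts.

```python
from typing import List

# Delimiter words taken from Section IV.C.
DELIMITERS = {"bahawasanya", "maka", "lagi"}


def split_sentences(tokens: List[str]) -> List[List[str]]:
    """Start a new sentence whenever a delimiter word is seen."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token.lower() in DELIMITERS and current:
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Invented sample text, used only to exercise the function.
    sample = "maka raja pun berangkat maka rakyat pun mengiring".split()
    for sentence in split_sentences(sample):
        print(" ".join(sentence))
```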
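The homograph step of Section V, where the candidate with the highest search result count is kept, can be sketched in Python as follows. The sketch assumes that the candidate Rumi readings of each word are already available and that a hit-count function is supplied by the caller; no real search engine API is called. The counts in the stub are the two figures reported in Table III, and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List


def candidate_phrases(options: List[List[str]]) -> List[str]:
    """Build every phrase from the per-word lists of candidate readings."""
    return [" ".join(words) for words in product(*options)]


def pick_by_hits(candidates: List[str], hit_count: Callable[[str], int]) -> str:
    """Return the candidate phrase with the highest hit count."""
    return max(candidates, key=hit_count)


# Stub standing in for a search engine query; figures are those of Table III.
TABLE_III_HITS: Dict[str, int] = {
    "Ayah pakai topi": 1_040_000,
    "Ayah pakai tupai": 430_000,
}


def stub_hit_count(phrase: str) -> int:
    """Look up the recorded count; unknown phrases count as zero hits."""
    return TABLE_III_HITS.get(phrase, 0)


if __name__ == "__main__":
    # The third word is the homograph with two candidate readings.
    phrases = candidate_phrases([["Ayah"], ["pakai"], ["topi", "tupai"]])
    print(pick_by_hits(phrases, stub_hit_count))
```

Passing the hit-count function as a parameter keeps the selection logic independent of any particular search service, which would have to be wired in separately.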

ACKNOWLEDGMENT

Special thanks to the International Islamic University College Selangor (KUIS) for providing the funds to continue studies at PhD level under the Academic Staff Training Scheme KUIS (SLAK). Thanks also to Universiti Kebangsaan Malaysia (UKM) for sponsoring the fees to attend this seminar.

REFERENCES

[1] Amat Juhari Moain, "Sejarah Tulisan Jawi," in Jurnal Dewan Bahasa, 1991.
[2] Zainal Abidin Ahmad (Za'ba), Daftar Ejaan Melayu (Rumi-Jawi), . : Department of Education Printers Ltd., 1949.
[3] Ismail Dahaman, "Pedoman ejaan jawi yang disempurnakan (1986)," presented at the Konvensyen Tulisan Jawi, Kuala Lumpur, 1991.
[4] Hamdan Abdul Rahman, "Sistem Baharu Ejaan Jawi Bahasa Melayu," presented at the Konvensyen Tulisan Jawi, Terengganu, 1984.
[5] G. S. Josan and J. Kaur, "Punjabi to Hindi Statistical Machine Transliteration," International Journal of Information Technology, vol. 4, pp. 459-463, 2011.
[6] M. F. Nasrudin, et al., "Handwritten Cursive Jawi Character Recognition: A Survey," in Computer Graphics, Imaging and Visualisation, 2008. CGIV '08. Fifth International Conference on, 2008, pp. 247-256.
[7] Mahmud Haji Ashari, et al., "Antara Jawi lama dan baru serta masalah pelaksanaannya," presented at the Konvensyen Tulisan Jawi, Kuala Lumpur, 1991.
[8] Wan Mohd Shaghir Abdullah, "Tulisan Melayu/Jawi dalam manuskrip dan kitab bercetak: Suatu analisis perbandingan," in Tradisi Penulisan Manuskrip Melayu, ed Kuala Lumpur: Perpustakaan Negara Malaysia, 1997, pp. 87-105.
[9] Antony P J and Soman K P, "Machine Transliteration for Indian Languages: A Literature Survey," International Journal of Scientific & Engineering Research, vol. 2, pp. 1-8, 2011.
[10] M. Arbabi, et al., "Algorithms for Arabic name transliteration," IBM Journal of Research and Development, vol. 38, pp. 183-194, 1994.
[11] K. Knight and J. Graehl, "Machine transliteration," Computational Linguistics, vol. 24, pp. 128-135, 1997.
[12] B. G. Stalls and K. Knight, "Translating names and technical terms in Arabic text," in COLING/ACL Workshop on Computational Approaches to Semitic Languages, 1998, pp. 34-41.
[13] Y. Al-Onaizan and K. Knight, "Machine transliteration of names in Arabic text," in ACL-02 Conference Workshop on Computational Approaches to Semitic Languages, 2002, pp. 1-13.
[14] A. T. Sarvnaz Karimi, and Falk Scholer, "English to Persian Transliteration," in Springer-Verlag Berlin Heidelberg 2006, P. F. Fabio Crestani, Mark Sanderson Ed., ed Glasgow, UK, October 11-13, 2006: Springer, 2006, pp. 255-266.
[15] A. Malik, et al., "A hybrid model for Urdu Hindi transliteration," presented at the Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, Suntec, Singapore, 2009.
[16] K. A. Rahman, "Perisian Pemprosesan Perkataan untuk Sistem Tulisan Rumi dan Jawi (Editor Rumi-Jawi)," Bachelor of Information Technology thesis, Universiti Kebangsaan Malaysia, 1998.
[17] Ab. Nasir Abdul Aziz, "Sistem Pertukaran Tulisan Rumi Ke Jawi," Bachelor's thesis, Universiti Teknologi Malaysia, Skudai, Johor Bahru, 1998.
[18] Suhailan Safei, "Sistem Penterjemahan Jawi Ke Rumi," Bachelor's thesis, Universiti Teknologi Malaysia, Skudai, Johor Bahru, 2000.
[19] C. W. S. B. C. W. Ahmad, "Penterjemah Jawi lama kepada Jawi baru," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2007.
[20] Juhaida Abu Bakar, "Transliterasi Jawi lama kepada Jawi baru berasaskan grafem (kajian kes pada Hikayat Merong Mahawangsa)," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2008.
[21] Yonhendri, "Transliterasi Rumi ke Jawi berasaskan petua," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2009.
[22] Roslan Abdul Ghani, et al., "Jawi-Malay transliteration," in International Conference on Electrical Engineering and Informatics, 2009 (ICEEI '09), 2009, pp. 154-157.
[23] "homograph. (n.d.)," in Collins English Dictionary – Complete and Unabridged, ed, 1991, 1994, 1998, 2000, 2003.
[24] Adi Yasran Abdul Aziz and Hashim Musa, "Isu homograf dan cabarannya dalam usaha pelestarian tulisan Jawi," Jurnal ASWARA, vol. 3, pp. 109-126, 2008.
[25] Muhammad @ Mokhtar Talib (MATLOB), Pandai Jawi, 3 ed. Shah Alam: Cerdik Publications Sdn. Bhd., 2007.
[26] S. Karimi, "Machine transliteration of proper names between English and Persian," PhD thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Victoria, Australia, 2008.

Che Wan Shamsul Bahri is a lecturer at International Islamic University College Selangor (KUIS), Bangi, Selangor, Malaysia. He is a PhD candidate at Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor, Malaysia.
Prof Khairuddin Omar, Dr Mohammad Faidzul Nasrudin and Mohd Zamri Murah are lecturers at the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor. They are also members of the Center of Artificial Intelligent Technology (CAIT), UKM.
Mohd Sanusi Azmi is a lecturer at Universiti Teknikal Malaysia (UTeM), Malacca, Malaysia. He is also a member of the Center of Artificial Intelligent Technology (CAIT), UKM.
