Machine Transliteration Design for Old Malay Manuscript
2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

Che Wan Shamsul Bahri Che Wan Ahmad, Khairuddin Omar, Mohammad Faidzul Nasrudin, Mohd Zamri Murah, Sanusi Mohd Azmi

Che Wan Shamsul Bahri C.W.Ahmad is with the International Islamic University College Selangor (KUIS), Bangi, 43000 Kajang, Selangor, Malaysia (phone: +60389254251; fax: +6038925447; e-mail: [email protected]).
Prof. Dr. Khairuddin Omar, Dr. Mohammad Faidzul Nasrudin and Mohd Zamri Murah are with the Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia, Bangi, 43600 Kajang, Selangor, Malaysia (e-mail: [email protected], [email protected], [email protected]).
Mohd Sanusi Azmi is with the Universiti Teknikal Malaysia (UTEM), Malacca, Malaysia (e-mail: [email protected]).

Abstract—Jawi is a script with Arabic influence. In the past it was widely used by the Malay community as well as by foreigners who had diplomatic, business, missionary and similar dealings with it. At that time Malay was the lingua franca of the region, so much of the Malay heritage, such as manuscripts, religious books, letters, documents and agreements, is written in the Jawi script. There is a significant need to transliterate the Jawi text of these materials into Roman (Rumi) Malay, and research on machine transliteration will help that effort. Much research on machine transliteration has been done for major languages, including English, other European languages, and Asian languages such as Chinese, Japanese, Korean and Arabic. However, research in the context of the Malay language is still lacking, especially on the transliteration of Jawi into Roman script. Jawi writing is quite different from Urdu and Arabic although they share the same characters. Modern Jawi uses more vowels than the old version. This paper discusses previous studies related to machine transliteration of the Malay language and approaches that can be used to develop it.

Keywords—Jawi, machine transliteration, Malay, rule base

I. INTRODUCTION

Jawi writing today is a Malay script with Arabic influences that has been in use for nearly 700 years. This is evidenced by the discovery of the Terengganu Inscribed Stone, dated 1303 AD [1]. However, old Jawi differs from today's Jawi script in its use of vowels, its writing techniques and its use of new letters. These changes are in line with the development of the Malay language itself, which began to include foreign words and technical terms, especially with today's advances in science and technology.

The new Jawi spelling system was introduced in 1986 under the name Pedoman Ejaan Jawi Yang Disempurnakan (Guidelines of the Enhanced Jawi Spelling) and is intended to meet present and future needs. The system is the result of a formulation drawn up at the Jawi National Convention held in 1984 at Kuala Terengganu, Malaysia, and it takes the Jawi spelling system compiled by [2] as its basis. According to [3], the system involves five processes: maintaining confirmed words, perfecting the imperfect, creating the non-existent, clarifying the vague and tidying up the loose. In this paper, we define old Jawi as the Jawi spelling system before the era of Za'ba (before 1949). According to [4], there is a 30% difference between the old and new Jawi spelling. Table I shows examples of the differences between the two spelling systems.

TABLE I
SAMPLE OLD AND NEW JAWI
Old Jawi | New Jawi | Roman
ﺑﮏ | ﺑﺎڬﻲ | bagi (for)
ﺳڬﻞ | ﺳڬﺎﻻ | segala (all)
ﺍﻭﻟﻬﻢ | ﺍﻭﻟﻬﻴﻤﻮ | olehmu (by you)
ﺗﺎﮐﺔ | ﺗﺎﮐﻮﺓ | takut (fear)
ﺳﻮﺩﺭﺍ | ﺳﺎﻭﺩﺍﺭﺍ | saudara (brother)

II. PROBLEM STATEMENT

In the context of the Malay language, transliteration is used to convert Jawi spelling to Rumi or vice versa. The main problem in transliteration arises when there are no matching characters in the target text [5]. Jawi and Rumi (Roman) are two different spelling systems: Jawi is read and written from right to left, whereas Rumi runs in the opposite direction [6].

In addition, old Jawi spelling practised economy in the use of characters and therefore used fewer letters than modern Jawi [7]. According to [8], old Jawi spelling is not consistent (it varies from author to author) and contains many more ambiguous words than the new Jawi spelling system. These problems are the main challenges in developing machine transliteration from old Jawi to Rumi.

III. RELATED WORK

Machine transliteration is an important component of natural language processing (NLP) applications, especially in translating entity names from one language to another [9]. The languages whose transliteration mechanisms have developed fastest include English, other European languages, and Asian languages such as Chinese, Japanese, Korean, Arabic, Urdu, Hindi, Punjabi and Thai, among others. Examples are [10], [11], [12], [13], [14], [15] and [5].

In Malaysia, research on machine transliteration has a long history, but its development for the Malay language has been quite slow compared with foreign languages. Many studies on Malay machine transliteration have been done, such as [16], [17], [18], [19], [20], [21] and [22]. All of this research concerns modern Jawi, except the studies by [19] and [20], which concern old Jawi.

Several researchers, such as [16], [17] and [18], use a character mapping technique that directly matches ﺍ to a, ﺏ to b, and so on. As a result, many words cannot be converted correctly in the machine transliteration process, because some letters have more than one match in the target text.

In contrast, [21] uses a rule-based technique in a study of Rumi to Jawi transliteration. Each Rumi word is processed according to Jawi spelling patterns: first the word is divided into syllables, and then the syllables are matched against rules for conversion into Jawi spelling. According to [21], some words cannot be converted into Jawi accurately because of several problems: loan words from Arabic and English, the difficulty of distinguishing the e-taling and e-pepet vowels, and the difficulty of determining whether a glottal stop is written with ﻕ or ک. In the Jawi vocabulary, several words are classified as following past convention or as exempt from the rules, so they are not subject to the Jawi spelling rules. These are among the difficulties faced by [21], because the Jawi spelling system is so unique.

The studies by [19] and [20] differ in that they examine old Malay manuscripts. [19] uses a Kitab Nazam (an old Malay manuscript) that was digitized in advance for the study, and applies a stemming and filtering model that performs stemming and separates old from new Jawi words on the basis of a purpose-built Jawi corpus. [20] applies grapheme-based transliteration to the old Malay manuscript of the epic Merong Mahawangsa. With the grapheme method, [20] successfully converted the letter ﻑ to ڤ in old Jawi words, so that the Rumi output uses the grapheme p rather than f.

A. Homographs and Fewer Vowels

A homograph is one of two or more words that have the same spelling but different meanings [23]. A homograph can also be a homonym (same sound); for example, the words mereka (to create) and mereka (they) sound alike [24]. In the Jawi script, most homographs arise because the Jawi writing system itself uses only four vowel letters (ﺍ ، ﻭ، ﻱ، ﻯ), compared with six vowels (a, e (e-pepet), e (e-taling), i, o, u) in Rumi writing. Homographs are more frequent in old Jawi spelling because of the economy principle of using fewer vowels [7]. The following old Jawi sentences contain homographs.

ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ، ﺗﻴﻤﺒﻖ ﺗﻮﭬﻲ ﺑﺮﺳﺎﻡ ﺍﻧﭽﺊ ﺟﻮﻫﺮ ﺩ ﺟﻮﻫﺮ
Ayah pakai topi, tembak tupai bersama En. Johar di Johor.
(Dad wears a hat and shoots squirrels with Mr Johar in Johor.)

ﺍﻳﻪ ﻣﺎﻛﻦ ڬﻮﻟﻲ، ﺳﺪڠﻜﻦ ﺍﻧﻖ ﺑﺮﻣﺎءﻳﻦ ڬﻮﻟﻲ
Ayah makan gulai, sedangkan anak bermain guli.
(Dad eats curry, while the children play marbles.)

Here topi and tupai share one Jawi spelling, as do Johar and Johor, and gulai and guli.

B. Inconsistent Spelling

In old Malay manuscripts, the spelling of some words varies even within the same book, and it also varies between authors. Old Jawi writers fall into several classes, including the religiously educated, anonymous palace writers, scholars and the general public [8]. Some of these people were not trained in the method of Jawi writing, so their writing does not follow the rules, is untidy and contains unclear sentences. Examples identified by [8] are found in pamphlets on ilmu wafaq (wafaq knowledge), ilmu hikmat (wisdom knowledge), mantera (mantras), traditional medicine and others.

C. Sentence Structure

Most old Jawi writings or old manuscripts do not use punctuation [25]. A new sentence usually begins with the word bahawasanya (whereas), maka (therefore) or lagi (again), which acts as a sentence delimiter signal, even though maka (then) and lagi (again) sometimes act as conjunctions. The word dan (and) is also used at the beginning of a sentence. In addition, a hadith or Quranic verse can serve as a sentence delimiter.
This is because D. Spelling System ڤ most of the old Jawi script does not distinguish phonemes ﻱ Most of the old spelling does not use a point on the letter can reduce the gap between the old and new Jawi. (ya) if in the end of the word. This is because many of the old[20] .ﻑ and ya) does) ﻯ Many other studies are more focused on Rumi to new Jawi books printed in Arabia [25].