Finite-State Scriptural Translation

Finite-state Scriptural Translation M. G. Abbas Malik Christian Boitet Pushpak Bhattacharyya GETALP – LIG (Grenoble Informatics Lab.) University of Grenoble IIT Bombay Abbas.Malik [email protected] [email protected] ple, the transcription of the word “love” in the Abstract International Phonetic Alphabet (IPA) is [ləv]. While the text processing community uses the We use robust and fast Finite-State Machines term transliteration and defines it as a process of (FSMs) to solve scriptural translation prob- converting a word written in one writing system lems. We describe a phonetico-morphotactic into another writing system while preserving the pivot UIT (universal intermediate transcrip- sound of the original word (Al-Onaizan and tion), based on the common phonetic reposito- Knight, 2002; AbdulJaleel and Larkey, 2003). ry of Indo-Pak languages. It is also extendable to other language groups. We describe a finite- More precisely, the text processing community state scriptural translation model based on fi- defines the term transliteration as two transcrip- nite-state transducers and UIT. We report its tion processes “source script to sound transcrip- performance on Hindi, Urdu, Punjabi and Se- tion” and “sound to target script transcription” raiki corpora. For evaluation, we design two and sometimes as one transcription process classification scales based on the word and “source script to target script transcription”. sentence accuracies for translation system We propose a new term Scriptural Translation classifications. We also show that subjective for this combined process. Scriptural translation evaluations are vital for real life usage of a is a process of transcribing a word written in the translation system in addition to objective source language script into the target language evaluations. script by preserving its articulation in the original 1 Introduction language in such a way that the native speaker of the target language can produce the original pro- Transliteration refers to phonetic translation nunciation. across two languages with different writing sys- FSMs have been successfully used in various tems, such as Arabic to English (Arbabi et al., domains of Computational Linguistics and Natu- 1994; Stall and Knight, 1998; Al-Onaizan and ral Language Processing (NLP). The successful Knight, 2002; AbdulJaleel and Larkey, 2003). use of FSMs have already been shown in various Most prior work on transliteration has been done fields of computational linguistics (Mohri, 1997; for MT of English, Arabic, Japanese, Chinese, Roche and Schabes, 1997; Knight and Al- Korean, etc., for CLIR (Lee and Choi., 1998; Onaizan, 1998). Their practical and advanta- Jeong et al., 1999; Fujii and Ishikawa, 2001; geous features make them very strong candidates Sakai et al., 2002; Pirkola et al., 2003; Virga and to be used for solving scriptural translation Khudanpur, 2003; Yan et al., 2003), and for the problems. development of multilingual resources (Kang First, we describe scriptural translation and and Choi, 2000; Yan, Gregory et al., 2003). identify its problems that fall under weak transla- The terms transliteration and transcription are tion problems. Then, we analyze various chal- often used as generic terms for various processes lenges for solving weak scriptural translation like transliteration, transcription, romanization, problems. We describe our finite-state scriptural transcribing and technography (Halpern, 2002). translation model and report our results on Indo- In general, the speech processing community Pak languages. uses the term transcription to denote a process of conversion from the script or writing system to the sound (phonetic representation). For exam- 2 Scriptural Translation – a weak 3.1 Scriptural divide translation problem There exists a written communication gap be- A weak translation problem is a translation prob- tween people who can understand each other lem in which the number of possible valid trans- verbally but cannot read each other. They are lations, say N, is either very small, less than 5, or virtually divided and become scriptural aliens. almost always 1. Examples are the Hindi & Urdu communities, Scriptural Translation is a sub-problem of the Punjabi/Shahmukhi & Punjabi/Gurmukhi general translation and almost always a weak communities, etc. An example of scriptural di- translation problem. For example, French-IPA vide is shown in Figure 1. Such a gap also ap- and Hindi-Urdu scriptural translation problems pears when people want to read some foreign are weak translation problems due to their small language or access a bilingual dictionary and are number of valid translations. On the other hand, not familiar with the writing system. For exam- Japanese-English and French-Chinese scriptural ple, Japanese–French or French–Urdu dictiona- translation problems are not weak. ries are useless for French learners because of the Scriptural translation is not only vital for scriptural divide. Table 1 gives some figures on translation between different languages, but also how this scriptural divide affects a large popula- becomes inevitable when the same language is tion of the world. written in two or more mutually incomprehensi- Sr. Language Number of Speakers ble scripts. For example, Punjabi is written in 1 Hindi 853,000,000 2 Urdu 164,290,000 three different scripts: Shahmukhi (a derivation 3 Punjabi 120,000,000 of the Perso-Arabic script), Gurmukhi and Deva- 4 Sindhi 21,382,120 nagari. Kazakh and Kurdish are also written in 5 Seraiki 13,820,000 three different scripts, Arabic, Latin and Cyrillic. 6 Kashmir 5,640,940 Malay has two writing systems, Latin and Jawi Total 1178,133,060 (a derivation of the Arabic script), etc. Figure 1 Table 1: Number of Speakers of Indo-Pak languages shows an example of scriptural divide between Hindi and Urdu. 3.2 Under-resourced languages Under-resourced and under-written features of Xì ]gzu[¢ ÌÐ áðZ â§ 63Õ [3e the source or target language are the second big challenge for scriptural translation. The lack of टुिनया को अमन की ज़रत है। standard writing practices or even the absence of [ḓʊnɪjɑ ko əmən ki zərurəṱ hæ.] a standard code page for a language makes trans- The world needs peace. literation or transcription very hard. The exis- Figure 1: Example of scriptural divide tence of various writing styles and systems for a Thus, solving the scriptural translation prob- language leads towards a large number of va- lem is vital to bridge the scriptural divide be- riants and it becomes difficult and complex to tween the speakers of different languages as well handle them. as of the same language. In the case of Indo-Pak languages, Punjabi is Punjabi, Sindhi, Seraiki and Kashmiri exist on the largest language of Pakistan (more than 70 both sides of the common border between India million) and is more a spoken language than a and Pakistan and all of them are written in two or written one. There existed only two magazines more mutually incomprehensible scripts. The (one weekly and one monthly) in 1992 (Rahman, Hindi–Urdu pair exists both in India and Pakis- 1997). In the words of (Rahman, 2004), “… tan. We call all these languages the Indo-Pak there is little development in Punjabi, Pashto, languages. Balochi and other languages…”. (Malik, 2005) reports the first effort towards establishing a 3 Challenges of Scriptural Translation standard code page for Punjabi-Shahmukhi and In this section, we describe the main challenges till date, a standard code page for Shahmukhi of scriptural translation. does not exist. Similar problems also exist for the Kashmiri and Seraiki languages. 3.3 Absence of necessary information conventions between Punjabi/Shahmukhi– Punjabi Gurmukhi, Hindi–Urdu, etc. There are cases where the necessary and indis- Different spelling conventions are also driven pensable information for scriptural translation by different religious influences on different are missing in the source text. For example, the communities. In the Indian sub-continent, Hindi -world) of the example sen) [ ] دﻧﻴﺎ first word ḓʊnɪjɑ is a part of the Hindu identity, while Urdu is a tence of Figure 1 misses crucial diacritical in- part of the Muslim identity1 (Rahman, 1997; Rai, formation, mandatory to perform Urdu to Hindi 2000). Hindi derives its vocabulary from San- scriptural translation. Like in Arabic, diacritical skrit, while Urdu borrows its literary and scien- marks are part of the Urdu writing system but are tific vocabulary from Persian and Arabic. Hindi sparingly used in writings (Zia, 1999; Malik et and Urdu not only borrow from Sanskrit and Per- al., 2008; Malik et al., 2009). sian/Arabic, but also adopt the original spellings Figure 2(a) shows the example word without of the borrowed word due the sacredness of the diacritical marks and its wrong Hindi conversion original language. These differences make scrip- according to conversion rules (explained later). tural translation across scripts, dialects or lan- The Urdu community can understand the word in guages more challenging and complex. its context or without the context because people are tuned to understand the Urdu text or word 3.5 Transcriptional ambiguities without diacritical marks, but the Hindi conver- Character level scriptural translation across dif- sion of Figure 2(a) is not at all acceptable or ferent scripts is ambiguous. For example, the readable in the Hindi community. human being) can be) [ ] اِﻧﺴﺎن Sindhi word Figure 2(b) shows the example word with dia- ɪɲsɑn converted into Devanagari either as इंसान [ɪɲsɑn] or critical marks and its correct Hindi conversion * * according to conversion rules. Similar problems इसान [ɪnsɑn] ( means wrong spellings). The trans- also arise for the other Indo-Pak languages. literation process of the example word from Therefore, missing information in the source text Sindhi to Devanagari is shown in Figure 3(a). makes the scriptural translation problem compu- The transliteration of the third character from the n], is ambiguous because in the] (ن) tationally complex and difficult. left, Noon middle of a word, Noon may represent a conso- دُ ﻧِﻴﺎ = د [ ] ُ [ ] ن [n] ِ [ ] ﯼ [ ] ا [ ] ɑ j ɪ ʊ ḓ nant [n] or the nasalization [ɲ] of a vowel.

Finite-State Scriptural Translation

Persian Language

Notes on the Emergence of New Semitic Roots in the Light of Compounding

Psalm 119 & the Hebrew Aleph

Leila's Alphabet Journey Text Book 2019 -Reduced

Old Phrygian Inscriptions from Gordion: Toward A

Persian Basic Course: Units 1-12. INSTITUTION Center for Applied Linguistics, Washington, D.C.; Foreign Service (Dept

Majority Language Death

Arabic Alphabet 1 Arabic Alphabet

DHJC Scroll May-June 2019

Molybdenum and Cobalt Silicide Field Emitter Arrays

Punjab, Punjabi and Urdu, the Question of Displaced Identity: a Historical Appraisal

Text-Critical Question Begging in Nahum 1,2–8: Re-Evaluating the Evidence and Arguments