Hindi Urdu Machine Transliteration Using Finite-State Transducers

M G Abbas Malik, Christian Boitet
GTALP, Laboratoire d’Informatique Grenoble, Université Joseph Fourier, France
[email protected], [email protected]

Pushpak Bhattacharyya
Dept. of Computer Science and Engineering, IIT Bombay, India
[email protected]

Abstract

Finite-state Transducers (FST) can be very efficient to implement inter-dialectal transliteration. We illustrate this on the Hindi and Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for the same pair on the basis of their common phonetic repository in such a way that it can be extended to other languages like Arabic, Chinese, English, French, etc. We describe a transliteration model based on FST and UIT, and evaluate it on Hindi and Urdu corpora.

1 Introduction

Transliteration is mainly used to transcribe a word written in one language in the writing system of the other language, thereby keeping an approximate phonetic equivalence. It is useful for MT (to create possible equivalents of unknown words) (Knight and Stall, 1998; Paola and Sanjeev, 2003), cross-lingual information retrieval (Pirkola et al., 2003), the development of multilingual resources (Yan et al., 2003) and multilingual text and speech processing. Inter-dialectal translation without lexical changes is quite useful and sometimes even necessary when the dialects in question use different scripts; it can be achieved by transliteration alone. That is the case of HUMT (Hindi-Urdu Machine Transliteration) where each word has to be transliterated from Hindi to Urdu and vice versa, irrespective of its type (noun, verb, etc. and not only proper noun or unknown word).

“One man’s Hindi is another man’s Urdu” (Rai, 2000). The major difference between Hindi and Urdu is that the former is written in Devanagari script with a more Sanskritized vocabulary and the latter is written in Urdu script (derivation of Persio-Arabic script) with more vocabulary borrowed from Persian and Arabic. In contrast to the transcriptional difference, Hindi and Urdu share grammar, morphology, a huge vocabulary, history, classical literature, cultural heritage, etc. Hindi is the National language of India with 366 million native speakers. Urdu is the National language of Pakistan and one of the state languages of India, with 60 million native speakers (Rahman, 2004). Table 1 gives an idea about the size of Hindi and Urdu.

         Native Speakers   2nd Language Speakers   Total
Hindi    366,000,000       487,000,000             853,000,000
Urdu     60,290,000        104,000,000             164,290,000
Total    426,290,000       591,000,000             1,017,000,000

Table 1: Hindi and Urdu speakers

Hindi and Urdu, being varieties of the same language, cover a huge proportion of world’s population. People from Hindi and Urdu communities can understand the verbal expressions of each other but not the written expressions. HUMT is an effort to bridge this scriptural divide between India and Pakistan.

Hindi and Urdu scripts are briefly introduced in section 2. Universal Intermediate Transcription (UIT) is described in section 3, and UIT mappings for Hindi and Urdu are given in section 4. Contextual HUMT rules are presented and discussed in section 5. An HUMT system implementation and its evaluation are provided in sections 6 and 7. Section 8 is on future work and conclusion.

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 537–544, Manchester, August 2008

2 HUMT

There exist three languages at the border between India and Pakistan: Kashmiri, Punjabi and Sindhi. All of them are mainly written in two scripts, one being a derivation of the Persio-Arabic script and the other being Devanagari script. A person using the Persio-Arabic script cannot understand the Devanagari script and vice versa. The same is true for Hindi and Urdu which are varieties or dialects of the same language, called Hindustani by Platts (1909).

PMT (Punjabi Machine Transliteration) (Malik, 2006) was a first effort to bridge this scriptural divide between the two scripts of Punjabi, namely Shahmukhi (a derivation of Persio-Arabic script) and Gurmukhi (a derivation of Landa, Shardha and Takri, old Indian scripts). HUMT is a logical extension of PMT. Our HUMT system is generic and flexible such that it will be extendable to handle similar cases like Kashmiri, Punjabi, Sindhi, etc. HUMT is also a special type of machine transliteration like PMT.

A brief account of Hindi and Urdu is first given for unacquainted readers.

2.1 Hindi

The Devanagari (literally “godly urban”) script, a simplified version of the alphabet used for Sanskrit, is a left-to-right script. Each consonant symbol inherits by default the vowel sound [ə]. Two or more consonants may be combined together to form a cluster called Conjunct that marks the absence of the inherited vowel [ə] between two consonants (Kellogg, 1872; Montaut, 2004). A sentence illustrating Devanagari is given below:

हिन्दी हिन्दुस्तान की क़ौमी ज़ुबान है.
[hɪnḓi hɪnḓustɑn ki qɔmi zubɑn hæ]
(Hindi is the national language of India)

2.2 Urdu

Urdu is written in an alphabet derived from the Persio-Arabic alphabet. It is a right-to-left script and the shape assumed by a character in a word is context-sensitive, i.e. the shape of a character is different depending on whether its position is at the beginning, in the middle or at the end of a word (Zia, 1999). A sentence illustrating Urdu is given below:

اردو پاکستان کی قومی زبان ہے۔
[ʊrḓu pɑkɪstɑn ki qɔmi zubɑn hæ]
(Urdu is the National Language of Pakistan.)

3 Universal Intermediate Transcription

UIT (Universal Intermediate Transcription) is a scheme to transcribe texts in Hindi, Urdu, Punjabi, etc. in an unambiguous way encoded in ASCII range 32 – 126, since a text in this range is portable across computers and operating systems (James, 1993; Wells, 1995). SAMPA (Speech Assessment Methods Phonetic Alphabet) is a widely accepted scheme for encoding the IPA (International Phonetic Alphabet) into ASCII. It was first developed for Danish, Dutch, French, German and Italian, and since then it has been extended to many languages like Arabic, Czech, English, Greek, Hebrew, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, etc.

We define UIT as a logical extension of SAMPA. The UIT encoding for Hindi and Urdu is developed on the basis of rules and principles of SAMPA and X-SAMPA (Wells, 1995), which cover all symbols on the IPA chart. Phonemes are the most appropriate invariants to mediate between the scripts of Hindi, Punjabi, Urdu, etc., so that the encoding choice is logical and suitable.

4 Analysis of Scripts and UIT Mappings

For the analysis and comparison, scripts of Hindi and Urdu are divided into different groups on the basis of character types.

4.1 Consonants

These are grouped into two categories:

Aspirated Consonants: Hindi and Urdu both have 15 aspirated consonants. In Hindi, 11 aspirated consonants are represented by separate characters, e.g. ख [kʰ], भ [bʰ], etc. The remaining 4 consonants are represented by combining a simple consonant to be aspirated and the conjunct form of HA ह [h], e.g. ल [l] + ◌् + ह [h] = ल्ह [lʰ].

In Urdu, all aspirated consonants are represented by a combination of a simple consonant to be aspirated and Heh Doachashmee (ھ) [h], e.g. ب [b] + ھ [h] = بھ [bʰ], ک [k] + ھ [h] = کھ [kʰ], ل [l] + ھ [h] = لھ [lʰ], etc.

The UIT mapping for aspirated consonants is given in Table 2.

Hindi   Urdu   IPA     UIT
भ       بھ     [bʰ]    b_h
फ       پھ     [pʰ]    p_h
थ       تھ     [ṱʰ]    t_d_h
ठ       ٹھ     [ʈʰ]    t`_h
झ       جھ     [ʤʰ]    d_Z_h
छ       چھ     [ʧʰ]    t_S_h
ध       دھ     [ḓʰ]    d_d_h
ढ       ڈھ     [ɖʰ]    d`_h
र्ह      رھ     [rʰ]    r_h
ढ़       ڑھ     [ɽʰ]    r`_h
ख       کھ     [kʰ]    k_h
घ       گھ     [gʰ]    g_h
ल्ह      لھ     [lʰ]    l_h
म्ह      مھ     [mʰ]    m_h
न्ह      نھ     [nʰ]    n_h

Table 2: Hindi Urdu aspirated consonants

Non-aspirated Consonants: Hindi has 29 non-aspirated consonant symbols representing 28 consonant sounds, as both SHA (श) and SSA (ष) represent the same sound [ʃ]. Similarly, Urdu has 35 consonant symbols representing 27 sounds, as multiple characters are used to represent the same sound, e.g. Heh (ح) and Heh-Goal (ہ) represent the sound [h], and Theh (ث), Seen (س) and Sad (ص) represent the sound [s], etc.

UIT mapping for non-aspirated consonants is given in Table 3.

Hindi   Urdu   IPA    UIT
ब       ب      [b]    b
स       ص      [s]    s2
               [z]    z2

Urdu contains 10 vowels and 7 of them have nasalized forms (Hussain, 2004; Khan, 1997). Urdu vowels are represented using four long vowels (Alef Madda (آ), Alef (ا), Vav (و) and Choti Yeh (ی)) and three short vowels (Arabic Fatha – Zabar (◌َ), Arabic Damma – Pesh (◌ُ) and Arabic Kasra – Zer (◌ِ)). Vowel representation is context-sensitive in Urdu. Vav (و) and Choti Yeh (ی) are also used as consonants.

Hamza (ء) is a place holder between two successive vowel sounds, e.g. in کمائی [kəmɑi] (earning), Hamza (ء) separates the two vowel sounds Alef (ا) [ɑ] and Choti Yeh (ی) [i]. Noon ghunna (ں) is used as nasalization marker. Analysis and mapping of Hindi Urdu vowels is given in Table 5.

4.3 Diacritical Marks

Urdu contains 15 diacritical marks.
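To make the role of UIT as a pivot concrete, the aspirated-consonant mapping of Table 2 can be sketched as a small lookup-based rewriter. This is only an illustration under stated assumptions, not the FST implementation described in the paper: the names `ASPIRATED`, `hindi_to_uit` and `uit_to_urdu` are hypothetical, a plain longest-match scan stands in for a non-contextual transducer, and Heh Doachashmee is assumed to be encoded as U+06BE.

```python
import unicodedata

# (Hindi, Urdu, UIT) triples for the aspirated consonants of Table 2.
# Assumption: Heh Doachashmee is encoded as U+06BE in the Urdu strings.
ASPIRATED = [
    ("भ", "بھ", "b_h"), ("फ", "پھ", "p_h"), ("थ", "تھ", "t_d_h"),
    ("ठ", "ٹھ", "t`_h"), ("झ", "جھ", "d_Z_h"), ("छ", "چھ", "t_S_h"),
    ("ध", "دھ", "d_d_h"), ("ढ", "ڈھ", "d`_h"), ("र्ह", "رھ", "r_h"),
    ("\u0922\u093c", "ڑھ", "r`_h"),  # DDHA + NUKTA (rhha)
    ("ख", "کھ", "k_h"), ("घ", "گھ", "g_h"), ("ल्ह", "لھ", "l_h"),
    ("म्ह", "مھ", "m_h"), ("न्ह", "نھ", "n_h"),
]

# NFC normalization gives one canonical key for nukta forms such as U+095D.
HINDI_TO_UIT = {unicodedata.normalize("NFC", h): uit for h, _, uit in ASPIRATED}
UIT_TO_URDU = {uit: urdu for _, urdu, uit in ASPIRATED}
MAX_KEY_LEN = max(len(key) for key in HINDI_TO_UIT)


def hindi_to_uit(text):
    """Rewrite Hindi text into a list of UIT symbols, longest match first."""
    text = unicodedata.normalize("NFC", text)
    symbols, i = [], 0
    while i < len(text):
        for n in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in HINDI_TO_UIT:
                symbols.append(HINDI_TO_UIT[chunk])
                i += n
                break
        else:
            # Characters outside Table 2 are passed through unchanged.
            symbols.append(text[i])
            i += 1
    return symbols


def uit_to_urdu(symbols):
    """Map UIT symbols to Urdu strings; unknown symbols pass through."""
    return "".join(UIT_TO_URDU.get(symbol, symbol) for symbol in symbols)
```

Trying longer keys first matters because some entries share a prefix: ढ (d`_h) is the first character of ढ़ (r`_h), so a shortest-match scan would emit the wrong symbol. A full system would also need the non-aspirated consonants, vowels and contextual rules, which this sketch leaves out.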
