
Machine Transliteration using Finite-state Transducers

M G Abbas Malik, Christian Boitet
GETALP, Laboratoire d’Informatique Grenoble, Université Joseph Fourier, France

Pushpak Bhattacharyya
Dept. of Computer Science and Engineering, IIT Bombay

Abstract

Finite-state transducers (FSTs) can implement inter-dialectal transliteration very efficiently. We illustrate this on the Hindi and Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for the same pair on the basis of their common phonetic repository, in such a way that it can be extended to other languages like Arabic, Chinese, English, French, etc. We describe a transliteration model based on FST and UIT, and evaluate it on Hindi and Urdu corpora.

1 Introduction

Transliteration is mainly used to transcribe a word written in one language in the writing system of another language, while keeping an approximate phonetic equivalence. It is useful for MT (to create possible equivalents of unknown words) (Knight and Stall, 1998; Paola and Sanjeev, 2003), cross-lingual information retrieval (Pirkola et al, 2003), the development of multilingual resources (Yan et al, 2003) and multilingual text and speech processing. Inter-dialectal translation without lexical changes is quite useful and sometimes even necessary when the dialects in question use different scripts; it can be achieved by transliteration alone. That is the case of HUMT (Hindi-Urdu Machine Transliteration), where each word has to be transliterated from Hindi to Urdu and vice versa, irrespective of its type (noun, verb, etc., and not only proper noun or unknown word).

“One man’s Hindi is another man’s Urdu” (Rai, 2000). The major difference between Hindi and Urdu is that the former is written in Devanagari script with a more Sanskritized vocabulary and the latter is written in Urdu script (a derivation of the Perso-Arabic script) with more vocabulary borrowed from Persian and Arabic. In contrast to the transcriptional difference, Hindi and Urdu share grammar, morphology, a huge vocabulary, history, classical literature, cultural heritage, etc. Hindi is the national language of India with 366 million native speakers. Urdu is the national language of Pakistan and one of the state languages of India, with 60 million native speakers (Rahman, 2004). Table 1 gives an idea about the size of Hindi and Urdu.

          Native Speakers   2nd Language Speakers   Total
Hindi     366,000,000       487,000,000             853,000,000
Urdu       60,290,000       104,000,000             164,290,000
Total     426,290,000       591,000,000             1,017,000,000

Table 1: Hindi and Urdu speakers

Hindi and Urdu, being varieties of the same language, cover a huge proportion of the world’s population. People from Hindi and Urdu communities can understand each other’s spoken expressions but not the written ones. HUMT is an effort to bridge this scriptural divide between India and Pakistan.

Hindi and Urdu scripts are briefly introduced in section 2. Universal Intermediate Transcription (UIT) is described in section 3, and UIT mappings for Hindi and Urdu are given in section 4. Contextual HUMT rules are presented and discussed in section 5. An HUMT system implementation and its evaluation are provided in sections 6 and 7. Section 8 is on future work and conclusion.

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.

Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 537–544, Manchester, August 2008

2 HUMT

There exist three languages at the border between India and Pakistan: Kashmiri, Punjabi and Sindhi. All of them are mainly written in two scripts, one being a derivation of the Perso-Arabic script and the other being Devanagari script. A person using the Perso-Arabic script cannot understand the Devanagari script and vice versa. The same is true for Hindi and Urdu, which are varieties or dialects of the same language, called Hindustani by Platts (1909).

PMT (Punjabi Machine Transliteration) (Malik, 2006) was a first effort to bridge this scriptural divide between the two scripts of Punjabi, namely Shahmukhi (a derivation of the Perso-Arabic script) and Gurmukhi (a derivation of Landa, Shardha and Takri, old Indian scripts). HUMT is a logical extension of PMT. Our HUMT system is generic and flexible, so that it can be extended to handle similar cases like Kashmiri, Punjabi, Sindhi, etc. HUMT is also a special type of machine transliteration, like PMT.

A brief account of Hindi and Urdu is first given for unacquainted readers.

2.1 Hindi

The Devanagari (literally “godly urban”) script, a simplified version of the alphabet used for Sanskrit, is a left-to-right script. Each consonant symbol inherits by default the vowel sound [ə]. Two or more consonants may be combined together to form a cluster called Conjunct that marks the absence of the inherited vowel [ə] between two consonants (Kellogg, 1872; Montaut, 2004). A sentence illustrating Devanagari is given below:

हिन्दी हिन्दुस्तान की क़ौमी ज़ुबान है।
[hɪnḓi hɪnḓustɑn ki qɔmi zubɑn hæ]
(Hindi is the national language of India)

2.2 Urdu

Urdu is written in an alphabet derived from the Perso-Arabic alphabet. It is a right-to-left script and the shape assumed by a character in a word is context-sensitive, i.e. the shape of a character differs depending on whether it is at the beginning, in the middle or at the end of a word (Zia, 1999). A sentence illustrating Urdu is given below:

اُردو پاکستان کی قومی زبان ہے۔
[ʊrḓu pɑkɪstɑn ki qɔmi zubɑn hæ]
(Urdu is the National Language of Pakistan.)

3 Universal Intermediate Transcription

UIT (Universal Intermediate Transcription) is a scheme to transcribe texts in Hindi, Urdu, Punjabi, etc. in an unambiguous way, encoded in the ASCII range 32 to 126, since a text in this range is portable across computers and operating systems (James 1993; Wells, 1995). SAMPA (Speech Assessment Methods Phonetic Alphabet) is a widely accepted scheme for encoding the IPA (International Phonetic Alphabet) into ASCII. It was first developed for Danish, Dutch, French, German and Italian, and since then it has been extended to many languages like Arabic, Czech, English, Greek, Hebrew, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, etc.

We define UIT as a logical extension of SAMPA. The UIT encoding for Hindi and Urdu is developed on the basis of the rules and principles of SAMPA and X-SAMPA (Wells, 1995), which cover all symbols on the IPA chart. Phonemes are the most appropriate invariants to mediate between the scripts of Hindi, Punjabi, Urdu, etc., so that the encoding choice is logical and suitable.

4 Analysis of Scripts and UIT Mappings

For the analysis and comparison, the scripts of Hindi and Urdu are divided into different groups on the basis of character types.

4.1 Consonants

These are grouped into two categories:

Aspirated Consonants: Hindi and Urdu both have 15 aspirated consonants. In Hindi, 11 aspirated consonants are represented by separate characters, e.g. ख [kʰ], भ [bʰ], etc. The remaining 4 consonants are represented by combining a simple consonant to be aspirated and the conjunct form of HA ह [h], e.g. ल [l] + ◌् + ह [h] = ल्ह [lʰ].

In Urdu, all aspirated consonants are represented by a combination of a simple consonant to be aspirated and Heh Doachashmee (ھ) [h], e.g. ب [b] + ھ [h] = بھ [bʰ], ک [k] + ھ [h] = کھ [kʰ], ل [l] + ھ [h] = لھ [lʰ], etc.

The UIT mapping for aspirated consonants is given in Table 2.

Hindi   Urdu   IPA    UIT
भ       بھ     [bʰ]   b_h
फ       پھ     [pʰ]   p_h
थ       تھ     [ṱʰ]   t_d_h
ठ       ٹھ     [ʈʰ]   t`_h
झ       جھ     [ʤʰ]   d_Z_h
छ       چھ     [ʧʰ]   t_S_h
ध       دھ     [ḓʰ]   d_d_h
ढ       ڈھ     [ɖʰ]   d`_h
र्ह      رھ     [rʰ]   r_h
ढ़       ڑھ     [ɽʰ]   r`_h
ख       کھ     [kʰ]   k_h
घ       گھ     [gʰ]   g_h
ल्ह      لھ     [lʰ]   l_h
म्ह      مھ     [mʰ]   m_h
न्ह      نھ     [nʰ]   n_h

Table 2: Hindi Urdu aspirated consonants

Non-aspirated Consonants: Hindi has 29 non-aspirated consonant symbols representing 28 consonant sounds, as both SHA (श) and SSA (ष) represent the same sound [ʃ]. Similarly, Urdu has 35 consonant symbols representing 27 sounds, as multiple characters are used to represent the same sound, e.g. Heh (ح) and Heh-Goal (ہ) represent the sound [h], and Theh (ث), Seen (س) and Sad (ص) represent the sound [s], etc.

The UIT mapping for non-aspirated consonants is given in Table 3.

Hindi   Urdu   IPA    UIT
ब       ب      [b]    b
प       پ      [p]    p
त       ت      [ṱ]    t_d
ट       ٹ      [ʈ]    t`
स       ث      [s]    s1
ज       ج      [ʤ]    d_Z
च       چ      [ʧ]    t_S
ह       ح      [h]    h1
ख़       خ      [x]    x
द       د      [ḓ]    d_d
ड       ڈ      [ɖ]    d`
ज़       ذ      [z]    z1
र       ر      [r]    r
ड़       ڑ      [ɽ]    r`
ज़       ز      [z]    z
ज़       ژ      [ʒ]    Z
स       س      [s]    s
श       ش      [ʃ]    S
ष       ش      [ʃ]    S1
स       ص      [s]    s2
ज़       ض      [z]    z2
त       ط      [ṱ]    t_d1
ज़       ظ      [z]    z3
-       ع      [ʔ]    ?
ग़       غ      [ɣ]    X
फ़       ف      [f]    f
क़       ق      [q]    q
क       ک      [k]    k
ग       گ      [g]    g
ल       ل      [l]    l
म       م      [m]    m
न       ن      [n]    n
व       و      [v]    v
ह       ہ      [h]    h
य       ی      [j]    j
त       ة      [ṱ]    t_d2
ण       -      [ɳ]    n`
ं        ں      [ŋ]    ~

Table 3: Hindi Urdu non-aspirated consonants

4.2 Vowels

Hindi has 11 vowels and 10 of them have nasalized forms. They are represented by 11 independent vowel symbols, e.g. आ [ɑ], ऊ [u], औ [ɔ], etc., and 10 dependent vowel symbols, e.g. ◌ा [ɑ], ◌ू [u], ◌ौ [ɔ], etc., called maatraas. When a vowel comes at the start of a word or a syllable, the independent form is used; otherwise the dependent form is used (Kellogg, 1872; Montaut, 2004).

Urdu contains 10 vowels and 7 of them have nasalized forms (Hussain, 2004; Khan, 1997). Urdu vowels are represented using four long vowels (Alef Madda (آ), Alef (ا), Vav (و) and Choti Yeh (ی)) and three short vowels (Zabar ◌َ, the Arabic Fatha; Pesh ◌ُ, the Arabic Damma; and Zer ◌ِ, the Arabic Kasra). Vowel representation is context-sensitive in Urdu. Vav (و) and Choti Yeh (ی) are also used as consonants.

Hamza (ء) is a place holder between two successive vowel sounds, e.g. in کمائی [kəmɑi] (earning), Hamza (ء) separates the two vowel sounds Alef (ا) [ɑ] and Choti Yeh (ی) [i]. Noon ghunna (ں) is used as nasalization marker. The analysis and mapping of Hindi Urdu vowels is given in Table 5.

4.3 Diacritical Marks

Urdu contains 15 diacritical marks. They represent vowel sounds, except Hamza-e-Izafat ◌ٔ and Kasr-e-Izafat ◌ِ, which are used to build compound words, e.g. اِدارۂ سائنس [ɪḓɑrəhɪsɑɪns] (Institute of Science), تارِیخِ پیدائش [tɑrixɪpedɑɪʃ] (date of birth), etc. Shadda ◌ّ is used to geminate a consonant, e.g. ربّ [rəbb] (God), اچّھا [əʧʧʰɑ] (good), etc. Jazm ◌ْ is used to mark the absence of a vowel after the base consonant (Platts, 1909). In Hindi, the conjunct form is used to geminate a consonant. The mapping of Urdu diacritical marks is given in Table 4.

Hindi   Urdu   IPA    UIT
-       ◌َ     [ə]    @
◌ि      ◌ِ     [ɪ]    I
◌ु      ◌ُ     [ʊ]    U
◌ू      ◌ٗ     [u]    u
◌ी      ◌ٖ     [i]    i
◌ा      ◌ٰ     [ɑ]    A
न       ◌ً     [ən]   @n
◌ुन     ◌ٌ     [ʊn]   Un
◌िन     ◌ٍ     [ɪn]   In

Table 4: Diacritical Marks of Urdu

Diacritical marks are present in Urdu but are sparingly used. They are very important for the correct pronunciation and understanding of the meaning of a word. For example,

یہ سڑک بہت چوڑی ہے۔
[je səɽək bʊhəṱ ʧɔɽi hæ] (This is a wide road.)

میری چوڑی سرخ ہے۔
[meri ʧuɽi sʊrəx hæ] (My bangle is red.)

In the first sentence, the word چوڑی is pronounced as [ʧɔɽi] (wide) and in the second, it is

pronounced as [ʧuɽi] (bangle). There should be Zabar (◌َ) and Pesh (◌ُ) after Cheh (چ) in the above words, and the correct transcriptions are چَوڑی (wide) and چُوڑی (bangle). Thus diacritical marks are essential for removing ambiguities, natural language processing and speech synthesis.

Vowel [ə], Hindi अ (UIT @)
Urdu: It is represented by Alef (ا) + Zabar ◌َ at the start of a word, e.g. اَب [əb] (now), and by Zabar ◌َ in the middle of a word, e.g. رَبّ [rəbb] (God). It never comes at the end of a word.

Vowel [ɑ], Hindi आ or ◌ा (UIT A)
Urdu: It is represented by Alef Madda (آ) at the start of a word, e.g. آدمی [ɑḓmi] (man), and by Alef (ا) or Alef Madda (آ) in the middle of a word, e.g. جانا [ʤɑnɑ] (go), بِلآخر [bɪlɑxər] (at last). At the end of a word, it is represented by Alef (ا). In some Arabic loan words, it is represented by Choti Yeh (ی) + Khari Zabar ◌ٰ at the end of a word, e.g. اعلیٰ [əʔlɑ] (Superior), and by Khari Zabar ◌ٰ in the middle of a word, e.g. الٰہی [ɪlɑhi] (God).

Vowel [e], Hindi ए or ◌े (UIT e)
Urdu: It is represented by Alef (ا) + Choti Yeh (ی) at the start of a word, e.g. ایثار [esɑr] (sacrifice), ایک [ek] (one), etc., and by Choti Yeh (ی) or Baree Yeh (ے) in the middle of a word, e.g. میرا [merɑ] (mine), اندھیرا [ənḓʰerɑ] (darkness), بےگھر [begʰər] (homeless), etc. At the end of a word, it is represented by Baree Yeh (ے), e.g. سارے [sɑre] (all).

Vowel [æ], Hindi ऐ or ◌ै (UIT {)
Urdu: It is represented by Alef (ا) + Zabar ◌َ + Choti Yeh (ی) at the start of a word, e.g. اَیہہ [æh] (this), and by Zabar ◌َ + Choti Yeh (ی) in the middle of a word, e.g. مَیل [mæl] (dirt). At the end of a word, it is represented by Zabar ◌َ + Baree Yeh (ے), e.g. ہَے [hæ] (is).

Vowel [ɪ], Hindi इ or ◌ि (UIT I)
Urdu: It is represented by Alef (ا) + Zer ◌ِ at the start of a word, e.g. اِس [ɪs] (this), and by Zer ◌ِ in the middle of a word, e.g. بارِش [bɑrɪʃ] (rain). It never comes at the end of a word. At the end of a word, Zer is used as Kasr-e-Izafat to build compound words.

Vowel [i], Hindi ई or ◌ी (UIT i)
Urdu: It is represented by Alef (ا) + Zer ◌ِ + Choti Yeh (ی) at the start of a word, e.g. اِیمان [imɑn] (belief), and by Zer ◌ِ + Choti Yeh (ی) in the middle or at the end of a word, e.g. امِیری [ɑmiri] (richness), قرِیب [qərib] (near), etc.

Vowel [ʊ], Hindi उ or ◌ु (UIT U)
Urdu: It is represented by Alef (ا) + Pesh ◌ُ at the start of a word, e.g. اُدّھر [ʊḓḓʰər] (there), and by Pesh ◌ُ in the middle of a word, e.g. مُلّ [mʊll] (price). It never comes at the end of a word.

Vowel [u], Hindi ऊ or ◌ू (UIT u)
Urdu: It is represented by Alef (ا) + Pesh ◌ُ + Vav (و) at the start of a word, e.g. اُونگھتا [ũgʰəṱɑ] (dozing), and by Pesh ◌ُ + Vav (و) in the middle or at the end of a word, e.g. صُورت [surəṱ] (face), ترازُو [ṱərɑzu] (physical balance), etc.

Vowel [o], Hindi ओ or ◌ो (UIT o)
Urdu: It is represented by Alef (ا) + Vav (و) at the start of a word, e.g. اوچھا [oʧʰɑ] (nasty), and by Vav (و) in the middle or at the end of a word, e.g. ہولی [holi] (slowly), کہو [kəho] (say), etc.

Vowel [ɔ], Hindi औ or ◌ौ (UIT O)
Urdu: It is represented by Alef (ا) + Zabar ◌َ + Vav (و) at the start of a word, e.g. اَوٹ [ɔʈ] (hindrance), and by Zabar ◌َ + Vav (و) in the middle or at the end of a word, e.g. مَوت [mɔṱ] (death).

Vowel [r̥], Hindi ऋ or ◌ृ (UIT r1)
Urdu: It is represented by the consonant symbol Reh (ر) [r], as this vowel is only present in Sanskrit loan words. It is almost not used in modern standard Hindi. It is not present in Urdu as it is used only in Sanskrit loan words.

Note: In Hindi, nasalization of a vowel is done by adding Anunasik (◌ँ) or Anusavar (◌ं) after the vowel. Anusavar (◌ं) is used when the vowel graph goes over the upper line; otherwise Anunasik (◌ँ) is used (Kellogg, 1872; Montaut, 2004). In UIT, ~ is added at the end of the UIT encoding to nasalize any of the above vowels except the last one, which does not have a nasalized form.

Table 5: Analysis and Mapping of Hindi Urdu Vowels

5 HUMT Rules

In this section, the UIT mappings of the Hindi and Urdu alphabets and the contextual rules that are necessary for Hindi-Urdu transliteration are discussed.

5.1 UIT Mappings

UIT mappings for the Hindi and Urdu alphabets and their vowels are given in Tables 2 to 5. In Hindi, SHA (श) and SSA (ष) both represent the sound [ʃ] and have one equivalent symbol in Urdu, i.e. Sheen (ش). To distinguish SHA (श) from SSA (ष) in UIT, they are mapped on S and S1 respectively. Similarly in Urdu, Seh (ث), Seen (س) and Sad (ص) represent the sound [s] and have one equivalent symbol in Hindi, i.e. SA (स). To distinguish them in UIT, they are mapped on s1, s and s2 respectively. All similar cases are shown in Table 6.

IPA   Urdu (UIT)                                      Hindi (UIT)
ṱ     ت (t_d), ط (t_d1), ة (t_d2)                     त (t_d)
s     ث (s1), س (s), ص (s2)                           स (s)
h     ح (h1), ہ (h)                                   ह (h)
z     ذ (z1), ز (z), ژ (Z), ض (z2), ظ (z3)            ज़ (z)
ʃ     ش (S)                                           श (S), ष (S1)
r     ر (r)                                           र (r), ऋ (r1)

Table 6: Multiple Characters for one IPA

Multi-equivalences are problematic for Hindi-Urdu transliteration.

UIT is extendable to other languages like English, French, Kashmiri, Punjabi, Sindhi, etc. For example, Punjabi has one more character than Urdu, i.e. Rnoon [ɳ] (ڻ), which is mapped on ‘n`’ in UIT. Similarly, UIT, a phonetic encoding scheme, can be extended to other languages.

All these mappings can be implemented by simple finite-state transducers using XEROX’s XFST (Beesley and Karttunen, 2003) language. A sample XFST code is given in Figure 1.

    read regex [ب -> b, پ -> p, ج -> [d "_" Z] ];
    read regex [[ج ھ] -> [d "_" Z "_" h]];
    read regex [و -> v, ی -> j || .#. _ ];
    read regex [و -> v, ی -> j || _ [ا | آ]];
    read regex [ی -> e || CONSONANTS _ ];
    read regex [ی -> i || _ [ْ | .#.]];
    …
    read regex [ब -> b, प -> p, ज़ -> z, झ -> [d "_" Z "_" h]];
    read regex [अ -> "@", आ -> A, ई -> i || .#. _ ];
    …

Figure 1: Sample XFST code

Finite-state transducers are robust and time and space efficient (Mohri, 1997). They are a logical choice for Hindi-Urdu transliteration via UIT, as this problem can also be seen as string matching that produces an analysis string as output, like finite-state morphological analysis.

5.2 Contextual HUMT Rules

UIT mappings need to be accompanied by the necessary contextual HUMT rules for correct Hindi to Urdu transliteration and vice versa.

For example, Vav (و) and Choti Yeh (ی) are used to represent vowels like [o], [ɔ], [i], [e], etc., but they are also used as consonants. Vav (و) and Choti Yeh (ی) are consonants when they come at the beginning of a word or when they are followed by Alef Madda (آ) or Alef (ا). Also, Choti Yeh (ی) represents the vowel [e] when it is preceded by a consonant, but when it comes at the end of a word and is preceded by a consonant, it represents the vowel [i]. These are the rules with a context (after ||) in Figure 1.

Thus HUMT contextual rules are necessary for Hindi-Urdu transliteration, and they can also be implemented as finite-state transducers using XFST. Not all of these rules can be given here due to shortage of space.

6 HUMT System

The HUMT system exploits the simplicity, robustness, power and time and space efficiency of finite-state transducers. Exactly the same transducer that encodes a Hindi or Urdu text into UIT can be used in the reverse direction to generate Hindi or Urdu text from the UIT encoded text. This two-way power of the finite-state transducer (Mohri, 1997) has significantly reduced the effort needed to build the HUMT system.

Another very important strength of finite-state transducers is that they can be composed into a single transducer that performs the same task as two or more transducers applied sequentially (Mohri, 1997). This not only allows us to build a direct Hindi ↔ Urdu transducer, but also helps to divide difficult and complex problems into simple ones, and has indeed simplified the process of building the HUMT system. A direct Hindi ↔ Urdu transducer can be used in applications where UIT encoding is not necessary, like a Hindi Urdu MT system.

The HUMT system can be extended to perform transliteration between two or more different scripts used for the same language, like Kashmiri, Kazakh, Malay, Punjabi, Sindhi, etc., or between language pairs like English–Hindi, English–Urdu, English–French, etc., by just introducing the respective transducers in the Finite-state Transducer Manager of the HUMT system to build a multilingual machine transliteration system.

Figure 2: HUMT System

In the HUMT system, Text Tokenizer takes the input Hindi or Urdu Unicode text, tokenizes it into Hindi or Urdu words and passes

them to UIT Enconverter. The enconverter enconverts Hindi or Urdu words into UIT words using the appropriate transducer from Finite-state Transducers Manager, e.g. for Hindi words, it uses the Hindi ↔ UIT transducer. It passes these UIT encoded words to UIT Deconverter, which deconverts them into Hindi or Urdu words using the appropriate transducer from Finite-state Transducers Manager in reverse and generates the target Hindi or Urdu text.

6.1 Enconversion of Hindi-Urdu to UIT

The Hindi ↔ UIT transducer is a composition of the mapping rules transducers and the contextual rules transducers. This is shown in Figure 3 with a sample XFST code.

    clear stack
    set char-encoding UTF-8
    define CONSONANTS [क | ख | ग | घ | ङ | छ | ज];
    read regex [◌् -> J, ◌ः -> h, ◌़ -> 0];
    read regex [क -> k, ख -> [k "_" h], ग -> g, घ -> [g "_" h], ङ -> [n "@" g], च -> [t "_" S], छ -> [t "_" S "_" h]];
    read regex [[क ◌् क] -> [k k], [क ◌् ख] -> [k k "_" h], [ग ◌् ग] -> [g g], [ग ◌् घ] -> [g g "_" h]];
    …
    read regex [[क ◌ि] -> [k h], [न] -> [n A], [य ◌े] -> [j h], [व ◌े] -> [v h] || .#. _ .#.];
    compose net

Figure 3: Sample code for Hindi ↔ UIT Transducer

How the HUMT system works is shown with the help of an example. Take the Hindi sentence:

फ़ाख़ता मुहबत और अमन का निशान है
[fɑxəṱɑ mʊhəbəṱ ɔr əmən kɑ nɪʃɑn hæ]
(Dove is symbol of love and peace)

This sentence is received by the Text Tokenizer and is tokenized into Hindi words, which are enconverted into UIT words by the UIT Enconverter using the mapping and the contextual rules of the Hindi ↔ UIT transducer. The Hindi words and their UIT enconversions are given in Table 7.

Hindi Words           UIT
फ़ाख़ता [fɑxəṱɑ]       fAx@t_dA
मुहबत [mʊhəbəṱ]       mUh@b@t_d
और [ɔr]               Or
अमन [əmən]            @m@n
का [kɑ]               kA
निशान [nɪʃɑn]         nISAn
है [hæ]               H{

Table 7: Hindi Words with UIT

6.2 Deconversion of UIT to Hindi-Urdu

For the deconversion, the Hindi ↔ UIT or Urdu ↔ UIT transducer is applied in reverse on the UIT enconverted words to generate Hindi or Urdu words. To continue with the example in the previous section, the UIT words are deconverted into Urdu words by the UIT Deconverter using the Urdu ↔ UIT transducer in reverse. The Urdu words are given in Table 8 with the Hindi and the UIT words.

Hindi                 UIT          Urdu
फ़ाख़ता [fɑxəṱɑ]       fAx@t_dA     فاختا
मुहबत [mʊhəbəṱ]       mUh@b@t_d    مُحبت
और [ɔr]               Or           اَور
अमन [əmən]            @m@n         امن
का [kɑ]               kA           کا
निशान [nɪʃɑn]         nISAn        نِشان
है [hæ]               H{           ہَے

Table 8: Hindi, UIT and Urdu Words

Finally, the following Urdu sentence is generated from the Urdu words.

فاختا مُحبت اَور امن کا نِشان ہَے

Here the word फ़ाख़ता [fɑxəṱɑ] (Dove) is transliterated wrongly into ‘فاختا’ because the vowel [ɑ] at the end of some Urdu words (borrowed from Persian) is written with Heh-gol [h] (ہ). This phenomenon is a problem for Hindi to Urdu transliteration but not for Urdu to Hindi transliteration.

7 Evaluation Experiments and Results

For evaluation purposes, we used a Hindi corpus containing 374,150 words and an Urdu corpus with 38,099 words. The Hindi corpus is extracted from the Hindi WordNet² developed by the Resource Center for Indian Language Technology Solutions, CSE Department, Indian Institute of Technology (IIT) Bombay, India, and from the project CIFLI (GETALP-LIG³, University Joseph Fourier), a project for building resources and tools for network-based “linguistic survival” communication between French, English and Indian languages like Hindi, Tamil, etc. The Urdu corpus was developed manually from a book titled “ظُلمت کدہ” [zʊlməṱ kədɑ]. The Hindi-Urdu corpus contains in total 412,249 words.

The HUMT system is an initial step to build Urdu resources and add Urdu to the languages of

² http://www.cfilt.iitb.ac.in
³ http://www.liglab.fr

SurviTra-CIFLI (Survival Translation) (Boitet et al, 2007), a multilingual digital phrase-book to help tourists with communication and enquiries such as restaurant and hotel reservations, flight enquiries, etc.

To reduce evaluation and testing efforts, unique words are extracted from the Hindi-Urdu corpus and are transliterated using the HUMT system. These unique words and their transliterations are checked for accuracy with the help of dictionaries (Platts, 1911; Feroz).

7.1 Urdu → Hindi Transliteration Results

While transliterating Urdu into Hindi, multiple problems occur, like multi-equivalences, no-equivalences and missing diacritical marks in Urdu text.

For example, Sheen [ʃ] (ش) can be transliterated in Hindi into SHA [ʃ] (श) or SSA [ʃ] (ष), which are present in 7,917 and 6,399 corpus words respectively. Sheen [ʃ] (ش) is transliterated into SHA [ʃ] (श) by default. Thus, 6,399 words containing SSA [ʃ] (ष) are wrongly transliterated into Hindi using HUMT. Urdu to Hindi multi-equivalence cases are given in Table 9 with their frequencies.

Urdu      Hindi (corpus frequency)
ش [ʃ]     श (7,917), ष (6,399)
ر [r]     र (79,345), ऋ (199)

Table 9: Urdu → Hindi Multi-equivalences

Some Hindi characters do not have equivalent characters in Urdu, e.g. NNA [ɳ] (ण), the retroflexed version of [n], is approximately mapped onto Noon [n] (ن). This creates a problem when a word actually containing NNA [ɳ] (ण) is transliterated from Urdu to Hindi. No-equivalence cases are given in Table 10.

Urdu      Hindi (corpus frequency)
-         ण (4,744)
-         ङ (0)
-         ञ (532)

Table 10: Urdu → Hindi No-equivalences

Missing diacritical marks are the major problem when transliterating Urdu into Hindi. The importance of diacritical marks has already been explained in section 4.3. This work assumes that all necessary diacritical marks are present in the Urdu text, because they play a vital role in Urdu to Hindi transliteration. Results of Urdu to Hindi transliteration are given in Table 11.

                Error Words   Accuracy
Corpus          11,874        97.12%
Unique Words    123           98.54%

Table 11: Urdu → Hindi Transliteration Results

7.2 Hindi → Urdu Transliteration Results

Hindi → Urdu transliteration also has multi-equivalence and no-equivalence problems, which are given in Table 12.

Hindi     Urdu (corpus frequency)
त         ت (41,751), ط (1,312)
स         س (53,289), ص (751), ث (86)
ह         ہ (72,850), ح (1,800)
ज़         ز (2,551), ض (1,489), ذ (228), ظ (215), ژ (2)
-         ع (2,857)

Table 12: Hindi → Urdu Multi & No equivalences

Results of Hindi to Urdu transliteration are given in Table 13.

                Error Words   Accuracy
Corpus          8,740         97.88%
Unique Words    1,400         83.41%

Table 13: Hindi → Urdu Transliteration Results

Interestingly, Hindi to Urdu conversion is 14.47 percentage points less accurate on the unique words than on the corpus data, in contrast to the reverse conversion. The HUMT system gives 97.12% accuracy for Urdu to Hindi and 97.88% accuracy for Hindi to Urdu on the corpus. Averaged over the two directions, the HUMT system thus works with 97.50% accuracy.

8 Future Implications

Hindi-Urdu transliteration is one of the cases where one language is written in two or more mutually incomprehensible scripts, like Kazakh, Kashmiri, Malay, Punjabi, Sindhi, etc. The HUMT system can be enhanced by extending UIT and introducing the respective finite-state transducers. It can similarly be enhanced to transliterate between language pairs, e.g. English-Arabic, English-Hindi, English-Urdu, French-Hindi, etc. Thus, it can be enhanced to build a multilingual machine transliteration system that can be used for cross-scriptural transliteration and MT.

We intend to resolve the problems of multi-equivalences and no-equivalences and, most importantly, the restoration of diacritical marks in Urdu text, which were observed but left unattended in the current work. Restoration of diacritical marks in Urdu, Sindhi, Punjabi, Kashmiri, etc. texts is essential for word sense disambiguation, natural language processing and speech synthesis of the said languages.

The HUMT system will also provide a basis for the development of inter-dialectal translation systems and MT systems for surface-close languages like Indonesian-Malay, Japanese-Korean,

Hindi-Marathi, Hindi-Urdu, etc. Translation of surface-close languages or inter-dialectal translation can be performed by using mainly transliteration and some lexical translations. Thus HUMT will also provide a basis for Cross-Scriptural Transliteration, Cross-scriptural Information Retrieval, Cross-scriptural Application Development, inter-dialectal translation and translation of surface-close languages.

9 Conclusion

Finite-state transducers are very efficient, robust, and simple to use. Their simplicity and powerful features are exploited in the HUMT model to perform Hindi-Urdu transliteration using UIT, a generic and flexible encoding scheme that uniquely encodes natural languages into ASCII. The HUMT system gives 97.50% accuracy when it is applied on the Hindi-Urdu corpora containing 412,249 words in total. It is an endeavor to bridge the scriptural, ethnic, cultural and geographical division between 1,017 million people around the globe.

Acknowledgement

This study is partially supported by the project CIFLI funded under the ARCUS-INDIA program by the Ministry of Foreign Affairs and the Rhône-Alpes region.

References

Beesley, Kenneth R. and Karttunen, Lauri. 2003. Finite State Morphology. CSLI Publications, USA.

Boitet, Christian. Bhattacharayya, Pushpak. Blanc, Etienne. Meena, Sanjay. Boudhh, Sangharsh. Fafiotte, Georges. Falaise, Achille. Vacchani, Vishal. 2007. Building Hindi-French-English-UNL Resources for SurviTra-CIFLI, a linguistic survival system under construction. Proceedings of the Seventh Symposium on NLP, 13 – 15 December, Chonburi, Thailand.

Feroz ul Din. فیروزُاللُغات اُردو. Feroz Sons Publishers, Lahore, Pakistan.

Hussain, Sarmad. 2004. Letter to Sound Rules for Urdu Text to Speech System. Proceedings of Workshop on Computational Approaches to Arabic Script-based Languages, COLING 2004, Geneva, Switzerland.

James, L. Hieronymus. 1993. ASCII Phonetic Symbols for the World’s Languages: Worldbet. AT&T Bell Laboratories, Murray Hill, NJ 07974, USA.

Kellogg, Rev. S. H. 1872. A Grammar of Hindi Language. Delhi, Oriental Book Reprints.

Khan, Mehboob Alam. 1997. اُردو کا صوتی نظام (Sound System in Urdu). National Language Authority, Pakistan.

Knight, K. and Graehl, J. 1998. Machine Transliteration. Computational Linguistics, 24(4).

Knight, K. and Stall, B G. 1998. Translating Names and Technical Terms in Arabic Text. Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.

Malik, M. G. Abbas. 2006. Punjabi Machine Transliteration. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, July 2006, Sydney.

Mohri, Mehryar. 1997. Finite-state Transducers in Language and Speech Processing. Computational Linguistics, 23(2).

Montaut A. 2004. A Linguistic Grammar of Hindi. Studies in Indo-European Linguistics Series, München, Lincom Europa.

Paola, V. and Sanjeev, K. 2003. Transliteration of proper names in cross-language applications. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Pirkola, A. Toivonen, J. Keskustalo, H. Visala, K. and Järvelin, K. 2003. Fuzzy translation of cross-lingual spelling variants. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada.

Platts, John T. 1909. A Grammar of the Hindustani or Urdu Language. Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate Hill, London, E.C.

Platts, John T. 1911. A Dictionary of Urdu, Classical Hindi and English. Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate Hill, London, E.C.

Rahman, Tariq. 2004. Language Policy and Localization in Pakistan: Proposal for a Paradigmatic Shift. Crossing the Digital Divide, SCALLA Conference on Computational Linguistics.

Rai, Alok. 2000. Hindi Nationalism. Orient Longman Private Limited, New Delhi.

Wells, J C. 1995. Computer-coding the IPA: A Proposed Extension of SAMPA. University College London. http://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf.

Yan Qu, Gregory Grefenstette, David A. Evans. 2003. Automatic transliteration for Japanese-to-English text retrieval. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Zia, Khaver. 1999a. Standard Code Table for Urdu. Proceedings of 4th Symposium on Multilingual Information Processing (MLIT-4), Yangon, Myanmar, CICC, Japan.
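The contextual rules of Section 5.2 for Vav (و) and Choti Yeh (ی) can be sketched in plain Python. This is only an illustration of the rules as stated in the text, not the XFST implementation: the function name `classify`, the tiny `CONSONANTS` set and the choice to test the consonant reading before the vowel reading are assumptions made for this sketch, and contexts that the quoted rules do not cover are reported as `None`.

```python
from typing import Optional

ALEF = "\u0627"        # ا
ALEF_MADDA = "\u0622"  # آ
VAV = "\u0648"         # و
CHOTI_YEH = "\u06CC"   # ی
JAZM = "\u0652"        # diacritic marking the absence of a vowel

# Hypothetical, deliberately small consonant set (Meem, Lam, Reh, Heh-Goal, Peh);
# a real system would list the full inventory of Table 3.
CONSONANTS = {"\u0645", "\u0644", "\u0631", "\u06C1", "\u067E"}


def classify(word: str, i: int) -> Optional[str]:
    """Return the UIT symbol for Vav or Choti Yeh at position i of word."""
    ch = word[i]
    if ch not in (VAV, CHOTI_YEH):
        raise ValueError("only Vav and Choti Yeh are handled in this sketch")
    prev = word[i - 1] if i > 0 else None
    nxt = word[i + 1] if i + 1 < len(word) else None
    # Consonant reading: word-initial, or followed by Alef or Alef Madda.
    if prev is None or nxt in (ALEF, ALEF_MADDA):
        return "v" if ch == VAV else "j"
    # Choti Yeh after a consonant: [i] at word end or before Jazm, else [e].
    if ch == CHOTI_YEH and prev in CONSONANTS:
        return "i" if nxt is None or nxt == JAZM else "e"
    return None  # context not covered by the rules quoted in Section 5.2


MERA = "\u0645\u06CC\u0631\u0627"  # [merɑ] (mine), from Table 5
HOLI = "\u06C1\u0648\u0644\u06CC"  # [holi] (slowly), from Table 5
YEH = "\u06CC\u06C1"               # word-initial Choti Yeh

print(classify(MERA, 1), classify(HOLI, 3), classify(YEH, 0))  # prints: e i j
```

In XFST the same effect comes from the order in which the rule transducers are composed; the explicit `if` ordering here is only a stand-in for that.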

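The two transducer properties that Section 6 relies on, reversal and composition, can be illustrated with a small sketch. Python dictionaries stand in for finite-state transducers here, which is a simplification: they handle only context-free, one-to-one symbol mappings. The rows are four entries of Table 2; the helper names `invert` and `compose` are hypothetical and not part of XFST.

```python
# A few rows of Table 2 (aspirated consonants): (Hindi, Urdu, UIT).
TABLE_2 = [
    ("भ", "بھ", "b_h"),
    ("फ", "پھ", "p_h"),
    ("ख", "کھ", "k_h"),
    ("घ", "گھ", "g_h"),
]


def invert(relation: dict) -> dict:
    """Reverse a one-to-one mapping, like applying a transducer backwards."""
    inverse = {}
    for source, target in relation.items():
        if target in inverse:
            raise ValueError(f"mapping is not invertible at {target!r}")
        inverse[target] = source
    return inverse


def compose(first: dict, second: dict) -> dict:
    """Chain two mappings into one that skips the intermediate symbol."""
    return {src: second[mid] for src, mid in first.items() if mid in second}


hindi_to_uit = {hindi: uit for hindi, _, uit in TABLE_2}
urdu_to_uit = {urdu: uit for _, urdu, uit in TABLE_2}

# The Urdu -> UIT mapping, reversed, is the deconverter UIT -> Urdu.
uit_to_urdu = invert(urdu_to_uit)

# Composing Hindi -> UIT with UIT -> Urdu gives a direct Hindi -> Urdu mapping,
# and reversing that gives Urdu -> Hindi without building anything new.
hindi_to_urdu = compose(hindi_to_uit, uit_to_urdu)
urdu_to_hindi = invert(hindi_to_urdu)

print(hindi_to_uit["भ"])  # prints: b_h
```

The `invert` check fails as soon as two source characters share one target, which is exactly the multi-equivalence situation of Table 6; the distinct UIT symbols such as s, s1 and s2 are what keep the Urdu side invertible.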