<<

Punjabi Machine Transliteration Muhammad Ghulam Abbas Malik

To cite this version:

Muhammad Ghulam Abbas Malik. Punjabi Machine Transliteration. 21st international Conference on Computational Linguistics (COLING) and the 44th Annual Meeting of the ACL, Jul 2006, Sydney, France. pp.1137-1144. ￿hal-01002160￿

HAL Id: hal-01002160 https://hal.archives-ouvertes.fr/hal-01002160 Submitted on 15 Jan 2018

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée dépôt et à diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Punjabi Machine Transliteration

M. G. Abbas Malik Department of Linguistics Denis Diderot, University of Paris 7 Paris, France [email protected]

Transliteration refers to phonetic translation Abstract across two languages with different writing sys- tems (Knight & Graehl, 1998), such as Arabic to Machine Transliteration is to transcribe a English (Nasreen & Leah, 2003). Most prior word written in a with approximate work has been done for Machine Translation phonetic equivalence in another lan- (MT) (Knight & Leah, 97; Paola & Sanjeev, guage. It is useful for machine transla- 2003; Knight & Stall, 1998) from English to tion, cross-lingual information retrieval, other major languages of the world like Arabic, multilingual text and speech processing. Chinese, etc. for cross-lingual information re- Punjabi Machine Transliteration (PMT) trieval (Pirkola et al, 2003), for the development is a special case of machine translitera- of multilingual resources (Yan et al, 2003; Kang tion and is a process of converting a word & Kim, 2000) and for the development of cross- from Shahmukhi (based on ) lingual applications. to (derivation of Landa, PMT is a special kind of machine translitera- Shardha and Takri, old scripts of Indian tion. It converts a Shahmukhi word into a Gur- subcontinent), two scripts of Punjabi, ir- mukhi word irrespective of the type constraints respective of the type of word. of the word. It not only preserves the phonetics of the transliterated word but in contrast to usual The Punjabi Machine Transliteration transliteration, also preserves the meaning. System uses transliteration rules (charac- Two scripts are discussed and compared. ter mappings and dependency rules) for Based on this comparison and analysis, character transliteration of Shahmukhi words into mappings between Shahmukhi and Gurmukhi are Gurmukhi. The PMT system can translit- drawn and transliteration rules are discussed. erate every word written in Shahmukhi. Finally, architecture and process of the PMT sys- tem are discussed. When it is applied to Punjabi 1 Introduction encoded text especially designed for testing, the results were complied and analyzed. Punjabi is the mother tongue of more than 110 PMT system will provide basis for Cross- million people of Pakistan (66 million), India (44 Scriptural Information Retrieval (CSIR) and million) and many millions in America, Canada Cross-Scriptural Application Development and Europe. It has been written in two mutually (CSAD). incomprehensible scripts Shahmukhi and Gur- mukhi for centuries. from Pakistan are 2 Punjabi Machine Transliteration unable to comprehend Punjabi written in Gur- mukhi and Punjabis from India are unable to According to Paola (2003), “When writing a for- comprehend Punjabi written in Shahmukhi. In eign name in one’s native language, one tries to contrast, they do not have any problem to under- preserve the way it sounds, .. one uses an or- stand the verbal expression of each other. Pun- thographic representation which, when read jabi Machine Transliteration (PMT) system is an aloud by the native speaker of the language, effort to bridge the written communication gap sounds as it would when spoken by a speaker of between the two scripts for the benefit of the mil- the foreign language – a process referred to as lions of Punjabis around the globe. Transliteration”. Usually, transliteration is re- ferred to phonetic translation of a word of some

1137 Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 1137–1144, Sydney, July 2006. c 2006 Association for Computational Linguistics specific type (proper nouns, technical terms, etc) of types of characters e.g. , , across languages with different writing systems. diacritical marks, etc. Native speakers may not understand the meaning of transliterated word. 4.1 Mapping PMT is a special type of Machine Translitera- Consonants can be further subdivided into two tion in which a word is transliterated across two groups: different writing systems used for the same lan- Aspirated Consonants: There are sixteen as- guage. It is independent of the type constraint of pirated consonants in Punjabi (Malik, 2005). Ten the word. It preserves both the phonetics as well of these aspirated consonants ( [ ], [ ], as the meaning of transliterated word. JJ bʰ JJ pʰ JJ[ṱʰ], JJ[ʈʰ], bY[ʤʰ], bb[ʧʰ], |e[ḓʰ], |e[ɖʰ], ÏÏ[kʰ], 3 Scripts of Punjabi ÏÏ[gʰ]) are very frequently used in Punjabi as 3.1 Shahmukhi compared to the remaining six aspirates (|g[rʰ], Shahmukhi derives its character set form the [ ], [ ], [ ], [ ], [ ]). In Arabic . It is a right-to-left script and the |h ɽʰ Ïà lʰ Jb mʰ JJ nʰ |z vʰ shape assumed by a character in a word is con- Shahmukhi, aspirated consonants are represented text sensitive, i.e. the shape of a character is dif- by the combination of a consonant (to be aspi- ferent depending whether the position of the rated) and HEH-DOACHASHMEE (|). For character is at the beginning, in the middle or at example [ ] + [ ] = [ ] and [ ] + [ ] the end of the word. Normally, it is written in [ b | h JJ bʰ ` ʤ | h Nastalique, a highly complex that = b Y [ʤʰ]. is cursive and context-sensitive. A sentence illus- In Gurmukhi, each frequently used aspirated- trating Shahmukhi is given below: consonant is represented by a unique character. But, less frequent aspirated consonants are repre- ð ÌÌ6=P sented by the combination of a consonant (to be@~ –6ڻ >X}Z Ìáââ y6– ÌÐâ It has 49 consonants, 16 diacritical marks and aspirated) and sub-joined PAIREEN HAAHAA 16 vowels, etc. (Malik 2005) e.g. ਲ [l] + ◌੍ + ਹ [h] = ਲ (Ïà) [lʰ] and ਵ [v] + ◌੍ 3.2 Gurmukhi + ਹ [h] = ਵ (|z) [vʰ], where ◌੍ is the sub-joiner. Gurmukhi derives its character set from old scripts of the Indian Sub-continent i.e. Landa The sub-joiner character (◌੍) tells that the follow- (script of North West), Sharda (script of Kash- ing [h] is going to change the shape of mir) and Takri (script of western Himalaya). It is ਹ PAIREEN HAAHHA. a left-to-right syllabic script. A sentence illustrat- The mapping of ten frequently used aspirated ing Gurmukhi is given below: consonants is given in Table 1.

ਪੰ ਜਾਬੀ ਮੇਰੀ ਮਾਣ ਜੋਗੀ ਮ ਬੋਲੀ ਏ. Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi It has 38 consonants, 10 vowels characters, 9 1 [bʰ] 6 [ʧʰ] symbols, 2 symbols for nasal sounds and 1 JJ ਭ bb ਛ symbol that duplicates the sound of a consonant. 2 JJ [pʰ] ਫ 7 |e [ḓʰ] ਧ (Bhatia 2003, Malik 2005) 3 JJ [ṱʰ] ਥ 8 |e [ɖʰ] ਢ 4 [ ] 9 [ ] 4 Analysis and PMT Rules JJ ʈʰ ਠ ÏÏ kʰ ਖ 5 bY [ʤʰ] ਝ 10 ÏÏ [gʰ] ਘ Punjabi is written in two completely different Table 1: Aspirated Consonants Mapping scripts. One script is right-to-left and the other is left-to-right. One is Arabic based cursive and the The mapping for the remaining six aspirates is other is syllabic. But both of them represent the covered under non-aspirated consonants. phonetic repository of Punjabi. These phonetic Non-Aspirated Consonants: In case of non- sounds are used to determine the relation be- aspirated consonants, Shahmukhi has more con- tween the characters of two scripts. On the basis sonants than Gurmukhi, which follows the one of this idea, character mappings are determined. symbol for one sound principle. On the other hand there are more then one characters for a For the analysis and comparison, both scripts single sound in Shahmukhi. For example, Seh are subdivided into different group on the basis

1138 (_), Seen (k) and Sad (m) represent [s] and [s] Hamza (Y) is a special character and always comes between two vowel sounds as a place has one equivalent in Gurmukhi i.e. Sassaa (ਸ). holder. For example, in õGõ6 W [ɑsɑɪʃ] (comfort), Similarly other characters like ਅ [a], ਤ [ṱ], ਹ [h] Hamza (Y) is separating two vowel sounds Alef (Z) and ਜ਼ [z] have multiple equivalents in Shah- and Zer ( ), in [ ] (come), Hamza ( ) is mukhi. Non-aspirated consonants mapping is G◌ zW ɑ Y given in Table 2. separating two vowel sounds Alef Madda (W) [ɑ] Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi and Vav (z) [o], etc. In the first example õGõ6 W 1 [ [b] ਬ 21 o [ṱ] ਤ [ɑsɑɪʃ] (comfort), Hamza (Y) is separating two 2 \ [p] ਪ 22 p [z] ਜ਼ vowel sounds Alef (Z) and Zer (G◌), but normally 3 ] [ṱ] ਤ 23 q [ʔ] ਅ Zer ( ) is dropped by common people. So 4 ^ [ʈ] ਟ 24 r [ɤ] ਗ਼ G◌ Hamza ( ) is mapped on [ ] when it is followed 5 _ [s] ਸ 25 s [f] ਫ਼ Y ਇ ɪ 6 ` [ʤ] ਜ 26 t [q] by a consonant. In Gurmukhi, vowels are represented by ten 7 a [ʧ] ਚ 27 [k] ਕ independent vowel characters (ਅ, ਆ, ਇ, ਈ, ਉ, 8 b [h] ਹ 28 v [g] ਗ , , , , ) and nine dependent vowel signs 9 c [x] ਖ਼ 29 w [l] ਲ ਊ ਏ ਐ ਓ ਔ e [ḓ] ਦ 30 wؕ [ɭ] ਲ਼ (◌ਾ, ਿ◌, ◌ੀ, ◌ੁ, ◌ੂ, ◌ੇ, ◌ੈ, ◌ੋ, ◌ੌ). When a vowel 10 11 e [ɖ] ਡ 31 x [m] ਮ sound comes at the start of a word or is inde- pendent of some consonant in the middle or end 12 f [z] 32 y [n] ਜ਼ ਨ of a word, independent vowels are used; other- -ɳ] ਣ wise dependent vowel signs are used. The analy] ڻ g [r] ਰ 33 13 14 h [ɽ] ੜ 35 y [ŋ] ◌ਂ sis of vowels is shown in Table 4 and the vowel mapping is given in Table 3. 15 i [z] ਜ਼ 35 z [v] ਵ Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi 16 j [ʒ] ਜ਼ 36 { [h] ਹ 1 FZ [ə] ਅ 11 Z[ə] ਅ,◌ਾ 17 k [s] ਸ 37 | [h] ◌੍ਹ ◌ɑ] ਆ 12 G◌ [ɪ] ਿ] ﺁ 2 18 l [ʃ] ਸ਼ 38 ~ [j] ਯ G [i] ◌ੀ◌ ﯼ GZ [ɪ] ਇ 13 3 19 m [s] ਸ 39 } [j] ਯ ੁ◌ [i] ਈ 14 E◌ [ʊ] اِﯼ 4 20 n [z] ਜ਼

Table 2: Non-Aspirated Consonants Mapping 5 EZ [ʊ] ਉ 15 z E◌ [u] ◌ੂ 6 [ ] 16 [ ] 4.2 Vowel Mapping zEZ u ਊ } e ◌ੇ 7 }Z [e] ਏ 17 } F◌ [æ] ◌ੈ Punjabi contains ten vowels. In Shahmukhi, 8 18 these vowels are represented with help of four }FZ [æ] ਐ z [o] ◌ੋ 9 [o] 19 [Ɔ] long vowels (Alef Madda (W), Alef (Z), Vav (z) and zZ ਓ Fz ◌ੌ 10 [ ] 20 [ ] Choti Yeh ( )) and three short vowels (Arabic zFZ Ɔ ਔ Y ɪ ਇ ~ Table 3: Vowels Mapping Fatha – Zabar (F◌), Arabic Damma – Pesh (E◌) and Arabic Kasra – Zer (G◌)). Note that the last two long vowels are also used as consonants.

1139

Vowel Shahmukhi Gurmukhi Example Represented by Alef Madda (W) in the beginning Represented by ਆ ÌòeW → ਆਦਮੀ [ɑdmi] (man) ɑ of a word and by Alef (Z) in the middle or at the and ◌ਾ 6 z6 → [ʤɑvɳɑ] (go) end of a word. ਜਾਵਣਾ Represented by Alef () in the beginning of a Z Represented by ਅ ə H`Z → ਅੱਜ [ɑʤʤ] (today) word and with Zabar (F◌) elsewhere. in the beginning. Represented by the combinations of Alef (Z) and uOääZ → ਏਧਰ [eḓʰər] (here), Choti Yeh (~) in the beginning; a consonant and Represented by ਏ e Z@ð → ਮੇਰਾ [merɑ] (mine), Choti Yeh (~) in the middle and a consonant and and ◌ ੇ }g66 → ਸਾਰ ੇ [sɑre] (all) Baree Yeh (}) at the end of a word. Represented by the combination of Alef (Z), Za- (F◌) and Choti Yeh (~) in the beginning; a E}FZ → ਐਹ [æh] (this), Represented by ਐ æ consonant, Zabar (F◌) and Choti Yeh (~) in the I‚Fr → [mæl] (dirt), and ਮੈਲ middle and a consonant, Zabar ( ) and Baree ◌ੈ F◌ Fì → ਹੈ [hæ] (is) Yeh (}) at the end of a word.

Represented by the combination of Alef (Z) and Represented by → [ɪkko] (one), Zer (G◌) in the beginning and a consonant and ਇ âH§GZ ਇੱਕੋ ɪ and → [ ] (rain) Zer (G◌) in the middle of a word. It never appears ਿ◌ lGg66 ਬਾਿਰਸ਼ bɑrɪsh at the end of a word. Represented by the combination of Alef (Z), Zer @GZ → ਈਤਰ [iṱər] (mean) (G◌) and Choti Yeh (~) in the beginning; a → [ ] (rich- Represented by ਈ ~@GðZ ਅਮੀਰੀ ɑmiri i consonant, Zer (G◌) and Choti Yeh (~) in the ness), and ◌ੀ middle and a consonant and Choti Yeh (~) at the ÌÌ6=P → ਪੰ ਜਾਬੀ [pənʤɑbi] end of a word (Punjabi) Represented by the combination of Alef (Z) and Represented by → [ʊḓḓhr] (there) Pesh (E◌) in the beginning; a consonant and Pesh ਉ uOHeEZ ਧਰ ʊ and → [ ] (price) (E◌) in the middle of a word. It never appears at ◌ੁ HIEï ਮੁੱਲ mʊll the end of a word. Represented by the combination of Alef (Z), Pesh Represented by → [ʊrḓu] (E◌) and Vav (z) in the beginning, a consonant, ਊ zEegEZ ਉਰਦੂ u and → [ ] (face) Pesh (E◌) and Vav (z) in the middle and at the end ◌ੂ ]gâEß ਸੂਰਤ surṱ of a word. Represented by the combination of Alef () and Z h6JzZ → ਓਛਾੜ [oʧhɑɽ] (cover), Represented by ਓ Vav ( ) in the beginning; a consonant and Vav o z iâðww → [pɽholɑ] (a big and ◌ ੋ ਪੜੋਲਾ (z) in the middle and at the end of a word. pot in which wheat is stored) Represented by the combination of Alef (Z), Za- Represented by → [Ɔɽɑ] (hindrance), bar (F◌) and Vav (z) in the beginning; a ਔ ZhzFZ ਔੜਾ Ɔ and → [ ] (death) consonant, Zabar (F◌) and Vav (z) in the middle ◌ੌ ]âFñ ਮੌਤ mƆṱ and at the end of a word. Note: Where → means ‘its equivalent in Gurmukhi is’.

Table 4: Vowels Analysis of Punjabi for PMT

1140 4.3 Sub-Joins (PAIREEN) of Gurmukhi For example, There are three PAIREEN (sub-joins) in Gur- X}Z ~hâa ~ww uuu E}FZ mukhi, “Haahaa”, “Vaavaa” and “Raaraa” shown in Table 5. For PMT, if HEH-DOACHASHMEE X}Z wi ~hâa ~@ð (|) does come after the less frequently used In the first sentence, the word ~hâa is pronounced aspirated consonants then it is transliterated into as [ʧɔɽi] and it conveys the meaning of ‘wide’. PAIREEN Haahaa. Other PAIREENS are very rare in their usage and are used only in In the second sentence, the word ~hâa is pro- loan words. In present day writings, PAIREEN nounced as [ʧuɽi] and it conveys the meaning of Vaavaa and Raaraa are being replaced by normal ‘bangle’. There should be Zabar (◌F ) after Cheh Vaavaa (ਵ) and Raaraa (ਰ) respectively. (a) and Pesh (◌E ) after Cheh (a) in the first and Sr. PAIREEN Shahmukhi Gurmukhi English second words respectively, to remove the ambi- 1 H JHçEo ਬੁੱਲ Lips guities. It is clear from the above example that dia- 2 R 6–gäs" Moon ਚੰ ਦਮਾ critical marks are essential for removing ambi- Self- 3 Í y6˜FâÎ ਸੈਮਾਨ respect guities, natural language processing and speech synthesis. Table 5: Sub-joins (PAIREEN) of Gurmukhi 4.5 Other Symbols 4.4 Diacritical Marks marks in Gurmukhi are the same as Both in Shahmukhi and Gurmukhi, diacritical marks (dependent vowel signs in Gurmukhi) are in English, except the . (।) and the back bone of the vowel system and are very double DANDA (॥) of Devanagri script are used important for the correct pronunciation and un- for the full stop instead. In case of Shahmukhi, derstanding the meaning of a word. There are these are same as in Arabic. The mapping of dig- sixteen diacritical marks in Shahmukhi and nine its and punctuation marks is given in Table 7. dependent vowel sings in Gurmukhi (Malik, 2005). The mapping of diacritical marks is given Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi in Table 6. 1 0 ੦ 8 7 ੭ Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi 2 1 ੧ 9 8 ੮

1 F◌ [ə] --- 9 F◌ [ɪn] ਿ◌ਨ 3 2 ੨ 10 9 ੯

2 G◌ [ɪ] ਿ◌ 10 H◌ ◌ੱ 4 3 ੩ 11 Ô ,

3 E◌ [ʊ] ◌ੁ 11 W◌ --- 5 4 ੪ 12 ? ? 4 --- 12 --- ; ; Y◌ 6 5 ੫ 13 ؕ 5 [ ] 13 --- F◌ ən ਨ Y◌ 7 6 ੬ 14 X । 6 E◌ [ʊn] ◌ੂਨ 14 G◌ --- Table 7: Other Symbols Mapping 7 E◌ --- 15 --- 4.6 Dependency Rules 8 --- 16 G◌ [ɑ] ◌ਾ Character mappings alone are not sufficient for

Table 6: Diacritical Mapping PMT. They require certain dependency or con- textual rules for producing correct transliteration. Diacritical marks in Shahmukhi are very im- The basic idea behind these rules is the same as portant for the correct pronunciation and under- that of the character mappings. These rules in- standing the meaning of a word. But they are clude rules for aspirated consonants, non- sparingly used in writing by common people. In the normal text of Shahmukhi books, newspa- aspirated consonants, Alef (Z), Alef Madda (W), pers, and magazines etc. one will not find the Vav (z), Choti Yeh (~) etc. Only some of these diacritical marks. The pronunciation of a word rules are discussed here due to space limitations. and its meaning would be comprehended with Rules for Consonants: Shahmukhi conso- the help of the context in which it is used. nants are transliterated into their equivalent

1141 parsing techniques. These words are called Gurmukhi consonants e.g. k → [s]. Any dia- ਸ Shahmukhi Tokens. Then these tokens are given critical mark except Shadda (H◌) is ignored at this to the Transliteration Component. This point and is treated in rules for vowels or in rules component gives each token to the PMT Token Converter for diacritical marks. In Shahmukhi, Shadda ( ) that converts a Shahmukhi Token H◌ into a Gurmukhi Token by using the PMT is placed after the consonant but in Gurmukhi, its Rules Manager, which consists of character equivalent Addak (◌ੱ) is placed before the con- mappings and dependency rules. The PMT To- ken Converter then gives the Gurmukhi To- sonant e.g. + → [ ]. Both Shadda ( ) \ ◌H ◌ੱਪ pp H◌ ken back to the Transliteration Compo- nent and Addak (◌ੱ) double the sound a consonant . When all Shahmukhi Tokens are con- after or before which they are placed. verted into Gurmukhi Tokens, then all Gurmukhi Output Text Gen- This rule is applicable to all consonants in Ta- Tokens are passed to the erator that generates the output Unicode en- ble 1 and 2 except Ain (q), Noon (y), coded Gurmukhi text. The main PMT process is PMT Token Converter Noonghunna (y), Vav (z), Heh Gol ({), done by the and the PMT Rules Manager. Dochashmee Heh (|), Choti Yeh (~) and Baree

Yeh (}). These characters are treated separately. Unicode Encoded Shahmukhi Text Rule for Hamza (Y): Hamza (Y) is a special character of Shahmukhi. Rules for Hamza ( ) are: Input Text Parser Punjabi Machine Transliteration Y System − If Hamza (Y) is followed by Choti Yeh (~), then Shahmukhi Tokens Shahmukhi Token Hamza (Y) and Choti Yeh (~) will be Transliteration PMT Token Converter transliterated into [ ]. Component ਈ i Gurmukhi Token Gurmukhi Tokens − If Hamza (Y) is followed by Baree Yeh (}), PMT Rules Manager

then Hamza ( ) and Baree Yeh ( ) will be Output Text Character Depend- Y } Mappings ency Rules Generator transliterated into ਏ [e]. − If Hamza ( ) is followed by Zer ( ), then Unicode Encoded Y G◌ Gurmukhi Text Hamza (Y) and Zer (G◌) will be transliterated Figure 1: Architecture of PMT System into ਇ [ɪ]. PMT system is a rule based transliteration sys- − If Hamza ( ) is followed by Pesh ( ), then tem and is very robust. It is fast and accurate in Y ◌E its working. It can be used in domains involving Hamza (Y) and Pesh (E◌) will be transliterated Information Communication Technology (web, WAP, instant messaging, etc.). into ਉ [ʊ]. 5.2 PMT Process In all other cases, Hamza (Y) will be transliter- PMT ated into [ ]. The PMT Process is implemented in the ਇ ɪ Token Converter and the PMT Rules Manager. For PMT, each Shahmukhi Token is 5 PMT System parsed into its constituent characters and the 5.1 System Architecture character dependencies are determined on the basis of the occurrence and the contextual The architecture of PMT system and its func- placement of the character in the token. In each tionality are described in this section. The system Shahmukhi Token, there are some characters that architecture of Punjabi Machine Transliteration bear dependencies and some characters are inde- System is shown in figure 1. pendent of such contextual dependencies for Unicode encoded Shahmukhi text input is re- transliteration. If the character under considera- ceived by the Input Text Parser that tion bears a dependency, then it is resolved and parses it into Shahmukhi words by using simple transliterated with the help of dependency rules.

1142 If the character under consideration does not bear is divided into four parts eastern Punjab a dependency, then its transliteration is achieved (Indian Punjab), central Punjab, southern Punjab by character mapping. This is done through map- and northern Punjab. All these geographical re- ping a character of the Shahmukhi token to its gions represent the major dialects of Punjabi. equivalent Gurmukhi character with the help of Hayms of Baba Nanak (eastern Punjab), Heer by character mapping tables 1, 2, 3, 6 and 7, which- (central Punjab), Hayms by Khawaja ever is applicable. In this way, a Shahmukhi To- Farid (southern Punjab) and Saif-ul-Malooq by ken is transliterated into its equivalent Gurmukhi Mian Muhammad Bakhsh (northern Punjab) Token. were selected for the evaluation of PMT system. Consider some input Shahmukhi text S. First it All the above selected texts are categorized as is parsed into Shahmukhi Tokens (S , S … S ). classical literature of Punjabi. In modern litera- 1 2 N th ture, poetry and short stories of different poets Suppose that S = “ ” [ ] is the i Shah- i y63„Zz vɑlejɑ̃ and writers were selected from some issues of Puncham (monthly Punjabi magazine since mukhi Token. S is parsed into characters Vav (z) i 1985) and other published books. All of these [v], Alef (Z) [ɑ], Lam (w) [l], Choti Yeh (~) [j], selected texts were then compiled into Unicode encoded text as none of them were available in Alef (Z) [ɑ] and Noon Ghunna (y) [ŋ]. Then PMT this form before. mappings and dependency rules are applied to The main task after the compilation of all the transliterate the Shahmukhi Token into a Gur- selected texts into Unicode encoded texts is to mukhi Token. The Gurmukhi Token put all necessary diacritical marks in the text. G =“ ” is generated from S . The step by i ਵਾਿਲਆਂ i This is done with help of dictionaries. The accu- step process is clearly shown in Table 8. racy of the PMT system depends upon the neces- sary diacritical marks. Absence of the necessary Character(s) Gurmukhi Sr. Mapping or Rule Applied Parsed Token diacritical marks affects the accuracy greatly.

1 z → ਵ [v] ਵ Mapping Table 4 6.2 Results

2 Z → ◌ਾ [ɑ] ਵਾ Rule for ALEF After the compilation of selected input texts, they are transliterated into Gurmukhi texts by using 3 w → ਲ [l] ਵਾਲ Mapping Table 4 the PMT system. Then the transliterated Gur- 66 → ਿ◌ਆ mukhi texts are tested for errors and accuracy. 4 ਵਾਿਲਆ Rule for YEH [ɪɑ] Testing is done manually with help of dictionar- Rule for ies of Shahmukhi and Gurmukhi by persons who 5 y → [ŋ] ◌ਂ ਵਾਿਲਆਂ NOONGHUNNA know both scripts. The results are given in Table Note: → is read as ‘is transliterated into’. 9. Table 8: Methodology of PMTS Source Total Words Accuracy In this way, all Shahmukhi Tokens are trans- Manuscripts 1,007 98.21 Baba Nanak 3,918 98.47 literated into Gurmukhi Tokens (G1, G2 … Gn). From these Gurmukhi Tokens, Gurmukhi text G Khawaja Farid 2,289 98.25 Waris Shah 14,225 98.95 is generated. Mian Muhammad Bakhsh 7,245 98.52 The important point to be noted here is that Modern lieratutre 16,736 99.39 input Shahmukhi text must contain all necessary Total 45,420 98.95 diacritical marks, which are necessary for the Table 9: Results of PMT System correct pronunciation and understanding the If we look at the results, it is clear that the meaning of the transliterated word. PMT system gives more than 98% accuracy on 6 Evaluation Experiments classical literature and more than 99% accuracy on the modern literature. So PMT system fulfills 6.1 Input Selection the requirement of transliteration across two The first task for evaluation of the PMT system scripts of Punjabi. The only constraint to achieve is the selection of input texts. To consider the this accuracy is that input text must contain all historical aspects, two manuscripts, poetry by necessary diacritical marks for removing ambi- Maqbal (Maqbal) and Heer by Waris Shah guities. (Waris, 1766) were selected. Geographically

1143 7 Conclusion Knight, Kevin; Morgan Kaufmann and Graehl, Jona- than. 1998. Machine Transliteration. In Computa- Shahmukhi and Gurmukhi being the only two tional Linguistics. 24(4): 599 – 612 prevailing scripts for Punjabi expressions en- Malik, M. G. Abbas. 2005. Towards Unicode Com- compass a population of almost 110 million patible Punjabi Character Set. In proceedings of around the globe. PMT is an endeavor to bridge 27th Internationalization and Unicode Conference, the ethnical, cultural and geographical divisions 6 – 8 April, Berlin, Germany between the Punjabi speaking communities. By implementing this system of transliteration, new Maqbal. _âú Gbäæ. Punjabi Manuscript in Oriental Sec- horizons for thought, idea and belief will be tion, Main Library University of the Punjab, shared and the world will gain an impetus on the Quaid-e-Azam Campus, Lahore Pakistan; 7 pages; efforts harmonizing relationships between na- Access # 8773 tions. The large repository of historical, literary Mian Muhammad Bakhsh (Edited by Fareer - and religious work done by generations will now hammad Faqeer). 2000. Saif-ul-Malooq. Al-Faisal be available for easy transformation and critique Pub. Bazar, Lahore for all. The research has future milestone ena- Nasreen AbdulJaleel, Leah S. Larkey. 2003. Statisti- bling PMT system for back machine translitera- cal transliteration for English-Arabic cross lan- tion from Gurmukhi to Shahmukhi. guage information retrieval. In Proceedings of the 12th international conference on information and Reference knowledge management. pp: 139 – 146 Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Paola Virga and Sanjeev Khudanpur. 2003. Translit- Visala, and Kalervo Järvelin. 2003. Fuzzy Transla- eration of proper names in cross-language appli- tion of Cross-Lingual Spelling Variants. In Pro- cations. In Proceedings of the 26th annual interna- ceedings of the 26th annual international ACM tional ACM SIGIR conference on Research and SIGIR conference on Research and development in development in information retrieval. pp: 365 – informaion retrieval. pp: 345 – 352 366 Baba , arranged by Muhammad Asif Rahman Tariq. 2004. Language Policy and Localiza- tion in Pakistan: Proposal for a Paradigmatic Khan. 1998. " HH6 6666 63riW (Sayings of Baba Nanak in Shift. Crossing the Digital Divide, SCALLA Con- Punjabi Shahmukhi). Pakistan Punjabi Adbi Board, ference on Computational Linguistics, 5 – 7 Janu- Lahore ary 2004 Bhatia, Tej K. 2003. The Gurmukhi Script and Other Sung Young Jung, SungLim Hong and Eunok Peak. Writing Systems of Punjab: History, Structure and 2000. An English to Korean transliteration model Identity. International Symposium on Indic Script: of extended markov window. In Proceedings of the Past and future organized by Research Institute for 17th conference on Computational Linguistics. the Languages and Cultures of Asia and Africa and 1:383 – 389 Tokyo University of Foreign Studies, December 17 – 19. pp: 181 – 213 Tanveer Bukhari. 2000. zegEZ ÌÌ6=P ›~Ö. Urdu Science In-Ho Kang and GilChang Kim. 2000. English-to- Board, 299 Uper Mall, Lahore Korean transliteration using multiple unbounded overlapping phoneme chunks. In Proceedings of Waris Shah. 1766. 6J6=Zg @¦. Punjabi Manuscript in Ori- the 17th conference on Computational Linguistics. ental Section, Main Library University of the Pun- 1: 418 – 424 jab, Quaid-e-Azam Campus, Lahore Pakistan; 48 pages; Access # [Ui VI 135/]1443 Khawaja Farid (arranged by Muhammad Asif Khan). " ääGuu EbZâa 63riW (Sayings of Khawaja Farid in Punjabi Waris Shah (arranged by Naseem Ijaz). 1977. 6J6=Zg @¦. Shahmukhi). Pakistan Punjabi Adbi Board, Lahore Lehran, Punjabi Journal, Lahore Knight, K. and Stalls, B. G. 1998. Translating Names Yan Qu, Gregory Grefenstette, David A. Evans. 2003. and Technical Terms in Arabic Tex. Proceedings of Automatic transliteration for Japanese-to-English the COLING/ACL Workshop on Computational text retrieval. In Proceedings of the 26th annual in- Approaches to Semitic Languages ternational ACM SIGIR conference on Research and development in information retrieval. pp: 353 Knight, Kevin and Graehl, Jonathan. 1997. Machine – 360 Transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computational Lin- guistics. pp. 128-135

1144