Punjabi Machine Transliteration Muhammad Ghulam Abbas Malik
Total Page:16
File Type:pdf, Size:1020Kb
Punjabi Machine Transliteration Muhammad Ghulam Abbas Malik To cite this version: Muhammad Ghulam Abbas Malik. Punjabi Machine Transliteration. 21st international Conference on Computational Linguistics (COLING) and the 44th Annual Meeting of the ACL, Jul 2006, Sydney, France. pp.1137-1144. hal-01002160 HAL Id: hal-01002160 https://hal.archives-ouvertes.fr/hal-01002160 Submitted on 15 Jan 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Punjabi Machine Transliteration M. G. Abbas Malik Department of Linguistics Denis Diderot, University of Paris 7 Paris, France [email protected] Transliteration refers to phonetic translation Abstract across two languages with different writing sys- tems (Knight & Graehl, 1998), such as Arabic to Machine Transliteration is to transcribe a English (Nasreen & Leah, 2003). Most prior word written in a script with approximate work has been done for Machine Translation phonetic equivalence in another lan- (MT) (Knight & Leah, 97; Paola & Sanjeev, guage. It is useful for machine transla- 2003; Knight & Stall, 1998) from English to tion, cross-lingual information retrieval, other major languages of the world like Arabic, multilingual text and speech processing. Chinese, etc. for cross-lingual information re- Punjabi Machine Transliteration (PMT) trieval (Pirkola et al, 2003), for the development is a special case of machine translitera- of multilingual resources (Yan et al, 2003; Kang tion and is a process of converting a word & Kim, 2000) and for the development of cross- from Shahmukhi (based on Arabic script) lingual applications. to Gurmukhi (derivation of Landa, PMT is a special kind of machine translitera- Shardha and Takri, old scripts of Indian tion. It converts a Shahmukhi word into a Gur- subcontinent), two scripts of Punjabi, ir- mukhi word irrespective of the type constraints respective of the type of word. of the word. It not only preserves the phonetics of the transliterated word but in contrast to usual The Punjabi Machine Transliteration transliteration, also preserves the meaning. System uses transliteration rules (charac- Two scripts are discussed and compared. ter mappings and dependency rules) for Based on this comparison and analysis, character transliteration of Shahmukhi words into mappings between Shahmukhi and Gurmukhi are Gurmukhi. The PMT system can translit- drawn and transliteration rules are discussed. erate every word written in Shahmukhi. Finally, architecture and process of the PMT sys- tem are discussed. When it is applied to Punjabi 1 Introduction Unicode encoded text especially designed for testing, the results were complied and analyzed. Punjabi is the mother tongue of more than 110 PMT system will provide basis for Cross- million people of Pakistan (66 million), India (44 Scriptural Information Retrieval (CSIR) and million) and many millions in America, Canada Cross-Scriptural Application Development and Europe. It has been written in two mutually (CSAD). incomprehensible scripts Shahmukhi and Gur- mukhi for centuries. Punjabis from Pakistan are 2 Punjabi Machine Transliteration unable to comprehend Punjabi written in Gur- mukhi and Punjabis from India are unable to According to Paola (2003), “When writing a for- comprehend Punjabi written in Shahmukhi. In eign name in one’s native language, one tries to contrast, they do not have any problem to under- preserve the way it sounds, i.e. one uses an or- stand the verbal expression of each other. Pun- thographic representation which, when read jabi Machine Transliteration (PMT) system is an aloud by the native speaker of the language, effort to bridge the written communication gap sounds as it would when spoken by a speaker of between the two scripts for the benefit of the mil- the foreign language – a process referred to as lions of Punjabis around the globe. Transliteration”. Usually, transliteration is re- ferred to phonetic translation of a word of some 1137 Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 1137–1144, Sydney, July 2006. c 2006 Association for Computational Linguistics specific type (proper nouns, technical terms, etc) of types of characters e.g. consonants, vowels, across languages with different writing systems. diacritical marks, etc. Native speakers may not understand the meaning of transliterated word. 4.1 Consonant Mapping PMT is a special type of Machine Translitera- Consonants can be further subdivided into two tion in which a word is transliterated across two groups: different writing systems used for the same lan- Aspirated Consonants: There are sixteen as- guage. It is independent of the type constraint of pirated consonants in Punjabi (Malik, 2005). Ten the word. It preserves both the phonetics as well of these aspirated consonants ( [ ], [ ], as the meaning of transliterated word. JJ bʰ JJ pʰ JJ[ṱʰ], JJ[ʈʰ], bY[ʤʰ], bb[ʧʰ], |e[ḓʰ], |e[ɖʰ], ÏÏ[kʰ], 3 Scripts of Punjabi ÏÏ[gʰ]) are very frequently used in Punjabi as 3.1 Shahmukhi compared to the remaining six aspirates (|g[rʰ], Shahmukhi derives its character set form the [ ], [ ], [ ], [ ], [ ]). In Arabic alphabet. It is a right-to-left script and the |h ɽʰ Ïà lʰ Jb mʰ JJ nʰ |z vʰ shape assumed by a character in a word is con- Shahmukhi, aspirated consonants are represented text sensitive, i.e. the shape of a character is dif- by the combination of a consonant (to be aspi- ferent depending whether the position of the rated) and HEH-DOACHASHMEE (|). For character is at the beginning, in the middle or at example [ ] + [ ] = [ ] and [ ] + [ ] the end of the word. Normally, it is written in [ b | h JJ bʰ ` ʤ | h Nastalique, a highly complex writing system that = b Y [ʤʰ]. is cursive and context-sensitive. A sentence illus- In Gurmukhi, each frequently used aspirated- trating Shahmukhi is given below: consonant is represented by a unique character. But, less frequent aspirated consonants are repre- ð ÌÌ6=P sented by the combination of a consonant (to be@~ –6ڻ >X}Z Ìáââ y6– ÌÐâ It has 49 consonants, 16 diacritical marks and aspirated) and sub-joined PAIREEN HAAHAA 16 vowels, etc. (Malik 2005) e.g. ਲ [l] + ◌੍ + ਹ [h] = ਲ (Ïà) [lʰ] and ਵ [v] + ◌੍ 3.2 Gurmukhi + ਹ [h] = ਵ (|z) [vʰ], where ◌੍ is the sub-joiner. Gurmukhi derives its character set from old scripts of the Indian Sub-continent i.e. Landa The sub-joiner character (◌੍) tells that the follow- (script of North West), Sharda (script of Kash- ing [h] is going to change the shape of mir) and Takri (script of western Himalaya). It is ਹ PAIREEN HAAHHA. a left-to-right syllabic script. A sentence illustrat- The mapping of ten frequently used aspirated ing Gurmukhi is given below: consonants is given in Table 1. ਪੰ ਜਾਬੀ ਮੇਰੀ ਮਾਣ ਜੋਗੀ ਮ ਬੋਲੀ ਏ. Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi It has 38 consonants, 10 vowels characters, 9 1 [bʰ] 6 [ʧʰ] vowel symbols, 2 symbols for nasal sounds and 1 JJ ਭ bb ਛ symbol that duplicates the sound of a consonant. 2 JJ [pʰ] ਫ 7 |e [ḓʰ] ਧ (Bhatia 2003, Malik 2005) 3 JJ [ṱʰ] ਥ 8 |e [ɖʰ] ਢ 4 [ ] 9 [ ] 4 Analysis and PMT Rules JJ ʈʰ ਠ ÏÏ kʰ ਖ 5 bY [ʤʰ] ਝ 10 ÏÏ [gʰ] ਘ Punjabi is written in two completely different Table 1: Aspirated Consonants Mapping scripts. One script is right-to-left and the other is left-to-right. One is Arabic based cursive and the The mapping for the remaining six aspirates is other is syllabic. But both of them represent the covered under non-aspirated consonants. phonetic repository of Punjabi. These phonetic Non-Aspirated Consonants: In case of non- sounds are used to determine the relation be- aspirated consonants, Shahmukhi has more con- tween the characters of two scripts. On the basis sonants than Gurmukhi, which follows the one of this idea, character mappings are determined. symbol for one sound principle. On the other hand there are more then one characters for a For the analysis and comparison, both scripts single sound in Shahmukhi. For example, Seh are subdivided into different group on the basis 1138 (_), Seen (k) and Sad (m) represent [s] and [s] Hamza (Y) is a special character and always comes between two vowel sounds as a place has one equivalent in Gurmukhi i.e. Sassaa (ਸ). holder. For example, in õGõ6 W [ɑsɑɪʃ] (comfort), Similarly other characters like ਅ [a], ਤ [ṱ], ਹ [h] Hamza (Y) is separating two vowel sounds Alef (Z) and ਜ਼ [z] have multiple equivalents in Shah- and Zer ( ), in [ o] (come), Hamza ( ) is mukhi. Non-aspirated consonants mapping is G◌ zW ɑ Y given in Table 2. separating two vowel sounds Alef Madda (W) [ɑ] Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi and Vav (z) [o], etc. In the first example õGõ6 W 1 [ [b] ਬ 21 o [ṱ] ਤ [ɑsɑɪʃ] (comfort), Hamza (Y) is separating two 2 \ [p] ਪ 22 p [z] ਜ਼ vowel sounds Alef (Z) and Zer (G◌), but normally 3 ] [ṱ] ਤ 23 q [ʔ] ਅ Zer ( ) is dropped by common people. So 4 ^ [ʈ] ਟ 24 r [ɤ] ਗ਼ G◌ Hamza ( ) is mapped on [ ] when it is followed 5 _ [s] ਸ 25 s [f] ਫ਼ Y ਇ ɪ 6 ` [ʤ] ਜ 26 t [q] by a consonant. In Gurmukhi, vowels are represented by ten 7 a [ʧ] ਚ 27 u [k] ਕ independent vowel characters (ਅ, ਆ, ਇ, ਈ, ਉ, 8 b [h] ਹ 28 v [g] ਗ , , , , ) and nine dependent vowel signs 9 c [x] ਖ਼ 29 w [l] ਲ ਊ ਏ ਐ ਓ ਔ e [ḓ] ਦ 30 wؕ [ɭ] ਲ਼ (◌ਾ, ਿ◌, ◌ੀ, ◌ੁ, ◌ੂ, ◌ੇ, ◌ੈ, ◌ੋ, ◌ੌ).