A Gurmukhi to Shahmukhi Transliteration System Gurpreet Singh Lehal Department of Computer Science Punjabi University, Patiala, India-147002
Total Page:16
File Type:pdf, Size:1020Kb
A Gurmukhi to Shahmukhi Transliteration System Gurpreet Singh Lehal Department of Computer Science Punjabi University, Patiala, India-147002. [email protected] Abstract of the original word. Transliteration is frequently used in Machine Translation, to create possible Transliteration is a process wherein an input equivalents of unknown words, cross-lingual string in some alphabet is converted to a string in information retrieval systems and in another alphabet, usually based on the phonetics development of multilingual resources. of the original word. It is useful for Machine Transliteration is usually classified into two Translation, to create possible equivalents of directions. Given a pair (s,t) where s is the unknown words, cross-lingual information original word in the source language and t is the retrieval systems and in development of transliterated word in the target language, multilingual resources. Punjabi is one of the forward transliteration is the process of unique languages, which are written in more than phonetically converting s into t, and backward one script. In India, Punjabi is written in transliteration is the process of correctly Gurmukhi script, while in Pakistan it is written in generating s given t(Lin and Chen, 2002). Shahmukhi (Urdu) script. This has created a Backward transliteration is more challenging script wedge as majority of Punjabi speaking than forward transliteration. While forward people in Pakistan cannot read Gurmukhi script, transliteration can accomplish the mapping and similarly the majority of Punjabi speaking through table-lookup, backward transliteration is people in India cannot comprehend Shahmukhi required to disambiguate the noise produced in script. In this paper, we present a high accuracy the forward transliteration and estimate the Gurmukhi to Shahmukhi transliteration system. original word as close as possible. The Gurmukhi We have tried to overcome the shortcomings of to Shahmukhi transliteration is different from the existing Gurmukhi to Shahmukhi standard transliteration schemes in the sense that transliteration systems and developed a system the Gurmukhi word has to be converted to which can which can transliterate any Gurmukhi Shahmukhi with exact spellings, since the text to Shahmukhi at more than 98.6% accuracy language is written in both Gurmukhi and at word level. Shahmukhi. This fact has largely been ignored by existing systems, which have treated the Introduction transliteration problem as forward transliteration and used character mappings and dependency Punjabi is one of the most widely spoken rules to convert Gurmukhi word to Shahmukhi language in Indian sub-continent, with more than without any consideration to spellings. 100 million speakers in India and Pakistan. A In this paper we present a fairly good accuracy unique feature of Punjabi is that it is written in Gurmukhi-Shahmukhi transliteration system, two mutually incomprehensible scripts. In India where we have taken special care to retain the Punjabi language is written in Gurmukhi script, correct spellings in the transliterated Shahmukhi while in Pakistan it is written in Shahmukhi text. (Urdu) script. This has created a script wedge as majority of Punjabi speaking people in Pakistan Gurmukhi and Shahmukhi scripts: a brief cannot read Gurmukhi script, and similarly the overview majority of Punjabi speaking people in India cannot comprehend Shahmukhi script. To break Shahmukhi is a right-to-left script and the shape this script barrier, we have developed a assumed by a character in a word is context transliteration system for transliterating Punjabi sensitive. The Nastalique script, a cursive, right- text written in Gurmukhi script to Shahmukhi. to-left context-sensitive and a highly complex Transliteration is a process wherein an input writing system is normally used for Shahmukhi string in some alphabet is converted to a string in orthography. It has 35 simple consonants, 15 another alphabet, usually based on the phonetics aspirated consonants, one character for nasal "Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009" د sound, 15 diacritical marks, 10 digits and other symbols. Gurmukhi derives its character set from ਦ دھ old scripts of the Indian Sub-continent. It is a ਧ ن ,left-to-right syllabic script. It has 38 consonants 10 vowels characters, 9 vowel symbols, 2 ਨ پ symbols for nasal sounds and 1 symbol that ਪ پھ ,duplicates the sound of a consonant (Malik 2006; Malik, 2005; Jawaid and Ahmed, 2009). ਫ ب Below is the mapping chart for Gurmukhi and its ਬ respective Shahmukhi character(s). بھ Table 1. Gurmukhi to Shahmukhi Mapping ਭ م Gurmukhi Shahmukhi ਮ ا ے ਅ ਯ آ ر ਆ ਰ اِ ل ਇ ਲ ای ل ਈ ਲ਼ اُ و ਉ ਵ ُاو ش ਊ ਸ਼ اے / ص / ث ਏ ਸ س َاے ਐ ه / ح ਹ َاو ਓ ا / ه / ٰ◌ ਾ◌ َاو ਔ ِ◌ ◌ਿ ق/ ک ਕ ی ੀ◌ کھ ਖ ُ◌ ੁ◌ گ ਗ ُ◌و ੂ◌ گھ ਘ ے ੇ◌ نگ ਙ َ◌ے ੈ◌ چ ਚ و ੋ◌ چھ ਛ َ◌و ੌ◌ ج ਜ خ ਖ਼ جھ ਝ غ ਗ਼ نج ਞ ذ / ظ / ض ਜ਼ ٹ ز / ਟ ڑ ٹھ ਠ ੜ ف ڈ ਡ ਫ਼ ن / ں ڈھ ਢ ◌ੰ ّ◌ ن ਣ ◌ੱ ن / ں ت / ط ਤ ◌ਂ تھ ਥ As can be seen from Table 1, there are certain Two more characters ◌ੰ and ◌ਂ are mapped Gurmukhi characters which have multiple But these characters do not . ن and ں to both equivalent Shahmukhi characters. For example, ں pose any problem since the character Similarly, always come at the end of the word. For .ط and ت the character ਤ is mapped to the character is mapped to four Shahmukhi other cases we choose the character having ਜ਼ higher frequency of occurrence. But one .(ض/ظ/ذ/ز) characters obvious problem is that the lesser frequent A statistical analysis on the Shahmukhi corpus characters never get mapped. A statistical was performed to determine the percentage of analysis of corpus reveals that these lesser times the Gurmukhi character was mapped to the frequently occurring similar sounding similar sounding Shahmukhi characters (Table Shahmukhi characters have 1.45% frequency 2). The Shahmukhi characters with highest of occurrence. It was also found that the frequency are selected for default mapping. average word length of a Shahmukhi word is Table 2. Frequency of Occurrence of multiple 3.55, which means that we can roughly mapping characters predict that any rule based system which Gurmukhi Equivalent % Default does not take care of these characters, will Character Shahmukhi Frequency have 1.45*3.55 = 5.15% error rate at word Characters level, due to multiple mappings. Thus to ه %92.12 ه ਹ have a high accuracy system, it is necessary .to solve this problem %7.88 ح س %92.58 س ਸ 2. No exact equivalent mappings in Gurmukhi for some Shahmukhi %7.41 ص characters: There are certain Shahmukhi %1.39 ث ک %91.29 ک hamza) for) ء ain) and) ع ਕ characters such as which there no exact equivalent Gurmukhi %8.71 ق ت %95.49 ت ਤ characters. Though hamza can be generated using character combination rules, but there %4.51 ط ز %62.12 ز ਜ਼ are no specific rules for ain. As for example, :consider the following cases %14.87 ض عورت <- %13.91 ظ ਔਰਤ %9.10 ذ ُ ُشروع <- ਸ਼ੁਰੂ رفيع <- Issues in Gurmukhi-Shahmukhi Transliteration ਰਫ਼ੀ ِالوداع <- ਅਲਿਵਦਾ As already discussed above, the main issue in It is not possible to specify the rules in above Gurmukhi-Shahmukhi transliteration is to examples for generation of ain in Shahmukhi convert the Gurmukhi word to Shahmukhi with words and it depends from case to case. correct spellings. The existing systems for From the statistics, we find that ain has Gurmukhi-Shahmukhi transliteration are mostly 0.42% of frequency of occurrence and thus rule based and thus do not have good any pure rule based system, which cannot transliteration accuracy as they do not check for properly generate ain in Shahmukhi will correct Shahmukhi spellings. Two transliteration further have 0.42*3.55 = 1.49% error rate at systems available on internet have been taken for word level. comparison with our system 3. Missing nukta symbols in Gurmukhi text : (http://www.puran.info/PMT/PMT.aspx; Punjabi language has borrowed many words http://parc.cdac.in/utrans.htm). The main challenges from Arabic, Persian etc. Five consonants ( involved in development of a good accuracy ਸ਼ system are: ਖ਼ ਗ਼ ਜ਼ ਫ਼) were added to the original 35 1. Multiple Mappings : As already discussed characters in Gurmukhi to accommodate the in above section, there are certain Gurmukhi sounds from these languages. As can be characters which are mapped to multiple seen, these characters were created by adding Shahmukhi characters. The five such the nukta(dot) symbol at the feet of the Gurmukhi characters are shown in Table 2. respectively. For transliteration of such existing symbols (ਸ ਖ ਗ ਜ ਫ). But over the words, we have to device special methods to years, the usage of these characters handle these cases. particularly, ਖ਼ ਗ਼ ਜ਼ ਫ਼, has been on decline as 5. Transliteration of proper nouns: The many Punjabi speakers do not make a transliteration of Urdu proper nouns such as distinction between , and . In names of persons and places from Gurmukhi ਖ ਖ਼ ਗ ਗ਼ ਫ ਫ਼ to Shahmukhi poses another challenge. fact from the analysis of Gurmukhi and Many times the spellings of such words in Shahmukhi corpus, we found that these four Shahmukhi are typical and it is not possible characters had combined character frequency to formulate transliteration rules for of occurrence of only 0.17% as compared to generation of such spellings.