
A Gurmukhi to Shahmukhi Transliteration System
Gurpreet Lehal
Department of Computer Science , , India-147002. [email protected]

"Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009"

Abstract

Transliteration is a process wherein an input string in some alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word. It is useful for Machine Translation, to create possible equivalents of unknown words, for cross-lingual information retrieval systems and in the development of multilingual resources. Punjabi is one of the unique languages which are written in more than one script. In India, Punjabi is written in Gurmukhi script, while in Pakistan it is written in Shahmukhi (Urdu) script. This has created a script wedge, as the majority of Punjabi speaking people in Pakistan cannot read Gurmukhi script, and similarly the majority of Punjabi speaking people in India cannot comprehend Shahmukhi script. In this paper, we present a high accuracy Gurmukhi to Shahmukhi transliteration system. We have tried to overcome the shortcomings of the existing Gurmukhi to Shahmukhi transliteration systems and developed a system which can transliterate any Gurmukhi text to Shahmukhi at more than 98.6% accuracy at word level.

Introduction

Punjabi is one of the most widely spoken languages in the Indian sub-continent, with more than 100 million speakers in India and Pakistan. A unique feature of Punjabi is that it is written in two mutually incomprehensible scripts. In India Punjabi is written in Gurmukhi script, while in Pakistan it is written in Shahmukhi (Urdu) script. This has created a script wedge, as the majority of Punjabi speaking people in Pakistan cannot read Gurmukhi script, and similarly the majority of Punjabi speaking people in India cannot comprehend Shahmukhi script. To break this script barrier, we have developed a transliteration system for transliterating Punjabi text written in Gurmukhi script to Shahmukhi.

Transliteration is a process wherein an input string in some alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word. Transliteration is frequently used in Machine Translation, to create possible equivalents of unknown words, in cross-lingual information retrieval systems and in the development of multilingual resources. Transliteration is usually classified into two directions. Given a pair (s,t) where s is the original word in the source language and t is the transliterated word in the target language, forward transliteration is the process of phonetically converting s into t, and backward transliteration is the process of correctly generating s given t (Lin and Chen, 2002). Backward transliteration is more challenging than forward transliteration. While forward transliteration can accomplish the mapping through table-lookup, backward transliteration is required to disambiguate the noise produced in the forward transliteration and estimate the original word as closely as possible.

The Gurmukhi to Shahmukhi transliteration is different from standard transliteration schemes in the sense that the Gurmukhi word has to be converted to Shahmukhi with exact spellings, since the language is written in both Gurmukhi and Shahmukhi. This fact has largely been ignored by existing systems, which have treated the transliteration problem as forward transliteration and used character mappings and dependency rules to convert a Gurmukhi word to Shahmukhi without any consideration of spellings. In this paper we present a fairly good accuracy Gurmukhi-Shahmukhi transliteration system, where we have taken special care to retain the correct spellings in the transliterated Shahmukhi text.

Gurmukhi and Shahmukhi scripts: a brief overview

Shahmukhi is a right-to-left script and the shape assumed by a character in a word is context sensitive. The Nastalique script, a cursive, right-to-left, context-sensitive and highly complex script, is normally used for Shahmukhi orthography. It has 35 simple consonants, 15 aspirated consonants, one character for nasal sound, 15 diacritical marks, 10 digits and other symbols. Gurmukhi derives its character set from old scripts of the Indian Sub-continent. It is a left-to-right syllabic script. It has 38 consonants, 10 vowel characters, 9 vowel symbols, 2 symbols for nasal sounds and 1 symbol that duplicates the sound of a consonant (Malik 2006; Malik, 2005; Jawaid and Ahmed, 2009). Below is the mapping chart for Gurmukhi and its respective Shahmukhi character(s).

Table 1. Gurmukhi to Shahmukhi Mapping

Gurmukhi    Shahmukhi
ਅ    ا
ਆ    آ
ਇ    اِ
ਈ    ای
ਉ    اُ
ਊ    اُو
ਏ    اے
ਐ    اَے
ਓ    اَو
ਔ    اَو
ਕ    ک / ق
ਖ    کھ
ਗ    گ
ਘ    گھ
ਙ    نگ
ਚ    چ
ਛ    چھ
ਜ    ج
ਝ    جھ
ਞ    نج
ਟ    ٹ
ਠ    ٹھ
ਡ    ڈ
ਢ    ڈھ
ਣ    ن
ਤ    ت / ط
ਥ    تھ
ਦ    د
ਧ    دھ
ਨ    ن
ਪ    پ
ਫ    پھ
ਬ    ب
ਭ    بھ
ਮ    م
ਯ    ے
ਰ    ر
ਲ    ل
ਲ਼    ل
ਵ    و
ਸ਼    ش
ਸ    س / ص / ث
ਹ    ه / ح
◌ਾ    ا / ه / ◌ٰ
◌ਿ    ◌ِ
◌ੀ    ی
◌ੁ    ◌ُ
◌ੂ    ◌ُو
◌ੇ    ے
◌ੈ    ◌َے
◌ੋ    و
◌ੌ    ◌َو
ਖ਼    خ
ਗ਼    غ
ਜ਼    ز / ذ / ظ / ض
ੜ    ڑ
ਫ਼    ف
◌ੰ    ن / ں
◌ੱ    ◌ّ
◌ਂ    ن / ں

As can be seen from Table 1, there are certain Gurmukhi characters which have multiple equivalent Shahmukhi characters. For example, the character ਤ is mapped to ت and ط. Similarly, the character ਜ਼ is mapped to four Shahmukhi characters (ز / ذ / ظ / ض). A statistical analysis of the Shahmukhi corpus was performed to determine the percentage of times each Gurmukhi character was mapped to the similar sounding Shahmukhi characters (Table 2). The Shahmukhi characters with the highest frequency are selected for default mapping.

Table 2. Frequency of Occurrence of multiple mapping characters

Gurmukhi Character    Equivalent Shahmukhi Characters    % Frequency    Default
ਹ    ه    92.12%    ه
      ح    7.88%
ਸ    س    92.58%    س
      ص    7.41%
      ث    1.39%
ਕ    ک    91.29%    ک
      ق    8.71%
ਤ    ت    95.49%    ت
      ط    4.51%
ਜ਼    ز    62.12%    ز
      ض    14.87%
      ظ    13.91%
      ذ    9.10%

Issues in Gurmukhi-Shahmukhi Transliteration

As already discussed above, the main issue in Gurmukhi-Shahmukhi transliteration is to convert the Gurmukhi word to Shahmukhi with correct spellings. The existing systems for Gurmukhi-Shahmukhi transliteration are mostly rule based and thus do not have good transliteration accuracy, as they do not check for correct Shahmukhi spellings. Two transliteration systems available on the internet have been taken for comparison with our system (http://www.puran.info/PMT/PMT.aspx; http://parc.cdac.in/utrans.htm). The main challenges involved in the development of a good accuracy system are:

1. Multiple Mappings: As already discussed in the section above, there are certain Gurmukhi characters which are mapped to multiple Shahmukhi characters. The five such Gurmukhi characters are shown in Table 2. Two more characters, ◌ੰ and ◌ਂ, are mapped to both ں and ن, but these characters do not pose any problem since the character ں always comes at the end of the word. For the other cases we choose the character having the higher frequency of occurrence. One obvious problem is that the less frequent characters never get mapped. A statistical analysis of the corpus reveals that these less frequently occurring similar sounding Shahmukhi characters have a 1.45% frequency of occurrence. It was also found that the average word length of a Shahmukhi word is 3.55, which means that we can roughly predict (multiplying the per-character rate by the average number of characters per word) that any rule based system which does not take care of these characters will have a 1.45*3.55 = 5.15% error rate at word level due to multiple mappings. Thus, to have a high accuracy system, it is necessary to solve this problem.

2. No exact equivalent mappings in Gurmukhi for some Shahmukhi characters: There are certain Shahmukhi characters such as ع (ain) and ء (hamza) for which there are no exact equivalent Gurmukhi characters. Though hamza can be generated using character combination rules, there are no specific rules for ain. As an example, consider the following cases:
ਔਰਤ -> عورت
ਸ਼ੁਰੂ -> شُرُوع
ਰਫ਼ੀ -> رفيع
ਅਲਵਿਦਾ -> اِلوداع
It is not possible to specify rules for the generation of ain in the Shahmukhi words in the above examples; it depends on the individual case. From the statistics, we find that ain has a 0.42% frequency of occurrence, and thus any pure rule based system which cannot properly generate ain in Shahmukhi will have a further 0.42*3.55 = 1.49% error rate at word level.

3. Missing nukta symbols in Gurmukhi text: The Punjabi language has borrowed many words from Arabic, Persian etc. Five consonants (ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼) were added to the original 35 characters in Gurmukhi to accommodate the sounds from these languages. As can be seen, these characters were created by adding the nukta (◌਼) symbol at the feet of the existing symbols (ਸ ਖ ਗ ਜ ਫ). But over the years, the usage of these characters, particularly ਖ਼ ਗ਼ ਜ਼ ਫ਼, has been on the decline, as many Punjabi speakers do not make a distinction between ਖ and ਖ਼, ਗ and ਗ਼, and ਫ and ਫ਼. In fact, from the analysis of the Gurmukhi and Shahmukhi corpora, we found that these four characters had a combined character frequency of occurrence of only 0.17%, as compared to 1.35% for their counterparts in Shahmukhi. The result is that most of the words in Gurmukhi are now written without the nukta symbol. The symbol ਸ਼ is an exception. As an example, we found that the word ਫ਼ਕੀਰ occurs 69 times in the Gurmukhi corpus while the word ਫਕੀਰ occurs 147 times. It is interesting to see that even though ਫ਼ਕੀਰ is the correct spelling, the word ਫਕੀਰ has the higher frequency of occurrence. The reader can easily make out the word even if it is written without nukta, but if that word is converted to Shahmukhi using character to character based mapping, it results in wrong spellings. So ਫਕੀਰ will be transliterated to پِھکير in Shahmukhi, which is wrong, while the actual transliteration is فقير, which is obtained if the correct Gurmukhi spelling ਫ਼ਕੀਰ is used. This puts a constraint on the system: either the user should supply the correct Gurmukhi spellings for nukta characters, or the system should automatically correct such errors, if proper spellings in Shahmukhi are to be generated.

4. Difference between pronunciation and orthography: In certain cases, the Gurmukhi words are written with short vowels, while they are pronounced with long vowels. The equivalent words in Shahmukhi are also written with long vowels, and so a rule based mapping system which converts those short vowels in Gurmukhi to Shahmukhi gives wrong results in such cases. Some examples of such words are ਗੁਰੂ, ਬਿਮਾਰ and ਖ਼ੁਰਾਕ. They are pronounced as ਗੂਰੂ, ਬੀਮਾਰ and ਖ਼ੂਰਾਕ but written with short vowels, while the corresponding words in Shahmukhi are written with long vowels as گورو, بيمار and خوراک respectively. For transliteration of such words, we have to devise special methods to handle these cases.

5. Transliteration of proper nouns: The transliteration of Urdu proper nouns such as names of persons and places from Gurmukhi to Shahmukhi poses another challenge. Many times the spellings of such words in Shahmukhi are typical and it is not possible to formulate transliteration rules for the generation of such spellings. As an example, consider the Gurmukhi words ਅਬਦੁੱਲਾ, ਬੁਸ਼ਰਾ, ਰਹਿਮਾਨ and ਹੈਦਰਾਬਾਦ, which are written in Shahmukhi as عبدﷲ, بُشریٰ, رحمٰن and حيدرآباد, but which will yield wrong results if transliterated using the usual rule based mapping system.

Gurmukhi-Shahmukhi Transliteration System Architecture

The system architecture of the Gurmukhi-Shahmukhi transliteration system developed by us is displayed in Fig. 1. The system has been designed to take care of the issues raised in the last section. As can be seen from Fig. 1, the complete system is divided into 3 stages: pre-processing, processing and post-processing.

Figure 1: System architecture of Gurmukhi-Shahmukhi Transliteration System
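Issues 3 and 4 above both come down to rewriting the Gurmukhi word before any character mapping takes place. A minimal sketch of such a pre-processing step is given below. It assumes the corrections are available as plain lookup tables: the two small dictionaries are illustrative stand-ins (filled only with examples quoted in this paper) for the Gurmukhi spell checker and the pronunciation database, and the names `NUKTA_FIXES`, `PRONUNCIATION_FORMS` and `preprocess` are hypothetical, not taken from the actual system.

```python
# Illustrative stand-in for the Gurmukhi spell checker: maps a spelling
# with a missing nukta to the corrected spelling (example from issue 3).
NUKTA_FIXES = {
    "ਫਕੀਰ": "ਫ਼ਕੀਰ",
}

# Illustrative stand-in for the pronunciation database: maps a word written
# with short vowels to the long-vowel form (examples from issue 4).
PRONUNCIATION_FORMS = {
    "ਗੁਰੂ": "ਗੂਰੂ",
    "ਬਿਮਾਰ": "ਬੀਮਾਰ",
    "ਖ਼ੁਰਾਕ": "ਖ਼ੂਰਾਕ",
}


def preprocess(word: str) -> str:
    """Correct nukta spellings first, then normalize for pronunciation.

    Words that appear in neither table are returned unchanged.
    """
    word = NUKTA_FIXES.get(word, word)
    return PRONUNCIATION_FORMS.get(word, word)
```

Calling `preprocess` on ਫਕੀਰ returns ਫ਼ਕੀਰ, and on ਗੁਰੂ returns ਗੂਰੂ. A real spell checker would work from root words and inflections rather than a fixed word list, so this only illustrates the order of the two corrections.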

In the pre-processing stage, the Gurmukhi word is cleaned and prepared for transliteration by normalizing the Gurmukhi word according to the Shahmukhi spellings and pronunciation. In the processing stage, the normalized Gurmukhi word is converted to Shahmukhi using a rule based system. The proper names and words with typical spellings are transliterated using a Gurmukhi-Shahmukhi lookup dictionary. In the post-processing stage, the spellings of the transliterated Shahmukhi word are corrected by using a Shahmukhi corpus.

Lexical Resources Used

The following lexical resources have been used:

Resource    Details
Gurmukhi spell checker    Root words: 41,253
Gurmukhi-Shahmukhi dictionary    Terms: 10,254
Shahmukhi Corpus    Total words: 97,63,294; Unique words: 1,93,679

The details of the three stages are:

1. Pre-Processing: The input to this stage is a Gurmukhi word in Unicode format. As already discussed above, the Gurmukhi word may be wrongly spelled because of the nukta related characters. These spelling errors may not appear so serious to Gurmukhi readers, but when transliterated as such they generate wrong Shahmukhi spellings. To solve this problem, the Gurmukhi word is sent to a Gurmukhi spell checker, which automatically corrects the nukta related errors. Thus the word ਗਜਲ will get converted to ਗ਼ਜ਼ਲ after being fed to the Gurmukhi spell checker. To handle the problem of variation in pronunciation and orthography of some Gurmukhi words, we have created a database of all such words along with their inflections. The Gurmukhi word is checked in this database and, if found, gets converted to the appropriate form. Thus the word ਗੁਰੂਆਂ gets converted to ਗੂਰੂਆਂ. At the end of this stage the Gurmukhi word, corrected for spellings and normalised for pronunciation based errors, is generated. Thus for the input word ਖੁਸ਼ੀ, the output word will be ਖ਼ੂਸ਼ੀ after spell check and normalization.

2. Processing: In this stage the normalized word generated in the pre-processing stage is converted to Shahmukhi by using letter to letter conversion rules based on the mappings in Table 1 and Table 2. The Gurmukhi word is broken into its constituent characters and each Gurmukhi character is replaced by the corresponding Shahmukhi character. For multiple mappings, the default character, with the highest frequency of occurrence as mentioned in Table 2, is selected. As an example, the word ਕਮਰੇ will be converted to Shahmukhi as follows:
ਕ + ਮ + ਰ + ◌ੇ -> ک + م + ر + ے = کمرے
If two vowels in Gurmukhi come together, then the character hamza is placed in between them in Shahmukhi. For example, for ਕੋਈ:
ਕੋ + ਈ -> کو + ئی
and similarly for ਆਓ:
ਆ + ਓ -> آ + ئو
Besides these simple mapping rules, some special pronunciation based rules have also been developed. It is not possible to lay down all the rules, but some of these rules are:
ਇ + ਆ -> يا (ਲਾਇਆ -> لايا)
◌ਿ + ਓ -> يو (ਵਾਲਿਓ -> واليو)
◌ੰ + ਪ -> مپ (ਪੰਪ -> پمپ)
◌ੰ + ਨ -> نّ (ਸੁੰਨ -> سُنّ)
Though the above rules work fine for most of the common words, they do not give proper results in the following cases:
1. In case of multiple mappings of a Gurmukhi character to Shahmukhi characters, these rules may not map the correct Shahmukhi character.
2. There are certain Shahmukhi characters like ع (ain) and khadi zabar for which there are no exact equivalent Gurmukhi characters. The above rules will not be able to automatically generate these characters in the final Shahmukhi output.
3. These rules many times fail to transliterate Urdu proper nouns.
The multiple mapping problem is handled in the post-processing stage, while for handling the zero mappings and transliterating proper nouns and other typical spellings, we create a separate database of Gurmukhi-Shahmukhi words. Some sample Gurmukhi words with typical Shahmukhi spellings stored in the database are:
ਫ਼ਤਵਾ    فتویٰ
ਖ਼ਸੂਸਨ    خصوصاً
ਸ਼ੁਰੂ    شروع
ਜ਼ਿਲ੍ਹਾ    ضلع
ਮੌਕਾ    موقع
ਇੱਤਲਾਅ    اطلاع
The Gurmukhi word to be transliterated is first searched in this database. If the word is found, then it is directly converted to Shahmukhi, else it is converted using the above rule based mapping.

3. Post-Processing: This stage is primarily used to correct the spellings of the Shahmukhi words generated in the processing stage. The major source of spelling errors in the transliterated Shahmukhi words is the multiple character mapping in Shahmukhi. For example, the words ਸਲਾਹ, ਮਜ਼ਬੂਤ, ਕਿਸਮ and ਮਤਲਬ will be transliterated as سلاه, مزبوت, کِسم and متلب, while the actual spellings are صلاح, مضبوط, قِسم and مطلب respectively. To automatically correct the spellings, we have used a Shahmukhi corpus. For the Gurmukhi characters with multiple Shahmukhi mappings, word forms using all the possible mappings are generated and the word with the highest frequency of occurrence in the Shahmukhi corpus is selected. As an example, consider the word ਤਾਕਤ. Both the characters ਤ and ਕ have multiple Shahmukhi mappings. The Shahmukhi word generated in the processing stage is تاکت. From this word, all its forms are generated as follows:
تاکت, تاقت, تاکط, تاقط, طاکت, طاقت, طاکط, طاقط
A search for each of these words in the Shahmukhi corpus reveals that while the word طاقت has 2045 occurrences, none of the other forms has a single occurrence. Thus the word ਤਾਕਤ is transliterated to طاقت.

To illustrate how the Gurmukhi word is transformed in each of the three stages, we take the following sample sentence in Gurmukhi. The words which get modified in the next stage are highlighted in bold.

ਪੁਲਿਸ ਨੂੰ ਮੂਸਾ ਖਾਨ ਬਿਮਾਰ ਸਿਹਤ ਅਤੇ ਜਖਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ

After Gurmukhi spell checking in the pre-processing stage, the sentence becomes

ਪੁਲਿਸ ਨੂੰ ਮੂਸਾ ਖ਼ਾਨ ਬਿਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ

The text after normalization becomes

ਪੂਲੀਸ ਨੂੰ ਮੂਸਾ ਖ਼ਾਨ ਬੀਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ

This normalized text is sent to the processing stage. In the processing stage, each word is first checked for typical spellings in the Gurmukhi-Shahmukhi dictionary. If found, it is directly changed to Shahmukhi, else it is transliterated using the mapping rules. Since the word ਮੂਸਾ has typical spellings, the output after dictionary lookup is:

ਪੂਲੀਸ ਨੂੰ مُوسیٰ ਖ਼ਾਨ ਬੀਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ

The output from the processing stage after rule based transliteration is:

پوليس نوں مُوسیٰ خان بيمار سِہت اتے زخمی ہالت وِچ مِلی

The final output after running the spell checker in the post-processing stage is:

پوليس نوں مُوسیٰ خان بيمار صِحت اَتے زخمی حالت وِچ مِلی

The outputs we got from the other existing systems are given below. The wrongly transliterated words are highlighted in red colour.

پُلِس نُون مُوسا کھان بِمار سِہت اتے جکھمی ہالت وِچ مِل اي
http://www.puran.info/PMT/PMT.aspx

پُلِس نُوں مُوسا کھان بِمار سيھت اتے جکھمي ھالت وِچ مِليا
http://parc.cdac.in/utrans.htm
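The processing and post-processing steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system's actual code: `CANDIDATES` holds only a small excerpt of Tables 1 and 2, `TYPICAL_SPELLINGS` holds one sample entry from the dictionary list, and `CORPUS_FREQ` contains only the one corpus count reported in the text (2045 for طاقت), with every other form treated as unseen. All function and variable names are hypothetical, and the hamza and pronunciation based rules are left out.

```python
from itertools import product

# Candidate Shahmukhi characters for each Gurmukhi character, with the
# default (most frequent, per Table 2) listed first. Excerpt only.
CANDIDATES = {
    "ਤ": ["ت", "ط"],
    "ਕ": ["ک", "ق"],
    "ਾ": ["ا"],
    "ਮ": ["م"],
    "ਰ": ["ر"],
    "ੇ": ["ے"],
}

# One sample entry standing in for the Gurmukhi-Shahmukhi dictionary.
TYPICAL_SPELLINGS = {"ਮੌਕਾ": "موقع"}

# Stand-in for Shahmukhi corpus counts; forms not listed count as 0.
CORPUS_FREQ = {"طاقت": 2045}


def default_form(word):
    """Processing stage: map each character to its default equivalent."""
    return "".join(CANDIDATES[ch][0] for ch in word)


def all_forms(word):
    """All spellings obtainable from the alternative mappings.

    The first element is always the default form, because product()
    starts with the first candidate of every character.
    """
    return ["".join(p) for p in product(*(CANDIDATES[ch] for ch in word))]


def transliterate(word, corpus_freq=CORPUS_FREQ):
    """Dictionary lookup first, then mapping plus corpus-based selection."""
    if word in TYPICAL_SPELLINGS:
        return TYPICAL_SPELLINGS[word]
    # max() returns the first maximal element, so the default form is
    # kept when none of the generated forms occurs in the corpus.
    return max(all_forms(word), key=lambda form: corpus_freq.get(form, 0))
```

For ਤਾਕਤ, `all_forms` yields the eight spellings listed above and `transliterate` returns طاقت, while ਕਮਰੇ has only one possible form, کمرے. Note that the number of forms grows multiplicatively with the number of ambiguous characters in a word; with an average word length of 3.55 this stays small in practice.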

Another sample sentence when transliterated by the three systems gave the following output:

Gurmukhi: ਸ਼ਾਹ ਆਲਮ ਕੈਂਪ ਵਿਚ ਦਿਨ ਤਾਂ ਕਿਸੇ ਨਾ ਕਿਸੇ ਤਰ੍ਹਾਂ ਗੁਜ਼ਰ ਜਾਂਦੇ ਹਨ ਪਰ ਰਾਤਾਂ ਕਿਆਮਤ ਦੀਆਂ ਹੁੰਦੀਆਂ ਹਨ।

Our system: شاه عالم کيمپ وچ دن تاں کسے نہ کسے طرحاں گزر جاندے ہن پر راتاں قيامت دياں ہندياں ہن۔

Puran: ساه آلم کَينپ وِچ دِن تاں کِسے نا کِسے تر੍ہاں گجر جاندے ہن پر راتاں کيامت دِياں ہُندِياں ہن ۔

Utrans: شاه الم کَينپ وِچ دِن تاں کِسے نا کِسے ترھاں گُزر جاندے ھن پر راتاں کيامت دياں ھُندياں ھن ۔

Experimental Results

We have tested our system on 121 pages of text compiled from newspapers, books and poetry. The results are compared with Puran (http://www.puran.info/PMT/PMT.aspx) and Utrans (http://parc.cdac.in/utrans.htm), the two transliteration systems available on the net (Table 3).

Table 3. Transliteration Accuracy

System    Transliteration Accuracy
Utrans    83.45%
Puran    84.92%
Our System    98.6%

We observed that the main sources of improvement in the transliteration accuracy over the existing systems have been:
1. Pre-processing stage, wherein the wrong Gurmukhi spellings are corrected and the spellings of some of the words are modified according to their pronunciation.
2. Development of transliteration rules for special cases.
3. Usage of a Gurmukhi-Shahmukhi dictionary.
4. Correction of Shahmukhi spellings with the help of a Shahmukhi corpus.
All these have been implemented for the first time in a Gurmukhi-Shahmukhi transliteration system and have resulted in the development of a good accuracy transliteration system.

Failure Cases

The system fails for the following cases:
1. Words with typical spellings, which are not present in the Gurmukhi-Shahmukhi dictionary and the Shahmukhi corpus.
2. Gurmukhi words with multiple spellings in Shahmukhi. For example, the Gurmukhi word ਕਤਰਾ can get mapped to both قطره and کترا. Similarly the word ਅਰਬ has two forms in Shahmukhi, عرب and ارب. The correct spellings can be selected only after context analysis.

Conclusion

In this paper we have presented a Gurmukhi to Shahmukhi Transliteration system with around 98.6% word accuracy. We have tried to overcome the shortcomings of the existing rule based Gurmukhi to Shahmukhi Transliteration systems. The various challenges such as multiple/zero character mappings, variations in pronunciation and orthography, and transliteration of proper nouns have been handled by generating special rules and using various lexical resources such as a Gurmukhi spell checker, a Shahmukhi corpus and a Gurmukhi-Shahmukhi transliteration dictionary.

The author would like to acknowledge the web support provided by Tejinder Singh Saini and the linguistic support provided by Abdur Rashid, Anwar Chirag and Mohammad Sadiq.

Reference

Wei-Hao Lin and Hsin-Hsi Chen, “Backward Machine Transliteration by Learning Phonetic Similarity”, Proceedings of the 6th Conference on Natural Language Learning, Volume 20, pp. 1-7 (2002).
M.G.A. Malik, “Punjabi Machine Transliteration”, Proceedings of the 21st International Conference on Computational Linguistics, pp. 1137–1144 (2006).
M.G.A. Malik, “Towards Unicode Compatible Punjabi Character Set”, Proceedings of the 27th Internationalization and Unicode Conference, Berlin, Germany (2005).
Bushra Jawaid and Tafseer Ahmed, “Hindi to Urdu Conversion: Beyond Simple Transliteration”, Proceedings of the Conference on Language & Technology, Lahore, pp. 24-31 (2009).
http://www.puran.info/PMT/PMT.aspx
http://parc.cdac.in/utrans.htm