
Shahmukhi to Gurmukhi Transliteration System

Tejinder Saini, ACTDPL, Punjabi University, Patiala 147 002, India
Gurpreet Singh Lehal, DCS, Punjabi University, Patiala 147 002, India
Virinder S Kalra, Sociology, SOSS, University of Manchester

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.

Abstract

The existence of two scripts for Punjabi has created a script barrier between the Punjabi literature written in India and in Pakistan. This research has developed a new system, the first of its kind, for Shahmukhi text without diacritical marks. The proposed system for Shahmukhi to Gurmukhi transliteration has been implemented with various research techniques based on language corpora. Corpus analysis of both scripts is performed to generate statistical data of different types, such as character, word and bi-gram frequencies. This statistical analysis is used in different phases of transliteration. Potentially, all members of the substantial Punjabi community will benefit vastly from this transliteration system.

1 Introduction

One of the great challenges before Information Technology is to overcome language barriers across the whole of humanity so that everyone can communicate with everyone else on the planet in real time. South Asia is one of those unique parts of the world where a single language is written in different scripts. This is the case, for example, with Punjabi, spoken by tens of millions of people but written in Indian East Punjab (20 million) in Gurmukhi script (a Left to Right script based on Devanagari), in Pakistani West Punjab (80 million) in Shahmukhi script (a Right to Left script based on Arabic), and by a growing number of Punjabis (2 million) in the EU and the US in the Roman script. Whilst in speech the Punjabi spoken in the Eastern and the Western parts is mutually comprehensible, in the written form it is not so. The existence of two scripts for Punjabi has created a barrier between the Punjabi literature written in India and in Pakistan. More than 60 per cent of Punjabi literature of the medieval period (500-1450 AD) is available in Shahmukhi script only, while most modern Punjabi writings are in Gurmukhi. Potentially, all members of the substantial Punjabi community will benefit vastly from this transliteration system.

1.1 Gurmukhi Script

The Gurmukhi script, derived from the Sharada script and standardised by Guru Angad Dev in the 16th century, was designed to write the Punjabi language. The meaning of "Gurmukhi" is literally "from the mouth of the Guru". The Gurmukhi script has forty one letters, including thirty eight consonants and three basic vowel sign bearers. There are five nasal consonants (ਙ, ਞ, ਣ, ਨ, ਮ) and two additional nasalization signs, bindi ◌ਂ [ɲ] and tippi ◌ੰ [ɲ]. In addition to this, there are nine dependent vowel signs (◌ੁ [ʊ], ◌ੂ [u], ◌ੋ [o], ◌ਾ [ɘ], ◌ਿ [ɪ], ◌ੀ [i], ◌ੇ [e], ◌ੈ [æ], ◌ੌ [Ɔ]) used to create ten independent vowels with three bearer characters: Ura ੳ [ʊ], Aira ਅ [ə] and Iri ੲ [ɪ].

1.2 Shahmukhi Script

The meaning of "Shahmukhi" is literally "from the King's mouth". Shahmukhi is a local variant of the Urdu script used to record the Punjabi language. It is based on the right to left Nastalique style of the Persian and Arabic script. It has thirty seven simple consonants, eleven frequently used aspirated consonants, four long vowels and three short vowel symbols (Malik 2006).
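The writing directions of the two scripts are also recorded in the Unicode character database, which a transliteration pipeline can query. The following is a minimal sketch using Python's standard `unicodedata` module; the sample letters and the helper name `describe` are illustrative choices for this sketch, not part of the system described in the paper.

```python
import unicodedata

# Illustrative sample characters (chosen for this sketch only).
GURMUKHI_GA = "\u0A17"   # a Gurmukhi consonant
ARABIC_GAF = "\u06AF"    # an Arabic-script consonant, as used in Shahmukhi
BINDI = "\u0A02"         # Gurmukhi nasalization sign bindi
TIPPI = "\u0A70"         # Gurmukhi nasalization sign tippi


def describe(ch: str) -> tuple:
    """Return (Unicode name, bidirectional class, general category)."""
    return (
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
        unicodedata.category(ch),
    )


if __name__ == "__main__":
    for ch in (GURMUKHI_GA, ARABIC_GAF, BINDI, TIPPI):
        print(describe(ch))
```

Bidirectional class `L` marks a left-to-right letter and `AL` a right-to-left Arabic letter; the two nasalization signs are nonspacing marks (category `Mn`), which is why they attach to a preceding consonant rather than standing alone.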

Coling 2008: Companion volume – Posters and Demonstrations, pages 177–180, Manchester, August 2008

2 Comparison with the Existing System

In actual practice, Shahmukhi script is written without short vowels and other diacritical marks. The Punjabi Machine Transliteration (PMT) system discussed by Malik (2006) claims 98% accuracy only when the input text has all the diacritical marks necessary for removing ambiguities. Supplying the missing diacritical marks is not practical, however, for reasons such as large input size, the need for manual intervention, and the need for a person who knows both scripts. We manually evaluated the PMT system against the following Shahmukhi input, published on a web site; its output is shown as Output-A in Table 1. The output of the proposed system on the same input is shown as Output-B. Wrongly transliterated Gurmukhi tokens are shown in bold and italic, and the comparison of both outputs is shown in Table 2. Clearly, our system is more practical than PMT, and we got good transliteration with different inputs having missing diacritical marks.

Input text (right to left):
اس ﮔﻞ وچ ﺟﺪوں اﺳﻴﮟ ﺑﮩﺘﮯ ﭘﻨﺠﺎﺑﻴﺎں ﻧﻮں وﻳﮑﻬﺪے ﮨﺎں ﺗﺎں ﭘﺮﻧﺴﭙﻞ ﺗﻴﺠﺎ ﺳﻨﮕﻪ دے ﻟﻴﮑﻪ وچ ﺑﻴﺎﻧﻴﺎں ﮔﺌﻴﺎں ﮐﻮڑﻳﺎں ﺳﭽﺎﺋﻴﺎں ﮨﻮر وی ﺷﺪت ﻧﺎل ﻣﺤﺴﻮس ﮨﻨﺪﻳﺎں ﮨﻴﻦ۔ اﺳﻴﮟ دﻳﺲ ﻧﻮں ﭘﻴﺎر ﮐﺮن دا دﻋﻮیٰ ﮐﺮدے ﮨﺎں ﭘﺮ اﭘﻨﮯ ﺻﻮﺑﮯ ﻧﻮں وﺳﺎری ﺑﻴﭩﻬﮯ ﮨﺎں۔ اس دا ﺳﺒﻪ ﺗﻮں وڈا ﺛﺒﻮت اﻳﮩہ ﮨﮯ ﮐہ ﺑﻬﺎرت دے ﻟﮓ ﺑﻬﮓ ﺑﮩﺘﮯ ﺻﻮﺑﮯ اﭘﻨﮯ اﭘﻨﮯ ﺳﺘﻬﺎﭘﻨﺎ دوس ﺑﮍے اﺗﺸﺎﮦ ﺗﮯ ﺟﺬﺑﮯ ﻧﺎل ﻣﻨﺎؤﻧﺪے ﮨﻴﻦ۔ اﭘﻨﯽ زﺑﺎن، اﭘﻨﮯ ﺳﺒﻬﻴﺎﭼﺎر، اﭘﻨﮯ ﭘﭽﻬﻮﮐﮍ ﺗﮯ اﭘﻨﮯ ورﺛﮯ ﺗﮯ ﻣﺎن ﮐﺮدے ﮨﻴﻦ۔ اﭘﻨﯽ ﻗﻮﻣﯽ ﭘﭽﻬﺎن ﺗﮯ ﻣﺎن ﮐﺮدے ﮨﻴﻦ۔ ﭘﺮ ﺳﺎڈا ﺑﺎﺑﺎ ﺁدم ﮨﯽ ﻧﺮاﻻ ﮨﮯ۔ ﺳﺮﮐﺎراں ﺗﻮں ﻟﮯ ﮐﮯ ﻋﺎم ﻟﻮﮐﺎں ﺗﮏ ﭘﻨﺠﺎﺑﯽ ﺻﻮﺑﮯ دے ﺑﻨﻦ دن ﺑﺎرے ﭘﻮری ﻃﺮﺣﺎں اوﻳﺴﻠﮯ ﮨﯽ رﮨﻨﺪے ﮨﻴﻦ۔

Output-A of PMT system (left to right):
ਅਸ ਗਲ ਵਚ ਜਦ ਅਸ ਬਹਤ ੇ ਪਨਜਾਬਆੇ ਂ ਨ ਵੇਖਦ ੇ ਹਾਂ ਤਾ ਂ ਪਰਨਸਪਲ ਤੇਜਾ ਸਨਘ ਦੇ ਲੇਖ ਵਚ ਬੇਅਨੇਆ ਂ ਗਈਆਂ ਕੜੋ ਆੇ ਂ ਸਚਾਈਆਂ ਹੋਰ ਵੀ ਸ਼ਦਤ ਨਾਲ ਮਹਸਸੋ ਹਨਦਆੇ ਂ ਹਨੇ । ਅਸ ਦੇਸ ਨ ਪਆਰੇ ਕਰਨ ਦਾ ਦਾਵਾ ਕਰਦ ੇ ਹਾ ਂ ਪਰ ਅਪਨੇ ਸਬੋ ੇ ਨ ਵਸਾਰੀ ਬਠੇ ੇ ਹਾਂ। ਅਸ ਦਾ ਸਭ ਤ ਵਡਾ ਸਬਤੋ ਇਹਾ ਹ ੇ ਕਹ ਭਾਰਤ ਦੇ ਲਗ ਭਗ ਬਹਤ ੇ ਸਬੋ ੇ ਅਪਨੇ ਅਪਨੇ ਸਥਾਪਨਾ ਦਸੋ ਬੜੇ ਅਤਸ਼ਾਹ ਤੇ ਜਜ਼ਬੇ ਨਾਲ ਮਨਾਈਵਨਦ ੇ ਹਨੇ । ਅਪਨੀ ਜ਼ਬਾਨ, ਅਪਨੇ ਸਭਅਚਾਰੇ , ਅਪਨੇ ਪਛਕੜੋ ਤੇ ਅਪਨੇ ਵਰਸ ੇ ਤੇ ਮਾਨ ਕਰਦੇ ਹਨੇ । ਅਪਨੀ ਕੋਮੀ ਪਛਾਨ ਤੇ ਮਾਨ ਕਰਦੇ ਹਨੇ । ਪਰ ਸਾਡਾ ਬਾਬਾ ਆਦਮ ਹੀ ਨਰਾਲਾ ਹੇ। ਸਰਕਾਰਾ ਂ ਤ ਲੇ ਕੇ ਆਮ ਲੋਕਾ ਂ ਤਕ ਪਨਜਾਬੀ ਸਬੋ ੇ ਦੇ ਬਨਨ ਦਨ ਬਾਰ ੇ ਪਰੀੋ ਤਰਹਾ ਂ ਉ◌ਸਲੇ ੇ ਹੀ ਰਹਨਦ ੇ ਹਨੇ ।

Output-B of proposed system (left to right):
ਇਸ ਗੱਲ ਵਿਚ ਜਦ ਅਸ ਬਹੁਤੇ ਪੰਜਾਬੀਆਂ ਨੂੰ ਵੇਖਦੇ ਹਾਂ ਤਾਂ ਪ੍ਰਿੰਸੀਪਲ ਤੇਜਾ ਸਿੰਘ ਦੇ ਲੇਖ ਵਿਚ ਬਿਆਨੀਆਂ ਗਈਆਂ ਕੌੜੀਆਂ ਸਚਾਈਆਂ ਹੋਰ ਵੀ ਸ਼ਿੱਦਤ ਨਾਲ ਮਹਿਸੂਸ ਹੁੰਦੀਆਂ ਹੈਨ। ਅਸ ਦੇਸ ਨੂੰ ਪਿਆਰ ਕਰਨ ਦਾ ਦਾਵਾ ਕਰਦੇ ਹਾਂ ਪਰ ਆਪਣੇ ਸੂਬੇ ਨੂੰ ਵਸਾਰੀ ਬੈਠੇ ਹਾਂ। ਇਸ ਦਾ ਸਭ ਤ ਵੱਡਾ ਸਬੂਤ ਇਹ ਹੈ ਕਿ ਭਾਰਤ ਦੇ ਲਗ ਭਗ ਬਹੁਤੇ ਸੂਬੇ ਆਪਣੇ ਆਪਣੇ ਸਥਾਪਨਾ ਦਸੋ ਬੜੇ ਉਤਸ਼ਾਹ ਤੇ ਜਜ਼ਬੇ ਨਾਲ ਮਨਾਉਂਦੇ ਹੈਨ। ਆਪਣੀ ਜ਼ਬਾਨ, ਆਪਣੇ ਸਭਿਆਚਾਰ, ਆਪਣੇ ਪਿਛੋਕੜ ਤੇ ਆਪਣੇ ਵਿਰਸੇ ਤੇ ਮਾਣ ਕਰਦੇ ਹੈਨ। ਆਪਣੀ ਕੌਮੀ ਪਛਾਣ ਤੇ ਮਾਣ ਕਰਦੇ ਹੈਨ। ਪਰ ਸਾਡਾ ਬਾਬਾ ਆਦਮ ਹੀ ਨਿਰਾਲਾ ਹੈ। ਸਰਕਾਰਾਂ ਤ ਲੈ ਕੇ ਆਮ ਲੋਕਾਂ ਤੱਕ ਪੰਜਾਬੀ ਸੂਬੇ ਦੇ ਬਣਨ ਦਿਨ ਬਾਰੇ ਪੂਰੀ ਤਰ੍ਹਾਂ ਅਵੈਸਲੇ ਹੀ ਰਹਿੰਦੇ ਹੈਨ।

Table 1. I/O of PMT and Proposed Systems

Output Type | Total Tokens | Wrong | Right | Accuracy %
A | 116 | 64 | 52 | 44.8275
B | 116 | 02 | 114 | 98.2758

Table 2. Comparison of Output-A & B

3 The Complexity

The Shahmukhi script has many inherent complexities; the two major ones are:

3.1 Recognition of Shahmukhi Text without Diacritical Marks

Shahmukhi script is usually written without short vowels and other diacritical marks, often leading to potential ambiguity. Arabic orthography does not provide full vocalization of the text, and the reader is expected to infer short vowels from the context of the sentence. In written Shahmukhi it is not mandatory to put short vowels below or above a character to make its sound clear. These special signs are called "Aerab" in Urdu. Recognizing the right word from the written text is a big challenge in the process of machine transliteration.

3.2 Multiple Mappings

A single character in the Shahmukhi script can have multiple possible mappings in the Gurmukhi script, as shown in Table 3.

Name | Shahmukhi Character | Gurmukhi Mapping
Vav | و | ਵ [v], ◌ੋ [o], ◌ੌ [Ɔ], ◌ੁ [ʊ], ◌ੂ [u], ਓ [o]
Yeh | ى | ਯ [j], ◌ਿ [ɪ], ◌ੇ [e], ◌ੈ [æ], ◌ੀ [i], ਈ [i]

Table 3. Multiple Mapping into Gurmukhi Script

4 Transliteration System

The transliteration system, as shown in Figure 1, is virtually divided into two phases. The first phase performs pre-processing on the input Shahmukhi token by performing a dictionary lookup. If the dictionary lookup fails, the token goes for rule based transliteration, and ultimately this phase generates the best possible Gurmukhi token(s). The second phase performs post-processing.
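The first phase (dictionary lookup with a rule based fallback) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `preprocess`, `TOY_DICTIONARY` and `toy_rule_based` are hypothetical, the Shahmukhi spellings are assumed, and the rule based component is replaced by a canned stand-in because the paper does not give its rules. The Gurmukhi forms and weights are the ones listed for this paper's example tokens.

```python
from typing import Callable, Dict, List, Tuple

WeightedForms = List[Tuple[str, int]]  # (Gurmukhi form, corpus weight)

# Assumed Shahmukhi spellings, used only as keys in this sketch.
SH_IS = "اس"
SH_HOR = "ہور"
SH_HAIRANI = "حیرانی"

# Toy dictionary: each Shahmukhi token maps to weighted Gurmukhi forms.
# Attaching these weights to these spellings is an assumption.
TOY_DICTIONARY: Dict[str, WeightedForms] = {
    SH_IS: [("ਇਸ", 59998), ("ਏਸ", 1186)],
    SH_HOR: [("ਹੋਰ", 14054), ("ਹੌਰ", 18)],
}


def toy_rule_based(token: str) -> WeightedForms:
    """Hypothetical stand-in for the rule based transliteration component."""
    canned = {SH_HAIRANI: [("ਹੈਰਾਨੀ", 524)]}
    return canned.get(token, [])


def preprocess(
    token: str,
    dictionary: Dict[str, WeightedForms],
    rule_based: Callable[[str], WeightedForms],
) -> Tuple[WeightedForms, bool]:
    """Return (weighted Gurmukhi forms, found_in_dictionary)."""
    forms = dictionary.get(token)
    if forms:
        return forms, True
    # Dictionary lookup failed: fall back to rule based transliteration.
    return rule_based(token), False


if __name__ == "__main__":
    for tok in (SH_IS, SH_HOR, SH_HAIRANI):
        print(tok, preprocess(tok, TOY_DICTIONARY, toy_rule_based))
```

Returning the `found_in_dictionary` flag alongside the forms lets a later stage treat dictionary hits and rule based output differently, which is the distinction the example in Section 6 relies on.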

In the post-processing phase, the alignment component performs context analysis of the input Gurmukhi token(s). The All Forms Generator (AFG) component performs the critical task of handling missing diacritical marks: it suggests similar possible forms of a Gurmukhi token that is not a most frequent one. The queue manager of the post-processing phase is designed to work on a bi-gram language model. It selects the best possible unigram for the final output by consulting the bi-gram weights of the current token with its neighboring tokens.

[Figure 1. System Overview. Unicode encoded Shahmukhi text passes through the Shahmukhi tokenizer; each Shahmukhi token goes through pre-processing and transliteration (Shahmukhi-Gurmukhi dictionary component, rule based transliteration component) to give Gurmukhi token(s); post-processing (Unicode alignment, All Forms Generator (AFG), bi-gram queue manager) feeds the output text generator, which produces Unicode encoded Gurmukhi text.]

5 Lexical Resources Used

Shahmukhi Corpus: 3.3 million words.
Gurmukhi Corpus: 7 million words.
Shahmukhi-Gurmukhi Dictionary
Unigram and Bi-gram Table
All Forms Generator (AFG)

6 Example

Here we show the internal working of the system through an example. Suppose we observe a Shahmukhi string as shown in Figure 2. First, we pass it through the pre-processing and transliteration phase, where the input string is tokenized into eleven Shahmukhi tokens. Every input token is searched for in the dictionary. The result is shown in Table 4: the 1st, 2nd, 4th, 5th, 6th, 7th, 8th, 10th and 11th tokens are found in the dictionary, and their intermediate Weighted Gurmukhi Forms (WGF) are generated. These tokens go directly to the bi-gram queue manager for bi-gram analysis in the post-processing phase.

[Figure 2. Input of 11 Shahmukhi tokens (right to left, numbered 11 to 1) and the transliterated 11 Gurmukhi tokens (left to right, numbered 1 to 11): ਫਿਰ ਹੋਰ ਹੈਰਾਨੀ ਇਸ ਗੱਲ ਹੁੰਦੀ ਐ ਜੇ ਅਜਿਹੀ ਗੱਲ ਕਰਨ]

Token | Found in Dictionary | WGF token{weight}
1 | Yes | ਫੇਰ{4513}; ਫਿਰ{8714}
2 | Yes | ਹੋਰ{14054}; ਹੌਰ{18}
3 | No | ਹੈਰਾਨੀ{524}
4 | Yes | ਇਸ{59998}; ਏਸ{1186}
5 | Yes | ਗੱਲ{107}
6 | Yes | ਹੁੰਦੀ{7699}
7 | Yes | ਏ{7927}; ਐ{3600}
8 | Yes | ਜੈ{295}; ਜੇ{9791}
9 | No | ਅਜਹੀ{4}
10 | Yes | ਗੱਲ{447}; ਗਲ{47}; ਗਿੱਲ{9}; ਗੁਲ{5}; ਗੁੱਲ{5}
11 | Yes | ਕਰਨ{21582}; ਕਰਣ{174}; ਕਿਰਨ{159}

Table 4. Pre-Processing Transliteration Status

On the other hand, the 3rd and 9th input tokens are not found in the dictionary. Therefore, in this phase they pass through the transliteration component, and in the post-processing phase they pass through Unicode formatting. After that they undergo the Most Frequent (MF) check, in which their weights are compared with a predefined threshold value [2].

[2] The threshold value is the minimum probability of occurrence among the most frequent tokens in the target script corpus.
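The Most Frequent (MF) check and the routing of rare forms through the All Forms Generator can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, `toy_afg` is a canned stand-in that returns the forms the paper reports for the 9th token instead of generating them, and treating a weight equal to the threshold as "most frequent" is a guess, since the paper only says the weight is compared with the threshold (100 in its example).

```python
from typing import Callable, List, Tuple

WeightedForms = List[Tuple[str, int]]  # (Gurmukhi form, corpus weight)

MF_THRESHOLD = 100  # value used in the paper's example

HAIRANI = "ਹੈਰਾਨੀ"  # 3rd token of the example, weight 524
AJAHI = "ਅਜਹੀ"      # 9th token of the example, weight 4


def is_most_frequent(weight: int, threshold: int = MF_THRESHOLD) -> bool:
    # Assumption: a weight equal to the threshold counts as most frequent.
    return weight >= threshold


def toy_afg(form: str, weight: int) -> WeightedForms:
    """Hypothetical stand-in for the AFG: canned output for one token."""
    if form == AJAHI:
        return [("ਅਜੇਹੀ", 310), ("ਅਜਿਹੀ", 1486), (form, weight)]
    return [(form, weight)]


def route_to_queue(
    form: str,
    weight: int,
    afg: Callable[[str, int], WeightedForms],
    threshold: int = MF_THRESHOLD,
) -> WeightedForms:
    """Return the weighted forms handed to the bi-gram queue manager."""
    if is_most_frequent(weight, threshold):
        return [(form, weight)]      # frequent enough: pass through as is
    return afg(form, weight)         # rare: ask the AFG for similar forms


if __name__ == "__main__":
    print(route_to_queue(HAIRANI, 524, toy_afg))
    print(route_to_queue(AJAHI, 4, toy_afg))
```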

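The bi-gram queue manager's rule that the next token decides the selection of its previous one can be sketched as follows. The function name `choose_previous` is hypothetical, and the fallback to the highest unigram weight when no bi-gram is found is an assumption of this sketch; the paper does not spell out that case. The weights are those of the example's first two tokens.

```python
from typing import Dict, List, Tuple

WeightedForms = List[Tuple[str, int]]  # (Gurmukhi form, unigram weight)

FER = "ਫੇਰ"
FIR = "ਫਿਰ"
HOR = "ਹੋਰ"

PREV_FORMS: WeightedForms = [(FER, 4513), (FIR, 8714)]   # candidates, token 1
NEXT_FORMS: WeightedForms = [(HOR, 14054), ("ਹੌਰ", 18)]  # candidates, token 2
BIGRAMS: Dict[Tuple[str, str], int] = {(FER, HOR): 12, (FIR, HOR): 20}


def choose_previous(
    prev_forms: WeightedForms,
    next_forms: WeightedForms,
    bigram_weights: Dict[Tuple[str, str], int],
) -> str:
    """Pick the form of the previous token using bi-grams with the next one."""
    best_form, best_weight = None, -1
    for prev, _ in prev_forms:
        for nxt, _ in next_forms:
            weight = bigram_weights.get((prev, nxt))
            if weight is not None and weight > best_weight:
                best_form, best_weight = prev, weight
    if best_form is not None:
        return best_form
    # Assumption: with no bi-gram evidence, keep the most frequent unigram.
    return max(prev_forms, key=lambda fw: fw[1])[0]


if __name__ == "__main__":
    print(choose_previous(PREV_FORMS, NEXT_FORMS, BIGRAMS))
```

With the weights above, the pair with weight 20 beats the pair with weight 12, so the first token is emitted in the form that participates in the heavier bi-gram.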
The threshold value is 100 in this case. As shown in Table 5, the WGF of the 3rd token, ਹੈਰਾਨੀ{524}, is a most frequent one and moves to the bi-gram queue, whereas the WGF of the 9th token, ਅਜਹੀ{4}, is not a most frequent token and reaches the bi-gram queue manager only after passing through the All Forms Generator (AFG).

Token | MF | AFG | Status | Bi-gram Found | Output
1 | - | - | hold | ਫੇਰ; ਫਿਰ | -
2 | - | - | - | ਫੇਰ-ਹੋਰ,12; ਫਿਰ-ਹੋਰ,20 | ਫਿਰ
3 | Yes | - | hold | ਹੋਰ | ਹੋਰ
4 | - | - | - | ਹੈਰਾਨੀ-ਇਸ,10 | ਹੈਰਾਨੀ
5 | - | - | hold | ਇਸ | ਇਸ
6 | - | - | - | ਇਸ-ਗੱਲ,45 | ਗੱਲ
7 | - | - | - | ਹੁੰਦੀ-ਏ,86; ਹੁੰਦੀ-ਐ,125 | ਹੁੰਦੀ
8 | - | - | hold | ਐ | ਐ
9 | No | ਅਜੇਹੀ{310}; ਅਜਿਹੀ{1486}; ਅਜਹੀ{4} | - | ਐ-ਜੇ,22; ਜੇ-ਅਜਿਹੀ,13 | ਜੇ
10 | Yes | - | hold | ਅਜਿਹੀ; ਅਜਿਹੀ-ਗੱਲ,38 | ਅਜਿਹੀ
11 | - | - | - | ਗੱਲ-ਕਰਨ,179; ਗਲ-ਕਰਨ,18 | ਗੱਲ
EOS | - | - | hold | ਕਰਨ | ਕਰਨ

Table 5. Post-Processing Status and Output

Here, we see that the AFG has generated two additional forms, ਅਜੇਹੀ{310} and ਅਜਿਹੀ{1486} (Table 5), for this token. These new forms carry the diacritical marks of short vowels that are missing in the original form. Clearly, the AFG has supplied the best possible forms. Next, we show how the bi-gram manager works on WGF tokens to generate the final Gurmukhi token. In this model the next token decides the selection of its previous one. Consider the second WGF token, ਹੋਰ{14054}, which has bi-gram combinations with the previous token: ਫੇਰ-ਹੋਰ with weight 12 and ਫਿਰ-ਹੋਰ with weight 20. The token ਫਿਰ is produced as output rather than ਫੇਰ because the ਫਿਰ-ਹੋਰ combination has a higher weight than ਫੇਰ-ਹੋਰ. Similarly, Table 5 shows the bi-gram weights found and the Gurmukhi token decided as output in each case.

7 Results and Discussion

The transliteration system was tested on a small set of poetry, article and story text. The results are tabulated in Table 6. An average transliteration accuracy of 91.37% was obtained (the unweighted mean of the three per-type accuracies). We got good transliteration with different inputs. The main source of error is the existence of vowel-consonant mapping between the two scripts. The Shahmukhi vowel characters Vav (و) and Yeh (ی) map to the Gurmukhi consonants Vava (ਵ) and Yayya (ਯ) respectively. This kind of vowel-consonant mapping cannot be resolved fully with dependency rules but can be minimized by refining the dictionary and the phonetic code generation rules of the AFG component. In other cases, the system makes errors that show a deficiency in handling tokens outside the common vocabulary domain.

Type | Transliterated Tokens | Accuracy %
Poetry | 3,301 | 90.63769
Article | 584 | 92.60274
Story | 3,981 | 90.88043
Total | 7,866 | 91.37362

Table 6. Transliteration Results

8 References

Arbabi, Mansur, Scott M. Fischthal, Vincent C. Cheng and Elizabeth Bart. 1994. Algorithms for Arabic name transliteration. IBM Journal of Research and Development, pp 183-193.

Gal, Ya'akov. 2002. An HMM Approach to Vowel Restoration in Arabic and Hebrew. Proceedings of ACL Workshop on Computational Approaches to Semitic Languages, pp 27-33.

Haizhou Li, Min Zhang and Jian Su. 2004. A Joint Source-Channel Model for Machine Transliteration. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp 159-166.

Malik, M. G. Abbas. 2006. Punjabi Machine Transliteration. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp 1137-1144.

Youngim Jung, Donghun Lee, Aesun Yoon, Hyuk Chul Kwon. 2004. Transliteration System for Arabic-Numeral Expressions using Decision Tree for Intelligent Korean TTS, volume 1. 30th Annual Conference of IEEE, pp 657-662.
