
Word Disambiguation in Shahmukhi to Gurmukhi Transliteration

Tejinder Singh Saini
ACTDPL, Punjabi University, Patiala, Punjab-147 002, India
[email protected]

Gurpreet Singh Lehal
DCS, Punjabi University, Patiala, Punjab-147 002, India
[email protected]

Abstract

To write Punjabi, speakers use two different scripts, Perso-Arabic (referred to as Shahmukhi) and Gurmukhi. Shahmukhi is used by the people of Western Punjab in Pakistan, whereas Gurmukhi is used by most people of Eastern Punjab in India. Naturally written Shahmukhi text has missing short vowels and other diacritical marks. Additionally, the presence of ambiguous characters having multiple mappings in Gurmukhi script causes ambiguity at the character as well as the word level while transliterating Shahmukhi text into Gurmukhi script. In this paper we focus on the word level ambiguity problem. Ambiguous Shahmukhi word tokens have many interpretations in the target Gurmukhi script. We have proposed two different algorithms for Shahmukhi word disambiguation. The first algorithm formulates this problem using a state sequence representation as a Hidden Markov Model (HMM). The second approach uses an n-gram model in which the joint occurrence of words within a small window of size ±5 is used. After evaluation we found that both approaches have more than 92% word disambiguation accuracy.

1 Introduction

1.1 Shahmukhi Script

Shahmukhi is a derivation of the Perso-Arabic script used to record the Punjabi language in Pakistan. The Shahmukhi script has thirty eight letters, including four long vowel signs: Alif ا [ɘ], Vao و [v], Choti-ye ى [j] and Badi-ye ے [j]. In general, Shahmukhi has thirty seven simple consonants and eleven frequently used aspirated consonants. There are three nasal consonants (ڻ [ɳ], ن [n], م [m]) and one additional nasalization sign, called Noon-ghunna ں [ɲ]. In addition to this, there are three short vowel signs, called Zer ◌ِ [ɪ], Pesh ◌ُ [ʊ] and Zabar ◌َ [ə], and some other diacritical marks or symbols like ء [ɪ], Shad ◌ّ, Khari-Zabar ◌ٰ [ɘ], do-Zabar ◌ً [ən] and do-Zer ◌ٍ [ɪn]. Arabic orthography does not provide full vocalization of the text, and the reader is expected to infer the short vowels from the context of the sentence. Any machine transliteration or text to speech synthesis system has to automatically guess and insert these missing symbols. This is a non-trivial problem and requires an in-depth statistical analysis (Durrani and Hussain, 2010).

1.2 Gurmukhi Script

The Gurmukhi script, standardized by Guru Angad Dev in the 16th century, was designed to write the Punjabi language (Sekhon, 1996); (Singh, 1997). It was modeled on the Landa alphabet. The literal meaning of "Gurmukhi" is "from the mouth of the Guru". Gurmukhi is a syllabic alphabet in which all consonants have an inherent vowel. The Gurmukhi alphabet has forty one letters, comprising thirty eight consonants and three basic vowel sign bearers. The first three letters, Ura ੳ [ʊ], Aira ਅ [ə] and Iri ੲ [ɪ], are unique because they form the basis for vowels and are not consonants. Six consonants are created by placing a dot at the foot (pair) of an existing consonant. There are five nasal consonants (ਙ [ɲə], ਞ [ɲə], ਣ [ɳ], ਨ [n], ਮ [m]) and two additional nasalization signs, bindi ◌ਂ [ɲ] and tippi ◌ੰ [ɲ], in Gurmukhi script. In addition to this, there are nine dependent vowel signs, or matras (◌ੁ [ʊ], ◌ੂ [u], ◌ੋ [o], ◌ਾ [ɘ], ਿ◌ [ɪ], ◌ੀ [i], ◌ੇ [e], ◌ੈ [æ], ◌ੌ [Ɔ]), used to create ten independent vowels (ਉ [ʊ], ਊ [u], ਓ [o], ਅ [ə], ਆ [ɑ], ਇ [ɪ], ਈ [i], ਏ [e], ਐ [æ], ਔ [Ɔ]) with the three bearer characters Ura ੳ [ʊ], Aira ਅ [ə] and Iri ੲ [ɪ]. With the exception of Aira ਅ [ə], independent vowels are never used without additional vowel signs. The diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel, and when they appear at the beginning of a syllable, vowels are written as independent vowels. Some Punjabi words require consonants to be written in a conjunct form, in which the second consonant is written under the first as a subscript. Three subjoined consonants are commonly used: Haha ਹ [h] (usage ਨ [n] + ◌੍ + ਹ [h] = ਨ੍ਹ [nʰ]), Rara ਰ [r] (usage ਪ [p] + ◌੍ + ਰ [r] = ਪ੍ਰ [prʰ]) and Vava ਵ [v] (usage ਸ [s] + ◌੍ + ਵ [v] = ਸ੍ਵ [sv]).

1.3 Transliteration and Ambiguity

To understand the problem of word ambiguity in the transliterated text, let us consider a Shahmukhi sentence having a total of 13 words, out of which five are ambiguous. During the transliteration phase our system generates all possible interpretations in the target script. Therefore, with this input the transliterated text has supplied all the ambiguous words, each with a maximum of two interpretations, in the Gurmukhi script as shown in Figure 1.

ਇਸ ਦੌਰ ਿਵਚ ਸਭ ਤ ਤਾਕਤਵਰ ਅਤੇ ਚਲਾਕ ਿਵਅਕਤੀ ਕਬੀਲੇ ਦਾ ਮੁਖੀ ਿਰਹਾ
is daur vic sabh tōṃ tākatvar atē calāk viaktī kabīlē dā mukhī rihā

Figure 1. Word level Ambiguity in Transliterated Text

In a bigram statistical word disambiguation approach, the probabilities of co-occurrence of the various alternatives, such as ਇਸ | ਉਸ, ਇਸ ਦੌਰ | ਉਸ ਦੌਰ | ਇਸ ਦੂਰ | ਉਸ ਦੂਰ, ਦੌਰ ਿਵਚ | ਦੂਰ ਿਵਚ, ਸਭ ਤ | ਸਭ ਤੂੰ, ਤ ਤਾਕਤਵਰ | ਤੂੰ ਤਾਕਤਵਰ, ਤਾਕਤਵਰ ਅਤੇ | ਤਾਕਤਵਰ ਉੱਤੇ, ਅਤੇ ਚਲਾਕ | ਉੱਤੇ ਚਲਾਕ, ਦਾ ਮੁਖੀ | ਦਾ ਮੱਖੀ, and ਮੁਖੀ ਿਰਹਾ | ਮੱਖੀ ਿਰਹਾ, are examined in the training corpus to estimate the likelihoods. If the joint co-occurrences of ਇਸ, ਇਸ ਦੌਰ, ਦੌਰ ਿਵਚ, ਸਭ ਤ, ਤ ਤਾਕਤਵਰ, ਤਾਕਤਵਰ ਅਤੇ, ਅਤੇ ਚਲਾਕ, ਦਾ ਮੁਖੀ, and ਮੁਖੀ ਿਰਹਾ are found to be the more likely bigram tokens, then the disambiguation will select ਇਸ, ਦੌਰ, ਤ, ਅਤੇ and ਮੁਖੀ respectively, as expected. Unfortunately, due to limited training data size or data sparseness, it is quite probable that some of the alternative word interpretations are missing in the training corpus. In such cases additional information about word similarity, such as a POS tagger or a thesaurus, may be helpful.
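The bigram check described above reduces to comparing corpus counts of the competing word pairs. Below is a minimal illustration in Python; the toy corpus variable, the function name and the candidate lists are our own illustrative assumptions, not the system's actual code.

from collections import Counter
from itertools import product

# Toy stand-in for the Gurmukhi training corpus (whitespace-tokenized);
# in the paper this role is played by a 7.8-million-word corpus.
corpus = "ਇਸ ਦੌਰ ਿਵਚ ਸਭ ਤ ਤਾਕਤਵਰ ਅਤੇ ਚਲਾਕ ਿਵਅਕਤੀ ਕਬੀਲੇ ਦਾ ਮੁਖੀ ਿਰਹਾ".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

def pick_by_bigram(prev_choices, cur_choices):
    """Return the (previous, current) pair with the highest bigram count.
    Each argument is the list of alternative Gurmukhi readings of one
    Shahmukhi token, e.g. ["ਇਸ", "ਉਸ"] and ["ਦੌਰ", "ਦੂਰ"]."""
    return max(product(prev_choices, cur_choices),
               key=lambda pair: bigram_counts[pair])

print(pick_by_bigram(["ਇਸ", "ਉਸ"], ["ਦੌਰ", "ਦੂਰ"]))  # ('ਇਸ', 'ਦੌਰ') on this toy corpus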

1.4 Causes of Ambiguity

The most common reasons for this ambiguity are missing short vowels and the presence of ambiguous characters having multiple mappings in Gurmukhi script.

In written Shahmukhi it is not mandatory to put the short vowels, called Aerab, below or above a Shahmukhi character to clarify its sound, which leads to potentially ambiguous transliteration into Gurmukhi, as shown in Table 1. In our findings, the Shahmukhi corpus has just 1.66% coverage of the short vowels ◌ُ [ʊ] (0.81415%), ◌ِ [ɪ] (0.7295%) and ◌َ (0.1234%), whereas the equivalent ਿ◌ [ɪ] (4.5462%) and ◌ੁ [ʊ] (1.5844%) in the Gurmukhi corpus have 6.13% usage. Hence, it is a big challenge in the machine transliteration process to recognize the right word from text written without Aerab, because in such a situation the correct meaning of the word needs to be corroborated from its neighboring context.

Sr    Possible Gurmukhi Transliterations
1     ਗੱਲ /gall/, ਿਗੱਲ /gill/, ਗੁੱਲ /gull/, ਗੁਲ /gul/
2     ਤਕ /tak/, ਤੱਕ /takk/, ਤੁਕ /tuk/
3     ਮੱਖੀ /makkhī/, ਮੁਖੀ /mukhī/
4     ਹਨ /han/, ਹੁਣ /huṇ/
5     ਿਜਥੇ /jithē/, ਜਥੇ /jathē/
6     ਿਦਸਦਾ /disdā/, ਦੱਸਦਾ /dassdā/
7     ਅੱਕ /akk/, ਇੱਕ /ikk/
8     ਜੱਟ /jaṭṭ/, ਜੁੱਟ /juṭṭ/
9     ਉਸ /us/, ਇਸ /is/
10    ਅਤੇ /atē/, ਉੱਤੇ /uttē/
Table 1. Ambiguous Shahmukhi Words without Short Vowels (each row corresponds to one Shahmukhi word written without diacritics)

Secondly, it is observed that there are multiple possible mappings into Gurmukhi script corresponding to a single character of the Shahmukhi script, as shown in Table 2. Moreover, the Shahmukhi characters shown there have vowel-vowel, vowel-consonant and consonant-consonant mappings.

Sr    Shahmukhi Char.    Multiple Gurmukhi Mappings
1     و [v]              ਵ [v], ◌ੋ [o], ◌ੌ [Ɔ], ◌ੁ [ʊ], ◌ੂ [u], ਓ [o]
2     ى [j]              ਯ [j], ਿ◌ [ɪ], ◌ੇ [e], ◌ੈ [æ], ◌ੀ [i], ਈ [i]
3     ن [n]              ਨ [n], ◌ੰ [ɲ], ਣ [ɳ], ◌ਂ [ɲ]
Table 2. Multiple Mapping into Gurmukhi Script

For example, consider two Shahmukhi words, /cīn/ and /rōs/, containing the ambiguous characters ى [i] and و [o] respectively. Our transliteration engine discovers the corresponding word interpretations ਚੇਨ /cēn/, ਚੀਨ /cīn/ or ਚੈਨ /cain/ for the first, and ਰੋਸ /rōs/ or ਰੂਸ /rūs/ for the second. Furthermore, both problems may coexist in a particular Shahmukhi word; for example, the Shahmukhi word /baṇdī/ has four different forms, ਬਣਦੀ /baṇdī/, ਬੁਣਦੀ /buṇdī/, ਬੰਦੀ /bandī/ or ਿਬੰਦੀ /bindī/, in Gurmukhi script due to the ambiguous character ن [n] and a missing short vowel. More sample cases are shown in Table 3.

Sr    Ambiguous Char.    Possible Gurmukhi Transliterations
1     و [v]              ਖੂਹ /khūh/, ਖੋਹ /khōh/
2     ى [j]              ਿਪਓ /piō/, ਪੀਓ /pīō/
3     ى [j]              ਚੇਨ /cēn/, ਚੀਨ /cīn/, ਚੈਨ /cain/
4     ن [n]              ਜਾਂਦਾ /jāndā/, ਜਾਣਦਾ /jāṇdā/
5     و [v], ن [n]        ਸੂਚਨਾ /sūcnā/, ਸੋਚਣਾ /sōcaṇā/
6     ى [j], ن [n]        ਦੇਣ /dēṇ/, ਦੀਨ /dīn/
Table 3. Shahmukhi Words with Multiple Gurmukhi Mappings
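The mappings in Table 2 are what the transliteration engine fans out over when it generates candidate Gurmukhi readings. A minimal sketch of that fan-out follows, using an abridged map and a function name of our own choosing; a real engine also maps the unambiguous characters and inserts the missing short vowels.

from itertools import product

# Abridged character map taken from Table 2: an ambiguous Shahmukhi
# character may surface as any of several Gurmukhi letters or vowel signs.
GURMUKHI_OPTIONS = {
    "و": ["ਵ", "ੋ", "ੌ", "ੁ", "ੂ", "ਓ"],
    "ى": ["ਯ", "ਿ", "ੇ", "ੈ", "ੀ", "ਈ"],
    "ن": ["ਨ", "ੰ", "ਣ", "ਂ"],
}

def expand_candidates(shahmukhi_word):
    """Naively substitute every ambiguous character with each of its
    mapped options and return all resulting spellings; characters outside
    the map are passed through unchanged."""
    slots = [GURMUKHI_OPTIONS.get(ch, [ch]) for ch in shahmukhi_word]
    return ["".join(combo) for combo in product(*slots)]

The word disambiguation algorithms described below then score such candidate readings against the Gurmukhi training data.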
Another variety of word ambiguity, mostly encountered in machine translation systems, is that many words have several meanings or senses. The task of word sense disambiguation is to determine which sense of an ambiguous word is invoked in a particular use of the word. This is done by looking at the context of the ambiguous word and by exploiting contextual word similarities based on some predefined co-occurrence relations. Various types of disambiguation methods, in which the source of word similarity was either statistical (Schutze, 1992); (Dagan et al. 1993, 1995); (Karov and Shimon, 1996); (Lin, 1997); or a manually crafted thesaurus (Resnik, 1992, 1995); (Jiang and Conrath, 1997); have been presented in the literature.

In this paper we have proposed two different algorithms for Shahmukhi word disambiguation. The first algorithm formulates this problem using a state sequence representation as a Hidden Markov Model (HMM). The second approach uses an n-gram model in which the joint occurrence of words within a small window of size ±5 is used.

2 The Level of Ambiguity in Shahmukhi Text

We have performed experiments to evaluate how much word level ambiguity is present in Shahmukhi text. In order to measure the extent of such ambiguous words in a Shahmukhi corpus, we analyzed the 10,000 most frequent words obtained from the Shahmukhi word list that was generated during corpus analysis. The result of this analysis is shown in Table 4.

Sr.   Most Frequent Words   Percentage of Ambiguous Words
1     Top 100               20%
2     Top 500               15.8%
3     Top 1,000             11.9%
4     Top 5,000             4.72%
5     Top 10,000            3.6%
Table 4. Extent of Ambiguity in Top 10K words of Shahmukhi Corpus

Observations:
• The most frequent words in the Shahmukhi corpus have higher chances of being ambiguous.
• In this test case the maximum amount of ambiguity is 20%, which is very high.
• The percentage of ambiguity decreases continuously while moving from the most frequent to less frequent words within the target list.
• The ambiguous words in the Top 10,000 dataset have at most four interpretations in Gurmukhi script, with 2% coverage, whereas the shares of words with three and two Gurmukhi interpretations are 12% and 86% respectively, as shown in Figure 2.

Figure 2. Number of Gurmukhi Interpretations of Ambiguous Words in the Top 10K Dataset (two interpretations: 86%; three interpretations: 12%; four interpretations: 2%)
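The percentages in Table 4 amount to a simple count over the frequency list. A minimal sketch of that measurement, with data structures we assume for illustration (a token list from the Shahmukhi corpus and a map from each Shahmukhi word to its Gurmukhi readings):

from collections import Counter

def ambiguity_extent(shahmukhi_tokens, gurmukhi_readings, top_n):
    """Percentage of the top_n most frequent Shahmukhi words that have
    more than one Gurmukhi interpretation (cf. Table 4).
    gurmukhi_readings maps a Shahmukhi word to the list of candidate
    Gurmukhi forms produced by the transliteration engine."""
    freq = Counter(shahmukhi_tokens)
    top_words = [w for w, _ in freq.most_common(top_n)]
    ambiguous = [w for w in top_words
                 if len(gurmukhi_readings.get(w, [w])) > 1]
    return 100.0 * len(ambiguous) / len(top_words)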

Additionally, a similar experiment was performed on a Shahmukhi book having a total of 37,620 words. After manual evaluation, we discovered that the extent of ambiguous words in this book was 17.12%. Hence, both test cases indicate that there is a significant percentage of ambiguous words in Shahmukhi text, which must be addressed to achieve a higher rate of transliteration accuracy.

3 The Approach

At the outset, all we have is the raw corpora for each script (Shahmukhi and Gurmukhi) of the Punjabi language. The properties of these corpora are presented in Table 5. The majority of the Shahmukhi soft data was found in the form of InPage software files. This soft data was converted to Unicode format using the InPage to Unicode Converter. A corpus based statistical analysis of both corpora was performed. We started from scratch and created the required resources in Unicode for both Shahmukhi and Gurmukhi scripts. The size of the Gurmukhi training data used for the word disambiguation task is shown in Table 6.

Script       Corpus Size   Unique Words
Gurmukhi     7.8 m         1,59,272
Shahmukhi    8.5 m         1,79,537
Text source (both): daily and regional newspapers, reports, periodicals, magazines, short stories and Punjabi literature books etc.
Table 5. Properties of Shahmukhi and Gurmukhi Corpora

N-gram models are used extensively in language modeling, and the same is proposed for Shahmukhi word disambiguation using the target script corpus. N-grams have the practical advantage of providing useful likelihood estimates for alternative readings of language corpora. Word similarities obtained from N-gram analysis are a combination of syntactic, semantic and contextual similarities, which makes them very suitable for this word disambiguation task.

We have proposed two different algorithms for Shahmukhi word disambiguation. The first algorithm formulates this problem using a state sequence representation as a Hidden Markov Model. The second approach uses an n-gram model (including the right side context) in which the joint occurrence of words within a small window of size ±5 is used.

Training Data                   Size (Records)
Gurmukhi Word Frequency List    87,962
Gurmukhi Bigram List            265,372
Gurmukhi Trigram List           247,010
Table 6. Training Data Resources

3.1 Word Disambiguation using HMM

A second order HMM is equivalent to an n-gram language model with n = 3, called a trigram language model. One major problem with fixed-n models is data sparseness. Therefore, one good way to smooth n-gram estimates is to use linear interpolation (Jelinek and Mercer, 1980) of the n-gram estimates for various n, for example:

P(wn | wn−1 wn−2) = λ1·P1(wn | wn−1 wn−2) + λ2·P2(wn | wn−1) + λ3·P3(wn)    (1)

where Σi λi = 1 and 0 ≤ λi ≤ 1.

The varying n means that we use trigram, bigram and unigram probabilities together in a linear interpolation. This way we still get some estimate of how likely a particular word is, even if our coverage of trigrams is sparse.
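Equation 1 can be read directly as code. A minimal sketch, assuming plain dictionaries of raw counts from the Gurmukhi corpus (the function and variable names are ours, not the system's):

def interpolated_prob(w, prev1, prev2, uni, bi, tri, lambdas, total_words):
    """P(w | prev1, prev2) as in equation 1: a weighted sum of trigram,
    bigram and unigram relative-frequency estimates.
    prev1 is the immediately preceding word, prev2 the one before it;
    lambdas = (l1, l2, l3) must sum to 1."""
    l1, l2, l3 = lambdas
    # Fall back to a denominator of 1 when a history was never seen,
    # which simply zeroes that component of the sum.
    p3 = tri.get((prev2, prev1, w), 0) / max(bi.get((prev2, prev1), 0), 1)
    p2 = bi.get((prev1, w), 0) / max(uni.get(prev1, 0), 1)
    p1 = uni.get(w, 0) / total_words
    return l1 * p3 + l2 * p2 + l3 * p1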

Now the next question is how to set the parameters λi. Thede and Harper (1999) modeled a second order HMM for part of speech tagging. Rather than using a fixed smoothing technique, they discussed a new method of calculating contextual probabilities using linear interpolation. This method attaches more weight to triples that occur more often. The formula to estimate the contextual probability is:

P(τp = wk | τp−1 = wj, τp−2 = wi) = k3·(N3/C2) + (1−k3)·k2·(N2/C1) + (1−k3)·(1−k2)·(N1/C0)    (2)

where k3 = (log2(N3 + 1) + 1) / (log2(N3 + 1) + 2), and

k2 = (log2(N2 + 1) + 1) / (log2(N2 + 1) + 2)    (3)

Equation 2 depends on the following counts:

N3: frequency of the trigram wi wj wk in the Gurmukhi corpus

N2: frequency of the bigram wj wk in the Gurmukhi corpus
N1: frequency of the unigram wk in the Gurmukhi corpus
C2: number of times the bigram wi wj occurs in the Gurmukhi corpus
C1: number of times the unigram wj occurs in the Gurmukhi corpus
C0: total number of words that appear in the Gurmukhi corpus

The formulas for k3 and k2 are chosen so that the weighting of each element in equation 2 changes based on how often that element occurs in the Gurmukhi corpus. After comparing equations 1 and 2, we can easily see that λ1 = k3, λ2 = (1 − k3)·k2 and λ3 = (1 − k3)·(1 − k2), and that these satisfy the condition Σi λi = 1 with 0 ≤ λi ≤ 1.
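A minimal sketch of equations 2 and 3, again assuming plain dictionaries of raw counts (the names are ours):

import math

def contextual_prob(wk, wj, wi, uni, bi, tri, corpus_size):
    """P(wk | preceded by wi, wj) following equations 2 and 3: the weights
    k3 and k2 grow with how often the trigram and bigram were observed,
    so better-attested histories dominate the estimate."""
    n3 = tri.get((wi, wj, wk), 0)   # trigram count
    n2 = bi.get((wj, wk), 0)        # bigram count
    n1 = uni.get(wk, 0)             # unigram count
    c2 = bi.get((wi, wj), 0)        # history bigram count
    c1 = uni.get(wj, 0)             # history unigram count

    k3 = (math.log2(n3 + 1) + 1) / (math.log2(n3 + 1) + 2)
    k2 = (math.log2(n2 + 1) + 1) / (math.log2(n2 + 1) + 2)

    p3 = n3 / c2 if c2 else 0.0
    p2 = n2 / c1 if c1 else 0.0
    p1 = n1 / corpus_size
    return k3 * p3 + (1 - k3) * k2 * p2 + (1 - k3) * (1 - k2) * p1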

We build an HMM with four states for each word pair: one for the basic word pair and three representing each choice of n-gram model for calculating the next transition. Therefore, as expressed in equations 2 and 3 of the second order HMM, there are three ways for wc = {ਦਸ or ਦੱਸ or ਿਦਸ} to follow wa wb (ਹੁਣ ਤੂੰ), and the total probability of seeing wc next is then the sum of each of the n-gram probabilities that adorn the arcs, multiplied by the corresponding parameter 0 ≤ λi ≤ 1. Correspondingly, the fragment of the second order HMM for the ambiguous Gurmukhi word sequence ਹੁਣ ਤੂੰ ਦੱਸ is shown in Figure 3, where the first two consecutive words have two forms each, {ਹਨ or ਹੁਣ} and {ਤ or ਤੂੰ}, whereas the third consecutive Gurmukhi word has three forms, {ਦਸ or ਦੱਸ or ਿਦਸ}.

Figure 3. A Fragment of Second Order HMM for ਹੁਣ ਤੂੰ ਦੱਸ

To calculate the best state sequence we have modeled the Viterbi algorithm, which efficiently computes the most likely state sequence.
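The Viterbi search over the candidate lattice can be pictured with the first-order sketch below; the paper's model is second order (it conditions on two previous words), so this is a simplified illustration, and the transition function is assumed to be any of the probability estimates given above.

def viterbi(candidates, trans_prob, start="<s>"):
    """Return the most likely sequence of Gurmukhi forms.
    candidates: one list of alternative forms per sentence position,
    e.g. [["ਹਨ", "ਹੁਣ"], ["ਤ", "ਤੂੰ"], ["ਦਸ", "ਦੱਸ", "ਿਦਸ"]].
    trans_prob(prev, cur): transition score, e.g. a smoothed bigram
    probability."""
    best = {start: (1.0, [])}            # state -> (path score, path)
    for position in candidates:
        new_best = {}
        for cur in position:
            score, path = max(
                ((s * trans_prob(prev, cur), p + [cur])
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0])
            new_best[cur] = (score, path)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]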

3.2 Word Disambiguation using ±5 Window Context

This n-gram based algorithm performs word disambiguation using a small window context of size ±5. This context is used to exploit the contextual, semantic and syntactic similarities captured by an n-gram language model. The structure of our small window context is shown in Figure 4.

Figure 4. Structure of the ±5 Window Context (positions i−2 and i−1 form the left context, the ambiguous word is at position i, and positions i+1 and i+2 form the right context)

The disambiguation task starts from the first word of the input sentence and attempts to investigate the co-occurrence probabilities of the words present to the left and right of the ambiguous word within the window. Unlike the HMM approach, which is based on linear interpolation of n-gram estimates (Jelinek and Mercer, 1980), this algorithm works in a back-off fashion, as proposed by Katz (1987): it first relies on the highest order trigram model to estimate the joint co-occurrence of the alternative word interpretations and to select the most probable interpretation. For example, consider the following Shahmukhi sentence with three ambiguous words, whose transliteration yields the alternatives:

{ਦਸ ਦੱਸ ਿਦਸ} {ਤ ਤੂੰ} {ਹਨ ਹੁਣ}

The initial ambiguity is {ਹਨ /han/ | ਹੁਣ /huṇ/}, {ਤ /tōṃ/ | ਤੂੰ /tūṃ/} and {ਦਸ /das/ | ਦੱਸ /dass/ | ਿਦਸ /dis/}. The left and right context probabilities of the first ambiguous word are shown in Table 7.

Word   Right Context Bigram          Right Context Trigram                      Left Context Bigram      Left Context Trigram
ਹਨ     P(ਹਨ, ਤ|ਤੂੰ) = 0.000519       P(ਹਨ, ਤ|ਤੂੰ, ਦਸ|ਦੱਸ|ਿਦਸ) = 0.0              P(_, ਹਨ) = 0.001613     P(_, _, ਹਨ) = 0.001673
ਹੁਣ    P(ਹੁਣ, ਤ|ਤੂੰ) = 0.015945      P(ਹੁਣ, ਤ|ਤੂੰ, ਦਸ|ਦੱਸ|ਿਦਸ) = 0.037383        P(_, ਹੁਣ) = 0.063967    P(_, _, ਹੁਣ) = 0.064930
Table 7. Context Window Probabilities for the Words ਹਨ and ਹੁਣ

Clearly, the word ਹੁਣ is selected because it has the higher trigram co-occurrence probability. Now the sentence ambiguity is reduced to ਹੁਣ {ਤ | ਤੂੰ} {ਦਸ | ਦੱਸ | ਿਦਸ}. The estimation of the co-occurrence probabilities for the next ambiguous word is shown in Table 8.

Word   Right Context Bigram               Left Context Bigram      Left Context Trigram
ਤ      P(ਤ, ਦਸ|ਦੱਸ|ਿਦਸ) = 0.000492        P(ਹੁਣ, ਤ) = 0.006519     P(_, ਹੁਣ, ਤ) = 0.003341
ਤੂੰ     P(ਤੂੰ, ਦਸ|ਦੱਸ|ਿਦਸ) = 0.005019       P(ਹੁਣ, ਤੂੰ) = 0.009426    P(_, ਹੁਣ, ਤੂੰ) = 0.012454
Table 8. Context Window Probabilities for the Words ਤ and ਤੂੰ

As expected, the word ਤੂੰ is selected by the system using the left trigram context, and the sentence ambiguity is now reduced to ਹੁਣ ਤੂੰ {ਦਸ ਦੱਸ ਿਦਸ}. The next ambiguity lies in the last word of the sentence, so it has only a left context, as shown in Table 9. After evaluating the left context co-occurrences for all three word forms, the system finds that the valid co-occurrence is P(ਹੁਣ, ਤੂੰ, ਦੱਸ), and on this basis the word ਦੱਸ is selected. Finally, the output of the system is ਹੁਣ ਤੂੰ ਦੱਸ, as expected.

Word   Right Context   Left Context Bigram        Left Context Trigram
ਦਸ     N.A.            P(ਤੂੰ, ਦਸ) = 0.0020768      P(ਹੁਣ, ਤੂੰ, ਦਸ) = 0
ਦੱਸ     N.A.            P(ਤੂੰ, ਦੱਸ) = 0.003980       P(ਹੁਣ, ਤੂੰ, ਦੱਸ) = 0.037383
ਿਦਸ    N.A.            P(ਤੂੰ, ਿਦਸ) = 0.000028      P(ਹੁਣ, ਤੂੰ, ਿਦਸ) = 0
Table 9. Context Window Probabilities for the Words ਦਸ, ਦੱਸ and ਿਦਸ

Unlike the above example, there are situations where the higher order joint co-occurrence is found to be zero in the training corpus. In such a situation the proposed algorithm backs off to the next lower (n−1)-gram model.
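A minimal sketch of this back-off selection for one ambiguous position, using only the left part of the window and the same assumed count dictionaries (the full algorithm also consults the right context within the ±5 window; the helper name is ours):

def choose_with_backoff(left2, left1, alternatives, uni, bi, tri):
    """Pick one Gurmukhi form for an ambiguous position from its two
    left-context words, backing off in the spirit of Katz (1987):
    use trigram evidence if any alternative has it, otherwise bigram
    counts, and finally raw unigram frequency."""
    tri_scores = {w: tri.get((left2, left1, w), 0) for w in alternatives}
    if any(tri_scores.values()):
        return max(tri_scores, key=tri_scores.get)
    bi_scores = {w: bi.get((left1, w), 0) for w in alternatives}
    if any(bi_scores.values()):
        return max(bi_scores, key=bi_scores.get)
    return max(alternatives, key=lambda w: uni.get(w, 0))

# With the counts behind Table 9, choose_with_backoff("ਹੁਣ", "ਤੂੰ",
# ["ਦਸ", "ਦੱਸ", "ਿਦਸ"], uni, bi, tri) would return "ਦੱਸ".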

4 Example

The following are the N-gram and HMM outputs of the word disambiguation task for a sample text downloaded from an article available on the web site http://www.likhari.org.

Input Text (romanized):
asīṃ gall tāṃ karadē hāṃ ki asīṃ āpṇī māṃ bōlī nūṃ usdā baṇdā hakk divāuṇ laī purzōr hāṃ par sāḍīā akkhāṃ sāmhṇē hī pañjābī nāḷ usdē ghar vic hī nā iṃsāfī hō rahī hai tē asīṃ phir cupp kar kē ih sabh vēkh rahē hāṃ bhārat atē pākistān dōvāṃ mulkāṃ vallōṃ pañjābī laī sāñjhē mañc tē kamm kītā jā rihā hai pichlē dinīṃ bambī vic jō kujjh vāpriā us nē sārī dunīāṃ nūṃ hilā kē rakkh dittā is nāḷ dōvāṃ mulkāṃ dē rishtē taṛkē han par buddhījīvī varag nūṃ ikk gall āpaṇē zihan vic rakkhṇī cāhīdī hai ki sarhaddāṃ nē zamīn vaṇḍī hai zabān nahīṃ

N-gram Output:
ਅਸ ਗੱਲ ਤਾਂ ਕਰਦੇ ਹਾਂ ਿਕ ਅਸ ਆਪਣੀ ਮਾਂ ਬੋਲੀ ਨੂੰ ਉਸਦਾ ਬਣਦਾ ਹੱਕ ਿਦਵਾਉਣ ਲਈ ਪੁਰਜ਼ੋਰ ਹਾਂ ਪਰ ਸਾਿਡਆ ਅੱਖਾਂ ਸਾਮਣੇ ਹੀ ਪੰਜਾਬੀ ਨਾਲ਼ ਉਸਦੇ ਘਰ ਿਵਚ ਹੀ ਨਾ ਇੰਸਾਫ਼ੀ ਹੋ ਰਹੀ ਹੈ ਤੇ ਅਸ ਿਫਰ ਚੁੱਪ ਕਰ ਕੇ ਇਹ ਸਭ ਵੇਖ ਰਹੇ ਹਾਂ ਭਾਰਤ ਅਤੇ ਪਾਿਕਸਤਾਨ ਦੋਵਾਂ ਮੁਲਕਾਂ ਵੱਲ ਪੰਜਾਬੀ ਲਈ ਸਾਂਝੇ ਮੰਚ ਤੇ ਕੰਮ ਕੀਤਾ ਜਾ ਿਰਹਾ ਹੈ ਿਪਛਲੇ ਿਦਨ ਬੰਬਈ ਿਵਚ ਜੋ ਕੁੱਝ ਵਾਪਿਰਆ ਉਸ ਨ ਸਾਰੀ ਦੁਨੀਆਂ ਨੂੰ ਿਹਲਾ ਕੇ ਰੱਖ ਿਦੱਤਾ ਇਸ ਨਾਲ਼ ਦੋਵਾਂ ਮੁਲਕਾਂ ਦੇ ਿਰਸ਼ਤੇ ਤੜਕੇ ਹਨ ਪਰ ਬੁੱਧੀਜੀਵੀ ਵਰਗ ਨੂੰ ਇਕ ਗੱਲ ਆਪਣੇ ਿਜ਼ਹਨ ਿਵਚ ਰੱਖਣੀ ਚਾਹੀਦੀ ਹੈ ਿਕ ਸਰਹੱਦਾਂ ਨ ਜ਼ਮੀਨ ਵੰਡੀ ਹੈ ਜ਼ਬਾਨ ਨਹ

Ambiguous words (total = 15, i.e. 14.285%):
{ਗੱਲ ਿਗੱਲ ਗੁੱਲ ਗੁਲ} {ਨੂੰ ਨ} {ਬਣਦਾ ਬੰਦਾ} {ਪਰ ਪ ਪੁਰ} {ਸਾਿਡਆ ਸਾਡੀਆ} {ਅਤੇ ਉੱਤੇ} {ਉਸ ਇਸ} {ਨੂੰ ਨ} {ਰੱਖ ਰੁੱਖ} {ਇਸ ਉਸ ਐਸ} {ਹਨ ਹੁਣ} {ਪਰ ਪ ਪੁਰ} {ਨੂੰ ਨ} {ਇਕ ਅੱਕ ਇੱਕ} {ਗੱਲ ਿਗੱਲ ਗੁੱਲ ਗੁਲ}

2nd Order HMM Output:
ਅਸ ਗੱਲ ਤਾਂ ਕਰਦੇ ਹਾਂ ਿਕ ਅਸ ਆਪਣੀ ਮਾਂ ਬੋਲੀ ਨੂੰ ਉਸਦਾ ਬਣਦਾ ਹੱਕ ਿਦਵਾਉਣ ਲਈ ਪੁਰਜ਼ੋਰ ਹਾਂ ਪਰ ਸਾਡੀਆ ਅੱਖਾਂ ਸਾਮਣੇ ਹੀ ਪੰਜਾਬੀ ਨਾਲ਼ ਉਸਦੇ ਘਰ ਿਵਚ ਹੀ ਨਾ ਇੰਸਾਫ਼ੀ ਹੋ ਰਹੀ ਹੈ ਤੇ ਅਸ ਿਫਰ ਚੁੱਪ ਕਰ ਕੇ ਇਹ ਸਭ ਵੇਖ ਰਹੇ ਹਾਂ ਭਾਰਤ ਅਤੇ ਪਾਿਕਸਤਾਨ ਦੋਵਾਂ ਮੁਲਕਾਂ ਵੱਲ ਪੰਜਾਬੀ ਲਈ ਸਾਂਝੇ ਮੰਚ ਤੇ ਕੰਮ ਕੀਤਾ ਜਾ ਿਰਹਾ ਹੈ ਿਪਛਲੇ ਿਦਨ ਬੰਬਈ ਿਵਚ ਜੋ ਕੁੱਝ ਵਾਪਿਰਆ ਉਸ ਨ ਸਾਰੀ ਦੁਨੀਆਂ ਨੂੰ ਿਹਲਾ ਕੇ ਰੱਖ ਿਦੱਤਾ ਇਸ ਨਾਲ਼ ਦੋਵਾਂ ਮੁਲਕਾਂ ਦੇ ਿਰਸ਼ਤੇ ਤੜਕੇ ਹਨ ਪਰ ਬੁੱਧੀਜੀਵੀ ਵਰਗ ਨੂੰ ਇੱਕ ਗੱਲ ਆਪਣੇ ਿਜ਼ਹਨ ਿਵਚ ਰੱਖਣੀ ਚਾਹੀਦੀ ਹੈ ਿਕ ਸਰਹੱਦਾਂ ਨ ਜ਼ਮੀਨ ਵੰਡੀ ਹੈ ਜ਼ਬਾਨ ਨਹ

This sample input text has 105 words in total and around 14.28% ambiguity at the word level. While processing, the disambiguation task identified that fifteen words are ambiguous, i.e. having two, three or four interpretations in Gurmukhi script. The disambiguation results for this sample input show that out of the fifteen ambiguous words, fourteen have been correctly disambiguated by both the N-gram and HMM algorithms, whereas only one wrong word, ਸਾਿਡਆ /sāḍiā/, is mistakenly chosen by the N-gram approach; it is correctly recognized as ਸਾਡੀਆ /sāḍīā/ by the HMM algorithm.

5 Experiments and Results

The natural sources of Shahmukhi text are very limited. With this limitation we have identified the available online and offline sources, and three different test sets were taken from different domains, as shown in Table 10. After manual evaluation, the word disambiguation results on the three datasets are given in Table 11. The overall word ambiguity of 13.85% across all datasets is significant. The upper bound contribution is from Set-1 (book), which has the highest percentage of word ambiguity, 17.12%, and the corresponding performance of the two disambiguation approaches on it is also the highest.

Test Data   Word Size   Source
Set-1       37,620      Book
Set-2       39,714      www.likhari.org
Set-3       46,678      www.wichaar.com
Total       1,24,012
Table 10. Description of the Test Data

We have evaluated both the HMM and N-gram algorithms on these datasets, and the results of this experiment show that the accuracy of the N-gram and HMM based algorithms is 92.81% and 93.77% respectively. Hence, the HMM based approach has marginally outperformed the N-gram approach.

Test Data            Word Ambiguity   N-gram ±5 Window   2nd order HMM
Set-1 (book)         17.121%          95.358%            95.870%
Set-2 (likhari.org)  12.587%          91.189%            91.629%
Set-3 (wichaar.com)  11.85%           91.892%            93.822%
Total                13.85%           92.813%            93.773%
Table 11. Word Disambiguation Results

The accuracy of both algorithms is more than 92%, indicating there is still room for improvement.
A comparative analysis of both outputs was performed. We found that there are cases where the HMM and the N-gram based method each individually outperform the other, as shown in Table 12, rows 1 & 2 and rows 3 & 4 respectively. However, there are also various cases in which both approaches fail to disambiguate, either partially or fully, as shown in rows 5 & 6 of Table 12. It is observed that, due to lack of training data, both of the proposed approaches fail to distinguish correctly between forms such as ਅਤੇ /atē/ and ਉੱਤੇ /uttē/, as shown in the 5th row of Table 12. Similarly, in some other cases the system fails to predict named entity abbreviations, as shown in the 6th row of Table 12.

We can produce better results in the future by increasing the size of the training corpus and by exploiting contextual word similarities based on some predefined co-occurrence relations.

Sr. 1
N-gram Output: ਕਈ ਵਾਰ ਲੋਕਾਂ ਿਵਚ ਕੁੱਝ ਲੋਕ ਵੀ ਇਕ ਕੇ ਜ਼ਲਮੁ ਦਾ ਟਾਕਰਾ ਕਰਨ ਲਈ ਹਿਥਆਰ ਚੱਕ ਲਦੇ ਹਨ ਪਰ ਸੇਧ ਅਤੇ ਅਨੁਸ਼ਾਸਨ ਦੀ ਘਾਟ ਕਾਰਨ ਅਪਣਾ ਹੀ ਨੁਕਸਾਨ ਕਰਵਾ ਬੈਠਦੇ ਹਨ (kaī vār lōkāṃ vicōṃ kujjh lōk vī ik kē zulam dā ṭākrā karan laī hathiār cakk laindē han par sēdh atē anushāsan dī ghāṭ kāran apṇā hī nuksān karavā baiṭhdē han)
Word Ambiguity: {ਇਕ ਅੱਕ ਇੱਕ} {ਹਨ ਹੁਣ} {ਪਰ ਪ ਪੁਰ} {ਅਤੇ ਉੱਤੇ} {ਹਨ ਹੁਣ} — {ik akk ikk} {han huṇ} {par pr pur} {atē uttē} {han huṇ}
2nd order HMM Output: ਕਈ ਵਾਰ ਲੋਕਾਂ ਿਵਚ ਕੁੱਝ ਲੋਕ ਵੀ ਅੱਕ ਕੇ ਜ਼ਲਮੁ ਦਾ ਟਾਕਰਾ ਕਰਨ ਲਈ ਹਿਥਆਰ ਚੱਕ ਲਦੇ ਹਨ ਪਰ ਸੇਧ ਅਤੇ ਅਨੁਸ਼ਾਸਨ ਦੀ ਘਾਟ ਕਾਰਨ ਅਪਣਾ ਹੀ ਨੁਕਸਾਨ ਕਰਵਾ ਬੈਠਦੇ ਹਨ (kaī vār lōkāṃ vicōṃ kujjh lōk vī akk kē zulam dā ṭākrā karan laī hathiār cakk laindē han par sēdh atē anushāsan dī ghāṭ kāran apṇā hī nuksān karavā baiṭhdē han)

Sr. 2
N-gram Output: ਆਪਣੇ ਘਰਿਦਆਂ ਤੂੰ ਿਵਆਹ ਦੀ ਆਸ ਲਾਹ ਛੱਡ (āpaṇē ghardiāṃ tūṃ viāh dī ās lāh chaḍḍ)
Word Ambiguity: {ਤੂੰ ਤ} — {tūṃ tōṃ}
2nd order HMM Output: ਆਪਣੇ ਘਰਿਦਆਂ ਤ ਿਵਆਹ ਦੀ ਆਸ ਲਾਹ ਛੱਡ (āpaṇē ghardiāṃ tōṃ viāh dī ās lāh chaḍḍ)

Sr. 3
N-gram Output: ਬਚਪਨ ਿਵਚ ਅਸ ਉਨਾਂ ਕੋਲ਼ ਜਾਂਦੇ ਹੁੰਦੇ ਸੀ (bacpan vic asīṃ unhāṃ kōḷ jāndē hundē sī)
Word Ambiguity: {ਜਾਂਦੇ ਜਾਣਦੇ} — {jāndē jāṇdē}
2nd order HMM Output: ਬਚਪਨ ਿਵਚ ਅਸ ਉਨਾਂ ਕੋਲ਼ ਜਾਣਦੇ ਹੁੰਦੇ ਸੀ (bacpan vic asīṃ unhāṃ kōḷ jāṇdē hundē sī)

Sr. 4
N-gram Output: ਸਾਨੂੰ ਪੁੱਤਾਂ ਤ ਸਪਤ ਬਣਨ ਲਈ ਜੀਵਨ ਸੇਧ ਗੁਰੂ ਗੰਥ ਸਾਿਹਬ ਦੇ ਿਵਚ ਦਰਜ ਬਾਣੀ ਤ ਹੀ ਿਮਲ ਸਕਦੀ ਹੈ (sānūṃ puttāṃ tōṃ sapat baṇan laī jīvan sēdh gurū granth sāhib dē vic daraj bāṇī tōṃ hī mil sakadī hai)
Word Ambiguity: {ਤ ਤੂੰ} {ਤ ਤੂੰ} {ਿਮਲ ਮੱਲ ਿਮੱਲ ਮੁੱਲ} — {tōṃ tūṃ} {tōṃ tūṃ} {mil mall mill mull}
2nd order HMM Output: ਸਾਨੂੰ ਪੁੱਤਾਂ ਤ ਸਪਤ ਬਣਨ ਲਈ ਜੀਵਨ ਸੇਧ ਗੁਰੂ ਗੰਥ ਸਾਿਹਬ ਦੇ ਿਵਚ ਦਰਜ ਬਾਣੀ ਤੂੰ ਹੀ ਿਮਲ ਸਕਦੀ ਹੈ (sānūṃ puttāṃ tōṃ sapat baṇan laī jīvan sēdh gurū granth sāhib dē vic daraj bāṇī tūṃ hī mil sakadī hai)

Sr. 5
N-gram Output: ਮੇਰੇ ਉੱਤੇ ਅਨੁਪੀਤ ਦੇ ਹੈਿਮਲਟਨ ਰਿਹਣ ਕਰ ਕੇ ਜਸਜੀਤ ਵੀ ਸਾਡੇ ਪਾਸ ਆ ਗਈ (mērē uttē anuprīt dē haimilṭan rahiṇ kar kē jasjīt vī sāḍē pās ā gaī)
Word Ambiguity: {ਅਤੇ ਉੱਤੇ} — {atē uttē}
2nd order HMM Output: ਮੇਰੇ ਉੱਤੇ ਅਨੁਪੀਤ ਦੇ ਹੈਿਮਲਟਨ ਰਿਹਣ ਕਰ ਕੇ ਜਸਜੀਤ ਵੀ ਸਾਡੇ ਪਾਸ ਆ ਗਈ (mērē uttē anuprīt dē haimilṭan rahiṇ kar kē jasjīt vī sāḍē pās ā gaī)

Sr. 6
N-gram Output: ਪੋਫ਼ੈਸਰ ਏਸ ਐਨ ਿਮਸ਼ਰਾ (prōfaisar ēs ain mishrā)
Word Ambiguity: {ਏਸ ਇਸ ਐਸ} {ਐਨ ਇਨ} — {ēs is ais} {ain in}
2nd order HMM Output: ਪੋਫ਼ੈਸਰ ਇਸ ਐਨ ਿਮਸ਼ਰਾ (prōfaisar is ain mishrā)

Table 12. Sample Failure Cases

References

Dekang Lin. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of the 35th Annual Meeting of the ACL and 8th Conference of the European Chapter of the Association for Computational Linguistics. Madrid, Spain, 64-71.

Frederick Jelinek and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam, The Netherlands: North-Holland.

Harbhajan Singh. 1997. Medieval Indian Literature: An Anthology. Paniker K. Ayyappa (Ed.), Sahitya Akademi Publication, volume 2, 417-452.

Hinrich Schütze. 1992. Dimensions of meaning. In Proceedings of Supercomputing '92. Minneapolis, MN, 787-796.

Ido Dagan, Shaul Marcus and Shaul Markovitch. 1993. Contextual word similarity and estimation from sparse data. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL '93). Association for Computational Linguistics, Stroudsburg, PA, USA, 164-171. doi=10.3115/981574.981596

Ido Dagan, Shaul Marcus and Shaul Markovitch. 1995. Contextual word similarity and estimation from sparse data. Computer Speech and Language, 9:123-152.

Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING). Taiwan, 1-15.

Lawrence R. Rabiner. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285.

Nadir Durrani and Sarmad Hussain. 2010. Urdu word segmentation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 528-536.

Philip Resnik. 1992. WordNet and distributional analysis: A class-based approach to lexical discovery. In Proceedings of the AAAI Workshop on Statistically-based Natural Language Processing Techniques. Menlo Park, California, 56-64.

Philip Resnik. 1995. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora. Cambridge, 54-68.

Ralph Grishman and John Sterling. 1993. Smoothing of automatically generated selectional constraints. In Proceedings of the DARPA Conference on Human Language Technology. San Francisco, California, 254-259.

Ralph Grishman, Lynette Hirschman and Ngo Thanh Nhan. 1986. Discovery procedures for sublanguage selectional patterns: initial experiments. Computational Linguistics, 12(3):205-215.

Sant S. Sekhon. 1996. A History of Panjabi Literature, volumes 1 & 2. Publication Bureau, Punjabi University, Patiala, Punjab, India.

Scott M. Thede and Mary P. Harper. 1999. A second-order Hidden Markov Model for part-of-speech tagging. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 175-182. doi=10.3115/1034678.1034712

Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401.

Ute Essen and Volker Steinbiss. 1992. Cooccurrence smoothing for stochastic language modeling. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech and Signal Processing - Volume 1 (ICASSP '92), IEEE Computer Society, Washington, DC, USA, 161-164.

Yael Karov and Shimon Edelman. 1996. Learning similarity-based word sense disambiguation from sparse data. In Proceedings of the Fourth Workshop on Very Large Corpora. Copenhagen, Denmark, 42-55.
