
Sangam: A Perso-Arabic to Indic Script Machine Transliteration Model

Gurpreet Lehal Tejinder Singh Saini Department of Computer Science Advanced Center for Technical Development of , Punjabi Literature and Culture 147002 , Punjabi University, Patiala Punjab, India gslehal@.com [email protected]

D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 232–239, Goa, India. December 2014. © 2014 NLP Association of India (NLPAI)

Abstract

The Indian sub-continent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, written in Indian East Punjab in Gurmukhi script (a left-to-right script based on Devnagri) and in Pakistani West Punjab in Shahmukhi (a right-to-left script based on Perso-Arabic). This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written in mutually incomprehensible forms). Similarly, Sindhi and Kashmiri languages are written in both Perso-Arabic and Devnagri scripts. Thus there is a dire need for development of transliteration tools for conversion between Perso-Arabic and Indic scripts. In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. Sangam is a hybrid system which combines rules as well as word and character level language models to transliterate the words. The system has been designed in such a fashion that the main code, algorithms and data structures remain unchanged, and for adding a new script pair only the databases, mapping rules and language models for the script pair need to be developed and plugged in. The system has been successfully tested on Punjabi, Urdu and Sindhi languages and can be easily extended for other languages like Kashmiri and Konkani.

1 Introduction

The Indian sub-continent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, spoken by tens of millions of people, but written in Indian East Punjab (20 million) in Gurmukhi script (a left-to-right script based on Devnagri) and in Pakistani West Punjab (80 million) in Shahmukhi (a right-to-left script based on Perso-Arabic). Whilst in speech the Punjabi spoken in the Eastern and the Western parts is mutually comprehensible, in the written form it is not. This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written, as with Punjabi, in mutually incomprehensible forms). Hindi is written in the Devnagri script from left to right, while Urdu is written in a script derived from a Persian modification of the Arabic script, written from right to left. A similar problem resides with the Sindhi language, which is written in a Perso-Arabic script in Pakistan and in both Perso-Arabic and Devnagri in India. Similar is the case with the Kashmiri language too. Konkani is probably the only language in India which is written in five scripts: Roman, Devnagri, Kannada, Persian-Arabic and Malayalam (Carmen Brandt, 2014). The existence of multiple scripts has created communication barriers: people can understand the spoken or verbal communication, but when it comes to scripts or written communication the number diminishes, and thus a need arises for transliteration tools which can convert text written in one script of a language to another script. A common feature of all these languages is that one of the scripts is Perso-Arabic (Urdu, Sindhi, Shahmukhi etc.), while the other script is Indic (Devnagri, Gurmukhi, Kannada, Malayalam). Perso-Arabic script is a right-to-left script, while Indic scripts are left-to-right scripts, and the two are mutually incomprehensible. Thus there is a dire need for development of automatic machine transliteration tools for conversion between Perso-Arabic and Indic scripts.

Machine transliteration is an automatic method to generate characters or words in one alphabetical system for the corresponding characters in another alphabetical system. The transformation of text from one script to another is usually based on phonetic equivalencies. Transliteration is usually categorized as forward and backward transliteration. Forward transliteration refers to transliteration from the native language to a foreign language, while the process of recalling a word in the native language from a transliteration is defined as back-transliteration. Forward transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns, technical terms and out of vocabulary words. Back-transliteration is popularly used as an input mechanism for certain languages, where typing in the native script is not very popular. In such cases, the user types the native language words and sentences (usually) in Roman script, and a transliteration engine automatically converts the Roman input back to the native script. This input mechanism is popularly used for Indian languages including Hindi, Punjabi, Tamil, Telugu, etc., and also for Arabic, Chinese etc.

In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. The system has been successfully tested on Punjabi (Shahmukhi-Gurmukhi), Urdu (Urdu-Devnagri) and Sindhi (Sindhi Perso-Arabic - Sindhi Devnagri) languages and can be easily extended for other languages like Kashmiri and Konkani. One should note that the transliteration model presented in this paper can be categorized neither as forward nor as backward, since it is concerned with script conversion within the same language. The usual techniques for forward or backward transliteration therefore cannot be applied here, and we have to develop a special methodology to handle the transliteration issues related to conversion between scripts of the same language.

2 Related Work

The first transliteration system for a Perso-Arabic to Indic script was presented by Malik (2006), who described a Shahmukhi to Gurmukhi transliteration system with 98% accuracy. But the accuracy was achieved only when the input text had all necessary diacritical marks for removing ambiguities, even though the process of putting missing diacritical marks is not practically possible due to many reasons like large input size, manual intervention, the need for a person having knowledge of both the scripts and so on. Saini et al. (2008) developed a system which could automatically insert the missing diacritical marks in the Shahmukhi text and convert the text to Gurmukhi. The system had been implemented with various research techniques based on corpus analysis of both scripts, and an accuracy of 91.37% at word level had been reported.

Durrani et al. (2010) presented an approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. They proposed two probabilistic models, based on conditional and joint probability formulations, and reported an accuracy of 81.4%. Lehal and Saini (2012) presented an Urdu to Hindi transliteration system and claimed an accuracy of 97.74% at word level. The various challenges such as multiple/zero character mappings, missing diacritical marks in Urdu, multiple Hindi words mapped to an Urdu word, word segmentation issues in Urdu text etc. were handled by generating special rules and using various lexical resources such as n-gram language models at word and character level and an Urdu-Hindi parallel corpus. Recently Malik et al. (2013) analysed the application of statistical machine translation for solving the problem of Urdu-Hindi transliteration using a parallel lexicon. The authors reported a word level accuracy of 77.8% when the input Urdu text contained all necessary diacritical marks and 77% when it did not, which is much below the accuracy reported in earlier works. A rule based converter for Kashmiri from Perso-Arabic to Devnagri script has been

developed by Kak et al. (2010), and the authors have claimed 90% conversion accuracy.

Leghari and Rehman (2010) have discussed the different issues, complexities and problems of transliteration and presented a model for transliteration between Perso-Arabic and Devnagri scripts of Sindhi, which is based on an intermediate Roman script.

Malik et al. (2010) described a finite-state scriptural translation model based on Finite State Machines to convert the scripts for Urdu, Punjabi and Seraiki languages. But the transliteration results for Urdu-Hindi, Punjabi Shahmukhi-Gurmukhi and Seraiki Shahmukhi-Gurmukhi have not been very encouraging, with transliteration accuracy at word level ranging from 31.2% to 58.9% for the Urdu-Devnagri script pair and 67.3% for Shahmukhi-Gurmukhi.

3 Challenges in Perso-Arabic to Indic Script Transliteration

Transliteration is not trivial to automate, and transliteration of Perso-Arabic script to Indic scripts is an even more challenging problem. Since the language does not change, it becomes important that the correct spellings and context of the words are maintained in the target script. The major challenges of transliteration of languages using Perso-Arabic script to Indic scripts are as follows:

3.1 Missing Diacritical marks and short Vowels

Diacritical marks are critical for correct pronunciation and sometimes even for disambiguation of certain words. The diacritical marks are also used for gemination (doubling of a consonant) and to mark the absence of a vowel following a base consonant. But the diacritical marks and short vowels are sparingly used in Perso-Arabic script. These missing diacritical marks and short vowels create substantial difficulties for transliteration systems, as the missing diacritic marks and vowels have to be guessed by the system and added for correct transliteration. For example, in Table 1 we see how words in Perso-Arabic script, which are commonly written without diacritic marks, will be transliterated in Indic script if we go in for character by character substitution and do not put in the missing short vowels.

Perso-Arabic script | Word | Indic script | Indic transliteration | Actual transliteration
Urdu | دنیا | Devnagri | दनया | दुनिया
Shahmukhi | وچ | Gurmukhi | ਵਚ | ਵਿੱਚ
Sindhi | سنڌ | Devnagri | सनध | सिंध
Table 1. Transliteration without diacritical marks

3.2 Filling the Missing Script Maps

There are many characters present in the Perso-Arabic script which have no corresponding character in Indic script, e.g. ء, ◌ً (Do-Zabar), ع (Aen), ◌ٰ (Khadi Zabar) etc.

3.3 Multiple Mappings for Perso-Arabic Characters

It is observed that corresponding to many Perso-Arabic characters there are multiple mappings into Indic script, as shown in Table 2. Additional information such as grammar rules and context is needed to select the appropriate Indic script character for such Perso-Arabic characters.

Perso-Arabic script | Char | Indic script | Equivalent mappings
Urdu | و | Devnagri | व, ◌ो, ◌ौ, ◌ु, ◌ू, ऊ, ओ, औ
Shahmukhi | ن | Gurmukhi | ◌ਂ, ◌ੰ, ਨ, ਣ
Table 2. Multiple Mappings of Perso-Arabic characters

3.4 Transliteration Ambiguity at Word level

Due to multiple character mappings and missing short vowels, many words in Perso-Arabic script get mapped to multiple Indic words, as shown in Table 3. Higher level language information will be needed to choose the most relevant word in Indic script.

Perso-Arabic script | Word | Indic script | Equivalent words in Indic script
Urdu | کیا | Devnagri | क्या, किया
Shahmukhi | ہن | Gurmukhi | ਹਨ, ਹੁਣ
Sindhi | جان | Devnagri | जां, जान, जानि
Table 3. Multiple Mappings of Perso-Arabic words


3.5 Word-Segmentation Issues

Space is not consistently used in Perso-Arabic words, which makes word segmentation a non-trivial task. Many times the space is deleted, resulting in many Perso-Arabic words being jumbled together, and many other times an extra space is put in a word, resulting in over segmentation of that word. This problem is more pronounced in Urdu and Shahmukhi scripts as compared to Sindhi script. Table 4 shows samples of Urdu and Shahmukhi words containing multiple merged words, and their transliterations if the words are transliterated as such without splitting them at proper positions.

Perso-Arabic word | Transliteration without splitting | Actual transliteration
انکارکردیاہے (Urdu script) | अनकारकरदयाहे (Devnagri script) | इन्कार कर दिया है (Devnagri script)
پہلاشکار (Shahmukhi script) | ਪਹਲਾਸ਼ਕਾਰ (Gurmukhi script) | ਪਹਿਲਾ ਸ਼ਿਕਾਰ (Gurmukhi script)
Table 4. Merged Perso-Arabic Words and their Transliterations without splitting words

4 System Architecture

The system architecture of the Perso-Arabic - Indic transliteration model developed by us is shown in Figure 1. The system has been designed in such a fashion that the main code, algorithms and data structures remain unchanged, while the databases, mapping rules and language models are plugged in depending on the script pair. The source text is in script S1 while the target text is in script S2. For example, if text in Urdu script has to be converted to Devnagri, then we need to plug in the word frequency list of Urdu and the Urdu-Devnagri dictionary, along with n-gram language models at word and character level for Devnagri script and mapping tables for Urdu to Devnagri transliteration. The system has been successfully tested on three script pairs (Urdu-Hindi, Sindhi-Devnagri and Shahmukhi-Gurmukhi) and has been able to successfully handle most of the issues raised in the previous section. We have developed the lexical resources for all the scripts, and depending on our need the relevant data is used. In case a new script pair has to be added, only the lexical resources have to be created.

As can be seen in Figure 1, the complete transliteration system is divided into three stages: pre-processing, processing and post-processing. In the pre-processing stage, the text in S1 script is cleaned and prepared for transliteration by normalizing and joining the broken Perso-Arabic words. In the processing stage, corresponding to each word in S1 script, one or several possible words in S2 are generated. If only one word is produced, then that word is finalised. Otherwise, for multiple alternatives, the final decision is taken in the post-processing stage. In the post-processing stage, the final decision about choosing from multiple S2 alternatives is made using language models for S2. The three stages are discussed in detail in the following sections.

4.1 Pre-Processing

In the pre-processing stage, the Perso-Arabic words are cleaned and prepared for transliteration by normalizing the words as well as joining the broken words. The two main stages in pre-processing are:

4.1.1 Normalizing Perso-Arabic words

Two kinds of normalization are required for Perso-Arabic words. First, a letter may be represented by multiple Unicode code points, and thus the redundancy in encoding has to be cleaned in raw text before further processing. For example, from the transliteration point of view, ى (0649), ي (064A) and ی (06CC) represent the same character in Perso-Arabic script. Secondly, a letter or a ligature is sometimes encoded in composed form as well as decomposed form. Thus, the two equivalent representations must also be reduced to the same underlying form before further processing. For example, آ (0622) can also be represented by the combination ا (0627) + ◌ٓ (0653). All such forms are normalized to have only one representation.

4.1.2 Joining the broken Perso-Arabic words

The transliteration system faces many problems related to word segmentation of Perso-Arabic script, as in many cases space is not properly put between words. Sometimes it is deleted, resulting in many Perso-Arabic words being jumbled together, and many other times an extra space is put in a word, resulting in over segmentation of that word. The space insertion problem is handled in the pre-processing stage, while the space deletion problem is handled in the processing stage. The space insertion problem usually occurs due to the conventional way of writing in Perso-Arabic script or due to extra space being inserted during typing. The typing related space insertion problems are handled by using the word frequency list of script S1 (Lehal, 2009). If the product of the probabilities of occurrence of two adjacent words in S1 is less than the probability of occurrence of the word formed by joining the two, then the two words are joined together.

[Figure 1 is a flowchart: Text in S1 script → Pre-Processing (Normalise Words; Merge Broken Words, using the Word Frequency List of S1) → Processing (Database Lookup in the S1-S2 Dictionary; Rule Based/Statistical Transliteration for OOV/Merged Words, using Mapping Tables, S2 Character Trigrams and S2 Word Unigrams) → Post-Processing (Word Sense Disambiguation, using Word level n-gram Language Models for S2) → Transliterated Text in S2 Script.]
Figure 1. System Architecture

4.2 Processing Stage

This is the main stage, and in this stage, corresponding to each word in S1 script, one or several possible words in S2 script are generated. For multiple alternatives, the final decision is taken in the post-processing stage. First the word is searched in the S1-S2 dictionary, and if it is found, then all its alternatives are passed on to the post-processing stage. In case the word is not found, it is fed to a multi-stage transliteration engine. In the first stage a Hybrid-wordlist-generator (HWG) is used to convert the word. The HWG uses the mapping rules and a trigram character language model to generate a set of words in S2. A unigram word language model is then used to rank these words, after dropping words with zero probability. If there is no word with non-zero probability, then the word is inspected for presence of merged words which can be transliterated to non-empty sets of words in S2. If no such sets can be generated, then we use the simple character mapping rules to convert the word to S2. We now discuss in detail the main modules used in the multi-stage transliteration engine. These modules are:

4.2.1 Hybrid-wordlist-generator (HWG)

This is the major module in the transliteration engine. It generates multiple transliterations for a word in S1. The multiple outputs are produced due to ambiguity both at character and word level, as already mentioned in the above sections.

The sequence of probable Indic words is produced by a hybrid system, which uses rule based character mapping tables and a trigram character language model. The Perso-Arabic word is processed character by character, and the characters are mapped directly to their corresponding similar sounding Indic characters based on their position in the word and syntax rules (snippet shown in Table 5). In most of the cases there is a 1-1 mapping, but a few characters such as ن, و, ی have multiple mappings, and also some character combinations in Perso-Arabic script such as تھ (062A+06BE) have a single representation in some of the Indic scripts, such as in Devnagri (थ) or Gurmukhi (ਥ), while in Sindhi (Devnagri) we have the character combination तह. Similarly, the character ۾ (U06FE) gets mapped to the word में in Sindhi (Devnagri) script, while it has no equivalent mapping in Devnagri and Shahmukhi scripts. In some cases the mapping is dependent on the position of the character. For example, the character ا (U0627) is mapped to the character अ if it is at the beginning of the word, else it gets mapped to अ and ◌ा in Devnagri script. Similarly, the character ن

(U0646) gets mapped to the character न in Sindhi (Devnagri) if it is at the start of the word, and gets mapped to न, ◌ं and ◌ँ otherwise. The same character, ن (U0646), gets mapped to न and ण in Devnagri if it comes at the beginning or ending of a word, else it gets mapped to न, ण, ◌ं and ◌ँ.

Char At | Perso-Arabic Char | Devnagri | Gurmukhi | Sindhi (Dev)
Any | ب | ब | ਬ | ब
Any | تھ | थ | ਥ | तह
Any | ۾ | - | - | में
Start | و | व, ऊ, ओ, औ | ਵ, ਊ, ਓ, ਔ | व, ऊ, ओ, औ
Start | ن | न, ण | ਨ, ਣ | न
Start | ا | अ | ਅ | अ
Mid | ن | न, ण, ◌ं, ◌ँ | ਨ, ਣ, ◌ਂ, ◌ੰ | न, ◌ं, ◌ँ
Mid/end | ا | अ, ◌ा | ਅ, ◌ਾ | अ, ◌ा
Mid/end | و | व, ◌ो, ◌ौ, ◌ु, ◌ू, ऊ, ओ, औ | ਵ, ◌ੂ, ◌ੋ, ◌ੌ, ਊ, ਓ, ਔ | व, ◌ो, ◌ौ, ◌ु, ◌ू, ऊ, ओ, औ
End | ن | न, ण | ਨ, ਣ | न, ◌ं, ◌ँ
Table 5. Portion of Perso Arabic - Indic character mapping Tables

As already mentioned, the short vowels and diacritical marks are usually omitted in Perso-Arabic text and there are no half characters in Perso-Arabic script, so the resultant text in Indic script has poor accuracy. For example, the Urdu word قسمت in Perso-Arabic script gets transliterated to क़समत, while the actual word in Devnagri should be क़िस्मत. This is because the character स is written as a half character in Devnagri, while the short vowel ि◌, which is missing in the original Perso-Arabic word, has to be written in Devnagri to maintain proper spelling. To fill in these missing diacritical marks and put half characters at appropriate locations in the Indic word, we consider all its possible mappings in Indic, which include the missing short vowels and half characters. So we modify our mapping table to include all such forms for all the Perso-Arabic characters, resulting in multiple mappings. Thus, for example, the character combination تھ (062A+06BE) and the character و (0648) get mapped as shown in Table 6.

We form all possible combinations which could be generated from these multiple mappings, and the top N combinations are retained. The character based trigram language model for Indic script is used to select the top N combinations. To avoid processing an exponential number of candidates, we process the input characters one at a time and use the character trigram probability for pruning the partially generated candidates at each step.

Char At | Perso-Arabic Char | Devnagri | Gurmukhi | Sindhi (Dev)
Any | تھ | थ, थ्, थि, थु | ਥ, ਥਿ, ਥੁ | [not legible in the source]
[The remaining rows, for و and ن at Start, Mid and End positions, are not legible in the source.]
Table 6. Portion of Modified Perso Arabic - Indic character mapping Tables

It should be noted that not all the suggestions generated by the character language model are valid words in Indic script. To further rank these words, we use the Unigram Word Model for Indic script. Words with zero probability in the Unigram word model are ignored and the rest of the words are ranked on their probabilities. It could also happen that all the top N alternatives suggested by the character level trigram have zero probability, in which case no alternative is returned by HWG.

4.2.2 Merged word segmentation

As already discussed above, space is not consistently used in Perso-Arabic, which gives rise to both space omission and space insertion errors. Due to the space deletion problem, a sequence of words is jumbled together as a single word, and when the HWG tries to generate the equivalent Indic alternatives it fails. The sequence of Perso-Arabic words written together without space is still readable because of the character joining property in Perso-Arabic. We have used the space deletion algorithm presented by Lehal (Lehal, 2010) to split the Perso-Arabic words. The algorithm makes use of the unigram wordlist of S2 and statistical word disambiguation techniques to first detect if the Perso-Arabic word contains multiple words. In case multiple words are present, the algorithm splits them at appropriate positions. The Indic script alternatives for these individual Perso-Arabic words are then generated using the HWG module.

4.2.3 Handling Out of Vocabulary Words

For out of vocabulary words, no possible suggestions will be generated by the dictionary or HWG. So if, after passing through all the modules, still no transliteration alternatives are generated, it implies that the word is out of vocabulary. For such words the Indic script word is generated by using the mapping rules and the trigram character language model, and the topmost alternative is selected for further processing.

4.3 Post Processing Stage

The main task of post processing is to select the best alternative amongst the various transliteration options. The HWG module presents a set of ranked transliterations instead of a single transliteration, due to multiple character mappings as shown in Tables 5 and 6. Up to this point, we were only considering the Indic words in isolation, without any consideration of their neighbouring words. Now we consider the whole sentence instead of isolated words.

To choose between the different alternatives we have used the word trigram probability. To take care of the sparseness in the trigram model, we have used deleted interpolation, which offers the solution of backing away from low count trigrams by augmenting the estimate using bigram and unigram counts. The deleted interpolation trigram model assigns a probability to each trigram which is the linear interpolation of the trigram, bigram, unigram and uniform models:

\[
\Pr(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})} + \lambda_2 \frac{c(w_{i-1} w_i)}{c(w_{i-1})} + \lambda_1 \frac{c(w_i)}{N} + \lambda_0 \frac{1}{V}
\]

where N = number of words in the training corpus and V = size of the vocabulary. The weights are set automatically using the Expectation-Maximization (EM) algorithm.

5 Experimental Results

We have tested our system on text in Perso-Arabic script in Urdu, Punjabi and Sindhi languages and converted it to the respective Indic scripts. The transliterated text has been manually evaluated. The results are tabulated in Table 7. We can see from the table that the transliteration accuracy for the three scripts ranges from 91.68% to 97.75%, which is the best accuracy reported so far in literature for script pairs in Perso-Arabic and Indic scripts. As can be observed, the transliteration accuracy for Sindhi is much lower than for Urdu and Punjabi. The main reasons for this are:
a) Lack of linguistic resources and digital text in Sindhi (Devnagri).
b) High level of ambiguity at word level in Sindhi (Perso-Arabic) words, which is much more pronounced than in Shahmukhi and Urdu words.
A sample of the output for the three scripts is shown in Figure 2.

Script Pair | Words | Transliteration Accuracy
Urdu-Devnagri | 30,248 | 97.75%
Shahmukhi-Gurmukhi | 26,141 | 97.02%
Sindhi (Perso Arabic) to Sindhi (Devnagri) | 29,131 | 91.68%
Table 7. Word level Transliteration Accuracy of different script pairs

[Urdu source text not recoverable]
सब काम अपने वक़्त पर ही होते हैं। हमें काम करना चाहिए। फल की फ़िक्र नहीं करनी चाहिए।
a) Urdu-Devnagri

[Shahmukhi source text not recoverable]
ਸਾਰੇ ਕੰਮ ਆਪਣੇ ਸਮੇਂ ਸਿਰ ਹੀ ਹੁੰਦੇ ਹਨ। ਸਾਨੂੰ ਕੰਮ ਕਰਨਾ ਚਾਹੀਦਾ ਹੈ, ਫਲ ਦੀ ਚਿੰਤਾ ਨਹੀਂ ਕਰਨੀ ਚਾਹੀਦੀ।
b) Shahmukhi-Gurmukhi

سڀ ڪم ﭘﻨﮭﻨﺠﻲ وڪت ﺗﻲ ﺋﻲ ٿﯾﻨﺪا آھﻨﻲ۔ اﺳﺎن ﮐﻲ ڪم ڪرڻ ﮔﮭﺮﺟﻲ، لڦ ِﺟﻲ ﭼﻨﺘﺎ ﻧﮫ ڪرڻ ﮔﮭﺮﺟﻲ ۔

सभु कम पंहिंजे वक़्त ते ई थींदा आहनी। असां खे कमु करणु घुर्जे, फल जी चिंता न करणु घुर्जे।
c) Sindhi (Perso-Arabic) - Sindhi (Devnagri)
Figure 2. Samples of Transliteration output of text in three languages (Urdu, Punjabi and Sindhi) by Sangam

Conclusion

In this paper, we have presented Sangam, a Perso-Arabic to Indic script machine transliteration model, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. The system has been successfully tested on Punjabi, Urdu and Sindhi languages and can be easily extended for other languages like Kashmiri and Konkani. The transliteration accuracy for the three languages ranges from 91.68% to 97.75%, which is the best accuracy reported so far in literature for transliteration from Perso-Arabic to Indic script. The system has been designed in such a fashion that the main code, algorithms and data structures remain unchanged, and for adding a new script pair only the databases, mapping rules and language models for the script pair need to be developed and plugged in.

Acknowledgments

The authors would like to acknowledge the support provided by PAN ASIA Grants and ISIF grants Australia for carrying out this research. The Sindhi language support provided by Dr. Bharat Ratanpal and Ms. Madhuri Wardey, Faculty of Technology & Engineering, MSU Baroda is also duly acknowledged.

References

Aadil Amin Kak, Nazima Mehdi and Aadil Ahmad -waye. 2010. Building a Cross Script Kashmiri Converter: Issues and Solutions. In Proceedings of Oriental COCOSDA (The International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques). Web access: http://desceco.org/O-COCOSDA2010/proceedings/paper_38.pdf

Carmen Brandt. 2014. Script as a Potential Demarcator and Stabilizer of Languages in South Asia. Language Documentation & Conservation Special Publication No. 7, in Language Endangerment and Preservation in South Asia, ed. by Hugo C. Cardoso, pp. 78-99.

Durrani, N., Sajjad, H., Fraser, A. and Schmid, H. 2010. Hindi-to-Urdu Machine Translation through Transliteration. In Proceedings of the 48th Annual Conference of the Association for Computational Linguistics, pp. 465–474, Uppsala, Sweden.

G. S. Lehal. 2009. A Two Stage Word Segmentation System For Handling Space Insertion Problem In Urdu Script. Proceedings of World Academy of Science, Engineering and Technology, Bangkok, Thailand, Vol. 60, pp. 321-324.

G. S. Lehal. 2010. A Word Segmentation System for Handling Space Omission Problem in Urdu Script. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), the 23rd International Conference on Computational Linguistics (COLING), pp. 43–50, Beijing.

Gurpreet S. Lehal and Tejinder S. Saini. 2012. Development of a complete Urdu-Hindi transliteration system. In Proceedings of the 24th International Conference on Computational Linguistics, pp. 643–652, Mumbai, India.

Leghari, M., and Rahman, M. 2010. Towards Transliteration between Sindhi Scripts by using Roman Script. Conference on Language and Technology. Islamabad: Authority, Pakistan. http://www.cle.org.pk/clt10/papers/Towards%20Transliteration%20between%20Sindhi%20Scripts%20by%20using%20Roman%20Script.pdf

M. G. Abbas Malik, Christian Boitet, and Pushpak Bhattacharyya. 2010. Finite-state scriptural translation. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (COLING '10). ACL, pp. 791-800, Stroudsburg, PA, USA.

Malik, M. G. Abbas. 2006. Punjabi Machine Transliteration. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 1137-1144.

Malik, M. G. Abbas, Boitet, Christian, Besacier, Laurent, Bhattcharrya, Pushpak. 2013. Urdu Hindi Machine Transliteration using SMT. The 4th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), a collocated event at International Joint Conference on Natural Language Processing (IJCNLP), pp. 43-57, Nagoya, Japan.

T. S. Saini, G. S. Lehal and V. S. Kalra. 2008. Shahmukhi to Gurmukhi Transliteration System. Coling: Companion volume: Posters and Demonstrations, pp. 177-180, Manchester, UK.