Sangam: a Perso-Arabic to Indic Script Machine Transliteration Model

Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala 147002, Punjab, India

Tejinder Singh Saini
Advanced Center for Technical Development of Punjabi Language Literature and Culture, Punjabi University, Patiala, Punjab, India

D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 232–239, Goa, India. December 2014. © 2014 NLP Association of India (NLPAI)

Abstract

The Indian subcontinent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, which is written in the Gurmukhi script (a left-to-right script based on Devanagari) in Indian East Punjab and in Shahmukhi (a right-to-left script based on Perso-Arabic) in Pakistani West Punjab. This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written in mutually incomprehensible forms). Similarly, the Sindhi and Kashmiri languages are written in both Perso-Arabic and Devanagari scripts. Thus there is a dire need for the development of transliteration tools for conversion between Perso-Arabic and Indic scripts. In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. Sangam is a hybrid system which combines rules as well as word and character level language models to transliterate the words. The system has been designed in such a fashion that the main code, algorithms and data structures remain unchanged, and for adding a new script pair only the databases, mapping rules and language models for the script pair need to be developed and plugged in. The system has been successfully tested on Punjabi, Urdu and Sindhi languages and can be easily extended to other languages like Kashmiri and Konkani.

1 Introduction

The Indian subcontinent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, spoken by tens of millions of people, but written in Indian East Punjab (20 million) in the Gurmukhi script (a left-to-right script based on Devanagari) and in Pakistani West Punjab (80 million) in Shahmukhi (a right-to-left script based on Perso-Arabic). Whilst the Punjabi spoken in the Eastern and the Western parts is mutually comprehensible in speech, in the written form it is not. This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written, as with Punjabi, in mutually incomprehensible forms). Hindi is written in the Devanagari script from left to right, while Urdu is written from right to left in a script derived from a Persian modification of the Arabic script. A similar problem exists for the Sindhi language, which is written in a Perso-Arabic script in Pakistan and in both Perso-Arabic and Devanagari in India. The same is true of the Kashmiri language. Konkani is probably the only language in India which is written in five scripts: Roman, Devanagari, Kannada, Perso-Arabic and Malayalam (Carmen Brandt, 2014). The existence of multiple scripts has created communication barriers: people can understand spoken communication, but the number who can follow written communication across scripts is much smaller. Thus a need arises for transliteration tools which can convert text written in one script of a language to another script. A common feature of all these languages is that one of the scripts is Perso-Arabic (Urdu, Sindhi, Shahmukhi etc.), while the other script is Indic (Devanagari, Gurmukhi, Kannada, Malayalam). Perso-Arabic is a right-to-left script, while Indic scripts are left-to-right scripts, and the two are mutually incomprehensible. Thus there is a dire need for the development of automatic machine transliteration tools for conversion between Perso-Arabic and Indic scripts.

Machine transliteration is an automatic method to generate characters or words in one alphabetical system for the corresponding characters in another alphabetical system. The transformation of text from one script to another is usually based on phonetic equivalencies. Transliteration is usually categorized as forward or backward transliteration. Forward transliteration refers to transliteration from the native language to a foreign language, while the process of recalling a word in the native language from a transliteration is defined as back-transliteration. Forward transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns, technical terms and out-of-vocabulary words. Back-transliteration, in contrast, is popularly used as an input mechanism for certain languages where typing in the native script is not very popular. In such cases, the user types the native language words and sentences (usually) in Roman script, and a transliteration engine automatically converts the Roman input back to the native script. This input mechanism is popularly used for all Indian languages including Hindi, Punjabi, Tamil, Telugu, etc., and also for Arabic, Chinese etc.

In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. The system has been successfully tested on Punjabi (Shahmukhi-Gurmukhi), Urdu (Urdu-Devanagari) and Sindhi (Sindhi Perso-Arabic - Sindhi Devanagari) languages and can be easily extended to other languages like Kashmiri and Konkani. One should note that the transliteration model presented in this paper can be categorized neither as forward nor as backward, since it is concerned with script conversion within the same language. The usual techniques for forward or backward transliteration therefore cannot be applied here, and we have to develop a special methodology to handle the transliteration issues related to conversion between scripts of the same language.

2 Related Work

The first transliteration system for a Perso-Arabic to Indic script was presented by Malik (2006), who described a Shahmukhi to Gurmukhi transliteration system with 98% accuracy. But this accuracy was achieved only when the input text had all the diacritical marks necessary for removing ambiguities, and manually inserting missing diacritical marks is not practical for many reasons, such as large input size, the need for manual intervention, and the need for a person with knowledge of both scripts. Saini et al. (2008) developed a system which could automatically insert the missing diacritical marks in the Shahmukhi text and convert the text to Gurmukhi. The system was implemented with various research techniques based on corpus analysis of both scripts, and an accuracy of 91.37% at word level was reported.

Durrani et al. (2010) presented an approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. They proposed two probabilistic models, based on conditional and joint probability formulations, and reported an accuracy of 81.4%. Lehal and Saini (2012) presented an Urdu to Hindi transliteration system and claimed an accuracy of 97.74% at word level. The various challenges, such as multiple/zero character mappings, missing diacritic marks in Urdu, multiple Hindi words mapped to an Urdu word, and word segmentation issues in Urdu text, were handled by generating special rules and using various lexical resources such as n-gram language models at word and character level and an Urdu-Hindi parallel corpus. Recently Malik et al. (2013) analysed the application of statistical machine translation to the problem of Urdu-Hindi transliteration using a parallel lexicon. The authors reported a word level accuracy of 77.8% when the input Urdu text contained all necessary diacritical marks and 77% when it did not, which is much below the accuracy reported in earlier works.

A rule based converter for the Kashmiri language from Perso-Arabic to Devanagari script has been developed by Kak et al. (2010), and the authors have claimed 90% conversion accuracy.

Leghari and Rehman (2010) have discussed the different issues, complexities and problems of Sindhi transliteration and presented a model for transliteration between the Perso-Arabic and Devanagari scripts of the Sindhi language, which is based on an intermediate Roman script.

Malik et al. (2010) described a finite-state scriptural translation model based on finite state machines to convert the scripts for the Urdu, Punjabi and Seraiki languages. But the transliteration results for Urdu-Hindi, Punjabi Shahmukhi-Gurmukhi and Seraiki Shahmukhi-Gurmukhi have not been very encouraging, with transliteration accuracy at word level ranging from 31.2% to 58.9% for the Urdu-Devanagari script pair and 67.3% for Shahmukhi-Gurmukhi.

3 Challenges in Perso-Arabic to Indic Script Transliteration

Perso-Arabic script | Word | Indic script | Indic transliteration | Actual transliteration
Urdu | دنیا | Devanagari | दनया | दुनिया
Shahmukhi | وچ | Gurmukhi | ਵਚ | ਵਿੱਚ
Sindhi | سنڌ | Devanagari | सनध | सिंधु

Table 1. Transliteration without diacritical marks

3.2 Filling the Missing Script Maps

Many characters present in the Perso-Arabic script have no corresponding character in the Indic scripts, e.g. Hamza (ء), Do-Zabar (◌ً), Aen (ع), Khadi Zabar (◌ٰ) etc.

3.3 Multiple Mappings for Perso-Arabic Characters

It is observed that many Perso-Arabic characters have multiple mappings into Indic script, as shown in Table 2.
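As a minimal sketch of why such one-to-many mappings are a problem, the Python fragment below expands an Urdu word into every Devanagari candidate its characters allow and then picks the candidate found in a word list. The mapping entries, the frequency counts and the function names are illustrative assumptions, not the contents of Table 2 or the actual Sangam resources, and a simple word-frequency lookup stands in for the word and character level language models mentioned in the abstract.

```python
from itertools import product

# Illustrative one-to-many character mappings (an assumption for this
# sketch, NOT the paper's Table 2). Waw can surface as a consonant or
# as one of several vowel signs in Devanagari.
CHAR_MAP = {
    "د": ["द"],
    "و": ["व", "ो", "ू", "ौ"],
    "ر": ["र"],
}

# Hypothetical target-side word frequencies; the counts are made up.
WORD_FREQ = {"दो": 120, "दूर": 45, "दौर": 12}


def candidates(word):
    """Return every Devanagari string licensed by the character map."""
    # Characters without a mapping are passed through unchanged.
    options = [CHAR_MAP.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def transliterate(word):
    """Pick the most frequent known candidate, else the first candidate."""
    cands = candidates(word)
    known = [c for c in cands if c in WORD_FREQ]
    if known:
        return max(known, key=WORD_FREQ.get)
    return cands[0]


if __name__ == "__main__":
    print(candidates("دو"))
    print(transliterate("دور"))
```

The number of candidates grows multiplicatively with each ambiguous character, which is one reason a purely rule based mapping is insufficient and some lexical or statistical evidence is needed to choose among them.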
