International Journal of Computer Science and Communication Vol. 2, No. 2, July-December 2011, pp. 521-526

DEVELOPMENT OF A PUNJABI TO ENGLISH TRANSLITERATION SYSTEM

Kamal Deep1 and Vishal Goyal2
1Department of Computer Science, Punjabi University, Patiala, E-mail: [email protected]
2Assistant Professor, Department of Computer Science, Punjabi University, Patiala, India, E-mail: [email protected]

ABSTRACT
Machine transliteration has gained prime importance as a supporting tool for machine translation and cross-language information retrieval, especially when proper names and technical terms are involved. The performance of machine translation and cross-language information retrieval depends heavily on accurate transliteration of named entities. Hence, the transliteration model must aim to preserve the phonetic structure of words as closely as possible. This paper addresses the problem of transliterating Punjabi to English using a rule based approach. The proposed scheme has been demonstrated on transliteration of common names from Punjabi to English and achieved an accuracy of 93.22%.

1. INTRODUCTION

Transliteration is the process of replacing words in a source language with their approximate phonetic or spelling equivalents in a target language. Commonly, transliteration is used to translate named entities across languages. Automatic transliteration is helpful for many applications, such as Machine Translation (MT), Cross Language Information Retrieval (CLIR) and Information Extraction (IE). Transliterating a word from the language of its origin to a foreign language is called Forward Transliteration, while transliterating a loan word written in a foreign language back to the language of its origin is called Backward Transliteration. This paper addresses the problem of forward transliteration of person names from Punjabi to English.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the Punjabi and English languages. Section 4 describes the architecture of the transliteration system. Experimental results and error analysis are discussed in Section 5. Finally, Section 6 concludes the paper.

2. RELATED WORK

Several approaches have been proposed for name transliteration. In grapheme-based approaches (ψG), transliteration is viewed as a process of mapping a grapheme sequence from a source language to a target language, ignoring the phoneme-level processes. In contrast, in phoneme-based approaches (ψP), the transliteration key is the pronunciation, or source phoneme, rather than the spelling, or source grapheme. This approach is basically a source grapheme-to-source phoneme transformation followed by a source phoneme-to-target grapheme transformation. Hybrid approaches (ψH) simply combine the grapheme-based transliteration probability Pr(ψG) and the phoneme-based transliteration probability Pr(ψP) using linear interpolation.

Vijaya, VP, Shivapratap and KP CEN [1] have developed an English to Tamil transliteration system using WEKA. It is a rule based system that uses the J48 decision tree classifier of WEKA for classification purposes. The transliteration process consists of four phases: preprocessing, feature extraction, training and transliteration. The system has been tested with 1000 English names that were outside the corpus. The transliteration model produced an exact transliteration in Tamil from English words with an accuracy of 84.82%.

Chinnakotla, Damani, Satoskar [2] have developed transliteration systems for resource scarce languages. They have developed rule based systems for Hindi to English, English to Hindi, and Persian to English transliteration tasks. They used CSM (Character Sequence Modeling) on the source side for word origin identification, a manually generated non-probabilistic character mapping rule base for generating transliteration candidates, and then again used the CSM on the target side for ranking the generated candidates. The overall efficiency reported using the CRF (Conditional Random Field) approach is 67.0% for English to Hindi, 70.7% for Hindi to English and 48.0% for Persian to English.

Lehal and Singh [3] have developed a Shahmukhi to Gurmukhi transliteration system based on a corpus approach. In this system, mappings are done first, covering Simple Consonants, Aspirated Consonants (AC), Vowels, and other Diacritical Marks or Symbols. The system is divided into two phases. The first phase performs pre-processing and rule-based transliteration tasks and the second phase performs post-processing. The overall accuracy of the system has been reported to be 91.37%.

Malik [4] has developed the Punjabi Machine Transliteration (PMT) system, which is rule-based. PMT has been used for Shahmukhi to Gurmukhi transliteration. PMT preserves both the phonetics and the meaning of the transliterated word. The primary limitation of this system is that it works only on input data which has been manually edited for missing vowels or diacritical marks (the basic ambiguity of written Arabic script), which practically has limited use. The accuracy of the system has been reported to be 98.95%.

Verma [5] has developed a Gurmukhi to Roman transliteration system named GTrans. He surveyed existing Roman-Indic script transliteration techniques and developed a transliteration scheme based on ISO 15919 and ALA-LC. It is a rule based system. He has also done reverse transliteration from Roman to Gurmukhi. The overall accuracy of the system has been reported to be 98.43%.

Hong, Kim, Lee and Chang [6] have developed an English-Korean name transliteration system using a hybrid approach. In the transliteration process, first, a phrase-based SMT model with some factored translation features has been used. Second, they have expanded the base system by applying web-based n-best re-ranking of the results. Third, they have applied a pronouncing dictionary-based method to the base system, which utilizes pronunciation symbols and is motivated by linguistic knowledge. Finally, a phonics-based method is applied, which was originally designed for teaching speakers of English to read and write that language. The experimental results with three n-best re-ranking techniques showed that web-based re-ranking is a useful method. Their standard run and best standard run achieved accuracies of 45.1% and 78.5% respectively.

Ali and Ijaz [7] have developed an English to Urdu transliteration system based on mapping rules. The whole process has three steps. In the first step, mapping rules are used to generate Urdu text from the English transcription. English text is converted to Urdu using both English pronunciation and mapping rules. In the second step, Urdu syllabification is applied to the English transcription. Consonants and vowels are combined to make syllables; breaking up a word into syllables is known as syllabification. To improve the system's accuracy, they have applied Urduization rules in the third step. The overall accuracy of the system is 96%.

Hoon Oh and Choi [8] have developed an English-Korean transliteration system using a hybrid approach, since it uses both phonetic information, such as a phoneme and its context, and orthography. The method is composed of two phases, i.e. alignment and transliteration. First, an English pronunciation unit and its corresponding phoneme are aligned phonetically. Second, English words are transliterated into Korean words through several steps. A pronunciation is assigned to a given English word using an English pronunciation dictionary (P-DIC). If the word is not found in P-DIC, the system investigates whether it has a complex word form. For detecting a complex word form, a given English word is divided into two words (word+word) using entries of P-DIC. If both of them are in P-DIC, the system can assign a pronunciation to the given word; otherwise the system must estimate the pronunciation. Then, the system checks whether the English word is of Greek origin or not. Because E-K transliteration for English words of Greek origin differs from that for pure English words, it is important to detect this. The pronunciation of English words which are not registered in P-DIC is estimated in the next step. Finally, Korean transliterated words are generated using conversion rules. Evaluation has been performed through Word Accuracy (WA) and Character Accuracy (CA). This system has reported accuracy of 90.82% for WA and 56% for CA.

Yaser, Knight [9] have developed an Arabic to English transliteration system based on sound and spelling mapping using finite state machines. They have combined the phonetic-based model and the spelling-based model into a single transliteration model. For testing they have used a development data set and a blind data set. The overall accuracy has been reported to be 53.66% on the development data set and 61% on the blind data set. The reason for the higher accuracy on the blind data set is that it consists mostly of highly frequent names of prominent politicians, whereas the development set also contains names of writers and less common political figures.

3. PUNJABI & ENGLISH LANGUAGE

In this section we discuss the Punjabi and English languages.

3.1 Punjabi Language

The Punjabi language is written in the Gurmukhi script. The Gurmukhi script was derived from the Sharada script and standardized by Guru Angad Dev in the 16th century. It was designed to write the Punjabi language. The meaning of Gurmukhi is “from the mouth of the Guru”. The Gurmukhi (or Punjabi) alphabet contains thirty-five distinct letters. The first three letters are unique because they form the basis for vowels and are not consonants. Apart from Aira, these characters are never used on their own.

The remaining letters are consonants. In addition to these, there are six consonants created by placing a dot (bindi) at the foot (pair) of the consonant.

In addition to this, there are nine dependent vowel signs used to create ten independent vowels with three bearer characters: Ura [ੳ], Aira [ਅ] and Iri [ੲ].

3.2 English Language

The English language is written in the Roman script. English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England. It is one of the six official languages of the United Nations. India is one of the countries where English is spoken as a second language. There are 26 letters in English, of which 21 are consonants and 5 are vowels. The vowels are: a, e, i, o, u.

4. ARCHITECTURE

Our basic rule based transliteration system works by employing a set of character mapping or character sequence mapping rules between the languages involved. Punjabi words are written in the Gurmukhi script while English words are written in the Roman script. Each Gurmukhi consonant symbol that is not followed by a vowel sign represents that consonant plus an inherent schwa vowel sound. Note that the schwa is not pronounced in certain contexts. A snippet of the direct mapping of vowels and consonants is shown in Tables 1, 2, 3 and 4. The accuracy of the system using direct mapping alone was very low. To improve the accuracy we have developed different rules. In our system, rules also include constraints which specify the context in which they are applicable, such as Start of a Word (S), Ending of a Word (E), After Vowel (AV), After Consonant (AC) etc. Combining the different mapping options for each character of an input Punjabi word results in different transliteration candidates. For example, consider a Punjabi word whose six characters have 2, 2, 1, 1, 1 and 2 possible mappings respectively. Hence a total of 2*2*1*1*1*2 = 8 transliteration candidates should be considered (examples: waishali, wayshali, vaishali, vayshali, waishalee, wayshalee etc.).

Table 1 Independent Vowels Mapping

Table 2 Dependent Vowels Mapping

Table 3 Consonant Mapping

Table 4 Mapping of Special Symbols
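The candidate generation described in Section 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule table `RULES`, the helper names, and the choice of context tags checked are hypothetical, and the Gurmukhi spelling used for the example word is an assumption inferred from the Roman candidates listed in the text. The only property taken from the paper is that six characters with 2, 2, 1, 1, 1 and 2 options yield 8 candidates.

```python
from itertools import product
import unicodedata

NUKTA = "\u0A3C"  # dot (bindi) placed at the foot of a consonant
# Dependent vowel signs, used here only to decide After Vowel / After Consonant.
VOWEL_SIGNS = {"\u0A3E", "\u0A3F", "\u0A40", "\u0A41", "\u0A42",
               "\u0A47", "\u0A48", "\u0A4B", "\u0A4C"}

# Hypothetical rule base: (Gurmukhi token, context tag) -> Roman options.
# A context of None means the rule applies wherever no tagged rule matches.
RULES = {
    ("\u0A35", None): ["w", "v"],      # VA
    ("\u0A48", None): ["ai", "ay"],    # vowel sign AI
    ("\u0A38\u0A3C", None): ["sh"],    # SA + nukta (SHA)
    ("\u0A3E", None): ["a"],           # vowel sign AA
    ("\u0A32", None): ["l"],           # LA
    ("\u0A40", None): ["i"],           # vowel sign II, default
    ("\u0A40", "E"): ["i", "ee"],      # vowel sign II at Ending of a Word
}


def tokenize(word):
    """Split a word into characters, keeping a nukta with its consonant."""
    tokens = []
    for ch in unicodedata.normalize("NFD", word):
        if ch == NUKTA and tokens:
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens


def context_tags(tokens, i):
    """Return the context tags (S, E, AV, AC) that hold at position i."""
    tags = set()
    if i == 0:
        tags.add("S")
    if i == len(tokens) - 1:
        tags.add("E")
    if i > 0:
        # Simplification: anything that is not a vowel sign counts as a consonant.
        tags.add("AV" if tokens[i - 1] in VOWEL_SIGNS else "AC")
    return tags


def options_for(tokens, i):
    """Prefer a context-specific rule; otherwise fall back to the default."""
    tags = context_tags(tokens, i)
    for tag in ("S", "E", "AV", "AC"):
        if tag in tags and (tokens[i], tag) in RULES:
            return RULES[(tokens[i], tag)]
    return RULES[(tokens[i], None)]


def candidates(word):
    """Enumerate every combination of per-character mapping options."""
    tokens = tokenize(word)
    per_char = [options_for(tokens, i) for i in range(len(tokens))]
    return ["".join(parts) for parts in product(*per_char)]


# Assumed Gurmukhi spelling behind the candidates "waishali", "vaishali", ...
word = "\u0A35\u0A48\u0A38\u0A3C\u0A3E\u0A32\u0A40"
print(len(candidates(word)))  # 8
```

Keying the rules on a (character, context) pair keeps the direct mapping and the context constraints in one table; a real system would still need a way to rank the resulting candidates.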

5. EXPERIMENTAL RESULT AND ERROR ANALYSIS

In this section, we discuss the accuracy of our system.

5.1 Evaluation Metrics

The main aim of the system is effective transliteration of names from Punjabi to English. Thus, for evaluating our system, an accuracy test and an error analysis have been carried out. To measure the quality of the transliteration results, Word Accuracy is calculated using the following equation:

Accuracy = (C/N) * 100

where
C = the total number of correctly transliterated words and
N = the total number of test words.

5.2 Evaluation Data

We have divided the data set into two parts: a training data set and a test data set. The training data set consisted of 1013 person names; using these names we have made the rules for transliteration from Punjabi to English. The test data set consists of original data of the kind on which the system will be deployed. Our system is accurate for Punjabi words but not for foreign words. For evaluating the system we took names from different domains, such as person names, city names, state names, river names etc. We have made two test cases. Test case 1 contains person names. Test case 2 contains city names, state names and river names.

Test Case      Domain                                    Number of names
Test Case 1    Person names                              1923 (351 duplicates)
Test Case 2    City names, State names, River names      128
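The Word Accuracy metric of Section 5.1 can be sketched in a few lines of Python. The function name `word_accuracy` and the four-name sample are hypothetical and serve only to show the arithmetic; they are not the paper's test data.

```python
def word_accuracy(predicted, reference):
    """Return (C / N) * 100, where C counts exact matches and N is the test size."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("the test set must not be empty")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference) * 100


# Toy data for illustration only: three of the four outputs match the reference.
predicted = ["vaishali", "kamal", "patiala", "vishal"]
reference = ["vaishali", "kamal", "patiala", "wishal"]
print(word_accuracy(predicted, reference))  # 75.0
```

A word counts as correct only when it matches the reference exactly, so a single wrong character (here 'v' instead of 'w') makes the whole word incorrect.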

5.3 Result

The results of the two test cases discussed above are given below. The overall accuracy of our system is 93.22%.

Test Case                                               Accuracy
Test case 1 (Person names)                              95.00%
Test case 2 (City names, State names, River names)      91.40%

Figure 1 gives a graphical view of the accuracy of Test case 1 and Test case 2.

Figure 1

5.4 Error Analysis

The overall accuracy of the system is quite good. Test case 2 is less accurate than Test case 1 because un-standardized language causes more ambiguities. There are several reasons for the errors in the output.

• Multiple Transliterations: Sometimes a name as pronounced in Punjabi corresponds to many English words, so the system fails to guess which one is best for that particular transliteration.

• Wrong Input of Words: Sometimes the user does not enter correct data into the system, due to which the output is also not correct. For example, a halant may be entered as a standalone character although it is used only to write a half letter.

• Character Gap: The number of characters in the English and Punjabi character sets differs, which makes the transliteration process difficult. The numbers of vowels are 5 and 20 and the numbers of consonants are 21 and 41 in English and Punjabi respectively, as explained earlier. So there is a character gap between the two languages that leads to problems in the transliteration process. For example, some Punjabi characters have no corresponding character in English.

• One-to-Multi mapping Problem: In this problem, a single character in one script transforms to multiple characters in another script. The multi-mapping problem is associated with the characters shown in Table 3. For example, the character ਵ in Punjabi can be transliterated into two characters in English, ‘v’ or ‘w’. Some algorithm is required to select the appropriate character in different situations.

Some names are not transliterated correctly by our system.

6. CONCLUSION

In this paper we have addressed the problem of transliterating Punjabi to English using a rule based approach. A Punjabi to English transliteration system is very beneficial for removing the language and script barrier. The system is giving promising results and can be further used by researchers working on Punjabi and English Natural Language Processing tasks. Since most government departments in the Punjab area store their data in Punjabi, this transliteration system will help them a lot to transliterate Punjabi to English at the click of a button.

7. ACKNOWLEDGMENTS

I would like to express my deep and sincere gratitude to my Guide Dr. Vishal Goyal, Assistant Professor, Dept of Computer Science, Punjabi University, Patiala, for the continuous support of my work, and for his patience, motivation, enthusiasm, and immense knowledge. His understanding, encouragement and personal guidance have provided a good basis for the present work. I would like to thank my parents for supporting me throughout my life. Above all, I thank ‘GOD’ for making this mortal venture possible.

REFERENCES

[1] Vijaya, V.P., Shivapratap and K.P. CEN (2009), “English to Tamil Transliteration using WEKA system”, International Journal of Recent Trends in Engineering, May 2009, 1, No. 1, pages: 498-500.

[2] Chinnakotla, Damani, Satoskar (2009), “Transliteration for Resource Scarce Language”, ACM Transactions on Asian Language Information Processing, V, No. N.
[3] Lehal and Singh (2008), “Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach”, Proceeding of Advanced Centre for Technical Development of Punjabi Language, Literature & Culture, Punjabi University, Patiala 147 002, Punjab, India, pages: 151-162.
[4] Malik (2006), “Punjabi Machine Transliteration System”, In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (2006), pages: 1137-1144.
[5] Verma (2006), “A Roman-Gurmukhi Transliteration System”, Proceeding of the Department of Computer Science, Punjabi University, Patiala, 2006.
[6] Hong, Kim, Lee and Chang (2009), “A Hybrid Approach to English-Korean Name Transliteration”, Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 108–111, Suntec, Singapore, 7 August 2009, ACL and AFNLP.
[7] Ali and Ijaz (2009), “English to Urdu Transliteration System”, Proceedings of the Conference on Language & Technology 2009, pages: 15-23.
[8] Hoon Oh, and Key-Sun Choi (2002), “An English-Korean Transliteration Model using Pronunciation and Contextual Rules”, In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), pages: 393–399.
[9] Yaser, Knight (2002), “Machine Transliteration of Names in Arabic Text”, In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, pages: 1-13.
[10] “Transliteration”, Internet Source: http://en.wikipedia.org/wiki/transliteration accessed on Jan, 2011.