Development of a Punjabi to English Transliteration System

International Journal of Computer Science and Communication Vol. 2, No. 2, July-December 2011, pp. 521-526

Kamal Deep¹ and Vishal Goyal²
¹Department of Computer Science, Punjabi University, Patiala, India
²Assistant Professor, Department of Computer Science, Punjabi University, Patiala, India

ABSTRACT

Machine transliteration has gained prime importance as a supporting tool for machine translation and cross-language information retrieval, especially when proper names and technical terms are involved. The performance of machine translation and cross-language information retrieval depends heavily on accurate transliteration of named entities. Hence, the transliteration model must aim to preserve the phonetic structure of words as closely as possible. This paper addresses the problem of transliterating from Punjabi to English using a rule-based approach. The proposed transliteration scheme uses a grapheme-based method to model the transliteration problem. This technique has been demonstrated on Punjabi to English transliteration of common names and achieved an accuracy of 93.22%.

1. INTRODUCTION

Transliteration is the process of replacing words in the source language with their approximate phonetic or spelling equivalents in the target language. Commonly, transliteration is used to translate named entities across languages. Automatic transliteration is helpful for many applications, such as Machine Translation (MT), Cross Language Information Retrieval (CLIR) and Information Extraction (IE). Transliterating a word from the language of its origin to a foreign language is called Forward Transliteration, while transliterating a loan word written in a foreign language back to the language of its origin is called Backward Transliteration. This paper addresses the problem of forward transliteration of person names from Punjabi to English.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the English and Punjabi languages. Section 4 describes the transliteration system architecture. Experimental results and error analysis are discussed in Section 5. Finally, Section 6 concludes the paper.

2. RELATED WORK

Several approaches have been proposed for name transliteration. In grapheme-based approaches (ΨG), transliteration is viewed as a process of mapping a grapheme sequence from a source language to a target language, ignoring the phoneme-level processes. In contrast, in phoneme-based approaches (ΨP), the transliteration key is the pronunciation, or source phoneme, rather than the spelling, or source grapheme. This approach is basically a source grapheme-to-source phoneme transformation followed by a source phoneme-to-target grapheme transformation. Hybrid approaches (ΨH) simply combine the grapheme-based transliteration probability (Pr(ΨG)) and the phoneme-based transliteration probability (Pr(ΨP)) using linear interpolation.

Vijaya, VP, Shivapratap and KP CEN [1] developed an English to Tamil transliteration system using WEKA. It is a rule-based system that uses the J48 decision tree classifier of WEKA for classification purposes. The transliteration process consisted of four phases: preprocessing, feature extraction, training and transliteration. The accuracy of this system was tested with 1000 English names that were out of corpus. The transliteration model produced an exact transliteration in Tamil from English words with an accuracy of 84.82%.

Chinnakotla, Damani and Satoskar [2] developed transliteration systems for resource-scarce languages. They developed rule-based systems for Hindi to English, English to Hindi, and Persian to English transliteration tasks. They used CSM (Character Sequence Modeling) on the source side for word origin identification, a manually generated non-probabilistic character mapping rule base for generating transliteration candidates, and then again used CSM on the target side for ranking the generated candidates. The overall efficiency using the CRF (Conditional Random Field) approach is 67.0% for English to Hindi, 70.7% for Hindi to English and 48.0% for Persian to English.

Lehal and Singh [3] developed a Shahmukhi to Gurmukhi transliteration system based on a corpus approach. In this system, script mappings are done first, covering simple consonants, aspirated consonants (AC), vowels, and other diacritical marks or symbols. The system is divided into two phases. The first phase performs pre-processing and rule-based transliteration tasks and the second phase performs post-processing. The overall accuracy of the system has been reported to be 91.37%.

Malik [4] developed the Punjabi Machine Transliteration (PMT) system, which is rule-based. PMT has been used for Shahmukhi to Gurmukhi transliteration. PMT preserves the phonetics and the meaning of the transliterated word. The primary limitation of this system is that it works only on input data which has been manually edited for missing vowels or diacritical marks (the basic ambiguity of written Arabic script), which limits its practical use. The accuracy of the system has been reported to be 98.95%.

Verma [5] developed a Gurmukhi to Roman transliteration system and named it GTrans. He surveyed existing Roman-Indic script transliteration techniques and finally developed a transliteration scheme based on ISO 15919 transliteration and ALA-LC. It is a rule-based system. He also performed reverse transliteration from Gurmukhi to Roman. The overall accuracy of the system has been reported to be 98.43%.

Hong, Kim, Lee and Chang [6] developed an English-Korean name transliteration system using the hybrid approach. In the transliteration process, first, a phrase-based SMT model with some factored translation features was used. Second, they expanded the base system by applying web-based n-best re-ranking of the results. Third, they applied to the base system a pronouncing dictionary-based method, which utilizes pronunciation symbols and is motivated by linguistic knowledge. Finally, a phonics-based method was applied, which was originally designed for teaching speakers of English to read and write that language. The experimental results of using three n-best re-ranking techniques showed that web-based re-ranking is a useful method. Their standard run and best standard run achieved accuracies of 45.1% and 78.5%, respectively.

Ali and Ijaz [7] developed an English to Urdu transliteration system based on mapping rules. The whole process has three steps. In the first step, mapping rules are used to generate Urdu text from the English transcription. English text is converted to Urdu using both English pronunciation and mapping rules. In the second step, Urdu syllabification is applied to the English transcription. Consonants and vowels are combined to make syllables; breaking up a word into syllables is known as syllabification. To improve the system's accuracy, they applied the Urduization Rules [...]

[...] and its corresponding phoneme have been aligned phonetically. Second, English words are transliterated into Korean words through several steps. Using an English pronunciation dictionary (P-DIC), the system assigns a pronunciation to a given English word. If the word is not found in P-DIC, the system checks whether it has a complex word form. To detect a complex word form, they divided a given English word into two words (word+word) using entries of P-DIC. If both of them are in P-DIC, the system can assign a pronunciation to the given word; otherwise the system has to estimate the pronunciation. Then, the system checks whether the English word is of Greek origin or not. Because the way of E-K transliteration for English words of Greek origin is different from that for pure English words, it is important to detect this. The pronunciation of English words which are not registered in P-DIC is estimated in the next step. Finally, Korean transliterated words are generated using conversion rules. Evaluation was performed through Word Accuracy (WA) and Character Accuracy (CA). This system reported an accuracy of 90.82% for WA and 56% for CA.

Yaser and Knight [9] developed an Arabic to English transliteration system based on sound and spelling mapping using finite state machines. They combined the phonetic-based model and the spelling-based model into a single transliteration model. For testing they used a development data set and a blind data set. The overall accuracy was reported to be 53.66% on the development data set and 61% on the blind data set. The reason for the higher accuracy on the blind data set was that the blind set consists mostly of highly frequent, prominent politicians, whereas the development set also contains names of writers and less common political figures.

3. PUNJABI & ENGLISH LANGUAGE

This section discusses the Punjabi and English languages.

3.1 Punjabi Language

The Punjabi language is written in the Gurmukhi script. The Gurmukhi script was derived from the Sharada script and standardized by Guru Angad Dev in the 16th century. It was designed to write the Punjabi language. The meaning of Gurmukhi is "from the mouth of the Guru". The Gurmukhi (or Punjabi) alphabet contains thirty-five distinct letters.
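As a rough illustration of the grapheme-based, rule-based idea described in the abstract, the Python sketch below maps a handful of Gurmukhi letters and vowel signs directly to Roman characters, without any phoneme-level step. The rule tables, the inherent-vowel heuristic and the function name `transliterate` are assumptions made for this example only. They are not the rule base used in this paper, and they cover just a few letters.

```python
# Hypothetical, partial rule tables for illustration (not the authors' rules).
# Code points are written as escapes so the mapping is unambiguous.
CONSONANTS = {
    "\u0A15": "k",  # GURMUKHI LETTER KA
    "\u0A2E": "m",  # GURMUKHI LETTER MA
    "\u0A32": "l",  # GURMUKHI LETTER LA
    "\u0A30": "r",  # GURMUKHI LETTER RA
    "\u0A28": "n",  # GURMUKHI LETTER NA
    "\u0A38": "s",  # GURMUKHI LETTER SA
}
VOWEL_SIGNS = {
    "\u0A3E": "a",   # VOWEL SIGN AA
    "\u0A3F": "i",   # VOWEL SIGN I
    "\u0A40": "ee",  # VOWEL SIGN II
    "\u0A41": "u",   # VOWEL SIGN U
    "\u0A42": "oo",  # VOWEL SIGN UU
    "\u0A47": "e",   # VOWEL SIGN EE
    "\u0A4B": "o",   # VOWEL SIGN OO
}


def transliterate(word: str) -> str:
    """Map each Gurmukhi grapheme to a Roman string, one rule per character."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            # Assumed heuristic: a consonant directly followed by another
            # consonant carries the inherent vowel "a"; a word-final
            # consonant or one followed by a vowel sign does not.
            if nxt is not None and nxt in CONSONANTS:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            # Characters without a rule are passed through unchanged.
            out.append(ch)
    return "".join(out).capitalize()


print(transliterate("\u0A15\u0A2E\u0A32"))        # KA MA LA
print(transliterate("\u0A38\u0A40\u0A2E\u0A3E"))  # SA II MA AA
```

With these assumed rules the two calls print `Kamal` and `Seema`. A realistic rule base would need many more letters, the nasalization and gemination marks, and a better treatment of the inherent vowel than this single heuristic.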
