Sangam: a Perso-Arabic to Indic Script Machine Transliteration Model


Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala 147002, Punjab, India

Tejinder Singh Saini
Advanced Center for Technical Development of Punjabi Language, Literature and Culture, Punjabi University, Patiala, Punjab, India

D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 232–239, Goa, India. December 2014. © 2014 NLP Association of India (NLPAI)

Abstract

The Indian sub-continent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, which is written in Indian East Punjab in Gurmukhi script (a left-to-right script based on Devanagari) and in Pakistani West Punjab in Shahmukhi (a right-to-left script based on Perso-Arabic). This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written in mutually incomprehensible forms). Similarly, the Sindhi and Kashmiri languages are written in both Perso-Arabic and Devanagari scripts. Thus there is a dire need for the development of transliteration tools for conversion between Perso-Arabic and Indic scripts. In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. Sangam is a hybrid system which combines rules as well as word and character level language models to transliterate the words. The system has been designed in such a fashion that the main code, algorithms and data structures remain unchanged, and for adding a new script pair only the databases, mapping rules and language models for the script pair need to be developed and plugged in. The system has been successfully tested on the Punjabi, Urdu and Sindhi languages and can be easily extended to other languages like Kashmiri and Konkani.

1 Introduction

The Indian sub-continent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, spoken by tens of millions of people, but written in Indian East Punjab (20 million) in Gurmukhi script (a left-to-right script based on Devanagari) and in Pakistani West Punjab (80 million) in Shahmukhi (a right-to-left script based on Perso-Arabic). Whilst in speech the Punjabi of the Eastern and the Western parts is mutually comprehensible, in the written form it is not. This is also the case with other languages like Urdu and Hindi (whilst having different names, they are the same language but written, as with Punjabi, in mutually incomprehensible forms). Hindi is written in the Devanagari script from left to right, while Urdu is written from right to left in a script derived from a Persian modification of the Arabic script. A similar problem exists with the Sindhi language, which is written in a Perso-Arabic script in Pakistan and in both Perso-Arabic and Devanagari in India. The same is true of the Kashmiri language. Konkani is probably the only language in India which is written in five scripts: Roman, Devanagari, Kannada, Perso-Arabic and Malayalam (Carmen Brandt, 2014). The existence of multiple scripts has created communication barriers: people can understand the spoken language, but far fewer can read it in the other script. Thus a need arises for transliteration tools which can convert text written in one script of a language to another script. A common feature of all these languages is that one of the scripts is Perso-Arabic (Urdu, Sindhi, Shahmukhi etc.), while the other script is Indic (Devanagari, Gurmukhi, Kannada, Malayalam). Perso-Arabic script is a right-to-left script, while Indic scripts are left-to-right scripts, and the two are mutually incomprehensible. Thus there is a dire need for the development of automatic machine transliteration tools for conversion between Perso-Arabic and Indic scripts.

Machine transliteration is an automatic method to generate characters or words in one alphabetical system for the corresponding characters in another alphabetical system. The transformation of text from one script to another is usually based on phonetic equivalencies. Transliteration is usually categorized as forward or backward transliteration. Forward transliteration refers to transliteration from the native language to a foreign language, while the process of recalling a word in the native language from a transliteration is defined as back-transliteration. Forward transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns, technical terms and out-of-vocabulary words. Back-transliteration, in turn, is popularly used as an input mechanism for languages where typing in the native script is not very common. In such cases, the user types the native language words and sentences (usually) in Roman script, and a transliteration engine automatically converts the Roman input back to the native script. This input mechanism is popularly used for all Indian languages including Hindi, Punjabi, Tamil, Telugu, etc., and also for Arabic, Chinese etc.

In this paper, we present Sangam, a Perso-Arabic to Indic script machine transliteration system, which can convert with high accuracy text written in Perso-Arabic script to one of the Indic scripts sharing the same language. The system has been successfully tested on the Punjabi (Shahmukhi-Gurmukhi), Urdu (Urdu-Devanagari) and Sindhi (Sindhi Perso-Arabic to Sindhi Devanagari) languages and can be easily extended to other languages like Kashmiri and Konkani. One should note that the transliteration model presented in this paper can be categorized neither as forward nor as backward, since it is concerned with script conversion within the same language. The usual techniques for forward or backward transliteration therefore cannot be applied here, and we have to develop a special methodology to handle the transliteration issues related to conversion between scripts of the same language.

2 Related Work

The first transliteration system for a Perso-Arabic to Indic script was presented by Malik (2006), who described a Shahmukhi to Gurmukhi transliteration system with 98% accuracy. But this accuracy was achieved only when the input text had all the diacritical marks necessary for removing ambiguities, and manually inserting missing diacritical marks is not practical for reasons such as large input size, the manual intervention involved and the need for a person with knowledge of both scripts. Saini et al. (2008) developed a system which could automatically insert the missing diacritical marks in the Shahmukhi text and convert the text to Gurmukhi. The system was implemented with various research techniques based on corpus analysis of both scripts, and an accuracy of 91.37% at word level was reported.

Durrani et al. (2010) presented an approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. They proposed two probabilistic models, based on conditional and joint probability formulations, and reported an accuracy of 81.4%. Lehal and Saini (2012) presented an Urdu to Hindi transliteration system and claimed an accuracy of 97.74% at word level. Challenges such as multiple/zero character mappings, missing diacritic marks in Urdu, multiple Hindi words mapped to an Urdu word and word segmentation issues in Urdu text were handled by generating special rules and using various lexical resources such as n-gram language models at word and character level and an Urdu-Hindi parallel corpus. More recently, Malik et al. (2013) analysed the application of statistical machine translation to the problem of Urdu-Hindi transliteration using a parallel lexicon. The authors reported a word level accuracy of 77.8% when the input Urdu text contained all necessary diacritical marks and 77% when it did not, which is much below the accuracy reported in earlier works.

A rule based converter for the Kashmiri language from Perso-Arabic to Devanagari script was developed by Kak et al. (2010), and the authors claimed 90% conversion accuracy.

Leghari and Rehman (2010) discussed the different issues, complexities and problems of Sindhi transliteration and presented a model for transliteration between the Perso-Arabic and Devanagari scripts of the Sindhi language, which is based on an intermediate Roman script.

Malik et al. (2010) described a finite-state scriptural translation model based on Finite State Machines to convert the scripts for the Urdu, Punjabi and Seraiki languages. But the transliteration results for Urdu-Hindi, Punjabi Shahmukhi-Gurmukhi and Seraiki Shahmukhi-Gurmukhi were not very encouraging, with transliteration accuracy at word level ranging from 31.2% to 58.9% for the Urdu-Devanagari script pair and 67.3% for Shahmukhi-Gurmukhi.

3 Challenges in Perso-Arabic to Indic Script Transliteration

Perso-Arabic script | Word | Indic script | Indic transliteration | Actual transliteration
Urdu | دنیا | Devanagari | दनया | दुनिया
Shahmukhi | وچ | Gurmukhi | ਵਚ | ਵਿੱਚ
Sindhi | سنڌ | Devanagari | सनध | सिंधु

Table 1. Transliteration without diacritical marks

3.2 Filling the Missing Script Maps

Many characters present in the Perso-Arabic script have no corresponding character in the Indic script, e.g. Hamza (ء), Do-Zabar (◌ً), Aen (ع), Khadi Zabar (◌ٰ) etc.

3.3 Multiple Mappings for Perso-Arabic Characters

It is observed that many Perso-Arabic characters have multiple mappings into the Indic script, as shown in Table 2.
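The ambiguity created by multiple mappings can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the two-entry mapping table, the toy word-frequency dictionary and the function names `candidates` and `transliterate` are hypothetical and are not taken from Sangam or from its mapping tables. The sketch expands each Perso-Arabic character into its possible Devanagari renderings, drops characters that have no entry (one possible policy for characters without an Indic counterpart, assumed here) and picks the candidate that a stand-in word-level language model scores highest.

```python
from itertools import product

# Toy candidate table (illustrative assumption, not the paper's data).
# Keys are Perso-Arabic characters, values are possible Devanagari outputs.
CANDIDATES = {
    "\u062F": ["\u0926"],                      # dal -> DA
    "\u0648": ["\u0935", "\u094B", "\u0942"],  # waw -> VA, vowel sign O, vowel sign UU
}

# Toy word frequencies standing in for a word-level language model
# (hypothetical counts, for illustration only).
WORD_FREQ = {
    "\u0926\u094B": 50,  # "do" (two)
    "\u0926\u0942": 1,
}


def candidates(word):
    """Return every Devanagari string obtainable from the candidate table."""
    # Characters without an entry map to the empty string, i.e. they are dropped.
    options = [CANDIDATES.get(ch, [""]) for ch in word]
    return ["".join(parts) for parts in product(*options)]


def transliterate(word):
    """Pick the candidate with the highest toy frequency (0 if unseen)."""
    return max(candidates(word), key=lambda w: WORD_FREQ.get(w, 0))


if __name__ == "__main__":
    urdu_do = "\u062F\u0648"  # Urdu word "do" (two), written dal + waw
    print(candidates(urdu_do))
    print(transliterate(urdu_do))
```

With these toy tables the two-letter input yields three candidates, and the one ending in the vowel sign O is selected because it is the only frequent entry. A fuller system would presumably replace the frequency dictionary with trained word and character n-gram models, as the abstract describes in general terms.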