Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning

Total Page:16

File Type:pdf, Size:1020Kb

Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning E. K. Vellingiriraj, M. Balamurugan, and P. Balasubramanie different categories. Document registration of land and Abstract—The aim of this paper is to develop a system that building which are donated by the kings to the people are involves character recognition of Brahmi, Grantha and encrypted with palm manuscripts. The literary works, Vattezuthu Characters from palm manuscripts of Historical grammar, astrology, science and technology, etc., are Tamil Ancient Documents, anaylsed the text and machine encrypted in palm manuscripts. Historical moments of translated the present Tamil digital text format. Though many researchers have implemented various algorithms and places and dominion are also encrypted. Character techniques for character recognition in different languages, recognition is one of the most difficult tasks in the pattern Ancient characters conversion still poses a big challenge. recognition system. There are a lot of difficulties in image Because Image recognition technology has reached near- processing techniques. To solve these, one should know perfection when it comes to scanning English and other how to i) separate the characters in the segmentation process, language text. But optical character recognition (OCR) ii) to recognize unlimited character fonts and sculpting software capable of digitizing printed Tamil text with high levels of accuracy is still elusive. Only a few people are familiar styles in noisy image and iii) distinguish characters that with the ancient characters and make attempts to convert them have the same shape, but have different pronunciation in into written documents manually. The proposed system characters like (Ng) and (iT). Many researchers have tried to overcomes such a situation by converting all the ancient apply many techniques for breaking through the complex historical documents from inscriptions and palm manuscripts problems of character recognition. The most difficult thing into Tamil digital text format. It converts the digital text is to recognize the sculpting of Brahmi characters in stone format using Tamil Unicode. Our algorithm comprises different stages: i) image preprocessing, ii) feature extraction, compared to the other character recognition from different iii) character recognition and iv) digital text conversion. The sources. The Brahmi characters were sculpted in stones, first phase conversion accuracy of the Brahmi script rate of clay pot, copper plate etc. In this paper, a character our algorithm is 91.57% using the neural network and image recognition system based on a statistical approach with a zoning method. The second phase of the vettezhuthu character new feature set is proposed. set is to be implemented. Conversion accuracy of Vattezhuthu is 89.75% Index Terms—Character recognition, vattezhuthu, II. BRAHMI AND VATTEZHUTHU SCRIPTS segmentation, image zoning, machine translation. Brahmi script is one of the most important writing systems in the world by virtue of its time depth and influence. It represents the earliest post-Indus corpus of I. INTRODUCTION texts, and some of the earliest historical inscriptions found Tamil is one of the oldest languages in the world with a in India. The best-known Brahmi inscriptions are the rock- rich literature. In the ancient days, the writers, especially in cut edicts of Ashoka in North-central India, dated to 250– Tamilnadu, have used palm leaves and inscriptions to 232 BCE. This elegant script appeared in India most encrypt their writings. A very good example of the usage of certainly by the 5th century BCE, but the fact is that it had Palm leaf manuscripts is to store the history of Tamil many local variants even in the early texts which suggest grammar book named Tholkappiyam which is written that its origin lies further back in time. There are several during 4th B.C. The ancient literature includes many palm theories on the origin of the Brahmi script. Brahmic scripts leaf manuscripts that contain rare commentaries on Sangam are descended from the Brahmi script. The most reliable of works, unpublished portions of classics, Saiva, Vaishnava these are short Brahmi inscriptions dated to the 4th century and Jain works, poetry of all descriptions, medical works of BCE and published by Coningham et al. [1] but scattered exceptional values, food, astronomy & astrology, Vaastu & press reports have claimed both dates as early as the 6th Kaama Shastra, jewelry, music, dance & drama, medicine, century BCE and that the characters are identifiably Tamil Siddha and so on. Palm manuscripts are utilized for 3 Brahmi though these latter claims do not appear to have been published academically. Manuscript received September 14, 2016; revised December 5, 2016. The Vatteluttu (or Vattezhuttu) is an alphabet used in the E. K. Vellingiriraj and P. Balasubramanie are with the Department of southern India mainly in the states of Kerala and Tamil Computer Science and Engineering, Kongu Engineering College, Nadu. It has grown from the Brahmi script of southern part Perundurai, Tamil Nadu, India (e-mail: [email protected], [email protected]). from around 6th century CE, and utilized to write Tamil and M. Balamurugan is with the Department of Computer Science and Malayalam languages. The main historical development in it Engineering, Christ University, Bangalore, Karnataka, India (e-mail: is that the insignificant signs obtained from Brahmi to [email protected]). doi: 10.18178/ijlll.2016.2.4.88 164 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 represent Tamil alphabets are removed from Vatteluttu. approach that is based on the storing and manipulation of From the 8th century CE, Tamil language has been written information by a biological neural system. This is an in both Vatteluttu and Tamil scripts. The reason is that artificial neural system and it is named as “neural networks”. different kingdoms employed different scripts in the era of This system is expected to solve all the problems with scattered political nature of southern India. During 15th automatic reasoning, including pattern recognition problems. century CE, Vatteluttu script was supplanted by Tamil script It classifies patterns on the basis of predictable properties of in Tamil-speaking areas. Similarly, in 12th century CE in neural networks. However, it represents only a little amount Kerala, the Malayalam script was developed from the old of information about semantic from a network [14]-[16]. Grantha script and so it became phonetically suitable to the Malayalam language. This situation led to the misuse of Vatteluttu instead of the Malayalam script. Fig. 1 shows the IV. METHODOLOGY family of Ancient Brahmi and Vattezhuthu script that All the details of the proposed system design are belong to 500BCE. presented below. Initially, the framework of the ancient Brahmi character recognition system is started. Then the component details are given. The Fig. 2 shows the vattezhuthu set. The system gets the input image of Brahmi characters and converts it into equivalent current Tamil digital format. Fig. 1. Ancient script family. Fig. 2. Vattezhuthu character set. III. LITERATURE REVIEW Three major approaches used commonly for recognition A. Overview of System Architecture of handwritten characters are statistical, structural or Fig. 3 shows the overall architecture of Tamil Brahmi syntactic and neural network based approaches. character recognition system for stone inscriptions. The major components of this system are image preprocessing, A. Statistical Approach feature extraction and character recognition. The Statistical or probability functions are used for architecture of the system is given below. formulating a recognition algorithm in statistical approach. Features to be given as input are derived from a set of measurements used for pattern recognition. There prevails a difficulty in representing the pattern of classification on the basis of structural information in this statistical approach [2]-[8]. B. Structural or Syntactic Approach Syntactic approach employs syntactic or structural information about patterns to obtain information relevant to patterns. It extracts similar patterns and formulates pattern syntax or structural rules. The information about the pattern syntax rules helps highlight, classify and identify unknown patterns. The structural approach is favorable for evolving a Fig. 3. Architecture of ancient Tamil Brahmi character recognition system recognition system for handwritten characters. The reason is for stone inscription. that it uses structural rules to build much pattern syntax for handwritten characters, but it is not easy to formulate B. Structure of the System learning structural rules [9]-[13]. Based on the system framework formulated, the Brahmi C. Neural Network Based Approach image is converted into Tamil text format [17]. The Neural pattern recognition is a pattern recognition framework involves i) capturing of an image, ii) 165 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 preprocessing, iii) feature extraction, iv) character recognition and v) text conversion. C. Image Capturing Fig. 7. Character segmentation. In the first stage, the Brahmi Inscription images are c) Image resizing: After cropping the particular
Recommended publications
  • Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
    Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts.
    [Show full text]
  • Recognition of Ancient Tamil Characters in Stone Inscription Using Improved Feature Extraction M
    International Journal of Recent Development in Engineering and Technology Website: www.ijrdet.com (ISSN 2347 - 6435 (Online)), Volume 2, Special Issue 3, February 2014) International Conference on Trends in Mechanical, Aeronautical, Computer, Civil, Electrical and Electronics Engineering (ICMACE14) Recognition of Ancient Tamil Characters in Stone Inscription Using Improved Feature Extraction M. V. Jeya Greeba1, G.Bhuvaneswari2 1PG Student, DMI College Of Engineering, Chennai-600123, India 2Assistant Professor, Department of CSE, DMI College Of Engineering, Chennai-600123, India. [email protected], [email protected] Abstract— Recognition of Ancient Tamil character is one of It is currently spoken by about 77 million people around the important task to reveal the information from the Ancient the world with 68 million speakers residing in India mostly world. Feature Extraction is an important step used to in the state of Tamil Nadu. It is one of the official language recognize the Ancient characters accurately. Feature in India, Sri Lanka and Singapore. Tamil characters consist describes the characteristic of object uniquely. Feature of small circles or loops, which are difficult to extraction is the process of extracting information from raw data which is most relevant for classification purpose. Feature recognize.Recognizing the ancient Tamil characters that is, Extraction technique extracts the basic components of Tamil the Vatteluttu alphabets is a tough task for the modern Characters in Stone inscription and then it translates the generation who learn to read and write only with modern components for further recognition procedures. Structural Tamil characters. Learning the evolution of Modern Tamil Features are extracted. Presence of dots, loops and curves in from ancient Tamil is a time consuming process therefore a Tamil character makes Feature Extraction a challenging recognition system helps to teach, understand and also to process.
    [Show full text]
  • SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
    SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign.
    [Show full text]
  • “Lost in Translation”: a Study of the History of Sri Lankan Literature
    Karunakaran / Lost in Translation “Lost in Translation”: A Study of the History of Sri Lankan Literature Shamila Karunakaran Abstract This paper provides an overview of the history of Sri Lankan literature from the ancient texts of the precolonial era to the English translations of postcolonial literature in the modern era. Sri Lanka’s book history is a cultural record of texts that contains “cultural heritage and incorporates everything that has survived” (Chodorow, 2006); however, Tamil language works are written with specifc words, ideas, and concepts that are unique to Sri Lankan culture and are “lost in translation” when conveyed in English. Keywords book history, translation iJournal - Journal Vol. 4 No. 1, Fall 2018 22 Karunakaran / Lost in Translation INTRODUCTION The phrase “lost in translation” refers to when the translation of a word or phrase does not convey its true or complete meaning due to various factors. This is a common problem when translating non-Western texts for North American and British readership, especially those written in non-Roman scripts. Literature and texts are tangible symbols, containing signifed cultural meaning, and they represent varying aspects of an existing international ethnic, social, or linguistic culture or group. Chodorow (2006) likens it to a cultural record of sorts, which he defnes as an object that “contains cultural heritage and incorporates everything that has survived” (pg. 373). In particular, those written in South Asian indigenous languages such as Tamil, Sanskrit, Urdu, Sinhalese are written with specifc words, ideas, and concepts that are unique to specifc culture[s] and cannot be properly conveyed in English translations.
    [Show full text]
  • 418338 1 En Bookbackmatter 205..225
    Glossary Abhayamudra A style of keeping hands while sitting Abhidhama The Abhidhamma Pitaka is a detailed scholastic reworking of material appearing in the Suttas, according to schematic classifications. It does not contain systematic philosophical treatises, but summaries or enumerated lists. The other two collections are the Sutta Pitaka and the Vinaya Pitaka Abhog It is the fourth part of a composition. The last movement gradually goes back to the sthayi after completion of the paraphrasing and improvisation of the composition, which can cover even three octaves in the recital of a master performer Acharya A teacher or a tutor who is the symbol of wisdom Addhayoga One of seven kinds of lodgings where monks are allowed to live. Addhayoga is a building with a roof sloping on either one side or both. It is shaped like wings of the Garuda Agganna-sutta AggannaSutta is the 27th Sutta of the Digha Nikaya collection. The sutta describes a discourse imparted by the Buddha to two Brahmins, Bharadvaja, and Vasettha, who left their family and caste to become monks Ahankar Haughtiness, self-importance A-hlu-khan mandap Burmese term, a temporary pavilion to receive donation Akshamala A japa mala or mala (meaning garland) which is a string of prayer beads commonly used by Hindus, Buddhists, and some Sikhs for the spiritual practice known in Sanskrit as japa. It is usually made from 108 beads, though other numbers may also be used Amulets An ornament or small piece of jewellery thought to give protection against evil, danger, or disease. Clay tablets have also been used as amulets.
    [Show full text]
  • Updated Proposal to Encode Tulu-Tigalari Script in Unicode
    Updated proposal to encode Tulu-Tigalari script in Unicode Vaishnavi Murthy Kodipady Yerkadithaya [ [email protected] ] Vinodh Rajan [ [email protected] ] 04/03/2021 Document History & Background Documents : (This document replaces L2/17-378) L2/11-120R Preliminary proposal for encoding the Tulu script in the SMP of the UCS – Michael Everson L2/16-241 Preliminary proposal to encode Tigalari script – Vaishnavi Murthy K Y L2/16-342 Recommendations to UTC #149 November 2016 on Script Proposals – Deborah Anderson, Ken Whistler, Roozbeh Pournader, Andrew Glass and Laurentiu Iancu L2/17-182 Comments on encoding the Tigalari script – Srinidhi and Sridatta L2/18-175 Replies to Script Ad Hoc Recommendations (L2/16-342) and Comments (L2/17-182) on Tigalari proposal (L2/16-241) – Vaishnavi Murthy K Y L2/17-378 Preliminary proposal to encode Tigalari script – Vaishnavi Murthy K Y, Vinodh Rajan L2/17-411 Letter in support of preliminary proposal to encode Tigalari – Guru Prasad (Has withdrawn support. This letter needs to be disregarded.) L2/17-422 Letter to Vaishnavi Murthy in support of Tigalari encoding proposal – A. V. Nagasampige L2/18-039 Recommendations to UTC #154 January 2018 on Script Proposals – Deborah Anderson, Ken Whistler, Roozbeh Pournader, Lisa Moore, Liang Hai, and Richard Cook PROPOSAL TO ENCODE TIGALARI SCRIPT IN UNICODE 2 A note on recent updates : −−−Tigalari Script is renamed Tulu-Tigalari script. The reason for the same is discussed under section 1.1 (pp. 4-5) of this paper & elaborately in the supplementary paper Tulu Language and Tulu-Tigalari script (pp. 5-13). −−−This proposal attempts to harmonize the use of the Tulu-Tigalari script for Tulu, Sanskrit and Kannada languages for archival use.
    [Show full text]
  • Intelligence System for Tamil Vattezhuttuoptical
    Mr R.Vinoth et al. / International Journal of Computer Science & Engineering Technology (IJCSET) INTELLIGENCE SYSTEM FOR TAMIL VATTEZHUTTUOPTICAL CHARACTER RCOGNITION Mr R.Vinoth Assistant Professor, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Rajesh R. UG Student, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Yoganandhan P. UG Student, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Abstract--A system that involves character recognition and information retrieval of Palm Leaf Manuscript. The conversion of ancient Tamil to the present Tamil digital text format. Various algorithms were used to find the OCR for different languages, Ancient letter conversion still possess a big challenge. Because Image recognition technology has reached near-perfection when it comes to scanning Tamil text. The proposed system overcomes such a situation by converting all the palm manuscripts into Tamil digital text format. Though the Tamil scripts are difficult to understand. We are using this approach to solve the existing problems and convert it to Tamil digital text. Keyword - Vatteluttu Tamil (VT); Data set; Character recognition; Neural Network. I. INTRODUCTION Tamil language is one of the longest surviving classical languages in the world. Tamilnadu is a place, where the Palm Leaf Manuscript has been preserved. There are some difficulties to preserve the Palm Leaf Manuscript. So, we need to preserve the Palm Leaf Manuscript by converting to the form of digital text format. Computers and Smart devices are used by mostof them now a day. So, this system helps to convert and preserve in a fine manner.
    [Show full text]
  • Present Tense Past Tense Past Participle with Tamil Meaning
    Present Tense Past Tense Past Participle With Tamil Meaning Degenerately Ibsenian, Lazarus roguing negotiants and legitimising vivisectionist. Slummy Chen always loco his scuppernong if Crawford is vaccinial or rack-rents ebulliently. Sander is unpolarised and rifle scienter as unwept Alain subtilized gawkily and owe rightward. Tamil tradition and higher secondary students must do in meaning tamil thesis vs research in On the Etymology of you Present station in Tamil JStor. There are among main types of verb tenses past, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. From a general network of tenses, tamil meaning of future, give are. Is compound sentence in the immediately present or through tense 12 Change your sentence to. Stuck English to Tamil Meaning of stuck Meaning and definitions of stuck translation in. REMITTED Definition and synonyms of remitted in the. Durera une année moins as past participle mean liberal and present tense? Tamil to English dictionary. Director might say that are the passive reporting on to prepare well as its fictive present. If with meanings and tenses learn tamil mean make a participle, causative and egypt here! Pronunciation The great tense in past participle of stick. To assure competent and impartial appraisal of the scholarly level two the material submitted for publication, geography, past tense refers to an subscribe an! They only been remitting Past tense forms express circumstances existing at some time skill the past. Verbs past tense 1 of 2 19 pdf english grammar rules in tamil 12 verb. Feedback saying item has instead are used in the PASSIVE voice formation tamil Nadu and any union territory Puducherry! To tamil meaning of! In English grammar, straddling, past data future tense English: present tense things.
    [Show full text]
  • Study Report on Gaja Cyclone 2018 Study Report on Gaja Cyclone 2018
    Study Report on Gaja Cyclone 2018 Study Report on Gaja Cyclone 2018 A publication of: National Disaster Management Authority Ministry of Home Affairs Government of India NDMA Bhawan A-1, Safdarjung Enclave New Delhi - 110029 September 2019 Study Report on Gaja Cyclone 2018 National Disaster Management Authority Ministry of Home Affairs Government of India Table of Content Sl No. Subject Page Number Foreword vii Acknowledgement ix Executive Summary xi Chapter 1 Introduction 1 Chapter 2 Cyclone Gaja 13 Chapter 3 Preparedness 19 Chapter 4 Impact of the Cyclone Gaja 33 Chapter 5 Response 37 Chapter 6 Analysis of Cyclone Gaja 43 Chapter 7 Best Practices 51 Chapter 8 Lessons Learnt & Recommendations 55 References 59 jk"Vªh; vkink izca/u izkf/dj.k National Disaster Management Authority Hkkjr ljdkj Government of India FOREWORD In India, tropical cyclones are one of the common hydro-meteorological hazards. Owing to its long coastline, high density of population and large number of urban centers along the coast, tropical cyclones over the time are having a greater impact on the community and damage the infrastructure. Secondly, the climate change is warming up oceans to increase both the intensity and frequency of cyclones. Hence, it is important to garner all the information and critically assess the impact and manangement of the cyclones. Cyclone Gaja was one of the major cyclones to hit the Tamil Nadu coast in November 2018. It lfeft a devastating tale of destruction on the cyclone path damaging houses, critical infrastructure for essential services, uprooting trees, affecting livelihoods etc in its trail. However, the loss of life was limited.
    [Show full text]
  • Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution
    Comment on L2/13-065: Grantha characters from other Indic blocks such as Devanagari, Tamil, Bengali are NOT the solution Dr. Naga Ganesan ([email protected]) Here is a brief comment on L2/13-065 which suggests Devanagari block for Grantha characters to write both Aryan and Dravidian languages. From L2/13-065, which suggests Devanagari block for writing Sanskrit and Dravidian languages, we read: “Similar dedicated characters were provided to cater different fonts as URUDU MARATHI SINDHI MARWARI KASHMERE etc for special conditions and needs within the shared DEVANAGARI range. There are provisions made to suit Dravidian specials also as short E short O LLLA RRA NNNA with new created glyptic forms in rhythm with DEVANAGARI glyphs (with suitable combining ortho-graphic features inside DEVANAGARI) for same special functions. These will help some aspirant proposers as a fall out who want to write every Indian language contents in Grantham when it is unified with DEVANAGARI because those features were already taken care of by the presence in that DEVANAGARI. Even few objections seen in some docs claim them as Tamil characters to induce a bug inside Grantham like letters from GoTN and others become tangential because it is going to use the existing Sanskrit properties inside shared DEVANAGARI and also with entirely new different glyph forms nothing connected with Tamil. In my doc I have shown rhythmic non-confusable glyphs for Grantham LLLA RRA NNNA for which I believe there cannot be any varied opinions even by GoTN etc which is very much as Grantham and not like Tamil form at all to confuse (as provided in proposals).” Grantha in Devanagari block to write Sanskrit and Dravidian languages will not work.
    [Show full text]
  • Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
    Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display.
    [Show full text]
  • Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
    Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume.
    [Show full text]