Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts. -
Recognition of Ancient Tamil Characters in Stone Inscription Using Improved Feature Extraction M
International Journal of Recent Development in Engineering and Technology Website: www.ijrdet.com (ISSN 2347 - 6435 (Online)), Volume 2, Special Issue 3, February 2014) International Conference on Trends in Mechanical, Aeronautical, Computer, Civil, Electrical and Electronics Engineering (ICMACE14) Recognition of Ancient Tamil Characters in Stone Inscription Using Improved Feature Extraction M. V. Jeya Greeba1, G.Bhuvaneswari2 1PG Student, DMI College Of Engineering, Chennai-600123, India 2Assistant Professor, Department of CSE, DMI College Of Engineering, Chennai-600123, India. [email protected], [email protected] Abstract— Recognition of Ancient Tamil character is one of It is currently spoken by about 77 million people around the important task to reveal the information from the Ancient the world with 68 million speakers residing in India mostly world. Feature Extraction is an important step used to in the state of Tamil Nadu. It is one of the official language recognize the Ancient characters accurately. Feature in India, Sri Lanka and Singapore. Tamil characters consist describes the characteristic of object uniquely. Feature of small circles or loops, which are difficult to extraction is the process of extracting information from raw data which is most relevant for classification purpose. Feature recognize.Recognizing the ancient Tamil characters that is, Extraction technique extracts the basic components of Tamil the Vatteluttu alphabets is a tough task for the modern Characters in Stone inscription and then it translates the generation who learn to read and write only with modern components for further recognition procedures. Structural Tamil characters. Learning the evolution of Modern Tamil Features are extracted. Presence of dots, loops and curves in from ancient Tamil is a time consuming process therefore a Tamil character makes Feature Extraction a challenging recognition system helps to teach, understand and also to process. -
SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign. -
“Lost in Translation”: a Study of the History of Sri Lankan Literature
Karunakaran / Lost in Translation “Lost in Translation”: A Study of the History of Sri Lankan Literature Shamila Karunakaran Abstract This paper provides an overview of the history of Sri Lankan literature from the ancient texts of the precolonial era to the English translations of postcolonial literature in the modern era. Sri Lanka’s book history is a cultural record of texts that contains “cultural heritage and incorporates everything that has survived” (Chodorow, 2006); however, Tamil language works are written with specifc words, ideas, and concepts that are unique to Sri Lankan culture and are “lost in translation” when conveyed in English. Keywords book history, translation iJournal - Journal Vol. 4 No. 1, Fall 2018 22 Karunakaran / Lost in Translation INTRODUCTION The phrase “lost in translation” refers to when the translation of a word or phrase does not convey its true or complete meaning due to various factors. This is a common problem when translating non-Western texts for North American and British readership, especially those written in non-Roman scripts. Literature and texts are tangible symbols, containing signifed cultural meaning, and they represent varying aspects of an existing international ethnic, social, or linguistic culture or group. Chodorow (2006) likens it to a cultural record of sorts, which he defnes as an object that “contains cultural heritage and incorporates everything that has survived” (pg. 373). In particular, those written in South Asian indigenous languages such as Tamil, Sanskrit, Urdu, Sinhalese are written with specifc words, ideas, and concepts that are unique to specifc culture[s] and cannot be properly conveyed in English translations. -
418338 1 En Bookbackmatter 205..225
Glossary Abhayamudra A style of keeping hands while sitting Abhidhama The Abhidhamma Pitaka is a detailed scholastic reworking of material appearing in the Suttas, according to schematic classifications. It does not contain systematic philosophical treatises, but summaries or enumerated lists. The other two collections are the Sutta Pitaka and the Vinaya Pitaka Abhog It is the fourth part of a composition. The last movement gradually goes back to the sthayi after completion of the paraphrasing and improvisation of the composition, which can cover even three octaves in the recital of a master performer Acharya A teacher or a tutor who is the symbol of wisdom Addhayoga One of seven kinds of lodgings where monks are allowed to live. Addhayoga is a building with a roof sloping on either one side or both. It is shaped like wings of the Garuda Agganna-sutta AggannaSutta is the 27th Sutta of the Digha Nikaya collection. The sutta describes a discourse imparted by the Buddha to two Brahmins, Bharadvaja, and Vasettha, who left their family and caste to become monks Ahankar Haughtiness, self-importance A-hlu-khan mandap Burmese term, a temporary pavilion to receive donation Akshamala A japa mala or mala (meaning garland) which is a string of prayer beads commonly used by Hindus, Buddhists, and some Sikhs for the spiritual practice known in Sanskrit as japa. It is usually made from 108 beads, though other numbers may also be used Amulets An ornament or small piece of jewellery thought to give protection against evil, danger, or disease. Clay tablets have also been used as amulets. -
Updated Proposal to Encode Tulu-Tigalari Script in Unicode
Updated proposal to encode Tulu-Tigalari script in Unicode Vaishnavi Murthy Kodipady Yerkadithaya [ [email protected] ] Vinodh Rajan [ [email protected] ] 04/03/2021 Document History & Background Documents : (This document replaces L2/17-378) L2/11-120R Preliminary proposal for encoding the Tulu script in the SMP of the UCS – Michael Everson L2/16-241 Preliminary proposal to encode Tigalari script – Vaishnavi Murthy K Y L2/16-342 Recommendations to UTC #149 November 2016 on Script Proposals – Deborah Anderson, Ken Whistler, Roozbeh Pournader, Andrew Glass and Laurentiu Iancu L2/17-182 Comments on encoding the Tigalari script – Srinidhi and Sridatta L2/18-175 Replies to Script Ad Hoc Recommendations (L2/16-342) and Comments (L2/17-182) on Tigalari proposal (L2/16-241) – Vaishnavi Murthy K Y L2/17-378 Preliminary proposal to encode Tigalari script – Vaishnavi Murthy K Y, Vinodh Rajan L2/17-411 Letter in support of preliminary proposal to encode Tigalari – Guru Prasad (Has withdrawn support. This letter needs to be disregarded.) L2/17-422 Letter to Vaishnavi Murthy in support of Tigalari encoding proposal – A. V. Nagasampige L2/18-039 Recommendations to UTC #154 January 2018 on Script Proposals – Deborah Anderson, Ken Whistler, Roozbeh Pournader, Lisa Moore, Liang Hai, and Richard Cook PROPOSAL TO ENCODE TIGALARI SCRIPT IN UNICODE 2 A note on recent updates : −−−Tigalari Script is renamed Tulu-Tigalari script. The reason for the same is discussed under section 1.1 (pp. 4-5) of this paper & elaborately in the supplementary paper Tulu Language and Tulu-Tigalari script (pp. 5-13). −−−This proposal attempts to harmonize the use of the Tulu-Tigalari script for Tulu, Sanskrit and Kannada languages for archival use. -
Intelligence System for Tamil Vattezhuttuoptical
Mr R.Vinoth et al. / International Journal of Computer Science & Engineering Technology (IJCSET) INTELLIGENCE SYSTEM FOR TAMIL VATTEZHUTTUOPTICAL CHARACTER RCOGNITION Mr R.Vinoth Assistant Professor, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Rajesh R. UG Student, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Yoganandhan P. UG Student, Department of Information Technology Agni college of Technology, Chennai, India [email protected] Abstract--A system that involves character recognition and information retrieval of Palm Leaf Manuscript. The conversion of ancient Tamil to the present Tamil digital text format. Various algorithms were used to find the OCR for different languages, Ancient letter conversion still possess a big challenge. Because Image recognition technology has reached near-perfection when it comes to scanning Tamil text. The proposed system overcomes such a situation by converting all the palm manuscripts into Tamil digital text format. Though the Tamil scripts are difficult to understand. We are using this approach to solve the existing problems and convert it to Tamil digital text. Keyword - Vatteluttu Tamil (VT); Data set; Character recognition; Neural Network. I. INTRODUCTION Tamil language is one of the longest surviving classical languages in the world. Tamilnadu is a place, where the Palm Leaf Manuscript has been preserved. There are some difficulties to preserve the Palm Leaf Manuscript. So, we need to preserve the Palm Leaf Manuscript by converting to the form of digital text format. Computers and Smart devices are used by mostof them now a day. So, this system helps to convert and preserve in a fine manner. -
Present Tense Past Tense Past Participle with Tamil Meaning
Present Tense Past Tense Past Participle With Tamil Meaning Degenerately Ibsenian, Lazarus roguing negotiants and legitimising vivisectionist. Slummy Chen always loco his scuppernong if Crawford is vaccinial or rack-rents ebulliently. Sander is unpolarised and rifle scienter as unwept Alain subtilized gawkily and owe rightward. Tamil tradition and higher secondary students must do in meaning tamil thesis vs research in On the Etymology of you Present station in Tamil JStor. There are among main types of verb tenses past, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. From a general network of tenses, tamil meaning of future, give are. Is compound sentence in the immediately present or through tense 12 Change your sentence to. Stuck English to Tamil Meaning of stuck Meaning and definitions of stuck translation in. REMITTED Definition and synonyms of remitted in the. Durera une année moins as past participle mean liberal and present tense? Tamil to English dictionary. Director might say that are the passive reporting on to prepare well as its fictive present. If with meanings and tenses learn tamil mean make a participle, causative and egypt here! Pronunciation The great tense in past participle of stick. To assure competent and impartial appraisal of the scholarly level two the material submitted for publication, geography, past tense refers to an subscribe an! They only been remitting Past tense forms express circumstances existing at some time skill the past. Verbs past tense 1 of 2 19 pdf english grammar rules in tamil 12 verb. Feedback saying item has instead are used in the PASSIVE voice formation tamil Nadu and any union territory Puducherry! To tamil meaning of! In English grammar, straddling, past data future tense English: present tense things. -
Study Report on Gaja Cyclone 2018 Study Report on Gaja Cyclone 2018
Study Report on Gaja Cyclone 2018 Study Report on Gaja Cyclone 2018 A publication of: National Disaster Management Authority Ministry of Home Affairs Government of India NDMA Bhawan A-1, Safdarjung Enclave New Delhi - 110029 September 2019 Study Report on Gaja Cyclone 2018 National Disaster Management Authority Ministry of Home Affairs Government of India Table of Content Sl No. Subject Page Number Foreword vii Acknowledgement ix Executive Summary xi Chapter 1 Introduction 1 Chapter 2 Cyclone Gaja 13 Chapter 3 Preparedness 19 Chapter 4 Impact of the Cyclone Gaja 33 Chapter 5 Response 37 Chapter 6 Analysis of Cyclone Gaja 43 Chapter 7 Best Practices 51 Chapter 8 Lessons Learnt & Recommendations 55 References 59 jk"Vªh; vkink izca/u izkf/dj.k National Disaster Management Authority Hkkjr ljdkj Government of India FOREWORD In India, tropical cyclones are one of the common hydro-meteorological hazards. Owing to its long coastline, high density of population and large number of urban centers along the coast, tropical cyclones over the time are having a greater impact on the community and damage the infrastructure. Secondly, the climate change is warming up oceans to increase both the intensity and frequency of cyclones. Hence, it is important to garner all the information and critically assess the impact and manangement of the cyclones. Cyclone Gaja was one of the major cyclones to hit the Tamil Nadu coast in November 2018. It lfeft a devastating tale of destruction on the cyclone path damaging houses, critical infrastructure for essential services, uprooting trees, affecting livelihoods etc in its trail. However, the loss of life was limited. -
Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution
Comment on L2/13-065: Grantha characters from other Indic blocks such as Devanagari, Tamil, Bengali are NOT the solution Dr. Naga Ganesan ([email protected]) Here is a brief comment on L2/13-065 which suggests Devanagari block for Grantha characters to write both Aryan and Dravidian languages. From L2/13-065, which suggests Devanagari block for writing Sanskrit and Dravidian languages, we read: “Similar dedicated characters were provided to cater different fonts as URUDU MARATHI SINDHI MARWARI KASHMERE etc for special conditions and needs within the shared DEVANAGARI range. There are provisions made to suit Dravidian specials also as short E short O LLLA RRA NNNA with new created glyptic forms in rhythm with DEVANAGARI glyphs (with suitable combining ortho-graphic features inside DEVANAGARI) for same special functions. These will help some aspirant proposers as a fall out who want to write every Indian language contents in Grantham when it is unified with DEVANAGARI because those features were already taken care of by the presence in that DEVANAGARI. Even few objections seen in some docs claim them as Tamil characters to induce a bug inside Grantham like letters from GoTN and others become tangential because it is going to use the existing Sanskrit properties inside shared DEVANAGARI and also with entirely new different glyph forms nothing connected with Tamil. In my doc I have shown rhythmic non-confusable glyphs for Grantham LLLA RRA NNNA for which I believe there cannot be any varied opinions even by GoTN etc which is very much as Grantham and not like Tamil form at all to confuse (as provided in proposals).” Grantha in Devanagari block to write Sanskrit and Dravidian languages will not work. -
Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. -
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume.