Identification of Marathi and Sanskrit Compound and Non-Compound Word using Genetic Algorithm

International Journal of Advances in Electronics and Computer Science, ISSN(p): 2394-2835, Volume-5, Issue-4, Apr.-2018, http://iraj.in

¹SONAL P. PATIL, ²K. N. JARIWALA
¹Ph.D. Research Scholar, ²Assistant Professor, Computer Engineering Department, S.V.N.I.T, Surat, India

Abstract- Text-based language recognition is the task of automatically recognizing the language of a given text document. It is harder to distinguish languages within one language family than languages from different families. This paper investigates the performance of statistical measures for text-based language identification, with emphasis on five languages used in India and written in the Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language identification is a main pre-processing step in several tasks of Natural Language Processing (NLP). In a multilingual society like India there is wide scope for automatic language identification, since it would be a fundamental step in bridging the digital divide between the Indian masses and the world.

Index Terms- Devanagari Script, Multilingual Computing, Wiener filter, Curvelet transform, Genetic algorithm

I. INTRODUCTION

Text-based language identification, or recognition, is the task of automatically recognizing the language of a specified text document. It is not easy to distinguish languages within a language family. This paper investigates the performance of statistical measures for text-based language identification, with emphasis on five languages used in India and written in the Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language recognition is a significant pre-processing step in many tasks of Natural Language Processing (NLP). In a multilingual society like India there is wide scope for automatic language identification, since it would be a vital step in bridging the digital divide between the Indian masses and the world.

OCR (Optical Character Recognition) is an active field of research in Pattern Recognition. OCR methodologies can be classified on two criteria: the data acquisition process, which can be on-line or off-line, and the type of text, which can be printed or handwritten [1]. Devanagari is the most popular Indian script. For Indian languages, however, research work is very limited because of the complex structure of the language [2].

1.1 Devanagari Character Set

Devanagari is an Indian script of the syllabic-alphabetic (alphasyllabary) type. It is used to write several languages, including Sanskrit, Hindi, Marathi, Bhojpuri, Nepali, Konkani, Sindhi, Marwari, Pali, Maithili and many other languages spoken in various parts of India. The word Devanagari is a combination of two words: deva, which means God, and nagari. Most Indo-Aryan languages are written in the Devanagari script, which is central to their writing system. An alphasyllabary is a writing system that is primarily based on consonants; in such a system, vowel notation is obligatory [8]. The Unicode Standard describes three blocks for Devanagari: Devanagari (U+0900–U+097F), Devanagari Extended (U+A8E0–U+A8FF) and Vedic Extensions (U+1CD0–U+1CFF). In the code chart, unassigned code points are indicated by grey areas. The main Devanagari block spans 0900 to 097F.

1.2 Word Identification Architecture

Optical word identification involves several steps to completely recognize the input and produce machine-encoded text. These phases are: Pre-processing, Segmentation, Feature extraction and Classification. The architecture of these phases is shown in Figure 1, and the phases are listed below with a brief description.

Pre-processing

The pre-processing phase normally includes techniques for binarization, noise removal, skew detection, slant correction, normalization, contour making and skeletonization, which make it easier to extract relevant features from the character image and to recognize it efficiently.

Figure 1: Steps to recognize a specific language [3].

All possible n-grams (unigrams, bigrams and trigrams), at both character level and word level, were extracted during the training stage. The main advantages of n-gram models and algorithms are their relative simplicity and the capability to scale up by simply increasing n. A larger n lets the model store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently. The joint probability of a sequence, which an n-gram model approximates by truncating each conditioning context to the preceding n-1 items, is given by the chain rule:

$$P(X_1, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1^{n-1}) = \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})$$

where X_1^{k-1} denotes the sequence X_1, ..., X_{k-1}.

The Language Profile Generator and the Classifier are the two main components of an LID system. The former calculates the n-gram profile of the text to be identified, which is then compared with language-specific n-gram profiles. For every language, it generates all possible n-grams of the training text and saves them into the corresponding language file. In the classification step, the likelihood of a given test sample is calculated under all the models, and the language whose model gives the best likelihood is selected.

Feature extraction is used to extract the features on which character recognition is based. First, features are computed and extracted; then the most relevant features are selected to construct the feature vector that is eventually used for recognition.

Classification

Each pattern, represented by its feature vector, is assigned to one of the predefined classes using classifiers. Classifiers are first trained on a training set of pattern samples to prepare a model, which is later used to recognize the test samples. The training data should contain a wide variety of samples so that all possible samples can be recognized during testing.

II. LITERATURE SURVEY

A number of notable works address LID for Indian languages, and they help in understanding the challenges and methods of Indian language identification. Kavi Narayana Murthy formulated language identification as a machine learning problem: a supervised classification task in which features extracted from a training corpus are used for classification. In the work on n-gram and word network features for native language identification, Shibamouli Lahiri recognizes a writer's native language from his or her writing in a second language, using n-gram features and WordNet [9]. Another LID method for Indian languages, proposed by Pinky Roy as "Language Identification using Gaussian Mixture Model Tokenization", aims at identifying the language of a spoken utterance. It uses a Gaussian mixture model as the basis for phone tokenization and uses n-gram …

… the features from an image, a thinning process is applied during pre-processing. Thinning is a significant pre-processing step in OCR. Its purpose is to delete redundant information while retaining the characteristic features of the image. Freeman chain code, a representation technique useful in image processing, shape analysis and pattern recognition, is used with a heuristic approach for feature extraction [1].

U. Pal, Wakabayashi and Kimura also presented a comparative study of Devanagari handwritten character recognition using different features and classifiers [4]. They used four sets of features based on curvature and gradient information obtained from binary as well as gray-scale images, and compared results using 10 different classifiers. They concluded that the best results, 74.74% and 75.17% for features extracted from binary and gray images respectively, were obtained with the Mirror Image Learning (MIL) classifier.

Sarbajit Pal et al. [5] described a projection-based statistical approach for handwritten character recognition. They proposed four-sided projections of characters, and the projections were smoothed by polygon approximation.

Nikita Gaur and Dayashankar Singh et al. [6] described a gradient feature extraction approach for the recognition of Sanskrit words, using the Sobel operator for edge detection. The Sobel operator is used in image processing, particularly in edge detection algorithms.

Brijmohan Singh, Ankush Mittal, M.A. Ansari, Debashis Ghosh et al. [7] described a holistic system for offline handwritten Devanagari word recognition. In that paper, they proposed a scheme based on a Curvelet feature extractor with SVM and k-NN classifiers for the recognition of handwritten Devanagari words.

Prachi Patil, Saniya Ansari et al. [9] proposed Online Handwritten Devnagari Word Recognition using an HMM-based technique. Feature extraction from the input image is done using Android technology, and the HMM recognizes the word from those features. They tested the proposed system on different word images and obtained 95.70% recognition accuracy.

M. N. Sandhya Arora, D. Bhattacharjee et al. [13] proposed recognition of non-compound handwritten Devnagari characters using a combination of MLP and minimum edit distance. They used two well-known and established pattern recognition techniques, one using neural networks and the other using minimum edit distance, with characters represented by shadow features and chain code histograms. The method was evaluated on a database of 7154 samples, and the overall recognition rate was found to be 90.74%.