Unicode Standard, Version 5.2


Devanagari
Range: 0900–097F

This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 5.2.

This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See http://www.unicode.org/errata/ for an up-to-date list of errata.

See http://www.unicode.org/charts/ for access to a complete list of the latest character code charts.
See http://www.unicode.org/charts/PDF/Unicode-5.2/ for charts showing only the characters added in Unicode 5.2.
See http://www.unicode.org/Public/5.2.0/charts/ for a complete archived file of character code charts for Unicode 5.2.

Disclaimer

These charts are provided as the online reference to the character contents of the Unicode Standard, Version 5.2 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 5.2, online at http://www.unicode.org/versions/Unicode5.2.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, and #44, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online.

See http://www.unicode.org/ucd/ and http://www.unicode.org/reports/

A thorough understanding of the information contained in these additional sources is required for a successful implementation.

Fonts

The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts. The particular fonts used in these charts were provided to the Unicode Consortium by a number of different font designers, who own the rights to the fonts. See http://www.unicode.org/charts/fonts.html for a list.
Terms of Use

You may freely use these code charts for personal or internal business uses only. You may not incorporate them either wholly or in part into any product or publication, or otherwise distribute them without express written permission from the Unicode Consortium. However, you may provide links to these charts.

The fonts and font data used in production of these code charts may NOT be extracted, or used in any other way in any product or publication, without permission or license granted by the typeface owner(s).

The Unicode Consortium is not liable for errors or omissions in this file or the standard itself. Information on characters added to the Unicode Standard since the publication of the most recent version of the Unicode Standard, as well as on characters currently being considered for addition to the Unicode Standard can be found on the Unicode web site. See http://www.unicode.org/pending/pending.html and http://www.unicode.org/alloc/Pipeline.html.

Copyright © 1991-2009 Unicode, Inc. All rights reserved.

0900    Devanagari    097F

Code chart (reference glyphs not reproduced; each cell gives the code point, and ---- marks a position with no character in this chart)

     090   091   092   093   094   095   096   097
0   0900  0910  0920  0930  0940  0950  0960  0970
1   0901  0911  0921  0931  0941  0951  0961  0971
2   0902  0912  0922  0932  0942  0952  0962  0972
3   0903  0913  0923  0933  0943  0953  0963  ----
4   0904  0914  0924  0934  0944  0954  0964  ----
5   0905  0915  0925  0935  0945  0955  0965  ----
6   0906  0916  0926  0936  0946  ----  0966  ----
7   0907  0917  0927  0937  0947  ----  0967  ----
8   0908  0918  0928  0938  0948  0958  0968  ----
9   0909  0919  0929  0939  0949  0959  0969  0979
A   090A  091A  092A  ----  094A  095A  096A  097A
B   090B  091B  092B  ----  094B  095B  096B  097B
C   090C  091C  092C  093C  094C  095C  096C  097C
D   090D  091D  092D  093D  094D  095D  096D  097D
E   090E  091E  092E  093E  094E  095E  096E  097E
F   090F  091F  092F  093F  ----  095F  096F  097F

The Unicode Standard 5.2, Copyright © 1991-2009 Unicode, Inc. All rights reserved.
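The code chart pairs each position in the 0900–097F range with a character. As a minimal sketch of how that pairing can be inspected programmatically, the following uses Python's standard `unicodedata` module. The helper names `describe` and `block_listing` are illustrative and not part of the standard, and the interpreter's bundled Unicode database is normally newer than Version 5.2, so positions shown as empty in this chart may carry names there.

```python
import unicodedata

# Block boundaries as given in the chart heading (0900-097F).
BLOCK_START, BLOCK_END = 0x0900, 0x097F


def describe(code_point):
    """Return 'XXXX NAME', or 'XXXX <no name>' when the interpreter's
    Unicode database has no name for the code point."""
    name = unicodedata.name(chr(code_point), "<no name>")
    return f"{code_point:04X} {name}"


def block_listing():
    """One line per code point in the Devanagari block, in order."""
    return [describe(cp) for cp in range(BLOCK_START, BLOCK_END + 1)]


if __name__ == "__main__":
    # The database version decides which positions have names.
    print("Unicode database version:", unicodedata.unidata_version)
    for line in block_listing():
        print(line)
```

Printing `unicodedata.unidata_version` first makes it clear which version of the character database the listing reflects, which matters when comparing against a chart frozen at Version 5.2.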
0900    Devanagari    0951

Various signs
0900  DEVANAGARI SIGN INVERTED CANDRABINDU
      = vaidika adhomukha candrabindu
0901  DEVANAGARI SIGN CANDRABINDU
      = anunasika
      → 0310 combining candrabindu
0902  DEVANAGARI SIGN ANUSVARA
      = bindu
0903  DEVANAGARI SIGN VISARGA

Independent vowels
0904  DEVANAGARI LETTER SHORT A
0905  DEVANAGARI LETTER A
0906  DEVANAGARI LETTER AA
0907  DEVANAGARI LETTER I
0908  DEVANAGARI LETTER II
0909  DEVANAGARI LETTER U
090A  DEVANAGARI LETTER UU
090B  DEVANAGARI LETTER VOCALIC R
090C  DEVANAGARI LETTER VOCALIC L
090D  DEVANAGARI LETTER CANDRA E
090E  DEVANAGARI LETTER SHORT E
      • for transcribing Dravidian short e
090F  DEVANAGARI LETTER E
0910  DEVANAGARI LETTER AI
0911  DEVANAGARI LETTER CANDRA O
0912  DEVANAGARI LETTER SHORT O
      • for transcribing Dravidian short o
0913  DEVANAGARI LETTER O
0914  DEVANAGARI LETTER AU

Consonants
0915  DEVANAGARI LETTER KA
0916  DEVANAGARI LETTER KHA
0917  DEVANAGARI LETTER GA
0918  DEVANAGARI LETTER GHA
0919  DEVANAGARI LETTER NGA
091A  DEVANAGARI LETTER CA
091B  DEVANAGARI LETTER CHA
091C  DEVANAGARI LETTER JA
091D  DEVANAGARI LETTER JHA
091E  DEVANAGARI LETTER NYA
091F  DEVANAGARI LETTER TTA
0920  DEVANAGARI LETTER TTHA
0921  DEVANAGARI LETTER DDA
0922  DEVANAGARI LETTER DDHA
0923  DEVANAGARI LETTER NNA
0924  DEVANAGARI LETTER TA
0925  DEVANAGARI LETTER THA
0926  DEVANAGARI LETTER DA
0927  DEVANAGARI LETTER DHA
0928  DEVANAGARI LETTER NA
0929  DEVANAGARI LETTER NNNA
      • for transcribing Dravidian alveolar n
      ≡ 0928 093C
092A  DEVANAGARI LETTER PA
092B  DEVANAGARI LETTER PHA
092C  DEVANAGARI LETTER BA
092D  DEVANAGARI LETTER BHA
092E  DEVANAGARI LETTER MA
092F  DEVANAGARI LETTER YA
0930  DEVANAGARI LETTER RA
0931  DEVANAGARI LETTER RRA
      • for transcribing Dravidian alveolar r
      • half form is represented as Eyelash RA
      ≡ 0930 093C
0932  DEVANAGARI LETTER LA
0933  DEVANAGARI LETTER LLA
0934  DEVANAGARI LETTER LLLA
      • for transcribing Dravidian l
      ≡ 0933 093C
0935  DEVANAGARI LETTER VA
0936  DEVANAGARI LETTER SHA
0937  DEVANAGARI LETTER SSA
0938  DEVANAGARI LETTER SA
0939  DEVANAGARI LETTER HA

Various signs
093C  DEVANAGARI SIGN NUKTA
      • for extending the alphabet to new letters
093D  DEVANAGARI SIGN AVAGRAHA

Dependent vowel signs
093E  DEVANAGARI VOWEL SIGN AA
093F  DEVANAGARI VOWEL SIGN I
      • stands to the left of the consonant
0940  DEVANAGARI VOWEL SIGN II
0941  DEVANAGARI VOWEL SIGN U
0942  DEVANAGARI VOWEL SIGN UU
0943  DEVANAGARI VOWEL SIGN VOCALIC R
0944  DEVANAGARI VOWEL SIGN VOCALIC RR
0945  DEVANAGARI VOWEL SIGN CANDRA E
      = candra
0946  DEVANAGARI VOWEL SIGN SHORT E
      • for transcribing Dravidian vowels
0947  DEVANAGARI VOWEL SIGN E
0948  DEVANAGARI VOWEL SIGN AI
0949  DEVANAGARI VOWEL SIGN CANDRA O
094A  DEVANAGARI VOWEL SIGN SHORT O
      • for transcribing Dravidian vowels
094B  DEVANAGARI VOWEL SIGN O
094C  DEVANAGARI VOWEL SIGN AU

Virama
094D  DEVANAGARI SIGN VIRAMA
      = halant (the preferred Hindi name)
      • suppresses inherent vowel

Archaic dependent vowel sign
094E  DEVANAGARI VOWEL SIGN PRISHTHAMATRA E
      • character has historic use only
      • combines with E to form AI, with AA to form O, and with O to form AU

Sign
0950  DEVANAGARI OM

Vedic tone marks
0951  DEVANAGARI STRESS SIGN UDATTA
      = Vedic tone svarita
      • mostly used for Rigvedic svarita, with rare use for Yajurvedic udatta
      • used also in Vedic texts written in other scripts
      → 1CDA vedic tone double svarita
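The ≡ lines in the names list (0929, 0931 and 0934 above) give canonical decompositions: each of those letters is canonically equivalent to a base consonant followed by the nukta, 093C. A small sketch, again assuming Python's standard `unicodedata` module and using illustrative helper names, shows how normalization to NFD and NFC reflects these entries.

```python
import unicodedata


def hex_points(text):
    """Format each character of text as a four-digit hex code point."""
    return [f"{ord(ch):04X}" for ch in text]


def decompose(text):
    """Code points after canonical decomposition (NFD)."""
    return hex_points(unicodedata.normalize("NFD", text))


def compose(text):
    """Code points after canonical composition (NFC)."""
    return hex_points(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    # The three letters in this part of the list that carry a ≡ entry.
    for ch in ("\u0929", "\u0931", "\u0934"):
        print(hex_points(ch), "decomposes to", decompose(ch))
    # NA followed by NUKTA recomposes to NNNA under NFC.
    print(compose("\u0928\u093C"))
```

Because the precomposed letter and the two-character sequence are canonically equivalent, comparing Devanagari strings is usually safer after normalizing both sides to the same form.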
0952    Devanagari    097F

0952  DEVANAGARI STRESS SIGN ANUDATTA
      = Vedic tone anudatta
      • used also in Vedic texts written in other scripts
      → 1CDC vedic tone kathaka anudatta

Accent marks
0953  DEVANAGARI GRAVE ACCENT
      → 0300 combining grave accent
0954  DEVANAGARI ACUTE ACCENT
      → 0301 combining acute accent
0955  DEVANAGARI VOWEL SIGN CANDRA LONG E
      • used in transliteration of Avestan

Additional consonants
0958  DEVANAGARI LETTER QA
      ≡ 0915 093C
0959  DEVANAGARI LETTER KHHA
      ≡ 0916 093C
095A  DEVANAGARI LETTER GHHA
      ≡ 0917 093C
095B  DEVANAGARI LETTER ZA
      ≡ 091C 093C
095C  DEVANAGARI LETTER DDDHA
      ≡ 0921 093C
095D  DEVANAGARI LETTER RHA
      ≡ 0922 093C
095E  DEVANAGARI LETTER FA
      ≡ 092B 093C
095F  DEVANAGARI LETTER YYA
      ≡ 092F 093C

Additional vowels for Sanskrit
0960  DEVANAGARI LETTER VOCALIC RR
0961  DEVANAGARI LETTER VOCALIC LL
0962  DEVANAGARI VOWEL SIGN VOCALIC L
0963  DEVANAGARI VOWEL SIGN VOCALIC LL

Generic punctuation for scripts of India
These punctuation marks are for common use for the scripts of India despite being named "DEVANAGARI".
0964  DEVANAGARI DANDA
      = purna viram
      • phrase separator
0965  DEVANAGARI DOUBLE DANDA
      = deergh viram

Digits
0966  DEVANAGARI DIGIT ZERO
0967  DEVANAGARI DIGIT ONE
0968  DEVANAGARI DIGIT TWO
0969  DEVANAGARI DIGIT THREE
096A  DEVANAGARI DIGIT FOUR
096B  DEVANAGARI DIGIT FIVE
096C  DEVANAGARI DIGIT SIX
096D  DEVANAGARI DIGIT SEVEN
096E  DEVANAGARI DIGIT EIGHT
096F  DEVANAGARI DIGIT NINE

Devanagari-specific additions
0970  DEVANAGARI ABBREVIATION SIGN
0971  DEVANAGARI SIGN HIGH SPACING DOT

Additional vowel for Marathi
0972  DEVANAGARI LETTER CANDRA A

Additional consonants
0979  DEVANAGARI LETTER ZHA
      • used in transliteration of Avestan
097A  DEVANAGARI LETTER HEAVY YA
      • used for an affricated glide JJYA

Sindhi implosives
097B  DEVANAGARI LETTER GGA
097C  DEVANAGARI LETTER JJA

Glottal stop
097D  DEVANAGARI LETTER GLOTTAL STOP
      • used for writing Limbu in Devanagari
      • a glyph variant has the connecting top bar

Sindhi implosives
097E  DEVANAGARI LETTER DDDA
097F  DEVANAGARI LETTER BBA
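Two parts of this list lend themselves to a short illustration: the additional consonants 0958–095F, each listed with a canonical decomposition into a base letter plus nukta, and the digits 0966–096F. The sketch below assumes Python's standard `unicodedata` module; the function names are illustrative only. It relies on two facts outside this excerpt: the Unicode Character Database lists 0958–095F among the composition exclusions, so NFC leaves them decomposed, and Python's `int()` accepts any Unicode decimal digits.

```python
import unicodedata


def to_hex(text):
    """Format each character of text as a four-digit hex code point."""
    return [f"{ord(ch):04X}" for ch in text]


def nfc(text):
    """Code points of text after normalization to NFC."""
    return to_hex(unicodedata.normalize("NFC", text))


def parse_devanagari_number(text):
    """Convert a string of Devanagari digits (0966-096F) to an int.

    int() already understands Unicode decimal digits; the range check
    only rejects digits from other scripts and non-digit characters.
    """
    if not text or any(not 0x0966 <= ord(ch) <= 0x096F for ch in text):
        raise ValueError("expected only Devanagari digits")
    return int(text)


if __name__ == "__main__":
    # QA (0958) is a composition exclusion: NFC yields KA + NUKTA.
    print(nfc("\u0958"))
    # DIGIT ONE, DIGIT TWO, DIGIT THREE
    print(parse_devanagari_number("\u0967\u0968\u0969"))
```

A practical consequence is that text normalized to NFC never contains 0958–095F, so searches for those letters should target the two-character sequences instead.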