Arxiv:1805.08701V1 [Cs.CL] 22 May 2018 Cally Transliterated Form in Order to Use a Common Form of the Word Can Be Fetched (I.E

Total Page:16

File Type:pdf, Size:1020Kb

Arxiv:1805.08701V1 [Cs.CL] 22 May 2018 Cally Transliterated Form in Order to Use a Common Form of the Word Can Be Fetched (I.E Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance Soumil Mandal, Karthick Nanmaran Department of Computer Science & Engineering SRM Institute of Science & Technology, Chennai, India fsoumil.mandal, [email protected] Abstract ating systems for code-mixed data, post language tagging, normalization of transliterated text is ex- Building tools for code-mixed data is rapidly gaining popularity in the NLP research com- tremely important in order to identify the word munity as such data is exponentially rising on and understand it’s semantics. This would help social media. Working with code-mixed data a lot in systems like opinion mining, and is actu- contains several challenges, especially due to ally necessary for tasks like summarization, trans- grammatical inconsistencies and spelling vari- lation, etc. A normalizing module will also be of ations in addition to all the previous known immense help while making word embeddings for challenges for social media scenarios. In this code-mixed data. article, we present a novel architecture focus- In this paper, we present an architecture for ing on normalizing phonetic typing variations, which is commonly seen in code-mixed data. automatic normalization of phonetically translit- One of the main features of our architecture is erated words to their standard forms. The lan- that in addition to normalizing, it can also be guage pair we have worked on is Bengali-English utilized for back-transliteration and word iden- (Bn-En), where both are typed in Roman script, tification in some cases. Our model achieved thus the Bengali words are in their transliterated an accuracy of 90.27% on the test data. form. The canonical or normalized form we have 1 Introduction considered is the Indian Languages Translitera- tion (ITRANS) form of the respective Bengali With rising popularity of social media, the amount word. Bengali is an Indo-Aryan language of In- of data is rising exponentially. If mined, this data dia where 8.10% 3 of the total population are 1st can proof to be useful for various purposes. In language speakers and is also the official language countries where the number of bilinguals are high, of Bangladesh. The mother script of Bengali is we see that users tend to switch back and forth be- the Eastern Nagari Script 4. Our architecture uti- tween multiple languages, a phenomenon known lizes fully char based sequence to sequence learn- as code-mixing or code-switching. An interest- ing in addition to Levenshtein distance to give the ing case is switching between languages which final normalized form or as close to it as possible. share different mother scripts. On such occasions, Some additional advantages of our system is that one of the two languages is typed in it’s phoneti- at an intermediate stage, the back-transliterated arXiv:1805.08701v1 [cs.CL] 22 May 2018 cally transliterated form in order to use a common form of the word can be fetched (i.e. word identifi- script. Though there are some standard transliter- cation), which will be very useful in several cases ation rules, for example ITRANS 1, ISO 2, but it as original tools (i.e. tools using mother script) can is extremely difficult and un-realistic for people to be utilized, for example emotion lexicons. Some follow them while typing. This indeed is the case other important contributions of our research are as we see that identical words are being transliter- the new lexicons that have been prepared (dis- ated differently by different people based on their cussed in Sec3) which can be used for building own phonetic judgment influenced by dialects, lo- various other tools for studying Bengali-English cation, or sometimes even based on the informal- code-mixed data. ity or casualness of the situation. Thus, for cre- 1https://en.wikipedia.org/wiki/ITRANS 3https://en.wikipedia.org/wiki/Languages of India 2https://en.wikipedia.org/wiki/ISO 15919 4https://www.omniglot.com/writing/bengali.htm 2 Related Work then converted it into standardized ITRANS for- mat. The final size of the PL was 6000. The Normalization of text has been studied quite a second lexicon we created was a transliteration lot (Sproat et al., 1999), especially as it acts as dictionary (BN TRANS) where the first column a pre-processing step for several text process- had Bengali words in Eastern Nagari script taken ing systems. Using Conditional Random Fields from samsad 5, while the second column had the (CRF), Zhu et al.(2007) performed text normal- standard transliterations (ITRANS). The number ization on informal emails. Dutta et al.(2015) of entries in the dictionary was 21850. For test- created a system based on noisy channel model ing, we took the data used in Mandal and Das for text normalization which handles wordplay, (2018), language tagged it using the system de- contracted words and phonetic variations in code- scribed in Mandal et al.(2018a), and then col- mixed background. An unsupervised framework lected Bn tagged tokens. Post manual checking was presented by Sridhar(2015) for normaliz- and discarding of misclassified tokens, the size ing domain-specific and informal noisy texts using of the list was 905. Finally, each of the words distributed representation of words. The soundex were tagged with their ITRANS using the same algorithm was used in (Sitaram et al., 2015) and approach used for making PL. For PL and test (Sitaram and Black, 2016) for spelling correc- col 1 data, some initial rule based normalization tech- tion of transliterated words and normalization in niques were used. If the input string contains a a speech to text scenario of code-mixed data re- digit, it was replaced by the respective phone (e.g. spectively. Sharma et al.(2016) build a normal- ek for 1, dui for 2, etc), and if there are n consecu- ization system using noisy channel framework and tive identical characters where n > 2 (elongation), SILPA spell checker in order to build a shallow it was trimmed down to 2 consecutive characters parser. Sproat and Jaitly(2016) build a system (e.g. baaaad will become baad), as no word in combining two models, where one essentially is a it’s standard form has more than two consecutive seq2seq model which checks the possible normal- identical characters. izations and the other is a language model which considers context information. Jaitly and Sproat 4 Proposed Method (2017) used a seq2seq model with attention trained at sentence level followed by error pruning us- Our method is a two step modular approach com- ◦ ing finite-state filters to build a normalization sys- prising of two degrees of normalization. The 1 tem, mainly targeted for text to speech purposes. normalization module does an initial normaliza- A similar flow was adopted by Zare and Rohatgi tion and tries to convert the input string closest ◦ (2017) as well where seq2seq was used for nor- to the standard transliteration. The 2 normaliza- malization and a window of size 20 was consid- tion module takes the output from the first mod- ered for context. Singh et al.(2018) exploited the ule and tries to match with the standard transliter- fact that words and their variations share similar ations present in the dictionary (BN TRANS). The context in large noisy text corpora to build their candidate with the closest match is returned as the normalizing model, using skip gram and cluster- final normalized string. ing techniques. To the best of our knowledge, 5 1◦ Normalization Module the system architecture proposed by us hasn’t been tried before, especially for code-mixed data. The purpose of this module is to phonetically nor- malize the word as close to the standard transliter- 3 Data Sets ation as possible, to make the work of the match- On a whole, three data sets or lexicons were cre- ing module easier. To achieve this, our idea was ated. The first data set was a parallel lexicon (PL) to train a sequence to sequence model where the where the 1st column had phonetically transliter- input sequences are user transliterated words and ated Bn words in Roman taken from code-mixed the target sequences are the respective ITRANS data prepared in Mandal et al.(2018b). The 2 nd transliterations. We had specifically chosen this column consisted of the standard Roman translit- architecture as it has performed amazingly well in erations (ITRANS) of the respective words. To get complex sequence mapping tasks like neural ma- chine translation and summarization. this, we first manually back-transliterated PLcol 1 to the original word in Eastern Nagari script, and 5http://dsal.uchicago.edu/dictionaries/biswas-bengali/ 5.1 Seq2Seq Model 5.2 Training The sequence to sequence model (Sutskever et al., The model is trained by minimizing the negative 2014) is a relatively new idea for sequence learn- log-likelihood. For training, we used the fully ing using neural networks. It has been especially character based seq2seq model (Lee et al., 2016) popular since it achieved state of the art results in with stacked LSTM cells. The input units were machine translation task. Essentially, the model user typed phonetic transliterations (PLcol 1) while takes as input a sequence X = fx1, x2, ..., xng the target units were respective standard transliter- and tries to generate another sequence Y = fy1, ations (PLcol 2). Thus, the model learns to map y2, ..., ymg, where xi and yi are the input and user transliterations to standard transliterations, target symbols respectively. The architecture of effectively learning to normalize phonetic varia- seq2seq model comprises of two parts, the en- tions. The lookup table Ex we used for character coder and decoder. As the input and target vec- encoding was a dictionary where the keys were the tors were quite small (words), attention (Vaswani 26 English alphabets and the values were the re- et al., 2017) mechanism was not incorporated.
Recommended publications
  • Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs
    Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs Anil Kumar Singh Samar Husain LTRC, IIIT LTRC, IIIT Gachibowli, Hyderabad Gachibowli, Hyderabad India - 500019 India - 500019 a [email protected] s [email protected] Abstract than 95%, and usually 98 to 99% and above). The evaluation is performed in terms of precision, and Several algorithms are available for sen- sometimes also recall. The figures are given for one tence alignment, but there is a lack of or (less frequently) more corpus sizes. While this systematic evaluation and comparison of does give an indication of the performance of an al- these algorithms under different condi- gorithm, the variation in performance under varying tions. In most cases, the factors which conditions has not been considered in most cases. can significantly affect the performance Very little information is given about the conditions of a sentence alignment algorithm have under which evaluation was performed. This gives not been considered while evaluating. We the impression that the algorithm will perform with have used a method for evaluation that the reported precision and recall under all condi- can give a better estimate about a sen- tions. tence alignment algorithm's performance, We have tested several algorithms under differ- so that the best one can be selected. We ent conditions and our results show that the per- have compared four approaches using this formance of a sentence alignment algorithm varies method. These have mostly been tried significantly, depending on the conditions of test- on European language pairs. We have ing. Based on these results, we propose a method evaluated manually-checked and validated of evaluation that will give a better estimate of the English-Hindi aligned parallel corpora un- performance of a sentence alignment algorithm and der different conditions.
    [Show full text]
  • Slides for My Lecture for the Texperience 2010
    ◦ DEVELOPMENTDEVELOPMENT OF OFxxındyındy◦ SORTSORT AND AND MERGE MERGE RULES RULES FORFOR INDIC INDIC LANGUAGES LANGUAGES ZdeněkZdeněk Wagner, Wagner, Praha, Praha, Česká Česká republika republika AnshumanAnshuman Pandey, Pandey, Univ. Univ. Michigan, Michigan, USA USA JayaJaya Saraswati, Saraswati, Mumbai, Mumbai, India India ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh Russian: 娭¤¨ (meaning Hindi) ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh Russian: 娭¤¨ (meaning Hindi) x◦ındy sorts Hindi MakeIndexMakeIndex • version for English and German • CSIndex – version for Czech and Slovak • unpublished version for Sanskrit (Mark Csernel) Tables defining the sort algorithm are hard-wired in the pro- gram source code. Modification for other languages is difficult and leads rather to confusion than to development of a univer- sal tool. InternationalInternationalMakeIndexMakeIndex • Tables defining the sort algorithm present in external files. • Sort rules defined by regular expressions.
    [Show full text]
  • Note: the Font “Hindiromanized” Must Be Selected
    Hinweis: Der Font “HindiRomanized” muss Note: the font “HindiRomanized” must be ausgewählt sein! selected! Jetzt können Sie ihre Kaschmiri-Texte in Devanagari- Now you may convert your Hindi texts in Schrift umwandeln Devanagari script entweder in ihrer reversen Form (reverse either in their reverted form (reverse Transliteration) Transliteration) oder in ihrer romanisierten Form (Romanization) or in their romanized form (Romanization) Um das zu machen, verfahren Sie so: To do this proceed as described: Klicken Sie mit der rechten Maustaste auf das Right click the “Aksharamala” icon and choose „Aksharamala” Icon und wählen Sie „Options“ “Options” Wählen Sie „Transliteration using multiple Select “Transliteration using multiple Keymaps” Keymaps” Um Kashmiri revers: Aksharas mit dem inherenten To get Kashmiri revers: Aksharas with the Vokal “a” zu erhalten inherent Vowel “a” Wählen Sie als den 1. Keymap: Select as the 1st Keymap: “VerynewDevKashmiri FullReverse-Literation “VerynewDevKashmiri FullReverse-Literation (Unicode -> ITRANS)” (Unicode -> ITRANS)” Wählen Sie als den 2. Keymap “RomKashmiriRaina Select as the 2nd Keymap: (ITRANS -> Unicode)” “RomKashmiriRaina (ITRANS -> Unicode)” oder or “RomKashmiriKoul (ITRANS -> Unicode)” “RomKashmiriKoul (ITRANS -> Unicode)” Drücken Sie anschließend „OK“ Click finally on “OK” Wählen Sie nun den Keymap “VerynewDevKashmiri Now select the keymap “VerynewDevKashmiri FullReverse-Literation (Unicode -> ITRANS)” aus FullReverse-Literation (Unicode -> ITRANS)” Um Kashmiri revers: Aksharas ohne den inherenten
    [Show full text]
  • Power of Sanskrit
    09/03/2014 WEBPAGE: http://www.translink.profkrishna.com E-mail: [email protected] rofkrishna.com p www. वागतम ् amurthy, Singapore n n Swaagatham N. Krishnamurthy 23 February 2014 www.profkrishna.com Copyright: Dr. N. Krish Consultant, Singapore 1 Acknowledgements and Scope of talk ी गुयो नमः (shri gurub’yo’ namaha) Thanks to Singapore Dakshina Bharatha Brahmana Sabha, and Sri Srinivasan and all members of the rofkrishna.com Sabha Committee for organising this event for me to p p launch my transliteration scheme KrishnaDheva. www. Thanks also to the Sanskrit scholars here, as well as those who have come to learn how to pronounce Sanskrit correctly in English. This talk will not be a religious discourse amurthy, Singapore n This talk will not be a Sanskrit tutoring class This talk will simply be my sharing with you how to: Write down in simple English (KrishnaDheva) any Sanskrit material without special rules, and, Copyright: Dr. N. Krish Read Sanskrit correctly from KrishnaDheva. 2 1 09/03/2014 Starting off The wrong things we say: Sri should be s’ri Siva or Shiva should be s’iva rofkrishna.com p Krishna should be kr+shna www. Visaka should be vis’a’k’a’ Shuklambaradharam should be s’ukla’mbaradh’aram Vrishaba raasi should be vr+shab’a ra’s’ihi amurthy, Singapore n Kowsika gothra should be kaus’ika go’thra … and so on! Copyright: Dr. N. Krish 3 http://sanskritdocuments.org/news/subnews/NASASanskrit.txt Power of Sanskrit – a In ancient India the intention to discover truth was so consuming, that in the process, they discovered perhaps the most perfect tool for fulfilling such a search that the world has ever known – the Sanskrit language.
    [Show full text]
  • Python Module Index 9
    indictransliterationDocumentation Release 0.0.1 sanskrit-programmers Mar 28, 2021 Contents 1 Submodules 3 1.1 indic_transliteration.sanscript......................................3 1.1.1 Submodules...........................................3 1.1.1.1 indic_transliteration.sanscript.schemes........................3 1.1.1.1.1 Submodules.................................3 1.2 indic_transliteration.xsanscript......................................3 1.3 indic_transliteration.detect........................................3 1.3.1 Supported schemes.......................................4 1.4 indic_transliteration.deduplication....................................5 2 Indices and tables 7 Python Module Index 9 Index 11 i ii indictransliterationDocumentation; Release0:0:1 sanscript is the most popular submodule here. Contents 1 indictransliterationDocumentation; Release0:0:1 2 Contents CHAPTER 1 Submodules 1.1 indic_transliteration.sanscript 1.1.1 Submodules 1.1.1.1 indic_transliteration.sanscript.schemes 1.1.1.1.1 Submodules indic_transliteration.sanscript.schemes.roman indic_transliteration.sanscript.schemes.brahmi 1.2 indic_transliteration.xsanscript 1.3 indic_transliteration.detect Example usage: from indic_transliteration import detect detect.detect('pitRRIn') == Scheme.ITRANS detect.detect('pitRRn') == Scheme.HK When handling a Sanskrit string, it’s almost always best to explicitly state its transliteration scheme. This avoids embarrassing errors with words like pitRRIn. But most of the time, it’s possible to infer the encoding from the text itself.
    [Show full text]
  • Language Transliteration in Indian Languages – a Lexicon Parsing Approach
    LANGUAGE TRANSLITERATION IN INDIAN LANGUAGES – A LEXICON PARSING APPROACH SUBMITTED BY JISHA T.E. Assistant Professor, Department of Computer Science, Mary Matha Arts And Science College, Vemom P O, Mananthavady A Minor Research Project Report Submitted to University Grants Commission SWRO, Bangalore 1 ABSTRACT Language, ability to speak, write and communicate is one of the most fundamental aspects of human behaviour. As the study of human-languages developed the concept of communicating with non-human devices was investigated. This is the origin of natural language processing (NLP). Natural language processing (NLP) is a subfield of Artificial Intelligence and Computational Linguistics. It studies the problems of automated generation and understanding of natural human languages. A 'Natural Language' (NL) is any of the languages naturally used by humans. It is not an artificial or man- made language such as a programming language. 'Natural language processing' (NLP) is a convenient description for all attempts to use computers to process natural language. The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person. The last 50 years of research in the field of Natural Language Processing is that, various kinds of knowledge about the language can be extracted through the help of constructing the formal models or theories. The tools of work in NLP are grammar formalisms, algorithms and data structures, formalism for representing world knowledge, reasoning mechanisms. Many of these have been taken from and inherit results from Computer Science, Artificial Intelligence, Linguistics, Logic, and Philosophy.
    [Show full text]
  • Brahminet-ITRANS Transliteration Scheme
    BrahmiNet-ITRANS Transliteration Scheme Anoop Kunchukuttan August 2020 The BrahmiNet-ITRANS notation (Kunchukuttan et al., 2015) provides a scheme for transcription of major Indian scripts in Roman, using ASCII characters. It extends the ITRANS1 transliteration scheme to cover characters not covered in the original scheme. Tables 1a shows the ITRANS mappings for vowels and diacritics (matras). Table 1b shows the ITRANS mappings for consonants. These tables also show the Unicode offset for each character. By Unicode offset, we mean the offset of the character in the Unicode range assigned to the script. For Indic scripts, logically equivalent characters are assigned the same offset in their respective Unicode codepoint ranges. For illustration, we also show the Devanagari characters corresponding to the transliteration. ka kha ga gha ∼Na, N^a ITRANS Unicode Offset Devanagari 15 16 17 18 19 a 05 अ क ख ग घ ङ aa, A 06, 3E आ, ◌ा cha Cha ja jha ∼na, JNa i 07, 3F इ, ि◌ 1A 1B 1C 1D 1E ii, I 08, 40 ई, ◌ी च छ ज झ ञ u 09, 41 उ, ◌ु Ta Tha Da Dha Na uu, U 0A, 42 ऊ, ◌ू 1F 20 21 22 23 RRi, R^i 0B, 43 ऋ, ◌ृ ट ठ ड ढ ण RRI, R^I 60, 44 ॠ, ◌ॄ ta tha da dha na LLi, L^i 0C, 62 ऌ, ◌ॢ 24 25 26 27 28 LLI, L^I 61, 63 ॡ, ◌ॣ त थ द ध न pa pha ba bha ma .e 0E, 46 ऎ, ◌ॆ 2A 2B 2C 2D 2E e 0F, 47 ए, ◌े प फ ब भ म ai 10,48 ऐ, ◌ै ya ra la va, wa .o 12, 4A ऒ, ◌ॊ 2F 30 32 35 o 13, 4B ओ, ◌ो य र ल व au 14, 4C औ, ◌ौ sha Sha sa ha aM 05 02, 02 अं 36 37 38 39 aH 05 03, 03 अः श ष स ह .m 02 ◌ं Ra lda, La zha .h 03 ◌ः 31 33 34 (a) Vowels ऱ ळ ऴ (b) Consonants Version: v1.0 (9 August 2020) NOTES: 1.
    [Show full text]
  • Tugboat, Volume 19 (1998), No. 2 115 an Overview of Indic Fonts For
    TUGboat, Volume 19 (1998), No. 2 115 the Indo-Aryan and Dravidian language families of Fonts India. Such uniformity in phonetics is reflected in orthography, which in turn enables all scripts to be transliterated through a single scheme. This unifor- An Overview of Indic Fonts for TEX mity has subsequently been reflected in the translit- Anshuman Pandey eration schemes of the Indic language/script pack- ages. 1 Introduction Most packages have their own transliteration Many scholars and students in the humanities have scheme, but these schemes are essentially variations on a single scheme, differing merely in the coding preferred TEX over other “word processors” or doc- ument preparation systems because of the ease TEX of a few vowel, nasal, and retroflex letters. Most provides them in typesetting non-Roman scripts, the of these packages accept input in one of the two availability of TEX fonts of interest to them, and the primary 7-bit transliteration schemes— ITRANS or ability TEX has in producing well-structured docu- Velthuis—or a derivative of one of them. There ments. is also an 8-bit format called CS/CSX which a few However, this is not the case amongst Indol- of these packages support. CS/CSX is described in ogists. The lack of Indic fonts for TEXandthe further detail in Section 3. perceived difficulty of typesetting them have often 2 The Fonts and Packages turned Indologists away from using TEX. Little do they realize that TEXisthe foremost tool for de- Figure 1 shows examples of the various fonts de- veloping Indic language/script documents.
    [Show full text]
  • Krishnadeva, a Fresh Transliteration Scheme
    KRISHNADHEVA SCRIPT – Transliteration Scheme for Indian Languages to English N. Krishnamurthy 1. INTRODUCTION For the thousands of years that Devanagari has existed, millions have used it, mainly to chant Sanskrit (or other Devanagari-script based) prayers and scriptures. Devanagari is the script, and it is used for many languages such as Sanskrit, Hindi, and Marathi. Many do not know the Devanagari script, but have access to its transliterations into a language they know. Of these, a number of South Indian languages, such as Kannada, Malayalam and Telugu, based on Devanagari, have very similar alphabet sets with almost identical sounds. To them the transliterated versions would have been faithful reproductions of the original. However for those who did not know such Devanagari-based languages, or who did not have the transliterations available, the next recourse was to use the English language script. The author of this paper learnt Sanskrit at home from a priest as a boy, and had Sanskrit as second language in his undergraduate days. He also studied some Hindu scriptures in the Devanagari script, and formally studied a small portion of the Veda-s through a Guru. Problems with Devanagari-related languages: Many adjustments and accommodations have been made to reflect the sounds of Devanagari, such as the following (see Comparison Table at end): . The International Alphabet of Sanskrit Transliteration (IAST) includes the following conventions: "i", "u", etc. with a bar above it, to represent the long "i", "u", etc. "t", "d", and "s" with a dot underneath each, to represent the sound "t", "d", and "sh", while without the dot, they would represent, "th", "dh" and "s".
    [Show full text]
  • Transcription Final Table
    MAJOR ALTERNATIVE TRANSLITERATION METHODS FOR MAINLAND SOUTHEAST ASIAN OLD SCRIPTS So far, this table includes Old Khmer and Cam scripts. IAST: International Alphabet of Sanskrit Transliteration ITRANS: Indian languages TRANSliteration See http://en.wikipedia.org/wiki/Devanagari_transliteration And http://indology.info/email/members/wujastyk And http://fr.wikipedia.org/wiki/American_Standard_Code_for_Information_Interchange See also: Antelme 2007 CIK+CIC CIK+CIC Other Harvard- ITRANS ASCII Sakamoto IAST Velthuis Schreiner Remarks publications pub. Kyoto 5.3 corpora Glottal stop (as in Khmer, Mon and ’ q q, a Q other Southeast Asian scripts) VOWELS a a a A a a a a a ā aa ā AF ā A aa, A aa -a i i i I i i i i i ī ii ī IF ī I ii, I ii -i u u u U u u u u u ū uu ū UF ū U uu, U uu -u 1 e e e E e e e e e ai ai ai AI ai ai ai ai ai o o o O o o o o o au au au AU au au au au au The consonant .r r̥ .r r̥ R. r̥ R RRi, R^i .r .r (and the .rh) is not relevant for Southeast Asian scripts: r̥/r ̥̄ always represent the vowel. The dot r̥̄ .r.r r̥̄ r̥̄ RR RRI, R^I .rr, .r.r -.r underneath the r should be a small circle (here I used Lucida Grande). ḷ .l ḷ L. ḷ IR LLi, L^i .l .l ḹ .l.l ḹ ḹ IRR LLI, L^I .ll, .l.l -.l ṁ .m ṃ M.
    [Show full text]
  • Claim Service User Guide
    ITRANS Dental User Guide © Continovation Services Inc. 1 www.continovation.com/ASSIST (05/11) Dental User Guide Table of Contents Dental User Guide .................................................................................. 2 Overview ............................................................................................. 3 Requirements ....................................................................................... 3 Log In ................................................................................................. 4 The Main Screen .................................................................................... 5 Log Off ............................................................................................... 6 General Navigation ................................................................................. 7 Library ............................................................................................. 7 Transactions ......................................................................................... 9 Most Recent ...................................................................................... 9 Pending Transactions .......................................................................... 14 Query ............................................................................................. 16 Management ....................................................................................... 18 Profile ...........................................................................................
    [Show full text]
  • A Tool for Transliteration of Bilingual Texts Involving Sanskrit
    A Tool for Transliteration of Bilingual Texts Involving Sanskrit Nikhil Chaturvedi Prof. Rahul Garg IIT Delhi IIT Delhi [email protected] [email protected] Abstract Sanskrit texts are increasingly being written in bilingual and trilingual formats, with Sanskrit paragraphs/shlokas followed by their corresponding English commentary. Sanskrit can also be written in many ways, including multiple encodings like SLP-1 and Velthuis for its romanised form. The need to tackle such code-switching is exacerbated through the requirement to render web pages with multilingual Sanskrit content. These need to automatically detect whether a given text fragment is in Sanskrit, followed by the identification of the form/encoding, further selectively performing transliteration to a user specified script. The Brahmi-derived writing systems of Indian languages are mostly rather similar in structure, but have different letter shapes. These scripts are based on similar phonetic values which allows for easy transliteration. This correspondence forms the basis of the motivation behind deriving a uniform encoding schema that is based on the underlying phonetic value rather than the symbolic representation. The open-source tool developed by us performs this end-to-end detection and transliteration, and achieves an accuracy of 99.1% between SLP-1 and English on a Wikipedia corpus using simple machine learning techniques. 1 Introduction Sanskrit is one of the most ancient languages in India and forms the basis of numerous Indian lan- guages. It is the only known language which has a built-in scheme for pronunciation, word formation and grammar (Maheshwari, 2011). It one of the most used languages of it's time (Huet et al., 2009) and hence encompasses a rich tradition of poetry and drama as well as scientific, technical, philosophical and religious texts.
    [Show full text]