Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs

Anil Kumar Singh and Samar Husain
LTRC, IIIT, Gachibowli, Hyderabad, India - 500019

Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 99–106, Ann Arbor, June 2005. © 2005 Association for Computational Linguistics

Abstract

Several algorithms are available for sentence alignment, but there is a lack of systematic evaluation and comparison of these algorithms under different conditions. In most cases, the factors which can significantly affect the performance of a sentence alignment algorithm have not been considered while evaluating. We have used a method for evaluation that can give a better estimate of a sentence alignment algorithm's performance, so that the best one can be selected. We have compared four approaches using this method. These have mostly been tried on European language pairs. We have evaluated manually-checked and validated English-Hindi aligned parallel corpora under different conditions. We also suggest some guidelines on actual alignment.

1 Introduction

Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other. Sentence alignment means identifying which sentence in the target language (TL) is a translation of which one in the source language (SL). Such corpora are useful for statistical NLP, algorithms based on unsupervised learning, automatic creation of resources, and many other applications.

Over the last fifteen years, several algorithms have been proposed for sentence alignment. Their performance as reported is excellent (in most cases not less than 95%, and usually 98 to 99% and above). The evaluation is performed in terms of precision, and sometimes also recall. The figures are given for one or (less frequently) more corpus sizes. While this does give an indication of the performance of an algorithm, the variation in performance under varying conditions has not been considered in most cases. Very little information is given about the conditions under which evaluation was performed. This gives the impression that the algorithm will perform with the reported precision and recall under all conditions.

We have tested several algorithms under different conditions and our results show that the performance of a sentence alignment algorithm varies significantly, depending on the conditions of testing. Based on these results, we propose a method of evaluation that will give a better estimate of the performance of a sentence alignment algorithm and will allow a more meaningful comparison. Our view is that unless this is done, it will not be possible to pick the best algorithm for a certain set of conditions. Those who want to align parallel corpora may end up picking a less suitable algorithm for their purposes. We have used the proposed method for comparing four algorithms under different conditions. Finally, we also suggest some guidelines for using these algorithms for actual alignment.

2 Sentence Alignment Methods

Sentence alignment approaches can be categorized as based on sentence length, word correspondence, and composite (where more than one approach is combined), though other techniques, such as cognate matching (Simard et al., 1992), were also tried. Word correspondence was used by Kay (Kay, 1991; Kay and Roscheisen, 1993). It was based on the idea that words which are translations of each other will have similar distributions in the SL and TL texts. Sentence length methods were based on the intuition that the length of a translated sentence is likely to be similar to that of the source sentence. Brown, Lai and Mercer (Brown et al., 1991) used word count as the sentence length, whereas Gale and Church (Gale and Church, 1991) used character count. Brown, Lai and Mercer assumed prior alignment of paragraphs. Gale and Church relied on some previously aligned sentences as 'anchors'. Wu (Wu, 1994) also used lexical cues from a corpus-specific bilingual lexicon for better alignment.

Word correspondence was further developed in IBM Model-1 (Brown et al., 1993) for statistical machine translation. Melamed (Melamed, 1996) also used word correspondence in a different (geometric correspondence) way for sentence alignment.

Simard and Plamondon (Simard and Plamondon, 1998) used a composite method in which the first pass does alignment at the level of characters as in (Church, 1993) (itself based on cognate matching) and the second pass uses IBM Model-1, following Chen (Chen, 1993). The method used by Moore (Moore, 2002) also had two passes, the first one being based on sentence length (word count) and the second on IBM Model-1. Composite methods are used so that different approaches can complement each other.

3 Factors in Performance

As stated above, the performance of a sentence alignment algorithm depends on some identifiable factors. We can even make predictions about whether the performance will increase or decrease. However, as the results given later show, the algorithms don't always behave in a predictable way. For example, one of the algorithms did worse rather than better on an 'easier' corpus. This variation in performance is quite significant and it cannot be ignored for actual alignment (Table 1). Some of these factors have been indicated in earlier papers, but these were not taken into account while evaluating, nor were their effects studied.

Translation of a text can be fairly literal or it can be a recreation, with a whole range between these two extremes. Paragraphs and/or sentences can be dropped or added. In actual corpora, there can even be noise (sentences which are not translations at all and may not even be part of the actual text). This can happen due to the fact that the texts have been extracted from some other format such as web pages. While translating, sentences can also be merged or split. Thus, the SL and TL corpora may differ in size.

All these factors affect the performance of an algorithm in terms of, say, precision, recall and F-measure. For example, we can expect the performance to worsen if there is an increase in additions, deletions, or noise. And if the texts were translated fairly literally, statistical algorithms are likely to perform better. However, our results show that this does not happen for all the algorithms.

The linguistic distance between SL and TL can also play a role in performance. The simplest measure of this distance is in terms of the distance on the family tree model. Other measures could be the number of cognate words or some measure based on syntactic features. For our purposes, it may not be necessary to have a quantitative measure of linguistic distance. The important point is that for languages that are distant, some algorithms may not perform too well, if they rely on some closeness between languages. For example, an algorithm based on cognates is likely to work better for English-French or English-German than for English-Hindi, because there are fewer cognates for English-Hindi.

It is reasonable to say that Hindi is more distant from English than German is. English and German belong to the Germanic branch whereas Hindi belongs to the Indo-Aryan branch. There are many more cognates between English and German than between English and Hindi. Similarly, as compared to French, Hindi is also distant from English in terms of morphology. The vibhaktis of Hindi can adversely affect the performance of sentence length (especially word count) as well as word correspondence based algorithms. From the syntactic point of view, Hindi is a comparatively free word order language, but with a preference for the SOV (subject-object-verb) order, whereas English is more of a fixed word order and SVO type language. For sentence length and IBM Model-1 based sentence alignment, this doesn't matter since they don't take the word order into account. However, Melamed's algorithm (Melamed, 1996), though it allows 'non-monotonic chains' (thus taking care of some difference in word order), is somewhat sensitive to the word order. As Melamed states, how it will fare with languages with more word order variation than English and French is an open question.

Another aspect of performance, which may not seem important from an NLP-research point of view, is speed. Someone who has to use these algorithms for actual alignment of large corpora (say, more than 1000 sentences) will have to realize the importance of speed. Any algorithm which does worse than O(n) is bound to create problems for large sizes. Obviously, an algorithm that can align 5000 sentences in 1 hour is preferable to one which takes three days, even if the latter is marginally more accurate. Similarly, one which takes 2 minutes for 100 sentences, but 16 minutes for 200 sentences, will be difficult to use for practical purposes.

[...] corpus sizes. They also compared the performance of several (till then known) algorithms.

In most of the other cases, evaluation was performed on only one corpus type and one corpus size. In some cases, certain other factors were considered, but not very systematically. In other words, there wasn't an attempt to study the effect of the various factors described earlier on the performance. In some cases, the size used for testing was too small. One other detail is that size was sometimes mentioned in terms of number of words, not number of sentences.

5 Evaluation Measures

We have used local (for each run) as well as global (over all the runs) measures of performance of an algorithm. These measures are:

• Precision (local and global)
• Recall (local and global)
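The distinction between local and global precision and recall can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not confirm: an alignment is represented as a set of (source sentence ids, target sentence ids) pairs, a proposed pair counts as correct only if it matches a reference pair exactly, and the global figures are computed by pooling the counts of all runs. The names `Counts`, `count_run` and `pooled` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass

# One aligned unit: (source sentence ids, target sentence ids).
# This representation is an assumption made for the sketch.
Bead = tuple[tuple[int, ...], tuple[int, ...]]


@dataclass
class Counts:
    correct: int    # proposed beads that also occur in the reference
    proposed: int   # beads output by the aligner
    reference: int  # beads in the manually checked alignment


def count_run(proposed: set[Bead], reference: set[Bead]) -> Counts:
    """Count exact matches for a single run (one corpus sample)."""
    return Counts(len(proposed & reference), len(proposed), len(reference))


def precision(c: Counts) -> float:
    return c.correct / c.proposed if c.proposed else 0.0


def recall(c: Counts) -> float:
    return c.correct / c.reference if c.reference else 0.0


def pooled(runs: list[Counts]) -> Counts:
    """Assumed definition of 'global': sum the counts over all runs."""
    return Counts(
        sum(r.correct for r in runs),
        sum(r.proposed for r in runs),
        sum(r.reference for r in runs),
    )


# Run 1: the aligner misses a 1-to-2 alignment and proposes 1-to-1 instead.
ref1 = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
out1 = {((0,), (0,)), ((1,), (1,)), ((2,), (3,))}

# Run 2: the aligner proposes only one of the two reference beads.
ref2 = {((0,), (0,)), ((1,), (1,))}
out2 = {((0,), (0,))}

runs = [count_run(out1, ref1), count_run(out2, ref2)]

for i, c in enumerate(runs, start=1):
    print(f"run {i}: precision={precision(c):.3f} recall={recall(c):.3f}")

total = pooled(runs)
print(f"global: precision={precision(total):.3f} recall={recall(total):.3f}")
```

Pooling counts weights each run by its size; averaging the per-run figures instead would weight every run equally. Which of the two corresponds to the global measure intended here is not stated in the text above, so the choice made in the sketch should be treated as illustrative.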