Brahmi-Net: a Transliteration and Script Conversion System for Languages of the Indian Subcontinent
Total Page:16
File Type:pdf, Size:1020Kb
Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent Anoop Kunchukuttan ∗ Ratish Puduppully ∗y Pushpak Bhattacharyya IIT Bombay IIIT Hyderabad IIT Bombay [email protected] ratish.surendran [email protected] @research.iiit.ac.in Abstract one of the oldest writing systems of the Indian sub- continent which can be dated to at least the 3rd cen- We present Brahmi-Net - an online system for tury B.C.E. In addition, Arabic-derived and Roman transliteration and script conversion for all ma- scripts are also used for some languages. Given the jor Indian language pairs (306 pairs). The sys- diversity of languages and scripts, transliteration and tem covers 13 Indo-Aryan languages, 4 Dra- script conversion are extremely important to enable vidian languages and English. For training effective communication. the transliteration systems, we mined paral- The goal of script conversion is to represent the lel transliteration corpora from parallel trans- lation corpora using an unsupervised method source script accurately in the target script, without and trained statistical transliteration systems loss of phonetic information. It is useful for exactly using the mined corpora. Languages which reading manuscripts, signboards, etc. It can serve do not have parallel corpora are supported as a useful tool for linguists, NLP researchers, etc. by transliteration through a bridge language. whose research is multilingual in nature. Script con- Our script conversion system supports con- version enables reading text written in foreign scripts version between all Brahmi-derived scripts as accurately in a user's native script. On the other well as ITRANS romanization scheme. For this, we leverage co-ordinated Unicode ranges hand, transliteration aims to conform to the phonol- between Indic scripts and use an extended ogy of the target language, while being close to the ITRANS encoding for transliterating between source language phonetics. Transliteration is needed English and Indic scripts. The system also pro- for phonetic input systems, cross-lingual informa- vides top-k transliterations and simultaneous tion retrieval, question-answering, machine transla- transliteration into multiple output languages. tion and other cross-lingual applications. We provide a Python as well as REST API to Brahmi-Net is a general purpose transliteration access these services. The API and the mined and script conversion system that aims to provide so- transliteration corpus are made available for research use under an open source license. lutions for South Asian scripts and languages. While transliteration and script conversion are challenging given the scale and diversity, we leverage the com- 1 Introduction monality in the phonetics and the scriptural systems of these languages. The major features of Brahmi- The Indian subcontinent is home to some of the most Net are: widely spoken languages of the world. It is unique 1. It supports 18 languages and 306 language in terms of the large number of scripts used for writ- pairs for statistical transliteration. The sup- ing these languages. Most of the these are abugida ported languages cover 13 Indo-Aryan lan- scripts derived from the Brahmi script. Brahmi is guage (Assamese, Bengali, Gujarati, Hindi, ∗These authors contributed equally to this project Konkani, Marathi, Nepali, Odia, Punjabi, San- yWork done while the author was at IIT Bombay skrit, Sindhi, Sinhala, Urdu) , 4 Dravidian lan- guages (Kannada, Malayalam, Tamil, Telugu) 2.1 Among scripts of the Brahmi family and English. To the best of our knowledge, Each Brahmi-derived Indian language script has no other system covers as many languages and been allocated a distinct codepoint range in the Uni- scripts. code standard. These scripts have a similar char- acter inventory, but different glyphs. Hence, the 2. It supports script conversion among the fol- first 85 characters in each Unicode block are in the lowing 10 scripts used by major Indo-Aryan same order and position, on a script by script basis. and Dravidian languages: Bengali, Gujarati, Our script conversion method simply maps the code- Kannada, Malayalam, Odia, Punjabi, Devana- points between the two scripts. gari, Sinhala, Tamil and Telugu. Some of these The Tamil script is different from other scripts scripts are used for writing multiple languages. since it uses the characters for unvoiced, unaspi- Devanagari is used for writing Hindi, Sanskrit, rated plosives for representing voiced and/or aspi- Marathi, Nepali, Konkani and Sindhi. The Ben- rated plosives. When converting into the Tamil gali script is also used for writing Assamese. script, we substitute all voiced and/or aspirated plo- Also, Sanskrit has historically been written in sives by the corresponding unvoiced, unaspirated many of the above mentioned scripts. plosive in the Tamil script. For Sinhala, we do an ex- 3. The system also supports an extended ITRANS plicit mapping between the characters since the Uni- transliteration scheme for romanization of the code ranges are not coordinated. Indic scripts. This simple script conversion scheme accounts for a vast majority of the characters. However, there 4. The transliteration and script conversion sys- are some characters which do not have equivalents tems are accessible via an online portal. Some in other scripts, an issue we have not addressed so additional features include the ability to simul- far. For instance, the Dravidian scripts do not have taneously view transliterations to all available the nukta character. languages and the top-k best transliterations. 2.2 Between a Roman transliteration scheme 5. An Application Programming Interface (API) is and scripts from the Brahmi family available as a Python package and a REST in- We chose ITRANS1 as our transliteration scheme terface for easy integration of the transliteration since: (i) it can be entered using Roman keyboard and script conversion systems into other appli- characters, (ii) the Roman character mappings map cations requiring transliteration services. to Indic script characters in a phonetically intuitive fashion. The official ITRANS specification is lim- 6. As part of the project, parallel transliteration ited to the Devanagari script. We have added a few corpora has been mined for transliteration be- extensions to account for some characters not found tween 110 languages pairs for the following 11 in non-Devanagari scripts. Our extended encoding languages: Bengali, Gujarati, Hindi, Konkani, is backward compatible with ITRANS. We convert Marathi, Punjabi, Urdu, Malayalam, Tamil, Devanagari to ITRANS using Alan Little's python Telugu and English. The parallel translitera- module2. For romanization of other scripts, we use tion corpora is comprised of 1,694,576 word Devanagari as a pivot script and use the inter-Brahmi pairs across all language pairs, which is roughly script converter mentioned in Section 2.1. 15,000 mined pairs per language pair. 3 Transliteration 2 Script Conversion Though Indian language scripts are phonetic and Our script conversion engine contains two rule- largely unambiguous, script conversion is not a sub- based systems: one for script conversion amongst 1http://www.aczoom.com/itrans/ scripts of the Brahmi family, and the other for ro- 2http://www.alanlittle.org/projects/ manization of Brahmi scripts. transliterator/transliterator.html stitute for transliteration which needs to account for system is trained on them. the target language phonology and orthographic con- We used a hybrid approach for transliteration in- ventions. The main challenges that machine translit- volving languages for which we could not mine eration systems encounter are: script specifications, a parallel transliteration corpus. Source languages missing sounds, transliteration variants, language of which cannot be statistically transliterated are first origin, etc. (Karimi et al., 2011). A summary of the transliterated into a phonetically close language challenges specific to Indian languages is described (bridge language) using the above-mentioned rule- by Antony, P. J. and Soman, K.P. (2011). based system. The bridge language is then transliter- ated into the target language using statistical translit- 3.1 Transliteration Mining eration. Similarly, for target languages which cannot Statistical transliteration can address these chal- be statistically transliterated, the source is first sta- lenges by learning transliteration divergences from a tistically transliterated into a phonetically close lan- parallel transliteration corpus. For most Indian lan- guage, followed by rule-based transliteration into the guage pairs, parallel transliteration corpora are not target language. publicly available. Hence, we mine transliteration corpora for 110 language pairs from the ILCI corpus, 4 Brahmi-Net Interface a parallel translation corpora of 11 Indian languages (Jha, 2012). Transliteration pairs are mined using Brahmi-Net is accessible via a web interface as well the unsupervised approach proposed by Sajjad et al. an API. We describe these interfaces in this section. (2012) and implemented in the Moses SMT system (Durrani et al., 2014). Their approach models paral- 4.1 Web Interface lel translation corpus generation as a generative pro- The purpose of the Web interface is to allow users cess comprising an interpolation of a transliteration quick access to transliteration and script conversion and a non-transliteration process. The parameters of services. They can also choose to see the translitera-