Orthographic Syllable as basic unit for SMT between Related Languages

Anoop Kunchukuttan, Pushpak Bhattacharyya
Center for Indian Language Technology, Department of Computer Science & Engineering
Indian Institute of Technology Bombay
{anoopk,pb}@cse.iitb.ac.in

Abstract

We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.

1 Introduction

Related languages exhibit lexical and structural similarities on account of sharing a common ancestry (Indo-Aryan, Slavic languages) or being in contact for a prolonged period of time (the Indian subcontinent, the Standard Average European linguistic area) (Bhattacharyya et al., 2016). Translation between related languages is an important requirement due to substantial government, business and social communication among people speaking these languages. However, most of these languages have few parallel corpus resources, an important requirement for building good quality SMT systems.

Modelling the lexical similarity among related languages is the key to building good-quality SMT systems with limited parallel corpora. Lexical similarity implies that the languages share many words with similar form (spelling/pronunciation) and meaning, e.g. blindness is andhapana in Hindi and aandhaLepaNaa in Marathi. These words could be cognates, lateral borrowings or loan words from other languages. Translation of such words can be achieved by sub-word level transformations. For instance, lexical similarity can be modelled in the standard SMT pipeline by transliteration of words while decoding (Durrani et al., 2010) or in post-processing (Nakov and Tiedemann, 2012; Kunchukuttan et al., 2014).

A different paradigm is to drop the notion of word boundary and consider the character n-gram as the basic unit of translation (Vilar et al., 2007; Tiedemann, 2009a). Such character-level SMT has been explored for closely related languages like Bulgarian-Macedonian and Indonesian-Malay with modest success, with the short context of unigrams being a limiting factor (Tiedemann, 2012). The use of character n-gram units to address this limitation leads to data sparsity for higher order n-grams and provides little benefit (Tiedemann and Nakov, 2013).

In this work, we present a linguistically motivated, variable-length unit of translation, the orthographic syllable (OS), which provides more context for translation while limiting the number of basic units. The OS consists of one or more consonants followed by a vowel and is inspired by the akshara, a consonant-vowel unit which is the fundamental organizing principle of Indic scripts (Sproat, 2003; Singh, 2006). It can be thought of as an approximate syllable with an onset and nucleus, but no coda. While true syllabification is hard, orthographic syllabification can be done easily. Atreya et al. (2016) and Ekbal et al. (2006) have shown that the OS is a useful unit for transliteration involving Indian languages.

We show that orthographic syllable-level translation significantly outperforms character-level as well as strong word-level and morpheme-level baselines over multiple related language pairs (Indian as well as others). Character-level approaches have previously been shown to work well for language pairs with high lexical similarity. Our major finding is that OS-level translation outperforms other approaches even when the language pairs have relatively low lexical similarity or belong to different language families (but have a sufficient contact relation).

Basic Unit              Example                 Transliteration
Word                    घरासमोरचा               gharAsamoracA
Morph Segment           घरा समोर चा             gharA samora cA
Orthographic Syllable   घ रा स मो र चा          gha rA sa mo ra cA
Character unigram       घ र ◌ा स म ◌ो र च ◌ा   gha ra A sa ma o ra ca A
Character 3-gram        घरा समो रचा             gharA samo rcA

Table 1: Various translation units for a Marathi word meaning something that is in front of the home: ghara=home, samora=front, cA=of
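The contrast between the units in Table 1 can be illustrated with a small sketch. The snippet below is ours, not part of the paper's tooling: it segments a lowercase Latin-script word (the paper's example is Devanagari) into orthographic syllables using the C+V+ scheme of Section 2, and into the non-overlapping character n-gram units used as baselines. Treating `aeiou` as the vowel set and attaching word-final consonants to the last syllable are our simplifying assumptions.

```python
import re

def os_segment_latin(word):
    """Split a lowercase Latin-script word into orthographic syllables:
    maximal consonant*-vowel+ chunks, per the paper's C+V+ scheme.
    Word-final consonants (not covered by the scheme) are attached to
    the preceding syllable -- an assumption of this sketch."""
    chunks = re.findall(r'[^aeiou]*[aeiou]+|[^aeiou]+$', word)
    if len(chunks) > 1 and not re.search(r'[aeiou]', chunks[-1]):
        chunks[-2:] = [chunks[-2] + chunks[-1]]
    return chunks

def char_ngrams(word, n):
    """Non-overlapping character n-grams, the baseline translation units."""
    return [word[i:i + n] for i in range(0, len(word), n)]

print(os_segment_latin("lakshami"))  # ['la', 'ksha', 'mi']
print(os_segment_latin("mumbai"))    # ['mu', 'mbai']
print(char_ngrams("lakshami", 3))    # ['lak', 'sha', 'mi']
```

Note how the orthographic syllables align with pronounceable chunks while the fixed 3-grams cut across them; this is the variable-length property the paper exploits.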

2 Orthographic Syllabification

The orthographic syllable is a sequence of one or more consonants followed by a vowel, i.e. a C+V unit. We briefly describe procedures for orthographic syllabification of Indic scripts and non-Indic alphabetic scripts. Orthographic syllabification cannot be done for languages using logographic and abjad scripts, as these scripts do not represent vowels.

Indic Scripts: Indic scripts are abugida scripts, consisting of consonant-vowel sequences, with a consonant core (C+) and a dependent vowel (maatraa). If no vowel follows a consonant, an implicit schwa vowel [IPA: ə] is assumed. Suppression of the schwa is indicated by the halanta character following a consonant. This script design makes for a straightforward syllabification process, as shown in the following example: लक्षमी (lakShamI, CVCCVCV) is segmented as ल क्ष मी (la kSha mI, CV CCV CV). There are two exceptions to this scheme: (i) Indic scripts distinguish between dependent vowels (vowel diacritics) and independent vowels, and the latter constitute an OS on their own, e.g. मुम्बई (mumbaI) → मु म्ब ई (mu mba I); (ii) the characters anusvaara and chandrabindu are part of the OS to the left if they represent nasalization of the vowel/consonant, or start a new OS if they represent a nasal consonant. Their exact role is determined by the character following the anusvaara. See Appendix A for details.

Non-Indic Alphabetic Scripts: We use a simpler method for the alphabetic scripts used in our experiments (Latin and Cyrillic). The OS is identified by a C+V+ sequence, e.g. lakshami → la ksha mi, mumbai → mu mbai. The OS can contain multiple terminal vowel characters, representing long vowels (oo in cool) or diphthongs (ai in mumbai). A vowel starting a word is considered to be an OS.

3 Translation Models

We compared the orthographic syllable level model (O) with models based on other translation units that have been reported in previous work: word (W), morpheme (M), character unigram (C) and character trigram. Table 1 shows examples of these representations. The first step in building these translation systems is to transform sentences to the chosen representation. Each word is segmented as per the unit of representation, punctuation marks are retained, and a special word boundary marker character (_) is introduced to indicate word boundaries, as shown here:

W: राजू , घराबाहेर जाऊ नको .
O: रा जू _ , _ घ रा बा हे र _ जा ऊ _ न को _ .

For all units of representation, we trained phrase-based SMT (PBSMT) systems. Since related languages have similar word order, we used a distance-based distortion model and monotonic decoding. For character and orthographic syllable level models, we used higher order (10-gram) language models, since data sparsity is a lesser concern due to the small vocabulary size (Vilar et al., 2007). As suggested by Nakov and Tiedemann (2012), we used word-level tuning for character and orthographic syllable level models by post-processing n-best lists in each tuning step to calculate the usual word-based BLEU score.

While decoding, the word and morpheme level systems cannot translate OOV words. Since the languages involved share vocabulary, we transliterate the untranslated words, resulting in the post-edited systems WX and MX corresponding to the systems W and M respectively. Following decoding, we used a simple method to regenerate words from sub-word level units: since we represent word boundaries using a word boundary marker, we simply concatenate the output units between consecutive occurrences of the marker character.

4 Experimental Setup

Languages: Our experiments primarily concentrated on multiple language pairs from the two major language families of the Indian sub-continent (the Indo-Aryan branch of Indo-European, and Dravidian). These languages have been in contact for a long time, hence there are many lexical and grammatical similarities among them, leading to the sub-continent being considered a linguistic area (Emeneau, 1956). Specifically, there is overlap between the vocabularies of these languages to varying degrees due to cognates, language contact and loanwords from Sanskrit (throughout history) and English (in recent times). Table 2 lists the languages involved in the experiments and provides an indication of the lexical similarity between them in terms of the Longest Common Subsequence Ratio (LCSR) (Melamed, 1995) between the parallel training sentences at the character level. All these languages have a rich inflectional morphology, with the Dravidian languages, and Marathi and Konkani to some degree, being agglutinative. kok-mar and pan-hin have a high degree of lexical similarity.

IA→IA            DR→DR            IA→DR
ben-hin  52.30   mal-tam  39.04   hin-mal  33.24
pan-hin  67.99   tel-mal  39.18   DR→IA
kok-mar  54.51                    mal-hin  33.24

IA: Indo-Aryan, DR: Dravidian

Table 2: Language pairs used in the experiments, along with the lexical similarity between them in terms of LCSR between training corpus sentences

Dataset: We used the multilingual ILCI corpus for our experiments (Jha, 2012), consisting of a modest number of sentences from the tourism and health domains. The data split is as follows – training: 44,777, tuning: 1K, test: 2K sentences. Language models for word-level systems were trained on the target side of the training corpora plus monolingual corpora from various sources [hin: 10M (Bojar et al., 2014), tam: 1M (Ramasamy et al., 2012), mar: 1.8M (news websites), mal: 200K (Quasthoff et al., 2006) sentences]. We used the target language side of the parallel corpora for the character, morpheme and OS level LMs.

System details: PBSMT systems were trained using the Moses system (Koehn et al., 2007), with the grow-diag-final-and heuristic for extracting phrases, and Batch MIRA (Cherry and Foster, 2012) for tuning (default parameters). We trained 5-gram LMs with Kneser-Ney smoothing for word and morpheme level models and 10-gram LMs for character and OS level models. We used the BrahmiNet transliteration system (Kunchukuttan et al., 2015) for post-editing, which is based on the transliteration module in Moses (Durrani et al., 2014). We used unsupervised morphological segmenters trained with Morfessor (Virpioja et al., 2013) for obtaining morpheme representations. The unsupervised morphological segmenters were trained on the ILCI corpus and the Leipzig corpus (Quasthoff et al., 2006). The morph-segmenters and our implementation of orthographic syllabification are made available as part of the Indic NLP Library (http://anoopkunchukuttan.github.io/indic_nlp_library).

Evaluation: We use BLEU (Papineni et al., 2002) and Le-BLEU (Virpioja and Grönroos, 2015) for evaluation. Le-BLEU does fuzzy matching of words and hence is suitable for evaluating SMT systems that perform transformations at the sub-word level.

5 Results and Discussion

This section discusses the results on Indian and non-Indian languages and on cross-domain translation.

Comparison of Translation Units: Table 3 compares the BLEU scores for the various translation systems. The orthographic syllable level system is clearly better than all other systems. It significantly outperforms the character-level system (by 46% on average). The character-based system is competitive only for highly lexically similar language pairs like pan-hin and kok-mar. The OS-level system also outperforms two strong baselines which address data sparsity: (a) a word-level system with transliteration of OOV words (10% improvement), and (b) a morph-level system with transliteration of OOV morphemes (5% improvement). The OS-level representation is more beneficial when morphologically rich languages are involved in translation. Significantly, OS-level translation is also the best system for translation between languages of different language families. The Le-BLEU scores show the same trend as the BLEU scores (see Appendix B). There are a very small number of untranslated OSes, which we handled by simple mapping of untranslated characters from the source to the target script. This barely increased translation accuracy (0.02% increase in BLEU score).

         W      WX     M      MX     C      O
ben-hin  31.23  32.79  32.17  32.32  27.95  33.46
pan-hin  68.96  71.71  71.29  71.42  71.26  72.51
kok-mar  21.39  21.90  22.81  22.82  19.83  23.53
mal-tam   6.52   7.01   7.61   7.65   4.50   7.86
tel-mal   6.62   6.94   7.86   7.89   6.00   8.51
hin-mal   8.49   8.77   9.23   9.26   6.28  10.45
mal-hin  15.23  16.26  17.08  17.30  12.33  18.50

Table 3: Results - ILCI corpus (% BLEU). The reported scores are – W: word-level, WX: word-level followed by transliteration of OOV words, M: morph-level, MX: morph-level followed by transliteration of OOV morphemes, C: character-level, O: orthographic syllable. The values marked in bold indicate the best scores for the language pair.

         WX     MX     C      O
ben-hin  Corpus not available
pan-hin  61.56  59.75  58.07  58.48
kok-mar  19.32  18.32  17.97  19.65
mal-tam   5.88   6.02   4.12   5.88
tel-mal   3.19   4.07   3.11   3.77
hin-mal   5.20   6.00   3.85   6.26
mal-hin   9.68  11.44   8.42  13.32

Table 5: Results: Agriculture Domain (% BLEU)

Why is OS better than other units? The improved performance of the OS level representation can be attributed to the following factors:

One, the number of basic translation units is limited and small compared to the word-level and morpheme-level representations. For the word-level representation, the number of translation units can increase with corpus size, especially for morphologically rich languages, which leads to many OOVs. Thus, OS-level units address data sparsity.

Two, while the character level representation does not suffer from data sparsity either, we observe that its translation accuracy is highly correlated with lexical similarity (Table 4). The high correlation between the character-level system and lexical similarity explains why character-level translation performs nearly as well as other methods for language pairs which have high lexical similarity, but performs badly otherwise. On the other hand, the OS-level representation has a lower correlation with lexical similarity and sits somewhere between the character-level and word/morpheme-level systems. Hence it is able to make generalizations beyond simple character-level mappings. We observed that the OS-level representation was able to correctly generate words whose translations are not cognate with the source language. This is an important property, since function words and suffixes tend to be less similar lexically across languages.

         C     O     M     W
ben-hin  0.71  0.63  0.58  0.40
pan-hin  0.72  0.70  0.64  0.50
kok-mar  0.74  0.68  0.63  0.64
mal-tam  0.77  0.71  0.56  0.46
tel-mal  0.78  0.65  0.52  0.45
hin-mal  0.79  0.59  0.46  -0.02
mal-hin  0.71  0.61  0.45  0.37

Table 4: Pearson's correlation coefficient between lexical similarity and translation accuracy (both in terms of LCSR at the character level). This was computed over the test set between: (i) sentence level lexical similarity between source and target sentences, and (ii) sentence level translation match between hypothesis and reference.

Can the improved translation performance be explained by longer basic translation units alone? To verify this, we trained translation systems with character trigrams as basic units. We chose trigrams since the average length of the OS was 3-5 characters for the languages we tested. The translation accuracies were far lower than even the unigram representation. The number of unique basic units was about 8-10 times larger than for orthographic syllables, again making data sparsity an issue. So, the improved translation performance cannot be attributed to longer n-gram units alone.
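The word-boundary marker scheme of Section 3 (segment each word, separate words with `_`, then regenerate words after decoding by concatenating units between markers) can be sketched as follows. This is our illustration, not the paper's code; the function names and the identity round-trip example are ours, and any segmenter (character, OS or morph) can be plugged in.

```python
MARKER = "_"  # word boundary marker used in the sub-word representation

def to_subword_sentence(words, segment):
    """Segment each word with `segment` and insert the boundary
    marker between consecutive words."""
    units = []
    for i, w in enumerate(words):
        if i > 0:
            units.append(MARKER)
        units.extend(segment(w))
    return " ".join(units)

def to_word_sentence(subword_sentence):
    """Invert the representation: concatenate output units between
    consecutive occurrences of the marker character."""
    words, current = [], []
    for unit in subword_sentence.split():
        if unit == MARKER:
            words.append("".join(current))
            current = []
        else:
            current.append(unit)
    if current:
        words.append("".join(current))
    return " ".join(words)

# Character-level segmentation of a romanized example sentence
sent = to_subword_sentence(["raajU", ",", "gharAbAhera"], lambda w: list(w))
print(sent)                    # r a a j U _ , _ g h a r A b A h e r a
print(to_word_sentence(sent))  # raajU , gharAbAhera
```

The same `to_word_sentence` step is what makes word-level tuning possible: n-best lists in sub-word representation are mapped back to words before computing BLEU.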

Robustness to Domain Change: We also tested the translation models trained on the tourism & health domains on an agriculture domain test set of 1000 sentences. In this cross-domain translation scenario too, the OS level model outperforms most units of representation (Table 5). The only exceptions are the pan-hin and tel-mal language pairs for the MX system (the accuracies of the OS-level system are within 10% of the MX system). Since the word level model depends on coverage of the lexicon, it is highly domain dependent, whereas the sub-word units are not. So, even unigram-level models outperform word-level models in the cross-domain setting.

Experiments with non-Indian languages: Table 6 shows the corpus statistics and our results for translation between some related non-Indic language pairs (Bulgarian-Macedonian, Danish-Swedish, Malay-Indonesian). The OS level representation outperforms the character and word level representations, though the gains are not as significant as for the Indic language pairs. This could be due to the short length of the sentences in the training corpus [OPUS movie subtitles (Tiedemann, 2009b)] and the high lexical similarity between the language pairs. Further experiments between less lexically related languages on general parallel corpora will be useful.

          Corpus Stats   Lex-Sim  W      C      O
bul-mac   (150k,1k,2k)   62.85    21.20  20.61  21.38
dan-swe   (150k,1k,2k)   63.39    35.13  35.36  35.46
may-ind   (137k,1k,2k)   73.54    61.33  60.50  61.24

Table 6: Translation among non-Indic languages (% BLEU). Corpus Stats show the (train,tune,test) split.

Effect of training data size: For different training set sizes, we trained SMT systems with the various representation units (Figure 1 shows the learning curves, % BLEU against training set size in thousands of sentences, for the C, O, M and W systems on two language pairs: (a) ben-hin and (b) mal-hin). OS level models are consistently better than character level models, even for small training corpora. While the OS-level models benefit from larger training corpora, the character-level models show only small benefits. For smaller corpora, the OS level model shows even larger performance gains over the corresponding word and morph level models. The performance gap between the word/morph-level and OS-level systems decreases as training data increases.

Figure 1: Effect of training data size on translation accuracy for different basic units

6 Conclusion & Future Work

We focus on the task of translation between related languages. This aspect of MT research is important for making translation technologies available to language pairs with limited parallel corpora but huge potential translation requirements. We propose the use of the orthographic syllable, a variable-length, linguistically motivated, approximate syllable, as a basic unit for translation between related languages. We show that it significantly outperforms other units of representation over multiple language pairs, spanning different language families, with varying degrees of lexical similarity, and that it is robust to domain changes too. This opens up the possibility of further exploration of sub-word level translation units, e.g. segments learnt using byte pair encoding (Sennrich et al., 2016).

Acknowledgments

We thank Arjun Atreya for inputs regarding orthographic syllables. We thank the Technology Development for Indian Languages (TDIL) Programme and the Department of Electronics & Information Technology, Govt. of India for their support.

References

Arjun Atreya, Swapnil Chaudhari, Pushpak Bhattacharyya, and Ganesh Ramakrishnan. 2016. Value the vowels: Optimal transliteration unit selection for machine transliteration. Unpublished, private communication with authors.

Pushpak Bhattacharyya, Mitesh Khapra, and Anoop Kunchukuttan. 2016. Statistical machine translation between related languages. In NAACL Tutorials.

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. 2014. HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation. In Proceedings of the 9th International Conference on Language Resources and Evaluation.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Nadir Durrani, Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2010. Hindi-to-Urdu machine translation through transliteration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

Nadir Durrani, Hieu Hoang, Philipp Koehn, and Hassan Sajjad. 2014. Integrating an unsupervised transliteration model into Statistical Machine Translation. In EACL 2014.

Asif Ekbal, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. 2006. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL Main Conference Poster Sessions.

Murray B. Emeneau. 1956. India as a linguistic area. Language.

Girish Nath Jha. 2012. The TDIL program and the Indian Language Corpora Initiative. In Language Resources and Evaluation Conference.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions.

Anoop Kunchukuttan, Ratish Puduppully, Rajen Chatterjee, Abhijit Mishra, and Pushpak Bhattacharyya. 2014. The IIT Bombay SMT System for ICON 2014 Tools Contest. In NLP Tools Contest at ICON 2014.

Anoop Kunchukuttan, Ratish Puduppully, and Pushpak Bhattacharyya. 2015. Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent.

I. Dan Melamed. 1995. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Third Workshop on Very Large Corpora.

Preslav Nakov and Jörg Tiedemann. 2012. Combining word-level and character-level models for machine translation between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation.

Loganathan Ramasamy, Ondřej Bojar, and Zdeněk Žabokrtský. 2012. Morphological Processing for English-Tamil Statistical Machine Translation. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Anil Kumar Singh. 2006. A computational phonetic model for Indian language scripts. In Constraints on Spelling Changes: Fifth International Workshop on Writing Systems.

Richard Sproat. 2003. A formal computational analysis of Indic scripts. In International Symposium on Indic Scripts: Past and Future, Tokyo.

Jörg Tiedemann and Preslav Nakov. 2013. Analyzing the use of character-level translation with sparse and noisy datasets. In RANLP.

Jörg Tiedemann. 2009a. Character-based PBSMT for closely related languages. In Proceedings of the 13th Conference of the European Association for Machine Translation.

Jörg Tiedemann. 2009b. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing.

Jörg Tiedemann. 2012. Character-based pivot translation for under-resourced languages and domains. In EACL.

David Vilar, Jan-T. Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation.

Sami Virpioja and Stig-Arne Grönroos. 2015. LeBLEU: N-gram-based translation evaluation score for morphologically complex languages. In WMT 2015.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline.
Algorithm 1: Orthographic Syllabification of Indic Scripts

 1: function ORTH-SYLLABIFY(W)    ▷ W: word to be syllabified; SW: syllabified word (spaces added at boundaries)
 2:     SW ← ‘’                   ▷ Initialized to empty string
 3:     for i ← 1, len(W) do      ▷ Iterate over characters
 4:         SW ← SW + W[i]
 5:         if is_vowel(W[i]) and not is_nasalizer(W[i+1], W[i+2]) then
 6:             SW ← SW + ‘ ’     ▷ Add space
 7:         else if is_consonant(W[i]) then
 8:             if is_depvowel(W[i+1]) or is_consonant(W[i+1]) then
 9:                 SW ← SW + ‘ ’
10:             else if not is_nasalizer(W[i+1], W[i+2]) then
11:                 SW ← SW + ‘ ’
12:             end if
13:         end if
14:     end for
15:     return SW
16: end function

17: function IS_NASALIZER(c1, c2)    ▷ c1: character to check; c2: character following c1
18:     is_nasal ← is_anusvar(c1) or is_chandrabindu(c1)    ▷ Checks for certain characters
19:     return is_nasal and not is_plosive(c2)
20: end function
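Algorithm 1 can be turned into runnable code. The sketch below is our simplified Python rendering for Devanagari only, not the Indic NLP Library implementation: it breaks after vowels (unless a nasalizer follows) and after a consonant that is followed by another consonant or an independent vowel (the implicit-schwa case), and it omits the anusvaara/plosive distinction of IS_NASALIZER by always treating anusvaara and chandrabindu as nasalizers. The Unicode code-point classes are standard Devanagari block ranges.

```python
# Devanagari code-point classes (simplified sketch, our assumptions)
CONSONANTS = set(range(0x0915, 0x093A)) | set(range(0x0958, 0x0960))
DEP_VOWELS = set(range(0x093E, 0x094D))   # dependent vowel signs (maatraas)
IND_VOWELS = set(range(0x0904, 0x0915))   # independent vowels
NASALIZERS = {0x0901, 0x0902}             # chandrabindu, anusvaara

def orth_syllabify(word):
    """Simplified orthographic syllabification of a Devanagari word.
    Returns the list of orthographic syllables (OSes)."""
    out = []
    cps = [ord(c) for c in word]
    for i, ch in enumerate(word):
        out.append(ch)
        if i + 1 >= len(cps):
            break                          # last character: nothing to decide
        cp, nxt = cps[i], cps[i + 1]
        if (cp in DEP_VOWELS or cp in IND_VOWELS) and nxt not in NASALIZERS:
            out.append(" ")                # vowel closes an OS
        elif cp in NASALIZERS:
            out.append(" ")                # nasalizer attaches left, then break
        elif cp in CONSONANTS and (nxt in CONSONANTS or nxt in IND_VOWELS):
            out.append(" ")                # implicit schwa: consonant closes an OS
        # halanta and matra continuations: no break
    return "".join(out).split()

print(orth_syllabify("लक्षमी"))   # ['ल', 'क्ष', 'मी']
print(orth_syllabify("मुम्बई"))   # ['मु', 'म्ब', 'ई']
```

Both of the paper's worked examples come out as expected: the halanta keeps the क्ष cluster together, and the independent vowel ई forms an OS on its own.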

A. Pseudo-code for Orthographic Syllabification of Indic Scripts

A minimal pseudo-code for orthographic syllabification of Indic scripts is shown in Algorithm 1.

B. Le-BLEU Scores

In this appendix, we report and briefly analyse the Le-BLEU scores for all the experiments. Table 7 shows the Le-BLEU scores for the Indian language pairs. Table 8 shows the Le-BLEU scores for cross-domain translation. The Le-BLEU scores also clearly show that the OS level representation is better than the other representations. In the cross-domain scenario especially, the Le-BLEU scores indicate the superiority of the OS level representation more clearly than the BLEU scores.

         W      WX     M      MX     C      O
ben-hin  67.63  71.03  70.77  71.18  67.22  71.66
pan-hin  86.81  90.13  89.95  90.34  90.45  90.66
kok-mar  63.97  63.74  65.95  65.85  63.22  67.05
mal-tam  31.87  40.72  40.92  43.43  31.11  44.19
tel-mal  31.35  37.92  38.33  39.34  34.62  43.79
hin-mal  40.48  43.89  43.55  44.05  32.41  46.65
mal-hin  46.03  52.28  52.80  54.17  44.38  55.50

Table 7: Results: ILCI Corpus (% Le-BLEU)

         WX     MX     C      O
ben-hin  Corpus not available
pan-hin  87.44  86.95  86.57  86.73
kok-mar  63.90  63.82  64.75  66.68
mal-tam  41.27  42.75  30.09  44.62
tel-mal  26.31  24.88  24.61  29.26
hin-mal  38.11  36.20  28.13  39.30
mal-hin  48.98  51.25  43.91  54.57

Table 8: Results: Agriculture Domain (% Le-BLEU)