
WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation

Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae
Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
[email protected], [email protected], [email protected]

Abstract

In this paper, we translate the glosses in the English WordNet based on the expand approach for improving and generating wordnets with the help of multilingual neural machine translation. Neural Machine Translation (NMT) has recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. However, the performance of NMT often suffers in low-resource scenarios where large corpora cannot be obtained. Using training data from closely related languages has proven to be invaluable for improving performance. In this paper, we describe how we trained multilingual NMT from closely related languages utilizing phonetic transcription for Dravidian languages. We report the evaluation results of the generated wordnet senses in terms of precision. By comparing to a recently proposed approach, we show improvement in terms of precision.

1 Introduction

Wordnets are organized as a hierarchical structure based on synsets and the semantic features of the words (Miller, 1995; Fellbaum, 1998). Manually constructing a wordnet is a difficult task, and it takes years of experts' time. Another way is translating the synsets of an existing wordnet to the target language, then applying methods to identify exact matches or providing the translated synsets to linguists; this has been proven to speed up creation. The latter approach is known as the expand approach. Popular wordnets like EuroWordNet (Vossen, 1997) and IndoWordNet (Bhattacharyya, 2010) were based on the expand approach. On the Global WordNet Association website,1 a comprehensive list of wordnets available for different languages can be found, including IndoWordNet and EuroWordNet.

Due to the lack of parallel corpora, machine translation systems for less-resourced languages are not readily available. We attempt to utilize Multilingual Neural Machine Translation (MNMT) (Ha et al., 2016), where multiple source and target languages are trained simultaneously without changes to the network architecture. This has been shown to improve translation quality; however, most under-resourced languages use different scripts, which limits the application of such multilingual NMT. In order to overcome this, we transliterate the languages on the target side into a single script to take advantage of multilingual NMT for closely-related languages. Closely-related languages refer to languages that share similar lexical and structural properties due to sharing a common ancestor (Popović et al., 2016). Frequently, languages in contact with other languages or closely-related languages like the Dravidian, Indo-Aryan, and Slavic languages share words from a common root (cognates), which are highly semantically and phonologically similar.

In the scope of wordnet creation for under-resourced languages, combining parallel corpora from closely related languages, phonetically transcribing the corpus and creating multilingual neural machine translation is shown to improve the results in this paper. The evaluation results obtained from MNMT with the transliterated corpus are better than the results of Statistical Machine Translation (SMT) from the recent work (Chakravarthi et al., 2018).

© 2019 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.

1 http://globalwordnet.org/

2 Related Work

The Princeton WordNet (Miller, 1995; Fellbaum, 1998) was built from scratch. In the merge approach, the taxonomies of the languages, the synsets, and the relations among synsets are built first. Popular wordnets like EuroWordNet (Vossen, 1997) and IndoWordNet (Bhattacharyya, 2010) are developed by the expand approach, whereby the synsets are built in correspondence with the existing wordnet synsets by translation. For the Tamil language, Rajendran et al. (2002) proposed a design template for the Tamil wordnet.

To evaluate and improve the wordnets for the targeted under-resourced Dravidian languages, Chakravarthi et al. (2018) followed the approach of Arcan et al. (2016), which uses the existing translations of wordnets in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. They use this contextual information to improve the translation quality of WordNet senses. They showed that their approach can help overcome the drawbacks of simple translation of words without context. Chakravarthi et al. (2018) removed the code-mixing based on the script of the parallel corpus to reduce the noise in translation. The authors used SMT to create bilingual MT systems for three Dravidian languages. In our work, we use an MNMT system, and we transliterate the closely related language corpora into a single script to take advantage of MNMT systems.

Neural Machine Translation has achieved rapid development in recent years; however, conventional NMT (Bahdanau et al., 2015) creates a separate machine translation system for each pair of languages. Creating an individual machine translation system for many languages is resource-consuming, considering there are around 7,000 languages in the world. Recent work on NMT, specifically on low-resource (Zoph et al., 2016; Chen et al., 2017) or zero-resource machine translation (Johnson et al., 2017; Firat et al., 2016), uses third languages as pivots and showed that translation quality is significantly improved. Ha et al. (2016) proposed an approach to extend the Bahdanau et al. (2015) architecture to multilingual translation by sharing the entire model. The approach of a shared vocabulary across multiple languages resulted in a shared embedding space. Although the results were promising, the experiments were reported on highly resourced languages such as English, German, and French, whereas many under-resourced languages have different syntactic and semantic structure to these languages. Chakravarthi et al. (2019) have shown that using languages belonging to the same family and phonetically transcribing the parallel corpus into a single script improves the MNMT results.

Our approach extends that of Chakravarthi et al. (2019) and Chakravarthi et al. (2018) by utilizing MNMT with a transliterated parallel corpus of closely related languages to create wordnet senses for Dravidian languages. In particular, we downloaded the data, removed code-mixing and phonetically transcribed each corpus to Latin script. Two types of experiments were performed: in the first one, we just removed code-mixing and compiled the multilingual corpora by concatenating the parallel corpora from the three languages. In the second one, we removed code-mixing, phonetically transcribed the corpora and then compiled the multilingual corpora by concatenating the parallel corpora from the three languages. These two experiments are the contribution of this work compared to the previous works.

3 Experiment Setup

3.1 Dravidian Languages

For our study, we perform experiments on Tamil (ISO 639-1: ta), Telugu (ISO 639-1: te) and Kannada (ISO 639-1: kn). The targeted languages for this work differ in their orthographies due to historical reasons and whether they adopted the Sanskrit tradition or not (Bhanuprasad and Svenson, 2008). Each of these has been assigned a unique block in Unicode, and thus from an MNMT perspective they are completely distinct.

3.2 Multilingual Neural Machine Translation

Johnson et al. (2017) and Ha et al. (2016) extended the architecture of Bahdanau et al. (2015) to use a universal model to handle multiple source and target languages, with a special tag in the encoder to determine which target language to translate into. The idea is to use a unified vocabulary and training corpus without modification to the architecture, to take advantage of the shared embedding. The goal of this approach is to improve the translation quality for individual language pairs for which parallel corpus data is scarce, by letting the NMT learn the commonalities across languages and by reducing the number of translation systems needed. The sentences of different languages are distinguished through language tags.

3.3 Data

We used datasets from Chakravarthi et al. (2018) in our experiment. The authors collected three Dravidian languages ↔ English pairs from the OPUS2 web-page (Tiedemann and Nygaard, 2004). Corpus statistics are shown in Table 1. More descriptions of the three datasets can be found in Chakravarthi et al. (2018). We transliterated this corpus using the Indic-trans library.3 All the sentences are first tokenized with the OpenNMT (Klein et al., 2017) tokenizer and then segmented into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016). We learn the BPE merge operations across all the languages. Following Ha et al. (2016), we indicate the language by prepending two tokens for the desired source and target language. An example of a sentence in English to be translated into Tamil would be:

src__en tgt_ta I like ice-cream

3.4 Transliteration

As the Indian languages under our study are written in different scripts, they must be converted to some common representation before training the MNMT to take advantage of closely related language resources. Phonetic transcription is an approach where a word in one script is transformed into a different script while maintaining phonetic correspondence. Phonetic transcription to Latin script and to the International Phonetic Alphabet (IPA) was studied by Chakravarthi et al. (2019), who showed that Latin script outperforms IPA for MNMT of the Dravidian languages. The improvements in results were shown in terms of the BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and chrF (Popović, 2015) metrics. To evaluate the similarity of the corpora, the authors used cosine similarity and showed that transcribing to Latin script retains more similarity. We used the Indic-trans library by Bhat et al. (2015), which brings all the languages into a single representation by phoneme matching. The same library can also back-transliterate from English (Latin script) to the Indian languages.

2 http://opus.nlpl.eu/
3 https://github.com/libindic/indic-trans
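As an illustration of the preprocessing just described (Sections 3.3-3.4), the sketch below transliterates the Dravidian side of each parallel corpus to Latin script and prepends the two language tokens. This is a minimal sketch, not the exact pipeline of the paper: the file names are hypothetical placeholders, and we assume the indic-trans API as documented in its repository (a Transliterator class with a transform method and ISO 639-3 language codes), which may differ across versions.

```python
# Sketch: build a tagged, transliterated multilingual training corpus.
# Assumes the indic-trans API (Transliterator(...).transform(...));
# file names such as "train.en-ta.ta" are hypothetical placeholders.
from indictrans import Transliterator

PAIRS = {"ta": "tam", "te": "tel", "kn": "kan"}  # ISO 639-1 -> ISO 639-3

with open("train.multi.en", "w", encoding="utf-8") as src_out, \
     open("train.multi.dr", "w", encoding="utf-8") as tgt_out:
    for lang, iso3 in PAIRS.items():
        # Transliterator from the Dravidian script to Latin ("eng").
        trn = Transliterator(source=iso3, target="eng", build_lookup=True)
        with open(f"train.en-{lang}.en", encoding="utf-8") as src_in, \
             open(f"train.en-{lang}.{lang}", encoding="utf-8") as tgt_in:
            for en, dr in zip(src_in, tgt_in):
                # Language tokens follow the format shown in Section 3.3.
                src_out.write(f"src__en tgt_{lang} {en.strip()}\n")
                tgt_out.write(trn.transform(dr.strip()) + "\n")
# Joint BPE merge operations would then be learned over the
# concatenation of the resulting files.
```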
3.5 Code-Mixing

Code-mixing is a phenomenon which occurs commonly in most multilingual societies, where the speaker or writer alternates between two or more languages in a sentence (Ayeomoni, 2006; Ranjan et al., 2016; Yoder et al., 2017; Parshad et al., 2016). Most of our corpus comes from publicly available parallel corpora that are created by voluntary annotators or aligned automatically. The technical-document translations, such as the KDE, GNOME, and Ubuntu translations, contain code-mixed data, since some of the technical terms may not be known to the voluntary annotators. The code-mixing in OpenSubtitles, on the other hand, is due to bilingual and historical reasons of Indian speakers (Chanda et al., 2016; Parshad et al., 2016). Different combinations of languages may occur in code-mixing, for example German-Italian and French-Italian in Switzerland, Hindi-Telugu in the state of Telangana, India, and Taiwanese-Mandarin Chinese in Taiwan (Chan et al., 2009). Since the Internet era, English has become the international language of the younger generation; hence, English words are frequently embedded in Indians' speech. For our work, only intra-sentential code-mixing was taken into account, with the Dravidian languages as the primary language and English as the secondary language. We removed the English words, considering only English as a foreign word, based on the script. Statistics of the removal of code-mixing are shown in Table 2.

3.6 WordNet creation

Using contextual information to improve the translation quality of wordnet senses was shown to improve the results (Arcan et al., 2016). The approach is to select the most relevant sentences from a parallel corpus based on the overlap with existing wordnet translations. For each synset of a wordnet entry, multiple sentences were collected that share semantic information. We use this contextual data in English to be translated into Tamil, Telugu, and Kannada using our MNMT system.
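As a simplified sketch of this selection step: Arcan et al. (2016) rank candidate sentences by their overlap with existing wordnet translations, which we approximate below with overlap against the synset gloss. The gloss-overlap scoring is our stand-in for illustration, not the exact method.

```python
# Simplified sketch of context selection for a wordnet sense (cf. Arcan
# et al., 2016): keep English sentences containing the lemma, ranked by
# lexical overlap with the synset gloss (a stand-in for overlap with
# existing wordnet translations used in the original method).
def select_context(lemma, gloss, corpus, n=5):
    gloss_words = set(gloss.lower().split())
    scored = []
    for sent in corpus:  # English side of the parallel corpus
        words = set(sent.lower().split())
        if lemma in words:
            scored.append((len(gloss_words & words), sent))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [sent for _, sent in scored[:n]]

corpus = ["the mango is a juicy tropical fruit",
          "mango prices rose sharply this year",
          "he planted a mango tree"]
print(select_context("mango", "juicy tropical fruit with sweet flesh", corpus))
```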

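Returning to the filtering step of Section 3.5, the following sketch shows one way the script-based removal can be realised. This is our assumption of the heuristic; the exact rule in Chakravarthi et al. (2018) may differ in details. A token is kept only if it contains at least one character from the primary language's Unicode block.

```python
# Sketch: remove intra-sentential English (Latin-script) tokens, keeping
# only tokens with at least one character in the primary language's
# Unicode block (Tamil U+0B80-U+0BFF, Telugu U+0C00-U+0C7F,
# Kannada U+0C80-U+0CFF).
BLOCKS = {"ta": (0x0B80, 0x0BFF), "te": (0x0C00, 0x0C7F), "kn": (0x0C80, 0x0CFF)}

def remove_code_mixing(sentence, lang):
    lo, hi = BLOCKS[lang]
    kept = [tok for tok in sentence.split()
            if any(lo <= ord(ch) <= hi for ch in tok)]
    return " ".join(kept)

# An embedded English word such as "login" has no Tamil-block character
# and is therefore dropped.
print(remove_code_mixing("நான் login செய்தேன்", "ta"))  # -> நான் செய்தேன்
```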
                         English-Tamil          English-Telugu        English-Kannada
                         English     Tamil      English    Telugu     English    Kannada
Number of tokens         7,738,432   6,196,245  258,165    226,264    68,197     71,697
Number of unique words   134,486     459,620    18,455     28,140     7,740      15,683
Average word length      4.2         7.0        3.7        4.8        4.5        6.0
Average sentence length  5.2         7.9        4.6        5.6        5.3        6.8
Number of sentences      449,337                44,588                13,543

Table 1: Statistics of the parallel corpora used to train the translation systems.

      English-Tamil                  English-Telugu                English-Kannada
      English        Tamil           English       Telugu          English       Kannada
tok   0.5% (45,847)  1.1% (72,833)   2.8% (7,303)  4.9% (12,818)   3.5% (2,425)  9.0% (6,463)
sent  0.9% (4,100)                   3.1% (1,388)                  3.4% (468)

Table 2: Number of sentences (sent) and number of tokens (tok) removed from the original corpus.

4 Results

We present consolidated results in Table 3. Apart from Precision at 1, Table 3 shows Precision at 2, Precision at 5, and Precision at 10. The goal of this work is to aid the human annotator in speeding up the process of wordnet creation for under-resourced languages. Precision at the different levels is calculated by comparing with IndoWordNet for an exact match out of the top 10 words from the word alignment, based on the attention model in MNMT and the alignment from SMT. The precision of all the MNMT systems is greater than the baseline.

A perfect match between a word and an IndoWordNet entry is considered for Precision at 1. Tamil, Telugu, and Kannada yield better precision at the different levels for translations based on both MNMT systems. For Tamil and Telugu, the translations based on MNMT trained on the native script and MNMT trained on the transcribed script did not have much variance. The slight reduction in the result is caused by the transliteration into, and back from, the original script. In the case of Kannada, which has a very small number of parallel sentences to train on compared to the other two languages, the MNMT trained on the transcribed script shows high improvement.

We have several observations. First, the precision presented is below 15 percent, and this is because these languages have very minimal parallel corpora. Chakravarthi et al. (2018) used the corpora collected during August 2017 from OPUS, which contain mostly translations of religious texts, technical documents, and movie subtitles. Analyzing the results by comparing with IndoWordNet is likely to be problematic, since it is far from complete and is overly skewed towards the classical words of these languages. Second, our method outperforms the baseline from Chakravarthi et al. (2018) for all the languages, demonstrating the effectiveness of our framework for multilingual NMT. More importantly, transliterating the parallel corpora is more beneficial for the low-resource language pair English-Kannada.

English → Tamil
            P@10     P@5      P@2      P@1
B-SMT       0.1200   0.1087   0.0833   0.0651
NC-SMT      0.1252   0.1147   0.0911   0.0725
NC-MNMT     0.2030   0.1559   0.1228   0.1161
NCT-MNMT    0.1816   0.1538   0.1351   0.1320

English → Telugu
            P@10     P@5      P@2      P@1
B-SMT       0.0471   0.0455   0.0380   0.0278
NC-SMT      0.0467   0.0451   0.0382   0.0274
NC-MNMT     0.0933   0.0789   0.0509   0.0400
NCT-MNMT    0.0918   0.0807   0.0599   0.0565

English → Kannada
            P@10     P@5      P@2      P@1
B-SMT       0.0093   0.0096   0.0080   0.0055
NC-SMT      0.0110   0.0107   0.0091   0.0067
NC-MNMT     0.0652   0.0472   0.0319   0.0226
NCT-MNMT    0.0906   0.0760   0.0535   0.0433

Table 3: Results of the automatic evaluation of the translated wordnet against IndoWordNet. Precision at different levels is denoted P@k, e.g. P@10 means Precision at 10. B: baseline (original corpus); NC: non-code-mixed; MNMT: Multilingual Neural Machine Translation; NCT-MNMT: MNMT trained on the non-code-mixed, transliterated corpus.
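To make the automatic metric concrete, here is a minimal sketch of Precision@k as we read the procedure above; the synset identifiers and candidate words are invented toy data for illustration only.

```python
# Sketch: Precision@k against IndoWordNet. Each synset has a ranked list
# of up to 10 candidate target words taken from the word alignment; a
# synset scores a hit at k if any of its top-k candidates exactly
# matches an IndoWordNet entry for that synset. Toy data below.
def precision_at_k(candidates, gold, k):
    hits = sum(any(w in gold.get(sid, set()) for w in ranked[:k])
               for sid, ranked in candidates.items())
    return hits / len(candidates)

candidates = {"synset-1": ["maram", "veru", "ilai"],  # hypothetical outputs
              "synset-2": ["pazham", "kani"]}
gold = {"synset-1": {"maram"}, "synset-2": {"kaay"}}  # hypothetical gold
for k in (1, 2, 5, 10):
    print(f"P@{k} = {precision_at_k(candidates, gold, k):.2f}")
```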

Manual Evaluation

In order to re-confirm the validity of the output in practical scenarios, we also performed a human-based evaluation in comparison with IndoWordNet entries. For the human evaluation, 50 wordnet entries were randomly selected. All these entries were evaluated according to the manual evaluation method performed by Chakravarthi et al. (2018). The classification from that paper is given below; more details about the classification can be found in Chakravarthi et al. (2018).

• Agrees with IndoWordNet: Perfect match with IndoWordNet.

• Inflected form: Some part of a word, such as the root of the word, is found.

• Transliteration: Transliteration of an English word into Tamil; this might be due to the unavailability of the translation in the parallel corpus.

• Spelling variant: A spelling variant can be caused by a wrong spelling or a misspelling of the word according to IndoWordNet. Since the corpus contains data from OpenSubtitles, this might include dialect variations of the word.

• Correct, but not in IndoWordNet: Word sense not found in IndoWordNet but found in our translation. We verified we had identified the correct sense by referring to the wordnet gloss.

• Incorrect: This error class can be caused by an inappropriate term or a mistranslation.

                                 B-SMT   NC-SMT   NC-MNMT   NC-MNMT-T
Agrees with IndoWordNet          18%     20%      28%       26%
Inflected form                   12%     22%      26%       30%
Transliteration                  4%      4%       2%        2%
Spelling variant                 2%      2%       2%        2%
Correct, but not in IndoWordNet  18%     24%      22%       24%
Incorrect                        46%     28%      20%       16%

Table 4: Manual evaluation of wordnet creation for the Tamil language compared with IndoWordNet (IWN) at Precision at 10, presented in percentages. B: baseline (original corpus); NC: non-code-mixed; MNMT: Multilingual Neural Machine Translation; NC-MNMT-T: MNMT trained on the non-code-mixed, transliterated corpus.

Table 4 contains the percentages for the outputs of the wordnet translation. As mentioned earlier in Section 3, SMT systems trained with and without the removal of code-mixing are used as baselines for this assessment. The baseline systems show that the cleaned data (with code-mixing removed) produce better results. Again, as previously mentioned, both our MNMT systems trained on cleaned data are better than the baseline systems in the manual evaluation as well. From Table 4, we can see that there is a significant improvement in the inflected-form category for the MNMT system trained with the transcribed corpus. The perfect match with IndoWordNet is lower for MNMT trained with the transcribed corpus compared to MNMT trained on the original script, but still better than the baselines. This might be due to the back-transliteration effect. It is clear from the results that these translations can be used as an aid by annotators to create wordnets for under-resourced languages.

5 Conclusion

In this paper, we presented how to take advantage of phonetic transcription and multilingual NMT to improve the wordnet sense translation of under-resourced languages. The proposed approach takes the code-mixing phenomenon into consideration, as well as the phonetic transcription of closely related languages, to better utilize multilingual NMT. We evaluated the proposed approach on three Dravidian languages and showed that it outperforms the baseline by effectively leveraging the information from closely related languages. Moreover, our approach can provide better translations for a very low-resourced language pair (English-Kannada). In the future, we would like to conduct an experiment by transcribing the languages to one of the Dravidian languages' scripts, which will be able to represent the information more easily than Latin script.

Acknowledgments

This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289, and the European Union's Horizon 2020 research and innovation programme under grant agreement No 731015, ELEXIS - European Lexical Infrastructure, and grant agreement No 825182, Prêt-à-LLOD.

References

Arcan, Mihael, John P. McCrae, and Paul Buitelaar. 2016. Expanding wordnets to new languages with multilingual sense disambiguation. In Proceedings of The 26th International Conference on Computational Linguistics.

MomenT-2019 Dublin, Aug. 19-23, 2019 | p. 5 Ayeomoni, Moses Omoniyi. 2006. Code-switching for emnlp 2016 code-switching workshop shared and code-mixing: Style of language use in child- task: System description. EMNLP 2016, page 112. hood in Yoruba speech community. Nordic Journal of African Studies, 15(1):90–99. Chen, Yun, Yang Liu, Yong Cheng, and Victor O.K. Li. 2017. A Teacher-Student Framework for Zero- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Resource Neural Machine Translation. In Proceed- Bengio. 2015. Neural machine translation by ings of the 55th Annual Meeting of the Association jointly learning to align and translate. In 3rd Inter- for Computational Linguistics (Volume 1: Long Pa- national Conference on Learning Representations, pers), pages 1925–1935. Association for Computa- ICLR 2015, San Diego, CA, USA, May 7-9, 2015, tional Linguistics. Conference Track Proceedings. Fellbaum, Christiane, editor. 1998. WordNet: an elec- Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: tronic lexical database. MIT Press. An Automatic Metric for MT Evaluation with Im- Firat, Orhan, Baskaran Sankaran, Yaser Al-Onaizan, proved Correlation with Human Judgments. In Pro- Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. ceedings of the ACL Workshop on Intrinsic and Ex- Zero-Resource Translation with Multi-Lingual Neu- trinsic Evaluation Measures for Machine Transla- ral Machine Translation. In Proceedings of the 2016 tion and/or Summarization, pages 65–72. Associa- Conference on Empirical Methods in Natural Lan- tion for Computational Linguistics. guage Processing, pages 268–277. Association for Computational Linguistics. Bhanuprasad, Kamadev and Mats Svenson. 2008. Er- rgrams - A Way to Improving ASR for Highly In- Ha, Thanh-Le, Jan Niehues, and Alexander H. Waibel. flected Dravidian Languages. In Proceedings of 2016. Toward multilingual neural machine transla- the Third International Joint Conference on Natural tion with universal encoder and decoder. In Pro- Language Processing: Volume-II. ceedings of the International Workshop on Spoken Language Translation. Bhat, Irshad Ahmad, Vandan Mujadia, Aniruddha Tam- mewar, Riyaz Ahmad Bhat, and Manish Shrivastava. Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim 2015. IIIT-H System Submission for FIRE2014 Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Shared Task on Transliterated Search. In Proceed- Fernanda Viegas,´ Martin Wattenberg, Greg Corrado, ings of the Forum for Evalua- Macduff Hughes, and Jeffrey Dean. 2017. ’s tion, FIRE ’14, pages 48–53, New York, NY, USA. multilingual neural machine translation system: En- ACM. abling zero-shot translation. Transactions of the As- sociation for Computational Linguistics, 5:339–351, Bhattacharyya, Pushpak. 2010. IndoWordNet. In December. Chair), Nicoletta Calzolari (Conference, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Klein, Guillaume, Yoon Kim, Yuntian Deng, Jean Odijk, Stelios Piperidis, Mike Rosner, and Daniel Senellart, and Alexander Rush. 2017. OpenNMT: Tapias, editors, Proceedings of the Seventh Interna- Open-Source Toolkit for Neural Machine Transla- tional Conference on Language Resources and Eval- tion. In Proceedings of ACL 2017, System Demon- uation (LREC’10), Valletta, Malta, May. European strations, pages 67–72. Association for Computa- Language Resources Association (ELRA). tional Linguistics.

Chakravarthi, Bharathi Raja, Mihael Arcan, and John P. Miller, George A. 1995. WordNet: a lexical McCrae. 2018. Improving Wordnets for Under- database for English. Communications of the ACM, Resourced Languages Using Machine Translation. 38(11):39–41. In Proceedings of the 9th Global WordNet Confer- Papineni, Kishore, Salim Roukos, Todd Ward, and Wei- ence. The Global WordNet Conference 2018 Com- Jing Zhu. 2002. BLEU: a Method for Automatic mittee. Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Chakravarthi, Bharathi Raja, Mihael Arcan, and John P. Computational Linguistics. McCrae. 2019. Comparison of Different Orthogra- phies for Machine Translation of Under-resourced Parshad, Rana D., Suman Bhowmick, Vineeta Chand, Dravidian Languages. In Proceedings of the 2nd Nitu Kumari, and Neha Sinha. 2016. What is India Conference on Language, Data and Knowledge. speaking? Exploring the “Hinglish” invasion. Phys- ica A: Statistical Mechanics and its Applications, Chan, Joyce Y. C., Houwei Cao, P. C. Ching, and Tan 449:375 – 389. Lee. 2009. Automatic Recognition of Cantonese- English Code-Mixing Speech. In International Jour- Popovic,´ Maja, Mihael Arcan, and Filip Klubicka.ˇ nal of Computational Linguistics & Chinese Lan- 2016. Language Related Issues for Machine Trans- guage Processing, Volume 14, Number 3, September lation between Closely Related South Slavic Lan- 2009, September. guages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects Chanda, Arunavha, Dipankar Das, and Chandan (VarDial3), pages 43–52. The COLING 2016 Orga- Mazumdar. 2016. Columbia-Jadavpur submission nizing Committee.

Popović, Maja. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395. Association for Computational Linguistics.

Rajendran, S, S Arulmozi, B Kumara Shanmugam, S Baskaran, and S Thiagarajan. 2002. Tamil WordNet. In Proceedings of the First International Global WordNet Conference, Mysore, volume 152, pages 271–274.

Ranjan, Prakash, Bharathi Raja, Ruba Priyadharshini, and Rakesh Chandra Balabantaray. 2016. A comparative study on code-mixed data of Indian social media vs formal text. In 2nd International Conference on Contemporary Computing and Informatics (IC3I), pages 608–611. IEEE.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Tiedemann, Jörg and Lars Nygaard. 2004. The OPUS Corpus - Parallel and Free: http://logos.uio.no/opus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04). European Language Resources Association (ELRA).

Vossen, Piek. 1997. EuroWordNet: a multilingual database for information retrieval. In Proceedings of the DELOS workshop on Cross-language Information Retrieval, pages 5–7.

Yoder, Michael, Shruti Rijhwani, Carolyn Rosé, and Lori Levin. 2017. Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 73–82. Association for Computational Linguistics.

Zoph, Barret, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575. Association for Computational Linguistics.
