
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers

Ife Adebara, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
Natural Language Processing Lab, University of British Columbia
{ife.adebara,moatez.nagoudi,muhammad.mageed}@ubc.ca

Abstract

We investigate different approaches to translate between similar languages under low resource conditions, as part of our contribution to the WMT 2020 Similar Languages Translation Shared Task. We submitted Transformer-based bilingual and multilingual systems for all language pairs, in the two directions. We also leverage back-translation for one of the language pairs, acquiring an improvement of more than 3 BLEU points. We interpret our results in light of the degree of mutual intelligibility (based on Jaccard similarity) between each pair, finding a positive correlation between mutual intelligibility and model performance. Our Spanish-Catalan model has the best performance of all the five language pairs. Except for the case of Hindi-Marathi, our bilingual models achieve better performance than the multilingual models on all pairs.

1 Introduction

We present our findings from our participation in the WMT 2020 Similar Language Translation shared task, which focused on translation between similar language pairs in low-resource settings. Similar languages share a certain level of mutual intelligibility that may aid the improvement of translation quality. Depending on the level of closeness, certain languages may share similar orthographic, lexical, syntactic, and/or semantic structures, which may make translation more accurate.

The level of mutual intelligibility is such that speakers of one language can understand another language without prior instruction in that other language. They can also communicate without the use of a lingua franca, which is a link or vehicular language used for communicating between speakers of different languages (Gooskens, 2007). It is important to mention that, sometimes, the level of intelligibility varies in both directions. For instance, Slovene-Croatian intelligibility is said to be asymmetric, such that speakers of Slovene can understand spoken and written Croatian better than speakers of Croatian understand Slovene (Golubović and Gooskens, 2015).

Machine translation of similar languages has been explored in a number of works (Hajic, 2000; Currey et al., 2016; Dabre et al., 2017). This can be seen as part of a growing need to develop models that translate well in low resource scenarios. The goal of the current shared task is to encourage researchers to explore methods for translating between similar languages. We also view the shared task as useful context for studying the interaction between degrees of similarity and mutual intelligibility on the one hand, and model performance on the other hand. We explore the use of bilingual and multilingual models for all the 5 shared task language pairs. We also perform back-translation for one language pair.

In the remainder of this paper, we discuss related literature in Section 2. We explain the methodology, which includes a description of the Transformer model, back-translation, and beam search, in Section 3. In Section 4, we describe the models we developed for this task and discuss the various experiments we perform. We also describe the architectures of the models we developed. Then we discuss the evaluation procedure in Section 6. Evaluation is done on both the validation and test sets. We conclude with discussion and the insights gained from this task in Section 7.

2 Related Work

Translation between similar languages has recently attracted attention. Different approaches have been adopted using state-of-the-art techniques, methods, and tools to take advantage of the similarity between languages even in low resource scenarios. Approaches that have been effective for other machine translation tasks have proven to achieve success in the context of similar language translation as well.

NMT models, specifically the Transformer architecture, have been shown to perform well when translating between similar languages (Baquero-Arnal et al., 2019; Przystupa and Abdul-Mageed, 2019). The use of in-domain data for fine-tuning has also proven to be of remarkable benefit for this task. This problem has also been tackled by using character replacement to leverage the orthographic and phonological relationship between closely related, mutually intelligible language pairs (Chen and Avgustinova, 2019). A new approach was also introduced for this task using a two-dimensional method that assumes that each word of the target sentence can be explained by all the words in the source sentence (Baquero-Arnal et al., 2019).

Within the realm of MT for low resource languages, recent work has focused on translation using large monolingual corpora due to the scarcity of parallel data for many language pairs (Lample et al., 2018, 2017; Artetxe et al., 2018b). These approaches have leveraged careful initialization of the unsupervised neural MT model using an inferred bilingual dictionary, sequence-to-sequence language models, and back-translation to achieve remarkable results. The bilingual dictionary is built without parallel data by using an unsupervised approach to align the monolingual word embedding spaces of each language (Conneau et al., 2017; Artetxe et al., 2018a). Since parallel data is not available in sufficiently large quantities, back-translation is used to create pseudo parallel data. The monolingual data of the target language is translated into the source language using an existing translation system (e.g., one trained with available gold data). The output is then used to train a new MT model (Sennrich et al., 2015a). Weak supervision caused by back-translation results in a noisy training dataset, which can eventually affect translation quality.

More recent works adopt different approaches to manage the noise in back-translation. For instance, phrase-based statistical MT models are introduced as a posterior regularization during the back-translation process to reduce the noise and errors of the generated data (Ren et al., 2019). Another method (Artetxe et al., 2019b) uses cross-lingual word embeddings incorporated with sub-word information. The weights of the log-linear model are then tuned through an unsupervised process and the entire system is jointly refined in opposite directions to improve performance. This method outperforms the previous SOTA model by about 5-7 BLEU points. A re-scoring mechanism that re-uses the pre-trained language model to select translations generated through beam search has also been found to improve the fluency and consistency of translations (Liu et al., 2019). Yet another approach combines cross-lingual embeddings with a language model to make a phrase-table (Artetxe et al., 2019a). The resulting system is then used to generate a pseudo parallel corpus with which a bilingual lexicon is derived. This approach can work with any word or cross-lingual embedding technique.

3 Methodology

Motivated by the success of Transformers and back-translation, we develop a sequence-to-sequence approach using the Transformer architecture and perform back-translation for one language pair. For decoding, we use Beam Search (BS). BS is a heuristic decoding strategy based on exploring the solution space and selecting a sequence of words that maximizes the overall likelihood of the target sentence. During translation, we hold a beam of β sequences (the beam size) which are iteratively extended. At each step, β words are selected to extend each of the sequences in the beam, so the output is β² candidate sequences (hypotheses); we retain only the β highest-scoring hypotheses (top-β candidates) for the next step (Koehn, 2009). In all our experiments we use a beam size of 5 while decoding.
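For illustration, the following is a minimal, self-contained sketch of the beam search procedure described above; the score_next callback, bos/eos tokens, and length limit are placeholders, not part of our Fairseq setup, which implements beam search internally.

```python
import heapq

def beam_search(score_next, bos, eos, beam_size=5, max_len=50):
    """Minimal beam search sketch.

    score_next(prefix) must return (log_prob, token) pairs for candidate
    next tokens. Each hypothesis is extended with its beam_size best
    continuations, and only the beam_size highest-scoring hypotheses
    overall are kept for the next step.
    """
    beam = [(0.0, [bos])]          # (cumulative log-probability, token sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beam:
            if seq[-1] == eos:     # hypothesis already ended; set it aside
                completed.append((logp, seq))
                continue
            extensions = heapq.nlargest(beam_size, score_next(seq),
                                        key=lambda x: x[0])
            for tok_logp, tok in extensions:
                candidates.append((logp + tok_logp, seq + [tok]))
        if not candidates:         # every hypothesis has ended
            break
        # retain only the top-beam_size hypotheses for the next step
        beam = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    completed.extend(beam)
    return max(completed, key=lambda x: x[0])[1]
```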

3.1 Transformer

Our baseline models are based on the Transformer architecture. A Transformer (Vaswani et al., 2017) is a sequence-to-sequence model that does not have the recurrent architecture present in Recurrent Neural Networks (RNNs). It uses a positional encoding to remember how sequences are fed into the model. These positions are added to the embedded representation (an n-dimensional vector) of each word. Transformers have been shown to train faster than RNNs for translation tasks.

The encoder and decoder in a Transformer model have modules that consist mainly of multi-head attention and feedforward layers. The attention mechanism is based on a function that operates on Q (queries), K (keys), and V (values). The query is a vector representation of one token in the input sequence, while K refers to the vector representations of all the tokens in the input sequence. More information about the Transformer can be found in (Vaswani et al., 2017).

3.2 Back-translation

We perform back-translation using the bilingual model developed for the Croatian-Slovene (HR-SL) language pair. We use the best HR-SL model checkpoint, i.e., the one that acquires the highest BLEU score on the DEV set, to translate the monolingual HR data. This produces synthetic Slovene (SL) data, which we then use as the source language while the original monolingual data is used as the target when training the SL-HR model. We combine this data with the initial training data. Due to time constraints, we used only a subset of the monolingual data, with a beam size of 5.

3.3 Jaccard Similarity

Jaccard similarity compares the similarity, diversity, and distance of data sets (Niwattanakul et al., 2013). It is calculated between two data sets (in our case, two languages) A and B by dividing the number of features common to the two sets (their intersection) by the union of features in the two sets, as in (1) below:

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

We use tokens, identified based on white space, as features when we calculate Jaccard.
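To make the computation concrete, the sketch below implements Equation (1) over whitespace tokens as described above; the file names in the usage example are illustrative placeholders rather than our actual corpus paths.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over whitespace-delimited token sets, as in Eq. (1)."""
    tokens_a = set(text_a.split())
    tokens_b = set(text_b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Example: compare the two sides of a parallel training corpus.
# File names are illustrative placeholders.
if __name__ == "__main__":
    with open("train.es", encoding="utf-8") as f_src, \
         open("train.ca", encoding="utf-8") as f_tgt:
        sim = jaccard_similarity(f_src.read(), f_tgt.read())
    print(f"Jaccard similarity: {sim:.4f}")
```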
4 Experiments

4.1 Model Architecture

Our neural network models are based on the Transformer architecture (as described in Section 3) as implemented by Facebook in the Fairseq toolkit. The following hyper-parameter configuration was used: 6 attention layers in the encoder and the decoder, 4 attention heads in each layer, an embedding dimension of 512, a maximum of 4,096 tokens per batch, the Adam optimizer with β1 = 0.90 and β2 = 0.98, dropout regularization of 0.3, weight decay of 0.0001, label smoothing of 0.1, and a variable learning rate of 5e-4 with the inverse square root lr-scheduler and 4,000 warmup updates. We used the label smoothed cross-entropy criterion, and the gradient clip-norm threshold was set to 0.

5 Data

We used all the parallel data provided for all language pairs (http://www.statmt.org/wmt20/similar.html). The task was constrained, so we did not add any additional data to develop our models. We used the monolingual data for the SL-HR language pair for back-translation. Table 2 shows the size of the data in terms of the number of sentences and words for each language pair, while Table 1 shows example source sentences and corresponding outputs from our bilingual and multilingual models for each language pair. We also calculated the Jaccard similarity for the training data we used for the tasks. "Jaccard similarity" measures the similarity between two text documents by taking the intersection of both and dividing it by their union. Linguists measure these intersections (Oktavia, 2019; Gooskens and Swarte, 2017) between languages to determine the level of mutual intelligibility as well as to classify languages as dialects of the same language or as different languages. We calculated the Jaccard similarity for each language pair.

5.1 Pre-processing

Pre-processing was done with a regular Moses toolkit (Koehn et al., 2007) pipeline that involved tokenization, byte pair encoding, and removing long sentences. We applied Byte-Pair Encoding (BPE) (Sennrich et al., 2015b) operations, learned jointly over the source and target languages. For each language pair, we used 32k split operations for subword segmentation (Sennrich et al., 2016b). We run experiments with Transformers under three settings, as we explain next.

5.2 Models

We develop both bilingual and multilingual models using gold data for all pairs. For one pair, we also use back-translation with one bilingual model. We provide more details next.

5.2.1 Bilingual Models

In this setting, we build an independent model for each language pair. We develop models for both directions for all language pairs, thus ultimately creating 12 models (6 for each direction). We train each model on 1 GPU for 7 days.

5.2.2 Multilingual Models

We develop two multilingual models that translate between all languages, one model for each direction (2 models overall) (Johnson et al., 2017). We add a language code representing the target language as the start token for each line of the source data. We train each model on 4 GPUs for 7 days. We use the same hyper-parameter values set for the bilingual models. Multilingual models enable us to determine the impact of learning similar languages with a shared representation.

5.2.3 Bilingual Model with Back-translation

For the third approach, we combine back-translation with the bilingual translation model for the SL-HR language pair. We incorporated the monolingual data to do this. This was influenced by reports in the literature (Sennrich et al., 2016a) of significant improvements in translation quality when monolingual data is incorporated into training data through back-translation. We were able to test the effect of back-translation on one model.
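To illustrate the target-language tagging described in Section 5.2.2, a minimal sketch is given below; the tag format (e.g., <2ca>) and the file names are assumptions for illustration and may differ from the exact convention used in our pipeline.

```python
def add_target_lang_token(src_path: str, out_path: str, tgt_lang: str) -> None:
    """Prepend a target-language tag (e.g. '<2ca>') to each source line,
    so one multilingual model can be steered toward the desired output
    language (Johnson et al., 2017)."""
    tag = f"<2{tgt_lang}>"
    with open(src_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(f"{tag} {line.strip()}\n")

# Illustrative usage: tag the Spanish side of the Es-Ca data for translation into Catalan.
add_target_lang_token("train.bpe.es", "train.tagged.es", "ca")
```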

Bilingual Model

  Es-Ca
    Diseña stickers para soñar  ->  Dissenya stickers per soyir
    Mueva el diez de corazones al nueve de corazones .  ->  Mobles el deu de garrons al nou de garrons .
  Es-Pt
    Diseña stickers para soñar  ->  Design stickers para sonhar
    Mueva el diez de corazones al nueve de corazones .  ->  Muda o dez corações para nove corações .
  Sl-Hr
    Vesel pomladni pozdrav ob novi izdaji Bisnode novičk .  ->  Sretan proljetni pozdrav na novom izdanju Bisnode Vijesti .
    Z lepimi pozdravi ,  ->  S lijepim pozdravima ,
  Sl-Sr
    Danes ni enostavno slediti vsem informacijam , ki so pomembne za poslovanje .  ->  Danas nije lako pratiti sve informacije koje su važne za poslovanje .
    Iščete podatke za drugo državo ?  ->  Tražite podatke za drugu zemlju ?

Multilingual Model

  Es-Ca
    Mueva el diez de corazones al nueve de corazones .  ->  Muva el 10 de coração al 9 de coraons .
    el cuatro de diamantes  ->  el quatre de diamants
  Es-Pt
    Luche en el aire con un avión enemigo  ->  Luche en l aire amb un avió enemigo
    Entonces , ¿ qué salió mal ?  ->  Então , o que saiu mal ?
  Sl-Hr
    Objašnjenje - Indeks plaćanja Datum :  ->  Objašnjenje – Indeks plaćanja Datum :
    Poštovani ,  ->  Poštovani ,
  Sl-Sr
    Iščete podatke za drugo državo ?  ->  Tražite podatke za drugu zemlju ?
    Vesel pomladni pozdrav ob novi izdaji Bisnode novičk .  ->  Sretan proljetni pozdrav uz novo izdanje Bisnode novosti .

Table 1: Example sentences from the various pairs and corresponding translations based on the bilingual and multilingual models. Examples are from the DEV set.

Pair    Language    #sentences    #words
Hi-Mr   hi          43.2K         829.9K
        mr          43.2K         600K
        mono-hi     113.5M        4.74B
        mono-mr     4.9M          112.6M
Es-Ca   es          11.3M         150.4M
        ca          11.3M         163M
        mono-es     58.4M         1.5B
        mono-ca     28M           763.7M
Es-Pt   es          4.15M         86.6M
        pt          4.15M         82.5M
        mono-es     58.4M         1.47B
        mono-pt     11.4M         233.9M
Sl-Hr   sl          17.6M         113.09M
        hr          17.6M         117.73M
        mono-sl     46.25M        770.6M
        mono-hr     64.5M         1.24B
Sl-Sr   sl          14.1M         79.1M
        sr          14.1M         86.1M
        mono-sl     46.2M         770.6M
        mono-sr     24M           489.9M

Table 2: Number of sentences and words in the training data used for each language pair.

6 Evaluation

We evaluated on both the DEV and TEST sets. We used the best-checkpoint metric with BLEU score to evaluate the validation set at each iteration. We used a beam size of 5 during the evaluation. We de-tokenized from BPE units back into words.

6.1 Evaluation on DEV set

We report the results on the DEV sets for each language pair in Table 3. These models were trained without the monolingual data, except for the SL-HR pair marked with an asterisk.
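As a concrete illustration of the scoring step described in Section 6, the sketch below removes BPE segmentation and computes BLEU with the sacrebleu Python package; the "@@ " separator convention and the file names are assumptions for illustration, not our exact pipeline.

```python
import sacrebleu

def remove_bpe(line: str) -> str:
    """Merge BPE subwords back into words (assumes the '@@ ' separator convention)."""
    return line.replace("@@ ", "").strip()

with open("hyps.bpe.txt", encoding="utf-8") as f:
    hypotheses = [remove_bpe(line) for line in f]
with open("refs.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```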

Pair     Bilingual Models    Multiling. Models
hi-mr    12.14               16.35
mr-hi    10.63               01.02
es-ca    74.85               16.13
ca-es    74.24               64.57
es-pt    46.71               26.41
pt-es    41.12               05.81
*sl-hr   36.89               -
sl-hr    33.28               09.25
hr-sl    55.51               07.94
sl-sr    40.80               32.80
sr-sl    39.80               06.97

Table 3: Evaluation in BLEU on the development set for the different language pairs. The asterisk marks the model with back-translation.

The bilingual models outperform the multilingual models for all language pairs except the hi-mr language pair.

6.2 Evaluation on TEST

In order to evaluate the test data, we removed the byte-pair codes from the test set. We used fairseq-generate to translate the test set. We show the results on the test set in Table 4.
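For illustration, the call below sketches how a binarized test set could be decoded with fairseq-generate; the data directory, checkpoint path, and output file are hypothetical placeholders, and the flags shown are standard Fairseq options rather than a record of our exact command.

```python
import subprocess

# Hypothetical paths; substitute the actual binarized data and checkpoint.
cmd = [
    "fairseq-generate", "data-bin/sl-hr",
    "--path", "checkpoints/sl-hr/checkpoint_best.pt",
    "--source-lang", "sl",
    "--target-lang", "hr",
    "--beam", "5",          # same beam size used throughout our experiments
    "--remove-bpe",         # strip BPE markers so hypotheses are plain words
]
with open("generate.sl-hr.out", "w", encoding="utf-8") as log:
    subprocess.run(cmd, stdout=log, check=True)
```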

Pair     Bilingual Models    Multiling. Models
hi-mr    0.49                -
es-ca    41.74               8.49
ca-es    45.23               45.86
es-pt    23.35               17.06
pt-es    24.26               21.55
*sl-hr   20.92               22.26
hr-sl    14.94               7.37
sl-sr    14.7                20.18
sr-sl    19.46               11.37

Table 4: The BLEU scores for some of the models on the test set. The language pair with the asterisk has back-translation.

Figure 1: Interaction between performance in BLEU and Jaccard similarity.

6.3 Discussion

We used the Jaccard similarity to measure the level of mutual intelligibility. Figure 1 shows a positive correlation between the BLEU scores and the Jaccard similarity between each language pair.[2] This relationship holds both for the bilingual and the multilingual models. One exception is the Slovene-Serbian pair (SL-SR), where higher similarity does not translate into higher BLEU. For example, the SL-SR BLEU is below the SL-HR BLEU even though the latter pair has a higher similarity score. Interpreting Jaccard to mean mutual intelligibility, our findings imply that higher intelligibility is correlated with higher BLEU scores. However, there is a need to further investigate this relationship due to the SL-SR exception we observe.

[2] We multiplied the Jaccard similarity by 100 to reduce the range of values on the y axis.

7 Conclusion

We described our contribution to the WMT 2020 Similar Languages Translation Shared Task. We developed both bilingual and multilingual models for all pairs, in both directions. We showed back-translation to help improve performance on one pair. We also showed how mutual intelligibility between a pair of languages (measured by Jaccard similarity) positively correlates with model performance (in BLEU). Future work can focus on exploiting other similarity metrics and providing a more in-depth study of mutual intelligibility between similar languages and how it interacts with MT model performance, both in bilingual and multilingual models. Exploring the utility of back-translation on pairs we have not studied can also be fruitful.

Acknowledgements

MAM gratefully acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and Compute Canada (www.computecanada.ca).

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019a. Bilingual lexicon induction through unsupervised machine translation. arXiv preprint arXiv:1907.10761.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019b. An effective approach to unsupervised machine translation. arXiv preprint arXiv:1902.01313.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Pau Baquero-Arnal, Javier Iranzo-Sánchez, Jorge Civera, and Alfons Juan. 2019. The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish machine translation systems for WMT19 similar language translation task. In Proceedings of the 4th Conference on MT (Volume 3: Shared Task Papers, Day 2), pages 179–184.

Yu Chen and Tania Avgustinova. 2019. Machine translation from an intercomprehension perspective. In Proceedings of the 4th Conference on MT (Volume 3: Shared Task Papers, Day 2), pages 192–196.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Anna Currey, Alina Karakanta, and Jon Dehdari. 2016. Using related languages to enhance statistical language models. In Proceedings of the NAACL Student Research Workshop, pages 116–123.

Raj Dabre, Tetsuji Nakagawa, and Hideto Kazawa. 2017. An empirical study of language relatedness for transfer learning in neural machine translation. In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pages 282–286.

Jelena Golubović and Charlotte Gooskens. 2015. Mutual intelligibility between West and South Slavic languages. Russian Linguistics, 39(3):351–373.

Charlotte Gooskens. 2007. The contribution of linguistic factors to the intelligibility of closely related languages. Journal of Multilingual and Multicultural Development, 28(6):445–467.

Charlotte Gooskens and Femke Swarte. 2017. Linguistic and extra-linguistic predictors of mutual intelligibility between Germanic languages. Nordic Journal of Linguistics, 40(2):123–147.

Jan Hajic. 2000. Machine translation of very close languages. In 6th Applied NLP Conference, pages 7–12.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Zihan Liu, Yan Xu, Genta Indra Winata, and Pascale Fung. 2019. Incorporating word and subword units in unsupervised machine translation using language model rescoring. arXiv preprint arXiv:1908.05925.

Suphakit Niwattanakul, Jatsada Singthongchai, Ekkachai Naenudorn, and Supachanun Wanapu. 2013. Using of Jaccard coefficient for keywords similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, pages 380–384.

Diana Oktavia. 2019. Understanding new language: Mutual intelligibility in Romance language pair. Journal of Language Education and Development (JLed), 2(1):180–187.

Michael Przystupa and Muhammad Abdul-Mageed. 2019. Neural machine translation of low-resource and similar languages with backtranslation. In Proceedings of the 4th Conference on MT (Volume 3: Shared Task Papers, Day 2), pages 224–235.

Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, and Shuai Ma. 2019. Unsupervised neural machine translation with SMT as posterior regularization. arXiv preprint arXiv:1901.04112.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
