
AdapNMT: Neural Machine Translation with Technical Domain Adaptation for Indic Languages

Hema Ala, LTRC, IIIT-Hyderabad, India, [email protected]
Dipti Misra Sharma, LTRC, IIIT-Hyderabad, India, [email protected]

Abstract

Adapting to a new domain is a highly challenging task for Neural Machine Translation (NMT). In this paper we show the capability of general domain machine translation when translating into Indic languages (English - Hindi and Hindi - Telugu), and low resource domain adaptation of MT systems using existing general parallel data together with small in-domain parallel data for the AI and Chemistry domains. We carried out our experiments using Byte Pair Encoding (BPE), as it addresses the rare word problem. We observe that adding even a small amount of in-domain data to the general data improves the BLEU score significantly.

1 Introduction

Because Neural Machine Translation (NMT) performs better than traditional statistical machine translation (SMT) models, it has become very popular in recent years. NMT systems require a large amount of training data and thus perform poorly relative to phrase-based machine translation (PBMT) systems in low resource and domain adaptation scenarios (Koehn and Knowles, 2017). One of the challenges in NMT is domain adaptation, and it becomes harder for low resource Indic languages and technical domains like Artificial Intelligence (AI) and Chemistry, as these domains may contain many technical terms, equations, etc. In a typical domain adaptation setup like ours, we have a large amount of out-of-domain bilingual training data on which we train an NMT model; we treat this as the baseline model. Given only an additional small amount of in-domain data, the challenge is to improve translation performance on the new domain. Domain adaptation has become a very popular topic recently, but very few works have been carried out on technical domains like chemistry, computer science, etc. We therefore adopt two new technical domains in our experiments, Artificial Intelligence and Chemistry, provided by the ICON Adap-MT 2020 shared task for the English - Hindi and Hindi - Telugu language pairs. In our approach we first train general models (baseline models) on general data only and test domain data (AI, Chemistry) on these general models; we then try to improve performance on the new domain by training another model on the combined training data (general data + domain data). Inspired by Sennrich et al. (2015), we encode rare and unknown words as sequences of subword units using Byte Pair Encoding (BPE) in order to make our NMT model capable of open vocabulary translation; this is further discussed in section 3.2.

2 Background & Motivation

Domain adaptation has become an active research topic in NMT. Freitag and Al-Onaizan (2016) proposed two approaches: continue the training of the baseline model (general model) on the in-domain data (domain data) only, and ensemble the continued model with the baseline model at decoding time. Zeng et al. (2019) proposed an iterative dual domain adaptation framework for NMT, which continuously exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer. Apart from these domain adaptation techniques, there are approaches that make use of domain terminology in NMT.

Similarly, Hasler et al. (2018) proposed an approach to NMT decoding with terminology constraints using decoder attentions, which enables reduced output duplication and better constraint placement compared to existing methods. Apart from traditional approaches, there is a stack-based lattice search algorithm whose search space is constrained with lattices generated by phrase-based machine translation (PBMT), which improves robustness (Khayrallah et al., 2017). Wang et al. (2017) proposed two instance weighting methods with a dynamic weight learning strategy for NMT domain adaptation.

Although a huge amount of research exists in this area, there are very few works on Indian languages. To the best of our knowledge there is no work on technical domains like ours (Artificial Intelligence and Chemistry). Therefore there is a need to handle these technical domains and to work on morphologically rich and resource poor languages.

3 Approach

There are many approaches for domain adaptation, as discussed in section 2. The approach we adopted falls under combining the training data of the general domain with the specific technical domain data. This is further discussed in section 3.3. Our approach follows an attention-based NMT implementation similar to Bahdanau et al. (2014) and Luong et al. (2015). Our model is very similar to the model described in Luong et al. (2015) and supports label smoothing, beam-search decoding and random sampling. A brief explanation of NMT is given in section 3.1.

3.1 Neural Machine Translation

An NMT system tries to find the conditional probability of the target sentence given the source sentence; in our case the targets are Indic languages. There are many ways to parameterize this conditional probability. Kalchbrenner and Blunsom (2013) used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. (2014) used a deep Long Short-Term Memory (LSTM) model, Cho et al. (2014) used an architecture similar to the LSTM, and Bahdanau et al. (2014) used a more elaborate neural network architecture that uses an attentional mechanism over the input sequence. In this work, following Luong et al. (2015) and Sutskever et al. (2014), we use LSTM architectures for our NMT models, with one LSTM encoding the input sequence and a separate LSTM producing the translation. The encoder reads the source sentence, one word at a time, and produces a vector that represents the entire source sentence. The decoder is initialized with this vector and generates the translation, one word at a time, until it emits the end-of-sentence symbol. For better translations we use a bi-directional LSTM (Bahdanau et al., 2014) and the attention mechanism described in Luong et al. (2015).
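The paper states this objective only in prose. As a worked illustration, under the Luong et al. (2015) global-attention formulation that the paper says it follows (the notation here is ours, not taken from the paper), the model factorizes the probability of a target sentence y = (y_1, ..., y_m) given a source sentence x as:

```latex
% Standard attentional encoder-decoder factorization (Luong-style global attention);
% the notation is ours and is only illustrative.
\begin{align*}
  p(y \mid x) &= \prod_{t=1}^{m} p(y_t \mid y_{<t}, x),
  & p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\!\left(W_s \tilde{h}_t\right),\\
  \tilde{h}_t &= \tanh\!\left(W_c [c_t ; h_t]\right),
  & c_t &= \sum_{s=1}^{n} \alpha_{ts} \bar{h}_s,
  \qquad \alpha_{ts} = \frac{\exp\!\left(\operatorname{score}(h_t, \bar{h}_s)\right)}
                            {\sum_{s'=1}^{n} \exp\!\left(\operatorname{score}(h_t, \bar{h}_{s'})\right)}.
\end{align*}
```

Here \bar{h}_s denotes the s-th encoder state, h_t the decoder LSTM state at step t, c_t the attention context vector, and training maximizes the log-likelihood of the reference translations over the parallel corpus.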
3.2 Byte Pair Encoding (BPE)

BPE (Gage, 1994) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence. We use this algorithm for word segmentation: by merging frequent pairs of character sequences we can obtain a vocabulary of the desired size (Sennrich et al., 2015). Telugu and Hindi are morphologically rich languages, Telugu in particular being an agglutinative language, so postpositions, compound words, etc. need to be handled. BPE helps here by separating suffixes, prefixes and compound words; it handles new and complex Telugu and Hindi words by interpreting them as subword units. NMT with Byte Pair Encoding has yielded significant improvements in translation quality for low resource, morphologically rich languages (Pinnis et al., 2017). We adopted the same for our experiments on both language pairs, English-Hindi and Hindi-Telugu. In our approach we obtained the best results with a vocabulary size of 20000 and an embedding dimension of 300.
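As a concrete illustration of the merge-learning step, here is a minimal sketch in the spirit of Sennrich et al. (2015); the function names, toy vocabulary and merge count are ours and are not the exact code or data used in the paper. In our setting the number of merges is chosen so that the resulting subword vocabulary is about 20000 units.

```python
import re
import collections

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the current vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a sequence of symbols ending in a word marker '</w>'.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = stats.most_common(1)[0][0]   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                         # the learned merge operation
```

The learned merge operations are then applied to both the general and the domain corpora, so frequent technical terms and inflected forms are segmented into known subword units instead of becoming unknown words.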

3.3 Technical Domain Adaptation

Freitag and Al-Onaizan (2016) discussed two problems that arise when general data and domain data are combined for training. First, training a neural machine translation system on large data sets can take several weeks, so training a new model on the combined training data is time consuming. Second, since the in-domain data is relatively small, the out-of-domain data tends to dominate the training data, and the learned model will therefore not perform as well on the in-domain test data. We nevertheless preferred this approach, because our target languages are morphologically rich and resource poor. We addressed the two problems discussed in Freitag and Al-Onaizan (2016) as follows. First, as our main objective is to use the small amount of available technical domain data (AI and Chemistry) along with the general data to improve translation of the given domain test data, adding this small amount of data does not make training much more time consuming, since the general data itself is limited for these morphologically rich languages (Telugu and Hindi). To address the second problem, we use BPE. The technical domain data is very small compared to the general data, so if we take the top 50k words as our vocabulary, most of the words will come from the general data, which leads to poor translation of domain data. To overcome this we used BPE, as it uses subword units, handles rare words, and can easily recognize the inflected words that are prevalent in morphologically rich languages. Also, because the technical domain data is very small, performing validation on combined data (general validation data + domain validation data) would lead to low translation quality on the domain test data. Therefore we used only domain data for validation and obtained a significant improvement in BLEU score on the domain test data; a sketch of this data setup is given below.
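The paper does not give code for this step; the following is a minimal sketch, under our own assumptions, of how the combined training set and the domain-only validation set can be assembled before BPE and training. All file names and paths are hypothetical placeholders.

```python
from pathlib import Path

def read_lines(path):
    """Read one side of a parallel corpus, dropping empty lines."""
    return [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines()
            if line.strip()]

def build_adapted_corpus(general_src, general_tgt, domain_src, domain_tgt):
    """Concatenate general and in-domain parallel data for training.
    Validation is intentionally taken from the domain data only (see section 3.3)."""
    train_src = read_lines(general_src) + read_lines(domain_src)
    train_tgt = read_lines(general_tgt) + read_lines(domain_tgt)
    assert len(train_src) == len(train_tgt), "source/target sides must stay parallel"
    return train_src, train_tgt

# Hypothetical file names for the English-Hindi AI setup.
train_src, train_tgt = build_adapted_corpus(
    "gen.en", "gen.hi", "ai.train.en", "ai.train.hi")
val_src, val_tgt = read_lines("ai.val.en"), read_lines("ai.val.hi")  # domain-only validation
```

Keeping the validation set purely in-domain means that early stopping and model selection are driven by performance on the technical domain rather than on the much larger general data.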

             Train    Val    Test
Gen-En-Hi    665474   7003   507
Gen-En-Te    120708   2259   507
AI-En-Hi     4872     400    401
AI-En-Te     4872     400    401
Chem-En-Hi   4984     300    397
Chem-Hi-Te   3300     300    500

Table 1: Data statistics (no. of sentences). Val: validation data; Gen: general data for that language pair.

4 Experiments and Results

We evaluate our approach on the test data sets provided by the ICON Adap-MT 2020 shared task for all language pairs and all domains. Data statistics are given in table 1. All the sentences in table 1 are taken from various sources provided by ICON Adap-MT 2020; these include OpenSubtitles, GlobalVoices, GNOME, etc. from the OPUS corpus (Tiedemann, 2012). After collecting the data from the above mentioned sources, the training and validation split was done based on the corpus size, and empty lines were removed. To measure translation quality we used the automatic evaluation metric BLEU (Papineni et al., 2002).
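The paper reports BLEU but does not state which implementation was used; as one reasonable way to reproduce such corpus-level scores, the sacrebleu library can be applied to detokenized system outputs and references. The toy strings below are placeholders, not outputs from the paper's models.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and reference translations (one string per sentence).
hypotheses = ["the square function is very simple ."]
references = ["square function is pretty simple ."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```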

4.1 Training Details

We train three models for each language pair: 1. a baseline model trained on general data, 2. a model trained on general + AI data, and 3. a model trained on general + Chemistry data. For statistics on training and validation sentences, refer to table 1. We followed Bahdanau et al. (2014) and Luong et al. (2015) while training our NMT systems. Our parameters are uniformly initialized in [-0.1, 0.1]. We used a standard embedding dimension, i.e. 300. We have a comparatively small amount of data (including the general data), hence we preferred a small batch size of 10. We start with a learning rate of 0.001 and halve the learning rate every 5 epochs. Additionally, we use dropout with probability 0.3. To avoid overfitting we used an early stopping criterion, which is one form of regularization. A sketch of this training configuration is given below.
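The paper specifies the hyperparameters but not the toolkit, so the following is only a minimal PyTorch-style sketch of the optimization schedule described above. The trivial linear module stands in for the actual attentional bi-LSTM encoder-decoder, the optimizer choice and early-stopping patience are our assumptions, and the validation loss is a placeholder.

```python
import torch

# Stand-in for the real encoder-decoder model (section 3.1); used only so the schedule runs.
model = torch.nn.Linear(300, 300)  # 300 matches the embedding dimension in section 4.1

# Uniform initialization in [-0.1, 0.1], as described in section 4.1.
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.1, 0.1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # start at 0.001; optimizer choice assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs
dropout = torch.nn.Dropout(p=0.3)                           # dropout probability 0.3

best_val, patience, bad_epochs = float("inf"), 3, 0         # early stopping; patience value assumed
for epoch in range(50):
    # ... iterate over mini-batches of size 10 and update the model here ...
    val_loss = 1.0 / (epoch + 1)                            # placeholder for domain-only validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                           # stop when validation stops improving
    scheduler.step()
```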

Domain        BLEU (on val)
AI-En-Hi      8.4
Chem-En-Hi    6
AI-Hi-Te      0.6
Chem-Hi-Te    0.03

Table 2: BLEU scores of AI and Chemistry validation data on the general models (trained on general data only) for the respective language pairs.

Model         BLEU (on val)   BLEU (on test)
AI-En-Hi      16              15.37
Chem-En-Hi    19.6            12.35
AI-Hi-Te      8.2             10.35
Chem-Hi-Te    5.7             6.87

Table 3: AI-En-Hi: trained on AI + general data for English-Hindi; AI-Hi-Te: trained on AI + general data for Hindi-Telugu; Chem-En-Hi: trained on Chemistry + general data for English-Hindi; Chem-Hi-Te: trained on Chemistry + general data for Hindi-Telugu.

Source: Square function is pretty simple.
  Target: स्क्वेयर फंक्शन बहुत सरल है। (skveyar phankshan bahut saral hai.)
  MT1: यह काम सरल सरल है। (yah kaam saral saral hai.)
  MT2: स्क्वेर फंक्शन बहुत सरल है। (skver phankshan bahut saral hai.)

Source: In this case, there is no difference between the enzyme immunoassay and radioimmunoassay.
  Target: इस विधि में, एंजाइम इम्यूनोएसे और रेडियोइम्यूनोएसे के बीच कोई अंतर नहीं होता। (is vidhi mein, enjaim imyoonoese aur rediyoimyoonoese ke beech koee antar nahin hota.)
  MT1: इस मामले में एंजाइम विज़ेशन और रेडियो के बीच कोई अंतर नहीं है। (is maamale mein enjaim vizeshan aur rediyo ke beech koee antar nahin hai.)
  MT2: इस मामले में, एंजाइम इम्यूनोएसे और रेडियोइम्यूनोएसे के बीच कोई अंतर नहीं है। (is maamale mein, enjaim imyoonoese aur rediyoimyoonoese ke beech koee antar nahin hai.)

Table 4: Examples of improved sentences. MT1: output of the general model (trained on general data only); MT2: output of the proposed model (trained on general + domain data).

4.2 Analysis

We conducted an evaluation of random sentences from the test data for both domains and found that the translation of domain/technical terms and named entities improved after adding a small amount of technical domain data to the general data; some examples for English to Hindi, for the AI and Chemistry domains respectively, are shown in table 4. In the first example in table 4, taken from the AI domain, the domain term "square function" is translated properly into "स्क्वेर फंक्शन" (skver phankshan) by our proposed model. The same happened in the chemistry domain for the domain terms "enzyme immunoassay" and "radioimmunoassay": our model translated them correctly, whereas the general model did not. To show the improvement in terms of BLEU score, we tested our AI and Chemistry validation data on the general model, which was trained on general data only. Then we tested the same validation data on our proposed models, which are trained on the combined data (general + domain). Once we saw improvements on the validation data from the general model to the new model, we fixed the parameters of the model as mentioned in section 3.3 for testing. Table 2 shows the BLEU scores of the AI and Chemistry validation data on the English-Hindi and Hindi-Telugu general models. When the same validation data is tested on the proposed models (table 3), the BLEU score of the chemistry validation data improves from 6 to 19.6 for the English to Hindi language pair, an increase of more than three times. Similarly for AI, the BLEU score increases from 8.4 to 16 for English to Hindi. For Hindi to Telugu, the BLEU score of the AI domain increases from 0.6 to 8.2, and likewise from 0.03 to 5.7 for the chemistry domain. Next we evaluated the domain test data on the proposed models AI-En-Hi, Chem-En-Hi, AI-Hi-Te and Chem-Hi-Te; refer to table 3 for the BLEU scores on test data.

5 Future Work

We would like to extend this work to other possible technical domains and to more languages. We plan to explore other approaches, such as Transformer based models, for technical domain adaptation, and to try to incorporate linguistic features into the NMT models.

6 Conclusion

For morphologically rich and resource poor languages like Telugu it is very difficult to obtain a large parallel corpus for a technical domain. Therefore there is a need to optimize our general models with the small amount of available domain data.

In this paper we showed an approach that adds a small amount of technical domain data to the available general domain data and trains a model using BPE. For better translation quality on the technical domain we used only domain data for validation and observed that our approach gives promising results.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.

Eva Hasler, Adrià De Gispert, Gonzalo Iglesias, and Bill Byrne. 2018. Neural machine translation decoding with terminology constraints. arXiv preprint arXiv:1805.03750.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Huda Khayrallah, Gaurav Kumar, Kevin Duh, Matt Post, and Philipp Koehn. 2017. Neural lattice search for domain adaptation in machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 20–25.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017. Neural machine translation for morphologically rich languages with improved sub-word units and synthetic data. In International Conference on Text, Speech, and Dialogue, pages 237–245. Springer.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine translation domain adaptation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1482–1488.

Jiali Zeng, Yang Liu, Jinsong Su, Yubin Ge, Yaojie Lu, Yongjing Yin, and Jiebo Luo. 2019. Iterative dual domain adaptation for neural machine translation. arXiv preprint arXiv:1912.07239.