
AdapNMT: Neural Machine Translation with Technical Domain Adaptation for Indic Languages

Hema Ala, LTRC, IIIT-Hyderabad, India, [email protected]
Dipti Misra Sharma, LTRC, IIIT-Hyderabad, India, [email protected]

Abstract

Adapting to a new domain is a highly challenging task for Neural Machine Translation (NMT). In this paper we show the capability of general domain machine translation when translating into Indic languages (English - Hindi and Hindi - Telugu), and low resource domain adaptation of MT systems using existing general parallel data together with small in-domain parallel data for the AI and Chemistry domains. We carried out our experiments using Byte Pair Encoding (BPE), as it addresses the rare word problem. We observe that adding even a small amount of in-domain data to the general data improves the BLEU score significantly.

1 Introduction

Because Neural Machine Translation (NMT) performs better than traditional statistical machine translation (SMT) models, it has become very popular in recent years. NMT systems require a large amount of training data and thus perform poorly relative to phrase-based machine translation (PBMT) systems in low resource and domain adaptation scenarios (Koehn and Knowles, 2017). One of the challenges in NMT is domain adaptation, and it becomes harder for low resource Indic languages and technical domains like Artificial Intelligence (AI) and Chemistry, as these domains may contain many technical terms, equations, etc. In a typical domain adaptation setup like ours, we have a large amount of out-of-domain bilingual training data on which we train an NMT model; we treat this as the baseline model. Given only an additional small amount of in-domain data, the challenge is to improve translation performance on the new domain. Domain adaptation has become a very popular topic recently, but very few works have been carried out on technical domains like chemistry, computer science, etc. We therefore adopt two new technical domains in our experiments, Artificial Intelligence and Chemistry, provided by the ICON Adap-MT 2020 shared task for the English - Hindi and Hindi - Telugu language pairs. In our approach we first train general models (baseline models) on general data only and test domain data (AI, Chemistry) on these general models; we then try to improve performance on the new domain by training another model on the combined training data (general data + domain data). Inspired by Sennrich et al. (2015), we encode rare and unknown words as sequences of subword units using Byte Pair Encoding (BPE) in order to make our NMT model capable of open vocabulary translation; this is further discussed in section 3.2.

2 Background & Motivation

Domain adaptation has become an active research topic in NMT. Freitag and Al-Onaizan (2016) proposed two approaches: continue the training of the baseline model (general model) on the in-domain data (domain data) only, and ensemble the continued model with the baseline model at decoding time. Zeng et al. (2019) proposed an iterative dual domain adaptation framework for NMT, which continuously exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer. Apart from these domain adaptation techniques, there are approaches that make use of domain terminology in NMT.

Similarly, Hasler et al. (2018) proposed an approach to NMT decoding with terminology constraints using decoder attentions, which enables reduced output duplication and better constraint placement compared to existing methods. Apart from traditional approaches, there is a stack-based lattice search algorithm whose search space is constrained with lattices generated by phrase-based machine translation (PBMT), which improves robustness (Khayrallah et al., 2017). Wang et al. (2017) proposed two instance weighting methods with a dynamic weight learning strategy for NMT domain adaptation.

Although a huge amount of research exists in this area, there are very few works on Indian languages. To the best of our knowledge there is no work on technical domains like ours (Artificial Intelligence and Chemistry). Therefore there is a need to handle these technical domains and to work on morphologically rich and resource poor languages.

3 Approach

There are many approaches for domain adaptation, as discussed in section 2. The approach we adopted falls under combining the training data of the general domain with the specific technical domain data. This is further discussed in section 3.3. Our approach follows an attention-based NMT implementation similar to Bahdanau et al. (2014) and Luong et al. (2015). Our model is very similar to the model described in Luong et al. (2015) and supports label smoothing, beam-search decoding and random sampling. A brief explanation of NMT is given in section 3.1.

3.1 Neural Machine Translation

An NMT system tries to find the conditional probability of the target sentence given the source sentence; in our case the targets are Indic languages. There are many ways to parameterize this conditional probability. Kalchbrenner and Blunsom (2013) used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. (2014) used a deep Long Short-Term Memory (LSTM) model, Cho et al. (2014) used an architecture similar to the LSTM, and Bahdanau et al. (2014) used a more elaborate neural network architecture that uses an attentional mechanism over the input sequence. In this work, following Luong et al. (2015) and Sutskever et al. (2014), we use LSTM architectures for our NMT models, with one LSTM encoding the input sequence and a separate LSTM producing the translation. The encoder reads the source sentence, one word at a time, and produces a vector that represents the entire source sentence. The decoder is initialized with this vector and generates the translation, one word at a time, until it emits the end-of-sentence symbol. For better translations we use a bi-directional LSTM (Bahdanau et al., 2014) and the attention mechanism described in Luong et al. (2015).
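The paper states this objective only in prose. As a worked illustration, under the Luong et al. (2015) global-attention formulation that the paper says it follows (the notation here is ours, not taken from the paper), the model factorizes the probability of a target sentence y = (y_1, ..., y_m) given a source sentence x as:

```latex
% Standard attentional encoder-decoder factorization (Luong-style global attention);
% the notation is ours and is only illustrative.
\begin{align*}
  p(y \mid x) &= \prod_{t=1}^{m} p(y_t \mid y_{<t}, x),
  & p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\!\left(W_s \tilde{h}_t\right),\\
  \tilde{h}_t &= \tanh\!\left(W_c [c_t ; h_t]\right),
  & c_t &= \sum_{s=1}^{n} \alpha_{ts} \bar{h}_s,
  \qquad \alpha_{ts} = \frac{\exp\!\left(\operatorname{score}(h_t, \bar{h}_s)\right)}
                            {\sum_{s'=1}^{n} \exp\!\left(\operatorname{score}(h_t, \bar{h}_{s'})\right)}.
\end{align*}
```

Here \bar{h}_s denotes the s-th encoder state, h_t the decoder LSTM state at step t, c_t the attention context vector, and training maximizes the log-likelihood of the reference translations over the parallel corpus.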
3.2 Byte Pair Encoding (BPE)

BPE (Gage, 1994) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence. We use this algorithm for word segmentation: by merging frequent pairs of character sequences we can obtain a vocabulary of the desired size (Sennrich et al., 2015). Telugu and Hindi are morphologically rich languages, Telugu in particular being an agglutinative language, so postpositions, compound words, etc. need to be handled. BPE helps here by separating suffixes, prefixes and compound words; it handles new and complex Telugu and Hindi words by interpreting them as subword units. NMT with Byte Pair Encoding has yielded significant improvements in translation quality for low resource, morphologically rich languages (Pinnis et al., 2017). We adopted the same for our experiments on both language pairs, English-Hindi and Hindi-Telugu. In our approach we obtained the best results with a vocabulary size of 20000 and an embedding dimension of 300.
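As a concrete illustration of the merge-learning step, here is a minimal sketch in the spirit of Sennrich et al. (2015); the function names, toy vocabulary and merge count are ours and are not the exact code or data used in the paper. In our setting the number of merges is chosen so that the resulting subword vocabulary is about 20000 units.

```python
import re
import collections

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the current vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a sequence of symbols ending in a word marker '</w>'.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = stats.most_common(1)[0][0]   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                         # the learned merge operation
```

The learned merge operations are then applied to both the general and the domain corpora, so frequent technical terms and inflected forms are segmented into known subword units instead of becoming unknown words.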

3.3 Technical Domain Adaptation

Freitag and Al-Onaizan (2016) discussed two problems that arise when general data and domain data are combined for training. First, training a neural machine translation system on large data sets can take several weeks, so training a new model on the combined training data is time consuming. Second, since the in-domain data is relatively small, the out-of-domain data tends to dominate the training data, and the learned model will therefore not perform as well on the in-domain test data. We nevertheless preferred this approach, because our target languages are morphologically rich and resource poor. We addressed the two problems discussed in Freitag and Al-Onaizan (2016) as follows. First, as our main objective is to use the small amount of available technical domain data (AI and Chemistry) along with the general data to improve translation of the given domain test data, adding this small amount of data does not make training much more time consuming, since the general data itself is limited for these morphologically rich languages (Telugu and Hindi). To address the second problem, we use BPE. The technical domain data is very small compared to the general data, so if we take the top 50k words as our vocabulary, most of the words will come from the general data, which leads to poor translation of domain data. To overcome this we used BPE, as it uses subword units, handles rare words, and can easily recognize the inflected words that are prevalent in morphologically rich languages. Also, because the technical domain data is very small, performing validation on combined data (general validation data + domain validation data) would lead to low translation quality on the domain test data. Therefore we used only domain data for validation and obtained a significant improvement in BLEU score on the domain test data; a sketch of this data setup is given below.
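The paper does not give code for this step; the following is a minimal sketch, under our own assumptions, of how the combined training set and the domain-only validation set can be assembled before BPE and training. All file names and paths are hypothetical placeholders.

```python
from pathlib import Path

def read_lines(path):
    """Read one side of a parallel corpus, dropping empty lines."""
    return [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines()
            if line.strip()]

def build_adapted_corpus(general_src, general_tgt, domain_src, domain_tgt):
    """Concatenate general and in-domain parallel data for training.
    Validation is intentionally taken from the domain data only (see section 3.3)."""
    train_src = read_lines(general_src) + read_lines(domain_src)
    train_tgt = read_lines(general_tgt) + read_lines(domain_tgt)
    assert len(train_src) == len(train_tgt), "source/target sides must stay parallel"
    return train_src, train_tgt

# Hypothetical file names for the English-Hindi AI setup.
train_src, train_tgt = build_adapted_corpus(
    "gen.en", "gen.hi", "ai.train.en", "ai.train.hi")
val_src, val_tgt = read_lines("ai.val.en"), read_lines("ai.val.hi")  # domain-only validation
```

Keeping the validation set purely in-domain means that early stopping and model selection are driven by performance on the technical domain rather than on the much larger general data.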

             Train    Val    Test
Gen-En-Hi    665474   7003   507
Gen-En-Te    120708   2259   507
AI-En-Hi     4872     400    401
AI-En-Te     4872     400    401
Chem-En-Hi   4984     300    397
Chem-Hi-Te   3300     300    500

Table 1: Data statistics (no. of sentences). Val: validation data; Gen: general data for that language pair.

4 Experiments and Results

We evaluate our approach on the test data sets provided by the ICON Adap-MT 2020 shared task for all language pairs and all domains. Data statistics are given in table 1. All the sentences in table 1 are taken from various sources provided by ICON Adap-MT 2020; these include OpenSubtitles, GlobalVoices, GNOME, etc. from the OPUS corpus (Tiedemann, 2012). After collecting the data from the above mentioned sources, the training and validation split was done based on the corpus size, and empty lines were removed. To measure translation quality we used the automatic evaluation metric BLEU (Papineni et al., 2002).
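The paper reports BLEU but does not state which implementation was used; as one reasonable way to reproduce such corpus-level scores, the sacrebleu library can be applied to detokenized system outputs and references. The toy strings below are placeholders, not outputs from the paper's models.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and reference translations (one string per sentence).
hypotheses = ["the square function is very simple ."]
references = ["square function is pretty simple ."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```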

4.1 Training Details

We train three models for each language pair: 1. a baseline model trained on general data, 2. a model trained on general + AI data, and 3. a model trained on general + Chemistry data. For statistics on training and validation sentences, refer to table 1. We followed Bahdanau et al. (2014) and Luong et al. (2015) while training our NMT systems. Our parameters are uniformly initialized in [-0.1, 0.1]. We used a standard embedding dimension, i.e. 300. We have a comparatively small amount of data (including the general data), hence we preferred a small batch size of 10. We start with a learning rate of 0.001 and halve the learning rate every 5 epochs. Additionally, we use dropout with probability 0.3. To avoid overfitting we used an early stopping criterion, which is one form of regularization. A sketch of this training configuration is given below.
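The paper specifies the hyperparameters but not the toolkit, so the following is only a minimal PyTorch-style sketch of the optimization schedule described above. The trivial linear module stands in for the actual attentional bi-LSTM encoder-decoder, the optimizer choice and early-stopping patience are our assumptions, and the validation loss is a placeholder.

```python
import torch

# Stand-in for the real encoder-decoder model (section 3.1); used only so the schedule runs.
model = torch.nn.Linear(300, 300)  # 300 matches the embedding dimension in section 4.1

# Uniform initialization in [-0.1, 0.1], as described in section 4.1.
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.1, 0.1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # start at 0.001; optimizer choice assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs
dropout = torch.nn.Dropout(p=0.3)                           # dropout probability 0.3

best_val, patience, bad_epochs = float("inf"), 3, 0         # early stopping; patience value assumed
for epoch in range(50):
    # ... iterate over mini-batches of size 10 and update the model here ...
    val_loss = 1.0 / (epoch + 1)                            # placeholder for domain-only validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                           # stop when validation stops improving
    scheduler.step()
```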

Domain        BLEU (on val)
AI-En-Hi      8.4
Chem-En-Hi    6
AI-Hi-Te      0.6
Chem-Hi-Te    0.03

Table 2: BLEU scores of AI and Chemistry validation data on the general models (trained on general data only) for the respective language pairs.

Model         BLEU (on val)   BLEU (on test)
AI-En-Hi      16              15.37
Chem-En-Hi    19.6            12.35
AI-Hi-Te      8.2             10.35
Chem-Hi-Te    5.7             6.87

Table 3: AI-En-Hi: trained on AI + general data for English-Hindi; AI-Hi-Te: trained on AI + general data for Hindi-Telugu; Chem-En-Hi: trained on Chemistry + general data for English-Hindi; Chem-Hi-Te: trained on Chemistry + general data for Hindi-Telugu.

Source: Square function is pretty simple.
  Target: स्क्वेयर फंक्शन बहुत सरल है। (skveyar phankshan bahut saral hai.)
  MT1: यह काम सरल सरल है। (yah kaam saral saral hai.)
  MT2: स्क्वेर फंक्शन बहुत सरल है। (skver phankshan bahut saral hai.)

Source: In this case, there is no difference between the enzyme immunoassay and radioimmunoassay.
  Target: इस विधि में, एंजाइम इम्यूनोएसे और रेडियोइम्यूनोएसे के बीच कोई अंतर नहीं होता। (is vidhi mein, enjaim imyoonoese aur rediyoimyoonoese ke beech koee antar nahin hota.)
  MT1: इस मामले में एंजाइम विज़ेशन और रेडियो के बीच कोई अंतर नहीं है। (is maamale mein enjaim vizeshan aur rediyo ke beech koee antar nahin hai.)
  MT2: इस मामले में, एंजाइम इम्यूनोएसे और रेडियोइम्यूनोएसे के बीच कोई अंतर नहीं है। (is maamale mein, enjaim imyoonoese aur rediyoimyoonoese ke beech koee antar nahin hai.)

Table 4: Examples of improved sentences. MT1: output of the general model (trained on general data only); MT2: output of the proposed model (trained on general + domain data).

4.2 Analysis

We conducted an evaluation of random sentences from the test data for both domains and found that the translation of domain/technical terms and named entities improved after adding a small amount of technical domain data to the general data; some examples for English to Hindi, for the AI and Chemistry domains respectively, are shown in table 4. In the first example in table 4, taken from the AI domain, the domain term "square function" is translated properly into "स्क्वेर फंक्शन" (skver phankshan) by our proposed model. The same happened in the chemistry domain for the domain terms "enzyme immunoassay" and "radioimmunoassay": our model translated them correctly, whereas the general model did not. To show the improvement in terms of BLEU score, we tested our AI and Chemistry validation data on the general model, which was trained on general data only. Then we tested the same validation data on our proposed models, which are trained on the combined data (general + domain). Once we saw improvements on the validation data from the general model to the new model, we fixed the parameters of the model as mentioned in section 3.3 for testing. Table 2 shows the BLEU scores of the AI and Chemistry validation data on the English-Hindi and Hindi-Telugu general models. When the same validation data is tested on the proposed models (table 3), the BLEU score of the chemistry validation data improves from 6 to 19.6 for the English to Hindi language pair, an increase of more than three times. Similarly for AI, the BLEU score increases from 8.4 to 16 for English to Hindi. For Hindi to Telugu, the BLEU score of the AI domain increases from 0.6 to 8.2, and likewise from 0.03 to 5.7 for the chemistry domain. Next we evaluated the domain test data on the proposed models AI-En-Hi, Chem-En-Hi, AI-Hi-Te and Chem-Hi-Te; refer to table 3 for the BLEU scores on test data.

5 Future Work

We would like to extend this work to other possible technical domains and to more languages. We plan to explore other approaches, such as Transformer based models, for technical domain adaptation, and to try to incorporate linguistic features into the NMT models.

6 Conclusion

For morphologically rich and resource poor languages like Telugu it is very difficult to obtain a large parallel corpus for a technical domain. Therefore there is a need to optimize our general models with the small amount of available domain data.

In this paper we showed an approach that adds a small amount of technical domain data to the available general domain data and trains a model using BPE. For better translation quality on the technical domain we used only domain data for validation and observed that our approach gives promising results.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.

Eva Hasler, Adrià De Gispert, Gonzalo Iglesias, and Bill Byrne. 2018. Neural machine translation decoding with terminology constraints. arXiv preprint arXiv:1805.03750.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Huda Khayrallah, Gaurav Kumar, Kevin Duh, Matt Post, and Philipp Koehn. 2017. Neural lattice search for domain adaptation in machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 20–25.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017. Neural machine translation for morphologically rich languages with improved sub-word units and synthetic data. In International Conference on Text, Speech, and Dialogue, pages 237–245. Springer.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine translation domain adaptation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1482–1488.

Jiali Zeng, Yang Liu, Jinsong Su, Yubin Ge, Yaojie Lu, Yongjing Yin, and Jiebo Luo. 2019. Iterative dual domain adaptation for neural machine translation. arXiv preprint arXiv:1912.07239.