Factored Neural Machine Translation on Low Resource Languages in the COVID-19 Crisis
Saptarashmi Bandyopadhyay
University of Maryland, College Park
College Park, MD 20742
[email protected]

Abstract

Factored Neural Machine Translation models have been developed for machine translation of COVID-19 related multilingual documents from English to five low resource languages – Lingala, Nigerian Fulfulde, Kurdish Kurmanji, Kinyarwanda, and Luganda – which are spoken in the impoverished and strife-torn regions of Africa and Middle-East Asia. The objective of the task is to ensure that COVID-19 related authentic information reaches the common people in their own language, primarily those in marginalized communities with limited linguistic resources, so that they can take appropriate measures to combat the pandemic without falling for the infodemic arising from the COVID-19 crisis. Two NMT systems have been developed for each of the five language pairs: one with the sequence-to-sequence NMT transformer model as the baseline, and the other with a factored NMT model in which lemma and POS tags are added to each word on the source English side. The motivation behind the factored NMT model is to address the paucity of linguistic resources by using linguistic features, which also help in generalization. It has been observed that the factored NMT model outperforms the baseline model by around 10% in terms of the Unigram BLEU score.

1 Introduction

The 2019-2020 global pandemic due to COVID-19 has affected the lives of all people across the world. It has become very essential for all to know the right kind of precautions and steps that everyone should follow so that they do not fall prey to the virus. It is necessary that appropriate authentic information reaches all in their own language in this period of infodemic as well, so that people can fluently communicate and understand serious health concerns in their mother tongue. Authentic information sites like the FDA (FDA) and the CDC (CDC) have published multilingual COVID-related information on their websites. A machine translation system (Way et al., 2020) for high resource language pairs like German-English, French-English, Spanish-English and Italian-English has also been developed and published to enable the whole world to fight the pandemic and the associated infodemic as well.

The present work reports on the development of factored neural machine translation systems for five different low resource language pairs: English–Lingala, English–Nigerian Fulfulde, English–Kurdish (Kurmanji), English–Kinyarwanda and English–Luganda. All the target languages are low resource languages in terms of natural language processing resources, but are spoken by a large number of people. Moreover, the countries where the native speakers of these languages reside are facing the COVID-19 crisis at various levels.

Lingala (Ngala) (Wikipedia, c) is a Bantu language spoken throughout the northwestern part of the Democratic Republic of the Congo and a large part of the Republic of the Congo, as well as to some degree in Angola and the Central African Republic. It has over 10 million speakers. The language follows the Latin script in writing. An English–Lingala rule-based MT system (at Google Summer of Code) has been developed by students interning at the Apertium organization in the Google Summer of Code 2019, but no BLEU score has been reported.

Luganda or Ganda (Wikipedia, d) is a morphologically rich and low resource language from Uganda. Luganda is a Bantu language and has over 8.2 million native speakers. The language follows the Latin script in writing. An English–Luganda SMT system () has been reported that, when trained with morphological segmentation at the pre-processing stage, produces a BLEU score of 31.20 on the Old Testament Bible corpus. However, it is not relevant to our research due to the different medical emergency context captured by the corpora.

Nigerian Fulfulde (Wikipedia, e) is one of the major languages in Nigeria. It is spoken by about 15 million people. The language follows the Latin script in writing.

Kinyarwanda (Wikipedia, a) is one of the official languages of Rwanda and a dialect of the Rwanda-Rundi language spoken by at least 12 million people in Rwanda, the Eastern Democratic Republic of Congo and adjacent parts of southern Uganda. The language follows the Latin script in writing.

Kurdish (Kurmanji) (Wikipedia, b), also termed Northern Kurdish, is the northern dialect of the Kurdish languages, spoken predominantly in southeast Turkey, northwest and northeast Iran, northern Iraq, northern Syria and the Caucasus and Khorasan regions. It has over 15 million native speakers.

The following statistics, as on 30th June 2020 and shown in Table 1, are available from the Coronavirus resources website hosted by the Johns Hopkins University (JHU) regarding the effects of COVID-19 in the countries in which the above languages are spoken.
In the present work, the idea of using factored neural machine translation (FNMT) has been explored in the five machine translation systems. The following steps have been taken in developing the MT systems:

1. The initial parallel corpus in the Translation Memory (.tmx) format has been converted to a sentence-level parallel corpus using the resources provided by (Madlon-Kay).

2. Tokenization and truecasing have been done on both the source and the target sides using the MOSES decoder (Hoang and Koehn, 2008).

3. The English source side, after tokenization and truecasing, has been augmented to include factors like the lemma (using the Snowball Stemmer (Porter, 2001)) and PoS tags (using the NLTK tagger). No factoring has been done on the low resource target language side.

4. Byte Pair Encoding (BPE) and the vocabulary are jointly learnt on the original and factored datasets with the subword-nmt tool (Sennrich et al., 2016).

5. BPE is then applied on the training, development and testing datasets for both the original and factored datasets.

6. The vocab file is obtained from the training and development datasets of the source and target files for both the original and factored datasets.

7. The MT system is trained on the original and factored datasets.

8. Testing is carried out on the original and factored datasets.

9. The output target data is post-processed, i.e., detokenized and detruecased, for both the original and factored datasets.

10. The Unigram BLEU score is calculated on the detokenized and detruecased target side for both the original and factored datasets.
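Steps 4 and 5 learn and apply Byte Pair Encoding with the subword-nmt tool. As a rough, self-contained illustration of the merge-learning idea behind BPE, the toy sketch below learns merge operations from word frequencies and segments a new word with them; it is not the subword-nmt implementation, which additionally handles vocabulary thresholds, "@@" continuation markers and joint source-target vocabularies.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is a single symbol already
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a single word with the learnt merge operations, in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = ["low", "low", "lower", "lowest", "newer", "newer"]
ops = learn_bpe(corpus, num_merges=10)
print(apply_bpe("lowest", ops))
```

Because merges are learnt from frequent pairs, rare words decompose into subword units seen in training, which is what keeps the corpus size constant while the token count grows, as noted in Section 3.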
2 Related Works

Neural machine translation (NMT) systems are the current state-of-the-art systems, as the translation accuracy of such systems is very high for languages with large amounts of publicly available training corpora. Current NMT systems that deal with low resource (LowRes) languages ((Guzmán et al., 2019); (ws-, 2018)) are based on unsupervised neural machine translation, semi-supervised neural machine translation, pretraining methods leveraging monolingual data and multilingual neural machine translation, among others.

Meanwhile, research work on Factored NMT systems ((Bandyopadhyay, 2019); (Koehn and Knowles, 2017); (García-Martínez et al., 2016); (Sennrich and Haddow, 2016)) has evolved over the years. The factored NMT architecture has played a significant role in increasing the vocabulary coverage over standard NMT systems. The syntactic and semantic information from the language is useful for generalizing the neural models being learnt from the parallel corpora. The number of unknown words also decreases in Factored NMT systems.

3 Dataset Development

The five target languages and language pairs with English as the source language are shown in Table 2. The parallel corpus has been developed by several academic (Carnegie Mellon University, Johns Hopkins University) and industry (Amazon, Apple, Facebook, Google, Microsoft, Translated) partners with the Translators without Borders (TICO-Consortium) to prepare COVID-19 materials for a variety of the world's languages, to be used by professional translators and for training state-of-the-art Machine Translation (MT) models.

Language             Country   Affected   Deceased
Lingala              Congo     7039       170
Luganda              Uganda    889        0
Nigerian Fulfulde    Nigeria   25694      590
Kinyarwanda          Rwanda    1025       2
Kurdish (Kurmanji)   Turkey    199906     5131

Table 1: COVID-19 statistics in the countries speaking the five low resource languages

Target Language      Language Pair
Lingala              en-ln
Nigerian Fulfulde    en-fuv
Kurdish Kurmanji     en-ku
Kinyarwanda          en-rw
Luganda              en-lg

Table 2: Five low resource language pairs

The number of sentences in the training, development and testing datasets for each language pair is mentioned in Table 3.

Training   Development   Testing
2771       200           100

Table 3: Number of lines in the training, development and testing datasets

Byte Pair Encoding (BPE) and the vocabulary were jointly learnt in each of the experiments, and then BPE was implemented on the tokenized training, development and testing files. The subword-nmt tool was used for the purpose of implementing Byte Pair Encoding (Sennrich et al., 2016) on the datasets to solve the problem of unknown words in the vocabulary. Accordingly, the corpus size remained the same but the number of tokens increased.

4 Experiment and Results

FNMT experiments were conducted in OpenNMT (Klein et al., 2017) based on PyTorch 1.4 with the following parameters:

1. Dropout rate of 0.4
2. 4-layered transformer-based encoder and decoder
3. Batch size of 50 and 512 hidden states
4. 8 heads for transformer self-attention
5. 2048 hidden transformer feed-forward units
6. 70000 training steps
7. Adam optimizer
8. Noam learning rate decay
9. Sequence length of 90 for factored data
10. Sequence length of 30 for original data

The following factors have been identified on the English side:

1. Stemmed word – The lemmas of the surface-level English words have been identified using the NLTK 3.4.1 implementation of the Snowball Stemmer (Porter, 2001).
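The factored source side attaches a stem and a PoS tag to each surface word; the target side is left unfactored. The sketch below shows the shape of such factored input. It is an illustration only: the toy suffix-stripping stemmer and suffix-based tagger stand in for the NLTK Snowball Stemmer and NLTK tagger used in the paper, and the "|" factor separator is an assumption — the actual separator and factor order depend on the toolkit configuration.

```python
def toy_stem(word):
    """Stand-in for the Snowball Stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_pos(word):
    """Stand-in for the NLTK tagger: a crude suffix/capitalization guess."""
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word[0].isupper():
        return "NNP"
    return "NN"

def factorize(sentence, sep="|"):
    """Attach stem and PoS factors to every source-side token.

    Only the English source side is factored; the low resource
    target side is left untouched, as in the paper.
    """
    return " ".join(
        # surface form | stem of the lowercased form | PoS tag
        sep.join((tok, toy_stem(tok.lower()), toy_pos(tok)))
        for tok in sentence.split()
    )

print(factorize("Wash your hands frequently"))
```

Each factored token thus carries linguistic features alongside the surface form, which is what lets the model generalize when the surface word itself is rare in the small training corpus.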