A Baseline Neural Machine Translation System for Indian Languages
A Baseline Neural Machine Translation System for Indian Languages

Jerin Philip∗ (IIIT Hyderabad), Vinay P. Namboodiri† (IIT Kanpur), C. V. Jawahar‡ (IIIT Hyderabad)

∗ [email protected]  † [email protected]  ‡ [email protected]

Abstract

We present a simple, yet effective, Neural Machine Translation system for Indian languages. We demonstrate its feasibility for multiple language pairs, and establish a strong baseline for further research.

1 Introduction

The ability to access information from non-native languages has become a necessity in the modern age of Information Technology. This is especially true in India, where there are many popular languages. While knowledge repositories in English are growing at a rapid pace, Indian languages remain very poor in digital resources. This leaves the average Indian handicapped in accessing knowledge resources. Automatic machine translation is a promising tool to bridge the language barrier, and thereby to democratize knowledge.

Machine Translation (MT) has been a topic of research for over two decades now, with schools of thought following rule-based approaches [Sinha et al., 1995, Chaudhury et al., 2010], example-based approaches [Somers, 1999, Sinhal and Gupta, 2014], and statistics-based approaches [Brown et al., 1993, Koehn, 2009, Kunchukuttan et al., 2014, Banerjee et al., 2018]. A class of statistics-based methods involving deep neural networks, popularly known as Neural Machine Translation (NMT), is making rapid progress in providing nearly usable translations across many language pairs [Edunov et al., 2018, Aharoni et al., 2019, Neubig and Hu, 2018]. Multiple competitive neural network architectures have emerged in the last couple of years, each outperforming the previous best and establishing superior baselines in a systematic manner [Sutskever et al., 2014, Bahdanau et al., 2014, Luong et al., 2015, Gehring et al., 2017, Vaswani et al., 2017]. Unfortunately, most of these methods are highly resource intensive, which naturally made NMT less attractive for Indian languages due to concerns of feasibility. However, we believe that neural machine translation schemes are most appropriate for Indian languages in the present-day setting due to: (i) the simplicity of the solution in rapidly prototyping and establishing practically effective systems; (ii) the lower cost of annotating the training data (the data required for building NMT systems at most demands sentence-level alignments, with no special tagging); (iii) the ease of transferring ideas and algorithms across languages (many practical tricks for developing effective solutions in one language provide insights for newer language pairs as well); (iv) the ease of transferring knowledge across languages and tasks, often implemented through pre-training, transfer learning, or domain adaptation; and (v) the rich software infrastructure available for rapidly prototyping effective models (e.g., software stacks for training) and for deployment (e.g., on embedded devices). Many of these are very important in the Indian setting, where the number of researchers and industries interested in Indian-language MT is very small compared to what is demanded. Often there may not even be more than one group per language equipped with state-of-the-art tools and techniques.

India has 22 official languages, and many unofficial languages in use. With the fast-growing numbers of mobile phone and Internet users, there is an immediate need for automatic machine translation systems from/to English as well as across Indian languages. Though digital content in Indian languages has increased considerably in the last few years, it is not yet comparable to that in English. For example, there are only 780K Wikipedia pages in Hindi and 214K in Telugu, in contrast to 36.5M in English [Wikipedia]. The ability to have effective translation models with acceptable performance takes solutions based on natural interfaces to the remote areas of a large world population. India has a spectrum of languages which could be labelled as low resource. Within the scope of the discussion in this work, a high-resource language is one for which a large number (> 1M) of sentence-level aligned text pairs between two languages is available. Note that even this could be a very conservative definition.

Usable NMT solutions already exist between several high-resource languages of the world. Some closed systems are available for public use on the web [Google, Microsoft] for Indian languages. However, they are limited in terms of details, reproducibility, and scope for further extension by the wider research community.

The objectives of this work are multi-fold. We summarize them below.

• Though neural machine translation has received much attention for many western language pairs, not many effective systems have been reported yet for Indic languages in the literature, barring exceptions like [Banerjee et al., 2018, Sen et al., 2018, Philip et al., 2018]. We would like to explore this machine learning and machine translation front, and demonstrate feasibility for multiple Indian languages.

• Indian languages have always suffered from the lack of annotated corpora for a number of language processing tasks, including machine translation using statistical or learning techniques. One of our objectives is also to investigate how to effectively use the available data, which could be monolingual, noisy, and partially aligned.

• From a system point of view, we would like to build and demonstrate a web-based system that provides satisfactory results, possibly superior to many of the previous open systems. In the process of developing the system, we also study the quirks that work for Indian languages, and comment on directions that we did not find promising.

• Quantitatively, we would like to establish a strong baseline for machine translation to and across Indian languages. We hope that this will enable further research in this area.

• Finally, we would like to release models and resources (such as noisy sentences) to enable research in this area.

The rest of the content is organized as follows: Section 2 reviews some of the relevant background literature with a specific focus on the Indian-language scenario. Section 3 discusses the components of the system and the adaptations carried out for Indian-language situations. Section 4 elaborates the experimental details. Section 5 summarizes the results and discusses our findings.

2 Machine Translation and Indian Language Situation

Machine translation systems can be built for a specific language pair or for multiple pairs simultaneously. We take the latter route in this work. This choice is dominated by many practical advantages in managing the limited available resources. We also hypothesize that the linguistic similarities across many Indian languages could help each other in this manner, though a systematic study of this is outside the scope of this work. Our work is in a multilingual setting wherein one system translates in all directions. We discuss challenges specific to the context of a few Indian languages.

Text Resources  Most Indian language pairs are resource scarce in terms of sentence-aligned parallel bi-text. More pairs and experts are likely to be available by pivoting through English to learn a multilingual translation model. The increasing reach of Internet access in the subcontinent is, however, changing this, as observable from Figure 1, which presents the systematic increase in Wikipedia pages in Indian languages.

Hindi-English (hi-en) can be labelled a relatively high-resource pair, after continued efforts to compile and increase resources over the past decade.
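To make the multilingual setting concrete: a common technique for training one model that translates in all directions is to prepend a target-language token to every source sentence, so a single encoder-decoder learns which direction is requested. The sketch below is only an illustrative preprocessing step, not the authors' exact pipeline; the tag format `<2xx>`, the helper names, and the toy sentence pairs are all assumptions.

```python
# Illustrative sketch (not the paper's pipeline): prepare sentence-aligned
# pairs for a single multilingual NMT model by prepending a target-language
# tag to each source sentence. Tag format and helper names are assumed.

def tag_pair(src_sentence, tgt_sentence, tgt_lang):
    """Prepend a target-language token so one model can serve all directions."""
    return (f"<2{tgt_lang}> {src_sentence}", tgt_sentence)

def build_training_corpus(aligned_pairs):
    """Expand each aligned record (lang_a, sent_a, lang_b, sent_b) into
    both translation directions for multilingual training."""
    corpus = []
    for lang_a, sent_a, lang_b, sent_b in aligned_pairs:
        corpus.append(tag_pair(sent_a, sent_b, lang_b))  # a -> b
        corpus.append(tag_pair(sent_b, sent_a, lang_a))  # b -> a
    return corpus

# Toy, romanized examples purely for illustration.
pairs = [("en", "hello", "hi", "namaste"),
         ("en", "water", "te", "neellu")]
corpus = build_training_corpus(pairs)
# Each source line now carries its target-language tag, e.g. "<2hi> hello".
```

One practical consequence of this scheme is that every additional language pair only adds data, not parameters: the same shared vocabulary and model absorb all directions, which is attractive when per-pair resources are scarce.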
The IIT-Bombay Hindi-English Parallel Corpus (IITB-hi-en) [Kunchukuttan et al., 2018] is the largest general-domain dataset available for use in MT. In addition to 1.5 million parallel sentence pairs, IITB-hi-en also offers nearly 19.2M monolingual Hindi sentences, which can be used to further enhance performance.

The Indian Languages Corpora Initiative (ILCI) Corpus [Jha, 2010] is another major initiative. The ILCI corpus we use here includes parallel data specific to the domains of tourism and health across seven languages, all 7-way aligned. The Indic Languages Multilingual Parallel Corpus (WAT-ILMPC)1 is a compilation of parallel subtitle data from OPUS2, containing parallel pairs of subtitles between languages of the subcontinent and English. We also make use of this.

[Figure 1: Growth of Wikipedia pages in Indian languages (bn, hi, ur, ml, ta, te); number of Wikipedia pages vs. year, 2009-2019.]

Methods  Most of the past attempts look at machine translation in the Indian setting as a pair-wise translation problem; that is, the scope is restricted to one or two languages when building translation solutions [Garje and Kharate, 2013, Khan et al., 2017]. The Sampark [TDIL-Sampark] system treats translation between different pairs of languages in the country as multiple separate problems, solved independently, with some commonality of approach across the pairs. This pair-wise view of the problem space enables greater incorporation of language expertise into the solution, and in the absence of huge resources to train statistical systems, these provided reasonable translations. Sata-Anuvadak [Kunchukuttan et al., 2014], a compilation of 110 independently trained translation systems using Statistical Machine Translation (SMT), analyzes a multilingual setup through SMT-based approaches. One may attempt to solve multilingual translation through several hops through intermediate pivots; however, error compounds at each step.

Modern-day NMT approaches are capable of sharing learning between language pairs for which no data is available, through

Linguistic Aspects  Khan et al.