Proactive Learning for Building Machine Translation Systems for Minority Languages
Vamshi Ambati, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Proceedings of the NAACL HLT Workshop on Active Learning for Natural Language Processing, pages 58–61, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Abstract

Building machine translation (MT) for many minority languages in the world is a serious challenge. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, it becomes very important for an MT system to make the best use of its resources, both labeled and unlabeled, in building a quality system. In this paper we argue that the traditional active learning setup may not be the right fit for seeking the annotations required for building a Syntax Based MT system for minority languages. We posit that a relatively new variant of active learning, Proactive Learning, is more suitable for this task.

1 Introduction

Speakers of minority languages could benefit from fluent machine translation (MT) between their native tongue and the dominant language of their region. But scarcity in capital and know-how has largely restricted machine translation to the dominant languages of first world nations. To lower the barriers surrounding MT system creation, we must reduce the time and resources needed to develop MT for new language pairs. Syntax based MT has proven to be a good choice for the minority language scenario (Lavie et al., 2003). While the amount of parallel data required to build such systems is orders of magnitude smaller than for corresponding phrase based statistical systems (Koehn et al., 2003), the variety of linguistic annotation required is greater. Syntax based MT systems require lexicons that provide coverage for the target translations and synchronous grammar rules that define the divergences in word order across the language pair. In the case of minority languages one can only expect to find a meagre amount of such data, if any. Building such resources effectively, within a constrained budget, and deploying an MT system is the need of the day.

We first consider ‘Active Learning’ (AL) as a framework for building annotated data for the task of MT. However, AL relies on unrealistic assumptions related to the annotation tasks. For instance, AL assumes there is a unique omniscient oracle. In MT, it is possible and more general to have multiple sources of information with differing reliabilities or areas of expertise. A literate bilingual speaker with no extra training can produce translations for words, phrases or sentences and even align them. But it requires a trained linguist to produce syntactic parse trees. AL also assumes that the single oracle is perfect, always providing a correct answer when requested. In reality, an oracle (human or machine) may be incorrect (fallible) with a probability that should be a function of the difficulty of the question. AL also has no notion of an annotation cost that varies across the input space. But in MT, it is easy to see that the cost of translation grows superlinearly with the length of a sentence. Also, not all annotation tasks for MT have the same level of difficulty or cost. For example, it is relatively cheap to ask a bilingual speaker whether a word, phrase or sentence was correctly translated by the system, but a bit more expensive to ask for a correction. Assumptions like these render active learning unsuitable for our task at hand, which is building an MT system for languages with limited resources. We make the case for “Proactive Learning” (Donmez and Carbonell, 2008) as a solution for this scenario.

In the rest of the paper, we discuss the syntax based MT approach in Section 2. In Section 3 we first discuss active learning approaches for MT and detail the characteristics of the minority language MT problem that render traditional active learning unsuitable for practical purposes. In Section 4 we discuss proactive learning as a potential solution for the current problem. We conclude with some challenges that still remain in applying proactive learning for MT.

2 Syntax Based Machine Translation

In recent years, corpus based approaches to machine translation have become predominant, with Phrase Based Statistical Machine Translation (PB-SMT) (Koehn et al., 2003) being the most actively progressing area. Recent research in syntax based machine translation (Yamada and Knight, 2001; Chiang, 2005) incorporates syntactic information to ameliorate the reordering problem faced by PB-SMT approaches. While traditional approaches to syntax based MT were dependent on the availability of a manual grammar, more recent approaches operate within the resources of PB-SMT and induce hierarchical or linguistic grammars from existing phrasal units, to provide better generality and structure for reordering (Yamada and Knight, 2001; Chiang, 2005; Wu, 1997).

2.1 Resources for Syntax MT

Syntax based approaches to MT seek to leverage the structure of natural language to automatically induce MT systems. Depending upon the MT system and the paradigm, the resource requirements may vary and could also include modules such as morphological analyzers, sense disambiguation modules, generators etc. A detailed discussion of the comprehensive pipeline is out of the scope of this paper, more so because such resources cannot be expected in a low-resource language scenario. We focus only on the quintessential set of modules for the MT pipeline: data acquisition, word-alignment, syntactic analysis etc. The resources can broadly be categorized as ‘monolingual’ vs ‘bilingual’ depending upon whether annotation requires knowledge of one language or both languages. A sample of the different kinds of data and annotation that is expected by an MT system is shown below. Each additional piece of information can be seen as an extra annotation for the ‘Source’ sentence. The target language in the example is ‘Hindi’.

• Source: John ate an apple
• Target: John ne ek seb khaya
• Alignment: (1,1),(2,5),(3,3),(4,4)
• SourceParse: (S (NP (NNP John)) (VP (VBD ate) (NP (DT an) (NN apple))))
• Lexicon: (seb → apple),(ate → khaya)
• Grammar: VP: V NP → NP V

3 Active Learning for MT

Modern syntax based MT rides on the success of both Statistical Machine Translation and Statistical Parsing. Active learning has been applied to Statistical Parsing (Hwa, 2004; Baldridge and Osborne, 2003) to improve sample selection for manual annotation. In the case of MT, active learning has remained largely unexplored. Some attempts include training multiple statistical MT systems on varying amounts of data, and exploring a committee based selection for re-ranking the data to be translated and included for re-training. But this does not apply to training in a low-resource scenario where data is scarce.

In the rest of the section we discuss the different scenarios that arise in gathering annotation for MT under a traditional ‘active learning’ setup and discuss the characteristics of the task that render it difficult.

3.1 Multiple Oracles

For each of the sub-tasks of annotation, in reality we have multiple sources of information, or multiple oracles. We can elicit translations for building a parallel corpus from bilingual speakers who speak both languages with a certain accuracy, or from a linguist who is formally educated in the languages. With the success of collaborative sites like Amazon’s ‘Mechanical Turk’ (http://www.mturk.com/), one can provide the task of annotation to multiple oracles on the internet (Snow et al., 2008). The task of word alignment can be posed in a similar fashion too. More interestingly, there are statistical tools like GIZA that take as input un-annotated parallel data and propose automatic correspondences between words in the language pair, giving scope to ‘machine oracles’.

3.2 Varying Quality and Reliability

Oracles also vary in the correctness of the answers they provide (quality) as well as in their availability to answer (robustness). One typical distinction is ‘human oracles’ vs ‘machine oracles’. Human oracles produce higher quality annotations than machine oracles. We would prefer a tree bank of manually created parse trees over automatically generated tree banks. The same holds for word-alignment and other tasks of translation. Some oracles are ‘reluctant’ to produce an output, for example parsers tend to break on really long sentences, but when they produce an output we can associate some confidence with its quality.

[...] but sometimes a tradeoff to make is cost vs quality. We cannot afford to introduce a low-quality grammar rule into the system, but can possibly tolerate an incorrect word-correspondence link.

4 Proactive Learning

Proactive learning (Donmez and Carbonell, 2008) is a generalization of active learning designed to relax unrealistic assumptions and thereby reach practical applications. Active learning seeks to select the most informative unlabeled instances and ask an omniscient oracle for their labels, so as to retrain the learning algorithm and maximize accuracy. However, the oracle is assumed to be infallible (never wrong), indefatigable (always answers), individual (only one oracle), and insensitive to costs (always free or always charges the same). Proactive learning relaxes all four of these assumptions, relying on a decision-theoretic approach to jointly select the optimal oracle and instance, by casting the problem as a utility optimization problem subject to a budget constraint.
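
The joint choice of oracle and instance under a budget, as described for proactive learning, can be illustrated with a small sketch. This is not the formulation of Donmez and Carbonell (2008); it is a simple greedy approximation under stated assumptions: each instance carries a hypothetical informativeness value, each oracle has a positive per-query cost, a probability of answering, and a per-task probability of being correct, and the expected utility of a query is the product of these three quantities. All names (Oracle, expected_utility, select_queries) and all numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Oracle:
    name: str
    cost: float                  # assumed positive cost per query
    p_answer: float              # assumed probability that the oracle responds
    p_correct: Dict[str, float]  # assumed accuracy per annotation task


def expected_utility(value: float, task: str, oracle: Oracle) -> float:
    """Instance value discounted by the chance of getting a correct answer."""
    return value * oracle.p_answer * oracle.p_correct.get(task, 0.0)


def select_queries(
    instances: List[Tuple[str, str, float]],  # (instance id, task, value)
    oracles: List[Oracle],
    budget: float,
) -> Tuple[List[Tuple[str, str]], float]:
    """Greedily pick (instance, oracle) pairs by expected utility per unit cost."""
    remaining = budget
    unlabeled = list(instances)
    plan = []
    while unlabeled:
        # Score every affordable (instance, oracle) pair.
        candidates = [
            (expected_utility(value, task, o) / o.cost, inst, o)
            for inst, task, value in unlabeled
            for o in oracles
            if o.cost <= remaining
        ]
        if not candidates:
            break  # nothing affordable is left
        score, inst, oracle = max(candidates, key=lambda c: c[0])
        if score <= 0.0:
            break  # no oracle can usefully answer the remaining instances
        plan.append((inst, oracle.name))
        remaining -= oracle.cost
        unlabeled = [u for u in unlabeled if u[0] != inst]
    return plan, remaining


# Illustrative setup: a cheap bilingual speaker who can only translate,
# and an expensive linguist who can also produce parse trees.
speaker = Oracle("bilingual_speaker", 1.0, 0.9, {"translation": 0.9})
linguist = Oracle("linguist", 5.0, 1.0, {"translation": 0.95, "parse": 0.95})
instances = [("s1", "translation", 1.0), ("s2", "parse", 0.8), ("s3", "translation", 0.3)]
plan, remaining = select_queries(instances, [speaker, linguist], budget=7.0)
print(plan, remaining)
```

In this toy setup the cheap speaker is preferred for translation queries, while the parse query can only go to the linguist, which mirrors the differing areas of expertise discussed in Section 3.1. A greedy ratio rule is only one possible way to approximate a budgeted utility optimization.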