Apertium-Icenlp: a Rule-Based Icelandic to English Machine Translation System
Total Page:16
File Type:pdf, Size:1020Kb
Apertium-IceNLP: A rule-based Icelandic to English machine translation system Martha Dís Brandt, Hrafn Loftsson, Francis M. Tyers Hlynur Sigurþórsson School of Computer Science Dept. Lleng. i. Sist. Inform. Reykjavik University Universitat d’Alacant IS-101 Reykjavik, Iceland E-03071 Alacant, Spain {marthab08,hrafn,hlynurs06}@ru.is [email protected] Abstract translate between the source language (SL) and the target language (TL). We describe the development of a pro- SMT has many advantages, e.g. it is data-driven, totype of an open source rule-based language independent, does not need linguistic ex- Icelandic English MT system, based on perts, and prototypes of new systems can by built → the Apertium MT framework and IceNLP, quickly and at a low cost. On the other hand, the a natural language processing toolkit for need for parallel corpora as training data in SMT Icelandic. Our system, Apertium-IceNLP, is also its main disadvantage, because such corpora is the first system in which the whole are not available for a myriad of languages, espe- morphological and tagging component of cially the so-called less-resourced languages, i.e. Apertium is replaced by modules from languages for which few, if any, natural language an external system. Evaluation shows processing (NLP) resources are available. When that the word error rate and the position- there is a lack of parallel corpora, other machine independent word error rate for our pro- translation (MT) methods, such as rule-based MT, totype is 50.6% and 40.8%, respectively. e.g. Apertium (Forcada et al., 2009), may be used As expected, this is higher than the corre- to create MT systems. sponding error rates in two publicly avail- In this paper, we describe the development able MT systems that we used for com- of a prototype of an open source rule-based parison. Contrary to our expectations, the Icelandic English (is-en) MT system based on → error rates of our prototype is also higher Apertium and IceNLP, an NLP toolkit for process- than the error rates of a comparable system ing and analysing Icelandic texts (Loftsson and based solely on Apertium modules. Based Rögnvaldsson, 2007b). A decade ago, the Ice- on error analysis, we conclude that bet- landic language could have been categorised as ter translation quality may be achieved by a less-resourced language. The current situation, replacing only the tagging component of however, is much better thanks to the development Apertium with the corresponding module of IceNLP and various linguistic resources (Rögn- in IceNLP, but leaving morphological anal- valdsson et al., 2009). On the other hand, no large ysis to Apertium. parallel corpus, in which Icelandic is one of the languages, is freely available. This is the main rea- 1 Introduction son why the work described here was initiated. Our system, Apertium-IceNLP, is the first sys- Over the last decade or two, statistical machine tem in which the whole morphological and tagg- translation (SMT) has gained significant momen- ing component of Apertium is replaced by mod- tum and success, both in academia and industry. ules from an external system. Our motivation SMT uses large parallel corpora, texts that are for developing such a hybrid system was to be translations of each other, during training to derive able to answer the following research question: Is a statistical translation model which is then used to the translation quality of an is-en shallow-transfer c 2011 European Association for Machine Translation. MT system higher when using state-of-the-art Ice- Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (eds.) Proceedings of the 15th Conference of the European Association for Machine Translation, p. 217224 Leuven, Belgium, May 2011 landic NLP modules in the Apertium pipeline as morphologically analysed words, chooses the opposed to relying solely on Apertium modules? most likely sequence of PoS tags. Evaluation results show that the word error rate Lexical selection: A lexical selection mod- (WER) of our prototype is 50.6% and the position- • independent word error rate (PER) is 40.8%1. This ule based on Constraint Grammar (Karlsson is higher than the evaluation results of two publicly et al., 1995) selects between possible transla- available MT systems for is-en translation, Google tions of a word based on sentence context. 2 3 Translate and Tungutorg . This was expected, Lexical transfer: For an unambiguous lex- • given the short development time of our system, ical form in the SL, this module returns the i.e. 8 man-months. For comparison, we know that equivalent TL form based on a bilingual dic- Tungutorg has been developed by an individual, tionary. Stefán Briem, intermittently over a period of two decades4. Structural transfer: Performs local morpho- • Contrary to our expectations, the error rates of logical and syntactic changes to convert the our hybrid system is also higher than the error rates SL into the TL. of an is-en system based solely on Apertium mod- A morphological generator: For a given TL ules. This “pure” Apertium version was devel- • oped in parallel with Apertium-IceNLP. Based on lexical form, this module returns the TL sur- our error analysis, we conclude that better trans- face form. lation quality may be achieved by replacing only 2.1 Language pair specifics the tagging component of Apertium with the corre- sponding module in IceNLP, but leaving morpho- For each language pair, the Apertium platform logical analysis to Apertium. needs a monolingual SL dictionary used by the We think that our work can be viewed as a morphological analyser, a bilingual SL-TL dictio- guideline for other researchers wanting to develop nary used by the lexical transfer module, a mono- hybrid MT systems based on Apertium. lingual TL dictionary used by the morphological generator, and transfer rules used by the struc- 2 Apertium tural transfer module. The dictionaries and transfer rules specific to the is-en pair will be discussed in The Apertium shallow-transfer MT platform was Sections 4.2 and 4.3, respectively. originally aimed at the Romance languages of the The lexical selection module is a new module Iberian peninsula, but has also been adapted for in the Apertium platform and the is-en pair is the other languages, e.g. Welsh (Tyers and Don- first released pair to make extensive use of it. The nelly, 2009) and Scandinavian languages (Nord- module works by selecting a translation based on falk, 2009). The whole platform, both programs sentence context. For example, for the ambigu- and data, is free and open source and all the soft- ous word bóndi ’farmer’ or ’husband’, the default ware and data for the supported language pairs is 5 translation is left as ’farmer’, but a lexical selec- available for download from the project website . tion rule chooses the translation of ’husband’ if The Apertium platform consists of the following a possessive pronoun is modifying it. While the main modules: current lexical selection rules have been written by A morphological analyser: Performs token- hand, work is ongoing to generate them automati- • isation and morphological analysis which for cally with machine learning techniques. a given surface form returns all of the possible 3 IceNLP lexical forms (analyses) of the word. IceNLP is an open source6 NLP toolkit for pro- A part-of-speech (PoS) tagger: The HMM- • cessing and analysing Icelandic texts. Currently, based PoS tagger, given a sequence of the main modules of IceNLP are the following: 1See explanations of WER and PER in Section 5. 2http://translate.google.com A tokeniser. This module performs both 3 • http://www.tungutorg.is/ word tokenisation and sentence segmenta- 4 We do not have information on development months for the tion. is-en part of Google Translate. 5http://www.apertium.org 6http://icenlp.sourceforge.net 218 IceMorphy: A morphological analyser Wallenberg, 2008; Loftsson et al., 2009) has • (Loftsson, 2008). The program provides the shown that the morphological complexity of the tag profile (the ambiguity class) for known Icelandic language, and the relatively small train- words by looking up words in its dictionary. ing corpus in relation to the size of the tagset, is to The dictionary is derived from the Icelandic blame for a rather low tagging accuracy (compared Frequency Dictionary (IFD) corpus (Pind et to related languages). Taggers that are purely al., 1991). The tag profile for unknown based on machine learning (including HMM tri- words, i.e. words not known to the dictionary, gram taggers) have not been able to produce high is guessed by applying rules based on mor- accuracy when tagging Icelandic text (with the ex- phological suffixes and endings. IceMorphy ception of Dredze and Wallenberg (2008)). The does not generate word forms, it only carries current state-of-the-art tagging accuracy of 92.5% out analysis. is obtained by applying a hybrid approach, inte- grating TriTagger into IceTagger (Loftsson et al., IceTagger: A linguistic rule-based PoS • 2009). tagger (Loftsson, 2008). The tagger produces disambiguated morphosyntactic tags from the 4 Apertium-IceNLP tagset of the IFD corpus. The tagger uses Ice- Morphy for morphological analysis and ap- We decided to experiment with using IceMorphy, plies both local rules and heuristics for dis- Lemmald, IceTagger and IceParser in the Aper- ambiguation. tium pipeline. Note that since Apertium is based on a collection of modules that are connected by TriTagger: A statistical PoS tagger. This • clean interfaces in a pipeline (following the Unix trigram tagger is a re-implemenation of philosophy (Forcada et al., 2009)) it is relatively the well-known HMM tagger described by easy to replace modules or add new ones. Figure 1 Brants (2000). It is trained on the IFD cor- shows the Apertium-IceNLP pipeline.