Apertium-IceNLP: A rule-based Icelandic to English machine translation system

Martha Dís Brandt, Hrafn Loftsson, Hlynur Sigurþórsson
School of Computer Science, Reykjavik University
IS-101 Reykjavik, Iceland
{marthab08,hrafn,hlynurs06}@ru.is

Francis M. Tyers
Dept. Lleng. i Sist. Inform., Universitat d'Alacant
E-03071 Alacant, Spain
[email protected]

© 2011 European Association for Machine Translation. Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (eds.), Proceedings of the 15th Conference of the European Association for Machine Translation, p. 217–224, Leuven, Belgium, May 2011.

Abstract

We describe the development of a prototype of an open source rule-based Icelandic → English MT system, based on the Apertium MT framework and IceNLP, a natural language processing toolkit for Icelandic. Our system, Apertium-IceNLP, is the first system in which the whole morphological and tagging component of Apertium is replaced by modules from an external system. Evaluation shows that the word error rate and the position-independent word error rate for our prototype are 50.6% and 40.8%, respectively. As expected, this is higher than the corresponding error rates in two publicly available MT systems that we used for comparison. Contrary to our expectations, the error rates of our prototype are also higher than the error rates of a comparable system based solely on Apertium modules. Based on error analysis, we conclude that better translation quality may be achieved by replacing only the tagging component of Apertium with the corresponding module in IceNLP, but leaving morphological analysis to Apertium.

1 Introduction

Over the last decade or two, statistical machine translation (SMT) has gained significant momentum and success, both in academia and industry. SMT uses large parallel corpora, texts that are translations of each other, during training to derive a statistical translation model which is then used to translate between the source language (SL) and the target language (TL). SMT has many advantages, e.g. it is data-driven, language independent, does not need linguistic experts, and prototypes of new systems can be built quickly and at a low cost. On the other hand, the need for parallel corpora as training data in SMT is also its main disadvantage, because such corpora are not available for a myriad of languages, especially the so-called less-resourced languages, i.e. languages for which few, if any, natural language processing (NLP) resources are available. When there is a lack of parallel corpora, other machine translation (MT) methods, such as rule-based MT, e.g. Apertium (Forcada et al., 2009), may be used to create MT systems.

In this paper, we describe the development of a prototype of an open source rule-based Icelandic → English (is-en) MT system based on Apertium and IceNLP, an NLP toolkit for processing and analysing Icelandic texts (Loftsson and Rögnvaldsson, 2007b). A decade ago, the Icelandic language could have been categorised as a less-resourced language. The current situation, however, is much better thanks to the development of IceNLP and various linguistic resources (Rögnvaldsson et al., 2009). On the other hand, no large parallel corpus, in which Icelandic is one of the languages, is freely available. This is the main reason why the work described here was initiated.

Our system, Apertium-IceNLP, is the first system in which the whole morphological and tagging component of Apertium is replaced by modules from an external system. Our motivation for developing such a hybrid system was to be able to answer the following research question: Is the translation quality of an is-en shallow-transfer MT system higher when using state-of-the-art Icelandic NLP modules in the Apertium pipeline as opposed to relying solely on Apertium modules?

Evaluation results show that the word error rate (WER) of our prototype is 50.6% and the position-independent word error rate (PER) is 40.8% (see Section 5 for explanations of WER and PER). This is higher than the evaluation results of two publicly available MT systems for is-en translation, Google Translate (http://translate.google.com) and Tungutorg (http://www.tungutorg.is/). This was expected, given the short development time of our system, i.e. 8 man-months. For comparison, we know that Tungutorg has been developed by an individual, Stefán Briem, intermittently over a period of two decades (we do not have information on development months for the is-en part of Google Translate).

Contrary to our expectations, the error rates of our hybrid system are also higher than the error rates of an is-en system based solely on Apertium modules. This "pure" Apertium version was developed in parallel with Apertium-IceNLP. Based on our error analysis, we conclude that better translation quality may be achieved by replacing only the tagging component of Apertium with the corresponding module in IceNLP, but leaving morphological analysis to Apertium.

We think that our work can be viewed as a guideline for other researchers wanting to develop hybrid MT systems based on Apertium.

2 Apertium

The Apertium shallow-transfer MT platform was originally aimed at the Romance languages of the Iberian peninsula, but has also been adapted for other languages, e.g. Welsh (Tyers and Donnelly, 2009) and Scandinavian languages (Nordfalk, 2009). The whole platform, both programs and data, is free and open source, and all the software and data for the supported language pairs are available for download from the project website (http://www.apertium.org).

The Apertium platform consists of the following main modules:

• A morphological analyser: Performs tokenisation and morphological analysis which for a given surface form returns all of the possible lexical forms (analyses) of the word.
• A part-of-speech (PoS) tagger: The HMM-based PoS tagger, given a sequence of morphologically analysed words, chooses the most likely sequence of PoS tags.
• Lexical selection: A lexical selection module based on Constraint Grammar (Karlsson et al., 1995) selects between possible translations of a word based on sentence context.
• Lexical transfer: For an unambiguous lexical form in the SL, this module returns the equivalent TL form based on a bilingual dictionary.
• Structural transfer: Performs local morphological and syntactic changes to convert the SL into the TL.
• A morphological generator: For a given TL lexical form, this module returns the TL surface form.

2.1 Language pair specifics

For each language pair, the Apertium platform needs a monolingual SL dictionary used by the morphological analyser, a bilingual SL-TL dictionary used by the lexical transfer module, a monolingual TL dictionary used by the morphological generator, and transfer rules used by the structural transfer module. The dictionaries and transfer rules specific to the is-en pair will be discussed in Sections 4.2 and 4.3, respectively.

The lexical selection module is a new module in the Apertium platform and the is-en pair is the first released pair to make extensive use of it. The module works by selecting a translation based on sentence context. For example, for the ambiguous word bóndi 'farmer' or 'husband', the default translation is left as 'farmer', but a lexical selection rule chooses the translation 'husband' if a possessive pronoun is modifying it. While the current lexical selection rules have been written by hand, work is ongoing to generate them automatically with machine learning techniques.

3 IceNLP

IceNLP is an open source (http://icenlp.sourceforge.net) NLP toolkit for processing and analysing Icelandic texts. Currently, the main modules of IceNLP are the following:

• A tokeniser. This module performs both word tokenisation and sentence segmentation.
• IceMorphy: A morphological analyser (Loftsson, 2008). The program provides the tag profile (the ambiguity class) for known words by looking up words in its dictionary. The dictionary is derived from the Icelandic Frequency Dictionary (IFD) corpus (Pind et al., 1991). The tag profile for unknown words, i.e. words not known to the dictionary, is guessed by applying rules based on morphological suffixes and endings. IceMorphy does not generate word forms, it only carries out analysis.
• IceTagger: A linguistic rule-based PoS tagger (Loftsson, 2008). The tagger produces disambiguated morphosyntactic tags from the tagset of the IFD corpus. The tagger uses IceMorphy for morphological analysis and applies both local rules and heuristics for disambiguation.
• TriTagger: A statistical PoS tagger. This trigram tagger is a re-implementation of the well-known HMM tagger described by Brants (2000). It is trained on the IFD corpus.
• Lemmald: A lemmatiser (Ingason et al., 2008). The method used combines a data-driven method with linguistic knowledge to maximise accuracy.
• IceParser: A shallow parser (Loftsson and Rögnvaldsson, 2007a). The parser marks both constituent structure and syntactic functions using a cascade of finite-state transducers.

3.1 The tagset and the tagging accuracy

The IFD corpus consists of about 600,000 tokens and the tagset of about 700 tags. In this tagset, each character in a tag has a particular function. The first character denotes the word class. For each word class there is a predefined number of additional characters (at most six), which describe morphological features, like gender, number and case for nouns; degree and declension for adjectives; voice, mood and tense for verbs, etc. To illustrate, consider the Icelandic word strákarnir 'the boys'. The corresponding IFD tag is nkfng, denoting noun (n), masculine (k), plural (f), nominative (n), and suffixed definite article (g).

Previous work on PoS tagging Icelandic text (Helgadóttir, 2005; Loftsson, 2008; Dredze and Wallenberg, 2008; Loftsson et al., 2009) has shown that the morphological complexity of the Icelandic language, and the relatively small training corpus in relation to the size of the tagset, are to blame for a rather low tagging accuracy (compared to related languages). Taggers that are purely based on machine learning (including HMM trigram taggers) have not been able to produce high accuracy when tagging Icelandic text (with the exception of Dredze and Wallenberg (2008)). The current state-of-the-art tagging accuracy of 92.5% is obtained by applying a hybrid approach, integrating TriTagger into IceTagger (Loftsson et al., 2009).

4 Apertium-IceNLP

We decided to experiment with using IceMorphy, Lemmald, IceTagger and IceParser in the Apertium pipeline. Note that since Apertium is based on a collection of modules that are connected by clean interfaces in a pipeline (following the Unix philosophy (Forcada et al., 2009)), it is relatively easy to replace modules or add new ones. Figure 1 shows the Apertium-IceNLP pipeline.

Our motivation for using the above modules is the following:

1. Developing a good morphological analyser for a language is a time-consuming task (although there exists a morphological database for Icelandic, http://bin.arnastofnun.is, it is unfortunately not available as free/open source software/data). Since our system is unidirectional, i.e. is-en but not en-is, we only need to be able to analyse an Icelandic surface form, but do not need to generate an Icelandic surface form from a lexical form (lemma and morphosyntactic tags). We can thus rely on IceMorphy for morphological analysis.

2. As discussed in Section 3.1, research has shown that HMM taggers, like the one included in Apertium, have not been able to achieve high accuracy when tagging Icelandic. Thus, it seems logical to use the state-of-the-art tagger, IceTagger, instead.

3. Morphological analysers in Apertium return a lemma in addition to morphosyntactic tags. To produce a lemma for each word, we can instead rely on the Icelandic lemmatiser, Lemmald.

[Figure: SL text → IceMorphy → IceTagger → Lemmald → lexical selection → lexical transfer → structural transfer (chunker → interchunk1 → interchunk2 → postchunk) → morph. generator → TL text]

Figure 1: The Apertium-IceNLP pipeline.

4. Information about syntactic functions can be of help in the translation process. IceParser, which provides this information, can therefore potentially be used (we have not yet added IceParser to the pipeline).

4.1 IceNLP enhancements

In order to use modules from IceNLP in the Apertium pipeline, various enhancements needed to be carried out in IceNLP.

4.1.1 Mappings

Various mappings from the output generated by IceTagger to the format expected by the Apertium modules were necessary. All mappings were implemented by a single mapping file with different sections for different purposes. For example, morphosyntactic tags produced by IceTagger needed to be mapped to the tags used by Apertium. The mapping file thus contains entries for each possible tag from the IFD tagset and the corresponding Apertium tags. For example, the following entry in the mapping file shows the mapping for the IFD tag nkfng (see Section 3.1).

[TAGMAPPING]
...
nkfng <n><m><pl><nom><def>

The string "[TAGMAPPING]" above is a section name, whereas <n> stands for noun, <m> for masculine, <pl> for plural, <nom> for nominative, and <def> for definite.

Another example of a necessary mapping regards exceptions to tag mappings for particular lemmata. The following entries show that after tag mapping, the tags <vblex><actv> (verb, active voice) for the lemmata vera 'to be' and hafa 'to have' should be replaced by the single tags <vbser> and <vbhaver>, respectively. The reason is that Apertium needs specific tags for these verbs.

[LEMMA]
...
vera
hafa

The last example of a mapping concerns multiword expressions (MWEs). IceTagger tags each word of a MWE, whereas Apertium handles them as a single unit because MWEs cannot be translated word-for-word. Therefore, MWEs need to be listed in the mapping file along with the corresponding Apertium tags. The following entries show two MWEs, að einhverju leyti 'to some extent' and af hverju 'why', along with the corresponding Apertium tags.

[MWE]
...
að_einhverju_leyti
af_hverju

Instead of producing tags for each component of a MWE, IceTagger searches for MWEs in its input text that match entries in the mapping file and produces the Apertium tag(s) for a particular MWE if a match is found.

4.1.2 Daemonising IceNLP

The versions of IceMorphy/IceTagger described in (Loftsson, 2008), and Lemmald described in (Ingason et al., 2008), were designed to tag and lemmatise large amounts of Icelandic text, e.g. corpora. When IceTagger starts up, it creates an instance of IceMorphy which in turn loads various dictionaries into memory. Similarly, when Lemmald starts up, it loads its rules into memory. This behaviour is fine when tagging and lemmatising corpora, because, in that case, the startup time is relatively small compared to the time needed to tag and lemmatise.

On the other hand, a common usage of a machine translation system is translating a small number of sentences (for example, in online MT services) as opposed to a corpus. Using the modules from IceNLP unmodified as part of the Apertium pipeline would be inefficient in that case because the aforementioned dictionaries and rules would be reloaded every time the language pair is used.

Therefore, we added a client-server functionality to IceNLP in order for it to run efficiently as part of the Apertium pipeline. We added two new applications to IceNLP: IceNLPServer and IceNLPClient. IceNLPServer is a server application, which contains an instance of the IceNLP toolkit. Essentially, IceNLPServer is a daemon which runs in the background. When it is started up, all necessary dictionaries and rules are loaded into memory and are kept there while the daemon is running. Therefore, the daemon can serve requests to the modules in IceNLP without any loading delay.

IceNLPClient is a console-based client for com- [...] put string. To illustrate, when the client is asked to analyse the string Hún er góð 'She is good':

echo "Hún er góð" | RunClient.sh

it returns:

^Hún/hún$
^er/vera$
^góð/góður$

This output is consistent with the output generated by the Apertium tagger, i.e. for each word of an input sentence, the lexeme is followed by the lemma followed by the (disambiguated) morphosyntactic tags. The output above is then fed directly into the remainder of the Apertium pipeline, i.e. into lexical selection, lexical transfer, structural transfer, and morphological generation (see Figure 1), to produce the English translation 'She is good'.

4.2 The bilingual dictionary

When this project was initiated, no is-en bilingual dictionary (bidix) was publicly available in electronic format. Our bidix was built in three stages.

First, the is-en dictionary was populated with entries spidered from the Internet from Wikipedia, Wiktionary, Freelang, the Cleasby-Vigfusson Old Icelandic dictionary (http://www.ling.upenn.edu/~kurisuto/germanic/oi_cleasbyvigfusson_about.html) and the Icelandic Word Bank (http://www.ismal.hi.is/ob/index.en.html). This provided a starting point of over 5,000 entries in Apertium style XML format which needed to be checked manually for correctness. Also, since lexical selection was not an option in the early stages of the project, only one entry could be used per SL word. For SL words that had multiple TL translations, all but the translation that seemed the most likely option had to be commented out. For example, below we have three options for the SL word fíngerður in the bidix where it could be translated as 'fine', 'petite' or 'subtle', and the latter two options are commented out.

fíngerður fine

Each entry in the bidix is surrounded by element tags <e>...</e> and paragraph tags <p>...</p>. SL words are surrounded by left tags <l>...</l> and TL translations by right tags <r>...</r>. Within the left and right tags the attribute value "adj" denotes that the word is an adjective and the presence of the attribute value "sint" denotes that the adjective's degree of comparison is shown with "-er/-est" endings (e.g. 'fine', 'finer', 'finest').

In the second stage of the bidix development, a bilingual wordlist of about 6,000 SL words with word class and gender was acquired from an individual, Anton Ingason. It required some preprocessing before it could be added to the bidix, e.g. determining which of these new SL words did not already exist in the dictionary, and selecting a default translation in cases where more than one translation was given.

Last, we acquired a bilingual wordlist from the dictionary publishing company Forlagið (http://snara.is/), containing an excerpt of about 18,000 SL words from their Icelandic-English online dictionary. This required similar preprocessing work as described above.

Currently, our is-en bidix contains 21,717 SL lemmata and 1,491 additional translations to be used for lexical selection.

4.3 Transfer rules

The syntactic (structural) transfer stage (see Figure 1) in the translator is split into four stages.

The first stage (chunker) performs local reordering and chunking. The second (interchunk1) produces chunks of chunks, e.g. chunking relative clauses into noun phrases. The third (interchunk2) performs longer distance reordering, e.g. constituent reordering, and some tense changes. As an example of a tense change, consider: Hann vildi að verðlaunin færu til þeirra → 'He wanted that the awards went to them' → 'He wanted the awards to go to them'. Finally, the fourth stage (postchunk) does some cleanup operations, and insertion of the indefinite article.

There are 78 rules in the first stage, the majority dealing with noun phrases, 3 rules in the second, 26 rules in the third stage and 5 rules in the fourth stage.

It is worth noting that the development of the bilingual dictionary and the transfer rules benefits both the Apertium-IceNLP system and the is-en system based solely on Apertium modules.

5 Evaluation

Our goal was to evaluate approximately 5,000 words, which corresponds to roughly 10 pages of text, and compare our results to two other publicly available is-en MT systems: Google Translate, an SMT system, and Tungutorg, a proprietary rule-based MT system, developed by an individual. In addition, we sought a comparison to the is-en system based solely on Apertium modules.

The test corpus for the evaluation was extracted from a dump of the Icelandic Wikipedia on April 24th 2010, which provided 187,906 lines of SL text. The reason for choosing texts from Wikipedia is that the evaluation material can be distributed, which is not the case for other available corpora of Icelandic.

Then, 1,000 lines were randomly selected from the test corpus and the resulting file filtered semi-automatically such that: i) each line had only one complete sentence; ii) each sentence had more than three words; iii) each sentence had zero or one lower case unknown word (we want to test the transfer, not the coverage of the dictionaries); iv) lines that were clearly metadata and traceable to individuals were removed, e.g. user names; v) lines that contained incoherent strings of numbers were removed, e.g. from a table entry; vi) lines containing non-Latin alphabet characters were removed, e.g. if they contained Greek or [...] font; vii) lines that contained extremely domain specific and/or archaic words were removed (e.g. words that our human translator did not know how to translate); and viii) repetitive lines, e.g. multiple lines of the same format from a list, were removed.

After this filtering process, 397 sentences remained which were then run through the four MT systems. In order to calculate evaluation metrics (see below), each of the four output files had to be post-edited. A bilingual human posteditor reviewed each TL sentence, copied it and then made minimal corrections to the copied sentence so that it would be suitable for dissemination, meaning that the sentence needs to be as close to grammatically correct as possible so that post-editing requires less effort.

The translation quality was measured using two metrics: word error rate (WER), and position-independent word error rate (PER). The WER is the percentage of the TL words that require correction, i.e. substitutions, deletions and insertions. PER is similar to WER except that PER does not penalise correct words in incorrect positions. Both metrics are based on the well known Levenshtein distance and were calculated for each of the sentences using the apertium-eval-translator tool (http://wiki.apertium.org/wiki/Evaluation). Metrics based on word error rate were chosen so as to be able to compare the system against other Apertium systems and to assess the usefulness of the system in real settings, i.e. of translating for dissemination.

Note that, in our case, the WER and PER scores are computed based on the difference between the system output and a post-edited version of the system output. As can be seen in Table 1, the WER and PER for our Apertium-IceNLP prototype are 50.6% and 40.8%, respectively. This may seem quite high, but looking at the translation quality statistics for some of the other language pairs in Apertium (http://wiki.apertium.org/wiki/Translation_quality_statistics), we see that the WER for Norsk Bokmål-Nynorsk is 17.7%, for Swedish-Danish 30.3%, for Breton-French 38.0%, for Welsh-English 55.7%, and for Basque-Spanish 72.4%. It is worth noting however that each of these evaluations had slightly different requirements for source language sentences. For instance, the Swedish-Danish pair allowed any number of unknown words.

Translator          WER      PER
Apertium-IceNLP     50.6%    40.8%
Apertium            45.9%    38.2%
Tungutorg           44.4%    33.7%
Google Translate    36.5%    28.7%

Table 1: Word error rate (WER) and position-independent word error rate (PER) over the test sentences for the publicly available is-en machine translation systems.

We expected that the translation quality of Apertium-IceNLP would be significantly lower than that of both Google Translate and Tungutorg, and the results in Table 1 confirm this expectation. The reason for our expectation was that the development time of our system was relatively short (8 man-months), whereas Tungutorg, for example, has been developed intermittently over a period of two decades.

Unexpectedly, the error rates of Apertium-IceNLP are also higher than the error rates of a system based solely on Apertium modules (see row "Apertium" in Table 1). We will discuss reasons for this and future work to improve the translation quality in the next section.

6 Discussion and future work

In order to determine where to concentrate efforts towards improving the translation quality of Apertium-IceNLP, some error analysis was carried out on a development data set. This development data was collected from the largest Icelandic online newspaper mbl.is into 1728 SL files and then translated by the system into TL files. Subsequently, 50 files from the pool were randomly selected for manual review and categorisation of errors.

The error categories were created along the way, resulting in a total of 6 error categories to identify where it would be most beneficial to make improvements. Analysis of the error categories showed that 60.7% of the errors were due to words missing from the bidix, mostly proper nouns and compound words (see Table 2). This analysis suggests that improvement to the translation quality can be achieved by concentrating on adding proper nouns to the bidix, on the one hand, and resolving compound words, on the other.

Error category                 Freq.    %
Missing from the bidix         912      60.7%
Need further analysis          414      27.5%
Multiword expressions          90       6.0%
Abbreviations and initials     31       2.1%
More sophisticated patterns    31       2.1%
Other                          24       1.6%
Total                          1502     100%

Table 2: Error categories and corresponding frequencies.

One possible explanation for the lower error rates for the "pure" Apertium version than the Apertium-IceNLP system is the handling of MWEs. MWEs most often do not translate literally nor even to the same number of words, which can dramatically increase the error rate. The pure version translates unlimited lengths of MWEs as single units and can deal with MWEs that contain inflectional words. In contrast, the length of the MWEs in IceNLP (and consequently also in Apertium-IceNLP) is limited to trigrams and, furthermore, IceNLP cannot deal with inflectional MWEs.

The additional work required to get a better translation quality out of the Apertium-IceNLP system than a pure Apertium system raises the question as to whether "less is more", i.e. whether instead of incorporating tokenisation, morphological analysis, lemmatisation and PoS tagging from IceNLP into the Apertium pipeline, it may produce better results to only use IceTagger for PoS tagging but rely on Apertium for the other tasks. As discussed in Section 3.1, IceTagger outperforms an HMM tagger such as the one used by the Apertium pipeline.

In order to replace only the PoS tagger in the Apertium pipeline, some modifications will have to be made to IceTagger. In addition to the modifications already carried out to make IceTagger return output in Apertium style format (see Section 4.1.1), the tagger will also have to be able to take Apertium style formatted input. More specifically, instead of relying on IceMorphy and Lemmald for morphological analysis and lemmatisation, IceTagger would have to be changed to receive the necessary information from the morphological component of Apertium.

7 Conclusion

We have described the development of Apertium-IceNLP, an Icelandic → English (is-en) MT system based on the Apertium platform and IceNLP, an NLP toolkit for Icelandic. Apertium-IceNLP is a hybrid system, the first system in which the whole morphological and tagging component of Apertium is replaced by modules from an external system.

Our system is a prototype with about 8 man-months of development work. Evaluation, based on word error rate, shows that our prototype does not perform as well as two other available is-en systems, Google Translate and Tungutorg. This was expected and can mainly be explained by two factors. First, our system has been developed over a short time. Second, our system makes systematic errors that we intend to fix in future work.

Contrary to our expectations, the Apertium-IceNLP system also performs worse than the is-en system based solely on Apertium modules. We conjecture that this is mainly due to the fact that the Apertium-IceNLP system does not handle MWEs adequately, whereas the handling of MWEs is an integrated part of the Apertium morphological analyser. Therefore, we expect that better translation quality may be achieved by replacing only the tagging component of Apertium with the corresponding module in IceNLP, but leaving morphological analysis to Apertium. This conjecture will be verified in future work.

Acknowledgments

The work described in this paper has been supported by: i) The Icelandic Research Fund, project "Viable Language Technology beyond English – Icelandic as a test case", grant no. 090662012; and ii) The NILS mobility project (The Abel Predoc Research Grant), coordinated by Universidad Complutense de Madrid.

References

Brants, Thorsten. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, WA, USA.

Dredze, Mark and Joel Wallenberg. 2008. Icelandic Data Driven Part of Speech Tagging. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, OH, USA.

Forcada, Mikel L., Francis M. Tyers, and Gema Ramírez-Sánchez. 2009. The Apertium machine translation platform: Five years on. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, Spain.

Helgadóttir, Sigrún. 2005. Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In Holmboe, H., editor, Nordisk Sprogteknologi 2004, pages 257–265. Museum Tusculanums Forlag, Copenhagen.

Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson, and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using Hierarchy of Linguistic Identities (HOLI). In Nordström, B. and A. Ranta, editors, Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings, Gothenburg, Sweden.

Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, and Arto Anttila. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin.

Loftsson, Hrafn and Eiríkur Rögnvaldsson. 2007a. IceParser: An Incremental Finite-State Parser for Icelandic. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NoDaLiDa 2007), Tartu, Estonia.

Loftsson, Hrafn and Eiríkur Rögnvaldsson. 2007b. IceNLP: A Natural Language Processing Toolkit for Icelandic. In Proceedings of Interspeech 2007, Special Session: "Speech and language technology for less-resourced languages", Antwerp, Belgium.

Loftsson, Hrafn, Ida Kramarczyk, Sigrún Helgadóttir, and Eiríkur Rögnvaldsson. 2009. Improving the PoS tagging accuracy of Icelandic text. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NoDaLiDa 2009), Odense, Denmark.

Loftsson, Hrafn. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics, 31(1):47–72.

Nordfalk, Jacob. 2009. Shallow-transfer rule-based machine translation for Swedish to Danish. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, Spain.

Pind, Jörgen, Friðrik Magnússon, and Stefán Briem. 1991. Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik.

Rögnvaldsson, Eiríkur, Hrafn Loftsson, Kristín Bjarnadóttir, Sigrún Helgadóttir, Anna B. Nikulásdóttir, Matthew Whelpton, and Anton K. Ingason. 2009. Icelandic Language Resources and Technology: Status and Prospects. In Domeij, R., K. Koskenniemi, S. Krauwer, B. Maegaard, E. Rögnvaldsson, and K. de Smedt, editors, Proceedings of the NoDaLiDa 2009 Workshop 'Nordic Perspectives on the CLARIN Infrastructure of Language Resources', Odense, Denmark.

Tyers, Francis M. and Kevin Donnelly. 2009. apertium-cy - a collaboratively-developed free RBMT system for Welsh to English. Prague Bulletin of Mathematical Linguistics, 91:57–66.
