Using Danish As a CG Interlingua: a Wide-Coverage Norwegian-English Machine Translation System

Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System Eckhard Bick Lars Nygaard Institute of Language and The Text Laboratory Communication University of Southern Denmark University of Oslo Odense, Denmark Oslo, Norway [email protected] [email protected] Abstract running, mixed domain text. Also, some languages, like English, German and This paper presents a rule-based Japanese, are more equal than others, Norwegian-English MT system. not least in a funding-heavy Exploiting the closeness of environment like MT. Norwegian and Danish, and the The focus of this paper will be existence of a well-performing threefold: Firstly, the system presented Danish-English system, Danish is here is targeting one of the small, used as an «interlingua». «unequal» languages, Norwegian. Structural analysis and polysemy Secondly, the method used to create a resolution are based on Constraint Norwegian-English translator, is Grammar (CG) function tags and ressource-economical in that it uses dependency structures. We another, very similar language, Danish, describe the semiautomatic as an «interlingua» in the sense of construction of the necessary translation knowledge recycling (Paul Norwegian-Danish dictionary and 2001), but with the recycling step at the evaluate the method used as well SL side rather than the TL side. Thirdly, as the coverage of the lexicon. we will discuss an unusual analysis and transfer methodology based on Constraint Grammar dependency 1 Introduction parsing. In short, we set out to construct a Norwegian-English MT Machine translation (MT) is no longer system by building a smaller, an unpractical science. Especially the Norwegian-Danish one and piping its advent of corpora with hundreds of output into an existing Danish deep millions of words and advanced parser (DanGram, Bick 2003) and an machine learning techniques, bilingual existing, robust Danish-English MT electronic data and advanced machine system (Dan2Eng, Bick 2006 and 2007). learning techniques have fueled a torrent of MT-project for a large number 2 The MT system of language pairs. However, the potentially most powerful, deep rule- The Bokmål standard variety of based approaches still struggle, for Norwegian is a language historically so most languages, with a serious close to Danish, that speakers of one coverage problem when used on language can understand texts in the Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit (Eds.) NODALIDA 2007 Conference Proceedings, pp. 21–28 Eckhard Bick and Lars Nygaard other without prior training - though the 2.2 The Norwegian-Danish lexicon same does not necessarily hold for the spoken varieties. It is therefore a less The complexity of a Norwegian-Danish challenging task to create a Norwegian- dictionary can be compared to Spanish- Danish MT system than a Norwegian- Catalan language pair addressed in the English or even Norwegian-Japanese open source Apertium MT project one. Furthermore, syntactic differences (Corbí-Bellot et al. 2005), where a 1-to-1 are so few, that lexical transfer can to a lexicon was deemed sufficient (with a large degree be handled at the word few polysemous cases handled as multi- level with only part of speech (PoS) word expressions), avoiding the disambiguation and no syntactic disambiguation complexity of many-to- disambiguation, allowing us to depend many lexica necessary for less-related on the Danish parser to provide a deep languages. Even without extensive structural analysis. Furthermore, the polysemy mismatches, the productive polysemy spectrum of many Bokmål compounding nature of Scandinavian words closely matches the semantics of languages, however, increases lexical the corresponding Danish word, so complexity as compared to Romance different English translation equivalents languages - an issue reflected in the can be chosen using Danish context- transfer evaluation in chapter 2.3. based discriminators. In a project with virtually zero funding, like ours, it can be difficult to 2.1 Norwegian analysis build or buy a lexicon, not to mention the general lack of wide-coverage As a first step of analysis, we use the Norwegian-Danish electronic lexica to Oslo-Bergen Tagger (Hagen et al. 2000) begin with. So with only a few thousand to provide lemma disambiguation and words from terminology lists or the like PoS tagging, the idea being to translate available, creative methods had to be results into Danish, using a large employed, and we opted for a bilingual lexicon, and feed them into the bootstrapping system with the following syntactic and dependency stages of the steps: DanGram parser. However, though both (a) Create a large corpus of the OBT tagger and DanGram adhere to monolingual - Norwegian text and the Constraint Grammar (CG) formalism lemmatize it automatically. Quality was (Karlsson 1990), a number of less important in this step, since descriptive compatibility issues had to frequency measures could be employed be addressed. Since categories could to weed out errors and create a not always be mapped one-to-one, we candidate list of Norwegian lemmas. had to also use the otherwise to-be- (b) Regard Norwegian as misspelled skipped syntactic stage of the OBT Danish, and run a Danish spell checker tagger in order to further disambiguate on the lemma-list obtained from (a). a word's part of speech. Thus, the Assume translation as identical, if the Danish preposition-adverb distinction is Norwegian word is accepted by a underspecified in the Norwegian system Danish spell checker. Use correction where the 2 lexemes have the same suggestions by spell checkers as form, using the preposition tag even translations suggestions. Because without the presence of a pp. The same differences could be greater than holds for about 50 words that in Danish Levenshtein distance 1 or 2, a special, are regarded as unambiguous adverbs, CG-based spell checker (OrdRet, Bick but in Norwegian as unambiguous 2006) was used, with a particular focus prepositions. on heavy, dyslexic spelling deviations 22 Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System and a mixed graphical-phonetic alphabetical list were corrected at the approach. same time. (c) Produce phonetic transmutation In order to evaluate our method of rules for Norwegian and Danish spelling lexicon-creation, we extracted all words to generate hypothetical Danish words with frequency 9 - the most frequent from Norwegian candidates, and than group without prior manual revision - check if a word of the relevant word and inspected all suggested translations class was listed in either DanGram's (1544 cases). parsing lexicon or its spell checker fullform list. Methods (a-c) resulted in a list of 226,000 lemmas with translations candidates in Danish. Only 20,000 low- frequency words were completely unmatchable. In a first round of manual type n % revision, all closed-class words, all non-word 33 2.1 % polylexical matches were checked, and wrong PoS 8 0.5 % a confidence value from DanGram's spell checker module was used to grade etymology 161 10.4 % suggestions into safe, unsafe and none. = Next, a compound analyzer was written transpare 6 0.4 and run on all Norwegian words, nt1 accepting compound splits as likely if the resulting parts both individually intranspar 20 1.3 % existed in the word list, finally creating ent2 a Danish translation from the all 187 12.1 % translations of the parts, and checking corrected it and its epenthetic letters against the Danish lexicon. This step not only all 228 14.8 % helped to fill in remaining blanks, but Table 1 was also used to corroborate spell As can be seen from table 1, ignoring checker suggestions as correct, if they the 2.6 % of non-words from the corpus- matched the translation produced by based lemma-list, about 12% of the compound analysis - or replace them, if unrevised translations were wrong. not. After this, 13.800 lemmas had no However, in most of these cases (10.4%, translation, 23.500 lemmas were left over 4/5), the Danish translations were with an «unsafe» marking from the spell still etymologically - and thus spelling- checker stage, and in 20.700 cases, wise - related to their Norwegian compound analysis contradicted spell checker or list suggestions otherwise 1 deemed safe. Allowing overrides in the brise (blæse), spenntak (spændloft), stabbe, strupetak (strubelåg), villastrøk latter case, and removing the two (villakvarter), vårluft (forårsluft) former cases, we were left with a 2 guttete (drengete), havert (slags sæl), bilingual lemma list of 188.500 entries. hengemyr (hængedynd), kraftsektor Finally, a dual pass of manual (energisektor), koring, kvinneyrke, checking was directed at all items with langdryg, låtskriver, lønnsnemnd, a frequency count over 10, malingflekk, omvisning (rundvisning), purke corresponding to about 12.5%. In (so), sauebonde, smokk (sut), strikkegenser, obvious cases, related low-frequency søppelbøtte (affaldsbøtte), tukle (fumle), words in neighbouring positions on the tøyelig (fleksibel), vassdrag (vandløb), yrkesutdanning 23 Eckhard Bick and Lars Nygaard counterparts, and should thus be former to achieve a part-by-part accessible to improved automatic translation for words not listed in the matching-techniques. bilingual lexicon, the latter to permit assignment of secondary Danish frequen non- wrong correct information (valency, semantics) to cy word PoS ed Danish translations not covered by the DanGram monolingual lexicon. 9 (all) 2.1 % 0.5 % 12.1 % The Norwegian-Danish transfer 5 0.5 % 1.5 % 14 % module was evaluated on 1,000 mixed- genre sentences from the Norwegian 4 4 % 0.5 % 14 % web part of the Leipzig Corpora 3 1 % 0.5 % 10.5 % Collection3 and a 6.500 word chunk 4 2 4 % 3 % 9.5 % from the ECIcorpus . 1 3.5 % 0.5 % 8.5 % average 2.5 % 1.1 % 11.4 % Web Litterature Table 2 words 15,641 6,521 N, ADJ, V, 8,976 3,098 Small checks were also conducted ADV (57.4%) (47.5%) for other frequencies (200 words each), not in noda- 991 (6.3%) 182 (2.8%) randomly extracting 1 out of 10 words.

Using Danish As a CG Interlingua: a Wide-Coverage Norwegian-English Machine Translation System

Using Constraint Grammar for Treebank Retokenization

Floresta Sinti(C)Tica : a Treebank for Portuguese

Instructions for Preparing LREC 2006 Proceedings

17Th Nordic Conference of Computational Linguistics (NODALIDA

Frag, a Hybrid Constraint Grammar Parser for French

Prescriptive Infinitives in the Modern North Germanic Languages: An

The VISL System

A Morphological Lexicon of Esperanto with Morpheme Frequencies

Degrees of Orality in Speechlike Corpora

Arborest – a Growing Treebank of Estonian

The English Wikipedia in Esperanto

The Nordic Dialect Corpus – an Advanced Research Tool