Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine System

Eckhard Bick Lars Nygaard Institute of Language and The Text Laboratory Communication University of Southern University of Oslo Odense, Denmark Oslo, Norway [email protected] [email protected]

Abstract running, mixed domain text. Also, some languages, like English, German and This paper presents a rule-based Japanese, are more equal than others, Norwegian-English MT system. not least in a funding-heavy Exploiting the closeness of environment like MT. Norwegian and Danish, and the The focus of this paper will be existence of a well-performing threefold: Firstly, the system presented Danish-English system, Danish is here is targeting one of the small, used as an «interlingua». «unequal» languages, Norwegian. Structural analysis and polysemy Secondly, the method used to create a resolution are based on Constraint Norwegian-English translator, is Grammar (CG) function tags and ressource-economical in that it uses dependency structures. We another, very similar language, Danish, describe the semiautomatic as an «interlingua» in the sense of construction of the necessary translation knowledge recycling (Paul Norwegian-Danish dictionary and 2001), but with the recycling step at the evaluate the method used as well SL side rather than the TL side. Thirdly, as the coverage of the lexicon. we will discuss an unusual analysis and transfer methodology based on Constraint Grammar dependency 1 Introduction . In short, we set out to construct a Norwegian-English MT Machine translation (MT) is no longer system by building a smaller, an unpractical science. Especially the Norwegian-Danish one and piping its advent of corpora with hundreds of output into an existing Danish deep millions of words and advanced parser (DanGram, Bick 2003) and an machine learning techniques, bilingual existing, robust Danish-English MT electronic data and advanced machine system (Dan2Eng, Bick 2006 and 2007). learning techniques have fueled a torrent of MT-project for a large number 2 The MT system of language pairs. However, the potentially most powerful, deep rule- The Bokmål standard variety of based approaches still struggle, for Norwegian is a language historically so most languages, with a serious close to Danish, that speakers of one coverage problem when used on language can understand texts in the

Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit (Eds.) NODALIDA 2007 Conference Proceedings, pp. 21–28 Eckhard Bick and Lars Nygaard

other without prior training - though the 2.2 The Norwegian-Danish lexicon same does not necessarily hold for the spoken varieties. It is therefore a less The complexity of a Norwegian-Danish challenging task to create a Norwegian- dictionary can be compared to Spanish- Danish MT system than a Norwegian- Catalan language pair addressed in the English or even Norwegian-Japanese open source Apertium MT project one. Furthermore, syntactic differences (Corbí-Bellot et al. 2005), where a 1-to-1 are so few, that lexical transfer can to a lexicon was deemed sufficient (with a large degree be handled at the word few polysemous cases handled as multi- level with only part of speech (PoS) word expressions), avoiding the disambiguation and no syntactic disambiguation complexity of many-to- disambiguation, allowing us to depend many lexica necessary for less-related on the Danish parser to provide a deep languages. Even without extensive structural analysis. Furthermore, the polysemy mismatches, the productive polysemy spectrum of many Bokmål compounding nature of Scandinavian words closely matches the semantics of languages, however, increases lexical the corresponding Danish word, so complexity as compared to Romance different English translation equivalents languages - an issue reflected in the can be chosen using Danish context- transfer evaluation in chapter 2.3. based discriminators. In a project with virtually zero funding, like ours, it can be difficult to 2.1 Norwegian analysis build or buy a lexicon, not to mention the general lack of wide-coverage As a first step of analysis, we use the Norwegian-Danish electronic lexica to Oslo-Bergen Tagger (Hagen et al. 2000) begin with. So with only a few thousand to provide lemma disambiguation and words from terminology lists or the like PoS tagging, the idea being to translate available, creative methods had to be results into Danish, using a large employed, and we opted for a bilingual lexicon, and feed them into the bootstrapping system with the following syntactic and dependency stages of the steps: DanGram parser. However, though both (a) Create a large corpus of the OBT tagger and DanGram adhere to monolingual - Norwegian text and the Constraint Grammar (CG) formalism lemmatize it automatically. Quality was (Karlsson 1990), a number of less important in this step, since descriptive compatibility issues had to frequency measures could be employed be addressed. Since categories could to weed out errors and create a not always be mapped one-to-one, we candidate list of Norwegian lemmas. had to also use the otherwise to-be- (b) Regard Norwegian as misspelled skipped syntactic stage of the OBT Danish, and run a Danish spell checker tagger in order to further disambiguate on the lemma-list obtained from (a). a word's part of speech. Thus, the Assume translation as identical, if the Danish preposition-adverb distinction is Norwegian word is accepted by a underspecified in the Norwegian system Danish spell checker. Use correction where the 2 lexemes have the same suggestions by spell checkers as form, using the preposition tag even suggestions. Because without the presence of a pp. The same differences could be greater than holds for about 50 words that in Danish Levenshtein distance 1 or 2, a special, are regarded as unambiguous adverbs, CG-based spell checker (OrdRet, Bick but in Norwegian as unambiguous 2006) was used, with a particular focus prepositions. on heavy, dyslexic spelling deviations

22 Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System

and a mixed graphical-phonetic alphabetical list were corrected at the approach. same time. (c) Produce phonetic transmutation In order to evaluate our method of rules for Norwegian and Danish spelling lexicon-creation, we extracted all words to generate hypothetical Danish words with frequency 9 - the most frequent from Norwegian candidates, and than group without prior manual revision - check if a word of the relevant word and inspected all suggested translations class was listed in either DanGram's (1544 cases). parsing lexicon or its spell checker fullform list. Methods (a-c) resulted in a list of 226,000 lemmas with translations candidates in Danish. Only 20,000 low- frequency words were completely unmatchable. In a first round of manual type n % revision, all closed-class words, all non-word 33 2.1 % polylexical matches were checked, and wrong PoS 8 0.5 % a confidence value from DanGram's spell checker module was used to grade etymology 161 10.4 % suggestions into safe, unsafe and none. = Next, a compound analyzer was written transpare 6 0.4 and run on all Norwegian words, nt1 accepting compound splits as likely if the resulting parts both individually intranspar 20 1.3 % existed in the word list, finally creating ent2 a Danish translation from the all 187 12.1 % translations of the parts, and checking corrected it and its epenthetic letters against the Danish lexicon. This step not only all 228 14.8 % helped to fill in remaining blanks, but Table 1 was also used to corroborate spell As can be seen from table 1, ignoring checker suggestions as correct, if they the 2.6 % of non-words from the corpus- matched the translation produced by based lemma-list, about 12% of the compound analysis - or replace them, if unrevised translations were wrong. not. After this, 13.800 lemmas had no However, in most of these cases (10.4%, translation, 23.500 lemmas were left over 4/5), the Danish translations were with an «unsafe» marking from the spell still etymologically - and thus spelling- checker stage, and in 20.700 cases, wise - related to their Norwegian compound analysis contradicted spell checker or list suggestions otherwise 1 deemed safe. Allowing overrides in the brise (blæse), spenntak (spændloft), stabbe, strupetak (strubelåg), villastrøk latter case, and removing the two (villakvarter), vårluft (forårsluft) former cases, we were left with a 2 guttete (drengete), havert (slags sæl), bilingual lemma list of 188.500 entries. hengemyr (hængedynd), kraftsektor Finally, a dual pass of manual (energisektor), koring, kvinneyrke, checking was directed at all items with langdryg, låtskriver, lønnsnemnd, a frequency count over 10, malingflekk, omvisning (rundvisning), purke corresponding to about 12.5%. In (so), sauebonde, smokk (sut), strikkegenser, obvious cases, related low-frequency søppelbøtte (affaldsbøtte), tukle (fumle), words in neighbouring positions on the tøyelig (fleksibel), vassdrag (vandløb), yrkesutdanning

23 Eckhard Bick and Lars Nygaard

counterparts, and should thus be former to achieve a part-by-part accessible to improved automatic translation for words not listed in the matching-techniques. bilingual lexicon, the latter to permit assignment of secondary Danish frequen non- wrong correct information (valency, semantics) to cy word PoS ed Danish translations not covered by the DanGram monolingual lexicon. 9 (all) 2.1 % 0.5 % 12.1 % The Norwegian-Danish transfer 5 0.5 % 1.5 % 14 % module was evaluated on 1,000 mixed- genre sentences from the Norwegian 4 4 % 0.5 % 14 % web part of the Leipzig Corpora 3 1 % 0.5 % 10.5 % Collection3 and a 6.500 word chunk 4 2 4 % 3 % 9.5 % from the ECIcorpus . 1 3.5 % 0.5 % 8.5 % average 2.5 % 1.1 % 11.4 % Web Litterature Table 2 words 15,641 6,521 N, ADJ, V, 8,976 3,098 Small checks were also conducted ADV (57.4%) (47.5%) for other frequencies (200 words each), not in noda- 991 (6.3%) 182 (2.8%) randomly extracting 1 out of 10 words. lex Results indicate that automatic compound 458 (2.9%) 78 (1.2%) translatability remains similar in s general, though not in dan- 127 (0.8%) 32 (0.5%) there was a slight correlation between lex falling frequency and less need for correction. The proportion of non-words Table 3 was high for low frequencies, possibly reflecting spelling errors and analysis The failure rate for Norwegian words problems with rare words in the corpus was 6.3% in the web corpus, in part data. However, since having non- compensated by the fact that almost existing words in the SL-list, is only half of these (2.9%) could still be «noise» and not a problem for the MT compound-analyzed. The coverage rate system, we conclude from their of the Danish lexicon was very high - translatability that low-frequency words only 0.8% of suggested translations are at least as safe a contribution to the were not found. Figures for the lexicon as high-frequency words. literature corpus were almost twice as 2.3 Norwegian-Danish transfer good - even when taking into account that the percentage of open-class Analysed input from the Oslo-Bergen- inflecting words was 10 percentage tagger is danified by substituting points lower in this corpus. Danish base forms for Norwegian ones. Even with an extensive bilingual word 2.4 Danish generation list, the transfer program is not, Finally, Danish full-forms are generated however, a mere lookup procedure. Due from the translated base-forms, based to the compounding structure of the languages involved, compound analysis 3 http://corpora.uni-leipzig.de has to be performed both on the 4 European Corpus Initiative, Norwegian and the Danish side - the http://www.elsnet.org/resources/eciCorpus.html

24 Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System

both on the filtered OBT morphological question (described in Bick 2005) tag string, and inflexional information consists of a few hundred rules from the Danish lexicon. targeting CG function tags, supported "[hus] N NEU S DEF GEN", for by attachment direction markers and instance, will be inflected as hus -> close/long-attachment markers from a NEU DEF huset -> GEN husets. special CG layer run as a last step Irregular forms are stored in full in a before dependency. separate file, and compound stems are constructed, prior to inflexion, using 2.6 Danish-English transfer rules for the insertion of epenthetic s or epenthetic e. Though the Danish-English MT system (Dan2eng, fBick 2007) is not the focus (1) agurk+tid -> agurketid of this paper, and used as is in a black (2) forbud+stat -> forbudsstat box fashion, a short description is in Alas, Danish and Norwegian order - not least because of the are not completely perspective of ultimately creating a isomorphic, and in order to handle similar system for direct Norwegian- differences in a context-dependent way, English transfer. a special CG grammar is run before The core principle of Dan2eng is to generation. This grammar handles, for rely as much as possible on deep and instance, the Norwegian phenomenon accurate SL analysis. In this spirit, the of double definiteness: selection of translation equivalents is based on lexical transfer rules (3) NOR: den store bilen -> DAN: exploiting syntactic relations in a den store bil semanticised way. The way in which Here, so-called substitution rules are Dan2eng semanticizes , differs used, replacing the tag DEF with IDF in significantly from many older rule- the presence of definite articles based MT systems designed in the 80's (example below) or pre- or post- and 90's. First, it uses dependency positioned determiners and attributes rather than constituent analyses, and (syntactic tags @, @DET>): be based on Constraint Grammar, a combination that provides it with a SUBSTITUTE (DEF) (IDF) TARGET robust way of progressing from shallow (N) to deep analyses (Bick 2005) without IF (*-1 ART BARRIER NON-PRE- the high percentage of parse failures N/ADV) ; inherent to many generative systems when run on free text5. 2.5 Structural analysis As an example, let us have a look at the translation spectrum Danish verb at Syntactic-functional analysis was based regne (to rain), which has many other, not on the Norwegian OBT-analysis, but non-meteorological, meanings on a from-scratch analysis of the (calculate, consider, expect, convert ...) translated Danish text, in part because as well. Here, Dan2eng simply uses of the high syntactic accuracy of the grammatical distinctors to distinguish Danish parser (Bick 2000), in part to between translations, rather than ensure compatibility with the define sub-senses. descriptive conventions used in the next syntactic stage, dependency analysis, 5 Even today, MT systems using deep syntax, may and the Danish-English MT system find it cautious to restrict their domain or itself. The in structural scope, like the LFG- and HPSG-based LOGON system (Lønning et al. 2004).

25 Eckhard Bick and Lars Nygaard

Thus, the translation rain (a) is regular expressions from word or base chosen if a daughter/dependent (D) forms, or exploit DanGram's semantic exists with the function of prototype tags in a systematic way, e.g. situative/formal subject (@S-SUBJ), , , , while most other meanings ask for a etc. for nouns (160 types in all). human subject. As a default6 translation Adjectives and verbs have fewer classes for the latter calculate (f) is chosen, but (e.g. psychological adjective, move, the presence of other dependents speech or cognitive verbs), but make up (objects or particles) may trigger other for this with a rich annotation of translations. regne med (c-e), for argument/valency tags. instance, will mean include, if med has The rule-based transfer system is been identified as an adverb, while the supplemented by a dictionary of fixed preposition med triggers the expressions and a (so far sentence- translations count on for human based) translation memory. The Danish- «granddaughter» dependents (GD = English bilingual lexicon was built to ), and expect otherwise. Note that match the coverage of the DanGram the include translation also could have lexicon (100.000 words plus 40.000 been conditioned by the presence of an names), but does not yet have the same object (D = @ACC), but would then coverage for compounds. In any case, have to be differentiated from (b), compounds are productive, and regne for (‘consider’). therefore covered by a special back-up module that combines part-translations, regne_V7 affix-translations. Rules may be used to (a) D=(@S-SUBJ) :rain; force a different translation for a (b) D=( @ACC) D=("for" PRP)_nil lexeme if used as first or second part in :consider; compounds, e.g. FN-styrke, where (c) D=("med" PRP)_on GD=() :count; styrke should be 'force', not 'strength'. (d) D=("med" PRP)_on :expect; (e) D=(@ACC) D=("med" ADV)_nil The compound module is doubly :include; important for our Nor2eng interlingua (f) D=( @SUBJ) D?=("på" PRP)_nil approach, since secondary Danish :calculate; lookup-failures may be caused by Norwegian lookup-failures. The example shows how information from different descriptive layers is integrated in the transfer rules. 2.7 English generation and syntax Structural conditions may either be English generation is handled much like expressed in n-gram fashion (with P+n Danish generation, drawing on CG or P-n) positions, or dependency fashion morphological tags, a lexicon of (reference to daughters, mothers, irregular forms and some granddaughters and grandmothers phonetic/stress heuristics to inflect independent of distance). Semantic translated base forms - again supported conditions can either be inferred with by a special CG layer performing systematic substitutions (for instance 6 The ordering of differentiator-translation pairs plural translations of singular words) is important - readings with fewer restrictions have to come last. The example lacks the and insertions (certain modals, or general, differentiator-free default provided with articles). Differences in syntax are all real lexicon entries. handled by successive transformation 7 The full list of differentiators for this verb rules, which may move either words or contains 13 cases, including several prepositional complements not included here whole dependency tree sections if (regne efter, blandt, fra, om, sammen, ud, fejl ...)

26 Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System

certain tags, tokens or sequences are The necessary dependency links are found. marked in the Danish interlingua In the following example, two sentence. movement rules were applied. The first changes the Scandinavian VS order into SV after a filled front field, placing the fronted adverbial between S and V. The Illustration 1 other rule, classifying the adverbial, decides on a better place for it - lex between auxiliary and main verb. Norwegian OB­tagger NOR: På 1980-tallet ble sammenhengen text (CG) mellom sosiale faktorer og helse i stor VISL grad avskrevet. adapt lex Adapt. DAN: Nor2dan CG I PRP @ADVL #1->13 1980'erne N @P< #2->1 ­ transfer blev V @STA ­generation #3->0 lex sammenhængen N @SUBJ Danish DanGram #4->3 text * morphology mellem PRP @N< #5->4 * disambig.CG sociale ADJ @>N #6->7 * syntax­CG faktorer N @P< Dependency #7->5 lexAdapt. grammar og KC @CO Dan2eng CG #8->7 helse N @P< #9->7 ­ transfer i PRP @ADVL #10- ­ generation >13 stor ADJ @>N #11- English Statistical >12 text smoothing grad N @P< #12- >10 afskrevet V @AUX< #13- >3. 3 Perspectives: Statistical smoothing ENG: In the 1980s the connexion between social factors and health was In spite of the fact that Dan2Eng largely written off. employs tens of thousands of hand- written lexical transfer rules, it is extremely difficult to cover all Note also the fact, that the idiosyncrasies of, for instance, preposition change is a difference preposition usage or choice of synonym between Norwegian and Danish, not in a rule based way. Furthermore, between Danish and English, and that mismatches are more likely when the subject movement acted on the chaining two translations. On the other whole NP, including its dependent PP, hand, statistical methods allow to check which again contained a coordination. the probabilities of rule-suggested

27 Eckhard Bick and Lars Nygaard

translations in a given context, Theory, Barcelona, December 9th - 10th, smoothing out translational rough 2005), pp.19-27 spots. Given the lack of large bilingual Bick, Eckhard. 2006. «A Constraint Norwegian-Danish or Norwegian- Grammar Based Spellchecker for Danish English corpora, it is an added with a Special Focus on Dyslexics». In: advantage, that such methods work Suominen, Mickael et al. (ed.) A Man of with monolingual, target language Measure: Festschrift in Honour of Fred corpora - of which there are almost Karlsson on his 60th Birthday. Special unlimited amounts availabe in the case Supplement to SKY Jounal of Linguistics, of English. To prepare for an integration Vol. 19 (ISSN 1796-279X), pp. 387-396. Turku: The Linguistic Association of of TL smoothing, we performed Finland dependency annotation of 1 billion words, and started extracting n-gram Bick, Eckhard. 2007. «Fra syntaks til information as well as what we call dep- semantik: Polysemiresolution igennem grams - hierarchical chains of dependensstrukturer i dansk-engelsk maskinoversættelse.» (forthcoming) dependency-linked words, the former with the perspective of preposition- Corbí-Bello, Antonio M. et al. 2005. An smoothing, the latter for argument- open-source shallow-transfer machine smoothing. translation engine for the Romance Future evaluations, to be conducted Languages of Spain. In Proceedings of the European Association for Machine after a more complete revision of the Translation, 10th Annual Conference, Norwegian bilingual lexicon and the Budapest 2005, p. 79-86. construction of a polysemy-sensitive Norwegian-Danish transfer grammar, Hagen, Kristin, Johannessen, Janne Bondi, will have to address not only the overall Nøklestad, Anders. 2000. "A Constraint- quality of the MT system as a whole - Based Tagger for Norwegian". In: Lindberg, C.-E. and Lund, S.N. (red.): optimally in comparison with other 17th Scandinavian Conference of systems, like LOGON (Lønning et al. Linguistic, Odense. Odense Working 2004) -, but also the relative Papers in Language and Communication, contributions of rule based and No. 19, vol I. statistical modules. Karlsson, Fred. 1990. Constraint Grammar as a Framework for Parsing Running Text. References In: Karlgren, Hans (ed.), COLING-90 Bick, Eckhard. 2001. «En Constraint Helsinki: Proceedings of the 13th Grammar Parser for Dansk», in Peter International. Conference on Widell & Mette Kunøe (eds.), 8. Møde om Computational Linguistics, Vol. 3, pp. Udforskningen af Dansk Sprog, 12.-13. 168-173 oktober 2000, pp. 40-50, Århus University Lønning, Jan Tore, Stephan Oepen, Bick, Eckhard. 2003, «A CG & PSG Hybrid Dorothee Beermann, Lars Hellan, John Approach to Automatic Corpus Carroll, Helge Dyvik, Dan Flickinger, Annotation», In: Kiril Simow & Petya Janne Bondi Johannessen, Paul Meurer, Osenova (eds.), Proceedings of Torbjørn Nordgård, Victoria Rosén, and SProLaC2003 (at Erik Velldal. 2004. LOGON. A Norwegian 2003, Lancaster), pp. 1-12 MT effort. In Proceedings of the Workshop in Recent Advances in Bick, Eckhard. 2005 «Turning Constraint Scandinavian Machine Translation, Grammar Data into Running Dependency Uppsala, Sweden, », In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia Paul, Michael. 2001. Knowledge Recycling (red.), Proceedings of TLT 2005 (4th for Related Languages. Proceedings of Workshop on Treebanks and Linguistic MT Summit VIII. Santiago de Compostela, Spain. pp. 265-269.

28