Developing Prototypes for Machine Translation Between Two Sámi Languages
Francis M. Tyers
Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

Linda Wiechetek
Giellatekno, Universitetet i Tromsø, Norway

Trond Trosterud
Giellatekno, Universitetet i Tromsø, Norway

© 2009 European Association for Machine Translation. Proceedings of the 13th Annual Conference of the EAMT, pages 120–127, Barcelona, May 2009.

Abstract

This paper describes the development of two prototype systems for machine translation between North Sámi and Lule Sámi. Experiments were conducted in rule-based machine translation (RBMT), using the Apertium platform, and statistical machine translation (SMT), using the Moses decoder. The experiments show that both approaches have their advantages and disadvantages, and that they can both make use of pre-existing linguistic resources.

1 Introduction

In this paper we describe the development of two prototype machine translation systems between two Sámi languages, North Sámi (sme) and Lule Sámi (smj): one rule-based (Apertium) and one statistical (Moses). Other systems have been developed with marginalised languages in mind (e.g. Lavie, 2008), but, as of writing, these were not available under an open-source licence and thus could not be applied to the task at hand.

The paper is organised as follows. The first section gives a general overview of the languages in question and sketches a typology of MT scenarios for minority languages. The next sections describe the two machine translation strategies in some detail and outline how the existing language technology could be re-used and integrated. We follow this with a short evaluation, some discussion, and future work.

1.1 The languages

Both North Sámi and Lule Sámi belong to the Finno-Ugric language family and are spoken in the north of Norway and Sweden; North Sámi is also spoken in Finland. North Sámi has between 15,000 and 25,000 speakers, while Lule Sámi has fewer than 2,000 speakers.

The Sámi proto-language was originally an agglutinative language, but North and Lule Sámi have developed features known from inflective languages (case/number combinations are often expressed by one suffix only, certain morphological distinctions are expressed only by means of consonant gradation, i.e. a non-segmental process, etc.).

The main objective in developing the prototype rule-based system was to evaluate how well existing resources could be re-used, and whether the shallow-transfer approach was suited to languages with more agglutinative typologies.

1.2 A typology of MT systems for minority languages

Minority language speakers typically differ from the majority in being bilingual: the minority speaks the language of the majority, but not vice versa. This has some implications for the requirements society will place on machine translation systems.

A majority to minority language system must be of high quality, so high that post-editing the output is faster than translating from scratch. The goal is to produce well-formed text, not to understand the content, since the minority language users will prefer the original to a bad translation. A minority to majority language system, on the other hand, will be useful even as a gist system, answering vital questions such as “what are they writing about me in the minority language newspaper?”.

The systems presented here are minority to minority language systems. North and Lule Sámi are mutually intelligible, so in this context, too, a gist system is of little interest. The importance of the system lies in its ability to produce text. Here, North Sámi is the larger language, possessing close to a full curriculum of school textbooks. A high-quality MT system would help produce the same for Lule Sámi, and would do so more easily from the closely related North Sámi than from Norwegian. The same situation may be found for many language communities.

2 Rule-based machine translation

2.1 Apertium

Apertium is an open-source platform for creating rule-based machine translation systems. It was initially designed for closely-related languages, but has also been adapted to work better for less-related languages. The engine largely follows a shallow-transfer approach. Finite-state transducers (Garrido-Alenda and Forcada, 2002; Roche and Schabes, 1997) are used for lexical processing, first-order hidden Markov models (HMM) are used for part-of-speech tagging, and multi-stage finite-state based chunking is used for structural transfer (Forcada, 2006). The original shallow-transfer Apertium system consists of a de-formatter, a morphological analyser, the categorial disambiguator, the structural and lexical transfer module, the morphological generator, the post-generator and the re-formatter.

2.1.1 Analysis and generation

For the analysis and generation, we used existing finite-state transducers for the two languages.1

1 http://giellatekno.uit.no/

lttoolbox has been widely used to model Romance language morphology, and although it has been used to model the morphology of other languages with complex morphology (e.g. Basque), it is not ideal for these languages. lttoolbox lacks features for dealing with stem-internal variation, diphthong simplification, and compounding.

North Sámi word forms involve consonant gradation, diphthong simplification and compounding. The North Sámi noun guolli (‘fish’) alternates between -ll- (strong stage) and -l- (weak stage). Additionally, one has to deal with diphthong simplification: the diphthong uo changes into the monophthong u in, for example, the accusative plural guliid.

In the Apertium lexicon, guolli is represented as in figure 1. gu represents the stem, the item between <l></l> (liid) is the generated ending, and the items between <r></r> are the analysis, including the lemma and the morphological tags.

    <pardef n="gu/olli__N">
      <e>
        <p>
          <l>liid</l>
          <r>olli<s n="N"/><s n="Pl"/><s n="Acc"/></r>
        </p>
      </e>
      ...
    </pardef>

Figure 1: Section of inflectional paradigm for gu/olli__N

The Divvun and Giellatekno Sámi language technology projects2 use finite-state transducers for the morphological analyser and closed-source finite-state tools from Xerox (Beesley and Karttunen, 2003). These tools handle the two-level morphology model, with twolc (the two-level compiler) for morphophonological analysis, together with the lexical tools, in a single transducer; consonant gradation, diphthong simplification and compounding are handled by two-level rules.

2 To be found at http://www.divvun.no/index.html and http://giellatekno.uit.no/.

Consonant gradation and diphthong simplification of the noun guolli are handled in the following way. guolli is listed in the root lexicon with its lemma and the continuation lexicon AIGI, and is redirected to the sublexicon AIGI.

    guolli AIGI "fish N" ;

    LEXICON AIGI !Bisyll. V-Nouns.
    +N+Sg+Acc:%>X4 K ;
    +N:%>X5 GODII- ; ! weak gr dipth simpl
    ...

From there it is redirected to a further sublexicon, GODII-, which provides the plural accusative analysis.

    LEXICON GODII-
    +Pl+Acc:jd9 K ;

At the same time, a two-level rule handles diphthong simplification when encountering the diacritical mark X5, by removing the second vowel (e o a) in a diphthong (ie uo ea) if the suffix contains an i.

    Vx:0 <=> Vow _ Cns:+ i (...) X5: ;
         where Vx in (e o a) ;

Consonant gradation is handled in another rule, where a consonant (f l m n r s ...) is removed when it follows a vowel and precedes an identical consonant, another vowel, and a weak-grade-triggering diacritical mark (the rule is slightly simplified, as indicated by ...).

    Cx:0 <=> Vow: _ Cy Vow (...) WeG: ;
         where Cx in (f l m n r s ...)
               Cy in (f l m n r s ...)

A general difficulty for generation and analysis is the inconsistency of the tagsets in SL and TL. While verbs are specified with regard to transitivity (V TV, V IV) for North Sámi, they were not specified in the Lule Sámi dictionary (only V). Another matter of choice and convenience is the degree of lexicalisation, as in the case of derived verbs. The North Sámi verb form gohčoduvvo (‘he/she is called’) goes back either to the form gohččut (‘order’) or to gohčodit (‘call, name’), which is derived from gohččut but is to some extent lexicalised.

    gohččut+V+TV+Der1+Der/d+V+Der2+Der/PassL+V+Ind+Prs+Sg3
    gohčodit+V+TV+Der2+Der/PassL+V+Ind+Prs+Sg3
    gåhtjudit+V+TV+Der1+Der/Pass+V+Ind+Prs+Sg3

In Lule Sámi, gåhtjuduvvá only gets the analysis with the lexicalised verb gåhtjudit as a lemma.

    ij     ij+V+Neg+Prs+Sg3
    ittjij ij+V+Neg+Prt+Sg3

For both generation and analysis, this means that one has to find a way to account for the ‘missing’ tag in North Sámi. ‘Missing’ here means the lack of a tag specification for the temps (tempus) variable in the transfer files.

A number of multiword expressions differ from each other in SL and TL. While the North Sámi gii beare (‘whoever’) has inner inflection, the Lule Sámi vajku guhti does not. The initial pronoun gii corresponds to the second component guhti in Lule Sámi.

The last type of generation modification happens in a separate step. Orthographic variants and contractions are handled by the post-generator. The Lule Sámi copula liehket (‘to be’) has three forms for the tag combination liehket+V+Ind+Prs+Sg3: le, la and l. While le and la are interchangeable variants, l is a shortened form of la used after word forms that end in a vowel. The postgeneration lexicon specifies this change and outputs the correct form.

2.1.2 Disambiguation (Constraint Grammar)

Disambiguation of morphological and shallow