Rule-Based Breton to French Machine Translation
Total Page:16
File Type:pdf, Size:1020Kb
[EAMT May 2010 St Raphael, France] Rule-based Breton to French machine translation Francis M. Tyers Departament de Llenguatges i Sistemes Informatics` Universitat d’Alacant E-03070 Alacant [email protected] Abstract Speakers of minority or regional languages typ- ically differ from the majority in being bilingual, This paper describes a rule-based machine speaking both their language, and the language translation system from Breton to French of the majority. In contrast, a majority language intended for producing gisting translations. speaker does not usually speak the minority or re- The paper presents a summary of the ongo- gional language. This has some implications for ing development of the system, along with the requirements society will put on machine trans- an evaluation of two versions, and some lation systems. reflection on the use of MT systems for Applications of machine translation system can lesser-resourced or minority languages. be divided in two main groups: assimilation, that is, to enable a user to understand what the text is 1 Introduction about; and dissemination, that is, to help in the task of translating a text to be published. The require- This paper describes the development of a “gist- ments of either group of applications is different. ing” machine translation system between Breton Assimilation may be possible even when the and French.1 The first section will give a general text is far from being grammatically correct; how- overview of the two languages in question and de- ever, for dissemination, the effort needed to correct scribe the aims for the current system. The subse- (post-edit) the text must not be so high that it is quent sections will describe some of the develop- preferable to translate it manually from scratch. ment work, the current status, an evaluation, and some prospects for future development. A majority to minority language system will mainly be used for dissemination purposes; it must Breton is a Celtic language of the Brythonic therefore be such that post-editing the output is branch which is today largely spoken in Brittany in faster than translating from scratch. Intelligibil- the north-west of France. Historically it was spo- ity is secondary, and only important if it helps the ken to different degrees throughout Brittany, but post-editor. has been losing territory to French since the 12th century, most rapidly in the last 100 years. The A minority to majority language system will, language is classed as a language in “serious dan- however be mainly used for assimilation, for in- ger of extinction” by the UNESCO Red Book on stance, to answer vital questions such as “what are Endangered Languages (Salminen, 1999). they writing about me in the minority language newspaper?”. Therefore, the main goal is intelli- Although both Breton and French both belong gibility. to the large Indo-European language group, they are from widely different families, Celtic and Ro- The system in this paper was indeed developed mance respectively, with a number of substantial with this second objective in mind, to be able to differences between them. provide intelligible translations into French of text published in the Breton language media, and to at- c 2010 European Association for Machine Translation. tempt to decrease the need to translate everything 1The system can be tried out online at http: //www.ofis-bzh.org/bzh/ressources_ published in Breton into French in order to be un- linguistiques/index-troerofis.php derstood by interested people who do not speak Breton. 2.1 Morphological analyser Two examples of this are: A shared e-mail As no free morphological analyser for Breton was conversation between Breton speakers, with one previously available, a new analyser was written French speaker joining in. As opposed to hav- for the project. Words were added to the analyser ing the whole conversation translated, it should be based on frequency, with the most frequent words possible to use the MT system to get a gist and being added first. The frequency list was taken then request clarification. Another would be to al- from a database dump of the Breton Wikipedia. low Breton-speaking organisations to keep notes Some open categories, for example nouns and ad- of meetings and minutes in Breton, while not ex- jectives were semi-automatically extracted from cluding members who do not speak Breton from the French and Breton Wiktionaries.4 understanding what is going on. For example, for nouns, the French Wiktionary There have been two releases of the system has 1,704 entries and the Breton Wiktionary has (apertium-br-fr), the first one is version 0.1, 2,415 entries. For adjectives the French Wik- which was released in May of 2009, and the sec- tionary has 434 entries while the Breton Wik- ond is version 0.2, released in March of 2010. Also tionary has 557 entries. There are however over- mentioned in the paper are the results from the pro- laps and duplicates. No record was kept of how totype word-for-word system that was presented in many of these entries finally made it into the anal- Tyers (2009). yser as the data was merged with other free data, for example from the Breton–French FreeLang dictionary.5 2 Development As described below, verb conjugation is very regular, and thus, the main verb paradigms were The system is based on the Apertium machine entered by hand, and entries were largely able to translation platform.2 The platform was originally be assigned to paradigms based only on their in- aimed at the Romance languages of the Iberian finitive. Some verbs feature stem internal varia- peninsula, but has also been adapted for other lan- tion, and these were added manually. guage pairs, and in particular languages from the Closed categories, including pronouns, auxil- Celtic group, e.g. Welsh (Tyers and Donnelly, iary verbs, prepositions, conjunctions etc. were 2009). The whole platform, both programs and added from two descriptive grammars of Breton data, are licensed under the Free Software Founda- (Hemon, 2007; Press, 1986) by hand. The total tion’s General Public Licence3 (GPL) and all the development time for the morphological analyser software and data for the 22 supported language for version 0.1 was approximately three person pairs (and the other pairs being worked on) is avail- months. The final tagset includes 63 unique tags able for download from the project website. (for part-of-speech and other morphological prop- The initial development time (for version 0.1 of erties such as person, number, gender and tense) the system) was approximately six person months and 202 unique combinations of tags.6 from scratch. Three of these taken up with de- Breton is a weakly inflected language, with the velopment of the morphological analyser, two in following characteristics. Nouns have two gram- development of the bilingual dictionary and one matical genders, masculine and feminine and two month for the transfer rules. The majority of the numbers, singular and plural. The gender of the work was done by a PhD student of machine trans- noun plays an important part in the system of ini- lation, with input from Breton speakers. One dur- tial consonant mutations. For example, feminine ing the development of the dictionaries, and two singular nouns mutate after the definite article, bro during development of the transfer rules. Subse- ‘country’ and ar vro ‘the country’, but masculine quent expansion of the system has been continued singular nouns do not, mor ‘boat’ ar mor ‘the sea’. by one PhD student working on the transfer rules, Adjectives inflect for comparative, superlative and and one Breton speaker working on the dictionar- ies. 4http://fr.wiktionary.org and http: //br.wiktionary.org/ 5http://meskach.free.fr/arbo/dico/tomaz. 2http://www.apertium.org html 3http://www.fsf.org/licensing/licenses/ 6A description of the tagset can be found at http://wiki. gpl.html apertium.org/wiki/Breton#Tagset. exclamative. Version Lemmata Sur. forms Ambig.8 Verbs inflect for four persons (first, second, third 2009-01-01 9,961 105,697 1.07 and impersonal) and two numbers (singular and 0.1 11,559 224,839 1.09 plural), although the impersonal form of the verb 0.2 14,693 454,479 1.11 does not have number. There are four main finite Table 1: Basic statistics of two versions of the Breton mor- tenses: present, imperfect, past definite, and fu- phological analyser. The two versioned numbers are quality ture, and four moods: indicative, conditional, ha- checked releases, the 2009-01-01 version is the analyser as bitual and imperative. The habitual only appears in used in Tyers (2009). two verbs, bezan˜ ‘to be’ and kaout ‘to have’. The verb bezan˜ in addition has special locative forms, compare Pelec’h eman˜? ‘Where is he?’ with Piv large increase in surface forms between the two eo? ‘Who is he?’. When the subject is placed be- versions, for a relatively small number of new lem- fore the verb, then the verb is inflected in the third mata can be explained by the number of new mul- person singular, regardless of the person and num- tiword verbs (930 in version 0.2 compared to 204 ber of the subject. There are also a number of non- in version 0.1). finite forms of the verb, the infinitive and the past participle. 2.2 Part-of-speech tagger Verbal inflection is very regular, unlike in The part-of-speech tagger for the system is based French or Spanish, there is only one conjuga- on two technologies, the first is constraint gram- tion class for all verbs. There are however dis- mar (Karlsson et al., 1995), which uses linguist- tinct forms of the infinitive, -in˜, -an˜, -out, etc.