A Rule-Based Shallow-Transfer Machine Translation System for Scots and English
Total Page:16
File Type:pdf, Size:1020Kb
A Rule-based Shallow-transfer Machine Translation System for Scots and English Gavin Abercrombie [email protected] Abstract An open-source rule-based machine translation system is developed for Scots, a low-resourced minor language closely related to English and spoken in Scotland and Ireland. By concentrating on translation for assimilation (gist comprehension) from Scots to English, it is proposed that the development of dictionaries designed to be used within the Apertium platform will be sufficient to produce translations that improve non-Scots speakers understanding of the language. Mono- and bilingual Scots dictionaries are constructed using lexical items gathered from a variety of resources across several domains. Although the primary goal of this project is translation for gisting, the system is evaluated for both assimilation and dissemination (publication-ready translations). A variety of evaluation methods are used, including a cloze test undertaken by human volunteers. While evaluation results are comparable to, and in some cases superior to, those of other language pairs within the Apertium platform, room for improvement is identified in several areas of the system. Keywords: Scots, machine translation, assimilation 1. Introduction (‘what did he do that for’). There are also notable differ- The Scots language is spoken by over 1.5 million people ences to English in negation e.g. huvnae (‘have not’), din- in Scotland1 and a further 140,000 in Ireland,2 but is low- nae (‘do not’), and the double modal e.g. he’ll can dae that resourced in terms of technology - as far as I am aware there (‘he’ll be able to do that’) (Beal, 1997). are currently no automatic translators available for this lan- While Scots shares a great deal of vocabulary with English, guage. The development of machine translation (MT) sys- albeit often with different forms and/or senses, it also in- tems can lead to positive outcomes for a minor language cludes a large quantity of words derived from Old English and its speakers by contributing to the language’s standard- that are not present in English, as well as loan words from isation, and assisting the diffusion of content originally cre- Gaelic, French and Latin, the Scandinavian languages, and ated in that language (Forcada, 2006). The aim of this other West Germanic languages such as Frisian (Jones, project is to take a step towards the ‘de-minorizing’ of Scots 1997). 4 by developing a shallow transfer, rule-based MT system as There is no official standard written form of Scots, so part of the Apertium platform. spelling varies greatly - any translation system needs to ac- Since all Scots speakers comprehend standard English,3 count for multiple forms of many, if not most, words. this study will focus on translation from Scots to English 2.2. Translation: assimilation and dissemination for assimilation (or gisting) purposes. With assimilation Dis- thought to be the most common use of MT systems today The aims of translation between languages can vary. semination (O’Regan et al., 2013), this would seem to be an effective is ‘the production of translations of ‘publishable use of time and resources as an initial goal in MT between quality’ (Hutchins, 2003a), and can help in the creation of 5 assimila- these two languages. As the two languages are very closely more text in a lesser-resourced language, while tion related, it is proposed that translation of the bulk of lexical is translation for gist comprehension by readers who items in Scots texts (combined with part-of-speech (POS) ‘can accept poor quality as long as they can get an idea of analysis) will be sufficent to enable gist comprehension by what the text conveys’ (Hutchins, 2003b). non-Scots speakers. Successful assimilation is often viewed as the more realis- tic and achievable goal for MT systems (Hutchins, 2003a) 2. Background and is probably ‘the most frequent application of MT nowa- 2.1. The Scots language days’ (O’Regan et al., 2013). Translation for assimilation would enable people who do not speak the language to Scots is a West Germanic language descended from Old understand Scots text ‘thus removing an argument against Northumbrian (a branch of Old English) and closely related writing in the lesser-resourced language,’6 and is the pri- to modern English (Macafee and Aitken, 2002). Indeed, it mary aim of this study. shares many characteristics with its neighbour and relative including SVO sentence structure and a ‘vast body’ of com- 2.3. Machine translation mon vocabulary (Tulloch, 1997). Approaches to machine translation can be broadly catego- Grammatical differences can however be seen in the word rized as either rule-based or statistical, or in some cases, a order of some constructions e.g. whit fur did he dae that? hybrid of the two. 1 Scotland’s Census 2011 - National Records of Scotland 4www.educationscotland.gov.uk/knowledgeoflanguage 2 2001 Census of Northern Ireland - Northern Ireland /scots/writinginscots/scotsspelling/index.asp Statistics & Research Agency 5wiki.apertium.org/wiki/Assimilation and Dissemination 3 Scotland’s Census 2001 - National Records of Scotland 6wiki.apertium.org/wiki/Assimilation and Dissemination 578 While statistical MT systems have the advantage of not re- The modules perform the following functions: quiring the explicit programming of language rules, they • The deformatter and reformatter modules handle for- generally require large amounts of bilingual corpus data in matting and blank spaces. order to produce satisfactory results. As Forcada (2006) suggests, lacking such resources, ‘it may be much easier • The morphological analyser maps each word (or for speakers of the minor language to encode the language multi-word unit) in the input to entries in a monolin- expertise needed to build a rule-based machine translation gual dictionary, returning their lemma, lexical cate- system.’ gory and morphological information. This informa- While statistical MT tends to produce more fluent transla- tion is passed to the other modules and eventually tions, those of rule-based systems are often more faithful mapped to a target language surface form by the mor- to the details of content of the original text (Forcada et al., phological generator. 2011), a somewhat important advantage where information • A constraint grammar module contains rules designed transferral is concerned. to reduce ambiguity regarding parts-of-speech. Due to the low-resource status of Scots, and this study’s focus on translation for assimilation, rule-based translation • The part-of-speech tagger selects the most statistically would currently seem to be the more appropriate approach. likely of the possible lexical forms, given an item’s context. 2.3.1. Shallow-transfer rule-based machine • The lexical transfer module maps the given lexical translation item to its corrsponding lemma in the bilingual dic- Most rule-based MT systems create translations by parsing tionary. the source text, creating an intermediate symbolic represen- • tation of it, and then generating a final translation in the tar- The chunker segments the input into syntactic chunks get language. They apply mappings between lexical items and handles transfer rules. stored in dictionaries as well as transfer rules to account for • The post generator handles orthographic operations structural differences between the two languages. Trans- such as contractions. lation of unrelated or distantly related languages requires deep syntactic and semantic analysis, whereas closely re- Further details of the Apertium architecture can be found lated languages (such as English and Scots) can be trans- in Forcada et al. (2011). lated with shallow parsing (Sanchez-Mart´ ´ınez and Forcada, 3. Development of the English-Scots 2009). translation system In the belief that translation of Scots-only lexical items to 2.3.2. The Apertium MT platform English will be sufficient to enable gist comprehension, this is a free, open-source platform for develop- Apertium study concentrates on the development of the monolingual ing rule-based, shallow-transfer machine translation sys- and bilingual dictionaries for this language pair, and partic- tems, which was initially developed for translation between ularly on including Scots words that do not exist in English. closely related languages (Forcada et al., 2011). As ‘one of Development of the system was undertaken in the following the few open-source MT systems that can be used for real- stages. life purposes’ (Forcada, 2006), the Apertium platform is highly suitable for this project. 3.1. Prior status of the eng-sco language pair in The system consists of a number of modules through which the Apertium platform the input text is passed and modified before a translation in Apertium language pairs are classified as being in one the target language is outputted. Figure 1 illustrates this of four stages of development: from the trunk for release pipeline. quality pairs, down through staging and nursery to incuba- tor where incomplete ‘dictionaries, dictionary fragments, rules, things that aren’t quite ready to live in the real world’ are kept.7 Prior to this study, English & Scots (eng-sco) was in this latter category, and contained a Scots dictionary of only 172 entries. 3.2. Construction of eng-sco translator modules An English monolingual dictionary containing ap- proximately 20,000 lexical entries and correspond- ing morphological information was first added to eng-sco in the Apertium incubator. Associated dictionary processing files such as the post generator apertium-eng.post-eng.dix were also included. Figure 1: Architecture of the Apertium system. Adapted Work then began on expanding the Scots dictionaries. from Forcada et al., (2011). 7http://wiki.apertium.org/wiki/Incubator 579 3.3. Acquisition of linguistic data ↓ Scots lexical items were collected from the following Morphological analysis sources and added to the monolingual Scots dictionary, the English-Scots bilingual dictionary, and where necessary the ∧Ye/Prpers < prn >< subj >< p2 ><mf ><sp> monolingual English dictionary.