A Prototype System for Learning Structural Transfer Rules in the

Apertium Machine Translation Platform

Daniel Swanson

May 4, 2021

Abstract

This thesis develops a prototype system for automatically generating structural transfer rules for the Apertium platform in order to streamline the process of constructing language technology for the benefit of under-privileged language communities. The current state of the system is described and found to be potentially useful though not yet production-ready; the current primary limitations are identified, and potential resolutions to these issues are discussed.

1 Introduction

In this thesis, I attempt to develop a tool for extracting syntactic information from relatively small corpora to enable more rapid development of translation systems for lower-resourced languages. The results are limited but encouraging, and there are multiple avenues available for immediate improvement.

Section 2 describes the benefits of creating such systems. Section 3 discusses the various approaches to machine translation and their benefits and drawbacks. Section 4 describes Apertium, the machine translation system that the tools from this thesis are particularly designed to work with. Section 5 describes various ways of combining different translation approaches for better results. Section 6 explains a method for extracting syntactic rules. Section 7 evaluates, Section 8 suggests next steps for extending this work, and Section 9 concludes.

2 Computational Linguistics and Language Endangerment

Computational tools for language are becoming increasingly important for both linguistic research and language revitalization. Kornai (2013) argues that lack of digital presence and computer support is very likely to lead to the endangerment of the vast majority of the world's languages, even those that have large speaker populations and don't fit the other typical criteria for being considered endangered (such as not being passed on to children).

As the world becomes more digital, speakers of minority languages will make greater and greater use of the internet. At present, computer support for minority languages is quite limited. Computer support for a language includes things like appropriate keyboard layouts (if the orthography is not compatible with a more widely-supported language such as English or Russian), spellcheck, autocorrect, localized interfaces (that is, having the menus and buttons and so forth of the operating system and various pieces of software in the minority language), and machine translation to and from more widely-used languages (ideally including both regional trade languages and globally dominant languages such as English). Digital presence, meanwhile, includes social media, blogs, Wikipedia, and other online materials available in that language.

When these resources are lacking, speakers of minority languages can conclude that their languages do not belong on the internet or in other digital realms. Given the growing importance of the internet and related technologies, this leads them to view their own language as inferior in general. This leads, particularly in younger generations, to abandonment of heritage languages in favor of what Kornai calls "digitally ascendant" languages, thus hastening the process of language endangerment.

An instance where machine translation systems can be especially useful is in domains like medicine and law. An example of this is in India where, according to the constitution, the Supreme Court only operates in

English (Constitution of India 1950). This creates problems when the lower courts hear cases in regional languages and documents from those courts must then be translated into English. For example, in one land dispute, there were more than 11,000 pages that had to be translated from 16 languages into English

(Ayodhya case 2019). In a case like this, manual translation can take months and is thus quite expensive.

With a machine translation system, however, draft translations can be produced immediately, and then the human translators only need to check them, which is a much faster and cheaper process.1

Fortunately, remedying this lack of support is eminently possible. A machine translation system can potentially be built in a matter of months by an individual or a small team and can then both serve as a resource in itself and also be used to directly generate various other kinds of resources. It helps with lack

1 Thank you to Tanmai Khanna on the Apertium IRC chat for pointing me to this example.

of content as it can be used to generate drafts of software interfaces and Wikipedia articles, which can then be corrected by native speakers rather than requiring them to start from scratch. If the translation system has morphological processing as a separate step of the translation, this component can often be converted into a working spellchecker and sometimes even a full autocorrect system with minimal effort.

3 Approaches to Machine Translation

Machine translation systems can generally be classified as either rule-based or corpus-based, though overlap is possible. Systems using neural networks and deep learning have become widely used in recent years and they are often listed as a separate category, but for the purposes of this thesis it is sufficient to consider them as a subset of corpus-based systems.

Rule-based systems use dictionaries and lists of rules,2 which are often hand-written, to transform input

(usually text) from one language to another. The structure of one such system is described in Section 4.

Corpus-based systems, on the other hand, generally take some amount of corresponding text in multiple languages and attempt to derive general mappings between them without human input.

Both of these approaches have significant limitations in terms of the resources required to construct them.

Building rule-based systems frequently requires someone who is familiar with the languages in question, the translation software, and likely some experience with linguistic analysis, and who is willing to work on implementing a translation system for months or years.

Corpus-based systems, on the other hand, can be built fairly quickly and without a significant amount of background knowledge. However, extracting useful generalizations from a corpus without human intervention requires a prohibitively large amount of text, often in the realm of millions of words of sentence-aligned bilingual text - that is, a document and a translation such that each line or sentence in one corresponds to exactly one line or sentence in the other - or even word-aligned text, where correspondences between individual words have been specified. Google, for example, when describing their current neural model, indicates that their training data involved hundreds of millions of sentence pairs (Wu et al. 2016). By contrast, it is nearly impossible to find more than a few thousand words online for most languages (Kornai

2013). Resources on such a scale rarely exist and quite often there are simply not enough translators to produce them.
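The two corpus formats described above can be illustrated with a short sketch. This is not code from the thesis; the file layout and the "0-0 1-1"-style alignment notation are common conventions assumed here for illustration (the index-pair format is the one used by tools such as eflomal, discussed in Section 6.2).

```python
# Sketch: load a sentence-aligned bilingual corpus stored as two parallel
# text files, where line i of one file is a translation of line i of the
# other. The filenames and formats are illustrative assumptions.

def load_bitext(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

# Word-aligned text additionally records index pairs, e.g. "0-0 1-1 2-3",
# meaning source word 0 aligns to target word 0, source word 2 to target
# word 3, and so on.
def parse_alignment_line(line):
    return [tuple(int(i) for i in pair.split("-"))
            for pair in line.split()]
```
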

2 Strictly speaking, this refers to linguistic generalizations which typically take the form of either constraints or mappings between strings or structures, rather than the sequences of operations that "rules" typically refers to. Nonetheless, as the term "rule-based" is standard, I will continue referring to them as "rules" throughout.

[Diagram: SL text → … → discontiguous multiword assembly → anaphora resolution → discontiguous multiword disassembly → shallow structural transfer / recursive structural transfer → … → TL text]

Figure 1: The standard components of the pipeline for the translation system Apertium. Note that many translation pairs do not use every module. Retrieved from http://wiki.apertium.org/wiki/File:Apertium_system_architecture.png.

As a result, although corpus-based approaches generally deliver better results when sufficiently trained, they are only available for a small handful of languages. In addition, training such models is computationally intensive and can require access to thousands of dollars worth of computer hardware while emitting dozens or hundreds of tons of CO2 (Strubell, Ganesh & McCallum 2019). In contrast, rule-based systems can ordinarily be compiled and tested on a typical home computer. Rule-based systems have the added benefits of allowing developers to fix specific errors and having rules that can potentially serve as language documentation, in contrast to corpus-based models which typically produce statistical black boxes.

Hybrid systems which attempt to combine the benefits of both approaches also exist and will be discussed in more detail in Section 5. This thesis explores a particular way of hybridizing the generally rule-based translation system Apertium, which is described in the next section.

4 The Apertium Machine Translation System

Apertium is a rule-based translation system originally developed at the Universitat d'Alacant to translate between the languages of the Iberian peninsula (Forcada et al. 2011, Khanna et al. 2021). Apertium is set up as a pipeline, meaning that the translation process is broken down into a number of stages (usually between 6 and 10), with each stage doing part of the work of translating each segment (which may be a word or phrase, depending on the location in the pipe) before passing it along to the next stage. Section 4.1 discusses the reasons for using this architecture and Section 4.2 describes the various components of the pipeline.

Input text
He took the big plant out.

Morphological analyzer
^He/He/prpers$ ^took/take$ ^the/the$ ^big/big$ ^plant/plant/plant/plant/plant$ ^out/out/out$^./.$

Morphological disambiguator
^prpers$ ^take$ ^the$ ^big$ ^plant$ ^out$^./.$

Discontiguous multiword processing
^prpers$ ^take# out$ ^the$ ^big$ ^plant$^./.$

Lexical transfer
^Prpers/Prpers$ ^take# out/sacar$ ^the/el$ ^big/grande$ ^plant/planta/fábrica/maquinaria$^./.$

Lexical selection
^Prpers/Prpers$ ^take# out/sacar$ ^the/el$ ^big/grande$ ^plant/planta$^./.$

Structural transfer
^sacar$ ^el$ ^planta$ ^grande$^.$

Morphological generator
Sacó la planta grande.

Figure 2: An example sentence at each stage of translation from English to Spanish.

4.1 Reasons for Using a Pipeline

There are a number of advantages to breaking up the system like this. First, it makes it much easier to reuse rules between different translation pairs. The clearest example is that the first two stages (morphological analysis and disambiguation) are the same for a given source language, regardless of the target language. Similarly, the last two stages (morphological generation and post-generation) are the same for any target language, regardless of the source language. As such, these components only need to be written once, saving a significant amount of effort when building other translation pairs that use those languages. Another way this can happen is if two related languages have different writing systems, in which case they would need separate translation dictionaries (because Apertium currently primarily processes text, most of the processes are to some degree dependent on orthography), but if they are sufficiently grammatically similar, it might be possible to copy the rules handling word order from one translation pair to another with only minimal changes.

Another advantage of the pipeline architecture is that breaking something down into pieces often means that the total number of rules involved is smaller and each rule is easier to work with. For example, if at one stage in the pipeline, combinations of verbs and prepositions that have special meanings in English are combined so that later stages can treat them as single words, then they can be simply looked up in the dictionary and the stage that handles syntax doesn't need to process them differently.

Finally, it is more efficient to use multiple stages because on a modern computer with multiple CPU cores, multiple stages can be operating at once, all working on different pieces of the text in parallel, leading to an overall faster translation time.

The primary downside of such a system is that developers need to learn how to write a much wider variety of rules, and it can often be somewhat complicated to determine what type of rule or stage of a pipeline will most effectively solve a particular translation issue.

4.2 Structure of the Pipeline

The structure of the Apertium pipeline is shown in Figure 1 and examples of the output of each stage are shown in Figure 2.

The stages of the Apertium pipeline are generally the following:

First, the morphological analyzer takes surface forms and converts them into lemmas (that is, roots or dictionary forms) and morphological information. Morphological information consists of abbreviations for various features, such as part of speech (required), number, gender, and tense, each enclosed in angle

brackets. Thus <n> is 'noun', <m> is 'masculine', <sg> is 'singular', and so on.
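As a rough illustration of this lemma-plus-tags notation, the sketch below splits a stream-format lexical unit into its surface form and candidate analyses. It is an illustrative reimplementation for this thesis's discussion, not Apertium's own parsing code, and it ignores escaping of special characters.

```python
import re

# Sketch: split an Apertium-style lexical unit "^surface/analysis1/analysis2$"
# into the surface form and a list of (lemma, tag-list) analyses.
# Illustrative only; real stream parsing also handles escaped characters.

def parse_lexical_unit(unit):
    body = unit.strip("^$")
    surface, *analyses = body.split("/")
    result = []
    for analysis in analyses:
        tags = re.findall(r"<([^<>]+)>", analysis)
        lemma = analysis[:analysis.index("<")] if "<" in analysis else analysis
        result.append((lemma, tags))
    return surface, result

# parse_lexical_unit("^plant/plant<n><sg>/plant<vblex><inf>$") yields the
# surface "plant" with a noun analysis and a verb analysis.
```
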

Next, the morphological disambiguator attempts to distinguish based on context between morphological analyses that have the same surface form. In the example in Figure 2, it must distinguish between

"plant" as either a noun or a verb.

At this point, some translation pairs have discontiguous multiword processing which deals with situations such as English phrasal verbs. It can look for sequences such as the English "take the book out of the box" and replace them with "take-out the book of the box" so that in subsequent stages "take-out" can be translated as if it were a single word, thus giving in Spanish "sacar" ('remove from a container') rather than "tomar fuera" ('carry outside'). Such replacements are generally unconditional (though incorporating context is possible), since the combined form will be a better translation in most cases, and if it isn't, the equivalent combined form (here "tomar-fuera") can be added as an alternate translation and the more applicable one can be decided later in the pipeline.

Following this, lexical transfer looks up each word in a bilingual dictionary, replacing the lemma but leaving inflectional information unchanged.

Then, lexical selection chooses between alternative translations, performing much the same task as the morphological disambiguator. The lexical selection module, however, is nearly always rule-based, while the morphological disambiguator is often at least partially statistical (though work has been done on generating lexical selection rules from corpora).

After that, structural transfer rearranges words and corrects any errors in inflectional information.

In the example, lexical transfer attached the tags <GD><ND>, "gender to be determined, number to be determined", to the determiner, which structural transfer changes to <f><sg>, "feminine, singular", to agree with the noun. Apertium has two transfer systems. The original one rearranges sequences of adjacent words, which is simple and fast, but becomes very difficult to use when dealing with larger syntactic differences.
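This kind of agreement operation can be sketched as a small helper that copies gender and number from the noun's tags into the determiner's placeholder positions. The function and tag sets below are illustrative assumptions, not the actual Apertium transfer formalism.

```python
# Sketch: resolve the placeholder tags GD (gender to be determined) and
# ND (number to be determined) on a determiner by copying the gender and
# number tags of the noun it agrees with. Hypothetical helper, not the
# actual Apertium rule language.

GENDERS = {"m", "f", "nt"}
NUMBERS = {"sg", "pl"}

def resolve_agreement(det_tags, noun_tags):
    gender = next((t for t in noun_tags if t in GENDERS), "GD")
    number = next((t for t in noun_tags if t in NUMBERS), "ND")
    return [gender if t == "GD" else number if t == "ND" else t
            for t in det_tags]

# e.g. resolve_agreement(["det", "GD", "ND"], ["n", "f", "sg"])
# yields ["det", "f", "sg"], matching the "el" -> "la" correction above.
```
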

The newer system (Swanson et al. 2021), meanwhile, builds a syntax tree of the input sentence, allowing for much larger changes.

Then the morphological generator converts from lemmas and tags back to surface forms. This is very similar to the morphological analyzer, and in fact the generator is essentially constructed by running a language's analyzer backwards, though there are slight differences. The primary distinction is that a generator must only produce one surface form. For instance, the English analyzer converts both "isn't" and

"ain't" to ~be+not$ but the generator only produces "isn't". 3

3 Issues arise at several points in the pipeline from the fact that there must be exactly one output but several are potentially correct. Some translation pairs deal with this by choosing a single variety (often the most common or most standardized written

Finally, the post-generator deals with orthographic changes between words, such as contracting "de el" to "del" in Spanish or "a" vs "an" in English.

Every step in this process is typically accomplished with dozens or hundreds of handwritten rules and thousands of dictionary entries. However, it is relatively straightforward to replace components of this pipeline with other programs which can process the data in any manner they please so long as they produce output in the correct format (that is, formatting data as ^lemma<tags>$ in the manner expected by components further down the pipeline). As a result, some translation pairs incorporate statistical components to replace or augment the rule-based ones. This and other methods of incorporating statistical knowledge will be discussed in the next section.
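Because every stage simply reads and writes this stream format, a replacement component can be a small filter program. The sketch below (a toy stage that uppercases each lexical unit, purely illustrative) shows the general shape such a drop-in module takes; a real replacement, such as a statistical disambiguator, would do its work inside process_unit().

```python
import re

# Sketch: a drop-in Apertium pipeline stage is just a filter over text
# containing "^...$" lexical units. This toy stage uppercases the contents
# of every unit; it ignores escaping of special characters.

def process_unit(match):
    unit = match.group(1)
    return "^" + unit.upper() + "$"

def run_stage(text):
    return re.sub(r"\^([^$]*)\$", process_unit, text)

# A real module would be wired into the pipe roughly as:
#     sys.stdout.write(run_stage(sys.stdin.read()))
```
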

5 Hybrid Translation Systems

There are two possible approaches to building a hybrid machine translation system. Section 5.1 discusses the option of having a pipeline in which some modules are rule-based and some are corpus-based. Then

Section 5.2 discusses the option of having a corpus-based approach generate rules which are then processed by a rule-based system.

5.1 Hybridizing by Replacing Components

Hybridizing a translation system by replacing one or more rule-based components with statistical components would decrease the number of rules involved because it eliminates the set of rules for the component that was removed without changing the rules required for any of the others.

A statistical component is doing less work than a full statistical system and so should be able to learn

from a smaller corpus.4 The reason for this is that there are fewer possible things for such a component to do. If a statistical program has to learn which of 5 possible things to do in a given situation, it will probably need many fewer examples than if there are 50.

For example, in the sentence in Figure 2, a statistical English-Spanish lexical selection module would only need to learn whether to choose "planta", "fábrica", or "maquinaria" as the proper translation of "plant".

A statistical system could conceivably derive useful generalizations for that from a few dozen examples.

variety). Others, however, allow the user to specify which version they wish to translate into, such as "Portuguese (current standard)" or "Portuguese prior to the 1990 spelling reform". A similar issue regarding inserting pronouns when one language doesn't have them has largely been dealt with by using "he" or some equivalent in all cases, though a proper solution based on determining what the pronoun refers to is in the process of being added to some translation pairs.

4 I'm not aware of any studies of how corpus size affects the accuracy of different approaches, but it seems reasonable that learning smaller, simpler transformations would require less data.

A program trying to derive full translations, meanwhile, would probably need significantly more examples, since it would also need to determine that these 3 words are all possible translations. And depending on the system, it might end up learning the plural forms separately, requiring a separate set of examples for those.

In languages with more complex morphology than English, that last effect is even more pronounced. In

Attic Greek, for example, a typical verb had over 600 possible forms (Groton 2013). Not having to learn the relationship between those will substantially decrease the number of examples required. This is particularly true of rarer forms that might not appear frequently enough to be learned on their own.

However, simply replacing rule-based components with statistical ones isn't the only method of using a corpus in a translation system.

5.2 Hybridizing by Learning Rules

Another potential approach to hybridizing translation systems is to modify the programs that learn patterns from corpora so that they output rules in the same format as the rule-based systems. This would allow humans to correct or expand on what the statistical system learns, allowing expert knowledge and corpus data to be combined in a single set of rules.

This combining process can also go the other direction, taking a set of rules and feeding them back into a statistical system, such as is done in Sánchez-Cartagena, Sánchez-Martínez & Pérez-Ortiz (2011) for structural transfer or Dugast, Senellart & Koehn (2008) for an entire translation system.

Both of these options enable a translation system to incorporate both corpora and expert knowledge.

However, an even further step in that direction would be to implement incremental learning, that is, to have a program that doesn't just learn rules from scratch but instead looks for extensions or corrections to an existing set of rules.

The differences between these approaches are not particularly important if a translation system is built once and then never modified. That is, if there is only one corpus and rules are only written on a single occasion, then it probably doesn't make very much difference what order the two pieces of the process happen in. However, if adding more hand-written rules and training on more corpus data are both operations that simply expand an existing set of rules, then, with incremental learning, both can happen arbitrarily many times and in any order, which enables continued improvement of a system as more potential developers become interested or as more text becomes available over time.

It is possible that such incremental learning methods could exploit the fact that they have access to the existing rules, which is more information than just the text by itself, in order to produce good systems

with less input text. However, even if they don't reduce the amount of data required, they still open up possibilities for decreasing the amount of effort required to produce that data, thus also decreasing the amount of effort needed to reach higher levels of accuracy.

In particular, it has been shown that editing the output of a machine translation system produces more consistent output than manual translation and requires significantly less time (Green, Heer & Manning 2013).

Thus the system could translate the first page of a document and the user could post-edit it. The system could then determine refinements to its rules while the user post-edits the second page. With each successive page translated, the quality of the initial translation would improve and the person doing the post-editing would have less work to do. This then creates a positive feedback loop where the larger a bilingual corpus is, the easier it is to usefully expand it.

Additionally, given such a system, it becomes possible for anyone who speaks the language to usefully contribute in any time increment from correcting a single sentence to translating dozens of pages without needing to be taught how to modify the rules themselves.

Having described hybridization in general, I will now describe the work of this thesis itself.

6 Learning Syntax Rules

The purpose of this thesis is to create a tool that can generate structural transfer rules for an Apertium translation pair compatible with the new recursive transfer module (Swanson et al. 2021) using corpus data and a syntactic description of at least one of the languages.

The general outline of the method is as follows: First, the sentences of each corpus are converted into syntax trees (Section 6.1). Second, correspondences are established between the nodes of each pair of trees

(Section 6.2). Third, these correspondences are converted into translation rules (Section 6.3). The code implementing this process can be found at https://github.com/mr-martian/apertium-recursive-learning.

6.1 Parsing The Sentences

There are a variety of ways that a corpus can be converted to syntax trees. The trees can be constructed by hand (for example, if the language in question was documented by a syntactician who produced a treebank) or an existing syntactic parser can be used (the structural transfer module from an Apertium translation pair that includes one of the languages in the corpus might work, though currently most of these use a method that doesn't generate trees). Alternatively, various methods of Grammar Induction have been devised, such as Solan et al. 2004, which attempt to derive syntactic structures of a language solely from raw text.

Regardless of how the trees are to be obtained, they are only actually needed for one of the languages in the corpus. The input to the process of establishing correspondences as described in the next section must be in the form of a tree, but the process will function if one of the trees has no internal structure. That is, given an English tree such as the one in Figure 3, then either of the Spanish trees in Figure 4 can be used.

[S [DP [Det The] [NP [N Lord]]] [VP [V is] [PP [Prep with] [DP [Pro you]]]]]

Figure 3: A possible syntax tree for the English sentence "The Lord is with you."

s s

DP VP Det N V pp A A I I I I Det NP V pp El Senor esta contigo

I I I I El N esta contigo

I Senor

Figur e 4: Two possible syntax trees for the Spanish sentence "El Senor esta contigo."

Since the transfer rules that this overall process generates can be used to generate trees, the output of the process can also then serve as the input to a later iteration. This means that this thesis can in principle be used in building translation pairs between arbitrarily many languages as long as a parser or treebank exists for at least one of them.

6.2 Establishing Correspondences

Once the parallel corpus has been converted into a set of syntax trees, the next step is to align the nodes within the trees.

The first step of this alignment is to establish correspondences between the individual words of the sentences. There are currently 3 methods available to do this. The first is to use an external program called a word-aligner. My code is currently set up to use the eflomal word-aligner (Östling & Tiedemann 2016).

The second option is to read alignment data from a file. This is useful if the alignments from eflomal have been corrected by a speaker of the languages in question. The third option is to use the bilingual dictionary.

For this method, each word in the source sentence is looked up in the dictionary. Any word in the target sentence that it could translate to is then listed as a potential correspondent.
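The dictionary method can be sketched as follows. The dictionary is represented here as a plain Python mapping from a word to its possible translations, an assumption made for illustration; the actual system reads an Apertium bilingual dictionary.

```python
# Sketch of the bilingual-dictionary alignment method: each source word is
# looked up, and every target-sentence position holding one of its possible
# translations is recorded as a potential correspondent. The dict-of-sets
# representation is an illustrative simplification.

def dictionary_align(src_words, tgt_words, bidix):
    alignments = {}  # source index -> set of candidate target indices
    for i, sw in enumerate(src_words):
        translations = bidix.get(sw, set())
        candidates = {j for j, tw in enumerate(tgt_words)
                      if tw in translations}
        if candidates:
            alignments[i] = candidates
    return alignments

bidix = {"plant": {"planta", "fabrica"}, "big": {"grande"}}
src = ["the", "big", "plant"]
tgt = ["la", "planta", "grande"]
# "the" has no entry and stays unaligned; "big" -> {2}; "plant" -> {1}
```
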

After word alignment is complete, some words, particularly function words such as articles and auxiliaries, may have no correspondents, and some, particularly if the dictionary approach is used, may have multiple.

If a word has multiple correspondents, it is currently treated as a list of possible correspondents, with the correct one being resolved in the next step. In the future, it may be useful to have a way of specifying that a particular word in one tree corresponds to multiple words in another tree for when one language uses a morphological alternation but the other uses a separate function word, but for the moment the function word is simply left unaligned.

In order to establish correspondences between the phrasal nodes of the trees, an algorithm closely based on Hanneman, Burroughs & Lavie 2011 is used.

A node A in one tree can correspond to a node B in another tree if every word in A is either not aligned or is aligned to a word in B. If A can correspond to B and B can also correspond to A, then an alignment is added between them. See Figure 5 for an example of the effect of this criterion.
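This two-way criterion can be sketched directly in code. Here each node is represented simply as the set of word indices it dominates, and the word alignment as a mapping from word indices in one tree to the set of aligned indices in the other; both representations are illustrative assumptions, not the thesis's actual data structures.

```python
# Sketch of the two-way node alignment criterion (after Hanneman,
# Burroughs & Lavie 2011). A node is a set of word indices; word_align maps
# a word index in one tree to the set of indices it aligns to in the other.

def can_correspond(node_a, node_b, word_align):
    """Every word in A must be unaligned or aligned only to words inside B."""
    return all(word_align.get(w, set()) <= node_b for w in node_a)

def nodes_aligned(node_a, node_b, align_a_to_b, align_b_to_a):
    """A and B are aligned only if the criterion holds in both directions."""
    return (can_correspond(node_a, node_b, align_a_to_b)
            and can_correspond(node_b, node_a, align_b_to_a))
```

With the Figure 5 sentences word-aligned one-to-one, the two APs (words 3-4 on each side) pass in both directions, while the English AP {3, 4} against the Spanish VP {2, 3, 4} fails in the VP-to-AP direction because "era" (word 2) is aligned outside the AP.
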

For any given node, the tree-aligner might return no alignments, such as in cases like Figure 6a, where a constituent on one side corresponds to pieces of multiple constituents on the other side. This is often a result of bracketing differences. In the example given in Figure 6a, there is a prepositional phrase that could plausibly attach to either a noun or a verb and the two trees have it in different places, likely because the annotators understood it differently. There does not appear to be a general way to solve this issue, so such instances are simply ignored, potentially leading to a less complete set of rules.

In other contexts, a single node may have multiple alignments, as shown in Figure 6b. One reason this can happen is that a node only has one child and thus both parent and child contain precisely the same sequence of words. Another possibility is that one language requires function words which are not present

in the other language, such as the Attic Greek word "χώρᾳ", which corresponds to the English phrase "to a land". In such cases the aligner outputs all possibilities on the assumption that the rule-generator can determine which alignment is most useful by comparing with other sentences in the corpus.

Figure 5: Examples of the effect of the two-way alignment criterion on aligning various constituents of the Spanish sentence "La serpiente era muy astuta." with English "The serpent was very astute." (a) The English AP cannot be aligned to the Spanish VP because the VP contains "era", which is aligned to a word outside the AP. (b) The Spanish AP cannot be aligned to the English VP because the VP includes "was", which is aligned to a word outside the AP. (c) The two APs can be aligned because neither of them contains words aligned outside the other.

Once all the alignments that fit these criteria have been found, the aligner attempts to align parts of constituents by creating "virtual" nodes. These are nodes which correspond to 2 or more adjacent children of a node, so that adding them never breaks the original constituent structure. An example of inserting virtual nodes is shown in Figure 7. The program first tries to create virtual nodes to align with each remaining unaligned input node. If this fails, as it would in Figure 8, it will be impossible to completely align the tree.

An incomplete alignment will not prevent rules from being generated, though it will prevent the generator from producing rules which can entirely account for the pair of sentences in question.
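Enumerating the candidate virtual nodes for a parent can be sketched as listing every run of two or more adjacent children; because each span stays inside the parent's boundaries, inserting it never breaks the existing constituent structure. The list-and-tuple representation is an illustrative assumption.

```python
# Sketch: candidate "virtual" nodes for a parent with the given children.
# Each candidate covers 2 or more adjacent children but not all of them
# (the full span would simply duplicate the parent).

def virtual_nodes(children):
    spans = []
    for start in range(len(children)):
        for end in range(start + 2, len(children) + 1):
            if end - start < len(children):
                spans.append(tuple(children[start:end]))
    return spans

# For a VP with children ["V", "DP", "PP"], the candidates are
# ("V", "DP") and ("DP", "PP").
```
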

This ability to insert virtual nodes is the reason that one of the trees can be completely flat, as the aligner can induce the structure of the flat tree from the non-flat one. Unfortunately, preliminary experiments indicate that this works very badly if both trees are flat because the aligner is then only looking for adjacent segments and all the actual constituents it finds are overwhelmed by the noise of things that merely happen

to be adjacent. This feature is ideal for translating between a high-resource language and a low-resource language, as the tools or data which exist for the former can thus make up for the lack of equivalent resources in the latter.

Figure 6: Two instances where a one-to-one alignment of syntactic nodes is not possible. (a) The Spanish PP "en una caja" has been interpreted as a modifier of the noun, while the English PP has been interpreted as a modifier of the verb. The result is that the Spanish NP "gato negro en una caja" cannot be aligned with any single constituent in the English tree. A complete alignment of the two trees is thus impossible in this case. (b) An instance where a node has multiple alignments. The Greek PP satisfies the alignment criterion with any of the three English nodes shown. Since the English words "to a" do not correspond to separate words on the Greek side, they are unaligned and thus can be ignored for the purposes of aligning their parents.

6.3 Generating Rules

Having produced a set of aligned trees, the next step is to convert these alignments into transfer rules.

The type of structural transfer rules being generated here apply in a two-step process. In the input step, the rule matches a sequence of nodes, which may be either words or larger units generated by other rules, and then combines them into a single syntactic node. These syntactic nodes keep track of which rule generated them, and once all possible rules have been applied, each node uses the corresponding output step to determine how its constituents should be modified and rearranged.

Figure 7: In order to complete the alignment of the two trees, a virtual node is inserted in the Spanish tree.

Every node alignment from the previous section is directly convertible into the outline of a structural transfer rule. The children of the source node give the pattern to be matched, the source node is what they combine into, and the alignments between the source children and the target children give the rearrangement that should be output, with any unaligned children on the target side being inserted. Unaligned children on the source side will not be mentioned in the output pattern and thus will be deleted. See Figure 9 for an example of the rules produced.
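As a rough illustration of that conversion (the data structures here are hypothetical stand-ins, not the generator's own), one node alignment yields a pattern from the source children and an output sequence mixing references to matched children with inserted target material:

```python
def rule_outline(src_children, tgt_children, alignment):
    """Sketch of converting one node alignment into a rule outline.

    alignment maps source-child index -> target-child index. The source
    children give the pattern; the target order gives the output, with
    unaligned target children inserted literally and unaligned source
    children simply never referenced (and thus deleted).
    """
    pattern = [c["label"] for c in src_children]
    aligned = {tgt: src for src, tgt in alignment.items()}
    output = []
    for j, tgt in enumerate(tgt_children):
        if j in aligned:
            output.append(aligned[j] + 1)   # 1-based reference to a source child
        else:
            output.append(tgt["lemma"])     # unaligned target child: insertion
    return pattern, output

# Modeled loosely on Figure 9(b): a single Greek verb aligns to the second
# English child, and English "have" is inserted before it.
src = [{"label": "vblex", "lemma": "grapho"}]
tgt = [{"label": "vbhaver", "lemma": "have"},
       {"label": "vblex", "lemma": "write"}]
pattern, output = rule_outline(src, tgt, {0: 1})
# pattern == ["vblex"], output == ["have", 1]
```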

Structural transfer rules are also responsible for modifying morphological information, particularly agreement. Deriving this information is currently not attempted, though see Section 8.2 for a possible method.

Unfortunately, the rules generated in this manner are sometimes in conflict with each other, particularly if a node has multiple alignments, leading to a pair of rules that both match the same input but produce different outputs. Currently, this is dealt with by weighting various rules by how often they appear in the corpus, with the result being that all of them appear in the output but, when applying them, only the most common option will be selected. See Section 8.1 for some possible better ways to handle this.
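A minimal sketch of this weighting scheme (hypothetical data shapes, not the actual implementation): all observed rules are kept, but for each pattern only the most frequently observed output is selected at application time.

```python
from collections import Counter

def resolve_conflicts(rule_instances):
    """rule_instances: list of (pattern, output) pairs observed in the corpus.

    Every rule is retained with a weight (its corpus count), but for each
    pattern only the most frequent output will actually be chosen.
    """
    counts = Counter(rule_instances)
    best = {}
    for (pattern, output), n in counts.items():
        if pattern not in best or n > counts[(pattern, best[pattern])]:
            best[pattern] = output
    return best

# Two conflicting rules for the same pattern; the one seen 3 times wins
# over the one seen twice.
conflicting = [("NP: n adj", "2 1")] * 3 + [("NP: n adj", "1 2")] * 2
chosen = resolve_conflicts(conflicting)
# chosen["NP: n adj"] == "2 1"
```

Note that, as discussed in Section 8.1, this tie-breaks arbitrarily when two outputs occur equally often.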

7 Results

I evaluated the rule-learning program by generating translation rules from Spanish to English. The Apertium English-Spanish bilingual dictionary5 was used both for translation and for alignment, a set of transfer rules from Spanish to Catalan6 was used to generate the Spanish trees, and the English-Spanish Europarl corpus (Koehn 2005) was used for training and evaluation.

5https://github.com/apertium/apertium-eng-spa/blob/master/apertium-eng-spa.eng-spa.dix

Figure 8: An instance where a virtual node would be aligned to another virtual node. Neither the highlighted Spanish NP nor the highlighted English VP can be aligned without breaking constituent boundaries. As a result, a virtual node will be inserted on each side, one for "un gato" and another for "a cat".

To mitigate present inefficiencies in the code, sentences from the corpus that were more than 200 characters long were excluded. Of the remainder, the first 200 lines were used for evaluation and varying numbers of subsequent lines were passed to the rule-learner to see how it handled different amounts of data. Since the rule-learner does not currently attempt to handle morphological information, all tags other than part of speech (POS) were removed and translations were evaluated by comparing lemmas and POS tags only.

The accuracy of the translation was measured using Word Error Rate, which is the Levenshtein edit distance from the output to the reference translation divided by the number of words in the reference.
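As a concrete reference for the metric, here is a minimal sketch of Word Error Rate over whitespace-separated tokens (an illustration of the definition above, not the evaluation script itself):

```python
def word_error_rate(hyp, ref):
    """Levenshtein edit distance between token sequences, divided by the
    number of tokens in the reference."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits needed to turn h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(h)][len(r)] / len(r)

# One substitution against a four-word reference gives WER 0.25.
wer = word_error_rate("the cat sat down", "the dog sat down")
```

In the experiments here the "words" compared are lemma+POS tokens, since all other tags were stripped.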

These results were compared against the existing Apertium English-Spanish translation pair and against the bilingual dictionary alone. The results are shown in Table 1.

On this data, the total difference between a large set of hand-written transfer rules (nearly 150) and no structural transfer at all is only 5 percentage points. This suggests that the bilingual dictionary and the lexical selection module (which only has 4 rules in this case) are having a significantly larger effect on the result than structural transfer.

6https://github.com/apertium/apertium-spa-cat/blob/recursive-spa-cat/apertium-spa-cat.spa-cat.rtx

Figure 9: An example of deriving a rule from a pair of aligned nodes. The verb "γέγραφεν" in the Attic Greek sentence "ἡ δέσποινα τὴν ἐπιστολὴν γέγραφεν." corresponds to the phrase "has written" in the English "The lady has written the letter." This gives rise to the two rules given. (a) The trees of the two sentences with the V nodes aligned. (b) The Greek-English rule V -> vblex { have@vbhaver _ 1};, which says that a V node in Greek is composed of a lexical verb (vblex) and is rendered in English as the modal verb "have" (have@vbhaver) followed by a space (_) and the translation of the Greek verb (1). (c) The English-Greek rule V -> vbhaver vblex { 2};, which says that a V node in English is composed of a modal verb (vbhaver) followed by a lexical verb (vblex) and is rendered in Greek as the translation of the lexical verb (2). Note that neither rule currently deals with the fact that the English uses the past participle while the Greek uses the perfect active indicative. See Section 8.2 for further discussion.

Nonetheless, the rule-learner does have a measurable positive impact of nearly 1.2 percentage points, which is a little less than a quarter of the way to the hand-written system. Further, there is a very slight improvement with increasing corpus size, suggesting that a larger corpus might be able to produce useful results. (Unfortunately, the code is currently too slow and memory-intensive to reasonably test larger corpora.)

7.1 Analysis of Generated Rules

On average, just over a third of each set of rules is for the sentence node. These range from "match a clause and a punctuation mark and output them in the same order" to one which has the pattern "noun phrase, prepositional phrase, noun phrase, comma, verb phrase, left parenthesis, unknown word, noun phrase, right parenthesis, punctuation". This probably happens because the Spanish parser is not able to fully process the input in all cases and the training script handles this by joining adjacent nodes into sentences.

Transfer System              Number of Rules   WER on 200 Sentences
apertium-eng-spa             148               69.29%
no transfer                  0                 74.29%
training on 250 sentences    50                73.13%
training on 500 sentences    69                73.04%
training on 750 sentences    90                73.04%
training on 1000 sentences   112               73.04%

Table 1: Evaluations of the various translation systems. "apertium-eng-spa" is the standard Apertium Spanish-English translation system, "no transfer" is the bilingual dictionary and lexical selection rules without any transfer, and the other lines are the transfer rules generated on inputs of 250, 500, 750, and 1000 lines, respectively. The translation metric is Word Error Rate. Lower numbers indicate better translations.

The syntactic nodes used in the Spanish-Catalan transfer rules for which rules are produced by the learning script are adjective phrase, adverbial phrase, noun phrase, prepositional phrase, relative clause, and verb phrase.

For all of these (with the exception of relative clause, which never has more than 3 rules generated), the generated rules include simple and expected ones like "a noun phrase can consist of a noun" or "a clause can consist of a noun phrase and a verb phrase".

Somewhat more common than this, however, are rules with plausible patterns, such as "a noun phrase can consist of a noun phrase and an adjective phrase", but with one or more of the outputs deleted on the English side, such as "a noun phrase containing a noun phrase and an adjective phrase translates to English as just a noun phrase". I suspect that this is mainly due to the word-aligner failing to connect the adjectives in the two trees, and that a better alignment, whether from more data or from a different alignment method, would solve this issue. In the other direction, there are several rules which insert words. Most of these are function words ("be", "have", "that", "there"), suggesting that they, too, are alignment failures.

Finally, there are a handful of cases where the two alignment issues just mentioned cancel out and produce rules such as SPrep -> SPrep SN { "of"@pr _ 2 };7. In the Spanish-Catalan rules, SPrep can be either a single preposition or a prepositional phrase, so the preposition failed to align and is being deleted and reinserted, resulting in a rule that is sometimes correct but for the wrong reasons. Investigating why this has SPrep in the pattern rather than pr (the standard label for prepositions) led to the further realization that the tree-aligner is occasionally failing to fully align the trees because it currently cannot insert virtual nodes consisting of a single word, which would otherwise allow it to better handle this situation.

7This rule can be read as "a Spanish SPrep (prepositional phrase) node is composed of an SPrep node and an SN (noun phrase) node and is output in English as the preposition 'of', a space, and the second input (the noun phrase)".

8 Future Work

There are a variety of things that could be done to improve the system. In terms of actually running the program, one of the most important is making the various components operate more efficiently or in parallel.

Some of the code dealing with virtual nodes, for instance, currently processes all possible arrangements, which can lead to a combinatorial explosion. In addition, most of the steps of the process currently hold the whole corpus in memory but could probably be modified so as to only operate on a segment of the corpus at a time, decreasing memory usage substantially.

Beyond this, I see two main areas of improvement. Section 8.1 discusses how to make rule conflict resolution less naive and Section 8.2 discusses how to incorporate morphological information.

8.1 Resolving Rule Conflicts

The current method of dealing with conflicting rules is to weight them by how often each occurs in the corpus. Unfortunately, this means that if two rules occur equally often, then which one gets chosen will be more or less random.

A better way would be to look at the instances of each alignment and see whether some feature or combination of features was correlated with one or the other: for example, if some variation were triggered by the presence of a particular preposition, or by transitive verbs but not intransitive ones, or certain noun cases but not others.

One limitation of this approach is that there are many such features that could be checked, and if combinations of them are tested it could lead to combinatorial explosion, making the process much slower and potentially pushing it beyond the limits of the user's hardware. As a result, it would likely be helpful to give the user the option of limiting the search to certain features or combinations, or of specifying that some features should be preferred (for example, telling the program to make sure that differences can't be explained by animacy before it tries to explain them by what prepositions are involved). If such a filter affected the final output, then that would indicate that multiple explanations fit the data equally well, likely indicating a need for more data. On the other hand, if the user had some idea where the two languages used different structures, then this would simply be a way to speed up the process by giving the program more data.
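The simplest version of this idea can be sketched as follows (hypothetical feature dictionaries, single features only, no combinations): for each conflicting rule, collect the features of every instance and keep the features whose value perfectly predicts which output was used.

```python
def correlated_features(instances):
    """instances: list of (features, chosen_output) pairs, where features
    is a dict such as {"prep": "de", "case": "gen"}.

    Returns the feature names whose value perfectly predicts which output
    was used. This is deliberately naive: real use would need feature
    combinations and the user-supplied preferences discussed above.
    """
    keys = set().union(*(feats.keys() for feats, _ in instances))
    predictive = []
    for key in keys:
        mapping = {}
        consistent = True
        for feats, out in instances:
            value = feats.get(key)
            if mapping.setdefault(value, out) != out:
                consistent = False
                break
        if consistent:
            predictive.append(key)
    return predictive

# "prep" predicts which rule applied; "case" does not.
data = [({"prep": "de", "case": "gen"}, "rule_a"),
        ({"prep": "a",  "case": "gen"}, "rule_b"),
        ({"prep": "de", "case": "dat"}, "rule_a")]
found = correlated_features(data)
# found == ["prep"]
```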

Another approach would be to generate multiple sets of rules and then apply them to a corpus and evaluate the results. This, of course, also has the danger of combinatorial explosion, besides the added time of running all the rules.

The number of tests that need to be performed could also be kept small by assuming that each set has a more or less independent effect, which would remove the possibility of combinatorial explosion. This assumption should be tested, though, to determine whether the effect of different rules on each other is small enough to be ignored in this way.

When dealing with corpora where both sides can be parsed, some conflicts could also potentially be resolved by preferring alignments where node labels correspond. So, for example, in Figure 6b, where a prepositional phrase consisting of a single case-marked noun is aligned with a prepositional phrase consisting of a preposition, a determiner, and a noun, the alignment between the two prepositional phrases would be chosen rather than the alignment of the prepositional phrase with either the determiner phrase or the noun phrase.

8.2 Extracting Agreement Rules

Once the rules to build and rearrange the tree have been generated, the morphological dictionaries of the two languages (which are compiled into the morphological analyzers and generators) can be used to extract all possible morphological patterns. Since tags are moderately standardized across different languages in Apertium, determining systematic changes of tags (such as marking different sets of tenses) can be worked out more or less automatically. For example, Spanish uses <pri> for present indicative, <prs> for present subjunctive, <ifi> for past indicative, and <pii> for past imperfect indicative, among others. English, on the other hand, uses <past> for past tense and <pres> for present tense. Given machine-readable documentation of these tags8, it should be possible to determine that, when mapping from Spanish to English, <ifi> and <pii> should map to <past>, and <pri> and <prs> should map to <pres>, with the proper auxiliaries being determined by the methods discussed in Section 8.1.
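Assuming tag documentation of roughly this shape, the mapping step could be sketched as below; the dictionaries here are illustrative stand-ins for the machine-readable documentation, and the category labels are invented for the example.

```python
# Hypothetical machine-readable tag documentation: each tag is annotated
# with the category it marks. The category strings are illustrative only.
docs_spa = {"pri": "tense:pres", "prs": "tense:pres",
            "ifi": "tense:past", "pii": "tense:past"}
docs_eng = {"pres": "tense:pres", "past": "tense:past"}

# Map each Spanish tag to every English tag documenting the same category.
mapping = {spa: [eng for eng, eng_cat in docs_eng.items() if eng_cat == cat]
           for spa, cat in docs_spa.items()}
# e.g. mapping["ifi"] == ["past"] and mapping["pri"] == ["pres"]
```

When a source language distinguishes more categories than the target (as Spanish does here), several source tags collapse onto one target tag; the reverse direction would instead yield multiple candidates, to be resolved by the methods of Section 8.1.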

Agreement, meanwhile, can be determined by finding cases where one word has a certain tag before structural transfer but multiple words have it after. Then the possible paths from the word that has it to the words that don't could be calculated as follows: any node which has a descendant that needs the tag propagates it to the relevant immediate child, and any node that has a tag needed by a non-descendant node propagates that tag to its parent. If this results in multiple potential paths for a particular word to get the tag that it is missing, the paths can be compared with those generated by other examples, or heuristics (either language-independent or potentially user-specified) can be applied. See Figure 10 for an example of how this would work.

8This is currently under construction at https://wiki.apertium.org/wiki/List_of_symbols. The documentation of the Apertium tagset on that page has been formatted so as to be easily extracted by categories. In addition, many of the tags are listed with the corresponding Universal Features from the Universal Dependencies project (Nivre et al. 2020) to facilitate semi-automated mapping between languages using different subsets of a particular category.
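A stripped-down sketch of the propagation idea (hypothetical tree representation; it handles a single tag and omits the comparison of competing paths):

```python
def propagate_tag(tree, tag):
    """Propagate `tag` from words that have it to words that lack it.

    Nodes are dicts: leaves have "word" and "tags" (a set); internal nodes
    have "children". A subtree containing the tag makes it available to
    sibling subtrees that need it, mirroring the up-then-down paths
    described above.
    """
    def has_source(node):
        if node.get("word"):
            return tag in node["tags"]
        return any(has_source(c) for c in node.get("children", []))

    def needs(node):
        if node.get("word"):
            return tag not in node["tags"]
        return any(needs(c) for c in node.get("children", []))

    def walk(node, incoming):
        if node.get("word"):
            if incoming:
                node["tags"].add(tag)
            return
        available = incoming or has_source(node)
        for c in node.get("children", []):
            walk(c, available and needs(c))

    walk(tree, False)
    return tree

# "Bob cantar": the singular tag on the noun propagates up through the
# sentence node and down onto the verb.
s = {"children": [{"word": "Bob",    "tags": {"np", "sg"}},
                  {"word": "cantar", "tags": {"vblex"}}]}
propagate_tag(s, "sg")
# the verb now carries "sg"
```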

Figure 10: An example of the process of deriving agreement rules using the English sentence "Bob sang the long songs." and the Spanish translation "Bob cantó las canciones largas." (a) A syntax tree of the English input sentence together with the Spanish generated by the bilingual dictionary. (b) The desired syntax tree for the Spanish sentence with differences from the bilingual dictionary output marked. (c) The paths along which agreement information would propagate through the tree according to the algorithm proposed. Dashed lines show actual paths; the dotted line is a possible but rejected path. Nouns are here shown automatically propagating 3rd person tags, which might work as a language-independent heuristic, or it could be user-specified in some way. As a result, the verb has multiple possible paths for getting the 3rd person tag, which could probably be resolved in this case by choosing the path along which it is also getting the singular tag. The conversion from <past> to <ifi> is not shown.

9 Conclusion

This thesis has presented the need for language technology support for minority languages and the advantages and disadvantages of various methods of producing that support. It has also presented the structure of the Apertium machine translation platform and a prototype system for automatically learning structural transfer rules from corpora, intended to contribute to Apertium's ability to address this need. The system is slow and resource-intensive, but this can likely be improved. The rules it generates are limited, but they result in a measurable improvement in translation quality and also have avenues for improvement. It seems quite possible that this system may in future be useful for reducing the effort needed to build new translation pairs.

References

Ayodhya case. 2019. Ayodhya case: 8 translators to take 120 days to translate nearly 11,500 pages in English. DNA India. https://www.dnaindia.com/india/report-ayodhya-case-8-translators-to-take-120-days-to-translate-nearly-11500-pages-in-english-2724410 (9 May, 2020).

Constitution of India. 1950. Constitution of India Part XVII. https://www.constitution.org/cons/india/p17348.html (9 May, 2020).

Dugast, Loïc, Jean Senellart & Philipp Koehn. 2008. Can we relearn an RBMT system? In Proceedings of the Third Workshop on Statistical Machine Translation (StatMT '08), 175-178. Columbus, Ohio: Association for Computational Linguistics. https://doi.org/10.3115/1626394.1626421.

Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez & Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25(2). 127-144.

Green, Spence, Jeffrey Heer & Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 439-448.

Groton, Anne. 2013. From alpha to omega: a beginning course in Classical Greek. Hackett Publishing Company.

Hanneman, Greg, Michelle Burroughs & Alon Lavie. 2011. A general-purpose rule extractor for SCFG-based machine translation. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, 135-144.

Khanna, Tanmai, Jonathan Washington, Francis Tyers, Sevilay Bayatlı, Daniel Swanson, Tommi Pirinen, Irene Tang & Hèctor Alòs i Font. 2021. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation.

Koehn, Philipp. 2005. Europarl: a parallel corpus for statistical machine translation. In MT Summit, vol. 5, 79-86.

Kornai, András. 2013. Digital language death. PLOS ONE 8(10). e77056. https://doi.org/10.1371/journal.pone.0077056.

Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers & Daniel Zeman. 2020. Universal Dependencies v2: an ever-growing multilingual treebank collection. arXiv:2004.10643 [cs], 4027-4036.

Östling, Robert & Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics 106. 125-146. http://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf.

Sánchez-Cartagena, Víctor M., Felipe Sánchez-Martínez & Juan Antonio Pérez-Ortiz. 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In Machine Translation Summit.

Solan, Zach, David Horn, Eytan Ruppin & Shimon Edelman. 2004. Unsupervised context sensitive language acquisition from a large corpus. In S. Thrun, L. K. Saul & B. Schölkopf (eds.), Advances in Neural Information Processing Systems, 961-968. MIT Press. http://papers.nips.cc/paper/2467-unsupervised-context-sensitive-language-acquisition-from-a-large-corpus.pdf.

Strubell, Emma, Ananya Ganesh & Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv:1906.02243 [cs]. http://arxiv.org/abs/1906.02243.

Swanson, Daniel G., Jonathan N. Washington, Francis M. Tyers & Mikel L. Forcada. 2021. A tree-based structural transfer module for the Apertium machine translation platform. Machine Translation.

Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes & Jeffrey Dean. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144 [cs]. http://arxiv.org/abs/1609.08144.
