A Prototype System for Learning Structural Transfer Rules in the

Apertium Machine Translation Platform

Daniel Swanson

May 4, 2021

Abstract

This thesis develops a prototype system for automatically generating structural transfer rules for the Apertium platform in order to streamline the process of constructing language technology for the benefit of under-privileged language communities. The current state of the system is described and found to be potentially useful though not yet production-ready; the current primary limitations are identified, and potential resolutions to these issues are discussed.

1 Introduction

In this thesis, I attempt to develop a tool for extracting syntactic information from relatively small corpora to enable more rapid development of translation systems for lower-resourced languages. The results are limited but encouraging, and there are multiple avenues available for immediate improvement.

Section 2 describes the benefits of creating such systems. Section 3 discusses the various approaches to machine translation and their benefits and drawbacks. Section 4 describes Apertium, the machine translation system that the tools from this thesis are particularly designed to work with. Section 5 describes various ways of combining different translation approaches for better results. Section 6 explains a method for extracting syntactic rules. Section 7 evaluates, Section 8 suggests next steps for extending this work, and Section 9 concludes.

2 Computational Linguistics and Language Endangerment

Computational tools for language are becoming increasingly important for both linguistic research and language revitalization. Kornai (2013) argues that lack of digital presence and computer support is very likely to lead to the endangerment of the vast majority of the world's languages, even those that have large speaker populations and don't fit the other typical criteria for being considered endangered (such as not being passed on to children).

As the world becomes more digital, speakers of minority languages will make greater and greater use of the internet. At present, computer support for minority languages is quite limited. Computer support for a language includes things like appropriate keyboard layouts (if the orthography is not compatible with a more widely-supported language such as English or Russian), spellcheck, autocorrect, localized interfaces (that is, having the menus and buttons and so forth of the operating system and various pieces of software in the minority language), and machine translation to and from more widely-used languages (ideally including both regional trade languages and globally dominant languages such as English). Digital presence, meanwhile, includes social media, blogs, Wikipedia, and other online materials available in that language.

When these resources are lacking, speakers of minority languages can conclude that their languages do not belong on the internet or in other digital realms. Given the growing importance of the internet and related technologies, this leads them to view their own language as inferior in general. This leads, particularly in younger generations, to abandonment of heritage languages in favor of what Kornai calls "digitally ascendant" languages, thus hastening the process of language endangerment.

An instance where machine translation systems can be especially useful is in domains like medicine and law. An example of this is in India where, according to the constitution, the Supreme Court only operates in

English (Constitution of India 1950). This creates problems when the lower courts hear cases in regional languages and documents from those courts must then be translated into English. For example, in one land dispute, there were more than 11,000 pages that had to be translated from 16 languages into English

(Ayodhya case 2019). In a case like this, manual translation can take months and is thus quite expensive.

With a machine translation system, however, draft translations can be produced immediately, and then the human translators only need to check them, which is a much faster and cheaper process.1

Fortunately, remedying this lack of support is eminently possible. A machine translation system can potentially be built in a matter of months by an individual or a small team and can then both serve as a resource in itself and also be used to directly generate various other kinds of resources. It helps with lack

1 Thank you to Tanmai Khanna on the Apertium IRC chat for pointing me to this example.

of content as it can be used to generate drafts of software interfaces and Wikipedia articles, which can then be corrected by native speakers rather than requiring them to start from scratch. If the translation system has morphological processing as a separate step of the translation, this component can often be converted into a working spellchecker and sometimes even a full autocorrect system with minimal effort.

3 Approaches to Machine Translation

Machine translation systems can generally be classified as either rule-based or corpus-based, though overlap is possible. Systems using neural networks and deep learning have become widely used in recent years and they are often listed as a separate category, but for the purposes of this thesis it is sufficient to consider them as a subset of corpus-based systems.

Rule-based systems use dictionaries and lists of rules,2 which are often hand-written, to transform input

(usually text) from one language to another. The structure of one such system is described in Section 4.

Corpus-based systems, on the other hand, generally take some amount of corresponding text in multiple languages and attempt to derive general mappings between them without human input.

Both of these approaches have significant limitations in terms of the resources required to construct them.

Building rule-based systems frequently requires someone who is familiar with the languages in question, the translation software, and likely some experience with linguistic analysis, and who is willing to work on implementing a translation system for months or years.

Corpus-based systems, on the other hand, can be built fairly quickly and without a significant amount of background knowledge. However, extracting useful generalizations from a corpus without human intervention requires a prohibitively large amount of text, often in the realm of millions of words of sentence-aligned bilingual text - that is, a document and a translation such that each line or sentence in one corresponds to exactly one line or sentence in the other - or even word-aligned text, where correspondences between individual words have been specified. Google, for example, when describing their current neural model, indicates that their training data involved hundreds of millions of sentence pairs (Wu et al. 2016). By contrast, it is nearly impossible to find more than a few thousand words online for most languages (Kornai

2013). Resources on such a scale rarely exist and quite often there are simply not enough translators to produce them.
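The two corpus formats described above can be illustrated with a short sketch. This is not code from the thesis; the file layout and the "0-0 1-1"-style alignment notation are common conventions assumed here for illustration (the index-pair format is the one used by tools such as eflomal, discussed in Section 6.2).

```python
# Sketch: load a sentence-aligned bilingual corpus stored as two parallel
# text files, where line i of one file is a translation of line i of the
# other. The filenames and formats are illustrative assumptions.

def load_bitext(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

# Word-aligned text additionally records index pairs, e.g. "0-0 1-1 2-3",
# meaning source word 0 aligns to target word 0, source word 2 to target
# word 3, and so on.
def parse_alignment_line(line):
    return [tuple(int(i) for i in pair.split("-"))
            for pair in line.split()]
```
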

2 Strictly speaking, this refers to linguistic generalizations which typically take the form of either constraints or mappings between strings or structures, rather than the sequences of operations that "rules" typically refers to. Nonetheless, as the term "rule-based" is standard, I will continue referring to them as "rules" throughout.

[Diagram: SL text → … → discontiguous multiword assembly → anaphora resolution → discontiguous multiword disassembly → shallow structural transfer / recursive structural transfer → … → TL text]

Figure 1: The standard components of the pipeline for the translation system Apertium. Note that many translation pairs do not use every module. Retrieved from http://wiki.apertium.org/wiki/File:Apertium_system_architecture.png.

As a result, although corpus-based approaches generally deliver better results when sufficiently trained, they are only available for a small handful of languages. In addition, training such models is computationally intensive and can require access to thousands of dollars worth of computer hardware while emitting dozens or hundreds of tons of CO2 (Strubell, Ganesh & McCallum 2019). In contrast, rule-based systems can ordinarily be compiled and tested on a typical home computer. Rule-based systems have the added benefits of allowing developers to fix specific errors and having rules that can potentially serve as language documentation, in contrast to corpus-based models which typically produce statistical black boxes.

Hybrid systems which attempt to combine the benefits of both approaches also exist and will be discussed in more detail in Section 5. This thesis explores a particular way of hybridizing the generally rule-based translation system Apertium, which is described in the next section.

4 The Apertium Machine Translation System

Apertium is a rule-based translation system originally developed at the Universitat d'Alacant to translate between the languages of the Iberian peninsula (Forcada et al. 2011, Khanna et al. 2021). Apertium is set up as a pipeline, meaning that the translation process is broken down into a number of stages (usually between 6 and 10), with each stage doing part of the work of translating each segment (which may be a word or phrase, depending on the location in the pipe) before passing it along to the next stage. Section 4.1 discusses the reasons for using this architecture and Section 4.2 describes the various components of the pipeline.

Input text
He took the big plant out.

Morphological analyzer
^He/He/prpers$ ^took/take$ ^the/the$ ^big/big$ ^plant/plant/plant/plant/plant$ ^out/out/out$^./.$

Morphological disambiguator
^prpers$ ^take$ ^the$ ^big$ ^plant$ ^out$^./.$

Discontiguous multiword processing
^prpers$ ^take# out$ ^the$ ^big$ ^plant$^./.$

Lexical transfer
^Prpers/Prpers$ ^take# out/sacar$ ^the/el$ ^big/grande$ ^plant/planta/fábrica/maquinaria$^./.$

Lexical selection
^Prpers/Prpers$ ^take# out/sacar$ ^the/el$ ^big/grande$ ^plant/planta$^./.$

Structural transfer
^sacar$ ^el$ ^planta$ ^grande$^.$

Morphological generator
Sacó la planta grande.

Figure 2: An example sentence at each stage of translation from English to Spanish.

4.1 Reasons for Using a Pipeline

There are a number of advantages to breaking up the system like this. First, it makes it much easier to reuse rules between different translation pairs. The clearest example is that the first two stages (morphological analysis and disambiguation) are the same for a given source language, regardless of the target language. Similarly, the last two stages (morphological generation and post-generation) are the same for any target language, regardless of the source language. As such, these components only need to be written once, saving a significant amount of effort when building other translation pairs that use those languages. Another way this can happen is if two related languages have different writing systems, in which case they would need separate translation dictionaries (because Apertium currently primarily processes text, most of the processes are to some degree dependent on orthography), but if they are sufficiently grammatically similar, it might be possible to copy the rules handling word order from one translation pair to another with only minimal changes.

Another advantage of the pipeline architecture is that breaking something down into pieces often means that the total number of rules involved is smaller and each rule is easier to work with. For example, if at one stage in the pipeline, combinations of verbs and prepositions that have special meanings in English are combined so that later stages can treat them as single words, then they can be simply looked up in the dictionary and the stage that handles syntax doesn't need to process them differently.

Finally, it is more efficient to use multiple stages because on a modern computer with multiple CPU cores, multiple stages can be operating at once, all working on different pieces of the text in parallel, leading to an overall faster translation time.

The primary downside of such a system is that developers need to learn how to write a much wider variety of rules, and it can often be somewhat complicated to determine what type of rule or stage of a pipeline will most effectively solve a particular translation issue.

4.2 Structure of the Pipeline

The structure of the Apertium pipeline is shown in Figure 1 and examples of the output of each stage are shown in Figure 2.

The stages of the Apertium pipeline are generally the following:

First, the morphological analyzer takes surface forms and converts them into lemmas (that is, roots or dictionary forms) and morphological information. Morphological information consists of abbreviations for various features, such as part of speech (required), number, gender, and tense, each enclosed in angle

brackets. Thus <n> is 'noun', <m> is 'masculine', <sg> is 'singular', and so on.
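As a rough illustration of this lemma-plus-tags notation, the sketch below splits a stream-format lexical unit into its surface form and candidate analyses. It is an illustrative reimplementation for this thesis's discussion, not Apertium's own parsing code, and it ignores escaping of special characters.

```python
import re

# Sketch: split an Apertium-style lexical unit "^surface/analysis1/analysis2$"
# into the surface form and a list of (lemma, tag-list) analyses.
# Illustrative only; real stream parsing also handles escaped characters.

def parse_lexical_unit(unit):
    body = unit.strip("^$")
    surface, *analyses = body.split("/")
    result = []
    for analysis in analyses:
        tags = re.findall(r"<([^<>]+)>", analysis)
        lemma = analysis[:analysis.index("<")] if "<" in analysis else analysis
        result.append((lemma, tags))
    return surface, result

# parse_lexical_unit("^plant/plant<n><sg>/plant<vblex><inf>$") yields the
# surface "plant" with a noun analysis and a verb analysis.
```
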

Next, the morphological disambiguator attempts to distinguish based on context between morphological analyses that have the same surface form. In the example in Figure 2, it must distinguish between

"plant" as either a noun or a verb.

At this point, some translation pairs have discontiguous multiword processing which deals with situations such as English phrasal verbs. It can look for sequences such as the English "take the book out of the box" and replace them with "take-out the book of the box" so that in subsequent stages "take-out" can be translated as if it were a single word, thus giving in Spanish "sacar" ('remove from a container') rather than "tomar fuera" ('carry outside'). Such replacements are generally unconditional (though incorporating context is possible), since the combined form will be a better translation in most cases, and if it isn't, the equivalent combined form (here "tomar-fuera") can be added as an alternate translation and the more applicable one can be decided later in the pipeline.

Following this, lexical transfer looks up each word in a bilingual dictionary, replacing the lemma but leaving inflectional information unchanged.

Then, lexical selection chooses between alternative translations, performing much the same task as the morphological disambiguator. The lexical selection module, however, is nearly always rule-based, while the morphological disambiguator is often at least partially statistical (though work has been done on generating lexical selection rules from corpora).

After that, structural transfer rearranges words and corrects any errors in inflectional information.

In the example, lexical transfer attached the tags <GD><ND>, "gender to be determined, number to be determined", to the determiner, which structural transfer changes to <f><sg>, "feminine, singular", to agree with the noun. Apertium has two transfer systems. The original one rearranges sequences of adjacent words, which is simple and fast, but becomes very difficult to use when dealing with larger syntactic differences.
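This kind of agreement operation can be sketched as a small helper that copies gender and number from the noun's tags into the determiner's placeholder positions. The function and tag sets below are illustrative assumptions, not the actual Apertium transfer formalism.

```python
# Sketch: resolve the placeholder tags GD (gender to be determined) and
# ND (number to be determined) on a determiner by copying the gender and
# number tags of the noun it agrees with. Hypothetical helper, not the
# actual Apertium rule language.

GENDERS = {"m", "f", "nt"}
NUMBERS = {"sg", "pl"}

def resolve_agreement(det_tags, noun_tags):
    gender = next((t for t in noun_tags if t in GENDERS), "GD")
    number = next((t for t in noun_tags if t in NUMBERS), "ND")
    return [gender if t == "GD" else number if t == "ND" else t
            for t in det_tags]

# e.g. resolve_agreement(["det", "GD", "ND"], ["n", "f", "sg"])
# yields ["det", "f", "sg"], matching the "el" -> "la" correction above.
```
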

The newer system (Swanson et al. 2021), meanwhile, builds a syntax tree of the input sentence, allowing for much larger changes.

Then the morphological generator converts from lemmas and tags back to surface forms. This is very similar to the morphological analyzer, and in fact the generator is essentially constructed by running a language's analyzer backwards, though there are slight differences. The primary distinction is that a generator must only produce one surface form. For instance, the English analyzer converts both "isn't" and

"ain't" to ~be+not$ but the generator only produces "isn't". 3

3 Issues arise at several points in the pipeline from the fact that there must be exactly one output but several are potentially correct. Some translation pairs deal with this by choosing a single variety (often the most common or most standardized written

Finally, the post-generator deals with orthographic changes between words, such as contracting "de el" to "del" in Spanish or "a" vs "an" in English.

Every step in this process is typically accomplished with dozens or hundreds of handwritten rules and thousands of dictionary entries. However, it is relatively straightforward to replace components of this pipeline with other programs which can process the data in any manner they please so long as they produce output in the correct format (that is, formatting data as ^lemma<tags>$ in the manner expected by components further down the pipeline). As a result, some translation pairs incorporate statistical components to replace or augment the rule-based ones. This and other methods of incorporating statistical knowledge will be discussed in the next section.
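Because every stage simply reads and writes this stream format, a replacement component can be a small filter program. The sketch below (a toy stage that uppercases each lexical unit, purely illustrative) shows the general shape such a drop-in module takes; a real replacement, such as a statistical disambiguator, would do its work inside process_unit().

```python
import re

# Sketch: a drop-in Apertium pipeline stage is just a filter over text
# containing "^...$" lexical units. This toy stage uppercases the contents
# of every unit; it ignores escaping of special characters.

def process_unit(match):
    unit = match.group(1)
    return "^" + unit.upper() + "$"

def run_stage(text):
    return re.sub(r"\^([^$]*)\$", process_unit, text)

# A real module would be wired into the pipe roughly as:
#     sys.stdout.write(run_stage(sys.stdin.read()))
```
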

5 Hybrid Translation Systems

There are two possible approaches to building a hybrid machine translation system. Section 5.1 discusses the option of having a pipeline in which some modules are rule-based and some are corpus-based. Then

Section 5.2 discusses the option of having a corpus-based approach generate rules which are then processed by a rule-based system.

5.1 Hybridizing by Replacing Components

Hybridizing a translation system by replacing one or more rule-based components with statistical components would decrease the number of rules involved because it eliminates the set of rules for the component that was removed without changing the rules required for any of the others.

A statistical component is doing less work than a full statistical system and so should be able to learn

from a smaller corpus.4 The reason for this is that there are fewer possible things for such a component to do. If a statistical program has to learn which of 5 possible things to do in a given situation, it will probably need many fewer examples than if there are 50.

For example, in the sentence in Figure 2, a statistical English-Spanish lexical selection module would only need to learn whether to choose "planta", "fábrica", or "maquinaria" as the proper translation of "plant".

A statistical system could conceivably derive useful generalizations for that from a few dozen examples.

variety). Others, however, allow the user to specify which version they wish to translate into, such as "Portuguese (current standard)" or "Portuguese prior to the 1990 spelling reform". A similar issue regarding inserting pronouns when one language doesn't have them has largely been dealt with by using "he" or some equivalent in all cases, though a proper solution based on determining what the pronoun refers to is in the process of being added to some translation pairs.

4 I'm not aware of any studies of how corpus size affects the accuracy of different approaches, but it seems reasonable that learning smaller, simpler transformations would require less data.

A program trying to derive full translations, meanwhile, would probably need significantly more examples, since it would also need to determine that these 3 words are all possible translations. And depending on the system, it might end up learning the plural forms separately, requiring a separate set of examples for those.

In languages with more complex morphology than English, that last effect is even more pronounced. In

Attic Greek, for example, a typical verb had over 600 possible forms (Groton 2013). Not having to learn the relationship between those will substantially decrease the number of examples required. This is particularly true of rarer forms that might not appear frequently enough to be learned on their own.

However, simply replacing rule-based components with statistical ones isn't the only method of using a corpus in a translation system.

5.2 Hybridizing by Learning Rules

Another potential approach to hybridizing translation systems is to modify the programs that learn patterns from corpora so that they output rules in the same format as the rule-based systems. This would allow humans to correct or expand on what the statistical system learns, allowing expert knowledge and corpus data to be combined in a single set of rules.

This combining process can also go the other direction, taking a set of rules and feeding them back into a statistical system, such as is done in Sánchez-Cartagena, Sánchez-Martínez & Pérez-Ortiz (2011) for structural transfer or Dugast, Senellart & Koehn (2008) for an entire translation system.

Both of these options enable a translation system to incorporate both corpora and expert knowledge.

However, an even further step in that direction would be to implement incremental learning, that is, to have a program that doesn't just learn rules from scratch but instead looks for extensions or corrections to an existing set of rules.

The differences between these approaches are not particularly important if a translation system is built once and then never modified. That is, if there is only one corpus and rules are only written on a single occasion, then it probably doesn't make very much difference what order the two pieces of the process happen in. However, if adding more hand-written rules and training on more corpus data are both operations that simply expand an existing set of rules, then, with incremental learning, both can happen arbitrarily many times and in any order, which enables continued improvement of a system as more potential developers become interested or as more text becomes available over time.

It is possible that such incremental learning methods could exploit the fact that they have access to the existing rules, which is more information than just the text by itself, in order to produce good systems

with less input text. However, even if they don't reduce the amount of data required, they still open up possibilities for decreasing the amount of effort required to produce that data, thus also decreasing the amount of effort needed to reach higher levels of accuracy.

In particular, it has been shown that editing the output of a machine translation system produces more consistent output than manual translation and requires significantly less time (Green, Heer & Manning 2013).

Thus the system could translate the first page of a document and the user could post-edit it. The system could then determine refinements to its rules while the user post-edits the second page. With each successive page translated, the quality of the initial translation would improve and the person doing the post-editing would have less work to do. This then creates a positive feedback loop where the larger a bilingual corpus is, the easier it is to usefully expand it.

Additionally, given such a system, it becomes possible for anyone who speaks the language to usefully contribute in any time increment from correcting a single sentence to translating dozens of pages without needing to be taught how to modify the rules themselves.

Having described hybridization in general, I will now describe the work of this thesis itself.

6 Learning Syntax Rules

The purpose of this thesis is to create a tool that can generate structural transfer rules for an Apertium translation pair compatible with the new recursive transfer module (Swanson et al. 2021) using corpus data and a syntactic description of at least one of the languages.

The general outline of the method is as follows: First, the sentences of each corpus are converted into syntax trees (Section 6.1). Second, correspondences are established between the nodes of each pair of trees

(Section 6.2). Third, these correspondences are converted into translation rules (Section 6.3). The code implementing this process can be found at https://github.com/mr-martian/apertium-recursive-learning.

6.1 Parsing The Sentences

There are a variety of ways that a corpus can be converted to syntax trees. The trees can be constructed by hand (for example, if the language in question was documented by a syntactician who produced a treebank) or an existing syntactic parser can be used (the structural transfer module from an Apertium translation pair that includes one of the languages in the corpus might work, though currently most of these use a method that doesn't generate trees). Alternatively, various methods of Grammar Induction have been devised, such as Solan et al. 2004, which attempt to derive syntactic structures of a language solely from raw text.

Regardless of how the trees are to be obtained, they are only actually needed for one of the languages in the corpus. The input to the process of establishing correspondences as described in the next section must be in the form of a tree, but the process will function if one of the trees has no internal structure. That is, given an English tree such as the one in Figure 3, then either of the Spanish trees in Figure 4 can be used.

[S [DP [Det The] [NP [N Lord]]] [VP [V is] [PP [Prep with] [DP [Pro you]]]]]

Figure 3: A possible syntax tree for the English sentence "The Lord is with you."

s s

DP VP Det N V pp A A I I I I Det NP V pp El Senor esta contigo

I I I I El N esta contigo

I Senor

Figur e 4: Two possible syntax trees for the Spanish sentence "El Senor esta contigo."

Since the transfer rules that this overall process generates can be used to generate trees, the output of the process can also then serve as the input to a later iteration. This means that this thesis can in principle be used in building translation pairs between arbitrarily many languages as long as a parser or treebank exists for at least one of them.

6.2 Establishing Correspondences

Once the parallel corpus has been converted into a set of syntax trees, the next step is to align the nodes within the trees.

The first step of this alignment is to establish correspondences between the individual words of the sentences. There are currently 3 methods available to do this. The first is to use an external program called a word-aligner. My code is currently set up to use the eflomal word-aligner (Östling & Tiedemann 2016).

The second option is to read alignment data from a file. This is useful if the alignments from eflomal have been corrected by a speaker of the languages in question. The third option is to use the bilingual dictionary.

For this method, each word in the source sentence is looked up in the dictionary. Any word in the target sentence that it could translate to is then listed as a potential correspondent.
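The dictionary method can be sketched as follows. The dictionary is represented here as a plain Python mapping from a word to its possible translations, an assumption made for illustration; the actual system reads an Apertium bilingual dictionary.

```python
# Sketch of the bilingual-dictionary alignment method: each source word is
# looked up, and every target-sentence position holding one of its possible
# translations is recorded as a potential correspondent. The dict-of-sets
# representation is an illustrative simplification.

def dictionary_align(src_words, tgt_words, bidix):
    alignments = {}  # source index -> set of candidate target indices
    for i, sw in enumerate(src_words):
        translations = bidix.get(sw, set())
        candidates = {j for j, tw in enumerate(tgt_words)
                      if tw in translations}
        if candidates:
            alignments[i] = candidates
    return alignments

bidix = {"plant": {"planta", "fabrica"}, "big": {"grande"}}
src = ["the", "big", "plant"]
tgt = ["la", "planta", "grande"]
# "the" has no entry and stays unaligned; "big" -> {2}; "plant" -> {1}
```
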

After word alignment is complete, some words, particularly function words such as articles and auxiliaries, may have no correspondents, and some, particularly if the dictionary approach is used, may have multiple.

If a word has multiple correspondents, it is currently treated as a list of possible correspondents, with the correct one being resolved in the next step. In the future, it may be useful to have a way of specifying that a particular word in one tree corresponds to multiple words in another tree for when one language uses a morphological alternation but the other uses a separate function word, but for the moment the function word is simply left unaligned.

In order to establish correspondences between the phrasal nodes of the trees, an algorithm closely based on Hanneman, Burroughs & Lavie 2011 is used.

A node A in one tree can correspond to a node B in another tree if every word in A is either not aligned or is aligned to a word in B. If A can correspond to B and B can also correspond to A, then an alignment is added between them. See Figure 5 for an example of the effect of this criterion.
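This two-way criterion can be sketched directly in code. Here each node is represented simply as the set of word indices it dominates, and the word alignment as a mapping from word indices in one tree to the set of aligned indices in the other; both representations are illustrative assumptions, not the thesis's actual data structures.

```python
# Sketch of the two-way node alignment criterion (after Hanneman,
# Burroughs & Lavie 2011). A node is a set of word indices; word_align maps
# a word index in one tree to the set of indices it aligns to in the other.

def can_correspond(node_a, node_b, word_align):
    """Every word in A must be unaligned or aligned only to words inside B."""
    return all(word_align.get(w, set()) <= node_b for w in node_a)

def nodes_aligned(node_a, node_b, align_a_to_b, align_b_to_a):
    """A and B are aligned only if the criterion holds in both directions."""
    return (can_correspond(node_a, node_b, align_a_to_b)
            and can_correspond(node_b, node_a, align_b_to_a))
```

With the Figure 5 sentences word-aligned one-to-one, the two APs (words 3-4 on each side) pass in both directions, while the English AP {3, 4} against the Spanish VP {2, 3, 4} fails in the VP-to-AP direction because "era" (word 2) is aligned outside the AP.
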

For any given node, the tree-aligner might return no alignments, such as in cases like Figure 6a, where a constituent on one side corresponds to pieces of multiple constituents on the other side. This is often a result of bracketing differences. In the example given in Figure 6a, there is a prepositional phrase that could plausibly attach to either a noun or a verb and the two trees have it in different places, likely because the annotators understood it differently. There does not appear to be a general way to solve this issue, so such instances are simply ignored, potentially leading to a less complete set of rules.

In other contexts, a single node may have multiple alignments, as shown in Figure 6b. One reason this can happen is that a node only has one child and thus both parent and child contain precisely the same sequence of words. Another possibility is that one language requires function words which are not present

in the other language, such as the Attic Greek word "χώρᾳ", which corresponds to the English phrase "to a land". In such cases the aligner outputs all possibilities on the assumption that the rule-generator can determine which alignment is most useful by comparing with other sentences in the corpus.

Figure 5: Examples of the effect of the two-way alignment criterion on aligning various constituents of the Spanish sentence "La serpiente era muy astuta." with English "The serpent was very astute." (a) The English AP cannot be aligned to the Spanish VP because the VP contains "era", which is aligned to a word outside the AP. (b) The Spanish AP cannot be aligned to the English VP because the VP includes "was", which is aligned to a word outside the AP. (c) The two APs can be aligned because neither of them contains words aligned outside the other.

Once all the alignments that fit these criteria have been found, the aligner attempts to align parts of constituents by creating "virtual" nodes. These are nodes which correspond to 2 or more adjacent children of a node, so that adding them never breaks the original constituent structure. An example of inserting virtual nodes is shown in Figure 7. The program first tries to create virtual nodes to align with each remaining unaligned input node. If this fails, as it would in Figure 8, it will be impossible to completely align the tree.

An incomplete alignment will not prevent rules from being generated, though it will prevent the generator from producing rules which can entirely account for the pair of sentences in question.
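Enumerating the candidate virtual nodes for a parent can be sketched as listing every run of two or more adjacent children; because each span stays inside the parent's boundaries, inserting it never breaks the existing constituent structure. The list-and-tuple representation is an illustrative assumption.

```python
# Sketch: candidate "virtual" nodes for a parent with the given children.
# Each candidate covers 2 or more adjacent children but not all of them
# (the full span would simply duplicate the parent).

def virtual_nodes(children):
    spans = []
    for start in range(len(children)):
        for end in range(start + 2, len(children) + 1):
            if end - start < len(children):
                spans.append(tuple(children[start:end]))
    return spans

# For a VP with children ["V", "DP", "PP"], the candidates are
# ("V", "DP") and ("DP", "PP").
```
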

This ability to insert virtual nodes is the reason that one of the trees can be completely flat, as the aligner can induce the structure of the flat tree from the non-flat one. Unfortunately, preliminary experiments indicate that this works very badly if both trees are flat because the aligner is then only looking for adjacent segments and all the actual constituents it finds are overwhelmed by the noise of things that merely happen

to be adjacent. This feature is ideal for translating between a high-resource language and a low-resource language, as the tools or data which exist for the former can thus make up for the lack of equivalent resources in the latter.

Figure 6: Two instances where a one-to-one alignment of syntactic nodes is not possible. (a) The Spanish PP "en una caja" has been interpreted as a modifier of the noun, while the English PP has been interpreted as a modifier of the verb. The result is that the Spanish NP "gato negro en una caja" cannot be aligned with any single constituent in the English tree. A complete alignment of the two trees is thus impossible in this case. (b) An instance where a node has multiple alignments. The Greek PP satisfies the alignment criterion with any of the three English nodes shown. Since the English words "to a" do not correspond to separate words on the Greek side, they are unaligned and thus can be ignored for the purposes of aligning their parents.

6.3 Generating Rules

Having produced a set of aligned trees, the next step is to convert these alignments into transfer rules.

The type of structural transfer rules being generated here apply in a two-step process. In the input step, the rule matches a sequence of nodes, which may be either words or larger units generated by other rules, and then combines them into a single syntactic node. These syntactic nodes keep track of which rule generated them, and once all possible rules have been applied, each node uses the corresponding output step to determine how its constituents should be modified and rearranged.

Figure 7: In order to complete the alignment of the two trees, a virtual node is inserted in the Spanish tree.

Every node alignment from the previous section is directly convertible into the outline of a structural transfer rule. The children of the source node give the pattern to be matched, the source node is what they combine into, and the alignments between the source children and the target children give the rearrangement that should be output, with any unaligned children on the target side being inserted. Unaligned children on the source side will not be mentioned in the output pattern and thus will be deleted. See Figure 9 for an example of the rules produced.
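As a rough illustration of that conversion (the data structures here are hypothetical stand-ins, not the generator's own), one node alignment yields a pattern from the source children and an output sequence mixing references to matched children with inserted target material:

```python
def rule_outline(src_children, tgt_children, alignment):
    """Sketch of converting one node alignment into a rule outline.

    alignment maps source-child index -> target-child index. The source
    children give the pattern; the target order gives the output, with
    unaligned target children inserted literally and unaligned source
    children simply never referenced (and thus deleted).
    """
    pattern = [c["label"] for c in src_children]
    aligned = {tgt: src for src, tgt in alignment.items()}
    output = []
    for j, tgt in enumerate(tgt_children):
        if j in aligned:
            output.append(aligned[j] + 1)   # 1-based reference to a source child
        else:
            output.append(tgt["lemma"])     # unaligned target child: insertion
    return pattern, output

# Modeled loosely on Figure 9(b): a single Greek verb aligns to the second
# English child, and English "have" is inserted before it.
src = [{"label": "vblex", "lemma": "grapho"}]
tgt = [{"label": "vbhaver", "lemma": "have"},
       {"label": "vblex", "lemma": "write"}]
pattern, output = rule_outline(src, tgt, {0: 1})
# pattern == ["vblex"], output == ["have", 1]
```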

Structural transfer rules are also responsible for modifying morphological information, particularly agreement. Deriving this information is currently not attempted, though see Section 8.2 for a possible method.

Unfortunately, the rules generated in this manner are sometimes in conflict with each other, particularly if a node has multiple alignments, leading to a pair of rules that both match the same input but produce different outputs. Currently, this is dealt with by weighting various rules by how often they appear in the corpus, with the result being that all of them appear in the output but, when applying them, only the most common option will be selected. See Section 8.1 for some possible better ways to handle this.
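A minimal sketch of this weighting scheme (hypothetical data shapes, not the actual implementation): all observed rules are kept, but for each pattern only the most frequently observed output is selected at application time.

```python
from collections import Counter

def resolve_conflicts(rule_instances):
    """rule_instances: list of (pattern, output) pairs observed in the corpus.

    Every rule is retained with a weight (its corpus count), but for each
    pattern only the most frequent output will actually be chosen.
    """
    counts = Counter(rule_instances)
    best = {}
    for (pattern, output), n in counts.items():
        if pattern not in best or n > counts[(pattern, best[pattern])]:
            best[pattern] = output
    return best

# Two conflicting rules for the same pattern; the one seen 3 times wins
# over the one seen twice.
conflicting = [("NP: n adj", "2 1")] * 3 + [("NP: n adj", "1 2")] * 2
chosen = resolve_conflicts(conflicting)
# chosen["NP: n adj"] == "2 1"
```

Note that, as discussed in Section 8.1, this tie-breaks arbitrarily when two outputs occur equally often.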

7 Results

I evaluated the rule-learning program by generating translation rules from Spanish to English. The Apertium English-Spanish bilingual dictionary5 was used both for translation and for alignment, a set of transfer rules from Spanish to Catalan6 was used to generate the Spanish trees, and the English-Spanish Europarl corpus (Koehn 2005) was used for training and evaluation.

5https://github.com/apertium/apertium-eng-spa/blob/master/apertium-eng-spa.eng-spa.dix

Figure 8: An instance where a virtual node would be aligned to another virtual node. Neither the highlighted Spanish NP nor the highlighted English VP can be aligned without breaking constituent boundaries. As a result, a virtual node will be inserted on each side, one for "un gato" and another for "a cat".

To mitigate present inefficiencies in the code, sentences from the corpus that were more than 200 characters long were excluded. Of the remainder, the first 200 lines were used for evaluation and varying numbers of subsequent lines were passed to the rule-learner to see how it handled different amounts of data. Since the rule-learner does not currently attempt to handle morphological information, all tags other than part of speech (POS) were removed and translations were evaluated by comparing lemmas and POS tags only.

The accuracy of the translation was measured using Word Error Rate, which is the Levenshtein edit distance from the output to the reference translation divided by the number of words in the reference.
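As a concrete reference for the metric, here is a minimal sketch of Word Error Rate over whitespace-separated tokens (an illustration of the definition above, not the evaluation script itself):

```python
def word_error_rate(hyp, ref):
    """Levenshtein edit distance between token sequences, divided by the
    number of tokens in the reference."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits needed to turn h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(h)][len(r)] / len(r)

# One substitution against a four-word reference gives WER 0.25.
wer = word_error_rate("the cat sat down", "the dog sat down")
```

In the experiments here the "words" compared are lemma+POS tokens, since all other tags were stripped.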

These results were compared against the existing Apertium English-Spanish translation pair and against the bilingual dictionary alone. The results are shown in Table 1.

On this data, the total difference between a large set of hand-written transfer rules (nearly 150) and no structural transfer at all is only 5 percentage points. This suggests that the bilingual dictionary and the lexical selection module (which only has 4 rules in this case) are having a significantly larger effect on the result than structural transfer.

6https://github.com/apertium/apertium-spa-cat/blob/recursive-spa-cat/apertium-spa-cat.spa-cat.rtx

Figure 9: An example of deriving a rule from a pair of aligned nodes. The verb "γέγραφεν" in the Attic Greek sentence "ἡ δέσποινα τὴν ἐπιστολὴν γέγραφεν." corresponds to the phrase "has written" in the English "The lady has written the letter." This gives rise to the two rules given. (a) The trees of the two sentences with the V nodes aligned. (b) The Greek-English rule V -> vblex { have@vbhaver _ 1};, which says that a V node in Greek is composed of a lexical verb (vblex) and is rendered in English as the modal verb "have" (have@vbhaver) followed by a space (_) and the translation of the Greek verb (1). (c) The English-Greek rule V -> vbhaver vblex { 2};, which says that a V node in English is composed of a modal verb (vbhaver) followed by a lexical verb (vblex) and is rendered in Greek as the translation of the lexical verb (2). Note that neither rule currently deals with the fact that the English uses the past participle while the Greek uses the perfect active indicative. See Section 8.2 for further discussion.

Nonetheless, the rule-learner does have a measurable positive impact of nearly 1.2 percentage points, which is a little less than a quarter of the way to the hand-written system. Further, there is a very slight improvement with increasing corpus size, suggesting that a larger corpus might be able to produce useful results. (Unfortunately, the code is currently too slow and memory-intensive to reasonably test larger corpora.)

7.1 Analysis of Generated Rules

On average, just over a third of each set of rules is for the sentence node. These range from "match a clause and a punctuation mark and output them in the same order" to one which has the pattern "noun phrase, prepositional phrase, noun phrase, comma, verb phrase, left parenthesis, unknown word, noun phrase, right parenthesis, punctuation". This probably happens because the Spanish parser is not able to fully process the input in all cases and the training script handles this by joining adjacent nodes into sentences.

Transfer System              Number of Rules   WER on 200 Sentences
apertium-eng-spa             148               69.29%
no transfer                  0                 74.29%
training on 250 sentences    50                73.13%
training on 500 sentences    69                73.04%
training on 750 sentences    90                73.04%
training on 1000 sentences   112               73.04%

Table 1: Evaluations of the various translation systems. "apertium-eng-spa" is the standard Apertium Spanish-English translation system, "no transfer" is the bilingual dictionary and lexical selection rules without any transfer, and the other lines are the transfer rules generated on inputs of 250, 500, 750, and 1000 lines, respectively. The translation metric is Word Error Rate. Lower numbers indicate better translations.

The syntactic nodes used in the Spanish-Catalan transfer rules for which rules are produced by the learning script are adjective phrase, adverbial phrase, noun phrase, prepositional phrase, relative clause, and verb phrase.

For all of these (with the exception of relative clause, which never has more than 3 rules generated), the generated rules include simple and expected ones like "a noun phrase can consist of a noun" or "a clause can consist of a noun phrase and a verb phrase".

Somewhat more common than this, however, are rules with plausible patterns, such as "a noun phrase can consist of a noun phrase and an adjective phrase", but with one or more of the outputs deleted on the English side, such as "a noun phrase containing a noun phrase and an adjective phrase translates to English as just a noun phrase". I suspect that this is mainly due to the word-aligner failing to connect the adjectives in the two trees, and that a better alignment, whether from more data or from a different alignment method, would solve this issue. In the other direction, there are several rules which insert words. Most of these are function words ("be", "have", "that", "there"), suggesting that they, too, are alignment failures.

Finally, there are a handful of cases where the two alignment issues just mentioned cancel out and produce rules such as SPrep -> SPrep SN { "of"@pr _ 2 };7. In the Spanish-Catalan rules, SPrep can be either a single preposition or a prepositional phrase, so the preposition failed to align and is being deleted and reinserted, resulting in a rule that is sometimes correct but for the wrong reasons. Investigating why this has SPrep in the pattern rather than pr (the standard label for prepositions) led to the further realization that the tree-aligner is occasionally failing to fully align the trees because it currently cannot insert virtual nodes consisting of a single word, which would otherwise allow it to better handle this situation.

7This rule can be read as "a Spanish SPrep (prepositional phrase) node is composed of an SPrep node and an SN (noun phrase) node and is output in English as the preposition 'of', a space, and the second input (the noun phrase)".

8 Future Work

There are a variety of things that could be done to improve the system. In terms of actually running the program, one of the most important is making the various components operate more efficiently or in parallel.

Some of the code dealing with virtual nodes, for instance, currently processes all possible arrangements, which can lead to a combinatorial explosion. In addition, most of the steps of the process currently hold the whole corpus in memory but could probably be modified so as to only operate on a segment of the corpus at a time, decreasing memory usage substantially.

Beyond this, I see two main areas of improvement. Section 8.1 discusses how to make rule conflict resolution less naive and Section 8.2 discusses how to incorporate morphological information.

8.1 Resolving Rule Conflicts

The current method of dealing with conflicting rules is to weight them by how often each occurs in the corpus. Unfortunately, this means that if two rules occur equally often, then which one gets chosen will be more or less random.

A better way would be to look at the instances of each alignment and see whether some feature or combination of features was correlated with one or the other: for example, if some variation were triggered by the presence of a particular preposition, or by transitive verbs but not intransitive ones, or certain noun cases but not others.

One limitation of this approach is that there are many such features that could be checked, and if combinations of them are tested it could lead to combinatorial explosion, making the process much slower and potentially pushing it beyond the limits of the user's hardware. As a result, it would likely be helpful to give the user the option of limiting the search to certain features or combinations, or of specifying that some features should be preferred (for example, telling the program to make sure that differences can't be explained by animacy before it tries to explain them by what prepositions are involved). If such a filter affected the final output, then that would indicate that multiple explanations fit the data equally well, likely indicating a need for more data. On the other hand, if the user had some idea where the two languages used different structures, then this would simply be a way to speed up the process by giving the program more data.
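The simplest version of this idea can be sketched as follows (hypothetical feature dictionaries, single features only, no combinations): for each conflicting rule, collect the features of every instance and keep the features whose value perfectly predicts which output was used.

```python
def correlated_features(instances):
    """instances: list of (features, chosen_output) pairs, where features
    is a dict such as {"prep": "de", "case": "gen"}.

    Returns the feature names whose value perfectly predicts which output
    was used. This is deliberately naive: real use would need feature
    combinations and the user-supplied preferences discussed above.
    """
    keys = set().union(*(feats.keys() for feats, _ in instances))
    predictive = []
    for key in keys:
        mapping = {}
        consistent = True
        for feats, out in instances:
            value = feats.get(key)
            if mapping.setdefault(value, out) != out:
                consistent = False
                break
        if consistent:
            predictive.append(key)
    return predictive

# "prep" predicts which rule applied; "case" does not.
data = [({"prep": "de", "case": "gen"}, "rule_a"),
        ({"prep": "a",  "case": "gen"}, "rule_b"),
        ({"prep": "de", "case": "dat"}, "rule_a")]
found = correlated_features(data)
# found == ["prep"]
```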

Another approach would be to generate multiple sets of rules and then apply them to a corpus and evaluate the results. This, of course, also has the danger of combinatorial explosion, besides the added time of running all the rules.

The number of tests that need to be performed could also be kept small by assuming that each set has a more or less independent effect, which would remove the possibility of combinatorial explosion. This assumption should be tested, though, to determine whether the effect of different rules on each other is small enough to be ignored in this way.

When dealing with corpora where both sides can be parsed, some conflicts could also potentially be resolved by preferring alignments where node labels correspond. So, for example, in Figure 6b, where a prepositional phrase consisting of a single case-marked noun is aligned with a prepositional phrase consisting of a preposition, a determiner, and a noun, the alignment between the two prepositional phrases would be chosen rather than the alignment of the prepositional phrase with either the determiner phrase or the noun phrase.

8.2 Extracting Agreement Rules

Once the rules to build and rearrange the tree have been generated, the morphological dictionaries of the two languages (which are compiled into the morphological analyzers and generators) can be used to extract all possible morphological patterns. Since tags are moderately standardized across different languages in Apertium, determining systematic changes of tags (such as marking different sets of tenses) can be worked out more or less automatically. For example, Spanish uses <pri> for present indicative, <prs> for present subjunctive, <ifi> for past indicative, and <pii> for past imperfect indicative, among others. English, on the other hand, uses <past> for past tense and <pres> for present tense. Given machine-readable documentation of these tags8, it should be possible to determine that, when mapping from Spanish to English, <ifi> and <pii> should map to <past>, and <pri> and <prs> should map to <pres>, with the proper auxiliaries being determined by the methods discussed in Section 8.1.
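Assuming tag documentation of roughly this shape, the mapping step could be sketched as below; the dictionaries here are illustrative stand-ins for the machine-readable documentation, and the category labels are invented for the example.

```python
# Hypothetical machine-readable tag documentation: each tag is annotated
# with the category it marks. The category strings are illustrative only.
docs_spa = {"pri": "tense:pres", "prs": "tense:pres",
            "ifi": "tense:past", "pii": "tense:past"}
docs_eng = {"pres": "tense:pres", "past": "tense:past"}

# Map each Spanish tag to every English tag documenting the same category.
mapping = {spa: [eng for eng, eng_cat in docs_eng.items() if eng_cat == cat]
           for spa, cat in docs_spa.items()}
# e.g. mapping["ifi"] == ["past"] and mapping["pri"] == ["pres"]
```

When a source language distinguishes more categories than the target (as Spanish does here), several source tags collapse onto one target tag; the reverse direction would instead yield multiple candidates, to be resolved by the methods of Section 8.1.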

Agreement, meanwhile, can be determined by finding cases where one word has a certain tag before structural transfer but multiple words have it after. Then the possible paths from the word that has it to the words that don't could be calculated as follows: any node which has a descendant that needs the tag propagates it to the relevant immediate child, and any node that has a tag needed by a non-descendant node propagates that tag to its parent. If this results in multiple potential paths for a particular word to get the tag that it is missing, the paths can be compared with those generated by other examples, or heuristics (either language-independent or potentially user-specified) can be applied. See Figure 10 for an example of how this would work.

8This is currently under construction at https://wiki.apertium.org/wiki/List_of_symbols. The documentation of the Apertium tagset on that page has been formatted so as to be easily extracted by categories. In addition, many of the tags are listed with the corresponding Universal Features from the Universal Dependencies project (Nivre et al. 2020) to facilitate semi-automated mapping between languages using different subsets of a particular category.
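A stripped-down sketch of the propagation idea (hypothetical tree representation; it handles a single tag and omits the comparison of competing paths):

```python
def propagate_tag(tree, tag):
    """Propagate `tag` from words that have it to words that lack it.

    Nodes are dicts: leaves have "word" and "tags" (a set); internal nodes
    have "children". A subtree containing the tag makes it available to
    sibling subtrees that need it, mirroring the up-then-down paths
    described above.
    """
    def has_source(node):
        if node.get("word"):
            return tag in node["tags"]
        return any(has_source(c) for c in node.get("children", []))

    def needs(node):
        if node.get("word"):
            return tag not in node["tags"]
        return any(needs(c) for c in node.get("children", []))

    def walk(node, incoming):
        if node.get("word"):
            if incoming:
                node["tags"].add(tag)
            return
        available = incoming or has_source(node)
        for c in node.get("children", []):
            walk(c, available and needs(c))

    walk(tree, False)
    return tree

# "Bob cantar": the singular tag on the noun propagates up through the
# sentence node and down onto the verb.
s = {"children": [{"word": "Bob",    "tags": {"np", "sg"}},
                  {"word": "cantar", "tags": {"vblex"}}]}
propagate_tag(s, "sg")
# the verb now carries "sg"
```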

Figure 10: An example of the process of deriving agreement rules using the English sentence "Bob sang the long songs." and the Spanish translation "Bob cantó las canciones largas." (a) A syntax tree of the English input sentence together with the Spanish generated by the bilingual dictionary. (b) The desired syntax tree for the Spanish sentence with differences from the bilingual dictionary output marked. (c) The paths along which agreement information would propagate through the tree according to the algorithm proposed. Dashed lines show actual paths; the dotted line is a possible but rejected path. Nouns are here shown automatically propagating 3rd person tags, which might work as a language-independent heuristic, or it could be user-specified in some way. As a result, the verb has multiple possible paths for getting the 3rd person tag, which could probably be resolved in this case by choosing the path along which it is also getting the singular tag. The conversion from <past> to <ifi> is not shown.

9 Conclusion

This thesis has presented the need for language technology support for minority languages and the advantages and disadvantages of various methods of producing that support. It has also presented the structure of the Apertium machine translation platform and a prototype system for automatically learning structural transfer rules from corpora, intended to contribute to Apertium's ability to address this need. The system is slow and resource-intensive, but this can likely be improved. The rules it generates are limited, but they result in a measurable improvement in translation quality and also have avenues for improvement. It seems quite possible that this system may in future be useful for reducing the effort needed to build new translation pairs.

References

Ayodhya case. 2019. Ayodhya case: 8 translators to take 120 days to translate nearly 11,500 pages in English. DNA India. https://www.dnaindia.com/india/report-ayodhya-case-8-translators-to-take-120-days-to-translate-nearly-11500-pages-in-english-2724410 (9 May, 2020).

Constitution of India. 1950. Constitution of India Part XVII. https://www.constitution.org/cons/india/p17348.html (9 May, 2020).

Dugast, Loïc, Jean Senellart & Philipp Koehn. 2008. Can we relearn an RBMT system? In Proceedings of the Third Workshop on Statistical Machine Translation (StatMT '08), 175-178. Columbus, Ohio: Association for Computational Linguistics. https://doi.org/10.3115/1626394.1626421.

Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez & Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25(2). 127-144.

Green, Spence, Jeffrey Heer & Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 439-448.

Groton, Anne. 2013. From alpha to omega: a beginning course in Classical Greek. Hackett Publishing Company.

Hanneman, Greg, Michelle Burroughs & Alon Lavie. 2011. A general-purpose rule extractor for SCFG-based machine translation. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, 135-144.

Khanna, Tanmai, Jonathan Washington, Francis Tyers, Sevilay Bayatlı, Daniel Swanson, Tommi Pirinen, Irene Tang & Hèctor Alòs i Font. 2021. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation.

Koehn, Philipp. 2005. Europarl: a parallel corpus for statistical machine translation. In MT Summit, vol. 5, 79-86.

Kornai, András. 2013. Digital language death. PLOS ONE 8(10). e77056. https://doi.org/10.1371/journal.pone.0077056.

Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers & Daniel Zeman. 2020. Universal Dependencies v2: an ever-growing multilingual treebank collection. arXiv:2004.10643 [cs], 4027-4036.

Östling, Robert & Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics 106. 125-146. http://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf.

Sánchez-Cartagena, Víctor M., Felipe Sánchez-Martínez & Juan Antonio Pérez-Ortiz. 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In Machine Translation Summit.

Solan, Zach, David Horn, Eytan Ruppin & Shimon Edelman. 2004. Unsupervised context sensitive language acquisition from a large corpus. In S. Thrun, L. K. Saul & B. Schölkopf (eds.), Advances in Neural Information Processing Systems, 961-968. MIT Press. http://papers.nips.cc/paper/2467-unsupervised-context-sensitive-language-acquisition-from-a-large-corpus.pdf.

Strubell, Emma, Ananya Ganesh & Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv:1906.02243 [cs]. http://arxiv.org/abs/1906.02243.

Swanson, Daniel G., Jonathan N. Washington, Francis M. Tyers & Mikel L. Forcada. 2021. A tree-based structural transfer module for the Apertium machine translation platform. Machine Translation.

Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes & Jeffrey Dean. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144 [cs]. http://arxiv.org/abs/1609.08144.
