
Target Preposition Selection – an Experiment with Transformation-Based Learning and Aligned Bilingual Data

Ebba Gustavii
Department of Linguistics and Philology, Uppsala University, Sweden
[email protected]

Abstract. The translation of prepositions is often considered one of the more difficult tasks within the field of machine translation. We describe an experiment using transformation-based learning to induce rules to select the appropriate target language preposition from aligned bilingual data. Results show an accuracy of 84.9%, to be compared with a baseline of 75.5%, where the most frequent translation alternative is always chosen.

1. Introduction

The selection of prepositions may be due to many factors, some of which are mainly idiosyncratic to the language in question, and some of which are dependent on the content that the prepositions contribute. In the field of machine translation, the translation of prepositions is thus often considered to be one of the more difficult issues, and often there are separate modules dedicated to that task.

The many dependencies, often lexical in nature, make it cumbersome, maybe even unfeasible, to manually identify and formalize the constraints necessary to translate prepositions appropriately. With the growing bulk of large parallel corpora, however, supervised machine-learning techniques may be used to facilitate the tedious work: either by revealing patterns hidden in the data, or, more directly, by using the techniques to generate classifiers selecting the appropriate preposition.

Here we will take the latter approach, and apply transformation-based learning to induce rules for correcting prepositions output by a rule-based machine translation system. Selectional constraints will be sought in the target language context. For training, however, solely aligned bilingual corpus data will be used, and one rule sequence will be induced for each source language preposition. Each classifier will be trained on target language prepositions actually being aligned to the respective source language preposition.

The paper is organized as follows: In the second section, we will look into the heterogeneous nature of prepositions and discuss some of its implications on the translation process. In the third section, we will briefly review some previous experiments on related tasks; we will specifically consider whether they have involved the use of aligned bilingual data or not. The fourth section will outline and motivate the main features of the current approach. In the fifth section, transformation-based learning will be introduced. The sixth section presents the actual experiment: the data and tools, the parameter settings and the choice of templates. Section seven is devoted to a presentation of the results. In the final section, some concluding remarks will be given.

2. How Prepositions Translate

Linguists often distinguish two types of prepositional uses: their functional use and their lexical use.[1] In its functional use, a preposition is governed by some other word, most often by a verb as in example 1, but sometimes by an adjective (afraid of), or a noun (belief in).

1. I believe in magic.

[1] Other labels that have been used for approximately the same distinction are: determined vs. non-determined, synsemantic vs. autosemantic, and non-predicative vs. predicative (Tseng, 2002).

112 EAMT 2005 Conference Proceedings Target Language Preposition Selection

The selection of a functional preposition is determined by the governor, and the preposition typically does not carry much semantic information. This is evident when comparing semantically similar verbs taking different prepositions, such as charge NP with NP, blame NP for NP, and accuse NP of NP. When translating a functional preposition, the identity of the source language preposition is thereby of less importance. Rather, the crucial information lies in the co-occurrence patterns of the target language.[2] Working from an interlingual perspective, Miller (1998) suggests that content-free prepositions, which roughly coincide with prepositions in their functional use, need not be represented at the interlingual level at all, but are better treated as a problem of generation. Within a corpus-based strategy, this would correspond to using only monolingual target data as corpus data.

In their lexical use, prepositions are not determined by some governing word, but are selected due to their meaning. In example 2, other prepositions than in are grammatically valid, e.g. under or beside, but these would alter the meaning of the utterance.

2. The rabbit is in the hat.

When translating a lexical preposition, the identity of the source language preposition, or rather the content it carries, is thus of importance; something which implies the need for bilingual data.

The best place to look for clues for the selection of a target preposition is evidently dependent on whether the source preposition is functional or lexical. The optimal strategy would thus be to treat functional and lexical prepositions differently. In practice, however, it turns out to be very difficult to classify prepositional uses into these categories. The verb put, for instance, subcategorizes for a direct object and a locative complement, where the latter is often expressed by a prepositional phrase (e.g. put the vase on the table). The prepositional phrase is thus subcategorized for, but still, the selection of the preposition is semantically based. Moreover, lexical prepositions are not always chosen on the basis of their content only, but may be further constrained by the nouns they govern. We say at the bank and in the store, though the prepositions contribute approximately the same meaning in both cases. (For an in-depth discussion of classificational issues of prepositions, see Tseng (2000).)

When choosing a strategy for selecting the appropriate target preposition, one should thus keep both kinds of prepositional uses in mind - something which implies the need for both bilingual and monolingual data.

[2] This is a bit simplified. The particular syntactic relation that is signaled by the source language preposition may of course be of relevance.

3. Related Work

Several strategies have been suggested for the task of selecting the appropriate target word in context. Most of these, however, address the translation of content words. We will take a brief look at some of the more influential such proposals. For the specific task of selecting the appropriate target preposition, we will take a closer look at a strategy proposed by Kanayama (2002).

The methods suggested for target word selection may be classified according to whether they make use of aligned bilingual corpus data or not.

The obvious advantage of not using aligned bilingual corpora, but monolingual corpora instead, is the vast increase in data available. Dagan and Itai (1994) suggest a statistically-based approach using a monolingual target corpus and a bilingual dictionary. When the bilingual dictionary gives several translation alternatives for a word, the context is considered, and the alternatives are ranked according to how frequently they occur in a similar context in the target language corpus. When there is more than one selection to be made, the order is determined by a constraint propagation algorithm. The results from an evaluation on a small English-Hebrew test set were promising, showing a recall of 68% and a precision of 91%.

Kanayama (2002) presents an algorithm specifically tailored to acquire statistical data for the translation of the Japanese postposition de to the appropriate English preposition. Following Dagan and Itai (1994), he selects the target word on the basis of co-occurrence patterns in the target language. For the experiment, however, a Japanese parsed corpus is also used, from which almost half a million verb phrases with the postposition de are extracted. These are partially translated to English, with the preposition left unspecified. Next, a parsed English newspaper corpus is searched for the partial translations where the unspecified preposition is instantiated as one of six predefined translations of de. When translating de, the most frequent target preposition, given the surrounding verb and noun, is chosen. In case there are no such tuples in the data, only the noun context is considered. As a last resort, a default preposition is selected. The reported total precision was 68.5%, to be compared with a baseline of 41.8% (where the default translation is always chosen).

Dagan and Itai (1994) note that the use of non-aligned corpus data alone makes it impossible to distinguish between instances of a target word that correspond to different source words when gathering context statistics for the target words. Therefore, each instance of a target word will be treated as a translation of all the source words for which it is a potential translation. In both experiments, this has been reported to be a source of errors. For instance, the algorithm suggested by Kanayama selects with over for in work (with/for) the company, since that construction is the most frequent one in the target language corpus. In the particular context though, with is not an appropriate translation of de, but corresponds to the translation of some other adposition.

Approaches to target word selection that make use of aligned bilingual data have also been suggested. Among the more influential ones are Brown et al (1991a; 1991b). In their proposal, the translation process is preceded by a sense-labeling phase, where ambiguous words are labeled with senses that correspond to different translations in the particular target language. A word token is sense-labeled by reference to a single informant site in its context (e.g. the first verb to its right). For each ambiguous word, the algorithm identifies the informant site that partitions the tokens in a way that maximizes the mutual information between the senses and the aligned translations. For instance, when translating the French verb prendre to English, the most informative feature was found to be the accusative object (approximated as the closest succeeding noun). By incorporating the sense-labeling technique into a statistical machine translation system, Brown et al (1991b) increased the number of acceptable translations produced by the system from 37 to 45 sentences out of 100.

In statistical machine translation, aligned bilingual data plays a major role in the selection of target words. Probability estimates are extracted from a translation model and a language model, which are built from an aligned bilingual corpus and a monolingual corpus, respectively. In part, however, the problem noted by Dagan and Itai (1994) still prevails; since the target language model is built on non-aligned data, there are no means to distinguish the different sources when context statistics are gathered for a target word.

4. Main Features of the Current Approach

The aim of the current experiment is to construct classifiers able to correct prepositions output from a rule-based MT-system. We will assume that the rule-based system, as a default, picks the most frequent target language preposition given the source preposition. Our task will thus be to identify the contexts where this default selection should be overridden, and the selected preposition changed for a more appropriate one.[3] We will avoid inducing rules where a preposition should be changed to some other part-of-speech, or where it should be completely removed, since such rules would alter the output structure in an uncontrolled way. The focus will consequently be on situations where prepositions translate as prepositions. This limits the applicability of the strategy to relatively similar languages, such as the ones of the current study (Swedish and English).

To induce the classifiers we will use the symbolic induction algorithm transformation-based learning (TBL) (for a very brief introduction, see section 5). TBL has successfully been applied to a wide range of NLP tasks, e.g. part-of-speech tagging (Brill, 1995), prepositional phrase attachment (Brill & Resnik, 1994), spelling correction (Mangu & Brill, 1997) and word sense disambiguation (Lager & Zinovjeva, 2001).

[3] We will assume that the rule-based system annotates whether prepositions are output as defaults or have been selected by some rule. The post-processing filter should only be applied to the former ones.
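As an illustration, a post-processing filter of this kind would apply an ordered rule sequence to each default-annotated preposition. The experiment does not publish its induced rules, so the rule, the feature name and the context representation below are invented for illustration only:

```python
def correct_preposition(default_prep, context, rule_sequence):
    """Apply an ordered sequence of override rules to a default preposition.
    Each rule (from_prep, to_prep, feature, value) fires when the current
    choice equals from_prep and the context carries the triggering feature."""
    prep = default_prep
    for frm, to, feat, val in rule_sequence:
        if prep == frm and context.get(feat) == val:
            prep = to
    return prep

# Hypothetical induced rule: the default 'on' becomes 'about' when the
# closest preceding verb lemma is 'talk'.
rules = [("on", "about", "preceding_verb_lemma", "talk")]

print(correct_preposition("on", {"preceding_verb_lemma": "talk"}, rules))  # about
print(correct_preposition("on", {"preceding_verb_lemma": "put"}, rules))   # on
```

Since the rules are ordered, a later rule may override the output of an earlier one, mirroring how TBL rule sequences are applied at test time.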


For the current task, where we look for contexts in which a default selection should be overridden, we find TBL to be particularly well-suited; starting with a good heuristic and then, iteratively, defining contexts where previous decisions should be changed, is at the heart of TBL.

Paliouras et al (2000) compare the performance of different machine learning techniques (symbolic induction algorithms, probabilistic classifiers and memory-based classifiers) on word sense disambiguation (WSD), and find the symbolic induction algorithms to give the best results. Since WSD and target word selection are relatively similar tasks, this gives further motivation for the choice of a symbolic induction algorithm for the task at hand.

Since the selection of target language prepositions is to a great extent due to factors idiosyncratic to the target language, we will follow Dagan and Itai (1994), and Kanayama (2002), in looking for selectional constraints in the target language context. To avoid confusing the sources, as may happen when non-aligned data is used, we will however use an aligned bilingual corpus, and induce one rule sequence for each source language preposition. Each classifier will be trained on actual translations (i.e. alignments) only of the respective source language preposition. This strategy, to look for selectional constraints in the target language context while still keeping track of the identity of the source language preposition, may be viewed as a compromise to accommodate both functional and lexical uses of prepositions.

The classifiers will have access to the word form, the lemma and the part-of-speech of the potential contextual triggers. We will primarily accommodate selectional constraints triggered by governing words, or by governed nominals inside the prepositional phrase. The potential governors will be approximated as the closest preceding verb, noun or adjective, and the governed nominals as the closest succeeding noun. With fully parsed data, the governor, as well as the governed nouns, would be recognized with higher precision. The resulting classifiers would however be dependent on having access to fully parsed data, something which is not always output from rule-based MT-systems.

5. Transformation-Based Learning

Transformation-based learning, introduced by Brill (1995), is an error-driven symbolic induction algorithm that learns an ordered set of rules from annotated training data. The format of the induced rules is determined by a set of rule templates that define what features the rules are to condition on. In a first stage, the algorithm labels every instance with its most likely tag (initial annotation). It then iteratively examines every possible rule-instantiation and selects the one which improves the overall tagging the most. The iteration continues until no rule-instantiation reaches a reduction in error above a certain threshold.

In our experiments we use µ-TBL, a flexible and efficient Prolog implementation of a generalized form of transformation-based learning, developed by Lager (1999).

6. Experimental Setup

6.1. Data and Evaluation

As parallel corpus data, we have used a subset of the Swedish-English EUROPARL corpus (Koehn, n.d.). The subset consists of approximately 3

Source Language    Accuracy   Accuracy   Nr of Training
Preposition        TBL        Baseline   Instances
i (in)             87.0%      83.3%      27190
av (of)            89.4%      79.8%      21182
för (for)          80.2%      73.2%      14632
med (with)         88.6%      85.4%       8465
på (on)            81.1%      45.3%       7898
om (on)            73.4%      59.3%       7502
Total:             84.9%      75.5%          -

Table 1. Accuracy for the six most frequent source language prepositions (score threshold 2, accuracy threshold 0.6). Baseline calculated from always selecting the most frequent translation (given in brackets).
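As a rough consistency check, the totals in Table 1 can be approximately reproduced by weighting the per-preposition accuracies with the training-instance counts. The test-set sizes are not reported, so using training counts as weights is an assumption and the match is only approximate:

```python
# (TBL accuracy, baseline accuracy, training instances) from Table 1.
rows = {"i":   (0.870, 0.833, 27190),
        "av":  (0.894, 0.798, 21182),
        "för": (0.802, 0.732, 14632),
        "med": (0.886, 0.854,  8465),
        "på":  (0.811, 0.453,  7898),
        "om":  (0.734, 0.593,  7502)}

n = sum(c for _, _, c in rows.values())
tbl_total = sum(a * c for a, _, c in rows.values()) / n
base_total = sum(b * c for _, b, c in rows.values()) / n
# Both land within a few tenths of a point of the published 84.9% / 75.5%.
print(f"TBL: {tbl_total:.3f}, baseline: {base_total:.3f}")
```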

million tokens in each language, out of which approximately 90% were used for training, and the remaining 10% were left for testing. The corpus was word-aligned with the GIZA++ toolkit (Och & Ney, 2000).

To identify the prepositions, and to allow more general rules to be learnt, the corpus was part-of-speech tagged. For both languages the TnT tagger (Brants, 2000) was used, with a model extracted from the Penn Treebank Wall Street Journal Corpus (Marcus et al, 1994) for the English part, and from the Stockholm-Umeå Corpus (Ejerhed et al, 1992) for the Swedish part (Megyesi, 2002).

In the English part, all verbs, nouns and adjectives were lemmatized with the morphological tool morpha (Minnen et al, 2001).

From the aligned and processed corpus, training and testing sets were extracted for the six most frequent prepositions in the training corpus: i, av, för, med, på and om. For each of those, we extracted the aligned target language prepositions in their sentence context.

The target prepositions in the training and the testing sets were initially annotated with the most frequent translation of their respective source prepositions (as estimated from the training corpus). In so doing, we are simulating the output of an MT-system that always selects the most frequent translation of a source language preposition.

Each rule sequence was evaluated by running the built-in evaluation function in µ-TBL on its respective test set.

6.2. Templates

The templates determine the format of the rules to be learnt, or more specifically, what features should be conditioned by the rules. As was previously noted, we have defined the templates to accommodate selectional constraints triggered either by some governing word, or by a word inside the prepositional phrase. Templates for external triggers are defined to condition the closest preceding noun, verb or adjective. There are also supplementary templates conditioning any immediately preceding word and/or part-of-speech. Templates for internal triggers are defined to condition the closest succeeding noun. Also here, supplementary templates are defined to condition any immediately succeeding word and/or part-of-speech.

6.3. µ-TBL – Parameter Settings

When running the µ-TBL system, the user must decide on a minimum score threshold[4] and a minimum accuracy threshold[5]. The optimal values of these depend on the data at hand, and are best estimated empirically. Here we have only experimented with three values for each: 2, 4, and 6 as possible score thresholds, and 0.6, 0.8 and 1.0 as possible accuracy thresholds.

[4] The score of a rule is its number of positive instances minus its number of negative instances.
[5] The accuracy of a rule is its number of positive instances over its total number of instances.

7. Experimental Results

The best overall results, presented in Table 1, were achieved with a score threshold of 2 and an accuracy threshold of 0.6. The increase in accuracy, as compared to a baseline where the most frequent translation of each preposition is always selected, varies considerably across the different source language prepositions. It ranges from 3.2 to 35.8 percentage points, and is generally higher where the baseline is low. The two prepositions that show the highest baseline are med and i. For these, the most frequent translation is appropriate in more than 80% of the cases. By adding the post-processing filter to these, the accuracy only slightly increases (by 3.2 and 3.7 percentage points respectively). For på and om, on the other hand, the most frequent translation is appropriate in only 45.3% and 59.3% of the respective cases. Adding the post-processing filter to these dramatically improves the accuracy (by 35.8 and 14.1 percentage points respectively).

Intuitively, med and i are more inclined to be used lexically than are på and om. This may, in part, explain why the baseline strategy of simply selecting the most frequent translation is so much more effective for the former two prepositions than for the latter two.

Summing up the results for all six prepositions, the application of the learnt rule sequences gives an accuracy of 84.9%, which corresponds to an increase of 9.4 percentage points as compared to the baseline.
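The greedy learning loop described in section 5, with the score and accuracy thresholds of section 6.3, can be sketched as follows. This is a simplified illustration, not the µ-TBL implementation: the feature names and the toy data are invented, and templates are reduced to single feature lookups:

```python
def learn_rules(instances, templates, score_threshold=2, accuracy_threshold=0.6):
    """Greedy TBL sketch. Each instance is a dict with 'current' (the
    initial most-frequent-translation annotation), 'gold' (the aligned
    target preposition) and context features named by `templates`.
    A rule (feature, value, from_prep, to_prep) rewrites 'current'.
    Score = positives - negatives; accuracy = positives / total."""
    rules = []
    while True:
        # Candidate rules: corrections suggested by mislabeled instances.
        candidates = {(f, inst[f], inst["current"], inst["gold"])
                      for inst in instances for f in templates
                      if inst["current"] != inst["gold"]}
        best, best_score = None, None
        for feat, val, frm, to in candidates:
            matches = [inst for inst in instances
                       if inst["current"] == frm and inst.get(feat) == val]
            pos = sum(inst["gold"] == to for inst in matches)
            neg = len(matches) - pos
            if (pos - neg >= score_threshold
                    and pos / len(matches) >= accuracy_threshold
                    and (best is None or pos - neg > best_score)):
                best, best_score = (feat, val, frm, to), pos - neg
        if best is None:          # no candidate clears both thresholds
            return rules
        feat, val, frm, to = best
        for inst in instances:    # apply the winning rule before iterating
            if inst["current"] == frm and inst.get(feat) == val:
                inst["current"] = to
        rules.append(best)

# Toy data: a default 'on' that aligns to 'about' after the verb 'talk'.
data = [{"current": "on", "gold": "about", "verb": "talk"},
        {"current": "on", "gold": "about", "verb": "talk"},
        {"current": "on", "gold": "on",    "verb": "put"}]
print(learn_rules(data, ["verb"]))  # [('verb', 'talk', 'on', 'about')]
```

One rule sequence of this kind would be induced per source language preposition, each trained only on the target prepositions aligned to that source preposition.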


8. Concluding Remarks

We have reported on an experiment with using transformation-based learning to induce rules to select target language prepositions. Selectional constraints have been sought in the target language context. To avoid losing control of the source language prepositions, we have used aligned bilingual corpus data only, and induced one rule sequence for each source language preposition.

An evaluation, using the built-in evaluation function in µ-TBL, revealed an accuracy of 84.9%, which corresponds to an increase of 9.4 percentage points as compared to the baseline where the most frequent translation is always selected.

It still remains to be investigated how the application of the rule sequences would perform on data output from a real MT-system. The rules condition on target words in the context of the prepositions, and the applicability of the rules is thus dependent on the translation of the surrounding words. The effect of this is something which can only be estimated empirically.

9. References

BRANTS, T. (2000). 'TnT – a statistical part-of-speech tagger'. In Proceedings of the 6th Applied NLP Conference (pp. 224-231), Seattle, USA.

BRILL, E. and P. Resnik. (1994). 'A rule-based approach to prepositional phrase attachment disambiguation'. In Proceedings of the 15th Conference on Computational Linguistics (pp. 1198-1204), Kyoto, Japan.

BRILL, E. (1995). 'Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging'. Computational Linguistics, (21:4):543-566.

BROWN, P., S. Della Pietra, V. Della Pietra and R. Mercer. (1991a). 'A statistical approach to sense disambiguation in machine translation'. In Proceedings of the DARPA Workshop of Speech and Natural Language (pp. 146-151), Pacific Grove, California.

BROWN, P., S. Della Pietra, V. Della Pietra and R. Mercer. (1991b). 'Word Sense Disambiguation using statistical methods'. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 264-270), Berkeley, California.

DAGAN, I. and A. Itai. (1994). 'Word Sense Disambiguation Using a Second Language Monolingual Corpus'. Computational Linguistics, (20:4):563-596.

EJERHED, E., G. Källgren, O. Wennstedt and M. Åström. (1992). 'Linguistic Annotation System of the Stockholm-Umeå Project'. Technical Report, Department of General Linguistics, University of Umeå.

KANAYAMA, H. (2002). 'An Iterative Algorithm for Translation Acquisition of Adpositions'. In Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation (pp. 85-95), Keihanna, Japan.

KOEHN, P. (n.d.). 'Europarl: A Multilingual Corpus for Evaluation of Machine Translation'. Draft, Unpublished.

LAGER, T. (1999). 'The µ-TBL System: Logic Programming tools for Transformation-Based Learning'. In Proceedings of the 3rd International Workshop on Computational Natural Language Learning (pp. 33-42), Bergen, Norway.

LAGER, T. and N. Zinovjeva. (2001). 'Sense and Deduction: The Power of Peewees Applied to the SENSEVAL-2 Swedish Lexical Sample Task'. In Proceedings of SENSEVAL-2: 2nd International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France.

MANGU, L. and E. Brill. (1997). 'Automatic rule acquisition for spelling correction'. In Proceedings of the 14th International Conference on Machine Learning (pp. 187-194), Nashville, Tennessee.

MARCUS, M., B. Santorini and M.-A. Marcinkiewicz. (1994). 'Building a large annotated corpus of English: The Penn Treebank'. Computational Linguistics, 19(2):313-330.

MEGYESI, B. (2002). 'Data-Driven Syntactic Analysis – Methods and Applications for Swedish'. PhD thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.

MILLER, K. (1998). 'From above to under: Enabling the Generation of the Correct Preposition from an Interlingual Representation'. In Proceedings of the AMTA/SIG-IL Second Workshop on Interlinguas, Langhorne, Pennsylvania.

MINNEN, G., J. Carroll and D. Pearce. (2001). 'Applied morphological processing of English'. Journal of Natural Language Processing, (7:3):207-223.

OCH, F. and H. Ney. (2000). 'Improved Statistical Alignment Models'. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (pp. 440-447), Hong Kong, China.


PALIOURAS, G., V. Karkaletsis, I. Androutsopoulos and C.D. Spyropoulos. (2000). 'Learning Rules for Large-Vocabulary Word Sense Disambiguation: A Comparison of Various Classifiers'. In Proceedings of the 2nd International Conference on Natural Language Processing (pp. 383-394), Patra, Greece.

TSENG, J. L. (2000). 'The Representation and Selection of Prepositions'. PhD Thesis, University of Edinburgh.
