Evaluation of a Substitution Method for Idiom Transformation in Statistical Machine Translation

Giancarlo D. Salton and Robert J. Ross and John D. Kelleher
Applied Intelligence Research Centre
School of Computing
Dublin Institute of Technology
Ireland
[email protected] {robert.ross,john.d.kelleher}@dit.ie

Proceedings of the 10th Workshop on Multiword Expressions (MWE 2014), pages 38–42, Gothenburg, Sweden, 26-27 April 2014. © 2014 Association for Computational Linguistics

Abstract

We evaluate a substitution based technique for improving Statistical Machine Translation performance on idiomatic multiword expressions. The method operates by performing substitution on the original idiom with its literal meaning before translation, with a second substitution step replacing literal meanings with idioms following translation. We detail our approach, outline our implementation and provide an evaluation of the method for the language pair English/Brazilian-Portuguese. Our results show improvements in translation accuracy on sentences containing either morphosyntactically constrained or unconstrained idioms. We discuss the consequences of our results and outline potential extensions to this process.

1 Introduction

Idioms are a form of figurative multiword expressions (MWE) that are ubiquitous in speech and written text across a range of discourse types. Idioms are often characterized in terms of their having non-literal and non-compositional meaning whilst occasionally sharing surface realizations with literal language uses (Garrao and Dias, 2001). For example the multiword expression s/he took the biscuit can have both a figurative meaning of being (pejoratively) remarkable, and a literal meaning of removing the cookie.

It is notable that idioms are a compact form of language use which allow large fragments of meaning with relatively complex social nuances to be conveyed in a small number of words, i.e., idioms can be seen as a form of compacted regularized language use. This is one reason why idiom use is challenging to second language learners (see, e.g., Cieslicka (2006)).

Another difficulty for second language learners in handling idioms is that idioms can vary in terms of their morphosyntactic constraints or fixedness (Fazly et al., 2008). On one hand some idiomatic expressions such as popped the question are highly fixed with syntactic and lexical variations considered unacceptable usage. On the other hand idioms such as hold fire are less fixed with variations such as hold one’s fire and held fire considered to be acceptable instances of the idiom type.

For reasons such as those outlined above idioms can be challenging to human speakers; but they also pose a great challenge to a range of Natural Language Processing (NLP) applications (Sag et al., 2002). While idiomatic expressions, and more generally multiword expressions, have been widely studied in a number of NLP domains (Acosta et al., 2011; Moreno-Ortiz et al., 2013), their investigation in the context of machine translation has been more limited (Bouamor et al., 2011; Salton et al., 2014).

The broad goal of our work is to advance machine translation by improving the processing of idiomatic expressions. To that end, in this paper we introduce and evaluate our initial approach to the problem. We begin in the next section by giving a brief review of the problem of idiom processing in a Statistical Machine Translation (SMT) context. Following that we outline our substitution based solution to idiom processing in SMT. We then outline a study that we have conducted to evaluate our initial method. This is followed with results and a brief discussion before we draw conclusions and outline future work.

2 Translation & Idiomatic Expressions

The current state-of-the-art in machine translation is phrase-based SMT (Collins et al., 2005). Phrase-based SMT systems extend basic word-by-word SMT by splitting the translation process into 3 steps: the input source sentence is segmented into “phrases” or multiword units; these phrases are then translated into the target language; and finally the translated phrases are reordered if needed (Koehn, 2010). Although the term phrase-based translation might imply the system works at the semantic or grammatical phrasal level, it is worth noting that the concept of a phrase in SMT is simply a frequently occurring sequence of words. Hence, standard SMT systems do not model idioms explicitly (Bouamor et al., 2011).

Given the above, the question arises as to how SMT systems can best be enhanced to account for idiom usage and other similar multiword expressions. One direct way is to use a translation dictionary to insert the idiomatic MWE along with its appropriate translation into the SMT model phrase table along with an estimated probability. While this approach is conceptually simple, a notable drawback with such a method is that while the MWEs may be translated correctly the word order in the resulting translation is often incorrect (Okuma et al., 2008).

An alternative approach to extending SMT to handle idiomatic and other MWEs is to leave the underlying SMT model alone and instead perform intelligent pre- and post-processing of the translation material. Okuma et al. (2008) is an example of this approach applied to a class of multi- and single word expressions. Specifically, Okuma et al. (2008) proposed a substitution based pre and post processing approach that uses a dictionary of surrogate words from the same word class to replace low frequency (or unseen) words in the sentences before the translation with high frequency words from the same word class. Then, following the translation step, the surrogate words are replaced with the original terms. Okuma et al.’s direct focus was not on idioms but rather on place names and personal names. For example, given an English sentence containing the relatively infrequent place name Cardiff, Okuma et al.’s approach would: (1) replace this low frequency place name with a high frequency surrogate place name, e.g. New York; (2) translate the updated sentence; and (3) replace the surrogate words with the correct translation of the original term.

The advantage of this approach is that the word order of the resulting translation has a much higher probability of being correct. While this method was developed for replacing just one word (or a highly fixed name) at a time and those words must be of the same open-class category, we see the basic premise of pre- and post- substitution as also applicable to idiom substitution.

3 Methodology

The hypothesis we base our approach on is that the work-flow that a human translator would have in translating an idiom can be reproduced in an algorithmic fashion. Specifically, we are assuming a work-flow whereby a human translator first identifies an idiomatic expression within a source sentence, then ‘mentally’ replaces that idiom with its literal meaning. Only after this step would a translator produce the target sentence deciding whether or not to use an idiom on the result. For simplicity we assumed that the human translator should use an idiom in the target language if available. While this work-flow is merely a proposed method, we see it as plausible and have developed a computational method based on this work-flow and the substitution technique employed by (Okuma et al., 2008).

Our idiom translation method can be explained briefly in terms of a reference architecture as depicted in Figure 1. Our method makes use of 3 dictionaries and 2 pieces of software. The first dictionary contains entries for the source language idioms and their literal meaning, and is called the “Source Language Idioms Dictionary”. The second dictionary meanwhile contains entries for the target language idioms and their literal meaning, and is called the “Target Language Idioms Dictionary”. The third dictionary is a bilingual dictionary containing entries for the idioms in the source language pointing to their translated literal meaning in the target language. This is the “Bilingual Idiom Dictionary”.

Figure 1: Reference Architecture for Substitution Based Idiom Translation Technique.

The two pieces of software are used in the pre- and post-processing steps. The first piece of software analyzes the source sentences, consulting the “Source Language Idioms Dictionary”, to identify and replace the source idioms with their literal meaning in the source language. During this first step the partially rewritten source sentences are marked with replacements. Following the subsequent translation step the second piece of software is applied for the post-processing step. The software first looks into the marked sentences to obtain the original idioms. Then, consulting the “Bilingual Idiom Dictionary”, the software tries to match a substring with the literal translated meaning in the target translation. If the literal meaning is identified, it then checks the “Target Language Idioms Dictionary” for a corresponding idiom for the literal use in the target language. If found, the literal wording in the target translation is then replaced with an idiomatic phrase from the target language. However if in the post-processing step the original idiom substitution is not found, or if there are no corresponding idioms in the target language, then the post-processing software does nothing.

“highly fixed” idioms and will be referred to as the “High Fixed Corpus”. Finally a second test corpus containing sentences with “low fixed” idioms, the “Low Fixed Corpus”, was also constructed. In order to make results comparable across test corpora the length of sentences in each of the two test corpora were kept between fifteen and twenty words.

To create the initial large corpus a series of small corpora available on the internet were compiled into one larger corpus which was used to
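The pre- and post-processing steps described in Section 3 can be illustrated with a small Python sketch. This is not the authors' implementation, which the text does not show. The dictionary entries, the names `preprocess`, `postprocess`, `MarkedSentence` and `toy_translate`, and the use of exact substring matching are all assumptions made for illustration, and `toy_translate` is a lookup table standing in for a real SMT system.

```python
from dataclasses import dataclass, field

# Hypothetical toy entries; the paper does not list its dictionary contents.
SOURCE_IDIOMS = {"popped the question": "proposed marriage"}    # idiom -> literal (source)
BILINGUAL_IDIOMS = {"popped the question": "propôs casamento"}  # source idiom -> literal (target)
TARGET_IDIOMS = {"propôs casamento": "pediu a mão dela"}        # literal (target) -> idiom (target)


@dataclass
class MarkedSentence:
    """A rewritten source sentence plus the idioms that were replaced in it."""
    text: str
    replaced: list = field(default_factory=list)


def preprocess(sentence: str) -> MarkedSentence:
    """Replace each known source idiom with its literal meaning and record it."""
    marked = MarkedSentence(sentence)
    for idiom, literal in SOURCE_IDIOMS.items():
        if idiom in marked.text:
            marked.text = marked.text.replace(idiom, literal)
            marked.replaced.append(idiom)
    return marked


def postprocess(translation: str, marked: MarkedSentence) -> str:
    """Swap translated literal meanings back to target idioms where possible."""
    for idiom in marked.replaced:
        literal_target = BILINGUAL_IDIOMS.get(idiom)
        if literal_target is None or literal_target not in translation:
            continue  # literal meaning not found in the output: do nothing
        target_idiom = TARGET_IDIOMS.get(literal_target)
        if target_idiom is None:
            continue  # no corresponding target idiom: do nothing
        translation = translation.replace(literal_target, target_idiom)
    return translation


def toy_translate(text: str) -> str:
    """Stand-in for the SMT step (assumption: a fixed lookup, not a real decoder)."""
    table = {"He proposed marriage yesterday": "Ele propôs casamento ontem"}
    return table.get(text, text)


marked = preprocess("He popped the question yesterday")
result = postprocess(toy_translate(marked.text), marked)
print(marked.text)
print(result)
```

Exact substring matching as used here would only catch highly fixed idioms in their canonical form. Less fixed idioms such as hold fire, with variants like held fire, would need some form of morphological or pattern matching, which this sketch does not attempt.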