Convergence of Translation Memory and Statistical Machine Translation

Philipp Koehn
University of Edinburgh
10 Crichton Street
Edinburgh, EH8 9AB
Scotland, United Kingdom

Jean Senellart
Systran
La Grande Arche
1, Parvis de la Défense
92044 Paris, France

Ventsislav Zhechev (ed.): Proceedings of the Second Joint EM+/CNGL Workshop “Bringing MT to the User: Research on Integrating MT in the Translation Industry” (JEC ’10), pp. 21–31. Denver, CO, 4 November 2010.

Abstract

We present two methods that merge ideas from statistical machine translation (SMT) and translation memories (TM). We use a TM to retrieve matches for source segments, and replace the mismatched parts with instructions to an SMT system to fill in the gap. We show that for fuzzy matches of over 70%, one method outperforms both SMT and TM baselines.

1 Introduction

Two technological advances in the field of automated language translation, translation memory (TM) and statistical machine translation (SMT), have seen vast progress over the last decades, but they have been developed very much in isolation. The reason for this is that different communities played a role in each technology's development.

TMs are a tool for human translators. Since many translation needs are highly repetitive (translation of updated product manuals, or several drafts of legislation), being able to find existing translations of segments of the source language text alleviates the need to carry out redundant translation. In addition, finding close matches (so-called fuzzy matches) may dramatically reduce the translation workload. Various commercial vendors offer TM software, and the technology is in wide use by translation agencies.

Instead of building machine translation systems by manually writing translation rules, SMT systems (Koehn, 2010) are built by fully automatically analyzing translated text and learning the rules. SMT has been embraced by the academic and commercial research communities as the new dominant paradigm in machine translation. Almost all recently published papers on machine translation present new SMT techniques. The methodology has left the research labs and become the basis of successful companies such as Language Weaver and the highly visible Google and Microsoft web translation services. Even traditional rule-based companies such as Systran have embraced statistical methods and integrated them into their systems (Dugast et al., 2007).

The two technologies have not touched much in the past, and not only because of the different development communities (software suppliers to translation agencies vs. mostly academic research labs). Another factor is that TM and SMT have addressed different translation challenges. While TMs have addressed the need of translation agencies to produce high-quality translations of often repetitive material, SMT has set itself the challenge of open-domain translation, such as news stories, and is mostly satisfied with translation quality that is good enough for gisting, i.e., transmitting the meaning of the source text to a target language speaker (consider web page translation or information gathering by intelligence agencies).

Currently, SMT receives increasing attention from translation agencies, who would like to employ the technology in a workflow of first automatic translation and then human post-editing. One possible user scenario is to offer a human translator a fuzzy match from a TM, or an SMT translation, or both. What to show may be decided by an automatic classifier (Simard and Isabelle, 2009; Soricut and Echihabi, 2010; He et al., 2010) or may be based on fuzzy match score or SMT confidence measures (Specia et al., 2009).

In this paper, we argue that the two technologies have much more in common than is commonly perceived. We present a method that integrates TM and SMT and that outperforms either technology for fuzzy matches of more than 80%. In a second method, we encode TM matches as very large translation rules and outperform all other methods for fuzzy match ranges over 70%.

2 XML Method

The main idea of our method is as follows: If we are able to find a sufficiently good fuzzy match for a given source segment in the TM, then we detect the location of the mismatch in source and target of the retrieved TM segment pair, and let the SMT system translate the mismatched area. We use the capability of the Moses SMT decoder (Koehn et al., 2007) to use XML markup to specify required translations for the matched parts of the source sentence, hence forcing it to only translate the unmatched part.

Recent work has explored similar strategies. Motivated by work in EBMT, Smith and Clark (2009) and Zhechev and van Genabith (2010) use syntactic information to align TM segments. Then they create an XML frame to be passed to Moses. Both show weaker performance (on different data sets) than we report here. Smith and Clark (2009) never overcome the SMT baseline. Zhechev and van Genabith (2010) only beat the SMT system in the 90-99% fuzzy match range.

The work by Biçici and Dymetman (2008) is closer to our approach: They align the TM segments using GIZA++ posterior probabilities to identify the mismatch in the target and add one non-contiguous phrase rule to their phrase-based decoder. They show significant improvements over both SMT and TM baselines, but their SMT seems to perform rather badly: it is outperformed by raw TM matches even in the 74-85% fuzzy match range.

2.1 Example

To illustrate the process (see also Figure 1), let us first go over one example. Let us say that the following source segment needs translation:

    The second paragraph of Article 21 is deleted .

The TM does not contain this source segment, but it contains something very similar:

    The second paragraph of Article 5 is deleted .

In the TM, this segment is translated as:

    À l' article 5 , le texte du deuxième alinéa est supprimé .

The mismatch between our source segment and the TM source segment is the word 21 or 5, respectively. By letting the SMT system translate the true source word 21, but otherwise trusting the target side of the TM match, we construct the following (simplified) XML frame:

    <xml translation="À l' article"/> 21 <xml translation=", le texte du deuxième alinéa est supprimé ."/>

Or, to use a more compact formalism for the purposes of this paper:

    <À l' article> 21 <, le texte du deuxième alinéa est supprimé .>

The XML frame consists of specified translations (e.g., À l' article) and source words (e.g., 21). The decoder is instructed to use the specified translations in its output. The remaining source words are translated as usual, by consulting a phrase translation table and searching for the best translation according to various scoring functions, including a language model.

In our example, the SMT decoder produces the following output:

    À l' article 21 , le texte du deuxième alinéa est supprimé .

A perfectly fine translation.

2.2 Fuzzy Matching

The first processing step is to retrieve the best match from the TM. Such fuzzy matches are measured by a fuzzy match score, and the task is to find the best segment pair in the TM under this score.

There are several different implementations of the fuzzy match score, and commercial products typically do not disclose their exact formula. However, most are based on the string edit distance, i.e., the number of deletions, insertions, and substitutions needed to edit the source segment to the TM segment.

Our implementation of the fuzzy match score uses word-based string edit distance, and uses letter-based string edit distance as a tie breaker. We define the fuzzy match score as:

$$\mathrm{FMS} = 1 - \frac{\text{edit-distance}(\text{source}, \text{tm-source})}{\max(|\text{source}|, |\text{tm-source}|)}$$

For our example, the word-based fuzzy match score is 89% (one substitution for 9 words), and the letter-based fuzzy match score is 95% (one substitution and one deletion for 37 letters, not counting spaces).

2.3 Word Alignment of TM

The mismatch between the source segment and the TM source segment is easy to detect. As with our fuzzy match metric, we compute the string edit distance between the two, which detects the inserted, deleted and substituted words.

A harder problem is to determine which target words are affected by the mismatch. For this, we need a word alignment between the source words and the target words in the TM segment.

Word alignment is a standard problem in SMT, and many algorithms have been proposed and implemented.

[...] a specified translation, and instead the source word is inserted in their place.

A mismatch may consist of a sequence of multiple words in source, TM source, or TM target. We treat such a block the same way we treat single words: they are removed as a block from the TM target and the source block is inserted.

There may be multiple non-neighboring mismatched sequences. We treat each separately and perform the subtraction process for each.

2.5 Special Cases

There are a number of special cases (also illustrated in Figure 2):

Pure insertion: If the source segment has an inserted block that is not present in the TM source segment, we add these source words to the XML frame. The location of the words is after the TM target word aligned to the TM source word just prior to the insertion point.

Pure deletion: If the TM source segment has additional words, then these are removed from the specified translation in the XML frame.

Unaligned mismatched words: If the mismatched source words are unaligned, then we find the closest aligned previous source word and spec-
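The fuzzy match score of Section 2.2 can be illustrated with a short Python sketch. It is not the authors' implementation: the function names (`edit_distance`, `fuzzy_match_score`, `word_and_letter_scores`, `best_match`) are hypothetical, and whitespace tokenization and dropping spaces for the letter-based score are assumptions made for illustration. The sketch computes the word-based score, the letter-based score used as a tie breaker, and picks the best TM source segment under that ordering.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum number of deletions, insertions,
    and substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(source: Sequence, tm_source: Sequence) -> float:
    """FMS = 1 - edit-distance(source, tm-source) / max(|source|, |tm-source|)."""
    longest = max(len(source), len(tm_source))
    if longest == 0:
        return 1.0  # assumption: two empty segments count as a perfect match
    return 1.0 - edit_distance(source, tm_source) / longest


def word_and_letter_scores(source: str, tm_source: str) -> Tuple[float, float]:
    """Word-based score first, letter-based score (spaces dropped) second."""
    word = fuzzy_match_score(source.split(), tm_source.split())
    letter = fuzzy_match_score(source.replace(" ", ""),
                               tm_source.replace(" ", ""))
    return word, letter


def best_match(source: str, tm_sources: Iterable[str]) -> str:
    """Pick the TM source with the highest word-based score; tuple comparison
    makes the letter-based score the tie breaker. Raises ValueError if
    tm_sources is empty."""
    return max(tm_sources, key=lambda tm: word_and_letter_scores(source, tm))


if __name__ == "__main__":
    src = "The second paragraph of Article 21 is deleted ."
    tm = "The second paragraph of Article 5 is deleted ."
    word, letter = word_and_letter_scores(src, tm)
    print(f"word-based FMS: {word:.0%}, letter-based FMS: {letter:.0%}")
```

For the example of Section 2.1, the sketch gives a word-based score that rounds to 89% and a letter-based score that rounds to 95%. A production system would typically avoid scoring every TM segment exhaustively, which this sketch does for simplicity.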
