Enhancing the Bilingual Concordancer Transsearch with Word-Level Alignment Julien Bourdaillet, Stéphane Huet, Fabrizio Gotti, Guy Lapalme, Philippe Langlais

Enhancing the Bilingual Concordancer TransSearch with Word-Level Alignment Julien Bourdaillet, Stéphane Huet, Fabrizio Gotti, Guy Lapalme, Philippe Langlais To cite this version: Julien Bourdaillet, Stéphane Huet, Fabrizio Gotti, Guy Lapalme, Philippe Langlais. Enhancing the Bilingual Concordancer TransSearch with Word-Level Alignment. 22nd Conference of the Canadian Society for Computational Studies of Intelligence (Canadian AI 2009), May 2009, Kelowna, Canada. pp.27-38, 10.1007/978-3-642-01818-3_6. hal-02021384 HAL Id: hal-02021384 https://hal.archives-ouvertes.fr/hal-02021384 Submitted on 26 Feb 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Enhancing the Bilingual Concordancer TransSearch with Word-level Alignment Julien Bourdaillet, StéphaneHuet, Fabrizio Gotti, Guy Lapalme, and Philippe Langlais Département d'Informatique et de Recherche Opérationnelle Universitéde Montréal C.P. 6128, succursale Centre-ville H3C 3J7, Montréal, Québec, Canada http://rali.iro.umontreal.ca Abstract. Despite the impressive amount of recent studies devoted to improving the state of the art of Machine Translation (MT), Computer Assisted Translation (CAT) tools remain the preferred solution of hu- man translators when publication quality is of concern. In this paper, we present our perspectives on improving the commercial bilingual concordancer TransSearch, a Web-based service whose core technology mainly relies on sentence-level alignment. We report on experiments which show that it can greatly benefit from statistical word-level alignment. 1 Introduction Although the last decade has witnessed an impressive amount of effort devoted to improving the current state of Machine Translation (MT), professional translators still prefer Computer Assisted Translation (CAT) tools, particularly translation memory (TM) systems. A TM is composed of a bitext, a set of pairs of units that are in translation relation, plus a search engine. Given a new text to translate, a TM system automatically segments the text into units that are systematically searched for in the memory. If a match is found, the associated target material is retrieved and output with possible modifications, in order to account for small divergences between the unit to be translated and the one retrieved. Thus, such systems avoid the need to re-translate previously translated units. Commercial solutions such as SDL Trados1, Deja Vu2, LogiTerm3 or MultiTrans4 are available; they mainly operate at the level of sentences, which narrows down their usefulness to repetitive translation tasks. Whereas a TM system is a translation device, a bilingual concordancer (BC) is conceptually simpler, since its main purpose is to retrieve from a bitext, the pairs of units that contain a query (typically a phrase) that a user manually 1 http://www.trados.com 2 http://www.atril.com 3 http://www.terminotix.com 4 http://www.multicorpora.ca submits. It is then left to the user to locate the relevant material in the retrieved target units. As simple as it may appear, a bilingual concordancer is nevertheless a very popular CAT tool. In [1], the authors report that TransSearch,5 the commercial concordancer we focus on in this study, received an average of 177 000 queries a month over a one-year period (2006{2007). This study aims at improving the current TransSearch system by providing it with robust word-level alignment technology. It was conducted within the TS3 project, a partnership between the RALI and the Ottawa-based company Terminotix.6 One important objective of the project is to automatically identify (highlight) in the retrieved material the different translations of a user query, as discussed initially in [2]. The authors of that paper also suggested that grouping variants of the same \prototypical" translation would enhance the usability of a bilingual concordancer. These are precisely the two problems we are addressing in this study. The remainder of this paper is organized as follows. We first describe in Sec- tion 2 the translation spotting techniques we implemented and compared. Since translation spotting is a notoriously difficult problem, we discuss in Section 3 two novel issues that we think are essential to the success of a concordancer such as TransSearch: the identification of erroneous alignments (Section 3.1) and the grouping of translation variants (Section 3.2). We report on experiments in Section 4 and conclude our discussion and propose further research avenues in Section 5. 2 Transpotting Translation spotting, or transpotting, is the task of identifying the word-tokens in a target-language (TL) translation that correspond to the word-tokens of a query in a source language (SL) [3]. It is therefore an essential part of the TS3 project. We call transpot the target word-tokens automatically associated with a query in a given pair of units (sentences). The following example7 illustrates the output of one of the transpotting algorithms we implemented. Both conformes à and fidèlesà are French transpots of the English query in keeping with. S1 = These are important measures in keeping with our international obligations. T1 = Il s'agit d'importantes mesures conformes à nos obligations internationales. S2 = In keeping with their tradition, liberals did exactly the opposite. T2 = Fidèlesà leur tradition, les libérauxont fait exactement l'inverse. 5 www.tsrali.com 6 www.terminotix.com 7 The data used in this study is described in Section 4.1. As mentioned in [4], translation spotting can be seen as a by-product of word- level alignment. Since the seminal work of [5], statistical word-based models are still the core technology of today's Statistical MT. This is therefore the alignment technique we consider in this study. Formally, given a SL sentence S = s1:::sn and a TL sentence T = t1:::tm in translation relation, an IBM-style alignment a = a1:::am connects each target token to a source one (aj 2 f1; :::; ng) or to the so-called null token which accounts for untranslated target tokens, and which is arbitrarily set to the source position 0 (aj = 0). This defines a word-level alignment space between S and T whose size is in O(mn+1). Several word-alignment models are introduced and discussed in [5]. They differ by the expression of the joint probability of a target sentence and its alignment, given the source sentence. We focus here on the simplest form, which corresponds to IBM models 1 & 2: m m m Y X p(t1 ; a1 ) = p(tjjsi) × p(ijj; m; n) j=1 i2[0;n] where the first term inside the summation is the so-called transfer distribution and the second one is the alignment distribution. A transpotting algorithm comprises two stages: an alignment stage and a decision stage. We describe in the following sections the algorithms we implemented and tested, with the convention that si2 stands for the source query (we i1 only considered contiguous queries in this study, since they are by far the most frequent according to our user logfile). 2.1 Viterbi Transpotting One straightforward strategy is to compute the so-called Viterbi alignment (â). By applying Bayes' rule and removing terms that do not affect the maximization, we have: n m m n m m n p(s1 ) × p(t1 ; a1 js1 ) m m n a^ = argmax p(a1 jt1 ; s1 ) = argmax n m n = argmax p(t1 ; a1 js1 ) m m p(s ) × p(t js ) m a1 a1 1 1 1 a1 In the case of IBM models 1 & 2, this alignment is computed efficiently in O(n × m). Our first transpotting implementation, called Viterbi-spotter, simply gathers together the words aligned to each token of the query. Note that with this strategy, nothing forces the transpot of the query to be a contiguous phrase. In our first example above, the transpot of the query produced by this formula for in keeping with is the French word conformes. 2.2 Contiguous Transpotting This method, which was introduced by M. Simard in [4], forces the transpot to be a contiguous sequence. This is accomplished by computing for each pair hj ; j i 2 [1; m]2, two Viterbi alignments: one between the phrase tj2 and the 1 2 j1 query si2 , and one between the remaining material in those sentences,s ¯i2 ≡ i1 i1 si1−1sn and t¯j2 ≡ tj1−1tm . The complexity of this algorithm is O(nm3): 1 i2+1 j1 1 j2+1 ( ) ^ t^j2 = argmax max p(aj2 jsi2 ; tj2 ) × max p(¯aj2 js¯i2 ; t¯j2 ) ^ j1 i1 j1 j1 i1 j1 j1 j2 j2 (j1;j2) a a¯ j1 j1 This method, called ContigVit-spotter hereafter, returns the transpot conformes à for the query in keeping with in the first example above. 2.3 Maximum Contiguous Subsequence transpotting In this transpotting strategy, called MCSS-spotter, each token tj is associated with a score computed such as: score(tj) = score0(tj) − t~ where score (t ) = Pi2 p(t js )p(ijj; m; n) and t~= 1 Pm score (t ). 0 j i=i1 j i m j=1 0 j The score corresponds to the word alignment score of IBM model 2 minus the average computed for each token tj. Because of t~, the score associated with a token tj is either positive or negative. The Maximum Contiguous Subsequence Sum (MCSS) algorithm is then ap- plied.

Enhancing the Bilingual Concordancer Transsearch with Word-Level Alignment Julien Bourdaillet, Stéphane Huet, Fabrizio Gotti, Guy Lapalme, Philippe Langlais

Preparation and Exploitation of Bilingual Texts Dusko Vitas, Cvetana Krstev, Eric Laporte

The Iafor European Conference Series 2014 Ece2014 Ecll2014 Ectc2014 Official Conference Proceedings ISSN: 2188-1138

Kernerman Kdictionaries.Com/Kdn DICTIONARY News the European Network of E-Lexicography (Enel) Tanneke Schoonheim

TANGO: Bilingual Collocational Concordancer

Aconcorde: Towards a Proper Concordance for Arabic

Panacea D6.1

Translate's Localization Guide

Corpus Linguistics and Data-Driven Learning: a Critical Overview

From a Bilingual Concordancer to a Translation Finder Julien Bourdaillet, Stéphane Huet, Philippe Langlais, Guy Lapalme

Computer Aided Translation Advances and Challenges

The Universal Dependencies Treebank of Spoken Slovenian

Arxiv:2105.02877V1 [Cs.CV] 6 May 2021 Tles Corresponding to the Audio Content