MASARYK UNIVERSITY FACULTY OF INFORMATICS

Enlarging Translation Memories

MASTER THESIS

Josef Bušta

Brno, 2014

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: RNDr. Miloš Jakubíček

Acknowledgement

In the first place I would like to thank my advisor Miloš Jakubíček and my consultants Vít Baisa and Aleš Horák for all the valuable advice, for the patience with my work, and for great guidance. Furthermore, I would like to give thanks to the other people from the NLP Centre for their help with the tools being developed at the Faculty of Informatics. Support and great motivation to finish my thesis came from my friends Stanislav Smrčka and Vojtěch Emmer; the thanks belong to them as well. Special thanks go to Jan Bušta, my brother, for all his willingness to help, and to Zdeňka Sitová for all the inspiring discussions. Last but not least, thanks go to my family for their support and patience with me.

Abstract

The goal of this thesis is to enlarge the coverage of translation memories while preserving their high translational accuracy. Statistical machine translation techniques and tools are used to fulfill this goal. We propose a new algorithm for the decoding of overlapping phrases, and we use this algorithm both to generate new segments and as a support tool for the completion of segments from a TM with an approximate match to a new segment. We provide an outline of approaches applicable to enlarging a TM. The proposed decoding algorithm is evaluated on the freely accessible DGT Translation Memory and on a commercial translation memory.

Keywords

translation memory, computer-aided translation, overlapping phrases, fuzzy match, decoding, phrase combination, statistical machine translation, Moses, NLP

Contents

1 Introduction
2 Related Work
3 Basic Work-flow, Objectives and Motivation
4 Subsegment Extraction and Combination
   4.1 Subsegment Extraction
   4.2 Subsegment Combination
5 Overlapping Phrases Decoder
   5.1 Decoder
   5.2 Subsegment Scores
   5.3 Translation Combination
   5.4 Decoder – Analysis and Alterations
6 Fuzzy Match Treatment
   6.1 Mismatches Detection
   6.2 Enlarged Mismatches Translation
7 Methods Proposals and Overview
   7.1 Overlapping Hierarchical Model in SMT
   7.2 The Decoder as Fuzzy Match Generator
   7.3 Lexicalization for overlapping phrases decoder
8 Evaluation
9 Conclusion

Chapter 1 Introduction

The growing number of texts to be translated among many language pairs requires a fast translation process. Computer-aided translation (CAT) speeds up the translation process while preserving the translation quality. It saves the time of human translators by offering previously translated segments1 or fuzzy matches2. CAT is becoming more and more popular with state-of-the-art technologies such as subsegment leveraging, machine translation, or automatic terminology extraction. This work focuses on the first two technologies.

Translation memories (TM) used in CAT systems store previously translated segments. They are the highest-quality resources of parallel texts since they are carefully prepared and checked by professional human translators. On the other hand, they are quite small when compared with other parallel data sources. These high-quality parallel texts can be used by machine translation (MT) systems to improve their outputs, which can be reused in CAT.

The combination of TM and MT has recently become very popular, as it helps to overcome the drawbacks of both: whereas MT focuses more on high coverage, TMs aspire to preserve high quality translations. We want to take advantage of these two approaches and combine them in a way that preserves the high quality of the translations together with an enhancement of the coverage.

There is also a commercial aspect to our research: the coverage analyses provided by CAT systems are usually used for estimating the amount of work needed for translating a given document (i.e. the price of the translation work). The higher the number of segments which can be pre-translated

1. A short text, usually corresponding to a sentence, a paragraph, a heading, a title or an element in a list. 2. A fuzzy match denotes a segment from the translation memory with the highest fuzzy match score with respect to the segment to be translated. The fuzzy match score is usually based on edit distance or some modification of the edit distance measure. A match with a 100% fuzzy match score is a so-called exact match.

automatically, the lower the price of the translation work. That is why translation (and localization) companies aim at the highest coverage of their resources.

The second chapter is dedicated to the related work. A closer introduction to our work, its background and the basic work flow follows in the next chapter. Subsegment extraction and the possible subsegment combinations are described further. Chapter 5 describes our decoding algorithm for overlapping phrases. Furthermore, we present a method to create new segments by leveraging the combination of TM and MT together with the overlaps. Chapter 7 is dedicated to future work and to other proposals of new methods appropriate for TM enlargement. In the last chapters we evaluate the implemented methods and conclude our work.

Chapter 2 Related Work

Translation companies are becoming more and more interested in machine translation due to the growing volume of translation requests. Recently, a lot of attention has been dedicated to the connection between machine translation and translation memories. Since we propose methods combining the MT and TM approaches, we sum up the state of the art.

The TM related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems [1], but not much work has been devoted to the problem of expanding translation memories. TMs are often presented [2] within a closely related field: example-based machine translation (EBMT), which uses a similar approach as CAT systems do – reusing samples of previously translated texts.

Two other MT approaches are more connected to our research: phrase-based statistical machine translation (PBSMT) [3] and the hierarchical model in statistical machine translation (SMT). PBSMT extracts phrases from a parallel corpus using word alignment. Subsequently, the phrases are combined together according to the input sentence in the decoding phase. In contrast to PBSMT, longer phrases or parts of parallel texts are used in EBMT.

The hierarchical model in SMT is similar to classical SMT, with the difference that instead of phrase extraction, grammar rules are generated. These rules can be formalized as a synchronous CFG1 in terms of pairs of source and target side rules. The advantage of this model is that it is able to capture larger reordering. The decoder has to be modified to process the grammar rules.

In [4], the authors have attempted to build translation memories from the Web, since they found out that human translators in Canada use Google search results even more often than specialized translation memories. That is why the research team at the National Research Council of Canada

1. Context-free grammar

developed a system called WeBiText for extracting possible segments and their translations from bilingual web pages. They make an important observation: it is always better to provide translators with a list of possible translations and let them find the correct one than to have nothing prepared. In other words, it is easier and faster for the translators to look up a good translation than to make up their own translation from scratch. It is also very important that the correct translation be among the first 10 or 20 items in the suggested list.

In the study [5], the authors exploited two methods of segmentation of translation memories to enrich a TM with shorter segments; they use SMT methods. The motivation for their approach was that the segments are sometimes too long and human translators also decompose segments during the translation process. The second motivation is that long segments are less repetitive.

The next paper [6] describes a method of subsegmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch [7], for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated – they use a text-chunker to extract phrases from the Hansard corpus, a corpus containing the Canadian parliamentary debates from the year 1803 to the present time.

In [8], two methods for completing non-exact fuzzy matches using SMT are described and evaluated. This work is the most similar to the second main approach in this thesis. The main goal of the work lies in the construction of an XML frame consisting of the translations of the matching and the mismatching parts between the segment to be translated and the fuzzy match for this segment. Subsequently, the XML frame is processed by Moses [9] and only the mismatches are translated. The second approach leverages the hierarchical model in SMT: rules are created by inserting non-terminals in place of the mismatched parts, and decoding is subsequently launched for the hierarchical rules by Moses. The authors of [8] show that the method using the hierarchical model outperforms the TM and SMT baselines on fuzzy matches over 70%.

In the paper [10], two approaches to EBMT matching are investigated, string-based and syntax-based. For those parts of the input sentence where EBMT is not confident, Moses is used to fill the gaps. The work experiments with semantic information from WordNet [11] as a part of the string-based approach. According to the BLEU score, the pure SMT system outperforms the hybrid of EBMT and SMT.

Another approach, presented in [12], aims to improve SMT systems using overlaps between phrases. This work is the most similar to the first part of this

thesis. Overlapping phrases are merged if an overlap of at least one word exists on both the source and the target side. If the generated phrases are included in the decoding, the translation quality is increased according to the BLEU and NIST evaluation metrics.

The thesis [13] is mainly dedicated to the evaluation of TM and MT outputs. The motivation is that a CAT system should offer the better of the TM and MT outputs to the human translator. One part focuses on the combination of high fuzzy matches together with MT on the subsegment level.

The paper [14] deals with a tree-based alignment which is precision-oriented and provides better handling of long-distance reordering. This alignment is used to detect unaligned parts between a fuzzy match from the TM and its translation. The mismatched subsegments are translated using SMT. The authors show that the quality of SMT is significantly improved using TM matches with a high fuzzy match score.

Chapter 3 Basic Work-flow, Objectives and Motivation

The input for our methods consists of a translation memory and a document, see Figure 3.1. We want to enlarge the TM (the expanded TM is denoted TMexp) by adding newly created high quality segments. Besides these segments, we want to offer, as a by-product, partially translated segments from the document for which no good1 fuzzy match is offered.

Figure 3.1: Schema of the basic work flow for TMexp – the TM and the document are the input of the methods; the enlarged TMexp is the output.

To fulfil the goal of TM enrichment with high quality segments, we first extract the subsegments from the TM. We propose several methods leveraging the extracted subsegments, mainly based on overlaps. Two main new approaches leading to a larger TM are introduced here and described in detail in the following chapters.

The first method is closely related to the approach of [12], dealing with merging overlapping phrases. The key differences in our work are:

• validation of the overlap – we use the alignments between a subsegment and its translation to detect the overlap more precisely,

• iterative merging of overlapping phrases.

The second method is dedicated to the treatment of fuzzy matches with the aim of obtaining an exact match. Our work is motivated by the paper [8], which is dedicated to the convergence of TM and MT in such a way that as much as possible is translated using TM segments and the mismatches are translated using SMT.

1. CAT systems usually offer segments with a fuzzy match score of 50% or higher, but the setting is up to the human translator – for very low fuzzy matches it is easier for the human translator to start from scratch.


In this thesis, we propose to extend mismatches with context. The enlarged mismatch is then translated, and its translations are subsequently checked to see whether any of them can be inserted between the parts translated by the TM. We expect a higher quality of the translations in contrast to [8], since the translation is inserted with regard to the context, and thus the fluency of the translations is ensured.

Let TMnew contain the segments completely covered by merging overlapping subsegments (the first method) and the segments obtained by modifying a fuzzy match (the second method). Then the segments from TMnew can be suggested to the human translator in the appropriate places in the document and validated, or the segments can be directly added to the TM and the document can be pre-translated with these segments.2

TMexp = TM ∪ TMnew

2. Papers (e.g. [8, 13]) dedicated to the integration of MT and TM do not append their outputs directly to the TM. The outputs are usually suggested to the human translator as a better variant than the fuzzy match.

Chapter 4 Subsegment Extraction and Combination

There are a number of options for acquiring new subsegments from an existing TM. In this chapter we present the methods and tools we use to obtain new subsegments from the TM.

4.1 Subsegment Extraction

Phrases and their corresponding translations are generated using Moses [9] directly from the TM; subsegments correspond to any extracted phrase. The word alignment is based on MGIZA++ [15] (a parallel version of GIZA++ [16]) and the default Moses heuristic grow-diag-final.1 Moreover, the quality of the alignment obtained from GIZA++ can be improved by providing a parallel corpus2 that is used by GIZA++ in addition to the TM. The output from Moses is a set of phrases with their aligned translation phrases and the probabilities of the translation, as given in the example below:

Subsegment      Translation   Probabilities                Alignment points
nejlepší uhlí   best coal     0.158, 0.142, 0.158, 0.69    0-0 1-1

The probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. These probabilities are used to select the best translations in case there are many translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, e.g. 0-0 1-1 means that the first word "nejlepší" from the source language is translated to the first word in the translation "best"

1. http://www.statmt.org/moses/?n=FactoredTraining.AlignWords 2. Some available sources of parallel data: OPUS [17], Europarl [18], JRC-Acquis [19] or DGT-TM [20]

Figure 4.1: Word matrix for two aligned sentences / segments (a Czech segment aligned with the English translation "if you were there you would know it now").

and the second word "uhlí" to the second word "coal". These points give us important information about the subsegment translation:

1) empty alignment,

2) one-to-many alignment, and

3) opposite orientation.

In Figure 4.1, an empty alignment is represented by an empty row or an empty column, a one-to-many alignment by a sequence of adjacent squares in a row or in a column, and an opposite orientation by a sequence of neighbouring squares on the secondary diagonal. The alignments are used to determine the correct positions in the subsegment translations.

We respect the standard Moses pipeline to prepare the data. We start with tokenization3, followed by truecasing4 and cleaning5. Plenty of parameters can be set for the phrase extraction in Moses; we use the training script train-model.perl almost with the standard options. We just add some parameters to speed up the training, the option -score-options '--GoodTuring' to generate the alignment points, and the option -max-phrase-length 10 to extract longer phrases (the predefined setting is 7). The full command is shown below.

3. Tokenization: Breaks a text into words, phrases, symbols, or other meaningful elements called tokens. 4. Truecasing: Initial words in each sentence are converted to their most probable casing. 5. Cleaning: Long sentences, empty sentences and obviously mis-aligned sentences are removed.


    train-model.perl -cores 4 -parallel \
        -sort-buffer-size 10G -sort-batch-size 1024 \
        -sort-compress gzip -sort-parallel 4 \
        -max-phrase-length 10 -root-dir ./MOSES-FILES \
        --first-step 1 --last-step 6 \
        -external-bin-dir ./EXTERNAL \
        --corpus ./CORPUS --f cs --e en \
        -score-options '--GoodTuring' \
        --mgiza --mgiza-cpus=4
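For illustration, the extracted phrase pairs can be loaded with a few lines of Python. This is a minimal sketch assuming the usual |||-delimited Moses phrase-table format; the function name and the dictionary layout are our own choices, not part of Moses.

    def read_phrase_table(path):
        phrases = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = [x.strip() for x in line.split("|||")]
                src, tgt, probs, align = fields[0], fields[1], fields[2], fields[3]
                phrases.append({
                    "src": src.split(),
                    "tgt": tgt.split(),
                    # inverse/direct phrase translation probability and
                    # lexical weighting, in the order given above
                    "probs": [float(p) for p in probs.split()],
                    # alignment points i-j as (source, target) index pairs
                    "align": [tuple(map(int, a.split("-"))) for a in align.split()],
                })
        return phrases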

4.2 Subsegment Combination

Basic operations with the subsegments are proposed in this section; we follow the research from our previous work [21]. A particular operation, or any combination of the operations, should ideally yield a new segment covering a whole, originally uncovered, segment in the input document – a so-called 100% match – or at least generate longer subsegments, which can also be suggested to a human translator when a sufficient fuzzy match is missing. The input document allows the validation of subsegment combinations on the source side, as is common in other MT systems. The methods for expanding the TM are as follows.

1. JOIN: new (sub)segments are built by concatenating subsegments.

(a) JOIN^O: the joined subsegments overlap in a segment from the document; for instance, the subsegments šel do and do lesa are joined into šel do lesa, and their translations are joined in a similar way: went into and into the forest create went into the forest.

(b) JOIN^N: the joined subsegments neighbour each other in a segment from the document; for instance, the phrases šel and do create šel do, and their translations went and into create went into.

2. SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment.

(a) SUBSTITUTE^O: the gap in the first segment is covered by an overlap with the second subsegment, see the example in Table 4.1.

(b) SUBSTITUTE^N: the second subsegment is inserted into the gap in the first segment; for instance, consider the example in Table 4.1 with the segment dodržovat and its translation comply with instead of musí dodržovat zvláštní and shall comply with the special.


Table 4.1: SUBSTITUTE^O, example for Czech → English

new subsegment:      Provozovatelé musí dodržovat zvláštní pravidla pro výzkumné
its translation:     Operators shall comply with the special rules on research
from subsegments:    Provozovatelé musí vytvářet zvláštní pravidla pro výzkumné | musí dodržovat zvláštní
their translations:  Operators shall create the special rules on research | shall comply with the special

The operation JOIN^N corresponds to the decoding used in phrase-based SMT systems. The operation SUBSTITUTE^N is closely related to the hierarchical model in SMT: a non-terminal symbol is used in place of the gap in the segment and a rule is created. For example, for the phrase nejlepší velký pes the rule nejlepší N pes could be built, where N is the non-terminal symbol; the phrase nejlepší malý pes could then be easily translated using this rule and the terminal malý. We focus on the overlapping versions of the proposed operations in the following sections.
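The following minimal sketch illustrates the JOIN^O operation on word lists. It validates the overlap by plain string comparison on both sides, a simplification of the alignment-based validation developed in Chapter 5; all names are illustrative.

    def join_overlap(src1, src2, tgt1, tgt2, k):
        # JOIN^O sketch: the two subsegments must share their last/first
        # k words on the source side; the translations are merged on the
        # longest matching target-side overlap. Returns None on failure.
        if k < 1 or src1[-k:] != src2[:k]:
            return None
        for m in range(min(len(tgt1), len(tgt2)), 0, -1):
            if tgt1[-m:] == tgt2[:m]:
                return src1 + src2[k:], tgt1 + tgt2[m:]
        return None

    # join_overlap("šel do".split(), "do lesa".split(),
    #              "went into".split(), "into the forest".split(), 1)
    # -> (['šel', 'do', 'lesa'], ['went', 'into', 'the', 'forest'])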

Chapter 5 Overlapping Phrases Decoder

In this chapter we propose a decoding algorithm for overlapping phrases. The book [22] presents the beam-search stack decoder in detail and summarizes several other decoding algorithms, such as greedy hill-climbing decoding or finite-state transducer decoding. We bring out a new decoder since the decoding problem can be simplified for overlapping phrases. The decoding algorithms merging non-overlapping phrases need to use language models to ensure translation fluency, whereas, in our approach, the fluency of the translation is ensured by the overlap of the phrases.

This also reduces the search space; the problem of exact MT decoding is NP-complete, as proved in [23]. The search space is reduced in the following way: if the merging of the translations of two overlapping phrases fails, then we know that these overlapping phrases cannot be merged and their combination cannot be part of the best path in the search space. This is in contrast to non-overlapping decoding, where the phrases can always be merged. Even if the merging of two phrases receives a bad score according to the language model, it can still be a part of the best path if the other scores on the path are good. For this reason, to find the best path through the search space, not only the language model but also other metrics (future cost estimation, scores of the extracted phrases) are used in non-overlapping decoding.

5.1 Decoder

Since the proposed decoder works with the indices of the subsegments in a segment, let us start with a brief introduction and several definitions for a better understanding of the decoding algorithm. Let S_{i,j} denote a subsegment of the segment S of length n (i is the index of the start word, j is the index of the end word).

S_{i,j} = w_i ... w_j;   i, j ∈ 1 ... n;   i ≤ j

Let S_{k,l} and S_{m,n} be subsegments occurring in the segment S; then


the function (5.1) determines, using the indices, whether the subsegments are overlapping.

J(S_{k,l}, S_{m,n}) =
    w_k ... w_n,  if k < m ∧ l + 1 > m ∧ n > l;
    w_m ... w_l,  if m < k ∧ n + 1 > k ∧ l > n;        (5.1)
    ∅,            otherwise.

The first part of the condition, k < m, determines whether S_{k,l} is before or after S_{m,n}; the second part, l + 1 > m, checks whether an overlap exists between the given subsegments. The last part, n > l, validates that the segment S_{m,n} is not a subsegment of the segment S_{k,l}. Notice that each of the two combined phrases has to consist of at least two words, otherwise the phrases cannot be combined with an overlap (in contrast to non-overlapping phrases); in that case the function (5.1) returns ∅. The condition on line 7 of the decoding Algorithm 1 corresponds to equation (5.1): the subsegments overlap on the source side.
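As a minimal sketch, the index test of equation (5.1) can be expressed as follows; the inclusive word indices and the function name are conventions of this sketch, not of the thesis notation.

    def source_overlap(k, l, m, n):
        # Equation (5.1): spans [k..l] and [m..n] must overlap without
        # one containing the other; returns the joined span, or None (≡ ∅).
        if k < m and l + 1 > m and n > l:
            return (k, n)
        if m < k and n + 1 > k and l > n:
            return (m, l)
        return None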

Algorithm 1: Basic Structure of the Decoder for Overlapping Subsegments
Data: segment S from the document; list I of indices (i, j) of subsegments occurring in S
Result: R
 1  I ← SortIndexes(I)
 2  while I ≠ ∅ do
 3      (i, j) ← First(I)
 4      I ← I − (i, j)
 5      T ← ∅
 6      for (k, l) ∈ I do
 7          if (k < i ∧ l + 1 ≥ i ∧ j > l) ∨ (i < k ∧ j + 1 ≥ k ∧ l > j) then
 8              if JoinTranslations(i, j, k, l) then
 9                  T ← T + (Min(k, i), Max(l, j))
10                  R ← R + (Min(k, i), Max(l, j))
11                  if (Min(k, i), Max(l, j)) = (0, Length(S)) then
12                      return R
13      I ← T + I
14  return R

Algorithm 1 for overlapping phrase decoding starts with the first subsegment (the index of the subsegment in the segment) from the sorted list (see Section 5.2 for the implementation of the function on line 1 of the

algorithm) and then tries to join it with the other subsegments. If it succeeds, the new subsegment is appended to the temporary list T. The success is conditioned on merging the translations (see Section 5.3 for the implementation of the function on line 8 of the algorithm). After all the other subsegments are processed, T is prepended to I and the algorithm starts with a new subsegment created from the best subsegment S_best (according to the sorting function) and the best other subsegment that could be joined with S_best. If it does not succeed, the next subsegment in order is processed.

In each iteration, Algorithm 1 discards one processed subsegment and generates new (longer) subsegments, or it takes the next unprocessed subsegment from the queue. Several iterations are illustrated in Figure 5.1. Overlapping phrases are combined until the whole segment is covered or until there are no phrases that can be combined.
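A condensed Python rendering of Algorithm 1 might look as follows. The callback join_translations stands for the validation of Section 5.3; its signature, the 0-based inclusive span convention, and the initial empty result are assumptions of this sketch.

    from collections import deque

    def decode(segment_len, spans, join_translations):
        # spans: sorted list of (i, j) inclusive word indices of subsegments
        queue = deque(spans)                 # I
        result = []                          # R
        while queue:
            i, j = queue.popleft()           # take the best remaining span
            created = []                     # T
            for k, l in queue:
                overlapping = (k < i and l + 1 >= i and j > l) or \
                              (i < k and j + 1 >= k and l > j)
                if overlapping and join_translations(i, j, k, l):
                    span = (min(k, i), max(l, j))
                    created.append(span)
                    result.append(span)
                    if span == (0, segment_len - 1):
                        return result        # the whole segment is covered
            queue.extendleft(reversed(created))   # prepend T to I
        return result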

Figure 5.1: Several iterations of the decoder. In the top-left position is the relaxed subsegment; left-to-right (black) arrows denote the subsegments that were joined to the relaxed subsegment, and right-to-left arrows denote the new subsegments added to the queue. The left half displays the new indices, while the right half illustrates more schematically how many subsegments were included during generation.


5.2 Subsegment Scores

The decoding algorithm processes the subsegments in order from the first to the last. Newly created segments are placed at the beginning of the queue; thus, the order is preserved during the whole decoding. Therefore, if the phrases are of different quality, they can be sorted so that the better pairs of subsegments and their translations are preferred. The phrases are sorted by the probabilities provided by Moses; all four probabilities are multiplied.
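A minimal sketch of this scoring, assuming each phrase carries the four probabilities parsed from the Moses output as in Section 4.1:

    def phrase_score(probs):
        # multiply the four Moses probabilities into a single sorting key
        score = 1.0
        for p in probs:
            score *= p
        return score

    def sort_phrases(phrases):
        # the best-scored pairs of subsegments are processed first
        return sorted(phrases, key=lambda ph: phrase_score(ph["probs"]),
                      reverse=True)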

5.3 Translation Combination

In this section we deal with how to join translations according to the overlapping part of the subsegments. In contrast to the work [12], we use the alignment points available from the subsegment extraction. If the phrases in the source language are overlapping, they are expected to have an overlap in the target language. Nevertheless, in some cases the translations can be merged correctly without any overlap even if the segments in the source are overlapping, see Figure 5.2. An empty alignment is the cause.

Figure 5.2: Translation combination with an empty alignment of a source word.

Different kinds of alignments are handled in the following way. Let us look at and analyze more examples.

Figure 5.3: Problematic translation combinations – many-to-one alignment, opposite direction alignment.

Figure 5.3 displays two situations where the translations of overlapping subsegments cannot be merged. In the case on the left side, the reason is that there is no overlap on the target side (compare with Figure 5.2). In the second case there is a match of one word in the translations, but

the order in which the translations can be merged does not agree with the order in which the subsegments are merged.

Figure 5.4 shows several examples of translation merging. These examples, each with a different kind of alignment, do not represent all possible cases, since different alignments are usually combined in subsegments. A subsegment can at the same time contain a one-to-many alignment, an empty alignment and a one-to-one alignment, or any other combination of different alignments.

Figure 5.4: Translation combination: successively from top to bottom, one-to-one alignment, empty alignment of a target word, many-to-one alignment, one-to-many alignment, opposite orientation.

Furthermore, we give and analyze several problematic or incorrect cases occurring in real data. Consider the example below with the overlap on the last word plném.

Subsegment     Translation    Alignment
jsou v plném   are fully in   0-0 2-1 1-2

If we split the translation according to the alignment in order to find the

overlapping part in the translation, we lose the preposition in. In this case, the solution could be the deletion of the word fully in the middle. But if the overlap is on the last two words v plném, the subsegment can be split. Notice that the correct alignment should consist of a one-to-one and an opposite direction alignment.

In the example below we meet the empty alignment of the preposition v. Let us again distinguish the overlap on the last two words and on the last word. If the overlap with the second segment is on the words v plném, the missing translation of the preposition v can be replaced by the translation of v from the second subsegment, if the second segment contains the translation of v with a correct alignment. In the case that we have the overlap only on the last word, the translation of v cannot be completed and is lost.

Subsegment    Translation   Alignment
jsou v plném  are fully     0-0 2-1

The empty alignment of the word the is in most cases correct in the direction from Czech to English. But it has to be considered when two translations are merged, since the unaligned word or words could be incorrectly repeated.

Subsegment        Translation               Alignment
jsou v souladu s  are consistent with the   0-0 1-1 2-1 3-2

The last example shows the one-to-many alignment of the word Konečný to the words A deadline and, at the same time, the many-to-one alignment of the words Konečný termín to the word deadline.

Subsegment      Translation  Alignment
Konečný termín  A deadline   0-0 0-1 1-1

Considering the previous analysis, we propose algorithms to validate the overlapping combination of the translations. For this purpose, we use the overlap of the subsegments together with the alignment points between the subsegments and their translations.

Let Overlap^S denote the overlap on the source side; further, let T1 be the translation of the first subsegment S1 and, similarly, T2 the translation of S2. The alignment points i-j are represented as pairs (i, j) in the list A1 for the first subsegment and in the list A2 for the second subsegment. Algorithm 2 goes through the words in Overlap^S and for each word performs the following three steps:


1. it finds the index of the word in S1 and then in S2,

2. it finds the translation in T1 and then in T2 according to the alignment, and

3. it checks the equality of the two translations of the word; if the comparison fails, the translations cannot be merged and the algorithm returns False.

Algorithm 2 validates the alignment of a particular word concurrently in both subsegments. The other option is to split the translation using the alignment and subsequently validate, by string comparison, whether the parts of the translations where the overlap may be are equal.

Algorithm 2: Validation of Exact Word Matches
Data: A1, A2, S1, S2, T1, T2, Overlap^S
 1  for word ∈ Overlap^S do
 2      k ← Index(word, S1)
 3      for (i, j) ∈ A1 do
 4          if k = i then
 5              t1 ← Word(j, T1)
 6      k ← Index(word, S2)
 7      for (i, j) ∈ A2 do
 8          if k = i then
 9              t2 ← Word(j, T2)
10      if t1 ≠ t2 then
11          return False
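In Python, Algorithm 2 may be rendered as below. Word lists and alignment-point pairs follow the data structures of Section 4.1; taking the first occurrence of a repeated word and the last matching alignment point are simplifications of this sketch.

    def exact_word_matches(A1, A2, S1, S2, T1, T2, overlap_src):
        # For each word of the source overlap, compare the word it is
        # aligned to in T1 with the word it is aligned to in T2; any
        # disagreement blocks the merge.
        for word in overlap_src:
            t1 = t2 = None
            k = S1.index(word)
            for i, j in A1:
                if k == i:
                    t1 = T1[j]
            k = S2.index(word)
            for i, j in A2:
                if k == i:
                    t2 = T2[j]
            if t1 != t2:
                return False
        return True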

Let Overlap^T_1 be the part of T1 where the overlap may be, as estimated by Algorithm 2; concretely, the words of T1 from the first word to which any word from Overlap^S is aligned up to the last word to which any word from Overlap^S is aligned. Further, let I1 be the list of indices of the words from T1 to which any word from Overlap^S is aligned.

Algorithm 3 checks whether Overlap^T_1 contains any word which is not aligned to Overlap^S. Such a word can be totally unaligned due to an empty alignment, or it can be aligned to a word outside Overlap^S by reordering. In the case of reordering, the subsegments cannot be merged and the algorithm returns False.

In empty, Algorithm 3 returns the empty-aligned words (indices of the words) in Overlap^T_1; the words in empty for T1 and for T2 are also consecutively checked for string equality. In behind, it returns the empty-aligned words behind Overlap^T_1, and in before the empty-aligned words before Overlap^T_1.


Algorithm 3 is written for the first subsegment, but it can easily be modified for the second subsegment.

The empty alignment around Overlap^T_1 may also cause wrong translation merging, for example word duplicates. Therefore, we check whether the returned before for T1 corresponds to before for T2, and behind for T2 corresponds to behind for T1. The example below shows two subsegments whose translations could be merged correctly, but are rejected by our algorithms.

Subsegment         Translation                        Alignment
výdaje na program  expenditure of the programme       0-0 1-1 1-2 2-3
program i většina  of the programme and the majority  0-1 0-2 1-3 2-5

The correct merged translation is expenditure of the programme and the majority. However, according to the alignment, the Czech word program is translated to programme in the first case and to the programme in the second. So the phrases would not be merged by our algorithms.

Algorithm 3: Detection of the Reordering and of the Empty Alignment
Data: A1, I1, T1
Result: empty, behind, before
 1  min ← Min(I1)
 2  max ← Max(I1)
 3  for k ← min to Length(T1) do
 4      if k ∉ I1 then
 5          for (i, j) ∈ A1 do
 6              if k = j then
 7                  return False
 8          if k > max then
 9              behind ← behind ∪ j
10          empty ← empty ∪ j
    /* min to 0 ≡ iteration in decreasing order */
11  for k ← min to 0 do
12      for (i, j) ∈ A1 do
13          if k = j then
14              return True
15          else
16              before ← before ∪ j

How can we find the correct position of the split using the alignment and, at the same time, avoid some mistakes in the alignment points? We encounter


examples where the empty alignment of several words behind Overlap^T_1 in T1 corresponds to the one-to-many alignment of the first word of Overlap^S to the first several words of Overlap^T_2. The second approach we deal with is to determine the approximate position of the split using the alignment points, and subsequently to shift T1 word by word and search for the correct match with T2.

Algorithm 4 compares the first word of T2 consecutively with the words of T1, in order from the last to the first. If a match is found, it then compares, in sequence, the words of T1 starting with the word after the match with the words of T2 starting with the second. If all the words match, the overlap is found; otherwise we continue with the comparison of the words of T1 with the first word of T2.

Algorithm 4: Join Overlapping Translations if Possible
Data: T1, T2
Result: T12, if T1 and T2 can be joined; ∅ otherwise
 1  position ← Length(T1)
 2  found ← False
 3  while position > 0 do
 4      if T1[position] = T2[0] then
 5          k ← 1
 6          found ← True
 7          for n ← position + 1 to Length(T1) do
 8              if T1[n] ≠ T2[k] then
 9                  found ← False
10              k ← k + 1
11          if found then
12              T12 ← T1[0 ... position] + T2
13              return True
14      position ← position − 1
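A Python rendering of Algorithm 4 on 0-based word lists follows; the guard against T2 being fully contained in T1 is an addition of this sketch.

    def join_overlapping_translations(T1, T2):
        # Slide T2 against the tail of T1 and return the merged
        # translation, or None when no consistent overlap is found.
        position = len(T1) - 1
        while position >= 0:
            if T1[position] == T2[0]:
                tail = T1[position + 1:]
                # the whole tail of T1 must match the continuation of T2,
                # and T2 must extend past the end of T1
                if tail == T2[1:1 + len(tail)] and len(tail) < len(T2):
                    return T1[:position] + T2
            position -= 1
        return None

    # join_overlapping_translations("went into".split(),
    #                               "into the forest".split())
    # -> ['went', 'into', 'the', 'forest']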

5.4 Decoder – Analysis and Alterations

In this section we suggest improvements for our decoder, mostly with the aim of reducing the complexity of the algorithm. As in other decoding algorithms, the search space can be reduced risk-free by recombining hypotheses: if there are two paths leading to the same result, we keep only the better one. Consider, for example, the phrases

na našem stole – on our table, na našem – on our and našem stole – our table; then by merging the last two subsegments we receive the first subsegment, and the path leading to this combination can be discarded.

The decoder can be modified to prefer joining phrases with a bigger overlap. The size of the overlap can easily be set by altering line 7 of Algorithm 1 according to the function (5.2), where the variable O determines the size of the overlap.

J(S_{k,l}, S_{m,n}) =
    w_k ... w_n,  if k < m ∧ l + 1 + O > m ∧ n > l;
    w_m ... w_l,  if m < k ∧ n + 1 + O > k ∧ l > n;        (5.2)
    ∅,            otherwise.

As was shown in Section 5.3, there are several cases of alignments which we currently do not know how to handle. For example, a phrase with an opposite direction alignment on two words can never be a part of a phrase combination during decoding. These useless phrases can be removed beforehand to speed up the decoding.

The decoder can generate several alternative translations for the segment. The phrases are scored only according to the probabilities provided by Moses, since the fluency of the translation is ensured by the overlap. But in Section 5.3 we discussed phrases with some problematic alignments which can cause, for example, duplicates in the join. For that reason, we also experiment with using a language model to choose the best merging. For this purpose we trained a language model with the KenLM [24] tool on the first 50,100,100 sentences of the corpus enTenTen1 [25], with the model order set to 5.

Let us analyze the iteration of the decoding algorithm in a little more detail, see Figure 5.5. When we combine the phrase 1 with the phrase 623, previously created by combining the phrases 6, 2 and 3, we know that it is enough to check the overlap of the phrase 1 just with the phrases 6 and 3, since it would have been joined with 2 in one of the previous steps, when 2 was at the edge. Similarly, when we combine 1623 with 04, we know that 04 will be prepended to 1623, or else it would have been joined to 623 in a previous step. Notice that if we have 1623, we also have the combinations 162, 16, 62, 623 and 23; all these phrases can be combined. Currently, in the implementation of the decoder, we use the information that the comparison needs to be done just at the edges of a phrase previously combined from more than two other phrases.

1. http://www.sketchengine.co.uk/documentation/wiki/Corpora/enTenTen


Figure 5.5: Iteration of the decoding algorithm.

Chapter 6 Fuzzy Match Treatment

The segments in the TM can differ from a new segment in just a few words, and these small mismatches can often be repaired. The disadvantage of the related works [10, 8] lies in substitution irrespective of the context; we want to remedy this shortcoming. In contrast to these works, we propose to translate some neighbourhood of the mismatch in the segment to ensure the fluency of the repaired segment.

6.1 Mismatches Detection

At first, we find the segments from the TM most similar to the new segment using the edit distance, which is commonly used in CAT systems to offer the translator the best segment from the TM to edit. Specifically, we use the Levenshtein distance, which counts the minimum number of edits (i.e. insertions, deletions or substitutions) required to change one segment into the other; we work with the word-based version of the string edit distance. Let S_s denote the segment to be translated, S_t the fuzzy-matched segment from the TM for S_s, and T_t its translation. We use the same definition of the fuzzy match score (FMS) as proposed in [8].

FMS = 1 − edit-distance(S_s, S_t) / max(|S_s|, |S_t|)
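For concreteness, the word-based edit distance and the FMS can be computed as follows; this is a standard dynamic-programming sketch, and the function names are ours.

    def edit_distance(a, b):
        # word-based Levenshtein distance between two token lists
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            cur = [i]
            for j, wb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (wa != wb)))   # substitution
            prev = cur
        return prev[-1]

    def fuzzy_match_score(s_source, s_tm):
        # FMS = 1 - edit-distance(S_s, S_t) / max(|S_s|, |S_t|)
        return 1 - edit_distance(s_source, s_tm) / max(len(s_source), len(s_tm))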

If we find an S_t with a sufficient score, we proceed with searching for the word alignment between S_s and S_t; otherwise, the SMT output could be offered to the human translator.

Three different cases can occur if we align S_s to S_t; these cases are illustrated in Figure 6.1 together with the way we handle them. We follow the outline mentioned in the paper [8]. Concretely, an insertion inserts a word into S_t and sets a pointer to the position in T_t, between the aligned words in the neighbourhood, where the translation should be inserted;

a deletion deletes the word from S_t together with its translation in T_t; and a substitution substitutes a word from S_t for the word from S_s and keeps the pointer to the translation which should be replaced by the new one. Our improvements are mainly focused on insertion and substitution.

The aligned words split S_t into the subsegments matching S_s, further denoted as sub, and the subsegments mismatching S_s, denoted as mis. For example, word2^s builds mis in the substitution displayed in the bottom part of Figure 6.1, and the words word1^s, word3^s build sub in the deletion.

Since we extract the phrases from the TM in Section 4.1, the alignment between S_t and T_t falls out as a by-product of the phrase extraction. We use the alignment to find the translations of all subsegments in sub and the translations of all subsegments in mis; subsequently we seek out problematic alignments of neighbouring subsegments from mis and sub to their translations and try to resolve them, similarly as in [8], for example by splitting up blocks. Some problematic alignments are illustrated in Figure 6.2.

Figure 6.1: Handling of the different edit operations (insertion, deletion, substitution).

Figure 6.2: Examples of problematic alignments (non-contiguous and unaligned words).


6.2 Enlarged Mismatches Translation

In this section we deal with the improvement of fuzzy matches by translating mismatches enlarged with some neighbourhood, together with the matching parts of the fuzzy match and the new segment. Let 1mis1 denote the mismatch with one word before and one word behind. Notice that the mismatches at the beginning and at the end cannot be extended from both sides; there we use 1mis for a mismatch at the end and, respectively, mis1 for a mismatch at the beginning. The decoder for overlapping phrases translates 1mis1 and joins it with the matched parts in the segment from the TM.1 The other possibility is to translate 1mis1 with standard SMT decoders and validate the overlap subsequently; the decoders can usually generate n-best translations, which can be successively validated for the overlap with the matched parts.

The approach of using our decoder to improve the translation is described in detail below. For simplicity, suppose that we have just one mismatch in S_s. Then, in our approach, we first find the subsegments in S_t matching subsegments in S_s, together with their translations; subsequently we find among the extracted phrases those which cover the 1mis1, 1mis and mis1 in S_s. All these phrases then build the input for our decoder.

See Figure 6.3, where the mismatched word is 1000 and 1mis1 is obsahuje 1000 záznamů; the search space of the decoder then consists of the subsegment Tento soubor obsahuje with its translation The set contains, the subsegment záznamů firem . with its translation records of the companies ., and all subsegments and their translations covering obsahuje 1000 záznamů.

New segment:     Tento soubor obsahuje 1000 záznamů firem .
TM segment:      Tento soubor obsahuje 1134 záznamů firem .
TM translation:  The set contains 1134 records of the companies .

Figure 6.3: Detection of the mismatched word in the TM source together with the word alignment to the TM target.

1. Mismatches extended with more than one word could also be considered.
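A minimal sketch of the context extension, assuming the mismatch is given as a half-open word span in the new segment:

    def enlarge_mismatch(segment, mis_start, mis_end):
        # Extend the mismatch span [mis_start, mis_end) by one word of
        # context on each available side (1mis1; mis1 / 1mis at the
        # segment edges, where only one side can be extended).
        start = max(0, mis_start - 1)
        end = min(len(segment), mis_end + 1)
        return segment[start:end]

    # For ['Tento', 'soubor', 'obsahuje', '1000', 'záznamů', 'firem', '.']
    # and the span (3, 4), this yields ['obsahuje', '1000', 'záznamů'].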

Chapter 7 Methods Proposals and Overview

This chapter brings out several other proposals of methods which can be used for TM enlargement.

7.1 Overlapping Hierarchical Model in SMT

In [26], a method is described to create more complex rules in the hierarchical model in SMT using overlaps. For example, the rules nejlepší N pes and pes N na náměstí could be merged into the new rule nejlepší N1 pes N2 na náměstí.

Motivated by high accuracy, we propose an improvement for the hierarchical model. We create longer rules with at least two terminals before and after the non-terminal symbol, and with terminal sequences consisting of at least three words; then during the decoding we can search among the rules and validate the overlap. Two kinds of substitution can be distinguished: with the non-terminal symbol or without it. For example, naše staré N bijí čtyři and staré zaprášené N hlasitě bijí create the new rule with a non-terminal symbol naše staré zaprášené N hlasitě bijí čtyři. The same approach can be used for the rules with the non-terminals at the end or at the beginning of the rule. A fast searching and decoding algorithm for these rules is challenging.

7.2 The Decoder as Fuzzy Match Generator

Our decoder can be used for translating a whole segment, or it can supply a close TM match; or, in the case that there is no good fuzzy match, it can generate as much as possible to cover the segment and offer the result to the human translator as a new fuzzy match. The translated parts should be properly highlighted to ease the detection of the translations in the segment. The human translator should have the possibility to choose among more than one translation if alternative translations are generated.


7.3 Lexicalization for overlapping phrases decoder

SMT systems can improve their outputs by adding further factors, such as lemmas or tags, into the training; the word alignment on sparse data can often be improved in this way. SMT systems can also use syntactic analysis. Inspired by these approaches, we consider the use of additional factors as well. Our decoder for overlapping phrases could be enhanced using lemmas: the overlaps could be searched on lemmas, and the corresponding translations could be merged together using some transfer rules.

Chapter 8 Evaluation

As test data we have used a commercial translation memory (COM) and an example document, both provided by one of the biggest Czech translation companies. In the evaluation process we have tested the translation on a document with 3,818 segments; the provided TM contained 144,082 segment pairs. For comparison we also present our results for the DGT translation memory [20] (DGT). For the evaluation using DGT we have used 329,155 pairs from the 2014 release and evaluated them on 10,000 randomly chosen segments from the same release; duplicate pairs were removed beforehand. The out-of-vocabulary rate on tokens is 2.8% for DGT and 11.0% for COM. See Table 8.1 for more statistics.

Table 8.1: Statistics of the corpora used in the experiments

                    COM                     DGT
                    Corpus      Test        Corpus      Test
Czech words         5,973,329   182,643     1,722,064   36,947
avg. |segment| CS   18          18          12          10
English words       6,774,993   206,594     2,065,742   45,169
avg. |segment| EN   20          21          14          12

The Meteor [27] tool was used to evaluate the quality (precision) of the proposed translated segments. We provide statistics for the proposed decoding algorithm on both test data sets, see Table 8.2. We compare the scores of the segments fully translated by our decoder with the fuzzy matches for those segments, and we show that our system outperforms the TM baseline. The fuzzy match is determined using the fuzzy match score proposed in Section 6.1; if several segments are chosen by the word-based FMS, we try to resolve the tie by launching a character-based FMS, otherwise we choose randomly one of the best segments.

Our decoder built 9 new segments for DGT.

Table 8.2: Translation quality (Meteor score) for 100% matches.

                COM                DGT
                Decoder  Fuzzy     Decoder  Fuzzy
precision       0.63     0.41      0.93     0.82
recall          0.74     0.37      0.86     0.83
f1              0.68     0.39      0.89     0.82
Meteor score    0.37     0.20      0.50     0.48

In the picture below, a new segment from our decoder (on the left side) and the fuzzy match (on the right side) are displayed. At the top lies the segment from the reference.

Reference:  assessment visits shall be conducted in two phases .

Meteor alignment, decoder output vs. fuzzy match (Segment 820):
P:     0.916 vs 0.493 : -0.422
R:     0.916 vs 0.453 : -0.463
Frag:  0.000 vs 0.407 :  0.407
Score: 0.916 vs 0.272 : -0.644

Chapter 9 Conclusion

In this thesis we proposed and described two new methods for enlarging translation memories, the first based on iterative merging of overlapping phrases and the second on repairing high fuzzy matches. The first method was evaluated on the DGT TM and on a commercial TM, and we showed that our method outperforms the TM baseline. It was shown that the usage of overlaps has a positive influence on the translation quality and that great potential lies in the use of overlapping phrases.

Several other approaches were proposed to obtain high quality translations or to help human translators speed up the translation process.

In future work we will focus on improving the decoding of overlapping phrases and on implementations of the other proposed methods, especially the treatment of fuzzy matches.

Bibliography

[1] Planas, E., Furuse, O.: Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In: Proceedings of the 18th Conference on Computational Linguistics – Volume 2, Association for Computational Linguistics (2000) 621–627

[2] Planas, E., Furuse, O.: Formalizing translation memories. In: Machine Translation Summit VII. (1999) 331–339

[3] Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics (2003) 48–54

[4] Désilets, A., Farley, B., Stojanovic, M., Patenaude, G.: WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer 30 (2008) 27–28

[5] Nevado, F., Casacuberta, F., Landa, J.: Translation memories enrichment by statistical bilingual segmentation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004. (2004)

[6] Simard, M., Langlais, P.: Sub-sentential exploitation of translation memories. In: Machine Translation Summit VIII. (2001) 335–339

[7] Macklovitch, E., Simard, M., Langlais, P.: TransSearch: A Free Translation Memory on the World Wide Web. In: Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000. (2000)

[8] Koehn, P., Senellart, J.: Convergence of translation memory and statistical machine translation. (2010)

[9] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses:


Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics (2007) 177–180

[10] Smith, J., Clark, S.: EBMT for SMT: A new EBMT-SMT hybrid. (2009) 3–10

[11] Fellbaum, C.: WordNet: An Electronic Lexical Database. Bradford Books (1998)

[12] Tribble, A., Vogel, S., Waibel, A.: Overlapping phrase-level translation rules in an SMT engine. IEEE (2003) 574–579

[13] He, Y.: The integration of machine translation and translation memory. PhD thesis, Dublin City University, School of Computing (2011)

[14] Zhechev, V., van Genabith, J.: Seeding statistical machine translation with translation memory output through tree-based structural alignment. In: Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation, Beijing, China, Coling 2010 Organizing Committee (2010) 43–51

[15] Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, Association for Computational Linguistics (2008) 49–57

[16] Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003) 19–51

[17] Tiedemann, J.: Parallel Data, Tools and Interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012. (2012) 2214–2218 http://opus.lingfil.uu.se

[18] Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT Summit. Volume 5. (2005) http://www.statmt.org/europarl

[19] Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058 (2006)


[20] Steinberger, R., Eisele, A., Klocek, S., Pilos, S., Schlüter, P.: DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226 (2013)

[21] Baisa, V., Bušta, J., Horák, A.: Improving coverage of translation memories with language modelling. In: RASLAN 2014: Recent Advances in Slavonic Natural Language Processing (2014)

[22] Koehn, P.: Statistical Machine Translation. Cambridge University Press (2009)

[23] Knight, K.: Decoding complexity in word-replacement translation models. Comput. Linguist. 25 (1999) 607–615

[24] Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics (2011) 187–197

[25] Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V., et al.: The TenTen corpus family. In: Proc. Int. Conf. on Corpus Linguistics. (2013)

[26] Karimova, S., Simianer, P., Riezler, S.: Offline extraction of overlapping phrases for hierarchical phrase-based translation. (2014)

[27] Denkowski, M., Lavie, A.: Meteor universal: Language specific transla- tion evaluation for any target language. In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation. (2014)
