From Statistical Term Extraction to Hybrid Machine Translation

From Statistical Term Extraction to Hybrid Machine Translation Petra Wolf, Ulrike Bernardi Christian Federmann, Sabine Hunsicker Lucy Software and Services DFKI Neumarkter Str. 81 Stuhlsatzenhausweg 3 81673 Munich, Germany 66123 Saarbrucken,¨ Germany petra.wolf,[email protected] sabine.hunsicker,[email protected] Abstract multi-layered Linguistically augmented Statistical Terminology EXtraction). We decided to impose This study presents a new hybrid approach strict knowledge-enhanced requirements for the for translation equivalent selection within statistical term extraction which consequently was a transfer-based machine translation sys- augmented by several layers of linguistic data re- tem using an intertwined net of traditional finement and automatic feature attribution. LiS- linguistic methods together with statisti- TEX is distinct from other term extractors as it cal techniques. Detailed evaluation reveals has layers that access already during the early term that the translation quality can be improved defining extraction phase the RBMT system com- substantially in this way. ponents for linguistic filtering and later production of knowledge-augmented output. In contrast to 1 Introduction the phrase-table approach, the F-measure is high A promising integration point for statistical which avoids evident deteriorations after integra- techniques into Rule-Based Machine Translation tion into the MT system. (RBMT) systems is the transfer phase. A key prob- The remainder of this paper is organized as fol- lem here is to deal with non-deterministic rules lows: Section 2 gives an overview of related work. and preferences in order to disambiguate and select Details on the terminology extraction are then pro- the most natural expressions in the target language vided in Section 3 followed by a description of the (cf. (Eisele et al., 2008b; Thurmair, 2009)). We results of terminology and translation quality eval- performed studies on a phrase-table-driven trans- uations in section 4. Section 5 summarizes the key fer approach based on an RBMT system (for de- findings and outlines open issues for future work. tails on the RBMT system see (Alonso and Thur- mair, 2003)): A prototype was built accessing sta- 2 Related work tistically generated bilingual phrase tables at run- There have been several studies of hybrid machine time in addition to system lexicons and grammars. translation approaches to overcome the drawbacks This hybrid prototype showed indeed better lexi- of rule-based and statistical MT alone by a com- cal selection, improved coverage and even some- bined approach, starting from RBMT as well times enhanced syntax, but well-known issues in as starting from SMT. Evaluations of SMT vs. morpho-syntax and a very low hit rate of the phrase RBMT systems revealed that one of the weak table module remained as limiting factors. Thus al- points of RBMT systems is the lexical selection in though the data which were accessed and preferred transfer (cf. (Thurmair, 2009; Chen et al., 2009). are obviously the desired ones, the first challenge Since RBMT systems tend to suffer from insuf- was the lack of deeper linguistic knowledge within ficient and too deterministic lexical coverage and the data while the second challenge was the small choice (Eisele et al., 2008b), this study concen- F-measure of the data itself. trates on automatic enlargements of RBMT lexi- This led us to a deeper intertwined hybrid exten- cons and enhanced transfer-generation operations, sion: The LiSTEX approach (Hybrid Transfer by while taking into account the peculiarities of a spe- c 2011 European Association for Machine Translation. cific RBMT system and statistical techniques. Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (eds.) Proceedings of the 15th Conference of the European Association for Machine Translation, p. 225232 Leuven, Belgium, May 2011 The considerable potential of statistical term extraction combined with RBMT has been evalu- ated by other researchers, such as (Thurmair, 2003; Dugast et al., 2009). Here we go one step further by already integrating some RBMT techniques into the very early stages of the statistical term extraction process. This helps to avoid the well- known problems found in term guessing and identification (Heid, 1999). Additional linguistic layers assure that the extracted terms and phrases are tailored and augmented for the RBMT system in question. 3 Intelligent Terminology Extraction The underlying term extraction tool extracts term pairs by means of statistical algorithms from ex- isting translation memories or bilingual corpora (Eisele et al., 2008a). As a bilingual corpus, the Europarl corpus (Koehn, 2005) was used for devel- opment. As a test set, we choose the ACL WMT 2008 test set which contains the Q4/2000 portion Figure 1: Term Extraction Workflow of the EuroParl data (2000-10 to 2000-12). The initial design study on peculiarities of a spe- tents of this terminology to be extracted and im- cific statistically based term extraction that satis- ported are determined by the fine-grained data fies the requirements of an RBMT system dealt needed by an RBMT system. The first major modi- with issues like term definition in a strict linguis- fication of the statistical extraction has to fulfill the tic and system-related sense, term identification requirement to extract only well defined linguis- as single words vs. multiword expressions, base tic terms. For this reason, the extraction toolkit form reduction and application of linguistic cate- has been extended in order to access and use the gory patterns for identification or elimination of RBMT analysis, transfer and generation trees to noise at the end of the first multi-layered extrac- find interesting linguistic phrases, i.e. terms. In tion phase. Also the term recognition needed to be addition, it checks whether the translation of a spe- defined in relation with the linguistically based tar- cific term in the transfer tree, be it a single or a get system, i.e. comparison of the extraction result multiword, differs from the reference translation with the target system lexicon in order to identify in the corpus. In this way, the term extraction tool known vs. unknown terms. Thus this painstaking delivers only terms which receive a linguistic iden- specification research provided the basis for the ex- tification and which are translated differently with tended implementation of the LiSTEX term identi- the conventional RBMT system than in the bilin- fication, knowledge-enhanced acquisition and aug- gual memory. mentation modules. The extracted terminology has to contain the For LiSTEX, we concentrated on German- following minimum information: Canonical forms English and Spanish-English. The remaining sub- and categories of the source and target terms, do- sections explain details of the intelligent term ex- main area, frequency and, in case of multiwords, traction techniques and present the term extraction the internal structure of the expression. Finally, workflow as a whole (see Figure 1). the context has to be passed through: One example source language sentence in which the term 3.1 Term Identification with Initial Meta occurs in the corpus with the corresponding trans- Knowledge Acquisition and Organization lation equivalent sentence, useful for later manual The objective is to have the semantic translation inspection. equivalents decided by the statistical technique. For achieving this first layer of terminology ac- The requirements, however, on the linguistic con- quisition, the source and reference text are tok- 226 enized, tagged with part of speech information and as punctuation marks or integers, are filtered the tokens are lemmatized. Source and reference into separate lists. Since these entries tend to texts are aligned word-by-word, the source text be wrongly aligned and extracted, the sublists is translated by the RBMT Engine and analysis, have to be inspected and only useful terms transfer and generation trees are created. These will be imported. trees are aligned to the words in the source and ref- All entries showing a category change • erence text and all phrases which the RBMT sys- (i.e. source and target categories are not tem translated differently than the reference trans- identical) are excluded from import since lation are selected as potential term candidates. they are likely to be wrong. Sometimes they Since these phrases still include the original sur- are caused by errors during part of speech face forms, the canonical forms are built now, e.g. tagging, sometimes by alignment errors. adjectives are inflected properly to match the head Quality Splitting procedures which allow • noun. Thus the terms receive the proper dictionary further predictions as to the quality of the format. The categories for the entire terms are de- extracted terms. A German multiword term rived from the part of speech sequences. The list for example which results in an English of terms is alphabetically sorted and non-frequent single target word is a likely non-valid term, terms are filtered out. mostly due to alignment errors: Up to this point in the processing chain, the ex- 1. Single words on source and target side traction process is done for one language direction 2. Single words on source, multiword ex- only. Now the other language direction is gener- pressions on target side ated and the intersection of both is created to avoid 3. Multiword expressions on source side, wrong term pairs caused by alignment problems. single words on the target side

Load more