
Automatic Translation of WordNet Glosses

Jesús Giménez and Lluís Màrquez
TALP Research Center, LSI Department
Universitat Politècnica de Catalunya
{jgimenez,lluism}@lsi.upc.edu

German Rigau
IXA Group
University of the Basque Country
[email protected]

Abstract

We approach the task of automatically translating the glosses in the English WordNet. We intend to generate preliminary material which could be utilized to enrich other wordnets lacking glosses. A phrase-based Statistical Machine Translation system has been built using a parallel corpus of proceedings of the European Parliament. We study how to adapt the system to the domain of dictionary definitions. First, we work with specialized language models. Second, we exploit the Multilingual Central Repository to build domain-independent translation models. Combining these two complementary techniques and properly tuning the system, a relative improvement of 64% in BLEU score is attained.

1 Introduction

In this work we study the possibility of applying Statistical Machine Translation (SMT) techniques to the glosses in the English WordNet (Fellbaum, 1998). WordNet glosses are a very useful resource. For instance, Mihalcea and Moldovan (1999) suggested an automatic method for generating sense-tagged corpora which uses WordNet glosses. Hovy et al. (2001) used WordNet glosses as external knowledge to improve their Webclopedia Question Answering (QA) system.

However, most of the wordnets in the Multilingual Central Repository (MCR) (Atserias et al., 2004) contain very few glosses. For instance, in the current version of the Spanish WordNet fewer than 10% of the synsets have a gloss. Conversely, since version 1.6 every synset in the English WordNet has a gloss. We believe that a method to rapidly obtain glosses for all the wordnets in the MCR may be helpful. These glosses could serve as a starting point for a further step of revision and post-editing. Furthermore, from a conceptual point of view, the idea of enriching the MCR using the MCR itself is very attractive.

Moreover, SMT is today a very promising approach to Machine Translation (MT) for a number of reasons. The most important one in the context of this work is that it allows an MT system to be built very quickly, given only a parallel corpus representing the languages involved. Besides, SMT is fully automatic and its results are also very competitive. However, one of the main claims against SMT is that it is domain oriented. Since parameters are estimated from a parallel corpus in a specific domain, the performance of the system on a different domain is often much worse. In the absence of a parallel corpus of definitions, we built phrase-based[1] translation models on the Europarl corpus[2] (Koehn, 2003). However, the language of dictionary definitions is very specific and different from that of parliament proceedings. This is particularly harmful to the system's recall, because many unknown words will be processed.

[1] The term 'phrase' used hereafter refers to a sequence of words not necessarily syntactically motivated.
[2] European Parliament Proceedings (1996-2003) are available for 11 European languages at http://people.csail.mit.edu/people/koehn/publications/europarl/. We used a version of this corpus reviewed by the RWTH Aachen group.
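To make the domain-dependence problem concrete, the following sketch (with invented toy data; this is an illustration of the general idea, not code from the actual system) shows how translation probabilities estimated by relative frequency from an in-domain corpus leave out-of-domain words untranslated:

```python
from collections import defaultdict

def build_phrase_table(aligned_pairs):
    """Estimate Pr(target | source) by relative frequency over
    extracted phrase pairs, as in standard phrase-based SMT."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    table = {}
    for src, tgts in counts.items():
        total = sum(tgts.values())
        table[src] = {t: c / total for t, c in tgts.items()}
    return table

def translate(words, table):
    """Greedy lookup; unknown words pass through untranslated."""
    out = []
    for w in words:
        if w in table:
            out.append(max(table[w], key=table[w].get))
        else:
            out.append(w)  # out-of-domain word: left as-is, hurting recall
    return out

# Toy in-domain (parliamentary) pairs -- invented for illustration.
pairs = [("debate", "debate"), ("president", "presidente"),
         ("president", "presidente"), ("president", "presidenta")]
table = build_phrase_table(pairs)
print(translate(["president", "ankle"], table))  # 'ankle' is unknown
```

A dictionary-domain word such as 'ankle' never occurs in the parliamentary training data, so it surfaces verbatim in the output, exactly the recall problem described above.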

In order to adapt the system to the new domain we study two separate lines. First, we use electronic dictionaries in order to build more adequate target language models. Second, we work with domain-independent word-based translation models extracted from the MCR. Other authors have previously applied information extracted from aligned wordnets. Tufis et al. (2004b) presented a method for Word Sense Disambiguation (WSD) based on parallel corpora. They utilized the aligned wordnets in BalkaNet (Tufis et al., 2004a).

We propose to use these models as a complement to phrase-based models. These two proposals, together with a good tuning of the system parameters, lead to a notable improvement of results. In our experiments, we focus on translation from English into Spanish. A relative increase of 64% in BLEU measure is achieved when limiting the use of the MCR-based model to the case of unknown words.

The rest of the paper is organized as follows. In Section 2 the fundamentals of SMT are depicted. In Section 3 we describe the components of our system. Experimental work is deployed in Section 4. Improvements are detailed in Section 5. Finally, in Section 6, current limitations of our approach are discussed, and further work is outlined.

2 Statistical Machine Translation

Current state-of-the-art SMT systems are based on ideas borrowed from the Communication Theory field (Weaver, 1955). Brown et al. (1988) suggested that MT can be statistically approximated to the transmission of information through a noisy channel. Given a sentence f (the distorted signal), it is possible to approximate the sentence e (the original signal) which produced it. We need to estimate Pr(e|f), the probability that a translator produces e as a translation of f. By applying Bayes' rule we decompose it:

    Pr(e|f) = Pr(f|e) * Pr(e) / Pr(f)    (1)

To obtain the string e which maximizes the translation probability for f, a search in the probability space must be performed. Because the denominator is independent of e, we can ignore it for the purpose of the search:

    ê = argmax_e Pr(f|e) * Pr(e)    (2)

Equation 2 reveals three components in an SMT system. First, a language model that estimates Pr(e). Second, a translation model representing Pr(f|e). Last, a decoder responsible for performing the search. See (Brown et al., 1993) for a detailed report on the mathematics of Machine Translation.

3 System Description

Fortunately, we can count on a number of freely available tools to build an SMT system.

We utilized the SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002). It supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices.

In order to build phrase-based translation models, a phrase extraction must be performed on a word-aligned parallel corpus. We used the GIZA++ SMT Toolkit[3] (Och and Ney, 2003) to generate word alignments. We applied the phrase-extract algorithm, as described by Och (2002), on the Viterbi alignments output by GIZA++. This algorithm takes as input a word alignment matrix and outputs a set of phrase pairs that is consistent with it. A phrase pair is said to be consistent with the word alignment if all the words within the source phrase are aligned only to words within the target phrase, and vice versa.

Phrase pairs are scored by relative frequency (Equation 3). Let p_s be a phrase in the source language (f) and p_t a phrase in the target language (e). We define a function count(p_s, p_t) which counts the number of times the phrase p_s has been seen aligned to phrase p_t in the training data. The conditional probability that p_s maps into p_t is estimated as:

    Pr(p_t | p_s) = count(p_s, p_t) / Σ_p' count(p_s, p')    (3)

No smoothing is performed.

For the search, we used the Pharaoh beam search decoder (Koehn, 2004). Pharaoh is an implementation of an efficient dynamic programming search algorithm with lattice generation and XML markup for external components. Performing an optimal decoding can be extremely costly because the search space grows exponentially with the length of the input (Knight, 1999). For this reason, like most decoders, Pharaoh actually performs a suboptimal (beam) search by pruning the search space according to certain heuristics based on the translation cost.

[3] The GIZA++ SMT Toolkit may be freely downloaded at http://www.fjoch.com/GIZA++.html

4 Experiments

4.1 Experimental Setting

As sketched in Section 2, in order to build an SMT system we need to build a language model and a translation model, all in a format that is convenient for the Pharaoh decoder.

We tokenized and case-lowered the Europarl corpus. A set of 327,368 parallel segments of length between five and twenty was selected for training. The Spanish side consisted of 4,243,610 tokens, whereas the English side consisted of 4,197,836 tokens.

We built a language model from the Spanish side of the Europarl corpus selection. Linear interpolation was applied for smoothing.

We used the GIZA++ default configuration. In the phrase extraction we worked with the union of source-to-target and target-to-source alignments, with no heuristic refinement. Only phrases up to length five were considered. Also, phrase pairs in which the source/target phrase was more than three times longer than the target/source phrase were ignored. Finally, phrase pairs appearing only once were discarded, too.

4.2 Data Sets

By means of the MCR we obtained a set of 6503 parallel glosses. These definitions correspond to 5684 nouns, 87 verbs, and 732 adjectives. Examples and parenthesized texts were removed. Gloss average length was 8.03 words for English and 7.83 for Spanish. Parallel glosses were tokenized and case-lowered, and randomly split into development (3295 gloss pairs) and test (3208 gloss pairs) sets.

4.3 Evaluation Metrics

Three different evaluation metrics have been computed, namely the General Text Matching (GTM) F-measure (for exponents e = 1 and e = 2) (Melamed et al., 2003), the BLEU score (n = 4) (Papineni et al., 2001), and the NIST score (Lin and Hovy, 2002). These metrics have proved to correlate well with both human adequacy and fluency. They all reward n-gram matches between the candidate translation and a set of reference translations. The larger the number of reference translations, the more reliable these measures are. Unfortunately, in our case, a single reference translation is available.

BLEU has become a 'de facto' standard nowadays in MT. Therefore, we discuss our results based on the BLEU score. However, it has several deficiencies that make it impractical for error analysis (Turian et al., 2003). First, BLEU does not have a clear interpretation. Second, BLEU is not adequate to work at the segment[4] level but only at the document level. Third, in order to punish candidate translations that are too long/short, BLEU computes a heuristically motivated word penalty factor.

In contrast, the GTM F-measure has an intuitive interpretation in the context of a bitext grid. It represents the fraction of the grid covered by aligned blocks. It also, by definition, works well at the segment level and punishes translations too divergent in length. Therefore, we also analyze individual cases based on the GTM F-measure.

In the future, we also consider the possibility of conducting very modest human evaluations.

[4] A segment is the minimal unit of translation. It is usually the size of a sentence. It can be smaller (a word, a phrase) or bigger (a couple of sentences, a paragraph), though.

4.4 Results

Baseline system results are shown in Table 1.

system            GTM-1   GTM-2   BLEU    NIST
EU-baseline-dev   0.3091  0.2196  0.0730  3.0953
EU-baseline-test  0.3028  0.2155  0.0657  3.0274
EU-europarl       0.5885  0.3567  0.2725  7.2477

Table 1: Preliminary MT results on development (dev) and test (test) sets, and on a Europarl test set.

The performance of the system on the new domain is very low in comparison to the performance on a set of 8490 unseen sentences from the European Parliament Proceedings.

We analyzed these results in detail based on the GTM F-measure. Some cases are shown in Table 2. Only 28 glosses obtain an F-measure over 0.9. Most of them are too short, less than 5 words (e.g. 2917). 10% of the glosses (320) obtain an F-measure over 0.5. Interestingly, many of them are somehow related to the domain of politics and economy (e.g. 193, 293, 345, 362, 1414, 1674 and 1721). On the other hand, 18% of the glosses obtain an F-measure below 0.1. In many cases this is due to unknown vocabulary (e.g. 34, 508, 2263 and 2612). However, we found many translations unfairly scoring too low due to strong divergences between source and reference. We call this phenomenon 'quasi-parallelism' (e.g. 7, 1606, and 2985).

case  synset-ili  | Source | Target | Reference
'good' translations
193   00392749#n  | the office and function of president | el cargo y función de presidente | cargo y función de presidente
293   00630513#n  | the action of attacking the enemy | acción de atacar al enemigo | acción y efecto de atacar al enemigo
345   00785108#n  | the act of giving hope or support to someone | la acción de dar esperanza o apoyo a alguien | acción de dar esperanza o apoyo a alguien
362   00804210#n  | the combination of two or more commercial companies | la combinación de dos o más comerciales compañías | combinación de dos o más empresas
1414  05359169#n  | the act of presenting a proposal | el acto de presentar una propuesta | acto de presentar una propuesta
1674  06089036#n  | a military unit that is part of an army | unidad militar que forma parte de un ejército | unidad militar que forma parte de un ejército
1721  06213619#n  | a group of representatives or delegates | grupo de representantes o delegados | grupo de representantes o delegados
2917  01612822#v  | perform an action | realizar una acción | realizar una acción
'bad' translations
7     00012865#n  | a feature of the mental life of a living organism | una característica de la vida mental de un organismo vivo | rasgo psicológico
34    00029442#n  | the act of departing politely | el acto de departing politely | acción de marcharse de forma educada
508   02581431#n  | a kitchen appliance for disposing of garbage | kitchen una appliance para disposing de garbage | cubo donde se depositan los residuos
1606  05961082#n  | people in general | gente en general | grupo de gente que constituye la mayoría de la población y que define y mantiene la cultura popular y las tradiciones
2263  07548871#n  | a painter of theatrical scenery | una painter de theatrical scenery | persona especializada en escenografía
2612  10069279#n  | rowdy behavior | rowdy behavior | comportamiento escandaloso
2985  00490201#a  | without reservation | sin reservas | movido por una devoción o un compromiso entusiasta y decidido

Table 2: MT examples of the baseline system. 'Source' and 'Target' refer to the input and output of the system, respectively. 'Reference' corresponds to the expected output.

5 Improvements

5.1 Language Modeling

The first improvement is based on building additional specialized language models. We utilized two large monolingual Spanish electronic dictionaries, consisting of 142,892 definitions (2,112,592 tokens) (Martí, 1996) and 168,779 definitions (1,553,674 tokens) (Vox, 1990), respectively.

We tried different language model configurations. See Table 3. We refer to the baseline system, which uses the Europarl language model only, as 'EU'. In 'D1' and 'D2' we replaced the language model with those obtained from dictionaries D1 and D2, respectively. 'D1-D2' combines the two dictionaries with equal probability. 'D1-D2-EU' combines all three language models with equal probability.

language model  GTM-1   GTM-2   BLEU    NIST
EU              0.3091  0.2196  0.0730  3.0953
D1              0.3361  0.2409  0.0905  3.4881
D2              0.3374  0.2419  0.0890  3.4719
D1-D2           0.3422  0.2457  0.0940  3.5515
D1-D2-EU        0.3428  0.2456  0.0949  3.5655

Table 3: MT results on the development set for different language model configurations.

As expected, language models built from dictionaries work much better than the one built from the Europarl corpus. Results improve still slightly further by combining the two dictionaries. A relative increase of 30% in BLEU score is reported. Adding the EU language model does not report any significant improvement.

5.2 Using the MCR

The second improvement is based on extracting domain-independent translation models from the MCR. Outer knowledge may be supplied to the Pharaoh decoder by annotating the input with alternative translation options via XML markup. In the default setting we enrich all nouns, verbs, and adjectives by looking up all possible translations for all their meanings according to the MCR. For the 3295 glosses in the development set, a total of 13,335 words, corresponding to 8,089 nouns, 2,667 verbs and 2,579 adjectives respectively, were enriched. We have not worked on adverbs yet because of some problems with our lemmatizer: while in WordNet the lemma for an adverb is an adjective, our lemmatizer returns an adverb.

Translation pairs are heuristically scored according to the number of senses which may lexicalize in the same manner. For instance, the English word 'bank' as a noun is assigned nine different senses in WordNet. Four of these senses may lexicalize as the Spanish word 'banco' (financial institution) whereas only one sense lexicalizes as 'orilla' (the bank of a river). The scoring heuristic accounts for this by assigning a higher score to '(banco, bank)'.

Let w_s and p_s be the source word and its PoS, and let w_t be the target word. We define a function count(w_s, p_s, w_t) which counts the number of senses of (w_s, p_s) which may lexicalize as w_t. The scoring function is defined as:

    Pr(w_t | w_s, p_s) = count(w_s, p_s, w_t) / Σ_w' count(w_s, p_s, w')    (4)

In WordNet all word forms related to the same concept are grouped and represented by their lemma and part-of-speech (PoS). Therefore, input word forms must be lemmatized and PoS-tagged. WordNet takes care of the lemmatization step. For PoS-tagging we utilized the SVMTool[5] (Giménez and Màrquez, 2004). Similarly, at the output, the MCR provides us with lemmas instead of word forms as translation candidates, so a lemma extension must be performed. We utilized components from the Freeling[6] package (Carreras et al., 2004) for this step. See an example of enriched input in Table 4.

Then, we proceeded to apply the MCR-based model. Several strategies were tried. In all cases we allowed the decoder to bypass the MCR-based model when a better solution was found using the phrase-based model alone. See results in Table 5.

We defined as a new baseline the system which combines the three language models as detailed in Subsection 5.1 (no-MCR). In a first attempt, we enriched all content words in the validation set with all possible translation candidates (ALL). No improvement was achieved. By inspecting the input data, apart from some PoS-tagging errors, we found that the number of translation options generated via the MCR was growing too fast for words with too many senses, particularly verbs. In order to reduce the degree of polysemy we tried limiting the enrichment to words with at most 1, 2, 3, 4 and 5 different senses (S1, S2, S3, S4 and S5). Results improved slightly.

Ideally, one would wish to work with accurately word sense disambiguated input. We tried restricting translation candidates to those generated by the most frequent sense only (ALL-mfs). There was no significant variation in results.

[5] The SVMTool may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool/
[6] The Freeling Suite of Language Analyzers may be downloaded at http://www.lsi.upc.es/~nlp/freeling/

accomplishment of an objective
an organism such as an insect that habitually shares the nest of a species of ant
the part of the human leg between the knee and the ankle
a married man
an abstraction belonging to or characteristic of two entities or parts together
strengthening the concentration by removing extraneous material

Table 4: A sample of enriched input, scored as detailed in Equation 4.
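As a concrete illustration of the sense-counting heuristic of Equation 4, the following sketch (with a hand-built toy sense inventory standing in for the MCR; the data and function name are ours, not part of the actual system) scores the translation candidates of the noun 'bank' discussed above:

```python
def score_candidates(senses):
    """Score target lemmas for one (source word, PoS) pair by the
    relative frequency of senses that may lexicalize as each lemma,
    as in Equation 4. `senses` maps each sense of the source word
    to the set of target lemmas that can lexicalize it."""
    counts = {}
    for target_lemmas in senses.values():
        for lemma in target_lemmas:
            counts[lemma] = counts.get(lemma, 0) + 1
    total = sum(counts.values())
    return {lemma: c / total for lemma, c in counts.items()}

# Toy inventory for the noun 'bank' (illustrative, not the real MCR):
# four senses lexicalize as 'banco', one as 'orilla'.
bank_senses = {
    "bank#1": {"banco"},
    "bank#2": {"banco"},
    "bank#3": {"banco"},
    "bank#4": {"banco"},
    "bank#5": {"orilla"},
}
scores = score_candidates(bank_senses)
print(scores)  # 'banco' scores 0.8, 'orilla' scores 0.2
```

In the actual pipeline, scores of this kind would be attached to the input tokens through Pharaoh's XML markup, so the decoder can weigh them against the phrase-based model.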

strategy    GTM-1   GTM-2   BLEU    NIST
no-MCR      0.3428  0.2456  0.0949  3.5655
ALL         0.3382  0.2439  0.0949  3.4980
ALL-mfs     0.3367  0.2434  0.0951  3.4720
S1          0.3432  0.2469  0.0961  3.5774
S2          0.3424  0.2464  0.0963  3.5686
S3          0.3414  0.2459  0.0963  3.5512
S4          0.3412  0.2458  0.0966  3.5441
S5          0.3403  0.2451  0.0962  3.5286
N-mfs       0.3361  0.2428  0.0944  3.4588
V-mfs       0.3428  0.2456  0.0945  3.5649
A-mfs       0.3433  0.2462  0.0959  3.5776
UNK-mfs     0.3538  0.2535  0.1035  3.7580
UNK-and-S1  0.3463  0.2484  0.0977  3.6313
UNK-or-S1   0.3507  0.2523  0.1026  3.7104

Table 5: MT results on the development set, using the MCR.

We also studied the behavior of the model applied separately to nouns (N-mfs), verbs (V-mfs), and adjectives (A-mfs). The system worked worst for nouns, and seemed to work a little better for adjectives than for verbs.

All in all, we did not find an adequate manner to have the two translation models cooperate properly. Therefore we decided to use the MCR-based model only for those words unknown[7] to the phrase-based model (UNK-mfs). A significant relative improvement of 9% in BLEU score was achieved.

Finally, we tried translating only those words that were both unknown and monosemous (UNK-and-S1), and those that were either unknown or monosemous (UNK-or-S1). Results did not improve.

[7] 7.87% of the words in the development set are unknown.

5.3 Tuning the System

Another path we explored is the tuning of the Pharaoh parameters that control the importance of the different models that govern the search. In general, there are four important parameters to adjust: the language model probability, the translation model probability, the distortion probability, and the word penalty factor. Recall, for instance, the difference in length between source and target seen in Subsection 4.2. Tuning the word penalty parameter leads to better results. Also, a proper tuning of the probabilities of the three language models yields a significant improvement.

We utilized software based on the Downhill Simplex Method in Multidimensions (Press et al., 2002). Parameters were tuned for the 'no-MCR' and 'UNK-mfs' strategies on the development set. A further relative gain of 9% in BLEU score is reported. See Table 6.

strategy          GTM-1   GTM-2   BLEU    NIST
no-MCR-dev        0.3428  0.2456  0.0949  3.5655
UNK-mfs-dev       0.3538  0.2535  0.1035  3.7580
no-MCR-test       0.3352  0.2420  0.0915  3.4802
UNK-mfs-test      0.3478  0.2500  0.0991  3.6946
no-MCR-dev-T      0.3492  0.2496  0.1026  3.5352
UNK-mfs-dev-T     0.3599  0.2582  0.1124  3.7609
no-MCR-test-T     0.3431  0.2450  0.0965  3.4628
UNK-mfs-test-T    0.3554  0.2546  0.1075  3.7079

Table 6: MT results for the 'no-MCR' and 'UNK-mfs' strategies, before and after tuning (T), on development (dev) and test (test) sets.

We analyzed results of the 'UNK-mfs' and 'ALL-mfs' strategies based on the GTM F-measure. Table 7 shows some cases where MCR-based models prove their usefulness (e.g. 29, 35, 194, 268, 351, 377 and 965) and some cases where they cause the system to make a mistake (e.g. 1001, 1125 and 2570).

case  synset-ili  | Source | Target-base | Target-MCR | Reference
UNK-mfs
29    00025788#n  | accomplishment of an objective | accomplishment de un objetivo | consecución de un objetivo | consecución de un objetivo
194   00393890#n  | the position of secretary | situación de secretary | el cargo de secretario | posición de secretario
268   00579072#n  | the activity of making portraits | actividad de hacer portraits | actividad de hacer retratos | actividad de hacer retratos
377   00913742#n  | an organism such as an insect that habitually shares the nest of a species of ant | un organismo como un insect que habitually comparte el nest de una especie de ant | un organismo como un insecto que habitually comparte el nido de una especie de hormiga | organismo que comparte el nido de una especie de hormigas
965   04309478#n  | the part of the human leg between the knee and the ankle | parte de la persona leg entre los knee y el ankle | parte de la persona pierna entre la rodilla y el tobillo | parte de la pierna humana comprendida entre la rodilla y el tobillo
ALL-mfs
35    00029961#n  | the act of withdrawing | el acto de retirar | el acto de retirarse | acción de retirarse
351   00790504#n  | a favorable judgment | una sentencia favorable | una opinión favorable | opinión favorable
1001  04395081#n  | source of difficulty | fuente de dificultad | fuente de problemas | fuente de dificultad
1125  04634158#n  | the branch of biology that studies plants | rama de la biología que estudios plantas | rama de la biología que estudia factoría | rama de la biología que estudia las plantas
2570  10015334#n  | balance among the parts of something | equilibrio entre las partes de algo | equilibrio entre las partes de entidades | equilibrio entre las partes de algo

Table 7: MT examples of the 'ALL-mfs' and 'UNK-mfs' strategies. 'Source' refers to the raw input. 'Target-base' and 'Target-MCR' refer to the output of the baseline and MCR-helped systems, respectively. 'Reference' corresponds to the expected output.

6 Conclusions

By working with specialized language models and MCR-based translation models we achieved a relative gain of 63.62% in BLEU score (0.0657 vs 0.1075) when porting the system to a new domain.

But there is a strong limitation in our approach. When we mark up the input to Pharaoh we are somehow forcing the decoder to choose between a word-to-word translation and a phrase-to-phrase translation. In SMT, phrase-based models have been demonstrated to outperform word-based ones. A better way to integrate MCR-based models with phrase-based models should be investigated.

Moreover, more sophisticated heuristics should be considered for selecting and scoring MCR-based translation candidates.

Finally, better results should be obtained by working with word sense disambiguated text. We could favor those translation candidates showing a closer semantic relation to the source. We believe that coarse-grained WSD is sufficient for the purpose of MT. In the short term, we plan to utilize the system by Castillo et al. (2004), winner of the Senseval-3 workshop shared task on WSD of WordNet glosses.

Acknowledgements

This research has been funded by the Spanish Ministry of Science and Technology (ALIADO TIC2002-04447-C02) and the European Commission (MEANING IST-2001-34460). Our research group, TALP Research Center, is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government. The authors are also grateful to Xavier Carreras for pointing out to us the availability of the Pharaoh decoder, and to Patrik Lambert for providing us with the implementation of the Simplex Method.

References

Jordi Atserias, Luis Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek Vossen. 2004. The MEANING Multilingual Central Repository. In Proceedings of the Second International Global WordNet Conference (GWC'04), Brno, Czech Republic, January. ISBN 80-210-3302-9.

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, Robert L. Mercer, and Paul S. Roossin. 1988. A statistical approach to language translation. In Proceedings of COLING'88.

Peter F. Brown, Stephen A. Della Pietra, Robert L. Mercer, and Vincent J. Della Pietra. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Xavier Carreras, Isaac Chao, Lluís Padró, and Muntsa Padró. 2004. Freeling: An open-source suite of language analyzers. In Proceedings of the 4th LREC.

Mauro Castillo, Francis Real, Jordi Atserias, and German Rigau. 2004. The TALP systems for disambiguating glosses. In Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Task: Word-Sense Disambiguation of WordNet Glosses.

C. Fellbaum, editor. 1998. WordNet. An Electronic Lexical Database. The MIT Press.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th LREC.

Eduard Hovy, Ulf Hermjakob, and Chin-Yew Lin. 2001. The use of external knowledge of factoid QA. In Proceedings of TREC.

Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4).

Philipp Koehn. 2003. Europarl: A multilingual corpus for evaluation of machine translation. Technical report, http://people.csail.mit.edu/people/koehn/publications/europarl/.

Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA'04.

Chin-Yew Lin and E.H. Hovy. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Technical report, National Institute of Standards and Technology.

María Antonia Martí, editor. 1996. Gran diccionario de la Lengua Española. Larousse Planeta, Barcelona.

I. Dan Melamed, Ryan Green, and Joseph P. Turian. 2003. Precision and recall of machine translation. In Proceedings of HLT/NAACL'03.

Rada Mihalcea and Dan Moldovan. 1999. An automatic method for generating sense tagged corpora. In Proceedings of AAAI.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen, Germany.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Research report RC22176, IBM T.J. Watson Research Center.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 2002. Numerical Recipes in C++: the Art of Scientific Computing. Cambridge University Press.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP'02.

Dan Tufis, Dan Cristea, and Sofia Stamou. 2004a. BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal on Science and Technology of Information, Special Issue on BalkaNet, 7(3-4):9-44.

Dan Tufis, Radu Ion, and Nancy Ide. 2004b. Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets. In Proceedings of COLING'04.

Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of machine translation and its evaluation. In Proceedings of MT SUMMIT IX.

Vox, editor. 1990. Diccionario Actual de la Lengua Española. Bibliograf, Barcelona.

Warren Weaver. 1955. Translation. In Machine Translation of Languages. MIT Press, Cambridge, MA.