Textual Entailment Features for Machine Translation Evaluation

Sebastian Padó, Michel Galley, Dan Jurafsky, Christopher D. Manning*
Stanford University
{pado,mgalley,jurafsky,manning}@stanford.edu

* This paper is based on work funded by the Defense Advanced Research Projects Agency through IBM. The content does not necessarily reflect the views of the U.S. Government, and no official endorsement should be inferred.

Abstract

We present two regression models for the prediction of pairwise preference judgments among MT hypotheses. Both models are based on feature sets that are motivated by textual entailment and incorporate lexical similarity as well as local syntactic features and specific semantic phenomena. One model predicts absolute scores; the other one direct pairwise judgments. We find that both models are competitive with regression models built over the scores of established MT evaluation metrics. Further data analysis clarifies the complementary behavior of the two feature sets.

1 Introduction

Automatic metrics to assess the quality of machine translations have been a major enabler in improving the performance of MT systems, leading to many varied approaches to develop such metrics. Initially, most metrics judged the quality of MT hypotheses by token sequence match (cf. BLEU (Papineni et al., 2002), NIST (Doddington, 2002)). These measures rate system hypotheses by measuring the overlap in surface word sequences shared between hypothesis and reference translation.

With improvements in the state of the art in machine translation, the effectiveness of purely surface-oriented measures has been questioned (see e.g. Callison-Burch et al. (2006)). In response, metrics have been proposed that attempt to integrate more linguistic information into the matching process to distinguish linguistically licensed from unwanted variation (Giménez and Màrquez, 2008). However, there is little agreement on what types of knowledge are helpful: some suggestions concentrate on lexical information, e.g., by the integration of word similarity information as in Meteor (Banerjee and Lavie, 2005) or MaxSim (Chan and Ng, 2008). Other proposals use structural information such as dependency edges (Owczarzak et al., 2007).

In this paper, we investigate an MT evaluation metric that is inspired by the similarity between this task and the textual entailment task (Dagan et al., 2005), which suggests that the quality of an MT hypothesis should be predictable by a combination of lexical and structural features that model the matches and mismatches between system output and reference translation. We use supervised regression models to combine these features and analyze feature weights to obtain further insights into the usefulness of different feature types.

2 Textual Entailment for MT Evaluation

2.1 Textual Entailment vs. MT Evaluation

Textual entailment (TE) was introduced by Dagan et al. (2005) as a concept that corresponds more closely to "common sense" reasoning than classical, categorical entailment. Textual entailment is defined as a relation between two natural language sentences (a premise P and a hypothesis H) that holds if a human reading P would infer that H is most likely true.

Information about the presence or absence of entailment between two sentences has been found to be beneficial for a range of NLP tasks such as Word Sense Disambiguation or Question Answering (Dagan et al., 2006; Harabagiu and Hickl, 2006). Our intuition is that this idea can also be fruitful in MT evaluation, as illustrated in Figure 1: very good MT output should entail the reference translation. In contrast, missing hypothesis material breaks forward entailment; additional material breaks backward entailment; and for bad translations, entailment fails in both directions.

  HYP: The virus did not infect anybody.
  REF: No one was infected by the virus.
  (entailment holds in both directions)

  HYP: Virus was infected.
  REF: No one was infected by the virus.
  (no entailment in either direction)

Figure 1: Entailment status between an MT system hypothesis and a reference translation for good translations (above) and bad translations (below).
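To make the intuition behind Figure 1 concrete, the mapping from bidirectional entailment decisions to a coarse quality judgment can be written as a small decision rule. The sketch below is illustrative and not part of the paper's system; `entails` is a hypothetical stand-in for any RTE decision function.

```python
# Illustrative sketch: map bidirectional entailment decisions to a coarse
# translation-quality label, following the logic of Figure 1.
# `entails(premise, hypothesis)` is a hypothetical RTE decision function.

def quality_label(entails, hyp: str, ref: str) -> str:
    forward = entails(hyp, ref)   # good output should entail the reference
    backward = entails(ref, hyp)
    if forward and backward:
        return "good translation"
    if backward and not forward:
        return "missing material"     # omissions break forward entailment
    if forward and not backward:
        return "additional material"  # additions break backward entailment
    return "bad translation"          # entailment fails in both directions
```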
Work on the recognition of textual entailment (RTE) has consistently found that the integration of more syntactic and semantic knowledge can yield gains over surface-based methods, provided that the linguistic analysis is sufficiently robust. Thus, for RTE, "deep" matching outperforms surface matching. The reason is that linguistic representation makes it considerably easier to distinguish admissible variation (i.e., paraphrase) from true, meaning-changing divergence. Admissible variation may be lexical (synonymy), structural (word and phrase placement), or both (diathesis alternations).

The working hypothesis of this paper is that the benefits of deeper analysis carry over to MT evaluation. More specifically, we test whether the features that allow good performance on the RTE task can also predict human judgments for MT output. Analogously to RTE, these features should help us to differentiate meaning-preserving translation variants from bad translations.

Nevertheless, there are also substantial differences between TE and MT evaluation. Crucially, TE assumes the premise and hypothesis to be well-formed sentences, which is not true in MT evaluation. Thus, a possible criticism of the use of TE methods is that the features could become unreliable for ill-formed MT output. However, there is a second difference between the tasks that works to our advantage. Due to its strict compositional nature, TE requires an accurate semantic analysis of all sentence parts, since, for example, one misanalysed negation or counterfactual embedding can invert the entailment status (MacCartney and Manning, 2008). In contrast, human MT judgments behave more additively: failure of a translation with respect to a single semantic dimension (e.g., polarity or tense) degrades its quality, but usually not crucially so. We therefore expect that even noisy entailment features can be predictive in MT evaluation.

2.2 Entailment-based prediction of MT quality

Regression-based prediction. Experience from the annotation of MT quality judgments shows that human raters have difficulty in consistently assigning absolute scores to MT system output, due to the number of ways in which MT output can deviate. Thus, the human annotation for the WMT 2008 dataset was collected in the form of binary pairwise preferences, which are considerably easier to make (Callison-Burch et al., 2008). This section presents two models for the prediction of pairwise preferences.

The first model (ABS) is a regularized linear regression model over entailment-motivated features (see below) that predicts an absolute score for each reference-hypothesis pair. Pairwise preferences are created simply by comparing the absolute predicted scores. This model is more general, since it can also be used where absolute score predictions are desirable; furthermore, it is efficient, with a runtime linear in the number of systems and corpus size.

The second model (PAIR) predicts the pairwise judgments directly. Its runtime is quadratic in the number of systems; on the other hand, it can be trained on more reliable pairwise preference judgments. In a second step, we combine the individual decisions to compute the highest-likelihood total ordering of hypotheses. The construction of an optimal ordering from weighted pairwise preferences is an NP-hard problem (via reduction of CYCLIC-ORDERING; Barzilay and Elhadad, 2002), but a greedy search yields a close approximation (Cohen et al., 1999).
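One standard greedy strategy in the spirit of Cohen et al. (1999) can be sketched as follows. The data layout (`pref[a][b]` as the model's confidence that hypothesis a is preferable to b) is an assumption made for illustration, not the paper's implementation.

```python
# Sketch: greedy approximation of a total ordering from weighted pairwise
# preferences (cf. Cohen et al., 1999). `pref[a][b]` is assumed to hold the
# model's confidence that hypothesis a is better than hypothesis b.

from typing import Dict, List

def greedy_order(items: List[str],
                 pref: Dict[str, Dict[str, float]]) -> List[str]:
    remaining = set(items)
    order: List[str] = []
    while remaining:
        def potential(v: str) -> float:
            # Outgoing minus incoming preference mass among remaining items.
            out_mass = sum(pref[v][u] for u in remaining if u != v)
            in_mass = sum(pref[u][v] for u in remaining if u != v)
            return out_mass - in_mass
        best = max(remaining, key=potential)
        order.append(best)  # rank the currently strongest hypothesis next
        remaining.remove(best)
    return order
```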
Both models can be used to predict system-level scores from sentence-level scores. Again, we have two methods for doing this. The basic method (BASIC) predicts the quality of each system directly as the percentage of sentences for which its output was rated best among all systems. However, we noticed that the manual rankings for the WMT 2007 dataset show a tie for the best system for almost 30% of sentences. BASIC is systematically unable to account for these ties. We therefore implemented a "tie-aware" prediction method (WITHTIES) that uses the same sentence-level output as BASIC, but computes system-level quality differently: as the percentage of sentences where the system's hypothesis was scored better than, or at most ε worse than, the best system, for some global "tie interval" ε.
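A minimal sketch of the tie-aware computation, assuming sentence-level scores are available for every system (`scores[sys][i]` for sentence i); the names and data layout are illustrative assumptions, not the paper's code:

```python
# Sketch: tie-aware system-level quality (WITHTIES). A sentence counts for a
# system if its hypothesis scored better than, or at most eps worse than,
# the best-scoring hypothesis on that sentence.

from typing import Dict, List

def withties_score(scores: Dict[str, List[float]],
                   system: str, eps: float) -> float:
    n = len(scores[system])
    wins = sum(
        1 for i in range(n)
        if scores[system][i] >= max(s[i] for s in scores.values()) - eps
    )
    return wins / n  # fraction of sentences rated (near-)best
```

With eps = 0, this reduces to counting the sentences on which the system (co-)achieves the best score, i.e., a variant of BASIC that credits exact ties.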

Features. We use the Stanford RTE system (MacCartney et al., 2006) to generate a set of entailment features (RTE) for each pair of MT hypothesis and reference translation. Features are generated in both directions to avoid biases towards short or long translations (a sketch follows Table 1). The Stanford RTE system uses a three-stage architecture. It (a) constructs a robust, dependency-based linguistic analysis of the two sentences; (b) identifies the best alignment between the two dependency graphs given similarity scores from a range of lexical resources, using a Markov Chain Monte Carlo sampling strategy; and (c) computes roughly 75 features over the aligned pair of dependency graphs. The different feature groups are shown in Table 1. A small number of features are real-valued, measuring different quality aspects of the alignment. The other features are binary, indicating matches and mismatches of different types.

  Alignment score (3)          Unaligned material (10)
  Adjuncts (7)                 Apposition (2)
  Modality (5)                 Factives (8)
  Polarity (5)                 Quantors (4)
  Tense (2)                    Dates (6)
  Root (2)                     Semantic Relations (4)
  Semantic relatedness (7)     Structural Match (5)
  Compatibility of locations and entities (4)

Table 1: Entailment feature groups provided by the Stanford RTE system, with number of features
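Since features are generated in both directions (hypothesis as premise and reference as premise), the per-pair feature vector can be assembled as in the following sketch. Here `rte_features` is a hypothetical stand-in for the Stanford RTE pipeline, of which we assume only that it maps a (premise, hypothesis) pair to a feature vector.

```python
# Sketch: bidirectional feature generation for one hypothesis-reference pair.
# `rte_features(premise, hypothesis)` is a hypothetical stand-in for the
# three-stage Stanford RTE pipeline (analysis, alignment, feature scoring).

from typing import Callable, List

def mt_eval_features(hyp: str, ref: str,
                     rte_features: Callable[[str, str], List[float]]
                     ) -> List[float]:
    forward = rte_features(hyp, ref)   # hypothesis as premise
    backward = rte_features(ref, hyp)  # reference as premise
    # Concatenating both directions avoids biases towards short or
    # long translations.
    return forward + backward
```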
