TINE: A Metric to Assess MT Adequacy

Miguel Rios, Wilker Aziz and Lucia Specia
Research Group in Computational Linguistics
University of Wolverhampton
Stafford Street, Wolverhampton, WV1 1SB, UK
{m.rios, w.aziz, l.specia}@wlv.ac.uk

Proceedings of the 6th Workshop on Statistical Machine Translation, pages 116–122, Edinburgh, Scotland, UK, July 30–31, 2011. © 2011 Association for Computational Linguistics

Abstract

We describe TINE, a new automatic evaluation metric for Machine Translation that aims at assessing segment-level adequacy. Lexical similarity and shallow semantics are used as indicators of adequacy between machine and reference translations. The metric is based on the combination of a lexical matching component and an adequacy component. Lexical matching is performed by comparing bags-of-words without any linguistic annotation. The adequacy component consists of: i) using ontologies to align predicates (verbs), ii) using semantic roles to align predicate arguments (core arguments and modifiers), and iii) matching predicate arguments using distributional semantics. TINE's performance is comparable to that of previous metrics at segment level for several language pairs, with average Kendall's tau correlations from 0.26 to 0.29. We show that the addition of the shallow-semantic component improves the performance of simple lexical matching strategies and of metrics such as BLEU.

1 Introduction

The automatic evaluation of Machine Translation (MT) is a long-standing problem. A number of metrics have been proposed in the last two decades, mostly measuring some form of matching between the MT output (hypothesis) and one or more human (reference) translations. However, most of these metrics focus on fluency aspects, as opposed to adequacy. Therefore, measuring whether the meanings of the hypothesis and reference translation are the same or similar remains an understudied problem.

The most commonly used metrics, BLEU (Papineni et al., 2002) and the like, perform simple exact matching of n-grams between hypothesis and reference translations. Such a simple matching procedure has well-known limitations: the matching of non-content words counts as much as the matching of content words, variations of words with the same meaning are disregarded, and a perfect matching can happen even if the order of the sequences of n-grams in the hypothesis and reference translation is very different, completely changing the meaning of the translation.

A number of other metrics have been proposed to address these limitations, for example, by allowing the matching of synonyms or paraphrases of content words, as in METEOR (Denkowski and Lavie, 2010). Other attempts have been made to capture whether the reference and hypothesis translations share the same meaning using shallow semantics, i.e., Semantic Role Labeling (Giménez and Màrquez, 2007). However, these are limited to the exact matching of semantic roles and their fillers.

We propose TINE, a new metric that complements lexical matching with a shallow-semantic component to better address adequacy. The main contribution of the metric is to provide a more flexible way of measuring the overlap between shallow-semantic representations, one that considers both the semantic structure of the sentence and the content of the semantic elements. The metric uses semantic role labels as in (Giménez and Màrquez, 2007); however, it analyses the content of predicates and arguments, seeking either exact or "similar" matches. The inexact matching is based on the use of ontologies such as VerbNet (Schuler, 2006) and distributional semantic similarity metrics, such as Dekang Lin's thesaurus (Lin, 1998).

In the remainder of this paper we describe related work (Section 2), present our metric, TINE (Section 3), and compare its performance to that of previous work (Section 4), along with some further improvements. We then analyse these results and discuss the limitations of the metric (Section 5), and present conclusions and future work (Section 6).

2 Related Work

A few metrics have been proposed in recent years to address the problem of measuring whether a hypothesis and a reference translation share the same meaning. The best-known metric is probably METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2010). METEOR is based on a generalized concept of unigram matching between the hypothesis and the reference translation. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. However, the structure of the sentences is not considered.

Wong and Kit (2010) measure word choice and word order by matching words based on surface forms, stems, senses and semantic similarity. The informativeness of matched and unmatched words is also weighted.

Liu et al. (2010) propose to match bags of unigrams, bigrams and trigrams, considering both recall and precision in an F-measure that gives more importance to recall, and also using WordNet synonyms.

Tratz and Hovy (2008) use transformations to match short syntactic units defined as Basic Elements (BE): minimal-length, syntactically well-defined units. For example, nouns, verbs, adjectives and adverbs can be considered BE-Unigrams, while a BE-Bigram could be formed from a syntactic relation (e.g. subject+verb, verb+object). BEs can be lexically different but semantically similar.

Padó et al. (2009) use Textual Entailment features extracted from the Stanford Entailment Recognizer (MacCartney et al., 2006), which computes matching and mismatching features over dependency parses. The metric then predicts MT quality with a regression model. The alignment is improved using ontologies.

He et al. (2010) measure the similarity between hypothesis and reference translation in terms of the Lexical Functional Grammar (LFG) representation. This representation uses dependency graphs to generate unordered sets of dependency triples. Calculating precision, recall, and F-score on the sets of triples corresponding to the hypothesis and reference segments measures similarity at the lexical and syntactic levels. The measure also matches WordNet synonyms.

The metric most closely related to the one proposed in this paper is that of Giménez and Màrquez (2007) and Giménez et al. (2010), which also uses shallow-semantic representations. That metric combines a number of components, including lexical matching metrics like BLEU and METEOR, as well as components that compute the matching of constituent and dependency parses, named entities, discourse representations and semantic roles. However, its semantic role matching is based on the exact matching of roles and role fillers. Moreover, it is not clear what the contribution of this specific information is to the overall performance of the metric.

We propose a metric that uses a lexical similarity component and a semantic component in order to deal with both word choice and semantic structure. The semantic component is based on semantic roles, but instead of simply matching the surface forms (i.e. arguments and predicates), it is able to match similar words.

3 Metric Description

The rationale behind TINE is that an adequacy-oriented metric should go beyond measuring the matching of lexical items and incorporate information about the semantic structure of the sentence, as in (Giménez et al., 2010). However, the metric should also be flexible enough to consider inexact matches of semantic components, similar to what is done by lexical metrics like METEOR (Denkowski and Lavie, 2010). We experiment with TINE with English as the target language because of the availability of linguistic processing tools for this language. The metric is particularly dependent on semantic role labeling systems, which have reached satisfactory performance for English (Carreras and Màrquez, 2005).

TINE uses semantic role labels (SRL) and lexical semantics to fulfill two requirements: (i) to compare both the semantic structure and its content across matching arguments in the hypothesis and reference translations; and (ii) to propose alternative ways of measuring inexact matches for both predicates and role fillers. Additionally, it uses an exact lexical matching component to reward hypotheses that present the same lexical choices as the reference translation. The overall score s is defined using the simple weighted average model in Equation (1):

    s(H, R) = max_{R ∈ R} [αL(H, R) + βA(H, R)] / (α + β)    (1)

where H represents the hypothesis translation, R represents a reference translation contained in the set of available references R, L defines the (exact) lexical match component in Equation (2), A defines the adequacy component in Equation (3), and α and β are tunable weights for these two components. If multiple references are provided, the score of the segment is the maximum score achieved by comparing the segment to each available reference.

    A(H, R) = Σ_{v ∈ V} verb_score(H_v, R_v) / |V_r|    (3)

In the adequacy component, V is the set of verbs aligned between H and R, and |V_r| is the number of verbs in R. Hereafter the indexes h and r stand for the hypothesis and reference translations, respectively. Verbs are aligned using VerbNet (Schuler, 2006) and VerbOcean (Chklovski and Pantel, 2004). A verb in the hypothesis v_h is aligned to a verb in the reference v_r if they are related according to the following heuristics: (i) the pair of verbs shares at least one class in VerbNet; or (ii) the pair of verbs holds a relation in VerbOcean. For example, in VerbNet the verbs spook and terrify share the class amuse-31.1, and in VerbOcean the verb dress is related to the verb wear.

Both core arguments and modifiers are considered, since arguments may get merged with modifiers due to bad semantic role labeling (e.g. [A0 I] [T bought] [A1 something to eat yesterday] instead of [A0 I] [T bought] [A1 something to eat] [AM-TMP yesterday]).

    verb_score(H_v, R_v) = Σ_{a ∈ A_h ∩ A_r} arg_score(H_a, R_a) / |A_r|
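The per-verb score and the adequacy component of Equation (3) could be sketched as below. This is a minimal illustration, not the authors' implementation: argument fillers are assumed to be stored per role label, arg_score is treated as a black box (the metric matches fillers exactly or via distributional semantics), and all names are hypothetical.

```python
def verb_score(hyp_args, ref_args, arg_score):
    # hyp_args / ref_args map role labels (e.g. "A0", "AM-TMP") to filler
    # strings. Only roles present on both sides are scored, normalised by
    # the number of arguments in the reference.
    shared = set(hyp_args) & set(ref_args)
    return sum(arg_score(hyp_args[a], ref_args[a]) for a in shared) / len(ref_args)

def adequacy(aligned_verbs, n_ref_verbs, arg_score):
    # Equation (3): sum of verb_score over aligned verb pairs, normalised
    # by the number of verbs in the reference.
    # aligned_verbs is a list of (hyp_args, ref_args) pairs.
    return sum(verb_score(h, r, arg_score)
               for h, r in aligned_verbs) / n_ref_verbs
```

With an exact-match arg_score, a verb pair whose three arguments all match contributes a verb_score of 1.0, and a single such verb in a one-verb reference gives A(H, R) = 1.0.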
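The two verb-alignment heuristics, (i) a shared VerbNet class or (ii) a VerbOcean relation, could be sketched as follows. The lookup tables here are tiny hand-built stand-ins seeded only with the examples from the text; a real implementation would query the actual VerbNet and VerbOcean resources.

```python
# Toy stand-ins for the real resources (assumed data, from the examples in
# the text): spook and terrify share VerbNet class amuse-31.1; dress and
# wear hold a relation in VerbOcean.
VERBNET_CLASSES = {
    "spook": {"amuse-31.1"},
    "terrify": {"amuse-31.1"},
}
VERBOCEAN_RELATIONS = {
    frozenset({"dress", "wear"}),
}

def verbs_aligned(v_h, v_r):
    # Heuristic (i): the pair of verbs shares at least one VerbNet class.
    if VERBNET_CLASSES.get(v_h, set()) & VERBNET_CLASSES.get(v_r, set()):
        return True
    # Heuristic (ii): the pair of verbs holds a relation in VerbOcean.
    return frozenset({v_h, v_r}) in VERBOCEAN_RELATIONS
```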
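Finally, the overall combination in Equation (1) can be sketched as below. This is a rough sketch under stated assumptions, not the authors' code: the bag-of-words overlap stands in for the lexical component of Equation (2), whose exact form is not shown in this excerpt, and the adequacy component is passed in as a callable so any implementation of Equation (3) can be plugged in.

```python
from collections import Counter

def lexical_match(hyp_tokens, ref_tokens):
    # Assumed stand-in for Equation (2): exact bag-of-words overlap,
    # normalised by reference length (no linguistic annotation).
    overlap = Counter(hyp_tokens) & Counter(ref_tokens)
    return sum(overlap.values()) / len(ref_tokens)

def tine_score(hyp, refs, adequacy, alpha=0.5, beta=0.5):
    # Equation (1): weighted average of the lexical component L and the
    # adequacy component A, maximised over the available references.
    best = 0.0
    for ref in refs:
        L = lexical_match(hyp.split(), ref.split())
        A = adequacy(hyp, ref)  # Equation (3), supplied by the caller
        best = max(best, (alpha * L + beta * A) / (alpha + beta))
    return best
```

For instance, with a single reference identical to the hypothesis and an adequacy callable that returns 0, only the lexical term fires and the score is α/(α + β) = 0.5 at the default weights.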