SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation

Eneko Agirre (a), Carmen Banea (b), Daniel Cer (d), Mona Diab (e), Aitor Gonzalez-Agirre (a), Rada Mihalcea (b), German Rigau (a), Janyce Wiebe (f)

(a) University of the Basque Country, Donostia, Basque Country
(b) University of Michigan, Ann Arbor, MI
(d) Google Inc., Mountain View, CA
(e) George Washington University, Washington, DC
(f) University of Pittsburgh, Pittsburgh, PA

The authors of this paper are listed in alphabetic order.

Abstract

Semantic Textual Similarity (STS) seeks to measure the degree of semantic equivalence between two snippets of text. Similarity is expressed on an ordinal scale that spans from semantic equivalence to complete unrelatedness. Intermediate values capture specifically defined levels of partial similarity. While prior evaluations constrained themselves to just monolingual snippets of text, the 2016 shared task includes a pilot subtask on computing semantic similarity on cross-lingual text snippets. This year's traditional monolingual subtask involves the evaluation of English text snippets from the following four domains: Plagiarism Detection, Post-Edited Machine Translations, Question-Answering and News Article Headlines. From the question-answering domain, we include both question-question and answer-answer pairs. The cross-lingual subtask provides paired Spanish-English text snippets drawn from the same sources as the English data as well as independently sampled news data. The English subtask attracted 43 participating teams producing 119 system submissions, while the cross-lingual Spanish-English pilot subtask attracted 10 teams resulting in 26 systems.

1 Introduction

Semantic Textual Similarity (STS) assesses the degree to which the underlying semantics of two segments of text are equivalent to each other. This assessment is performed using an ordinal scale that ranges from complete semantic equivalence to complete semantic dissimilarity. The intermediate levels capture specifically defined degrees of partial similarity, such as topicality or rough equivalence, but with differing details. The snippets being scored are approximately one sentence in length, with their assessment being performed outside of any contextualizing text. While STS has previously just involved judging text snippets that are written in the same language, this year's evaluation includes a pilot subtask on the evaluation of cross-lingual sentence pairs.
The systems and techniques explored as a part of STS have a broad range of applications, including Machine Translation (MT), Summarization, Generation and Question Answering (QA). STS allows for the independent evaluation of methods for computing semantic similarity drawn from a diverse set of domains that would otherwise be only studied within a particular subfield of computational linguistics. Existing methods from a subfield that are found to perform well in a more general setting, as well as novel techniques created specifically for STS, may improve any natural language processing or language understanding application where knowing the similarity in meaning between two pieces of text is relevant to the behavior of the system.

Paraphrase detection and textual entailment are both highly related to STS. However, STS is more similar to paraphrase detection in that it defines a bi-directional relationship between the two snippets being assessed, rather than the non-symmetric, propositional-logic-like relationship used in textual entailment (e.g., P → Q leaves Q → P unspecified). STS also expands the binary yes/no categorization of both paraphrase detection and textual entailment to a finer grained similarity scale. The additional degrees of similarity introduced by STS are directly relevant to many applications where intermediate levels of similarity are significant. For example, when evaluating machine translation system output, it is desirable to give credit for partial semantic equivalence to human reference translations. Similarly, a summarization system may prefer short segments of text with a rough meaning equivalence to longer segments with perfect semantic coverage.

STS is related to research into machine translation evaluation metrics. This subfield of machine translation investigates methods for replicating human judgements regarding the degree to which a translation generated by a machine translation system corresponds to a reference translation produced by a human translator. STS systems plausibly could be used as a drop-in replacement for existing translation evaluation metrics (e.g., BLEU, MEANT, METEOR, TER).[1] The cross-lingual STS subtask that is newly introduced this year is similarly related to machine translation quality estimation.

[1] Both monolingual and cross-lingual STS score what is referred to in the machine translation literature as adequacy and ignore fluency unless it obscures meaning. While popular machine translation evaluation techniques do not assess fluency independently of adequacy, it is possible that the deeper semantic assessment being performed by STS systems could benefit from being paired with a separate fluency module.

The STS shared task has been held annually since 2012, providing a venue for the evaluation of state-of-the-art algorithms and models (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015). During this time, a diverse set of genres and data sources have been explored (i.a., news headlines, video and image descriptions, glosses from lexical resources including WordNet (Miller, 1995; Christiane Fellbaum, 1998), FrameNet (Baker et al., 1998), OntoNotes (Hovy et al., 2006), web discussion forums, and Q&A data sets). This year's evaluation adds new data sets drawn from plagiarism detection and post-edited machine translations. We also introduce an evaluation set on Q&A forum question-question similarity and revisit news headlines and Q&A answer-answer similarity. The 2016 task includes both a traditional monolingual subtask with English data and a pilot cross-lingual subtask that pairs together Spanish and English texts.

Score 5: The two sentences are completely equivalent, as they mean the same thing.
  English: The bird is bathing in the sink. / Birdie is washing itself in the water basin.
  Spanish-English: El pájaro se está bañando en el lavabo. / Birdie is washing itself in the water basin.

Score 4: The two sentences are mostly equivalent, but some unimportant details differ.
  English: In May 2010, the troops attempted to invade Kabul. / The US army invaded Kabul on May 7th last year, 2010.
  Spanish-English: En mayo de 2010, las tropas intentaron invadir Kabul. / The US army invaded Kabul on May 7th last year, 2010.

Score 3: The two sentences are roughly equivalent, but some important information differs or is missing.
  English: John said he is considered a witness but not a suspect. / "He is not a suspect anymore." John said.
  Spanish-English: John dijo que él es considerado como testigo, y no como sospechoso. / "He is not a suspect anymore." John said.

Score 2: The two sentences are not equivalent, but share some details.
  English: They flew out of the nest in groups. / They flew into the nest together.
  Spanish-English: Ellos volaron del nido en grupos. / They flew into the nest together.

Score 1: The two sentences are not equivalent, but are on the same topic.
  English: The woman is playing the violin. / The young lady enjoys listening to the guitar.
  Spanish-English: La mujer está tocando el violín. / The young lady enjoys listening to the guitar.

Score 0: The two sentences are completely dissimilar.
  English: John went horse back riding at dawn with a whole group of friends. / Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
  Spanish-English: Al amanecer, Juan se fue a montar a caballo con un grupo de amigos. / Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.

Table 1: Similarity scores with explanations and examples for the English and the cross-lingual Spanish-English subtasks.
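To make the 0-5 scale above concrete, the following is a minimal sketch, not drawn from any participating system, of how a trivial bag-of-words baseline could produce the kind of real-valued scores on this range that systems submit; the whitespace tokenizer and the linear scaling of word overlap are assumptions made purely for illustration.

    # Illustrative baseline only: map word overlap onto the 0-5 STS scale.
    # The tokenization and scaling are simplifying assumptions, not the
    # method of any shared task participant.

    def tokens(text):
        """Crude lowercase whitespace tokenization."""
        return set(text.lower().split())

    def baseline_sts_score(sent1, sent2):
        """Return a real-valued similarity in [0, 5] from Jaccard word overlap."""
        t1, t2 = tokens(sent1), tokens(sent2)
        if not t1 and not t2:
            return 5.0  # two empty snippets are trivially identical
        jaccard = len(t1 & t2) / len(t1 | t2)
        return 5.0 * jaccard  # rescale overlap in [0, 1] to the 0-5 range

    if __name__ == "__main__":
        # Sentence pair adapted from the score-5 row of Table 1.
        print(baseline_sts_score("The bird is bathing in the sink.",
                                 "Birdie is washing itself in the water basin."))

A purely lexical baseline of this kind assigns a very low score to the score-5 example from Table 1, because the synonymy between "bird"/"Birdie" and "sink"/"water basin" is invisible to word overlap; that is the kind of gap that richer resources such as WordNet or word2vec, which participants are permitted to use, are meant to help close.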
2 Task Overview

STS presents participating systems with paired text snippets of approximately one sentence in length. The systems are then asked to return a numerical score indicating the degree of semantic similarity between the two snippets. Canonical STS scores fall on an ordinal scale with 6 specifically defined degrees of semantic similarity (see Table 1). While the underlying labels and their interpretation are ordinal, systems can provide real valued scores to indicate their semantic similarity prediction.

Participating systems are then evaluated based on the degree to which their predicted similarity scores correlate with STS human judgements. Algorithms are free to use any scale or range of values for the scores they return. They are not punished for outputting scores outside the range of the interpretable human annotated STS labels.

The lowest label, 0, is reserved for pairs of snippets that are completely dissimilar, while a score of 1 indicates pairs that are not equivalent but are on the same topic. A score of 2 means the two snippets are not equivalent but share some details, and a score of 3 indicates rough equivalence with some important information differing or missing, while a score of 4 indicates that the differing details are not important. The top score of 5 denotes that the two texts being evaluated have complete semantic equivalence.

In the context of the STS task, meaning equivalence is defined operationally as two snippets of text that mean the same thing when interpreted by a reasonable human judge. The operational approach to sentence level semantics was popularized by the recognizing textual entailment task (Dagan et al., 2010). It has the advantage that it allows the labeling of sentence pairs by human annotators without any training in formal semantics, while also being more useful and intuitive to work with for downstream systems. Beyond just sentence level semantics, the operationally defined STS labels also reflect both world knowledge and pragmatic phenomena.

As in prior years, 2016 shared task participants are allowed to make use of existing resources and tools (e.g., WordNet, Mikolov et al. (2013)'s word2vec). Participants are also allowed to make unsupervised use of arbitrary data sets, even if such data overlaps with the announced sources of the evaluation data.

3 English Subtask
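Systems submitted to this subtask, like those for the cross-lingual pilot, are scored by how well their predictions correlate with the human judgements described in Section 2. As a brief illustration of that style of evaluation, the sketch below compares a set of invented system outputs against invented gold labels; the choice of Pearson's r via scipy.stats.pearsonr is an assumption for illustration and not necessarily the official scoring script.

    # Illustrative correlation-based evaluation in the style described in
    # Section 2. Pearson's r (scipy.stats.pearsonr) is assumed here as the
    # correlation measure; the scores below are invented examples.
    from scipy.stats import pearsonr

    gold_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]    # human-annotated labels
    system_scores = [4.6, 4.1, 2.4, 2.2, 0.8, 0.3]  # real-valued system output

    r, _ = pearsonr(system_scores, gold_scores)
    print("Correlation with human judgements: %.3f" % r)

Because Pearson's r is invariant under positive linear rescaling of the system outputs, this style of evaluation is consistent with the provision in Section 2 that systems may return scores on any scale or range.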