SemEval-2013 Task 4: Free Paraphrases of Noun Compounds

Iris Hendrickx (Radboud University Nijmegen & Universidade de Lisboa)
Zornitsa Kozareva (University of Southern California)
Preslav Nakov (QCRI, Qatar Foundation)
Diarmuid Ó Séaghdha (University of Cambridge)
Stan Szpakowicz (University of Ottawa & Polish Academy of Sciences)
Tony Veale (University College Dublin)

Abstract

In this paper, we describe SemEval-2013 Task 4: the definition, the data, the evaluation and the results. The task is to capture some of the meaning of English noun compounds via paraphrasing. Given a two-word noun compound, the participating system is asked to produce an explicitly ranked list of its free-form paraphrases. The list is automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed through Amazon's Mechanical Turk. The comparison of raw paraphrases is sensitive to syntactic and morphological variation. The "gold" ranking is based on the relative popularity of paraphrases among annotators. To make the ranking more reliable, highly similar paraphrases are grouped, so as to downplay superficial differences in syntax and morphology. Three systems participated in the task. They all beat a simple baseline on one of the two evaluation measures, but not on both measures. This shows that the task is difficult.

1 Introduction

A noun compound (NC) is a sequence of nouns which act as a single noun (Downing, 1977), as in these examples: colon cancer, suppressor protein, tumor suppressor protein, colon cancer tumor suppressor protein, etc. This type of compounding is highly productive in English. NCs comprise 3.9% and 2.6% of all tokens in the Reuters corpus and the British National Corpus (BNC), respectively (Baldwin and Tanaka, 2004).

The frequency spectrum of compound types follows a Zipfian distribution (Ó Séaghdha, 2008), so many NC tokens belong to a "long tail" of low-frequency types. More than half of the two-noun types in the BNC occur exactly once (Kim and Baldwin, 2006). Their high frequency and high productivity make robust NC interpretation an important goal for broad-coverage semantic processing of English texts. Systems which ignore NCs may give up on salient information about the semantic relationships implicit in a text. Compositional interpretation is also the only way to achieve broad NC coverage, because it is not feasible to list in a lexicon all compounds which one is likely to encounter. Even for relatively frequent NCs occurring 10 times or more in the BNC, static English dictionaries provide only 27% coverage (Tanaka and Baldwin, 2003).

In many natural language processing applications it is important to understand the syntax and semantics of NCs. NCs are often structurally similar, but have very different meanings. Consider caffeine headache and ice-cream headache: a lack of caffeine causes the former, while an excess of ice-cream causes the latter. Different interpretations can lead to different inferences, query expansions, paraphrases, translations, and so on. A question answering system may have to determine whether protein acting as a tumor suppressor is an accurate paraphrase for tumor suppressor protein. An information extraction system might need to decide whether neck vein thrombosis and neck thrombosis can co-refer in the same document. A machine translation system might paraphrase the unknown compound WTO Geneva headquarters as WTO headquarters located in Geneva.

Research on the automatic interpretation of NCs has focused mainly on common two-word NCs. The usual task is to classify the semantic relation underlying a compound with either one of a small number of predefined relation labels or a paraphrase from an open vocabulary. Examples of the former take on classification include (Moldovan et al., 2004; Girju, 2007; Ó Séaghdha and Copestake, 2008; Tratz and Hovy, 2010). Examples of the latter include (Nakov, 2008b; Nakov, 2008a; Nakov and Hearst, 2008; Butnariu and Veale, 2008) and a previous NC paraphrasing task at SemEval-2010 (Butnariu et al., 2010), upon which the task described here builds.

The assumption of a small inventory of predefined relations has some advantages, namely parsimony and generalization, but at the same time there are limitations on expressivity and coverage. For example, the NCs headache pills and fertility pills would be assigned the same semantic relation (PURPOSE) in most inventories, but their relational semantics are quite different (Downing, 1977). Furthermore, the definitions given by human subjects can involve rich and specific meanings. For example, Downing (1977) reports that a subject defined the NC oil bowl as "the bowl into which the oil in the engine is drained during an oil change", compared to which a minimal interpretation bowl for oil seems very reductive. In view of such arguments, linguists such as Downing (1977), Ryder (1994) and Coulson (2001) have argued for a fine-grained, essentially open-ended space of interpretations.

The idea of working with fine-grained paraphrases for NC semantics has recently grown in popularity among NLP researchers (Butnariu and Veale, 2008; Nakov and Hearst, 2008; Nakov, 2008a). Task 9 at SemEval-2010 (Butnariu et al., 2010) was devoted to this methodology. In that previous work, the paraphrases provided by human subjects were required to fit a restrictive template admitting only verbs and prepositions occurring between the NC's constituent nouns. Annotators recruited through Amazon Mechanical Turk were asked to provide paraphrases for the dataset of NCs. For example, a plastic saw could be a saw made of plastic or a saw for cutting plastic. Systems participating in the task were given the set of attested paraphrases for each NC, and evaluated according to how well they could reproduce the humans' ranking. The gold standard for each NC was the ranked list of paraphrases given by the annotators; this reflects the idea that a compound's meaning can be described in different ways, at different levels of granularity and capturing different interpretations in the case of ambiguity.

The design of this task, SemEval-2013 Task 4, is informed by previous work on compound annotation and interpretation. It is also influenced by similar initiatives, such as the English Lexical Substitution task at SemEval-2007 (McCarthy and Navigli, 2007), and by various evaluation exercises in the fields of paraphrasing and machine translation. We build on SemEval-2010 Task 9, extending the task's flexibility in a number of ways. The restrictions on the form of annotators' paraphrases were relaxed, giving us a rich dataset of close-to-freeform paraphrases (Section 3). Rather than ranking a set of attested paraphrases, systems must now both generate and rank their paraphrases; the task they perform is essentially the same as what the annotators were asked to do. This new setup required us to innovate in terms of evaluation measures (Section 4).

We anticipate that the dataset and task will be of broad interest among those who study lexical semantics. We believe that the overall progress in the field will significantly benefit from a public-domain set of free-style NC paraphrases. That is why our primary objective is the challenging endeavour of preparing and releasing such a dataset to the research community. The common evaluation task which we establish will also enable researchers to compare their algorithms and their empirical results.

2 Task description

This is an English NC interpretation task, which explores the idea of interpreting the semantics of NCs via free paraphrases. Given a noun-noun compound such as air filter, the participating systems are asked to produce an explicitly ranked list of free paraphrases, as in the following example:

1 filter for air
2 filter of air
3 filter that cleans the air
4 filter which makes air healthier
5 a filter that removes impurities from the air
...

Such a list is then automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed via Amazon's Mechanical Turk. The comparison of raw paraphrases is sensitive to syntactic and morphological variation. The ranking of paraphrases is based on their relative popularity among different annotators. To make the ranking more reliable, highly similar paraphrases are grouped so as to downplay superficial differences in syntax and morphology. Paraphrases with a frequency of 1 (proposed for a given NC by only one annotator) always occupy the lowest rank on the list for that compound.

3 Data collection

We used Amazon's Mechanical Turk service to collect diverse paraphrases for a range of "gold-standard" NCs. We paid the workers a small fee ($0.10) per compound, for which they were asked to provide five paraphrases. Each paraphrase should contain the two nouns of the compound (in singular or plural inflectional forms, but not in another derivational form), an intermediate non-empty linking phrase and optional preceding or following terms.

We used 174+181 noun-noun compounds from the NC dataset of Ó Séaghdha (2007). The trial dataset, which we initially released to the participants, consisted of 4,255 human paraphrases for 174 noun compounds.

                        Total    Min / Max / Avg
  Trial/Train (174 NCs)
    paraphrases         6,069    1 / 287 / 34.9
    unique paraphrases  4,255    1 / 105 / 24.5
  Test (181 NCs)
    paraphrases         9,706    24 / 99 / 53.6
    unique paraphrases  8,216    21 / 80 / 45.4

Table 1: Statistics of the trial and test datasets: the total number of paraphrases with and without duplicates, and the minimum / maximum / average per noun compound.