SemEval-2013 Task 4: Free Paraphrases of Noun Compounds

Iris Hendrickx, Radboud University Nijmegen & Universidade de Lisboa, [email protected]
Zornitsa Kozareva, University of Southern California, [email protected]
Preslav Nakov, QCRI, Qatar Foundation, [email protected]
Diarmuid Ó Séaghdha, University of Cambridge, [email protected]
Stan Szpakowicz, University of Ottawa & Polish Academy of Sciences, [email protected]
Tony Veale, University College Dublin, [email protected]

Abstract

In this paper, we describe SemEval-2013 Task 4: the definition, the data, the evaluation and the results. The task is to capture some of the meaning of English noun compounds via paraphrasing. Given a two-word noun compound, the participating system is asked to produce an explicitly ranked list of its free-form paraphrases. The list is automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed through Amazon's Mechanical Turk. The comparison of raw paraphrases is sensitive to syntactic and morphological variation. The "gold" ranking is based on the relative popularity of paraphrases among annotators. To make the ranking more reliable, highly similar paraphrases are grouped, so as to downplay superficial differences in syntax and morphology. Three systems participated in the task. They all beat a simple baseline on one of the two evaluation measures, but not on both measures. This shows that the task is difficult.

1 Introduction

A noun compound (NC) is a sequence of nouns which act as a single noun (Downing, 1977), as in these examples: colon cancer, suppressor protein, tumor suppressor protein, colon cancer tumor suppressor protein, etc. This type of compounding is highly productive in English. NCs comprise 3.9% and 2.6% of all tokens in the Reuters corpus and the British National Corpus (BNC), respectively (Baldwin and Tanaka, 2004). The frequency spectrum of types follows a Zipfian distribution (Ó Séaghdha, 2008), so many NC tokens belong to a "long tail" of low-frequency types. More than half of the two-noun types in the BNC occur exactly once (Kim and Baldwin, 2006). Their high frequency and high productivity make robust NC interpretation an important goal for broad-coverage semantic processing of English texts. Systems which ignore NCs may give up on salient information about the semantic relationships implicit in a text. Compositional interpretation is also the only way to achieve broad NC coverage, because it is not feasible to list in a lexicon all compounds which one is likely to encounter. Even for relatively frequent NCs occurring 10 times or more in the BNC, static English dictionaries provide only 27% coverage (Tanaka and Baldwin, 2003).

In many natural language processing applications it is important to understand the syntax and semantics of NCs. NCs are often structurally similar, but have very different meanings. Consider caffeine headache and ice-cream headache: a lack of caffeine causes the former, an excess of ice-cream the latter. Different interpretations can lead to different inferences, query expansions, paraphrases, translations, and so on. A question answering system may have to determine whether protein acting as a tumor suppressor is an accurate paraphrase for tumor suppressor protein. An information extraction system might need to decide whether neck vein thrombosis and neck thrombosis can co-refer in the same document. A machine translation system might paraphrase the unknown compound WTO Geneva headquarters as WTO headquarters located in Geneva.

Research on the automatic interpretation of NCs has focused mainly on common two-word NCs. The usual task is to classify the semantic relation underlying a compound with either one of a small number of predefined relation labels or a paraphrase from an open vocabulary. Examples of the former take on classification include (Moldovan et al., 2004; Girju, 2007; Ó Séaghdha and Copestake, 2008; Tratz and Hovy, 2010). Examples of the latter include (Nakov, 2008b; Nakov, 2008a; Nakov and Hearst, 2008; Butnariu and Veale, 2008) and a previous NC paraphrasing task at SemEval-2010 (Butnariu et al., 2010), upon which the task described here builds.

The assumption of a small inventory of predefined relations has some advantages – parsimony and generalization – but at the same time there are limitations on expressivity and coverage. For example, the NCs headache pills and fertility pills would be assigned the same semantic relation (PURPOSE) in most inventories, but their relational semantics are quite different (Downing, 1977). Furthermore, the definitions given by human subjects can involve rich and specific meanings. For example, Downing (1977) reports that a subject defined the NC oil bowl as "the bowl into which the oil in the engine is drained during an oil change", compared to which a minimal interpretation bowl for oil seems very reductive. In view of such arguments, linguists such as Downing (1977), Ryder (1994) and Coulson (2001) have argued for a fine-grained, essentially open-ended space of interpretations.

The idea of working with fine-grained paraphrases for NC semantics has recently grown in popularity among NLP researchers (Butnariu and Veale, 2008; Nakov and Hearst, 2008; Nakov, 2008a). Task 9 at SemEval-2010 (Butnariu et al., 2010) was devoted to this methodology. In that previous work, the paraphrases provided by human subjects were required to fit a restrictive template admitting only verbs and prepositions occurring between the NC's constituent nouns. Annotators recruited through Amazon Mechanical Turk were asked to provide paraphrases for the dataset of NCs. The gold standard for each NC was the ranked list of paraphrases given by the annotators; this reflects the idea that a compound's meaning can be described in different ways, at different levels of granularity and capturing different interpretations in the case of ambiguity. For example, a plastic saw could be a saw made of plastic or a saw for cutting plastic. Systems participating in the task were given the set of attested paraphrases for each NC, and evaluated according to how well they could reproduce the humans' ranking.

The design of this task, SemEval-2013 Task 4, is informed by previous work on compound annotation and interpretation. It is also influenced by similar initiatives, such as the English Lexical Substitution task at SemEval-2007 (McCarthy and Navigli, 2007), and by various evaluation exercises in the fields of paraphrasing and machine translation. We build on SemEval-2010 Task 9, extending the task's flexibility in a number of ways. The restrictions on the form of annotators' paraphrases were relaxed, giving us a rich dataset of close-to-freeform paraphrases (Section 3). Rather than ranking a set of attested paraphrases, systems must now both generate and rank their paraphrases; the task they perform is essentially the same as what the annotators were asked to do. This new setup required us to innovate in terms of evaluation measures (Section 4).

We anticipate that the dataset and task will be of broad interest among those who study lexical semantics. We believe that the overall progress in the field will significantly benefit from a public-domain set of free-style NC paraphrases. That is why our primary objective is the challenging endeavour of preparing and releasing such a dataset to the research community. The common evaluation task which we establish will also enable researchers to compare their algorithms and their empirical results.

2 Task description

This is an English NC interpretation task, which explores the idea of interpreting the semantics of NCs via free paraphrases. Given a noun-noun compound such as air filter, the participating systems are asked to produce an explicitly ranked list of free paraphrases, as in the following example:

1 filter for air
2 filter of air
3 filter that cleans the air
4 filter which makes air healthier
5 a filter that removes impurities from the air
...
Such a list is then automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed via Amazon's Mechanical Turk. The comparison of raw paraphrases is sensitive to syntactic and morphological variation. The ranking of paraphrases is based on their relative popularity among different annotators. To make the ranking more reliable, highly similar paraphrases are grouped so as to downplay superficial differences in syntax and morphology.

3 Data collection

We used Amazon's Mechanical Turk service to collect diverse paraphrases for a range of "gold-standard" NCs.¹ We paid the workers a small fee ($0.10) per compound, for which they were asked to provide five paraphrases. Each paraphrase had to contain the two nouns of the compound (in singular or plural inflectional forms, but not in another derivational form), an intermediate non-empty linking phrase and optional preceding or following terms. The paraphrasing terms could have any part of speech, so long as the resulting paraphrase was a well-formed noun phrase headed by the NC's head.

We gave the workers feedback during data collection if they appeared to have misunderstood the nature of the task. Once raw paraphrases had been collected from all workers, we collated them into a spreadsheet, and we merged identical paraphrases in order to calculate their overall frequencies. Ill-formed paraphrases – those violating the syntactic restrictions described above – were manually removed following a consensus decision-making procedure; every paraphrase was checked by at least two task organizers. We did not require that the paraphrases be semantically felicitous, but we performed minor edits on the remaining paraphrases if they contained obvious typos.

The remaining well-formed paraphrases were sorted by frequency separately for each NC. The most frequent paraphrases for a compound are assigned the highest rank 0, those with the next-highest frequency are given a rank of 1, and so on. Paraphrases with a frequency of 1 – proposed for a given NC by only one annotator – always occupy the lowest rank on the list for that compound.

We used 174+181 noun-noun compounds from the NC dataset of Ó Séaghdha (2007). The trial dataset, which we initially released to the participants, consisted of 4,255 human paraphrases for 174 noun-noun pairs; this dataset was also the training dataset. The test dataset comprised paraphrases for 181 noun-noun pairs. The "gold standard" contained 9,706 paraphrases, of which 8,216 were unique for those 181 NCs. Further statistics on the datasets are presented in Table 1.

                          Total    Min / Max / Avg
  Trial/Train (174 NCs)
    paraphrases           6,069    1 / 287 / 34.9
    unique paraphrases    4,255    1 / 105 / 24.5
  Test (181 NCs)
    paraphrases           9,706    24 / 99 / 53.6
    unique paraphrases    8,216    21 / 80 / 45.4

Table 1: Statistics of the trial and test datasets: the total number of paraphrases with and without duplicates, and the minimum / maximum / average per noun compound.

Compared with the data collected for SemEval-2010 Task 9 on the interpretation of noun compounds, the data collected for this new task have a far greater range of variety and richness. For example, the following (selected) paraphrases for work area vary from parsimonious to expansive:

• area for work
• area of work
• area where work is done
• area where work is performed
• ...
• an area cordoned off for persons responsible for work
• an area where construction work is carried out
• an area where work is accomplished and done
• area where work is conducted
• office area assigned as a work space
• ...

¹ Since the annotation on Mechanical Turk was going slowly, we also recruited four other annotators to do the same work, following exactly the same instructions.
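The frequency-based ranking procedure described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the organizers' actual tooling, and the worker responses below are hypothetical examples rather than entries from the dataset:

```python
from collections import Counter

def rank_paraphrases(raw_paraphrases):
    """Merge identical paraphrases and assign dense frequency ranks:
    the most frequent paraphrases get the highest rank 0, the
    next-most-frequent get rank 1, and so on.  Paraphrases proposed
    only once (frequency 1) necessarily end up on the lowest rank."""
    freqs = Counter(raw_paraphrases)
    # Distinct frequency values, highest first; ties share a rank.
    distinct = sorted(set(freqs.values()), reverse=True)
    rank_of = {f: r for r, f in enumerate(distinct)}
    return {p: rank_of[f] for p, f in freqs.items()}

# Hypothetical worker output for "air filter" (invented for illustration):
raw = ["filter for air"] * 5 + ["filter of air"] * 3 + ["filter that cleans the air"]
ranks = rank_paraphrases(raw)
# ranks == {'filter for air': 0, 'filter of air': 1,
#           'filter that cleans the air': 2}
```

Dense ranking means that equally frequent paraphrases share a rank, which matches the description of assigning rank 0 to the most frequent paraphrases and rank 1 to those with the next-highest frequency.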
4 Scoring

Noun compounding is a generative aspect of language, but so too is the process of NC interpretation: human speakers typically generate a range of possible interpretations for a given compound, each emphasizing a different aspect of the relationship between the nouns. Our evaluation framework reflects the belief that there is rarely a single right answer for a given noun-noun pairing. Participating systems are thus expected to demonstrate some generativity of their own, and are scored not just on the accuracy of individual interpretations, but on the overall breadth of their output.

For evaluation, we provided a scorer implemented, for good portability, as a Java class. For each noun compound to be evaluated, the scorer compares a list of system-suggested paraphrases against a "gold-standard" reference list, compiled and rank-ordered from the paraphrases suggested by our human annotators. The score assigned to each system is the mean of the system's performance across all test compounds. Note that the scorer removes all determiners from both the reference and the test paraphrases, so a system is neither punished for not reproducing a determiner nor rewarded for producing the same determiners.

The scorer can match words identically or non-identically. A match of two identical words W_gold and W_test earns a score of 1.0. There is a partial score of (2|P| / (|PW_gold| + |PW_test|))² for a match of two words PW_gold and PW_test that are not identical but share a common prefix P, |P| > 2, e.g., wmatch(cutting, cuts) = (6/11)² = 0.297.

Two n-grams N_gold = [GW_1, ..., GW_n] and N_test = [TW_1, ..., TW_n] can be matched if wmatch(GW_i, TW_i) > 0 for all i in 1..n. The score assigned to the match of these two n-grams is then Σ_i wmatch(GW_i, TW_i). For every n-gram N_test = [TW_1, ..., TW_n] in a system-generated paraphrase, the scorer finds a matching n-gram N_gold = [GW_1, ..., GW_n] in the reference paraphrase Para_gold which maximizes this sum.

The overall n-gram overlap score for a reference paraphrase Para_gold and a system-generated paraphrase Para_test is the sum of the scores calculated for all n-grams in Para_test, where n ranges from 1 to the size of Para_test. This overall score is then normalized by dividing by the maximum of the n-gram overlap score for Para_gold compared with itself and the n-gram overlap score for Para_test compared with itself. This normalization step produces a paraphrase match score in the range [0.0 – 1.0]. It punishes a paraphrase Para_test both for over-generating (containing more words than are found in Para_gold) and for under-generating (containing fewer words than are found in Para_gold). In other words, Para_test should ideally reproduce everything in Para_gold, and nothing more or less.

The reference paraphrases in the "gold standard" are ordered by rank; the highest rank is assigned to the paraphrases which human judges suggested most often. The rank of a reference paraphrase matters because a good participating system will aim to reproduce the top-ranked "gold-standard" paraphrases produced by human judges. The scorer assigns a multiplier of R/(R + n) to reference paraphrases at rank n; this multiplier asymptotically approaches 0 for higher values of n, i.e., for ever lower-ranked paraphrases. We choose a default setting of R = 8, so that a reference paraphrase at rank 0 (the highest rank) has a multiplier of 1, while a reference paraphrase at rank 5 has a multiplier of 8/13 = 0.615.

When a system-generated paraphrase Para_test is matched with a reference paraphrase Para_gold, their normalized n-gram overlap score is scaled by the rank multiplier attached to the rank of Para_gold relative to the other reference paraphrases provided by human judges. The scorer automatically chooses the reference paraphrase Para_gold for a test paraphrase Para_test so as to maximize this product of normalized n-gram overlap score and rank multiplier.

The overall score assigned to each system for a specific compound is calculated in two different ways: using isomorphic matching of suggested paraphrases to the "gold standard's" reference paraphrases (on a one-to-one basis); and using non-isomorphic matching of the system's paraphrases to the "gold standard's" reference paraphrases (in a potentially many-to-one mapping).

Isomorphic matching rewards both precision and recall. It rewards a system for accurately reproducing the paraphrases suggested by human judges, for reproducing as many of these as it can, and in much the same order. In isomorphic mode, the system's paraphrases are matched 1-to-1 with reference paraphrases on a first-come first-matched basis, so ordering can be crucial.

Non-isomorphic matching rewards only precision. It rewards a system for accurately reproducing the top-ranked human paraphrases in the "gold standard". A system will achieve a higher score in a non-isomorphic match if it reproduces the top-ranked human paraphrases as opposed to lower-ranked human paraphrases. The ordering of the system's paraphrases is thus not important in non-isomorphic matching.

Each system is evaluated using the scorer in both modes, isomorphic and non-isomorphic.

  Team               isomorphic   non-isomorphic
  SFS                   23.1          17.9
  IIITH                 23.1          25.8
  MELODI-Primary        13.0          54.8
  MELODI-Contrast       13.6          53.6
  Naive Baseline        13.8          40.6

Table 2: Results for the participating systems; the baseline outputs the same paraphrases for all compounds.
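The word-matching, n-gram overlap and rank-weighting rules above can be sketched in Python as follows. This is an illustrative reimplementation of the description only; the official scorer is a Java class and may differ in details (determiner stripping and the isomorphic/non-isomorphic assignment of paraphrases are omitted here):

```python
def wmatch(gold_word, test_word):
    """Word match: 1.0 for identical words; otherwise a partial score
    based on the longest common prefix P, provided |P| > 2."""
    if gold_word == test_word:
        return 1.0
    p = 0  # length of the longest common prefix
    for a, b in zip(gold_word, test_word):
        if a != b:
            break
        p += 1
    if p <= 2:
        return 0.0
    return (2 * p / (len(gold_word) + len(test_word))) ** 2

def ngrams(words, n):
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def overlap(gold_words, test_words):
    """For every n-gram of the test paraphrase (n = 1..len), add the score
    of the best-matching gold n-gram; two n-grams can be matched only if
    every aligned word pair has wmatch > 0."""
    total = 0.0
    for n in range(1, len(test_words) + 1):
        for t_ng in ngrams(test_words, n):
            best = 0.0
            for g_ng in ngrams(gold_words, n):
                scores = [wmatch(g, t) for g, t in zip(g_ng, t_ng)]
                if all(s > 0 for s in scores):
                    best = max(best, sum(scores))
            total += best
    return total

def paraphrase_match(gold, test):
    """Normalized n-gram overlap score in [0.0, 1.0]."""
    g, t = gold.split(), test.split()
    return overlap(g, t) / max(overlap(g, g), overlap(t, t))

def rank_multiplier(rank, R=8):
    """Multiplier R / (R + n) for a reference paraphrase at rank n."""
    return R / (R + rank)
```

Under this sketch, wmatch("cutting", "cuts") gives (6/11)² ≈ 0.297, reproducing the worked example above, and rank_multiplier(5) gives 8/13 ≈ 0.615; a test paraphrase identical to the reference scores exactly 1.0, while any over- or under-generation lowers the normalized score.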
Systems which aim only for precision should score highly in non-isomorphic match mode, but poorly in isomorphic match mode. Systems which aim for precision and recall will face a more substantial challenge, likely reflected in their scores.

A naïve baseline

We decided to allow preposition-only paraphrases, which are abundant in the paraphrases suggested by human judges in the crowdsourced Mechanical Turk collection process. This abundance means that the top-ranked paraphrase for a given compound is often a preposition-only phrase, or one of a small number of very popular paraphrases such as used for or used in. It is thus straightforward to build a naïve baseline generator which we can expect to score reasonably on this task, at least in non-isomorphic matching mode. For each test compound M H, the baseline system generates the following paraphrases, in this precise order: H of M, H in M, H for M, H with M, H on M, H about M, H has M, H to M, H used for M, H used in M.

This naïve baseline is truly unsophisticated. No attempt is made to order paraphrases by their corpus frequencies or by their frequencies in the training data. The same sequence of paraphrases is generated for each and every test compound.

5 Results

Three teams participated in the challenge, and all their systems were supervised. The MELODI system relied on a semantic vector space model built from the UKWAC corpus (window-based, 5 words). It used only the features of the right-hand head noun to train a maximum entropy classifier.

The IIITH system used the probabilities of a preposition co-occurring with a relation to identify the class of the noun compound. To collect statistics, it used the Google n-grams, the BNC and the ANC.

The SFS system extracted templates and fillers from the training data, which it then combined with a four-gram language model and a MaxEnt reranker. To find similar compounds, they used Lin's WordNet similarity. They further used statistics from the English Gigaword and the Google n-grams.

Table 2 shows the performance of the participating systems, SFS, IIITH and MELODI, and the naïve baseline. The baseline shows that it is relatively easy to achieve a moderately good score in non-isomorphic match mode by generating a fixed set of paraphrases which are both common and generic: two of the three participating systems, SFS and IIITH, under-perform the naïve baseline in non-isomorphic match mode, but outperform it in isomorphic mode. The only system to surpass this baseline in non-isomorphic match mode is the MELODI system; yet, it under-performs against the same baseline in isomorphic match mode. No participating team submitted a system which would outperform the naïve baseline in both modes.

6 Conclusions

The conclusions we draw from the experience of organizing the task are mixed. Participation was reasonable but not large, suggesting that NC paraphrasing remains a niche interest – though we believe it deserves more attention among the broader lexical semantics community, and we hope that the availability of our freeform paraphrase dataset will attract a wider audience in the future. We also observed a varied response from our annotators in terms of embracing their freedom to generate complex and rich paraphrases; there are many possible reasons for this, including laziness, time pressure and the fact that short paraphrases are often very appropriate paraphrases. The results obtained by our participants were also modest, demonstrating that compound paraphrasing is both a difficult task and a novel one that has not yet been "solved".

Acknowledgments

This work was partially supported by a small but effective grant from Amazon; the credit allowed us to hire sufficiently many Turkers – thanks! And a thank-you to our additional annotators Dave Carter, Chris Fournier and Colette Joubarne for their complete sets of paraphrases of the noun compounds in the test data.

References

Timothy Baldwin and Takaaki Tanaka. 2004. Translation by machine of complex nominals: Getting it right. Proc. ACL-04 Workshop on Multiword Expressions: Integrating Processing, Barcelona, Spain, 24-31.

Cristina Butnariu and Tony Veale. 2008. A concept-centered approach to noun-compound interpretation. Proc. 22nd International Conference on Computational Linguistics (COLING-08), Manchester, UK, 81-88.

Cristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2010. SemEval-2010 Task 9: The interpretation of noun compounds using paraphrasing verbs and prepositions. Proc. 5th International ACL Workshop on Semantic Evaluation, Uppsala, Sweden, 39-44.

Seana Coulson. 2001. Semantic Leaps: Frame-Shifting and Conceptual Blending in Meaning Construction. Cambridge University Press, Cambridge, UK.

Pamela Downing. 1977. On the creation and use of English compound nouns. Language, 53(4): 810-842.

Roxana Girju. 2007. Improving the interpretation of noun phrases with cross-linguistic information. Proc. 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 568-575.

Su Nam Kim and Timothy Baldwin. 2006. Interpreting semantic relations in noun compounds via verb semantics. Proc. ACL-06 Main Conference Poster Session, Sydney, Australia, 491-498.

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 Task 10: English lexical substitution task. Proc. Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, 48-53.

Dan Moldovan, Adriana Badulescu, Marta Tatu, Daniel Antohe, and Roxana Girju. 2004. Models for the semantic classification of noun phrases. Dan Moldovan and Roxana Girju, eds., HLT-NAACL 2004: Workshop on Computational Lexical Semantics, Boston, MA, USA, 60-67.

Preslav Nakov and Marti Hearst. 2008. Solving relational similarity problems using the Web as a corpus. Proc. 46th Annual Meeting of the Association for Computational Linguistics (ACL-08), Columbus, OH, USA, 452-460.

Preslav Nakov. 2008a. Improved statistical machine translation using monolingual paraphrases. Proc. 18th European Conference on Artificial Intelligence (ECAI-08), Patras, Greece, 338-342.

Preslav Nakov. 2008b. Noun compound interpretation using paraphrasing verbs: Feasibility study. Proc. 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA-08), Varna, Bulgaria, Lecture Notes in Computer Science 5253, Springer, 103-117.

Diarmuid Ó Séaghdha. 2007. Designing and evaluating a semantic annotation scheme for compound nouns. Proc. 4th Corpus Linguistics Conference, Birmingham, UK.

Diarmuid Ó Séaghdha. 2008. Learning compound noun semantics. Ph.D. thesis, Computer Laboratory, University of Cambridge. Published as University of Cambridge Computer Laboratory Technical Report 735.

Diarmuid Ó Séaghdha and Ann Copestake. 2008. Semantic classification with distributional kernels. Proc. 22nd International Conference on Computational Linguistics (COLING-08), Manchester, UK.

Diarmuid Ó Séaghdha and Ann Copestake. 2009. Using lexical and relational similarity to classify semantic relations. Proc. 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece, 621-629.

Mary Ellen Ryder. 1994. Ordered Chaos: The Interpretation of English Noun-Noun Compounds. University of California Press, Berkeley, CA, USA.

Takaaki Tanaka and Tim Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. Proc. ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, 17-24.

Stephen Tratz and Eduard Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. Proc. 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), Uppsala, Sweden, 678-687.