The Effects of Lexical Resource Quality on Preference Violation Detection

Jesse Dunietz
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Lori Levin and Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{lsl,jgc}@cs.cmu.edu

Abstract

Lexical resources such as WordNet and VerbNet are widely used in a multitude of NLP tasks, as are annotated corpora such as PropBank. Often, the resources are used as-is, without question or examination. This practice risks missing significant performance gains and even entire techniques.

This paper addresses the importance of resource quality through the lens of a challenging NLP task: detecting selectional preference violations. We present DAVID, a simple, lexical resource-based preference violation detector. With as-is lexical resources, DAVID achieves an F1-measure of just 28.27%. When the resource entries and parser outputs for a small sample are corrected, however, the F1-measure on that sample jumps from 40% to 61.54%, and performance on other examples rises, suggesting that the algorithm becomes practical given refined resources. More broadly, this paper shows that resource quality matters tremendously, sometimes even more than algorithmic improvements.

1 Introduction

A variety of NLP tasks have been addressed using selectional preferences or restrictions, including word sense disambiguation (see Navigli (2009)), semantic parsing (e.g., Shi and Mihalcea (2005)), and metaphor processing (see Shutova (2010)). These semantic problems are quite challenging; metaphor analysis, for instance, has long been recognized as requiring considerable semantic knowledge (Wilks, 1978; Carbonell, 1980).

The advent of extensive lexical resources, annotated corpora, and a spectrum of NLP tools presents an opportunity to revisit such challenges from the perspective of selectional preference violations. Detecting these violations, however, constitutes a severe stress-test for resources designed for other tasks. As such, it can highlight shortcomings and allow quantifying the potential benefits of improving resources such as WordNet (Fellbaum, 1998) and VerbNet (Schuler, 2005).

In this paper, we present DAVID (Detector of Arguments of Verbs with Incompatible Denotations), a resource-based system for detecting preference violations. DAVID is one component of METAL (Metaphor Extraction via Targeted Analysis of Language), a new system for identifying, interpreting, and cataloguing metaphors. One purpose of DAVID was to explore how far lexical resource-based techniques can take us. Though our initial results suggested that the answer is "not very," further analysis revealed that the problem lies less in the technique than in the state of existing resources and tools.

Often, it is assumed that the frontier of performance on NLP tasks is shaped entirely by algorithms. Manning (2011) showed that this may not hold for POS tagging – that further improvements may require resource cleanup. In the same spirit, we argue that for some semantic tasks, exemplified by preference violation detection, resource quality may be at least as essential as algorithmic enhancements.

2 The Preference Violation Detection Task

DAVID builds on the insight of Wilks (1978) that the strongest indicator of metaphoricity is the violation of selectional preferences. For example, only plants can literally be pruned. If laws is the object of pruned, the verb is likely metaphorical. Flagging such semantic mismatches between verbs and arguments is the task of preference violation detection.

We base our definition of preferences on the Pragglejaz guidelines (Pragglejaz Group, 2007) for identifying the most basic sense of a word as the most concrete, embodied, or precise one. Similarly, we define selectional preferences as the semantic constraints imposed by a verb's most basic sense. WordNet may list figurative senses of prune, but we take the basic sense to be cutting plant growth.

Several types of verbs were excluded from the task because they have very lax preferences. These include verbs of becoming or seeming (e.g., transform, appear), light verbs, auxiliaries, and aspectual verbs. For the sake of simplifying implementation, phrasal verbs were also ignored.

"The politician pruned laws regulating plastic bags, and created new fees for inspecting dairy farms."

Verb        Arg0            Arg1
pruned      The politician  laws ... bags
regulating  laws            plastic bags
created     The politician  new fees
inspecting  -               dairy farms

Table 1: SENNA's SRL output for the example sentence above. Though this example demonstrates only two arguments, SENNA is capable of labeling up to six.

3 Algorithm Design

To identify violations, DAVID employs a simple algorithm based on several existing tools and resources: SENNA (Collobert et al., 2011), a semantic role labeling (SRL) system; VerbNet, a computational verb lexicon; SemLink (Loper et al., 2007), which includes mappings between PropBank (Palmer et al., 2005) and VerbNet; and WordNet. As one of METAL's several metaphor detection components, DAVID is designed to favor precision over recall. The algorithm is as follows:

1. Run the Stanford CoreNLP POS tagger (Toutanova et al., 2003) and the TurboParser dependency parser (Martins et al., 2011).
2. Run SENNA to identify the semantic arguments of each verb in the sentence using the PropBank argument annotation scheme (Arg0, Arg1, etc.). See Table 1 for example output.
3. For each verb V, find all VerbNet entries for V. Using SemLink, map each PropBank argument name to the corresponding VerbNet thematic roles in these entries (Agent, Patient, etc.). For example, the VerbNet class for prune is carve-21.2-2. SemLink maps Arg0 to the Agent of carve-21.2-2 and Arg1 to the Patient.
4. Retrieve from VerbNet the selectional restrictions of each thematic role. In our running example, VerbNet specifies +int_control and +concrete for the Agent and Patient of carve-21.2-2, respectively.
5. If the head of any argument cannot be interpreted to meet V's preferences, flag V as a violation.

Restriction     WordNet Synsets
animate         animate_being.n.01, people.n.01, person.n.01
concrete        physical_object.n.01, matter.n.03, substance.n.04
organization    social_group.n.01, district.n.01

Table 2: DAVID's mappings between some common VerbNet restriction types and WordNet synsets.

Each VerbNet restriction is interpreted as mandating or forbidding a set of WordNet hypernyms, defined by a custom mapping (see Table 2). For example, VerbNet requires both the Patient of a verb in carve-21.2-2 and the Theme of a verb in wipe_manner-10.4.1-1 to be concrete. By empirical inspection, concrete nouns are hyponyms of the WordNet synsets physical_object.n.01, matter.n.03, or substance.n.04. Laws (the Patient of prune) is a hyponym of none of these, so prune would be flagged as a violation.
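To make the restriction check in steps 3-5 concrete, the sketch below walks through the running prune example. It is an illustrative sketch rather than DAVID's actual implementation: it assumes NLTK's WordNet interface, hand-codes a fragment of the Table 2 mapping (synset names follow the text and may need adjusting to a particular WordNet version), and replaces the SemLink and VerbNet lookups with small hard-coded dictionaries.

```python
# Illustrative sketch of the restriction check (Section 3, steps 3-5); not DAVID's code.
# Assumes NLTK with the WordNet corpus installed; SemLink/VerbNet lookups are stubbed.
from nltk.corpus import wordnet as wn

# Fragment of the Table 2 mapping: restriction label -> licensing WordNet synsets.
RESTRICTION_SYNSETS = {
    "animate": ["animate_being.n.01", "people.n.01", "person.n.01"],
    "concrete": ["physical_object.n.01", "matter.n.03", "substance.n.04"],
    "organization": ["social_group.n.01", "district.n.01"],
}

def satisfies(noun, restriction):
    """True if ANY WordNet sense of `noun` falls under a synset licensed by
    `restriction` (every sense is treated as plausible, as in DAVID)."""
    allowed = {wn.synset(name) for name in RESTRICTION_SYNSETS[restriction]}
    for sense in wn.synsets(noun, pos=wn.NOUN):
        # Hypernym closure of this sense, including the sense itself.
        closure = set(sense.closure(lambda s: s.hypernyms())) | {sense}
        if closure & allowed:
            return True
    return False

def violates(verb_roles, restrictions, argument_heads):
    """Flag a violation if any argument head fails its role's restriction.
    `verb_roles` maps PropBank args to VerbNet roles (as SemLink would);
    `restrictions` maps VerbNet roles to restriction labels (as VerbNet would)."""
    for pb_arg, head in argument_heads.items():
        role = verb_roles.get(pb_arg)
        restriction = restrictions.get(role)
        if restriction and not satisfies(head, restriction):
            return True
    return False

# Running example for "prune" (VerbNet class carve-21.2-2):
semlink_roles = {"Arg0": "Agent", "Arg1": "Patient"}      # stand-in for SemLink
verbnet_restrictions = {"Patient": "concrete"}             # +concrete on the Patient
heads = {"Arg0": "politician", "Arg1": "law"}              # from SRL + head finding
print(violates(semlink_roles, verbnet_restrictions, heads))  # True: no concrete sense of "law"
```

Because every WordNet sense of the argument head is treated as plausible, a check of this kind errs toward not flagging, mirroring DAVID's preference for precision over recall.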
4 Corpus Annotation

To evaluate our system, we assembled a corpus of 715 sentences from the METAL project's corpus of sentences with and without metaphors. The corpus was annotated by two annotators following an annotation manual. Each verb was marked for whether its arguments violated the selectional preferences of the most basic, literal meaning of the verb. The annotators resolved conflicts by discussing until consensus.

5 Initial Results

As the first row of Table 4 shows, our initial evaluation left little hope for the technique. With such low precision and F1, it seemed a lexical resource-based preference violation detector was out. When we analyzed the errors in 90 randomly selected sentences, however, we found that most were not due to systemic problems with the approach; rather, they stemmed from SRL and parsing errors and missing or incorrect resource entries (see Table 3). Armed with this information, we decided to explore how viable our algorithm would be absent these problems.

Error source                     Frequency
Bad/missing VN entries           4.5 (14.1%)
Bad/missing VN restrictions      6 (18.8%)
Bad/missing SL mappings          2 (6.3%)
Parsing/head-finding errors      3.5 (10.9%)
SRL errors                       8.5 (26.6%)
VN restriction system too weak   4 (12.5%)
Confounding WordNet senses       3.5 (10.9%)
Endemic errors:                  7.5 (23.4%)
Resource errors:                 12.5 (39.1%)
Tool errors:                     12 (37.5%)
Total:                           32 (100%)

Table 3: Sources of error in 90 randomly selected sentences. For errors that were due to a combination of sources, 1/2 point was awarded to each source. (VN stands for VerbNet and SL for SemLink.)

6 Refining the Data

To evaluate the effects of correcting DAVID's inputs, we manually corrected the tool outputs and resource entries that affected the aforementioned 90 sentences. SRL output was corrected for every sentence, while SemLink and VerbNet entries were corrected only for each verb that produced an error.

6.1 Corrections to Tool Output (Parser/SRL)

Guided by the PropBank annotation guidelines, we corrected all errors in core role assignments from SENNA. These corrections included relabeling arguments, adding missed arguments, fixing argument spans, and deleting annotations for non-verbs. The only parser-related error we corrected was a mislabeled noun.

6.2 Correcting Corrupted Data in VerbNet

The VerbNet download is missing several subclasses that are referred to by SemLink or that have been updated on the VerbNet website. Some roles also have not been updated to the latest version, and some subclasses are listed with incorrect IDs. These problems, which caused SemLink mappings to fail, were corrected before reviewing errors from the corpus.

Six subclasses needed to be fixed, all of which were easily detected by a simple script that did not depend on the 90-sentence subcorpus. We therefore expect that few further changes of this type would be needed for a more complete resource refinement effort.
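The script itself is not reproduced here, but a check of this kind is straightforward to sketch. The sketch below is illustrative rather than the script we used: it assumes NLTK's VerbNet corpus reader, and it leaves the collection of SemLink-referenced class IDs as a stub, since that step depends on the file format of the particular SemLink release.

```python
# Illustrative sketch (not the original script) of the consistency check described in
# Section 6.2: find VerbNet class IDs that SemLink refers to but that are absent from
# the installed VerbNet download. Assumes NLTK's VerbNet corpus is installed.
from nltk.corpus import verbnet

def missing_verbnet_classes(semlink_class_ids):
    """Return SemLink-referenced class IDs with no matching (sub)class in VerbNet."""
    known = set(verbnet.classids())  # e.g., 'carve-21.2-2', 'defend-85', ...
    return sorted(set(semlink_class_ids) - known)

if __name__ == "__main__":
    # Stub: in practice these IDs would be scraped from the SemLink mapping files.
    referenced = ["carve-21.2-2", "wipe_manner-10.4.1-1", "defend-85"]
    for classid in missing_verbnet_classes(referenced):
        print("SemLink refers to missing VerbNet class:", classid)
```

Any ID reported by such a check points either to a subclass missing from the VerbNet download or to a mismatched class ID of the kind described above.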
6.3 Corpus-Based Updates to SemLink

Our modifications to SemLink's mappings included adding missing verbs, adding missing roles to mappings, and correcting mappings to more appropriate classes or roles. We also added null mappings in cases where a PropBank argument had no corresponding role in VerbNet. This makes the system's strategy for ruling out mappings more reliable.

No corrections were made purely based on the sample. Any time a verb's mappings were edited, VerbNet was scoured for plausible mappings for every verb sense in PropBank, and any nonsensical mappings were deleted. For example, when the phrase go dormant caused an error, we inspected the mappings for go. Arguments of all but 2 of the 7 available mappings were edited, either to add missing arguments or to correct nonsensical ones. These changes actually had a net negative impact on test set performance because the bad mappings had masked parsing and selectional preference problems.

Based on the 90-sentence subcorpus, we modified 20 of the existing verb entries in SemLink. These changes included correcting 8 role mappings, adding 13 missing role mappings to existing senses, deleting 2 incorrect senses, adding 11 verb senses, correcting 2 senses, deleting 1 superfluous role mapping, and adding 46 null role mappings. (Note that although null mappings represented the largest set of changes, they also had the least impact on system behavior.) One entirely new verb was added, as well.

6.4 Corpus-Based Updates to VerbNet

Nineteen VerbNet classes were modified, and one class had to be added. The modifications generally involved adding, correcting, or deleting selectional restrictions, often by introducing or rearranging subclasses. Other changes amounted to fixing clerical errors, such as incorrect role names or restrictions that had been ANDed instead of ORed.

An especially difficult problem was an inconsistency in the semantics of VerbNet's subclass system. In some cases, the restrictions specified on a verb in a subclass did not apply to subcategorization frames inherited from a superclass, but in other cases the restrictions clearly applied to all frames. The conflict was resolved by duplicating subclassed verbs in the top-level class whenever different selectional restrictions were needed for the two sets of frames.

As with SemLink, samples determined only which classes were modified, not what modifications were made. Any non-obvious changes to selectional restrictions were verified by examining dozens of verb instances from SketchEngine's (Kilgarriff et al., 2004) corpus. For example, the Agent of seek was restricted to +animate, but the corpus confirmed that organizations are commonly described non-metaphorically as seeking, so the restriction was updated to +animate | +organization.

7 Results After Resource Refinement

After making corrections for each set of 10 sentences, we incrementally recomputed F1 and precision, both on the subcorpus corrected so far and on a test set of all 625 sentences that were never corrected. (The manual nature of the correction effort made testing k-fold subsets impractical.) The results for 30-sentence increments are shown in Table 4.

Sent.  Tools  Rsrcs     P        F1
715    0      0         27.14%   28.27%
625    0      0         26.55%   27.98%
625    0      corr.     26.37%   28.15%
30     0      0         50.00%   40.00%
30     30     0         66.67%   44.44%
30     0      corr.+30  62.50%   50.00%
30     30     corr.+30  87.50%   70.00%
625    0      corr.+30  27.07%   28.82%
60     0      0         35.71%   31.25%
60     60     0         54.55%   31.38%
60     0      corr.+60  53.85%   45.16%
60     60     corr.+60  90.91%   68.97%
625    0      corr.+60  26.92%   28.74%
90     0      0         31.82%   30.43%
90     90     0         44.44%   38.10%
90     0      corr.+90  47.37%   41.86%
90     90     corr.+90  80.00%   61.54%
625    0      corr.+90  27.37%   28.99%

Table 4: Performance on the preference violation detection task. Column 1 shows the sentence count. Columns 2 and 3 show how many sentences' SRL/parsing and resource errors, respectively, had been fixed ("corr." indicates corrupted files).

The most striking feature of these figures is how much performance improves on corrected sentences: for the full 90 sentences, F1 rose from 30.43% to 61.54%, and precision rose even more dramatically from 31.82% to 80.00%. Interestingly, resource corrections alone generally made a larger difference than tool corrections alone, suggesting that resources may be the dominant factor in resource-intensive tasks such as this one. Even more compellingly, the improvement from correcting both the tools and the resources was nearly double the sum of the improvements from each alone: tool and resource improvements interact synergistically.

The effects on the test corpus are harder to interpret. Due to a combination of SRL problems and the small number of sentences corrected, the scores on the test set improved little with resource correction; in fact, they even dipped slightly between the 30- and 60-sentence increments. Nonetheless, we contend that our results testify to the generality of our corrections: after each iteration, every altered result was either an error fixed or an error that should have appeared before but had been masked by another. Note also that all results on the test set are without corrected tool output; presumably, these sentences would also have improved synergistically with more accurate SRL. How long corrections would continue to improve performance is a question that we did not have the resources to answer, but our results suggest that there is plenty of room to go.

Some errors, of course, are endemic to the approach and cannot be fixed either by improved resources or by better tools. For example, we consider every WordNet sense to be plausible, which produces false negatives. Additionally, the selectional restrictions specified by VerbNet are fairly loose; a more refined set of categories might capture the range of verbs' restrictions more accurately.

8 Implications for Future Refinement Efforts

Although improving resources is infamously labor-intensive, we believe that similarly refining the remainder of VerbNet and SemLink would be doable. In our study, it took about 25-35 person-hours to examine about 150 verbs and to modify 20 VerbNet classes and 25 SemLink verb entries (excluding time for SENNA corrections, fixing corrupt VerbNet data, and analysis of DAVID's errors). Extrapolating from our experience, we estimate that it would take roughly 6-8 person-weeks to systematically fix this particular set of issues with VerbNet.

Improving SemLink could be more complex, as its mappings are automatically generated from VerbNet annotations on top of the PropBank corpus. One possibility is to correct the generated mappings directly, as we did in our study, which we estimate would take about two person-months.

With the addition of some metadata from the generation process, it would then be possible to follow the corrected mappings back to annotations from which they were generated and fix those annotations. One downside of this approach is that if the mappings were ever regenerated from the annotated corpus, any mappings not encountered in the corpus would have to be added back afterwards.

Null role mappings would be particularly thorny to implement. To add a null mapping, we must know that a role definitely does not belong, and is not just incidentally missing from an example. For instance, VerbNet's defend-85 class truly has no equivalent to Arg2 in PropBank's defend.01, but Arg0 or Arg1 may be missing for other reasons (e.g., in a passive). It may be best to simply omit null mappings, as is currently done. Alternatively, full parses from the Penn Treebank, on which PropBank is based, might allow distinguishing phenomena such as passives, where arguments are predictably omitted.

The maintainers of VerbNet and PropBank are aware of many of the issues we have raised, and we have been in contact with them about possible approaches to fixing them. They are particularly aware of the inconsistent semantics of selectional restrictions on VerbNet subclasses, and they hope to fix this issue within a larger attempt at retooling VerbNet's selectional restrictions. In the meantime, we are sharing our VerbNet modifications with them for them to verify and incorporate. We are also sharing our SemLink changes so that they can, if they choose, continue manual correction efforts or trace SemLink problems back to the annotated corpus.

9 Conclusion

Our results argue for investing effort in developing and fixing resources, in addition to developing better NLP tools. Resource and tool improvements interact synergistically: better resources multiply the effect of algorithm enhancements. Gains from fixing resources may sometimes even exceed what the best possible algorithmic improvements can provide. We hope the NLP community will take up the challenge of investing in its resources to the extent that its tools demand.

Acknowledgments

Thanks to Eric Nyberg for suggesting building a system like DAVID, to Spencer Onuffer for his annotation efforts, and to Davida Fromm for curating METAL's corpus of English sentences.

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0020. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

References

Jaime G. Carbonell. 1980. Metaphor: a key to extensible semantic analysis. In Proceedings of the 18th annual meeting on Association for Computational Linguistics, ACL '80, pages 17–21, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. The Sketch Engine. In Proceedings of EURALEX.

Edward Loper, Szu-ting Yi, and Martha Palmer. 2007. Combining lexical resources: Mapping between PropBank and VerbNet. In Proceedings of the 7th International Workshop on Computational Linguistics, Tilburg, the Netherlands.

Christopher D. Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing, pages 171–189. Springer.

André F. T. Martins, Noah A. Smith, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. 2011. Dual decomposition with many overlapping components. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 238–249, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):10.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Pragglejaz Group. 2007. MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol, 22(1):1–39.

Karin K. Schuler. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. AAI3179808.

Lei Shi and Rada Mihalcea. 2005. Putting pieces together: Combining FrameNet, VerbNet and WordNet for robust semantic parsing. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 3406 of Lecture Notes in Computer Science, pages 100–111. Springer Berlin Heidelberg.

Ekaterina Shutova. 2010. Models of metaphor in NLP. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 688–697, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yorick Wilks. 1978. Making preferences more active. Artificial Intelligence, 11:197–223.
