Evaluating Framenet-Style Semantic Parsing: the Role of Coverage Gaps in Framenet
Total Page:16
File Type:pdf, Size:1020Kb
Evaluating FrameNet-style semantic parsing: the role of coverage gaps in FrameNet Alexis Palmer and Caroline Sporleder Computational Linguistics Saarland University apalmer, csporled @coli.uni-saarland.de { } Abstract limited coverage of the available hand-coded data against which to evaluate. More realistic evalua- Supervised semantic role labeling (SRL) tions test systems on full text, but these same cov- systems are generally claimed to have ac- erage limitations mean that the assumptions made curacies in the range of 80% and higher in more restricted evaluations do not necessarily (Erk and Pado,´ 2006). These numbers, hold for full text. This paper provides an analysis though, are the result of highly-restricted of the extent and nature of the coverage gaps in evaluations, i.e., typically evaluating on FrameNet. A more precise understanding of the hand-picked lemmas for which training limitations of existing resources with respect to data is available. In this paper we con- robust semantic analysis of texts is an important sider performance of such systems when foundational component both for improving ex- we evaluate at the document level rather isting systems and for developing future systems, than on the lemma level. While it is well- and it is in this spirit that we make our analysis. known that coverage gaps exist in the re- sources available for training supervised Full-text semantic analysis SRL systems, what we have been lacking Automated frame-semantic analysis aims to ex- until now is an understanding of the pre- tract from text the key event-denoting predicates cise nature of this coverage problem and and the semantic argument structure for those its impact on the performance of SRL sys- predicates. The semantic argument structure of tems. We present a typology of five differ- a predicate describing an event encodes relation- ent types of coverage gaps in FrameNet. ships between the participants involved in the We then analyze the impact of the cov- event, e.g. who did what to whom. Knowledge of erage gaps on performance of a super- semantic argument structure is essential for lan- vised semantic role labeling system on full guage understanding and thus important for ap- texts, showing an average oracle upper plications such as information extraction (Mos- bound of 46.8%. chitti et al., 2003; Surdeanu et al., 2003), ques- tion answering (Shen and Lapata, 2007), or recog- 1 Introduction nizing textual entailment (Burchardt et al., 2009). A lot of progress has been made in semantic Evaluating an existing system for its ability to aid role labeling over the past years, but the per- such tasks is unrealistic if the evaluation is lemma- formance of state-of-the-art systems is still rel- based rather than text-based. Consequently, there atively low, especially for deep, FrameNet-style continues to be significant interest in developing semantic parsing. Furthermore, many of the re- semantic role labeling (SRL) systems able to au- ported performance figures are somewhat unre- tomatically compute the semantic argument struc- alistic because system performance is evaluated tures in an input text. on hand-selected lemmas, usually under the im- Performance on the full text task, though, is plicit assumptions that (i) all relevant word senses typically much lower than for the more restricted (frames) of each lemma are known, and (ii) there evaluations. The SemEval 2007 Task on “Frame is a suitable amount of training data for each Semantic Structure Extraction,” for example, re- sense. This approach to evaluation arises from the quired systems to identify key predicates in texts, 928 Coling 2010: Poster Volume, pages 928–936, Beijing, August 2010 assign a semantic frame to the relevant predi- it might require obtaining an analysis of the word cates, identify the semantic arguments for the tried. predicates, and finally label those arguments with (1) [Captain Thomas Preston] their semantic roles. The systems participating Defendanti was for in this task only obtained F-Scores between 55% triedTry defendanti [murder] . and 78% for frame assignment, despite the fact Chargesi,j that the task organizers adopted a lenient evalu- In the end [he] was ation scheme which gave partial credit for near- Defendantj misses (Baker et al., 2007). For the combined task [Ø] . acquittedVerdictj Chargesj of frame assigment and role labeling the perfor- mance was even lower, ranging from 35% to 54% Performance levels obtained for full text are F-Score. usually not sufficient for this kind of real-world Note that this distinction between evaluation task. FrameNet-style semantic role labeling has schemes for SRL systems corresponds to the dis- been shown to, in principle, be beneficial for ap- tinction between “lexical sample” and “all words” plications that need to generalise over individual evaluations in word sense disambiguation, where lemmas, such as recognizing textual entailment or results for the latter scheme are also typically question answering. However, studies also found lower (McCarthy, 2009). that state-of-the-art FrameNet-style SRL systems The low performances are at least partly due perform too poorly to provide any substantial ben- to coverage problems. For example, Baker et efit to real applications (Burchardt et al., 2009; al. (2007) annotated three new texts for their Shen and Lapata, 2007). SemEval 2007 task. Although these new texts Extending the value of automated semantic overlap in domain with existing FrameNet data, parsing for a variety of applications requires im- the task organizers had to create 40 new frames proving the ability of systems to process unre- in order to complete annotation. The new frames stricted text. Several methods have been pro- were for word senses found in the test set but posed to address different aspects of the cover- missing from FrameNet. The test set contained age problem, ranging from automatic data expan- only 272 frames (types), meaning that nearly 15% sion and semi-supervised semantic role labelling of the frames therein were not yet defined in (Furstenau¨ and Lapata, 2009b; Furstenau¨ and La- FrameNet. Obviously, coverage issues of this de- pata, 2009a; Deschacht and Moens, 2009; Gordon gree make full SRL a difficult task, but this is a and Swanson, 2007; Pado´ et al., 2008) to systems realistic scenario that will be encountered in real which can infer missing word senses (Pennac- applications as well. chiotti et al., 2008b; Pennacchiotti et al., 2008a; As mentioned above, for many tasks it is neces- Cao et al., 2008; Burchardt et al., 2005). How- sary to compute the semantic argument structures ever, so far there has not been a detailed analysis for the whole text, or at least for multi-sentence of the problem. In this paper we provide that de- passages. Due to non-local relations between ar- tailed analysis, by defining different types of cov- gument structures this is also true for tasks like erage problems and performing analysis of both question answering, where it might be possible coverage and performance of an automated SRL to automatically determine a subset of lemmas system on three different data sets. which are relevant for the task. For example, in (1) Section 2 of the paper provides an introduction it might be possible to determine that the second to FrameNet and introduces the basic terminol- sentence contains the answer to the question “Was ogy. Section 4 describes our approach to coverage Thomas Preston acquitted of theft?” However, evaluation, Section 3 discusses the texts analyzed, to correctly answer this question, it is necessary and the analysis itself appears in Section 5. Sec- to resolve the null instantiation of the CHARGES tion 6 then looks at one possibility for addressing role of the VERDICT frame. This null instantiation the coverage problem. The final section presents links back to the previous sentence, and resolving some discussion and conclusions. 929 (a) (b) Figure 1: Terminology: (a) Frame with core frame elements (FEs) and frame-evoking elements (FEEs) (b) Target with possible frame assignments and resultant lexical units (LUs) 2 FrameNet semantic argument annotations for nouns, adjec- tives, adverbs, prepositions and even multi-word Manual annotation of corpora with semantic ar- expressions. For example, the sentence in (2) and gument structure information has enabled the de- the NP in (3) have identical argument structures velopment of statistical and supervised machine because the verb speak and the noun comment learning techniques for semantic role labeling evoke the same frame STATEMENT. (Toutanova et al., 2008; Moschitti et al., 2008; Gildea and Jurafsky, 2002). (2) [The politician]Speaker spokeStatement The two main resources are PropBank (Palmer [about recent developments on the labour et al., 2005) and FrameNet (Ruppenhofer et al., market]T opic. 2006). PropBank aims to provide a semantic role (3) [The politician’s] com- annotation for every verb in the Penn TreeBank Speaker mentsStatement [on recent developments (Marcus et al., 1994) and assigns roles on a verb- on the labour market] by-verb basis, without making higher-level gener- T opic alizations. Whether two distinct usages of a given Since FrameNet annotations are semanti- verb are viewed as different senses or not is thus cally driven they are considerably more time- driven by both syntax (namely, differences in syn- consuming to create than PropBank annotations. tactic argument structure) and semantics (via ba- However, FrameNet also provides ‘deeper’ and sic, easily-discernable differences in meaning). more informative annotations than PropBank FrameNet1 is a lexicographic project whose analyses (Ellsworth et al., 2004). For instance, aim it is to create a lexical resource documenting the fact that (2) and (3) refer to the same state- valence structures for different word senses and of-affairs is not captured by PropBank sense dis- their possible mappings to underlying semantic tinctions. argument structure (Ruppenhofer et al., 2006). In contrast to PropBank, FrameNet is primarily se- FrameNet Terminology mantically driven; word senses (frames)2 are de- The English FrameNet data consist of an inven- fined mainly based on sometimes-subtle meaning tory of frames (i.e.