The Dependency-Parsed FrameNet Corpus

Daniel Bauer, Hagen Fürstenau, Owen Rambow
Columbia University
New York, NY 10024, USA

Abstract

When training semantic role labeling systems, the syntax of example sentences is of particular importance. Unfortunately, for the FrameNet annotated sentences, there is no standard parsed version. The integration of the automatic parse of an annotated sentence with its semantic annotation, while conceptually straightforward, is complex in practice. We present a standard dataset that is publicly available and that can be used in future research. This dataset contains parser-generated dependency structures (with POS tags and lemmas) for all FrameNet 1.5 sentences, with nodes automatically associated with FrameNet annotations.

1. Introduction

FrameNet (Fillmore et al., 2003) is a lexical resource for English, based on the theory of Frame Semantics (Fillmore, 1976). It comprises both a lexicon and a corpus of example sentences, in which certain words are identified as frame evoking elements (FEEs) and annotated with a semantic frame. Such frames represent prototypical situations or events, such as COMMERCIAL TRANSACTION, and may feature one or more frame elements (FEs, sometimes also called semantic roles), such as BUYER or SELLER. In its most recent version 1.5, FrameNet contains manual annotations for more than 170,000 sentences. Figure 1 shows an example of a semantically annotated sentence. These annotations have been used to train semantic role labeling (SRL) systems, which are then able to derive frame semantic analyses, i.e., identify frames and frame elements in any given sentence.

Syntactic structure is of crucial importance to the training and application of SRL systems, which typically make use of it to learn and infer semantic structure. Indeed, starting with Gildea and Jurafsky (2002), most SRL systems have made extensive use of syntactic parses as the basis of their semantic predictions, e.g., in the form of features used in various classification algorithms (e.g., Moschitti et al., 2008; Johansson and Nugues, 2007; Das and Smith, 2011). These features have been shown to improve SRL performance. In addition to characterizing syntactic realizations of predicate-argument structure, syntactic information is also necessary to extract the head word of an argument phrase. Features extracted for this head are commonly used to model the meaning of the argument and thus its compatibility with a particular role.

Unfortunately, the majority of FrameNet annotations are taken from the British National Corpus, for which there exists no standard syntactic annotation. Instead, as can be seen from the example in Figure 1, FrameNet provides only shallow syntactic information, indicating parts of speech, phrase types, and limited information on grammatical functions. FrameNet annotates FEs on surface text spans instead of syntactic constituents, and annotations do not mark the head word of an annotated FE.

Consequently, SRL work in the FrameNet paradigm often uses a syntactic parser to automatically produce syntactic analyses for the sentences in the FrameNet corpus. These parses then need to be aligned with the provided semantic annotations. For instance, Figure 2 shows the annotation from Figure 1 projected onto a parser-generated dependency structure. FEEs and FEs have been identified with nodes in the parse tree. Aligning syntactic analyses produced by a parser with the semantic annotation provided by FrameNet, while seemingly straightforward, is complicated by various issues, including parser errors, discontinuous frame elements, inconsistent FrameNet annotations, and technical issues such as character encodings. However, while this alignment task has to be performed repeatedly by many researchers and potentially has a significant impact on the performance of SRL systems, it has rarely been explained and evaluated in detail. This makes it harder than necessary to reproduce work on SRL within the Frame Semantic paradigm. In addition, differences in parser performance, syntactic representations, and alignment techniques limit the comparability of reported SRL results.

The aim of the present work is threefold:

1. In Section 3, we present in detail a simple but effective method of aligning FrameNet annotations with syntactic parses. We decided to restrict ourselves to dependency trees, which have enjoyed increasing popularity in SRL research (Hacioglu, 2004; Johansson and Nugues, 2007; Surdeanu et al., 2008).

2. In Section 4, we evaluate this method on a small subset of the FrameNet corpus, for which we manually annotated the head words of FEs, and tune some of its parameters.

3. We apply the method to the entire FrameNet 1.5 corpus to produce a standard dataset that is publicly available and can be used in future research.[1] This dataset contains dependency parses (with POS tags and lemmas) for all FrameNet 1.5 annotations, including per-dependency-node annotation of FEEs and FEs. We also include the split into training and test data that was used to evaluate a recent state-of-the-art SRL system for FrameNet (Das and Smith, 2011). Section 5 describes this resource in detail.

We conclude in Section 6.

[1] The dataset is available at http://www1.ccls.columbia.edu/~rambow/resources/parsed_framenet/.

Token:   She     wrinkled        her     nose    in      disapproval     .
Frame:           Body_movement
FE:      Agent                   Body_part       Internal_cause
POS:     PNP     VVD             DPS     NN1     PRP     NN1             PUN
PT:      NP                      NP              PP
GF:      Ext                     Obj             Dep

Figure 1: Example FrameNet annotation. FEs and FEEs are annotated over text spans. Additional information about parts of speech (POS), phrase types (PT), and grammatical functions (GF) is provided.

2-wrinkled (FRAME: Body_movement)
    nsubj -> 1-She (FE: Agent)
    dobj  -> 4-nose (FE: Body_part)
        poss -> 3-her
    prep  -> 5-in (FE: Internal_cause)
        pobj -> 6-disapproval

Figure 2: FrameNet annotation projected onto a dependency graph produced by an automatic parser. FEs and FEEs are assigned to nodes in the parse tree.

2. Related Work

The basic problem of parsing FrameNet sentences and aligning the resulting parses with the semantic annotations had to be addressed by most SRL systems working with the FrameNet data. In contrast, the PropBank corpus (Palmer et al., 2005) is based on the Penn Treebank and therefore already provides hand-annotated syntax, so that a similar problem does not arise. However, it follows a different paradigm of semantic roles (Rambow et al., 2003). Furthermore, while PropBank annotates only verbs, the FrameNet lexicon also contains frame evoking nouns, adjectives, adverbs, and prepositions.

An explicit description of different algorithms for the task at hand was previously given by Fürstenau (2008). However, that approach was limited to annotations of verbal FEEs. The present work provides more extensive evaluation results, confirming the robustness of our approach, reports on parameter tuning experiments, and is accompanied by a publicly available resource.

3. Method

In this section, we describe our approach of merging dependency information provided by a syntactic parser and Frame Semantic annotation from the FrameNet corpus into a new resource. The resulting resource is a corpus of semantically annotated dependency graphs, i.e., syntactic dependency graphs in which certain nodes are marked as frame evoking elements (FEEs) or frame elements (FEs) of specific semantic frames.

While describing the various processing steps involved in creating a unified resource, we will leave certain parameters unspecified. These will be empirically optimized in Section 4. In Section 5, we will then describe technical details of the resulting publicly available resource.

3.1. Parsing

FrameNet contains the tokenized plain text of each annotated sentence. We extract these sentences and parse them with the Stanford Parser (Klein and Manning, 2003)[2], setting it up to produce dependency analyses. In particular, we consider Stanford Dependencies (de Marneffe et al., 2006) in either of the two modes basicDependencies and CCPropagatedDependencies. The main difference between these is that the latter "collapses" prepositions into edge labels and does not guarantee that the dependency graph is a tree (it may not even be acyclic), both of which can allow the syntactic dependencies to better reflect the semantics of a sentence. Since there are advantages and disadvantages to either dependency representation, we experimented with both, and offer two versions of the final resource, built on either kind of syntax.

Instead of only considering the most probable parse, we let the parser output the 50 best parses of each sentence. In Section 3.3.3, we will describe how our method makes use of competing syntactic analyses. Section 4 then shows the impact of the number n of considered parses (n ≤ 50) on the quality of the resulting resource.

[2] Version 2.0.1, available at http://nlp.stanford.edu/software/lex-parser.shtml.

3.2. Preprocessing

In preparation for the merging of syntactic parses and semantic annotation, we transform the FrameNet annotation, which is based on character indices of substrings, into token-based annotation. Since the sentences in FrameNet are already tokenized, we simply split each sentence string at whitespace characters and number the resulting tokens, starting at 1. Each FEE or FE in FrameNet comes annotated as a (not necessarily contiguous) subset of the characters of the sentence. We convert such an annotation into a set of token numbers by considering each token whose set of characters is completely contained within the set of annotated characters. For example, the annotation of the FE Body_part in Figure 1 would be converted from the index set {13, ..., 20} into the token number set {3, 4}.

[...] syntactic head of the FE) that dominates exactly the annotated FE tokens. For various reasons (mostly parse errors), it is not always possible to find such a node.
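The two operations described above, converting character-based annotations to token numbers and looking for a node that dominates exactly the annotated tokens, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `chars_to_tokens` and `exact_head`, the dictionary-based representation of the parse, and the use of 0-based character indices (inferred from the {13, ..., 20} example) are assumptions made for illustration only.

```python
import re


def chars_to_tokens(sentence, annotated_chars):
    """Map annotated character indices (assumed 0-based) to 1-based token numbers.

    A token is selected only if all of its characters are annotated.
    """
    annotated = set(annotated_chars)
    token_numbers = set()
    # Split at whitespace while keeping character offsets of each token.
    for number, match in enumerate(re.finditer(r"\S+", sentence), start=1):
        token_chars = set(range(match.start(), match.end()))
        if token_chars <= annotated:
            token_numbers.add(number)
    return token_numbers


def exact_head(heads, fe_tokens):
    """Return a node whose descendants (including itself) are exactly fe_tokens.

    `heads` is a hypothetical representation mapping each token number to the
    token number of its head (0 for the root). Returns None if no node
    dominates exactly the annotated tokens.
    """
    fe_tokens = set(fe_tokens)
    children = {}
    for node, head in heads.items():
        children.setdefault(head, []).append(node)

    def dominated(node):
        # Iterative traversal with a visited set, so cyclic graphs terminate.
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(children.get(current, []))
        return seen

    for node in sorted(fe_tokens):
        if dominated(node) == fe_tokens:
            return node
    return None


sentence = "She wrinkled her nose in disapproval ."
# Head of each token as drawn in Figure 2 (the final period is not shown there
# and is left out here).
heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 2, 6: 5}

body_part = chars_to_tokens(sentence, range(13, 21))
print(body_part)                     # {3, 4}
print(exact_head(heads, body_part))  # 4
```

For the sentence in Figure 1, the character set {13, ..., 20} covers "her nose", which yields tokens 3 and 4, and node 4 ("nose") dominates exactly these tokens in the parse of Figure 2. For a token set such as {2, 3}, no node dominates exactly those tokens and the sketch returns `None`, which corresponds to the case in which no suitable node can be found.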
