Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference
James Pustejovsky, Adam Meyers, Martha Palmer, Massimo Poesio
Abstract

Many recent annotation efforts for English have focused on pieces of the larger problem of semantic annotation, rather than initially producing a single unified representation. This paper discusses the issues involved in merging four of these efforts into a unified linguistic structure: PropBank, NomBank, the Discourse Treebank and Coreference Annotation undertaken at the University of Essex. We discuss resolving overlapping and conflicting annotation as well as how the various annotation schemes can reinforce each other to produce a representation that is greater than the sum of its parts.

1. Introduction

The creation of the Penn Treebank (Marcus et al., 1993) and the word sense-annotated SEMCOR (Fellbaum, 1997) have shown how even limited amounts of annotated data can result in major improvements in complex natural language understanding systems. These annotated corpora have led to high-level improvements for parsing and word sense disambiguation (WSD), on the same scale as previously occurred for Part of Speech tagging by the annotation of the Brown corpus and, more recently, the British National Corpus (BNC) (Burnard, 2000). However, the creation of semantically annotated corpora has lagged dramatically behind the creation of other linguistic resources: in part due to the perceived cost, in part due to an assumed lack of theoretical agreement on basic semantic judgments, and in part due to the understandable unwillingness of research groups to get involved in such an undertaking. As a result, the need for such resources has become urgent.

Many recent annotation efforts for English have focused on pieces of the larger problem of semantic annotation, rather than producing a single unified representation like Head-driven Phrase Structure Grammar (Pollard and Sag, 1994) or the Prague Dependency Tectogrammatical Representation (Hajicova & Kucerova, 2002). PropBank (Palmer et al., 2005) annotates predicate argument structure anchored by verbs. NomBank (Meyers et al., 2004a) annotates predicate argument structure anchored by nouns. TimeBank (Pustejovsky et al., 2003) annotates the temporal features of propositions and the temporal relations between propositions. The Penn Discourse Treebank (Miltsakaki et al., 2004a/b) treats discourse connectives as predicates and the sentences being joined as arguments. Researchers at Essex were responsible for the coreference markup scheme developed in MATE (Poesio et al., 1999; Poesio, 2004a) and have annotated corpora using this scheme, including a subset of the Penn Treebank (Poesio and Vieira, 1998) and the GNOME corpus (Poesio, 2004a). This paper discusses the issues involved in creating a Unified Linguistic Annotation (ULA) by merging annotation of examples using the schemata from these efforts. Crucially, all individual annotations can be kept separate in order to make it easy to produce alternative annotations of a specific type of semantic information without need to modify the annotation at the other levels.

Embarking on separate annotation efforts has the advantage of allowing researchers to focus on the difficult issues in each area of semantic annotation and the disadvantage of inducing a certain amount of tunnel vision or task-centricity: annotators working on a narrow task tend to see all phenomena in light of the task they are working on, ignoring other factors. However, merging these annotation efforts allows these biases to be dealt with. The result, we believe, could be a more detailed semantic account than possible if the ULA had been the initial annotation effort rather than the result of merging.

There is a growing community consensus that general annotation, relying on linguistic cues, and in particular lexical cues, will produce an enduring resource that is useful, replicable and portable. We provide the beginnings of one such level derived from several distinct annotation efforts. This level could provide the foundation for a major advance in our ability to automatically extract salient relationships from text. This will in turn facilitate breakthroughs in message understanding, machine translation, fact retrieval, and information retrieval.

2. The Component Annotation Schemata

We describe below existing independent annotation efforts, each one of which is focused on a specific aspect of the semantic representation task: semantic role labeling, coreference, discourse relations, temporal relations, etc. They have reached a level of maturity that warrants a concerted attempt to merge them into a single, unified representation, ULA. There are several technical and theoretical issues that will need to be resolved in order to bring these different layers together seamlessly. Most of these approaches have annotated the same type of data, Wall Street Journal text, so it is also important to demonstrate that the annotation can be extended to other genres such as spoken language. The demonstration of success for the extensions would be the training of accurate statistical semantic taggers.

PropBank: The Penn Proposition Bank focuses on the argument structure of verbs, and provides a corpus annotated with semantic roles, including participants traditionally viewed as arguments and adjuncts. An important goal is to provide consistent semantic role labels across different syntactic realizations of the same verb, as in [ARG0 John] broke [ARG1 the window] and [ARG1 The window] broke. Arg0 and Arg1 are used rather than the more traditional Agent and Patient to keep the annotation as theory-neutral as possible, and to facilitate mapping to richer representations. The 1M word Penn Treebank II Wall Street Journal corpus has been successfully annotated with semantic argument structures for verbs and is now available via the Penn Linguistic Data Consortium as PropBank I (Palmer et al., 2005). Coarse-grained sense tags, based on groupings of WordNet senses, are being added, as well as links from the argument labels in the Frames Files to FrameNet frame elements. There are close parallels to other semantic role labeling projects, such as FrameNet (Baker et al., 1998; Fillmore & Atkins, 1998; Fillmore & Baker, 2001), Salsa (Ellsworth et al., 2004), Prague Tectogrammatics (Hajicova & Kucerova, 2002) and IAMTC (Helmreich et al., 2004).

NomBank: The NYU NomBank project can be considered part of the larger PropBank effort and is designed to provide argument structure for instances of about 5000 common nouns in the Penn Treebank II corpus (Meyers et al., 2004a). PropBank argument types and related verb Frames Files are used to provide a commonality of annotation. This enables the development of systems that can recognize regularizations of lexically and syntactically related sentence structures, whether they occur as verb phrases or noun phrases. For example, given an IE system tuned to a hiring scenario (MUC-6, 1995), NomBank and PropBank annotation facilitate generalization over patterns. PropBank and NomBank would both support a single IE pattern stating that the object (ARG1) of appoint is John and the subject (ARG0) is IBM, allowing a system to detect that IBM hired John from each of the following strings: IBM appointed John, John was appointed by IBM, IBM's appointment of John, the appointment of John by IBM and John is the current IBM appointee.

Coreference: Coreference involves the detection of subsequent mentions of invoked entities, as in George Bush,… he…. Researchers at Essex (UK) were responsible for the coreference markup scheme developed in MATE (Poesio et al., 1999; Poesio, 2004a), partially implemented in the annotation tool MMAX and now proposed as an ISO standard. They have also been responsible for the creation of two small but commonly used anaphorically annotated corpora: the Vieira / Poesio subset of the Penn Treebank (Poesio and Vieira, 1998) and the GNOME corpus (Poesio, 2004a). Parallel coreference annotation efforts funded by ACE have resulted in similar guidelines, exemplified by BBN's recent annotation of Named Entities, common nouns and pronouns. These two approaches provide a suitable springboard for an attempt at achieving a community consensus on coreference.

Discourse Treebank: The Penn Discourse Treebank (PDTB) (Miltsakaki et al., 2004a/b) is based on the idea that discourse connectives are predicates with associated argument structure (for details see Miltsakaki et al., 2004a; Miltsakaki et al., 2004b). The long-range goal is to develop a large scale and reliably annotated corpus that will encode coherence relations associated with discourse connectives, including their argument structure and anaphoric links, thus exposing a clearly defined level of discourse structure and supporting the extraction of a range of inferences associated with discourse connectives. This annotation references the Penn Treebank annotations as well as PropBank, and currently only considers Wall Street Journal text.

TimeBank: The Brandeis TimeBank corpus, funded by ARDA, focuses on the annotation of all major aspects in natural language text associated with temporal and event information (Day et al., 2003; Pustejovsky et al., 2004). Specifically, this involves three areas of the annotation: temporal expressions, event-denoting expressions, and the links that express either an anchoring of an event to a time or an ordering of

dependency between the argument (ARG1) of modal can and the or. Or is transparent in that