English Propbank Annotation Guidelines

English PropBank Annotation Guidelines Claire Bonial [email protected] Julia Bonn [email protected] Kathryn Conger [email protected] Jena Hwang [email protected] Martha Palmer [email protected] Nicholas Reese [email protected] Center for Computational Language and Education Research Institute of Cognitive Science University of Colorado at Boulder Based on original PropBank guidelines by Olga Babko-Malaya July 14, 2015 Contents 1 Verb Annotation Instructions3 1.1 PropBank Annotation Goals..............................3 1.2 Sense Annotation....................................4 1.2.1 Frame Files...................................4 1.2.2 ER: Error roleset................................6 1.2.3 IE: Idiomatic Expression roleset.......................7 1.2.4 DP: Duplicate indicator roleset........................7 1.2.5 LV: Light Verb roleset.............................8 1.2.6 What to do when there is no frame file....................8 1.3 Annotation of Numbered Arguments.........................8 1.3.1 Choosing ARG0 versus ARG1.........................8 1.3.2 ARGA: Secondary Agent........................... 10 1.4 Annotation of Modifiers................................ 10 1.4.1 Comitatives (COM).............................. 11 1.4.2 Locatives (LOC)................................ 12 1.4.3 Directional (DIR)............................... 12 1.4.4 Goal (GOL)................................... 12 1.4.5 Manner (MNR)................................. 13 1.4.6 Temporal (TMP)................................ 14 1.4.7 Extent (EXT)................................. 14 1.4.8 Reciprocals (REC)............................... 15 1.4.9 Secondary Predication (PRD)......................... 16 1.4.10 Purpose Clauses (PRP)............................ 16 1.4.11 Cause Clauses (CAU)............................. 17 1.4.12 Discourse (DIS)................................. 17 1.4.13 Modals (MOD)................................. 19 1.4.14 Negation (NEG)................................ 19 1.4.15 Direct Speech (DSP).............................. 19 1.4.16 Adverbials (ADV)............................... 21 1.4.17 Adjectival (ADJ)................................ 21 1.4.18 Light Verb (LVB)............................... 21 1.4.19 Construction (CXN).............................. 22 1.5 Span of Annotation................................... 22 1.6 Where to Place Tags.................................. 26 1.6.1 Exceptions to Normal Tag Placement.................... 26 1.7 Understanding and Annotating Null Elements in the Penn TreeBank....... 27 1.7.1 Passive Sentences................................ 28 1.7.2 Fronted and Dislocated Arguments...................... 29 1.7.3 Questions and Wh-Phrases.......................... 30 1.7.4 Interpret Constituent Here (ICH) Traces................... 31 1 1.7.5 Right Node Raising (RNR) Traces...................... 32 1.7.6 It EXtraPosition (EXP)............................ 35 1.7.7 Other Traces.................................. 37 1.8 Linking and Annotation of Null Elements...................... 38 1.8.1 Relative Clause Annotation.......................... 38 1.8.2 Reduced Relative Annotation......................... 38 1.8.3 Concatenation of multiple nodes into one argument............. 41 1.8.4 Special cases of topicalization......................... 42 1.9 Special Cases: small clauses and sentential complements.............. 43 1.10 Handling common features of spoken data...................... 45 1.10.1 Disfluencies and Edited Nodes........................ 45 1.10.2 Asides: PRN nodes............................... 46 2 Light Verb Annotation 47 2.1 Pass 1: Verb Pass.................................... 47 2.2 Pass 2: Noun Pass................................... 50 2.3 Examples........................................ 50 2.4 Tricky Cases...................................... 52 3 Noun Annotation Instructions 53 3.1 Span of Annotation................................... 53 3.2 Annotation of Numbered Arguments......................... 55 3.3 Annotation of Modifiers................................ 55 3.3.1 Adjectival modifiers (ADJ).......................... 56 3.3.2 Secondary Predication modifiers (PRD)................... 56 3.4 Exceptions to Annotation: Determiners and Other Noun Relations........ 57 3.5 .YY Roleset....................................... 57 4 Adjectival Predicate Annotation Instructions 59 4.1 Span of Annotation................................... 59 4.2 Annotation of Arguments............................... 59 4.3 Annotating Constructions............................... 59 4.3.1 Comparative Constructions.......................... 60 4.3.2 Degree Construction.............................. 61 4.3.3 The Xer the Yer Construction......................... 62 A Jubilee Hotkeys 63 References 65 2 Chapter 1 Verb Annotation Instructions 1.1 PropBank Annotation Goals PropBank is a corpus in which the arguments of each predicate are annotated with their semantic roles in relation to the predicate (Palmer et al., 2005). Currently, all the PropBank annotations are done on top of the phrase structure annotation of the Penn TreeBank (Marcus et al., 1993). In addition to semantic role annotation, PropBank annotation requires the choice of a sense id (also known as a frameset or roleset id) for each predicate. Thus, for each verb in every tree (representing the phrase structure of the corresponding sentence), we create a PropBank instance that consists of the sense id of the predicate (e.g. run.02) and its arguments labeled with semantic roles. An important goal is to provide consistent argument labels across different syntactic realizations of the same verb, as in. [John]ARG0 broke [the window]ARG1 [The window]ARG1 broke As this example shows, the arguments of the verbs are labeled as numbered arguments: ARG0, ARG1, ARG2, and so on. The argument structure of each predicate is outlined in the PropBank frame file for that predicate. The frame file gives both semantic and syntactic information about each sense of the predicate lemmas that have been encountered thus far in PropBank annotation. The frame file also denotes the correspondences between numbered arguments and semantic roles, as this is somewhat unique for each predicate. Numbered arguments reflect either the arguments that are required for the valency of a predicate (e.g., agent, patient, benefactive), or if not required, those that occur with high-frequency in actual usage. Although numbered arguments correspond to slightly different semantic roles given the usage of each predicate, in general numbered arguments correspond to the following semantic roles: ARG0 agent ARG3 starting point, benefactive, attribute ARG1 patient ARG4 ending point ARG2 instrument, benefactive, attribute ARGM modifier Table 1.1: List of arguments in PropBank In addition to numbered arguments, another task of PropBank annotation involves assigning functional tags to all modifiers of the verb, such as manner (MNR), locative (LOC), temporal (TMP) and others: Mr. Bush met him privately, in the White House, on Thursday. REL: met 3 ARG0: Mr. Bush ARG1: him ARGM-MNR: privately ARGM-LOC: in the White House ARGM-TMP: on Thursday And, finally, PropBank annotation involves finding antecedents for empty arguments of the verbs, as illustrated below: I made a decision [*PRO*] to leave. The subject of the verb leave in this example is represented as an empty category [*] in TreeBank. In PropBank, all empty categories which could be co-referred with a NP within the same sentence are linked in co-reference chains: REL: leave ARG0: [*PRO*] * [I] Similarly, relativizers and their referents are linked in relative clause constructions, and traces are linked to their referents in reduced relative constructions. While these links were at one time created manually by the annotators, they are now added automatically in post-processing. The annotation of this information creates a valuable corpus, which can be used as training data for a variety of natural language processing applications. Training data, essentially, is what computer scientists and computational linguists can use to `teach the computer' about different aspects of human language. Once this information is processed, it can guide future decisions on how to categorize and/or label different features in novel utterances outside of the PropBank corpus. Parallel PropBank corpora currently exist or are underway for English, Chinese, Arabic and Hindi. As a whole, the PropBank corpus has the potential to assist in natural language processing applications such as machine translation, text editing, text summary and evaluation as well as question answering. Thus, the main tasks of PropBank annotation are: argument labeling, annotation of modifiers, choosing a sense for the predicate, and creating links for empty categories, relative clauses, and reduced relatives. Each of these aspects of annotation are discussed in detail below. Although some detail is provided in each section on how to annotate appropriately using the annotation tool, Jubilee, complete guidelines on the use of this tool are provided in the Jubilee technical report: Jubilee: Propbank Instance Editor Guideline (Version 2.1) (Choi et al., 2009). 1.2 Sense Annotation 1.2.1 Frame Files The argument labels for each predicate are specified in the frame files, which are available at http://verbs.colorado.edu/propbank/framesets-english-aliases/ and are also dis- played in the frameset view of the annotation tool, Jubilee (see Jubilee technical report (Choi et al., 2009) for further information).

English Propbank Annotation Guidelines

Abstraction and Generalisation in Semantic Role Labels: Propbank

A Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling

How Do BERT Embeddings Organize Linguistic Knowledge?

Treebanks, Linguistic Theories and Applications Introduction to Treebanks

An Arabic Wordnet with Ontologically Clean Content

Verbnet Based Citation Sentiment Class Assignment Using Machine Learning

Senserelate::Allwords - a Broad Coverage Word Sense Tagger That Maximizes Semantic Relatedness

Deep Linguistic Analysis for the Accurate Identification of Predicate

The Procedure of Lexico-Semantic Annotation of Składnica Treebank

Generating Lexical Representations of Frames Using Lexical Substitution

Unified Language Model Pre-Training for Natural

Towards a Cross-Linguistic Verbnet-Style Lexicon for Brazilian Portuguese