Issues in Synchronizing the English Treebank and PropBank

Olga Babko-Malaya (a), Ann Bies (a), Ann Taylor (b), Szuting Yi (a), Martha Palmer (c), Mitch Marcus (a), Seth Kulick (a) and Libin Shen (a)
(a) University of Pennsylvania, (b) University of York, (c) University of Colorado
{malayao,bies}@ldc.upenn.edu, {szuting,mitch,skulick,libin}@linc.cis.upenn.edu, [email protected], [email protected]

Abstract

The PropBank primarily adds semantic role labels to the syntactic constituents in the parsed trees of the Treebank. The goal is for automatic semantic role labeling to be able to use the domain of locality of a predicate in order to find its arguments. In principle, this is exactly what is wanted, but in practice the PropBank annotators often make choices that do not actually conform to the Treebank parses. As a result, the syntactic features extracted by automatic semantic role labeling systems are often inconsistent and contradictory. This paper discusses in detail the types of mismatches between the syntactic bracketing and the semantic role labeling that can be found, and our plans for reconciling them.

1 Introduction

The PropBank corpus annotates the entire Penn Treebank with predicate argument structures by adding semantic role labels to the syntactic constituents of the Penn Treebank.

Theoretically, it is straightforward for PropBank annotators to locate possible arguments based on the syntactic structure given by a parse tree, and mark the located constituent with its argument label. We would expect a one-to-one mapping between syntactic constituents and semantic arguments. However, in practice, PropBank annotators often make choices that do not actually conform to the Penn Treebank parses.

The discrepancies between the PropBank and the Penn Treebank obstruct the study of the syntax and semantics interface and pose an immediate problem to an automatic semantic role labeling system. A semantic role labeling system is trained on many syntactic features extracted from the parse trees, and the discrepancies make the training data inconsistent and contradictory. In this paper we discuss in detail the types of mismatches between the syntactic bracketing and the semantic role labeling that can be found, and our plans for reconciling them. We also investigate the sources of the disagreements, which types of disagreements can be resolved automatically, which types require manual adjudication, and for which types an agreement between syntactic and semantic representations cannot be reached.

1.1 Treebank

The Penn Treebank annotates text for syntactic structure, including syntactic argument structure and rough semantic information. Treebank annotation involves two tasks: part-of-speech tagging and syntactic annotation.

The first task is to provide a part-of-speech tag for every token (Marcus, et al. 1993; Santorini 1990). Particularly relevant for PropBank work, verbs in any form (active, passive, gerund, infinitive, etc.) are marked with a verbal part of speech (VBP, VBN, VBG, VB, etc.).

The syntactic annotation task consists of marking constituent boundaries, inserting empty categories (traces of movement, PRO, pro), showing the relationships between constituents (argument/adjunct structures), and specifying a particular subset of adverbial roles (Marcus, et al. 1994; Bies, et al. 1995).

Constituent boundaries are shown through syntactic node labels in the trees. In the simplest case, a node will contain an entire constituent, complete with any associated arguments or modifiers. However, in structures involving syntactic movement, sub-constituents may be displaced. In these cases, Treebank annotation represents the original position with a trace and shows the relationship as coindexing. In (1) below, for example, the direct object of entail is shown with the trace *T*, which is coindexed to the WHNP node of the question word what.

(1) (SBARQ (WHNP-1 (WP What))
           (SQ (VBZ does)
               (NP-SBJ (JJ industrial)
                       (NN emigration))
               (VP (VB entail)
                   (NP *T*-1)))
           (. ?))

In (2), the relative clause modifying a journalist has been separated from that NP by the prepositional phrase to al Riyadh, which is an argument of the verb sent. The position where the relative clause originated or "belongs" is shown by the trace *ICH*, which is coindexed to the SBAR node containing the relative clause constituent.

(2) (S (NP-SBJ You)
       (VP sent
           (NP (NP a journalist)
               (SBAR *ICH*-2))
           (PP-DIR to
                   (NP al Riyadh))
           (SBAR-2
               (WHNP-3 who)
               (S (NP-SBJ *T*-3)
                  (VP served
                      (NP (NP the name)
                          (PP of
                              (NP Lebanon)))
                      (ADVP-MNR magnificently))))))

Empty subjects which are not traces of movement, such as PRO and pro, are shown as * (see the null subject of the infinitival clause in (4) below). These null subjects are coindexed with a governing NP if the syntax allows. The null subject of an infinitive clause complement to a noun is, however, not coindexed with another node in the tree in the syntax. This coindexing is shown as a semantic coindexing in the PropBank annotation.

The distinction between syntactic arguments and adjuncts of the verb or verb phrase is made through the use of functional dashtags rather than with a structural difference. Both arguments and adjuncts are children of the VP node. No distinction is made between VP-level modification and S-level modification. All constituents that appear before the verb are children of S and sisters of VP; all constituents that appear after the verb are children of VP.

Syntactic arguments of the verb are NP-SBJ, NP (no dashtag), SBAR (either -NOM-SBJ or no dashtag), S (either -NOM-SBJ or no dashtag), -DTV, -CLR (closely/clearly related), and -DIR with directional verbs.

Adjuncts or modifiers of the verb or sentence are any constituent with any other adverbial dashtag, PP (no dashtag), and ADVP (no dashtag). Adverbial constituents are marked with a more specific functional dashtag if they belong to one of the more specific types in the annotation system (temporal -TMP, locative -LOC, manner -MNR, purpose -PRP, etc.).

Inside NPs, the argument/adjunct distinction is shown structurally. Argument constituents (S and SBAR only) are children of NP, sister to the head noun. Adjunct constituents are sister to the NP that contains the head noun, child of the NP that contains both:

    (NP (NP head)
        (PP adjunct))

1.2 PropBank

PropBank is an annotation of predicate-argument structures on top of syntactically parsed, or Treebanked, structures (Palmer, et al. 2005; Babko-Malaya, 2005). More specifically, PropBank annotation involves three tasks: argument labeling, annotation of modifiers, and creating co-reference chains for empty categories.

The first goal is to provide consistent argument labels across different syntactic realizations of the same verb, as in

(3) [ARG0 John] broke [ARG1 the window]
    [ARG1 The window] broke.

As this example shows, semantic arguments are tagged with numbered argument labels, such as Arg0, Arg1, Arg2, where these labels are defined on a verb-by-verb basis.

The second task of the PropBank annotation involves assigning functional tags to all modifiers of the verb, such as MNR (manner), LOC (locative), TMP (temporal), DIS (discourse connectives), PRP (purpose), DIR (direction), and others.

And, finally, PropBank annotation involves finding antecedents for 'empty' arguments of the verbs, as in (4). The subject of the verb leave in this example is represented as an empty category [*] in Treebank. In PropBank, all empty categories which could be co-referred with an NP within the same sentence are linked in 'co-reference' chains:

(4) I made a decision [*] to leave

    Rel: leave
    Arg0: [*] -> I

As the following sections show, all three tasks of PropBank annotation result in structures which differ in certain respects from the corresponding Treebank structures. Section 2 presents our approach to reconciling the differences between Treebank and PropBank with respect to the third task, which links empty categories with their antecedents. Section 3 introduces mismatches between syntactic constituency in Treebank and PropBank. Mismatches between modifier labels are not addressed in this paper and are left for future work.

2 Coreference and syntactic chains

PropBank chains include all syntactic chains (represented in the Treebank) plus other cases of nominal semantic coreference, including those in which the coreferring NP is not a syntactic antecedent. For example, according to PropBank guidelines, if a trace is coindexed with an NP in Treebank, then the chain should be reconstructed:

(5) What-1 do you like [*T*-1]?

    Original PropBank annotation:
    Rel: like
    Arg0: you
    Arg1: [*T*] -> What

Such chains usually include traces of A and A' movement and PRO for subject and object control. On the other hand, not all instances of PROs have syntactic antecedents. As the following example illustrates, subjects of infinitival verbs and

[...]

(7) What-1 do you like [*T*-1]?

    Revised PropBank annotation:
    Rel: like
    Arg0: you
    Arg1: [*T*]

As the second stage, syntactic chains will be reconstructed automatically, based on the coindexation provided by Treebank (note that the trace is coindexed with the NP What in (7)). And, finally, coreference annotation will be done on top of the resulting resource, with the goal of finding antecedents for the remaining empty categories, including empty subjects of infinitival verbs and gerunds.

One of the advantages of this approach is that it allows us to distinguish different types of chains, such as syntactic chains (i.e., chains which are derived as the result of syntactic movement, or control coreference), direct coreference chains (as illustrated by the example in (6)), and semantic type links for other 'indirect' types of links between an empty category and its antecedent.

Syntactic chains are annotated in Treebank, and are reconstructed automatically in PropBank. The annotation of direct coreference chains is done manually on top of Treebank, and is restricted to empty categories that are not coindexed with any NP in Treebank.
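
The automatic reconstruction of syntactic chains from Treebank coindexation, as described above, can be illustrated with a minimal sketch. It is not the procedure actually used for PropBank; it only assumes that a coindexed constituent carries its index as a trailing "-N" on the node label and that an empty category appears as a terminal such as *T*-1, *ICH*-2 or *-1. The names parse, leaves and syntactic_chains are hypothetical and chosen for illustration.

```python
import re

# Assumed conventions (illustrative, not taken from any released tool):
# a constituent index is a trailing "-N" on the label; an empty category
# is a terminal of the form *T*-N, *ICH*-N or *-N.
INDEX = re.compile(r"-(\d+)$")
TRACE = re.compile(r"^\*(?:[A-Z]+\*)?-(\d+)$")


def parse(bracketed):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a label or a terminal
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the overt words under a node, skipping empty categories."""
    if isinstance(node, str):
        return [] if node.startswith("*") else [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def syntactic_chains(tree):
    """Link each coindexed empty category to the constituent with its index."""
    antecedents, traces = {}, []

    def walk(node):
        if isinstance(node, str):
            m = TRACE.match(node)
            if m:
                traces.append((node, m.group(1)))
            return
        m = INDEX.search(node[0])
        if m:
            antecedents[m.group(1)] = node
        for child in node[1:]:
            walk(child)

    walk(tree)
    return [(t, antecedents[i][0], " ".join(leaves(antecedents[i])))
            for t, i in traces if i in antecedents]


EXAMPLE_1 = """(SBARQ (WHNP-1 (WP What))
  (SQ (VBZ does) (NP-SBJ (JJ industrial) (NN emigration))
      (VP (VB entail) (NP *T*-1)))
  (. ?))"""

print(syntactic_chains(parse(EXAMPLE_1)))
# [('*T*-1', 'WHNP-1', 'What')]
```

Empty categories without an index, such as the [*] subject in (4), are left unlinked by this sketch, which matches the division of labor described above: they remain for the manual coreference annotation.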
