
Exploiting a Verb Lexicon in Automatic Semantic Role Labelling

Robert S. Swier and Suzanne Stevenson
Department of Computer Science, University of Toronto
Toronto, Ontario, Canada M5S 3G4
{swier,suzanne}@cs.toronto.edu

Abstract

We develop an unsupervised semantic role labelling system that relies on the direct application of information in a predicate lexicon combined with a simple probability model. We demonstrate the usefulness of predicate lexicons for role labelling, as well as the feasibility of modifying an existing role-labelled corpus for evaluating a different set of semantic roles. We achieve a substantial improvement over an informed baseline.

1 Introduction

Intelligent technologies capable of full semantic interpretation of domain-general text remain an elusive goal. However, statistical advances have made it possible to address core pieces of the problem. Recent years have seen a wealth of research on one important component of semantic interpretation—automatic role labelling (e.g., Gildea and Jurafsky, 2002; Pradhan et al., 2004; Hacioglu et al., 2004, and additional papers from Carreras and Marquez, 2004). Such work aims to annotate each constituent in a clause with a semantic tag indicating the role that the constituent plays with respect to the target predicate, as in (1):

(1) [Yuka]Agent [whispered]Pred to [Dar]Recipient

Semantic role labelling systems address a crucial first step in the automatic extraction of semantic relations from domain-general text, taking us closer to the goal of comprehensive semantic mark-up.

Most work thus far on domain-general role labelling depends on supervised learning over statistical features extracted from a hand-labelled corpus. The reliance on such a resource—one in which the arguments of each predicate are manually identified and assigned a semantic role—limits the portability of such methods to other languages or even to other genres of corpora.

In this study, we explore the possibility of using a verb lexicon, rather than a hand-labelled corpus, as the primary resource in the semantic role labelling task. Perhaps because of the focus on what can be gleaned from labelled data, existing supervised approaches have made little use of the additional knowledge available in the predicate lexicon associated with the labelled corpus. By contrast, we exploit the explicit knowledge of the role assignment possibilities for each verb within an existing lexicon. Moreover, we utilize a very simple probability model within a highly efficient algorithm.

We use VerbNet (Kipper et al., 2000), a computational lexicon which lists the possible semantic role assignments for each of its verbs. Our algorithm extracts automatically parsed arguments from a corpus, and assigns to each a list of the compatible roles according to VerbNet. Arguments which are given only a single role possibility are considered to have been assigned an unambiguous role label. This set of arguments constitutes our primary-labelled data, which serves as the noisy training data for a simple probability model which is then used to label the remaining (role-ambiguous) arguments.

This method has several advantages, the foremost of which is that it eliminates the dependence on a role-labelled corpus, a very expensive resource to produce. Of course, a verb lexicon is also an expensive resource, but one that is highly reusable across a range of NLP tasks. Moreover, the approach points at some potentially useful information that current supervised methods have failed to exploit.

Even if one has access to an annotated corpus for training, our work shows that directly calling on additional information from the lexicon itself may prove useful in restricting the possible labels for an argument.

The method has disadvantages as well. The information available in a predicate lexicon is less directly applicable to building a learning model. Inevitably, our results are noisier than in a supervised approach which has access to a labelled sample of what it must produce. Still, the method shows promise: on unseen test data, the system yields an F-measure of .83 on labelling of correctly extracted arguments, compared to an informed baseline of .74, and an F-measure of .65 (compared to .52) on the overall identification and labelling task. The latter is well below the best supervised performance of about .80 on similar tasks, but it must be emphasized that it is achieved with a simple probability model and without the use of hand-labelled data. We view this as a starting point by which to demonstrate the utility of deriving more explicit knowledge from a predicate lexicon, which can be later extended through the use of additional probabilistic features.

We face a methodological challenge arising from the particular choice of VerbNet for the prototyping of our method: the lexicon has no associated semantic role labelled corpus. While this underscores the need for approaches which do not rely on such a resource, it also means that we lack a labelled sample of data against which to evaluate our results. To address this, we use the existing labelled corpus of FrameNet (Baker et al., 1998), and develop a mapping for converting the FrameNet roles to corresponding VerbNet roles. Our mapping method demonstrates the possibility of leveraging existing resources to support the development of role labelling systems based on verb lexicons that do not have an associated hand-labelled corpus.

2 VerbNet Roles and the Role Mapping

Before describing our labelling algorithm, we first briefly introduce the semantic role information available in VerbNet, and describe how we map FrameNet roles to VerbNet roles.

2.1 The VerbNet Lexicon

VerbNet is a manually developed hierarchical lexicon based on the verb classification of Levin (1993). For each of almost 200 classes containing a total of 3000 verbs, VerbNet specifies the syntactic frames along with the semantic role assigned to each argument position of a frame.[1] Figure 1 shows an example VerbNet entry:

    whisper
    Frames:
        Agent V
        Agent V Prep(+dest) Recipient
        Agent V Topic
    Verbs in same (sub)class:
        [bark, croon, drone, grunt, holler, ...]

    Figure 1: A portion of a VerbNet entry.

The thematic roles used in VerbNet are more general than the situation-specific roles of FrameNet. For example, the roles Speaker, Message, and Addressee of a Communication verb such as whisper in FrameNet would be termed Agent, Topic, and Recipient in VerbNet. These coarser-grained roles are often assumed in linguistic theory, and have some advantages in terms of capturing commonalities of argument relations across a wide range of predicates.

[1] Throughout the paper we use the term "frame" to refer to a syntactic frame—a configuration of syntactic arguments of a verb—possibly labelled with roles, as in Figure 1.
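To make the structure of Figure 1 concrete, the following is a minimal sketch (ours, in Python; VerbNet itself is a lexicon, not code) of how such an entry might be represented. The names VNFrame and VNClass are illustrative choices, not part of VerbNet.

```python
from dataclasses import dataclass, field

@dataclass
class VNFrame:
    """One syntactic frame: an ordered list of (slot, role) pairs.

    A slot is 'SUBJ', 'OBJ', 'IOBJ', or a PP slot; PP slots carry a
    preposition list or a semantic feature such as '+dest'.
    """
    slots: list  # e.g. [('SUBJ', 'Agent'), ('PP(+dest)', 'Recipient')]

@dataclass
class VNClass:
    """A VerbNet (sub)class: member verbs sharing a set of frames."""
    members: set = field(default_factory=set)
    frames: list = field(default_factory=list)

# The entry of Figure 1, transcribed by hand for illustration.
whisper_class = VNClass(
    members={'whisper', 'bark', 'croon', 'drone', 'grunt', 'holler'},
    frames=[
        VNFrame([('SUBJ', 'Agent')]),
        VNFrame([('SUBJ', 'Agent'), ('PP(+dest)', 'Recipient')]),
        VNFrame([('SUBJ', 'Agent'), ('OBJ', 'Topic')]),
    ],
)
```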
2.2 Mapping FrameNet to VerbNet Roles

As noted, VerbNet lacks a corpus of example role assignments against which to evaluate a role labelling based upon it. We create such a resource by adapting the existing FrameNet corpus. We formulate a mapping between FrameNet's larger role set and VerbNet's much smaller one, and create a new corpus with our mapped roles substituted for the original roles in the FrameNet corpus.

We perform the mapping in three steps. First, we use an existing mapping between the semantically-specific roles in FrameNet and a much smaller intermediate set of 39 semantic roles which subsume all FrameNet roles.[2] The associations in this mapping are straightforward—e.g., the Place role for Abusing verbs and the Area role for Operate-vehicle verbs are both mapped to Location.

[2] This mapping was provided by Roxana Girju, UIUC.

Second, from this intermediate set we create a simple mapping to the set of 22 VerbNet roles. Some roles are unaffected by the mapping (e.g., Cause alone in the intermediate set maps to Cause in the VerbNet set). Other roles are merged (e.g., Degree and Measure both map to Amount). Moreover, some roles in FrameNet (and the intermediate set) must be mapped to more than one VerbNet role. For example, an Experiencer role in FrameNet is considered Experiencer by some VerbNet classes, but Agent by others. In such cases, our mappings in this step must be specific to the VerbNet class.

In this second step, some roles have no subsuming VerbNet role, because FrameNet provides roles for a wider variety of relations. For example, both FrameNet and the intermediate role set contain a Manner role, which VerbNet does not have. We create a catch-all label, "NoRole," to which we map eight such intermediate roles: Condition, Manner, Means, Medium, Part-Whole, Property, Purpose, and Result. These phrases labelled NoRole are adjuncts—constituents not labelled by VerbNet.

In the third step of our mapping, some of the roles in VerbNet—such as Theme and Topic, or Asset and Amount—which appear to be too fine-grained for us to distinguish reliably, are mapped to a more coarse-grained set of VerbNet roles. The final set consists of 16 roles: Agent, Amount, Attribute, Beneficiary, Cause, Destination, Experiencer, Instrument, Location, Material, Predicate, Recipient, Source, Stimulus, Theme, and Time; plus the NoRole label.
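The three-step mapping can be pictured as table lookups with class-specific overrides. The sketch below is our own schematic rendering, not the authors' code; the table entries are limited to the examples named above, and the VerbNet class identifiers are hypothetical.

```python
# Step 1: FrameNet (frame, role) -> intermediate role (39 roles).
FN_TO_INTER = {
    ('Abusing', 'Place'): 'Location',
    ('Operate-vehicle', 'Area'): 'Location',
}

# Step 2: intermediate role -> VerbNet role (22 roles); eight
# intermediate roles have no subsuming VerbNet role and become NoRole.
NO_ROLE = {'Condition', 'Manner', 'Means', 'Medium',
           'Part-Whole', 'Property', 'Purpose', 'Result'}
INTER_TO_VN = {'Cause': 'Cause', 'Degree': 'Amount', 'Measure': 'Amount'}
# Some mappings depend on the VerbNet class (illustrative class ids).
CLASS_SPECIFIC = {
    ('Experiencer', 'admire-31.2'): 'Experiencer',
    ('Experiencer', 'amuse-31.1'): 'Agent',
}

# Step 3: collapse VerbNet roles we cannot distinguish reliably,
# yielding the final set of 16 roles plus NoRole.
COARSEN = {'Topic': 'Theme', 'Asset': 'Amount'}

def map_role(fn_frame, fn_role, vn_class):
    inter = FN_TO_INTER.get((fn_frame, fn_role), fn_role)
    if inter in NO_ROLE:
        return 'NoRole'
    vn = CLASS_SPECIFIC.get((inter, vn_class),
                            INTER_TO_VN.get(inter, inter))
    return COARSEN.get(vn, vn)
```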
3 The Frame Matching Process

A main goal of our system is to demonstrate the usefulness of predicate lexicons for the role labelling task. The primary way that we apply the knowledge in our lexicon is via a process we call frame matching, adapted from Swier and Stevenson (2004). The automatic frame matcher aligns arguments extracted from an automatically parsed sentence with the frames in VerbNet for the target verb in the sentence. The output of this process is a highly constrained set of candidate roles (possibly of size one) for each potential argument. The resulting singleton sets constitute a (noisy) role assignment for their corresponding arguments, forming our primary-labelled data. This data is then used to train a probability model, described in Section 4, which we employ to label the remaining arguments (those having more than one candidate role).

3.1 Initialization of Candidate Roles

The frame matcher construes extracted arguments from the parsed sentence as being in one of the four main types of syntactic positions (or slots) used by VerbNet frames: subject, object, indirect object, and PP-object.[3] Additionally, we specialize the latter by the individual preposition, such as "object of for." For the first three slot types, alignment between the extracted arguments and the frames is relatively straightforward. An extracted subject would be aligned with the subject position in a VerbNet frame, for instance, and the subject role from the frame would be listed as a possible label for the extracted subject.

The alignment of PP-objects is similar to that of the other slot types, except that we add an additional constraint that the associated prepositions must match. For PP-object slots, VerbNet frames often provide an explicit list of allowable prepositions. Alternatively, the frame may specify a required semantic feature such as +path or +loc. In order for an extracted PP-object to align with one of these frame slots, its associated preposition must be included in the list provided by the frame, or must have the specified feature. To determine the latter, we manually create lists of prepositions that we judge to have each of the possible semantic features.

In general, this matching procedure assumes that frames describing a syntactic argument structure similar to that of the parsed sentence are more likely to correctly describe the semantic roles of the extracted arguments. Thus, the frame matcher only chooses roles from frames that are the best syntactic matches with the extracted argument set. This is achieved by adopting the scoring method of Swier and Stevenson (2004), in which we compute the portion %Frame of frame slots that can be mapped to an extracted argument, and the portion %Sent of extracted arguments from the sentence that can be mapped to the frame. The score for each frame is given by %Frame + %Sent, and only frames having the highest score contribute candidate roles to the extracted arguments; a sketch of this computation follows.

[3] Since VerbNet has very few verbs with sentential complements, we do not consider them for now.
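Here is a minimal sketch of that computation, under our own representation assumptions: a frame is a list of (slot, role) pairs, an extracted argument set is a list of slot descriptors, and non-PP slots match by type alone. The preposition lists are illustrative, not the paper's hand-built ones.

```python
# Illustrative preposition lists for two semantic features; the
# paper's actual lists were created by hand.
PREP_FEATURES = {'+dest': {'to', 'into', 'onto'},
                 '+loc': {'in', 'on', 'at'}}

def slot_matches(frame_slot, extr_slot):
    """Slots are (kind, spec) with kind in {'SUBJ','OBJ','IOBJ','PP'}.
    For frame PP slots, spec is a feature name or a set of allowed
    prepositions; for extracted PP slots, spec is the preposition."""
    (fkind, fspec), (ekind, espec) = frame_slot, extr_slot
    if fkind != ekind:
        return False
    if fkind != 'PP':
        return True
    if isinstance(fspec, str):                # feature such as '+dest'
        return espec in PREP_FEATURES.get(fspec, set())
    return espec in fspec                     # explicit preposition list

def score(frame, extracted):
    """%Frame + %Sent for one frame (cf. Table 1); assumes both
    the frame and the extracted argument set are non-empty."""
    hit_f = sum(any(slot_matches(fs, e) for e in extracted)
                for fs, _ in frame)
    hit_s = sum(any(slot_matches(fs, e) for fs, _ in frame)
                for e in extracted)
    return 100.0 * hit_f / len(frame) + 100.0 * hit_s / len(extracted)

def candidate_roles(frames, extracted):
    """Collect roles only from the frames tied for the highest score."""
    best = max(score(f, extracted) for f in frames)
    cands = {e: set() for e in extracted}
    for frame in (f for f in frames if score(f, extracted) == best):
        for fslot, role in frame:
            for e in extracted:
                if slot_matches(fslot, e):
                    cands[e].add(role)
    return cands
```

If the four frames of Table 1 are encoded this way (treating Recipient as an indirect-object slot, our assumption), the function reproduces the scores in the table and leaves {Agent, Instrument} on the subject and {Theme} on the object.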

                  Extracted slots: SUBJ, OBJ

    Possible Frames for Verb V     %Frame   %Sent   Score
    Agent V                           100      50     150
    Agent V Theme                     100     100     200
    Instrument V Theme                100     100     200
    Agent V Recipient Theme            67     100     167

    Table 1: An example of frame matching.

An example scoring is shown in Table 1. Note that two of the frames are tied for the highest score of 200, resulting in two possible roles for the subject (Agent and Instrument), and Theme as the only possible role for the object.

As mentioned, this frame matching step is very restrictive, and it greatly reduces role ambiguity. Many potential arguments receive only a single candidate role, providing the primary-labelled data we use to train our probability model. Some slots receive no candidate roles, which is an error for argument slots but which is correct for adjuncts. The reduction of candidate roles in general is very helpful in lightening the subsequent load on the probability model to be applied next, but note that it may also cause the correct role to be omitted. We experiment with choosing roles from the frames that are the best syntactic matches, and from all possible frames.

3.2 Adjustments to the Role Mapping

We further extend the frame matcher, which has extensive knowledge of VerbNet, for the separate task of helping to eliminate some of the inconsistencies that are introduced by our role mapping procedure. This is a process that applies concurrently with the initialization of candidate roles described above, but only affects the gold standard labelling of evaluation data.[4]

For instance, FrameNet assigns the role Side2 to the object of the preposition with occurring with the verb brawl. Side2 is mapped to Theme by our role mapping; however, in VerbNet, brawl does not accept Theme as the object of with. Our mapping thus creates a target (i.e., gold standard) label in the evaluation data that is inconsistent with VerbNet. Since there is no possibility of the role labeller assigning a label that matches such a target, this unfairly raises the task difficulty. However, since brawl does accept Theme in another slot, it is not an option to entirely eliminate this role in the mapping for the verb. Instead, we use our frame matcher to verify that each target role generated by our mapping from FrameNet is allowed by VerbNet in the relevant slot. If the target role is not allowed, then it is converted to NoRole in the evaluation set. Constituents labelled as NoRole are not considered target arguments, and it is correct for the system to not assign labels in these cases.

The NoRole conversions help to ensure that our gold standard evaluation data is consistent with our lexicon, but the method does have limitations. For instance, some of the arguments which the system fails to extract might have had their target role changed to NoRole if they were properly extracted. Additionally, in some cases a target role is converted to NoRole when there is an actual role that VerbNet would have assigned instead.

4 The Probability Model

Once argument slots are initialized with sets of possible roles, the algorithm uses a probability model to label slots having two or more possibilities. Since our primary goal is to demonstrate how much can be accomplished through the frame matcher, we compare a number of very simple probability models:

  • P(r|v,s): the probability of a role given the target verb and the slot; the latter includes subject, object, indirect object, and prepositional object, where each PP slot is specialized by the identity of the preposition;

  • P(r|s): the probability of a role given the slot;

  • P(r|sc): the probability of a role given the slot class, in which all prepositional slots are treated together.

[4] Of course, the fact that the frame matcher "sees" the evaluation set as part of its dual duties is not allowed to influence its assignment of candidate roles.

Each probability model predicts a role given certain conditioning information, with maximum likelihood estimates determined by the primary-labelled data directly resulting from the frame matching step.[5] We also compare one non-probabilistic model to resolve the same set of ambiguous cases:

  • Default assignment: candidate roles for ambiguous slots are ignored; the four slot classes of subject, object, indirect object, and PP-object are assigned the roles Agent, Theme, Recipient, and Location, respectively.

These are the most likely roles assigned by the frame matcher over our development data.

For comparison, we also apply the iterative algorithm developed by Swier and Stevenson (2004), using the same bootstrapping parameters. That method uses backoff over three levels of specificity of probabilities.
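As a minimal sketch (ours, not the original implementation) of how the simplest model can be trained and applied: maximum likelihood counts are taken over the primary-labelled slots, and the model is consulted only to choose among the frame matcher's surviving candidates.

```python
from collections import Counter, defaultdict

def slot_class(slot):
    """Collapse every preposition-specialized PP slot (e.g. 'PP:for')
    into a single class, as required for P(r|sc)."""
    return 'PP' if slot.startswith('PP:') else slot

def train_p_r_sc(primary_labelled):
    """MLE counts for P(r|sc) from (verb, slot, role) triples output
    by the frame matcher as primary-labelled data."""
    counts = defaultdict(Counter)
    for _verb, slot, role in primary_labelled:
        counts[slot_class(slot)][role] += 1
    return counts

def resolve(counts, slot, candidates):
    """Label an ambiguous slot with its most probable candidate role;
    the normalizing constant cancels in the argmax."""
    dist = counts[slot_class(slot)]
    return max(candidates, key=lambda role: dist[role])
```

P(r|s) and P(r|v,s) differ only in the conditioning key (the raw slot, or the verb-slot pair), and the default-assignment variant replaces resolve with a fixed table {SUBJ: Agent, OBJ: Theme, IOBJ: Recipient, PP: Location}.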
5 Materials and Methods

5.1 The Target Verbs

For ease of comparison, we use the same verbs as in Swier and Stevenson (2004), except that we measure performance over a much larger superset of verbs. In that work, a core set of 54 target verbs are selected to represent a variety of classes with interesting role ambiguities, and the system is evaluated against only those verbs. An additional 1105 verbs—all verbs sharing at least one class with the target verbs—are also labelled, in order to provide more data for the probability estimations. Here, we consider our system's performance over the 1159 target verbs that consist of the union of these two sets of verbs.

5.2 The Corpus and Preprocessing

The majority of sentences in FrameNet II are taken from the British National Corpus (BNC Reference Guide, 2000). Our development and test data consists of a percentage of these sentences. For some experiments, these sentences are then merged with a random selection of additional sentences from the BNC in order to provide more training data for the probability estimations. We evaluate performance only on FrameNet sentences that include our target verbs.

All of our corpus data was parsed using the Collins parser (Collins, 1999). Next, we use TGrep2 (Rohde, 2004) to automatically extract from the parse trees the constituents forming potential arguments of the target verbs. For each verb, we label as the subject the lowest NP node, if it exists, that is immediately to the left of a VP node which dominates the verb. Other arguments are identified by finding sister NP or PP nodes to the right of the verb. Heads of noun phrases are identified using the method of Collins (1999), which primarily chooses the rightmost noun in the phrase that is not inside a prepositional phrase or subordinate clause. Error may be introduced at each step of this preprocessing—the sentence may be misparsed, some arguments (such as distant subjects) may not be extracted, or the wrong word may be found as the phrase head. (A sketch of these extraction heuristics is given below, after the footnotes.)

5.3 Validation and Test Data

A random selection of 30% of the preprocessed FrameNet data is set aside for testing, and another random 30% is used for development and validation. For experiments involving additional BNC data, each 30% of the FrameNet sentences is embedded in a random selection of 20% of the BNC. We selected these percentages to yield a sufficient amount of data for experimentation, while reserving some unseen data for future work. The FrameNet portion of the validation set includes 515 types of our target verbs (across 161 VerbNet classes) in 4300 sentences, and contains a total of 6636 target constituents—i.e., constituents that receive a valid VerbNet role as their gold standard label, not NoRole. The test set includes 517 of the target verbs (from 163 classes) in 4308 sentences, yielding 6705 target constituents.[6]

To create an evaluation set, we map the manually annotated FrameNet roles in the corpus to VerbNet roles (or NoRole), as described in Sections 2.2 and 3.2. We use this role information to calculate performance: the system should assign roles matching the target VerbNet roles, and make no assignment when the target is NoRole.

[5] Note that we assume the probability of a role for a slot is independent of other slots—that is, we do not ensure a consistent role assignment to all arguments across an instance of a verb.

[6] The verbs appearing in the validation and test sets occur respectively across 161 and 165 FrameNet classes (what in FrameNet are called "frames").
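A simplified rendering (ours, using NLTK trees rather than the actual TGrep2 patterns) of the extraction heuristics of Section 5.2:

```python
from nltk.tree import Tree

def extract_args(tree, verb):
    """Candidate arguments of `verb`: the subject is the lowest NP
    immediately to the left of a VP dominating the verb; the other
    arguments are NP/PP sisters to the right of the verb."""
    subj, others = None, []
    for pos in tree.treepositions():       # preorder traversal
        node = tree[pos]
        if not isinstance(node, Tree):
            continue                       # skip leaf strings
        if node.label() == 'VP' and verb in node.leaves():
            # NP sister immediately to the left of this VP; deeper VPs
            # are visited later, so overwriting keeps the lowest match.
            if pos and pos[-1] > 0:
                left = tree[pos[:-1]][pos[-1] - 1]
                if isinstance(left, Tree) and left.label() == 'NP':
                    subj = left
            # NP/PP sisters to the right of the verb's preterminal.
            for i, child in enumerate(node):
                if isinstance(child, Tree) and child.leaves() == [verb]:
                    others = [sib for sib in node[i + 1:]
                              if isinstance(sib, Tree)
                              and sib.label() in ('NP', 'PP')]
    return subj, others

t = Tree.fromstring(
    "(S (NP (NNP Yuka)) (VP (VBD whispered) (PP (TO to) (NP (NNP Dar)))))")
print(extract_args(t, 'whispered'))  # subject 'Yuka'; PP argument 'to Dar'
```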

5.4 Methods of Argument Identification

One of the decisions we face is how to evaluate the identification of extracted arguments generated by the system against the manually annotated target arguments provided by FrameNet. We try two methods, the more strict of which is to require full-phrase agreement: an extracted argument and a target argument must cover exactly the same words in the sentence in order for the argument to be considered correctly extracted. This means, for instance, that a prepositional phrase incorrectly attached to an extracted object would render the object incompatible with the target argument, and any system label on it would be counted as incorrect. This evaluation method is commonly used in other work (e.g., Carreras and Marquez, 2004).

The other method we use is to require that only the head of an extracted argument and a target argument match. This latter method helps to provide a fuller picture of the range of arguments found by the system, since there are fewer near-misses caused by attachment errors. Since heads of phrases are often the most semantically relevant part of an argument, labels on heads provide much of the same information as labels on whole phrases. For these reasons, we use head matching for most of our experiments below. For comparison, however, we provide results based on full-phrase matching as well.
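The two criteria reduce to a pair of predicates over argument spans and heads. A minimal sketch, assuming each argument records its token span and head offset (our representation, not the paper's):

```python
from collections import namedtuple

Arg = namedtuple('Arg', ['span', 'head'])  # span = (start, end) token offsets

def full_phrase_match(extracted, target):
    """Strict criterion: both arguments cover exactly the same words."""
    return extracted.span == target.span

def head_match(extracted, target):
    """Lenient criterion: only the phrase heads must coincide, so a PP
    mis-attached inside the extracted phrase does not void the match."""
    return extracted.head == target.head
```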
6 Experimental Results

6.1 Experimental Setup

We evaluate our system's performance on several aspects of the overall role labelling task; all results are given in terms of F-measure, 2PR/(P+R).[7] The first task is argument identification, in which constituents considered by our system to be arguments (i.e., those that are extracted and labelled) are evaluated against actual target arguments. The second task is labelling extracted arguments, which evaluates the labelling of only those arguments that were correctly extracted. Last is the overall role labelling task, which evaluates the system on the combined tasks of identification and labelling of all target arguments.

We compare our results to an informed baseline that has access to the same set of extracted arguments as does the frame matcher. The baseline labels all extracted arguments using the default role assignments described in Section 4.

In addition to experiments in which we employ various methods of resolving ambiguous assignments, we also evaluate the system with varying types and amounts of training data, and with two alternate methods for choosing frames from which to draw candidate roles.

6.2 Evaluation of Probability Models

We first evaluate our system with the three very simple probability models, as well as the non-probabilistic default assignment, to determine roles for the extracted arguments that the frame matcher considers to be ambiguous. We also report results after only the frame matcher has been applied, to indicate how much work is being done by it alone. Because we have constructed the frame matcher to be highly restrictive in assigning candidate roles to extracted arguments, a large number (about 62%) become primary-labelled data and so do not require resolution of ambiguous roles. Only about 16% of our extracted arguments have role ambiguities, and about 22% (many of which are adjuncts) do not receive any candidates and remain unlabelled.

    Task:                   Id.    Lab.   Id.+Lab.
    Baseline                .80    .74     .52
    FM + P(r|sc)            .83    .83     .65
    FM + P(r|s)             .83    .84     .65
    FM + P(r|v,s)           .83    .78     .61
    FM + Dflt. Assgnmt.     .83    .82     .64
    FM only                 .83    .76     .60

As shown in the table, all models perform equally well on identification, which is determined by the frame matcher (FM); i.e., any extracted argument receiving one or more candidate roles is "identified" as an argument. Performance is somewhat above the baseline, which must label all extracted arguments. For the task of labelling correctly extracted arguments and for the combined task, the simplest probability models, P(r|sc) and P(r|s), perform about the same. On the combined task, they achieve .13 above the informed baseline, indicating the effectiveness of such simple models when combined with the frame matcher. The more specific model, P(r|v,s), performs less well, and may be over-fitting on this relatively small amount of training data.

[7] In each case, P and R are close in value.

Two observations indicate the power of the frame matcher. First, even using the non-probabilistic default assignments to resolve ambiguous roles substantially outperforms the baseline (and indeed performs quite close to the best results, since the default role assignment is often the same as that chosen by the probability models). Importantly, the baseline uses the same default assignments, but without the benefit of the frame matcher to further narrow down the possible arguments. Second, the frame matcher alone achieves .60 F-measure on the combined task, not far below the performance of the best models. These results show that once arguments have been extracted, much of the labelling work is performed by the frame matcher's careful application of lexical information.

Henceforth we consider the use of the frame matcher plus P(r|sc) as our basic system, since this is our simplest model, and no other outperforms it.

6.3 Evaluation of Training Methods

In our above experiments, the probabilistic models are trained only on primary-labelled data from the frame matcher run on the FrameNet data. We would like to determine whether using either more data or less noisy data may improve results. To provide more data, we ran the frame matcher on the additional 20% of the BNC. This provides almost 600K more sentences containing our target verbs, yielding a much higher amount of primary-labelled data. To provide less-noisy data, we trained the probability models on manually annotated target labels from system-identified arguments in 1000 sentences. While fewer sentences are used, all arguments in the training data are guaranteed to have a correct role assignment, in contrast to the primary-labelled data output by the frame matcher. (We chose 1000 sentences as an upper bound on an amount of data that could be relatively easily annotated by human judges.)

    Training Data:     Prim.-lab. FN    Prim.-lab. BNC    1K sents annot'd
    Baseline                .52
    FM + P(r|sc)            .65              .65                .65
    FM + P(r|v,s)           .61              .62                .63

For our basic model, P(r|sc), these variations in training data do not affect performance. Only the most specific model, P(r|v,s), shows improvement when trained on more data or on manually annotated data, although it still does not perform as well as the simplest model. Because the models only choose from among candidate roles selected by the frame matcher, differences in the learned probability estimations must be quite large to have an effect. At least for the simplest model, these estimations do not vary with a larger corpus or one lacking in noise. However, the increase in performance seen here for the more specific model, albeit small, may indicate that richer probability models may require more or cleaner training data.

6.4 Evaluation of Frame Choice

                        "Best" frames    All frames
    Baseline                 .52
    FM + P(r|sc)             .65             .63

The frame matcher has been shown to shoulder much of the responsibility in our system, and it is worth considering variations in its operation. For example, by having the frame matcher only choose roles from the frames that are the best syntactic matches to the sentence, role ambiguity is minimized at the cost of possibly excluding the correct role. To determine whether we may do better by relying more on the probability model and less on the frame matcher, we instead include role candidates from all frames in a verb's lexical entry. The effect of this choice is more role ambiguity, decreasing the number of primary-labelled slots by roughly 30%. We see that performance using P(r|sc) is slightly worse with the greater ambiguity admitted by using all frames, indicating the benefit of precise selection of candidate roles.

6.5 Differing Argument Evaluation Methods

                        Heads    Full phrase
    Baseline             .52        .49
    FM + P(r|sc)         .65        .61
As mentioned, for most of our evaluations we match the arguments extracted by the system to the target arguments via a match on phrase heads, since head labels provide much useful semantic information. When we instead require that the extracted arguments match the targets exactly, the number of correctly extracted arguments falls from about 80% of the roughly 6700 targets to about 74%, due to increased parsing difficulty. As expected, this results in both the system and the baseline having performance decreases on the overall task.

7 Related Work

Most role labelling systems have required hand-labelled training data. Two exceptions are the subcategorization frame based work of Atserias et al. (2001) and the bootstrapping labeller of Swier and Stevenson (2004), but both are evaluated on only a small number of verbs and arguments. In related unsupervised tasks, Riloff and colleagues have learned "case frames" for verbs (e.g., Riloff and Schmelzenbach, 1998), while Gildea (2002) has learned role-slot mappings (but does not apply the knowledge to the labelling task).

Other role labelling systems have also relied on the extraction of much more complex features or multiple probability models than we adopt here. As a point of comparison, we apply the iterative backoff model from Swier and Stevenson (2004), trained on 20% of the BNC, with our frame matcher and test data. The backoff model achieves an F-measure of .63, slightly below the performance of .65 for our simplest probability model, which uses less training data and takes far less time to run (minutes rather than hours).

In general, it is not possible to make direct comparisons between our work and most other role labellers because of differences in corpora and role sets, and, perhaps more significantly, differences in the selection of target arguments. However, the best supervised systems, using automatic parses to identify full argument phrases in PropBank, achieve about .82 on the task of identifying and labelling arguments (Pradhan et al., 2004). Though this is higher than our performance of .61 on full-phrase arguments, our system does not require manually annotated data.

8 Conclusion

In this work, we employ an expensive but highly reusable resource—a verb lexicon—to perform role labelling with a simple probability model and a small amount of unsupervised training data. We outperform similar work that uses much more data and a more complex model, showing the benefit of exploiting lexical information directly. To achieve performance comparable to that of supervised methods may require human filtering or augmentation of the initial labelling. However, given the expense of producing a large semantically annotated corpus, even such "human in the loop" approaches may lead to a decrease in overall resource demands. We use such a corpus for evaluation purposes only, modifying it with a role mapping to correspond to our lexicon. We thus demonstrate that such existing resources can be bootstrapped for lexicons lacking an associated annotated corpus.

Acknowledgments

We gratefully acknowledge the support of NSERC of Canada. We also thank Afsaneh Fazly, who assisted with much of our corpus pre-processing.

References

J. Atserias, L. Padró, and G. Rigau. 2001. Integrating multiple knowledge sources for robust semantic parsing. In Proc. of the International Conf. on Recent Advances in NLP.

C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet Project. In Proc. of COLING-ACL, p. 86-90.

BNC Reference Guide. 2000. Reference Guide for the British National Corpus (World Edition), second edition.

X. Carreras and L. Marquez, editors. 2004. CoNLL-04 Shared Task.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

D. Gildea. 2002. Probabilistic models of verb-argument structure. In Proc. of the 19th International COLING, p. 308-314.

D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.

K. Hacioglu, S. Pradhan, W. Ward, J. H. Martin, and D. Jurafsky. 2004. Semantic role labeling by tagging syntactic chunks. In Proc. of the 8th CoNLL, p. 110-113.

K. Kipper, H. T. Dang, and M. Palmer. 2000. Class-based construction of a verb lexicon. In Proc. of the 17th AAAI Conf.

B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky. 2004. Shallow semantic parsing using support vector machines. In Proc. of HLT/NAACL.

E. Riloff and M. Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proc. of the 6th WVLC.

D. L. T. Rohde. 2004. TGrep2 user manual, ver. 1.11. http://tedlab.mit.edu/~dr/Tgrep2.

R. Swier and S. Stevenson. 2004. Unsupervised semantic role labelling. In Proc. of the 2004 Conf. on EMNLP, p. 95-102.
