The Berkeley FrameNet Project

Collin F. Baker, Charles J. Fillmore, and John B. Lowe
{collinb, fillmore, jblowe}@icsi.berkeley.edu
International Computer Science Institute
1947 Center St. Suite 600, Berkeley, Calif., 94704

Abstract

FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic portion makes use of frame semantics. The resulting database will contain (a) descriptions of the semantic frames underlying the meanings of the words described, and (b) the valence representation (semantic and syntactic) of several thousand words and phrases, each accompanied by (c) a representative collection of annotated corpus attestations, which jointly exemplify the observed linkings between "frame elements" and their syntactic realizations (e.g. grammatical function, phrase type, and other syntactic traits). This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.

1 Introduction

The Berkeley FrameNet project [1] is producing frame-semantic descriptions of several thousand English lexical items and backing up these descriptions with semantically annotated attestations from contemporary English corpora [2].

[1] The project is based at the International Computer Science Institute (1947 Center Street, Berkeley, CA). A fuller bibliography may be found in (Lowe et al., 1997).

[2] Our main corpus is the British National Corpus (BNC). We have access to it through the courtesy of Oxford University Press; the POS-tagged and lemmatized version we use was prepared by the Institut für Maschinelle Sprachverarbeitung (IMS) of the University of Stuttgart. The European collaborators whose participation has made this possible are Sue Atkins, Oxford University Press, and Ulrich Heid, IMS-Stuttgart.

These descriptions are based on hand-tagged semantic annotations of example sentences extracted from large text corpora and on systematic analysis, by lexicographers and linguists, of the semantic patterns they exemplify. The primary emphasis of the project therefore is the encoding, by humans, of semantic knowledge in machine-readable form. The intuition of the lexicographers is guided by and constrained by the results of corpus-based research using high-performance software tools.

The semantic domains to be covered are: HEALTH CARE, CHANCE, PERCEPTION, COMMUNICATION, TRANSACTION, TIME, SPACE, BODY (parts and functions of the body), MOTION, LIFE STAGES, SOCIAL CONTEXT, EMOTION and COGNITION.

1.1 Scope of the Project

The results of the project are (a) a lexical resource, called the FrameNet database [3], and (b) associated software tools. The database has three major components (described in more detail below):

• Lexicon containing entries which are composed of: (a) some conventional dictionary-type data, mainly for the sake of human readers; (b) FORMULAS which capture the morphosyntactic ways in which elements of the semantic frame can be realized within the phrases or sentences built up around the word; (c) links to semantically ANNOTATED EXAMPLE SENTENCES which illustrate each of the potential realization patterns identified in the formula; [4] and (d) links to the FRAME DATABASE and to other machine-readable resources such as WordNet and COMLEX.
• Frame Database containing descriptions of each frame's basic conceptual structure and giving names and descriptions for the elements which participate in such structures. Several related entries in this database are schematized in Fig. 1.
• Annotated Example Sentences which are marked up to exemplify the semantic and morphosyntactic properties of the lexical items. (Several of these are schematized in Fig. 2.) These sentences provide empirical support for the lexicographic analysis provided in the frame database and lexicon entries.

[3] The database will ultimately contain at least 5,000 lexical entries together with a parallel annotated corpus, these in formats suitable for integration into applications which use other lexical resources such as WordNet and COMLEX. The final design of the database will be selected in consultation with colleagues at Princeton (WordNet), ICSI, and IMS, and with other members of the NLP community.

[4] In cases of accidental gaps, clearly marked invented examples may be added.

These three components form a highly relational and tightly integrated whole: elements in each may point to elements in the other two. The database will also contain estimates of the relative frequency of senses and complementation patterns calculated by matching the senses and patterns in the hand-tagged examples against the entire BNC corpus.

1.2 Conceptual Model

The FrameNet work is in some ways similar to efforts to describe the argument structures of lexical items in terms of case-roles or theta-roles, [5] but in FrameNet, the role names (called frame elements or FEs) are local to particular conceptual structures (frames); some of these are quite general, while others are specific to a small family of lexical items.

[5] The semantic frames for individual lexical units are typically "blends" of more than one basic frame; from our point of view, the so-called "linking" patterns proposed in LFG, HPSG, and Construction Grammar operate on higher-level frames of action (giving agent, patient, instrument), motion and location (giving theme, location, source, goal, path), and experience (giving experiencer, stimulus, content), etc. In some but not all cases, the assignment of syntactic correlates to frame elements could be mediated by mapping them to the roles of one of the more abstract frames.

For example, the TRANSPORTATION frame, within the domain of MOTION, provides MOVERS, MEANS of transportation, and PATHS; [6] subframes associated with individual words inherit all of these while possibly adding some of their own. Fig. 1 shows some of the subframes, as discussed below.

[6] A detailed study of motion predicates would require a finer-grained analysis of the Path element, separating out Source and Goal, and perhaps Direction and Area, but for a basic study of the transportation predicates such refined analysis is not necessary. In any case, our work includes the separate analysis of the frame semantics of directional and locational expressions.

    frame(TRANSPORTATION)
    frame_elements(MOVER(S), MEANS, PATH)
    scene(MOVER(S) move along PATH by MEANS)

    frame(DRIVING)
    inherit(TRANSPORTATION)
    frame_elements(DRIVER (=MOVER), VEHICLE (=MEANS), RIDER(S) (=MOVER(S)), CARGO (=MOVER(S)))
    scenes(DRIVER starts VEHICLE, DRIVER controls VEHICLE, DRIVER stops VEHICLE)

    frame(RIDING_1)
    inherit(TRANSPORTATION)
    frame_elements(RIDER(S) (=MOVER(S)), VEHICLE (=MEANS))
    scenes(RIDER enters VEHICLE, VEHICLE carries RIDER along PATH, RIDER leaves VEHICLE)

Figure 1: A subframe can inherit elements and semantics from its parent

The DRIVING frame, for example, specifies a DRIVER (a principal MOVER), a VEHICLE (a particularization of the MEANS element), and potentially CARGO or RIDER as secondary movers. In this frame, the DRIVER initiates and controls the movement of the VEHICLE. For most verbs in this frame, DRIVER or VEHICLE can be realized as subjects; VEHICLE, RIDER, or CARGO can appear as direct objects; and PATH and VEHICLE can appear as oblique complements.

Some combinations of frame elements, or Frame Element Groups (FEGs), for some real corpus sentences in the DRIVING frame are shown in Fig. 2.

    FEG        Annotated Example from BNC
    D          [D Kate] drove [P home] in a stupor.
    V, D       A pregnant woman lost her baby after she fainted as she waited for a bus and fell into the path of [V a lorry] driven [D by her uncle].
    D, P       And that was why [D I] drove [P eastwards along Lake Geneva].
    D, R, P    Now [D Van Cheele] was driving [R his guest] [P back to the station].
    D, V, P    [D Cumming] had a fascination with most forms of transport, driving [V his Rolls] at high speed [P around the streets of London].
    D+R, P     [D We] drive [P home along miles of empty freeway].
    V, P       Over the next 4 days, [V the Rolls Royces] will drive [P down to Plymouth], following the route of the railway.

Figure 2: Examples of Frame Element Groups and Annotated Sentences

A RIDING_1 frame has the primary mover role as RIDER, and allows as VEHICLE those driven by others. [7] In grammatical realizations of this frame, the RIDER can be the subject; the VEHICLE can appear as a direct object or an oblique complement; and the PATH is generally realized as an oblique.

[7] A separate frame RIDING_2 that applies to the English verb ride selects means of transportation that can be straddled, such as bicycles, motorcycles, and horses.

The FrameNet entry for each of these verbs will include a concise formula for all semantic and syntactic combinatorial possibilities, together with a collection of annotated corpus sentences in which each possibility is exemplified. The syntactic positions considered relevant for lexicographic description include those that are internal to the maximal projection of the target word (the whole VP, AP, or NP for target V, A or N), and those that are external to the maximal projection under precise structural conditions: the subject, in the case of VP, and the subject of support verbs in the case of AP and NP. [8]

[8] For causatives, the object of the support verb is included; for details, see (Fillmore and Atkins, forthcoming).

Used in NLP, the FrameNet database should make it possible for a system which finds a valence-bearing lexical item in a text to know (for each of its senses) where its individual arguments are likely to be found. For example, once a parser has found the verb drive and its direct object NP, the link to the DRIVING frame will suggest some semantics for that NP, e.g. that a person as direct object probably represents the RIDER, while a non-human proper noun is probably the VEHICLE.

For practical lexicography, the contribution of the FrameNet database will be its presentation of the full range of use possibilities for individual words, documented with corpus data, the model examples for each use, and the statistical information on relative frequency.

2 Organization and Workflow

2.1 Overview

The computational side of the FrameNet project is directed at efficiently capturing human insights into semantic structure. The majority of the work involved is marking text with semantic tags, specifying (again by hand) the structure of the frames to be treated, and writing dictionary-style entries based on the results of annotation and a priori descriptions. With the exception of the example sentence extraction component, all the software modules are highly interactive and have substantial user interface requirements. Most of this functionality is provided by WWW-based programs written in PERL.

Four processing steps are required to produce the FrameNet database of frame semantic representations: (a) generating initial descriptions of semantic and syntactic patterns for use in corpus queries and annotation ("Preparation"), (b) extracting good example sentences ("Subcorpus Extraction"), (c) marking (by hand) the constituents of interest ("Annotation"), and (d) building a database of lexical semantic representations based on the annotations and other data ("Entry Writing"). These are discussed briefly below and shown in Fig. 3.

[Figure 3: Workflow, Roles, Data Structures and Software]

2.2 Workflow and Personnel

As work on the project has progressed, we have defined several explicit roles which project participants play in the various steps; these roles are referred to as Vanguard (1.1 in Fig. 3), Annotators (3.1) and Rearguard (4.1). These are purely functional designations: the same person may play different roles at different times. [9]

[9] Of course there are other staff members who write code and maintain the databases. This behind-the-scenes work is not shown in Fig. 3.

1. Preparation. The Vanguard (1.1) prepares the initial descriptions of frames, including lists of frames and frame elements, and adds these to the Frame Database (5.1) using the Frame Description tool (1.2). The Vanguard also selects the major vocabulary items for the frame (the target words) and the syntactic patterns that need to be checked for each word, which are entered in the Lexical Database (5.2) by means of the Lexical Database Tool (1.3).

2. Subcorpus Extraction. Based on the Vanguard's work, the subcorpus extraction tools (2.2) produce a representative collection of sentences containing these words. This selection of examples is achieved through a hybrid process partially controlled by the preliminary lexical description of each lemma. Sentences containing the lemma are extracted from a corpus and classified into subcorpora by syntactic pattern (2.2.1) using a CASCADE FILTER (2.2.2, 2.2.5, 2.2.6) representing a partial regular-expression grammar of English over part-of-speech tags (cf. Gahl (forthcoming)), formatted for annotation (2.2.4), and automatically sampled (2.2.3) down to an appropriate number. (If these heuristics fail to find appropriate examples by means of syntactic patterns, sentences are selected using INTERACTIVE SELECTION TOOLS (2.3).)

3. Annotation. Using the annotation software (3.2) and the tagsets (3.2.1) derived from the Frame Database, the Annotators (3.1) mark selected constituents in the extracted subcorpora according to the frame elements which they realize, and identify canonical examples, novel patterns, and problem sentences. [10]

[10] We are building a "constituent type identifier" which will semi-automatically assign Grammatical Function (GF) and Phrase Type (PT) attributes to these FE-marked constituents, eliminating the need for Annotators to mark these.

4. Entry Writing. The Rearguard (4.1) reviews the skeletal lexical record created by the Vanguard, the annotated example sentences (5.3), and the FEGs extracted from them, and builds both the entries for the lemmas in the Lexical Database (5.2) and the frame descriptions in the Frame Database (5.1), using the Entry Writing Tools (4.2).
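The Subcorpus Extraction step can be illustrated with a small sketch. It is not the project's implementation, which relies on CQP queries generated by a PERL preprocessor. The tag names (TARGET, DET, ADJ, NOUN, PREP), the three patterns, the length limits, and the function names are all assumptions made for illustration. The sketch shows the two ideas named in the text: an ordered cascade of regular expressions over part-of-speech tags, where the first matching pattern claims the sentence, followed by length filtering and random sampling down to a fixed number per subcorpus.

```python
import random
import re

# Hypothetical cascade: ordered (subcorpus name, regex over the POS-tag string).
# The target word is assumed to carry the tag TARGET. The first pattern that
# matches claims the sentence, so more specific patterns come first.
CASCADE = [
    ("np_pp", re.compile(r"\bTARGET (DET )?(ADJ )*NOUN PREP\b")),
    ("np", re.compile(r"\bTARGET (DET )?(ADJ )*NOUN\b")),
    ("pp", re.compile(r"\bTARGET PREP\b")),
]


def classify(tagged_sentence):
    """Return the name of the first cascade pattern matching the tag sequence."""
    tags = " ".join(tag for _word, tag in tagged_sentence)
    for name, pattern in CASCADE:
        if pattern.search(tags):
            return name
    return "other"


def build_subcorpora(sentences, max_per_pattern, min_len=3, max_len=40, seed=0):
    """Partition by pattern, drop very short or very long sentences, then sample."""
    rng = random.Random(seed)
    subcorpora = {}
    for sent in sentences:
        if min_len <= len(sent) <= max_len:
            subcorpora.setdefault(classify(sent), []).append(sent)
    # Sample each subcorpus down to at most max_per_pattern sentences.
    return {
        name: rng.sample(group, min(max_per_pattern, len(group)))
        for name, group in subcorpora.items()
    }
```

Each sentence is a list of (word, tag) pairs, for example `[("We", "PRON"), ("drove", "TARGET"), ("along", "PREP"), ("the", "DET"), ("coast", "NOUN")]`, which this cascade assigns to the "pp" subcorpus. Ordering the patterns from most to least specific is what makes the filter a cascade: a sentence with both an object and a prepositional phrase never reaches the plain "np" pattern.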

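The Entry Writing step refers to the FEGs extracted from the annotated sentences. A minimal sketch of such an extraction follows. It assumes the bracketed notation of Fig. 2 (`[LABEL text]`) rather than the project's SGML markup, and the function names are hypothetical. It reads off the frame element labels in order of appearance, recovers the plain sentence, and tallies how often each FEG occurs.

```python
import re
from collections import Counter

# Matches one annotated constituent in the bracket notation of Fig. 2,
# e.g. "[D Kate]" -> label "D", text "Kate". Labels may be joined with "+".
CONSTITUENT = re.compile(r"\[([A-Z]+(?:\+[A-Z]+)*) ([^\[\]]+)\]")


def frame_element_group(annotated):
    """Return the labels of the annotated constituents, in order of appearance."""
    return tuple(label for label, _text in CONSTITUENT.findall(annotated))


def strip_annotation(annotated):
    """Recover the plain sentence by removing brackets and labels."""
    return CONSTITUENT.sub(lambda m: m.group(2), annotated)


def count_fegs(annotated_sentences):
    """Tally how often each FEG occurs across a set of annotated sentences."""
    return Counter(frame_element_group(s) for s in annotated_sentences)
```

Applied to the Fig. 2 sentence "Now [D Van Cheele] was driving [R his guest] [P back to the station].", `frame_element_group` yields the labels D, R, P. Counts of this kind over a hand-tagged subcorpus are one simple way to approach the relative frequency of complementation patterns mentioned in Section 1.1.
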
3 Implementation

3.1 Data Model

The data structures described above are implemented in SGML. [11] Each is described by a DTD, and these DTDs are structured to provide the necessary links between the components.

[11] Eventually, we plan to migrate to an XML data model, which appears to provide more flexibility while reducing complexity. Also, the FrameNet software is being developed on Unix, but we plan to provide cross-platform capabilities by making our tool suite web-based and XML-compatible.

3.2 Software

The software suite currently supporting database development is an aggregate of existing software tools held together with PERL/CGI-based "glue". In order to get the project started, we have depended on off-the-shelf software which in some cases is not ideal for our purposes. Nevertheless, using these programs allowed us to get the project up and running within just a few months. We describe below, in approximate order of application, the programs used and their state of completion.

• Frame Description Tool (1.2) (in development) An interactive, web-based tool.
• Lexical Description Tool (1.3) (prototype) An interactive, web-based tool.
• CQP (2.2.1) is a high-performance Corpus Query Processor, developed at IMS Stuttgart (IMS, 1997). The cascade filter, which partitions lemma-specific subcorpora by syntactic patterns, is built using a preprocessor (written in PERL, 2.2.2) which generates CQP's native query language.
• XKWIC (2.3) is an X-window, interactive tool, also from IMS, which facilitates manipulating corpora and subcorpora.
• Subcorpora are prepared for annotation by a program ("arf" for Annotation Ready Formatter, 2.2.4) which wraps SGML tags around sentences, target words, comments and other distinguishable text elements. Another program, "whittle" (2.2.3), combines subcorpora in a preselected order, removing very long and very short sentences, and sampling to reduce large subcorpora.
• Alembic (3.2) (Mitre, 1998) allows the interactive markup (in SGML) of text files according to predefined tagsets (3.2.1). It is used to introduce frame element annotations into the subcorpora.
• Sgmlnorm, etc. (from James Clark's SGML tool set) are used to validate and manage the SGML files.
• Entry Writing Tools (4.2) (in development)
• Database management tools to manage the catalog of subcorpora, schedule the work, render the SGML files into HTML for convenient viewing on the web, etc. are being written in PERL. RCS maintains version control over most files.

4 Conclusion

At the time of writing, there is something in place for each of the major software components, though in some cases these are little more than stubs or "toy" implementations. Nearly 10,000 sentences exemplifying just under 200 lemmas have been annotated; there are over 20,000 frame element tokens marked in these example sentences. About a dozen frames have been specified, which refer to 47 named frame elements. Most of these annotations have been accomplished in the last few months since the software for corpus extraction, frame description, and annotation became operational. We expect the inventory to increase rapidly. If the proportions cited (roughly 50 annotated sentences and 100 frame element tokens per lemma) hold constant as the FrameNet database grows, the final database of 5,000 lexical units may contain 250,000 annotated sentences and over half a million tokens of frame elements.

References

Charles J. Fillmore and B. T. S. Atkins. forthcoming. FrameNet and lexicographic relevance. In Proceedings of the First International Conference On Language Resources And Evaluation, Granada, Spain, 28-30 May 1998.

Susanne Gahl. forthcoming. Automatic extraction of subcorpora based on subcategorization frames from a part of speech tagged corpus. In Proceedings of the 1998 COLING-ACL conference.

Institut für maschinelle Sprachverarbeitung IMS. 1997. IMS corpus toolbox web page at Stuttgart. http://www.ims.uni-stuttgart.de/~oli/CorpusToolbox/.

John B. Lowe, Collin F. Baker, and Charles J. Fillmore. 1997. A frame-semantic approach to semantic annotation. In Tagging Text with Lexical Semantics: Why, What, and How? pages 18-24. Proceedings of the Workshop, Special Interest Group on the Lexicon, Association for Computational Linguistics, April.

Mitre. 1998. Alembic Workbench web page at Mitre corp. http://www.mitre.org/resources/centers/advanced_info/g04h/workbench.html.
