
Corpus annotation within the French FrameNet: a domain-by-domain methodology

Marianne Djemaa*, Marie Candito*, Philippe Muller†, Laure Vieu‡
* Alpage (Univ. Paris Diderot / INRIA), † IRIT, Toulouse University, ‡ IRIT, CNRS

Abstract
This paper reports on the development of a French FrameNet, within the ASFALDA project. While the first phase of the project focused on the development of a French set of frames and corresponding lexicon (Candito et al., 2014), this paper concentrates on the subsequent corpus annotation phase, which focused on four notional domains (commercial transactions, cognitive stances, causality and verbal communication). Given that full coverage is not reachable for a relatively “new” FrameNet project, we advocate that focusing on specific notional domains allowed us to obtain full lexical coverage for the frames of these domains, while partially reflecting word sense ambiguities. Furthermore, as frames and roles were annotated on two French treebanks (the French Treebank (Abeillé and Barrier, 2004) and the Sequoia Treebank (Candito and Seddah, 2012)), we were able to extract a syntactico-semantic lexicon from the annotated frames. In the resource’s current status, there are 98 frames, 662 frame-evoking words, 872 senses, and about 13000 annotated frames, with their semantic roles assigned to portions of text. The French FrameNet is freely available at alpage.inria.fr/asfalda.

Keywords: FrameNet, French, semantic roles, semantic frames, semantically-annotated corpus

1. Introduction

The ASFALDA project [1] aims to build semantic resources and a corresponding semantic analyzer for French, to capture generalizations both over predicates and over the semantic arguments of predicates. We chose to build on the work resulting from the FrameNet project (Baker et al., 1998), which provides a structured set of prototypical situations, called frames, along with a semantic characterization of the participants of these situations (called frame elements, but we will use roles for short). The corresponding English lexicon associates frames with the words that can evoke them (called frame-evoking elements, FEEs for short). While other English semantic resources, such as PropBank (Palmer et al., 2005) or VerbNet (Schuler, 2005), also provide semantic classes and/or semantic roles for predicate arguments, we chose FrameNet mainly because of its more semantic orientation, which is crucial for portability to other languages. FrameNet offers generalization over not only syntactic variation (e.g. diathesis alternation) but also lexical variation (like VerbNet but unlike PropBank), and groups together lexical units of various categories, on the basis of criteria that are not primarily syntactic (unlike VerbNet).

The resources built within ASFALDA consist of a set of frames, a French lexicon in which lexical units are associated with FrameNet frames, and a semantic annotation layer added on top of existing syntactic French treebanks. The project also aims to investigate new models for frame-based semantic analysis. In a first phase of the project (Candito et al., 2014), the work focused on a set of notional domains, and frames pertaining to these domains were selected from the frame set of Berkeley FrameNet 1.5. These frames were adapted to French, and the corresponding French lexicon was built. The current paper focuses on the subsequent corpus annotation phase.

In section 2., we describe the original FrameNet developed for English. In section 3., we review the pros and cons of various FrameNet building methodologies before presenting ours, and then describe our target corpora and our annotation workflow. We next present the resulting resource in section 4., evaluating its quality using inter-annotator agreement and providing various statistics. We then review specific phenomena we had to address in section 5., and conclude in section 6.

2. FrameNet

FrameNet (Baker et al., 1998), developed at UC Berkeley’s ICSI, is a resource inspired by Fillmore (1982)’s Frame Semantics. It is made of: (a) a network of frames: prototypical scenes, complete with semantic characterizations of the scene’s participants, props or parts called frame elements (as mentioned in the introduction, we will say roles for short); (b) a lexicon of lexemes that may evoke those frames; and (c) a corpus of annotated English sentences. These sentences carry frame annotations on the words that evoke them, as well as role annotations on the linguistic material that realizes those roles.

As we can see in figure 1 (which shows excerpts from the EXPORTING frame), a frame consists of several parts, the first of which is a natural language definition of the prototypical situation described by the frame. Each frame also includes a set of roles; those may be assigned specific semantic types, and relations (such as exclusion) may be defined between several roles. Frames are linked via frame relations, such as inheritance or is-causative-of. A key aspect of FrameNet is that roles are specific to each frame, as an answer to the well-known difficulty of fitting the semantic arguments of predicates into a small set of semantic roles (Fillmore, 2007). Yet, more coarse-grained roles can be obtained through frame-to-frame relations: identity relations are defined between the roles of two frames linked via a frame-to-frame relation. This makes it possible to generalize over frame-specific roles to obtain a granularity closer

[1] alpage.inria.fr/asfalda

to that of traditional sets of semantic roles.

EXPORTING
Def: An Exporter moves Goods across a border from an Exporting area to an Importing area.
Roles:
  Exporter        The conscious entity, generally a person, that moves the Goods across a border out of the Exporting area.
  Exporting area  The place where the Goods are initially, before the Exporter moves them.
  Goods           The items of value whose location is changing.
  Importing area  The place that the Goods end up as a result of motion.
FEEs: export.n, export.v, exportation.n

Figure 1: EXPORTING frame

Roles are grouped by status, based on whether they are essential to the meaning of the frame. In particular, roles are called “core” if they “instantiate a conceptually necessary component of a frame, while making the frame unique” (Ruppenhofer et al., 2006) [2].

Each frame also lists the set of lexical units that may evoke it: the frame-evoking elements (FEEs), each made of a lemma and a part-of-speech. Each frame and FEE pair usually comes with a choice of annotated sentences from the British National Corpus, picked so that they exhaustively exemplify the range of possible syntactic realizations of the roles of a frame. Each such exemplar contains exactly one annotation set, namely one frame annotated on the FEE along with the various roles annotated on the corresponding role fillers. In a later stage of the project, full-text annotations were added, which consist of a set of running texts in which every occurrence of potentially frame-bearing words has been annotated.

Compared to other semantic role resources, FrameNet has a more semantic orientation [3]. Yet FrameNet annotation is still lexicon-based: two sentences referring to the same situation might bear different frames depending on the words used.

Initiated in 1998, the resource currently has 13 474 FEEs that may evoke a total of 1 216 frames. There are 174 038 annotation sets in the lexicographic corpus, and 28 046 in the full-text corpus [4]. The French FrameNet was designed using the slightly smaller 1.5 release.

Probably because of its semantic orientation, FrameNet has been shown to be quite portable to other languages (Boas, 2009), and FrameNet-like resources have already been developed for, among others, Spanish (Subirats-Rüggeberg and Petruck, 2003), Japanese (Ohara et al., 2004), German (Burchardt et al., 2006b) and Swedish (Borin et al., 2009).

3. Methodology

3.1. Pros and cons of existing frame semantics annotation strategies

The original Berkeley FrameNet project and subsequent FrameNet projects for various languages have adopted various methodologies, which have a major impact on the resulting resources: because full lexicon coverage is unreachable, the path taken to carry out the annotations shapes the result. We therefore describe the main traits of previous methodologies, before presenting ours. Three main strategies have been used in the past:

• The lexicographic frame-by-frame strategy is prevalent within FrameNet-related projects. A frame is defined along with its FEEs, and exemplar sentences containing (disambiguated) occurrences of these FEEs are chosen for annotation, with the objective of maximizing the range of annotated syntactic valences for a given semantic frame. Each exemplar sentence contains one annotation set only.

• The corpus-driven lemma-by-lemma strategy is characteristic of the German FrameNet (developed in the SALSA project (Burchardt et al., 2009)), and was also partly adopted by the Japanese FrameNet (Ohara et al., 2004). A set of lemmas is chosen [5] and all their occurrences in the target corpus are annotated.

• The full-text strategy, which was later adopted within the Berkeley FrameNet project, consists in annotating any content-word occurrence in running text. Though it presupposes the existence of a core set of frames, it necessarily entails defining new frames as uncovered senses are encountered.

This last strategy obviously achieves perfect coverage, though restricted to the target corpus, in terms of the lexical diversity for a given frame and of the sense ambiguity for a given lemma. Yet it is extremely difficult to achieve: it is likely that the full-text annotations could only be completed thanks to the experience acquired through the lexicographic phase of the Berkeley FrameNet project. Preliminary trials led us to the conclusion that full-text frame semantic annotation could not be accomplished within our project.

The other two strategies have different flaws and qualities. The frame-by-frame annotation enforces full lexical coverage for frames, and fetching lexicographic examples enforces full coverage of the possible syntactic variation within a frame. Yet because frame coverage is necessarily insufficient, the resulting lexicon is biased: for a given lemma, only senses pertaining to covered frames will appear in the lexicon, even though these senses may not be the most frequent senses of that lemma. On the contrary, with the lemma-by-lemma strategy, lemmas are either fully

[2] Although primarily stated in semantic terms, coreness is closely related to the formal distinction of argument versus adjunct of the underlying FEEs. Note though that some subcategorized complements may not be conceptually necessary, such as the addressee of the Statement frame.
[3] For instance, unlike VerbNet (Schuler, 2005), which is based on Levin’s classes (Levin, 1993), FrameNet’s criteria for frame delimitation and role typing are mostly semantic.
[4] https://framenet.icsi.berkeley.edu/fndrupal/current status, consulted in March 2016.
[5] In the SALSA project, the target lemmas were primarily 500 verbs of all frequency bands, plus some deverbal nouns.

accounted for or entirely missing, and thus the lexical diversity of a frame is not fully accounted for.

As far as ease of annotation is concerned, since achieving a good understanding of the limits of a frame is quite a difficult task, the frame-by-frame strategy eases the task of annotators, who can perform better on a frame they know well. The lemma-by-lemma strategy requires addressing very diverse lemma senses, often without an existing frame in the Berkeley FrameNet database, including rarer senses or cases in which the lemma is part of a larger lexical unit with non fully compositional semantics (Burchardt et al., 2009). While this can be a valid strategy to increase frame coverage, preliminary investigations led us to conclude that it was extremely difficult for an annotator to master frames from widely varying notional domains, and a fortiori to be able to create new frames [6].

On the other hand, we have found that using the frame-by-frame strategy, i.e. working in isolation on a frame, can result in missing some similarities with other frames, and thus artificially increasing polysemy: choosing the right frame for a given lemma’s occurrence can become quite difficult due to blurred frontiers between frames.

Finally, a crucial characteristic of the different strategies concerns the use of the resulting annotations as training data for semantic parsers: even though they are much more numerous, Berkeley FrameNet’s lexicographic annotations have proved much less useful than its full-text ones (Das et al., 2010), which have the crucial trait of preserving the natural sense and role-realization probabilistic distributions.

3.2. Corpus-driven domain-by-domain strategy

From these observations, we chose to use a “corpus-driven domain-by-domain strategy”, i.e. to focus on a small set of notional domains and fully annotate the frames pertaining to these domains. Within a given domain, we aimed for full coverage in terms of the FEEs that can evoke the frames, and full coverage in terms of the occurrences of such FEEs within a target treebank. In a first phase of the project (Candito et al., 2014):

• We chose a small set of notional domains, and selected the English frames pertaining to these domains (within the 1.5 FrameNet release).

• We worked in parallel to (i) define the lexicon for these frames, starting from automatically projected French FrameNet (Mouton et al., 2010; Padó, 2007); (ii) adapt the frames to French and (iii) cope with frame overlap brought to light thanks to the domain-by-domain approach. This has sometimes led us to merge several English frames into one. Some frames were split and totally new frames were created, to complete the frame modeling of a notional domain.

In the current paper, we describe the subsequent corpus annotation phase. We used as target corpus two syntactically-annotated corpora (see section 3.3. below), and basically aimed to annotate all occurrences of lemmas that potentially evoke a frame of one of 4 notional domains (cf. section 4.1.). Once disambiguated, the FEE occurrence is either associated with one relevant frame, or with a special Other sense frame, to indicate that the meaning is outside of the targeted notional domains. More precisely, we used an upper bound of 100 occurrences [7] of a given lemma. This methodology entails that an occurrence of a lemma appearing in our lexicon is either (i) annotated with a frame from one of the 4 notional domains, (ii) annotated with the dummy Other sense frame or (iii) not associated with any frame at all (meaning it was not proposed for annotation because it was beyond the first 100 occurrences).

In the resulting resource, such a strategy entails that:

• a given notional domain is completely annotated;

• a frame is associated with all its potential FEEs, and their occurrences (if any) in the corpus are frame-annotated. We thus retain the crucial property of the FrameNet project of capturing the lexical diversity within a frame, including diversity of parts-of-speech;

• sentences have both full syntactic gold annotation and partial semantic annotations, which can be useful for corpus studies on the syntax-semantics interface;

• the resulting annotations do reflect naturally occurring sense and role distributions, though only within the four target notional domains, and modulo the 100-occurrence upper bound. The resource can thus be used as training data for partial word sense disambiguation, and for semantic role labeling, for all the occurrences associated with a frame (whether actual or dummy).

3.3. Target corpora

The corpus we used is the concatenation of two syntactically annotated treebanks: the Sequoia treebank (Candito and Seddah, 2012) and the French Treebank (Abeillé and Barrier, 2004), totaling 21634 sentences and 624187 tokens. The French Treebank consists of sentences from the national newspaper Le Monde. The version we used, from the SPMRL 2013 shared task (Seddah et al., 2013), contains 18535 sentences. The Sequoia treebank was developed using the same annotation scheme. It is much smaller (3099 sentences), and contains sentences from 4 different sources [8]: Europarl, a regional newspaper (L’Est Républicain), public assessment reports for medicines from the European Medicines Agency, and the French

[6] For that reason, the SALSA project proposed to cope with senses not covered by any existing English frame by creating sense-specific proto-frames, without lexical generalization nor semantic relations to other frames.
[7] Actually the first 100 occurrences in the Sequoia plus French Treebank corpus, in that order.
[8] The Sequoia treebank was originally developed to provide “out-of-domain” sentences for parsing experiments. It is important to note that we use in this paper the term “domain” in the sense of “notional domain”, different from the “human activity domain” sense that can be used to qualify corpora. The two corpora we used have different origins, but both contain occurrences of FEEs evoking one of our 4 notional domains.

Wikipedia.

While both corpora are available with both phrase-structure trees and dependency trees, we chose to add the semantic annotations on top of the syntactic dependency trees. These syntactic annotations were essential to the goal of obtaining a syntactico-semantic resource, and also served as a means to speed up annotations, mainly in the (good) cases in which a role filler coincides with a syntactic subtree.

3.4. Annotation workflow

As stressed in section 3.2., we chose to focus on a set of notional domains in order to define the French frames and their corresponding FEEs. Yet, the ambiguity of any given lemma occurrence forces us to adopt a lemma-by-lemma corpus annotation workflow.

Six annotators worked in this phase, plus 4 domain experts (the authors of this paper). The experts first performed some trial annotations, in order to set up a first version of the annotation guide. Annotation then started with lemmas associated with one frame only, then moved to lemmas associated with several frames within the same notional domain, before switching to cross-domain ambiguous lemmas [9]. The annotation guide was continuously expanded based on the annotators’ feedback and questions.

More precisely, for each given lemma to annotate, the following steps were applied:

• Pre-annotation with ambiguous frames: the first 100 occurrences of the lemma are automatically pre-annotated with its corresponding candidate frames according to the lexicon.

• Double annotation on top of syntactic dependency trees: two annotators then have to independently annotate these occurrences, using the Salto graphical tool (Burchardt et al., 2006a): for a given pre-annotated occurrence, each annotator decides whether the occurrence evokes one of the proposed frames. If so, they discard all other frames, and annotate the relevant roles. Only the core roles were to be annotated, as well as non-core roles realized as a subcategorized complement (cf. section 2.). As noted above, the syntactic structure helped speed up the annotation of role fillers when they coincided with a subtree. Note though that annotators could in any case choose the exact tokens composing a role filler, independently of the provided syntactic analysis. This is crucial for cases of syntax/semantics discrepancies and for cases of errors in syntactic annotations.
For lemmas they found difficult to disambiguate, the annotators could ask “domain experts” to provide a specific disambiguation guide. This could result in modifying the lexicon (i.e. the frames associated with the lemma in the lexicon). When annotating the frames for a predicative noun, annotators were also asked to add special frames for support verbs if needed (these were not pre-annotated). Annotators had the possibility to leave several frames if they were unsure.

• Adjudication: the two annotated versions are adjudicated, either by an expert, or by the two annotators together [10].

4. Resulting annotated resource

4.1. Annotated notional domains

Though the first phase of the project focused on 7 notional domains, we were able to annotate only 4 of these:

• Commercial transactions: originally well studied in the English FrameNet, this domain has the particularity of including converse verbs, for which FrameNet is particularly well suited;

• Cognitive stances: this notional domain includes predicates in which the stance of a cognizer towards a propositional content is expressed. It is mostly concerned with beliefs, with varying degrees of certainty, and either stative (know, think) or inchoative (realize);

• Causality: the domain covers both factual causation between events appearing in narratives and evidential or epistemic relations between facts relevant in argumentative texts [11];

• Verbal communication: this notional domain is pervasive in the journalistic parts of our target corpora. In the current release, it is not fully annotated though: we started with the FEEs ambiguous with the other 3 domains.

A frame generally pertains to one domain only, but not always. For instance the FR Attributing cause frame (when someone attributes an effect to a cause) relates both to the causality and the cognitive stances domains.

4.2. Evaluation of the annotation task: Inter-annotator agreement

As noted by Burchardt et al. (2009), chance-corrected inter-annotator agreement metrics such as the kappa score are applicable to classification tasks in which both the items to classify and the classes are fixed. This is not the case in our setting: for the frame assignment task, items to annotate were almost fixed (except for support verbs), but the frames could vary. For the role assignment task, items to add a role to were not fixed, and the set of roles depended on the frame assigned by the annotator. We thus evaluated the annotation task simply using an inter-annotator Fscore, both for frame assignment and role assignment [12]. For the

[9] Note that in any case, additional ambiguity could arise from senses not belonging to the target notional domains.
[10] The first adjudications were performed by the experts plus the annotators, in order to train them. Then, adjudications were performed by annotators if the inter-annotator agreement was high enough (see section 4.2.), by experts otherwise.
[11] We address the most generic causal frames only, some of which subsume a large number of specialized ones.
[12] The Fscore is the harmonic mean of precision and recall when considering one annotation as the reference and the other as the prediction. Swapping the two annotations swaps the precision and recall values, resulting in the same Fscore.

                 Nb of FEE   % of   % of   Inter-annotator Fscore
                 occ.        N      V      Frame   Exact Role   Partial Role
ALL [13]         17667       36     50     85.9    77.2         81.9
Break-down by notional domain
Commercial        3307       60     40     92.0    73.4         80.4
Causality         7691       30     48     79.2    74.2         80.4
Cog. Stances      7886       28     62     90.6    81.1         86.0
Communic.         2221       23     76     89.6    82.3         87.5
Break-down by POS of the FEE
V                 8834       -      -      87.6    82.8         87.1
N                 6234       -      -      86.8    68.3         72.5
other             2509       -      -      77.7    74.6         82.1

Table 1: Inter-annotator agreement for all occurrences of FEE that were independently annotated by two annotators, in total and broken down by notional domain, and by part-of-speech of the FEE. First col.: number of FEE occurrences proposed for double annotation. Next two col.: percentage of nouns and of verbs within the FEE occurrences to annotate. Last three col.: Inter-annotator agreement F1 scores for the frame assignment, exact role assignment, and partial role assignment tasks (for matching frames).

latter, we report both exact and partial match Fscores. For exact match Fscore, the common role fillers are the pairs of role fillers with both the same role and the exact same set of tokens. For partial match Fscore, we also give partial credit if two role fillers sharing some tokens have been assigned the same role: in order to calculate the amount of correct answers, for each role filler in annotation 1, we look for the role filler in annotation 2 that has the same role and the highest intersection over union ratio, and add this highest ratio to the total of correct answers.

We computed these scores for all occurrences of FEEs that have been independently annotated by two annotators so far. Results are shown in table 1. The first column provides the number of pre-annotated FEE occurrences proposed for double annotation, in total and for each domain (a FEE may belong to several domains in these counts, either because it is associated with frames from several domains, or because the frame per se pertains to several domains). The relatively low number of instances from the verbal communication domain results from this domain being only partially annotated, and containing primarily FEEs ambiguous with some of the other three domains.

Overall, agreement is roughly 86% for frames and 77% for roles of matching frames. Agreement is thus substantial, which shows that the task was well defined. The break-down by notional domain reveals that agreement varies across domains. It is computed for each domain by considering only FEEs that have at least one frame belonging to the domain. Hence it mixes information concerning ambiguity both within a domain and across domains. Overall, the best and worst agreement for frame assignment is achieved for the commercial transaction domain and the causality domain, respectively. An explanation could be that the ambiguity within the causality domain was more difficult to cope with. As far as agreement on role assignment is concerned, the break-down by part-of-speech of the FEE seems to provide the best explanation for the agreement variation across domains: role assignment agreement is much lower for nouns than for verbs. Hence, domains with a high ratio of verbs (cognitive stances, verbal communication) have a better role assignment agreement than the other two domains, which have fewer verbal FEEs. Note that the agreement achieved for verbal FEEs is on a par with that reported by Burchardt et al. (2009) for German, within the SALSA project, which focused mainly on verbal FEEs (reported agreement is 85% for frames and 86% for roles of matching frames, whereas ours for verbs is 87.6% for frames and 82.8% for roles). Our having better agreement for frames may be due to our restricting ourselves to 4 domains.

4.3. Resulting annotated corpus

We now turn to the current resulting resource. Adjudication was not fully completed at the time of writing, although it will be by the time of the conference (this concerns about 15% of the FEE occurrences). All the figures provided in this section have been obtained as if adjudication had been completed, using one annotator’s annotations as if they resulted from adjudication. Furthermore, in order to complete annotation of the domains, some FEEs were annotated by one expert only (totaling about 10% of all the annotated FEE occurrences).

Statistics are provided in table 2. We have about 100 frames and over 650 distinct FEEs, with various POS, though mainly verbs and nouns. A syntactico-semantic lexicon was extracted from the annotated corpus, providing precise information on the syntactic realization patterns of the roles of each frame+FEE pair and each frame overall.

5. Handling of specific linguistic phenomena

5.1. Non-strict semantic compositionality

As stressed in (Burchardt et al., 2009), corpus-based frame annotation runs into several kinds of full or partial semantic non-compositionality, namely idioms, light verb constructions, and metaphors. We treat these as they are in the Berkeley FrameNet (Ruppenhofer et al., 2006). Idioms are identified as multi-word expressions, for the most part frozen, whose meanings have to be understood as a whole. Such idioms can evoke frames pretty much like plain words can, and are thus recorded in the lexicon. For instance the idiom mettre (qqch) sur le compte de (qqun/qqch) (lit. to put (sth) on the account of (sb/sth), meaning to blame (sth) on (sb/sth)) can evoke the FR Attributing cause frame. Note that such idiomatic FEEs, especially the verbal ones, are potentially discontinuous.

Light verb constructions (LVC) combine a predicative noun and a verb which does not have exactly the meaning it would have in the absence of the noun. As in previous FrameNet projects, the frame is evoked by the noun, and we mark the verb as support verb, in order to easily spot such constructions. For that purpose, we used specific frames for support verbs, with only one “role”, for the noun.

[13] Note that the counts are before sense disambiguation. A given FEE is associated with a notional domain if at least one of its frames belongs to this domain. Hence the ALL count is less than the sum for all the 4 domains.

                    Nb distinct   Nb distinct   Nb senses   Nb annotation sets   % of non-dummy
                    frames        FEEs                      (≠ dummy frame)      frames
ALL                 98            662           872         12874                64.4
Commercial          19            93            103         2937                 -
Causality           11            217           252         3843                 -
Cognitive stances   40            283           356         4040                 -
Communication       36            168           210         2631                 -
N                   -             207           247         4275                 63.9
V                   -             341           483         7087                 66.6
PREP                -             25            29          530                  54.5
ADV                 -             23            31          320                  48.9
CONJ                -             21            27          299                  79.3
ADJ                 -             37            42          210                  53.6

Table 2: Statistics of the resulting annotated resource: number of distinct frames and FEEs, number of senses (association frame + FEE) and number of annotation sets. Last column: proportion of annotated occurrences that are not the Other sense frame.
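As an illustration of how the columns of Table 2 relate to one another, the sketch below derives them from a flat list of annotation sets. The record layout (`AnnotationSet`), the function `resource_stats`, the dummy frame identifier and the toy frame names are all hypothetical, and it is an assumption here that occurrences annotated with the dummy frame are left out of the frame, FEE and sense counts.

```python
from collections import namedtuple

# Hypothetical record: one annotated FEE occurrence.
AnnotationSet = namedtuple("AnnotationSet", ["lemma", "pos", "frame"])

# Hypothetical identifier for the dummy "Other sense" frame.
DUMMY_FRAME = "Other_sense"


def resource_stats(annotation_sets):
    """Compute Table 2 style counts from a list of AnnotationSet records."""
    real = [s for s in annotation_sets if s.frame != DUMMY_FRAME]
    total = len(annotation_sets)
    return {
        # distinct frames actually annotated (dummy frame excluded: assumption)
        "frames": len({s.frame for s in real}),
        # a FEE is a lemma plus a part-of-speech
        "fees": len({(s.lemma, s.pos) for s in real}),
        # a sense is the association of a frame and a FEE
        "senses": len({(s.lemma, s.pos, s.frame) for s in real}),
        # annotation sets whose frame is not the dummy frame
        "annotation_sets": len(real),
        # share of annotated occurrences that are not the dummy frame
        "pct_non_dummy": round(100.0 * len(real) / total, 1) if total else 0.0,
    }


# Toy data with made-up frame names, for illustration only.
toy = [
    AnnotationSet("vendre", "V", "Commerce_sell"),
    AnnotationSet("vendre", "V", DUMMY_FRAME),
    AnnotationSet("achat", "N", "Commerce_buy"),
    AnnotationSet("vendre", "V", "Commerce_sell"),
]
print(resource_stats(toy))
```

Grouping the same records by notional domain or by part-of-speech before calling `resource_stats` would give the per-domain and per-POS rows.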

Finally, metaphors are cases of new or non-conventionalized non-literal meaning, either lexical (i.e. concerning one word only) or multi-word. Although novelty is a continuous property, we map it to a binary feature, which distinguishes lexical metaphor from polysemy, and multi-word metaphors from idioms (using the criteria of Ruppenhofer et al. (2006)). Given our annotation workflow, polysemic senses and idioms arising from conventionalized metaphors were entered in the lexicon before annotation time. For a (non-conventionalized) metaphorical use of a lemma, the candidate frames proposed to the annotators pertained to literal meanings only. In such cases, annotators were asked to annotate the literal meaning (from the source domain in Lakoff’s view), but to flag the frame occurrence as metaphorical. Contrary to what was done in the SALSA project, we did not annotate the metaphorical meaning, because such annotation might fall outside the notional domains we targeted.

5.2. Syntactic non-locality

In Berkeley FrameNet lexicographic annotations, only those role fillers that were realized “in grammatical construction” with the FEE (Fillmore, 2007) were annotated. This includes the syntactic dependents of the FEE, but also some syntactically well-defined cases of syntactic non-locality of role fillers with respect to the FEE (raising and control, relative clauses (Ruppenhofer et al., 2006, p. 27), but also light verb constructions, copular sentences and FEEs typically modifying and thus governed by one of their roles). This choice derives from the objective of obtaining relevant patterns of how semantic valents are realized as syntactic valents.

Using this strategy means we sometimes do not annotate a role filler which is expressed in the sentence of the FEE but outside its locality domain. Because FrameNet annotations are also useful for obtaining lexical selectional preferences, we found it interesting to annotate even non-locally realized role fillers. So for instance in La terrible pression de la grande distribution pour acheter le plus bas possible asphyxiait les fabricants (the terrible pressure of supermarkets to buy at the lowest (possible price) was asphyxiating manufacturers), manufacturers and supermarkets are clearly understood and thus annotated as the Seller and the Buyer of the buying frame [14].

In the same vein, when a role filler is a non-lexical anaphor with a lexical antecedent expressed within the same sentence, we asked annotators to mark both as fillers of the same role (with flags indicating the anaphoric relation), in order to capture the syntactic regularity (when the anaphor is realized as a regular syntactic valent of the FEE) without missing a lexical role filler (the antecedent).

5.3. Null instantiation

Sometimes, a role that has been deemed conceptually necessary to a frame cannot be filled by any part of a sentence where said frame is evoked. Berkeley FrameNet keeps track of those cases of what they call null instantiation (NI), for they provide “lexicographically relevant information regarding omissibility conditions” (Ruppenhofer et al., 2006). A distinction is made between three types of NI. When a constituent is constructionally omitted, as is often the case of agents in passive sentences, the role that this constituent would have filled is labeled CNI for Constructional NI. Definite NI (DNI) covers cases in which the missing role must be interpreted from the linguistic or discourse context, whereas Indefinite NI (INI) is used for existentially understood missing roles, which for certain verbs tend to receive a stereotypical interpretation (e.g. if someone eats, it is assumed that they eat a meal).

For the French FrameNet, we have used a different set of NI subtypes. Where FrameNet uses DNI to annotate indiscriminately missing role fillers that can be found in either the linguistic or discourse context, our own Definite NI is only used when the non-instantiated role is clearly filled by an element of the linguistic context. When the role filler can be interpreted but has not been explicitly mentioned in the few surrounding sentences, we consider the role to be an Extra-linguistic NI (ENI). Only the DNI cases should be looked for in surrounding sentences. All the other cases of NI, that is those in which the missing role filler cannot be interpreted, are labeled as Unknown NI (UNI). We have

[14] A flag is used to identify non-local instantiations of FEEs. Because these are necessarily more subject to interpretation issues, annotators were asked to favor syntactic locality whenever possible, and to only annotate doubtless non-local fillers.

decided not to include a CNI-type label, as it seemed redundant to us to distinguish between lexically and syntactically omitted arguments: the information as to whether an argument’s omission is constructional can be extracted from the syntax itself. To sum up, the null instantiation types distinguish whether the interpretation of the missing role can be found in the linguistic context (DNI), in the discourse context (ENI), or is underspecified or generic (UNI).

(1) Ils revendront plus tard.
    They will-resell more late.
    (They will resell later.)

In sentence (1), in which revendront evokes the Commerce_sell frame, ils fills the Seller role, and the Goods to be resold (real estate) are labeled DNI, because they are mentioned in a previous sentence. By contrast, the only thing that can be said about the Buyer in that Commerce_sell event is that they are people likely to buy real estate. Since that information is entirely entailed by the definition of the Buyer role and the filler of the Goods role, we consider the Buyer to be UNI.

5.4. Syntactic alternations

In this section, we briefly review how productive syntactic alternations are treated in the French FrameNet. By productive, we mean those applying without many lexical exceptions, given a certain canonical syntactic valency. For French these are the passive, middle, impersonal, reflexive and causative alternations.

The treatment of syntactic alternations is not described at length in FrameNet’s guidelines, but can be inferred from the general objectives of FrameNet. On the one hand, all LUs evoking a given frame should share the same semantic arguments; on the other hand, productive syntactic alternations should be neutralized. So only those alternations that do not modify the set of participants of a predicate should be captured within the same frame. This is the case for passive, middle and impersonal alternations: personal and impersonal uses of an intransitive verb, cf. examples (2) and (3), fall under the same frame, and so do active, passive and middle voices of a transitive verb. For the middle voice, the well-known key aspect to consider is that although the agent participant (the canonical subject) cannot be locally expressed, cf. example (4), it is necessarily understood. So middle uses of a transitive verb are represented using the same frame as the transitive verb, with a null instantiation flag for the missing participant.

(2) Des complications peuvent en résulter.
    Some complications may from-it result.
    (Complications may result from it)

(3) Il peut en résulter des complications.
    It may from-it result some complications.
    (Complications may result from it)

(4) Les crises ne s’ anticipent pas (*par les traders).
    The crises NEG REF anticipate not (*by the traders).
    (One doesn’t anticipate crises)

The situation is different for the causative/inchoative alternation for some transitive verbs (roughly change of state or change of position verbs). The transitive version can be understood approximately as providing a cause for the intransitive version. But unlike in the middle construction, in the intransitive version (the neuter construction) the agent need not be understood: it is either absent or at least deprofiled. For this reason, neuter versions cannot be captured within the same frame as the transitive construction. A frame-to-frame relation is used between the frames for the transitive and the intransitive versions.

(5) Le vase peut (se) casser.
    The vase can (REF) break.
    (The vase can break)

The causative alternation also modifies the number of participants. French causative constructions involve the verb faire (to make) combining with an infinitive, e.g. (6). From the semantic point of view, except for cases in which the combination of faire and the infinitive is lexicalized, the meaning of the construction is roughly compositional: one extra semantic argument (the causer) triggers the eventuality described by the infinitive, which can roughly be analyzed as a causal relation. From the formal point of view though, it is well known that properties of causative constructions in Romance languages lead to two competing analyses (Abeillé et al., 1996), in which the combination of faire with the infinitive is either regular or forms a complex single predicate. In the syntactic annotation scheme of our target corpora, only the latter was retained: the faire verb forms a complex predicate with the infinitive verb, which appears with a non-canonical valency (hence the characterization of causative constructions as syntactic alternations). The causer appears as subject of the complex predicate, and the causee is demoted to direct or oblique object, depending on the infinitive’s transitivity.

(6) Il veut faire payer le surcoût à Paul.
    He wants-to make pay the overcost to Paul.
    (He wants to make Paul pay the overcost.)

Because the causative uses of an infinitive have an extra participant (the causer), they cannot evoke the same frame as the non-causative versions. We annotate causative constructions compositionally, which results in a syntax/semantics divergence: faire evokes the Causation frame, with the causer filling either the Cause or the Actor role, and the infinitive regularly evokes one of its frames.

Finally, the (true) reflexive alternation reduces the set of participants in that two roles are played by the same entity. Furthermore in French, the reflexive clitic se is considered semantically void, meaning that the reflexive verb has one fewer syntactic valent. So we simply annotate true reflexives by assigning two roles to the subject.

5.5. Classification of the reflexive se marker

As can be seen from the previous section, the French reflexive se clitic paradigm is highly ambiguous: it can mark (i) lexicalized se+V combinations, (ii) neuter and middle syntactic alternations (cf. examples (5) and (4) in the previous section), and (iii) true reflexive or reciprocal constructions. Depending on the status of the clitic, a given occurrence of se+V triggers a frame evoked by the V alone, or by the lexicalized combination se+V. In order to break up annotation tasks,
we used a preliminary annotation of the reflexive se’s status in context¹⁵: when faced with a se+V occurrence, the se’s status was used by the pre-annotation tool to choose whether to pre-annotate frames pertaining to the V alone (in case of middle, true reflexive or reciprocal) or to the clitic+V combination.

¹⁵ This annotation was performed by V. Combet, V. Mouilleron, B. Sagot and M. Candito on the occasion of the ASFALDA project and the deep Sequoia project (https://deep-sequoia.inria.fr).

6. Conclusion

We presented the annotation workflow for a French FrameNet, consisting of a set of frames, a lexicon, and frame and role annotations added to syntactic treebanks. The resource is focused on four notional domains (commercial transactions, cognitive stances, causality and verbal communication). Inter-annotator agreement is substantial, and the resulting resource can be used for word sense disambiguation and semantic role labeling tasks, as well as for studying syntactic valency patterns of semantic frames.

7. Acknowledgements

This work was funded by the French National Research Agency (ASFALDA project ANR-12-CORD-023), and supported by the French Investissements d’Avenir - Labex EFL program (ANR-10-LABX-0083). We warmly thank the people involved in the preceding frame and lexicon-building phase of the project, and are extremely grateful to the annotators for their work: Vanessa Combet, Noémie Faivre, Virginie Mouilleron, Marjorie Raufast, Anny Soubeille and Emilia Verzeni.

8. Bibliographic References

A. Abeillé and N. Barrier. 2004. Enriching a French treebank. In Proc. of LREC’04.

A. Abeillé, D. Godard, and P. Miller. 1996. Les causatives en français, un cas de compétition syntaxique. Langue Française, 115:62–74.

C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet project. In COLING-ACL ’98: Proc. of the Conference, pages 86–90.

H. C. Boas, editor. 2009. Multilingual FrameNets in computational lexicography: methods and applications. Trends in Linguistics. Mouton de Gruyter.

L. Borin, D. Dannélls, M. Forsberg, M. Toporowska Gronostaj, and D. Kokkinakis. 2009. Thinking Green: Toward Swedish FrameNet++. In FrameNet Masterclass and Workshop.

A. Burchardt, K. Erk, A. Frank, A. Kowalski, S. Padó, and M. Pinkal. 2006a. SALTO – A Versatile Multi-Level Annotation Tool. In Proc. of LREC’06.

A. Burchardt, K. Erk, A. Frank, A. Kowalski, S. Padó, and M. Pinkal. 2006b. The SALSA Corpus: a German corpus resource for lexical semantics. In Proc. of LREC’06.

A. Burchardt, K. Erk, A. Frank, A. Kowalski, S. Padó, and M. Pinkal. 2009. FrameNet for the semantic analysis of German: Annotation, representation and automation. In Boas (2009), pages 209–244.

M. Candito and D. Seddah. 2012. Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical. In Proc. of TALN 2012 (in French).

M. Candito, P. Amsili, L. Barque, F. Benamara, Ga. de Chalendar, M. Djemaa, P. Haas, R. Huyghe, Y. Y. Mathieu, P. Muller, B. Sagot, and L. Vieu. 2014. Developing a French FrameNet: Methodology and first results. In Proc. of LREC’14.

D. Das, N. Schneider, D. Chen, and N. A. Smith. 2010. Probabilistic frame-semantic parsing. In Proc. of NAACL HLT 2010, pages 948–956.

C. J. Fillmore. 1982. Frame semantics, pages 111–137. Hanshin Publishing Co.

C. J. Fillmore. 2007. Valency issues in FrameNet. In Thomas Herbst and Katrin Götz-Votteler, editors, Valency: theoretical, descriptive and cognitive issues, volume 187 of Trends in Linguistics, pages 129–162. Mouton de Gruyter.

B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

C. Mouton, G. de Chalendar, and B. Richert. 2010. FrameNet translation using bilingual dictionaries with evaluation on the English-French pair. In Proc. of LREC’10.

K. Ohara, S. Fujii, S. Ishizaki, T. Ohori, Hiroaki S., and R. Suzuki. 2004. The Japanese FrameNet project; an introduction. In Proc. of the Workshop on Building Lexical Resources from Semantically Annotated Corpora, LREC’04.

S. Padó. 2007. Cross-Lingual Annotation Projection Models for Role-Semantic Information. Ph.D. thesis, Saarland University. MP.

M. Palmer, D. Gildea, and P. Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, March.

J. Ruppenhofer, M. Ellsworth, M. R.L. Petruck, C. R. Johnson, and J. Scheffczyk. 2006. FrameNet II: Extended Theory and Practice. International Computer Science Institute, Berkeley, California. Distributed with the FrameNet data.

K. Kipper Schuler. 2005. VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. thesis.

D. Seddah, R. Tsarfaty, S. Kübler, M. Candito, J. Choi, R. Farkas, J. Foster, I. Goenaga, K. Gojenola, Y. Goldberg, S. Green, N. Habash, M. Kuhlmann, W. Maier, J. Nivre, A. Przepiórkowski, R. Roth, W. Seeker, Y. Versley, V. Vincze, M. Woliński, A. Wróblewska, and É. Villemonte de la Clergerie. 2013. Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages. In Proc. of the 4th Workshop on Statistical Parsing of Morphologically Rich Languages: Shared Task.

C. Subirats-Rüggeberg and M. R.L. Petruck. 2003. Surprise: Spanish FrameNet! In Proc. of the Workshop on Frame Semantics, XVII International Congress of Linguists (CIL). Matfyzpress.
