Developing a large scale FrameNet for Italian: the IFrameNet experience

Roberto Basili° Silvia Brambilla§ Danilo Croce° Fabio Tamburini§ ° § Dept. of Enterprise Engineering Dept. of Classic Philology and Italian Studies University of Rome Tor Vergata University of Bologna {basili,croce}@info.uniroma2.it [email protected], [email protected]

ian Portuguese, German, Spanish, Japanese, Swe- Abstract dish and Korean. All these projects are based on the idea that English. This paper presents work in pro- most of the Frames are the same among languages gress for the development of IFrameNet, a and that, thanks to this, it is possible to adopt large-scale, computationally oriented, lexi- Berkeley’s Frames and FEs and their relations, cal resource based on Fillmore’s frame se- with few changes, once all the language-specific mantics for Italian. For the development of information has been cut away (Tonelli et al. 2009, IFrameNet linguistic analysis, corpus- Tonelli 2010). processing and machine learning techniques With regard to Italian, over the past ten years are combined in order to support the semi- several research projects have been carried out at automatic development and annotation of different universities and Research Centres. In par- the resource. ticular, the ILC-CNR in Pisa (e.g. Lenci et al. 2008; Johnson and Lenci 2011), FBK in Trento (e.g. Italiano. Questo articolo presenta un work Tonelli et al. 2009, Tonelli 2010) and the Universi- in progress per lo sviluppo di IFrameNet, ty of Rome, Tor Vergata (e.g. Pennacchiotti et al. una risorsa lessicale ad ampia copertura, 2008, Basili et al. 2009) proposed automatic or computazionalmente orientata, basata sulle semiautomatic methods to develop an Italian teorie di Semantica dei Frame proposte da FrameNet. However, as of today, a resource even Fillmore. Per lo sviluppo di IFrameNet so- remotely equivalent to Berkeley’s FrameNet (BFN) no combinate analisi linguistica, corpus- is still missing. processing e tecniche di machine learning al As a of this kind is useful in fine di semi-automatizzare lo sviluppo della many computational applications (such as Human- risorsa e il processo di annotazione. Robot interaction), a new effort is currently being jointly made at the universities of Bologna and 1 Introduction Roma, Tor Vergata. The IFrameNet project aims to develop a large-coverage FrameNet-like resource Firstly developed at the University of Berkeley for Italian, relying on robust and scalable methods, (California) in 1997, FrameNet adopts theories in which the automatic corpus processing is con- from Frame (Fillmore 1976, 1982, sistently integrated with manual . It 1985) to NLP and explains ’ meanings ac- builds upon the achievements of previous projects cording to the semantic frames they evoke. It illus- that automatically harvested FrameNet LUs ex- trates semantic frames (i.e. schematizations of pro- ploiting both distributional and WordNet based totypical events, relations or entities in the reality), models (Pennacchiotti et al. 2008). Since the LUs through the involved participants (called frame el- induction is a noisy process, the data thus obtained ements, FEs) and the evoking words (or, better, the need to be manually refined and validated. lexical units, LUs). Moreover, FrameNet aims to The aim is also to provide Sample Sentences for give a valence representation of the lexical units LUs with the highest corpus frequency. On the one and underline the relations between frames and side, they will be derived from already existing between frame elements (Baker et al. 1998). resources such as the HuRIC corpus (Bastianelli The initial American project has since been ex- 2014) or the EvalIta2011 FLaIT task data: FBK set tended to other languages: French, Chinese, Brazil- (Tonelli, Pianta 2008) and ILC set (Lenci et al. 2012). On the other side, candidate sentences will

59 59 also be extracted through semi-automatic distribu- tional and paradigmatic lexical information (i.e. tional analysis of a large corpus - i.e. CORIS (Ros- derived from WordNet) to assign unknown LUs to sini Favretti et al. 2002) - and refined through lin- frames. In particular, distributional models are used guistic analysis and manual validation of data thus to select a list of frames suggested by the corpus’ obtained. evidence and then the plausible lexical senses of the unknown LU are used to re-rank proposed frames. 2 The development of the large scale In order to rely on comparable representations IFrameNet resource for LUs and sentences for transferring semantic information from the former to the latter, we ex- The need for a large-scale resource cannot be satis- ploit Distributional Models (DM) of Lexical Se- fied without resorting to a semi-automatic process mantics, in line with (Pennacchiotti et al. 2008) and for the gathering of linguistic evidence, selection of (De Cao et al. 2008). DMs are intended to acquire lexical examples as well as the annotation of the semantic relationships between words, mainly by targeted texts. This work is thus at the cross roads looking at the usage. The foundation for these of linguistic theoretical investigation, corpus analy- models is the Distributional Hypothesis (Harris sis and natural language processing. 1954), i.e. words that are used and occur in the On the one hand, the matching between LUs same “contexts” tend to be semantically similar. A and frames is always granted through manual lin- context is a set of words appearing in the neighbor- guistic validation applied to the data in the devel- hood of a target predicate word (e.g. a LU). In this opment stage. For every Frame the correctness of sense, if two predicates share many contexts then the inducted LUs is analysed and the ‘missing’ they can be considered similar in some way. Alt- LUs, that is the BFN LUs’ translations, which are hough different ways for modeling word semantics absent in the inducted LU’s list, are detected. exist (Sahlgren 2006; Pado and Lapata 2007; On the other hand, most choices rely on large Mikolov et al. 2013; Pennington et al. 2014), they sets of corpus examples, as made available by CO- all derive vector representations for words from RIS. Finally, the scaling to large sets of textual ex- more or less complex processing stages of large- amples is supported by automatically searching scale text collections. This kind of approach is ad- candidate items through semantic pre-filtering over vantageous in that it enables the estimation of se- the corpus: frame phenomena are here used as que- mantic relationships in terms of vector similarity. ries while intelligent retrieval and ranking methods From a linguistic perspective, such vectors allow are applied to the corpus material to minimize the for some aspects of lexical semantics to be geo- manual effort involved. metrically modelled, and to provide a useful way In the following section, we will sketch the to represent this information in a machine-readable main stages of the process that integrate the above format. Distributional methods can model different paradigms. semantic relationships, e.g. topical similarities (if vectors are built considering the occurrence of a 2.1 Integrating corpus processing and lexical word in documents) or paradigmatic similarities (if analysis for populating IFrameNet vectors are built considering the occurrence of a word in the (short) contexts of another word The beneficial contribution of the interaction be- (Sahlgren 2006)). In such models, words like run tween corpus processing techniques and lexical and walk are close in the space, while run and read analysis for the semi-automatic expansion of the are likely to be projected in different subspaces. FrameNet resource has been discussed since (Pen- Here, we concentrate on DMs mainly devoted to nacchiotti et al. 2008), where LU induction is pre- modelling paradigmatic relationships, as we are sented as the task of assigning a generic lexical unit more interested in capturing phenomena of quasi not yet present in the FrameNet database (the so- synonymy, i.e. that tends to called unknown LU) to the correct frame(s). The preserve meaning. number of possible classes (i.e. frames) and the problem of multiple assignment make it a challeng- 2.2 The development cycle ing task. This task is discussed in (Pennacchiotti et In the following paragraphs, we outline the different al. 2008, De Cao et al. 2008, Croce and Previtali stages in the development process. Each stage cor- 2010), where different models combine distribu- responds to specific computational processes.

60 60 Figure 1: Three lexical clusters for the frames triggered by the verb abandon.v: pairs closed in the map correspond to (paradigmatic) semantic similar words and frames

Validation of existing resources. At this stage, Lexical clustering is important here as specific the existing resources, dating back to previous space regions enclosing the instance vectors of work, are analysed and manually pruned of errors some considered LUs correspond to semantically such as lexical units wrongly assigned to frames coherent lexical subsets. This is a priming function (e.g. ‘asta’ or ‘colmo’ to the Frame for mapping unseen word vectors to frames, as ap- ‘BODY_PARTS’), or words never assigned to their plied in (De Cao et al. 2008): the centroids of the correct frame, for instances the LU ‘piede’ or possibly multiple clusters generated by the known ‘mano’ for the Frame ‘BODY_PARTS’. LUs of a given frame f are used to detect all regions expressing f and thus predict the predicate f over All the acquired Italian LUs have been com- previously unseen words and sentences. Examples pared, frame by frame, to BFN’s ones, using bilin- of semantically coherent regions evoked by the gual dictionaries (e.g. Oxford bilingual dictionary) verb abandon for the English Framenet are report- and WordNet in order to verify the correctness of ed in Fig. 1. Here different lexical clusters for a matching between lexical and frames. Over the given frame (i.e. DEPARTING) are depicted while 15,134 automatically acquired ⟨LU, frame⟩ pairs different frames (e.g. DEPARTING, QUIT- (6,670 nouns and 8,464 verbs and adjectives), TING_A_PLACE, COLLABORATION) are also evoked 7,377 LUs have been considered correctly assigned by the verb. It should be noted that in the figure (2,506 verb and adjective and 4,871 noun pairs). distances in the two-dimensional plot correspond to In addition, bilingual dictionaries, ItalWordNet distances between the vectors, and MultiWordNet have been used to manually while each lexical cluster is expressed as the cen- insert a list of missing lexical entries for each troid of its member vectors. frame. At the end of the process, the resulting vali- The distributional information has been acquired dated and refined ⟨LU,frame⟩ amount to 7,902 for the considered 7,902 LUs from CORIS and (5,128 nouns and 2,774 verbs and adjectives). used to support the LU mapping and the sentence Corpus processing and lexical modeling. At validation. In fact, given a sentence s containing a this stage, the LUs made available from manual target LU l, a specific geometrical representation validation are used to model distributionally the for s can be derived by linearly combining all vec- individual frames. Firstly, distributional corpus tors representing words w surrounding l in sentence analysis is applied to map individual LUs into dis- s. This duality property allows the embedding tributional vectors. A distributional model will be space to represent sentences s, lexical units l as well acquired from the CORIS corpus by applying the as generic words w. This enables to model the rele- neural method presented in (Mikolov et al. 2013). vance of a frame f for an incoming sentence s It will enable the acquisition of geometrical repre- through the distance d(f,s) between vectors f related sentations for words in a high dimensional space to a centroid for a frame f and the vector s of the where distance reflects the paradigmatic relation sentence s. It corresponds to a confidence measure among words. This model can also be adopted to computed for a rule such as: build a representation for sentences, as traditional- “s is a valid example of the usage of frame f ” ly carried out by Distributional Semantic models, e.g. (Landauer and Dumais 1997) or (Mitchell and The open aspects of the above semi-automatic Lapata, 2010). process are the following:

61 61 I. How to design a suitable representation 3 Status of the Project and Perspective (centroid or model) for a frame f Views II. How to define the vector for a sentence s III. How to compute the distance function d(f,s) Although the general software architecture for the project progress is available, the overall process The current research activity is focusing on the described above has not been fully accomplished. best solution for these issues and part of our exper- Current material covers a set of 554 frames and imental activity is devoted to assess these design 7,902 lexical units, of which 2,604 verbs, 5,128 choices, as discussed in Section 3. nouns and 170 adjectives. The average number of occurrences for each of these selected words is First Lexical Analysis and Validation. A further higher than 9,400, although there are still 508 stage for the resource development focuses on the words not present in CORIS. selection of a significant sample of LUs, chosen on All these occurrences correspond to a number the basis of their high semantic salience and for of about 70 millions non validated and unsorted their high number of occurrences in the corpus sentences. In the rest of the paper, we describe the (primary LUs). By relying on the method described outcome of the First Lexical Analysis and Valida- above, we use the distributional representation of tion stage: its aim is to trigger the semi-automatic words, lexical units and sentences, to gather CO- learning and tagging of the whole corpus, accord- RIS sentences s where a LU occurs and evaluate its ing to the methods suggested in section 2.2. suitability as an example for the evoked f. This de- cision function is based on the geometric distance 3.1 Empirical Investigation: First Lexical d(f,s) that can be computed over a large number of Analysis and Validation sentences s. When this step is carried out in CO- The stage First Lexical Analysis and Validation has RIS, the validation of the acquired candidate sen- been currently accomplished. The three research tences allows for positive examples of a frame f questions posed above: (I) the modelling of a frame to develop quickly: this is used to trigger super- f, (II) the sentence representation and (III) the defi- vised learning of f. nition of a distance function able to model similari- The manually validation in fact confirms the ty between sentences. proper correspondence between automatically se- About the problem (I) two approaches are pos- lected sentences and LUs that evoke a targeted sible. We can model a frame via clustering its lexi- frame f. It produces novel seed examples for f: the- cal units and applying the method described in se will serve as a training set for a semi-automatic (Pennacchiotti et al. 2008, De Cao et al. 2008). On stage of resource expansion. the contrary, we can adopt a supervised technique. A frame f is represented as the target class of in- Semi-automatic resource expansion. The ac- stances corresponding to ⟨s,l⟩ pairs, where s is an quired distributional model will support the semi- input sentence and l is a lexical unit: a statistical automatic expansion of the seed set, by selecting classifier is trained to map s,l into a confidence the most semantically similar word to the seed set ⟨ ⟩ and assigning them to frames by applying the value and its output h(s,l,f) corresponds to the sys- methodologies suggested in (Pennacchiotti et al. tem’s confidence that the sentence 2008, De Cao et al. 2008, Croce and Previtali “f is the frame evoked by l in s” 2010). Moreover, the same distributional model is true. Notice that the pair ⟨s,l⟩ can be expressed will support the assignment process of sentences to as an instance by combining the embedding vector frames. We will in fact investigate semi-supervised l of its lexical unit l with a vector s for s. models based on clustering techniques (Pennac- As a solution for the problem (II) we define s as chiotti et al. 2008) or other supervised approaches the linear combination of vectors w, for each word such as Support Vector Machines as in (Croce and w in s, i.e. s = ∈ w . Previtali 2010). Σw s The above formulation allows to define the Final Validation and Release. The extracted sen- classification task as follows: tences will be ordered by decreasing probability, Given a sentence s including a word l as a po- according to their distributional collocation, and a tential frame evoking LU, Find the frame f that list of 15 to 20 candidates per LU will be provided. characterizes l in s. This list will be manually validated. The aim is to The solution of the above problem over a ⟨s,l⟩ provide at least 4 sample sentences for each of the pair would also be a useful solution for the problem primary LUs. (III), as the confidence h(s,l,f) in the classification

62 62 of a sentence s in a frame f for l can be retained as uation over a set of 3261 frames, the ones with the inverse of the target distance function d(f,l) lo- more than 5 lexical units in the initial lexicon. In cal to the sentence. this way, we selected 1,095 different LUs, repre- The major problem with the above formulation sented as an embedding vector in the wordspace. is that the training of the statistical classifier is not On average, we have 12 LU per frame, and every possible without the availability of useful examples individual lexical entry l appears in about 1.88 of different frame f. The idea is thus to develop frames. The baseline of a classification task that ways to derive from CORIS the proper candidates s maps a sentence s including a lexical unit into its for f through the knowledge of some of its LUs. In own frame is about 35%, as for the ambiguity char- the bootstrapping stage, we define as virtual exam- acterizing most frequent entries. ples the pairs ⟨l,{l}⟩ that are retained as positive We asked three annotators to evaluate individual examples for the frame f, for every l that is a known triples ⟨l, s, f⟩ validating the system proposal. Four lexical unit for f. In our approach, an example is main cases where possible: thus obtained by modelling the sentence s as a sin- • MISSING FRAME. The sentence s is not mani- gleton {l}, i.e. the lexical unit l. festing any of the frames f evoked by the lexi- A statistical classifier considers every known cal unit l, but corresponds to a frame not yet LU as an individual (positive) example and can be present in the lexicon for l. In this case the algo- applied to every LU in our initial resource (i.e. rithm cannot provide the suitable frame, as it 7,902 for the 554 frames). cannot generate a novel frame. In synthesis, the method works as follows. First, • NOT APPLICABLE. The sentence s does not con- for every lemma w in the corpus, an n-dimensional tain an occurrence of the lexical unit l in one of embedding vector w is derived, according to its proper senses: this case is typical for phrase- (Mikolov et al. 2013). As a side effect, for every ological uses of a verb such as morire di freddo, LU l of each known frame f, the lexical embedding andare di fretta, … that do not directly corre- vector l is used to build the example (l, l) for the spond to lexical predicates and thus cannot be LU sentence pair: ⟨l, {l}⟩. treated through the lexical embedding vectors. A multiclass-statistical categorizer is trained for • CORRECT/INCORRECT, when the outcome every frame f for which at least 5 examples (i.e. 5 argmaxf’ { h(l,s,f’) } is correct (or incorrect) as different LUs) where available. the frame evoked by l in s is exactly (or not) f. When applied to an incoming sentence s includ- According to the above method annotators validat- ing a LU l, the classifier outcome h(l,s, f) is said to ed 667 sentences for 113 frames and 212 different accept the frame f if: verbal lexical units. The analysis resulted into a • f belongs to the set of frames evoked by l precision (i.e. the number of correct candidate • f = argmaxf’ { h(l,s,f’) } frames emitted by the algorithm w.r.t. the number of valid cases, that is all but the MISSING FRAME or For every sentence s including a frame evoking NOT APPLICABLE cases) is 75,2%, well beyond the lexical unit l, the above function suggests one can- 35% baseline. The method could be applied onto didate frame among the possibly multiple ones. the 74,5% of the sentences, including CORRECT When the scoring function h is negative every- cases and MISSING FRAME cases. We neglected in where (e.g. with the SVM formulation of a classifi- this coverage score the NOT APPLICABLE cases that cation task), the sentence is rejected and is not con- amount to 44 sentences, i.e. about 6,4%. sidered a valid example for future iterations. Examples of the correct assignment of the algo- The application of this method to the CORIS rithm on quite ambiguous verbs, such as finire (i.e. corpus has been carried out applying a multi- to end, in frames ACTIVITY_FINISH, classifier SVM with linear kernel to the 2n- CAUSE_TO_END and KILLING) or rivelare (i.e. to dimensional vectors of each pair ⟨l, {l}⟩. Starting reveal, in frames REVEAL_SECRET, OMEN, EVI- from the lexicon validated in the first stages, the DENCE) are the following: SVM has been able to label over 2 million sentenc- La vicenda avrebbe potuto [finire] lì , ma il prefetto es. ACTIVITY_FINISH di Nuoro fece presentare ... 3.2 Empirical Investigation: Current Results In prova si è [rivelato]EVIDENCE ad altissimo livello sia sull' In order to evaluate the proposed supervised classi- asciutto sia sul ... fication method for the stage “First Lexical Analy- 1 sis and Validation” we run and experimental eval- By keeping the frames that include at least 4 lexical units the number of targeted frames grows to 371.

63 63 An example of Missing Frame is BEAT_OPPONENT Johnson, M. And Lenci, A. (2011). Verbs of visual per- for the verb battere in ception in Italian FrameNet, Advances in Frame Se- mantics, 3(1), 9–45 ... impegnato a fornire quante più informazioni possibili, anche Landauer, T. and Dumais, S. (1997). A solution to Pla- per [battere]BEAT_OPPONENT la concorrenza dei siti Ipsoa e il ... to’s problem: The theory of as the lexicon of the verb battere only includes the acquisition, induction and representation of frames CAUSE_HARM, CORPORAL_PUNISHMENT knowledge. Psychological Review, 104. and EXPERIENCE_BODI-LY_HARM. Lenci, A, Johnson, M, Lapesa, G. (2010). Building an The experiments only run over verbal lexical Italian FrameNet through Semi-automatic Corpus units will be extended soon to nouns and adjec- Analysis. Proceedings of LREC 2010. Malta. tives. However, the encouraging precision reached Lenci, A., Montemagni, S., Venturi, G, Cutrullà, M. G. by the method allows for direct use it in an iterative (2012). Enriching the ISST-TANL Corpus with active learning schema, where the more ambiguous Semantic Frames in Proceedings of LREC 2012, sentences found and annotated within a specific Istanbul, Turkey training stage are used to train the system at the next stage. We expect this to speed up the lexicon Mikolov, T., Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). Efficient estimation of word representations development process and to allow bootstrapping in vector space. CoRR abs/1301.3781. with fewer resources. The lexicon will be made http://arxiv.org/abs/1301.3781. available for crowdsourcing further annotations and delivered incrementally in the next few months. Mitchell, J. and Lapata, M. (2010). Composition in dis- tributional models of semantics. Cognitive Science, 34(8):1388–1429. References Pado, S. and Lapata, M. (2007). Dependency-based con- Baker C. F., Fillmore, C. J., Lowe, J. B.. (1998). The struction of semantic space models. Computational Berkeley FrameNet project. In: COLING '98 Linguistics, 33(2):161–199. Proceedings of COLING '98, 1. Canada, 86-90. Pennacchiotti M., De Cao D., Basili R., Croce D., Roth Basili R., De Cao D., Croce D., Coppola B., Moschitti M. (2008). Automatic induction of FrameNet lexical A. (2009). Cross-Language Frame Semantics units. In: Proceedings of the EMNLP 2008, Hawaii Transfer in Parallel Corpora. In: Proceedings of the CICLing 2009, Best Paper Award. Mexico Pennington, J., Socher, R. and Manning, C. (2014). GloVe: Global Vectors for Word Representation, In Bastianelli, E., Castellucci, G., Croce, D., Iocchi, L., Proceedings of EMNLP 2014, 1532-1543. Basili, R., & Nardi, D. (2014). HuRIC: a Human Ro- bot Interaction Corpus. In Proceeings of LREC 2014, Rossini Favretti R., Tamburini F., De Santis C. (2002). 4519-4526. CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model. In A Rainbow of Croce, D. and Previtali, D. (2010). Manifold learning for Corpora: and the Languages of the the semi-supervised induction of framenet predicates: World, Lincom-Europa, Munich, 27-38. an empirical investigation. In Proceedings of GEMS ’10, pages 7–16, Stroudsburg, PA, USA. Sahlgren, M.. (2006). The Word-Space Model. Ph.D. thesis, Stockholm University. De Cao D., Croce D., Pennacchiotti M., Basili R. (2008). Combining word sense and usage for modeling frame Tonelli, S. and Pianta, E. (2008). Frame information semantics. In Proceedings of STEP 2008, Italy transfer from English to Italian. In Proceedings of LREC, Marrekech, Morocco De Cao D., Croce D., Basili R. (2010). Extensive Evaluation of a FrameNet-WordNet mapping Tonelli, S, Pighin, D, Giuliano, C, Pianta, E. (2009). resource. In: Proceedings of the LREC 2010, Malta. Semi-automatic Development of FrameNet for Ital- ian. In Proceedings of the FrameNet Workshop and Fillmore, C.J. (1985). Frames and the semantics of Masterclass, Milano, Italy. Milan, Italy understanding. Quaderni di Semantica, VI(2), 222-254. Tonelli, S. and Pighin, D. (2009). ‘New Features for Fillmore, Charles J. (1976). Frame semantics and the FrameNet - WordNet Mapping’, in Proceedings of nature of language, Annals of the New York Acade- CoNLL 2009, Boulder, Colorado, 219–227 my of Sciences: Conference on the Origin and Devel- opment of Language and Speech, vol. 280, pp. 20-32 Tonelli, S. (2010). “Semi-automatic techniques for ex- tending the FrameNet lexical database to new lan- Fillmore, C. J. (1982). Frame semantics. Linguistics in guages”, Università Ca’ Foscari, Venezia the morning calm, pp. 111-137. Venturi G., Lenci A., Montemagni S., Vecchi E., Sagri Harris, Z. (1954). Distributional structure. In Jerrold J. M., Tiscornia D., Agnoloni T. (2009). Towards a Katz and Jerry A. Fodor, editors, The Philosophy of FrameNet Resource for the Legal Domain. In Linguistics, New York. Oxford University Press. Proceedings of LOAIT 2009. Barcelona, Spain

64 64