The Hebrew FrameNet Project

Avi Hayoun and Michael Elhadad Dept. of Computer Science Ben-Gurion University Beer Sheva, Israel {hayounav,elhadad}@cs.bgu.ac.il

Abstract 2 Development Process

We present the Hebrew FrameNet We started with the English FrameNet data project, describe the development (version 1.5.1) as a basis on which to build. and annotation processes and enu- We currently use the same semantic frames merate the challenges we faced along included in the original FrameNet data, in- the way. cluding definitions, participant roles inter- role and inter-frame relationships. 1 Introduction To accelerate the process of adding He- Based on the linguistic theory of Frame brew lexical units (LUs) to frames, our sys- proposed by Fillmore (Fillmore, tem suggests translations of the English 1982), the FrameNet project (Fillmore and LUs. However, annotators are free to add Baker, 2010; Ruppenhofer et al., 2010) is a new LUs on their own. We collected trans- human-annotated linguistic resource with lation pairs for most of lexical units oc- rich semantic content. FrameNet defines a curring in FrameNet from online lexico- collection of semantic frames, each contain- graphic resources and introduced lookup ing a set of frame evoking predicates called procedures as part of the online annotation Lexical Units, definitions of roles of partic- tool we developed for the project. ipants in the event described by the frame The next step in the annotation pro- and additional information describing rela- cess is the selection of exemplar sentences tionships between roles and frames. for each LU for each frame. To assist in FrameNet databases in languages other this step, we prepared a large corpus of than English have been created in multi- about 2M sentences collected from a vari- ple languages, including German, Spanish, ety of Modern Hebrew sources (newspaper, Japanese, Swedish. Inspired by Swedish blogs, wikipedia) and pre-processed these FrameNet++ (Friberg Heppin and Voion- sentences with full morphological analysis, maa, 2012) and the ideas put forth by and automatic syntactic (both con- Petruck (Petruck, 2005; Petruck, 2009), we stituent and dependency parsing). We gave have started the development of a Hebrew access to this annotated corpus from the FrameNet, a semi-automatic translation of FrameNet annotation tool through a full the English FrameNet. text search where specific lexical items can As part of the development and adap- be searched and matched irrespective of tation process, we were confronted with is- their morphological inflection. In addition, sues specific to the Hebrew language, which annotators can refine the search query by we discuss. We first present the process adding specification of part of speech, mor- we have adopted to develop the Hebrew phological features (number, gender, per- FrameNet resource, the supporting tools we son etc), and syntactic context (items re- developed and provide information on the lated to other items, e.g., a appearing linguistic issues. as the subject of a verb). We apply the syn- tactic diversification algorithm of (Borin et We solved this issue by adding a “contigu- al., 2012) to the search result, so that the ous” flag to the multi-word LUs in our top N sentences presented to the annota- dataset, which indicates their expected be- tor exhibit a wide range of syntactic con- havior. structs. Annotators can quickly create a range of syntactic examples for a single se- 3.2 Role-Bearing Phrases mantic concept. Embedded in Morphology The final stage in the annotation pro- In some cases, a role-bearing phrase is em- cess of a single Frame consists of anno- bedded in another word in the text, due tating occurrences of frame elements with to the rich morphology which exists in He- roles within the selected example sentences. brew. For example, consider the follow- Spans of text are selected by annotators to ing exemplar sentence from the Abandon- If the possessive .זנח את אשתו :fill the various roles encoded in the seman- ment frame ,were a separate token אשתו in the word ו .tic frame definition the correct annotation would be 2.1 Project Status זנח את [אשת Theme] [ו As of May 2015, the Hebrew FrameNet [Agent project contains 3, 006 LUs across 167 frames, with an average of 18 LUs per Since this is not the case, we needed a frame. Additionally, there are 423 anno- way to encode the fact that the Agent is tated exemplar sentences pending final re- embedded in another phrase. We followed view across 66 LUs, with an average of 6.41 Petruck’s recommendation and borrowing sentences per LU. Before starting a more from the Spanish FrameNet project, we an- intense annotation campaign, we are now notate such roles as “externally instanti- reviewing the linguistic issues faced dur- ated”, meaning no single phrase or token ing the initial annotation trial and assessing in the sentence can represent the role. the potential to speed up annotation with semi-supervised expansion. 4 Semi-Supervised Dataset Expansion 3 Hebrew-Specific Issues F¨urstenauand Lapata (F¨urstenauand La- We identified the following issues specific to pata, 2012) presented a semi-supervised Hebrew in our initial annotation effort. method for automatically expanding the FrameNet corpus, using structural align- 3.1 Multi-word Lexical Units ment of dependency-parse trees. They re- While the English FrameNet project con- ported a success rate of about 33% of the tains several multi-word LUs (MWLUs), projected annotations as completely cor- such as give up.v and turn in.v, they are rect. We implemented this algorithm in annotated as contiguous units. In Hebrew, the Hebrew FrameNet tool to assist anno- we found so many morphological and syn- tators; gold annotations are used to seed tactic variants of MWLUs that we decided automatic annotations of candidate sen- to enable annotation of discontinous units. tences from Wikipedia. We computed a is a contiguous unit, on the Hebrew מזל רע ,For example -is the LU in both of the follow- corpus taking into account syntactic rela יצר קשר but ing sentences illustrating the wide range of tions in the dependency trees as opposed discontinuous constructions we observed: to n-grams following (Levy and Goldberg, 2014). Candidate sentences are extracted using a lexical similarity measure, based החייל יצר קשר עם מפקדו. .on the the computed word embeddings הוא יצר איתי קשר אתמול. The projected annotations are manually re- viewed before being accepted into the He- brew FrameNet corpus. We will report on the accuracy of this process.

References Lars Borin, Markus Forsberg, Karin Friberg Hep- pin, Richard Johansson, and Annika Kjellands- son. 2012. Search result diversification meth- ods to assist lexicographers. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 113–117. Association for Computational Linguistics.

Charles J Fillmore and Collin Baker. 2010. A frames approach to semantic analysis. The Ox- ford handbook of linguistic analysis, pages 313– 339.

Charles J. Fillmore, 1982. Frame semantics, pages 111–137. Hanshin Publishing Co., Seoul, South Korea.

Karin Friberg Heppin and Kaarlo Voionmaa. 2012. Practical aspects of transferring the english berkeley framenet to other languages. In SLTC 2012 The Fourth Swedish Language Technology Conference Lund, October 24-26, 2012 Proceed- ings of the Conference, pages 28–29.

Hagen F¨urstenauand Mirella Lapata. 2012. Semi- supervised via structural alignment. Comput. Linguist., 38(1):135–171, March.

Omer Levy and Yoav Goldberg. 2014. Depen- dencybased word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 2, pages 302–308.

Miriam R.L. Petruck. 2005. Towards hebrew framenet. Kernerman Dictionary News, 13:12.

Miriam Petruck, 2009. Typological Considerations in constructing a Hebrew FrameNet, pages 183– 208. Mouton de Gruyter, Berlin.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended The- ory and Practice. Technical report, Interna- tional Computer Science Institute in Berkeley, 09.