Multimodal Frame Identification with Multilingual Evaluation

Teresa Botschen§†, Iryna Gurevych∗§†, Jan-Christoph Klie∗†, Hatem Mousselly-Sergieh∗†, Stefan Roth∗§‡
§Research Training Group AIPHES, †Ubiquitous Knowledge Processing (UKP) Lab, ‡Visual Inference Lab
Department of Computer Science, Technische Universität Darmstadt
www.{aiphes, ukp, visinf}.tu-darmstadt.de

Abstract

An essential step in FrameNet Semantic Role Labeling is the Frame Identification (FrameId) task, which aims at disambiguating a situation around a predicate. Whilst current FrameId methods rely on textual representations only, we hypothesize that FrameId can profit from a richer understanding of the situational context. Such contextual information can be obtained from common sense knowledge, which is more present in images than in text. In this paper, we extend a state-of-the-art FrameId system in order to effectively leverage multimodal representations. We conduct a comprehensive evaluation on the English FrameNet and its German counterpart SALSA. Our analysis shows that for the German data, textual representations are still competitive with multimodal ones. However, on the English data, our multimodal FrameId approach outperforms its unimodal counterpart, setting a new state of the art. Its benefits are particularly apparent in dealing with ambiguous and rare instances, the main source of errors of current systems. For research purposes, we release (a) the implementation of our system, (b) our evaluation splits for SALSA 2.0, and (c) the embeddings for synsets and IMAGINED words.1

1 Introduction

FrameNet Semantic Role Labeling analyzes sentences with respect to frame-semantic structures based on FrameNet (Fillmore et al., 2003). Typically, this involves two steps: first, Frame Identification (FrameId), capturing the context around a predicate (frame evoking element) and assigning a frame, basically a word sense label for a prototypical situation, to it; second, Role Labeling, i.e. identifying the participants (fillers) of the predicate and connecting them with predefined frame-specific role labels. FrameId is crucial to the success of the overall pipeline, as FrameId errors account for most wrong predictions in current systems (Hartmann et al., 2017). Consequently, improving FrameId is of major interest.

The main challenge and source of prediction errors of FrameId systems are ambiguous predicates, which can evoke several frames, e.g., the verb sit evokes the frame Change_posture in a context like 'a person is sitting back on a bench', while it evokes Being_located when 'a company is sitting in a city'. Understanding the predicate context, and thereby the context of the situation (here, 'Who / what is sitting where?'), is crucial to identifying the correct frame for ambiguous cases.

State-of-the-art FrameId systems model the situational context using pretrained distributed word embeddings (see Hermann et al., 2014). Hence, it is assumed that the context of the situation is explicitly expressed in the text. However, language understanding involves implicit knowledge, which is not mentioned but still seems obvious to humans, e.g., 'people can sit back on a bench, but companies cannot', 'companies are in cities'. Such implicit common sense knowledge is obvious enough to be rarely expressed in sentences, but is more likely to be present in images. Figure 1 takes the ambiguous predicate sit to illustrate how images can provide access to implicit common sense knowledge crucial to FrameId.

∗ named alphabetically
1 https://github.com/UKPLab/naacl18-multimodal-frame-identification

Figure 1: Example sentences demonstrating the potential benefit of images for ambiguous predicates.

When looking at the participants of events, FrameId has commonalities with event prediction tasks. These aim at linking events and their participants to script knowledge and at predicting events in narrative chains. Ahrendt and Demberg (2016) argue that knowing about the participants helps to identify the event, which suggests the need for implicit context knowledge also for FrameId. This specifically applies to images, which can reflect properties of the participants of a situation in an inherently different way, see Fig. 1.

We analyze whether multimodal representations grounded in images can encode common sense knowledge to improve FrameId. To that end, we extend SimpleFrameId (Hartmann et al., 2017), a recent FrameId model based on distributed word embeddings, to the multimodal case and evaluate for English and German. Note that there is a general lack of evaluation of FrameId systems for languages other than English. This is problematic as other languages yield different challenges; German, for example, due to long-distance dependencies. Also, word embeddings trained on different languages have different strengths with respect to ambiguous words. We elaborate on insights from using different datasets by language.

Contributions. (1) We propose a pipeline and architecture of a FrameId system, extending state-of-the-art methods with the option of using implicit multimodal knowledge. It is flexible toward modality and language, reaches state-of-the-art accuracy on English FrameId data, clearly outperforming several baselines, and sets a new state of the art on German FrameId data. (2) We discuss properties of language and meaning with respect to implicit knowledge, as well as the potential of multimodal representations for FrameId. (3) We perform a detailed analysis of FrameId systems. First, we develop a new strong baseline. Second, we suggest novel evaluation metrics that are essential for assessing ambiguous and rare frame instances. We show our system's advantage over the strong baseline in this regard and by this improve upon the main source of errors. Third, we analyze gold annotated datasets for English and German, showing their different strengths. Finally, we release the implementation of our system, our evaluation splits for SALSA 2.0, and the embeddings for synsets and IMAGINED words.

2 Related Work

2.1 Frame identification

State-of-the-art FrameId systems rely on pretrained word embeddings as input (Hermann et al., 2014). This proved to be helpful: those systems consistently outperform the previously leading FrameId system SEMAFOR (Das et al., 2014), which is based on a handcrafted set of features. The open-source neural network-based FrameId system SimpleFrameId (Hartmann et al., 2017) is conceptually simple, yet yields competitive accuracy. Its input representation is a concatenation of the predicate's pretrained embedding and an embedding of the predicate context, where the dimension-wise mean of the pretrained embeddings of all words in the sentence is taken as the context. In this work, we first aim at improving the representation of the predicate context using multimodal embeddings, and second at assessing the applicability to another language, namely German.
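For concreteness, the following is a minimal sketch of this input representation; the function names and the out-of-vocabulary handling are our own assumptions, and the released implementation may differ in detail:

```python
import numpy as np

def context_embedding(words, emb, dim=300):
    # Dimension-wise mean of the pretrained embeddings of all words in
    # the sentence; out-of-vocabulary words are simply skipped here.
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def input_representation(predicate, words, emb, dim=300):
    # Concatenation of the predicate's embedding and the context embedding.
    pred_vec = emb.get(predicate, np.zeros(dim))
    return np.concatenate([pred_vec, context_embedding(words, emb, dim)])
```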
Common sense knowledge for language understanding. Situational background knowledge can be described in terms of frames (Fillmore, 1985) and scripts (Schank and Abelson, 2013). Ahrendt and Demberg (2016) report that knowing about a script's participants aids in predicting events linked to script knowledge. Transferring this insight to FrameId, we assume that a rich context representation helps to identify the sense of ambiguous predicates. Addressing ambiguous predicates where participants have different properties depending on the context, Feizabadi and Padó (2012) give some examples where the location plays a discriminating role as participant: motion verbs that have both a concrete motion sense and a more abstract sense in the cognitive domain, e.g., struggle, lean, follow.

Frame identification in German. Shalmaneser (Erk and Pado, 2006) is a toolbox for semantic role assignment on FrameNet schemata of English and German (integrated into the SALSA project for German). Shalmaneser uses a Naive Bayes classifier to identify frames, together with features for a bag-of-words context with a window over sentences, the lemma and part of speech of the target word, and dependency annotations. They report an F1 of 75.1 % on FrameNet 1.2 and 60 % on SALSA 1.0. These scores are difficult to compare against more recent work as the evaluation uses older versions of the datasets and custom splits.

Shalmaneser requires software dependencies that are not available anymore, hindering application to new data. To the best of our knowledge, there is no FrameId system evaluated on SALSA 2.0.

Johannsen et al. (2015) present a simple, but weak translation baseline for cross-lingual FrameId. A SEMAFOR-based system is trained on English FrameNet and tested on German Wikipedia sentences, translated word-by-word to English. This translation baseline reaches an F1 score of 8.5 % on the German sentences when translated to English. The performance of this weak translation baseline is worse than that of another simple baseline: a 'most frequent sense baseline' – computing majority votes for German (and many other languages) – reaches an F1 score of 53.0 % on the German sentences. This shows that pure translation does not help with FrameId and, furthermore, indicates a large room for improvement for FrameId in languages other than English.

2.2 Multimodal representation learning

There is a growing interest in Natural Language Processing in enriching traditional approaches with knowledge from the visual domain, as images capture qualitatively different information compared to text. Regarding FrameId, to the best of our knowledge, multimodal approaches have not yet been investigated. For other tasks, multimodal approaches based on pretrained embeddings are reported to be superior to unimodal approaches. Textual embeddings have been enriched with information from the visual domain, e.g., for Metaphor Identification (Shutova et al., 2016), Visual Question Answering (Wu et al., 2017), and Word Pair Similarity (Collell et al., 2017). The latter presents a simple, but effective way of extending textual embeddings with so-called multimodal IMAGINED embeddings by a learned mapping from language to vision. We apply the IMAGINED method to our problem.
In this work, we aim to uncover whether representations that are grounded in images can help to improve the accuracy of FrameId. Our application case of FrameId is more complex than a comparison on the word-pair level as it considers a whole sentence in order to identify the predicate's frame. However, we see a potential for multimodal IMAGINED embeddings to help: their mapping from text to multimodal representations is learned from images for nouns. Such nouns, in turn, are candidates for role fillers of predicates. In order to identify the correct sense of an ambiguous predicate, it could help to enrich the representation of the context situation with multimodal embeddings for the entities that are linked by the predicate.

3 Our Multimodal FrameId Model

Our system builds upon the SimpleFrameId (Hartmann et al., 2017) system for English FrameId based on textual word embeddings. We extend it to multimodal and multilingual use cases; see Fig. 2 for a sketch of the system pipeline. Same as SimpleFrameId, our system is based on pretrained embeddings to build the input representation out of the predicate context and the predicate itself. However, different to SimpleFrameId, our representation of the predicate context is multimodal: beyond textual embeddings we also use IMAGINED and visual embeddings. More precisely, we concatenate all unimodal representations of the predicate context, which in turn are the unimodal mean embeddings of all words in the sentence. We use concatenation for fusing the different embeddings as it is the simplest yet successful fusion approach (Bruni et al., 2014; Kiela and Bottou, 2014). The input representation is processed by a two-layer Multilayer Perceptron (MLP, Rosenblatt, 1958), where we adapt the number of hidden nodes to the increased input size and apply dropout to all hidden layers to prevent overfitting (Srivastava et al., 2014). Each node in the output layer corresponds to one frame-label class. We use rectified linear units (Nair and Hinton, 2010) as activation function for the hidden layers, and a softmax for the output layer yielding a multinomial distribution over frames. We take its arg max as the final prediction at test time. Optionally, filtering based on the lexicon can be performed on the predicted probabilities for each frame label. The development set was used to determine the architecture and hyperparameters, see Sec. 6.

Figure 2: Sketch of the pipeline. (1) Data: sentence with predicate. (2) Mapping: words to embeddings. (3) Representation: concatenation of modality-specific means. (4) Classifier: neural network predicting frame.
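The following Keras-style sketch summarizes the multimodal context representation and the classifier under our own naming and with placeholder hyperparameters (the actual hidden sizes were tuned on the development set, see Sec. 6); it is an illustration, not the released implementation:

```python
import numpy as np
from tensorflow.keras import layers, models

def multimodal_context(words, modality_embs):
    # Concatenate the per-modality mean embeddings (textual, IMAGINED,
    # visual) of all words in the sentence.
    means = []
    for emb, dim in modality_embs:  # list of (embedding dict, dimension)
        vecs = [emb[w] for w in words if w in emb]
        means.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.concatenate(means)

def build_classifier(input_dim, num_frames, hidden=512, dropout=0.5):
    # Two hidden ReLU layers with dropout; softmax over all frame labels.
    return models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(hidden, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(hidden, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(num_frames, activation="softmax"),
    ])

def predict_frame(model, x, allowed_frames=None):
    # Arg max over the predicted distribution; in the with-lexicon setting
    # the probabilities are first masked to the frames the lexicon lists.
    probs = model.predict(x[None, :], verbose=0)[0]
    if allowed_frames:  # set of frame indices allowed by the lexicon
        mask = np.zeros_like(probs)
        mask[list(allowed_frames)] = 1.0
        probs = probs * mask
    return int(np.argmax(probs))
```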

Majority baselines. We propose a new strong baseline based on a combination of two existing ones. These are: first, the most-frequent-sense baseline using the data majority (Data Baseline) to determine the most frequent frame for a predicate; second, the baseline introduced by Hartmann et al. (2017) using a lexicon (Lexicon Baseline) to consider the data counts of the Data Baseline only for those frames available for a predicate. We propose to combine them into a Data-Lexicon Baseline, which uses the lexicon for unambiguous predicates, and for ambiguous ones it uses the data majority. This way, we trust the lexicon for unambiguous predicates but not for ambiguous ones, where we rather consider the data majority. Comparing a system to these baselines helps to see whether it just memorizes the data majority or the lexicon, or actually captures more. All majority baselines strongly outperform the weak translation baseline of Johannsen et al. (2015) when training the system on English data and evaluating it on German data.
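A minimal sketch of the combined Data-Lexicon Baseline follows; the data structures are our assumptions, not the released implementation:

```python
from collections import Counter, defaultdict

def data_lexicon_baseline(train_pairs, lexicon):
    # train_pairs: iterable of (predicate, frame) tuples from training data.
    # lexicon: dict mapping each predicate to the set of frames it can evoke.
    counts = defaultdict(Counter)
    for pred, frame in train_pairs:
        counts[pred][frame] += 1

    def predict(pred):
        frames = lexicon.get(pred, set())
        if len(frames) == 1:                  # unambiguous: trust the lexicon
            return next(iter(frames))
        if counts[pred]:                      # ambiguous: data majority
            return counts[pred].most_common(1)[0][0]
        return None                           # unseen and not in the lexicon

    return predict
```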
4 Preparation of Input Embeddings

Textual embeddings for words. We use the 300-dimensional GloVe embeddings (Pennington et al., 2014) for English, and the 100-dimensional embeddings of Reimers et al. (2014) for German. GloVe and Reimers have been trained on the Wikipedia of their targeted language and on additional newswire text to cover more domains, resulting in similarly low out-of-vocabulary scores.

Visual embeddings for synsets. We obtain visual embeddings for WordNet synsets (Fellbaum, 1998): we apply the pretrained VGG-m-128 Convolutional Neural Network model (Chatfield et al., 2014) to images for synsets from ImageNet (Deng et al., 2009), extract the 128-dimensional activation of the last layer (before the softmax), and then L2-normalize it. We use the images of the WN9-IMG dataset (Xie et al., 2017), which links WordNet synsets to a collection of ten ImageNet images. We average the embeddings of all images corresponding to a synset, leading to a vocabulary size of 6555 synsets. All synsets in WN9-IMG are part of triples of the form entity-relation-entity, i.e. synset-relation-synset. Such synset entities that are participants of relations with other synset entities are candidates for incorporating the role fillers for predicates and, therefore, may help to find the correct frame for a predicate (see Sec. 5 for details about sense disambiguation).

Linguistic embeddings for synsets. We obtain 300-dimensional linguistic synset embeddings: we apply the AutoExtend approach (Rothe and Schütze, 2015) to GloVe embeddings and produce synset embeddings for all synsets having at least one synset lemma in the GloVe embeddings. This leads to a synset vocabulary size of 79 141. Linguistic synset embeddings are based on textual word embeddings and the synset information known by the knowledge base WordNet, thus they complement the visual synset embeddings.

IMAGINED embeddings for words. We use the IMAGINED method (Collell et al., 2017) for learning a mapping function: it maps from the word embedding space to the visual embedding space given those words that occur in both pretrained embedding spaces (7220 for English and 7739 for German). To obtain the English synset lemmas, we extract all lemmas of a synset and keep those that are nouns. We automatically translate English nouns to German nouns using the Google Translate API to obtain the corresponding German synset lemmas. The IMAGINED method is promising for cases where one embedding space (here, the textual one) has many instances without correspondence in the other embedding space (here, the visual one), but the user still aims at obtaining instances of the first in the second space. We aim to obtain visual correspondences for the textual embeddings in order to incorporate regularities from images into our system. The mapping is a nonlinear transformation using a simple neural network. The objective is to minimize the cosine distance between each mapped representation of a word and the corresponding visual representation. Finally, a multimodal representation for any word can be obtained by applying this mapping to the word's textual embedding.
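As an illustration, a minimal sketch of such a mapping network in Keras; the layer size, activation, and optimizer are our placeholders, not the exact IMAGINED configuration:

```python
from tensorflow.keras import layers, losses, models

def build_imagined_mapping(text_dim=300, visual_dim=128, hidden=300):
    # Nonlinear mapping from the textual embedding space into the visual
    # space (hidden size and tanh activation are placeholder choices).
    model = models.Sequential([
        layers.Input(shape=(text_dim,)),
        layers.Dense(hidden, activation="tanh"),
        layers.Dense(visual_dim),
    ])
    # Keras' CosineSimilarity loss is the negative cosine similarity, so
    # minimizing it minimizes the cosine distance between the mapped word
    # vector and its visual counterpart.
    model.compile(optimizer="adam", loss=losses.CosineSimilarity(axis=-1))
    return model

# Fit on the words present in both spaces (7220 for English, 7739 for
# German in the setup above); afterwards, mapping.predict(...) yields an
# IMAGINED embedding for any word that has a textual embedding.
```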

5 Data and Preparation of Splits

English FrameId: Berkeley FrameNet. The Berkeley FrameNet (Baker et al., 1998; Ruppenhofer et al., 2016) is an ongoing project for building a large lexical resource for English with expert annotations based on frame semantics (Fillmore, 1976). It consists of two parts, a manually created lexicon that maps predicates to the frames they can evoke, and fully annotated texts (fulltext). The mapping can be used to facilitate the frame identification for a predicate in a sentence, e.g., a sentence in the fulltext corpus. Table 1 contains the lexicon statistics, Table 2 (top left) the dataset statistics. In this work, we use FrameNet 1.5 to ensure comparability with the previous state of the art, with the common evaluation split for FrameId systems introduced by Das and Smith (2011) (with the development split of Hermann et al., 2014). Due to having a single annotation as consent of experts, it is hard to estimate a performance bound of a single human for the fulltext annotation.

German FrameId: SALSA. The SALSA project (Burchardt et al., 2006; Rehbein et al., 2012) is a completed annotation project, which serves as the German counterpart to FrameNet. Its annotations are based on FrameNet up to version 1.2. SALSA adds proto-frames to properly annotate senses that are not covered by the English FrameNet. For a more detailed description of differences between FrameNet and SALSA, see Ellsworth et al. (2004); Burchardt et al. (2009). SALSA also provides a lexicon (see Table 1 for statistics) and fully annotated texts. There are two releases of SALSA: 1.0 (Burchardt et al., 2006), used for Shalmaneser (Erk and Pado, 2006) (cf. Sec. 2.1), and the final release 2.0 (Rehbein et al., 2012), which contains more annotations and adds nouns as predicates. We use the final release.

SALSA has no standard evaluation split; Erk and Pado (2006) used an undocumented random split. Also, it is not possible to follow the splitting method of Das and Smith (2011), as SALSA project distributions do not map to documents. We suggest splitting based on sentences, i.e. all annotations of a sentence are in the same set to avoid mixing training and test sets. We assign sentences to 100 buckets based on their IDs and create a 70/15/15 split for training, development, and test sets based on the bucket order. This procedure allows future work to be evaluated on the same data. Table 2 (bottom left) shows the dataset statistics.
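A sketch of this splitting procedure follows; the exact assignment of sentence IDs to buckets is our assumption (we assume numeric IDs so that the assignment is deterministic), and the released splits are authoritative:

```python
def split_by_sentence(sentence_ids, n_buckets=100):
    # Assign each sentence to one of 100 buckets by its ID, then split the
    # bucket order 70/15/15 into train/dev/test. All annotations share
    # their sentence's ID, so they end up in the same set.
    buckets = [[] for _ in range(n_buckets)]
    for sid in sentence_ids:
        buckets[int(sid) % n_buckets].append(sid)
    train = [s for b in buckets[:70] for s in b]
    dev = [s for b in buckets[70:85] for s in b]
    test = [s for b in buckets[85:] for s in b]
    return train, dev, test
```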
Synsets in FrameNet and SALSA. To prepare the datasets for working with the synset embeddings, we sense-disambiguate all sentences using the API of BabelNet (Navigli and Ponzetto, 2010), which returns multilingual synsets. We thus depend on the state-of-the-art accuracy of BabelNet (Navigli and Ponzetto, 2012) when using synset embeddings on sense-disambiguated sentences. However, this dependence does not hold when applying IMAGINED embeddings to sentences, as the mapping from words to IMAGINED embeddings does not need any synsets labeled in the sentences. After sense disambiguation, some sentences do not contain any synset available in our synset embeddings. The statistics of those sentences that have at least one synset embedding (visual or linguistic AutoExtend) are given in Table 2 (right).

lexicon     frames   LUs      avg(fr/pred)   %amb.pred.
FrameNet    1020     11 942   1.26           17.32
SALSA       1023     1827     2.82           57.56

Table 1: Lexicon statistics for FrameNet 1.5 and for SALSA 2.0: the total number of distinct frames and lexical units LUs (distinct predicate-frame combinations), the number of frames a predicate can evoke on average avg, and the % of ambiguous predicates.

              sentences   frames    reduced: syns-Vis   reduced: syns-AutoExt
FrameNet
  train       2819        15 406    1310                2714
  dev         707         4593      320                 701
  test        2420        4546      913                 2318
SALSA
  train       16 852      26 081    4707                16 736
  dev         3561        5533      1063                3540
  test        3605        5660      1032                3570

Table 2: Dataset statistics for FrameNet 1.5 fulltext with Das split and for SALSA 2.0 with our split: number of sentences and frames (as used in our experiments). Right half (only used in further investigations): number of sentences when reduced to only those having synsets in the visual and in the linguistic AutoExtend embeddings.

6 Experimental Setup

We contrast our system's performance for context representations based on unimodal (textual) versus multimodal (textual and visual) embeddings. Also, we compare English against German data. We run the prediction ten times to reduce noise in the evaluation (cf. Reimers and Gurevych, 2017) and report the mean for each metric.

Use of lexicon. We evaluate our system in two settings: with and without lexicon, as suggested by Hartmann et al. (2017). In the with-lexicon setting, the lexicon is used to reduce the choice of frames for a predicate to only those listed in the lexicon. If the predicate is not in the lexicon, this corresponds to the without-lexicon setting, where the choice has to be made amongst all frames.

Evaluation metrics. FrameId systems are usually compared in terms of accuracy, which we adopt for comparability. As a multiclass classification problem, FrameId has to cope with a strong variation in the annotation frequency of frame classes. Minority classes are frames that occur only rarely; majority classes occur frequently. Note that the accuracy is biased toward majority classes, explaining the success of majority baselines on imbalanced datasets such as FrameNet.

Alternatively, the F1 score is sometimes reported as it takes a complementary perspective. The F-measure is the harmonic mean of precision and recall, measuring exactness and completeness of a model, respectively. In previous work, micro-averaging is used to compute F1 scores. Yet, similar to the accuracy, micro-averaging introduces a bias toward majority classes. We compute F1-macro instead, for which precision and recall are computed for each class and averaged afterwards, giving equal weight to all classes.

Taken together, this yields scores that underestimate (F1-macro) and overestimate (average accuracy) the performance on imbalanced datasets. Previous work just used the overestimate, such that a comparison is possible in terms of accuracy in the with-lexicon setting. We suggest using F1-macro additionally to analyze rare, but interesting classes. Thus, a comparison within our work is possible for both aspects, giving a more detailed picture. Note that previous work reports one score whilst we report the mean score of ten runs.
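For reference, these metrics can be computed as in the following sketch using scikit-learn; the ambiguity mask marking instances whose predicate can evoke more than one frame is our own helper:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(gold, pred, ambiguous_mask):
    # F1-macro averages per-class precision/recall, weighting all frame
    # classes equally, so minority frames count as much as majority ones.
    amb_gold = [g for g, a in zip(gold, ambiguous_mask) if a]
    amb_pred = [p for p, a in zip(pred, ambiguous_mask) if a]
    return {
        "acc": accuracy_score(gold, pred),
        "f1_macro": f1_score(gold, pred, average="macro"),
        "acc_amb": accuracy_score(amb_gold, amb_pred),
        "f1_macro_amb": f1_score(amb_gold, amb_pred, average="macro"),
    }
```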
Hyperparameters. We identified the best hyperparameters for the English and German data based on the respective development sets.2 The Multilayer Perceptron architecture performed consistently better than a more complex Gated Recurrent Unit model (Cho et al., 2014). We found that more than two hidden layers did not bring any improvement over two layers; using dropout on the hidden layers helped to increase the accuracy. Among the various input representations, a concatenation of the representations of context and predicate was the best amongst others, including dependencies, lexicon indicators, and part-of-speech tags. Training is done using Nesterov-accelerated Adam (Nadam, Dozat, 2016) with default parameters. A batch size of 128 is used. Learning stops if the development accuracy has not improved for four epochs, and the learning rate is reduced by a factor of two if there has not been any improvement for two epochs.

2 Differences in hyperparameters to SimpleFrameId: 'nadam' as optimizer instead of 'adagrad', dropout on hidden layers and early stopping to regularize training. Different number of hidden units, optimized by grid search.
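In Keras terms, this training regime corresponds roughly to the following sketch; the model is assumed to come from the classifier sketch in Sec. 3, the epoch cap and variable names are ours:

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# model = build_classifier(...) as sketched in Sec. 3; X_train/y_train and
# X_dev/y_dev hold input representations and integer frame labels.
model.compile(optimizer="nadam",  # Nadam with default parameters
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train,
          batch_size=128,
          epochs=100,  # upper bound; early stopping ends training earlier
          validation_data=(X_dev, y_dev),
          callbacks=[
              # stop after four epochs without development-accuracy gain
              EarlyStopping(monitor="val_accuracy", patience=4),
              # halve the learning rate after two epochs without improvement
              ReduceLROnPlateau(monitor="val_accuracy",
                                factor=0.5, patience=2),
          ])
```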

7 Results

First, we report our results on English data (see Table 3, top) and then we compare against German data (see Table 3, bottom).

                            with lexicon                       without lexicon
model                       acc    acc amb  F1-m   F1-m amb    acc    acc amb  F1-m   F1-m amb
FrameNet
  Data Baseline             79.06  69.73    33.00  37.42       79.06  69.73    33.00  37.42
  Lexicon Baseline          79.89  55.52    65.61  30.95       –      –        –      –
  Data-Lexicon Baseline     86.32  69.73    64.54  37.42       –      –        –      –
  Hermann et al. (2014)     88.41  73.10    –      –           –      –        –      –
  Hartmann et al. (2017)    87.63  73.80    –      –           77.49  –        –      –
  our uni                   88.66  74.92    76.65  53.86       79.96  71.70    57.07  47.40
  our mm (im, synsV)        88.82  75.28    76.77  54.80       81.21  72.51    57.81  49.38
SALSA
  Data Baseline             77.00  70.51    37.40  28.87       77.00  70.51    37.40  28.87
  Lexicon Baseline          61.57  52.50    19.36  15.68       –      –        –      –
  Data-Lexicon Baseline     77.16  70.51    38.48  28.87       –      –        –      –
  our uni                   80.76  75.59    48.42  41.38       80.59  75.52    47.64  41.17
  our mm (im)               80.71  75.58    48.29  41.19       80.51  75.51    47.36  40.93

Table 3: FrameId results (in %) on English (upper) and German (lower) with and without using the lexicon. Reported are accuracy and F1-macro, both also for ambiguous predicates (mean scores over ten runs). Models: (a) Data, Lexicon, and Data-Lexicon Baselines. (b) Previous models for English. (c) Ours: unimodal our-uni, and multimodal on top of our-uni – our-mm – with IMAGINED embeddings (and synset visual embeddings for English). The best run's results for English were: our uni: acc 89.35, acc amb 76.45, F1-m 76.95, F1-m amb 54.02 (with lexicon); our mm (im, synsV): acc 89.09, acc amb 75.86, F1-m 78.17, F1-m amb 57.48 (with lexicon).

7.1 English FrameNet data

Baseline. Our new strong Data-Lexicon Baseline reaches a considerable accuracy of 86.32 %, which is hard to beat by trained models. Even the most recent state of the art only beats it by about two points: 88.41 % (Hermann et al., 2014). However, the accuracy of the baseline drops for ambiguous predicates (69.73 %) and the F1-macro score reveals its weakness toward minority classes (drop from 64.54 % to 37.42 %).

Unimodal. Our unimodal system trained and evaluated on English data slightly exceeds the accuracy of the previous state of the art (88.66 % on average versus 88.41 % for Hermann et al., 2014); our best run's accuracy is 89.35 %. Especially on ambiguous predicates, i.e. the difficult and therefore interesting cases, our average accuracy surpasses that of previous work by more than one point (the best run by almost three points). Considering the proposed F1-macro score for an assessment of the performance on minority classes and ambiguous predicates reveals our main improvement: our system substantially outperforms the strong Data-Lexicon Baseline, demonstrating that our system differs from memorizing majorities and actually improves minority cases.

Multimodal. From a range of multimodal context representations as extensions to our system, the most helpful one is the concatenation of IMAGINED embeddings and visual synset embeddings: it outperforms the unimodal approach slightly in all measurements. We observe that the improvements are more pronounced for difficult cases, such as rare and ambiguous cases (one point improvement in F1-macro), as well as in the absence of a lexicon (up to two points improvement).

Significance tests. We conduct a single-sample t-test to judge the difference between the previous state-of-the-art accuracy (Hermann et al., 2014) and our unimodal approach. The null hypothesis (the expected value of our sample of ten accuracy scores equals the previous state-of-the-art accuracy) is rejected at a significance level of α = 0.05 (p = 0.0318). In conclusion, even our unimodal approach outperforms the prior state of the art in terms of accuracy. To judge the difference between our unimodal and our multimodal approach, we conduct a t-test for the means of the two independent samples. The null hypothesis states identical expected values for our two samples of ten accuracy scores. Regarding the setting with lexicon, the null hypothesis cannot be rejected at a significance level of α = 0.05 (p = 0.2181). However, concerning accuracy scores without using the lexicon, the null hypothesis is rejected at a significance level of α = 0.05 (p < 0.0001). In conclusion, the multimodal approach has a slight overall advantage and, interestingly, has a considerable advantage over the unimodal one when confronted with the more difficult setting of not using the lexicon.
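These two tests can be reproduced with SciPy along the following lines; the score lists hold the ten per-run accuracies, and 88.41 is the fixed previous state-of-the-art accuracy from Table 3:

```python
from scipy import stats

def significance_tests(uni_acc, mm_acc, prev_sota=88.41, alpha=0.05):
    # One-sample t-test of our ten unimodal accuracies against the fixed
    # previous state-of-the-art accuracy, and a two-sample t-test between
    # the unimodal and multimodal samples.
    t1, p1 = stats.ttest_1samp(uni_acc, popmean=prev_sota)
    t2, p2 = stats.ttest_ind(uni_acc, mm_acc)
    return {"vs_sota_p": p1, "vs_sota_reject": p1 < alpha,
            "uni_vs_mm_p": p2, "uni_vs_mm_reject": p2 < alpha}
```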

7.2 German SALSA versus English data

German results. Our system evaluated on German data sets a new state of the art on this corpus with 80.76 % accuracy, outperforming the baselines (77.16 %; no other system has been evaluated on this dataset). The difference in F1-macro between the majority baselines and our system is smaller than for the English FrameNet. This indicates that the majorities learned from data are more powerful in the German case with SALSA than in the English case, when comparing against our system. Multimodal context representations cannot show an improvement for SALSA with this general dataset.

Lexicon. We report results achieved without the lexicon to evaluate independently of its quality (Hartmann et al., 2017). On English data, our system outperforms Hartmann et al. (2017) by more than two points in accuracy and we achieve a large improvement over the Data Baseline. Comparing the F1-macro with and without lexicon, it can be seen that the additional information stored in the lexicon strongly increases the score by about 20 points for English data. For German data, the increase of F1-macro with lexicon versus without is small (one point).

8 Discussion

8.1 English data

Insights from the baseline. Many indicators point to our approach not just learning the data majority: our trained models have better F1-macro and especially much higher ambiguous F1-macro scores with lexicon. This clearly suggests that our system is capable of acquiring more expressiveness than the baselines do by counting majorities.

Impact of multimodal representations. Multimodal context representations improve results compared to unimodal ones. It helps to incorporate visual common sense knowledge about the situation's participants. Referring back to our example of the ambiguous predicate sit, the multimodal approach is able to transfer the knowledge to the test sentence 'Al-Anbar in general, and Ramadi in particular, are sat with the Americans in Jordan.' by correctly identifying the frame Being_located, whilst the unimodal approach fails by predicting Change_posture. The increase in performance when adding information from visual synset embeddings is not simply due to the higher dimensionality of the embedding space. To verify this, we further investigate extending the unimodal system with random word embeddings. This leads to a drop in performance compared to using just the unimodal representations or using these in combination with the proposed multimodal embeddings, especially in the setting without lexicon. Interestingly, replacing visual synset embeddings with linguistic synset embeddings (AutoExtend by Rothe and Schütze (2015), see Sec. 4) in further investigations also showed that visual embeddings yield better performance. This points out the potential for incorporating even more image evidence to extend our approach.
8.2 German versus English data

Difficulties for German data. The impact of multimodal context representations is more difficult to interpret for the German dataset. The fact that they have not helped here may be due to mismatches when translating the English nouns of a synset to German in order to train the IMAGINED embeddings. Here, we see room for future work to improve on simple translation by sense-based translations. In SALSA, a smaller portion of sentences has at least one synset embedding, see Table 2. For further investigations, we reduced the dataset to only sentences actually containing a synset embedding. Then, minor improvements of the multimodal approach were visible for SALSA. This points out that a dataset containing more words linking to implicit knowledge in images (visual synset embeddings) can profit more from visual and IMAGINED embeddings.

Impact of lexicon: English versus German. Even if both lexica define approximately the same number of frames (see Table 1), the number of defined lexical units (distinct predicate-frame combinations) in SALSA is smaller. This leads to a lexicon that is a magnitude smaller than the FrameNet lexicon. Thus, the initial situation for the German case is more difficult. The impact of the lexicon for SALSA is smaller than for FrameNet (best visible in the increase of F1-macro with using the lexicon compared to without), which can be explained by the larger percentage of ambiguous predicates (especially those evoking proto-frames) and the smaller size of the lexicon. The evaluation on two different languages highlights the impact of an elaborate, manually created lexicon: it boosts the performance on frame classes that are less present in the training data. English FrameId benefits from the large high-quality lexicon, whereas German FrameId currently lacks a high-quality lexicon that is large enough to benefit the FrameId task.

Dataset properties: English versus German. To better understand the influence of the dataset on the prediction errors, we further analyze the errors of our approach (see Table 4) following Palmer and Sporleder (2010).

                            with lexicon                        without lexicon
model                       correct  e uns  e unsLab  e n       correct  e uns  e unsLab  e n
FrameNet
  our uni                   89.35    0.40   3.04      7.22      80.36    1.32   7.68      10.65
  our mm (im, synsV)        89.79    0.58   3.55      6.08      80.63    1.91   8.50      8.96
SALSA
  our uni                   80.99    0.49   0.97      17.54     80.80    0.49   1.10      17.61
  our mm (im)               81.24    1.94   1.88      14.94     80.96    1.94   2.05      15.05

Table 4: Error analysis of the best uni- and multimodal systems (in %): correct predictions and errors, split into completely unseen instances (e uns), instances unseen with the target label (e unsLab), and normal classification errors (e n).

A wrong prediction can either be a normal classification error, or it can be the result of an instance that was unseen at training time, which means that the error is due to the training set. The instance can either be completely unseen or unseen with the target label. We observe that FrameNet has larger issues with unseen data compared to SALSA, especially data that was unseen with one specific label but seen with another label. This is due to the uneven split of the documents in FrameNet, leading to data from different source documents and domains in the training and test split. SALSA does not suffer from this problem as much since the split was performed differently. It would be worth considering the same splitting method for FrameNet.
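A sketch of this Palmer and Sporleder (2010)-style error breakdown; the helper and its data structures are our own:

```python
def categorize_prediction(pred_frame, gold_frame, predicate, train_pairs):
    # train_pairs: set of (predicate, frame) tuples observed in training.
    if pred_frame == gold_frame:
        return "correct"
    seen_predicates = {p for p, _ in train_pairs}
    if predicate not in seen_predicates:
        return "unseen"            # e uns: predicate never seen in training
    if (predicate, gold_frame) not in train_pairs:
        return "unseen label"      # e unsLab: seen, but never with this frame
    return "normal"                # e n: ordinary classification error
```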
8.3 Future work

As stated previously, FrameId has commonalities with event prediction. Since identifying frames is only one way of capturing events, our approach is transferable to other schemes of event prediction, and visual knowledge about participants of situations should be beneficial there, too. It would be interesting to evaluate the multimodal architecture on other predicate-argument frameworks, e.g., script knowledge or VerbNet-style Semantic Role Labeling. In particular, the exploration of our findings on visual contributions to FrameId in the context of further event prediction tasks forms an interesting next step.

More precisely, future work should consider using implicit knowledge not only from images of the participants of the situation, but also from the entire scene in order to directly capture relations between the participants. This could provide access to a more holistic understanding of the scene. The following visual tasks with accompanying datasets could serve as a starting point: (a) visual Verb Sense Disambiguation with the VerSe dataset (Gella et al., 2016) and (b) visual SRL with several datasets, e.g., imSitu (Yatskar et al., 2016) (linked to FrameNet), V-COCO (Gupta and Malik, 2015) (verbs linked to COCO), VVN (Ronchi and Perona, 2015) (visual VerbNet), or even SRL grounded in video clips for the cooking domain (Yang et al., 2016) and visual Situation Recognition (Mallya and Lazebnik, 2017). Such datasets could be used for extracting visual embeddings for verbs or even complex situations in order to improve the visual component in the embeddings for our FrameId system. Vice versa, visual tasks could profit from multimodal approaches (Baltrušaitis et al., 2017) in a similar sense as our textual task, FrameId, profits from additional information encoded in further modalities. Moreover, visual SRL might profit from our multimodal FrameId system to a similar extent as any FrameNet SRL task profits from correctly identified frames (Hartmann et al., 2017).

Regarding the combination of embeddings from different modalities, we suggest experimenting with different fusion strategies complementing the middle fusion (concatenation) and the mapping (IMAGINED method). This could be a late fusion at decision level, operating like an ensemble.

9 Conclusion

In this work, we investigated multimodal representations for Frame Identification (FrameId) by incorporating implicit knowledge, which is better reflected in the visual domain. We presented a flexible FrameId system that is independent of modality and language in its architecture. With this flexibility it is possible to include textual and visual knowledge and to evaluate on gold data in different languages. We created multimodal representations from textual and visual domains and showed that for English FrameNet data, enriching the textual representations with multimodal ones improves the accuracy toward a new state of the art. For German SALSA data, we set a new state of the art with textual representations only and discuss why incorporating multimodal information is more difficult. For both datasets, our system is particularly strong with respect to ambiguous and rare classes, considerably outperforming our new Data-Lexicon Baseline and thus addressing a key challenge in FrameId.

Acknowledgments

This work has been supported by the DFG-funded research training group "Adaptive Preparation of Information from Heterogeneous Sources" (AIPHES, GRK 1994/1). We also acknowledge the useful comments of the anonymous reviewers.

References

Simon Ahrendt and Vera Demberg. 2016. Improving event prediction by representing script participants. In Proceedings of NAACL-HLT, pages 546–551, San Diego, USA.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 86–90, Stroudsburg, PA, USA.

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49(2014):1–47.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA corpus: A German corpus resource for lexical semantics. In Proceedings of LREC, Genoa, Italy.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2009. Using FrameNet for the semantic analysis of German: Annotation, representation, and automation. In Multilingual FrameNets in Computational Lexicography, pages 209–244. Mouton de Gruyter, New York City, USA.

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of the British Machine Vision Conference (BMVC 2014), Nottingham, Great Britain.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In Thirty-First AAAI Conference on Artificial Intelligence, pages 4378–4384, San Francisco, USA.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.

Dipanjan Das and Noah A. Smith. 2011. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1435–1444, Stroudsburg, PA, USA.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, Miami, USA.

Timothy Dozat. 2016. Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations (ICLR): Posters, pages 1–7, San Juan, Puerto Rico.

Michael Ellsworth, Katrin Erk, Paul Kingsbury, and Sebastian Padó. 2004. PropBank, SALSA, and FrameNet: How design determines product. In Proceedings of the LREC 2004 Workshop on Building Lexical Resources from Semantically Annotated Corpora, Lisbon, Portugal.

Katrin Erk and Sebastian Pado. 2006. Shalmaneser - A flexible toolbox for semantic role assignment. In Proceedings of LREC, Genoa, Italy.

Parvin Sadat Feizabadi and Sebastian Padó. 2012. Automatic identification of motion verbs in WordNet and FrameNet. In KONVENS, pages 70–79, Vienna, Austria.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, USA.

Charles J. Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280:20–32.

Charles J. Fillmore. 1985. Frames and the semantics of understanding. Quaderni di semantica, 6(2):222–254.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R.L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In Proceedings of NAACL-HLT, pages 182–192, San Diego, USA.

Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.

Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. 2017. Out-of-domain FrameNet semantic role labeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 471–482, Valencia, Spain.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of the ACL, pages 1448–1458, Baltimore, USA.

Anders Johannsen, Héctor Martínez Alonso, and Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2062–2066, Lisbon, Portugal.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 36–45, Doha, Qatar.

Arun Mallya and Svetlana Lazebnik. 2017. Recurrent models for situation recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 455–463, Venice, Italy.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, Haifa, Israel.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Alexis Palmer and Caroline Sporleder. 2010. Evaluating FrameNet-style semantic parsing: The role of coverage gaps in FrameNet. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 928–936, Beijing, China.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Ines Rehbein, Josef Ruppenhofer, Caroline Sporleder, and Manfred Pinkal. 2012. Adding nominal spice to SALSA - Frame-semantic annotation of German nouns and verbs. In Proceedings of KONVENS 2012, pages 89–97, Vienna, Austria.

Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, and Iryna Gurevych. 2014. GermEval-2014: Nested named entity recognition with neural networks. In Workshop Proceedings of the 12th Edition of the KONVENS Conference, pages 117–120, Hildesheim, Germany.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark.

Matteo Ruggero Ronchi and Pietro Perona. 2015. Describing common human visual actions in images. In Proceedings of the British Machine Vision Conference (BMVC 2015), Swansea, Wales.

Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL, Beijing, China.

Josef Ruppenhofer, Michael Ellsworth, Miriam R.L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2016. FrameNet II: Extended Theory and Practice, revised November 1, 2016 edition. International Computer Science Institute, Berkeley, USA.

Roger C. Schank and Robert P. Abelson. 2013. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Psychology Press.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of NAACL-HLT, pages 160–170, San Diego, USA.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40.

Ruobing Xie, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2017. Image-embodied knowledge representation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 3140–3146, Melbourne, Australia.

Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y. Chai. 2016. Grounded semantic role labeling. In Proceedings of NAACL-HLT, pages 149–159, San Diego, USA.

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5534–5542, Las Vegas, USA.
