Grounded Semantic Role Labeling

Shaohua Yang¹, Qiaozi Gao¹, Changsong Liu¹, Caiming Xiong², Song-Chun Zhu³, and Joyce Y. Chai¹
¹Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824
²Metamind, Palo Alto, CA 94301
³Center for Vision, Cognition, Learning, and Autonomy, University of California, Los Angeles, CA 90095
{yangshao, gaoqiaoz, cliu, jchai}@cse.msu.edu, [email protected], [email protected]

Proceedings of NAACL-HLT 2016, pages 149–159, San Diego, California, June 12–17, 2016. © 2016 Association for Computational Linguistics.

Abstract

Semantic Role Labeling (SRL) captures semantic roles (or participants) such as agent, patient, and theme associated with verbs in text. While it provides important intermediate semantic representations for many traditional NLP tasks (such as information extraction and question answering), it does not capture grounded semantics with which an artificial agent can reason, learn, and perform actions with respect to the physical environment. To address this problem, this paper extends traditional SRL to grounded SRL, where arguments of verbs are grounded to participants of actions in the physical world. By integrating language and vision processing through joint inference, our approach not only grounds explicit roles, but also grounds implicit roles that are not explicitly mentioned in language descriptions. This paper describes our empirical results and discusses challenges and future directions.

[Figure 1: An example of grounded semantic role labeling for the sentence "the woman takes out a cucumber from the refrigerator." The left-hand side shows three frames of a video clip with the corresponding language description; the objects in the bounding boxes are tracked, and each track has a unique identifier. The right-hand side shows the grounding results, where each role, including the implicit role (destination), is grounded to a track id: Predicate "takes out": track 1; Agent "the woman": track 2; Patient "a cucumber": track 3; Source "from the refrigerator": track 4; Destination "": track 5.]

1 Introduction

Linguistic studies capture the semantics of verbs by their frames of thematic roles (also referred to as semantic roles or verb arguments) (Levin, 1993). For example, a verb can be characterized by agent (i.e., the animator of the action), patient (i.e., the object on which the action is acted upon), and other roles such as instrument, source, destination, etc. Given a verb frame, the goal of Semantic Role Labeling (SRL) is to identify linguistic entities in the text that serve the different thematic roles (Palmer et al., 2005; Gildea and Jurafsky, 2002; Collobert et al., 2011; Zhou and Xu, 2015). For example, given the sentence "the woman takes out a cucumber from the refrigerator", takes out is the main verb (also called the predicate); the noun phrase the woman is the agent of this action; a cucumber is the patient; and the refrigerator is the source.

SRL captures important semantic representations for actions associated with verbs, which have been shown to be beneficial for a variety of applications such as information extraction (Emanuele et al., 2013) and question answering (Shen and Lapata, 2007). However, traditional SRL is not targeted at representing verb semantics grounded to the physical world, such that artificial agents can truly understand the ongoing activities and (learn to) perform the specified actions. To address this issue, we propose a new task: grounded semantic role labeling.

Figure 1 shows an example of grounded SRL. The sentence "the woman takes out a cucumber from the refrigerator" describes an activity in a visual scene. The semantic role representation from linguistic processing (including implicit roles such as destination) is first extracted and then grounded to tracks of visual entities in the video. For example, the verb phrase take out is grounded to a trajectory of the right hand. The role agent is grounded to the person who actually performs the take-out action in the visual scene (track 1); the patient is grounded to the cucumber taken out (track 3); and the source is grounded to the refrigerator (track 4). The implicit role destination (which is not explicitly mentioned in the language description) is grounded to the cutting board (track 5).

To tackle this problem, we have developed an approach that jointly processes language and vision by incorporating semantic role information. In particular, we base our investigation on a benchmark dataset (TACoS), which consists of parallel video and language descriptions in a complex cooking domain (Regneri et al., 2013). We have further annotated several layers of information for developing and evaluating grounded semantic role labeling algorithms.

Compared to previous work on language grounding (Tellex et al., 2011; Yu and Siskind, 2013; Krishnamurthy and Kollar, 2013), our work makes several contributions. First, beyond arguments explicitly mentioned in language descriptions, our work simultaneously grounds explicit and implicit roles in an attempt to better connect verb semantics with actions in the underlying physical world; by incorporating semantic role information, our approach leads to better grounding performance. Second, most previous work has focused on a small number of verbs with limited activities; we base our investigation on a wider range of verbs and a much more complex domain where object recognition and tracking are notably more difficult. Third, our work adds layers of annotation to part of the TACoS dataset. This annotation captures the structure of actions in the video, informed by semantic roles. The annotated data is available for download at http://lair.cse.msu.edu/gsrl.html and will provide a benchmark for future work on grounded SRL.

2 Related Work

Recent years have witnessed an increasing amount of work on integrating language and vision, from earlier image annotation (Ramanathan et al., 2013; Kazemzadeh et al., 2014) to recent image/video caption generation (Kuznetsova et al., 2013; Venugopalan et al., 2015; Ortiz et al.; Elliott and de Vries, 2015; Devlin et al., 2015), video-sentence alignment (Naim et al., 2015; Malmaud et al., 2015), scene generation (Chang et al., 2015), and multimodal embeddings incorporating language and vision (Bruni et al., 2014; Lazaridou et al., 2015). More relevant to our work is recent progress on grounded language understanding, which involves learning the meanings of words through connections to machine perception (Roy, 2005) and grounding language expressions to the shared visual world, for example, to visual objects (Liu et al., 2012; Liu and Chai, 2015), to physical landmarks (Tellex et al., 2011; Tellex et al., 2014), and to perceived actions or activities (Tellex et al., 2014; Artzi and Zettlemoyer, 2013).

Different approaches and emphases have been explored. For example, linear programming has been applied to mediate perceptual differences between humans and robots for referential grounding (Liu and Chai, 2015). Semantic parsing has been applied to ground language to internal world representations (Chen and Mooney, 2008; Artzi and Zettlemoyer, 2013). Logical Semantics with Perception (LSP) (Krishnamurthy and Kollar, 2013) grounds natural language queries to visual referents by jointly parsing natural language (with a combinatory categorial grammar, CCG) and classifying visual attributes. Graphical models have also been applied to word grounding: a generative model was used to integrate And-Or-Graph representations of language and vision for joint parsing (Tu et al., 2014), and a Factorial Hidden Markov Model (FHMM) was used to learn the meanings of nouns, verbs, prepositions, adjectives, and adverbs from short video clips paired with sentences (Yu and Siskind, 2013). Discriminative models have been applied to ground human commands or instructions to perceived visual entities, mostly for robotic applications (Tellex et al., 2011; Tellex et al., 2014). More recently, deep learning has been applied to ground phrases to image regions (Karpathy and Fei-Fei, 2015).

3 Method

We first describe our problem formulation and then provide details on the learning and inference algorithms.

3.1 Problem Formulation

Given a sentence S and its corresponding video clip V, our goal is to ground the explicit/implicit roles associated with a verb in S to video tracks in V. In this paper, we focus on the following set of semantic roles: {predicate, patient, location, source, destination, tool}. In the cooking domain, as actions always involve hands, the predicate is grounded to the hand pose represented by a trajectory of the relevant hand(s). Normally the agent would be grounded to the person who performs the action. [...] the source is grounded to the center of the bounding boxes of the refrigerator track; and the destination is grounded to the center of the cutting board track.

We use a Conditional Random Field (CRF) to model this problem. An example CRF factor graph is shown in Figure 2. The CRF structure is created based on information extracted from language. More specifically, s1, ..., s6 refer to the observed text and its semantic roles. Notice that s6 is an implicit role, as there is no text in the sentence describing the destination. Also note that the whole prepositional phrase "from the drawer" is identified as the source, rather than "the drawer" alone. This is because prepositions play an important role in specifying location information: for example, "near the cutting board" describes a location that is near to, but not exactly at, the location of the cutting board. Here v1, ..., v6 are grounding random variables which take values from object tracks and locations in the video clip, and φ1, ..., φ6 are binary
