Learning the Semantics of Manipulation Action

Yezhou Yang†, Yiannis Aloimonos†, Cornelia Fermüller† and Eren Erdal Aksoy‡
† UMIACS, University of Maryland, College Park, MD, USA
{yzyang, yiannis, fer}@umiacs.umd.edu
‡ Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]

Abstract

In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the λ-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.

1 Introduction

Autonomous robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. This requires a formal system to represent the action semantics. This representation needs to store the semantic information about the actions, be encoded in a machine readable language, and inherently be in a programmable fashion in order to enable reasoning beyond observation. A formal representation of this kind has a variety of other applications such as intelligent manufacturing, human robot collaboration, action planning and policy design, etc.

In this paper, we are concerned with manipulation actions, that is, actions performed by agents (humans or robots) on objects, resulting in some physical change of the object. However, most current AI systems require manually defined semantic rules. In this work, we propose a computational framework, which is based on probabilistic semantic parsing with Combinatory Categorial Grammar (CCG), to learn manipulation action semantics (lexicon entries) from annotations. We later show that this learned lexicon enables our system to reason about manipulation action goals beyond just observation. Thus the intelligent system can not only imitate human movements, but also imitate action goals.

Understanding actions by observation and executing them are generally considered as dual problems for intelligent agents. The sensori-motor bridge connecting the two tasks is essential, and a great amount of attention in AI, Robotics, and Neurophysiology has been devoted to investigating it. Experiments conducted on primates have discovered that certain neurons, the so-called mirror neurons, fire during both observation and execution of identical manipulation tasks (Rizzolatti et al., 2001; Gazzola et al., 2007). This suggests that the same process is involved in both the observation and execution of actions. From a functionalist point of view, such a process should be able to first build up a semantic structure from observations, and then the decomposition of that same structure should occur when the intelligent agent executes commands.

Additionally, studies in linguistics (Steedman, 2002) suggest that the language faculty develops in humans as a direct adaptation of a more primitive apparatus for planning goal-directed action in the world by composing affordances of tools and consequences of actions. It is this more primitive apparatus that is our major interest in this paper.

Such an apparatus is composed of a "syntax part" and a "semantic part". In the syntax part, every linguistic element is categorized as either a function or a basic type, and is associated with a syntactic category which either identifies it as a function or a basic type. In the semantic part, a semantic translation is attached following the syntactic category explicitly.

Combinatory Categorial Grammar (CCG), introduced by (Steedman, 2000), is a theory that can be used to represent such structures with a small set of combinators such as functional application and type-raising. What do we gain, though, from such a formal description of action? This is similar to asking what one gains from a formal description of language as a generative system. Chomsky's contribution to language research was exactly this: the formal description of language through the formulation of the Generative and Transformational Grammar (Chomsky, 1957). It revolutionized language research, opening up new roads for the computational analysis of language and providing researchers with common, generative language structures and syntactic operations on which language analysis tools were built. A grammar for action would contribute to providing a common framework for the syntax and semantics of action, so that basic tools for action understanding can be built, tools that researchers can use when developing action interpretation systems without having to start development from scratch. The same tools can be used by robots to execute actions.

In this paper, we propose an approach for learning the semantic meaning of manipulation action through a probabilistic semantic parsing framework based on CCG theory. For example, we want to learn from an annotated training action corpus that the action "Cut" is a function which has two arguments: a subject and a patient. Also, the action consequence of "Cut" is a separation of the patient. Using formal logic representation, our system will learn the semantic representation of "Cut":

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y)

Here cut(x, y) is a primitive function. We will further introduce the representation in Sec. 3. Since our action representation is in a common calculus form, it naturally enables further logical reasoning beyond visual observation.

The advantage of our approach is twofold: 1) Learning semantic representations from annotations helps an intelligent agent to automatically enrich its own knowledge about actions; 2) The formal logic representation of the action can be used to infer the object-wise consequence after a certain manipulation, and can also be used to plan a set of actions to reach a certain action goal. We further validate our approach on a large publicly available manipulation action dataset (MANIAC) from (Aksoy et al., 2014), achieving promising experimental results. Moreover, we believe that our work, even though it only considers the domain of manipulation actions, is also a promising example of a more closely intertwined computer vision and computational linguistics system. The diagram in Fig. 1 depicts the framework of the system.

Figure 1: A CCG based semantic parsing framework for manipulation actions.

2 Related Works

Reasoning beyond appearance: The very small number of works in computer vision which aim to reason beyond appearance models are also related to this paper. (Xie et al., 2013) proposed that beyond state-of-the-art computer vision techniques, we could possibly infer implicit information (such as functional objects) from video, and they call them "Dark Matter" and "Dark Energy". (Yang et al., 2013) used stochastic tracking and graph-cut based segmentation to infer manipulation consequences beyond appearance. (Joo et al., 2014) used a ranking SVM to predict the persuasive motivation (or the intention) of the photographer who captured an image. More recently, (Pirsiavash et al., 2014) seeks to infer the motivation of the person in the image by mining knowledge stored in a large corpus using natural language processing techniques.

Different from these fairly general investigations about reasoning beyond appearance, our paper seeks to learn manipulation action semantics in logic forms through CCG, and further infer hidden action consequences beyond appearance through reasoning.

Action Recognition and Understanding: Human activity recognition and understanding has been studied heavily in Computer Vision recently, and there is a large range of applications for this work in areas like human-computer interaction, biometrics, and video surveillance. Both visual recognition methods and non-visual description methods using motion capture systems have been used. A few good surveys of the former can be found in (Moeslund et al., 2006) and (Turaga et al., 2008). Most of the focus has been on recognizing single human actions like walking, jumping, or running (Ben-Arie et al., 2002; Yilmaz and Shah, 2005). Approaches to more complex actions have employed parametric approaches, such as HMMs (Kale et al., 2004), to learn the transition between feature representations in individual frames, e.g. (Saisan et al., 2001; Chaudhry et al., 2009). More recently, (Aksoy et al., 2011; Aksoy et al., 2014) proposed a semantic event chain (SEC) representation to model and learn the semantic segment-wise relationship transition from spatial-temporal video segmentation.

There have also been many syntactic approaches to human activity recognition which used the concept of context-free grammars, because such grammars provide a sound theoretical basis for modeling structured processes. Tracing back to the middle 90's, (Brand, 1996) used a grammar to recognize disassembly tasks that contain hand manipulations. (Ryoo and Aggarwal, 2006) used the context-free grammar formalism to recognize composite human activities and multi-person interactions. It is a two-level hierarchical approach where the lower levels are composed of HMMs and Bayesian Networks while the higher-level interactions are modeled by CFGs. To deal with errors from low-level processes such as tracking, stochastic grammars such as stochastic CFGs were also used (Ivanov and Bobick, 2000; Moore and Essa, 2002). More recently, (Kuehne et al., 2014) proposed to model goal-directed human activities using Hidden Markov Models and treat sub-actions just like words in speech. These works proved that grammar based approaches are practical in activity recognition systems, and shed insight onto human manipulation action understanding. However, as mentioned, thinking about manipulation actions solely from the viewpoint of recognition has obvious limitations. In this work we adopt principles from CFG based activity recognition systems, with extensions to a CCG grammar that accommodates not only the hierarchical structure of human activity but also action semantics representations. It enables the system to serve as the core parsing engine for both manipulation action recognition and execution.

Manipulation Action Grammar: As mentioned before, (Chomsky, 1993) suggested that a minimalist generative grammar, similar to the one of human language, also exists for action understanding and execution. The works most closely related to this paper are (Pastra and Aloimonos, 2012; Summers-Stay et al., 2013; Guha et al., 2013). (Pastra and Aloimonos, 2012) first discussed a Chomskyan grammar for understanding complex actions as a theoretical concept, and (Summers-Stay et al., 2013) provided an implementation of such a grammar using as perceptual input only objects. More recently, (Yang et al., 2014) proposed a set of context-free grammar rules for manipulation action understanding, and (Yang et al., 2015) applied it on unconstrained instructional videos. However, these approaches only consider the syntactic structure of manipulation actions without coupling semantic rules using λ expressions, which limits the capability of doing reasoning and prediction.

Combinatory Categorial Grammar and Semantic Parsing: CCG based semantic parsing was originally used mainly to translate natural language sentences to their desired semantic representations as λ-calculus formulas (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007). (Mooney, 2008) presented a framework of grounded language acquisition: the interpretation of language entities into semantically informed structures in the context of perception and actuation. The concept has been applied successfully in tasks such as robot navigation (Matuszek et al., 2011), forklift operation (Tellex et al., 2014) and human-robot interaction (Matuszek et al., 2014). In this work, instead of grounding natural language sentences directly, we ground information obtained from visual perception into semantically informed structures, specifically in the domain of manipulation actions.

3 A CCG Framework for Manipulation Actions

Before we dive into the semantic parsing of manipulation actions, a brief introduction to the Combinatory Categorial Grammar framework in Linguistics is necessary. We will only introduce related concepts and formalisms. For a complete background reading, we would like to refer readers to (Steedman, 2000). We will first give a brief introduction to CCG and then introduce a fundamental combinator, i.e., functional application. The introduction is followed by examples to show how the combinator is applied to parse actions.

3.1 Manipulation Action Semantics

The semantic expression in our representation of manipulation actions uses a typed λ-calculus language. The formal system has two basic types: entities and functions. Entities in manipulation actions are Objects or Hands, and functions are the Actions. Our λ-calculus expressions are formed from the following items:

Constants: Constants can be either entities or functions. For example, Knife is an entity (i.e., it is of type N) and Cucumber is an entity too (i.e., it is of type N). Cut is an action function that maps entities to entities. When the event Knife Cut Cucumber happened, the expression cut(Knife, Cucumber) returns an entity of type AP, a.k.a. Action Phrase. Constants like divided are status functions that map entities to truth values. The expression divided(cucumber) returns a true value after the event (Knife Cut Cucumber) happened.

Logical connectors: The λ-calculus expression has logical connectors like conjunction (∧), disjunction (∨), negation (¬) and implication (→). For example, the expression

connected(tomato, cucumber) ∧ divided(tomato) ∧ divided(cucumber)

represents the joint status that the sliced tomato merged with the sliced cucumber. It can be regarded as a simplified goal status for "making a cucumber tomato salad". The expression ¬connected(spoon, bowl) represents the status after the spoon finished stirring the bowl. The expression

λx.cut(x, cucumber) → divided(cucumber)

represents that if the cucumber is cut by x, then the status of the cucumber is divided.

λ expressions: Lambda expressions represent functions with unknown arguments. For example, λx.cut(knife, x) is a function from entities to entities, which takes any entity of type N that is cut by the knife.

3.2 Combinatory Categorial Grammar

The semantic parsing formalism underlying our framework for manipulation actions is that of combinatory categorial grammar (CCG) (Steedman, 2000). A CCG specifies one or more logical forms for each element or combination of elements for manipulation actions. In our formalism, an element of Action is associated with a syntactic "category" which identifies it as a function, and specifies the type and directionality of its arguments and the type of its result. For example, action "Cut" is a function from a patient object phrase (NP) on the right into predicates, and into functions from a subject object phrase (NP) on the left into a sub-action phrase (AP):

Cut := (AP\NP)/NP.    (1)

As a matter of fact, the pure categorial grammar is a context-free grammar presented in the accepting, rather than the producing direction. The expression (1) is just an accepting form for action "Cut" following the context-free grammar. While it is now convenient to write derivations as follows, they are equivalent to conventional tree structure derivations as in Figure 2.

  Knife       Cut             Cucumber
  N                           N
  NP          (AP\NP)/NP      NP
              ------------------- >
                   AP\NP
  -------------------------------- <
                AP

            AP
           /  \
         NP    AP
         |    /  \
         N   A    NP
         |   |    |
       Knife Cut  N
                  |
              Cucumber

Figure 2: Example of conventional tree structure.
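Before turning to how syntax and semantics combine, the typed λ-calculus of Sec. 3.1 can be made concrete with a toy encoding. The following Python sketch is our illustration only, not the paper's implementation: entities are strings, cut is a primitive action function whose event updates the world status, and divided is a status function from entities to truth values.

```python
# Toy encoding (ours) of the typed λ-calculus in Sec. 3.1: entities are
# strings, "cut" is a primitive action function whose event updates the
# world status, and "divided" is a status function to truth values.

world_status = {'divided': set()}

def cut(x, y):
    """Primitive action function: the event of x cutting y (type AP)."""
    world_status['divided'].add(y)      # cut(x, y) -> divided(y)
    return ('cut', x, y)

def divided(y):
    """Status function: entity -> truth value."""
    return y in world_status['divided']

knife, cucumber = 'knife', 'cucumber'
cut_with_unknown_tool = lambda x: cut(x, cucumber)   # λx.cut(x, cucumber)

print(divided(cucumber))     # False: nothing has happened yet
cut(knife, cucumber)         # the event "Knife Cut Cucumber"
print(divided(cucumber))     # True after the event
```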

The semantic type is encoded in these categories, and their translation can be made explicit in an expanded notation. Basically, a λ-calculus expression is attached to the syntactic category. A colon operator is used to separate syntactic and semantic expressions, and the right side of the colon is assumed to have lower precedence than the left side. This is intuitive, as any explanation of manipulation actions should first obey syntactic rules, then semantic rules. Now the basic element, Action "Cut", can be further represented by:

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y).

(AP\NP)/NP denotes a phrase of type AP, which requires an element of type NP to specify what object was cut, and requires another element of type NP to further complement what effector initiates the cut action. λx.λy.cut(x, y) is the λ-calculus representation for this function. Since the functions are closely related to the state update, → divided(y) further points out the status expression after the action is performed.

A CCG system has a set of combinatory rules which describe how adjacent syntactic categories in a string can be recursively combined. In the setting of manipulation actions, we want to point out that similar combinatory rules are also applicable. Especially the functional application rules are essential in our system.

3.3 Functional application

The functional application rules with semantics can be expressed in the following form:

A/B : f    B : g    =>    A : f(g)    (2)
B : g    A\B : f    =>    A : f(g)    (3)

Rule (2) says that a string with type A/B can be combined with a right-adjacent string of type B to form a new string of type A. At the same time, it also specifies how the semantics of the category A can be compositionally built out of the semantics for A/B and B. Rule (3) is the symmetric form of Rule (2).

In the domain of manipulation actions, the following derivation is an example CCG parse. This parse shows how the system can parse an observation ("Knife Cut Cucumber") into a semantic representation (cut(knife, cucumber) → divided(cucumber)) using the functional application rules.

  Knife          Cut                              Cucumber
  N : knife      (AP\NP)/NP :                     N : cucumber
  NP : knife     λx.λy.cut(x, y) → divided(y)     NP : cucumber
                 ---------------------------------------------- >
                 AP\NP : λx.cut(x, cucumber) → divided(cucumber)
  -------------------------------------------------------------- <
  AP : cut(knife, cucumber) → divided(cucumber)
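The functional application rules (2) and (3) are simple enough to mechanize directly. Below is a minimal Python sketch (our own simplified encoding of categories and semantics, not the authors' parser) that reproduces the derivation above:

```python
# Minimal sketch (ours) of the functional application rules (2) and (3).
# A basic category is a string; a complex category is a tuple
# (result, slash, argument) with slash '/' (rightward) or '\' (leftward).

def forward_apply(left, right):
    """Rule (2): A/B : f  +  B : g  =>  A : f(g)."""
    cat, sem = left
    if isinstance(cat, tuple) and cat[1] == '/' and cat[2] == right[0]:
        return (cat[0], sem(right[1]))

def backward_apply(left, right):
    """Rule (3): B : g  +  A\\B : f  =>  A : f(g)."""
    cat, sem = right
    if isinstance(cat, tuple) and cat[1] == '\\' and cat[2] == left[0]:
        return (cat[0], sem(left[1]))

# Lexicon for the "Knife Cut Cucumber" derivation. Following the parse
# above, the right-adjacent NP (the patient y) is consumed first.
knife = ('NP', 'knife')
cucumber = ('NP', 'cucumber')
cut = ((('AP', '\\', 'NP'), '/', 'NP'),
       lambda y: lambda x: f'cut({x}, {y}) -> divided({y})')

step1 = forward_apply(cut, cucumber)   # AP\NP : λx.cut(x, cucumber) -> ...
step2 = backward_apply(knife, step1)   # AP : cut(knife, cucumber) -> ...
print(step2[1])                        # cut(knife, cucumber) -> divided(cucumber)
```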

4 Learning Model and Semantic Parsing

After having defined the formalism and application rule, instead of manually writing down all the possible CCG representations for each entity, we would like to apply a learning technique to derive them from the paired training corpus. Here we adopt the learning model of (Zettlemoyer and Collins, 2005), and use it to assign weights to the semantic representations of actions. Since an action may have multiple possible syntactic and semantic representations assigned to it, we use the probabilistic model to assign weights to these representations.

4.1 Learning Approach

First we assume that complete syntactic parses of the observed action are available, and in fact a manipulation action can have several different parses. The parsing uses a probabilistic combinatory categorial grammar framework similar to the one given by (Zettlemoyer and Collins, 2007). We assume a probabilistic categorial grammar (PCCG) based on a log-linear model. M denotes a manipulation task, L denotes the semantic representation of the task, and T denotes its parse tree. The probability of a particular syntactic and semantic parse is given as:

P(L, T | M; Θ) = e^{f(L,T,M) · Θ} / Σ_{(L,T)} e^{f(L,T,M) · Θ}    (4)

where f is a mapping of the triple (L, T, M) to feature vectors in R^d, and Θ ∈ R^d represents the weights to be learned. Here we use only lexical features, where each feature counts the number of times a lexical entry is used in T. Parsing a manipulation task under PCCG equates to finding L such that P(L | M; Θ) is maximized:

argmax_L P(L | M; Θ) = argmax_L Σ_T P(L, T | M; Θ).    (5)
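Equation (4) is a standard log-linear model over candidate parses. A small sketch of the scoring and normalization follows, with made-up lexicon-entry weights purely for illustration:

```python
import math
from collections import Counter, defaultdict

# Sketch (ours) of Eqs. (4) and (5): a log-linear model over parses whose
# only features are counts of lexical entries used in the parse tree T.

def score(entries, theta):
    """f(L, T, M) · Θ with purely lexical features."""
    return sum(theta.get(e, 0.0) * n for e, n in Counter(entries).items())

def parse_probabilities(parses, theta):
    """Eq. (4): normalize exp(score) over all candidate (L, T) pairs."""
    z = sum(math.exp(score(entries, theta)) for _, entries in parses)
    return {(L, tuple(entries)): math.exp(score(entries, theta)) / z
            for L, entries in parses}

def best_semantics(parses, theta):
    """Eq. (5): pick the L that maximizes the sum over its parse trees."""
    totals = defaultdict(float)
    for L, entries in parses:
        totals[L] += math.exp(score(entries, theta))
    return max(totals, key=totals.get)

# Hypothetical candidates: each is (semantics L, lexical entries of tree T);
# weights are invented for illustration only.
theta = {'Cut:=(AP\\NP)/NP': 1.2, 'Cut:=AP\\NP': 0.3}
parses = [('cut(knife, cucumber) -> divided(cucumber)',
           ['Knife:=N', 'Cut:=(AP\\NP)/NP', 'Cucumber:=N']),
          ('cut(knife, cucumber)',
           ['Knife:=N', 'Cut:=AP\\NP', 'Cucumber:=N'])]
print(best_semantics(parses, theta))
```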

We use dynamic programming techniques to calculate the most probable parse for the manipulation task. In this paper, the implementation from (Baral et al., 2011) is adopted, where an inverse-λ technique is used to generalize new semantic representations. The generalization of lexicon rules is essential for our system to deal with unknown actions presented during the testing phase.

5 Experiments

5.1 Manipulation Action (MANIAC) Dataset

(Aksoy et al., 2014) provides a manipulation action dataset with 8 different manipulation actions (cutting, chopping, stirring, putting, taking, hiding, uncovering, and pushing), each of which consists of 15 different versions performed by 5 different human actors.¹ There are in total 30 different objects manipulated in all demonstrations. All manipulations were recorded with the Microsoft Kinect sensor and serve as training data here.

The MANIAC dataset contains another 20 long and complex chained manipulation sequences (e.g. "making a sandwich") which consist of a total of 103 different versions of these 8 manipulation tasks performed in different orders with novel objects under different circumstances. These serve as testing data for our experiments.

(Aksoy et al., 2014; Aksoy and Wörgötter, 2015) developed a semantic event chain based model-free decomposition approach. It is an unsupervised probabilistic method that measures the frequency of the changes in the spatial relations embedded in event chains, in order to extract the subject and patient visual segments. It also decomposes the long chained complex testing actions into their primitive action components according to the spatio-temporal relations of the manipulator. Since the visual recognition is not the core of this work, we omit the details here and refer the interested reader to (Aksoy et al., 2014; Aksoy and Wörgötter, 2015). All these features make the MANIAC dataset a great testing bed for both the theoretical framework and the implemented system presented in this work.

¹Dataset available for download at https://fortknox.physik3.gwdg.de/cns/index.php?page=maniac-dataset.

5.2 Training Corpus

We first created a training corpus by annotating the 120 training clips from the MANIAC dataset, in the format of observed triplets (subject action patient) and a corresponding semantic representation of the action as well as its consequence. The semantic representations in λ-calculus format are given by human annotators after watching each action clip. A set of sample training pairs is given in Table 1 (one from each action category in the training set). Since every training clip contains one single full execution of each manipulation action considered, the training corpus thus has a total of 120 paired training samples.

Table 1: Example annotations from the training corpus, one per manipulation action category (snapshot column omitted).

  triplet                   semantic representation
  cleaver chopping carrot   chopping(cleaver, carrot) → divided(carrot)
  spatula cutting pepper    cutting(spatula, pepper) → divided(pepper)
  spoon stirring bucket     stirring(spoon, bucket)
  cup take down bucket      take down(cup, bucket) → ¬connected(cup, bucket) ∧ moved(cup)
  cup put on top bowl       put on top(cup, bowl) → on top(cup, bowl) ∧ moved(cup)
  bucket hiding ball        hiding(bucket, ball) → contained(bucket, ball) ∧ moved(bucket)
  hand pushing box          pushing(hand, box) → moved(box)
  box uncover apple         uncover(box, apple) → appear(apple) ∧ moved(box)

We also assume the system knows that every "object" involved in the corpus is an entity of its own type, for example:

Knife := N : knife
Bowl := N : bowl
......

Additionally, we assume the syntactic form of each "action" has a main type (AP\NP)/NP (see Sec. 3.2). These two sets of rules form the initial seed lexicon for learning.
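For concreteness, one annotated training pair from Table 1 together with the two kinds of seed lexicon rules might be encoded as follows (a hypothetical format of ours; the actual corpus format may differ):

```python
# Hypothetical encoding (ours) of one annotated training pair from Table 1
# and the two kinds of seed lexicon rules described above.

training_pair = {
    'triplet': ('cleaver', 'chopping', 'carrot'),
    'semantics': 'chopping(cleaver, carrot) -> divided(carrot)',
}

seed_lexicon = {
    # every object is an entity of its own type, e.g. Knife := N : knife
    'cleaver': ('N', 'cleaver'),
    'carrot': ('N', 'carrot'),
    # every action gets main type (AP\NP)/NP; its λ-calculus semantics,
    # e.g. λx.λy.chopping(x, y) -> divided(y), is what learning must fill in
    'chopping': ('(AP\\NP)/NP', None),
}
```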

5.3 Learned Lexicon

We applied the learning technique mentioned in Sec. 4, using the NL2KR implementation from (Baral et al., 2011). The system learns and generalizes a set of lexicon entries (syntactic and semantic) for each action category from the training corpus, accompanied with a set of weights. We list the entry with the largest weight for each action here respectively:

Chopping := (AP\NP)/NP : λx.λy.chopping(x, y) → divided(y)
Cutting := (AP\NP)/NP : λx.λy.cutting(x, y) → divided(y)
Stirring := (AP\NP)/NP : λx.λy.stirring(x, y)
Take down := (AP\NP)/NP : λx.λy.take down(x, y) → ¬connected(x, y) ∧ moved(x)
Put on top := (AP\NP)/NP : λx.λy.put on top(x, y) → on top(x, y) ∧ moved(x)
Hiding := (AP\NP)/NP : λx.λy.hiding(x, y) → contained(x, y) ∧ moved(x)
Pushing := (AP\NP)/NP : λx.λy.pushing(x, y) → moved(y)
Uncover := (AP\NP)/NP : λx.λy.uncover(x, y) → appear(y) ∧ moved(x)

The set of seed lexicon and the learned lexicon entries are further used to probabilistically parse the detected triplet sequences from the 20 long manipulation activities in the testing set.

5.4 Deducing Semantics

Using the decomposition technique from (Aksoy et al., 2014; Aksoy and Wörgötter, 2015), the reported system is able to detect a sequence of action triplets in the form of (Subject Action Patient) from each of the testing sequences in the MANIAC dataset. Briefly speaking, the event chain representation (Aksoy et al., 2011) of the observed long manipulation activity is first scanned to estimate the main manipulator, i.e. the hand, and manipulated objects, e.g. knife, in the scene, without employing any visual feature-based object recognition method. Solely based on the interactions between the hand and manipulated objects in the scene, the event chain is partitioned into chunks. These chunks are further fragmented into sub-units to detect parallel action streams. Each parsed Semantic Event Chain (SEC) chunk is then compared with the model SECs in the library to decide whether the current SEC sample belongs to one of the known manipulation models or represents a novel manipulation. SEC models, stored in the library, are learned in an on-line unsupervised fashion using the semantics of manipulations derived from a given set of training data, in order to create a large vocabulary of single atomic manipulations.

For the different testing sequences, the number of triplets detected ranges from two to seven. In total, we are able to collect 90 testing detections, and they serve as the testing corpus. However, since many of the objects used in the testing data are not present in the training set, an object model-free approach is adopted and thus the "subject" and "patient" fields are filled with segment IDs instead of a specific object name. Fig. 3 and 4 show several examples of the detected triplets accompanied with a set of key frames from the testing sequences. Nevertheless, the method we use here can 1) generalize the unknown segments into the category of object entities and 2) generalize the unknown actions (those that do not exist in the training corpus) into the category of action functions. This is done by automatically generalizing the following two types of lexicon entries using the inverse-λ technique from (Baral et al., 2011):

Object [ID] := N : object [ID]
Unknown := (AP\NP)/NP : λx.λy.unknown(x, y)

Among the 90 detected triplets, using the learned lexicon we are able to parse all of them into semantic representations. Here we pick the representation with the highest probability after parsing as the individual action semantic representation. The "parsed semantics" rows of Fig. 3 and 4 show several example action semantics on testing sequences. Taking the fourth sub-action from Fig. 4 as an example, the visually detected triplet based on segmentation and spatial decomposition is (Object 014, Chopping, Object 011). After semantic parsing, the system predicts that divided(Object 011). The complete training corpus and parsed results of the testing set will be made publicly available for future research.
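The parsing step of Sec. 5.4 can be sketched as a lookup-plus-application over detected triplets; the fallback entry below mirrors the generalized Unknown lexicon rule (our simplified stand-in for the inverse-λ machinery of (Baral et al., 2011)):

```python
# Sketch (ours) of the parsing step in Sec. 5.4: a detected (subject,
# action, patient) triplet is mapped to its action semantics. Unseen
# actions fall back to the generalized entry, mirroring
# Unknown := (AP\NP)/NP : λx.λy.unknown(x, y).

learned_actions = {
    'chopping': lambda x, y: f'chopping({x}, {y}) -> divided({y})',
    'cutting':  lambda x, y: f'cutting({x}, {y}) -> divided({y})',
    'pushing':  lambda x, y: f'pushing({x}, {y}) -> moved({y})',
}

def parse_triplet(subject, action, patient):
    sem = learned_actions.get(action, lambda x, y: f'unknown({x}, {y})')
    return sem(subject, patient)

# e.g. the fourth sub-action of Fig. 4, with model-free segment IDs:
print(parse_triplet('object_014', 'chopping', 'object_011'))
# -> chopping(object_014, object_011) -> divided(object_011)
```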

Figure 3: System output on complex chained manipulation testing sequence one. The segmentation output and detected triplets are from (Aksoy and Wörgötter, 2015).

Figure 4: System output on the 18th complex chained manipulation testing sequence. The segmentation output and detected triplets are from (Aksoy and Wörgötter, 2015).

5.5 Reasoning Beyond Observations

As mentioned before, because of the use of λ-calculus for representing action semantics, the obtained data can naturally be used to do logical reasoning beyond observations. This by itself is a very interesting research topic and is beyond this paper's scope. However, by applying a couple of common sense Axioms to the testing data, we can provide some flavor of this idea.

Case study one: See the "final action consequence and reasoning" row of Fig. 3 for case one. Using propositional logic and axiom schemata, we can represent the common sense statement ("if an object x is contained in object y, and object z is on top of object y, then object z is on top of object x") as follows:

Axiom (1): ∃x, y, z, contained(y, x) ∧ on top(z, y) → on top(z, x).

Then it is trivial to deduce an additional final action consequence in this scenario: on top(object 007, object 009). This matches the fact: the yellow box which is put on top of the red bucket is also on top of the black ball.

Case study two: See the "final action consequence and reasoning" row of Fig. 4 for a more complicated case. Using propositional logic and axiom schemata, we can represent three common sense statements:
1) "if an object y is contained in object x, and object z is contained in object y, then object z is contained in object x";
2) "if an object x is contained in object y, and object y is divided, then object x is divided";
3) "if an object x is contained in object y, and object y is on top of object z, then object x is on top of object z"
as follows:

Axiom (2): ∃x, y, z, contained(y, x) ∧ contained(z, y) → contained(z, x).
Axiom (3): ∃x, y, contained(y, x) ∧ divided(y) → divided(x).
Axiom (4): ∃x, y, z, contained(y, x) ∧ on top(y, z) → on top(x, z).

With these common sense Axioms, the system is able to deduce several additional final action consequences in this scenario:

divided(object 005) ∧ divided(object 010) ∧ on top(object 005, object 012) ∧ on top(object 010, object 012).

From Fig. 4, we can see that these additional consequences indeed match the facts: 1) the bread and cheese which are covered by ham are also divided, even though from observation the system only detected the ham being cut; 2) the divided bread and cheese are also on top of the plate, even though from observation the system only detected the ham being put on top of the plate.
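A minimal sketch of this reasoning step: the four Axioms can be forward-chained over the parsed consequences until a fixed point is reached. The predicate encoding below is ours ('in' for contained, 'on' for on top), chosen to make the argument order of the English statements explicit:

```python
# Sketch (ours) of forward-chaining the four Axioms to a fixed point.
# Encoding: ('in', a, b) = a is contained in b; ('on', a, b) = a is on
# top of b; ('divided', a) = a is divided.

def deduce(facts):
    facts = set(facts)
    while True:
        new = set()
        for f in facts:
            if f[0] != 'in':
                continue
            _, x, y = f                        # x is contained in y
            for g in facts:
                if g[0] == 'on' and g[2] == y:
                    new.add(('on', g[1], x))   # Axiom (1): z on y => z on x
                if g[0] == 'in' and g[2] == x:
                    new.add(('in', g[1], y))   # Axiom (2): z in x => z in y
                if g[0] == 'on' and g[1] == y:
                    new.add(('on', x, g[2]))   # Axiom (4): y on z => x on z
            if ('divided', y) in facts:
                new.add(('divided', x))        # Axiom (3): y divided => x divided
        if new <= facts:
            return facts
        facts |= new

# Fig. 4-style example: bread and cheese are inside (covered by) the ham,
# which is cut and put on top of the plate.
observed = {('in', 'bread', 'ham'), ('in', 'cheese', 'ham'),
            ('divided', 'ham'), ('on', 'ham', 'plate')}
for fact in sorted(deduce(observed) - observed):
    print(fact)   # deduces divided/on-top facts for bread and cheese
```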

We applied the four Axioms to the 20 testing action sequences and deduced the "hidden" consequences from observation. To evaluate our system performance quantitatively, we first annotated all the final action consequences (both obvious and "hidden" ones) from the 20 testing sequences as ground-truth facts. In total there are 122 consequences annotated. Using perception only (Aksoy and Wörgötter, 2015), due to decomposition errors (such as the ones shown in red in Fig. 4) the system detects 91 consequences correctly, yielding a 74% correct rate. After applying the four Axioms and reasoning, our system is able to detect 105 consequences correctly, yielding an 86% correct rate. Overall, this is a 15.4% relative improvement. Here we want to mention a caveat: there are definitely other common sense Axioms that we are not able to address in the current implementation. However, from the case studies presented, we can see that using the presented formal framework, our system is able to reason about manipulation action goals instead of just observing what is happening visually. This capability is essential for intelligent agents to imitate action goals from observation.

6 Conclusion and Future Work

In this paper we presented a formal computational framework for modeling manipulation actions based on a Combinatory Categorial Grammar. An empirical study on a large manipulation action dataset validates that 1) with the introduced formalism, a learning system can be devised to deduce the semantic meaning of manipulation actions in λ-schema; 2) with the learned schema and several common sense Axioms, our system is able to reason beyond just observation and deduce "hidden" action consequences, yielding a decent performance improvement.

Due to the limitation of current testing scenarios, we conducted experiments considering only a relatively small set of seed lexicon rules and logical expressions. Nevertheless, we want to mention that the presented CCG framework can also be extended to learn the formal logic representation of more complex manipulation action semantics. For example, the temporal order of manipulation actions can be modeled by considering a seed rule such as AP\AP : λf.λg.before(f(·), g(·)), where before(·, ·) is a temporal predicate. For actions in this paper we consider the seed main type (AP\NP)/NP. For more general manipulation scenarios, based on whether the action is transitive or intransitive, the main types of action can be extended to include AP\NP.

Moreover, the logical expressions can also be extended to include universal quantification ∀ and existential quantification ∃. Thus, a manipulation action such as "knife cut every tomato" can be parsed into a representation as ∀x.tomato(x) ∧ cut(knife, x) → divided(x) (the parse is given in the following chart). Here, the concept "every" has a main type of NP\NP and semantic meaning of ∀x.f(x). The same framework can also be extended to have other combinatory rules such as composition and type-raising (Steedman, 2002). These are part of the future work along the lines of the presented work.

  Knife        Cut                              every       Tomato
  N                                                         N
  NP : knife   (AP\NP)/NP :                     NP\NP :     NP : tomato
               λx.λy.cut(x, y) → divided(y)     ∀x.f(x)
                                                ----------------------- >
                                                NP : ∀x.tomato(x)
               ------------------------------------------------------- >
               AP\NP : ∀y.λx.tomato(y) ∧ cut(x, y) → divided(y)
  --------------------------------------------------------------------- <
  AP : ∀y.tomato(y) ∧ cut(knife, y) → divided(y)

The presented computational linguistic framework enables an intelligent agent to predict and reason about action goals from observation, and thus has many potential applications such as human intention prediction, robot action policy planning, human robot collaboration, etc. We believe that our formalism of manipulation actions bridges computational linguistics, vision and robotics, and opens further research in Artificial Intelligence and Robotics. As the robotics industry is moving towards robots that function safely, effectively and autonomously to perform tasks in real-world unstructured environments, they will need to be able to understand the meaning of actions and acquire human-like common-sense reasoning capabilities.

7 Acknowledgements

This research was funded in part by the support of the European Union under the Cognitive Systems program (project POETICON++), the National Science Foundation under INSPIRE grant SMA 1248056, and by DARPA through U.S. Army grant W911NF-14-1-0384 under the project: Shared Perception, Cognition and Reasoning for Autonomy.

References

E. E. Aksoy and F. Wörgötter. 2015. Semantic decomposition and recognition of long and complex manipulation action sequences. International Journal of Computer Vision, page Under Review.

E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, and F. Wörgötter. 2011. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research, 30(10):1229–1249.

E. E. Aksoy, M. Tamosiunaite, and F. Wörgötter. 2014. Model-free incremental learning of the semantics of manipulation actions. Robotics and Autonomous Systems, pages 1–42.

Chitta Baral, Juraj Dzifcak, Marcos Alvarez Gonzalez, and Jiayu Zhou. 2011. Using inverse λ and generalization to translate English to formal languages. In Proceedings of the Ninth International Conference on Computational Semantics, pages 35–44. Association for Computational Linguistics.

Jezekiel Ben-Arie, Zhiqian Wang, Purvin Pandit, and Shyamsundar Rajaram. 2002. Human activity recognition using multidimensional indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1091–1104.

Matthew Brand. 1996. Understanding manipulation in video. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 94–99, Killington, VT. IEEE.

R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. 2009. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Proceedings of the 2009 IEEE International Conference on Computer Vision and Pattern Recognition, pages 1932–1939, Miami, FL. IEEE.

N. Chomsky. 1957. Syntactic Structures. Mouton de Gruyter.

Noam Chomsky. 1993. Lectures on Government and Binding: The Pisa Lectures. Walter de Gruyter.

V. Gazzola, G. Rizzolatti, B. Wicker, and C. Keysers. 2007. The anthropomorphic brain: the mirror neuron system responds to human and robotic actions. Neuroimage, 35(4):1674–1684.

Anupam Guha, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Minimalist plans for interpreting manipulation actions. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5908–5914.

Yuri A. Ivanov and Aaron F. Bobick. 2000. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872.

Jungseock Joo, Weixin Li, Francis F. Steen, and Song-Chun Zhu. 2014. Visual persuasion: Inferring communicative intents of images. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 216–223. IEEE.

A. Kale, A. Sundaresan, A. N. Rajagopalan, N. P. Cuntoor, A. K. Roy-Chowdhury, V. Kruger, and R. Chellappa. 2004. Identification of humans using gait. IEEE Transactions on Image Processing, 13(9):1163–1173.

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 780–787. IEEE.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2011. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML).

Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. 2014. Learning from unscripted deictic gesture and language for human-robot interactions. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

T. B. Moeslund, A. Hilton, and V. Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2):90–126.

Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI, pages 1598–1601.

Darnell Moore and Irfan Essa. 2002. Recognizing multitasked activities from video using stochastic context-free grammar. In Proceedings of the National Conference on Artificial Intelligence, pages 770–776, Menlo Park, CA. AAAI.

K. Pastra and Y. Aloimonos. 2012. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1585):103–117.

Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. 2014. Inferring the why in images. arXiv preprint arXiv:1406.5472.

Giacomo Rizzolatti, Leonardo Fogassi, and Vittorio Gallese. 2001. Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2(9):661–670.

Michael S. Ryoo and Jake K. Aggarwal. 2006. Recognition of composite human activities through context-free grammar based representation. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1709–1718, New York City, NY. IEEE.

P. Saisan, G. Doretto, Y. N. Wu, and S. Soatto. 2001. Dynamic texture recognition. In Proceedings of the 2001 IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages 58–63, Kauai, HI. IEEE.

Mark Steedman. 2000. The Syntactic Process, volume 35. MIT Press.

Mark Steedman. 2002. Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25(5-6):723–753.

D. Summers-Stay, C. L. Teo, Y. Yang, C. Fermüller, and Y. Aloimonos. 2013. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4104–4111, Vilamoura, Portugal. IEEE.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. 2008. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473–1488.

Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. 2013. Inferring "dark matter" and "dark energy" from videos. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2224–2231. IEEE.

Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2563–2570, Portland, OR. IEEE.

Y. Yang, A. Guha, C. Fermüller, and Y. Aloimonos. 2014. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems, 3:67–86.

Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Robot learning manipulation action plans by "watching" unconstrained videos from the World Wide Web. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15).

A. Yilmaz and M. Shah. 2005. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 984–989, San Diego, CA. IEEE.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687.
