Learning the Semantics of Manipulation Action

Yezhou Yang†, Yiannis Aloimonos†, Cornelia Fermüller† and Eren Erdal Aksoy‡
† UMIACS, University of Maryland, College Park, MD, USA
{yzyang, yiannis, fer}@umiacs.umd.edu
‡ Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]

Abstract

In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the λ-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.

1 Introduction

Autonomous robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. This requires a formal system to represent the action semantics. This representation needs to store the semantic information about the actions, be encoded in a machine readable language, and inherently be in a programmable fashion in order to enable reasoning beyond observation. A formal representation of this kind has a variety of other applications such as intelligent manufacturing, human robot collaboration, action planning and policy design, etc.

In this paper, we are concerned with manipulation actions, that is, actions performed by agents (humans or robots) on objects, resulting in some physical change of the object. However, most current AI systems require manually defined semantic rules. In this work, we propose a computational framework, which is based on probabilistic semantic parsing with Combinatory Categorial Grammar (CCG), to learn manipulation action semantics (lexicon entries) from annotations. We later show that this learned lexicon enables our system to reason about manipulation action goals beyond just observation. Thus the intelligent system can not only imitate human movements, but also imitate action goals.

Understanding actions by observation and executing them are generally considered as dual problems for intelligent agents. The sensori-motor bridge connecting the two tasks is essential, and a great amount of attention in AI, Robotics, and Neurophysiology has been devoted to investigating it. Experiments conducted on primates have discovered that certain neurons, the so-called mirror neurons, fire during both observation and execution of identical manipulation tasks (Rizzolatti et al., 2001; Gazzola et al., 2007). This suggests that the same process is involved in both the observation and execution of actions. From a functionalist point of view, such a process should be able to first build up a semantic structure from observations, and then the decomposition of that same structure should occur when the intelligent agent executes commands.

Additionally, studies in linguistics (Steedman, 2002) suggest that the language faculty develops in humans as a direct adaptation of a more primitive apparatus for planning goal-directed action in the world by composing affordances of tools and consequences of actions. It is this more primitive apparatus that is our major interest in this paper.

Such an apparatus is composed of a "syntax part" and a "semantic part". In the syntax part, every linguistic element is categorized as either a function or a basic type, and is associated with a syntactic category which either identifies it as a function or a basic type. In the semantic part, a semantic translation is attached following the syntactic category explicitly.

Combinatory Categorial Grammar (CCG), introduced by (Steedman, 2000), is a theory that can be used to represent such structures with a small set of combinators such as functional application and type-raising. What do we gain, though, from such a formal description of action? This is similar to asking what one gains from a formal description of language as a generative system. Chomsky's contribution to language research was exactly this: the formal description of language through the formulation of the Generative and Transformational Grammar (Chomsky, 1957). It revolutionized language research, opening up new roads for the computational analysis of language and providing researchers with common, generative language structures and syntactic operations on which language analysis tools were built. A grammar for action would contribute to providing a common framework for the syntax and semantics of action, so that basic tools for action understanding can be built, tools that researchers can use when developing action interpretation systems without having to start development from scratch. The same tools can be used by robots to execute actions.

In this paper, we propose an approach for learning the semantic meaning of manipulation action through a probabilistic semantic parsing framework based on CCG theory. For example, we want to learn from an annotated training action corpus that the action "Cut" is a function which has two arguments: a subject and a patient. Also, the action consequence of "Cut" is a separation of the patient. Using formal logic representation, our system will learn the semantic representation of "Cut":

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y)

Here cut(x, y) is a primitive function. We will further introduce the representation in Sec. 3. Since our action representation is in a common calculus form, it naturally enables further logical reasoning beyond visual observation.

The advantage of our approach is twofold: 1) Learning semantic representations from annotations helps an intelligent agent to automatically enrich its own knowledge about actions; 2) The formal logic representation of the action can be used to infer the object-wise consequence after a certain manipulation, and can also be used to plan a set of actions to reach a certain action goal. We further validate our approach on a large publicly available manipulation action dataset (MANIAC) from (Aksoy et al., 2014), achieving promising experimental results. Moreover, we believe that our work, even though it only considers the domain of manipulation actions, is also a promising example of a more closely intertwined computer vision and computational linguistics system. The diagram in Fig. 1 depicts the framework of the system.

Figure 1: A CCG based semantic parsing framework for manipulation actions.

2 Related Works

Reasoning beyond appearance: The very small number of works in computer vision which aim to reason beyond appearance models are also related to this paper. (Xie et al., 2013) proposed that beyond state-of-the-art computer vision techniques, we could possibly infer implicit information (such as functional objects) from video, and they call them "Dark Matter" and "Dark Energy". (Yang et al., 2013) used stochastic tracking and graph-cut based segmentation to infer manipulation consequences beyond appearance. (Joo et al., 2014) used a ranking SVM to predict the persuasive motivation (or the intention) of the photographer who captured an image. More recently, (Pirsiavash et al., 2014) seeks to infer the motivation of the person in the image by mining knowledge stored in a large corpus using natural language processing techniques.

Different from these fairly general investigations about reasoning beyond appearance, our paper seeks to learn manipulation action semantics in logic forms through CCG, and further infer hidden action consequences beyond appearance through reasoning.

Action Recognition and Understanding: Human activity recognition and understanding has been studied heavily in Computer Vision recently, and there is a large range of applications for this work in areas like human-computer interaction, biometrics, and video surveillance. Both visual recognition methods and non-visual description methods using motion capture systems have been used. A few good surveys of the former can be found in (Moeslund et al., 2006) and (Turaga et al., 2008). Most of the focus has been on recognizing single human actions like walking, jumping, or running (Ben-Arie et al., 2002; Yilmaz and Shah, 2005). Approaches to more complex actions have employed parametric approaches, such as HMMs (Kale et al., 2004), to learn the transition between feature representations in individual frames, e.g. (Saisan et al., 2001; Chaudhry et al., 2009). More recently, (Aksoy et al., 2011; Aksoy et al., 2014) proposed a semantic event chain (SEC) representation to model and learn the semantic segment-wise relationship transition from spatial-temporal video segmentation.

There have also been many syntactic approaches to human activity recognition which used the concept of context-free grammars, because such grammars provide a sound theoretical basis for modeling structured processes. Tracing back to the middle 90's, (Brand, 1996) used a grammar to recognize disassembly tasks that contain hand manipulations. (Ryoo and Aggarwal, 2006) used the context-free grammar formalism to recognize composite human activities and multi-person interactions. It is a two-level hierarchical approach where the lower levels are composed of HMMs and Bayesian Networks while the higher-level interactions are modeled by CFGs. To deal with errors from low-level processes such as tracking, stochastic grammars such as stochastic CFGs were also used (Ivanov and Bobick, 2000; Moore and Essa, 2002). More recently, (Kuehne et al., 2014) proposed to model goal-directed human activities using Hidden Markov Models and treat sub-actions just like words in speech. These works proved that grammar based approaches are practical in activity recognition systems, and shed insight onto human manipulation action understanding. However, as mentioned, thinking about manipulation actions solely from the viewpoint of recognition has obvious limitations. In this work we adopt principles from CFG based activity recognition systems, with extensions to a CCG grammar that accommodates not only the hierarchical structure of human activity but also action semantics representations. It enables the system to serve as the core parsing engine for both manipulation action recognition and execution.

Manipulation Action Grammar: As mentioned before, (Chomsky, 1993) suggested that a minimalist generative grammar, similar to the one of human language, also exists for action understanding and execution. The works most closely related to this paper are (Pastra and Aloimonos, 2012; Summers-Stay et al., 2013; Guha et al., 2013). (Pastra and Aloimonos, 2012) first discussed a Chomskyan grammar for understanding complex actions as a theoretical concept, and (Summers-Stay et al., 2013) provided an implementation of such a grammar using as perceptual input only objects. More recently, (Yang et al., 2014) proposed a set of context-free grammar rules for manipulation action understanding, and (Yang et al., 2015) applied it on unconstrained instructional videos. However, these approaches only consider the syntactic structure of manipulation actions without coupling semantic rules using λ expressions, which limits the capability of doing reasoning and prediction.

Combinatory Categorial Grammar and Semantic Parsing: CCG based semantic parsing was originally used mainly to translate natural language sentences to their desired semantic representations as λ-calculus formulas (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007). (Mooney, 2008) presented a framework of grounded language acquisition: the interpretation of language entities into semantically informed structures in the context of perception and actuation. The concept has been applied successfully in tasks such as robot navigation (Matuszek et al., 2011), forklift operation (Tellex et al., 2014) and human-robot interaction (Matuszek et al., 2014). In this work, instead of grounding natural language sentences directly, we ground information obtained from visual perception into semantically informed structures, specifically in the domain of manipulation actions.

3 A CCG Framework for Manipulation Actions

Before we dive into the semantic parsing of manipulation actions, a brief introduction to the Combinatory Categorial Grammar framework in Linguistics is necessary. We will only introduce related concepts and formalisms. For a complete background reading, we would like to refer readers to (Steedman, 2000). We will first give a brief introduction to CCG and then introduce a fundamental combinator, i.e., functional application. The introduction is followed by examples to show how the combinator is applied to parse actions.

3.1 Manipulation Action Semantics

The semantic expression in our representation of manipulation actions uses a typed λ-calculus language. The formal system has two basic types: entities and functions. Entities in manipulation actions are Objects or Hands, and functions are the Actions. Our λ-calculus expressions are formed from the following items:

Constants: Constants can be either entities or functions. For example, Knife is an entity (i.e., it is of type N) and Cucumber is an entity too (i.e., it is of type N). Cut is an action function that maps entities to entities. When the event Knife Cut Cucumber happened, the expression cut(Knife, Cucumber) returns an entity of type AP, a.k.a. Action Phrase. Constants like divided are status functions that map entities to truth values. The expression divided(cucumber) returns a true value after the event (Knife Cut Cucumber) happened.

Logical connectors: The λ-calculus expression has logical connectors like conjunction (∧), disjunction (∨), negation (¬) and implication (→). For example, the expression

connected(tomato, cucumber) ∧ divided(tomato) ∧ divided(cucumber)

represents the joint status that the sliced tomato merged with the sliced cucumber. It can be regarded as a simplified goal status for "making a cucumber tomato salad". The expression ¬connected(spoon, bowl) represents the status after the spoon finished stirring the bowl. The expression

λx.cut(x, cucumber) → divided(cucumber)

represents that if the cucumber is cut by x, then the status of the cucumber is divided.

λ expressions: Lambda expressions represent functions with unknown arguments. For example, λx.cut(knife, x) is a function from entities to entities, which takes any entity of type N that is cut by the knife.

3.2 Combinatory Categorial Grammar

The semantic parsing formalism underlying our framework for manipulation actions is that of combinatory categorial grammar (CCG) (Steedman, 2000). A CCG specifies one or more logical forms for each element or combination of elements for manipulation actions. In our formalism, an element of Action is associated with a syntactic "category" which identifies it as a function, and specifies the type and directionality of its arguments and the type of its result. For example, action "Cut" is a function from a patient object phrase (NP) on the right into predicates, and into functions from a subject object phrase (NP) on the left into a sub-action phrase (AP):

Cut := (AP\NP)/NP.    (1)

As a matter of fact, the pure categorial grammar is a context-free grammar presented in the accepting, rather than the producing direction. The expression (1) is just an accepting form for action "Cut" following the context-free grammar. While it is now convenient to write derivations as follows, they are equivalent to conventional tree structure derivations as in Figure 2.

  Knife       Cut             Cucumber
  N                           N
  NP          (AP\NP)/NP      NP
              ------------------- >
                   AP\NP
  -------------------------------- <
                AP

            AP
           /  \
         NP    AP
         |    /  \
         N   A    NP
         |   |    |
       Knife Cut  N
                  |
              Cucumber

Figure 2: Example of conventional tree structure.
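Before turning to how syntax and semantics combine, the typed λ-calculus of Sec. 3.1 can be made concrete with a toy encoding. The following Python sketch is our illustration only, not the paper's implementation: entities are strings, cut is a primitive action function whose event updates the world status, and divided is a status function from entities to truth values.

```python
# Toy encoding (ours) of the typed λ-calculus in Sec. 3.1: entities are
# strings, "cut" is a primitive action function whose event updates the
# world status, and "divided" is a status function to truth values.

world_status = {'divided': set()}

def cut(x, y):
    """Primitive action function: the event of x cutting y (type AP)."""
    world_status['divided'].add(y)      # cut(x, y) -> divided(y)
    return ('cut', x, y)

def divided(y):
    """Status function: entity -> truth value."""
    return y in world_status['divided']

knife, cucumber = 'knife', 'cucumber'
cut_with_unknown_tool = lambda x: cut(x, cucumber)   # λx.cut(x, cucumber)

print(divided(cucumber))     # False: nothing has happened yet
cut(knife, cucumber)         # the event "Knife Cut Cucumber"
print(divided(cucumber))     # True after the event
```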

The semantic type is encoded in these categories, and their translation can be made explicit in an expanded notation. Basically, a λ-calculus expression is attached to the syntactic category. A colon operator is used to separate syntactic and semantic expressions, and the right side of the colon is assumed to have lower precedence than the left side. This is intuitive, as any explanation of manipulation actions should first obey syntactic rules, then semantic rules. Now the basic element, Action "Cut", can be further represented by:

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y).

(AP\NP)/NP denotes a phrase of type AP, which requires an element of type NP to specify what object was cut, and requires another element of type NP to further complement what effector initiates the cut action. λx.λy.cut(x, y) is the λ-calculus representation for this function. Since the functions are closely related to the state update, → divided(y) further points out the status expression after the action is performed.

A CCG system has a set of combinatory rules which describe how adjacent syntactic categories in a string can be recursively combined. In the setting of manipulation actions, we want to point out that similar combinatory rules are also applicable. Especially the functional application rules are essential in our system.

3.3 Functional application

The functional application rules with semantics can be expressed in the following form:

A/B : f    B : g    =>    A : f(g)    (2)
B : g    A\B : f    =>    A : f(g)    (3)

Rule (2) says that a string with type A/B can be combined with a right-adjacent string of type B to form a new string of type A. At the same time, it also specifies how the semantics of the category A can be compositionally built out of the semantics for A/B and B. Rule (3) is the symmetric form of Rule (2).

In the domain of manipulation actions, the following derivation is an example CCG parse. This parse shows how the system can parse an observation ("Knife Cut Cucumber") into a semantic representation (cut(knife, cucumber) → divided(cucumber)) using the functional application rules.

  Knife          Cut                              Cucumber
  N : knife      (AP\NP)/NP :                     N : cucumber
  NP : knife     λx.λy.cut(x, y) → divided(y)     NP : cucumber
                 ---------------------------------------------- >
                 AP\NP : λx.cut(x, cucumber) → divided(cucumber)
  -------------------------------------------------------------- <
  AP : cut(knife, cucumber) → divided(cucumber)
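The functional application rules (2) and (3) are simple enough to mechanize directly. Below is a minimal Python sketch (our own simplified encoding of categories and semantics, not the authors' parser) that reproduces the derivation above:

```python
# Minimal sketch (ours) of the functional application rules (2) and (3).
# A basic category is a string; a complex category is a tuple
# (result, slash, argument) with slash '/' (rightward) or '\' (leftward).

def forward_apply(left, right):
    """Rule (2): A/B : f  +  B : g  =>  A : f(g)."""
    cat, sem = left
    if isinstance(cat, tuple) and cat[1] == '/' and cat[2] == right[0]:
        return (cat[0], sem(right[1]))

def backward_apply(left, right):
    """Rule (3): B : g  +  A\\B : f  =>  A : f(g)."""
    cat, sem = right
    if isinstance(cat, tuple) and cat[1] == '\\' and cat[2] == left[0]:
        return (cat[0], sem(left[1]))

# Lexicon for the "Knife Cut Cucumber" derivation. Following the parse
# above, the right-adjacent NP (the patient y) is consumed first.
knife = ('NP', 'knife')
cucumber = ('NP', 'cucumber')
cut = ((('AP', '\\', 'NP'), '/', 'NP'),
       lambda y: lambda x: f'cut({x}, {y}) -> divided({y})')

step1 = forward_apply(cut, cucumber)   # AP\NP : λx.cut(x, cucumber) -> ...
step2 = backward_apply(knife, step1)   # AP : cut(knife, cucumber) -> ...
print(step2[1])                        # cut(knife, cucumber) -> divided(cucumber)
```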

4 Learning Model and Semantic Parsing

After having defined the formalism and application rule, instead of manually writing down all the possible CCG representations for each entity, we would like to apply a learning technique to derive them from the paired training corpus. Here we adopt the learning model of (Zettlemoyer and Collins, 2005), and use it to assign weights to the semantic representations of actions. Since an action may have multiple possible syntactic and semantic representations assigned to it, we use the probabilistic model to assign weights to these representations.

4.1 Learning Approach

First we assume that complete syntactic parses of the observed action are available, and in fact a manipulation action can have several different parses. The parsing uses a probabilistic combinatory categorial grammar framework similar to the one given by (Zettlemoyer and Collins, 2007). We assume a probabilistic categorial grammar (PCCG) based on a log-linear model. M denotes a manipulation task, L denotes the semantic representation of the task, and T denotes its parse tree. The probability of a particular syntactic and semantic parse is given as:

P(L, T | M; Θ) = e^{f(L,T,M) · Θ} / Σ_{(L,T)} e^{f(L,T,M) · Θ}    (4)

where f is a mapping of the triple (L, T, M) to feature vectors in R^d, and Θ ∈ R^d represents the weights to be learned. Here we use only lexical features, where each feature counts the number of times a lexical entry is used in T. Parsing a manipulation task under PCCG equates to finding L such that P(L | M; Θ) is maximized:

argmax_L P(L | M; Θ) = argmax_L Σ_T P(L, T | M; Θ).    (5)
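Equation (4) is a standard log-linear model over candidate parses. A small sketch of the scoring and normalization follows, with made-up lexicon-entry weights purely for illustration:

```python
import math
from collections import Counter, defaultdict

# Sketch (ours) of Eqs. (4) and (5): a log-linear model over parses whose
# only features are counts of lexical entries used in the parse tree T.

def score(entries, theta):
    """f(L, T, M) · Θ with purely lexical features."""
    return sum(theta.get(e, 0.0) * n for e, n in Counter(entries).items())

def parse_probabilities(parses, theta):
    """Eq. (4): normalize exp(score) over all candidate (L, T) pairs."""
    z = sum(math.exp(score(entries, theta)) for _, entries in parses)
    return {(L, tuple(entries)): math.exp(score(entries, theta)) / z
            for L, entries in parses}

def best_semantics(parses, theta):
    """Eq. (5): pick the L that maximizes the sum over its parse trees."""
    totals = defaultdict(float)
    for L, entries in parses:
        totals[L] += math.exp(score(entries, theta))
    return max(totals, key=totals.get)

# Hypothetical candidates: each is (semantics L, lexical entries of tree T);
# weights are invented for illustration only.
theta = {'Cut:=(AP\\NP)/NP': 1.2, 'Cut:=AP\\NP': 0.3}
parses = [('cut(knife, cucumber) -> divided(cucumber)',
           ['Knife:=N', 'Cut:=(AP\\NP)/NP', 'Cucumber:=N']),
          ('cut(knife, cucumber)',
           ['Knife:=N', 'Cut:=AP\\NP', 'Cucumber:=N'])]
print(best_semantics(parses, theta))
```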

We use dynamic programming techniques to calculate the most probable parse for the manipulation task. In this paper, the implementation from (Baral et al., 2011) is adopted, where an inverse-λ technique is used to generalize new semantic representations. The generalization of lexicon rules is essential for our system to deal with unknown actions presented during the testing phase.

5 Experiments

5.1 Manipulation Action (MANIAC) Dataset

(Aksoy et al., 2014) provides a manipulation action dataset with 8 different manipulation actions (cutting, chopping, stirring, putting, taking, hiding, uncovering, and pushing), each of which consists of 15 different versions performed by 5 different human actors.¹ There are in total 30 different objects manipulated in all demonstrations. All manipulations were recorded with the Microsoft Kinect sensor and serve as training data here.

The MANIAC dataset contains another 20 long and complex chained manipulation sequences (e.g. "making a sandwich") which consist of a total of 103 different versions of these 8 manipulation tasks performed in different orders with novel objects under different circumstances. These serve as testing data for our experiments.

(Aksoy et al., 2014; Aksoy and Wörgötter, 2015) developed a semantic event chain based model-free decomposition approach. It is an unsupervised probabilistic method that measures the frequency of the changes in the spatial relations embedded in event chains, in order to extract the subject and patient visual segments. It also decomposes the long chained complex testing actions into their primitive action components according to the spatio-temporal relations of the manipulator. Since the visual recognition is not the core of this work, we omit the details here and refer the interested reader to (Aksoy et al., 2014; Aksoy and Wörgötter, 2015). All these features make the MANIAC dataset a great testing bed for both the theoretical framework and the implemented system presented in this work.

¹Dataset available for download at https://fortknox.physik3.gwdg.de/cns/index.php?page=maniac-dataset.

5.2 Training Corpus

We first created a training corpus by annotating the 120 training clips from the MANIAC dataset, in the format of observed triplets (subject action patient) and a corresponding semantic representation of the action as well as its consequence. The semantic representations in λ-calculus format are given by human annotators after watching each action clip. A set of sample training pairs is given in Table 1 (one from each action category in the training set). Since every training clip contains one single full execution of each manipulation action considered, the training corpus thus has a total of 120 paired training samples.

Table 1: Example annotations from the training corpus, one per manipulation action category (snapshot column omitted).

  triplet                   semantic representation
  cleaver chopping carrot   chopping(cleaver, carrot) → divided(carrot)
  spatula cutting pepper    cutting(spatula, pepper) → divided(pepper)
  spoon stirring bucket     stirring(spoon, bucket)
  cup take down bucket      take down(cup, bucket) → ¬connected(cup, bucket) ∧ moved(cup)
  cup put on top bowl       put on top(cup, bowl) → on top(cup, bowl) ∧ moved(cup)
  bucket hiding ball        hiding(bucket, ball) → contained(bucket, ball) ∧ moved(bucket)
  hand pushing box          pushing(hand, box) → moved(box)
  box uncover apple         uncover(box, apple) → appear(apple) ∧ moved(box)

We also assume the system knows that every "object" involved in the corpus is an entity of its own type, for example:

Knife := N : knife
Bowl := N : bowl
......

Additionally, we assume the syntactic form of each "action" has a main type (AP\NP)/NP (see Sec. 3.2). These two sets of rules form the initial seed lexicon for learning.
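For concreteness, one annotated training pair from Table 1 together with the two kinds of seed lexicon rules might be encoded as follows (a hypothetical format of ours; the actual corpus format may differ):

```python
# Hypothetical encoding (ours) of one annotated training pair from Table 1
# and the two kinds of seed lexicon rules described above.

training_pair = {
    'triplet': ('cleaver', 'chopping', 'carrot'),
    'semantics': 'chopping(cleaver, carrot) -> divided(carrot)',
}

seed_lexicon = {
    # every object is an entity of its own type, e.g. Knife := N : knife
    'cleaver': ('N', 'cleaver'),
    'carrot': ('N', 'carrot'),
    # every action gets main type (AP\NP)/NP; its λ-calculus semantics,
    # e.g. λx.λy.chopping(x, y) -> divided(y), is what learning must fill in
    'chopping': ('(AP\\NP)/NP', None),
}
```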

5.3 Learned Lexicon

We applied the learning technique mentioned in Sec. 4, using the NL2KR implementation from (Baral et al., 2011). The system learns and generalizes a set of lexicon entries (syntactic and semantic) for each action category from the training corpus, accompanied with a set of weights. We list the entry with the largest weight for each action here respectively:

Chopping := (AP\NP)/NP : λx.λy.chopping(x, y) → divided(y)
Cutting := (AP\NP)/NP : λx.λy.cutting(x, y) → divided(y)
Stirring := (AP\NP)/NP : λx.λy.stirring(x, y)
Take down := (AP\NP)/NP : λx.λy.take down(x, y) → ¬connected(x, y) ∧ moved(x)
Put on top := (AP\NP)/NP : λx.λy.put on top(x, y) → on top(x, y) ∧ moved(x)
Hiding := (AP\NP)/NP : λx.λy.hiding(x, y) → contained(x, y) ∧ moved(x)
Pushing := (AP\NP)/NP : λx.λy.pushing(x, y) → moved(y)
Uncover := (AP\NP)/NP : λx.λy.uncover(x, y) → appear(y) ∧ moved(x)

The set of seed lexicon and the learned lexicon entries are further used to probabilistically parse the detected triplet sequences from the 20 long manipulation activities in the testing set.

5.4 Deducing Semantics

Using the decomposition technique from (Aksoy et al., 2014; Aksoy and Wörgötter, 2015), the reported system is able to detect a sequence of action triplets in the form of (Subject Action Patient) from each of the testing sequences in the MANIAC dataset. Briefly speaking, the event chain representation (Aksoy et al., 2011) of the observed long manipulation activity is first scanned to estimate the main manipulator, i.e. the hand, and manipulated objects, e.g. knife, in the scene, without employing any visual feature-based object recognition method. Solely based on the interactions between the hand and manipulated objects in the scene, the event chain is partitioned into chunks. These chunks are further fragmented into sub-units to detect parallel action streams. Each parsed Semantic Event Chain (SEC) chunk is then compared with the model SECs in the library to decide whether the current SEC sample belongs to one of the known manipulation models or represents a novel manipulation. SEC models, stored in the library, are learned in an on-line unsupervised fashion using the semantics of manipulations derived from a given set of training data, in order to create a large vocabulary of single atomic manipulations.

For the different testing sequences, the number of triplets detected ranges from two to seven. In total, we are able to collect 90 testing detections, and they serve as the testing corpus. However, since many of the objects used in the testing data are not present in the training set, an object model-free approach is adopted and thus the "subject" and "patient" fields are filled with segment IDs instead of a specific object name. Fig. 3 and 4 show several examples of the detected triplets accompanied with a set of key frames from the testing sequences. Nevertheless, the method we use here can 1) generalize the unknown segments into the category of object entities and 2) generalize the unknown actions (those that do not exist in the training corpus) into the category of action functions. This is done by automatically generalizing the following two types of lexicon entries using the inverse-λ technique from (Baral et al., 2011):

Object [ID] := N : object [ID]
Unknown := (AP\NP)/NP : λx.λy.unknown(x, y)

Among the 90 detected triplets, using the learned lexicon we are able to parse all of them into semantic representations. Here we pick the representation with the highest probability after parsing as the individual action semantic representation. The "parsed semantics" rows of Fig. 3 and 4 show several example action semantics on testing sequences. Taking the fourth sub-action from Fig. 4 as an example, the visually detected triplet based on segmentation and spatial decomposition is (Object 014, Chopping, Object 011). After semantic parsing, the system predicts that divided(Object 011). The complete training corpus and parsed results of the testing set will be made publicly available for future research.
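The parsing step of Sec. 5.4 can be sketched as a lookup-plus-application over detected triplets; the fallback entry below mirrors the generalized Unknown lexicon rule (our simplified stand-in for the inverse-λ machinery of (Baral et al., 2011)):

```python
# Sketch (ours) of the parsing step in Sec. 5.4: a detected (subject,
# action, patient) triplet is mapped to its action semantics. Unseen
# actions fall back to the generalized entry, mirroring
# Unknown := (AP\NP)/NP : λx.λy.unknown(x, y).

learned_actions = {
    'chopping': lambda x, y: f'chopping({x}, {y}) -> divided({y})',
    'cutting':  lambda x, y: f'cutting({x}, {y}) -> divided({y})',
    'pushing':  lambda x, y: f'pushing({x}, {y}) -> moved({y})',
}

def parse_triplet(subject, action, patient):
    sem = learned_actions.get(action, lambda x, y: f'unknown({x}, {y})')
    return sem(subject, patient)

# e.g. the fourth sub-action of Fig. 4, with model-free segment IDs:
print(parse_triplet('object_014', 'chopping', 'object_011'))
# -> chopping(object_014, object_011) -> divided(object_011)
```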

Figure 3: System output on complex chained manipulation testing sequence one. The segmentation output and detected triplets are from (Aksoy and Wörgötter, 2015).

Figure 4: System output on the 18th complex chained manipulation testing sequence. The segmentation output and detected triplets are from (Aksoy and Wörgötter, 2015).

5.5 Reasoning Beyond Observations

As mentioned before, because of the use of λ-calculus for representing action semantics, the obtained data can naturally be used to do logical reasoning beyond observations. This by itself is a very interesting research topic and is beyond this paper's scope. However, by applying a couple of common sense Axioms to the testing data, we can provide some flavor of this idea.

Case study one: See the "final action consequence and reasoning" row of Fig. 3 for case one. Using propositional logic and axiom schemata, we can represent the common sense statement ("if an object x is contained in object y, and object z is on top of object y, then object z is on top of object x") as follows:

Axiom (1): ∃x, y, z, contained(y, x) ∧ on top(z, y) → on top(z, x).

Then it is trivial to deduce an additional final action consequence in this scenario: on top(object 007, object 009). This matches the fact: the yellow box which is put on top of the red bucket is also on top of the black ball.

Case study two: See the "final action consequence and reasoning" row of Fig. 4 for a more complicated case. Using propositional logic and axiom schemata, we can represent three common sense statements:
1) "if an object y is contained in object x, and object z is contained in object y, then object z is contained in object x";
2) "if an object x is contained in object y, and object y is divided, then object x is divided";
3) "if an object x is contained in object y, and object y is on top of object z, then object x is on top of object z"
as follows:

Axiom (2): ∃x, y, z, contained(y, x) ∧ contained(z, y) → contained(z, x).
Axiom (3): ∃x, y, contained(y, x) ∧ divided(y) → divided(x).
Axiom (4): ∃x, y, z, contained(y, x) ∧ on top(y, z) → on top(x, z).

With these common sense Axioms, the system is able to deduce several additional final action consequences in this scenario:

divided(object 005) ∧ divided(object 010) ∧ on top(object 005, object 012) ∧ on top(object 010, object 012).

From Fig. 4, we can see that these additional consequences indeed match the facts: 1) the bread and cheese which are covered by ham are also divided, even though from observation the system only detected the ham being cut; 2) the divided bread and cheese are also on top of the plate, even though from observation the system only detected the ham being put on top of the plate.
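A minimal sketch of this reasoning step: the four Axioms can be forward-chained over the parsed consequences until a fixed point is reached. The predicate encoding below is ours ('in' for contained, 'on' for on top), chosen to make the argument order of the English statements explicit:

```python
# Sketch (ours) of forward-chaining the four Axioms to a fixed point.
# Encoding: ('in', a, b) = a is contained in b; ('on', a, b) = a is on
# top of b; ('divided', a) = a is divided.

def deduce(facts):
    facts = set(facts)
    while True:
        new = set()
        for f in facts:
            if f[0] != 'in':
                continue
            _, x, y = f                        # x is contained in y
            for g in facts:
                if g[0] == 'on' and g[2] == y:
                    new.add(('on', g[1], x))   # Axiom (1): z on y => z on x
                if g[0] == 'in' and g[2] == x:
                    new.add(('in', g[1], y))   # Axiom (2): z in x => z in y
                if g[0] == 'on' and g[1] == y:
                    new.add(('on', x, g[2]))   # Axiom (4): y on z => x on z
            if ('divided', y) in facts:
                new.add(('divided', x))        # Axiom (3): y divided => x divided
        if new <= facts:
            return facts
        facts |= new

# Fig. 4-style example: bread and cheese are inside (covered by) the ham,
# which is cut and put on top of the plate.
observed = {('in', 'bread', 'ham'), ('in', 'cheese', 'ham'),
            ('divided', 'ham'), ('on', 'ham', 'plate')}
for fact in sorted(deduce(observed) - observed):
    print(fact)   # deduces divided/on-top facts for bread and cheese
```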

We applied the four Axioms to the 20 testing action sequences and deduced the "hidden" consequences from observation. To evaluate our system performance quantitatively, we first annotated all the final action consequences (both obvious and "hidden" ones) from the 20 testing sequences as ground-truth facts. In total there are 122 consequences annotated. Using perception only (Aksoy and Wörgötter, 2015), due to decomposition errors (such as the ones shown in red in Fig. 4) the system detects 91 consequences correctly, yielding a 74% correct rate. After applying the four Axioms and reasoning, our system is able to detect 105 consequences correctly, yielding an 86% correct rate. Overall, this is a 15.4% relative improvement. Here we want to mention a caveat: there are definitely other common sense Axioms that we are not able to address in the current implementation. However, from the case studies presented, we can see that using the presented formal framework, our system is able to reason about manipulation action goals instead of just observing what is happening visually. This capability is essential for intelligent agents to imitate action goals from observation.

6 Conclusion and Future Work

In this paper we presented a formal computational framework for modeling manipulation actions based on a Combinatory Categorial Grammar. An empirical study on a large manipulation action dataset validates that 1) with the introduced formalism, a learning system can be devised to deduce the semantic meaning of manipulation actions in λ-schema; 2) with the learned schema and several common sense Axioms, our system is able to reason beyond just observation and deduce "hidden" action consequences, yielding a decent performance improvement.

Due to the limitation of current testing scenarios, we conducted experiments considering only a relatively small set of seed lexicon rules and logical expressions. Nevertheless, we want to mention that the presented CCG framework can also be extended to learn the formal logic representation of more complex manipulation action semantics. For example, the temporal order of manipulation actions can be modeled by considering a seed rule such as AP\AP : λf.λg.before(f(·), g(·)), where before(·, ·) is a temporal predicate. For actions in this paper we consider the seed main type (AP\NP)/NP. For more general manipulation scenarios, based on whether the action is transitive or intransitive, the main types of action can be extended to include AP\NP.

Moreover, the logical expressions can also be extended to include universal quantification ∀ and existential quantification ∃. Thus, a manipulation action such as "knife cut every tomato" can be parsed into a representation as ∀x.tomato(x) ∧ cut(knife, x) → divided(x) (the parse is given in the following chart). Here, the concept "every" has a main type of NP\NP and semantic meaning of ∀x.f(x). The same framework can also be extended to have other combinatory rules such as composition and type-raising (Steedman, 2002). These are part of the future work along the lines of the presented work.

  Knife        Cut                              every       Tomato
  N                                                         N
  NP : knife   (AP\NP)/NP :                     NP\NP :     NP : tomato
               λx.λy.cut(x, y) → divided(y)     ∀x.f(x)
                                                ----------------------- >
                                                NP : ∀x.tomato(x)
               ------------------------------------------------------- >
               AP\NP : ∀y.λx.tomato(y) ∧ cut(x, y) → divided(y)
  --------------------------------------------------------------------- <
  AP : ∀y.tomato(y) ∧ cut(knife, y) → divided(y)

The presented computational linguistic framework enables an intelligent agent to predict and reason about action goals from observation, and thus has many potential applications such as human intention prediction, robot action policy planning, human robot collaboration, etc. We believe that our formalism of manipulation actions bridges computational linguistics, vision and robotics, and opens further research in Artificial Intelligence and Robotics. As the robotics industry is moving towards robots that function safely, effectively and autonomously to perform tasks in real-world unstructured environments, they will need to be able to understand the meaning of actions and acquire human-like common-sense reasoning capabilities.

7 Acknowledgements

This research was funded in part by the support of the European Union under the Cognitive Systems program (project POETICON++), the National Science Foundation under INSPIRE grant SMA 1248056, and by DARPA through U.S. Army grant W911NF-14-1-0384 under the project: Shared Perception, Cognition and Reasoning for Autonomy.

References

E. E. Aksoy and F. Wörgötter. 2015. Semantic decomposition and recognition of long and complex manipulation action sequences. International Journal of Computer Vision, page Under Review.

E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, and F. Wörgötter. 2011. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research, 30(10):1229–1249.

E. E. Aksoy, M. Tamosiunaite, and F. Wörgötter. 2014. Model-free incremental learning of the semantics of manipulation actions. Robotics and Autonomous Systems, pages 1–42.

Chitta Baral, Juraj Dzifcak, Marcos Alvarez Gonzalez, and Jiayu Zhou. 2011. Using inverse λ and generalization to translate English to formal languages. In Proceedings of the Ninth International Conference on Computational Semantics, pages 35–44. Association for Computational Linguistics.

Jezekiel Ben-Arie, Zhiqian Wang, Purvin Pandit, and Shyamsundar Rajaram. 2002. Human activity recognition using multidimensional indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1091–1104.

Matthew Brand. 1996. Understanding manipulation in video. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 94–99, Killington, VT. IEEE.

R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. 2009. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Proceedings of the 2009 IEEE International Conference on Computer Vision and Pattern Recognition, pages 1932–1939, Miami, FL. IEEE.

N. Chomsky. 1957. Syntactic Structures. Mouton de Gruyter.

Noam Chomsky. 1993. Lectures on Government and Binding: The Pisa Lectures. Walter de Gruyter.

V. Gazzola, G. Rizzolatti, B. Wicker, and C. Keysers. 2007. The anthropomorphic brain: the mirror neuron system responds to human and robotic actions. Neuroimage, 35(4):1674–1684.

Anupam Guha, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Minimalist plans for interpreting manipulation actions. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5908–5914.

Yuri A. Ivanov and Aaron F. Bobick. 2000. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872.

Jungseock Joo, Weixin Li, Francis F. Steen, and Song-Chun Zhu. 2014. Visual persuasion: Inferring communicative intents of images. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 216–223. IEEE.

A. Kale, A. Sundaresan, A. N. Rajagopalan, N. P. Cuntoor, A. K. Roy-Chowdhury, V. Kruger, and R. Chellappa. 2004. Identification of humans using gait. IEEE Transactions on Image Processing, 13(9):1163–1173.

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 780–787. IEEE.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2011. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML).

Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. 2014. Learning from unscripted deictic gesture and language for human-robot interactions. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

T. B. Moeslund, A. Hilton, and V. Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2):90–126.

Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI, pages 1598–1601.

Darnell Moore and Irfan Essa. 2002. Recognizing multitasked activities from video using stochastic context-free grammar. In Proceedings of the National Conference on Artificial Intelligence, pages 770–776, Menlo Park, CA. AAAI.

K. Pastra and Y. Aloimonos. 2012. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1585):103–117.

Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. 2014. Inferring the why in images. arXiv preprint arXiv:1406.5472.

Giacomo Rizzolatti, Leonardo Fogassi, and Vittorio Gallese. 2001. Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2(9):661–670.

Michael S. Ryoo and Jake K. Aggarwal. 2006. Recognition of composite human activities through context-free grammar based representation. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1709–1718, New York City, NY. IEEE.

P. Saisan, G. Doretto, Y. N. Wu, and S. Soatto. 2001. Dynamic texture recognition. In Proceedings of the 2001 IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages 58–63, Kauai, HI. IEEE.

Mark Steedman. 2000. The Syntactic Process, volume 35. MIT Press.

Mark Steedman. 2002. Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25(5-6):723–753.

D. Summers-Stay, C. L. Teo, Y. Yang, C. Fermüller, and Y. Aloimonos. 2013. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4104–4111, Vilamoura, Portugal. IEEE.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. 2008. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473–1488.

Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. 2013. Inferring "dark matter" and "dark energy" from videos. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2224–2231. IEEE.

Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2563–2570, Portland, OR. IEEE.

Y. Yang, A. Guha, C. Fermüller, and Y. Aloimonos. 2014. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems, 3:67–86.

Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Robot learning manipulation action plans by "watching" unconstrained videos from the World Wide Web. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15).

A. Yilmaz and M. Shah. 2005. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 984–989, San Diego, CA. IEEE.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687.
