Towards Intelligent Social Robots: Current Advances in Cognitive Robotics
Amir Aly, Sascha Griffiths, Francesca Stramandinoli

To cite this version: Amir Aly, Sascha Griffiths, Francesca Stramandinoli. Towards Intelligent Social Robots: Current Advances in Cognitive Robotics. France, 2015. HAL Id: hal-01673866, https://hal.archives-ouvertes.fr/hal-01673866, submitted on 1 Jan 2018.

Proceedings of the Full Day Workshop "Towards Intelligent Social Robots: Current Advances in Cognitive Robotics", held in conjunction with Humanoids 2015, South Korea, November 3, 2015.
Amir Aly (1), Sascha Griffiths (2), Francesca Stramandinoli (3)
(1) ENSTA ParisTech, France; (2) Queen Mary University, England; (3) Italian Institute of Technology, Italy

Towards Emerging Multimodal Cognitive Representations from Neural Self-Organization
German I. Parisi, Cornelius Weber and Stefan Wermter
Knowledge Technology Institute, Department of Informatics, University of Hamburg, Germany
{parisi,weber,wermter}@informatik.uni-hamburg.de
http://www.informatik.uni-hamburg.de/WTM/

Abstract—The integration of multisensory information plays a crucial role in autonomous robotics. In this work, we investigate how robust multimodal representations can naturally develop in a self-organized manner from co-occurring multisensory inputs. We propose a hierarchical learning architecture with growing self-organizing neural networks for learning human actions from audiovisual inputs. Associative links between unimodal representations are incrementally learned by a semi-supervised algorithm with bidirectional connectivity that takes into account inherent spatiotemporal dynamics of the input. Experiments on a dataset of 10 full-body actions show that our architecture is able to learn action-word mappings without the need of segmenting training samples for ground-truth labelling. Instead, multimodal representations of actions are obtained using the co-activation of action features from video sequences and labels from automatic speech recognition. Promising experimental results encourage the extension of our architecture in several directions.

Keywords—Human action recognition, multimodal integration, self-organizing networks.
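As a concrete reading of the co-activation idea in the abstract, the following minimal Python sketch links visual action prototypes to co-occurring words from speech recognition and reads out the strongest link as the action label. The class name, the additive update, and the winner-take-all readout are illustrative assumptions for this sketch, not the formulation used in the paper.

```python
# Minimal sketch of co-activation-based (Hebbian-like) association between
# visual action prototypes and words from automatic speech recognition.
# Names, the additive update, and the readout are illustrative assumptions.
from collections import defaultdict

class ActionWordAssociations:
    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate
        # connection strength: visual prototype id -> {word: weight}
        self.weights = defaultdict(lambda: defaultdict(float))

    def co_activate(self, prototype_id, word):
        """Strengthen the link between a visual prototype and a co-occurring word."""
        self.weights[prototype_id][word] += self.lr

    def label_of(self, prototype_id):
        """Return the word most strongly associated with a visual prototype."""
        links = self.weights[prototype_id]
        return max(links, key=links.get) if links else None

# Toy usage: prototype 7 is mostly active while "walking" is being recognized,
# so it ends up labelled "walking" despite one spurious co-occurrence.
assoc = ActionWordAssociations()
for _ in range(10):
    assoc.co_activate(prototype_id=7, word="walking")
assoc.co_activate(prototype_id=7, word="standing")
print(assoc.label_of(7))   # -> walking
```

Because the association is built from running co-occurrence statistics rather than per-sample labels, no manual segmentation of the training stream is needed, which is the property the abstract emphasizes.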
I. INTRODUCTION

The ability to integrate information from different modalities for an efficient interaction with the environment is a fundamental feature of the brain. As humans, our daily perceptual experience is modulated by an array of sensors that convey different types of modalities such as vision, sound, touch, and movement [1]. Similarly, the integration of modalities conveyed by multiple sensors has been a paramount ingredient of autonomous robots. In this context, multisensory inputs must be represented and integrated in an appropriate way such that they give rise to a reliable cognitive experience aimed at triggering adequate behavioral responses. Multimodal cognitive representations have been shown to improve robustness in the context of action recognition and action-driven perception, learning by imitation, socially-aware agents, and natural human-robot interaction (HRI) [2].

Numerous computational models have been proposed that aim to integrate audiovisual input (e.g. [3][4]). These approaches used unsupervised learning for generalizing visual properties of the environment (e.g. objects) and linking these representations with linguistic labels. However, action verbs do not label actions in the same way that nouns label objects [5]. While nouns generally refer to objects that can be perceived as distinct units, action words refer instead to spatiotemporal relations within events that may be performed in many different ways. In fact, action classification has been shown to be particularly challenging since it involves the processing of a huge amount of visual information to learn inherent spatiotemporal dependencies in the data. To tackle this issue, learning-based mechanisms have typically been used for generalizing a set of labelled training action samples and then predicting the labels of unseen samples (e.g. [15][16]). However, most of the well-established methods learn actions with a batch learning scheme, i.e. assuming that all the training samples are available at the training phase. An additional common assumption is that training samples, generally presented as a sequence of frames from a video, are well segmented so that ground-truth labels can be univocally assigned. Therefore, it is usually the case that raw data collected by sensors must undergo an intensive pre-processing pipeline before training a model. Such pre-processing stages are mainly performed manually, thereby hindering the automatic, continuous learning of actions from live video streams. Intuitively, this is not the case in nature.

Words for actions and events appear to be among children's earliest vocabulary [6]. A central question in the field of developmental learning has been how children first attach verbs to their referents. During their development, children have at their disposal a wide range of perceptual, social, and linguistic cues that they can use to attach a novel label to a novel referent [7]. Referential ambiguity of verbs could then be solved by children assuming that words map onto the action with the most perceptual saliency in their environment. Recent experiments have shown that human infants are able to learn action-label mappings using cross-situational statistics, i.e. even when ground-truth action labels are only piecewise available [8]. Furthermore, action labels can be progressively learned and improved from social and linguistic cues so that novel words can be attached to existing visual representations. This hypothesis is supported by many neurophysiological studies evidencing strong links between the areas in the brain governing visual and language processing, and suggesting high levels of functional interaction of these areas during action learning and recognition [9].

In this work, we investigate how associative links between unimodal representations can naturally emerge from the co-occurrence of audiovisual stimuli. We show that it is possible to progressively learn congruent multimodal representations of human actions with neural self-organization using a special type of hierarchical connectivity. For this purpose, we extended our recently proposed neural architecture for the self-organizing integration of action cues [16] with an associative learning layer where action-word mappings emerge from co-occurring audiovisual inputs using Hebbian-like learning [10]. We implement experience-dependent plasticity with the use of an incremental self-organizing network that employs neurobiologically-motivated habituation for stable learning [11]. The proposed architecture is novel in two main aspects. First, our learning mechanism does not require manual segmentation of training samples. Instead, spatiotemporal generalizations of actions are incrementally obtained and mapped to symbolic labels using the co-activation of audiovisual stimuli. This allows us to train the model in an online fashion with a semi-supervised learning scheme. Second, we propose a type of bidirectional inter-layer connectivity that takes into account the spatiotemporal dynamics of sequences so that symbolic labels are linked to temporally-ordered representations in the visual domain.
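The incremental, habituation-based self-organizing networks referred to above (grow-when-required-style learning [11]) can be sketched as follows. Thresholds, learning rates, and the simplified habituation decay are illustrative assumptions made for this sketch, not the equations used in the paper.

```python
# Minimal sketch of a grow-when-required (GWR)-style self-organizing network
# with habituation, in the spirit of the incremental networks used here.
# All parameter names and values are illustrative assumptions.
import numpy as np

class GrowingNetworkSketch:
    def __init__(self, dim, act_thresh=0.85, hab_thresh=0.1,
                 eps_bmu=0.1, eps_nbr=0.01, tau=0.3, max_age=50):
        self.W = np.random.rand(2, dim)   # weights of the two initial neurons
        self.hab = np.ones(2)             # habituation counters (1 = fully responsive)
        self.edges = {}                   # undirected edge (i, j) -> age
        self.act_thresh, self.hab_thresh = act_thresh, hab_thresh
        self.eps_bmu, self.eps_nbr = eps_bmu, eps_nbr
        self.tau, self.max_age = tau, max_age

    def train_step(self, x):
        # 1. Best- and second-best-matching units for the input sample x.
        d = np.linalg.norm(self.W - x, axis=1)
        b, s = np.argsort(d)[:2]
        activity = np.exp(-d[b])

        # 2. Grow: add a neuron between x and the BMU if the BMU matches the
        #    input poorly AND is already habituated (i.e. well trained).
        if activity < self.act_thresh and self.hab[b] < self.hab_thresh:
            self.W = np.vstack([self.W, (self.W[b] + x) / 2.0])
            self.hab = np.append(self.hab, 1.0)
            n = len(self.W) - 1
            self.edges.pop((min(b, s), max(b, s)), None)
            self.edges[(min(b, n), max(b, n))] = 0
            self.edges[(min(s, n), max(s, n))] = 0
            return n

        # 3. Otherwise adapt the BMU and its topological neighbours towards x,
        #    scaled by how responsive (non-habituated) each neuron still is.
        self.W[b] += self.eps_bmu * self.hab[b] * (x - self.W[b])
        for (i, j), age in list(self.edges.items()):
            if b in (i, j):
                k = j if i == b else i
                self.W[k] += self.eps_nbr * self.hab[k] * (x - self.W[k])
                self.edges[(i, j)] = age + 1
        self.edges[(min(b, s), max(b, s))] = 0   # refresh BMU/runner-up edge

        # 4. Habituate the BMU (simplified exponential decay towards 0) and
        #    prune edges that have not been refreshed for too long.
        self.hab[b] -= self.tau * self.hab[b]
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        return b
```

Trained online on feature vectors from an unsegmented stream, such a network adds prototypes only for inputs that already-habituated neurons cannot represent, which is what makes incremental, segmentation-free learning stable.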
From a neurobiological perspective, a large number of studies has shown that the superior temporal sulcus (STS) in the mammalian brain is the basis of an action-encoding network with neurons that are not only driven by the perception of dynamic human bodies, but also by audiovisual integration [13]. Therefore, the STS area is thought to be an associative learning device for linking different unimodal representations, accounting for the mapping of naturally occurring, highly correlated features such as shape, motion, and characteristic sound [14]. In our proposed architecture, we implement an associative learning network in Layer 4 (or STS-2) where action-word mappings are progressively learned from co-occurring audiovisual inputs using a self-organizing connectivity scheme.

In Section II, we describe our hierarchical architecture with incremental self-organizing networks and hierarchical connectivity.

[Fig. 1 (diagram omitted): hierarchical visual processing of pose (P-1, P-2) and motion (M-1, M-2) features, their integration in STS-1 and STS-2, and the action words layer (AWL) with labels such as "standing", "walking", "sitting down", "jogging", "jump", and "lie down"; temporal windows grow from 1 frame (0.1 s) to 5, 7, and 10 frames (0.5 s, 0.7 s, 1 s). Caption: Diagram of our learning architecture with GWR self-organizing networks and the number of frames (and seconds) required for hierarchical processing. Layers 1-3: parallel spatiotemporal clustering of visual features and self-organizing pose-motion integration (STS-1). Layer 4: associative learning for linking visual representations in STS-2 to the action words layer (AWL) obtained with automatic speech recognition (ASR).]

A. Self-Organizing Hierarchical Learning

Our model consists