Humanoid Learns to Detect Its Own Hands

Jürgen Leitner∗, Simon Harding†, Mikhail Frank∗, Alexander Förster∗, Jürgen Schmidhuber∗

∗Dalle Molle Institute for Artificial Intelligence (IDSIA) / SUPSI, Faculty of Informatics, Università della Svizzera italiana (USI), CH-6928 Manno-Lugano, Switzerland. Email: [email protected]

†Machine Intelligence Ltd, South Zeal, United Kingdom. Email: [email protected]

Abstract—Robust object manipulation is still a hard problem in robotics, even more so in high degree-of-freedom (DOF) humanoid robots. To improve performance, a closer integration of visual and motor systems is needed. We herein present a novel method for a robot to learn robust detection of its own hands and fingers, enabling sensorimotor coordination. It does so solely using its own camera images and does not require any external systems or markers. Our system, based on Cartesian Genetic Programming (CGP), allows us to evolve programs that perform this image segmentation task in real-time on the real hardware. We show results for a Nao and an iCub humanoid, each detecting its own hands and fingers.

I. INTRODUCTION

Although robotics has seen advances over the last decades, robots are still not in wide-spread use outside industrial applications. Various scenarios are proposed for future robotic helpers, ranging from cleaning tasks and grocery shopping to elderly care and helping in hospitals. These would involve robots working together with, helping, and coexisting with humans in daily life. From this arises the need to manipulate objects in unstructured environments. Yet autonomous object manipulation is still a hard problem in robotics. Humans, in contrast, are able to perform a variety of manipulation tasks on arbitrary objects quickly and without much thought. To achieve this, visual motor integration (or visuomotor integration, VMI), also referred to as eye-hand coordination, is of importance. For example, if someone wants to pick up a certain object from the table, the brain takes in the location of the object as perceived by the eyes and uses this visual information to guide the arm. To perform this guidance, both the object and the arm need to be detected visually.

In robotics, a multi-step approach is traditionally used to reach for an object. First the object's position is determined in operational space. If no active vision systems (RGB-D cameras/Kinect) or external trackers/markers (Vicon) are available, the most common approach is to use a stereo camera pair. After detecting the object in the image, projective geometry is used to estimate the operational space coordinate [1]. This is the desired position to be reached with the end-effector. The joint configuration of the robot for reaching this position is determined by applying inverse kinematics techniques [2]. The obvious problem in this chain of operations is that errors are propagated, and sometimes amplified, through these transformations. Visual servoing [3], [4] has been proposed to overcome some of these problems. It uses the distance between the object and the arm in the visual plane as the control signal.

It stands to reason that humanoids require a certain level of visuomotor coordination to develop better manipulation skills. This parallels the study of human visual and motor processing, which were long investigated separately before growing evidence showed that the two systems are very closely related [5]. Robust visual perception, such as precisely and quickly detecting the hands, is important in approaches that aim to develop better robotic manipulation skills. This is especially of interest in highly-complex robots such as humanoids, where generally no precise knowledge of the kinematic structure (body) of the robot is available, generating significant errors when transforming back and forth into operational space.

In this paper we present a novel method for humanoid robots to visually detect their own hands. Section II describes previous work related to visually detecting hands on humanoid robots. In Section III a machine learning technique called Cartesian Genetic Programming (CGP) is described. CGP is used to learn a robust method to filter out the hands and fingers in the camera stream of an iCub and a Nao robot; these experiments are detailed in Section IV. This is, to our knowledge, the first time a robot has learned to detect its own hands from vision alone. Section V contains concluding remarks and comments on how this opens up future research into sensorimotor coordination on these platforms.
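To make the error propagation in the open-loop reaching chain described above concrete, the pipeline can be summarised in a few lines of code. The following is only an illustrative sketch and not part of the presented system; the calibrated stereo projection matrices P1 and P2 and the inverseKinematics routine are assumed to be provided by the robot's calibration and kinematics libraries.

#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>

// Hypothetical IK routine supplied by the robot's kinematics library.
std::vector<double> inverseKinematics(const cv::Point3d& target);

// Open-loop reaching: triangulate the object from a calibrated stereo pair,
// then compute a joint configuration for that operational-space point.
// Calibration, detection and kinematic-model errors all propagate through
// this chain, as there is no visual feedback on the arm itself.
std::vector<double> planReach(const cv::Mat& P1, const cv::Mat& P2,
                              const cv::Point2d& uvLeft, const cv::Point2d& uvRight) {
    cv::Mat ptsL = (cv::Mat_<double>(2, 1) << uvLeft.x, uvLeft.y);
    cv::Mat ptsR = (cv::Mat_<double>(2, 1) << uvRight.x, uvRight.y);
    cv::Mat X;                                     // 4x1 homogeneous result
    cv::triangulatePoints(P1, P2, ptsL, ptsR, X);
    X.convertTo(X, CV_64F);
    cv::Point3d target(X.at<double>(0) / X.at<double>(3),
                       X.at<double>(1) / X.at<double>(3),
                       X.at<double>(2) / X.at<double>(3));
    return inverseKinematics(target);              // inverse kinematics step [2]
}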
II. RELATED WORK

The detection of hands in robotics has so far mainly focused on detecting human hands in camera images. For example, Kölsch and Turk presented a method for robustly classifying different hand postures in images [6]. These approaches usually exploit the quite uniform colour of the skin [7], motion flow [8] and particle filtering [9].

Hand-eye coordination for robotics has previously been investigated using extra sensors like lasers or other helpers mounted on the end-effector, such as LEDs, brightly coloured symbols or markers. Langdon and Nordin already explored GP for evolving hand-eye coordination on a robot arm with an LED on the end-effector [10]. Hülse et al. used machine learning to grasp a ball with a robot arm, yet only the ball, not the arm, was visually detected [11]. Another approach, detecting the hand gesture directly using a coloured glove and machine learning, was presented by Nagi et al. [12]. Another often used option to track and detect the position of the robot's end-effector are external motion capture and imaging systems (e.g. [13], [14]). Recently, the German Aerospace Center (DLR) investigated methods for better position estimation for their humanoid robot. Using RGB-D (Kinect) and a stereo camera approach, both combined with a model-based technique, their system was able to qualitatively decrease the error from the pure kinematic solution [15].

Machine learning has been used in computer vision previously, but has mainly focussed on techniques such as Support Vector Machines (SVM), k-Nearest Neighbour (kNN) and Artificial Neural Networks (ANN).
Herein we use Genetic Programming (GP), a search technique inspired by concepts from Darwinian evolution [16]. It can be used in many domains, but is most commonly used for symbolic regression and classification tasks. GP has also been used to solve problems in image processing; however, previous attempts typically use a small set of mathematical functions to evolve kernels, or a small number of basic image processing functions (such as erode and dilate). Spina used GP to evolve programs performing segmentation based on features calculated from partially labeled instances, separating foreground and background [17].

Given the maturity of the field of image processing, it should be possible to construct programs that use much more complicated image operations and hence incorporate domain knowledge. For example, Shirakawa evolved segmentation programs that use many high-level operations such as mean, maximum, Sobel, Laplacian and product [18].

III. CARTESIAN GENETIC PROGRAMMING FOR IMAGE PROCESSING

Herein we are using Cartesian Genetic Programming (CGP), in which programs are encoded in partially connected feed-forward graphs [19], [20]. The genotype, given by a list of nodes, encodes the graph. For each node in the genome there is a vertex, represented by a function, and a description of the nodes from which the incoming edges are attached.

The basic algorithm works as follows: initially, a population of candidate solutions is composed of randomly generated individuals. Each of these individuals, represented by its genotype, is tested to see how well it performs the given task. This evaluation against the 'fitness function' is used to assign a numeric score to each individual in the population. Generally, the lower this error, the better the individual.

In the next step of the algorithm, a new population of individuals is generated from the old population. This is done by taking pairs of the best performing individuals and performing functions analogous to recombination and mutation. These new individuals are then tested using the fitness function. The process of test and generate is repeated until a solution is found or until a certain number of individuals have been evaluated.

CGP offers some nice features. For instance, not all of the nodes of a solution representation (the genotype) need to be connected to the output node of the program. As a result there are nodes in the representation that have no effect on the output, a feature known in GP as 'neutrality'. It has been shown that this can be helpful in the evolutionary process [21]. Also, because the genotype encodes a graph, nodes can be reused, which makes the representation distinct from a classically tree-based GP representation (Fig. 1).

Fig. 1. Example illustration of a CGP-IP genotype. Internally each node is represented by several parameters. In this example, the first three nodes obtain the image components from the current test case (i.e. a grey scale representation and the red and green channels). The fourth node adds the green and red images together. This is then dilated by the fifth node. The sixth node is not referenced by any node connected to the output (i.e. it is neutral), and is therefore ignored. The final node takes the average of the fifth node and the grey scale component from the current test case.

Our implementation, CGP for Image Processing (CGP-IP), draws inspiration from much of the previous work in the field (see e.g. [22]). It uses a mixture of primitive mathematical and high-level operations. Its main difference to previous implementations is that it encompasses domain knowledge, i.e. it allows for the automatic generation of computer programs using a large subset of the OpenCV image processing library functionality [23]. With over 60 unique functions, the function set is considerably larger than those typically used with GP. This does not appear to hinder evolution (see the results we obtained with this configuration in Section IV), and we speculate that the increased number of functions provides greater flexibility in how the evolved programs can operate.

Executing the genotype is a straightforward process. First, the active nodes are identified. This is done recursively, starting at the output node and following the connections used to provide the inputs for that node. In CGP-IP the final node in the genotype is used as the output. Next, the phenotype can be executed on an image. The input image (or images) are used as inputs to the program, and a forward parse of the phenotype is performed to generate the output.

The efficacy of this approach has been shown for several different domains (including basic image processing, medical imaging, terrain classification, object detection in robotics and defect detection in industrial applications) by Harding et al. [24] and Leitner et al. [25].
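To illustrate the encoding and execution scheme just described, a minimal sketch is given below. The structure mirrors the description above and the per-node parameters of Table II, but all names, the function-set interface and the rule used for out-of-range connections are assumptions made for illustration, not our actual implementation.

#include <functional>
#include <set>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// One CGP-IP node: a function index plus relative connections and parameters.
struct Node {
    int function;                  // index into the function set (cf. Table I)
    int connection0, connection1;  // relative addresses of the two inputs
    double parameter0;             // real-valued parameter
    int parameter1, parameter2;    // integer parameters (e.g. kernel sizes)
};

struct Genotype {
    std::vector<Node> nodes;       // fixed graph length (50 nodes in this work)
    double threshold;              // used to binarise the final output
};

using ImageFunction = std::function<cv::Mat(const cv::Mat&, const cv::Mat&, const Node&)>;

// Recursively mark the nodes that contribute to the output; the rest are 'neutral'.
void markActive(const Genotype& g, int index, std::set<int>& active) {
    if (index < 0 || active.count(index)) return;   // reached a program input, or already visited
    active.insert(index);
    markActive(g, index - g.nodes[index].connection0, active);
    markActive(g, index - g.nodes[index].connection1, active);
}

// Forward parse: evaluate the active nodes in order; the final node is the output.
cv::Mat execute(const Genotype& g, const std::vector<cv::Mat>& inputs,
                const std::vector<ImageFunction>& functions) {
    std::set<int> active;
    markActive(g, static_cast<int>(g.nodes.size()) - 1, active);

    std::vector<cv::Mat> value(g.nodes.size());
    auto resolve = [&](int node, int offset) -> cv::Mat {
        int src = node - offset;
        return src < 0 ? inputs[(-src - 1) % inputs.size()]   // out-of-range connections map to inputs
                       : value[src];
    };
    for (int i : active) {                                    // std::set iterates in increasing order
        const Node& n = g.nodes[i];
        value[i] = functions[n.function](resolve(i, n.connection0), resolve(i, n.connection1), n);
    }
    cv::Mat out;
    cv::threshold(value.back(), out, g.threshold, 255, cv::THRESH_BINARY);
    return out;
}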

A. Fitness Function

Depending on the application, different fitness functions are available in CGP-IP. The thresholded output image of an individual is compared to a target image using the Matthews Correlation Coefficient (MCC) [26], [27], which has previously been observed to be useful for CGP approaches to classification problems [28]. The MCC is calculated from the 'confusion matrix', i.e. the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). A coefficient of 0 indicates that the classifier is working no better than chance. A score of 1 is achieved by a perfect classifier; −1 indicates that the classifier is perfect, but has inverted the output classes. Therefore, the fitness of an individual is

fitness = 1 − |MCC|    (1)

with values closer to 0 being more fit.
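For reference, the computation behind Eq. (1) is spelled out below. The MCC formula itself is the standard one; returning an MCC of 0 when the denominator vanishes is a convention added for this sketch.

#include <cmath>

// Fitness as in Eq. (1): MCC from the confusion-matrix counts, then 1 - |MCC|,
// so 0 is a perfect (possibly class-inverted) classifier and 1 is chance level.
double mccFitness(double tp, double fp, double tn, double fn) {
    double denom = std::sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
    double mcc = (denom == 0.0) ? 0.0 : (tp * tn - fp * fn) / denom;
    return 1.0 - std::fabs(mcc);
}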
B. High Level Operations

Previous work on image processing with GP mostly operates in a convolutional manner: a program is evolved that operates as a kernel. For each pixel in an image, the kernel operates on a neighbourhood and outputs a new pixel value. This is also the typical approach when other machine learning techniques are applied to imaging. In GP, the kernel is typically an expression composed from primitive mathematical operators such as +, −, × and ÷. For example, this approach was used in [29], [30], [31] to evolve noise reduction filters. In [32], many different image processing operations (e.g. dilate, erode, edge detection) were reverse-engineered.

The function set not only contains high-level image processing functions, but also primitive mathematical operations. In CGP-IP a large number of commands from the OpenCV library are available to GP. Additionally, higher-level functions such as Gabor filters are available. A list of functions is shown in Table I. Most of these functions are based on their OpenCV equivalents. The functions may include multiple implementations to provide a variety of interfaces (e.g. add a constant, or an image).

TABLE I. CURRENTLY IMPLEMENTED FUNCTIONS THAT MAKE USE OF OPENCV AND CAN BE SELECTED BY CGP-IP.

absdiff, add, addc, avg, canny, dCTFwdReal, dCTInvReal, dilate, div, erode, exp, findCircles, findShapes, gabor, gauss, goodFeaturesToTrack, grabCut, laplace, localAvg, localMax, localMin, localNormalize, log, max, min, mul, mulc, nearestNeighbourImg, normalize, normalize2, normalizeImage, opticFlow, pow, pyrDown, pyrUp, reScale, resize, resizeThenGabor, runConvolutionKernel, sIFT, shift, shiftDown, shiftLeft, shiftRight, shiftUp, smoothBilataral, smoothBlur, smoothMedian, sobel, sobelx, sobely, sqrt, sub, subc, threshold, thresholdInv, unsharpen, MorphologyBlackHat, MorphologyClose, MorphologyGradient, MorphologyOpen, MorphologyTopHat

The primitive operators also work on entire images, i.e. addition produces a new image by adding the values of corresponding pixels from two input images. However, this does not directly allow for kernel-style functionality. Instead, GP has to use a combination of shifts, rotations and other existing kernel methods to get information about a pixel's neighbourhood. This is similar to the methods proposed in [32] to allow for efficient parallel processing of images on Graphics Processing Units (GPUs).

Using OpenCV we can also be confident about using high quality, high speed software. In CGP-IP, individuals are evaluated at a rate of hundreds per second on a single core. This makes it efficient to evolve with, but also means that the evolved filters will run quickly if deployed. Much of the previous work on imaging with GP has focused on grey scale images, often for performance considerations, but also out of consideration for how the images are handled within the program. In CGP-IP, all functions operate on single-channel images. The default treatment for colour images is to separate them into both RGB (red, green and blue) and HSV (hue, saturation and value) channels and provide these as available inputs. A grey scale version is also provided. Each available channel is presented as an input to CGP-IP, and evolution selects which inputs will be used. While CGP-IP can process inputs from stereo camera systems, herein we restrict ourselves to input images from one single camera.

Our implementation of CGP-IP generates human readable C# or C++ code based on OpenCV functions. The code can be compiled and directly used with our robots within our computer vision framework [33]. It is typically found that this process reduces the number of used instructions, and hence the execution time of the evolved program.
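The shift-based access to neighbourhood information mentioned above can be illustrated with plain OpenCV calls. This is our own illustration of the idea rather than code taken from CGP-IP: a shiftDown-style primitive combined with a whole-image difference already approximates a vertical gradient without any per-pixel kernel code.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Shift a single-channel image down by one row (the border row is replicated).
cv::Mat shiftDown(const cv::Mat& src) {
    cv::Mat padded, dst;
    cv::copyMakeBorder(src, padded, 1, 0, 0, 0, cv::BORDER_REPLICATE);  // add one row on top
    padded(cv::Rect(0, 0, src.cols, src.rows)).copyTo(dst);             // drop the bottom row
    return dst;
}

// |I(x, y) - I(x, y-1)|: a crude vertical edge map built only from
// whole-image primitives, with no explicit kernel required.
cv::Mat verticalEdges(const cv::Mat& grey) {
    cv::Mat diff;
    cv::absdiff(grey, shiftDown(grey), diff);
    return diff;
}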
C. CGP Parameters

A general feature of CGP implementations is the low number of parameters required for configuration, and the same holds for CGP-IP. The main parameters are:

• Graph length (i.e. the number of nodes in the genotype), set to 50 here.
• Mutation rate: 10% of all genes in the graph are mutated when an offspring is generated. The threshold parameter is mutated with a probability of 1%.
• Size of mutations.
• Number of islands, i.e. separate populations, allowing for a distributed evolutionary process (this has been shown to improve the overall performance of the evolutionary process [34]). We chose 8 islands (1 per CPU core).
• Number of individuals per island, which is set to 5, keeping the typical 1+4 evolutionary strategy.
• Synchronisation interval between islands. Here each island compares its best individual to the server's individual every 10 generations.

It is important to note that in the work presented here the parameters have not been optimised other than by casual experimentation. It may be possible to improve the performance of CGP-IP by a more careful selection. In particular, we expect the mutation rate, genotype size and number of islands to be the important parameters to adjust.

Compared to classical CGP, CGP-IP needs a number of additional parameters encoded in each node; they are listed in Table II. These are needed because the functions used often require additional parameters, with specific requirements as to their type and range. Connection 0 and 1 contain the relative address of the node used as first and second input. If a relative address extends beyond the extent of the genome, it is connected to one of the inputs. Specialised functions are provided to select the input (e.g. INP, SKIP), which manipulate a pointer that indexes the available inputs and return the currently indexed input. A full description can be found in [35]. An illustrative example is shown in Fig. 1. In addition to the graph representing the program, the genotype also includes a value used for thresholding the output.

TABLE II. PARAMETERS OF CGP-IP ENCODED AT EVERY NODE.

Parameter             Type   Range
Function              Int    # of functions
Connection 0          Int    # of nodes and inputs
Connection 1          Int    # of nodes and inputs
Parameter 0           Real   no limitation
Parameter 1           Int    [−16, +16]
Parameter 2           Int    [−16, +16]
Gabor Filter Frequ.   Int    [0, 16]
Gabor Filter Orient.  Int    [−8, +8]

All parameters are kept constant throughout the experiments presented below. It may be possible to improve performance for a given problem by optimising these parameters.
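The settings above can be collected in a small configuration structure, and the generation step below shows the 1+4 evolutionary strategy run on each island. This is a sketch with our own naming; mutate and fitness stand in for the CGP-IP mutation operator and the MCC-based fitness of Section III-A. Accepting offspring of equal fitness lets neutral mutations drift, as is usual in CGP.

#include <functional>

// Main CGP-IP settings used in this work (field names are illustrative).
struct CgpIpSettings {
    int graphLength = 50;             // nodes per genotype
    double mutationRate = 0.10;       // fraction of genes mutated per offspring
    double thresholdMutationRate = 0.01;
    int islands = 8;                  // one island per CPU core
    int individualsPerIsland = 5;     // 1 parent + 4 offspring
    int syncInterval = 10;            // generations between island synchronisations
};

// One generation of the 1+4 strategy on a single island: keep the parent,
// create four mutated offspring and promote the best-scoring individual.
template <typename Genotype>
Genotype evolveGeneration(const Genotype& parent, const CgpIpSettings& s,
                          const std::function<Genotype(const Genotype&, const CgpIpSettings&)>& mutate,
                          const std::function<double(const Genotype&)>& fitness) {
    Genotype best = parent;
    double bestFitness = fitness(parent);
    for (int i = 0; i < s.individualsPerIsland - 1; ++i) {
        Genotype child = mutate(parent, s);
        double f = fitness(child);
        if (f <= bestFitness) { best = child; bestFitness = f; }  // lower (1 - |MCC|) is better
    }
    return best;
}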

D. Training

A handful of examples in which the object of interest (in our case the fingers and hands) has been correctly segmented are used as a training set for CGP-IP. The selection of this training set is important for the capabilities of the found solution. If chosen well, the resulting programs are very robust filters. Fig. 2 shows the various stages. The original image is used to hand-label points of interest (in this case the finger tips). This is then used as a training set for CGP-IP. The output of the filter is shown in the third column and is used again, just for visualisation, in the last column as an overlay on the grey-scale input image.

Fig. 2. The various stages of the supervised learning. On the left the (grey-scaled) input image, followed by the hand-segmented ground truth provided. The third column shows the image segmentation output of the current individual using the specific input image, while the next column shows this output after thresholding. The fitness values (MCC) are based on this column. The overlay column allows for quick visual verification of the results.

An example of the human-readable output, in the form of a C++ computer program, is shown in Listing 1. On a single core of a modern desktop PC, the evolved programs run quickly, and as these are largely linear in form, the memory requirements are low. Speed and simplicity of processing is important in many embedded systems, making this approach suitable for implementation in constrained computing environments.
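Tying these pieces together, scoring one candidate filter against such a hand-labelled training set could look as follows. This is a sketch: the filter callable stands for an executed genotype, mccFitness is the function from the Section III-A sketch, and the per-pixel mask comparison is our assumption about how the confusion matrix is accumulated.

#include <functional>
#include <vector>
#include <opencv2/core.hpp>

double mccFitness(double tp, double fp, double tn, double fn);  // see the Section III-A sketch

// Run the candidate filter on every training image, compare its binary output
// with the hand-labelled ground-truth mask pixel by pixel, and score the
// accumulated confusion matrix with the MCC-based fitness (0 is perfect).
double evaluateOnTrainingSet(const std::function<cv::Mat(const std::vector<cv::Mat>&)>& filter,
                             const std::vector<std::vector<cv::Mat>>& trainingInputs,
                             const std::vector<cv::Mat>& groundTruth) {
    double tp = 0, fp = 0, tn = 0, fn = 0;
    for (size_t i = 0; i < trainingInputs.size(); ++i) {
        cv::Mat out = filter(trainingInputs[i]);    // thresholded, single-channel 0/255 mask
        const cv::Mat& truth = groundTruth[i];      // same format, hand-labelled
        cv::Mat notOut = ~out, notTruth = ~truth;
        cv::Mat hit = out & truth;
        cv::Mat falseAlarm = out & notTruth;
        cv::Mat miss = notOut & truth;
        int hits = cv::countNonZero(hit);
        int falsePos = cv::countNonZero(falseAlarm);
        int misses = cv::countNonZero(miss);
        tp += hits; fp += falsePos; fn += misses;
        tn += static_cast<double>(truth.total()) - hits - falsePos - misses;
    }
    return mccFitness(tp, fp, tn, fn);
}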
IV. EXPERIMENTS

For testing our system we employed two humanoid robot platforms. The two systems are quite different, both in their appearance and in the sensors available. Our first platform is the Nao humanoid built by Aldebaran Robotics; our second robot is the iCub humanoid, which was developed during the European funded RobotCub project and is currently actively used in a variety of projects at both European and international level. On each of these robots we aim for the humanoid to learn how to detect its own hand and fingers while moving the arm through the visible workspace.

Fig. 3. Our robotic platforms: the Nao (left) and the iCub (right).

A. Nao Hand Detection

Our first experiment is performed with the Nao humanoid (see Fig. 3). The 53cm tall robot has 25 DOF (including the actuated hands), an inertial measurement unit (with accelerometer and gyrometer), four ultrasonic sensors, eight force-sensing resistors and two bumpers. It also features four microphones, two speakers, and two cameras, one looking forward (forehead placement) and one looking down (chin placement). The Nao has been the robot of choice for the RoboCup Standard Platform League (SPL) competitions since 2008.

A dataset of pictures was collected from both cameras (though the hand was more often visible in the lower camera, due to its placement). Fig. 4 shows some of these pictures. While collecting them, the hand was randomly moved through the field of view of the robot. We apply our CGP-IP approach described above to detect the robot's fingers and hands. A hand-labeled training set was generated and used (as described above; similar to Fig. 2).

Fig. 4. Some of the images in the dataset for the Nao robot. The hand, consisting of three separate, uniformly white fingers, can clearly be seen in all images.

A filter was found after a short evolution. Fig. 5 shows the result of one of the top-performing filters for this task. The output of the filter is again used as an overlay on the grey-scale images. From the 8 inputs one was selected as validation image (highlighted by a golden frame). The performance of the filter is rather good, with green indicating a correct detection. Red shows areas that are, according to the provided input labels, part of the hand but not detected by our filter. The hand was cleanly detected in each of the 8 images and with very little (pixel-wise) error. The generated program to perform this segmentation is shown in Listing 1.

Fig. 5. One of the results when learning to detect the hands of the Nao robot. Green marks areas that are correctly detected, red marks areas that were marked in the supervised training set yet have not been detected by this individual, and blue marks areas wrongly detected to be part of the robot's hand.

This solution, the best out of 10 test runs, was found rather quickly, in a few thousand evaluations. This is probably because of the simple and uniform visual appearance of the white fingers and hand of the Nao robot.

icImage NaoFingers::runFilter() {
  icImage node0 = InputImages[3].min(InputImages[4]);
  icImage node1 = InputImages[1].unsharpen(11);
  icImage node2 = node0.Max();
  icImage node3 = node0.gauss(7, -7);
  icImage node4 = node2.addc(6.71329411864281);
  icImage node5 = InputImages[0].mul(node1);
  icImage node6 = node3.dilate(6);
  icImage node7 = node5.sobely(15);
  icImage node9 = node4.absdiff(node6);
  icImage node10 = InputImages[5];
  icImage node13 = node10.unsharpen(15);
  icImage node14 = node7.dilate(6);
  icImage node17 = node9.gauss(11);
  icImage node18 = node17.SmoothBlur(11);
  icImage node19 = node13.min(node14);
  icImage node27 = node19.SmoothBlur(11);
  icImage node31 = node18.Exp();
  icImage node32 = node31.erode(4);
  icImage node37 = node27.mul(node32);
  icImage node38 = node37.ShiftDown();
  icImage node41 = node38.ShiftDown();
  icImage node49 = node41.laplace(19);
  return node49;
}

Listing 1. Generated C# code from CGP-IP for detecting the Nao's fingers. icImage is a wrapper class to allow portability within our framework [33].
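To relate the generated wrapper calls to standard OpenCV, the first few nodes of Listing 1 can be read roughly as shown below. This mapping is our interpretation only; how icImage parameters such as dilate(6) or gauss(7, -7) translate into kernel sizes, sigmas and iteration counts is an assumption, not something documented here.

#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Rough plain-OpenCV reading of the first generated nodes (illustrative only).
void firstNodes(const std::vector<cv::Mat>& InputImages) {
    cv::Mat node0, node3, node6;
    cv::min(InputImages[3], InputImages[4], node0);             // node0 = InputImages[3].min(InputImages[4])
    cv::GaussianBlur(node0, node3, cv::Size(7, 7), 0);          // node3 = node0.gauss(7, -7), sigma handling assumed
    cv::dilate(node3, node6, cv::Mat(), cv::Point(-1, -1), 6);  // node6 = node3.dilate(6), read as 6 iterations
    // ... the remaining nodes follow the same one-OpenCV-call-per-node pattern.
}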
B. iCub Finger and Hand Detection

The second set of experiments is performed with the iCub [36], an open-system robotic platform providing a 41 degree-of-freedom (DOF) upper body, comprising two arms, a head and a torso (see Fig. 3). The iCub is generally considered an interesting experimental platform for research on cognitive and sensorimotor development and embodied Artificial Intelligence [37], with a focus on object manipulation. To allow for manipulation, the object of interest has to first be detected and located using the available sensors. The robot has to rely on a visual system consisting of two cameras in the head, analogous to human visual perception. The errors during this localisation, as well as in the kinematic model, are significant. Therefore we aim to use visual feedback to perform better reaching and grasping. In comparison to the Nao's hand, the iCub sports a rather complex end-effector. The hand and fingers are both mechanically complicated and visually hard to detect (Fig. 6). There has been a previous effort to apply machine learning to detect the iCub's hands [38].

Fig. 6. The iCub's hands and fingers are quite complicated mechanically and complex to detect visually.

We again apply our CGP-IP approach to detect the iCub's fingers and hands, a necessity for learning hand-eye coordination. As in the section above, we collected a dataset while the robot was moving its arms around. A handful of pictures was hand-labeled and used as a training set. The detection of the hand is split into two separate problems: the identification of the fingertips only, and, secondly, of the full hand. This separation allows us to create a finer level of control for tasks, such as grasping, that involve controlled motion of the fingers.

The fingertips are made out of a black rubber protecting the touch sensors within. Fig. 7 shows the result of an evolved filter detecting the fingertips of the iCub's hands. To make sure that the solution is not just identifying any black part, the images contain black objects in the background, such as chairs and computer cases. Due to the stochastic nature of the evolutionary approach we ran multiple tests. The best solution (out of 10 test runs) for filtering out the fingertips is found rather quickly, after only a bit more than 8 minutes of evolution (about 55k individual evaluations).

Fig. 7. Results when visually locating the fingertips of the iCub.

Detecting the full hand is a more complicated task. Here we used 10 images to train the filter and an additional four to validate each solution against unseen data. For this more complicated task the evolution of the best solution (out of 10 tests) took more than 5.7M evaluations (about a hundred times more than for the fingers alone), taking about 12 hours on a desktop computer. In Fig. 8 the performance of the learnt filter is shown. Generally a very good detection can be seen, although there are some minor issues. For example, a reduced level of detection can be seen around the edges of the frame. As we will be gazing upon a specific object we want to manipulate, precise control will matter most when the hand reaches close to the centre of the image frame. Another issue seems to be the stark difference in visual appearance between the front and back side of the hand. This can be seen in the second image from the right in the middle row of the figure. Again, in most of the planned scenarios for object manipulation the back of the hand will be visible. Therefore we do not see this as an important issue. Furthermore, if a more thorough detection, also of the front side, is required, another separate filter just for the front can be evolved.

Fig. 8. Results when visually locating the fingertips of the iCub with an evolved filter. The images with golden borders are for validation. Green shows the correct detection; red indicates parts that were not detected by our filter (false negative); blue indicates where the filter falsely detected the hand (false positive).

By combining the two separate detectors, for the fingertips and the hand, we get a very precise, almost full identification of the iCub's hands. The programs to perform this segmentation are generated in the same way as for the Nao and are also provided as C# code. The complexity, though, is higher, providing an indication that the detection of the iCub's hands is a more complex task than the Nao's. The complexity of the filter can also be visualised by plotting the operations and their connections of the found CGP-IP solution (Fig. 9). Each box shows the output, and the name, of the function at that node. The top node is the final output of the filter, i.e. the segmentation. It shows the thresholded image used by the classifier. Following the graph through to the bottom, we see the inputs selected for this filter by evolution.

Fig. 9. The operations performed by the filter to detect the iCub's hands. The (binary) segmentation of the hands is the top-most image.
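The exact combination operation for the two detectors is not specified above; the most direct reading, sketched here as an assumption, is a per-pixel OR of the two binary masks.

#include <opencv2/core.hpp>

// Combine the fingertip filter and the hand filter (both binary 0/255 masks)
// into one near-complete segmentation of the iCub's hand.
cv::Mat combineDetectors(const cv::Mat& fingertipMask, const cv::Mat& handMask) {
    cv::Mat combined;
    cv::bitwise_or(fingertipMask, handMask, combined);
    return combined;
}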
V. CONCLUSION

We presented a novel approach for a robot to detect its own hands and fingers using vision, by applying Cartesian Genetic Programming (CGP). Detecting hands in images has previously focused mainly on detecting human hands. Allowing the robot to detect its own hands in real-time enables hand-eye coordination in tasks such as reaching for and picking up objects.

Our technique using CGP was tested with both the Nao and the iCub humanoid robots. In both cases very satisfying results were achieved with a very limited number of training images required. The resulting programs are quick enough to run in real-time on the hardware platforms. In the case of the iCub we were especially successful in generating different filters for the hand and the finger tips respectively, allowing for tracking even in precise manipulation tasks using only the fingers.

Whilst the results are hard to compare with other published work, due to the lack of suitable datasets, they indicate that CGP-IP is highly competitive. Further, we have seen that CGP-IP can work with well-known image processing operations and generate human readable programs. One remaining issue is that, due to the lack of precise ground truth, it is hard to quantify the pixel errors of our solutions. Another reason why pixel error is not a precise measure is that this error will also vary depending on the distance of the hand to the eye. We plan to test this in the future with other methods, e.g. by measuring in operational space the distance between an object and the hand and the detection distance in the image frame.

In the future we are planning to extend this framework to allow for a more autonomous and developmental learning process, where no prior labelling is needed.

REFERENCES

[1] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2000.
[2] R. P. Paul, Robot Manipulators: Mathematics, Programming, and Control. MIT Press, 1981.
[3] S. Hutchinson, G. D. Hager, and P. I. Corke, "A tutorial on visual servo control," IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, 1996.
[4] F. Chaumette and S. Hutchinson, "Visual servo control, Part I: Basic approaches," IEEE Robotics and Automation Magazine, vol. 13, no. 4, pp. 82–90, 2006.
[5] M. Goodale, "Visuomotor control: Where does vision end and action begin?" Current Biology, vol. 8, no. 14, p. 489, 1998.
[6] M. Kölsch and M. Turk, "Robust hand detection," in Proc. of the Intl. Conference on Automatic Face and Gesture Recognition, 2004, p. 614.
[7] X. Zhu, J. Yang, and A. Waibel, "Segmenting hands of arbitrary color," in Proc. of the Intl. Conference on Automatic Face and Gesture Recognition, 2000, pp. 446–453.
[8] R. Cutler and M. Turk, "View-based interpretation of real-time optical flow for gesture recognition," in Proc. of the Intl. Conference on Automatic Face and Gesture Recognition, 1998, pp. 416–421.
[9] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[10] W. B. Langdon and P. Nordin, "Evolving hand-eye coordination for a humanoid robot with machine code genetic programming," in Proc. of the European Conference on Genetic Programming (EuroGP), 2001.
[11] M. Hülse, S. McBride, and M. Lee, "Robotic hand-eye coordination without global reference: A biologically inspired learning scheme," in Proc. of the Intl. Conference on Developmental Robotics, 2009.
[12] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella, "Max-pooling convolutional neural networks for vision-based hand gesture recognition," in Proc. of the Intl. Conference on Signal and Image Processing Applications (ICSIPA), 2011, pp. 342–347.
[13] I. Oikonomidis, N. Kyriazis, and A. Argyros, "Efficient model-based 3D tracking of hand articulations using Kinect," in Proc. of the British Machine Vision Conference, 2011.
[14] Y. Ehara, H. Fujimoto, S. Miyazaki, S. Tanaka, and S. Yamamoto, "Comparison of the performance of 3D camera systems," Gait & Posture, vol. 3, no. 3, pp. 166–169, 1995.
[15] O. Porges, K. Hertkorn, M. Brucker, and M. A. Roa, "Robotic hand pose estimation using vision," Poster Session at the IEEE RAS Summer School on "Robot Vision and Applications", 2012.
[16] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[17] T. V. Spina, J. A. Montoya-Zegarra, A. X. Falcao, and P. A. V. Miranda, "Fast interactive segmentation of natural images using the image foresting transform," in Proc. of the Intl. Conference on Digital Signal Processing, 2009, pp. 1–8.
[18] S. Shirakawa and T. Nagao, "Feed forward genetic image network: Toward efficient automatic construction of image processing algorithm," in Proc. of the Intl. Symposium on Visual Computing, ser. Lecture Notes in Computer Science, vol. 4842. Springer, 2007, pp. 287–297.
[19] J. F. Miller, "An empirical study of the efficiency of learning boolean functions using a cartesian genetic programming approach," in Proc. of the Genetic and Evolutionary Computation Conference, 1999, pp. 1135–1142.
[20] J. F. Miller, Ed., Cartesian Genetic Programming, ser. Natural Computing Series. Springer, 2011.
[21] J. F. Miller and S. L. Smith, "Redundancy and computational efficiency in cartesian genetic programming," IEEE Transactions on Evolutionary Computation, vol. 10, pp. 167–174, 2006.
[22] L. Sekanina, S. L. Harding, W. Banzhaf, and T. Kowaliw, "Image processing and CGP," in Cartesian Genetic Programming, ser. Natural Computing Series, J. F. Miller, Ed. Springer, 2011, ch. 6, pp. 181–215.
[23] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[24] S. Harding, J. Leitner, and J. Schmidhuber, "Cartesian genetic programming for image processing," in Genetic Programming Theory and Practice X (in press). Springer, 2013.
[25] J. Leitner, S. Harding, A. Förster, and J. Schmidhuber, "Mars terrain image classification using cartesian genetic programming," in Proc. of the Intl. Symposium on AI, Robotics and Automation in Space.
[26] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.
[27] Wikipedia, "Matthews correlation coefficient — Wikipedia, the free encyclopaedia," 2012, [accessed 21-March-2012]. [Online]. Available: http://en.wikipedia.org/w/index.php?title=Matthews_correlation_coefficient&oldid=481532406
[28] S. Harding, V. Graziano, J. Leitner, and J. Schmidhuber, "MT-CGP: Mixed type cartesian genetic programming," in Proc. of the Genetic and Evolutionary Computation Conference, 2012.
[29] S. Harding and W. Banzhaf, "Genetic programming on GPUs for image processing," International Journal of High Performance Systems Architecture, vol. 1, no. 4, pp. 231–240, 2008.
[30] S. L. Harding and W. Banzhaf, "Distributed genetic programming on GPUs using CUDA," in Workshop on Parallel Architectures and Bioinspired Algorithms, 2009, pp. 1–10.
[31] T. Martínek and L. Sekanina, "An evolvable image filter: Experimental evaluation of a complete hardware implementation in FPGA," in Proc. of the Intl. Conference Evolvable Systems: From Biology to Hardware, 2005, pp. 76–85.
[32] S. Harding, "Evolution of image filters on graphics processor units using cartesian genetic programming," in Proc. of the World Congress on Computational Intelligence, 2008, pp. 1921–1928.
[33] J. Leitner, S. Harding, M. Frank, A. Förster, and J. Schmidhuber, "An integrated, modular framework for computer vision and cognitive robotics research (icVision)," in Biologically Inspired Cognitive Architectures 2012, A. Chella, R. Pirrone, R. Sorbello, and K. R. Jóhannsdóttir, Eds. Springer, 2013, vol. 196, pp. 205–210.
[34] D. Izzo, M. Ruciński, and F. Biscani, "The generalized island model," Parallel Architectures and Bioinspired Algorithms, pp. 151–169, 2012.
[35] S. Harding, W. Banzhaf, and J. F. Miller, "A survey of self modifying cartesian genetic programming," in Genetic Programming Theory and Practice VIII, R. Riolo, T. McConaghy, and E. Vladislavleva, Eds. Springer, 2010, vol. 8, pp. 91–107.
[36] N. G. Tsagarakis et al., "iCub: the design and realization of an open humanoid platform for cognitive and neuroscience research," Advanced Robotics, vol. 21, pp. 1151–1175, Jan. 2007.
[37] G. Metta et al., "The iCub humanoid robot: An open-systems platform for research in cognitive development," Neural Networks, vol. 23, no. 8-9, pp. 1125–1134, Oct. 2010.
[38] C. Ciliberto, F. Smeraldi, L. Natale, and G. Metta, "Online multiple instance learning applied to hand detection in a humanoid robot," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp. 1526–1532.