
Künstliche Intelligenz 2:24–29, 2003

Learning to Recognise Objects and Situations to Control a Robot End-Effector

Gunther Heidemann and Helge Ritter

View-based representations have become very popular for recognition tasks. In this contribution, we argue that the potential of the approach is not yet fully tapped: tasks need not be "homogeneous", i.e. there is no need to restrict a system to, say, either "object classification" or "gesture recognition". Instead, qualitatively different problems like gesture recognition and scene evaluation can be handled simultaneously by the same system. This feature makes the view-based approach a well-suited tool for such inhomogeneous domains, as will be demonstrated for an end-effector camera. In the described scenario, the task is threefold: recognition of object types, judging the stability of grasps on objects, and hand gesture classification. As this task leads to a large variety of views, we describe a neural recognition architecture specifically designed to represent highly non-linear distributions of view samples.

1 Introduction

Computer Vision (CV) is a substantial technique in modern robotics. Not only does CV take the role of "eyes", e.g. for tasks like navigation — which is what it is expected to do from a cognitive point of view — it often also replaces other sensory input channels like haptics. However, CV capabilities are in general still far from the human example. One of the major shortcomings is that most CV systems are highly specialised to a certain task. CV applications perform the better the narrower their limitations: in counting bright spots on a dark background, CV outperforms humans by several orders of magnitude, but it cannot tell whether the spots are shells on a beach or bolts on a conveyor belt. As a consequence, several specialised modules have to be used when dealing with a variety of tasks.

During the last ten years, view-based representations have become popular for recognition tasks. This technique is a powerful tool when geometrical modelling of an object domain is intractable. The key idea is to memorise features extracted from object views, using filters generated from samples, instead of memorising representations derived from human expertise.

The view-based approach is inherently domain-independent. Nevertheless, this major benefit is not fully exploited in most applications, in the sense that they are strongly restricted to a certain domain. For example, the domain of the well-known object recognition system developed by Murase and Nayar [14] is the classification of isolated objects (toys and household items) on a turntable against a black background. Another domain in which the view-based approach is successful is face classification (e.g. [12]). Still, the restriction to a certain scenario is unavoidable. However, this paper will demonstrate that, using a suitable neural architecture, a view-based representation can be built that captures a domain which is much more complex and inhomogeneous than the above-mentioned examples: a robot hand camera that has to classify not only different objects but simultaneously grasping situations and hand gestures [6]. In the following, the experimental setup is outlined first; in section 3 the view-based classification system is described, and in section 4 the results are discussed.

2 Grasping scenario

The robot system is used in the context of a man-machine interaction scenario where a human instructor tells the robot to grasp an object on a table using hand gestures and speech. As a complete description of this system is beyond the scope of this paper, we refer to [15, 18] and outline the setup only briefly.

An active stereo camera head that is separate from the robot detects and pursues the hand movements of the instructor. When a pointing gesture is detected, the pointing direction is evaluated. Simultaneously, the active camera head surveys the scene of objects on the table, so the correspondence between pointing gesture and object can be established. The active system gives the approximate world coordinates of the indicated object to the robot system. The robot arm then carries out an approach movement until the hand camera (Fig. 1) detects the object. The final approach is controlled by the hand camera alone, without help from the active camera head. When the hand is above the object, the object identity is checked and the hand is moved into the optimal grasping position for this type of object. Then the hand is lowered and the object grasped; the oil pressure in the hydraulic motor pistons indicates whether the grasp was successful.

In this contribution, we present a supplemental vision system for the hand camera which will be integrated into the grasping architecture. So far, only the active camera head evaluates pointing gestures. Since its distance to the table is over one meter, information for correction movements during the approach phase cannot be gathered. Another shortcoming is the insufficient sensory feedback from the robot hand, from which it can only be judged whether the object is grasped at all, but not whether the grasp is stable. The active camera head cannot be used to watch grasp stability, again because of the distance. To solve both problems, a vision system for the hand camera was developed which enables visual guidance of the robot from gestures carried out under the end-effector during the approach movement, as well as a better evaluation of grasp stability.

Figure 2: View of the hand camera with the processing rectangle. Three different types of tasks have to be solved: classification of (1) object type, (2) stability of the grasp and (3) the direction indicated by a hand.

Figure 1: Three-fingered oil-hydraulic robot hand holding an object. The camera is mounted in front.

2.1 Robot setup

We use a standard industrial robot arm (PUMA 560) with six degrees of freedom. The end-effector is an oil-hydraulic robot hand developed by Pfeiffer et al. [11]. It has three equal fingers mounted in an equilateral triangle, pointing in parallel directions (Fig. 1). Each finger has three degrees of freedom: bending the first joint sideways and inwards at the wrist, and bending the coupled second and third joints. The oil pressure in the hydraulic system serves as sensory feedback. A detailed description of the setup can be found in [7]. Fig. 1 also shows a force/torque wrist sensor, which will not be used in the following.

2.2 Visual sensors for grasping

The hydraulic oil pressure alone is not sufficient as feedback to judge the success of the grasping movement. From this data it can only be judged whether the object is still in the grasp or whether it was lost. Though there are additional position sensors on the hydraulic motor pistons, these position measurements cannot be used to estimate the resulting joint angles within the required accuracy, due to mechanical hysteresis.

Insufficient sensory feedback is a major problem in the domain of grasping with grippers or even anthropomorphic artificial hands. It is one of the reasons that robotic abilities in grasping objects are no match for human skills: we can grasp objects of almost arbitrary shape and many different sizes, and we easily adapt the applied forces to light or heavy weight. This ability is the result of a complex interaction between the controlling neural system and the haptic and visual "sensors". Force sensing in the fingertips plays an important role in this process. Unfortunately, there is as yet no technical equivalent to human fingertip force sensing, though progress has been made e.g. using piezo-resistive pressure-sensitive foils for tactile imaging [9]. The problem is not to provide high-quality sensors in general, but sensors which have the size and shape of a human fingertip; such fingertip sensors are still unreliable and coarse in measurement. As a consequence, other sensory sources are used, especially CV, as miniature cameras and image processing hardware are easily available.

Fig. 1 shows the three-fingered hand with the mounted camera (Sony XC 999P). The viewpoint is chosen such that all fingertips are still visible at the image borders when the hand is completely stretched (Fig. 2). The view area is also large enough to capture a human hand held below.

2.3 The recognition task

The hand camera delivers three major categories of information: the directions of hand gestures, the identity of grasped objects and a judgement on the stability of the grasp.

For the purpose of guiding the robot, hand gestures are divided qualitatively into six classes: left, right, come, away, up and down, see Fig. 3. While the first four are intuitively clear, the classes up and down are more of a symbolic nature, which is due to the fact that a finger pointing downwards is hardly visible and a finger pointing upwards looks similar to the come gesture.

Figure 3: Classes of gestures indicating directions (from left to right and top to bottom): left, right, come (towards the user), away (from the user), up and down.

Judging grasp stability requires an individual definition of the classes stable / semi-stable / unstable for each object. Fig. 4 shows an example: a wooden cube with holes (a Baufix toy piece) is held in a stable position (left) and two semi-stable positions (middle and right). While in the stable position all three fingers are within holes of the object, one finger has lost contact in the semi-stable positions. In Fig. 7 an example can be seen for an object with a greater variety of possible grasping positions.

Figure 4: Grasps of different stability on a Baufix cube. Left: stable, all fingers grip into a hole. Middle and right: semi-stable, one finger has lost its grip. Though the difference can be seen clearly from the side (upper row), it is much more difficult to detect from the view of the hand camera (lower row).

The total number of classes $N$ is hence

$N = N_{HG} + N_{Obj} \cdot N_{Stab}$   (1)

where $N_{HG} = 6$ denotes the number of hand gestures, $N_{Obj}$ the number of objects and $N_{Stab}$ the three stability classes. However, in section 4 it will become clear that not the entire product space of object and stability classes has to be covered by the classifier.
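With the numbers used later in the experiments ($N_{HG} = 6$ hand gestures and, from section 4.1, $N_{Obj} = 5$ objects with $N_{Stab} = 3$ stability classes), eq. (1) gives for the full product space

$N = 6 + 5 \cdot 3 = 21$ output channels,

whereas the reduced coding adopted in section 4, which shares the stability channels across all objects, needs only $N_{HG} + N_{Obj} + N_{Stab} = 6 + 5 + 3 = 14$ channels.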

3 The neural recognition system

The task described above differs from typical object recognition problems in that some of the "objects" are hands or situations. Hence, a large capacity of the recognition system is required. We will outline the basic problems of such tasks in the next section and argue that the common method of principal component analysis (PCA) is not suitable in this case. Instead, in section 3.2 a neural network based architecture will be described that performs a local PCA. This method facilitates the representation of even highly inhomogeneous domains.

3.1 ANNs for recognition tasks

Artificial Neural Networks (ANNs) can acquire knowledge from examples, which makes them an ideal tool in image processing, especially when dealing with data that cannot easily be captured by explicit geometrical modelling. However, due to the "curse of dimensionality", ANNs cannot be applied directly to the raw pixel data: nets of enormous size and huge training sets would be necessary (the $81 \times 59$ grey-value windows used in section 4.2 already have 4779 raw input dimensions). Hence, a preceding feature extraction has to capture the interesting part of the image variance and remove redundancy. To avoid spoiling the advantages of ANNs by "hand-made" features, the feature extraction itself should be trained from samples.

Figure 5: (Schematic.) While simple PCA still works for sufficiently simple image data (left), a clustered distribution requires a representation by local PCA (right).

An established solution to dimensionality reduction is principal component analysis (PCA), which is related to the Karhunen-Loève transform. The image data are projected onto the $N_{PCA}$ principal components (PCs) with the largest eigenvalues, where $N_{PCA}$ is much smaller than the original dimensionality (i.e. the number of pixels). Moreover, interpolation between different object views becomes much simpler in the projection space than in the original pixel space [14]. PCA has been successfully applied e.g. in object classification [14] or face recognition [12].

Being a linear method, however, PCA is limited in that it can capture only the "global" variance of the data. This is sufficient e.g. in the object recognition application of Murase and Nayar [14]: though the domain consists of 100 objects, there are nevertheless great similarities. All objects are compact, size-normalised and centred in the middle of the image, and each object appears with only one degree of freedom (rotation). Therefore, PCA is still successful for classification and pose estimation, though it could be shown in [3] that the VPL-architecture described in the next section performs better.

In contrast, the distribution of visual data acquired by the hand camera is not only non-linear but forms several clusters in pixel space, as schematically depicted in Fig. 5. Hence, the structure cannot be approximated by PCA. A suitable non-linear extension of PCA is local PCA [19], which partitions the data into subsets for which simple PCA performs well. Note that "local" refers in this context to the partitioning in the data space, not in the image plane. The neural architecture outlined in the following is based on this idea.
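To illustrate the local-PCA idea in isolation, the following Python sketch partitions the data and fits one PCA per partition. It is not the implementation used in the paper: plain k-means stands in here for the AEV vector quantisation described in section 3.2.1, and all names are ours.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def fit_local_pca(X, n_partitions=12, n_components=8, seed=0):
    # Partition the samples in data space (k-means as a stand-in for VQ),
    # then fit a separate PCA to each partition.
    km = KMeans(n_clusters=n_partitions, n_init=10, random_state=seed).fit(X)
    pcas = [PCA(n_components=n_components).fit(X[km.labels_ == i])
            for i in range(n_partitions)]
    return km, pcas

def project_local(x, km, pcas):
    # Map a sample to the feature vector of its best-match partition.
    i = km.predict(x.reshape(1, -1))[0]
    return i, pcas[i].transform(x.reshape(1, -1))[0]

# Toy data: two elongated clusters with orthogonal main axes -- a single
# global PCA cannot represent both directions of variance with one PC.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [5.0, 0.2], (200, 2)),
               rng.normal([10, 10], [0.2, 5.0], (200, 2))])
km, pcas = fit_local_pca(X, n_partitions=2, n_components=1)
print(project_local(X[0], km, pcas))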

[Figure 6 (schematic): the three-stage VPL architecture. Input: image window → best-match selection among reference vectors (VQ) → attached PCA-nets compute the features → selected expert LLM-classifier net → output: gesture classes, object classes, stability judgements.]

Figure 6: The neural VPL-classification system maps an image patch with pixel vector $\vec{x}$ to a vector-valued output $\vec{y}$ in three steps. First, the best-match reference vector is determined; this processing stage is trained from sample image patches by vector quantisation. Then $\vec{x}$ is projected onto the local PCs which were calculated by the attached PCA-net. Finally, the projection is classified by the selected LLM-net. The output vector $\vec{y}$ has one component for each hand gesture (direction), one for each object type and one for each stability judgement.

3.2 VPL-architecture

The trainable classifier performs a mapping $\vec{x} \to \vec{y}$, $\vec{x} \in \mathbb{R}^M$, $\vec{y} \in \mathbb{R}^N$ [3]. In computer vision, $\vec{x}$ consists of the pixels of an appropriately sized window of an image. For classification tasks, the output is a discrete-valued class $k$. In this case, one separate output channel for each of the $N$ classes should be used to avoid the artificial introduction of "neighbourhoods" within the class system (e.g. classes number 2 and 3 being closer than 2 and 10). Hence, training is performed with sample windows $\vec{x}_i^{Tr}$, $i = 1 \ldots \#\mathrm{samples}$, and corresponding binary output vectors $(\vec{y}_i^{Tr})_j = \delta_{jk}$, $j = 1 \ldots N$, coding the class $k$. The trained system classifies an unknown window $\vec{x}$ by taking the class $k$ of the output component with maximal value: $k = \arg\max_j (\vec{y}(\vec{x}))_j$.

The classifier combines visual feature extraction and classification in a three-stage architecture called VPL, which stands for Vector quantisation, PCA and LLM-network. Fig. 6 shows an overview of the architecture.

3.2.1 First level: Vector quantisation

Vector quantisation (VQ) is carried out on the raw image windows to provide a first data partitioning using $N_{VQ}$ different reference vectors $\vec{r}_i \in \mathbb{R}^M$, $i = 1 \ldots N_{VQ}$. If incremental algorithms like competitive learning are used for VQ, one of the major difficulties is codeword under-utilisation, especially if the dimensionality is high [2]: since the huge volume of a high-dimensional space cannot be "filled" by the provided data, many reference vectors don't "find" the data and remain outliers. Many solutions to this problem have been proposed, like the neural gas [10] or the Kohonen self-organising map [8]. However, since these methods introduce interactions between the reference vectors, parameters have to be chosen appropriately, which turns out to be difficult. Algorithms which are designed specifically to provide a homogeneous codebook utilisation are e.g. Activity Equalisation VQ (AEV) [5] or frequency-sensitive competitive learning [1], since they take the codeword access frequencies into account. As a homogeneously used codebook is essential within the VPL-architecture, we use AEV in the first processing stage.
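The details of the AEV algorithm are given in [5] and not reproduced here. As a rough, hedged sketch of the same idea, the following implements the related frequency-sensitive competitive learning [1], which also counteracts under-utilisation by biasing the winner selection with the codeword access counts; parameter values are illustrative.

import numpy as np

def fscl_vq(X, n_codewords=12, epochs=10, eta=0.05, seed=0):
    # Frequency-sensitive competitive learning (cf. Ahalt et al. [1]):
    # the win count scales each codeword's distance, so rarely used
    # reference vectors become "cheaper" and are pulled into the data,
    # counteracting codeword under-utilisation.
    rng = np.random.default_rng(seed)
    R = X[rng.choice(len(X), n_codewords, replace=False)].astype(float)
    counts = np.ones(n_codewords)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            d2 = ((R - x) ** 2).sum(axis=1)
            w = np.argmin(counts * d2)      # frequency-weighted winner
            R[w] += eta * (x - R[w])        # move winner towards the sample
            counts[w] += 1
    return R

# Usage: X holds the raw (sub-sampled) image windows as row vectors.
X = np.random.default_rng(1).normal(size=(500, 64))
reference_vectors = fscl_vq(X, n_codewords=12)
print(reference_vectors.shape)              # (12, 64)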

3.2.2 Second level: PCA

Figure 7: Grasps of different stability as seen from the hand camera (lower row) and from a separate camera (upper row), from left to right: (1) stable; (2) unstable, because the left finger pushes the object out of the other two fingers; (3) stable, since two fingers are in the holes; (4) like (3), but only semi-stable because the object is tilted and one finger slipped out of the hole.

To each reference vector $\vec{r}_i$ of the primary VQ, a single-layer feed-forward network is attached for the successive calculation of the principal components (PCs). The PCA-nets are trained as proposed by Sanger [17]. The input $\vec{x}$ is projected onto the $N_{PCA}$ PCs with the largest eigenvalues: $\vec{x} \to \vec{p}_l(\vec{x}) \in \mathbb{R}^{N_{PCA}}$, $l = 1 \ldots N_{VQ}$. $\vec{p}_l(\vec{x})$ can be regarded as the feature vector of the input $\vec{x}$.
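A minimal sketch of how one such PCA-net can be trained with Sanger's rule [17] (the generalized Hebbian algorithm). Learning rate and epoch count are our illustrative choices; in the VPL, each net would be trained only on the samples assigned to its reference vector.

import numpy as np

def sanger_pca_net(X, n_components=8, eta=0.001, epochs=50, seed=0):
    # Single-layer feed-forward net learning the leading PCs of the
    # (mean-centred) data via Sanger's rule:
    #     dW = eta * (y x^T - LT[y y^T] W),  y = W x,
    # where LT keeps the lower triangle including the diagonal.
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                  # centre the data
    W = rng.normal(scale=0.01, size=(n_components, X.shape[1]))
    for _ in range(epochs):
        for x in Xc[rng.permutation(len(Xc))]:
            y = W @ x
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W                                 # rows converge to the PCs

# Usage: features are the projections onto the learned components.
X = np.random.default_rng(2).normal(size=(300, 20)) @ np.diag(np.linspace(3, 0.1, 20))
W = sanger_pca_net(X, n_components=3)
features = (X - X.mean(axis=0)) @ W.T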

3.2.3 Third level: LLM-nets

In the third processing stage, to each of the $N_{VQ}$ PCA-nets one expert neural classifier of the Local Linear Map type (LLM-net) is attached [16]. The LLM-nets perform the final mapping $\vec{p}_l(\vec{x}) \to \vec{y} \in \mathbb{R}^N$. The LLM-network is related to the self-organising map [8] and the GRBF-approach [13]. It performs a vector quantisation using $N_{LLM}$ nodes in the input space and can be trained to approximate a non-linear function by a set of locally valid linear mappings attached to the reference vectors; for details see e.g. [16]. The VQ is again carried out using the AEV-algorithm [5].

3.2.4 Training and application

The three processing stages are trained successively: first the vector quantisation and the PCA-nets (unsupervised), finally the LLM-nets (supervised). Classification of an input $\vec{x}$ is carried out by finding the best-match reference vector $\vec{r}_{n(\vec{x})}$, i.e. the vector closest to $\vec{x}$. In case $\vec{x}$ lies exactly in the middle between two reference vectors, $\vec{r}_{n(\vec{x})}$ has to be chosen by chance; however, this case is extremely rare. Subsequently, $\vec{x}$ is mapped to $\vec{p}_{n(\vec{x})}(\vec{x})$ by the attached PCA-net; finally, the mapping $\vec{p}_{n(\vec{x})}(\vec{x}) \to \vec{y}$ is performed.

The major advantage of the VPL-classifier is its ability to form many highly specific feature detectors (the $N_{VQ} \cdot N_{PCA}$ local PCs) while needing to apply only $N_{VQ} + N_{PCA}$ filter operations per classification: first $N_{VQ}$ operations to find the best-match reference vector, then $N_{PCA}$ operations to extract the feature vector. So a large set of filters storing implicit object knowledge does not have to be paid for by a high computational effort in application.
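To summarise sections 3.2.1–3.2.4, here is a hedged sketch of the complete VPL forward pass. The LLM output model (the best-match node answers with its locally valid linear map) follows the description in [16]; training of the maps is omitted, and all names and the random demo parameters are ours.

import numpy as np

def llm_output(nodes, p):
    # LLM-net [16]: each node is a tuple (c, b, A) of input centre,
    # output offset and matrix; the best-match node answers with its
    # locally linear map y = b + A (p - c).
    c, b, A = min(nodes, key=lambda n: np.sum((p - n[0]) ** 2))
    return b + A @ (p - c)

def vpl_classify(x, R, W, means, llm_nets):
    # Three-stage VPL forward pass (cf. Fig. 6).
    l = np.argmin(((R - x) ** 2).sum(axis=1))   # stage 1: VQ best match
    p = W[l] @ (x - means[l])                   # stage 2: local PCA features
    y = llm_output(llm_nets[l], p)              # stage 3: expert classifier
    return int(np.argmax(y))                    # k = argmax_j y_j

# Tiny demo with random parameters: M = 64 "pixels", N_VQ = 12,
# N_PCA = 8, N_LLM = 20 nodes per net, N = 14 output channels.
rng = np.random.default_rng(0)
M, NVQ, NPCA, NLLM, N = 64, 12, 8, 20, 14
R = rng.normal(size=(NVQ, M))                   # reference vectors
means = rng.normal(size=(NVQ, M))               # per-partition means
W = rng.normal(size=(NVQ, NPCA, M))             # rows = local PCs
llm_nets = [[(rng.normal(size=NPCA), rng.normal(size=N),
              rng.normal(size=(N, NPCA))) for _ in range(NLLM)]
            for _ in range(NVQ)]
print(vpl_classify(rng.normal(size=M), R, W, means, llm_nets))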

4 Results

The task of recognising hand gestures, object types and grasp stabilities requires, for a complete separation, a VPL-classifier with $N = N_{HG} + N_{Obj} \cdot N_{Stab}$ output channels (section 2.3). However, one channel for each object plus one for each stability class proves to be sufficient, so the stability channels are common to all objects (Fig. 6).

4.1 Image data

For evaluation, 130 images of hand gestures and 60 images of objects within the grip of the robot hand were used. The larger number of hand images is necessary due to the greater appearance variability of a hand. The gestures to be carried out were qualitatively predefined. The number of objects was $N_{Obj} = 5$; they were presented to the robot on a table in different poses, then the robot hand was lowered and closed. Some grasps were entirely unstable and the object was lost at once when lifted up. Successful grasps were classified into stable, semi-stable and unstable from human knowledge.

4.2 Classification performance

As input to the VPL, a grey-value window sub-sampled to $81 \times 59$ pixels was used, which covers most of the camera image (rectangle in Fig. 2). Performance was tested on the 190 images using the "leaving one out" strategy. Using this data base, the parameters $N_{VQ} = 12$, $N_{PCA} = 8$ and $N_{LLM} = 20$ proved to be the best choice (see Fig. 8). For the combined hand posture / object classification, 91% correct classifications could be achieved (without stability judgement). Errors were mostly caused by misclassifications of the hand gestures. In addition, the three stability judgement channels were evaluated whenever an object was detected; correctness was 84%. The relatively high error can be explained by the poor visibility of some of the grasps; an example is Fig. 7: the last two grasps differ in stability, but the tilt of the object is difficult to see from the hand camera.

Fig. 8 compares the combined hand posture and object recognition rates — i.e. the percentage of correctly classified samples — for different values of $N_{VQ}$ and $N_{PCA}$ and motivates the choice $N_{VQ} = 12$, $N_{PCA} = 8$. The local-PCA architecture performs significantly better than normal PCA (the case $N_{VQ} = 1$), which reaches only a correctness of 86% for hand posture / object classification and 78% for stability judgement (the latter is not plotted in Fig. 8). Increasing $N_{PCA}$ further does not improve the results.
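The "leaving one out" protocol itself is simple: train on all samples but one, test on the held-out sample, repeat for every sample and average. The sketch below shows the loop with scikit-learn's LeaveOneOut; a 1-nearest-neighbour classifier stands in for the retrained VPL, and the random placeholder data merely mimic the 190 labelled windows.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loo_recognition_rate(X, y):
    # Train on all samples but one, test on the held-out sample,
    # and average over all folds.
    hits = 0
    for train, test in LeaveOneOut().split(X):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[train], y[train])
        hits += clf.predict(X[test])[0] == y[test][0]
    return hits / len(X)

# Placeholder data standing in for the 190 labelled camera windows.
rng = np.random.default_rng(4)
X = rng.normal(size=(190, 30))
y = rng.integers(0, 14, size=190)
print(loo_recognition_rate(X, y))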

Figure 8: Recognition rate for the combined hand posture and object classification task (without stability judgement) for different $N_{VQ}$ and $N_{PCA}$. [Plot: recognition rate (40–95%) over the number of reference vectors (0–14), one curve each for 4, 6, 8 and 12 PCs.]

This behaviour is in accordance with previous work [3, 4], where it could be shown that the classification performance of the VPL is "well-behaved" under changes of the three fundamental parameters $N_{VQ}$, $N_{PCA}$ and $N_{LLM}$: the recognition rate increases smoothly in all three parameters until saturation is reached. Over-fitting does not occur for $N_{VQ}$ and $N_{LLM}$, because the AEV-algorithm automatically sorts out under-utilised reference vectors. For $N_{PCA}$, too large values can easily be avoided by watching the eigenvalues.

The different performance of simple PCA and local PCA can easily be explained by the inhomogeneity of the image data (hands versus objects in the gripper), which results in a highly non-linear distribution in pixel space. For comparison, VPL-classifiers with $N_{VQ} = 1$ and $N_{VQ} = 8$ were trained separately for the gesture and the object classification task. In this case, the advantage of the VPL with $N_{VQ} > 1$ over the normal PCA case is much smaller.

4.3 Operating conditions

To achieve good results, the scenario is subject to some conditions. In principle, background and lighting conditions are arbitrary as long as they remain constant: the system adapts to every colour or texture during the training phase, but conditions must be the same in application. For example, even a sharp beam of light from one side can be used as illumination, because the system adapts automatically even to strong shadows. Shadows are just another form of "object appearance" — as long as conditions are fixed.

In practice, however, the requirement of constancy still implies several restrictions. Though background texture could be trained, a homogeneous background must be used, because movement of the robot arm would change the pattern under the camera. For the same reason, homogeneous illumination from several diffused sources is necessary, since the robot arm moves over a considerable range. Another restriction is that the hand of the instructor must have approximately the same size in the camera image for training and application (training for several distances would be quite tiresome). This is brought about by holding the robot hand at a constant and relatively low height over the table, which leaves little range for the hand of the user. Further, the hand must be approximately centred below the camera to be completely visible. This can quite easily be done with a little practice, because the robot fingers indicate the location of the centre. However, we hope to improve this situation by using acoustic feedback for hand centring and recognisability in future work.

5 Conclusion and outlook

We presented a visual recognition system for the end-effector camera of a three-fingered robot hand. This system has to solve different tasks: classification of hand gestures carried out under the end-effector for guidance, recognition of grasped objects and judging grasp stability. It could be shown that the VPL-architecture outlined in section 3.2 is able to represent this image domain, which is achieved by a combination of feature extraction based on local PCA with neural classifiers.

So far, the scenario outlined in section 2 makes use only of the object classification results of the hand camera. Future goals are the development of control strategies which allow the integration of the qualitative gesture recognition into the grasping process, and the design of adequate grasping behaviours that incorporate the feedback from the hand camera.

The future development of the vision system will aim to exploit its representational capabilities by classifying a greater variety of situations. For the hand gestures, we plan the transition from pre-defined "symbolic", non-continuous poses to more natural gestures. The basis will be an image database of test persons who were not instructed to use predefined poses but used their own natural gestures.

The object recognition part will not only be improved with respect to the stability judgement in future work; the hand camera will also be involved in the grasping process itself. This means a new category of situations will be involved: to guide the hand to a suitable grasping position, a judgement of the "grasping situation" will be needed which evaluates the object pose with respect to the finger positions. By this means the grasping process could be optimised and grasp stability increased.
6 Acknowledgement

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) within the collaborative research project SFB 360 "Situated Artificial Communicators".

References

[1] S. C. Ahalt, A. K. Krishnamurthy, P. Chen, and D. E. Melton. Competitive learning algorithms for vector quantization. Neural Networks, 3:277–290, 1990.
[2] S. Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Sci., 11:23–63, 1987.
[3] G. Heidemann. Ein flexibel einsetzbares Objekterkennungssystem auf der Basis neuronaler Netze. PhD thesis, Univ. Bielefeld, Technische Fakultät, 1998. Infix, DISKI 190.
[4] G. Heidemann and H. Ritter. Combining multiple neural nets for visual feature selection and classification. In ICANN 99, Ninth Int'l Conf. on Artificial Neural Networks, pages 365–370, 1999.
[5] G. Heidemann and H. Ritter. Efficient vector quantization using the WTA-rule with activity equalization. Neural Processing Letters, 13(1):17–30, 2001.
[6] G. Heidemann and H. Ritter. Combining gestural and contact information for visual guidance of multi-finger grasps. In M. Verleysen, editor, Proc. ESANN 02, pages 301–306, Bruges, Belgium, 2002. d-side publications.
[7] J. Jockusch. Exploration based on Neural Networks with Applications in Manipulator Control. PhD thesis, Univ. Bielefeld, Technische Fakultät, 2000.
[8] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.
[9] H. Liu, P. Meusel, and G. Hirzinger. A tactile sensing system for the DLR three-finger robot hand. In Proc. ISMCR 95, pages 91–96, 1995.
[10] T. Martinetz, S. G. Berkovich, and K. Schulten. "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4):558–569, 1993.
[11] R. Menzel, K. Woelfl, and F. Pfeiffer. The development of a hydraulic hand. In 2nd Conf. on Mechatronics and Robotics, pages 225–238, 1993.
[12] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):696–710, 1997.
[13] J. Moody and C. Darken. Learning with localized receptive fields. In Proc. of the 1988 Connectionist Models Summer School, pages 133–143. Morgan Kaufmann Publishers, San Mateo, CA, 1988.
[14] H. Murase and S. K. Nayar. Visual learning and recognition of 3-D objects from appearance. Int'l J. Computer Vision, 14:5–24, 1995.
[15] R. Rae, M. Fislage, and H. Ritter. Visuelle Aufmerksamkeitssteuerung zur Unterstützung gestikbasierter Mensch-Maschine-Interaktion. KI – Künstliche Intelligenz, Themenheft Aktive Sehsysteme, (1):18–24, 1999.
[16] H. J. Ritter, T. M. Martinetz, and K. J. Schulten. Neuronale Netze. Addison-Wesley, München, 1992.
[17] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473, 1989.
[18] J. Steil, G. Heidemann, J. Jockusch, R. Rae, N. Jungclaus, and H. Ritter. Guiding attention for grasping tasks by gestural instruction: The GRAVIS-robot architecture. In Proc. IROS 2001, 2001.
[19] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.

Contact

Dr. Gunther Heidemann
Universität Bielefeld
Technische Fakultät, AG Neuroinformatik
Postfach 10 01 31, D-33501 Bielefeld
Tel.: +49 (0)521 106-6056
Fax: +49 (0)521 106-6011
e-mail: [email protected]
WWW: http://www.TechFak.Uni-Bielefeld.DE/ags/ni

Prof. Dr. Helge Ritter
Universität Bielefeld
Technische Fakultät, AG Neuroinformatik
Postfach 10 01 31, D-33501 Bielefeld
Tel.: +49 (0)521 106-6062
Fax: +49 (0)521 106-6011
e-mail: [email protected]
WWW: http://www.TechFak.Uni-Bielefeld.DE/ags/ni

Gunther Heidemann studied at the Universities of Karlsruhe and Münster and received a Ph.D. from Bielefeld University in 1998. He is currently working within the collaborative research project "Hybrid Knowledge Representation" of the SFB 360 at Bielefeld University. His fields of research are mainly computer vision, neural networks and hybrid systems.

Helge J. Ritter studied physics and mathematics at the Universities of Bayreuth, Heidelberg and Munich and received a Ph.D. in physics from the Technical University of Munich in 1988. Since 1985, he has been engaged in research in the field of neural networks. In 1989 he moved as a guest scientist to the Laboratory of Computer and Information Science at Helsinki University of Technology. Subsequently he was an assistant research professor at the then newly established Beckman Institute for Advanced Science and Technology and the Department of Physics at the University of Illinois at Urbana-Champaign. Since 1990 he has been a professor at the Department of Information Science, Bielefeld University. His main interests are principles of neural computation, in particular self-organising and learning systems, and their application to machine vision, robot control, data analysis and interactive man-machine interfaces. In 1999 Helge Ritter was awarded the SEL Alcatel Research Prize, and in 2001 the Leibniz Prize of the German Research Foundation (DFG).