Human–Robot Interaction by Understanding Upper Body Gestures

Yang Xiao, Zhijun Zhang,* and Aryel Beck
Institute for Media Innovation, Nanyang Technological University, Singapore

Junsong Yuan
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Daniel Thalmann
Institute for Media Innovation, Nanyang Technological University, Singapore

*Correspondence to [email protected].

Presence, Vol. 23, No. 2, Spring 2014, 133–154. doi:10.1162/PRES_a_00176. © 2014 by the Massachusetts Institute of Technology.

Abstract

In this paper, a human–robot interaction system based on a novel combination of sensors is proposed. It allows one person to interact with a humanoid social robot using natural body language. The robot understands the meaning of human upper body gestures and expresses itself by using a combination of body movements, facial expressions, and verbal language. A set of 12 upper body gestures is involved for communication. This set also includes gestures with human–object interactions. The gestures are characterized by head, arm, and hand posture information. The wearable Immersion CyberGlove II is employed to capture the hand posture. This information is combined with the head and arm posture captured from Microsoft Kinect. This is a new sensor solution for human-gesture capture. Based on the posture data from the CyberGlove II and Kinect, an effective and real-time human gesture recognition method is proposed. The gesture understanding approach based on an innovative combination of sensors is the main contribution of this paper. To verify the effectiveness of the proposed gesture recognition method, a human body gesture data set is built. The experimental results demonstrate that our approach can recognize the upper body gestures with high accuracy in real time. In addition, for robot motion generation and control, a novel online motion planning method is proposed. In order to generate appropriate dynamic motion, a quadratic programming (QP)-based dual-arm kinematic motion generation scheme is proposed, and a simplified recurrent neural network is employed to solve the QP problem. The integration of this scheme within the HRI system illustrates the effectiveness of the proposed online motion generation method.

1 Introduction

Recently, human–robot interaction (HRI) has garnered attention in both academic and industrial communities. Being regarded as the sister community of human–computer interaction (HCI), HRI is still a relatively young field that began to emerge in the 1990s (Dautenhahn, 2007; Goodrich & Schultz, 2007). It is an interdisciplinary research field that requires contributions from mathematics, psychology, mechanical engineering, biology, and computer science, among other fields (Goodrich & Schultz, 2007). HRI aims to understand and shape the interactions between humans and robots. Unlike the earlier human–machine interaction, more social dimensions must be considered in HRI, especially when interactive social robots are involved (Dautenhahn, 2007; Fong, Nourbakhsh, & Dautenhahn, 2003). In this case, robots should be believable. Moreover, humans prefer interacting with robots in the way they do with other people (Dautenhahn, 2007; Fong et al., 2003). Therefore, one way to increase believability would be to make the robot interact with a human using the same modalities as human–human interaction. This includes verbal and body language as well as facial expressions. That is to say, the robots should be able to use these modalities for both perception and expression. Some social robots have already been proposed to achieve this goal. For instance, the Leonardo robot expresses itself using a combination of voice, facial, and body expressions (Smith & Breazeal, 2007). Another example is the Nao humanoid robot1 that can use vision along with gestures and body expression of emotions (Beck, Cañamero et al., 2013). In contrast to these two robots, Nadine is a highly realistic humanoid robot (see Figure 1). This robot presents some different social challenges. In this paper, a human–robot interaction system that addresses some of these challenges is proposed. As shown in Figure 1, it supports one person to communicate and interact with one humanoid robot. In the proposed system, the human can naturally communicate with the Nadine robot by using body language. The Nadine robot is able to express itself by a combination of speech, body language, and facial expressions.

1. http://www.aldebaran-robotics.com/



Figure 1. Human–robot social interaction. Human is on the right, and robot is on the left.

In this paper, the main research questions addressed are:

• How to establish the communication between human and robot by using body language;
• How to control a human-like robot so that it can sustain believable interaction with humans.

Verbal and nonverbal language are two main communication avenues for human–human interaction. Verbal language has been employed in many HRI systems (Faber et al., 2009; Nickel & Stiefelhagen, 2007; Perzanowski, Schultz, Adams, Marsh, & Bugajska, 2001; Spiliotopoulos, Androutsopoulos, & Spyropoulos, 2001; Stiefelhagen et al., 2004, 2007). However, it still has some constraints: speech recognition accuracy is likely to be affected by background noise, human accents, and device performance, and learning and interpreting the subtle rules of syntax and grammar in speech is also a difficult task. These factors limit the practical use of verbal language to some degree. On the other hand, nonverbal cues also convey rich communication (Cassell, 2000; Mehrabian, 1971). Thus, one of our research motivations is to apply nonverbal language to human–robot social interaction. More specifically, upper body gesture language is employed. Currently, 12 human upper body gestures are involved in the proposed system. These gestures are all natural motions with intuitive semantics. They are characterized by head, arm, and hand posture simultaneously. It is worth noting that human–object interactions are involved in these gestures. Human–object interaction events manifest frequently during human–human interaction in daily life. However, to our knowledge, they are largely ignored by previous HRI systems.

The main challenge in applying upper body gesture language to human–robot interaction is how to enable the Nadine robot to understand and react to human gestures accurately and in real time. To achieve this goal, two crucial issues need to be solved.


• First, an appropriate human gesture-capture sensor solution is required. To recognize the 12 upper body gestures, head, arm, and hand posture information are needed simultaneously. Because robustly obtaining hand posture based on vision-based sensors (such as the RGB camera) is still a difficult task (Lu, Shark, Hall, & Zeshan, 2012; Teleb & Chang, 2012), the wearable CyberGlove II (Immersion, 2010) is employed. Using this device, high-accuracy hand posture data can be acquired stably. Meanwhile, Microsoft Kinect (Shotton et al., 2011) is an effective and efficient low-cost depth sensor applied successfully to human body tracking. The skeleton joints can be extracted from the Kinect depth images (Shotton et al.) in real time (30 fps). In our work, Kinect is applied to capture the upper body (head and arm) posture information. Recently, Kinect 2, which supports tracking multiple people with better depth imaging quality, has been released. Since our work investigates the HRI scenario that only involves one person, Kinect is sufficient to handle the human body tracking task.
• Second, an effective and real-time gesture recognition method should be developed. Based on the CyberGlove II and Kinect posture data, a descriptive upper body gesture feature is proposed. The LMNN distance metric learning method (Weinberger & Saul, 2009) is applied in order to improve the performance of recognizing and interpreting gestures. Then, the energy-based LMNN classifier is used to recognize the gestures.

To evaluate the proposed gesture recognition method, a human upper body gesture data set is constructed. This data set contains gesture samples from 25 people of different genders, body sizes, and cultural backgrounds. Further, the experimental results demonstrate the effectiveness and efficiency of our method.

Overall, the main contributions of this paper are as follows.

• A novel human gesture-capture sensor solution is proposed. That is, the CyberGlove II and Kinect are integrated to capture head, arm, and hand posture simultaneously;
• An effective and real-time upper body gesture recognition approach is proposed;
• A novel online motion planning method is proposed for robot control;
• To support the human to communicate and interact with the robot using body language, a gesture understanding and human–robot interaction (GUHRI) system is built.

The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 gives an overview of the GUHRI system. The human upper body gesture understanding method is illustrated in Section 4. The robot motion planning and control mechanism is described in Section 5. Section 6 introduces the scenario for human–robot interaction. Experimental details and discussion are given in Section 7. Section 8 concludes the paper and discusses future research.

2 Related Work

HRI systems are constructed mainly based on verbal, nonverbal, or multimodal communication modalities. As mentioned in Section 1, verbal communication still faces constraints in practical applications. Our work focuses on how to apply nonverbal communication to human–robot social interaction, especially using upper body gesture. Some HRI systems have already employed body gesture for human–robot communication. In Waldherr, Romero, and Thrun (2000), an arm gesture-based interface for HRI was proposed. The user could control a mobile robot by using static or dynamic arm gestures. Hand gesture was used as the communication modality for HRI in Brethes, Menezes, Lerasle, and Hayet (2004). The HRI systems addressed in Stiefelhagen et al. (2004, 2007) could recognize a human's gesture by using the 3D head and hand position information, and head orientation was further appended to enhance recognition performance.


In Faber et al. (2009), the social robot could understand the human's body language characterized by arm and head posture. Our proposed focus on nonverbal human–robot communication is different from previous work in two main respects. First, head, arm, and hand posture are jointly captured to describe the 12 upper body gestures involved in the GUHRI system. Secondly, the gestures accompanied with human–object interaction can be understood by the robot. However, these gestures are always ignored by the previous HRI systems.

Body gesture recognition plays an important role in the GUHRI system. According to the gesture-capture sensor type, gesture recognition systems can be categorized as either encumbered or unencumbered (Berman & Stern, 2012). Encumbered systems require the users to wear physical assistive devices such as infrared responders, hand markers, or data gloves. These systems are of high precision and fast response, and are robust to environment changes. Many encumbered systems have been proposed. For instance, two education systems (Adamo-Villani, Heisler, & Arns, 2007) were built for the deaf by using data gloves and optical motion capture devices; Lu et al. (2012) proposed an immersive virtual object manipulation system based on two data gloves and a hybrid ultrasonic tracking system. Although most commercialized gesture-capture devices are currently encumbered, unencumbered systems are expected to be the choice for the future, especially the vision-based systems. With the emergence of low-cost 3D vision sensors, the application of such devices has become a very hot topic in both research and commercial fields. One of the most famous examples is the Microsoft Kinect, which has been successfully employed in human body tracking (Shotton et al., 2011), activity analysis (Wang, Liu, Wu, & Yuan, 2012), and gesture interpreting (Xiao, Yuan, & Thalmann, 2013). Even for other vision applications, such as scene categorization (Xiao, Wu, & Yuan, 2014) and image segmentation (Xiao, Cao, & Yuan, 2014), Kinect also holds the potential to boost performance. However, accurate and robust hand posture capture is still a difficult task for vision-based sensors.

As discussed above, both encumbered and unencumbered sensors possess intrinsic advantages and drawbacks. For specific applications, they can be complementary. In the GUHRI system, a trade-off between the two kinds of sensors is made; that is, the fine hand posture is captured by the encumbered device (the CyberGlove II), while the rough upper body posture is handled by the unencumbered sensor (the Kinect).

The aim of the GUHRI system is to allow a user to interact with a humanoid robot named Nadine (see Figure 1). In order to be believable, a humanoid robot should behave in a way consistent with its physical appearance (Beck, Stevens, Bard, & Cañamero, 2012). Therefore, due to its highly realistic appearance, the Nadine robot should rely on the same modalities as humans for communication. In other words, it should communicate using speech, facial expressions, and body movements. This paper focuses mostly on body movement generation and describes a control mechanism that allows the robot to respond to the user's gestures in real time.

Humanoid social robots need to be able to display coordinated and independent arm movements depending on the actual situation. To do so, a kinematics model is necessary to generate motions dynamically. The dual arms of a humanoid robot are redundant, and this makes it difficult to solve the inverse kinematics model directly. The classical approaches for solving the redundancy-resolution problem are the pseudoinverse-based methods, that is, one minimum-norm particular solution plus a homogeneous solution (Wang & Li, 2009). Specifically, at the joint-velocity level, the pseudoinverse-type solution can be formulated as

\dot{\theta}_L(t) = J_L^{\dagger}(\theta_L) \dot{r}_L + (I_L - J_L^{\dagger}(\theta_L) J_L(\theta_L)) w_L,   (1)

\dot{\theta}_R(t) = J_R^{\dagger}(\theta_R) \dot{r}_R + (I_R - J_R^{\dagger}(\theta_R) J_R(\theta_R)) w_R,   (2)

where θ_L and θ_R denote the joints of the left arm and the right arm, respectively; θ̇_L and θ̇_R denote the joint velocities of the left arm and the right arm, respectively; J_L and J_R are the Jacobian matrixes defined as J_L = ∂f_L(θ_L)/∂θ_L and J_R = ∂f_R(θ_R)/∂θ_R, respectively; J_L^†(θ_L) ∈ R^{n×m} and J_R^†(θ_R) ∈ R^{n×m} denote the pseudoinverses of the left arm Jacobian matrix J_L(θ_L) and the right arm Jacobian matrix J_R(θ_R), respectively; I_L = I_R ∈ R^{n×n} are the identity matrixes; and w_L ∈ R^n and w_R ∈ R^n are arbitrary vectors usually selected by using some optimization criteria. The first terms on the right-hand side of Equations 1 and 2 are the particular solutions (i.e., the minimum-norm solutions), and the second terms are the homogeneous
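For illustration only, the pseudoinverse-type resolution of Equations 1 and 2 can be sketched for a single toy arm as follows; the planar three-link arm, its link lengths, and the null-space vector w are arbitrary stand-ins rather than the Nadine robot's actual kinematics.

```python
import numpy as np

def planar_jacobian(theta, lengths=(0.3, 0.25, 0.2)):
    """Analytic Jacobian of a planar serial arm (x, y end-effector position)."""
    theta, lengths = np.asarray(theta), np.asarray(lengths)
    cum = np.cumsum(theta)                       # theta_1, theta_1+theta_2, ...
    J = np.zeros((2, theta.size))
    for i in range(theta.size):
        J[0, i] = -np.sum(lengths[i:] * np.sin(cum[i:]))   # dx / dtheta_i
        J[1, i] =  np.sum(lengths[i:] * np.cos(cum[i:]))   # dy / dtheta_i
    return J

def pseudoinverse_step(J, r_dot, w):
    """One joint-velocity step of Eq. 1/2: theta_dot = J^+ r_dot + (I - J^+ J) w."""
    J_pinv = np.linalg.pinv(J)                   # Moore-Penrose pseudoinverse
    n = J.shape[1]
    particular = J_pinv @ r_dot                  # minimum-norm particular solution
    homogeneous = (np.eye(n) - J_pinv @ J) @ w   # self-motion in the null space
    return particular + homogeneous

theta = np.array([0.1, 0.4, -0.2])               # redundant 3-DOF arm, 2-D task
J = planar_jacobian(theta)
theta_dot = pseudoinverse_step(J, r_dot=np.array([0.05, 0.0]), w=-0.1 * theta)
print(theta_dot)
```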


Figure 2. The GUHRI system architecture.

solutions. Pseudoinverse-based approaches need to compute the matrix inverse, which may be time-intensive in real-time computation. In addition, pseudoinverse-based approaches have an inner shortage; that is, they cannot solve inequality problems. Recent works have preferred optimization methods (Cai & Y. Zhang, 2012; Guo & Zhang, 2012; Kanoun, Lamiraux, & Wieber, 2011), and most of them focus on industrial manipulators and single robots (Z. Zhang & Y. Zhang, 2012, 2013c).

The conventional redundancy resolution method uses the pseudoinverse-type formulation, that is, Equations 1 and 2. Based on such a pseudoinverse-type solution, many optimization performance criteria have been exploited in terms of manipulator configurations and interaction with the environment, such as joint-limit avoidance (Chan & Dubey, 1995; Ma & Watanabe, 2002), singularity avoidance (Taghirad & Nahon, 2008), and manipulability enhancement (Martins, Dias, & Alsina, 2006). Recent research shows that the solutions to redundancy resolution problems can be enhanced by using optimization techniques based on QP methods (Cheng, Chen, & Sun, 1994). Compared with the conventional pseudoinverse-based solutions, such QP-based methods do not need to compute the inverse of the Jacobian matrix, and are readily able to deal with inequality and/or bound constraints. This is why QP-based methods have been employed. In Cheng et al. (1994), considering the physical limits, the authors proposed a compact QP method to resolve the constrained kinematic redundancy problem. Z. Zhang and Y. Zhang (2013) implement a QP-based two-norm scheme on a planar six-DOF manipulator. However, the above methods only consider a single arm and are therefore not directly applicable to the two arms of a humanoid robot. This is why a QP-based dual-arm kinematic motion generation scheme is proposed, and a simplified recurrent neural network is employed to solve the QP problem.

3 System Overview

The proposed GUHRI system is able to capture and understand human upper body gestures and trigger the robot's reaction in real time. The GUHRI system is implemented using a framework called the Integrated Integration Platform (I2P) that is specifically developed for integration. I2P was developed by the Institute for Media Innovation.2 This framework allows for the link and integration of perception, decision, and action modules within a unified and modular framework. The platform uses client–server communications between the different components. Each component has an I2P interface, and the communication between the clients and servers is implemented using Thrift.3 It should be noted that the framework is highly modular and components can be added to make the GUHRI system extendable.

2. http://imi.ntu.edu.sg/Pages/Home.aspx
3. http://thrift.apache.org/


As shown in Figure 2, the current GUHRI system is composed of two modules. One is the human gesture understanding module that serves as the communication interface between human and robot, and the other is the robot control module proposed to control the robot's behaviors for interaction. At this stage, our system supports the interaction between one person and one robot. One right-hand CyberGlove II and one Microsoft Kinect are employed to capture the human hand and body posture information simultaneously for gesture perception. This new gesture-capturing sensor solution is different from the approaches introduced in Section 2. Specifically, the CyberGlove is used to capture the hand posture, and Kinect acquires the 3D position information of the human skeletal joints (including head, shoulder, limb, and hand). At this stage, the GUHRI system relies on the upper body gestures triggered by the human's right hand and right arm.

Apart from the CyberGlove, the user does not need to wear any other device. Thus, the proposed sensor solution does not exert a heavy burden that would be uncomfortable to the user. Meanwhile, since the CyberGlove II is connected to the system using Bluetooth, the user can move freely. In addition, the GUHRI system is able to recognize gestures with human–object interaction, such as "call," "drink," "read," and "write," by fusing the hand and body posture information. These gestures are often ignored by previous systems. However, they manifest frequently in daily interaction between humans. They affect the interaction state abruptly and should be considered in HRI. Therefore, they should be regarded as essential elements of natural human–robot interaction. In our system, the robot is able to recognize and give meaningful responses to these gestures.

The first step of the gesture understanding phase is to synchronize the original data from the CyberGlove and Kinect. The descriptive features are then extracted. The multimodal features are then fused to generate a unified input for the gesture classifier. Lastly, the gesture recognition and understanding results are sent to the robot control module via the message server to trigger the robot's reaction.

Figure 3. The GUHRI system deployment.

The robot control module enables the robot to respond to the human's body gesture. In our system, the robot's behavior is composed of three parts: body movement, facial expression, and verbal expression. Combining these modalities makes the robot more lifelike, and should enhance the users' interest during the interaction.

As shown in Figure 3, when the GUHRI system is running, the user wears the CyberGlove and stands facing the robot for interaction. The Kinect is placed beside the robot to capture the user's body frame information.

4 Human Upper Body Gesture Interpretation

As an essential part of the GUHRI system, the human upper body gesture interpretation module plays an important role during the interaction. Its performance will affect the interaction experience. In this section, our upper body gesture interpretation method, which fuses the gesture information from CyberGlove and Kinect, is illustrated in detail. First, the body gestures included in the GUHRI system are introduced. The feature extraction pipelines for both CyberGlove and Kinect are then presented. To generate an integral gesture description, the multimodal features from different sensors will be fused as the input for the classifier.


Figure 4. The Category 1 upper body gestures. These gestures can be characterized by the body and hand posture information simultaneously.

Aiming to enhance the gesture recognition accuracy, the LMNN distance metric learning approach (Weinberger & Saul, 2009) is applied to mine the optimal distance measures. Further, the energy-based classifier (Weinberger & Saul) is applied for decision-making.

4.1 Gestures in the GUHRI System

At the current stage, 12 static upper body gestures are included in the GUHRI system. Since we only have one right-hand CyberGlove, to obtain accurate hand posture information, all the gestures are mainly triggered by the human's right hand and right arm. The gestures involved can be partitioned into two categories, according to whether human–object interaction happens:

• Category 1 includes those body gestures without human–object interaction.
• Category 2 includes those body gestures with human–object interaction.

Category 1 contains eight upper body gestures: "be confident," "have question," "object," "praise," "stop," "succeed," "shake hand," and "weakly agree." Some gesture samples are shown in Figure 4. These gestures are natural and have intuitive meanings. They are related to the human's emotional state and behavior intention, as opposed to transient gestures for specific applications. Therefore, gesture-to-meaning mapping is not needed in our system. Because humans' behavior habits are not all the same, recognizing natural gestures is challenging. However, these natural gestures are meaningful for human–robot interaction.
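For reference, the full gesture vocabulary (the four Category 2 gestures are detailed below) can be written down as a small lookup structure; the Python naming used here is purely illustrative.

```python
# The 12 upper body gestures of the GUHRI system, grouped by whether a
# human-object interaction is involved (Section 4.1).
GESTURES = {
    "category_1": [            # no human-object interaction
        "be confident", "have question", "object", "praise",
        "stop", "succeed", "shake hand", "weakly agree",
    ],
    "category_2": [            # with human-object interaction
        "call", "drink", "read", "write",
    ],
}

assert sum(len(v) for v in GESTURES.values()) == 12
```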


Figure 5. The Category 2 upper body gestures. These gestures can be characterized by the body and hand posture information simultaneously.

As exhibited in Figure 4, both hand and body posture information are required for recognizing these gestures. For instance, the upper body postures corresponding to "have question" and "object" are very similar. Without the hand posture, they are difficult to distinguish. The converse also happens with "have question," "weakly agree," and "stop": they correspond to similar hand gestures but very different upper body postures.

Category 2 is composed of four other upper body gestures: "call," "drink," "read," and "write" (see Figure 5). As opposed to Category 1 gestures, these four gestures occur with human–object interactions. Existing systems do not consider this kind of gesture (see Section 3). One main reason is that objects often cause body occlusion, especially of the hand. In this case, vision-based hand gesture recognition methods are impaired. This is why the CyberGlove is employed to capture the hand posture. In the GUHRI system, Category 2 gestures are recognized and affect the interaction in a realistic way. These gestures are also recognized based on the hand and upper body posture information.

In summary, the description of the human's hand and upper body posture is the key to recognizing and understanding the 12 upper body gestures.

4.2 Feature Extraction and Fusion

In this section, we introduce the feature extraction methods for both human hand and upper body posture description. The multimodal feature fusion approach is also illustrated.

Figure 6. CyberGlove II data joints (Immersion, 2010).

4.2.1 Hand Posture Feature. The Immersion wireless CyberGlove II is employed as the hand posture capture device in the GUHRI system. As one of the most sophisticated and accurate data gloves, CyberGlove II provides 22 high-accuracy joint-angle measurements in real time. These measurements reflect the bending degree of fingers and wrist. The 22 data joints (marked as big white or black dots) are located on the CyberGlove as shown in Figure 6. However, not all the joints are used. For the hand gestures in our application, we found that the wrist posture does not provide stable descriptive information. The wrist bending degrees of different people vary to a large extent, even for the same gesture. This phenomenon is related


to different behavior habits. This is why the two wrist data joints (marked as black in Figure 6) were discarded.

A 20-dimensional feature vector F_hand is extracted from the 20 white data joints to describe the human hand posture as

F_{hand} = (h_1, h_2, h_3, \cdots, h_{19}, h_{20}),   (3)

where h_i is the bending degree corresponding to the white data joint i.
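A minimal sketch of assembling the hand-posture feature of Equation 3 is given below; the indices of the two wrist channels and the way raw glove values are exposed are assumptions made for illustration, not the CyberGlove SDK's actual interface.

```python
import numpy as np

# Indices of the two wrist sensors to drop; which of the 22 CyberGlove II
# channels they correspond to is assumed here purely for illustration.
WRIST_JOINTS = (20, 21)

def hand_feature(glove_angles):
    """Build F_hand (Eq. 3): the 20 finger bending degrees, wrist excluded."""
    angles = np.asarray(glove_angles, dtype=float)
    assert angles.shape == (22,), "CyberGlove II provides 22 joint-angle values"
    keep = [i for i in range(22) if i not in WRIST_JOINTS]
    return angles[keep]                       # 20-dimensional F_hand

# Example with synthetic glove readings (degrees).
f_hand = hand_feature(np.random.uniform(0, 90, size=22))
print(f_hand.shape)                           # (20,)
```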

4.2.2 Upper Body Posture Feature. Using the Kinect sensor, we describe the human upper body posture intermediately using the 3D skeletal joint positions. For a full human subject, 20 body joint positions can be detected and tracked by the real-time skeleton tracker (Shotton et al., 2011) based on the Kinect depth frame. This is invariant to posture, body shape, clothing, and so on. Each joint J_i is represented by three coordinates at frame t as

J_i = (x_i(t), y_i(t), z_i(t)).   (4)

However, not all the 20 joints are necessary for upper body gesture recognition. As discussed in Section 4.1, the head and right arm are highly correlated with the 12 upper body gestures (Figure 4 and Figure 5). For efficiency, only four upper body joints are chosen as the descriptive joints for gesture interpretation. These are "head," "right shoulder," "right elbow," and "right hand," as shown by gray dots in Figure 7.

Figure 7. The selected body skeletal joints.

Directly using the original 3D joint information for body posture description is not stable, because it is sensitive to the relative position between the human and the Kinect. Solving this problem by restricting the human's position is not appropriate for interaction. In Wang et al. (2012), human action is recognized by using the pairwise relative positions between all joints, which is robust to the human–Kinect relative position. Inspired by this work, a simplified solution is proposed. First, the "middle of the two shoulders" joint (black dot in Figure 7) is selected as the reference joint. The pairwise relative positions between the four descriptive joints and the reference joint are then computed for the body posture description as

J_{sr} = J_s - J_r,   (5)

where J_s is the descriptive joint and J_r is the reference joint. With this processing, J_{sr} is less sensitive to the human–Kinect relative position. It is mainly determined by the body posture. The "middle of the two shoulders" was chosen as the reference joint because it can be robustly detected and tracked in most cases. Moreover, it is rarely occluded by the limbs or the objects during gestures in the GUHRI system. Finally, an upper body posture feature vector F_body of 12 dimensions is constructed by combining the four pairwise relative positions as

F_{body} = (J_{1r}, J_{2r}, J_{3r}, J_{4r}),   (6)

where J_{1r}, J_{2r}, J_{3r}, and J_{4r} are the pairwise relative positions.

4.2.3 Feature Fusion. From the CyberGlove II and Kinect, two multimodal feature vectors, F_hand and F_body, are extracted to describe the hand posture and upper body posture, respectively. To fully understand the upper body gestures, the joint information of the two feature vectors is required. Both of them are essential for the recognition task. However, the two feature vectors are located in different value ranges. Simply combining them as the input for the classifier will yield a performance bias toward the feature vector of low values. To overcome this difficulty, we scale them into similar ranges before feature fusion.


Supposing F_i is one dimension of F_hand or F_body, and F_i^max and F_i^min are the corresponding maximum and minimum values in the training set, then F_i can be normalized as

\hat{F}_i = \frac{F_i - F_i^{min}}{F_i^{max} - F_i^{min}},   (7)

for both training and test. After normalization, the effectiveness of the two feature vectors for gesture recognition will be balanced. Finally, they are fused to generate an integral feature vector by concatenation as

F = (\hat{F}_{hand}, \hat{F}_{body}).   (8)

This process results in a 32-dimensional feature vector F used for upper body gesture recognition.

4.3 Classification Method

Using F as the input feature, the upper body gestures will be recognized by template matching based on the energy-based LMNN classifier proposed in Weinberger and Saul (2009).4 It is derived from the energy-based model (Chopra, Hadsell, & LeCun, 2005) and the LMNN distance metric learning method (Weinberger & Saul, 2009). The LMNN distance metric is the key to constructing this classifier. The LMNN distance metric learning approach is proposed to seek the best distance measure for the k-nearest neighbor (KNN) classification rule (Cover & Hart, 1967). As one of the oldest methods for pattern recognition, the KNN classifier is very simple to implement and use. Nevertheless, it can still yield competitive results in certain domains such as object recognition and shape matching (Belongies, Malik, & Puzicha, 2002). It has also been applied to action recognition (Müller & Röder, 2006).

The KNN rule classifies each testing sample by majority label voting among its k nearest training samples. Its performance crucially depends on how to compute the distances between different samples for the k nearest neighbors search. Euclidean distance is the most widely used distance measure. However, it ignores any statistical regularities that may be estimated from the training set. Ideally, the distance measure should be adjusted according to the specific task being solved. To achieve better classification performance, the LMNN distance metric learning method is proposed to mine the best distance measure for the KNN classification.

Figure 8. Illustration of the LMNN distance metric learning.

Let {(x_i, y_i)}_{i=1}^{n} be a training set of n labeled samples with inputs x_i ∈ R^d and class labels y_i. The main goal of LMNN distance metric learning is to learn a linear transformation L : R^d → R^d that is used to compute the square sample distances as

D(x_i, x_j) = \| L(x_i - x_j) \|^2.   (9)

Using D(x_i, x_j) as the distance measure tends to optimize the KNN classification by making each input x_i have k nearest neighbors that share the same class label y_i to the greatest possibility. Figure 8 gives an intuitive illustration of LMNN distance metric learning. Compared with Euclidean distance, LMNN distance tries to pull the nearest neighbors of class y_i closer to x_i, meanwhile pushing the neighbors from different classes away. Under the assumption that the training set and the test set keep a similar feature distribution, LMNN distance metric learning can help to improve the KNN classification result.

The energy-based LMNN classifier makes use of both the D(x_i, x_j) distance measure and the loss function defined for LMNN distance metric learning. It constructs an energy-based criterion function, and the testing sample is assigned to the class which yields the minimum loss value.

Because the related theory is sophisticated, we do not give a detailed definition of the energy-based LMNN classifier here. Interested readers can turn to Weinberger and Saul (2009) for reference.

4. The source code is available at http://www.cse.wustl.edu/~kilian/code/lmnn/lmnn.html
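The feature computations of Sections 4.2.2 and 4.2.3 and the metric-based nearest-neighbor matching of Section 4.3 can be sketched together as follows; the joint names and the plain k-NN voting (standing in for the full energy-based decision rule) are simplifications for illustration only.

```python
import numpy as np

DESCRIPTIVE = ("head", "right_shoulder", "right_elbow", "right_hand")
REFERENCE = "shoulder_center"          # the "middle of the two shoulders" joint

def body_feature(joints):
    """F_body (Eq. 6): the four descriptive joints expressed relative to the
    reference joint (Eq. 5). `joints` maps a joint name to its (x, y, z)."""
    ref = np.asarray(joints[REFERENCE], dtype=float)
    rel = [np.asarray(joints[name], dtype=float) - ref for name in DESCRIPTIVE]
    return np.concatenate(rel)                        # 12-dimensional

def normalize(F, f_min, f_max):
    """Dimension-wise min-max scaling of Eq. 7 (bounds come from training data)."""
    return (F - f_min) / np.maximum(f_max - f_min, 1e-8)

def fuse(f_hand, f_body, f_min, f_max):
    """Eq. 8: concatenate the normalized hand and body features (32-D)."""
    return normalize(np.concatenate([f_hand, f_body]), f_min, f_max)

def knn_predict(x, X_train, y_train, L, k=3):
    """k-NN voting under the learned metric of Eq. 9, D = ||L (x_i - x_j)||^2.
    Plain majority voting stands in for the energy-based decision rule."""
    diffs = (X_train - x) @ L.T
    dist = np.einsum("ij,ij->i", diffs, diffs)
    labels, counts = np.unique(y_train[np.argsort(dist)[:k]], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: 32-D fused features, 12 gesture classes, identity transform L.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 32)), rng.integers(0, 12, size=60)
print(knn_predict(X_train[0], X_train, y_train, L=np.eye(32)))
```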


Figure 9. Lip synchronization for part of the sentence “I am Nadine.”

The next section describes the robot control mechanism that makes the robot react to human gestures.

5 Robot Motion Planning and Control

The Nadine robot (see Figure 1) is a realistic human-size robot developed by Kokoro Company, Ltd.5 It has 27 degrees of freedom (DOF) and uses pneumatic motors to display natural-looking movements. Motion planning and control are always an important issue for robots (Miyashita & Ishiguro, 2004) and are becoming a necessary and promising research area (Pierris & Lagoudakis, 2009; Takahashi, Kimura, Maeda, & Nakamura, 2012). They allow the synchronization of animations (predefined and online animations), speech, and gaze. The following sections describe the core functionalities of the Nadine robot controller. This includes lip synchronization, synchronization of predefined gestures and facial expressions, and online motion generation.

5.1 Lip Synchronization

Lip synchronization is part of the core function of the Nadine robot controller. It ensures that the Nadine robot looks natural when talking. However, implementing this on a robot such as the Nadine is a challenging task. On the one hand, the Nadine robot is physically realistic, raising users' expectations; on the other hand, the robot has strong limitations in terms of the range and speed of movements that it can achieve. The Cerevoice text-to-speech library6 is used to extract the phonemes as well as to synthesize speech. Figure 9 illustrates the process for the beginning of the sentence "I am Nadine." First, the following phonemes are extracted: "sil," "ay," "ax," "m," "n," "ey," "d," "iy," "n," along with their durations. Due to the Nadine robot's velocity limits, it is not possible to generate lip movements for all the phonemes. This is why, to maintain the synchronization, any phonemes that last less than 0.1 s are ignored and the duration of the next phoneme is extended by the same amount. In the Figure 9 example, "ax" is removed and "m" is extended, "n" is removed and "ey" is extended, and "d" is removed and "iy" is extended. The phonemes are then mapped to visemes that were designed by a professional animator. Figure 9 shows examples of two visemes (Frames 3 and 10). The transitions between phonemes are done using cosine interpolation (see Figure 9, Frames 4 to 9). Moreover, the robot cannot display an "O" mouth movement along with a "Smile." Therefore, if a forbidden transition is needed, a closing mouth movement is generated prior to displaying the next viseme. The synchronization is done so that the predefined viseme position is reached at the end of each phoneme.
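The timing rule just described (drop phonemes shorter than 0.1 s and give their duration to the following phoneme) can be sketched as follows; the list-of-pairs data format and the example durations are invented for illustration, only the phoneme symbols come from the "I am Nadine" example above.

```python
MIN_PHONEME_DURATION = 0.1   # seconds; shorter phonemes cannot be articulated

def filter_phonemes(phonemes):
    """Merge too-short phonemes into the next one so the total duration is
    preserved, mirroring the lip-synchronization rule described in Section 5.1.
    `phonemes` is a list of (symbol, duration_in_seconds) pairs."""
    kept, carry = [], 0.0
    for symbol, duration in phonemes:
        duration += carry
        carry = 0.0
        if duration < MIN_PHONEME_DURATION:
            carry = duration              # drop it, extend the next phoneme
        else:
            kept.append((symbol, duration))
    if carry and kept:                    # trailing remainder extends the last one
        kept[-1] = (kept[-1][0], kept[-1][1] + carry)
    return kept

# "I am Nadine" with synthetic durations: "ax", "n", and "d" are absorbed.
print(filter_phonemes([("sil", 0.20), ("ay", 0.15), ("ax", 0.05), ("m", 0.12),
                       ("n", 0.06), ("ey", 0.20), ("d", 0.08), ("iy", 0.15),
                       ("n", 0.12)]))
```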

5. http://www.kokoro-dreams.co.jp/english/
6. http://www.cereproc.com/en/products/sdk


Figure 10. Examples of body movements and facial expressions from the library of gestures.

5.2 Library of Gestures and Idle Behaviors

In addition to the lip-synch animation generator, a professional animator designed predefined animations for the Nadine robot. These are used to display iconic gestures such as waving hello. The predefined gestures also include facial and bodily emotional expressions (Figure 10). Providing they do not require the same joints, they are dynamically combined to create a richer display. We are also designing a generator for idle behaviors to avoid having the robot remain completely static, as this would look unnatural. To do so, we plan to use the well-established Perlin noise to generate movements, as this can be modulated to express emotions (Beck, Hiolle, & Cañamero, 2013).

5.3 Online Motion Generation

The kinematic model of the Nadine robot is a dual-arm kinematic model. It includes two parts, that is, the forward kinematic model and the inverse kinematic model. The forward kinematic model outputs the end-effector (hand) trajectories of the robot if the joint vector of the dual arms is given, and the inverse kinematic model outputs the joint vector of the dual arms if the end-effector (hand) path is known. Mathematically, given joint-space vector θ(t) ∈ R^n, the end-effector position/orientation vector r(t) ∈ R^m can be formulated as the following forward kinematic equation:

r(t) = f(\theta(t)),   (10)

where f(·) is a smooth nonlinear function, which can be obtained if the structure and parameters of a robot are known; n is the dimension of the joint space; m is the dimension of the end-effector Cartesian space. Conversely, given the end-effector position/orientation vector r(t) ∈ R^m, the joint-space vector θ(t) ∈ R^n can be denoted by

\theta(t) = f^{-1}(r(t)),   (11)

where f^{-1}(·) is the inverse function of f(·) in Equation 10. For a redundant arm system, that is, n > m, the difficulty is that the inverse kinematics Equation 11 is usually nonlinear and underdetermined, and is difficult (even impossible) to solve. The key of online motion generation is how to solve the inverse kinematics problem.

Figure 11. System chart of online motion generation with handshake as an example.

In our setting (Figure 11), the robot is expected to generate social gestures and motions dynamically according to the situation. For instance, a handshake is commonly used as a greeting at the beginning and end of an interaction. Moreover, this allows the robot to communicate through touch, which is common in human–human interaction. This kind of gesture cannot be included in the predefined library, as the gesture needs to be adapted to the current user's position on the fly. In order to generate motion dynamically, the forward kinematic equations of the dual arms are first built. Then, they are integrated into a quadratic programming formulation. After that, we use a simplified recurrent neural network to solve such a quadratic program. Specifically, when the robot recognizes that the user wants to shake hands, the robot stretches out its hand to shake hands with the user.

The forward kinematics model considers the robot's arms. Each arm has 7 DOF. Given the left arm end-effector position vector p_endL ∈ R^m, the right arm end-effector position vector p_endR ∈ R^m, and their corresponding homogeneous representations r_L, r_R ∈ R^{m+1} (with superscript T denoting the transpose of a vector or a matrix), we can obtain the homogeneous representations r_L and r_R from the following chain formulas, respectively:

r_L(t) = f_L(\theta_L) = {}^{0}_{1}T \cdot {}^{1}_{2}T \cdot {}^{2}_{3}T \cdot {}^{3}_{4}T \cdot {}^{4}_{5}T \cdot {}^{5}_{6}T \cdot {}^{6}_{7}T \cdot p_{endL},   (12)

r_R(t) = f_R(\theta_R) = {}^{0}_{8}T \cdot {}^{8}_{9}T \cdot {}^{9}_{10}T \cdot {}^{10}_{11}T \cdot {}^{11}_{12}T \cdot {}^{12}_{13}T \cdot {}^{13}_{14}T \cdot p_{endR},   (13)

where {}^{i}_{i+1}T with i = 0, 1, ..., 14 denote the homogeneous transform matrixes. In this paper, n = 7 and m = 3. Inspired by the work on a one-arm redundant system (Z. Zhang & Y. Zhang, 2013c), we attempt to build a model based on quadratic programming as shown below:

minimize   \dot{\vartheta}^T(t) M \dot{\vartheta}(t)/2   (14)
subject to   j(\vartheta)\dot{\vartheta}(t) = \dot{\Upsilon}(t),   (15)
             \vartheta^-(t) \le \vartheta(t) \le \vartheta^+(t),   (16)
             \dot{\vartheta}^-(t) \le \dot{\vartheta}(t) \le \dot{\vartheta}^+(t),   (17)

where ϑ(t) = [θ_L^T, θ_R^T]^T ∈ R^{2n}; ϑ^-(t) = [θ_L^{-T}, θ_R^{-T}]^T ∈ R^{2n}; ϑ^+(t) = [θ_L^{+T}, θ_R^{+T}]^T ∈ R^{2n}; ϑ̇(t) = dϑ/dt = [θ̇_L^T, θ̇_R^T]^T ∈ R^{2n}; ϑ̇^-(t) = [θ̇_L^{-T}, θ̇_R^{-T}]^T ∈ R^{2n}; ϑ̇^+(t) = [θ̇_L^{+T}, θ̇_R^{+T}]^T ∈ R^{2n}; and Υ̇(t) = [ṙ_L^T, ṙ_R^T]^T ∈ R^{2m}. Matrix j is composed of the Jacobian matrixes J_L and J_R, and M ∈ R^{2n×2n} is an identity matrix. Specifically,

j = \begin{bmatrix} J_L & 0_{m \times n} \\ 0_{m \times n} & J_R \end{bmatrix} \in R^{2m \times 2n}.

For the sake of calculation, the QP-based coordinated dual-arm scheme can be formulated as the following expression constrained by an equality and an inequality:

minimize   \|\dot{\vartheta}(t)\|^2/2   (18)
subject to   J(\vartheta)\dot{\vartheta}(t) = \dot{\Upsilon}(t),   (19)
             \xi^-(t) \le \dot{\vartheta}(t) \le \xi^+(t),   (20)

where \|\cdot\| denotes the two-norm of a vector or a matrix. Equation 18 is the simplification of Equation 14. Equation 20 is transformed from Equations 16 and 17. In Equation 20, the ith components of ξ^-(t) and ξ^+(t) are ξ_i^-(t) = max{ϑ̇_i^-, ν(ϑ_i^-(t) - ϑ_i)} and ξ_i^+(t) = min{ϑ̇_i^+, ν(ϑ_i^+(t) - ϑ_i)}, with ν = 2 being used to scale the feasible region of ϑ̇. In addition, ϑ^- = [ϑ_L^{-T}, ϑ_R^{-T}]^T ∈ R^{2n}; ϑ^+ = [ϑ_L^{+T}, ϑ_R^{+T}]^T ∈ R^{2n}; ϑ̇^- = [ϑ̇_L^{-T}, ϑ̇_R^{-T}]^T ∈ R^{2n}; ϑ̇^+ = [ϑ̇_L^{+T}, ϑ̇_R^{+T}]^T ∈ R^{2n}.

In the subsequent experiments, the physical limits are as follows:

ϑ_L^+ = [π/20, π/10, π/8, π/2, 0, 2π/3, π/2]^T,
ϑ_L^- = [0, -3π/10, -7π/120, 0, -131π/180, 0, π/9]^T,
ϑ_R^+ = [0, 3π/10, 7π/120, π, 0, 2π/3, -π/2]^T, and
ϑ_R^- = [-π/20, -π/10, -π/8, π/2, -131π/180, 0, -8π/9]^T.
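A minimal numerical sketch of this QP-based velocity scheme is given below; it Euler-integrates the box-projection ("simplified recurrent neural network") dynamics described in the following paragraphs (Equations 21–23). The toy Jacobian, bounds, step size, and iteration count are arbitrary choices, not the values used on the Nadine robot.

```python
import numpy as np

def project(u, lower, upper):
    """Box projection Phi_Omega of Equation 22."""
    return np.clip(u, lower, upper)

def qp_velocity_ik(J, r_dot, xi_lo, xi_hi, gamma=1.0, dt=0.05, steps=20000):
    """Solve  minimize ||theta_dot||^2 / 2
              subject to J theta_dot = r_dot,  xi_lo <= theta_dot <= xi_hi
    by Euler-integrating  u_dot = gamma * (Phi_Omega(u - (Gamma u + q)) - u),
    with u = [theta_dot; dual], Gamma = [[I, -J^T], [J, 0]], q = [0; -r_dot].
    This mirrors Equations 18-23 in spirit but is only an illustrative solver."""
    m, n = J.shape
    omega = 1e10                                     # large bound on the dual part
    Gamma = np.block([[np.eye(n), -J.T],
                      [J,          np.zeros((m, m))]])
    q = np.concatenate([np.zeros(n), -r_dot])
    lower = np.concatenate([xi_lo, -omega * np.ones(m)])
    upper = np.concatenate([xi_hi,  omega * np.ones(m)])
    u = np.zeros(n + m)
    for _ in range(steps):
        u += dt * gamma * (project(u - (Gamma @ u + q), lower, upper) - u)
    return u[:n]                                     # joint velocities theta_dot

# Toy redundant "arm": 3 joints, 2-D end-effector task (values are arbitrary).
J = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
r_dot = np.array([0.05, -0.02])
theta_dot = qp_velocity_ik(J, r_dot,
                           xi_lo=-0.5 * np.ones(3), xi_hi=0.5 * np.ones(3))
print(theta_dot)            # feasible minimum-norm joint velocities
print(J @ theta_dot)        # should be close to r_dot once the dynamics settle
```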


Firstly, according to Z. Zhang and Y. Zhang (2013c), Equations 18–20 can be converted to a linear variational inequality; that is, to find a solution vector u^* ∈ Ω such that

(u - u^*)^T (\Gamma u^* + q) \ge 0,  \forall u \in \Omega.   (21)

Secondly, Equation 21 is equivalent to the following system of piecewise-linear equations (Z. Zhang & Y. Zhang, 2013c):

\Phi_\Omega(u - (\Gamma u + q)) - u = 0,   (22)

where Φ_Ω(·) : R^{2n+2m} → Ω is a projection operator, that is,

[\Phi_\Omega(u)]_i = \begin{cases} u_i^-, & \text{if } u_i < u_i^-, \\ u_i, & \text{if } u_i^- \le u_i \le u_i^+, \\ u_i^+, & \text{if } u_i > u_i^+, \end{cases} \quad \forall i \in \{1, 2, \ldots, 2n+2m\}.

In addition, Ω = {u ∈ R^{2n+2m} | u^- ≤ u ≤ u^+} ⊂ R^{2n+2m}; u ∈ R^{2n+2m} is the primal-dual decision vector; u^- and u^+ are the lower and upper bounds of u, respectively; and ω is usually set to a sufficiently large value (e.g., in the simulations and experiments afterward, ω := 10^{10}). Specifically,

u = \begin{bmatrix} \dot{\vartheta}(t) \\ \iota \end{bmatrix} \in R^{2n+2m}, \quad u^+ = \begin{bmatrix} \xi^+(t) \\ \omega 1_v \end{bmatrix} \in R^{2n+2m}, \quad u^- = \begin{bmatrix} \xi^-(t) \\ -\omega 1_v \end{bmatrix} \in R^{2n+2m},

\Gamma = \begin{bmatrix} M & -j^T \\ j & 0 \end{bmatrix} \in R^{(2n+2m) \times (2n+2m)}, \quad q = \begin{bmatrix} 0 \\ -\dot{\Upsilon} \end{bmatrix} \in R^{2n+2m}, \quad 1_v := [1, \cdots, 1]^T.

Thirdly, being guided by dynamic-system-solver design experience (Y. Zhang, Tan, Yang, Lv, & Chen, 2008; Zhang, Wu, Zhang, Xiao, & Guo, 2013; Z. Zhang & Y. Zhang, 2013a, 2013b), we can adopt the following neural dynamics (the simplified recurrent neural network; Y. Zhang et al., 2008) to solve Equation 22:

\dot{u} = \gamma (\Phi_\Omega(u - (\Gamma u + q)) - u),   (23)

where γ is a positive design parameter used to scale the convergence rate of the neural network. The lemma proposed in Y. Zhang et al. (2008) guarantees the convergence of the neural network formulated by Equation 23 (proof omitted due to space limitations).

Lemma: Assume that the optimal solution ϑ̇* to the strictly-convex QP problem formulated by Equations 18–20 exists. Being the first 2n elements of the state u(t), the output ϑ̇(t) of the simplified recurrent neural network in Equation 23 is globally exponentially convergent to ϑ̇*. In addition, the exponential convergence rate is proportional to the product of γ and the minimum eigenvalue of M (Y. Zhang et al., 2008).

6 Human–Robot Interaction

As a case study for the GUHRI system, a scenario was defined in which a user and the robot interact in a classroom. The robot is the lecturer, and the user is the student. The robot is female in appearance, and named "Nadine." Nadine can understand the 12 upper body gestures described in Section 4.1 and react to the user's gestures accordingly. In our system, Nadine is human-like and capable of reacting by combining body movement, facial expression, and verbal language (see Section 5). In this way, Nadine's reactions provide the user with vivid feedback. Figure 10 shows some examples of Nadine's body movements along with corresponding facial expressions. Nonverbal behaviors can help to structure the processing of verbal information as well as giving affective feedback during the interaction (Cañamero & Fredslund, 2001; Krämer, Tietz, & Bente, 2003). Thus, body movements and facial expressions are expected to enhance the quality of the interaction with Nadine.

In this scenario, Nadine's behaviors are triggered by the user's body language. Nadine's reactions are consistent with the defined scenario (see Table 1). It should be noted that because it is difficult to fully describe the robot's body actions, the robot's movements and emotional display are described at a high level. All the 12 upper body gestures are involved.


Table 1. The Scenario for Human–Robot Interaction

Human gesture      Nadine's response (nonverbal)   Nadine's response (verbal)
"Be confident"     Happy                           It is great to see you so confident.
"Have question"    Moderate                        What is your question?
"Object"           Sad                             Why do you disagree?
"Praise"           Happy                           Thank you for your praise.
"Stop"             Moderate                        Why do you stop me?
"Succeed"          Happy                           Well done. You are successful.
"Shake hand"       Happy                           Nice to meet you.
"Weakly agree"     Head nod                        OK, we finally reach an agreement.
"Call"             Head shake                      Please turn off your phone.
"Drink"            Moderate                        You can have a drink. No problem.
"Read"             Moderate                        Please, take your time and read it carefully.
"Write"            Moderate                        If you need time for taking notes, I can slow my presentation.
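Inside the robot control module, a mapping like Table 1 could be stored as a simple lookup structure; the sketch below only illustrates that idea and is not the actual GUHRI implementation.

```python
# Nadine's scripted responses from Table 1: gesture -> (nonverbal, verbal).
RESPONSES = {
    "be confident": ("happy", "It is great to see you so confident."),
    "have question": ("moderate", "What is your question?"),
    "object": ("sad", "Why do you disagree?"),
    "praise": ("happy", "Thank you for your praise."),
    "stop": ("moderate", "Why do you stop me?"),
    "succeed": ("happy", "Well done. You are successful."),
    "shake hand": ("happy", "Nice to meet you."),
    "weakly agree": ("head nod", "OK, we finally reach an agreement."),
    "call": ("head shake", "Please turn off your phone."),
    "drink": ("moderate", "You can have a drink. No problem."),
    "read": ("moderate", "Please, take your time and read it carefully."),
    "write": ("moderate", "If you need time for taking notes, I can slow my presentation."),
}

def react(gesture):
    """Return the (nonverbal, verbal) response for a recognized gesture."""
    return RESPONSES.get(gesture.lower())

print(react("Shake hand"))    # ('happy', 'Nice to meet you.')
```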

The GUHRI system can also handle unexpected situations during the interaction. For example, Nadine can react appropriately even if the user suddenly answers an incoming phone call.

7 Experiment and Discussion

The experiment is composed of two main parts. First, our upper body gesture recognition method is tested. Then, the effectiveness of the proposed online motion generation approach is verified by executing a handshake between the user and the robot.

7.1 Upper Body Gesture Recognition Test

A human upper body gesture data set was built to test the proposed gesture recognition method. This data set involves all the 12 upper body gestures mentioned in Section 4.1. The samples are captured from 25 volunteers of different genders, body sizes, and countries of origin. During the sample collection, no strict constraint was imposed. The subjects carried out the gestures according to their own habits. The user–Kinect relative position was also not strictly limited. For convenience, the CyberGlove II was pre-calibrated for all the people with a standard calibration. Due to the data set collection setup, a large divergence may exist among the gesture samples from different people. This will yield challenges for body gesture recognition. Figure 12 exhibits parts of the Category 1 and Category 2 gesture samples ("have question," "succeed," "call," and "drink") captured from five people for comparison. For the sake of brevity, not all the gestures are shown. The five descriptive and reference skeletal joints proposed in Section 4.2.2 are marked as the big dots in Figure 12. These joints are connected by straight segments to outline the upper body posture intuitively. From the exhibited samples, we make the following observations.

• For different people, the listed body gestures can indeed be differentiated from the hand and upper body posture information. Further, the people execute the gestures differently to some degree. As mentioned previously, this phenomenon leads to challenges for upper body gesture recognition.


Figure 12. Some gesture samples captured from different volunteers. These people are of different genders, body sizes, and countries of origin. They executed the gestures according to their own habits.

• For different people and gestures, the five skeletal joints employed for gesture recognition can be tracked robustly, even when human–object interaction occurs. Generally, the resulting positions are accurate for gesture recognition. Meanwhile, the CyberGlove II is a human-touch device that can capture hand posture robustly and yield high-accuracy data. Thus, the proposed human gesture-capture sensor solution can stably acquire usable data for gesture recognition.

For each gesture, one key snapshot is picked up to build the data set among all the 25 people. As a consequence, the resulting data set contains 25 × 12 = 300 gesture samples in all. During the experiment, the samples are randomly split into the training and testing set for five trials, and the average classification accuracy and standard deviation are reported.

The KNN classifier is employed as the baseline to make comparison with the energy-based LMNN classifier. They are compared in terms of both classification accuracy and time consumption. The KNN classifier will run with different kinds of distance measures. Following the experimental setup in Weinberger and Saul (2009), k is set as 3 in all cases.


Table 2. Classification Result (in percent) on the Constructed Upper Body Gesture Data Set; Best Performance is Shown in Boldface; Standard Deviations are in Parentheses

                    Training sample number per class
Classifiers         4              6              8              10             12             14
KNN (Euclidean)     86.51 (±2.89)  89.56 (±2.43)  91.47 (±2.21)  93.00 (±1.15)  93.33 (±0.73)  92.27 (±2.30)
KNN (PCA)           73.81 (±4.78)  84.04 (±5.41)  79.31 (±3.09)  86.11 (±2.75)  88.21 (±4.17)  86.67 (±3.70)
KNN (LDA)           79.68 (±5.33)  90.44 (±2.56)  92.35 (±2.44)  92.44 (±1.34)  94.74 (±0.84)  93.48 (±1.38)
KNN (LMNN)          86.67 (±2.18)  90.35 (±2.56)  92.16 (±1.80)  93.67 (±1.34)  93.85 (±1.48)  93.48 (±1.15)
Energy (LMNN)       90.00 (±3.40)  92.28 (±0.85)  94.31 (±2.04)  95.22 (±1.50)  95.64 (±1.39)  96.52 (±2.37)
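The evaluation protocol behind Table 2 (five random splits per training-set size, class-balanced training sets, mean accuracy and standard deviation) can be sketched as follows; the pluggable `classify` callback and the 1-NN stand-in used in the toy example are assumptions, not the authors' code.

```python
import numpy as np

def evaluate(features, labels, n_train_per_class, classify, n_trials=5, seed=0):
    """Randomly split the samples into training/testing sets n_trials times and
    report mean accuracy and standard deviation (in percent), as in Table 2.
    `classify(X_train, y_train, X_test)` returns predicted labels."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_trials):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            train_idx.extend(idx[:n_train_per_class])
            test_idx.extend(idx[n_train_per_class:])
        pred = classify(features[train_idx], labels[train_idx], features[test_idx])
        accuracies.append(np.mean(pred == labels[test_idx]))
    return np.mean(accuracies) * 100, np.std(accuracies) * 100

# Toy usage with a 1-nearest-neighbor stand-in classifier and random data
# (300 samples = 12 classes x 25 subjects, 32-D fused features).
def nn_classify(X_tr, y_tr, X_te):
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[np.argmin(d, axis=1)]

X = np.random.default_rng(1).normal(size=(300, 32))
y = np.repeat(np.arange(12), 25)
print(evaluate(X, y, n_train_per_class=4, classify=nn_classify))
```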

Because the training sample number is a crucial factor that affects the classification accuracy, the results of the classifiers are compared corresponding to different amounts of training samples. For each class, the training sample number is increased from 4 to 14 with a step size of 2. Two other well-known distance metric learning methods, PCA (Jolliffe, 1986) and LDA (Fisher, 1936), are employed for comparison with the LMNN distance metric learning approach. For PCA, the first 10 eigenvectors are used to capture roughly 90% of the sum of eigenvalues. Further, the first six eigenvectors are employed for LDA. The distance measures yielded by PCA and LDA are applied to the KNN classifier.

Table 2 lists the classification results yielded by the different classifier and distance measure combinations. It can be observed that:

• The 12 upper body gestures in the data set can be well recognized by the proposed gesture recognition method. More than 95.00% classification accuracy can be achieved if enough training samples are used. With the increase of the training sample amount, performance is generally enhanced.
• Corresponding to all the training sample numbers, the energy-based LMNN classifier yields the highest classification accuracy. Even with a small number of training samples (such as four), it can still achieve relatively good performance (90.00%). When the training sample number reaches 14, the classification accuracy (96.52%) is nearly satisfactory for practical use. Further, its standard deviations are relatively low in most cases, which means that the energy-based LMNN classifier is robust to the gesture diversities among people.
• The KNN classifier can also yield good results on this data set. However, it is inferior to the energy-based LMNN classifier. Compared to Euclidean distance, the LMNN distance metric learning method can improve the performance of the KNN classifier consistently in most cases. However, it works much better on the energy-based model.
• PCA does not work well on this data set. Its performance is even worse than the basic Euclidean distance. The reason may be that PCA needs a large number of training samples to obtain satisfactory distance measures (Osborne & Costello, 2004). That is a limitation for practical applications.
• LDA also achieves good performance for upper body gesture recognition. However, it is still consistently inferior to energy-based LMNN, especially when the training sample number is small. For example, when the training sample number is only four, energy-based LMNN's accuracy (90.00%) is significantly better than that of LDA (79.68%) by a large margin (10.32%).

Apart from the classification accuracy, we are also concerned about the testing time consumption. The reason is that the GUHRI system should run in real time for a good HRI experience. According to the classification results in Table 2, the energy-based LMNN classifier, the LMNN KNN classifier, and the LDA KNN classifier are the three strongest ones for gesture recognition. Here, a comparison of their testing time is also made.


Table 3. Average Testing Time Consumption (ms) per Sample*

                    Training sample number per class
Classifiers         4        6        8        10       12       14
KNN (LDA)           0.0317   0.0398   0.0414   0.0469   0.0498   0.0649
KNN (LMNN)          0.0239   0.0282   0.0328   0.0342   0.0418   0.0525
Energy (LMNN)       0.0959   0.1074   0.1273   0.1359   0.1610   0.1943

*The program is running on a computer with an Intel Core i5-2430M at 2.4 GHz (only using one core).

Figure 13. Snapshots of shaking hands between the human and the Nadine robot.

Table 3 lists the average run time per testing sample of the three classifiers, corresponding to different amounts of training samples. We can see that all three classifiers are extremely fast under our experimental conditions. Further, the time consumption mainly depends on the number of training samples. Note that the LDA KNN classifier and the LMNN KNN classifier are much faster than the energy-based LMNN classifier. If a huge number of training samples is used (such as tens of thousands), the LDA KNN classifier and the LMNN KNN classifier will be the better choice to achieve a balance between classification accuracy and computational efficiency.

7.2 The Motion Generation and Control Effectiveness


within the case study described in Section 6. At the and efficiency of our gesture recognition method and beginning of the interaction, a user can shake hands motion planning approach. with the Nadine robot, as shown in Figure 13. From So far, the gestures involved in the GUHRI system the figure, we can see that when the user is far from are static ones, for example, “have question,” “praise,” the Nadine robot, the robot knows that it is too far “call,” and “drink.” In future work, we plan to enable to shake hands, and will ask the user to come closer. the robot to understand the dynamic gestures, such as When the user comes closer and gives a handshake “ hand,” “type keyboard,” and “clap.” Speech recog- gesture, the robot will recognize it, and stretch out a nition can be further added to make the interaction hand to shake hands with the user while saying “nice more natural. to meet you.” After the handshake, the Nadine robot withdraws the hand to its initial position and is ready for the next movement. From Figure 13, we can see Acknowledgments the Nadine robot is able to shake hands with people well. This situation demonstrates that the proposed This work, partially carried out at BeingThere Center, QP-based online motion generation approach is effec- is supported by IMI Seed Grant M4081078.B40, and Singapore National Research Foundation under its Interna- tive, applicable, and well integrated within the GUHRI tional Research Centre @ Singapore Funding Initiative and system. administered by the IDM Programme Office. This work is an extended and improved version of Xiao, Y., Yuan, J., and Thalmann, D. (2013), Human–virtual human 8 Conclusions interaction by upper body gesture understanding. Proceedings of the 19th ACM Symposium on Virtual Reality Software and The GUHRI system, a novel body gesture under- Technology, VRST 2013, 133–142. standing and human–robot interaction system, is proposed in this paper. A set of 12 human upper body gestures with and without human–object interac- References tions can be understood by the robot. Meanwhile, the robot can express itself by using a combination of body Adamo-Villani, N., Heisler, J., & Arns, L. (2007). Two ges- ture recognition systems for immersive math education of movements, facial expressions, and verbal language the deaf. Proceedings of the First International Conference on simultaneously, aiming to provide a vivid experience to Immersive Telecommunications, ICIT 2007,9. the users. Beck, A., Cañamero, L., Hiolle, A., Damiano, L., Cosi, P., A new combination of sensors is proposed. That is, Tesser, F., & Sommavilla, G. (2013). Interpretation of emo- the CyberGlove II and Kinect are combined to cap- tional body language displayed by a humanoid robot: A case ture the head, arm, and hand posture simultaneously. study with children. International Journal of Social Robotics, An effective and real-time gesture recognition method is 5(3), 325–334. also proposed. For robot control, a novel online motion Beck, A., Hiolle, A., & Cañamero, L. (2013). Using Per- planning method is proposed. This motion planning lin noise to generate emotional expressions in a robot. In Proceedings of the Annual Meeting of the Cognitive Science method is formulated as a quadratic program style. It is Society, COG SCI 2013, 1845–1850. then solved by a simplified recurrent neural network. Beck, A., Stevens, B., Bard, K. A., & Cañamero, L. (2012). The obtained optimization solutions are finally used Emotional body language displayed by artificial agents. 
So far, the gestures involved in the GUHRI system are static ones, for example, "have question," "praise," "call," and "drink." In future work, we plan to enable the robot to understand dynamic gestures, such as "wave hand," "type keyboard," and "clap." Speech recognition can be further added to make the interaction more natural.

Acknowledgments

This work, partially carried out at the BeingThere Center, is supported by IMI Seed Grant M4081078.B40 and by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative, administered by the IDM Programme Office. This paper is an extended and improved version of Xiao, Y., Yuan, J., & Thalmann, D. (2013). Human–virtual human interaction by upper body gesture understanding. Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, VRST 2013, 133–142.

References

Adamo-Villani, N., Heisler, J., & Arns, L. (2007). Two gesture recognition systems for immersive math education of the deaf. Proceedings of the First International Conference on Immersive Telecommunications, ICIT 2007, 9.

Beck, A., Cañamero, L., Hiolle, A., Damiano, L., Cosi, P., Tesser, F., & Sommavilla, G. (2013). Interpretation of emotional body language displayed by a humanoid robot: A case study with children. International Journal of Social Robotics, 5(3), 325–334.

Beck, A., Hiolle, A., & Cañamero, L. (2013). Using Perlin noise to generate emotional expressions in a robot. In Proceedings of the Annual Meeting of the Cognitive Science Society, COG SCI 2013, 1845–1850.

Beck, A., Stevens, B., Bard, K. A., & Cañamero, L. (2012). Emotional body language displayed by artificial agents. ACM Transactions on Interactive Intelligent Systems, 2(1), 1–29.

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.


Berman, S., & Stern, H. (2012). Sensors for gesture recognition systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(3), 277–290.

Brethes, L., Menezes, P., Lerasle, F., & Hayet, J. (2004). Face tracking and hand gesture recognition for human–robot interaction. Proceedings of the IEEE Conference on Robotics and Automation, ICRA 2004, Vol. 2, 1901–1906.

Cai, B., & Zhang, Y. (2012). Different-level redundancy-resolution and its equivalent relationship analysis for robot manipulators using gradient-descent and Zhang's neural-dynamic methods. IEEE Transactions on Industrial Electronics, 59(8), 3146–3155.

Cañamero, L., & Fredslund, J. (2001). I show you how I like you—Can you read it in my face? IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 31(5), 454–459.

Cassell, J. (2000). Nudge nudge wink wink: Elements of face-to-face conversation for embodied conversational agents. In J. Cassell, J. Sullivan, S. Prevost, & E. Churchill (Eds.), Embodied Conversational Agents (Chap. 1, pp. 1–27).

Chan, T. F., & Dubey, R. (1995). A weighted least-norm solution based scheme for avoiding joint limits for redundant joint manipulators. IEEE Transactions on Robotics and Automation, 11(2), 286–292.

Cheng, F.-T., Chen, T.-H., & Sun, Y.-Y. (1994). Resolving manipulator redundancy under inequality constraints. IEEE Transactions on Robotics and Automation, 10(1), 65–71.

Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2005, Vol. 1, 539–546.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Dautenhahn, K. (2007). Socially intelligent robots: Dimensions of human–robot interaction. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1480), 679–704.

Faber, F., Bennewitz, M., Eppner, C., Gorog, A., Gonsior, C., Joho, D., … Behnke, S. (2009). The humanoid museum tour guide Robotinho. Proceedings of the IEEE Symposium on Robot and Human Interactive Communication, RO-MAN 2009, 891–896.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3), 143–166.

Goodrich, M. A., & Schultz, A. C. (2007). Human–robot interaction: A survey. Foundations and Trends in Human–Computer Interaction, 1(3), 203–275.

Guo, D., & Zhang, Y. (2012). A new inequality-based obstacle-avoidance MVN scheme and its application to redundant robot manipulators. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(6), 1326–1340.

Immersion. (2010). CyberGlove II specifications. Retrieved from http://www.cyberglovesystems.com/products/cyberglove-ii/specifications

Jolliffe, I. T. (1986). Principal component analysis. Berlin: Springer-Verlag.

Kanoun, O., Lamiraux, F., & Wieber, P. B. (2011). Kinematic control of redundant manipulators: Generalizing the task-priority framework to inequality task. IEEE Transactions on Robotics, 27(4), 785–792.

Krämer, N. C., Tietz, B., & Bente, G. (2003). Effects of embodied interface agents and their gestural activity. In Intelligent Virtual Agents. Lecture Notes in Artificial Intelligence, Vol. 2792 (pp. 292–300). Berlin: Springer-Verlag.

Lu, G., Shark, L.-K., Hall, G., & Zeshan, U. (2012). Immersive manipulation of virtual objects through glove-based hand gesture interaction. Virtual Reality, 16(3), 243–252.

Ma, S., & Watanabe, M. (2002). Time-optimal control of kinematically redundant manipulators with limit heat characteristics of actuators. Advanced Robotics, 16(8), 735–749.

Martins, A., Dias, A., & Alsina, P. (2006). Comments on manipulability measure in redundant planar manipulators. In Proceedings of the IEEE Latin American Robotics Symposium, LARS 2006, 169–173.

Mehrabian, A. (1971). Silent messages. Belmont, CA: Wadsworth.

Miyashita, T., & Ishiguro, H. (2004). Human-like natural behavior generation based on involuntary motions for humanoid robots. Robotics and Autonomous Systems, 48(4), 203–212.

Müller, M., & Röder, T. (2006). Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2006, 137–146.


Nickel, K., & Stiefelhagen, R. (2007). Visual recognition of pointing gestures for human–robot interaction. Image and Vision Computing, 25(12), 1875–1884.

Osborne, J. W., & Costello, A. B. (2004). Sample size and subject to item ratio in principal components analysis. Practical Assessment, Research & Evaluation, 9(11). Retrieved from http://pareonline.net/getvn.asp?v=9&n=11

Perzanowski, D., Schultz, A. C., Adams, W., Marsh, E., & Bugajska, M. (2001). Building a multimodal human–robot interface. IEEE Intelligent Systems, 16(1), 16–21.

Pierris, G., & Lagoudakis, M. (2009). An interactive tool for designing complex robot motion patterns. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2009, 4013–4018.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., … Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, 1297–1304.

Smith, L. B., & Breazeal, C. (2007). The dynamic lift of developmental process. Developmental Science, 10(1), 61–68.

Spiliotopoulos, D., Androutsopoulos, I., & Spyropoulos, C. D. (2001). Human–robot interaction based on spoken natural language dialogue. In Proceedings of the European Workshop on Service and Humanoid Robots, 25–27.

Stiefelhagen, R., Ekenel, H. K., Fugen, C., Gieselmann, P., Holzapfel, H., Kraft, F., & Waibel, A. (2007). Enabling multimodal human–robot interaction for the Karlsruhe humanoid robot. IEEE Transactions on Robotics, 23(5), 840–851.

Stiefelhagen, R., Fugen, C., Gieselmann, R., Holzapfel, H., Nickel, K., & Waibel, A. (2004). Natural human–robot interaction using speech, head pose and gestures. In Proceedings of the IEEE Conference on Intelligent Robots and Systems, IROS 2004, Vol. 3, 2422–2427.

Taghirad, H. D., & Nahon, M. (2008). Kinematic analysis of a macro–micro redundantly actuated parallel manipulator. Advanced Robotics, 22(6–7), 657–687.

Takahashi, Y., Kimura, T., Maeda, Y., & Nakamura, T. (2012). Body mapping from human demonstrator to inverted-pendulum mobile robot for learning from observation. Proceedings of the IEEE Conference on Fuzzy Systems, FUZZ-IEEE 2012, 1–6.

Teleb, H., & Chang, G. (2012). Data glove integration with 3D virtual environments. Proceedings of the International Conference on Systems and Informatics, ICSAI 2012, 107–112.

Waldherr, S., Romero, R., & Thrun, S. (2000). A gesture based interface for human–robot interaction. Autonomous Robots, 9(2), 151–173.

Wang, J., & Li, Y. (2009). Inverse kinematics analysis for the arm of a mobile humanoid robot based on the closed-loop algorithm. Proceedings of the International Conference on Information and Automation, ICIA 2009, 516–521.

Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012, 1290–1297.

Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10, 207–244.

Xiao, Y., Cao, Z., & Yuan, J. (2014). Entropic image thresholding based on GLGM histogram. Pattern Recognition Letters, 40, 47–55.

Xiao, Y., Wu, J., & Yuan, J. (2014). mCENTRIST: A multi-channel feature generation mechanism for scene categorization. IEEE Transactions on Image Processing, 23(2), 823–836.

Xiao, Y., Yuan, J., & Thalmann, D. (2013). Human–virtual human interaction by upper body gesture understanding. Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, VRST 2013, 133–142.

Zhang, Y., Tan, Z., Yang, Z., Lv, X., & Chen, K. (2008). A simplified LVI-based primal-dual neural network for repetitive motion planning of PA10 robot manipulator starting from different initial states. Proceedings of the IEEE Joint Conference on Neural Networks, IJCNN 2008, 19–24.

Zhang, Y., Wu, H., Zhang, Z., Xiao, L., & Guo, D. (2013). Acceleration-level repetitive motion planning of redundant planar robots solved by a simplified LVI-based primal-dual neural network. Robotics and Computer-Integrated Manufacturing, 29(2), 328–343.

Zhang, Z., & Zhang, Y. (2012). Acceleration-level cyclic-motion generation of constrained redundant robots tracking different paths. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4), 1257–1269.

Zhang, Z., & Zhang, Y. (2013a). Design and experimentation of acceleration-level drift-free scheme aided by two recurrent neural networks. Control Theory & Applications, IET, 7(1), 25–42.


Zhang, Z., & Zhang, Y. (2013b). Equivalence of different-level schemes for repetitive motion planning of redundant robots. Acta Automatica Sinica, 39(1), 88–91.

Zhang, Z., & Zhang, Y. (2013c). Variable joint-velocity limits of redundant robot manipulators handled by quadratic programming. IEEE/ASME Transactions on Mechatronics, 18(2), 674–686.

9 Appendix

The demonstration video of this paper can be viewed at https://www.dropbox.com/s/eht69b8kyg53vso/hri_demo_mit.mp4.
