Vision-Based Gesture Recognition in Human-Robot Teams Using Synthetic Data
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 25-29, 2020, Las Vegas, NV, USA (Virtual)

Celso M. de Melo1, Brandon Rothrock2, Prudhvi Gurram1, Oytun Ulutan3 and B.S. Manjunath3

1 Celso M. de Melo and Prudhvi Gurram are at the CCDC US Army Research Laboratory, Playa Vista, CA 90094, USA
2 Brandon Rothrock performed this research at the NASA Jet Propulsion Laboratory, Pasadena, CA 91101
3 Oytun Ulutan and B.S. Manjunath are with the Electrical Engineering Department, UC Santa Barbara, Santa Barbara, CA 93106-9560, USA

Abstract— Building successful collaboration between humans and robots requires efficient, effective, and natural communication. Here we study an RGB-based deep learning approach for controlling robots through gestures (e.g., "follow me"). To address the challenge of collecting high-quality annotated data from human subjects, synthetic data is considered for this domain. We contribute a dataset of gestures that includes real videos with human subjects and synthetic videos from our custom simulator. A solution is presented for gesture recognition based on the state-of-the-art I3D model. Comprehensive testing was conducted to optimize the parameters for this model. Finally, to gather insight on the value of synthetic data, several experiments are described that systematically study the properties of synthetic data (e.g., gesture variations, character variety, generalization to new gestures). We discuss practical implications for the design of effective human-robot collaboration and the usefulness of synthetic data for deep learning.

I. INTRODUCTION

Robust perception of humans and their activities will be required for robots to team effectively and efficiently with humans in vast, outdoor, dynamic, and potentially dangerous environments. This kind of human-robot teaming is likely to be relevant across different domains, including space exploration [1], peacekeeping missions [2], and industry [3]. The cognitive demand imposed on humans by the sheer complexity of these environments requires multiple, effective, and natural means of communication with their robotic teammates. Here we focus on one visual communication modality: gestures [4].

Gestures complement other forms of communication - e.g., natural language - and may be necessary when (1) it is desirable to communicate while maintaining a low profile, (2) other forms of digital communication are not available, or (3) distance prevents alternative forms of communication. Since humans are likely to carry a heavy load and need to react quickly to emerging situations, it is important that gesture communication be untethered; thus, an approach based on data gloves is excluded [5]. The operational environment, moreover, is likely to be outdoors, thus excluding the viability of depth sensors - such as the Kinect - which are known to have limitations in these settings. Therefore, in contrast to other work in gesture recognition [6] [7], we pursue a pure vision-based RGB solution to this challenge.

A. Activity & Gesture Recognition

Recognizing human activity is an important problem in computer vision that has been studied for over two decades [8]. It traditionally involves predicting an activity class among a large set of common human activities (e.g., sports actions, daily activities). In contrast, here we are concerned with recognition of a limited set of gestures with a clear symbolic meaning (e.g., "follow me"). Recognizing and understanding gestures requires models that can represent the appearance and motion dynamics of the gestures. This can be accomplished by using (shallow) models that rely on hand-crafted features capturing detailed gesture knowledge, or more general deep models that are data-driven and less reliant on hand-crafted features [9]. Indeed, deep learning has been gaining increasing popularity in the domain of activity recognition [10] [11] [12] [13] [14] [15] [8] [16] [17]. This has been facilitated by the development of better learning algorithms and the exponential growth in available computational power (e.g., powerful GPUs). In this paper, we follow an approach based on a state-of-the-art deep learning model [18].
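Although the recognition model used in the paper is I3D [18], its implementation is not reproduced in this excerpt. The following is a minimal sketch of the general approach - fine-tuning a pre-trained spatio-temporal 3D CNN for a small gesture vocabulary - using torchvision's r3d_18 as a stand-in for I3D; the seven-class label set follows Table I, and the clip shape and batch size are illustrative only.

```python
# Minimal sketch: fine-tune a 3D CNN for a small gesture vocabulary.
# r3d_18 is a stand-in for the paper's I3D model (an assumption);
# torchvision >= 0.13 is assumed for the weights API.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

GESTURES = ["move in reverse", "halt", "attention", "advance",
            "follow me", "rally", "move forward"]  # gesture set per Table I

model = r3d_18(weights=None)  # or weights="KINETICS400_V1" for pre-training
model.fc = nn.Linear(model.fc.in_features, len(GESTURES))

# Video clips enter as (batch, channels, frames, height, width).
clip = torch.randn(2, 3, 16, 112, 112)  # dummy batch of two 16-frame clips
logits = model(clip)                    # -> shape (2, 7)
print(GESTURES[int(logits.argmax(dim=1)[0])])
```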
B. The Data Problem

Deep learning models, however, tend to have low sample efficiency and, thus, their success is contingent on the availability of large amounts of labeled training data. Collecting and annotating a large number of performances from human subjects is costly, time-consuming, error-prone, and comes with potential ethical issues. These problems are only likely to be exacerbated as society increases scrutiny of current data handling practices [19]. For these reasons, researchers have pointed out that, rather than being limited by algorithms or computational power, in certain cases the main bottleneck is the availability of labeled data [20]. To help mitigate this problem, several datasets are available for human activity recognition (for a full review see [21] [8]) from diverse sources, including movies [22], sporting events [23], and daily life [24]. However, a general problem is that these datasets capture a very broad range of activities that is readily available on the web, or are collected in heavily controlled environments. They are, therefore, unlikely to suffice for domains that target more specific types of activity and that favor models optimized for this subset of human activity, rather than general human activity. This is the case for our target domain and, thus, new data must still be collected and annotated. To minimize the cost and problems associated with collecting labeled data from humans, we consider synthetic data.

C. Synthetic Data

Synthetic data may be critical to sustain the deep learning revolution. Synthetic data is less expensive, already comes with error-free ground-truth annotation, and avoids many ethical issues [25]. Synthetic data, generated from simulations in virtual worlds, has been used in several important computer vision problems. In object and scene segmentation, synthetic data has been used to improve performance in object detection [26] [27] [28], object tracking [29] [30], viewpoint estimation [31], and scene understanding [29] [32] [33] [34]. In robot control domains, because of the difficulty and danger of training robots in real environments, synthetic data has often been used to teach robots visual space recognition [35] [36], navigation [37], object detection [38], and grasping control [39] [40] [41]. Synthetic virtual characters have also been used to train depth-based models for static pose estimation [42] [43] [44] [45], gaze estimation [46], and optical flow estimation [47]. Comparatively less work has looked at synthetic videos to recognize human activity, with Ionescu et al. [43] and de Souza et al. [10] being rare exceptions. Their work, though, focused on general activity recognition, rather than recognition of a specific set of gestures as we do here. Much of this research reports a performance bump from augmenting real data with synthetic data (e.g., [10] [29] [44]) or competitive performance while training only on synthetic data (e.g., [31] [27]). Here we study whether synthetic data can also improve performance in our target domain, with a focus on understanding how to optimize the value of synthetic data.

D. Approach & Contributions

We present a new synthetic dataset for a small set of control gestures (e.g., "follow me") generated from a custom simulator developed for our target domain. We also collected a small, by design, dataset of (real) data with human subjects. Since the focus of our [...] important than others. We test the effect of different synthetic data input properties - namely, frame resolution, quality of rendering, and frames per second - on performance. Finally, we present a set of experiments testing whether the model is able to generalize to new gestures when training exclusively on synthetic data (for that gesture); the results show robust performance for new gestures.

In sum, this paper makes the following contributions:
• A novel dataset of synthetic (and real) videos of control gestures that can be used as a benchmark for studying how models can be improved with synthetic data;
• A solution based on the I3D model for recognition of control gestures in human-robot teaming;
• Insight on the value of synthetic data for deep learning models.
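A recurring protocol in these experiments is to train on simulator output and evaluate on videos of human subjects. The helper below is a minimal, hypothetical sketch of that evaluation loop; the GestureClips dataset class and the directory names are placeholders, not the paper's released tooling.

```python
# Minimal sketch of a train-on-synthetic / test-on-real evaluation loop.
# GestureClips and the directory names are hypothetical placeholders.
import torch
from torch.utils.data import DataLoader

def evaluate(model, loader, device="cpu"):
    """Top-1 accuracy of a clip classifier on a labeled video loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in loader:            # clips: (B, 3, T, H, W)
            logits = model(clips.to(device))
            correct += (logits.argmax(1).cpu() == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

# Hypothetical usage:
# train_loader = DataLoader(GestureClips("synthetic/"), batch_size=8)
# test_loader  = DataLoader(GestureClips("real/"), batch_size=8)
# ... fit the model on train_loader, then:
# print("accuracy on real videos:", evaluate(model, test_loader))
```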
II. THE DATA

For effective human-robot interaction, the gestures need to have clear meaning, be easy to interpret, and have intuitive shape and motion profiles. To accomplish this, we selected standard gestures from the US Army Field Manual [48], which describes efficient, effective, and tried-and-tested gestures that are appropriate for various types of operating [...]

TABLE I
PARAMETERS USED TO GENERATE THE SYNTHETIC DATA

Parameter       Range
Characters      Male civilian, male camouflage, female civilian, female camouflage
Skin color      Caucasian, African-American, East Indian
Thickness       Thin, thick
Animations      3 animations per gesture
Repetitions     Based on the distribution in the human data: move in reverse, 2-4; halt, 1; attention, 1-3; advance, 1; follow me, 2-4; rally, 2-4; move forward, 2-4
Speeds          0.75×, 1.0×, 1.25×
Environments    Alpine, barren desert, coastal, rolling hills
Camera angles   0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°
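Table I defines a combinatorial grid. As a rough illustration of how such a grid might be enumerated when scripting a simulator, the sketch below iterates over the tabulated values; render_clip is a hypothetical placeholder, not the simulator's actual API. Ignoring the per-gesture animation variants and repetition sampling, the grid alone yields 4 × 3 × 2 × 3 × 4 × 8 × 7 = 16,128 combinations.

```python
# Minimal sketch: enumerate the synthetic-data generation grid of Table I.
# render_clip is a hypothetical placeholder for the simulator call.
from itertools import product

characters    = ["male civilian", "male camouflage",
                 "female civilian", "female camouflage"]
skin_colors   = ["Caucasian", "African-American", "East Indian"]
thicknesses   = ["thin", "thick"]
speeds        = [0.75, 1.0, 1.25]
environments  = ["alpine", "barren desert", "coastal", "rolling hills"]
camera_angles = list(range(0, 360, 45))  # 0, 45, ..., 315 degrees
gestures      = ["move in reverse", "halt", "attention", "advance",
                 "follow me", "rally", "move forward"]

for character, skin, thick, speed, env, angle, gesture in product(
        characters, skin_colors, thicknesses, speeds,
        environments, camera_angles, gestures):
    # Each combination would drive one labeled synthetic video, e.g.:
    # render_clip(character, skin, thick, speed, env, angle, gesture)
    pass
```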