Accurate, Robust, and Flexible Real-Time Hand Tracking
Toby Sharp† Cem Keskin† Duncan Robertson† Jonathan Taylor† Jamie Shotton† David Kim Christoph Rhemann Ido Leichter Alon Vinnikov Yichen Wei Daniel Freedman Pushmeet Kohli Eyal Krupka Andrew Fitzgibbon⋆ Shahram Izadi⋆
Microsoft Research

† denotes joint first authorship. ⋆ denotes joint last authorship.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI '15, April 18–23, 2015, Seoul, Republic of Korea.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3145-6/15/04...$15.00. http://dx.doi.org/10.1145/2702123.2702179

Figure 1: We present a new system for tracking the detailed motion of a user's hand using only a commodity depth camera. Our system can accurately reconstruct the complex articulated pose of the hand, whilst being robust to tracking failure, and supporting flexible setups such as tracking at large distances and over-the-shoulder camera placement.

ABSTRACT
We present a new real-time hand tracking system based on a single depth camera. The system can accurately reconstruct complex hand poses across a variety of subjects. It also allows for robust tracking, rapidly recovering from any temporary failures. Most uniquely, our tracker is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios. This flexibility opens up new possibilities for human-computer interaction, with examples including tracking at distances from tens of centimeters through to several meters (for controlling the TV at a distance), supporting tracking using a moving depth camera (for mobile scenarios), and arbitrary camera placements (for VR headsets). These features are achieved through a new pipeline that combines a multi-layered discriminative reinitialization strategy for per-frame pose estimation, followed by a generative model-fitting stage. We provide extensive technical details and a detailed qualitative and quantitative analysis.

INTRODUCTION
The human hand is remarkably dextrous, capable of high-bandwidth communication such as typing and sign language. Computer interfaces based on the human hand have so far been limited in their ability to accurately and reliably track the detailed articulated motion of a user's hand in real time. We believe that, if these limitations can be lifted, hand tracking will become a foundational interaction technology for a wide range of applications including immersive virtual reality, assistive technologies, robotics, home automation, and gaming.

However, hand tracking is challenging: hands can form a variety of complex poses due to their many degrees of freedom (DoFs), and come in different shapes and sizes. Despite some notable successes (e.g. [7, 32]), solutions that augment the user's hand with gloves or markers can be cumbersome and inaccurate. Much recent effort, including this work, has thus focused on camera-based systems. However, cameras, even modern consumer depth cameras, pose further difficulties: the fingers can be hard to disambiguate visually and are often occluded by other parts of the hand. Even state-of-the-art academic and commercial systems are thus sometimes inaccurate and susceptible to loss of track, e.g. due to fast motion. Many approaches address these concerns by severely constraining the tracking setup, for example by supporting only close-range and front-facing scenarios, or by using multiple cameras to help with occlusions.

In this paper, we present a system that aims to relax these constraints. We aim for high accuracy, i.e. the correctness and fidelity of the final reconstruction across a wide range of human hand poses and motions. Our system is remarkably robust, i.e. can rapidly recover from momentary failures of tracking. But perhaps most uniquely, our system is flexible: it uses only a single commodity depth camera; the user's hand does not need to be instrumented; the hand and fingers can point in arbitrary directions relative to the sensor; it works well at distances of several meters; and the camera placement is largely unconstrained and need not be static (see Fig. 1).

Our system combines a per-frame reinitializer that ensures robust recovery from loss of track, with a model-fitter that uses temporal information to achieve a smooth and accurate result. We propose a new reinitializer that uses machine learning to efficiently predict a hierarchical distribution over hand poses. For the model-fitter, we describe a 'golden' objective function and a stochastic optimization algorithm that minimizes the reconstruction error between a detailed 3D hand model and the observed depth image. The pipeline runs in real time on consumer hardware. We provide extensive technical details to aid replication, as well as a detailed qualitative and quantitative analysis and comparison on several test datasets. We show how our tracker's flexibility can enable new interactive possibilities beyond previous work, including: tracking across the range of tens of centimeters through to several meters (for high-fidelity control of displays, such as TVs, at a distance); tracking using a moving depth camera (for mobile scenarios); and arbitrary camera placements, including first person (enabling tracking for head-worn VR systems).

RELATED WORK
Given our aims described above, we avoid encumbering the user's hand with data gloves [7], colored gloves [32], wearable cameras [13], or markers [38], all of which can be barriers for natural interaction. We focus below on vision-based articulated hand tracking. Discriminative approaches work directly on the image data (e.g. extracting image features and using classification or regression techniques) to establish a mapping to a predefined set of hand pose configurations. These often do not require temporal information and can thus be used as robust reinitializers [21]. Generative (or model-based) methods use an explicit hand model to recover pose. Hybrid methods (such as [3, 23] and ours) combine discriminative and generative techniques to improve the robustness of frame-to-frame model fitting with per-frame reinitialization.

RGB input. Early work relied on monocular RGB cameras, making the problem extremely challenging (see [8] for a survey). Discriminative methods [2, 34] used small databases of restricted hand poses, limiting accuracy. Generative methods used models with restricted DoFs, working with simplified hand representations based on 2D, 2½D, or inverse kinematics (IK) (e.g. [24, 35]), again resulting in limited accuracy. Bray et al. [4] used a more detailed 3D hand model. de La Gorce et al. [6] automatically apply a scaling to the bones in the hand model during tracking. Most of these systems performed offline tracking using recorded sequences. An early example of online (10Hz) tracking with a simplified deformable hand model is [10]. This work, as with other RGB-based methods, struggled with complex poses, changing backgrounds, and occlusions, thus limiting general applicability.

Multi-camera input. There has also been recent work on offline, high-quality (non-interactive) performance capture of hands using multi-camera rigs. Ballan et al. [3] demonstrate high-quality results, closely fitting a detailed scanned mesh model to complex two-handed and hand-object interactions.

[…] known initial pose, and the method cannot recover from loss of track. Qian et al. [20] extend [17] by adding a 'guided' PSO step and a reinitializer that requires fingertips to be clearly visible. Melax et al. [16] use a generative approach driven by a physics solver to generate 3D pose estimates. These systems work only at close range, and are based on simple polyhedral models; our approach instead works across a wide range of distances, and exploits a full 3D hand mesh model that is better able to fit to the observed data.

Keskin et al. [12] propose a discriminative method using a multi-layered random forest to predict hand parts and thereby to fit a simple skeleton. The system runs at 30Hz on consumer CPU hardware, but can fail under occlusion. Tang et al. [27, 26] extend this work, demonstrating more complex poses at 25Hz. Whilst more robust to occlusions than [12], neither approach employs an explicit model-fitting step, meaning that results may not be kinematically valid (e.g. implausible articulations or finger lengths). Xu et al. [36] estimate the global orientation and location of the hand, regress candidate 21-DoF hand poses, and select the correct pose by minimizing reconstruction error. The system runs at 12Hz, and the lack of tracking can lead to jittery pose estimates. Tompson et al. [30] demonstrate impressive hand tracking results, using deep neural networks to predict feature locations and IK to infer a skeleton. While real-time, the approach only tackles close-range scenarios. Wang et al. [32, 31] demonstrate a discriminative nearest-neighbor lookup scheme using a large hand pose database, and IK for pose refinement. Nearest-neighbor methods are highly dependent on the database of poses, and can struggle to generalize to unseen poses.

Commercial systems. Beyond this research, there have also been commercial hand tracking systems. The 3Gear Systems [1] (based on prior work of [32, 31]) and the second-generation software for the Leap Motion [15] have shown tracking of a range of complex poses, including two-handed interaction. We compare to both in our results section.
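To make the hybrid pipeline concrete, the sketch below shows one frame of tracking: candidate poses proposed by a per-frame reinitializer are pooled with the previous frame's pose, then refined by minimizing a truncated reconstruction-error objective with a plain perturb-and-select stochastic search. Everything here is an illustrative assumption rather than the paper's actual implementation: `reinitializer` and `render_depth` are hypothetical callables, the truncation threshold and particle counts are arbitrary, and the simple random search merely stands in for the stochastic optimizer described in the paper.

```python
import numpy as np

def golden_energy(pose, observed_depth, render_depth, tau=0.5):
    """Truncated per-pixel reconstruction error between the depth map
    rendered from a candidate pose and the observed depth image."""
    residual = np.abs(render_depth(pose) - observed_depth)
    return float(np.minimum(residual, tau).sum())

def track_frame(observed_depth, prev_pose, reinitializer, render_depth,
                n_iters=10, sigma=0.05, rng=None):
    """One frame of a hybrid tracker: seed candidates from reinitializer
    proposals plus the previous frame's pose, then refine with a simple
    perturb-and-select search. `reinitializer` maps a depth image to a
    list of candidate pose vectors; `render_depth` maps a pose vector to
    a synthetic depth map (both hypothetical interfaces)."""
    rng = np.random.default_rng(0) if rng is None else rng
    energy = lambda p: golden_energy(p, observed_depth, render_depth)
    particles = list(reinitializer(observed_depth)) + [prev_pose]
    for _ in range(n_iters):
        # Keep the fittest quarter of the population (at least one).
        survivors = sorted(particles, key=energy)[: max(1, len(particles) // 4)]
        # Spawn perturbed copies of each survivor; retaining the
        # survivors themselves means the best energy never increases.
        particles = [p + rng.normal(0.0, sigma, p.shape)
                     for p in survivors for _ in range(4)] + survivors
    return min(particles, key=energy)
```

In the paper's system the pose vector has many more degrees of freedom and `render_depth` would rasterize a detailed 3D hand mesh; here a pose can be any small vector and rendering any cheap function, which is enough to exercise the seed-then-refine structure.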