Interacting beyond the Screen: A Multitouchless Interface Expanding User Interaction
Philip Krejov, Andrew Gilbert, and Richard Bowden ■ University of Surrey

This interface tracks users' hands and fingers in 3D, using a Microsoft Kinect sensor. The system tracks up to four hands simultaneously on a standard desktop computer. From this tracking, the system identifies gestures and uses them for interaction.

Multitouch technology is now commonplace across modern devices and provides direct interaction with an application's GUI. However, interaction is limited to the 2D plane of the display. The detection and tracking of fingertips in 3D, independent of the display, opens up multitouch to new application areas, ranging from medical analysis in sterile environments to home entertainment and gaming. Furthermore, the additional dimension of interaction provides new avenues for UI design.

Unfortunately, using gloves, specialized hardware, or multicamera stereo rigs to accurately track the hands limits the technology's applications. Alternatively, detecting and tracking fingertips in real-time video is difficult due to the variability of the environment, the fast motion of the hands, and the high degree of freedom of finger movement. The Leap Motion, an infrared stereo sensor tailored for hand interaction, accurately tracks fingertips, enabling interaction with existing multitouch applications. However, it has a limited workspace, allowing only the hands to be tracked.1 Alternatively, using a generic depth sensor allows both the body and hands to be tracked for interaction. Cem Keskin and his colleagues created a means of full hand pose estimation that can also classify hand shapes.2 This approach, however, requires significant quantities of synthetic data. (For more on previous research on hand pose estimation and interactive displays, see the related sidebars.)

In contrast, our approach detects only fingertips but does so without machine learning and a large amount of training data. Its real-time methodology uses geodesic maxima on the hand's surface instead of its visual appearance, which is efficient to compute and robust to both the pose and environment. With this methodology, we created a "multitouchless" interface that allows direct interaction with data through finger tracking and gesture recognition.3

Extending interaction beyond surfaces within the operator's physical reach (see Figure 1) provides greater scope for interaction. We can easily extend the user's working space to walls or the entire environment. Our approach also achieves real-time tracking of up to four hands on a standard desktop computer. This opens up possibilities for collaboration, especially because the workspace isn't limited by the display's size.

Our primary use of multitouchless interaction has been in the Making Sense project (www.making-sense.org), which employs data-mining and machine-learning tools to allow the discovery of patterns that relate images, video, and text. It combines multitouchless interaction and visualization to let analysts quickly

■ visualize data,
■ summarize the data or subsets of the data,
■ discriminatively mine rules that separate one dataset from another,
■ find commonality between data,
■ project data into a visualization space that shows the semantic similarity of the different items, and
■ group and categorize disparate data types.
Related Work in Hand Pose Tracking and Estimation

Extensive research has focused on analyzing hands for use in computer interaction. This includes systems that identify fingertip locations, because hand tracking is often the first stage in finding the fingertips.

Model-based tracking is a popular way to infer fingertip locations1 and account for large-scale occlusion.2 Such approaches can achieve 15 fps for a single hand, but this is still too slow for smooth interaction. Our approach (see the main article) can achieve 30 fps for four hands.

Many researchers optimized model fitting with the assistance of devices attached to the user. Motion capture is a common example of this; it uses markers at key locations on the hand, sometimes including the wrist. Andreas Aristidou and Joan Lasenby employed markers to reduce the complexity of mapping a model to the full structure of the hand.3

After obtaining depth information from a structured-light camera, Jagdish Raheja and his colleagues removed the background.4 Then, they used a large circular filter to remove the palm and obtain the fingers' masks. They located each fingertip by finding the point closest to the camera under each finger mask. To find fingers, Georg Hackenberg and his colleagues exploited the understanding that fingers comprise tips and pipes.5 Employing features previously applied to body pose estimation, Cem Keskin and his colleagues trained a random forest using a labeled model.6 This forest classified the regions of a captured hand on a per-pixel basis. They then used a support vector machine to estimate the pose.

Using a time-of-flight camera, Christian Plagemann and his colleagues applied geodesic distances to locate the body skeleton.7 We also employ geodesic extrema, but for finger detection, and we use Dijkstra's algorithm to efficiently identify the candidate regions.

References

1. I. Oikonomidis, N. Kyriazis, and A.A. Argyros, "Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect," Proc. British Machine Vision Conf., 2011, pp. 101.1–101.11.
2. H. Hamer et al., "Tracking a Hand Manipulating an Object," Proc. 2009 IEEE Int'l Conf. Computer Vision (ICCV 09), 2009, pp. 1475–1482.
3. A. Aristidou and J. Lasenby, "Motion Capture with Constrained Inverse Kinematics for Real-Time Hand Tracking," Proc. 2010 Int'l Symp. Communications, Control and Signal Processing (ISCCSP 10), 2010, pp. 1–5.
4. J.L. Raheja, A. Chaudhary, and K. Singal, "Tracking of Fingertips and Centers of Palm Using Kinect," Proc. 2011 Int'l Conf. Computational Intelligence, Modeling and Simulation (CIMSiM 11), 2011, pp. 248–252.
5. G. Hackenberg, R. McCall, and W. Broll, "Lightweight Palm and Finger Tracking for Real-Time 3D Gesture Control," Proc. 2011 IEEE Virtual Reality Conf. (VR 11), 2011, pp. 19–26.
6. C. Keskin et al., "Real Time Hand Pose Estimation Using Depth Sensors," Proc. 2011 IEEE Computer Vision Workshops, 2011, pp. 1228–1234.
7. C. Plagemann et al., "Real-Time Identification and Localization of Body Parts from Depth Images," Proc. 2010 IEEE Int'l Conf. Robotics and Automation (ICRA 10), 2010, pp. 3108–3113.

Visualization provides an intuitive understanding of large, complex datasets, while interaction provides an intuitive interface to complex machine-learning algorithms. Figure 1 shows the system in use.

Tracking Fingertips

A Microsoft Kinect serves as the sensor. First, we capture the depth image and calibrate the point cloud.
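As a rough sketch of this first step, the following snippet back-projects a depth frame into a metric point cloud using the pinhole camera model. It is not the authors' code: the intrinsic parameters (FX, FY, CX, CY) are typical published values for the Kinect v1 depth camera and, like the function name, are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the authors' implementation): back-project a Kinect
# depth frame to a calibrated 3D point cloud with the pinhole camera model.
import numpy as np

# Typical Kinect v1 depth intrinsics for a 640x480 image (assumed values).
FX, FY = 585.0, 585.0      # focal lengths in pixels
CX, CY = 320.0, 240.0      # principal point

def depth_to_point_cloud(depth_mm: np.ndarray) -> np.ndarray:
    """Convert a (480, 640) depth image in millimetres to an (N, 3) array of
    points in metres, discarding pixels with no depth reading."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    z = depth_mm.astype(np.float32) / 1000.0          # millimetres -> metres
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                   # zero depth = invalid

if __name__ == "__main__":
    # Synthetic frame standing in for a real Kinect capture: everything at 1.5 m.
    fake_depth = np.full((480, 640), 1500, dtype=np.uint16)
    print(depth_to_point_cloud(fake_depth).shape)     # (307200, 3)
```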
We locate and separate hand blobs with respect to the image domain. Processing each hand in parallel, we build a weighted graph from the real-world point information for each hand's surface. An efficient shortest-path algorithm traverses the graph to find candidate fingertips. We then filter these candidates on the basis of their location relative to the center of the hand and the wrist, and use them in a temporally smoothed model of the fingertip locations.

Figure 1. Multitouchless interaction with Making Sense's main screen. The user is employing the selection gesture to choose the data subset to mine.

Hand and Forearm Segmentation

For each color frame from the Kinect, we locate the user's face using a standard face detector. This provides the subject's location for each frame. Using the point cloud derived from the calibrated depth information (the distance from the Kinect), we find the bounding box of the closest face, which we as-

Related Work in Interactive Displays

John Underkoffler at MIT designed the gesture-based computer interface featured in the 2002 film Minority Report. An interface based on that concept is available from Oblong Industries (www.oblong.com), but its reliance on specialized hardware limits it to custom installations. However, the Microsoft Kinect has brought new levels of interaction to consumers. In an early example of applying the Minority Report interaction paradigm to the Kinect, Garratt Gallagher, from the MIT Computer Science and Artificial Intelligence Laboratory, used the Point Cloud Library (http://pointclouds.org) to track hands (see Figure A1 and www.csail.mit.edu/taxonomy/term/158). The HoloDesk used head and hand tracking to let users interact with 3D virtual objects through the screen.1 SpaceTop extended this by developing a desktop interface (see Figure A2).2 6D Hands used two webcams to perform pose estimation and tracking, enabling a design-and-assembly application with virtual object manipulation.3

OmniTouch was a shoulder-mounted interface that displayed graphics on users' hands, held objects, walls, and tables.6 It captured user interaction through a depth sensor and used a pico projector to project a multitouch overlay on the target surface. It also tracked the surface's plane, letting users adjust the projected image to maintain the overlay.

Because the collaboration of users is important, approaches that let multiple people use a system in real time are desirable. In research by Alexander Kulik and his colleagues, six people are able to interact with the same projective warping screen.4 The T(ether) application used infrared markers to track multiple hands interacting with a virtual environment.5 Users viewed this environment on a tablet computer, which was also tracked through infrared markers.

Projectors are the typical means of presenting information to multiple users and allow collaborative work. Andrew Wilson and his colleagues developed a steerable projector that presents information on any surface in a

Figure A. (1) Hand tracking using the Point Cloud Library. (2) The SpaceTop desktop interface.
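Returning to the fingertip detection described in the Tracking Fingertips section, the sketch below illustrates the general idea of treating the hand's surface as a weighted graph and using Dijkstra's algorithm to find geodesic maxima as fingertip candidates. It is only an illustration of the technique, not the authors' implementation: the 8-connected graph construction, the multi-source distance computation, and the choice of five extrema are assumptions.

```python
# Illustrative sketch (assumptions noted above): candidate fingertips as geodesic
# extrema of a segmented hand. `points` is an (H, W, 3) array of 3D coordinates,
# `mask` an (H, W) boolean hand segmentation, and `palm_index` the index of the
# palm-centre point among the masked points.
import heapq
import numpy as np

def build_hand_graph(points, mask):
    """Adjacency list over masked pixels; edge weight = 3D distance between neighbours."""
    h, w = mask.shape
    idx = -np.ones((h, w), dtype=int)
    idx[mask] = np.arange(int(mask.sum()))
    adj = [[] for _ in range(int(mask.sum()))]
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            # Each undirected 8-neighbour edge is added once via these four offsets.
            for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                    wgt = float(np.linalg.norm(points[y, x] - points[ny, nx]))
                    adj[idx[y, x]].append((idx[ny, nx], wgt))
                    adj[idx[ny, nx]].append((idx[y, x], wgt))
    return adj

def geodesic_distances(adj, sources):
    """Multi-source Dijkstra: shortest surface distance from any source to every point."""
    dist = np.full(len(adj), np.inf)
    heap = []
    for s in sources:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def fingertip_candidates(points, mask, palm_index, k=5):
    """Iteratively pick the point geodesically farthest from the palm and from all
    previously selected extrema; extended fingertips emerge as these extrema."""
    adj = build_hand_graph(points, mask)
    sources, extrema = [palm_index], []
    for _ in range(k):
        dist = geodesic_distances(adj, sources)
        dist[~np.isfinite(dist)] = -1.0   # ignore points disconnected from the palm
        farthest = int(np.argmax(dist))
        extrema.append(farthest)
        sources.append(farthest)
    return extrema
```

In this sketch each new extremum lies far, along the hand's surface, from both the palm and the extrema already found, which is why extended fingertips stand out; the filtering relative to the hand center and wrist described in the article would then prune spurious candidates such as points on the forearm.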