Interacting beyond the Screen

A Multitouchless Interface Expanding User Interaction

Philip Krejov, Andrew Gilbert, and Richard Bowden ■ University of Surrey

This interface tracks users' hands and fingers in 3D, using a Microsoft Kinect sensor. The system tracks up to four hands simultaneously on a standard desktop computer. From this tracking, the system identifies gestures and uses them for interaction.

Multitouch technology is now commonplace across modern devices and provides direct interaction with an application's GUI. However, interaction is limited to the 2D plane of the display. The detection and tracking of fingertips in 3D, independent of the display, opens up multitouch to new application areas, ranging from medical analysis in sterile environments to home entertainment and gaming. Furthermore, the additional dimension of interaction provides new avenues for UI design.

Unfortunately, using gloves, specialized hardware, or multicamera stereo rigs to accurately track the hands limits the technology's applications. Alternatively, detecting and tracking fingertips in real-time video is difficult due to the variability of the environment, the fast motion of the hands, and the high degree of freedom of finger movement. The Leap Motion Controller, an infrared stereo sensor tailored for hand interaction, accurately tracks fingertips, enabling interaction with existing multitouch applications. However, it has a limited workspace, allowing only the hands to be tracked.1 Alternatively, using a generic depth sensor allows both the body and hands to be tracked for interaction. Cem Keskin and his colleagues created a means of full hand pose estimation that can also classify hand shapes.2 This approach, however, requires significant quantities of synthetic data. (For more on previous research on hand pose estimation and interactive displays, see the related sidebars.)

In contrast, our approach detects only fingertips but does so without machine learning and a large amount of training data. Its real-time methodology uses geodesic maxima on the hand's surface instead of its visual appearance, which is efficient to compute and robust to both the pose and environment. With this methodology, we created a "multitouchless" interface that allows direct interaction with data through finger tracking and gesture recognition.3

Extending interaction beyond surfaces within the operator's physical reach (see Figure 1) provides greater scope for interaction. We can easily extend the user's working space to walls or the entire environment. Our approach also achieves real-time tracking of up to four hands on a standard desktop computer. This opens up possibilities for collaboration, especially because the workspace isn't limited by the display's size.

Our primary use of multitouchless interaction has been in the Making Sense project (www.making-sense.org), which employs data-mining and machine-learning tools to allow the discovery of patterns that relate images, video, and text. It combines multitouchless interaction and visualization to let analysts quickly

■ visualize data,
■ summarize the data or subsets of the data,
■ discriminatively mine rules that separate one dataset from another,
■ find commonality between data,
■ project data into a visualization space that shows the semantic similarity of the different items, and
■ group and categorize disparate data types.

Related Work in Hand Pose Tracking and Estimation

Extensive research has focused on analyzing hands for use in computer interaction. This includes systems that identify fingertip locations, because hand tracking is often the first stage in finding the fingertips.

Model-based tracking is a popular way to infer fingertip locations1 and account for large-scale occlusion.2 Such approaches can achieve 15 fps for a single hand, but this is still too slow for smooth interaction. Our approach (see the main article) can achieve 30 fps for four hands.

Many researchers optimized model fitting with the assistance of devices attached to the user. Motion capture is a common example of this; it uses markers at key locations on the hand, sometimes including the wrist. Andreas Aristidou and Joan Lasenby employed markers to reduce the complexity of mapping a model to the full structure of the hand.3

After obtaining depth information from a structured-light camera, Jagdish Raheja and his colleagues removed the background.4 Then, they used a large circular filter to remove the palm and obtain the fingers' masks. They located each fingertip by finding the point closest to the camera under each finger mask. To find fingers, Georg Hackenberg and his colleagues exploited the understanding that fingers comprise tips and pipes.5 Employing features previously applied to body pose estimation, Cem Keskin and his colleagues trained a random forest using a labeled model.6 This forest classified the regions of a captured hand on a per-pixel basis. They then used a support vector machine to estimate the pose.

Using a time-of-flight camera, Christian Plagemann and his colleagues applied geodesic distances to locate the body skeleton.7 We also employ geodesic extrema, but for finger detection, and we use Dijkstra's algorithm to efficiently identify the candidate regions.

References
1. I. Oikonomidis, N. Kyriazis, and A.A. Argyros, "Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect," Proc. British Machine Vision Conf., 2011, pp. 101.1–101.11.
2. H. Hamer et al., "Tracking a Hand Manipulating an Object," Proc. 2009 IEEE Int'l Conf. Computer Vision (ICCV 09), 2009, pp. 1475–1482.
3. A. Aristidou and J. Lasenby, "Motion Capture with Constrained Inverse Kinematics for Real-Time Hand Tracking," Proc. 2010 Int'l Symp. Communications, Control and Signal Processing (ISCCSP 10), 2010, pp. 1–5.
4. J.L. Raheja, A. Chaudhary, and K. Singal, "Tracking of Fingertips and Centers of Palm Using Kinect," Proc. 2011 Int'l Conf. Computational Intelligence, Modeling and Simulation (CIMSiM 11), 2011, pp. 248–252.
5. G. Hackenberg, R. McCall, and W. Broll, "Lightweight Palm and Finger Tracking for Real-Time 3D Gesture Control," Proc. 2011 IEEE Virtual Reality Conf. (VR 11), 2011, pp. 19–26.
6. C. Keskin et al., "Real Time Hand Pose Estimation Using Depth Sensors," Proc. 2011 IEEE Computer Vision Workshops, 2011, pp. 1228–1234.
7. C. Plagemann et al., "Real-Time Identification and Localization of Body Parts from Depth Images," Proc. 2010 IEEE Int'l Conf. Robotics and Automation (ICRA 10), 2010, pp. 3108–3113.

Visualization provides an intuitive understanding of large, complex datasets, while interaction provides an intuitive interface to complex machine-learning algorithms. Figure 1 shows the system in use.

Tracking Fingertips
A Microsoft Kinect serves as the sensor. First, we capture the depth image and calibrate the point cloud. We locate and separate hand blobs with respect to the image domain. Processing each hand in parallel, we build a weighted graph from the real-world point information for each hand's surface. An efficient shortest-path algorithm traverses the graph to find candidate fingertips. We then filter these candidates on the basis of their location relative to the center of the hand and the wrist, and use them in a temporally smoothed model of the fingertip locations.

Figure 1. Multitouchless interaction with Making Sense's main screen. The user is employing the selection gesture to choose the data subset to mine.


Related Work in Interactive Displays

John Underkoffler at MIT designed the gesture-based computer interface featured in the 2002 film Minority Report. An interface based on that concept is available from Oblong Industries (www.oblong.com), but its reliance on specialized hardware limits it to custom installations. However, the Microsoft Kinect has brought new levels of interaction to consumers. In an early example of applying the Minority Report interaction paradigm to the Kinect, Garratt Gallagher, from the MIT Computer Science and Artificial Intelligence Laboratory, used the Point Cloud Library (http://pointclouds.org) to track hands (see Figure A1 and www.csail.mit.edu/taxonomy/term/158). The HoloDesk used head and hand tracking to let users interact with 3D virtual objects through the screen.1 SpaceTop extended this by developing a desktop interface (see Figure A2).2 6D Hands used two webcams to perform pose estimation and tracking, enabling a design-and-assembly application with virtual object manipulation.3

OmniTouch was a shoulder-mounted interface that displayed graphics on users' hands, held objects, walls, and tables.6 It captured user interaction through a depth sensor and used a pico projector to project a multitouch overlay on the target surface. It also tracked the surface's plane, letting users adjust the projected image to maintain the overlay.

Because the collaboration of users is important, approaches that let multiple people use a system in real time are desirable. In research by Alexander Kulik and his colleagues, six people are able to interact with the same projective warping screen.4 The T(ether) application used infrared markers to track multiple hands interacting with a virtual environment.5 Users viewed this environment on a tablet computer, which was also tracked through infrared markers.

Projectors are the typical means of presenting information to multiple users and allow collaborative work. Andrew Wilson and his colleagues developed a steerable projector that presents information on any surface in a room.7 They also used it with a Kinect camera to provide reactive projection.

Figure A. Examples of hand- and gesture-based interaction. (1) An early Kinect-based version of an interface developed by MIT researchers. (Source: Garratt Gallagher, MIT Computer Science and Artificial Intelligence Laboratory; used with permission.) (2) Use of 3D hand tracking in 3D interaction. (Source: Microsoft Applied Sciences Group; used with permission.)

References
1. O. Hilliges et al., "HoloDesk: Direct 3D Interactions with a Situated See-Through Display," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 12), 2012, pp. 2421–2430.
2. J. Lee et al., "SpaceTop: Integrating 2D and Spatial 3D Interactions in a See-Through Desktop Environment," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 13), 2013, pp. 189–192.
3. R. Wang, S. Paris, and J. Popović, "6D Hands: Markerless Hand Tracking for Computer Aided Design," Proc. 24th Ann. ACM Symp. User Interface Software and Technology (UIST 11), 2011, pp. 549–558.
4. A. Kulik et al., "C1x6: A Stereoscopic Six-User Display for Co-located Collaboration in Shared Virtual Environments," Proc. Siggraph Asia Conf., 2011, article 188.
5. M. Blackshaw et al., "T(ether): A Tool for Spatial Expression," 2011; http://tangible.media.mit.edu/project/tether.
6. C. Harrison, H. Benko, and A.D. Wilson, "OmniTouch: Wearable Multitouch Interaction Everywhere," Proc. 24th Ann. ACM Symp. User Interface Software and Technology (UIST 11), 2011, pp. 441–450.
7. A. Wilson et al., "Steerable Augmented Reality with the Beamatron," Proc. 25th Ann. ACM Symp. User Interface Software and Technology (UIST 12), 2012, pp. 413–422.


Figure 2. Multiple hand shapes and their first seven extremities. (a) An open palm with two extrema belonging to the wrist. (b) Pinching. (c) Open pinching. (d) A pointing finger, with one other extremum approaching the tip. (e) Self-occlusion. (f) The extrema paths.

Hand and Forearm Segmentation
For each color frame from the Kinect, we locate the user's face using a standard face detector. This provides the subject's location for each frame. Using the point cloud derived from the calibrated depth information (the distance from the Kinect), we find the bounding box of the closest face, which we assume is the user. We smooth the depth over a 20-ms window to define the back plane for segmentation, which lets us remove the body and background from the depth image. The remaining depth space is the workspace for gesture-based interaction.

We cluster any points in this space into connected blobs. We ignore blobs smaller than potential hand shapes; the remaining blobs are candidates for the user's arms. We then classify the points in each blob as part of a hand subset or wrist subset. This classification uses a depth threshold of one-quarter of the arm's total depth.

Hand Center Localization
The hand's center serves as the seed point for distance computation. Simply using the centroid of points would result in the seed point shifting when the hand is opened and closed. We use the hand's chamfer distance, measured from the external boundary, to find a stable center. (The chamfer distance is a transform that computes the distance between sets of points. In this context, it computes each point's distance from the closest boundary point.) A hand's center is the point with the greatest distance in the chamfer image.
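The chamfer-based center can be computed with a standard two-pass distance transform. The sketch below is a minimal illustration written for this article, not the system's code; the binary mask layout and the 3-4 chamfer weights are assumptions made for the example.

```cpp
// Illustrative sketch: locate a stable palm center as the maximum of a
// two-pass 3-4 chamfer distance transform over a binary hand mask
// (1 = hand pixel, 0 = background). Not the authors' implementation.
#include <algorithm>
#include <climits>
#include <vector>

struct PixelPos { int x, y; };

PixelPos handCenter(const std::vector<int>& mask, int w, int h) {
    const int kInf = INT_MAX / 4;
    std::vector<int> dist(w * h);
    for (int i = 0; i < w * h; ++i) dist[i] = mask[i] ? kInf : 0;
    auto d = [&](int x, int y) -> int& { return dist[y * w + x]; };

    // Forward pass: propagate boundary distances from the top-left.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if (x > 0)              d(x, y) = std::min(d(x, y), d(x - 1, y) + 3);
            if (y > 0)              d(x, y) = std::min(d(x, y), d(x, y - 1) + 3);
            if (x > 0 && y > 0)     d(x, y) = std::min(d(x, y), d(x - 1, y - 1) + 4);
            if (x < w - 1 && y > 0) d(x, y) = std::min(d(x, y), d(x + 1, y - 1) + 4);
        }
    // Backward pass: propagate boundary distances from the bottom-right.
    for (int y = h - 1; y >= 0; --y)
        for (int x = w - 1; x >= 0; --x) {
            if (x < w - 1)              d(x, y) = std::min(d(x, y), d(x + 1, y) + 3);
            if (y < h - 1)              d(x, y) = std::min(d(x, y), d(x, y + 1) + 3);
            if (x < w - 1 && y < h - 1) d(x, y) = std::min(d(x, y), d(x + 1, y + 1) + 4);
            if (x > 0 && y < h - 1)     d(x, y) = std::min(d(x, y), d(x - 1, y + 1) + 4);
        }
    // The palm center is the hand pixel farthest from the external boundary.
    PixelPos best{0, 0};
    int bestDist = -1;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (mask[y * w + x] && d(x, y) > bestDist) { bestDist = d(x, y); best = {x, y}; }
    return best;
}
```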
Candidate Fingertip Detection
By mapping the hand's surface and searching for extremities, we can find the fingertips, excluding the closed fingers. We conduct this search by mapping the distance from the hand's center to all other points on the hand's surface. The geodesic distances with the greatest local value correspond to geodesic maxima. We search for up to five extrema to account for all open fingertips.

In practice, however, the wrist forms additional extremities with a similar geodesic distance. So, we greedily compute the first seven extremities; this accounts for each fingertip, including two false positives. When the fingers are closed, the tips aren't of interest because the fingers contribute to the fist's formation. The extremity normally associated with a folded finger forms an additional false positive that we filter out at a later stage.

To find the candidate fingertips, we first build a weighted undirected graph of the hand. Each point in the hand represents a vertex in the graph. We connect these vertices with neighboring vertices in an 8-neighborhood fashion, deriving the edge cost from the Euclidean distance between their world coordinates.

Using the hand's center as the seed point, we compute the first geodesic extremity of the hand graph. We use an efficient implementation of Dijkstra's shortest-path algorithm and search for hand locations with the longest direct path from the hand's center. (For a given source vertex, Dijkstra's algorithm finds the lowest-cost path to all other vertices in the mesh. The geodesic extremity is the vertex with the greatest cost.) This uses the hand's internal structure to find the extrema and, as such, is robust to contour noise, which is frequent in the depth image and would cause circular features to fail. We then find the remaining extrema iteratively, again using Dijkstra's algorithm but with a noninitialized distance map,4 reducing the computational complexity.

Figure 2 shows the seven extremities found for various hand shapes. Each extremity is associated with its shortest path, as Figure 2f shows.
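As a minimal sketch of this step (written for this article with assumed data structures, not the system's code), the routine below runs Dijkstra's algorithm over an adjacency-list hand graph and greedily extracts the first k geodesic extrema, reusing the distance map between runs rather than reinitializing it:

```cpp
// Sketch: finding geodesic extrema of a hand surface graph with Dijkstra's
// algorithm. Vertices are hand points; edges connect 8-neighbours weighted by
// 3D Euclidean distance. After the first run from the palm center, later runs
// reuse the distance map (no reinitialization), so each new extremum is the
// vertex farthest from every source found so far. Illustrative only.
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; float w; };
using Graph = std::vector<std::vector<Edge>>;   // adjacency list

// Relax distances from 'seed' into 'dist' (which may already hold finite
// values from earlier seeds) and return the vertex with the greatest cost.
int relaxAndFindExtremum(const Graph& g, int seed, std::vector<float>& dist) {
    using Item = std::pair<float, int>;                       // (distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[seed] = 0.0f;
    pq.push({0.0f, seed});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;                            // stale queue entry
        for (const Edge& e : g[u])
            if (dist[u] + e.w < dist[e.to]) {
                dist[e.to] = dist[u] + e.w;
                pq.push({dist[e.to], e.to});
            }
    }
    int extremum = seed;
    for (int v = 0; v < static_cast<int>(g.size()); ++v)
        if (dist[v] < std::numeric_limits<float>::infinity() && dist[v] > dist[extremum])
            extremum = v;
    return extremum;
}

// Greedily extract the first k geodesic extrema (k = 7 in the article:
// five fingertips plus two wrist false positives).
std::vector<int> geodesicExtrema(const Graph& g, int palmCenter, int k) {
    std::vector<float> dist(g.size(), std::numeric_limits<float>::infinity());
    std::vector<int> extrema;
    int seed = palmCenter;
    for (int i = 0; i < k; ++i) {
        seed = relaxAndFindExtremum(g, seed, dist);
        extrema.push_back(seed);
    }
    return extrema;
}
```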


Nonfingertip Rejection
Next, we filter the candidate fingertips' points to a subset of valid fingertips. We combine a penalty metric derived from the path taken during Dijkstra's algorithm with a candidate's position relative to the hand's center.

Finding the penalty for each candidate requires the covariance of the hand and arm points, translated to the wrist location. (The covariance models the variation in pixels for the chosen body part.) Using the covariance to map the wrist in this manner takes into consideration the variability of hand shapes. We translate the covariance to form an elliptical mask centered around the wrist. If a pixel has a Mahalanobis distance within three standard deviations of the wrist, we mark it as 1; otherwise, we mark it as 0 (see Figure 3).

Figure 3. The covariance of several hand shapes translated to the wrist. (a) An image of an open palm, showing a long principal axis for a full hand and arm. (b) An image of a fist, revealing a circular boundary. (c) An image of two extended fingers, demonstrating wrist rotation.

To find the penalty, we use both the paths from the extrema to the hand's center and the elliptical mask. We increment a path's score for each vertex with increasing depth through the mask, then normalize the score using the complete path's length. When moving along this path, we include a vertex in the penalty only if its depth is less than the next vertex's depth. (In Figure 4, the red vertices contribute to an increased score.) This heuristic is consistent with the understanding that the wrist has a greater depth than the hand's center. We only consider vertices that traverse with increasing depth in the penalty. The elliptical mask allows for gestures in which the fingers have a greater depth than the hand's center—for example, someone pointing at his or her own chest. After finding the penalty for each candidate, we remove the candidate with the greatest penalty.

Figure 4. The complete shortest path for each extremum. The points along each path that increase the score are red. (a) Vertices outside the ellipse don't contribute to the penalty. (b) The thumb's path demonstrates the need to penalize only vertices that traverse with increasing depth.

We reduce the remaining candidates using the Euclidean distance of the fingertip to the hand's center, rejecting fingers with a distance less than 7 mm. This forms a sphere around the user's hand. We consider any fingertip outside this sphere a true positive.
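A sketch of this rejection step is shown below. It is written for this article with assumed data structures (per-candidate shortest paths, an inverse covariance supplied by the caller, and a minimum-radius parameter); it is not the system's implementation.

```cpp
// Sketch of the non-fingertip rejection step: a wrist-centred elliptical mask
// (three-sigma Mahalanobis distance under the hand+arm covariance) and a path
// penalty that counts vertices inside the ellipse whose depth increases toward
// the next vertex, normalized by path length. Illustrative data layout only.
#include <cstddef>
#include <vector>

struct PathVertex { float x, y;     // image coordinates
                    float depth; };

struct WristEllipse {
    float cx, cy;            // wrist location (ellipse center)
    float icov[2][2];        // inverse of the 2x2 hand+arm covariance
    bool contains(float x, float y) const {
        float dx = x - cx, dy = y - cy;
        float m2 = dx * (icov[0][0] * dx + icov[0][1] * dy) +
                   dy * (icov[1][0] * dx + icov[1][1] * dy);
        return m2 <= 9.0f;   // within three standard deviations
    }
};

// Penalty for one candidate: fraction of path vertices that lie inside the
// wrist ellipse and whose depth is less than that of the next vertex.
float pathPenalty(const std::vector<PathVertex>& path, const WristEllipse& wrist) {
    if (path.size() < 2) return 0.0f;
    int score = 0;
    for (std::size_t i = 0; i + 1 < path.size(); ++i)
        if (wrist.contains(path[i].x, path[i].y) && path[i].depth < path[i + 1].depth)
            ++score;
    return static_cast<float>(score) / static_cast<float>(path.size());
}

// Keep all candidates except the one with the greatest penalty, then drop any
// candidate whose 3D distance to the palm center is below a minimum radius.
std::vector<int> rejectNonFingertips(const std::vector<std::vector<PathVertex>>& paths,
                                     const std::vector<float>& distToCenter,
                                     const WristEllipse& wrist, float minRadius) {
    std::vector<float> penalty(paths.size());
    std::size_t worst = 0;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        penalty[i] = pathPenalty(paths[i], wrist);
        if (penalty[i] > penalty[worst]) worst = i;
    }
    std::vector<int> kept;
    for (std::size_t i = 0; i < paths.size(); ++i)
        if (i != worst && distToCenter[i] >= minRadius)
            kept.push_back(static_cast<int>(i));
    return kept;
}
```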
Finger Assignment and Tracking
To track and assign the fingertips between frames, we use a Kalman filter. (A Kalman filter takes a series of noisy measurements observed over time and combines these estimates to give a more accurate prediction. It often includes assumptions about the type of motion and noise that might be observed, which it uses to smooth the predictions.) We update this model using the point correspondence that minimizes the change between consecutive frames. As tracking is in 3D, we must check all possible permutations when matching points. When assigning five or fewer points, searching all permutations requires fewer operations than using the Hungarian algorithm (an optimization algorithm that efficiently solves the assignment problem). The hand has five fingers, resulting in only 120 permutations in the worst case.

We use each detected fingertip's world position to update a bank of 3D Kalman filters that persist from one frame to the next. To update this model, we pair detected extrema with the Kalman filter predictions by selecting the lowest-cost assignment between pairs.

Any unmatched extrema initialize a new Kalman tracker to account for the presence of new fingers. Kalman filter predictions derived from the previous frames that were not matched are updated using the previous predicted position. This blind update is performed on the condition that the prediction's confidence does not diminish considerably, at which point we remove it from the model.

We use the smoothed model to output each fingertip as a 3D coordinate in millimeters. For visualization, we can project these coordinates back to the image domain (see Figure 5).

It's more accurate to track the fingertips in 3D than in the image domain because the additional depth information and filter estimation provide subpixel accuracy. If we tracked the points in the image plane, the resulting fingertips would be prone to jitter, owing to the quantization of the image.
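The sketch below illustrates the permutation search for pairing detections with filter predictions; the Kalman filters themselves are left out, and the data types are assumptions made for the example, not the system's code.

```cpp
// Sketch: matching up to five detected fingertips to the predictions of a bank
// of Kalman filters by exhaustive permutation search (at most 5! = 120
// assignments), choosing the pairing with the lowest total 3D distance.
// 'predictions' stands in for the per-finger Kalman position estimates.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };

static float dist(const Vec3& a, const Vec3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// Returns, for each detection, the index of the matched prediction
// (or -1 when there are more detections than predictions).
std::vector<int> assignFingertips(const std::vector<Vec3>& detections,
                                  const std::vector<Vec3>& predictions) {
    const int p = static_cast<int>(predictions.size());
    std::vector<int> order(std::max(detections.size(), predictions.size()));
    std::iota(order.begin(), order.end(), 0);

    std::vector<int> best(detections.size(), -1);
    float bestCost = std::numeric_limits<float>::infinity();
    do {
        float cost = 0.0f;
        for (std::size_t d = 0; d < detections.size(); ++d)
            if (order[d] < p)                       // detection d paired with a prediction
                cost += dist(detections[d], predictions[order[d]]);
        if (cost < bestCost) {
            bestCost = cost;
            for (std::size_t d = 0; d < detections.size(); ++d)
                best[d] = order[d] < p ? order[d] : -1;
        }
    } while (std::next_permutation(order.begin(), order.end()));
    return best;
}
```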

Tracking Validation
Our tests, written in C++, were performed on a standard desktop PC with a 3.4-GHz Intel Core i7 CPU. To validate tracking performance, we used sequences captured from a range of users. We labeled these videos with approximately 15,000 ground truth markers, against which we measured the tracked fingers.

Over all the sequences, the average error was 2.48 cm. However, Figure 6 shows that most fingertips were accurate to within 5 mm. Our approach operated at 30 fps for each hand, with up to four hands processing in parallel.

Figure 5. Multiple fingertips and the hand’s center tracked over 10 frames. Each line’s length demonstrates the distance covered in the previous 10 frames.

Figure 6. The detections across all user sequences (number of detections versus error in mm). Most detections were within 5 mm of their ground truth.

Recognizing Gestures
Gestures for interaction typically are common across multitouch devices and provide familiarity that lets users easily perform complex tasks. However, 3D interaction differs fundamentally, and 2D gestures don't necessarily translate well to 3D.

The development of gestures requires special consideration because they're integral to the user experience. They should be simple and semantically similar to the task they represent; a good example would be using a pinching gesture to zoom. However, for large-scale displays, a single-handed, two-finger pinch zoom does not scale well to the visualization's size; a two-handed, single-finger zoom is more appropriate.

To provide flexible multitouchless gesture recognition, we place middleware between the finger-tracking and application levels. We chose the gesture vocabulary to generalize across applications. To minimize the complexity new users experience when the interaction modality changes, we adapted the most common 2D gestures.

Basic Gestures
From hand and finger tracking, we know the presence and position of the user's head, hand, and fingers (see Figure 7a); additional skeletal information is available from the Kinect. We use a representation of a hand's velocity to detect a zoom gesture (see Figure 7b) and horizontal and vertical swipes (see Figure 7c), single- or double-handed. As we know the number of fingers visible on each hand, our system can recognize these gestures on the basis of specific finger configurations. For example, at the application level, users employ one finger per hand for the zoom gesture, but the system detects swipes regardless of the number of fingers.

Users can also employ an extended thumb and index finger to control a cursor. Then, to select an object placed under the index finger, the user retracts the thumb to lie alongside the palm. The system also easily detects a grasping gesture as the rapid transition from five to zero fingers.

Figure 7d shows a gesture that translates directly to a variable-size selection window in the application layer. The hands' position controls the window's size and position. Our demonstration system uses this particularly intuitive gesture to select different subsets of data in the visualization.

Because our approach tracks the hands in 3D, it can interpret rapid motion toward the camera as a push gesture. Users can employ this gesture with an open hand to open a menu or with the index finger to select an item. They can rotate a hand to control a circular context menu (also known as a pie or radial menu).

We can extend the system to recognize additional types and combinations of gestures. However, more complex gestures involving transitions—for example, a horizontal swipe followed by a downward swipe—could be better represented using a probabilistic approach such as hidden Markov models.
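As an illustration of the velocity-based detection described at the start of this subsection, the sketch below flags two gestures from the smoothed 3D hand position: a push toward the camera and a horizontal swipe. It is our example, not the middleware's code; the window length and speed thresholds are illustrative values only.

```cpp
// Sketch of a middleware-style detector for two gestures: a push (rapid motion
// toward the camera, i.e. decreasing depth) and a horizontal swipe (sustained
// lateral velocity). Velocities are estimated from the smoothed 3D hand
// positions over a short window. Thresholds are illustrative, not the
// system's values. A real detector would also debounce repeated triggers.
#include <deque>

struct HandSample { float x, y, z;   // palm center in millimetres
                    double t; };     // timestamp in seconds

enum class Gesture { None, Push, SwipeLeft, SwipeRight };

class GestureDetector {
public:
    Gesture update(const HandSample& s) {
        window_.push_back(s);
        // Keep roughly the last 250 ms of samples.
        while (window_.size() > 2 && s.t - window_.front().t > 0.25)
            window_.pop_front();
        if (window_.size() < 2) return Gesture::None;

        const HandSample& a = window_.front();
        double dt = s.t - a.t;
        if (dt <= 0.0) return Gesture::None;
        float vx = (s.x - a.x) / static_cast<float>(dt);   // mm/s
        float vz = (s.z - a.z) / static_cast<float>(dt);

        if (vz < -kPushSpeed)  return Gesture::Push;        // toward the camera
        if (vx > kSwipeSpeed)  return Gesture::SwipeRight;
        if (vx < -kSwipeSpeed) return Gesture::SwipeLeft;
        return Gesture::None;
    }
private:
    std::deque<HandSample> window_;
    static constexpr float kPushSpeed  = 800.0f;   // illustrative thresholds, mm/s
    static constexpr float kSwipeSpeed = 600.0f;
};
```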
Limitations
Because the Kinect sensor is designed for full body pose estimation, it has inherent drawbacks for our approach. For example, it has difficulty mapping the hand's surface when fingers are directed toward it. This forms a hole in the depth information that our approach does not account for. So, we avoid gestures in which users point at an onscreen object.

In addition, we parametrically tuned the distance threshold described in the section "Nonfingertip Rejection" to be invariant to adult users. In the event of a child using the system, this might lead to failure owing to the smaller hand.


Figure 7. Gestures implemented in the middleware. (a) The position of the user's head, hand, and fingers, determined from hand and finger tracking. (b) Zooming. (c) Swiping. (d) Selection.


Our approach could derive a more appropriate value at runtime from the chamfer distance. The maximum value at the hand's center, once normalized by the depth, would provide the palm width.

The Making Sense System
The Making Sense system employs a curved projection screen that forms the main display and a secondary tabletop display (see Figure 8). Two high-definition projectors are predistorted to correct for the screen's curvature and stitched together to form a 2,500 × 800-pixel display that presents timeline data.

Because the user stands some distance from the screen, conventional interaction such as with a mouse and keyboard isn't applicable or required. So, the system employs the previously described gesture vocabulary. The user's face serves as the back plane for gesture recognition; any objects behind this are ignored. This removes background distractions but allows the workspace to move with the user.

The tabletop display, which is directly in front of the user, employs a traditional multitouch overlay on a high-definition display. However, the space above the tabletop display is within the workspace for multitouchless recognition. So, by sensing the area containing the user's arms, the system uses gestures as input for interacting with both the screen and tabletop.

The tabletop display visualizes data in the semantic space, determined through machine learning. The distance between objects represents the semantic similarity of documents in the dataset, projected through multidimensional scaling into an interactive finite-element simulation. (A finite-element simulation consists of a set of nodes connected by edges. Each node and edge has synthetic physical properties defining how they interact. The simulation constantly evaluates these interactions to minimize the overall system's energy.) Users can select datasets and mine the data to extract rules linking the data. These rules provide the semantic similarity.
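The following is a minimal sketch of this kind of layout: nodes connected by springs whose rest lengths encode pairwise dissimilarity, relaxed with simple damping so that proximity on the tabletop comes to reflect semantic similarity. It is a simplified spring model written for this article, not Making Sense's simulation; the stiffness and damping constants are assumptions.

```cpp
// Sketch of an interactive node-edge layout: edges act as springs whose rest
// lengths encode semantic dissimilarity, and repeated relaxation with damping
// drives the system toward a low-energy arrangement. Illustrative only.
#include <cmath>
#include <vector>

struct Node { float x, y, vx = 0, vy = 0; };
struct Spring { int a, b; float restLength; };   // restLength ~ dissimilarity

void relax(std::vector<Node>& nodes, const std::vector<Spring>& springs,
           int iterations, float stiffness = 0.05f, float damping = 0.85f) {
    for (int it = 0; it < iterations; ++it) {
        for (const Spring& s : springs) {
            Node& a = nodes[s.a];
            Node& b = nodes[s.b];
            float dx = b.x - a.x, dy = b.y - a.y;
            float len = std::sqrt(dx * dx + dy * dy) + 1e-6f;
            // Hooke-style force pulling the pair toward its rest length.
            float f = stiffness * (len - s.restLength) / len;
            a.vx += f * dx;  a.vy += f * dy;
            b.vx -= f * dx;  b.vy -= f * dy;
        }
        for (Node& n : nodes) {       // integrate and damp, reducing the energy
            n.x += n.vx;  n.y += n.vy;
            n.vx *= damping;  n.vy *= damping;
        }
    }
}
```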

Data Collection
An example combining data mining and multitouchless interaction was analysis of the content of hard drives. To generate a synthetic dataset, a team of psychologists generated scripts for four individuals, each with different profiles and interests. The scripts contained each subject's atypical online behaviors.

For a month, the scripts were performed on a day-by-day basis, with the scripted behaviors alternating with normal filler behaviors. This generated a hard drive for each subject, containing text, images, and videos from webpages, emails, YouTube videos, Flickr photos, and Word documents—a truly mixed-mode dataset. Digital-forensic scientists then reconstructed this data from the hard drives and produced high-level event timelines of user activity, along with supporting digital evidence from the hard-drive images.

Figure 8. The Making Sense system. By sensing the area containing the user's arms, the system uses gestures as input for interacting with both the screen and tabletop display. (The diagram labels the sensor, the sensor plane, the user plane, a 0.6 m working volume, the projection screen, and the tabletop display.)

Semantic Grouping of Data
The traditional method for clustering or grouping media is to use a large training set of labeled data. Instead, Making Sense lets users find natural groups of similar content based on a handful of seed examples. To do this, it employs two efficient data-mining tools originally developed for text analysis: MinHash and APriori. MinHash quickly estimates set similarity. APriori, often called the shopping-basket algorithm, finds the frequency and association rules in datasets. It's typically used in applications such as market basket analysis (for example, to determine that people who bought item A also bought item B).

The user guides the correct grouping of the diverse media by identifying only a small subset of disparate items—for example, Web, video, image, and emails.5 The pairwise similarity between the media is then used to project the data into a 2D presentation for the tabletop display that preserves the notion of similarity via proximity of items. More recent research has looked at automatically attaching linguistic tags to images harvested from the Internet.6

The approach can also summarize media, efficiently identifying frequently recurring elements in the data. When all the events are used, this can provide an overall gist or flavor of the activities. Furthermore, we can employ discriminative mining, which mines a subset of (positive) events against another subset of (negative) events. Positive and negative sets are chosen by the user, and discriminative mining attempts to find rules that vary between them. This can be used to highlight events or content that's salient in the positive set, letting users identify trends that, for example, describe activity particular to a specific time of day or day of the week.

Figure 9a shows the timeline view of data from the data collection example we described earlier. The horizontal axis is the date; the vertical axis is the time of day. Figure 1 shows a user employing the selection gesture on the timeline view to choose the data subset to mine. As we mentioned before, the Making Sense system presents the data on the tabletop display in terms of semantic similarity (see Figure 9b). To form semantic groups, we use a fast multiscale community detection algorithm7 on the distance matrix to find natural groupings. A video demonstrating the system is at www.ee.surrey.ac.uk/Personal/R.Bowden/multitouchless.

Figure 9. Making Sense screenshots. (a) In the timeline view on the main screen, the horizontal axis is the date and the vertical axis is the time of day. (b) On the tabletop display, the data items' proximity indicates the content's semantic similarity.
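Returning to the MinHash estimator described under "Semantic Grouping of Data," the brief sketch below reduces each item's feature set to a short signature whose agreement rate estimates Jaccard similarity. It is our illustration, not the project's implementation; the integer feature IDs and the hash mixing scheme are assumptions.

```cpp
// Sketch of MinHash set-similarity estimation: each document becomes a set of
// integer feature IDs; k independent hash functions keep the minimum hash per
// document; the fraction of matching minima estimates the Jaccard similarity.
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

std::vector<uint64_t> minHashSignature(const std::vector<uint64_t>& features, int k) {
    std::vector<uint64_t> sig(k, std::numeric_limits<uint64_t>::max());
    for (uint64_t f : features)
        for (int i = 0; i < k; ++i) {
            // Cheap per-function mixing (splitmix64-style), seeded by i.
            uint64_t h = f + 0x9e3779b97f4a7c15ULL * static_cast<uint64_t>(i + 1);
            h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
            h ^= h >> 27; h *= 0x94d049bb133111ebULL;
            h ^= h >> 31;
            if (h < sig[i]) sig[i] = h;
        }
    return sig;
}

// Estimated Jaccard similarity: fraction of signature slots that agree.
float estimatedSimilarity(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    if (a.empty() || b.empty()) return 0.0f;
    int matches = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] == b[i]) ++matches;
    return static_cast<float>(matches) / static_cast<float>(a.size());
}
```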


System Validation
We employed the Making Sense system on a multiclass video dataset with 1,200 videos. With only 44 user-labeled examples, our approach increased the class grouping's accuracy from 72 to 97 percent, compared to the basic MinHash approach.5 (This contrasts with the popular use of equal amounts of training data and test data.)

Making Sense users have reported that they found the gestures intuitive and responsive. While mining data, they used the gestures for in-depth analysis. They used the zoom gesture to look at tight groupings of data in the timeline visualization. They used scrolling both to explore the data, moving between clusters of activity, and to navigate a thumbnail summary of the data.

The users found the timeline representation enlightening because it visualized trends in data usage. For example, the horizontal grouping in Figure 9a indicates repeated evening use of the computer. Also, the real-time data mining let users drill down into the data for increased detail, so that they could fully explore the data.

We plan to improve the system in several ways. Instead of relying on face detection, we'll use the Kinect's skeletal tracker to establish more accurate hand and forearm segmentation. We'll also incorporate a probabilistic framework in which we can combine several hypotheses of fingertips, improving detection and accuracy. Another area of investigation is the classification of more complex hand shapes, moving toward full hand pose estimation. A wider range of detected hand shapes would also allow recognition of more gestures, extending the system's capabilities.

Acknowledgments
The UK Engineering and Physical Sciences Research Council Project Making Sense (grant EP/H023135/1) funded this research. We thank Chris Hargreaves at Cranfield for the digital-forensics work and Margaret Wilson at the University of Liverpool for the psychology work.

References
1. F. Weichert et al., "Analysis of the Accuracy and Robustness of the Leap Motion Controller," Sensors, vol. 13, no. 5, 2013, pp. 6380–6393.
2. C. Keskin et al., "Real Time Hand Pose Estimation Using Depth Sensors," Proc. 2011 IEEE Computer Vision Workshops, 2011, pp. 1228–1234.
3. P. Krejov and R. Bowden, "Multi-touchless: Real-Time Fingertip Detection and Tracking Using Geodesic Maxima," Proc. IEEE 10th Int'l Conf. and Workshops Automatic Face and Gesture Recognition, 2013, pp. 1–7.
4. A. Baak et al., "A Data-Driven Approach for Real-Time Full Body Pose Reconstruction from a Depth Camera," Proc. 2011 IEEE Int'l Conf. Computer Vision (ICCV 11), 2011, pp. 1092–1099.
5. A. Gilbert and R. Bowden, "iGroup: Weakly Supervised Image and Video Grouping," Proc. 2011 IEEE Int'l Conf. Computer Vision (ICCV 11), 2011, pp. 2166–2173.
6. A. Gilbert and R. Bowden, "A Picture Is Worth a Thousand Tags: Automatic Web Based Image Tag Expansion," Computer Vision: ACCV 2012, LNCS 7725, Springer, 2013, pp. 447–460.
7. E. Le Martelot and C. Hankin, "Fast Multi-scale Detection of Relevant Communities in Large-Scale Networks," Computer J., 22 Jan. 2013.

Philip Krejov is a PhD student at the University of Surrey's Centre for Vision, Speech and Signal Processing. His research focuses on human–computer interaction, with interests in head tracking for anamorphic projection and hand pose estimation using graph-based analysis and machine learning. Krejov received a BEng (Hons) in electronic engineering from the University of Surrey and received the prize for best final-year dissertation. Contact him at [email protected].

Andrew Gilbert is a research fellow at the University of Surrey's Centre for Vision, Speech and Signal Processing. His research interests include real-time visual tracking; human activity recognition; intelligent surveillance; and exploring, segmenting, and visualizing large amounts of mixed-modality data. Gilbert received a PhD from the Centre for Vision, Speech and Signal Processing. He's a member of the British Machine Vision Association's Executive Committee and coordinates the association's national technical meetings. Contact him at [email protected].

Richard Bowden is a professor of computer vision and machine learning at the University of Surrey, where he leads the Cognitive Vision Group in the Centre for Vision, Speech and Signal Processing. His research employs computer vision to locate, track, and understand humans, with specific examples in sign language and gesture recognition, activity and action recognition, lip reading, and facial feature tracking. Bowden received a PhD in computer vision from Brunel University, for which he received the Sullivan Doctoral Thesis Prize for the best UK PhD thesis in vision. He's a member of the British Machine Vision Association, a fellow of the UK Higher Education Academy, and a senior member of IEEE. Contact him at [email protected].
