
2016 23rd International Conference on Pattern Recognition (ICPR), Cancún Center, Cancún, México, December 4-8, 2016

3D Gesture-based Interaction for Immersive Experience in Mobile VR

Shahrouz Yousefi∗, Mhretab Kidane†, Yeray Delgado†, Julio Chana†, and Nico Reski∗
∗Department of Media Technology, Linnaeus University, Växjö, Sweden
Email: shahrouz.yousefi@lnu.se, [email protected]
†ManoMotion AB, Stockholm, Sweden
Email: [email protected], [email protected], [email protected]

Abstract—In this paper we introduce a novel solution for real-time 3D hand gesture analysis using the embedded 2D camera of a mobile device. The presented framework is based on forming a large database of hand gestures, including the ground truth information of hand poses and details of the finger joints in 3D. For a query frame captured by the mobile device's camera in real time, the gesture analysis system finds the best match from the database. Once the best match is found, the corresponding ground truth information is used for interaction in the designed interface. The presented framework performs extremely efficient gesture analysis (more than 30 fps) under flexible lighting conditions and complex backgrounds, with dynamic movement of the mobile device. The introduced work is implemented in Android and tested on the Gear VR headset.

I. INTRODUCTION

The rapid development and wide adoption of mobile devices in recent years have been mainly driven by the introduction of novel interaction and visualization technologies. Although touchscreens have significantly enhanced human-device interaction, for the next generation of smart devices, such as Virtual/Augmented Reality (VR/AR) headsets, smart watches, and future smartphones/tablets, users will clearly no longer be satisfied with performing interaction over the limited space of a 2D touchscreen or with extra controllers. They will demand more natural interactions performed with the bare hands in the free space around the smart device [1]. Thus, the next generation of smart devices will require a gesture-based interface that lets the bare hands manipulate digital content directly. In general, 3D hand gesture recognition and tracking have been considered classical computer vision and pattern recognition problems. Although substantial research has been conducted in this area, the state-of-the-art research results are mainly limited to global hand tracking and low-resolution gesture analysis [2]. However, in order to facilitate the natural development of gesture-based interaction, a full analysis of the hand and fingers will be required, which in total incorporates 27 degrees of freedom (DOF) for each hand [3].

The main objectives behind this work are to introduce new frameworks and methods for intuitive interaction with future smart devices. We aim to reproduce real-life experiences in the digital space with high-accuracy hand gesture analysis. Based on comprehensive studies [4], [5], [6], [7], and on feedback from top researchers in the field and major technology developers, it has been clearly verified that today's 3D gesture analysis methods are limited and will not satisfy users' needs in the near future. The presented research results enable large-scale and real-time 3D gesture analysis. This can be used for user-device interaction in real-time applications on mobile and wearable devices, where intuitive and instant 3D interaction is important. VR, AR, and 3D gaming are among the areas that directly benefit from 3D interaction technology. Specifically, with the presented research results we plan to observe how the integration of new frameworks, such as search methods, into existing computer vision solutions facilitates high-degrees-of-freedom gesture analysis. In the presented work, a Gesture Search Engine is introduced as an innovative framework for 3D gesture analysis and is used on mobile platforms to facilitate AR/VR applications. Prototypes of simple application scenarios have been demonstrated based on the proposed technology.

II. RELATED WORK

One of the current enabling technologies for building gesture-based interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One of the existing solutions is to employ glove-based devices, which directly measure the finger positions and joint angles using a set of sensors (i.e., electromagnetic or fiber-optical sensors) [8], [9]. However, glove-based solutions are too intrusive and expensive for natural interaction with smart devices. To overcome these limitations, vision-based hand tracking solutions need to be developed, in which video sequences are analyzed. Capturing hand and finger motions in video sequences is a highly challenging task in computer vision due to the large number of DOF of the hand kinematics. Recently, Microsoft demonstrated how to capture full-body motions using the Kinect [10], [11]. Substantial developments in hand tracking and gesture recognition are based on depth sensors. Sridhar et al. [12] use RGB and depth data for tracking articulated hand motion based on color information and a part-based hand model. Oikonomidis et al. [13] and Taylor et al. [14] track articulated hand motion using RGBD information from the Kinect. Papoutsakis et al. [15] analyze a limited number of hand gestures with an RGBD sensor. Model-based hand tracking using a depth sensor is among the commonly proposed solutions [14], [16]. Oikonomidis et al. [13] introduce articulated hand tracking using a calibrated multi-camera system and optimization methods.

Ballan et al. [17] propose a model-based solution to estimate the pose of two hands using discriminative salient points. Here, the question is whether 3D depth cameras can potentially solve the problem of 3D hand tracking and gesture recognition. This problem has been greatly simplified by the introduction of real-time depth cameras. However, technologies based on depth information for hand tracking and gesture recognition still face major challenges for mobile applications. In fact, mobile applications have at least two critical requirements: computational efficiency and robustness. Feedback and interaction in a timely fashion are assumed, and any latency should not be perceived as unnatural by the human participant. It is doubtful whether most existing technical approaches, including the one used in the Kinect body tracking system, would lead the technical development of future smart devices, due to their inherently resource-intensive nature. Another issue is robustness: solutions for mobile applications should always work, whether indoors or outdoors. This may exclude the possibility of using Kinect-type sensors in uncontrolled environments. Therefore, the original problem is how to provide effective hand tracking and gesture recognition with video cameras. A critical question is whether we can develop alternative video-based solutions that fit future mobile applications better.

1) Bare-hand Gesture Recognition and Tracking: Algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model-based approaches [3], [18], [19], [20], [21], [22]. Appearance-based approaches rely on a direct comparison of hand gestures with 2D image features. The popular image features used to detect human hands include hand colors and shapes, local hand features, optical flow, and so on. The early works on hand tracking belong to this type of approach [4], [5], [23]. The gesture analysis step usually includes feature extraction, gesture detection, motion analysis, and tracking. Pattern recognition methods for detecting and analyzing hand gestures are mainly based on local or global image features. Simple features such as edges, corners, and lines, and more complex features such as SIFT (scale-invariant feature transform), SURF (speeded-up robust features), and FAST (features from accelerated segment test), are widely used in computer vision applications [24], [25]. For dynamic hand gestures, a combination of local/global image features might be useful to detect hand gestures. In general, the drawback of feature-based approaches is that a clean image segmentation is required in order to extract the hand features, which is not a trivial task when the background is cluttered. Furthermore, human hands are highly articulated. It is often difficult to find local hand features due to self-occlusion, and some kind of heuristic is needed to handle the large variety of hand gestures.

Instead of employing 2D image features to represent the hand directly, 3D model-based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera and minimizing the discrepancy between them. Generally, it is easier to achieve real-time performance with appearance-based approaches due to their simpler 2D image features [2]. However, this type of approach can only handle simple hand gestures, like detection and tracking of fingertips. In contrast, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. The bad news is that the 3D hand model is a complex articulated deformable object with 27 DOF. To cover all the characteristic hand images under different views, a very large image database is required, and matching the query images from the video input against all hand images in the database is computationally expensive. This is why most existing model-based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.

To handle this challenging search problem in the high-dimensional space of human hands, efficient indexing technologies from the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text retrieval tools with computer vision techniques in order to improve the efficiency of image retrieval [26]. An Okapi-Chamfer matching algorithm is used in their work, based on the inverted index technique. Athitsos et al. proposed a method that can generate a ranked list of 3D hand configurations that best match an input image [27]. Hand pose estimation is achieved by searching for the closest matches for an input hand image in a large database of synthetic hand images. The novelty of their system is its ability to handle the presence of clutter. Imai et al. proposed a 2D appearance-based method to estimate 3D hand posture [28]. In their method, the variations of possible hand contours around the registered typical appearances are trained from a number of graphical images generated from a 3D hand model. Although the retrieval-based methods are very promising, they are too few to be visible in the field. The reason might be that the approach is still immature, or that the results are not impressive because the tests covered only a very limited database size. Moreover, it might also be a consequence of the success of 3D sensors such as the Kinect in real-time human gesture recognition and tracking. The statistical approaches (random forests, for example) adopted in the Kinect have started to dominate mainstream gesture recognition. This effect is reinforced by the introduction of a new type of depth sensor from Leap Motion, which can run at interactive rates on consumer hardware and interact with moving objects in real time. Despite its attractive demos, the Leap Motion sensor cannot handle the full range of hand shapes. The main reason is that such sensors usually detect and track the presence of fingertips or points in free space when the user's hands enter the sensor's field of view; in fact, they can be used for general hand motion tracking. Regarding the special requirements of mobile applications, such as real-time processing, low complexity, and robustness, it seems that a promising approach to the problem of hand tracking and hand gesture recognition is to use retrieval technologies for search. In order to apply this technology to the next generation of smart devices, a systematic study is needed on how retrieval tools should be applied to gesture recognition, and in particular on how to integrate advanced image search technologies [29]. The key issue is how to relate vision-based gesture analysis to the large-scale search framework. The proposed solution, based on a search framework for gesture analysis, is explained in the following sections.

III. SYSTEM DESCRIPTION

Fig. 1. Overview of the presented system.

The proposed interactive process consists of different components, as demonstrated in Figure 1. An ordinary smartphone with an embedded camera provides the video sequence for the analysis of hand gestures. The Pre-processing component receives the video frames, applies efficient image processing, provides a segmentation of the hand, and finally removes the background noise. Our database consists of a large collection of the required hand gestures, including the global and local information about the hand pose and joints. The Matching component analyzes the similarity of the pre-processed query to the database entries and finds the best match in real time. The information annotated to the best match is analyzed in the Gesture Analysis component, and the detected static/dynamic hand gestures are sent to the output, which is used in the application.

A. Organization of the Database of Hand Gestures

Our database contains a large set of different hand gestures with all potential variations in rotation, positioning, scaling, and deformation. Besides matching the query input against the database, we aim to retrieve the 3D motion parameters for the query image. Since query inputs do not contain any pose information, the best solution is to associate the motion parameters of the best retrieved match from the database with the query. For this reason, we need to annotate the database images with their ground truth motion parameters: the global hand pose as well as the positions of the joints, plus additional flags about each specific gesture. The flags represent the states of each specific gesture in detail. In the following, we explain how the database was created.

The current version of the database features data for four different gesture types: pinch (thumb and index finger), point (using the index finger), grab normal (hand's back facing the user), and grab palm (hand's palm facing the user). Each gesture type is recorded in five rotations and different states, i.e., grab strength (pinch, grab) or tilt (point), for both left and right hands. Figure 2 illustrates the recorded gesture types, including descriptions of the selected rotations and states. Gestures of more than ten participants were recorded in a laboratory setup using a Samsung Galaxy S6. A chroma-key screen was used to assure near-optimal conditions for the later image processing, so that the relevant gesture data could be extracted exclusively from the images.

Fig. 2. Database: Gesture types, rotations, states, and annotation of the joints.

The actual image processing and database creation were performed in MATLAB. Each gesture recording was cropped to a square dimension. Then, each of the 19 joints of the hand was manually marked (see Figure 2) in order to collect its individual x and y coordinates. The coordinate data are normalized from 0 to 1. Keeping track of the joint positions is important for building a joint skeleton of the hand. Once the joint annotation of the processed gestures was completed, binary and edge representations were created based on a resized (100-by-100 pixels) version of the cropped recording and stored as binary data. Consequently, one database entry contains 10000 elements with values of 0 or 1. Finally, a general flag database assists with the unique identification of the gesture's type, rotation, state, and hand (left, right) for each database entry. In practice, the recorded hand poses were processed and the database files were created. The output files contain the individual information, i.e., joint annotation, binary image, edge image, and flag matrix. Table I provides a statistical overview of the conducted gesture database.

TABLE I: DATABASE: OVERVIEW OF GESTURE RECORDINGS (INCL. SUM OF DATABASE ENTRIES PER GESTURE TYPE)

Gesture type    Rotations   States   Hands   Sum
Pinch           5           5        2       650
Point           5           3        2       550
Grab (normal)   5           7        2       850
Grab (palm)     5           7        2       850
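To make the entry layout concrete, the sketch below shows how one such database entry could be represented. This is our illustrative reconstruction, not the authors' actual data layout; the type and field names (DatabaseEntry, HandJoint, and so on) are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// One annotated joint, with coordinates normalized to [0, 1].
struct HandJoint {
    float x;
    float y;
};

// The four recorded gesture types and the two hands (see Table I).
enum class GestureType : std::uint8_t { Pinch, Point, GrabNormal, GrabPalm };
enum class Hand : std::uint8_t { Left, Right };

// Minimal sketch of one database entry: a 100x100 binary image
// (10000 elements with 0/1 values), the matching edge representation,
// the 19 manually annotated joints, and the flag data identifying
// the gesture's type, rotation, state, and hand.
struct DatabaseEntry {
    std::bitset<100 * 100> binaryImage;
    std::bitset<100 * 100> edgeImage;
    std::array<HandJoint, 19> joints;
    GestureType type;
    std::uint8_t rotation;  // one of the five recorded rotations
    std::uint8_t state;     // grab strength (pinch/grab) or tilt (point)
    Hand hand;
};
```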

B. Pre-processing

Pre-processing is an essential step that segments the hand from the background, so that an efficient search and matching process can be performed on the database of hand gestures. Pre-processing starts by capturing the RGB frame and converting it to the YCbCr color space. Since the Cr and Cb channels are less sensitive to illumination changes, they can be used to define a segmentation function and localize the hand area. The luminance channel (Y) is ignored, as it varies considerably between frames. To increase the contrast between hand and background, we define a weighted image from the Cb and Cr channels (see Equation 1). In order to analyze the hand color, maximize the contrast between hand and background in the weighted image, and segment a clean hand, different sets of samples from both hand and background have been analyzed. In the implemented system, each sample is a small patch of 3-by-3 pixels, and the median value of each patch represents the color information of the sample. For segmentation, the following equations are considered:

z_i = \alpha H_{Cr_i} - \beta H_{Cb_i}    (1)

t_i = \alpha B_{Cr_i} - \beta B_{Cb_i},  i = 1, ..., n    (2)

where Z and T denote the weighted images of the hand and background samples, respectively; H_{Cr} and H_{Cb} represent the Cr and Cb color information of the hand samples, and B_{Cr} and B_{Cb} that of the background. A large training set of different hands, lighting conditions, and backgrounds has been used to tune the parameters in these equations and reach a clean segmentation of the hand. Therefore, the following objective is maximized to tune \alpha and \beta:

\arg\max_{\alpha, \beta} \|Z - T\|    (3)

Fig. 3. Analysis of the sample images for tuning the segmentation parameters.

Based on the calculated parameters, a threshold on pixel values is applied to segment the hand from the background and form a binary image of the hand. The threshold value, \alpha, and \beta are first defined in an offline step for presets of different environments. If the segmentation function in the default mode does not provide a clean segmentation of the hand, the user can easily switch to other modes such as Bright, Dark, or Complex background. The implemented system also includes an adaptive analysis that re-samples the hand colors by taking the values from the last known gesture. This allows the system to automatically adapt to changes in lighting or background throughout a session: by re-sampling and re-calculating the Cb, Cr, and threshold values, the system becomes more tolerant to changes in the environment. After creating the binary image, we find the largest object. Since the thresholds and contrasts favor the hand, the hand will be the largest object in the frame. This area is defined as the ROI (region of interest), as it is expected to hold the hand. The ROI is then cropped out of the frame and normalized to a size that matches the width and height of the database gestures.
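A minimal OpenCV sketch of this pre-processing chain, under our own assumptions, might look as follows; segmentHand is a hypothetical name, and the alpha, beta, and thresh parameters stand in for the offline-tuned presets described above.

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/opencv.hpp>

// Sketch: weighted Cr/Cb image (cf. Eq. 1), fixed threshold, largest
// connected component as the hand, ROI crop, and normalization to the
// 100x100 size of the database gestures.
cv::Mat segmentHand(const cv::Mat& frameBGR,
                    double alpha, double beta, double thresh) {
    cv::Mat ycrcb;
    cv::cvtColor(frameBGR, ycrcb, cv::COLOR_BGR2YCrCb);
    std::vector<cv::Mat> ch;
    cv::split(ycrcb, ch);  // ch[0] = Y (ignored), ch[1] = Cr, ch[2] = Cb

    // Weighted image: alpha*Cr - beta*Cb, maximizing hand/background contrast.
    cv::Mat weighted;
    cv::addWeighted(ch[1], alpha, ch[2], -beta, 0.0, weighted, CV_32F);

    // Threshold the weighted image to obtain a binary hand mask.
    cv::Mat binary;
    cv::threshold(weighted, binary, thresh, 255.0, cv::THRESH_BINARY);
    binary.convertTo(binary, CV_8U);

    // The tuned thresholds favor the hand, so keep the largest object.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL,
                     cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return cv::Mat();
    const auto largest = std::max_element(
        contours.begin(), contours.end(),
        [](const std::vector<cv::Point>& a, const std::vector<cv::Point>& b) {
            return cv::contourArea(a) < cv::contourArea(b);
        });

    // Crop the ROI and normalize it to the database gesture size.
    cv::Mat query;
    cv::resize(binary(cv::boundingRect(*largest)), query,
               cv::Size(100, 100), 0, 0, cv::INTER_NEAREST);
    return query;
}
```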

C. Real-time Matching and Similarity Analysis

The output of the pre-processing is a normalized binary vector of the region of interest in the query frame, containing the cleanly segmented hand with a minimum level of background noise. Ideally, the segmented hand would be identical to specific entries in the database if an extremely large database of hand gestures were available. In practice, we perform a similarity analysis to find the closest match in the database for a captured query gesture. In our system we have experimented with L1 and L2 norms to match the query with the closest database entry in real time. Based on the conducted experiments, the similarity level required for a query to be recognized as a specific gesture varies per category. For instance, the similarity of a query to entries of the pointer category should be higher than 83% to maximize the probability of correct recognition as a pointer. The similarity criterion varies between different categories of gestures, from 80% to 86%. In order to significantly improve the efficiency of the search process, a selective search method is introduced. In the selective search, the priority when searching the database for a given query is based on the likelihood of the gesture detected in previous frames. Therefore, the dimension of the search can be significantly reduced. This process is explained in the following section.

1) Gesture Mapping and Selective Search: In order to perform smooth and fast retrieval, we analyze the database images with dimensionality reduction methods to cluster similar hand gestures. This indicates which gestures are close to each other and fall in the same neighborhood in the high-dimensional space.
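As an illustration of the basic (non-selective) matching step, the sketch below compares the binary query against every entry using the L1 distance, which for binary images is the Hamming distance. The names findBestMatch and categoryThreshold are hypothetical, the thresholds correspond to the 80-86% range reported above, and the entry layout reuses the hypothetical DatabaseEntry sketched in Section III-A.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kPixels = 100 * 100;

// Similarity of two binary images: 1 minus the normalized L1 (Hamming)
// distance, i.e. the fraction of matching pixels.
double similarity(const std::bitset<kPixels>& query,
                  const std::bitset<kPixels>& entry) {
    return 1.0 - static_cast<double>((query ^ entry).count()) / kPixels;
}

// Linear scan over the database; returns the index of the best entry
// whose similarity passes its category's criterion (0.80-0.86 depending
// on the gesture type), or -1 if no entry qualifies.
int findBestMatch(const std::bitset<kPixels>& query,
                  const std::vector<DatabaseEntry>& db,
                  const std::array<double, 4>& categoryThreshold) {
    int best = -1;
    double bestScore = 0.0;
    for (std::size_t i = 0; i < db.size(); ++i) {
        const double s = similarity(query, db[i].binaryImage);
        if (s >= categoryThreshold[static_cast<std::size_t>(db[i].type)] &&
            s > bestScore) {
            bestScore = s;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The selective search described next avoids this full scan by first probing the neighborhood of the previous frame's best match.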

For dimensionality reduction and gesture mapping, different methods have been tested. The best results, mapping our database images to visually distinguishable patterns in 3D, were achieved with the Isomap method. Since hand movements follow a continuous motion, it is expected that for each frame the best match from the database falls in the neighborhood of the best matches of the previous frames. This significantly reduces the search domain and improves the efficiency of the retrieval. The implemented system uses this method: if the best match is detected with high certainty (a high score), the search for the next frame starts from the neighborhood of the previous match; if none of the entries in the neighborhood meets the similarity criterion for the best match, the search domain is extended to the rest of the database until the best match is found.

2) Motion Averaging: Suppose that for the query images Q_{k-n}, ..., Q_k (k > n), the best database matches have been selected. In order to smooth the retrieved motion over a sequence, an averaging method is applied. Thus, for the (k+1)-th query image, the position/orientation is computed from the estimated positions/orientations of the n previous frames as follows:

P_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} P_{Q_i}, \quad O_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} O_{Q_i}    (4)

Here, P_Q and O_Q represent the estimated position and orientation for the query images, respectively. Position and orientation include all 3D information (translation and rotation parameters with respect to the x, y, and z axes). According to the experiments, averaging performs properly for 3 ≤ n ≤ 5.

IV. EXPERIMENTAL RESULTS

A. Static and Dynamic Gesture Recognition

The capability of the framework to recognize hand gestures highly depends on the size of the database. In other words, we can increase the resolution of the tracking and the number of recognized gestures by extending the database. The framework classifies gestures into two types: static and dynamic. A static gesture refers to a hand pose recognized in a single frame, such as open/closed hand (front and back view), open/closed pinch, and pointer (7 categories in total). However, this approach is likely to fail on some occasions due to runtime noise. To reduce the noise and potential failures, the framework applies a noise reduction method that reads the previous static hand gestures and removes the outliers: it is unlikely that within a short period of time (e.g., 5 frames) the user performs different hand poses. If the framework detects 4 open-hand frames and 1 closed-hand frame among the previous frames, the output for the last 5 frames is an open-hand gesture. The analysis of a frame results in a state that refers to a static gesture. A combination of states over the video sequence results in a dynamic gesture, which can be a click, double click, swipe right/left, grab, or hold up/down (7 in total). In order to detect a dynamic gesture, the framework analyzes the states in the last n frames. For instance, in order to detect a grab gesture, the stored states of the previous frames should contain the transition from the open-hand state (static gesture) to the closed-hand state (static gesture). Based on the conducted experiments, n varies for different dynamic hand gestures and should be selected so as to maximize the naturalness of the interaction and minimize wrong detections.
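A minimal sketch of this temporal filtering, under our own assumptions about the state names and window sizes (both illustrative, not taken from the authors' implementation), is given below: a majority vote over the recent frames suppresses single-frame outliers, and a dynamic gesture such as grab is detected from the open-to-closed transition in the recent state history.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <map>

// Hypothetical static states; the paper reports 7 static categories.
enum class StaticGesture {
    OpenHandFront, ClosedHandFront, OpenHandBack, ClosedHandBack,
    OpenPinch, ClosedPinch, Pointer
};

// Majority vote over the recent frames (oldest first): e.g. four
// open-hand frames and one closed-hand frame still yield "open hand".
StaticGesture filterStatic(const std::deque<StaticGesture>& lastFrames) {
    std::map<StaticGesture, int> votes;
    for (const StaticGesture g : lastFrames) ++votes[g];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& a, const auto& b) {
                                return a.second < b.second;
                            })->first;
}

// Grab: the state history contains a transition from the open-hand
// state to the closed-hand state within the last n frames.
bool isGrab(const std::deque<StaticGesture>& states) {
    for (std::size_t i = 1; i < states.size(); ++i) {
        if (states[i - 1] == StaticGesture::OpenHandFront &&
            states[i] == StaticGesture::ClosedHandFront) {
            return true;
        }
    }
    return false;
}
```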
Fig. 4. The designed interface based on the proposed approach.

B. Implementation for Android/HMD and Interface Design

The framework is developed in C++/OpenCV and has been tested by more than 10 users. The results show similar performance across all users, although they have different skin colors and hand shapes. The tests were conducted in different lighting conditions and backgrounds, with users wearing the VR headset or holding the smartphone in one hand and interacting with the content with the other hand. The algorithm can handle the movements of the user (a moving camera) as long as the image quality is preserved. In order to improve the quality of detection, the camera's auto-focus is deactivated. As demonstrated in Figure 4, a simple interface is designed in Android to show the detected static and dynamic gestures on the camera view while the user is performing different hand gestures. A small box is also rendered in the bottom corner to show the segmented hand in real time. We have designed two applications in Unity to showcase the technology in simple scenarios. The applications have been implemented and tested on Android phones (Samsung Galaxy Note 4 and S6) and a VR headset (Samsung Gear VR). In the applications, users can pick and place objects and select and activate different features, such as playing a set of songs in VR, using hand and finger motions. The rendered hand exactly follows the user's motion in 3D space. As shown, the system handles complex and noisy backgrounds properly. The main requirement for segmentation of the hand is an acceptable level of contrast between hand and background. In different lighting conditions, even in cases where the hand area is affected by background noise, our system can detect the gesture class reliably. Since the system works on normalized hand patterns, there is no restriction on the hand scale in the algorithm. The current database is stored in RAM due to its size (less than 3 MB) and for performance reasons. The average processing time for the end-to-end cycle, including the designed interface in Unity, is less than 35 ms per frame.

The battery consumption of the application is similar to that of current 3D games or open-camera applications on Android. Based on our estimation, the presented system is able to handle a significantly larger number of database entries. Hand patterns from various users increase the database size, but this increase is not significant beyond a certain point at which the diversity is covered. Based on our estimation, a database of 10-20K entries will handle high-resolution 3D joint tracking. With the proposed method, the processing can be handled in real time for the extended database. A benchmark of the proposed system is summarized in Table II.

TABLE II: BENCHMARK OF THE PROPOSED SYSTEM

Average processing time                       30-40 fps
DB and query frame size                       100x100 and 320x240
CPU usage on Galaxy S6                        9-14%
Number of static and dynamic hand gestures    7 and 7

V. CONCLUSION AND FUTURE WORK

In comparison with the existing gesture analysis systems, our proposed technology is unique in several aspects. Clearly, the existing solutions based on a single RGB camera are not capable of handling the analysis of articulated hand motions; they can detect the global hand motion (translations) and track the fingertips [15], [16]. Moreover, they mainly work in stable lighting conditions with a static, plain background. Another group of solutions relies on extra hardware such as depth sensors. These usually perform global hand motion tracking, including the 3D rotation/translation parameters (6 degrees of freedom), and fingertip detection. On the other hand, they cannot be embedded in existing mobile devices due to their size and power limitations. Since our technology does not require complex sensors and, in addition, provides high-degrees-of-freedom motion analysis (global motion and local joint movements), it can be used to recover up to 27 parameters of hand motion. Due to our innovative search-based method, this solution can handle large-scale gesture analysis in real time on both stationary and mobile platforms with minimal hardware requirements. A summary of the comparison between the existing solutions and our contribution is depicted in Table III.

TABLE III: COMPARISON OF THE PROPOSED SYSTEM AND CURRENT TECHNOLOGIES

Features/Approach     2D Camera     Depth Camera    Our approach
Stationary/Mobile     Yes/Limited   Yes/No          Yes/Yes
2D/3D Tracking        Yes/Limited   Yes/Yes         Yes/Yes
Joint analysis        No            Limited         Yes
Degrees of freedom    Low (2-6)     Medium (6-10)   High (10+)

Performance of the proposed solution is highly related to the quality of the database. Building a comprehensive collection of hand gestures that represents all possible cases in interactive applications requires more time and effort. One effective solution is to use a 3D hand model to render all possible hand postures with computer graphics technology and convert the generated gestures into binary representations. The major problem with this approach is that the extracted features are not natural, which directly affects the search process. Recording the hand gestures of real users and converting them into binary hand-shape images seems to be a reasonable approach for extending the database in future work.

REFERENCES

[1] K. Roebuck, Tangible User Interfaces: High-Impact Emerging Technology - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. Emereo Pty Limited, 2011.
[2] A. Sarkar, G. Sanyal, and S. Majumder, "Hand Gesture Recognition Systems: A Survey," 2013.
[3] B. Stenger, "Model-Based Hand Tracking Using A Hierarchical Bayesian Filter," 2006.
[4] S. Yousefi, F. A. Kondori, and H. Li, "Experiencing real 3D gestural interaction with mobile devices," Pattern Recognition Letters, 2013.
[5] S. Yousefi, F. A. Kondori, and H. B. Li, "Camera-Based Gesture Tracking for 3D Interaction Behind Mobile Devices," 2012.
[6] S. Yousefi, "3D photo browsing for future mobile devices," in Proceedings of the 20th ACM International Conference on Multimedia, 2012.
[7] S. Yousefi, H. Li, and L. Liu, "3D Gesture Analysis Using a Large-Scale Gesture Database," in Advances in Visual Computing. Springer, 2014.
[8] A. Kolahi, M. H, T. R, M. A, M. B, and H. M, "Design of a marker-based human motion tracking system," pp. 59-67, 2007.
[9] M. Knecht, A. Dünser, C. T, M. W, and R. G, "A Framework For Perceptual Studies In Photorealistic Augmented Reality," 2011.
[10] M. Tang, "Recognizing Hand Gestures with Microsoft's Kinect," pp. 303-313, 2011.
[11] E. Parvizi and Q. Wu, "Real-Time 3D Head Tracking Based on Time-of-Flight Depth Sensor," 2007.
[12] S. Sridhar, A. O, and C. T, "Interactive Markerless Articulated Hand Motion Tracking using RGB and Depth Data," in Proc. of the Int. Conf. on Computer Vision (ICCV), 2013.
[13] I. Oikonomidis, N. K, K. T, and A. A, "Tracking hand articulations: Relying on 3D visual hulls versus relying on multiple 2D cues," in Ubiquitous Virtual Reality (ISUVR), 2013.
[14] J. Taylor, R. S, and V. R, "User-Specific Hand Modeling from Monocular Depth Sequences," in Computer Vision and Pattern Recognition (CVPR), 2014.
[15] D. Michel, K. P., and A. A. Argyros, "Gesture Recognition Supporting the Interaction of Humans with Socially Assistive Robots," in Advances in Visual Computing. Springer International Publishing, 2014.
[16] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, "Realtime and robust hand tracking from depth," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1106-1113.
[17] L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys, "Motion Capture of Hands in Action using Discriminative Salient Points," in European Conference on Computer Vision (ECCV), Firenze, Oct. 2012.
[18] J. Song, G. Soros, F. Pece, S. Fanello, S. Izadi, C. Keskin, and O. Hilliges, "In-air Gestures Around Unmodified Mobile Devices," in Proceedings of UIST '14, 2014.
[19] R. Y. R. Yang and S. Sarkar, "Gesture Recognition using Hidden Markov Models from Fragmented Observations," 2006.
[20] C. Hardenberg and F. B, "Bare-hand human-computer interaction," 2001.
[21] D. Iwai and K. Sato, "Heat Sensation in Image Creation with Thermal Vision," 2005.
[22] M. Kolsch and M. Turk, "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," 2004.
[23] F. A. Kondori, S. Yousefi, and H. Li, "Real 3D interaction behind mobile phones for augmented environments," in Proceedings - IEEE International Conference on Multimedia and Expo, 2011.
[24] D. G. Lowe, "Distinctive Image Features from Scale-invariant Keypoints," 2004.
[25] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," pp. 346-359, 2008.
[26] H. Zhou and T. Huang, "Okapi-Chamfer Matching for Articulated Object Recognition," pp. 1026-1033, 2005.
[27] V. Athitsos and S. Sclaroff, "Estimating 3D hand pose from a cluttered image," 2003.
[28] A. Imai, N. Shimada, and Y. Shirai, "3-D hand posture recognition by training contour variation," pp. 895-900, 2004.
[29] Y. Cao, C. Wang, L. Zhang, and L. Zhang, "Edgel Index for Large-Scale Sketch-based Image Search," 2011.
