
2016 23rd International Conference on Pattern Recognition (ICPR), Cancún Center, Cancún, México, December 4-8, 2016

3D Gesture-based Interaction for Immersive Experience in Mobile VR

Shahrouz Yousefi∗, Mhretab Kidane†, Yeray Delgado†, Julio Chana†, and Nico Reski∗
∗Department of Media Technology, Linnaeus University, Växjö, Sweden
Email: shahrouz.yousefi@lnu.se, [email protected]
†ManoMotion AB, Stockholm, Sweden
Email: [email protected], [email protected], [email protected]

Abstract—In this paper we introduce a novel solution for real-time 3D hand gesture analysis using the embedded 2D camera of a mobile device. The presented framework is based on forming a large database of hand gestures, including the ground truth information of hand poses and details of the finger joints in 3D. For a query frame captured by the mobile device's camera in real time, the gesture analysis system finds the best match from the database. Once the best match is found, the corresponding ground truth information is used for interaction in the designed interface. The presented framework performs extremely efficient gesture analysis (more than 30 fps) under flexible lighting conditions and complex backgrounds, with dynamic movement of the mobile device. The introduced work is implemented in Android and tested on the Gear VR headset.

I. INTRODUCTION

The rapid development and wide adoption of mobile devices in recent years have been mainly driven by the introduction of novel interaction and visualization technologies. Although touchscreens have significantly enhanced human-device interaction, for the next generation of smart devices, such as Virtual/Augmented Reality (VR/AR) headsets, smart watches, and future smartphones/tablets, users will clearly no longer be satisfied with performing interaction over the limited space of a 2D touchscreen or with extra controllers. They will demand more natural interactions performed with the bare hands in the free space around the smart device [1]. Thus, the next generation of smart devices will require a gesture-based interface that lets the bare hands manipulate digital content directly. In general, 3D hand gesture recognition and tracking have been considered classical computer vision and pattern recognition problems. Although substantial research has been conducted in this area, the state-of-the-art research results are mainly limited to global hand tracking and low-resolution gesture analysis [2]. However, in order to facilitate the natural development of gesture-based interaction, a full analysis of the hand and fingers will be required, which in total incorporates 27 degrees of freedom (DOF) for each hand [3].

The main objectives behind this work are to introduce new frameworks and methods for intuitive interaction with future smart devices. We aim to reproduce real-life experiences in the digital space with high-accuracy hand gesture analysis. Based on comprehensive studies [4], [5], [6], [7], and on feedback from top researchers in the field and major technology developers, it has been clearly verified that today's 3D gesture analysis methods are limited and will not satisfy users' needs in the near future. The presented research results enable large-scale and real-time 3D gesture analysis. This can be used for user-device interaction in real-time applications on mobile and wearable devices, where intuitive and instant 3D interaction is important. VR, AR, and 3D gaming are among the areas that directly benefit from 3D interaction technology. Specifically, with the presented research results we plan to observe how the integration of new frameworks, such as search methods, into existing computer vision solutions facilitates high-degrees-of-freedom gesture analysis. In the presented work, a Gesture Search Engine is introduced as an innovative framework for 3D gesture analysis and is used on mobile platforms to facilitate AR/VR applications. Prototypes of simple application scenarios have been demonstrated based on the proposed technology.

II. RELATED WORK

One of the current enabling technologies for building gesture-based interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One of the existing solutions is to employ glove-based devices, which directly measure the finger positions and joint angles using a set of sensors (i.e., electromagnetic or fiber-optical sensors) [8], [9]. However, glove-based solutions are too intrusive and expensive for natural interaction with smart devices. To overcome these limitations, vision-based hand tracking solutions need to be developed, in which video sequences are analyzed. Capturing hand and finger motions in video sequences is a highly challenging task in computer vision due to the large number of DOF of the hand kinematics. Recently, Microsoft demonstrated how to capture full-body motions using the Kinect [10], [11]. Substantial developments in hand tracking and gesture recognition are based on depth sensors. Sridhar et al. [12] use RGB and depth data for tracking articulated hand motion based on color information and a part-based hand model. Oikonomidis et al. [13] and Taylor et al. [14] track articulated hand motion using RGBD information from the Kinect. Papoutsakis et al. [15] analyze a limited number of hand gestures with an RGBD sensor. Model-based hand tracking using a depth sensor is among the commonly proposed solutions [14], [16]. Oikonomidis et al. [13] introduce articulated hand tracking using a calibrated multi-camera system and optimization methods.

Ballan et al. [17] propose a model-based solution to estimate the pose of two hands using discriminative salient points. Here, the question is whether 3D depth cameras can potentially solve the problem of 3D hand tracking and gesture recognition. This problem has been greatly simplified by the introduction of real-time depth cameras. However, technologies based on depth information for hand tracking and gesture recognition still face major challenges for mobile applications. In fact, mobile applications have at least two critical requirements: computational efficiency and robustness. Feedback and interaction in a timely fashion are assumed, and any latency should not be perceived as unnatural by the human participant. It is doubtful whether most existing technical approaches, including the one used in the Kinect body tracking system, would lead the technical development of future smart devices, due to their inherently resource-intensive nature. Another issue is robustness: solutions for mobile applications should always work, whether indoors or outdoors. This may exclude the possibility of using Kinect-type sensors in uncontrolled environments. Therefore, the original problem is how to provide effective hand tracking and gesture recognition with video cameras. A critical question is whether we can develop alternative video-based solutions that fit future mobile applications better.

1) Bare-hand Gesture Recognition and Tracking: Algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model-based approaches [3], [18], [19], [20], [21], [22]. Appearance-based approaches rely on a direct comparison of hand gestures with 2D image features. The popular image features used to detect human hands include hand colors and shapes, local hand features, optical flow, and so on. The early works on hand tracking belong to this type of approach [4], [5], [23]. The gesture analysis step usually includes feature extraction, gesture detection, motion analysis, and tracking. Pattern recognition methods for detecting and analyzing hand gestures are mainly based on local or global image features. Simple features such as edges, corners, and lines, and more complex features such as SIFT (scale-invariant feature transform), SURF (speeded-up robust features), and FAST (features from accelerated segment test), are widely used in computer vision applications [24], [25]. For dynamic hand gestures, a combination of local/global image features might be useful to detect hand gestures. In general, the drawback of feature-based approaches is that a clean image segmentation is required in order to extract the hand features, which is not a trivial task when the background is cluttered. Furthermore, human hands are highly articulated. It is often difficult to find local hand features due to self-occlusion, and some kind of heuristic is needed to handle the large variety of hand gestures.

Instead of employing 2D image features to represent the hand directly, 3D model-based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera and minimizing the discrepancy between them. Generally, it is easier to achieve real-time performance with appearance-based approaches due to their simpler 2D image features [2]. However, this type of approach can only handle simple hand gestures, like detection and tracking of fingertips. In contrast, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. The bad news is that the 3D hand model is a complex articulated deformable object with 27 DOF. To cover all the characteristic hand images under different views, a very large image database is required, and matching the query images from the video input against all hand images in the database is computationally expensive. This is why most existing model-based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.

To handle this challenging search problem in the high-dimensional space of human hands, efficient indexing technologies from the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text retrieval tools with computer vision techniques in order to improve the efficiency of image retrieval [26]. An Okapi-Chamfer matching algorithm is used in their work, based on the inverted index technique. Athitsos et al. proposed a method that can generate a ranked list of 3D hand configurations that best match an input image [27]. Hand pose estimation is achieved by searching for the closest matches for an input hand image in a large database of synthetic hand images. The novelty of their system is its ability to handle the presence of clutter. Imai et al. proposed a 2D appearance-based method to estimate 3D hand posture [28]. In their method, the variations of possible hand contours around the registered typical appearances are trained from a number of graphical images generated from a 3D hand model. Although the retrieval-based methods are very promising, they are too few to be visible in the field. The reason might be that the approach is still immature, or that the results are not impressive because the tests covered only a very limited database size. Moreover, it might also be a consequence of the success of 3D sensors such as the Kinect in real-time human gesture recognition and tracking. The statistical approaches (random forests, for example) adopted in the Kinect have started to dominate mainstream gesture recognition. This effect is reinforced by the introduction of a new type of depth sensor from Leap Motion, which can run at interactive rates on consumer hardware and interact with moving objects in real time. Despite its attractive demos, the Leap Motion sensor cannot handle the full range of hand shapes. The main reason is that such sensors usually detect and track the presence of fingertips or points in free space when the user's hands enter the sensor's field of view; in fact, they can be used for general hand motion tracking. Regarding the special requirements of mobile applications, such as real-time processing, low complexity, and robustness, it seems that a promising approach to the problem of hand tracking and hand gesture recognition is to use retrieval technologies for search. In order to apply this technology to the next generation of smart devices, a systematic study is needed on how retrieval tools should be applied to gesture recognition, and in particular on how to integrate advanced image search technologies [29]. The key issue is how to relate vision-based gesture analysis to the large-scale search framework. The proposed solution, based on a search framework for gesture analysis, is explained in the following sections.

III. SYSTEM DESCRIPTION

Fig. 1. Overview of the presented system.

The proposed interactive process consists of different components, as demonstrated in Figure 1. An ordinary smartphone with an embedded camera provides the video sequence for the analysis of hand gestures. The Pre-processing component receives the video frames, applies efficient image processing, provides a segmentation of the hand, and finally removes the background noise. Our database consists of a large collection of the required hand gestures, including the global and local information about the hand pose and joints. The Matching component analyzes the similarity of the pre-processed query to the database entries and finds the best match in real time. The information annotated to the best match is analyzed in the Gesture Analysis component, and the detected static/dynamic hand gestures are sent to the output, which is used in the application.

A. Organization of the Database of Hand Gestures

Our database contains a large set of different hand gestures with all potential variations in rotation, positioning, scaling, and deformation. Besides matching the query input against the database, we aim to retrieve the 3D motion parameters for the query image. Since query inputs do not contain any pose information, the best solution is to associate the motion parameters of the best retrieved match from the database with the query. For this reason, we need to annotate the database images with their ground truth motion parameters: the global hand pose as well as the positions of the joints, plus additional flags about each specific gesture. The flags represent the states of each specific gesture in detail. In the following, we explain how the database was created.

The current version of the database features data for four different gesture types: pinch (thumb and index finger), point (using the index finger), grab normal (hand's back facing the user), and grab palm (hand's palm facing the user). Each gesture type is recorded in five rotations and different states, i.e., grab strength (pinch, grab) or tilt (point), for both left and right hands. Figure 2 illustrates the recorded gesture types, including descriptions of the selected rotations and states. Gestures of more than ten participants were recorded in a laboratory setup using a Samsung Galaxy S6. A chroma-key screen was used to assure near-optimal conditions for the later image processing, so that the relevant gesture data could be extracted exclusively from the images.

Fig. 2. Database: Gesture types, rotations, states, and annotation of the joints.

The actual image processing and database creation were performed in MATLAB. Each gesture recording was cropped to a square dimension. Then, each of the 19 joints of the hand was manually marked (see Figure 2) in order to collect its individual x and y coordinates. The coordinate data are normalized from 0 to 1. Keeping track of the joint positions is important for building a joint skeleton of the hand. Once the joint annotation of the processed gestures was completed, binary and edge representations were created based on a resized (100-by-100 pixels) version of the cropped recording and stored as binary data. Consequently, one database entry contains 10000 elements with values of 0 or 1. Finally, a general flag database assists with the unique identification of the gesture's type, rotation, state, and hand (left, right) for each database entry. In practice, the recorded hand poses were processed and the database files were created. The output files contain the individual information, i.e., joint annotation, binary image, edge image, and flag matrix. Table I provides a statistical overview of the conducted gesture database.

TABLE I: DATABASE: OVERVIEW OF GESTURE RECORDINGS (INCL. SUM OF DATABASE ENTRIES PER GESTURE TYPE)

Gesture type    Rotations   States   Hands   Sum
Pinch           5           5        2       650
Point           5           3        2       550
Grab (normal)   5           7        2       850
Grab (palm)     5           7        2       850
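To make the entry layout concrete, the sketch below shows how one such database entry could be represented. This is our illustrative reconstruction, not the authors' actual data layout; the type and field names (DatabaseEntry, HandJoint, and so on) are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// One annotated joint, with coordinates normalized to [0, 1].
struct HandJoint {
    float x;
    float y;
};

// The four recorded gesture types and the two hands (see Table I).
enum class GestureType : std::uint8_t { Pinch, Point, GrabNormal, GrabPalm };
enum class Hand : std::uint8_t { Left, Right };

// Minimal sketch of one database entry: a 100x100 binary image
// (10000 elements with 0/1 values), the matching edge representation,
// the 19 manually annotated joints, and the flag data identifying
// the gesture's type, rotation, state, and hand.
struct DatabaseEntry {
    std::bitset<100 * 100> binaryImage;
    std::bitset<100 * 100> edgeImage;
    std::array<HandJoint, 19> joints;
    GestureType type;
    std::uint8_t rotation;  // one of the five recorded rotations
    std::uint8_t state;     // grab strength (pinch/grab) or tilt (point)
    Hand hand;
};
```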

B. Pre-processing

Pre-processing is an essential step that segments the hand from the background, so that an efficient search and matching process can be performed on the database of hand gestures. Pre-processing starts by capturing the RGB frame and converting it to the YCbCr color space. Since the Cr and Cb channels are less sensitive to illumination changes, they can be used to define a segmentation function and localize the hand area. The luminance channel (Y) is ignored, as it varies considerably between frames. To increase the contrast between hand and background, we define a weighted image from the Cb and Cr channels (see Equation 1). In order to analyze the hand color, maximize the contrast between hand and background in the weighted image, and segment a clean hand, different sets of samples from both hand and background have been analyzed. In the implemented system, each sample is a small patch of 3-by-3 pixels, and the median value of each patch represents the color information of the sample. For segmentation, the following equations are considered:

z_i = \alpha H_{Cr_i} - \beta H_{Cb_i}    (1)

t_i = \alpha B_{Cr_i} - \beta B_{Cb_i},  i = 1, ..., n    (2)

where Z and T denote the weighted images of the hand and background samples, respectively; H_{Cr} and H_{Cb} represent the Cr and Cb color information of the hand samples, and B_{Cr} and B_{Cb} that of the background. A large training set of different hands, lighting conditions, and backgrounds has been used to tune the parameters in these equations and reach a clean segmentation of the hand. Therefore, the following objective is maximized to tune \alpha and \beta:

\arg\max_{\alpha, \beta} \|Z - T\|    (3)

Fig. 3. Analysis of the sample images for tuning the segmentation parameters.

Based on the calculated parameters, a threshold on pixel values is applied to segment the hand from the background and form a binary image of the hand. The threshold value, \alpha, and \beta are first defined in an offline step for presets of different environments. If the segmentation function in the default mode does not provide a clean segmentation of the hand, the user can easily switch to other modes such as Bright, Dark, or Complex background. The implemented system also includes an adaptive analysis that re-samples the hand colors by taking the values from the last known gesture. This allows the system to automatically adapt to changes in lighting or background throughout a session: by re-sampling and re-calculating the Cb, Cr, and threshold values, the system becomes more tolerant to changes in the environment. After creating the binary image, we find the largest object. Since the thresholds and contrasts favor the hand, the hand will be the largest object in the frame. This area is defined as the ROI (region of interest), as it is expected to hold the hand. The ROI is then cropped out of the frame and normalized to a size that matches the width and height of the database gestures.
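A minimal OpenCV sketch of this pre-processing chain, under our own assumptions, might look as follows; segmentHand is a hypothetical name, and the alpha, beta, and thresh parameters stand in for the offline-tuned presets described above.

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/opencv.hpp>

// Sketch: weighted Cr/Cb image (cf. Eq. 1), fixed threshold, largest
// connected component as the hand, ROI crop, and normalization to the
// 100x100 size of the database gestures.
cv::Mat segmentHand(const cv::Mat& frameBGR,
                    double alpha, double beta, double thresh) {
    cv::Mat ycrcb;
    cv::cvtColor(frameBGR, ycrcb, cv::COLOR_BGR2YCrCb);
    std::vector<cv::Mat> ch;
    cv::split(ycrcb, ch);  // ch[0] = Y (ignored), ch[1] = Cr, ch[2] = Cb

    // Weighted image: alpha*Cr - beta*Cb, maximizing hand/background contrast.
    cv::Mat weighted;
    cv::addWeighted(ch[1], alpha, ch[2], -beta, 0.0, weighted, CV_32F);

    // Threshold the weighted image to obtain a binary hand mask.
    cv::Mat binary;
    cv::threshold(weighted, binary, thresh, 255.0, cv::THRESH_BINARY);
    binary.convertTo(binary, CV_8U);

    // The tuned thresholds favor the hand, so keep the largest object.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL,
                     cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return cv::Mat();
    const auto largest = std::max_element(
        contours.begin(), contours.end(),
        [](const std::vector<cv::Point>& a, const std::vector<cv::Point>& b) {
            return cv::contourArea(a) < cv::contourArea(b);
        });

    // Crop the ROI and normalize it to the database gesture size.
    cv::Mat query;
    cv::resize(binary(cv::boundingRect(*largest)), query,
               cv::Size(100, 100), 0, 0, cv::INTER_NEAREST);
    return query;
}
```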

C. Real-time Matching and Similarity Analysis

The output of the pre-processing is a normalized binary vector of the region of interest in the query frame, containing the cleanly segmented hand with a minimum level of background noise. Ideally, the segmented hand would be identical to specific entries in the database if an extremely large database of hand gestures were available. In practice, we perform a similarity analysis to find the closest match in the database for a captured query gesture. In our system we have experimented with L1 and L2 norms to match the query with the closest database entry in real time. Based on the conducted experiments, the similarity level required for a query to be recognized as a specific gesture varies per category. For instance, the similarity of a query to entries of the pointer category should be higher than 83% to maximize the probability of correct recognition as a pointer. The similarity criterion varies between different categories of gestures, from 80% to 86%. In order to significantly improve the efficiency of the search process, a selective search method is introduced. In the selective search, the priority when searching the database for a given query is based on the likelihood of the gesture detected in previous frames. Therefore, the dimension of the search can be significantly reduced. This process is explained in the following section.

1) Gesture Mapping and Selective Search: In order to perform smooth and fast retrieval, we analyze the database images with dimensionality reduction methods to cluster similar hand gestures. This indicates which gestures are close to each other and fall in the same neighborhood in the high-dimensional space.
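As an illustration of the basic (non-selective) matching step, the sketch below compares the binary query against every entry using the L1 distance, which for binary images is the Hamming distance. The names findBestMatch and categoryThreshold are hypothetical, the thresholds correspond to the 80-86% range reported above, and the entry layout reuses the hypothetical DatabaseEntry sketched in Section III-A.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kPixels = 100 * 100;

// Similarity of two binary images: 1 minus the normalized L1 (Hamming)
// distance, i.e. the fraction of matching pixels.
double similarity(const std::bitset<kPixels>& query,
                  const std::bitset<kPixels>& entry) {
    return 1.0 - static_cast<double>((query ^ entry).count()) / kPixels;
}

// Linear scan over the database; returns the index of the best entry
// whose similarity passes its category's criterion (0.80-0.86 depending
// on the gesture type), or -1 if no entry qualifies.
int findBestMatch(const std::bitset<kPixels>& query,
                  const std::vector<DatabaseEntry>& db,
                  const std::array<double, 4>& categoryThreshold) {
    int best = -1;
    double bestScore = 0.0;
    for (std::size_t i = 0; i < db.size(); ++i) {
        const double s = similarity(query, db[i].binaryImage);
        if (s >= categoryThreshold[static_cast<std::size_t>(db[i].type)] &&
            s > bestScore) {
            bestScore = s;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The selective search described next avoids this full scan by first probing the neighborhood of the previous frame's best match.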

For dimensionality reduction and gesture mapping, different methods have been tested. The best results, mapping our database images to visually distinguishable patterns in 3D, were achieved with the Isomap method. Since hand movements follow a continuous motion, it is expected that for each frame the best match from the database falls in the neighborhood of the best matches of the previous frames. This significantly reduces the search domain and improves the efficiency of the retrieval. The implemented system uses this method: if the best match is detected with high certainty (a high score), the search for the next frame starts from the neighborhood of the previous match; if none of the entries in the neighborhood meets the similarity criterion for the best match, the search domain is extended to the rest of the database until the best match is found.

2) Motion Averaging: Suppose that for the query images Q_{k-n}, ..., Q_k (k > n), the best database matches have been selected. In order to smooth the retrieved motion over a sequence, an averaging method is applied. Thus, for the (k+1)-th query image, the position/orientation is computed from the estimated positions/orientations of the n previous frames as follows:

P_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} P_{Q_i}, \quad O_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} O_{Q_i}    (4)

Here, P_Q and O_Q represent the estimated position and orientation for the query images, respectively. Position and orientation include all 3D information (translation and rotation parameters with respect to the x, y, and z axes). According to the experiments, averaging performs properly for 3 ≤ n ≤ 5.

IV. EXPERIMENTAL RESULTS

A. Static and Dynamic Gesture Recognition

The capability of the framework to recognize hand gestures highly depends on the size of the database. In other words, we can increase the resolution of the tracking and the number of recognized gestures by extending the database. The framework classifies gestures into two types: static and dynamic. A static gesture refers to a hand pose recognized in a single frame, such as open/closed hand (front and back view), open/closed pinch, and pointer (7 categories in total). However, this approach is likely to fail on some occasions due to runtime noise. To reduce the noise and potential failures, the framework applies a noise reduction method that reads the previous static hand gestures and removes the outliers: it is unlikely that within a short period of time (e.g., 5 frames) the user performs different hand poses. If the framework detects 4 open-hand frames and 1 closed-hand frame among the previous frames, the output for the last 5 frames is an open-hand gesture. The analysis of a frame results in a state that refers to a static gesture. A combination of states over the video sequence results in a dynamic gesture, which can be a click, double click, swipe right/left, grab, or hold up/down (7 in total). In order to detect a dynamic gesture, the framework analyzes the states in the last n frames. For instance, in order to detect a grab gesture, the stored states of the previous frames should contain the transition from the open-hand state (static gesture) to the closed-hand state (static gesture). Based on the conducted experiments, n varies for different dynamic hand gestures and should be selected so as to maximize the naturalness of the interaction and minimize wrong detections.
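A minimal sketch of this temporal filtering, under our own assumptions about the state names and window sizes (both illustrative, not taken from the authors' implementation), is given below: a majority vote over the recent frames suppresses single-frame outliers, and a dynamic gesture such as grab is detected from the open-to-closed transition in the recent state history.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <map>

// Hypothetical static states; the paper reports 7 static categories.
enum class StaticGesture {
    OpenHandFront, ClosedHandFront, OpenHandBack, ClosedHandBack,
    OpenPinch, ClosedPinch, Pointer
};

// Majority vote over the recent frames (oldest first): e.g. four
// open-hand frames and one closed-hand frame still yield "open hand".
StaticGesture filterStatic(const std::deque<StaticGesture>& lastFrames) {
    std::map<StaticGesture, int> votes;
    for (const StaticGesture g : lastFrames) ++votes[g];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& a, const auto& b) {
                                return a.second < b.second;
                            })->first;
}

// Grab: the state history contains a transition from the open-hand
// state to the closed-hand state within the last n frames.
bool isGrab(const std::deque<StaticGesture>& states) {
    for (std::size_t i = 1; i < states.size(); ++i) {
        if (states[i - 1] == StaticGesture::OpenHandFront &&
            states[i] == StaticGesture::ClosedHandFront) {
            return true;
        }
    }
    return false;
}
```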
Fig. 4. The designed interface based on the proposed approach.

B. Implementation for Android/HMD and Interface Design

The framework is developed in C++/OpenCV and has been tested by more than 10 users. The results show similar performance across all users, although they have different skin colors and hand shapes. The tests were conducted in different lighting conditions and backgrounds, with users wearing the VR headset or holding the smartphone in one hand and interacting with the content with the other hand. The algorithm can handle the movements of the user (a moving camera) as long as the image quality is preserved. In order to improve the quality of detection, the camera's auto-focus is deactivated. As demonstrated in Figure 4, a simple interface is designed in Android to show the detected static and dynamic gestures on the camera view while the user is performing different hand gestures. A small box is also rendered in the bottom corner to show the segmented hand in real time. We have designed two applications in Unity to showcase the technology in simple scenarios. The applications have been implemented and tested on Android phones (Samsung Galaxy Note 4 and S6) and a VR headset (Samsung Gear VR). In the applications, users can pick and place objects and select and activate different features, such as playing a set of songs in VR, using hand and finger motions. The rendered hand exactly follows the user's motion in 3D space. As shown, the system handles complex and noisy backgrounds properly. The main requirement for segmentation of the hand is an acceptable level of contrast between hand and background. In different lighting conditions, even in cases where the hand area is affected by background noise, our system can detect the gesture class reliably. Since the system works on normalized hand patterns, there is no restriction on the hand scale in the algorithm. The current database is stored in RAM due to its size (less than 3 MB) and for performance reasons. The average processing time for the end-to-end cycle, including the designed interface in Unity, is less than 35 ms per frame.

The battery consumption of the application is similar to that of current 3D games or open-camera applications on Android. Based on our estimation, the presented system is able to handle a significantly larger number of database entries. Hand patterns from various users increase the database size, but this increase is not significant beyond a certain point at which the diversity is covered. Based on our estimation, a database of 10-20K entries will handle high-resolution 3D joint tracking. With the proposed method, the processing can be handled in real time for the extended database. A benchmark of the proposed system is summarized in Table II.

TABLE II: BENCHMARK OF THE PROPOSED SYSTEM

Average processing time                       30-40 fps
DB and query frame size                       100x100 and 320x240
CPU usage on Galaxy S6                        9-14%
Number of static and dynamic hand gestures    7 and 7

V. CONCLUSION AND FUTURE WORK

In comparison with the existing gesture analysis systems, our proposed technology is unique in several aspects. Clearly, the existing solutions based on a single RGB camera are not capable of handling the analysis of articulated hand motions; they can detect the global hand motion (translations) and track the fingertips [15], [16]. Moreover, they mainly work in stable lighting conditions with a static, plain background. Another group of solutions relies on extra hardware such as depth sensors. These usually perform global hand motion tracking, including the 3D rotation/translation parameters (6 degrees of freedom), and fingertip detection. On the other hand, they cannot be embedded in existing mobile devices due to their size and power limitations. Since our technology does not require complex sensors and, in addition, provides high-degrees-of-freedom motion analysis (global motion and local joint movements), it can be used to recover up to 27 parameters of hand motion. Due to our innovative search-based method, this solution can handle large-scale gesture analysis in real time on both stationary and mobile platforms with minimal hardware requirements. A summary of the comparison between the existing solutions and our contribution is depicted in Table III.

TABLE III: COMPARISON OF THE PROPOSED SYSTEM AND CURRENT TECHNOLOGIES

Features/Approach     2D Camera     Depth Camera    Our approach
Stationary/Mobile     Yes/Limited   Yes/No          Yes/Yes
2D/3D Tracking        Yes/Limited   Yes/Yes         Yes/Yes
Joint analysis        No            Limited         Yes
Degrees of freedom    Low (2-6)     Medium (6-10)   High (10+)

Performance of the proposed solution is highly related to the quality of the database. Building a comprehensive collection of hand gestures that represents all possible cases in interactive applications requires more time and effort. One effective solution is to use a 3D hand model to render all possible hand postures with computer graphics technology and convert the generated gestures into binary representations. The major problem with this approach is that the extracted features are not natural, which directly affects the search process. Recording the hand gestures of real users and converting them into binary hand-shape images seems to be a reasonable approach for extending the database in future work.

REFERENCES

[1] K. Roebuck, Tangible User Interfaces: High-Impact Emerging Technology - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. Emereo Pty Limited, 2011.
[2] A. Sarkar, G. Sanyal, and S. Majumder, "Hand Gesture Recognition Systems: A Survey," 2013.
[3] B. Stenger, "Model-Based Hand Tracking Using A Hierarchical Bayesian Filter," 2006.
[4] S. Yousefi, F. A. Kondori, and H. Li, "Experiencing real 3D gestural interaction with mobile devices," Pattern Recognition Letters, 2013.
[5] S. Yousefi, F. A. Kondori, and H. B. Li, "Camera-Based Gesture Tracking for 3D Interaction Behind Mobile Devices," 2012.
[6] S. Yousefi, "3D photo browsing for future mobile devices," in Proceedings of the 20th ACM International Conference on Multimedia, 2012.
[7] S. Yousefi, H. Li, and L. Liu, "3D Gesture Analysis Using a Large-Scale Gesture Database," in Advances in Visual Computing. Springer, 2014.
[8] A. Kolahi, M. H, T. R, M. A, M. B, and H. M, "Design of a marker-based human motion tracking system," pp. 59-67, 2007.
[9] M. Knecht, A. Dünser, C. T, M. W, and R. G, "A Framework For Perceptual Studies In Photorealistic Augmented Reality," 2011.
[10] M. Tang, "Recognizing Hand Gestures with Microsoft's Kinect," pp. 303-313, 2011.
[11] E. Parvizi and Q. Wu, "Real-Time 3D Head Tracking Based on Time-of-Flight Depth Sensor," 2007.
[12] S. Sridhar, A. O, and C. T, "Interactive Markerless Articulated Hand Motion Tracking using RGB and Depth Data," in Proc. of the Int. Conf. on Computer Vision (ICCV), 2013.
[13] I. Oikonomidis, N. K, K. T, and A. A, "Tracking hand articulations: Relying on 3D visual hulls versus relying on multiple 2D cues," in Ubiquitous Virtual Reality (ISUVR), 2013.
[14] J. Taylor, R. S, and V. R, "User-Specific Hand Modeling from Monocular Depth Sequences," in Computer Vision and Pattern Recognition (CVPR), 2014.
[15] D. Michel, K. P., and A. A. Argyros, "Gesture Recognition Supporting the Interaction of Humans with Socially Assistive Robots," in Advances in Visual Computing. Springer International Publishing, 2014.
[16] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, "Realtime and robust hand tracking from depth," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1106-1113.
[17] L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys, "Motion Capture of Hands in Action using Discriminative Salient Points," in European Conference on Computer Vision (ECCV), Firenze, Oct. 2012.
[18] J. Song, G. Soros, F. Pece, S. Fanello, S. Izadi, C. Keskin, and O. Hilliges, "In-air Gestures Around Unmodified Mobile Devices," in Proceedings of UIST '14, 2014.
[19] R. Y. R. Yang and S. Sarkar, "Gesture Recognition using Hidden Markov Models from Fragmented Observations," 2006.
[20] C. Hardenberg and F. B, "Bare-hand human-computer interaction," 2001.
[21] D. Iwai and K. Sato, "Heat Sensation in Image Creation with Thermal Vision," 2005.
[22] M. Kolsch and M. Turk, "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," 2004.
[23] F. A. Kondori, S. Yousefi, and H. Li, "Real 3D interaction behind mobile phones for augmented environments," in Proceedings - IEEE International Conference on Multimedia and Expo, 2011.
[24] D. G. Lowe, "Distinctive Image Features from Scale-invariant Keypoints," 2004.
[25] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," pp. 346-359, 2008.
[26] H. Zhou and T. Huang, "Okapi-Chamfer Matching for Articulated Object Recognition," pp. 1026-1033, 2005.
[27] V. Athitsos and S. Sclaroff, "Estimating 3D hand pose from a cluttered image," 2003.
[28] A. Imai, N. Shimada, and Y. Shirai, "3-D hand posture recognition by training contour variation," pp. 895-900, 2004.
[29] Y. Cao, C. Wang, L. Zhang, and L. Zhang, "Edgel Index for Large-Scale Sketch-based Image Search," 2011.
