
3D Gesture Recognition and Tracking for Next Generation of Smart Devices

Theories, Concepts, and Implementations

SHAHROUZ YOUSEFI
Department of Media Technology and Interaction Design
School of Computer Science and Communication
KTH Royal Institute of Technology

Doctoral Thesis in Media Technology

Stockholm, February 2014

3D Gesture Recognition and Tracking for Next Generation of Smart Devices: Theories, Concepts, and Implementations
Shahrouz Yousefi
Department of Media Technology and Interaction Design (MID)
School of Computer Science and Communication (CSC)
KTH Royal Institute of Technology
SE-100 44, Stockholm, Sweden
Author’s e-mail: [email protected]

Akademisk avhandling som med tillstånd av Kungliga Tekniska Högskolan framläggs till offentlig granskning för avläggande av Teknologie Doktorsexamen i Medieteknik, måndagen den 17 mars 2014 kl 13:15 i sal F3, Lindstedsvägen 26, Kungliga Tekniska Högskolan, Stockholm.

TRITA-CSC-A-2014-02 ISSN-1653-5723 ISRN-KTH/CSC/A–14/02-SE ISBN-978-91-7595-031-0

Copyright © 2014 by Shahrouz Yousefi, All rights reserved.
Typeset in LaTeX by Shahrouz Yousefi
E-version available at http://kth.diva-portal.org
Printed by E-print AB, Stockholm, Sweden, 2014
Distributor: KTH School of Computer Science and Communication

Abstract

The rapid development of mobile devices during the recent decade has been greatly driven by interaction and visualization technologies. Although touchscreens have significantly enhanced the interaction technology, it is predictable that with future mobile devices, e.g., augmented reality glasses and smart watches, users will demand more intuitive inputs such as free-hand interaction in 3D space. Specifically, for manipulation of the digital content in augmented environments, 3D hand/body gestures will be essential. Therefore, 3D gesture recognition and tracking are highly desired features for interaction design in future smart environments. Due to the complexity of hand/body motions, and the limitations of mobile devices in performing expensive computations, 3D gesture analysis is still an extremely difficult problem to solve.
This thesis aims to introduce new concepts, theories and technologies for natural and intuitive interaction in future augmented environments. The contributions of this thesis support the concept of bare-hand 3D gestural interaction and interactive visualization on future smart devices. The introduced technical solutions enable effective interaction in the 3D space around the smart device. Highly accurate and robust 3D motion analysis of hand/body gestures is performed to facilitate 3D interaction in various application scenarios. The proposed technologies enable users to control, manipulate, and organize digital content in 3D space.

Keywords: 3D gestural interaction, gesture recognition, gesture tracking, 3D visualization, 3D motion analysis, augmented environments.

Shahrouz Yousefi
February 2014

Sammanfattning

Den snabba utvecklingen av mobila enheter under det senaste decenniet har i stor utsträckning drivits av interaktions- och visualiseringsteknologi. Även om pekskärmar avsevärt har förbättrat interaktionstekniken är det förutsägbart att med framtida mobila enheter, t.ex. augmented reality-glasögon och smarta klockor, kommer användare kräva mer intuitiva sätt att interagera, såsom t.ex. frihandsinteraktion i 3D-rymden. Speciellt viktigt blir det vid manipulation av digitalt innehåll i utökade miljöer där 3D hand-/kroppsgester kommer att vara ytterst nödvändiga. Därför är 3D-gestigenkänning och spårning högt önskade egenskaper för interaktionsdesign i framtida smarta miljöer. På grund av komplexiteten i hand-/kroppsrörelser, och begränsningar hos mobila enheter vid dyra beräkningar, är 3D-gestanalys fortfarande ett mycket svårt problem att lösa.
Avhandlingen syftar till att införa nya begrepp, teorier och tekniker för naturlig och intuitiv interaktion i framtida utökade miljöer. Bidragen från denna avhandling stöder begreppet 3D-gestinteraktion med bara händer och interaktiv visualisering på framtida smarta enheter. De införda tekniska lösningarna möjliggör effektiv interaktion i 3D-rymden runt den smarta enheten. Hög noggrannhet och robust 3D-rörelseanalys av hand-/kroppsgester utförs för att underlätta 3D-interaktion i olika tillämpningsscenarier. De föreslagna teknikerna gör det möjligt för användare att kontrollera, manipulera och organisera digitalt innehåll i 3D-rymden.

Nyckelord: 3D-gestinteraktion, gestigenkänning, gestspårning, 3D-visualisering, 3D-rörelseanalys, utökade miljöer.

Shahrouz Yousefi
February 2014

Acknowledgements

First of all, I wish to express my sincere gratitude to my main advisor, Prof. Haibo Li, for providing me this research opportunity. Thank you for your motivation, enthusiasm, and support during these years. Without your supervision and mentoring this thesis would not have been possible. You inspired me to be more adventurous in research. I would like to thank my second advisor, Dr. Li Liu, for all the motivational and fruitful discussions.
Special thanks to my dear friend and colleague, Farid Kondori. We had many collaborations, interesting discussions and enjoyable moments during these years. I would like to thank my former colleagues at Digital Media Lab, Umeå University, for their helpful suggestions and comments on my research projects. Special thanks to Annemaj Nilsson, Mona-Lisa Gunnarsson, and the friendly staff of the department of Applied Physics and Electronics, Umeå University.
My time at KTH was really enjoyable due to the friendly colleagues of the department of Media Technology and Interaction Design. I am grateful for the time spent with them at work meetings, seminars and social events. I must especially thank Prof. Ann Lantz for providing an excellent research environment at the MID department. Thanks for your support, encouragement and kindness. I would also like to thank Henrik Artman, Cristian Bogdan, Ambjörn Naeve, Olle Bälter, Eva-Lotta Sallnäs, and other senior researchers at the MID department for their support and guidance.
Many thanks should go to Dr. Roberto Bresin and Prof. Yngve Sundblad for reviewing my thesis. Your constructive ideas, insightful comments, and suggestions made a great improvement in the quality of my PhD thesis. Winning the first prize in the KTH Innovation Idea Competition, best project work in the Uminova Academic Business Challenge, and being selected as one of the top PhD works at the ACM Multimedia Doctoral Symposium motivated me to work harder on the development of my research ideas.
I would especially like to thank Håkan Borg and Cecilia Sandell from KTH Innovation for their great support on patentability analysis, business development and commercialization of my research results.
Finally, and most importantly, I am grateful to my loving parents, my brother, and his family for giving me endless intellectual support and encouragement to pursue my studies during these years. I would especially like to thank my best friend and companion Shora. Thanks for the wonderful and precious moments we shared together.

Shahrouz Yousefi
February 2014

Contents

Contents v

1 Introduction 1
1.1 Motivation ...... 1
1.2 Research Problem ...... 3
1.2.1 Future Mobile Devices ...... 4
1.2.2 Experience Design ...... 4
1.2.3 Limitations in Interaction Facilities ...... 6
1.2.4 Limitations in Visualization ...... 8
1.2.5 Technical Challenges in 3D Gestural Interaction ...... 8
1.3 Future Trends in Multimedia Context ...... 10
1.3.1 3D Interaction Technology ...... 10
1.3.2 3D Visualization ...... 10
1.3.3 Passive Vision to Active/Interactive Vision ...... 10
1.3.4 Gesture Analysis: from Pattern Recognition Methods to Image-based Search Methods ...... 11
1.4 Research Strategy ...... 11

2 Related Work 13
2.1 Terminology ...... 13
2.2 Related Work ...... 14
2.2.1 3D Technologies in Available Interactive Systems ...... 15

v CONTENTS

2.2.1.1 Passive Motion Tracking and Its Applications ...... 15
2.2.1.2 Active Motion Tracking and Its Applications ...... 16
2.2.1.3 Comparison Between Active and Passive Methods ...... 16
2.2.2 3D Motion Estimation for Mobile Interaction ...... 18
2.2.3 3D Gesture Recognition and Tracking ...... 19
2.2.4 3D Visualization on Mobile Devices ...... 21

3 General Concept and Methodology 23
3.1 General Concept ...... 23
3.1.1 Interaction/Visualization Space ...... 25
3.1.2 Sharing the Interaction/Visualization Space ...... 27
3.1.2.1 Single-user, Single-device ...... 27
3.1.2.2 Multi-user, Multi-device with Shared Interaction Space ...... 28
3.1.2.3 Multi-user, Single-device with Shared Visualization Space ...... 28
3.1.2.4 Interaction from Different Locations for Multi-user Multi-device ...... 28
3.2 Evolution of Interaction/Visualization Spaces ...... 28
3.3 Enabling Media Technologies ...... 31
3.3.1 Vision-based Motion Tracking in 3D Space ...... 32
3.3.2 3D Visualization ...... 33
3.4 Methodology Overview ...... 36
3.5 Gesture Analysis through the Pattern Recognition Methods ...... 38
3.6 Gesture Analysis through the Large-scale Image Retrieval ...... 40

4 Enabling Media Technologies 43
4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition ...... 44


4.1.1 3D Motion Analysis ...... 46
4.2 Gesture Detection and Tracking Based on Gesture Search Engine ...... 48
4.2.1 Providing the Database of Gesture Images ...... 49
4.2.2 Query Processing and Matching ...... 50
4.2.3 Scoring System ...... 50
4.2.4 Quality of Hand Gesture Database ...... 52
4.3 Interactive 3D Visualization ...... 54
4.4 Methods for 3D Visualization ...... 56
4.4.1 Depth Recovery and 3D Visualization from a Single View ...... 56
4.4.2 3D Visualization from Multiple 2D Views ...... 57
4.5 3D Channel Coding ...... 57

5 Experimental Results 59
5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis ...... 59
5.1.1 Camera and Experiment Condition ...... 59
5.1.2 ...... 60
5.1.3 Programming Environment and Results ...... 62
5.2 Experiments on Gesture Search Framework ...... 63
5.2.1 Constructing the Database ...... 63
5.2.2 Forming the Vocabulary Table ...... 65
5.2.3 Gesture Search Engine and Neighborhood Analysis ...... 66
5.2.4 Gesture Search Results ...... 66
5.3 Technical Comparison between the Prior Art and the Proposed Solutions ...... 68
5.4 3D Rendering and Graphical Interface ...... 69
5.5 Research Scenarios ...... 71
5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform ...... 71
5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display ...... 72


5.5.3 3D Rendering and Visualization of 2D Content ...... 73
5.6 Potential Applications ...... 74
5.6.1 3D Photo Browsing ...... 75
5.6.2 Virtual/Augmented Reality ...... 75
5.6.3 Interactive 3D Display ...... 76
5.6.4 Medical Applications ...... 76
5.6.5 3D Games ...... 76
5.6.6 and Reconstruction ...... 77
5.6.7 Wearable AR Displays ...... 77
5.7 Usability Analysis in Object Manipulation: Interaction vs. 3D Gestural Interaction ...... 77
5.7.1 User Test ...... 79
5.7.2 Usability Results ...... 80

6 Concluding Remarks and Future Direction 83
6.1 Contributions ...... 83
6.1.1 Conceptual Models for Future Human Mobile Device Interaction ...... 84
6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization ...... 84
6.1.3 Implementations ...... 85
6.2 Concluding Remarks and Future Direction ...... 86
6.2.1 Technical Challenges ...... 88
6.2.1.1 Active vs. Passive Motion Capture ...... 88
6.2.1.2 Gesture Detection and Tracking without Intelligence ...... 89
6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution ...... 89
6.2.1.4 Contributions of other Research Areas to Computer Vision ...... 90
6.2.2 Further Development ...... 90
6.2.2.1 Concept of Collaborative 3D Interaction ...... 91


6.2.2.2 Concept of Interaction in the Space using Body Gestures ...... 91
6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale ...... 91
6.2.3 Future of Mobile Interaction and Visualization ...... 92

7 Summary of the Selected Articles 95
7.1 List of Publications ...... 97

8 Paper I:

Experiencing Real 3D Gestural Interaction with Mobile Devices 105
8.1 Abstract ...... 106
8.2 Introduction ...... 106
8.3 Related Work ...... 108
8.4 System Description ...... 110
8.4.1 Gesture Detection and Tracking ...... 111
8.4.2 Local Orientation and Double-angle Representation ...... 111
8.4.3 Rotational Symmetries Detection ...... 113
8.4.4 3D Structure from Motion ...... 115
8.4.5 Finger Detection and Tracking ...... 116
8.4.5.1 Fingertip Detection ...... 117
8.4.5.2 Localization by Clustering ...... 118
8.4.5.3 ...... 119
8.4.6 3D Coding and Visualization ...... 119
8.5 Experimental Results ...... 121
8.6 Usability of the Proposed System ...... 124
8.7 Conclusion ...... 125

9 Paper II:


3D Photo Browsing for Future Mobile Devices 133
9.1 Abstract ...... 134
9.2 Motivation ...... 134
9.3 Challenges ...... 135
9.4 Enabling Media Technologies ...... 137
9.4.1 Vision-based Motion Tracking in 3D Space ...... 137
9.4.2 3D Visualization ...... 138
9.5 Design of the 3D Photo Browser ...... 139
9.6 Technical Contributions ...... 142
9.6.1 Gesture Detection and Tracking ...... 142
9.6.2 3D Motion Analysis ...... 143
9.6.3 Methods for 3D Visualization ...... 143
9.7 Concluding Remarks ...... 144
9.8 Future Work ...... 144

10 Paper III:

Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval 147
10.1 Abstract ...... 148
10.2 Introduction ...... 148
10.3 Related Work ...... 150
10.4 System Description ...... 152
10.4.1 Pre-processing on the Database ...... 152
10.4.1.1 Position/Orientation Tagging to the Database ...... 152
10.4.1.2 Defining and Filling the Edge-orientation Table ...... 155
10.4.2 Query Processing and Matching ...... 156
10.4.2.1 Direct Scoring ...... 156
10.4.2.2 Reverse Scoring ...... 159
10.4.2.3 Weighting the Second Level Top Matches ...... 160
10.4.2.4 Dimensionality Reduction for Motion Path Analysis ...... 160


10.4.2.5 Motion Averaging ...... 161
10.5 Experimental Results ...... 162
10.5.1 Dimensionality Reduction for Selective Search ...... 166
10.6 Conclusion and Future Work ...... 167

11 Paper IV:

Interactive 3D Visualization on a 4K Wall-Sized Display 173
11.1 Abstract ...... 174
11.2 Introduction and Related Work ...... 174
11.3 3D Motion Analysis ...... 178
11.4 Error Analysis in 3D Motion Estimation ...... 180
11.5 Experimental Results ...... 182
11.5.1 Visualization on a 4K Wall-sized Display ...... 183
11.6 Conclusion and Future Work ...... 183

12 Paper V:

3D Visualization of Single Images Using Patch Level Depth 187
12.1 Abstract ...... 188
12.2 Introduction ...... 188
12.3 Related Work ...... 189
12.4 Monocular Features for Depth Estimation ...... 190
12.5 Feature Vector ...... 191
12.6 MRF and Depth Map Recovery ...... 193
12.7 Depth Normalization and Pixel Level Translation ...... 194
12.8 Anaglyph 3D Coding ...... 195
12.9 Experimental Results ...... 196
12.10 Conclusion ...... 198

13 Paper VI:


Stereoscopic Visualization of Monocular Images in Photo Collections 201
13.1 Abstract ...... 202
13.2 Introduction ...... 202
13.3 Related Work ...... 203
13.4 System Description ...... 204
13.4.1 SIFT Feature Detection and Matching ...... 204
13.4.2 Image Transformation ...... 205
13.4.3 Image Projection and Stereoscopic Adjustment ...... 206
13.4.4 3D Coding and Visualization ...... 207
13.5 Experimental Results ...... 208
13.6 Conclusion ...... 210

14 Paper VII:

Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid 213
14.1 Abstract ...... 214
14.2 Introduction ...... 214
14.3 Related Work ...... 216
14.4 System Overview ...... 216
14.5 System Description ...... 218
14.5.1 Pre-processing ...... 218
14.5.2 Structure from Motion ...... 219
14.5.3 Uncertainty Analysis ...... 221
14.5.4 Signal Model ...... 223
14.5.5 Measurement Model ...... 224
14.5.6 Data Fusion ...... 225
14.6 Experimental Results ...... 225
14.7 Discussion and Conclusion ...... 226

Bibliography 235


Chapter 1

Introduction

1.1 Motivation

Mobile devices play an important role in the modern world. Beyond the ordinary use in daily life, they are being used by people for various advanced purposes in scientific areas, entertainment, education, medical applications, communication, gaming, etc. The fast-growing market of mobile devices reveals that the sales of mobile devices are overtaking PCs. Recent statistics on the mobile device market indicate that total smartphone sales reached 490 million units in 2011 and 700 million units in 2012 across the globe [1, 2]. With the current rate of growth, the sales of smartphones will exceed 1.5 billion in 2017 [3]. In addition to this enormous number, we should take into account the other types of mobile devices such as tablets, advanced portable gaming devices, digital cameras, camcorders, multimedia players, smart watches, and augmented reality glasses.

The capability of mobile devices in capturing, storing, processing, and visualization of multimedia content has increased significantly during recent years. In addition to high-resolution cameras, other embedded sensors such as GPS, accelerometer, gyroscope and magnetometer provide a chance to collect extra metadata and integrate them in a wide range of application scenarios. Moreover, a variety of sensors might be considered as alternative input facilities for interaction between users and their mobile devices.

The introduction of smartphones has changed the way we interact with mobile phones. Nowadays, people interact with their mobile devices through touchscreen displays. The current technology offers single- or multi-touch gestural interaction on 2D touchscreens. This approach is designed to provide a more natural interaction when users are operating their mobile devices. On the touchscreen of a smartphone, users can manipulate a soft keyboard and virtual objects, and perform actions just by moving their fingers. Although this technology has solved many limitations in human mobile device interaction, the recent trend in the digital world reveals that people always prefer intuitive experiences with their digital devices. For instance, the popularity of the Microsoft Kinect demonstrates that people enjoy experiences that give them the freedom to act like they would in the real world.

Rapid development and wide adoption of smartphones have greatly changed our lives. Nowadays, we rely more and more on our smartphones. It is a strong trend that the smartphone will become a part of our body. An indicative example is Google Glass, which can be seen as a version of the next generation of smartphones. Most probably, for next generation smartphones, users will no longer be satisfied with just performing interaction over a 2D touchscreen; they will demand more natural interactions performed by the bare hands in 3D free space, at the back of the phone, or in front of the smart device, for instance. Thus, the next generation of smart devices will need a gesture interface that lets the bare hands manipulate digital objects directly, for instance, playing Spotify, scanning photo collections, and reading emails. Due to the strong indications and current trends, mobile devices will be an essential, inseparable part of our body in the near future.

In fact, smartphones, tablets or wearable augmented reality glasses will not be just ordinary devices. They will bring any experience to the personalized visualization space from the huge sea of information. For instance, a mobile device might be a guitar, fitness trainer, home theater, shopping center, navigation system, game console or thousands of other possible scenarios.


Currently the major discussion is how we interact with the mobile device, while in the near future we should also consider how we interact through the mobile device with the physical space, objects, information, etc. However, when we discuss the next generation of mobile devices, we should consider the next generation of interaction facilities too. The important question is: in which space, and how, will we interact with and through our future mobile devices?

1.2 Research Problem

Design of the interaction experience for future mobile devices incorporates many challenging problems. The rapid growth in the technology of mobile devices shows that in the near future we will have extremely powerful handheld or wearable devices. Although it is rather hard to exactly predict the hardware capabilities and features of future mobile devices, current trends in multimedia technology indicate that interaction with future devices should happen in a more intuitive and natural manner. Here, some important scientific questions might be considered. First, in which space and how should the intuitive interaction happen? Will touchscreens and track pads be replaced by other input facilities? And how should we design a new space for intuitive interaction with future mobile devices? Intuitive interaction is highly related to the mental connection of humans to their natural experiences. Since humans interact with their environment through physical gestures, 3D hand/body gestures might be effective alternatives to existing interaction facilities.
Assume that the new interaction space is designed and introduced. Now the main challenge is how to technically support this concept. What types of technologies are required to perform this significant change? What are the limitations of media technologies in performing 3D gestural interaction? How can we detect, recognize and track complex hand gestures, head motions and body movements in 3D space? And how can enabling media technologies solve the technical problems?


However, from both design and technical perspectives, introducing new ways of interacting with future mobile devices seems to be an extremely challenging task. This thesis aims to tackle these challenges and introduce new concepts, designs and technical solutions for the mentioned issues. These challenges are discussed in detail in the following sections.

1.2.1 Future Mobile Devices

In the discussion of future mobile devices we have to consider some important points. Five to ten years from now, we will most probably be faced with substantial changes in mobile technology. From a hardware point of view, future mobile devices will feature more advanced and powerful components such as various types of sensors, high-speed processors, 3D displays, and huge memories. It seems that the user experience in interaction with mobile devices will be quite different from now. Therefore, designing any interactive application for future mobile devices needs extensive investigation and research. From a design point of view, the interaction environment and visualization quality will change substantially in the near future. Here, the main challenge is how to design a usable system to enhance the user experience in interaction with future mobile devices.

1.2.2 Experience Design

In the multimedia context, experience is defined as the sensation of interaction with a product, service or event [4]. Therefore, in experience design for mobile users, the quality and sensation of interaction on physiological, affective, and cognitive levels should be taken into account. In fact, for a more convenient and desired experience, interaction between user and device should happen in a natural and effective manner. Unlike interaction with the physical world, where people use their body gestures, the best available technologies in smartphones and tablets rely on interaction on limited 2D touchscreens. This limitation stops users from having a natural interaction in a wide range of applications where using physical gestures for 3D manipulations might be unavoidable. For instance, picking, placing, grabbing, moving, pushing, zooming, and, in general, manipulating virtual menus, objects and graphics in 3D environments require physical hand gestures. In addition, because the interaction happens on the display, in practice, users' fingers or hands cover a large area or some parts of the screen while they are operating the device. As a result, they lose visibility of the display during the interaction. Since the hardware capability of mobile devices is increasing rapidly, the complexity of applications will increase as well. This means that in the near future we will interact with our digital devices in a quite different manner. Another important point to consider is visualization quality. For high quality user perception, realistic visualization is needed. This is the main idea behind the development of 3D display technologies such as 3D cinemas or TVs. It is predictable that in the future, multimedia content will be displayed in 3D format. Therefore, adaptation of old content to future visualization technologies should be considered. For instance, we need to find an effective way to convert our old multimedia collections, such as 2D photo albums and videos, to 3D, which is quite a challenging problem. However, experience design for future mobile devices seems to be a difficult task from both interaction and visualization perspectives.

Quality of user experience is rather a difficult concept to define, measure and evaluate. Although substantial research has been done on this subject, finding a straightforward method to measure the quality of experience (QoE) is still challenging. Usability is an important criterion to consider when we investigate the QoE concept. Usability might be perceived from three angles: efficiency, effectiveness and user satisfaction [5]. From a technical point of view, these three factors have been found to be more practical to measure and evaluate. Therefore, improving the usability factors might significantly enhance the quality of user experience in the multimedia context.


1.2.3 Limitations in Interaction Facilities

Designing interactive applications for mobile devices is still a challenging problem. Although from the processing point of view new devices are quite powerful, due to the limitations in size and weight for portability purposes, many problems remain unsolved. One major problem is how users can effectively communicate with their devices at the hardware level. The current technology provides several solutions to this problem. The commonly used hardware facilities for communicating with mobile devices are miniature keyboards, tiny joysticks and touchscreen displays [6].

Keyboards allow users to perform tasks through the menus, type, search, navigate, etc., but in reality, even a small-sized keyboard occupies a large space and limits the area for the display. Secondly, the usability of those keyboards is questionable for users with large fingers selecting tiny buttons. Substantial research has been done to reduce the size of keyboards, for example simplifying the devices by mapping the QWERTY keyboard to other formats or using few or single buttons [7, 8]. Swype is another known technique for enhancing interaction with a mobile device through the virtual keyboard. With a Swype virtual keyboard, the user enters words by sliding a finger from the first letter of a word to its last letter, lifting only between words.

Although joysticks are useful in some applications for scrolling up and down and selecting menus, they are very limited and difficult to work with, especially on small screens. Nowadays, touchscreen displays are used by most smartphones and tablet PCs, and the trend shows that buttonless devices are becoming more popular. This indicates that users prefer to work on larger screens, and designers allocate the whole device surface to the touchscreen display. On the other hand, touchscreen displays have several drawbacks. First, for typing scenarios a virtual keyboard is rendered on the screen, which occupies a large space for user convenience. Second, in most applications at least one hand works on the surface, which brings the occlusion problem, and in many cases both hands are involved. Therefore, the occlusion limits the visualization and the quality of experience is degraded.


In some contributions, a novel approach of touching the back of the device is presented [9]. Although this solution might work in limited scenarios, users generally lose the matching between visual perception and touch. The commonly used technology in interaction design for mobile devices is a 2D touchscreen display with a single physical button or without any physical button. Since humans interact with the physical world in 3D space, the quality of interaction is degraded when it is mapped to 2D surfaces. Technically speaking, in 3D space the motion is represented by six degrees of freedom (three rotation parameters and three translation parameters), while in the mapping from the real world to 2D screens, the motion parameters are reduced to two. On single-touch displays, motion is limited to translation in 2D (x and y) coordinates, but new products in the market use multi-touch gestures to simulate rotations and translations along the z-axis. However, even with the best interactive devices in the market, the motion parameters are limited on 2D screens. This means that without using extra buttons, 2D gestures, or the aid of embedded orientation sensors, manipulation around the x and y axes on 2D screens is not possible. Due to the fact that in magnetometer-aided applications the device itself should move, the visual content such as graphics, photos and videos might move out of the user's sight, so this approach is not applicable in most cases.
Another important group of mobile devices is the forthcoming augmented reality glasses, such as Google Glass. Since this type of gadget might be the future of smartphones, it is quite important to investigate how conveniently users can interact with them. Voice commands are one effective solution for commanding the device to take a set of actions. For instance, for dialing a number, searching a phrase, or capturing a photo, voice commands might be really useful.
For more complex tasks, such as writing a text, skimming emails or browsing photos, users definitely need more input facilities. Google has introduced a touchscreen bar on the side frame of the Glass to solve this problem. Although this small surface provides more capability for user interaction, it is obviously weaker than current smartphone touchscreens due to its size and invisibility to the user's sight.
However, designing usable and convenient input facilities for future mobile devices requires deep research and investigation. 3D gestural interaction in free space might be an effective alternative to the current interaction technology. Enabling media technologies can support user device interaction and enhance the user experience.
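The reduction from six degrees of freedom to two, noted above, can be made concrete with a minimal sketch. The code below is illustrative only (it is not from the thesis; NumPy and all function names are my own): it contrasts a full 3D rigid motion, parameterized by three rotation angles and three translations, with the two in-plane translation parameters a single-touch drag can express.

```python
import numpy as np

def rigid_transform(points, rotation_xyz, translation_xyz):
    """Apply a full 6-DOF rigid motion (3 rotations + 3 translations)
    to an (N, 3) array of 3D points. Angles are in radians."""
    ax, ay, az = rotation_xyz
    # Elementary rotations about the x, y and z axes.
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    r = rz @ ry @ rx
    return points @ r.T + np.asarray(translation_xyz)

def single_touch_transform(points, dx, dy):
    """A single-touch drag exposes only 2 of those 6 parameters:
    translation in the screen plane (x, y); depth and all rotations
    are out of reach without extra gestures or sensors."""
    return points + np.array([dx, dy, 0.0])
```

The four remaining parameters (translation in z and the three rotations) are exactly what multi-touch gestures or embedded orientation sensors try to simulate, as discussed above.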

1.2.4 Limitations in Visualization

In order to improve the user experience in multimedia applications, in addition to effective interaction, high quality and realistic visualization is required. The main reason behind manufacturing mobile devices with larger displays is to enhance the visual output and the quality of experience. Although size and mobility have been in a design trade-off for many years, due to the importance of visual interaction, mobile devices have become larger in screen size. Substantial experiments have been done to find the optimal and most effective size for different mobile devices [6]. A mobile device, as it is named, should be portable and easy to hold by its user, and this criterion brings the most challenging task of keeping the balance between portability and size, besides the power consumption issue, which is outside our discussion [10]. However, today's smartphones offer only a limited surface for visualization. If wearable smart glasses provide high quality visualization, they might significantly enhance the visual experience. 3D display technology is another feature that might improve the perception quality of future mobile devices.

1.2.5 Technical Challenges in 3D Gestural Interaction

3D gestural interaction is a rather new trend in the multimedia context, and substantial efforts have been made in this area. Specifically, 3D gestural interfaces are used in gaming and entertainment applications. One of the enabling technologies for building such gesture interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One existing solution is to employ glove-based devices, which directly measure the finger joint angles and spatial positions of the hand using a set of sensors (e.g., electromagnetic or fiber-optical sensors). Although such applications exist in human-computer interaction and 3D games, glove-based solutions are too intrusive, cumbersome, and expensive for natural interaction with mobile devices. To overcome this, vision-based hand motion capture and tracking solutions need to be developed. Capturing hand and finger motions in video sequences is a highly challenging task due to the large number of degrees of freedom (DOF) of the hand kinematics. Tracking articulated objects through sequences of images is one of the grand challenges in computer vision. Recently, Microsoft demonstrated how to capture full-body motions by means of its newly developed depth camera, the Kinect. The question is whether the problem of 3D hand tracking and gesture recognition can be solved by using 3D depth cameras. Of course, this problem has been greatly simplified by the introduction of real-time depth cameras. However, technologies based on 3D depth information for hand tracking and gesture recognition still face major challenges in mobile applications.

Mobile applications have at least two critical requirements: computational efficiency and robustness. For mobile applications, timely feedback and interaction are assumed: any latency should not be perceived as unnatural by the human participant. Therefore, the maximum time between the completion of a gestural action by a person and the response from the device must be no longer than 100 ms (at least 10 frames per second should be processed in real-time vision-based systems). This requires an extremely fast solution for hand tracking and gesture recognition.
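The 100 ms bound translates directly into a per-frame processing budget for the whole pipeline. The sketch below illustrates this budgeting; the stage names and timings are illustrative assumptions, not measured values from any real system:

```python
# Check whether a vision pipeline meets the 100 ms end-to-end latency
# bound, i.e. sustains at least 10 processed frames per second.
# Stage timings (milliseconds) are illustrative assumptions.
stage_ms = {
    "capture": 15,          # grab a frame from the camera
    "detection": 40,        # locate the hand gesture in the frame
    "pose_estimation": 25,  # recover the 3D motion parameters
    "render": 10,           # update the displayed content
}

total_ms = sum(stage_ms.values())  # end-to-end latency per frame
fps = 1000.0 / total_ms            # achievable frame rate

print(f"latency: {total_ms} ms, {fps:.1f} fps")
assert total_ms <= 100, "pipeline exceeds the 100 ms interaction budget"
```

With these assumed timings the budget is met with 10 ms to spare; doubling the detection stage would already violate the constraint, which is why detection dominates the design of mobile gesture pipelines.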
It is doubtful whether most existing technical approaches, including the one used in the Kinect body-tracking system, will be the direction that leads the technical development of future mobile devices, due to their inherently resource-intensive nature. Another issue is robustness. Solutions for mobile applications should work reliably both indoors and outdoors. This may well exclude the possibility of using Kinect-type depth sensors in the next generation of mobile devices. Therefore,

we come back to our original problem again: how to solve hand tracking and gesture recognition with video cameras. A critical question is whether we can develop alternative video-based solutions to hand tracking and gesture recognition that fit future mobile applications better. Obviously, this question is important to address, since it is not only one of the fundamental problems in computer vision, but it would also have a potential impact on the mobile industry and, above all, on the interaction with mobile devices in the future.

1.3 Future Trends in Multimedia Context

Contributions of this thesis are highly inspired by the following key trends in the multimedia context.

1.3.1 3D Interaction Technology

The major direction of interaction technology is toward more intuitive and natural interaction between users and digital devices. This is the main reason that keyboards, joysticks, and other traditional input facilities have largely been replaced by track pads and touchscreens. The rapid development of new sensors for 3D interaction, such as the Microsoft Kinect, is another indication of this trend.

1.3.2 3D Visualization

Visualization quality has improved significantly during the recent decade. The major trend indicates that 2D displays are being replaced by 3D technology. Realistic perception and quality of experience are important features introduced by 3D display technology.

1.3.3 Passive Vision to Active/Interactive Vision

The introduction of wearable smart displays such as Google Glass reveals a new trend in the multimedia world. Augmented reality glasses and similar products will change the user's perception from passive to active or interactive vision. In fact, users might interact with the environment through the wearable display, receive information from various channels, and command the display.

1.3.4 Gesture Analysis: from Computer Vision Methods to Image-based Search Methods

Gesture detection, recognition, and tracking have mainly been treated as classical computer vision and pattern recognition problems. The capability of new devices to store and process large databases motivates the idea of solving these problems using image-based search approaches. Therefore, the development of search methods for visual content might be the future approach to gesture analysis.

1.4 Research Strategy

The main objective behind this research is to develop concepts and technologies for effective interaction with future mobile devices. In order to fulfill this objective, challenges from both design and technical aspects should be considered. This thesis aims to cover the concept of human-mobile device interaction, future challenges, and possible new frameworks to improve the quality of user experience. Afterwards, technical solutions to overcome these challenges are introduced and experimental results are demonstrated. The main research strategies towards achieving these goals can be summarized in the following items:

• Concepts of interaction and visualization spaces have been deeply investigated during this work. The idea of extending the interaction and visualization spaces to real 3D space, sharing the interaction/visualization spaces, and potential application scenarios are introduced.

• In order to support the future interaction/visualization concept, enabling media technologies have been deeply studied and widely used during this research.

• 3D gestural interaction is suggested as a powerful tool for future interaction technology. Different methods for gesture detection, recognition, tracking, and 3D motion analysis have been studied, and new methods supporting this concept have been developed during this research.

• Concepts of active and passive vision have been investigated and studied during this work. An interactive framework for 3D displays has been introduced. Moreover, new methods for 3D visualization of multimedia content have been developed.

• Various implementations of gestural interaction and 3D visualization, and different experiments, have been conducted on stationary and mobile platforms. Experimental results have been compared and final conclusions have been drawn.

• The future direction of mobile multimedia, potential application scenarios, enabling technologies, and new frameworks have been investigated.

Chapter 2

Related Work

2.1 Terminology

Nowadays, gesture-based interaction is a strong trend in the multimedia context. In general, effective 3D gestural interaction can be achieved by combining technical solutions in gesture analysis with the design of usable and efficient applications. Since the main focus of this thesis is on the technical aspects of gesture analysis, it is crucial to provide a comprehensive definition of the technical keywords and expressions that are frequently used in the discussions.

Gesture recognition: gesture recognition is the process of interpreting human gestures using mathematical models or computer vision algorithms. Gesture recognition is widely considered for communication between users and computers. Various hand gestures might be used for commanding digital devices in different tasks. In this context, gesture recognition is the process of differentiating between various hand gestures and assigning different labels to them. For instance, all variations of the grab gesture in different poses and orientations should be recognized as the grab gesture.

Gesture detection: the process of detecting the presence of a gesture pattern in an image frame is known as gesture detection. In this context, for a specific hand gesture, the detection output indicates the presence or absence of the gesture pattern in an image frame.

Gesture localization: the process of returning the estimated position of the detected gesture in an image frame is known as gesture localization. The location of the gesture might be returned through different parameters such as a bounding box, ellipse axes, or the center of mass.

Gesture tracking: the process of gesture localization in a video sequence is known as gesture tracking. Gesture tracking might be performed by localizing the gesture in each single frame of an image sequence. Alternatively, following the motion of the localized gesture across consecutive frames might be considered gesture tracking.
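The localization and tracking definitions above can be sketched in a few lines. This is a minimal pure-Python illustration on hand-made binary masks (1 = gesture pixel); a real system would, of course, obtain the masks from a detector running on camera frames:

```python
# Gesture localization (bounding box + center of mass) on a binary mask,
# and gesture tracking as per-frame localization over a sequence.

def localize(mask):
    """Return ((top, left, bottom, right), center_of_mass) or None."""
    pts = [(r, c) for r, row in enumerate(mask)
                  for c, v in enumerate(row) if v]
    if not pts:
        return None  # gesture absent in this frame
    rows = [p[0] for p in pts]
    cols = [p[1] for p in pts]
    bbox = (min(rows), min(cols), max(rows), max(cols))
    center = (sum(rows) / len(pts), sum(cols) / len(pts))
    return bbox, center

def track(frames):
    """The trajectory is the sequence of centers in consecutive frames."""
    return [loc[1] for f in frames if (loc := localize(f)) is not None]

frame1 = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
frame2 = [[0, 0, 0],
          [0, 1, 1],
          [0, 1, 1]]
print(track([frame1, frame2]))  # → [(0.5, 1.5), (1.5, 1.5)]
```

The gesture's center moves down by one row between the two frames, which is exactly the motion a tracker would report.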

Gesture pose estimation: estimating the position and orientation of the detected gesture with respect to the camera origin is pose estimation. In this thesis, 3D pose refers to position (three parameters) and orientation (three parameters) with respect to the camera coordinate system.
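The six pose parameters can be made concrete as a rigid transform: three Euler angles composed into a rotation matrix, plus a translation vector. The following is a schematic illustration of this convention (the z-y-x rotation order and the sample values are assumptions for illustration, not the thesis implementation):

```python
import math

# A 6-DOF pose: three rotation parameters (Euler angles about x, y, z)
# and three translation parameters, mapping a point from gesture
# coordinates into the camera coordinate system.

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_pose(point, rx, ry, rz, t):
    """Transform a 3D point by the pose (rx, ry, rz, t): R p + t."""
    R = matmul(rot_z(rz), matmul(rot_y(ry), rot_x(rx)))
    return [sum(R[i][k] * point[k] for k in range(3)) + t[i]
            for i in range(3)]

# Rotate 90 degrees about z and translate 10 cm along the camera axis.
p = apply_pose([1.0, 0.0, 0.0], 0.0, 0.0, math.pi / 2, [0.0, 0.0, 0.1])
print([round(v, 6) for v in p])  # → [0.0, 1.0, 0.1]
```

Recovering these six parameters from the camera image is precisely the pose estimation problem discussed throughout the thesis.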

3D gestural interaction: interaction between users and digital devices by means of hand/body gestures in 3D space is regarded as 3D gestural interaction. Gesture detection, localization, recognition, tracking, and 3D pose estimation are the essential components of gestural interaction.

2.2 Related Work

3D technology and the research areas around it have developed rapidly in recent years. Although substantial efforts have been made to equip devices with 3D technology, numerous problems and challenges remain. Nowadays, the main focus of 3D research is on 3D visualization. For instance, in 3D cinemas, 3D TVs, 3D digital cameras, and even 3D mobile phones, the main goal is to add 3D features to the visualization part. In this context, 3D technology is considered from two aspects: first, how the interaction between user and device happens in 3D space, and second, the visualization technology by which digital output is displayed in 3D format.
In the following sections, current motion capture and tracking technologies in interactive products and environments are reviewed first. Afterwards, the applicability of current technologies to mobile applications, as well as the available solutions and related works, are discussed.

2.2.1 3D Motion Capture Technologies in Available Interactive Systems

In recent years, 3D motion analysis has been found useful in various scenarios such as entertainment, virtual/augmented reality, and medical applications [11]. Different methods have been studied and introduced to effectively retrieve the motion parameters. Since the aim is to interact with the mobile device through the user's gestures in 3D space, different techniques for 3D motion tracking and analysis must be studied. If we successfully retrieve the 3D motion parameters with high accuracy, we will be able to design an effective interaction environment for manipulation of digital content. In the following sections, different approaches for analyzing the 3D motion captured by various sensors in different setups are introduced and compared.

2.2.1.1 Passive Motion Tracking and Its Applications

The most common method of motion analysis uses static cameras. This tracking method is known as passive: the cameras are static and the subjects move. The Microsoft Kinect is one example of passive motion analysis, where the sensor is mounted somewhere in the room and users move in front of it. The Kinect features an RGB camera and a depth sensor that provide full-body 3D motion capture and facial recognition [12]. Sony has also added motion capture to its game console [13]. The PlayStation Move performs motion capture through a handheld controller. This controller features a spherical glowing part that can shine in the full range of RGB colors. Based on the size and position of the glowing sphere as captured by the PlayStation camera, the 3D motion is accurately estimated [14, 15, 16]. The passive approach is widely used in medical applications to analyze patients' motions for diagnosing different types of physical disorders. Such systems usually use several expensive cameras mounted at different positions in the room, and wearable markers or special clothes with visible markers on the body joints, detected by the cameras from a distance [17]. In systems that work without any markers or wearable devices, additional sensors such as 3D cameras or depth/distance sensors are usually added to the installation or capturing device [17].
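The way the PlayStation camera infers depth from the glowing sphere's apparent size can be understood with a simple pinhole camera model: a sphere of known radius R that appears r pixels wide in an image taken with focal length f (in pixels) lies at distance Z = f R / r. The sketch below illustrates this relation; the focal length and sphere radius are illustrative assumptions, not the actual PlayStation Eye parameters:

```python
# Depth from apparent size under a pinhole camera model:
# image radius r = f * R / Z  =>  distance Z = f * R / r.

def distance_from_radius(f_px, sphere_radius_m, image_radius_px):
    """Estimate the distance (meters) of a sphere of known radius."""
    return f_px * sphere_radius_m / image_radius_px

f_px = 800.0  # focal length in pixels (assumed)
R = 0.02      # 2 cm sphere radius (assumed)
print(distance_from_radius(f_px, R, 40.0))  # ~0.4 m
print(distance_from_radius(f_px, R, 20.0))  # half the size -> ~0.8 m
```

Halving the apparent radius doubles the estimated distance, which is why a single RGB camera suffices for coarse depth once the marker's physical size is known.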

2.2.1.2 Active Motion Tracking and Its Applications

Active motion capture configurations estimate the 3D motion by using wearable devices. These wearable devices might be different types of sensors that measure 3D motion parameters such as orientation and acceleration changes, and they transmit this information to a base station for processing. The Wii MotionPlus game controller performs motion analysis in an active configuration: the device incorporates a gyroscope and an accelerometer to accurately capture and report the 3D motion [18]. In high-accuracy virtual reality and medical applications, similar types of sensors are extensively used on body joints [17].

2.2.1.3 Comparison Between Active and Passive Methods

Active motion analysis usually provides more accurate results due to its higher resolution in measuring the motion parameters. Since the sensors are mounted on body parts, they can measure body motions at a higher resolution compared with a passive installation, where motion is captured from a distance. Based on the research conducted in [19], the accuracy of active motion capture is about 10 times higher than that of passive motion capture where a single RGB


Figure 2.1: Passive and active motion tracking in gaming consoles; Left: The Microsoft Kinect; Middle: PlayStation 3 Move; Right: Wii MotionPlus.

camera is used for capturing and measuring the 3D motion. Although the measurement is highly dependent on the motion analysis techniques, this substantial difference shows that, for accuracy reasons, body-mounted sensors are preferred to a passive configuration. A major drawback of active motion capture is that wearable devices are usually uncomfortable for users. Moreover, many active motion capture systems use specific installations, expensive materials, and many sensors, which substantially increase the total cost. On the other hand, passive systems suffer from lower accuracy due to the error caused by distant motion estimation. Apart from passive systems with wearable markers, marker-less systems use natural gesture analysis and are convenient for users.
Since mobile devices are equipped with different types of sensors, it is possible to use them in either active or passive configurations. On one hand, when they are in motion, they can be used as active sensors (for instance, by reading the orientation sensor or analyzing the video input). On the other hand, in a static setup, they might be considered passive sensors; motion analysis of a moving object from the video input captured by the device's camera is an example of this configuration. This thesis focuses on gestural interaction behind the mobile device's camera. Basically, this scenario is similar to passive motion tracking using a vision sensor, but due to the close distance between the vision sensor and the user's gesture, it also presents features of active motion tracking.

2.2.2 3D Motion Estimation for Mobile Interaction

Currently, the most popular way to interact with mobile devices is through 2D touchscreens. As mentioned before, touchscreen displays have limitations in 3D application scenarios, and the idea of employing other types of sensors, such as orientation or vision sensors, provides an opportunity to enhance the quality of interaction. Generally, in HCI applications, different solutions have been used to analyze human body or gesture motion, and the information retrieved from motion analysis might be used to facilitate the interaction. Many solutions are based on marked gloves or markers on body joints (see Fig. 2.2) [20, 21, 22, 23, 24, 25, 26]. Some perform gesture analysis using depth sensors [27, 28, 29]. Model-based approaches have been used in many applications [30, 31]. Other solutions analyze the motion by means of shape or temperature sensors [32, 33, 34], etc. Almost all of these solutions are developed for stationary systems with powerful components; due to the limitations of mobile devices, most of them are not practical there. Limited power resources, cost, mobility, and size are important factors that make the design process for 3D interaction really difficult. New devices are equipped with different types of integrated sensors (orientation sensor, optical cameras, GPS, etc.). The question is whether it is possible to use them in an effective way to analyze 3D motion. Generally, the answer is yes. In many virtual reality and augmented reality applications, integrated sensors are used to control the motion. In [35], rendered graphics are controlled by the orientation sensor. In [36, 37], vision sensors are employed to detect hands, gestures, or different types of objects. The major weakness of all current technologies is their limitation in 3D motion analysis: most of them are limited to object detection algorithms for augmenting graphics or manipulating virtual objects.
The problem to be tackled is to analyze the full six-DOF motion in 3D space. Therefore, when real 3D interaction with mobile devices is discussed, it means


that all motion parameters in 3D space must be considered.

Figure 2.2: Motion-based interaction using wearable markers and gloves. Left: visual markers; Middle: T(ether), motion tracking glove; Right: ShapeHand, motion capture device.

2.2.3 3D Gesture Recognition and Tracking

Existing algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model-based approaches. Appearance-based approaches rely on a direct comparison of hand gestures with 2D image features. Popular image features used to detect human hands and recognize gestures include hand colors and shapes, local hand features, optical flow, and so on. The earlier works on hand tracking belong to this type of approach [38, 39]. The drawback of these feature-based approaches is that clean image segmentation is generally required in order to extract the hand features, which is not a trivial task when the background is cluttered, for instance. Furthermore, human hands are highly articulated: it is often difficult to find local hand features due to self-occlusion, and some kind of heuristics is needed to handle the large variety of hand gestures. Instead of employing 2D image features to represent the hand directly, 3D hand model-based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera, and minimizing the discrepancy between them.

Generally, it is easier to achieve real-time performance with appearance-based approaches owing to their simpler 2D image features. However, these approaches can only handle simple hand gestures, like detection and tracking of fingertips. In contrast, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. The bad news is that the 3D hand model is a complex articulated deformable object with 27 DOF. To cover all the characteristic hand images under different views, a very large image database is required, and matching query images from the video input against all hand images in the database is time-consuming and computationally expensive. This is why most existing 3D hand model-based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.

For general mobile applications, we need to cover the full range of hand gestures, so 3D hand model-based approaches seem more promising. To handle the challenging problem of exhaustive search in the high-dimensional space of hand configurations, efficient indexing technologies from the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text retrieval tools with computer vision techniques in order to improve the efficiency of hand image retrieval [40]; an Okapi-Chamfer matching algorithm based on the inverted index technique is used in their work. Athitsos et al. proposed a method that generates a ranked list of three-dimensional hand configurations that best match an input image [41]. Hand pose estimation is achieved by searching for the closest matches to an input hand image in a large database of synthetic hand images; the novelty of their system is its ability to handle the presence of clutter. Imai et al. proposed a 2D appearance-based method that uses hand contours to estimate 3D hand posture [42].
In their method, the variations of possible hand contours around the registered typical appearances are trained from a number of graphical images generated from a 3D hand model. A low-dimensional embedded manifold is created to overcome the high computational cost of the large number of appearance variations.

Although the methods based on text retrieval are very promising, they are too few to be visible in the field. The reason might be that the approach is still too preliminary, or that the results are not impressive because the tests have only been run on databases of very limited size. It might also be a consequence of the success of the Kinect in real-time human body gesture recognition and tracking: the statistical approaches adopted in the Kinect (random forests, for example) have started to dominate mainstream gesture recognition. This effect is reinforced by the introduction of a new type of depth sensor from the Leap Motion company. This sensor can run at interactive rates (processing at least 10 frames per second) on consumer hardware and interact with moving objects in real time. Despite its impressive demos, the Leap Motion sensor cannot handle the full range of human hand shapes and sizes; the main reason is that such sensors usually detect and track the presence of fingertips or points in free space when the user's hands enter the sensor's field of view. In effect, they can only be used for general hand motion tracking.

Regarding the special requirements of mobile applications, such as real-time processing, low complexity, and robustness, a promising way to handle the problem of hand tracking and hand gesture recognition is to use text retrieval technologies for search. In order to apply this technology to the next generation of mobile devices, a systematic study is needed of how text retrieval tools should be applied to gesture recognition and, in particular, of how to integrate advanced image search technologies [43]. Surely, there exist many powerful tools to overcome this problem. The key issue is how to relate vision-based gesture analysis to the large-scale search framework and define the right problem. Once the right problem is defined, we can identify and integrate the right tools to form a powerful solution.
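The core mechanism borrowed from text retrieval can be sketched in a few lines: each database hand image is represented as a "bag of visual words" (quantized local features), and an inverted index maps each word to the images containing it, so a query only touches images that share at least one word instead of scanning the whole database. The toy example below illustrates this indexing idea with made-up word IDs and image names; it is not the Okapi-Chamfer algorithm of Zhou et al., which additionally uses Chamfer distance and Okapi weighting:

```python
from collections import defaultdict

# Toy bag-of-visual-words retrieval over a gesture database using an
# inverted index. Word IDs stand in for quantized image features.
database = {
    "grab_pose_01":  {3, 17, 42, 88},
    "grab_pose_02":  {3, 17, 40, 90},
    "point_pose_01": {5, 21, 42, 63},
}

# Inverted index: visual word -> set of images containing that word.
index = defaultdict(set)
for image_id, words in database.items():
    for w in words:
        index[w].add(image_id)

def query(words):
    """Rank database images by the number of shared visual words."""
    scores = defaultdict(int)
    for w in words:
        for image_id in index[w]:  # only images sharing a word are scored
            scores[image_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query({3, 17, 42}))
# → [('grab_pose_01', 3), ('grab_pose_02', 2), ('point_pose_01', 1)]
```

Because scoring work is proportional to the posting lists touched rather than to the database size, this structure scales to the very large synthetic hand-image databases that 3D model-based approaches require.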

2.2.4 3D Visualization on Mobile Devices

3D visualization or 3D imaging refers to techniques for conveying the illusion of depth to the viewer's eyes. The first efforts in 3D imaging started around the mid-1800s [44, 45]. Most 3D vision systems are based on stereoscopic vision. Stereoscopic vision, or stereopsis, is the process of conveying the 3D illusion by means of stereo images, and various techniques have been introduced to this end. Older techniques such as parallel or cross-eyed viewing [46] without eyeglasses, and color anaglyphs and color codes using eyeglasses, are widely used for 3D photography [47, 48]. In recent years, 3D technology has become popular in the cinema industry and TV production: passive technology using polarized glasses [49] and active shutter glasses [50] are the common approaches in 3D cinemas and 3D TVs, respectively. Autostereoscopic 3D is another technology, which requires no glasses; in this method, stereo images are transmitted separately to each eye from the light source. Some advanced 3D displays also provide a limited number of views of a scene for a more realistic 3D perception while the user moves his/her head [51].

The popularity of 3D displays has increased since 3D cinemas and 3D TVs attracted public attention. The current trend in manufacturing 3D devices suggests that we should expect tremendous growth in this market over the coming years. Mobile device manufacturers have started releasing smartphones with 3D capabilities, and a few 3D mobile phones are available in the market (see Fig. 2.3). Among them, two famous mobile phone manufacturers, LG and HTC, have introduced 3D smartphones. Both devices use autostereoscopic technology and dual cameras for recording and displaying stereoscopic images and videos [52, 53].

Figure 2.3: Examples of available 3D mobile devices; Left: HTC EVO 3D; Right: LG Optimus 3D.

Chapter 3

General Concept and Methodology

3.1 General Concept

Intuitive interaction between multimedia users and digital devices is a desired feature of future technology. Although the introduction of breakthrough technologies such as the iPhone dramatically changed the way users interact with their phones, users always demand more realistic ways to communicate with their devices. Currently, the limitations of 2D touchscreens prevent users from interacting naturally in a wide range of applications where physical gestures are unavoidable. For instance, 3D manipulation of graphical content, such as 3D rotation, zooming in/out, grabbing, pushing, and moving, requires physical 3D space for intuitive interaction. Even the latest smartphones and tablets are limited to touch interaction on a restricted 2D surface. Moreover, occlusion caused by fingers and hands might degrade the quality of interaction and visualization. Furthermore, various entertainment and gaming applications are limited to rendered virtual buttons on the touchscreen when they really need 3D manipulation in free space. For instance, playing musical instruments such as a virtual piano, guitar, or drums requires pushing, moving, and tapping in 3D space. Obviously, restricting the same tasks to a 2D surface degrades the natural interaction, and rendered virtual controllers on the 2D surface limit the visualization space and affect the quality of user experience. 3D gestural interaction might be even more vital for the forthcoming augmented reality glasses, since their dedicated surface for interaction is shrunk and shifted to the frame.

This thesis aims to introduce a new space for interaction between user and mobile device. The main idea is to shift the interaction space from the 2D surface to the real 3D space around the device, where the vision sensor can capture the hand/body gestures. In other words, performing gestural interaction in 3D space is proposed to overcome the limitations of 2D interaction technology. Delivering the experience of free-hand interaction to mobile device users could transform the mobile industry. Bare-hand 3D interaction enables users to communicate with their mobile devices in exactly the same manner as they do in the physical world with people and objects (in the same way they push a button, or pick up and rotate an object, in physical space). Analysis of the user's gestures from the video input might be used to control the ongoing operation on the device: users might perform a set of actions using different hand gestures, or they might control and manipulate virtual objects and buttons through their hand movements.

From the design point of view, the main goal is to enhance the quality of user experience by designing a new interactive environment. Since the quality of user experience is highly affected by the interaction design, introducing a new interaction space aims to solve the current limitations of interaction technology. From the technical point of view, the aim is to introduce enabling technologies for gesture recognition, tracking, and 3D motion analysis that can effectively facilitate interaction design for mobile devices and multimedia applications.
In addition, novel techniques for 3D visualization of multimedia collections, such as single images and photo albums, are considered. Finally, other related implemented techniques that might support the main contributions are included in this study.


Figure 3.1: For a better experience design, interaction and visualization spaces can be extended to 3D space.

From the practical point of view, this work aims to demonstrate the conducted experiments and implementations of the main contributions in real applications. Different application scenarios, such as photo browsing, graphical manipulation, and 3D motion control, are included in this work.

3.1.1 Interaction/Visualization Space

Miniature keyboards, tiny joysticks, and especially touchscreen surfaces are among the various input facilities designed for mobile devices. Considering today's mobile devices, it is clearly observable that the interaction and visualization spaces are located on one side of the device. Since physical buttons have gradually been removed from mobile devices, the surface allocated to the display

has been significantly increased. The major concern about today's mobile devices is the overlap between the interaction surface and the display, due to the common surface designed for the input and output modules. Since users prefer to keep visual contact with the input facilities, it makes sense to design them this way; for instance, users want to see which button they push or where they touch. On the other hand, this configuration might cause problems in visualization. Obviously, when we work on touchscreen displays, occlusion might occur: users lose visibility, and the quality of experience is degraded. A novel solution to this problem is to extend the interaction and visualization spaces from the 2D surface to 3D space (see Fig. 3.1). This extension should be done in a way that preserves the user's visual contact with the ongoing operation during interaction with the device. Since mobile devices have at least one embedded camera on the back side, it is possible to see the space behind the device through that vision sensor. If we manage to interact with the device in the 3D space behind the camera, we can successfully extend the interaction space. Furthermore, interaction in 3D space offers substantial capabilities that can facilitate a wide range of applications. For effective interaction in 3D space, advanced technologies for 3D motion analysis should be developed. From the technical perspective, interaction in 3D space can overcome the limitations of 2D interaction on touchscreen displays. On the other hand, 3D visualization technology, such as 3D coding and 3D displays, might help to extend the visualization space from the 2D surface to 3D. 3D visualization technology conveys the illusion of depth and 3D perception to users (see Fig. 3.2).

Figure 3.2: Bare-hand 3D interaction with a mobile device in the extended interaction/visualization space.

3.1.2 Sharing the Interaction/Visualization Space

One of the great advantages that 3D interaction offers is the possibility of sharing the physical space for collaboration. In fact, by turning the interaction space from a limited surface into free space, users become able to collaborate within the common physical space provided between the mobile devices. Therefore, the single-user, single-device concept can be extended to collaborative multi-user, multi-device scenarios using the shared space. In general, different configurations for interaction between users and mobile devices can be considered. Depending on the desired purpose, the number of users, the number of devices, and the interaction/visualization spaces may vary. Here, several possible scenarios for single and shared interactive applications are introduced (see Fig. 3.3, 3.4).

3.1.2.1 Single-user, Single-device

In this scenario, the user holds the mobile device in one hand while the other hand controls the interaction in the 3D space behind the device. The 3D space between the display and the user’s eyes is allocated to 3D visualization. The user can control and manipulate the content within the allocated interaction and visualization spaces.


3.1.2.2 Multi-user, Multi-device with Shared Interaction Space

In this configuration, two or more users share a common interaction space to manipulate the content. They may sit in front of or next to each other. Each user holds a device in one hand and interacts with the content using the other hand. Interaction happens in the common space between the devices, where users can share content, pass it around, or manipulate it together.

3.1.2.3 Multi-user, Single-device with Shared Visualization Space

In this setup, users share a single device for collaboration. They use the space behind the device for 3D interaction, while visualization happens on a single display. This configuration is suitable for two-user collaboration.

3.1.2.4 Interaction from Different Locations for Multi-user, Multi-device

In this configuration, each user has his/her own location, device, and space for interaction with the digital content, but all users share the same virtual space. In other words, they interact with common content from different locations through a network connection. This model can be extended to the case where the visualization space of one user is affected by the interaction of the other users, and vice versa.

3.2 Evolution of Interaction/Visualization Spaces

Mobile phones have evolved substantially since the 1980s, in both design and functionality. Before smartphones came to the market, the challenge was to reduce device size for portability. After smartphones attracted users’ attention, devices grew in display size with fewer physical buttons, for a better quality of experience. During this evolution, many devices have hit the market with their special features. Generally, considering the evolution of mobile devices during the recent decade, a gradual change can be distinguished in both interaction and visualization. In the earlier


Figure 3.3: Different configurations of single and collaborative interaction. 1: Single-user, single-device; 2: Multi-user, multi-device with shared interaction space; 3: Multi-user, single-device with shared visualization space; 4: Interaction from different locations for multi-user, multi-device.

generation of mobile phones, the device’s surface was allocated to both interaction and visualization facilities (keypad and display). In that configuration, visualization quality was quite poor due to the very limited display area. Later, some manufacturers proposed an innovative solution: allocating the whole surface to the display and designing a physical keypad layer under the display layer. In recent years, most smartphone manufacturers have introduced products with touchscreen displays, in which both interaction and visualization spaces are located on the same area. With the introduction of wearable displays such as Google Glass, the way users interact with the device may change completely due to the removal of the hand-held module. This significant change enables users to benefit from 3D interaction using both hands. If a technical solution for bare-hand interaction is provided, users may perform different actions in 3D instead of relying on weaker input facilities such as touch frames or voice commands. The solutions proposed in this thesis to the limitations of today’s technology extend the interaction to the physical 3D space. Interaction happens in


Figure 3.4: In 3D interaction, users might share the 3D space for collaborative tasks in different applications.

the 3D space behind the mobile device, and 3D visualization shows its effect in the 3D space between the user and the display. If other body parts, such as the head or foot, are taken into account as interaction facilities, the interaction space can be extended to the 3D space around the device. This conceptual model might be considered for future interactive smart devices (see Fig. 3.5). The evolution trend in mobile interaction is towards designing simpler and more intuitive input facilities. Clearly, advanced media technologies combined with powerful hardware are required to make this evolution happen.


Figure 3.5: Evolution of the interaction/visualization spaces in mobile devices.

3.3 Enabling Media Technologies

As discussed before, the main objective of this thesis is to provide technical solutions that enable users to experience realistic interaction with future smart devices in entertainment, communication, and information contexts. In order to support the main concept, this thesis focuses on two major problems: first, interaction design for mobile devices based on motion analysis in 3D space [38, 39, 54, 55, 56, 57], and second, 3D visualization on ordinary 2D displays [58, 59, 60, 61]. The proposed interactive systems are based on the detection, tracking, and analysis of the user’s 3D motion from visual input. This visual input might be received from the mobile device’s camera, a body-mounted camera, a webcam, or, in general, any type of vision sensor. The main focus of this thesis is on the real-time analysis of the user’s gestures captured by the vision sensor. Specifically, hand gestures are considered, because hands are directly involved in real-life gestural activities.


Figure 3.6: Enabling media technologies support the concept of 3D interaction and 3D visualization.

The 3D motion parameters retrieved from the detected gestures are used to drive real-time interaction in various applications. Other technical contributions of this thesis focus on providing realistic visualization. They mainly include technical solutions for converting 2D content to 3D and for interactive visualization of the content based on the user’s head motion (see Fig. 3.6).

3.3.1 Vision-based Motion Tracking in 3D Space

In this thesis, 3D gestural interaction and its significant advantages over the current 2D technology are introduced. As technically discussed in paper I [57], the efficiency and effectiveness of interaction with mobile devices in 3D space are substantially higher than in 2D space. This interaction might happen by detecting and tracking specific hand gestures that play important roles in interactive applications. Alternatively, other body parts such as the head or foot might be employed to perform the 3D interaction. However, the main idea is to use the physical space for 3D manipulation on 2D devices. By using six-DOF motion analysis, the problems of 2D interaction can be handled in most cases [38, 39, 54, 55]. The important features of the proposed systems are bare-hand, marker-less gesture detection, recognition, and tracking from 2D video input. This technology enables users to efficiently interact with their devices in real-time applications (see Fig. 3.7). Since vision-based interaction with hand-held mobile devices happens in the space close to the camera, the distance between the moving subject and the capturing sensor has some physical limitations. For instance, the user cannot move his/her hand more than 35-40 cm away from the body.

Figure 3.7: 3D gestural interaction with mobile device.

This limited interaction space preserves high-resolution motion analysis: when the vision sensor and the moving subject are relatively close to each other, the configuration is more accurate for measuring the 3D motion parameters, so the resolution of the motion analysis increases. Another proposed configuration for 3D spatial interaction is interactive vision. This setup is proposed for interactive 3D displays, where the user interacts with the content of the display based on head motion. Head-mounted or static vision sensors might be used to measure and report the head movements. Using the measured 3D motion parameters, users control the angle and viewpoint of the digital content in real time.

3.3.2 3D Visualization

The amount of multimedia content on digital devices has increased enormously in recent years. Due to the substantial improvement of the


Figure 3.8: 3D processing on 2D query images.

quality of cameras in smart devices, users can capture large amounts of photos and videos with their smartphones. Besides the challenges of ordering, organizing, and interacting with huge collections, visualization quality is another issue that should be taken into account. Since most devices capture and store visual content with 2D technology, 3D visualization of 2D content becomes a challenging task. While today’s 3D technology has attracted users’ attention, it is quite important to find an effective way to visualize the content in a more realistic fashion. Fig. 3.8 shows the system overview for 3D processing and visualization of 2D content on mobile devices. This thesis aims to tackle several challenges in visualization and improve the quality of visual perception in interaction with multimedia content. Specifically, the following items are considered in the discussions:

• First, is there any way to convey the experience of real 3D visualization (similar to what people experience in watching a real-world scene) to users by measuring the users’ dynamic position/orientation in real time (paper IV [61])?
• Second, is it possible to display stereoscopic 3D content on normal 2D displays (papers I, II, IV, V, VI [56, 57, 58, 59, 62])?
• Third, how can we recover the 3D information from a single 2D image and visualize it in 3D format on an ordinary 2D display (paper V [59])?
• Fourth, is it possible to make use of photo/video collections, captured by 2D devices, to convert and display the content in 3D (paper VI [58])?
• Fifth, is there any efficient way to correct 3D modeling, positioning, and localization errors by integrating the metadata from position/orientation sensors with computer vision techniques (paper VII [60])?

Figure 3.9: Active 3D vision for head motion-based user-device interaction.
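Several of the questions above revolve around synthesizing stereoscopic views from 2D content. As a rough, self-contained illustration of the underlying idea (a naive depth-image-based rendering sketch on synthetic data, not the pipeline of the cited papers), pixels can be shifted horizontally in proportion to an assumed per-pixel depth to form a left/right pair:

```python
import numpy as np

def synthesize_stereo_pair(image, depth, max_disparity=8):
    """Naive depth-image-based rendering (DIBR) sketch.

    image: (H, W) grayscale array; depth: (H, W) in [0, 1], 1 = near.
    Each pixel is shifted horizontally by +/- disparity/2 to form a
    left/right view. Holes (disoccluded pixels) are left at 0 here;
    a real system would inpaint them.
    """
    H, W = image.shape
    disparity = (depth * max_disparity).astype(int)
    left = np.zeros_like(image)
    right = np.zeros_like(image)
    cols = np.arange(W)
    for r in range(H):
        lc = np.clip(cols + disparity[r] // 2, 0, W - 1)
        rc = np.clip(cols - disparity[r] // 2, 0, W - 1)
        left[r, lc] = image[r]
        right[r, rc] = image[r]
    return left, right

# Toy frame: with constant depth the pair is just a pure horizontal shift.
img = np.tile(np.arange(16.0), (4, 1))
L, R = synthesize_stereo_pair(img, np.ones((4, 16)), max_disparity=4)
```

With a constant depth map the synthesized pair reduces to a pure horizontal shift; a real depth map produces per-pixel parallax, and the disocclusion holes left at zero here would need inpainting before stereoscopic coding.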

The main focus of all 3D display technologies is to convey the illusion of depth to the viewer’s eyes, although 3D visualization can be seen from other perspectives as well. In fact, real 3D perception, in the way we observe the real world, is motion-based 3D manipulation plus depth perception. An example may clarify this idea. Imagine a box in front of a user. Each

side has a different color and pattern, and from the front view the user can only see the top and front sides. In a natural manner, if the user wants to see the left or right side, he/she moves to the left or right, and in the same way users observe any scene by moving in different directions. In another approach, users can pick up the box and rotate it to see any side they desire. One way to observe the 3D space is thus to manipulate the scene through the users’ own motion; in other words, they should be able to control what they want to see. This type of visualization can be realized by analyzing the user’s motion in front of the vision sensor and transmitting the motion information to the rendering system for 3D visualization. Of course, this process should be performed in real time, without any noticeable delay, to deliver a realistic experience to the user’s eyes. Moreover, the output might be rendered with stereoscopic techniques to convey the illusion of depth. The concept of interactive vision, or interactive 3D displays, can be formed based on this idea (see Fig. 3.9).

3.4 Methodology Overview

The technical contributions of this thesis are mainly focused on the development of enabling technologies for 3D gestural interaction. Therefore, 3D gesture detection, recognition, and tracking are the technical features used extensively in the proposed solutions. Generally, gesture analysis is considered a classical computer vision and pattern recognition problem, and a substantial part of the technical discussion of this work is therefore devoted to these challenges from the classical approach. Low-level feature/pattern detection, global model-based detection, motion estimation from tracking robust features, and other computer vision methods are employed to find novel solutions to the challenges of gesture analysis. In addition to the common computer vision methods, a new framework for gesture analysis is introduced. Since the capability of modern computers to store and process extremely large databases has increased substantially, shifting the complexity of the methods from pattern recognition algorithms to a large-scale retrieval approach might be the new trend for tackling

the gesture analysis problems. The introduced method is therefore based on collecting an extremely large database of gesture images and retrieving the best match from the provided data. In the ideal scenario for gesture analysis, the database should include all possible articulated hand gestures and the corresponding metadata, including the relative spatial position and orientation with respect to the camera. The major methodology is based on direct retrieval of the best match for any query gesture. The retrieval process should be performed in a way that preserves smooth motion in a continuous gestural interaction. This step might be done by analyzing the gesture patterns in a high-dimensional space.

The main methodology for improving visual perception is based on 3D visualization of today’s multimedia content on current display devices. The whole process can be divided into two steps. In the first step, 3D motion analysis of the user’s head is performed for real-time manipulation of the content. Here, a vision sensor is used to track visual features in the environment, and motion analysis over consecutive frames is used to measure the 3D motion parameters. The parameters, measured in real time, help users interact with the content and manipulate it in a natural manner. The methodology for visualization of the content is based on processing the images and videos captured by current 2D devices. The conversion methods from 2D to 3D are based on direct analysis of single views or on multiple-view analysis in photo collections. The main strategy is to convert the 2D multimedia to 3D and use stereoscopic coding. This approach adds value to the visual experience without requiring extra hardware. In other words, besides the user-manipulated content, the output can be visualized with stereoscopic techniques to convey the illusion of depth to the user’s eyes.


3.5 Gesture Analysis through Pattern Recognition Methods

Basically, a common vision-based system for real-time gestural interaction is composed of four main elements: user, vision sensor, gesture analysis component, and visualization component. The real-time query input from the user is a continuous set of hand/body gestures. In this context, bare-hand gestural performance in free space is considered for most of the proposed scenarios, and in a few cases head movements are used as query input. Ordinary vision sensors can be divided into two groups: 2D cameras, such as normal RGB webcams, and 3D depth sensors, such as the Microsoft Kinect. Since most ordinary devices are equipped with normal RGB cameras, the main focus of this work is on using that type of sensor in the different research scenarios (embedding depth sensors in mobile devices does not seem feasible in the near future). The gesture analysis step usually includes feature extraction, gesture detection, motion analysis, and tracking. Pattern recognition methods for detecting and analyzing hand gestures are mainly based on local or global image features. Simple features such as edges, corners, and lines, and more complex features such as symmetry patterns, SIFT, SURF, and FAST features, are widely used in computer vision applications [63, 64]. If the desired goal is to detect a specific pattern, a combination of image features might be used. For dynamic hand gestures, it is quite challenging to define a single pattern for detection due to the complex configurations of the hand joints. Therefore, a combination of local/global image features might be useful to detect and localize the hand gestures. Distinctive features are extremely useful for robust tracking and 3D motion analysis. If the hand gesture is correctly detected and localized, robust features such as SIFT or SURF can be used to analyze the 3D motion parameters in a sequence of image frames.
One way to track the gesture in consecutive frames is to run the detection algorithm on every single frame of the sequence. Another way is to detect and localize the gesture in a single frame and follow the detected


Figure 3.10: Overview of the 3D gesture analysis process based on computer vision methods.

pattern in the following frames using common tracking methods such as optical flow. Depending on the application scenario, if recognition of different types of gestures is required, different gesture patterns should be analyzed. If the goal is to track a special gesture, the specific pattern might be detected in consecutive frames, and if the 3D motion of the hand gesture is required, gesture localization and 3D motion analysis over the sequence of frames should be performed. Finally, the gesture analysis output provides the required information about the type of gesture and the position/orientation of the detected gesture with respect to the vision sensor. The retrieved information is then sent to the real-time applications, and the final output might be rendered in 2D/3D for visualization on the display. Fig. 3.10 shows the block diagram of 3D gesture analysis based on computer vision methods.
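The detect-then-track strategy above can be illustrated with a minimal, single-window Lucas–Kanade step (a NumPy sketch on synthetic frames, not the thesis implementation): once a gesture region has been localized in one frame, its dominant translation to the next frame follows from the spatial and temporal image gradients inside that window.

```python
import numpy as np

def lucas_kanade_shift(frame1, frame2, center, half=10):
    """Estimate the (dx, dy) translation of a window between two frames.

    Solves the standard Lucas-Kanade 2x2 normal equations G d = -b,
    built from spatial gradients (Ix, Iy) and the temporal difference
    It inside the window around `center` = (row, col).
    """
    r0, c0 = center
    win = (slice(r0 - half, r0 + half + 1), slice(c0 - half, c0 + half + 1))
    Iy, Ix = np.gradient(frame1)          # axis 0 = rows (y), axis 1 = cols (x)
    It = (frame2 - frame1)[win]
    ix, iy = Ix[win], Iy[win]
    G = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = np.array([np.sum(ix * It), np.sum(iy * It)])
    return np.linalg.solve(G, -b)         # (dx, dy)

# Synthetic smooth frames: frame2 is frame1 translated by (0.4, 0.25) px.
r, c = np.mgrid[0:64, 0:64].astype(float)
frame1 = np.sin(c / 5.0) * np.cos(r / 6.0)
frame2 = np.sin((c - 0.4) / 5.0) * np.cos((r - 0.25) / 6.0)
dx, dy = lucas_kanade_shift(frame1, frame2, center=(32, 32))
```

The linearization behind the 2x2 system only holds for small, smooth displacements; practical trackers use image pyramids and many windows, but the core computation per window is the one above.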


3.6 Gesture Analysis through Large-scale Image Retrieval

In addition to the computer vision methods for gesture analysis, this thesis introduces a new framework and methodology for tracking articulated hand motions in video sequences based on search technologies. The innovative solution is to define the problem of hand tracking and gesture recognition as a general image search problem. The idea is to build a large database that contains at least thousands of hand gesture images. Ideally, these images should emulate all possible hand gestures. Furthermore, the images are tagged with hand motion parameters, including the 3D position and orientation of the gestures. When the hand of a mobile device user is captured by the video camera in the space around the device, the captured hand image is used to retrieve the most similar hand gesture image stored in the database. The motion parameters tagged to the matched image are then assigned to the captured hand image. Thus, 3D hand tracking and gesture recognition can be achieved. The key to this approach is how to quickly find the best match in the database. The proposed solution is to treat each image as a document, convert shape features into a huge visual vocabulary table, and employ inverted indexing as a powerful retrieval tool to perform the search. The developed framework might have a big impact on gesture analysis where high-resolution hand/gesture tracking is required. In fact, unlike the classical pattern recognition methods, in the search framework the entries of the database are not analyzed with shape-based or model-based methods. The main idea is to include every possible hand gesture image regardless of its shape or model. The entries of the database might be real images of articulated hand gestures or computer-generated graphics. The important point is to annotate the database entries with the position and orientation information of the recorded hand gestures.
The vocabulary of hand gestures integrates the visual features of the gesture images and their pose information in an extremely large table. The query frame, captured by the vision sensor, will be


pre-processed, and its visual features will be extracted for analysis in the gesture search block. The core of the system is the gesture search engine, which analyzes the similarity of the query input to the database entries in several steps and retrieves the best match. The output of the system is the gesture image most similar to the query input; in the ideal case, it is identical to the query. Finally, the retrieved image and its annotated pose information are employed in the application. Fig. 3.11 shows the block diagram of the 3D gesture analysis system based on large-scale image retrieval.

Figure 3.11: Overview of the 3D gesture analysis system based on the large-scale image search method.


Chapter 4

Enabling Media Technologies

From a technical point of view, numerous challenges must be considered in order to enhance the usability of an interactive system. Specifically, interaction design for mobile devices using hand gestures involves technical issues in computer vision, such as detection, tracking, 3D motion estimation, and visualization. Basically, the technical discussion of the proposed methods can be divided into the following categories: low-level pattern recognition for gesture analysis, search-based gesture analysis, and interactive 3D visualization. In order to implement a gesture-based interactive system, various hand gestures should be considered. Fig. 4.1 shows the most common hand gestures for 3D interaction and manipulation of objects in different digital environments. Although the collected gestures can be used for different actions such as pick, place, move, grab, zoom, and rotate, they can all be seen as variations of basic hand poses such as the grab or pinch gestures. This is the main reason that, in this context, gesture detection and recognition based on computer vision methods mainly focus on the grab gesture and its variations, such as deformations, scaling, and rotations. Clearly, these gestures can cover the majority of the required actions in 3D interaction. Moreover, the proposed search-based method for gesture analysis can be used for an extremely large number of hand gestures.


Figure 4.1: Most common hand gestures in 3D interaction scenarios.

4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition

Low-level pattern recognition algorithms can be extremely useful in gesture analysis. Although low-level features do not represent complex patterns independently, their extremely fast processing and low complexity make them highly suitable for real-time applications. The main challenge here is how to combine low-level features in an effective way to retrieve a global meaning, such as detecting a gesture pattern in a video sequence. In the contributions of this thesis towards hand gesture detection and tracking, low-level features are used extensively [38, 39, 54]. Specifically, gesture tracking based on low-level operators known as rotational symmetry patterns is considered. As discussed in paper I [57], rotational symmetries are specific curvature patterns derived from the local orientation image [65]. The main idea behind rotational symmetries is to use local orientation to detect complex curvatures in the double-angle representation. The double-angle representation,

z, of an orientation with direction θ is defined as a complex number whose argument (angle) is double the local orientation, z = c·e^{i2θ}, where the magnitude c represents the signal energy, or confidence. Rotational symmetries can be categorized into different orders and phases. By applying a set of specific filters to the orientation image, it is possible to detect different members of the rotational symmetry family, such as curvature, circular, and star patterns. Using rotational symmetries for gesture detection may seem rather general and complex, but modeling a gesture by a choice of rotational symmetry patterns of different classes makes it possible to differentiate it from other features, even in cluttered backgrounds. The theory and mathematical definitions of local orientation, rotational symmetries, detection of symmetry patterns, etc., are fully discussed in paper I [57]. Experiments demonstrate that hand gestures and fingertips show high responses when searching for specific group members of rotational symmetry patterns in the orientation image. For example, fingertips respond to first-order rotational symmetries (curvature patterns) [38], and the grab gesture responds to second-order rotational symmetries (circular patterns) [39, 54]. Therefore, depending on the application scenario, a proper detector for different hand gestures can be introduced, and the hand gesture can be localized in a sequence of frames captured by the device’s camera. The selectivity of the desired patterns can be increased by removing the noisy responses caused by complex backgrounds [38]. For instance, if detecting fingertips is desired, first-order symmetry pattern detection returns the positions of the fingertips as well as noisy features from the background.
In further processing, the magnitude, phase, and color properties of the responses can be used to differentiate between correct detections and noisy points [66] (see Fig. 4.2). Second-order rotational symmetries correspond to more specific patterns. For instance, the grab gesture responds to the circular pattern from this group of


Figure 4.2: Gesture detection, tracking, and 3D motion analysis based on rotational symmetry patterns.

symmetries. It is possible to enhance the detection by controlling the phase of the pattern with a simple threshold. Thus, the grab gesture can be properly localized in a video sequence [57]. This processing results in a proper detection and rejects the noisy responses. From a technical perspective, gesture detection and tracking are the first steps in 3D gesture-based interaction. The core of the system is the 3D motion analysis, where the 3D motion parameters are recovered from the video sequence (see Fig. 4.2). In many interactive applications, 2D gesture detection and tracking are enough to perform the task, and further 3D motion analysis is not required. For real 3D (six-DOF) interaction, extra information about the 3D position and orientation must be recovered.
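A toy illustration of the double-angle machinery (a NumPy sketch on a synthetic blob, not the filter sets of paper I): the double-angle image z = c·e^{i2θ} is built from image gradients and correlated with a second-order (circular) symmetry template. For a radially symmetric pattern, whose orientation field is radial everywhere, the normalized response peaks at the pattern's center.

```python
import numpy as np

# Synthetic radially symmetric blob; its gradient orientation field is
# radial, so the double-angle image has phase exp(i*2*phi) around the center.
H = W = 64
rr, cc = np.mgrid[0:H, 0:W].astype(float)
f = np.exp(-((rr - 32) ** 2 + (cc - 32) ** 2) / (2.0 * 4.0 ** 2))

gy, gx = np.gradient(f)
z = (gx + 1j * gy) ** 2          # double-angle representation z = c * e^{i2theta}

# Second-order rotational symmetry template b = exp(i*2*phi) on a small window.
R = 7
wy, wx = np.mgrid[-R:R + 1, -R:R + 1].astype(float)
phi = np.arctan2(wy, wx)
w = np.exp(-(wx ** 2 + wy ** 2) / (2.0 * 3.0 ** 2))   # Gaussian window weights
template = np.exp(1j * 2 * phi) * w

# Normalized correlation: magnitude near 1 where the local orientation
# field matches the circular pattern, near 0 elsewhere.
resp = np.zeros((H, W))
for i in range(R, H - R):
    for j in range(R, W - R):
        patch = z[i - R:i + R + 1, j - R:j + R + 1]
        num = abs(np.sum(patch * np.conj(template)))
        den = np.sum(np.abs(patch) * w) + 1e-12
        resp[i, j] = num / den

peak = np.unravel_index(np.argmax(resp), resp.shape)   # expected near (32, 32)
```

The real detectors use dedicated filter sets and phase control rather than this brute-force normalized correlation, but the same principle applies: a circular pattern such as a closed grab gesture produces a coherent e^{i2φ} structure that the second-order template picks out.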

4.1.1 3D Motion Analysis

In computer vision and image processing, a common way to retrieve and estimate the motion between image frames is to analyze the motion between extracted feature points. The 3D structure can be studied by finding and matching corresponding feature points in consecutive frames [67]. Various types of feature detectors and descriptors have been introduced in computer vision. Generally, feature detectors can be divided into edge,


Figure 4.3: 3D motion analysis steps. Retrieving the 3D structure from motion between image frames.

corner, and blob detectors, or any combination of them [68]. In applications where robustness and accuracy have higher priority, more complex feature descriptors are required. SIFT, SURF, and CHoG [63, 64, 69] are examples of robust feature descriptors which have been found useful in many multimedia applications. In the contributions of this thesis, the scale-invariant feature transform (SIFT) is widely used as a robust scale- and rotation-invariant feature descriptor. Once the hand gesture is localized, SIFT features are extracted in the desired region (the user’s gesture) of the image frame. The extracted features are tracked in consecutive frames, and the structure of the 3D motion can be derived by finding the transformation between two frames. This transformation might be in the form of a planar homography [67], as discussed in [58], or a fundamental or essential matrix [67], as suggested in [39, 54, 60]. In order to remove the outliers among the matched feature points and find the best transformation matrix consistent with the true matches, a robust iterative method such as RANSAC [70, 71] is performed. As a result, the best motion transformation between the two frames is estimated (see Fig. 4.3). Paper I [57] explains extensively how the 3D motion parameters can be retrieved by decomposing the estimated transformation [39, 54]. In paper I [57], gesture detection, tracking, and 3D motion analysis from rotational symmetry patterns are explained in detail. Moreover, the effect of applying the 3D motion parameters in different applications is demonstrated [57, 62].
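The matching-plus-RANSAC step can be sketched in a self-contained way (standard DLT homography estimation on synthetic correspondences; the function names and data are illustrative, not taken from the papers): random 4-point samples vote for the transformation with the largest inlier set, which is then refit on all inliers.

```python
import numpy as np

def homography_dlt(src, dst):
    """Direct Linear Transform: fit H (3x3) with dst ~ H @ src, >= 4 points."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    p = (H @ np.column_stack([pts, np.ones(len(pts))]).T).T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=200, thresh=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = homography_dlt(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set for the final estimate.
    return homography_dlt(src[best_inliers], dst[best_inliers]), best_inliers

# Synthetic matches: 30 correct correspondences under a known H, 10 outliers.
rng = np.random.default_rng(1)
H_true = np.array([[1.1, 0.05, 3.0], [-0.02, 0.95, -2.0], [5e-4, 3e-4, 1.0]])
src = rng.uniform(0, 100, size=(40, 2))
dst = project(H_true, src)
dst[30:] += rng.uniform(20, 50, size=(10, 2))     # corrupt the last 10 matches
H_est, inliers = ransac_homography(src, dst)
```

In the thesis's setting the correspondences come from SIFT matches between consecutive frames rather than synthetic points, and the estimated homography (or essential matrix) is subsequently decomposed into the six-DOF motion parameters.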


4.2 Gesture Detection and Tracking Based on Gesture Search Engine

In this thesis, a new framework and algorithms for tracking hands in cluttered images and recognizing the underlying gestures are introduced. To better specify hand gestures and hand motions, two concepts should be distinguished.
Hand Posture: a static hand pose and its current position, without any movements involved.
Hand Gesture: a sequence of hand postures connected by continuous hand or finger movements over a short period of time.
For real-world hand tracking applications, the problems of initialization and recovery have to be addressed. In order to develop robust solutions, we can adopt a static approach, that is, localize and recognize the hand posture in individual frames. Hand gesture recognition can then be achieved by reading individual posture images. The goal is that the new framework and algorithms lead to solutions with such high tracking and recognition accuracy that they can be used as a stand-alone module for 3D hand tracking. Such solutions will also be useful for providing single-frame estimates to a 3D hand tracker, and consequently for achieving automatic initialization and error recovery. The proposed technical approach is to redefine the problem of hand tracking and gesture recognition as a text search problem. The framework is based on the idea of building a large database which, in the best case, emulates all possible articulated hand motions. Furthermore, these images are tagged with 3D hand motion parameters, including the joint angles of the articulated fingers. When the hand of a user is captured by the device’s camera, the captured hand image is used to retrieve the most similar image from the database. The ground-truth labels of the retrieved matches are used as hand pose estimates for the input. The approach works even under poor segmentation conditions: all that is required as input is a bounding box around the hand gesture.
The bounding box is allowed to include arbitrary amounts of clutter in addition to the hand region.

48 Enabling Media Technologies

The key issue in this approach is how to quickly find the best match in a database of gesture images. The proposed solution is based on treating each image as a document, converting shape features into words, and employing a powerful text retrieval tool, inverted indexing, to perform the fast search.

4.2.1 Providing the Database of Gesture Images

The core of the gesture search system is how to represent gesture contours. To enable the formulation of the gestural interaction problem as a search framework, two particular properties should be considered: first, shape sensitivity, meaning that the matched hand gesture shape should be as close as possible to the one in the input frame; second, position sensitivity, meaning that the matched gesture should be at a similar position to the input gesture. In this work a new type of shape vocabulary is defined. The introduced technique is based on dividing the contour into segments, or edge features, and an individual segment is considered a word in the search table. In order to form the search table, all the database images are normalized and their corresponding edge images computed. Each edge pixel is represented by its position and orientation. To give the low-level edge orientation features a global structure, a large table can be formed that represents every case in which each edge feature might occur. Considering the whole database with respect to the positions and orientations of the edges, an extremely large table can represent the whole vocabulary of hand gestures in edge pixel format. For instance, for an image size of 640x480 with 8 orientation bins and a database of 10000 hand gesture images, the gesture vocabulary table has the dimension 2457600x10000. After forming this huge table, each block is filled with the indices of all database images that have a feature at that specific point. The table thus collects the required information from the whole database, which is essential for the online gesture search.
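The table construction described above can be sketched as an inverted index over quantized edge words. The function names and input format below are illustrative assumptions, not the thesis implementation; a real system would compute the edge masks and orientations from the normalized database images:

```python
import numpy as np
from collections import defaultdict

N_THETA = 8                      # number of orientation bins, as in the text

def edge_features(edge_mask, orientation):
    """Yield one (x, y, theta_bin) word per edge pixel of an image.
    edge_mask: (H, W) bool array; orientation: (H, W) angles in radians."""
    ys, xs = np.nonzero(edge_mask)
    bins = ((orientation[ys, xs] % np.pi) / np.pi * N_THETA).astype(int) % N_THETA
    return zip(xs, ys, bins)

def build_vocabulary_table(database):
    """Inverted index: (x, y, theta_bin) -> indices of the database images
    that have an edge with that orientation at that pixel position."""
    table = defaultdict(list)
    for idx, (edge_mask, orientation) in enumerate(database):
        for word in edge_features(edge_mask, orientation):
            table[word].append(idx)
    return table
```

Storing only the occupied blocks (a dictionary rather than the full 2457600x10000 array) is what makes the table tractable in practice, since most blocks are empty.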


In addition to processing the database images to form the search table, for each gesture image in the database the 3D motion parameters are calculated and tagged to that image. This is done by mounting a motion capture sensor on the hand while the database images are being recorded. For the database, an active vision sensor (a hand-mounted camera) is used to measure the gesture movements and annotate the gesture images.

4.2.2 Query Processing and Matching

A query hand gesture is any type of hand gesture with its specific position and orientation. The first step in the retrieval and matching process is edge detection. This process is the same as the edge detection in the database processing, but the result will be totally different, because for the query gesture the presence of edge features from the cluttered background and other irrelevant objects is expected.

4.2.3 Scoring System

Assume that each query edge image, Qi, contains a set of edge points, each represented by its row and column position and a specific direction. During the first step of the scoring process, for each query edge pixel Qi(xu, yv), a similarity function to the database images at that specific position is computed: Sim(Qi, Dj). If the required condition is satisfied for the edge pixel in the query image and the corresponding database images, the first level of scoring starts, and all the database images that have an edge with a similar direction at that specific coordinate receive +3 points in the scoring table. The same process is performed for all the edge pixels in the query image, and the corresponding database images receive their +3 points.
Here, an important issue that can arise during scoring should be considered. The first step of the scoring system covers the case where two edge patterns from the query and database images exactly overlap, whereas in most real cases two similar patterns are extremely close in position but do not overlap much. For these cases, which happen regularly, first- and second-level neighbor scoring is introduced. A very probable case is that two extremely similar patterns do not overlap but fall on neighboring pixels of each other. To cover these cases, besides the first-step scoring, for each pixel the first-level 8 neighboring and the second-level 16 neighboring pixels in the database images are checked. All the database images that have an edge with a similar direction in the first-level and second-level neighbors receive +2 and +1 points, respectively. In short, scoring is performed for all the edge pixels in the query with respect to the similarity to the database images at three levels with different weights. The accumulated score of each database image is calculated and normalized, and the maximum scores are selected as the best top matches; the proposed algorithm selects the top ten matches from the database.
In order to find the closest match among the top matches, a reverse comparison is required. Reverse scoring means that besides finding the similarity of the query gesture to the database images, Sim(Qi, D), the reverse similarity of the selected top database images to the query gesture is computed. The combination of the direct and reverse similarity functions results in much higher accuracy in finding the closest match from the database. The final scoring function is computed as S = [Sim(Qi, D) x Sim(D, Qi)]^0.5. The highest value of this function returns the best match from the database for the given query gesture. Afterwards, the motion parameters tagged to the best match can be used immediately in various application scenarios. An additional consideration in a sequence of gestural interaction is the smoothness of the gesture search: the retrieved best matches in a sequence should represent a smooth motion.
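A minimal sketch of the three-level scoring and the combined direct/reverse similarity, with images represented simply as sets of (x, y, orientation-bin) edge words; this set representation is an illustrative stand-in for the inverted-index implementation:

```python
# First-level 8-neighborhood and second-level 16-neighborhood offsets.
RING1 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
RING2 = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
         if max(abs(dx), abs(dy)) == 2]

def similarity(query, image):
    """Sim(query, image): +3 for an edge with the same orientation at the
    same pixel, +2 / +1 if one is found among the first- / second-level
    neighbors. Both arguments are sets of (x, y, theta_bin) edge words."""
    score = 0
    for x, y, t in query:
        if (x, y, t) in image:
            score += 3
        elif any((x + dx, y + dy, t) in image for dx, dy in RING1):
            score += 2
        elif any((x + dx, y + dy, t) in image for dx, dy in RING2):
            score += 1
    return score / max(len(query), 1)        # normalize by query edge count

def combined_score(query, image):
    """S = [Sim(Q, D) x Sim(D, Q)]^0.5 -- the direct/reverse combination."""
    return (similarity(query, image) * similarity(image, query)) ** 0.5
```

The reverse term penalizes database images that contain the query's edges but also many extra edges the query lacks, which the direct score alone cannot detect.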
In order to perform smooth retrieval, the database gesture images should be analyzed in a high-dimensional space to detect the motion maps. Motion maps indicate which gestures are close to each other and fall in the same neighborhood in the high-dimensional space. Therefore, for a query gesture image in a sequence, after the top-ten selection, the reverse similarity is computed and the top four matches are selected. The algorithm then searches the motion paths to check which of these top matches is closest to the previous frame's match, and that image is selected as the final best match. Fig. 4.4 shows the block diagram of the gesture search engine. In paper III [56], the whole process is explained in detail.

Figure 4.4: Overview of the gesture search engine.

4.2.4 Quality of Hand Gesture Database

In general, two main issues must be considered for the database: how large should it be, and how should it be built? The human hand is a complex articulated structure consisting of many connected links and joints. Including 6 DOF for global orientation and position, the human hand has 27 DOF in total [72]. Rendering all possible combinations of joints and poses would generate a huge number of hand images, and it is impossible to store all of them on a mobile device. Fortunately, there is a strong correlation between joint angles, so the state space of the joints has substantially lower dimensionality. In [73], Wu et al. show that the state space of the joints can be approximated with 7 DOF. Thus, 7 is a rather good estimate of the embedded dimension of hand postures. If each DOF is quantized and represented with 3 bits, there is a total of 8^7, approximately 2 million, states. This gives a rough estimate of the required size of the database of hand gesture images: at least around 2 million.
The second issue is how to build such a database. One solution is to use a 3D hand model to render all possible hand postures with computer graphics technology, and to convert the generated gesture images into binary shape images through edge and boundary detection. The major problem with this approach is that the extracted edges are not natural, which directly affects the search for the best-matched hand shape. In this thesis, a bare hand is instead used against a uniform background to perform all sorts of gestures. The hand gestures are recorded and converted into binary hand shape images, and motion sensors or video cameras are attached to the hand to measure the exact position and orientation of the gestures. Thus, the ground-truth hand motion parameters are tagged to the gesture images.

Figure 4.5: Interactive 3D vision overview.
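The size estimate follows directly from quantizing each of the 7 effective DOF into 3 bits:

```latex
8^{7} = \left(2^{3}\right)^{7} = 2^{21} = 2\,097\,152 \approx 2 \times 10^{6}
\ \text{distinct hand states.}
```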


4.3 Interactive 3D Visualization

The main idea behind interactive 3D visualization is to enable users to interact with the content of a display based on their motion in free space. In fact, this technology helps them perceive the content in a realistic manner by controlling the angle and viewpoint in real-time, turning the normal screen into an interactive digital window. For accurate 3D motion tracking, active technology requires mounting the vision sensor on the user's body. Since the aim is to manipulate the content based on the user's viewpoint, the sensor is mounted on the user's head, so the video sequence can be captured in real-time. In order to estimate the head motion parameters from the visual input, the image frames extracted from the video sequence are processed in the motion analysis step. The proposed motion analysis technology is based on analyzing the 3D head motion between consecutive frames captured by the camera. For each pair of consecutive image frames, a robust feature detector is employed to extract and track important feature points from the environment; in most cases, due to its robustness and scale invariance, the SIFT feature detector is used. Afterwards, the relation between the two sets of corresponding feature points is represented by a transformation matrix: a planar homography or, for a more accurate representation, the Fundamental or Essential matrix. The transformation matrix contains the information about the motion between the two image planes. In the next step, a decomposition is applied to the transformation matrix to retrieve the 3D motion parameters. This process is performed on every pair of consecutive frames, and the relative 3D position and orientation are estimated. The motion analysis block provides six outputs for the rendering block: three orientation parameters and three position parameters in the x, y, z coordinate system.
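As a concrete sketch of the transformation step, the following estimates a planar homography from matched point pairs with the standard DLT (direct linear transform) algorithm in plain NumPy. The synthetic matrix and points are illustrative stand-ins for real SIFT correspondences; decomposing the resulting matrix (or the Essential matrix) then yields the rotation and translation:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src via DLT.
    src, dst: (N, 2) arrays of matched feature points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # the homography is the right null vector of A (last row of V^T)
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                    # fix the scale ambiguity

# synthetic check: points related by a known homography are recovered
H_true = np.array([[1.0,   0.02, 5.0],
                   [-0.01, 1.0, -3.0],
                   [1e-4,  2e-4, 1.0]])
src = np.random.default_rng(0).uniform(0, 100, (8, 2))
h = np.c_[src, np.ones(8)] @ H_true.T
dst = h[:, :2] / h[:, 2:]                 # perspective division
H_est = estimate_homography(src, dst)
```

With noisy real correspondences, the estimation would normally be wrapped in a RANSAC loop to reject mismatched features.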
Note that SIFT feature detection at each single frame requires rather heavy processing, which is not a problem in stationary systems. For faster processing, especially on mobile platforms, faster detectors such as SURF or FAST features can be used. Another approach is to perform feature detection in the first frame and track the detected features with common tracking methods such as optical flow in the consecutive frames; feature detection can then be repeated when the number of tracked features drops below a certain value. The rendering block generates and updates the scene based on the motion information provided at each moment. The rendered scene is based on a pre-defined graphical model or an augmented reality environment. The result is displayed on a screen while the user interacts with and manipulates the content in real-time. Fig. 4.5 demonstrates the system overview of interactive 3D vision. Fig. 4.6 shows how the user controls the viewpoint and position in the rendered scene by moving in 3D space. For capturing and measuring the 3D position and orientation of the user, an ordinary webcam is mounted on the user's head. The graphical content is updated according to the translation and rotation of the head at each moment. The perception effect is similar to looking at a real scene through a window: the view of the scene is adjusted based on the angle and position of the viewer. In paper IV [61], interactive 3D visualization is discussed in detail.

Figure 4.6: Real-time interaction with the graphical content using the interactive 3D vision system.

4.4 Methods for 3D Visualization

As mentioned before, in order to enhance the quality of experience in multimedia applications, the aim is to visualize the output in 3D format. The following scenarios are considered.

- First, 3D visualization of a graphical model with a known geometry [39, 54];
- Second, 3D visualization of single images using the image itself [59];
- Third, 3D visualization of monocular images by analysis of multiple views in 2D digital photo collections [58].

A common way to visualize content in 3D format is to produce stereo views. As fully discussed in [54, 57, 58, 59], stereoscopic systems transmit two views of a scene captured from viewpoints separated by a slight horizontal translation. For rendering graphical models in different applications, the geometry of the scene is known, so it is rather simple to render a second view that satisfies the required geometry for stereoscopic viewing. The task of stereoscopic visualization becomes more challenging when the content is not recorded with stereo cameras and no prior knowledge about the geometry or structure of the 3D scene is provided. Considering single views or randomly captured views of a scene, an efficient way to generate stereo views must be found. 3D visualization from single and multiple 2D views is briefly described in the following sections. In papers V and VI [58, 59], the whole process is explained in detail.

4.4.1 Depth Recovery and 3D Visualization from a Single View

Making stereo views from a single monocular image is one of the most challenging tasks in computer vision. The first step in making 3D from single images is to recover the depth map. This is done by applying supervised learning algorithms to a set of images and their corresponding ground-truth depth maps. Statistical image modeling and estimation techniques such as Markov Random Fields (MRF) are used to train the system [74]. After training, the depth map of a query image can be recovered. Once the depth map is estimated, the information required for generating stereo views is calculated as suggested in [59].
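Once a depth map is available, a second view can be synthesized by shifting pixels horizontally with a disparity inversely proportional to depth. The following is a deliberately naive grayscale sketch (no occlusion ordering, simple left-neighbor hole filling); the baseline constant is an illustrative assumption:

```python
import numpy as np

def render_right_view(img, depth, baseline=8.0):
    """Synthesize a right-eye view from an image and its depth map by
    horizontal pixel shifts: nearer pixels get larger disparities.
    img: (H, W) grayscale array; depth: (H, W) positive depths."""
    H, W = img.shape
    disparity = np.round(baseline / depth).astype(int)
    right = np.zeros_like(img)
    filled = np.zeros((H, W), bool)
    for y in range(H):
        for x in range(W):
            xr = x - disparity[y, x]          # shift left in the right view
            if 0 <= xr < W:
                right[y, xr] = img[y, x]
                filled[y, xr] = True
    # naive hole filling: propagate the left neighbor into disoccluded gaps
    for y in range(H):
        for x in range(1, W):
            if not filled[y, x]:
                right[y, x] = right[y, x - 1]
    return right
```

A production renderer would traverse pixels in depth order so that nearer surfaces correctly occlude farther ones, and would use more careful inpainting for the disocclusion holes.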

4.4.2 3D Visualization from Multiple 2D Views

Although 2D digital photo galleries and collections do not contain any explicit 3D information, interesting 3D images and videos can be generated from them with computer vision techniques. Basically, in many photo collections there are many hidden connections between images, and these connections can be represented by a transformation matrix. In fact, any two, three, or more unstructured photos of a scene might capture overlapping areas. This means that by finding the geometric transformation between the overlapping images, the 3D information of the real scene can be inferred. In paper VI, the process of generating stereo views for 3D visualization by matching feature points and finding the homography transformation between overlapping frames is discussed in detail [58].

4.5 3D Channel Coding

The final step in visualizing the content in 3D is to encode the stereo channels. The coding techniques vary with the display technology. For instance, in passive 3D systems with polarized glasses, the stereoscopic output is transmitted in channels with different polarities [75], while with active shutter glasses, stereo frames are transmitted at twice the original rate (60 x 2 = 120 frames/sec) [75]. In the implementations, ordinary 2D displays are considered for rendering the 3D output. A common group of stereoscopic techniques that do not require 3D displays are color anaglyphs [76, 77]. In anaglyph methods, the stereo frames are encoded into two different colors for the left and right eyes. The color-coded stereo frames are merged and displayed as a single layer on the display. Depending on the coding method, appropriate low-cost glasses are used to decode the displayed output: the glasses feature two different color filters for the left and right lenses, each filtering the corresponding layer from the output image. In the implementations, two enhanced techniques for generating more realistic outputs are used, known as Optimized Anaglyph and Color-code 3D [47, 76].

Figure 4.7: Contributions in 3D visualization.
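A minimal sketch of the simplest anaglyph encoding, plain red-cyan (not the Optimized Anaglyph or Color-code 3D variants used in the implementations):

```python
import numpy as np

def red_cyan_anaglyph(left, right):
    """Encode a stereo pair as a red-cyan anaglyph: the red channel comes
    from the left view, green and blue from the right view.
    left, right: (H, W, 3) uint8 RGB images of equal shape."""
    out = np.empty_like(left)
    out[..., 0] = left[..., 0]      # red   <- left eye
    out[..., 1] = right[..., 1]     # green <- right eye
    out[..., 2] = right[..., 2]     # blue  <- right eye
    return out
```

The red lens then passes only the left-view layer to the left eye, and the cyan lens only the right-view layer to the right eye, producing the depth illusion on an ordinary 2D display.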

Chapter 5

Experimental Results

5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis

Basically, the implemented system for gesture detection and tracking based on low-level patterns consists of the gesture input from the user, the vision sensor, and the algorithms for 3D gesture analysis. The target gesture for the detector based on low-level patterns is the grab gesture, selected because of studies of intuitive human hand gestures for daily tasks such as picking, placing, and object manipulation [57]. In the experiments, the grab gesture is not treated as a rigid object; the implemented system is designed to tolerate deformation and rotation of the gesture up to the limit at which the global shape is preserved in the captured frames.

5.1.1 Camera and Experiment Condition

Experiments on gesture detection using rotational symmetry patterns are generally conducted in a lab environment with normal lighting conditions and different backgrounds. For all the experiments a single RGB webcam is used, in both static and semi-dynamic (held in one hand) setups, to simulate stationary and mobile configurations. The distance between the camera and the user's gesture is normally between 15 and 40 cm. To test the robustness of the system, various backgrounds with different colors and patterns are used. In addition, a number of users with different skin colors and hand sizes are considered in the tests.

Figure 5.1: Sample variations of the Grab gesture.

5.1.2 Algorithm

In the experiments of this thesis, rotational symmetry patterns are used in two different approaches for detecting the grab gesture. The first approach is based on first-order symmetry patterns, which represent curvature patterns with different orientations. Observations reveal that fingertips respond strongly to this group of symmetry patterns. On the other hand, curvature patterns are rather general, and noisy points from the background might show a similar response to the first-order symmetry detector. Therefore, in order to differentiate between noisy points and fingertips, more features should be integrated into the algorithm. The first criterion is the magnitude of the responses: responses at fingertips are normally much stronger than at noisy points. Another feature is phase; since an intuitive hand gesture will not rotate freely through every angle, a threshold on the phase limits the responses to natural observations. The third cue is skin color; a further threshold on the color of the responses helps remove more noise. Finally, by combining all these conditions, the best responses, representing the fingertips, are detected. Although this approach requires further processing for detecting the fingertips, it provides more flexibility for detecting deformed gesture patterns. By detecting the fingertips and measuring the distance between them, it is possible to model various hand gestures.
The other developed method for detecting the grab gesture is based on second-order symmetry patterns, which represent circular patterns. With some constraints on the phase of the detected patterns, the circular form of the grab gesture can be detected properly. The constraints on the phase can be set based on the restriction of the wrist joints, which rotate only within a limited angle. Therefore, the grab gesture can be detected by searching for circular patterns with a phase variation between +/- 45 degrees (see Fig. 5.1). To improve the robustness of the system, after the first detections a region of interest is defined around the localized gesture to secure correct detection in consecutive frames and automatically remove noisy points. Since the second-order rotational symmetries represent more complex patterns, there are significantly fewer noisy points than in the previous case; on the other hand, the flexibility of the user's gesture is lower than with fingertip detection. The center of the patterns detected with second-order symmetries returns a point near the center of the grab gesture.
In order to retrieve the 3D motion of the localized gesture between the image frames, SIFT feature detection and tracking is performed. For faster processing, the feature points in the first frame are detected and then tracked in consecutive frames to retrieve the 3D motion parameters. In the PC implementation, full SIFT feature matching between all frames was also tested. In the tracking case, when the number of features drops below 35 points, feature detection is restarted to guarantee robust motion analysis.

Figure 5.2: 3D model manipulation using second-order rotational symmetry patterns. The graphical model follows the exact motion of the user's hand gesture in 3D space.
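A simplified sketch of second-order rotational symmetry detection: the gradient field is squared into a double-angle orientation image and correlated with the conjugated second-order basis exp(i2φ), so circular patterns produce a coherent, large-magnitude response at their centers. The window radius and the synthetic ring test are illustrative assumptions; the thesis implementation is in C/C++:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def second_order_symmetry(img, radius=12):
    """Response map of a second-order rotational symmetry (circular
    pattern) detector: correlate the double-angle orientation image with
    the conjugated basis exp(i*2*phi) over a disc-shaped window."""
    gy, gx = np.gradient(img.astype(float))
    z = (gx + 1j * gy) ** 2                 # double-angle representation
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    phi = np.arctan2(ys, xs)
    disc = (xs ** 2 + ys ** 2) <= radius ** 2
    kernel = np.where(disc, np.exp(-2j * phi), 0)
    zp = np.pad(z, radius)
    windows = sliding_window_view(zp, kernel.shape)
    return np.abs((windows * kernel).sum(axis=(-2, -1)))

# synthetic test: a radial ring pattern peaks at its own center
yy, xx = np.mgrid[:64, :64]
r = np.hypot(yy - 32, xx - 32)
ring = np.exp(-(r - 10.0) ** 2 / 8.0)
resp = second_order_symmetry(ring)
cy, cx = np.unravel_index(np.argmax(resp), resp.shape)
```

At the center of a circular pattern every gradient is radial, so the double-angle phases cancel the kernel phases exactly and the contributions add coherently; elsewhere they partially cancel.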

5.1.3 Programming Environment and Results

The first version of the gesture analysis system, based on rotational symmetries, was developed in Matlab. After the preliminary results were approved, the program was re-implemented in a C/C++ environment, which significantly improved the efficiency of the system in processing the video sequence in real-time. Since rotational symmetries are low-level patterns, the detection process is computationally extremely fast. This is the major advantage that improves the quality of interaction in real-time applications; even with the further processing for retrieving the 3D motion parameters, the performance required for efficient interaction is achieved. As reflected in [57, 66], the measured detection accuracy shows the effectiveness of the algorithm. The system based on the first-order rotational symmetry detector returns the fingertip positions; to localize the grab gesture, the middle position between the detected thumb and index finger is taken as the output. The system based on the second-order symmetry detector returns the center of the circular pattern, so for the grab gesture the position of the response is always a point close to the gesture center. In both cases, the detected gesture points are used for manipulation of graphical objects. The tested scenarios are based on 3D tracking and rotation with six-DOF motion analysis (see Fig. 5.2).


Figure 5.3: Capturing the motion parameters for tagging the pose information to the database images.

5.2 Experiments on Gesture Search Framework

Experiments on the gesture search framework can be divided into two steps: first, an offline step that includes constructing the database entries, tagging the motion parameters, and forming the vocabulary table of the gestures; second, the online gesture search process for a query input, which includes the scoring process and the neighborhood analysis for finding the best match for the query.

5.2.1 Constructing the Database

The main strategy behind constructing the database is to record and store all possible hand gestures, including deformation, scaling, and translation variations. Moreover, the stored gesture frames should carry the 3D motion information for instant retrieval after the matching step. For this reason, an active vision system is used to immediately retrieve and tag the 3D motion parameters (six parameters comprising the 3D position and orientation) to each image frame while the database images are recorded. The whole database is recorded in a lab environment with stable lighting conditions and a plain green background. In order to easily obtain a clear image of the gesture and eliminate the rest of the image, extra green paper covers the arm and the hand-mounted camera. The active camera is mounted on the back of the hand, and a second camera captures the video sequence while the user performs different gestures. The active camera thus captures frames from the environment for online 3D motion analysis, while the second camera simultaneously captures the gesture sequence. Finally, the retrieved 3D orientation, based on the hand motion, is tagged to the synchronized frame from the second camera, and this process continues until the construction of the database is complete (see Fig. 5.3 and Fig. 5.4). The process of generating the database images and retrieving the orientation parameters is conducted in a C++ environment and performed in real-time.
Another reason to cover the arm with the background color and provide a clear image of the hand is to calculate the 3D position of the gesture in each database image. In this step, the database images are first converted to edge images. Afterwards, the average position of the edges in the image coordinate system is calculated for each frame, and the bounding box around each gesture is defined. The size of the bounding box reflects the scaling factor, or depth, of the gesture with respect to the camera position. Finally, these three parameters representing the 3D gesture position are retrieved. At this point the database of hand gestures, consisting of the gesture images, the converted gesture edge images, and the corresponding text files containing the six motion parameters, is constructed.

Figure 5.4: Active motion analysis for tagging the orientation information to the database images. The vision sensor is attached to the back of the hand for measuring the 3D motion parameters. The retrieved motion parameters are applied to a 3D model to validate the accuracy.
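The three position parameters described above can be sketched directly from a binary edge image; taking the larger bounding-box side as the scale proxy is an illustrative choice:

```python
import numpy as np

def gesture_position(edge_img):
    """Estimate the three position parameters of a gesture from its binary
    edge image: (x, y) as the mean edge position, plus a depth/scale proxy
    from the bounding-box size (a larger box means a closer hand)."""
    ys, xs = np.nonzero(edge_img)
    x_mean, y_mean = xs.mean(), ys.mean()
    box_w = xs.max() - xs.min() + 1
    box_h = ys.max() - ys.min() + 1
    scale = max(box_w, box_h)       # proxy for depth w.r.t. the camera
    return x_mean, y_mean, scale
```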

5.2.2 Forming the Vocabulary Table

The implemented algorithm for finding the best match for query frames is based on the low-level edge orientation features. Thus, the vocabulary table contains the indices of the relevant database images at different locations. The table has the number of database images as its row size and m x n x n_theta as its column size, where m and n are the width and height of each image and n_theta is the number of angle intervals. For most of the conducted tests, with an image size of 320x240, eight angle intervals, and 6000 images in the database, the vocabulary table has 6000 rows and 614400 columns. Each block in the vocabulary table stores the indices of the database images that have an edge at that position with a similar orientation. The conducted experiments reveal that with a database size of around 6000, the maximum number of indices in each block does not exceed 100. The whole process of forming the vocabulary table is performed in Matlab, and the final table is stored in text format for the online retrieval step.


5.2.3 Gesture Search Engine and Neighborhood Analysis

The online search system is implemented in a C++ environment for efficient real-time interaction. First, the vocabulary table is loaded into memory for fast retrieval. Afterwards, each frame from the real-time video input is sent to the gesture search engine. After the direct and reverse scoring steps, the top four matches are sent to the neighborhood analysis step and the best match is selected.
Different methods for analyzing and mapping the gesture images from the high-dimensional space to 3D space are introduced in paper III [56]. The main idea is to analyze the distances between the gesture patterns and construct a meaningful structure for the neighborhood search. Since gestural interaction represents a smooth motion in 3D space, neighborhood analysis for selecting or predicting the closest database match for the query inputs is quite important. In the implementations, the Laplacian method is selected for mapping the gesture vectors from the high-dimensional space to 3D space. The Laplacian is chosen over other methods such as PCA and LLE because of the visible pattern in the 3D representation of the image vectors. As demonstrated in Fig. 5.5, each branch in the graph indicates a clear change in the positioning of the gesture patterns within the database images. Basically, the dense center mostly represents gestures around the center point of the image, and each branch shows a direction towards a corner of the image frame. In the process of selecting the best match from the top matches, neighborhood analysis returns the closest gesture match based on the previously selected match in the video sequence. This step smooths the motion of the retrieved sequence.
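A compact sketch of a Laplacian (eigenmap) embedding of the kind used for the neighborhood analysis, in its simplest unnormalized form (symmetric kNN graph, L = D - W, eigenvectors of the smallest nonzero eigenvalues); the neighborhood size and toy data are illustrative assumptions:

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbors=5, n_components=3):
    """Map the row vectors of X to n_components dimensions with a basic
    Laplacian eigenmap. (The standard method solves the generalized
    problem L v = lambda D v; the plain L is used here for brevity.)"""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:   # skip self
            W[i, j] = W[j, i] = 1.0                      # symmetric kNN graph
    L = np.diag(W.sum(1)) - W                            # graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_components + 1]                   # drop constant vector

# toy usage: 40 gesture images flattened to vectors, embedded into 3D
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 100))
Y = laplacian_eigenmap(X)
```

Nearby rows of Y can then serve as the motion-map neighborhood when choosing among the top matches for consecutive frames.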

5.2.4 Gesture Search Results

During the experiments, various databases were provided for testing the performance of the system. The earliest database contained about 1500 images of the grab gesture. Later, another database with 3000 images, including the grab gesture and other types of hand gestures, was captured. Afterwards, the database was extended to more than 6000 images; at this step non-gesture images were also included, to analyze the performance on a larger database with noisy entries. At each step, system tests were also conducted on resized images. In general, both 320x240 and 160x120 images show quite promising results in the tests. Most of the tests are based on 320x240 images, but if the database size grows beyond 10000 entries, the image size can be set to 160x120 to improve the efficiency of the retrieval. Among the steps of the online retrieval system, the reverse scoring consumes most of the processing time; this is the major reason that the reverse scoring is conducted only for the top ten matches retrieved by the direct scoring step. For instance, with 6000 images in the database and an image size of 320x240, the retrieval system processes 25 frames/second with the direct scoring step alone, which drops to 15 frames/second when the reverse scoring is applied. Thus, the reverse scoring has a stronger effect on the processing time than the database size. Fig. 5.6 shows sample gesture inputs and the corresponding best matches from the database of hand gestures.

Figure 5.5: Left: Gesture images mapped to the three-dimensional space by the Laplacian method. Right: Gesture and non-gesture images mapped to the 3D space by PCA.


Figure 5.6: Output of the gesture search engine for a number of sample query gestures.

5.3 Technical Comparison between the Prior Art and the Proposed Solutions

Basically, the majority of hand gesture recognition and tracking systems employ vision-based approaches to handle the technical challenges. RGB and depth cameras, or combinations of the two (e.g., Kinect), are the widely used hardware for capturing body gestures. Since the contributions of this thesis target current and future mobile devices, the introduced methods are based on ordinary RGB cameras such as webcams and mobile cameras. Although various algorithms have been introduced in the computer vision and pattern recognition literature, the proposed solutions of this thesis can be compared with the common approaches for hand detection, gesture recognition, gesture tracking, and 3D motion analysis. 2D and 3D features, 3D models, skeletal models, appearance, color, and depth information are among the best-known properties used for gesture detection and tracking. In fact, the majority of the prior art can be grouped into these categories or combinations of them.
As discussed before, the proposed gesture analysis system based on rotational symmetry patterns can be considered a combination of the mentioned computer vision approaches. On the other hand, the introduced gesture analysis system based on large-scale search is not a classical computer vision approach. Nevertheless, the technical contributions of this thesis can be compared with the prior art from different aspects. Table 5.1 provides a comprehensive comparison between the prior art and the proposed solutions; Method 1 and Method 2 represent the rotational symmetries and the gesture search method, respectively. Ratings are estimated based on reviews and surveys of current vision-based technologies [78].

5.4 3D Rendering and Graphical Interface

3D rendering is the process of generating a graphical view from three-dimensional models. 3D models might contain various properties such as geometry and texture. The 3D rendering process depicts the 3D scene as a picture taken from a particular perspective, which may change according to the desired viewpoint in a continuous sequence. Various features such as lighting, shadow, atmosphere, refraction of light, or motion blur on moving objects can enhance the realism of the 3D rendering. With the development of modern computers, 3D rendering has become a major step in many applications such as video games, simulators, movies, augmented reality and virtual reality. In order to convey a realistic 3D experience to users, two possible approaches, or a combination of them, might be considered. First, the generated graphical view might be understood through various noticeable features such as perspective, shading, texture-mapping, reflection, depth of field, transparency, translucency and refraction. Second, the generated scene might be rendered using stereoscopic techniques to convey the illusion of depth, with the final result visualized on a 3D display. Nowadays, due to the popularity of 3D displays, both techniques are combined to enhance the quality of the user experience. Since the interaction between users and digital devices happens at the interface level, the effect of the provided technical solutions might be visualized in a graphical interface. Basically, two scenarios are considered for graphical


Property                      shape-  color-  depth-  3D model-  Method 1      Method 2
                              based   based   based   based      (Rot. sym.)   (Gesture search)
Efficiency of detection         3       5       5        4            5              5
Accuracy of detection           4       3       5        4            4              5
Tracking quality                4       3       5        4            4              5
Gesture recognition             4       3       4        4            4              5
Robustness to environmental
  conditions                    3       1       2        3            3              5
3D motion                       3       2       4        4            4              4
Large-scale gesture             2       3       4        4            3              5
Cluttered background            3       2       5        4            4              5
Occlusion                       2       1       1        3            1              3
Scale-invariance                2       3       4        3            3              5
Rotation-invariance             2       3       4        3            3              5
Deformation-invariance          2       3       3        3            3              5
Mobile platform                 3       3       0        2            4              5
Multi-gesture                   4       3       5        4            3              5

Table 5.1: Properties of the different methods in gesture analysis are compared with the proposed solutions of this thesis. Methods 1 and 2 represent the discussed methods based on rotational symmetries and the gesture search engine, respectively. The quality of the different properties is scaled between zero and five. 0: not applicable. 1: very weak. 2: weak. 3: average. 4: strong. 5: very strong.

interface design in this thesis: first, manipulation of graphical objects using hand gestures, and second, manipulation of the graphical scene in interactive 3D vision. In both cases the graphics are rendered in an OpenGL environment. Perspective projection, lighting, color, reflection and other features are used to provide a realistic 3D experience. Moreover, in order to convey the illusion of depth, the rendered output is provided in color-coded stereoscopic 3D. With this technology, users can experience the illusion of depth on any 2D screen using simple color-coded glasses. In most environments designed for 3D gestural interaction, manipulation of the graphical objects in an augmented environment is considered. Normally, the rendered objects are shown on the live camera view while the user can pick, rotate, move, zoom in/out, or even reshape the objects in real time. For interactive 3D vision, the main goal is to place users in a virtual reality environment, enabling them to move and perceive the rendered scene in an interactive manner. Thus, the recommended setup for this scenario is a rather large, or possibly wall-sized, screen.

5.5 Research Scenarios

Conceptual and technical contributions of this thesis have been tested and used for implementation in different research scenarios. The major research scenarios can be summarized in the following items.

5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform

Since one of the main target areas for applying the proposed technologies is future mobile devices, implementation on mobile platforms is an essential part of this work. The Android platform is selected for mobile implementation of the gesture-based interaction. The core of the system for detecting, tracking and analyzing the gestures is developed in native C/C++ in an OpenCV environment. The graphical part is mainly handled by OpenGL (Open Graphics Library [79]). In some earlier versions, Min3D (a 3D library for Android using


Figure 5.7: Graphical interface in mobile application. Implementation of the proposed systems in photo browsing and 3D manipulation.

Java and OpenGL ES) has been used for rendering different graphical objects (see Fig. 5.7).

5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display

The proposed interactive 3D vision is tested in three different setups. The first test is performed on a normal computer display with both 2D and stereoscopic 3D rendering. In the second test, the output is displayed on a wall using a video projector. The third test is performed on the 4K wall-sized display in the KTH VIC lab (visualization studio). In all three cases the graphical scene is rendered in an OpenGL environment. In the stereoscopic case, passive 3D glasses are used for depth perception (see Fig. 5.8).

An important point to mention here is that interactive 3D vision on personal devices might be set up in both active and passive configurations. As discussed before, in the active configuration, where the vision sensor is mounted on the user's head, the resolution of the 3D motion analysis is significantly higher than in the passive configuration. The accuracy level is highly dependent on the relative distance between the moving subject and the vision sensor. Thus, if we remove the body-mounted camera and simply use the device's camera for motion tracking, decreasing the distance between user and device can restore the accuracy to a proper level. This is usually the case when users interact with their devices at closer range, such as when operating laptops, smartphones and tablets. In these cases, for the simplicity and comfort of the users, the passive configuration is more practical. Although it does not provide the same level of accuracy as active vision, it is generally acceptable for natural interaction (see Fig. 5.9). In the conducted experiments on a MacBook Pro, the device's camera is used for tracking the head motion. In order to improve the quality of tracking, in the first step, face detection is applied to immediately separate the moving part from the rest of the image. Afterwards, the discussed technology for tracking and estimating the 3D head motion is used to provide the required data for 3D interaction with the content. In larger interaction spaces, such as visualization on wall-sized displays, active motion estimation is unavoidable for accurate and high-resolution interaction, since the quality of motion tracking with a passive installation is quite weak at large distances between the sensor and the moving subject.
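The distance dependence above follows directly from the pinhole camera model. A back-of-the-envelope sketch (all numbers hypothetical, not measurements from the thesis) of how many pixels of image displacement one centimeter of lateral motion produces at different working distances:

```python
# Pinhole-camera estimate: image-plane displacement (in pixels) caused
# by 1 cm of lateral subject motion at a given distance.
def pixels_per_cm(focal_px: float, distance_cm: float) -> float:
    """delta_pixels = focal_length_px * delta_X / Z, with delta_X = 1 cm."""
    return focal_px / distance_cm

# Hypothetical webcam with an ~800 px focal length:
close = pixels_per_cm(800, 50)    # laptop use, ~0.5 m -> 16 px per cm
far   = pixels_per_cm(800, 400)   # wall display, ~4 m ->  2 px per cm
```

An eightfold drop in trackable resolution between the two setups illustrates why the passive configuration is acceptable at close range but active (head-mounted) sensing becomes unavoidable at wall-display distances.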

5.5.3 3D Rendering and Visualization of 2D Content

Unlike 3D graphics, where the geometry of the scene and objects is known, 3D visualization of 2D content such as images and videos is quite a challenging task. Single-view and multiple-view analysis for retrieving 3D information from 2D content are introduced in the contributions of papers V and VI [58, 59]. The main idea is to provide a supportive technology that enhances the user experience in interactive applications. This technology enables users to see the content in 3D while they operate the application or manipulate the content. The 3D visualization is based on stereoscopic techniques using passive glasses. Due to the simplicity of the tests and applicability to any


Figure 5.8: Active 3D vision tests in different setups.

type of display, anaglyph glasses are used in all the conducted tests [58, 59]. The 3D channel coding is performed based on the selected glasses.

5.6 Potential Applications

Contributions of this thesis in 3D motion analysis and visualization can be used in a wide range of multimedia applications on mobile devices and stationary systems. Virtual reality, augmented reality, medical imaging, motion-based interactive systems, 3D games, 3D displays, motion-based localization and positioning systems, visual search and many other applications might take advantage of the proposed methods. Here, several implemented and potential applications based on the contributions of this thesis are briefly explained.


Figure 5.9: The passive configuration shows a similar effect to the active one in close-range interaction.

5.6.1 3D Photo Browsing

The interactive photo browser enables users to manipulate their photo collections in 3D space. Unlike 2D interaction, where only one user can operate the device (due to the limited area for interaction), in 3D interaction two or more users might share both the interaction and visualization spaces for collaborative tasks. Users might sit together to share their photo collections and manipulate them in 3D space while each has their own device; they can use a single device and share the interaction and visualization spaces; or they might share the virtual space while present at different locations, etc.

5.6.2 Virtual/Augmented Reality

In [39, 54], gestural interaction techniques are applied to render graphical objects in augmented environments. The analysis of the hand gesture motion behind the mobile phone's camera is used to manipulate graphical models.


The six-DOF motion control with a high level of accuracy enables users to experience a realistic interaction with their mobile devices. The proposed gestural interaction for manipulation of graphical models is also implemented on the Android platform. Efficiency and performance of the system are tested and validated on different devices such as Samsung and HTC smartphones. The visual outputs are rendered in both 2D and 3D formats.

5.6.3 Interactive 3D Display

Human motion tracking might be used to interact with the display device. In the implemented interactive systems introduced in [57, 61], the user controls the content of the display using head or gesture motion. The retrieved motion parameters (rotations and translations along three axes) between consecutive frames captured by the device's camera apply the motion control to the application.
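One way such per-frame motion parameters can drive an application is to accumulate them into a viewing pose. The sketch below is a hypothetical simplification (translations plus a single yaw angle instead of full six-DOF pose composition) to show the idea of folding frame-to-frame deltas into the view state:

```python
import math

# Accumulating per-frame motion deltas (tx, ty, tz and a yaw rotation,
# for brevity) into a view pose that steers the rendered scene.
class ViewPose:
    def __init__(self):
        self.x = self.y = self.z = 0.0
        self.yaw = 0.0   # radians

    def apply_frame_delta(self, dx, dy, dz, dyaw):
        """Fold the motion estimated between two consecutive camera
        frames into the current viewing pose."""
        self.x += dx
        self.y += dy
        self.z += dz
        self.yaw = (self.yaw + dyaw) % (2 * math.pi)

pose = ViewPose()
for delta in [(0.1, 0.0, -0.2, 0.05)] * 10:   # ten identical frame deltas
    pose.apply_frame_delta(*delta)
# pose.x ~ 1.0, pose.z ~ -2.0, pose.yaw ~ 0.5 rad
```

A full implementation would compose rotation matrices (or quaternions) rather than summing a single angle, but the control loop, one pose update per captured frame, is the same.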

5.6.4 Medical Applications

The proposed technologies might be widely used in medical applications. 3D motion tracking and analysis of patients helps physicians diagnose and treat physical disorders in various types of diseases. Furthermore, 3D imaging and visualization of the body organs, together with interactive 3D manipulation on display devices, help experts analyze and diagnose physical problems and select the required treatment.

5.6.5 3D Games

One of the most exciting areas that can benefit from efficient 3D motion analysis is 3D gaming. Bare-hand, marker-less gesture analysis using ordinary 2D cameras provides a great chance to experience realistic interaction with the graphical environment in 3D games. Head and gesture detection and tracking, using the techniques discussed in the previous chapters, provide an effective way of playing in 3D environments.


5.6.6 3D Modeling and Reconstruction

Many digital photo/video capturing devices, in addition to a vision sensor, provide other types of embedded sensors such as GPS and orientation sensors. Therefore, extra information such as position and orientation can be tagged to the captured photos. This geo-tagging has been found to be useful in many applications such as 3D digital photo albums, photo-tagged maps and visual navigation. In many cases, however, the geo-tagged metadata are corrupted by noise or missing due to the unavailability of the GPS signal or magnetic sensors. In paper VII [60], we discuss how 3D motion analysis can help to form a signal model and significantly correct this noisy data.
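For illustration only (the actual signal model is developed in paper VII), a generic exponential smoother shows how a motion prior can suppress jitter in geo-tagged metadata such as a compass heading:

```python
# Generic exponential smoothing of a noisy metadata sequence (e.g.,
# compass headings in degrees). This is a stand-in illustration, not
# the signal model from paper VII.
def smooth(measurements, alpha=0.3):
    """Blend each noisy measurement with the estimate carried over
    from previous samples; a small alpha trusts the prior more."""
    estimate = measurements[0]
    out = [estimate]
    for z in measurements[1:]:
        estimate = (1 - alpha) * estimate + alpha * z
        out.append(estimate)
    return out

noisy = [10.0, 30.0, 12.0, 28.0, 11.0]   # jittery headings
smoothed = smooth(noisy)                 # a much flatter sequence
```

A model-based corrector such as the one in paper VII can do better than this, because it predicts the next value from the estimated 3D camera motion instead of from the previous sample alone.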

5.6.7 Wearable AR Displays

Contributions of this thesis fit the area of mobile augmented reality particularly well. AR glasses such as Google Glass, which integrate information into augmented environments, require intuitive interaction technology. Since in wearable AR glasses the touchscreen will be removed or reduced to a smaller scale, convenient 3D gestural interaction can definitely enhance the interaction experience.

5.7 Usability Analysis in Object Manipulation: Touchscreen Interaction vs. 3D Gestural Interaction

In order to evaluate the user experience in 3D gestural interaction, a comparative user study was conducted. In this study, manipulation of graphical objects in 3D space using bare-hand gestures is considered. Learnability, user experience and interaction quality are evaluated and compared with the same task in 2D touchscreen interaction. Four students from the course Evaluation Methods in HCI (DH2408) assisted this study by selecting this case as their course project. In order to provide a comparative scenario for evaluating the 3D gestural interaction, two sets of designed interfaces and tasks for 2D touchscreen and 3D gesture-based interaction, the required usability tests, and questionnaires for the user interviews were provided to the students. In this task, they were supposed to invite users, test the learnability and usability of both systems, and collect and report the required information based on the given instructions. Here, the whole process is explained in detail.

Touchscreen interaction: Two smartphones are considered for this case, positioned side-by-side on a table. Smartphone 1 plays a pre-recorded video of the rendered graphical model. On smartphone 2, the same graphical model is rendered, and the user can manipulate it through the touchscreen, controlling the position, zooming and viewpoint along the x, y, and z axes. During the task, the user should follow and mimic the exact motion of the graphical model on smartphone 1 through real-time manipulation of the model on smartphone 2 using touchscreen interaction. A webcam mounted above both smartphones records the touchscreen interaction for further study.

3D gestural interaction: In this case, the user can control and manipulate the same graphical model in 3D space using bare-hand interaction. A Kinect depth sensor is used to detect and measure the user's hand motion in 3D space. As in the previous case, the same pre-recorded motion tasks are displayed on the computer screen. The user should follow and mimic the motion of the graphical model in free space through real-time 3D interaction. A camera records the whole task for further study.

Task: Both the 2D and 3D interaction tasks are divided into different parts. In each part the graphical model moves with a specific motion sequence to reach a certain position/orientation. Afterwards, the user should follow the same motion to reach a similar position/orientation. These pre-recorded tasks are divided into 10 parts. The first two videos are used for the learnability step, where new users learn how to work with both the 2D and 3D tasks. In the main part, 8


videos (2 easy, 4 normal, and 2 hard) are considered (see Fig. 5.10).

Figure 5.10: User test in 2D touchscreen and 3D gestural interaction.

5.7.1 User Test

For this study, ten users were selected: one pilot user, seven in the primary target group (experienced in using touchscreens) and two in the secondary target group (very little or no experience in using touchscreens). According to Nielsen [80, 81], five people find 85% of the problems; therefore, ten users (mostly between the ages of 20 and 30) were enough to provide proper results. As mentioned before, the goals are to test the learnability of the 3D gestural interaction system and to compare the user experience of 3D gestural interaction with the touchscreen interface. For the comparative analysis, efficiency, effectiveness and user satisfaction are considered the main criteria. In order to increase the reliability of the tests, subjective data based on the experience of the participants during and after the test sessions were gathered, through filling in scale-based forms and answering predefined questions. In addition, user performance was observed during the comparative tests and the manual data were collected. Together, these two steps provide access to both quantitative and qualitative data for the final evaluation.
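The 85% figure cited from Nielsen comes from the Nielsen-Landauer problem-discovery model, which estimates the share of usability problems found by n evaluators as 1 - (1 - L)^n, with a typical per-user detection rate of L = 0.31:

```python
# Nielsen & Landauer's problem-discovery model: the proportion of
# usability problems found by n evaluators, given a per-user
# detection probability L (0.31 is the commonly cited average).
def problems_found(n: int, L: float = 0.31) -> float:
    return 1 - (1 - L) ** n

five = problems_found(5)   # ~0.84, i.e. roughly 85% of the problems
ten  = problems_found(10)  # diminishing returns beyond the first few users
```

The curve flattens quickly, which is why ten users comfortably cover the study's needs.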


Figure 5.11: Average score of the 2D vs. 3D user performance.

5.7.2 Usability Results

Since the tests are based on following the movements in a video, the following method is considered for the quantitative measurement of user performance; it is motivated by the MUSiC user performance method [82]. All observers in the student group watched the recordings of each and every user test and scored them from 1 to 7 according to how well the user performed relative to the instructions in the video (1 = no coherence at all; 7 = no difference between video and performance). After all four students had scored the user tests, the average scores were calculated. Fig. 5.11 shows the measured comparative performance between the 2D and 3D scenarios. Since tasks 1 and 2 are used for the learnability step, they are not included in the chart.

Analysis of the collected data highlights some important points. Firstly, it is clear that the touchscreen interface works well as long as the task is limited to spinning/turning around the x and y axes without any translation or zooming. As soon as movements in 3D or zooming come into play, the 3D gestural interface clearly shows its strength. Scoring on tasks 3, 4, 6, 8 and 9 supports this observation (task 3 was primarily limited to turning the object). This is also confirmed by the comments of the users in the interviews, where five users mentioned that turning objects was easier than turning and moving on the touchscreen, whereas a combination of rotations, translations and zooming is simpler with the 3D gestural interaction. It may be too early to tell without further studies with larger user groups, but according to the collected data, the 3D gestural interaction seems overall to be the preferred system, since all users (except in one case) scored higher on it than on the touchscreen interface.

Two of the test users had little to no experience using touchscreens (two male teachers in their sixties); they formed the secondary group in the usability evaluation. Although neither of them owned a smartphone or regularly used touchscreens, one of them had a little experience with a Nintendo Wii, which might be why he scored substantially higher in the 3D interaction part, whereas the second user scored only a few points higher than with the touchscreen interface. It would seem that 3D gestural interaction is the preferred system for people who learn both systems at the same time: both of them scored higher on the 3D gestural interaction, and during the interviews both said that they strongly prefer the 3D interface over the touchscreen. However, they both mentioned that large hands and fingers might cause problems on touchscreens. Findings from the interviews reveal that users truly believe 3D gestural interaction will become a standard interaction tool for future applications, although opinions differ on how widely it will be used in future interactive scenarios. During the interviews, the users also had to rate a few statements on a scale of 1 to 5 (1 = I do not agree at all; 5 = I fully agree). The results of this interview are reflected in Fig. 5.12.
In the learnability step, the main idea was to let users watch the videos and intuitively start following the recorded motions instead of giving them specific instructions. Based on the responses gathered from these questions, it is clear that users find the 3D system easy to learn and think that most people would learn to use a similar interface quite easily. They mainly believe that with a bit more time in front of the 3D


interface, they would master it. 3D gestural interaction has a quicker learning curve than the touchscreen interface and is clearly better for performing more complex movements. However, interfaces using 3D gestural interaction must be specifically designed for gesture-based inputs, in the same way as applications for touchscreens are developed differently from those for a desktop computer where mouse and keyboard are available. This is already done in games designed for Kinect and similar products.

Figure 5.12: Qualitative results of 3D gestural interaction.

Chapter 6

Concluding Remarks and Future Direction

6.1 Contributions

Today's multimedia technology is highly inspired by two strong trends: technologies towards intuitive interaction and technologies towards augmented visualization. The former trend provides natural interaction technology for effective communication between users and smart devices. The latter trend, which is considered the direction towards the fifth screen, or augmented reality visualization, combines the interactive experience on the personalized screen with augmented information from the Internet. Technical contributions of this thesis can support the development of both trends. In fact, 3D interaction through intuitive gestures is an unavoidable part of future AR applications. Therefore, defining new frameworks for effective interaction in augmented environments will improve the quality of user experience in future mobile applications. Basically, the contributions of this thesis can be divided into three main categories: conceptual models for future human mobile device interaction; technical contributions towards 3D interaction design and interactive visualization; and implementation of the proposed concepts and methods for different application scenarios.


6.1.1 Conceptual Models for Future Human Mobile Device Interaction

This thesis proposes new concepts and frameworks for future human mobile device interaction. The main features of the proposed ideas can be summarized as follows.

- Current and future trends in multimedia technology, especially mobile multimedia, together with future demands, challenges, limitations and directions, are discussed in detail.
- The evolution of interaction and visualization facilities on mobile devices and the corresponding future trends are investigated.
- The concept of extending the interaction and visualization spaces to 3D on mobile devices is introduced and its advantages are discussed.
- The concept of 3D gestural interaction on mobile devices is introduced and its significant impacts are discussed.
- The concept of collaborative tasks on mobile devices using bare-hand interaction in 3D, sharing the interaction and visualization spaces, is discussed.
- Potential application scenarios based on 3D gestural interaction are introduced.
- The concepts of user-manipulated content and interactive 3D vision are introduced and discussed.

6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization

Technical contributions are the main focus of this thesis. New methods and frameworks for 3D gesture analysis and interactive visualization have been introduced. Specifically, the technical contributions focus on two major problems: first, interaction with mobile devices based on motion analysis in 3D space [38, 39, 54, 57], and second, 3D visualization on ordinary 2D displays [58, 59, 60, 61]. The introduced interactive systems are based on detection, tracking, and analysis of the users' 3D motion from the visual input. This visual input might come from the mobile device's camera, a body-mounted camera, a webcam, and in general, any type of vision sensor. The technical contributions can be listed as below.

- The concept of 3D gesture recognition and tracking, with new methods and algorithms based on low-level operators, is discussed.
- Novel methods and algorithms for gesture recognition and tracking based on a large-scale search framework are introduced.
- The proposed methods for 3D gesture analysis are compared and evaluated.
- Technical solutions for the motion-based interactive 3D display are introduced and compared in different configurations.
- Different configurations for the interaction between users and multimedia content in various scenarios and platforms are discussed.
- New methods for 3D visualization of monocular images, photo collections, and videos are investigated and discussed.

6.1.3 Implementations

Implemented scenarios based on the conceptual and technical contributions can be summarized in the following items.

- New methods for gesture analysis based on low-level patterns are implemented.
- A new framework for 3D gesture analysis based on large-scale retrieval and search methods is implemented.
- 3D gesture detection and tracking are implemented on different platforms (Windows, Mac OS X, Android).
- Interactive 3D vision is implemented and tested in different scenarios, from personal smart devices to a wall-sized display.
- 3D visualization of monocular images, photo collections and videos is implemented and tested.


6.2 Concluding Remarks and Future Direction

Although today's media industry is highly inspired by 3D technology, realistic interaction and visualization are still at an early stage of development. Realistic visualization has attracted a lot of attention during the recent decade. The introduction of 3D display technology in TVs, projectors and even mobile devices is an indication of the fast-growing 3D market. Strong efforts towards moving from stereoscopic 3D to glasses-free 3D displays are another indication of the general trend towards intuitive and realistic visualization. However, the current technology of 3D displays is quite different from real human observation of the 3D world, and significant improvements are required to fulfill the objective of realistic visualization. Contributions of this thesis in 3D visualization, especially the introduced concept and technology for interactive 3D display, support realistic and intuitive visualization. In fact, the main idea behind interactive visualization is to enable users to observe the content and control the angle and viewpoint in a manner similar to real-world observation. The introduction of 3D interaction facilities such as Microsoft Kinect has significantly changed the way people interact with digital content, especially in the entertainment area. Since real 3D interaction requires extremely high accuracy in 3D motion estimation and tracking of the body joints, there are still many unsolved issues and challenges. However, strong indications reveal that future human mobile interaction will be highly affected by intuitive 3D interaction. The contributions of this thesis aimed to tackle the fundamental issues and propose novel ideas towards solving them. In this thesis, 3D gestural interaction is deeply investigated as an effective tool for future human mobile device interaction. Computer vision, pattern recognition, and machine learning methods are widely used in this area.
Observations and experimental results of this thesis indicate that although these methods can be extremely useful for solving individual challenges in 3D gesture recognition, 3D motion analysis, etc., they are not adequate for the generalized problem formulation. Therefore, new methods for 3D gesture analysis through a large-scale retrieval system have been introduced. Given the possibility of storing and processing extremely large databases and the corresponding metadata, future methodologies for solving the discussed problems will be mostly centered around metadata retrieval and search methods instead of processing the low-level data. Thus, the preparation of rich and comprehensive databases can formulate the classical problems in a totally new way. For instance, the challenges of gesture recognition and tracking can gradually be shifted from the signal processing level to large-scale search and matching frameworks. Although image-based retrieval and template matching are well-known concepts in media technology, a large-scale search framework for gesture analysis is a rather new concept and needs further development. This thesis has introduced and investigated this framework for high-accuracy 3D motion retrieval and gesture tracking. Experimental results indicate that the search framework is extremely powerful, especially when recognition, tracking and 3D motion retrieval are required all together, in large scale and in real time. On the other hand, if we target specific patterns and models for recognition and tracking, computer vision methods can handle the complexity of the problems.

Figure 6.1: Different approaches for solving the technical challenges in media technology. The current trend shows the gradual move from low-level features and high-level algorithms towards metadata retrieval from large databases.

6.2.1 Technical Challenges

During this research, various methodologies, algorithms and approaches towards solving the current and future challenges in media technology have been considered. Some of the important technical challenges and findings tackled during this research work are highlighted in the following items.

6.2.1.1 Active vs. Passive Motion Capture

A common discussion in human motion analysis is where the motion capture sensor should be mounted. In order to enhance the accuracy of the tracking and the convenience of the users, various configurations have been introduced in different application scenarios. As discussed before, for a more intuitive and natural interaction design, marker-less bare-hand solutions are preferred to wearable sensors such as motion capture gloves or body-mounted devices. Although the current motion analysis setups can be divided into passive and active systems, in mobile devices or augmented reality glasses motion analysis might be performed in both passive and active configurations. In fact, the mobile sensor can be used in static or dynamic modes. This possibility provides a great chance to take the technical advantages of both configurations. For instance, the camera of AR glasses offers the advantages of active motion analysis for a moving head, while it can be used as a passive sensor for hand gesture tracking. This thesis has demonstrated the practical scenarios where each configuration shows its advantages. For instance, the proposed interactive 3D visualization employs active vision for manipulation of the content from larger distances (wall-sized display and projection), while the same system is introduced with a passive configuration for close-range interaction with mobile devices or laptops. As discussed before, in order to design a realistic experience for hand gesture interaction, the passive configuration is preferred: intuitive interaction should happen using bare hands in free space. The proposed technical solutions thus provide flexible designs for different application scenarios.

6.2.1.2 Gesture Detection and Tracking without Intelligence

The majority of the available computer vision and pattern recognition methods employ complex algorithms for gesture detection, recognition, and tracking. These types of solutions usually involve heavy computation and large training sets. Obviously, for mobile systems with hardware and power limitations, the majority of the common solutions are not applicable. The idea of employing low-level operators for detecting and tracking hand gestures is to ensure efficient detection without intelligence. Although the implementation of an effective gesture analysis system without high-level detection algorithms is quite challenging, for efficiency reasons this important goal should be achieved. Employing rotational symmetries for detecting and tracking bare-hand gestures is based on this idea.
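The low-level idea can be illustrated with a hedged, minimal sketch (the thesis implementation, filters and parameters may differ): gradient orientations are mapped to a double-angle representation and correlated with an n-th order complex basis filter, so that circular patterns respond strongly to order n = 2 without any trained detector.

```python
import numpy as np

def rotational_symmetry_response(img, order=2, radius=10):
    """Magnitude of the n-th order rotational-symmetry response (sketch)."""
    gy, gx = np.gradient(img.astype(float))
    # Double-angle representation: opposite gradient directions coincide.
    z = (gx + 1j * gy) ** 2
    z = z / (np.abs(z) + 1e-9)

    # n-th order basis filter b = exp(i * n * phi) over a disc mask.
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    phi = np.arctan2(ys, xs)
    basis = np.exp(1j * order * phi) * ((xs**2 + ys**2) <= radius**2)

    # Direct valid-region correlation: coherent phases -> large response.
    h, w = img.shape
    out = np.zeros((h, w))
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch = z[y - radius:y + radius + 1, x - radius:x + radius + 1]
            out[y, x] = np.abs(np.sum(patch * np.conj(basis)))
    return out

# Usage: the response peaks at the centre of a circular (Gaussian) blob.
ys, xs = np.mgrid[0:33, 0:33]
blob = np.exp(-((xs - 16.0)**2 + (ys - 16.0)**2) / (2 * 5.0**2))
resp = rotational_symmetry_response(blob, order=2, radius=10)
peak = np.unravel_index(np.argmax(resp), resp.shape)
```

Because only gradients and one fixed correlation are involved, such an operator has a small, predictable cost, which is the property that matters on hardware-limited mobile devices.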

6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution

Obviously, with the current rate of technology development, new types of sensors will be introduced and embedded in smart devices. Although the proposed solutions of this thesis are mainly designed and tested based on current technology, they can in fact perfectly fit future environments. The development of new sensors and extra hardware-related features can additionally support the contributions and enhance the quality of the achieved results. For instance, the release of the Kinect sensor provided more flexibility to the proposed concepts, designs and technologies due to its capability to provide additional depth information. Clearly, the integration of RGB images and depth information can substantially improve the quality of detection, tracking, noise removal, etc.

Another example is the development of wearable AR glasses. Although presenting information through AR glasses is not a new concept in media technology, the technical development of recent years has made this concept feasible to implement. The combination of a lightweight wearable display and different types of sensors is the ideal scenario for gesture-based interaction technology. The technical contributions of this thesis perfectly fit this area.

6.2.1.4 Contributions of other Research Areas to Computer Vision

The rapid development of other research areas can make strong contributions to the computer vision field. Solving gesture analysis problems through search methods is based on this idea. Since search algorithms are extensively used for text and document retrieval, modeling the gesture recognition and tracking problems with common search methods such as indexing could effectively improve the research results. These findings reveal that breakthrough technologies from other research areas can be successfully adapted to similar concepts with totally different application scenarios. Basically, retrieving the best gesture entry from a huge database of images is a similar concept to finding the most relevant document for a searched text phrase. Thus, the integration of classical computer vision and pattern recognition methods with enabling technologies from other research fields can provide extremely powerful tools for solving the technical challenges.
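The analogy can be made concrete with a hypothetical sketch (not the thesis implementation; the class and names below are illustrative): gesture images described by quantized local features ("visual words") are indexed the way a text search engine indexes documents by terms, using an inverted index that maps each word to the entries containing it.

```python
# Illustrative inverted index for gesture retrieval: each database entry
# is a set of quantized feature ids plus its annotated 3D pose.
class GestureIndex:
    def __init__(self):
        self.index = {}   # visual word -> set of entry ids
        self.poses = {}   # entry id -> annotated 3D pose parameters

    def add(self, entry_id, words, pose):
        """Index one annotated database entry."""
        self.poses[entry_id] = pose
        for w in words:
            self.index.setdefault(w, set()).add(entry_id)

    def query(self, words):
        """Vote over shared words; return the best entry and its pose."""
        votes = {}
        for w in words:
            for eid in self.index.get(w, ()):
                votes[eid] = votes.get(eid, 0) + 1
        best = max(votes, key=votes.get)
        return best, self.poses[best]

# Usage: a query sharing most features with "open_hand" retrieves its pose.
idx = GestureIndex()
idx.add("open_hand", {1, 5, 9, 12}, {"pitch": 0.0, "yaw": 0.1})
idx.add("fist", {2, 5, 7}, {"pitch": 0.4, "yaw": 0.0})
best, pose = idx.query({1, 9, 12, 7})
```

The retrieved pose annotation is what turns a recognition result into tracking output: the best match carries the 3D motion parameters directly, with no per-frame pose estimation.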

6.2.2 Further Development

There are quite a large number of application scenarios and configurations that might benefit from the proposed technologies of this thesis. Since this research was conducted during a limited period of time, it was not possible to deeply investigate all aspects of the proposed methods, such as user studies and design features. Evaluations and experiments were mainly performed on the technical aspects of the contributions. However, for the systems implemented based on the proposed methods, user experience and design aspects were considered and studied in most cases. Here, some interesting directions for further research and development are mentioned.

6.2.2.1 Concept of Collaborative 3D Interaction

The development of 3D interaction in collaborative scenarios is an interesting line for further research. From a technical perspective, collaborative 3D interaction can be implemented based on the proposed solutions of this thesis. As discussed in the previous chapters, sharing the interaction/visualization spaces among several users opens up numerous application scenarios. Exchanging digital information such as documents, photos, audio and video tracks between different users is an example of collaborative sharing based on 3D gestural interaction. In fact, users might grab, move and pass multimedia content in a shared 3D space using physical hand gestures.

6.2.2.2 Concept of Interaction in the Space using Body Gestures

Interaction between humans and future smart devices can be extended to the whole space. Specifically, with the introduction of AR glasses, hand-held devices will disappear and the whole space in front of the user can be dedicated to the interaction. The contributions of this thesis have focused on hand gesture technology for interaction in front of the smart device and on 3D head motion estimation for interactive displays. Since the interaction space can be extended to a larger space, whole-body motion for action recognition and other body parts such as the feet might be employed to design interactive application scenarios.

6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale

The proposed search framework for gesture recognition and tracking has been implemented and tested with different databases. The largest database contains 10,000 gesture entries. Although this number seems to cover quite a large set of gesture poses for handling the gesture analysis problem, according to the estimations, for extremely high-resolution tracking the database should be extended. One important line of research for further development is to generalize the retrieval system to an extremely large database. Certainly, the real-time matching process for such an extended database will be quite a challenging problem.
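One standard way to keep matching real-time as the database grows far beyond 10,000 entries is approximate search. The following hedged sketch (a generic technique, not the thesis method; all names are illustrative) uses locality-sensitive hashing over binary gesture descriptors: entries that agree on a random subset of bits share a bucket, so a query scans only a few buckets instead of the whole database.

```python
import random

def build_tables(descriptors, n_tables=4, n_bits=8, dim=64, seed=0):
    """Build LSH tables: each table keys entries by a random bit subset."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        bits = rng.sample(range(dim), n_bits)      # random bit positions
        buckets = {}
        for eid, d in descriptors.items():
            key = tuple(d[b] for b in bits)
            buckets.setdefault(key, []).append(eid)
        tables.append((bits, buckets))
    return tables

def query(tables, descriptors, q):
    """Collect bucket candidates, then rank by exact Hamming distance."""
    candidates = set()
    for bits, buckets in tables:
        candidates.update(buckets.get(tuple(q[b] for b in bits), ()))
    if not candidates:
        return None
    return min(candidates,
               key=lambda e: sum(a != b for a, b in zip(descriptors[e], q)))

# Usage: a query identical to a stored descriptor retrieves that entry.
db = {"open_hand": [0] * 64, "fist": [1] * 64}
tables = build_tables(db)
match = query(tables, db, [0] * 64)
```

The trade-off is the usual one: more tables raise recall for near-but-not-identical queries at the cost of memory, while bucket lookup keeps the per-query cost nearly independent of the database size.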

6.2.3 Future of Mobile Interaction and Visualization

It is quite difficult to predict the evolution of smart devices within the next ten years. From a hardware point of view, the current trend shows that displays might be presented with 3D technology, in transparent, flexible, or wearable formats. Future devices will definitely be equipped with numerous sensors, larger storage, and faster processors. High-speed mobile network connections might totally change the design of future smartphones. Storage and processing can be shifted to the infrastructure, and smart devices might act as a set of sensors and a screen for visualization. A huge network of connected devices will provide the chance to share digital content in a virtual space. Collaborative interaction might be an important part of future mobile multimedia. The concept of personalized environments and screens can totally change the future of visualization. In fact, with the introduction of AR glasses, any space can be dedicated to a personalized environment. The whole space and the augmented information can be designed based on the user's demands.

The future of interaction technology on mobile devices might be highly affected by multimodal inputs. 3D technology for intuitive interaction will definitely be an essential part of that. Bare-hand interaction in 3D space can be used to perform various tasks. Other input modalities such as voice, motion or orientation will be complementary.

An important point to mention here is that natural interaction requires the sense of touch. This is probably the essential feature that free-space interaction might need. Ultrasound 3D rendering is an enabling technology that might be useful for implementing the sense of touch in free space. Since this research is still at an early stage [83], the availability of this technology for mobile devices in the near future is quite questionable at this moment. This technology might offer the rendering of virtual objects in free space.


Chapter 7

Summary of the Selected Articles

This thesis reflects the results of the research conducted by Shahrouz Yousefi during his PhD studies. The first part of this thesis (the introduction) is based on the results of more than 15 papers published in international conferences and journals. The publication list is included at the end of this chapter. In the second part of this thesis, 7 papers are included. In all selected papers, Shahrouz Yousefi (the author of this thesis) is the first/corresponding author. The major contributions of these seven papers, including concepts, theories, experiments, implementations and writing, are from Shahrouz Yousefi. Prof. Haibo Li has supervised Shahrouz Yousefi as the main supervisor during this PhD study. The third author has assisted Shahrouz Yousefi in some experiments or has participated in the discussions.

Chapter 8 introduces gesture detection, tracking and 3D motion analysis based on first-order and second-order rotational symmetry patterns. Rotational symmetries have been used for gesture localization and fingertip detection. Feature detection, feature tracking, and 3D motion retrieval have been performed. The computed motion parameters have been used to control and manipulate virtual objects on the screen. Various application scenarios that might benefit from the proposed technology have been introduced. The content of this chapter has been published as a journal article in Pattern Recognition Letters (PRL).

The content of chapter 9 is reprinted from the paper published at the ACM International Conference on Multimedia (ACMMM12). This work was presented at the conference in both oral and poster sessions. It was selected as one of the top eight papers for the Doctoral Symposium track. The content of this work was evaluated by an opponent and committee members at the conference. This paper reflects the substantial part of this PhD thesis in brief. It introduces the concept of 3D gestural interaction, potential applications, the enabling media technologies that support this concept, and the implemented photo browsing system.

Chapter 10 introduces the concept of gesture analysis based on a large-scale gesture retrieval and search engine. The introduced technology is based on a database of annotated gesture images with the corresponding 3D pose information, and a search engine for similarity analysis between the query gesture and the database entries. The output provides the best match from the database, and the annotated motion parameters are used in real-time interaction. This paper is accepted for publication in the 9th International Conference on Computer Vision Theory and Applications (VISAPP2014). This work has successfully passed the novelty analysis step through KTH Innovation. Due to patent application restrictions, the full version of this work with technical details has not been submitted for publication in conferences or journals. The extended version of this work is filed as a U.S. patent application.

Chapter 11 introduces interactive 3D visualization, the proposed technology for real-time interaction between users and the content of the display.


This technology enables users to control and manipulate the content of the screen based on their position/orientation in 3D space. A head-mounted vision sensor is employed to measure and report the 3D motion parameters. The real-time motion parameters are sent to the rendering block for visualization on the screen. This paper is submitted to the International Conference on Image Processing (ICIP2014).

The content of chapter 12 is reprinted from the paper published at the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2011). This paper introduces the technology for 3D visualization of monocular images based on patch-level depth retrieval. Stereoscopic techniques have been used for 3D visualization on a normal 2D display.

The content of chapter 13 is the reprinted version of the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP2011). This paper discusses the technology for converting 2D monocular photo and video collections to 3D and visualizing them on 2D displays using stereoscopic technology.

Chapter 14 introduces a vision-based technique for robust correction of 3D geo-metadata in photo collections. The proposed technology efficiently improves the accuracy of the position/orientation information in photo collections. Consequently, this approach enhances the 3D visualization, navigation, and exploration of large data sets. The content of chapter 14 is reprinted from the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP2011).

7.1 List of Publications

This thesis is based on the contributions of the following articles, although not all of them are included:


Journal articles:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Experiencing Real 3D Gestural Interaction with Mobile Devices, published in Pattern Recognition Letters (PRLetters), 2013.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Gesture Tracking for Real 3D Interaction Behind Mobile Devices, published in the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2013.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Direct Head Pose Estimation Using Kinect-type Sensors, published in Electronics Letters, 2014.

Licentiate thesis:

• Shahrouz Yousefi, Enabling Media Technologies for Mobile Photo Browsing, Licentiate Thesis, Digital Media Lab (DML), Department of Applied Physics and Electronics, Umeå University, SE-901 87, Umeå, Sweden, ISSN: 1652-6295:16, ISBN: 978-91-7459-426-3, 2012.

Conference papers:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval, accepted for publication in the 9th International Conference on Computer Vision Theory and Applications (VISAPP), January 2014.

• Shahrouz Yousefi, 3D Photo Browsing for Future Mobile Devices, In Proceedings of the 20th ACM International Conference on Multimedia (ACMMM12), October 29-November 2, Nara, Japan, 2012.


• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Real 3D Interaction Behind Mobile Phones for Augmented Environments, In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, July 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Gestural Interaction for Stereoscopic Visualization on Mobile Devices, In Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns (CAIP), Seville, Spain, CAIP (2), Vol. 6855, Springer, pp. 555-562, 29-31 August 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Single Images using Patch Level Depth, In Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP), Seville, Spain, 18-21 July 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Stereoscopic Visualization of Monocular Images in Photo Collections, In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid, In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Tracking Fingers in 3D Space for Mobile Interaction, In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 2010.


Under-review articles:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Hand Gesture Recognition and Tracking through the Large-scale Gesture Search Engine, Submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Interactive 3D Visualization on a 4K Wall-Sized Display, Submitted to the IEEE International Conference on Image Processing (ICIP 2014), Paris, France, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Direct Hand Pose Estimation for Immersive Gestural Interaction, Submitted to Pattern Recognition Letters (PRLetters), 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Ahmad Ostovar, Li Liu, Haibo Li, A Direct Method for 3D Hand Pose Recovery, Submitted to the 22nd International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden, 2014.

Other related publications:

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Head Operated Electric Wheelchair, accepted for publication in the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2014), San Diego, USA, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Samuel Sonning, Sabina Sonning, 3D Head Pose Estimation Using the Kinect, In Proceedings of the 2011 IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, Nov. 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Smart Baggage In Aviation, In Proceedings of the 2011 IEEE International Conference on Internet of Things (iThings-11), Dalian, China, 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Monocular Images in Photo Collections, In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linköping, Sweden, 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Gesture Tracking for 3D Interaction in Augmented Environments, In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linköping, Sweden, 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, , In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Gothenburg, Sweden, 2013.

Patents:

• Shahrouz Yousefi, Haibo Li, Farid Abedan Kondori, Real-time 3D Gesture Recognition and Tracking System for Mobile Devices, U.S. Patent Application, filed January 2014. Patent pending.
