Computer Vision Based Interfaces
for Computer Games
Jonathan Rihan
Thesis submitted in partial fulfilment of the requirements of the award of
Doctor of Philosophy
Oxford Brookes University
October 2010

Abstract
Interacting with a computer game using only a simple web camera has seen a great deal of success in the computer games industry, as demonstrated by the numerous computer vision based games available for the Sony PlayStation 2 and PlayStation 3 game consoles. Computational efficiency is important for these human computer interaction applications, so for simple interactions a fast background subtraction approach is used that incorporates a new local descriptor which uses a novel temporal coding scheme that is much more robust to noise than the standard formulations. Results are presented that demonstrate the effect of using this method for code label stability.
Detecting local image changes is sufficient for basic interactions, but exploiting high-level information about the player’s actions, such as detecting the location of the player’s head, the player’s body, or ideally the player’s pose, could be used as a cue to provide more complex interactions. Taking an object detection approach to this problem, a combined detection and segmentation method is explored that uses a face detection algorithm to initialise simple shape priors, demonstrating that good real-time performance can be achieved for face texture segmentation.
Ultimately, knowing the player’s pose solves many of the problems encountered by simple local image feature based methods, but is a difficult and non-trivial problem. A detection approach is also taken to pose estimation: first as a binary class problem for human detection, and then as a multi-class problem for combined localisation and pose detection.
For human detection, a novel formulation of the standard chamfer matching algorithm as an SVM classifier is proposed that allows shape template weights to be learnt automatically. This allows templates to be learnt directly from training data even in the presence of background and without the need to pre-process the images to extract their silhouettes. Good results are achieved when compared to a state of the art human detection classifier.
For combined pose detection and localisation, a novel and scalable method of exploiting the edge distribution in aligned training images is presented to select the most potentially discriminative locations for local descriptors that allows a much higher space of descriptor configurations to be utilised efficiently. Results are presented that show competitive performance when compared to other combined localisation and pose detection methods.

Dedicated to my parents and my family, to family Dahm, and to Susanne.
In memory of Elke Dahm, and Winnie Smith.
Acknowledgements
During the time I have spent studying at Oxford Brookes, I have had the honour and the pleasure of meeting and working with many very talented and enthusiastic people.
First I would like to thank my director of studies Professor Philip H. S. Torr for all his valuable guidance, advice, enthusiasm and support that allowed me to develop my skills as a researcher and gain an understanding of computer vision. I would also like to thank my second supervisor Dr Nigel Crook for his advice and valuable feedback during writing, and would like to thank Professor William Clocksin for the teaching opportunities that gave me valuable teaching experience during my final year of study.
Of those I have worked with while studying in the Oxford Brookes Computer Vision group, I would like to thank M. Pawan Kumar, Pushmeet Kohli, Carl Henrik Ek, Chris Russel, Gregory Rogez, Karteek Alahari, Srikumar Ramalingam, Paul Sturgess, David Jarzebowski, Lubor Ladicky, Christophe Restif, Glenn Sheasby and Fabio Cuzzolin for all the interesting research discussions.
During my time at the London Studio of Sony Computer Entertainment Europe, I worked under the supervision and guidance of Diarmid Campbell who I would like to thank for giving me valuable insight into applied computer vision and the games industry. From within the Sony EyeToy R&D group I would also like to thank Simon Hall, Nick Lord, Graham Clemo, Sam Hare, Dave Griffiths and Mark Pupilli.
I would like to thank my parents for all the support they have given me during the many years of my studies. I would also like to thank Elke and Erdmann Dahm and family for their support and encouragement during the final year of writing.
Finally, I would like to thank Susanne Dahm for her unwavering support and understanding during my studies, and in whom I found the strength to do more than I ever thought I could.
Contents
1 Introduction 1
1.1 Defining ‘Real-time’...... 1
1.2 Motivation...... 2
1.3 Approach...... 3
1.3.1 Detecting Movement...... 4
1.3.2 Face Detection and Segmentation...... 5
1.3.3 Human Detection...... 5
1.3.4 Pose Estimation...... 6
1.4 Contributions...... 6
1.5 Thesis Structure...... 7
1.6 Publications...... 8
2 Background 9
2.1 Computer Vision in Games...... 9
2.1.1 Nintendo Game Boy Camera (1998)...... 9
2.1.2 SEGA Dreamcast Dreameye Camera (2000)...... 10
2.1.3 Sony EyeToy (2003)...... 11
2.1.4 Nintendo Wii (2006)...... 13
2.1.5 XBox LIVE Vision (2006)...... 13
2.1.6 Sony Go!Cam (2007)...... 14
2.1.7 Sony PlayStation Eye (2007)...... 15
2.2 Types of Camera...... 16
2.2.1 Monocular (standard webcam)...... 17
2.2.2 Stereo...... 17
2.2.3 Depth, or Z-Cam...... 18
2.3 Problem Domain...... 18
2.3.1 Typical Computer Games...... 19
2.3.2 Computational Constraints...... 20
2.4 Background Subtraction and Segmentation...... 20
2.4.1 Background Subtraction...... 20
2.4.2 Background Subtraction using Local Correlation...... 24
2.4.3 Segmentation...... 27
2.4.4 Graph-Cut based Methods...... 29
2.5 Human Detection and Pose Estimation...... 32
2.5.1 Learning Problem...... 32
2.6 Human Detection...... 33
2.6.1 Generative...... 35
2.6.2 Discriminative...... 35
2.6.3 Chamfer Matching...... 38
2.7 Pose Estimation...... 40
2.7.1 Generative...... 41
2.7.2 Discriminative...... 44
2.7.3 Combined Detection and Pose Estimation...... 46
3 Fast Background Subtraction 47
3.1 Problem: Where is the player?...... 48
3.2 Image Differencing...... 48
3.3 Motion Button Limitations...... 50
3.4 Persistent Buttons...... 50
3.5 Algorithm Overview...... 51
3.5.1 Sub-sampling...... 53
3.5.2 Intensity Image...... 53
3.5.3 Blur Filtering...... 53
3.5.4 Code Map...... 54
3.6 Algorithm Details...... 54
3.6.1 Sub-sampling and Converting to an Intensity Image...... 54
3.6.2 Blur Filtering...... 55
3.7 Noise Model...... 56
3.8 Code Map...... 57
3.8.1 Local Binary Patterns...... 57
3.8.2 3 Label Local Binary Patterns (LBP3)...... 58
3.8.3 Temporal Hysteresis...... 59
3.9 Experiments...... 63
3.9.1 Comparisons...... 63
3.9.2 Table-top Game...... 64
3.9.3 Human Shape Game...... 64
3.10 Results...... 65
3.11 Discussion and Future Work...... 66
4 Detection and Segmentation 68
4.1 Shape priors for Segmentation...... 69
4.2 Coupling Face Detection and Segmentation...... 70
4.3 Preliminaries...... 71
4.3.1 Face Detection and Localisation...... 71
4.3.2 Image Segmentation...... 72
4.3.3 Colour and Contrast based Segmentation...... 72
4.4 Integrating Face Detection and Segmentation...... 73
4.4.1 The Face Shape Energy...... 74
4.5 Incorporating the Shape Energy...... 74
4.5.1 Pruning False Detections...... 76
4.6 Implementation and Experimental Results...... 77
4.6.1 Handling Noisy Images...... 77
4.7 Extending the Shape Model to Upper Body...... 77
4.8 Discussion and Future Work...... 79
5 Human Detection 82
5.1 Suitable Algorithms...... 83
5.2 Features...... 83
5.2.1 Histogram of Oriented Gradients...... 84
5.2.2 DAISY...... 88
5.2.3 SURF...... 89
5.2.4 Chamfer Distance Features...... 90
5.3 Human Detector...... 96
5.3.1 Training...... 97
5.3.2 Hard Examples...... 98
5.4 Datasets...... 98
5.4.1 HumanEva...... 98
5.4.2 INRIA...... 99
5.4.3 Mobo...... 99
5.4.4 Upper Body Datasets...... 100
5.5 Experiments...... 100
5.5.1 Human Detection...... 100
5.5.2 Upper Body Detection...... 102
5.6 Discussion...... 102
5.6.1 Computational Efficiency...... 104
5.6.2 Chamfer SVM...... 106
5.6.3 Edge Thresholding...... 107
5.6.4 Bagging, Boosting and Randomized Forests...... 108
5.7 Summary and Future Work...... 110
6 Pose Detection 112
6.1 Introduction...... 114
6.1.1 Related Previous Work...... 114
6.1.2 Motivations and Overview of the Approach...... 116
6.2 Selection of Discriminative HOGs...... 119
6.2.1 Formulation...... 120
6.3 Randomized Cascade of Rejectors...... 123
6.3.1 Bottom-up Hierarchical Tree construction...... 123
6.3.2 Randomized Cascades...... 126
6.3.3 Application to Human Pose Detection...... 132
6.4 Experiments...... 134
6.4.1 Preliminary results training on HumanEva...... 135
6.4.2 Experimentation on MoBo dataset...... 137
6.5 Conclusions and Discussions...... 144
7 Conclusion 146
7.1 Summary of Contributions...... 146
7.2 Future Work...... 148
A Code Documentation 149
A.1 System Overview...... 149
A.1.1 Shared Projects...... 149
A.1.2 Project: CapStation...... 150
A.1.3 Project: HogLocalise...... 151
A.2 C++: Shared Libraries...... 153
A.2.1 Common...... 153
A.2.2 ImageIO...... 154
A.3 MEX: cppHogLocalise...... 154
A.3.1 Usage...... 154
A.3.2 Options: Scanning...... 155
A.3.3 Options: Classification...... 157
A.3.4 Related Source Code Folders...... 161
References 162
List of Tables
6.1 2D Pose Estimation Error on HumanEva II...... 135
6.2 Classifiers - Training Time...... 141
6.3 Classifiers - Detection Time...... 143
List of Figures
1.1 Typical Camera Position...... 2
1.2 Camera View...... 3
2.1 Nintendo Game Boy Camera...... 10
2.2 Sega Dreameye...... 11
2.3 Sony EyeToy Camera...... 11
2.4 Sony EyeToy Games...... 12
2.5 Sony EyeToy Peripherals...... 13
2.6 Microsoft XBox Live Vision Camera...... 14
2.7 Sony PlayStation Portable Go!Cam...... 15
2.8 Sony PlayStation Eye camera...... 16
2.9 Camera Bayer Pattern...... 17
2.10 hsL Colour Space...... 27
2.11 GraphCut Illustration...... 30
2.12 Supervised Machine Learning Diagram...... 33
2.13 Chamfer distance...... 39
2.14 Chamfer Pose Templates...... 46
3.1 Image Differencing...... 49
3.2 Background Image Differencing...... 51
3.3 LBP Code...... 52
3.4 LBP3 Subsampling...... 54
3.5 Blur Filter...... 56
3.6 LBP3 Code Response To Lighting...... 59
3.7 LBP3 Labels...... 60
3.8 LBP3 Code Map Construction...... 60
3.9 LBP3 Code Map Differencing...... 61
3.10 LBP3 Hysteresis...... 62
3.11 Persistent Button Applications...... 63
3.12 LBP3 Hysteresis Results...... 65
3.13 LBP3 Hysteresis FPR...... 66
3.14 Shadow Results...... 67
3.15 Camera Drift Effect on LBP3...... 67
4.1 Real-Time Face Segmentation...... 69
4.2 Face Shape Energy...... 74
4.3 Energy Function Terms...... 75
4.4 False Positive Pruning...... 76
4.5 Face Segmentation Results...... 77
4.6 Effect of Smoothing...... 78
4.7 Upper Body Model...... 79
4.8 Optimising Body Parameters...... 80
4.9 Upper Body Segmentation Results...... 81
5.1 HOG Descriptor...... 85
5.2 Integral Image...... 86
5.3 Haar-like Rectangular Features...... 87
5.4 Integral Histogram HOG...... 88
5.5 DAISY Descriptor Construction...... 88
5.6 SURF Descriptor...... 89
5.7 Hausdorff Distance...... 92
5.8 Dilation Example...... 93
5.9 Hausdorff PAC Classifier Results...... 94
5.10 Chamfer Distance Features...... 97
5.11 Human Detector Window Construction...... 98
5.12 HumanEva Detection Dataset...... 99
5.13 INRIA Dataset...... 99
5.14 MoBo Dataset...... 100
5.15 INRIA: SVM Classifier Feature Comparison...... 101
5.16 Mobo Upper Body: SVM Classifier Feature Comparison...... 103
5.17 Chamfer SVM Weights...... 104
5.18 Feature Vector Timings...... 105
5.19 Canny Edge Thresholds...... 108
5.20 Chamfer SVM Classifier Edge Thresholds...... 108
5.21 SVM and SGD-QN SVM Comparison...... 109
5.22 SVM and FEST Algorithms Comparison...... 110
6.1 Random Forest Preliminary Results...... 117
6.2 Log-likelihood Ratio for Human Pose...... 119
6.3 Log-likelihood Ratio for Face Expressions...... 122
6.4 Selection of Discriminative HOG for Face Expressions...... 123
6.5 Bottom-up Hierarchical Tree learning...... 124
6.6 Hog Block Selection...... 127
6.7 Rejector branch decision...... 130
6.8 Single Cascade Localisation...... 131
6.9 Pose Detection...... 133
6.10 Pose Localisation...... 134
6.11 Pose Detection Results on HumanEva II Data Set...... 135
6.12 Pose Detection Result with a Moving Camera...... 136
6.13 MoBo Dataset...... 138
6.14 Defined Classes on MoBo...... 139
6.15 Pose Classification Baseline Experiments on MoBo...... 140
6.16 Localisation Dataset...... 141
6.17 Detection Results...... 142
6.18 Best Cascade Localisation Results...... 143
A.1 System Diagram of Main Applications...... 149
A.2 Main Stages of Sliding Window Algorithm...... 151
A.3 Scale Search Method...... 152
A.4 HOG Feature Scaling...... 153
Notation
The following conventions are used in this thesis. Sets, matrices or special functions are upper-case characters, such as X or Y(θ). Functions or scalar values are lower-case characters such as a, n, f(n) and h(g). Vectors are lower-case bold characters, x and p, and their components are lower-case with a subscript index, e.g. {x₁, x₂, x₃, …, xₙ} or pr, pg, pb. A number followed by a subscript denotes the base of the number, e.g. 5₁₀ = 0101₂; if no subscript is specified then the number is assumed to be decimal.
CHAPTER 1
Introduction
The ability to interact with a computer game without the requirement of a traditional game controller device provides a player with the unique ability to use their own body to directly interact with the game they are playing.
This thesis is concerned with the problem of real-time human computer interaction for computer games. Single camera systems are (at the time of writing) by far the most common type of system available, and as such are the problem domain of the algorithms discussed.
Many existing camera based games, such as the numerous games available for the Sony PlayStation 2 system in conjunction with Sony’s EyeToy camera (§2.1.3), all encourage the player to stand up and take a more physical role in the game play, in contrast with their usual seated gaming position. Indeed, some games such as Sony EyeToy Kinetic have taken this interaction concept even further and try to provide exercise training routines to help players keep fit.
These games tend to employ computer vision algorithms that are both computationally efficient, so as not to take too much computation time away from the rest of the media the game has to update, and robust, so that the player isn’t frustrated by the technology failing at critical moments.
1.1 Defining ‘Real-time’
Throughout the thesis, the term ‘real-time’ is frequently used to describe and compare the performance and practicality of different algorithms for computer vision based interface problems. Generally speaking, the goal of any algorithm used in this context is to process images fast enough to give the user a sufficient level of interaction with the interface.
The term ‘interactive frame rate’ could also be used to describe the performance of the algorithms discussed in this thesis, but the term ‘real-time’ is more widely used.
Where ‘real-time’ is used in the text, it is taken to mean that there is sufficient performance for user interaction, and may vary depending on the context.
1.2 Motivation
Offering the ability for a player to use their body to interact with a computer game has already seen a great deal of success in the computer games industry, and demonstrates that there is a lot of interest in using this technology as an alternative to traditional game controller devices.
The game systems responsible for much of the success of computer vision based computer games have been the Sony PlayStation 2 and PlayStation 3 game consoles. See section 2.1 for a history of how computer vision based games have evolved over the last decade.
Figure 1.1: Illustration showing a typical camera position for a camera based computer games system. The camera location is highlighted in red, and the viewing frustum is highlighted in green. The image displayed on the television in the diagram has been flipped horizontally.
Game systems such as these typically use a single web camera that is usually placed on top of the television set, roughly aligned to the screen centre. Figure 1.1 illustrates the camera placement used for most vision based games. The camera is highlighted in red, and the viewing frustum of the camera is highlighted in green.
The video stream recorded by the camera is displayed on the TV screen usually as a horizontally flipped image so that the movements displayed on the screen behave like
a mirror of the player’s actions. Gaming components are overlaid on top of the live video stream and the player must move their body to interact with the components.
Figure 1.2 shows an example video frame captured from the camera in figure 1.1. The image on the left is the original video image captured from the point of view of the camera, and the image on the right is displayed on the television screen facing the player. The image has been flipped horizontally to make interactions more intuitive.
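The mirrored display described above amounts to a horizontal flip of each captured frame. As a minimal sketch (using NumPy arrays as a stand-in for the console's video frames, which is an assumption):

```python
import numpy as np

def mirror_frame(frame):
    """Flip an H x W x C video frame horizontally so the on-screen image
    behaves like a mirror reflection of the player's movements."""
    return frame[:, ::-1, :]

# Tiny 1 x 3 RGB frame: red, green, blue pixels from left to right.
frame = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
mirrored = mirror_frame(frame)
# After mirroring, the blue pixel is leftmost and red is rightmost.
```

Flipping the display rather than the processed image also means any detection algorithm can run on the raw camera frame, with only the final coordinates mirrored for rendering.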
Interpreting this type of image poses some difficult problems, however. Generally a living room environment contains many varied types of object in the background¹. In the example frame presented in figure 1.2 there are two picture frames, a sofa, a wall, and a floor.
The problem of detecting areas of the image that belong to the player is made more complicated by the presence of these items, as their appearance can contain texture or colours similar to the clothing of the player. It is unreasonable to expect players to clear their living space completely of these items to make algorithms more robust, so any algorithms considered must ideally be able to cope with this kind of background appearance variation.
Figure 1.2: Left: Video image as seen from the point of view of the camera in figure 1.1. Right: Video image displayed on the TV screen with superimposed graphics. Notice that the image is flipped horizontally so that the displayed image behaves like a mirror reflection.
1.3 Approach
The work presented in this thesis considers two approaches to the problem of detecting the movements of the player with the goal of extending the types of interactions that can be used in computer games.
¹Here ‘background’ simply means anything that is not the player.
The first approach (§3) presents an algorithm that uses simple image features to detect changes in the image useful for realising computer interactions.
The second approach is based on the observation that any changes detected by these low-level algorithms are a direct consequence of the player moving their body, so being able to detect the location of a body part (§4), the location of the whole body (§5), or ideally the player’s pose (§6) should help solve some of the interaction problems encountered by simple low-level algorithms.
In the context of computer games and entertainment, the types of computer vision algorithms that can be used are generally limited to those that can be considered to run in real-time: algorithms that process an image fast enough for a player to see the result of their actions affecting the computer game environment in some way.
As an example of why this is important, consider a game where a player is bombarded by objects and must swat them aside to earn points (see right-most image in figure 1.2). Clearly, if the algorithm used to locate the player’s movements² were too slow, then swatting the objects would be impossible, since by the time the control system responds the object would have moved past the area of interaction.
1.3.1 Detecting Movement
Detecting movement can be useful to trigger user interface controls by encouraging the player to ‘wave’ their hand over them for a given time, as demonstrated in Sony EyeToy games.
This can be posed as the problem of detecting changes in the image since the previous frame. Simple algorithms such as image differencing, where the current video frame is subtracted from the previous frame and the result thresholded to detect changes, have seen a great deal of success in many of the EyeToy series of games, demonstrating that game mechanics can be created to work within the limitations of a simple algorithm.
However, simple image differencing is contrast dependent, so it can be problematic to detect changes in areas that are a similar colour to a part of the player. Camera sensor noise can be a problem particularly in poorly lit environments, as can very slow or subtle movements that cause gradual change below the detection threshold of the image differencing algorithm.
Another problem is that this method only detects changes between frames, and
²Either indirectly by detecting low-level image changes, or directly using object detection or pose.
cannot detect anything if the player remains stationary on top of a game object. Chapter 3 addresses some of these problems and presents an alternative algorithm that takes steps toward allowing this kind of interaction to be realised.
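The image-differencing scheme described above can be sketched in a few lines. This is a minimal illustration only; the threshold value is an assumption, not one taken from any shipped game:

```python
import numpy as np

def frame_difference_mask(prev, curr, threshold=25):
    """Classic image differencing: mark pixels whose absolute intensity
    change between consecutive frames exceeds a fixed threshold."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > threshold

prev = np.zeros((4, 4), dtype=np.uint8)   # static background frame
curr = prev.copy()
curr[1:3, 1:3] = 200                      # a bright region "moves" in
mask = frame_difference_mask(prev, curr)
# mask is True only over the 2 x 2 changed region; a stationary player
# produces no change at all, which is exactly the limitation noted above.
```

Note that the sketch makes the stationary-player problem concrete: calling `frame_difference_mask(curr, curr)` returns an all-False mask regardless of where the player is standing.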
1.3.2 Face Detection and Segmentation
A useful case of object detection that naturally lends itself to a computer game application is face detection. To see what is happening within the game they are playing, the player must face their television, and since this is where the camera is situated, the face of the player should be visible for the majority of the game.
Detecting faces can be a useful cue for interaction, and there are good algorithms for fast and accurate face detection [170] that can be employed to locate the player’s face. Once the location of the player’s face is known, it can be used to find roughly where the rest of the player is within the video frame. Once the face has been localised it also becomes possible to perform other actions, like placing the face on a computer generated avatar, or passing the extracted face texture to a recognition algorithm for identifying individual players within a multi-user interface [157].
Chapter 4 presents a method to achieve segmentation in real-time using an off-the-shelf face detection algorithm to first find a region of interest for the segmentation algorithm.
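One simple way a face detection output can seed a segmentation step is to expand the detected face box into a larger region of interest. The sketch below is illustrative only: the function name and the 1.5x expansion factor are assumptions, not details of the thesis implementation:

```python
def roi_from_face(face_box, image_size, scale=1.5):
    """Expand a detected face bounding box (x, y, w, h) into a larger
    region of interest for a segmentation step, clipped to the image."""
    x, y, w, h = face_box
    img_w, img_h = image_size
    cx, cy = x + w / 2.0, y + h / 2.0                 # box centre
    half_w, half_h = (w * scale) / 2.0, (h * scale) / 2.0
    x0, y0 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    x1, y1 = min(img_w, int(cx + half_w)), min(img_h, int(cy + half_h))
    return x0, y0, x1, y1

# A 50 x 50 face detected at (100, 100) in a 640 x 480 frame.
roi = roi_from_face((100, 100, 50, 50), (640, 480))
```

Restricting the (comparatively expensive) segmentation to this window is what makes the real-time budget attainable: the segmentation cost then scales with the face size rather than the full frame.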
1.3.3 Human Detection
Detecting and localising a human is a complex object detection problem. Humans have a high variance in appearance and pose, so an accurate detector must be able to deal with these variations. Knowing the location of the player would allow more complex algorithms to be applied, and given that the player will be present in front of the camera when playing the game, a human detector should be able to find the player reliably.
Some common approaches to this problem are discussed in section 2.6. Among these methods, fast detection algorithms such as chamfer matching [20, 63, 66] can achieve real-time performance, but the requirement that silhouettes must be extracted from the training images before the classifier is learnt can limit the number of training instances that are practical for use with the algorithm.
Chapter 5 presents an algorithm that formulates the chamfer matching algorithm as an SVM classifier, where the weights of the SVM represent the template, allowing a general chamfer template to be learnt automatically from the training data.
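For reference, the quantity that chamfer matching minimises — the mean distance from each template edge point to its nearest image edge point — can be sketched as follows. This is a brute-force toy version; real implementations precompute a distance transform of the edge image instead of searching over all point pairs:

```python
import numpy as np

def chamfer_score(template_pts, image_edge_pts):
    """Mean distance from each template edge point to the nearest edge
    point in the image: zero for a perfect overlap, growing as the
    template drifts away from the image edges."""
    t = np.asarray(template_pts, dtype=float)    # (N, 2) template points
    e = np.asarray(image_edge_pts, dtype=float)  # (M, 2) image edge points
    dists = np.linalg.norm(t[:, None, :] - e[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

edges = [(0, 0), (0, 1), (1, 1)]
aligned = chamfer_score(edges, edges)                       # perfect overlap
shifted = chamfer_score([(x + 1, y) for x, y in edges], edges)  # misaligned
```

The SVM formulation in Chapter 5 can be read against this: instead of every template point contributing equally to the mean, each location carries a learnt weight.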
1.3.4 Pose Estimation
The methods mentioned in the previous sections attempt to extract basic location information about which parts of the image contain parts of the player, so that interactions can affect game objects or controls that are superimposed on top of the video frame. Since these interactions are all dependent on the location or motion of the user, they can all be achieved by determining the pose of the individual.
If pose is known, then user interface controls such as floating buttons can be interacted with simply by tracking the position of the user’s hand. Knowing pose also means events can be triggered for a specific hand only, and can even stop a control from activating in error when an arm, the head or another body part enters the area. For exercise and fitness games, pose could enable the game to provide feedback to the player on how well they are performing an action.
Pose estimation is of course a very difficult and non-trivial problem, particularly so for real-time applications. Ambiguities arise quite easily when using information from a single camera, such as which way round a player’s legs are when viewed from the side, or how far their hand is from the camera (since depth information isn’t available from a single camera).
There are many approaches to the problem of pose estimation, and they are discussed in more detail in section 2.7. Many methods use a two stage approach where the image is first processed by a computationally efficient algorithm (such as background subtraction) to identify locations in the image where the human is most likely to be, and then a more computationally expensive method is applied to determine pose. However, only a few methods consider the combined problem of simultaneously detecting and estimating the pose of a human within an image [17, 44, 116].
Chapter 6 proposes a method that exploits the distributions of edges over training set examples to select the most discriminative locations at which to place local features for discriminating between different classes, where the classes represent discrete poses, and constructs an efficient hierarchical cascade classifier from them. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes.
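The idea of exploiting edge distributions can be illustrated with a toy version: rank pixel locations by how often an edge occurs there across aligned training windows, and place descriptors at the highest-ranking positions. (Chapter 6 uses a discriminative log-likelihood-ratio formulation over positive and negative classes; the sketch below keeps only the frequency-ranking idea and is an assumption, not the thesis algorithm.)

```python
import numpy as np

def select_descriptor_locations(edge_maps, k=1):
    """Rank (row, col) locations by the fraction of aligned training edge
    maps that contain an edge there, and return the top k locations."""
    freq = np.mean(np.stack(edge_maps).astype(float), axis=0)
    order = np.argsort(freq, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, freq.shape))
            for i in order]

# Three tiny 3 x 3 "edge maps": every example has an edge at (1, 1),
# plus one example-specific noise edge each.
maps = [np.zeros((3, 3), dtype=np.uint8) for _ in range(3)]
for m in maps:
    m[1, 1] = 1
maps[0][0, 0] = 1
maps[1][2, 2] = 1
best = select_descriptor_locations(maps, k=1)
# The consistently-edged location (1, 1) beats the noise edges.
```

The scalability benefit follows directly: ranking candidate locations once, offline, avoids evaluating a classifier over every possible descriptor configuration.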
1.4 Contributions
The contributions made by this thesis are as follows:
1. An extension to the local binary pattern descriptor that is more robust to noise than the standard formulations.
2. A detection and segmentation approach that demonstrates how combining detection algorithms with simple shape priors can achieve real-time performance for face texture segmentation, and presents some initial results on using the segmentation to help prune false positive detections.
3. A novel formulation of the standard chamfer matching algorithm as an SVM clas- sifier that allows shape template weights to be learnt automatically without the need to pre-process them to find their silhouettes.
4. A method of exploiting the edge distribution in aligned training images to select discriminative locations for local descriptors that allows a much higher space of descriptor configurations to be utilised efficiently.
1.5 Thesis Structure
Chapter 2 This chapter presents a history of the types of systems that have been employed for single camera computer vision based computer games, and presents the background literature on the methods addressed by the algorithms presented in the following chapters.
Chapter 3 Being able to identify the specific areas of an image that belong to the player and not to the background environment is an important problem that can be used to interact with objects and user interface controls presented by the game. This chapter takes steps towards solving this problem with a background subtraction method that uses fast local features.
Chapter 4 This chapter presents a method of detection and segmentation using a sim- ple shape prior to extract the texture belonging to the face of the player which can be utilised by other game interactions. Some initial results are also presented on using the result of the segmentation algorithm to aid detection.
Chapter 5 The problem of fast human detection is addressed in this chapter, and an alternative formulation of the chamfer template matching algorithm is presented that expresses the shape template as the weight vector of an SVM classifier, allowing training examples to be used that also contain background information, without the need to pre-process them to find their silhouettes. The algorithm is then compared with a state of the art detection method.
Chapter 6 A novel algorithm to jointly detect and estimate the pose of humans in single camera images is presented in this chapter, and comparisons are made to other fast state of the art methods. The algorithm exploits the distributions of edges over the training set examples to select the most discriminative locations at which to place local features for discriminating between different classes. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes. Parts of this chapter have previously appeared in the publication listed in section 1.6.
1.6 Publications
Parts of the material that appear in Chapter 4 have previously been published in the following publications (and can be found at the end of the thesis):
1. Jon Rihan, Pushmeet Kohli, Philip H.S. Torr, ObjCut for Face Detection, In ICVGIP (2006)
2. Pushmeet Kohli, Jon Rihan, Matthieu Bray, Philip H.S. Torr, Simultaneous Segmentation and Pose Estimation of Humans using Dynamic Graph Cuts, In International Journal of Computer Vision, Volume 79, Issue 3, pages 285–298, 2008
Parts of the material that appear in Chapter 6 are extensions to material that has previously appeared in the following publication (which can also be found at the end of the thesis):
1. Gregory Rogez, Jon Rihan, Srikumar Ramalingam, Carlos Orrite, Philip H.S. Torr, Randomized Trees for Human Pose Detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008
CHAPTER 2
Background
Computer vision as a method of human computer interaction for entertainment media has been an active area of research for over a decade. Major entertainment companies such as Sony, Nintendo and Microsoft have all looked towards computer vision to expand the types of experiences they can provide to home and business customers.
This chapter contains a review of computer vision based user interface research employed in the entertainment industry, from initial concepts up to the state-of-the-art systems in use today. The chapter then goes on to look at the main problem areas within computer vision that typically need to be addressed to make these systems robust in real world environments, and gives a review of current research literature concerned with solving these problems within the constraints provided by current games hardware.
2.1 Computer Vision in Games
This section introduces the systems made available by major entertainment companies such as Sony, Nintendo and Microsoft in roughly chronological order, so that the simpler systems are presented first and the more advanced and currently used systems are introduced later. This is to highlight the evolution of computer vision driven entertainment media over the past decade.
2.1.1 Nintendo Game Boy Camera (1998)
Freeman et al. [59, 60] from Mitsubishi’s MERL laboratory used simple, low-level vision image processing algorithms implemented on a hardware chip to try and achieve
Figure 2.1: Shown here is the Nintendo Game Boy (left), the Game Boy Camera device (middle), and an example photo taken by the camera (right).
interactive frame rates required by human computer interaction. The user interface methods discussed in the paper are tested in typically quite clean environments (e.g. a player stands in front of a white wall), so finding useful information is made easier.
This chip was later used by Nintendo in their Game Boy hand-held system, as shown in figure 2.1. On this system no games actually used the camera as a control device in the manner discussed by the authors; it simply provided a means to take photographs and apply filter effects. This interactivity alone, however, was enough to sell the product to consumers, demonstrating that even simple visual systems can be entertaining to players.
2.1.2 SEGA Dreamcast Dreameye Camera (2000)
Although SEGA is no longer competing in the games console market, they did release a camera peripheral for their last console, the SEGA Dreamcast, which was intended as a webcam and photo device. The camera was only released in Japan.
The resolution of the camera is 640x480 for still images, and when connected to the Dreamcast games console its picture resolution can be either 640x480, 320x240, or 160x240. However, since the camera's data transfer rate is only 4Mbps (524,288 bytes per second), the frame rate is severely limited, and only very low resolution video is possible at interactive frame rates.
The camera was demonstrated at the Tokyo Game Show in 2000, showing its video conferencing application and a simple computer vision interface game (see figure 2.2). The software finally released with the device was primarily for editing photos, video conferencing with other Dreameye owners or recording short videos.
Figure 2.2: The Sega Dreameye (left), and images from a demonstration at the Tokyo Game Show in 2000 of a video conferencing game (middle) and a simple colour based computer vision game (right) in which the colour red causes the frog to move.
Figure 2.3: The EyeToy camera device with a game (left), EyeToy camera close up (right).
2.1.3 Sony EyeToy (2003)
Early in 2002, Dr Richard Marks joined Sony to work on the EyeToy project, with the idea of using a webcam as an input device for a computer game. SCEE's London studio created four technical demos that illustrated the potential of a video camera as an input device, and the Sony EyeToy camera launched with its first games in 2003.
The camera itself (see figure 2.3) is a simple webcam capable of 320x240 or 640x480 resolution at 50 or 60Hz. The image quality is relatively low for a web camera, but the frame rate is very high, which is ideal for interactive computer games.
Control Method
To play EyeToy games, the EyeToy camera is placed centred on top of a television set and the player stands in front of the camera. The games generally display the video feed of the player on the television screen and superimpose the game graphics over the top (see figure 2.4).
A few games use a technology called Digimask that allows players to place their face on digital characters within a game using the camera. This technology comes
Figure 2.4: A selection of EyeToy games. From left to right: EyeToy Kinetic, EyeToy Play 2, EyeToy Kinetic Combat, and AntiGrav
under the name EyeToy: Cameo, and several games support it as an extra feature. EyeToy: Cameo is not really a control mechanism, however; it simply gives the player another way of personalising their gaming experience.
When the camera is used as a control interface, the player generally interacts by waving their hand over certain portions of the screen and the game uses this stimulus for navigating a menu interface, or interacting with a game object (e.g. swatting a monster or something similar).
Due to the limited processing power available on the console, only simple low-level computer vision algorithms can be used. Despite this, a surprising number of game mechanics can be created using these simple processing techniques, as the EyeToy: Play series of games demonstrates.
More Advanced Control
A few games use more advanced image processing algorithms and track parts of the player's body to control a game avatar. One game that does this particularly well is AntiGrav, in which the aim is to navigate around a racing course on a floating board. Once calibrated, the player can move their head around the screen to steer their avatar around the course, and move their arms up and down to collect objects at different heights along the race track (see the rightmost image in figure 2.4).
Unlike many of the other games released for the EyeToy device, AntiGrav does not display a live video feed of the player during the game. It instead displays the result of the tracking system on the lower right of the screen, showing where the player’s head and hands are in relation to the control’s neutral position.
Colour Tracking
A few new games are being sold with brightly coloured game peripherals that can be easily tracked using simple computer vision colour tracking methods (see figure 2.5).
Figure 2.5: Two EyeToy games that use peripherals to aid computer vision colour tracking algorithms. Left: EyeToy Play: Hero. Right: EyeToy Play: Pom Pom Party
Typically these items are coloured bright green or bright pink to reduce the likelihood that a player’s living room might contain the same colour.
By using two different colours, one for each hand, multiple object tracking is made easier since there’s little chance of one object being mistaken for the other with the colour tracking enabled.
2.1.4 Nintendo Wii (2006)
Nintendo released a new console in 2006 aimed at a more family oriented casual gamer market, called the Nintendo Wii.
Although at first sight there is no computer vision based interface, the game controllers actually contain a small infra-red sensor that tracks the location of a sensor bar with IR LEDs positioned on top of a television. By monitoring the location and separation of the LEDs on this sensor bar, it is possible to calculate the controller's 3D position and rotation using triangulation.
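The distance component of this kind of calculation can be illustrated with a simple pinhole-camera model: the further the bar, the smaller the pixel gap between the two LED clusters. The function name and all numeric defaults below (focal length in pixels, LED separation in millimetres) are hypothetical stand-ins for illustration, not Nintendo's actual implementation.

```python
def estimate_sensor_bar_distance(pixel_separation, focal_length_px=1320.0,
                                 bar_width_mm=200.0):
    """Estimate camera-to-bar distance (mm) from the pixel gap between the
    two IR LED clusters, using similar triangles in a pinhole model:
    distance = focal_length * real_width / image_width.
    All default values are illustrative assumptions."""
    if pixel_separation <= 0:
        raise ValueError("LED clusters must be separated in the image")
    return focal_length_px * bar_width_mm / pixel_separation
```

With these assumed constants, an observed separation of 132 pixels corresponds to a distance of two metres; combining this with the midpoint and tilt of the LED pair yields position and rotation.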
In 2009, a camera device was released for the console with a game called Your Shape. The game uses a simple camera to attempt to detect if the player has performed the displayed exercise moves correctly, and also provides a fitness program for the player to follow.
2.1.5 XBox LIVE Vision (2006)
In 2006 Microsoft released a video camera for their XBox 360 console. The camera supports a resolution of 640x480 at 30 frames per second, and also supports 1.3 megapixel still images at a higher resolution of 1280x1024 (see left image in figure 2.6).
Games have generally used the camera for live video feeds of players in online games (a typical video conferencing application), and to let players add pictures taken with the camera to their online gaming profile.
Figure 2.6: Left: XBox Live Vision camera. Right: Computer vision interface based game where the player waves their hands on the left/right of the screen to control the character.
The camera uses a long exposure time to increase the light sensitivity of its sensor. This limits interfaces that require high-speed interactions or more complex computer vision algorithms: with a long exposure time, features such as edges tend to become blurred and smoothed out during fast actions.
Some games have used the camera as a vision based controller using methods similar to those of the games created for Sony's EyeToy on the PS2, but it has not yet reached the same level of success as the EyeToy franchise.
A few games use the Digimask technology to allow players to place their face on characters in game, as with the EyeToy: Cameo system used in some Sony PlayStation 2 games.
2.1.6 Sony Go!Cam (2007)
Sony has also released a camera, the Go!Cam, that attaches to the back of their PSP hand-held console.
The first major game to exploit the Sony PSP camera is a title called 'InviZimals', which was released in the UK in November 2009 and in the US in autumn 2010. This is an Augmented Reality (AR) game that exploits various image properties within the camera field of view to generate creatures, and uses AR algorithms to track an AR marker card to provide a 3D position for a battlefield in which the creatures can fight each other.
The player must scan their surroundings with the camera while the game extracts various image properties to determine which creatures are detected and where they are located. These creatures can be captured by placing the card at the indicated location, and are then made available to the player to fight and capture other creatures. The actual properties extracted from the image are not published, but are likely to be things
Figure 2.7: Left: PlayStation Portable (PSP) Go!Cam camera. Right: InviZimals game, the AR marker card is underneath the creature on the left.
such as colour distributions and edge information that can be extracted efficiently and quickly.
The combat section of the game uses an augmented reality marker to position the centre of the battlefield, with the fighting creatures positioned at either side to attack each other. Displaying the battle this way means that during combat the camera can be moved around to view the battle from different viewpoints.
2.1.7 Sony PlayStation Eye (2007)
A year after the release of Sony's new PlayStation 3 games console, Sony released a second web camera called the PlayStation Eye (see figure 2.8). The new camera is capable of capturing video at a higher resolution and frame rate than the original EyeToy camera: 640x480 at 60 frames per second, or 320x240 at a much higher 120 frames per second.
The camera is much more sensitive than the original EyeToy, and is able to cope with lower light conditions. It also has no compression artefacts in the video data, which were visible in video data captured by the previous EyeToy camera.
With the greater processing power of the PS3, games are able to employ more complex computer vision algorithms for their control interfaces. One such game uses augmented reality algorithms to detect special markers on playing cards and render the appropriate 3D character on top of the cards presented by a player (see middle image in figure 2.8). The cards can then be used to place the creatures on a grid where they can fight each other.
This augmented reality concept was later applied to a digital pet game called EyePet, in which the player must look after a digital creature. The game tracks a patterned card that the player may use to interact with the creature and perform various tasks. The game encourages the player to feed, wash, and customise the appearance of the creature, and to interact with it by playing a series of short games.
Figure 2.8: Left: PlayStation Eye camera. Middle: Eye of Judgement game. Right: EyePet game.
There is also a mode in which the player can draw a simple shape on a piece of paper and present the drawing to the camera; a 3D object is then extruded from the lines detected in the drawing so that the creature can interact with it.
2.2 Types of Camera
Different types of camera technology have been investigated for computer vision in media and games. Whatever the technology, the hardware must be cheap enough to distribute at a reasonably low price with games or utilities, so that consumers are likely to buy the device, while still offering video quality suitable for image processing.
The following features are important for a camera device intended for computer gaming interfaces:
Resolution The device must have a reasonably good resolution. Higher resolution means more information can be exploited for computer vision tasks. Good camera devices typically support a resolution of 640x480 pixels.
Frame rate A high frame rate means that fast movements can be processed with better accuracy since the frames do not suffer as much from motion blur artefacts. Cameras capable of 60 frames per second and above are good for use where fast actions need to be handled with accuracy.
Sensitivity The device should be able to cope with a reasonably good range of lighting conditions. Living rooms might be quite dark or poorly lit, so being able to handle these conditions is important. However, a sensitive device is only good if it has a low noise level (grain) in the video frames. Typically noise shows up in poor lighting conditions due to the gain applied by the camera to enhance the picture, so high sensitivity with low noise is ideal.
Figure 2.9: A typical Bayer pattern used by consumer web camera hardware. Each pixel in the sensor array has a corresponding colour filter. Full colour RGB images are reconstructed by interpolating between the different colour channels.
Quality A camera that produces video images with compression artefacts can cause problems with computer vision algorithms, so it is important for a device to be able to produce images with very few or no compression artefacts. The PSEye camera for instance can send video frames with no compression making it ideal for computer vision applications.
Cost The overall cost of the device must be low to ensure that it can be distributed at a price that encourages consumers to purchase it. This usually means that technology is a compromise between cost and quality.
2.2.1 Monocular (standard webcam)
All camera devices used for computer games to date have been standard single-lens webcams. The device typically provides video data at a range of resolutions over a wired USB interface.
The types of images that cameras can retrieve are RGB, grey-scale and Bayer pattern images. Hardware support for the different formats may be available on the camera, but RGB and grey-scale images can also be constructed from the Bayer sensor pattern in software. See figure 2.9 for a typical Bayer pattern.
Applying computer vision algorithms to the raw camera data can be faster, as the RGB image then does not need to be constructed for each frame; constructing the RGB image can take some time depending on the interpolation scheme used.
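As an illustration of this trade-off, a deliberately crude demosaicing scheme can skip interpolation entirely and reconstruct a half-resolution RGB image directly from 2x2 Bayer cells. The sketch below assumes an RGGB layout (real sensors may use other arrangements) and is not the scheme used by any particular camera.

```python
import numpy as np

def demosaic_half_res(raw):
    """Reconstruct a half-resolution RGB image from an RGGB Bayer mosaic.

    Each 2x2 cell [[R, G], [G, B]] becomes one RGB pixel; the two green
    samples are averaged. Cruder than full bilinear demosaicing, but very
    cheap, which matters when every frame must also be processed by the
    vision and game subsystems. Assumes an RGGB layout (an assumption).
    """
    r = raw[0::2, 0::2].astype(np.float32)
    g = 0.5 * (raw[0::2, 1::2].astype(np.float32) +
               raw[1::2, 0::2].astype(np.float32))
    b = raw[1::2, 1::2].astype(np.float32)
    return np.stack([r, g, b], axis=-1)
```

A full-resolution bilinear demosaic would instead interpolate each missing channel from its neighbours, at several times the cost per frame.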
2.2.2 Stereo
Stereo cameras are essentially two synchronised cameras fixed a short distance apart. They can retrieve depth as well as colour image data, which can be used to resolve ambiguities that occur with a single
lens and no depth information.
This type of camera is more expensive than a regular single-lens camera, as the camera sensor hardware is doubled. Also, due to the extra processing required to calculate the disparity maps for depth, processing frames can be costly even before computer vision algorithms are applied to the image and depth data.
2.2.3 Depth, or Z-Cam
Recently, the cost of depth camera technology has become low enough for it to be considered for use in computer games, though no consumer hardware is currently available (as of 2009). There are a few different approaches to measuring depth.
Structured light In this type of camera, a light pattern is projected into the scene from the camera and depth is calculated from the way the scene deforms the projected pattern. Several techniques can be used to determine depth, such as fringe projection (Zhang [178]), but these are outside the scope of this thesis.
Time-of-flight (TOF) There are two types of time-of-flight camera. Shutter-based cameras send out a fixed-duration pulse of light and collect light on a sensor for only a short period of time; the more light collected, the closer the surface is determined to be. Phase-based cameras modulate the light source and determine depth from the difference in phase of the light reflected back to the camera.
The camera can then provide depth data along with the RGB colour image data. In ideal conditions, normally difficult tasks such as background subtraction can then be performed by simply thresholding the depth value of each pixel.
2.3 Problem Domain
As highlighted in section 2.1, there has been a great deal of interest in human computer interaction with single cameras in the home entertainment industry. Due to the constraints of making this technology available at low cost, the majority of these systems employ simple monocular web cameras (§2.2.1).
Any algorithm employed in this context is subject to the following constraints:
1. The algorithm must be computationally efficient to allow for real-time interaction feedback to the user.
2. The algorithm must be able to operate using only data from a single camera.
The computational efficiency constraint is important because of the many components (or subsystems) that are required to run a typical computer game, as discussed in the next section.
2.3.1 Typical Computer Games
Computer games have evolved from small projects that a single developer could manage twenty or so years ago into huge multimedia projects involving many hundreds of people [165]. During this transition it has become important to modularise common components into subsystems so that they can be more easily maintained in the large-scale development projects that are now typical in the computer games industry [7].
Though there is some variation in the terminology used in the computer entertainment industry when referring to the software components of a computer game, the term 'game engine' refers to a collection of reusable subsystems responsible for such things as rendering, sound, game state, physics, and other tasks [7].
Until recently, game engines were generally optimised for single-processor architectures, but in the latter half of the last decade game developers have started to explore multi-threaded game engines to take advantage of the multi-processor and multi-core architectures present in most modern computers and games consoles [165]. Despite this trend, the term 'game engine' retains the same abstract meaning and is responsible for the same sub-processes, though the workload is spread across multiple threads.
One of the core subsystems of any game engine is the input subsystem. A computer vision algorithm can be considered an input subsystem, since it provides control information to the rest of the game engine, though it may also provide other information, such as texture information for the rendering subsystem from algorithms like the one presented in chapter 4.
Each of the subsystems within the game engine is updated once per frame. The frame rate of the game is the rate at which new frames are presented, typically measured in frames per second (FPS). Where fast interactions are required, games should have a high frame rate, so that a user can react to the game objects being displayed and see feedback of that action perceptually close to the time the action was performed.
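The per-frame update of subsystems can be sketched as a fixed-rate loop in which a vision-based input subsystem is updated alongside the others. The class and function names below are illustrative, not taken from any particular engine, and `max_frames` exists only to bound the run here.

```python
import time

class VisionInputSystem:
    """Placeholder for a camera-driven input subsystem (illustrative)."""
    def update(self, dt):
        # In a real engine: grab a camera frame, run e.g. background
        # subtraction, and emit input events to the rest of the engine.
        pass

def run_game_loop(subsystems, target_fps=60.0, max_frames=3):
    """Fixed-rate loop: every subsystem is updated once per frame.

    The vision algorithm must fit within a fraction of the per-frame
    budget (1/target_fps seconds) shared by all subsystems.
    """
    frame_budget = 1.0 / target_fps
    frames = 0
    while frames < max_frames:  # a real engine loops until quit
        start = time.perf_counter()
        for s in subsystems:
            s.update(frame_budget)
        frames += 1
        # sleep away whatever budget the subsystems did not use
        elapsed = time.perf_counter() - start
        if elapsed < frame_budget:
            time.sleep(frame_budget - elapsed)
    return frames
```

This makes the computational constraint concrete: at 60 FPS the entire frame, vision processing included, has under 16.7 ms.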
2.3.2 Computational Constraints
Clearly, implementing game interactions using low-level, computationally efficient methods is attractive, given the number of other complex subsystems that constitute a typical computer game engine. The algorithm must operate within a fraction of the available processing time so that the other subsystems of the game can also be updated, while providing a frame rate high enough for real-time interaction.
This constraint restricts the algorithms that can be used to those that operate within these requirements. The subsequent chapters first present algorithms that provide useful interactions (§3 and §4), then move on to the more difficult problems of efficient localisation (§5) and pose estimation (§6). The following sections give an overview of the literature for each of these problems.
2.4 Background Subtraction and Segmentation
Background subtraction and segmentation methods are two approaches to extracting similar types of information. They both attempt to extract the shape of an object from an image, but tackle the problem in different ways.
Background subtraction methods generally cope well with motion in video sequences and can be made computationally efficient, so they are well suited to processing live video streams. They construct either simple or relatively complex background models that can adapt to changes in the video so that foreground objects can be detected.
Segmentation algorithms, on the other hand, can be formulated to deal with both single images and video sequences, but their more complex formulation generally makes them computationally more expensive. As will be demonstrated in chapter 4, however, using other algorithms to focus on smaller areas of the image allows such algorithms to be used in real-time applications.
2.4.1 Background Subtraction
Background subtraction is a method of finding differences in an image to detect moving objects from a static camera. Typically a model of the background is either calculated offline from reference images captured when the camera view is clear of moving objects (and under varied lighting conditions), or the background model is separated from the foreground using pixel intensity statistics.
There are several survey papers that compare background subtraction methods applied to different problems (see [33, 123]). The common approaches to background subtraction are described next. For simplicity, references in equations to images or image models are assumed to operate on greyscale intensity values, but they can easily be applied independently to each colour channel in RGB colour images or other colour models.
At the most general level, a background subtraction algorithm compares pixels in the current frame to a background model, labelling them background if they are similar and foreground otherwise:
$$|B_t - I_t| > \theta_t \tag{2.4.1}$$
where $B_t$ and $I_t$ are the background model and image intensity respectively at the current time $t$, and $\theta_t$ is a per-pixel threshold. The threshold can represent a more complex criterion depending on the type of model used; for instance, $\theta_t$ is typically a function of the standard deviation measured at its respective pixel, as in [89, 174].
It is from the formulation of equation (2.4.1) that the term background subtraction is derived [123]. More generally, background subtraction is simply the method of comparing the pixels in an image to a reference model to find differences from that model.
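Equation (2.4.1) amounts to a single vectorised comparison per frame. A minimal NumPy sketch (the function name and default threshold are arbitrary choices; `threshold` may equally be a per-pixel array, e.g. a multiple of each pixel's standard deviation):

```python
import numpy as np

def foreground_mask(image, background, threshold=25.0):
    """Per-pixel background subtraction in the form of eq. (2.4.1):
    a pixel is labelled foreground when |B_t - I_t| exceeds the threshold.

    `threshold` may be a scalar or an array the same shape as the image,
    allowing a per-pixel threshold as discussed in the text.
    """
    diff = np.abs(image.astype(np.float32) - background.astype(np.float32))
    return diff > threshold
```

The resulting boolean mask is the foreground labelling; everything else in this section refines how the background model $B_t$ and threshold are obtained and updated.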
Another popular foreground detection method is to normalise the measured difference by the standard deviation of the pixel:
$$\frac{|B_t - I_t|}{\sigma_t} > \theta_{t,\sigma} \tag{2.4.2}$$
In either threshold method, most works determine the detection thresholds $\theta_t$ and $\theta_{t,\sigma}$ experimentally [33].
The model can either be learned offline or updated online. When updating online one assumption that could be taken is that a pixel will remain background most of the time [174], and foreground pixels will be those that differ from this mean by some threshold. Other works maintain hypotheses on moving objects to update the model only in areas other than where foreground objects have been predicted to be [84, 86, 89]. This conditional background update is known as a selective update.
A common approach is to use a temporal median filter to determine the background model [38, 39, 67, 101, 181]. The median is computed over the last $n$ frames [39, 67, 101, 181], or over a set of selected frames [38]. The assumption is that each pixel takes a background value for more than half of the measured time. The main disadvantages of this approach are that a buffer must be maintained to store the values of the last $n$ frames, and that the lack of a statistical definition of deviation from the median makes it difficult to adapt the threshold value [123].
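A minimal sketch of the temporal median model, keeping the $n$-frame buffer that is its main memory cost (class name and default $n$ are arbitrary):

```python
import numpy as np
from collections import deque

class MedianBackground:
    """Temporal-median background model over the last n frames.

    The deque holds the n most recent frames; this buffer is the memory
    overhead noted in the text. Illustrative sketch, not a specific
    published implementation.
    """
    def __init__(self, n=5):
        self.frames = deque(maxlen=n)  # oldest frame drops out automatically

    def update(self, frame):
        self.frames.append(np.asarray(frame, dtype=np.float32))

    def background(self):
        # per-pixel median across the buffered frames
        return np.median(np.stack(self.frames), axis=0)
```

A transient bright object affects fewer than half the buffered frames, so the median (unlike the mean) ignores it, at the cost of storing all $n$ frames.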
Some works instead use a running Gaussian average [84, 86, 89, 174] to construct a model of the background. Wren et al. [174] approximate the calculation of the mean and standard deviation of each pixel over $n$ frames: to avoid storing $n$ frames of pixel values and subsequently fitting a Gaussian model to each pixel, they update their background model using a running Gaussian average.
$$\mu_{t+1} = \alpha I_t + (1 - \alpha)\mu_t \tag{2.4.3}$$
$$\sigma_{t+1}^2 = \alpha (I_t - \mu_t)^2 + (1 - \alpha)\sigma_t^2 \tag{2.4.4}$$
where $\alpha$ is a parameter that controls the rate at which the model is updated. While this type of model can be useful in situations such as CCTV, where the background remains the same for much of the time, in a real-time computer game the user can occupy the same region of the screen for a long time, and it is undesirable for those pixels to be updated into the background model.
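One update step of equations (2.4.3) and (2.4.4) can be sketched as follows (the function name and the value of $\alpha$ used in the example are arbitrary):

```python
import numpy as np

def running_gaussian_update(mu, var, image, alpha=0.05):
    """One step of the running Gaussian average of eqs. (2.4.3)-(2.4.4):

        mu_{t+1}     = alpha * I_t + (1 - alpha) * mu_t
        sigma^2_{t+1} = alpha * (I_t - mu_t)^2 + (1 - alpha) * sigma^2_t

    `mu` and `var` are per-pixel arrays; no frame history is stored.
    """
    image = np.asarray(image, dtype=np.float32)
    new_mu = alpha * image + (1.0 - alpha) * mu
    new_var = alpha * (image - mu) ** 2 + (1.0 - alpha) * var
    return new_mu, new_var
```

Only the current mean and variance arrays are kept, which is precisely the memory saving over the median-buffer approach; the drawback described above is that a stationary player is slowly averaged into `mu`.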
The background update equation can alternatively be formulated so that the background is only updated for pixels not labelled as foreground in the current frame [84, 86, 130]. The update can be expressed in a Kalman filter formulation:
$$B_{t+1} = B_t + \left(\alpha_1 (1 - M_t) + \alpha_2 M_t\right) D_t \tag{2.4.5}$$
where $D_t = I_t - B_t$ is the difference between the image and the background model at time $t$, and $\alpha_1$ and $\alpha_2$ are based on an estimate of the rate of change of the background. The mask $M_t = |D_t| > \tau$ is simply a thresholded difference map; it helps prevent the background model being affected by pixels that are very different from it.
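A sketch of the selective update of equation (2.4.5), with illustrative values for $\tau$, $\alpha_1$ and $\alpha_2$ (the cited works derive these from an estimate of the background's rate of change rather than fixing them):

```python
import numpy as np

def selective_update(background, image, tau=25.0, a1=0.1, a2=0.01):
    """Selective background update, eq. (2.4.5):

        B_{t+1} = B_t + (a1*(1 - M_t) + a2*M_t) * D_t

    with D_t = I_t - B_t and M_t = |D_t| > tau. Pixels far from the model
    (likely foreground) are blended in at the slow rate a2; the rest are
    updated at the normal rate a1. Parameter values are illustrative.
    """
    image = np.asarray(image, dtype=np.float32)
    d = image - background                      # D_t
    m = (np.abs(d) > tau).astype(np.float32)    # M_t, thresholded difference
    return background + (a1 * (1.0 - m) + a2 * m) * d
```

With $\alpha_2 \ll \alpha_1$, a player standing still corrupts the model far more slowly than background pixels adapt to lighting changes, though not at all only if $\alpha_2 = 0$.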
Koller et al. [89] use a Gaussian to model the values of each pixel, but extend the selective update method of [84, 86] with a tracking-based object motion mask that predicts where foreground pixels are expected to be in the current frame, instead of simply thresholding the difference image. This object motion mask ensures that pixels predicted to be foreground are not considered when the background model is updated. They then combine this system with an object tracking algorithm
that is used to predict a new object motion mask $M_{t+1}$ for the next frame.
For dynamic backgrounds, a single Gaussian model per pixel is not sufficient. Stauffer and Grimson [153] proposed using a mixture of Gaussians to model each pixel independently. This can handle dynamic backgrounds where background objects have varying intensity properties, such as fountains, moving trees and plants. A few years later, Power and Schoonees [126] provided a principled tutorial and offered some corrections to the Stauffer and Grimson [153] paper. The recent history of each pixel, $X_1, \ldots, X_t$, is modelled by a mixture of Gaussians; the probability of observing the current pixel value is:
$$P(X_t) = \sum_{i=1}^{K} w_{i,t} \, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \tag{2.4.6}$$
where $K$ is the number of distributions, $w_{i,t}$ is an estimate of the weight (the proportion of the data accounted for by this Gaussian) of the $i$th Gaussian in the mixture at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and covariance matrix of the $i$th Gaussian at time $t$, and $\eta$ is a Gaussian probability density function.
The $K$ Gaussians are sorted by $w_{k,t}/\sigma_{k,t}$, a measure proportional to the peak of the weighted density function $w_{k,t}\,\eta(X_t, \mu_{k,t}, \Sigma_{k,t})$, and the top $B$ Gaussians are assumed to describe the background, with $B$ estimated as:
$$B = \operatorname*{argmin}_{b} \left( \sum_{k=1}^{b} w_{k,t} > T \right) \tag{2.4.7}$$
where $T$ is the prior probability of anything in view being background; the remaining Gaussians are considered to belong to the foreground. The mixture is updated each frame only for pixels considered to come from the background Gaussians. Pixels are labelled foreground if they are more than $2.5\sigma$ away from every background distribution. In practice $K$ is typically 3 or 5, representing a reasonable trade-off between computational cost and memory usage.
A disadvantage of this method is that its computational cost is higher than that of the simpler (but less flexible) single Gaussian models, so low values of $K$ must be used to approach real-time speeds.
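The following is a greatly simplified, single-pixel sketch of a Stauffer and Grimson style mixture for scalar intensities. It fixes the learning rate $\rho$ to $\alpha$ and omits several details of the published method (the density-weighted per-component $\rho$, among other refinements), so it is illustrative only; all parameter values are arbitrary.

```python
import numpy as np

class PixelMoG:
    """Simplified single-pixel mixture-of-Gaussians background model.

    Scalar intensities, K components; not a faithful reimplementation of
    Stauffer and Grimson [153], just the shape of the algorithm.
    """
    def __init__(self, k=3, alpha=0.05, T=0.7, init_var=225.0):
        self.alpha, self.T, self.init_var = alpha, T, init_var
        self.w = np.full(k, 1.0 / k)              # component weights
        self.mu = np.linspace(0.0, 255.0, k)      # component means
        self.var = np.full(k, init_var)           # component variances

    def update(self, x):
        """Match x within 2.5 sigma, update the matched component, and
        return True if x is explained by a background component."""
        d = np.abs(x - self.mu)
        matched = d < 2.5 * np.sqrt(self.var)
        self.w = (1.0 - self.alpha) * self.w      # decay all weights
        if matched.any():
            i = int(np.argmin(np.where(matched, d, np.inf)))
            self.w[i] += self.alpha
            rho = self.alpha                      # simplified learning rate
            delta = x - self.mu[i]
            self.mu[i] += rho * delta
            self.var[i] += rho * (delta ** 2 - self.var[i])
        else:
            i = int(np.argmin(self.w))            # replace weakest component
            self.mu[i], self.var[i], self.w[i] = x, self.init_var, self.alpha
        self.w /= self.w.sum()
        # components sorted by w/sigma; first ones covering mass T = background
        order = np.argsort(-self.w / np.sqrt(self.var))
        background, total = set(), 0.0
        for j in order:
            background.add(int(j))
            total += self.w[j]
            if total > self.T:
                break
        return bool(matched.any()) and i in background
```

After the model has seen a stable intensity for a while, that value is classified as background, while a component that only just appeared (or matches a low-weight component) is not.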
Histograms of the pixel values can be used to approximate the background distribution, but because the histogram is a step function it may not model the distribution well. Elgammal et al. [46] instead use kernel density estimation (KDE) over the past $n$ (with $n$ around 100) values of the pixel to model the background distribution:
$$P(X_t) = \frac{1}{n} \sum_{i=1}^{n} K(X_t - X_i) \tag{2.4.8}$$
where $K$ is the kernel estimator function, chosen to be the normal distribution $N(0, \Sigma)$.
Modelled this way, the background distribution is estimated directly from the data points themselves, rather than being restricted to a fixed number of modes as in the work of Stauffer and Grimson [153]. Elgammal et al. [46] combine this formulation with a second stage that suppresses false detections by also evaluating a small neighbourhood around each pixel under the centre pixel's model, constrained by the requirement that a connected component must also have been displaced.
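With a Gaussian kernel, equation (2.4.8) reduces to a mean of kernel evaluations over the stored samples. A scalar-intensity sketch (the function name and bandwidth $\sigma$ are arbitrary; Elgammal et al. estimate the bandwidth from the data):

```python
import numpy as np

def kde_background_prob(x, samples, sigma=10.0):
    """Eq. (2.4.8): P(x) = (1/n) * sum_i K(x - x_i), with a Gaussian
    kernel N(0, sigma^2), evaluated over the pixel's last n intensity
    samples. A low probability suggests x is foreground.
    """
    samples = np.asarray(samples, dtype=np.float64)
    k = (np.exp(-0.5 * ((x - samples) / sigma) ** 2)
         / (sigma * np.sqrt(2.0 * np.pi)))
    return float(k.mean())
```

Classification then thresholds this probability; the cost per pixel is $O(n)$ kernel evaluations, which is why the sample buffer is kept modest (around 100).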
Other methods use eigenvalue decomposition to model the background over blocks [138] or over the whole image (Oliver et al. [119]), but the model construction is too computationally intensive for use in real-time applications.
Though many of these systems update the model at runtime using selective updating, they generally assume that a pixel will remain background for the majority of the time observed. This is not a safe assumption for computer vision based games, where the player may remain in the centre of the image for most of the duration of the game. In this situation pixel values belonging to the user may be absorbed into the background model: where they are close enough to the background values (e.g. similar but slightly different colours in clothing and background), the model can drift gradually towards the new foreground values.
Methods that consider local texture properties, ideally with robustness to illumination changes, may provide better representations than single pixel value comparisons. There are fast local descriptors that can describe local texture in a computationally efficient way. The following section briefly introduces correlation.
2.4.2 Background Subtraction using Local Correlation
For near real-time systems that apply a model to each pixel, comparisons between the model and the current frame are generally done on individual pixel values. An object moving in a video sequence, however, is generally made up of a group of connected pixels. Local patch based correlation methods that use the texture of a small region around a pixel, instead of the individual pixel values, could therefore provide a
more robust measure of correlation.
Modelling pixel intensities independently is computationally efficient, but pixels are generally part of a larger pattern or structure. Local patch correlation methods can add a degree of robustness to illumination changes beyond that of intensity or colour models built on single pixel values, for instance by comparing relative intensities within the patch [115] or by using locally intensity-normalised patches.
For correlation between local patches, dense descriptor methods such as HOG descriptors [40, 140, 183], Haar filters [171], DAISY [162] or SIFT [102], among others, can be used, but they must be efficient and fast to calculate.
In other fields such as wide-baseline stereo, local features are used to find correspondences between pairs of images and can be calculated using efficient local descriptors such as the DAISY descriptor presented by Tola et al. [162]. Grabner et al. [68] use a grid of local classifiers to construct a discriminative background model that classifies blocks as foreground or background. Neither of these approaches is fast enough for real-time correlation, however.
Normalised Cross Correlation (NCC)
NCC is a correlation method that is robust to global changes in illumination, which reduces the need to update the background model every frame. Lewis [100] observed that much of the computation for the NCC comparison can be pre-calculated, and for small window sizes in the region of 5 or 7 pixels across it can be made to run in real-time (such as the implementation used in §3).
Given two M-by-N patches, f(x, y) and g(x, y), the normalised cross correlation between them is expressed as:
\frac{\sum_{x,y}\left[f(x,y)-\bar{f}\right]\left[g(x,y)-\bar{g}\right]}{\sqrt{\sum_{x,y}\left[f(x,y)-\bar{f}\right]^{2}\,\sum_{x,y}\left[g(x,y)-\bar{g}\right]^{2}}} \qquad (2.4.9)
Where f̄ and ḡ are the means of f and g respectively. The correlation is the dot product of the differences from the mean in each of the respective patches, normalised by the square root of the product of the sums of squared differences. If the patches are pre-processed to subtract their mean before the NCC comparison, this reduces to:
\mathrm{NCC}_{f,g}(x,y) = \frac{\sum_{x,y} f'(x,y)\,g'(x,y)}{\sqrt{\sum_{x,y} f'(x,y)^{2}\,\sum_{x,y} g'(x,y)^{2}}} \qquad (2.4.10)
25 CHAPTER 2: BACKGROUND
Where f'(x, y) and g'(x, y) are the patches with their mean already subtracted. The square root can be removed by squaring both sides. The denominator is just the product of the standard deviations of each patch.
\mathrm{NCC}_{f,g}(x,y) = \frac{\sum_{x,y} f'(x,y)\,g'(x,y)}{\sigma_{f} \cdot \sigma_{g}} \qquad (2.4.11)
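As an illustration, equation (2.4.10) can be sketched in a few lines of NumPy. The helper name `ncc` and the convention of returning 0 for a flat patch (where the correlation is undefined) are choices made here for the sketch, not part of Lewis's formulation:

```python
import numpy as np

def ncc(f, g):
    """Normalised cross correlation between two equal-sized patches,
    as in equation (2.4.10).  Returns a value in [-1, 1]; 1 means the
    patches differ only by a gain/offset change in intensity."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    f = f - f.mean()                       # f'(x, y)
    g = g - g.mean()                       # g'(x, y)
    denom = np.sqrt((f * f).sum() * (g * g).sum())
    if denom == 0.0:                       # flat patch: define NCC as 0
        return 0.0
    return float((f * g).sum() / denom)
```

A patch compared against a brightened and contrast-scaled copy of itself scores 1, which is precisely the illumination robustness the text describes.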
If B(x, y) represents the background model, and I_t(x, y) the image from the current frame, then the quantities B'(x, y) = B(x, y) − B̄ and σ_B can be pre-calculated. Additionally, by squaring both sides of the equation, the identity Var(X) = E(X²) − E(X)² can be used (where E(X) is the expectation value of X, and Var(X) is the variance σ²) to remove the need for calculating the square root to find the standard deviations at each pixel location, and just use the variance.
\mathrm{NCC}_{B,I}(x,y)^{2} = \frac{\left[\sum_{x,y} I'(x,y)\,B'(x,y)\right]^{2}}{\sigma_{I}^{2} \cdot \sigma_{B}^{2}} \qquad (2.4.12)

Convolution over summed-area tables [37] (see §5.2.1 for a description) can be used to efficiently precompute the mean Ī and variance σ_I² at each pixel for a fixed size correlation window; then only the product between the background model and the current image, B'(x, y) · I_t'(x, y), needs to be calculated to evaluate NCC_{B,I}(x, y)². The values of B'(x, y) for a fixed size window can be stored in a vector for each pixel for efficiency when evaluating each new frame I_t.
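The pre-computation described above can be sketched with summed-area tables built from cumulative sums. The function names and the cropping to valid window positions are illustrative choices, assuming the identity Var(X) = E(X²) − E(X)²:

```python
import numpy as np

def integral(a):
    """Summed-area table with a zero top row/column for O(1) box sums."""
    s = np.cumsum(np.cumsum(a, axis=0), axis=1)
    return np.pad(s, ((1, 0), (1, 0)))

def box_sum(ii, w):
    """Sum over every w x w window (valid positions only)."""
    return ii[w:, w:] - ii[:-w, w:] - ii[w:, :-w] + ii[:-w, :-w]

def window_mean_var(img, w):
    """Per-pixel mean and variance over a w x w window via two
    summed-area tables, using Var(X) = E(X^2) - E(X)^2."""
    img = np.asarray(img, dtype=float)
    n = float(w * w)
    mean = box_sum(integral(img), w) / n
    ex2 = box_sum(integral(img * img), w) / n
    return mean, np.maximum(ex2 - mean ** 2, 0.0)
```

Each window sum costs four lookups regardless of window size, which is what makes the per-frame NCC evaluation cheap.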
Jacques Jr et al. [78] use NCC to remove shadow pixels from detected foreground pixels, though shadows that cause hard edges are still a problem in grey-scale images as there is no way to distinguish them from true edges using the NCC measure alone.
Colour Normalised Cross Correlation (CNCC)
To help address the problem of shadows in the NCC comparison, Grest et al. [71] proposed a variation on the NCC formulation to perform the comparison in a more shadow invariant colour space. The RGB colour image is transformed into what they refer to as an hsL image, the hs component represents a vector on the colour hue and saturation plane defined in the HSL colour space (see figure 2.10). This representation allows colour comparisons to be expressed as a dot-product and easily integrates into the NCC equation.
Grest et al. [71] define the CNCC over a window M × N, where the hsL components are split into an (h, s) vector c = (h, s) and a lightness value L, as:
Figure 2.10: Left: HSL colour cylinder [172]. Right: hsL colour space representation showing two colours, c_a and c_b. Their similarity is measured by the function C(c_a, c_b) = max(0, c_a^T c_b).
\mathrm{CNCC}_{x,y} = \frac{\sum_{x,y} c_{x,y}^{B} \circ c_{x,y}^{I} - MN\,\bar{L}^{B}\bar{L}^{I}}{\sqrt{\mathrm{CVAR}(B) \cdot \mathrm{CVAR}(I)}} \qquad (2.4.13)
And:
\mathrm{CVAR}(A) = \sum_{x,y} c_{x,y}^{A} \circ c_{x,y}^{A} - MN\left(\bar{L}^{A}\right)^{2} \qquad (2.4.14)
Where c_{x,y}^B and c_{x,y}^I are the (h, s) vectors from the hsL colour space at position (x, y) from the background model B and image I respectively, and c_a ∘ c_b represents the scalar product between two vectors, but with negative values set to 0.
Grest et al. [71] report that the algorithm successfully uses the (h, s) colour space to increase robustness to shadows in colour images, while in intensity-only images the CNCC correlation is equivalent to NCC.
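A minimal sketch of the clipped scalar product C(c_a, c_b) = max(0, c_a^T c_b) from figure 2.10. Encoding the (h, s) component as a polar 2D vector on the hue/saturation disc is an assumption made here for illustration, not a detail taken from Grest et al.:

```python
import colorsys
import math

def hs_vector(r, g, b):
    """Map an RGB triple (components in 0..1) to a 2D vector on the
    hue/saturation disc of the HSL cylinder.  This polar encoding of
    the (h, s) component is an assumption made for illustration."""
    h, _l, s = colorsys.rgb_to_hls(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle))

def hs_similarity(ca, cb):
    """C(c_a, c_b) = max(0, c_a . c_b): colours with opposing hues
    score zero instead of going negative."""
    return max(0.0, ca[0] * cb[0] + ca[1] * cb[1])
```

A shadowed pixel keeps roughly the same hue and saturation at a lower lightness, so its hs vector, and hence this similarity, changes little.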
Following on from the works using local correlation methods for background subtraction, chapter 3 presents a fast method of background subtraction using a single reference frame as a background model and compares the new algorithm to correlation methods.
2.4.3 Segmentation
Image segmentation algorithms try to solve the general problem of partitioning an image into a number of segments, commonly a foreground and a background segment in the binary case, or object class segments in classification based segmentation methods such as the approach used by Shotton et al. [146].
Thresholding is the simplest way of tackling the segmentation problem. A threshold is selected and each pixel is labelled as either foreground if it is above this threshold, or background if it is below. The threshold is selected either as an intensity value, or a colour (multi-band thresholding).
This simple approach to segmentation (though computationally efficient) is only really effective in applications where the object already stands out quite clearly from a clear background, for instance in manufacturing and quality control, document scanning, or in applications such as chroma-keying in film and television where the background is a distinct colour such as bright green or blue.
An excellent survey of the common methods of image thresholding as applied to many varied applications is presented by Sezgin and Sankur [139], who categorise the algorithms into different approaches. Histogram shape methods select a threshold based on the maxima, minima and curvatures of the smoothed histogram of pixel intensities. Clustering based methods cluster intensity (or colour) values into background and foreground regions, or alternatively model them as a mixture of two Gaussians (similar to the approaches used in the background subtraction methods presented in §2.4.1). Entropy based methods use the entropy of the foreground and background regions, and the cross-entropy between the original and binarised image. Object attribute based methods search for similarity between the intensity and the binarised images, using measures such as shape similarity or edge coincidence. Spatial methods use higher-order probability distributions or correlation between pixels to compare local regions, similar in some respects to local correlation methods. Local methods adapt the threshold value at each pixel to the local image characteristics, such as local brightness.
Sezgin and Sankur [139] conclude that clustering and entropy based methods work best on the segmentation task in their non-destructive testing image dataset, while for the binarisation of degraded document images, clustering and local methods perform better. The best performing method across both datasets was the clustering method proposed by Kittler and Illingworth [87].
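As a concrete example of a clustering-type method in this taxonomy, the well-known Otsu method (named here only for illustration; it is not the Kittler and Illingworth method discussed above) picks the threshold that maximises the between-class variance of the intensity histogram:

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Threshold selection by maximising the between-class variance of
    the intensity histogram (Otsu's method, a clustering-type approach
    in the taxonomy above; shown for illustration only)."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    omega = np.cumsum(p)                        # class-0 probability
    mu = np.cumsum(p * np.arange(bins))         # class-0 cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return int(np.argmax(sigma_b))              # pixels > t are foreground
```

On a strongly bimodal histogram the selected threshold cleanly separates the two intensity populations, which is exactly the situation where global thresholding works well.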
Thresholding approaches are generally computationally efficient and lend themselves well to applications in controlled environments such as document scanning, but in complex applications with varied and dynamic environments they can leave incomplete or fragmented foreground regions that must be cleaned up using morphological operations such as dilation and erosion. Many background subtraction methods also suffer from these problems.
Region based methods such as the Watershed algorithm [169] use the edge gradient magnitudes extracted from an image as a height map surface, and create segments by flooding it from the minima of the surface. The flooding of a region continues until it reaches a point higher than its boundary and can spill over to fill another area. A segment boundary is created when one region spills over into another, or meets another flooded segment. The process continues until all regions have been flooded. This algorithm is based on the assumption that segments correspond to the valleys (catchment basins) of the surface, with boundaries along the ridges between them.
2.4.4 Graph-Cut based Methods
Instead of thresholding the image at each pixel independently to assign a label, or growing closed contour regions, the pixels can be formulated as a graph constructed from a regular lattice of pixels. The GraphCut algorithm formulates the pixels in such a way and achieves good results in many applications due to the flexibility of the energy function that is optimised over the graph to solve the labelling problem.
Greig et al. [70] were the first to apply graph cuts to a computer vision problem: they formulated image restoration as an energy minimisation problem and solved it using a max-flow/min-cut algorithm. Some time later, Boykov and Kolmogorov [26] reformulated several computer vision problems, including image restoration, stereo, and segmentation, as energy minimisation problems and demonstrated that they could be effectively solved using graph-cut methods, proposing a new max-flow/min-cut algorithm that solves the labelling problem efficiently.
The basic formulation for binary image segmentation is as follows [26, 70]: a pixel in an image P can take one of a number of labels L = {l_1, l_2, ..., l_n} = {0, 1, ..., n}, where L = {0, 1} for binary image segmentation. The set N defines the set of all connected pairs of pixels in the image. For a given labelling L of an image P, where each pixel has been assigned an associated label, the Potts energy of the image taking that labelling can be expressed as follows:
E(L) = \sum_{p \in P} D_{p}(L_{p}) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_{p}, L_{q}) \qquad (2.4.15)
Where Dp is a data penalty function, and Vp,q is an interaction penalty function
Figure 2.11: GraphCut of a directed capacitated (weighted) graph (example adapted from Boykov and Kolmogorov [26]). Weight strengths are reflected by line thickness. Far-Left: Pixel intensities of a 3 × 3 neighbourhood. Mid-Left: Graph representation of the pixel intensities. Pixels are represented by the grey nodes. Mid-Right: A cut on the graph. Far-Right: Maximum a posteriori (MAP) solution for the graph.
between pixels. The data term D_p is the cost of assigning the pixel to a given label (usually based on observed intensities and a pre-specified likelihood function, e.g. determined from histograms of colours), and the interaction term V_{p,q} corresponds to a cost for discontinuity between pixels to encourage spatial coherence between them. The data and interaction penalties are also known as unary and pairwise potentials in other segmentation literature [27, 88, 92].
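The Potts energy of equation (2.4.15) can be evaluated directly for a small labelling. The dictionary-based representation and the indicator-function form of the interaction term are illustrative choices:

```python
def potts_energy(labels, unary, neighbours, gamma=1.0):
    """Evaluate E(L) = sum_p D_p(L_p) + sum_{(p,q) in N} V_pq(L_p, L_q)
    with the Potts interaction V_pq = gamma * [L_p != L_q].

    labels:     {pixel: assigned label}
    unary:      {pixel: tuple of costs, indexed by label}   (the D_p term)
    neighbours: iterable of (p, q) pairs                    (the set N)"""
    data = sum(unary[p][labels[p]] for p in labels)
    smoothness = sum(gamma for p, q in neighbours if labels[p] != labels[q])
    return data + smoothness
```

Finding the labelling that minimises this energy is the hard part; enumerating labellings is exponential in the number of pixels, which is why the max-flow/min-cut formulation below matters.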
The optimal labelling solution L* is found by solving the following using a max-flow/min-cut algorithm such as the one proposed by Boykov and Kolmogorov [26]:
L^{*} = \underset{L}{\operatorname{argmin}}\; E(L) \qquad (2.4.16)
Figure 2.11 shows an example graph cut for a small 3 × 3 neighbourhood. The graph is constructed as follows. The pixels p ∈ P from image P are represented as a set of nodes V and are connected to their neighbours by directed weighted edges E to form a directed weighted (capacitated) graph G = ⟨V, E⟩. The nodes in the graph are also connected to special nodes called terminals that represent the set of possible labels a node can take. In binary image segmentation, where there are 2 labels, these are called the source node s and sink node t.
There are two types of connections between nodes in the graph, and each link is given a weight (cost). The first type are N-links, the connections between the pixel nodes. The weights of these N-links are determined by the interaction term V_{p,q} in equation (2.4.15), and correspond to a penalty for discontinuity between pixels. Pixel pairs which have a high contrast are typically set to low values to encourage a split along a high contrast boundary. The other type of connection are terminal links, or T-links, and the weight of the connection from a given node to either the s or t terminal is based on the cost of that node taking a particular label, from the data term D_p in (2.4.15). Since the graph is directed, the weight over the link (p, q) ∈ N can be different from the weight in the other direction (q, p) ∈ N. This is a useful property that can be exploited by many vision algorithms [26].
The s/t cut C on a graph G partitions the nodes into two sets: those connected to the source S and those connected to the sink T [26]. The cost of the s/t cut C is the sum of the costs between boundary pixels (p, q) ∈ N in the neighbourhood system where p ∈ S and q ∈ T. The cost of the cut is directed, so that costs are added only in the direction of S to T. The minimum cut is the cut that has the minimum cost of all cuts. This can be solved as a maximum flow problem [56], where the edges in the graph are treated as a network of pipes with capacity equal to their weights. Ford and Fulkerson [56] state that the maximum flow from s to t saturates a set of edges dividing the graph into two sets that correspond to a minimum cut. Boykov and Kolmogorov [26] discuss various algorithms that can be used to solve this max-flow problem, and propose their own algorithm that solves it efficiently.
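A toy Edmonds-Karp solver illustrates the max-flow/min-cut equivalence on a small hand-built graph. This is the textbook augmenting-path scheme, not the specialised algorithm of Boykov and Kolmogorov:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp (shortest augmenting path) max-flow on a directed
    graph given as cap[u][v] = capacity.  By the max-flow/min-cut
    theorem, the value returned equals the cost of the minimum s/t cut."""
    res = {u: dict(vs) for u, vs in cap.items()}        # residual capacities
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # reverse edges
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:                    # BFS for a path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:                    # recover the path
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                               # augment the flow
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck
```

When no augmenting path remains, the nodes still reachable from s in the residual graph form the set S of the minimum cut, which in the segmentation setting is the set of pixels assigned to one label.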
When the images being segmented are from image sequences or videos, the images tend to change only slightly from one frame to the next. Kohli and Torr [88] proposed an efficient max-flow formulation that uses information from the first frame to considerably speed up the computation of the solution for subsequent frames. They call this method dynamic graph cuts, and demonstrated good results on improving the computation time versus the traditional approach of reconstructing the graph every frame.
Other works [27, 58, 76, 92] observed that when prior knowledge of the type of object being segmented is available, the data term can be augmented to include a shape prior. Selecting an appropriate shape representation however is the main problem in these systems due to the variability of the shape and pose over time or viewing angle. Determining these pose and shape parameters is a difficult problem in itself.
Kumar et al. [92] approached this difficult problem by matching a set of exemplars for different parts of the object onto the image. These matches are used to generate a shape model for the object. The segmentation problem is then modelled by combining MRFs with layered pictorial structures (LPS) which provide them with a realistic shape prior described by a set of latent shape parameters. However, a lot of effort has to be spent to learn the exemplars for different parts of the model.
Instead of matching shape exemplars, Bray et al. [27] used a simple 3D stick figure as a shape prior to determine 3D pose at the same time as solving the segmentation of the image. The parameters of this model are iteratively explored to find the pose corresponding to the human segmentation having the maximum probability, or minimum energy. The iterative search was made efficient by using the dynamic graph cut algorithm proposed by Kohli and Torr [88].
The work on shape priors [27, 58, 76, 92], and the introduction of an efficient dynamic graph cut algorithm for image sequences [88], made solving graph cut based segmentations much more efficient and practical for applications where efficiency is important. Chapter 4 couples these ideas with a detection algorithm to reduce the processing region to a smaller area of the image. This allows the size of the graph being solved to be reduced considerably, improving computation to real-time speeds.
2.5 Human Detection and Pose Estimation
The methods discussed in the previous section can be employed to quickly identify regions of the image that have changed due to the appearance of the user in front of the modelled background (background subtraction), or to label pixels in the image that should belong to the user based on an appearance model (segmentation). Since the changes in the image are primarily dependent on the location and motion of the user, both tasks could equally be solved by determining the location and pose of the individual.
If the pose is known, then user interface controls such as floating buttons or other more advanced interactions can be driven by following the positions of the parts of the user's body. Knowing the pose also makes it possible to trigger an interaction for a specific body part only, and even to stop a control from activating in error when an arm, head or other limb enters the area.
This is of course a very difficult and non-trivial problem, particularly for real-time applications. The next section discusses methods for monocular pose estimation. This problem can be broken down into two tasks: detection §2.6, and pose estimation §2.7.
2.5.1 Learning Problem
The machine learning problems examined in the chapter on Human Detection (chapter 5) and the chapter on Pose Detection (chapter 6) are supervised learning problems. See figure 2.12 for an illustration of a typical supervised machine learning algorithm.
Figure 2.12: A diagram showing the typical processes involved in supervised machine learning.
These types of algorithms learn a model from a training set D = {(x_i, y_i)} made up of examples that pair a feature vector x ∈ R^n with a corresponding label y ∈ R^m. To reduce the dimensionality of the data that the learning algorithm must deal with, feature vectors are usually generated from training images by extracting lower dimensional information (such as the descriptors discussed later in §5.2).
Supervised learning algorithms attempt to learn a model h(·) that maps the feature vectors x_i extracted from the training data to their corresponding labels y_i (i.e. ∀i : h(x_i) = y_i), or predicted function values in the case of regression. Given a new and unseen feature vector x′, the model will try to predict an appropriate output y′ for it.
In the case of human detection, the output y is usually either a binary value indicating object or non-object, or a continuous value indicating a degree of confidence that the object is there or not. In the case of pose estimation/detection, the output is either a class indicating some quantisation of a possible pose space that maps to a pose, or a predicted 3D pose.
2.6 Human Detection
Human detection is a specialised form of object detection, and is generally formulated as a binary classification problem. A classifier h(·) is learnt that can classify subregions of an image as either containing an object or not.
There have been a few surveys dealing with the problems relating to human detection [49, 62, 63]. The problem is more difficult than rigid object detection, since the appearance of a human can vary considerably and finding a suitable model to cope with the variations is extremely difficult.
The survey by Gandhi and Trivedi [62] explores the problem in terms of pedestrian detection for vehicle safety and discusses various methods that use different sensing technologies. The methods most relevant to this thesis are visible light single camera methods, as they best represent the hardware constraints typically used for computer vision based computer entertainment applications.
A more recent survey by Enzweiler and Gavrila [49] presents a detailed summary of the state of the art computer vision algorithms that address the problem of human detection and compares several methods side by side to assess their performance. They consider the problem in two parts: region of interest (ROI) detection, and classification, though not all methods can be neatly separated into these two stages (take the sliding window approach of Dalal and Triggs [40], discussed later, for instance). The algorithms considered in their benchmark were a HOG based linear classifier [40]; a Haar wavelet cascade classifier [171]; a Neural Network (NN) classifier with adaptive local receptive fields (LRF), where each neuron sees only a portion of the image [173]; and a combined shape detector and texture classifier based on chamfer template matching, with a final texture verification stage that uses an NN/LRF classifier to prune false positives [66].
They report that the local receptive field classifier might have performed slightly better if trained with a non-linear SVM instead, but the memory requirements to do this were too high to allow training using their dataset.
Enzweiler and Gavrila [49] concluded that the HOG linear SVM classifier performed by far the best of the benchmarked methods when no time constraints were considered. When computation constraints were applied, however, the Haar wavelet method performed best. This seems to indicate that cascaded classifiers have a clear advantage in terms of computation in time critical applications, due to the rejection power of the first few levels of the cascade, but at a cost in classification performance when the whole cascade is considered.
Algorithms for human detection can be broadly categorised as either generative or discriminative. Discriminative methods try to find a model that can directly predict the probability p(c|x) of a class label c (either human or non-human) from the feature representation x, while generative methods try to find an appropriate model with variable parameters Θ that describes the appearance of the human by finding the joint probability p(c, x), which can be written as p(x|c)p(c). This can be done by learning the likelihood p(x|c) and the class prior p(c) separately. The prior probability p(c) may vary, as it can be dependent on the model parameters Θ of the shape model used, e.g. to restrict the shape to physically plausible configurations independent of the appearance likelihood p(x|c).
2.6.1 Generative
An advantage of generative algorithms is that they can handle missing or partially labelled data, and their model can also be used to augment datasets with synthetically generated as well as natural training data. However, since they model the joint probability density p(c, x), predicting the class for a new image often requires a computationally intensive iterative solution (such as the Active Contour methods used in [35, 113]), making them costly for use in real-time applications.
Generative approaches generally employ a shape representation to model appearance. Discrete shape models use exemplars (representative shapes) that cover the expected variation in appearance and use efficient methods to determine if any of them matches the query image [65, 66, 154, 164], while other approaches use a parametric shape representation [12, 16, 35, 48, 73, 74, 80, 113] to describe the shape being matched and optimise over the parameter space of the appearance model to find the best match.
Combining shape and texture information with a compound appearance model has also been explored [34, 35, 48, 52, 80]. Training data are normalised using sparse landmark features, such as the approach used by Fan et al. [52], or dense correspondences, such as the method used by Cootes et al. [34]; an intensity model is then learnt from the normalised examples to model the variation in texture appearance.
2.6.2 Discriminative
Discriminative approaches are typically very fast at predicting p(c|x) for new data compared to the iterative solution often required by generative methods. This makes discriminative approaches potentially more useful in real-time applications. A disadvantage of discriminative methods, however, can be the requirement of large numbers of training examples to cover the expected variation in appearance.
Some discriminative approaches use a mixture of experts approach, where the training data are first separated into local shape specific pedestrian clusters and then a classifier is trained for each subspace [66, 114, 141, 144, 175, 177]. The advantage of these methods is that by grouping training examples into clusters of roughly similar shape, the classifier is not faced with the problem of trying to model very high variability in appearance.
Other methods attempt to model the detection problem in terms of semantic parts, such as body parts [5, 104, 108, 141, 147, 175], where a discriminative classifier is learnt for each part, or codebook representations [3, 96, 97, 137], where occurrences of features over local patches, and the geometric relations between them, are learned.
An advantage of multi-part methods is that they can handle occlusions well, and can reduce the number of training examples required to cover the intended pose space. The disadvantage of these methods, however, is that they usually come with a higher computational cost due to the multiple detectors and the additional cost of classifying examples during testing.
Detection Strategies
A typical approach to human detection is that of using a sliding window to scan across an image at multiple scales, to determine directly the classification of each sub-window [40, 41, 108, 122, 136, 159]. The problem with sliding window methods is that they are generally too computationally costly to be used in real-time applications. Chamfer based matching methods [65, 66] can exploit the smoothness of the distance transform to perform a coarse to fine search of the image, but are less accurate than the more computationally expensive sliding window algorithms.
Dalal and Triggs [40] arrange a densely overlapping grid of HOG descriptors within a 96 × 160 detection window and train the classifier using a linear support vector machine. The linear Support Vector Machine (SVM) used by [40, 41] and others [114, 142, 144, 177, 183] is a linear classifier that, once trained, classifies a feature vector as either positive or negative using the equations in (2.6.1).
w^{T}x + b \geq 0 \quad \text{for positive classification} \qquad (2.6.1)
w^{T}x + b < 0 \quad \text{for negative classification}
Where w and b are the weight vector and bias determined during the training process, representing the decision hyperplane used to classify examples, and x is the feature vector to be classified. The decision hyperplane surface is described by equation (2.6.2).
w^{T}x + b = 0 \qquad (2.6.2)
The training problem of a linear SVM is to maximise the distance of the training examples from the decision hyperplane surface. Dalal and Triggs [40] used an implementation called SVM Light [79] to train the linear SVM using a set of positive and negative feature vector examples. They then scan negative images to find a set of hard examples that are used as a bootstrap set to train the classifier again. The performance achieved by this method is still considered among the state of the art for standing pedestrian detection.
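The decision rule (2.6.1) and a crude primal sub-gradient trainer for the hinge loss can be sketched as follows. This toy trainer is only a stand-in illustration; it is not the SVM Light solver used by Dalal and Triggs:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy primal sub-gradient descent on the L2-regularised hinge loss
    (a stand-in illustration, not the SVM Light solver).  y in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0               # examples violating the margin
        if viol.any():
            gw = lam * w - (y[viol, None] * X[viol]).mean(axis=0)
            gb = -y[viol].mean()
        else:
            gw, gb = lam * w, 0.0
        w -= lr * gw
        b -= lr * gb
    return w, b

def classify(w, b, x):
    """Positive class iff w.T x + b >= 0, as in equation (2.6.1)."""
    return 1 if w @ np.asarray(x, dtype=float) + b >= 0.0 else -1
```

At test time only the inner product w^T x + b is evaluated per window, which is what makes linear SVMs attractive for dense sliding-window scanning.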
The work of Zhu, Yeh, Cheng and Avidan [183] greatly improves the computational efficiency of the HOG descriptors by exploiting summed-area tables to build integral histograms [125] that allow constant time calculation of arbitrarily sized HOG features, and combines them in a cascade classifier in a manner similar to [170, 171]. Since then, other authors have also used variable sized HOG blocks [142, 177] with promising results. However, the survey by Enzweiler and Gavrila [49] does not include this more efficient HOG cascade implementation in their time constrained performance evaluation benchmark.
Non-linear SVMs can also be used and can provide an improvement in performance over linear SVMs [5, 108, 112, 114, 122, 159], but the additional computational cost can make it difficult to use for classifiers intended for real-time applications.
Another approach is to employ 2-stage methods that use a fast region of interest detection algorithm to identify candidate locations for processing with a more computationally expensive algorithm. CCTV surveillance applications, or other applications that have a fixed camera with a static background, can employ methods such as background subtraction to focus only on areas of the image that have been identified as foreground. The works of [114, 152, 180] use background subtraction to identify regions of the image that can be focused on with more detailed algorithms. Search space constraints or scene knowledge can also be exploited to more efficiently identify regions of interest [50, 66, 96, 141, 180], while other works attempt to identify regions of high information content [3, 96, 97, 102, 137] to find candidate locations.
A hybrid approach, somewhere between sliding window and ROI/classification methods, is to train a cascade [104, 118, 142, 166, 171, 175, 177, 183], where the first stages have a high precision but use only a small number of features to quickly reject large portions of the image, and later stages use more features to make a more detailed decision. This method has seen great success in face detection [170] at real-time speeds, and is exploited as a region of interest detector for the segmentation algorithm proposed in Chapter 4.
A cascade is trained using the AdaBoost algorithm [170], which attempts to find a subset of possible features at each level of the cascade that meets a specified accuracy, typically a 95% true positive rate and a 50% false positive rate. Each level is trained to correct errors made by the previous level, and the cascade gradually becomes more complex. The key advantage of this approach is that a great deal of the image can be rejected as negative with only a few features, making it suitable for real-time applications.
Gavrila [65] uses a chamfer template matching exemplar based approach and constructs a hierarchical template tree using human shape exemplars and the chamfer distance between them. Similar shape templates are recursively clustered together, selecting at each node a single cluster prototype along with a chamfer similarity threshold calculated from all the templates within that cluster. Multiple branches can be explored if edges from a query image are considered similar to the cluster exemplars of more than one branch in the tree.
2.6.3 Chamfer Matching
The chamfer matching algorithm is inherently fast due to its simplicity, which makes it an attractive method for use in real-time applications, but providing the optimal number of exemplar edge templates can be problematic, since they must be learnt from segmented images to ensure that only object edges are considered by the exemplars. This section gives an overview of the basic matching algorithm used to match an exemplar template such as one from a node in the tree hierarchy constructed by Gavrila [65].
Chamfer matching [11, 20] is a technique for object detection that finds locations (ideally a single location) in a query image that closely match a binary edge template created from an instance of an object being searched for.
Object templates are created off-line from edges extracted from a representative object image using an edge detection technique such as the Canny edge detector [32]. Each template is a set of edge point coordinates from a binary edge map:
O = \{(o_{x_i}, o_{y_i})\}_{i=1}^{n_O} = \{o_i\}_{i=1}^{n_O} \qquad (2.6.3)
Where o_i ∈ R² is a 2D coordinate of an edge point extracted from the original exemplar object image, and n_O is the total number of edge points extracted from the image.
At run-time, edges are extracted from a query image to create a set of binary edge coordinates:
Figure 2.13: Distance feature calculation. Left: A query image. Middle: edges extracted for 2 orientations; each colour represents a different channel, and white represents an overlap between adjacent orientation channels. Right: Truncated distance transforms for each orientation channel.
A = \{(a_{x_i}, a_{y_i})\}_{i=1}^{n_A} = \{a_i\}_{i=1}^{n_A} \qquad (2.6.4)
Where n_A is the total number of edge points. Using the set of points in A, a distance transform D(·) is calculated so that for any query point p and set of edge points A from a query image:
D_A(p) = \min_{q \in A} \lVert p - q \rVert \qquad (2.6.5)
gives the distance to the nearest edge in A from that point. Efficient methods exist to calculate distance transforms [20, 54], making the algorithm ideal for real-time applications.
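One such efficient method, the classic two-pass 3-4 chamfer approximation of D_A (the kind of algorithm referenced in [20]), can be sketched as follows; the rescaling by 3 so that a unit step costs roughly 1 is a presentational choice:

```python
import numpy as np

def chamfer_dt(edges):
    """Two-pass 3-4 chamfer approximation of the Euclidean distance
    transform D_A: distance from each pixel to the nearest edge pixel.

    edges: boolean 2D array, True at edge pixels."""
    INF = 10 ** 6
    h, w = edges.shape
    d = np.where(edges, 0, INF).astype(np.int64)
    for y in range(h):                      # forward pass (top-left down)
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x - 1] + 4)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y - 1, x + 1] + 4)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 3)
    for y in range(h - 1, -1, -1):          # backward pass (bottom-right up)
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y + 1, x - 1] + 4)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x + 1] + 4)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 3)
    return d / 3.0                          # unit step costs ~1
```

Two raster sweeps suffice because the 3/4 costs approximate the 1/√2 ratio of axial to diagonal steps, giving near-Euclidean distances in linear time.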
The points from O are used to sample the distance transform at a given position using a chamfer cost function. Different cost functions can be used such as the average sum of squared differences [20, 154]. This cost function is evaluated at every position in the image, where the minima indicate the best matches.
The chamfer score C of an object template O at a given location in the distance transformed image DA with the same dimensions as the object template is given by [20, 154]:
C(D_A, O) = \frac{1}{|O|} \sum_{p \in O} D_A(p)^2 \qquad (2.6.6)
A truncated chamfer cost function can improve matching stability [154] in images where the object is partially occluded:
C(D_A, O, \tau_d) = \frac{1}{|O|} \sum_{p \in O} \min(D_A(p)^2, \tau_d) \qquad (2.6.7)
The threshold τ_d gives an upper limit to values in the distance transform where edges might be missing, so that squared distances beyond τ_d all take the same value.
A threshold τ_c is typically used to accept partial matches due to incomplete edge information. A cost value of zero indicates a perfect match (i.e. the average edge distance over all points on the template is zero); however, a low cost can also be caused by a highly textured image area creating a spurious local minimum.
To alleviate this, oriented edge information can also be used to improve matching performance [154]. In this case, template edges are split into a set of edge points for each orientation channel, O = \{O^\theta\}_{\theta=1}^{|\Theta|}, and a set of distance transforms D_A = \{D_A^\theta\}_{\theta=1}^{|\Theta|} is created, one for each edge orientation to be considered from the image edges. See figure 2.13 for an illustration of an oriented chamfer distance transform. The oriented chamfer cost function is:
C_\Theta(D_A, O, \tau_d) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} C(D_A^\theta, O^\theta, \tau_d) \qquad (2.6.8)
The oriented chamfer cost is the average chamfer cost over each of the edge orien- tation channels.
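As a sketch of the chamfer costs of equations (2.6.6) and (2.6.7), assuming the template is stored as an array of integer (x, y) edge coordinates and the distance transform has already been computed:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_cost(dist_map, template_points, tau_d=None):
    """(Truncated) chamfer cost: average (optionally truncated) squared
    distance-transform value sampled at the template's edge points."""
    d = dist_map[template_points[:, 1], template_points[:, 0]] ** 2
    if tau_d is not None:
        d = np.minimum(d, tau_d)
    return d.mean()

# Toy query image whose edges exactly contain the template.
edges = np.zeros((8, 8), dtype=bool)
edges[3, 2:6] = True                      # a horizontal edge segment
dist = distance_transform_edt(~edges)

template = np.array([[2, 3], [3, 3], [4, 3], [5, 3]])  # (x, y) points
print(chamfer_cost(dist, template))       # 0.0: every point lies on an edge
```

The oriented cost of equation (2.6.8) would simply average `chamfer_cost` over per-orientation distance maps and template subsets.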
Chapter 5 proposes a method that combines the chamfer matching algorithm with a linear SVM to automatically learn the templates, even in the presence of background information.
2.7 Pose Estimation
Pose estimation can be formulated as a multi-class classification problem, where each class represents a distinct pose and the closest matching pose exemplars can then be used to regress to a pose in continuous space; as a continuous problem, where the output pose is derived by varying the parameters of an internal model; or as a problem based on the detection results of many part-based classifiers. Whichever approach is used, all are concerned with extracting sufficient information from an image to determine an output pose from the data. The inferred pose can be either in 2D or in 3D.
There are various surveys that consider human motion analysis [4, 63, 106, 107, 124, 150], in which pose estimation is usually a component. Motion analysis methods that are interested in action recognition do not necessarily need to determine an explicit pose configuration to analyse motion, as this can be done indirectly using other information, such as tracking the motion of detections over time or analysing the evolution of low-level image features over time; such methods are out of the scope of this discussion. Similarly, methods that use multiple cameras or additional hardware such as depth cameras, and those that are still computationally too costly for real-time applications, are also not considered.
Many of the existing surveys each define their own categories to group the different approaches, but as observed in the survey by Poppe [124] they generally fall into two categories:
Model-free / Discriminative methods use discriminative approaches to determine pose directly from feature representations.
Model-based / Generative methods maintain an internal appearance likelihood model that iteratively tries to determine pose by varying the model parameters.
These discriminative and generative approaches to human pose estimation have similar advantages and disadvantages to their equivalent approaches in human detection. Discriminative approaches tend to be faster, but can be limited to dealing only with poses that they have been trained with, so large amounts of training data can be required to sufficiently cover the variation in pose. Part-based discriminative methods alleviate the training data problem somewhat, but at the cost of more complex computation when estimating pose, due to evaluating classifiers for individual parts. Generative methods that maintain an internal human model usually find the pose through an iterative solution, making them less suited for real-time applications, but have the advantage that they can deal with poses not seen during training.
2.7.1 Generative
In generative models, various different approaches have been explored to model the shape of a human, ranging from simplistic models using hierarchies of simple 2D shapes to very detailed 3D models of shape and physical deformation.
2D models such as the ‘Cardboard human’ model proposed by Ju et al. [81] use a kinematic hierarchy of planar patches to model human shape. A similar approach
by Morris and Rehg [111] uses a scaled prismatic kinematic 2D human body model for human body registration, where the parts are modelled using textured shapes to find correspondences. Howe et al. [75] use a method similar to Morris and Rehg [111] and infer 3D pose by tracking parts in 2D, then using a prior model of 3D motion learnt from motion capture data in a Bayesian framework that models short-duration motions to reconstruct 3D pose. Huang and Huang [77] extend the model of Morris and Rehg [111] with an extra dimension in the DOF of each part to describe width change, but do not attempt to recover 3D pose.
Models that are defined in 3D are generally constructed from a hierarchy of simple primitives. The hierarchy defines joint locations and relative positions (e.g. the head is attached to the torso), and geometric primitives are defined relative to their associated joints. These primitives can be spheres [120], cylinder based models [133, 148], or models defined using tapered super-quadrics [64, 85], which [155] also uses to model the hand.
Rohr [133] focuses on the problem of pedestrians walking in a plane relative to the camera. First a 3D model of cylinders is constructed to represent the human shape, then background subtraction is applied to locate a region of interest. Within the region of interest, edge pixel locations are detected, which are then linked together, and lines are fit to the linked edge pixels. Cylinder edges from the model are projected into the window around the ROI and model lines are compared against detected lines.
Sidenbladh et al. [148] also use a 3D cylinder based generative model of image appearance that extends the idea of parameterised optical flow estimation to 3D articulated figures. Depth ambiguities, occlusion, and ambiguous image information result in a multi-modal posterior probability over the pose of the model. They employ a particle filtering approach to track multiple hypotheses in parallel and use prior probability distributions over the dynamics of the human body to constrain motions to valid poses.
Pose ambiguities with single camera methods can be resolved by combining information from multiple cameras. Tapered super-quadrics are exploited by Gavrila and Davis [64] to construct a 3D human shape model. The parameters for parts are determined during a calibration step by varying parameters and assessing similarity with chamfer matching using the silhouette of the shape while a subject remains in a known pose. Initialisation is done using background subtraction to find the region of interest, followed by PCA on the foreground pixels to discover the main axes of variation to initialise the torso parameters. Using this initialisation, a search is done to find the best fitting head/torso configuration, then afterwards limb parameters are found in a similar manner.
Bottino and Laurentini [22] use multiple cameras to calculate a 3D reconstruction of the human shape using volume intersection; motion data is then acquired by fitting a human shape model to the reconstructed 3D volume. Later, Kehl and Gool [85] also use 3D reconstruction techniques in combination with a model built using super-ellipsoids, and fit the model to the images over several video streams. To fit the model, multiple image cues are used, consisting of edges, colour information and a volumetric reconstruction. The use of 3D information helps reduce ambiguity, and the model can in turn be used to refine the quality of the reconstruction.
Alternatively, the shape model can be defined as a single polygonal mesh. Anguelov et al. [8] build a surface mesh shape model constructed using training examples from laser scan data over many subjects, then find correspondences between the models. Dimensions of shape variation (e.g. male and female, height, body size) are learnt using PCA on the model data, and realistic surface deformations are learnt using linear regressions per triangle. The method is computationally costly but very expressive. Later, Sigal et al. [149] combined this model with a discriminative initialisation in another multi-view approach, followed by a refinement step to determine a more accurate pose.
Barron and Kakadiaris [10] propose a semi-automatic method of simultaneously estimating a human's anthropometric measurements (physical limb lengths and their proportions) and pose from an uncalibrated image. Landmarks are placed by the user indicating the positions of the main joints; a set of plausible limb length estimates is then produced using a priori statistical information about the human body, and plausible poses are inferred using geometric joint limit constraints.
Some works attempt to determine the shape and articulation automatically. Kakadiaris and Metaxas [82] use data from multiple views and apply a spatio-temporal analysis of the deforming contour of a human moving through a series of predetermined movements to determine the shape and articulation of the model (i.e. it is not defined a priori). When the movements are unknown, however, this method cannot be applied.
The majority of these generative methods, however, tend to be too computationally intensive for real-time applications, due to the extra cost of iteratively evaluating the model and updating its parameters to fit the image data. Multiple cameras can reduce ambiguity, but such methods are not related to this thesis and are mentioned only to give context for the types of approaches used in some of the model-based literature.
2.7.2 Discriminative
Discriminative methods can be further broken down into two categories [124]: learning based, where the problem is to find a mapping directly from feature space to pose space (though due to ambiguities in pose, a set or mixture of mappings tends to be learnt to handle the multi-modality [1, 69, 134, 151]); and example based, where the problem is formulated as finding the nearest exemplar(s) out of a set of possible exemplars that most closely match the feature, and returning the associated pose.
Various methods analyse silhouettes of humans to determine pose. Silhouettes from multiple views are used by Grauman et al. [69] in a probabilistic shape and structure model. A prior density over the multi-view shape and corresponding structure is constructed with a mixture of probabilistic principal components analysers that locally model clusters of data in the input space with probabilistic linear manifolds, and pose is determined by finding the maximum a posteriori estimate. Instead of clustering in input space, [134] cluster in 2D pose space and learn mappings from features to pose for each cluster.
Agarwal and Triggs [1] and Sminchisescu et al. [151] also use a mixture of experts approach. Agarwal and Triggs [1] do this by first clustering silhouettes in a lower dimensional input space found using PCA, to which a mixture of regressors is fit. This approach helps model ambiguities by following multiple hypotheses, but requires a clean silhouette. Sminchisescu et al. [151] use local appearance and shape contexts as a feature space and show promising results on complex monocular human actions.
Elgammal and Lee [47] recover pose from a monocular silhouette, but instead learn view-based activity manifolds and mapping functions between the manifolds and both silhouette and 3D body pose. Pose is estimated by projecting the silhouette to the learned activity manifold, finding the point on the learned manifold representation corresponding to the silhouette, and then applying interpolation over probable 3D poses.
Example (or exemplar) based methods such as Mori and Malik [110] store a number of exemplar 2D views of the human body over different configurations and viewpoints with corresponding manually labelled locations of body joints. Pose estimation is done by matching the input image using shape context matching with a kinematic chain based deformation model and the corresponding pose is used to reconstruct the 3D pose of the human.
Exemplar based approaches have been very successful in pose recognition. However, in applications involving a wide range of viewpoints and poses, a large number of exemplars would be required, and as a result the computational time to recognise individual poses would be very high. One approach, based on efficient nearest neighbour search using histogram of gradient features, addressed the problem of quick retrieval in a large set of exemplars by using Parameter Sensitive Hashing (PSH) [140], a variant of the original Locality Sensitive Hashing (LSH) algorithm [42]. Once a set of nearest neighbours has been found, the final pose estimate is produced by locally-weighted regression, using the set of neighbours to dynamically build a model of the neighbourhood and infer pose. PSH is also applied in the work by [129], using Haar-like wavelet features based on Viola and Jones [170] from multi-view silhouettes obtained using three cameras; a motion graph is then used to find poses close in both input and pose space.
The method of Agarwal and Triggs [2] is exemplar based. They use kernel based regression, but do not perform a nearest neighbour search for exemplars, instead using a sparse subset of the exemplars learnt by the Relevance Vector Machine (RVM). Their method has the disadvantages that it is silhouette based and that it cannot model ambiguity in pose, as the regression is uni-modal.
Gavrila [65] presents a probabilistic approach to hierarchical, exemplar-based shape matching. This method achieves a very good detection rate and real-time performance, but does not regress to a pose estimate. Similar in spirit, Stenger [154] proposed a hierarchical Bayesian filter for real-time articulated hand tracking, but clusters shapes in pose space to construct the tree hierarchy. Toyama and Blake [164] use dynamics in combination with an exemplar-based approach for tracking pedestrians.
Everingham and Zisserman [51] utilise a similar hierarchical chamfer tree approach to [65, 154], constructing a chamfer template tree to reduce the pose search space, then using the estimated pose from the tree classifier to initialise a generative model to find a more accurate pose.
For applications such as pose estimation where there are many object templates, it is inefficient to find the chamfer score for each object at every point in an image, so more efficient methods must be used. Figure 2.14 shows templates for several actions.
Pose detection using chamfer matching can be achieved by clustering poses obtained from a varied action database into a hierarchical structure to quickly explore the example space in a coarse to fine manner, such as the methods proposed by [51, 65, 154]. Okada and Stenger [117] present a method for marker-less human motion capture from a single camera, using tree-based chamfer filtering to efficiently propagate a probability distribution over poses of a 3D body model. Dynamics are used to handle self-occlusion by increasing the variance of occluded body parts, allowing for recovery
Figure 2.14: Pose templates for several different actions.
when the body part reappears.
2.7.3 Combined Detection and Pose Estimation
In contrast to two stage methods, relatively few works attempt to combine localisation and pose estimation into a single discriminative model. Dimitrijevic et al. [44] present a template-based pose detector and solve the problem of the dependency on huge training datasets by detecting only human silhouettes in characteristic postures (sideways opened-leg walking postures in this case). They extended this work in [57] by inferring 3D poses between consecutive detections using motion models. This work gave some very interesting results with moving cameras; however, it seems difficult to generalise to actions that do not exhibit a characteristic posture.
The pose estimation and detection work of Okada and Soatto [116] learns k kernel SVMs to discriminate between k pre-defined pose clusters, and then learns linear regressors from feature to pose space. They extend this method to localisation by adding an additional cluster that contains only images of background.
The problem of jointly tackling human detection and pose estimation at the same time is discussed in more detail in the work presented in Chapter 6.
CHAPTER 3
Fast Background Subtraction
For gaming interfaces, a camera is typically placed on top of a television or monitor and the video stream is displayed on the screen so that the user can see what the camera is viewing.
In most applications, the video image is mirrored on the y axis so that the video stream acts as if the television display is a mirror placed in front of the user. This is the method of display employed by many of the Sony EyeToy games.
People are naturally used to interacting with their own reflection due to the abundance of mirrors in everyday life, and so co-ordinating movement with what a user sees on screen is very intuitive and can be learnt quickly. The success of the EyeToy franchise is a testament to how well this method of interaction has been received, and it is this method that is used for the algorithms described in this chapter.
User interface components are displayed as an overlay on top of the video stream. To interact with a component, the user must move a part of their body over the area covered by that control, and some kind of feedback (a sound effect or visual cue) is triggered to indicate that the interaction was successful. Game objects are displayed and interacted with in a similar manner.
For simple interactions such as those required by basic user interface controls, it is reasonable to tackle the problem using methods that are computationally efficient and extract only the minimum amount of required information when processing the video frame. A strong motivation behind this low-level processing approach is to take as little processing time as possible when the game engine updates each of its subsystems between frames (§2.3.1), to ensure that other media subsystems, such as music, sound and graphical effects, have sufficient computing time remaining to be updated too.
This chapter takes a look at a common algorithm used in many computer vision based games, proposes a new local descriptor, and looks at how other algorithms might
be exploited to provide more advanced methods of interaction.
3.1 Problem: Where is the player?
The general question that needs answering for a computer vision based interface is: "Where in the image is the player?"
An answer to this question is required at least on a very basic level to determine how the user is attempting to interact with the components or game objects being superimposed on the video frame.
This can be determined either directly using shape detection [65, 170], background subtraction [123], and segmentation methods [27, 92], or indirectly by detecting motion using sparse optical flow methods [9, 143] and simple image differencing.
The following section describes an approach of detecting inter-frame differences to drive user interface controls and interact with game objects.
3.2 Image Differencing
The simplest method employed in many computer vision based games (for example the EyeToy Play series and EyeToy Kinetic) is that of image differencing between two consecutive video frames and thresholding the result to create a binary map of differences.
D_t(x, y) = \begin{cases} 1 & \text{if } |I_t(x, y) - I_{t-1}(x, y)| > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (3.2.1)
where D_t is the binary mask of differences between the current frame I_t at time t and the previous frame I_{t−1}, and τ is the difference threshold. The threshold τ is set to a value just above the ambient noise level of the camera, but not so high that true image differences are missed. See figure 3.1 for an illustration.
By finding the difference between consecutive frames, a model of the background image is not required and does not need to be updated to account for illumination changes. However, if the user stands completely still or moves very slowly so that the pixel-wise differences between frames are very small, then movement can be missed.
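Equation (3.2.1) amounts to a thresholded absolute difference between frames. A minimal sketch with NumPy (the threshold value of 20 is illustrative, not from the thesis):

```python
import numpy as np

def difference_mask(frame, prev_frame, tau=20):
    """Binary map of inter-frame differences (eq. 3.2.1).

    Frames are greyscale uint8 arrays; tau should sit just above the
    camera's ambient noise level."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > tau).astype(np.uint8)

prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1, 1] = 200          # simulated movement at one pixel
D = difference_mask(curr, prev)
print(D.sum())            # 1: a single changed pixel
```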
Interacting with a game using this method is simply a case of having game objects collide with areas of the screen that have any non-zero values in the difference map,
D_t. The user just has to make sure they wave their hand in the appropriate area of
Figure 3.1: Image differencing algorithm. For user interface controls, differences are accumulated over several frames under the control to place the button in an active state.
the screen, or just create a large amount of movement in that area and hence produce image differences.
For user interface controls, using the image differences for the current frame D_t means that other components may be accidentally triggered as the user reaches for the appropriate control. This can be addressed by accumulating the differences under the region R covered by the user interface component for several frames, and activating the control only when the accumulated differences have reached a certain threshold.
D_t(R) = \sum_{(x, y) \in R} D_t(x, y) \qquad (3.2.2)
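A sketch of how equation (3.2.2) might drive a motion button, assuming a simple leaky accumulator; the `MotionButton` class and its `activation` and `decay` parameters are hypothetical illustrations, not taken from the thesis:

```python
import numpy as np

class MotionButton:
    """Accumulate the difference mask under the button's region R over
    several frames and activate once the running total crosses a
    threshold; a small per-frame decay lets old motion fade out."""
    def __init__(self, region, activation=20, decay=1):
        self.region = region          # (x0, y0, x1, y1)
        self.activation = activation
        self.decay = decay
        self.total = 0

    def update(self, diff_mask):
        x0, y0, x1, y1 = self.region
        self.total += int(diff_mask[y0:y1, x0:x1].sum())
        self.total = max(0, self.total - self.decay)
        return self.total >= self.activation

button = MotionButton(region=(0, 0, 4, 4), activation=20)
mask = np.ones((8, 8), dtype=np.uint8)    # motion everywhere
print(button.update(mask))                # False: one frame is not enough
print(button.update(mask))                # True after a second frame
```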
This motion button approach relies heavily on the user moving around and does not offer any persistent information across frames as to the true state of pixels belonging to the user. It also does not discriminate between sudden illumination changes and the user's movements, or the shadows cast by the user as they move around within the video frame.
If a user interface control can reliably detect which regions of the video image the user occupies, then more varied interactions can be accommodated. As an example, consider for the moment a musical keyboard interface. With the existing image difference based controls it is difficult to activate the keys momentarily, as the button would have to reliably detect two spikes of differences: one for the hand entering the key region and another for when the hand is removed. A toggle button could be possible, but relies on the control being able to successfully detect both activation and deactivation cues. If a control could determine which pixels were different from some
pre-calibrated reference model, then the button can remain active so long as the number of different pixels is above a given activation threshold, allowing for more sensitive user interface controls.
To solve this, a fast and robust background subtraction algorithm is required. The next section proposes an approach that extends the local binary pattern (LBP) algorithm to use a 3-label neighbourhood coding scheme in which each label is coded by 2 bits, and presents some qualitative results of the method.
3.3 Motion Button Limitations
Motion buttons have been used for computer vision user interface components in a large majority of the existing EyeToy games released by Sony. The idea is simple and effective. Absolute pixel differences between two consecutive frames that are greater than a certain threshold are counted over the region of the frame that the button covers.
Motion buttons become activated when they have accumulated movement over several frames, so that the button is not accidentally activated by the user or by something considered to be in the background. When there is no movement within the button region the button reverts to an inactive state.
However, because motion buttons have to accumulate movement over time before they activate, they have a couple of limitations. Firstly, they cannot be used in interfaces that require a user to keep a button activated, since if a user keeps their hand stationary, no motion is accumulated and the button deactivates. Secondly, the time it takes to accumulate enough movement to activate the button means that expecting the user to activate the button quickly for time-critical applications becomes impractical.
Performing this image differencing instead with a reference image would be one approach to solving this, but subtle illumination changes can affect the difference map and it is difficult to select a threshold that can distinguish true differences from global brightness changes. See figure 3.2 for an illustration of the problem.
It is these limitations that the algorithm used for the persistent button control is intended to address.
3.4 Persistent Buttons
The design goal of the persistent button is to be responsive to fast user interactions, and be able to maintain an active state when the user intends to keep the button activated.
Figure 3.2: Diagram illustrating one of the problems caused by using simple image differencing on a background image. The circled red regions appear only slightly different due to a global illumination change caused by the automatic camera gain, but can cause differences to register in the difference map if the difference threshold is not updated to reflect this.
An example application for this type of component could be a small virtual keyboard, where the user can play sustained notes by keeping their hand within a button's area, or activate them quickly by passing their hand through them briefly.
This level of responsiveness comes with some limitations. Calibration is required to build a model of the region underneath the button, and this means that if the camera is moved for some reason, the button will require recalibration.
This section describes the algorithm, presents some qualitative results and compares the algorithm to two other algorithms: Normalised Cross Correlation (NCC) [100] and Colour Normalised Cross Correlation (CNCC) [71]. See section 2.4.2 for the details of these two algorithms.
3.5 Algorithm Overview
In the following sections, a ‘label’ is defined as a 2-bit binary value, and a ‘code’ is an ordered sequence of concatenated labels c = (l_1, l_2, ..., l_n). The proposed algorithm extends the idea of Local Binary Patterns (LBP) [115]. In the local binary pattern algorithm, the intensity value of each pixel p in an image is compared to the values of the pixels in its local neighbourhood N_p, centred on its position, and a binary code is constructed. Each bit in the binary code represents the result of a comparison between a pixel p_n ∈ N_p from that neighbourhood and the centre pixel p. The value stored in that bit is 0 or 1 to represent the polarity of the comparison: 0 if
Figure 3.3: Illustration of LBP code construction. The centre pixel is compared to each of its neighbours, which are assigned a value of 1 if equal to or greater than its value, or 0 otherwise. The code is then constructed by assigning each neighbourhood pixel to a corresponding bit in a binary code.
(p_n < p), and 1 if (p_n ≥ p). Figure 3.3 illustrates how a simple binary code is generated from intensity values in a small 3x3 neighbourhood.
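The 3x3 LBP construction of figure 3.3 can be sketched as follows (the neighbour ordering here is an arbitrary choice; any fixed bit assignment works):

```python
import numpy as np

def lbp_code(img, x, y):
    """Standard 3x3 LBP code for pixel (x, y): each neighbour contributes
    one bit, set when the neighbour is >= the centre pixel."""
    centre = img[y, x]
    # Clockwise neighbour offsets starting at the top-left.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[y + dy, x + dx] >= centre:
            code |= 1 << bit
    return code

img = np.array([[10, 20, 10],
                [30, 15, 5],
                [10, 20, 10]], dtype=np.uint8)
print(lbp_code(img, 1, 1))  # 162: bits set for the three neighbours >= 15
```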
The proposed algorithm extends this idea of Local Binary Patterns (LBP) [115] so that instead of the pixels in N_p taking one of 2 labels, they can take one of 3 possible labels. This is in the same spirit as the Local Ternary Patterns (LTP) proposed by [160], except that in the proposed LBP3 coding scheme:
1. The labels are given 2-bit values that are not split into upper and lower channels as in [160] and are instead concatenated to form a single code.
2. The code map is generated after pre-processing the image using a Gaussian blur, so that comparisons are done between the sum over weighted regions instead of single pixel intensities to add more spatial support to the comparisons. This is similar in spirit to the way filters are used to compute weighted sums in the DAISY descriptor construction [162] and Geometric Blurring [15].
3. To achieve temporal stability for use with video streams, temporal hysteresis is applied to the code labels so that ambient camera sensor noise does not cause labels to oscillate between two states over time.
4. The labels are coded in such a way that taking the Hamming distance between two codes will yield the distance in label space between them (see section 3.8.2 for details).
The proposed algorithm extension is referred to as LBP3 in the following sections, to differentiate it from the Local Ternary Pattern work of Tan and Triggs [160].
The algorithm takes a frame as input, sub-samples and converts the image into an intensity image (greyscale image), then builds a code for each pixel describing its intensity relative to a set of neighbouring pixels. The current frame's code map is compared to a reference code map stored during calibration, and differences in codes are summed
up to give a measure of how different the current frame is. If the number of differences is above a certain threshold (minus the ambient distance learned during calibration), then the button becomes active. The main image processing steps are highlighted in Algorithm 1.

Algorithm 1: LBP3 Image Processing Steps
1. Sub-sample the image;
2. Convert the image to an intensity image;
3. Blur the intensity image;
4. Build the code map.
3.5.1 Sub-sampling
The image frame I is sub-sampled to reduce the number of pixels that the algorithm has to deal with. This step is optional and the sub-sampling scale factor used in the current implementation is 0.5. This resamples the source image so that the final image used by the algorithm after pre-processing is half the resolution of the original image.
Though this decreases computational cost by reducing the number of pixels processed, it also limits the effective minimum size that a control can be, since less information is available to decide whether or not the button should be in an active state. When a control covers a small area of the image, sub-sampling should not be used.
3.5.2 Intensity Image
By reducing the colour image to just its intensity values, the amount of data the algorithm needs to consider is reduced. This is helpful for the next stage (blurring the image to reduce the effects of noise), as the filtering need only be applied to a single channel rather than to each of the individual RGB channels.
3.5.3 Blur Filtering
By applying a Gaussian filter to the image, the pixel-wise comparisons used in the binary coding scheme (discussed in the next section) are between weighted sums over a small neighbourhood around each of the pixels being compared. Though this does not remove the effects of sensor noise (see section 3.7) on the pixel values, it is an efficient method of adding more spatial support to the comparisons. This is similar to
Figure 3.4: The source image can be sub-sampled by a factor of 0.5 to an image of half the dimensions. To calculate the pixel value for each coordinate (x_s, y_s) in the sub-sampled image, the position is projected back into the original image, and four pixels are averaged together.
the way the DAISY descriptor exploits filtering to pre-calculate weighted sums for its descriptor construction [162].
3.5.4 Code Map
Once the image has been pre-processed, a binary code is generated for each pixel. This code is built from the concatenated labels given to a set of neighbouring pixels N_p, which are labelled according to their intensity relative to the current pixel p being considered. Each label is given a 2-bit binary code, and these codes are concatenated to form the final code for that pixel. This is discussed in more detail in section 3.8.2.
3.6 Algorithm Details
3.6.1 Sub-sampling and Converting to an Intensity Image
Figure 3.4 shows the sub-sampling process. The image I is both sub-sampled and converted from RGB to a greyscale representation in the same pass over the current frame. A simple component average is used to convert the RGB value to greyscale, as shown in the equation below.
f_G(p) = \frac{1}{3}(p_r + p_g + p_b) \qquad (3.6.1)
The original image I is then transformed using this function to give a greyscale image, IG = fG(I). Since the sub-sampling scale factor used is 0.5, the image can be sub-sampled using the average of the current pixel and the pixels to the right, below and below-right and stepping over the source image two pixels at a time.
I_S(x_s, y_s) = \frac{1}{4} \sum_{(u,v) \in N_o} I_G(2x_s + u, 2y_s + v) \qquad (3.6.2)
where x_s ∈ [0, w/2) and y_s ∈ [0, h/2), w and h are the original image width and height respectively, and N_o = {(0, 0), (0, 1), (1, 0), (1, 1)} are the neighbourhood offsets to average over.
The sub-sampling pass can be combined with the greyscale conversion by simply converting the RGB values to greyscale during the sub-sampling pass. The combined sub-sampling transform can then be written as:
I_S(x_s, y_s) = f_G\left( \frac{1}{4} \sum_{(u,v) \in N_o} I(2x_s + u, 2y_s + v) \right) \qquad (3.6.3)
where f_G(·) converts an RGB value to greyscale as defined in equation 3.6.1. Since the divisor is a power of two, the division can be performed as a bit shift, removing the need for an integer divide.
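The combined pass can be sketched in pure Python (an illustrative implementation, not the thesis code; here the greyscale conversion is applied before the four-pixel average, which with integer arithmetic agrees with equation 3.6.3 up to rounding):

```python
def subsample_grey(image):
    """Half-resolution greyscale conversion in one pass over the frame.

    `image` is a list of rows of (r, g, b) tuples with even dimensions.
    The greyscale conversion averages the three channels (equation 3.6.1);
    the four-pixel neighbourhood average uses a right shift by 2, since
    the divisor 4 is a power of two.
    """
    h, w = len(image), len(image[0])
    out = []
    for ys in range(h // 2):
        row = []
        for xs in range(w // 2):
            # Sum the greyscale values over the 2x2 neighbourhood offsets.
            total = 0
            for v in (0, 1):
                for u in (0, 1):
                    r, g, b = image[2 * ys + v][2 * xs + u]
                    total += (r + g + b) // 3      # integer component average
            row.append(total >> 2)                 # divide by 4 as a bit shift
        out.append(row)
    return out
```

For a 640x480 input this produces a 320x240 single-channel image, halving each dimension as described in the text.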
3.6.2 Blur Filtering
To help reduce the ambient pixel noise from the video camera source, the intensity image is smoothed using a blur filter. This is a Gaussian kernel separated to be a two pass 1D operation. The kernel is discretised and scaled so that the values are an integer vector g = {gi} and sum to a power-of-two so that normalisation can be done by a bit shift instead of an integer divide. See figure 3.5 for an illustration of the process.
First the image is convolved with the 1D Gaussian kernel g and stored in a buffer Ibuffer whose dimensions are transposed relative to IG, so that the result of convolving at IG(x, y) is saved in the buffer at location Ibuffer(y, x). Finally Ibuffer is convolved with g again, and the result is stored in IG.
Convolving this way means that IG is convolved along x twice, but because the buffer is transposed the second pass, with (x, y) = (y, x), is actually convolving vertically.
Figure 3.5: Filtering process used to blur the image before generating codes in the code map.
I_{buffer} = (I_G \ast g)^T
I_G = (I_{buffer} \ast g)^T \qquad (3.6.4)
This method can be more efficient on architectures that use memory caching. By transposing the result of the horizontal pass, the vertical pass can read values from the image that are adjacent in memory, rather than being w bytes apart between kernel components g_{i-1} and g_i (assuming a pixel size of one byte in this example). It also means that both passes reduce, at each pixel location, to an integer dot product over contiguous memory followed by a bit shift.
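A minimal sketch of the two-pass scheme follows (illustrative, not the original implementation; the kernel is assumed to be integer-valued and to sum to 2**shift, and edge pixels are clamped):

```python
def blur_row(row, kernel, shift):
    """Convolve one row with an integer kernel; edges are clamped.
    Normalisation is a right shift because sum(kernel) == 2**shift."""
    n, pad = len(row), len(kernel) // 2
    out = []
    for x in range(n):
        acc = 0
        for i, g in enumerate(kernel):
            acc += g * row[min(max(x + i - pad, 0), n - 1)]
        out.append(acc >> shift)
    return out

def blur_separable(img, kernel, shift):
    """Two horizontal passes with a transposed buffer (equation 3.6.4):
    the second pass reads the transposed result of the first, so it is
    effectively vertical while keeping all reads contiguous in memory."""
    buffer = [blur_row(r, kernel, shift) for r in img]
    buffer = [list(col) for col in zip(*buffer)]       # transpose
    result = [blur_row(r, kernel, shift) for r in buffer]
    return [list(col) for col in zip(*result)]         # transpose back
```

With the kernel [1, 2, 1] and shift = 2 (since 1 + 2 + 1 = 4 = 2^2) this reproduces the bit-shift normalisation described above.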
3.7 Noise Model
The input to the algorithm is an image from a web camera video stream. However, due to sensor noise and other factors, the pixel values in this image can vary over time even in static scenes where nothing is moving. A small amount of sensor noise is almost always present due to thermal effects on the sensor device. Automatic gain can alter the relative intensities of the colour channels in the image over time and at different scales, and in the process alter the respective noise levels in the image, so it is important to consider these effects.
Let Iraw(x) be a matrix of image pixels on the sensor Bayer pattern, where coordinate x maps to a pixel intensity value for either R, G or B depending on its position, such as in the pattern shown in figure 2.9 in chapter 2.
The noise model assumes that the actual value of the pixel Iraw(x) is corrupted by an additive sensor noise function η(x) ∼ N (0, σx) and is transformed by a gain function
GΘ which is governed by a number of gain parameters Θ in the camera. This can be written as:
I(x) = GΘ(Iraw(x) + η(x)) (3.7.1)
It is assumed that the sensor noise function η(x) is an additive Gaussian noise model with zero mean and standard deviation σx. The gain function GΘ amplifies the effect of the sensor noise, and may amplify the R, G or B sensor values by different amounts (i.e. the gain factor can be different for each colour sensor).
The noise value can be made constant by transforming I(x) by another function G_\Theta^{-1} (ideally the inverse of G_\Theta) to cancel out the effects of gain. This yields a constant noise function:
G_\Theta^{-1}(I(x)) \approx I_{raw}(x) + \eta(x) \qquad (3.7.2)
However, in practice the actual form of the gain function GΘ may not be available to reverse the effect of the gain. The algorithm assumes that the automatic gain parameters Θ can at least be fixed in some way, for example by turning off automatic gain, so that the noise remains constant during the operation of the algorithm. Making the noise constant is desirable for the labelling scheme described next.
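The noise model of equation 3.7.1 can be simulated in a few lines (a sketch under the simplifying assumption that the unknown gain function G_Θ is a scalar multiply; function names are illustrative):

```python
import random

def observed_pixel(i_raw, gain=1.0, sigma=2.0):
    """Simulate equation 3.7.1: the raw sensor value is corrupted by
    additive Gaussian noise eta ~ N(0, sigma), then passed through a
    gain function, modelled here as a simple scalar multiply."""
    eta = random.gauss(0.0, sigma)
    return gain * (i_raw + eta)

def undo_gain(value, gain):
    """Inverse transform of equation 3.7.2 for the scalar-gain case,
    recovering i_raw + eta so the noise level is constant again."""
    return value / gain
```

In practice the true G_Θ is unknown, which is exactly why the algorithm instead assumes the gain parameters can be fixed during operation.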
3.8 Code Map
The algorithm used to create a code map for the current frame is based on the Local Binary Pattern (LBP) algorithm [115] but extends the 2-label binary coding scheme to a 3-label scheme. The following section gives a brief overview of the LBP algorithm.
3.8.1 Local Binary Patterns
For each pixel p in an image I, neighbouring pixels are assigned a label based on their intensities relative to the intensity of the current pixel. Labels are either 0 or 1, and represent that a neighbourhood pixel pn is either less than (pn < p), or greater than or equal to (pn ≥ p), the intensity value of the current pixel p.
The code map representation is a description of the local neighbourhood intensity values around each pixel. The neighbourhood considered can be arbitrary, but the basic LBP descriptor considers the 8 adjacent pixels surrounding the current pixel (its 8-neighbourhood).
The labels are concatenated to form a binary code describing the pixel's neighbourhood. Refer back to figure 3.3 to see how a code is generated from a simple 3x3 area.
3.8.2 3 Label Local Binary Patterns (LBP3)
The LBP3 algorithm extends the original LBP algorithm by adding a third label to make the LBP code generation more robust to noise. One of 3 labels can be assigned, {less, similar, greater}. Let L(p, q) be a labelling function that assigns a label based on the intensity values of a reference pixel p and a neighbouring pixel q ∈ Np. Labels are assigned as follows:
L(p, q) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau \\ \text{similar} & \text{if } |q - p| \le \tau \\ \text{less} & \text{if } (q - p) < -\tau \end{cases} \qquad (3.8.1)
As with LTP [160], a neighbour pixel in the proposed LBP3 algorithm can take one of 3 labels: less, similar, greater. The similar label is assigned if the absolute difference between the neighbour pixel intensity and the current pixel intensity is within a specified threshold. This threshold is a parameter of the algorithm, and adjusting it can reduce the effect of noise at the cost of contrast sensitivity.
However, unlike the LTP algorithm, in LBP3 each label is associated with a 2-bit binary value: less = 01₂, similar = 11₂, greater = 10₂. These codes are chosen so that the bitwise difference between two labels expresses how far apart they are in label space. Let H(a, b) be the Hamming distance between two binary strings a and b; then H(less, similar) < H(less, greater) and H(similar, less) = H(similar, greater).
This 3-label scheme improves robustness to noise over uniform regions where the differences between pixels are almost identical but offset by a small amount of noise [160]. It is, however, less resistant to intensity scaling, where intensity changes by a constant factor, but it is still more robust than taking simple image intensity differences between frames as a difference measure. See figure 3.6 for an illustration of the code map generation for a simple code.
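The labelling scheme and its Hamming-distance property can be sketched as follows (an illustrative implementation of equation 3.8.1, not the thesis code):

```python
# 2-bit label codes chosen so the Hamming distance between two labels
# reflects how far apart they are in label space.
LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10

def lbp3_label(p, q, tau):
    """Assign one of three labels to neighbour q relative to pixel p,
    with similarity threshold tau (equation 3.8.1)."""
    d = int(q) - int(p)
    if d > tau:
        return GREATER
    if d < -tau:
        return LESS
    return SIMILAR

def lbp3_code(p, neighbours, tau):
    """Concatenate the 2-bit labels of each neighbour into one code."""
    code = 0
    for q in neighbours:
        code = (code << 2) | lbp3_label(p, q, tau)
    return code

def hamming(a, b):
    """Hamming distance between two label codes."""
    return bin(a ^ b).count('1')
```

Here hamming(LESS, SIMILAR) is 1 while hamming(LESS, GREATER) is 2, matching the distance ordering described above.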
Figure 3.6: LBP3 code response to changes in lighting for two image patches covering the same area sampled from different frames in a video. The top image is while the room is under low lighting, the bottom image is of the same area under normal lighting. Each colour in the respective code maps represents a different combination of labels in the binary code, generated using a simple two-pixel neighbourhood code such as the one depicted in figure 3.8. Large differences in lightness yield almost identical code maps for this image patch.
The example implementation of the LBP3 algorithm considers neighbourhood pixels above and to the right to build the code map. These 2-bit labels are concatenated to form the final code for the current pixel. However, as with the original LBP, any arbitrary neighbourhood can be used. Figure 3.8 shows an illustration of a simple two-pixel LBP3 code.
Figure 3.9 shows that the representation is more robust to subtle global illumination changes that simple image differencing would be vulnerable to without updating its difference threshold value.
Of course, there are still label transition problems due to noise, but these occur mostly along edges, as the similar label suppresses noise in homogeneous areas of intensity. Hysteresis thresholding can be used to address this, as discussed in the following section.
3.8.3 Temporal Hysteresis
When the intensity value of a neighbouring pixel is at or very close to the boundary between the similar and greater labels, ambient camera sensor noise can
Figure 3.7: Illustration of relative intensity fluctuation over time between reference pixel p and neighbourhood pixel q, and its effect on the original 2-label LBP algorithm (top) and the proposed LBP3 extension (bottom). In the LBP labelling scheme the value q − p fluctuates between the two labels over time. In the LBP3 labelling scheme the fluctuations stay within the similar label region.
cause the label to be assigned either the similar or greater labels over time. This creates an unwanted level of variation over the region and will show up as differences in the code map.
An approach to dealing with this is to estimate the amount of background noise in the code map for the region covered by the control. For the button region R, over a number of frames T, find the mean ambient noise in the code map difference, \hat{\mu} = \frac{1}{|T|} \sum_{t \in T} L_t(R), and use this value to offset the activation threshold for the control. This can work reasonably well, but for very noisy regions the required
Figure 3.8: LBP3 code map construction.
Figure 3.9: LBP3 code map difference algorithm. Code maps are constructed for the reference image and compared to the code map of the current frame.
number of different pixels per frame can exceed the remaining pixels in the area of the control, meaning the control cannot be activated.
A more robust approach to improving code stability, while still retaining some sensitivity to change, is to apply hysteresis to the code labels assigned at each pixel. Once a pixel has been assigned a label, any change in a consecutive frame must exceed the label boundary by a secondary hysteresis threshold α.
This can be implemented efficiently by updating the code label lt using a hysteresis threshold that adapts itself based on the previous label lt−1.
l_t = L_h(p_t, q_t, l_{t-1}) \qquad (3.8.2)
where p_t and q_t are pixel intensities at time t, and L_h(p, q, l_{t-1}) is defined as:
L_h(p, q, l_{t-1} = \text{similar}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{similar} & \text{otherwise} \end{cases} \qquad (3.8.3)

L_h(p, q, l_{t-1} = \text{greater}) = \begin{cases} \text{similar} & \text{if } (q - p) < \tau - \alpha \\ \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{greater} & \text{otherwise} \end{cases} \qquad (3.8.4)

L_h(p, q, l_{t-1} = \text{less}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{similar} & \text{if } (q - p) > -\tau + \alpha \\ \text{less} & \text{otherwise} \end{cases} \qquad (3.8.5)
Figure 3.10 illustrates how a label is assigned for a pixel q from within the neigh- bourhood Np of a reference pixel p. The left graph shows how the label is assigned if a simple threshold value is used. Since the difference (q − p) starts very close to the similarity threshold τ the label fluctuates between the similar and greater labels. The graph on the right shows how labels are assigned to the same difference value (q − p) when a secondary hysteresis value α is used, and is much more stable over time.
Figure 3.10: Illustration of label assignments given relative intensity (q − p) over time between neighbourhood pixel q and reference pixel p without temporal hysteresis (left) and with temporal hysteresis enabled (right). Orange lines indicate label boundaries, and the height of the orange rectangular regions represents the hysteresis threshold α. By using a secondary hysteresis threshold, the code label will not change until its value becomes significantly different. This drastically reduces the effect of noise on the difference map.
This temporal hysteresis threshold significantly reduces the effects of label fluctuations due to noise, particularly on edges in the image where the difference between reference pixel p and a neighbourhood pixel q is close to the similarity threshold τ. In these situations, ambient sensor noise pushes the difference (q − p) between the two label regions and causes false positives in the code difference map.
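The hysteresis update can be sketched as a single function (an illustrative implementation of the L_h rules; the 2-bit label constants are redefined here so the sketch is self-contained, and the stricter "less"/"greater" boundary is checked first to resolve the overlapping conditions):

```python
LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10

def lbp3_label_hysteresis(p, q, prev_label, tau, alpha):
    """Temporal-hysteresis labelling: a label only changes once the
    difference (q - p) crosses the label boundary by an extra margin
    alpha, suppressing flicker from sensor noise near tau."""
    d = int(q) - int(p)
    if prev_label == SIMILAR:
        if d > tau + alpha:
            return GREATER
        if d < -tau - alpha:
            return LESS
        return SIMILAR
    if prev_label == GREATER:
        if d < -tau - alpha:        # far enough to jump straight to less
            return LESS
        if d < tau - alpha:         # fell below the boundary by alpha
            return SIMILAR
        return GREATER
    # prev_label == LESS
    if d > tau + alpha:
        return GREATER
    if d > -tau + alpha:
        return SIMILAR
    return LESS
```

With tau = 5 and alpha = 3, a difference of 6 that would flicker into the greater label under plain thresholding stays similar until it exceeds 8.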
In section 3.10, figure 3.12 shows results comparing two images from the same video, processed with only a single similarity threshold (left) and with a secondary hysteresis threshold (right). The left image shows the ambient background noise caused by label fluctuations along edges in a sample video sequence; in the right image, using a secondary temporal hysteresis threshold significantly reduces these artifacts.
Figure 3.11: Screen captures of two test scenarios for the persistent button algorithm using the LBP3 local binary pattern variant. Left: Table-top game. Right: Human shape game.
3.9 Experiments
The algorithm was qualitatively tested in two gaming scenarios. The first was a table-top game where the player has to move a cursor using their hand to collect items while avoiding obstacles (shown on the left in figure 3.11); the second was a human shape game where the player must move into a shape that activates the required blocks for as long as they can to score points (shown on the right in figure 3.11).
3.9.1 Comparisons
Since the LBP3 algorithm essentially performs a type of background subtraction on a local region, it is compared to two other computationally efficient local patch comparison methods: a computationally efficient Normalised Cross Correlation (NCC) [100], and the Colour Normalised Cross Correlation (CNCC) proposed by Grest et al. [71]. See section 2.4.2 for the details of these two algorithms.
Normalised cross correlation normalises each of the patches being compared by the magnitude of their respective variances. This offers some illumination invariance to the comparison. Colour normalised cross correlation on the other hand, extends the NCC formulation to perform the correlation dot product using a hue and saturation coordinate vector to gain sensitivity to colour while being more robust to differences caused by shadows [71].
Since the image being compared against is a reference image saved during calibration, much of the computation for the correlation can be pre-calculated for efficiency when processing each new video frame [100]. A small neighbourhood size of 5x5 is used when assessing each algorithm.
3.9.2 Table-top Game
In the table top game application, the camera is angled downward to point towards a clear space on the floor. A gaming grid is then superimposed on top of the lower half of the video frame. Each cell in the grid represents an independent persistent button control.
The buttons are first calibrated with the floor area clear, allowing the camera to settle its internal gain parameters and the algorithm to model the mean ambient noise in each of the button areas. For each button Bi, a mean noise µi is calculated over several frames and used to offset the activation threshold of the respective button. Once calibration is complete, the game begins.
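The calibration and activation logic can be sketched as follows (a hypothetical minimal version; the function names are illustrative, and the activation rule is one reading of "above a certain threshold minus the ambient distance"):

```python
def calibrate_button(diff_counts):
    """Mean ambient code-map difference for one button region,
    measured over the calibration frames while the scene is clear."""
    return sum(diff_counts) / len(diff_counts)

def button_active(diff_count, threshold, ambient_mean):
    """The number of changed codes in the button region, less the
    calibrated ambient noise, must exceed the activation threshold."""
    return (diff_count - ambient_mean) > threshold
```

A region that averages 10 spurious code changes per frame during calibration therefore needs 10 more changes than a quiet region before it fires.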
To play the game, the user sits in front of the virtual gaming board facing the camera and must place their finger into the desired gaming square. Since the user's hands and arms will also activate all buttons up to their desired button, the 'active' button is taken to be the activated button furthest away from the user (closest to the bottom of the screen).
Green and red blocks start from a position along a random side of the gaming board and move across to the opposite side. The user must collect as many green blocks as they can without colliding with any red blocks.
3.9.3 Human Shape Game
The human shape game is played by angling the camera towards the opposite wall so that most of the user’s legs, upper body and arms are visible. As with the table top game, the buttons are calibrated by the user leaving the view of the camera for a short time while mean noise levels are determined for each button in the grid.
Once the game begins, the user must activate only the indicated blocks to score points within a fixed time. If not all blocks are covered, then no points are gained. If the wrong blocks are activated, points are lost. After a short while, a new pattern is displayed and the user must activate a different set of blocks.
3.10 Results
All three algorithms perform well, but with different limitations and advantages. Videos of subjects playing the two games were recorded and then analysed by each of the 3 algorithms.
Figure 3.12 shows the effect of applying temporal hysteresis to the code maps. This significantly reduces the effect of noise from codes generated by pixels that oscillate on the border of the similarity threshold and the less and greater labels.
To quantify the effect of the hysteresis threshold on the LBP3 algorithm, the algorithm was run on a test sequence with ground truth, first with the basic similarity threshold labelling scheme and then with the secondary hysteresis threshold. The first 65 frames of the sequence are a view of a static room in which no movement occurs; after that, the player comes into view of the camera. The graph in figure 3.13(a) shows how the False Positive Rate evolves over time with the two methods, and clearly shows that by using a secondary temporal hysteresis threshold the LBP3 labelling scheme is considerably more stable.
The false positives registered after frame 65 occur when the player comes into view of the camera, and are due to a small reflective surface in the background of the room combined with some minor dilation artifacts caused by the coverage of the descriptor neighbourhood (see figure 3.13(b)). These cause a small number of changes near the border of the ground truth area in the current frame, but are not due to labelling noise.
Figure 3.12: Using a secondary hysteresis threshold significantly reduces the effect of noise on the difference map. Left: Frame without temporal hysteresis. Right: The same frame processed with temporal hysteresis.
Of the three methods, the CNCC algorithm is the most robust to many situations
Figure 3.13: (a) False Positive Rate (FPR) of LBP3 over a video sequence with and without the use of a secondary hysteresis threshold. After the first 65 frames the player comes into view of the camera. It can be clearly seen that without the hysteresis threshold there are many false positive detections; when hysteresis is applied, the false positives reduce to zero in the static part of the scene. (b) False positives after frame 65 are due to a reflective surface in the background and minor dilation artifacts slightly offset from the actual change, due to the LBP3 descriptor coverage (red = false positives, green = true positives, grey = ground truth).
where shadows would otherwise be detected in the difference map. See figure 3.14 for the segmentations achieved by each algorithm. The CNCC algorithm is more robust to shadows only when the underlying surface has colour; where the surface is generally grey it performs no better than the NCC algorithm. This is to be expected: as the colour of the pixels moves towards grey values, the (h, s) hue-saturation plane component reduces to zero and the algorithm operates purely on the lightness L component of the (h, s, L) colour model.
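This grey-surface degeneracy is easy to verify: as saturation goes to zero, the hue-saturation component of the comparison vanishes. A small sketch using Python's `colorsys` (standing in for whatever colour conversion the original implementation uses):

```python
import colorsys
import math

def hue_sat_vector(r, g, b):
    """Cartesian coordinates of a pixel in the (h, s) hue-saturation
    plane, with RGB inputs in [0, 1]. For grey pixels s == 0, so the
    vector collapses to the origin and only lightness remains."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle))
```

A mid-grey pixel maps to the origin of the plane, so any correlation computed on these coordinates carries no information, whereas a saturated red pixel maps to a unit vector.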
3.11 Discussion and Future Work
There are two areas of improvement that would be interesting to explore for future versions of the LBP3 algorithm.
First, it is not as robust to shadows as methods such as CNCC. This could be addressed by performing the LBP3 comparison measure in a more shadow-invariant colour representation, such as the (h, s, L) colour space used by the CNCC algorithm. The work by Yeffet and Wolf [176] uses SSD to compare local patches between different frames to detect motion when constructing their LBP-inspired binary pattern representation, and it would be interesting to see how incorporating shadow-invariant local
Figure 3.14: Results on algorithms applied to table-top game scenario. Game display overlay has been omitted for clarity of results. The CNCC method performs best on this frame due to the red-brown hue of the table cloth aiding the shadow removal in the comparison. If this were grey, then CNCC would perform with similar results to NCC. The extra saturation on the NCC table result includes marked areas that have too low variance and their NCC is unstable. These are detected and filtered out when generating the NCC difference map.
patch comparisons such as the CNCC correlation method into the spatial coding of LBP3 improves robustness to shadows (i.e. comparing patches instead of single pixels from the processed image), though this would increase the computational complexity, slowing down the algorithm and risking the loss of its real-time performance.
Second, since the LBP3 algorithm (as well as the NCC and CNCC methods tested here) holds a background reference image to compare new frames against, if the camera drifts slowly over time, as illustrated in figure 3.15, differences will be registered in the difference map as the frame drifts further out of alignment with the reference image. Applying basic image stabilisation methods to correct for gradual or sudden drift should address this problem.
Figure 3.15: Camera drift over the course of many frames on LBP3 difference maps. Left to right: the camera very slowly tilts upward on its stand because it was not placed securely, causing gradual constant differences in the difference map.
67 CHAPTER 4
Detection and Segmentation
Object detection and segmentation are important problems in computer vision and have numerous commercial applications such as pedestrian detection, surveillance and gesture recognition. Image segmentation has been an extremely active area of research in recent years [24, 27, 58, 88, 92, 135]. In particular, segmentation of the face is of great interest due to applications such as Windows Messenger [36, 158].
One recent application in a computer game context is the use of face detection to control the steering of a virtual character in the racing game AntiGrav. To control the game, the player must move their head to different areas of the screen to steer left, right, up or down. Although the simple face detection employed by the game failed at times, it demonstrated that novel interactions like this can be realised by using computer vision algorithms in a creative way.
Taking inspiration from this idea, it is easy to see the potential of placing a player's face on their own digital avatar, particularly in a multi-player computer game. Using an off-the-shelf face detection algorithm such as the one presented in [170], and coupling it with segmentation techniques, it is possible to segment the face of the player so that it can be added to a 3D avatar. This chapter proposes a novel use of a detection and segmentation algorithm to segment a face in real-time.
Until recently the only reliable method for performing segmentation in real-time was blue screening. This method imposes strict restrictions on the input data and can only be used for certain specific applications. Recently Kolmogorov et al. [90] proposed a robust method for extracting the foreground and background layers of a scene from a stereo image pair. Their system ran in real-time and used two carefully calibrated cameras to perform the segmentation. These cameras were used to obtain disparity information about the scene, which was then used to segment the scene into foreground and background layers. Although they obtained excellent segmentation results, the
Figure 4.1: Real-Time Face Segmentation using face detection. The first image on the first row shows the original image. The second image shows the face detection results. The image on the second row shows the segmentation obtained by using shape priors generated using the detection and localisation results.
need for two calibrated cameras was a drawback of their system.
In the previous chapter (§ 3.4), a low-level method of segmentation via background subtraction using a fast binary descriptor was presented and compared to other fast background subtraction methods. It showed that image segmentation can be tackled efficiently this way, though there remain issues such as camera drift that need to be addressed by other algorithms, making these methods less attractive when reliable segmentations are required.
This section presents a framework that exploits the result of a state-of-the-art face detection algorithm [170] to provide an initialisation for the position and scale of a simple shape prior, which can then be incorporated into a Markov Random Field energy function and minimised using the dynamic graph cut algorithm [88].
4.1 Shape priors for Segmentation
An orthogonal approach to background subtraction for solving the segmentation problem robustly has been the use of prior knowledge about the object to be segmented. In recent years a number of papers have successfully coupled the MRFs used to model the image segmentation problem with information about the nature and shape of the object to be segmented [27, 58, 76, 92]. The primary challenge in these systems is ascertaining what makes a good choice of shape prior, because the shape (and pose) of objects in the real world varies with time. To obtain a good shape prior, the object must be localised in the image and its pose inferred, both of which are extremely difficult problems in themselves.
Kumar et al. [92] proposed a solution to these problems by matching a set of exemplars for different parts of the object on to the image. Using these matches they generate a shape model for the object. They model the segmentation problem by combining MRFs with layered pictorial structures (LPS), which provide them with a realistic shape prior described by a set of latent shape parameters. A lot of effort has to be spent learning the exemplars for the different parts of the LPS model.
In their work on simultaneous segmentation and 3D pose estimation of humans, Bray et al. [27] proposed the use of a simple 3D stick-man model as a shape prior. Instead of matching exemplars for individual parts of the object, their method followed an iterative algorithm for pose inference and segmentation whose aim was to find the pose corresponding to the human segmentation having the maximum probability (or least energy). Their iterative algorithm was made efficient using the dynamic graph cut algorithm [88]. Their work had the important message that rough shape priors were sufficient to obtain accurate segmentation results. This is an important observation which will be exploited in our work to obtain an accurate segmentation of the face.
4.2 Coupling Face Detection and Segmentation
In the methods described above, the computational problem is that of localising the object in the image and inferring its pose. Once a rough estimate of the object pose is obtained, the segmentation can be computed extremely efficiently using graph cuts [24, 25, 70, 88, 91]. In this section we show how an off-the-shelf face detector, such as the one described in [170], can be coupled with graph cut based segmentation to give accurate segmentation and improved face detection results in real-time.
The key idea of the framework proposed in the following sections is that face localisation estimates in an image (obtained from any generic face detector) can be used to generate a rough shape energy. These energies can then be incorporated into a discriminative MRF framework to obtain robust and accurate face segmentation results, as shown in Figure 4.1. This method is an example of the OBJCUT paradigm for an unarticulated object. We define an uncertainty measure for each face detection, based on the energy associated with the face segmentation, and show how this measure can be used to filter out false face detections, thus improving the face detection accuracy.
The algorithm proposed in this section is a method for face segmentation which works by coupling the problems of face detection and segmentation in a single framework. This method is efficient and runs in real-time. The key novelties of the algorithm include:
1. A framework for coupling face detection and segmentation problems together.
2. A method for generating rough shape energies from face detection results.
2. An uncertainty measure for face segmentation results which can be used to identify and prune false detections.
In the next section, we briefly discuss the methods for robust face detection and image segmentation. In section 4.3, we describe how a rough shape energy can be generated using localisation results obtained from any face detection algorithm. The procedure for integration of this shape energy in the segmentation framework is given in the same section along with details of the uncertainty measure associated with each face segmentation. The simple shape prior is then extended to an upper body model in section 4.7. We conclude by listing some ideas for future work in section 4.8.
4.3 Preliminaries
In this section we give a brief description of the methods used for face detection and image segmentation.
4.3.1 Face Detection and Localisation
Given an image, the aim of a face detection system is to detect the presence of all human faces in the image and to give rough estimates of the positions of all such detected faces. In this proposed framework we use the face detection method proposed by Viola and Jones [170]. This method is extremely efficient and has been shown to give good detection accuracy. A brief description of the algorithm is given next.
The Viola Jones face detector works on features which are similar to Haar filters. The computation of these features is done at multiple scales and is made efficient by using an image representation called the integral image [170]. After these features have been extracted, the algorithm constructs a set of classifiers using AdaBoost [61]. Once constructed, successively more complex classifiers are combined in a cascade structure. This dramatically increases the speed of the detector by focussing attention on promising regions of the image¹. The output of the face detector is a set of rectangular windows in the image where a face has been detected. We will assume that each detection window W_i is parameterised by a vector θ_i = {c_i^x, c_i^y, w_i, h_i}, where (c_i^x, c_i^y) is the centre of the detection window and w_i and h_i are its width and height respectively.
¹ A system has been developed which uses a single camera and runs in real-time.
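The early-reject behaviour of the cascade described above can be sketched as follows. This is an illustrative toy, not the Viola-Jones implementation: the stage score functions and thresholds stand in for boosted sums of Haar-feature responses.

```python
# Illustrative cascade sketch: cheap classifiers run first, and a window is
# rejected as soon as any stage fails, so most background windows are
# discarded after only a few feature evaluations. Stage scores here are
# made-up stand-ins for boosted Haar-feature sums.

def cascade_accepts(window, stages):
    """stages: list of (score_fn, threshold), ordered cheapest first."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early reject: later (costlier) stages never run
    return True  # survived every stage: report a detection

# Toy stages: each 'classifier' just sums part of the window values.
stages = [
    (lambda w: sum(w[:2]), 1.0),   # cheap stage using two features
    (lambda w: sum(w), 3.0),       # more complex stage, evaluated rarely
]

face_like = [1.0, 1.0, 1.5]
background = [0.1, 0.2, 0.3]
assert cascade_accepts(face_like, stages)
assert not cascade_accepts(background, stages)
```

The design point this illustrates is that the per-window cost is dominated by the first, cheapest stages, which is what makes dense scanning of the image tractable.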
4.3.2 Image Segmentation
Given a vector y = {y1, y2, ··· , yn} where each yi represents the colour of the pixel i of an image having n pixels, the image segmentation problem is to find the value of the vector x = {x1, x2, ··· , xn} where each xi represents the label which the pixel i is assigned. Each xi takes values from the label set L = {l1, l2, ··· , lm}. Here the label set L consists of only two labels i.e. ‘face’ and ‘background’. The posterior probability for x given y can be written as:
Pr(x|y) = Pr(y|x)Pr(x) / Pr(y) ∝ Pr(y|x)Pr(x)  (4.3.1)

We define the energy E(x) of a labelling x as:
E(x) = −log Pr(x|y) = −log Pr(y|x) − log Pr(x) + constant = φ(x, y) + ψ(x) + constant  (4.3.2)
where φ(x, y) = −log Pr(y|x) and ψ(x) = −log Pr(x). Given an energy function E(x), the most probable or maximum a posteriori (MAP) segmentation solution x∗ can be found as the segmentation solution x that minimises E(x):
x∗ = argmin_x E(x)  (4.3.3)

It is typical to formulate the segmentation problem in terms of a Discriminative Markov Random Field [93]. In this framework the likelihood φ(x, y) and prior terms ψ(x) of the energy function can be decomposed into unary and pairwise potential functions. In particular this is the contrast dependent MRF [24, 92] with energy:
E(x) = ∑_i (φ(x_i, y) + ψ(x_i)) + ∑_{(i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.3.4)
where N is the neighbourhood system defining the MRF. Typically a 4 or 8 neighbourhood system is used for image segmentation, which implies each pixel is connected with 4 or 8 pixels in the graphical model respectively.
4.3.3 Colour and Contrast based Segmentation
The unary likelihood terms φ(x_i, y) of the energy function are computed using the colour distributions for the different segments in the image [24, 92]. For our experiments we built the colour appearance models for the face/background using the pixels
lying inside/outside the detection window obtained from the face detector. The pairwise likelihood term φ(x_i, x_j, y) of the energy function is called the contrast term and is discontinuity preserving in the sense that it encourages pixels having dissimilar colours to take different labels (see [24, 92] for more details). This term takes the form:
φ(x_i, x_j, y) = γ(i, j) if x_i ≠ x_j, and 0 if x_i = x_j  (4.3.5)
where γ(i, j) = exp(−g(i, j)² / 2σ²) · 1/dist(i, j). Here g(i, j) = ‖I_i − I_j‖₂ measures the difference between the RGB pixel values I_i and I_j respectively, and dist(i, j) gives the spatial distance between i and j. Other colour spaces could be used (such as (L∗, u∗, v∗)) but as this would add an additional computational cost to transform the RGB values to the new colour space, this simple distance measure was preferred.
The pairwise prior terms ψ(x_i, x_j) are defined in terms of a generalized Potts model as:

ψ(x_i, x_j) = K_ij if x_i ≠ x_j, and 0 if x_i = x_j  (4.3.6)
This encourages neighbouring pixels in the image² to take the same label, thus resulting in smoothness in the segmentation solution. In most methods, the value of the unary prior term ψ(x_i) is fixed to a constant. This is equivalent to assuming a uniform prior and does not affect the solution. In the next section we will show how a shape prior derived from a face detection result can be incorporated in the image segmentation framework.
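The colour and contrast terms above can be made concrete with a small sketch. This is a minimal toy evaluation of the energy of equation 4.3.4 on a tiny grey-level image, assuming a hypothetical uniform unary cost and a 4-neighbourhood; the constants sigma and K are illustrative, not values used in the thesis experiments.

```python
import numpy as np

# Minimal sketch of the contrast-dependent MRF energy (eq. 4.3.4): unary
# terms per pixel, plus a contrast term (eq. 4.3.5) and a Potts prior
# (eq. 4.3.6) on every 4-neighbour pair with a different label.

def mrf_energy(image, labels, unary_cost, sigma=10.0, K=1.0):
    H, W = image.shape
    E = 0.0
    # unary terms: cost of assigning labels[i] given the pixel value
    for y in range(H):
        for x in range(W):
            E += unary_cost(image[y, x], labels[y, x])
    # pairwise terms over horizontal/vertical neighbours (dist(i,j) = 1)
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < H and nx < W and labels[y, x] != labels[ny, nx]:
                    g = abs(float(image[y, x]) - float(image[ny, nx]))
                    E += np.exp(-g**2 / (2 * sigma**2))  # contrast term
                    E += K                                # Potts prior
    return E

img = np.array([[0, 0, 200], [0, 0, 200]], dtype=float)
cost = lambda value, label: 0.0  # uniform unary: isolates the pairwise terms
aligned = np.array([[0, 0, 1], [0, 0, 1]])      # cut along the strong edge
misaligned = np.array([[0, 1, 1], [0, 1, 1]])   # cut inside a flat region
assert mrf_energy(img, aligned, cost) < mrf_energy(img, misaligned, cost)
```

The assertion shows the discontinuity-preserving effect: placing the label boundary on a strong image edge (large g, small contrast penalty) is cheaper than cutting through a uniform region.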
4.4 Integrating Face Detection and Segmentation
Having given a brief overview of image segmentation and face detection methods, we now show how we couple these two methods in a single framework. Following the OBJCUT paradigm, we start by describing the face energy and then show how it is incorporated in the MRF framework.
² Pixels i and j are neighbours if (i, j) ∈ N.
Figure 4.2: Generating the face shape energy. The figure shows how a localisation result from the face detection stage (left) is used to define a rough shape energy for the face.
4.4.1 The Face Shape Energy
In their work on segmentation and 3D pose estimation of humans, Bray et al. [27] show that rough and simple shape energies are adequate to obtain accurate segmentation results. Following their example we use a simple elliptical model for the shape energy for a human face. The model is parameterised in terms of four parameters: the ellipse centre coordinates (c_x, c_y), the semi-minor axis a and the semi-major axis b (assuming a < b). The values of these parameters are computed from the parameters θ_k = {c_k^x, c_k^y, w_k, h_k} of the detection window k obtained from the face detector as: c_x = c_k^x, c_y = c_k^y, a = w_k/α and b = h_k/β. The values of α and β used in our experiments were set to 2.5 and 2.0 respectively, however these can be computed iteratively in a manner similar to [27]. A detection window and the corresponding shape prior are shown in figure 4.2.
4.5 Incorporating the Shape Energy
For each face detection k, we create a shape energy Θ_k as described above. This energy is integrated in the MRF framework described in section 4.3.2 using the unary terms ψ(x_i) as:

ψ(x_i) = λ(x_i, Θ_k) = −log p(x_i|Θ_k)  (4.5.1)

where we define p(x_i|Θ_k) as:

p(x_i = face|Θ_k) = 1 / (1 + exp(μ · ((cx_i − c_k^x)² / (a_k)² + (cy_i − c_k^y)² / (b_k)² − 1)))  (4.5.2)
and:
p(x_i = background|Θ_k) = 1 − p(x_i = face|Θ_k)  (4.5.3)
where cx_i and cy_i are the x and y coordinates of the pixel i; c_k^x, c_k^y, a_k = w_k/α and b_k = h_k/β are the parameters of the shape energy Θ_k; and the parameter μ determines how the strength of the shape energy term varies with the distance from the ellipse boundary. The different terms of the energy function and the corresponding segmentation for a particular image are shown in figure 4.3.
Figure 4.3: Different terms of the shape-prior + MRF energy function. The figure shows the different terms of the energy function for a particular face detection and the corresponding image segmentation obtained.
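The soft elliptical prior of equation 4.5.2 can be sketched directly. The detection-window numbers below are made up for illustration; only the α = 2.5, β = 2.0 mapping from section 4.4.1 is taken from the text.

```python
import numpy as np

# Sketch of the soft elliptical face prior (eq. 4.5.2): probability near 1
# inside the ellipse, near 0 outside, with mu controlling how sharply the
# probability falls off across the ellipse boundary.

def face_prior(H, W, cx, cy, a, b, mu=5.0):
    ys, xs = np.mgrid[0:H, 0:W]
    d = ((xs - cx) ** 2) / a**2 + ((ys - cy) ** 2) / b**2 - 1.0
    return 1.0 / (1.0 + np.exp(mu * d))

# hypothetical 40x50 detection window centred at (32, 24), alpha=2.5, beta=2.0
w_k, h_k = 40, 50
prior = face_prior(64, 64, cx=32, cy=24, a=w_k / 2.5, b=h_k / 2.0)

assert prior[24, 32] > 0.9          # ellipse centre: almost certainly face
assert prior[0, 0] < 0.1            # far corner: almost certainly background
assert np.all((prior >= 0) & (prior <= 1))
```

Taking −log of this map (and of its complement, per equation 4.5.3) gives the unary shape terms ψ(x_i) added to the MRF energy.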
Once the energy function E(x) has been formulated, the most probable segmentation solution x∗ defined in equation 4.3.3 can be found by computing the solution of the max-flow problem over the energy equivalent graph [91]. The complexity of the max-flow algorithm increases with the number of variables involved in the energy function. Recall that the number of random variables is equal to the number of pixels in the image to be segmented. Even for a moderate sized image the number of pixels is in the range of 10⁵ to 10⁶. This makes the max-flow computation quite time consuming. To overcome this problem we only consider pixels which lie in a window W_k whose dimensions are double those of the original detection window obtained from the face detector. As pixels outside this window are unlikely to belong to the face (due to the shape term ψ(x_i)), we set them to the background.

Figure 4.4: This figure shows an image from the INRIA pedestrian data set. After running our algorithm, we obtain four face segmentations, one of which (the one bounded by a black square) is a false detection. The energy-per-pixel values obtained for the true detections were 74, 82 and 83, while that for the false detection was 87. As can be seen, the energy of the false detection is higher than that of the true detections, and can be used to detect and remove it.

The energy function for each face detection k now becomes:
E_k(x) = ∑_{i∈W_k} (φ(x_i, y) + ψ(x_i, Θ_k)) + ∑_{i,j∈W_k, (i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.5.4)
This energy is then minimised using graph cuts to find the face segmentation x_k∗ for each detection k.
4.5.1 Pruning False Detections
The energy E(x′) of any segmentation solution x′ is the negative log of the probability, and can be viewed as a measure of how uncertain that solution is. The higher the energy of a segmentation, the lower the probability that it is a good segmentation. Intuitively, if the face detection given by the detector is correct, then the resulting segmentation obtained from our method should have high probability and hence have low energy compared to the case of a false detection (as can be seen in figure 4.4).
This characteristic of the energy of the segmentation solution can be used to prune false face detections. This method was also explored by Ramanan [128] for improving the results of human detection. Alternatively, if the number of people P in the scene is known, then we can choose the top P detections according to the segmentation energy.
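Both pruning strategies just described are simple to state in code. The sketch below uses the energy-per-pixel numbers from the figure 4.4 example (74, 82, 83 for true detections, 87 for the false one); the threshold value is illustrative.

```python
# Sketch of pruning by segmentation energy (sec. 4.5.1): detections whose
# energy-per-pixel exceeds a threshold are treated as false positives, or,
# if the number of people P is known, only the P lowest-energy detections
# are kept. Energies mirror the figure 4.4 example.

detections = [("A", 74.0), ("B", 82.0), ("C", 83.0), ("D", 87.0)]

def prune_by_threshold(dets, max_energy):
    return [name for name, e in dets if e <= max_energy]

def keep_top_p(dets, P):
    return [name for name, _ in sorted(dets, key=lambda d: d[1])[:P]]

assert prune_by_threshold(detections, max_energy=85.0) == ["A", "B", "C"]
assert keep_top_p(detections, P=3) == ["A", "B", "C"]
```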
Figure 4.5: Some face detection and segmentation results obtained from our algorithm.
4.6 Implementation and Experimental Results
We tested our algorithm on a number of images containing faces. Some detection and segmentation results are shown in figure 4.5. The time taken for segmenting out the faces is in the order of tens of milliseconds. We also implemented a real-time system for frontal face detection and segmentation. The system is capable of running at roughly 15 frames per second on images of 320 × 240 resolution.
4.6.1 Handling Noisy Images
The contrast term of the energy function can become unreliable in noisy images. To avoid this we smooth the image before the computation of this term. The results of this procedure are shown in figure 4.6.
4.7 Extending the Shape Model to Upper Body
The simple ellipse model used for face detection and segmentation in the previous section can be extended to detect and segment the upper body of a human. This use of localisation can improve the speed of the parameter search in PoseCut [27] as the position and approximate scale of the human is already known from the detector stage of the algorithm.
Figure 4.6: Effect of smoothing on the contrast term and the final segmentation. The images on the first row correspond to the original noisy image. The images on the second row are obtained after smoothing the image.

Figure 4.7 shows an illustration of the model. This upper body model expresses simple articulation of the neck length and angle, and the shape of the upper torso. The model has 6 parameters which encode the x and y location of the two shoulders and the length and angle of the neck.
Using the PoseCut approach, the parameters of the model can be optimised to find the lowest energy E_k(x, Θ_k) given the current model parameters Θ_k, which now represent the 6 parameters of the upper body model.
The energy cost function E_k(x, Θ_k) for detection k is:

E_k(x, Θ_k) = ∑_{i∈W_k} (φ(x_i, y) + ψ(x_i, Θ_k)) + ∑_{i,j∈W_k, (i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.7.1)
The goal is then to find the optimal model parameters Θ_k∗ that yield the lowest segmentation energy, min_x E_k(x, Θ_k):

Θ_k∗ = argmin_{Θ_k} min_x E_k(x, Θ_k)  (4.7.2)
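The nested optimisation of equation 4.7.2 can be sketched with a coarse grid search over two parameters, in place of Powell minimisation, mirroring the neck rotation/length slice explored in figure 4.8. The energy function below is a made-up smooth surrogate for min_x E_k(x, Θ_k), with its minimum placed at angle = 0.2 and length = 30.

```python
import numpy as np

# Sketch of eq. 4.7.2: for each candidate Theta_k, the inner problem
# (graph-cut segmentation) returns the best energy min_x E_k(x, Theta_k);
# the outer problem searches over Theta_k. A quadratic surrogate stands in
# for the real inner minimisation here.

def inner_energy(angle, length):
    # surrogate for the best segmentation energy at these model parameters
    return (angle - 0.2) ** 2 + 0.01 * (length - 30.0) ** 2

angles = np.linspace(-0.5, 0.5, 21)     # neck rotation candidates
lengths = np.linspace(20.0, 40.0, 21)   # neck length candidates

best = min(
    ((inner_energy(a, l), a, l) for a in angles for l in lengths),
    key=lambda t: t[0],
)
_, best_angle, best_length = best

assert abs(best_angle - 0.2) < 0.06     # within one grid step (0.05)
assert abs(best_length - 30.0) < 1.1    # within one grid step (1.0)
```

Powell's method replaces this exhaustive grid with a sequence of 1-D line searches, which is why the multiple but nearly equal local minima seen in figure 4.8 matter: different initialisations can land in either one at essentially the same energy.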
As with the model used in face segmentation, a window W_k is defined around detection k with a size proportional to the detection window scale. The model is initialised to a frontal configuration, as shown in subfigure (a) in figure 4.7, and the optimal parameters are determined by finding the solution to equation 4.7.2. In the upper body model experiments Powell minimisation [127] is used.

Figure 4.7: Upper Body Model. (a) The model parameterised by 6 parameters encoding the x and y location of the two shoulders and the length and angle of the neck. (b) The shape prior generated using the model. Pixels more likely to belong to the foreground/background are green/red. (c) and (d) The model rendered in two poses.
Figure 4.8 shows iterations of several starting positions optimising over 2 of the 6 parameters of the upper body model shown in figure 4.7. It can be seen that although the energy surface has multiple local minima, they are very close and have very similar energy. The experiments showed that the Powell minimisation algorithm was able to converge to almost the same point for different initialisations (see bottom image in figure 4.8).
The upper body model was tested on video sequences from the Microsoft Research bilayer video segmentation dataset [90]. Some results from processing one of the se- quences can be seen in figure 4.9.
4.8 Discussion and Future Work
In the previous sections a method for face segmentation which combines face detection and segmentation into a single framework has been presented. The method runs in real-time and gives accurate segmentation, and shows how the segmentation energy can be used to help prune out false positives in certain situations.
When the upper body model was applied to video sequences, the time taken to find the optimal parameters Θ_k∗ for a given detection window was too slow for processing video streams in real-time. This means that it would also be unviable to use a fully articulated model for determining pose, such as the problem discussed in chapter 6.
For video sequences, knowledge of previous good detections, in addition to temporal smoothing, can be used to suppress detections that oscillate between adjacent discrete detection scales when the true scale of the human is part way between them. Other temporal information, such as initialising the model with the optimal parameters from the previous frame Θ_k∗[t − 1], could take the performance of an upper body model a step closer to real-time speed.

Figure 4.8: Optimising the upper body parameters. (top) The values of min_x E_k(x, Θ_k) obtained by varying the rotation and length parameters of the neck. (bottom) The image shows five runs of the Powell minimisation algorithm which are started from different initial solutions. The runs converge on two solutions which are very close and have almost the same energy.
Figure 4.9: Segmentation results using the 2D upper body model. The first column shows the original images, the second column shows the segmentations obtained from the initial model parameters for the upper body model, and the third column shows segmentations using the optimal parameters. The segmentation energies for the initial and optimal pose parameters are also shown.
CHAPTER 5
Human Detection
Within a web camera based computer game setting, where a player typically stands in front of a camera positioned on top of their TV (as in the Sony EyeToy games), efficiently localising where the player currently is in the video frame can make more advanced computer vision algorithms more robust, or even faster to compute. For instance, if an algorithm fails or produces a weak response, then extra information that localises the player can reduce the search area needed to re-initialise a more complex algorithm.
An example game that illustrates this problem is Sony's EyeToy Kinetic and Kinetic Combat. The player is asked to perform a series of tasks, such as hitting a block while avoiding others, but how the player achieves this is not actually measured, i.e. no estimate of pose is determined during the course of the game. Instead the game detects collisions with areas of movement on screen (e.g. via pixel differencing).
If for instance, a player is asked to perform a movement in a particular way for interactive exercise games such as these, it would be useful to also assess how well a player is doing with this movement. To do this, the pose of the player or other more detailed information needs to be determined.
Pose estimation algorithms can be employed to attempt to solve this; however, good localisation of the player allows complex algorithms to be applied only at areas of interest.
This chapter first discusses the basic requirements of a fast detection algorithm (not necessarily real-time), presents some existing feature descriptors and learning algorithms, and then proposes a new approach to chamfer matching for object detection that can learn the weights required for human detection. Finally, some experiments are presented that compare the classification performance of the different algorithms discussed by testing them on human and upper body datasets.
5.1 Suitable Algorithms
The localisation task can be thought of as a binary class problem in which a position in an image can be labelled as either background or object.
A good localisation method must be computationally efficient so that it can be used in conjunction with other computer vision algorithms on top of any computer game media being displayed, and it should be able to cope with partial occlusions and varied subject size and clothing. Once the location of a player is known, or at least reduced to a smaller number of possible locations, then other more computationally intensive algorithms can be applied.
One such algorithm that has these properties is chamfer matching [11, 20, 154]. Another algorithm, called Histograms of Oriented Gradients (or HOG) [40], is considered state-of-the-art for human detection, though even efficient implementations are not quite able to run at interactive frame rates [182]. This idea has also been extended to considering the responses of different part detectors such as in [23], but this algorithm is slower in comparison due to the complexity added by the separate classifiers, so for detection the simpler single classifier algorithm is used in experiments presented in this chapter.
5.2 Features
Simply passing the raw image data to a learning algorithm such as an SVM does not achieve good results due to variability in brightness and appearance. Typically images are processed to reduce them not only to a more manageable size, but also to the information that remains useful for detecting an object, e.g. edge maps or histograms of oriented gradients (HOG).
There are numerous features used in human detection: silhouette [2], chamfer [65], edges [44], HOG descriptors [40, 140, 183], Haar filters [171], motion and appearance patches [18], edgelet features [175], shapelet features [136] or SIFT [102]. The features selected for use here are edge gradient based descriptors, chosen for their popularity, their success in state-of-the-art human detection [40], and their ability to be made computationally efficient while retaining good accuracy [162, 163, 183].
5.2.1 Histogram of Oriented Gradients
The principle behind the Histogram of Oriented Gradients (HOG) introduced by Dalal and Triggs [40] is quite simple. In contrast to sparse descriptors such as SIFT that are positioned at extrema found in scale-space [102] or at distinct repeatable interest points [13], the HOG detector proposed in [40] uses a dense array of overlapping histograms across a sample window. See figure 5.11 for an illustration of how the descriptors are arranged within the detection window.
The image is first processed to extract gradient orientations and magnitudes at each pixel using an edge operator such as Sobel or Canny. Where available, colour information is used and the maximum gradient over the colour channels is used as the gradient for that pixel.
Each HOG descriptor block is split into a grid of c_w × c_h cells of n_w × n_h pixels (typically 2 × 2 and 8 × 8 respectively [40]). Before constructing the descriptor, the gradients within the descriptor region are weighted by a Gaussian of size σ = 0.5 · BlockSize to down-weight gradient contributions from the pixels at the very edge of the descriptor.
For each cell within the block, a histogram is constructed over θ orientations by summing the magnitudes of each of the gradients within the cell into the histogram bin associated with their orientations. Gradients are bilinearly interpolated with adjacent orientation bins to avoid boundary effects within the orientation histogram. The gradient is also weighted spatially by contributing to histograms in neighbouring cells depending on how close the current location is to the other cells.
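The orientation binning with bilinear interpolation between adjacent bins can be sketched as follows. The spatial interpolation between neighbouring cells is omitted for brevity, and the bin layout over [0, π) is an assumption of this sketch.

```python
import numpy as np

# Sketch of per-cell orientation binning: each gradient's magnitude is split
# between the two nearest orientation bins in proportion to its distance
# from each bin centre, avoiding boundary effects in the histogram.

def cell_histogram(magnitudes, orientations, n_bins=9):
    hist = np.zeros(n_bins)
    bin_width = np.pi / n_bins
    for mag, ori in zip(magnitudes.ravel(), orientations.ravel()):
        pos = ori / bin_width - 0.5          # continuous bin coordinate
        lo = int(np.floor(pos)) % n_bins
        hi = (lo + 1) % n_bins               # wrap around at pi
        frac = pos - np.floor(pos)
        hist[lo] += mag * (1.0 - frac)       # split the magnitude between
        hist[hi] += mag * frac               # the two nearest bins
    return hist

mags = np.ones((2, 2))
oris = np.full((2, 2), np.pi / 9)            # on the boundary of bins 0 and 1
hist = cell_histogram(mags, oris)
assert np.isclose(hist.sum(), mags.sum())    # total magnitude is preserved
assert np.isclose(hist[0], hist[1])          # boundary gradient split evenly
```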
The histograms for each of the block cells are concatenated together h = {h_i} and then normalised to a unit vector to form the final HOG descriptor representation. Each of the HOG descriptor blocks is normalised based on the 'energy' of the histograms contained within it. Dalal and Triggs [40] discusses different methods of contrast normalisation, and the method that appeared to improve results most in the original HOG implementation is the clipped norm method originally proposed by Lowe [102], and referred to as L2-hys by Dalal and Triggs [40]. This method first normalises the descriptor using the L2-norm: h → h / √(‖h‖₂² + ε²), where ‖h‖₂² = h₁² + h₂² + ··· + h_n². Then the components h_i ∈ h are thresholded so that none of the values h_i are greater than 0.2 (this constant was determined empirically in [102]), and the descriptor is normalised again using the L2-norm. Since the descriptor is applied densely, even over uniform patches where no edge gradients exist, a small constant value ε is used to avoid division by zero when ‖h‖₂ = 0. See figure 5.1 for a diagram showing the key steps in the algorithm.

Figure 5.1: Diagram of HOG descriptor construction. Gradients for the whole image are extracted, then over a given descriptor block B, for each position B(x, y) within the block the gradient magnitudes are weighted by a Gaussian |∇B(x, y)| · G(σ, x, y) where σ = 0.5 · blockSize, and added to the histogram belonging to the cell that covers (x, y). The gradient also contributes to adjacent cells proportionally to how close the pixel is to the respective cell centre.
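The L2-hys normalisation step (L2-normalise, clip at 0.2, renormalise) is compact enough to write out directly; the epsilon value below is illustrative.

```python
import numpy as np

# Sketch of L2-hys block normalisation: L2-normalise, clip every component
# at 0.2, then L2-normalise again. The epsilon guard keeps uniform
# (all-zero-gradient) blocks from dividing by zero.

def l2_hys(h, clip=0.2, eps=1e-6):
    h = np.asarray(h, dtype=float)
    h = h / np.sqrt(np.sum(h**2) + eps**2)     # first L2 normalisation
    h = np.minimum(h, clip)                    # clip large components
    return h / np.sqrt(np.sum(h**2) + eps**2)  # renormalise

h = l2_hys([10.0, 1.0, 0.0, 0.0])
assert h.max() <= 0.9                        # dominant bin no longer saturates
assert np.isclose(np.sum(h**2), 1.0, atol=1e-3)
assert np.all(l2_hys(np.zeros(4)) == 0.0)    # zero block stays zero
```

Note that after the second normalisation individual components can again exceed 0.2; the clipping limits the influence of any single dominant gradient direction rather than imposing a hard cap on the final vector.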
Dalal and Triggs [40] explores different configurations of this descriptor, but blocks of 2 × 2 cells of 8 × 8 pixels seem to give good results on the INRIA database while keeping the dimensionality low. The best result was reported to be HOG blocks with a 3 × 3 arrangement of 6 × 6 pixel cells and 9 orientation bins, but this has a much higher (nearly double) dimensionality than the 2 × 2 cell, 8 × 8 pixel arrangement, with only a small improvement in accuracy [40].
Descriptor Variations
There are a few variations of the HOG descriptor discussed in Dalal and Triggs [40]. The two main categories are rectangular and circular HOG descriptors (R-HOG and C-HOG respectively), and the paper explores their performance compared to SIFT descriptors [102], shape context descriptors [110] (simulated using C-HOG descriptors) and generalised Haar wavelet based descriptors [170].
Dalal and Triggs [40] demonstrated that HOG based descriptors outperformed each of the other descriptor variations on two different person databases: the MIT pedestrian database, and a more challenging INRIA database created to test the HOG descriptor. R-HOG descriptors generally performed best on human detection.
Figure 5.2: The sum over the area defined by region R can be found by sampling 4 points from the integral image (A, B, C and D) to give: sum(R) = ii(D) − ii(B) − ii(C) + ii(A).
Integral HOG Features
The HOG features used in the experiments are a slight variant of the original HOG descriptor, called Integral HOG [182]. This descriptor uses integral histograms to speed up the HOG descriptor computation time, while retaining similar accuracy to the original descriptor [182].
An integral image [170] (also known as a summed area table [37]) is a method of efficiently finding the sum of values over a rectangular area. The value at a given point in the integral image is defined as [170]:
ii(x, y) = ∑_{x′≤x, y′≤y} i(x′, y′)  (5.2.1)
Where ii(x, y) is the integral image and i(x, y) is the original image. This table can be calculated efficiently in a single pass over an image by using the following pair of equations [170]:
s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)  (5.2.2)
Where s(x,y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0.
Figure 5.2 shows how an integral image is used to calculate the sum over a given area. In the integral image, the value at point (x, y) represents the sum of all the values in the original image to the left and above of that position (as defined in eq. 5.2.1). The sum over any arbitrary rectangular region R can be calculated by sampling four points (A, B, C and D in figure 5.2) in the integral image, and using the following formula:
sum(R) = ii(D) − ii(B) − ii(C) + ii(A) (5.2.3)
The value at ii(A) is added to compensate for the fact that by subtracting ii(B) and ii(C) from ii(D) the area ii(A) has actually been subtracted twice already, since the area represented by ii(A) is in both ii(B) and ii(C). Using this method the area over any arbitrary region can be calculated with only 4 samples from the integral image representation.
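The integral image and the 4-sample area sum of equation 5.2.3 can be sketched as follows; a zero-padded first row and column stand in for the s(x, −1) = 0 and ii(−1, y) = 0 boundary conditions.

```python
import numpy as np

# Sketch of the integral image (eq. 5.2.1) and the 4-lookup area sum
# sum(R) = ii(D) - ii(B) - ii(C) + ii(A) (eq. 5.2.3). Padding with a zero
# row/column lets region corners index directly into the table.

def integral_image(img):
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def area_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using 4 lookups: D - B - C + A."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
assert area_sum(ii, 0, 0, 4, 4) == img.sum()
assert area_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```

The point of the construction is that after one pass to build the table, the cost of summing a region is constant regardless of its size.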
Viola and Jones [170] uses this representation to efficiently calculate rectangular features made up of positive and negative regions summed together, see figure 5.3 for some example features.
Figure 5.3: Some examples of the Haar-like rectangular features used by Viola and Jones [170]. Areas in white are subtracted from the areas in black.
Zhu, Yeh, Cheng and Avidan [183] extend this idea by combining the approach used in [170] and the integral histogram work of Porikli [125] to HOG [40], calculating integral images over the orientation histograms used by the HOG descriptors to accelerate the speed at which the descriptors can be calculated. In the integral histogram implementation of HOG, the edge gradients are calculated as with the HOG descriptor [40]; however, after the edge gradients have been calculated, they are interpolated between orientation channels and an integral image is calculated over each of the channels. These oriented integral histograms are used to calculate the gradient contribution for each cell by only sampling 4 points for each bin in the cell, instead of having to iterate over each pixel in a HOG descriptor cell.
For each cell in the descriptor block, the sum over the area covered by the cell R over all histograms is found by sampling 4 points for each orientation θ from the integral image (A_θ, B_θ, C_θ and D_θ) to give: sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ). This is done for the region that defines each descriptor cell to quickly generate histograms at any scale in constant time. The histograms are concatenated into a vector and normalised as with regular HOG to form the descriptor h = {h_i}, i = 1, …, c_w·c_h·n_θ, where c_w and c_h are the number of cells along the width and the height of the descriptor respectively, and n_θ is the number of orientations.

Figure 5.4: Diagram of integral histogram calculation for HOG. For each cell in the descriptor, the sum over the magnitudes for each orientation channel θ is concatenated to form the histogram of gradients for the cell using: sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ).

Given that this implementation is much faster than the original HOG implementation and almost as accurate [183], this algorithm is used for the HOG feature experiments presented in this chapter.
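The integral-histogram version of the cell computation can be sketched as follows, assuming the orientation channels have already been built as per-bin gradient-magnitude maps.

```python
import numpy as np

# Sketch of integral-histogram HOG: one integral image per orientation
# channel, so each histogram bin of a cell is the 4-sample area sum
# sum(R_theta) = ii(D) - ii(B) - ii(C) + ii(A), independent of cell size.

def channel_integrals(channels):
    # channels: (n_theta, H, W) gradient-magnitude map per orientation bin
    n, H, W = channels.shape
    padded = np.zeros((n, H + 1, W + 1))
    padded[:, 1:, 1:] = np.cumsum(np.cumsum(channels, axis=1), axis=2)
    return padded

def cell_histogram_from_integrals(ii, top, left, bottom, right):
    # 4 lookups per orientation bin, whatever the cell dimensions are
    return (ii[:, bottom, right] - ii[:, top, right]
            - ii[:, bottom, left] + ii[:, top, left])

channels = np.random.rand(9, 16, 16)          # 9 orientation bins
ii = channel_integrals(channels)
hist = cell_histogram_from_integrals(ii, 4, 4, 12, 12)   # an 8x8 cell
direct = channels[:, 4:12, 4:12].sum(axis=(1, 2))
assert np.allclose(hist, direct)              # matches the per-pixel sum
```

This is the source of the speed-up: the per-cell cost drops from O(cell area) to 4 lookups per orientation bin, and cells of any size come at the same cost.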
5.2.2 DAISY
The DAISY descriptor is a computationally efficient descriptor used for dense matching applications that can be calculated very efficiently compared to other state-of-the-art algorithms such as SIFT, while retaining good accuracy [162, 163].
The descriptor itself is made up of a sampling grid arranged in concentric circles around the descriptor origin. See figure 5.5 for an illustration. Gradient magnitudes are first calculated in different directions. A Gaussian is applied to the edge magnitudes for each orientation, and the result is sampled to build the innermost ring histograms. The Gaussian is applied again and the mid-region histograms are sampled. This process is repeated for each ring of samples.
Figure 5.5: DAISY descriptor construction. Gradient magnitudes are extracted from the source image and quantised into orientations layers. The orientation layers are consecutively blurred by a Gaussian and sampled by one of the sample rings of the descriptor at each step (highlighted in red).
Figure 5.6: Filters used to determine edge response in x and y for the SURF descriptor.
The Gaussian blurring is an efficient way of determining the weighted sum over a circular area, in this case the sample regions in the circular descriptor arrangement. Another advantage of this method is that the different sized sampling regions can be calculated by applying a small Gaussian kernel successively to incrementally sample the different sized regions. The speed of this descriptor makes it very attractive as a feature for use in a dense arrangement such as over a dense object detection window as used in the experiments of this chapter.
The default arrangement of 1 centre region, 3 rings and 8 samples per ring makes for quite a high dimensional descriptor compared to SIFT (though DAISY can be computed much more efficiently). A lower dimensional arrangement of 1 centre region, 2 rings and 4 samples per ring over 4 orientations brings the descriptor down to a similar size to the standard HOG descriptor arrangement, with an acceptable reduction in performance and computational efficiency comparable to integral histogram HOG.
One of the key sources of its efficiency comes from using successive convolution kernels applied to the image oriented gradient map to create lookup tables for the descriptor cell coverage.
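The successive-blurring idea can be sketched on a single orientation channel. This is a rough illustration only: a small separable kernel stands in for the real Gaussian, and the ring radii and sample counts are made-up geometry, not the published DAISY parameters.

```python
import numpy as np

# Rough sketch of the DAISY sampling idea: an orientation map is blurred by
# successive small kernels, and each ring of sample points reads its value
# from the map at the appropriate cumulative blur level, so larger outer
# regions come almost for free.

def blur(channel, k=np.array([0.25, 0.5, 0.25])):
    # small separable smoothing applied once per ring (cumulative blurring)
    c = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, c)

def daisy_samples(channel, centre, radii=(4, 8), n_samples=4):
    cy, cx = centre
    values = [channel[cy, cx]]                 # centre sample, least blurred
    for radius in radii:
        channel = blur(channel)                # blur again for the next ring
        for i in range(n_samples):
            ang = 2 * np.pi * i / n_samples
            y = int(round(cy + radius * np.sin(ang)))
            x = int(round(cx + radius * np.cos(ang)))
            values.append(channel[y, x])
    return np.array(values)

channel = np.random.rand(32, 32)
desc = daisy_samples(channel, centre=(16, 16))
assert desc.shape == (1 + 2 * 4,)              # centre + 2 rings of 4 samples
```

In the full descriptor this sampling is done per orientation channel and the sampled histograms are concatenated and normalised; the sketch keeps only the blur-then-sample structure.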
5.2.3 SURF
The Speeded-Up Robust Features (SURF) algorithm proposed by Bay et al. [13] presents a simple and computationally efficient keypoint descriptor. There are two variants: a rotation invariant descriptor used in sparse keypoint detection algorithms such as SIFT, and an 'upright' descriptor that does not orient the descriptor representation to its dominant gradient orientation. The 'upright' version of the descriptor is used for object detection and is the one used in the experiments of this chapter.
The descriptor itself consists of a 4 × 4 cell grid of 4-bin histograms. The histograms
are calculated from responses to a Haar-like edge filter in x and y. See figure 5.6 for an illustration of the process.
At a desired descriptor scale s, a region of size 20s (s = 0.8 gives a descriptor diameter of 16 pixels, and is used for the detection experiments discussed later) centred at the descriptor's location is convolved with Haar-like wavelets to determine horizontal and vertical edge responses. The region is then partitioned into 4 × 4 sub-regions. For each sub-region, Haar wavelet responses (filter size is 2s) are computed at 5 × 5 regularly spaced sample points, and are weighted with a Gaussian (σ = 3.3s) to increase robustness to localisation errors and geometric deformations. These weighted responses dx and dy over each cell are summed up to create a histogram vector v = (∑ dx, ∑ dy, ∑ |dx|, ∑ |dy|). The absolute values of the feature responses are included to express the polarity of the intensity changes over the region [13]. The vectors for each cell are concatenated and normalised to a unit vector to create a 64 dimensional descriptor. The wavelet responses are invariant to a constant offset in illumination, and invariance to contrast is achieved by normalising the descriptor.
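The sub-region accumulation just described can be sketched as follows. Simple finite differences stand in for the Haar wavelet responses, and the Gaussian weighting and 5 × 5 sample grid are omitted; only the 4 × 4 grid of (∑dx, ∑dy, ∑|dx|, ∑|dy|) vectors and the final normalisation are kept.

```python
import numpy as np

# Sketch of the upright SURF layout: 4x4 sub-regions, each contributing
# (sum dx, sum dy, sum |dx|, sum |dy|), concatenated and L2-normalised to a
# 64-dimensional vector. Finite differences stand in for Haar responses.

def upright_surf(patch):
    dx = patch[:, 1:] - patch[:, :-1]          # horizontal response
    dy = patch[1:, :] - patch[:-1, :]          # vertical response
    dx, dy = dx[:16, :16], dy[:16, :16]        # crop to a common 16x16 grid
    vec = []
    for sy in range(4):
        for sx in range(4):
            rx = dx[4 * sy:4 * sy + 4, 4 * sx:4 * sx + 4]
            ry = dy[4 * sy:4 * sy + 4, 4 * sx:4 * sx + 4]
            vec += [rx.sum(), ry.sum(), np.abs(rx).sum(), np.abs(ry).sum()]
    vec = np.array(vec)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec     # contrast invariance

desc = upright_surf(np.random.rand(17, 17))
assert desc.shape == (64,)                     # 4x4 sub-regions x 4 values
assert np.isclose(np.linalg.norm(desc), 1.0)
```

Keeping both the signed sums and the sums of absolute values is what lets the descriptor distinguish, for example, a region of uniform gradient from one of alternating gradients with the same net response.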
5.2.4 Chamfer Distance Features
Recall from chapter 2, section 2.6.3, that given a set of edge pixel coordinates O = {(x_i, y_i)}_{i=1}^{n_O} extracted from a binary edge template object I_O, and the edge pixel coordinates A = {(x_i, y_i)}_{i=1}^{n_A} extracted from the binary edge image I_A of a query image (e.g. thresholded edges from the current video frame), where n_O and n_A are the number of edge pixels found in template I_O and in image I_A respectively, then the (truncated) chamfer distance between them can be found using:

C(D_A, O, τ_d) = (1/|O|) ∑_{p∈O} min(D_A(p)², τ_d)        (5.2.4)
where τ_d is a threshold used to provide an upper limit on the values in the distance transform to increase stability (see section 2.6.3), and D_A is the distance transform function of A such that, for a given point p, the transform gives the distance to the nearest edge in A:
D_A(p) = min_{q∈A} ‖p − q‖        (5.2.5)
For multiple orientations, the chamfer distance is defined as:
C_Θ(D_A, O, τ_d) = (1/|Θ|) ∑_{θ∈Θ} C(D_{A_θ}, O_θ, τ_d)        (5.2.6)
where Θ is the set of orientation channels, and |Θ| gives the number of orientations.
Expanding C_Θ(D_A, O, τ_d) gives:

C_Θ(D_A, O, τ_d) = (1/|Θ|) ∑_{θ∈Θ} (1/|O_θ|) ∑_{p∈O_θ} min(D_{A_θ}(p)², τ_d)        (5.2.7)
where O_θ and A_θ are the sets of edge coordinates corresponding to orientation θ for the template object O and the query image A respectively, and D_{A_θ} is the distance transform for oriented edge coordinates A_θ, defined using the same notation as equation 5.2.5:
D_{A_θ}(p) = min_{q∈A_θ} ‖p − q‖        (5.2.8)
The distance transform function can be efficiently precalculated for a given query image, making the matching algorithm more computationally efficient. Matches are found by evaluating C_Θ(D_A, O, τ_d) at different locations in the image, where high values indicate a poor match and low values indicate a good match. A value of 0 means that the edges in O correspond exactly to edges in A, i.e. O ⊆ A, but a threshold τ_c is typically used to accept partial matches: C_Θ(D_A, O, τ_d) ≤ τ_c. The chamfer matching algorithm requires that edge templates can be extracted from a series of reference images of the desired object. These images generally require that the edges of the foreground can be separated from background edges, so that only the relevant edges are used in the matching algorithm.
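The truncated chamfer distance of equation 5.2.4 can be sketched directly from these definitions. The helper names are illustrative; the distance transform is computed by brute force rather than the efficient precalculation mentioned above, and orientation channels are omitted for brevity.

```python
import numpy as np

def distance_transform(edge_mask):
    """Brute-force Euclidean distance transform (eq. 5.2.5).

    Fine for small images; real implementations precalculate this
    with a linear-time algorithm.
    """
    ys, xs = np.nonzero(edge_mask)
    pts = np.stack([ys, xs], axis=1).astype(float)
    h, w = edge_mask.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"),
                    axis=-1).reshape(-1, 2).astype(float)
    # Distance from every pixel to its nearest edge pixel.
    d = np.sqrt(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(h, w)

def chamfer_score(dist, template_pts, tau):
    """Truncated chamfer distance of eq. 5.2.4 at a fixed location.

    template_pts is a list of (y, x) template edge coordinates
    already shifted to the candidate location; tau truncates the
    squared distances.
    """
    vals = np.array([dist[y, x] for (y, x) in template_pts])
    return np.minimum(vals ** 2, tau).mean()
```

Evaluating `chamfer_score` with the template shifted to each candidate location, and accepting scores below τ_c, reproduces the matching procedure described above; a perfectly aligned template scores 0.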
The following section proposes an alternative method that allows background edges to be present in the training images by formulating the template of an object as a weight vector in a linear SVM. This is similar to the work presented by Felzenszwalb [53], which learns weights for a Hausdorff distance transform classifier, but here the truncated chamfer distance transform is used. The algorithm is then tested on a standard detection dataset and the results are compared to the other algorithms.
The Partial Hausdorff Distance
Felzenszwalb [53] observed that algorithms such as the Hausdorff distance can be expressed as a generalisation of a linear classifier. The task of matching edges can be done as follows:
Figure 5.7: Illustration of the Hausdorff distance between edge pixel coordinates from object A (green) and object B (blue). If for each point in A, the closest point in B is found, then the Hausdorff distance is the largest of these distances.
If A and B are sets of 2D edge point coordinates, then the Hausdorff distance between the two sets is given by [53]:

h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖        (5.2.9)

The Hausdorff distance is the maximum distance over all the points in A to their nearest point in B, i.e. if for each point in A the closest point in B is found, then the Hausdorff distance is the largest of these distances. See figure 5.7 for an illustration. One of the strengths of this measure is that it does not require that the points in A and B match each other exactly, which makes it able to handle partial edge contours, minor occlusions, or even a slight variation in shape [53].
One problem with this distance measure is that it is not robust with respect to noise or outliers [53]. To deal with this, the K-th ranked value is used instead of the max in equation 5.2.9. The partial Hausdorff distance [53] is defined as:

h_K(A, B) = K^{th}_{a∈A} min_{b∈B} ‖a − b‖        (5.2.10)

Taking K = |A|/2, the partial distance is the median distance from the set of points in A to the set of points in B.
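A minimal sketch of this measure, assuming the point sets are given as (n, 2) coordinate arrays (the function name is illustrative):

```python
import numpy as np

def partial_hausdorff(A, B, K=None):
    """K-th ranked Hausdorff distance h_K(A, B) of eq. 5.2.10.

    A and B are (n, 2) arrays of edge point coordinates. With
    K = |A| this reduces to the classical Hausdorff distance of
    eq. 5.2.9; K = |A| // 2 gives the median distance.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Distance from every point in A to its nearest point in B.
    nearest = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)).min(1)
    K = len(A) if K is None else K
    return np.sort(nearest)[K - 1]      # K-th ranked value (1-indexed)
```

Because the K-th ranked value ignores the |A| − K worst-matching points, outlying edge points no longer dominate the score.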
Object Recognition
Using this partial Hausdorff distance measure, an input object B can be considered the same as object A if:

h_K(A, B) ≤ d        (5.2.11)
Felzenszwalb [53] observed that this condition holds exactly if at least K points from A are at most distance d from some point in B. Using the notion of dilation, where B⟨r⟩ denotes the set of points that are at most distance r from any point in B, h_K(A, B) ≤ d holds when at least K points from A are contained in B⟨d⟩. See figure 5.8 for an illustration of the dilation operation.
Figure 5.8: Illustration of the dilation operator. Shown on the left is a set of edge coordinates B; the middle diagram shows the dilation of B by distance d to give the set of points B⟨d⟩. The diagram on the right shows another set A that matches B⟨d⟩.
Let A be an m × n matrix with binary values 1 for all points in A and 0 otherwise, and let B⟨d⟩ be an m × n matrix with binary values 1 for all points in B⟨d⟩ and 0 otherwise. Then let a = vec(A) and b⟨d⟩ = vec(B⟨d⟩) be mn × 1 vectorisations of these matrices respectively, where vec(·) is defined as:
vec(M) = (M_{1,1}, M_{2,1}, …, M_{m,1}, M_{1,2}, M_{2,2}, …, M_{m,2}, …, M_{m,n})^T        (5.2.12)
where M is an arbitrary m × n matrix, and superscript T denotes the transpose.
Using this vectorised representation, the partial Hausdorff condition h_K(A, B) ≤ d can be expressed by a dot product between a and b⟨d⟩:

h_K(A, B) ≤ d  ⇔  a^T b⟨d⟩ ≥ K        (5.2.13)
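The equivalence in equation 5.2.13 can be illustrated on binary masks. The sketch below dilates B by brute force on a discrete grid and counts the points of A falling inside the dilation via a dot product; the function name and the pixel-grid dilation are assumptions for illustration.

```python
import numpy as np

def hausdorff_dot_check(A_mask, B_mask, d, K):
    """Check h_K(A, B) <= d via the dot product of eq. 5.2.13.

    A_mask and B_mask are binary m x n arrays; B_mask is dilated by
    distance d to form B<d>, and a^T b<d> counts the points of A
    inside the dilation.
    """
    by, bx = np.nonzero(B_mask)
    h, w = B_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # B<d>: pixels within distance d of some point of B (brute force).
    dil = np.zeros_like(B_mask, dtype=bool)
    for y, x in zip(by, bx):
        dil |= (ys - y) ** 2 + (xs - x) ** 2 <= d * d
    a = A_mask.astype(int).ravel()       # a = vec(A)
    b_d = dil.astype(int).ravel()        # b<d> = vec(B<d>)
    return int(a @ b_d) >= K             # a^T b<d> >= K
```

The dot product simply counts how many points of A land inside B⟨d⟩, which is exactly the condition stated above.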
Felzenszwalb [53] applied this approach to human detection in a PAC learning framework [168], using the perceptron algorithm [105] for hypotheses, and promising results on human detection were reported. Figure 5.9 shows a sample detection made by the classifier together with the corresponding learnt weights; a roughly human shape is visible in the positive weights.
Learning Chamfer Weights
Figure 5.9: Some results as reported by Felzenszwalb [53]. Image (a) shows an example detection made by the classifier, (b) shows the weights learnt by the PAC classifier, and (c) shows only the positive weights.
Using a similar approach to the one discussed in section 5.2.4, the chamfer matching algorithm can also be expressed as a dot product in a linear classifier learning algorithm such as an SVM.
Given an object template O and a query image A that are both sets of edge pixel coordinates, let O_θ and A_θ denote the subsets of edge pixel coordinates for a given orientation channel θ.
Using a similar formulation to the one used in section 5.2.4, let O_θ be an m × n matrix with binary values 1 for points in O_θ and 0 otherwise, and let o_θ = vec(O_θ^T) be the vectorisation of the transpose of this matrix, an mn × 1 column vector. The values are transposed so that the rows instead of the columns are concatenated to form the vectorisation, which is more efficient to access in memory than concatenating the columns.
The matrix O is an mn × |Θ| matrix where each of the columns is one of the vectorised orientations o_θ for θ ∈ Θ. This matrix is also vectorised to give o = vec(O), so that the vectorised orientations are concatenated to form a single (mn · |Θ|) × 1 column vector:
o = ( vec(O_1^T)^T, vec(O_2^T)^T, …, vec(O_{|Θ|}^T)^T )^T = ( o_1^T, o_2^T, …, o_{|Θ|}^T )^T        (5.2.14)
Recall that D_A is the distance transform of edge pixel coordinates A; let this now represent an m × n matrix where each entry of D_A is the distance to the nearest edge coordinate in A. For oriented chamfer matching, there is an oriented m × n distance transform matrix D_{A_θ} for each orientation channel θ and corresponding edge pixel coordinates A_θ.
Using the same notation for vectorisation as with the template object O, the orientated distance transforms for the query image A are vectorised and concatenated together so that:
d = ( vec(D_{A_1}^T)^T, vec(D_{A_2}^T)^T, …, vec(D_{A_{|Θ|}}^T)^T )^T = ( d_{A_1}^T, d_{A_2}^T, …, d_{A_{|Θ|}}^T )^T        (5.2.15)
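The stacking of equation 5.2.15, combined with the squaring and truncation required by equation 5.2.7, might be sketched as follows. The function name is an assumption, and row-major flattening is used to mirror the vec(D^T) convention above.

```python
import numpy as np

def chamfer_feature_vector(dist_maps, tau):
    """Build the (mn * |Theta|) x 1 vector d of eq. 5.2.15, with each
    entry squared and truncated at tau as in eq. 5.2.7.

    dist_maps is a list of m x n oriented distance transform arrays
    D_{A_theta}, one per orientation channel theta.
    """
    parts = []
    for D in dist_maps:
        D = np.asarray(D, dtype=float)
        # Row-major flattening of D equals vec(D^T) in the thesis'
        # column-stacking convention.
        parts.append(np.minimum(D ** 2, tau).ravel(order="C"))
    return np.concatenate(parts)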
Next, the values in d are transformed to satisfy the min(D_{A_θ}(p)², τ_d) term in equation 5.2.7, so that the squared distances are no higher than the distance threshold τ_d: