Combining Computer Vision and Knowledge Acquisition to Provide Real-Time Activity Recognition for Multiple Persons within Immersive Environments

Anuraag Sridhar

School of Computer Science and Engineering
University of New South Wales, Sydney, Australia

A thesis submitted in fulfilment of the requirements for the degree of Philosophiæ Doctor (PhD)

2012

Abstract

In recent years, vision has gained increasing importance as a method of human-computer interaction. Vision techniques are becoming especially popular within immersive environment systems, which rely on innovation and novelty, and whose large spatial areas benefit from the unintrusive operation of visual sensors. However, despite the vast amount of research on vision-based analysis and on immersive environments, there is a considerable gap in coupling the two fields. In particular, a vision system that can recognise the activities of multiple persons within the environment in real time would be highly beneficial, providing data not just for real-virtual interaction but also for sociology and psychology research.

This thesis presents novel solutions for two important vision tasks that support the ultimate goal of performing activity recognition of multiple persons within an immersive environment. Although the work within this thesis has been tested in a specific immersive environment, namely the Advanced Visualisation and Interaction Environment, the components and frameworks can easily be carried over to other immersive systems.

The first task is the real-time tracking of multiple persons as they navigate within the environment. Numerous low-level algorithms, which leverage the spatial positioning of the cameras, are combined in an innovative manner to provide a high-level, extensible framework that robustly tracks up to 10 persons within the immersive environment. The framework uses multiple cameras distributed over multiple computers for efficiency, and supports additional cameras for greater coverage of larger areas.

The second task is that of converting the low-level feature values derived from an underlying vision system into activity classes for each person. Such a system can be used in later stages to recognise increasingly complex activities using symbolic logic. An on-line, incremental knowledge acquisition (KA) philosophy is utilised for this task, which allows the introduction of additional features and classes even during system operation. The philosophy lends itself to a robust software framework, which allows a vision programmer to add activity classification rules to the system ad infinitum. The KA framework provides automated knowledge verification techniques and leverages the power of human cognition to provide computationally efficient yet accurate classification. The final system is able to discriminate 8 different activities performed by up to 5 persons.

Dedication

I dedicate this thesis to my dear parents. To Amma, for the love and support that she gave me along the long and arduous journey, and for inspiring me to pursue creative skills along with the technical. To Nanna, for encouraging me to pursue education and scientific endeavours from a very young age, and for imbuing in me a deep passion for reading and learning.

Acknowledgements

Along the long and difficult road that led to the final goal that is this thesis, I have been fortunate to have the support and encouragement of many people, to whom I owe my deepest gratitude.
First are my supervisor Arcot Sowmya and my co-supervisor Paul Compton. They have both been incredibly supportive and patient, and have inspired in me a great appreciation for the world of academia and the pursuit of knowledge. Without their help, this thesis simply could not have been achieved. I extend my deepest gratitude to Avishkar Misra, who drew me into the concept of Ripple-Down Rules and provided great insights and conversations along the way.

Thanks also go to my PhD review group, Maurice Pagnucco, Claude Sammut, and Alan Blair, who showed a keen interest in my work during my presentations and asked some very insightful and thought-provoking questions. I would especially like to thank Maurice for providing an immense amount of guidance during some very difficult times.

The people at the iCinema Centre for Interactive Cinema Research provided me with the resources to see my thesis through to the end. For this reason, I thank Jeffrey Shaw and Dennis Del Favero for allowing me to use the iCinema system for my research. The person to whom I owe the most thanks at iCinema, however, is Ardrian Hardjono, who provided me with not just physical resources but also moral guidance and a strong friendship. I have also become highly indebted to the many people at iCinema who started as colleagues but, through their incredible support, have become my dear friends over the course of the PhD. These are Alex Kuptsov, Som Guan, Robin Chow, Marc Yu-San Chee, Densan Obst, Matthew Mcginity, Jared Berghold, Piyush Bedi, Rob Lawther, Sue Midgley, and Volker Kuchelmeister.

Finally, I would like to thank my friends and family, who provided so much encouragement and support to see me through to the end. Thanks to my parents, Kumar and Ambika Sridhar, for their love, support and patience. Thanks to Oleg Sushkov, for many great coffee conversations about coding and life in general. Thanks to Boris Derevnin and Vince Clemente for the many dinners where the question of "Are you finished yet?" kept me on the path. Last but in no manner the least, I thank the many researchers whose shoulders have provided such terrific foundations.

Publications arising from Thesis

1. A. Sridhar and A. Sowmya. SparseSPOT: Using a priori 3-D tracking for real-time multi-person voxel reconstruction. In Proceedings of the 16th ACM Symposium on Virtual Reality Software and Technology, pages 135–138. ACM, 2009.

2. A. Sridhar and A. Sowmya. Distributed, multi-sensor tracking of multiple participants within immersive environments using a 2-cohort camera setup. Machine Vision and Applications, pages 1–17, 2010.

3. A. Sridhar and A. Sowmya. Multiple camera, multiple person tracking with pointing gesture recognition in immersive environments. In Advances in Visual Computing, pages 508–519. Springer, 2008.

4. A. Sridhar, A. Sowmya, and P. Compton. On-line, incremental learning for real-time vision based movement recognition. In The 9th International Conference on Machine Learning and Applications, 2010.

Contents

List of Figures
List of Tables

1 Computer Vision, Activity Recognition, and the Immersive Environment
  1.1 Activity Recognition within Immersive Environments
  1.2 Thesis Scope
  1.3 Research Overview
    1.3.1 Tracking Multiple Persons
    1.3.2 A Knowledge-Based Approach to Activities
  1.4 Thesis Contribution
  1.5 Thesis Organization

2 Activity Recognition within the Large Scale Projection Display
  2.1 Real-Time Performance
  2.2 Tracking Systems
    2.2.1 Lighting and Colour
    2.2.2 Camera Calibration
    2.2.3 Segmentation
    2.2.4 Monocular Tracking Systems
    2.2.5 Stereo Tracking Systems
    2.2.6 Multi-Camera Tracking System
    2.2.7 Bayesian State Estimation
  2.3 Activity Recognition
    2.3.1 Rigid-Body Motion Analysis
    2.3.2 The Use of Joint Models
    2.3.3 The Silhouette and the Visual Hull
    2.3.4 Activity Classes: The Benefits of Cognitive Rules
  2.4 Summary

3 Immersitrack: Tracking Multiple Persons in Real-Time using a 2-Cohort Camera Setup
  3.1 Immersitrack Architecture
    3.1.1 The Slave Process
    3.1.2 The Master Process
  3.2 Camera Calibration
    3.2.1 Camera Parameters
    3.2.2 Calibration Accuracy
  3.3 Overhead Cameras
    3.3.1 2-D Motion Models
    3.3.2 Correspondence
    3.3.3 Tracker Switching
  3.4 Oblique Cameras
    3.4.1 Body Localization
  3.5 Determining 3-D Hypotheses on the Master
    3.5.1 3-D Motion Models
    3.5.2 Assign Blobs to Existing Persons
    3.5.3 Group Unassigned Blobs
    3.5.4 Reconstructing World Position
    3.5.5 Verifying Observed Persons
  3.6 Experimental Results
    3.6.1 Baseline Error Values for Single Person Tracking
    3.6.2 Framerates and Filter Prediction
    3.6.3 Tracking Multiple Persons
  3.7 SparseSPOT: Real-Time Multi-Person Visual Hull Reconstruction
    3.7.1 SPOT
    3.7.2 SparseSPOT: Leveraging Immersitrack for Spatial Coherency
    3.7.3 Performance Comparison between SPOT and SparseSPOT
    3.7.4 Final Frame-rates of Immersitrack and SparseSPOT
  3.8 Summary

4 Real-World Applications of Immersitrack
  4.1 A Real-Time Pointing Gesture Recognition System
    4.1.1 Finger Detection in Overhead Cameras
    4.1.2 Finger Detection in Oblique Cameras
    4.1.3 3-D Reconstruction
    4.1.4 Head Position
    4.1.5 Multiple Persons
    4.1.6 Experimental Results: Pointing Gestures vs. Inertial Sensor
    4.1.7 Application
  4.2 Scenario
  4.3 Mnajdra
  4.4 Summary

5 An Incremental Knowledge Acquisition Framework for Activity Recognition
  5.1 Ripple Down Rules
    5.1.1 Single Classification RDR (SCRDR)
    5.1.2 Knowledge Revision
    5.1.3 Multiple Classifications
    5.1.4 Generalisation
  5.2 Encoding Knowledge for Activity Recognition
    5.2.1 Rule Format
    5.2.2 Static versus Dynamic Attributes
    5.2.3 Cases and the Temporal Nature of Activities
    5.2.4 The Final Rule Format