Combining Computer Vision and Knowledge Acquisition to provide Real-Time Activity Recognition for Multiple Persons within Immersive Environments

Anuraag Sridhar

School of Computer Science and Engineering

University of New South Wales, Sydney, Australia

A thesis submitted in fulfilment of the requirements for the degree of

Philosophiæ Doctor (PhD)

2012

Abstract

In recent years, vision is gaining increasing importance as a method of human-computer interaction. Vision techniques are becoming popular especially within immersive environment systems, which rely on innovation and novelty, and in which the large spatial area benefits from the unintrusive operation of visual sensors. However, despite the vast amount of research on vision-based analysis and immersive environments, there is a considerable gap in coupling the two fields. In particular, a vision system that can provide recognition of the activities of multiple persons within the environment in real time would be highly beneficial in providing data not just for real-virtual interaction, but also for sociology and psychology research.

This thesis presents novel solutions for two important vision tasks that support the ultimate goal of performing activity recognition of multiple persons within an immersive environment. Although the work within this thesis has been tested in a specific immersive environment, namely the Advanced Visualisation and Interaction Environment, the components and frameworks can be easily carried over to other immersive systems. The first task is the real-time tracking of multiple persons as they navigate within the environment. Numerous low-level algorithms, which leverage the spatial positioning of the cameras, are combined in an innovative manner to provide a high-level, extensible framework that provides robust tracking of up to 10 persons within the immersive environment. The framework uses multiple cameras distributed over multiple computers for efficiency, and supports additional cameras for greater coverage of larger areas.

The second task is that of converting the low-level feature values derived from an underlying vision system into activity classes for each person. Such a system can be used in later stages to recognize increasingly complex activities using symbolic logic. An on-line, incremental knowledge acquisition (KA) philosophy is utilised for this task, which allows the introduction of additional features and classes even during system operation. The philosophy lends itself to a robust software framework, which allows a vision programmer to add activity classification rules to the system ad infinitum. The KA framework provides automated knowledge verification techniques and leverages the power of human cognition to provide computationally efficient yet accurate classification. The final system is able to discriminate 8 different activities performed by up to 5 persons.

Dedication

I dedicate this thesis to my dear parents.

To Amma, for the love and support that she gave to me along the long and arduous journey and for inspiring me to pursue creative skills along with the technical.

To Nanna, for encouraging me to pursue education and scientific endeavours from a very young age and for imbuing within me a deep passion for reading and learning.

Acknowledgements

Along the long and difficult road that led to the final goal that is this thesis, I have been fortunate to have the support and encouragement of many people to whom I owe my deepest gratitude. First are my supervisor, Arcot Sowmya, and co-supervisor, Paul Compton. They have both been incredibly supportive and patient and have inspired in me a great appreciation for the world of academia and the pursuit of knowledge. Without their help, this thesis simply could not have been completed.

I extend my deepest gratitude to Avishkar Misra, who drew me into the concept of Ripple-Down Rules and provided great insights and conversations along the way. Thanks also go to my PhD review group, Maurice Pagnucco, Claude Sammut, and Alan Blair, who showed a keen interest in my work during my presentations and who asked some very insightful and thought-provoking questions. I would especially like to thank Maurice for providing an immense amount of guidance during some very difficult times.

The people at the iCinema Centre for Interactive Cinema Research provided me with the resources to see my thesis through to the end. For this reason, I thank Jeffrey Shaw and Dennis Del Favero for allowing me to use the iCinema system for my research. The person to whom I owe the most thanks at iCinema, however, is Ardrian Hardjono, who provided me with not just physical resources but also moral guidance and a strong friendship. I have also become highly indebted to the many people at iCinema who started as colleagues but, through their incredible support, have become my dear friends over the course of the PhD. These are Alex Kuptsov, Som Guan, Robin Chow, Marc Yu-San Chee, Densan Obst, Matthew Mcginity, Jared Berghold, Piyush Bedi, Rob Lawther, Sue Midgley, and Volker Kuchelmeister.

Finally, I would like to thank my friends and family who provided so much encouragement and support to see me through to the end. Thanks to my parents, Kumar and Ambika Sridhar, for their love, support and patience. Thanks to Oleg Sushkov, for many great coffee conversations about coding and life in general. Thanks to Boris Derevnin and Vince Clemente for the many dinners where the question of "Are you finished yet?" kept me on the path.

Last, but by no means least, I thank the many researchers whose shoulders have provided such terrific foundations.

Publications arising from Thesis

1. A. Sridhar and A. Sowmya. SparseSPOT: using a priori 3-D tracking for real-time multi-person voxel reconstruction. In Proceedings of the 16th ACM Symposium on Virtual Reality Software and Technology, pages 135–138. ACM, 2009.

2. A. Sridhar and A. Sowmya. Distributed, multi-sensor tracking of multiple participants within immersive environments using a 2-cohort camera setup. Machine Vision and Applications, pages 1–17, 2010.

3. A. Sridhar and A. Sowmya. Multiple Camera, Multiple Person Tracking with Pointing Gesture Recognition in Immersive Environments. In Advances in Visual Computing, pages 508–519. Springer, 2008.

4. A. Sridhar, A. Sowmya, and P. Compton. On-line, Incremental Learning for Real-Time Vision Based Movement Recognition. In The 9th International Conference on Machine Learning and Applications, 2010.

Contents

List of Figures

List of Tables

1 Computer Vision, Activity Recognition, and the Immersive Environment
  1.1 Activity Recognition within Immersive Environments
  1.2 Thesis Scope
  1.3 Research Overview
    1.3.1 Tracking Multiple Persons
    1.3.2 A Knowledge-Based Approach to Activities
  1.4 Thesis Contribution
  1.5 Thesis Organization

2 Activity Recognition within the Large Scale Projection Display
  2.1 Real-Time Performance
  2.2 Tracking Systems
    2.2.1 Lighting and Colour
    2.2.2 Camera Calibration
    2.2.3 Segmentation
    2.2.4 Monocular Tracking Systems
    2.2.5 Stereo Tracking Systems
    2.2.6 Multi-Camera Tracking System
    2.2.7 Bayesian State Estimation
  2.3 Activity Recognition
    2.3.1 Rigid-Body Motion Analysis
    2.3.2 The Use of Joint Models
    2.3.3 The Silhouette and the Visual Hull
    2.3.4 Activity Classes: The Benefits of Cognitive Rules
  2.4 Summary

3 Immersitrack: Tracking Multiple Persons in Real-Time using a 2-Cohort Camera Setup
  3.1 Immersitrack Architecture
    3.1.1 The Slave Process
    3.1.2 The Master Process
  3.2 Camera Calibration
    3.2.1 Camera Parameters
    3.2.2 Calibration Accuracy
  3.3 Overhead Cameras
    3.3.1 2-D Motion Models
    3.3.2 Correspondence
    3.3.3 Tracker Switching
  3.4 Oblique Cameras
    3.4.1 Body Localization
  3.5 Determining 3-D Hypotheses on the Master
    3.5.1 3-D Motion Models
    3.5.2 Assign Blobs to Existing Persons
    3.5.3 Group Unassigned Blobs
    3.5.4 Reconstructing World Position
    3.5.5 Verifying Observed Persons
  3.6 Experimental Results
    3.6.1 Baseline Error Values for Single Person Tracking
    3.6.2 Framerates and Filter Prediction
    3.6.3 Tracking Multiple Persons
  3.7 SparseSPOT: Real-Time Multi-Person Visual Hull Reconstruction
    3.7.1 SPOT
    3.7.2 SparseSPOT: Leveraging Immersitrack for Spatial Coherency
    3.7.3 Performance Comparison between SPOT and SparseSPOT
    3.7.4 Final Frame-rates of Immersitrack and SparseSPOT
  3.8 Summary

4 Real-World Applications of Immersitrack
  4.1 A Real-Time Pointing Gesture Recognition System
    4.1.1 Finger Detection in Overhead Cameras
    4.1.2 Finger Detection in Oblique Cameras
    4.1.3 3-D Reconstruction
    4.1.4 Head Position
    4.1.5 Multiple Persons
    4.1.6 Experimental Results: Pointing Gestures vs. Inertial Sensor
    4.1.7 Application
  4.2 Scenario
  4.3 Mnajdra
  4.4 Summary

5 An Incremental Knowledge Acquisition Framework for Activity Recognition
  5.1 Ripple Down Rules
    5.1.1 Single Classification RDR (SCRDR)
    5.1.2 Knowledge Revision
    5.1.3 Multiple Classifications
    5.1.4 Generalisation
  5.2 Encoding Knowledge for Activity Recognition
    5.2.1 Rule Format
    5.2.2 Static versus Dynamic Attributes
    5.2.3 Cases and the Temporal Nature of Activities
    5.2.4 The Final Rule Format
  5.3 Implementation and the Roles within the Framework
  5.4 The User Interface
    5.4.1 Consequent Selection
    5.4.2 Rule Selection
    5.4.3 Cases Selection
    5.4.4 Cases Display
    5.4.5 Feature Selection
    5.4.6 Antecedent View
    5.4.7 Fired Cornerstones
    5.4.8 Confirmation
  5.5 Summary

6 Applications of the RDR Framework to Activity Recognition
  6.1 CAVIAR: Activity Recognition in a Surveillance Setting
    6.1.1 The CAVIAR Case
  6.2 AVIE: Activity Recognition in the Immersive Environment
    6.2.1 The AVIE Case
  6.3 Training and Evaluation Protocol
  6.4 Resulting Features Designed for both Datasets
    6.4.1 CAVIAR Features
    6.4.2 AVIE Features
  6.5 Experimental Results and Discussion
    6.5.1 Overall Performance
    6.5.2 Consistency vs. Generalisation Ability
    6.5.3 Complexity of Training
    6.5.4 Number of Corrections over Time
    6.5.5 Time Taken
  6.6 Discussion on the Datasets
    6.6.1 Discussion on CAVIAR
    6.6.2 Discussion on AVIE
  6.7 Situated Cognition
  6.8 Summary

7 Conclusion
  7.1 Contributions
    7.1.1 Contributions to Computer Vision
    7.1.2 Contribution to Knowledge Acquisition
    7.1.3 Contribution to Immersive Environments
  7.2 Limitations and Future Work
  7.3 Final Remarks

A Reducing the Minimum-Weight Edge Cover of a Bipartite Graph to a Linear Assignment Problem

References

List of Figures

1 Computer Vision, Activity Recognition, and the Immersive Environment
  1.1 CAVE
  1.2 The AlloSphere
  1.3 VIDEOPLACE
  1.4 Computer Vision in the Arts
  1.5 KidsRoom
  1.6 Advanced Visualisation and Interaction Environment
  1.7 Example Applications within AVIE

2 Activity Recognition within the Large Scale Projection Display
  2.1 Technological Variables influencing Tele-Presence
  2.2 Active Illumination
  2.3 Electromagnetic Spectrum
  2.4 Visible Light compared to Near-Infrared
  2.5 Pinhole Camera Model
  2.6 The W4 system
  2.7 Group Tracking under Occlusion
  2.8 Probabilistic combination of tracking algorithms
  2.9 Depth Image from a Stereo Camera
  2.10 Camera Hand-off
  2.11 Probabilistic Occupancy Map Tracker
  2.12 Kalman Filter Equations
  2.13 Kalman Filter Example
  2.14 Moving Light Displays
  2.15 Rigid Body and Articulated Motion Tracking
  2.16 Marker-less Model Reconstruction
  2.17 3-D Skeleton Fitting
  2.18 Voxel Model Fitting and Tracking
  2.19 Depth-Based Joint Localisation
  2.20 Actions in Low-Resolution Video
  2.21 Motion Energy Image
  2.22 Object Silhouette Projections
  2.23 The Visual Hull in 2-D
  2.24 Volumetric and Polygonal Representations of the Visual Hull
  2.25 Voxel-based Visual Hull

3 Immersitrack: Tracking Multiple Persons in Real-Time using a 2-Cohort Camera Setup
  3.1 Tracking System Architecture
  3.2 AVIE Hardware Setup
  3.3 Camera Hardware
  3.4 Camera Layout
  3.5 AVIE Camera Layout in 3-D
  3.6 Slave Process
  3.7 Foreground Extraction
  3.8 Segmentation
  3.9 Master Process
  3.10 Coplanar Calibration Grid Markers
  3.11 Calibration Grid viewed from Cameras
  3.12 3-D Calibration Rig
  3.13 Reprojection Error
  3.14 Multiple Participants in One Blob
  3.15 Graph Theoretical Tracking
  3.16 Switching Tracking Methods
  3.17 Overhead Tracking Results
  3.18 Extracting body segments in oblique view cameras
  3.19 Final Results in Oblique View Cameras
  3.20 Reconstruction Algorithms
  3.21 Points returned from HighestPoint and LowestPoint
  3.22 Occlusion in oblique cameras
  3.23 Entrance Rule
  3.24 Tracking 3 Persons
  3.25 Single Person Standing Still
  3.26 Error vs. Spatial Locations
  3.27 Straight Line Tests
  3.28 Kalman Filter Test
  3.29 Kalman Filter Test (Frames 1–150)
  3.30 Kalman Filter Test (Frames 151–300)
  3.31 Kalman Filter Test (Frames 301–450)
  3.32 Kalman Filter Test (Frames 451–600)
  3.33 Graph of Multiple Object Tracking Results
  3.34 Precision vs. Person for Persons 1–7
  3.35 Precision vs. Person for Persons 8–14
  3.36 Precision vs. Person for Persons 15–21
  3.37 Comparison between AVIE and SPOT environments
  3.38 SPOT vs. SparseSPOT
  3.39 SparseSPOT Reconstruction
  3.40 SPOT vs. SparseSPOT Comparison: Experiment 1
  3.41 SPOT vs. SparseSPOT Comparison: Experiment 2
  3.42 SparseSPOT: Comparison between Performance and Person Count

4 Real-World Applications of Immersitrack
  4.1 Fingertip Detection in Overhead Camera
  4.2 Fingertip Refinement in Overhead Cameras
  4.3 Hand Extents in Oblique Camera
  4.4 Oblique Camera Finger Localization Results
  4.5 Final finger detection results in oblique cameras
  4.6 Oblique Camera Finger Detection for Multiple Persons
  4.7 Fingertip World Lines
  4.8 Results from finger tracking and head reconstruction
  4.9 Finger Tracking Results for Three Persons
  4.10 Pointing Gesture (white) and Inertial Sensor (grey) in 3-D
  4.11 Theta vs. Time for Side-to-Side Motion
  4.12 Height vs. Time for Up-and-Down Motion
  4.13 Height vs. Time for Up-and-Down Motion with Offset
  4.14 Real World Pointing Gesture Application
  4.15 Scenario Scene 1
  4.16 Scenario Scene 2
  4.17 Spaces of Mnajdra Project
  4.18 Tracking triggered Transition

5 An Incremental Knowledge Acquisition Framework for Activity Recognition
  5.1 RDR Hierarchy Example
  5.2 RDR Inference Example
  5.3 RDR Revision Example 1
  5.4 RDR Revision
  5.5 MCRDR
  5.6 FlatRDR
  5.7 Generalisation Example 1
  5.8 Generalisation Example 2
  5.9 Generalisation Example 3
  5.10 Rule Structure
  5.11 RDR Framework Class Diagram
  5.12 Defining a Case
  5.13 Defining Features
  5.14 Defining Consequents
  5.15 RDR User Interface for Activity Recognition
  5.16 Selecting a Consequent
  5.17 Selecting a Rule
  5.18 Selecting Cases
  5.19 Feature Selection
  5.20 Feature graph for Range Feature
  5.21 Selecting a Time Span
  5.22 Selecting Upper and Lower Bounds
  5.23 Antecedent View
  5.24 Error Cornerstones Table
  5.25 Feature Graph for Cornerstones
  5.26 Confirmation of the New Rule

6 Applications of the RDR Framework to Activity Recognition
  6.1 CAVIAR Frame
  6.2 Displaying the CAVIAR Case
  6.3 AVIE Activity Classes
  6.4 Total Error Rate on CAVIAR Dataset
  6.5 Total Error Rate on AVIE Dataset
  6.6 Seen vs. Unseen Error on CAVIAR Dataset
  6.7 Seen vs. Unseen Error on AVIE Dataset
  6.8 Training Complexity in CAVIAR Dataset
  6.9 Training Complexity in AVIE Dataset
  6.10 Number of Corrections made over time for CAVIAR
  6.11 Number of Corrections made over time for AVIE
  6.12 Corrections and Classifications for AVIE
  6.13 Activity Class Ambiguity 1
  6.14 Walking/Running Ambiguity
  6.15 Activity Class Ambiguity 2

List of Tables

3 Immersitrack: Tracking Multiple Persons in Real-Time using a 2-Cohort Camera Setup
  3.1 Camera Parameters for Cameras 1–4
  3.2 Camera Parameters for Cameras 5–7
  3.3 Camera Parameters for Cameras 9–12
  3.4 Camera Parameters for Cameras 13–16
  3.5 Reprojection Error
  3.6 Average Error Rates for a single still person
  3.7 Quantitative Results for Error vs. Spatial Location Tests
  3.8 Quantitative Results for Single Person Straight Line tests
  3.9 Framerates of different components of the system
  3.10 Average Error for Kalman Filter Test
  3.11 Multiple Object Tracking Results for the Tracking System
  3.12 Cumulative Statistics for Multiple Object Tracking Results
  3.13 Properties of the SPOT vs. SparseSPOT Experiments
  3.14 Frame rates for Immersitrack and SparseSPOT

4 Real-World Applications of Immersitrack
  4.1 Discrepancy Statistics between Pointing Gesture and Inertial Sensor

6 Applications of the RDR Framework to Activity Recognition
  6.1 CAVIAR Movement Classes
  6.2 CAVIAR Attributes
  6.3 AVIE Case
  6.4 Properties of the Knowledge Bases
  6.5 Final complexity measures for the knowledge bases
  6.6 Average Time taken for Knowledge Bases
  6.7 Confusion Matrix for CAVIAR Dataset
  6.8 Confusion Matrix for the AVIE Dataset

List of Algorithms

  3.1 Assign previously tracked overhead blobs to existing persons
  3.2 Assign untracked overhead blobs to existing persons
  3.3 Assign oblique blobs to existing persons
  3.4 Group unassigned overhead blobs
  3.5 Assign oblique blobs to new persons
  3.6 SPOT
  3.7 SparseSPOT
  4.1 Determine 2-D fingertip correspondence
  4.2 Determine 3-D fingertip correspondence
  5.1 SCRDR Inference
  5.2 MCRDR Inference
  5.3 FlatRDR Inference


Chapter 1

Computer Vision, Activity Recognition, and the Immersive Environment

Computer vision is fast becoming an acceptable method to provide interaction between real and virtual worlds. At a time when there is a continuous search for new and innovative methods of interfacing with computers, vision systems present solutions that are not burdened by wires and wearable sensors. Cameras work from afar, and the visual nature of their data is more intuitive for programmers than that of other sensors, while also carrying considerably more information.

However, although research has led to a better understanding of how data enters a human's vision system, the cognitive interpretation of the information contained in the data is incredibly complex and still not completely understood. It is well known that humans leverage experience and knowledge to perform the interpretation, and it is extremely difficult to store and use such experience and knowledge in a computer.

Despite these challenges, vision systems are growing increasingly advanced and gaining greater acceptance in mainstream society. They are currently being used in surveillance systems, defence, sports, user interfaces, and robotics, along with a vast number of other applications. This massive increase in practical applications is a result of continuing advances in vision software and the availability of cheap yet effective vision sensors.

The fields of virtual reality and immersive environments are particularly embracing vision techniques as a novel method of interacting with the virtual environment. Practitioners in these areas are usually artists, and are particularly fond of the visual nature of the data that can be obtained from vision sensors. In many such systems, data from vision sensors is fed directly into the virtual environment, because the imagery is impressive enough on its own.

1.1 Activity Recognition within Immersive Environments

Ever since the inception of the CAVE virtual environment by Cruz-Neira et al. (1993), the large-scale projection display has become an increasingly popular method for providing virtual immersion. Prior to the design of such a system, virtual reality was most often symbolised by the headset and goggles. The use of projected displays meant that one could do away with such encumbering technologies. Advances in graphics technologies have now made it considerably easier to provide high-quality projections yielding very impressive displays. An example of the CAVE system in operation is given in Figure 1.1.

Figure 1.1: An example of the CAVE Automatic Virtual Environment (Cruz-Neira et al., 1993)

The CAVE system has seen numerous implementations, such as the Blue-C project (Gross et al., 2003) and the NICE project (Roussos et al., 1997), and has been extended to curved display metaphors such as cylindrical and spherical screens (Höllerer et al., 2007; McGinity et al., 2007), an example of which is shown in Figure 1.2.

Figure 1.2: The AlloSphere: a spherical realization of the CAVE environment (Höllerer et al., 2007)

Such immersive systems and virtual reality systems provide benefits that go well beyond human-computer interaction research. For example, Blascovich (2002) presents the case for the use of immersive virtual environment technology (IVET) for sociology experiments. Blascovich argues that an individual can feel just as much presence in a virtual space as in a physical space and that “interpersonal self-relevance, the importance to the individual’s sense of self, of interpersonal interactions operates in basically the same manner within synthetic or virtual worlds as in natural social environments”. McCall and Blascovich (2009) extended this work to show the benefits that IVET provides towards researching behavioural mechanisms.

Thus, these systems provide advanced experimental platforms for research not just in technological fields, but also in the sociological and psychological fields. In such spaces, concepts like group behaviour (Slater et al., 2000), real-virtual interaction (Kenny et al., 2007), and the influence of virtual behaviours on real experiences (Pertaub et al., 2002) are just some of the problems that have been investigated. Because the rewards and punishments can be highly controlled through the display mechanism, a wide variety of sociology experiments can be conducted.

However, in order for the immersive display system to be effective in sociology research, one has to maximise the level of presence experienced within it. Presence is the degree to which a person feels engaged with the story or narrative unfolding within the environment (Steuer, 1992), and it must be as high as possible for an impressive and realistic virtual environment.

Slater and Usoh (2002) identify 10 points in order for a user to feel presence within a virtual environment. One of the key factors that they discovered is that “the information [must be free] from signals that indicate the existence of the input devices, or display”. That is, the more aware the user is of the input device, the lower the degree of presence. Thus, in such systems the use of traditional wired or wearable controllers is avoided, as it considerably lowers the degree of presence experienced and fails to accommodate audience groups that contain dozens of persons at a time. This is where computer vision comes into its own.

Since the very beginnings of the virtual reality movement, the use of computer vision as a means of providing interactivity in fabricated spaces has proven beneficial. In their seminal works, Krueger (1991, 1977) and Krueger, Gionfriddo, and Hinrichsen (1985) identified the need to provide an unencumbered space along with a responsive environment to make the virtual reality application more rewarding and intuitive. As part of their research, Krueger et al. presented “VIDEOPLACE”, a system with a vision-based input mechanism, shown in Figure 1.3. In this system, a simple vision interface, operating on the silhouette image of a human, provides interactivity on a projected surface. The silhouette and other projected graphical entities drive persons towards interacting with the virtual backdrop.

Figure 1.3: VIDEOPLACE - An early virtual reality system that utilised computer vision (Krueger et al., 1985)

Since then, the use of computer vision within virtual reality systems has expanded considerably as a means of providing interactivity without relying on expensive and cumbersome hardware technologies such as gloves, sensors, and other wearable devices. Despite the advancing development of cheaper input sensors, such as touch surfaces and inertial sensors, computer vision systems continually prove to be useful in providing interaction on their own, while also providing support for other sensors.

Recent times have especially seen computer vision move from purely theoretical research into practical and commercial applications where competition, driven by financial rewards, is fierce. Technologies such as the Microsoft Kinect 1 show that the gaming industry is increasingly embracing vision mechanisms. Advertising and marketing are seeing an increasing use of vision techniques for ubiquitous displays that provide interactivity to groups of the public 2. Within the arts, new realms of interactive visualisations are constantly being discovered using cameras, some examples of which are shown in Figure 1.4.

Figure 1.4: Computer Vision in various commercial arts and entertainment systems

1 http://www.xbox.com/kinect
2 http://www.gesturetek.com

Computer vision has shown itself to be highly beneficial as a means of sensing input from users without alerting them to the presence of the sensors. The typical uses of computer vision that have been explored for interactivity purposes are also representative of traditional paradigms of human-machine interaction. Davis and Chen (2002) detect laser devices pointed by multiple persons on a wall projection display. Cabral et al. (2005) measure the usability of “gesture interfaces” within a CAVE environment, but ultimately the gestures are used in a manner similar to using a mouse to manipulate a 3-D world.

Some systems have looked at the idea of inserting a representation of a real world person directly into the virtual environment. The famous “VIDEOPLACE” (Krueger et al., 1985) and “Pfinder” (Wren et al., 1997) both inserted a 2-D representation of a person obtained from a single camera. VIDEOPLACE worked with silhouette images whereas Pfinder tracked a statistical model of the shape and colour of the person. Allard et al. (2004) and Hasenfratz et al. (2003) insert a 3-D reconstruction of a real world human into a virtual environment. Allard et al. only present a visualisation of the reconstructed model while Hasenfratz et al. provide some interactivity using collisions between the reconstructed actor and the virtual world.

These types of applications, among others, demonstrate a strong influence from the world of game development on human-computer interaction. This is the direct interaction metaphor, where a person’s physical presence directly influences the virtual environment through touch and manipulation.

However, research has already shown a trend within such environments towards the general task of activity recognition. The KidsRoom (Bobick et al., 1999, 1998), shown in Figure 1.5, showed that the interactive narrative does not necessarily have to operate on the traditional direct input typically used in games. An engaging story can be narrated by responding to abstract actions performed by children within the environment.

Figure 1.5: The KidsRoom (Bobick et al., 1999)

Furthermore, both games and immersive display systems are now increasingly utilising such abstract interaction techniques. The interactive narrative, a more general method of game playing, is one such example. Pinhanez (1999) identified the benefits of a story-driven architecture within interactive spaces, and showed how physical activity may be used to drive such narratives. The desire for greater complexity and higher realism in such interactive virtual worlds has driven an increasing interest in identifying activities of humans within such spaces.

The construction of an activity recognition system is not just beneficial for an interactive application. For sociology and psychology researchers, a system that can output higher-level descriptions of activities has greater relevance than one that outputs lower-level feature descriptors. For such researchers, being able to describe important activities and have the system recognize them is vital to support progress in the understanding of human behaviour.

There is clearly a huge benefit in using computer vision to detect and identify the activities of persons within immersive environments. Such systems would provide enormous advantages not just in the field of human-computer interaction but also in sociological and psychological fields. As immersive environments become increasingly advanced and realistic, vision needs to be able to match the advances both in innovation and in realism.

1.2 Thesis Scope

Figure 1.6: The Advanced Visualisation and Interaction Environment (AVIE)

The primary focus of this thesis is on the Advanced Visualisation and Interaction Environment (AVIE) (McGinity et al., 2007), which is part of the iCinema Centre for Interactive Cinema Research at the University of New South Wales, Australia 1. AVIE, shown in Figure 1.6, is a state-of-the-art immersive system providing an engaging virtual reality experience. It consists of a 360-degree stereoscopic immersive interactive visualisation environment with a multi-channel audio system and a set of state-of-the-art resources that enable the development and study of applications in the fields of immersive visualisation, immersive sonification, and human interaction design.

AVIE is a realisation of the CAVE environment that demonstrates the trend towards circular displays that “provide a more seamless appearance over a greater range of viewing angles and conditions” (Lantz, 1998, 2007). AVIE presents the move from the flat cubic display system of the CAVE to a cylindrical display system.

The AVIE environment has been used for the display of panoramic photography and virtual worlds constructed from computer graphics. Examples of both are shown in Figure 1.7. Research within AVIE has resulted in numerous artistic applications touted as the future of interactive cinema. Several museums have employed a likeness of the AVIE environment to provide an interactive, virtual museum exhibit. In addition, applications beyond the visual arts, such as in teaching hazard awareness to miners, have also proven quite effective within AVIE (McGinity et al., 2007).

Figure 1.7: Example Applications within AVIE

1 http://www.icinema.unsw.edu.au

AVIE presents a highly suitable environment for the research developed in this thesis. The development of vision-based tracking applications has many uses within the environment, and can provide numerous innovative paths for interaction. A system that can detect the motions of multiple persons within AVIE facilitates the development of a vast range of new applications. Such applications can go beyond the traditional paradigms of direct interaction between real people and virtual worlds, and can look at investigating how virtual agents can react to more abstract social behaviour between real people.

1.3 Research Overview

The goal of this thesis is as follows:

“To recognize the activities of multiple persons as they navigate within an immersive environment, and to do so in real-time”

The problem is broken down into two sub-goals:

Tracking in which the hypotheses about the positions of the persons within the environment are maintained and updated as the persons move around. The purpose of tracking is to maintain a unique identity for each person so that the temporal nature of activities can be suitably exploited, and the activities are recognized in a reduced search space aligned with the positions of the participants.

Recognition in which the data from the tracking step is used to perform subject-aligned classification of activities performed by individuals within the environment. In this thesis, recognition is framed as a knowledge acquisition problem. A well-known knowledge engineering paradigm is used to construct the recognition framework, and it is shown how simple rules, determined by a vision programmer, can effectively perform fast activity classification while providing accuracy through the maintenance of knowledge consistency.
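To make the decomposition concrete, the sketch below shows how the two sub-goals fit together in a per-frame loop: tracking first maintains person hypotheses, and recognition then classifies each tracked subject. This is an illustrative outline only; the names (Person, process_frame, tracker, classifier and so on) are hypothetical and do not correspond to the interfaces of the systems developed in later chapters.

```python
# Illustrative only: a minimal sketch of the two-stage decomposition described
# above (tracking followed by subject-aligned recognition). All names here are
# hypothetical placeholders, not the Immersitrack or RDR framework interfaces.

from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothesis about one tracked person."""
    ident: int
    position: tuple                               # (x, y, z) world position
    history: list = field(default_factory=list)   # recent positions, for temporal features


def process_frame(tracker, classifier, camera_frames, timestamp):
    """One iteration of the real-time loop: track, then classify per person."""
    # Stage 1: tracking. Maintains a unique identity per person so that the
    # temporal nature of activities can be exploited.
    persons = tracker.update(camera_frames, timestamp)

    # Stage 2: recognition. Classification is aligned to each tracked subject,
    # restricting the search space to the area around that person.
    activities = {}
    for person in persons:
        person.history.append(person.position)
        features = classifier.extract_features(person, camera_frames)
        activities[person.ident] = classifier.classify(features)
    return activities
```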

1.3.1 Tracking Multiple Persons

The first goal of this thesis is to track multiple persons within the environment in order to perform activity recognition aligned to each person’s position. Tracking reduces the search space when performing activity classification, and allows localisation of the search to only a restricted space around the position of the person. This allows for much faster processing times in the classification stage.

Tracking is achieved by utilising multiple cameras and a distributed configuration that allows for real-time performance. Numerous existing algorithms are brought together, extended, and combined in an innovative way to provide a robust, algorithmic framework with a high level of extensibility. The resulting tracking system is called Immersitrack, which is introduced in Chapter 3.

The innovativeness of Immersitrack stems from the different treatment of cameras with different positions and orientations. The system explores the use of a 2-cohort set-up for the cameras, where each group applies distinct image processing algorithms. This separation allows each group to exploit its own strengths to achieve increased speed and accuracy benefits over prior tracking systems. Immersitrack achieves real-time results even when having to fuse data from 16 cameras, while also being able to incorporate an increasing number of cameras in the future.

1.3.2 A Knowledge-Based Approach to Activities

The second goal of this thesis is to provide subject-aligned activity recognition for multiple persons within the environment in real time. To achieve this goal, a knowledge-centric approach is taken. It will first be shown that activity recognition is highly context dependent, and vision programmers can often make use of simple rules to achieve classification of various activities. Although such rules can be efficient to formulate, the introduction of too many rules and activity classes can often result in conflicts and a decrease in classification accuracy.

An incremental knowledge maintenance paradigm, known as Ripple Down Rules (Compton and Jansen, 1989), is used to assure consistency when adding new rules to the system, without degrading the quality of knowledge already stored. Verification and validation steps are used during the addition of rules, to prevent new rules from conflicting with older ones. As a result, the performance efficiency of simple rules is maintained while accuracy is kept high. The use of RDR for activity recognition has not been researched in the past and is a novel contribution to both the knowledge acquisition and activity recognition fields.
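The consistency mechanism can be illustrated with a minimal single-classification RDR (SCRDR) sketch, assuming a rule tree in which the conclusion of the last rule that fires wins and corrections are attached as exceptions. This is only an outline of the general RDR idea; the class and function names are hypothetical, and the actual framework, including multiple classifications and the full cornerstone-case handling, is described in Chapter 5.

```python
# A minimal single-classification Ripple Down Rules (SCRDR) sketch. A correction
# is added as an exception to the rule that produced the wrong conclusion,
# guarded so that the stored (cornerstone) case keeps its old classification.

class Rule:
    def __init__(self, condition, conclusion, cornerstone=None):
        self.condition = condition      # callable: case -> bool
        self.conclusion = conclusion    # activity label (None for the default rule)
        self.cornerstone = cornerstone  # the case that prompted this rule
        self.except_rule = None         # tried when this rule fires
        self.else_rule = None           # tried when this rule does not fire


def infer(rule, case, best=None):
    """SCRDR inference: the conclusion of the last rule that fired wins.
    Returns (conclusion, node, branch), where (node, branch) indicate where a
    corrective rule for this case would be attached."""
    if rule.condition(case):
        best, nxt, branch = rule.conclusion, rule.except_rule, "except_rule"
    else:
        nxt, branch = rule.else_rule, "else_rule"
    if nxt is None:
        return best, rule, branch
    return infer(nxt, case, best)


def add_correction(knowledge_base, case, correct_label, new_condition):
    """Attach an exception rule for a misclassified case. Simplified consistency
    check: the expert-supplied condition must not fire on the stored cornerstone
    of the rule being refined, which is what keeps earlier knowledge intact."""
    _, node, branch = infer(knowledge_base, case)
    if node.cornerstone is not None and new_condition(node.cornerstone):
        raise ValueError("condition also fires on the cornerstone case; refine it")
    setattr(node, branch, Rule(new_condition, correct_label, cornerstone=case))
```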

The knowledge representation, acquisition, and inference mechanism used in this thesis is highly suitable for the activity recognition task, as it ensures that the insertion of knowledge into the knowledge base never gets too complex. The vision programmer is required only to determine properties from the lower-level vision modules that differentiate activities. As a result, new activity classes can be continuously added, and prior classifications can be continuously improved, in order to achieve a classifier that converges towards an optimal solution.

1.4 Thesis Contribution

The research presented in this thesis contributes to numerous areas of computer science and engineering.

Contribution to Computer Vision The main contribution of this thesis is to the field of computer vision. Various prior algorithms within computer vision are first combined in an innovative and robust way to provide a framework that can be used for tracking applications. The use of knowledge acquisition philosophies for vision-based activity recognition is also a novel approach that stems from existing uses of such philosophies in vision-based object recognition applications. The extension of the knowledge-based paradigms presented in this research can provide considerable benefits to vision programmers.

Contribution to Immersive Environments The tracking and recognition systems developed within this thesis can easily be adapted to different immersive systems, as they are tailored to work under conditions that are common to the majority of such environments. The provision of efficient activity recognition within immersive environments is a challenging problem, and this thesis makes considerable headway in achieving this goal.

Contribution to Knowledge Acquisition Finally, this thesis makes valuable contributions to knowledge acquisition domains, and particularly in the use of the RDR methodology for activity recognition: a use that has never been explored before. The performance benefits of using simple rules for the activity recognition task are strengthened by the knowledge consistency mechanisms provided by RDRs. As a result, the robustness of rules used for the activity recognition task is considerably improved.

1.5 Thesis Organization

The remaining chapters of this thesis are organized as follows. Related work in the field of activity recognition using computer vision techniques is presented in Chapter 2. In particular, background work on the two areas of tracking and activity recognition is presented. The chapter also highlights existing algorithms that have been used within the tracking system in this research, and a motivation for using a knowledge-based approach for activity classification.

The tracking system, Immersitrack, outlined above is then presented in Chapter 3. A complete framework that extracts 3-D position hypotheses of multiple persons within the environment from the raw camera images is formulated. The novel 2-cohort camera set-up, along with the image processing algorithms for each camera and the 3-D reconstruction methods, is presented. A quantitative evaluation of the performance of Immersitrack is also presented in Chapter 3.

The qualitative benefits of Immersitrack are then presented in Chapter 4 in order to show the effectiveness of the system in real-world situations. Three applications that use the data from Immersitrack to provide interaction to multiple persons within the environment are presented. Immersitrack has already performed robustly in numerous public demonstrations and has proven itself highly suitable as a means of providing interaction.

The incremental knowledge acquisition framework used for activity recognition is then introduced in Chapter 5. The background behind the framework and the encoding of knowledge for the activity recognition task are first presented. Low-level details regarding the implementation of the framework and the method of training the knowledge base are also presented in this chapter.

To show the effectiveness of the final knowledge-based framework, two activity recognition tasks are evaluated in Chapter 6. The first task is based on the CAVIAR dataset (Fisher, 2004), a well-known ground-truth dataset that is freely available and often used for benchmarking activity recognition tasks. The available labelled ground truth data within this dataset is used to both quantitatively and qualitatively evaluate the performance of the framework. The second dataset is constructed from labelled activity data from the AVIE environment. The results of Immersitrack, along with a fast 3-D reconstruction algorithm, which is also presented in this chapter, are labelled with a set of chosen activities. These labels are then used to evaluate the activity recognition task within AVIE.

Finally, closing remarks are presented in Chapter 7. In this chapter, the benefits of Immersitrack and the knowledge-based framework are reiterated along with their limitations. The chapter concludes with a discussion on how both systems could be extended in the future, to overcome limitations and to provide additional benefits.

Chapter 2

Activity Recognition within the Large Scale Projection Display

Activity recognition, through vision mechanisms, is no easy task. Within the gamut of activities that humans can exhibit, trying to recognize a single activity involves a search through a very large space. Researchers have made several attempts to tackle the problem in order to come up with a tractable solution. This is easily demonstrated by the large number of surveys on the topic (Aggarwal and Cai, 1999; Ahad et al., 2008; Cedras and Shah, 1995; Gabriel et al., 2003; Gavrila, 1999; Hu et al., 2004a; Moeslund and Granum, 2001; Moeslund et al., 2006; Zhan et al., 2008).

Activity recognition within immersive environments, however, has been relatively overlooked. Most researchers in activity recognition have focused on surveillance systems and smart home environments, which have greater applications and where people exhibit a larger number of activities. A lack of imagination of activity classes that can provide interactivity within an immersive environment has also contributed towards this deficiency.

As more and more immersive applications are developed, the role of activities within these environments is becoming more apparent. A shift from traditional input devices to more abstract modes of interaction is also facilitating greater research on vision-based activity recognition systems tailored to the immersive environment.

In this thesis, the activity recognition problem is broken down into two subtasks: tracking and classification. The tracking task involves the maintenance of a unique identity for each person within the environment, while the classification task involves the assignment of an activity class to each person. Before prior work in either area is presented, however, it is vital to understand the importance of real-time performance within an immersive environment. The constraints imposed by the necessity for real-time performance limit the scope of the background research that is relevant to this thesis.

The chapter will begin by describing the real-time performance constraints that are imposed by virtual reality environments in Section 2.1. The task of tracking multiple people and the related research is presented in Section 2.2. Finally, background for the classification task and motivation for the approach taken in this thesis are presented in Section 2.3. The last section summarises the concepts presented.

2.1 Real-Time Performance

Immersive environments and virtual reality systems are typically required to run at very high frame-rates to achieve effective realism and immersion. Researchers have identified that the degree of “presence” felt within a virtual reality system is increased when the speed of the interactivity is increased. “Presence” may be defined as the degree of engagement that a person experiences in the physical environment (Steuer, 1992). Barfield and Hendrix (1995) measured the subjective report of presence while navigating a virtual reality application at various “update rates”. The application was run at 5 Hz, 15 Hz, 20 Hz, and 25 Hz, and it was found that the presence experienced by the subjects was greater at higher frame rates.

Steuer also presented the argument that the effectiveness of an immersive environment should be measured not just by the degree of presence but also by the degree of “tele-presence”, or how involved the person is with the communicating medium. The degree of tele-presence is measured as the sum of the vividness of the environment along with the level of interactivity, as shown in Figure 2.1. One of the key factors governing the level of interactivity, it is claimed, is the speed at which interactivity is provided.

Figure 2.1: Technological Variables influencing Tele-Presence (Steuer, 1992)

The real-time performance constraint is a difficult one to impose, since most systems that perform activity recognition require the processing of a large amount of data. Prior research on such systems demonstrates algorithms that run much slower than in real time, with the claim that future hardware and software improvements would provide sufficient technology to bring the performance up to real-time standards. Nevertheless, because no testing is done, no guarantees are made. In the background work presented in this chapter, real-time performance is considered a vital property of the vision system. Having to deal with multiple persons can often increase the run-time complexity of algorithms considerably, and it is important to keep the total complexity of all parts of the system as low as possible.
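As a rough, purely illustrative calculation of what the constraint implies, the target update rate fixes a per-frame time budget that every stage of the pipeline must share. The numbers below are hypothetical placeholders, not measured figures from this thesis.

```python
# Illustrative only: a hypothetical per-frame budget at a 25 Hz target rate.
TARGET_HZ = 25
frame_budget_ms = 1000.0 / TARGET_HZ   # 40 ms to process every camera frame

# Every per-frame stage (for all persons and all cameras) must fit inside it.
stage_budget_ms = {"segmentation": 10, "2-D tracking": 5,
                   "3-D fusion": 15, "activity classification": 10}
assert sum(stage_budget_ms.values()) <= frame_budget_ms
```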

2.2 Tracking Systems

Tracking is the task of maintaining a unique identity for a set of moving objects that are being observed by one or more sensors. Tracking is a difficult problem, as it requires determining correspondences between feature values in images. Due to lighting changes, noise, and other vision-based artefacts, tracking can be quite challenging, particularly when a real-time performance constraint, as presented above, is imposed.

When there is only a single person involved, it is possible to perform activity recognition without any kind of tracking. Researchers have considered such recognition directly from video data. For example, Wilson and Bobick (2002) and Bobick and Davis (1996a) perform real-time activity recognition of a single person in a single camera. The availability of subject-aligned ground truth data (Blank et al., 2005) has resulted in a recent burst in activity recognition systems that operate on a single person without having to track the person (Ali et al., 2007; Gorelick et al., 2007; Natarajan and Nevatia, 2008). In such systems, the camera stays still, and the activity is assumed to be occurring in a small fixed window within the video images.

Unfortunately, such techniques are not suitable for immersive environment systems as the cameras typically cover a large area in which persons can move around considerably. For this reason, tracking is usually mandatory for performing activity recognition within such systems. The activity recognition components can then be aligned to the result from the tracking system in order to perform localised recognition.

Tracking is simply the task of maintaining a model of each object being observed by the cameras and ensuring model consistency over time. By model consistency, it is meant that as each object enters, traverses, and exits the environment, it is the responsibility of the tracking system to initialise a model for the object, derive correspondences between object and model in subsequent frames, update the model using the correspondences, and finally destroy the model on object exit. When multiple cameras are used, the tracking system is also responsible for determining spatial correspondence across cameras.
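The lifecycle just described (initialise, correspond, update, destroy) can be sketched as follows. The sketch is illustrative only: it uses plain nearest-neighbour gating for the correspondence step and is not the association method used by Immersitrack; all names (Track, update_tracks, gate, max_missed) are hypothetical.

```python
# A schematic sketch of the model-consistency responsibilities listed above.
import math
from itertools import count

_next_id = count(1)


class Track:
    def __init__(self, position):
        self.ident = next(_next_id)   # unique identity maintained over time
        self.position = position      # current model state (a 2-D or 3-D point)
        self.missed = 0               # frames since the last supporting observation


def update_tracks(tracks, detections, gate=0.5, max_missed=10):
    """One tracking step: associate detections with existing tracks, update the
    matched models, initialise new tracks, and destroy stale ones."""
    unmatched = list(detections)
    for track in tracks:
        # Correspondence: nearest detection within a gating distance.
        best = min(unmatched, default=None,
                   key=lambda d: math.dist(d, track.position))
        if best is not None and math.dist(best, track.position) < gate:
            track.position = best      # update the model with the observation
            track.missed = 0
            unmatched.remove(best)
        else:
            track.missed += 1          # no supporting observation this frame
    tracks.extend(Track(d) for d in unmatched)             # initialise new models
    return [t for t in tracks if t.missed <= max_missed]   # destroy on exit
```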

Tracking systems have received an enormous amount of attention in recent years with many different solutions discovered for varying situations. A good tracking system is immensely valuable for vision systems in general, the greatest benefit being that it provides a reduction of the search space by identifying points, areas, and volumes of interest. This reduction provides considerable benefits later on, where more challenging stages can end up being very computationally intensive.

Challenges due to lighting within the environment are presented in Section 2.2.1. An overview of the mathematics of camera calibration is presented in Section 2.2.2. The first software task in the tracking process, segmentation, is presented in Section 2.2.3. Work on monocular, stereo, and multi-camera tracking systems is presented in Sections 2.2.4, 2.2.5, and 2.2.6 respectively. Finally, Section 2.2.7 presents algorithms to estimate the state of tracking between updates in order to provide faster frame rates.

2.2.1 Lighting and Colour

The most fundamental problem faced in tracking systems is the choice of the lighting model. This problem has particular relevance in immersive environments, which are required to be dark during operation in order to enhance the vividness of the display. This means that the typical CCD camera, which operates on visible light, cannot be used, as it will not be able to “see”. Furthermore, the projection system can interfere with the effectual running of the algorithms. Gross et al. (2003) overcome this problem through active illumination, using pulsed calibrated light sources which are turned on when the person’s view of the environment is blocked off, as shown in Figure 2.2.

Figure 2.2: Active Illumination (Gross et al., 2003)

The use of active illumination is quite expensive, as it requires thousands of LEDs along with special shuttered glasses to provide calibrated time-multiplexing with the video source. The lighting method used within this thesis is that used by Penny et al. (1999). In their work, near-infrared lighting is used to illuminate the area. Near-infrared light lies within a relatively thin band on the electromagnetic spectrum, as in Figure 2.3.

Figure 2.3: Electromagnetic Spectrum

Night-vision goggles and active infrared cameras quite often use near-infrared light for observing scenes in the dark. Its benefit over visible light is that it is invisible to the naked eye. An example of a scene viewed using visible light and near-infrared respectively is shown in Figure 2.4.

Figure 2.4: Visible Light compared to Near-Infrared: Left image shows scene using visible light. Right image shows same scene with a visible light blocking filter that does not filter near infrared light. Image courtesy of: http://smartsecuritycamera.com/infrared-illuminators.html

Near-infrared cameras require the presence of near-infrared light to work properly. As it turns out, human skin and clothing are quite good near-infrared reflectors, while typical projection display screens are poor reflectors even though they reflect visible light quite well. Thus, bathing the entire environment with near-infrared light would provide sufficient lighting for the cameras to pick up humans, without affecting the visible darkness that is required for the visualisation system to operate properly.

2.2.2 Camera Calibration

Another issue that is faced in vision systems is in modelling and determining camera parameters, or camera calibration. The determination of camera parameters provides a mathematical relation between world coordinates and image pixels. This is particularly useful in localising objects within imagery, as the space of possible hypotheses is reduced.

Many vision systems in the past have operated without the need for camera calibration. Monocular systems in particular commonly perform tracking without calibration information (Haritaoglu et al., 2000; Shotton et al., 2011; Yu et al., 2008). In such systems, detecting feature points in a single image is sufficient. There is no need to determine correspondence across cameras, and so the camera parameters are not required.

Camera calibration is most often used in multi-camera systems where there is a need to match features observed in different cameras. Some multi-camera systems have been able to achieve tracking without the need for calibration. For example, Javed et al. (2000) and Khan and Shah (2000) detect field-of-view boundaries by determining when persons exit one camera’s image and enter another. These boundary lines are then used for 2-D tracking by switching the tracking camera when targets cross a boundary.

However, 3-D reconstruction cannot be achieved without calibrating the cameras. This is because the process of triangulation, or recovering a 3-D point from multiple image points, which is vital to 3-D methods, cannot be performed without knowing at least the position and orientation of the camera view.

Camera calibration involves the determination of two sets of parameters for each camera: extrinsic parameters, which model the position and orientation of the camera and map 3-D points in a world co-ordinate system to 3-D points in the camera co-ordinate system, and intrinsic parameters, which model properties of the camera and camera lens and map 3-D points in the camera co-ordinate system to 2-D image pixels. A commonly used method of modelling these camera parameters is the pinhole camera model (Forsyth and Ponce, 2003), which maps points in the real world to points on the image plane, as shown in Figure 2.5.

Figure 2.5: Pinhole Camera Model

17 The pinhole camera model is entirely defined based on first-order matrices and disre- gards lens and camera specific parameters such as distortion, sensor sampling, and the discretisation of pixels. Tsai (1986) extended the pinhole model to include such non-linear intrinsic parameters and derived a method of determining them from correspondences be- tween image pixels and world points. Tsai’s camera model defines 4 transformations from T world coordinates, (xw, yw, zw) , to image pixel coordinates:

T Camera Coordinates (xc, yc, zc) : The first transformation is a rigid-body transforma- tion that transforms points in R3 so that the origin lies at the camera position. A rotation 3 3 3 matrix R R × and a translation vector t R defines this transformation, shown in Equa- ∈ ∈ tion 2.1.

    xc xw      y  R y  t (2.1)  c  =  w  + zc zw

T (Ideal) Image Coordinates (Xu,Yu) : The second transformation maps points in cam- era coordinates to points on the ideal, undistorted image plane,

à ! à xc ! Xu f zc (2.2) yc Yu = zc

T Lens Distortion (Xd,Yd) : The third transformation models the distortion of the image due to the camera lens. This transformation is:

à ! à ! Xd Dx Xu + (2.3) Yd D y = Yu + where, assuming radial distortion,

à ! à ¡ 2 4 ¢ ! Dx Xd κ1r κ2r ... + + ¡ 2 4 ¢ D y = Yd κ1r κ2r ... + + q r X 2 Y 2 (2.4) = d + d

$\kappa_1, \kappa_2, \ldots$ are called the lens distortion coefficients and $r$ is the distance between the point and the centre of the sensor. It has been shown by Tsai (1987) that in the majority of industrial vision applications only the first-order coefficient for lens distortion, $\kappa_1$, needs to be used; the remaining coefficients provide no significant benefit and can increase the numeric instability of the calibration. Thus, the distortion transformation becomes

\begin{pmatrix} D_x \\ D_y \end{pmatrix} = \begin{pmatrix} X_d \kappa_1 r^2 \\ Y_d \kappa_1 r^2 \end{pmatrix} \qquad (2.5)

Computer image coordinates $(X_f, Y_f)^T$: Finally, the Tsai model defines a transformation to map the image points, defined in real-world, continuous metric coordinates, to discrete coordinates in pixel space. This transformation is:

\begin{pmatrix} X_f \\ Y_f \end{pmatrix} = \begin{pmatrix} s_x \, d_x'^{-1} X_d + C_x \\ d_y^{-1} Y_d + C_y \end{pmatrix} \qquad (2.6)

d_x' = d_x \frac{N_{cx}}{N_{fx}} \qquad (2.7)

where,

$(C_x, C_y)$ are the $(x, y)$ coordinates of the centre of the image in pixel coordinates

$d_x$ is the centre-to-centre distance between adjacent sensor elements in the X direction

$d_y$ is the centre-to-centre distance between adjacent sensor elements in the Y direction

$N_{cx}$ is the number of sensor elements in the X direction

$N_{fx}$ is the number of pixels in a line as sampled by the computer

$s_x$ is an image scale factor that is often necessary in certain CCD cameras due to the way that the cameras perform scan-line re-sampling.

The values of $(d_x, d_y, N_{cx}, N_{fx})$ can be determined from the camera's specification parameters. Generally, $C_x, C_y$ can also be input manually; however, the Tsai calibration technique provides a method of automating their discovery. Thus, calibrating the camera requires the calculation of 11 parameters, which consist of 5 intrinsic parameters:

$f$ Focal length

$\kappa_1$ First-order radial lens distortion coefficient

$C_x, C_y$ Coordinates of the principal point in the image coordinate system

$s_x$ Image scale factor due to scan-line re-sampling

along with 6 extrinsic parameters:

$R_x, R_y, R_z$ Rotation angles for the transformation between the world and camera coordinate frames

$T_x, T_y, T_z$ Translational components for the transformation between the world and camera coordinate frames

The method of determining these parameters has been demonstrated by Tsai (1987, 1986). The cameras are calibrated against an absolute world coordinate system. Thus, the calibration technique requires the coordinates of a set of known points in the real world whose corresponding points in image space can be extracted for each camera.
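To illustrate how Equations 2.1 to 2.7 chain together, the sketch below applies the forward Tsai model to map a world point to computer image coordinates. It assumes the eleven parameters have already been estimated by the calibration routine; the one-step inversion of the radial distortion is a simplification, and all names are illustrative.

```python
import numpy as np

def world_to_pixel(p_world, R, t, f, kappa1, Cx, Cy, dx, dy, sx, Ncx, Nfx):
    """Forward Tsai model: world point -> computer image coordinates.

    R, t      : extrinsic rotation (3x3) and translation (3,)
    f, kappa1 : focal length and first-order radial distortion coefficient
    Cx, Cy    : image centre in pixels; dx, dy: sensor element spacing
    sx        : scale factor; Ncx, Nfx: sensor elements vs. sampled pixels per line
    """
    # (2.1) world -> camera coordinates
    xc, yc, zc = R @ np.asarray(p_world) + t
    # (2.2) perspective projection onto the ideal image plane
    Xu, Yu = f * xc / zc, f * yc / zc
    # (2.3)/(2.5) go from undistorted to distorted coordinates; a crude
    # one-step approximation of the inverse distortion is used here for brevity
    r2 = Xu ** 2 + Yu ** 2
    Xd, Yd = Xu / (1 + kappa1 * r2), Yu / (1 + kappa1 * r2)
    # (2.6)-(2.7) metric image coordinates -> discrete pixel coordinates
    dx_prime = dx * Ncx / Nfx
    Xf = sx * Xd / dx_prime + Cx
    Yf = Yd / dy + Cy
    return Xf, Yf
```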

Prior work (Svoboda et al., 2005; Zhang, 2000) has looked at fully automating the calibration procedure using a calibration object with known geometry or pixel values. However, these techniques typically require the determination of very precise correspondences between cameras, or between image and world space. Because the environments investigated in this thesis are quite large, such precise points are unavailable.

Instead, a calibration rig and grid were used as landmarks to determine a coarse correspondence through hand marking. Because the Tsai calibration procedure works through least-squares minimization, the resulting calibration is accurate if a sufficient number of points are specified. The calibration procedure is presented in Section 3.2.

2.2.3 Segmentation

Segmentation is the lowest-level task in any vision system. It is the task of separating pixels belonging to objects of interest from background pixels in images. It is the foundation upon which all other algorithms are built, and is thus a common source of failures in a vision system.

The output of segmentation is usually a set of blobs or pixel regions representing detected subjects. Segmentation has been studied extensively, with dozens of papers written on the topic (Pal and Pal, 1993). The solutions range from simple techniques such as background subtraction (Elgammal et al., 1999), frame differencing (Yoshinari and Michihito, 1996), and edge detection (Sminchisescu and Triggs, 2001), through to modelling individual pixels probabilistically (Stauffer and Grimson, 1999) and machine learning (Blanz and Gish, 2002).

Despite the vast amount of research on segmentation, many algorithms still fail in certain situations. The outcomes of segmentation failure fall into one of two cases: false negatives, when regions belonging to an object are not detected, or false positives, when regions are detected that do not belong to any object. Both of these failures can cause errors in the tracking task, as tracked targets can be erroneously deleted in the presence of false negatives, and erroneously created or updated in the presence of false positives. Thus, tracking has to take such errors into account when determining correspondence.

The use of near-infrared light within AVIE, however, means that the pixel values belonging to humans are in stark contrast to those belonging to the background, because the background is a poor reflector whereas humans and human clothing reflect the light considerably better. This means that a simple background subtraction technique can be utilised to accurately segment out human blobs from the background. Although this procedure still suffers occasionally from errors, the tracking system presented uses higher-level information to resolve them.
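As an illustration of the kind of background subtraction this relies on, the following sketch differences each frame against a reference image of the empty environment and extracts connected blobs; the threshold and minimum-area values are placeholders, not parameters of the actual system.

```python
import numpy as np
from scipy import ndimage

def segment_foreground(frame, background, threshold=30, min_area=200):
    """Segment bright (NIR-reflective) foreground blobs against a static,
    dark background by simple differencing and connected components."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    mask = diff > threshold                       # binary foreground mask
    labels, n = ndimage.label(mask)               # connected components
    blobs = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if xs.size >= min_area:                   # discard speckle noise
            blobs.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return mask, blobs                            # mask and blob bounding boxes
```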

2.2.4 Monocular Tracking Systems

When a monocular sensing system is used, the tracking system maintains a set of tracked targets, detects persons within the camera's reference frame, determines an optimal correspondence between tracked targets and detected objects, and performs an update, deletion, or addition of a tracked target as the situation requires.

One of the earliest human motion tracking systems, built by Rossi and Bozzoli (1994), provides for detection, tracking, and counting of moving persons within camera sequence images. The sensing device consists of an overhead camera, pointing down, such that the camera plane is parallel to the floor plane. The system assumes that objects only enter the room from either the bottom or the top of the image. This speeds up segmentation, as motion need only be checked in one dimension: across the bottom and top lines of the video. Once entering objects are detected, they are tracked in two phases: (i) a simple prediction formula, using a constant velocity trajectory model, is used to determine a reasonable estimate of the region where the object has moved, and (ii) a template matching method is used to determine the point position of the object.

The system presented by Cai and Aggarwal (1996) detects humans within the images, and identifies key feature points on the human. Properties of the feature points, such as position and velocity, are modelled as multivariate Gaussian distributions, and matched across frames using a Bayesian formulation.

Segen and Pingali (1996) also present one of the earliest vision-based human tracking systems. They track feature points of a region-of-interest over time. Initially the feature points are detected by identifying the points of high curvature around the boundary of the region. The feature points are then tracked over subsequent frames by comparing distances and curvature. The system identifies and tracks the motion paths of persons observed by a surveillance camera. However, it is unable to deal with temporary occlusion of persons.

The W4 system developed by Haritaoglu et al. (1998a,b, 1999a,b), and shown in Figure 2.6, is a complete, monocular real-time surveillance system for outdoor environments. It attempts to solve several different problems in outdoor surveillance tracking, including: background segmentation in cluttered outdoor scenes, tracking the head positions of multiple persons through occlusion and group movement, tracking body parts of persons using silhouettes, and identifying interactions with inanimate objects. Although W4 is an outdoor system, many of its algorithms can be adapted and, in fact, improved for indoor environments.

Figure 2.6: The Hydra tracking system, part of W4 (Haritaoglu et al., 1998a, 1999b)

Cui et al. (1998) recognized that different sensors could provide different information. They used two different types of sensors to track moving individuals within a room. A peripheral sensor, implemented by a wide-angled camera, tracked the global position of individuals within the room while a foveal sensor provided more detailed monitoring. Although they still perform monocular tracking in both sensors, their research demonstrates the benefits of splitting the sensor network into different groups providing different information. In this work, the sensors are split into two groups comprising overhead and oblique cameras, as will be explained in Chapter 3.

McKenna et al. (2000) track groups of persons in a single camera and maintain identities even under occlusion. Colour-based tracking is first performed to segment blobs while there is no occlusion. A colour histogram of each blob is maintained to determine the probability of each colour within a segmented blob. When occlusion occurs, the system determines the blob that each pixel in the occluded region belongs to, by using the prior probabilities of the pixel colour. The system, shown in Figure 2.7, successfully maintains tracking once the blobs split, and there is no more occlusion.

Figure 2.7: Group Tracking under Occlusion (McKenna et al., 2000)

Leichter et al. (2004) provide a probabilistic framework for combining multiple tracking algorithms into a single tracking system. Their system assumes that each tracking algorithm outputs a probability density function estimate of the state of the tracked object, and the feature space of one tracking algorithm is independent of the feature space of another algorithm. The system treats each tracking subsystem as a black box, and allows multiple subsystems, which could operate on different feature spaces of differing dimensionality, to work together to provide an enhanced tracking system, presented in Figure 2.8.

Figure 2.8: Combining a template-based and a background differencing tracking algorithm (Leichter et al., 2004)

Intille et al. (2002) and Chen et al. (2001) both present algorithms for the matching of tracked targets to detected blobs. Both systems present a theoretically sound approach to determine an optimal match between multiple segmented blobs and multiple tracked targets. In this research, the approach presented by Chen et al. (2001) is used, as there exists a polynomial time approximation to the algorithm.
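The sketch below illustrates the general idea of solving the target-blob correspondence as an assignment problem; it uses simple Euclidean distances and the Hungarian algorithm rather than the exact formulation of Chen et al. (2001), and the gating value is a placeholder.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_targets_to_blobs(targets, blobs, gate=80.0):
    """Assign tracked targets to detected blobs by minimising total
    Euclidean distance; unmatched targets and blobs are reported so the
    tracker can delete or create tracks as required.

    targets, blobs : (N, 2) / (M, 2) arrays of predicted / detected positions
    gate           : maximum distance for a match to be accepted
    """
    if len(targets) == 0 or len(blobs) == 0:
        return [], list(range(len(targets))), list(range(len(blobs)))
    cost = np.linalg.norm(targets[:, None, :] - blobs[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)      # optimal assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_t = {r for r, _ in matches}
    matched_b = {c for _, c in matches}
    unmatched_targets = [i for i in range(len(targets)) if i not in matched_t]
    unmatched_blobs = [j for j in range(len(blobs)) if j not in matched_b]
    return matches, unmatched_targets, unmatched_blobs
```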

Prior research has achieved limited success within immersive systems using single cameras. Sparacino (2004) constructs an immersive interactive museum exhibit using a single overhead camera. The ALIVE system presented by Maes et al. (1997) uses a single oblique-view camera to allow a person to interact with a virtual character on a single-screen projected display. More recently, Tran et al. (2009) were able to perform motion tracking of persons within surveillance video using affine-invariant feature tracking.

Unfortunately, monocular systems start to fail drastically even for a small group of persons, due to severe occlusion issues. This problem is especially compounded in immersive systems, as the dark environment prevents the use of colour-based techniques, such as those used by Khan and Shah (2000) and Tran et al. (2009).

2.2.5 Stereo Tracking Systems

The use of stereo cameras has also become popular recently. Stereo cameras simulate human binocular vision by using a pair of cameras placed next to each other, with separation equal to the human eye separation and a parallel line-of-sight. This placement allows the determination of a disparity image, which offers relative depth information about the scene, shown in Figure 2.9. Harville and Li (2004); Harville (2004) were able to use a stereo camera pair to perform tracking and activity recognition.

Figure 2.9: Depth Image from a Stereo Camera (Harville, 2004)

Krumm et al. (2000) use two stereo camera pairs to track multiple persons around a small office environment. A blob-based tracking module is employed in each camera pair, and blobs are clustered together to form human shaped blobs. An identity for each person is maintained using colour histogram statistics of the blob. The system reports 2-D ground plane position of each person.

Stereo cameras, however, are not very suitable for large areas. The typical effective range for stereo cameras is roughly the size of a small room. Beyond this distance, the depth values are very similar despite a large discrepancy in real-world depth. Typical large-scale projection displays can be much larger than the average room, making stereo cameras ineffective in such environments. In addition, stereo cameras can only resolve partial occlusions. If a person is fully occluded, there is no way to identify the position of the person in the background until they have separated from the occluding object. This becomes quite problematic as the number of persons increases and the probability of full occlusion becomes much greater.

2.2.6 Multi-Camera Tracking Systems

The systems presented so far perform tracking using monocular and stereo sensors. However, as previously mentioned, the occlusion and range issues make such systems unsuitable for immersive environments. In such areas the use of multiple cameras, spaced quite widely relative to each other, is much more beneficial. This spacing can resolve a considerable amount of occlusion while providing greater coverage of the larger area.

Multiple camera systems have become exceedingly popular in recent times, due to faster networking hardware, cheaper vision sensors, and improved algorithms. Tracking in multiple cameras requires some method for identifying correspondence between subject models in different cameras, and the correspondence is used to update the subject's models in subsequent frames. This is quite similar to monocular tracking, except that the correspondence is performed both spatially and temporally, instead of just temporally.

One way to perform the correspondence is to treat the cameras independently of each other. Dockstader and Tekalp (2001a,b) use this approach. A Bayesian belief network is used to compare the projections of 3-D state vectors, corresponding to significant features on a human, against the 2-D motion of the features obtained from independent observations in multiple cameras. The two most confident observations are used to triangulate and update the 3-D state vector in future frames. Results are only presented for 3-view tracking, however, and because optical flow features are used, adding more cameras would considerably affect the run-time performance of the system.

Another approach often used in the literature is camera switching. Camera switching is the task of tracking each subject in a single camera using a 2-D model, and switching cameras to the one with the best view as the subject moves around. Cai and Aggarwal (1998) extended their single-camera tracking system to multiple cameras by providing camera switching. An intelligent, automatic camera switching module is used to determine the cameras that have the best view of the tracked object, and only information from those cameras is extracted.

Figure 2.10: Camera Hand-off based on FOV Lines (Javed et al., 2000; Khan et al., 2001)

Javed et al. (2000) and Khan et al. (2001) developed a method of camera hand-off tracking by identifying the field-of-view (FOV) lines for each camera in every other camera. As a human walks around the environment, the system determines if they have crossed the FOV line of one camera into another camera's viewing area. The system automatically switches to the camera with the best view when a crossing occurs. An example of the system in operation is presented in Figure 2.10.

Zeeshan et al. (2003) extended the camera hand-off procedure to disjoint views, by employing Parzen windows (Parzen, 1962) during a training phase to learn space-time relations between cameras with disjoint views. Camera switching is quite useful when there is a relatively small amount of overlap between camera fields of view. However, camera placement within immersive environments can be highly controlled in order to increase the amount of overlap between cameras. Thus, greater robustness may be achieved by detecting subjects and finding correspondence across as many cameras as possible to obtain a more confident hypothesis.

The use of a reconstructed top-down view has become quite popular for ground-plane tracking. Harville (2004) showed that using a reconstructed top-down view of an environment from stereo cameras is a useful method of tracking persons within an environment. Yang et al. (2003) present a system that tracks and counts multiple persons moving around in a room, using multiple cameras. The cameras are all placed such that they point horizontally into the scene. A background subtraction method is first applied to all cameras, to determine foreground silhouettes. The system then computes the projection of the visual hull of all the silhouettes onto the ground plane of the scene, effectively offering a simulated top-down view of the scene. A person within the scene is then represented by a polygon in the visual hull projection. The system then applies a bounds-tightening algorithm, which eventually converges to the correct number of persons within the scene.

Zabulis et al. (2010) perform a 3-D reconstruction of an entire scene using 8 cameras, and then project the reconstruction to the ground-plane to perform unoccluded tracking. Their system is used to provide interactivity within a museum exhibit for 3 persons. A reconstructed top-down view clearly has merits in tracking. However, the reconstruction is usually crude and noisy, and expensive to compute.

In summary, although such 2-D methods can be quite effective, ultimately it is much more useful to obtain a 3-D trajectory for the objects within the environment. A 3-D trajectory allows the determination of depth values, which is not generally afforded by performing image-based tracking. The task in 3-D tracking consists of three sub-tasks: track persons in individual cameras, determine matching tracks across cameras, and reconstruct 3-D position from matching tracks.

The first two sub-tasks are often tackled using colour (Bernardin et al., 2007; Focken and Stiefelhagen, 2002; Mittal and Davis, 2003; Yang et al., 2007). Each track hypothesis in each camera is assigned a colour signature. This signature can then be used to identify correspondence across frames, and across cameras. Once correspondence is achieved, the signature can be updated incrementally to reflect changes in lighting and body orientation. The correspondence problem itself has been approached using different methods including Bayesian modality fusion (Chang and Lin, 2002), best-hypothesis heuristics (Focken and Stiefelhagen, 2002), and joint probabilistic data fusion (Bernardin et al., 2007).

Fleuret et al. (2008) present a system that uses a series of synchronised camera streams all at eye level to track multiple individuals in both indoor and outdoor settings, shown in Figure 2.11. A probabilistic occupancy map of the ground plane is constructed using segmented binary foreground images, and colour motion models are then used to optimize the trajectories of individuals such that they fit the occupancy map. Although their system is quite accurate, the computation of the probabilistic occupancy map is too expensive to be used in real-time.

Figure 2.11: The Probabilistic Occupancy Map Tracker (Fleuret et al., 2008)

There has also been considerable research in recent times into off-line tracking within videos. Kang et al. (2004) track soccer players across a field, through multiple cameras, using a stochastic formulation on the colours of segmented blobs. Iwase and Saito (2004) perform similar tracking of soccer players, but utilising ground plane homography constraints rather than appearance information. Zhao and Nevatia (2004); Zhao et al. (2008) track multiple humans in complex situations by breaking down the movement into two parts. An ellipsoid model is first used to track global motion, and then a locomotion-dependent limb model is used to track individual body parts.

Although tracking methods based on computer vision have been considered in the past, most tracking systems focus on the use of colour in an attempt to re-identify individuals who become occluded or leave a camera's reference frame to enter into that of another. These tracking systems are geared towards surveillance applications, where the cameras are fixed in usually oblique positions, and the presence of visible light and colour makes image features highly suitable for tracking.

However, immersive environments pose challenges and present opportunities that are quite different from surveillance systems. The main difficulty is the lack of colour information when the system is in operation. Since infrared or near-infrared cameras are required, it becomes quite difficult to perform re-identification based on image features that can change markedly as the person moves around. One of the main benefits that immersive environments present, however, is that the spatial positioning of the cameras can be changed at will, and it is possible to leverage this opportunity to arrange the cameras so that the entire environment is covered within the fields of view of the cameras. This can result in reduced impact of issues such as occlusion and camera re-entry.

In this thesis, the idea of using a top-down view for tracking is exploited to provide a fast and robust method of tracking multiple persons. Rather than performing expensive and noisy reconstructions, however, the cameras themselves are positioned in a top-down manner, providing much cleaner results. Various algorithms from the literature are then combined to build an efficient and robust tracking system. First, the cameras are split into two groups: a set of overhead cameras to perform tracking and a set of oblique cameras to provide height information. In this way, instead of the crude overhead reconstruction presented by Harville (2004), a clean top-down view is obtained.

The 2-D blob detection methods of Haritaoglu et al. (1999b) and the graph-theoretical framework of Chen et al. (2001) are applied to the overhead cameras to achieve 2-D tracking. Because the blobs in the oblique cameras suffer from occlusion, some simple rules are used to determine blobs that are most likely to belong to bodies. The 2-D information is then collated and a 3-D point is constructed through triangulation. The triangulated point is then used in subsequent frames to determine correspondences. More details are in Chapter 3.

2.2.7 Bayesian State Estimation

One method of overcoming performance issues in tracking systems is to make use of a model estimation technique to provide predictions of the state of the scene in between updates. Bayesian state estimation attempts to perform this task in a probabilistic setting (Arulampalam et al., 2002). A particle filter is a common tool used to perform Bayesian state estimation.

Bayesian state estimation techniques attempt to determine the posterior probability density function (p.d.f.) of a process $x_t \in \mathbb{R}^n$, $t = 0, 1, 2, 3, \ldots$, given measurements $z_t \in \mathbb{R}^m$, $t = 0, 1, 2, 3, \ldots$ and probability distributions relating to prior states of the process. A Bayesian estimator assumes that $x_t$ is a first-order Markov process (Papoulis et al., 1965) and that it evolves by the equation:

x_t = f_t(x_{t-1}, v_{t-1}) \qquad (2.8)

where $f_t : \mathbb{R}^{n+k} \to \mathbb{R}^n$ and $v_t \in \mathbb{R}^k$ is an independent and identically distributed (i.i.d.) process noise vector sequence. Furthermore, the particle filter assumes that the observations are governed by the equation:

z_t = h_t(x_t, n_t) \qquad (2.9)

where $h_t : \mathbb{R}^{n+l} \to \mathbb{R}^m$ and $n_t \in \mathbb{R}^l$ is an i.i.d. measurement noise vector sequence. A particle filter then defines two steps. A prediction step calculates the prior p.d.f. at time $t$ as

p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1} \qquad (2.10)

and a measurement step is used to update the prior p.d.f. when measurements become available:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} \qquad (2.11)

p(z_t \mid z_{1:t-1}) = \int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, dx_t \qquad (2.12)

The derivation and explanation of Equations 2.10, 2.11, and 2.12 are presented by Arulampalam et al. (2002). The main point that must be stressed is that the Bayesian estimation equations provide a way of updating the prior density in order to obtain the posterior density at the current time step. This updated posterior density is then used for future prediction calculations to obtain a better estimate.
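As a concrete illustration, the following is a minimal bootstrap particle filter that approximates Equations 2.10 to 2.12 by sampling, for a 1-D random-walk state observed with Gaussian noise; the motion and measurement models are placeholders rather than the models used in this thesis.

```python
import numpy as np

def particle_filter_step(particles, weights, z, motion_noise=0.1, meas_noise=0.5):
    """One predict/update cycle of a bootstrap particle filter.

    particles : (N,) array of state samples approximating p(x_{t-1} | z_{1:t-1})
    weights   : (N,) normalised importance weights
    z         : scalar measurement z_t
    """
    n = len(particles)
    # Prediction (2.10): propagate samples through the process model.
    particles = particles + np.random.normal(0.0, motion_noise, size=n)
    # Measurement update (2.11): re-weight by the likelihood p(z_t | x_t).
    likelihood = np.exp(-0.5 * ((z - particles) / meas_noise) ** 2)
    weights = weights * likelihood
    weights = weights / (weights.sum() + 1e-12)
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights
```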

The mathematical Bayesian formulation presented above is a conceptual tool and, in the general case, the analytical determination of the p.d.f.s is intractable. The Kalman filter (Kalman, 1960) is a popular tractable formulation that is often used for tracking. The filter assumes that the p.d.f.s are all Gaussian, and thus can be parameterized by means and covariance matrices. The Kalman filter models the state of a process $x \in \mathbb{R}^n$ by the linear stochastic equation

x_t = A x_{t-\Delta t} + B u_{t-\Delta t} + w_{t-\Delta t} \qquad (2.13)

assuming that the process has an observable state, or measurement, $z \in \mathbb{R}^m$ governed by

z_t = H x_t + v_t \qquad (2.14)

The definitions of the different terms within the Kalman state and measurement equations are:

$A$ the transition matrix, which defines a linear transform for the process from the previous time step

$B$ the control matrix, which relates $x$ to an optional control input $u \in \mathbb{R}^l$

$w$ the process noise, with normal probability distribution $p(w) \sim N(0, Q)$, where $Q$ is a covariance matrix for the process noise

$H$ the measurement matrix, which relates the current state of the process to the observed measurement

$v$ the measurement noise, with normal probability distribution $p(v) \sim N(0, R)$, where $R$ is a covariance matrix for the measurement noise

The prediction and update steps for the Kalman filter are shown in Figure 2.12. All the calculations in both steps are simply linear transformations.

Figure 2.12: Kalman Filter Equations (Welch and Bishop, 2001)

An excellent tutorial on the Kalman filter is available (Welch and Bishop, 2001). A simple, 1-dimensional example of the Kalman filter, from Welch and Bishop (2001), is shown in Figure 2.13.

Figure 2.13: Kalman Filter Example. Crosses represent irregular, noisy measurements of the signal, the solid straight line gives the true value of the signal, and the remaining plot line gives the output predicted by the filter

A common approach to using the Kalman filter in motion tracking is to model the position and velocity of certain regions or features within the scene. Rosales and Sclaroff (1999) used this technique by first detecting bounding boxes for subjects within the scene. Each bounding box is tracked by storing two corner vertices. The filter is then used to predict and estimate the positions and velocities of these corner vertices.
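As an illustration of this position-velocity modelling, the sketch below implements a constant-velocity Kalman filter for a 2-D point following Equations 2.13 and 2.14; the noise covariances are placeholders that would need to be tuned for a particular camera setup.

```python
import numpy as np

class ConstantVelocityKF:
    """Kalman filter tracking 2-D position and velocity, state x = [px, py, vx, vy]."""

    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.x = np.zeros(4)                         # state estimate
        self.P = np.eye(4)                           # state covariance
        self.A = np.array([[1, 0, dt, 0],            # transition matrix (2.13)
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0],             # measurement matrix (2.14)
                           [0, 1, 0, 0]], float)
        self.Q = process_var * np.eye(4)             # process noise covariance
        self.R = meas_var * np.eye(2)                # measurement noise covariance

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:2]                            # predicted position

    def update(self, z):
        # Standard Kalman gain and correction step.
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                            # corrected position
```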

Jang et al. (2002) present an interesting new structural framework in the use of the Kalman filter for 2-D human body tracking. The framework is composed of a graph-like structure, with each node being defined by a cell Kalman filter and each edge defined by a relation Kalman filter. The system tracks moving persons, by splitting the segmented person into subregions. Each subregion is tracked using a cell Kalman filter, and the spatial relation between subregions is tracked by a relation Kalman filter. When the person is not occluded, the system behaves like an ordinary Kalman filter, but as partial occlusions occur, the relation filters are activated and the system is able to track the occluded subregions.

Other extensions of Bayesian estimation, which do not make the Gaussian assumptions, have also been proposed in the literature. The Extended Kalman filter (Julier and Uhlmann, 2004) was developed to deal with cases where the state is governed not by linear functions, but by differentiable functions. Particle filtering methods that use sampling to update the posterior p.d.f., such as the CONDENSATION algorithm (Isard and Blake, 1998) and bootstrap filtering (Gordon et al., 1993), have also been shown to be quite useful for state estimation.

Ultimately, the generality of these techniques is offset by the cost of processing them. The linear assumption made in the Kalman filter is not a particularly restrictive one when it comes to human motion tracking, and the efficiency of the Kalman filter in using only linear transformations makes it quite a useful tool for estimation.

In this research, the Kalman filter is used at two levels to provide robustness in tracking. A 2-D first-order Kalman filter is first used at the image level to perform blob tracking in overhead cameras. This 2-D framework is combined with optical flow methods to provide robust tracking in the presence of blob splits and merges. At the second level, a Kalman filter is used to maintain the 3-D state of each person within the environment. The Kalman filter provides accurate predictions of a person's position between vision updates, considerably improving the correspondence derivations. Both uses of the Kalman filter are presented in Chapter 3, Sections 3.3.1 and 3.5.1.

2.3 Activity Recognition

Research in activity recognition has only recently begun to attract growing interest, alongside the advent of such technologies as smart homes and surveillance systems. Most human motion tracking systems so far have focused mainly on low-level tracking tasks. However, it has been shown recently that there is a need to develop efficient high-level action recognition schemes, to be able to deal with developing technology that can utilise such outputs. Buxton (2003), Aggarwal and Park (2004), Kruger et al. (2007), and Weinland et al. (2010) present four excellent surveys on the activity recognition problem.

In their influential work, Bobick (1997) proposed a taxonomy that broke the activity perception task down into several subtasks. They identified a 3-tier hierarchy:

Movements are the most atomic primitives, requiring no contextual or sequence knowledge to be recognized; movement is often addressed using either view-invariant or view-specific geometric techniques. Activity refers to sequences of movements or states, where the only real knowledge required is the statistics of the sequence; much of the recent work in gesture understanding falls within this category of motion perception. Finally, Actions are larger-scale events, which typically include interaction with the environment and causal relationships; action understanding straddles the grey division between perception and cognition, computer vision and artificial intelligence.

Action understanding, as defined by Bobick (1997), involves higher-level semantics of the activities being performed. Although some actions can be derived from low-level features, the task is typically performed through higher-level symbolic logic and has a limited view of the lower-level features. Action understanding is better performed once movement and activity classification modules have converted vision feature vectors into class symbols. In this thesis, the focus is on movement and activity classification, although it is shown how the research presented can be extended to action understanding.

Movements are generally identified using statistical techniques over multiple frames. Activities are composed of a sequence of movements. However, it should be noted that an activity could be defined and recognized entirely on its own without recognizing any lower-level movement classes. The term activity classification is used from here on to denote recognition of both movement and activity classes.

The terms activities and actions are often used interchangeably in research, owing to the lack of agreed, concrete definitions for them. For example, Turaga et al. (2008) use actions to refer to the task of converting feature vectors to symbols, and activities to denote higher-level understanding. This usage is the opposite of that defined by Bobick. In this research, the definitions presented by Bobick will be used.

2.3.1 Rigid-Body Motion Analysis

One class of movements that is immediately obtainable from a low-level tracking system, and that may be useful for activity classification, is rigid-body motion. A common method of analysing rigid-body motion is through trajectory analysis. Morris and Trivedi (2008) present a recent and comprehensive survey. Trajectory analysis involves the processing of point models and has become very popular in low-resolution videos, such as surveillance systems, where detailed models are difficult to extract. These methods forego the construction of highly complex, computationally taxing representations of the subjects being observed.

Trajectory analysis models the scene as a series of fixed-dimensional feature vectors. The trajectory is defined as the curve, or random walk, that the feature vectors trace out in the vector sub-space over time. The feature space consists not just of spatial position, but can also include properties such as pixel intensity values, global image properties, and object blob characteristics.

Trajectory analysis has obvious uses in surveillance systems. For example, Bird et al. (2005) detect when pedestrians are loitering in front of a bus stop. Jung and Ho (1999) extract traffic flow information by tracking cars in a single video stream. Hu et al. (2004b) predict the occurrence of accidents between cars by tracking the cars, feeding their trajectories through a self-organizing neural network to predict the future motion of the cars, and determining when the predictions of two or more cars are likely to meet. Makris and Ellis (2005) and Wang et al. (2006) learn topological representations of an observed scene, by labelling the scene based on identifiable activities performed in different regions. Identifiable activities are defined by spatio-temporal properties of persons' trajectories.

Rigid-body motion has also been successfully used for the recognition of group interaction activities. Cupillard et al. (2002) use a finite state model to represent important activities characterised solely by the motion of tracked object bounding boxes in surveillance video. Ryoo and Aggarwal (2008) construct a context-free grammar to represent group activities consisting of a dynamic number of individuals. The atomic constructs of their grammar use only properties of the bounding box of tracked persons.

Just recognizing rigid-body motion on its own is not sufficient, however, as activities can also involve the motion of a person's limbs. In this case, a more complex representation of the person is required in order to identify such motions. This is the task of extracting articulated motion.

2.3.2 The Use of Joint Models

A common method of extracting articulated motion from video is by using joint models. These methods find their roots within studies related to biological vision. Johansson (1973) showed that activities performed by a human could be easily identified by observers using 10 to 12 strategically located moving light displays, or MLDs. The concept of MLDs is the main inspiration for technologies in which markers are used to extract the pose of a person.

Figure 2.14: Recognizing movements from moving light displays (Johansson, 1973)

Extracting a joint motion model from cameras is often posed as the task of solving a constraint satisfaction problem (CSP) over the camera images. Using CSP methods for model optimisation is an open and rather difficult problem, as even the simplest model requires a considerably large search space. In motion capture applications the problem is typically simplified by the use of markers, but this cannot be allowed in immersive environment applications, particularly when multiple persons are involved, since wearing markers makes the application expensive and cumbersome, and can affect the degree of presence. In order to compensate for the large space, researchers have looked at different techniques for approximating an optimal model.

Considerable advances have been made in extracting the limb positions of a person in 2-D images obtained from a single camera (Agarwal and Triggs, 2004; Sminchisescu and Triggs, 2001; Wachter and Nagel, 1999). More recently, Zhao and Nevatia (2004) and Zhao et al. (2008) fit an articulated skeleton to pedestrians in a surveillance environment, using a rigid-body motion tracking component to reduce the search space, shown in Figure 2.15. Their system uses colour imagery and they make no claims as to real-time performance. Spurred by the recent availability of benchmark datasets (Blank et al., 2005; Gorelick et al., 2007), research in the construction of a joint model from monocular image sequences continues to this day.

Figure 2.15: Rigid body tracking (left) is performed by using an ellipse model which reduces the search space for extracting articulated motion (right) using a skeleton model (Zhao and Nevatia, 2004)

The use of multiple cameras has also proved beneficial to articulated motion extraction. Multiple cameras provide the benefit of view-invariance, an important source of variability in the recognition of activities (Sheikh and Shah, 2005). Two other such identified variables are the anthropometry of the human performing the action and the execution rate of the action. Sheikh and Shah (2005) show that modelling the action as a linear combination of bases in joint spatio-temporal space allows these variabilities to be overcome.

Gavrila and Davis (1996) present one of the earliest systems that perform marker-less 3-D model reconstruction of humans in multiple cameras using special clothing. They employ a 22-DOF (degree of freedom) skeletal model along with 3-D super-quadrics to model the human body. They use edge detection to measure the similarity between the projected model and images. They are able to reconstruct models for up to 2 persons moving around.

Figure 2.16: Marker-less Model Reconstruction (Gavrila and Davis, 1996)

Researchers have looked at various methods to optimise the model fitting, including determining a priori joint locations using image features (Rosales et al., 2001), using physical forces determined by edge and contour features (Delamarre and Faugeras, 2001), and using simple Boolean energy functions (Carranza et al., 2003). Stochastic sampling techniques such as the Extended Kalman Filter (Wachter and Nagel, 1999) and annealed particle filters (Duetscher et al., 2000) have also been used to update the model hypothesis in subsequent frames, to reduce the search space. Mitchelson and Hilton (2003) use a 22-parameter model consisting of ellipsoids, re-project the model to images, and use colour and silhouette cues to determine fit. They achieve at best 3 seconds per frame, with a single person and 3 cameras.

Others have looked at performing joint model fitting in a reduced 3-D space rather than in image space. A 3-D representation of the model is first constructed for this approach. Skeletonization of the representation can provide a useful energy function to fit a 3-D skeleton model, as shown in Figure 2.17 (Chu et al., 2003, 2002). Theobalt et al. (2002) fit a skeleton model to a 3-D reconstruction by first aligning the skeleton to the model using principal component analysis, and then using detected feature points in the silhouette images to hierarchically guide the locations of the joints. Kehl (2005); Kehl et al. (2005) perform model fitting using stochastic sampling over a coarse colour reconstruction of the individual. Their system was built for a small CAVE-like environment and demonstrated only for a single person.

Figure 2.17: 3-D Skeleton Fitting (Chu et al., 2003, 2002)

Ellipsoid models are often quite effective for fitting in 3-D space (Cheung et al., 2000, 2003a,b; Cheung, 2003; de Aguiar et al., 2004). An ellipsoid model is first fitted to the reconstructed data, using intersections and coverage as an energy function, and this model is then reduced to a skeleton model. Mikic et al. (2003) fitted an articulated human body model, made up of cylinders and ellipsoids, to a reconstruction of a single individual, by using some prior knowledge about the shape and relative size of different parts of the human body. To prevent re-initialization of the model in every frame, a tracking framework based on an Extended Kalman filter is used to track the individual joints of the model. Examples are presented in Figure 2.18. Similarly, Sundaresan and Chellappa (2005) perform marker-less motion capture of a single person by fitting a super-quadric model to a 3-D representation and using an iterative Kalman filter to update the model.

Figure 2.18: Voxel Model Fitting with Model Tracking (Mikic et al., 2003)

The state of the art in reconstructing articulated models is presented by Shotton et al. (2011) in their commercial work on the Microsoft Kinect device. In their work, depth images from a specialised depth-sensing device are used to fit an articulated skeleton model to humans. Their system is extremely robust and fast for single persons, using consumer hardware to achieve frame rates of up to 200 frames per second. The extraction of the joint model is seen as a generative problem. A series of depth comparison features are extracted using the depth-disparity image, and a random forest is trained to assign different pixels to different body parts. The skeletal model is then fitted according to the classified pixels. An example is shown in Figure 2.19.

Figure 2.19: Depth-image based skeletal joint localisation using the Microsoft Kinect sensor (Shotton et al., 2011)

Shotton et al. achieve high robustness and efficiency by using an enormous amount of training and testing data to build the classifier. Roughly 500,000 synthetic and real images of persons in various poses are used for training and testing the classifier. This allows for view-invariance and can handle a wide range of complex poses. However, the depth sensor is vital to their work, and it has a practical ranging limit of only 1.2–3 metres. The authors also do not show results for more than two persons. Typical immersive systems are much larger than 3 metres and need to deal with numerous persons at a time.

The joint model has obvious benefits for classification, being of fixed dimensionality and having considerably fewer dimensions than image space. Numerous researchers have looked at feeding the articulated model directly into a learning agent. Guo et al. (1994) transform the model parameters into the Fourier domain and feed them through a neural network to perform classification. Park and Aggarwal (2003) segment and parameterise body parts of individuals in a monocular video sequence, which are then input into a hierarchical Bayesian network to perform pose estimation. Niebles and Fei-Fei (2007) feed a hierarchical, 2-D ellipse model representing a human through a Support Vector Machine (SVM) classifier to determine the activity. Rosales and Sclaroff (2000) use an EM algorithm to map low-level segmented features directly to a pose, which determines a configuration for an articulated model.

The Hidden Markov Model (HMM) (Poritz, 1988; Rabiner, 1989) is a common technique used for recognizing actions from kinematic models. Bregler (1997) extracts a blob-based body model from images, parameterised by a set of probability distributions representing image and shape properties, and uses an HMM formulation to perform classification. Gu et al. (2010) use exemplars and an HMM to classify the actions of a 24-DOF skeletal model.

The use of the joint model for classification has been influenced by research on motion capture and moving light displays that operate on off-line videos. Even the most recent work in joint model extraction fails to perform efficiently in the presence of multiple persons and large spatial environments. Although both research and hardware are progressing hand-in-hand to improve the performance of joint model extraction, it would be highly beneficial to be able to perform activity recognition without having to obtain such complicated models.

Previous research has shown that the joint model is often unnecessarily complex and that humans have the ability to identify activities based on lower-level motion imagery quite effectively. In this thesis, although the extraction of joint locations and orientations is not employed, the work presented is shown to be extensible to cover classifications based on an articulated model.

Instead, a coarser model, known as a visual hull, is used to represent each human, and activity recognition is performed over this model. As will be shown in the next sub-section, the visual hull has numerous benefits for the activity recognition task. The main benefits are that it is efficiently computable and it preserves much of the shape information that is vital to activity recognition.

2.3.3 The Silhouette and the Visual Hull

In their pioneering work on the recognition of movements, Bobick and Davis (1996a,b) realized that there is no need to build complex kinematic skeletons of individuals for activity classification, because a simpler underlying feature vector can provide sufficient information to perform the task. It has been shown that humans have the ability to recognize certain actions even from very coarse, low-resolution data.

Figure 2.20: Bobick and Davis (1996a) realized that certain movements, such as sitting in this case, can be inferred by humans even in coarse, low-resolution data. A computer vision programmer with sufficient knowledge may even be able to infer how the action relates to the behaviour of the individual pixels

Using the concept of temporal templates, Bobick and Davis (1996b) showed that direct movement classification could be performed on the imagery, as there is sufficient information present for a human to discriminate the movements. Their system was trained with a series of templates for certain movements, captured from several different camera angles. They were then able to identify, in real time, simple movements performed by a particular individual, such as sitting, standing, and waving arms. They achieved view-invariance by constructing training data at multiple angles.

Figure 2.21: Camera Input frames (top row) and their corresponding Motion Energy Images (bottom row) (Bobick and Davis, 1996b)

Through their work, Bobick and Davis demonstrated the enormous potential for spatial reasoning that is present in something as simple as a binary silhouette: an image containing only ones and zeros. They extended the silhouette to include temporal information by introducing the motion history image (MHI) and motion energy image (MEI), both of which contain accumulated motion statistics about each pixel within the image. The MHI is gray-scale whereas the MEI is a binary image.

Since their work, the MHI and MEI have shown themselves to be incredibly effective as a means of recognizing activities, while their computation is very efficient. So much so that many researchers discard colour information altogether due to the noisy nature of colour data. For example, Martinez-Contreras et al. (2009) reduce an MHI to a smaller subspace using a Kohonen Self-Organizing Map (Kohonen, 1989) and then use an HMM to perform classification. de Leon and Sucar (2000) reduce a silhouette down to a set of Fourier descriptors and then use a nearest-centroid classifier to recognize poses in a single frame.
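A minimal sketch of how an MHI and MEI can be accumulated from binary silhouettes is given below; the decay duration is a placeholder value, and the silhouette is assumed to come from an upstream segmentation step.

```python
import numpy as np

def update_motion_templates(mhi, silhouette, timestamp, duration=1.0):
    """Update a Motion History Image (MHI) from a binary silhouette and
    derive the corresponding Motion Energy Image (MEI).

    mhi        : float image holding, per pixel, the time of the last motion
    silhouette : binary foreground mask for the current frame
    timestamp  : current time in seconds
    duration   : how long motion persists in the template
    """
    # Pixels that moved in this frame are stamped with the current time.
    mhi = np.where(silhouette > 0, timestamp, mhi)
    # Pixels whose last motion is older than `duration` fade to zero.
    mhi = np.where(mhi < timestamp - duration, 0.0, mhi)
    mei = (mhi > 0).astype(np.uint8)      # MEI: binary union of recent motion
    return mhi, mei
```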

The silhouette has been so effective that the necessity for view invariance has inspired the use of 3-D analogies. The visual hull is one such analogous representation. The visual hull, first presented by Laurentini (1991, 1994), is a 3-D representation of an object that preserves many of the external shape properties of the object. This is suitable for human action recognition, because many human actions can be recognized solely from the shape of the human body performing the action. Researchers have already shown that precise approximations to the visual hull can be constructed in real time.

Figure 2.22: Object Silhouette Projections

The visual hull was first introduced as a method of reconstructing a 3-D object from silhouette images. The silhouette of an object in a camera image is the 2-D projection of the object into the camera's reference frame, as shown in Figure 2.22. This silhouette can be back-projected from the image into the 3-D world space through the camera centre, if the camera's calibration is known. The resulting back-projection is called the generalised cone of the silhouette. If multiple cameras observe the same object, then multiple generalised cones can be determined. The intersection of all of these generalised cones is called the visual hull of the object. The visual hull can be seen as the 3-D extension of the 2-D convex hull, shown in Figure 2.23. Any method that extracts the visual hull from a series of silhouette images is known as shape-from-silhouette (SFS).

Figure 2.23: The Visual Hull in 2-D

The methods for representing and reconstructing visual hulls can generally be split into two classes: polygonal representations that model the visual hull as a set of polygonal surface patches and volumetric representations that model the visual hull by splitting the world up into a set of atomic constructs and choosing a subset of spatially coherent atoms that most closely represent the object. Examples of both are shown in Figure 2.24.

Figure 2.24: A comparison between polygonal surface based representations and volumetric representations (Franco and Boyer, 2003)

SFS methods to extract polygonal representations typically utilise the generalised cones of the silhouettes, and apply an intersection routine to obtain the visual hull. Real-time methods for such techniques have been proposed (Matusik et al., 2001). However, direct cone intersection is quite difficult for general 3-D objects, and especially for humans, whose silhouettes often contain curved and irregular edges that are difficult to approximate with just polygonal patches. Although polygonal surface representations are quite suitable for presenting high detail, they only provide benefits for analysis if the carved model itself is sufficiently detailed. Moreover, the computational complexity of performing cone intersections is quite high, making them more suitable for off-line reconstruction.

In contrast, the typical SFS method for extracting volumetric representations is to use a cuboid, known as a voxel, as the atomic construct, and then to carve the space so that only voxels that intersect with a foreground object are kept. Voxel models are faster to calculate than polygonal representations, yet can still provide a model suitably detailed for action classification tasks.

The term voxel is derived from the two words volumetric pixel. A voxel is the 3-D ana- logue of an image pixel, and is often used as a basic atomic unit for 3-D models, just as the pixel is used as an atomic unit for images in image processing. Voxels have been used con- siderably in computer graphics and medical imaging, particularly in visualisations where they provide a fast reconstruction and rendering method for high detail 3-D models.

A voxel model requires the world space to be split into a grid of 3-D hyper-rectangles. The voxel representation models the visual hull by choosing only the voxels that intersect with the hull and discarding the remaining voxels. An example of voxel representation appears in Figure 2.25.

Figure 2.25: Voxel-based Visual Hull

The first devised method for computing the voxel model (Potmesil, 1987), given a set of silhouette images of a foreground object taken at different camera angles, is to project each voxel into each view using the camera calibration information. A voxel represents a cube in 3-D space, and each of the eight corners of the cube is projected to the view. This yields eight points in the image. The convex hull of the 8 points is constructed, and if the convex hull intersects any part of the foreground object, then the voxel is marked as a foreground voxel. Otherwise, the voxel is marked as a background voxel.

The computation of the convex hull and the inspection of every point within the hull can be very time-consuming, making the standard voxel reconstruction algorithm too slow for real-time purposes. Penny et al. (1999) and Chu et al. (2003) reduced each voxel to a single point (the mid-point of the voxel), which is then projected to the silhouette. If this point lies within the foreground silhouette then the voxel is marked as a foreground voxel. Cheung et al. (2000, 2003a) extended this algorithm by introducing the SPOT (Sparse Pixel Occupancy Test) algorithm, which allows for a more robust reconstruction of the voxel model.
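The following is a simplified sketch of the mid-point style of voxel carving described above: each voxel centre is projected into every silhouette and the voxel is kept only if it falls on foreground in all views. It assumes the binary silhouettes and 3×4 projection matrices are available, and it is not the SPOT algorithm itself.

```python
import numpy as np

def carve_voxels(voxel_centres, silhouettes, projections):
    """Mark voxels as foreground if their centres project onto the
    foreground silhouette in every camera (mid-point approximation).

    voxel_centres : (N, 3) array of voxel mid-points in world coordinates
    silhouettes   : list of binary images, one per camera
    projections   : list of 3x4 projection matrices, one per camera
    """
    occupied = np.ones(len(voxel_centres), dtype=bool)
    homog = np.hstack([voxel_centres, np.ones((len(voxel_centres), 1))])
    for sil, P in zip(silhouettes, projections):
        img_pts = (P @ homog.T).T                 # project all voxel centres
        u = (img_pts[:, 0] / img_pts[:, 2]).astype(int)
        v = (img_pts[:, 1] / img_pts[:, 2]).astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        on_fg = np.zeros(len(voxel_centres), dtype=bool)
        on_fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= on_fg                         # must be foreground in all views
    return occupied
```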

Caillette and Howard (2004) used a hierarchical subdivision method to obtain a speed-up of SPOT and allow for voxel colouring. Lopez et al. (2007) reconstructed a voxel model within an office environment and used particle filtering on each voxel to track individuals. Their system does not operate in real time, however. Kehl (2005); Kehl et al. (2005) reverse the SPOT lookup by mapping each pixel in the image to a set of voxels that re-project to it. In this way, the voxel checking process is dependent on the number of changed pixels, and not on the size of the voxel space. They perform model fitting to the voxel space using stochastic sampling.

Several hardware approaches to voxel reconstruction have also been attempted (Hasenfratz et al., 2003; Ladikos et al., 2008). However, although these achieve reconstruction at very fast speeds on the GPU, there is still a severe bottleneck in extracting the model for further analysis due to hardware limitations. Therefore, while voxel models can be reconstructed and viewed in real time, further analysis cannot easily be performed.

The use of the voxel model has proved incredibly fruitful for the task of view-invariant activity recognition, so much so that publicly available datasets providing voxel models, and not just images, have appeared in recent times (Weinland et al., 2006). Because the voxel model is a natural extension of the binary silhouette, the concepts of the MHI and MEI have also been extended to voxel representations as the Motion History Volume (MHV) and Motion Energy Volume (MEV) (Weinland et al., 2005).

It is clear that voxel reconstructions present a viable alternative to kinematic skeletons for activity recognition. They are quite efficiently computable on modern hardware while preserving 3-D shape information vital to activity classification. They are a natural extension to the 2-D binary silhouette, which has proven to be extremely effective in 2-D activity recognition tasks, and they provide the added benefit of view invariance. In this thesis, a speed-up of the SPOT algorithm is achieved using spatial coherency within the environment. An improved algorithm, called SparseSPOT, is presented in Chapter 3, Section 3.7; it provides the real-time reconstruction of a voxel model for multiple people in large spaces.

The question remains, however, of how to extract activity classes from such 3-D representations. With a kinematic skeleton, the fixed dimensionality of the feature vector describing the skeleton makes it easy to apply machine learning principles to perform classification. However, voxel reconstructions are dynamic in dimensionality and are typically of very high dimensionality. The next sub-section looks at available options for performing the classification of actions from such data.

2.3.4 Activity Classes: The Benefits of Cognitive Rules

Activity classification is often modelled as a supervised problem with a fixed, known set of activity classes. In surveillance systems, the activity classes are clear-cut. Anomalous activities are often well defined and fixed. For example, in a shopping centre environment, the recognition of criminal activity is beneficial. In traffic monitoring systems, the prediction of accidents would be useful.

For this reason, many researchers utilise standard machine learning techniques to perform activity classification. Hidden Markov Models (HMMs) (Rabiner, 1989), for example, are a common approach to this problem. An HMM models a system as a Markov process. The state of the system is invisible to the observer, hence the term hidden, but emits a series of observations that are visible. The task of the HMM is to predict the probability of occurrence of a sequence of observations.
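For a discrete-observation HMM, this prediction is computed with the forward algorithm; the sketch below is a generic implementation with illustrative parameter names, not a fragment of any particular activity recognition system.

```python
import numpy as np

def sequence_likelihood(obs, pi, A, B):
    """Forward algorithm: likelihood of a discrete observation sequence.

    obs : list of observation symbol indices
    pi  : (S,) initial state distribution
    A   : (S, S) state transition matrix, A[i, j] = P(s_j | s_i)
    B   : (S, V) emission matrix, B[i, k] = P(o_k | s_i)
    """
    alpha = pi * B[:, obs[0]]                     # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]             # induction step
    return alpha.sum()                            # P(observation sequence)
```

In an activity classification setting, one such model would be trained per activity class and the class with the highest sequence likelihood reported.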

Hidden Markov models are often used in temporal pattern recognition due to their ability to model temporal processes. In activity recognition problems, the observable states are usually feature vectors derived from an underlying vision system. An HMM is built for each activity class, and is trained over a sequence of positive and negative cases. It is then used to predict the likelihood of an activity class given a sequence of feature vectors. Many researchers have used the HMM in this manner (Gu et al., 2010; He and Debrunner, 2000; Kellokumpu et al., 2005; Martinez-Contreras et al., 2009; Yamato et al., 1992). The HMM has also been used, to a limited extent, in modelling interactions (Oliver et al., 2000).

A more general form of the HMM, the Dynamic Bayesian Network (DBN), has also been used for activity recognition. Park and Aggarwal (2003, 2004), for example, recognize 2-person interactions using a hierarchical Bayesian network. A lower-level module first extracts low-level features from image sequences. A set of Bayesian networks at the lowest level then outputs pose estimates of body parts based on these features, and the pose estimates are fed into a higher-level Bayesian network to determine the probability of occurrence of an action.

The use of HMMs and DBNs for activity classification from low-level feature vectors poses several difficulties. The first issue is that such models require a fixed graph-based structure that is predetermined during software construction. This structure is not always intuitive, and may require considerable experimentation to determine. Because such models output the probability of a single activity occurring, employing them would require the determination of a graph structure for every activity of interest. This can result in considerable time spent simply experimenting with different models.

Another problem is that such models become infeasible when the observations are continuous variables, such as feature vectors from an underlying vision system, because computation of the posterior probabilities becomes computationally expensive. Continuous versions of the HMM have been proposed for activity recognition (Ivanov and Bobick, 2000;

Kale et al., 2002), but these are inefficient in general as they require complex probability calculations even for small structured models. It is unlikely that real-time performance can be achieved when multiple persons, and hence multiple classifiers, are present.

Ultimately, HMMs and DBNs are more appropriate for higher-level action understanding tasks, once the feature vectors from the vision components have been converted to discrete symbols using activity and movement classification modules. Such modules can effectively be seen as a discrete quantisation of continuous feature vectors. Researchers have proposed such modules using exemplars (Gu et al., 2010), Kohonen Self-Organizing Maps (SOMs) (Martinez-Contreras et al., 2009), vector codebook quantization (He and Debrunner, 2000; Yamato et al., 1992), and SVM classifiers (Kellokumpu et al., 2005).

Another method that at first glance looks promising is Dynamic Time Warping (DTW) (Rabiner and Juang, 1993). DTW uses dynamic programming to determine a similarity measure between time series while eliminating variations in their execution rate. DTW methods have clear benefits for the activity recognition problem as they eliminate one of the major causes of variance (Sheikh and Shah, 2005). However, although a polynomial-time algorithm exists for 1-D DTW, the problem in two or more dimensions has been shown to be NP-complete, and algorithms to compute such metrics are quite inefficient.
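For reference, the standard 1-D dynamic-programming recurrence behind DTW is sketched below; it is the textbook quadratic formulation rather than any specific method from the cited work, and the toy sequences are placeholders.

import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW for 1-D time series.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A warping path step may match, insert or delete a sample.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Same shape executed at different rates yields a small distance.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 3, 2, 1]))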

It is the contention of this thesis that activity recognition is ultimately a knowledge-based problem, with no single perfect solution. The discrimination of activities is based on humans' interpretation of what constitutes a particular activity, since the activities themselves are usually defined in human terms using language-specific constructs. The problem with most activity classifiers is that they attempt to achieve complex human cognition using relatively simple inference methods and a single conceptual model. This leads to fallacies, as the concepts describing significant activities are continuously evolving. Even the idea of what denotes a significant activity is constantly in flux.

In immersive environments, the concept of an important activity is particularly ambiguous. It is reasonable to say that certain activities that humans perform in everyday life can be used to provide interactivity within the environment. However, who is to say that abstract activities, such as walking in an S-shape then walking in a circle then jumping, could not be used to provide interactivity? The artistic and dynamic nature of immersive environments means that important activities are sometimes determined after the system has been built and is in operation. It is clear from these properties that human cognition plays a major part in the determination and recognition of important activities.

However, it has already been shown in the past that humans have an uncanny ability to condense complex cognition into simpler rules based on knowledge and expertise (Dreyfus et al., 1986). This is particularly true in vision systems where the optical nature of data

makes it easy to discern such rules. Even for general machine learning problems, Holte (1993) showed that simple 1-attribute rules perform nearly as accurately as more complex induced classifiers while also being efficient for inference.

It is clear that the use of human expertise in the form of rules provides a fast and effective alternative for achieving activity classification. In fact, expert systems, or software systems tailored to use encoded human expertise in solving problems, have been applied to vision-based tasks since the mid-1980s (Nazif and Levine, 1984). Crevier and Lepage (1997) present an extensive survey of the different methods that researchers have used to encode knowledge for image understanding tasks. The use of off-line reasoning for activity recognition, modelled by rules encoded by an expert, has shown particularly fruitful results (Fusier et al., 2007).

The difficulty with typical expert systems, however, is that knowledge and rules are usually encoded during an initial knowledge engineering (KE) step. The rules are determined by both a knowledge engineer and a domain expert during KE and are fixed for the remaining life of the system. In this sense, an expert system is no different from a classifier based on a single conceptual model. This makes it very difficult to incorporate new activities or new contexts into the system. A knowledge engineer would need to ensure knowledge consistency every time the client required new activity classes, which can become quite cumbersome if many activities are introduced.

This thesis explores the use of a knowledge acquisition and maintenance philosophy known as Ripple Down Rules (RDR) (Compton and Jansen, 1989). The benefit of the RDR approach is that it is an incremental learning strategy with automated knowledge consistency checks, eliminating the need for a knowledge engineer. The vision expert is only required to determine rules that discriminate activity classes from each other. Because of the reasoning ability of vision experts, the RDR strategy allows the production of rules which are efficient enough to run in real time, generalised enough to cover a majority of cases, and accurate enough, through verification, to maintain constant error rates over prior knowledge. RDR knowledge bases have been explored for general vision-based recognition systems in the past with successful results (Misra et al., 2004; Pham and Sammut, 2005).

RDR is highly suitable for use with the visual hull for activity recognition. The visual hull is often a coarse model of the human body, and programmers end up building correspondingly coarse but efficient features and rules for activity classification. With the RDR framework, these rules can be added to the system ad infinitum in order to build a classifier that is highly accurate and efficient, and that allows considerable extensibility in new classes and features. More details about the RDR strategy employed in this thesis, including implementation and experimental results, are presented in Chapters 5 and 6.

2.4 Summary

In this chapter, background literature relevant to the tasks of tracking and activity recognition within this thesis has been presented. First, the importance of real-time performance, particularly within immersive systems, was stressed. Applying a real-time constraint places a severe restriction on the choice of algorithms that may be used, and any method investigated must be checked for inefficiency, as this can considerably limit the final run-time speed of the system.

The importance of tracking, and methods thereof, were then introduced. Tracking is an enormously researched topic, yet despite the vast amount of research there is still no suitable solution that can track multiple people using multiple cameras in real time. The tracking system built for this thesis, presented in Chapters 3 and 4, achieves this task by leveraging several past algorithms in a novel manner.

Finally, methods of activity recognition have been outlined. The primary activity recognition goal pursued in this thesis, movement classification, was defined. A contrast between rigid-body motion and articulated motion analysis was presented, with emphasis on the difficulty of recognising articulated motion due to its high dimensionality. The visual hull method is far more efficient than skeletal models while providing sufficient information for the recognition task. The activity recognition problem is viewed as a knowledge acquisition task, as presented in Chapters 5 and 6. When combined with the visual hull, this allows simple, efficient rules to be used to high accuracy, resulting in an activity classifier that is both accurate and efficient.

Chapter 3

Immersitrack: Tracking Multiple Persons in Real-Time using a 2-Cohort Camera Setup 1

Tracking, in the context of AVIE, is the task of identifying where persons are upon entry into the space, and maintaining a unique identity for each person, over both time and space, until they exit. The role of the tracking system is to convert the camera images into multiple hypotheses representing persons within the environment. In multiple camera systems, a tracking system is also responsible for intelligently fusing together data from numerous video streams.

This chapter describes Immersitrack: a complete, distributed tracking system that unifies data from multiple camera sources to construct a consistent 3-D point-based hypothesis of the persons within an immersive environment, by reconstructing the positions of all persons within it. Immersitrack provides a solid algorithmic framework for the construction of such systems, including the fusion of data from multiple camera sources, the reconstruction of the 3-D scene, and intelligent prediction of the tracked persons' positions. Immersitrack has proven to be highly robust and extensible. The algorithms can be used in any general multi-camera tracking system and are capable of dealing with an increasing number of cameras, allowing for an increase in area coverage.

Immersitrack employs a novel 2-cohort camera setup, based on the cameras' real-world spatial positions and orientations. Each camera is classified as either an overhead camera or an oblique camera. Overhead cameras, suspended from the roof of the environment, look directly down. Oblique cameras, placed at the perimeter of the environment, look towards the centre. Despite all cameras being of the same hardware specifications,

1Portions of this chapter appear in Sridhar and Sowmya (2010)

the differing information offered by the two groups warrants separate treatment. Because they consistently obtain a clear view of the persons, the overhead cameras can provide unoccluded tracking within the environment. In contrast, the oblique cameras suffer from occlusion but provide height information. Such height information can be useful for differentiating activities such as jumping, standing, and crouching.

Immersitrack employs a master-slave distributed configuration. Each camera is treated as independent of all others, both within and across the slaves: it knows only its own position and calibration within the world, and not that of the other cameras. The slaves can only perform 2-D processing of images and have no 3-D information about the world, as they cannot identify the disparity across camera images. The master, however, does have the calibration information of every camera and can reconstruct 3-D information.

The main task of Immersitrack is to maintain a set of 3-D hypotheses P = {P_1, P_2, ..., P_n} representing real-world individuals within the immersive space. As the individuals move about, the system is required to update the position of each P_i, representing the person's centre of gravity. The challenge lies in fusing 2-D information from multiple cameras and matching it to 3-D real-world targets.

An overview of the software and hardware architecture of Immersitrack is presented in Section 3.1. The camera calibration process is outlined in Section 3.2. The methods of 2-D tracking in the overhead and oblique cameras are presented in Sections 3.3 and 3.4 respectively. The method of reconstructing 3-D motion models for each person within the environment is presented in Section 3.5. Experimental results and evaluations of Immersitrack are presented in Section 3.6. Finally, a novel method of performing 3-D voxel reconstructions of multiple people in real time, which leverages Immersitrack, is presented in Section 3.7.

3.1 Immersitrack Architecture

The architecture of Immersitrack consists of numerous processes in two groups, a set of slave processes and a single master process, shown in Figure 3.1.

Figure 3.1: Tracking System Architecture

In order to distribute the workload, each process runs on a separate machine and the hardware is configured using a networked master-slave architecture, shown in Figure 3.2. The slave machines communicate with the master machine via an Ethernet link. A slave machine has a direct hardware link to the cameras through a frame grabber and performs 2-D image processing on the video sequences to extract meaningful information from the raw data. The results of the 2-D image processing are then sent to the master for further processing. The master provides the central point for obtaining collated vision data, and the display system polls the master for regular updates on tracking information. Immersitrack is fully extensible, and the number of cameras and slaves can be increased to provide greater coverage and greater distribution.

Figure 3.2: AVIE Hardware Setup

Although the cameras are hardware synchronized on each slave, the slaves connect to the master via a local Ethernet connection and so there is no guarantee that the data from the slave to the master will be synchronized between slaves. Software methods can be used to perform tight synchronization between slaves. However, this thesis presents a framework that does not require it. The slaves send so little data to the master that the network suffers a negligible delay in data transfer.

Four slave machines are used within this research. Each slave has an AMD Athlon 3500+ CPU running at 2.4 GHz, 1 GB of RAM and an NVidia graphics card. To communicate with the cameras, each computer contains an IDS Falcon Quattro framegrabber board with four parallel video channels, capable of outputting PAL-resolution images at 20 fps when all four ports are used. Each slave is connected to exactly four cameras.

Each camera is a Sony XC-I50CE CCD camera, fitted with a Pentax-Cosmicar C60402 (H416) monofocal lens with a focal length of 4.2 mm and a near-infrared filter, shown in Figure 3.3. The filter blocks out all incoming light except for a narrow band of near-infrared at approximately 830 nm. A set of halogen lights, fitted with filters, bathe

the system with light at the required wavelength.

Figure 3.3: Camera Hardware

A top-down plan view of the camera layout used in AVIE is shown in Figure 3.4. Of the 16 cameras, 9 cameras are placed within AVIE looking directly down, and 7 cameras are placed around the circumference with an oblique (side) view of the environment. In the figure, boxed crosses represent the overhead cameras and angle brackets represent the oblique cameras.

Figure 3.4: Camera Layout

This particular camera layout was determined after considerable experimentation within AVIE. The overhead cameras are particularly good at providing an unoccluded view of the entire environment, which is highly valuable as more and more persons enter the space. Note also that there is a single overhead camera placed above the entrance of the environment. This proves useful for Immersitrack as it ensures that persons are only tracked if they enter at the entrance door of AVIE, which improves the robustness of the tracking algorithms. The 3-D positions and orientations of the 16 cameras within AVIE are shown in Figure 3.5.


Figure 3.5: AVIE Camera Spatial Configuration in 3-D. The blue circle indicates the entrance to the AVIE environment

3.1.1 The Slave Process

The software process running on each slave is presented in Figure 3.6.

Figure 3.6: The Slave Process within Immersitrack

A slave process consists of two separate paths: one for each overhead camera and one for each oblique camera. A segmentation process first extracts blobs of interest from the camera images. The segmentation step is the same for all cameras, regardless of whether they are overhead or oblique. The output from the segmentation step is a set of observed blobs. In the overhead cameras, these blobs are used to update tracking. In the oblique cameras, the blobs are used to localize human bodies.

For each overhead camera, the slave maintains a set of 2-D hypotheses about moving targets within the environment. A dynamic graph theoretic framework is used to maintain an optimal match between the detected blobs in the current frame and tracked targets in the previous frame. Each overhead camera only performs tracking within its 2-D reference frame.

Because of occlusion issues, the oblique cameras do not perform tracking. However, a novel method of body localization that exploits the nature of the camera angle and the spatial ratios of the human body is applied. Approximate bodies are determined in each camera, regardless of occlusion, and the master process differentiates between occluded bodies using reconstructed 3-D information.

Segmentation The segmentation task involves two steps: identifying the foreground as a binary pixel mask over the camera image and identifying connected regions of interest within the foreground mask. Segmentation, tackled by many different techniques within the literature, is not the focus of this thesis. The current work uses existing segmentation techniques which will be briefly discussed here.

The cameras work on reflected near-infrared light (approximately 830 nm), and the display system operates in the dark. This means that little to no illumination change is encountered in the camera images. Furthermore, the cameras are passive and the background is clutter-free during operation. As a result, the segmentation technique used is very basic, making it extremely efficient. Background subtraction is used, where the background model is the mean image of 40 frames captured when there are no persons within the environment. The absolute difference between each incoming frame and the background model is thresholded to obtain a binary mask representing the foreground. Although more complex models such as mixture-of-Gaussians could have been employed, they impacted the performance of the system without providing sufficient benefits, and were left out of scope of this thesis. An example of foreground extraction is shown in Figure 3.7.
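A minimal sketch of this background model and thresholding step is shown below. It assumes a list of empty-scene frames and a manually chosen threshold (both placeholders); the actual threshold used in the system is tuned to the near-infrared imagery.

import numpy as np

def build_background(empty_frames):
    # Background model: mean of ~40 frames captured while the environment is empty.
    return np.mean(np.stack(empty_frames).astype(np.float32), axis=0)

def foreground_mask(frame, background, threshold=30):
    # Threshold the absolute difference against the background model.
    diff = np.abs(frame.astype(np.float32) - background)
    return (diff > threshold).astype(np.uint8) * 255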

Figure 3.7: Foreground Extraction

One difficulty in the use of near-infrared sensors is that the amount of reflected light depends on the type of material that the person is wearing. Although most clothing reflects near-infrared light quite effectively, occasionally there are regions that are not as bright within the images, and the thresholding process rejects these regions. This results in multiple split foreground regions belonging to the same person. To resolve this issue, the binary image is subjected to a series of morphological operators, prior to connected component analysis, in order to close such regions. A median filter first removes small noisy regions, and then the image is dilated twice and eroded once. For the dilation and erosion operations, the standard 3 × 3 structuring element is used:

  1 1 1    1 1 1  (3.1)   1 1 1

Connected components of true values within the resulting mask must then be determined. For this, an existing algorithm (cvFindContours) available in Intel's OpenCV library (Bradski, 2002) is used, which performs contour extraction of the foreground regions. The result of this step is a foreground mask and a set of blobs for each camera, each blob described by its boundary contour. An example of the extracted contours is shown in Figure 3.8.
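The post-processing described above can be reproduced with the modern OpenCV Python bindings (the thesis used the older C interface); the dilation/erosion counts and the 3 × 3 kernel follow the description, while the median filter aperture is a placeholder and cv2.findContours stands in for cvFindContours (OpenCV 4 return convention assumed).

import cv2
import numpy as np

def clean_and_extract_blobs(mask):
    # Median filter removes small noisy regions, then dilating twice and eroding
    # once closes split regions belonging to the same person (Equation 3.1 kernel).
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.medianBlur(mask, 5)
    mask = cv2.dilate(mask, kernel, iterations=2)
    mask = cv2.erode(mask, kernel, iterations=1)
    # Contour extraction yields one boundary contour per foreground blob.
    contours, _hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                            cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours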

Figure 3.8: Segmentation

3.1.2 The Master Process

Once the slave machines have finished processing the images from the camera video streams, the result is a set of blobs observed in each camera. Each blob in an overhead camera represents a single person, whereas each blob in an oblique camera can contain more than one person due to occlusion. In order to keep network bandwidth low, the slaves send only blob information to the master, without any image data. The master's task is to reconstruct the 3-D positions of each individual person without being able to use image-processing techniques; only the spatial and temporal properties of the blobs may be used. The master process is illustrated in Figure 3.9.

Figure 3.9: The Master Process within Immersitrack

The master maintains a database of tracked 3-D persons within the real world. The only information that the master receives from the slaves is the tracked targets from the overhead cameras and the localized bodies from the oblique cameras. The master first performs correspondence between each tracked person and the blob information from the slaves, and determines if any new persons have entered the environment, or anyone has left the environment. Once all the persons in the current frame have been determined and the corresponding blobs assigned, the master performs reconstruction of the new positions using the assigned blobs. The reconstructions form the information used to update the database.

3.2 Camera Calibration

Before the master can determine correspondence across cameras, the geometric properties of each camera must be identified, in order to determine where blobs within one camera might occur in another camera. This is the process of camera calibration. Independent calibration of each camera is performed using two calibration tools: a planar grid and a 3-D calibration rig.

Planar Grid: A grid of calibration markers is placed on the ground using electrical tape, which is able to reflect near-infrared light quite well, and this grid is used to perform an initial coplanar calibration of the cameras. The positions of the markers are shown in Figure 3.10. Each marker consists of a cross that is placed exactly 1 metre away from all of its neighbours. A view of the markers in every camera is shown in Figure 3.11.

Figure 3.10: Coplanar Calibration Grid Markers

To perform fully optimised calibration using the Tsai camera model, a camera requires a view with at least 11 points. As Figure 3.11 shows, even the camera that views the fewest points can still observe 14 points. Thus, fully optimised planar calibration of all cameras can be performed using the markers.

Figure 3.11: Calibration Grid viewed from Cameras

Calibration Rig: To achieve non-coplanar calibration, a 2-m high and 1-m wide calibration rig, made of metal and reflecting near-infrared light well, is used. The calibration rig is shown in Figure 3.12. When the calibration rig is placed so that it is aligned with the markers, it provides an additional 4 points, whose world positions are known, to perform non-coplanar calibration.

Figure 3.12: 3-D Calibration Rig

A calibration application was built which allows the calibrator to select known calibration points within the camera images and specify their world positions. Once a sufficient number of points are specified, the system performs calibration using Tsai's calibration techniques. Doing this for each camera yields absolute, independent calibrations for all cameras.

3.2.1 Camera Parameters

The resulting calibration parameters are shown in Tables 3.1, 3.2, 3.3 and 3.4.

       Camera 1    Camera 2    Camera 3    Camera 4
Cx     328.5781    293.5116    307.638     307.4941
Cy     235.6596    246.6517    236.4496    229.8994
sx     1.025766    1.02685     1.029468    1.033743
f      11.79505    11.71578    12.44507    12.01212
κ1     0.004082    0.00408     0.003864    0.003905
Tx     550.1182    551.833     -368.06     -411.673
Ty     -684.737    303.5553    337.1502    -716.319
Tz     495.9996    383.1989    547.7831    580.4539
Rx     -3.08285    3.021663    3.082938    -3.01038
Ry     -0.02984    -0.05737    0.159302    0.061654
Rz     -3.12521    -1.55281    -0.15304    1.580297

Table 3.1: Camera Parameters for Cameras 1–4

       Camera 5    Camera 6    Camera 7    Camera 8
Cx     357.0203    307.7223    299.0942    302.1438
Cy     264.7282    286.1142    260.4388    244.2268
sx     0.990329    0.986908    1.030483    0.993396
f      17.72521    17.63594    15.24365    12.48836
κ1     0.000494    0.001059    0.002627    0.003311
Tx     731.3163    183.0899    786.1197    383.4755
Ty     230.0783    -82.9522    -220.145    -222.535
Tz     490.137     1267.457    598.8774    470.5333
Rx     -2.30193    -1.99813    -3.16221    -3.09781
Ry     -0.89736    0.521296    -0.04204    -0.16401
Rz     -2.52454    2.780032    3.069852    3.063848

Table 3.2: Camera Parameters for Cameras 5–8

       Camera 9    Camera 10   Camera 11   Camera 12
Cx     361.1488    290.1317    353.6802    260.5765
Cy     298.7282    250.4626    305.8194    218.4972
sx     0.981604    1.001901    1.017409    0.982693
f      17.1993     16.37304    12.79102    14.34182
κ1     0.00081     0.000822    0.003548    0.003246
Tx     -155.372    571.9855    -349.86     -540.265
Ty     426.3587    500.4112    750.2151    861.8024
Tz     29.56276    388.1422    562.606     699.2138
Rx     2.262319    3.072117    3.144386    3.026233
Ry     -0.5906     -0.06012    0.06931     0.320389
Rz     -0.30855    -1.61777    0.075329    -0.0847

Table 3.3: Camera Parameters for Cameras 9–12

       Camera 13   Camera 14   Camera 15   Camera 16
Cx     360.7611    256.228     267.8962    304.4545
Cy     229.4947    248.6448    236.7121    208.393
sx     0.960281    0.99633     0.993419    0.973321
f      18.56652    17.34269    17.6937     18.58304
κ1     0.0005      0.001278    0.001086    0.000736
Tx     -581.224    408.0095    621.1651    -269.301
Ty     309.1157    372.3758    -56.5384    -113.949
Tz     305.4543    132.1931    989.8382    1235.524
Rx     1.991292    2.806747    -2.09523    -2.59625
Ry     0.175803    -0.95448    -0.17321    1.03601
Rz     0.086747    -1.18347    -2.97664    2.107221

Table 3.4: Camera Parameters for Cameras 13–16

3.2.2 Calibration Accuracy

To measure the accuracy of the camera calibration, the reprojection error for each camera is calculated. To compute this error for a given camera, each 3-D calibration point for the camera is re-projected into the camera image (768 × 576 pixels) using the calibration data. The error for that point is then the 2-D Euclidean distance (in pixels) between the re-projected point and the actual point input during calibration. The error for each camera is then measured as the mean error over all calibration points. The resulting error graph is shown in Figure 3.13 and the quantitative reprojection errors are in Table 3.5.
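The per-camera figures in Table 3.5 amount to the following computation; `project` is a hypothetical helper standing in for the calibrated Tsai camera model, which maps a 3-D calibration point to image pixels.

import numpy as np

def mean_reprojection_error(points_3d, points_2d, project):
    # `project` is assumed to implement the calibrated world-to-image mapping.
    errors = [np.linalg.norm(np.asarray(project(X)) - np.asarray(x))
              for X, x in zip(points_3d, points_2d)]
    return float(np.mean(errors)), float(np.std(errors))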

Figure 3.13: Reprojection Error for Calibration. Error is plotted against left axis. Number of points is plotted against right axis

Camera   Number of Points   Mean Error   Std. Dev. of Error
1        25                 2.19769      1.04626
2        25                 2.42604      1.391
3        22                 1.68256      1.03222
4        25                 2.56206      1.25856
5        29                 2.5543       1.51218
6        22                 2.00653      1.54734
7        22                 3.54975      1.67164
8        22                 1.41939      0.8689
9        26                 1.27797      0.747871
10       20                 2.53845      1.3357
11       21                 2.28465      1.13278
12       18                 2.33983      1.22431
13       25                 1.1113       0.613311
14       21                 1.92196      0.963914
15       27                 1.77675      0.962756
16       25                 2.25683      1.24507

Table 3.5: Reprojection Error

3.3 Overhead Cameras

The system first performs 2-D tracking in the overhead cameras and attempts to find correspondences between existing tracked persons and segmented blobs. For each frame, each overhead camera has two sets of information: (i) a database of 2-D targets from the previous frame and (ii) the set of segmented foreground blobs extracted in the current frame. Targets are specified only by their motion models, whereas segmented blobs provide shape and appearance information.

The motion model used to represent each target is first presented in Section 3.3.1. Each target is modelled as a first-order Markov process with Gaussian noise, and is maintained using the Kalman filtering framework presented in Chapter 2. The use of Kalman filtering allows the system to predict a target's motion in subsequent frames. This prediction increases the confidence in determining correspondence between targets and blobs.

To determine the correspondence, a dynamic graph-theoretic framework is used. Using only properties of the bounding box of targets and blobs, an optimal match between targets and blobs is determined allowing multiple targets to match to multiple blobs. This matching process, presented in section 3.3.2, is shown to be very effective despite the lack of image features.

Finally, a limitation of the Kalman filtering framework, in updating multiple targets when they match to a single blob, is discussed in section 3.3.3. To overcome these cases, a novel technique of switching tracking to optical flow methods is presented. This switching allows targets to maintain identities even when their blobs merge.

3.3.1 2-D Motion Models

In order for the slave machines to perform tracking in the overhead cameras, a 2-D Kalman filter formulation is used to model each tracked blob, or target, that is being tracked by a camera. The Kalman filter is used to provide an accurate prediction of blob motion in subsequent frames, which assists in determining correspondence between observed blobs and tracked targets.

Because blob-based tracking is performed, each 2-D target t_t represents the projection of a single person within the camera view and is modelled as

\[
\mathbf{t}_t = \begin{bmatrix} x & y & \dot{x} & \dot{y} & w & h & \dot{w} & \dot{h} \end{bmatrix}^T \tag{3.2}
\]

where [x, y] is the 2-D location of the centre of mass of the blob in image coordinates, [ẋ, ẏ] is the velocity of its centre of mass, [w, h] is the width and height of its bounding box and [ẇ, ḣ] is the rate of change of the width and height. In each frame, once the system computes

correspondence, the bounding box of a segmented blob updates the state of its corresponding target. Thus, the measurement vector of each 2-D target is simply [x, y, w, h]. The overhead cameras perform frame-based tracking, and since no frames are skipped, ∆t is set to 1. Then the standard first-order equations are assumed:

\[
\mathbf{t}_t = A\,\mathbf{t}_{t-1} =
\begin{bmatrix}
x_{t-1} + \dot{x}_{t-1} \\
y_{t-1} + \dot{y}_{t-1} \\
\dot{x}_{t-1} \\
\dot{y}_{t-1} \\
w_{t-1} + \dot{w}_{t-1} \\
h_{t-1} + \dot{h}_{t-1} \\
\dot{w}_{t-1} \\
\dot{h}_{t-1}
\end{bmatrix},
\qquad
A =
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\tag{3.3}
\]

Since the system performs blob-based tracking, the observation consists of the bounding box of each blob in each frame. Thus, the measurement equations are

\[
\mathbf{z}_t = [x, y, w, h]^T = H\,\mathbf{t}_t,
\qquad
H =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\tag{3.4}
\]
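A sketch of this eight-state, four-measurement filter using OpenCV's KalmanFilter class is shown below. The transition and measurement matrices follow Equations 3.3 and 3.4 (∆t = 1); the noise covariances are illustrative placeholders rather than the values tuned for Immersitrack.

import cv2
import numpy as np

def make_blob_kalman_filter():
    # State: [x, y, x_dot, y_dot, w, h, w_dot, h_dot]; measurement: [x, y, w, h].
    kf = cv2.KalmanFilter(8, 4)
    A = np.eye(8, dtype=np.float32)
    A[0, 2] = A[1, 3] = A[4, 6] = A[5, 7] = 1.0      # constant-velocity terms
    kf.transitionMatrix = A
    H = np.zeros((4, 8), dtype=np.float32)
    H[0, 0] = H[1, 1] = H[2, 4] = H[3, 5] = 1.0      # observe position and box size
    kf.measurementMatrix = H
    kf.processNoiseCov = np.eye(8, dtype=np.float32) * 1e-2      # placeholder
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1  # placeholder
    return kf

# Per frame: predict, then correct with the matched blob's bounding box.
kf = make_blob_kalman_filter()
prediction = kf.predict()
kf.correct(np.array([[120.0], [80.0], [40.0], [90.0]], dtype=np.float32))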

3.3.2 Correspondence

The problem of correspondence between targets and blobs within the camera view becomes complex as tracked targets come close together. When this happens, the extracted blobs within the camera views start to merge. Thus, it becomes difficult to find one-to-one correspondences between blobs and targets, especially since the motion of persons within the environment can result in very complicated blob motion dynamics. An example of two targets within the same foreground blob appears in Figure 3.14.

To overcome this split-merge problem, a graph-theoretic technique is employed, similar to that presented by Chen et al. (2001), who tackled the problem of assigning tracked targets to foreground blobs in colour video. They constructed a weighted bipartite graph between targets and blobs. The algorithm assigns a cost to each edge between a target and a blob, calculated as the dissimilarity between the target and the blob. Constructing a minimum cost edge cover of the graph then yields an optimal match between targets and blobs, allowing multiple blobs to be assigned to a target and multiple targets to a blob. Their solution is suitable because the problem reduces to the linear assignment problem, for which a polynomial-time solution exists. However, their approach uses a cost

function based on colour similarity, as they attempt to resolve correspondence after occlusion has occurred. This work employs a cost function based only on the spatial properties of blobs and targets.

Figure 3.14: Multiple Participants in One Blob

Given a set of segmented blobs b_i and tracked targets t_j, the task is to match the targets to the blobs. First, a dissimilarity cost between each blob and each target is constructed:

\[
D(b_i, t_j) = \frac{\left(1 - \dfrac{\left\| b_i^{BB} \cap t_j^{BB} \right\|}{\left\| b_i^{BB} \right\|}\right)^{2} + \left(1 - \dfrac{\left\| b_i^{BB} \cap t_j^{BB} \right\|}{\left\| t_j^{BB} \right\|}\right)^{2}}{2}
\tag{3.5}
\]

where ‖·‖ is the area operator and b_i^BB and t_j^BB are the bounding boxes of b_i and t_j respectively. Intuitively, the function tends towards a maximum of one as the two bounding boxes become more and more separated, and towards a minimum of zero as they come closer together in both size and position. Furthermore, if multiple blobs reside within a single target, the intersection between the blobs and the target determines the cost of each blob, and the total cost summed across the target will not exceed one; the same holds when multiple targets lie within a single blob.
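Equation 3.5 translates directly into a few lines of code; in this sketch, bounding boxes are assumed to be given as (x, y, width, height) tuples.

def bbox_area(b):
    return b[2] * b[3]

def bbox_intersection_area(a, b):
    # Overlap of two (x, y, w, h) axis-aligned bounding boxes.
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def dissimilarity(blob_bb, target_bb):
    # Equation 3.5: zero for identical boxes, tending towards one as they separate.
    inter = bbox_intersection_area(blob_bb, target_bb)
    return ((1 - inter / bbox_area(blob_bb)) ** 2 +
            (1 - inter / bbox_area(target_bb)) ** 2) / 2.0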

The algorithm then proceeds as follows. A bipartite graph G = (V, E) is built, with each vertex representing either a target or a foreground blob. An edge is constructed between each target vertex and blob vertex if the dissimilarity between the vertices is less than 1 − ε; the use of ε ignores very small, negligible overlaps. Each edge e is assigned a weight w(e) equal to the corresponding dissimilarity cost. Further, two dummy nodes are constructed, one on each side of the bipartite graph. An edge between a dummy node and a real node is constructed if there are no edges connecting to the real node, and the weight of these edges is set to 1/2. A dummy node connection represents either a foreground blob that does not match any target, or a target that has disappeared from the reference frame of the camera.

The system computes the minimum weight edge cover of this graph to find an optimal match between tracked targets and segmented blobs. In Appendix A, it is shown that this problem reduces to the linear assignment problem, which can be solved in polynomial time using the Hungarian algorithm (Kuhn, 1955; Munkres, 1957). Examples of constructed graphs and the resulting edge covers appear in Figure 3.15.
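The reduction to linear assignment can be handled by an off-the-shelf Hungarian solver. The sketch below is a simplification of the scheme described above, assuming the dissimilarity matrix from Equation 3.5 as input: it pads the cost matrix with dummy rows and columns of weight 1/2 and returns a one-to-one assignment, leaving the multi-way cases of the full edge cover to be resolved by the rules listed after Figure 3.15.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_targets_to_blobs(cost, dummy_cost=0.5):
    # cost: (n_targets, n_blobs) dissimilarity matrix (Equation 3.5).
    n_t, n_b = cost.shape
    padded = np.full((n_t + n_b, n_b + n_t), dummy_cost)
    padded[:n_t, :n_b] = cost                 # real target-blob edges
    rows, cols = linear_sum_assignment(padded)
    pairs, new_blobs, lost_targets = [], [], []
    for r, c in zip(rows, cols):
        if r < n_t and c < n_b:
            pairs.append((r, c))              # target r matched to blob c
        elif r < n_t:
            lost_targets.append(r)            # target matched a dummy: disappeared
        elif c < n_b:
            new_blobs.append(c)               # blob matched a dummy: new target
    return pairs, new_blobs, lost_targets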

Figure 3.15: Overhead Tracking via Graph Matching. Left column: Graph as an adjacency matrix. Middle column: Camera image (yellow rectangles are segmented blobs and green rectangles are tracked targets). Right column: Resulting minimum weighted edge cover of the graph (red nodes are dummy nodes). Top row: New target to be tracked enters the camera's reference frame. Middle row: One-to-one correspondence between blobs and targets. Bottom row: Two targets within one blob.

One of several situations can then occur:

« A single target node connects to exactly one blob node and no other target node connects to that blob node. The position and dimensions of the bounding box of the connected foreground blob update the Kalman filter state of the target.

« A target node connects to a dummy node. This occurs when the target does not match any segmented foreground blob. Delete the target, as it has exited the camera's reference frame.

« A blob node connects to a dummy node. This occurs when a new untracked person enters the camera’s reference frame. Create a new target and set its Kalman state to the state of the blob.

« A target node connects to multiple foreground nodes. This occurs when a target turns out to be a group of persons. Update the target with the foreground blob that has the least dissimilarity cost, and create a new target for each remaining foreground blob.

« A foreground blob node connects to multiple target nodes. The final case occurs when multiple tracked targets come too close together and thus lie within the same foreground blob. When this occurs, a novel method of switching tracking mechanisms is used, explained in the following subsection.

3.3.3 Tracker Switching

When multiple persons being tracked individually come close together, their segmented blobs merge, resulting in a single blob. In this case, the state of each individual target must somehow be determined during the merge so that, when they split up again, the system maintains target identities. This is the familiar merge-split problem. Because of the use of overhead cameras, it is possible to identify targets through merges using only the motion models. To this end, the system assumes that although the edges of a target's blob may be occluded during a merge, the centre of mass is always visible. Furthermore, it assumes that the bounding box of a target will not vary much while it is in a group, as the other targets restrict its motion.

Based on these assumptions, when targets come together to form a group, the system bases their motion on that of their individual centres of mass. It maintains a separate point for each individual within the group, initialised to the centre of mass from the individual's motion model before they entered the group. A Lucas-Kanade optical flow tracker (Lucas and Kanade, 1981) then tracks the centres of mass while the targets are in the group. This optical flow tracker is suitable and highly robust for such small motions, allowing the determination of how targets are moving within the group. An example of the optical flow tracker at work is shown in Figure 3.16.
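A sketch of the tracker switch using OpenCV's pyramidal Lucas-Kanade implementation is given below; the window size and pyramid depth are illustrative placeholders, and prev_gray and curr_gray are assumed to be consecutive 8-bit greyscale frames from the same overhead camera.

import cv2
import numpy as np

def track_group_members(prev_gray, curr_gray, centres):
    # centres: (x, y) centre-of-mass points of the individual targets,
    # recorded just before their blobs merged into a group.
    prev_pts = np.float32(centres).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    # Keep the previous position wherever the flow could not be found.
    next_pts = np.where(status.reshape(-1, 1, 1) == 1, next_pts, prev_pts)
    return next_pts.reshape(-1, 2)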

The effectiveness of the system in tracking three persons in a single overhead camera, as they come close enough for their blobs to merge together, is illustrated in Figure 3.17. Initially, the system tracks and assigns a unique id to each person individually. Two of the persons then come close enough together that their blobs within the camera merge and form a single blob. Then the third person also joins the group, forming a single blob consisting of three persons. Finally, all three persons split up again.

Figure 3.16: Optical flow tracking through blob merges. Yellow boxes are detected blobs. Green boxes are individual targets. Cyan boxes are blobs containing multiple targets, where the tracker switches to optical flow tracking.

The figure shows that the system is able to recognize that the blobs have merged and start tracking a grouped blob. The system maintains individual target positions by creating a new optical flow tracker for each merged blob. The bounding box shown for the grouped blob is simply the combined bounding box of the individual blobs. This method of tracking allows the system to maintain identities despite the merging of blobs. Results of overhead tracking are shown in Figure 3.17.

Figure 3.17: Overhead Tracking Results (green squares represent tracked blobs from previous frames and yellow squares represent detected blobs in the current frame). Frame 1 to Frame 61: 3 persons moving around are tracked as 3 individual blobs. Frame 141: Persons 2 and 3 merge and are tracked as a group. Frame 157: Person 1 joins the group containing Persons 2 and 3. Frame 239: Person 1 leaves the group and the system is able to maintain the original id. Frame 285: Person 1 has left the system altogether and Person 2 leaves the group.

3.4 Oblique Cameras

Due to the use of the overhead views in this work, tracking is not required in the oblique cameras. The corresponding slave process only performs segmentation and some simple spatial analysis, and the master process then determines how to assign oblique blobs to individuals. This considerably reduces the computational load on the oblique cameras. There is also no need to worry about occlusion, as the master can resolve it effectively.

The task of the oblique cameras is to identify the positions of individual bodies within the environment. To provide sufficient support for the later 3-D reconstruction stage, the oblique process needs to extract the 2-D centre-of-mass for each person. The master then combines these positions with the information from the overhead cameras, to allow the determination of the heights of persons.

3.4.1 Body Localization

To determine where individual bodies are located within the reference frame of the camera, a technique similar to the Backpack system by Haritaoglu et al. (1999a) is used. As in the overhead cameras, the process first applies segmentation to extract the foreground silhouette, and a connected component analysis to determine whole regions. The result is a set of binary silhouette blobs. A binary mask defines each silhouette blob:

\[
s_i : [0, w] \times [0, h] \mapsto \{0, 1\} \tag{3.6}
\]

where w is the width of the blob's bounding box and h is the height. For each silhouette blob, the system computes the normalized projection of the silhouette mask onto the horizontal x-axis. That is, given a silhouette blob,

\[
h_i : [0, w] \mapsto [0, 1], \qquad h_i(x) = \frac{1}{h} \sum_{y=0}^{h} s_i(x, y) \tag{3.7}
\]

The mean value of the horizontal projection is determined:

\[
\mu_i = \frac{\sum_{x=0}^{w} h_i(x)}{w} \tag{3.8}
\]

Using µ_i, the system determines a set of continuous horizontal segments [x_1, x_2] such that the value of h_i between x_1 and x_2 is always greater than µ_i. Each segment then gives

the horizontal position and width of the body of a single individual. A width threshold discards segments that are too narrow to be part of a body. Once the horizontal segments are determined, it is necessary to refine the vertical position and height of each body. To perform this step, for each horizontal segment [x_1, x_2], the silhouette mask is scanned from top to bottom and a set of vertical segments [y_1, y_2] is found such that

\[
\forall y \in [y_1, y_2], \; \exists x \in [x_1, x_2] : s_i(x, y) = 1 \tag{3.9}
\]

This step separates vertical segments wherever image rows in the blob contain no true pixels of the silhouette mask within [x_1, x_2]. In the case of multiple segments, the system chooses the largest segment as the vertical position and height of the body. The resulting horizontal and vertical segments determine a bounding box for an individual body. All three steps are shown in Figure 3.18.

Figure 3.18: Extracting body segments in oblique view cameras. Top image: Horizontal segments for bodies are first determined. Middle image: Horizontal segments are then used to determine the vertical extents of the body. Bottom image: The horizontal extents and the largest vertical extents are used to construct the body rectangle.
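A condensed version of this projection-based localisation is sketched below for a single silhouette blob mask, taken as a 2-D 0/1 array; the minimum-width threshold is a placeholder value.

import numpy as np

def localise_bodies(blob_mask, min_width=10):
    # blob_mask: binary (h, w) array for one silhouette blob.
    # Returns a list of (x1, x2, y1, y2) body rectangles.
    h, w = blob_mask.shape
    proj = blob_mask.sum(axis=0) / float(h)   # normalised horizontal projection (Eq. 3.7)
    mu = proj.mean()                          # mean projection value (Eq. 3.8)
    above = proj > mu
    bodies = []
    x = 0
    while x < w:
        if above[x]:
            x1 = x
            while x < w and above[x]:
                x += 1
            x2 = x - 1
            if x2 - x1 + 1 >= min_width:      # discard segments too narrow for a body
                # Rows with at least one true pixel inside [x1, x2] (Eq. 3.9).
                rows = blob_mask[:, x1:x2 + 1].any(axis=1)
                segments, y = [], 0
                while y < h:
                    if rows[y]:
                        y1 = y
                        while y < h and rows[y]:
                            y += 1
                        segments.append((y1, y - 1))
                    else:
                        y += 1
                y1, y2 = max(segments, key=lambda s: s[1] - s[0])   # largest vertical extent
                bodies.append((x1, x2, y1, y2))
        else:
            x += 1
    return bodies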

Finally, the centre-of-mass position of each individual is set as the centre of mass of his or her body rectangle. The results in the oblique cameras are shown in Figure 3.19. Note that the figure also shows a green circle on top of each person. This point approximates the head position of each person, determined by using the horizontal mid-point of the bounding box along with the uppermost y-value. These points give an idea of the relative alignment between the bounding box and the actual body.

Figure 3.19: Final Results in Oblique View Cameras

3.5 Determining 3-D Hypotheses on the Master

At each time-step, the master maintains a hypothesis of each person's motion, denoted by T′_{k+1}. The master process then uses this prediction to assign blobs from the slaves to individual targets and approximates each target's position in the current frame by reconstructing a 3-D position from the assigned blobs.

It may be recalled from Section 3.3 that the slaves use a Kalman filter formulation for each blob observed in each overhead camera. This formulation is extended to 3-D on the master machine to model the hypothesis of each person being tracked. This two-level Kalman filtering provides more robustness in the tracking and prediction of individual persons. The 3-D motion models used on the master are first presented in section 3.5.1, and the algorithms used to perform the correspondence and reconstruction in sections 3.5.2, 3.5.3 and 3.5.4.

Section 3.5.5 presents an additional verification step which leverages the closed-world nature of the immersive environment and uses an entrance rule to verify tracked targets: targets may only be created or destroyed when they pass through the entrance. This significantly increases the robustness of Immersitrack.

70 3.5.1 3-D Motion Models

A 6-dimensional vector T_k models each tracked person in the environment in 3-D,

\[
T_k = \begin{bmatrix} x & y & z & \dot{x} & \dot{y} & \dot{z} \end{bmatrix}^T \tag{3.10}
\]

where [x, y, z] is the spatial location of the target in world coordinates and [ẋ, ẏ, ż] is its spatial velocity in world coordinates at time k. The master process uses a first-order Kalman filter model, so the state transition matrix is

\[
A =
\begin{bmatrix}
1 & 0 & 0 & \Delta t & 0 & 0 \\
0 & 1 & 0 & 0 & \Delta t & 0 \\
0 & 0 & 1 & 0 & 0 & \Delta t \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\tag{3.11}
\]

Note the use of the ∆t term in the transition matrix. The transition matrix presented in Section 3.3 for the 2-D case does not contain this term, as the overhead cameras do not need to perform prediction between frames. This is because the interval between frames, for each camera, is quite small, resulting in negligible motion. The master, however, has to deal with several processes and network data transfer, and can occasionally suffer a significant delay between updates. The use of the ∆t term ensures that the master updates the Kalman filter with real-world times.

For 3-D targets only the position of each target is observed:

\[
Z_k = [x, y, z]^T \tag{3.12}
\]

and so the measurement matrix is

  1 0 0 0 0 0   H  0 1 0 0 0 0  (3.13) =   0 0 1 0 0 0

It is up to the master to determine the appropriate 3-D measurement vector Z_k, based on the 2-D blob information, in order to update the motion models and trajectories of the persons. To do so, the master needs to assign blobs from the slaves to persons that are being tracked. Once all blobs are assigned, for each person, the master uses the 2-D positions of the blobs to triangulate and reconstruct the 3-D position of the person.

Before explaining the process of assignation and reconstruction, some notation is introduced:

« O = {o_1, o_2, ..., o_n} is the set of all overhead blobs from the slaves

« S = {s_1, s_2, ..., s_n} is the set of all oblique blobs from the slaves

« C = {c_1, c_2, ..., c_n} is the set of all cameras within the system

« P = {P_1, P_2, ..., P_n} is the set of all individuals being tracked by the master

« c : O ∪ S → C maps blobs to the cameras in which they were observed

Three assignation maps are defined:

\[
\varphi : O \mapsto P \tag{3.14}
\]
\[
\xi : S \mapsto \mathcal{P}(P) \tag{3.15}
\]
\[
\gamma : P \mapsto \mathcal{P}(O \cup S) \tag{3.16}
\]

φ assigns each overhead blob in O to a person in P, ξ assigns each oblique blob in S to one or more persons in P, and γ is the inverse of φ and ξ, mapping individuals to the blobs assigned to them. All three maps are initially empty (γ(P) = ∅ for each existing individual P, and ξ(s) = ∅ for each oblique blob s), and the first task of reconstruction is to initialise them. To perform this task, a series of assumptions are made:

If P_1, P_2 ∈ P and P_1 ≠ P_2 then γ(P_1) ∩ γ(P_2) ⊂ S. Two separate individuals cannot have the same overhead blobs assigned to them but, due to occlusion, may have the same oblique blobs assigned.

φ and ξ are total maps. Each blob in O ∪ S must be assigned to at least one person in P, after all the assignations are performed.

If a, b ∈ O and c(a) = c(b) then φ(a) ≠ φ(b). Two blobs viewed in the same camera cannot be assigned to the same person. Similarly, if a, b ∈ S and c(a) = c(b) then ξ(a) ∩ ξ(b) = ∅.

For simplicity, given a map α, define D(α) to be the domain of α (that is, the set of all elements for which α is defined). A high-level overview of the reconstruction algorithms is given in Figure 3.20.

72 Figure 3.20: Reconstruction Algorithms

3.5.2 Assign Blobs to Existing Persons

The first step for the master is to assign blobs to tracked individuals. The motion models of these individuals give a fair idea of their position in the current time-step, so that their re-projections in the camera views can be used to identify the appropriate blobs to use in their reconstruction.

To assign overhead camera blobs to individual persons, the system maintains a persistent map ψ : O → P that records the blobs that have been assigned to specific individuals in previous frames. Since the overhead blobs are tracked independently on each slave, ψ can be updated using the 2-D correspondence determined on the slave. That is, if ψ is defined for an overhead blob o_i, and o_i is still being tracked in its camera's reference frame, then its assignation is simply ψ(o_i). This is shown in Algorithm 3.1.

Algorithm 3.1: Assign previously tracked overhead blobs to existing persons.

for o_i ∈ O do
    if o_i ∈ D(ψ) then
        φ(o_i) = ψ(o_i);
        γ(ψ(o_i)) = γ(ψ(o_i)) ∪ o_i;

At this stage, there may yet be overhead blobs that are unassigned to persons but that are re-projections of tracked persons. This occurs when a person moves into the reference frame of a camera. To assign these blobs correctly, the motion models of each person are used. Let T_{i,k} be the predicted position of tracked person P_i in the current time step k, and let X ⊥ c_i denote the 2-D projection of the 3-D point X into camera c_i. Furthermore, given a 2-D point x and a blob b, define x [∈] b if x lies within the bounding box of b. Details of how untracked overhead blobs are assigned to existing persons are given in Algorithm 3.2.

Algorithm 3.2: Assign untracked overhead blobs to existing persons

for o_i ∈ O do
    if o_i ∉ D(φ) then
        for P_j ∈ P do
            if (T_{j,k+1} ⊥ c(o_i)) [∈] o_i then
                φ(o_i) = P_j;
                γ(P_j) = γ(P_j) ∪ o_i;
                ψ(o_i) = P_j;

Finally, oblique blobs are assigned to existing persons. Since these blobs are not tracked on the slaves, correspondence between individuals and oblique blobs in each time step must be determined. This is done similarly to the prior step performed for unassigned overhead

blobs, and is shown in Algorithm 3.3.

Algorithm 3.3: Assign oblique blobs to existing persons

for s_i ∈ S do
    for P_j ∈ P do
        if (T_{j,k+1} ⊥ c(s_i)) [∈] s_i then
            ξ(s_i) = ξ(s_i) ∪ P_j;
            γ(P_j) = γ(P_j) ∪ s_i;
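For clarity, Algorithms 3.1 to 3.3 can be rendered with ordinary dictionaries and sets. The sketch below assumes hypothetical helper objects not defined in the thesis: each blob carries its camera, its (x, y, w, h) bounding box and the slave-side track identifier, each person carries a predicted 3-D position, and project(point, camera) stands in for the ⊥ operator.

def point_in_blob(pt, blob):
    # pt lies within the blob's (x, y, w, h) bounding box ([∈] operator).
    x, y, w, h = blob.bbox
    return x <= pt[0] <= x + w and y <= pt[1] <= y + h

def assign_blobs(overhead, oblique, persons, phi, xi, gamma, psi, project):
    # Algorithm 3.1: overhead blobs already tracked on a slave keep their person.
    for o in overhead:
        if o.track_id in psi:
            phi[o] = psi[o.track_id]
            gamma.setdefault(phi[o], set()).add(o)
    # Algorithm 3.2: remaining overhead blobs matched via predicted re-projections.
    for o in overhead:
        if o not in phi:
            for p in persons:
                if point_in_blob(project(p.predicted_position, o.camera), o):
                    phi[o] = p
                    gamma.setdefault(p, set()).add(o)
                    psi[o.track_id] = p
                    break
    # Algorithm 3.3: oblique blobs may be shared by several occluding persons.
    for s in oblique:
        for p in persons:
            if point_in_blob(project(p.predicted_position, s.camera), s):
                xi.setdefault(s, set()).add(p)
                gamma.setdefault(p, set()).add(s)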

3.5.3 Group Unassigned Blobs

After blobs are assigned to existing persons, there may still be unassigned blobs. These blobs are the re-projections of individuals that have just entered the environment in the current time-step. The system needs to determine correspondences across cameras amongst these blobs and construct new targets from them. The master does not receive any images from the slaves, so the only information that can be used for correspondence is the spatial positions of the blobs.

The first step is to group together unassigned overhead camera blobs that are re-projections of the same person, as shown in Algorithm 3.4.

Algorithm 3.4: Group unassigned overhead blobs

for o_i ∈ O do
    if o_i ∉ D(φ) then
        W = Approximate3DPosition(o_i);
        P′ = CreateNewPerson();
        P = P ∪ {P′};
        φ(o_i) = P′, γ(P′) = γ(P′) ∪ o_i, ψ(o_i) = P′;
        for o_j ∈ O do
            if o_j ∉ D(φ) ∧ c(o_i) ≠ c(o_j) then
                w = W ⊥ c(o_j);
                if w [∈] o_j then
                    φ(o_j) = P′, γ(P′) = γ(P′) ∪ o_j, ψ(o_j) = P′;

This algorithm ensures that if a person walks into the reference frames of multiple overhead cameras at the same time, the blob correspondence across cameras is appropriately determined and a new person is created from the set of corresponding blobs. The routine Approximate3DPosition is any routine that takes as input an overhead blob and determines an approximation of the blob's centre of gravity. In this work, the most effective method found was to fix the height of the centre of gravity at half the approximate human body height, and back-project the centre of mass of the blob into 3-D using the calibration data of the camera in which the blob was visible.
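A possible implementation of Approximate3DPosition is sketched below. It assumes that the calibration supplies a rotation matrix R, translation vector t and focal length f for the blob's camera, that the world z-axis is vertical, and that the average body height is taken as roughly 1.7 m; these are illustrative assumptions rather than the thesis's exact values.

import numpy as np

def approximate_3d_position(blob_centre, R, t, f, body_height=1.7):
    # Back-project a blob's 2-D centre of mass onto the horizontal plane at
    # half the approximate body height (z0), using the camera calibration.
    u, v = blob_centre                      # ideal (undistorted) image coordinates
    d = np.array([u / f, v / f, 1.0])       # viewing ray direction in camera coordinates
    z0 = body_height / 2.0
    Rt = np.asarray(R).T
    t = np.asarray(t)
    # World point along the ray: X(lam) = R^T (lam * d - t); solve for X(lam)_z = z0.
    lam = (z0 + (Rt @ t)[2]) / (Rt @ d)[2]
    return Rt @ (lam * d - t)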

The correspondence between the new persons created in the previous algorithm and the oblique blobs in S then needs to be determined. Define P′ to be the set of all newly detected persons; P′ can be determined as the set of persons that were added to the tracking database in Algorithm 3.4.

Algorithm 3.5: Assign oblique blobs to new persons

for P′ ∈ P′ do
    for o_i ∈ (O ∩ γ(P′)) do
        W^h = HighestPoint(o_i);
        W^l = LowestPoint(o_i);
        for s_j ∈ S do
            w^h = W^h ⊥ c(s_j), w^l = W^l ⊥ c(s_j);
            l = segment joining w^h and w^l;
            bb_j = bounding box of s_j;
            if l intersects bb_j then
                ξ(s_j) = ξ(s_j) ∪ P′;
                γ(P′) = γ(P′) ∪ s_j;

The routines HighestPoint and LowestPoint in Algorithm 3.5 compute the highest and lowest points on the reprojection line of the centre of mass of o_i. These points are determined by fixing the height of the point to 0 for the lowest point and to the height of the environment for the highest point. The camera calibration is then used to determine the position of the point in world coordinates. An example of the points returned by these routines is shown in Figure 3.21.

Figure 3.21: Points returned from HighestPoint and LowestPoint. Results for two overhead cameras are shown in a third, oblique camera.

3.5.4 Reconstructing World Position

The final step in reconstruction is to use the blobs and the blob maps to reconstruct a 3-D position for each possible person. Since all of the m cameras are already calibrated, this amounts to finding a 3-D point X_i, for each person P_i, that minimizes

\[
S(X_i) = \sum_{j=1}^{m} d\left(x_{ij}, C_j(X_i)\right)^2 \tag{3.17}
\]

where C_j(X_i) is the projection of X_i into camera C_j, x_{ij} is the required 2-D position of person P_i in camera j, and d(·, ·) is the 2-D Euclidean distance metric. x_{ij} is defined to be the centre of mass of the blob of person P_i in camera j, determined from the maps in the previous stages, in ideal image coordinates.

This amounts to a non-linear least-squares optimization problem. To perform this minimization, the iterative Levenberg-Marquardt (L-M) technique (Levenberg, 1944; Marquardt, 1963) is used. The L-M technique is popular because it interpolates between the Gauss-Newton method and gradient descent, thus making use of certain strengths of both techniques.

To perform L-M minimization over a parameter vector β for a function f, three components are needed: an initial estimate β_0, the function f, and its Jacobian with respect to β:

\[
J_\beta = \frac{\partial f}{\partial \beta} \tag{3.18}
\]

In this case, β = X_i and f = C_j. Recall from Section 3.2 that the Tsai camera model is used to define the camera parameters. These calibrated parameters can be further used to determine the reprojection function C_j.

Suppose that b_{ij} is the blob for person P_i in camera j. Then (X_{ij}, Y_{ij}) is defined to be the 2-D centre of mass of b_{ij} in image memory coordinates. To simplify the computation of the Jacobian, the coordinates of each (X_{ij}, Y_{ij}) are first transformed from image memory coordinates to ideal image coordinates x_{ij}. Thus, the distortion equations in the Tsai camera model are eliminated.

Then, given a 3-D point X_i = (x_w, y_w, z_w)^T, C_j is defined to be the transformation from world coordinates to camera coordinates to ideal image coordinates, which determines the corresponding point for X_i in the same 2-D coordinate system as x_{ij}. Suppose that R and t are, respectively, the rotation matrix and translation vector for the transformation from world coordinates to camera coordinates. Then set

\[
R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \tag{3.19}
\]

\[
t = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix} \tag{3.20}
\]

Furthermore, suppose that f is the focal length of camera C_j. Then

\[
C_j(X_i) = f \begin{pmatrix} \dfrac{x_c}{z_c} \\[2mm] \dfrac{y_c}{z_c} \end{pmatrix} \tag{3.21}
\]

\[
x_c = r_{11} x_w + r_{12} y_w + r_{13} z_w + t_1 \tag{3.22}
\]
\[
y_c = r_{21} x_w + r_{22} y_w + r_{23} z_w + t_2 \tag{3.23}
\]
\[
z_c = r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3 \tag{3.24}
\]

Then the Jacobian for C_j can be calculated:

\[
J_{X_i} = \frac{f}{z_c^2}
\begin{pmatrix}
r_{11} z_c - r_{31} x_c & r_{12} z_c - r_{32} x_c & r_{13} z_c - r_{33} x_c \\
r_{21} z_c - r_{31} y_c & r_{22} z_c - r_{32} y_c & r_{23} z_c - r_{33} y_c
\end{pmatrix}
\tag{3.25}
\]

These values for C_j and J_{X_i} are input into the Levenberg-Marquardt algorithm, with the initial estimate of X_i set to the world origin, and an optimal solution for X_i is determined. The final result of the reconstruction step is a set of points X_i representing the observed persons within the environment.
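The same minimisation can be reproduced with a library Levenberg-Marquardt solver. The sketch below, mirroring Equations 3.17 to 3.24, assumes each observing camera is described by a (R, t, f) calibration and an undistorted 2-D blob centre of mass; it uses SciPy's solver rather than the thesis's own implementation, and omits the analytic Jacobian of Equation 3.25 for brevity.

import numpy as np
from scipy.optimize import least_squares

def reproject(X, R, t, f):
    # Equations 3.21-3.24: world -> camera -> ideal image coordinates.
    xc, yc, zc = np.asarray(R) @ X + np.asarray(t)
    return f * np.array([xc / zc, yc / zc])

def triangulate(observations, x0=np.zeros(3)):
    # observations: list of (xy, R, t, f) tuples, one per camera that saw the
    # person; xy is the blob centre of mass in ideal image coordinates.
    def residuals(X):
        return np.concatenate([reproject(X, R, t, f) - np.asarray(xy)
                               for xy, R, t, f in observations])
    # method='lm' selects the Levenberg-Marquardt algorithm in SciPy.
    return least_squares(residuals, x0, method='lm').x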

Occlusion in Oblique Cameras As mentioned above, to perform the 3-D reconstruction for a single person P_i, the centre of mass of each blob of P_i in each camera is used. With this approach, occlusion in the oblique cameras needs to be handled separately: the actual centre of mass of the person may differ considerably from the detected centre of mass, as blobs of other persons introduce instability into the calculations.

To overcome this instability, for each person P_i ∈ P the set C_i ⊆ γ(P_i) is constructed such that for each oblique blob s ∈ C_i ∩ S, |ξ(s)| = 1. This condition ensures that any oblique blobs that are part of an occlusion are ignored, as these blobs can considerably skew the reconstruction. For accurate tracking, the system requires that the person be detected in at least one overhead camera and one other camera, whether overhead or oblique.

This is found to be a reasonable assumption, despite the occlusion rule applied here. An example of this kind of instability is shown in Figure 3.22.

Figure 3.22: Occlusion in oblique cameras. Top image shows camera with no occlusion. Bottom image shows same frame in a different camera, where occlusion has occurred. Box with a cross shows the blob that is discarded

3.5.5 Verifying Observed Persons

The output from the previous tracking stages is a set of points X_i for likely persons P_i within the environment. It was found that using the X_i directly to update the motion models of tracked targets would often yield false positives, where false targets are created due to segmentation errors, and false negatives, where persons who wander around the peripheries of the cameras without exiting the environment are wrongly deleted. One way to overcome these errors is to use an entrance rule, similar to the one used in the KidsRoom (Bobick et al., 1999).

Figure 3.23: Entrance Rule

The entrance within AVIE is shown in Figure 3.23. Because of the narrow width of the entrance, ensuring that persons are only created and deleted at that location is a highly effective rule for systems like AVIE. A cylindrical entrance rule is used, for which the centre and radius are specified manually. A verification procedure, described below, is then employed to eliminate false positives and false negatives.

To perform the verification, the graph-theoretic framework introduced in Section 3.3.2 is reused. Define the set P′ = {P′_j} to be the set of persons who are currently being tracked by the system, up to the last frame. Then once more a bipartite graph G = (P ∪ P′, E) is constructed, where the left vertex set is P and the right vertex set is P′. An edge between each P ∈ P and P′ ∈ P′ is constructed and an edge weight function w : E → [0,1] is defined:

w(P, P′) = \frac{d(P, P′)}{D}    (3.26)

where d is the Euclidean distance and D is the maximum possible distance within the environment. D is set to the diameter of the cylindrical environment, which is 10 metres.

Furthermore, two dummy nodes are constructed, one in P and one in P′. Every P ∈ P is connected to the dummy node in P′ and every P′ ∈ P′ is connected to the dummy node in P. The weight of all edges connected to a dummy node is set to 0.5.

Then once more a minimum weight edge cover for this graph is derived, which gives an optimal mapping between each observed person P and tracked person P′. This mapping is many-to-many and the task then is to reduce it so that only the dummy nodes have multiple connections. All other nodes must have exactly one edge incident on them.

To achieve this, for each tracked person in P′ that has multiple connections, only the connection of minimum weight is kept and all other connections are removed. Each observed person P incident on a removed connection is then connected to the dummy node in P′. This process is repeated for each remaining observed person with multiple connections, thus yielding exactly one connection incident on each person, observed or tracked. There are then five cases to deal with; a sketch of their handling follows the list:

Observed person P is connected to a dummy node and P is in the entrance: P has just entered the system. Create a new tracked person P′ and update its motion models with the state of P.

Observed person P is connected to a dummy node and P is not in the entrance: P is a false positive, so delete it.

Observed person P is connected to tracked person P′: Update P′'s motion models with the state of P.

Tracked person P′ is connected to a dummy node and P′ is in the entrance: P′ has just exited the system. Delete P′.

Tracked person P′ is connected to a dummy node and P′ is not in the entrance: This is a false negative. Update P′'s motion models using the previous states of P′.
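The following sketch illustrates how the five cases above might be applied once the matching has been reduced to a one-to-one mapping; the Person/TrackedPerson representations, the `in_entrance` predicate and the helper names (`new_tracked_person`, `update_motion_models`, `predict_from_previous_states`) are hypothetical and stand in for Immersitrack's own data structures.

def verify(matches, observed, tracked, in_entrance):
    """matches maps each observed person to a tracked person or to None
    (the dummy node), and each tracked person to an observed person or
    None.  in_entrance(pos) tests whether a position lies inside the
    cylindrical entrance region."""
    for obs in observed:
        trk = matches.get(obs)
        if trk is None:                      # connected to the dummy node
            if in_entrance(obs.position):
                tracked.append(new_tracked_person(obs))    # case 1: entry
            # case 2: false positive -> simply discard obs
        else:
            trk.update_motion_models(obs.state)            # case 3
    for trk in list(tracked):
        if matches.get(trk) is None:         # connected to the dummy node
            if in_entrance(trk.position):
                tracked.remove(trk)                        # case 4: exit
            else:
                trk.predict_from_previous_states()         # case 5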

Applying this verification step provides a considerable increase in the accuracy of Immersitrack: almost all false negatives and false positives are eliminated from the system. Some false positives and false negatives can still occur when persons are standing within the entrance area, but this happens very rarely during system operation. The final 3-D tracking results for three persons in an overhead and an oblique camera are shown in Figure 3.24.

Figure 3.24: Tracking Results for three persons reprojected to an oblique (left column) and an overhead (right column) camera

3.6 Experimental Results

Evaluating the performance of a tracking system is no easy task. The lack of suitably adaptable benchmark data makes it difficult to provide comparable performance evaluations. Tracking is also not a black-and-white task, and results are often subjective to the observer. One common source of ambiguity, for example, is determining when a person may be deemed to have entered the environment. In a multi-camera system this could be either when the person enters the environment physically, or when the person is visible in enough cameras to provide a sufficiently accurate position estimate.

In this section, numerous results quantitatively evaluating the performance of Immersitrack are presented. The performance of the system in tracking a single person is first determined by computing error values when the person is standing still and when the person is moving along planned motion paths. These values give baseline error rates for the multiple-person tracking performance later on.

The performance and accuracy of the Kalman filter prediction is then presented to identify the performance benefits of using the filter. Finally, the performance of the system in the presence of multiple persons is evaluated using metrics from prior literature, which provide a quantitative assessment of the system's ability to track multiple persons.

3.6.1 Baseline Error Values for Single Person Tracking

The first test measures the accuracy of the system in tracking a single person standing still within the environment. This determines a baseline error rate for future tests. The position of the person, x_t, is recorded for 3000 frames, and the error in each frame is calculated as the absolute difference between the position in that frame and the average position over all frames, which serves as the ground truth. The result of this test along each axis is shown in Figure 3.25.

Figure 3.25: Single Person Standing Still

The average error over all frames is shown in Table 3.6.

Axis                  Average Error (cm)
x                     0.3045131
y                     0.4326645
z                     0.1293832
Euclidean distance    0.605565

Table 3.6: Average error rates for a single person standing still within the environment

The values in the table represent the best-case error of Immersitrack. The tests were performed with one person exhibiting no motion, and any further test results will contain at least this error.

A person was then recorded standing still on 16 known markers within the environment, and the mean and standard deviation of the position output of the tracker, along each axis and for each marker, were recorded. This allows the determination of how error varies with the spatial location of the person within the environment. Since the person is standing still, error is measured as the magnitude of the standard-deviation vector at each marker. The results are shown in Figure 3.26. As the figure shows, some regions demonstrate slightly higher error rates than others. These are areas where the person is observed in fewer cameras. This effect can be counteracted by introducing more cameras around the problem areas without a large impact on the efficiency of the system.

Figure 3.26: Comparing error against spatial location. Size of spheres represents relative error values

The quantitative results for the tests performed in Figure 3.26 are presented in Table 3.7.

Test No.   Frames   Error (cm)       Test No.   Frames   Error (cm)
1          249      1.744080119      9          261      0.690105137
2          251      0.620052908      10         302      2.581934135
3          239      1.667897575      11         277      0.834650119
4          241      1.202973082      12         255      2.791370924
5          248      0.961847785      13         266      4.271070988
6          275      2.495039697      14         254      0.81201457
7          275      0.643946729      15         254      0.915243161
8          252      3.324170663      16         280      2.087111689

Table 3.7: Quantitative Results for Error vs. Spatial Location Tests

To test the performance of Immersitrack for moving objects, 2000 frames of a single person walking in a straight line radially out from the centre of the environment were recorded. Markers were used to ensure that the person kept to a straight line while in motion. Eight tests were performed. To calculate the error for these tests, linear regression was performed to fit a straight line to the ground-plane coordinates of all the points in each test. The error is measured, for each point, as the perpendicular distance between the point and the fitted line. The results are shown in Figure 3.27 and the average error rates for each straight-line test are presented in Table 3.8.
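A minimal sketch of this error measure is given below, assuming the ground-plane points of one test are available as an N×2 array; it fits a line y = ax + b by ordinary regression and measures perpendicular distances to it (for a walk nearly parallel to the y-axis, the roles of the two axes would need to be swapped).

import numpy as np

def straight_line_errors(points):
    """points: (N, 2) array of ground-plane coordinates from one test.
    Fits a line y = a*x + b by linear regression and returns, for each
    point, its perpendicular distance to the fitted line (in the same
    units as the input, e.g. centimetres)."""
    x, y = points[:, 0], points[:, 1]
    a, b = np.polyfit(x, y, deg=1)                 # least-squares line fit
    return np.abs(a * x - y + b) / np.sqrt(a**2 + 1)

# The average error reported per test would then be
# straight_line_errors(points).mean()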

Figure 3.27: Straight Line tests for Single Person Tracking. Colour is dependent on the test id

Test No.   Average Error (cm)
0          5.40731
1          7.373808
2          8.929848
3          8.534284
4          5.618567
5          6.160718
6          6.553798
7          3.499592

Table 3.8: Quantitative Results for Single Person Straight Line tests

3.6.2 Framerates and Filter Prediction

The utmost importance was given to ensuring that the system performs in real time, as the aim is to use it in real-world immersive installations where real-time performance is critical. Thus, during development, real-time performance was adhered to as closely as possible. The frame rates for the system are given in Table 3.9.

Configuration                                                     Frame Rate (fps)
Slave with a single overhead camera                               30
Slave with a single oblique camera                                33
Slave with 4 cameras (2 overhead and 2 oblique)                   12
Trajectory Reconstruction on master with 4 slaves (16 cameras)    12
Prediction on master with 4 slaves (16 cameras)                   64

Table 3.9: Framerates of different components of the system

As the table shows, a slave runs at 12 fps at its slowest, when processing 4 cameras at one time, and because each slave sends only a very small amount of network data, the master is able to receive tracking information as fast as the slowest slave produces it. The time taken for trajectory reconstruction on the master is negligible, as the amount of data that the master has to work with is very small, consisting almost entirely of bounding box positions.

The strength of this work, however, lies in the master's ability to provide accurate positions of tracked individuals between camera frames. Because the master runs independently of the slaves, it is able to run at its own speed, which is beneficial since the trajectory reconstruction algorithms run much faster than the image processing algorithms. Therefore, multi-threaded capabilities can be utilised: the reconstruction algorithms run in a separate thread from the prediction algorithms, so that tracking data can be provided even while the master is waiting for a frame update.
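The following sketch illustrates this master-side arrangement with a simplified constant-velocity predictor in place of the full Kalman filter; the class and method names are hypothetical and only indicate how position estimates can be served at a higher rate (60+ fps) than the slaves deliver measurements (about 12 fps).

import threading, time
import numpy as np

class PredictingTracker:
    """Serves position estimates at a high rate while measurements from
    the slaves arrive at a lower rate (illustrative constant-velocity
    model rather than the full Kalman filter)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.pos = np.zeros(3)
        self.vel = np.zeros(3)
        self.stamp = time.time()

    def update(self, measured_pos):
        """Called by the reconstruction thread (~12 fps)."""
        now = time.time()
        with self.lock:
            dt = now - self.stamp
            if dt > 0:
                self.vel = (measured_pos - self.pos) / dt
            self.pos, self.stamp = measured_pos, now

    def predict(self):
        """Called by the prediction thread (60+ fps)."""
        with self.lock:
            return self.pos + self.vel * (time.time() - self.stamp)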

As a result, the final system is able to run at over 60 frames per second while providing accurate prediction of the locations of individuals within the environment. To demonstrate

the accuracy of the prediction, 600 frames of a single person moving around the environment were recorded, and the error was measured as the Euclidean distance between the prediction output by the Kalman filter and the actual measured position obtained from the slaves. In Figure 3.28 is shown the trajectory of the person in the ground plane.

Figure 3.28: Kalman Filter Test

The average error rates in 3-D and in the ground plane are shown in Table 3.10.

Error Type              Average Error (cm)
Error in Ground Plane   15.56363599
Error in 3-D            16.65124767

Table 3.10: Average error in comparison between Kalman filter prediction and observed output during updates. The Kalman filter prediction lies roughly within a ball of radius 16cm which is sufficient as a human body is considerably larger than that

In Figures 3.29, 3.30, 3.31 and 3.32 are shown individual segments of the test. Each figure shows 150 frames of the test. The figures show that the greatest discrepancy in prediction occurs when the person's motion is considerably curved, thus violating the linear motion model assumed by the Kalman filter. Overall, however, the system provides quite an accurate reflection of the person's position.

Figure 3.29: Kalman Filter Test (Frames 1–150)

Figure 3.30: Kalman Filter Test (Frames 151–300)

Figure 3.31: Kalman Filter Test (Frames 301–450)

Figure 3.32: Kalman Filter Test (Frames 451–600)

3.6.3 Tracking Multiple Persons

Identifying the performance of Immersitrack for multiple persons is trickier than for a single person. The problem of objectively evaluating the performance of a multiple-object tracker is an open one, with some proposed solutions (Bernardin et al., 2006; Smith et al., 2005). Many systems that demonstrate multiple-object trackers present results that are not quantitatively measured, due to a lack of generally accepted metrics to perform the evaluation.

The metrics presented by Bernardin et al. (2006) are the most suitable for this system as they are simple to understand, yet provide clear quantitative results for system performance. They assume that the output from the system at each time frame consists of two sets, O_t and H_t. The set O_t consists of all the real-world objects to be tracked by the system at time t. The values in O_t are derived from ground-truth data and are hand-marked. The set H_t consists of all hypothesised objects output by the tracking system at time t.

If the tracking system were a perfect one, then in each frame each element of Ot would be paired with an element of Ht and vice-versa, and as time progressed the pairing would be maintained over time. Furthermore, as elements of Ot disappeared or reappeared, their corresponding paired elements in Ht would also disappear and reappear respectively. Of course, in reality, the tracking system is not perfect, and to measure the deviation from perfection Bernardin et al. identify two metrics to evaluate the performance of the tracker.

The first metric is the Multiple Object Tracking Precision (MOTP). The MOTP measures the total position error over the tracker's hypotheses and reflects the ability of the tracker to estimate precise object positions. It is calculated as the ratio

MOTP = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t}    (3.27)

where d_{i,t} is defined as the distance between the i-th object-hypothesis pair at time t and c_t is the total number of object-hypothesis pairs at time t. A lower MOTP value therefore indicates a more precise tracker. In this work, MOTP is calculated both in 3-D and in the 2-D ground plane.

The second metric is the Multiple Object Tracking Accuracy (MOTA). The MOTA accounts for all object configuration errors and quantitatively measures the ability of the tracker to maintain correct object-hypothesis pairs. It is calculated as:

MOTA = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t}    (3.28)

where m_t is the number of misses at time t, fp_t is the number of false positives, mme_t is the number of mismatch errors, and g_t is the number of objects. Misses occur when the tracker does not output a hypothesis for a particular object, false positives occur when the tracker outputs a hypothesis that cannot be matched to any object, and mismatch errors occur when the tracker incorrectly changes the hypothesis assigned to an object.

In this work, the ground-truth data was collected by recording video from the 4 cameras that have the best view of the environment while Immersitrack is running. The position of each person, in each frame of each video, was hand-marked and the 3-D positions were reconstructed using triangulation and camera calibration. The output from Immersitrack was recorded along with the video frames. Only 4 cameras were used, as using any more can affect the optimal operation of the tracking system due to expensive disk read/write operations.

Bernardin et al. give an automated algorithm to compute both the MOTA and MOTP, given the above reconstructed ground-truth data. Object-hypothesis pairs are determined over the entire video sequence using a distance threshold. Each hand-marked object, in each frame, is matched to the closest tracker hypothesis. If the distance between the matched object and hypothesis is larger than the distance threshold, the match is rejected. When an object is not matched to any hypothesis, it is a miss. When a hypothesis is not matched to any object, it is a false positive. When an object's matched hypothesis from the previous frame is different to the one in the current frame, it is a mismatch error. These values are then used to calculate the MOTP and MOTA as outlined above. The distance threshold used in this research is 100 cm.
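A simplified per-frame version of this evaluation is sketched below; unlike the full algorithm of Bernardin et al., it greedily matches each ground-truth object to its nearest hypothesis in every frame, which is sufficient to show how the counts feeding Equations 3.27 and 3.28 arise.

import numpy as np

def evaluate(objects, hypotheses, threshold=100.0):
    """objects, hypotheses: lists (one entry per frame) of dicts mapping
    an id to a 3-D position (np.array).  Returns (MOTP, MOTA).  Greedy
    nearest-neighbour matching is used for simplicity."""
    dist_sum = matches = misses = false_pos = mismatches = gt_count = 0
    prev_match = {}                       # object id -> hypothesis id
    for obj_frame, hyp_frame in zip(objects, hypotheses):
        gt_count += len(obj_frame)
        used = set()
        for oid, opos in obj_frame.items():
            best = min((h for h in hyp_frame if h not in used),
                       key=lambda h: np.linalg.norm(hyp_frame[h] - opos),
                       default=None)
            if best is None or np.linalg.norm(hyp_frame[best] - opos) > threshold:
                misses += 1
                continue
            used.add(best)
            dist_sum += np.linalg.norm(hyp_frame[best] - opos)
            matches += 1
            if oid in prev_match and prev_match[oid] != best:
                mismatches += 1           # identity switched
            prev_match[oid] = best
        false_pos += len(hyp_frame) - len(used)
    motp = dist_sum / matches if matches else float('nan')
    mota = 1.0 - (misses + false_pos + mismatches) / gt_count
    return motp, mota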

The MOTA and MOTP measures were computed for 6 tests containing 1 to 6 persons. The persons were asked to move in random motions around the environment, occasionally meeting together and touching each other. The purpose of meeting together was to test the tracker’s performance when persons are very close to each other and their blobs in the cameras merge. The cumulative statistics of the results are also measured. Two sets of cumulative statistics are computed. The first is a weighted average over the number of persons within each test, and the second is a weighted average over the number of frames in each test. These results are presented in Tables 3.11 and 3.12.

Num Persons  Num Frames  MOTP 3-D (cm)  MOTP 2-D (cm)  Misses (%)  False Positives (%)  Mismatch Errors (%)  MOTA (%)
1            610         4.01096        3.50009        0           0                    0                    100
2            669         2.50559        2.27767        0           0.60658              0                    99.3934
3            735         4.19375        3.62894        2.52417     1.18153              0.05371              96.2406
4            718         1.95228        1.39684        0.21727     9.88593              0                    89.8968
5            741         23.2288        15.475         5.65766     11.5676              0.43243              82.3423
6            911         5.05914        4.04837        4.12114     8.67936              0.06244              87.1371

Table 3.11: Multiple Object Tracking Results for the Tracking System

Weighted Average    MOTP 3-D (cm)  MOTP 2-D (cm)  Misses (%)  False Positives (%)  Mismatch Errors (%)  MOTA (%)
Over No. Persons    8.37673        6.00927        2.92651     7.34360              0.12847              89.60142
Over No. Frames     6.93524        5.12546        2.27689     5.67742              0.09497              91.95070

Table 3.12: Cumulative Statistics for Multiple Object Tracking Results

A graph of the multiple object tracking results is presented in Figure 3.33.

Figure 3.33: Graph of Multiple Object Tracking Results

Note, from Figure 3.33, that there is a sudden drop in accuracy and precision for Test 5. Upon reviewing the videos recorded in Test 5, it was noted that two persons in the test were wearing clothes that did not reflect near-infrared light well. As a result, they frequently blended into the background in the camera images and were not picked up by Immersitrack.

In fact, it was found that there are two main reasons for a drop in both accuracy and precision: poorly reflecting clothing, and the presence of an increasing number of persons. To demonstrate this, the mean precision for each ground-truth object is recorded in each test. A qualitative comparison is presented in Figures 3.34, 3.35, and 3.36. The figures present a single snapshot of each tracked object along with its mean precision. The objects are sorted according to their mean precision values, and a graph of the values is presented.

The identifiers of the tracked objects within the figures consist of two components. The number before the colon is the test id and represents the number of persons within the test. The number after the colon is the object id determined by the ground-truth marker. As is to be expected, most of the tests with fewer persons have better precision. In addition, objects with poorer precision are those that are less visible because their clothing reflects near-infrared light less.

Figure 3.34: Precision vs. Person for Persons 1–7

Figure 3.35: Precision vs. Person for Persons 8–14

Figure 3.36: Precision vs. Person for Persons 15–21

3.7 SparseSPOT: Real-Time Multi-Person Visual Hull Reconstruction¹

Immersitrack provides point-based trajectory values for persons within the AVIE environment. These values can be used to perform rigid-body movement classification within AVIE, as discussed in Chapter 2. However, in order to classify a greater number of activities, the data from Immersitrack is first enriched with a 3-D shape-preserving reconstruction. A voxel model representation is constructed for each person within the environment. As will be shown in the performance evaluation, the algorithm used can efficiently perform the construction for multiple persons in real time.

¹Portions of this section appear in Sridhar and Sowmya (2009)

3.7.1 SPOT

The SPOT algorithm was described in Chapter 2 as a fast, robust voxel reconstruction algorithm. The algorithm checks whether a voxel v belongs to a foreground object by projecting Q uniformly distributed points within the voxel onto the foreground image of each camera. If at least εQ of these points lie within a foreground object, where 0 < ε ≤ 1, then the voxel is classified as white in that view. The runtime complexity of this algorithm for a voxel space of size W × D × H is WHDQ. The SPOT algorithm is summarised in Algorithm 3.6.

Algorithm 3.6: SPOT
input : Voxel v to check, Q ∈ N, ε ∈ (0,1]
output: True if v is a foreground voxel, False otherwise.
begin
    incount ← 0
    for l = 1,...,Q do
        if q_v(l) is a silhouette pixel then
            incount ← incount + 1
    if incount ≥ εQ then
        return True
    else
        return False
end

Current software implementations of SPOT present experimental results within a small volume containing only one human at any one time. Typically this volume is about 2m × 2m × 2m and the voxel model has dimensions 64 × 64 × 64. The SPOT algorithm has been shown to run at 15 fps at most over this space. The computational complexity of the system increases proportionally with each dimension of the voxel space. Moving from a 2m × 2m × 2m system to a 5-metre-radius cylinder with height 3 metres, which are the dimensions of AVIE, represents an approximately 9-fold increase in the algorithmic runtime.

Figure 3.37: A relative size comparison between the AVIE environment and the typical environment that the SPOT algorithm has been tested on. The external cylinder is the AVIE environment. The yellow cube inside is the typical SPOT environment

Furthermore, general scenes may contain several humans, and real-time performance in the reconstruction of multiple humans is desirable. The current runtime of the SPOT algorithm is clearly unacceptable by real-time standards when multiple humans are present. To resolve these issues, the information derived from Immersitrack is used to construct a novel voxel reconstruction algorithm that performs the reconstruction for multiple persons in real time.

3.7.2 SparseSPOT: Leveraging Immersitrack for Spatial Coherency

The algorithm developed, called SparseSPOT, works by growing the voxel model from selected seed points by checking neighbourhood voxels and recursing over any white voxels found. This means that only voxels belonging to a foreground object are checked and the recursion terminates when a black voxel is reached. This provides a significant speed-up over SPOT, allowing SparseSPOT to be applied in real time even with multiple humans present. SparseSPOT is presented in Algorithm 3.7.

Algorithm 3.7: SparseSPOT
input : The set S of all seed points from a tracking step.
output: The set V of voxel points.
locals : The set C of all voxels that have been checked.
V ← ∅
for s_i ∈ S do
    V_i ← {s_i}
    while V_i ≠ ∅ do
        v ← random element from V_i \ C
        C ← C ∪ {v}
        V_i ← V_i \ {v}
        if SPOT(v) = True then
            V ← V ∪ {v}
            V_i ← V_i ∪ N_v
return V

Define

N_v = \{\, v' \mid d(v, v') \le \delta \wedge v' \notin C \,\}    (3.29)

where d is any suitable metric in R^3. N_v is the set of all voxels inside a ball of radius δ around v that have not already been checked. This algorithm is analogous to the 2-D connected component algorithm for silhouette extraction, which checks neighbouring pixels and grows the silhouette from a seed pixel.
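The region-growing idea can be sketched as follows; the silhouette test `spot_check` stands in for Algorithm 3.6 and is assumed to be supplied, and a 26-connected cube neighbourhood (δ = 1 under the infinity norm, as chosen in the experiments later) is hard-coded for illustration.

from itertools import product

def sparse_spot(seeds, spot_check):
    """Grow foreground voxel sets from tracker seed points.
    seeds: iterable of (x, y, z) integer voxel coordinates, one per person.
    spot_check: callable implementing the SPOT test for a single voxel.
    Returns a dict mapping each seed (person) to its set of voxels."""
    checked = set()
    models = {}
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    for seed in seeds:
        voxels = set()
        frontier = [seed]
        while frontier:
            v = frontier.pop()
            if v in checked:
                continue
            checked.add(v)
            if spot_check(v):
                voxels.add(v)
                # recurse over the unchecked 26-neighbourhood of v
                frontier.extend((v[0] + dx, v[1] + dy, v[2] + dz)
                                for dx, dy, dz in offsets)
        models[seed] = voxels
    return models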

A comparison of SparseSPOT against SPOT is shown in Figure 3.38. The results show that SparseSPOT reconstruction is comparable to that of the original SPOT algorithm and

additionally provides two other benefits for the reconstruction. It eliminates some noise voxels that SPOT picks up outside the tracked human, by not checking them at all: SparseSPOT only checks voxels that are near a tracked human, so any voxels that are too far from humans are removed. Another benefit is that SparseSPOT assigns voxels to tracked individuals during voxel reconstruction itself. Therefore, no further computation time need be spent assigning voxels to individuals after reconstruction, making SparseSPOT highly suitable for voxel-model-based tracking applications.

Figure 3.38: A comparison between the SPOT and SparseSPOT algorithms. The number in the SparseSPOT images is the identity output from Immersitrack

The final result of SparseSPOT in reconstructing the voxel models of 3 humans within AVIE, a cylinder of radius 5 metres and height 3 metres, is shown in Figure 3.39.

Figure 3.39: SparseSPOT Reconstruction for 3 persons. Top figure shows the reconstructed voxel model. Bottom 4 figures show the voxel reconstruction projected onto 4 cameras

3.7.3 Performance Comparison between SPOT and SparseSPOT

The performance of SparseSPOT is evaluated and is found to be considerably superior to that of the traditional SPOT algorithm. For SparseSPOT, the distance metric for the neighbourhood voxel extraction is chosen to be the metric induced by the infinity norm, d(v, v') = ||v − v'||_∞, so that the neighbourhood voxels are simply the voxels within a cube centred at v. δ is set to 1, as this provides satisfactory results in voxel extraction yet is comparable to the original SPOT algorithm.

Two experiments are performed. In Experiment 1, SparseSPOT is shown to be comparable to SPOT by using the parameters specified in the original paper by Cheung et al. (2000). The voxel grid size is set to 64 × 64 × 64, and the dimensions of the volume-of-interest are set to 2m × 2m × 2m. For the parameters of the SPOT algorithm, required by both SPOT and SparseSPOT, Q = 2 and ε = 1/Q = 0.5. Experiment 1 involves one human entering the region, walking around for 10 seconds and exiting the environment.

Experiment 2 involved running both algorithms in a much larger volume. The dimensions are 6m × 6m × 2m, which is roughly the volume of interest within AVIE, with voxel dimensions of 240 × 240 × 60. This represents an approximately 13-fold increase in the volume to be reconstructed. For Experiment 2, three humans walked into the volume, performed some actions for 10 seconds and then walked out.

In both experiments, roughly 200 frames of video from 4 cameras are recorded, along with the positions of persons output by Immersitrack. The voxel processing is done off-line on both videos, so that the measured times are independent of the time taken by Immersitrack. Only the time taken to reconstruct the entire voxel model is measured in each frame. A graph of the time taken for Experiment 1 is presented in Figure 3.40.

Figure 3.40: Performance Comparison between SPOT and SparseSPOT in Experiment 1

The time taken for Experiment 2 is presented in Figure 3.41.

Figure 3.41: Performance Comparison between SPOT and SparseSPOT in Experiment 2

As can be seen from the figures, SparseSPOT is clearly superior to SPOT in both environments, but most prominently so within the large volume in Experiment 2. In Table 3.13 are presented the properties of each experiment, including the average time taken for each. By average time is meant the total time taken over all frames, divided by the number of frames.

                                 Experiment 1     Experiment 2
Volume-of-Interest              2m × 2m × 2m     6m × 6m × 2m
Voxel Model Dimensions          64 × 64 × 64     240 × 240 × 60
Number of Frames                101              187
Time taken for SPOT (ms)        10.5             153.85
Time taken for SparseSPOT (ms)  2.46             8.28

Table 3.13: Properties of the SPOT vs. SparseSPOT Experiments

As can be seen, in the small volume SparseSPOT takes only 20% of the time that SPOT takes to reconstruct the voxel model, while in the larger volume SparseSPOT takes approximately 5% of the time taken by SPOT. It should be noted that where the run-time of the SPOT algorithm increases with the size of the volume, the run-time of the SparseSPOT algorithm increases with the number of persons, as demonstrated in Figure 3.42. Despite this increase in run-time, it is clear that SparseSPOT presents a viable alternative for real-time voxel reconstruction of multiple people.

Figure 3.42: Comparison between Performance and Person Count for SparseSPOT

3.7.4 Final Frame-rates of Immersitrack and SparseSPOT

Recall from Sub-section 3.6.2 that Immersitrack operates at roughly 12 frames per second or, conversely, takes approximately 83 ms to process each frame. These values are consistent regardless of how many persons enter the environment, deviating only by a millisecond or so even with up to 6 persons. The next task is to measure the computational efficiency of SparseSPOT combined with Immersitrack.

In order to do so, a suitable voxel size must first be selected that provides sufficient information for the activity recognition stage. Through observation, it is found that a voxel size of 4cm × 4cm × 4cm provides enough detail to distinguish activities while being computationally efficient to reconstruct. For SparseSPOT, Q is set to 2 and ε is set to 0.5, as in the original SPOT algorithm. To analyse the performance of the final system, 5 persons enter the environment one at a time. Every time a person enters, they walk around for some time and the time taken to perform the reconstruction in each frame is recorded. These tests reflect the total run time of the system, accounting for processing times along with network transfer times. The results from the tests are shown in Table 3.14. Note that the table shows average times, i.e. the total time taken divided by the number of frames.

Number of Persons   Number of Frames   Average Time (ms)   Average Frame Rate (fps)
1                   116                91.05               10.9828
2                   178                121.31              8.24303
3                   138                138.58              7.21569
4                   100                154.37              6.47794
5                   196                166.78              5.99608

Table 3.14: Frame rates for the Complete Vision System. Times are given in milliseconds, with the corresponding frame rate given in frames per second

3.8 Summary

In this chapter has been presented a complete and robust framework for the real-time tracking of multiple persons within the immersive environment. Immersitrack is highly suitable for such indoor areas, and can be extended to cover larger areas with a greater number of cameras without a large impact on the performance of the system. Quantitative experimental results have been provided demonstrating the consistency, effectiveness and efficiency of the tracking system.

The chapter also presented SparseSPOT, a fast multi-person voxel reconstruction system. SparseSPOT exploits the spatial coherency provided by Immersitrack to allow the reconstruction of multiple persons in real time. Immersitrack and SparseSPOT were shown to work in real time for up to 5 people within the environment.

In the following chapter will first be presented a qualitative evaluation of Immersitrack through real-world applications that utilise the system for interaction. Commercial demonstrations of Immersitrack that have been run dozens of times with multiple participants will be presented. The final chapters will then introduce and evaluate the activity recognition framework that uses output from Immersitrack and SparseSPOT to classify the activities of multiple persons.

Chapter 4

Real-World Applications of Immersitrack

IMMERSITRACK allows identification of the positions of persons over time, and thus provides maintenance of the identities of individual persons. Through tracking, the global motion of individuals within the environment may be derived, along with properties such as position, velocity, trajectory and other point-based values.

Although the quantitative robustness of Immersitrack has been established by experimental means and described in Chapter 3, the qualitative benefits of the system only become apparent in real-world systems that utilise the tracking output. These applications demonstrate the robustness of Immersitrack and its benefits for providing interactivity. They also show that Immersitrack provides a solid foundation for the recognition of activities in later chapters. In this chapter, several such applications are presented, including a pointing gesture recognition system that leverages Immersitrack to provide real-time interaction for multiple persons, and two artistic demonstrations in a public setting.

In Section 4.1, a pointing gesture recognition system that localises its search space by using the output from Immersitrack is presented. Although the pointing gesture system is not used in the activity recognition steps later on, it highlights a primary benefit of Immersitrack: the ability to reduce the search space for such gesture extraction systems. Immersitrack has also been used in two public arts demonstrations, Scenario and Spaces of Mnajdra, presented in Sections 4.2 and 4.3. These demonstrations have been run dozens of times, providing interactivity for multiple audience members at a time. Both applications provide valuable contributions to artistic and sociological research and have received numerous grants and accolades.

4.1 A Real-Time Pointing Gesture Recognition System¹

The use of pointing gestures has proven quite beneficial within immersive environments, because it is a natural translation from the familiar mouse used in desktop environments. Pointing gesture recognition systems can be used for direct interaction with the virtual world and they allow persons to highlight objects and areas of interest.

In this section, a pointing gesture recognition module that uses Immersitrack to localise the gesture search is presented. This module performs fingertip tracking on the 2-D images, then combines and reconstructs the fingertips in 3-D. To detect the fingertips, the centres of gravity of persons determined by Immersitrack are used as seed points, to speed up the detection. The pointing gesture recognition system presented here clearly demonstrates the reduction in search space that is achieved by Immersitrack.

Recall from Section 3.5 that a map γ is defined that maps each tracked person to a set of blobs, where each person is assigned to a single blob in a camera. Using this map and the centre of gravity reconstruction presented in Chapter 3, fingertip localization can be performed for each person in each assigned blob. The information from the various blobs can then be combined to obtain a 3-D fingertip position.

4.1.1 Finger Detection in Overhead Cameras

To perform fingertip detection in the overhead cameras, an extended version of the methods presented by Kehl and Van Gool (2004) is used. The centre of gravity of each person P_i is first projected onto each overhead camera c_j in which the person is visible. Define this projection to be p_{i,j}, and define b_{i,j} to be the corresponding blob for P_i in c_j determined by γ. If p_{i,j} does not lie within b_{i,j}, then there is a high discrepancy between the 3-D reconstruction of P_i and the blob b_{i,j}, and the blob is ignored. So assume that p_{i,j} lies within the bounding box of b_{i,j}.

The value of a distance function between p_{i,j} and the contour points of b_{i,j} is first computed. Because the distance function is not necessarily smooth, a box filter is applied to remove segmentation noise. Extremal points in this function are searched for and labelled as possible fingertips. Furthermore, the passive nature of the cameras allows the application of a distance threshold to remove some false positives. The result of this extremal point search is presented in Figure 4.1. The extremal point search often gives points that are not quite at the fingertip position, but slightly to one side. This is a consequence of the smoothing filter, which often shifts the local maxima slightly to the side of the true maxima, as can be seen in the graph in Figure 4.1.
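A compact sketch of this search is given below; it assumes the blob contour is available as an ordered array of 2-D points and uses a simple moving-average box filter, with the window size and threshold chosen arbitrarily for illustration.

import numpy as np

def fingertip_candidates(centre, contour, window=9, min_dist=60.0):
    """centre: 2-D projection p_ij of the person's centre of gravity.
    contour: (N, 2) ordered contour points of the blob b_ij.
    Returns contour indices whose smoothed distance from the centre is a
    local maximum and exceeds min_dist (in pixels)."""
    dist = np.linalg.norm(contour - centre, axis=1)
    kernel = np.ones(window) / window                 # box filter
    smooth = np.convolve(dist, kernel, mode='same')   # remove segmentation noise
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
             and smooth[i] > min_dist]
    return peaks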

¹Portions of this section appear in Sridhar and Sowmya (2008)

Figure 4.1: Fingertip Detection in Overhead Camera

To refine the results for more precise fingertip localization, the direction of the pointing hand is first quantized into one of 8 directions, based on the angle of the vector joining p_{i,j} and the extremal point. These 8 directions are Up, UpLeft, Left, DownLeft, Down, DownRight, Right and UpRight. A small fixed-size search window centred at the extremal point is used to localize the search space. Using the approximated hand direction, an additional extremal point search is performed within the search window to find extreme points on the contour. For example, if the vector direction is UpLeft, a single point on the contour, furthest from the bottom-right corner of the search window, is searched for. This refinement step considerably improves the standard extremal point search, resulting in much smoother finger motion. The result of this step is given in Figure 4.2.
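The quantization into eight directions is a straightforward angle bucketing, sketched below; the direction labels mirror those in the text, and image coordinates are assumed to have y increasing downwards.

import math

DIRECTIONS = ["Right", "UpRight", "Up", "UpLeft",
              "Left", "DownLeft", "Down", "DownRight"]

def quantize_direction(centre, extremal_point):
    """Return one of eight direction labels for the vector from the
    projected centre of gravity to the extremal (fingertip) point.
    Image coordinates: x to the right, y downwards."""
    dx = extremal_point[0] - centre[0]
    dy = centre[1] - extremal_point[1]        # flip y so "Up" means up on screen
    angle = math.atan2(dy, dx) % (2 * math.pi)
    sector = int((angle + math.pi / 8) // (math.pi / 4)) % 8
    return DIRECTIONS[sector]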

Figure 4.2: Fingertip Refinement in Overhead Cameras. Red cross is the result of the previous extremal point search. Green cross and circle are the result of the refinement step.

4.1.2 Finger Detection in Oblique Cameras

In the oblique cameras, a considerably different scenario occurs. Previous work (Kehl and Van Gool, 2004) localized finger points in oblique cameras using the same local extrema search method used in the overhead cameras. However, the morphological operators applied during the contour extraction step affect the fingertip points in the oblique cameras considerably more than in the overhead cameras. This results in noisy finger measurements, leading to severely erratic motion of the fingertips after 3-D reconstruction.

To resolve this issue, a different method of fingertip localization is used within the oblique cameras. Rather than trying to detect a single point to represent the fingertip, it is much more robust to localize the entire hand of a person as a single line, and then label the tip of this line as a fingertip. This results in much cleaner fingertip motion detection than other methods.

Haritaoglu et al. (1998a) showed that analysis of the horizontal and vertical projections of a human silhouette provides a fast and effective method of recovering certain body postures. In this system, only the horizontal projection is used, together with a simpler statistical technique to determine where the hands of a person start and stop.

For each silhouette, the normalized horizontal projection is first computed. Regions within the silhouette that may contain hands are determined by iterating over columns within the silhouette from either side and computing the mean and variance of the height of the projection. The algorithm stops when the variance in the projection height rises above a threshold. In this step, the body is separated into 3 regions, as shown in Figure 4.3: (i) a box in the middle containing the legs, torso and head, (ii) a box on the left containing pointing hands on the left, and (iii) a box on the right containing pointing hands on the right.
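One possible realisation of this column scan is sketched below; the silhouette is assumed to be a binary image, and the variance threshold is an arbitrary illustrative value.

import numpy as np

def hand_extents(silhouette, var_threshold=50.0):
    """silhouette: 2-D binary array (1 = foreground).
    Scans columns inwards from each side, accumulating the projection
    heights, and stops when their variance exceeds var_threshold.
    Returns (left_extent, right_extent): the column indices where the
    left-hand and right-hand regions end and the torso region begins."""
    projection = silhouette.sum(axis=0)           # horizontal projection
    cols = np.nonzero(projection)[0]
    left, right = cols[0], cols[-1]

    def scan(indices):
        heights = []
        for c in indices:
            heights.append(projection[c])
            if len(heights) > 1 and np.var(heights) > var_threshold:
                return c
        return indices[-1]

    left_extent = scan(range(left, right + 1))
    right_extent = scan(range(right, left - 1, -1))
    return left_extent, right_extent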

Figure 4.3: Determining hand extents in oblique cameras. Left: Horizontal Projection with detected extents. Right: Hand Extents (Red circle represents head position).

Once the extents of the hands have been detected, the actual hand lines within the rectangular hand segments are determined. The distance transform of the silhouette image is used for this purpose. A connected component algorithm is applied to the distance-transformed image to determine connected paths of high intensity. Any path whose length falls below a threshold is eliminated, and a line-fitting algorithm is applied to the remaining paths. The tips of the lines are output as fingertip positions. The result of the path detection and hand limb localization is shown in Figure 4.4.

Figure 4.4: Oblique Camera Finger Localization Results. Left: Distance Transform with Skeleton Paths marked in red. The smaller red lines are discarded using a length threshold. Right: Final Results. Red lines represent hand lines and Green circles represent fingertips

This technique of hand limb localization provides very good results in localizing fingertip points. Fingertips are identified even when a person is pointing with both hands on the same side. Some of the results achieved in the oblique cameras are shown in Figure 4.5. As the figure shows, the system is able to deal with multiple hands on the same side and can handle a skewed body posture.

Figure 4.5: Final finger detection results in oblique cameras

Although this method does suffer from occlusion when multiple persons come close together in one camera, the occlusion issue is resolved when multiple cameras are used. When two or more persons are occluded in an oblique camera and merged into one blob, the algorithm detects all possible hands at the left and right extremes of the blob. During the next reconstruction phase, finger points from the overhead cameras are matched to those from the oblique cameras. Because the overhead cameras have an unoccluded view of the finger points most of the time, this ensures that all pointing fingers are appropriately matched and reconstructed. It was also found that, given the set-up of 7 oblique cameras, a pointing finger is always extracted in at least one oblique camera, even when up to 5 persons are present within the environment. Thus, two views, one each from an overhead camera and an oblique camera, are sufficient for 3-D reconstruction. An example of finger points extracted in the presence of occlusion is shown in Figure 4.6.

Figure 4.6: Final finger detection results in oblique cameras for multiple persons

4.1.3 3-D Reconstruction

The output from the previous sections is a set of points for each person in each camera, representing the fingertip positions for that person. The final task is to reconstruct the fingertips for each person in 3-D. In this research, it is assumed that each person can have a maximum of two tracked fingers. Thus, the task is to find correspondences between either of the person's two fingers in 3-D and the 2-D fingers from the slaves, and then use the correspondences, or lack thereof, to determine whether each fingertip is raised and to update their positions.

Correspondences in 2-D are first determined between fingertip points observed in different cameras. To determine these correspondences, the HighestPoint and LowestPoint functions, introduced in Algorithm 3.5, are reused. Recall that these two functions reconstruct a highest and a lowest 3-D point, given a 2-D point and the calibration data of the camera in which it was observed. The world line of a 2-D point is defined as the 3-D line

joining the 3-D points returned from the HighestPoint and LowestPoint routines. An example of world lines for fingertips is shown in Figure 4.7.

Figure 4.7: Fingertip world lines. Each camera image shows fingertip world lines constructed from fingertips in the other cameras.

Suppose that F_2D is the set of fingertip points for a given person. As in Chapter 3, define C to be the set of cameras, and the map c : F_2D → C to map finger points in F_2D to the camera they were observed in. Further, define the distance function ld : F_2D × F_2D → R between two 2-D fingertip points as the perpendicular distance between the world lines of the two points. The correspondences are determined by Algorithm 4.1.

Algorithm 4.1: Determine 2-D fingertip correspondence
input : The set F_2D of 2-D fingertip positions for a person
input : The map c : F_2D → C from finger points to cameras.
output: F_G ⊆ P(F_2D), consisting of subsets of fingers from F_2D which correspond to each other.
local : M, the set of fingertips in F_2D that have been matched. Set M ← ∅, F_G ← ∅.
for f_i ∈ F_2D \ M do
    F_i ← {f_i}
    for f_j ∈ { x | x ∈ F_2D \ M ∧ c(x) ≠ c(f_i) ∧ ld(f_i, x) < T } do
        F_i ← F_i ∪ {f_j}
    for (f_j, f_k) ∈ { (x, y) | (x, y) ∈ F_i × F_i ∧ c(x) = c(y) } do
        if ld(f_i, f_j) < ld(f_i, f_k) then
            F_i ← F_i \ {f_k}
        else
            F_i ← F_i \ {f_j}
    M ← M ∪ F_i
    F_G ← F_G ∪ {F_i}

This algorithm matches each fingertip point to a single best fingertip point in every other camera, using a threshold to eliminate large discrepancies. The threshold T is set to 15 cm in this work. The iterative reconstruction procedure used in Section 3.5.4 is then reused to reconstruct a single 3-D finger point for each cluster of 2-D fingertips in F_G. Denote by F_3D the set of 3-D points derived from F_G.

As already mentioned, it is assumed that each person has a maximum of two fingertips. Although it may be possible to detect the direction the person is facing using their direction of motion, this information becomes inaccurate if the person rotates on the spot. Since it is difficult to determine left and right fingertips without knowing the direction the person is facing, a 3-D fingertip is designated as either primary or secondary. The first finger raised is always the primary fingertip; when two fingers are held up, the second finger raised is the secondary. The second step in fingertip tracking is to match the reconstructed 3-D points to either the primary or the secondary fingertip, if they are already being tracked, or to initialise the tracking of the primary or secondary fingertip otherwise. To do so, Algorithm 4.2 is used.

Algorithm 4.2: Determine 3-D fingertip correspondence
input : The set F_3D of reconstructed 3-D fingertip points, f′ the primary fingertip, f″ the secondary fingertip
if f′ exists then
    f_3D ← closest point to f′ in F_3D
    if f_3D exists then
        Update f′ with f_3D; F_3D ← F_3D \ {f_3D}
    else
        Stop tracking the primary fingertip f′
if f″ exists then
    f_3D ← closest point to f″ in F_3D
    if f_3D exists then
        Update f″ with f_3D; F_3D ← F_3D \ {f_3D}
    else
        Stop tracking f″
for f_3D ∈ F_3D do
    if f′ does not exist then
        Initialise f′ with f_3D; F_3D ← F_3D \ {f_3D}
    else if f″ does not exist then
        Initialise f″ with f_3D; F_3D ← F_3D \ {f_3D}

This algorithm assigns reconstructed fingertips to the primary or secondary fingertip, giving preference to the primary fingertip. If either of the fingertips is already being tracked, it is assigned to the closest reconstructed fingertip. If neither of them is being tracked, then the presence of any reconstructed fingers will trigger tracking of the primary fingertip first and the secondary second.

4.1.4 Head Position

The aim of the pointing gesture recognition component is to provide a method of interacting with the surround-screen environment. To be able to utilise the finger positions for the interaction task, it is not sufficient to have just a single point. A line-of-sight is required in 3-D, and the intersection of this line-of-sight with the screen and with objects within the environment can then be used to determine interactions.

Kehl and Van Gool (2004) used the line linking the head position and the fingertip position as the line-of-sight. The head position can be determined from the centre of gravity computed in the tracking phase: the head is given the same ground-plane position as the centre of gravity, but at double the height. However, for a more intuitive measure of the pointing gesture, a point one third of the way between the head and the centre of gravity is a better option for the line of sight. The final line-of-sight results for a single person are shown in Figure 4.8.
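The line-of-sight construction and its intersection with the cylindrical screen can be sketched as follows; the 5-metre screen radius and the choice of origin point are taken from the text, z is assumed to be the vertical axis, and the intersection routine is a standard ray-cylinder calculation written here purely for illustration.

import numpy as np

SCREEN_RADIUS = 5.0   # metres; AVIE cylindrical screen

def line_of_sight(centre_of_gravity, fingertip):
    """Head is at the same ground-plane position as the centre of gravity
    but at twice its height; the sight line starts one third of the way
    from the head towards the centre of gravity and passes through the
    fingertip."""
    cog = np.asarray(centre_of_gravity, dtype=float)
    head = np.array([cog[0], cog[1], 2.0 * cog[2]])
    origin = head + (cog - head) / 3.0
    direction = np.asarray(fingertip, dtype=float) - origin
    return origin, direction

def intersect_screen(origin, direction):
    """Intersect the sight ray with the cylinder x^2 + y^2 = R^2
    (axis through the centre of the environment) and return the 3-D
    point on the screen, or None if the ray is parallel to the wall."""
    ox, oy, _ = origin
    dx, dy, _ = direction
    a = dx**2 + dy**2
    if a == 0:
        return None
    b = 2 * (ox * dx + oy * dy)
    c = ox**2 + oy**2 - SCREEN_RADIUS**2
    disc = b**2 - 4 * a * c
    t = (-b + np.sqrt(disc)) / (2 * a)    # origin is inside, so take the + root
    return origin + t * direction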

Figure 4.8: Results from finger tracking and head reconstruction. Yellow circle is centre-of-gravity. Green circle is head position. Stick figure images on the bottom show resulting hand lines in 3-D world space

4.1.5 Multiple Persons

Because finger tracking and reconstruction is performed using the seed point of each person from the tracking step, the system is easily able to accommodate multiple persons. Currently the system is able to robustly reconstruct finger positions for up to 3 persons within the environment. Beyond 3 persons, unfortunately, the occlusion issues become too great and the accuracy drops considerably. This problem can be solved by introducing more cameras, without a severe impact on the efficiency of the system. An example with 3 persons is presented in Figure 4.9.

Figure 4.9: Finger Tracking Results for Three Persons

4.1.6 Experimental Results: Pointing Gestures vs. Inertial Sensor

Prior to this work, the AVIE environment contained an interaction mechanism in the form of an inertial sensor. This sensor is attached to a wired remote control, and users are able to interact with the display system by pointing the remote at the screen and clicking. This pointing device provides a satisfactory user experience and has been used in dozens of exhibitions and displays, demonstrating its usability and effectiveness. However, the inertial sensor's drawbacks are that it is wired and expensive, only one person can use it at any one time, and only one pointer can be controlled.

To demonstrate that the pointing gesture recognition system can be used as an alternative pointing device, it is shown that the pointing gestures are comparable to the inertial sensor in terms of the motion of the pointer. During the experiments, a user is asked to hold the inertial sensor in one hand and use the same hand to point at the screen. The user is then asked to move the hand in a strictly horizontal motion, and then a strictly vertical motion, across the cylindrical screen. Both the motion of the inertial sensor and the tracked finger position are recorded, for approximately 800 frames each. The recorded motion is shown in Figure 4.10.

Figure 4.10: Pointing Gesture (white) and Inertial Sensor (grey) in 3-D

The recorded positions are then converted to cylindrical coordinates, and the r component is ignored, since the radius of the cylindrical screen is fixed. The θ component is plotted against time for the lateral motion, shown in Figure 4.11. As the figure shows, the motion of the inertial sensor matches the motion of the pointing gesture quite closely.

Figure 4.11: Theta vs. Time for Side-to-Side Motion

The height component over time is also plotted, for the up-and-down motion, shown in Figure 4.12. This figure shows that there is a discrepancy between the height of the inertial sensor and the height of the pointing gesture. This is because the inertial sensor assumes that the line-of-sight is parallel to the horizontal plane in which the sensor resides, whereas the pointing gesture system uses the line between the head position and the fingertip position as the line-of-sight. Since the head position is higher than the fingertip position, the projection onto the screen will naturally be lower than the actual fingertip position.

Figure 4.12: Height vs. Time for Up-and-Down Motion

In order to obtain a better comparison between the two interaction devices, the discrepancy between the θ values is computed for the lateral test and between the height values for the vertical test. By discrepancy is meant the absolute difference between the values for the inertial sensor and those for the pointing gestures. The average and standard deviation of these discrepancies are given in Table 4.1.

                        Vertical Test (values in metres)   Lateral Test (values in radians)
Mean Discrepancy        4.127100095                        0.059971386
Std. Dev. Discrepancy   0.421334303                        0.056081401

Table 4.1: Discrepancy Statistics between Pointing Gesture and Inertial Sensor

If the height graph for the pointing gestures is offset by the average height discrepancy in Table 4.1, a better comparison of the height values emerges. This comparison is shown in Figure 4.13.

Figure 4.13: Height vs. Time for Up-and-Down Motion with Offset

From Figures 4.11 and 4.13, one can observe that the greatest discrepancy in the values for both tests occurs when the user changes the direction of motion. This is because the pointing gestures use a Kalman filter prediction technique to estimate the finger positions and, because this uses a linear estimate of motion, it catches up to the changing direction more slowly than the inertial sensor, which has a much faster measurement rate. Although this does not cause a major problem for interaction, the pointer occasionally overshoots the mark when the person exhibits a rapid direction change.

4.1.7 Application

A real-world application that uses the pointing gesture recognition system is shown in Figure 4.14. The algorithms presented so far allow a user to control a pointer-like object on the screen using their fingers, and to interact with a virtual menu and other virtual objects within the scene. Preliminary tests have shown that the gesture recognition system is able to pick up and reconstruct the pointing fingers of up to 3 persons accurately within the environment.

Figure 4.14: Pointing gesture system being used in a real world application: Top image shows a person controlling a menu item on a 3-D projected menu and bottom image shows the person interacting with the display system using both hands to control two projected pointers (red circles)

4.2 Scenario

Scenario (Brown et al., 2011; Del Favero and Barker, 2010) explores the realm of interactive narrative, utilising technologies in artificial intelligence, computer graphics and computer vision. It consists of a cinematic fairytale that is played out in the AVIE environment, and encourages the audience to interact with certain characters contained within. It is inspired by the experimental television work of Samuel Beckett (Herren, 2007), and is based on a virtual portrayal of real-life events. The story is as follows:

A female humanoid character has been imprisoned in a concealed basement, along with her four children, by her father, who lives above ground with his daytime family. Set within this underground labyrinth, she and her children take the audience through various basement spaces in an attempt to discover the possible ways they and the audience can resolve the mystery of their imprisonment and so effect an escape before certain death. Watching them are a series of shadowy humanoid sentinels who track the family and the audience, physically trying to block their escape.

Scenario uses Immersitrack to provide a means for the humanoid sentinels to identify where the audience members are and react to them. A set of white sentinels urge the audience members to perform certain tasks to allow the story to unfold, and a set of shadow sentinels work to prevent the audience from doing so.

Figure 4.15: Scenario Scene 1

The purpose of Scenario is to study an emerging field known as the co-evolutionary narrative (Del Favero and Barker, 2010). The idea behind a co-evolutionary narrative is to explore deliberated actions performed by virtual humanoid characters within the interactive cinematic experience. The humanoid characters are given a degree of autonomy and, rather than following fixed scripts as in traditional games, are required to respond to audience behaviour in an intelligent manner.

An artificial intelligence (AI) system that explores the behaviours of virtual reality characters is used in Scenario to formulate a way for the humanoid sentinels to respond, independently, to the behaviour of the audience. In order for the AI system to do so, the positions of the audience members, provided by Immersitrack, are used as sensory input.

Figure 4.16: Scenario Scene 2

Scenario operates on exactly 5 audience members at a time and has proved to be very engaging for those involved. It has been demonstrated dozens of times to members of the public, including as part of the Sydney Film Festival (2011), and Immersitrack has proven to be robust in each instance. In certain scenes of the cinematic, the audience members are required to control virtual artefacts using their positions within the AVIE space. Immersitrack always responds consistently with the positions of the persons, making interaction seamless. Scenario has received several grants from government bodies, including an Australian Research Council Queen Elizabeth II Fellowship (2005), an Australian Research Council Discovery Grant (2005), and an Australian Research Council Linkage Grant (2006).

4.3 Mnajdra

In the Spaces of Mnajdra project (Flynn et al., 2006), an archaeological dig site located in Mnajdra in South East Malta is projected within the AVIE environment. Audience members control a virtual silhouette that highlights different parts of the scene and uncovers a hidden view of the scene. In certain areas, the silhouettes uncover a hidden object that the audience members can then interact with using a wired wand. The purpose of the Spaces of Mnajdra is to provide an interactive museum exhibit that emphasizes explorability.

Figure 4.17: Spaces of Mnajdra Project. Copyright: Bernadette Flynn, Photographer: Heidrun Lohr

Another method of exploration that is provided within the Spaces of Mnajdra application is through scene transitions. Certain hotspots within the vista that look like doorways allow the users to enter another scene, where further objects can be explored.

Spaces of Mnajdra uses Immersitrack to reflect the positions of real persons within AVIE into the virtual world. The positions are projected onto the screen by extrapolating the line joining the centre of the environment to the real-world position of the person. A silhouette in the shape of a human is displayed where the line meets the screen. The silhouette acts like a virtual shadow of the human and allows the human to trigger a doorway when the silhouette and doorway overlap.
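As a rough sketch of this projection (an illustration only: it assumes the AVIE screen can be modelled as a cylinder of known radius centred at the environment origin, and the helper names are invented), the silhouette position on the screen can be computed as follows:

#include <cmath>

struct Point2D { double x, y; };

// Project a person's ground-plane position onto a cylindrical screen of
// radius screenRadius centred at the environment origin: the silhouette is
// drawn where the ray from the centre through the person meets the screen.
Point2D projectToScreen(const Point2D& person, double screenRadius)
{
    double angle = std::atan2(person.y, person.x);   // direction of the ray
    return { screenRadius * std::cos(angle),         // intersection with the
             screenRadius * std::sin(angle) };       // screen wall
}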

Spaces of Mnajdra operates on 3 audience members at a time. Although Immersitrack can handle a greater number of persons, a greater number of resulting silhouettes increases the confusion in interacting with the scene. The application has been demonstrated to numerous high-ranking officials of the Maltese government, to significant praise.

Figure 4.18: A Scene Transition triggered by user’s location in Spaces of Mnajdra. Copyright: Bernadette Flynn; Photographer: Heidrun Lohr

4.4 Summary

In this chapter, three real-world applications that use Immersitrack for interactivity have been presented. The Scenario and Spaces of Mnajdra projects demonstrate two unique methods of interaction that highlight the innovative value of the tracking system. The pointing gesture recognition system demonstrates how the spatial coherency provided by the tracking system can be exploited to provide additional interaction mechanisms.

These applications demonstrate that Immersitrack is a robust and effective framework. The results from Immersitrack, along with those from SparseSPOT, will be used in the following chapters, together with a knowledge-based framework, for activity recognition. The knowledge-based framework will first be introduced and the adaptation of the AVIE data to the framework will then be presented. A secondary activity classification dataset based on publicly available ground-truth will also be used to evaluate the framework against a benchmark.

Chapter 5

An Incremental Knowledge Acquisition Framework for Activity Recognition

Activity recognition is typically viewed as a classification problem. The idea is that the individual under observation is performing a particular activity at the current instant in time, and it is up to the system to determine what activity class to output. In order to determine this output, the system has, at its disposal, data from an underlying vision component.

In this chapter, a knowledge maintenance approach called ripple down rules, or RDR, is introduced and it is shown how this approach benefits the activity recognition problem. RDR is an incremental knowledge maintenance and acquisition philosophy that has been used in industrial expert systems. It provides a predetermined knowledge base structure, and the domain expert is only required to differentiate cases based on features. The RDR framework brings a knowledge consistency approach to the problem, ensuring that the justification provided by the expert is consistent with past knowledge.

In this thesis, the RDR framework is used to perform activity recognition in a knowledge-sensitive context. The novel contribution of this work is to highlight the difficulties in encoding knowledge for activity recognition tasks, including the temporal and dynamic nature of the low-level feature vectors, the necessity of dealing with ambiguities in truth values and the problems caused by generalisation schemes for numeric data. Solutions are proposed for each of these problems, yielding a framework that is highly suitable for tackling general activity recognition tasks.

An introduction to the RDR methodology is presented in Section 5.1. In Section 5.2, an RDR framework is presented which encodes knowledge for the activity recognition task. A brief overview of the implementation and of the roles of the individuals involved in the development of the knowledge base is presented in Section 5.3. Finally, the generic application developed for training the RDR on different activity recognition scenarios is presented in Section 5.4.

5.1 Ripple Down Rules

An expert system is a software system that solves a problem using a database of human expertise (Hayes-Roth et al., 1984; Waterman, 1986). An expert system models the solution to a particular problem as a set of rules over the lower level data, and human expertise is used to determine the nature of the rules. Expert systems attempt to dissociate the task of a domain expert, who usually has little to no programming experience, from the task of the low-level programmer, who builds and maintains the underlying system.

Building an expert system requires several knowledge-centric tasks, where knowledge is defined as the expertise of the domain expert. First is the step of knowledge representation, where the knowledge needs to be encoded into lower level data structures. The second is the combined task of knowledge acquisition and knowledge maintenance, which involves acquiring expertise from the domain expert and ensuring that the knowledge is acquired in a meaningful and consistent manner. Finally, the acquired knowledge, called the knowledge base, is used for the task where human expertise is required. This last step is often labelled inference.

In 1989 Compton and Jansen, following experience maintaining the GARVAN-ES1 expert system (Compton and Jansen, 1989), recognized that knowledge cannot be explained by a single underlying model. In their attempt to provide a mechanism for experts to update an expert system knowledge base, they identified the paradigmatic philosophy behind the maintenance and acquisition of knowledge. Knowledge was identified as being context dependent, and it was found that experts often “make up” knowledge, to justify a conclusion about a particular case, by identifying properties unique to that case.

One of the main problems with knowledge acquisition is that, if experts keep adding their expertise into the database haphazardly, contradictory results may arise. This is especially true if more than one expert is providing input, without collaboration between the experts. Therefore, it is necessary to ensure that prior knowledge is not broken in the process of updating the knowledge base. Unfortunately checking an entire knowledge base against new knowledge is quite an expensive process, when one considers the amount of knowledge that is practically stored by a typical real-world knowledge base. As a result, ensuring consistency quickly becomes intractable.

Compton and Jansen addressed this problem by reorganising the knowledge acquisition methodology as the construction of an exception hierarchy, and devised a largely localised method of checking knowledge consistency. Through this realization came the development of RDRs as a knowledge maintenance and acquisition strategy.

RDRs represent knowledge as an exception hierarchy of rules encoded by experts. Whenever the knowledge base requires an update, because it does not explain a particular case, experts add their own expertise as an exception to the knowledge already existent. The knowledge acquisition engine maintains consistency by asking the expert to justify the creation of the exception using features that are specific to the new case and that set it apart from previous cases. Furthermore, the RDR maintenance strategy allows incremental knowledge acquisition, to deal with an insufficient feature set or conclusion set. Knowledge acquisition and maintenance is an ongoing process that can be performed even while the system is in operation, and new features and conclusions can be introduced at any stage to deal with new cases.

RDR knowledge bases first gained industrial use in the PIERS medical expert system (Preston et al., 1994), where it was shown that a knowledge engineer was not required at the knowledge acquisition stage. By 1994, 2000 rules had been added to the system, and the knowledge base is still growing today. Since the implementation of PIERS, RDR knowledge bases have been used in a wide variety of fields, ranging from medical and clinical domains (Edwards et al., 1993) and dynamic systems control (Shiraz and Sammut, 1997) to resource allocation (Richards and Compton, 1999), and there are at least three commercial companies whose core technology is based on RDR.

5.1.1 Single Classification RDR (SCRDR)

The simplest form of the RDR hierarchy is the Single Classification RDR, or SCRDR. An example of an SCRDR knowledge base is shown in Figure 5.1.

Figure 5.1: RDR Hierarchy Example

The fundamental structure of any RDR knowledge base is defined by an exception hierarchy organized in a tree. Each node within the hierarchy is a rule encoded by an expert. A rule typically takes the form consequent ← antecedent. A case is said to fire for a rule if the case returns true for the antecedent of the rule.

At the very top of the RDR knowledge base hierarchy is the root rule, ε. This rule fires for every case and outputs a null consequent. Every RDR knowledge base begins with this rule. Every rule within the RDR hierarchy has any number of ordered exception rules as children. When a case c requires inference by an RDR hierarchy, that case and ε are passed into Algorithm 5.1.

Algorithm 5.1: SCRDRInference
input : Case c, Rule r = γ ← α
output: Consequent
for e_i ∈ Exceptions(r) do
    if e_i fires for c then
        return SCRDRInference(c, e_i);
return γ
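A minimal C++ sketch of this recursive inference is given below. The Rule and Case types here are simplified placeholders for illustration, not the framework classes described later in this chapter; the root rule would carry a null (sentinel) consequent.

#include <functional>
#include <vector>

struct Case { /* low-level feature data for one person at one instant */ };

struct Rule {
    std::function<bool(const Case&)> antecedent;  // fires if this returns true
    int consequent;                               // conclusion (class label); sentinel for the root
    std::vector<Rule*> exceptions;                // ordered child exception rules
};

// SCRDR inference: descend into the first exception that fires;
// if no exception fires, the consequent of the current rule is returned.
int SCRDRInference(const Case& c, const Rule& r)
{
    for (Rule* e : r.exceptions)
        if (e->antecedent(c))
            return SCRDRInference(c, *e);
    return r.consequent;
}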

An illustrative example of the RDR inference method is presented in Figure 5.2. In the example, which uses the same knowledge base as in Figure 5.1, the input case fires for rules ε, Rule 1, Rule 3, and Rule 6. Thus, the conclusion output by the inference engine is the conclusion in Rule 6.

Figure 5.2: RDR Inference Example

5.1.2 Knowledge Revision

Although the RDR inference mechanism is straightforward, the real benefits of RDR arise in the way that knowledge is revised within the knowledge base in order to keep it consistent with prior revisions. RDR knowledge revision operates on the principle of context-dependent updates. The reasoning is that it is much easier for an expert to justify a revision to the knowledge base as an exception within a localised context, rather than having to observe how the knowledge affects the entire knowledge base. Therefore, RDR knowledge base updates are based on general-to-specific refinements to rules.

To better explain the knowledge revision process, suppose that there exists a 2-D domain that needs to be segmented into regions based on a set of points and their classifications. The antecedent of each rule within the knowledge base defines a subregion of the entire space and the consequent is a classification over the region. Thus, the expert is required to construct rules that define 2-D subregions of the original space.

The knowledge base is initially empty, consisting only of ε. Suppose that the expert now encounters a case C1 that requires a classification c1. Based on the expert’s judgement a rule R1 is created as an exception to ε. R1’s antecedent contains C1 and its consequent is the classification c1. An example is shown in Figure 5.3.

Figure 5.3: RDR Initial Revision Example.

The key to knowledge consistency within the RDR framework is the use of cornerstones. Cornerstones are cases used to revise the knowledge base and are stored within the rule nodes of the hierarchy. Cornerstones are stored at the rules that they were responsible for defining. The cornerstones of an RDR knowledge base represent all the past knowledge used to build the hierarchy. In the previous example, the case C1 becomes a cornerstone case and is stored within the rule R1.

To demonstrate how cornerstones work in maintaining knowledge consistency, suppose that a new case C2 enters the system and requires a new classification c2. Suppose that C2 lies in the region defined by R1. Then the knowledge base will output conclusion c1 when classifying C2. This is obviously a misclassification, and the rule R1 is called the point-of-failure (POF) for the case C2.

Training within the RDR framework always happens by creating new exception rules under the POF. If the POF already has a list of child exception rules, and the new case does not fire for any of these exceptions, then a new exception is created succeeding all of the children. Knowledge verification is performed by checking that the cornerstones used to define the POF do not fire for the new rule. If they do, a more specific rule is requested from the expert to exclude the fired cornerstones.

Going back to the previous example, the expert is required to create an exception under rule R1 that covers C2 and outputs c2. To ensure that prior knowledge is not broken, the cornerstones of the POF R1 are checked against the new rule. If the new rule covers any of the cornerstone cases then the expert is asked to make the rule more specific to exclude C1. This is demonstrated in Figure 5.4.

Figure 5.4: RDR Revision Example: Rule created on the left is too general and erroneously covers cornerstone C1. It is not allowed
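The verification just illustrated can be sketched as the following check, reusing the simplified Rule and Case types from the SCRDR sketch above (the real framework stores cornerstones inside the rule nodes; here they are passed in explicitly):

#include <vector>

// Returns the cornerstones that the candidate exception rule erroneously
// covers; the expert must refine the rule's antecedent until this is empty,
// at which point the revision is known to preserve prior knowledge.
std::vector<const Case*> violatedCornerstones(const Rule& candidate,
                                              const std::vector<Case>& cornerstones)
{
    std::vector<const Case*> fired;
    for (const Case& cs : cornerstones)
        if (candidate.antecedent(cs))        // new rule must not fire for old cornerstones
            fired.push_back(&cs);
    return fired;
}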

One of the key benefits of the RDR knowledge maintenance technique is that revisions happen in a localised, contextual manner, so that adding rules is never made more difficult regardless of the size of the knowledge base. Every time a misclassification occurs, an exception is required, a set of cornerstones surrounding the new rule is checked and, if necessary, the rule is made more specific before addition to the knowledge base.

In contrast to previous expert systems, one of the most important features of RDR is that the organisation of knowledge is unknown to the expert; the underlying system itself takes care of it. The domain expert is only required to provide a method of distinguishing cases based on the available features. This means that, once the structure and nature of the knowledge base have been determined by a knowledge engineer, the engineer is no longer required for the maintenance of the system. This property of the RDR philosophy is highly beneficial, as it has been shown that humans are much better at identifying features that distinguish objects than at trying to define an object on its own (Gaines and Shaw, 1990).

5.1.3 Multiple Classifications

The SCRDR structure presented above is the most basic form of RDR. SCRDR is the first type of RDR developed and, as its name suggests, each case goes through a single path, reaches a single rule, and the knowledge base outputs a single consequent. Since the inception of RDR, numerous modifications have been made to the structure of the hierarchy. The first modification is the shift from single classification to Multiple Classification RDR (MCRDR), presented by Kang et al. (1995).

The idea behind MCRDR is that a case can take any number of paths in the hierarchy, until it reaches a rule where no child exception fires. For each such path, the final consequent is output, resulting in multiple consequents. The MCRDR inference algorithm is presented in Algorithm 5.2.

Algorithm 5.2: MCRDRInference
input : Case c, Rule r = γ ← α
output: Set of consequents, C
E ← ExceptionsOf(r);
if ∃ e ∈ E such that e fires for c then
    for e ∈ E s.t. e fires for c do
        MCRDRInference(c, e);
else
    C ← C ∪ {γ};

Note that the order of child rules is unimportant in MCRDR, as opposed to SCRDR. An example of inference in an MCRDR hierarchy, presented by Kang et al. (1995), is shown in Figure 5.5.

Figure 5.5: Multiple Classification RDR example (Kang et al., 1995). Bold boxes show the rule paths followed. Resulting conclusions are Classes 2, 5, and 6.

MCRDR is clearly useful for the activity recognition problem, as it means that a single knowledge base can be enriched with any number of activities, including activities that are independent of each other. This is in contrast with SCRDR where, if a person is performing two or more activities at the same time, then two or more knowledge bases must be set up, which can considerably impact both memory and performance of the system.

In MCRDR the knowledge revision process is more complex than in SCRDR, due to the fact that there may be multiple POFs, so several exception rules may need to be created. The steps in acquiring knowledge within an MCRDR hierarchy are as follows:

Determine Classifications The expert is first required to determine the set of correct classifications C0 for the misclassified case. This is simply a matter of choosing from a list of all possible classifications.

Determine all POFs The system passes the case through the knowledge base and determines all the POFs for the case. The POFs for MCRDR are any rules that fire for the case and that do not output a conclusion within the required conclusion list C0. Then three types of modifications to the knowledge base can be performed:

« For a POF, a new exception rule that provides a correct classification can be added as a child.
« For a POF, a special deletion rule that deletes the conclusion output by the POF can be added as a child.
« A new exception rule can be added to the root rule to provide a correct classification.

The first two cases occur when a fired rule outputs a conclusion that is not in C0. The third case occurs when there is a conclusion in C0 that is not output by any of the fired rules.

Knowledge Verification Knowledge verification involves checking any cornerstones within the knowledge base that may fire for the new rule. In SCRDR it was only required to ensure that the cornerstones of the POF did not fire for the new rule. In MCRDR cornerstones from several rules may fire for the new rule. First, as in SCRDR, the cornerstones from the POF must be checked against the new rule. It is also possible that cornerstones from the descendants of the POF may also fire for the new rule. By descendants is meant the previous children of the POF, along with their children and so on. Each of these must be checked against the new rule, and the expert must add enough conditions so that the rule fires for the new case, but not for any of these cornerstones. Although this appears to be a considerable number of cornerstones, in practice it is found that the expert can easily create conditions that eliminate most of the cornerstones quite quickly. This is because properties of cornerstones within a rule generally carry over to cornerstones in the rule’s children, as the cornerstones of the children must necessarily fire for the parent rules. In addition to this, rules that output a conclusion that is in C0 may also be eliminated since they are correct in their classifications.

Although MCRDR presents increased difficulty for the programmer in developing the initial infrastructure, once the knowledge base is in operation, the domain expert’s job is once more to simply create rules that distinguish cases, and no knowledge engineering expertise is required. MCRDR has been extended numerous times, in order to explore new types of RDR structures for different problem types, including NRDR (Beydoun and Hoffmann, 2000) and RIMCRDR (Compton and Richards, 2000). A wide variety of RDR methods are reviewed by Richards (2009). In this thesis, the FlatRDR structure is employed, which, as will be shown, is highly suited to this application domain.

FlatRDR In this research a modified version of the MCRDR structure called FlatRDR, introduced by Kang et al. (1996), is used. The FlatRDR structure provides considerable performance benefits over MCRDR while providing the same knowledge acquisition functionality. A FlatRDR is structured as an n-ary tree of maximum depth 2 with:

Level 0 The single root rule, ε,

Level 1 Any number of classification rules that provide a conclusion for a case, and

Level 2 Any number of deletion rules that delete a classification assigned by the parent.

Figure 5.6 shows the MCRDR in Figure 5.5 reformulated as a FlatRDR.

Figure 5.6: FlatRDR example

The inference method in FlatRDR is given in Algorithm 5.3.

Algorithm 5.3: FlatRDRInference
input : Case c, Root Rule ε
output: Set of consequents, C
// Exceptions of ε are all classification rules
E ← ExceptionsOf(ε);
C ← ∅;
for e ∈ E s.t. e fires for c do
    // Exceptions of e are all deletion rules
    D ← ExceptionsOf(e);
    if ∀ d ∈ D, d does not fire for c then
        C ← C ∪ ConsequentOf(e);
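Reusing the simplified Rule and Case types from the SCRDR sketch in Section 5.1.1, the inference of Algorithm 5.3 amounts to the following (a sketch only, not the thesis implementation):

#include <set>

// FlatRDR inference: each level-1 classification rule that fires contributes
// its consequent unless one of its level-2 deletion rules also fires.
std::set<int> FlatRDRInference(const Case& c, const Rule& root)
{
    std::set<int> conclusions;
    for (const Rule* classRule : root.exceptions) {
        if (!classRule->antecedent(c)) continue;
        bool deleted = false;
        for (const Rule* delRule : classRule->exceptions)
            if (delRule->antecedent(c)) { deleted = true; break; }
        if (!deleted) conclusions.insert(classRule->consequent);
    }
    return conclusions;
}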

In the FlatRDR knowledge base, although the expert is still assigned the task of implicitly distinguishing classifications, the task is broken down into two simpler sub-tasks: (i) determine the reasoning for an incorrect classification by creating a deletion rule and (ii) add a new rule with the correct classification. This considerably eases the burden on the domain expert, as fewer conditions need to be chosen for building a single rule, as opposed to MCRDR. In addition to the simplicity of rules, the FlatRDR structure also provides a beneficial performance improvement over the MCRDR structure. Because the FlatRDR hierarchy does not grow beyond depth 2, inference can be performed using faster programmatic techniques. This is considerably faster than MCRDR inference and is highly suitable for real-time domains. Thus, the knowledge base constructed for this thesis employs a FlatRDR structure.

5.1.4 Generalisation

Another major issue in RDR knowledge base revisions is the generalisation scheme used in adding rules (Scheffer, 1996). Generalisation is the process of expanding the boundaries of a rule within the knowledge base to cover new cases. A generalisation mechanism allows experts to go back and expand the scope of an existing rule, rather than creating a new exception every time. A lack of generalisation can lead to the knowledge base covering a very small set of cases and result in considerable duplication of knowledge, especially when dealing with real-time data where knowledge acquired over single frame steps is highly specific and minimal.

However, generalisations have proven to be quite problematic when features are numeric, as they commonly are in real-time systems, leading to knowledge degradation (Misra et al., 2009). To demonstrate the problem with generalisation, consider the knowledge base shown in Figure 5.7. In this knowledge base there are two rules outputting different conclusions. Rule R1 was built on cornerstone 1 and rule R2 was built on cornerstones 2 and 3. The colours of the cases denote the required conclusions.

Figure 5.7: An example demonstrating the problems of generalisation: Knowledge base consists of 2 rules, Rule 1 built on Case 1 and Rule 2 built on Cases 2 and 3.

Now, suppose that two new cases are observed by the expert as shown in Figure 5.8. Cases 4 and 5 are passed through the system in that order. Case 4 lies within the boundary defined by R2 and this conclusion is correct. As the case is classified correctly, it is ignored and no new rule is built. Case 5, however, requires the same conclusion as Case 1 but lies outside the boundaries of both rules. It is misclassified.

Figure 5.8: An example demonstrating the problems of generalisation: Two new cases enter the system. Case 4 is correctly classified by Rule 2 and is ignored. Case 5 is misclassified and requires a knowledge revision.

If generalisations were allowed within the knowledge base, the expert could choose to expand the scope of Rule R1 to cover Case 5 rather than creating a new rule, as shown in Figure 5.9. In doing so, however, the new rule now also covers Case 4. This should clearly not be allowed, but there is no way to check it because Case 4 does not reside within the knowledge base, as it was not used as a cornerstone. The knowledge consistency rule has been broken.

Figure 5.9: An example demonstrating the problems of generalisation: Rather than creating a new rule for Case 5, the expert has chosen to generalise Rule 1. The new rule incorrectly classifies Case 4, but this cannot be checked as Case 4 is not stored.

A benefit that arises from the use of real-time data, however, is that a large number of cases that have the same conclusion, and whose feature values are statistically similar, can easily be obtained for training a particular rule. Thus, a compromise is proposed: generalisations to existing rules are disallowed, but rules are built on several cases at once rather than on single cases. For example, one can assume that several frames of a person walking will have velocity values close to each other, and every one of these frames can be used to train the new rule. Using several cases to build a rule in such a way could be viewed as mixing RDR with induction and has been explored in the past (Gaines and Compton, 1995; Scheffer, 1996; Wada et al., 2001).

5.2 Encoding Knowledge for Activity Recognition

The RDR knowledge strategy provides a philosophical basis for the acquisition and maintenance of knowledge. Although it provides some algorithmic foundations for implementation, it does not describe how to encode knowledge from different scenarios within its rules. Before an RDR system can be trained with cases by an expert, a programmer is first required to tailor the RDR for the required domain. This is not always straightforward, as consideration needs to be given to the performance of the RDR implementation, along with making it easy for the expert to distinguish cases. Thus, aspects of low-level implementation and higher-level interfaces must both be first determined.

The activity recognition task presents its own difficulties that require careful consideration of several challenging aspects. In this section, justification for the use of RDR in activity recognition is presented and it is shown how the RDR philosophy can be adapted for an activity recognition framework.

5.2.1 Rule Format

The first important decision to consider is the format of the rules themselves, determined by the format of the antecedent in each rule. The choice of the rule format can lie between two extremes. The expert can be offered the power to create the rule as they see fit, giving them the greatest power over rules, or the format can be completely fixed and determined entirely by the cornerstones, giving the expert no power.

If the latter extreme is chosen, an automated technique must be implemented to induce rules from cornerstones. Such methods have been experimented with in the past. For example, Gaines and Compton (1995) developed Induct-RDR, a method of recursively inducing an RDR tree based on binomial probability, given a large training set. Wada et al. (2001) present a modification of Induct-RDR by using the Minimum-Description Length principle rather than binomial probability. Such inductive revisions are not used in this research as they rob the system of the valuable expertise afforded by the human expert, and can result in rules that are too complex and unreadable.

On the other hand, giving the expert maximum power over rules would require that the rules be implemented in a tailored machine-readable language, like LISP expressions. This generally imposes a steep learning curve on the expert, as they would need to learn the language first. It also makes differentiation of rules much more complex, especially in the latter stages of knowledge acquisition when rules can become quite specific. Furthermore, it can result in a considerable decrease in efficiency in processing cases, if the interpreter is not suitably fast.

In this research, inspired by machine learning approaches, each rule is implemented as a conjunction of predicates,

p_1 ∧ p_2 ∧ p_3 ∧ ... ∧ p_n        (5.1)

where each p_i is defined on a single low-level feature. The features themselves can be either real-valued or boolean-valued. The predicate p_i can take one of two forms:

p_i = f_{j,n} ∈ [l_{j,n}, u_{j,n}]        (5.2)

where f_j is the j-th feature vector and f_{j,n} is the n-th component of this vector, and l_{j,n}, u_{j,n} are lower and upper bounds. This is simply a real-valued decision boundary on the real-valued feature f_{j,n}. Note that l_{j,n}, u_{j,n} may be -∞, ∞ respectively to denote a single-sided boundary. The predicate may also take the form:

p_i = f_j        (5.3)

where f_j is a boolean-valued feature.

During training, the expert is simply required to choose a subset of features from the feature selection set, and the knowledge acquisition engine builds the rule automatically by choosing values based on the cases selected for training. This format of rules is simple enough for the expert to build rules without having to learn a new language. At the same time, the knowledge maintenance engine is given enough power to pick suitable values for the conditions, thus relieving the expert of this very precise task. It is also highly suitable for vision domains, where low-level features are usually either boolean or real-valued.
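A sketch of how such a conjunctive antecedent over bounded, real-valued features might be represented and evaluated is shown below. Feature evaluation is abstracted to a callback and the names are illustrative only; the placeholder Case type from the earlier sketches is reused.

#include <functional>
#include <limits>
#include <vector>

struct NumericPredicate {
    std::function<double(const Case&)> feature;    // evaluates f_{j,n} on a case
    double lower = -std::numeric_limits<double>::infinity();
    double upper =  std::numeric_limits<double>::infinity();

    // Single-sided boundaries are expressed by leaving one bound at +/- infinity.
    bool holds(const Case& c) const {
        double v = feature(c);
        return v >= lower && v <= upper;
    }
};

// A rule antecedent is a conjunction of predicates: it fires only if all hold.
bool antecedentFires(const std::vector<NumericPredicate>& predicates, const Case& c)
{
    for (const NumericPredicate& p : predicates)
        if (!p.holds(c)) return false;
    return true;
}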

5.2.2 Static versus Dynamic Attributes

RDR knowledge bases must have the ability to incorporate new features at any point of the knowledge base lifetime, thus allowing unique new rules to be incrementally introduced. However, it is important to note that there is an inherent limitation to the ability to define new features once the knowledge base has been structured.

An RDR knowledge base requires that past cases used to construct rules must be consistent with new rules for new cases. However, if the new rule that is being constructed utilises a feature that is not computable on a past case, then there is no way to ensure knowledge consistency. Thus, it must be ensured that every feature can be computed on every case within the knowledge base.

To impose this restriction, it is important to distinguish between static and dynamic attributes. A static attribute is defined as a fixed property of a case. It is usually of a primitive type, and it can always be determined for every case entered into the knowledge base, and for every cornerstone case stored within. Conversely, if a case does not have a static attribute required by the knowledge base, it simply cannot be inferred upon. Static attributes are determined in discussion between the knowledge engineer and the expert during domain construction; once knowledge acquisition has started, attributes can neither be added nor removed.

On the other hand, a dynamic attribute is a property that is usually determined by performing some computation on one or more static attributes. In the simplest case, a dynamic attribute could simply be an identity function on a static attribute, thus allowing the knowledge engineer to get direct access to the static attribute. Dynamic attributes can be constructed at any time to deal with new cases and, since they operate on a fixed set of static attributes that must be available for every case, they too can be computed on every case. Simple examples of static attributes for a living being would be the number of legs and the number of arms, whereas a dynamic attribute might be number of limbs = number of legs + number of arms. Static attributes are exposed to the expert during dynamic attribute construction, but not during knowledge base revision.
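The distinction can be illustrated with the legs/arms example above (type and function names here are invented for the sketch):

// Static attributes: fixed properties stored with every case and cornerstone.
struct BeingFrame {
    int numLegs;
    int numArms;
};

// Dynamic attribute: computed on demand from static attributes, so it can be
// introduced at any time and still be evaluated on every stored cornerstone.
int numLimbs(const BeingFrame& f)
{
    return f.numLegs + f.numArms;
}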

5.2.3 Cases and the Temporal Nature of Activities

In the activity recognition scenario, each case represents a single person at a single time instant. Since the properties of a person can change from one frame to the next, the case representing a person at time t is not the same as the case representing the person at time t + 1. Other than the cornerstones, each case is volatile and its lifetime is simply one frame. The attributes of the person consist of frames of data output from the lower level vision stages. Because activity recognition is necessarily temporal, each case can consist of attributes from any number of prior frames, and not just the attributes in the current frame. Although the activity is classified in the current frame, the knowledge base and expert may also need to consider these prior frames and be able to compute features over sequences of frames. Thus, cornerstones need to be stored with sequences of frames.

The problem with such temporal data is that knowledge consistency becomes difficult to maintain. As an example, suppose that there is a cornerstone c within the knowledge base with 10 frames stored, because the rule r that the cornerstone belongs to does not contain any features that require more than 10 frames. Suppose now that the expert wants to create a child rule r′, under r, for a misclassified case. If there is even a single feature within r′ which requires more than 10 frames, then there is simply not enough information to determine if r′ fires for c.

Ideally, when a person’s case is used as a cornerstone, the knowledge base would need to store every single past frame of the case. In real-time systems, this can result in thousands of frames being stored and, as the knowledge base becomes more and more complex, memory bounds will quickly be reached. In practice, it is found that an upper bound limit N on the number of frames that a feature can be computed over is sufficient, as features are usually computed over frame lengths that are much shorter than the entire length of time that a person has been tracked.
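A plain ring buffer captures the bounded frame history implied here, with an upper limit N on the frames retained per case; this is only a sketch of the idea, not the circular buffer used by the framework described in Section 5.3.

#include <cstddef>
#include <vector>

template <typename Frame>
class FrameHistory {
public:
    // maxFrames is the upper bound N on frames retained; must be at least 1.
    explicit FrameHistory(std::size_t maxFrames) : buffer_(maxFrames) {}

    // Append the newest frame, discarding the oldest once N frames are stored.
    void push(const Frame& f) {
        buffer_[head_] = f;
        head_ = (head_ + 1) % buffer_.size();
        if (size_ < buffer_.size()) ++size_;
    }

    // i-th most recent frame (i = 0 is the newest).
    const Frame& recent(std::size_t i) const {
        return buffer_[(head_ + buffer_.size() - 1 - i) % buffer_.size()];
    }

    std::size_t size() const { return size_; }

private:
    std::vector<Frame> buffer_;
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};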

5.2.4 The Final Rule Format

Having considered the previous conditions, an example of a single rule within the activity recognition problem is given in Figure 5.10.

Figure 5.10: An example rule used within the Activity Recognition RDR framework. The features are designed by the domain expert. The features, dimension, time range and decision boundary sidedness are selected during training.

The rule consists of a conjunction of predicates and each predicate is a decision boundary over feature values. The computation of each feature is designed by the domain expert and constructed by a vision programmer off-line. During training, the expert chooses the feature, the dimension if it is a multi-dimensional feature, the number of frames the feature is computed over, and whether an upper bound, lower bound, or both are used.

5.3 Implementation and the Roles within the Framework

A high-level class diagram of the implementation of the RDR framework is presented in Figure 5.11. This diagram shows a static overview of the structure of the framework, along with important operations. This structure is common to most RDR implementations and the differences lie mainly in the low-level implementation details. The framework in this thesis has been implemented in C++ to leverage the language’s speed and compile-time optimizations.

Figure 5.11: A high level overview of the class diagram for the RDR framework implementation

In adapting the RDR framework to a particular activity recognition situation there are two roles, shown in Figure 5.11 by the shaded classes Frame, CaseDisplayWidget, Feature, and Consequent. The first is the role of the vision programmer, who is responsible for defining the attributes of the underlying case. This involves specialising the Frame class by defining all the attributes of a single case in a single frame, and defining input/output methods for the frame class to write frames to disk. The case class takes care of updating frames using a circular buffer.

The vision programmer is also responsible for specialising the CaseDisplayWidget class for each case. This display class is responsible for presenting specialised cases to the domain expert using a meaningful visualisation. The framework developed in this thesis uses the Qt cross-platform user interface library (Blanchette and Summerfield, 2008), and so the vision programmer is required to implement a custom Qt display widget.

Once the vision programmer has performed the previous two tasks, they are required to define properties of the final case. The framework uses C++ macros to allow the programmer to define a case in one line with ease. The macro used for defining the case is shown in Figure 5.12.

DefineCase(ClassName,FrameType,Name,IdLength,CaseWidget,NumFrames);

Figure 5.12: Defining a Case

Several parameters need to be defined for each case. ClassName is the name of the class that defines the case being implemented. FrameType and CaseWidget are the names of the specialised classes defined above. Name is a unique string name to identify the case on disk. IdLength is the number of components in the id for a single, unique case. Finally, NumFrames is the number of frames available for each case. It determines the length of the circular buffer for each case.
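As a purely hypothetical illustration (the class, widget, and string names below are invented, not those used in Chapter 6), a case with a 50-frame buffer and a single-component id might be declared as:

DefineCase(PersonCase, PersonFrame, "person_case", 1, PersonCaseWidget, 50);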

The second role is the role of the domain expert, who is responsible for specialising the Feature and Consequent classes during training. Specialising a feature involves defining the feature, again using macros, and implementing the evaluation of the feature. The expert can define either a real-valued feature vector or a boolean feature, as shown in Figure 5.13.

DeclareFeatureVector(FeatureClassName,Name,NumDimensions,FeatureData);
void FeatureClassName::evaluate(Case* c, double* out);
DeclareBooleanFeature(FeatureClassName,Name,FeatureData);
void FeatureClassName::evaluate(Case* c, bool& out);

Figure 5.13: Defining Features

FeatureClassName is the C++ class name for the feature, Name is a unique string identi- fier, NumDimensions is the number of dimensions of the feature vector, and FeatureData is an arbitrary struct containing extra data types for feature computation. The boolean-valued feature is similar to the real-valued feature vector except that it only has one dimension.
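As an example of the kind of computation such an evaluate method might perform, the body of a hypothetical one-dimensional speed feature could reduce to a helper along these lines (names and the position representation are assumptions for the sketch):

#include <cmath>
#include <vector>

struct Position { double x, y; };

// Average ground-plane displacement per frame over the buffered positions;
// a hypothetical evaluate() would write this single value into out[0].
double averageSpeed(const std::vector<Position>& positions)
{
    if (positions.size() < 2) return 0.0;
    double dist = 0.0;
    for (std::size_t i = 1; i < positions.size(); ++i) {
        double dx = positions[i].x - positions[i - 1].x;
        double dy = positions[i].y - positions[i - 1].y;
        dist += std::sqrt(dx * dx + dy * dy);
    }
    return dist / (positions.size() - 1);
}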

Finally, the expert is required to define consequents. Since the framework performs classification, consequents are simply class values. Defining a consequent is shown in Figure 5.14.

DeclareConsequent(ConsequentClassName,Classification)

Figure 5.14: Defining Consequents

5.4 The User Interface

Another important aspect of the RDR knowledge base engineering stage is the development of an appropriate user interface to present to the domain expert, making it easier for the expert to make rule decisions. This task is especially important in vision-based applications where a visual display of the cases must also be provided to support decision-making. The user interface designed for this research is presented in Figure 5.15. Each coloured component of the user interface will be detailed in the following sections.


Figure 5.15: The RDR knowledge acquisition user interface designed for the activity recognition task.

In the following sections, the process of revising the knowledge base for the activity recognition scenario is demonstrated. The process begins when the expert has selected a set of cases that they wish to use to revise the knowledge base. Denote this set of cases by C0. In each revision run, the expert may choose to build multiple rules, and thus the cases selected may be stored as cornerstones of several different rules.

5.4.1 Consequent Selection

The first task for the expert is to select a consequent for the new rule. This simply involves choosing from a list of available consequents, as shown in Figure 5.16. Upon selection, the consequent determines the cornerstones to check during the verification of the new rule.

Figure 5.16: Selecting a Consequent

5.4.2 Rule Selection

The next step for the expert is to select a point-of-failure as the parent of the new rule. In Figure 5.17, two rules fire for the cases that the expert has selected: the root rule and a classification rule that outputs a walking classification. This means that out of all the cases in C0, all of them fire for the root rule, and some of them fire for the walking classification rule.

Figure 5.17: Selecting a Rule. Although the children rules are also shown, they are only there to assist the expert in building a rule. They are not selectable

Thus, the expert has one of two choices here. They may choose to build a new classification rule under the root rule. This new rule will output the consequent chosen from the previous consequent selection box. On the other hand, the expert may choose to build a rule that deletes the walking classification from some cases in C0. This would entail creating a new child rule under Rule 0 : 0 that deletes the consequent output by 0 : 0.

5.4.3 Cases Selection

Upon selecting a rule, the expert is presented with a subset of cases from C0 that fire for the selected rule. The expert is then required to select a subset of these cases on which to build the new rule. This selection determines the feature values that will be used to build the new rule. An example is shown in Figure 5.18.

Figure 5.18: Selecting Cases to build the new rule. Left image shows result if root rule was selected and right image shows result if Rule 0 : 2 was selected

5.4.4 Cases Display

The cases display section of the training interface is a custom-built display area determined in discussion with the expert when the knowledge structure was being designed. Because of the importance of visual information within the activity recognition problem, the cases display widget is necessary to enable the expert to develop meaningful rules. In the framework used in this thesis, the cases display widget is required to present to the expert the set of selected cases. The display widgets implemented for this thesis will be explained in Chapter 6, in which real-world uses of the RDR framework are presented.

5.4.5 Feature Selection

The next step for the expert is to start building the new rule to be added to the knowledge base. This involves constructing a set of decision boundaries over features. The first step is for the expert to select the feature name, as shown in Figure 5.19.

Figure 5.19: Feature Selection. A list of programmed features is presented to the expert to add to the new rule

Upon selection of the feature, a feature graph is updated with the feature values of the selected cases. A curve is plotted for each case showing the relationship between the feature value and the number of frames that the feature is computed over. The x-axis is a time range where time is defined as the number of frames. This is shown in Figure 5.20.

Figure 5.20: Feature graph for a feature computed over a range of frames. The time range is specified on the x-axis. The thick red lines represent the minimum and maximum feature values

Note the two red lines plotted along with the feature values. These lines plot the minimum and maximum values of the feature and represent the actual boundary that will be constructed for the rule. The feature graph is provided to assist the expert in identifying the features that are similar over the selected cases, thus making the rule more appropriate for the cases.

Having selected a feature and observed its values in the feature graph, if the feature depends on a time range, the expert must make a decision about the time span that the feature value is computed over, as shown in Figure 5.21. In the figure, the left image shows the feature graph for the entire time range. In the right image, the expert has made the time range smaller, from 18 to 24 frames. This means that the antecedent of the new rule will compute feature values for each time range from 18 frames to 24 frames. The use of a range of time values allows for the quick elimination of cases that should not fire for the new rule, while not making the rule too specific.

Figure 5.21: Selecting a feature time span. Left image shows a larger time span than the right image for the same feature. The antecedent constructed from the right image will be more general than the one from the left image

Finally, the expert needs to select whether the boundary is one-sided or two-sided. If a one-sided boundary is desired then the expert is required to choose whether the boundary is an upper boundary or a lower boundary. The resulting three choices are presented to the expert as the relationship GreaterThan, representing a lower boundary, LesserThan, representing an upper boundary, and Between, representing a double-sided boundary. This is shown in Figure 5.22.

Figure 5.22: Selecting whether the boundary is one-sided or two-sided

5.4.6 Antecedent View

As the expert keeps adding features to the new rule as above, the antecedent view updates with the added features, shown in Figure 5.23.

Figure 5.23: A view of the antecedent being used to create the new rule

When the expert is satisfied that there are sufficient features, the rule may be created. However, before this can happen, the new rule must be verified against the knowledge base as explained in Section 5.2. To do so the expert first clicks on the “Update Error Cornerstones” button, which determines all the cornerstones within the knowledge base that fire for the antecedent but should not do so. This is the knowledge validation step. It is now up to the expert to eliminate any erroneous cornerstones.

5.4.7 Fired Cornerstones

The expert is first shown the erroneous cornerstones in a table, as in Figure 5.24. The expert is also shown the consequent output by the rule that the cornerstone belongs to, and the relationship between that rule and the rule being built in the current training session. The relationship can be either parent or sibling. At this point, the responsibility of the expert is to add features to the new rule that can eliminate these cornerstones. The new rule can only be built once the cornerstones table is empty.

Figure 5.24: Table of error cornerstones. This table must be empty for the new rule to be created

In order to assist in the selection of appropriate features, when error cornerstones are selected the feature graph is updated to reflect the cornerstone choice. This enables the expert to see how a feature differs between misclassified cases and error cornerstones, thus allowing the expert to build valid rules quickly. The cornerstones graph is differentiated from the cases graph as it is represented by a thick green line, shown in Figure 5.25.

Figure 5.25: Feature graph for cornerstones. This graph allows the expert to choose features that are substantially different between cornerstones and cases. The green lines show the boundaries defined by the cornerstones

5.4.8 Confirmation

The final step in the training process is the confirmation of the new rule. Note that this process consists of two choices, shown in Figure 5.26: “Construct Rule” or “Correct Classifications”.

Figure 5.26: Confirmation of the New Rule

The first choice works normally, and can only be run when the error cornerstones table is empty. This choice constructs the new rule at the selected point and adds the selected cases as cornerstones. Once the rule is added, the knowledge base is run on all the cases in C0, and the training process is repeated.

The second choice becomes very important when dealing with ground truth data as will be shown in the next chapter. This choice is selected when the expert finds it impossible to construct a new rule that can account for the selected cases while not firing for any surrounding cornerstones. This choice enables the expert to modify the ground truth classifications of cases to reflect a more suitable classification. As will be shown, this is a highly beneficial aspect of the RDR knowledge base, as it allows for consistency not only in the classifier, but also in the training data.

5.5 Summary

In this chapter, an introduction to the RDR knowledge acquisition philosophy has been presented, along with the benefits that it provides towards the activity recognition task. A novel method of encoding knowledge for activity recognition within an RDR structural framework has been presented, with emphasis on accounting for the temporal nature of the activities and the dynamic nature of attributes and features. It was also shown how the RDR knowledge base could leverage real-time data to provide greater generalisation capabilities.

An overview of the concrete implementation of the RDR framework has also been presented in this chapter. In particular, the different roles played by the vision programmer and the domain expert in adapting a particular activity classification scenario to the framework were outlined. Finally, the chapter presented an overview of the different components of the user interface during RDR training, demonstrating the ease with which the domain expert can create new rules.

The next chapter will provide experimental results in adapting two markedly different activity recognition scenarios to the RDR framework. The first is a ground-truth dataset that provides for a qualitative evaluation of the use of the framework for the task, demonstrating its performance against benchmark standards. The second is the result from Immersitrack and SparseSPOT, which will be used to provide activity recognition within the AVIE environment.

Chapter 6

Applications of the RDR Framework to Activity Recognition

Having adapted the RDR framework to the activity recognition task, two applications of the framework are presented in this chapter. The RDR framework is separately adapted to two different activity recognition scenarios and a similar experimental analysis is performed on both adaptations to provide a comparative evaluation of the framework.

The first application performs activity classification in a surveillance setting. To this end, the popular CAVIAR dataset (Fisher, 2002) is used for evaluation. The CAVIAR dataset is a publicly available computer vision dataset that is often used to benchmark work in vision-based surveillance systems. The purpose of using the CAVIAR dataset, in the context of this thesis, is to provide a quantitative evaluation of RDR against commonly used ground truth data.

The second scenario that the RDR framework is employed for is the data from Immersitrack and SparseSPOT. These two algorithms provide point-based tracking and voxel reconstructions for each person within the AVIE environment. This data is then used to perform activity classification of 8 activities on persons within the environment.

Experimental results on both datasets show that the RDR knowledge acquisition strategy presents a viable approach for the classification of activities in the absence of a large amount of training data. It is shown that vision experts have the ability and the intuition to create suitable rules that generalise quickly to a large number of cases. Meanwhile, the knowledge verification step employed by RDR ensures that the rules added by the experts are consistent with past knowledge.

In addition to the classification benefits provided by the knowledge base, the incremental nature of the RDR approach means that cases that do not fit in with the expert’s own knowledge are quickly uncovered. This means that each revision can yield an update of the expert’s own cognition along with an update to the knowledge base. This allows the expert to adjust their perceptions and build more accurate features and rules to provide better classification.

The adaptation of the RDR framework to the CAVIAR dataset is presented in Section 6.1. An adaptation to the AVIE dataset is presented in Section 6.2. The method of training the knowledge base over both datasets is presented in Section 6.3. Quantitative experimental results on both datasets are presented in Section 6.5. The chapter concludes in Sections 6.6 and 6.7 with a discussion of the benefits of the RDR framework for activity recognition tasks, and its ability to leverage the cognitive aspect of human activities.

6.1 CAVIAR: Activity Recognition in a Surveillance Setting¹

The CAVIAR dataset (Fisher, 2002) provides an excellent example of the need for knowledge consistency in an activity recognition system. It consists of a series of videos of humans in a shopping centre environment. Ground-truth for this dataset consists of the bounding box of each human in each frame, along with hand-marked labels of the activities that the human is performing. A typical frame from the CAVIAR dataset is shown in Figure 6.1, along with the ground-truth values.

Figure 6.1: Frame from CAVIAR Dataset with ground truth (Bounding box and movement label)

A basic classification task defined by the creators of the CAVIAR dataset is to assign one of four movement classes { Inactive, Active, Walking, Running } to each human in each frame. The definitions of these four classes (Fisher, 2004) are given in Table 6.1. Note that the description of all of these classes is entirely dependent on the motion of the person’s centre-of-gravity and bounding box. This is important as it implies that the classes should be easily distinguishable using just spatio-temporal reasoning over the ground-truth, marked bounding boxes.

¹ Portions of this section appear in Sridhar et al. (2010).

Movement Class    Description
Inactive          A static person/object with no local motion
Active            A person with local motion but not translating across the image
Walking           A person translating across the image
Running           A person translating across the image faster than walking

Table 6.1: Movement classification tasks in the CAVIAR Dataset

The CAVIAR dataset comes with ground-truth labels stored within it. Attributes of CAVIAR cases are defined by information that is already present in every frame within the dataset.

6.1.1 The CAVIAR Case

The information available for each person within each frame of the CAVIAR dataset is an integer id, the ground-plane centre-of-gravity position, the bounding box height and width for the person’s blob, and a frame of video. Unlike previous research (Ribeiro et al., 2005), no image attributes are used within the RDR system. There are two reasons for this. The first is that the movement classification task, according to its definition, should be entirely independent of image attributes, therefore making them redundant. The second reason is that using image attributes would require the storage of the video images. This would include not only the image from the current frame, but also several prior frames, as this is a requirement for the cornerstone storage outlined in Chapter 5, Section 5.2.3. Since dozens of cases can be trained at any one time, space limitations would quickly be reached. Although this may be counteracted with additional disk space, RDR provides highly accurate results without the need to store image data, as the results will show.

Therefore, only spatio-temporal attributes of CAVIAR cases are utilised, and image-based attributes are discarded. The selected attributes are given in Table 6.2.

CAVIAR Case

Attribute Name               Description
Id ∈ N                       Labelled id of the person represented by the case
Position: (x, y) ∈ R²        Labelled centre of gravity of person
Bounding Box: (w, h) ∈ R²    Labelled bounding box width, height

Once the attributes are determined, it is necessary to select the number of prior frames that will be stored for each case. A frame in the CAVIAR case does not refer to a video frame but rather to a single structure holding all the attributes for an individual case. In this work, 50 frames are stored for each cornerstone. It is found that a much smaller number can actually be used, as most of the distinguishing features do not require so many frames. However, since the attribute set is quite small, 50 frames do not have a very large impact on the size of the knowledge base, and may prove useful later on when rules become highly specific and more information is required.
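Following Table 6.2, a single CAVIAR frame reduces to a small plain-data structure along these lines (field names are illustrative, not the thesis code):

// One frame of CAVIAR ground-truth data for a single person (cf. Table 6.2).
struct CaviarFrame {
    int    id;        // labelled id of the person
    double x, y;      // labelled ground-plane centre of gravity
    double width;     // labelled bounding box width
    double height;    // labelled bounding box height
};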

Finally, it must be determined how to display the CAVIAR cases to the expert. Each case is presented by displaying the current video frame, along with the bounding box of the case, the ground truth label, and the last 50 positions as a motion trajectory line. An example of a displayed CAVIAR case is shown in Figure 6.2.

Figure 6.2: Displaying the CAVIAR Case. The bounding box for the current frame, the trajectory for the last 50 frames and the CAVIAR ground truth label are shown to the expert

6.2 AVIE: Activity Recognition in the Immersive Environment

As discussed in Chapter 3, there are two primary algorithms in the underlying vision system within AVIE: Immersitrack, which provides point based trajectory values for persons within the environment, and SparseSPOT, which provides a real-time voxel reconstruction for each person. The representations output by these two algorithms provide the attributes for the persons within AVIE.

6.2.1 The AVIE Case

The output from the low-level system within AVIE is a fixed structure for each person within the environment, encapsulated in Table 6.3.

AVIE Case

Attribute Name      Description
Id                  Unique identity for each person
Head Position       3-D head position output by Immersitrack
Centre of Gravity   3-D centre of gravity output by Immersitrack
Voxels              Voxel model output by SparseSPOT

Table 6.3: A formal structure for the output from the low-level vision system implemented in AVIE

There are eight activity classes defined for the AVIE dataset: standing, crouching, jumping, walking, running, both arms ahead, both arms to side, and both arms pointing up. These classes were chosen as they can be easily distinguished by a human using just the voxel model. Thus any interactive application that uses up to 8 modes of interaction can utilise the results from this thesis. Examples are shown in Figure 6.3. A single person is recorded performing each activity at least 10 times.

Figure 6.3: AVIE Activity Classes. Left to right, top to bottom: Both Arms Ahead, Crouching, Standing, Both Arms Up, Jumping, Walking, Both Arms To Side, Running

For each AVIE case, 30 frames are presented to the expert for classification. It was found that 30 frames of data provide sufficient information to differentiate the classes visually. Each frame of data is hand-marked with the activity class. The display widget constructed for the AVIE case is exactly as seen in Figure 6.3. The expert is presented with a 3-D view of the voxel representation of the human, for which the viewpoint can be manipulated in 3-D.

6.3 Training and Evaluation Protocol

Evaluating the performance of an RDR knowledge base is a difficult task. In traditional machine learning (Mitchell, 1997), cross-validation techniques are usually employed to determine the performance of a predictive model. Cross-validation is performed by randomly splitting the input data into two sets, described as the training and test sets, multiple times. The training set is used to build the classifier and the test set, which represents data that has not been "seen" by the classifier, is used to estimate its error rate. This is repeated for each random split. A split of 90% training data to 10% test data is commonly used. However, using the RDR philosophy, the test set can simply be considered as unseen knowledge and may be fed back to further train the classifier.

To evaluate the performance of RDR on the activity datasets, the approach used by Kang et al. (1996) is employed. In their work, the knowledge base is built over 21 822 cases. A distinction is made between seen cases and unseen cases, with respect to the rules built within the knowledge base. Seen cases represent all the cases that the expert has already observed and were classified correctly or were used for revising the knowledge base. An error rate is measured by applying the knowledge base to a set of cases that have not been seen by the expert and therefore not used for training. The performance of the knowledge base is judged by the change in the error rate over time.

Both the CAVIAR and AVIE datasets represent finite sets. The concept of seen and unseen cases only makes sense if a single training run is performed over the entire dataset. Beyond this run, the knowledge base can continue being updated, but there exist only seen cases. Thus, in this work, a single training run is performed over the entire dataset. During this run, the entire dataset is iterated over and, for each person within each frame, if the person is classified incorrectly by the knowledge base the case is buffered. Every time a person's activity changes, all incorrectly classified cases corresponding to that person are presented to the expert for training. The case buffer for the person is then cleared.
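A minimal sketch of this buffering strategy is given below; the `classify` and `revise` interfaces for the knowledge base and the expert are hypothetical placeholders used only to illustrate the control flow.

```python
from collections import defaultdict


def training_run(frames, knowledge_base, expert):
    """Single pass over the labelled dataset, presenting buffered errors
    to the expert whenever a person's ground-truth activity changes."""
    buffered_errors = defaultdict(list)   # person id -> misclassified cases
    last_activity = {}                    # person id -> previous ground-truth label

    for frame in frames:
        for case in frame.cases:
            pid, truth = case.person_id, case.ground_truth

            # Present buffered errors at an activity transition, then clear them.
            if pid in last_activity and truth != last_activity[pid]:
                if buffered_errors[pid]:
                    expert.revise(knowledge_base, buffered_errors[pid])
                    buffered_errors[pid].clear()
            last_activity[pid] = truth

            # Buffer the case if the current knowledge base misclassifies it.
            if knowledge_base.classify(case) != truth:
                buffered_errors[pid].append(case)
```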

This training method lies between two extremes. On the one hand the expert could be presented with each misclassified case in each frame individually to revise the knowledge

base. This results in an incredibly slow learning process as the rules encoded become too specific. It also yields an unnecessarily large knowledge base.

At the other extreme, for each person that enters the environment, all cases for the person can be buffered until the person exits. The expert may be asked to revise the knowledge base only when the person exits the environment. This method results in too many cases being presented at once to the expert, as a person can be present in several thousand frames within the videos and can change activities dozens of times. Because of the limited amount of user interface space for the training application, it is difficult for the expert to navigate through such a large number of cases at once.

The use of the activity transition as the training point therefore strikes a balance between these two extremes. Because the subject is exhibiting the same activity, it is reasonable to assume that the features and values that can be used to distinguish that activity will be similar. This makes it easier for the expert to construct a rule for the activity that is general enough to be applicable to the same activity later on. Creating a rule for a single activity at a time is also much less confusing for the expert.

The properties mentioned above represent a simplification for labelled sets. In general operation of RDR, the training set is unbounded: there is no point at which revisions within the knowledge base cease. In such systems it is in fact impossible to determine the point at which an activity transition occurs. However, a benefit of such unbounded training, especially within immersive environments, is that if an activity was seen to occur but no cases were recorded for training, the person who performed the activity could be asked to repeat it while cases are being recorded. The results presented in this chapter offer a conceptual demonstration, based on performance on the labelled datasets, that RDR can be expected to work effectively in real-world situations.

6.4 Resulting Features Designed for both Datasets

Recall from Chapter 5, section 5.2.2, that a distinction is made in this thesis between attributes and features. Attributes are determined during the knowledge engineering phase, defined by information provided by the underlying vision system or dataset, and are fixed for the lifetime of the knowledge base. Features, on the other hand, are designed by a domain expert, constructed by a vision programmer using the attributes, and can be introduced while the system is in operation. Features are used to build the rules within the knowledge base.

In order to identify the benefits of using RDR for the activity recognition task, it is necessary to describe the features that were designed by the domain expert for each application during training, as this aids in determining the performance benefits of the knowledge base. In the following sections, the features designed for each activity recognition dataset

that were used in revising the RDR knowledge bases are presented. Each feature description is accompanied by a justification for why the feature was constructed and the classes that it helped distinguish. These lists further show that the use of expert cognition enforces human readability in feature design, which provides the ability to identify discrepancies in the training data. This is quite beneficial for the highly ambiguous activity recognition problem.

6.4.1 CAVIAR Features

A total of 11 features were programmed for the CAVIAR dataset. The features produced are:

CF1 Current x position of case Used to distinguish the “active” class when people are standing in active hot-spots.

CF2 Current y position of case Similar to above feature.

CF3 Distance travelled from first frame to last frame 2-D Euclidean distance from first frame to last frame. Used to distinguish “walking”, “inactive” and “running”.

CF4 Total distance travelled over single frame increments Sum of 2-D Euclidean distance over single frame increments. Similar to previous feature.

CF5 Variance in x position Variance in ground plane x position. Used to discriminate “active” class.

CF6 Variance in y position Variance in ground plane y position. Used to discriminate “active” class.

CF7 Variance in √((xₜ − xₜ₋₁)² + (yₜ − yₜ₋₁)²) Variance of the change in ground plane position over single frame increments. Used to discriminate "active" and "walking" classes.

CF8 Variance in w Variance in bounding box width. Used to discriminate “active” and “walking” classes.

CF9 Variance in h Variance in bounding box height. Used to discriminate “active” and “walking” classes.

CF10 Variance in bounding box area Single value giving the variance in width × height of the bounding box. Used to discriminate "active" and "inactive" classes.

CF11 Variance in |wₜhₜ − wₜ₋₁hₜ₋₁| Variance of the change in bounding box area over single frame increments. Used to discriminate "active" and "inactive" classes.

The features programmed for the CAVIAR dataset are quite simple in nature. They are based solely on position and bounding box properties and are highly efficient to compute. Despite this simplicity, however, as will be shown in section 6.5, they still achieve excellent results when combined with the RDR framework. In addition, because the features are human readable, it is easy to identify discrepancies within the dataset, as will be shown in section 6.6.1.
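To illustrate how lightweight these features are, the sketch below computes CF3, CF4 and CF7 from the stored x and y positions of a case; the function names are illustrative, not those of the actual implementation.

```python
import math
from statistics import pvariance


def cf3_net_distance(xs, ys):
    """CF3: 2-D Euclidean distance from the first to the last stored frame."""
    return math.hypot(xs[-1] - xs[0], ys[-1] - ys[0])


def cf4_path_length(xs, ys):
    """CF4: total distance travelled, summed over single-frame increments."""
    return sum(math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
               for i in range(1, len(xs)))


def cf7_step_variance(xs, ys):
    """CF7: variance of the per-frame displacement magnitude."""
    steps = [math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
             for i in range(1, len(xs))]
    return pvariance(steps)
```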

6.4.2 AVIE Features

A total of 12 features were programmed for the AVIE dataset. The features produced are:

AF1 Distance Travelled Derived from the trajectory information. Useful for discriminating "Running", "Standing" and "Walking".

AF2 Cross-section area of upper body voxels divided by cross-section area of lower body voxels The voxel model is split into two halves consisting of an upper portion and a lower portion, based on the height of voxels. A rectangular prism is fitted to both halves, and the area of both prisms in the ground plane is computed. This feature outputs the ratio between the two areas. It is useful for distinguishing "Standing" against "Arms to Side" and "Arms Ahead".

AF3 Area of ellipse fitted to upper body voxels divided by area of ellipse fitted to lower body voxels As in feature AF2 the voxel model is split into two portions, based on the height of the voxels. Both portions are flattened by projecting the voxels on to the ground plane, resulting in two 2-D masks. An ellipse is fitted to both 2-D masks. This feature outputs the ratio between the area of the upper ellipse and the area of the lower ellipse. It is useful for distinguishing “Standing” against “Arms to Side” and “Arms Ahead”.

AF4 Change in Height This feature computes hₜ − hₜ₋ₙ, where hₜ is the height value of the trajectory of the person at time t. This feature determines how much a person's height has changed over previous frames. It is used for "Jumping" and "Crouching" classifications.

AF5 Variance in Height Value This feature computes the variance in the trajectory height value of a person over n frames. It is used for "Jumping" and "Crouching" classifications.

AF6 Was Crouching This feature outputs a boolean value depending on whether a person has a crouching classification in the previous frame. It is used along with the previous two features to express that if a person is already crouching and their height has not changed much, then they must still be crouching.

AF7 Primary axis of ellipse fitted to upper body voxels divided by secondary axis of ellipse fitted to upper body voxels This feature reuses the ellipses fitted to the voxel data from feature AF3. The feature outputs the ratio between the upper ellipse's primary axis and its secondary axis. It is used to discriminate "Arms to Side" against "Arms Ahead".

AF8 Primary axis of ellipse fitted to upper body voxels divided by primary axis of ellipse fitted to lower body voxels This feature reuses the ellipses fitted to voxels data from feature AF3. It outputs the ratio between the primary axis of the upper ellipse and the primary axis of the lower ellipse. This is used to discriminate “Standing” against “Arms Ahead” and “Arms to Side”.

AF9 Number of voxels in upper body divided by number of voxels in lower body This feature splits the body in half as in features AF2 and AF3. It outputs the number of voxels in the upper body divided by the number of voxels in the lower body. It is used to discriminate “Standing” versus “Both Arms Up” as the ratio for the former should be much larger than for the latter.

AF10 Percentage coverage of voxels in bounding box This feature first computes an axis-aligned bounding box for the entire voxel model. It outputs the ratio between the number of voxels and the bounding box volume. It is used to discriminate “Both Arms to Side” and “Standing”. The justification is that the former will have a much smaller ratio than the latter.

AF11 Bounding Box Minimum Height Value The minimum height value of the voxel-fitted bounding box from feature AF10. This was used to determine "Jumping" behaviour.

AF12 Variance in maximum Height Variance of the height value of the top-most voxels within the model. This value should be large for the “Both Arms Up” classification as opposed to most other classifications.

The true power of RDR is demonstrated by the features produced for the AVIE dataset. Previous works that attempted voxel classification, as presented in Chapter 2, used highly complex features that are inefficient to compute and are more suitable for off-line classification. As the feature list above shows, the features programmed for the AVIE dataset are simple, owing to the fact that they were determined through visual perception rather than complex analytics, and as a result they are also computable in real time. This is a result that will be quantitatively demonstrated in section 6.5.5. Despite their simplicity, these features also achieve excellent accuracy on the AVIE dataset when combined with the RDR framework. Furthermore, the features are sufficiently general and are able to cover several variabilities in activity classes, as will be shown in section 6.6.2.
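As an example of this simplicity, the following sketch computes AF9, the ratio of upper-body to lower-body voxel counts, assuming the voxel model is available as a list of (x, y, z) voxel centres; this representation and the splitting rule shown are illustrative assumptions.

```python
def af9_upper_lower_voxel_ratio(voxels):
    """AF9: number of upper-body voxels divided by number of lower-body voxels.

    `voxels` is assumed to be a list of (x, y, z) voxel centres, with z the
    height above the ground plane.
    """
    zs = [z for _, _, z in voxels]
    split = (min(zs) + max(zs)) / 2.0            # split the model at half its height
    upper = sum(1 for _, _, z in voxels if z >= split)
    lower = len(voxels) - upper
    return upper / lower if lower else float("inf")
```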

6.5 Experimental Results and Discussion

Properties of the final RDR knowledge bases built for each scenario are presented in Table 6.4.

Dataset   #Rules   #Cornerstones   #Cases    #Videos   #Features
CAVIAR    91       2 523           29 064    30        11
AVIE      98       984             5 892     6         12

Table 6.4: Properties of the final activity classification knowledge bases. Last column shows the number of features that were created by the expert for each knowledge base, as described in section 6.4

To evaluate the performance of the RDR knowledge bases, all the rules within each knowledge base are sorted by their date of creation. These dates represent the instant of time at which a single revision was made to the knowledge base. Initially, all rules are turned off - that is, none of the rules fire for any case. Each rule is then turned on, in order of creation date. Every time a rule is turned on, the result is a new knowledge base, over which various measures are then calculated. The measures computed are:

Overall Performance The overall performance is measured as the number of cases for which the knowledge base outputs the correct activity class, compared to the ground truth. This is the sum of error over both seen and unseen cases.

Consistency vs. Generalisation Ability In order for the RDR knowledge base to be an effective classifier, the expert should be able to add sufficiently general rules to the knowledge base that can cover future cases. Care must be taken, however, to reduce the misclassification of past cases that were previously correctly classified. In order to measure this property, the error rates over seen cases and unseen cases are presented separately.

Complexity of Training Despite the large number of cases used to build individual rules in the framework presented in this thesis, the complexity of creating rules should never grow too large for the expert. To demonstrate this, the average number of cornerstones that are checked during knowledge verification is compared to the average number of features added.

Number of Corrections over Time Each refinement within the RDR knowledge base represents a correction to the knowledge maintained. Ideally, the number of cases that are observed between corrections should increase over time as more unseen cases are classified correctly.

Time Taken per Case Finally, in order to adhere to the real-time classification requirement, the time taken to process each case is calculated for the final knowledge base. The average time taken over all cases is then determined.

These measures give a comprehensive description of the knowledge evolution within an RDR knowledge base and the benefits that RDR provides to the activity recognition task. Following Kang et al. (1995), each measure is plotted against the number of seen cases observed up to the time of rule creation, as this gives a fair idea of the way that the properties of the knowledge base evolve with the observation of additional cases.
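The evaluation procedure can be summarised by the sketch below, which re-enables rules in order of their creation date and records the error rate after each activation; the rule and knowledge-base interfaces shown are assumptions made for illustration.

```python
def replay_knowledge_base(knowledge_base, cases):
    """Re-enable rules in creation order and record the error rate after each step."""
    error_curve = []
    rules = sorted(knowledge_base.rules, key=lambda rule: rule.creation_date)

    for rule in rules:
        rule.enabled = False          # start with every rule turned off

    for rule in rules:
        rule.enabled = True           # turn rules on one by one, oldest first
        errors = sum(1 for case in cases
                     if knowledge_base.classify(case) != case.ground_truth)
        error_curve.append(errors / len(cases))

    return error_curve
```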

6.5.1 Overall Performance

The overall error rate and number of cornerstones for the CAVIAR dataset are shown in Figure 6.4. Initially, a small increase in the number of cornerstones results in a large decrease in the error rate of the classifier. However, as time progresses, the cornerstones have a smaller impact on the performance, as rules become more specific. This is a common phenomenon in RDR knowledge bases. Most of the important knowledge is obtained in the initial training sessions of the knowledge base, when the expert creates quite broad rules.

Figure 6.4: Total Error Rate, as a percentage of the total number of cases, in CAVIAR Dataset. Error rate is plotted against left axis. Number of cornerstones is plotted against right axis

The overall error rate and number of cornerstones for the AVIE dataset are shown in Figure 6.5. The main point of note about the AVIE dataset is the large decreases in error rate even in later stages of the knowledge acquisition. The reason for this is that the classes within the AVIE dataset are not randomised, unlike in CAVIAR. The videos within the dataset

contain single activities interspersed between a series of walking and standing actions. Each video may contain an activity that is not likely to have been observed in any other video. The large decreases in the error rate, in later stages of training, therefore represent the introduction of a new activity class.

Figure 6.5: Total Error Rate, as a percentage of the total number of cases, in AVIE Dataset. Error rate is plotted against left axis. Number of cornerstones is plotted against right axis. The sudden drop on the left of the graph is produced by the introduction of the highly prevalent standing class and the sudden rise is due to the introduction of the walking class (which was defined by velocity > threshold)

In contrast, the CAVIAR dataset is sufficiently randomised that all the activities are observed in the first few videos, and later videos simply help to refine the hypothesis. Thus, the error rate levels out quite fast at first and decreases slowly thereafter. Although it is common methodology in machine learning to randomise the training and test instances prior to training, this is not required in the RDR framework, as the knowledge base converges towards an accurate solution regardless of the order of the training data.

6.5.2 Consistency vs. Generalisation Ability

As described in section 6.3, the concept of cross-validation is often used to measure the ability of a particular predictor to generalise to unseen cases. In the RDR framework, however, another issue can arise. It is possible that previous cases that have been correctly classified by a knowledge base will fail to be classified correctly in later revisions. This can occur if the cases were not used as cornerstones for past revisions, and thus were not stored

by the knowledge base. This leads to their not being used to perform knowledge verification at the time of knowledge base revision.

In order to measure these two properties of the knowledge base - generalisation ability and prior knowledge consistency - a method similar to cross-validation is used. Recall from above that each rule is “turned on” one by one in the order of creation. The entire dataset is then split into two subsets: the set of cases before the last cornerstone in the last rule that was turned on, and the set of cases after. These two subsets constitute the set of seen and unseen cases for the corresponding rule within the knowledge base.

The percentage error rate over the seen and unseen cases is calculated every time that a rule is turned on. The error rate on the unseen cases is then similar to the cross-validation error and measures the ability of the knowledge base to generalise to unseen examples. The error on the seen cases measures the ability of the knowledge base to maintain knowledge consistency and keep prior error low. The seen-unseen error rate for the CAVIAR dataset is given in Figure 6.6.

Figure 6.6: Seen vs. Unseen Error on CAVIAR Dataset. Top axis represents seen cases as a percentage of the total number of cases

The unseen error rate is initially slightly higher than the seen error rate but eventually decreases and converges towards the same value. This is the ideal situation and shows that the expert is able to create rules that are broad enough to cover a wide range of cases accurately. An interesting development occurs, however, in the seen error rate graph, which increases slightly until halfway through the dataset and then starts to decrease again towards the end. This trend was touched on briefly by Kang et al. (1996), who identified the problem as "later changes may cause errors in the classification of cases previously classified correctly", but they found that this error was "small compared to that on unseen cases and so does not effect the overall validity of the system".

The example that they used, however, involved relatively small changes to the knowledge base during each revision. Each rule in their knowledge base is built on a single case, and the expert's task is to distinguish two cases at a time. In contrast, the activity recognition problem requires the expert to distinguish between multiple cases at a time, and to build rules on sets of cases rather than single cases. As a result, the error rate resulting from correctly classified seen cases being misclassified in later revisions is considerably exacerbated. This is because the rules being created have increased generality, thus covering a greater number of cases.

The reason for the plateau effect and the final decrease in the error rate, however, is that both the knowledge contained within the knowledge base and the expert's own knowledge are continually evolving. As a result, the expert is initially likely to make some mistakes because not enough cases have been made visible to them yet. As more and more cases are seen, the expert refines their own knowledge and is able to make a better judgement on the rules required to perform the classification.

Figure 6.7: Seen vs. Unseen Error on AVIE Dataset. Top axis represents seen cases as a percentage of the total number of cases

The seen versus unseen performance on the AVIE dataset is presented in Figure 6.7. As discussed above, the classifications within the AVIE dataset are ordered and not randomised as in the CAVIAR dataset. As a result, the unseen error rate has large jumps when a classification rule for a particular consequent is first entered into the knowledge base. Another outcome of the ordered nature of the dataset is that the seen error never

rises considerably. This is because once a video of a single classification, other than standing and walking, has been classified completely, that particular classification is not going to be seen again. Therefore, the error rate of the classification is unlikely to rise considerably beyond that point.

6.5.3 Complexity of Training

Another issue in RDR knowledge bases is that creating new rules should not become too complex for the domain expert despite the age of the knowledge base. That is, no matter how many rules and cornerstones there are in the system, the complexity of adding additional rules and cornerstones should stay the same. Compton et al. (2006) measured complexity as the average time taken to add a rule to the knowledge base. However, this measure is only useful for knowledge bases that have been trained over a longer period than the scenarios presented in this thesis.

Rather, complexity in this thesis is measured as the number of features that the expert selects or constructs, in order to eliminate cornerstones. This measure should ideally be small throughout the lifetime of the knowledge base. To evaluate the complexity of RDR knowledge acquisition for the two scenarios, two metrics are calculated. The first metric is the average number of features per rule. Ideally, this value should remain as low as possible and should be roughly constant, indicating that the complexity of creating rules does not change considerably.

The second metric is the average number of cornerstones that are checked, for each rule, during knowledge verification. For each rule within the knowledge base, the number of cornerstones that would be checked if an exception were to be created under the rule is computed. The average over all the rules determines the value of the second metric. This metric helps identify how complex revisions to the knowledge base become over time. It provides a useful comparison against the number-of-features metric presented above. The results for both metrics are presented in Table 6.5.

Dataset   Average Number of Features per Rule   Average Number of Cornerstones Checked per Rule
CAVIAR    2.69767                               492.39535
AVIE      1.29592                               40.48979

Table 6.5: Final complexity measures for the knowledge bases
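A sketch of how the two metrics in Table 6.5 could be computed from a final knowledge base is given below; the per-rule attributes used here are illustrative assumptions.

```python
def complexity_metrics(rules):
    """Average number of features per rule, and average number of cornerstones
    that would be checked when an exception is created under each rule."""
    avg_features = sum(len(rule.features) for rule in rules) / len(rules)
    avg_cornerstones = sum(len(rule.cornerstones_checked) for rule in rules) / len(rules)
    return avg_features, avg_cornerstones
```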

As the CAVIAR dataset is much larger than AVIE, the average number of cornerstones that are checked within the former is much larger than in the latter. Despite the large number of cornerstones to be checked, the number of features that are used by the expert

to create the rules is quite low. This measure, along with the generalisation ability demonstrated above, shows the expert's ability to create simple yet accurate rules to perform the classification. It shows that cornerstones can be quickly eliminated with a very small set of features. The evolution of the complexity values over different revisions of the CAVIAR knowledge base is shown in Figure 6.8.

Figure 6.8: Training Complexity in CAVIAR Dataset. Average number of features plotted against left axis. Number of cornerstones plotted against right axis

The evolution of the complexity over different revisions of the AVIE knowledge base is shown in Figure 6.9. Both graphs show that, despite the increasing number of cornerstones that need to be checked during verification, the number of features per rule stays roughly constant. The CAVIAR dataset, however, seems to behave slightly more erratically than the AVIE dataset. This is because the CAVIAR dataset has a greater number of labelling ambiguities than AVIE. These ambiguities are discussed further in Section 6.6.1. The result of these ambiguities is that the domain expert’s knowledge varies considerably resulting in significant changes in the feature selection.

Figure 6.9: Training Complexity in AVIE Dataset. Average number of features plotted against left axis. Number of cornerstones plotted against right axis

6.5.4 Number of Corrections over Time

Another measure that is often used to determine the quality of an RDR knowledge base is its size, determined by the number of rules added (Kang et al., 1995). This value reflects the number of corrections made to the knowledge base in order to deal with new cases. Ideally, the number of corrections should decrease as more and more cases are observed. Thus, a logarithmic trend in the values represents the ideal situation. The graph of corrections over time for the CAVIAR dataset is presented in Figure 6.10.

Figure 6.10: Number of Corrections made over time for CAVIAR

As the figure above shows, the CAVIAR dataset behaves close to the ideal scenario. As time progresses and more and more cases are seen, fewer corrections are required by the

expert for correct classification. The graph of corrections over time for the AVIE dataset is presented in Figure 6.11. The graph for the AVIE dataset demonstrates an overall linear increase in corrections. The reason that this occurs is that the classifications within the AVIE dataset are not randomised, as presented earlier. Thus, every time a new class is introduced for training, the number of corrections jumps in order to account for the new class.

Figure 6.11: Number of Corrections made over time for AVIE

However, it can be demonstrated that between the introductions of new classes, the AVIE dataset behaves in the ideal sense. This is shown in Figure 6.12, which shows the same corrections graph labelled with the introduction of new classes. As the figure shows, between class introductions, the graph behaves logarithmically but due to the large jumps at the introduction points, the overall graph appears linear.

6.5.5 Time Taken

Finally, in order to ensure that the activity classification occurs in real time, the average time taken to process each case by the final knowledge base is computed. The times for both knowledge bases are shown in Table 6.6.

Dataset   Average Time Taken (ms)
CAVIAR    0.04
AVIE      1.06

Table 6.6: Average time taken for each knowledge base, computed as an average over all cases

The table shows that the time taken to process each case using the RDR framework is almost negligible, despite the large number of rules present in the knowledge base. The reason for this is that, as each case is run through the knowledge base, the feature values are cached as the case traverses the knowledge base tree. This means that each feature need only be evaluated once, regardless of how many rules there are in the knowledge base. Therefore, the worst-case run-time of the knowledge base increases with the number of features and the time taken to compute each feature.

Figure 6.12: Number of Corrections made over time for AVIE, labelled with the introduction of new classes

As will be presented in the discussion in the next section, the features that were designed for both datasets ended up being quite efficient to compute. In addition, because each feature is computed over a range of frames, effectively increasing its dimensionality, a very small set of features is sufficient for accurate classification. Classification on the AVIE dataset is somewhat slower than on the CAVIAR dataset as it involves voxel-based features.
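The per-case feature caching described above can be sketched as a simple memoisation layer, shown below with an illustrative interface: each feature is evaluated at most once per case, however many rule conditions refer to it.

```python
class CachedFeatureEvaluator:
    """Evaluates features lazily and caches the results for a single case."""

    def __init__(self, feature_functions):
        self.feature_functions = feature_functions   # name -> callable(case)

    def evaluate(self, case):
        cache = {}

        def value(name):
            # Compute the feature only on first use; later rules reuse the cache.
            if name not in cache:
                cache[name] = self.feature_functions[name](case)
            return cache[name]

        return value

# Usage: each rule condition calls value("CF3"), value("CF7"), etc. while the
# case traverses the rule tree, so run time grows with the number of distinct
# features used rather than with the number of rules.
```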

6.6 Discussion on the Datasets

It is clear from the results presented above that the RDR framework works admirably as a classifier. It is efficient, able to generalise over unseen cases and yet maintains consistency over past cases. However, the benefits of RDR go far beyond these properties. The main benefit of the RDR framework is that the features that are constructed for the knowledge base are human-readable. This is clearly demonstrated in section 6.4, which describes the features designed for both AVIE and CAVIAR.

The reason for the human readability is that RDR relies on human expertise to determine how to distinguish cases. When a human tries to justify the reasoning for why a feature helps distinguish a particular class from other classes, that human will necessarily

do so using constructs in natural language. For example, "a greater velocity implies running or walking as opposed to standing". The benefit of human readability of features is that the justification behind the choice of features is clearer to the expert, providing greater support for the expert's cognitive reasoning process.

As will be shown in this section, activity recognition scenarios are prone to ambiguities in labelling. Some of these ambiguities are so severe that they violate all prior knowledge. As a result, the expert is forced to re-evaluate the labels themselves. However, because the features are human readable and the knowledge base is updated incrementally, the expert is able to easily identify error-prone labelling and discard such cases or re-label them. Such cases form a small percentage of the total size of the dataset, and re-evaluating them provides a much improved labelling, resulting in considerably refined knowledge within both the knowledge base and the expert's cognition. A discussion on CAVIAR is presented in sub-section 6.6.1 and a discussion on AVIE is presented in sub-section 6.6.2.

6.6.1 Discussion on CAVIAR

The confusion matrix for the final CAVIAR knowledge base is first presented in Table 6.7. This table was computed using the final knowledge base from the previous section as a fixed classifier; it helps identify where the class discrepancies lie in the error rates of the knowledge base and provides a quantitative basis for the discussion presented here.

Classified as
         A      I      W      R     MC    MI      N
A     1733    476    835     0    280    12    606
I      127   6249    227     0    159    11    288
W      811    161  15272     0    450    11    482
R        0      0    241   231      3    11      2

Table 6.7: Confusion matrix for the CAVIAR movement classification task: A = active, I = inac- tive, W = walking, R = running, MC = multiple classifications with correct class, MI = multiple classifications without correct class and N = no classifications

The first major source of inconsistency within the CAVIAR dataset is the labelling of the active class. This is clearly highlighted by the confusion matrix, as the active class is quite often misclassified as walking or inactive and, conversely, instances of walking or inactive cases are classified as active.

Recall from Section 6.1 that the active class is defined as “a person with local motion but not translating across the image”. It is reasonable, from this definition and the definition of the other classes, to assume that the active class can be entirely defined by the motion of the person’s centre of gravity, and the change in bounding box width and height.

As it turns out, the ground truth labellers also labelled certain cases as active when the person was standing in particular hotspots. For example, some persons were labelled as active if they were standing in front of an ATM or a kiosk, despite the fact that they were not exhibiting any local motion. An example is shown in Figure 6.13.

Figure 6.13: Ambiguity in the CAVIAR activity class: Person in a hotspot is labelled as active

The problem with this labelling is that it is a major inconsistency with the a priori knowledge about the classes. Since such knowledge also governs the creation of the feature set, a machine learning or batch algorithm would also have failed to uncover this new definition of the active class. For example, it is reasonable to assume, from the description of the CAVIAR movement classes, that a programmer would not have constructed a feature outputting the position of a person relative to the hotspots. In that case, the presence of hotspots would have been completely ignored by a batch learner. Due to the incremental nature of RDR knowledge acquisition, the vision programmer has the opportunity to implement such features.
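For instance, a hotspot-aware feature such as the hypothetical one sketched below could be introduced as soon as the expert notices this labelling convention; the hotspot coordinates and the rule mentioned in the docstring are purely illustrative.

```python
import math

# Hypothetical hotspot positions on the ground plane (e.g. an ATM, a kiosk).
HOTSPOTS = [(120.0, 45.0), (310.0, 80.0)]


def distance_to_nearest_hotspot(x, y):
    """New feature: ground-plane distance from the case to its nearest hotspot.

    A rule such as "little local motion AND distance_to_nearest_hotspot below a
    threshold -> active" could then capture the labellers' hotspot convention.
    """
    return min(math.hypot(x - hx, y - hy) for hx, hy in HOTSPOTS)
```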

Another example of error-prone labelling in the dataset is in discriminating walking and running. This is highlighted in Figure 6.14. The right image in this figure shows a person who has been labelled as running. The left image shows a person who has been labelled as walking. The inconsistency lies in the fact that the trajectory of the walker, over 50 frames, is longer than that of the runner, because the runner is near the corner of the image, where lens distortion effects in the camera result in a smaller apparent trajectory length.

The walking-running discrimination case presented above highlights an important part of ground truth labelling: the use of cognition in analysing video data. In cases such as these, the labellers usually discard the spatio-temporal definitions of the classes, if they were given these definitions at all, and come up with their own judgements on the class definitions based on visual reasoning over image data.

Because the labellers are not aware of the low-level reasoning in designating the activity classes, the judgements that they employ in labelling do not easily carry over to spatio-temporal features. Therefore, even though the labelling may be consistent with the video data, the lack of corresponding relevant features makes it difficult for the expert to build rules consistent with the labelling. In this sense, building the RDR knowledge base has helped uncover limitations of the underlying assumptions.

Figure 6.14: Ambiguity between CAVIAR walking and running classes: The running case in the right image has a smaller trajectory length than the walking case in the left image. Both trajectories are 50 frames in length

Some of these inconsistencies are resolvable. The walking-running scenario can be resolved using velocity rules that also depend on the position of the case within the image; a sketch of such a rule is given after this paragraph. However, some cases are so grossly inconsistent that no such resolution is possible. An example is shown in Figure 6.15. In this image, two persons come together and start fighting with each other. Both of them exhibit similar movements for several frames, but while one of them is labelled as active, the other one is labelled as walking.
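The position-dependent velocity rule mentioned above could take a form similar to the following sketch; the border margin and threshold values are purely illustrative.

```python
def looks_like_running(path_length, x, y, image_width, image_height):
    """Illustrative rule condition: lower the running threshold near the image
    borders, where lens distortion shortens apparent trajectories."""
    near_border = (x < 0.15 * image_width or x > 0.85 * image_width or
                   y < 0.15 * image_height or y > 0.85 * image_height)
    threshold = 40.0 if near_border else 70.0   # path length over 50 frames, in pixels
    return path_length > threshold
```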

Figure 6.15: Ambiguity in the CAVIAR active class: Two persons exhibiting active motions are labelled differently

These types of cases present limitations that are so severe that there is simply no way to build a rule that covers the misclassified case and is consistent with prior knowledge. The cases presented in Figure 6.15 output feature values that are statistically so similar that no decision boundary can be constructed to differentiate them.

These inconsistencies within the dataset have been recognized in the past. List et al.

(2005) used the CAVIAR data to identify the discrepancies that arise when multiple humans observe and classify the same video of a movement being performed. They found that there was a considerably high level of ambiguity in labelling movements amongst different human observers. For example, they found that there was at least 60% error in distinguishing between Active and Walking because different observers constructed their own assumptions about how the two activities were defined.

The evaluation over the CAVIAR dataset uncovers the first major benefit of the RDR framework for activity recognition. Because of the incremental nature of RDR learning, and the simplicity with which new rules are constructed just by differentiating cases, it is much easier to identify labelling ambiguities prevalent in the training data. It also allows the vision programmer to introduce new features to deal with such ambiguities and, when this is not possible, to relabel or reject a particular training case as too ambiguous.

6.6.2 Discussion on AVIE

The CAVIAR dataset provides an evaluation of the RDR classifier against ground truth. It shows that the RDR framework is effective against a benchmark. In the discussion on AVIE, the benefit of the RDR framework over prior classification strategies is presented. As in the discussion above, the confusion matrix for the AVIE dataset is presented in Table 6.8.

Classified as
        S     W     R   BAA  BAU  BAS    J    C   MC   MI   NC
S     800    35    30    1    2    0    0    0   22    3   85
W      43  2913    23    0    0    0    0    0   39    1   41
R       2    35   228    0    0    0    0    0    1    0    5
BAA     3     4     0  372    0    0    0    0   32    0   10
BAU    17     4     0   14  337    0    0    0    5    9   12
BAS    15     9     0    4    0  253    0    0  103    2    9
J      10    16     0    5    2    0   23    0   10   11   30
C      37    32     0    0    0    0    0  143   32    4   14

Table 6.8: Confusion matrix for the AVIE movement classification task: S = standing, W = walking, R = running, BAA = both arms ahead, BAU = both arms up, BAS = both arms to side, J = jumping, C = crouching, MC = multiple classifications with correct class, MI = multiple classifications without correct class and NC = no classifications

The AVIE dataset is a much smaller and cleaner dataset than CAVIAR. There are fewer ambiguities because the classes are well defined and the voxel models across different classes are sufficiently different to allow disambiguation. The main difficulty for the AVIE dataset is that the features should be efficiently computable, a task which is made

difficult by the dynamic dimensionality of the voxel model.

Prior work in voxel model classification, as discussed in Chapter 2, uses methods such as PCA or SVMs to project the voxel model to a fixed-dimensional sub-space and perform classification there. However, these methods can be quite computationally taxing even for small models. In contrast, as the experimental results above have shown, a set of much simpler features combined with the RDR framework achieves accurate results while having a considerably lower computational load. In addition, the expert is forced to come up with a justification as to why a feature works in distinguishing classes. This is particularly beneficial for programmers who may look at the knowledge base later on to see how features were used.

Recall from Chapter 2 that there are three causes of variability in activity recognition: anthropometry of the human body, execution rate of the activity, and change in viewpoint. In order for the AVIE RDR to be effective, it must overcome each of these variabilities. For this to occur, each of the constructed features must be invariant to these properties. Invariance to viewpoint changes is easily observable, as the features designed for AVIE all operate on a 3-D reconstruction of the human rather than 2-D images, and no viewpoint-specific properties are used in any of the features.

Invariance to execution rate and anthropometry is slightly more difficult to observe. It is possible to automate these invariances during RDR learning by applying transformations to the features in the spatial and temporal dimensions in order to cover a greater number of cases. Dynamic time warping (Rabiner and Juang, 1993) is one such example that could provide invariance to execution rate. Invariance to anthropometry can be provided by transforming the voxel model into an invariant space and calculating feature values there.

Again, these additional transformations can be quite time consuming and ultimately unnecessary. First, combining the expert's cognition with the RDR framework has already been shown to generalise well over future cases. In the AVIE dataset, although a single person performs the activities, the execution rate does change across different instances of a particular activity. Despite the changing rate, the classifier is able to generalise well over the cases, because the expert identifies features that already provide these invariances.

In addition, the RDR framework allows for the continual addition of rules to the system, which means that over time each of these invariances can be dealt with in a convergent manner. This is particularly useful for dealing with changes in anthropometry. For example, if a person of small stature, such as a child, entered the area, their walking and running velocities may be considerably smaller than in previous cases. When this occurs, a new rule could be added which depends not just on velocity but also on the dimensions of the voxel model.
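Such a rule could, for example, scale its velocity threshold by the person's height taken from the voxel model, as in the hedged sketch below; the constants are purely illustrative.

```python
def walking_rule_for_small_person(velocity, body_height):
    """Illustrative exception rule: a smaller person (e.g. a child) is accepted
    as walking at a proportionally lower velocity threshold."""
    threshold = 1.2 * (body_height / 1.7)   # nominal adult threshold (m/s) scaled by height (m)
    return velocity > threshold
```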

6.7 Situated Cognition

The CAVIAR and AVIE datasets demonstrate an important concept that is inherent within the activity recognition task: that of situated cognition (SC). Situated cognition, first presented by Greeno (1989), is a philosophical model that states that all knowledge is situated in activity bound to social, cultural and physical contexts. In the SC philosophy, a general conceptual knowledge model takes a back seat, and learning is typically viewed as acquiring specific knowledge applicable to specific situations.

The SC model states that, in order for the learning agent to acquire knowledge, they have at their disposal affordances, or properties of the environment, and attunements that act upon those affordances. When a new situation arises, the agent is required to identify variances and invariances from the affordances and attunements. That is, the agent determines what is different and what stays the same in the new situation and updates their knowledge to reflect a new intention, or goal, based on the variances.

The SC model presents a view of the activity recognition task that provides insight into the discussions on the two datasets presented above. In the CAVIAR dataset, different observers form their own opinions about different activities based on contextual information. In the AVIE dataset, it becomes possible to deal with variability by using the context of the case to explain variations in feature values.

In both cases, the RDR framework is able to incorporate the contextual information itself in a meaningful manner. In fact, it is easy to see how the RDR framework and philosophy explored in this thesis are related to the SC model. The learning agent is the knowledge base. Affordances and attunements are simply attributes and features of the case respectively. Intentions are modelled as the consequents output during knowledge base inference. The combined role of the domain expert and the knowledge base is to identify how to distinguish a misclassified case from other cases. That is, they are required to identify and add invariances. This relation to the SC philosophy makes the RDR framework highly suitable for the activity recognition task.

6.8 Summary

In this chapter, concrete implementations of the RDR framework towards the activity recognition problem have been presented. Two different implementations are provided. The first is for CAVIAR, a ground truth dataset that provides a qualitative evaluation of the framework. The second is for AVIE data provided by the Immersitrack and SparseSPOT algorithms, and shows that the RDR framework can be highly efficient for large dimensional data.

Experimental results on both datasets show that the RDR framework is quite effective in activity classification. It is efficient, simple and human-readable. In addition, because it

leverages the power of human cognition, it can maintain accuracy while generalising well over unseen cases. Because of its incremental learning nature, it is also highly extensible and new features and new classifications can be continuously added without affecting the complexity of training the knowledge base.

The RDR framework has shown itself to be quite effective against the highly ambiguous nature of the activity recognition problem. It is able to uncover discrepancies in human judgement in classifying activities and allows the introduction of new contexts in order to explain such discrepancies. This is related to the philosophical model known as situated cognition, a philosophy that is highly applicable to the activity recognition scenario, and easily modelled by the RDR framework.

Chapter 7

Conclusion

The aim of this thesis is to recognize the activities of multiple people in real time, as they navigate within an immersive environment. The motivation springs from the fact that activities can be used to provide unique forms of interactivity to the participants. The use of activities for interaction allows immersive systems to transcend the traditional point-and-click interfaces that have survived from the age of the earliest computers. Despite the benefits that activity recognition systems provide, there is a considerable lack of investigation in the field due to the complexity of human activities.

This thesis models the goal as two sub-problems. The first is that of tracking, where the spatial location of individuals within the environment is maintained and updated over time. Tracking requires the fusion of data from various sources separated in both time and space, in order to provide coherent trajectory values for persons within the environment. The second task is that of recognition, where low-level data output by an underlying vision system is converted to meaningful symbols representing real-world activities.

Tracking is required in order for the software system to assign a unique identity to each person within the environment. Tracking provides three main benefits. The first benefit is that knowledge about a person in prior frames can be carried over to the current frame in order to utilise past history about a person's activities. Secondly, tracking allows for the determination of trajectory information about a person which can then be used to perform rigid-body motion analysis. Finally, tracking allows for the spatial localisation of later algorithms by fixing a region-of-interest around the person. This can be highly beneficial as activity recognition algorithms operate on large dimensional data and a reduced search space would speed things up considerably.

Recognition involves identifying the activities that people are performing within the environment. This thesis focuses on the task of converting feature vectors from an underlying vision system into high-level activity class symbols. These symbols can be used later on for activity recognition tasks of greater complexity, such as the modelling of activity relationships using temporal networks (Pinhanez, 1999).

This thesis has achieved both goals with results that operate in real time for multiple persons. Along with a demonstration of the computational efficiency of the algorithms presented, their efficacy has also been shown through experimental results. The speed of the tracking and recognition systems, along with performance evaluations based on existing metrics, has been presented. These metrics show that the systems are able to deal with an increasing number of people without deviating from the real-time requirements.

7.1 Contributions

In achieving the final goal, this thesis contributes to three important fields: computer vision, knowledge acquisition, and immersive environments.

7.1.1 Contributions to Computer Vision

(i) The first contribution this thesis makes within the field of computer vision is in the provision of a robust, extensible, multi-camera tracking system that is able to track multiple persons within an indoor environment. A solid programmatic framework has been presented from which either the complete system or significant parts can be worked into other environments.

(ii) The Immersitrack system presents a novel method of discriminating cameras based on their spatial position and orientation. The cameras are split into two groups: overhead cameras pointing directly down and oblique cameras pointing towards the centre of the environment. In this manner, properties of the cameras based on the camera layout can be leveraged to provide robustness in tracking. Such discrimination has rarely been explored before.

(iii) At the 2-D level, numerous innovative algorithms have been presented to perform tracking or detection of bodies within the environment. Because overhead cameras benefit from an unoccluded view of the environment, they are most suitable for performing tracking. A novel method of tracking using multiple tracker mechanisms is utilised. The mechanisms are switched between, based on the splitting and merging of blobs. The oblique cameras are then used to detect bodies with the possibility of occlusion.

(iv) At the 3-D level, two major contributions have been made. The first is a robust, algorithmic framework that combines blob information from the oblique and overhead cameras to perform correspondence and construct a 3-D trajectory representation of each person within the environment. This reconstruction includes a novel verification step, which considerably increases the accuracy of the tracking system. The

framework presented allows for a distributed configuration, which provides real-time results. The second contribution is SparseSPOT, a novel method of performing 3-D voxel reconstruction of the persons within the environment, leveraging the results from Immersitrack. SparseSPOT improves the existing SPOT algorithm for voxel reconstruction by only reconstructing voxels around a small neighbourhood of each person, compared to the entire environment. This allows for real-time reconstruction of multiple persons, which is highly beneficial for the activity recognition stages.

(v) As an extension showing the benefits of Immersitrack, this thesis has also presented a novel finger tracking mechanism that can detect and track the fingertips of multiple persons within the environment, allowing them to control artefacts projected by the display system.

(vi) For the purposes of vision-based activity recognition, this thesis presents an innovative approach to the activity recognition problem. The power of human cognition is leveraged through a knowledge acquisition and inference mechanism that allows for a fast, efficient, and effective activity classifier. The use of simple rules in an automated knowledge verification engine was shown to be quite beneficial to the activity recognition goal. Because of the visual nature of the data, it is very easy for a vision programmer to create consistent yet generalisable rules. The knowledge acquisition framework can easily be adapted to other activity recognition scenarios for further applications.

7.1.2 Contribution to Knowledge Acquisition

(i) Although RDR systems have been used for object recognition, the use of the RDR philosophy for the activity recognition task is a novel application that has not been attempted before. This is mainly because the collation of data for activity recognition, in order to present it to the vision expert, has been a difficult problem to tackle. This thesis presents valuable contributions to the knowledge acquisition field by providing a solution to this problem.

(ii) Numerous contributions are made to extend the RDR principles to real-time data. A solution to the generalisation problem is presented by building rules over numerous similar cases.

(iii) The problem of dynamic data within knowledge acquisition was discussed, where a lack of certain features in future cases could cause knowledge inconsistency. This problem is especially exacerbated in temporal cases, where feature values over longer periods might be unavailable later on. The problem was solved by differentiating between static and dynamic properties, and building the RDR knowledge base on dynamic features that operated on static attributes.

(iv) Finally, the benefits of the RDR knowledge philosophy towards the activity recognition task have been outlined through evaluation over two datasets. The evaluations show that the rules encoded by the expert end up being highly suitable for the task at hand, and the incremental nature of the learning results in the discovery and correction of ambiguities within the classifications that simply do not fit in with any of the rules. The importance of situated cognition in the context of activity recognition was emphasized along with the manner in which the RDR philosophy handles the situated cognition model.

7.1.3 Contribution to Immersive Environments

(i) The main contribution that this thesis makes towards immersive environment systems is the development of several unique methods of interaction within such areas that are able to run at frame rates comparable to those of the display system.

(ii) Immersitrack and SparseSPOT are both suitable for use in most immersive environments as they have been built to work in similar environmental conditions. Both systems can be ported completely or part-by-part to other similar systems to provide similar benefits.

(iii) The pointing gesture recognition developed in Chapter 4 can be used to provide real-time pointing interaction with the display mechanism. The ability to deal with multiple persons within such large spaces makes this system the first of its kind.

(iv) The activity recognition framework can be ported to different environments as is and can be continually enriched with new activities, both well defined and abstract, and can make use of any underlying vision system data. This means that data from other sensors, and not just vision, can be used to provide multi-modal activity recognition. This opens the way for a host of new interaction techniques.

7.2 Limitations and Future Work

The following are some of the limitations of the research, along with future extensions:

Automatic Calibration Techniques Currently, the vision system is calibrated by hand, which results in a coarse calibration model. An automated calibration technique would provide a much finer model, improving correspondence and tracking precision, and would also considerably improve the voxel reconstruction results.

Improved Motion Models Immersitrack uses a first-order Kalman filter to model the positions of persons within the environment. Other models, such as CONDENSATION or the Extended Kalman Filter, which relax the linear motion assumption, might be more beneficial for maintaining the positions of the persons.
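For reference, the first-order (constant-velocity) Kalman filter mentioned above has the standard textbook form sketched below; this is an illustrative recap rather than the Immersitrack implementation, and the noise parameters are placeholders.

import numpy as np

def constant_velocity_matrices(dt, process_var=1e-2, meas_var=1e-1):
    # State is (x, y, vx, vy); only the position (x, y) is observed.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # linear motion model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # measurement model
    Q = process_var * np.eye(4)                 # process noise covariance
    R = meas_var * np.eye(2)                    # measurement noise covariance
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    # One predict/update cycle for a measured position z.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    y = z - H @ x_pred                          # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

The CONDENSATION and Extended Kalman alternatives discussed here replace the linear F and H above with sampled or locally linearised models.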

Exploration in Colour Systems Because of the reliance on near-infrared imagery, the various low-level vision systems within this thesis perform processing based solely on spatio-temporal features. The algorithms should be explored in colour settings to see how well they fare within such environments.

Larger Areas and a Greater Number of People Immersitrack has proved quantitatively effective for up to 6 people and qualitatively effective for up to 8 people. Although AVIE is a fair representation of the typical size of a large immersive environment, the use of Immersitrack within larger environments, extended with a greater number of cameras and slaves, should be explored.

Higher Level Activity Descriptions Higher-level activity recognition systems have used grammar-parsing techniques on activity symbols to recognize more complex activities. The RDR framework within this thesis can easily be extended to utilise features that output symbols rather than real values. Such an extension would allow RDR to be used to model relationships between activities.

Recognising Interactions The current research looks at recognising the activities of a single person within the environment. To this end, each case within the RDR framework represents a single person. In immersive environments, it may be beneficial to look at modelling multi-person cases, in order to recognise interactions between people. This could also aid in modelling group behaviour.

Multi-Modal Interactions In immersive environments, there are usually several different modes of interaction. For example, within AVIE, along with the cameras, there is also an electronic wand for pointing and a joystick for navigation. The attributes of all of these modalities could be incorporated within the RDR framework for a greater space of activities. Additional vision modules may also be incorporated.

Using Activities for Interaction This thesis presents the recognition side of activity-based interaction. Further research should look at how activities can be used to drive the display mechanism. It would be interesting to see the use of abstract activities, determined by an artist during operation of the system, to control interactive mechanisms.

Machine Learning as Features The RDR framework presented in this thesis uses features that output real or boolean values, and a feature can be any algorithm that does so. Thus, a machine-learned classifier might also be used as a feature within the system. The use of classifiers as features could considerably increase accuracy and robustness.
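The sketch below illustrates one way a trained classifier could be wrapped as a real-valued feature of the kind the RDR framework consumes; the wrapper, the attribute names and the use of a scikit-learn-style predict_proba method are assumptions for illustration only.

class ClassifierFeature:
    # Wraps any probabilistic classifier as a real-valued feature over a case.
    def __init__(self, model, attributes):
        self.model = model            # e.g. a trained scikit-learn estimator
        self.attributes = attributes  # which case attributes the model consumes

    def value(self, case):
        x = [[case[name] for name in self.attributes]]
        # Probability of the positive class, usable in rule conditions like any
        # other real-valued feature (e.g. value(case) > 0.8).
        return float(self.model.predict_proba(x)[0][1])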

Usability of the RDR Training Interface The RDR knowledge bases built as a part of this thesis were built by a single expert who had intimate knowledge of the underlying system. Use of a variety of experts is recommended to provide better coverage of variations in judgement. More work is required on the usability of the training interface, as it demands a fair amount of learning in its current state. A more intuitive knowledge base editor, one that potentially provides a way to manipulate the video directly, would greatly benefit the use of RDR within vision applications such as this one.

7.3 Final Remarks

Immersive environments are becoming cheaper and faster, and are an increasingly suitable platform for exploring applications in arts, entertainment, and education. As these systems become more advanced, the interaction techniques they offer must keep pace. Such techniques must go beyond the pointing, clicking, and typing interfaces of traditional desktops to more abstract forms. A vision-based activity recognition module would provide a much greater range of interaction modalities than wired sensors, and could also benefit sociological and psychological research.

In this thesis, a solution to the activity recognition task has been presented that extends from the lowest levels of computer vision up to the high-level task of outputting a symbolic representation of the activity of each person within the environment. The system can be continuously enriched with new activities, making it suitable for artistic endeavours. The research presented uses efficient methods at each stage to ensure that the result operates in real time to match the graphics frame rates of the immersive environment display. Although the algorithms and frameworks are specific to the AVIE environment, they may easily be ported to a wide variety of other research settings.

Appendix A

Reducing the Minimum-Weight Edge Cover of a Bipartite Graph to a Linear Assignment Problem

In this appendix it is shown that finding the minimum-weight edge cover of a bipartite graph, $G = (V, E)$, with a defined edge weight function, $w$, reduces to the linear assignment problem. Although this proof has been demonstrated by Papadimitriou and Steiglitz (1998), it is presented here as the tracking algorithms depend on it.

Proof. A graph $G' = (V', E')$, which is a disjoint copy of $G$, is first constructed. So for every $v \in V$ there is a corresponding $v' \in V'$, and for every edge $e \in E$ there is an $e' \in E'$. The weight function $w$ in $G$ is extended to $G'$ by setting $w(e') = w(e)$, where $e'$ is the edge in $E'$ corresponding to $e$.

A further edge set $E''$ is defined by connecting each vertex $v \in V$ to its corresponding vertex $v' \in V'$ (denote by $e_v$ the edge connecting $v$ to $v'$). Then construct the graph $G^+ = (V^+, E^+)$, where

\[ V^+ = V \cup V' \tag{A.1} \]

and,

\[ E^+ = E \cup E' \cup E'' \tag{A.2} \]

An edge weight function $w^+$ in $G^+$ is defined by

\begin{align}
w^+(e) &= w(e), & e &\in E \tag{A.3} \\
w^+(e') &= w(e'), & e' &\in E' \tag{A.4} \\
w^+(e_v) &= 2\mu(v), & e_v &\in E'' \tag{A.5}
\end{align}

where $\mu(v)$ is the minimum weight of the edges of $G$ incident on $v$.

Note that $G^+$ has an even number of vertices, since $|V| = |V'|$. Thus, it is possible to find a minimum-weight perfect matching within $G^+$. That is, it is possible to find a set of edges, $M^+ \subseteq E^+$, such that every vertex in $V^+$ is incident on exactly one edge in $M^+$, and such that the sum of the weights of the edges in $M^+$ is minimized.

Then $M^+$ provides a minimum-weight edge cover, $M \subseteq E$, in $G$. $M$ is constructed as follows. For each edge $e^+ \in M^+$: if $e^+ = e \in E$ then add $e$ to $M$; if $e^+ = e' \in E'$ then discard $e^+$; and if $e^+ = e_v \in E''$ then add the minimum-weight edge in $G$ incident with $v$. It is shown that $M$ constructed in this way is a minimum-weight edge cover.

Since $M^+$ is a perfect matching, if $e_v \in M^+ \cap E''$, then both $v$ and $v'$ are incident on the same edge in $M^+$, and so there can be no other edge incident on them. So all the vertices that are incident on an edge in $M^+ \cap E''$ are removed from consideration, and two sets of disjoint vertices of equal size, $U \subseteq V$ and $U' \subseteq V'$, are obtained. Thus, the number of edges in $M^+$ that are incident on vertices in $U$ (that is, edges in $M^+ \cap E$) is equal to the number of edges that are incident on vertices in $U'$ (edges in $M^+ \cap E'$).

Furthermore, for every vertex $v \in U$ there is a corresponding vertex $v' \in U'$ with the same number of incident edges and with the same weights. Thus, for a minimum-weight perfect matching on $U$ there must be a minimum-weight perfect matching on $U'$ with the same weight. So, without loss of generality, assume that $w^+(M^+ \cap E) = w^+(M^+ \cap E')$. So, by the construction of $M$:

\begin{align}
w^+(M^+) &= w^+(M^+ \cap E) + w^+(M^+ \cap E') + w^+(M^+ \cap E'') \tag{A.6} \\
&= 2\, w^+(M^+ \cap E) + w^+(M^+ \cap E'') \tag{A.7} \\
&= 2 \sum_{e \in M^+ \cap E} w^+(e) + \sum_{e_v \in M^+ \cap E''} w^+(e_v) \tag{A.8} \\
&= 2 \sum_{e \in M^+ \cap E} w^+(e) + 2 \sum_{e_v \in M^+ \cap E''} \mu(v) \tag{A.9} \\
&= 2 \left( \sum_{e \in M^+ \cap E} w(e) + \sum_{e_v \in M^+ \cap E''} \mu(v) \right) \tag{A.10} \\
&= 2\, w(M) \tag{A.11}
\end{align}

Now suppose that $F$ is any edge cover of $G$. Assume that $F$ has no paths of length greater than 2 edges since, if it does, the middle edge can be removed to form a cover with fewer edges, and so smaller weight. Thus, $F$ is a union of stars.

Perform a reverse construction of $F$ into $G^+$, forming $F^+$, by taking each leaf vertex $v$ of each star, connecting it to its corresponding vertex $v'$, and setting the weight of the edge to $2\mu(v)$ (as above), until only vertices of degree 1 are left. Thus, $F^+$ is a perfect matching in $G^+$. Furthermore, by the definition of $\mu$, the weight of each edge formed in $F^+$ is at most twice the weight of the corresponding edge in $F$. So,

\[ w^+(F^+) \le 2\, w(F) \tag{A.12} \]

So, since $M^+$ is a minimum-weight perfect matching in $G^+$,

\begin{align}
w(M) &= \tfrac{1}{2}\, w^+(M^+) \tag{A.13} \\
&\le \tfrac{1}{2}\, w^+(F^+) \tag{A.14} \\
&\le w(F) \tag{A.15}
\end{align}

so $M$ is a minimum-weight edge cover for $G$.

Therefore, to find a minimum-weight edge cover for the original graph, simply compute the minimum-weight perfect matching, $M^+$, in $G^+$ and use the construction above to derive $M$. If $G$ is a bipartite graph, then $V = L \cup R$, where all edges in $G$ have one vertex in $L$ and one vertex in $R$. Then denote $V' = L' \cup R'$ and note that

\[ V^+ = (L \cup R') \cup (L' \cup R) \tag{A.16} \]

Let $L^+ = (L \cup R')$ and $R^+ = (L' \cup R)$. Then, by the construction of $G^+$, all edges have exactly one vertex in $L^+$ and the other vertex in $R^+$. Thus, $G^+$ is also bipartite. But finding a minimum-weight perfect matching in a bipartite graph is exactly the linear assignment problem. Thus, the minimum-weight edge cover within a bipartite graph is reduced to a linear assignment problem.
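To make the reduction concrete, the following is an illustrative sketch (not the thesis code) that applies the construction above to a complete bipartite graph with weight matrix W, solving the resulting assignment problem with SciPy's Hungarian-algorithm solver; the large constant standing in for missing edges and the helper name are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def min_weight_edge_cover(W):
    # W is an (m x n) weight matrix between parts L (rows) and R (columns).
    m, n = W.shape
    BIG = 1e9                      # stands in for "no edge" in G+
    mu_L = W.min(axis=1)           # mu(v) for each vertex in L
    mu_R = W.min(axis=0)           # mu(v) for each vertex in R

    # Rows of C index L+ = L u R', columns index R+ = L' u R.
    C = np.full((m + n, m + n), BIG)
    C[:m, m:] = W                                       # original edges E
    C[m:, :m] = W.T                                     # copied edges E'
    C[np.arange(m), np.arange(m)] = 2 * mu_L            # E'' edges v-v' for v in L
    C[m + np.arange(n), m + np.arange(n)] = 2 * mu_R    # E'' edges v-v' for v in R

    rows, cols = linear_sum_assignment(C)               # perfect matching M+
    cover = set()
    for r, c in zip(rows, cols):
        if r < m and c >= m:                            # an edge of E: keep it
            cover.add((r, c - m))
        elif r < m and c < m:                           # a vertex of L matched to its copy
            cover.add((r, int(np.argmin(W[r]))))        # use its minimum-weight edge
        elif r >= m and c >= m:                         # a vertex of R matched to its copy
            cover.add((int(np.argmin(W[:, c - m])), c - m))
        # r >= m and c < m: an edge of E', discarded
    return cover

For example, min_weight_edge_cover(np.array([[4., 1.], [2., 3.], [5., 6.]])) returns the cover {(0, 1), (1, 0), (2, 0)}, covering every vertex with total weight 8.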

References

Ankur Agarwal and Bill Triggs. Learning to track 3D human motion from silhouettes. In Proceedings of the twenty-first international conference on Machine learning, pages 2–2, Banff, Alberta, Canada, 2004. ACM. URL http://portal.acm.org/citation.cfm?id= 1015343. 34

JK Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440, 1999. 12

JK Aggarwal and S. Park. Human motion: modeling and recognition of actions and in- teractions. In Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization and Transmission, pages 640–647, 2004. 32

M. Ahad, JK Tan, HS Kim, and S. Ishikawa. Human activity recognition: various paradigms. In Control, Automation and Systems, 2008. ICCAS 2008. International Con- ference on, pages 1896–1901. IEEE, 2008. 12

S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In Proc. ICCV, pages 1–8. Citeseer, 2007. 14

Jérémie Allard, Edmond Boyer, Jean-Sébastien Franco, Clément Ménier, and Bruno Raffin. Marker-less real time 3D modeling for virtual reality. In Immersive Projection Technol- ogy, 2004. 5

M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. Signal Processing, IEEE Transactions on, 50(2):174–188, 2002. ISSN 1053-587X. 28, 29

W. Barfield and C. Hendrix. The effect of update rate on the sense of presence within virtual environments. Virtual Reality, 1(1):3–15, 1995. ISSN 1359-4338. 13

K. Bernardin, A. Elbs, and R. Stiefelhagen. Multiple object tracking performance metrics and evaluation in a smart room environment. In Sixth IEEE International Workshop on Visual Surveillance, in conjunction with ECCV, 2006. 90, 91

Keni Bernardin, Tobias Gehrig, and Rainer Stiefelhagen. Multi- and Single View Multiperson Tracking for Smart Room Environments. Lecture Notes in Computer Science, 4122: 81–92, 2007. URL http://dx.doi.org/10.1007/978-3-540-69568-4_5. 26, 27

G. Beydoun and A. Hoffmann. Incremental acquisition of search knowledge. International Journal of Human-Computer Studies, 52(3):493–530, 2000. 128

N.D. Bird, O. Masoud, N.P. Papanikolopoulos, and A. Isaacs. Detection of loitering individu- als in public transportation areas. Intelligent Transportation Systems, IEEE Transactions on, 6(2):167–177, 2005. ISSN 1524-9050. 33

J. Blanchette and M. Summerfield. C++ GUI programming with Qt 4. Prentice Hall Press Upper Saddle River, NJ, USA, 2008. 136

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1395–1402 Vol. 2, 2005. ISBN 1550-5499. doi: {10.1109/ICCV.2005.28}. 14, 35

WE Blanz and SL Gish. A connectionist classifier architecture applied to image segmenta- tion. In Pattern Recognition, 1990. Proceedings., 10th International Conference on, vol- ume 2, pages 272–277. IEEE, 2002. ISBN 0818620625. 20

J. Blascovich. Social influence within immersive virtual environments. The social life of avatars: Presence and interaction in shared virtual environments, pages 127–145, 2002. 3

A. Bobick and J. Davis. Real-time recognition of activity using temporal templates. In IEEE Workshop on Applications of Computer Vision, pages 39–42, 1996a. 14, 39

A. Bobick and J. Davis. An appearance-based representation of action. In Pattern Recog- nition, 1996., Proceedings of the 13th International Conference on, volume 1, pages –, 1996b. 39, 40

A. Bobick, S. Intille, J. Davis, F. Baird, C. Pinhanez, L. Campbell, Y. Ivanov, A. Schutte, and A. Wilson. The KidsRoom: A Perceptually-Based Interactive and Immersive Story Environment. PRESENCE: Teleoperators and Virtual Environments, 8(4):367–391, 1999. 5, 80

Aaron Bobick, Stephen Intille, Jim Davis, Freedom Baird, Claudio Pinhanez, Lee Camp- bell, Yuri Ivanov, Arjan Schütte, and Andrew Wilson. Design Decisions for Interactive Environments: Evaluating the KidsRoom. In Proceedings of the 1998 AAAI Spring Sym- posium on Intelligent Environments. AAAI TR, pages 98–02. AAAI Press, 1998. 5

A.F. Bobick. Movement, activity and action: the role of knowledge in the perception of motion. Philosophical Transactions of the Royal Society B: Biological Sciences, 352(1358): 1257, 1997. 32, 33

G. Bradski. OpenCV: Examples of use and new applications in stereo, recognition and tracking. In Proc. Intern. Conf. on Vision Interface (VI'2002), 2002. 55

C. Bregler. Learning and recognizing human dynamics in video sequences. In IEEE COM- PUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOG- NITION, pages 568–574. INSTITUTE OF ELECTRICAL ENGINEERS INC (IEEE), 1997. 38

N.C.M. Brown, T.S. Barker, and D. Del Favero. Performing digital aesthetics: The frame- work for a theory of the formation of interactive narratives. Leonardo, 44(3):212–219, 2011. 117

H. Buxton. Learning and understanding dynamic scene activity: a review. Image and Vision Computing, 21(1):125–136, 2003. 32

M.C. Cabral, C.H. Morimoto, and M.K. Zuffo. On the usability of gesture interfaces in virtual reality environments. In Proceedings of the 2005 Latin American conference on Human-computer interaction, pages 100–108. ACM Press New York, NY, USA, 2005. 5

Q. Cai and JK Aggarwal. Tracking human motion using multiple cameras. In International Conference on Pattern Recognition, volume 13, pages 68–72, 1996. 21

Q. Cai and JK Aggarwal. Automatic tracking of human motion in indoor scenes across mul- tiple synchronized video streams. In Computer Vision, 1998. Sixth International Confer- ence on, pages 356–362, 1998. 25

F. Caillette and T. Howard. Real-time markerless human body tracking using colored vox- els and 3D blobs. In Third IEEE and ACM International Symposium on Mixed and Augmented Reality, 2004. ISMAR 2004, pages 266–267, 2004. 43

J. Carranza, C. Theobalt, M.A. Magnor, and H.P. Seidel. Free-viewpoint video of human actors. In ACM SIGGRAPH 2003 Papers, page 577. ACM, 2003. 36

C. Cedras and M. Shah. Motion-based recognition: A survey. Image and Vision Computing, 13(2):129–155, 1995. 12

C.C. Chang and C.J. Lin. Training ν-Support Vector Regression: Theory and Algorithms. Neural Computation, 14(8):1959–1977, 2002. 27

H.T. Chen, H.H. Lin, and T.L. Liu. Multi-object tracking using dynamical graph matching. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, 2001. 23, 28, 63

G. Cheung, T. Kanade, J. Y. Bouguet, and M. Holler. A Real Time System for Robust 3D Voxel Reconstruction of Human Motions. In IEEE COMPUTER SOCIETY CON- FERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, volume 2. IEEE

Computer Society; 1999, 2000. doi: {http://doi.ieeecs.org/10.1109/CVPR.2000.854944}. 36, 42, 99

G. K. M. Cheung, S. Baker, and T. Kanade. Visual hull alignment and refinement across time: a 3D reconstruction algorithm combining shape-from-silhouette with stereo. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Soci- ety Conference on, volume 2, 2003a. 36, 43

K. M. G. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, 2003b. 36

KM Cheung. Visual Hull Construction, Alignment and Refinement for Human Kinematic Modeling, Motion Tracking and Rendering. PhD thesis, PhD thesis, Carnegie Mellon University, 2003. 36

C. Chu, O. Jenkins, and M. Mataric. Towards Model-free Markerless Motion Capture. In IEEE Intl. Conf. On Robotics and Automation, Taipei, Taiwan, 2003. URL http: //citeseer.ist.psu.edu/chu03towards.html. 36, 42

C.W. Chu, O.C. Jenkins, and M.J. Mataric. Converting Sequences of Human Volumes into Kinematic Motion. Technical report, Center for Robotics and Embedded Systems, 2002. 36

P. Compton and B. Jansen. A Philosophical Basis for Knowledge Acquisition. Knowledge Acquisition, 2:241–257, 1989. 9, 46, 122

P. Compton and D. Richards. Generalising ripple-down rules. Knowledge Engineering and Knowledge Management Methods, Models, and Tools, pages 125–135, 2000. 128

P. Compton, L. Peters, G. Edwards, and T.G. Lavers. Experience with ripple-down rules. Knowledge-Based Systems, 19(5):356–362, 2006. ISSN 0950-7051. 161

D. Crevier and R. Lepage. Knowledge-based image understanding systems: A survey* 1. Computer Vision and Image Understanding, 67(2):161–185, 1997. 46

C. Cruz-Neira, D.J. Sandin, and T.A. DeFanti. Surround-screen projection-based virtual reality: the design and implementation of the CAVE. In Proceedings of the 20th annual conference on Computer graphics and interactive techniques, pages 135–142. ACM New York, NY, USA, 1993. 2

Y. Cui, S. Samarasekera, Q. Huang, and M. Greiffenhagen. Indoor monitoring via the collaboration between a peripheral sensor and a foveal sensor. In Proc. of the IEEE Workshop on Visual Surveillance, pages 2–9. IEEE Computer Society, 1998. 22

F. Cupillard, F. Brémond, and M. Thonnat. Group behavior recognition with multiple cameras. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, page 177. Citeseer, 2002. 33

J. Davis and X. Chen. Lumipoint: Multi-user laser-based interaction on large tiled displays. Displays, 23(5):205–211, 2002. ISSN 0141-9382. 5

E. de Aguiar, C. Theobalt, M. Magnor, H. Theisel, and H.-P. Seidel. M3: marker-free model reconstruction and motion tracking from 3D voxel data. In Proc. 12th Pa- cific Conf. Computer Graphics and Applications PG 2004, pages 101–110, 2004. doi: 10.1109/PCCGA.2004.1348340. 36

R.D. de Leon and L.E. Sucar. Human Silhouette Recognition with Fourier Descriptors. International Conference on Pattern Recognition. IEE Barcelona, Spain, 2000. 40

D. Del Favero and T.S. Barker. Scenario: Co-evolution, shared autonomy and mixed real- ity. In Mixed and Augmented Reality-Arts, Media, and Humanities (ISMAR-AMH), 2010 IEEE International Symposium On, pages 11–18. IEEE, 2010. 117, 118

Q. Delamarre and O. Faugeras. 3D Articulated Models and Multiview Tracking with Phys- ical Forces* 1. Computer Vision and Image Understanding, 81(3):328–357, 2001. ISSN 1077-3142. 36

S.L. Dockstader and A.M. Tekalp. Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE, 89(10):1441–1455, 2001a. 25

S.L. Dockstader and A.M. Tekalp. Multiple camera fusion for multi-object tracking. In Multi-Object Tracking, 2001. Proceedings. 2001 IEEE Workshop on, pages 95–102, 2001b. 25

H.L. Dreyfus, S.E. Dreyfus, and T. Athanasiou. Mind over machine: The power of human intuition and expertise in the era of the computer. The Free Press New York, NY, USA, 1986. 45

J. Duetscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In cvpr, page 2126. Published by the IEEE Computer Society, 2000. 36

G. Edwards, P. Compton, R. Malor, A. Srinivasan, and L. Lazarus. PEIRS: a pathologist- maintained expert system for the interpretation of chemical pathology reports. Pathology, 25(1):27–34, 1993. 123

A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtrac- tion. FRAME-RATE Workshop, IEEE, 1999. 20

R.B. Fisher. The PETS04 surveillance ground-truth data sets. In Proc. 6th IEEE Inter- national Workshop on Performance Evaluation of Tracking and Surveillance, pages 1–5, 2004. 11, 147

Robert Fisher. EC Funded CAVIAR project/IST 2001 37540, 2002. URL http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. 146, 147

F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multicamera People Tracking with a Prob- abilistic Occupancy Map. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):267–282, 2008. ISSN 0162-8828. 27

B. Flynn, R. Sablatnig, J. Hemsley, P. Kammerer, E. Zolda, and J. Stockinger. Prehistoric ar- chaeology in a dynamic spatial visualization environment. In Digital Cultural Heritage- Essential for Tourism Proceedings of the 1st EVA 2006 Vienna Conference. Citeseer, 2006. 119

D. Focken and R. Stiefelhagen. Towards vision-based 3-D people tracking in a smart room. Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on, pages 400–405, 2002. 26, 27

D. A. Forsyth and Jean Ponce. Computer Vision, A Modern Approach. Prentice Hall, 2003. ISBN 0-12-379777-2. 17

J.S. Franco and E. Boyer. Exact polyhedral visual hulls. In British Machine Vision Confer- ence, volume 1, pages 329–338. Citeseer, 2003. 41

Florent Fusier, Valery Valentin, Francois Bremond, Monique Thonnat, Mark Borg, David Thirde, and James Ferryman. Video understanding for complex activity recognition. Ma- chine Vision and Applications, 18(3):167–188, 2007. URL http://dx.doi.org/10.1007/ s00138-006-0054-y. 46

P.F. Gabriel, J.G. Verly, J.H. Piater, and A. Genon. The state of the art in multiple object tracking under occlusion in video sequences. Advanced Concepts for Intelligent Vision Systems, pages 166–173, 2003. 12

B. Gaines and M.L.G. Shaw. Cognitive and logical foundations of knowledge acquisition. In The 5th Knowledge Acquisition for Knowledge Based Systems Workshop, pages 9–1, 1990. 126

B.R. Gaines and P. Compton. Induction of ripple-down rules applied to modeling large databases. Journal of Intelligent Information Systems, 5(3):211–228, 1995. 132

DM Gavrila. Visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1):82–98, 1999. 12

DM Gavrila and LS Davis. 3-D model-based tracking of humans in action: a multi-view approach. In 1996 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1996. Proceedings CVPR’96, pages 73–80, 1996. 35

N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 107–113. IET, 1993. 31

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as Space-Time Shapes. IEEE T Pattern Anal, 29(12):2247–2253, 2007. doi: 10.1109/TPAMI.2007.70711. 14, 35

J.G. Greeno. A perspective on thinking. American Psychologist, 44(2):134–141, 1989. ISSN 0003-066X. 171

M. Gross, S. Wurmlin, M. Naef, E. Lamboray, C. Spagno, A. Kunz, E. Koller-Meier, T. Svoboda, L. Van Gool, S. Lang, et al. blue-c: a spatially immersive display and 3D video portal for telepresence. ACM Transactions on Graphics, 22(3):819–827, 2003. 2, 15

J. Gu, X. Ding, S. Wang, and Y. Wu. Action and Gait Recognition From Recovered 3-D Human Joints. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics: a publication of the IEEE Systems, Man, and Cybernetics Society, 0(99):1–13, 2010. doi: 10.1109/TSMCB.2010.2043526. Early Access. 38, 44, 45

Y. Guo, G. Xu, and S. Tsuji. Understanding human motion patterns. In Pattern Recogni- tion, 1994. Vol. 2-Conference B: Computer Vision & Image Processing., Proceedings of the 12th IAPR International. Conference on, volume 2, pages 325–329. IEEE, 1994. ISBN 0818662700. 38

I. Haritaoglu, D. Harwood, and LS Davis. Ghost: a human body part labeling system using silhouettes. Pattern Recognition, 1998. Proceedings. Fourteenth International Conference on, 1, 1998a. 21, 22, 106

I. Haritaoglu, D. Harwood, and L.S. Davis. Who? What? When? Where?: A real-time system for detecting and tracking people. In Proc. Computer Vision and Pattern Recognition, pages 962–968. Springer, Springer, 1998b. 21

I. Haritaoglu, R. Cutler, D. Harwood, and L.S. Davis. Backpack: detection of people carrying objects using silhouettes. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 102–107 vol.1, 1999a. 21, 68

I. Haritaoglu, D. Harwood, and LS Davis. Hydra: multiple people detection and tracking using silhouettes. In Proceedings of the Second IEEE Workshop on Visual Surveillance, pages 6–13. IEEE Computer Society Washington, DC, USA, 1999b. 21, 22, 28

I. Haritaoglu, D. Harwood, LS Davis, I.B.M.A.R. Center, and CA San Jose. W4: real-time surveillance of people and their activities. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):809–830, 2000. 17

M. Harville and D. Li. Fast, Integrated Person Tracking and Activity Recognition with Plan- View Templates from a Single Stereo Camera. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2. IEEE Computer Society, 2004. 24

Michael Harville. Stereo person tracking with adaptive plan-view templates of height and occupancy statistics. Image and Vision Computing, 22:127–142, 2004. doi: 10.1.1.10.4194. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.4194. 24, 26, 28

Jean-Marc Hasenfratz, Marc Lapierre, Jean-Dominique Gascuel, and Edmond Boyer. Real- Time Capture, Reconstruction and Insertion into Virtual World of Human Actors. In Vision, Video and Graphics, pages 49–56. Eurographics, Elsevier, 2003. URL http:// artis.imag.fr/Publications/2003/HLGB03. 5, 43

F. Hayes-Roth, D. Waterman, and D. Lenat. Building expert systems. Addison-Wesley, Reading, MA, 1984. 122

Q. He and C. Debrunner. Individual recognition from periodic activity using hidden markov models. In Human Motion, 2000. Proceedings. Workshop on, pages 47–52. IEEE, 2000. ISBN 0769509398. 44, 45

G. Herren. Samuel Beckett’s plays on film and television. Palgrave Macmillan, 2007. ISBN 9781403977953. URL http://books.google.com/books?id=RBh0QgAACAAJ. 117

T. Höllerer, J.A. Kuchera-Morin, and X. Amatriain. The allosphere: a large-scale immersive surround-view instrument. In Proceedings of the 2007 workshop on Emerging displays technologies: images and beyond: the future of displays and interacton, page 3. ACM, 2007. 2

R.C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine learning, 11(1):63–90, 1993. 46

W. Hu, T. Tan, L. Wang, and S. Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applica- tions and Reviews, 34(3):334–352, 2004a. ISSN 1094-6977. 12

W. Hu, X. Xiao, D. Xie, T. Tan, and S. Maybank. Traffic accident prediction using 3-D model-based vehicle tracking. Vehicular Technology, IEEE Transactions on, 53(3):677– 694, 2004b. ISSN 0018-9545. 33

S.S. Intille, J.W. Davis, and A.F. Bobick. Real-time closed-world tracking. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Confer- ence on, pages 697–703. IEEE, 2002. ISBN 0818678224. 23

M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998. ISSN 0920-5691. 31

Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE T Pattern Anal, 22(8):852–872, 2000. doi: 10.1109/34.868686. 44

S. Iwase and H. Saito. Parallel tracking of all soccer players by integrating detected positions in multiple view images. Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, 4, 2004. 27

D.S. Jang, S.W. Jang, and H.I. Choi. 2D human body tracking with Structural Kalman Filter. Pattern Recognition, 35:2041–2049, 2002. 31

O. Javed, S. Khan, Z. Rasheed, and M. Shah. Camera handoff: tracking in multiple un- calibrated stationarycameras. Human Motion, 2000. Proceedings. Workshop on, pages 113–118, 2000. 17, 25, 26

G. Johansson. Visual perception of biological motion and a model for its analysis. Attention, Perception, & Psychophysics, 14(2):201–211, 1973. ISSN 1943-3921. 34

S.J. Julier and J.K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004. ISSN 0018-9219. 31

Y.K. Jung and Y.S. Ho. Traffic parameter extraction using video-based vehicle tracking. In Intelligent Transportation Systems, 1999. Proceedings. 1999 IEEE/IEEJ/JSAI Interna- tional Conference on, pages 764–769. IEEE, 1999. ISBN 078034975X. 33

A.A. Kale, N. Cuntoor, V. Krüger, and AN Rajagopalan. Gait-based recognition of humans using continuous HMMs. In fgr, page 0336. Published by the IEEE Computer Society, 2002. 45

R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960. 29

B. Kang, P. Compton, and P. Preston. Multiple classification ripple down rules: Evalua- tion and possibilities. In The 9th Knowledge Acquisition for Knowledge Based Systems Workshop. Citeseer, 1995. 126, 127, 157, 163

B.H. Kang, W. Gambetta, and P. Compton. Verification and validation with ripple-down rules. International journal of human-computer studies, 44(2):257–269, 1996. ISSN 1071- 5819. 129, 151, 159

J. Kang, I. Cohen, and G. Medioni. Tracking people in crowded scenes across multiple cameras. In Asian Conference on Computer Vision January, pages –, 2004. 27

R. Kehl. Markerless motion capture of complex human movements from multiple views. PhD thesis, Swiss Federal Institute of Technology, Zurich, 2005. 36, 43

R. Kehl and L. Van Gool. Real-time pointing gesture recognition for an immersive environ- ment. Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE Interna- tional Conference on, pages 577–582, 2004. 104, 106, 111

R. Kehl, M. Bray, and L. Van Gool. Full body tracking from multiple views using stochastic sampling. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2:129–136 Vol. 2, 2005. ISSN 1063-6919. 36, 43

V. Kellokumpu, M. Pietikäinen, and J. Heikkilä. Human activity recognition using se- quences of postures. In Proceedings of the IAPR Conference on Machine Vision Appli- cations (MVA 2005), Tsukuba Science City, Japan, pages 570–573. Citeseer, 2005. 44, 45

P. Kenny, A. Hartholt, J. Gratch, W. Swartout, D. Traum, S. Marsella, and D. Piepol. Build- ing interactive virtual humans for training environments. In The Interservice/Industry Training, Simulation & Education Conference (I/ITSEC), volume 2007. NTSA, 2007. 3

S. Khan and M. Shah. Tracking people in presence of occlusion. In Asian Conference on Computer Vision, volume 5, 2000. 17, 23

S. Khan, O. Javed, Z. Rasheed, and M. Shah. Human tracking in multiple cameras. In In International Conference on Computer Vision, pages 331–336, 2001. 25, 26

T. Kohonen. Self-organization and associative memory. Springer-Verlag New York, Inc. New York, NY, USA, 1989. 40

M. Krueger. Artificial Reality 2. Addison-Wesley Professional, 1991. ISBN 0201522608. 3

M.W. Krueger. Responsive environments. In Proceedings of the June 13-16, 1977, national computer conference, pages 423–433. ACM, 1977. 3

M.W. Krueger, T. Gionfriddo, and K. Hinrichsen. VIDEOPLACE - an artificial reality. ACM SIGCHI Bulletin, 16(4):35–40, 1985. ISSN 0736-6906. 3, 4, 5

V. Kruger, D. Kragic, A. Ude, and C. Geib. The Meaning of Action: a review on action recognition and mapping. Advanced Robotics, 21(13):1473–1501, 2007. 32

J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi- person tracking for easyliving. vs, page 3, 2000. 24

H.W. Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955. 65

A. Ladikos, S. Benhimane, and N. Navab. Efficient visual hull computation for real-time 3D reconstruction using CUDA. Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pages 1–8, 2008. 43

E. Lantz. Large-scale immersive displays in entertainment and education. In 2 nd Annual Immersive Projection Technology Workshop, 1998. 7

E. Lantz. A survey of large-scale immersive displays. In Proceedings of the 2007 workshop on Emerging displays technologies: images and beyond: the future of displays and interaction, page 1. ACM, 2007. 7

A. Laurentini. The visual hull: A new tool for contour-based image understanding. In Proc. 7th Scandinavian Conference on Image Analysis, pages 993–1002, 1991. 40

A. Laurentini. The visual hull concept for silhouette-based image understanding. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(2):150–162, 1994. ISSN 0162-8828. 40

I. Leichter, M. Lindenbaum, and E. Rivlin. A probabilistic framework for combining track- ing algorithms. Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, 2:445–451, 2004. 23

K. Levenberg. A method for the solution of certain nonlinear problems in least squares. Quart. Appl. Math, 2(2):164–168, 1944. 77

T. List, J. Bins, J. Vazquez, and R. B. Fisher. Performance evaluating the evaluator. In Proc. 2nd Joint IEEE Int Visual Surveillance and Performance Evaluation of Tracking and Surveillance Workshop, pages 129–136, 2005. doi: 10.1109/VSPETS.2005.1570907. 168

A. Lopez, C. Canton-Ferrer, and J.R. Casas. Multi-Person 3D Tracking with Particle Fil- ters on Voxels. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 1, pages I–913–I–916, 2007. ISBN 1520-6149. doi: {10.1109/ICASSP.2007.366057}. 43

B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International joint conference on artificial intelligence, volume 3, pages 674–679. Citeseer, 1981. 66

P. Maes, T. Darrell, B. Blumberg, and A. Pentland. The ALIVE system: wireless, full-body interaction with autonomous agents. Multimedia Systems, 5(2):105–112, 1997. 23

D. Makris and T. Ellis. Learning semantic scene models from observing activity in visual surveillance. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 35(3):397–408, 2005. ISSN 1083-4419. 33

D.W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Jour- nal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963. 77

F. Martinez-Contreras, C. Orrite-Urunuela, E. Herrero-Jaraba, H. Ragheb, and S. A. Ve- lastin. Recognizing Human Actions Using Silhouette-based HMM. In Proc. Sixth IEEE Int. Conf. Advanced Video and Signal Based Surveillance AVSS ’09, pages 43–48, 2009. doi: 10.1109/AVSS.2009.46. 40, 44, 45

W. Matusik, C. Buehler, and L. McMillan. Polyhedral visual hulls for real-time rendering. In Eurographics Workshop on Rendering, volume 1, pages 115–125. Citeseer, 2001. 41

C. McCall and J. Blascovich. How, when, and why to use digital experimental virtual en- vironments to study social behavior. Social and Personality Psychology Compass, 3(5): 744–758, 2009. ISSN 1751-9004. 3

M. McGinity, J. Shaw, V. Kuchelmeister, A. Hardjono, and D. Del Favero. AVIE: a versatile multi-user stereo 360 interactive VR theatre. In Proceedings of the 2007 workshop on Emerging displays technologies: images and beyond: the future of displays and interacton. ACM New York, NY, USA, 2007. 2, 7

S.J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler. Tracking groups of people. Computer Vision and Image Understanding, 80(1):42–56, 2000. 22

I. Mikic, M. Trivedi, E. Hunter, and P. Cosman. Human Body Model Acquisition and Track- ing Using Voxel Data. International Journal of Computer Vision, 53(3):199–223, 2003. 36, 37

A. Misra, A. Sowmya, and P. Compton. Incremental learning of control knowledge for lung boundary extraction. In Pacific Rim International Conference on Artificial Intelligence, Auckland. Citeseer, 2004. 46

A. Misra, A. Sowmya, and P. Compton. Impact of quasi-expertise on knowledge acquisition in computer vision. In Proc. 24th Int. Conf. Image and Vision Computing New Zealand IVCNZ ’09, pages 334–339, 2009. doi: 10.1109/IVCNZ.2009.5378387. 130

Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. ISBN 0070428077, 9780070428072. 151

J. Mitchelson and A. Hilton. Simultaneous pose estimation of multiple people using multiple-view cues with hierarchical sampling. In In Proc. of BMVC, 2003. 36

A. Mittal and L.S. Davis. M 2 Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision, 51(3):189–203, 2003. 26

T.B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, 2001. 12

T.B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2-3):90– 126, 2006. 12

B.T. Morris and M.M. Trivedi. A survey of vision-based trajectory learning and analysis for surveillance. Circuits and Systems for Video Technology, IEEE Transactions on, 18(8): 1114–1127, 2008. ISSN 1051-8215. 33

J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957. 65

P. Natarajan and R. Nevatia. Online, Real-time Tracking and Recognition of Human Ac- tions. In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop on, pages 1–8, 2008. 14

A.M. Nazif and M.D. Levine. Low level image segmentation: an expert system. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 5(5):555–577, 1984. 46

J.C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 38

N.M. Oliver, B. Rosario, and A.P. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831–843, 2000. 44

Nikhil R Pal and Sankar K Pal. A review on image segmentation techniques. Pattern Recognition, 26(9):1277 – 1294, 1993. ISSN 0031-3203. doi: DOI:10. 1016/0031-3203(93)90135-J. URL http://www.sciencedirect.com/science/article/ B6V14-48MPPFT-1T0/2/e14ee69e6208152164c7c7791b53b6da. 20

C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Com- plexity. Courier Dover Publications, 1998. 179

A. Papoulis, S.U. Pillai, and S. Unnikrishna. Probability, random variables, and stochastic processes, volume 196. McGraw-hill New York, 1965. 28

S. Park and JK Aggarwal. Recognition of two-person interactions using a hierarchical Bayesian network. In First ACM SIGMM international workshop on Video surveillance, pages 65–76. ACM, 2003. ISBN 158113780X. 38, 44

S. Park and JK Aggarwal. Event semantics in two-person interactions. Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, 4, 2004. 44

E. Parzen. On estimation of a probability density function and mode. The annals of mathe- matical statistics, 33(3):1065–1076, 1962. ISSN 0003-4851. 26

S. Penny, J. Smith, and A. Bernhardt. Traces: Wireless full body tracking in the cave. In International Conference on Artificial Reality and Telexistence, 1999. URL http://citeseer.ist.psu.edu/penny99traces.html. 15, 42

D.P. Pertaub, M. Slater, and C. Barker. An experiment on public speaking anxiety in re- sponse to three different types of virtual audience. Presence: Teleoperators & Virtual Environments, 11(1):68–78, 2002. ISSN 1054-7460. 3

K.C. Pham and C. Sammut. Rdrvision-learning vision recognition with ripple down rules. In Proceedings of the Australian Conference on Robotics and Automation. Citeseer, 2005. 46

C.S. Pinhanez. Representation and Recognition of Action in Interactive Spaces. PhD thesis, Massachusetts Institute of Technology, 1999. 6, 174

AB Poritz. Hidden Markov models: a guided tour. Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on, pages 7–13, 1988. 38

M. Potmesil. Generating octree models of 3D objects from their silhouettes in a sequence of images. Computer Vision, Graphics, and Image Processing, 40(1):1–29, 1987. ISSN 0734-189X. 42

P. Preston, G. Edwards, and P. Compton. A 2000 rule expert system without a knowledge engineer. In Proc. AAAI Knowledge Acquisition for Knowledge Based Systems Workshop. Citeseer, 1994. 123

L. R. Rabiner. A tutorial on hidden Markov models and selected applications inspeech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. 38, 44

Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. Prentice- Hall, Inc., Upper Saddle River, NJ, USA, 1993. ISBN 0-13-015157-2. 45, 170

P.C. Ribeiro, J. Santos-Victor, and P. Lisboa. Human activity recognition from video: Mod- eling, feature selection and classification architecture. In Proceedings of International Workshop on Human Activity Recognition and Modelling, 2005. 148

D. Richards. Two decades of ripple down rules research. The Knowledge Engineering Re- view, 24(2):159–184, 2009. 128

D. Richards and P. Compton. Revisiting Sisyphus I - an incremental approach to resource allocation using ripple down rules. In 12th Banff Knowledge Acquisition for Knowledge Base System Workshop (KAW99), Canada, SRDG, volume 1. Citeseer, 1999. 123

R. Rosales and S. Sclaroff. 3 D trajectory recovery for tracking multiple objects and trajec- tory guided recognition of actions. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:117–123, 1999. 31

R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 721–727. IEEE, 2000. 38

R. Rosales, M. Siddiqui, J. Alon, and S. Sclaroff. Estimating 3D body pose using uncali- brated cameras. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1. Citeseer, 2001. 36

M. Rossi and A. Bozzoli. Tracking and counting moving people. Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, 3:212–216, 1994. 21

M. Roussos, A. Johnson, J. Leigh, C. Barnes, C.A. Vasilakis, and T.G. Moher. The NICE project: Narrative, immersive, constructionist/collaborative environments for learning in virtual reality. In Proceedings of ED-MEDIA/ED-TELECOM, volume 97, pages 917–922. Citeseer, 1997. 2

M. S. Ryoo and J. K. Aggarwal. Recognition of high-level group activities based on ac- tivities of individual members. In Proceedings of the 2008 IEEE Workshop on Mo- tion and video Computing, pages 1–8, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-1-4244-2000-1. doi: 10.1109/WMVC.2008.4544065. URL http: //portal.acm.org/citation.cfm?id=1549821.1549924. 33

T. Scheffer. Algebraic foundation and improved methods of induction of ripple down rules. In Proc. Pacific Knowledge Acquisition Workshop. Citeseer, 1996. 130, 132

J. Segen and S. Pingali. A camera-based system for tracking people in real time. Proceedings of ICPR, 96:63–67, 1996. 21

Y. Sheikh and M. Shah. Exploring the space of an action for human action recognition. In Proc. Int. Conf. Comput. Vision, volume 1, pages 144–149. Citeseer, 2005. 35, 45

GM Shiraz and C. Sammut. Combining knowledge acquisition and machine learning to con- trol dynamic systems. In INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, volume 15, pages 908–913. Citeseer, 1997. 123

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. Technical report, Microsoft Research, 2011. 17, 37

M. Slater and M. Usoh. Presence in immersive virtual environments. In Virtual Reality Annual International Symposium, 1993., 1993 IEEE, pages 90–96. IEEE, 2002. ISBN 0780313631. 3

M. Slater, A. Sadagic, M. Usoh, and R. Schroeder. Small-group behavior in a virtual and real environment: A comparative study. Presence: Teleoperators & Virtual Environments, 9(1):37–51, 2000. ISSN 1054-7460. 3

C. Sminchisescu and B. Triggs. A Robust Multiple Hypothesis Approach to Monocular Hu- man Motion Tracking. INRIA Research Report No, 4208, 2001. 20, 34

K. Smith, D. Gatica-Perez, J.M. Odobez, and S. Ba. Evaluating multi-object tracking. In Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, page 36. IEEE, 2005. 90

F. Sparacino. Scenographies of the past and museums of the future: from the wunderkammer to body-driven interactive narrative spaces. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 72–79. ACM, 2004. ISBN 1581138938. 23

A. Sridhar and A. Sowmya. SparseSPOT: using a priori 3-D tracking for real-time multi- person voxel reconstruction. In Proceedings of the 16th ACM Symposium on Virtual Re- ality Software and Technology, pages 135–138. ACM, 2009. 94

A. Sridhar and A. Sowmya. Distributed, multi-sensor tracking of multiple participants within immersive environments using a 2-cohort camera setup. Machine Vision and Ap- plications, pages 1–17, 2010. 48

Anuraag Sridhar and Arcot Sowmya. Multiple Camera, Multiple Person Tracking with Pointing Gesture Recognition in Immersive Environments. In Advances in Vi- sual Computing, pages 508–519. Springer, 2008. URL http://dx.doi.org/10.1007/ 978-3-540-89639-5_49. 104

Anuraag Sridhar, Arcot Sowmya, and Paul Compton. On-line, Incremental Learning for Real-Time Vision Based Movement Recognition. In The 9th International Conference on Machine Learning and Applications, 2010. 147

C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time track- ing. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pat- tern Recognition, 2:246–252, 1999. 20

J. Steuer. Defining virtual reality: Dimensions determining telepresence. Journal of com- munication, 42(4):73–93, 1992. ISSN 1460-2466. 3, 13

A. Sundaresan and R. Chellappa. Markerless Motion Capture using Multiple Cameras. In Proc. Computer Vision for Interactive and Intelligent Environment, pages 15–26, 2005. doi: 10.1109/CVIIE.2005.13. 37

T. Svoboda, D. Martinec, and T. Pajdla. A convenient multicamera self-calibration for virtual environments. Presence: Teleoperators & Virtual Environments, 14(4):407–422, 2005. 20

C. Theobalt, M. Magnor, P. Schuler, and H.P. Seidel. Combining 2D feature tracking and vol- ume reconstruction for online video-based human motion capture. In Computer Graphics and Applications, 2002. Proceedings. 10th Pacific Conference on, pages 96–103, 2002. 36

S. Tran, Z. Lin, D. Harwood, and L. Davis. UMD_VDT, an Integration of Detection and Tracking Methods for Multiple Human Tracking. Multimodal Technologies for Perception of Humans, 1:179–190, 2009. 23

R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of robotics and Au- tomation, 3(4):323–344, 1987. 18, 20

R.Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 86), pages 364–374. Miami Beach: IEEE Press, 1986. 18, 20

P. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 18(11):1473–1488, 2008. ISSN 1051-8215. 33

S. Wachter and HH Nagel. Tracking of Persons in Monocular Image Sequences. Computer Vision and Image Understanding, 74(3):174–192, 1999. 34, 36

T. Wada, H. Motoda, and T. Washio. Knowledge acquisition from both human expert and data. Advances in Knowledge Discovery and Data Mining, 1:550–561, 2001. 132

X. Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. Computer Vision–ECCV 2006, pages 110–123, 2006. 33

D. Waterman. A guide to expert systems. Addison-Wesley Pub. Co., Reading, MA, 1986. 122

D. Weinland, R. Ronfard, and E. Boyer. Motion history volumes for free viewpoint action recognition. In IEEE International Workshop on Modeling People and Human Interaction. Citeseer, 2005. 43

D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257, 2006. ISSN 1077-3142. 43

Daniel Weinland, Rémi Ronfard, and Edmond Boyer. A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition. Research Report RR-7212, INRIA, 02 2010. URL http://hal.inria.fr/inria-00459653/en/. 32

G. Welch and G. Bishop. An Introduction to the Kalman Filter. ACM SIGGRAPH 2001 Course Notes, 2001. 30

A.D. Wilson and A.F. Bobick. Realtime online adaptive gesture recognition. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 1, pages 270– 275. IEEE, 2002. ISBN 0769507506. 14

C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (7):780–785, 1997. 5

J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. Computer Vision and Pattern Recognition, 1992. Proceedings CVPR’92., 1992 IEEE Computer Society Conference on, pages 379–385, 1992. 44, 45

DB Yang, HH Gonzalez-Banos, and LJ Guibas. Counting people in crowds with a real-time network of simple image sensors. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 122–129, 2003. doi: {10.1109/ICCV.2003.1238325}. 26

Tao Yang, F. Chen, D. Kimber, and J. Vaughan. Robust People Detection and Tracking in a Multi-Camera Indoor Visual Surveillance System. In Multimedia and Expo, 2007 IEEE International Conference on, pages 675–678, 2007. doi: {10.1109/ICME.2007.4284740}. 26

K. Yoshinari and M. Michihito. A human motion estimation method using 3-successive video frames. Proc. of Int. Conf. on Virtual Systems and Multimedia GIFU, pages 135– 140, 1996. 20

Chih-Chang Yu, Hsu-Yung Cheng, Jenq-Neng Hwang, and Kuo-Chin Fan. Human body modeling with partial self occlusion from monocular camera. In Proc. IEEE Workshop Machine Learning for Signal Processing MLSP 2008, pages 193–198, 2008. doi: 10.1109/ MLSP.2008.4685478. 17

X. Zabulis, D. Grammenos, T. Sarmis, K. Tzevanidis, and AA Argyros. Exploration of large- scale museum artifacts through non-instrumented, location-based, multi-user interac- tion. In The 11th International Symposium on Virtual Reality, Archaeology and Cultural Heritage VAST. VAST, 2010. 26

O.J. Zeeshan, O. Javed, Z. Rasheed, K. Shafique, and M. Shah. Tracking Across Multiple Cameras With Disjoint Views. In The Ninth IEEE International Conference on Computer Vision, pages 952–957. Citeseer, 2003. 26

B. Zhan, D.N. Monekosso, P. Remagnino, S.A. Velastin, and L.Q. Xu. Crowd analysis: a survey. Machine Vision and Applications, 19(5):345–357, 2008. 12

Z. Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, 2000. ISSN 0162-8828. 20

Tao Zhao and R. Nevatia. Tracking multiple humans in complex situations. Pattern Analy- sis and Machine Intelligence, IEEE Transactions on, 26(9):1208–1221, 2004. ISSN 0162- 8828. 27, 34, 35

Tao Zhao, R. Nevatia, and Bo Wu. Segmentation and Tracking of Multiple Humans in Crowded Environments. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(7):1198–1211, 2008. ISSN 0162-8828. doi: {10.1109/TPAMI.2007.70770}. 27, 34
