Computer Vision Based Interfaces
for Computer Games
Jonathan Rihan
Thesis submitted in partial fulfilment of the requirements of the award of
Doctor of Philosophy
Oxford Brookes University
October 2010

Abstract
Interacting with a computer game using only a simple web camera has seen a great deal of success in the computer games industry, as demonstrated by the numerous computer vision based games available for the Sony PlayStation 2 and PlayStation 3 game consoles. Computational efficiency is important for these human computer interaction applications, so for simple interactions a fast background subtraction approach is used that incorporates a new local descriptor which uses a novel temporal coding scheme that is much more robust to noise than the standard formulations. Results are presented that demonstrate the effect of using this method for code label stability.
Detecting local image changes is sufficient for basic interactions, but exploiting high-level information about the player’s actions, such as detecting the location of the player’s head, the player’s body, or ideally the player’s pose, could be used as a cue to provide more complex interactions. Taking an object detection approach to this problem, a combined detection and segmentation method is explored that uses a face detection algorithm to initialise simple shape priors, demonstrating that good real-time performance can be achieved for face texture segmentation.
Ultimately, knowing the player’s pose solves many of the problems encountered by simple local image feature based methods, but is a difficult and non-trivial problem. A detection approach is also taken to pose estimation: first as a binary class problem for human detection, and then as a multi-class problem for combined localisation and pose detection.
For human detection, a novel formulation of the standard chamfer matching algorithm as an SVM classifier is proposed that allows shape template weights to be learnt automatically. This allows templates to be learnt directly from training data even in the presence of background and without the need to pre-process the images to extract their silhouettes. Good results are achieved when compared to a state of the art human detection classifier.
For combined pose detection and localisation, a novel and scalable method of exploiting the edge distribution in aligned training images is presented to select the most potentially discriminative locations for local descriptors that allows a much higher space of descriptor configurations to be utilised efficiently. Results are presented that show competitive performance when compared to other combined localisation and pose detection methods.

Dedicated to my parents and my family, to family Dahm, and to Susanne.
In memory of Elke Dahm, and Winnie Smith.
Acknowledgements
During the time I have spent studying at Oxford Brookes, I have had the honour and the pleasure of meeting and working with many very talented and enthusiastic people.
First I would like to thank my director of studies Professor Philip H. S. Torr for all his valuable guidance, advice, enthusiasm and support that allowed me to develop my skills as a researcher and gain an understanding of computer vision. I would also like to thank my second supervisor Dr Nigel Crook for his advice and valuable feedback during writing, and would like to thank Professor William Clocksin for the teaching opportunities that gave me valuable teaching experience during my final year of study.
Of those I have worked with while studying in the Oxford Brookes Computer Vision group, I would like to thank M. Pawan Kumar, Pushmeet Kohli, Carl Henrik Ek, Chris Russel, Gregory Rogez, Karteek Alahari, Srikumar Ramalingam, Paul Sturgess, David Jarzebowski, Lubor Ladicky, Christophe Restif, Glenn Sheasby and Fabio Cuzzolin for all the interesting research discussions.
During my time at the London Studio of Sony Computer Entertainment Europe, I worked under the supervision and guidance of Diarmid Campbell who I would like to thank for giving me valuable insight into applied computer vision and the games industry. From within the Sony EyeToy R&D group I would also like to thank Simon Hall, Nick Lord, Graham Clemo, Sam Hare, Dave Griffiths and Mark Pupilli.
I would like to thank my parents for all the support they have given me during the many years of my studies. I would also like to thank Elke and Erdmann Dahm and family for their support and encouragement during the final year of writing.
Finally, I would like to thank Susanne Dahm for her unwavering support and understanding during my studies, and in whom I found the strength to do more than I ever thought I could.
Contents
1 Introduction 1
1.1 Defining ‘Real-time’...... 1
1.2 Motivation...... 2
1.3 Approach...... 3
1.3.1 Detecting Movement...... 4
1.3.2 Face Detection and Segmentation...... 5
1.3.3 Human Detection...... 5
1.3.4 Pose Estimation...... 6
1.4 Contributions...... 6
1.5 Thesis Structure...... 7
1.6 Publications...... 8
2 Background 9
2.1 Computer Vision in Games...... 9
2.1.1 Nintendo Game Boy Camera (1998)...... 9
2.1.2 SEGA Dreamcast Dreameye Camera (2000)...... 10
2.1.3 Sony EyeToy (2003)...... 11
2.1.4 Nintendo Wii (2006)...... 13
2.1.5 XBox LIVE Vision (2006)...... 13
2.1.6 Sony Go!Cam (2007)...... 14
2.1.7 Sony PlayStation Eye (2007)...... 15
2.2 Types of Camera...... 16
2.2.1 Monocular (standard webcam)...... 17
2.2.2 Stereo...... 17
2.2.3 Depth, or Z-Cam...... 18
2.3 Problem Domain...... 18
2.3.1 Typical Computer Games...... 19
2.3.2 Computational Constraints...... 20
2.4 Background Subtraction and Segmentation...... 20
2.4.1 Background Subtraction...... 20
2.4.2 Background Subtraction using Local Correlation...... 24
2.4.3 Segmentation...... 27
2.4.4 Graph-Cut based Methods...... 29
2.5 Human Detection and Pose Estimation...... 32
2.5.1 Learning Problem...... 32
2.6 Human Detection...... 33
2.6.1 Generative...... 35
2.6.2 Discriminative...... 35
2.6.3 Chamfer Matching...... 38
2.7 Pose Estimation...... 40
2.7.1 Generative...... 41
2.7.2 Discriminative...... 44
2.7.3 Combined Detection and Pose Estimation...... 46
3 Fast Background Subtraction 47
3.1 Problem: Where is the player?...... 48
3.2 Image Differencing...... 48
3.3 Motion Button Limitations...... 50
3.4 Persistent Buttons...... 50
3.5 Algorithm Overview...... 51
3.5.1 Sub-sampling...... 53
3.5.2 Intensity Image...... 53
3.5.3 Blur Filtering...... 53
3.5.4 Code Map...... 54
3.6 Algorithm Details...... 54
3.6.1 Sub-sampling and Converting to an Intensity Image...... 54
3.6.2 Blur Filtering...... 55
3.7 Noise Model...... 56
3.8 Code Map...... 57
3.8.1 Local Binary Patterns...... 57
3.8.2 3 Label Local Binary Patterns (LBP3)...... 58
3.8.3 Temporal Hysteresis...... 59
3.9 Experiments...... 63
3.9.1 Comparisons...... 63
3.9.2 Table-top Game...... 64
3.9.3 Human Shape Game...... 64
3.10 Results...... 65
3.11 Discussion and Future Work...... 66
4 Detection and Segmentation 68
4.1 Shape priors for Segmentation...... 69
4.2 Coupling Face Detection and Segmentation...... 70
4.3 Preliminaries...... 71
4.3.1 Face Detection and Localisation...... 71
4.3.2 Image Segmentation...... 72
4.3.3 Colour and Contrast based Segmentation...... 72
4.4 Integrating Face Detection and Segmentation...... 73
4.4.1 The Face Shape Energy...... 74
4.5 Incorporating the Shape Energy...... 74
4.5.1 Pruning False Detections...... 76
4.6 Implementation and Experimental Results...... 77
4.6.1 Handling Noisy Images...... 77
4.7 Extending the Shape Model to Upper Body...... 77
4.8 Discussion and Future Work...... 79
5 Human Detection 82
5.1 Suitable Algorithms...... 83
5.2 Features...... 83
5.2.1 Histogram of Oriented Gradients...... 84
5.2.2 DAISY...... 88
5.2.3 SURF...... 89
5.2.4 Chamfer Distance Features...... 90
5.3 Human Detector...... 96
5.3.1 Training...... 97
5.3.2 Hard Examples...... 98
5.4 Datasets...... 98
5.4.1 HumanEva...... 98
5.4.2 INRIA...... 99
5.4.3 Mobo...... 99
5.4.4 Upper Body Datasets...... 100
5.5 Experiments...... 100
5.5.1 Human Detection...... 100
5.5.2 Upper Body Detection...... 102
5.6 Discussion...... 102
5.6.1 Computational Efficiency...... 104
5.6.2 Chamfer SVM...... 106
5.6.3 Edge Thresholding...... 107
5.6.4 Bagging, Boosting and Randomized Forests...... 108
5.7 Summary and Future Work...... 110
6 Pose Detection 112
6.1 Introduction...... 114
6.1.1 Related Previous Work...... 114
6.1.2 Motivations and Overview of the Approach...... 116
6.2 Selection of Discriminative HOGs...... 119
6.2.1 Formulation...... 120
6.3 Randomized Cascade of Rejectors...... 123
6.3.1 Bottom-up Hierarchical Tree construction...... 123
6.3.2 Randomized Cascades...... 126
6.3.3 Application to Human Pose Detection...... 132
6.4 Experiments...... 134
6.4.1 Preliminary results training on HumanEva...... 135
6.4.2 Experimentation on MoBo dataset...... 137
6.5 Conclusions and Discussions...... 144
7 Conclusion 146
7.1 Summary of Contributions...... 146
7.2 Future Work...... 148
A Code Documentation 149
A.1 System Overview...... 149
A.1.1 Shared Projects...... 149
A.1.2 Project: CapStation...... 150
A.1.3 Project: HogLocalise...... 151
A.2 C++: Shared Libraries...... 153
A.2.1 Common...... 153
A.2.2 ImageIO...... 154
A.3 MEX: cppHogLocalise...... 154
A.3.1 Usage...... 154
A.3.2 Options: Scanning...... 155
A.3.3 Options: Classification...... 157
A.3.4 Related Source Code Folders...... 161
References 162
List of Tables
6.1 2D Pose Estimation Error on HumanEva II...... 135
6.2 Classifiers - Training Time...... 141
6.3 Classifiers - Detection Time...... 143
List of Figures
1.1 Typical Camera Position...... 2
1.2 Camera View...... 3
2.1 Nintendo Game Boy Camera...... 10
2.2 Sega Dreameye...... 11
2.3 Sony EyeToy Camera...... 11
2.4 Sony EyeToy Games...... 12
2.5 Sony EyeToy Peripherals...... 13
2.6 Microsoft XBox Live Vision Camera...... 14
2.7 Sony PlayStation Portable Go!Cam...... 15
2.8 Sony PlayStation Eye camera...... 16
2.9 Camera Bayer Pattern...... 17
2.10 hsL Colour Space...... 27
2.11 GraphCut Illustration...... 30
2.12 Supervised Machine Learning Diagram...... 33
2.13 Chamfer distance...... 39
2.14 Chamfer Pose Templates...... 46
3.1 Image Differencing...... 49
3.2 Background Image Differencing...... 51
3.3 LBP Code...... 52
3.4 LBP3 Subsampling...... 54
3.5 Blur Filter...... 56
3.6 LBP3 Code Response To Lighting...... 59
3.7 LBP3 Labels...... 60
3.8 LBP3 Code Map Construction...... 60
3.9 LBP3 Code Map Differencing...... 61
3.10 LBP3 Hysteresis...... 62
3.11 Persistent Button Applications...... 63
3.12 LBP3 Hysteresis Results...... 65
3.13 LBP3 Hysteresis FPR...... 66
3.14 Shadow Results...... 67
3.15 Camera Drift Effect on LBP3...... 67
4.1 Real-Time Face Segmentation...... 69
4.2 Face Shape Energy...... 74
4.3 Energy Function Terms...... 75
4.4 False Positive Pruning...... 76
4.5 Face Segmentation Results...... 77
4.6 Effect of Smoothing...... 78
4.7 Upper Body Model...... 79
4.8 Optimising Body Parameters...... 80
4.9 Upper Body Segmentation Results...... 81
5.1 HOG Descriptor...... 85
5.2 Integral Image...... 86
5.3 Haar-like Rectangular Features...... 87
5.4 Integral Histogram HOG...... 88
5.5 DAISY Descriptor Construction...... 88
5.6 SURF Descriptor...... 89
5.7 Hausdorff Distance...... 92
5.8 Dilation Example...... 93
5.9 Hausdorff PAC Classifier Results...... 94
5.10 Chamfer Distance Features...... 97
5.11 Human Detector Window Construction...... 98
5.12 HumanEva Detection Dataset...... 99
5.13 INRIA Dataset...... 99
5.14 MoBo Dataset...... 100
5.15 INRIA: SVM Classifier Feature Comparison...... 101
5.16 Mobo Upper Body: SVM Classifier Feature Comparison...... 103
5.17 Chamfer SVM Weights...... 104
5.18 Feature Vector Timings...... 105
5.19 Canny Edge Thresholds...... 108
5.20 Chamfer SVM Classifier Edge Thresholds...... 108
5.21 SVM and SGD-QN SVM Comparison...... 109
5.22 SVM and FEST Algorithms Comparison...... 110
6.1 Random Forest Preliminary Results...... 117
6.2 Log-likelihood Ratio for Human Pose...... 119
6.3 Log-likelihood Ratio for Face Expressions...... 122
6.4 Selection of Discriminative HOG for Face Expressions...... 123
6.5 Bottom-up Hierarchical Tree learning...... 124
6.6 Hog Block Selection...... 127
6.7 Rejector branch decision...... 130
6.8 Single Cascade Localisation...... 131
6.9 Pose Detection...... 133
6.10 Pose Localisation...... 134
6.11 Pose Detection Results on HumanEva II Data Set...... 135
6.12 Pose Detection Result with a Moving Camera...... 136
6.13 MoBo Dataset...... 138
6.14 Defined Classes on MoBo...... 139
6.15 Pose Classification Baseline Experiments on MoBo...... 140
6.16 Localisation Dataset...... 141
6.17 Detection Results...... 142
6.18 Best Cascade Localisation Results...... 143
A.1 System Diagram of Main Applications...... 149
A.2 Main Stages of Sliding Window Algorithm...... 151
A.3 Scale Search Method...... 152
A.4 HOG Feature Scaling...... 153
Notation
The following conventions are used in this thesis. Sets, matrices or special functions are upper-case characters, such as X or Y(θ). Functions or scalar values are lower-case characters such as a, n, f(n) and h(g). Vectors are lower-case bold characters, x and p, and their components are lower-case with a subscript index, e.g. {x₁, x₂, x₃, …, xₙ} or pr, pg, pb. A number followed by a subscript denotes the base of the number, e.g. 5₁₀ = 0101₂; if no subscript is specified then the number is assumed to be decimal.
CHAPTER 1
Introduction
The ability to interact with a computer game without the requirement of a traditional game controller device provides a player with the unique ability to use their own body to directly interact with the game they are playing.
This thesis is concerned with the problem of real-time human computer interaction for computer games. Single camera systems are (at the time of writing) by far the most common type of system available, and as such are the problem domain of the algorithms discussed.
Many existing camera based games, such as the numerous games available for the Sony PlayStation 2 system in conjunction with Sony’s EyeToy camera (§2.1.3), all encourage the player to stand up and take a more physical role in the game play, in contrast with their usual seated gaming position. Indeed, some games such as Sony EyeToy Kinetic have taken this interaction concept even further and try to provide exercise training routines to help players keep fit.
These games tend to employ computer vision algorithms that are both computationally efficient, so as not to take too much computation time away from the rest of the media the game has to update, and robust, so that the player isn’t frustrated by the technology failing at critical moments.
1.1 Defining ‘Real-time’
Throughout the thesis, the term ‘real-time’ is frequently used to describe and compare the performance and practicality of different algorithms for computer vision based interface problems. Generally speaking, the goal of any algorithm used in this context is to process images fast enough to give the user a sufficient level of interaction with the interface.
The term ‘interactive frame rate’ could also be used to describe the performance of the algorithms discussed in this thesis, but the term ‘real-time’ is more widely used.
Where ‘real-time’ is used in the text, it is taken to mean that there is sufficient performance for user interaction, and may vary depending on the context.
1.2 Motivation
Offering the ability for a player to use their body to interact with a computer game has already seen a great deal of success in the computer games industry, and demonstrates that there is a lot of interest in using this technology as an alternative to traditional game controller devices.
The game systems responsible for much of the success of computer vision based computer games have been the Sony PlayStation 2 and PlayStation 3 game consoles. See section 2.1 for a history of how computer vision based games have evolved over the last decade.
Figure 1.1: Illustration showing a typical camera position for a camera based computer games system. The camera location is highlighted in red, and the viewing frustum is highlighted in green. The image displayed on the television in the diagram has been flipped horizontally.
Game systems such as these typically use a single web camera that is usually placed on top of the television set, roughly aligned to the screen centre. Figure 1.1 illustrates the camera placement used for most vision based games. The camera is highlighted in red, and the viewing frustum of the camera is highlighted in green.
The video stream recorded by the camera is displayed on the TV screen usually as a horizontally flipped image so that the movements displayed on the screen behave like
a mirror of the player’s actions. Gaming components are overlaid on top of the live video stream and the player must move their body to interact with the components.
Figure 1.2 shows an example video frame captured from the camera in figure 1.1. The image on the left is the original video image captured from the point of view of the camera, and the image on the right is displayed on the television screen facing the player. The image has been flipped horizontally to make interactions more intuitive.
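The mirrored display described above amounts to a horizontal flip of each captured frame. As a minimal sketch (using NumPy arrays as a stand-in for the console's video frames, which is an assumption):

```python
import numpy as np

def mirror_frame(frame):
    """Flip an H x W x C video frame horizontally so the on-screen image
    behaves like a mirror reflection of the player's movements."""
    return frame[:, ::-1, :]

# Tiny 1 x 3 RGB frame: red, green, blue pixels from left to right.
frame = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
mirrored = mirror_frame(frame)
# After mirroring, the blue pixel is leftmost and red is rightmost.
```

Flipping the display rather than the processed image also means any detection algorithm can run on the raw camera frame, with only the final coordinates mirrored for rendering.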
Interpreting this type of image poses some difficult problems, however. Generally a living room environment contains many varied types of object in the background¹. In the example frame presented in figure 1.2 there are two picture frames, a sofa, a wall, and a floor.
The problem of detecting areas of the image that belong to the player is made more complicated by the presence of these items, as their appearance can contain texture or colours similar to the clothing of the player. It is unreasonable to expect players to clear their living space completely of these items to make algorithms more robust, so any algorithms considered must ideally be able to cope with this kind of background appearance variation.
Figure 1.2: Left: Video image as seen from the point of view of the camera in figure 1.1. Right: Video image displayed on the TV screen with superimposed graphics. Notice that the image is flipped horizontally so that the displayed image behaves like a mirror reflection.
1.3 Approach
The work presented in this thesis considers two approaches to the problem of detecting the movements of the player with the goal of extending the types of interactions that can be used in computer games.
¹Here ‘background’ simply means anything that is not the player.
The first approach (§3) presents an algorithm that uses simple image features to detect changes in the image useful for realising computer interactions.
The second approach is based on the observation that any changes detected by these low-level algorithms are a direct consequence of the player moving their body, so being able to detect the location of a body part (§4), the location of the whole body (§5), or ideally the player’s pose (§6) should help solve some of the interaction problems encountered by simple low-level algorithms.
In the context of computer games and entertainment, the types of computer vision algorithms that can be used are generally limited to those that can be considered to run in real-time: algorithms that process an image fast enough for a player to see the result of their actions affecting the computer game environment in some way.
As an example of why this is important, consider a game where a player is bombarded by objects and must swat them aside to earn points (see right-most image in figure 1.2). Clearly, if the algorithm used to locate the player’s movements² were too slow, then swatting the objects would be impossible, since by the time the control system responds the object would have moved past the area of interaction.
1.3.1 Detecting Movement
Detecting movement can be useful to trigger user interface controls by encouraging the player to ‘wave’ their hand over them for a given time, as demonstrated in Sony EyeToy games.
This can be posed as the problem of detecting changes in the image since the previous frame. Simple algorithms such as image differencing, where the current video frame is subtracted from the previous frame and the result thresholded to detect changes, have seen a great deal of success in many of the EyeToy series of games, demonstrating that game mechanics can be created to work within the limitations of a simple algorithm.
However, simple image differencing is contrast dependent, so it can be problematic to detect changes in areas that are a similar colour to a part of the player. Camera sensor noise can be a problem particularly in poorly lit environments, as can very slow or subtle movements that cause gradual change below the detection threshold of the image differencing algorithm.
Another problem is that this method only detects changes between frames, and
²Either indirectly by detecting low-level image changes, or directly using object detection or pose.
cannot detect anything if the player remains stationary on top of a game object. Chapter 3 addresses some of these problems and presents an alternative algorithm that takes steps toward allowing this kind of interaction to be realised.
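The image-differencing scheme described above can be sketched in a few lines. This is a minimal illustration only; the threshold value is an assumption, not one taken from any shipped game:

```python
import numpy as np

def frame_difference_mask(prev, curr, threshold=25):
    """Classic image differencing: mark pixels whose absolute intensity
    change between consecutive frames exceeds a fixed threshold."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > threshold

prev = np.zeros((4, 4), dtype=np.uint8)   # static background frame
curr = prev.copy()
curr[1:3, 1:3] = 200                      # a bright region "moves" in
mask = frame_difference_mask(prev, curr)
# mask is True only over the 2 x 2 changed region; a stationary player
# produces no change at all, which is exactly the limitation noted above.
```

Note that the sketch makes the stationary-player problem concrete: calling `frame_difference_mask(curr, curr)` returns an all-False mask regardless of where the player is standing.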
1.3.2 Face Detection and Segmentation
A useful case of object detection that naturally lends itself to a computer game application is face detection. To see what is happening within the game they are playing, the player must face their television, and since this is where the camera is situated, the face of the player should be visible for the majority of the game.
Detecting faces can be a useful cue for interaction, and there are good algorithms for fast and accurate face detection [170] that can be employed to locate the player’s face. Once the location of the player’s face is known, it can be used to find roughly where the rest of the player is within the video frame. Once the face has been localised it also becomes possible to perform other actions, like placing the face on a computer generated avatar, or passing the extracted face texture to a recognition algorithm for identifying individual players within a multi-user interface [157].
Chapter 4 presents a method to achieve segmentation in real-time using an off-the-shelf face detection algorithm to first find a region of interest for the segmentation algorithm.
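One simple way a face detection output can seed a segmentation step is to expand the detected face box into a larger region of interest. The sketch below is illustrative only: the function name and the 1.5x expansion factor are assumptions, not details of the thesis implementation:

```python
def roi_from_face(face_box, image_size, scale=1.5):
    """Expand a detected face bounding box (x, y, w, h) into a larger
    region of interest for a segmentation step, clipped to the image."""
    x, y, w, h = face_box
    img_w, img_h = image_size
    cx, cy = x + w / 2.0, y + h / 2.0                 # box centre
    half_w, half_h = (w * scale) / 2.0, (h * scale) / 2.0
    x0, y0 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    x1, y1 = min(img_w, int(cx + half_w)), min(img_h, int(cy + half_h))
    return x0, y0, x1, y1

# A 50 x 50 face detected at (100, 100) in a 640 x 480 frame.
roi = roi_from_face((100, 100, 50, 50), (640, 480))
```

Restricting the (comparatively expensive) segmentation to this window is what makes the real-time budget attainable: the segmentation cost then scales with the face size rather than the full frame.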
1.3.3 Human Detection
Detecting and localising a human is a complex object detection problem. Humans have a high variance in appearance and pose, so an accurate detector must be able to deal with these variations. Knowing the location of the player would allow more complex algorithms to be applied, and given that the player will be present in front of the camera when playing the game, a human detector should be able to find the player reliably.
Some common approaches to this problem are discussed in section 2.6. Among these methods, fast detection algorithms such as chamfer matching [20, 63, 66] can achieve real-time performance, but the requirement that silhouettes must be extracted from the training images before the classifier is learnt can limit the number of training instances that are practical for use with the algorithm.
Chapter 5 presents an algorithm that formulates the chamfer matching algorithm as an SVM classifier, where the weights of the SVM represent the template, allowing a general chamfer template to be learnt automatically from the training data.
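For reference, the quantity that chamfer matching minimises — the mean distance from each template edge point to its nearest image edge point — can be sketched as follows. This is a brute-force toy version; real implementations precompute a distance transform of the edge image instead of searching over all point pairs:

```python
import numpy as np

def chamfer_score(template_pts, image_edge_pts):
    """Mean distance from each template edge point to the nearest edge
    point in the image: zero for a perfect overlap, growing as the
    template drifts away from the image edges."""
    t = np.asarray(template_pts, dtype=float)    # (N, 2) template points
    e = np.asarray(image_edge_pts, dtype=float)  # (M, 2) image edge points
    dists = np.linalg.norm(t[:, None, :] - e[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

edges = [(0, 0), (0, 1), (1, 1)]
aligned = chamfer_score(edges, edges)                       # perfect overlap
shifted = chamfer_score([(x + 1, y) for x, y in edges], edges)  # misaligned
```

The SVM formulation in Chapter 5 can be read against this: instead of every template point contributing equally to the mean, each location carries a learnt weight.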
1.3.4 Pose Estimation
The methods mentioned in the previous sections attempt to extract basic location information about which parts of the image contain parts of the player, so that interactions can affect game objects or controls that are superimposed on top of the video frame. Since these interactions are all dependent on the location or motion of the user, they can all be achieved by determining the pose of the individual.
If pose is known, then user interface controls such as floating buttons can be interacted with simply by tracking the position of the user’s hand. Knowing pose also means events can be triggered for a specific hand only, and can even stop a control from activating in error when an arm, the head or another body part enters the area. For exercise and fitness games, pose could enable the game to provide feedback to the player on how well they are performing an action.
Pose estimation is of course a very difficult and non-trivial problem, particularly so for real-time applications. Ambiguities arise quite easily when using information from a single camera, such as which way round a player’s legs are when viewed from the side, or how far their hand is from the camera (since depth information isn’t available from a single camera).
There are many approaches to the problem of pose estimation, and they are discussed in more detail in section 2.7. Many methods use a two stage approach where the image is first processed by a computationally efficient algorithm (such as background subtraction) to identify locations in the image where the human is most likely to be, and then a more computationally expensive method is applied to determine pose. However, only a few methods consider the combined problem of simultaneously detecting and estimating the pose of a human within an image [17, 44, 116].
Chapter 6 proposes a method that exploits the distributions of edges over training set examples to select the most discriminative locations at which to place local features for discriminating between different classes, where the classes represent discrete poses, and constructs an efficient hierarchical cascade classifier from them. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes.
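The idea of exploiting edge distributions can be illustrated with a toy version: rank pixel locations by how often an edge occurs there across aligned training windows, and place descriptors at the highest-ranking positions. (Chapter 6 uses a discriminative log-likelihood-ratio formulation over positive and negative classes; the sketch below keeps only the frequency-ranking idea and is an assumption, not the thesis algorithm.)

```python
import numpy as np

def select_descriptor_locations(edge_maps, k=1):
    """Rank (row, col) locations by the fraction of aligned training edge
    maps that contain an edge there, and return the top k locations."""
    freq = np.mean(np.stack(edge_maps).astype(float), axis=0)
    order = np.argsort(freq, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, freq.shape))
            for i in order]

# Three tiny 3 x 3 "edge maps": every example has an edge at (1, 1),
# plus one example-specific noise edge each.
maps = [np.zeros((3, 3), dtype=np.uint8) for _ in range(3)]
for m in maps:
    m[1, 1] = 1
maps[0][0, 0] = 1
maps[1][2, 2] = 1
best = select_descriptor_locations(maps, k=1)
# The consistently-edged location (1, 1) beats the noise edges.
```

The scalability benefit follows directly: ranking candidate locations once, offline, avoids evaluating a classifier over every possible descriptor configuration.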
1.4 Contributions
The contributions made by this thesis are as follows:
1. An extension to the local binary pattern descriptor that is more robust to noise than the standard formulations.
2. A detection and segmentation approach that demonstrates how combining detection algorithms with simple shape priors can achieve real-time performance for face texture segmentation, and presents some initial results on using the segmentation to help prune false positive detections.
3. A novel formulation of the standard chamfer matching algorithm as an SVM clas- sifier that allows shape template weights to be learnt automatically without the need to pre-process them to find their silhouettes.
4. A method of exploiting the edge distribution in aligned training images to select discriminative locations for local descriptors that allows a much higher space of descriptor configurations to be utilised efficiently.
1.5 Thesis Structure
Chapter 2 This chapter presents a history of the types of systems that have been employed for single camera computer vision based computer games, and presents the background literature on the methods addressed by the algorithms presented in the following chapters.
Chapter 3 Being able to identify the specific areas of an image that belong to the player and not to the background environment is an important problem that can be used to interact with objects and user interface controls presented by the game. This chapter takes steps towards solving this problem with a background subtraction method that uses fast local features.
Chapter 4 This chapter presents a method of detection and segmentation using a sim- ple shape prior to extract the texture belonging to the face of the player which can be utilised by other game interactions. Some initial results are also presented on using the result of the segmentation algorithm to aid detection.
Chapter 5 The problem of fast human detection is addressed in this chapter, and an alternative formulation of the chamfer template matching algorithm is presented that expresses the shape template as the weight vector of an SVM classifier, allowing training examples to be used that also contain background information, without the need to pre-process them to find their silhouettes. The algorithm is then compared with a state of the art detection method.
Chapter 6 A novel algorithm to jointly detect and estimate the pose of humans in single camera images is presented in this chapter, and comparisons are made to other fast state of the art methods. The algorithm exploits the distributions of edges over the training set examples to select the most discriminative locations at which to place local features for discriminating between different classes. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes. Parts of this chapter have previously appeared in the publication listed in section 1.6.
1.6 Publications
Parts of the material that appear in Chapter 4 have previously been published in the following publications (and can be found at the end of the thesis):
1. Jon Rihan, Pushmeet Kohli, Philip H.S. Torr, ObjCut for Face Detection, In ICVGIP (2006)
2. Pushmeet Kohli, Jon Rihan, Matthieu Bray, Philip H.S. Torr, Simultaneous Segmentation and Pose Estimation of Humans using Dynamic Graph Cuts, In International Journal of Computer Vision, Volume 79, Issue 3, pages 285–298, 2008
Parts of the material that appear in Chapter 6 are extensions to material that has previously appeared in the following publication (which can also be found at the end of the thesis):
1. Gregory Rogez, Jon Rihan, Srikumar Ramalingam, Carlos Orrite, Philip H.S. Torr, Randomized Trees for Human Pose Detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008
CHAPTER 2
Background
Computer vision as a method of human computer interaction for entertainment media has been an active area of research for over a decade. Major entertainment companies such as Sony, Nintendo and Microsoft have all looked towards computer vision to expand the types of experiences they can provide to home and business customers.
This chapter contains a review of computer vision based user interface research employed in the entertainment industry, from initial concepts up to the state-of-the-art systems in use today. The chapter then goes on to look at the main problem areas within computer vision that typically need to be addressed to make these systems robust in real world environments, and gives a review of current research literature concerned with solving these problems within the constraints provided by current games hardware.
2.1 Computer Vision in Games
This section introduces the systems made available by major entertainment companies such as Sony, Nintendo and Microsoft in roughly chronological order, so that the simpler systems are presented first and the more advanced and currently used systems are introduced later. This is to highlight the evolution of computer vision driven entertainment media over the past decade.
2.1.1 Nintendo Game Boy Camera (1998)
Freeman et al. [59, 60] from Mitsubishi’s MERL laboratory used simple, low-level vision image processing algorithms implemented on a hardware chip to try and achieve
Figure 2.1: Shown here is the Nintendo Game Boy (left), the Game Boy Camera device (middle), and an example photo taken by the camera (right).
interactive frame rates required by human computer interaction. The user interface methods discussed in the paper are tested in typically quite clean environments (e.g. a player stands in front of a white wall), so finding useful information is made easier.
This chip was later used by Nintendo in their Game Boy hand-held system, as shown in figure 2.1. On this system no games actually used the camera as a control device in the manner discussed by the authors; it simply provided a means to take photographs and apply filter effects. This interactivity alone, however, was enough to sell the product to consumers, demonstrating that even simple visual systems can be entertaining to players.
2.1.2 SEGA Dreamcast Dreameye Camera (2000)
Although SEGA is no longer competing in the games console market, they did release a camera peripheral for their last console, the SEGA Dreamcast, which was intended as a webcam and photo device. The camera was only released in Japan.
The resolution of the camera is 640x480 for still images, and when connected to the Dreamcast games console its picture resolution can be either 640x480, 320x240, or 160x240. However, since the camera's data transfer rate is only 4Mbps (524,288 bytes per second), the frame rate is severely limited, and only very low resolution video is possible at interactive frame rates.
The camera was demonstrated at the Tokyo Game Show in 2000, showing its video conferencing application and a simple computer vision interface game (see figure 2.2). The software finally released with the device was primarily for editing photos, video conferencing with other Dreameye owners or recording short videos.
Figure 2.2: The Sega Dreameye (left), and images from a demonstration at the Tokyo Game Show in 2000 of a video conferencing game (middle) and a simple colour based computer vision game (right) in which the colour red causes the frog to move.
Figure 2.3: The EyeToy camera device with a game (left), EyeToy camera close up (right).
2.1.3 Sony EyeToy (2003)
Early in 2002, Dr Richard Marks joined Sony to work on the EyeToy project, with the idea of using a webcam as an input device for a computer game. SCEE's London studio created four technical demos that illustrated the potential of a video camera as an input device, and the Sony EyeToy camera launched with its first games in 2003.
The camera itself (see figure 2.3) is a simple webcam capable of 320x240 or 640x480 resolution at 50 or 60Hz. The image quality is relatively low for a web camera, but the frame rate is very high, which is ideal for interactive computer games.
Control Method
To play EyeToy games, the EyeToy camera is placed centred on top of a television set and the player stands in front of the camera. The games generally display the video feed of the player on the television screen and superimpose the game graphics over the top (see figure 2.4).
A few games use a technology called Digimask that allows players to place their face on digital characters within a game using the camera. This technology comes
Figure 2.4: A selection of EyeToy games. From left to right: EyeToy Kinetic, EyeToy Play 2, EyeToy Kinetic Combat, and AntiGrav
under the name EyeToy: Cameo, and several games support it as an extra feature. EyeToy: Cameo is not really a control mechanism, however; it simply gives the player another way of personalising their gaming experience.
When the camera is used as a control interface, the player generally interacts by waving their hand over certain portions of the screen and the game uses this stimulus for navigating a menu interface, or interacting with a game object (e.g. swatting a monster or something similar).
Due to the limited processing power available on the console, only simple low-level computer vision algorithms can be used. Despite this, a surprising number of game mechanics can be created using these simple processing techniques, as the EyeToy: Play series of games demonstrates.
More Advanced Control
A few games use more advanced image processing algorithms and track parts of the player's body to control a game avatar. One game that does this particularly well is AntiGrav, in which the aim is to navigate around a racing course on a floating board. Once calibrated, the player can move their head around the screen to steer their avatar around the course, and move their arms up and down to collect objects at different heights along the race track (see the rightmost image in figure 2.4).
Unlike many of the other games released for the EyeToy device, AntiGrav does not display a live video feed of the player during the game. It instead displays the result of the tracking system on the lower right of the screen, showing where the player’s head and hands are in relation to the control’s neutral position.
Colour Tracking
A few new games are being sold with brightly coloured game peripherals that can be easily tracked using simple computer vision colour tracking methods (see figure 2.5).
Figure 2.5: Two EyeToy games that use peripherals to aid computer vision colour tracking algorithms. Left: EyeToy Play: Hero. Right: EyeToy Play: Pom Pom Party
Typically these items are coloured bright green or bright pink to reduce the likelihood that a player’s living room might contain the same colour.
By using two different colours, one for each hand, multiple object tracking is made easier since there’s little chance of one object being mistaken for the other with the colour tracking enabled.
2.1.4 Nintendo Wii (2006)
Nintendo released a new console in 2006 aimed at a more family oriented casual gamer market, called the Nintendo Wii.
Although at first sight there is no computer vision based interface, the game controllers actually contain a small infra-red sensor that tracks the location of a sensor bar with IR LEDs positioned on top of a television. By monitoring the location and separation of the LEDs on this sensor bar, it is possible to calculate the controller's 3D position and rotation using triangulation.
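The distance component of this kind of calculation can be illustrated with a simple pinhole-camera model: the further the bar, the smaller the pixel gap between the two LED clusters. The function name and all numeric defaults below (focal length in pixels, LED separation in millimetres) are hypothetical stand-ins for illustration, not Nintendo's actual implementation.

```python
def estimate_sensor_bar_distance(pixel_separation, focal_length_px=1320.0,
                                 bar_width_mm=200.0):
    """Estimate camera-to-bar distance (mm) from the pixel gap between the
    two IR LED clusters, using similar triangles in a pinhole model:
    distance = focal_length * real_width / image_width.
    All default values are illustrative assumptions."""
    if pixel_separation <= 0:
        raise ValueError("LED clusters must be separated in the image")
    return focal_length_px * bar_width_mm / pixel_separation
```

With these assumed constants, an observed separation of 132 pixels corresponds to a distance of two metres; combining this with the midpoint and tilt of the LED pair yields position and rotation.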
In 2009, a camera device was released for the console with a game called Your Shape. The game uses a simple camera to attempt to detect if the player has performed the displayed exercise moves correctly, and also provides a fitness program for the player to follow.
2.1.5 XBox LIVE Vision (2006)
In 2006 Microsoft released a video camera for their XBox 360 console. The camera supports a resolution of 640x480 at 30 frames per second, and also supports 1.3 megapixel still images at a higher resolution of 1280x1024 (see left image in figure 2.6).
Games have generally used the camera for live video feeds of players in online games (a typical video conferencing application), and to let players add pictures taken with the camera to their online gaming profile.
Figure 2.6: Left: XBox Live Vision camera. Right: Computer vision interface based game where the player waves their hands on the left/right of the screen to control the character.
The camera uses a long exposure time to increase the light sensitivity of its sensor. This limits interfaces that require high-speed interactions or more complex computer vision algorithms: with a long exposure time, features such as edges tend to become blurred and smoothed out during fast actions.
Some games have used the camera as a vision based controller using methods similar to those of the games created for Sony's EyeToy on the PS2, but it has not yet reached the same level of success as the EyeToy franchise.
A few games use the Digimask technology to allow players to place their face on characters in game, as with the EyeToy: Cameo system used in some Sony PlayStation 2 games.
2.1.6 Sony Go!Cam (2007)
Sony has also released a camera, the Go!Cam, that attaches to the back of their PSP hand-held console.
The first major game to exploit the Sony PSP camera is a title called 'InviZimals', which was released in the UK in November 2009 and in the US in autumn 2010. This is an Augmented Reality (AR) game that exploits various image properties within the camera field of view to generate creatures, and uses AR algorithms to track an AR marker card to provide a 3D position for a battlefield in which the creatures can fight each other.
The player must scan their surroundings with the camera while the game extracts various image properties to determine which creatures are detected and where they are located. These creatures can be captured by placing the card at the indicated location, and are then made available to the player to fight and capture other creatures. The actual properties extracted from the image are not published, but are likely to be things
Figure 2.7: Left: PlayStation Portable (PSP) Go!Cam camera. Right: InviZimals game, the AR marker card is underneath the creature on the left.
such as colour distributions and edge information that can be extracted efficiently and quickly.
The combat section of the game uses an augmented reality marker to position the centre of the battlefield, with the fighting creatures positioned at either side to attack each other. Displaying the battle this way means that during combat the camera can be moved around to view the battle from different viewpoints.
2.1.7 Sony PlayStation Eye (2007)
A year after the release of Sony's new PlayStation 3 games console, Sony released a second web camera called the PlayStation Eye (see figure 2.8). The new camera is capable of capturing video at a higher resolution and frame rate than the original EyeToy camera: 640x480 at 60 frames per second, or 320x240 at a much higher 120 frames per second.
The camera is much more sensitive than the original EyeToy, and is able to cope with lower light conditions. It also has no compression artefacts in the video data, which were visible in video data captured by the previous EyeToy camera.
With the greater processing power of the PS3, games are able to employ more complex computer vision algorithms for their control interfaces. One such game uses augmented reality algorithms to detect special markers on playing cards and render the appropriate 3D character on top of the cards presented by a player (see middle image in figure 2.8). The cards can then be used to place the creatures on a grid where they can fight each other.
This augmented reality concept was later applied to a digital pet game called EyePet, in which the player must look after a digital creature. The game tracks a patterned card that the player may use to interact with the creature and perform various tasks. The game encourages the player to feed, wash, and customise the appearance of the creature, and to interact with it by playing a series of short games.
Figure 2.8: Left: PlayStation Eye camera. Middle: Eye of Judgement game. Right: EyePet game.
There is also a mode in which the player can draw a simple shape on a piece of paper and present the drawing to the camera; a 3D object is then extruded from the lines detected in the drawing so that the creature can interact with it.
2.2 Types of Camera
Different types of camera technology have been investigated for computer vision in media and games. Whatever the technology, the hardware must be cheap enough to distribute at a reasonably low price with games or utilities, so that consumers are likely to buy the device, while still offering video quality suitable for image processing.
The following features are important for a camera device intended for computer gaming interfaces:
Resolution The device must have a reasonably good resolution. Higher resolution means more information can be exploited for computer vision tasks. Good camera devices typically support a resolution of 640x480 pixels.
Frame rate A high frame rate means that fast movements can be processed with better accuracy since the frames do not suffer as much from motion blur artefacts. Cameras capable of 60 frames per second and above are good for use where fast actions need to be handled with accuracy.
Sensitivity The device should be able to cope with a reasonably good range of lighting conditions. Living rooms might be quite dark or poorly lit, so being able to handle these conditions is important. However, a sensitive device is only good if it has a low noise level (grain) in the video frames. Typically noise shows up in poor lighting conditions due to the gain applied by the camera to enhance the picture, so high sensitivity with low noise is ideal.
Figure 2.9: A typical Bayer pattern used by consumer web camera hardware. Each pixel in the sensor array has a corresponding colour filter. Full colour RGB images are reconstructed by interpolating between the different colour channels.
Quality A camera that produces video images with compression artefacts can cause problems with computer vision algorithms, so it is important for a device to be able to produce images with very few or no compression artefacts. The PSEye camera for instance can send video frames with no compression making it ideal for computer vision applications.
Cost The overall cost of the device must be low to ensure that it can be distributed at a price that encourages consumers to purchase it. This usually means that technology is a compromise between cost and quality.
2.2.1 Monocular (standard webcam)
All camera devices used for computer games to date have been standard single-lens webcams. The device typically provides video data at a range of resolutions over a wired USB interface.
The types of images that cameras can retrieve are RGB, grey-scale and Bayer pattern images. Hardware support for the different formats may be available on the camera, but RGB and grey-scale images can also be constructed from the Bayer sensor pattern in software. See figure 2.9 for a typical Bayer pattern.
Applying computer vision algorithms to the raw camera data can be faster, as the RGB image then does not need to be constructed for each frame; constructing the RGB image can take some time depending on the interpolation scheme used.
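As an illustration of this trade-off, a deliberately crude demosaicing scheme can skip interpolation entirely and reconstruct a half-resolution RGB image directly from 2x2 Bayer cells. The sketch below assumes an RGGB layout (real sensors may use other arrangements) and is not the scheme used by any particular camera.

```python
import numpy as np

def demosaic_half_res(raw):
    """Reconstruct a half-resolution RGB image from an RGGB Bayer mosaic.

    Each 2x2 cell [[R, G], [G, B]] becomes one RGB pixel; the two green
    samples are averaged. Cruder than full bilinear demosaicing, but very
    cheap, which matters when every frame must also be processed by the
    vision and game subsystems. Assumes an RGGB layout (an assumption).
    """
    r = raw[0::2, 0::2].astype(np.float32)
    g = 0.5 * (raw[0::2, 1::2].astype(np.float32) +
               raw[1::2, 0::2].astype(np.float32))
    b = raw[1::2, 1::2].astype(np.float32)
    return np.stack([r, g, b], axis=-1)
```

A full-resolution bilinear demosaic would instead interpolate each missing channel from its neighbours, at several times the cost per frame.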
2.2.2 Stereo
Stereo cameras are essentially two synchronised cameras fixed a short distance apart. They can retrieve depth as well as colour image data, which can be used to resolve ambiguities that occur with a single
lens and no depth information.
This type of camera is more expensive than a regular single-lens camera, as the camera sensor hardware is doubled. Also, due to the extra processing required to calculate the disparity maps for depth, processing frames can be costly even before computer vision algorithms are applied to the image and depth data.
2.2.3 Depth, or Z-Cam
Recently, the cost of depth camera technology has become low enough for it to be considered for use in computer games, though no consumer hardware is currently available (as of 2009). There are a few different approaches to measuring depth.
Structured light In this type of camera, a light pattern is projected into the scene from the camera and depth is calculated from the way the scene deforms the projected pattern. Several techniques can be used to determine depth, such as fringe projection (Zhang [178]), but these are outside the scope of this thesis.
Time-of-flight (TOF) There are two types of time-of-flight camera. Shutter-based cameras send out a fixed-duration pulse of light and collect light on a sensor for only a short period of time; the more light collected, the closer the surface is determined to be. Phase-based cameras modulate the light source and determine depth from the difference in phase of the light reflected back to the camera.
The camera can then provide depth data along with the RGB colour image data. In ideal conditions, normally difficult tasks such as background subtraction can then be performed by simply thresholding the depth value of each pixel.
2.3 Problem Domain
As highlighted in section 2.1, there has been a great deal of interest in human computer interaction with single cameras in the home entertainment industry. Due to the constraints of making this technology available at low cost, the majority of these systems employ simple monocular web cameras (§2.2.1).
Any algorithm employed in this context is subject to the following constraints:
1. The algorithm must be computationally efficient to allow for real-time interaction feedback to the user.
2. The algorithm must be able to operate using only data from a single camera.
The computational efficiency constraint is important because of the many components (or subsystems) that are required to run a typical computer game, as discussed in the next section.
2.3.1 Typical Computer Games
Computer games have evolved from small projects that a single developer could manage twenty or so years ago into huge multimedia projects involving many hundreds of people [165]. During this transition it has become important to modularise common components into subsystems so that they can be more easily maintained in the large-scale development projects that are now typical in the computer games industry [7].
Though there is some variation in the terminology used in the computer entertainment industry when referring to the software components of a computer game, the term 'game engine' refers to a collection of reusable subsystems responsible for such things as rendering, sound, game state, physics, and other tasks [7].
Until recently, game engines were generally optimised for single-processor architectures, but in the latter half of the last decade game developers have started to explore multi-threaded game engines to take advantage of the multi-processor and multi-core architectures present in most modern computers and games consoles [165]. Despite this trend, the term 'game engine' retains the same abstract meaning and is responsible for the same sub-processes, though the workload is spread across multiple threads.
One of the core subsystems of any game engine is the input subsystem. A computer vision algorithm can be considered an input subsystem, since it provides control information to the rest of the game engine, though it may also provide other information, such as texture information for the rendering subsystem from algorithms like the one presented in chapter 4.
Each of the subsystems within the game engine is updated once per frame. The frame rate of the game is the rate at which new frames are presented, typically measured in frames per second (FPS). Where fast interactions are required, games should have a high frame rate, so that a user can react to the game objects being displayed and see feedback of that action perceptually close to the time the action was performed.
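The per-frame update of subsystems can be sketched as a fixed-rate loop in which a vision-based input subsystem is updated alongside the others. The class and function names below are illustrative, not taken from any particular engine, and `max_frames` exists only to bound the run here.

```python
import time

class VisionInputSystem:
    """Placeholder for a camera-driven input subsystem (illustrative)."""
    def update(self, dt):
        # In a real engine: grab a camera frame, run e.g. background
        # subtraction, and emit input events to the rest of the engine.
        pass

def run_game_loop(subsystems, target_fps=60.0, max_frames=3):
    """Fixed-rate loop: every subsystem is updated once per frame.

    The vision algorithm must fit within a fraction of the per-frame
    budget (1/target_fps seconds) shared by all subsystems.
    """
    frame_budget = 1.0 / target_fps
    frames = 0
    while frames < max_frames:  # a real engine loops until quit
        start = time.perf_counter()
        for s in subsystems:
            s.update(frame_budget)
        frames += 1
        # sleep away whatever budget the subsystems did not use
        elapsed = time.perf_counter() - start
        if elapsed < frame_budget:
            time.sleep(frame_budget - elapsed)
    return frames
```

This makes the computational constraint concrete: at 60 FPS the entire frame, vision processing included, has under 16.7 ms.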
2.3.2 Computational Constraints
Clearly, implementing game interactions using low-level, computationally efficient methods is attractive, given the number of other complex subsystems that constitute a typical computer game engine. The algorithm must operate within a fraction of the available processing time so that the other subsystems of the game can also be updated, while providing a frame rate high enough for real-time interaction.
This constraint restricts the algorithms that can be used to those that operate within these requirements. The subsequent chapters first present algorithms that provide useful interactions (§3 and §4), then move on to the more difficult problems of efficient localisation (§5) and pose estimation (§6). The following sections give an overview of the literature for each of these problems.
2.4 Background Subtraction and Segmentation
Background subtraction and segmentation methods are two approaches to extracting similar types of information. They both attempt to extract the shape of an object from an image, but tackle the problem in different ways.
Background subtraction methods generally cope well with motion in video sequences and can be made computationally efficient, so they are well suited to processing live video streams. They construct either simple or relatively complex background models that can adapt to changes in the video so that foreground objects can be detected.
Segmentation algorithms, on the other hand, can be formulated to deal with both single images and video sequences, but their more complex formulation generally makes them computationally more expensive. As will be demonstrated in chapter 4, however, using other algorithms to focus on smaller areas of the image allows such algorithms to be used in real-time applications.
2.4.1 Background Subtraction
Background subtraction is a method of finding differences in an image to detect moving objects from a static camera. Typically a model of the background is either calculated offline from reference images captured when the camera view is clear of moving objects (and under varied lighting conditions), or the background model is separated from the foreground using pixel intensity statistics.
There are several survey papers that compare background subtraction methods applied to different problems (see [33, 123]). The common approaches to background subtraction are described next. For simplicity, references in equations to images or image models are assumed to operate on greyscale intensity values, but they can easily be applied independently to each colour channel in RGB colour images or other colour models.
At the most general level, a background subtraction algorithm compares pixels in the current frame to a background model, labelling them background if they are similar and foreground otherwise:
$$|B_t - I_t| > \theta_t \tag{2.4.1}$$
where $B_t$ and $I_t$ are the background model and image intensity respectively at the current time $t$, and $\theta_t$ is a per-pixel threshold. The threshold can represent a more complex criterion depending on the type of model used; for instance, $\theta_t$ is typically a function of the standard deviation measured at its respective pixel, as in [89, 174].
It is from the formulation of equation (2.4.1) that the term background subtraction is derived [123]. More generally, background subtraction is simply the method of comparing the pixels in an image to a reference model to find differences from that model.
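Equation (2.4.1) amounts to a single vectorised comparison per frame. A minimal NumPy sketch (the function name and default threshold are arbitrary choices; `threshold` may equally be a per-pixel array, e.g. a multiple of each pixel's standard deviation):

```python
import numpy as np

def foreground_mask(image, background, threshold=25.0):
    """Per-pixel background subtraction in the form of eq. (2.4.1):
    a pixel is labelled foreground when |B_t - I_t| exceeds the threshold.

    `threshold` may be a scalar or an array the same shape as the image,
    allowing a per-pixel threshold as discussed in the text.
    """
    diff = np.abs(image.astype(np.float32) - background.astype(np.float32))
    return diff > threshold
```

The resulting boolean mask is the foreground labelling; everything else in this section refines how the background model $B_t$ and threshold are obtained and updated.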
Another popular foreground detection method is to normalise the measured difference by the standard deviation of the pixel:
$$\frac{|B_t - I_t|}{\sigma_t} > \theta_{t,\sigma} \tag{2.4.2}$$
In either threshold method, most works determine the detection thresholds $\theta_t$ and $\theta_{t,\sigma}$ experimentally [33].
The model can either be learned offline or updated online. When updating online one assumption that could be taken is that a pixel will remain background most of the time [174], and foreground pixels will be those that differ from this mean by some threshold. Other works maintain hypotheses on moving objects to update the model only in areas other than where foreground objects have been predicted to be [84, 86, 89]. This conditional background update is known as a selective update.
A common approach is to use a temporal median filter to determine the background model [38, 39, 67, 101, 181]. The median is computed over the last $n$ frames [39, 67, 101, 181], or over a set of selected frames [38]. The assumption is that each pixel takes a background value for more than half of the measured time. The main disadvantages of this approach are that a buffer must be maintained to store the values of the last $n$ frames, and that the lack of a statistical definition of deviation from the median makes it difficult to adapt the threshold value [123].
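A minimal sketch of the temporal median model, keeping the $n$-frame buffer that is its main memory cost (class name and default $n$ are arbitrary):

```python
import numpy as np
from collections import deque

class MedianBackground:
    """Temporal-median background model over the last n frames.

    The deque holds the n most recent frames; this buffer is the memory
    overhead noted in the text. Illustrative sketch, not a specific
    published implementation.
    """
    def __init__(self, n=5):
        self.frames = deque(maxlen=n)  # oldest frame drops out automatically

    def update(self, frame):
        self.frames.append(np.asarray(frame, dtype=np.float32))

    def background(self):
        # per-pixel median across the buffered frames
        return np.median(np.stack(self.frames), axis=0)
```

A transient bright object affects fewer than half the buffered frames, so the median (unlike the mean) ignores it, at the cost of storing all $n$ frames.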
Some works instead use a running Gaussian average [84, 86, 89, 174] to construct a model of the background. Wren et al. [174] approximate the calculation of the mean and standard deviation of each pixel over $n$ frames: to avoid storing $n$ frames of pixel values and subsequently fitting a Gaussian model to each pixel, they update their background model using a running Gaussian average.
$$\mu_{t+1} = \alpha I_t + (1 - \alpha)\mu_t \tag{2.4.3}$$
$$\sigma_{t+1}^2 = \alpha (I_t - \mu_t)^2 + (1 - \alpha)\sigma_t^2 \tag{2.4.4}$$
where $\alpha$ is a parameter that controls the rate at which the model is updated. While this type of model can be useful in situations such as CCTV, where the background remains the same for much of the time, in a real-time computer game the user can occupy the same region of the screen for a long time, and it is undesirable for those pixels to be updated into the background model.
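One update step of equations (2.4.3) and (2.4.4) can be sketched as follows (the function name and the value of $\alpha$ used in the example are arbitrary):

```python
import numpy as np

def running_gaussian_update(mu, var, image, alpha=0.05):
    """One step of the running Gaussian average of eqs. (2.4.3)-(2.4.4):

        mu_{t+1}     = alpha * I_t + (1 - alpha) * mu_t
        sigma^2_{t+1} = alpha * (I_t - mu_t)^2 + (1 - alpha) * sigma^2_t

    `mu` and `var` are per-pixel arrays; no frame history is stored.
    """
    image = np.asarray(image, dtype=np.float32)
    new_mu = alpha * image + (1.0 - alpha) * mu
    new_var = alpha * (image - mu) ** 2 + (1.0 - alpha) * var
    return new_mu, new_var
```

Only the current mean and variance arrays are kept, which is precisely the memory saving over the median-buffer approach; the drawback described above is that a stationary player is slowly averaged into `mu`.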
The background update equation can alternatively be formulated so that the background is only updated for pixels not labelled as foreground in the current frame [84, 86, 130]. The update can be expressed in a Kalman filter formulation:
$$B_{t+1} = B_t + \left(\alpha_1 (1 - M_t) + \alpha_2 M_t\right) D_t \tag{2.4.5}$$
where $D_t = I_t - B_t$ is the difference between the image and the background model at time $t$, and $\alpha_1$ and $\alpha_2$ are based on an estimate of the rate of change of the background. The mask $M_t = |D_t| > \tau$ is simply a thresholded difference map; it helps prevent the background model being affected by pixels that are very different from it.
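A sketch of the selective update of equation (2.4.5), with illustrative values for $\tau$, $\alpha_1$ and $\alpha_2$ (the cited works derive these from an estimate of the background's rate of change rather than fixing them):

```python
import numpy as np

def selective_update(background, image, tau=25.0, a1=0.1, a2=0.01):
    """Selective background update, eq. (2.4.5):

        B_{t+1} = B_t + (a1*(1 - M_t) + a2*M_t) * D_t

    with D_t = I_t - B_t and M_t = |D_t| > tau. Pixels far from the model
    (likely foreground) are blended in at the slow rate a2; the rest are
    updated at the normal rate a1. Parameter values are illustrative.
    """
    image = np.asarray(image, dtype=np.float32)
    d = image - background                      # D_t
    m = (np.abs(d) > tau).astype(np.float32)    # M_t, thresholded difference
    return background + (a1 * (1.0 - m) + a2 * m) * d
```

With $\alpha_2 \ll \alpha_1$, a player standing still corrupts the model far more slowly than background pixels adapt to lighting changes, though not at all only if $\alpha_2 = 0$.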
Koller et al. [89] use a Gaussian to model the values of each pixel, but extend the selective update method of [84, 86] with a tracking-based object motion mask that predicts where foreground pixels are expected to be in the current frame, instead of simply thresholding the difference image. This object motion mask ensures that pixels predicted to be foreground are not considered when the background model is updated. They then combine this system with an object tracking algorithm
that is used to predict a new object motion mask $M_{t+1}$ for the next frame.
For dynamic backgrounds, a single Gaussian model per pixel is not sufficient. Stauffer and Grimson [153] proposed using a mixture of Gaussians to model each pixel independently. This can handle dynamic backgrounds where background objects have varying intensity properties, such as fountains, moving trees and plants. A few years later, Power and Schoonees [126] provided a principled tutorial and offered some corrections to the Stauffer and Grimson [153] paper. The recent history of each pixel, $X_1, \ldots, X_t$, is modelled by a mixture of Gaussians; the probability of observing the current pixel value is:
$$P(X_t) = \sum_{i=1}^{K} w_{i,t} \, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \tag{2.4.6}$$
where $K$ is the number of distributions, $w_{i,t}$ is an estimate of the weight (the proportion of the data accounted for by this Gaussian) of the $i$th Gaussian in the mixture at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and covariance matrix of the $i$th Gaussian at time $t$, and $\eta$ is a Gaussian probability density function.
The $K$ Gaussians are sorted by $w_{k,t}/\sigma_{k,t}$, a measure proportional to the peak of the weighted density function $w_{k,t}\,\eta(X_t, \mu_{k,t}, \Sigma_{k,t})$, and the top $B$ Gaussians are assumed to describe the background, with $B$ estimated as:
$$B = \operatorname*{argmin}_{b} \left( \sum_{k=1}^{b} w_{k,t} > T \right) \tag{2.4.7}$$
where $T$ is the prior probability of anything in view being background; the remaining Gaussians are considered to belong to the foreground. The mixture is updated each frame only for pixels considered to come from the background Gaussians. Pixels are labelled foreground if they are more than $2.5\sigma$ away from every background distribution. In practice $K$ is typically 3 or 5, representing a reasonable trade-off between computational cost and memory usage.
A disadvantage of this method is that its computational cost is higher than that of the simpler (but less flexible) single Gaussian models, so low values of $K$ must be used to approach real-time speeds.
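The following is a greatly simplified, single-pixel sketch of a Stauffer and Grimson style mixture for scalar intensities. It fixes the learning rate $\rho$ to $\alpha$ and omits several details of the published method (the density-weighted per-component $\rho$, among other refinements), so it is illustrative only; all parameter values are arbitrary.

```python
import numpy as np

class PixelMoG:
    """Simplified single-pixel mixture-of-Gaussians background model.

    Scalar intensities, K components; not a faithful reimplementation of
    Stauffer and Grimson [153], just the shape of the algorithm.
    """
    def __init__(self, k=3, alpha=0.05, T=0.7, init_var=225.0):
        self.alpha, self.T, self.init_var = alpha, T, init_var
        self.w = np.full(k, 1.0 / k)              # component weights
        self.mu = np.linspace(0.0, 255.0, k)      # component means
        self.var = np.full(k, init_var)           # component variances

    def update(self, x):
        """Match x within 2.5 sigma, update the matched component, and
        return True if x is explained by a background component."""
        d = np.abs(x - self.mu)
        matched = d < 2.5 * np.sqrt(self.var)
        self.w = (1.0 - self.alpha) * self.w      # decay all weights
        if matched.any():
            i = int(np.argmin(np.where(matched, d, np.inf)))
            self.w[i] += self.alpha
            rho = self.alpha                      # simplified learning rate
            delta = x - self.mu[i]
            self.mu[i] += rho * delta
            self.var[i] += rho * (delta ** 2 - self.var[i])
        else:
            i = int(np.argmin(self.w))            # replace weakest component
            self.mu[i], self.var[i], self.w[i] = x, self.init_var, self.alpha
        self.w /= self.w.sum()
        # components sorted by w/sigma; first ones covering mass T = background
        order = np.argsort(-self.w / np.sqrt(self.var))
        background, total = set(), 0.0
        for j in order:
            background.add(int(j))
            total += self.w[j]
            if total > self.T:
                break
        return bool(matched.any()) and i in background
```

After the model has seen a stable intensity for a while, that value is classified as background, while a component that only just appeared (or matches a low-weight component) is not.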
Histograms of the pixel values can be used to approximate the background distribution, but because the histogram is a step function it may not model the distribution well. Elgammal et al. [46] instead use kernel density estimation (KDE) over the past $n$ (with $n$ around 100) values of the pixel to model the background distribution:
$$P(X_t) = \frac{1}{n} \sum_{i=1}^{n} K(X_t - X_i) \tag{2.4.8}$$
where $K$ is the kernel estimator function, chosen to be the normal distribution $N(0, \Sigma)$.
Modelled this way, the background distribution is estimated directly from the data points themselves, rather than being restricted to a fixed number of modes as in the work of Stauffer and Grimson [153]. Elgammal et al. [46] combine this formulation with a second stage that suppresses false detections by also evaluating a small neighbourhood around each pixel under the centre pixel's model, constrained by the requirement that a connected component must also have been displaced.
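With a Gaussian kernel, equation (2.4.8) reduces to a mean of kernel evaluations over the stored samples. A scalar-intensity sketch (the function name and bandwidth $\sigma$ are arbitrary; Elgammal et al. estimate the bandwidth from the data):

```python
import numpy as np

def kde_background_prob(x, samples, sigma=10.0):
    """Eq. (2.4.8): P(x) = (1/n) * sum_i K(x - x_i), with a Gaussian
    kernel N(0, sigma^2), evaluated over the pixel's last n intensity
    samples. A low probability suggests x is foreground.
    """
    samples = np.asarray(samples, dtype=np.float64)
    k = (np.exp(-0.5 * ((x - samples) / sigma) ** 2)
         / (sigma * np.sqrt(2.0 * np.pi)))
    return float(k.mean())
```

Classification then thresholds this probability; the cost per pixel is $O(n)$ kernel evaluations, which is why the sample buffer is kept modest (around 100).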
Other methods use eigenvalue decomposition to model the background over blocks [138] or over the whole image (Oliver et al. [119]), but the model construction is too computationally intensive for use in real-time applications.
Though many of these systems update the model at runtime using selective updating, they generally assume that a pixel will remain background for the majority of the time observed. This is not a safe assumption for computer vision based games, where the player may remain in the centre of the image for most of the duration of the game. In this situation pixel values belonging to the user may be absorbed into the background model: where they are close enough to the background values (e.g. similar but slightly different colours in clothing and background), the model can drift gradually towards the new foreground values.
Methods that consider local texture properties, ideally with robustness to illumination changes, may provide better representations than single pixel value comparisons. There are fast local descriptors that can describe local texture in a computationally efficient way. The following section briefly introduces correlation.
2.4.2 Background Subtraction using Local Correlation
For near real-time systems that apply a model to each pixel, comparisons between the model and the current frame are generally done on individual pixel values. An object moving in a video sequence, however, is generally made up of a group of connected pixels. Local patch based correlation methods that use the texture of a small region around a pixel, instead of the individual pixel values, could therefore provide a
more robust measure of correlation.
Modelling pixel intensities independently is computationally efficient, but pixels are generally part of a larger pattern or structure. Local patch correlation methods can add a degree of robustness to illumination changes beyond that of intensity or colour models built on single pixel values, for instance by comparing relative intensities within the patch [115] or by using locally intensity-normalised patches.
For correlation between local patches, dense descriptor methods such as HOG descriptors [40, 140, 183], Haar filters [171], DAISY [162] or SIFT [102], among others, can be used, but they must be efficient and fast to calculate.
In other fields such as wide-baseline stereo, local features are used to find correspondences between pairs of images and can be calculated using efficient local descriptors such as the DAISY descriptor presented by Tola et al. [162]. Grabner et al. [68] use a grid of local classifiers to construct a discriminative background model that classifies blocks as foreground or background. Neither of these approaches is fast enough for real-time correlation, however.
Normalised Cross Correlation (NCC)
NCC is a correlation method that is robust to global changes in illumination, which reduces the need to update the background model every frame. Lewis [100] observed that much of the computation for the NCC comparison can be pre-calculated, and for small window sizes in the region of 5 or 7 pixels across it can be made to run in real-time (such as the implementation used in §3).
Given two M-by-N patches, f(x, y) and g(x, y), the normalised cross correlation between them is expressed as:
\frac{\sum_{x,y}\left[f(x,y)-\bar{f}\right]\left[g(x,y)-\bar{g}\right]}{\sqrt{\sum_{x,y}\left[f(x,y)-\bar{f}\right]^{2}\,\sum_{x,y}\left[g(x,y)-\bar{g}\right]^{2}}} \qquad (2.4.9)
Where f̄ and ḡ are the means of f and g respectively. The correlation is the dot product of the differences from the mean in each of the respective patches, normalised by the square root of the product of the sums of squared differences. If the patches are pre-processed to subtract their mean before the NCC comparison, this reduces to:
\mathrm{NCC}_{f,g}(x,y) = \frac{\sum_{x,y} f'(x,y)\,g'(x,y)}{\sqrt{\sum_{x,y} f'(x,y)^{2}\,\sum_{x,y} g'(x,y)^{2}}} \qquad (2.4.10)
25 CHAPTER 2: BACKGROUND
Where f'(x, y) and g'(x, y) are the patches with their mean already subtracted. The square root can be removed by squaring both sides. The denominator is just the product of the standard deviations of each patch.
\mathrm{NCC}_{f,g}(x,y) = \frac{\sum_{x,y} f'(x,y)\,g'(x,y)}{\sigma_{f} \cdot \sigma_{g}} \qquad (2.4.11)
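As an illustration, equation (2.4.10) can be sketched in a few lines of NumPy. The helper name `ncc` and the convention of returning 0 for a flat patch (where the correlation is undefined) are choices made here for the sketch, not part of Lewis's formulation:

```python
import numpy as np

def ncc(f, g):
    """Normalised cross correlation between two equal-sized patches,
    as in equation (2.4.10).  Returns a value in [-1, 1]; 1 means the
    patches differ only by a gain/offset change in intensity."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    f = f - f.mean()                       # f'(x, y)
    g = g - g.mean()                       # g'(x, y)
    denom = np.sqrt((f * f).sum() * (g * g).sum())
    if denom == 0.0:                       # flat patch: define NCC as 0
        return 0.0
    return float((f * g).sum() / denom)
```

A patch compared against a brightened and contrast-scaled copy of itself scores 1, which is precisely the illumination robustness the text describes.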
If B(x, y) represents the background model, and I_t(x, y) the image from the current frame, then the quantities B'(x, y) = B(x, y) − B̄ and σ_B can be pre-calculated. Additionally, by squaring both sides of the equation, the identity Var(X) = E(X²) − E(X)² can be used (where E(X) is the expectation value of X, and Var(X) is the variance σ²) to remove the need for calculating the square root to find the standard deviations at each pixel location, and just use the variance.
\mathrm{NCC}_{B,I}(x,y)^{2} = \frac{\left[\sum_{x,y} I'(x,y)\,B'(x,y)\right]^{2}}{\sigma_{I}^{2} \cdot \sigma_{B}^{2}} \qquad (2.4.12)

Convolution over summed-area tables [37] (see §5.2.1 for a description) can be used to efficiently precompute the mean Ī and variance σ_I² at each pixel for a fixed size correlation window; then only the product between the background model and the current image, B'(x, y) · I_t'(x, y), needs to be calculated to evaluate NCC_{B,I}(x, y)². The values of B'(x, y) for a fixed size window can be stored in a vector for each pixel for efficiency when evaluating each new frame I_t.
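The pre-computation described above can be sketched with summed-area tables built from cumulative sums. The function names and the cropping to valid window positions are illustrative choices, assuming the identity Var(X) = E(X²) − E(X)²:

```python
import numpy as np

def integral(a):
    """Summed-area table with a zero top row/column for O(1) box sums."""
    s = np.cumsum(np.cumsum(a, axis=0), axis=1)
    return np.pad(s, ((1, 0), (1, 0)))

def box_sum(ii, w):
    """Sum over every w x w window (valid positions only)."""
    return ii[w:, w:] - ii[:-w, w:] - ii[w:, :-w] + ii[:-w, :-w]

def window_mean_var(img, w):
    """Per-pixel mean and variance over a w x w window via two
    summed-area tables, using Var(X) = E(X^2) - E(X)^2."""
    img = np.asarray(img, dtype=float)
    n = float(w * w)
    mean = box_sum(integral(img), w) / n
    ex2 = box_sum(integral(img * img), w) / n
    return mean, np.maximum(ex2 - mean ** 2, 0.0)
```

Each window sum costs four lookups regardless of window size, which is what makes the per-frame NCC evaluation cheap.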
Jacques Jr et al. [78] use NCC to remove shadow pixels from detected foreground pixels, though shadows that cause hard edges are still a problem in grey-scale images as there is no way to distinguish them from true edges using the NCC measure alone.
Colour Normalised Cross Correlation (CNCC)
To help address the problem of shadows in the NCC comparison, Grest et al. [71] proposed a variation on the NCC formulation to perform the comparison in a more shadow invariant colour space. The RGB colour image is transformed into what they refer to as an hsL image, the hs component represents a vector on the colour hue and saturation plane defined in the HSL colour space (see figure 2.10). This representation allows colour comparisons to be expressed as a dot-product and easily integrates into the NCC equation.
Grest et al. [71] define the CNCC over a window M × N, where the hsL components are split into an (h, s) vector c = (h, s) and a lightness value L, as:
Figure 2.10: Left: HSL colour cylinder [172]. Right: hsL colour space representation showing two colours, c_a and c_b. Their similarity is measured by the function C(c_a, c_b) = max(0, c_a^T c_b).
\mathrm{CNCC}_{x,y} = \frac{\sum_{x,y} c_{x,y}^{B} \circ c_{x,y}^{I} - MN\,\bar{L}^{B}\bar{L}^{I}}{\sqrt{\mathrm{CVAR}(B) \cdot \mathrm{CVAR}(I)}} \qquad (2.4.13)
And:
\mathrm{CVAR}(A) = \sum_{x,y} c_{x,y}^{A} \circ c_{x,y}^{A} - MN\left(\bar{L}^{A}\right)^{2} \qquad (2.4.14)
Where c_{x,y}^B and c_{x,y}^I are the (h, s) vectors from the hsL colour space at position (x, y) from the background model B and image I respectively, and c_a ∘ c_b represents the scalar product between two vectors, but with negative values set to 0.
Grest et al. [71] report that the algorithm successfully uses the (h, s) colour space to increase robustness to shadows in colour images, while in intensity-only images the CNCC correlation is equivalent to NCC.
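A minimal sketch of the clipped scalar product C(c_a, c_b) = max(0, c_a^T c_b) from figure 2.10. Encoding the (h, s) component as a polar 2D vector on the hue/saturation disc is an assumption made here for illustration, not a detail taken from Grest et al.:

```python
import colorsys
import math

def hs_vector(r, g, b):
    """Map an RGB triple (components in 0..1) to a 2D vector on the
    hue/saturation disc of the HSL cylinder.  This polar encoding of
    the (h, s) component is an assumption made for illustration."""
    h, _l, s = colorsys.rgb_to_hls(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle))

def hs_similarity(ca, cb):
    """C(c_a, c_b) = max(0, c_a . c_b): colours with opposing hues
    score zero instead of going negative."""
    return max(0.0, ca[0] * cb[0] + ca[1] * cb[1])
```

A shadowed pixel keeps roughly the same hue and saturation at a lower lightness, so its hs vector, and hence this similarity, changes little.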
Following on from the works using local correlation methods for background subtraction, chapter 3 presents a fast method of background subtraction using a single reference frame as a background model and compares the new algorithm to correlation methods.
2.4.3 Segmentation
Image segmentation algorithms try to solve the general problem of partitioning an image into a number of segments, commonly a foreground and a background segment in the binary case, or object class segments in classification based segmentation methods such as the approach used by Shotton et al. [146].
Thresholding is the simplest way of tackling the segmentation problem. A threshold is selected and each pixel is labelled as either foreground if it is above this threshold, or background if it is below. The threshold is selected either as an intensity value, or a colour (multi-band thresholding).
This simple approach to segmentation (though computationally efficient) is only really effective in applications where the object already stands out quite clearly from a clear background, for instance in manufacturing and quality control, document scanning, or in applications such as chroma-keying in film and television where the background is a distinct colour such as bright green or blue.
An excellent survey of the common methods of image thresholding as applied to many varied applications is presented by Sezgin and Sankur [139], who categorise the algorithms into different approaches. Histogram shape methods select a threshold based on the maxima, minima and curvatures of the smoothed histogram of pixel intensities. Clustering based methods cluster intensity (or colour) values into background and foreground regions, or alternatively model them as a mixture of two Gaussians (similar to the approaches used in the background subtraction methods presented in §2.4.1). Entropy based methods use the entropy of the foreground and background regions, and the cross-entropy between the original and binarised image. Object attribute based methods search for similarity between the intensity and the binarised images, using measures such as shape similarity or edge coincidence. Spatial methods use higher-order probability distributions or correlation between pixels to compare local regions, similar in some respects to local correlation methods. Local methods adapt the threshold value at each pixel to the local image characteristics, such as local brightness.
Sezgin and Sankur [139] conclude that clustering and entropy based methods work best on the segmentation task in their non-destructive testing image dataset, while for the binarisation of degraded document images, clustering and local methods perform better. The best performing method across both datasets was the clustering method proposed by Kittler and Illingworth [87].
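As a concrete example of a clustering-type method in this taxonomy, the well-known Otsu method (named here only for illustration; it is not the Kittler and Illingworth method discussed above) picks the threshold that maximises the between-class variance of the intensity histogram:

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Threshold selection by maximising the between-class variance of
    the intensity histogram (Otsu's method, a clustering-type approach
    in the taxonomy above; shown for illustration only)."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    omega = np.cumsum(p)                        # class-0 probability
    mu = np.cumsum(p * np.arange(bins))         # class-0 cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return int(np.argmax(sigma_b))              # pixels > t are foreground
```

On a strongly bimodal histogram the selected threshold cleanly separates the two intensity populations, which is exactly the situation where global thresholding works well.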
Thresholding approaches are generally computationally efficient and lend themselves well to applications in controlled environments such as document scanning, but in complex applications with varied and dynamic environments they can leave incomplete or fragmented foreground regions that must be cleaned up using morphological operations such as dilation and erosion. Many background subtraction methods also suffer from these problems.
Region based methods such as the Watershed algorithm [169] use the edge gradient magnitudes extracted from an image as a height map surface, and create segments by flooding it from the minima of the surface. The flooding of a region continues until it reaches a point higher than its boundary and can spill over to fill another area. A segment boundary is created when one region spills over into another, or meets another flooded segment. The process continues until all regions have been flooded. This algorithm is based on the assumption that segments correspond to the valleys (catchment basins) of the surface, with boundaries along the ridges between them.
2.4.4 Graph-Cut based Methods
Instead of thresholding the image at each pixel independently to assign a label, or growing closed contour regions, the pixels can be formulated as a graph constructed from a regular lattice of pixels. The GraphCut algorithm formulates the pixels in such a way and achieves good results in many applications due to the flexibility of the energy function that is optimised over the graph to solve the labelling problem.
Greig et al. [70] were the first to apply graph cuts to a computer vision problem: they formulated image restoration as an energy minimisation problem and solved it using a max-flow/min-cut algorithm. Some time later, Boykov and Kolmogorov [26] reformulated several computer vision problems, including image restoration, stereo, and segmentation, as energy minimisation problems and demonstrated that they could be effectively solved using graph-cut methods, proposing a new max-flow/min-cut algorithm that solves the labelling problem efficiently.
The basic formulation for binary image segmentation is as follows [26, 70]: a pixel in an image P can take one of a number of labels L = {l_1, l_2, ..., l_n} = {0, 1, ..., n}, where L = {0, 1} for binary image segmentation. The set N defines the set of all connected pairs of pixels in the image. For a given labelling L of an image P, where each pixel has been assigned an associated label, the Potts energy of the image taking that labelling can be expressed as follows:
E(L) = \sum_{p \in P} D_{p}(L_{p}) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_{p}, L_{q}) \qquad (2.4.15)
Where Dp is a data penalty function, and Vp,q is an interaction penalty function
Figure 2.11: GraphCut of a directed capacitated (weighted) graph (example adapted from Boykov and Kolmogorov [26]). Weight strengths are reflected by line thickness. Far-Left: Pixel intensities of a 3 × 3 neighbourhood. Mid-Left: Graph representation of the pixel intensities. Pixels are represented by the grey nodes. Mid-Right: A cut on the graph. Far-Right: Maximum a posteriori (MAP) solution for the graph.
between pixels. The data term D_p is the cost of assigning the pixel to a given label (usually based on observed intensities and a pre-specified likelihood function, e.g. determined from histograms of colours), and the interaction term V_{p,q} corresponds to a cost for discontinuity between pixels to encourage spatial coherence between them. The data and interaction penalties are also known as unary and pairwise potentials in other segmentation literature [27, 88, 92].
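The Potts energy of equation (2.4.15) can be evaluated directly for a small labelling. The dictionary-based representation and the indicator-function form of the interaction term are illustrative choices:

```python
def potts_energy(labels, unary, neighbours, gamma=1.0):
    """Evaluate E(L) = sum_p D_p(L_p) + sum_{(p,q) in N} V_pq(L_p, L_q)
    with the Potts interaction V_pq = gamma * [L_p != L_q].

    labels:     {pixel: assigned label}
    unary:      {pixel: tuple of costs, indexed by label}   (the D_p term)
    neighbours: iterable of (p, q) pairs                    (the set N)"""
    data = sum(unary[p][labels[p]] for p in labels)
    smoothness = sum(gamma for p, q in neighbours if labels[p] != labels[q])
    return data + smoothness
```

Finding the labelling that minimises this energy is the hard part; enumerating labellings is exponential in the number of pixels, which is why the max-flow/min-cut formulation below matters.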
The optimal labelling solution L* is found by solving the following using a max-flow/min-cut algorithm such as the one proposed by Boykov and Kolmogorov [26]:
L^{*} = \underset{L}{\operatorname{argmin}}\; E(L) \qquad (2.4.16)
Figure 2.11 shows an example graph cut for a small 3 × 3 neighbourhood. The graph is constructed as follows. The pixels p ∈ P from image P are represented as a set of nodes V and are connected to their neighbours by directed weighted edges E to form a directed weighted (capacitated) graph G = ⟨V, E⟩. The nodes in the graph are also connected to special nodes called terminals that represent the set of possible labels a node can take. In binary image segmentation, where there are 2 labels, these are called the source node s and sink node t.
There are two types of connections between nodes in the graph, and each link is given a weight (cost). The first type are N-links, the connections between the pixel nodes. The weights of these N-links are determined by the interaction term V_{p,q} in equation (2.4.15), and correspond to a penalty for discontinuity between pixels. Pixel pairs which have a high contrast are typically set to low values to encourage a split along a high contrast boundary. The other type of connection are terminal links, or T-links, and the weight of the connection from a given node to either the s or t terminal is based on the cost of that node taking a particular label, from the data term D_p in (2.4.15). Since the graph is directed, the weight over the link (p, q) ∈ N can be different from the weight in the other direction (q, p) ∈ N. This is a useful property that can be exploited by many vision algorithms [26].
The s/t cut C on a graph G partitions the nodes into two sets: those connected to the source S and those connected to the sink T [26]. The cost of the s/t cut C is the sum of the costs between boundary pixels (p, q) ∈ N in the neighbourhood system where p ∈ S and q ∈ T. The cost of the cut is directed, so that costs are added only in the direction of S to T. The minimum cut is the cut that has the minimum cost of all cuts. This can be solved as a maximum flow problem [56], where the edges in the graph are treated as a network of pipes with capacity equal to their weights. Ford and Fulkerson [56] state that the maximum flow from s to t saturates a set of edges dividing the graph into two sets that correspond to a minimum cut. Boykov and Kolmogorov [26] discuss various algorithms that can be used to solve this max-flow problem, and propose their own algorithm that solves it efficiently.
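A toy Edmonds-Karp solver illustrates the max-flow/min-cut equivalence on a small hand-built graph. This is the textbook augmenting-path scheme, not the specialised algorithm of Boykov and Kolmogorov:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp (shortest augmenting path) max-flow on a directed
    graph given as cap[u][v] = capacity.  By the max-flow/min-cut
    theorem, the value returned equals the cost of the minimum s/t cut."""
    res = {u: dict(vs) for u, vs in cap.items()}        # residual capacities
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # reverse edges
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:                    # BFS for a path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:                    # recover the path
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                               # augment the flow
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck
```

When no augmenting path remains, the nodes still reachable from s in the residual graph form the set S of the minimum cut, which in the segmentation setting is the set of pixels assigned to one label.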
When the images being segmented are from image sequences or videos, the images tend to change only slightly from one frame to the next. Kohli and Torr [88] proposed an efficient max-flow formulation that uses information from the first frame to considerably speed up the computation of the solution for subsequent frames. They call this method dynamic graph cuts, and demonstrated good results on improving the computation time versus the traditional approach of reconstructing the graph every frame.
Other works [27, 58, 76, 92] observed that when prior knowledge of the type of object being segmented is available, the data term can be augmented to include a shape prior. Selecting an appropriate shape representation however is the main problem in these systems due to the variability of the shape and pose over time or viewing angle. Determining these pose and shape parameters is a difficult problem in itself.
Kumar et al. [92] approached this difficult problem by matching a set of exemplars for different parts of the object onto the image. These matches are used to generate a shape model for the object. The segmentation problem is then modelled by combining MRFs with layered pictorial structures (LPS) which provide them with a realistic shape prior described by a set of latent shape parameters. However, a lot of effort has to be spent to learn the exemplars for different parts of the model.
Instead of matching shape exemplars, Bray et al. [27] used a simple 3D stick figure as a shape prior to determine 3D pose at the same time as solving the segmentation of the image. The parameters of this model are iteratively explored to find the pose corresponding to the human segmentation having the maximum probability, or minimum energy. The iterative search was made efficient by using the dynamic graph cut algorithm proposed by Kohli and Torr [88].
The work on shape priors [27, 58, 76, 92], and the introduction of an efficient dynamic graph cut algorithm for image sequences [88], made solving graph cut based segmentations much more efficient and practical for applications where efficiency is important. Chapter 4 couples these ideas with a detection algorithm to reduce the processing region to a smaller area of the image. This allows the size of the graph being solved to be reduced considerably, improving computation to real-time speeds.
2.5 Human Detection and Pose Estimation
The methods discussed in the previous section can be employed to quickly identify regions of the image that have changed due to the appearance of the user in front of the modelled background (background subtraction), or to label pixels in the image that should belong to the user based on an appearance model (segmentation). Since the changes in the image are primarily dependent on the location and motion of the user, both tasks could equally be solved by determining the location and pose of the individual.
If the pose is known, then user interface controls such as floating buttons or other more advanced interactions can be driven by following the positions of the parts of the user's body. Knowing the pose also makes it possible to trigger an interaction for a specific body part only, and even to stop a control from activating in error when an arm, head or other limb enters the area.
This is of course a very difficult and non-trivial problem, particularly for real-time applications. The next section discusses methods for monocular pose estimation. This problem can be broken down into two tasks: detection §2.6, and pose estimation §2.7.
2.5.1 Learning Problem
The machine learning problems examined in the chapter on Human Detection (chapter 5) and the chapter on Pose Detection (chapter 6) are supervised learning problems. See figure 2.12 for an illustration of a typical supervised machine learning algorithm.
Figure 2.12: A diagram showing the typical processes involved in supervised machine learning.
These types of algorithms learn a model from a training set D = {(x_i, y_i)} made up of examples that pair a feature vector x ∈ R^n with a corresponding label y ∈ R^m. To reduce the dimensionality of the data that the learning algorithm must deal with, feature vectors are usually generated from training images by extracting lower dimensional information (such as the descriptors discussed later in §5.2).
Supervised learning algorithms attempt to learn a model h(·) that maps the feature vectors x_i extracted from the training data to their corresponding labels y_i (i.e. ∀i : h(x_i) = y_i), or predicted function values in the case of regression. Given a new and unseen feature vector x′, the model will try to predict an appropriate output y′ for it.
In the case of human detection, the output y is usually either a binary value indicating object or non-object, or a continuous value indicating a degree of confidence that the object is there or not. In the case of pose estimation/detection, the output is either a class indicating some quantisation of a possible pose space that maps to a pose, or a predicted 3D pose.
2.6 Human Detection
Human detection is a specialised form of object detection, and is generally formulated as a binary classification problem. A classifier h(·) is learnt that can classify subregions of an image as either containing an object or not.
There have been a few surveys dealing with the problems relating to human detection [49, 62, 63]. The problem is more difficult than rigid object detection, since the appearance of a human can vary considerably and finding a suitable model to cope with the variations is extremely difficult.
The survey by Gandhi and Trivedi [62] explores the problem in terms of pedestrian detection for vehicle safety and discusses various methods that use different sensing technologies. The methods most relevant to this thesis are visible light single camera methods, as they best represent the hardware constraints typically used for computer vision based computer entertainment applications.
A more recent survey by Enzweiler and Gavrila [49] presents a detailed summary of the state of the art computer vision algorithms that address the problem of human detection and compares several methods side by side to assess their performance. They consider the problem in two parts: region of interest (ROI) detection, and classification, though not all methods can be neatly separated into these two stages (take the sliding window approach of Dalal and Triggs [40], discussed later, for instance). The algorithms considered in their benchmark were a HOG based linear classifier [40]; a Haar wavelet cascade classifier [171]; a Neural Network (NN) classifier with adaptive local receptive fields (LRF), where each neuron sees only a portion of the image [173]; and a combined shape detector and texture classifier based on chamfer template matching, with a final texture verification stage that uses an NN/LRF classifier to prune false positives [66].
They report that the local receptive field classifier might have performed slightly better if trained with a non-linear SVM instead, but the memory requirements to do this were too high to allow training using their dataset.
Enzweiler and Gavrila [49] concluded that the HOG linear SVM classifier performed by far the best of the benchmarked methods when no time constraints were considered. When computation constraints were applied, however, the Haar wavelet method performed best. This seems to indicate that cascaded classifiers have a clear advantage in terms of computation in time critical applications, due to the rejection power of the first few levels of the cascade, but at a cost in classification performance when the whole cascade is considered.
Algorithms for human detection can be broadly categorised as either generative or discriminative. Discriminative methods try to find a model that can directly predict the probability p(c|x) of a class label c (either human or non-human) from the feature representation x, while generative methods try to find an appropriate model with variable parameters Θ that describes the appearance of the human by finding the joint probability p(c, x), which can be written as p(x|c)p(c). This can be done by learning the likelihood p(x|c) and the class prior p(c) separately. The prior probability p(c) may vary, as it can be dependent on the model parameters Θ of the shape model used, e.g. to restrict the shape to physically plausible configurations independent of the appearance likelihood p(x|c).
2.6.1 Generative
An advantage of generative algorithms is that they can handle missing or partially labelled data, and their model can also be used to augment datasets with synthetically generated as well as natural training data. However, since they model the joint probability density p(c, x), predicting the class for a new image often requires a computationally intensive iterative solution (such as the Active Contour methods used in [35, 113]), making them costly for use in real-time applications.
Generative approaches generally employ a shape representation to model appearance. Discrete shape models use exemplars (representative shapes) that cover the expected variation in appearance and use efficient methods to determine if any of them matches the query image [65, 66, 154, 164], while other approaches use a parametric shape representation [12, 16, 35, 48, 73, 74, 80, 113] to describe the shape being matched and optimise over the parameter space of the appearance model to find the best match.
Combining shape and texture information with a compound appearance model has also been explored [34, 35, 48, 52, 80]. Training data are normalised using sparse landmark features, such as the approach used by Fan et al. [52], or dense correspondences, such as the method used by Cootes et al. [34]; an intensity model is then learnt from the normalised examples to model the variation in texture appearance.
2.6.2 Discriminative
Discriminative approaches are typically very fast at predicting p(c|x) for new data compared to the iterative solution often required by generative methods. This makes discriminative approaches potentially more useful in real-time applications. A disadvantage of discriminative methods, however, can be the requirement of large numbers of training examples to cover the expected variation in appearance.
Some discriminative approaches use a mixture of experts approach, where the training data are first separated into local shape specific pedestrian clusters and then a classifier is trained for each subspace [66, 114, 141, 144, 175, 177]. The advantage of these methods is that by grouping training examples into clusters of roughly similar shape, the classifier is not faced with the problem of trying to model very high variability in appearance.
Other methods attempt to model the detection problem in terms of semantic parts, such as body parts [5, 104, 108, 141, 147, 175], where a discriminative classifier is learnt for each part, or codebook representations [3, 96, 97, 137], where occurrences of features over local patches, and the geometric relations between them, are learned.
An advantage of multi-part methods is that they can handle occlusions well, and can reduce the number of training examples required to cover the intended pose space. The disadvantage of these methods, however, is that they usually come with a higher computational cost due to the multiple detectors and the additional cost of classifying examples during testing.
Detection Strategies
A typical approach to human detection is that of using a sliding window to scan across an image at multiple scales, to determine directly the classification of each sub-window [40, 41, 108, 122, 136, 159]. The problem with sliding window methods is that they are generally too computationally costly to be used in real-time applications. Chamfer based matching methods [65, 66] can exploit the smoothness of the distance transform to perform a coarse to fine search of the image, but are less accurate than the more computationally expensive sliding window algorithms.
Dalal and Triggs [40] arrange a densely overlapping grid of HOG descriptors within a 96 × 160 detection window and train the classifier using a linear support vector machine. The linear Support Vector Machine (SVM) used by [40, 41] and others [114, 142, 144, 177, 183] is a linear classifier that, once trained, classifies a feature vector as either positive or negative using the equations in (2.6.1).
w^{T}x + b \geq 0 \quad \text{for positive classification} \qquad (2.6.1)
w^{T}x + b < 0 \quad \text{for negative classification}
Where w and b are the weight vector and bias determined during the training process, representing the decision hyperplane used to classify examples, and x is the feature vector to be classified. The decision hyperplane surface is described by equation (2.6.2).
w^{T}x + b = 0 \qquad (2.6.2)
The training problem of a linear SVM is to maximise the distance of the training examples from the decision hyperplane surface. Dalal and Triggs [40] used an implementation called SVM Light [79] to train the linear SVM using a set of positive and negative feature vector examples. They then scan negative images to find a set of hard examples that are used as a bootstrap set to train the classifier again. The performance achieved by this method is still considered among the state of the art for standing pedestrian detection.
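The decision rule (2.6.1) and a crude primal sub-gradient trainer for the hinge loss can be sketched as follows. This toy trainer is only a stand-in illustration; it is not the SVM Light solver used by Dalal and Triggs:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy primal sub-gradient descent on the L2-regularised hinge loss
    (a stand-in illustration, not the SVM Light solver).  y in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0               # examples violating the margin
        if viol.any():
            gw = lam * w - (y[viol, None] * X[viol]).mean(axis=0)
            gb = -y[viol].mean()
        else:
            gw, gb = lam * w, 0.0
        w -= lr * gw
        b -= lr * gb
    return w, b

def classify(w, b, x):
    """Positive class iff w.T x + b >= 0, as in equation (2.6.1)."""
    return 1 if w @ np.asarray(x, dtype=float) + b >= 0.0 else -1
```

At test time only the inner product w^T x + b is evaluated per window, which is what makes linear SVMs attractive for dense sliding-window scanning.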
The work of Zhu, Yeh, Cheng and Avidan [183] greatly improves the computational efficiency of the HOG descriptors by exploiting summed-area tables to build integral histograms [125] that allow constant time calculation of arbitrarily sized HOG features, and combines them in a cascade classifier in a manner similar to [170, 171]. Since then, other authors have also used variable sized HOG blocks [142, 177] with promising results. However, the survey by Enzweiler and Gavrila [49] does not include this more efficient HOG cascade implementation in their time constrained performance evaluation benchmark.
Non-linear SVMs can also be used and can provide an improvement in performance over linear SVMs [5, 108, 112, 114, 122, 159], but the additional computational cost can make it difficult to use for classifiers intended for real-time applications.
Another approach is to employ 2-stage methods that use a fast region of interest detection algorithm to identify candidate locations for processing with a more computationally expensive algorithm. CCTV surveillance applications, or other applications that have a fixed camera with a static background, can employ methods such as background subtraction to focus only on areas of the image that have been identified as foreground. The works of [114, 152, 180] use background subtraction to identify regions of the image that can be focused on with more detailed algorithms. Search space constraints or scene knowledge can also be exploited to more efficiently identify regions of interest [50, 66, 96, 141, 180], while other works attempt to identify regions of high information content [3, 96, 97, 102, 137] to find candidate locations.
A hybrid approach, somewhere between sliding window and ROI/classification methods, is to train a cascade [104, 118, 142, 166, 171, 175, 177, 183], where the first stages have a high precision but use only a small number of features to quickly reject large portions of the image, and later stages use more features to make a more detailed decision. This method has seen great success in face detection [170] at real-time speeds, and is exploited as a region of interest detector for the segmentation algorithm proposed in Chapter 4.
A cascade is trained using the AdaBoost algorithm [170], which attempts to find a subset of possible features at each level of the cascade that meets a specified accuracy, typically a 95% true positive rate and a 50% false positive rate. Each level is trained to correct errors made by the previous level, and the cascade gradually becomes more complex. The key advantage of this approach is that a great deal of the image can be rejected as negative with only a few features, making it suitable for real-time applications.
Gavrila [65] uses a chamfer template matching exemplar based approach and constructs a hierarchical template tree using human shape exemplars and the chamfer distance between them. Similar shape templates are recursively clustered together, selecting at each node a single cluster prototype along with a chamfer similarity threshold calculated from all the templates within that cluster. Multiple branches can be explored if edges from a query image are considered similar to the cluster exemplars of more than one branch in the tree.
2.6.3 Chamfer Matching
The chamfer matching algorithm is inherently fast due to its simplicity, which makes it an attractive method for use in real-time applications, but providing the optimal number of exemplar edge templates can be problematic, since they must be learnt from segmented images to ensure that only object edges are considered by the exemplars. This section gives an overview of the basic matching algorithm used to match an exemplar template such as one from a node in the tree hierarchy constructed by Gavrila [65].
Chamfer matching [11, 20] is a technique for object detection that finds locations (ideally a single location) in a query image that closely match a binary edge template created from an instance of an object being searched for.
Object templates are created off-line from edges extracted from a representative object image using an edge detection technique such as the Canny edge detector [32]. Each template is a set of edge point coordinates from a binary edge map:
O = \{(o_{x_i}, o_{y_i})\}_{i=1}^{n_O} = \{o_i\}_{i=1}^{n_O} \qquad (2.6.3)
Where o_i ∈ R² is a 2D coordinate of an edge point extracted from the original exemplar object image, and n_O is the total number of edge points extracted from the image.
At run-time, edges are extracted from a query image to create a set of binary edge coordinates:
Figure 2.13: Distance feature calculation. Left: A query image. Middle: edges extracted for 2 orientations; each colour represents a different channel, and white represents an overlap between adjacent orientation channels. Right: Truncated distance transforms for each orientation channel.
A = \{(a_{x_i}, a_{y_i})\}_{i=1}^{n_A} = \{a_i\}_{i=1}^{n_A} \qquad (2.6.4)
Where n_A is the total number of edge points. Using the set of points in A, a distance transform D(·) is calculated so that for any query point p and set of edge points A from a query image:
D_A(p) = \min_{q \in A} \lVert p - q \rVert \qquad (2.6.5)
gives the distance to the nearest edge in A from that point. Efficient methods exist to calculate distance transforms [20, 54], making the algorithm ideal for real-time applications.
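One such efficient method, the classic two-pass 3-4 chamfer approximation of D_A (the kind of algorithm referenced in [20]), can be sketched as follows; the rescaling by 3 so that a unit step costs roughly 1 is a presentational choice:

```python
import numpy as np

def chamfer_dt(edges):
    """Two-pass 3-4 chamfer approximation of the Euclidean distance
    transform D_A: distance from each pixel to the nearest edge pixel.

    edges: boolean 2D array, True at edge pixels."""
    INF = 10 ** 6
    h, w = edges.shape
    d = np.where(edges, 0, INF).astype(np.int64)
    for y in range(h):                      # forward pass (top-left down)
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x - 1] + 4)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y - 1, x + 1] + 4)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 3)
    for y in range(h - 1, -1, -1):          # backward pass (bottom-right up)
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 3)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y + 1, x - 1] + 4)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x + 1] + 4)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 3)
    return d / 3.0                          # unit step costs ~1
```

Two raster sweeps suffice because the 3/4 costs approximate the 1/√2 ratio of axial to diagonal steps, giving near-Euclidean distances in linear time.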
The points from O are used to sample the distance transform at a given position using a chamfer cost function. Different cost functions can be used such as the average sum of squared differences [20, 154]. This cost function is evaluated at every position in the image, where the minima indicate the best matches.
The chamfer score C of an object template O at a given location in the distance transformed image DA with the same dimensions as the object template is given by [20, 154]:
C(D_A, O) = \frac{1}{|O|} \sum_{p \in O} D_A(p)^2 \qquad (2.6.6)
A truncated chamfer cost function can improve matching stability [154] in images where the object is partially occluded:
C(D_A, O, \tau_d) = \frac{1}{|O|} \sum_{p \in O} \min(D_A(p)^2, \tau_d) \qquad (2.6.7)
The threshold τ_d gives an upper limit to values in the distance transform where edges might be missing, so that squared distances beyond τ_d all take the same value.
A threshold τ_c is typically used to accept partial matches due to incomplete edge information. A cost value of zero indicates a perfect match (i.e. the average edge distance over all points on the template is zero); however, a low cost can also be caused by a highly textured image area creating a spurious local minimum.
To alleviate this, oriented edge information can also be used to improve matching performance [154]. In this case, template edges are split into a set of edge points for each orientation channel, O = \{O^\theta\}_{\theta=1}^{|\Theta|}, and a set of distance transforms D_A = \{D_A^\theta\}_{\theta=1}^{|\Theta|} is created, one for each edge orientation to be considered from the image edges. See figure 2.13 for an illustration of an oriented chamfer distance transform. The oriented chamfer cost function is:
C_\Theta(D_A, O, \tau_d) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} C(D_A^\theta, O^\theta, \tau_d) \qquad (2.6.8)
The oriented chamfer cost is the average chamfer cost over each of the edge orien- tation channels.
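As a sketch of the chamfer costs of equations (2.6.6) and (2.6.7), assuming the template is stored as an array of integer (x, y) edge coordinates and the distance transform has already been computed:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_cost(dist_map, template_points, tau_d=None):
    """(Truncated) chamfer cost: average (optionally truncated) squared
    distance-transform value sampled at the template's edge points."""
    d = dist_map[template_points[:, 1], template_points[:, 0]] ** 2
    if tau_d is not None:
        d = np.minimum(d, tau_d)
    return d.mean()

# Toy query image whose edges exactly contain the template.
edges = np.zeros((8, 8), dtype=bool)
edges[3, 2:6] = True                      # a horizontal edge segment
dist = distance_transform_edt(~edges)

template = np.array([[2, 3], [3, 3], [4, 3], [5, 3]])  # (x, y) points
print(chamfer_cost(dist, template))       # 0.0: every point lies on an edge
```

The oriented cost of equation (2.6.8) would simply average `chamfer_cost` over per-orientation distance maps and template subsets.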
Chapter 5 proposes a method that combines the chamfer matching algorithm with a linear SVM to automatically learn the templates, even in the presence of background information.
2.7 Pose Estimation
Pose estimation can be formulated as a multi-class classification problem, where each class represents a distinct pose and the closest matching pose exemplars can then be used to regress to a pose in continuous space; as a continuous problem, where the output pose is derived by varying the parameters of an internal model; or as a problem based on the detection results of many part-based classifiers. Whichever approach is used, all are concerned with extracting sufficient information from an image to determine an output pose from the data. The inferred pose can be either in 2D or in 3D.
There are various surveys that consider human motion analysis [4, 63, 106, 107, 124, 150], in which pose estimation is usually a component. Motion analysis methods that are interested in action recognition do not necessarily need to determine an explicit pose configuration to analyse motion, as this can be done indirectly using other information, such as tracking the motion of detections over time or analysing the evolution of low-level image features over time; such methods are out of the scope of this discussion. Similarly, methods that use multiple cameras or additional hardware such as depth cameras, and those that are still computationally too costly for real-time applications, are also not considered.
Many of the existing surveys each define their own categories to group the different approaches, but as observed in the survey by Poppe [124] they generally fall into two categories:
Model-free / Discriminative methods use discriminative approaches to determine pose directly from feature representations.
Model-based / Generative methods maintain an internal appearance likelihood model that iteratively tries to determine pose by varying the model parameters.
These discriminative and generative approaches to human pose estimation have similar advantages and disadvantages to their equivalent approaches in human detection. Discriminative approaches tend to be faster, but can be limited to dealing only with poses that they have been trained with, so large amounts of training data can be required to sufficiently cover the variation in pose. Part-based discriminative methods alleviate the training data problem somewhat, but at the cost of more complex computation when estimating pose, due to evaluating classifiers for individual parts. Generative methods that maintain an internal human model usually find the pose through an iterative solution, making them less suited for real-time applications, but have the advantage that they can deal with poses not seen during training.
2.7.1 Generative
In generative models, various different approaches have been explored to model the shape of a human, ranging from simplistic models using hierarchies of simple 2D shapes to very detailed 3D models of shape and physical deformation.
2D models such as the ‘Cardboard human’ model proposed by Ju et al. [81] use a kinematic hierarchy of planar patches to model human shape. A similar approach
by Morris and Rehg [111] uses a scaled prismatic kinematic 2D human body model for human body registration, where the parts are modelled using textured shapes to find correspondences. Howe et al. [75] use a method similar to Morris and Rehg [111] and infer 3D pose by tracking parts in 2D, then using a prior model of 3D motion learnt from motion capture data in a Bayesian framework that models short-duration motions to reconstruct 3D pose. Huang and Huang [77] extend the model of Morris and Rehg [111] with an extra dimension in the DOF of each part to describe width change, but do not attempt to recover 3D pose.
Models that are defined in 3D are generally constructed from a hierarchy of simple primitives. The hierarchy defines joint locations and relative positions (e.g. the head is attached to the torso), and geometric primitives are defined relative to their associated joints. These primitives can be spheres [120], cylinder based models [133, 148], or models defined using tapered super-quadrics [64, 85], which [155] also uses to model the hand.
Rohr [133] focuses on the problem of pedestrians walking in a plane relative to the camera. First a 3D model of cylinders is constructed to represent the human shape, then background subtraction is applied to locate a region of interest. Within the region of interest, edge pixel locations are detected, which are then linked together, and lines are fit to the linked edge pixels. Cylinder edges from the model are projected into the window around the ROI and model lines are compared against detected lines.
Sidenbladh et al. [148] also use a 3D cylinder based generative model of image appearance that extends the idea of parameterised optical flow estimation to 3D articulated figures. Depth ambiguities, occlusion, and ambiguous image information result in a multi-modal posterior probability over the pose of the model. They employ a particle filtering approach to track multiple hypotheses in parallel and use prior probability distributions over the dynamics of the human body to constrain motions to valid poses.
Pose ambiguities with single camera methods can be resolved by combining information from multiple cameras. Tapered super-quadrics are exploited by Gavrila and Davis [64] to construct a 3D human shape model. The parameters for parts are determined during a calibration step by varying parameters and assessing similarity with chamfer matching using the silhouette of the shape while a subject remains in a known pose. Initialisation is done using background subtraction to find the region of interest, followed by PCA on the foreground pixels to discover the main axes of variation to initialise the torso parameters. Using this initialisation, a search is done to find the best fitting head/torso configuration, then afterwards limb parameters are found in a similar manner.
Bottino and Laurentini [22] use multiple cameras to calculate a 3D reconstruction of the human shape using volume intersection; motion data is then acquired by fitting a human shape model to the reconstructed 3D volume. Later, Kehl and Gool [85] also use 3D reconstruction techniques in combination with a model built using super-ellipsoids, and fit the model to the images over several video streams. To fit the model, multiple image cues are used, consisting of edges, colour information and a volumetric reconstruction. The use of 3D information helps reduce ambiguity, and the model can in turn be used to refine the quality of the reconstruction.
Alternatively, the shape model can be defined as a single polygonal mesh. Anguelov et al. [8] build a surface mesh shape model constructed using training examples from laser scan data over many subjects, then find correspondences between the models. Dimensions of shape variation (e.g. male and female, height, body size) are learnt using PCA on the model data, and realistic surface deformations are learnt using linear regressions per triangle. The method is computationally costly but very expressive. Later, Sigal et al. [149] combined this model with a discriminative initialisation in another multi-view approach, followed by a refinement step to determine a more accurate pose.
Barron and Kakadiaris [10] propose a semi-automatic method of simultaneously estimating a human's anthropometric measurements (physical limb lengths and their proportions) and pose from an uncalibrated image. Landmarks are placed by the user indicating the positions of the main joints; a set of plausible limb length estimates is then produced using a priori statistical information about the human body, and plausible poses are inferred using geometric joint limit constraints.
Some works attempt to determine the shape and articulation automatically. Kakadiaris and Metaxas [82] use data from multiple views and apply a spatio-temporal analysis of the deforming contour of a human moving through a series of predetermined movements to determine the shape and articulation of the model (i.e. it is not defined a priori). When the movements are unknown, however, this method cannot be applied.
The majority of these generative methods, however, tend to be too computationally intensive for real-time applications, due to the extra cost of iteratively evaluating the model and updating its parameters to fit the image data. Multiple cameras can reduce ambiguity, but such methods are not related to this thesis and are mentioned only to give context for the types of approaches used in some of the model-based literature.
2.7.2 Discriminative
Discriminative methods can be further broken down into two categories [124]: learning based, where the problem is to find a mapping directly from feature space to pose space (though due to ambiguities in pose, a set or mixture of mappings tends to be learnt to handle the multi-modality [1, 69, 134, 151]); and example based, where the problem is formulated as finding the nearest exemplar(s) out of a set of possible exemplars that most closely match the feature, and returning the associated pose.
Various methods analyse silhouettes of humans to determine pose. Silhouettes from multiple views are used by Grauman et al. [69] in a probabilistic shape and structure model. A prior density over the multi-view shape and corresponding structure is constructed with a mixture of probabilistic principal components analysers that locally model clusters of data in the input space with probabilistic linear manifolds, and pose is determined by finding the maximum a posteriori estimate. Instead of clustering in input space, [134] cluster in 2D pose space and learn mappings from features to pose for each cluster.
Agarwal and Triggs [1] and Sminchisescu et al. [151] also use a mixture of experts approach. Agarwal and Triggs [1] do this by first clustering silhouettes in a lower dimensional input space found using PCA, to which a mixture of regressors is fit. This approach helps model ambiguities by following multiple hypotheses, but requires a clean silhouette. Sminchisescu et al. [151] use local appearance and shape contexts as a feature space and show promising results on complex monocular human actions.
Elgammal and Lee [47] recover pose from a monocular silhouette, but instead learn view-based activity manifolds and mapping functions between the manifolds and both silhouette and 3D body pose. Pose is estimated by projecting the silhouette to the learned activity manifold, finding the point on the learned manifold representation corresponding to the silhouette, and then applying interpolation over probable 3D poses.
Example (or exemplar) based methods such as Mori and Malik [110] store a number of exemplar 2D views of the human body over different configurations and viewpoints with corresponding manually labelled locations of body joints. Pose estimation is done by matching the input image using shape context matching with a kinematic chain based deformation model and the corresponding pose is used to reconstruct the 3D pose of the human.
Exemplar based approaches have been very successful in pose recognition. However, in applications involving a wide range of viewpoints and poses, a large number of exemplars would be required, and as a result the computational time to recognise individual poses would be very high. One approach, based on efficient nearest neighbour search using histogram of gradient features, addressed the problem of quick retrieval in a large set of exemplars by using Parameter Sensitive Hashing (PSH) [140], a variant of the original Locality Sensitive Hashing (LSH) algorithm [42]. Once a set of nearest neighbours has been found, the final pose estimate is produced by locally-weighted regression, using the set of neighbours to dynamically build a model of the neighbourhood and infer pose. PSH is also applied in the work by [129], using Haar-like wavelet features based on Viola and Jones [170] from multi-view silhouettes obtained using three cameras; a motion graph is then used to find poses close in both input and pose space.
The method of Agarwal and Triggs [2] is exemplar based. They use kernel based regression, but do not perform a nearest neighbour search for exemplars, instead using a sparse subset of the exemplars learnt by the Relevance Vector Machine (RVM). Their method has the disadvantages that it is silhouette based and that it cannot model ambiguity in pose, as the regression is uni-modal.
Gavrila [65] presents a probabilistic approach to hierarchical, exemplar-based shape matching. This method achieves a very good detection rate and real-time performance, but does not regress to a pose estimate. Similar in spirit, Stenger [154] proposed a hierarchical Bayesian filter for real-time articulated hand tracking, but clusters shapes in pose space to construct the tree hierarchy. Toyama and Blake [164] use dynamics in combination with an exemplar-based approach for tracking pedestrians.
Everingham and Zisserman [51] utilise a similar hierarchical chamfer tree approach to [65, 154], constructing a chamfer template tree to reduce the pose search space, then using the estimated pose from the tree classifier to initialise a generative model to find a more accurate pose.
For applications such as pose estimation where there are many object templates, it is inefficient to find the chamfer score for each object at every point in an image, so more efficient methods must be used. Figure 2.14 shows templates for several actions.
Pose detection using chamfer matching can be achieved by clustering poses obtained from a varied action database into a hierarchical structure to quickly explore the example space in a coarse to fine manner, such as the methods proposed by [51, 65, 154]. Okada and Stenger [117] present a method for marker-less human motion capture from a single camera, using tree-based chamfer filtering to efficiently propagate a probability distribution over poses of a 3D body model. Dynamics are used to handle self-occlusion by increasing the variance of occluded body parts, allowing for recovery
Figure 2.14: Pose templates for several different actions.
when the body part reappears.
2.7.3 Combined Detection and Pose Estimation
In contrast to two stage methods, relatively few works attempt to combine localisation and pose estimation into a single discriminative model. Dimitrijevic et al. [44] present a template-based pose detector and solve the problem of the dependency on huge training datasets by detecting only human silhouettes in characteristic postures (sideways opened-leg walking postures in this case). They extended this work in [57] by inferring 3D poses between consecutive detections using motion models. This work gave some very interesting results with moving cameras; however, it seems difficult to generalise to actions that do not exhibit a characteristic posture.
The pose estimation and detection work of Okada and Soatto [116] learns k kernel SVMs to discriminate between k pre-defined pose clusters, and then learns linear regressors from feature to pose space. They extend this method to localisation by adding an additional cluster that contains only images of background.
The problem of jointly tackling human detection and pose estimation at the same time is discussed in more detail in the work presented in Chapter 6.
CHAPTER 3
Fast Background Subtraction
For gaming interfaces, a camera is typically placed on top of a television or monitor and the video stream is displayed on the screen so that the user can see what the camera is viewing.
In most applications, the video image is mirrored on the y axis so that the video stream acts as if the television display is a mirror placed in front of the user. This is the method of display employed by many of the Sony EyeToy games.
People are naturally used to interacting with their own reflection due to the abundance of mirrors in everyday life, and so co-ordinating movement with what a user sees on screen is very intuitive and can be learnt quickly. The success of the EyeToy franchise is a testament to how well this method of interaction has been received, and it is this method that is used for the algorithms described in this chapter.
User interface components are displayed as an overlay on top of the video stream. To interact with a component, the user must move a part of their body over the area covered by that control, and some kind of feedback (a sound effect or visual cue) is triggered to indicate that the interaction was successful. Game objects are displayed and interacted with in a similar manner.
For simple interactions such as those required by basic user interface controls, it is reasonable to tackle the problem using methods that are computationally efficient and extract only the minimum amount of required information when processing the video frame. A strong motivation behind this low-level processing approach is to take as little processing time as possible when the game engine updates each of its subsystems between frames (§2.3.1), to ensure that other media subsystems, such as music, sound and graphical effects, have sufficient computing time remaining to be updated too.
This chapter takes a look at a common algorithm used in many computer vision based games, proposes a new local descriptor, and looks at how other algorithms might
be exploited to provide more advanced methods of interaction.
3.1 Problem: Where is the player?
The general question that needs answering for a computer vision based interface is: "Where in the image is the player?"
An answer to this question is required at least on a very basic level to determine how the user is attempting to interact with the components or game objects being superimposed on the video frame.
This can be determined either directly using shape detection [65, 170], background subtraction [123], and segmentation methods [27, 92], or indirectly by detecting motion using sparse optical flow methods [9, 143] and simple image differencing.
The following section describes an approach of detecting inter-frame differences to drive user interface controls and interact with game objects.
3.2 Image Differencing
The simplest method employed in many computer vision based games (for example the EyeToy Play series and EyeToy Kinetic) is that of image differencing between two consecutive video frames and thresholding the result to create a binary map of differences.
D_t(x, y) = \begin{cases} 1 & \text{if } |I_t(x, y) - I_{t-1}(x, y)| > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (3.2.1)
where D_t is the binary mask of differences between the current frame I_t at time t and the previous frame I_{t−1}, and τ is the difference threshold. The threshold τ is set to a value just above the ambient noise level of the camera, but not so high that true image differences are missed. See figure 3.1 for an illustration.
By finding the difference between consecutive frames, a model of the background image is not required and does not need to be updated to account for illumination changes. However, if the user stands completely still or moves very slowly so that the pixel-wise differences between frames are very small, then movement can be missed.
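Equation (3.2.1) amounts to a thresholded absolute difference between frames. A minimal sketch with NumPy (the threshold value of 20 is illustrative, not from the thesis):

```python
import numpy as np

def difference_mask(frame, prev_frame, tau=20):
    """Binary map of inter-frame differences (eq. 3.2.1).

    Frames are greyscale uint8 arrays; tau should sit just above the
    camera's ambient noise level."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > tau).astype(np.uint8)

prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1, 1] = 200          # simulated movement at one pixel
D = difference_mask(curr, prev)
print(D.sum())            # 1: a single changed pixel
```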
Interacting with a game using this method is simply a case of having game objects collide with areas of the screen that have any non-zero values in the difference map,
D_t. The user just has to make sure they wave their hand in the appropriate area of
Figure 3.1: Image differencing algorithm. For user interface controls, differences are accumulated over several frames under the control to place the button in an active state.
the screen, or just create a large amount of movement in that area and hence produce image differences.
For user interface controls, using the image differences for the current frame D_t means that other components may be accidentally triggered as the user reaches for the appropriate control. This can be addressed by accumulating the differences under the region R covered by the user interface component for several frames, and activating the control only when the accumulated differences have reached a certain threshold.
D_t(R) = \sum_{(x, y) \in R} D_t(x, y) \qquad (3.2.2)
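A sketch of how equation (3.2.2) might drive a motion button, assuming a simple leaky accumulator; the `MotionButton` class and its `activation` and `decay` parameters are hypothetical illustrations, not taken from the thesis:

```python
import numpy as np

class MotionButton:
    """Accumulate the difference mask under the button's region R over
    several frames and activate once the running total crosses a
    threshold; a small per-frame decay lets old motion fade out."""
    def __init__(self, region, activation=20, decay=1):
        self.region = region          # (x0, y0, x1, y1)
        self.activation = activation
        self.decay = decay
        self.total = 0

    def update(self, diff_mask):
        x0, y0, x1, y1 = self.region
        self.total += int(diff_mask[y0:y1, x0:x1].sum())
        self.total = max(0, self.total - self.decay)
        return self.total >= self.activation

button = MotionButton(region=(0, 0, 4, 4), activation=20)
mask = np.ones((8, 8), dtype=np.uint8)    # motion everywhere
print(button.update(mask))                # False: one frame is not enough
print(button.update(mask))                # True after a second frame
```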
This motion button approach relies heavily on the user moving around and does not offer any persistent information across frames as to the true state of pixels belonging to the user. It also does not discriminate between sudden illumination changes and the user's movements, or the shadows cast by the user as they move around within the video frame.
If a user interface control can reliably detect which regions of the video image the user occupies, then more varied interactions can be accommodated. As an example, consider for the moment a musical keyboard interface. With the existing image difference based controls it is difficult to activate the keys momentarily, as the button would have to reliably detect two spikes of differences: one for the hand entering the key region and another for when the hand is removed. A toggle button could be possible, but relies on the control being able to successfully detect both activation and deactivation cues. If a control could determine which pixels were different from some
pre-calibrated reference model, then the button can remain active so long as the number of different pixels is above a given activation threshold, allowing for more sensitive user interface controls.
To solve this, a fast and robust background subtraction algorithm is required. The next section proposes an approach that extends the local binary pattern (LBP) algorithm to use a 3-label neighbourhood coding scheme in which each label is coded by 2 bits, and presents some qualitative results of the method.
3.3 Motion Button Limitations
Motion buttons have been used for computer vision user interface components in a large majority of the existing EyeToy games released by Sony. The idea is simple and effective. Absolute pixel differences between two consecutive frames that are greater than a certain threshold are counted over the region of the frame that the button covers.
Motion buttons become activated when they have accumulated movement over several frames, so that the button is not accidentally activated by the user or by something considered to be in the background. When there is no movement within the button region the button reverts to an inactive state.
However, because motion buttons have to accumulate movement over time before they activate, they have a couple of limitations. Firstly, they cannot be used in interfaces that require a user to keep a button activated, since if a user keeps their hand stationary, no motion is accumulated and the button deactivates. Secondly, the time it takes to accumulate enough movement to activate the button means that expecting the user to activate the button quickly for time-critical applications becomes impractical.
Performing this image differencing instead with a reference image would be one approach to solving this, but subtle illumination changes can affect the difference map and it is difficult to select a threshold that can distinguish true differences from global brightness changes. See figure 3.2 for an illustration of the problem.
It is these limitations that the algorithm used for the persistent button control is intended to address.
3.4 Persistent Buttons
The design goal of the persistent button is to be responsive to fast user interactions, and be able to maintain an active state when the user intends to keep the button activated.
Figure 3.2: Diagram illustrating one of the problems caused by using simple image differencing on a background image. The circled red regions appear only slightly different due to a global illumination change caused by the automatic camera gain, but can cause differences to register in the difference map if the difference threshold is not updated to reflect this.
An example application for this type of component could be a small virtual keyboard, where the user can play sustained notes by keeping their hand within a button's area, or activate them quickly by passing their hand through them briefly.
This level of responsiveness comes with some limitations. Calibration is required to build a model of the region underneath the button, and this means that if the camera is moved for some reason, the button will require recalibration.
This section describes the algorithm, presents some qualitative results and compares the algorithm to two other algorithms: Normalised Cross Correlation (NCC) [100] and Colour Normalised Cross Correlation (CNCC) [71]. See section 2.4.2 for the details of these two algorithms.
3.5 Algorithm Overview
In the following sections, a ‘label’ is defined as a 2-bit binary value, and a ‘code’ is an ordered sequence of concatenated labels c = (l_1, l_2, ..., l_n). The proposed algorithm extends the idea of Local Binary Patterns (LBP) [115]. In the local binary pattern algorithm, the intensity value of each pixel p in an image is compared to the values of the pixels in its local neighbourhood N_p, centred on its position, and a binary code is constructed. Each bit in the binary code represents the result of a comparison between a pixel p_n ∈ N_p from that neighbourhood and the centre pixel p. The value stored in that bit is 0 or 1 to represent the polarity of the comparison: 0 if
Figure 3.3: Illustration of LBP code construction. The centre pixel is compared to each of its neighbours, which are assigned a value of 1 if equal to or greater than its value, or 0 otherwise. The code is then constructed by assigning each neighbourhood pixel to a corresponding bit in a binary code.
(p_n < p), and 1 if (p_n ≥ p). Figure 3.3 illustrates how a simple binary code is generated from intensity values in a small 3x3 neighbourhood.
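The 3x3 LBP construction of figure 3.3 can be sketched as follows (the neighbour ordering here is an arbitrary choice; any fixed bit assignment works):

```python
import numpy as np

def lbp_code(img, x, y):
    """Standard 3x3 LBP code for pixel (x, y): each neighbour contributes
    one bit, set when the neighbour is >= the centre pixel."""
    centre = img[y, x]
    # Clockwise neighbour offsets starting at the top-left.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[y + dy, x + dx] >= centre:
            code |= 1 << bit
    return code

img = np.array([[10, 20, 10],
                [30, 15, 5],
                [10, 20, 10]], dtype=np.uint8)
print(lbp_code(img, 1, 1))  # 162: bits set for the three neighbours >= 15
```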
The proposed algorithm extends this idea of Local Binary Patterns (LBP) [115] so that instead of the pixels in N_p taking one of 2 labels, they can take one of 3 possible labels. This is in the same spirit as the Local Ternary Patterns (LTP) proposed by [160], except that in the proposed LBP3 coding scheme:
1. The labels are given 2-bit values that are not split into upper and lower channels as in [160] and are instead concatenated to form a single code.
2. The code map is generated after pre-processing the image using a Gaussian blur, so that comparisons are done between the sum over weighted regions instead of single pixel intensities to add more spatial support to the comparisons. This is similar in spirit to the way filters are used to compute weighted sums in the DAISY descriptor construction [162] and Geometric Blurring [15].
3. To achieve temporal stability for use with video streams, temporal hysteresis is applied to the code labels so that ambient camera sensor noise does not cause labels to oscillate between two states over time.
4. The labels are coded in such a way that taking the Hamming distance between two codes will yield the distance in label space between them (see section 3.8.2 for details).
The proposed algorithm extension is referred to as LBP3 in the following sections, to differentiate it from the Local Ternary Pattern work of Tan and Triggs [160].
The algorithm takes a frame as input, sub-samples and converts the image into an intensity image (greyscale image), then builds a code for each pixel describing its intensity relative to a set of neighbouring pixels. The current frame's code map is compared to a reference code map stored during calibration, and differences in codes are summed
up to give a measure of how different the current frame is. If the number of differences is above a certain threshold (minus the ambient distance learned during calibration), then the button becomes active. The main image processing steps are highlighted in Algorithm 1.

Algorithm 1: LBP3 Image Processing Steps
1. Sub-sample the image;
2. Convert the image to an intensity image;
3. Blur the intensity image;
4. Build the code map.
3.5.1 Sub-sampling
The image frame I is sub-sampled to reduce the number of pixels that the algorithm has to deal with. This step is optional and the sub-sampling scale factor used in the current implementation is 0.5. This resamples the source image so that the final image used by the algorithm after pre-processing is half the resolution of the original image.
Though this decreases computational cost by reducing the number of pixels processed, it also limits the effective minimum size that a control can be, since less information is available to decide whether or not the button should be in an active state. When a control covers a small area of the image, sub-sampling should not be used.
3.5.2 Intensity Image
By reducing the colour image to just its intensity values, the amount of data the algorithm needs to consider is reduced. This is helpful for the next stage (blurring the image to reduce the effects of noise), as the filtering need only be applied to a single channel rather than to each of the individual RGB channels.
3.5.3 Blur Filtering
By applying a Gaussian filter to the image, the pixel-wise comparisons used in the binary coding scheme (discussed in the next section) are between weighted sums over a small neighbourhood around each of the pixels being compared. Though this does not remove the effects of sensor noise (see section 3.7) on the pixel values, it is an efficient method of adding more spatial support to the comparisons. This is similar to
Figure 3.4: The source image can be sub-sampled by a factor of 0.5 to an image of half the dimensions. To calculate the pixel value for each coordinate (x_s, y_s) in the sub-sampled image, the position is projected back into the original image, and four pixels are averaged together.
the way the DAISY descriptor exploits filtering to pre-calculate weighted sums for its descriptor construction [162].
3.5.4 Code Map
Once the image has been pre-processed, a binary code is generated for each pixel. This code is built from the concatenated labels given to a set of neighbouring pixels N_p, which are labelled according to their intensity relative to the current pixel p being considered. Each label is given a 2-bit binary code, and these codes are concatenated to form the final code for that pixel. This is discussed in more detail in section 3.8.2.
3.6 Algorithm Details
3.6.1 Sub-sampling and Converting to an Intensity Image
Figure 3.4 shows the sub-sampling process. The image I is both sub-sampled and converted from RGB to a greyscale representation in the same pass over the current frame. A simple component average is used to convert the RGB value to greyscale, as shown in the equation below.
f_G(p) = \frac{1}{3}(p_r + p_g + p_b) \qquad (3.6.1)
The original image I is then transformed using this function to give a greyscale image, IG = fG(I). Since the sub-sampling scale factor used is 0.5, the image can be sub-sampled using the average of the current pixel and the pixels to the right, below and below-right and stepping over the source image two pixels at a time.
I_S(x_s, y_s) = \frac{1}{4} \sum_{(u,v) \in N_o} I_G(2x_s + u, 2y_s + v) \qquad (3.6.2)
where x_s ∈ [0, w/2) and y_s ∈ [0, h/2), w and h are the original image width and height respectively, and N_o = {(0, 0), (0, 1), (1, 0), (1, 1)} are the neighbourhood offsets to average over.
The sub-sampling pass can be combined with the greyscale conversion by simply converting the RGB values to greyscale during the sub-sampling pass. The combined sub-sampling transform can then be written as:
I_S(x_s, y_s) = f_G\left( \frac{1}{4} \sum_{(u,v) \in N_o} I(2x_s + u, 2y_s + v) \right) \qquad (3.6.3)
where f_G(·) converts an RGB value to greyscale as defined in equation 3.6.1. Since the divisor is a power of two, the division can be performed as a bit shift, removing the need for an integer divide.
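The combined pass can be sketched in pure Python (an illustrative implementation, not the thesis code; here the greyscale conversion is applied before the four-pixel average, which with integer arithmetic agrees with equation 3.6.3 up to rounding):

```python
def subsample_grey(image):
    """Half-resolution greyscale conversion in one pass over the frame.

    `image` is a list of rows of (r, g, b) tuples with even dimensions.
    The greyscale conversion averages the three channels (equation 3.6.1);
    the four-pixel neighbourhood average uses a right shift by 2, since
    the divisor 4 is a power of two.
    """
    h, w = len(image), len(image[0])
    out = []
    for ys in range(h // 2):
        row = []
        for xs in range(w // 2):
            # Sum the greyscale values over the 2x2 neighbourhood offsets.
            total = 0
            for v in (0, 1):
                for u in (0, 1):
                    r, g, b = image[2 * ys + v][2 * xs + u]
                    total += (r + g + b) // 3      # integer component average
            row.append(total >> 2)                 # divide by 4 as a bit shift
        out.append(row)
    return out
```

For a 640x480 input this produces a 320x240 single-channel image, halving each dimension as described in the text.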
3.6.2 Blur Filtering
To help reduce the ambient pixel noise from the video camera source, the intensity image is smoothed using a blur filter. This is a Gaussian kernel separated to be a two pass 1D operation. The kernel is discretised and scaled so that the values are an integer vector g = {gi} and sum to a power-of-two so that normalisation can be done by a bit shift instead of an integer divide. See figure 3.5 for an illustration of the process.
First the image is convolved with the 1D Gaussian kernel g and stored in a buffer Ibuffer whose dimensions are transposed relative to IG, so that the result of convolving at IG(x, y) is saved in the buffer at location Ibuffer(y, x). Finally Ibuffer is convolved with g again, and the result is stored in IG.
Convolving this way means that IG is convolved along x twice, but because the buffer is transposed the second pass, with (x, y) = (y, x), is actually convolving vertically.
Figure 3.5: Filtering process used to blur the image before generating codes in the code map.
I_{buffer} = (I_G \ast g)^T
I_G = (I_{buffer} \ast g)^T \qquad (3.6.4)
This method can be more efficient on architectures that use memory caching. By transposing the result of the horizontal pass, the vertical pass can read values from the image that are adjacent in memory, rather than being w bytes apart between kernel components g_{i-1} and g_i (assuming a pixel size of one byte in this example). It also means that both passes reduce, at each pixel location, to an integer dot product over contiguous memory followed by a bit shift.
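A minimal sketch of the two-pass scheme follows (illustrative, not the original implementation; the kernel is assumed to be integer-valued and to sum to 2**shift, and edge pixels are clamped):

```python
def blur_row(row, kernel, shift):
    """Convolve one row with an integer kernel; edges are clamped.
    Normalisation is a right shift because sum(kernel) == 2**shift."""
    n, pad = len(row), len(kernel) // 2
    out = []
    for x in range(n):
        acc = 0
        for i, g in enumerate(kernel):
            acc += g * row[min(max(x + i - pad, 0), n - 1)]
        out.append(acc >> shift)
    return out

def blur_separable(img, kernel, shift):
    """Two horizontal passes with a transposed buffer (equation 3.6.4):
    the second pass reads the transposed result of the first, so it is
    effectively vertical while keeping all reads contiguous in memory."""
    buffer = [blur_row(r, kernel, shift) for r in img]
    buffer = [list(col) for col in zip(*buffer)]       # transpose
    result = [blur_row(r, kernel, shift) for r in buffer]
    return [list(col) for col in zip(*result)]         # transpose back
```

With the kernel [1, 2, 1] and shift = 2 (since 1 + 2 + 1 = 4 = 2^2) this reproduces the bit-shift normalisation described above.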
3.7 Noise Model
The input to the algorithm is an image from a web camera video stream. However, due to sensor noise and other factors, the pixel values in this image can vary over time even in static scenes where nothing is moving. A small amount of sensor noise is almost always present due to thermal effects on the sensor device. Automatic gain can alter the relative intensities of the colour channels in the image over time and at different scales, and in the process alter the respective noise levels in the image, so it is important to consider these effects.
Let Iraw(x) be a matrix of image pixels on the sensor Bayer pattern, where coordinate x maps to a pixel intensity value for either R, G or B depending on its position, such as in the pattern shown in figure 2.9 in chapter 2.
The noise model assumes that the actual value of the pixel Iraw(x) is corrupted by an additive sensor noise function η(x) ∼ N (0, σx) and is transformed by a gain function
GΘ which is governed by a number of gain parameters Θ in the camera. This can be written as:
I(x) = GΘ(Iraw(x) + η(x)) (3.7.1)
It is assumed that the sensor noise function η(x) is an additive Gaussian noise model with zero mean and standard deviation σx. The gain function GΘ amplifies the effect of the sensor noise, and may amplify the R, G or B sensor values by different amounts (i.e. the gain factor can be different for each colour sensor).
The noise value can be made constant by transforming I(x) by another function G_\Theta^{-1} (ideally the inverse of G_\Theta) to cancel out the effects of gain. This yields a constant noise function:
G_\Theta^{-1}(I(x)) \approx I_{raw}(x) + \eta(x) \qquad (3.7.2)
However, in practice the actual form of the gain function GΘ may not be available to reverse the effect of the gain. The algorithm assumes that the automatic gain parameters Θ can at least be fixed in some way, for example by turning off automatic gain, so that the noise remains constant during the operation of the algorithm. Making the noise constant is desirable for the labelling scheme described next.
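The noise model of equation 3.7.1 can be simulated in a few lines (a sketch under the simplifying assumption that the unknown gain function G_Θ is a scalar multiply; function names are illustrative):

```python
import random

def observed_pixel(i_raw, gain=1.0, sigma=2.0):
    """Simulate equation 3.7.1: the raw sensor value is corrupted by
    additive Gaussian noise eta ~ N(0, sigma), then passed through a
    gain function, modelled here as a simple scalar multiply."""
    eta = random.gauss(0.0, sigma)
    return gain * (i_raw + eta)

def undo_gain(value, gain):
    """Inverse transform of equation 3.7.2 for the scalar-gain case,
    recovering i_raw + eta so the noise level is constant again."""
    return value / gain
```

In practice the true G_Θ is unknown, which is exactly why the algorithm instead assumes the gain parameters can be fixed during operation.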
3.8 Code Map
The algorithm used to create a code map for the current frame is based on the Local Binary Pattern (LBP) algorithm [115] but extends the 2-label binary coding scheme to a 3-label scheme. The following section gives a brief overview of the LBP algorithm.
3.8.1 Local Binary Patterns
For each pixel p in an image I, neighbouring pixels are assigned a label based on their intensities relative to the intensity of the current pixel. Labels are either 0 or 1, and represent that a neighbourhood pixel pn is either less than (pn < p), or greater than or equal to (pn ≥ p), the intensity value of the current pixel p.
The code map representation is a description of the local neighbourhood intensity values around each pixel. The neighbourhood considered can be arbitrary, but the basic LBP descriptor considers the 8 adjacent pixels surrounding the current pixel (its 8-neighbourhood).
The labels are concatenated to form a binary code describing the pixel's neighbourhood. Refer back to figure 3.3 to see how a code is generated from a simple 3x3 area.
3.8.2 3 Label Local Binary Patterns (LBP3)
The LBP3 algorithm extends the original LBP algorithm by adding a third label to make the LBP code generation more robust to noise. One of 3 labels can be assigned, {less, similar, greater}. Let L(p, q) be a labelling function that assigns a label based on the intensity values of a reference pixel p and a neighbouring pixel q ∈ Np. Labels are assigned as follows:
L(p, q) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau \\ \text{similar} & \text{if } |q - p| \le \tau \\ \text{less} & \text{if } (q - p) < -\tau \end{cases} \qquad (3.8.1)
As with LTP [160], a neighbour pixel in the proposed LBP3 algorithm can take one of 3 labels: less, similar, greater. The similar label is assigned if the absolute difference between the neighbour pixel intensity and the current pixel intensity is within a specified threshold. This threshold is a parameter of the algorithm, and adjusting it can reduce the effect of noise at the cost of contrast sensitivity.
However, unlike the LTP algorithm, in LBP3 each label is associated with a 2-bit binary value: less = 01₂, similar = 11₂, greater = 10₂. These codes are chosen so that the bitwise difference between two labels expresses how far apart they are in label space. Let H(a, b) be the Hamming distance between two binary strings a and b; then H(less, similar) < H(less, greater) and H(similar, less) = H(similar, greater).
This 3-label scheme improves robustness to noise over uniform regions where the differences between pixels are almost identical but offset by a small amount of noise [160]. It is, however, less resistant to intensity scaling, where intensity changes by a constant factor, but it is still more robust than taking simple image intensity differences between frames as a difference measure. See figure 3.6 for an illustration of the code map generation for a simple code.
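The labelling scheme and its Hamming-distance property can be sketched as follows (an illustrative implementation of equation 3.8.1, not the thesis code):

```python
# 2-bit label codes chosen so the Hamming distance between two labels
# reflects how far apart they are in label space.
LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10

def lbp3_label(p, q, tau):
    """Assign one of three labels to neighbour q relative to pixel p,
    with similarity threshold tau (equation 3.8.1)."""
    d = int(q) - int(p)
    if d > tau:
        return GREATER
    if d < -tau:
        return LESS
    return SIMILAR

def lbp3_code(p, neighbours, tau):
    """Concatenate the 2-bit labels of each neighbour into one code."""
    code = 0
    for q in neighbours:
        code = (code << 2) | lbp3_label(p, q, tau)
    return code

def hamming(a, b):
    """Hamming distance between two label codes."""
    return bin(a ^ b).count('1')
```

Here hamming(LESS, SIMILAR) is 1 while hamming(LESS, GREATER) is 2, matching the distance ordering described above.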
Figure 3.6: LBP3 code response to changes in lighting for two image patches covering the same area sampled from different frames in a video. The top image is while the room is under low lighting, the bottom image is of the same area under normal lighting. Each colour in the respective code maps represents a different combination of labels in the binary code, generated using a simple two-pixel neighbourhood code such as the one depicted in figure 3.8. Large differences in lightness yield almost identical code maps for this image patch.
The example implementation of the LBP3 algorithm considers neighbourhood pixels above and to the right to build the code map. These 2-bit labels are concatenated to form the final code for the current pixel. However, as with the original LBP, any arbitrary neighbourhood can be used. Figure 3.8 shows an illustration of a simple two-pixel LBP3 code.
Figure 3.9 shows that the representation is more robust to subtle global illumination changes that simple image differencing would be vulnerable to without updating its difference threshold value.
Of course, there are still label transition problems due to noise, but these occur mostly along edges, as the similar label suppresses noise in homogeneous areas of intensity. Hysteresis thresholding can be used to address this, as discussed in the following section.
3.8.3 Temporal Hysteresis
When the intensity value of a neighbouring pixel is at or very close to the boundary between the similar and greater labels, ambient camera sensor noise can
Figure 3.7: Illustration of relative intensity fluctuation over time between reference pixel p and neighbourhood pixel q, and its effect on the original 2-label LBP algorithm (top) and the proposed LBP3 extension (bottom). In the LBP labelling scheme the value q − p fluctuates between the two labels over time. In the LBP3 labelling scheme the fluctuations stay within the similar label region.
cause the label to be assigned either the similar or greater labels over time. This creates an unwanted level of variation over the region and will show up as differences in the code map.
An approach to dealing with this is to estimate the amount of background noise in the code map for the region covered by the control. For the button region R, over a number of frames T, find the mean ambient noise in the code map difference, \hat{\mu} = \frac{1}{|T|} \sum_{t \in T} L_t(R), and use this value to offset the activation threshold for the control. This can work reasonably well, but for very noisy regions the required
Figure 3.8: LBP3 code map construction.
Figure 3.9: LBP3 code map difference algorithm. Code maps are constructed for the reference image and compared to the code map of the current frame.
number of different pixels per frame can exceed the remaining pixels in the area of the control, meaning the control cannot be activated.
A more robust approach to improving code stability, while still retaining some sensitivity to change, is to apply hysteresis to the code labels assigned at each pixel. Once a pixel has been assigned a label, any change in a consecutive frame must exceed the label boundary by a secondary hysteresis threshold α.
This can be implemented efficiently by updating the code label lt using a hysteresis threshold that adapts itself based on the previous label lt−1.
l_t = L_h(p_t, q_t, l_{t-1}) \qquad (3.8.2)
where p_t and q_t are pixel intensities at time t, and L_h(p, q, l_{t-1}) is defined as:
L_h(p, q, l_{t-1} = \text{similar}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{similar} & \text{otherwise} \end{cases} \qquad (3.8.3)

L_h(p, q, l_{t-1} = \text{greater}) = \begin{cases} \text{similar} & \text{if } (q - p) < \tau - \alpha \\ \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{greater} & \text{otherwise} \end{cases} \qquad (3.8.4)

L_h(p, q, l_{t-1} = \text{less}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{similar} & \text{if } (q - p) > -\tau + \alpha \\ \text{less} & \text{otherwise} \end{cases} \qquad (3.8.5)
Figure 3.10 illustrates how a label is assigned for a pixel q from within the neigh- bourhood Np of a reference pixel p. The left graph shows how the label is assigned if a simple threshold value is used. Since the difference (q − p) starts very close to the similarity threshold τ the label fluctuates between the similar and greater labels. The graph on the right shows how labels are assigned to the same difference value (q − p) when a secondary hysteresis value α is used, and is much more stable over time.
Figure 3.10: Illustration of label assignments given relative intensity (q − p) over time between neighbourhood pixel q and reference pixel p without temporal hysteresis (left) and with temporal hysteresis enabled (right). Orange lines indicate label boundaries, and the height of the orange rectangular regions represents the hysteresis threshold α. By using a secondary hysteresis threshold, the code label will not change until its value becomes significantly different. This drastically reduces the effect of noise on the difference map.
This temporal hysteresis threshold significantly reduces the effects of label fluctuations due to noise, particularly on edges in the image where the difference between reference pixel p and a neighbourhood pixel q is close to the similarity threshold τ. In these situations, ambient sensor noise pushes the difference (q − p) between the two label regions and causes false positives in the code difference map.
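The hysteresis update can be sketched as a single function (an illustrative implementation of the L_h rules; the 2-bit label constants are redefined here so the sketch is self-contained, and the stricter "less"/"greater" boundary is checked first to resolve the overlapping conditions):

```python
LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10

def lbp3_label_hysteresis(p, q, prev_label, tau, alpha):
    """Temporal-hysteresis labelling: a label only changes once the
    difference (q - p) crosses the label boundary by an extra margin
    alpha, suppressing flicker from sensor noise near tau."""
    d = int(q) - int(p)
    if prev_label == SIMILAR:
        if d > tau + alpha:
            return GREATER
        if d < -tau - alpha:
            return LESS
        return SIMILAR
    if prev_label == GREATER:
        if d < -tau - alpha:        # far enough to jump straight to less
            return LESS
        if d < tau - alpha:         # fell below the boundary by alpha
            return SIMILAR
        return GREATER
    # prev_label == LESS
    if d > tau + alpha:
        return GREATER
    if d > -tau + alpha:
        return SIMILAR
    return LESS
```

With tau = 5 and alpha = 3, a difference of 6 that would flicker into the greater label under plain thresholding stays similar until it exceeds 8.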
In section 3.10, figure 3.12 shows results comparing two images from the same video, processed with only a single similarity threshold (left) and with a secondary hysteresis threshold (right). The left image shows the ambient background noise caused by label fluctuations along edges in a sample video sequence; in the right image, using a secondary temporal hysteresis threshold significantly reduces these artifacts.
Figure 3.11: Screen captures of two test scenarios for the persistent button algorithm using the LBP3 local binary pattern variant. Left: Table-top game. Right: Human shape game.
3.9 Experiments
The algorithm was qualitatively tested in two gaming scenarios. The first was a table-top game where the player has to move a cursor using their hand to collect items while avoiding obstacles (shown on the left in figure 3.11); the second was a human shape game where the player must move into a shape that activates the required blocks for as long as they can to score points (shown on the right in figure 3.11).
3.9.1 Comparisons
Since the LBP3 algorithm essentially performs a type of background subtraction on a local region, it is compared to two other computationally efficient local patch comparison methods: a computationally efficient Normalised Cross Correlation (NCC) [100], and the Colour Normalised Cross Correlation (CNCC) proposed by Grest et al. [71]. See section 2.4.2 for the details of these two algorithms.
Normalised cross correlation normalises each of the patches being compared by the magnitude of their respective variances. This offers some illumination invariance to the comparison. Colour normalised cross correlation on the other hand, extends the NCC formulation to perform the correlation dot product using a hue and saturation coordinate vector to gain sensitivity to colour while being more robust to differences caused by shadows [71].
Since the image being compared against is a reference image saved during calibration, much of the computation for the correlation can be pre-calculated for efficiency when processing each new video frame [100]. A small neighbourhood size of 5x5 is used when assessing each algorithm.
3.9.2 Table-top Game
In the table top game application, the camera is angled downward to point towards a clear space on the floor. A gaming grid is then superimposed on top of the lower half of the video frame. Each cell in the grid represents an independent persistent button control.
The buttons are first calibrated with the floor area clear, allowing the camera to settle its internal gain parameters and the algorithm to model the mean ambient noise in each of the button areas. For each button Bi, a mean noise µi is calculated over several frames and used to offset the activation threshold of the respective button. Once calibration is complete, the game begins.
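The calibration and activation logic can be sketched as follows (a hypothetical minimal version; the function names are illustrative, and the activation rule is one reading of "above a certain threshold minus the ambient distance"):

```python
def calibrate_button(diff_counts):
    """Mean ambient code-map difference for one button region,
    measured over the calibration frames while the scene is clear."""
    return sum(diff_counts) / len(diff_counts)

def button_active(diff_count, threshold, ambient_mean):
    """The number of changed codes in the button region, less the
    calibrated ambient noise, must exceed the activation threshold."""
    return (diff_count - ambient_mean) > threshold
```

A region that averages 10 spurious code changes per frame during calibration therefore needs 10 more changes than a quiet region before it fires.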
To play the game, the user sits in front of the virtual gaming board facing the camera and must place their finger into the desired gaming square. Since the user's hands and arms will also activate all buttons up to their desired button, the 'active' button is taken to be the activated button furthest away from the user (closest to the bottom of the screen).
Green and red blocks start from a position along a random side of the gaming board and move across to the opposite side. The user must collect as many green blocks as they can without colliding with any red blocks.
3.9.3 Human Shape Game
The human shape game is played by angling the camera towards the opposite wall so that most of the user’s legs, upper body and arms are visible. As with the table top game, the buttons are calibrated by the user leaving the view of the camera for a short time while mean noise levels are determined for each button in the grid.
Once the game begins, the user must activate only the indicated blocks to score points within a fixed time. If not all blocks are covered, then no points are gained. If the wrong blocks are activated, points are lost. After a short while, a new pattern is displayed and the user must activate a different set of blocks.
3.10 Results
All three algorithms perform well, but with different limitations and advantages. Videos of subjects playing the two games were recorded and then analysed by each of the 3 algorithms.
Figure 3.12 shows the effect of applying temporal hysteresis to the code maps. This significantly reduces the effect of noise from codes generated by pixels that oscillate on the border of the similarity threshold and the less and greater labels.
To quantify the effect of the hysteresis threshold on the LBP3 algorithm, the algorithm was run on a test sequence with ground truth, first with the basic similarity threshold labelling scheme and then with the secondary hysteresis threshold. The first 65 frames of the sequence are a view of a static room in which no movement occurs; after that, the player comes into view of the camera. The graph in figure 3.13(a) shows how the False Positive Rate evolves over time with the two methods, and clearly shows that by using a secondary temporal hysteresis threshold the LBP3 labelling scheme is considerably more stable.
The false positives registered after frame 65 occur when the player comes into view of the camera, and are due to a small reflective surface in the background of the room combined with some minor dilation artifacts caused by the coverage of the descriptor neighbourhood (see figure 3.13(b)). These cause a small number of changes near the border of the ground truth area in the current frame, but are not due to labelling noise.
Figure 3.12: Using a secondary hysteresis threshold significantly reduces the effect of noise on the difference map. Left: Frame without temporal hysteresis. Right: The same frame processed with temporal hysteresis.
Of the three methods, the CNCC algorithm is the most robust to many situations
Figure 3.13: (a) False Positive Rate (FPR) of LBP3 over a video sequence with and without the use of a secondary hysteresis threshold. After the first 65 frames the player comes into view of the camera. It can be clearly seen that without the hysteresis threshold there are many false positive detections; when hysteresis is applied, the false positives reduce to zero in the static part of the scene. (b) False positives after frame 65 are due to a reflective surface in the background and minor dilation artifacts slightly offset from the actual change, due to the LBP3 descriptor coverage (red = false positives, green = true positives, grey = ground truth).
where shadows would otherwise be detected in the difference map. See figure 3.14 for the segmentations achieved by each algorithm. The CNCC algorithm is more robust to shadows only when the underlying surface has colour; where the surface is generally grey it performs no better than the NCC algorithm. This is to be expected: as the colour of the pixels moves towards grey values, the (h, s) hue-saturation plane component reduces to zero and the algorithm operates purely on the lightness L component of the (h, s, L) colour model.
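This grey-surface degeneracy is easy to verify: as saturation goes to zero, the hue-saturation component of the comparison vanishes. A small sketch using Python's `colorsys` (standing in for whatever colour conversion the original implementation uses):

```python
import colorsys
import math

def hue_sat_vector(r, g, b):
    """Cartesian coordinates of a pixel in the (h, s) hue-saturation
    plane, with RGB inputs in [0, 1]. For grey pixels s == 0, so the
    vector collapses to the origin and only lightness remains."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle))
```

A mid-grey pixel maps to the origin of the plane, so any correlation computed on these coordinates carries no information, whereas a saturated red pixel maps to a unit vector.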
3.11 Discussion and Future Work
There are two areas of improvement that would be interesting to explore for future versions of the LBP3 algorithm.
First, it is not as robust to shadows as methods such as CNCC. This could be addressed by performing the LBP3 comparison measure in a more shadow-invariant colour representation, such as the (h, s, L) colour space used by the CNCC algorithm. The work by Yeffet and Wolf [176] uses SSD to compare local patches between different frames to detect motion when constructing their LBP-inspired binary pattern representation, and it would be interesting to see how incorporating shadow-invariant local
Figure 3.14: Results on algorithms applied to table-top game scenario. Game display overlay has been omitted for clarity of results. The CNCC method performs best on this frame due to the red-brown hue of the table cloth aiding the shadow removal in the comparison. If this were grey, then CNCC would perform with similar results to NCC. The extra saturation on the NCC table result includes marked areas that have too low variance and their NCC is unstable. These are detected and filtered out when generating the NCC difference map.
patch comparisons such as the CNCC correlation method into the spatial coding of LBP3 improves robustness to shadows (i.e. comparing patches instead of single pixels from the processed image), though this would increase the computational complexity, slowing down the algorithm and risking the loss of its real-time performance.
Second, since the LBP3 algorithm (as well as the NCC and CNCC methods tested here) holds a background reference image to compare new frames against, if the camera drifts slowly over time, as illustrated in figure 3.15, differences will be registered in the difference map as the frame drifts further out of alignment with the reference image. Applying basic image stabilisation methods to correct for gradual or sudden drift should address this problem.
Figure 3.15: Camera drift over the course of many frames on LBP3 difference maps. Left to right: the camera very slowly tilts upward on its stand because it was not placed securely, causing gradual constant differences in the difference map.
67 CHAPTER 4
Detection and Segmentation
Object detection and segmentation are important problems in computer vision and have numerous commercial applications such as pedestrian detection, surveillance and gesture recognition. Image segmentation has been an extremely active area of research in recent years [24, 27, 58, 88, 92, 135]. In particular, segmentation of the face is of great interest due to applications such as Windows Messenger [36, 158].
One recent application in a computer game context is the use of face detection to control the steering of a virtual character in the racing game AntiGrav. To control the game, the player must move their head to different areas of the screen to steer left, right, up or down. Although the simple face detection employed by the game failed at times, it demonstrated that novel interactions like this can be realised by using computer vision algorithms in a creative way.
Taking inspiration from this idea, it is easy to see the potential of placing a player's face on their own digital avatar, particularly in a multi-player computer game. Using an off-the-shelf face detection algorithm such as the one presented in [170], and coupling it with segmentation techniques, it is possible to segment the face of the player so that it can be added to a 3D avatar. This chapter proposes a novel use of a detection and segmentation algorithm to segment a face in real-time.
Until recently the only reliable method for performing segmentation in real-time was blue screening. This method imposes strict restrictions on the input data and can only be used for certain specific applications. Recently Kolmogorov et al. [90] proposed a robust method for extracting the foreground and background layers of a scene from a stereo image pair. Their system ran in real-time and used two carefully calibrated cameras to perform the segmentation. These cameras were used to obtain disparity information about the scene, which was then used to segment the scene into foreground and background layers. Although they obtained excellent segmentation results, the
Figure 4.1: Real-Time Face Segmentation using face detection. The first image on the first row shows the original image. The second image shows the face detection results. The image on the second row shows the segmentation obtained by using shape priors generated using the detection and localisation results.
need for two calibrated cameras was a drawback of their system.
In the previous chapter (§ 3.4), a low-level method of segmentation via background subtraction using a fast binary descriptor was presented and compared to other fast background subtraction methods. It showed that image segmentation can be tackled efficiently this way, though there remain issues such as camera drift that need to be addressed by other algorithms, making these methods less attractive when reliable segmentations are required.
This section presents a framework that exploits the result of a state-of-the-art face detection algorithm [170] to provide an initialisation for the position and scale of a simple shape prior, which can then be incorporated into a Markov Random Field energy function and minimised using the dynamic graph cut algorithm [88].
4.1 Shape priors for Segmentation
An orthogonal approach to background subtraction for solving the segmentation problem robustly has been the use of prior knowledge about the object to be segmented. In recent years a number of papers have successfully coupled the MRFs used to model the image segmentation problem with information about the nature and shape of the object to be segmented [27, 58, 76, 92]. The primary challenge in these systems is ascertaining what makes a good choice of shape prior, because the shape (and pose) of objects in the real world varies with time. To obtain a good shape prior, the object must be localised in the image and its pose inferred, both of which are extremely difficult problems in themselves.
Kumar et al. [92] proposed a solution to these problems by matching a set of exemplars for different parts of the object on to the image. Using these matches they generate a shape model for the object. They model the segmentation problem by combining MRFs with layered pictorial structures (LPS), which provide them with a realistic shape prior described by a set of latent shape parameters. A lot of effort has to be spent learning the exemplars for the different parts of the LPS model.
In their work on simultaneous segmentation and 3D pose estimation of humans, Bray et al. [27] proposed the use of a simple 3D stick-man model as a shape prior. Instead of matching exemplars for individual parts of the object, their method followed an iterative algorithm for pose inference and segmentation whose aim was to find the pose corresponding to the human segmentation having the maximum probability (or least energy). Their iterative algorithm was made efficient using the dynamic graph cut algorithm [88]. Their work had the important message that rough shape priors were sufficient to obtain accurate segmentation results. This is an important observation which will be exploited in our work to obtain an accurate segmentation of the face.
4.2 Coupling Face Detection and Segmentation
In the methods described above, the computational problem is that of localising the object in the image and inferring its pose. Once a rough estimate of the object pose is obtained, the segmentation can be computed extremely efficiently using graph cuts [24, 25, 70, 88, 91]. In this section we show how an off-the-shelf face detector, such as the one described in [170], can be coupled with graph cut based segmentation to give accurate segmentation and improved face detection results in real-time.
The key idea of the framework proposed in the following sections is that face localisation estimates in an image (obtained from any generic face detector) can be used to generate a rough shape energy. These energies can then be incorporated into a discriminative MRF framework to obtain robust and accurate face segmentation results, as shown in Figure 4.1. This method is an example of the OBJCUT paradigm for an unarticulated object. We define an uncertainty measure for each face detection, based on the energy associated with the face segmentation, and show how this measure can be used to filter out false face detections, thus improving the face detection accuracy.
The algorithm proposed in this section is a method for face segmentation which works by coupling the problems of face detection and segmentation in a single framework. This method is efficient and runs in real-time. The key novelties of the algorithm include:
1. A framework for coupling face detection and segmentation problems together.
2. A method for generating rough shape energies from face detection results.
2. An uncertainty measure for face segmentation results which can be used to identify and prune false detections.
In the next section, we briefly discuss the methods for robust face detection and image segmentation. In section 4.3, we describe how a rough shape energy can be generated using localisation results obtained from any face detection algorithm. The procedure for integration of this shape energy in the segmentation framework is given in the same section along with details of the uncertainty measure associated with each face segmentation. The simple shape prior is then extended to an upper body model in section 4.7. We conclude by listing some ideas for future work in section 4.8.
4.3 Preliminaries
In this section we give a brief description of the methods used for face detection and image segmentation.
4.3.1 Face Detection and Localisation
Given an image, the aim of a face detection system is to detect the presence of all human faces in the image and to give rough estimates of the positions of all such detected faces. In this proposed framework we use the face detection method proposed by Viola and Jones [170]. This method is extremely efficient and has been shown to give good detection accuracy. A brief description of the algorithm is given next.
The Viola Jones face detector works on features which are similar to Haar filters. The computation of these features is done at multiple scales and is made efficient by using an image representation called the integral image [170]. After these features have been extracted, the algorithm constructs a set of classifiers using AdaBoost [61]. Once constructed, successively more complex classifiers are combined in a cascade structure. This dramatically increases the speed of the detector by focussing attention on promising regions of the image¹. The output of the face detector is a set of rectangular windows in the image where a face has been detected. We will assume that each detection window W_i is parameterised by a vector θ_i = {c_i^x, c_i^y, w_i, h_i}, where (c_i^x, c_i^y) is the centre of the detection window and w_i and h_i are its width and height respectively.
¹ A system has been developed which uses a single camera and runs in real-time.
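The early-reject behaviour of the cascade described above can be sketched as follows. This is an illustrative toy, not the Viola-Jones implementation: the stage score functions and thresholds stand in for boosted sums of Haar-feature responses.

```python
# Illustrative cascade sketch: cheap classifiers run first, and a window is
# rejected as soon as any stage fails, so most background windows are
# discarded after only a few feature evaluations. Stage scores here are
# made-up stand-ins for boosted Haar-feature sums.

def cascade_accepts(window, stages):
    """stages: list of (score_fn, threshold), ordered cheapest first."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early reject: later (costlier) stages never run
    return True  # survived every stage: report a detection

# Toy stages: each 'classifier' just sums part of the window values.
stages = [
    (lambda w: sum(w[:2]), 1.0),   # cheap stage using two features
    (lambda w: sum(w), 3.0),       # more complex stage, evaluated rarely
]

face_like = [1.0, 1.0, 1.5]
background = [0.1, 0.2, 0.3]
assert cascade_accepts(face_like, stages)
assert not cascade_accepts(background, stages)
```

The design point this illustrates is that the per-window cost is dominated by the first, cheapest stages, which is what makes dense scanning of the image tractable.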
4.3.2 Image Segmentation
Given a vector y = {y1, y2, ··· , yn} where each yi represents the colour of the pixel i of an image having n pixels, the image segmentation problem is to find the value of the vector x = {x1, x2, ··· , xn} where each xi represents the label which the pixel i is assigned. Each xi takes values from the label set L = {l1, l2, ··· , lm}. Here the label set L consists of only two labels i.e. ‘face’ and ‘background’. The posterior probability for x given y can be written as:
Pr(x|y) = Pr(y|x)Pr(x) / Pr(y) ∝ Pr(y|x)Pr(x)  (4.3.1)

We define the energy E(x) of a labelling x as:
E(x) = −log Pr(x|y) = −log Pr(y|x) − log Pr(x) + constant = φ(x, y) + ψ(x) + constant  (4.3.2)
where φ(x, y) = −log Pr(y|x) and ψ(x) = −log Pr(x). Given an energy function E(x), the most probable or maximum a posteriori (MAP) segmentation solution x∗ can be found as the segmentation solution x that minimises E(x):
x∗ = argmin_x E(x)  (4.3.3)

It is typical to formulate the segmentation problem in terms of a Discriminative Markov Random Field [93]. In this framework the likelihood φ(x, y) and prior terms ψ(x) of the energy function can be decomposed into unary and pairwise potential functions. In particular this is the contrast dependent MRF [24, 92] with energy:
E(x) = ∑_i (φ(x_i, y) + ψ(x_i)) + ∑_{(i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.3.4)
where N is the neighbourhood system defining the MRF. Typically a 4 or 8 neighbourhood system is used for image segmentation, which implies each pixel is connected with 4 or 8 pixels in the graphical model respectively.
4.3.3 Colour and Contrast based Segmentation
The unary likelihood terms φ(x_i, y) of the energy function are computed using the colour distributions for the different segments in the image [24, 92]. For our experiments we built the colour appearance models for the face/background using the pixels
lying inside/outside the detection window obtained from the face detector. The pairwise likelihood term φ(x_i, x_j, y) of the energy function is called the contrast term and is discontinuity preserving in the sense that it encourages pixels having dissimilar colours to take different labels (see [24, 92] for more details). This term takes the form:
φ(x_i, x_j, y) = γ(i, j) if x_i ≠ x_j, and 0 if x_i = x_j  (4.3.5)
where γ(i, j) = exp(−g(i, j)² / 2σ²) · 1/dist(i, j). Here g(i, j) = ‖I_i − I_j‖₂ measures the difference between the RGB pixel values I_i and I_j respectively, and dist(i, j) gives the spatial distance between i and j. Other colour spaces could be used (such as (L∗, u∗, v∗)) but as this would add an additional computational cost to transform the RGB values to the new colour space, this simple distance measure was preferred.
The pairwise prior terms ψ(x_i, x_j) are defined in terms of a generalized Potts model as:

ψ(x_i, x_j) = K_ij if x_i ≠ x_j, and 0 if x_i = x_j  (4.3.6)
This encourages neighbouring pixels in the image² to take the same label, thus resulting in smoothness in the segmentation solution. In most methods, the value of the unary prior term ψ(x_i) is fixed to a constant. This is equivalent to assuming a uniform prior and does not affect the solution. In the next section we will show how a shape prior derived from a face detection result can be incorporated in the image segmentation framework.
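The colour and contrast terms above can be made concrete with a small sketch. This is a minimal toy evaluation of the energy of equation 4.3.4 on a tiny grey-level image, assuming a hypothetical uniform unary cost and a 4-neighbourhood; the constants sigma and K are illustrative, not values used in the thesis experiments.

```python
import numpy as np

# Minimal sketch of the contrast-dependent MRF energy (eq. 4.3.4): unary
# terms per pixel, plus a contrast term (eq. 4.3.5) and a Potts prior
# (eq. 4.3.6) on every 4-neighbour pair with a different label.

def mrf_energy(image, labels, unary_cost, sigma=10.0, K=1.0):
    H, W = image.shape
    E = 0.0
    # unary terms: cost of assigning labels[i] given the pixel value
    for y in range(H):
        for x in range(W):
            E += unary_cost(image[y, x], labels[y, x])
    # pairwise terms over horizontal/vertical neighbours (dist(i,j) = 1)
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < H and nx < W and labels[y, x] != labels[ny, nx]:
                    g = abs(float(image[y, x]) - float(image[ny, nx]))
                    E += np.exp(-g**2 / (2 * sigma**2))  # contrast term
                    E += K                                # Potts prior
    return E

img = np.array([[0, 0, 200], [0, 0, 200]], dtype=float)
cost = lambda value, label: 0.0  # uniform unary: isolates the pairwise terms
aligned = np.array([[0, 0, 1], [0, 0, 1]])      # cut along the strong edge
misaligned = np.array([[0, 1, 1], [0, 1, 1]])   # cut inside a flat region
assert mrf_energy(img, aligned, cost) < mrf_energy(img, misaligned, cost)
```

The assertion shows the discontinuity-preserving effect: placing the label boundary on a strong image edge (large g, small contrast penalty) is cheaper than cutting through a uniform region.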
4.4 Integrating Face Detection and Segmentation
Having given a brief overview of image segmentation and face detection methods, we now show how we couple these two methods in a single framework. Following the OBJCUT paradigm, we start by describing the face energy and then show how it is incorporated in the MRF framework.
² Pixels i and j are neighbours if (i, j) ∈ N.
Figure 4.2: Generating the face shape energy. The figure shows how a localisation result from the face detection stage (left) is used to define a rough shape energy for the face.
4.4.1 The Face Shape Energy
In their work on segmentation and 3D pose estimation of humans, Bray et al. [27] show that rough and simple shape energies are adequate to obtain accurate segmentation results. Following their example we use a simple elliptical model for the shape energy for a human face. The model is parameterised in terms of four parameters: the ellipse centre coordinates (c_x, c_y), the semi-minor axis a and the semi-major axis b (assuming a < b). The values of these parameters are computed from the parameters θ_k = {c_k^x, c_k^y, w_k, h_k} of the detection window k obtained from the face detector as: c_x = c_k^x, c_y = c_k^y, a = w_k/α and b = h_k/β. The values of α and β used in our experiments were set to 2.5 and 2.0 respectively, however these can be computed iteratively in a manner similar to [27]. A detection window and the corresponding shape prior are shown in figure 4.2.
4.5 Incorporating the Shape Energy
For each face detection k, we create a shape energy Θ_k as described above. This energy is integrated in the MRF framework described in section 4.3.2 using the unary terms ψ(x_i) as:

ψ(x_i) = λ(x_i, Θ_k) = −log p(x_i|Θ_k)  (4.5.1)

where we define p(x_i|Θ_k) as:

p(x_i = face|Θ_k) = 1 / (1 + exp(μ · ((cx_i − c_k^x)² / (a_k)² + (cy_i − c_k^y)² / (b_k)² − 1)))  (4.5.2)
and:
p(x_i = background|Θ_k) = 1 − p(x_i = face|Θ_k)  (4.5.3)
where cx_i and cy_i are the x and y coordinates of the pixel i; c_k^x, c_k^y, a_k = w_k/α and b_k = h_k/β are the parameters of the shape energy Θ_k; and the parameter μ determines how the strength of the shape energy term varies with the distance from the ellipse boundary. The different terms of the energy function and the corresponding segmentation for a particular image are shown in figure 4.3.
Figure 4.3: Different terms of the shape-prior + MRF energy function. The figure shows the different terms of the energy function for a particular face detection and the corresponding image segmentation obtained.
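The soft elliptical prior of equation 4.5.2 can be sketched directly. The detection-window numbers below are made up for illustration; only the α = 2.5, β = 2.0 mapping from section 4.4.1 is taken from the text.

```python
import numpy as np

# Sketch of the soft elliptical face prior (eq. 4.5.2): probability near 1
# inside the ellipse, near 0 outside, with mu controlling how sharply the
# probability falls off across the ellipse boundary.

def face_prior(H, W, cx, cy, a, b, mu=5.0):
    ys, xs = np.mgrid[0:H, 0:W]
    d = ((xs - cx) ** 2) / a**2 + ((ys - cy) ** 2) / b**2 - 1.0
    return 1.0 / (1.0 + np.exp(mu * d))

# hypothetical 40x50 detection window centred at (32, 24), alpha=2.5, beta=2.0
w_k, h_k = 40, 50
prior = face_prior(64, 64, cx=32, cy=24, a=w_k / 2.5, b=h_k / 2.0)

assert prior[24, 32] > 0.9          # ellipse centre: almost certainly face
assert prior[0, 0] < 0.1            # far corner: almost certainly background
assert np.all((prior >= 0) & (prior <= 1))
```

Taking −log of this map (and of its complement, per equation 4.5.3) gives the unary shape terms ψ(x_i) added to the MRF energy.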
Once the energy function E(x) has been formulated, the most probable segmentation solution x∗ defined in equation 4.3.3 can be found by computing the solution of the max-flow problem over the energy equivalent graph [91]. The complexity of the max-flow algorithm increases with the number of variables involved in the energy function. Recall that the number of random variables is equal to the number of pixels in the image to be segmented. Even for a moderate sized image the number of pixels is in the range of 10⁵ to 10⁶. This makes the max-flow computation quite time consuming. To overcome this problem we only consider pixels which lie in a window W_k whose dimensions are double those of the original detection window obtained from the face detector. As pixels outside this window are unlikely to belong to the face (due to the shape term ψ(x_i)), we set them to the background.

Figure 4.4: This figure shows an image from the INRIA pedestrian data set. After running our algorithm, we obtain four face segmentations, one of which (the one bounded by a black square) is a false detection. The energy-per-pixel values obtained for the true detections were 74, 82 and 83, while that for the false detection was 87. As can be seen, the energy of the false detection is higher than that of the true detections, and can be used to detect and remove it.

The energy function for each face detection k now becomes:
E_k(x) = ∑_{i∈W_k} (φ(x_i, y) + ψ(x_i, Θ_k)) + ∑_{i,j∈W_k, (i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.5.4)
This energy is then minimised using graph cuts to find the face segmentation x_k∗ for each detection k.
4.5.1 Pruning False Detections
The energy E(x′) of any segmentation solution x′ is the negative log of the probability, and can be viewed as a measure of how uncertain that solution is. The higher the energy of a segmentation, the lower the probability that it is a good segmentation. Intuitively, if the face detection given by the detector is correct, then the resulting segmentation obtained from our method should have high probability and hence have low energy compared to the case of a false detection (as can be seen in figure 4.4).
This characteristic of the energy of the segmentation solution can be used to prune false face detections. This method was also explored by Ramanan [128] for improving the results of human detection. Alternatively, if the number of people P in the scene is known, then we can choose the top P detections according to the segmentation energy.
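Both pruning strategies just described are simple to state in code. The sketch below uses the energy-per-pixel numbers from the figure 4.4 example (74, 82, 83 for true detections, 87 for the false one); the threshold value is illustrative.

```python
# Sketch of pruning by segmentation energy (sec. 4.5.1): detections whose
# energy-per-pixel exceeds a threshold are treated as false positives, or,
# if the number of people P is known, only the P lowest-energy detections
# are kept. Energies mirror the figure 4.4 example.

detections = [("A", 74.0), ("B", 82.0), ("C", 83.0), ("D", 87.0)]

def prune_by_threshold(dets, max_energy):
    return [name for name, e in dets if e <= max_energy]

def keep_top_p(dets, P):
    return [name for name, _ in sorted(dets, key=lambda d: d[1])[:P]]

assert prune_by_threshold(detections, max_energy=85.0) == ["A", "B", "C"]
assert keep_top_p(detections, P=3) == ["A", "B", "C"]
```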
Figure 4.5: Some face detection and segmentation results obtained from our algorithm.
4.6 Implementation and Experimental Results
We tested our algorithm on a number of images containing faces. Some detection and segmentation results are shown in figure 4.5. The time taken for segmenting out the faces is in the order of tens of milliseconds. We also implemented a real-time system for frontal face detection and segmentation. The system is capable of running at roughly 15 frames per second on images of 320 × 240 resolution.
4.6.1 Handling Noisy Images
The contrast term of the energy function can become unreliable in noisy images. To avoid this we smooth the image before the computation of this term. The results of this procedure are shown in figure 4.6.
4.7 Extending the Shape Model to Upper Body
The simple ellipse model used for face detection and segmentation in the previous section can be extended to detect and segment the upper body of a human. This use of localisation can improve the speed of the parameter search in PoseCut [27] as the position and approximate scale of the human is already known from the detector stage of the algorithm.
Figure 4.6: Effect of smoothing on the contrast term and the final segmentation. The images on the first row correspond to the original noisy image. The images on the second row are obtained after smoothing the image.

Figure 4.7 shows an illustration of the model. This upper body model expresses simple articulation of the neck length and angle, and the shape of the upper torso. The model has 6 parameters which encode the x and y location of the two shoulders and the length and angle of the neck.
Using the PoseCut approach, the parameters of the model can be optimised to find the lowest energy E_k(x, Θ_k) given the current model parameters Θ_k, which now represent the 6 parameters of the upper body model.
The energy cost function E_k(x, Θ_k) for detection k is:

E_k(x, Θ_k) = ∑_{i∈W_k} (φ(x_i, y) + ψ(x_i, Θ_k)) + ∑_{i,j∈W_k, (i,j)∈N} (φ(x_i, x_j, y) + ψ(x_i, x_j)) + const  (4.7.1)
The goal is then to find the optimal model parameters Θ_k∗ that yield the lowest segmentation energy, min_x E_k(x, Θ_k):

Θ_k∗ = argmin_{Θ_k} min_x E_k(x, Θ_k)  (4.7.2)
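The nested optimisation of equation 4.7.2 can be sketched with a coarse grid search over two parameters, in place of Powell minimisation, mirroring the neck rotation/length slice explored in figure 4.8. The energy function below is a made-up smooth surrogate for min_x E_k(x, Θ_k), with its minimum placed at angle = 0.2 and length = 30.

```python
import numpy as np

# Sketch of eq. 4.7.2: for each candidate Theta_k, the inner problem
# (graph-cut segmentation) returns the best energy min_x E_k(x, Theta_k);
# the outer problem searches over Theta_k. A quadratic surrogate stands in
# for the real inner minimisation here.

def inner_energy(angle, length):
    # surrogate for the best segmentation energy at these model parameters
    return (angle - 0.2) ** 2 + 0.01 * (length - 30.0) ** 2

angles = np.linspace(-0.5, 0.5, 21)     # neck rotation candidates
lengths = np.linspace(20.0, 40.0, 21)   # neck length candidates

best = min(
    ((inner_energy(a, l), a, l) for a in angles for l in lengths),
    key=lambda t: t[0],
)
_, best_angle, best_length = best

assert abs(best_angle - 0.2) < 0.06     # within one grid step (0.05)
assert abs(best_length - 30.0) < 1.1    # within one grid step (1.0)
```

Powell's method replaces this exhaustive grid with a sequence of 1-D line searches, which is why the multiple but nearly equal local minima seen in figure 4.8 matter: different initialisations can land in either one at essentially the same energy.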
As with the model used in face segmentation, a window W_k is defined around detection k with a size proportional to the detection window scale. The model is initialised to a frontal configuration, as shown in subfigure (a) in figure 4.7, and the optimal parameters are determined by finding the solution to equation 4.7.2. In the upper body model experiments Powell minimisation [127] is used.

Figure 4.7: Upper Body Model. (a) The model parameterised by 6 parameters encoding the x and y location of the two shoulders and the length and angle of the neck. (b) The shape prior generated using the model. Pixels more likely to belong to the foreground/background are green/red. (c) and (d) The model rendered in two poses.
Figure 4.8 shows iterations of several starting positions optimising over 2 of the 6 parameters of the upper body model shown in figure 4.7. It can be seen that although the energy surface has multiple local minima, they are very close and have very similar energy. The experiments showed that the Powell minimisation algorithm was able to converge to almost the same point for different initialisations (see bottom image in figure 4.8).
The upper body model was tested on video sequences from the Microsoft Research bilayer video segmentation dataset [90]. Some results from processing one of the se- quences can be seen in figure 4.9.
4.8 Discussion and Future Work
In the previous sections a method for face segmentation which combines face detection and segmentation into a single framework has been presented. The method runs in real-time and gives accurate segmentation, and shows how the segmentation energy can be used to help prune out false positives in certain situations.
When the upper body model was applied to video sequences, the time taken to find the optimal parameters Θ_k∗ for a given detection window was too slow for processing video streams in real-time. This means that it would also be unviable to use a fully articulated model for determining pose, such as the problem discussed in chapter 6.
For video sequences, knowledge of previous good detections, in addition to temporal smoothing, can be used to suppress detections that oscillate between adjacent discrete detection scales when the true scale of the human is part way between them. Other temporal information, such as initialising the model with the optimal parameters from the previous frame Θ_k∗[t − 1], could take the performance of an upper body model a step closer to real-time speed.

Figure 4.8: Optimising the upper body parameters. (top) The values of min_x E_k(x, Θ_k) obtained by varying the rotation and length parameters of the neck. (bottom) The image shows five runs of the Powell minimisation algorithm which are started from different initial solutions. The runs converge on two solutions which are very close and have almost the same energy.
Figure 4.9: Segmentation results using the 2D upper body model. The first column shows the original images, the second column shows the segmentations obtained from the initial model parameters for the upper body model, and the third column shows segmentations using the optimal parameters. The segmentation energies for the initial and optimal pose parameters are also shown.
CHAPTER 5
Human Detection
Within a web camera based computer game setting, where a player typically stands in front of a camera positioned on top of their TV (as in the Sony EyeToy games), efficiently localising where the player currently is in the video frame can make more advanced computer vision algorithms more robust, or even faster to compute. For instance, if an algorithm fails or produces a weak response, then extra information that localises the player can reduce the search area needed to re-initialise a more complex algorithm.
An example game that illustrates this problem is Sony's EyeToy Kinetic and Kinetic Combat. The player is asked to perform a series of tasks, such as hitting a block while avoiding others, but how the player achieves this is not actually measured, i.e. no estimate of pose is determined during the course of the game. Instead the game detects collisions with areas of movement on screen (e.g. via pixel differencing).
If for instance, a player is asked to perform a movement in a particular way for interactive exercise games such as these, it would be useful to also assess how well a player is doing with this movement. To do this, the pose of the player or other more detailed information needs to be determined.
Pose estimation algorithms can be employed to attempt to solve this; however, good localisation of the player allows complex algorithms to be applied only at areas of interest.
This chapter first discusses the basic requirements of a fast detection algorithm (not necessarily real-time), presents some existing feature descriptors and learning algorithms, and then proposes a new approach to chamfer matching for object detection that can learn the weights required for human detection. Finally, some experiments are presented that compare the classification performance of the different algorithms discussed by testing them on human and upper body datasets.
5.1 Suitable Algorithms
The localisation task can be thought of as a binary class problem in which a position in an image can be labelled as either background or object.
A good localisation method must be computationally efficient so that it can be used in conjunction with other computer vision algorithms on top of any computer game media being displayed, and it should be able to cope with partial occlusions and varied subject size and clothing. Once the location of a player is known, or at least reduced to a smaller number of possible locations, then other more computationally intensive algorithms can be applied.
One such algorithm that has these properties is chamfer matching [11, 20, 154]. Another algorithm, called Histograms of Oriented Gradients (or HOG) [40], is considered state-of-the-art for human detection, though even efficient implementations are not quite able to run at interactive frame rates [182]. This idea has also been extended to considering the responses of different part detectors such as in [23], but this algorithm is slower in comparison due to the complexity added by the separate classifiers, so for detection the simpler single classifier algorithm is used in experiments presented in this chapter.
5.2 Features
Simply passing the raw image data to a learning algorithm such as an SVM does not achieve good results due to variability in brightness and appearance. Typically images are processed to reduce them not only to a more manageable size, but also to the information that remains useful for detecting an object, e.g. edge maps or histograms of oriented gradients (HOG).
There are numerous features used in human detection: silhouette [2], chamfer [65], edges [44], HOG descriptors [40, 140, 183], Haar filters [171], motion and appearance patches [18], edgelet features [175], shapelet features [136] or SIFT [102]. The features selected for use here are edge gradient based descriptors, chosen for their popularity, their success in state-of-the-art human detection [40], and their ability to be made computationally efficient while retaining good accuracy [162, 163, 183].
5.2.1 Histogram of Oriented Gradients
The principle behind the Histogram of Oriented Gradients (HOG) introduced by Dalal and Triggs [40] is quite simple. In contrast to sparse descriptors such as SIFT that are positioned at extrema found in scale-space [102] or at distinct repeatable interest points [13], the HOG detector proposed in [40] uses a dense array of overlapping histograms across a sample window. See figure 5.11 for an illustration of how the descriptors are arranged within the detection window.
The image is first processed to extract gradient orientations and magnitudes at each pixel using an edge operator such as Sobel or Canny. Where available, colour information is used and the maximum gradient over the colour channels is used as the gradient for that pixel.
Each HOG descriptor block is split into a grid of c_w × c_h cells of n_w × n_h pixels (typically 2 × 2 and 8 × 8 respectively [40]). Before constructing the descriptor, the gradients within the descriptor region are weighted by a Gaussian of size σ = 0.5 · BlockSize to down-weight gradient contributions from the pixels at the very edge of the descriptor.
For each cell within the block, a histogram is constructed over θ orientations by summing the magnitudes of each of the gradients within the cell into the histogram bin associated with their orientations. Gradients are bilinearly interpolated with adjacent orientation bins to avoid boundary effects within the orientation histogram. The gradient is also weighted spatially by contributing to histograms in neighbouring cells depending on how close the current location is to the other cells.
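The orientation binning with bilinear interpolation between adjacent bins can be sketched as follows. The spatial interpolation between neighbouring cells is omitted for brevity, and the bin layout over [0, π) is an assumption of this sketch.

```python
import numpy as np

# Sketch of per-cell orientation binning: each gradient's magnitude is split
# between the two nearest orientation bins in proportion to its distance
# from each bin centre, avoiding boundary effects in the histogram.

def cell_histogram(magnitudes, orientations, n_bins=9):
    hist = np.zeros(n_bins)
    bin_width = np.pi / n_bins
    for mag, ori in zip(magnitudes.ravel(), orientations.ravel()):
        pos = ori / bin_width - 0.5          # continuous bin coordinate
        lo = int(np.floor(pos)) % n_bins
        hi = (lo + 1) % n_bins               # wrap around at pi
        frac = pos - np.floor(pos)
        hist[lo] += mag * (1.0 - frac)       # split the magnitude between
        hist[hi] += mag * frac               # the two nearest bins
    return hist

mags = np.ones((2, 2))
oris = np.full((2, 2), np.pi / 9)            # on the boundary of bins 0 and 1
hist = cell_histogram(mags, oris)
assert np.isclose(hist.sum(), mags.sum())    # total magnitude is preserved
assert np.isclose(hist[0], hist[1])          # boundary gradient split evenly
```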
The histograms for each of the block cells are concatenated together h = {h_i} and then normalised to a unit vector to form the final HOG descriptor representation. Each of the HOG descriptor blocks is normalised based on the 'energy' of the histograms contained within it. Dalal and Triggs [40] discusses different methods of contrast normalisation, and the method that appeared to improve results most in the original HOG implementation is the clipped norm method originally proposed by Lowe [102], and referred to as L2-hys by Dalal and Triggs [40]. This method first normalises the descriptor using the L2-norm: h → h / √(‖h‖₂² + ε²), where ‖h‖₂² = h₁² + h₂² + ··· + h_n². Then the components h_i ∈ h are thresholded so that none of the values h_i are greater than 0.2 (this constant was determined empirically in [102]), and the descriptor is normalised again using the L2-norm. Since the descriptor is applied densely, even over uniform patches where no edge gradients exist, a small constant value ε is used to avoid division by zero when ‖h‖₂ = 0. See figure 5.1 for a diagram showing the key steps in the algorithm.

Figure 5.1: Diagram of HOG descriptor construction. Gradients for the whole image are extracted, then over a given descriptor block B, for each position B(x, y) within the block the gradient magnitudes are weighted by a Gaussian |∇B(x, y)| · G(σ, x, y) where σ = 0.5 · blockSize, and added to the histogram belonging to the cell that covers (x, y). The gradient also contributes to adjacent cells proportionally to how close the pixel is to the respective cell centre.
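The L2-hys normalisation step (L2-normalise, clip at 0.2, renormalise) is compact enough to write out directly; the epsilon value below is illustrative.

```python
import numpy as np

# Sketch of L2-hys block normalisation: L2-normalise, clip every component
# at 0.2, then L2-normalise again. The epsilon guard keeps uniform
# (all-zero-gradient) blocks from dividing by zero.

def l2_hys(h, clip=0.2, eps=1e-6):
    h = np.asarray(h, dtype=float)
    h = h / np.sqrt(np.sum(h**2) + eps**2)     # first L2 normalisation
    h = np.minimum(h, clip)                    # clip large components
    return h / np.sqrt(np.sum(h**2) + eps**2)  # renormalise

h = l2_hys([10.0, 1.0, 0.0, 0.0])
assert h.max() <= 0.9                        # dominant bin no longer saturates
assert np.isclose(np.sum(h**2), 1.0, atol=1e-3)
assert np.all(l2_hys(np.zeros(4)) == 0.0)    # zero block stays zero
```

Note that after the second normalisation individual components can again exceed 0.2; the clipping limits the influence of any single dominant gradient direction rather than imposing a hard cap on the final vector.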
Dalal and Triggs [40] explores different configurations of this descriptor, but blocks of 2 × 2 cells of 8 × 8 pixels seem to give good results on the INRIA database while keeping the dimensionality low. The best result was reported to be HOG blocks with a 3 × 3 arrangement of 6 × 6 pixel cells and 9 orientation bins, but this has a much higher (nearly double) dimensionality than the 2 × 2 cell, 8 × 8 pixel arrangement, with only a small improvement in accuracy [40].
Descriptor Variations
There are a few variations of the HOG descriptor discussed in Dalal and Triggs [40]. The two main categories are rectangular and circular HOG descriptors (R-HOG and C-HOG respectively), and the paper explores their performance compared to SIFT descriptors [102], shape context descriptors [110] (simulated using C-HOG descriptors) and generalised Haar wavelet based descriptors [170].
Dalal and Triggs [40] demonstrated that HOG based descriptors outperformed each of the other descriptor variations on two different person databases: the MIT pedestrian database, and a more challenging INRIA database created to test the HOG descriptor. R-HOG descriptors generally performed best on human detection.
Figure 5.2: The sum over the area defined by region R can be found by sampling 4 points from the integral image (A, B, C and D) to give: sum(R) = ii(D) − ii(B) − ii(C) + ii(A).
Integral HOG Features
The HOG features used in the experiments are a slight variant of the original HOG descriptor, called Integral HOG [182]. This descriptor uses integral histograms to speed up the HOG descriptor computation time, while retaining similar accuracy to the original descriptor [182].
An integral image [170] (also known as a summed area table [37]) is a method of efficiently finding the sum of values over a rectangular area. The value at a given point in the integral image is defined as [170]:
ii(x, y) = ∑_{x′≤x, y′≤y} i(x′, y′)  (5.2.1)
Where ii(x, y) is the integral image and i(x, y) is the original image. This table can be calculated efficiently in a single pass over an image by using the following pair of equations [170]:
s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)  (5.2.2)
Where s(x,y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0.
Figure 5.2 shows how an integral image is used to calculate the sum over a given area. In the integral image, the value at point (x, y) represents the sum of all the values in the original image to the left and above of that position (as defined in eq. 5.2.1). The sum over any arbitrary rectangular region R can be calculated by sampling four points (A, B, C and D in figure 5.2) in the integral image, and using the following formula:
sum(R) = ii(D) − ii(B) − ii(C) + ii(A) (5.2.3)
The value at ii(A) is added to compensate for the fact that by subtracting ii(B) and ii(C) from ii(D) the area ii(A) has actually been subtracted twice already, since the area represented by ii(A) is in both ii(B) and ii(C). Using this method the area over any arbitrary region can be calculated with only 4 samples from the integral image representation.
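The integral image and the 4-sample area sum of equation 5.2.3 can be sketched as follows; a zero-padded first row and column stand in for the s(x, −1) = 0 and ii(−1, y) = 0 boundary conditions.

```python
import numpy as np

# Sketch of the integral image (eq. 5.2.1) and the 4-lookup area sum
# sum(R) = ii(D) - ii(B) - ii(C) + ii(A) (eq. 5.2.3). Padding with a zero
# row/column lets region corners index directly into the table.

def integral_image(img):
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def area_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using 4 lookups: D - B - C + A."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
assert area_sum(ii, 0, 0, 4, 4) == img.sum()
assert area_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```

The point of the construction is that after one pass to build the table, the cost of summing a region is constant regardless of its size.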
Viola and Jones [170] uses this representation to efficiently calculate rectangular features made up of positive and negative regions summed together, see figure 5.3 for some example features.
Figure 5.3: Some examples of the Haar-like rectangular features used by Viola and Jones [170]. Areas in white are subtracted from the areas in black.
Zhu, Yeh, Cheng and Avidan [183] extend this idea by combining the approach used in [170] and the integral histogram work of Porikli [125] to HOG [40], calculating integral images over the orientation histograms used by the HOG descriptors to accelerate the speed at which the descriptors can be calculated. In the integral histogram implementation of HOG, the edge gradients are calculated as with the HOG descriptor [40]; however, after the edge gradients have been calculated, they are interpolated between orientation channels and an integral image is calculated over each of the channels. These oriented integral histograms are used to calculate the gradient contribution for each cell by only sampling 4 points for each bin in the cell, instead of having to iterate over each pixel in a HOG descriptor cell.
For each cell in the descriptor block, the sum over the area covered by the cell R over all histograms is found by sampling 4 points for each orientation θ from the integral image (A_θ, B_θ, C_θ and D_θ) to give: sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ). This is done for the region that defines each descriptor cell to quickly generate histograms at any scale in constant time. The histograms are concatenated into a vector and normalised as with regular HOG to form the descriptor h = {h_i}, i = 1, …, c_w·c_h·n_θ, where c_w and c_h are the number of cells along the width and the height of the descriptor respectively, and n_θ is the number of orientations.

Figure 5.4: Diagram of integral histogram calculation for HOG. For each cell in the descriptor, the sum over the magnitudes for each orientation channel θ is concatenated to form the histogram of gradients for the cell using: sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ).

Given that this implementation is much faster than the original HOG implementation and almost as accurate [183], this algorithm is used for the HOG feature experiments presented in this chapter.
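The integral-histogram version of the cell computation can be sketched as follows, assuming the orientation channels have already been built as per-bin gradient-magnitude maps.

```python
import numpy as np

# Sketch of integral-histogram HOG: one integral image per orientation
# channel, so each histogram bin of a cell is the 4-sample area sum
# sum(R_theta) = ii(D) - ii(B) - ii(C) + ii(A), independent of cell size.

def channel_integrals(channels):
    # channels: (n_theta, H, W) gradient-magnitude map per orientation bin
    n, H, W = channels.shape
    padded = np.zeros((n, H + 1, W + 1))
    padded[:, 1:, 1:] = np.cumsum(np.cumsum(channels, axis=1), axis=2)
    return padded

def cell_histogram_from_integrals(ii, top, left, bottom, right):
    # 4 lookups per orientation bin, whatever the cell dimensions are
    return (ii[:, bottom, right] - ii[:, top, right]
            - ii[:, bottom, left] + ii[:, top, left])

channels = np.random.rand(9, 16, 16)          # 9 orientation bins
ii = channel_integrals(channels)
hist = cell_histogram_from_integrals(ii, 4, 4, 12, 12)   # an 8x8 cell
direct = channels[:, 4:12, 4:12].sum(axis=(1, 2))
assert np.allclose(hist, direct)              # matches the per-pixel sum
```

This is the source of the speed-up: the per-cell cost drops from O(cell area) to 4 lookups per orientation bin, and cells of any size come at the same cost.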
5.2.2 DAISY
The DAISY descriptor is a computationally efficient descriptor used for dense matching applications that can be calculated very efficiently compared to other state-of-the-art algorithms such as SIFT, while retaining good accuracy [162, 163].
The descriptor itself is made up of a sampling grid arranged in concentric circles around the descriptor origin. See figure 5.5 for an illustration. Gradient magnitudes are first calculated in different directions. A Gaussian is applied to the edge magnitudes for each orientation, and the result is sampled to build the innermost ring histograms. The Gaussian is applied again and the mid-region histograms are sampled. This process is repeated for each ring of samples.
Figure 5.5: DAISY descriptor construction. Gradient magnitudes are extracted from the source image and quantised into orientations layers. The orientation layers are consecutively blurred by a Gaussian and sampled by one of the sample rings of the descriptor at each step (highlighted in red).
Figure 5.6: Filters used to determine edge response in x and y for the SURF descriptor.
The Gaussian blurring is an efficient way of determining the weighted sum over a circular area, in this case the sample regions in the circular descriptor arrangement. Another advantage of this method is that the different sized sampling regions can be calculated by applying a small Gaussian kernel successively to incrementally sample the different sized regions. The speed of this descriptor makes it very attractive as a feature for use in a dense arrangement such as over a dense object detection window as used in the experiments of this chapter.
The default arrangement of 1 centre region, 3 rings and 8 samples per ring makes for quite a high dimensional descriptor compared to SIFT (though DAISY can be computed much more efficiently). A lower dimensional arrangement of 1 centre region, 2 rings and 4 samples per ring over 4 orientations brings the descriptor down to a similar size to the standard HOG descriptor arrangement, with an acceptable reduction in performance and computational efficiency comparable to integral histogram HOG.
One of the key sources of its efficiency comes from using successive convolution kernels applied to the image oriented gradient map to create lookup tables for the descriptor cell coverage.
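The successive-blurring idea can be sketched on a single orientation channel. This is a rough illustration only: a small separable kernel stands in for the real Gaussian, and the ring radii and sample counts are made-up geometry, not the published DAISY parameters.

```python
import numpy as np

# Rough sketch of the DAISY sampling idea: an orientation map is blurred by
# successive small kernels, and each ring of sample points reads its value
# from the map at the appropriate cumulative blur level, so larger outer
# regions come almost for free.

def blur(channel, k=np.array([0.25, 0.5, 0.25])):
    # small separable smoothing applied once per ring (cumulative blurring)
    c = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, c)

def daisy_samples(channel, centre, radii=(4, 8), n_samples=4):
    cy, cx = centre
    values = [channel[cy, cx]]                 # centre sample, least blurred
    for radius in radii:
        channel = blur(channel)                # blur again for the next ring
        for i in range(n_samples):
            ang = 2 * np.pi * i / n_samples
            y = int(round(cy + radius * np.sin(ang)))
            x = int(round(cx + radius * np.cos(ang)))
            values.append(channel[y, x])
    return np.array(values)

channel = np.random.rand(32, 32)
desc = daisy_samples(channel, centre=(16, 16))
assert desc.shape == (1 + 2 * 4,)              # centre + 2 rings of 4 samples
```

In the full descriptor this sampling is done per orientation channel and the sampled histograms are concatenated and normalised; the sketch keeps only the blur-then-sample structure.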
5.2.3 SURF
The Speeded-Up Robust Features (SURF) algorithm proposed by Bay et al. [13] presents a simple and computationally efficient keypoint descriptor. There are two variants: a rotation invariant descriptor used in sparse keypoint detection algorithms such as SIFT, and an 'upright' descriptor that does not orient the descriptor representation to its dominant gradient orientation. The 'upright' version of the descriptor is used for object detection and is the one used in the experiments of this chapter.
The descriptor itself consists of a 4 × 4 cell grid of 4-bin histograms. The histograms
are calculated from responses to a Haar-like edge filter in x and y. See figure 5.6 for an illustration of the process.
At a desired descriptor scale s, a region of size 20s (s = 0.8 gives a descriptor diameter of 16 pixels, and is used for the detection experiments discussed later) centred at the descriptor's location is convolved with Haar-like wavelets to determine horizontal and vertical edge responses. The region is then partitioned into 4 × 4 sub-regions. For each sub-region, Haar wavelet responses (filter size is 2s) are computed at 5 × 5 regularly spaced sample points, and are weighted with a Gaussian (σ = 3.3s) to increase robustness to localisation errors and geometric deformations. These weighted responses dx and dy over each cell are summed up to create a histogram vector v = (∑ dx, ∑ dy, ∑ |dx|, ∑ |dy|). The absolute values of the feature responses are included to express the polarity of the intensity changes over the region [13]. The vectors for each cell are concatenated and normalised to a unit vector to create a 64 dimensional descriptor. The wavelet responses are invariant to a constant offset in illumination, and invariance to contrast is achieved by normalising the descriptor.
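The sub-region accumulation just described can be sketched as follows. Simple finite differences stand in for the Haar wavelet responses, and the Gaussian weighting and 5 × 5 sample grid are omitted; only the 4 × 4 grid of (∑dx, ∑dy, ∑|dx|, ∑|dy|) vectors and the final normalisation are kept.

```python
import numpy as np

# Sketch of the upright SURF layout: 4x4 sub-regions, each contributing
# (sum dx, sum dy, sum |dx|, sum |dy|), concatenated and L2-normalised to a
# 64-dimensional vector. Finite differences stand in for Haar responses.

def upright_surf(patch):
    dx = patch[:, 1:] - patch[:, :-1]          # horizontal response
    dy = patch[1:, :] - patch[:-1, :]          # vertical response
    dx, dy = dx[:16, :16], dy[:16, :16]        # crop to a common 16x16 grid
    vec = []
    for sy in range(4):
        for sx in range(4):
            rx = dx[4 * sy:4 * sy + 4, 4 * sx:4 * sx + 4]
            ry = dy[4 * sy:4 * sy + 4, 4 * sx:4 * sx + 4]
            vec += [rx.sum(), ry.sum(), np.abs(rx).sum(), np.abs(ry).sum()]
    vec = np.array(vec)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec     # contrast invariance

desc = upright_surf(np.random.rand(17, 17))
assert desc.shape == (64,)                     # 4x4 sub-regions x 4 values
assert np.isclose(np.linalg.norm(desc), 1.0)
```

Keeping both the signed sums and the sums of absolute values is what lets the descriptor distinguish, for example, a region of uniform gradient from one of alternating gradients with the same net response.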
5.2.4 Chamfer Distance Features
Recall from chapter 2, section 2.6.3, that given a set of edge pixel coordinates O = {(x_i, y_i)}_{i=1}^{n_O} extracted from a binary edge template object I_O, and the edge pixel coordinates A = {(x_i, y_i)}_{i=1}^{n_A} extracted from the binary edge image I_A of a query image (e.g. thresholded edges from the current video frame), where n_O and n_A are the number of edge pixels found in template I_O and in image I_A respectively, then the (truncated) chamfer distance between them can be found using:

C(D_A, O, τ_d) = (1/|O|) ∑_{p∈O} min(D_A(p)², τ_d)        (5.2.4)
where τ_d is a threshold used to provide an upper limit on the values in the distance transform to increase stability (see section 2.6.3), and D_A is the distance transform function of A such that, for a given point p, the transform gives the distance to the nearest edge in A:
D_A(p) = min_{q∈A} ‖p − q‖        (5.2.5)
For multiple orientations, the chamfer distance is defined as:
C_Θ(D_A, O, τ_d) = (1/|Θ|) ∑_{θ∈Θ} C(D_{A_θ}, O_θ, τ_d)        (5.2.6)
where Θ is the set of orientation channels, and |Θ| gives the number of orientations.
Expanding C_Θ(D_A, O, τ_d) gives:

C_Θ(D_A, O, τ_d) = (1/|Θ|) ∑_{θ∈Θ} (1/|O_θ|) ∑_{p∈O_θ} min(D_{A_θ}(p)², τ_d)        (5.2.7)
where O_θ and A_θ are the sets of edge coordinates corresponding to orientation θ for the template object O and the query image A respectively, and D_{A_θ} is the distance transform for oriented edge coordinates A_θ, defined using the same notation as equation 5.2.5:
D_{A_θ}(p) = min_{q∈A_θ} ‖p − q‖        (5.2.8)
The distance transform function can be efficiently precalculated for a given query image, making the matching algorithm more computationally efficient. Matches are found by evaluating C_Θ(D_A, O, τ_d) at different locations in the image, where high values indicate a poor match and low values indicate a good match. A value of 0 means that the edges in O correspond exactly to edges in A, i.e. O ⊆ A, but a threshold τ_c is typically used to accept partial matches: C_Θ(D_A, O, τ_d) ≤ τ_c. The chamfer matching algorithm requires that edge templates can be extracted from a series of reference images of the desired object. These images generally require that the edges of the foreground can be separated from background edges, so that only the relevant edges are used in the matching algorithm.
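The truncated chamfer distance of equation 5.2.4 can be sketched directly from these definitions. The helper names are illustrative; the distance transform is computed by brute force rather than the efficient precalculation mentioned above, and orientation channels are omitted for brevity.

```python
import numpy as np

def distance_transform(edge_mask):
    """Brute-force Euclidean distance transform (eq. 5.2.5).

    Fine for small images; real implementations precalculate this
    with a linear-time algorithm.
    """
    ys, xs = np.nonzero(edge_mask)
    pts = np.stack([ys, xs], axis=1).astype(float)
    h, w = edge_mask.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"),
                    axis=-1).reshape(-1, 2).astype(float)
    # Distance from every pixel to its nearest edge pixel.
    d = np.sqrt(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(h, w)

def chamfer_score(dist, template_pts, tau):
    """Truncated chamfer distance of eq. 5.2.4 at a fixed location.

    template_pts is a list of (y, x) template edge coordinates
    already shifted to the candidate location; tau truncates the
    squared distances.
    """
    vals = np.array([dist[y, x] for (y, x) in template_pts])
    return np.minimum(vals ** 2, tau).mean()
```

Evaluating `chamfer_score` with the template shifted to each candidate location, and accepting scores below τ_c, reproduces the matching procedure described above; a perfectly aligned template scores 0.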
The following section proposes an alternative method that allows background edges to be present in the training images by formulating the template of an object as a weight vector in a linear SVM. This is similar to the work presented by Felzenszwalb [53], which learns weights for a Hausdorff distance transform classifier, but here the truncated chamfer distance transform is used. The algorithm is then tested on a standard detection dataset and the results are compared to the other algorithms.
The Partial Hausdorff Distance
Felzenszwalb [53] observed that algorithms such as the Hausdorff distance can be expressed as a generalisation of a linear classifier. The task of matching edges can be done as follows:
Figure 5.7: Illustration of the Hausdorff distance between edge pixel coordinates from object A (green) and object B (blue). If for each point in A, the closest point in B is found, then the Hausdorff distance is the largest of these distances.
If A and B are sets of 2D edge point coordinates, then the Hausdorff distance between the two sets is given by [53]:

h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖        (5.2.9)

The Hausdorff distance is the maximum distance over all the points in A to their nearest point in B, i.e. if for each point in A the closest point in B is found, then the Hausdorff distance is the largest of these distances. See figure 5.7 for an illustration. One of the strengths of this measure is that it does not require that the points in A and B match each other exactly, which makes it able to handle partial edge contours, minor occlusions, or even a slight variation in shape [53].
One problem with this distance measure is that it is not robust with respect to noise or outliers [53]. To deal with this, the K-th ranked value is used instead of the max in equation 5.2.9. The partial Hausdorff distance [53] is defined as:

h_K(A, B) = K^{th}_{a∈A} min_{b∈B} ‖a − b‖        (5.2.10)

Taking K = |A|/2, the partial distance is the median distance from the set of points in A to the set of points in B.
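A minimal sketch of this measure, assuming the point sets are given as (n, 2) coordinate arrays (the function name is illustrative):

```python
import numpy as np

def partial_hausdorff(A, B, K=None):
    """K-th ranked Hausdorff distance h_K(A, B) of eq. 5.2.10.

    A and B are (n, 2) arrays of edge point coordinates. With
    K = |A| this reduces to the classical Hausdorff distance of
    eq. 5.2.9; K = |A| // 2 gives the median distance.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Distance from every point in A to its nearest point in B.
    nearest = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)).min(1)
    K = len(A) if K is None else K
    return np.sort(nearest)[K - 1]      # K-th ranked value (1-indexed)
```

Because the K-th ranked value ignores the |A| − K worst-matching points, outlying edge points no longer dominate the score.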
Object Recognition
Using this partial Hausdorff distance measure, an input object B can be considered the same as object A if:

h_K(A, B) ≤ d        (5.2.11)
Felzenszwalb [53] observed that this condition holds exactly if at least K points from A are at most distance d from some point in B. Using the notion of dilation, where B⟨r⟩ denotes the set of points that are at most distance r from any point in B, h_K(A, B) ≤ d holds when at least K points from A are contained in B⟨d⟩. See figure 5.8 for an illustration of the dilation operation.
Figure 5.8: Illustration of the dilation operator. Shown on the left is a set of edge coordinates B; the middle diagram shows the dilation of B by distance d to give the set of points B⟨d⟩. The diagram on the right shows another set A that matches B⟨d⟩.
Let A be an m × n matrix with binary values 1 for all points in A and 0 otherwise, and let B⟨d⟩ be an m × n matrix with binary values 1 for all points in B⟨d⟩ and 0 otherwise. Then let a = vec(A) and b⟨d⟩ = vec(B⟨d⟩) be mn × 1 vectorisations of these matrices respectively, where vec(·) is defined as:
vec(M) = (M_{1,1}, M_{2,1}, …, M_{m,1}, M_{1,2}, M_{2,2}, …, M_{m,2}, …, M_{m,n})^T        (5.2.12)
where M is an arbitrary m × n matrix, and superscript T denotes the transpose.
Using this vectorised representation, the partial Hausdorff condition h_K(A, B) ≤ d can be expressed by a dot product between a and b⟨d⟩:

h_K(A, B) ≤ d  ⇔  a^T b⟨d⟩ ≥ K        (5.2.13)
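The equivalence in equation 5.2.13 can be illustrated on binary masks. The sketch below dilates B by brute force on a discrete grid and counts the points of A falling inside the dilation via a dot product; the function name and the pixel-grid dilation are assumptions for illustration.

```python
import numpy as np

def hausdorff_dot_check(A_mask, B_mask, d, K):
    """Check h_K(A, B) <= d via the dot product of eq. 5.2.13.

    A_mask and B_mask are binary m x n arrays; B_mask is dilated by
    distance d to form B<d>, and a^T b<d> counts the points of A
    inside the dilation.
    """
    by, bx = np.nonzero(B_mask)
    h, w = B_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # B<d>: pixels within distance d of some point of B (brute force).
    dil = np.zeros_like(B_mask, dtype=bool)
    for y, x in zip(by, bx):
        dil |= (ys - y) ** 2 + (xs - x) ** 2 <= d * d
    a = A_mask.astype(int).ravel()       # a = vec(A)
    b_d = dil.astype(int).ravel()        # b<d> = vec(B<d>)
    return int(a @ b_d) >= K             # a^T b<d> >= K
```

The dot product simply counts how many points of A land inside B⟨d⟩, which is exactly the condition stated above.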
Felzenszwalb [53] applied this approach to human detection in a PAC learning framework [168], using the perceptron algorithm [105] for hypotheses, and promising results on human detection were reported. Figure 5.9 shows a sample detection made by the classifier together with the corresponding learnt weights; a roughly human shape is visible in the positive weights.
Learning Chamfer Weights
Figure 5.9: Some results as reported by Felzenszwalb [53]. Image (a) shows an example detection made by the classifier, (b) shows the weights learnt by the PAC classifier, and (c) shows only the positive weights.
Using a similar approach to the one discussed in section 5.2.4, the chamfer matching algorithm can also be expressed as a dot product in a linear classifier learning algorithm such as an SVM.
Given an object template O and a query image A that are both sets of edge pixel coordinates, let O_θ and A_θ denote the subsets of edge pixel coordinates for a given orientation channel θ.
Using a similar formulation to the one used in section 5.2.4, let O_θ be an m × n matrix with binary values 1 for points in O_θ and 0 otherwise, and let o_θ = vec(O_θ^T) be the vectorisation of the transpose of this matrix, an mn × 1 column vector. The values are transposed so that the rows instead of the columns are concatenated to form the vectorisation, which is more efficient to access in memory than concatenating the columns.
The matrix O is an mn × |Θ| matrix where each of the columns is one of the vectorised orientations o_θ for θ ∈ Θ. This matrix is also vectorised to give o = vec(O), so that the vectorised orientations are concatenated to form a single (mn · |Θ|) × 1 column vector:
o = ( vec(O_1^T)^T, vec(O_2^T)^T, …, vec(O_{|Θ|}^T)^T )^T = ( o_1^T, o_2^T, …, o_{|Θ|}^T )^T        (5.2.14)
Recall that D_A is the distance transform of edge pixel coordinates A; let this now represent an m × n matrix where each entry of D_A is the distance to the nearest edge coordinate in A. For oriented chamfer matching, there is an oriented m × n distance transform matrix D_{A_θ} for each orientation channel θ and corresponding edge pixel coordinates A_θ.
Using the same notation for vectorisation as with the template object O, the orientated distance transforms for the query image A are vectorised and concatenated together so that:
d = ( vec(D_{A_1}^T)^T, vec(D_{A_2}^T)^T, …, vec(D_{A_{|Θ|}}^T)^T )^T = ( d_{A_1}^T, d_{A_2}^T, …, d_{A_{|Θ|}}^T )^T        (5.2.15)
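The stacking of equation 5.2.15, combined with the squaring and truncation required by equation 5.2.7, might be sketched as follows. The function name is an assumption, and row-major flattening is used to mirror the vec(D^T) convention above.

```python
import numpy as np

def chamfer_feature_vector(dist_maps, tau):
    """Build the (mn * |Theta|) x 1 vector d of eq. 5.2.15, with each
    entry squared and truncated at tau as in eq. 5.2.7.

    dist_maps is a list of m x n oriented distance transform arrays
    D_{A_theta}, one per orientation channel theta.
    """
    parts = []
    for D in dist_maps:
        D = np.asarray(D, dtype=float)
        # Row-major flattening of D equals vec(D^T) in the thesis'
        # column-stacking convention.
        parts.append(np.minimum(D ** 2, tau).ravel(order="C"))
    return np.concatenate(parts)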
Next, the values in d are transformed to satisfy the min(D_{A_θ}(p)², τ_d) term in equation 5.2.7, so that the squared distances are no higher than the distance threshold τ_d: