<<

Computer Vision Based Interfaces

for Computer Games

Jonathan Rihan

Thesis submitted in partial fulfilment of the requirements of the award of

Doctor of Philosophy

Oxford Brookes University

October 2010

Abstract

Interacting with a computer game using only a simple web camera has seen a great deal of success in the computer games industry, as demonstrated by the numerous computer vision based games available for the PlayStation 2 and PlayStation 3 game consoles. Computational efficiency is important for these human computer interaction applications, so for simple interactions a fast background subtraction approach is used that incorporates a new local descriptor which uses a novel temporal coding scheme that is much more robust to noise than the standard formulations. Results are presented that demonstrate the effect of using this method for code label stability.

Detecting local image changes is sufficient for basic interactions, but exploiting high-level information about the player’s actions, such as detecting the location of the player’s head, the player’s body, or ideally the player’s pose, could be used as a cue to provide more complex interactions. Following an object detection approach to this problem, a combined detection and segmentation approach is explored that uses a face detection algorithm to initialise simple shape priors to demonstrate that good real-time performance can be achieved for face texture segmentation.

Ultimately, knowing the player’s pose solves many of the problems encountered by simple local image feature based methods, but is a difficult and non-trivial problem. A detection approach is also taken to pose estimation: first as a binary class problem for human detection, and then as a multi-class problem for combined localisation and pose detection.

For human detection, a novel formulation of the standard chamfer matching algorithm as an SVM classifier is proposed that allows shape template weights to be learnt automatically. This allows templates to be learnt directly from training data even in the presence of background and without the need to pre-process the images to extract their silhouettes. Good results are achieved when compared to a state of the art human detection classifier.

For combined pose detection and localisation, a novel and scalable method of exploiting the edge distribution in aligned training images is presented to select the most potentially discriminative locations for local descriptors, which allows a much larger space of descriptor configurations to be utilised efficiently. Results are presented that show competitive performance when compared to other combined localisation and pose detection methods.

Dedicated to my parents and my family, to family Dahm, and to Susanne.

In memory of Elke Dahm, and Winnie Smith.

Acknowledgements

During the time I have spent studying at Oxford Brookes, I have had the honour and the pleasure of meeting and working with many very talented and enthusiastic people.

First I would like to thank my director of studies Professor Philip H. S. Torr for all his valuable guidance, advice, enthusiasm and support that allowed me to develop my skills as a researcher and gain an understanding of computer vision. I would also like to thank my second supervisor Dr Nigel Crook for his advice and valuable feedback during writing, and would like to thank Professor William Clocksin for the teaching opportunities that gave me valuable teaching experience during my final year of study.

Of those I have worked with while studying in the Oxford Brookes Computer Vision group, I would like to thank M. Pawan Kumar, Pushmeet Kohli, Carl Henrik Ek, Chris Russel, Gregory Rogez, Karteek Alahari, Srikumar Ramalingam, Paul Sturgess, David Jarzebowski, Lubor Ladicky, Christophe Restif, Glenn Sheasby and Fabio Cuzzolin for all the interesting research discussions.

During my time at the London Studio of Sony Computer Entertainment Europe, I worked under the supervision and guidance of Diarmid Campbell who I would like to thank for giving me valuable insight into applied computer vision and the games industry. From within the Sony EyeToy R&D group I would also like to thank Simon Hall, Nick Lord, Graham Clemo, Sam Hare, Dave Griffiths and Mark Pupilli.

I would like to thank my parents for all the support they have given me during the many years of my studies. I would also like to thank Elke and Erdmann Dahm and family for their support and encouragement during the final year of writing.

Finally, I would like to thank Susanne Dahm for her unwavering support and understanding during my studies, and in whom I found the strength to do more than I ever thought I could.

Contents

1 Introduction 1

1.1 Defining ‘Real-time’...... 1

1.2 Motivation...... 2

1.3 Approach...... 3

1.3.1 Detecting Movement...... 4

1.3.2 Face Detection and Segmentation...... 5

1.3.3 Human Detection...... 5

1.3.4 Pose Estimation...... 6

1.4 Contributions...... 6

1.5 Thesis Structure...... 7

1.6 Publications...... 8

2 Background 9

2.1 Computer Vision in Games...... 9

2.1.1 Game Boy Camera (1998)...... 9

2.1.2 SEGA Dreamcast Dreameye Camera (2000)...... 10

2.1.3 Sony EyeToy (2003)...... 11

2.1.4 Nintendo Wii (2006)...... 13

2.1.5 XBox LIVE Vision (2006)...... 13

2.1.6 Sony Go!Cam (2007)...... 14

2.1.7 Sony PlayStation Eye (2007)...... 15

2.2 Types of Camera...... 16

2.2.1 Monocular (standard webcam)...... 17


2.2.2 Stereo...... 17

2.2.3 Depth, or Z-Cam...... 18

2.3 Problem Domain...... 18

2.3.1 Typical Computer Games...... 19

2.3.2 Computational Constraints...... 20

2.4 Background Subtraction and Segmentation...... 20

2.4.1 Background Subtraction...... 20

2.4.2 Background Subtraction using Local Correlation...... 24

2.4.3 Segmentation...... 27

2.4.4 Graph-Cut based Methods...... 29

2.5 Human Detection and Pose Estimation...... 32

2.5.1 Learning Problem...... 32

2.6 Human Detection...... 33

2.6.1 Generative...... 35

2.6.2 Discriminative...... 35

2.6.3 Chamfer Matching...... 38

2.7 Pose Estimation...... 40

2.7.1 Generative...... 41

2.7.2 Discriminative...... 44

2.7.3 Combined Detection and Pose Estimation...... 46

3 Fast Background Subtraction 47

3.1 Problem: Where is the player?...... 48

3.2 Image Differencing...... 48

3.3 Motion Button Limitations...... 50

3.4 Persistent Buttons...... 50

3.5 Algorithm Overview...... 51

3.5.1 Sub-sampling...... 53

3.5.2 Intensity Image...... 53

3.5.3 Blur Filtering...... 53


3.5.4 Code Map...... 54

3.6 Algorithm Details...... 54

3.6.1 Sub-sampling and Converting to an Intensity Image...... 54

3.6.2 Blur Filtering...... 55

3.7 Noise Model...... 56

3.8 Code Map...... 57

3.8.1 Local Binary Patterns...... 57

3.8.2 3 Label Local Binary Patterns (LBP3)...... 58

3.8.3 Temporal Hysteresis...... 59

3.9 Experiments...... 63

3.9.1 Comparisons...... 63

3.9.2 Table-top Game...... 64

3.9.3 Human Shape Game...... 64

3.10 Results...... 65

3.11 Discussion and Future Work...... 66

4 Detection and Segmentation 68

4.1 Shape priors for Segmentation...... 69

4.2 Coupling Face Detection and Segmentation...... 70

4.3 Preliminaries...... 71

4.3.1 Face Detection and Localisation...... 71

4.3.2 Image Segmentation...... 72

4.3.3 Colour and Contrast based Segmentation...... 72

4.4 Integrating Face Detection and Segmentation...... 73

4.4.1 The Face Shape Energy...... 74

4.5 Incorporating the Shape Energy...... 74

4.5.1 Pruning False Detections...... 76

4.6 Implementation and Experimental Results...... 77

4.6.1 Handling Noisy Images...... 77

4.7 Extending the Shape Model to Upper Body...... 77


4.8 Discussion and Future Work...... 79

5 Human Detection 82

5.1 Suitable Algorithms...... 83

5.2 Features...... 83

5.2.1 Histogram of Oriented Gradients...... 84

5.2.2 DAISY...... 88

5.2.3 SURF...... 89

5.2.4 Chamfer Distance Features...... 90

5.3 Human Detector...... 96

5.3.1 Training...... 97

5.3.2 Hard Examples...... 98

5.4 Datasets...... 98

5.4.1 HumanEva...... 98

5.4.2 INRIA...... 99

5.4.3 Mobo...... 99

5.4.4 Upper Body Datasets...... 100

5.5 Experiments...... 100

5.5.1 Human Detection...... 100

5.5.2 Upper Body Detection...... 102

5.6 Discussion...... 102

5.6.1 Computational Efficiency...... 104

5.6.2 Chamfer SVM...... 106

5.6.3 Edge Thresholding...... 107

5.6.4 Bagging, Boosting and Randomized Forests...... 108

5.7 Summary and Future Work...... 110

6 Pose Detection 112

6.1 Introduction...... 114

6.1.1 Related Previous Work...... 114


6.1.2 Motivations and Overview of the Approach...... 116

6.2 Selection of Discriminative HOGs...... 119

6.2.1 Formulation...... 120

6.3 Randomized Cascade of Rejectors...... 123

6.3.1 Bottom-up Hierarchical Tree construction...... 123

6.3.2 Randomized Cascades...... 126

6.3.3 Application to Human Pose Detection...... 132

6.4 Experiments...... 134

6.4.1 Preliminary results training on HumanEva...... 135

6.4.2 Experimentation on MoBo dataset...... 137

6.5 Conclusions and Discussions...... 144

7 Conclusion 146

7.1 Summary of Contributions...... 146

7.2 Future Work...... 148

A Code Documentation 149

A.1 System Overview...... 149

A.1.1 Shared Projects...... 149

A.1.2 Project: CapStation...... 150

A.1.3 Project: HogLocalise...... 151

A.2 C++: Shared Libraries...... 153

A.2.1 Common...... 153

A.2.2 ImageIO...... 154

A.3 MEX: cppHogLocalise...... 154

A.3.1 Usage...... 154

A.3.2 Options: Scanning...... 155

A.3.3 Options: Classification...... 157

A.3.4 Related Code Folders...... 161

References 162

List of Tables

6.1 2D Pose Estimation Error on HumanEva II...... 135

6.2 Classifiers - Training Time...... 141

6.3 Classifiers - Detection Time...... 143

List of Figures

1.1 Typical Camera Position...... 2

1.2 Camera View...... 3

2.1 Nintendo Game Boy Camera...... 10

2.2 Sega Dreameye...... 11

2.3 Sony EyeToy Camera...... 11

2.4 Sony EyeToy Games...... 12

2.5 Sony EyeToy Peripherals...... 13

2.6 XBox Live Vision Camera...... 14

2.7 Sony PlayStation Portable Go!Cam...... 15

2.8 Sony PlayStation Eye camera...... 16

2.9 Camera Bayer Pattern...... 17

2.10 hsL Colour Space...... 27

2.11 GraphCut Illustration...... 30

2.12 Supervised Machine Learning Diagram...... 33

2.13 Chamfer distance...... 39

2.14 Chamfer Pose Templates...... 46

3.1 Image Differencing...... 49

3.2 Background Image Differencing...... 51

3.3 LBP Code...... 52

3.4 LBP3 Subsampling...... 54

3.5 Blur Filter...... 56


3.6 LBP3 Code Response To Lighting...... 59

3.7 LBP3 Labels...... 60

3.8 LBP3 Code Map Construction...... 60

3.9 LBP3 Code Map Differencing...... 61

3.10 LBP3 Hysteresis...... 62

3.11 Persistent Button Applications...... 63

3.12 LBP3 Hysteresis Results...... 65

3.13 LBP3 Hysteresis FPR...... 66

3.14 Shadow Results...... 67

3.15 Camera Drift Effect on LBP3...... 67

4.1 Real-Time Face Segmentation...... 69

4.2 Face Shape Energy...... 74

4.3 Energy Function Terms...... 75

4.4 False Positive Pruning...... 76

4.5 Face Segmentation Results...... 77

4.6 Effect of Smoothing...... 78

4.7 Upper Body Model...... 79

4.8 Optimising Body Parameters...... 80

4.9 Upper Body Segmentation Results...... 81

5.1 HOG Descriptor...... 85

5.2 Integral Image...... 86

5.3 Haar-like Rectangular Features...... 87

5.4 Integral Histogram HOG...... 88

5.5 DAISY Descriptor Construction...... 88

5.6 SURF Descriptor...... 89

5.7 Hausdorff Distance...... 92

5.8 Dilation Example...... 93

5.9 Hausdorff PAC Classifier Results...... 94

5.10 Chamfer Distance Features...... 97


5.11 Human Detector Window Construction...... 98

5.12 HumanEva Detection Dataset...... 99

5.13 INRIA Dataset...... 99

5.14 MoBo Dataset...... 100

5.15 INRIA: SVM Classifier Feature Comparison...... 101

5.16 Mobo Upper Body: SVM Classifier Feature Comparison...... 103

5.17 Chamfer SVM Weights...... 104

5.18 Feature Vector Timings...... 105

5.19 Canny Edge Thresholds...... 108

5.20 Chamfer SVM Classifier Edge Thresholds...... 108

5.21 SVM and SGD-QN SVM Comparison...... 109

5.22 SVM and FEST Algorithms Comparison...... 110

6.1 Random Forest Preliminary Results...... 117

6.2 Log-likelihood Ratio for Human Pose...... 119

6.3 Log-likelihood Ratio for Face Expressions...... 122

6.4 Selection of Discriminative HOG for Face Expressions...... 123

6.5 Bottom-up Hierarchical Tree learning...... 124

6.6 Hog Block Selection...... 127

6.7 Rejector branch decision...... 130

6.8 Single Cascade Localisation...... 131

6.9 Pose Detection...... 133

6.10 Pose Localisation...... 134

6.11 Pose Detection Results on HumanEva II Data Set...... 135

6.12 Pose Detection Result with a Moving Camera...... 136

6.13 MoBo Dataset...... 138

6.14 Defined Classes on MoBo...... 139

6.15 Pose Classification Baseline Experiments on MoBo...... 140

6.16 Localisation Dataset...... 141

6.17 Detection Results...... 142


6.18 Best Cascade Localisation Results...... 143

A.1 System Diagram of Main Applications...... 149

A.2 Main Stages of Sliding Window Algorithm...... 151

A.3 Scale Search Method...... 152

A.4 HOG Feature Scaling...... 153

Notation

The following conventions are used in this thesis. Sets, matrices or special functions are uppercase characters, such as $X$ or $Y(\theta)$. Functions or scalar values are lower-case characters such as $a$, $n$, $f(n)$ and $h(g)$. Vectors are lower-case bold characters, $\mathbf{x}$ and $\mathbf{p}$, and their components are lower-case with a subscript index, e.g. $\{x_1, x_2, x_3, \ldots, x_n\}$ or $p_r, p_g, p_b$. A number followed by a subscript denotes the base of the number, e.g. $5_{10} = 0101_2$; if no subscript is specified then the number is assumed to be decimal.

CHAPTER 1

Introduction

The ability to interact with a computer game without the requirement of a traditional game controller device provides a player with the unique ability to use their own body to directly interact with the game they are playing.

This thesis is concerned with the problem of real-time human computer interaction for computer games. Single camera systems are (at the time of writing) by far the most common type of system available, and as such are the problem domain of the algorithms discussed.

Many existing camera based games, such as the numerous games available for the Sony PlayStation 2 system in conjunction with Sony’s EyeToy camera (§2.1.3), all encourage the player to stand up and take a more physical role in the game play, in contrast with their usual seated gaming position. Indeed, some games such as Sony EyeToy Kinetic have taken this interaction concept even further and try to provide exercise training routines to help players keep fit.

These games tend to employ computer vision algorithms that are both computationally efficient, so as not to take too much computation time away from the rest of the media the game has to update, and robust, so that the player isn’t frustrated by the technology failing at critical moments.

1.1 Defining ‘Real-time’

Throughout the thesis, the term ‘real-time’ is frequently used to describe and compare the performance and practicality of using different algorithms for computer vision based interface problems. Generally speaking, the goal of any algorithm used in this context is to process images fast enough to allow the user a sufficient level of interaction with the interface.


The term ‘interactive frame rate’ could also be used to describe the performance of the algorithms discussed in this thesis, but the term ‘real-time’ is generally more widely used.

Where ‘real-time’ is used in the text, it is taken to mean that there is sufficient performance for user interaction, and may vary depending on the context.

1.2 Motivation

Offering the ability for a player to use their body to interact with a computer game has already seen a great deal of success in the computer games industry, and demonstrates that there is a lot of interest in using this technology as an alternative to traditional game controller devices.

The game systems responsible for much of the success of computer vision based computer games have been the Sony PlayStation 2 and PlayStation 3 game consoles. See section 2.1 for a history of how computer vision based games have evolved over the last decade.

Figure 1.1: Illustration showing a typical camera position for a camera based computer games system. The camera location is highlighted in red, and the viewing frustum is highlighted in green. The image displayed on the television in the diagram has been flipped horizontally.

Game systems such as these typically use a single web camera that is usually placed on top of the television set, roughly aligned to the screen centre. Figure 1.1 illustrates the camera placement used for most vision based games. The camera is highlighted in red, and the viewing frustum of the camera is highlighted in green.

The video stream recorded by the camera is displayed on the TV screen usually as a horizontally flipped image so that the movements displayed on the screen behave like a mirror of the player’s actions. Gaming components are overlaid on top of the live video stream and the player must move their body to interact with the components.

Figure 1.2 shows an example video frame captured from the camera in figure 1.1. The image on the left is the original video image captured from the point of view of the camera, and the image on the right is displayed on the television screen facing the player. The image has been flipped horizontally to make interactions more intuitive.

Interpreting this type of image poses some difficult problems, however. Generally a living room environment contains many varied types of object in the background¹. In the example frame presented in figure 1.2 there are two picture frames, a sofa, a wall, and a floor.

The problem of detecting areas of the image that belong to the player is made more complicated by the presence of these items, as their appearance can contain similar texture or colours to the clothing of the player. It is unreasonable to expect players to clear their living space completely of these items to make algorithms more robust, so any algorithms considered must ideally be able to cope with this kind of background appearance variation.

Figure 1.2: Left: Video image as seen from the point of view of the camera in figure 1.1. Right: Video image displayed on the TV screen with superimposed graphics. Notice that the image is flipped horizontally so that the displayed image behaves like a mirror reflection.

1.3 Approach

The work presented in this thesis considers two approaches to the problem of detecting the movements of the player with the goal of extending the types of interactions that can be used in computer games.

¹ Here ‘background’ simply means anything that is not the player.


The first approach (§3) presents an algorithm that uses simple image features to detect changes in the image useful for realising computer interactions.

The second approach is based on the observation that any changes detected by these low-level algorithms are a direct consequence of the player moving their body, so being able to detect the location of a body part (§4), the location of the body (§5), or ideally the player’s pose (§6) should help solve some of the interaction problems encountered by simple low-level algorithms.

In the context of computer games and entertainment, the types of computer vision algorithms that can be used are generally limited to only those that can be considered to run in real-time. This means that only algorithms that can process an image fast enough for the player to see the result of their actions affecting the computer game environment can be considered.

As an example of why this is important, consider a game where a player is bombarded by objects and must swat them aside to earn points (see right-most image in figure 1.2). Clearly if the algorithm used to locate the player’s movements² were too slow, then swatting the objects would be impossible since by the time the control system responds, the object would have moved past the area of interaction.

1.3.1 Detecting Movement

Detecting movement can be useful to trigger user interface controls by encouraging the player to ‘wave’ their hand over them for a given time, as demonstrated in Sony EyeToy games.

This can be posed as the problem of detecting changes in the image since the previous frame. Simple algorithms such as image differencing, where the current video frame is subtracted from its previous frame and thresholded to detect changes, have seen a great deal of success in many of the EyeToy series of games, demonstrating that game mechanics can be created to work within the limitations of a simple algorithm.

However, simple image differencing is contrast dependent, so it can be problematic to detect changes in areas that are a similar colour to a part of the player. Camera sensor noise can be a problem particularly in poorly lit environments, as can very slow or subtle movements that cause gradual change below the detection threshold of the image differencing algorithm.

Another problem is that this method only detects changes between frames, and cannot detect anything if the player remains stationary on top of a game object. Chapter 3 addresses some of these problems and presents an alternative algorithm that takes steps toward allowing this kind of interaction to be realised.

² Either indirectly by detecting low-level image changes or directly using object detection or pose.

1.3.2 Face Detection and Segmentation

A useful case of object detection that naturally lends itself to a computer game application is face detection. To see what is happening within the game they are playing, the player must face their television, and since this is where the camera is situated the face of the player should be visible for the majority of the game.

Detecting faces can be a useful cue for interaction, and there are good algorithms for fast and accurate face detection [170] that can be employed to locate the player’s face. Once the location of the player’s face is known, it can be used to find roughly where the rest of the player is in relation to the video frame. Additionally, once the face has been localised it becomes possible to perform other actions like placing the face on a computer generated avatar, or passing the extracted face texture to a recognition algorithm for identifying individual players within a multi-user interface [157].

Chapter 4 presents a method to achieve segmentation in real-time, using an off-the-shelf face detection algorithm to first find a region of interest for the segmentation algorithm.

1.3.3 Human Detection

Detecting and localising a human is a complex object detection problem. Humans have a high variance in appearance and pose, so an accurate detector must be able to deal with these variations. Knowing the location of the player would allow more complex algorithms to be applied, and given that the player will be present in front of the camera when playing the game, a human detector should be able to find the player reliably.

Some common approaches to this problem are discussed in section 2.6. Among these methods, fast detection algorithms such as chamfer matching [20, 63, 66] can achieve real-time performance, but the requirement that silhouettes must be extracted from the training images before the classifier is learnt can limit the number of training instances that are practical for use with the algorithm.

Chapter 5 presents an algorithm that formulates the chamfer matching algorithm as an SVM classifier, where the weights of the SVM represent the template, allowing a general chamfer template to be learnt automatically from the training data.


1.3.4 Pose Estimation

The methods mentioned in the previous sections attempt to extract basic location information about which parts of the image contain parts of the player so that interactions can affect game objects or controls that are being superimposed on top of the video frame. Since these interactions are all dependent on the location or motion of the user, they can all be achieved by determining the pose of the individual.

If pose is known, then user interface controls such as floating buttons can be interacted with simply by tracking the position of the user’s hand. Knowing pose also means events can be triggered for a specific hand only, and can even stop the control from activating in error when an arm, head or other limb enters the area. For exercise and fitness games, pose could enable the game to provide feedback to the player on how well they are performing the action.

Pose estimation is of course a very difficult and non-trivial problem, particularly so for real-time applications. Ambiguities can arise quite easily when using information from a single camera, such as which way round a player’s legs are from the side, or how far forward their hand is to the camera (since depth information isn’t available from a single camera).

There are many approaches to the problem of pose estimation, and they are discussed in more detail in section 2.7. Many methods use a two stage approach where the image is first processed by a computationally efficient algorithm (such as background subtraction) to identify locations in the image where the human is most likely to be, and then apply a more computationally expensive method to determine pose. However, only a few methods consider the combined problem of simultaneously detecting and estimating the pose of a human within an image [17, 44, 116].

Chapter 6 proposes a method that exploits the distributions of edges over training set examples to select the most discriminative locations at which to place local features so as to discriminate between different classes, where the classes represent discrete poses, and constructs an efficient hierarchical cascade classifier. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes.

1.4 Contributions

The contributions made by this thesis are as follows:

1. An extension to the local binary pattern descriptor that is more robust to noise than the standard formulations.

2. A detection and segmentation approach that demonstrates how combining detection algorithms with simple shape priors can achieve real-time performance for face texture segmentation, and presents some initial results on using the segmentation to help prune false detections.

3. A novel formulation of the standard chamfer matching algorithm as an SVM classifier that allows shape template weights to be learnt automatically without the need to pre-process them to find their silhouettes.

4. A method of exploiting the edge distribution in aligned training images to select discriminative locations for local descriptors that allows a much larger space of descriptor configurations to be utilised efficiently.

1.5 Thesis Structure

Chapter 2 This chapter presents a history of the types of systems that have been em- ployed for single camera computer vision based computer games, and presents the background literature on the methods addressed by the algorithms presented in the following chapters.

Chapter 3 Being able to identify the specific areas of an image that belong to the player and not to the background environment is an important problem that can be used to interact with objects and user interface controls presented by the game. This chapter takes steps towards solving this problem with a background subtraction method that uses fast local features.

Chapter 4 This chapter presents a method of detection and segmentation using a simple shape prior to extract the texture belonging to the face of the player which can be utilised by other game interactions. Some initial results are also presented on using the result of the segmentation algorithm to aid detection.

Chapter 5 The problem of fast human detection is addressed in this chapter, and an alternative formulation of the chamfer template matching algorithm is presented that expresses the shape template as the weight vector in an SVM classifier, allowing training examples to be used that also contain background information, without the need to pre-process them to find their silhouettes. The algorithm is then compared with a state of the art detection method.


Chapter 6 A novel algorithm to jointly detect and estimate the pose of humans in single camera images is presented in this chapter, and comparisons are made to other fast state of the art methods. The algorithm exploits the distributions of edges over the training set examples to select the most discriminative locations for local features to be placed to discriminate between different classes. To handle ambiguity in pose, the classifier output is a distribution of votes over all the available classes. Parts of this chapter have previously appeared in the publication listed in section 1.6.

1.6 Publications

Parts of the material that appear in Chapter 4 have previously been published in the following publications (and can be found at the end of the thesis):

1. Jon Rihan, Pushmeet Kohli, Philip H.S. Torr, ObjCut for Face Detection, In ICVGIP (2006)

2. Pushmeet Kohli, Jon Rihan, Matthieu Bray, Philip H.S. Torr, Simultaneous Segmentation and Pose Estimation of Humans using Dynamic Graph Cuts, In International Journal of Computer Vision, Volume 79, Issue 3, pages 285-298, 2008

Parts of the material that appear in Chapter 6 are extensions to material that has previously appeared in the following publication (which can also be found at the end of the thesis):

1. Gregory Rogez, Jon Rihan, Srikumar Ramalingam, Carlos Orrite, Philip H.S. Torr, Randomized Trees for Human Pose Detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008

CHAPTER 2

Background

Computer vision as a method of human computer interaction for entertainment media has been an active area of research for over a decade. Major entertainment companies such as Sony, Nintendo and Microsoft have all looked towards computer vision to expand the types of experiences they can provide to home and business customers.

This chapter contains a review of computer vision based user interface research employed in the entertainment industry from initial concepts up to current state-of-the-art systems employed today. The chapter then goes on to look at the main problem areas within computer vision that typically need to be addressed to make these systems robust in real world environments, and gives a review of current research literature concerned with solving these problems within the constraints provided by current games hardware.

2.1 Computer Vision in Games

This section introduces the systems made available by major entertainment companies such as Sony, Nintendo and Microsoft in roughly chronological order so that the simpler systems are presented first and the more advanced and currently used systems are introduced later. This is to highlight the evolution of computer vision driven entertainment media over the past decade.

2.1.1 Nintendo Game Boy Camera (1998)

Freeman et al. [59, 60] from Mitsubishi’s MERL laboratory used simple, low-level vision image processing algorithms implemented on a hardware chip to try and achieve interactive frame rates required by human computer interaction. The user interface methods discussed in the paper are tested in typically quite clean environments (e.g. a player stands in front of a white wall), so finding useful information is made easier.

Figure 2.1: Shown here is the Nintendo Game Boy (left), the Game Boy Camera device (middle), and an example photo taken by the camera.

This chip was later used by Nintendo for their Game Boy hand-held system as shown in figure 2.1. On this system no games actually used the camera as the control device discussed by the authors; it simply provided a means to take photographs and apply filter effects. This interactivity alone, however, was enough to sell the product to consumers, demonstrating that even simple visual systems can be entertaining to players.

2.1.2 SEGA Dreamcast Dreameye Camera (2000)

Although SEGA is no longer competing in the games console market, they did release a camera peripheral for their last console, the SEGA Dreamcast, which was intended as a webcam and photo device. The camera was only released in Japan.

The resolution of the camera is 640x480 for still images, and when connected to the Dreamcast games console its picture resolution can be 640x480, 320x240, or 160x240. However, since the data transfer rate of the camera is only 4Mbps (524,288 bytes per second), the frame rate of the camera is severely limited, so that only very low resolution video is possible at interactive frame rates.
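As a rough worked example of why the transfer rate is limiting (assuming uncompressed frames at 16 bits per pixel, which is an assumption rather than a documented specification of the device):

\[
\frac{524{,}288 \times 8\ \text{bits/s}}{320 \times 240 \times 16\ \text{bits/frame}} = \frac{4{,}194{,}304}{1{,}228{,}800} \approx 3.4\ \text{frames per second},
\]

which is far below typical interactive frame rates unless the resolution or bit depth is reduced, or the video is compressed.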

The camera was demonstrated at the Tokyo Game Show in 2000, showing its video conferencing application and a simple computer vision interface game (see figure 2.2). The software finally released with the device was primarily for editing photos, video conferencing with other Dreameye owners or recording short videos.


Figure 2.2: The Sega Dreameye (left), and images from a demonstration at the Tokyo Game Show in 2000 of a video conferencing game (middle) and a simple colour based computer vision game (right) in which the colour red causes the frog to move.

Figure 2.3: The EyeToy camera device with a game (left), EyeToy camera close up (right).

2.1.3 Sony EyeToy (2003)

Early in 2002, Dr Richard Marks joined Sony to work on the EyeToy project with the idea of using a webcam as an input device for a computer game. SCEE London studio created 4 technical demos that illustrated the potential for using a video camera as an input device, and then launched the Sony EyeToy camera with its first games in 2003.

The camera itself (see figure 2.3) is a simple webcam capable of a 320x240 or 640x480 resolution at 50 or 60Hz. The quality is relatively low for a web camera, but the frame rate is very high, which is ideal for interactive computer games.

Control Method

To play EyeToy games, the EyeToy camera is placed centred on top of a television set and the player stands in front of the camera. The games generally display the video feed of the player on the television screen, and superimpose the game graphics over the top (see figure 2.4).

Figure 2.4: A selection of EyeToy games. From left to right: EyeToy Kinetic, EyeToy Play 2, EyeToy Kinetic Combat, and AntiGrav

A few games use a technology called Digimask that allows players to place their face on digital characters within a game using the camera. This technology comes under the name EyeToy: Cameo, and there are several games that support it as an extra feature. EyeToy: Cameo isn’t really a control mechanism however, and simply allows the player another way of personalising their gaming experience.

When the camera is used as a control interface, the player generally interacts by waving their hand over certain portions of the screen and the game uses this stimulus for navigating a menu interface, or interacting with a game object (e.g. swatting a monster or something similar).

Due to the processing power available on the console, only simple low-level computer vision algorithms can be used. Despite this however, a surprising number of game mechanics can be created using these simple processing techniques, as the EyeToy: Play series of games demonstrates.

More Advanced Control

A few games use some more advanced image processing algorithms and track parts of the player to control a game avatar. One game that does this particularly well is AntiGrav. The aim of the game is to navigate around a racing course on a floating board. Once calibrated, the player can move their head around the screen to steer their avatar around the course, and move their arms up and down to collect objects at different heights along the race track (see right most image in figure 2.4).

Unlike many of the other games released for the EyeToy device, AntiGrav does not display a live video feed of the player during the game. It instead displays the result of the tracking system on the lower right of the screen, showing where the player’s head and hands are in relation to the control’s neutral position.

Colour Tracking

A few new games are being sold with brightly coloured game peripherals that can be easily tracked using simple computer vision colour tracking methods (see figure 2.5).


Figure 2.5: Two EyeToy games that use peripherals to aid computer vision colour tracking algorithms. Left: EyeToy Play: Hero. Right: EyeToy Play: Pom Pom Party

Typically these items are coloured bright green or bright pink to reduce the likelihood that a player’s living room might contain the same colour.

By using two different colours, one for each hand, multiple object tracking is made easier since there’s little chance of one object being mistaken for the other with the colour tracking enabled.

2.1.4 Nintendo Wii (2006)

Nintendo released a new console in 2006 aimed at a more family oriented casual gamer market, called the Nintendo Wii.

Although at first sight there is no computer vision based interface, the game controllers actually contain a small infra-red sensor that tracks the location of a sensor bar with IR LEDs positioned on top of a television. By monitoring the location and distance apart of the LEDs on this sensor bar, it is possible to calculate 3D position and rotation using triangulation.
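As a minimal sketch of the geometry involved (under a pinhole camera assumption; the symbols $f$, $L$ and $d$ are illustrative and not Nintendo’s published implementation), the distance $Z$ from the controller to the sensor bar follows from similar triangles:

\[
\frac{d}{f} = \frac{L}{Z} \quad\Rightarrow\quad Z = \frac{f\,L}{d},
\]

where $L$ is the known physical separation of the two LED clusters, $d$ is their separation in the sensor image, and $f$ is the focal length of the IR sensor in pixels; the roll of the controller can similarly be read from the angle of the line joining the two image points.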

In 2009, a camera device was released for the console with a game called Your Shape. The game uses a simple camera to attempt to detect if the player has performed the displayed exercise moves correctly, and also provides a fitness program for the player to follow.

2.1.5 XBox LIVE Vision (2006)

In 2006 Microsoft released a video camera for their Xbox 360 console. The camera supports a resolution of 640x480 at 30 frames per second. It also supports 1.3 megapixel still images at a higher resolution of 1280x1024 (see left image in figure 2.6).

Games have generally used the camera for live video feeds of players in online games (a typical video conferencing application) and also to provide the ability to add pictures of a player to their online gaming profile taken using the camera.


Figure 2.6: Left: XBox Live Vision camera. Right: Computer vision interface based game where the player waves their hands on the left/right of the screen to control the character.

The camera uses a long exposure time to increase the light sensitivity of the sensor used in the hardware. Due to the long exposure times, however, interfaces that require high speed interactions or more complex computer vision algorithms are limited. For instance, with a long exposure time, features like edges will tend to become blurred and smoothed out during fast actions.

Some games have used the camera as a vision based controller using similar methods to some of the games created for Sony’s EyeToy on the PS2, however it hasn’t yet reached the same level of success as the EyeToy franchise has.

A few games use the Digimask technology to allow players to place their face on characters in game, as with the EyeToy: Cameo system used in some Sony PlayStation 2 games.

2.1.6 Sony Go!Cam (2007)

Sony has also released a camera for their PSP hand-held console, which attaches to the back of the device.

The first major game to exploit the Sony PSP camera is a title called ‘InviZimals’, which was released in the UK in November 2009 and in the US in autumn 2010. This game is an Augmented Reality (AR) game that exploits various image properties within the camera field of view to generate creatures, and uses AR algorithms to track an AR marker card to provide a 3D position for a battlefield in which the creatures can fight each other.

The player must scan their surroundings using the camera while the game extracts various image properties to determine which creatures are detected and where they are located. These creatures can be captured by placing the card at the indicated location and are then made available to the player to fight and capture other creatures. The actual properties extracted from the image are not known, but are likely to be things like distributions over colour and edge information that can be extracted efficiently and quickly.

Figure 2.7: Left: PlayStation Portable (PSP) Go!Cam camera. Right: InviZimals game, the AR marker card is underneath the creature on the left.

The combat section of the game uses an augmented reality marker to position the centre of the battlefield, and the creatures that are fighting position themselves at either side to attack each other. Displaying the battle this way using the AR marker card means that during combat the camera may be moved around to different viewpoints to provide a different view on the battle taking place.

2.1.7 Sony PlayStation Eye (2007)

A year after the release of Sony’s new PlayStation 3 games console, Sony released a second web camera called the PlayStation Eye (see figure 2.8). The new camera is capable of capturing video at a higher resolution and frame rate than the original EyeToy camera. The video resolution is 640x480 at 60 frames per second, or 320x240 at a much higher 120 frames a second.

The camera is much more sensitive than the original EyeToy, and is able to cope with lower light conditions. It also has no compression artefacts in the video data, which were visible in video data captured by the previous EyeToy camera.

With the better processing power of the PS3, games are able to employ more complex computer vision algorithms for their control interfaces. One such game uses augmented reality algorithms to detect special markers on playing cards and create the appropriate 3D character on top of cards presented by a player (see right image in figure 2.8). The cards can then be used to place the creatures on a grid where they can fight each other.

Figure 2.8: Left: PlayStation Eye camera. Middle: Eye of Judgement game. Right: EyePet game.

This augmented reality concept was later applied to a digital pet game called EyePet, in which the player must look after a digital creature. The game tracks a patterned card that the player may use to interact with the creature and perform various tasks. The game encourages the player to feed, wash, and customise the appearance of the creature and interact with it by playing a series of short games. There is also a game where the player can draw a simple shape on a piece of paper and present the drawing to the PS3 camera; a 3D object is then extruded from the lines detected in the drawing so that the creature may interact with it.

2.2 Types of Camera

Different types of camera technology have been investigated for computer vision in media and games. For any camera technology investigated, the hardware must be cheap enough to distribute at a reasonably low price with games or utilities, so that consumers are likely to buy the device, while still offering good quality video suitable for image processing.

The following features are important for a camera device intended for computer gaming interfaces:

Resolution The device must have a reasonably good resolution. Higher resolution means more information can be exploited for computer vision tasks. Good camera devices typically support a resolution of 640x480 pixels.

Frame rate A high frame rate means that fast movements can be processed with better accuracy since the frames do not suffer as much from motion blur artefacts. Cameras that are capable of 60 frames per second and above would be good for use where fast actions need to be handled with accuracy.

Sensitivity The device should be able to cope with a reasonably good range of lighting conditions. Living rooms might be quite dark or poorly lit, so being able to handle these conditions is important. However, a sensitive device is only good if it has a low noise level (grain) in the video frames. Typically noise shows up in poor lighting conditions due to the gain applied by the camera to enhance the picture, so high sensitivity with low noise is ideal.

16 CHAPTER 2: BACKGROUND

Figure 2.9: A typical Bayer pattern used by consumer web camera hardware. Each pixel in the sensor array has a corresponding colour filter. Full colour RGB images are reconstructed by interpolating between the different colour channels.

Quality A camera that produces video images with compression artefacts can cause problems with computer vision algorithms, so it is important for a device to be able to produce images with very few or no compression artefacts. The PSEye camera for instance can send video frames with no compression making it ideal for computer vision applications.

Cost The overall cost of the device must be low to ensure that it can be distributed at a price that encourages consumers to purchase it. This usually means that the technology is a compromise between cost and quality.

2.2.1 Monocular (standard webcam)

All camera devices in use for computer games to date have been standard single lens webcams. The device typically provides the video data at a range of resolutions over a wired USB interface.

The types of images that cameras can retrieve are RGB, grey-scale and Bayer pattern images. Hardware processing support for different formats may be available on the camera, but they can also be constructed from the Bayer sensor pattern in software. See figure 2.9 for a typical Bayer pattern.

Applying computer vision algorithms to the raw camera data can be faster, as the RGB image doesn’t then need to be constructed for each frame. Constructing the RGB image can take some time depending on the type of interpolation scheme used.
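For illustration, a minimal C++ sketch of producing a half-resolution intensity image directly from the raw sensor data, assuming an RGGB Bayer layout; the function name and the layout are assumptions for the example, not the interface of any particular camera:

```cpp
#include <cstdint>
#include <vector>

// Build a half-resolution grey-scale image directly from raw Bayer data
// (assumed RGGB layout) by averaging each 2x2 cell. This avoids a full
// demosaicing pass when only intensity is needed by the vision algorithm.
std::vector<std::uint8_t> bayerToGrey(const std::vector<std::uint8_t>& bayer,
                                      int width, int height)
{
    std::vector<std::uint8_t> grey((width / 2) * (height / 2));
    for (int y = 0; y < height - 1; y += 2) {
        for (int x = 0; x < width - 1; x += 2) {
            // For RGGB: R at (y,x), G at (y,x+1) and (y+1,x), B at (y+1,x+1).
            int r  = bayer[y * width + x];
            int g1 = bayer[y * width + x + 1];
            int g2 = bayer[(y + 1) * width + x];
            int b  = bayer[(y + 1) * width + x + 1];
            // Simple average; a luma-weighted sum could be used instead.
            grey[(y / 2) * (width / 2) + (x / 2)] =
                static_cast<std::uint8_t>((r + g1 + g2 + b) / 4);
        }
    }
    return grey;
}
```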

2.2.2 Stereo

Stereo cameras are basically two synchronised cameras fixed a short distance apart from each other. These cameras can be used to retrieve depth as well as colour image data that can be used to solve problems where ambiguities might occur with a single lens and no depth information.

This type of camera is more expensive than a regular single lens camera, as the camera sensor hardware is doubled. Also, due to the extra processing required to calculate the disparity maps for depth, processing frames can be costly even before computer vision algorithms are applied to the image and depth data.

2.2.3 Depth, or Z-Cam

Recently the cost of depth camera technology has become low enough to be considered for use in computer games, though no consumer hardware is currently available (as of 2009). There are a few different approaches to finding depth.

Structured light In this type of camera, a light pattern is projected into the scene from the camera and depth is calculated by looking at the way the scene deforms the projected pattern. Several techniques can be used to determine depth, such as fringe projection (Zhang [178]), but these are beyond the scope of this thesis.

Time-of-flight (TOF) There are two types of time of flight camera. Shutter based cameras work by sending out a fixed duration pulse of light, and collecting light on a sensor for only a short period of time. The more light that is collected the closer the surface is determined to be. Phase based cameras modulate the light source and determine depth by detecting the difference in phase when the light is reflected back to the camera.

The camera can then provide depth data along with the RGB colour image data. In ideal conditions this means that usually difficult tasks like background subtraction can be performed by simply thresholding the depth value of each pixel.

2.3 Problem Domain

As highlighted in section 2.1 there has been a great deal of interest in human computer interaction with single cameras in the home entertainment industry. Due to the constraints of making this technology available at a low cost, the majority of these systems employ simple monocular web cameras (§2.2.1).

Any algorithm employed in this context is subject to the following constraints:

1. The algorithm must be computationally efficient to allow for real-time interaction feedback to the user.


2. The algorithm must be able to operate using only data from a single camera.

The computational efficiency constraint is important because of the many components (or subsystems) that are required to run a typical computer game, as discussed in the next section.

2.3.1 Typical Computer Games

Computer games have evolved from small projects that a single developer could manage 20 or so years ago, to huge multimedia projects involving many hundreds of people [165]. During this transition, it has become important to modularise common components into subsystems so that they can be more easily maintained in the large scale development projects that are now typical in the computer games industry [7].

Though there is some variation in the terminology used in the computer entertainment industry when referring to the software components of a computer game, the term ‘game engine’ refers to a collection of reusable subsystems that are responsible for such things as rendering, sound, game state, physics, and other tasks [7].

Until recently game engines were generally optimised for single processor architectures, but in the latter half of the last decade game developers have started to explore multi-threaded game engines to take advantage of the multi-processor and multi-core architectures present in most modern computers and games consoles [165]. Despite this trend, the term ‘game engine’ holds the same abstract meaning and is responsible for the same sub-processes, though the workload is spread across multiple threads.

One of the core subsystems of any game engine is the input subsystem. A computer vision algorithm can be considered an input subsystem since it provides control information to the rest of the game engine, though a computer vision algorithm may also provide other information such as texture information for the rendering subsystem from algorithms such as the one presented in chapter 4.

Each of the subsystems within the game engine is updated between each frame. The frame rate of the game is the rate at which new frames are presented, and is typically measured in units of frames per second, or FPS. Where fast interactions are required, games should have a high frame rate, so that a user can react to the game objects being displayed and see feedback of that action perceptually close to the time that the action was performed.
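The following sketch illustrates the idea of a frame-budgeted game loop in which a vision based input subsystem is updated alongside the other subsystems; the subsystem names and interfaces are illustrative assumptions, not the API of any particular engine:

```cpp
#include <chrono>

// Illustrative subsystem stand-ins; the names and interfaces are assumptions,
// not the API of any particular game engine.
struct VisionInput { void update() { /* grab a camera frame and run the vision algorithm */ } };
struct Physics     { void update(double dt) { (void)dt; /* advance the simulation */ } };
struct Renderer    { void draw() { /* render the frame */ } };

int main()
{
    using clock = std::chrono::steady_clock;
    const double frameBudget = 1.0 / 60.0;   // 60 FPS: all subsystems share ~16.7 ms

    VisionInput vision;
    Physics physics;
    Renderer renderer;

    auto previous = clock::now();
    for (int frame = 0; frame < 600; ++frame) {          // run for ~10 seconds at 60 FPS
        auto frameStart = clock::now();
        double dt = std::chrono::duration<double>(frameStart - previous).count();
        previous = frameStart;

        vision.update();     // input subsystem: should use only a fraction of the budget
        physics.update(dt);  // the remaining subsystems consume the rest
        renderer.draw();

        double frameTime = std::chrono::duration<double>(clock::now() - frameStart).count();
        if (frameTime > frameBudget) {
            // The frame missed its budget: the game's frame rate drops and the
            // player's interaction feedback is no longer perceived as immediate.
        }
    }
    return 0;
}
```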


2.3.2 Computational Constraints

Clearly, implementing game interactions using low level and computationally efficient methods is attractive due to the number of other complex subsystems that constitute a typical computer game engine. The algorithm must be able to operate within a fraction of the available processing time so that the other subsystems of the game can also be updated, and provide a high frame rate suitable for real-time interaction.

This constraint restricts the types of algorithms that can be used to those that can operate within these requirements. The algorithms discussed in the subsequent chapters first present algorithms to provide useful interactions (§3 and §4), then move on to tackle the more difficult problem of efficient localisation (§5) and pose estimation (§6). The following sections give an overview of the literature for each of these problems.

2.4 Background Subtraction and Segmentation

Background subtraction and segmentation methods are two approaches to extracting similar types of information. They both attempt to extract the shape of an object from an image, but tackle the problem in different ways.

Background subtraction methods generally cope with motion within video sequences and can be made computationally efficient, so they are quite well suited to processing live video streams. They use either simple or relatively complex background models that can adapt to changes in video sequences so that foreground objects can be detected.

Segmentation algorithms, on the other hand, can be formulated to deal with both single images and video sequences, but due to their more complex formulation they are generally more computationally expensive to use. However, as will be demonstrated in chapter 4, using other algorithms to focus on smaller areas of the image allows such algorithms to be used in real-time applications.

2.4.1 Background Subtraction

Background subtraction is a method of finding differences in an image to detect moving objects from a static camera. Typically there is a model of the background that is either calculated offline from reference images captured when the camera view is clear of moving objects (and in varied lighting conditions), or the background model can be separated from the foreground using the pixel intensity statistics.

There are several survey papers that contain comparisons of background subtraction methods applied to different problems (see [33, 123]). The common approaches to background subtraction are described next. For simplicity, references in equations to images or image models are assumed to be on greyscale intensity values, but can easily be applied independently to each colour channel in RGB colour images or other colour models.

At the most general level, a background subtraction algorithm compares pixels in the current frame to a background model, and labels them background if they are similar, or foreground otherwise.

\[
|B_t - I_t| > \theta_t \tag{2.4.1}
\]

where $B_t$ and $I_t$ are the background model and image intensity respectively for the current time $t$, and $\theta_t$ is a threshold (each pixel has its own threshold). The threshold $\theta_t$ can represent a more complex threshold based on the type of model used; for instance, $\theta_t$ is typically a function of the standard deviation measured at its respective pixel, such as in [89, 174].

It is from the formulation of equation (2.4.1) that the term background subtraction is derived [123]. More generally, background subtraction is simply the method of comparing the pixels in an image to a reference model to find differences from that model.

Another popular foreground detection threshold method is to normalise the measured difference with respect to the standard deviation of the pixel:

\[
\frac{|B_t - I_t|}{\sigma_t} > \theta_{t,\sigma} \tag{2.4.2}
\]

In either threshold method, most works determine the detection thresholds $\theta_t$ and $\theta_{t,\sigma}$ experimentally [33].
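A minimal sketch of the per-pixel test of equation (2.4.2), assuming the background model and per-pixel standard deviation are maintained elsewhere (the function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Label each pixel foreground/background using the normalised test of
// equation (2.4.2): |B_t - I_t| / sigma_t > theta.
std::vector<std::uint8_t> classifyPixels(const std::vector<float>& image,      // I_t
                                         const std::vector<float>& background, // B_t
                                         const std::vector<float>& sigma,      // per-pixel std. dev.
                                         float theta)                          // detection threshold
{
    std::vector<std::uint8_t> mask(image.size());
    for (std::size_t i = 0; i < image.size(); ++i) {
        float diff = std::fabs(background[i] - image[i]);
        // Guard against a zero standard deviation in flat image regions.
        float s = std::max(sigma[i], 1e-3f);
        mask[i] = (diff / s > theta) ? 1 : 0;   // 1 = foreground, 0 = background
    }
    return mask;
}
```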

The model can either be learned offline or updated online. When updating online, one assumption that can be made is that a pixel will remain background most of the time [174], and foreground pixels will be those that differ from this mean by some threshold. Other works maintain hypotheses on moving objects to update the model only in areas other than where foreground objects have been predicted to be [84, 86, 89]. This conditional background update is known as a selective update.

A common approach is to use a temporal median filter to determine the background model [38, 39, 67, 101, 181]. The median is computed over the last $n$ frames [39, 67, 101, 181], or a set of selected frames [38]. The assumption is that the pixel will take a value from the background more than half of the time that has been measured. However, the main disadvantages of this approach are that a buffer needs to be maintained to store the values of the last $n$ frames, and the lack of a statistical definition of deviation from the median makes it difficult to adapt the threshold value [123].
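A sketch of the idea, storing the last $n$ samples per pixel in a circular buffer and taking their median as the background estimate; this also makes the storage cost of the buffer explicit (class and member names are illustrative, and real implementations would update the median incrementally):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Temporal median background model: keep the last n values of every pixel in a
// circular buffer and take the per-pixel median as the background estimate.
class MedianBackground {
public:
    MedianBackground(std::size_t numPixels, std::size_t n)
        : history_(numPixels, std::vector<std::uint8_t>(n, 0)), n_(n) {}

    void addFrame(const std::vector<std::uint8_t>& frame) {
        for (std::size_t i = 0; i < frame.size(); ++i)
            history_[i][next_] = frame[i];
        next_ = (next_ + 1) % n_;
    }

    // Median of the stored samples for pixel i (the background estimate B_t).
    std::uint8_t background(std::size_t i) const {
        std::vector<std::uint8_t> samples = history_[i];   // copy: buffer stays intact
        std::nth_element(samples.begin(),
                         samples.begin() + samples.size() / 2, samples.end());
        return samples[samples.size() / 2];
    }

private:
    std::vector<std::vector<std::uint8_t>> history_;  // n samples per pixel
    std::size_t n_;
    std::size_t next_ = 0;
};
```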

Some works instead use a running Gaussian average [84, 86, 89, 174] to construct a model of the background. Wren et al. [174] use a method that approximates the calculation of the mean and standard deviations of pixels over $n$ frames. To avoid the requirement of storing $n$ frames of pixel values and subsequently processing them to fit a Gaussian model to each pixel, they update their background model using a running Gaussian average.

\[
\mu_{t+1} = \alpha I_t + (1 - \alpha)\mu_t \tag{2.4.3}
\]
\[
\sigma_{t+1}^2 = \alpha (I_t - \mu_t)^2 + (1 - \alpha)\sigma_t^2 \tag{2.4.4}
\]

where $\alpha$ is a parameter that controls the rate at which the model is updated. While this type of model can be useful in some situations such as CCTV where the background remains the same for much of the time, in a real-time computer game problem the user can occupy the same region of the screen for a long time, and as such it is undesirable to update these pixels into the background model.
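A minimal sketch of the running Gaussian average of equations (2.4.3) and (2.4.4), combined with the normalised foreground test of equation (2.4.2); the structure and variable names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Running Gaussian average per pixel, following equations (2.4.3) and (2.4.4).
// mu and var hold the per-pixel mean and variance of the background model.
struct RunningGaussian {
    std::vector<float> mu, var;
    float alpha;   // learning rate

    RunningGaussian(std::size_t numPixels, float a)
        : mu(numPixels, 0.f), var(numPixels, 1.f), alpha(a) {}

    void update(const std::vector<float>& image) {
        for (std::size_t i = 0; i < image.size(); ++i) {
            float d = image[i] - mu[i];                           // I_t - mu_t
            mu[i]  = alpha * image[i] + (1.f - alpha) * mu[i];    // eq. (2.4.3)
            var[i] = alpha * d * d    + (1.f - alpha) * var[i];   // eq. (2.4.4)
        }
    }

    // Foreground test of equation (2.4.2) against the running model.
    bool isForeground(std::size_t i, float value, float theta) const {
        return std::fabs(mu[i] - value) / std::sqrt(std::max(var[i], 1e-6f)) > theta;
    }
};
```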

The background update equation can alternatively be formulated so that the background is only updated for pixels that are not labelled as foreground in the current frame [84, 86, 130]. The update can be expressed in a Kalman filter formulation:

\[ B_{t+1} = B_t + \left( \alpha_1 (1 - M_t) + \alpha_2 M_t \right) D_t \tag{2.4.5} \]

where $D_t = I_t - B_t$ is the difference between the image and the background model at time $t$, and $\alpha_1$ and $\alpha_2$ are based on an estimate of the rate of change of the background. The mask is defined as $M_t = |D_t| > \tau$ and is simply a thresholded difference map. This helps avoid the background being affected by pixels that are very different from the background model.
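A minimal sketch of this selective update, assuming example values for the parameters rather than those used in [84, 86, 130]:

import numpy as np

def selective_update(B, I, alpha1=0.1, alpha2=0.01, tau=25.0):
    D = I.astype(np.float32) - B                 # difference D_t = I_t - B_t
    M = (np.abs(D) > tau).astype(np.float32)     # thresholded foreground mask M_t
    # Likely-background pixels (M = 0) adapt with alpha1, likely-foreground
    # pixels (M = 1) adapt with the (typically smaller) alpha2, as in eq. (2.4.5).
    return B + (alpha1 * (1 - M) + alpha2 * M) * D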

Koller et al. [89] use a Gaussian to model the values of each pixel but extend the selective update method used by [84, 86] to include a tracking-based object motion mask predicting where foreground pixels are expected to be located in the current frame, instead of simply thresholding the difference image. This object motion mask is used to ensure that pixels predicted to be foreground are not considered when the background model is updated. They then combine this system with an object tracking algorithm that is used to predict a new object motion mask $M_{t+1}$ for the next frame.


For dynamic backgrounds a single Gaussian model for each pixel is not sufficient. Stauffer and Grimson [153] proposed using a mixture of Gaussians to model each pixel independently. This is able to handle dynamic backgrounds where background objects have varying intensity properties, such as fountains, moving trees and plants, etc. A few years later, Power and Schoonees [126] provided a principled tutorial and offered some corrections to the Stauffer and Grimson [153] paper. The recent history of a pixel, $X_1, \ldots, X_t$, is modelled by a mixture of Gaussians. The probability of observing the current pixel value is:

\[ P(X_t) = \sum_{i=1}^{K} w_{i,t} \cdot \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \tag{2.4.6} \]

where $K$ is the number of distributions, $w_{i,t}$ is an estimate of the weight (the proportion of the data accounted for by this Gaussian) of the $i$th Gaussian in the mixture at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and covariance matrix of the $i$th Gaussian in the mixture at time $t$, and $\eta$ is a Gaussian probability density function.

The $K$ Gaussians are sorted using $w_{k,t}/\sigma_{k,t}$ as a measure, which is proportional to the peak of the weighted density function $w_{k,t}\,\eta(X_t, \mu_{k,t}, \Sigma_{k,t})$. The first $B$ Gaussians whose weights sum to more than $T$ are assumed to describe the background, with $B$ estimated as:

\[ B = \operatorname*{argmin}_{b} \left( \sum_{k=1}^{b} w_{k,t} > T \right) \tag{2.4.7} \]

where $T$ is the prior probability of anything in view being background; the remaining Gaussians are considered to belong to the foreground. The mixture is updated each frame only for pixels considered to be from the background Gaussians. Pixels are labelled foreground if they are more than $2.5\sigma$ away from all of the background distributions. The number $K$ used in practice is typically 3 or 5, and represents a reasonable trade-off between computational cost and memory usage.
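The following single-pixel sketch illustrates the background/foreground decision implied by equations (2.4.6) and (2.4.7); the mixture parameter update is omitted, and the values of T and the 2.5-sigma test are illustrative assumptions.

import numpy as np

def is_foreground(x, w, mu, sigma, T=0.7, k=2.5):
    """x: pixel value; w, mu, sigma: arrays of length K for this pixel."""
    order = np.argsort(-(w / sigma))                      # sort Gaussians by w/sigma, most background-like first
    w, mu, sigma = w[order], mu[order], sigma[order]
    # Smallest b whose cumulative weight exceeds T (eq. 2.4.7).
    B = np.searchsorted(np.cumsum(w), T, side="right") + 1
    # Foreground if the pixel lies more than k standard deviations from every
    # background Gaussian.
    return bool(np.all(np.abs(x - mu[:B]) > k * sigma[:B]))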

A disadvantage of this method is that its computational cost is higher than that of the simpler (but less flexible) single Gaussian models, so low values of $K$ must be used to approach real-time speeds.

Histograms of the pixel values can be used to approximate the background distribution, but because the histogram is a step function it might not model the distribution correctly. Elgammal et al. [46] use kernel density estimation (KDE) on the past $n$ values of the pixel (with $n$ around 100) to model the background distribution:

\[ P(X_t) = \frac{1}{n} \sum_{i=1}^{n} K(X_t - X_i) \tag{2.4.8} \]

where $K$ is the kernel estimator function, chosen to be the normal distribution $N(0, \Sigma)$.

By modelling the background in this way, the background distribution can be estimated directly from the data points themselves, instead of being restricted to a fixed number of modes as in the work of Stauffer and Grimson [153]. Elgammal et al. [46] combine this formulation with a second stage that attempts to suppress false detections by also considering a small neighbourhood around each pixel using the model from the centre pixel, constrained by the requirement that a connected component must also have been displaced.
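A minimal sketch of the kernel density estimate of equation (2.4.8) for a single pixel, assuming a Gaussian kernel with an illustrative bandwidth:

import numpy as np

def kde_background_probability(x, history, sigma=10.0):
    """history: the last n observed values of this pixel (n ~ 100 in [46])."""
    d = x - np.asarray(history, dtype=np.float32)
    kernel = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(kernel.mean())        # (1/n) * sum of kernel responses

# A pixel would then be labelled foreground if this probability falls below a
# chosen threshold.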

Other methods use eigenvalue decomposition to model the background over blocks [138] or over the whole image (Oliver et al. [119]), but the model construction is too computationally intensive for use in real-time applications.

Though many of these systems update a model at runtime using selective updating methods, they generally assume that for the majority of the time being observed a pixel will remain background. This is not a safe assumption for computer vision based games, where the player may remain in the centre of the image for most of the duration of the game. In this situation the pixel values belonging to the user may be absorbed into the background model: where foreground values are close to background values (e.g. similar but slightly different colours in clothing and background), the model can gradually drift towards the new foreground values.

Methods that consider local texture properties, ideally with robustness to illumination changes, may be better representations than comparisons of single pixel values. Fast local descriptors exist that can describe local texture in a computationally efficient way. The following section briefly introduces correlation-based approaches.

2.4.2 Background Subtraction using Local Correlation

For near real-time systems that apply a model to each pixel, the comparisons between the model and the current frame are generally done on individual pixel values. An object that is moving in a video sequence will generally be made up of a group of connected pixels. Local patch based correlation methods that use the texture description of a small region around a pixel, instead of the individual pixel values, could provide a more robust measure of correlation.

Modelling pixel intensities independently is computationally efficient, but pixels are generally part of a larger pattern or structure. Local patch correlation methods can be used to add a degree of robustness to illumination changes beyond that of just intensity or colour models using single pixel values. By using comparisons between relative intensities within the patch [115] or local intensity normalised patches, a degree of illumination change robustness can be achieved.

For correlation between local patches, dense descriptor methods such as HOG descriptors [40, 140, 183], Haar filters [171], DAISY [162] or SIFT [102], among others, can be used, but they must be efficient and fast to calculate.

In other fields such as wide baseline stereo, local features are used to find correspondences between pairs of images and can be calculated using efficient local descriptors such as the DAISY descriptor presented by Tola et al. [162]. Grabner et al. [68] use a grid of local classifiers to construct a discriminative background model that classifies blocks as foreground or background. Neither of these approaches is fast enough for real-time correlation, however.

Normalised Cross Correlation (NCC)

NCC is a correlation method that is robust to global changes in illumination, which means there is less need for the background model to be updated each frame. Lewis [100] observed that much of the computation for the NCC comparison can be pre-calculated, and that for small windows in the region of 5 or 7 pixels across it can be made to run in real time (such as the implementation used in chapter 3).

Given two $M \times N$ patches, $f(x, y)$ and $g(x, y)$, the normalised cross correlation between them is expressed as:

\[ \frac{\sum_{x,y} \left[ f(x,y) - \bar{f}\, \right]\left[ g(x,y) - \bar{g} \right]}{\sqrt{\sum_{x,y} \left[ f(x,y) - \bar{f}\, \right]^2 \sum_{x,y} \left[ g(x,y) - \bar{g} \right]^2}} \tag{2.4.9} \]

where $\bar{f}$ and $\bar{g}$ are the means of $f$ and $g$ respectively. The correlation is the dot product of the differences from the mean in each of the respective patches, normalised by the square root of the product of the sums of squared differences. If the patches are pre-processed to subtract their means before the NCC comparison, it reduces to:

\[ NCC_{f,g}(x, y) = \frac{\sum_{x,y} f'(x,y)\, g'(x,y)}{\sqrt{\sum_{x,y} f'(x,y)^2 \, \sum_{x,y} g'(x,y)^2}} \tag{2.4.10} \]


where $f'(x, y)$ and $g'(x, y)$ are the patches with their means already subtracted. The square root can be removed by squaring both sides. The denominator is just the product of the standard deviations of each patch.

\[ NCC_{f,g}(x, y) = \frac{\sum_{x,y} f'(x,y)\, g'(x,y)}{\sigma_f \cdot \sigma_g} \tag{2.4.11} \]

If $B(x, y)$ represents the background model and $I_t(x, y)$ the image from the current frame, then the quantities $B'(x, y) = B(x, y) - \bar{B}$ and $\sigma_B$ can be pre-calculated. Additionally, by squaring both sides of the equation, the identity $\mathrm{Var}(X) = E(X^2) - E(X)^2$ (where $E(X)$ is the expectation of $X$ and $\mathrm{Var}(X)$ is the variance $\sigma^2$) can be used to avoid calculating the square root needed to find the standard deviation at each pixel location, and to work with the variance instead.

\[ NCC_{B,I}(x, y)^2 = \frac{\left( \sum_{x,y} I'(x,y)\, B'(x,y) \right)^2}{\sigma_I^2 \cdot \sigma_B^2} \tag{2.4.12} \]

Convolution over summed-area tables [37] (see §5.2.1 for a description) can be used to efficiently precompute the mean $\bar{I}$ and variance $\sigma_I^2$ at each pixel for a fixed size correlation window; then only the product between the background model and the current image, $B'(x, y) \cdot I_t'(x, y)$, needs to be calculated to evaluate $NCC_{B,I}(x, y)^2$. The values of $B'(x, y)$ for a fixed size window can be stored in a vector at each pixel for efficiency when evaluating each new frame $I_t$.
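As an illustration of how the background statistics can be reused, the following sketch computes the patch-wise NCC of equation (2.4.10) with the background patch prepared once; the function names and the small epsilon guard are assumptions for illustration rather than part of the cited formulation.

import numpy as np

def prepare_background_patch(B_patch):
    # B'(x, y) = B(x, y) - mean(B), computed once per background patch.
    Bp = B_patch.astype(np.float32) - B_patch.mean()
    return Bp, float(np.sqrt((Bp ** 2).sum()))

def ncc(I_patch, Bp, Bp_norm, eps=1e-6):
    Ip = I_patch.astype(np.float32) - I_patch.mean()   # I'(x, y)
    denom = np.sqrt((Ip ** 2).sum()) * Bp_norm
    # A value close to 1 indicates the patch still matches the background.
    return float((Ip * Bp).sum() / (denom + eps))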

Jacques Jr et al. [78] use NCC to remove shadow pixels from detected foreground pixels, though shadows that cause hard edges are still a problem in grey-scale images as there is no way to distinguish them from true edges using the NCC measure alone.

Colour Normalised Cross Correlation (CNCC)

To help address the problem of shadows in the NCC comparison, Grest et al. [71] proposed a variation on the NCC formulation that performs the comparison in a more shadow-invariant colour space. The RGB colour image is transformed into what they refer to as an hsL image; the hs component represents a vector on the hue and saturation plane defined in the HSL colour space (see figure 2.10). This representation allows colour comparisons to be expressed as a dot product and integrates easily into the NCC equation.

Grest et al. [71] define the CNCC over an $M \times N$ window, where the hsL components are split into an $(h, s)$ vector $c = (h, s)$ and a lightness value $L$, as:


Figure 2.10: Left: HSL colour cylinder [172]. Right: hsL colour space representation showing two colours, $c_a$ and $c_b$. Their similarity is measured by the function $C(c_a, c_b) = \max(0, c_a^T c_b)$.

\[ CNCC_{x,y} = \frac{\sum_{x,y} \left( c^B_{x,y} \circ c^I_{x,y} - \bar{L}^B \bar{L}^I \right)}{\sqrt{CVAR(B) \cdot CVAR(I)}} \tag{2.4.13} \]

And:

\[ CVAR(A) = \sum_{x,y} \left( c^A_{x,y} \circ c^A_{x,y} \right) - MN \left( \bar{L}^A \right)^2 \tag{2.4.14} \]

where $c^B_{x,y}$ and $c^I_{x,y}$ are the $(h, s)$ vectors from the hsL colour space at position $(x, y)$ in the background model $B$ and image $I$ respectively, and $c_a \circ c_b$ represents the scalar product between two vectors, with negative values set to 0.

Grest et al. [71] report that using the $(h, s)$ colour space successfully increases robustness to shadows in colour images, while on intensity-only images the CNCC correlation is equivalent to NCC.

Following on from the works using local correlation methods for background subtraction, chapter 3 presents a fast method of background subtraction that uses a single reference frame as a background model, and compares the new algorithm to correlation methods.

2.4.3 Segmentation

Image segmentation algorithms try to solve the general problem of partitioning an image into a number of segments, commonly a foreground and a background segment in the binary case, or object class segments in classification based segmentation methods such as the approach used by Shotton et al. [146].

Thresholding is the simplest way of tackling the segmentation problem. A threshold is selected, and each pixel is labelled foreground if it is above this threshold, or background if it is below. The threshold is selected either as an intensity value or as a colour (multi-band thresholding).
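A minimal sketch of this simple thresholding, with an arbitrary illustrative threshold value:

import numpy as np

def threshold_segment(grey_image, t=128):
    # Label a pixel foreground (True) if its intensity exceeds the threshold.
    return grey_image.astype(np.float32) > t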

This simple approach to segmentation (though computationally efficient) is only really effective in applications where the object already stands out clearly from the background, for instance in manufacturing and quality control, document scanning, or in applications such as chroma-keying in film and television where the background is a distinct colour such as bright green or blue.

An excellent survey of the common methods of image thresholding as applied to many varied applications is presented by Sezgin and Sankur [139], who categorise the algorithms into different approaches. Histogram shape methods select a threshold based on the maxima, minima and curvatures of the smoothed histogram of pixel intensities. Clustering based methods cluster intensity (or colour) values into background and foreground regions, or alternatively model them as a mixture of two Gaussians (similar to the approaches used in the background subtraction methods presented in §2.4.1). Entropy based methods use the entropy of the foreground and background regions, and the cross-entropy between the original and binarised image. Object attribute based methods search for similarity between the intensity and binarised images, using measures such as shape similarity or edge coincidence. Spatial methods use higher-order probability distributions or correlation between pixels to compare local regions, similar in some respects to local correlation methods. Local methods adapt the threshold value at each pixel to the local image characteristics, such as local brightness.

Sezgin and Sankur [139] conclude that clustering and entropy based methods work best on the task of segmentation in their non-destructive testing image dataset, but in binarisation of degraded document images clustering and local methods perform better. The best performing method across both datasets was the clustering method proposed by Kittler and Illingworth [87].

Thresholding approaches are generally computationally efficient and lend themselves well to applications in controlled environments such as document scanning, but in complex applications with varied and dynamic environments they can leave incomplete or fragmented foreground regions that must be cleaned up using morphological operations such as dilation and erosion. Many background subtraction methods also suffer from these problems.

Region based methods such as the Watershed algorithm [169] use the edge gradient magnitudes extracted from an image as a height map surface, and create segments by flooding it from the minima of the surface. The flooding of a region continues until the water level becomes higher than a point along its boundary and can spill over to fill another area. A segment boundary is created when one region spills over into another, or meets another flooded segment. The process continues until all regions have been flooded. This algorithm is based on the assumption that segment boundaries correspond to the ridges separating these flooded valleys.

2.4.4 Graph-Cut based Methods

Instead of thresholding the image at each pixel independently to assign a label, or growing closed contour regions, the image can be formulated as a graph constructed from the regular lattice of pixels. The GraphCut algorithm formulates the pixels in such a way and achieves good results in many applications due to the flexibility of the energy function that is optimised over the graph to solve the labelling problem.

Greig et al. [70] were the first to apply graph cuts to a computer vision problem: they formulated image restoration as an energy minimisation problem and solved it using a max-flow/min-cut algorithm. Some time later, Boykov and Kolmogorov [26] reformulated several computer vision problems, including image restoration, stereo and segmentation, as energy minimisation problems and demonstrated that they could be solved effectively using graph-cut methods, proposing a new max-flow/min-cut algorithm that solves the labelling problem efficiently.

The basic formulation for binary image segmentation is as follows [26, 70]. A pixel in an image $P$ can take one of a number of labels $L = \{l_1, l_2, \ldots, l_n\} = \{0, 1, \ldots, n\}$, where $L = \{0, 1\}$ for binary image segmentation. The set $\mathcal{N}$ defines the set of all connected pairs of pixels in the image. For a given labelling $L$ of an image $P$, where each pixel has been assigned an associated label, the Potts energy of the image taking that labelling can be expressed as follows:

\[ E(L) = \sum_{p \in P} D_p(L_p) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_p, L_q) \tag{2.4.15} \]

where $D_p$ is a data penalty function and $V_{p,q}$ is an interaction penalty function between pixels.


Figure 2.11: GraphCut of a directed capacitated (weighted) graph (example adapted from Boykov and Kolmogorov [26]). Weight strengths are reflected by line thickness. Far-Left: Pixel intensities of a 3 × 3 neighbourhood. Mid-Left: Graph representation of the pixel intensities. Pixels are represented by the grey nodes. Mid-Right: A cut on the graph. Far-Right: Maximum a posteriori (MAP) solution for the graph.

The data term $D_p$ is the cost of assigning a pixel to a given label (usually based on observed intensities and a pre-specified likelihood function, e.g. determined from histograms of colours), and the interaction term $V_{p,q}$ corresponds to a cost for discontinuity between pixels that encourages spatial coherence between them. The data and interaction penalties are also known as unary and pairwise potentials in other segmentation literature [27, 88, 92].

The optimal labelling solution $L^*$ is found by solving the following using a max-flow/min-cut algorithm such as the one proposed by [26]:

\[ L^* = \operatorname*{argmin}_{L} E(L) \tag{2.4.16} \]
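To make the energy of equation (2.4.15) concrete, the sketch below evaluates it for a given binary labelling on a 4-connected lattice with a constant Potts pairwise penalty; the cost choices are illustrative assumptions and the max-flow optimisation itself is not shown.

import numpy as np

def potts_energy(labels, unary, pairwise_weight=1.0):
    """labels: HxW array of {0, 1}; unary: HxWx2 array of data costs D_p(l)."""
    h, w = labels.shape
    # Data term: sum of D_p(L_p) over all pixels.
    data_term = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Pairwise Potts term: a constant penalty for each pair of 4-connected
    # neighbours taking different labels.
    smooth_term = pairwise_weight * (
        (labels[:, 1:] != labels[:, :-1]).sum() + (labels[1:, :] != labels[:-1, :]).sum()
    )
    return float(data_term + smooth_term)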

Figure 2.11 shows an example graph cut for a small $3 \times 3$ neighbourhood. The graph is constructed as follows. The pixels $p \in P$ from image $P$ are represented as a set of nodes $V$ and are connected to their neighbours by directed weighted edges $E$ to form a directed weighted (capacitated) graph $G = \langle V, E \rangle$. The nodes in the graph are also connected to special nodes called terminals that represent the set of possible labels a node can take. In binary image segmentation, where there are two labels, these are called the source node $s$ and the sink node $t$.

There are two types of connections between nodes in the graph, and each link is given a weight (cost). The first type, N-links, are the connections between pixel nodes. The weights of these N-links are determined by the interaction term $V_{p,q}$ in equation (2.4.15), and correspond to a penalty for discontinuity between pixels. Pixel pairs which have a high contrast are typically given low values to encourage a split along a high contrast boundary. The other type of connection are terminal links, or T-links, and the weight of the connection from a given node to either the $s$ or $t$ terminal is based on the cost of that node taking a particular label, from the data term $D_p$ in (2.4.15). Since the graph is directed, the weight over the link $(p, q) \in \mathcal{N}$ can be different from the weight in the other direction $(q, p) \in \mathcal{N}$. This is a useful property that can be exploited by many vision algorithms [26].

The s/t cut $C$ on a graph $G$ partitions the nodes into two sets: those connected to the source $S$ and those connected to the sink $T$ [26]. The cost of the s/t cut $C$ is the sum of the costs between boundary pixels $(p, q) \in \mathcal{N}$ in the neighbourhood system where $p \in S$ and $q \in T$. The cost of the cut is directed, so that costs are added only in the direction of $S$ to $T$. The minimum cut is the cut that has the minimum cost of all cuts. This can be solved as a maximum flow problem [56], where the edges in the graph are treated as a network of pipes with capacity equal to their weights. Ford and Fulkerson [56] state that the maximum flow from $s$ to $t$ saturates a set of edges dividing the graph into two sets that correspond to a minimum cut. Boykov and Kolmogorov [26] discuss various algorithms that can be used to solve this max-flow problem, and propose their own algorithm that can solve it efficiently.

When the images being segmented come from image sequences or videos, they tend to change only slightly from one frame to the next. Kohli and Torr [88] proposed an efficient max-flow formulation that reuses information from the solution of the previous frame to considerably speed up the computation for subsequent frames. They call this method dynamic graph cuts, and demonstrated substantial improvements in computation time compared to the traditional approach of reconstructing the graph every frame.

Other works [27, 58, 76, 92] observed that when prior knowledge of the type of object being segmented is available, the data term can be augmented to include a shape prior. Selecting an appropriate shape representation however is the main problem in these systems due to the variability of the shape and pose over time or viewing angle. Determining these pose and shape parameters is a difficult problem in itself.

Kumar et al. [92] approached this difficult problem by matching a set of exemplars for different parts of the object onto the image. These matches are used to generate a shape model for the object. The segmentation problem is then modelled by combining MRFs with layered pictorial structures (LPS) which provide them with a realistic shape prior described by a set of latent shape parameters. However, a lot of effort has to be spent to learn the exemplars for different parts of the model.


Instead of matching shape exemplars, Bray et al. [27] used a simple 3D stick figure as a shape prior to determine 3D pose at the same time as solving the segmentation of the image. The parameters of this model are iteratively explored to find the pose corresponding to the human segmentation having the maximum probability, or minimum energy. The iterative search was made efficient by using the dynamic graph cut algorithm proposed by Kohli and Torr [88].

The work on shape priors [27, 58, 76, 92], and the introduction of an efficient dynamic graph cut algorithm for image sequences [88], made solving graph cut based segmentations much more efficient and practical for applications where speed is important. Chapter 4 couples these ideas with a detection algorithm to reduce the processing region to a smaller area of the image. This reduces the size of the graph being solved, consequently improving computation to real-time speeds.

2.5 Human Detection and Pose Estimation

The methods discussed in the previous section can be employed to quickly identify regions of the image that have changed due to the appearance of the user in front of the modelled background (background subtraction), or to label pixels in the image that should belong to the user based on an appearance model (segmentation). Since the changes in the image are primarily dependent on the location or motion of the user, these tasks could all be subsumed by determining the location and pose of the individual.

If pose is known, then user interface controls such as floating buttons, or other more advanced interactions, can be driven by following the positions of the parts of the user's body. Knowing pose also means that an interaction can be triggered for a specific body part only, and even that a control can be stopped from activating in error when an arm, head or other limb enters its area.

This is of course a very difficult and non-trivial problem, particularly for real-time applications. The following sections discuss methods for monocular pose estimation. This problem can be broken down into two tasks: detection (§2.6) and pose estimation (§2.7).

2.5.1 Learning Problem

The machine learning problems examined in the chapter on Human Detection (chapter 5) and the chapter on Pose Detection (chapter 6) are supervised learning problems. See figure 2.12 for an illustration of a typical supervised machine learning algorithm.


Figure 2.12: A diagram showing the typical processes involved in supervised machine learning.

These types of algorithms learn a model from a training set $D = \{(x_i, y_i)\}$ made up of examples that pair a feature vector $x \in \mathbb{R}^n$ with a corresponding label $y \in \mathbb{R}^m$. To reduce the dimensionality of the data that the learning algorithm must deal with, feature vectors are usually generated from training images by extracting lower dimensional information (such as the descriptors discussed later in §5.2).

Supervised learning algorithms attempt to learn a model $h(\cdot)$ that maps the feature vectors $x_i$ extracted from the training data to their corresponding labels $y_i$ (i.e. $\forall i: h(x_i) = y_i$), or to predicted function values in the case of regression. Given a new and unseen feature vector $x'$, the model will try to predict an appropriate output $y'$ for it.

In the case of human detection, the output $y$ is usually either a binary value indicating object or non-object, or a continuous value indicating a degree of confidence that the object is present. In the case of pose estimation/detection, the output is either a class indicating some quantisation of the possible pose space that maps to a pose, or a predicted 3D pose.

2.6 Human Detection

Human detection is a specialised form of object detection, and is generally formulated as a binary classification problem. A classifier h(·) is learnt that can classify subregions of an image as either containing an object or not.

There have been a few surveys dealing with the problems relating to human detection [49, 62, 63]. The problem is more difficult than rigid object detection, since the appearance of a human can vary considerably and finding a suitable model to cope with the variations is extremely difficult.

The survey by Gandhi and Trivedi [62] explores the problem in terms of pedestrian detection for vehicle safety and discusses various methods that use different sensing technologies. The methods most relevant to this thesis are visible light single camera methods, as they best represent the hardware constraints typically used for computer vision based computer entertainment applications.

A more recent survey by Enzweiler and Gavrila [49] presents a detailed summary of the state of the art computer vision algorithms that address the problem of human detection and compares several methods side by side to assess their performance. They consider the problem in two parts, region of interest (ROI) detection and classification, though not all methods can be neatly separated into these two stages (take the sliding window approach of Dalal and Triggs [40], discussed later, for instance). The algorithms considered in their benchmark were a HOG based linear classifier [40], a Haar wavelet cascade classifier [171], a Neural Network (NN) classifier with adaptive local receptive fields (LRF) where each neuron sees only a portion of the image [173], and a combined shape detector and texture classifier based on chamfer template matching, which uses an NN/LRF classifier as a final-stage texture verification to prune false positives [66].

They report that the local receptive field classifier may have had slightly better performance when trained instead with a non-linear SVM, but the memory requirements to do this were too high to allow training using their dataset.

Enzweiler and Gavrila [49] concluded that the HOG linear SVM classifier performed by far the best of the compared methods when no time constraints were considered. When computation constraints were applied, however, the Haar wavelet method performed the best. This seems to indicate that cascaded classifiers have a clear advantage in terms of computation in time critical applications, due to the rejection power of the first few levels of the cascade, but at a cost in classification performance when the whole cascade is considered.

Algorithms for human detection can be broadly categorised as either generative or discriminative. Discriminative methods try to find a model that predicts the probability $p(c|x)$ of a class label $c$ (either human or non-human) directly from the feature representation $x$, while generative methods try to find an appropriate model with variable parameters $\Theta$ that describes the appearance of the human by finding the joint probability $p(c, x)$, which can be written as $p(x|c)p(c)$. This can be done by learning the likelihood $p(x|c)$ and the class probability $p(c)$ separately. The prior probability $p(c)$ may vary, as it can depend on the model parameters $\Theta$ of the shape model used, e.g. to restrict the shape to physically plausible configurations independently of the appearance likelihood $p(x|c)$.

2.6.1 Generative

An advantage of generative algorithms is that they can handle missing or partially labelled data, and their models can also be used to augment datasets with synthetically generated as well as natural training data. However, since they try to model the joint probability density $p(c, x)$, predicting the class for a new image often requires a computationally intensive iterative solution (such as the Active Contours methods used in [35, 113]), making them costly for use in real-time applications.

Generative approaches generally employ a shape representation to model appearance. Discrete shape models use exemplars (representative shapes) that cover the expected variation in appearance and use efficient methods to determine whether any of them matches the query image [65, 66, 154, 164]. Other approaches use a parametric shape representation [12, 16, 35, 48, 73, 74, 80, 113] to describe the shape being matched and optimise over the parameter space of the appearance model to find the best match.

Combining shape and texture information with a compound appearance model has also been explored [34, 35, 48, 52, 80]. Training data are normalised using sparse landmark features, as in the approach of Fan et al. [52], or dense correspondences, as in the method of Cootes et al. [34]; an intensity model is then learnt from the normalised examples to model the variation in texture appearance.

2.6.2 Discriminative

Discriminative approaches are typically very fast at predicting $p(c|x)$ for new data compared to the iterative solution often required by generative methods. This makes discriminative approaches potentially more useful in real-time applications. A disadvantage of discriminative methods, however, can be the requirement of large numbers of training examples to cover the expected variation in appearance.

Some discriminative approaches use a mixture of experts strategy, where the training data are first separated into local shape specific pedestrian clusters and then a classifier is trained for each subspace [66, 114, 141, 144, 175, 177]. The advantage of these methods is that by grouping training examples into clusters of roughly similar shape, the classifier is not faced with the problem of trying to model very high variability in appearance.

Other methods attempt to model the detection problem in terms of semantic parts, such as body parts [5, 104, 108, 141, 147, 175], where a discriminative classifier is learnt for each part, or codebook representations [3, 96, 97, 137], where occurrences of features over local patches and the geometric relations between them are learned.

An advantage of multi-part methods is that they can be good at handling occlusions, and can reduce the number of training examples required to cover the intended pose space. The disadvantage of these methods, however, is that they usually come with a higher computational cost due to the multiple detectors and the additional cost of classifying examples during testing.

Detection Strategies

A typical approach to human detection is to use a sliding window to scan across an image at multiple scales and determine directly the classification of each sub-window [40, 41, 108, 122, 136, 159]. The problem with sliding window methods is that they are generally too computationally costly to be used in real-time applications. Chamfer based matching methods [65, 66] can exploit the smoothness of the distance transform to perform a coarse-to-fine search of the image, but are less accurate than the more computationally expensive sliding window algorithms.

Dalal and Triggs [40] arrange a densely overlapping grid of HOG descriptors within a 96 × 160 detection window and train the classifier using a linear support vector machine. The linear Support Vector Machine (SVM) used by [40, 41] and others [114, 142, 144, 177, 183] is a linear classifier that, once trained, classifies a feature vector as either positive or negative using equation (2.6.1).

\[ \begin{aligned} w^T x + b &\geq 0 \quad \text{for positive classification} \\ w^T x + b &< 0 \quad \text{for negative classification} \end{aligned} \tag{2.6.1} \]

where $w$ and $b$ are the weight vector and bias determined during the training process, which represent the decision hyperplane used to classify examples, and $x$ is the feature vector to be classified. The decision hyperplane surface is described by equation (2.6.2).

\[ w^T x + b = 0 \tag{2.6.2} \]

The training problem of a linear SVM is to maximise the distance (margin) between the training examples and the decision hyperplane. Dalal and Triggs [40] used an implementation called SVM Light [79] to train the linear SVM using a set of positive and negative feature vector examples. They then scan negative images to find a set of hard examples that is used as a bootstrap set to train the classifier again. The performance achieved by this method is still considered among the state of the art for standing pedestrian detection.
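The decision rule of equation (2.6.1) amounts to a single dot product per feature vector; the sketch below is a minimal illustration of it, assuming $w$ and $b$ have already been learnt (the names and the comment on bootstrapping describe the general idea rather than any specific implementation).

import numpy as np

def classify(w, b, x):
    # Signed distance (up to a scale factor) from the decision hyperplane.
    score = float(np.dot(w, x) + b)
    return score >= 0.0, score          # True => positive (person) classification

# Hard-negative bootstrapping re-runs this classifier over person-free images
# and adds any windows scoring >= 0 to the negative set before retraining.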

The work of Zhu, Yeh, Cheng and Avidan [183] greatly improves the computational efficiency of the HOG descriptors by exploiting summed-area tables to build integral histograms [125], which allow constant-time calculation of arbitrarily sized HOG features, and combines them in a cascade classifier in a manner similar to [170, 171]. Since then other authors have also used variable sized HOG blocks [142, 177] with promising results. However, the survey by Enzweiler and Gavrila [49] does not use this more efficient HOG cascade implementation as a comparison in their time constrained performance evaluation benchmark.

Non-linear SVMs can also be used and can provide an improvement in performance over linear SVMs [5, 108, 112, 114, 122, 159], but the additional computational cost can make them difficult to use in classifiers intended for real-time applications.

Another approach is to employ two-stage methods that use a fast region of interest detection algorithm that can identify candidate locations for processing with a more computationally expensive algorithm. CCTV surveillance applications, or other applications that have a fixed camera with a static background, can employ methods such as background subtraction to focus only on areas of the image that have been identified as foreground. The works of [114, 152, 180] use background subtraction to identify regions of the image that can be focused on with more detailed algorithms. Search space constraints or scene knowledge can also be exploited to more efficiently identify regions of interest [50, 66, 96, 141, 180], while other works attempt to identify regions of high information content [3, 96, 97, 102, 137] to find candidate locations.

A hybrid approach somewhere between sliding window and ROI/classification methods is to train a cascade [104, 118, 142, 166, 171, 175, 177, 183], where the first stages use only a small number of features to quickly reject large portions of the image while retaining almost all true detections, and later stages use more features to make a more detailed decision. This method has seen great success in face detection [170] at real-time speeds, and is exploited as a region of interest detector for the segmentation algorithm proposed in Chapter 4.

A cascade is trained using the AdaBoost algorithm [170], which attempts to find a subset of the possible features at each level of the cascade that meets a specified accuracy, typically a 95% true positive rate and a 50% false positive rate. Each level is trained to correct errors made by the previous level, and the cascade gradually becomes more complex. The key advantage of this approach is that a great deal of the image can be rejected as negative with only a few features, making it suitable for real-time applications.

Gavrila [65] uses a chamfer template matching exemplar based approach and constructs a hierarchical template tree using human shape exemplars and the chamfer distance between them. They recursively cluster together similar shape templates, selecting at each node a single cluster prototype along with a chamfer similarity threshold calculated from all the templates within that cluster. Multiple branches can be explored if edges from a query image are considered to be similar to cluster exemplars for more than one branch in the tree.

2.6.3 Chamfer Matching

The chamfer matching algorithm is inherently fast due to its simplicity, which makes it an attractive method for use in real-time applications, but providing the optimal number of exemplar edge templates can be problematic since they must be learnt from segmented images to ensure that only object edges are considered by the exemplars. This section gives an overview of the basic matching algorithm used to match an exemplar template, such as one from a node in the tree hierarchy constructed by Gavrila [65].

Chamfer matching [11, 20] is a technique for object detection that finds locations (ideally a single location) in a query image that closely match a binary edge template created from an instance of an object being searched for.

Object templates are created off-line from edges extracted from a representative object image using an edge detection technique such as the Canny edge detector [32]. Each template is a set of edge point coordinates from a binary edge map:

\[ O = \{(o_{x_i}, o_{y_i})\}_{i=1}^{n_O} = \{o_i\}_{i=1}^{n_O} \tag{2.6.3} \]

where $o_i \in \mathbb{R}^2$ is the 2D coordinate location of an edge point extracted from the original exemplar object image, and $n_O$ is the total number of edge points extracted from that image.

At run-time, edges are extracted from a query image to create a set of binary edge coordinates:


Figure 2.13: Distance feature calculation. Left: a query image. Middle: edges extracted for two orientations; each colour represents a different channel, and white represents an overlap between adjacent orientation channels. Right: truncated distance transforms for each orientation channel.

\[ A = \{(a_{x_i}, a_{y_i})\}_{i=1}^{n_A} = \{a_i\}_{i=1}^{n_A} \tag{2.6.4} \]

where $n_A$ is the total number of edge points. Using the set of points in $A$, a distance transform $D(\cdot)$ is calculated so that for any query point $p$ and set of edge points $A$ from a query image:

\[ D_A(p) = \min_{q \in A} \| p - q \| \tag{2.6.5} \]

gives the distance to the nearest edge in $A$ from that point. Efficient methods exist to calculate distance transforms [20, 54], making the algorithm ideal for real-time applications.

The points from $O$ are used to sample the distance transform at a given position using a chamfer cost function. Different cost functions can be used, such as the average of the squared distances [20, 154]. This cost function is evaluated at every position in the image, and the minima indicate the best matches.

The chamfer score $C$ of an object template $O$ at a given location in the distance transformed image $D_A$, evaluated over a window with the same dimensions as the object template, is given by [20, 154]:

\[ C(D_A, O) = \frac{1}{|O|} \sum_{p \in O} D_A(p)^2 \tag{2.6.6} \]

A truncated chamfer cost function can improve matching stability [154] in images where the object is partially occluded:


\[ C(D_A, O, \tau_d) = \frac{1}{|O|} \sum_{p \in O} \min\!\left( D_A(p)^2, \tau_d \right) \tag{2.6.7} \]

The threshold $\tau_d$ gives an upper limit to values in the distance transform where edges might be missing, and means that squared distances beyond $\tau_d$ take the same value.

A threshold $\tau_c$ is typically used to accept partial matches due to incomplete edge information. A cost value of zero indicates a perfect match (i.e. the average edge distance over all points on the template is zero); however, a cost of zero can also arise when a highly textured image area causes a strong local minimum.

To alleviate this, oriented edge information can also be used to improve matching performance [154]. In this case, template edges are split into a set of edge points for each orientation channel, $O = \{O^\theta\}_{\theta=1}^{|\Theta|}$, and a set of distance transforms $D_A = \{D_A^\theta\}_{\theta=1}^{|\Theta|}$ is created, one for each edge orientation to be considered from the image edges. See figure 2.13 for an illustration of an oriented chamfer distance transform. The oriented chamfer cost function is:

\[ C_\Theta(D_A, O, \tau_d) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} C(D_A^\theta, O^\theta, \tau_d) \tag{2.6.8} \]

The oriented chamfer cost is the average chamfer cost over each of the edge orientation channels.
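The following sketch computes the truncated chamfer cost of equation (2.6.7) using SciPy's Euclidean distance transform; the boolean edge-map representation, the (x, y) point format, the offset argument and the value of the truncation threshold are assumptions for illustration.

import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_cost(query_edges, template_points, offset=(0, 0), tau_d=400.0):
    # Distance from every pixel to the nearest query edge (eq. 2.6.5):
    # distance_transform_edt measures distance to the nearest zero, so edges
    # are inverted to become the zeros.
    D = distance_transform_edt(~query_edges)
    oy, ox = offset
    d2 = np.array([D[y + oy, x + ox] ** 2 for (x, y) in template_points])
    return float(np.minimum(d2, tau_d).mean())   # truncated average squared distance

# The oriented cost of equation (2.6.8) would average this score over the
# per-orientation edge channels and their corresponding distance transforms.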

Chapter 5 proposes a method that combines the chamfer matching algorithm with a linear SVM to automatically learn the templates, even in the presence of background information.

2.7 Pose Estimation

Pose estimation can be formulated either as a multi-class classification problem, where each class represents a distinct pose and the closest matching pose exemplars can then be used to regress to a pose in continuous space, or as a continuous problem, where the output pose is derived by varying the parameters of an internal model, or from the detection results of many part-based classifiers. Whichever approach is used, they are all concerned with extracting sufficient information from an image to determine an output pose from the data. The inferred pose can be either in 2D or in 3D.


There are various surveys that consider human motion analysis [4, 63, 106, 107, 124, 150], of which pose estimation is usually a component. Motion analysis methods concerned with action recognition do not necessarily need to determine an explicit pose configuration, as motion can be analysed indirectly using other information such as tracking the motion of detections over time, or analysing the evolution of low level image features over time; such methods are therefore out of the scope of this discussion. Similarly, methods that use multiple cameras or additional hardware such as depth cameras, and those that are still computationally too costly for real-time applications, are also not considered.

Many of the existing surveys define their own categories to group the different approaches, but as observed in the survey by Poppe [124] they generally fall into two categories:

Model-free / Discriminative methods use discriminative approaches to determine pose directly from feature representations.

Model-based / Generative methods maintain an internal appearance likelihood model that iteratively tries to determine pose by varying the model parameters.

These discriminative and generative approaches to human pose estimation have similar advantages and disadvantages to their equivalent approaches in human detection. Discriminative approaches tend to be faster, but can be limited to the poses they have been trained with, so large amounts of training data can be required to sufficiently cover the variation in pose. Parts based discriminative methods alleviate the training data problem somewhat, but at the cost of more complex computation when estimating pose due to the classifiers for individual parts. Generative methods that maintain an internal human model usually find the pose through an iterative solution, making them less suited to real-time applications, but have the advantage that they can deal with poses not seen during training.

2.7.1 Generative

In generative models, various approaches have been explored to model the shape of a human. These range from simplistic models using hierarchies of simple 2D shapes to very detailed 3D models of shape and physical deformation.

2D models such as the ‘Cardboard human’ model proposed by Ju et al. [81] use a kinematic hierarchy of planar patches to model human shape. A similar approach by Morris and Rehg [111] uses a scaled prismatic kinematic 2D human body model for human body registration, where the parts are modelled using textured shapes to find correspondences. Howe et al. [75] use a method similar to Morris and Rehg [111] and infer 3D pose by tracking parts in 2D, then using a prior model of 3D motion learnt from motion capture data in a Bayesian framework that models short duration motions to reconstruct 3D pose. Huang and Huang [77] extend the model of Morris and Rehg [111] with an extra degree of freedom for each part to describe width change, but do not attempt to recover 3D pose.

Models that are defined in 3D are generally constructed from a hierarchy of simple primitives. The hierarchy defines joint locations and relative positions (e.g. the head is attached to the torso), and geometric primitives are defined relative to their associated joints. These primitives can be spheres [120], cylinder based models [133, 148], or models defined using tapered super-quadrics, as in [64, 85] for the body and [155] for the hand.

Rohr [133] focuses on the problem of pedestrians walking in a plane relative to the camera. First a 3D model of cylinders is constructed to represent the human shape, then background subtraction is applied to locate a region of interest. Within the region of interest, edge pixel locations are detected, linked together, and lines are fit to the linked edge pixels. Cylinder edges from the model are projected into the window around the ROI, and the model lines are compared against the detected lines.

Sidenbladh et al. [148] also use a 3D cylinder based generative model of image appearance that extends the idea of parameterised optical flow estimation to 3D articulated figures. Depth ambiguities, occlusion, and ambiguous image information result in a multi-modal posterior probability over the pose of the model. They employ a particle filtering approach to track multiple hypotheses in parallel and use prior probability distributions over the dynamics of the human body to constrain motions to valid poses.

Pose ambiguities with single camera methods can be resolved by combining information from multiple cameras. Tapered super-quadrics are exploited by Gavrila and Davis [64] to construct a 3D human shape model. The parameters for parts are determined during a calibration step by varying the parameters and assessing similarity with chamfer matching using the silhouette of the shape while a subject remains in a known pose. Initialisation is done using background subtraction to find the region of interest, followed by PCA on the foreground pixels to discover the main axes of variation to initialise the torso parameters. Using this initialisation, a search is done to find the best fitting head/torso configuration, and afterwards limb parameters are found in a similar manner.


Bottino and Laurentini [22] use multiple cameras to calculate a 3D reconstruction of the human shape using volume intersection, then motion data are acquired by fitting a human shape model to the reconstructed 3D volume. Later, Kehl and Gool [85] also use 3D reconstruction techniques in combination with a model built using super-ellipsoids, and fit the model to the images over several video streams. To fit the model, multiple image cues are used, consisting of edges, colour information and a volumetric reconstruction. The use of 3D information helps reduce ambiguity, and the model can in turn be used to refine the quality of the reconstruction.

Alternatively, the shape model can be defined as a single polygonal mesh. Anguelov et al. [8] build a surface mesh shape model constructed using training examples from laser scan data over many subjects, then find correspondences between the models. Dimensions of shape variation are learnt using PCA on the model data (e.g. male and female, height, body size), and realistic surface deformations are learnt using linear regressions per triangle. The method is computationally costly but is very expressive. Later, Sigal et al. [149] combined this model in another multi-view approach with a discriminative initialisation step, followed by a refinement step to determine a more accurate pose.

Barron and Kakadiaris [10] propose a semi-automatic method of simultaneously estimating a human's anthropometric measurements (physical limb lengths and their proportions) and pose from an uncalibrated image. Landmarks are placed by the user indicating the positions of the main joints, then a set of plausible limb length estimates is produced using a priori statistical information about the human body, and plausible poses are inferred using geometric joint limit constraints.

Some works attempt to determine the shape and articulation automatically. Kakadiaris and Metaxas [82] use data from multiple views and apply a spatio-temporal analysis of the deforming contour of a human performing a series of predetermined movements to determine the shape and articulation of the model (i.e. it is not defined a priori). When the movements are unknown, however, this method cannot be applied.

The majority of these generative methods, however, tend to be too computationally intensive for real-time applications due to the extra cost of iterative model evaluation and parameter updating to fit the model to the image data. Multiple cameras can reduce ambiguity, but such methods are not directly related to this thesis and are only mentioned to give context for the types of approaches used in some of the model based literature.


2.7.2 Discriminative

Discriminative methods can be further broken down into two categories [124]: learning based, where the problem is to find a mapping directly from feature space to pose space (though due to ambiguities in pose, a set or mixture of mappings tends to be learnt to handle the multi-modality [1, 69, 134, 151]); and example based, where the problem is formulated as finding the nearest exemplar(s) out of a set of possible exemplars that most closely match the feature, and returning the associated pose.

Various methods analyse silhouettes of humans to determine pose. Silhouettes from multiple views are used by Grauman et al. [69] in a probabilistic shape and structure model. A prior density over the multi-view shape and corresponding structure is constructed with a mixture of probabilistic principal component analysers that locally model clusters of data in the input space with probabilistic linear manifolds, and pose is determined by finding the maximum a posteriori estimate. Instead of clustering in input space, [134] cluster in 2D pose space and learn mappings from features to pose for each cluster.

Agarwal and Triggs [1] and Sminchisescu et al. [151] also use a mixture of experts approach. Agarwal and Triggs [1] do this by first clustering silhouettes in a lower dimensional input space found using PCA, to which a mixture of regressors is fit. This approach helps model ambiguities by following multiple hypotheses, but requires a clean silhouette. Sminchisescu et al. [151] use local appearance and shape contexts as a feature space and show promising results on complex monocular human actions.

Elgammal and Lee [47] also recover pose from a monocular silhouette, but instead learn view-based activity manifolds and mapping functions between the manifolds and both silhouette and 3D body pose. Pose is estimated by projecting the silhouette to the learned activity manifold, finding the point on the learned manifold representation corresponding to the silhouette, and then applying interpolation over probable 3D poses.

Example (or exemplar) based methods such as that of Mori and Malik [110] store a number of exemplar 2D views of the human body over different configurations and viewpoints, with corresponding manually labelled locations of body joints. Pose estimation is done by matching the input image using shape context matching with a kinematic chain based deformation model, and the corresponding pose is used to reconstruct the 3D pose of the human.

Exemplar based approaches have been very successful in pose recognition. However, in applications involving a wide range of viewpoints and poses a large number of exemplars would be required, and as a result the computational time to recognise individual poses would be very high. One approach, based on efficient nearest neighbour search using histogram of gradient features, addressed the problem of quick retrieval from a large set of exemplars by using Parameter Sensitive Hashing (PSH) [140], a variant of the original Locality Sensitive Hashing algorithm (LSH) [42]. Once a set of nearest neighbours has been found, the final pose estimate is produced by locally-weighted regression, using the set of neighbours to dynamically build a model of the neighbourhood and infer pose. PSH is also applied in the work by [129], using Haar-like wavelet features based on Viola and Jones [170] from multi-view silhouettes obtained using three cameras; a motion graph is then used to find poses that are close in both input and pose space.

The method of Agarwal and Triggs [2] is exemplar based. They use kernel based regression, but they do not perform a nearest neighbour search for exemplars, instead using a (typically sparse) subset of the exemplars learnt by Relevance Vector Machines (RVMs). Their method has the disadvantages that it is silhouette based and that it cannot model ambiguity in pose, as the regression is uni-modal.

Gavrila [65] presents a probabilistic approach to hierarchical, exemplar-based shape matching. This method achieves a very good detection rate and real-time performance, but does not regress to a pose estimate. Similar in spirit, Stenger [154] proposed a hierarchical Bayesian filter for real-time articulated hand tracking, but clusters shapes in pose space to construct the tree hierarchy. Toyama and Blake [164] use dynamics in combination with an exemplar-based approach for tracking pedestrians.

Everingham and Zisserman [51] utilise a hierarchical chamfer tree approach similar to [65, 154], constructing a chamfer template tree to reduce the pose search space, and then use the estimated pose from the tree classifier to initialise a generative model that finds a more accurate pose.

For applications such as pose estimation where there are many object templates, it is inefficient to find the chamfer score for each object at every point in an image, so more efficient methods must be used. Figure 2.14 shows templates for several actions.

Pose detection using chamfer matching can be achieved by clustering poses obtained from a varied action database into a hierarchical structure to quickly explore the example space in a coarse-to-fine manner, such as in the methods proposed by [51, 65, 154]. Okada and Stenger [117] present a method for marker-less human motion capture using a single camera, using tree-based chamfer filtering to efficiently propagate a probability distribution over poses of a 3D body model. Dynamics is used to handle self-occlusion by increasing the variance of occluded body parts, allowing for recovery when the body part reappears.


Figure 2.14: Pose templates for several different actions.


2.7.3 Combined Detection and Pose Estimation

In contrast to two-stage methods, relatively few works attempt to combine localisation and pose estimation into a single discriminative model. Dimitrijevic et al. [44] present a template-based pose detector and address the dependency on huge training datasets by detecting only human silhouettes in a characteristic posture (sideways open-leg walking postures in this case). They extended this work in [57] by inferring 3D poses between consecutive detections using motion models. This work gave some very interesting results with moving cameras; however, it seems difficult to generalise to actions that do not exhibit a characteristic posture.

The pose estimation and detection work of Okada and Soatto [116] learns k kernel SVMs to discriminate between k pre-defined pose clusters, and then learns linear regressors from feature to pose space. They extend this method to localisation by adding an additional cluster that contains only images of background.

The problem of jointly tackling human detection and pose estimation at the same time is discussed in more detail in the work presented in Chapter 6.

CHAPTER 3

Fast Background Subtraction

For gaming interfaces, a camera is typically placed on top of a television or monitor and the video stream is displayed on the screen so that the user can see what the camera is viewing.

In most applications, the video image is mirrored on the y axis so that the video stream acts as if the television display is a mirror placed in front of the user. This is the method of display employed by many of the Sony EyeToy games.

People are naturally used to interacting with their own reflection due to the abundance of mirrors in everyday life, and so co-ordinating movement with what a user sees on screen is very intuitive and can be learnt quickly. The success of the EyeToy franchise is a testament to how well this method of interaction has been received, and it is this method that is used for the algorithms described in this chapter.

User interface components are displayed as an overlay on top of the video stream. To interact with a component, the user must move a part of their body over the area covered by that control; some kind of feedback (a sound effect or visual cue) is then triggered to indicate that the interaction was successful. Game objects are displayed and interacted with in a similar manner.

For simple interactions such as those required by basic user interface controls, it is reasonable to tackle the problem using methods that are computationally efficient and extract only the minimum amount of required information when processing the video frame. A strong motivation behind this low-level processing approach is to take as little processing time as possible when the game engine updates each of its subsystems between frames (§2.3.1), to ensure that other media subsystems, such as music, sound and graphical effects, have sufficient computing time remaining to be updated too.

This chapter takes a look at a common algorithm used in many computer vision based games, proposes a new local descriptor, and looks at how other algorithms might be exploited to provide more advanced methods of interaction.

3.1 Problem: Where is the player?

The general question that needs answering for a computer vision based interface is: "Where in the image is the player?"

An answer to this question is required at least on a very basic level to determine how the user is attempting to interact with the components or game objects being superimposed on the video frame.

This can be determined either directly using shape detection [65, 170], background subtraction [123], and segmentation methods [27, 92], or indirectly by detecting motion using sparse optical flow methods [9, 143] and simple image differencing.

The following section describes an approach that detects inter-frame differences to drive user interface controls and interact with game objects.

3.2 Image Differencing

The simplest method employed in many computer vision based games (for example the EyeToy Play series and EyeToy Kinetic) is that of image differencing between two consecutive video frames and thresholding the result to create a binary map of differences.

D_t(x, y) = \begin{cases} 1 & \text{if } |I_t(x, y) - I_{t-1}(x, y)| > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (3.2.1)

where D_t is the binary mask of differences between the current frame I_t at time t and the previous frame I_{t-1}, and \tau is the difference threshold. The threshold value \tau is set to a value that is just over the ambient noise level of the camera, but not so high that true image differences are missed. See figure 3.1 for an illustration.
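As a concrete illustration, the difference map of equation 3.2.1 reduces to a few array operations. The sketch below is a minimal NumPy version written for this document rather than the implementation used in the games discussed here; the frame arguments and the default threshold are illustrative.

import numpy as np

def difference_map(frame_t, frame_t_prev, tau=15):
    # Binary map of inter-frame differences (equation 3.2.1).
    # frame_t, frame_t_prev: greyscale frames as uint8 arrays of equal shape.
    # tau: difference threshold, set just above the camera's ambient noise level.
    diff = np.abs(frame_t.astype(np.int16) - frame_t_prev.astype(np.int16))
    return (diff > tau).astype(np.uint8)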

By finding the difference between consecutive frames, a model of the background image is not required and does not need to be updated to account for illumination changes. However, if the user stands completely still or moves very slowly so that the pixel-wise differences between frames are very small, then movement can be missed.

Figure 3.1: Image differencing algorithm. For user interface controls, differences are accumulated over several frames under the control to place the button in an active state.

Interacting with a game using this method is simply a case of having game objects collide with areas of the screen that have any non-zero values in the difference map, D_t. The user just has to make sure they wave their hand in the appropriate area of the screen, or just create a large amount of movement in that area and hence produce image differences.

For user interface controls, using the image differences for the current frame D_t means that other components may be accidentally triggered as the user reaches for the appropriate control. This can be addressed by accumulating the differences under the region R covered by the user interface component for several frames, and activating the control only when the accumulated differences have reached a certain threshold.

D_t(R) = \sum_{(x, y) \in R} D_t(x, y) \qquad (3.2.2)
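A motion button built on equation 3.2.2 then only needs to accumulate these per-frame counts for the pixels under its region. The sketch below shows that bookkeeping under stated assumptions: the region bounds, accumulation window and activation threshold are illustrative values rather than those used in the games described in this chapter.

class MotionButton:
    # Accumulates difference-map counts over a rectangular region (equation 3.2.2).
    def __init__(self, x0, y0, x1, y1, activation_threshold=500, history=5):
        self.region = (slice(y0, y1), slice(x0, x1))
        self.activation_threshold = activation_threshold
        self.max_history = history
        self.counts = []               # recent per-frame values of D_t(R)

    def update(self, diff_map):
        self.counts.append(int(diff_map[self.region].sum()))   # D_t(R)
        if len(self.counts) > self.max_history:
            self.counts.pop(0)
        # Active once enough movement has accumulated over recent frames.
        return sum(self.counts) >= self.activation_threshold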

This motion button approach relies heavily on the user moving around and does not offer any persistent information across frames as to the true state of pixels belonging to the user. It also does not discriminate between the user's movements and either sudden illumination changes or the moving shadows cast by the user as they move around within the video frame.

If a user interface control can reliably detect which regions of the video image the user occupies, then more varied interactions can be accommodated. As an example, consider for the moment a musical keyboard interface. With the existing image difference based controls it is difficult to activate the keys momentarily, as the button would have to reliably detect two spikes of differences; one for the hand entering the key region and another for when the hand was removed. A toggle button could be possible, but relies on the control being able to successfully detect both activation and deactivation cues. If a control could determine which pixels were different from some pre-calibrated reference model, then the button can remain active so long as the number of different pixels is above a given activation threshold, and would allow for more sensitive user interface controls.

To solve this, a fast and robust background subtraction algorithm is required. The next section proposes an approach that extends the local binary pattern (LBP) algorithm to use a 3-label neighbourhood coding scheme in which each label is coded by 2-bits, and presents some qualitative results of the method.

3.3 Motion Button Limitations

Motion buttons have been used for computer vision user interface components in a large majority of the existing EyeToy games released by Sony. The idea is simple and effective. Absolute pixel differences between two consecutive frames that are greater than a certain threshold are counted over the region of the frame that the button covers.

Motion buttons become activated when they have accumulated movement over several frames. This is so that the button is not accidentally activated by the user or by something considered to be in the background. When there is no movement within the button region the button reverts to an inactive state.

However, because motion buttons have to accumulate movement over time before they activate, they have a couple of limitations. Firstly, they cannot be used in interfaces that require a user to keep a button activated, since if a user keeps their hand stationary, no motion is accumulated and the button deactivates. Secondly, the time it takes to accumulate enough movement to activate the button means that expecting the user to activate the button quickly for time-critical applications becomes impractical.

Performing this image differencing against a reference image instead would be one approach to solving this, but subtle illumination changes can affect the difference map, and it is difficult to select a threshold that can distinguish true differences from global brightness changes. See figure 3.2 for an illustration of the problem.

It is these limitations that the algorithm used for persistent button control is intended to address.

3.4 Persistent Buttons

The design goal of the persistent button is to be responsive to fast user interactions, and be able to maintain an active state when the user intends to keep the button activated.


Figure 3.2: Diagram illustrating one of the problems caused by using simple image differencing on a background image. The circled red regions appear only slightly different due to a global illumination change caused by the automatic camera gain, but can cause differences to register in the difference map if the difference threshold is not updated to reflect this.

An example application for this type of component could be a small virtual keyboard, where the user can play sustained notes by keeping their hand within a button's area or activate them quickly by passing their hand through them briefly.

This level of responsiveness comes with some limitations. Calibration is required to build a model of the region underneath the button, and this means that if the camera is moved for some reason, the button will require recalibration.

This section describes the algorithm, presents some qualitative results and compares the algorithm to two other algorithms: Normalised Cross Correlation (NCC) [100] and Colour Normalised Cross Correlation (CNCC) [71]. See section 2.4.2 for the details of these two algorithms.

3.5 Algorithm Overview

In the following sections, a 'label' is defined as a 2-bit binary value, and a 'code' is an ordered sequence of concatenated labels c = (l_1, l_2, \ldots, l_n). The proposed algorithm extends the idea of Local Binary Patterns (LBP) [115]. In the local binary pattern algorithm, for each pixel p in an image its intensity value is compared to the values of pixels in its local neighbourhood N_p centred on its position and a binary code is constructed. Each bit in the binary code represents the result of a comparison between a pixel p_n \in N_p from that neighbourhood and the centre pixel p. The value stored in that bit is 0 or 1 to represent the polarity of the comparison: 0 if (p_n < p), and 1 if (p_n \geq p). Figure 3.3 illustrates how a simple binary code is generated from intensity values in a small 3x3 neighbourhood.

Figure 3.3: Illustration of LBP code construction. The centre pixel is compared to each of its neighbours, and assigned a value of 1 if equal or greater than its value, or 0 otherwise. The code is then constructed by assigning a neighbourhood pixel to a corresponding bit in a binary code.
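For reference, the standard 8-neighbourhood LBP code of figure 3.3 can be generated as in the sketch below. This is a plain per-pixel transcription of the textbook formulation, written for clarity rather than speed; the bit ordering of the neighbours is an arbitrary choice.

import numpy as np

def lbp_code(image, x, y):
    # Standard 8-neighbourhood LBP code for pixel (x, y) of a greyscale image.
    p = image[y, x]
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]   # clockwise around the centre
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if image[y + dy, x + dx] >= p:   # 1 if the neighbour is equal or greater
            code |= (1 << bit)
    return code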

The proposed algorithm extends this idea of Local Binary Patterns (LBP) [115] so that instead of pixels in Np taking one of 2 labels they can take one of 3 possible labels. This is in the same spirit as the work of Local Ternary Patterns (LTP) proposed by [160], except that in the proposed LBP3 coding scheme:

1. The labels are given 2-bit values that are not split into upper and lower channels as in [160] and are instead concatenated to form a single code.

2. The code map is generated after pre-processing the image using a Gaussian blur, so that comparisons are done between the sum over weighted regions instead of single pixel intensities to add more spatial support to the comparisons. This is similar in spirit to the way filters are used to compute weighted sums in the DAISY descriptor construction [162] and Geometric Blurring [15].

3. To achieve temporal stability for use with video streams, temporal hysteresis is applied to the code labels so that ambient camera sensor noise does not cause labels to oscillate between two states over time.

4. The labels are coded in such a way that taking the Hamming distance between two codes will yield the distance in label space between them (see section 3.8.2 for details).

The proposed algorithm extension is referred to as LBP3 in the following sections, to differentiate it from the Local Ternary Pattern work of Tan and Triggs [160].

The algorithm takes a frame as input, sub-samples and converts the image into an intensity image (greyscale image), then builds a code for each pixel describing its intensity relative to a set of neighbouring pixels. The current frame's code map is compared to a reference code map stored during calibration, and differences in codes are summed up to give a measure of how different the current frame is. If the number of differences is above a certain threshold (minus the ambient distance learned during calibration), then the button becomes active. The main image processing steps are highlighted in Algorithm 1.

Algorithm 1: LBP3 Image Processing Steps
1. Sub-sample the image;
2. Convert the image to an intensity image;
3. Blur the intensity image;
4. Build the code map.
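The comparison step at the end of this pipeline reduces to counting, within the button region, the pixels whose current code differs from the stored reference code. The sketch below assumes the code maps are already available as integer arrays of equal shape; the threshold handling mirrors the description above, but the values themselves are illustrative.

import numpy as np

def count_code_differences(code_map, ref_code_map, region):
    # Number of pixels in `region` whose code differs from the reference code map.
    return int(np.count_nonzero(code_map[region] != ref_code_map[region]))

def button_is_active(code_map, ref_code_map, region, threshold, ambient):
    # Activate when the difference count exceeds the activation threshold
    # offset by the ambient difference level measured during calibration.
    differences = count_code_differences(code_map, ref_code_map, region)
    return differences > max(threshold - ambient, 0)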

3.5.1 Sub-sampling

The image frame I is sub-sampled to reduce the number of pixels that the algorithm has to deal with. This step is optional and the sub-sampling scale factor used in the current implementation is 0.5. This resamples the source image so that the final image used by the algorithm after pre-processing is half the resolution of the original image.

Though this decreases computational cost due to reducing the number of pixels processed, it also limits the effective minimum size that a control can be, since less information is available to make a decision on whether or not the button should be in an active state. When a control covers a small area of the image, sub-sampling should not be used.

3.5.2 Intensity Image

By reducing the colour image to just its intensity values, the amount of data the algorithm needs to consider is reduced. This is helpful for the next stage (that of blurring the image to reduce the effects of noise), as the filtering need only be applied to a single channel rather than to each of the individual RGB channels.

3.5.3 Blur Filtering

By applying a Gaussian filter to the image, the pixel-wise comparisons used in the binary coding scheme (discussed in the next section) are between weighted sums over a small neighbourhood around each of the pixels being compared. Though this does not remove the effects of sensor noise (see section 3.7) on the pixel values, it is an efficient method of adding more spatial support to the comparisons. This is similar to the way the DAISY descriptor exploits filtering to pre-calculate weighted sums for its descriptor construction [162].

Figure 3.4: The source image can be sub-sampled by a factor of 0.5 to an image of half the dimensions. To calculate the pixel value for each coordinate in the sub-sampled image, the position (x_s, y_s) is projected back into the original image, and four pixels are averaged together.

3.5.4 Code Map

Once the image has been pre-processed, a binary code is generated for each pixel. This code is built from the concatenated labels given to a set of neighbouring pixels N_p, which are labelled according to their intensity relative to the current pixel p being considered. Each of the labels is given a 2-bit binary code that is concatenated to form the final code for that pixel. This is discussed in more detail in section 3.8.2.

3.6 Algorithm Details

3.6.1 Sub-sampling and Converting to an Intensity Image

Figure 3.4 shows the sub-sampling process. The image I is both sub-sampled and converted from RGB to a greyscale representation in the same pass over the current frame. A simple component average is used to convert the RGB value into greyscale as shown in the equation below.

f_G(p) = \frac{1}{3}(p_r + p_g + p_b) \qquad (3.6.1)


The original image I is then transformed using this function to give a greyscale image, IG = fG(I). Since the sub-sampling scale factor used is 0.5, the image can be sub-sampled using the average of the current pixel and the pixels to the right, below and below-right and stepping over the source image two pixels at a time.

I_S(x_s, y_s) = \frac{1}{4} \sum_{(u, v) \in N_o} I_G(2x_s + u, 2y_s + v) \qquad (3.6.2)

where x_s \in [0, w/2) and y_s \in [0, h/2), w and h are the original image width and height respectively, and N_o = \{(0, 0), (0, 1), (1, 0), (1, 1)\} are the neighbourhood offsets to average over.

The sub-sampling pass can be combined with the greyscale conversion by simply converting the RGB values to greyscale during the sub-sampling pass, so the combined sub-sampling and conversion transform can be written as:

I_S(x_s, y_s) = \frac{1}{4} \sum_{(u, v) \in N_o} f_G\big(I(2x_s + u, 2y_s + v)\big) \qquad (3.6.3)

where f_G(\cdot) converts an RGB value to greyscale as defined in equation 3.6.1. Since the divisor is a power of two, the division can be performed as a bit shift to remove the need for an integer divide.
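Equations 3.6.2 and 3.6.3 amount to averaging each 2x2 block of the frame while converting it to greyscale in the same pass. The sketch below is a vectorised NumPy equivalent; the original implementation works in integer arithmetic with a bit shift for the divide, whereas floating point is used here purely for brevity.

import numpy as np

def subsample_and_grey(rgb):
    # Combined 0.5 sub-sampling and greyscale conversion (equations 3.6.1-3.6.3).
    # rgb: H x W x 3 array with even H and W; returns an (H/2) x (W/2) image.
    grey = rgb.astype(np.float32).mean(axis=2)        # f_G: simple component average
    h, w = grey.shape
    blocks = grey.reshape(h // 2, 2, w // 2, 2)       # group pixels into 2x2 blocks
    return blocks.mean(axis=(1, 3)).astype(np.uint8)  # average each block (divide by 4)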

3.6.2 Blur Filtering

To help reduce the ambient pixel noise from the video camera source, the intensity image is smoothed using a blur filter. This is a Gaussian kernel separated into a two-pass 1D operation. The kernel is discretised and scaled so that the values form an integer vector g = \{g_i\} and sum to a power of two, so that normalisation can be done by a bit shift instead of an integer divide. See figure 3.5 for an illustration of the process.

First the image is convolved with the 1D Gaussian kernel g and stored in a buffer I_{buffer} whose dimensions are transposed relative to I_G, so that the result of convolving at I_G(x, y) is saved in the buffer at location I_{buffer}(y, x). Finally, I_{buffer} is convolved with g again and the result is stored back in I_G:

I_{buffer} = (I_G * g)^T, \qquad I_G = (I_{buffer} * g)^T \qquad (3.6.4)

Convolving this way means that I_G is convolved along x twice, but because the buffer is transposed after the first pass, the second pass is in fact convolving vertically.

Figure 3.5: Filtering process used to blur the image before generating codes in the code map.

This method can be more efficient on architectures that use memory caching, as there are more memory reads along the kernel components; by transposing the result of the horizontal pass, the vertical pass can read values from the image that are adjacent to each other in memory rather than being w bytes apart between kernel components g_{i-1} and g_i (assuming that the pixel size is one byte in this example). It also means that both passes are simply an integer dot product with contiguous memory at each pixel location followed by a bit shift.
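A sketch of the two-pass scheme is shown below. The power-of-two kernel (1, 4, 6, 4, 1), summing to 16, is a plausible choice consistent with the description but is not necessarily the kernel used in the actual system; the transpose between passes mirrors the cache-friendly layout described above.

import numpy as np

def blur_separable(image, kernel=(1, 4, 6, 4, 1)):
    # Two-pass separable blur with an integer kernel whose weights sum to a
    # power of two, so normalisation could be a bit shift (written here as //).
    k = np.asarray(kernel, dtype=np.int32)
    norm = int(k.sum())                               # 16 for the default kernel
    src = image.astype(np.int32)

    def pass_and_transpose(img):
        # Convolve every row with k ('same' size), normalise, then transpose.
        rows = np.stack([np.convolve(row, k, mode='same') for row in img])
        return (rows // norm).T

    buffer = pass_and_transpose(src)                     # I_buffer = (I_G * g)^T
    return pass_and_transpose(buffer).astype(np.uint8)   # I_G = (I_buffer * g)^T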

3.7 Noise Model

The input to the algorithm is an image from a web camera video stream. However, due to sensor noise and other factors, the pixel values in this image can vary over time even in static scenes where nothing is moving. There is generally always a small amount of sensor noise present from thermal effects on the sensor device. Automatic gain can alter the relative intensities of the colour channels in the image over time and at different scales, and in the process alter the respective noise levels in the image, so it is important to consider these effects.

Let I_{raw}(x) be a matrix of image pixels on the sensor Bayer pattern, where coordinate x maps to a pixel intensity value for either R, G or B depending on its position, such as in the pattern shown in figure 2.9 in chapter 2.


The noise model assumes that the actual value of the pixel I_{raw}(x) is corrupted by an additive sensor noise function \eta(x) \sim \mathcal{N}(0, \sigma_x) and is transformed by a gain function G_\Theta, which is governed by a number of gain parameters \Theta in the camera. This can be written as:

I(x) = GΘ(Iraw(x) + η(x)) (3.7.1)

It is assumed that the sensor noise function η(x) is an additive Gaussian noise model with zero mean and standard deviation σx. The gain function GΘ amplifies the effect of the sensor noise, and may amplify the R, G or B sensor values by different amounts (i.e. the gain factor can be different for each colour sensor).

The noise value can be made constant by transforming I(x) with another function G_\Theta^{-1} (ideally the inverse of the gain function) to cancel out the effects of gain. This yields a constant noise function:

G_\Theta^{-1}(I(x)) \approx I_{raw}(x) + \eta(x) \qquad (3.7.2)

However, in practice the actual form of the gain function G_\Theta may not be available to reverse the effect of the gain. The algorithm assumes that the automatic gain parameters \Theta of the gain function can at least be fixed in some way, such as being able to turn off automatic gain, so that the noise remains constant during the operation of the algorithm. Making this noise value constant is desirable for the labelling scheme described next.

3.8 Code Map

The algorithm used to create a code map for the current frame is based on the Local Binary Pattern (LBP) algorithm [115] but extends the 2-label binary coding scheme to a 3-label scheme. The following section gives a brief overview of the LBP algorithm.

3.8.1 Local Binary Patterns

For each pixel p in an image I, neighbouring pixels are assigned a label based on their intensities relative to the intensity of the current pixel. Labels are either 0 or 1 and represent that a neighbourhood pixel p_n is either less than (p_n < p) or equal to or greater than (p_n \geq p) the intensity value of the current pixel p.


The code map representation is a description of the local neighbourhood intensity values around each pixel. The neighbourhood considered can be arbitrary, but the basic LBP descriptor considers the 8 adjacent pixels surrounding the current pixel (its 8-neighbourhood).

Each of the labels are concatenated to form a binary code describing the pixel’s neighbourhood. Refer back to figure 3.3 to see how a code is generated from a simple 3x3 area.

3.8.2 3 Label Local Binary Patterns (LBP3)

The LBP3 algorithm extends the original LBP algorithm by adding a third label to make the LBP code generation more robust to noise. One of 3 labels can be assigned: {less, similar, greater}. Let L(p, q) be a labelling function that assigns a label based on the intensity values of a reference pixel p and a neighbouring pixel q \in N_p. Labels are assigned as follows:

L(p, q) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau \\ \text{similar} & \text{if } |q - p| \leq \tau \\ \text{less} & \text{if } (q - p) < -\tau \end{cases} \qquad (3.8.1)

As with LTP [160], in the proposed LBP3 algorithm a neighbouring pixel can take one of 3 labels: less, similar, greater. The similar label is assigned if the absolute difference between the neighbour pixel intensity and the current pixel intensity is within a specified threshold. This threshold is a parameter of the algorithm, and adjusting it can reduce the effect of noise at the cost of contrast sensitivity.

However, unlike the LTP algorithm, in LBP3 each label is associated with a 2-bit binary value: less = 01_2, similar = 11_2, greater = 10_2. These values are chosen so that the bitwise difference between two labels expresses how far away from each other they are in label space. Let H(a, b) be the Hamming distance between two binary strings a and b; then H(less, similar) < H(less, greater) and H(similar, less) = H(similar, greater).

This 3 label scheme improves robustness to noise over uniform regions where the differences between pixels are almost identical but offset by a small amount of noise [160]. It does, however, make it less resistant to intensity scaling, where intensity changes by a constant factor, but it is still more robust than taking simple image intensity differences between frames as a difference measure. See figure 3.6 for an illustration of the code map generation for a simple code.
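The 2-bit label values and their Hamming-distance property can be checked directly, as in the sketch below; the similarity threshold used here is an illustrative value, not the one used in the experiments.

LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10   # 2-bit LBP3 label codes

def lbp3_label(p, q, tau=8):
    # Assign one of the three LBP3 labels to neighbour q relative to pixel p.
    d = int(q) - int(p)
    if d > tau:
        return GREATER
    if d < -tau:
        return LESS
    return SIMILAR

def hamming(a, b):
    # Bitwise Hamming distance between two 2-bit labels.
    return bin(a ^ b).count('1')

# The coding is chosen so that label-space distance matches bit distance:
assert hamming(LESS, SIMILAR) == 1 and hamming(SIMILAR, GREATER) == 1
assert hamming(LESS, GREATER) == 2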


Figure 3.6: LBP3 code response to changes in lighting for two image patches cover- ing the same area sampled from different frames in a video. The top image is while the room is under low lighting, the bottom image is of the same area under normal lighting. Each colour in the respective code maps represent a different combination of labels in the binary code and are generated using a simple two-pixel neighbourhood code such as the one depicted in figure 3.8. Large differences in lightness yield almost identical code maps for this image patch.

The example implementation of the LBP3 algorithm considers neighbourhood pixels above and to the right to build the code map. These 2-bit labels are concatenated to form the final code for the current pixel. However, as with the original LBP, any arbitrary neighbourhood can be used. Shown in figure 3.8 is an illustration of a simple LBP3 code built from a two-pixel neighbourhood.

Figure 3.9 shows that the representation is more robust to the subtle global illumination change that simple image differencing would be vulnerable to without updating its difference threshold value.

Of course, there are still label transition problems due to noise, but those are mostly along edges, as the similar label reduces noise in homogeneous areas of intensity. Hysteresis thresholding can be used to address this, as discussed in the following section.

3.8.3 Temporal Hysteresis

When the intensity value of a neighbouring pixel is at or very close to the border between the similar and greater labels, ambient camera sensor noise can cause the pixel to be assigned either the similar or the greater label over time. This creates an unwanted level of variation over the region and will show up as differences in the code map.

Figure 3.7: Illustration of relative intensity fluctuation over time between reference pixel p and neighbourhood pixel q, and its effect on the original 2 label LBP algorithm (top), and in the proposed LBP3 extension (bottom). In the LBP labelling scheme the value q − p fluctuates between the two labels over time. In the LBP3 labelling scheme the fluctuations stay within the similar label region.

An approach to dealing with this is to estimate the amount of background noise in the code map for the region covered by the control. For the button region R, over a set of frames T, find the mean ambient noise in the code map difference, \hat{\mu} = \frac{1}{|T|} \sum_{t \in T} L_t(R), and use this value to offset the activation threshold for the control. This can work reasonably well, but for very noisy regions the required number of different pixels per frame can exceed the remaining pixels in the area of the control, meaning the control cannot be activated.

Figure 3.8: LBP3 code map construction.

Figure 3.9: LBP3 code map difference algorithm. Code maps are constructed for the reference image and compared to the code map of the current frame.

A more robust approach to improving code stability, while still retaining some sensitivity to change, is to apply hysteresis to the code labels assigned at each pixel. Once a label has been assigned, any change in a consecutive frame must exceed the label boundary by a secondary hysteresis threshold \alpha.

This can be implemented efficiently by updating the code label lt using a hysteresis threshold that adapts itself based on the previous label lt−1.

lt = Lh(pt, qt, lt−1) (3.8.2)

where p_t and q_t are the pixel intensities at time t, and L_h(p, q, l_{t-1}) is defined as:

L_h(p, q, l_{t-1} = \text{similar}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{similar} & \text{otherwise} \end{cases} \qquad (3.8.3)

L_h(p, q, l_{t-1} = \text{greater}) = \begin{cases} \text{less} & \text{if } (q - p) < -\tau - \alpha \\ \text{similar} & \text{if } -\tau - \alpha \leq (q - p) < \tau - \alpha \\ \text{greater} & \text{otherwise} \end{cases} \qquad (3.8.4)

L_h(p, q, l_{t-1} = \text{less}) = \begin{cases} \text{greater} & \text{if } (q - p) > \tau + \alpha \\ \text{similar} & \text{if } -\tau + \alpha < (q - p) \leq \tau + \alpha \\ \text{less} & \text{otherwise} \end{cases} \qquad (3.8.5)

Figure 3.10 illustrates how a label is assigned for a pixel q from within the neighbourhood N_p of a reference pixel p. The left graph shows how the label is assigned if a simple threshold value is used. Since the difference (q − p) starts very close to the similarity threshold \tau, the label fluctuates between the similar and greater labels. The graph on the right shows how labels are assigned to the same difference value (q − p) when a secondary hysteresis value \alpha is used, and is much more stable over time.

Figure 3.10: Illustration of label assignments given relative intensity (q − p) over time between neighbourhood pixel q and reference pixel p without temporal hysteresis (left) and with temporal hysteresis enabled (right). Orange lines indicate label boundaries, and the height of the orange rectangular regions represents the hysteresis threshold \alpha. By using a secondary hysteresis threshold, the code label will not change until its value becomes significantly different. This drastically reduces the effect of noise on the difference map.
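A direct transcription of equations 3.8.3-3.8.5 into a single update function is sketched below; the label constants are those of section 3.8.2, and the values of tau and alpha are illustrative.

LESS, SIMILAR, GREATER = 0b01, 0b11, 0b10

def lbp3_label_hysteresis(p, q, prev_label, tau=8, alpha=4):
    # Temporal hysteresis labelling (equations 3.8.3-3.8.5): the label only
    # changes when (q - p) crosses a label boundary by the extra margin alpha.
    d = int(q) - int(p)
    if prev_label == SIMILAR:
        if d > tau + alpha:
            return GREATER
        if d < -tau - alpha:
            return LESS
        return SIMILAR
    if prev_label == GREATER:
        if d < -tau - alpha:
            return LESS
        if d < tau - alpha:
            return SIMILAR
        return GREATER
    # prev_label == LESS
    if d > tau + alpha:
        return GREATER
    if d > -tau + alpha:
        return SIMILAR
    return LESS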

This temporal hysteresis threshold significantly reduces the effects of labelling fluctuations due to noise, particularly on edges in the image where the difference between the reference pixel p and a neighbourhood pixel q is equal to the similarity threshold \tau. In these situations, ambient sensor noise pushes the difference (q − p) between the two label regions and causes false positives in the code difference map.

In section 3.10, figure 3.12 shows results comparing two images from the same video processed with only a single similarity threshold (left) and with a secondary hysteresis threshold (right). The left image shows the ambient background noise caused by label fluctuations along edges in a sample video sequence; in the right image, using a secondary temporal hysteresis threshold significantly reduces these artifacts.


Figure 3.11: Screen captures of two test scenarios for the persistent button algorithm using the LBP3 local binary pattern variant. Left: Table-top game. Right: Human shape game.

3.9 Experiments

The algorithm was qualitatively tested in two gaming scenarios. The first was a table-top game where the player has to move a cursor using their hand to collect items while avoiding obstacles (shown on the left in figure 3.11); the second was a human shape game where the player must move into a shape that activates the required blocks for as long as they can to score points (shown on the right in figure 3.11).

3.9.1 Comparisons

Since the LBP3 algorithm essentially performs a type of background subtraction on a local region, it is compared to two other computationally efficient local patch comparison methods. The methods chosen are a computationally efficient Normalised Cross Correlation (NCC) [100] and another method, Colour Normalised Cross Correlation (CNCC), proposed by Grest et al. [71]. See section 2.4.2 for the details of these two algorithms.

Normalised cross correlation normalises each of the patches being compared by the magnitude of their respective variances. This offers some illumination invariance to the comparison. Colour normalised cross correlation on the other hand, extends the NCC formulation to perform the correlation dot product using a hue and saturation coordinate vector to gain sensitivity to colour while being more robust to differences caused by shadows [71].

Since the image being compared against is a reference image saved during calibration, much of the computation for the correlation can be pre-calculated for efficiency when processing each new video frame [100]. A small neighbourhood size of 5x5 is used when assessing each algorithm.
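For context, a textbook normalised cross correlation between a 5x5 patch of the current frame and the corresponding reference patch looks like the sketch below. This is the generic formulation rather than the specific optimised variants of [100] and [71] described in section 2.4.2; the low-variance guard mirrors the filtering mentioned later for figure 3.14.

import numpy as np

def ncc(patch, ref_patch, eps=1e-6):
    # Normalised cross correlation between two equally sized greyscale patches.
    # Returns a score in [-1, 1]; patches with near-zero variance are rejected
    # (score 0) because their NCC value is unstable.
    a = patch.astype(np.float64).ravel()
    b = ref_patch.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom < eps:
        return 0.0
    return float((a * b).sum() / denom)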

3.9.2 Table-top Game

In the table top game application, the camera is angled downward to point towards a clear space on the floor. A gaming grid is then superimposed on top of the lower half of the video frame. Each cell in the grid represents an independent persistent button control.

The buttons are first calibrated with the floor area clear, to allow the camera to calibrate its internal gain parameters and to model the mean ambient noise in each of the button areas. For each button B_i, a mean noise \mu_i is calculated over several frames and used to offset the activation threshold of the respective button. Once calibration is completed the game begins.

To play the game, the user sits in front of the virtual gaming board facing the camera and must place their finger into the desired gaming square. Since the user’s hands and arms will also activate all buttons up until their desired button, the ’active’ button is considered to be the active button that is furthest away from the user (closest to the bottom of the screen).

From a position along a random side of the gaming board, green and red blocks start moving across to the opposite side from their starting position. The user must collect as many green blocks as they can without colliding with any red blocks.

3.9.3 Human Shape Game

The human shape game is played by angling the camera towards the opposite wall so that most of the user’s legs, upper body and arms are visible. As with the table top game, the buttons are calibrated by the user leaving the view of the camera for a short time while mean noise levels are determined for each button in the grid.

Once the game begins, the user must activate only the indicated blocks to score points within a fixed time. If not all blocks are covered, then no points are gained. If the wrong blocks are activated, points are lost. After a short while, a new pattern is displayed and the user must activate a different set of blocks.


3.10 Results

All three algorithms perform well, but with different limitations and advantages. Videos of subjects playing the two games were recorded and then analysed by each of the 3 algorithms.

Figure 3.12 shows the effect of applying temporal hysteresis to the code maps. This significantly reduces the effect of noise from codes generated by pixels that oscillate on the border of the similarity threshold and the less and greater labels.

To quantify the effect of the hysteresis threshold on the LBP3 algorithm, the algorithm was run on a test sequence with ground truth, first with the basic similarity threshold labelling scheme and then with the secondary hysteresis threshold. The first 65 frames of the sequence are a view of a static room in which no movement is occurring; after that the player comes into view of the camera. The graph in figure 3.13(a) shows how the False Positive Rate evolves over time with the two methods, and clearly shows that by using a secondary temporal hysteresis threshold the LBP3 labelling scheme is considerably more stable.

The false positives that are registered after frame 65 occur when the player comes into view of the camera and are due to a small reflective surface in the background of the room, combined with some minor dilation artifacts due to the coverage of the descriptor neighbourhood (see figure 3.13(b)). This causes a small number of changes near the border of the ground truth area in the current frame, but these are not due to labelling noise.

Figure 3.12: Using a secondary hysteresis threshold significantly reduces the effect of noise on the difference map. Left: Frame without temporal hysteresis. Right: The same frame processed with temporal hysteresis.

Figure 3.13: (a) False Positive Rate (FPR) of LBP3 over a video sequence with and without the use of a secondary hysteresis threshold. After the first 65 frames the player comes into view of the camera. It can be clearly seen that without the hysteresis threshold there are many false positive detections. When hysteresis is applied, the false positives are reduced to zero in the static part of the scene. (b) False positives after frame 65 are due to a reflective surface in the background and minor dilation artifacts slightly offset from the actual change due to the LBP3 descriptor coverage (red = false positives, green = true positives, grey = ground truth).

Of the three methods, the CNCC algorithm is the most robust to many situations where shadows would otherwise be detected in the difference map. See figure 3.14 for the segmentations achieved by each algorithm. The CNCC algorithm is more robust to shadows only when the underlying surface has colour; in situations where the surface is generally grey it performs no better than the NCC algorithm. This is to be expected: as the colour of the pixels moves towards grey values, the (h, s) hue-saturation component is reduced to zero and the algorithm operates purely on the lightness L component of the (h, s, L) colour model.

3.11 Discussion and Future Work

There are two areas of improvement that would be interesting to explore for future versions of the LBP3 algorithm.

Figure 3.14: Results of the algorithms applied to the table-top game scenario. The game display overlay has been omitted for clarity of results. The CNCC method performs best on this frame due to the red-brown hue of the table cloth aiding the shadow removal in the comparison. If the cloth were grey, then CNCC would perform with similar results to NCC. The extra saturation in the NCC table result marks areas whose variance is too low, making their NCC score unstable; these are detected and filtered out when generating the NCC difference map.

First, it is not as robust to shadows as methods such as CNCC. This could be addressed by performing the LBP3 comparison measure using more shadow invariant colour representations, such as the (h, s, L) colour space used by the CNCC algorithm. The work by Yeffet and Wolf [176] uses SSD to compare local patches between different frames to detect motion when constructing their LBP inspired binary pattern representation, and it would be interesting to see how incorporating shadow invariant local patch comparisons such as the CNCC correlation method into the spatial coding of LBP3 improves robustness to shadows (i.e. comparing patches instead of single pixels from the processed image), though this would increase the computational complexity, slow down the algorithm and risk losing its real-time performance.

Second, since the LBP3 algorithm (as well as the NCC and CNCC methods tested here) holds a background reference image to compare new frames against, if the camera drifts slowly over time, as illustrated in figure 3.15, differences will be registered in the difference map the more the frame drifts out of alignment with the reference image. Applying basic image stabilising methods to the image to correct for gradual or sudden drift should address this problem.

Figure 3.15: Camera drift over the course of many frames on LBP3 difference maps. Left to right: The camera very slowly tilts upward on its stand due to not being placed securely, and causes gradual constant differences in the difference map.

CHAPTER 4

Detection and Segmentation

Object detection and segmentation are important problems of computer vision and have numerous commercial applications such as pedestrian detection, surveillance and gesture recognition. Image segmentation has been an extremely active area of research in recent years [24, 27, 58, 88, 92, 135]. In particular segmentation of the face is of great interest due to such applications as Windows Messenger [36, 158].

One such recent application used in a computer game context is that of using face detection to control the steering of a virtual character in a racing game called AntiGrav. To control the game, the player must move their head to different areas of the screen to steer left, right, up or down. Although the simple face detection employed by the game failed at times, it demonstrated that novel interactions like this can be realised by using computer vision algorithms in a creative way.

Taking inspiration from this idea, it is easy to see the potential of being able to place a player's face on their own digital avatar, particularly in a multi-player computer game. Using an off-the-shelf face detection algorithm such as the one presented by [170], and coupling it with segmentation techniques, it is possible to segment the face of the player so that it can be added to a 3D avatar. This chapter proposes a novel use of a detection and segmentation algorithm to segment a face in real-time.

Until recently the only reliable method for performing segmentation in real-time was blue screening. This method imposes strict restrictions on the input data and can only be used for certain specific applications. Recently Kolmogorov et al. [90] proposed a robust method for extracting foreground and background layers of a scene from a stereo image pair. Their system ran in real-time and used two carefully calibrated cameras for performing segmentation. These cameras were used to obtain disparity information about the scene which was later used in segmenting the scene into foreground and background layers. Although they obtained excellent segmentation results, the need for two calibrated cameras was a drawback of their system.

Figure 4.1: Real-Time Face Segmentation using face detection. The first image on the first row shows the original image. The second image shows the face detection results. The image on the second row shows the segmentation obtained by using shape priors generated using the detection and localisation results.

In the previous chapter (§3.4), a low-level method of segmentation via background subtraction using a fast binary descriptor was presented and compared to other fast background subtraction methods. It showed that image segmentation can be tackled efficiently in this way, though the remaining issues with camera drift, which need to be addressed by other algorithms, make such methods less attractive when reliable segmentations are required.

This section presents a framework that exploits the result of a state-of-the-art face detection algorithm [170] to provide an initialisation for the position and scale of a simple shape prior, which can then be incorporated into a Markov Random Field energy function that can be minimised by the dynamic graph cut algorithm [88].

4.1 Shape priors for Segmentation

An orthogonal approach to background subtraction for solving the segmentation problem robustly has been the use of prior knowledge about the object to be segmented. In recent years a number of papers have successfully tried to couple MRFs used for modelling the image segmentation problem with information about the nature and shape of the object to be segmented [27, 58, 76, 92]. The primary challenge in these systems is that of ascertaining what would be a good choice for a prior on the shape. This is because the shape (and pose) of objects in the real world vary with time. To obtain a good shape prior then, there is a need to localise the object in the image and also infer its pose, both of which are extremely difficult problems in themselves.

Kumar et al. [92] proposed a solution to these problems by matching a set of exemplars for different parts of the object onto the image. Using these matches they generate a shape model for the object. They model the segmentation problem by combining MRFs with layered pictorial structures (LPS), which provide them with a realistic shape prior described by a set of latent shape parameters. A lot of effort has to be spent to learn the exemplars for different parts of the LPS model.

In their work on simultaneous segmentation and 3D pose estimation of humans, Bray et al. [27] proposed the use of a simple 3D stick-man model as a shape prior. Instead of matching exemplars for individual parts of the object, their method followed an iterative algorithm for pose inference and segmentation whose aim was to find the pose corresponding to the human segmentation having the maximum probability (or least energy). Their iterative algorithm was made efficient using the dynamic graph cut algorithm [88]. Their work carried the important message that rough shape priors were sufficient to obtain accurate segmentation results; this observation will be exploited in our work to obtain an accurate segmentation of the face.

4.2 Coupling Face Detection and Segmentation

In the methods described above the computational problem is that of localising the object in the image and inferring its pose. Once a rough estimate of the object pose is obtained, the segmentation can be computed extremely efficiently using graph cuts [24, 25, 70, 88, 91]. In this section we show how an off-the-shelf face detector such as the one described in [170] can be coupled with graph cut based segmentation to give accurate segmentation and improved face detection results in real-time.

The key idea of the framework proposed in the following sections is that face localisation estimates in an image (obtained from any generic face detector) can be used to generate a rough shape energy. These energies can then be incorporated into a discriminative MRF framework to obtain robust and accurate face segmentation results, as shown in Figure 4.1. This method is an example of the OBJCUT paradigm for an unarticulated object. We define an uncertainty measure corresponding to each face detection which is based on the energy associated with the face segmentation. It is shown how this uncertainty measure might be used to filter out false face detections, thus improving the face detection accuracy.

The algorithm proposed in this section is a method for face segmentation which works by coupling the problems of face detection and segmentation in a single framework. This method is efficient and runs in real-time. The key novelties of the algorithm include:

1. A framework for coupling face detection and segmentation problems together.


2. A method for generating rough shape energies from face detection results.

3. An uncertainty measure for face segmentation results which can be used to identify and prune false detections.

In the next section, we briefly discuss the methods for robust face detection and image segmentation. In section 4.3, we describe how a rough shape energy can be generated using localisation results obtained from any face detection algorithm. The procedure for integration of this shape energy in the segmentation framework is given in the same section along with details of the uncertainty measure associated with each face segmentation. The simple shape prior is then extended to an upper body model in section 4.7. We conclude by listing some ideas for future work in section 4.8.

4.3 Preliminaries

In this section we give a brief description of the methods used for face detection and image segmentation.

4.3.1 Face Detection and Localisation

Given an image, the aim of a face detection system is to detect the presence of all human faces in the image and to give rough estimates of the positions of all such detected faces. In this proposed framework we use the face detection method proposed by Viola and Jones [170]. This method is extremely efficient and has been shown to give good detection accuracy. A brief description of the algorithm is given next.

The Viola-Jones face detector works on features which are similar to Haar filters. The computation of these features is done at multiple scales and is made efficient by using an image representation called the integral image [170]. After these features have been extracted, the algorithm constructs a set of classifiers using AdaBoost [61]. Once constructed, successively more complex classifiers are combined in a cascade structure. This dramatically increases the speed of the detector by focussing attention on promising regions of the image 1. The output of the face detector is a set of rectangular windows in the image where a face has been detected. We will assume that each detection window W_i is parameterised by a vector \theta_i = \{c_i^x, c_i^y, w_i, h_i\}, where (c_i^x, c_i^y) is the centre of the detection window and w_i and h_i are its width and height respectively.

1A system has been developed which uses a single camera and runs in real-time.
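As a practical note, a Viola-Jones style cascade is available in OpenCV, and its rectangular detections map directly onto the (c^x, c^y, w, h) parameterisation used here. The sketch below uses the stock OpenCV cascade file and default-style parameters, which are not necessarily those of the original system.

import cv2

def detect_faces(frame_bgr):
    # Return face detections as (cx, cy, w, h) tuples using OpenCV's
    # Haar cascade frontal face detector (a Viola-Jones style classifier).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    rects = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    # Convert each rectangle (x, y, w, h) into the window parameterisation theta_i.
    return [(x + w / 2.0, y + h / 2.0, w, h) for (x, y, w, h) in rects]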


4.3.2 Image Segmentation

Given a vector y = {y1, y2, ··· , yn} where each yi represents the colour of the pixel i of an image having n pixels, the image segmentation problem is to find the value of the vector x = {x1, x2, ··· , xn} where each xi represents the label which the pixel i is assigned. Each xi takes values from the label set L = {l1, l2, ··· , lm}. Here the label set L consists of only two labels i.e. ‘face’ and ‘background’. The posterior probability for x given y can be written as:

Pr(x|y) = \frac{Pr(y|x) Pr(x)}{Pr(y)} \propto Pr(y|x) Pr(x) \qquad (4.3.1)

We define the energy E(x) of a labelling x as:

E(x) = -\log Pr(x|y) = -\log Pr(y|x) - \log Pr(x) + \text{constant} = \phi(x, y) + \psi(x) + \text{constant} \qquad (4.3.2)

where \phi(x, y) = -\log Pr(y|x) and \psi(x) = -\log Pr(x). Given an energy function E(x), the most probable or maximum a posteriori (MAP) segmentation solution x^* can be found as the segmentation solution x that minimises E(x):

x^* = \arg\min_x E(x) \qquad (4.3.3)

It is typical to formulate the segmentation problem in terms of a Discriminative Markov Random Field [93]. In this framework the likelihood \phi(x, y) and prior terms \psi(x) of the energy function can be decomposed into unary and pairwise potential functions. In particular this is the contrast dependent MRF [24, 92] with energy:

E(x) = \sum_i \big(\phi(x_i, y) + \psi(x_i)\big) + \sum_{(i, j) \in N} \big(\phi(x_i, x_j, y) + \psi(x_i, x_j)\big) + \text{const} \qquad (4.3.4)

where N is the neighbourhood system defining the MRF. Typically a 4 or 8 neighbourhood system is used for image segmentation, which implies each pixel is connected with 4 or 8 pixels in the graphical model respectively.

4.3.3 Colour and Contrast based Segmentation

The unary likelihood terms \phi(x_i, y) of the energy function are computed using the colour distributions for the different segments in the image [24, 92]. For our experiments we built the colour appearance models for the face/background using the pixels lying inside/outside the detection window obtained from the face detector. The pairwise likelihood term \phi(x_i, x_j, y) of the energy function is called the contrast term and is discontinuity preserving in the sense that it encourages pixels having dissimilar colours to take different labels (see [24, 92] for more details). This term takes the form:

\phi(x_i, x_j, y) = \begin{cases} \gamma(i, j) & \text{if } x_i \neq x_j \\ 0 & \text{if } x_i = x_j \end{cases} \qquad (4.3.5)

where \gamma(i, j) = \exp\left(\frac{-g(i, j)^2}{2\sigma^2}\right) \cdot \frac{1}{\text{dist}(i, j)}. Here g(i, j) = \|I_i - I_j\|_2 measures the difference between the RGB pixel values I_i and I_j respectively, and \text{dist}(i, j) gives the spatial distance between i and j. Other colour spaces could be used (such as (L^*, u^*, v^*)), but as this would add an additional computational cost to transform the RGB values to the new colour space, this simple distance measure was preferred.

The pairwise prior terms \psi(x_i, x_j) are defined in terms of a generalized Potts model as:

\psi(x_i, x_j) = \begin{cases} K_{ij} & \text{if } x_i \neq x_j \\ 0 & \text{if } x_i = x_j \end{cases} \qquad (4.3.6)

This encourages neighbouring pixels in the image 2 to take the same label, thus resulting in smoothness in the segmentation solution. In most methods, the value of the unary prior term \psi(x_i) is fixed to a constant. This is equivalent to assuming a uniform prior and does not affect the solution. In the next section we will show how a shape prior derived from a face detection result can be incorporated in the image segmentation framework.
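As an illustration of the contrast term in equation 4.3.5, the weights for the horizontally adjacent pixel pairs of a 4-connected grid can be computed in one vectorised pass, as in the sketch below; the value of sigma is illustrative, and the vertical weights are obtained the same way along the other axis.

import numpy as np

def horizontal_contrast_weights(image_rgb, sigma=10.0):
    # gamma(i, j) for each horizontally adjacent pixel pair, where dist(i, j) = 1.
    # Returns an H x (W-1) array: large for similar colours, small across edges.
    img = image_rgb.astype(np.float64)
    diff = img[:, 1:, :] - img[:, :-1, :]
    g2 = (diff ** 2).sum(axis=2)                 # squared RGB distance g(i, j)^2
    return np.exp(-g2 / (2.0 * sigma ** 2))      # divided by dist(i, j) = 1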

4.4 Integrating Face Detection and Segmentation

Having given a brief overview of image segmentation and face detection methods, we now show how we couple these two methods in a single framework. Following the OBJCUT paradigm, we start by describing the face energy and then show how it is incorporated in the MRF framework.

2Pixels i and j are neighbours if (i, j) ∈ N


Figure 4.2: Generating the face shape energy. The figure shows how a localisation result from the face detection stage (left) is used to define a rough shape energy for the face.

4.4.1 The Face Shape Energy

In their work on segmentation and 3D pose estimation of humans, Bray et al. [27] show that rough and simple shape energies are adequate to obtain accurate segmentation results. Following their example we use a simple elliptical model for the shape energy of a human face. The model is parameterised in terms of four parameters: the ellipse centre coordinates (c_x, c_y), the semi-minor axis a and the semi-major axis b (assuming a < b). The values of these parameters are computed from the parameters \theta_k = \{c_k^x, c_k^y, w_k, h_k\} of the detection window k obtained from the face detector as: c_x = c_k^x, c_y = c_k^y, a = w_k/\alpha and b = h_k/\beta. The values of \alpha and \beta used in our experiments were set to 2.5 and 2.0 respectively; however, these can be computed iteratively in a manner similar to [27]. A detection window and the corresponding shape prior are shown in figure 4.2.

4.5 Incorporating the Shape Energy

For each face detection k, we create a shape energy \Theta_k as described above. This energy is integrated in the MRF framework described in section 4.3.2 using the unary terms \psi(x_i) as:

ψ(xi) = λ(xi, Θk) = −log p(xi, Θk) (4.5.1)

Where we define p(xi, Θk) as:

p(x_i = \text{face} \,|\, \Theta_k) = \frac{1}{1 + \exp\left(\mu \cdot \left(\frac{(cx_i - c_x^k)^2}{(a^k)^2} + \frac{(cy_i - c_y^k)^2}{(b^k)^2} - 1\right)\right)} \qquad (4.5.2)


and:

p(xi = background|Θk) = 1 − p(xi = face|Θk) (4.5.3)

where cx_i and cy_i are the x and y coordinates of the pixel i; c_x^k, c_y^k, a^k = w_k/\alpha and b^k = h_k/\beta are the parameters of the shape energy \Theta_k; and the parameter \mu determines how the strength of the shape energy term varies with the distance from the ellipse boundary. The different terms of the energy function and the corresponding segmentation for a particular image are shown in figure 4.3.
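The shape energy of equation 4.5.2 can be rendered as a per-pixel probability map with a few lines of NumPy. The sketch below assumes the squared elliptical form given above, with illustrative values for alpha, beta and mu.

import numpy as np

def face_shape_prior(h, w, cx, cy, win_w, win_h, alpha=2.5, beta=2.0, mu=5.0):
    # p(x_i = face | Theta_k) over an h x w image for one detection window.
    # (cx, cy, win_w, win_h) is the detection window; the ellipse axes are
    # a = win_w / alpha and b = win_h / beta as in section 4.4.1.
    a, b = win_w / alpha, win_h / beta
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    ellipse = ((xs - cx) ** 2) / (a ** 2) + ((ys - cy) ** 2) / (b ** 2) - 1.0
    p_face = 1.0 / (1.0 + np.exp(mu * ellipse))   # close to 1 inside, 0 outside
    return p_face                                  # p(background) = 1 - p_face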

Figure 4.3: Different terms of the shape-prior + MRF energy function. The figure shows the different terms of the energy function for a particular face detection and the corresponding image segmentation obtained.

Once the energy function E(x) has been formulated, the most probable segmentation solution x^* defined in equation 4.3.3 can be found by computing the solution of the max-flow problem over the energy equivalent graph [91]. The complexity of the max-flow algorithm increases with the number of variables involved in the energy function. Recall that the number of random variables is equal to the number of pixels in the image to be segmented. Even for a moderate sized image the number of pixels is in the range of 10^5 to 10^6. This makes the max-flow computation quite time consuming. To overcome this problem we only consider pixels which lie in a window W_k whose dimensions are double those of the original detection window obtained from the face detector. As pixels outside this window are unlikely to belong to the face (due to the shape term \psi(x_i)), we set them to the background. The energy function for each face detection k now becomes:

E_k(x) = \sum_{i \in W_k} \big(\phi(x_i, y) + \psi(x_i, \Theta_k)\big) + \sum_{j \in W_k, (i, j) \in N} \big(\phi(x_i, x_j, y) + \psi(x_i, x_j)\big) + \text{const} \qquad (4.5.4)

This energy is then minimised using graph cuts to find the face segmentation x_k^* for each detection k.

Figure 4.4: This figure shows an image from the INRIA pedestrian data set. After running our algorithm, we obtain four face segmentations, one of which (the one bounded by a black square) is a false detection. The energy-per-pixel values obtained for the true detections were 74, 82 and 83, while that for the false detection was 87. As can be seen, the energy of the false detection is higher than that of the true detections, and this can be used to detect and remove it.

4.5.1 Pruning False Detections

The energy E(x′) of any segmentation solution x′ is the negative log of its probability, and can be viewed as a measure of how uncertain that solution is. The higher the energy of a segmentation, the lower the probability that it is a good segmentation. Intuitively, if the face detection given by the detector is correct, then the resulting segmentation obtained from our method should have high probability and hence low energy compared to the case of a false detection (as can be seen in figure 4.4).

This characteristic of the energy of the segmentation solution can be used to prune false face detections. This method was also explored by Ramanan [128] for improving the results of human detection. Alternatively, if the number of people P in the scene is known, then we can choose the top P detections according to the segmentation energy.
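In practice this pruning step is just a sort or a threshold on the energy-per-pixel scores. A minimal sketch is given below; the threshold value is purely illustrative and would be chosen empirically.

def prune_detections(detections, known_count=None, energy_per_pixel_threshold=85.0):
    # detections: list of (detection, energy, num_pixels) tuples.
    # If the number of people P is known, keep the P lowest-energy detections;
    # otherwise discard detections whose energy per pixel is too high.
    scored = sorted(detections, key=lambda d: d[1] / d[2])
    if known_count is not None:
        return scored[:known_count]
    return [d for d in scored if d[1] / d[2] < energy_per_pixel_threshold]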


Figure 4.5: Some face detection and segmentation results obtained from our algorithm.

4.6 Implementation and Experimental Results

We tested our algorithm on a number of images containing faces. Some detection and segmentation results are shown in figure 4.5. The time taken for segmenting out the faces is on the order of tens of milliseconds. We also implemented a real-time system for frontal face detection and segmentation. The system is capable of running at roughly 15 frames per second on images of 320×240 resolution.

4.6.1 Handling Noisy Images

The contrast term of the energy function can become unreliable in noisy images. To avoid this, we smooth the image before computing this term. The results of this procedure are shown in figure 4.6.

4.7 Extending the Shape Model to Upper Body

The simple ellipse model used for face detection and segmentation in the previous section can be extended to detect and segment the upper body of a human. This localisation can improve the speed of the parameter search in PoseCut [27], as the position and approximate scale of the human are already known from the detector stage of the algorithm.

Figure 4.7 shows an illustration of the model. This upper body model expresses


Figure 4.6: Effect of smoothing on the contrast term and the final segmentation. The images on the first row correspond to the original noisy image. The images on the second row are obtained after smoothing the image.

simple articulation of the neck length and angle, and the shape of the upper torso. The model has 6 parameters which encode the x and y location of the two shoulders and the length and angle of the neck.

Using the PoseCut approach, the parameters of the model can be optimised to find

the lowest energy Ek(x, Θk) given the current model parameters Θk that now represent the 6 parameters of the upper body model.

The energy cost function Ek(x, Θk) for detection k is:

E_k(x, \Theta_k) = \sum_{i \in W_k} \Big( \phi(x_i, y) + \psi(x_i, \Theta_k) + \sum_{j \in W_k,\, (i,j) \in \mathcal{N}} \big( \phi(x_i, x_j, y) + \psi(x_i, x_j) \big) \Big) + \text{const}   (4.7.1)

The goal is then to find the optimal model parameters Θ∗_k that yield the lowest segmentation energy, min_x E_k(x, Θ_k):

\Theta_k^* = \arg\min_{\Theta_k} \min_x E_k(x, \Theta_k)   (4.7.2)

As with the model used in face segmentation, a window Wk is defined around detection k with a size proportional to the detection window scale. The model is initialised to a frontal configuration, as shown in subfigure (a) of figure 4.7, and the optimal pa-


Figure 4.7: Upper Body Model. (a) The model parameterised by 6 parameters encod- ing the x and y location of the two shoulders and the length and angle of the neck. (b) The shape prior generated using the model. Pixels more likely to belong to the foreground/background are green/red. (c) and (d) The model rendered in two poses.

rameters are determined by finding the solution to equation 4.7.2. In the upper body model experiments Powell minimisation [127] is used.

Figure 4.8 shows iterations of several starting positions optimising over 2 of the 6 parameters of the upper body model shown in figure 4.7. It can be seen that although the energy surface has multiple local minima, they are very close and have very similar energy. The experiments showed that the Powell minimisation algorithm was able to converge to almost the same point for different initialisations (see bottom image in figure 4.8).
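Since Powell's method only requires function evaluations, the optimisation can be sketched with a generic minimiser. The example below uses SciPy's Powell implementation purely as an illustration; the energy function, which evaluates min_x E_k(x, Θ_k) via a graph cut, is assumed to be supplied as segmentation_energy:

    import numpy as np
    from scipy.optimize import minimize

    def fit_upper_body(initial_params, segmentation_energy):
        # initial_params: the 6 upper-body parameters of the frontal configuration.
        # segmentation_energy(theta) should return min_x E_k(x, theta).
        result = minimize(segmentation_energy, np.asarray(initial_params, float),
                          method='Powell')
        return result.x, result.fun   # optimal parameters and their segmentation energy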

The upper body model was tested on video sequences from the Microsoft Research bilayer video segmentation dataset [90]. Some results from processing one of the sequences can be seen in figure 4.9.

4.8 Discussion and Future Work

In the previous sections a method that combines face detection and segmentation into a single framework has been presented. The method runs in real-time, gives accurate segmentation, and shows how the segmentation energy can be used to help prune out false positives in certain situations.

When the upper body model was applied to video sequences, the time taken to find the optimal parameters Θ∗_k for a given detection window was too slow for processing video streams in real-time. This means that it would also be unviable to use a fully articulated model for determining pose, such as the problem discussed in chapter 6.

For video sequences, knowledge of previous good detections in addition to tempo-


Figure 4.8: Optimising the upper body parameters. (top) The values of minx Ek(x, Θk) obtained by varying the rotation and length parameters of the neck. (bottom) The image shows five runs of the Powell minimisation algorithm which are started from different initial solutions. The runs converge on two solutions which are very close and have almost the same energy.

ral smoothing can be applied to avoid detections that oscillate between adjacent discrete detection scales when the true scale of the human is part way between them. Other temporal information, such as initialising the model with the optimal parameters from the previous frame Θ∗_k[t − 1], could take the performance of an upper body model a step closer to real-time speed.


Figure 4.9: Segmentation results using the 2D upper body model. The first column shows the original images, the second column shows the segmentations obtained from the initial model parameters for the upper body model, and the third column shows segmentations using the optimal parameters. The segmentation energies for the initial and optimal pose parameters are also shown.

CHAPTER 5

Human Detection

Within a web camera based computer game setting, where a player is typically standing in front of a camera positioned on top of their TV, such as in the Sony EyeToy games, efficiently localising exactly where in the video frame the player currently is can make more advanced algorithms more robust or even faster to compute. For instance, if an algorithm fails or produces a weak response, then extra information that localises the player can reduce the search area needed to re-initialise a more complex algorithm.

An example game that illustrates this problem is Sony's EyeToy Kinetic and Kinetic Combat. The player is asked to perform a series of tasks, such as hitting a block while avoiding others, but how the player achieves this is not actually measured, i.e. no estimate of pose is determined during the course of the game. Instead the game detects collisions with areas of movement (e.g. via pixel differencing) on screen.

If, for instance, a player is asked to perform a movement in a particular way in interactive exercise games such as these, it would be useful to assess how well the player executes that movement. To do this, the pose of the player or other more detailed information needs to be determined.

Pose estimation algorithms can be employed to attempt to solve this; however, good localisation of the player allows complex algorithms to be applied only to areas of interest.

This chapter first discusses the basic requirements of a fast detection algorithm (not necessarily real-time), presents some existing feature descriptors and learning algorithms, and then proposes a new approach to chamfer matching for object detection that can learn the weights required for human detection. Finally, some experiments are presented that compare the classification performance of the different algorithms discussed by testing them on human and upper body datasets.


5.1 Suitable Algorithms

The localisation task can be thought of as a binary class problem in which a position in an image can be labelled as either background or object.

A good localisation method must be computationally efficient so that it can be used in conjunction with other computer vision algorithms on top of any computer game media being displayed, and it should be able to cope with partial occlusions and varied subject size and clothing. Once the location of a player is known, or at least reduced to a smaller number of possible locations, then other more computationally intensive algorithms can be applied.

One such algorithm that has these properties is chamfer matching [11, 20, 154]. Another algorithm, called Histograms of Oriented Gradients (or HOG) [40], is considered state-of-the-art for human detection, though even efficient implementations are not quite able to run at interactive frame rates [182]. This idea has also been extended to considering the responses of different part detectors, such as in [23], but this algorithm is slower in comparison due to the complexity added by the separate classifiers, so for detection the simpler single classifier algorithm is used in the experiments presented in this chapter.

5.2 Features

Simply passing the raw image data to a learning algorithm such as an SVM does not achieve good results due to variability in brightness and appearance. Typically, images are processed not only to reduce them to a more manageable size, but also to reduce the image data to the information that remains useful for detecting an object, e.g. edge maps or histograms of oriented gradients (HOG).

There are numerous features used in human detection: silhouettes [2], chamfer [65], edges [44], HOG descriptors [40, 140, 183], Haar filters [171], motion and appearance patches [18], edgelet features [175], shapelet features [136] and SIFT [102]. The features selected for use here are edge gradient based descriptors, due to their success in human detection [40] and their ability to be made computationally efficient while retaining good accuracy [162, 163, 183]. Other features exist, but the ones used in this chapter were selected for their popularity and success in state-of-the-art algorithms.


5.2.1 Histogram of Oriented Gradients

The principle behind the Histogram of Oriented Gradients (HOG) introduced by Dalal and Triggs [40] is quite simple. In contrast to sparse descriptors such as SIFT that are positioned at extrema found in scale-space [102] or at distinct repeatable interest points [13], the HOG detector proposed in [40] uses a dense array of overlapping histograms across a sample window. See figure 5.11 for an illustration of how the descriptors are arranged within the detection window.

The image is first processed to extract gradient orientations and magnitudes at each pixel using an edge operator such as Sobel or Canny. Where colour information is available, the maximum gradient over the colour channels is used as the gradient for that pixel.

Each HOG descriptor block is split into a grid of c_w × c_h cells of n_w × n_h pixels (typically 2 × 2 and 8 × 8 respectively [40]). Before constructing the descriptor, the gradients within the descriptor region are weighted by a Gaussian of size σ = 0.5 · BlockSize to down-weight gradient contributions from the pixels at the very edge of the descriptor.

For each cell within the block, a histogram is constructed over θ orientations by summing the magnitudes of each of the gradients within the cell into the histogram bin associated with their orientations. Gradients are bilinearly interpolated with adjacent orientation bins to avoid boundary effects within the orientation histogram. The gradient is also weighted spatially by contributing to histograms in neighbouring cells depending on how close the current location is to the other cells.

The histograms for each of the block cells are concatenated together h = {h_i} and then normalised to a unit vector to form the final HOG descriptor representation. Each of the HOG descriptor blocks is normalised based on the 'energy' of the histograms contained within it. Dalal and Triggs [40] discuss different methods of contrast normalisation, and the method that appeared to improve results most in the original HOG implementation is the clipped norm method originally proposed by Lowe [102], referred to as L2-hys by Dalal and Triggs [40]. This method first normalises the descriptor using the L2-norm, h \leftarrow h / \sqrt{\|h\|_2^2 + \epsilon^2}, where \|h\|_2^2 = h_1^2 + h_2^2 + \cdots + h_n^2. Then the components h_i \in h are thresholded so that none of the values h_i are greater than 0.2 (this constant was determined empirically in [102]), and the descriptor is normalised again using the L2-norm. Since the descriptor is applied densely, even over uniform patches where no edge gradients exist, a small constant value \epsilon is used to avoid division by zero when \|h\|_2 = 0. See figure 5.1 for a diagram showing the key steps in the


Figure 5.1: Diagram of HOG descriptor construction. Gradients for the whole image are extracted, then over a given descriptor block B, for each position B(x, y) within the block the gradient magnitudes are weighted by a Gaussian |∇B(x, y)| · G(σ, x, y) where σ = 0.5 ∗ blockSize, and added to the histogram belonging to the cell that covers (x, y). The gradient also contributes to adjacent cells proportionally to how close the pixel is to the respective cell centre.

algorithm.
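For reference, the L2-hys normalisation step can be written in a few lines; the sketch below (NumPy, with an illustrative value of ε) follows the description above:

    import numpy as np

    def l2_hys(h, clip=0.2, eps=1e-3):
        h = h / np.sqrt(np.sum(h ** 2) + eps ** 2)      # first L2 normalisation
        h = np.minimum(h, clip)                          # clip components at 0.2
        return h / np.sqrt(np.sum(h ** 2) + eps ** 2)    # renormalise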

Dalal and Triggs [40] explore different configurations of this descriptor; blocks of 2 × 2 cells of 8 × 8 pixels seem to give good results on the INRIA database while keeping the dimensionality low. The best result was reported for HOG blocks with a 3 × 3 arrangement of 6 × 6 pixel cells and 9 orientation bins, but this has a much higher (nearly double) dimensionality than the 2 × 2 cell, 8 × 8 pixel arrangement, with only a small improvement in accuracy [40].

Descriptor Variations

There are a few variations of the HOG descriptor discussed in Dalal and Triggs [40]. The two main categories are rectangular and circular HOG descriptors (R-HOG and C-HOG respectively), and the paper explores their performance compared to SIFT descriptors [102], shape context descriptors [110] (simulated using C-HOG descriptors) and generalised Haar wavelet based descriptors [170].

Dalal and Triggs [40] demonstrated that HOG based descriptors outperformed each of the other descriptor variations on two different person databases: the MIT pedestrian database, and a more challenging INRIA database created to test the HOG descriptor. R-HOG descriptors generally performed best on human detection.


Figure 5.2: The sum over the area defined by region R can be found by sampling 4 points from the integral image (A, B, C and D) to give: sum(R) = ii(D) − ii(B) − ii(C) + ii(A).

Integral HOG Features

The HOG features used in the experiments are a slight variant of the original HOG descriptor called Integral HOG [182]. This descriptor uses integral histograms to speed up the HOG descriptor computation time, while retaining similar accuracy to the original descriptor [182].

An integral image [170] (also known as a summed area table [37]) is a method of efficiently finding the sum of values over a rectangular area. The value at a given point in the integral image is defined as [170]:

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')   (5.2.1)

Where ii(x, y) is the integral image and i(x, y) is the original image. This table can be calculated efficiently in a single pass over an image by using the following pair of equations [170]:

s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)   (5.2.2)

Where s(x,y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0.

Figure 5.2 shows how an integral image is used to calculate the sum over a given area. In the integral image, the value at point (x, y) represents the sum of all the values in the original image to the left of and above that position (as defined in eq. 5.2.1). The sum over any arbitrary rectangular region R can be calculated by sampling four points (A, B, C and D in figure 5.2) in the integral image, and using the following formula:


sum(R) = ii(D) − ii(B) − ii(C) + ii(A) (5.2.3)

The value at ii(A) is added to compensate for the fact that, by subtracting ii(B) and ii(C) from ii(D), the region covered by ii(A) has been subtracted twice, since it is included in both ii(B) and ii(C). Using this method the sum over any arbitrary region can be calculated with only 4 samples from the integral image representation.
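A minimal sketch of the integral image and the 4-sample area sum is given below (NumPy; border handling for regions touching the top or left edge is omitted for brevity):

    import numpy as np

    def integral_image(img):
        # ii(x, y): sum of all pixels above and to the left, inclusive (eq. 5.2.1).
        return img.cumsum(axis=0).cumsum(axis=1)

    def box_sum(ii, top, left, bottom, right):
        # Sum over rows top..bottom and columns left..right using eq. 5.2.3.
        A = ii[top - 1, left - 1]
        B = ii[top - 1, right]
        C = ii[bottom, left - 1]
        D = ii[bottom, right]
        return D - B - C + A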

Viola and Jones [170] use this representation to efficiently calculate rectangular features made up of positive and negative regions summed together; see figure 5.3 for some example features.

Figure 5.3: Some examples of the Haar-like rectangular features used by Viola and Jones [170]. Areas in white are subtracted from the areas in black.

Zhu, Yeh, Cheng and Avidan [183] extend this idea to HOG [40] by combining the approach used in [170] with the integral histogram work of Porikli [125], calculating integral images over the orientation histograms used by the HOG descriptors to accelerate the speed at which the descriptors can be calculated. In the integral histogram implementation of HOG, the edge gradients are calculated as with the HOG descriptor [40]; after the edge gradients have been calculated, they are interpolated between orientation channels and an integral image is calculated over each of the channels. These oriented integral histograms are used to calculate the gradient contribution for each cell by sampling only 4 points for each bin in the cell, instead of having to iterate over each pixel in a HOG descriptor cell.

For each cell in the descriptor block, the sum over the area R covered by the cell over all histograms is found by sampling 4 points for each orientation θ from the integral image (A_θ, B_θ, C_θ and D_θ) to give sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ). This is done for the region that defines each descriptor cell to quickly generate histograms at any scale in constant time. The histograms are concatenated into a vector and normalised as with regular HOG to form the descriptor h = \{h_i\}_{i=1}^{c_w \cdot c_h \cdot n_\theta}, where c_w and c_h are the number of cells along the width and the height of the descriptor respectively, and n_\theta is the number of orientations.


Figure 5.4: Diagram of integral histogram calculation for HOG. For each cell in the descriptor, the sum over the magnitudes for each orientation channel θ is concatenated to form the histogram of gradients for the cell using: sum(R_θ) = ii(D_θ) − ii(B_θ) − ii(C_θ) + ii(A_θ).

Given that this implementation is much faster than the original HOG implementation and almost as accurate [183], this algorithm is used for the HOG feature experiments presented in this chapter.
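The sketch below illustrates the idea with NumPy: one integral image per orientation channel, and a cell histogram read off with 4 samples per channel. Nearest-bin assignment is used here for brevity, whereas the descriptor described above interpolates between adjacent bins.

    import numpy as np

    def oriented_integral_histograms(magnitudes, orientations, n_theta=9):
        # Quantise each gradient into one of n_theta channels, then integrate each channel.
        h, w = magnitudes.shape
        bins = np.floor(orientations / np.pi * n_theta).astype(int) % n_theta
        channels = np.zeros((n_theta, h, w))
        channels[bins, np.arange(h)[:, None], np.arange(w)[None, :]] = magnitudes
        return channels.cumsum(axis=1).cumsum(axis=2)

    def cell_histogram(ii, y0, x0, y1, x1):
        # Histogram for the cell covering rows y0..y1-1, cols x0..x1-1 (y0, x0 > 0 assumed).
        return (ii[:, y1 - 1, x1 - 1] - ii[:, y0 - 1, x1 - 1]
                - ii[:, y1 - 1, x0 - 1] + ii[:, y0 - 1, x0 - 1])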

5.2.2 DAISY

The DAISY descriptor is a descriptor designed for dense matching applications; it can be calculated very efficiently compared to other state-of-the-art descriptors such as SIFT, while retaining good accuracy [162, 163].

The descriptor itself is made up of a sampling grid arranged in concentric circles around the descriptor origin. See figure 5.5 for an illustration. Gradient magnitudes are first calculated in different directions. A Gaussian is applied to the edge magnitudes for each orientation, which are then sampled to build the histograms of the innermost ring. The Gaussian is applied again and the mid-region histograms are sampled. This process is repeated for each ring of samples.

Figure 5.5: DAISY descriptor construction. Gradient magnitudes are extracted from the source image and quantised into orientation layers. The orientation layers are consecutively blurred by a Gaussian and sampled by one of the sample rings of the descriptor at each step (highlighted in red).


Figure 5.6: Filters used to determine edge response in x and y for the SURF descriptor.

The Gaussian blurring is an efficient way of determining the weighted sum over a circular area, in this case the sample regions in the circular descriptor arrangement. Another advantage of this method is that the different sized sampling regions can be computed incrementally by applying a small Gaussian kernel successively. The speed of this descriptor makes it very attractive as a feature for use in a dense arrangement, such as over the dense object detection window used in the experiments of this chapter.

The default arrangement of 1 centre region, 3 rings and 8 samples per ring makes for quite a high dimensional descriptor compared to SIFT (though DAISY can be computed much more efficiently). A lower dimensional arrangement of 1 centre region, 2 rings and 4 samples per ring over 4 orientations brings the descriptor size down to a similar size to the standard HOG descriptor arrangement, with an acceptable reduction in performance and computational efficiency comparable to integral histogram HOG.

One of the key sources of its efficiency is the use of successive convolution kernels applied to the image oriented gradient map to create lookup tables for the descriptor cell coverage.

5.2.3 SURF

The Speeded-Up Robust Features (SURF) algorithm proposed by Bay et al. [13] presents a simple and computationally efficient keypoint descriptor. There are two variants: a rotation invariant descriptor, which is used in sparse keypoint detection algorithms such as SIFT, and an 'upright' descriptor that does not orient the descriptor representation to its dominant gradient orientation. The 'upright' version of the descriptor is used for object detection and is the one used in the experiments of this chapter.

The descriptor itself consists of a 4 × 4 cell grid of 4-bin histograms. The histograms

are calculated from responses to a Haar-like edge filter in x and y. See figure 5.6 for an illustration of the process.

At a desired descriptor scale s, a region of size 20s (s = 0.8 gives a descriptor diameter of 16 pixels, and is used for the detection experiments discussed later) centred at the descriptor's location is convolved with Haar-like wavelets to determine horizontal and vertical edge responses. The region is then partitioned into 4 × 4 sub-regions. For each sub-region, Haar wavelet responses (filter size is 2s) are computed at 5 × 5 regularly spaced sample points, and are weighted with a Gaussian (σ = 3.3s) to increase robustness to localisation errors and geometric deformations. These weighted responses dx and dy over each cell are summed up to create a histogram vector v = (∑ dx, ∑ dy, ∑ |dx|, ∑ |dy|). The absolute values of the feature responses are included to express the polarity of the intensity changes over the region [13]. The vectors for each cell are concatenated and normalised to a unit vector to create a 64 dimensional descriptor vector. The wavelet responses are invariant to a constant offset in illumination, and invariance to contrast is achieved by normalising the descriptor.

5.2.4 Chamfer Distance Features

Recall from chapter 2, section 2.6.3, that given a set of edge pixel coordinates O = \{(x_i, y_i)\}_{i=1}^{n_O} extracted from a binary edge template object I_O, and the edge pixel coordinates A = \{(x_i, y_i)\}_{i=1}^{n_A} extracted from the binary edge image I_A of a query image (e.g. thresholded edges from the current video frame), where n_O and n_A are the number of edge pixels found in template I_O and in image I_A respectively, the (truncated) chamfer distance between them can be found using:

C(D_A, O, \tau_d) = \frac{1}{|O|} \sum_{p \in O} \min(D_A(p)^2, \tau_d)   (5.2.4)

where τ_d is a threshold used to provide an upper limit on the values in the distance transform to increase stability (see section 2.6.3), and D_A is the distance transform function of A, such that for a given point p the transform gives the distance to the nearest edge in A:

D_A(p) = \min_{q \in A} \|p - q\|   (5.2.5)

For multiple orientations, the chamfer distance is defined as:


C_\Theta(D_A, O, \tau_d) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} C(D_{A_\theta}, O_\theta, \tau_d)   (5.2.6)

where Θ is the set of orientation channels, and |Θ| gives the number of orientations.

Expanding CΘ(DA, O, τd) gives:

C_\Theta(D_A, O, \tau_d) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} \frac{1}{|O_\theta|} \sum_{p \in O_\theta} \min(D_{A_\theta}(p)^2, \tau_d)   (5.2.7)

where O_θ and A_θ are the sets of edge coordinates corresponding to orientation θ for the template object O and the query image A respectively, and D_{A_θ} is the distance transform for the oriented edge coordinates A_θ, defined using the same notation as equation 5.2.5:

D_{A_\theta}(p) = \min_{q \in A_\theta} \|p - q\|   (5.2.8)

The distance transform function can be precalculated for a given query image, making the matching algorithm more computationally efficient. Matches are found by evaluating C_Θ(D_A, O, τ_d) at different locations in the image, where high values indicate a poor match and low values indicate a good match. A value of 0 means that the edges in O correspond exactly to edges in A, i.e. O ⊆ A, but a threshold τ_c is typically used to accept partial matches: C_Θ(D_A, O, τ_d) ≤ τ_c. The chamfer matching algorithm requires that edge templates can be extracted from a series of reference images of the desired object. These images generally require that the edges of the foreground can be separated from background edges, so that only the relevant edges are used in the matching algorithm.
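For a single orientation channel, the matching score can be sketched as follows (a NumPy/SciPy illustration of equation 5.2.4, not the implementation used in the experiments); the oriented version of equation 5.2.7 simply averages this score over the channels:

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def chamfer_score(template_edges, query_edges, tau_d=64.0):
        # template_edges, query_edges: boolean edge maps of the same size.
        dist = distance_transform_edt(~query_edges)    # distance to the nearest query edge
        d2 = np.minimum(dist ** 2, tau_d)              # truncated squared distances
        return d2[template_edges].mean()               # average over the template edge pixels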

The following section proposes an alternative method that allows background edges to be present in the training images, by formulating the template of an object as a weight vector in a linear SVM. This is similar to the work presented by Felzenszwalb [53], which learns weights for a Hausdorff distance transform classifier, but here the truncated chamfer distance transform is used. The algorithm is then tested on a standard detection dataset and the results are compared to the other algorithms.

The Partial Hausdorff Distance

Felzenszwalb [53] observed that algorithms such as the Hausdorff distance can be expressed as a generalisation of a linear classifier. The task of matching edges can be done as follows:


Figure 5.7: Illustration of the Hausdorff distance between edge pixel coordinates from object A (green) and object B (blue). If for each point in A, the closest point in B is found, then the Hausdorff distance is the largest of these distances.

If A and B are sets of 2D edge point coordinates, then the Hausdorff distance between the two sets is given by [53]:

h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|   (5.2.9)

The Hausdorff distance is the maximum distance over all the points in A to their nearest point in B, i.e. if for each point in A the closest point in B is found, then the Hausdorff distance is the largest of these distances. See figure 5.7 for an illustration. One of the strengths of this measure is that it does not require that the points in A and B match each other exactly, which makes it able to handle partial edge contours or minor occlusions, or even a slight variation in shape [53].

One problem with this distance measure is that it is not robust with respect to noise or outliers [53]. To deal with this, the Kth ranked value is used instead of the max in equation 5.2.9. The partial Hausdorff distance [53] is defined as:

h_K(A, B) = K^{\mathrm{th}}_{a \in A} \min_{b \in B} \|a - b\|   (5.2.10)

Taking K = |A|/2, the partial distance is the median distance from the set of points in A to the set of points in B.

Object Recognition

Using this partial Hausdorff distance measure, an input object B can be considered the same as object A if:

hK(A, B) ≤ d (5.2.11)

Felzenszwalb [53] observed that this condition holds exactly if at least K points from A are at most distance d from some point in B. By using the notion of dilation,


Figure 5.8: Illustration of the dilation operator. Shown on the left is a set of edge coordinates B, the middle diagram shows the dilation of B by distance d to give the set of points B^⟨r⟩, and the diagram on the right shows another set A that matches B^⟨r⟩.

where B^⟨r⟩ denotes the set of points that are at most distance r from any point in B, then h_K(A, B) ≤ d holds when at least K points from A are contained in B^⟨d⟩. See figure 5.8 for an illustration of the dilation operation.

Let A be an m × n matrix with binary values 1 for all points in A and 0 otherwise, and let B^⟨d⟩ be an m × n matrix with binary values 1 for all points in B^⟨d⟩ and 0 otherwise; then let a = vec(A) and b^⟨d⟩ = vec(B^⟨d⟩) be mn × 1 vectorisations of these matrices respectively, where vec(·) is defined as:

\mathrm{vec}(M) = (M_{1,1}, M_{2,1}, \cdots, M_{m,1}, M_{1,2}, M_{2,2}, \cdots, M_{m,2}, \cdots, M_{m,n})^T   (5.2.12)

where M is an arbitrary m × n matrix, and superscript T denotes the transpose.

Using this vectorised representation, the partial Hausdorff distance h_K(A, B) ≤ d can be expressed by a dot product between a and b^⟨d⟩:

h_K(A, B) \le d \iff a^T b^{\langle d \rangle} \ge K   (5.2.13)

Felzenszwalb [53] applied this approach to human detection in a PAC learning framework [168] using the perceptron algorithm [105] for hypotheses, and promising results on human detection were reported. Figure 5.9 shows a sample detection made by the classifier and the corresponding weights learnt by the classifier, and a roughly human shape is visible in the positive weights.

Learning Chamfer Weights

Using a similar approach to the one discussed in section 5.2.4, the chamfer matching algorithm can also be expressed as a dot product in a linear classifier learning algorithm such as an SVM.


Figure 5.9: Some results as reported by Felzenszwalb [53]. Image (a) shows an example detection made by the classifier, (b) shows the weights learnt by the PAC classifier, and (c) shows only the positive weights.

Given an object template O and a query image A that are both sets of edge pixel coordinates, let Oθ and Aθ denote the set of edge pixel coordinates for the respective sets for a given orientation channel θ.

Using a similar formulation to the one used in section 5.2.4, let O_θ be an m × n matrix with binary values 1 for points in O_θ and 0 otherwise, and let o_θ = vec(O_θ^T) be the vectorisation of the transpose of this matrix, which is an mn × 1 column vector. The values are transposed so that the rows instead of the columns are concatenated to form the vectorisation, which is more efficient to access in memory than concatenating the columns.

The matrix O is an mn × |Θ| matrix where each of the columns is one of the vectorised orientations o_θ for θ ∈ Θ. This matrix is also vectorised to give o = vec(O), so that each of the vectorised orientations are concatenated to form a single (mn · |Θ|) × 1 column vector:

o = \begin{pmatrix} \mathrm{vec}(O_1^T) \\ \mathrm{vec}(O_2^T) \\ \vdots \\ \mathrm{vec}(O_{|\Theta|}^T) \end{pmatrix} = \begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_{|\Theta|} \end{pmatrix}   (5.2.14)

Recall that D_A is the distance transform of edge pixel coordinates A; let this now represent an m × n matrix where any point in D_A is the distance to the nearest edge coordinate in A. For oriented chamfer matching, there is an oriented distance transform m × n matrix D_{A_θ} for each orientation channel θ and corresponding edge pixel coordinates A_θ.


Using the same notation for vectorisation as with the template object O, the oriented distance transforms for the query image A are vectorised and concatenated together so that:

d = \begin{pmatrix} \mathrm{vec}(D_{A_1}^T) \\ \mathrm{vec}(D_{A_2}^T) \\ \vdots \\ \mathrm{vec}(D_{A_{|\Theta|}}^T) \end{pmatrix} = \begin{pmatrix} d_{A_1} \\ d_{A_2} \\ \vdots \\ d_{A_{|\Theta|}} \end{pmatrix}   (5.2.15)

Next the values in d are transformed to satisfy the \min(D_A(p)^2, \tau_d) term in equation 5.2.7, so that the squared distances are no higher than the distance threshold τ_d:

\hat{d} = \{ \min(d_i^2, \tau_d) \mid \forall d_i \in d \}   (5.2.16)

The chamfer score can be written as a dot product between the oriented binary edge template vector o and the truncated oriented distance transform vector \hat{d}:

C_{dot} = \frac{1}{N_O \cdot |\Theta|} o^T \hat{d}   (5.2.17)

Where N_O = \sum_{\theta \in \Theta} |O_\theta| and N_Θ = |Θ|. The template edges in o are either 1 to indicate an edge or 0 to indicate no edge, and are multiplied and summed with the corresponding values in the distance transform \hat{d} at the edge locations.

This formulation can be used to train a linear SVM to find the best weight vector for a set of object templates. As discussed in 2.6.2, a linear SVM classification is simply a dot product between a weight vector w and a query feature vector x plus a bias b:

w^T x + b \ge 0 \quad \text{for positive classification}
w^T x + b < 0 \quad \text{for negative classification}   (5.2.18)

Replacing the binary edge template vector o with a weight vector w in the linear SVM equation (Eq. 5.2.18), and training an SVM on oriented distance transform feature vectors \hat{d} extracted from positive and negative examples, allows the SVM to learn a suitable weight vector to classify the examples without requiring any special pre-processing of the training data, or requiring that all the training templates are stored in memory.

Classification is then a simple dot product with the distance transform of a new query image. Very small weights can be eliminated to reduce the dimensionality of the weight vector.


w^T \hat{d} + b \ge 0 \quad \text{for positive classification}
w^T \hat{d} + b < 0 \quad \text{for negative classification}   (5.2.19)

Feature Representation: Distance Features

The feature vectors used for the training and test examples are constructed in the following way: for each image in the dataset, edges are extracted using a Canny edge detector [32] and split into orientation channels. The edge maps were created using a threshold of 0.1 · ∇_max, where ∇_max is the maximum edge gradient strength of the edges within the window, and with hysteresis disabled, as these were found to be the best performing parameters (see section 5.6.3 for a discussion of the results).

The sign of an edge pixel is ignored, so that angles 180 degrees out of phase with each other are allocated to the same orientation channel; this reduces the influence of clothing and background colour on the edge information [154] (the classifier should be invariant to this).

For each orientation channel, a distance transform is calculated [20] and transformed to be the squared distance from the nearest edge pixel, truncated to an upper value of τ_d, where τ_d = 64 for these experiments. This means any edge 8 or more pixels away is given the same distance value:

\hat{d} = \{ \min(d_i^2, \tau_d) \mid \forall d_i \in d \}   (5.2.20)

The final feature vector is created by sampling the distance transform for each ori- entation channel every two pixels and concatenating the values into a single vector.
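A hedged sketch of this feature construction is given below; a simple Sobel-gradient magnitude threshold stands in for the Canny detector used in the experiments, and the parameter values mirror those stated above:

    import numpy as np
    from scipy.ndimage import distance_transform_edt, sobel

    def distance_features(gray, n_theta=4, tau_d=64.0, stride=2, thresh_frac=0.1):
        gy, gx = sobel(gray, axis=0), sobel(gray, axis=1)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % np.pi                  # sign ignored (180-degree symmetry)
        edges = mag > thresh_frac * mag.max()
        bins = np.floor(ang / np.pi * n_theta).astype(int) % n_theta
        feats = []
        for t in range(n_theta):
            channel = edges & (bins == t)
            if channel.any():
                d2 = np.minimum(distance_transform_edt(~channel) ** 2, tau_d)
            else:
                d2 = np.full(gray.shape, tau_d)           # no edges: saturate the channel
            feats.append(d2[::stride, ::stride].ravel())  # sample every two pixels
        return np.concatenate(feats)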

5.3 Human Detector

The method chosen to compare these different feature representations is the linear support vector machine (SVM). This is the approach already used with a good deal of success by Dalal and Triggs [40] in their HOG detector, and subsequently sped up by Zhu, Yeh, Cheng and Avidan [183]. Given the fast performance of the algorithm, it is a logical starting point from which to compare other fast features and see how they perform.

The human detectors are created by sampling a dense grid of features over a 96×160 detector window from cropped training examples. These training examples are used to train a linear SVM to create the detector.


Figure 5.10: Distance feature calculation. Left: An image from the HumanEva dataset. Middle: Edges extracted for 2 orientations, each colour represents a different channel, white represents an overlap between adjacent orientation channels. Right: Squared truncated distance transforms for each orientation channel. These are sampled every two pixels and concatenated to form the feature vector.

The following detectors are considered:

1. SVM + HOG

2. SVM + DAISY

3. SVM + SURF

4. SVM + Chamfer Features (Chamfer SVM)

The experiments in this chapter compare these different feature descriptors combined with the SVM classifier to see how they perform against each other, and to analyse the performance of the Chamfer SVM method proposed in this chapter. These experiments demonstrate that it is possible to learn appropriate template weights using the chamfer SVM formulation proposed in section 5.2.4, even in the presence of background edges.

5.3.1 Training

The original HOG classifier uses a linear SVM to classify images, and Dalal and Triggs [40] use a modified version of SVM Light [79] for training. This method proved very effective for human detection, but other feature types may be more effective than HOG, and other learning algorithms could be used as an alternative to Linear SVM.

The features are densely arranged in a detection window with a size of 96 × 160 pixels, in a similar way to the detection window used by Dalal and Triggs [40]. Each of


Figure 5.11: Construction of the detection window with the feature descriptor blocks. Blocks are created starting from the top left of the window moving to the right until the edge of the window, then continue on from the start of the left edge of the window on the next row of blocks until the bottom right of the window is reached.

the descriptors are configured to have a coverage of 16 × 16 pixels within this window so that they use the same amount of pixel information to construct their representation and can be compared fairly. See figure 5.11 for an illustration of how the descriptors are arranged.

To train the classifiers and features considered in this chapter, feature vectors are generated from a set of positive training images and randomly sampled patches from a negative image set.

5.3.2 Hard Examples

Once the initial classifier is trained, the negative images are searched exhaustively in scale-space to find false positives to use as hard examples. These hard examples are added to the initial positive and negative training set, and the SVM is trained again using this augmented set. Dalal and Triggs [40] state that this significantly improves the classification performance of the classifier.
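The bootstrapping loop can be sketched as below (scikit-learn's LinearSVC is used purely for illustration; scan_for_false_positives is a hypothetical helper that scans a negative image in scale-space with the current classifier and returns the feature vectors of windows it wrongly labels positive):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_with_hard_examples(X_pos, X_neg, negative_images, scan_for_false_positives):
        X = np.vstack([X_pos, X_neg])
        y = np.hstack([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        clf = LinearSVC().fit(X, y)                       # initial classifier
        hard = [fv for img in negative_images
                for fv in scan_for_false_positives(img, clf)]  # hypothetical helper
        if hard:                                          # retrain on the augmented set
            X = np.vstack([X, np.asarray(hard)])
            y = np.hstack([y, np.zeros(len(hard))])
            clf = LinearSVC().fit(X, y)
        return clf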

5.4 Datasets

5.4.1 HumanEva

The HumanEva dataset introduced by Sigal and Black [150] provided videos of subjects performing actions in a room from different cameras, while also recording motion capture data from markers placed on the body of each subject. This dataset was extended by [132] by using the ground truth to provide cropped and aligned images for each of the subjects in the original HumanEva dataset. See figure 5.12 for some


Figure 5.12: Some images from the HumanEva dataset [132, 150].

Figure 5.13: Some images from the INRIA dataset [40]. The images contain a large variability in background and appearance.

sample images from this dataset.

5.4.2 INRIA

Dalal and Triggs [40] introduced a challenging pedestrian dataset to determine the performance of their HOG classifier. The positive training set contains 1208 image windows and the test set contains 566. The positive training and test windows are reflected left-right, effectively doubling the number of positive examples. The negative training set contains 1218 images and the negative test set 453 images. Figure 5.13 shows some cropped images from the INRIA dataset.

5.4.3 Mobo

The CMU Mobo database [72] contains sequences of 15 subjects walking on a treadmill, seen from different camera angles. This chapter (and the experiments of the following chapter) uses a version of this dataset in which the images have been cropped and normalised to a resolution of 96×160 and superimposed on randomly sampled backgrounds (drawn from the negative INRIA training images). See chapter 6 for more information about this dataset. Figure 5.14 shows some images sampled from the dataset.


Figure 5.14: MoBo dataset: (top row) Examples of original MoBo training images from the 6 training views. (bottom row) Examples of the normalized 96x160 images with random background.

5.4.4 Upper Body Datasets

For the upper body detection experiments in this chapter, a cropped version of the Mobo dataset is used. A region of 48×48 pixels centred around the head and shoulders area of the subject is used to extract images for the experiments in upper body detection.

5.5 Experiments

5.5.1 Human Detection

Feature vectors were extracted for each of the feature types from the INRIA dataset, and a linear SVM trained using each type of feature. Each feature vector is constructed by sampling descriptors in an overlapping grid as originally done with the HOG [40] classifier for each feature type.

The training set consists of 2416 cropped positive examples and 7259 random samples from photos containing no people. The testing set consists of a further 1126 positive examples and 1275 randomly sampled negative examples.

Precision-recall graphs are used to compare the performances of each of the feature types. See figure 5.15 for the results. It can be seen that in this test HOG performs much better than the other feature types, but the chamfer SVM classifier performs as well as the SURF and DAISY descriptor classifiers.


Figure 5.15: Precision-recall graphs comparing different features used in a linear SVM classifier, trained and tested on INRIA. All graphs are zoomed in from 0.8 to 1.0 on each axis. Top-left: Performance of the Integral HOG classifier. Top-right: Performance of the Chamfer SVM classifier. Bottom-left: Performance of the DAISY classifier. Bottom-right: Performance of the SURF classifier.


5.5.2 Upper Body Detection

For a computer game context, detecting the whole player for localisation might not be necessary. Ferrari et al. [55] demonstrated that an upper body detector was sufficient to localise a human subject for more complex processing. They used the SVM HOG classifier as described in Dalal and Triggs [40] to train an upper body detector on examples from front and side views of people.

Though accurate, this could perhaps be made faster by considering simpler, less complex features.

Experimental Setup

As with the human detection experiments, four SVM classifiers were trained on the different feature types. The dataset used was the modified Mobo dataset described in §5.4.4.

The training dataset consists of 8008 positive examples of 14 out of the 15 available subjects, and 7259 randomly sampled negative examples from images known to contain no people. The testing dataset consists of 536 images from the remaining subject in the database (subject 1), and 1275 randomly sampled negative examples.

The features are sampled from a regular overlapping grid over a 48×48 pixel window centred over the subject's head and shoulders.

The results in figure 5.16 show that the HOG descriptor does not perform as well on the upper body dataset as it does on the INRIA human dataset. DAISY performs the best of the 4 descriptors. It is interesting to note that Chamfer SVM performs just as well as HOG on this dataset, implying that the formulation is quite capable of learning suitable template weights for binary classification tasks.

5.6 Discussion

Each of the 4 descriptor and SVM combinations performs well in the classification experiments run in this chapter. As expected, HOG clearly performs best on the INRIA dataset. The other descriptors perform at a similar rate to each other, and the Chamfer SVM method performs as well as the DAISY and SURF descriptors. SURF performs well, but not as well as either of the more complex descriptors; this could be due to the extra information encoded in the HOG and DAISY descriptor constructions.

The DAISY descriptor outperforms the HOG, SURF and Chamfer SVM classifiers on the


Figure 5.16: Precision-recall graphs comparing different features used in a linear SVM classifier, trained and tested on the Mobo upper body dataset. Top-left: Performance of the Integral HOG classifier. Top-right: Performance of the Chamfer SVM classifier. Bottom-left: Performance of the DAISY classifier. Bottom-right: Performance of the SURF classifier.

upper body dataset.

The proposed Chamfer SVM algorithm also performed well on both of the datasets, with competitive accuracy on the upper body experiment. The results show that it is possible to learn a suitable weight vector for the SVM formulation of chamfer matching, even in the presence of background edges in the training data.

Some of the weights learned by the SVM are shown in figure 5.17. These are generated from SVMs trained using the HumanEva dataset. These particular weights were chosen over the other datasets as they are more distinct and better aid the explanation of the weight values and their meaning for classification.

In figure 5.17, the negative weights around the head and feet area of the window penalise edge distances that coincide with edges associated with human shapes. Considering only the human head shape for a moment, if there are no edges in the distance transform where edges of a head shape should be, the distance transform will be high and contribute to a negative classification (hence the negative weighting in that region).


Figure 5.17: Learned SVM weights from oriented chamfer SVM classifiers projected onto the detection window. Colder weights are negatively weighted, warmer weights are positively weighted.

Conversely, areas with low distance transform values (i.e. close to an edge) near to where an edge from a head shape would be expected will give a higher contribution to positive classifications.

5.6.1 Computational Efficiency

To assess the relative computational costs of extracting features for each of the SVM and feature combinations, an experiment was run to determine the average cost of extracting a feature vector using the same parameters used to generate the classification results already reported in section 5.5.1.

Using a sliding window approach, each classifier was swept across a 640 × 480 test image at the same scale and with a window separation of 1 pixel at each step in x and y. A total of 173,217 locations were visited for each of the classifier windows, and the average window time was found by dividing the total image scan time (minus initialisation) by the number of windows extracted. A comparison of initialisation time was also produced in a separate graph. The results from this experiment are shown in figure 5.18.

Each classifier uses the same window size and grid array except for the chamfer SVM classifier, which uses the same window size but a much denser sampling grid than the others (sample separation of every 2 pixels). The classifiers were implemented in such a way that some of the calculations can be performed once for the image to attempt to minimise the effects of initialisation overheads on the feature timings, which is more in line with how the classifiers might be used in practice.


Figure 5.18: This figure shows how the integral HOG, DAISY and Chamfer SVM distance feature based classifiers compare with respect to feature extraction timings. (top) shows the average time taken to extract features for a classifier window, and (bottom) shows the initialisation times for a 640 × 480 resolution image for each descriptor type.

The SURF algorithm did not perform well in the experiments and does not appear in the graph because of this. Initial tests showed that extracting the SURF feature grid used in the human detection experiments would take on average 30.56 × 10^−3 seconds, nearly 20 times slower than the slowest timing result of 1.6 × 10^−3 seconds from the other 3 methods. The most probable cause is that the software API used in the experiment to generate the SURF descriptors¹ might not be optimised for this particular type of application, but it is likely that moving the wavelet filter calculation

¹ The SURF library was obtained from: http://www.vision.ee.ethz.ch/~surf/

for the features to a pre-calculation phase for the whole image, and then using an integral histogram approach (such as with Integral HOG), would improve the sliding window speed of the implementation.

In this experiment, the Chamfer SVM algorithm is faster at extracting feature vectors than the other two descriptors, but this speed comes at the cost of accuracy when compared to the efficient integral HOG descriptor. However, the feature grid used for Chamfer SVM is not optimised, and exploring the effects of different sampling densities on chamfer distance features, with respect to classification accuracy and speed, would be an interesting area to explore for performance improvements.

In the initialisation time comparison, the HOG descriptor requires the least time to prepare for scanning the test image, whereas Chamfer SVM takes nearly twice as long to initialise for the same image. Trying different edge detection algorithms with the Chamfer SVM algorithm (the Canny edge detector is currently used), as well as different distance transform algorithms may improve initialisation performance.

To improve the sliding window performance further, the structure of the overlapping descriptor grid can be exploited, as discussed in Dalal and Triggs [40]. If the window is moved at increments that match the descriptor overlap within the classifier window, then only a single line of new descriptors needs to be calculated, since the rest of the descriptors can still be used but are just shifted in position. When shifting along x in positive increments, the left-most column of descriptors in the window can be discarded and a new column added at the right of the descriptor window. The same can be done with the rows when moving in increments of y. Additionally, if memory permits, the descriptors can be stored over the width of the image for the previous W_r rows (where W_r is the number of rows in the grid of the descriptor window), so that when the right-most position of the x axis is reached and set to 0, and y is incremented by one step, the previously cached W_r − 1 rows of descriptors may be re-used. This second optimisation can potentially consume a large amount of memory, however, and is not always practical.

5.6.2 Chamfer SVM

A classification using the Chamfer SVM algorithm is fast even considering the dimensionality of the feature vector constructed using the dense sampling grid, as it is simply a dot product between the distance transform and the weight vector without any extra post processing. The distance transform calculation is a simple two-pass algorithm over the oriented edge maps [20]. Both distance transform and classification operations

are memory cache efficient.

To reduce the dimensionality of the Chamfer SVM classifier, very small weights from the SVM can be removed to make the sampling grid more sparse. These weights have a very small relative contribution to the overall classification decision compared to the more heavily weighted locations in the distance transform.

Alternatively, the weight strength can be used to determine the order in which the sampling is done while evaluating the dot product of the weight vector and the distance transform. A desired approximation level could be set based on the desired total contribution from the weight vector, and the order of weights traversed based on their absolute magnitude. This would be at the cost of some memory cache efficiency, due to the potential for somewhat random access into the distance transform when sampling values for a classification.
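A sketch of this approximate evaluation strategy is shown below (NumPy; the keep_fraction parameter is illustrative only):

    import numpy as np

    def approximate_classify(w, b, d_hat, keep_fraction=0.9):
        # Visit weights in order of absolute magnitude until keep_fraction of the
        # total |w| mass has been consumed, then classify with the bias.
        order = np.argsort(-np.abs(w))
        cumulative = np.cumsum(np.abs(w[order]))
        n_keep = int(np.searchsorted(cumulative, keep_fraction * cumulative[-1])) + 1
        idx = order[:n_keep]
        return float(np.dot(w[idx], d_hat[idx]) + b) >= 0.0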

5.6.3 Edge Thresholding

One factor that could affect the performance of the chamfer based classifier is the quality of the edge maps used to calculate the distance transforms. However, because the chamfer matching algorithm is tolerant to partial edges and minor occlusions, the effect of this should be minimal.

For the chamfer SVM experiments in this chapter, binary edge maps were created using a Canny edge detector with a threshold of 0.1 ∗ MaxGradient. Hysteresis, as described in the original Canny algorithm, was not used as the performance of the classifier at an edge threshold of 0.1 did not change significantly enough to justify the extra hysteresis computation.

Figure 5.19 illustrates the effect of hysteresis on the edge map created from a typical INRIA training image over varying edge thresholds.

To test the effect of the threshold and hysteresis on the performance of the chamfer SVM classifier, some further experiments were run on the INRIA dataset with varying thresholds both with and without hysteresis enabled.

The results in figure 5.20 show that although the use of hysteresis does indeed improve performance for relatively high Canny edge thresholds such as 0.3 and 0.5, for the lower and better performing thresholds such as 0.1 the extra edge information gained by using hysteresis makes only a very small difference in performance for the extra computational cost of the additional hysteresis pass, and even makes the performance worse for very low edge thresholds such as 0.025. At very low edge threshold levels, the secondary hysteresis threshold begins to include too many low strength


Figure 5.19: This diagram shows the oriented Canny edge output for the image on the left at different thresholds (0.025, 0.050, 0.1, 0.2, 0.3, 0.5) with hysteresis disabled (top row) and enabled (bottom row).

Figure 5.20: Precision-recall graphs comparing different Canny edge thresholds (0.025, 0.050, 0.1, 0.2, 0.3 and 0.5) used to train the Chamfer SVM classifier on INRIA, showing hysteresis disabled (left) and enabled (right).

intensity changes in the edge data and makes the edge map too dense.

Given that the extra computational cost of the hysteresis pass yields very little improvement in the performance of the classifier when trained with the 0.1 edge threshold, edge detection for the human detection experiments was done using a value of 0.1 for the edge threshold and with hysteresis disabled.

5.6.4 Bagging, Boosting and Randomized Forests

It could be that a linear SVM is not the best classifier for the task, and it is possible that other learning algorithms may yield a higher performing classifier. An experiment was run using classifiers that have shown recent success in object detection


Figure 5.21: Precision-recall graphs comparing the SVM Light classifier and the SGD-QN SVM classifier on INRIA with Integral HOG features (4 orientations). The numbers on the SGD-QN graph represent the number of training instances used for the classifier. Left: Performance of the SVM Light classifier. Right: Performance of the SGD-QN SVM classifier.

and classification problems, namely Randomised Forests and Boosting. Additionally, an experiment using Bagging was run, in part because this was also available within the learning algorithm library used (FEST), but mainly because the algorithm is another decision tree based classifier like Randomised Forests, so it was thought to be of interest for these comparison experiments.

Shown in figure 5.22 is a comparison between the SVM and the implementations of Boosting, Bagging and Randomised Forests provided by the FEST library. For this experiment, features extracted using the HOG descriptor with 4 orientations were chosen, as this training data performed with an accuracy similar to the higher orientation counts when combined with an SVM, while at the same time keeping the dimensionality of the feature representation low. The performance of the other algorithms is reasonably good compared to the linear SVM, with Boosting coming closest to the performance of the same data learnt using the SVM. The linear SVM classifier still performs the best of the 4 algorithms on this dataset.

Algorithm Notes

Boosting has been discussed briefly in §2.6.2, and the Random Forest algorithm is discussed later in §6.3.2. Bagging is a simple algorithm that can be summarised as follows:

Bagging, or bootstrap aggregating, is a method of training that helps to avoid the problem of over-fitting for classification and regression trees. Random Forests are an extension of this idea [29], but Random Forests also randomise over the dimensions used to split the data at each node, instead of exhaustively checking each dimension.


Figure 5.22: Precision-recall graphs comparing the SVM Light classifier and the Bagging, Boosting and Randomised Forest classifiers from the FEST library, trained and tested on INRIA using integral HOG with 4 orientations. Left: performance of the SVM Light classifier. Right: performance of the FEST classifiers using the same dataset and parameters as the SVM classifier.

A bagging classifier is a set of decision trees grown by repeatedly drawing a bootstrap sample from the training data T, creating m training sets of size n ≤ |T|, one for each tree. The bootstrap samples are drawn from the training set uniformly and with replacement. When n = |T| and the dataset is large, each bootstrap set is expected to contain about 63.2% unique samples, the rest being duplicates due to sampling from T with replacement [30].

Classification is made by taking the average over all trees.

y = \frac{1}{|H|} \sum_{h \in H} h(x)    (5.6.1)

where y is a vector containing the proportion of votes for each class in the case of classification, or the average over all predicted values in the case of regression.
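The following is a minimal sketch of bagging as summarised above, assuming scikit-learn's DecisionTreeClassifier as the base learner and integer class labels in 0..num_classes-1; the function names and parameters are illustrative, not the FEST implementation used in the experiments.

```python
# Hedged sketch of bootstrap aggregating (bagging) with decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X, y, num_trees=50, rng=np.random.default_rng(0)):
    trees = []
    n = len(X)
    for _ in range(num_trees):
        # Bootstrap sample: n draws from the training set, uniformly with replacement.
        idx = rng.choice(n, size=n, replace=True)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagging(trees, X, num_classes):
    # Each tree votes for a class; the output per example is the proportion of
    # votes per class, i.e. the averaging of equation (5.6.1).
    votes = np.stack([t.predict(X) for t in trees])   # (num_trees, num_examples)
    proportions = np.zeros((X.shape[0], num_classes))
    for k in range(num_classes):
        proportions[:, k] = (votes == k).mean(axis=0)
    return proportions
```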

5.7 Summary and Future Work

In summary, localisation plays an important role as an initial stage for more complex algorithms. Creating a separate, efficient classifier specifically for localisation allows more computationally expensive algorithms to be applied at fewer locations, so that more interactive frame rates can be achieved. Hard examples learnt during training aid object localisation a great deal.

The proposed chamfer SVM algorithm performs at similar rates to classifiers that use more elaborate descriptors such as DAISY and SURF.

The HOG descriptor performs consistently well, and the speed and robustness of this descriptor make it an attractive choice for dense features.

The Chamfer SVM formulation enables the use of chamfer matching in situations where templates of the objects could be learnt on-line (using on-line linear SVM methods such as SGD-QN SVM [19]), and it would be interesting to see how the algorithm performs in this application. Applying this algorithm to existing applications where hierarchical chamfer template matching trees are used would also be interesting.

The next chapter introduces a method of not only detecting a human but also making an estimate of their pose, by formulating the pose estimation problem as a multi-class object detection problem in an algorithm that exploits strengths of some of the learning algorithms examined in this chapter.

The algorithm proposed in the next chapter exploits the training data to sample a number of sparse features, making a classification using information from areas that are discriminative between different classes. This takes some inspiration from chamfer matching and its applications.

CHAPTER 6

Pose Detection

Determining the complete pose of a human solves many of the problems discussed for the methods presented in Chapter 3 and Chapter 4. Once the pose is known, the orientation and position of each limb is known, and interacting with virtual GUI components becomes the much simpler problem of detecting when the appropriate part of the user intersects with a virtual control on screen.

This chapter introduces a novel discriminative method of combining both localisation and pose classification. The algorithm does this without requiring that a segmentation or silhouette can be extracted from the image.

Monocular full-body pose estimation is a difficult and non-trivial problem. The widely varying appearance of people (e.g. shape, clothing, size) and varying backgrounds make it very hard to model efficiently. Generative approaches such as the one presented in [55] attempt to address this by adapting some parameters of their model to the subject being analysed, but they use a two-stage approach that separates detection from pose estimation.

Due to the wide availability of human detection and simple action databases such as HumanEva and MoBo that also contain ground truth, a simple walking action was selected to study the problem and analyse the performance of the algorithm. Monocular sequences of walking actions contain ambiguities in pose, particularly from the lateral view of the action and from frontal views, due to the absence of depth information (projection ambiguities). The approach discussed in this chapter attempts to give a distribution over possible poses as a classification instead of making a hard decision for a given pose.

An upper-body classifier would also have been interesting given the context of human computer interaction; however, at the time of writing only small annotated databases were available for evaluation and training [45, 55], and these do not contain a sufficient amount of viewpoint and pose variation for the algorithm to be applied.


In future work it would be interesting to apply this algorithm to larger upper-body datasets and compare its performance to existing generative methods.

Computer vision based user interfaces and computer games that aim to improve the user's fitness through exercise routines, such as Sony's EyeToy Kinetic series of games or Nintendo's Wii Fit, can be made more effective by accurately detecting the pose a player is currently in, and could subsequently offer advice to improve the exercise pose. Once integrated into a fitness programme, such a tool can provide more useful feedback about the user's fitness and progress as they perform specific exercises.

Some works [99, 167] attempt to determine pose from complex actions but use domain knowledge available for a specific problem (e.g. knowledge of a golf club as in [167]) to improve robustness and tracking. However, the simple background subtraction used in [167] is prone to problems in more complex environments, such as a living room, and with more varied clothing, as discussed in §3. If made more accurate and robust, these problem-specific approaches could be used to train and assess the quality of user actions in order to teach a specific exercise or movement.

However, systems that achieve this must ultimately be computationally efficient for online applications, so that immediate feedback can be provided to the user. This is a very complex problem that is difficult to solve at real-time speeds. A scalable combined detection and localisation approach is therefore proposed in the following sections that takes steps towards making the processing more computationally efficient (though not yet enough for real-time speeds).

By applying a method that can determine a player's pose efficiently, these kinds of human computer interactions can be achieved. Approaches to estimating the pose of a human subject can be broadly categorised as either generative or discriminative. As discussed in §2.7, generative approaches involve a model that is used to predict the expected appearance given a pose, whereas discriminative approaches attempt to determine pose directly from an appearance representation such as a vector of feature descriptors.

Such systems can have some difficulty where ambiguities in appearance arise, such as identifying which leg is forward when viewing a human from the side. Both approaches can alleviate this somewhat by allowing multiple hypotheses to be valid at the same time, i.e. they are multimodal.

The algorithm described in the following sections is a discriminative exemplar based approach that returns a distribution over possible poses when given a test image.


6.1 Introduction

Full-body human pose recognition from monocular images constitutes one of the fundamental problems in Computer Vision. It has a wide range of potential applications such as human/computer interfaces, video games, video annotation/indexing or surveillance. Given an input image, an ideal system would be able to both localise any humans present in the scene and recover their poses. The two stages, known as human detection and human pose recognition, are usually considered separately. There is an extensive literature on both detection [40, 65, 136, 171, 175, 183] and recognition [2, 18, 110, 131, 140, 161] but relatively few papers consider the two stages together [17, 44, 116]. Most algorithms for pose recognition assume that the human has been localised and the silhouette has been recovered, making the problem substantially easier.

Some techniques thus separate the foreground from the background and classify a detected (and segmented) object as human or non-human. The pose of the human can then be estimated, for instance by fitting a human model to the resulting blob or silhouette or by applying pose regressors. These methods are very helpful when a relatively clean background image can be computed, which is not always the case depending on the setting and application: for example, if the goal is to detect humans in an isolated image (not from a video sequence) or in a moving camera sequence, the computation of a background image, and consequently the segmentation of the subject, is not trivial.

This chapter proposes a pose detection algorithm that uses the best components of state-of-the-art classifiers including hierarchical trees, cascades of rejectors as well as randomized forests.

6.1.1 Related Previous Work

Exemplar-based approaches have been very successful in pose recognition [110]. However, in a scenario involving a wide range of viewpoints and poses, a large number of exemplars would be required, and as a result the computational time needed to recognise individual poses would be very high. One approach, based on an efficient nearest neighbour search using histogram of gradient features, addressed the problem of quick retrieval in a large set of exemplars by using Parameter Sensitive Hashing (PSH) [140], a variant of the original Locality Sensitive Hashing algorithm (LSH) [42]. The final pose estimate is then produced by locally-weighted regression, which uses the neighbours found by PSH to dynamically build a model of the neighbourhood.

The method of Agarwal and Triggs [2] is also exemplar based and uses kernel-based regression, but they do not perform a nearest neighbour search for exemplars, instead using a (hopefully sparse) subset of the exemplars learnt by Relevance Vector Machines (RVM).

Their method has the main disadvantage that it is silhouette based and, perhaps more seriously, it cannot model ambiguity in pose as the regression is uni-modal. In [164], an exemplar-based approach with dynamics is proposed for tracking pedestrians. In [65], Gavrila presents a probabilistic approach to hierarchical, exemplar-based shape matching. This method achieves a very good detection rate and real-time performance but does not regress to a pose estimate. Similar in spirit, Stenger [154] proposed a hierarchical Bayesian filter for real-time articulated hand tracking.

Many other works have focused on human detection specifically, without considering pose [40, 65, 136, 171, 175, 183]. Dalal and Triggs [40] use a dense grid of Histograms of Oriented Gradients (HOG) and learn a Support Vector Machine (SVM) classifier to separate human from background examples. This work was later extended [183] by integrating a cascade-of-rejectors concept, achieving near real-time detection performance.

Several works attempt to combine localisation and pose estimation. Dimitrijevic et al. [44] present a template-based pose detector and deal with the problem of huge datasets by detecting only human silhouettes in characteristic postures (sideways opened-leg walking postures in this case). They extended this work [57] by inferring 3D poses between consecutive detections using motion models. This work gave some very interesting results with moving cameras; however, it seems difficult to generalise to actions that do not exhibit a characteristic posture.

The pose estimation and detection work of Okada and Soatto [116] learns k kernel SVMs to discriminate between k pre-defined pose clusters and then learns linear regressors from feature to pose space. They extend this method to localisation by adding an additional cluster that contains only images of background. The Poselet work of [23] presents a two-layer classification/regression model for detecting people and localising body components. The first layer consists of poselet classifiers trained to detect local patterns in the image. The second layer combines the output of the classifiers in a max-margin framework. Ferrari et al. [55] use an upper-body detector to localise a human in an image, find a rough segmentation using a foreground and background model calculated from the detection window location, and then apply a pictorial structure model in regions of interest.

In this chapter a novel algorithm is presented that jointly tackles the problems of human detection and pose estimation in a similar way to template tree approaches [65, 154], while exploiting some advantages of AdaBoost-style cascade classifiers [170, 183] and Random Forests [21, 29].

Randomized trees [6] and Random Forests [29] have been shown to be fast and robust classification techniques that can handle multi-class problems [98]. Bosch et al. [21] used Random Forests for object recognition, and others have used them for clustering [109, 145].

Many different types of features have been considered for human detection and pose recognition: silhouettes [2], shape [65], edges [44], HOG descriptors [40, 140, 183], Haar filters [171], motion and appearance patches [18], edgelet features [175], shapelet features [136] or SIFT [102]. Driven by the recent success of HOG descriptors for both human detection [40, 183] and pose recognition [140], and the fact that they can be implemented efficiently [183], we chose to use HOG as a feature in our algorithm.

6.1.2 Motivations and Overview of the Approach

We consider the problem of detecting people and recognizing their poses at the same time as in [17, 116] or [156] for hands. Some work on pose recognition assumes that the bounding box is provided (e.g. [18]). In this work, we use a sliding window approach to jointly localise and classify human pose using a multi-class classifier.

Supposing we have a database of images with corresponding 3D and/or 2D poses, we first predefine a set of classes by discretising camera viewpoint and pose space. Random Forests [29] are inherently good at multi-class problems, which makes them seem ideal for pose estimation. We performed an initial test on a database of 50,000 images of walking people grouped into 64 classes and extracted 3 different grids of HOG; see figure 6.1 and the experiments in section 6.4. We identified two main drawbacks with the algorithm and the existing implementation.

Random Forests are grown by randomly selecting a subset of features at each node of the tree (typically \sqrt{D} [29], where D is the number of dimensions, to help avoid a single tree over-fitting the training data), and the best split is found for each candidate dimension m_i by evaluating all possible splits along that dimension using a measure such as information gain [21]. The dimension m^* that best splits the data according to that score is used to partition the data at that node. This process continues recursively until all the data has been split and each node contains a single class of data.

For pose estimation, an ideal dataset should contain variation in subject pose, camera viewpoint, appearance and physical attributes (size, shape). Combining such a dataset with a very dense image feature set (such as HOG [40]) captures discriminative details between very similar poses [116]. As illustrated in figure 6.1 (and section 6.4), using denser HOG feature grids improves pose classification accuracy.


Figure 6.1: Random Forest Preliminary Results: this initial test was performed on the MoBo walking dataset [72]: dense grids of HOG features are extracted for 15 different subjects in around 50,000 images, which are grouped into 64 classes. We build the training subset by randomly sampling 10 subjects and keep the remaining 5 subjects for testing. We run the same test for 3 different grids of HOG and show the classification results varying the number of trees used in the forest. Using denser HOG feature grids improves pose classification accuracy, but we quickly face memory issues that prevent us from working with denser grids.

Neighbouring classes can be very close to one another in image space, and in practice are only separable by some sparse subset of features. This means that, having randomly picked an arbitrary feature to project on from a high-dimensional feature space, it is highly unlikely that an informative split in this projection (i.e. one that improves the information measure) exists. While we do not need perfect trees, informative trees are still rare and finding them naively requires us to generate an infeasible number of trees.

Another drawback of the Random Forests algorithm is that it is not very well adapted to scanning window approaches. Even if on-demand feature extraction can be considered, as in [43], for each scanned sub-image the trees still have to be completely traversed to produce a vote/classification. This means that a non-negligible number of features have to be extracted for each processed window, making the algorithm less efficient than existing approaches like cascades-of-rejectors that quickly reject most of the negative input sub-images using a very small subset of features. Random Forests therefore do not seem to be an appropriate option for our problem of pose detection. Works such as [170, 183] use AdaBoost to learn a cascade structure using very few features at the first level of the cascade, increasing the number of features used for later stages. This allows efficient rejection of the majority of negative candidates using a small number of features.

Other approaches, such as those described in [103, 179] for multi-view face detection, organise the cascade into a hierarchical structure consisting of two types of classifier: face/non-face, and face view detection.

Inspired by these ideas, we followed a bottom-up approach to first build a decision tree by recursively clustering and merging predefined classes at each level. Hierarchical trees have been shown to be very effective for real-time systems [65, 154], and we extend this approach to non-segmented images. The main drawback of template tree matching techniques is that they are limited by the number of templates that need to be stored and accessed online. Instead of storing templates for each branch or node of the tree, we propose to use a reduced number of HOG descriptors, which require much less memory to store and can be manipulated faster.

For each branch of this decision tree we use a new algorithm that takes advantage of the alignment of training images to build a list of potentially discriminative gradient features. We then select the HOG blocks that show the best rejection performance. We finally grow an ensemble of cascades by randomly sampling one of these HOG-based rejectors at each branch of the tree. Cascade approaches are efficient at quickly rejecting negative candidates, so we exploit this property by learning multi-class hierarchical cascades. By randomly sampling the features, each cascade uses a different set of features to vote, which adds some robustness to noise and helps prevent over-fitting. Each cascade can vote for one or more classes, so the final classification is a distribution over classes.

While other algorithms such as PSH, SVMs and Random Forests must extract the entire feature space for all training instances during learning, making them less practical when dealing with very large datasets (as in the case of human pose classification), our hierarchical cascade classifier only selects a small set of discriminative features extracted from a small subset of the training instances for each branch. This makes it much more scalable to very large training sets.

In the next section, we present a new method for data-driven discriminative feature selection that enables our algorithm to deal with large datasets and high-dimensional feature spaces. Then in section 6.3 we explain how the proposed algorithm extends hierarchical template tree approaches [65, 154] to deal with unsegmented images, and describe how the algorithm incorporates a random feature selection inspired by Randomised Forests to learn an ensemble of multi-class cascade classifiers.

The work presented in this chapter is an extension of [132]. We take steps towards generalising the algorithm and present results on more challenging datasets. In particular, we observed that the original classifier was not robust to background due to the low variability of background appearance in the HumanEva training dataset [150].


This is addressed in the work presented here with a new dataset that includes a wider range of background appearances and more human subjects.

Our proposed approach gives promising near real-time performance with both fixed and moving cameras. We present results using different publicly available training and testing datasets. Note that in the presented work we focus on specific motion sequences (e.g. walking), although the algorithm can be generalised to any action.

6.2 Selection of Discriminative HOGs

Feature selection is probably the key point in most recognition problems. It is very important to select the relevant and most informative features in order to alleviate the effects of the curse of dimensionality (Bellman [14]). Many different types of features are used in general recognition problems. However, only a few of them are useful in the given exemplar-based pose recognition task. For example, features like colour and texture are very informative in general recognition problems, but because of their variation due to clothing and lighting conditions they are seldom useful in exemplar-based pose recognition. On the other hand, gradients and edges are more robust cues with respect to clothing and lighting variations¹. Guided by the success of HOG descriptors for both human detection [40, 183] and pose recognition [140] problems, we chose to use HOG blocks as a feature in our algorithm.

Figure 6.2: Log-likelihood Ratio for Human Pose. (a to e): examples of aligned images belonging to the same class (for different subjects and camera viewpoints) as defined in [132] using the HumanEva dataset [150]. The resulting gradient probability map (f) and log-likelihood ratio (g) for this class vs. all the other classes. Hot colours in (g) indicate the discriminative areas. The sampled HOG blocks are represented on top of the likelihood map in (h).

Each HOG block represents the probability distribution of gradient orientation (quantised into a pre-defined number of histogram bins) over a specific rectangular neighbourhood.

¹Note that clothing could still be a problem for edges if there are very few subjects in the training set: some edges due to clothing (and not due to the pose) could be considered as discriminative edges when they should not.

The usage of HOGs over the entire training image, usually in a grid, leads to a very large feature vector where all the individual HOG blocks are concatenated. So an important question is how to select the most informative blocks in the feature vector. Some works have addressed this question for human detection and pose recognition problems using SVMs or RVMs [17, 40, 116, 183]. However, such learning methods are computationally inefficient for very large data sets.

In [44], the authors use statistical learning techniques during the training phase to estimate and store the relevance of the different silhouette parts to the recognition task. We use a similar idea to learn relevant gradient features, although slightly different because of the absence of silhouette information. In what follows, we present our method to select the most discriminative and informative HOG blocks for human pose classification. The basic idea is to take advantage of accurate image alignment and study the gradient distribution over the entire training set to favour locations that we expect to be more discriminative between different classes (similar in spirit to [140]). Intra-class and inter-class probability density maps of gradient/edge distribution are used to select the best locations for the HOG blocks.

6.2.1 Formulation

Here we describe a simple Bayesian formulation to compute the log-likelihood ratios which can be used to determine the importance of different regions of the image when discriminating between different classes. Given a set of classes C, the probability of the classes C given the observed edges E can be defined using a simple Bayes rule:

p(C|E) = \frac{p(E|C)\, p(C)}{p(E)}    (6.2.1)

The likelihood term p(E|C) of the edges being observed given classes C can be estimated using the training data edges for the respective classes. Let T = \{(I_i, c_i)\} be a set of training data consisting of images each with a corresponding class label, and let T_C = \{(I, c) \in T \mid c \in C\} be the set of training instances for a subset of classes C. Then the likelihood of observing an edge given a set of classes C can be estimated as follows:

p(E|C) = \frac{1}{|T_C|} \sum_{(I,c) \in T_C} \nabla(I)    (6.2.2)

where \nabla(\cdot) calculates a normalised oriented gradient edge map for a given image I, with the value at any point being in the range [0, 1].


Class-specific information is represented by high values of p(E|C) at locations where edge gradients occur most frequently across the training instances. Edge gradients at locations that occur in only a few training instances (e.g. due to background or appearance) will tend to average out to low values. To increase robustness towards background noise, the likelihood can be thresholded by a lower bound:

p(E|C) = \begin{cases} p(E|C) & \text{if } p(E|C) > \tau \\ 0 & \text{otherwise} \end{cases}    (6.2.3)
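A minimal sketch of equations (6.2.2) and (6.2.3) follows, assuming the training images for a class subset are already cropped, aligned and converted to grayscale float arrays; the finite-difference gradient magnitude used here is only a stand-in for the thesis's normalised oriented gradient map \nabla(\cdot), and the function names are illustrative.

```python
# Hedged sketch: average normalised gradient maps over a class subset, then
# suppress weak responses with the lower bound tau of equation (6.2.3).
import numpy as np

def gradient_map(image):
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)          # values in [0, 1]

def edge_likelihood(images, tau=0.05):
    # p(E|C): mean of the per-image gradient maps, equation (6.2.2).
    p = np.mean([gradient_map(im) for im in images], axis=0)
    p[p <= tau] = 0.0                        # thresholding, equation (6.2.3)
    return p
```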

Suppose we have a subset of classes B ⊂ C. Discriminative edge gradients will be those that are strong across the instances of classes within B but are not common across the instances within C. Using the log-likelihood ratio between the two likelihoods p(E|B) and p(E|C) gives:

L(B, C) = \log\!\left(\frac{p(E|B)}{p(E|C)}\right)    (6.2.4)

The log-likelihood distribution defines a gradient prior for the subset B. High values of this function give an indication of where informative gradient features may be located to discriminate between instances belonging to subset B and the rest of the classes in C.

For the example given in figure 6.2, we can see that the right knee is a very discriminative region for this class, while in the example given in figure 6.3 [121] we can observe that the space between the eyebrows and the area around the nose are very discriminative regions for anger and joy together compared to the other facial expressions.

Gradient orientation can be included by decomposing the gradient map into a number n_\theta of separate orientation channels according to gradient orientation. The log-likelihood L_\theta(B, C) is then computed separately for each channel, thereby increasing the discriminatory power of the likelihood function, especially in cases where there are many noisy edge points present in the images.

Maximising over the n_\theta orientation channels, the log-likelihood gradient distribution for class B then becomes:

L(B, C) = \max_{\theta} L_{\theta}(B, C).    (6.2.5)

We also obtain the corresponding orientation map:


Figure 6.3: Log-likelihood ratio for face expressions: Orrite et al. [121] applied the proposed feature selection scheme to a subset of the dataset used by Kanade et al. [83]. This dataset is composed of 20 different individuals acting 5 basic emotions besides the neutral face: happiness, anger, surprise, sadness and disgust, with 3 examples each; 300 pictures altogether. All images were normalised, i.e. cropped and manually rectified. The gradient probability map (left) and log-likelihood ratio (right) are represented for each of the 6 classes. Hot colours indicate the discriminative areas for a given facial expression.

\Theta(B, C) = \operatorname{arg\,max}_{\theta} L_{\theta}(B, C).    (6.2.6)

Uninformative edges (background, clothes, etc.) from a varied dataset will generally be present in only a few instances and not be common across instances from the same class, whereas common informative edges for pose will be reinforced across instances belonging to a subset of classes B, and will be easier to discriminate from edges that are common between B and all classes in the parent set C. Note that if no separate orientation channels are considered, i.e. n_\theta = 1, the maximum in (6.2.5) reduces to the single-channel log-likelihood ratio of (6.2.4). Given this log-likelihood gradient distribution, we can randomly sample boxes from positions (x, y) where they are expected to be informative, and reduce the dimension of the feature space by considering only the discriminative HOG blocks. We then use L(B, C) as a proposal distribution to drive the sampling of block locations (x_i, y_i):

(x_i, y_i) \sim L(B, C).    (6.2.7)
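The following hedged sketch covers equations (6.2.4) to (6.2.7): per-orientation-channel log-likelihood ratios, their pixel-wise max and argmax over channels, and weighted sampling of candidate block locations. The orientation-binning scheme and all function names are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np

def oriented_channels(image, n_theta=4):
    # Split the normalised gradient magnitude into n_theta orientation channels.
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy); mag /= (mag.max() + 1e-8)
    theta = np.mod(np.arctan2(gy, gx), np.pi)                  # unsigned orientation
    bins = np.minimum((theta / np.pi * n_theta).astype(int), n_theta - 1)
    chans = np.zeros((n_theta,) + image.shape)
    for t in range(n_theta):
        chans[t][bins == t] = mag[bins == t]
    return chans

def log_likelihood_ratio(images_B, images_C, n_theta=4, eps=1e-6):
    # Per-channel ratio (6.2.4), then max/argmax over channels (6.2.5)-(6.2.6).
    pB = np.mean([oriented_channels(im, n_theta) for im in images_B], axis=0)
    pC = np.mean([oriented_channels(im, n_theta) for im in images_C], axis=0)
    L_theta = np.log((pB + eps) / (pC + eps))
    return L_theta.max(axis=0), L_theta.argmax(axis=0)

def sample_block_locations(L, n_h=100, rng=np.random.default_rng(0)):
    # (6.2.7): sample locations with probability proportional to the positive
    # part of the log-likelihood ratio map.
    w = np.clip(L, 0, None).ravel()
    w = w / w.sum()
    idx = rng.choice(w.size, size=n_h, p=w)
    return np.column_stack(np.unravel_index(idx, L.shape))     # (row, col) pairs
```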

Features are then extracted from areas of high gradient probability across our training set rather than from areas with low probability. By using this information to sample features, the amount of useful information available to learn efficient classifiers is increased.


Figure 6.4: Selection of Discriminative HOG for Facial Expression Classification: Results using this feature selection scheme on facial expression recognition have been reported in [121]. Details of the database are given in figure 6.3. In (a), the log-likelihood map is shown for 2 facial expressions together: joy and disgust. After thresholding (b), the most discriminative areas where HOG blocks will be extracted are obtained (c) to differentiate joy and disgust expressions from the other classes. In (d), we can see the areas covered by the sampled HOG blocks and how the resulting density follows the distribution from (b).

Results using the proposed feature selection scheme applied to facial expression recognition have been reported by Orrite et al. [121]. Figure 6.4a shows the log-likelihood for 2 facial expressions together: joy and disgust. After thresholding (figure 6.4b), the most significant areas where HOG blocks will be extracted are obtained (figure 6.4c), and can be used to differentiate joy and disgust expressions from the other classes. In figure 6.4d, we can see the areas covered by the selected HOG blocks and how the resulting density follows the distribution from figure 6.4b.

6.3 Randomized Cascade of Rejectors

The classifier is an ensemble of hierarchical cascade classifiers. The method takes inspiration from cascade approaches such as [170, 182], hierarchical template trees such as [65, 154] and Randomized Forests such as [6, 21, 29, 98].

6.3.1 Bottom-up Hierarchical Tree construction

Tree structures are a very effective way to deal with large exemplar sets. Gavrila [65] constructs hierarchical template trees using human shape exemplars and the chamfer distance between them. They recursively cluster together similar shape templates, selecting at each node a single cluster prototype along with a chamfer similarity threshold calculated from all the templates that the cluster contains.


Figure 6.5: Bottom-up Hierarchical Tree learning. The hierarchical tree is built using a bottom-up approach by recursively clustering and merging the classes at each level. We present an example of tree construction from the 192 classes defined in [132] on a torus manifold where the dimensions represent gait cycle and camera viewpoint. The matrix presented here (a) is built from the initial 192 classes and used to merge the classes at the very lowest level of the tree. The similarity matrix is then recomputed at each level of the tree with the resulting new classes. The resulting tree is shown in (b) while the merging process on the torus manifold is depicted in (c).

Multiple branches can be explored if edges from a query image are considered to be similar to the cluster exemplars for more than one branch in the tree. Stenger [154] follows a similar approach for hierarchical template tree construction applied to articulated hand tracking, the main difference being that the tree is constructed by partitioning the state space, which includes pose parameters and viewpoint. Although these existing hierarchical template tree techniques are shown to have interesting qualities in terms of speed, they present some important drawbacks for the task we want to achieve. First, templates need to be stored for each node/branch of the tree, leading to memory issues when dealing with large sets of templates. The second limitation is that they require that clean silhouette or template data is available from manual segmentation [65] or generated synthetically from a 3D model [154]. Their methodology cannot be directly applied to complete image frames because of the presence of too many noisy edges from the background and the clothing of the individuals, which dominate the informative pose-related edges.

By using only the silhouette outlines as image features, the approach in [65] ignores the non-negligible amount of information contained in the internal edges, which are very informative for pose estimation applications. We thus propose a solution that adapts the hierarchical tree structure to images by using the oriented gradient maps presented in section 6.2.1.


Instead of successively partitioning the state space at each level of the tree [154] or clustering together similar shape templates bottom-up [65], we propose a hybrid algorithm. Assuming that a parametric model of the human pose is available (3D or 2D joint locations, joint angles), we first partition the state space into a number of classes. Then we construct the tree by merging classes in a bottom-up manner, but using orientated gradient maps calculated over all the training images for a given class. The tree construction process only requires that the cropped training images have been aligned, without the need for clean silhouettes or templates. Note that two different methods have been implemented to obtain the discrete set of classes in the state space. The first consists of clustering the state space, and the second consists of mapping training sequences onto the surface of a 2D manifold whose dimensions represent action and camera viewpoint. The classes are then defined by applying a regular grid on the surface of the manifold, thus discretising action and camera viewpoint (see figure 6.5c for the walking action). The two methods are discussed in the experiments section (§6.4).

The leaves of our tree define the partition of the state space, while the hierarchical tree is constructed in the feature space: the similarity in terms of image features between compared classes increases, and the classification gets more difficult, when going down the tree and reaching lower levels, as in [65]. However, in our case each leaf represents a cluster/class in the state space as in [154], while in [65] templates corresponding to completely different poses can end up being merged into the same class, making the regression to a pose impossible.

The hierarchical tree is built using a bottom-up approach by recursively clustering and merging the classes based on a similarity matrix that is recomputed at each level of the tree (figure 6.5). The similarity matrix M is computed using the L2 distance between the log-likelihood ratios of the hyper-classes C_n, which represent the classes that fall below each node at the current level, and the global edge map C constructed from all the classes together.

M_{i,j} = \| L(C_i, C) - L(C_j, C) \|    (6.3.1)

At each level, nodes are merged by taking the values from the similarity matrix in ascending order and successively merging the corresponding classes until they are all merged, before processing the next level.

At each level of the tree, we repeat the following steps:

1. Compute new edge maps for each new cluster/hyper-class C_n.

2. Compute the log-likelihood ratios L(C_n, C) for the new clusters/classes.

3. Compute the similarity matrix over the clusters at this depth using the L2 distance between the L(C_n, C).

4. Merge classes based on similarity and define new hyper-classes.

This process leads to a hierarchical structure S as represented in figure 6.5.
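An illustrative sketch of one level of this merging follows, assuming the per-class log-likelihood ratio maps have already been computed (e.g. with the sketch after equation (6.2.7)); the pairwise ascending-order merge follows the text above, while the data structures are assumptions.

```python
import numpy as np

def merge_one_level(class_maps):
    # class_maps: one log-likelihood ratio map per current (hyper-)class.
    k = len(class_maps)
    M = np.full((k, k), np.inf)
    for i in range(k):
        for j in range(i + 1, k):
            M[i, j] = np.linalg.norm(class_maps[i] - class_maps[j])  # eq. (6.3.1)
    merged, used = [], set()
    # Visit pairs in ascending order of distance; merge a pair only if neither
    # member has already been merged at this level.
    for i, j in sorted(zip(*np.triu_indices(k, 1)), key=lambda ij: M[ij]):
        if i not in used and j not in used:
            merged.append({i, j})              # indices forming a new hyper-class
            used.update((i, j))
    merged += [{i} for i in range(k) if i not in used]   # any class left unmerged
    return merged
```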

In [65] the number of branches is selected before growing the tree, potentially forcing dissimilar templates to merge too early. In our case, each node in the final hierarchy can have 2 or more branches, and the decision to explore any branch is based on an accept/reject decision detailed in the next section.

Instead of storing and matching an entire template prototype at each node as in [65, 154], we now propose to build a reduced list of discriminative HOG features, thus making the approach more scalable to the challenging size and complexity of human pose datasets.

6.3.2 Randomized Cascades

Discriminative HOG extraction

While other algorithms must extract the entire feature space for all training instances during training, we propose a method to select a small set of discriminative features extracted from a small subset of the training instances at a time. This makes it much more scalable for very large training sets.

For each branch of our hierarchical decision tree, we use our new algorithm for feature selection to build a list of potentially discriminative features, in our case a vector of HOG descriptors. HOG descriptors need only be placed near areas of high edge probability for a particular class. Feature sampling will then be concentrated in locations that are considered discriminative following the intra-class discriminative log-likelihood maps (discussed in §6.2).

For each node, let the set of classes that fall under this node be C_n, and let the subset of classes belonging to a branch b being considered be denoted by C_b. Gradient probability distributions are calculated using L(C_b, C_n) as described in section 6.2.1.

Then n_h locations are sampled from the distribution L(C_b, C_n) to give a set of potentially discriminative locations for HOG descriptors, H_p = \{(x_i, y_i)\}_{i=1}^{n_h} \sim L(C_b, C_n). For each of these positions we sample a corresponding HOG descriptor parameter \psi_i from the set H_\Psi = \{\psi_i\}_{i=1}^{n_h} \in \Psi, drawn from a parameter space \Psi = W \times B \times A, where W = \{16, 24, 32\} is the width of the block in pixels, B = \{(2, 2), (3, 3)\} are the cell configurations considered and A = \{(1, 1), (1, 2), (2, 1)\} are the aspect ratios of the block.


Figure 6.6: HOG block selection. We present an example of a selected HOG block for 2 different branches of the tree. In each case, we show the location of the block on top of the log-likelihood map for this branch on the left. We also represent the 2 distributions of training images that should pass through that branch (Good: green distribution) and images that should not pass (Bad: red distribution). Finally we represent the True Positive (TP) vs False Positive (FP) rates varying the decision threshold. We give precision and recall values for the selected threshold.

For each branch b, a positive set T_b^+ is created by sampling from instances belonging to C_b and a negative set T_b^- is created by sampling from C_n − C_b. An out-of-bag (OOB) testing set is created by removing 1/3 of the instances from the positive and negative sets, and is used for ranking classifier performance. Next, at each location (x_i, y_i) \in H_p, HOG features are extracted from all positive and negative training examples for the node using the corresponding parameters \psi_i \in H_\Psi, and a discriminative classifier g_i is trained using these examples. We then test the classifier on the OOB test instances, select a threshold \tau_i that gives the desired False Positive (FP) and False Negative (FN) rates, and rank the block according to the actual FP and FN rates achieved. For each branch, the N_H HOG blocks that show the best rejection performance are kept in the list of HOG blocks B, which defines our feature vector. If N_B is the total number of branches in the tree, the final list B has N_B \cdot N_H block elements. The tree learning and the feature vector construction are depicted in Algorithm 2.
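Below is a hedged sketch of the per-branch block selection that Algorithm 2 formalises: sample candidate locations from the branch's log-likelihood map, describe each block with a small orientation histogram (standing in for a full HOG block), train a cheap linear rejector per block, and keep the blocks that score best on the held-out OOB split. It reuses oriented_channels() and sample_block_locations() from the sketch after equation (6.2.7); all names, the descriptor and the ranking rule are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def block_descriptor(image, y, x, size=24, n_theta=4):
    crop = image[y:y + size, x:x + size]
    chans = oriented_channels(crop, n_theta)          # from the earlier sketch
    return chans.reshape(n_theta, -1).sum(axis=1)     # tiny orientation histogram

def select_blocks(pos_images, neg_images, L_map, n_h=50, n_keep=5,
                  rng=np.random.default_rng(0)):
    locs = sample_block_locations(L_map, n_h, rng)
    labels = np.r_[np.ones(len(pos_images)), np.zeros(len(neg_images))]
    split = rng.random(labels.size) < 2 / 3           # 2/3 train, 1/3 OOB test
    ranked = []
    for y, x in locs:
        feats = np.array([block_descriptor(im, y, x)
                          for im in pos_images + neg_images])
        clf = LogisticRegression().fit(feats[split], labels[split])
        score = clf.score(feats[~split], labels[~split])   # OOB accuracy as the rank
        ranked.append((score, (y, x), clf))
    ranked.sort(key=lambda r: -r[0])
    return ranked[:n_keep]                            # the N_H best rejectors
```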

By this process, features are extracted from areas of high edge probability across our training set rather than from areas with low probability. By using this information to sample features, the proportion of useful information available for random selection is increased.

Randomization

Random Forests, as described in [29], are constructed as follows. Given a set of training examples T = \{(y_i, x_i)\}, a set (forest) of random trees H is created. For the k-th tree in the forest, a random vector \Phi_k is generated independently of the past random vectors \Phi_1, \ldots, \Phi_{k-1} but with the same distribution. The random vector \Phi_k is used to grow the tree by recursively splitting the data, selecting one or more of the dimensions to split on. This results in a classifier h_k(x, \Phi_k), where x is a test feature vector.

To train the k-th tree, a subset of the training data T_k \subset T is randomly sampled, leaving around one third of the examples out. The unused training examples are kept aside and used later to find an out-of-bag estimate for the forest. After each new tree is added to the forest, an out-of-bag error estimate can be found by aggregating votes for a given training example (y, x) \in T only from those trees whose training subset did not contain that example, i.e. for the k-th tree, votes are counted only for examples where (y, x) \notin T_k. This quantity can be used to estimate the error rate and generalisation error during the training process [28].

When selecting a dimension for splitting the data, a subset of the available dimensions m \ll D is used, and the dimension that best splits the data is selected from that subset. Typically m = \sqrt{D} dimensions are selected at random for each node, where D is the number of dimensions in the feature vector.

The resulting forest classifier H is used to classify a given feature vector x by taking the mode of all the classifications made by the tree classifiers h \in H in the forest.


Algorithm 2: Discriminative HOG Feature Selection
input : Hierarchical structure S, training images, and a discriminative classifier g(·).
output: List of discriminative HOG blocks B.

for each level l do
    for each node n do
        Let C_n = set of classes under n;
        for each branch b do
            Let C_b = set of classes under branch b;
            Compute L(C_b, C_n) (cf. §6.2);
            Sample H_p = {(x_i, y_i)}_{i=1..n_h} ∼ L(C_b, C_n);
            Sample H_Ψ = {ψ_i}_{i=1..n_h} ∈ Ψ;
            for i = 1 to n_h do
                Take (x_i, y_i) ∈ H_p and ψ_i ∈ H_Ψ;
                for all images under n do
                    Extract HOG at (x_i, y_i) using ψ_i;
                Let h_i^+ = HOG from C_b;
                Let h_i^- = HOG from C_n − C_b;
                Train classifier g_i on 2/3 of h_i^+ and h_i^-;
                Test g_i on the OOB set (remaining 1/3 of h_i^+ and h_i^-);
                Select τ_i for the desired FP and FN rates;
                Rank block (x_i, y_i, ψ_i, g_i, τ_i);
            Select the N_H best blocks, B_b = {(x_j, y_j, ψ_j, g_j, τ_j)}_{j=1..N_H};
            Update the list B = B ∪ B_b;


An advantage of this method over other tree-based methods (e.g. single decision trees) is that, since each tree is trained on a randomly sampled 2/3 of the training data [30] and only a subset of the available dimensions m \ll D is used to split the data, the trees grown in the forest are less prone to over-fitting the training data [29].
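The random subspace step at a single Random Forest node can be sketched as follows; the impurity measure and the exhaustive threshold search are deliberately simple illustrations, not the implementation used in the experiments.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_random_split(X, y, rng=np.random.default_rng(0)):
    # Only m = sqrt(D) randomly chosen dimensions are examined for the best split.
    D = X.shape[1]
    dims = rng.choice(D, size=max(1, int(np.sqrt(D))), replace=False)
    best = (None, None, -np.inf)                      # (dim, threshold, gain)
    for d in dims:
        for t in np.unique(X[:, d])[:-1]:             # candidate thresholds
            left = X[:, d] <= t
            gain = entropy(y) - (left.mean() * entropy(y[left])
                                 + (~left).mean() * entropy(y[~left]))
            best = max(best, (d, t, gain), key=lambda b: b[2])
    return best
```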

In our case, an ensemble (a forest) of hierarchical trees R is grown by randomly sampling one of the HOG block rejectors in the list B_b \in B at each branch b of the tree. This gives a random N_B-dimensional vector \Phi_k, where each element corresponds to a branch in the tree structure. The value of each element in \Phi_k is the index of the randomly selected rejector from B_b for its corresponding branch.

Figure 6.7: Rejector branch decision. Shown here is a diagram of a rejector classifier (note that the structure allows for any number of branches at each node and is not fixed at 2). The structure of the tree is defined by S and the blocks stored during training for each branch for this structure are held in B. The random vector Φ defines a cascade by selecting one of the rejector blocks at each branch. New random vectors will make a decision using different block configurations. With many random classifiers, this allows each classifier to make a decision using a different view of the data and offers a little robustness to noise.

For each tree, the decision to explore any branch in the hierarchy is based on an accept/reject decision of a simple binary classifier that works in a similar way to a cascade decision. The resulting forest is then, in fact, a series of hierarchical cascades of rejectors. The resulting classifier R is used to classify a given image I by taking the mode of all the classifications made by the hierarchical cascade classifiers r ∈ R. See figure 6.7.
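An illustrative sketch of this ensemble classification follows. Branch, the rejector interface (an object with an accept(image) -> bool method, e.g. a block classifier thresholded at \tau_i from the selection step) and the recursive traversal are assumed data structures, not the thesis implementation.

```python
import numpy as np

class Branch:
    def __init__(self, rejectors, child):
        self.rejectors = rejectors        # the list B_b learnt for this branch
        self.child = child                # a child node (list of Branch), or an int class id at a leaf

def classify_cascade(node, image, rng):
    # Traverse one randomised cascade: at every branch, one rejector is picked
    # at random (one entry of the random vector Phi) and decides accept/reject.
    votes = []
    for branch in node:                   # a node is a list of branches
        rejector = branch.rejectors[rng.integers(len(branch.rejectors))]
        if rejector.accept(image):
            if isinstance(branch.child, int):
                votes.append(branch.child)            # reached a class leaf
            else:
                votes.extend(classify_cascade(branch.child, image, rng))
    return votes

def classify_ensemble(root, image, num_classes, num_cascades=50, seed=0):
    # Accumulate leaf votes over many randomised cascades; the normalised counts
    # form the distribution over classes described in the text.
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes)
    for _ in range(num_cascades):
        for c in classify_cascade(root, image, rng):
            counts[c] += 1
    return counts / max(counts.sum(), 1)
```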

Since the tree structure S and the HOG block rejectors B remain fixed after training, new classifier trees can be constructed or adapted online by simply creating a new random vector \Phi_k. Performing construction online, each new random vector \Phi_k contributes to the final distribution. As more classifications are made, the distribution converges and becomes stable. This property can be exploited for efficient localisation, as discussed in the experiments section.

The distribution returned from the rejector forest R is the proportion of trees that voted for each class. This can also be interpreted as the probability that a randomly generated cascade r_i will vote for each class at that location. Single hierarchical cascades have good approximate localisation performance, but the disadvantage is that a single fixed rejector cascade may miss some positive detections that would otherwise have been found using more trees.

The vector \Phi_k used to generate a classifier r_k(\Phi_k, S, B) can be regenerated at each location considered. Locations that yield a detection using a single tree, randomised at each position, can then be classified with more trees until the detection result converges. The probability that any given randomly generated rejector r_k will vote close to an object location increases as the position gets closer to the true location of the object. Even if the classification by the initial rejector for the pose class is wrong, it is generally close to an area where the object is (see figure 6.8). Subsequent classifications from other randomised rejectors then push the distribution toward a stable result. Localising using this approximate approach means that a single rejector can be used as a region-of-interest detector for a denser classification. Results using this method are reported in the experiments section.
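A short sketch of this progressive refinement is given below: a location that triggers a detection with a single cascade is re-classified with more cascades until the class distribution stops changing appreciably. It reuses classify_cascade() from the earlier ensemble sketch, and the convergence threshold is an assumption.

```python
import numpy as np

def refine_until_stable(root, image, num_classes, tol=0.01, max_cascades=200):
    counts = np.zeros(num_classes)
    prev = counts.copy()
    rng = np.random.default_rng(0)
    for k in range(1, max_cascades + 1):
        for c in classify_cascade(root, image, rng):   # one more random cascade
            counts[c] += 1
        dist = counts / max(counts.sum(), 1)
        if k > 10 and np.abs(dist - prev).max() < tol:
            return dist                                 # distribution has converged
        prev = dist
    return counts / max(counts.sum(), 1)
```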

Figure 6.8: This figure shows localisation detection results for a single cascade classifier (i.e. the number of cascades in the classifier is 1) for a cropped test image. (a) shows the original test image, and (b) shows the detection results, where white pixels represent positive detections and the blue-tinted area represents negative detections.

This property may also be exploited to spread the computation of image classification and localisation over time. The computations may also be readily parallelised, since the structures S and B are fixed once the classifier has been trained.

In Random Forests [29], the use of randomisation over feature dimensions means that each tree makes a decision using a different view of the data, which makes the forest less prone to over-fitting, more robust to noisy data and better at handling outliers than single decision trees. Each tree learns quite different decision boundaries, but when averaged together the boundaries end up fitting the training data reasonably well. Our randomised cascades algorithm exploits this random selection of features so that the classifier is also less susceptible to over-fitting, and the tree structure enables the algorithm to use only a small number of features from a high-dimensional feature space to make a classification decision.


The decision at each node of the hierarchical cascade classifier is made in a one-vs-all manner, between the branch in question and its sibling branches. In this way multiple branches can be explored in the cascade, allowing it to vote for more than one class, so the final classification is a distribution over classes. This is a useful attribute for classifying potentially ambiguous classes, allows the Random Cascade ensemble to produce a distribution over multiple likely poses, and is useful for tracking algorithms.

In Random Forests, the trees have to be completely traversed to produce a vote/classification, which can make them expensive for scanning window detectors. The randomized cascades algorithm learns a hierarchy of cascade classifiers, selecting at each node of the cascade a single feature to discriminate between each branch and its sibling branches at that node. Due to the construction of the tree, our hierarchical cascade classifier can efficiently reject negative candidates by sampling only a few features of the available feature space.

Randomised Forests require that the full feature space be available from an image during training. This can lead to very high-dimensional feature vectors being extracted from an image for large configurations of features, and potentially to memory problems for training sets with a large number of examples. Randomised Cascades, however, select the features they consider to be the most informative during training, and can build a smaller, more useful feature space by sampling a small set of features from a much larger configuration of feature descriptors.

6.3.3 Application to Human Pose Detection

Given a new image, a sliding-window mechanism localises the individual within that image and, at each location and scale, the window is classified by our multi-class pose classifier as containing a human pose or not. Since we learn a classifier that is able to discriminate between very similar classes, we can also tackle localisation by including random background images in the negative examples during training of each rejector in the cascade to discriminate human pose from the background (see figure 6.10).

Selecting and normalising a bounding box in the input image, each tree gives a binary decision for each class, resulting in a distribution over all the classes when considering the entire forest. The decision can then be taken based on this distribution, for instance by choosing the class that received the most votes. Taking the maximum classification value over the image (after exploring all the possible positions, scales and orientations) results in reasonably good localisation of walking pedestrians, as shown in figure 6.9 and figure 6.10.
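A minimal sliding-window sketch of this scan is given below: every location and scale is cropped, resized to the classifier's normalised window (96x160 in the experiments) and scored with the ensemble, and the maximum response gives the localisation. It reuses classify_ensemble() from the earlier sketch; the resizing helper, step sizes and scale set are assumptions.

```python
import numpy as np
from skimage.transform import resize

def scan_image(image, root, num_classes, window=(160, 96),
               scales=(1.0, 0.75, 0.5), step=16):
    best = (0.0, None)                                  # (score, (x, y, scale, dist))
    for s in scales:
        scaled = resize(image, (int(image.shape[0] * s), int(image.shape[1] * s)))
        H, W = scaled.shape[:2]
        for y in range(0, H - window[0], step):
            for x in range(0, W - window[1], step):
                crop = scaled[y:y + window[0], x:x + window[1]]
                dist = classify_ensemble(root, crop, num_classes)
                score = dist.max()                      # confidence of the best class
                if score > best[0]:
                    best = (score, (x, y, s, dist))
    return best
```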



Figure 6.9: Pose Detection. (a) Input image from the moving camera sequence from [57], scanning in the X and Y directions of the image. (b) Resulting cropped image and pose corresponding to the peak resulting from the classification using the Random Forest. (c) Resulting distribution over the 192 classes after classification using the Random Forest. We also represent this distribution on the 3D and 2D representations of the torus manifold.

Once a human has been detected and classified, the 3D joints for this image can be estimated by weighting the mean poses of the classes in the resulting distribution, using the distribution values as weights, or regressors can be learnt as in [116]. The normalised 2D pose is computed in the same way and transformed back from the normalised bounding box to the input image coordinate system, giving the 2D joint locations.
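A sketch of this pose read-out is shown below: the final 3D (or normalised 2D) pose is the mean pose of each class weighted by the class distribution returned by the ensemble. The class_mean_poses array (num_classes x num_joints x 3, built from the training data) is an assumption.

```python
import numpy as np

def estimate_pose(class_distribution, class_mean_poses):
    # Weighted average of the per-class mean poses, weights from the ensemble.
    w = class_distribution / max(class_distribution.sum(), 1e-8)
    return np.tensordot(w, class_mean_poses, axes=1)    # shape: (num_joints, 3)
```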


Figure 6.10: Pose Detection. (top) Input image. (middle) Resulting saliency map when taking the mode of each distribution for each scanned sub-image. (bottom) Saliency map obtained after adding hard examples as negative training images in the cascade learning process.

6.4 Experiments

Before training the algorithm, strong correspondences are established between all the training images. This is done by aligning them automatically using the 2D joint locations (the complete process is described in our previous work [132]), thus making the selection of useful features much easier. After that is done, classes need to be defined to train our pose classifier.


Figure 6.11: Pose detection results on the HumanEva II dataset. Top row: normalised 96x160 images corresponding to the peak obtained when applying the pose detection to one of the HumanEva II sequences (subject S2 and camera 1). For each presented frame (1, 50, 100, 150, 200, 250, 300 and 350) the resulting pose is represented on top of the cropped image. Bottom row: mean 2D error (in pixels) plots for the same sequence using the Random Forest and [131].

Subject   Camera   Frames   Mean (Std) from [131]   Mean (Std) using RT
S2        C1       1-350    16.96 (4.83)            12.98 (3.5)
S2        C2       1-350    18.53 (5.97)            14.18 (4.38)

Table 6.1: 2D Pose Estimation Error on the HumanEva II dataset. The table shows the mean (and standard deviation) of the 2D joint location error using the silhouette-based model from [131] and our Randomized Cascade (RT). Results are given in pixels, calculated as the 2D distance between the ground truth and estimated 2D positions of each joint.

6.4.1 Preliminary results training on HumanEva

The first dataset we consider is the HumanEva dataset [150]. This dataset consists of 3 subjects performing a number of actions (e.g. walking, running, gesture), all recorded in a motion capture environment so that accurate ground truth data is available. Applying the process described above to this data (3 subjects and 7 camera views) for training, we generate a very large dataset of more than 40,000 aligned and normalised 96x160 images of walking people, with corresponding 2D and 3D poses.

To define the set of classes, we propose to utilise a 2D manifold representation where the action (consecutive 3D poses) is represented by a 1D manifold and the viewpoint by another 1D manifold. Because of the cyclic nature of the viewpoint parameter, if it is modelled with a circle the resulting manifold is in fact cylindrical.

When the action is cyclic too, as with gait, jogging etc., the resulting 2D manifold lies on a "closed cylinder" topologically equivalent to a torus [95, 131]. The walking sequences are then mapped onto the surface of this torus manifold and classes are defined by applying a regular grid on the surface of the torus, thus discretising gait and camera viewpoint. By this process, we create a set of 192 homogeneous classes with about the same number of instances.
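A small sketch of this class definition is given below: gait phase and camera viewpoint are both cyclic, so a pose instance is mapped to a cell of a regular grid on the torus surface. The 24 x 8 = 192 split matches the class count used here, but the exact grid proportions are an assumption.

```python
import numpy as np

def torus_class(gait_phase, viewpoint, n_phase=24, n_view=8):
    # gait_phase and viewpoint are angles in [0, 2*pi); the modulo gives the
    # wrap-around that makes the quantisation toroidal.
    i = int(np.floor((gait_phase % (2 * np.pi)) / (2 * np.pi) * n_phase))
    j = int(np.floor((viewpoint % (2 * np.pi)) / (2 * np.pi) * n_view))
    return i * n_view + j          # class index in [0, n_phase * n_view)
```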

We choose to define a non-overlapping class quantisation because of a property of our hierarchical cascade classifier: each cascade compares feature similarity rather than making a greedy decision, and so can traverse more than one branch of the hierarchical cascade. When an image reaches a node with a decision between very similar classes (which are consequently close in pose space), it is possible for a query image that lies close to the quantisation border between those two classes to arrive at both class leaves.

Figure 6.12: Pose detection result with a moving camera. Normalised 96x160 images corresponding to the peak obtained when applying the pose detection to a moving camera sequence (from [57]). For each presented frame, the mean 2D pose corresponding to the "winning" class is represented on top of the cropped image, while the corresponding 3D pose is presented just below.

The resulting cascade classifier is first validated in similar conditions (indoor with a fixed camera) using the HumanEva II dataset (figure 6.11). We apply a simple Kalman filter to the position, scale and rotation parameters along the sequence and locally look for the maxima, selecting only the probable classes based on spatio-temporal constraints (i.e. transitions between neighbouring classes on the torus). By this process we are not guaranteed to reach the best result, but we reach a reasonably good one in relatively few iterations. Note that since our pose detector was learnt only from walking human sequences, we do not attempt to detect people performing actions other than walking. Quantitative evaluation on the HumanEva II dataset is provided in table 6.1. The resulting cascade classifier is also applied to a moving camera sequence as in [44] (figure 6.12).


Although the cascade classifier shows some very good results in similar environments and conditions (figure 6.11 and table 6.1), it can be observed in figure 6.12 that the classifier is unable to make a good pose estimate if the gait is too wide (e.g. when the subject was walking at speed). This is due to the low variability in pose present in the HumanEva walking database. Even though 40,000 images are available, they are not representative of the variability in gait style, since only 3 subjects walking at the same speed have been considered.

Additionally, HumanEva has very little background (one unique capture room) and clothing (capture suit) variation, which makes the classifier not robust to cluttered backgrounds.

6.4.2 Experimentation on MoBo dataset

Our second set of experiments is performed on the CMU MoBo dataset [72]. Again, we consider the walking action but this time add more variability to the dataset by including 15 subjects, two walking speeds (high and low) and a discretised set of 8 camera viewpoints θ, uniformly distributed between 0 and 2π. Between 30 and 50 frames of each sequence were selected in order to capture exactly one gait cycle per subject. By this process we generate a training database encompassing around 8000 images and the corresponding 2D and 3D pose parameters. The same alignment procedure is applied to this dataset as with the HumanEva dataset, but in addition we use hand-labelled silhouette information to superimpose each of the instances on random backgrounds to increase the variability of the non-human area of the image (see figure 6.13). For more details about manual labelling of silhouettes and pose, please refer to [131].

We observed that the classification in section 6.4.1 was very strict, so that very fine alignment was necessary during localisation. Looser alignment in the training should allow for a smoother detection confidence landscape. Following [94] and [55], the training set is augmented by perturbing the original examples with small rotations and shears, and by mirroring them horizontally, which improves the generalisation ability of the classifier (a sketch of this augmentation step is given below). The augmented training set is 6 times larger and contains more than 48,000 examples with different backgrounds. The same dataset is generated for the 3 following configurations: original background, no background, and random background. This dataset enables the effect of cluttered backgrounds on classification to be measured. The richer background variation compared to the HumanEva dataset allows a more robust cascade to be learnt, as demonstrated later by our experiments.
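A minimal sketch of this kind of augmentation, assuming OpenCV is used for the image warps; the particular rotation angle, shear amount and mirroring below are illustrative values, not those used in the original experiments.

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: augment one normalised training image with a small rotation,
// a small shear, and a horizontal mirror. Parameter values are
// illustrative assumptions only.
std::vector<cv::Mat> augmentExample(const cv::Mat& image)
{
    std::vector<cv::Mat> out;
    const cv::Point2f centre(image.cols * 0.5f, image.rows * 0.5f);

    // Small in-plane rotation (here +5 degrees).
    cv::Mat rot = cv::getRotationMatrix2D(centre, 5.0, 1.0);
    cv::Mat rotated;
    cv::warpAffine(image, rotated, rot, image.size(), cv::INTER_LINEAR, cv::BORDER_REPLICATE);
    out.push_back(rotated);

    // Small horizontal shear about the image centre.
    const double shear = 0.05;
    cv::Mat shearM = (cv::Mat_<double>(2, 3) << 1.0, shear, -shear * centre.y,
                                                0.0, 1.0,   0.0);
    cv::Mat sheared;
    cv::warpAffine(image, sheared, shearM, image.size(), cv::INTER_LINEAR, cv::BORDER_REPLICATE);
    out.push_back(sheared);

    // Horizontal mirror.
    cv::Mat mirrored;
    cv::flip(image, mirrored, 1);
    out.push_back(mirrored);

    return out;
}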


Figure 6.13: MoBo dataset. (Top) Examples of original MoBo training images from the 6 training views. (Bottom) Examples of the normalised 96x160 images with random backgrounds.

Class definition

Since this is a discriminative approach, the pose space must be discretised into a distinct set of classes. Class definition is not a trivial problem because changes in both the human motion and the viewpoint are continuous, not discrete. In other words, it is not trivial to decide where one class ends and the next one starts. In [140], the authors define the pose neighbours based on the distance between 3D joints. This definition produced good results in the absence of viewpoint changes. However, two poses which are exactly the same in the pose space could still have completely different appearances in the images due to changes in viewpoint. Thus it is critical to consider viewpoint information in the class definition.

To our knowledge, no previous work has attempted to efficiently define classes for human pose classification and detection. In section 6.4.1, a 2D manifold of pose plus viewpoint was discretised (gait and viewpoint). In [116] the 3D pose space was clustered and a discrete set of cameras was also considered. [65] clustered the silhouette space using binary edge maps. Ferrari et al. [55] defined classes as 3D pose but only considered frontal views and a limited set of poses. In PSH, the neighbourhood is automatically built in 3D pose space, but again only frontal views are considered.

Classes need to be defined in feature space or in a 2D projection of the pose space (combined pose and viewpoint), which is the most representative of the pose information that can be extracted from the image.

When we considered the HumanEva dataset (section 6.4.1), classes were defined by applying a regular grid on the surface of a torus manifold (representing gait cycle position and viewpoint). The discretisation process thus assigns the same number of classes to quite different viewpoints. This means that, for instance, frontal and lateral viewpoints of a walking cycle were quantised with an equal number of classes, despite the difference in appearance variability. However, since there is much less visual change over the gait cycle when viewed from the front or back than for lateral views, differences between classes do not reflect the same amount of change in visual information over each of these views. This over-quantisation of visually quite similar views can make the class data unrealistically difficult to separate for certain viewpoints, and can introduce some confusion when a cascade must classify those examples. Therefore it is important to define homogeneous classes. Too many class clusters become too specific to a particular subject or pose, and do not generalise well enough. Too few clusters can merge poses that are too different, so that no feature can be found to represent all the images of the class.

Figure 6.14: Defined Classes on MoBo. The resulting 64 classes from the MoBo walking dataset. Note that each row corresponds to one of the 8 training views available in the dataset.

In the automatic class definition proposed here, each view is quantised into a number of classes that reflects the variability in appearance for that view, so frontal views of walking are given a coarser quantisation (i.e. fewer classes) than the lateral views.
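One simple way to realise such a variability-driven quantisation is sketched below: the number of classes allocated to each view is made proportional to a per-view measure of appearance variability. This is only an illustration of the idea; the variability measure (a single score per view, e.g. total pixel variance of the aligned images) and the allocation rule are assumptions, not the procedure used in the thesis.

#include <vector>
#include <cmath>
#include <algorithm>
#include <iterator>

// Sketch: allocate class counts per viewpoint in proportion to a per-view
// appearance-variability score, so that visually stable views (e.g. frontal)
// receive fewer classes than highly varying lateral views.
// The variability measure and rounding rule are illustrative assumptions.
std::vector<int> allocateClassesPerView(const std::vector<double>& viewVariability,
                                        int totalClasses)
{
    double sum = 0.0;
    for (double v : viewVariability) sum += v;
    if (sum <= 0.0) sum = 1.0;  // guard against degenerate input

    std::vector<int> classesPerView(viewVariability.size(), 1);
    int allocated = 0;
    for (std::size_t i = 0; i < viewVariability.size(); ++i)
    {
        // At least one class per view; otherwise proportional to variability.
        int n = std::max(1, static_cast<int>(std::round(totalClasses * viewVariability[i] / sum)));
        classesPerView[i] = n;
        allocated += n;
    }

    // Crude correction so the total matches exactly: absorb the rounding
    // error into the most variable view.
    std::size_t most = std::distance(viewVariability.begin(),
                                     std::max_element(viewVariability.begin(), viewVariability.end()));
    classesPerView[most] += totalClasses - allocated;
    return classesPerView;
}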

Baseline Results

We first ran some benchmark experiments to obtain initial baseline results with three state-of-the-art (pose) classifiers: PSH [140], Random Forest (RF) [29], and SVMs [116]. A training subset is first built by randomly selecting 10 of the 15 available subjects from the database, and a testing subset from the remaining 5 subjects. Figure 6.15 compares the performance using segmented images without background and images with a random background. Table 6.2 gives the corresponding training times.

Figure 6.15: Pose Classification Baseline Experiments on MoBo. Several classifiers (PSH, Random Forest and SVMs) are trained on a subset of the MoBo database containing 10 subjects and tested on another subset containing 5 different subjects. We compare the results using segmented images without background (a) and images with a random background (b). The dimensions are from different sized HOG feature grids: 4 × 4, 5 × 5, 7 × 12, and all three grids combined into one feature vector.

The first observation we made is that PSH does not perform well in the presence of cluttered backgrounds while performing decently on segmented images. PSH tries to make a decision based on 1-bin splits, as with RF. This could well be a reason for the difference in performance, as the histogram will be altered by the presence of background edges and the value in the selected bin will be affected.

HOG Grid                 4x4     5x5     7x12    3 together
Dimension                1152    1800    2688    5640
Background      RF       5h30    6h45    13h00   21h45
                SVMs     2h00    4h00    8h00    17h30
                PSH      1h50    2h30    3h00    5h30
No Background   RF       4h40    5h30    11h00   18h30
                SVMs     45min   52min   1h00    2h20
                PSH      1h20    2h20    3h08    5h30

Table 6.2: Classifiers - Training Time. Table showing training times for the experiments reported in figure 6.15.

In the presence of cluttered backgrounds, SVM seems to be the best classifier in terms of accuracy but learning an SVM per class is not very scalable.

Localisation Results

To construct our localisation dataset, we took images from several different datasets to compare each of the algorithms in different environments and at different subject scales. These images were taken from HumanEva [150], CamVid [31], INRIA [40], some images of pedestrians collected from movie sequences, and some images captured from a web camera in a simple lab environment. See figure 6.16 for some selected images from this dataset. We then manually annotated each of these images so that localisation accuracy could be determined.

Figure 6.16: Localisation Dataset. Sample images from the localisation dataset. From left to right: HumanEva, INRIA, movie sequences, CamVid, and lab sequence.


Each of the classifiers was trained on human subjects from the aligned MoBo dataset, combined with random hard background examples sampled from the INRIA dataset. To create the hard example dataset, the classifiers were run on negative examples from the INRIA dataset, and strong positive classifications were incorporated as hard examples. The hard examples were added to each of the classifiers as a background class, so that there are a total of 65 classes: 1 for background and 64 for pose. The resulting classifiers were then run on the localisation dataset at the same scales and with two different strides: 4 and 8 pixels. See figure 6.17 for the results; a sketch of the hard-example mining step is given below. The classification times for the three classifiers are shown in table 6.3.
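The following is a minimal sketch of this hard-example mining step under assumed interfaces (a Classifier with a score() method, a sliding-window scan over negative images, and a score threshold); it is not the actual implementation used here.

#include <vector>

// Assumed, simplified types used only for this sketch.
struct Window { int x, y, w, h; };
struct Image  { int width, height; /* pixel data omitted */ };

struct Classifier
{
    // Confidence score for a window; higher means "more likely a person".
    // Placeholder stub for the sketch, not a real classifier.
    double score(const Image&, const Window&) const { return 0.0; }
};

// Sketch of hard-negative mining: scan person-free images with the current
// classifier and keep windows that it scores highly. These windows are then
// added to training as examples of the extra "background" class.
std::vector<Window> mineHardExamples(const Classifier& classifier,
                                     const std::vector<Image>& negativeImages,
                                     double scoreThreshold,
                                     int stride, int winW, int winH)
{
    std::vector<Window> hardExamples;
    for (const Image& img : negativeImages)
    {
        for (int y = 0; y + winH <= img.height; y += stride)
        {
            for (int x = 0; x + winW <= img.width; x += stride)
            {
                Window w{x, y, winW, winH};
                // A strong positive response on a negative image is a hard example.
                if (classifier.score(img, w) > scoreThreshold)
                    hardExamples.push_back(w);
            }
        }
    }
    return hardExamples;
}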

Figure 6.17: Detection Results. Detection results on the combined localisation dataset for two different strides. The green line indicates a reference comparison point of a 1 × 10⁻⁴ FPPW detection rate, which corresponds to one false positive per 10,000 classification windows.

The reference point of 1 × 10⁻⁴ FPPW is arbitrary, but it is a reasonable comparison point given that the cascade classifier’s worst false positive performance is near that value, and it has been used in other detection work such as Dalal and Triggs [40]. At this rate we achieve a detection performance similar to RF and are close in accuracy to the multi-SVM classifier at the same FPPW rate.


            Multi-SVM   Random Forest   Cascade (t25 h300)
Stride 4    15h03       5h12            16h32
Stride 8    3h36        31m             3h42

Table 6.3: Classifiers - Detection Time. Table showing time taken for detection on the complete test dataset used in the experiments reported in figure 6.17.

We can actually achieve a higher detection rate than the multi-SVMs if we consider a cascade that uses only the 20 best HOG classifiers at each node instead of the best 300. This constrains the features that our cascade can randomly draw from, but if a higher rate of 15 × 10⁻⁴ FPPW is acceptable, then we achieve nearly 90% accuracy, as illustrated in figure 6.18. This is 10% higher than the highest multi-SVM detection rate at 3 × 10⁻⁴ FPPW.

Figure 6.18: Best Cascade Localisation Results. At a higher false positive rate of 15 × 10⁻⁴ FPPW, the cascade classifier yields nearly a 90% detection rate.

Although our method is the slowest of the three, it draws features from a much larger feature space, in the region of 200,000 dimensions. Because of this we are able to consider many more feature configurations than either the Random Forest or the multi-SVM method. Typically our cascades take around 2 hours to train, comparable to the times reported in table 6.2, even given the high dimensional feature space used by our classifier.


6.5 Conclusions and Discussions

We have presented a novel approach to exemplar-based human pose detection and recognition using randomized trees. Unlike most previous work, this pose detection algorithm is applicable to more challenging scenarios involving extensive viewpoint changes and moving cameras, without any prior assumption of an available segmented silhouette or normalised input bounding box. Moreover, the random cascade classifier outputs a distribution over possible poses that can be useful when combined with tracking algorithms for resolving ambiguities in pose.

The construction of the initial hierarchical tree structure is somewhat similar to Gavrila [65], even if, in our case, the leaves are the classes we previously defined. However, it presents the following key differences from that work: first, the selection of the relevant HOG feature blocks at each node of the structure using a log-likelihood ratio; and secondly, the way we grow an ensemble of trees by randomly sampling one of those selected HOG blocks at each node. This randomness makes the algorithm more robust to noise compared to a single hierarchical decision tree that would use all the features once. More importantly, it leads to a distribution over the classes that is useful for tracking, as in [156].

The algorithm used to train the Random Cascade classifier proposed in this chapter shares some similarities with PSH (Shakhnarovich et al. [140]). In PSH, the authors learn a set of hashing functions which efficiently index examples for the pose recognition task; the hash functions are sensitive to the similarity in the parameter space. In our work, we use the pose space to define classes that are then used to select relevant features and thresholds in the image space. Secondly, the hierarchical structure we build is similar to the set of projections found by PSH that best define nearest neighbourhoods.

Although this work shares some similarities with PSH, it differs in several respects. First, thanks to the method used for class definition, the algorithm can manage extensive viewpoint changes. Secondly, the search through the training data set is hierarchical, which means that if the input vector is rejected after the first node, the classifier does not have to extract all the other features. Finally, multiple branches can be explored as in [65], while PSH only considers binary splits. Deciding on a threshold is not an easy problem, as each of the class supersets to be separated at a given node has instances with some similarities to each other, so exploring multiple branches is necessary.

In future work, we will consider different actions and combine this approach with a pose tracking algorithm to make the system more robust, and improve the search method to achieve real-time performance.

CHAPTER 7

Conclusion

This thesis proposed a number of algorithms that can be used for creating novel interactions within computer games. The first approach was that of examining local image changes to indirectly locate the movements of the player in the live video stream by detecting changes between the current frame and a reference frame. Fast local features were used to make the detection efficient, and comparisons were made to other standard local correlation methods such as NCC and the more shadow-invariant CNCC algorithm. The second approach used a standard face detection algorithm to initialise a simple shape prior used in an ObjCut graph cut segmentation algorithm, to demonstrate a real-time face segmentation application and that the segmentation energy can be used to reject false positives from the detector under certain conditions. The third approach formulates the chamfer template matching algorithm as an SVM classifier where the weights represent the shape template. This allows the shape template to be learnt automatically from training data even when background information is present in the positive examples, and no pre-processing to extract silhouettes is necessary. The fourth approach presented was that of using log-likelihood ratios between edge distributions of different classes as a prior for sampling feature locations. This allows a wider range of feature configurations to be considered at locations that are more likely to be discriminative.

7.1 Summary of Contributions

LBP3 The LBP3 local descriptor is an extension to the LBP algorithm that can handle homogeneous regions that have a very low local variance in intensity due to the presence of camera sensor noise, and prevents these regions from being described as having texture. Instead of comparing the intensities of the pixels in the descriptor, the comparisons are made between weighted sums of the intensities around each of the respective pixels. This weighted comparison is made efficient by convolving the image with a Gaussian kernel and performing the comparisons on the convolved image. To alleviate the problem of boundary conditions over time for intensity differences that fluctuate on the border of two labels (e.g. between Similar and Greater), a temporal hysteresis threshold method was proposed and results were presented demonstrating its effect on the descriptor stability. A sketch of these ideas is given below.
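As a rough sketch of these two ideas, assuming the centre/neighbour difference has already been computed on a Gaussian-smoothed frame, and using illustrative threshold values that are not taken from the thesis:

enum Label { LESS = 0, SIMILAR = 1, GREATER = 2 };

// Sketch: 3-state comparison of a pixel with one neighbour, where 'diff' is
// the difference between the Gaussian-smoothed neighbour and centre values.
// A temporal hysteresis keeps the label from the previous frame unless the
// difference clears a higher threshold, which suppresses flicker caused by
// sensor noise. Threshold values are illustrative assumptions.
Label compareWithHysteresis(float diff, Label previousLabel,
                            float tauLow = 2.0f, float tauHigh = 4.0f)
{
    switch (previousLabel)
    {
    case GREATER:
        // Stay GREATER while the difference remains above the low threshold.
        if (diff > tauLow) return GREATER;
        break;
    case LESS:
        if (diff < -tauLow) return LESS;
        break;
    case SIMILAR:
        // Leave SIMILAR only when the difference clears the high threshold.
        if (diff >  tauHigh) return GREATER;
        if (diff < -tauHigh) return LESS;
        return SIMILAR;
    }

    // Previous label no longer holds: re-label using the high threshold.
    if (diff >  tauHigh) return GREATER;
    if (diff < -tauHigh) return LESS;
    return SIMILAR;
}

// The smoothed values would be produced once per frame, e.g. by convolving
// the frame with a small Gaussian kernel before taking the differences.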

Fast Face Segmentation using Detectors and Simple Shape Priors An algorithm was presented that combines an off-the-shelf face detection algorithm to localise a detection with a graph cut segmentation algorithm, using the detection to derive the initialisation parameters of a simple shape prior. This demonstrated that very simple shape priors can be sufficient to segment the face area from a detection in real time using graph cut. Initial results were also presented on using the segmentation energy to reject false positives under certain conditions. The detection initialisation approach was also applied to the non-real-time problem of upper body pose estimation, with some results of optimising over the pose parameters after initialisation.

SVM Formulation of Chamfer Template Matching for Detection A formulation of the chamfer template matching algorithm was presented that expresses the shape template as the weights of a linear SVM classifier. This formulation allows a shape template to be learnt automatically from training data even in the presence of background information, without having to first process the training examples to remove edges from the background. The method was compared to a state of the art detection method and results were presented on a pedestrian and an upper body dataset. A sketch of the scoring step implied by this formulation is given below.
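The key observation is that a chamfer score over a distance-transformed edge image can be written as a dot product between a (learnt) weight template and the distance transform, which is exactly the form of a linear SVM decision function. The sketch below illustrates only that scoring step, with an assumed row-major array layout; it is not the training code.

#include <vector>
#include <cstddef>

// Sketch: evaluate a linear SVM whose weights play the role of a chamfer
// shape template. 'distanceTransform' holds, for every pixel of a window,
// the distance to the nearest edge pixel (row-major, width*height values),
// and 'weights' is the learnt template of the same size. A conventional
// chamfer match is the special case where the weights are non-zero only on
// template edge locations (up to sign and normalisation).
double chamferSvmScore(const std::vector<double>& weights,
                       const std::vector<double>& distanceTransform,
                       double bias)
{
    double score = bias;
    const std::size_t n = weights.size();
    for (std::size_t i = 0; i < n && i < distanceTransform.size(); ++i)
        score += weights[i] * distanceTransform[i];

    // With an SVM-style convention, score > 0 could be taken as a detection;
    // learnt weights may be negative, unlike a hand-built chamfer template.
    return score;
}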

Edge Log-likelihood Ratios for Discriminative Location Selection A novel algorithm was presented that constructs a hierarchical cascade of rejectors. At each branch in the hierarchy, the edge distributions present in the training data of the classes that fall beneath that branch are accumulated, and the log-likelihood ratio of the edge distribution of the current branch versus the other branches at the same node is used to find potentially discriminative locations at which to sample features. This allows the SVM classifiers at each branch to use the log-likelihood ratio as a location prior favouring locations that are more discriminative. A sketch of this location prior is given below.
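A minimal sketch of such a location prior, assuming per-pixel edge probability maps have already been accumulated for the current branch and for its sibling branches; the clamping and normalisation details are illustrative assumptions.

#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Sketch: given per-pixel edge probabilities for the classes under the
// current branch and for the classes under the sibling branches, compute a
// log-likelihood ratio map and normalise its positive part into a sampling
// prior over feature locations. Pixels where edges are much more likely for
// the current branch than for its siblings get a high sampling probability.
std::vector<double> edgeLogLikelihoodRatioPrior(const std::vector<double>& pEdgeCurrent,
                                                const std::vector<double>& pEdgeOthers)
{
    const double eps = 1e-6;  // avoids log(0); illustrative value
    std::vector<double> prior(pEdgeCurrent.size(), 0.0);

    double total = 0.0;
    for (std::size_t i = 0; i < pEdgeCurrent.size() && i < pEdgeOthers.size(); ++i)
    {
        const double llr = std::log(pEdgeCurrent[i] + eps) - std::log(pEdgeOthers[i] + eps);
        prior[i] = std::max(llr, 0.0);   // keep only locations favouring this branch
        total += prior[i];
    }

    // Normalise so the map can be used directly as a discrete sampling
    // distribution over candidate feature locations.
    if (total > 0.0)
        for (double& p : prior) p /= total;

    return prior;
}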


7.2 Future Work

In future work, the LBP3 algorithm could be improved by considering different local descriptors in place of the weighted sum of the Gaussian filter. Using a shadow-invariant feature such as CNCC could improve its robustness to the effects of shadow.

Since the LBP3 algorithm (as well as the NCC and CNCC methods tested here) holds a background reference image to compare new frames against, if the camera drifts slowly over time (as illustrated in figure 3.15), or is shaken suddenly by movements of the player during the game, differences will be registered in the difference map as the frame moves out of alignment with the reference image. Applying basic image stabilisation methods to correct for gradual or sudden drift should address this problem.

The face segmentation algorithm could be improved by using knowledge of previous good detections in addition to temporal smoothing. This can be applied to avoid detections that oscillate between adjacent discrete detection scales when the true scale of the human is part way between them. Other temporal information, such as initialising the model with the optimal parameters from the previous frame Θ∗k[t − 1], could take the performance of an upper body model a step closer to real-time speed.

Applying the Chamfer SVM formulation in situations where templates of the objects could be learnt on-line, using linear SVM methods such as SGD-QN [19], would be interesting. In addition, comparing the performance of this algorithm to existing applications where hierarchical chamfer template matching trees are used would also be an interesting area to explore.

Making the localisation computation of the hierarchical randomised cascade of rejectors more efficient is an interesting area of future work. The optimal resolution at which the classifier operates for efficient localisation could be explored by varying the resolution of the training data. Different feature types used in the cascade could also be explored. Results on upper body pose estimation would be another interesting application, but at the time of writing a sufficiently well aligned upper body pose dataset covering a sufficient number of poses was not available.

APPENDIX A

Code Documentation

A.1 System Overview

The algorithms used in this thesis can be broadly categorised into two main projects, CapStation and HogLocalise. Each of these main projects depends on two shared utility projects called Common and ImageIO. Figure A.1 shows the high level dependencies between the project components.

Figure A.1: System diagram of the main applications used in the thesis. The shared components are in green, the main projects in blue, and external dependencies or libraries are in yellow. Arrows indicate directional dependencies, and circular connectors indicate the type of interface between components.

A.1.1 Shared Projects

These projects contain a number of type definitions, helper functions, and image manipulation functions that are used throughout the source code used in this thesis.


Common

The Common project defines some common integral types and floating point types. The project also provides some functions for edge detection, distance transforms, chamfer matching, managing reference counted objects, file names and path manipulation.

ImageIO

This project contains a C++ class template encapsulating basic image manipulation operations in a way that is simple and flexible. It also defines some common pixel formats and interpolation and cropping operations. The Image class can be used to represent a 2D array of any container-safe structure (i.e. structures that have copy-safe assignment and constructor operations) and provides a means to allow interpolation between these types.

The Image class implements an indirect mechanism for sub-image references, so that regions of an image can be manipulated directly without duplicating data, and behave just like a regular image. This means code can be developed to operate on an image without caring whether the image presented to it is the complete image or a smaller region of a much larger image. A sketch of this kind of mechanism is given below.
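A minimal sketch of such a sub-image mechanism, where a view shares the parent's pixel buffer via an offset and a row stride; the class and member names here are illustrative and do not reproduce the actual ImageIO interface.

#include <memory>
#include <vector>
#include <cassert>

// Sketch of an image view that shares its parent's storage. A sub-image is
// just another ImageView with a different origin, width and height, so code
// written against ImageView works identically on full images and regions.
template <typename T>
class ImageView
{
public:
    ImageView(int width, int height)
        : data_(std::make_shared<std::vector<T>>(width * height)),
          offset_(0), stride_(width), width_(width), height_(height) {}

    // Create a view onto a rectangular region without copying any pixels.
    ImageView subImage(int x, int y, int w, int h) const
    {
        assert(x >= 0 && y >= 0 && x + w <= width_ && y + h <= height_);
        ImageView view(*this);
        view.offset_ = offset_ + y * stride_ + x;
        view.width_  = w;
        view.height_ = h;
        return view;  // stride_ stays that of the parent image
    }

    T&       at(int x, int y)       { return (*data_)[offset_ + y * stride_ + x]; }
    const T& at(int x, int y) const { return (*data_)[offset_ + y * stride_ + x]; }

    int width()  const { return width_; }
    int height() const { return height_; }

private:
    std::shared_ptr<std::vector<T>> data_;  // shared pixel buffer
    int offset_, stride_, width_, height_;
};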

A.1.2 Project: CapStation

The real-time applications discussed in Chapter 3 were implemented using a GUI application developed by the author called CapStation. This program allowed modules to be written and easily added to the main application to filter and process video data from either live camera streams or a video file.

A system diagram of the application is shown in figure A.1. The main application utilises DirectShow and DirectX to interface with the graphics hardware of the host computer. This also allows more complex visualisations to be implemented efficiently.

Algorithms are implemented as separate modules called filters that are detected and loaded by the application at run time. They each have a simple interface that processes a frame and optionally draws a visualisation output over the image. Filters may also create additional windows such as with the rotating cube module.


Figure A.2: Figure showing the main stages of the sliding window algorithm: initialisation, search, feature extraction, and classification.

A.1.3 Project: HogLocalise

The offline algorithms discussed in Chapters 5 and 6 used a more generalised framework to handle multiple types of sliding window classifiers without having to re-implement a great deal of underlying code.

This system is divided into two main areas: classifiers and features. The classifier code is written so that the only thing a classifier deals with is feature vectors, i.e. a vector of real numbers x = {xi}. To allow more flexibility in how the feature vectors are generated, a FeatureVector object provides an abstraction. The on-demand feature sampling described in Chapter 6 is implemented using this mechanism, so that features are only extracted when the corresponding feature element xi is accessed. A table is maintained that maps a feature element xi to a feature type fi so that features can be extracted only when necessary. This can save computation time for classifiers that may only use a sparse set of elements from the feature vector. Figure A.2 shows the stages of the sliding window search algorithm. Applications such as HogLocalise.exe utilise this FeatureVector abstraction to implement the HOG grids required for the classifiers discussed in Chapter 6. A sketch of this on-demand mechanism is given below.
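A minimal sketch of such an on-demand feature vector, where element access triggers extraction of only the feature that element maps to; the interface is illustrative, not the actual FeatureVector class.

#include <vector>
#include <functional>
#include <cstddef>

// Sketch: a lazily evaluated feature vector. Each element index i is mapped
// to an extractor (e.g. "HOG cell j of the current window") that is only run
// the first time element i is accessed, so classifiers that touch a sparse
// subset of dimensions never pay for the full feature extraction.
class LazyFeatureVector
{
public:
    using Extractor = std::function<double(std::size_t)>;

    LazyFeatureVector(std::size_t dimension, Extractor extractor)
        : values_(dimension, 0.0), computed_(dimension, false),
          extract_(std::move(extractor)) {}

    double operator[](std::size_t i)
    {
        if (!computed_[i])
        {
            values_[i] = extract_(i);   // extract only this feature element
            computed_[i] = true;
        }
        return values_[i];
    }

    std::size_t size() const { return values_.size(); }

private:
    std::vector<double> values_;
    std::vector<bool>   computed_;
    Extractor           extract_;       // maps element index -> feature value
};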

The library component called cppHogLocalise allows the localisation algorithm to be used in other applications, such as processing real-time camera streams or videos, and provides a Matlab interface. Both 64-bit and 32-bit Windows are supported.


Data Caching

A cache is implemented so that consecutive calls to the module with the same classifier are efficient. After the first call, the classifier and corresponding HOG window are saved in memory for quick reuse in a second call. If the filenames match, then the previous classifier is used again, otherwise the new classifier data is loaded.

The integral histogram data generated during the search can also be cached. This means that if multiple operations are required on the same image, then the existing integral histograms can be reused.
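This pattern is essentially a keyed single-entry cache; the sketch below illustrates it with a filename key and placeholder types, which are assumptions for illustration rather than the actual module code.

#include <string>
#include <memory>

// Placeholder types standing in for the real classifier and integral
// histogram structures (assumptions for this sketch only).
struct Classifier {};
struct IntegralHistograms {};

inline Classifier loadClassifier(const std::string&) { return Classifier{}; }                       // placeholder loader
inline IntegralHistograms computeIntegralHistograms(const std::string&) { return IntegralHistograms{}; } // placeholder

// Sketch: keep the last classifier and the last image's integral histograms
// in memory, keyed by filename/id, so repeated calls with the same inputs
// skip the expensive loading and recomputation steps.
class LocaliserCache
{
public:
    const Classifier& classifier(const std::string& classifierFile)
    {
        if (classifierFile != classifierKey_)
        {
            classifier_    = std::make_unique<Classifier>(loadClassifier(classifierFile));
            classifierKey_ = classifierFile;
        }
        return *classifier_;
    }

    const IntegralHistograms& histograms(const std::string& imageId)
    {
        if (imageId != imageKey_)
        {
            histograms_ = std::make_unique<IntegralHistograms>(computeIntegralHistograms(imageId));
            imageKey_   = imageId;
        }
        return *histograms_;
    }

private:
    std::string classifierKey_, imageKey_;
    std::unique_ptr<Classifier>         classifier_;
    std::unique_ptr<IntegralHistograms> histograms_;
};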

Efficient Scale Search

Figure A.3: Illustration of the scale search method. For a given image and scales s = {si}, an integral image is computed for each gradient orientation channel that allows for a border sufficiently large for the classifier window to scan the image at its largest size. Keeping the image size fixed so that the integral histograms can be reused, the box scales inversely to the image scale, so the maximum size will be obtained from the minimum specified image scale, i.e. min(s).

Since the classifiers considered all use integral histogram HOG features, they can be rescaled easily for different window sizes without having to recalculate the integral histograms for each scale, just as with the Haar-like integral image features used by Viola et al. [170]; see figure A.4.

When given a set of scales to consider, the algorithm will first find the scale that produces the largest box size with respect to the image area, then accounts for a maximum border to allow the feature window to be placed at the very edge of the image area, and caches the corresponding integral image. Any scale can then be explored by using a sub-region of the cached integral image. Sub-regions can be accessed efficiently without duplicating the data using the Image implementation. This re-use of integral histogram data saves computation time when searching over multiple scales.

Figure A.4: Different scales are queried by rescaling the classifier window, and HOG feature boxes are rescaled in proportion to the new classifier window scale. The sampling coordinates for the HOG boxes can be rescaled at no extra cost in computation time.

As illustrated in figure A.3, different scales can be explored by keeping the source image size fixed and reusing the integral histograms. The classifier window is rescaled to the equivalent scale (1/si) and the integral HOG features are resized with respect to the new window size. Since the integral histogram sampling consists of only 4 coordinate samples for each cell considered, independent of the size of the HOG feature, the features can be scaled with respect to the new window size at no additional computation time, allowing different scales to be explored at approximately the same computational cost. A sketch of this four-lookup evaluation is given below.
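A minimal sketch of that four-lookup evaluation, where the cell rectangle is simply scaled before the integral image is sampled; the data layout (one summed-area table per orientation bin, row-major with a one-pixel zero border) is an assumption for illustration.

#include <vector>
#include <cmath>

// Sketch: sum of one gradient-orientation channel over a rectangular HOG
// cell using a summed-area table (integral image). The cell coordinates are
// scaled before sampling, so evaluating a different detection scale costs
// the same four lookups. 'integral' is (width+1) x (height+1), row-major,
// with the first row and column equal to zero (standard convention).
double cellSumAtScale(const std::vector<double>& integral, int width,
                      int x0, int y0, int x1, int y1,   // cell in window coordinates
                      double scale)                     // window scale factor
{
    // Scale the cell rectangle; the integral image itself is untouched.
    const int sx0 = static_cast<int>(std::lround(x0 * scale));
    const int sy0 = static_cast<int>(std::lround(y0 * scale));
    const int sx1 = static_cast<int>(std::lround(x1 * scale));
    const int sy1 = static_cast<int>(std::lround(y1 * scale));

    const int stride = width + 1;
    auto at = [&](int x, int y) { return integral[y * stride + x]; };

    // Standard four-corner summed-area table evaluation.
    return at(sx1, sy1) - at(sx0, sy1) - at(sx1, sy0) + at(sx0, sy0);
}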

A.2 C++: Shared Libraries

The main projects in the thesis use environment variables to determine the location of these shared projects: BVG_COMMON_DIR is the path to the root directory of the Common project, and BVG_IMAGEIO_DIR is the path to the ImageIO project directory.

A.2.1 Common

Some other features of this project are listed below.

Type Definitions Common types that are defined to be a specified number of bits in size. The integral types are: U8, S8, U16, S16, U32, S32, U64, S64, and the floating point types are: F32, F64, F128. These are defined in Common/Inc/numtypes.h.


Edge Detection Helper functions for edge detection, distance transforms, and chamfer distance are provided in: CannyEdge.h, DistanceTransforms.h, Chamfer.h.

File Names Some common operations on file names and paths are defined in pathutils.h.

A.2.2 ImageIO

The library is used by including imageio.h and any subsequent image loading and saving headers to handle file formats such as PGM and PPM image files.

The Image class is defined in image.h and can be used with any type that is safe to store in a STL container (i.e. can safely and efficiently handle copying and assignment operations).

A.3 MEX: cppHogLocalise

This code encapsulates the whole process of extracting HOG features and classifying them using the specified classifier. The result of calling this function is a distribution over the classes voted for by the classifier for each position in the image considered.

A.3.1 Usage

Usage is as follows:

1 [result] = cppHogLocalise( image, forestFile, hogFile [, ] );

There are 3 required parameters: image, forestFile, and hogFile. The first parameter, image, is either a path to an image file (PPM format only), or a colour image already loaded in Matlab using the imload function. The second and third parameters are the paths to the classifier file and the HOG window feature definition file respectively.

To load an image in Matlab and use it as a parameter to the function, you can do the following:

1 image = imload('someimage.ppm');

2 [result] = cppHogLocalise( image, forestFile, hogFile );


Using the MEX module in this way allows a greater number of file formats to be supported.

The 'result' value is organised in the following way:

 ∗ ∗  v1 c1 s1 x1 y1 w1 h1 c1,1 c1,2 c1,3 ··· c1,N  ......   ......    ∗ ∗ vM cM sM xM yM wM hM cM,1 cM,2 cM,3 ··· cM,N

Each row of the result matrix represents a single result and the columns represent the result data. From left to right the columns are as follows: maximum value in the distribution; maximum class for the highest value; the scale the result was found at; x and y coordinates and the width and height of the bounding box (with respect to the original image size); and finally the distribution over all classes for the result.

Example Usage

Scan the specified image at scale 1.0 with no search border with the specified classifier and corresponding HOG window:

1 [result] = cppHogLocalise( image, forestFile, hogFile );

Scan the image over scales [0.5, 1.0, 1.5] using a scale adjusted relative search border of 30 pixels and a scale adjusted search stride of 2 pixels using the 'auto−stride', 'auto−border', and 'use−scales' options:

1 [result] = cppHogLocalise( image, forestFile, hogFile, ...

2 'use−scales', [0.5 1.0 1.5], ...

3 'auto−border', 30.0, ...

4 'auto−stride', 2.0, ...

5 );

A.3.2 Options: Scanning

'auto−border', value (double) Specify the scale adjusted search border added to an image relative to the scanning window size of the classifier. Overrides 'search−border'.


'search−border', value (double) Specify the fixed search border in pixels (not adjusted by scale). Overridden by 'auto−border'.

'auto−stride', value (double) Specify the scale adjusted search stride. This stride step size will be min(value/curScale, 1). Overrides 'scan−step'.

'scan−step', value (double) Specify the fixed scan stride step in pixels (not adjusted by scale). Overridden by 'auto−stride'.

'scan−vector', matrix (3xN matrix) A 3xN matrix of (x, y, scale)T coordinates that specify target scan locations and scales. The classifier window will be scaled and centred at the (x, y) positions specified, so this option is best combined with the 'return−box−centres' option. Ignores 'auto−border', 'auto−stride', 'search−border' and 'scan−step'.

The result vector will correspond 1:1 with ordering of the points in the columns of the specified scan matrix. The first result corresponds to the first scan point, the second to the second scan point, and so on. Internally the points are sorted for scanning the scales efficiently, but the indices for the results are preserved so the points may be specified in any order.

The format of the point matrix is as follows:

points = [ x1  x2  x3  ...  xN ]
         [ y1  y2  y3  ...  yN ]
         [ s1  s2  s3  ...  sN ]

Example:

1 %'scan −vector' example

2 % points are:[x1 y1 scale1; x2 y2 scale2;.. etc]'

3

4 % Example1.

5 % scan at these specified points

6 points = [32 54 1.5; 23 44 0.5; 12 24 2.0; ]';

7 [result] = cppHogLocalise( image, forestFile, hogFile, ...

8 'scan−vector', points, ...

9 'return−box−centres', 1, ...

10 );


11

12 % Example2.

13 % scan across a series of x coordinates

14 % keeping y fixed at 5 and scale fixed at 0.75

15 points = [ 1 2 3 4 5; 5 5 5 5 5; 0.75 0.75 0.75 0.75 0.75 ];

16

17 % can alternatively specify the points as follows:

18 % points=[ 1:5; repmat(5, 1, 5); repmat(0.75, 1, 5)];

19

20 [result] = cppHogLocalise( image, forestFile, hogFile, ...

21 'scan−vector', points, ...

22 'return−box−centres', 1, ...

23 );

A.3.3 Options: Classification

'max−trees', value (double) Forest classifiers only. The maximum number of trees to use. The trees considered in a forest classifier will be in the range of [1:value].

'min−trees', value (double) Forest classifiers only. The minimum number of trees that should be used to make a decision. The trees considered will be [1:value], but this can also be combined with the 'use−trees' option. Must be used in conjunction with the 'min−trees−percent' option.

'min−trees−percent', value (double) Forest classifiers only. The proportion of the minimum trees considered that must make a vote to continue classifying with other trees. Requires 'min−trees' to be set to a non-zero value.

'bg−class−id', value (integer) Cascade classifiers only. Used to define which of the class IDs should be considered the ‘background’ class. This is used in conjunction with the 'min−trees' and 'min−trees−percent' parameters, so that a strong vote for this specified background class means that the classifier stops classifying with the rest of the trees. Setting this value to −1 means the classifier will use the highest class ID in the available classes as the background class.

'min−result−vote', value (double) Forest classifiers only. The minimum vote value that a classifier must provide to return the classification in the list of results. Set this to zero to consider all classifications, which can be used to determine the number of windows actually considered.

1 % 'min−result−vote' example

2

3 % Example1. Return results for all windows classified,

4 % even if the window was rejected.

5 trees = [1 3 5 20];

6 tic;

7 [result] = cppHogLocalise( image, forestFile, hogFile, ...

8 'min−result−vote', 0.0, ...

9 );

10 scan_time = toc;

11

12 % get the total number of windows actually considered in this search

13 numWindows = size(result,1);

14

15 % calculate the mean time per window

16 window_time = scan_time / numWindows;

'using−trees', matrix (1xN matrix) Forest classifiers only. Specify a list of indexes for trees that should be used in this classification.

1 %'using −trees' example

2 % indices are:[i1 i2 i3 i4.. etc]

3

4 % Example1.

5 % scan using the specified trees

6 trees = [1 3 5 20];

7 [result] = cppHogLocalise( image, forestFile, hogFile, ...

8 'using−trees', trees, ...

9 );

Options: Miscellaneous

'feature−vector−type', type (string) Specify how the feature vector is to be extracted for the classifier during the sliding window scan. The type string can take one of the following values:


• 'normal' Extracts the entire feature vector before passing it to the classifier code. This method keeps the image the same size, but copies a small portion into a sub-window area to classify.

• 'sample' Default. Extracts only the features corresponding to the feature dimensions that are accessed by the classifier. This can be efficient for classifiers that may only use a small subset of the available feature space to make a decision. As with 'normal', this method keeps the image the same size, but copies a small portion into a sub-window area to classify.

• 'sample−offset' Same on-demand extraction operation as 'sample', but rescales the features proportionally to the current scale so that no sub-windows are copied. This improves speed a great deal by not having to resample the sub-windows at each location. Generally the fastest method of feature extraction.

'enable−image−data−cache', value (0/1) Enables caching of integral histogram data between calls if the image remains the same. Default is disabled. Set this to 1 either after or before calling the module with the same image to re-use the integral histograms from the previous call.

'return−tree−distributions', value (0/1) Forest classifiers only. Returns vote distributions for each tree as a secondary parameter. The amount of data returned can be quite high for large forests, so must be used carefully.

1 %'return −tree−distributions' example

2

3 % Example1.

4 % scan image and return votes made by each individual tree in treeDist

5 [results, treeDist] = cppHogLocalise( imageFile, forestFile, hogFile, ...

6 'using−scales', [1.0], ...

7 'auto−border', 0.0,'auto −stride', 1.0, ...

8 'return−tree−distributions', 1 );

9

10 % actual result from MEX for treeDist is:

11 %[t1c1 t1c2.. t1cN t2c1 t2c2.. t2cN tMc1 tMc2.. tMcN]

12 % where t1c1 represents tree1 class1 votes,

13 % t1c2 is for tree1 class 2, etc

14


15 % reshape the distribution into a matrix

16 % where rows are trees, and columns are classes

17 %(in this example there are 192 classes and 200 trees)

18 dist = reshape(treeDist, 192, 200)';

19

20 % sum all the votes together

21 dres = sum(dist);

22

23 % plot the calculated total distribution

24 % below the result distribution

25 figure(1);

26 subplot(2,1,1);

27 plot(results(1,8:end));

28 subplot(2,1,2);

29 plot(dres);

'cascade−enable−normalised−vote', value (0/1) Cascade classifier only. If enabled, trees that vote for more than one class will be normalised so that they sum to one. For instance, a tree voting for 2 different classes would vote 0.5 to each class.

'cascade−enable−randomise−tree', value (0/1) Cascade classifier only. If enabled, bypasses the random vector lookup table for the tree definition, and generates a new random vector for each tree. This effectively generates a new cascade classifier each time the classification function is called.

You can combine this with 'min−trees' and 'min−trees−vote−percent' to quickly localise detections and commit the full forest to areas only where a tree makes a classification.

1 %'cascade −enable−randomise−tree' example

2 % indices are:[i1 i2 i3 i4.. etc]

3

4 % Example1.

5 % scan by randomising the forest and only using the full

6 % forest for confident classifications

7 [result] = cppHogLocalise( image, forestFile, hogFile, ...

8 'min−trees', 1, ...

9 'min−trees−vote−percent', 1.0, ...

10 'cascade−enable−randomise−tree', 1, ...

11 );


'return−box−centres', value (0/1)

Return box centres in the result matrix instead of the top-left (x, y) coordinate. Can be combined with 'scan−vector' to efficiently implement tracking, as the output of the result can be used to define the points for the next scan vector ma- trix.

'debug−mode', value (0/1) Enables debug mode. Some internal images are saved to the current Matlab di- rectory. Largely used for development and debugging the C++ source code, not very useful without the source.

A.3.4 Related Source Code Folders

./ExtractHOG Contains all code relating to HOG feature extraction. Implements both Integral Histogram HOG and regular HOG features. Also contains a library project called libExtractHOG.

./HOGLocalise The main HogLocalise project folder. Includes code from RandomForest and ExtractHOG.

./RandomForest Contains the classifier code. Named RandomForest for historical reasons. The library grew into a more generalised abstraction of a classifier, and the classifiers used in Chapter6 are implemented here.


References

[1] Agarwal, A. and Triggs, B. [2005], Monocular human motion capture with a mix- ture of regressors, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops’, pp. 72–72.

[2] Agarwal, A. and Triggs, B. [2006], ‘Recovering 3d human pose from monocular images’, IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58.

[3] Agarwal, S., Awan, A. and Roth, D. [2004], ‘Learning to detect objects in images via a sparse, part-based representation’, IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1475–1490.

[4] Aggarwal, J. and Cai, Q. [1999], ‘Human motion analysis: A review’, Computer vision and image understanding 73(3), 428–440.

[5] Alonso, I., Llorca, D., Sotelo, M., Bergasa, L., de Toro, P., Nuevo, J., Ocana, M. and Garrido, M. [2007], ‘Combination of feature extraction methods for SVM pedes- trian detection’, IEEE Transactions on Intelligent Transportation Systems 8(2), 292– 307.

[6] Amit, Y. and Geman, D. [1997], ‘Shape quantization and recognition with random- ized trees’, Neural Comput. 9(7), 1545–1588.

[7] Anderson, E., Engel, S., Comninos, P. and McLoughlin, L. [2008], The case for research in game engine architecture, in ‘Proceedings of the 2008 Conference on Future Play: Research, Play, Share’, ACM, pp. 228–231.

[8] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J. and Davis, J. [2005], ‘SCAPE: shape completion and animation of people’, ACM Transactions on Graph- ics (TOG) 24(3), 416.

[9] Baker, S. and Matthews, I. [2004], ‘Lucas-kanade 20 years on: A unifying frame- work’, International Journal of Computer Vision 56(3), 221–255.


[10] Barron, C. and Kakadiaris, I. A. [2001], ‘Estimating anthropometry and pose from a single uncalibrated image’, Computer Vision and Image Understanding 81(3), 269 – 284.

[11] Barrow, H. G., Tenenbaum, J. M., Bolles, R. C. and Wolf, H. C. [1977], Parametric correspondence and chamfer matching: Two new techniques for image matching, Cambridge, MA, pp. 659–663.

[12] Baumberg, A. [1998], ‘Hierarchical shape fitting using an iterated linear filter’, Image and Vision Computing 16(5), 329–335.

[13] Bay, H., Tuytelaars, T. and Van Gool, L. [2006], ‘Surf: Speeded up robust features’, Computer Vision–ECCV 2006 pp. 404–417.

[14] Bellman, R. [1961], ‘Adaptive control processes: a guided tour’, Princeton Univer- sity Press, Princeton, New Jersey, USA 19, 94.

[15] Berg, A. C. and Malik, J. [2001], ‘Geometric blur for template matching’, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 1, 607.

[16] Bergtholdh, M., Cremers, D. and Schurr, C. [2006], ‘Variational segmentation with shape priors’, Handbook of Mathematical Models in Computer Vision pp. 131–143.

[17] Bissacco, A., Yang, M.-H. and Soatto, S. [2006], Detecting humans via their pose, in ‘NIPS’, pp. 169–176.

[18] Bissacco, A., Yang, M.-H. and Soatto, S. [2007], Fast human pose estimation using appearance and motion via multi-dimensional boosting regression, in ‘CVPR’.

[19] Bordes, A., Bottou, L. and Gallinari, P. [2009], ‘Sgd-qn: Careful quasi-newton stochastic gradient descent’, The Journal of Machine Learning Research 10, 1737–1754.

[20] Borgefors, G. [1988], ‘Hierarchical chamfer matching: A parametric edge match- ing algorithm’, IEEE Trans. Pattern Anal. Mach. Intell. 10(6), 849–865.

[21] Bosch, A., Zisserman, A. and Munoz, X. [2007], Image classification using random forests and ferns, in ‘Proceedings of the 11th International Conference on Com- puter Vision, Rio de Janeiro, Brazil’.

[22] Bottino, A. and Laurentini, A. [2001], ‘A silhouette based technique for the recon- struction of human movement’, Computer Vision and Image Understanding 83(1), 79– 95.


[23] Bourdev, L. and Malik, J. [2009], Poselets: Body part detectors trained using 3d human pose annotations, in ‘International Conference on Computer Vision’. URL: http://www.eecs.berkeley.edu/ lbourdev/poselets

[24] Boykov, Y. and Jolly, M. [2001], Interactive graph cuts for optimal boundary and region segmentation of objects in ND images, in ‘International Conference on Computer Vision’, Vol. 1, Citeseer, pp. 105–112.

[25] Boykov, Y. and Kolmogorov, V. [2001], An experimental comparison of min- cut/max-flow algorithms for energy minimization in vision, in ‘Energy minimiza- tion methods in computer vision and pattern recognition’, Springer, pp. 359–374.

[26] Boykov, Y. and Kolmogorov, V. [2004], ‘An experimental comparison of min- cut/max-flow algorithms for energy minimization in vision’, IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1124–1137.

[27] Bray, M., Kohli, P. and Torr, P. [2006], Posecut: Simultaneous segmentation and 3d pose estimation of humans using dynamic graph-cuts, Springer, pp. 642–655.

[28] Breiman, L. [1996], ‘Out-of-bag estimation’.

[29] Breiman, L. [2001], ‘Random forests’, Mach. Learn. 45(1), 5–32.

[30] Breiman, L. [1996], Bagging predictors, in ‘Machine Learning’, pp. 123–140.

[31] Brostow, G. J., Shotton, J., Fauqueur, J. and Cipolla, R. [2008], Segmentation and recognition using structure from motion point clouds, in ‘ECCV’, pp. 44–57.

[32] Canny, J. [1986], ‘A computational approach to edge detection’, Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-8(6), 679–698.

[33] Cheung, S. and Kamath, C. [2004], ‘Robust techniques for background subtraction in urban traffic video’, video communications and image processing, SPIE electronic imaging 5308, 881–892.

[34] Cootes, T., Marsland, S., Twining, C., Smith, K. and Taylor, C. [2004], ‘Group- wise diffeomorphic non-rigid registration for automatic model building’, Com- puter Vision-ECCV 2004 pp. 316–327.

[35] Cootes, T. and Taylor, C. [2004], ‘Statistical models of appearance for computer vision’, Technical Report, March 8 .


[36] Criminisi, A., Cross, G., Blake, A. and Kolmogorov, V. [2006], Bilayer segmentation of live video, in ‘2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 1.

[37] Crow, F. [1984], ‘Summed-area tables for texture mapping’, ACM SIGGRAPH Com- puter Graphics 18(3), 212.

[38] Cucchiara, R., Grana, C., Piccardi, M. and Prati, A. [2003], ‘Detecting moving ob- jects, ghosts, and shadows in video streams’, IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1337–1342.

[39] Cutler, R. and Davis, L. [1998], View-based detection and analysis of periodic mo- tion, in ‘International Conference on Pattern Recognition’, Vol. 14, pp. 495–500.

[40] Dalal, N. and Triggs, B. [2005], Histograms of oriented gradients for human de- tection, in C. Schmid, S. Soatto and C. Tomasi, eds, ‘International Conference on Computer Vision & Pattern Recognition’, Vol. 2, INRIA Rhône-Alpes, ZIRST-655, av. de l’Europe, Montbonnot-38334, pp. 886–893. URL: http://lear.inrialpes.fr/pubs/2005/DT05

[41] Dalal, N., Triggs, B. and Schmid, C. [2006], ‘Human detection using oriented his- tograms of flow and appearance’, Computer Vision–ECCV 2006 pp. 428–441.

[42] Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. [2004], Locality-sensitive hashing scheme based on p-stable distributions, in ‘Proceedings of the twenti- eth annual symposium on Computational geometry’, ACM New York, NY, USA, pp. 253–262.

[43] Deselaers, T., Criminisi, A., Winn, J. M. and Agarwal, A. [2007], Incorporating on-demand stereo for real time recognition, in ‘CVPR’.

[44] Dimitrijevic, M., Lepetit, V. and Fua, P. [2006], ‘Human body pose detection using bayesian spatio-temporal templates’, Comput. Vis. Image Underst. 104(2), 127–139.

[45] Eichner, M., Ferrari, V. and Zürich, S. [2009], ‘Better appearance models for picto- rial structures’, Proc. BMVC .

[46] Elgammal, A., Harwood, D. and Davis, L. [2000], ‘Non-parametric model for background subtraction’, Computer Vision–ECCV 2000, pp. 751–767.

[47] Elgammal, A. and Lee, C. [2004], ‘Inferring 3D body pose from silhouettes using activity manifold learning’.


[48] Enzweiler, M. and Gavrila, D. [2008a], A mixed generative-discriminative frame- work for pedestrian classification, in ‘IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008’, pp. 1–8.

[49] Enzweiler, M. and Gavrila, D. [2008b], ‘Monocular pedestrian detection: Sur- vey and experiments’, IEEE transactions on pattern analysis and machine intelligence pp. 2179–2195.

[50] Enzweiler, M., Kanter, P. and Gavrila, D. [2008], Monocular pedestrian recognition using motion parallax, in ‘2008 IEEE Intelligent Vehicles Symposium’, pp. 792–797.

[51] Everingham, M. and Zisserman, A. [2005], ‘Identifying Individuals in Video by Combining Generative and Discriminative Head Models’.

[52] Fan, L., Sung, K. and Ng, T. [2003], ‘Pedestrian registration in static images with unconstrained background’, Pattern Recognition 36(4), 1019–1029.

[53] Felzenszwalb, P. F. [2001], Learning models for object recognition, in ‘In CVPR’, pp. 56–62.

[54] Felzenszwalb, P. F. and Huttenlocher, D. P. [2004.], Distance transforms of sampled functions, Technical report, Cornell Computing and Information Science. URL: http://citeseer.ist.psu.edu/696385.html

[55] Ferrari, V., Marín-Jiménez, M. J. and Zisserman, A. [2008], Progressive search space reduction for human pose estimation, in ‘Proceedings of the IEEE Computer Vision and Pattern Recognition’, Anchorage, Alaska.

[56] Ford, L. and Fulkerson, D. [1962], ‘Flows in networks’.

[57] Fossati, A., Dimitrijevic, M., Lepetit, V. and Fua, P. [2007], Bridging the gap be- tween detection and tracking for 3d monocular video-based motion capture, in ‘CVPR’.

[58] Freedman, D. and Zhang, T. [2005], Interactive graph cut based segmentation with shape priors, in ‘IEEE Computer Society Conference on Computer Vision and Pat- tern Recognition, 2005. CVPR 2005’, Vol. 1.

[59] Freeman, W., Anderson, D., Beardsley, P., Dodge, C., Roth, M., Weissman, C., Yerazunis, W., Kage, H., Kyuma, I., Miyake, Y. and Tanaka, K. [1998], ‘Computer vision for interactive computer graphics’, Computer Graphics and Applications, IEEE 18(3), 42–53.


[60] Freeman, W., Tanaka, K., Ohta, J. and Kyuma, K. [1996], Computer vision for com- puter games, pp. 100–105.

[61] Freund, Y. and Schapire, R. [1995], A desicion-theoretic generalization of on-line learning and an application to boosting, in ‘Computational Learning Theory’, Springer, pp. 23–37.

[62] Gandhi, T. and Trivedi, M. [2007], ‘Pedestrian protection systems: Issues, survey, and challenges’, IEEE Transactions on Intelligent Transportation Systems 8(3), 413– 430.

[63] Gavrila, D. [1999], ‘The Visual Analysis of Human Movement: A Survey’, Com- puter vision and image understanding 73(1), 82–98.

[64] Gavrila, D. and Davis, L. [1996], 3d model-based tracking of humans in action: A multi-view approach., in ‘Proc. Conf. Computer Vision and Pattern Recognition’, pp. 73–80.

[65] Gavrila, D. M. [2007], ‘A bayesian, exemplar-based approach to hierarchical shape matching’, IEEE Trans. Pattern Anal. Mach. Intell. 29(8), 1408–1421.

[66] Gavrila, D. M. and Munder, S. [2007], ‘Multi-cue pedestrian detection and tracking from a moving vehicle’, Int. J. Comput. Vision 73(1), 41–59.

[67] Gloyer, B., Aghajan, H., Siu, K. and Kailath, T. [1995], Video-based freeway- monitoring system using recursive vehicle tracking, in ‘Proceedings of SPIE’, Vol. 2421, p. 173.

[68] Grabner, H., Roth, P., Grabner, M. and Bischof, H. [2006], Autonomous learning of a robust background model for change detection, in ‘Proc. PETS’, pp. 39–46.

[69] Grauman, K., Shakhnarovich, G. and Darrell, T. [2003], Inferring 3d structure with a statistical image-based shape model, in ‘Ninth IEEE International Conference on Computer Vision, 2003. Proceedings’, pp. 641–647.

[70] Greig, D., Porteous, B. and Seheult, A. [1989], ‘Exact maximum a posteriori esti- mation for binary images’, Journal of the Royal Statistical Society. Series B (Method- ological) 51(2), 271–279.

[71] Grest, D., Frahm, J. and Koch, R. [2003], ‘A Color Similarity Measure for Robust Shadow Removal in Real-Time’, Vision, modeling, and visualization 2003: proceed- ings, November 19-21, 2003, Munchen, Germany p. 253.


[72] Gross, R. and Shi, J. [2001], The cmu motion of body (mobo) database, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

[73] Heap, T. and Hogg, D. [1997], Improving specificity in pdms using a hierarchical approach, in ‘British Machine Vision Conference’, Vol. 1, Citeseer, pp. 80–89.

[74] Heap, T. and Hogg, D. [1998], Wormholes in shape space: Tracking through dis- continuous changes in shape, in ‘Sixth International Conference on Computer Vi- sion’, pp. 344–349.

[75] Howe, N., Leventon, M. and Freeman, W. [1999], Bayesian reconstruction of 3d human motion from single-camera video, in ‘Neural Information Processing Sys- tems’, Vol. 1.

[76] Huang, R., Pavlovic, V. and Metaxas, D. [2004], ‘A graphical model framework for coupling MRFs and deformable models’.

[77] Huang, Y. and Huang, T. S. [2002], ‘2d model-based human body tracking’, Pattern Recognition, International Conference on 1, 10552.

[78] Jacques Jr, J., Jung, C. and Musse, S. [2005], ‘Background subtraction and shadow detection in grayscale video sequences’.

[79] Joachims, T. [1999], Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT- Press, chapter 11.

[80] Jones, M. and Poggio, T. [1998], Multidimensional morphable models, in ‘6th In- ternational Conference on Computer Vision’, pp. 683–688.

[81] Ju, S., Black, M. and Yacoob, Y. [1996], Cardboard people: A parameterized model of articulated image motion, in ‘Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition (FG’96)’, Citeseer, p. 38.

[82] Kakadiaris, I. and Metaxas, D. [1998], ‘Three-dimensional human body model ac- quisition from multiple views’, International Journal of Computer Vision 30(3), 191– 218.

[83] Kanade, T., li Tian, Y. and Cohn, J. F. [2000], Comprehensive database for facial expression analysis, in ‘FG’, pp. 46–53.

[84] Karmann, K. and von Brandt, A. [1990], ‘Moving object recognition using an adap- tive background memory’, Time-varying image processing and moving object recogni- tion 2, 289–296.

[85] Kehl, R. and Van Gool, L. [2006], ‘Markerless tracking of complex human motions from multiple views’, Computer Vision and Image Understanding 104(2-3), 190–209.

[86] Kilger, M. [1992], A shadow handler in a video-based real-time traffic monitoring system, in ‘IEEE Workshop on Applications of Computer Vision: November 30-December 2, 1992, Palm Springs: proceedings’, IEEE Computer Society, p. 11.

[87] Kittler, J. and Illingworth, J. [1986], ‘Minimum error thresholding’, Pattern recognition 19(1), 41–47.

[88] Kohli, P. and Torr, P. [2007], ‘Dynamic graph cuts for efficient inference in markov random fields’, IEEE transactions on pattern analysis and machine intelligence 29(12), 2079–2088.

[89] Koller, D., Weber, J., Huang, T., Malik, J., Ogasawara, G., Rao, B. and Russell, S. [1994], ‘Towards robust automatic traffic scene analysis in real-time’, ICPR, Israel.

[90] Kolmogorov, V., Criminisi, A., Blake, A., Cross, G. and Rother, C. [2005], Bi-layer segmentation of binocular stereo video, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005’, Vol. 2.

[91] Kolmogorov, V. and Zabih, R. [2002], ‘What energy functions can be minimized via graph cuts?’, Computer Vision ECCV 2002 pp. 185–208.

[92] Kumar, M. P., Torr, P. H. S. and Zisserman, A. [2005], ObjCut, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005’, Vol. 1.

[93] Kumar, S. and Hebert, M. [2004], ‘Discriminative fields for modeling spatial dependencies in natural images’, Advances in neural information processing systems 16, 1531–1538.

[94] Laptev, I. [2009], ‘Improving object detection with boosted histograms’, Image Vision Comput. 27(5), 535–544.

[95] Lee, C.-S. and Elgammal, A. M. [2006], Simultaneous inference of view and body pose using torus manifolds, in ‘ICPR (3)’, pp. 489–494.

[96] Leibe, B., Cornelis, N., Cornelis, K. and Van Gool, L. [2007], Dynamic 3d scene analysis from a moving vehicle, in ‘IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07’, pp. 1–8.

[97] Leibe, B., Seemann, E. and Schiele, B. [2005], ‘Pedestrian detection in crowded scenes’.

[98] Lepetit, V. and Fua, P. [2006], ‘Keypoint recognition using randomized trees’, IEEE Trans. Pattern Anal. Mach. Intell. 28(9), 1465–1479.

[99] Lepetit, V., Shahrokni, A. and Fua, P. [2003], ‘Robust data association for online applications’, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 1, 281.

[100] Lewis, J. [1995], Fast normalized cross-correlation, in ‘Vision Interface’, Vol. 120, p. 123.

[101] Lo, B. and Velastin, S. [2001], Automatic congestion detection system for underground platforms, in ‘Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2001’, pp. 158–161.

[102] Lowe, D. G. [2004], ‘Distinctive image features from scale-invariant keypoints’, International Journal of Computer Vision 60(2), 91–110.

[103] Ma, Y. and Ding, X. [2005], ‘Real-time multi-view face detection and pose estimation based on cost-sensitive adaboost’, Tsinghua Science & Technology 10(2), 152–157.

[104] Mikolajczyk, K., Schmid, C. and Zisserman, A. [2004], ‘Human detection based on a probabilistic assembly of robust part detectors’, Computer Vision-ECCV 2004 pp. 69–82.

[105] Minsky, M. L. and Papert, S. [1988], Perceptrons: An introduction to computational geometry, MIT press Cambridge, Mass.

[106] Moeslund, T. and Granum, E. [2001], ‘A survey of computer vision-based human motion capture’, Computer Vision and Image Understanding 81(3), 231–268.

[107] Moeslund, T., Hilton, A. and Krüger, V. [2006], ‘A survey of advances in vision-based human motion capture and analysis’, Computer vision and image understanding 104(2-3), 90–126.

[108] Mohan, A., Papageorgiou, C. and Poggio, T. [2001], ‘Example-based object detection in images by components’, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(4), 349.

[109] Moosmann, F., Nowak, E. and Jurie, F. [2008], ‘Randomized clustering forests for image classification’, IEEE Transactions on Pattern Analysis and Machine Intelligence 30(9), 1632–1646.

[110] Mori, G. and Malik, J. [2006], ‘Recovering 3d human body configurations using shape contexts’, IEEE Trans. on Pattern Analysis and Machine Intelligence 28(7), 1052–1062.

[111] Morris, D. and Rehg, J. [1998], Singularity analysis for articulated object tracking, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, pp. 289–297.

[112] Munder, S. and Gavrila, D. [2006], ‘An experimental study on pedestrian classification’, IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1863–1868.

[113] Munder, S., Schnorr, C. and Gavrila, D. [2008], ‘Pedestrian detection and tracking using a mixture of view-based shape-texture models’, IEEE Transactions on Intelligent Transportation Systems 9(2), 333–343.

[114] Nakajima, C., Pontil, M., Heisele, B. and Poggio, T. [2003], ‘Full-body person recognition system’, Pattern recognition 36(9), 1997–2006.

[115] Ojala, T., Pietikäinen, M. and Mäenpää, T. [2002], ‘Multiresolution gray-scale and rotation invariant texture classification with local binary patterns’, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987.

[116] Okada, R. and Soatto, S. [2008], Relevant feature selection for human pose estimation and localization in cluttered images, in ‘Proceedings European Conference on Computer Vision’, Springer, pp. 434–445.

[117] Okada, R. and Stenger, B. [2008], ‘A single camera motion capture system for human-computer interaction’, IEICE Transactions on Information and Systems 91(7), 1855–1862.

[118] Okuma, K., Taleghani, A., Freitas, N., Little, J. and Lowe, D. [2004], ‘A boosted particle filter: Multitarget detection and tracking’, Computer Vision-ECCV 2004 pp. 28–39.

[119] Oliver, N., Rosario, B. and Pentland, A. [1999], ‘A Bayesian computer vision system for modeling human interactions’, Computer Vision Systems pp. 255–272.

[120] O’Rourke, J. and Badler, N. [1980], ‘Model-based image analysis of human motion using constraint propagation’, IEEE Trans. Pattern Analysis and Machine Intelligence 2(6), 522–536.

[121] Orrite, C., Gañán, A. and Rogez, G. [2009], Hog-based decision tree for facial expression classification, in ‘IbPRIA’, pp. 176–183.

[122] Papageorgiou, C. and Poggio, T. [2000], ‘A trainable system for object detection’, International Journal of Computer Vision 38(1), 15–33.

[123] Piccardi, M. [2004], Background subtraction techniques: a review, in ‘IEEE International Conference on Systems, Man and Cybernetics’, Vol. 4, pp. 3099–3104.

[124] Poppe, R. [2007], ‘Vision-based human motion analysis: An overview’, Computer Vision and Image Understanding 108(1-2), 4–18.

[125] Porikli, F. [2005], Integral histogram: A fast way to extract histograms in Cartesian spaces, in ‘Proc. IEEE Conf. on Computer Vision and Pattern Recognition’, pp. 829–836.

[126] Power, P. and Schoonees, J. [2002], Understanding Background Mixture Models for Foreground Segmentation, in ‘Proceedings Image and Vision Computing New Zealand’, p. 267.

[127] Press, W., Flannery, B., Teukolsky, S. and Vetterling, W. [1988], Numerical Recipes in C, Cambridge University Press.

[128] Ramanan, D. [2007], Using segmentation to verify object hypotheses, in ‘IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07’, pp. 1– 8.

[129] Ren, L., Shakhnarovich, G., Hodgins, J. K., Pfister, H. and Viola, P. [2005], ‘Learning silhouette features for control of human motion’, ACM Trans. Graph. 24(4), 1303–1331.

[130] Ridder, C., Munkelt, O. and Kirchner, H. [1995], Adaptive background estimation and foreground detection using kalman-filtering, in ‘Proceedings of International Conference on recent Advances in Mechatronics’, Citeseer, pp. 193–199.

[131] Rogez, G., Orrite, C. and Martínez, J. [2008], ‘A spatio-temporal 2d-models framework for human pose recovery in monocular sequences’, Pattern Recognition.

[132] Rogez, G., Rihan, J., Ramalingam, S., Orrite, C. and Torr, P. H. [2008], ‘Random- ized trees for human pose detection’, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 0, 1–8.

[133] Rohr, K. [1994], ‘Towards model-based recognition of human movements in image sequences’, CVGIP-Image Understanding 59(1), 94–115.

[134] Rosales, R. and Sclaroff, S. [2000], Inferring body pose without tracking body parts, in ‘CVPR’, Published by the IEEE Computer Society, p. 2721.

[135] Rother, C., Kolmogorov, V. and Blake, A. [2004], Grabcut: Interactive foreground extraction using iterated graph cuts, in ‘ACM SIGGRAPH 2004 Papers’, ACM, p. 314.

[136] Sabzmeydani, P. and Mori, G. [2007], Detecting pedestrians by learning shapelet features, in ‘CVPR07’, pp. 1–8.

[137] Seemann, E., Fritz, M. and Schiele, B. [2007], Towards robust pedestrian detection in crowded image sequences, in ‘IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07’, pp. 1–8.

[138] Seki, M., Wada, T., Fujiwara, H. and Sumi, K. [2003], ‘Background subtraction based on cooccurrence of image variations’.

[139] Sezgin, M. and Sankur, B. [2004], ‘Survey over image thresholding techniques and quantitative performance evaluation’, Journal of Electronic imaging 13, 146.

[140] Shakhnarovich, G., Viola, P. and Darrell, T. [2003], Fast pose estimation with parameter-sensitive hashing, in ‘International Conference on Computer Vision’.

[141] Shashua, A., Gdalyahu, Y. and Hayun, G. [2004], Pedestrian detection for driving assistance systems: Single-frame classification and system level performance, in ‘Proceedings of IEEE Intelligent Vehicles Symposium’, Citeseer.

[142] Shet, V., Neumann, J., Ramesh, V. and Davis, L. [2007], Bilattice-based logical reasoning for human detection, in ‘IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07’, pp. 1–8.

[143] Shi, J. and Tomasi, C. [1994], Good features to track, in ‘1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94.’, pp. 593–600.

[144] Shimizu, H. and Poggio, T. [2004], Direction estimation of pedestrian from multiple still images, in ‘2004 IEEE Intelligent Vehicles Symposium’, pp. 596–600.

[145] Shotton, J., Johnson, M. and Cipolla, R. [2008], Semantic texton forests for image categorization and segmentation, in ‘IEEE Conference on Computer Vision and Pattern Recognition’, pp. 1–8.

[146] Shotton, J., Winn, J., Rother, C. and Criminisi, A. [2009], ‘Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context’, International journal of computer vision 81(1), 2–23.

[147] Sidenbladh, H. and Black, M. [2003], ‘Learning the statistics of people in images and video’, International Journal of Computer Vision 54(1), 183–209.

[148] Sidenbladh, H., Black, M. and Fleet, D. [2000], ‘Stochastic tracking of 3D human figures using 2D image motion’, Computer Vision-ECCV 2000 pp. 702–718.

[149] Sigal, L., Balan, A. and Black, M. [2007], ‘Combined discriminative and generative articulated pose and non-rigid shape estimation’, Advances in neural information processing systems.

[150] Sigal, L. and Black, M. J. [2006], ‘Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion’.

[151] Sminchisescu, C., Kanaujia, A., Li, Z. and Metaxas, D. [2005], ‘Discriminative density propagation for 3d human motion estimation’.

[152] Spengler, M. and Schiele, B. [2003], ‘Towards robust multi-cue integration for visual tracking’, Machine Vision and Applications 14(1), 50–58.

[153] Stauffer, C. and Grimson, W. [1998], Adaptive background mixture models for real-time tracking, in ‘Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 246–252.

[154] Stenger, B. [2004], Model-Based Hand Tracking Using A Hierarchical Bayesian Filter, PhD thesis, Department of Engineering, University of Cambridge.

[155] Stenger, B., Mendonça, P. and Cipolla, R. [2001], ‘Model-based 3D tracking of an articulated hand’.

[156] Stenger, B., Thayananthan, A., Torr, P. H. S. and Cipolla, R. [2003], Filtering using a tree-based estimator, in ‘International Conference on Computer Vision’, Vol. 2.

[157] Stenger, B., Woodley, T., Kim, T. and Cipolla, R. [2009], ‘A vision-based system for display interaction’.

[158] Sun, J., Zhang, W., Tang, X. and Shum, H. [2006], ‘Background cut’, Computer Vision–ECCV 2006 pp. 628–641.

[159] Szarvas, M., Yoshizawa, A., Yamamoto, M. and Ogata, J. [2005], Pedestrian detection with convolutional neural networks, in ‘IEEE Intelligent Vehicles Symposium, 2005. Proceedings’, pp. 224–229.

[160] Tan, X. and Triggs, B. [2007], ‘Enhanced local texture feature sets for face recognition under difficult lighting conditions’, Analysis and Modeling of Faces and Gestures pp. 168–182.

[161] Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P. H. S. and Cipolla, R. [2006], Multivariate relevance vector machines for tracking, in ‘ECCV (3)’, pp. 124–138.

[162] Tola, E., Lepetit, V. and Fua, P. [2008], A fast local descriptor for dense matching, in ‘Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on’, pp. 1–8. URL: http://dx.doi.org/10.1109/CVPR.2008.4587673

[163] Tola, E., Lepetit, V. and Fua, P. [2010], ‘Daisy: An efficient dense descriptor applied to wide baseline stereo’, IEEE Transactions on Pattern Analysis and Machine Intelligence 32(5), 815–830. URL: http://dx.doi.org/10.1109/TPAMI.2009.77

[164] Toyama, K. and Blake, A. [2002], ‘Probabilistic tracking with exemplars in a metric space’, Int. J. Comput. Vision 48(1), 9–19.

[165] Tulip, J., Bekkema, J. and Nesbitt, K. [2006], Multi-threaded game engine design, in ‘Proceedings of the 3rd Australasian conference on Interactive entertainment’, Murdoch University, p. 14.

[166] Tuzel, O., Porikli, F. and Meer, P. [2007], ‘Human Detection via Classification of Riemannian Manifolds’, CVPR.

[167] Urtasun, R., Fleet, D. J. and Fua, P. [2005], Monocular 3-d tracking of the golf swing, in ‘CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2’, IEEE Computer Society, Washington, DC, USA, pp. 932–938.

[168] Valiant, L. G. [1984], ‘A theory of the learnable’, Commun. ACM 27(11), 1134–1142. URL: http://dx.doi.org/10.1145/1968.1972

[169] Vincent, L. and Soille, P. [1991], ‘Watersheds in digital spaces: an efficient algorithm based on immersion simulations’, IEEE transactions on pattern analysis and machine intelligence 13(6), 583–598.

[170] Viola, P. and Jones, M. [2002], ‘Robust real-time object detection’, International Journal of Computer Vision . URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.4868

[171] Viola, P., Jones, M. J. and Snow, D. [2005], ‘Detecting pedestrians using patterns of motion and appearance’, Int. J. Comput. Vision 63(2), 153–161.

[172] Wikipedia [2010], ‘HSL and HSV’, http://en.wikipedia.org/wiki/HSL_and_HSV.

[173] Wohler, C. and Anlauf, J. [1999], ‘An adaptable time-delay neural-network algorithm for image sequence analysis’, IEEE Transactions on Neural Networks 10(6), 1531–1536.

[174] Wren, C., Azarbayejani, A., Darrell, T. and Pentland, A. [1997], ‘Pfinder: Real-time tracking of the human body’, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 781.

[175] Wu, B. and Nevatia, R. [2005], Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors, in ‘Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1’, pp. 90–97.

[176] Yeffet, L. and Wolf, L. [2009], ‘Local Trinary Patterns for Human Action Recognition’.

[177] Zhang, L., Wu, B. and Nevatia, R. [n.d.], Detection and tracking of multiple humans with extensive pose articulation.

[178] Zhang, S. [2009], ‘Recent progresses on real-time 3d shape measurement using digital fringe projection techniques’, Optics and Lasers in Engineering.

[179] Zhang, Z., Zhu, L., Li, S. and Zhang, H. [2002], Real-time multi-view face detection, in ‘Proc. Int’l Conf. Automatic Face and Gesture Recognition’, pp. 149–154.

[180] Zhao, T. and Nevatia, R. [2004], ‘Tracking multiple humans in complex situations’, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1208–1221.

[181] Zhou, Q. and Aggarwal, J. [2001], Tracking and classifying moving objects from video, in ‘Proceedings of IEEE Workshop on Performance Evaluation of Tracking and Surveillance’, Hawaii, USA.

[182] Zhu, Q., Avidan, S., Yeh, M. C. and Cheng, K. T. [2006], Fast human detection using a cascade of histograms of oriented gradients, in ‘CVPR’, IEEE Computer Society, pp. 1491–1498. URL: http://doi.ieeecomputersociety.org/10.1109/CVPR.2006.119

[183] Zhu, Q., Yeh, M.-C., Cheng, K.-T. and Avidan, S. [2006], Fast human detection using a cascade of histograms of oriented gradients, in ‘CVPR06’, pp. II: 1491–1498.
