Multiview Depth-based Pose Estimation

by

Alireza Shafaei

B.Sc., Amirkabir University of Technology, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2015

© Alireza Shafaei, 2015

Abstract

Commonly used human motion capture systems require intrusive attachment of markers that are visually tracked with multiple cameras. In this work we present an efficient and inexpensive solution to markerless motion capture using only a few Kinect sensors. We use our system to design a smart home platform with a network of Kinects that are installed inside the house.

Our first contribution is a multiview pose estimation system. Unlike the previous work on 3d pose estimation using a single depth camera, we relax constraints on the camera location and do not assume a co-operative user. We apply recent image segmentation techniques with convolutional neural networks to depth images and use curriculum learning to train our system on purely synthetic data. Our method accurately localizes body parts without requiring an explicit shape model. The body joint locations are then recovered by combining evidence from multiple views in real-time.

Our second contribution is a dataset of 6 million synthetic depth frames for pose estimation from multiple cameras with varying levels of complexity to make curriculum learning possible. We show the efficacy and applicability of our data generation process through various evaluations. Our final system exceeds the state-of-the-art results on multiview pose estimation on the Berkeley MHAD dataset.

Our third contribution is a scalable software platform to coordinate Kinect devices in real-time over a network. We use various compression techniques and develop software services that allow communication with multiple Kinects through TCP/IP. The flexibility of our system allows real-time orchestration of up to 10 Kinect devices over Ethernet.

Preface

The entire work presented here has been done by the author, Alireza Shafaei, with the collaboration and supervision of James J. Little. A manuscript describing the core of our work and our results has been submitted to the IEEE Conference on Computer Vision and Pattern Recognition (2016) and is under anonymous review at the moment of thesis submission.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Kinect Sensor
  1.2 Our Scenario
  1.3 Datasets
  1.4 Pose Estimation
  1.5 Outline
2 Related Work
  2.1 Pose Estimation
    2.1.1 Single-view Pose Estimation
    2.1.2 Multiview Depth-based Pose Estimation
  2.2 Dense Image Segmentation
  2.3 Curriculum Learning
3 System Overview
  3.1 The General Context
  3.2 High-Level Framework Specification
  3.3 Internal Structure and Data Flow
    3.3.1 Camera Registration and Data Aggregation
    3.3.2 Pose Estimation
4 Synthetic Data Generation
  4.1 Sampling Human Pose
  4.2 Building Realistic 3d Models
  4.3 Setting Camera Location
  4.4 Sampling Data
  4.5 Datasets
  4.6 Discussion
5 Multiview Pose Estimation
  5.1 Human Segmentation
  5.2 Pixel-wise Classification
    5.2.1 Preprocessing The Depth Image
    5.2.2 Dense Segmentation with Deep Convolutional Networks
    5.2.3 Designing the Deep Convolutional Network
  5.3 Classification Aggregation
  5.4 Pose Estimation
  5.5 Discussion
6 Evaluation
  6.1 Training the Dense Depth Classifier
  6.2 Evaluation on UBC3V Synthetic
    6.2.1 Dense Classification
    6.2.2 Pose Estimation
  6.3 Evaluation on Berkeley MHAD
    6.3.1 Dense Classification
    6.3.2 Pose Estimation
  6.4 Evaluation on EVAL
7 Discussion and Conclusion
Bibliography

List of Tables

Table 4.1  Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter and D refers to the distance parameter as described in Figure 4.5. The simple set is the subset of postures that have the label ‘walk’ or ‘run’. Going from the first dataset to the second would require pose adaptation, while going from the second to the third dataset requires shape adaptation.

Table 6.1  The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2 respectively.

Table 6.2  Mean and standard deviation of the prediction error by testing on subjects and actions with the joint definitions of Michel et al. [28]. We also report and compare the accuracy at the 10cm threshold.

List of Figures

Figure 1.1  The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.
Figure 1.2  A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.
Figure 1.3  An overview of our pipeline. In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.
Figure 3.1  The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.
Figure 3.2  The high-level representation of data flow within our pipeline. The pose estimation block operates independently from the number of the active Kinects.
Figure 3.3  An example to demonstrate the output result of camera calibration. The blue and the red points are coming from two different Kinects facing each other but they are presented in a unified coordinate space.
Figure 3.4  The pose estimation pipeline in our platform.
Figure 4.1  The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera locations to generate realistic training data.
Figure 4.2  Random samples from MotionCK as described in Section 4.1.
Figure 4.3  Regions of interest in our humanoid model. There are a total of 43 different body regions color-coded as above. (a) The frontal view and (b) the dorsal view.
Figure 4.4  All the 16 characters we made for synthetic data generation. Subjects vary in age, weight, height, and gender.
Figure 4.5  An overview of the extrinsic camera parameters inside our data generation pipeline.
Figure 4.6  Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 4.7  Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 4.8  Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 5.1  Our framework consists of four stages through which we gradually build higher level abstractions. The final output is an estimate of human posture.
Figure 5.2  Sample human segmentation in the first stage of our pose estimation pipeline.
Figure 5.3  Sample input and output of the normalization process. (a,b) the input from two views, (c,d) the corresponding foreground mask, (e,f) the normalized image output. The output is rescaled to 250×250 pixels. The depth data is from the Berkeley MHAD [31] dataset.
Figure 5.4  Our CNN architecture. The input is a 250×250 normalized depth image. The first row of the network generates a 44×14×14 coarsely classified depth with a high stride. Then it learns deconvolution kernels that are fused with the information from lower layers to generate finely classified depth. Like [26] we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in the parenthesis within each block is the number of the corresponding channels.
Figure 6.1  Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.
Figure 6.2  Front depth camera samples of all the subjects in the EVAL [12] dataset.
Figure 6.3  The reference groundtruth classes of UBC3V synthetic data.
Figure 6.4  The confusion matrix of Net 3 estimates on the Test set of Hard-Pose.
Figure 6.5  The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right). The images are in their original size.
Figure 6.6  The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).
Figure 6.7  Mean average joint prediction error on the groundtruth and the Net 3 classification output. The error bar is one standard deviation. The average error on the groundtruth is 2.44cm, and on Net 3 is 5.64cm.
Figure 6.8  Mean average precision of the groundtruth dense labels and the Net 3 dense classification output with accuracy at threshold 10cm of 99.1% and 88.7% respectively.
Figure 6.9  Dense classification result of Net 3 together with the original depth image on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.
Figure 6.10  Blue color is the motion capture groundtruth on the Berkeley MHAD [31] and the red color is the linear regression pose estimate.
Figure 6.11  Pose estimate mean average error per joint on the Berkeley MHAD [31] dataset.
Figure 6.12  Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.
Figure 6.13  Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has been trained only on synthetic data.

Acknowledgments

I would like to express my sincerest gratitude to my supervisor and mentor, Professor James J. Little, who has always been considerate, supportive, and most importantly, patient with me. His guidance and support in academia and life have been truly invaluable. I also would like to thank Professor Robert J. Woodham for the intellectually stimulating conversations in almost every encounter. My appreciation goes to Prof. David Kirkpatrick, Prof. Nick Harvey, and Prof. Mark Schmidt, from whom I learned innumerable lessons. I am thankful to the Computer Science Department staff, who have helped me in various circumstances in an outstandingly professional and courteous manner. I would like to thank Ankur Gupta and Bita Nejat for their friendship and support throughout the difficult times, and for continually going out of their way to offer solace. I am thankful to the other fellow graduate students whom I had the pleasure of knowing and working with. My greatest gratitude is reserved for my parents, Asghar Shafaei and Zahra Baghaei, for teaching me what is truly valuable. I am eternally indebted to them for their never-ending support and comfort.

Dedication

Chapter 1

Introduction

Pose estimation in computer vision is the problem of determining an approximate skeletal configuration of people in the environment through visual sensor readings (see Figure 1.1). Postural information provides a high level abstraction over the visual input which serves as a foundation for other computer vision tasks such as activity understanding, automatic surveillance, and gesture recognition, to name a few.

However, possible applications of pose estimates are not limited only to computer vision. For instance, postural information is used in human motion capture for computer generated imagery. Within this context, a 3d character is visualized while moving like a real human model – used mostly in movie and gaming products. Currently the most reliable and accurate existing method is markered motion capture: a collection of retroreflective markers is attached to the subjects, usually in the form of specialized clothing, and then tracked by infrared cameras. A reliable pose estimation method can virtually replace the existing motion capture systems in any context.

Similarly, in human computer interaction one can use this abstract information to design user interfaces that actively collaborate with a person. Interactive augmented reality simulations, for instance, can benefit from real-time human interaction to provide an immersive, exciting, and educative experience.

In medical care, systems that are capable of pose estimation can be used to facilitate early diagnosis of cognitive decline in patients. Pose estimation also opens the possibility for doctors to perform remote physiotherapy and to ensure the patient is performing the activities correctly by examining the exact movements. In e-health we can also use postural information to monitor the elderly who prefer to live alone. By careful analysis one can generate emergency notifications in case of an accident.

Figure 1.1: The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.

1.1 Kinect Sensor

The Kinect sensor provides a high resolution RGB video stream, as well as a specialized depth stream, where for each pixel, instead of color, we get a measure of distance from the camera. This depth image can then be used to reconstruct a volumetric model of the observed space. The depth image is usually referred to as 2.5d data because we get one depth reading along each ray through an image pixel.

This sensor was popularized by Microsoft to facilitate game development with human interactions. For example, in one game the player can learn to perform dance moves correctly; or in another, a player can interact with virtual animals through augmented reality. In the research community, people have been using this sensor extensively in robotics to perform tasks more accurately.
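Since each pixel of the 2.5d depth image corresponds to one reading along a camera ray, the image maps directly to a point cloud once the camera intrinsics are known. The following is a minimal NumPy sketch of that back-projection; the resolution and intrinsic values are illustrative placeholders rather than an actual Kinect 2 calibration.

```python
import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (in millimetres) to 3d points in the camera frame.

    Each pixel (u, v) with depth z yields one point along the camera ray, which is
    what makes the depth image '2.5d' rather than fully volumetric.
    """
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0   # metres
    valid = z > 0                              # zero depth means no reliable reading
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]                       # (N, 3) array of valid points

if __name__ == "__main__":
    # Hypothetical resolution and pinhole intrinsics, for illustration only.
    depth = np.random.randint(500, 4500, size=(424, 512)).astype(np.uint16)
    cloud = depth_to_points(depth, fx=365.0, fy=365.0, cx=256.0, cy=212.0)
    print(cloud.shape)
```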

Such a sensor can simplify important indoor problems that otherwise would be difficult to deal with through mere color data, such as obstacle avoidance, or determining the correct way to grasp objects.

A topic of current interest with the Kinect sensor is human pose estimation. Using a depth image has a few attractive properties that make pose estimation easier than using data from the color domain. With the additional depth information, we no longer need to worry about the scale ambiguity that is always present with single color images (i.e., is the object small and close, or big and distant?). Furthermore, the depth image primarily captures the shape of the objects rather than their visual pattern (see Figure 1.2). This brings in invariance to complex visual patterns for free; we can focus solely on the shape of the observation, which is greatly beneficial for the pose estimation problem. In domains such as pose estimation with RGB images, the high clothing variation, which alters the visual pattern, is itself a major source of complication in learning usable filters.

As part of the software development kit released for the Kinect, Microsoft also provides APIs to automatically perform pose estimation. The original software developed by Microsoft makes the assumption that there is a cooperative user facing the camera and interacting with the system. While this assumption is valid in the home entertainment setting, the resulting pose estimation algorithm does not generalize well in broader scenarios in which the user is not necessarily facing the camera or is not cooperative.

This limitation has led researchers to work on more robust pose estimation algorithms. Ever since the original publication [36] there have been numerous attempts at improving the empirical results of pose estimation [e.g., 1, 13, 17, 43, 44]. However, much of this research has been focused on performing pose estimation with only a single-view depth image.

One of the common obstacles with using multiple Kinect 1 devices in the past has been the interference between Kinects that are aimed in the same direction. The early versions of this sensor determined depth by emitting a patterned infrared light and analyzing its reflection. When these patterns collide, the Kinect is either unable to determine depth, or the result of the computation is highly erroneous.

Figure 1.2: A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.

With the recent release of the Kinect 2 and its new depth-sensing technology, using multiple Kinects has finally become practical; but at present there is relatively little literature on pose estimation with multiple Kinects.

1.2 Our Scenario

While many of the current methods focus on a single depth camera, using just a single camera has one inherent limitation that is not possible to deal with: when there is occlusion in the scene, there is no way to be sure of the hidden body part's pose. But if a second camera is present, there is a higher chance of observing the occluded region from a different viewpoint, and hence of having a more reliable output.

In this thesis our work is focused on generating better pose estimates with multiple Kinect sensors. By doing so we hope to perform better pose estimation in human monitoring scenarios. The basic idea is to install a set of Kinects inside our target environment and non-intrusively perform pose estimation on a not necessarily cooperative user.

Our first contribution is a generalized framework for multiview depth-based pose estimation that is inspired by the existing literature. We split the problem into several subproblems that we solve separately. Our approach enables using the state-of-the-art for each subproblem independently. We also instantiate the said framework with simple, yet effective, methods and discuss various design decisions and their effects on the final result.

The targeted application of such a system is to facilitate e-health solutions within home environments. As part of our work we also develop a lightweight, flexible, platform-independent, and distributed software framework to coordinate the orchestration and data communication of several Kinect sensors in real-time. We use this system as our infrastructure for experiments and evaluations. We present our software framework in Chapter 3. In Chapter 5 we describe our abstract pose estimation framework and instantiate it with state-of-the-art image segmentation and a simple, yet effective, pose estimation algorithm.

1.3 Datasets

One of the main challenges in this field is the absence of datasets for training or even a standard benchmark for evaluation. At the time of writing, the only dataset that provides depth from more than one viewpoint is the Berkeley Multimodal Human Action Database (MHAD) [31], with only two synchronized Kinect readings.

Collecting useful data is a lengthy and expensive process with unique challenges. For instance, reliably annotating posture on multiview data is by itself an expensive task for humans to do, let alone dealing with the likely event of annotators not agreeing on the same exact label. One can also resort to using external motion capture systems to accurately capture joint configurations while recording the data; however, a reliable motion capture requires wearing marker sensors that alter the visual appearance.

At the same time, with the advances in the graphics community, it is not difficult to solve the forward problem; that is, given a 3d model and a pose, generate a realistic depth image from multiple viewpoints. In contrast, we are interested in the inverse problem, that is, to infer the 3d pose from multiview depth images. One way to benefit from the advances in graphics is to synthesize data.

Using synthetic depth data for pose estimation has previously been proposed by Shotton et al. [36]. Interestingly, they show that their synthetic data provides a challenging baseline for the pose estimation problem. Unfortunately this pipeline, and even their data, is not publicly available, and we believe the technical difficulties of building such a pipeline may have discouraged many researchers from taking this particular direction.

As part of our contribution we adopt and implement a data generation pipeline to render realistic training data for our work. In Chapter 4 we thoroughly discuss the challenges, the details, and the differences with the previous work. Note that our contribution here removes a huge obstacle and makes further research within this realm possible. The developed pipeline, together with a compilation of datasets with varying levels of complexity, will be released to encourage further research in this direction.

1.4 Pose Estimation

The main focus of this thesis is the problem of estimating human postures given multiview depth information. Our work differs from the previous work in the following respects.

Multiview Depth. Most prior work has focused on single-view depth information. While there is a substantial amount of work on multiview RGB-based pose estimation, there are relatively few publications on multiview depth-based pose estimation. Furthermore, the absence of datasets further complicates our work. As part of our contribution we also release a dataset for multiview pose estimation.

Context Assumptions. The major focus of pose estimation papers has been on a limited context. For instance, the method that powers the Kinect SDK assumes that the camera is installed for home entertainment scenarios and that a cooperative user is using the system. In contrast, our focus is on home monitoring, where cameras are more likely to be installed on the walls rather than in front of the television or the user. Furthermore, the user is not necessarily cooperative, or even aware of the existence of such a system in the environment. Our goal is to improve upon single-view pose estimation techniques by analyzing the aggregated information of multiple views. Such a system will enable the application of pose estimation in broader contexts.

At a high level our system connects to n Kinects that are installed within an environment. We then retrieve depth images from all of these Kinects to find and estimate the posture of the individuals who are present and observable. Each person may be visible from one or more Kinects. The output of our system is the location of each predefined skeletal joint in $\mathbb{R}^3$. These steps are visually illustrated in Figure 1.3. Further discussion will be presented in Chapter 5.

Figure 1.3: An overview of our pipeline. In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.

1.5 Outline

In Chapter 2 we explore the literature of pose estimation, and for the unfamiliar reader we also present some of the fundamental methods that will facilitate comprehension of the subsequent chapters. Chapter 3 is dedicated to the high-level abstraction of our scenario and the infrastructure that we have designed for our experiments. It is the requirement engineering portion of our project, where we describe all the steps that are required to prepare the environment. In Chapter 4 we describe the data synthesis procedure and its challenges. At the end of that chapter we describe the properties of the datasets that we use in our work. Our abstract pose estimation framework is defined and motivated in Chapter 5. As we develop the framework we discuss the possible approaches one can take, which serve as possible future work for the interested reader. We also describe the chosen components that we experiment with in the subsequent chapter. Chapter 6 includes all the experiments that we have conducted together with a discussion of the results. We conclude in Chapter 7 and provide possible future directions that may be of interest.

Chapter 2

Related Work

In this chapter we present relevant background material for the methods in this thesis. In Section 2.1 we precisely define the pose estimation problem and present a summary of recent work. We then look at the image segmentation problem in Section 2.2 and discuss recent progress that underlies the foundation of our pipeline. To train our models we apply curriculum learning, which is presented in Section 2.3.

2.1 Pose Estimation

Previous work on real-time pose estimation can be categorized into top-down and bottom-up methods. Top-down or generative methods define a parametric shape model based on the kinematic properties of the human body. These models generally require expensive optimization procedures to be initialized with parameters that accurately explain the presented evidence. After the initialization step the parameter estimates are used as a prior for tracking the subject [11, 12, 18, 43].

Top-down methods require an accurate shape model in order to generate a reasonably precise pose estimate. A common practice for shape estimation is to a priori adapt the basic model to fit the physical properties of the test subjects. The shape estimation process usually requires a co-operative user taking a neutral pose such as the T-Pose (i.e., standing erect with hands stretched) at the beginning, which makes it difficult to apply top-down methods in non-cooperative scenarios.

Bottom-up discriminative models, the second category of approaches, directly focus on the current input to identify individual body parts, usually down to pixel-level classification. These estimates are then used to generate hypotheses about the configuration of the body that usually neglect higher-level kinematic properties and may give unlikely or impossible results. However, bottom-up methods are fast enough to be combined with a separate tracking algorithm that ensures the labeling is consistent and correct throughout the subsequent frames. Random-forest-based techniques have been shown to be an efficient approach for real-time performance in this domain [14, 34, 37, 44].

Pose estimation is the problem of determining the skeletal joint locations of human subjects within a given image. More formally, given an image $I$ the problem is to detect all human subjects $s \in S_I$ in the image $I$ and determine the joint configuration of each subject $s$ as $P^s = \{p^s_1, \dots, p^s_n\}$, where $p^s_i$ corresponds to the location of the $i$-th joint for subject $s$, and $n$ is the total number of predefined joints. There are two variations of pose estimation: 3d pose estimation and 2d pose estimation. In 3d pose estimation we are interested in finding the 3d real-world joint locations (i.e., $p^s_i \in \mathbb{R}^3$), while in the 2d setting we only want to label particular pixel coordinates on the spatial input such as an image.

While 2d images are primarily used to generate 2d pose estimates, it is also possible to infer 3d pose from a single or multiple 2d images. The multiple images used for 3d pose estimates can either come from a coherent sequence, such as a video, or simultaneously from different viewpoints. Commercial motion capture systems such as Vicon use up to 8 cameras/viewpoints for real-time joint tracking. However, the cheapest, fastest, and reasonably accurate pose estimates come from 2.5d depth images.

Pose estimates are useful for providing a higher level abstraction of the scene in problems such as action understanding, surveillance, and human computer interaction. An accurate pose estimate could be crucial for reliable action understanding [21]. One of the benefits of having a reliable pose estimate is the possibility of defining view-invariant features that can significantly decrease our dependence on training data with multiple viewpoints.

We expect a pose estimation process to operate under various constraints such as time, memory, and processing power. This is important because pose estimation is potentially the beginning of a longer pipeline that is resource demanding, whether it is human computer interaction or action recognition. Furthermore, if we wish to apply pose estimation tasks in embedded systems, or even mobile devices, satisfying the resource constraints is even more critical.

Two types of challenges arise in pose estimation: appearance and structural. The appearance problem refers to the way the human body is captured under varying lighting conditions, varied clothing, and different views, which makes it difficult to recognize the body in arbitrary settings. The structural problem refers to the exponentially large space of possible configurations of the human body and the resulting ambiguities; an exhaustive search through all possible joint configurations is simply not a feasible approach, and an effective system needs to capture all configurations in a way that allows it to choose a sensible pose within a reasonable time.
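To make the definition above concrete, the following is a small sketch of a 3d pose estimate $P^s$ represented as an array of $n$ joint locations, together with the two measures used later in the evaluation chapter: the mean joint prediction error and the accuracy at a distance threshold. The joint count and the 10cm threshold here are illustrative.

```python
import numpy as np

# A 3d pose estimate for one subject: an (n, 3) array of joint locations in metres.
# n and the joint ordering are whatever the chosen skeleton definition prescribes.
predicted = np.random.rand(15, 3)
groundtruth = np.random.rand(15, 3)

def mean_joint_error(pred, gt):
    """Mean Euclidean distance between predicted and groundtruth joint locations."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def accuracy_at_threshold(pred, gt, threshold=0.10):
    """Fraction of joints predicted within `threshold` metres of the groundtruth."""
    return float((np.linalg.norm(pred - gt, axis=1) < threshold).mean())

print(mean_joint_error(predicted, groundtruth),
      accuracy_at_threshold(predicted, groundtruth))
```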

2.1.1 Single-view Pose Estimation

RGB Based

Bourdev and Malik [4] propose a method to group neighboring body joints based on the groundtruth configuration. They randomly select a window containing at least a few joints and collect training images that have the exact same configuration. Each chosen pattern is called a poselet. It is typical to randomly choose thousands of poselets and learn appropriate filters to detect them later. During the evaluation these filters are run on an image and the highest responses vote for the true joint location.

The time complexity of learned poselet models grows linearly with the number of poselets, which can be an obstacle to scalability. Chen et al. [6] present a hierarchical evaluation method to make poselet-based approaches scalable in practice. Bourdev et al. [5] explore an alternative approach by learning generalized deep convolutional networks that generate pose-discriminative features. Gkioxari et al. [15] use poselets in the context of a deformable parts model to accurately estimate pose. The main idea of [15] is to reason about the relative positions of poselets to remove erroneous results and improve accuracy.

Yang and Ramanan [42] formulate a structural-SVM model to infer the possible body part configurations. The unary terms in their formulation correspond to the local evidence through a learned mixture of HOG filters, and the binary terms are quadratic functions of the relative joint locations that score the determined structure. At test time they efficiently pick the maximum-scoring configuration by applying dynamic programming and distance transform operators [8]. Toshev and Szegedy [40] directly learn a deep network regression model from images to body joint locations. Tompson et al. [39] learn a convolutional network to jointly identify body parts and perform belief-propagation-like inference on a graphical model.

Depth Based

Most of the application-oriented approaches to pose estimation rely on depth data or a combination of depth and RGB data. The common preference for depth sensors is due to their capability of operating in low-light or even no-light conditions, providing color-invariant data, and also resolving the scale ambiguity of the RGB domain.

One of the successful applications of pose estimation is the Microsoft Kinect and the work of Shotton et al. [36]. Shotton et al. use synthetic depth data and learn random-forest-based pixel classifiers on single depth images. The joint estimates are then derived from the densely classified depth by applying a mean-shift-based mode seeking method. Shotton et al. also propose an alternative method to directly learn regression forests that estimate the joint locations.

The use of random forests in [36] allows a 200fps performance. Furthermore, the accuracy of a single-depth pose estimate is high enough that no temporal constraint (e.g., tracking) on the input depth stream is necessary. However, since the algorithm of Shotton et al. uses a vast number of decision trees, it is a resource-demanding algorithm for real-time applications. Notably, Shotton et al. assume a home entertainment scenario with a co-operative user, which limits the applicability of their solution.

Baak et al. [1] describe a data-driven approach to pose estimation that runs at 60fps. Their method depends on a good initialization of a realistic 3d model while the co-operative subject is taking a neutral pose. Furthermore, the initialization step depends on a few hyper-parameters that are manually tuned for each dataset.

Ye and Yang [43] describe a probabilistic framework to simultaneously perform pose and shape estimation of an articulated object. They assume the input point cloud is a Gaussian Mixture Model whose centroids are defined by an articulated deformable model. Ye and Yang describe an Expectation Maximization approach to estimate the correct deformation parameters of a 3d model to explain the observations. It is possible to run this computationally intensive algorithm in real-time if the implementation is on the GPU.

Ge and Fan [13] introduce a non-rigid registration method called Global-Local Topology Preservation (GLTP). Their method combines the two preexisting approaches of Coherent Point Drift [30] and articulated ICP [32] into a complementary hybrid. They first initialize a realistic 3d model assuming the person is in a neutral pose and then track each joint similarly to Pellegrini et al. [32]. Their method heavily relies on the target person starting with a neutral pose, which is generally not the case in a monitoring setting. Furthermore, this system is computationally expensive and does not offer real-time performance.

Yub Jung et al. [44] demonstrate a Random-Tree-Walk-based method that achieves 1000fps pose estimation. They learn a regression forest for each joint to guide a walk on the depth image from a random starting point. After a predefined number of steps they use the average location as the joint position estimate. The speed improvement of their method is due to learning random forests per joint rather than per pixel (as opposed to Shotton et al. [36]). This method does not model the structural constraints of the human body; rather, it uses the forest as a guide to search the spatial domain of the input depth image.

2.1.2 Multiview Depth-based Pose Estimation

Michel et al. [28] use multiple depth sensors to generate a single point cloud of the environment. This point cloud is then used to configure the postural parameters of a cylindrical shape model by running particle swarm optimization offline. The physical description of each subject is manually tuned before each experiment. Phan and Ferrie [33] use optical flow in the RGB domain together with depth information to do multiview pose estimation within a human-robot collaboration context at the rate of 8fps. Phan and Ferrie report a median joint prediction error of approximately 15cm on a T-pose sequence. Zhang et al. [45] combine the depth data of multiple Kinects with wearable pressure sensors to estimate shape and track human subjects at 6fps.

To the best of our knowledge, these are the only published methods for multiview pose estimation from depth.

2.2 Dense Image Segmentation

Dense image segmentation is the problem of generating per-pixel classification estimates. This particular area of computer vision has been progressing rapidly during the past few months, and all of the top competing methods make use of convolutional networks in one way or another.

One of the major obstacles with commonly used convolutional networks for dense classification is that the output of each layer shrinks spatially as we progress towards the end of the pipeline. While in general classification this is a desirable property, in dense classification it effectively leads to high strides in the output space, which gives coarse region predictions.

Long et al. [26] propose a specific type of architecture for image segmentation that uses deconvolution layers to scale up the outputs of individual layers in a deep structure. These deconvolutional layers act as spatially large dictionaries that are combined in proportion to the outputs of the previous layer. Long et al. also fuse information from lower layers with higher layers through summation.

Hu and Ramanan [20] further motivate the use of lower layers by looking at these networks from another perspective. They show that having top-down propagation of information, as opposed to just doing a bottom-up or feed-forward pass, is an essential part of reasoning for a variety of computer vision tasks – something that is motivated by empirical neuroscientific results. The interesting aspect of this work is that we can simulate the top-down propagation by unrolling the network backwards and treating the whole architecture as a feed-forward structure.
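The deconvolution-and-summation idea of Long et al. [26] can be sketched in a few lines. The following PyTorch snippet is not the network used in this thesis (that architecture is described in Chapter 5); it is a toy illustration of how coarse class scores are upsampled with a learned deconvolution and fused by summation with scores computed from a shallower, higher-resolution layer. The 44 output channels mirror the channel count in Figure 5.4 (the 43 body regions plus, presumably, a background class); everything else is arbitrary.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional segmentation network with one skip fusion."""

    def __init__(self, num_classes=44):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.score_coarse = nn.Conv2d(64, num_classes, 1)   # coarse scores at 1/4 resolution
        self.score_skip = nn.Conv2d(32, num_classes, 1)     # scores from the lower layer at 1/2 resolution
        self.up = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up_final = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.block1(x)                                  # 1/2 resolution features
        f2 = self.block2(f1)                                 # 1/4 resolution features
        fused = self.up(self.score_coarse(f2)) + self.score_skip(f1)   # summation fusion
        return self.up_final(fused)                          # back to input resolution

scores = TinyFCN()(torch.randn(1, 1, 64, 64))                # -> shape (1, 44, 64, 64)
```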

Chen et al. [7] show that it is possible to further improve image segmentation by adding a fully connected CRF on top of the deep convolutional network. This approach essentially treats the output of the deep network as features for the CRF, except that these features are also automatically learned by the back-propagation algorithm. More recently, Zheng et al. [46] present a specialized deep architecture that integrates a CRF inside itself. They show that this architecture is capable of performing mean-field-like inference on a CRF with Gaussian pairwise potentials while doing the feed-forward operation. They successfully train this architecture end-to-end and achieve the state-of-the-art performance in dense image segmentation.

All of the recent work suggests that it is possible to incorporate the local dependencies of the output domain as part of a deep architecture itself. Building on this observation, we also take advantage of deep architectures as part of our work.

2.3 Curriculum Learning

Bengio et al. [2] describe curriculum learning as a possible approach to training models that involve non-convex optimization. The idea is to rank the training instances by their difficulty. This ranking is then used for training the system by starting with simple instances and gradually increasing the complexity of the instances during the training procedure. This strategy is hypothesized to improve the convergence speed and the quality of the final local minima [2]. Kumar et al. [25] later introduced the idea of self-paced learning, where the system decides which instance is more important to learn next – in contrast to the earlier approach, where an oracle had to define the curriculum before training starts. More recently, Jiang et al. [23] combined these two methods into an adaptive approach to curriculum learning that takes the feedback of the classifier into consideration while following the original curriculum guidelines.

Our experiments suggest that such a controlled approach to training deep convolutional networks can be crucial for obtaining a better model, providing an example of curriculum learning in practice.
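A schematic sketch of this training strategy, assuming an ordered list of datasets such as Easy-Pose, Inter-Pose, and Hard-Pose: the same model is trained on one stage and its parameters initialize the next, mirroring how Net 2 and Net 3 are initialized from Net 1 and Net 2 in Chapter 6. The toy datasets, loss, and optimizer below are placeholders, not the ones used in this thesis.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_with_curriculum(model, stages, epochs_per_stage=2, lr=1e-3):
    """Curriculum learning: fit the same model on datasets ordered from easy to hard.

    `stages` is a list of datasets; the parameters learned on one stage
    initialize training on the next, harder stage.
    """
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for stage, dataset in enumerate(stages):
        loader = DataLoader(dataset, batch_size=32, shuffle=True)
        for _ in range(epochs_per_stage):
            for x, y in loader:
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
        print(f"finished curriculum stage {stage}")
    return model

# Toy stages standing in for Easy-Pose / Inter-Pose / Hard-Pose.
toy = lambda n: TensorDataset(torch.randn(n, 8), torch.randn(n, 3))
model = train_with_curriculum(nn.Linear(8, 3), [toy(128), toy(128), toy(128)])
```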

Chapter 3

System Overview

In this chapter we present an overview of our environment and the developed system. In Section 3.1 we talk about the general context of our problem and highlight a few important details. We then talk about the high-level specification of our system in Section 3.2. Section 3.3 is dedicated to the internal structure and the data flow within our system.

3.1 The General Context

The vision of our project is to non-obtrusively collect activity information within home environments. The target application is e-health care and automated monitoring of people who may suffer from physical disabilities and require immediate attention in case of an accident. A collection of sensors is installed inside the house and a central server processes all the information. The specific sensor that we will be using is the Microsoft Kinect 2; however, the developed framework is capable of incorporating more sources of information.

The Kinect 2 sensor is, abstractly, a consolidated set of microphones, infrared sensors, a depth camera, and an HD RGB camera. The relatively cheap price and availability of this sensor have generated immense interest in developing systems with multiple Kinect sensors. Since this version of the device does not cause interference with other Kinect 2 sensors, it has become more attractive than ever.

In our scenario the Kinect 2 sensors are installed within an indoor space such as a house or a room. A central server located inside the house processes all the incoming data. Privacy concerns add the limitation that no raw information, such as the RGB video feed or the depth stream, can be stored on disk. Therefore, we are limited to real-time methods that analyze the data as the observations take place. The system should also recognize people automatically to profile the activities of individuals and to invoke notifications in case of an emergency.

A technical challenge is to organize communication with the Kinects. Each Kinect 2 requires the full bandwidth of a USB 3 controller, and the connecting cable cannot be longer than 5m – the underlying USB 3 protocol has a maximum communication latency limit. Additionally, the Microsoft Kinect 2 SDK does not support multiple Kinects at the same time. Our solution to these technical challenges is to deploy the system on multiple computers that communicate over the network.

3.2 High-Level Framework Specification

At the highest abstraction level our system is a collection of small software packages that communicate with each other on a network through message passing. The main component is a singleton Smart Home Core running on the server. For each Kinect involved we run a separate Kinect Service that processes and transmits sensor readings to the Smart Home Core software through a network with TCP/IP support (see Figure 3.1). Decoupling the individual components has the added benefit of easier scalability. For instance, the software operates independently of the total number of active Kinects in the system – if we later wish to add more Kinects for better accuracy or more coverage, we can do so without altering the software.

The foundation of this platform is implemented in C# under the .NET Framework 4. We use OpenCV (www.opencv.org) and the Point Cloud Library, PCL (www.pointclouds.org), at the lower levels for efficiency in tasks such as visualization and image processing.


Messages used in inter-process communication are serialized using Google Protocol Buffers (developers.google.com/protocol-buffers/) to achieve fast and efficient transmission. The language neutrality of Google Protocol Buffers allows interfacing with multiple languages; for example, we can communicate with a Kinect Service to gather data inside Matlab. To minimize network overhead we further compress the message payload with lossless LZ4 compression (https://code.google.com/p/lz4/) and a lossy JPEG scheme. The final system transmits a 720p video stream and depth data at a frame rate of 30fps while consuming only 5.3MB of bandwidth, making simultaneous communication with up to ten Kinects feasible over wireless networks. While we can still optimize the communication costs for even higher scalability, we found that the described system is efficient enough to proceed with our experiments.

Figure 3.1: The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.
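The production services are written in C# and serialize messages with Protocol Buffers; the following Python sketch only illustrates the transport idea: compress a depth frame losslessly with LZ4, prepend a length header, and stream it over a socket. The JPEG path for color frames is omitted, and the frame size is an arbitrary example.

```python
import socket
import struct
import numpy as np
import lz4.frame  # pip install lz4

def send_depth_frame(sock, depth):
    """Length-prefix framing: a 4-byte payload size followed by the LZ4-compressed frame."""
    payload = lz4.frame.compress(depth.tobytes())
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def recv_depth_frame(sock, shape, dtype=np.uint16):
    """Read one length-prefixed message and decompress it back into a depth image."""
    (size,) = struct.unpack("!I", _recv_exact(sock, 4))
    payload = _recv_exact(sock, size)
    return np.frombuffer(lz4.frame.decompress(payload), dtype=dtype).reshape(shape)

if __name__ == "__main__":
    a, b = socket.socketpair()                    # stand-in for a real TCP connection
    frame = np.random.randint(0, 4500, size=(100, 128), dtype=np.uint16)
    send_depth_frame(a, frame)
    assert np.array_equal(recv_depth_frame(b, shape=frame.shape), frame)
```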

3.3 Internal Structure and Data Flow

The internal structure of our system is best described by walking through the data flow path within our pipeline. The highest abstraction level of the data flow is shown in Figure 3.2.


Figure 3.2: The high-level representation of data flow within our pipeline. The pose estimation block operates independently from the number of the active Kinects.

3.3.1 Camera Registration and Data Aggregation

Each Kinect sensor makes measurements in its own coordinate system. This coordinate system is a function of the camera's location and orientation in our environment. The problem of finding the relative transformations that unify the measurements into a single coordinate system is called camera calibration or extrinsic camera parameter estimation. Camera calibration has been studied extensively in the computer vision and robotics communities [10, 16, 19].

Within our problem context we assume the cameras are installed in fixed locations. Therefore, we only need to calibrate the cameras once, and as long as we can do this in a reasonable time we can resort to simple procedures. In our pipeline we simply perform feature matching in the RGB space to come up with reasonably accurate transformation parameters, and then run the Iterative Closest Point (ICP) [3] algorithm to fine-tune the estimates. Our implementation uses SIFT [27] to match features and then estimates a transformation matrix $\hat{T}$ by minimizing an $\ell_2$ loss over the corresponding matched depth locations within a RANSAC [9] pipeline. To find a locally optimal transformation with respect to the entire point clouds, we then initialize the ICP method with $\hat{T}$ using the implementation in PCL.

After generating a transformation estimate for each sensor pair we can unify the coordinate spaces and merge all the measurements into the same domain. Figure 3.3 demonstrates a real output after determining a unified coordinate system. By adding more cameras we can increase the observable space to the entire house.
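The core geometric step of this calibration, estimating a rigid transform $\hat{T}$ that minimizes the $\ell_2$ loss over matched 3d locations, has a standard closed-form SVD (Kabsch) solution. The sketch below shows only that least-squares core; in our pipeline such an estimate is computed inside a RANSAC loop over SIFT matches and then refined with ICP, which the sketch does not reproduce.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform mapping `src` points onto `dst` points.

    src, dst: (N, 3) arrays of matched 3d locations from two Kinects.
    Returns a 4x4 homogeneous matrix; the closed-form SVD (Kabsch) solution
    minimizes the sum of squared distances between R @ src + t and dst.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```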

Figure 3.3: An example to demonstrate the output result of camera calibration. The blue and the red points are coming from two different Kinects facing each other but they are presented in a unified coordinate space.

By constructing a larger observable space we no longer need to know about the individual Kinects in our pipeline. If later we decide to add more Kinects, we can simply estimate the relative transformation of the newly added Kinect to at least one of the existing cameras – at this point the unification is straightforward due to transitivity. After merging the new data we will have a more accurately observed space or a larger observable area, and either way the rest of the pipeline remains agnostic to the number of the Kinects involved. After unifying the measurements the data flow path proceeds to the pose estimation stage. In this thesis we only focus on development of the pose estimation subsystem and leave the other potential stages to the future work.
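The transitivity argument amounts to composing homogeneous transforms. A short sketch with toy matrices; the naming convention (T_b_to_ref, T_new_to_b) is ours and purely illustrative.

```python
import numpy as np

def compose(T_b_to_ref, T_new_to_b):
    """Transitivity of extrinsics: chain 4x4 homogeneous transforms to reach the unified frame."""
    return T_b_to_ref @ T_new_to_b

# Toy example: camera B sits 2m along x in the reference frame; the new Kinect sits 1m along z in B's frame.
T_b_to_ref = np.eye(4); T_b_to_ref[0, 3] = 2.0
T_new_to_b = np.eye(4); T_new_to_b[2, 3] = 1.0
T_new_to_ref = compose(T_b_to_ref, T_new_to_b)

p_new = np.array([0.0, 0.0, 0.0, 1.0])        # a point at the new Kinect's origin
print((T_new_to_ref @ p_new)[:3])             # -> [2. 0. 1.] in the unified frame
```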

3.3.2 Pose Estimation

At the pose estimation stage the target is to identify the posture of every person in the observable space. For each individual we perform 3d pose estimation based on the depth and the point cloud data. This stage of the pipeline can further be separated into different parts, as shown in Figure 3.4.

Figure 3.4: The pose estimation pipeline in our platform.

The main focus of this thesis is on this particular stage of the developed framework. Our pose estimation pipeline consists of four stages.

Human segmentation At this stage we perform background subtraction to separate the evidence of each person from the background so that the rest of the pipeline can examine the data in isolation.

Pixel-wise classification After separating each person we perform classification on each pixel of the depth image.

Classification aggregation At this stage we merge all the classification results of all the cameras.

Pose estimation Given the merged evidence of the previous step we now solve the pose estimation problem and derive the actual joint locations.

Further details of our pose estimation methodology are presented in Chapter 5.

Chapter 4

Synthetic Data Generation

To address the absence of appropriate datasets we use computer generated imagery to synthesize realistic training data. The original problem of interest is extracting human posture from the input, but the inverse direction, that is, generating output from human posture, is reasonably solved in computer graphics. The main theme of this chapter is to simulate this inverse process to generate data.

Commercial 3d rendering engines such as Autodesk Maya (www.autodesk.com/products/maya) have simplified modeling a human body with human inverse kinematic algorithms. A human inverse kinematic algorithm calculates human joint angles under the constraints of human anatomy to achieve the desired posture. The HumanIK middleware (http://gameware.autodesk.com/humanik) that underlies the aforementioned 3d engine has also been widely adopted in game development to create real-time characters that interact with the environment of the game.

While inverse kinematic algorithms facilitate body shape manipulation in a credible way, we also require realistic 3d body shapes to begin with. Luckily, there are numerous commercial and non-commercial tools to create realistic 3d human characters with desired physical attributes. We will be using the term 'character' from this point on to refer to body shapes.

Shotton et al. [36] demonstrate the efficacy of using synthetic data for depth-based pose estimation and argue that the synthesized data tends to be more difficult for pose estimation than real-world data. This behavior is attributed to the high variation of possible postures in the synthetic data, while the real-world data tends to exhibit a biased distribution towards the common postures.


Figure 4.1: The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera locations to generate realistic training data.

In the remainder of this chapter we discuss the specifics of the data generation process. Our pipeline is adopted from the previous work presented in Shotton et al. [36]. An overview of the data generation process is shown in Figure 4.1. We use a collection of real human postures and synthesize data with realistic 3d models and random camera locations. The output of this process has been carefully tuned to generate usable data for our task.

4.1 Sampling Human Pose

At this stage of the data generation pipeline we are interested in collecting a set of real human postures. With the powerful HumanIK and a carefully defined space of postures, it would be possible to simply enumerate over all possible configurations. However, we take the simpler path of using data that has already been collected from human subjects and leave spontaneous pose generation to future work.

To collect real human postures we chose the publicly available CMU motion capture dataset (mocap.cs.cmu.edu). This dataset consists of over four million motion capture frames of human subjects performing a variety of tasks, ranging from day to day conversation with other people to activities such as playing basketball. Each sequence of this dataset has the recorded joint rotation parameters of the subjects' skeleton. Using the orientation of each bone rather than the XYZ coordinates of the joint locations has the benefit of being invariant to the skeleton's physical properties. By defining the physical properties of the skeleton ourselves, which is merely the length of a few bones, we can convert this rotational information to absolute XYZ joint locations.

One way to use this dataset is to simply pick random frame samples from the entire pool of frames. However, doing so would heavily bias our pose space because of the redundant nature of the data. For instance, consecutive frames of a 120fps sequence are highly similar to each other. Moreover, activities such as 'walk' tend to show up frequently in the dataset over different sequences. Therefore, we need to build a uniform space of postures to make sure the data skew does not bias our subsequent models.

To build an unbiased dataset we collect a representative set of 100,000 human postures. To achieve this goal we first define a basic fixed skeleton, convert the rotational information to Cartesian space, and then run the K-means clustering algorithm on this dataset with 100K centers. We use the Fast Library for Approximate Nearest Neighbours (FLANN) [29] to speed up the nearest neighbor look-ups. After finding the association of each posture with the cluster centers, we identify the median within each cluster and pick the corresponding rotational data as a representative for that cluster. After selecting a set of 100K human postures we split the data into three sets of 60K, 20K, and 20K for train, validation, and test respectively. We refer to this pose set as MotionCK. In Figure 4.2 you can see some examples of our postures. Note that at this stage our sets are merely a description of pose and do not include any 3d characters.

24 Figure 4.2: Random samples from MotionCK as described in Section 4.1.
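The clustering step of Section 4.1 can be sketched as follows. The thesis pipeline runs K-means with 100K centers and uses FLANN for the nearest-neighbour look-ups; the sketch below substitutes scikit-learn's MiniBatchKMeans, a toy pose matrix, and the member closest to each cluster center in place of the per-cluster median, so it illustrates the idea rather than reproducing the implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def representative_poses(poses_xyz, k):
    """Cluster Cartesian poses and return one representative frame index per cluster.

    poses_xyz: (num_frames, num_joints * 3) array of joint locations computed on a
    fixed skeleton. The representative is the member closest to its cluster center.
    """
    km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(poses_xyz)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(poses_xyz[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.array(reps)

# Toy stand-in for the ~4M CMU mocap frames: 2000 frames, 15 joints, 50 clusters.
toy_poses = np.random.rand(2000, 15 * 3)
print(representative_poses(toy_poses, k=50)[:10])
```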

4.2 Building Realistic 3d Models

The next stage in our pipeline is to create realistic human 3d models. We use the open-source MakeHuman project (www.makehuman.org), which allows the creation of human-like models with varying physical and clothing attributes. Since we are only interested in generating synthetic depth data, applying a human-like skin is irrelevant to the depth information. Hence, we create our own special skin that reflects our target regions of interest. Our model has 43 different body regions with distinguishing labels for the left and right body parts (see Figure 4.3). We purposefully chose to oversegment the body parts so that we can merge them later if necessary, without regenerating the data.

To make sure our data includes variety in shape, we create 16 characters with varying parameters in age, gender, height, and weight (2 of each). The MakeHuman project allows a higher degree of freedom in making a model; however, we found that varying other parameters does not substantially affect the apparent physical attributes. All of our characters can be seen in Figure 4.4. We plan to release our models for public use.

4.3 Setting Camera Location

In order to render a 3d model we also require a camera location. The camera parameters control the viewpoint from which we collect data. Recall that in our problem we do not require cooperation, and we would like to estimate pose from distances of up to seven meters and heights of up to three meters.



Figure 4.3: Regions of interest in our humanoid model. There are a total of 43 different body regions, color-coded as above. (a) The frontal view and (b) the dorsal view.

Figure 4.4: All the 16 characters we made for synthetic data generation. Subjects vary in age, weight, height, and gender.


Therefore we should define the camera location with respect to the aforementioned assumptions. We chose the following possible configurations for the camera. The height of the camera is assumed to be between one and three meters from the ground. We assume the person is at most four meters away from the sensor, and the relative azimuthal angle between the person and the camera spans the entire 2π range. See Figure 4.5 for a visualization of the defined camera parameters. The chosen parameters are for data generation purposes only; in Chapter 5 we describe how our method handles cases where the person is farther away. The intrinsic camera parameters such as the focal length and output size are carefully chosen based on the intrinsic depth camera parameters of the Kinect 2 device to ensure the synthetic data is as comparable to real data as possible.

Figure 4.5: An overview of the extrinsic camera parameters inside our data generation pipeline.
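One sample of the extrinsic parameters of Figure 4.5 can be turned into a camera pose as follows. The sketch assumes the subject stands at the origin of a z-up world frame and aims the camera at a point one metre above the ground; these conventions are illustrative, not the ones used in our Maya scripts.

```python
import numpy as np

def sample_camera(rng, theta_range=(-np.pi, np.pi), h_range=(1.0, 3.0), d_range=(1.5, 4.0)):
    """Sample (Theta, H, D) as in Figure 4.5 and build a camera position and rotation."""
    theta = rng.uniform(*theta_range)
    h = rng.uniform(*h_range)
    d = rng.uniform(*d_range)
    position = np.array([d * np.cos(theta), d * np.sin(theta), h])
    target = np.array([0.0, 0.0, 1.0])            # aim roughly at the subject's torso

    forward = target - position
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    R = np.stack([right, up, -forward], axis=1)   # camera axes as world-frame columns
    return position, R

pos, R = sample_camera(np.random.default_rng(0))
print(pos, R, sep="\n")
```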

4.4 Sampling Data

To generate data we follow the sampling process described in Algorithm 1. The first input to our algorithm is C, the pool of target 3d characters (e.g., the characters of Section 4.2). The second input is the range of camera locations L (e.g., the definition in Section 4.3). The third input is the pool of postures P (e.g., MotionCK). Finally, the last parameter is the total number of viewpoints n. The output sample

is a set S = {(D_i, G_i)}_{i=1}^{n}, where D_i and G_i are the depth and the groundtruth image as seen from the i-th camera.

Algorithm 1 Sample data
Input: C (pool of characters), L (range of camera locations), P (pool of postures), n (number of cameras).
1: procedure SAMPLE(C, L, P, n)
2:     c ~ Unif(C)                     ▷ select a random character
3:     l_{1:n} ~ Unif(L)               ▷ select n random camera locations
4:     p ~ Unif(P)                     ▷ select a posture
5:     S ← render the depth and groundtruth images of (c, p) from cameras l_{1:n}
6:     return S                        ▷ S = {(D_i, G_i)}_{i=1}^{n}

Separating the inputs to our function gives us full control over the data generation pipeline. In Section 4.5 we will generate multiple datasets with different inputs to this function. To implement Algorithm 1 we use Python for scripting and Maya for rendering. We generate over 2 million samples within our pipeline for training, validation, and testing.
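A minimal Python sketch of Algorithm 1 follows; sample_location and render_view are hypothetical stand-ins for the camera sampler and the Maya rendering step, and the pools are treated as plain sequences.

import random

def sample_scene(characters, sample_location, postures, n, render_view, rng=random):
    # One draw of Algorithm 1: returns S = [(D_1, G_1), ..., (D_n, G_n)].
    #   characters      : pool C of 3d characters.
    #   sample_location : callable drawing one camera location from L.
    #   postures        : pool P of postures.
    #   n               : number of cameras.
    #   render_view     : callable producing the (depth, groundtruth) pair for one camera.
    c = rng.choice(characters)                         # select a random character
    locations = [sample_location() for _ in range(n)]  # select n random locations
    p = rng.choice(postures)                           # select a posture
    return [render_view(c, p, loc) for loc in locations]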

4.5 Datasets

In order to apply curriculum learning we make datasets of different complexity to train our models. We first start training on the simplest dataset and then gradually increase the complexity of the data to adapt our models. Further details on our training procedure are described in Chapter 5. The datasets that we have created are summarized in Table 4.1. Our first dataset is Easy-Pose, which is the simplest of the three. To generate Easy-Pose we select the subset of postures in MotionCK that are labeled with 'Walk' or 'Run', and pick only one 3d character from our models. The second dataset, Inter-Pose, extends Easy-Pose by adding more variation to the possible postures. A hypothetical model making the transition from Easy-Pose to Inter-Pose is required to learn more postures for the same character. The final dataset, Hard-Pose, includes all the 3d characters of Section 4.2.

Table 4.1: Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter, and D refers to the distance parameter as described in Figure 4.5. The simple set is the subset of postures that have the label 'walk' or 'run'. Going from the first dataset to the second requires pose adaptation, while going from the second to the third requires shape adaptation.

Dataset       Postures          Characters   Camera Parameters                            Samples
Easy-Pose     simple (~10K)     1            Θ ~ U(−π,π), H ~ U(1,1.5)m, D ~ U(1.5,4)m    1M
Inter-Pose    MotionCK (100K)   1            Θ ~ U(−π,π), H ~ U(1,1.5)m, D ~ U(1.5,4)m    1.3M
Hard-Pose     MotionCK (100K)   16           Θ ~ U(−π,π), H ~ U(1,3)m, D ~ U(1.5,4)m      300K

Each dataset of Table 4.1 has a train, test, and validation set with mutually exclusive sets of postures. We generate all the data with n = 3 cameras. In Figure 4.6, Figure 4.7, and Figure 4.8 you can find sample training data from Easy-Pose, Inter-Pose, and Hard-Pose respectively.

4.6 Discussion

In Section 4.1 we extracted a set of representative postures from the publicly available CMU Mocap dataset. Even though this dataset has over four million frames of real human motion capture, it does not cover the entire set of possible postures. Shotton et al. [36] further collect real data within their problem context to improve generalization, but they do not release it for public use. To create a better representation of human posture we suggest two potential approaches: collecting more data, and using human inverse kinematics algorithms. While collecting more motion capture data can be prohibitively expensive, it guarantees the validity of the collected postures. Alternatively, it is also possible to enumerate over a space of body configurations and rely on HumanIK to calculate appropriate joint angles. However, a majority of the postures generated with this procedure could be unrealistic and useless without an effective pruning strategy.


Figure 4.6: Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.


Figure 4.7: Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.


Figure 4.8: Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.

In Section 4.2 we created 16 characters to generate data with. The chosen regions of interest are defined heuristically while considering the previous work. Since we were able to achieve satisfactory results (presented in Chapter 6) we did not spend time on other region selection schemes. One potential future work here is to experiment with different region definitions to gauge the room for improvement.

In Section 4.3 we defined a subspace to sample uniformly for the camera location parameter. An idea that we did not explore is to generate datasets with different elevations. Since at test time we have knowledge of the camera location, it is possible to use only models that are trained for that specific elevation.

Chapter 5

Multiview Pose Estimation

In this chapter we present a general framework for multiview depth-based pose estimation. Our approach is to define a sequence of four tasks that must be addressed in order to predict the pose. Each task is a specific problem for which we can apply a multitude of approaches. The four stages of our framework are depicted in Figure 5.1. In the first stage we perform background subtraction on the input depth image to separate out the human pixels. For each set of identified human pixels we then generate a pixel-wise classification where each pixel is labeled according to the body regions in Figure 4.3. Recall that each Kinect has an independent machine running an instance of Kinect Service (see Figure 3.2). The first two stages of our pipeline can be run on each machine in a distributed fashion, or run centrally within Smart Home Core. The next step is to aggregate information from all the cameras into a single unified coordinate space. This aggregation will result in a labeled point cloud of the human body. The final step is to perform pose estimation on the labeled point cloud of the previous stage. In the following sections we go through each step of this pipeline to provide an in-depth description of each task. We then discuss the potential design choices and present the motivation behind our chosen methods. We end this chapter in Section 5.5 by discussing alternative design choices and potential future research directions.


Figure 5.1: Our framework consists of four stages – human segmentation, pixel-wise classification, classification aggregation, and pose estimation (the first two run per Kinect) – through which we gradually build higher-level abstractions. The final output is an estimate of human posture.

5.1 Human Segmentation

Human segmentation is a binary background/foreground classification task that assigns a label y ∈ {0,1} to each pixel. The purpose of this task is to separate people from the background so that individuals can be processed in isolation. While devising an exact solution to this problem is arguably a challenge in the RGB domain, it is possible to build sufficiently accurate methods in the depth domain. With depth data it is possible to mark the boundary pixels of a human body by simply examining the discontinuities – unfortunately, in the RGB domain this cue is unreliable. Generating a pixel mask for each input is generally treated as a pre-processing step for which efficient tools already exist, and in the pose estimation literature it is commonly assumed to be given [36, 43, 44]. While theoretically it is possible to use any classification model for this task, random forests have been shown to be particularly efficient. Since the rest of the pipeline is likely to require more sophisticated models to yield a good result, we also use random decision forest classifiers for this step, as is commonly practiced in the literature. More specifically, we use the implementation of the Kinect SDK to execute this step of the pipeline. A sample output of this step is shown in Figure 5.2. Note that after this stage of the pipeline we will be looking at individual human subjects.


Figure 5.2: Sample human segmentation in the first stage of our pose estimation pipeline.

5.2 Pixel-wise Classification

Given a masked depth image of a human subject, our task is to assign a class label y ∈ Y to each pixel, where Y is the set of our body classes as defined in Figure 4.3. This formulation of the subproblem has previously been motivated by Shotton et al. [36], but we have a few differences from the previous work. Shotton et al. [36] assume the user is facing the camera, leading to a simplified classification problem because it is no longer necessary to distinguish between the left and right side of the body. Furthermore, Shotton et al. use body labels with only 21 regions which wrap around the person. In this work we extend this to 43 regions to distinguish the right and left sides of the body (see Figure 4.3). In our context, unlike the case of Shotton et al., it is not possible to make classification decisions locally. A left or right hand label, for instance, depends on the existence of a frontal or dorsal label for the head at a different spatial location; and even then, there are still special cases that must be taken into account. The other assumption that we are relaxing is the relative distance between the camera and the user. The home entertainment scenario of [36] is no longer valid in our context, and thus our system should be able to handle a greater variety of viewpoints. Furthermore, in our context the user is not necessarily a co-operative agent, which further complicates our task. We may be able to assume co-operation

by focusing on use cases such as distant physiotherapy, but the general monitoring problem does not admit this simplification. We approach the classification problem from a new perspective that allows the embedding of higher-level inter-class spatial dependencies. In the following sections we fully describe our pixel-wise classifier. In Section 5.2.1 we describe how background/foreground masks and depth images are used for normalization. In Section 5.2.2 we describe the CNN architecture that takes the normalized image as input and generates a densely classified output image. Further discussion of our architecture is presented in Section 5.2.3.

5.2.1 Preprocessing the Depth Image

The first step in our image classification pipeline is to normalize the input image to make it consistent across all possible inputs. The input to this stage is a depth image with a foreground mask of the target subject whose pose needs to be estimated. We first quantize and linearly map the depth value range [50, 800]cm to the range [0, 255]. We then crop the depth image using the foreground mask and scale the image to fit in a 190 × 190 pixel window while preserving the aspect ratio. Finally, we translate all the depth values so that the average depth is approximately 160cm. After adding a 30-pixel margin we have a 250 × 250 pixel image which is used in the next stage. See the examples in Figure 5.3.
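The sketch below illustrates this normalization with NumPy and OpenCV. The constants follow the description above, but details such as the interpolation mode and the way the average-depth shift interacts with the quantization are our own assumptions rather than the exact thesis implementation.

import numpy as np
import cv2

def normalize_depth(depth_cm, mask, out_size=250, fit_size=190, margin=30,
                    near=50.0, far=800.0, target_cm=160.0):
    # depth_cm: depth image in centimetres; mask: boolean foreground mask.
    # 1. Quantize: linearly map [50, 800] cm to [0, 255].
    depth = (np.clip((depth_cm - near) / (far - near), 0.0, 1.0) * 255.0).astype(np.float32)
    depth[~mask] = 0.0
    # 2. Crop to the foreground bounding box.
    ys, xs = np.nonzero(mask)
    crop = depth[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # 3. Scale to fit a 190 x 190 window while preserving the aspect ratio.
    scale = fit_size / max(crop.shape)
    new_wh = (max(1, round(crop.shape[1] * scale)), max(1, round(crop.shape[0] * scale)))
    crop = cv2.resize(crop, new_wh, interpolation=cv2.INTER_NEAREST)
    crop_mask = cv2.resize(crop_mask.astype(np.uint8), new_wh,
                           interpolation=cv2.INTER_NEAREST).astype(bool)
    # 4. Shift depth values so the average foreground depth maps to ~160 cm.
    target = (target_cm - near) / (far - near) * 255.0
    crop[crop_mask] += target - crop[crop_mask].mean()
    # 5. Centre the crop inside a 250 x 250 canvas, leaving a 30-pixel margin.
    out = np.zeros((out_size, out_size), dtype=np.float32)
    y0 = margin + (fit_size - crop.shape[0]) // 2
    x0 = margin + (fit_size - crop.shape[1]) // 2
    out[y0:y0 + crop.shape[0], x0:x0 + crop.shape[1]] = crop
    return out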

5.2.2 Dense Segmentation with Deep Convolutional Networks

We use Convolutional Neural Networks (CNN) to generate a densely classified output from the normalized input depth image. Our network architecture is inspired by the work of Long et al. [26] in image segmentation. We use deconvolution outputs and fuse the result with information from the lower layers to generate a densely classified depth image. Our approach takes advantage of the information in neighboring pixels; this is in contrast to random-forest-based methods such as [37], where each pixel is evaluated independently. The particular architecture that we have chosen is presented in Figure 5.4. The input to this network is a single channel of normalized depth data.


Figure 5.3: Sample input and output of the normalization process. (a,b) the input from two views, (c,d) the corresponding foreground mask, (e,f) the normalized image output. The output is rescaled to 250×250 pixels. The depth data is from the Berkeley MHAD [31] dataset.


Figure 5.4: Our CNN architecture. The input is a 250×250 normalized depth image. The first row of the network generates a 44 × 14 × 14 coarsely classified output with a high stride. The network then learns deconvolution kernels that are fused with information from the lower layers to generate a finely classified output. Like [26], we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in parentheses within each block is the number of channels.

In the first row of operations in Figure 5.4 our network generates a 14 × 14 output with 44 channels. The remainder of the network is responsible for learning deconvolution kernels to generate a dense classification output. After each deconvolution we fuse the output with the lower-layer features through summation. The final deconvolution operation, with a kernel size of 19 × 19, enforces the spatial dependency of adjacent pixels within a 19-pixel neighborhood. At the end, our network gives a 250 × 250 output with 44 channels – one per class and one for the background label. This stage of our pipeline can either be run independently within each Kinect Service, or executed on all the data at once within Smart Home Core.

5.2.3 Designing the Deep Convolutional Network

To the best of our knowledge, the general approach to designing deep convolutional networks is by trial and error. Most of the current applications of CNNs simply reuse previously successful architectures such as VGGNet [38] or AlexNet [24].

Such applications often fine-tune an architecture pretrained on a dataset such as ImageNet [35] on the specific target data. We initially started our experiments by training window classifiers that label each pixel locally by only observing a 50 × 50 pixel window. The final architecture for our window classifier is the first row of Figure 5.4, where a 14 × 14 output with 44 channels is generated. In the window classification setting, the output is simply 1 × 1 with 44 channels – one per class and one for the background. After training the initial window classifier we fix the parameters and extend the network with deconvolution layers to get the final architecture of Figure 5.4. We learn the parameters of the newly added layers by training the entire network on densely labeled input data. Using the deconvolution approach of Long et al. [26] to generate a densely classified output was a particularly attractive choice because of its compatibility with our window classifier. During our experiments we learned that a separate step for window classification was unnecessary, and thus we abandoned the initial two-step approach and trained the entire network in an end-to-end fashion.

5.3 Classification Aggregation

After generating a densely classified depth image, we use the extrinsic camera parameters of our set-up to reconstruct a merged point cloud in a reference coordinate system. It is possible to apply various filtering and aggregation techniques; however, we have found this simple and fast approach to be sufficiently effective. At this point we have a labeled point cloud, which is the final result after combining all the views. We then extract a feature f from our merged point cloud to be used in the next stage of our pose estimation pipeline. For each class in our point cloud we extract the following features:

• The median location.

• The covariance matrix.

• The eigenvalues of the covariance matrix.

• The standard deviation within each dimension.

• The minimum and maximum values in each dimension.

The final feature f is the concatenation of all the above features into a single feature vector f ∈ R^{1032}. We chose these features so that feature extraction runs in real-time. It is also possible to apply more computationally expensive data summarization techniques to generate possibly better features for the next stage.
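The per-class summaries listed above reduce to a few NumPy operations; the sketch below is our own illustration (24 values per class, 43 classes, giving the 1032-dimensional vector f), and the handling of nearly empty classes is an assumption.

import numpy as np

def point_cloud_features(points, labels, num_classes=43):
    # points: (N, 3) merged point cloud; labels: (N,) body-part class ids.
    features = []
    for c in range(num_classes):
        pts = points[labels == c]
        if len(pts) < 2:
            features.append(np.zeros(24))   # assumption: missing classes contribute zeros
            continue
        cov = np.cov(pts, rowvar=False)     # 3x3 covariance matrix
        features.append(np.concatenate([
            np.median(pts, axis=0),         # median location (3)
            cov.ravel(),                    # covariance matrix (9)
            np.linalg.eigvalsh(cov),        # its eigenvalues (3)
            pts.std(axis=0),                # per-dimension standard deviation (3)
            pts.min(axis=0),                # per-dimension minimum (3)
            pts.max(axis=0),                # per-dimension maximum (3)
        ]))
    return np.concatenate(features)         # f in R^1032 for 43 classes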

5.4 Pose Estimation

We treat the problem of pose estimation as regression. For each joint j in our skeleton we would like to learn a function F_j(·) that predicts the location of joint j given the feature vector f. After examining a few design choices with real-time performance, such as linear regression and neural networks, we learned that simple linear regression gives the best trade-off between complexity and performance. Our linear regression is a least-squares formulation with an ℓ2 regularizer, which is also known as ridge regression (Equation 5.1).

\arg\min_{W_j \in \mathbb{R}^{1032 \times 3},\, b_j \in \mathbb{R}^{3}} \; \frac{1}{2n} \sum_{i=1}^{n} \left\lVert W_j^{T} f^{i} + b_j - y_j^{i} \right\rVert_2^2 \; + \; \frac{\lambda_j}{2} \left( \mathrm{Tr}(W_j^{T} W_j) + \lVert b_j \rVert_2^2 \right) \qquad (5.1)

Our regression problem for each joint j is defined in Equation 5.1. For each joint j we would like to learn a matrix W_j ∈ R^{1032×3} and a bias term b_j ∈ R^{3} using the regularization parameter λ_j. If we append a constant one to the feature vector f, we can absorb the bias term b_j into W_j and arrive at the closed-form solution shown in Equation 5.2.

W_j = \left( F^{T} F + n \lambda_j I \right)^{-1} F^{T} Y_j \qquad (5.2)

Here F ∈ R^{n×1033} is the design matrix of all the training features and Y_j ∈ R^{n×3} holds the corresponding coordinates for the j-th joint. Having a closed-form solution allows fast hyperparameter optimization of λ_j. We also experimented with the LASSO counterpart to obtain sparser solutions, but the improvements were negligible while the optimization took substantially more time. If the input data is

over a sequence, we further smooth the predictions temporally by calculating a weighted average with the previous estimate (Equation 5.3).

\hat{Y}^{s}_{t} = (1 - \eta)\, \hat{Y}_{t-1} + \eta\, \hat{Y}_{t}, \qquad 0 \le \eta \le 1 \qquad (5.3)

where Ŷ^s_t is the smoothed estimate at time t and Ŷ_t is the original estimate at time t. The regularizer hyper-parameters and the optimal smoothing weights are chosen automatically by cross-validation over the training data. For pose estimation it is also possible to apply more complex methods such as structural SVMs [42] or Gaussian Processes [41]. However, if we choose more complicated methods we will also need more data and computational power. Since we need to evaluate on real data, and each dataset comes with its own definition of the skeleton, we prefer the simplest approach in order to simultaneously maintain real-time performance and robustness against overfitting. Because the joint definitions differ across datasets, we train this part of our pipeline separately for each dataset.
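The closed-form solution of Equation 5.2 and the smoothing of Equation 5.3 take only a few lines of NumPy; the sketch below is illustrative (the bias is absorbed by appending a constant one to each feature vector, and using the previous estimate for smoothing is our choice of implementation detail).

import numpy as np

def fit_ridge(F, Y, lam):
    # Equation 5.2: W = (F^T F + n*lam*I)^(-1) F^T Y
    # F: (n, 1033) design matrix (features with a constant 1 appended); Y: (n, 3).
    n, d = F.shape
    return np.linalg.solve(F.T @ F + n * lam * np.eye(d), F.T @ Y)

def predict_joint(W, f):
    # Predict one joint location from a single 1032-dimensional feature vector f.
    return np.append(f, 1.0) @ W

def smooth(prev_estimate, current_estimate, eta):
    # Equation 5.3: weighted average of the previous and current estimates, 0 <= eta <= 1.
    return (1.0 - eta) * prev_estimate + eta * current_estimate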

5.5 Discussion

In Section 5.2 we presented a CNN architecture to generate densely classified outputs. Our architecture builds on the ideas of Long et al. [26]; however, there has been a recent surge of architectures, such as the ones presented by Zheng et al. [46] and Hu and Ramanan [20], that directly model CRFs inside the CNN. One future direction is to investigate more recent CNN architectures to improve the classification.

In Section 5.3 we aggregated the classification results of all the views through simple algebraic operations. By using temporal information and applying filtering strategies we could eliminate some noise in the final labeled point cloud, which is likely to lead to minor improvements in the pipeline. Another possible direction is to explore various shape summarization techniques to generate better features.

The final step of pose estimation in Section 5.4 is a simple linear regression which ignores the kinematic constraints. While the presented CNN architecture has control over kinematic constraints, one potential direction is to also add kinematic constraints in the final prediction stage while maintaining real-time performance. A kinematic model that also incorporates temporal consistency is likely to

significantly improve the results. Collecting real data with multiple Kinect sensors is also a valuable future direction that would greatly benefit the community and help with further development of multiview pose estimation methods.

Chapter 6

Evaluation

In this chapter we provide our evaluation results on three datasets: (i) UBC3V Synthetic, (ii) Berkeley MHAD [31], and (iii) EVAL [12]. To train and evaluate our deep network model we use Caffe [22]. Since each dataset has a specific definition of joint locations, we only need to train the regression part of our pipeline (see Section 5.4) on each dataset.

Evaluation Metrics. There are two common evaluation metrics for the pose estimation task: (i) mean joint prediction error and (ii) mean average precision at threshold. Mean joint prediction error is the measure of the average error incurred in the prediction of each joint location. A mean average error of 10cm for a joint simply indicates that we incur an average error of 10cm per estimate. Mean average precision at threshold is the fraction of predictions that are within the threshold distance of the groundtruth. A mean average precision of 70% for a threshold of 5cm means that our estimates are within a 5cm radius of the groundtruth 70% of the time. Because of the errors in groundtruth annotations it is also common to report only the mean average precision at a 10cm threshold [12, 43, 44]. A short sketch of both metrics follows the dataset overview below.

Datasets. The only publicly available dataset for multiview depth at the moment of writing is the Berkeley MHAD [31]. Note that our target is pose estimation with multiple depth cameras; therefore we only qualitatively evaluate our CNN body part classifier on single-camera datasets, as our pose estimation technique is not applicable to single-depth-camera settings. Our evaluation also includes the results on our synthetic dataset with three depth cameras.
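Both metrics reduce to simple operations on the per-joint Euclidean errors; the following sketch is our own illustration of the definitions above.

import numpy as np

def joint_errors(pred, gt):
    # pred, gt: (num_frames, num_joints, 3) joint locations in cm.
    return np.linalg.norm(pred - gt, axis=-1)

def mean_joint_error(pred, gt):
    # Mean joint prediction error in cm, averaged over frames and joints.
    return joint_errors(pred, gt).mean()

def mean_average_precision(pred, gt, threshold_cm=10.0):
    # Fraction of joint predictions within threshold_cm of the groundtruth.
    return (joint_errors(pred, gt) <= threshold_cm).mean()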

UBC3V Synthetic. For evaluation we use the Test set of Hard-Pose. This dataset consists of 19000 body postures with 16 characters viewed from three cameras at random locations. Note that these 19000 postures are not present in the training set of our dataset and have not been observed before. The groundtruth and the extrinsic camera parameters come directly from the synthetic data. For more information and sample data see Chapter 4.

Berkeley MHAD [31]. This dataset includes 12 subjects performing 11 actions while being recorded by 12 cameras, two Kinect 1 devices, an Impulse motion capture system, four microphones, and six accelerometers. The mocap sequence generated by the Impulse motion capture system is the groundtruth for pose estimation on this dataset. A sample frame from all 12 subjects is shown in Figure 6.1. Note that we only use the depth information from the two Kinects for pose estimation and ignore all other sources of information.

EVAL [12]. This dataset consists of 24 sequences of three different characters performing eight activities each. Since this dataset is created for single-view pose estimation, we only qualitatively evaluate our body part classifier of Section 5.2 on this data. See Figure 6.2 for three random frames from this dataset.

6.1 Training the Dense Depth Classifier

Our initial attempts to train the deep network presented in Section 5.2.2 on the Hard-Pose dataset did not yield satisfying results. We experimented with various optimization techniques and configurations, but the accuracy of the network on dense classification did not go beyond 50%. Resorting to the curriculum learning idea of Bengio et al. [2] (see Section 2.3), we simplified the problem by defining easier datasets that we call Easy-Pose and Inter-Pose (see Table 4.1). We start training the network with the Easy-Pose dataset. Each iteration consists of eight densely classified images, and we stop at 250k iterations, reaching a dense classification accuracy of 87.8%. We then fine-tune the resulting network on Inter-Pose, initially starting at an accuracy of 78% and terminating at iteration 150k with an accuracy of 82%. Interestingly, the performance on Easy-Pose is preserved throughout this fine-tuning stage. Finally, we start fine-tuning on the Hard-Pose dataset and stop after 88k iterations.

Figure 6.1: Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.

Initially this network evaluates to 73%, and by the termination point we have an accuracy of 81%. The evolution of our three networks is shown in Table 6.1. Notice how the final accuracy improved from 50% to 81% by controlling the difficulty of the training instances that our network sees. Our experiments demonstrate a real application of curriculum learning [2] in practice. All of our networks are trained with Stochastic Gradient Descent (SGD) with a momentum of 0.99. The initial learning rate is set to 0.01 and multiplied by 10^{-1} every 30k iterations.

Figure 6.2: Front depth camera samples of all the subjects in the EVAL [12] dataset.

            Easy-Pose        Inter-Pose       Hard-Pose
            Start    End     Start    End     Start    End
Net 1       0%       87%     –        –       –        –
Net 2       87%      87%     78%      82%     –        –
Net 3       87%      85%     82%      79%     73%      81%

Table 6.1: The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2, respectively.

The weight decay parameter is set to 5·10^{-5}. In hindsight, all of the above stages of training can be run in approximately 5 days on a Tesla K40 GPU. Since we experimented with multiple architectures simultaneously, the exact time required to train these networks is not available. If we only extract the most likely class from the output, the CNN takes only 6ms to process each image on the GPU. However, calculating the exponents and normalizing the measures to get the full probability distribution for each pixel can cost up to an extra 40ms. To maintain real-time 30fps performance on multiple Kinects, we discard the full probability output and only use the most likely class for each pixel.
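The staged training of Section 6.1 can be summarized by the schematic loop below; train is a hypothetical wrapper around the Caffe solver, and only the hyperparameters quoted in the text (batch size 8, momentum 0.99, initial learning rate 0.01 decayed by 0.1 every 30k iterations, weight decay 5e-5) are taken from the thesis.

# Schematic curriculum schedule; `train` is a hypothetical helper that runs the
# Caffe solver for `iters` iterations starting from `init_weights`.
SOLVER = dict(batch_size=8, momentum=0.99, base_lr=0.01,
              lr_decay=0.1, lr_step=30_000, weight_decay=5e-5)

def curriculum(train):
    net1 = train(dataset="Easy-Pose",  iters=250_000, init_weights=None, **SOLVER)
    net2 = train(dataset="Inter-Pose", iters=150_000, init_weights=net1, **SOLVER)
    net3 = train(dataset="Hard-Pose",  iters=88_000,  init_weights=net2, **SOLVER)
    return net3  # the network used throughout the remaining evaluation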

6.2 Evaluation on UBC3V Synthetic

This dataset includes the groundtruth body part classification and pose annotation. Since the annotations come from synthetic data, there are no errors associated with the annotation. For real-world data, however, the annotations are likely to be erroneous to a small extent. Having groundtruth body part classes and postures allows us to separate the evaluation of the dense classifier from that of the pose estimates. That is, we can evaluate the pose estimates assuming a perfect dense classification is available, and then compare the results with those obtained from the densely classified depth image generated by our CNN. This separation gives us insight into how improvements in dense classification are likely to affect the pose estimates, and whether we should spend time improving the dense depth classifier or the pose estimation algorithm. For training we follow the multi-step fine-tuning procedure described in Section 6.1. We first train the network on the Train set of Easy-Pose. We then successively fine-tune on the Train sets of Inter-Pose and Hard-Pose. We refer to the third fine-tuned network as Net 3, which we will be using throughout the remainder of the thesis.

6.2.1 Dense Classification

The Test set of Hard-Pose includes 57057 depth frames with synthetically generated class annotations. This dataset is generated from a pool of 19000 postures that have not been seen by our classifier at any point. Furthermore, each frame of this dataset is generated from a random viewpoint. The reference class numbers are shown in Figure 6.3. The confusion matrix of our classifier is shown in Figure 6.4. Figure 6.5 displays a few sample classification outputs and the corresponding groundtruth images in their original size. Figure 6.6 shows a few enlarged sample classification outputs and the corresponding groundtruth. Note that for visualization we only use the most likely class at each pixel. The accuracy of Net 3 on the Test set is 80.6%, similar to the reported accuracy on the Validation set in Table 6.1. As is evident from Figure 6.5, the network correctly identifies the direction of the human body and assigns the appropriate left/right classes. However, the network seems to ignore sudden depth discontinuities in the classification (see the last row of Figure 6.5).

48 39 41 40 4243 34 33 32 31 10 9 38 37 36 35 17 18 31 32 33 34 29 11 35 36 37 38 30 29 13 12 15 16 14

1 2 7 5

3 4 8 6

19 20 20 19

21 22 22 21

23 24 24 23

27 28 26 25

Figure 6.3: The reference groundtruth classes of the UBC3V synthetic data.

6.2.2 Pose Estimation

We evaluate our linear regression on the groundtruth classes and on the classification output of our CNN. The estimates derived from the groundtruth serve as a lower bound on the error of the pose estimation algorithm. The mean average joint prediction error is shown in Figure 6.7. Our system achieves an average pose estimation error of 2.44cm on the groundtruth and 5.64cm on the Net 3 output. The gap between the two results is due to dense classification errors. This difference is smaller on easy-to-recognize body parts and gets larger on hard-to-recognize classes such as hands or feet. It is possible to reduce this gap by using more sophisticated pose estimation methods at the cost of more computation. In Figure 6.8 we compare the precision at threshold. The accuracy at 10cm for the groundtruth and Net 3 is 99.1% and 88.7% respectively.


Figure 6.4: The confusion matrix of the Net 3 estimates on the Test set of Hard-Pose (rows: true class, columns: estimated class).

6.3 Evaluation on Berkeley MHAD

This dataset has a total of 659 sequences of 12 actors over 11 actions with 5 repetitions¹. There are two Kinect 1 devices on opposite sides of the subjects capturing the depth information. This dataset defines 35 joints over the entire body (for a full list see Figure 6.11). The groundtruth pose here is the motion capture data. At the moment of writing there is no standard protocol for the evaluation of pose estimation techniques on this dataset. The leave-one-out approach is a common practice for single-view pose estimation.

¹ One sequence is missing.

Figure 6.5: The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right). The images are in their original size.


Figure 6.6: The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).

However, each action has five repetitions, and we argue that leave-one-out may not be a fair indicator of performance because the method can adapt to the shape of the test subject from the other sequences to obtain a better result. Furthermore, we are no longer restricted to only a few sequences of data as in previous datasets. To evaluate the performance on this dataset we take the harder leave-one-subject-out approach; that is, for the evaluation of each subject we train our system on all the other subjects. This protocol ensures that no extra physical information is leaked during training and provides a measure of robustness to shape variation.

52 20

18 Groundtruth Net 3

16

14

12

10

8

6

Mean Average Error (cm) Error Average Mean 4

2

0 Head Neck Spine2Spine1SpineHip RHip RKneeRFootLHip LKneeLFootRShoulderRElbowRHandLShoulderLElbowLHand

Figure 6.7: Mean average joint prediction error on the groundtruth and on the Net 3 classification output. The error bars are one standard deviation. The average error on the groundtruth is 2.44cm, and on Net 3 it is 5.64cm.

The Kinect depth images of this dataset are captured with Kinect 1 sensors, which have different intrinsic camera parameters than the Kinect 2. The difference in focal length and principal point offset can be eliminated by a simple scaling and translation of the depth image. To make the depth images of this dataset compatible with our pipeline, we resize and translate the provided depth images to match the intrinsic camera parameters of a Kinect 2 sensor. To verify the correctness of our procedure, we generate a point cloud from the final output using the Kinect 2 intrinsic camera parameters and compare the output cloud with the original point cloud generated from the Kinect 1 depth images. To measure the discrepancy between the two point clouds we run ICP for one iteration and calculate the objective value – we simply pick the translation and scale parameters that minimize the error objective between the two clouds.
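One way to implement this scale-and-translate adjustment is with a single affine warp, as sketched below. A pixel (u1, v1) seen with intrinsics (fx1, fy1, cx1, cy1) maps to u2 = (fx2/fx1)(u1 − cx1) + cx2, and similarly for v; the resampling choices here are our own assumptions, not the thesis implementation, and the calibrated Kinect 1 and Kinect 2 intrinsics must be supplied.

import numpy as np
import cv2

def adapt_intrinsics(depth_k1, K1, K2, out_size):
    # K1, K2: 3x3 intrinsic matrices; out_size: (width, height) of the Kinect 2 image.
    sx, sy = K2[0, 0] / K1[0, 0], K2[1, 1] / K1[1, 1]
    tx = K2[0, 2] - sx * K1[0, 2]
    ty = K2[1, 2] - sy * K1[1, 2]
    A = np.float32([[sx, 0, tx], [0, sy, ty]])
    # Nearest-neighbour resampling avoids mixing depth values across boundaries;
    # the depth values themselves are unchanged because the scene geometry is the same.
    return cv2.warpAffine(depth_k1, A, out_size, flags=cv2.INTER_NEAREST)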

53 1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

Mean Average Precision Average Mean 0.2 Groundtruth 0.1 Net 3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Threshold (cm)

Figure 6.8: Mean average precision of the groundtruth dense labels and of the Net 3 dense classification output, with an accuracy at the 10cm threshold of 99.1% and 88.7%, respectively.

6.3.1 Dense Classification

To reuse the CNN that we trained on the synthetic data in Section 6.2.1, we adjust the depth images using the procedure described earlier. After this step, we simply feed the depth image to the CNN to get dense classification results. Figure 6.9 shows the output of our dense classifier from the two Kinects on a few random frames. Even though the network has only been trained on synthetic data, it generalizes well to the real test data. As demonstrated in Figure 6.9, the network has also successfully captured the long-distance spatial relationships needed to correctly classify pixels based on the orientation of the body. The right column of Figure 6.9 shows an instance of high partial classification error due to occlusion. In the back image, the network mistakenly believes that the chair legs are the subject's hands. However, once the back data is merged with the front data we get a reasonable estimate (see Figure 6.10).


Figure 6.9: Dense classification result of Net 3 together with the original depth image (front and back Kinect views) on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.

6.3.2 Pose Estimation

We use the groundtruth motion capture joint locations to train our system. For each test subject we train our system on the other subjects' sequences. The final result is an average over all the test subjects. Figure 6.11 shows the mean average joint prediction error. The total average joint prediction error is 5.01cm. The torso joints are easier for our system to localize than the hand joints, a behavior similar to the synthetic data results. However, it must be noted that even the groundtruth motion capture on smaller body parts such as hands or feet is biased and has high variance. During visual inspection of Berkeley MHAD we noticed that, on some frames, especially when the subject bends over, the hands' location is outside of the body point cloud or even outside the frame, and clearly erroneous. The overall average precision at 10cm is 93%.
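The leave-one-subject-out protocol amounts to a simple loop over subjects; the sketch below assumes hypothetical fit_regressors and evaluate helpers wrapping the ridge regression of Section 5.4.

import numpy as np

def leave_one_subject_out(features_by_subject, joints_by_subject, fit_regressors, evaluate):
    # features_by_subject / joints_by_subject: dicts mapping subject id to arrays.
    subjects = sorted(features_by_subject)
    errors = []
    for held_out in subjects:
        train_ids = [s for s in subjects if s != held_out]
        F = np.concatenate([features_by_subject[s] for s in train_ids])
        Y = np.concatenate([joints_by_subject[s] for s in train_ids])
        model = fit_regressors(F, Y)          # one ridge regressor per joint
        errors.append(evaluate(model, features_by_subject[held_out],
                               joints_by_subject[held_out]))
    return float(np.mean(errors))             # average over all test subjects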

Figure 6.10: Blue denotes the motion capture groundtruth on the Berkeley MHAD [31] dataset and red denotes the linear regression pose estimate.


Figure 6.11: Pose estimation mean average error per joint on the Berkeley MHAD [31] dataset. The joints are: Hip, Spine1, Spine2, Spine3, Neck, Neck1, Head, RLShoulder, RShoulder, RArm, RElbow, RForearm, RHand, RHFingerBase, LLShoulder, LShoulder, LArm, LElbow, LForearm, LHand, LHFingerBase, RHip, RULeg, RKnee, RLLeg, RFoot, RToeBase, RToe, LHip, LULeg, LKnee, LLLeg, LFoot, LToeBase, and LToe.

An interesting observation is the similarity of the performance on the Berkeley MHAD data and on the synthetic data in Figure 6.7. This suggests that, at least for the applied methods, the synthetic data is a reasonable proxy for evaluating performance, which has also been suggested by Shotton et al. [37]. Figure 6.12 shows the accuracy at threshold for joint location predictions. We also compare our performance with Michel et al. [28] in Table 6.2. Since they use an alternative definition of the skeleton that is derived from their shape model, we only evaluate over the subset of our joints that are closest to the locations used by Michel et al. [28]. Note that the method of [28] uses predefined shape parameters that are optimized for each subject a priori and does not operate in real-time.

56 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2

Mean Average Precision Average Mean 0.1 0 0 2 4 6 8 10 12 14 16 18 20 Threshold (cm)

Figure 6.12: Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.

                       Subjects                    Actions
                       Mean    Std    Acc (%)      Mean    Std    Acc (%)
OpenNI [28]            5.45    4.62   86.3         5.29    4.95   87.3
Michel et al. [28]     3.93    2.73   96.3         4.18    3.31   94.4
Ours                   3.39    1.12   96.8         2.78    1.5    98.1

Table 6.2: Mean and standard deviation of the prediction error when testing on subjects and when testing on actions, using the joint definitions of Michel et al. [28]. We also report and compare the accuracy at the 10cm threshold.

In contrast, our method does not depend on shape attributes and operates in real-time. Following the procedure of [28] we evaluate our performance by testing on the subjects and by testing on the actions. Our method improves the previous mean joint prediction error from 3.93cm to 3.39cm (13%) when tested on subjects and from 4.18cm to 2.78cm (33%) when tested on actions.


Figure 6.13: Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has only been trained on synthetic data.

6.4 Evaluation on EVAL

There are a total of 24 sequences of 3 subjects with 16 joints. To generate a depth image from this dataset we must project the provided point cloud back into the original camera view and then rescale the image to resemble a Kinect 2 output. To verify the correctness of our depth image, we generate a point cloud from this image with the Kinect 2 parameters and compare it against the original point cloud provided in the dataset. Three sample outputs of our procedure are presented in Figure 6.2. Figure 6.13 shows four random dense classification outputs from this dataset. The first column of Figure 6.13 shows an instance of the network failing to confidently label the data with front or back classes, but the general locations of the torso, head, feet, and hands are correctly determined. The accuracy of our preliminary results suggests that single-depth-camera pose estimation techniques can benefit from using the output of our dense classifier.
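The projection of the provided point cloud back into a depth image is a standard pinhole projection with a z-buffer; the sketch below is our own illustration, and the commented intrinsics are placeholders rather than calibrated values.

import numpy as np

def project_to_depth(points, fx, fy, cx, cy, width, height):
    # points: (N, 3) cloud in the camera frame (z pointing forward, metric units).
    depth = np.full((height, width), np.inf, dtype=np.float32)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    valid = Z > 0
    u = np.round(fx * X[valid] / Z[valid] + cx).astype(int)
    v = np.round(fy * Y[valid] / Z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Keep the closest point per pixel (a simple z-buffer).
    np.minimum.at(depth, (v[inside], u[inside]), Z[valid][inside])
    depth[np.isinf(depth)] = 0.0
    return depth

# Example with placeholder Kinect 2-like intrinsics (substitute calibrated values):
# depth = project_to_depth(cloud, fx=365.0, fy=365.0, cx=256.0, cy=212.0,
#                          width=512, height=424)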

Chapter 7

Discussion and Conclusion

We presented an efficient and inexpensive markerless pose estimation system that uses only a few Kinect sensors. Our system only assumes the availability of calibrated depth cameras and is capable of real-time performance without requiring an explicit shape model of the subject or co-operation by the subject. While our main goal is to estimate the posture in real-time for smart homes, our system can also be used as a reasonably accurate and inexpensive replacement for commercial motion capture solutions in applications that do not require precise measurements. The non-intrusive nature of our system can also facilitate the development of easy-to-use virtual reality or augmented reality platforms.

The subproblems of our pose estimation pipeline as described in Chapter 5 are all open to further improvement. Our results in Chapter 6 suggest that improving the dense depth classifier of Section 5.2 is a worthwhile path to explore. For a thorough discussion on this topic we refer the reader to Section 5.5.

The supporting infrastructure of our pose estimation is a scalable and modular software framework for smart homes that orchestrates multiple Kinect devices in real-time. By tackling the technical challenges, our platform enables research on multiview depth-based pose estimation. The modular structure of our system simplifies the integration of more sources of information for the smart home application. Our platform is developed only to the extent required by our research on pose estimation. Adding more features, such as analysis of the auditory

signals to support voice-activated commands, is one of the many exciting research directions that our platform supports.

The training of our system was made possible by generating a dataset of 6 million synthetic depth frames. Our data generation process depended on a set of human postures collected from real data. The 100K set of postures (see Section 4.1) that we used is arguably not representative of every possible human posture. An interesting future research direction is to build automated systems that generate random, but plausible, body configurations.

Our experiments demonstrated an application of curriculum learning in practice, and our system exceeded the state-of-the-art multiview pose estimation performance on the Berkeley MHAD [31] dataset.

Bibliography

[1] A. Baak, M. Müller, G. Bharaj, H. P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision. 2013. → pages 3, 12

[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, 2009. → pages 15, 45, 46

[3] P. J. Besl and H. D. McKay. A method for registration of 3-D shapes. Transactions on Pattern Analysis and Machine Intelligence, 1992. → pages 19

[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision, 2009. → pages 11

[5] L. Bourdev, F. Yang, and R. Fergus. Deep poselets for human detection. arXiv preprint arXiv:1407.0717, 2014. → pages 11

[6] B. Chen, P. Perona, and L. Bourdev. Hierarchical cascade of classifiers for efficient poselet evaluation. In British Machine Vision Conference, 2014. → pages 11

[7] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015. → pages 15

[8] P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. Theory of Computing, 2012. → pages 12

[9] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981. → pages 19

[10] Y. Furukawa and J. Ponce. Accurate camera calibration from multi-view stereo and bundle adjustment. In Computer Vision and Pattern Recognition, 2008. → pages 19

[11] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a single time-of-flight camera. In Computer Vision and Pattern Recognition, 2010. → pages 9

[12] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real-time human pose tracking from range data. In European Conference on Computer Vision, 2012. → pages x, xi, 9, 44, 45, 47, 58

[13] S. Ge and G. Fan. Non-rigid articulated point set registration for human pose estimation. In Winter Applications of Computer Vision, 2015. → pages 3, 13

[14] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In International Conference on Computer Vision, 2011. → pages 10

[15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In Computer Vision and Pattern Recognition, 2014. → pages 11

[16] D. F. Glas, D. Brščić, T. Miyashita, and N. Hagita. SNAPCAT-3D: Calibrating networks of 3d range sensors for pedestrian tracking. In International Conference on Robotics and Automation, 2015. → pages 19

[17] L. He, G. Wang, Q. Liao, and J. H. Xue. Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing, 2015. → pages 3

[18] T. Helten, A. Baak, G. Bharaj, M. Müller, H. P. Seidel, and C. Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In International Conference on 3D Vision, 2013. → pages 9

[19] L. Heng, G. H. Lee, and M. Pollefeys. Self-calibration and visual slam with a multi-camera system on a micro aerial vehicle. In Robotics: Science and Systems (RSS), 2014. → pages 19

[20] P. Hu and D. Ramanan. Bottom-up and top-down reasoning with convolutional latent-variable models. arXiv preprint arXiv:1507.05699, 2015. → pages 14, 42

[21] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In International Conference on Computer Vision, 2013. → pages 10

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In International Conference on Multimedia, 2014. → pages 44

[23] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In Association for the Advancement of Artificial Intelligence, 2015. → pages 15

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012. → pages 39

[25] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010. → pages 15

[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015. → pages x, 14, 37, 39, 40, 42

[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004. → pages 19

[28] D. Michel, C. Panagiotakis, and A. A. Argyros. Tracking the articulated motion of the human body with two RGBD cameras. Machine Vision and Applications, 2014. → pages vii, 13, 56, 57

[29] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Transactions on Pattern Analysis and Machine Intelligence, 36, 2014. → pages 24

[30] A. Myronenko and X. Song. Point set registration: Coherent point drift. Transactions on Pattern Analysis and Machine Intelligence, 2010. → pages 13

[31] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Berkeley MHAD: A comprehensive multimodal human action database. In Winter Applications of Computer Vision, 2013. → pages ix, x, xi, 5, 38, 44, 45, 46, 55, 56, 57, 60

[32] S. Pellegrini, K. Schindler, and D. Nardi. A generalisation of the ICP algorithm for articulated bodies. In British Machine Vision Conference, 2008. → pages 13

[33] A. Phan and F. P. Ferrie. Towards 3D human posture estimation using multiple kinects despite self-contacts. In IAPR International Conference on Machine Vision Applications, 2015. → pages 14

[34] G. Pons-Moll, J. Taylor, J. Shotton, A. Hertzmann, and A. Fitzgibbon. Metric regression forests for human pose estimation. In British Machine Vision Conference, 2013. → pages 10

[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. → pages 40

[36] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, and A. Kipman. Efficient human pose estimation from single depth images. Transactions on Pattern Analysis and Machine Intelligence, 2013. → pages 3, 6, 12, 13, 22, 23, 29, 35, 36

[37] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013. → pages 10, 37, 56

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. → pages 39

[39] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, 2014. → pages 12

[40] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition, 2014. → pages 12

[41] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In Computer Vision and Pattern Recognition, 2008. → pages 42

[42] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. Transactions on Pattern Analysis and Machine Intelligence, 2013. → pages 12, 42

[43] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Computer Vision and Pattern Recognition, 2014. → pages 3, 9, 13, 35, 44

[44] H. Yub Jung, S. Lee, Y. Seok Heo, and I. Dong Yun. Random tree walk toward instantaneous 3d human pose estimation. In Computer Vision and Pattern Recognition, 2015. → pages 3, 10, 13, 35, 44

[45] P. Zhang, K. Siu, J. Zhang, C. K. Liu, and J. Chai. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. Transactions on Graphics, 2014. → pages 14

[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision, 2015. → pages 15, 42
