Multiview Depth-based Pose Estimation

by

Alireza Shafaei

B.Sc., Amirkabir University of Technology, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2015

© Alireza Shafaei, 2015

Abstract

Commonly used human motion capture systems require intrusive attachment of markers that are visually tracked with multiple cameras. In this work we present an efficient and inexpensive solution to markerless motion capture using only a few Kinect sensors. We use our system to design a smart home platform with a network of Kinects that are installed inside the house.

Our first contribution is a multiview pose estimation system. Unlike the previous work on 3d pose estimation using a single depth camera, we relax constraints on the camera location and do not assume a co-operative user. We apply recent image segmentation techniques with convolutional neural networks to depth images and use curriculum learning to train our system on purely synthetic data. Our method accurately localizes body parts without requiring an explicit shape model. The body joint locations are then recovered by combining evidence from multiple views in real-time.

Our second contribution is a dataset of 6 million synthetic depth frames for pose estimation from multiple cameras with varying levels of complexity to make curriculum learning possible. We show the efficacy and applicability of our data generation process through various evaluations. Our final system exceeds the state-of-the-art results on multiview pose estimation on the Berkeley MHAD dataset.

Our third contribution is a scalable software platform to coordinate Kinect devices in real-time over a network. We use various compression techniques and develop software services that allow communication with multiple Kinects through TCP/IP. The flexibility of our system allows real-time orchestration of up to 10 Kinect devices over Ethernet.

Preface

The entire work presented here has been done by the author, Alireza Shafaei, with the collaboration and supervision of James J. Little. A manuscript describing the core of our work and our results has been submitted to the IEEE Conference on Computer Vision and Pattern Recognition (2016) and is under anonymous review at the moment of thesis submission.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Kinect Sensor
  1.2 Our Scenario
  1.3 Datasets
  1.4 Pose Estimation
  1.5 Outline
2 Related Work
  2.1 Pose Estimation
    2.1.1 Single-view Pose Estimation
    2.1.2 Multiview Depth-based Pose Estimation
  2.2 Dense Image Segmentation
  2.3 Curriculum Learning
3 System Overview
  3.1 The General Context
  3.2 High-Level Framework Specification
  3.3 Internal Structure and Data Flow
    3.3.1 Camera Registration and Data Aggregation
    3.3.2 Pose Estimation
4 Synthetic Data Generation
  4.1 Sampling Human Pose
  4.2 Building Realistic 3d Models
  4.3 Setting Camera Location
  4.4 Sampling Data
  4.5 Datasets
  4.6 Discussion
5 Multiview Pose Estimation
  5.1 Human Segmentation
  5.2 Pixel-wise Classification
    5.2.1 Preprocessing The Depth Image
    5.2.2 Dense Segmentation with Deep Convolutional Networks
    5.2.3 Designing the Deep Convolutional Network
  5.3 Classification Aggregation
  5.4 Pose Estimation
  5.5 Discussion
6 Evaluation
  6.1 Training the Dense Depth Classifier
  6.2 Evaluation on UBC3V Synthetic
    6.2.1 Dense Classification
    6.2.2 Pose Estimation
  6.3 Evaluation on Berkeley MHAD
    6.3.1 Dense Classification
    6.3.2 Pose Estimation
  6.4 Evaluation on EVAL
7 Discussion and Conclusion
Bibliography

List of Tables

Table 4.1  Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter and D refers to the distance parameter as described in Figure 4.5. The simple set is the subset of postures that have the label ‘walk’ or ‘run’. Going from the first dataset to the second would require pose adaptation, while going from the second to the third dataset requires shape adaptation.

Table 6.1  The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2 respectively.

Table 6.2  Mean and standard deviation of the prediction error by testing on subjects and actions with the joint definitions of Michel et al. [28]. We also report and compare the accuracy at the 10cm threshold.

List of Figures

Figure 1.1  The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.
Figure 1.2  A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.
Figure 1.3  An overview of our pipeline. In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.
Figure 3.1  The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.
Figure 3.2  The high-level representation of data flow within our pipeline. The pose estimation block operates independently from the number of the active Kinects.
Figure 3.3  An example to demonstrate the output result of camera calibration. The blue and the red points are coming from two different Kinects facing each other but they are presented in a unified coordinate space.
Figure 3.4  The pose estimation pipeline in our platform.
Figure 4.1  The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera locations to generate realistic training data.
Figure 4.2  Random samples from MotionCK as described in Section 4.1.
Figure 4.3  Regions of interest in our humanoid model. There are a total of 43 different body regions color-coded as above. (a) The frontal view and (b) the dorsal view.
Figure 4.4  All the 16 characters we made for synthetic data generation. Subjects vary in age, weight, height, and gender.
Figure 4.5  An overview of the extrinsic camera parameters inside our data generation pipeline.
Figure 4.6  Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 4.7  Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 4.8  Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.
Figure 5.1  Our framework consists of four stages through which we gradually build higher level abstractions. The final output is an estimate of human posture.
Figure 5.2  Sample human segmentation in the first stage of our pose estimation pipeline.
Figure 5.3  Sample input and output of the normalization process. (a,b) the input from two views, (c,d) the corresponding foreground mask, (e,f) the normalized image output. The output is rescaled to 250×250 pixels. The depth data is from the Berkeley MHAD [31] dataset.
Figure 5.4  Our CNN architecture. The input is a 250×250 normalized depth image. The first row of the network generates a 44×14×14 coarsely classified depth with a high stride. Then it learns deconvolution kernels that are fused with the information from lower layers to generate finely classified depth. Like [26] we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in the parenthesis within each block is the number of the corresponding channels.
Figure 6.1  Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.
Figure 6.2  Front depth camera samples of all the subjects in the EVAL [12] dataset.
Figure 6.3  The reference groundtruth classes of UBC3V synthetic data.
Figure 6.4  The confusion matrix of Net 3 estimates on the Test set of Hard-Pose.
Figure 6.5  The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right). The images are in their original size.
Figure 6.6  The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).
Figure 6.7  Mean average joint prediction error on the groundtruth and the Net 3 classification output. The error bar is one standard deviation. The average error on the groundtruth is 2.44cm, and on Net 3 is 5.64cm.
Figure 6.8  Mean average precision of the groundtruth dense labels and the Net 3 dense classification output with accuracy at threshold 10cm of 99.1% and 88.7% respectively.
Figure 6.9  Dense classification result of Net 3 together with the original depth image on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.
Figure 6.10  Blue color is the motion capture groundtruth on the Berkeley MHAD [31] and the red color is the linear regression pose estimate.
Figure 6.11  Pose estimate mean average error per joint on the Berkeley MHAD [31] dataset.
Figure 6.12  Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.
Figure 6.13  Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has been trained only on synthetic data.

Acknowledgments

I would like to express my sincerest gratitude to my supervisor and mentor, Professor James J. Little, who has always been considerate, supportive, and most importantly, patient with me. His guidance and support in academia and life have been truly invaluable. I also would like to thank Professor Robert J. Woodham for the intellectually stimulating conversations in almost every encounter. My appreciation goes to Prof. David Kirkpatrick, Prof. Nick Harvey, and Prof. Mark Schmidt, from whom I learned innumerable lessons. I am thankful to the Computer Science Department staff, who have helped me in various circumstances in an outstandingly professional and courteous manner. I would like to thank Ankur Gupta and Bita Nejat for their friendship and support throughout the difficult times, and for continually going out of their way to offer solace. I am thankful to the other fellow graduate students whom I had the pleasure of knowing and working with. My greatest gratitude is reserved for my parents, Asghar Shafaei and Zahra Baghaei, for teaching me what is truly valuable. I am eternally indebted to them for their never-ending support and comfort.

Dedication

Chapter 1

Introduction

Pose estimation in computer vision is the problem of determining an approximate skeletal configuration of people in the environment through visual sensor readings (see Figure 1.1). Postural information provides a high level abstraction over the visual input which serves as a foundation for other computer vision tasks such as activity understanding, automatic surveillance, and gesture recognition, to name a few.

However, possible applications of pose estimates are not limited only to computer vision. For instance, postural information is used in human motion capture for computer generated imagery. Within this context, a 3d character is visualized while moving like a real human model – used mostly in movie and gaming products. Currently the most reliable and accurate existing method is markered motion capture: a collection of retroreflective markers is attached to the subjects, usually in the form of specialized clothing, and then tracked by infrared cameras. A reliable pose estimation method can virtually replace the existing motion capture systems in any context.

Similarly, in human computer interaction one can use this abstract information to design user interfaces that actively collaborate with a person. Interactive augmented reality simulations, for instance, can benefit from real-time human interaction to provide an immersive, exciting, and educative experience.

In medical care, systems that are capable of pose estimation can be used to facilitate early diagnosis of cognitive decline in patients. Pose estimation also opens the possibility for doctors to perform remote physiotherapy and to ensure the patient is performing the activities correctly by examining the exact movements. In e-health we can also use postural information to monitor the elderly who prefer to live alone. By careful analysis one can generate emergency notifications in case of an accident.

Figure 1.1: The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.

1.1 Kinect Sensor

The Kinect sensor provides a high resolution RGB video stream, as well as a specialized depth stream, where for each pixel, instead of color, we get a measure of distance from the camera. This depth image can then be used to reconstruct a volumetric model of the observed space. The depth image is usually referred to as 2.5d data because we get one depth reading along each ray through an image pixel.

This sensor was popularized by Microsoft to facilitate game development with human interactions. For example, in one game the player can learn to perform dance moves correctly; or in another, a player can interact with virtual animals through augmented reality. In the research community, people have been using this sensor extensively in robotics to perform tasks more accurately.
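Since each pixel of the 2.5d depth image corresponds to one reading along a camera ray, the image maps directly to a point cloud once the camera intrinsics are known. The following is a minimal NumPy sketch of that back-projection; the resolution and intrinsic values are illustrative placeholders rather than an actual Kinect 2 calibration.

```python
import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (in millimetres) to 3d points in the camera frame.

    Each pixel (u, v) with depth z yields one point along the camera ray, which is
    what makes the depth image '2.5d' rather than fully volumetric.
    """
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0   # metres
    valid = z > 0                              # zero depth means no reliable reading
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]                       # (N, 3) array of valid points

if __name__ == "__main__":
    # Hypothetical resolution and pinhole intrinsics, for illustration only.
    depth = np.random.randint(500, 4500, size=(424, 512)).astype(np.uint16)
    cloud = depth_to_points(depth, fx=365.0, fy=365.0, cx=256.0, cy=212.0)
    print(cloud.shape)
```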

Such a sensor can simplify important indoor problems that otherwise would be difficult to deal with through mere color data, such as obstacle avoidance, or determining the correct way to grasp objects.

A topic of current interest with the Kinect sensor is human pose estimation. Using a depth image has a few attractive properties that make pose estimation easier than using data from the color domain. With the additional depth information, we no longer need to worry about the scale ambiguity that is always present with single color images (i.e., is the object small and close, or big and distant?). Furthermore, the depth image primarily captures the shape of the objects rather than their visual pattern (see Figure 1.2). This brings in invariance to complex visual patterns for free; we can focus solely on the shape of the observation, which is greatly beneficial for the pose estimation problem. In domains such as pose estimation with RGB images, the high clothing variation, which alters the visual pattern, is itself a major source of complication in learning usable filters.

As part of the software development kit released for the Kinect, Microsoft also provides APIs to automatically perform pose estimation. The original software developed by Microsoft makes the assumption that there is a cooperative user facing the camera and interacting with the system. While this assumption is valid in the home entertainment setting, the resulting pose estimation algorithm does not generalize well in broader scenarios in which the user is not necessarily facing the camera or is not cooperative.

This limitation has led researchers to work on more robust pose estimation algorithms. Ever since the original publication [36] there have been numerous attempts at improving the empirical results of pose estimation [e.g., 1, 13, 17, 43, 44]. However, much of this research has been focused on performing pose estimation with only a single-view depth image.

One of the common obstacles with using multiple Kinect 1 devices in the past has been the interference between Kinects that are aimed in the same direction. The early versions of this sensor determined depth by emitting a patterned infrared light and analyzing its reflection. When these patterns collide, the Kinect is either unable to determine depth, or the result of the computation is highly erroneous.

Figure 1.2: A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.

With the recent release of the Kinect 2 and its new depth-sensing technology, using multiple Kinects has finally become practical; but at present there is relatively little literature on pose estimation with multiple Kinects.

1.2 Our Scenario

While many of the current methods focus on a single depth camera, using just a single camera has one inherent limitation that is not possible to deal with: when there is occlusion in the scene, there is no way to be sure of the hidden body part's pose. But if a second camera is present, there is a higher chance of observing the occluded region from a different viewpoint, and hence of having a more reliable output.

In this thesis our work is focused on generating better pose estimates with multiple Kinect sensors. By doing so we hope to perform better pose estimation in human monitoring scenarios. The basic idea is to install a set of Kinects inside our target environment and non-intrusively perform pose estimation on a not necessarily cooperative user.

Our first contribution is a generalized framework for multiview depth-based pose estimation that is inspired by the existing literature. We split the problem into several subproblems that we solve separately. Our approach enables using the state-of-the-art for each subproblem independently. We also instantiate the said framework with simple, yet effective, methods and discuss various design decisions and their effects on the final result.

The targeted application of such a system is to facilitate e-health solutions within home environments. As part of our work we also develop a lightweight, flexible, platform-independent, and distributed software framework to coordinate the orchestration and data communication of several Kinect sensors in real-time. We use this system as our infrastructure for experiments and evaluations. We present our software framework in Chapter 3. In Chapter 5 we describe our abstract pose estimation framework and instantiate it with state-of-the-art image segmentation and a simple, yet effective, pose estimation algorithm.

1.3 Datasets

One of the main challenges in this field is the absence of datasets for training or even a standard benchmark for evaluation. At the time of writing, the only dataset that provides depth from more than one viewpoint is the Berkeley Multimodal Human Action Database (MHAD) [31], with only two synchronized Kinect readings.

Collecting useful data is a lengthy and expensive process with unique challenges. For instance, reliably annotating posture on multiview data is by itself an expensive task for humans to do, let alone dealing with the likely event of annotators not agreeing on the same exact label. One can also resort to using external motion capture systems to accurately capture joint configurations while recording the data; however, a reliable motion capture requires wearing marker sensors that alter the visual appearance.

At the same time, with the advances in the graphics community, it is not difficult to solve the forward problem; that is, given a 3d model and a pose, generate a realistic depth image from multiple viewpoints. In contrast, we are interested in the inverse problem, that is, to infer the 3d pose from multiview depth images. One way to benefit from the advances in graphics is to synthesize data.

Using synthetic depth data for pose estimation has previously been proposed by Shotton et al. [36]. Interestingly, they show that their synthetic data provides a challenging baseline for the pose estimation problem. Unfortunately this pipeline, and even their data, is not publicly available, and we believe the technical difficulties of building such a pipeline may have discouraged many researchers from taking this particular direction.

As part of our contribution we adopt and implement a data generation pipeline to render realistic training data for our work. In Chapter 4 we thoroughly discuss the challenges, the details, and the differences with the previous work. Note that our contribution here removes a huge obstacle and makes further research within this realm possible. The developed pipeline, together with a compilation of datasets with varying levels of complexity, will be released to encourage further research in this direction.

1.4 Pose Estimation

The main focus of this thesis is the problem of estimating human postures given multiview depth information. Our work differs from the previous work in the following respects.

Multiview Depth. Most prior work has focused on single-view depth information. While there is a substantial amount of work on multiview RGB-based pose estimation, there are relatively few publications on multiview depth-based pose estimation. Furthermore, the absence of datasets further complicates our work. As part of our contribution we also release a dataset for multiview pose estimation.

Context Assumptions. The major focus of pose estimation papers has been on a limited context. For instance, the method that powers the Kinect SDK assumes that the camera is installed for home entertainment scenarios and that a cooperative user is using the system. In contrast, our focus is on home monitoring, where cameras are more likely to be installed on the walls rather than in front of the television or the user. Furthermore, the user is not necessarily cooperative, or even aware of the existence of such a system in the environment. Our goal is to improve upon single-view pose estimation techniques by analyzing the aggregated information of multiple views. Such a system will enable the application of pose estimation in broader contexts.

At a high level our system connects to n Kinects that are installed within an environment. We then retrieve depth images from all of these Kinects to find and estimate the posture of the individuals who are present and observable. Each person may be visible from one or more Kinects. The output of our system is the location of each predefined skeletal joint in $\mathbb{R}^3$. These steps are visually illustrated in Figure 1.3. Further discussion will be presented in Chapter 5.

Figure 1.3: An overview of our pipeline. In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.

1.5 Outline

In Chapter 2 we explore the literature of pose estimation, and for the unfamiliar reader we also present some of the fundamental methods that will facilitate comprehension of the subsequent chapters. Chapter 3 is dedicated to the high-level abstraction of our scenario and the infrastructure that we have designed for our experiments. It is the requirement engineering portion of our project, where we describe all the steps that are required to prepare the environment. In Chapter 4 we describe the data synthesis procedure and its challenges. At the end of that chapter we describe the properties of the datasets that we use in our work. Our abstract pose estimation framework is defined and motivated in Chapter 5. As we develop the framework we discuss the possible approaches one can take, which serve as possible future work for the interested reader. We also describe the chosen components that we experiment with in the subsequent chapter. Chapter 6 includes all the experiments that we have conducted together with a discussion of the results. We conclude in Chapter 7 and provide possible future directions that may be of interest.

Chapter 2

Related Work

In this chapter we present relevant background material for the methods in this thesis. In Section 2.1 we precisely define the pose estimation problem and present a summary of recent work. We then look at the image segmentation problem in Section 2.2 and discuss recent progress that underlies the foundation of our pipeline. To train our models we apply curriculum learning, which is presented in Section 2.3.

2.1 Pose Estimation

Previous work on real-time pose estimation can be categorized into top-down and bottom-up methods. Top-down or generative methods define a parametric shape model based on the kinematic properties of the human body. These models generally require expensive optimization procedures to be initialized with parameters that accurately explain the presented evidence. After the initialization step the parameter estimates are used as a prior for tracking the subject [11, 12, 18, 43].

Top-down methods require an accurate shape model in order to generate a reasonably precise pose estimate. A common practice for shape estimation is to a priori adapt the basic model to fit the physical properties of the test subjects. The shape estimation process usually requires a co-operative user taking a neutral pose such as the T-Pose (i.e., standing erect with hands stretched) at the beginning, which makes it difficult to apply top-down methods in non-cooperative scenarios.

Bottom-up discriminative models, the second category of approaches, directly focus on the current input to identify individual body parts, usually down to pixel-level classification. These estimates are then used to generate hypotheses about the configuration of the body that usually neglect higher-level kinematic properties and may give unlikely or impossible results. However, bottom-up methods are fast enough to be combined with a separate tracking algorithm that ensures the labeling is consistent and correct throughout the subsequent frames. Random-forest-based techniques have been shown to be an efficient approach for real-time performance in this domain [14, 34, 37, 44].

Pose estimation is the problem of determining the skeletal joint locations of human subjects within a given image. More formally, given an image $I$ the problem is to detect all human subjects $s \in S_I$ in the image $I$ and determine the joint configuration of each subject $s$ as $P^s = \{p^s_1, \dots, p^s_n\}$, where $p^s_i$ corresponds to the location of the $i$-th joint for subject $s$, and $n$ is the total number of predefined joints. There are two variations of pose estimation: 3d pose estimation and 2d pose estimation. In 3d pose estimation we are interested in finding the 3d real-world joint locations (i.e., $p^s_i \in \mathbb{R}^3$), while in the 2d setting we only want to label particular pixel coordinates on the spatial input such as an image.

While 2d images are primarily used to generate 2d pose estimates, it is also possible to infer 3d pose from a single or multiple 2d images. The multiple images used for 3d pose estimates can either come from a coherent sequence, such as a video, or simultaneously from different viewpoints. Commercial motion capture systems such as Vicon use up to 8 cameras/viewpoints for real-time joint tracking. However, the cheapest, fastest, and reasonably accurate pose estimates come from 2.5d depth images.

Pose estimates are useful for providing a higher level abstraction of the scene in problems such as action understanding, surveillance, and human computer interaction. An accurate pose estimate could be crucial for reliable action understanding [21]. One of the benefits of having a reliable pose estimate is the possibility of defining view-invariant features that can significantly decrease our dependence on training data with multiple viewpoints.

We expect a pose estimation process to operate under various constraints such as time, memory, and processing power. This is important because pose estimation is potentially the beginning of a longer pipeline that is resource demanding, whether it is human computer interaction or action recognition. Furthermore, if we wish to apply pose estimation tasks in embedded systems, or even mobile devices, satisfying the resource constraints is even more critical.

Two types of challenges arise in pose estimation: appearance and structural. The appearance problem refers to the way the human body is captured under varying lighting conditions, varied clothing, and different views, which makes it difficult to recognize the body in arbitrary settings. The structural problem refers to the exponentially large space of possible configurations of the human body and the resulting ambiguities; an exhaustive search through all possible joint configurations is simply not a feasible approach, and an effective system needs to capture all configurations in a way that allows it to choose a sensible pose within a reasonable time.
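To make the definition above concrete, the following is a small sketch of a 3d pose estimate $P^s$ represented as an array of $n$ joint locations, together with the two measures used later in the evaluation chapter: the mean joint prediction error and the accuracy at a distance threshold. The joint count and the 10cm threshold here are illustrative.

```python
import numpy as np

# A 3d pose estimate for one subject: an (n, 3) array of joint locations in metres.
# n and the joint ordering are whatever the chosen skeleton definition prescribes.
predicted = np.random.rand(15, 3)
groundtruth = np.random.rand(15, 3)

def mean_joint_error(pred, gt):
    """Mean Euclidean distance between predicted and groundtruth joint locations."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def accuracy_at_threshold(pred, gt, threshold=0.10):
    """Fraction of joints predicted within `threshold` metres of the groundtruth."""
    return float((np.linalg.norm(pred - gt, axis=1) < threshold).mean())

print(mean_joint_error(predicted, groundtruth),
      accuracy_at_threshold(predicted, groundtruth))
```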

2.1.1 Single-view Pose Estimation

RGB Based

Bourdev and Malik [4] propose a method to group neighboring body joints based on the groundtruth configuration. They randomly select a window containing at least a few joints and collect training images that have the exact same configuration. Each chosen pattern is called a poselet. It is typical to randomly choose thousands of poselets and learn appropriate filters to detect them later. During the evaluation these filters are run on an image and the highest responses vote for the true joint location.

The time complexity of learned poselet models grows linearly with the number of poselets, which can be an obstacle to scalability. Chen et al. [6] present a hierarchical evaluation method to make poselet-based approaches scalable in practice. Bourdev et al. [5] explore an alternative approach by learning generalized deep convolutional networks that generate pose-discriminative features. Gkioxari et al. [15] use poselets in the context of a deformable parts model to accurately estimate pose. The main idea of [15] is to reason about the relative positions of poselets to remove erroneous results and improve accuracy.

Yang and Ramanan [42] formulate a structural-SVM model to infer the possible body part configurations. The unary terms in their formulation correspond to the local evidence through a learned mixture of HOG filters, and the binary terms are quadratic functions of the relative joint locations that score the determined structure. At test time they efficiently pick the maximum-scoring configuration by applying dynamic programming and distance transform operators [8]. Toshev and Szegedy [40] directly learn a deep network regression model from images to body joint locations. Tompson et al. [39] learn a convolutional network to jointly identify body parts and perform belief-propagation-like inference on a graphical model.

Depth Based

Most of the application-oriented approaches to pose estimation rely on depth data or a combination of depth and RGB data. The common preference for depth sensors is due to their capability of operating in low-light or even no-light conditions, providing color-invariant data, and also resolving the scale ambiguity of the RGB domain.

One of the successful applications of pose estimation is the Microsoft Kinect and the work of Shotton et al. [36]. Shotton et al. use synthetic depth data and learn random-forest-based pixel classifiers on single depth images. The joint estimates are then derived from the densely classified depth by applying a mean-shift-based mode seeking method. Shotton et al. also propose an alternative method to directly learn regression forests that estimate the joint locations.

The use of random forests in [36] allows a 200fps performance. Furthermore, the accuracy of a single-depth pose estimate is high enough that no temporal constraint (e.g., tracking) on the input depth stream is necessary. However, since the algorithm of Shotton et al. uses a vast number of decision trees, it is a resource-demanding algorithm for real-time applications. Notably, Shotton et al. assume a home entertainment scenario with a co-operative user, which limits the applicability of their solution.

Baak et al. [1] describe a data-driven approach to pose estimation that runs at 60fps. Their method depends on a good initialization of a realistic 3d model while the co-operative subject is taking a neutral pose. Furthermore, the initialization step depends on a few hyper-parameters that are manually tuned for each dataset.

Ye and Yang [43] describe a probabilistic framework to simultaneously perform pose and shape estimation of an articulated object. They assume the input point cloud is a Gaussian Mixture Model whose centroids are defined by an articulated deformable model. Ye and Yang describe an Expectation Maximization approach to estimate the correct deformation parameters of a 3d model to explain the observations. It is possible to run this computationally intensive algorithm in real-time if the implementation is on the GPU.

Ge and Fan [13] introduce a non-rigid registration method called Global-Local Topology Preservation (GLTP). Their method combines the two preexisting approaches of Coherent Point Drift [30] and articulated ICP [32] into a complementary hybrid. They first initialize a realistic 3d model assuming the person is in a neutral pose and then track each joint similarly to Pellegrini et al. [32]. Their method heavily relies on the target person starting with a neutral pose, which is generally not the case in a monitoring setting. Furthermore, this system is computationally expensive and does not offer real-time performance.

Yub Jung et al. [44] demonstrate a Random-Tree-Walk-based method that achieves 1000fps pose estimation. They learn a regression forest for each joint to guide a walk on the depth image from a random starting point. After a predefined number of steps they use the average location as the joint position estimate. The speed improvement of their method is due to learning random forests per joint rather than per pixel (as opposed to Shotton et al. [36]). This method does not model the structural constraints of the human body; rather, it uses the forest as a guide to search the spatial domain of the input depth image.

2.1.2 Multiview Depth-based Pose Estimation

Michel et al. [28] use multiple depth sensors to generate a single point cloud of the environment. This point cloud is then used to configure the postural parameters of a cylindrical shape model by running particle swarm optimization offline. The physical description of each subject is manually tuned before each experiment. Phan and Ferrie [33] use optical flow in the RGB domain together with depth information to do multiview pose estimation within a human-robot collaboration context at the rate of 8fps. Phan and Ferrie report a median joint prediction error of approximately 15cm on a T-pose sequence. Zhang et al. [45] combine the depth data of multiple Kinects with wearable pressure sensors to estimate shape and track human subjects at 6fps.

To the best of our knowledge, these are the only published methods for multiview pose estimation from depth.

2.2 Dense Image Segmentation

Dense image segmentation is the problem of generating per-pixel classification estimates. This particular area of computer vision has been progressing rapidly during the past few months, and all of the top competing methods make use of convolutional networks in one way or another.

One of the major obstacles with commonly used convolutional networks for dense classification is that the output of each layer shrinks spatially as we progress towards the end of the pipeline. While in general classification this is a desirable property, in dense classification it effectively leads to high strides in the output space, which gives coarse region predictions.

Long et al. [26] propose a specific type of architecture for image segmentation that uses deconvolution layers to scale up the outputs of individual layers in a deep structure. These deconvolutional layers act as spatially large dictionaries that are combined in proportion to the outputs of the previous layer. Long et al. also fuse information from lower layers with higher layers through summation.

Hu and Ramanan [20] further motivate the use of lower layers by looking at these networks from another perspective. They show that having top-down propagation of information, as opposed to just doing a bottom-up or feed-forward pass, is an essential part of reasoning for a variety of computer vision tasks – something that is motivated by empirical neuroscientific results. The interesting aspect of this work is that we can simulate the top-down propagation by unrolling the network backwards and treating the whole architecture as a feed-forward structure.
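The deconvolution-and-summation idea of Long et al. [26] can be sketched in a few lines. The following PyTorch snippet is not the network used in this thesis (that architecture is described in Chapter 5); it is a toy illustration of how coarse class scores are upsampled with a learned deconvolution and fused by summation with scores computed from a shallower, higher-resolution layer. The 44 output channels mirror the channel count in Figure 5.4 (the 43 body regions plus, presumably, a background class); everything else is arbitrary.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional segmentation network with one skip fusion."""

    def __init__(self, num_classes=44):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.score_coarse = nn.Conv2d(64, num_classes, 1)   # coarse scores at 1/4 resolution
        self.score_skip = nn.Conv2d(32, num_classes, 1)     # scores from the lower layer at 1/2 resolution
        self.up = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up_final = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.block1(x)                                  # 1/2 resolution features
        f2 = self.block2(f1)                                 # 1/4 resolution features
        fused = self.up(self.score_coarse(f2)) + self.score_skip(f1)   # summation fusion
        return self.up_final(fused)                          # back to input resolution

scores = TinyFCN()(torch.randn(1, 1, 64, 64))                # -> shape (1, 44, 64, 64)
```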

Chen et al. [7] show that it is possible to further improve image segmentation by adding a fully connected CRF on top of the deep convolutional network. This approach essentially treats the output of the deep network as features for the CRF, except that these features are also automatically learned by the back-propagation algorithm. More recently, Zheng et al. [46] present a specialized deep architecture that integrates a CRF inside itself. They show that this architecture is capable of performing mean-field-like inference on a CRF with Gaussian pairwise potentials while doing the feed-forward operation. They successfully train this architecture end-to-end and achieve the state-of-the-art performance in dense image segmentation.

All of the recent work suggests that it is possible to incorporate the local dependencies of the output domain as part of a deep architecture itself. Building on this observation, we also take advantage of deep architectures as part of our work.

2.3 Curriculum Learning

Bengio et al. [2] describe curriculum learning as a possible approach to training models that involve non-convex optimization. The idea is to rank the training instances by their difficulty. This ranking is then used for training the system by starting with simple instances and gradually increasing the complexity of the instances during the training procedure. This strategy is hypothesized to improve the convergence speed and the quality of the final local minima [2]. Kumar et al. [25] later introduced the idea of self-paced learning, where the system decides which instance is more important to learn next – in contrast to the earlier approach, where an oracle had to define the curriculum before training starts. More recently, Jiang et al. [23] combined these two methods into an adaptive approach to curriculum learning that takes the feedback of the classifier into consideration while following the original curriculum guidelines.

Our experiments suggest that such a controlled approach to training deep convolutional networks can be crucial for obtaining a better model, providing an example of curriculum learning in practice.
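A schematic sketch of this training strategy, assuming an ordered list of datasets such as Easy-Pose, Inter-Pose, and Hard-Pose: the same model is trained on one stage and its parameters initialize the next, mirroring how Net 2 and Net 3 are initialized from Net 1 and Net 2 in Chapter 6. The toy datasets, loss, and optimizer below are placeholders, not the ones used in this thesis.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_with_curriculum(model, stages, epochs_per_stage=2, lr=1e-3):
    """Curriculum learning: fit the same model on datasets ordered from easy to hard.

    `stages` is a list of datasets; the parameters learned on one stage
    initialize training on the next, harder stage.
    """
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for stage, dataset in enumerate(stages):
        loader = DataLoader(dataset, batch_size=32, shuffle=True)
        for _ in range(epochs_per_stage):
            for x, y in loader:
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
        print(f"finished curriculum stage {stage}")
    return model

# Toy stages standing in for Easy-Pose / Inter-Pose / Hard-Pose.
toy = lambda n: TensorDataset(torch.randn(n, 8), torch.randn(n, 3))
model = train_with_curriculum(nn.Linear(8, 3), [toy(128), toy(128), toy(128)])
```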

Chapter 3

System Overview

In this chapter we present an overview of our environment and the developed system. In Section 3.1 we talk about the general context of our problem and highlight a few important details. We then talk about the high-level specification of our system in Section 3.2. Section 3.3 is dedicated to the internal structure and the data flow within our system.

3.1 The General Context

The vision of our project is to non-obtrusively collect activity information within home environments. The target application is e-health care and automated monitoring of people who may suffer from physical disabilities and require immediate attention in case of an accident. A collection of sensors is installed inside the house and a central server processes all the information. The specific sensor that we will be using is the Microsoft Kinect 2; however, the developed framework is capable of incorporating more sources of information.

The Kinect 2 sensor is, abstractly, a consolidated set of microphones, infrared sensors, a depth camera, and an HD RGB camera. The relatively cheap price and availability of this sensor have generated immense interest in developing systems with multiple Kinect sensors. Since this version of the device does not cause interference with other Kinect 2 sensors, it has become more attractive than ever.

In our scenario the Kinect 2 sensors are installed within an indoor space such as a house or a room. A central server located inside the house processes all the incoming data. Privacy concerns add the limitation that no raw information, such as the RGB video feed or the depth stream, can be stored on disk. Therefore, we are limited to real-time methods that analyze the data as the observations take place. The system should also recognize people automatically to profile the activities of individuals and to invoke notifications in case of an emergency.

A technical challenge is to organize communication with the Kinects. Each Kinect 2 requires the full bandwidth of a USB 3 controller, and the connecting cable cannot be longer than 5m – the underlying USB 3 protocol has a maximum communication latency limit. Additionally, the Microsoft Kinect 2 SDK does not support multiple Kinects at the same time. Our solution to these technical challenges is to deploy the system on multiple computers that communicate over the network.

3.2 High-Level Framework Specification

At the highest abstraction level our system is a collection of small software packages that communicate with each other on a network through message passing. The main component is a singleton Smart Home Core running on the server. For each Kinect involved we run a separate Kinect Service that processes and transmits sensor readings to the Smart Home Core software through a network with TCP/IP support (see Figure 3.1). Decoupling the individual components has the added benefit of easier scalability. For instance, the software operates independently of the total number of active Kinects in the system – if we later wish to add more Kinects for better accuracy or more coverage, we can do so without altering the software.

The foundation of this platform is implemented in C# under the .NET Framework 4. We use OpenCV (www.opencv.org) and the Point Cloud Library, PCL (www.pointclouds.org), at the lower levels for efficiency in tasks such as visualization and image processing.


Messages used in inter-process communication are serialized using Google Protocol Buffers (developers.google.com/protocol-buffers/) to achieve fast and efficient transmission. The language neutrality of Google Protocol Buffers allows interfacing with multiple languages; for example, we can communicate with a Kinect Service to gather data inside Matlab. To minimize network overhead we further compress the message payload with lossless LZ4 compression (https://code.google.com/p/lz4/) and a lossy JPEG scheme. The final system transmits a 720p video stream and depth data at a frame rate of 30fps while consuming only 5.3MB of bandwidth, making simultaneous communication with up to ten Kinects feasible over wireless networks. While we can still optimize the communication costs for even higher scalability, we found that the described system is efficient enough to proceed with our experiments.

Figure 3.1: The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.
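The production services are written in C# and serialize messages with Protocol Buffers; the following Python sketch only illustrates the transport idea: compress a depth frame losslessly with LZ4, prepend a length header, and stream it over a socket. The JPEG path for color frames is omitted, and the frame size is an arbitrary example.

```python
import socket
import struct
import numpy as np
import lz4.frame  # pip install lz4

def send_depth_frame(sock, depth):
    """Length-prefix framing: a 4-byte payload size followed by the LZ4-compressed frame."""
    payload = lz4.frame.compress(depth.tobytes())
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def recv_depth_frame(sock, shape, dtype=np.uint16):
    """Read one length-prefixed message and decompress it back into a depth image."""
    (size,) = struct.unpack("!I", _recv_exact(sock, 4))
    payload = _recv_exact(sock, size)
    return np.frombuffer(lz4.frame.decompress(payload), dtype=dtype).reshape(shape)

if __name__ == "__main__":
    a, b = socket.socketpair()                    # stand-in for a real TCP connection
    frame = np.random.randint(0, 4500, size=(100, 128), dtype=np.uint16)
    send_depth_frame(a, frame)
    assert np.array_equal(recv_depth_frame(b, shape=frame.shape), frame)
```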

3.3 Internal Structure and Data Flow

The internal structure of our system is best described by walking through the data flow path within our pipeline. The highest abstraction level of the data flow is shown in Figure 3.2.


Figure 3.2: The high-level representation of data flow within our pipeline. The pose estimation block operates independently from the number of the active Kinects.

3.3.1 Camera Registration and Data Aggregation

Each Kinect sensor makes measurements in its own coordinate system. This coordinate system is a function of the camera's location and orientation in our environment. The problem of finding the relative transformations that unify the measurements into a single coordinate system is called camera calibration or extrinsic camera parameter estimation. Camera calibration has been studied extensively in the computer vision and robotics communities [10, 16, 19].

Within our problem context we assume the cameras are installed in fixed locations. Therefore, we only need to calibrate the cameras once, and as long as we can do this in a reasonable time we can resort to simple procedures. In our pipeline we simply perform feature matching in the RGB space to come up with reasonably accurate transformation parameters, and then run the Iterative Closest Point (ICP) [3] algorithm to fine-tune the estimates. Our implementation uses SIFT [27] to match features and then estimates a transformation matrix $\hat{T}$ by minimizing an $\ell_2$ loss over the corresponding matched depth locations within a RANSAC [9] pipeline. To find a locally optimal transformation with respect to the entire point clouds, we then initialize the ICP method with $\hat{T}$ using the implementation in PCL.

After generating a transformation estimate for each sensor pair we can unify the coordinate spaces and merge all the measurements into the same domain. Figure 3.3 demonstrates a real output after determining a unified coordinate system. By adding more cameras we can increase the observable space to the entire house.
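The core geometric step of this calibration, estimating a rigid transform $\hat{T}$ that minimizes the $\ell_2$ loss over matched 3d locations, has a standard closed-form SVD (Kabsch) solution. The sketch below shows only that least-squares core; in our pipeline such an estimate is computed inside a RANSAC loop over SIFT matches and then refined with ICP, which the sketch does not reproduce.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform mapping `src` points onto `dst` points.

    src, dst: (N, 3) arrays of matched 3d locations from two Kinects.
    Returns a 4x4 homogeneous matrix; the closed-form SVD (Kabsch) solution
    minimizes the sum of squared distances between R @ src + t and dst.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```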

Figure 3.3: An example to demonstrate the output result of camera calibration. The blue and the red points are coming from two different Kinects facing each other but they are presented in a unified coordinate space.

By constructing a larger observable space we no longer need to know about the individual Kinects in our pipeline. If later we decide to add more Kinects, we can simply estimate the relative transformation of the newly added Kinect to at least one of the existing cameras – at this point the unification is straightforward due to transitivity. After merging the new data we will have a more accurately observed space or a larger observable area, and either way the rest of the pipeline remains agnostic to the number of the Kinects involved. After unifying the measurements the data flow path proceeds to the pose estimation stage. In this thesis we only focus on development of the pose estimation subsystem and leave the other potential stages to the future work.
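The transitivity argument amounts to composing homogeneous transforms. A short sketch with toy matrices; the naming convention (T_b_to_ref, T_new_to_b) is ours and purely illustrative.

```python
import numpy as np

def compose(T_b_to_ref, T_new_to_b):
    """Transitivity of extrinsics: chain 4x4 homogeneous transforms to reach the unified frame."""
    return T_b_to_ref @ T_new_to_b

# Toy example: camera B sits 2m along x in the reference frame; the new Kinect sits 1m along z in B's frame.
T_b_to_ref = np.eye(4); T_b_to_ref[0, 3] = 2.0
T_new_to_b = np.eye(4); T_new_to_b[2, 3] = 1.0
T_new_to_ref = compose(T_b_to_ref, T_new_to_b)

p_new = np.array([0.0, 0.0, 0.0, 1.0])        # a point at the new Kinect's origin
print((T_new_to_ref @ p_new)[:3])             # -> [2. 0. 1.] in the unified frame
```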

3.3.2 Pose Estimation

At the pose estimation stage the target is to identify the posture of every person in the observable space. For each individual we perform 3d pose estimation based on the depth and the point cloud data. This stage of the pipeline can further be separated into different parts, as shown in Figure 3.4.

Figure 3.4: The pose estimation pipeline in our platform.

The main focus of this thesis is on this particular stage of the developed framework. Our pose estimation pipeline consists of four stages.

Human segmentation At this stage we perform background subtraction to separate the evidence of each person from the background so that the rest of the pipeline can examine the data in isolation.

Pixel-wise classification After separating each person we perform classification on each pixel of the depth image.

Classification aggregation At this stage we merge all the classification results of all the cameras.

Pose estimation Given the merged evidence of the previous step we now solve the pose estimation problem and derive the actual joint locations.

Further details of our pose estimation methodology are presented in Chapter 5.

Chapter 4

Synthetic Data Generation

To address the absence of appropriate datasets we use computer generated imagery to synthesize realistic training data. The original problem of interest is extracting human posture from the input, but the inverse direction, that is, generating output from human posture, is reasonably solved in computer graphics. The main theme of this chapter is to simulate this inverse process to generate data.

Commercial 3d rendering engines such as Autodesk Maya (www.autodesk.com/products/maya) have simplified modeling a human body with human inverse kinematic algorithms. A human inverse kinematic algorithm calculates human joint angles under the constraints of human anatomy to achieve the desired posture. The HumanIK middleware (http://gameware.autodesk.com/humanik) that underlies the aforementioned 3d engine has also been widely adopted in game development to create real-time characters that interact with the environment of the game.

While inverse kinematic algorithms facilitate body shape manipulation in a credible way, we also require realistic 3d body shapes to begin with. Luckily, there are numerous commercial and non-commercial tools to create realistic 3d human characters with desired physical attributes. We will be using the term 'character' from this point on to refer to body shapes.

Shotton et al. [36] demonstrate the efficacy of using synthetic data for depth-based pose estimation and argue that the synthesized data tends to be more difficult for pose estimation than real-world data. This behavior is attributed to the high variation of possible postures in the synthetic data, while the real-world data tends to exhibit a biased distribution towards the common postures.


Figure 4.1: The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera locations to generate realistic training data.

In the remainder of this chapter we discuss the specifics of the data generation process. Our pipeline is adopted from the previous work presented in Shotton et al. [36]. An overview of the data generation process is shown in Figure 4.1. We use a collection of real human postures and synthesize data with realistic 3d models and random camera locations. The output of this process has been carefully tuned to generate usable data for our task.

4.1 Sampling Human Pose

At this stage of the data generation pipeline we are interested in collecting a set of real human postures. With the powerful HumanIK and a carefully defined space of postures, it would be possible to simply enumerate over all possible configurations. However, we take the simpler path of using data that has already been collected from human subjects and leave spontaneous pose generation to future work.

To collect real human postures we chose the publicly available CMU motion capture dataset (mocap.cs.cmu.edu). This dataset consists of over four million motion capture frames of human subjects performing a variety of tasks, ranging from day to day conversation with other people to activities such as playing basketball. Each sequence of this dataset has the recorded joint rotation parameters of the subjects' skeleton. Using the orientation of each bone rather than the XYZ coordinates of the joint locations has the benefit of being invariant to the skeleton's physical properties. By defining the physical properties of the skeleton ourselves, which is merely the length of a few bones, we can convert this rotational information to absolute XYZ joint locations.

One way to use this dataset is to simply pick random frame samples from the entire pool of frames. However, doing so would heavily bias our pose space because of the redundant nature of the data. For instance, consecutive frames of a 120fps sequence are highly similar to each other. Moreover, activities such as 'walk' tend to show up frequently in the dataset over different sequences. Therefore, we need to build a uniform space of postures to make sure the data skew does not bias our subsequent models.

To build an unbiased dataset we collect a representative set of 100,000 human postures. To achieve this goal we first define a basic fixed skeleton, convert the rotational information to Cartesian space, and then run the K-means clustering algorithm on this dataset with 100K centers. We use the Fast Library for Approximate Nearest Neighbours (FLANN) [29] to speed up the nearest neighbor look-ups. After finding the association of each posture with the cluster centers, we identify the median within each cluster and pick the corresponding rotational data as a representative for that cluster. After selecting a set of 100K human postures we split the data into three sets of 60K, 20K, and 20K for train, validation, and test respectively. We refer to this pose set as MotionCK. In Figure 4.2 you can see some examples of our postures. Note that at this stage our sets are merely a description of pose and do not include any 3d characters.

24 Figure 4.2: Random samples from MotionCK as described in Section 4.1.
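The clustering step of Section 4.1 can be sketched as follows. The thesis pipeline runs K-means with 100K centers and uses FLANN for the nearest-neighbour look-ups; the sketch below substitutes scikit-learn's MiniBatchKMeans, a toy pose matrix, and the member closest to each cluster center in place of the per-cluster median, so it illustrates the idea rather than reproducing the implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def representative_poses(poses_xyz, k):
    """Cluster Cartesian poses and return one representative frame index per cluster.

    poses_xyz: (num_frames, num_joints * 3) array of joint locations computed on a
    fixed skeleton. The representative is the member closest to its cluster center.
    """
    km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(poses_xyz)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(poses_xyz[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.array(reps)

# Toy stand-in for the ~4M CMU mocap frames: 2000 frames, 15 joints, 50 clusters.
toy_poses = np.random.rand(2000, 15 * 3)
print(representative_poses(toy_poses, k=50)[:10])
```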

4.2 Building Realistic 3d Models

The next stage in our pipeline is to create realistic human 3d models. We use the open-source MakeHuman project (www.makehuman.org), which allows the creation of human-like models with varying physical and clothing attributes. Since we are only interested in generating synthetic depth data, applying a human-like skin is irrelevant to the depth information. Hence, we create our own special skin that reflects our target regions of interest. Our model has 43 different body regions with distinguishing labels for the left and right body parts (see Figure 4.3). We purposefully chose to oversegment the body parts so that we can merge them later if necessary, without regenerating the data.

To make sure our data includes variety in shape, we create 16 characters with varying parameters in age, gender, height, and weight (2 of each). The MakeHuman project allows a higher degree of freedom in making a model; however, we found that varying other parameters does not substantially affect the apparent physical attributes. All of our characters can be seen in Figure 4.4. We plan to release our models for public use.

4.3 Setting Camera Location

In order to render a 3d model we also require a camera location. The camera parameters control the viewpoint from which we collect data. Recall that in our problem we do not require cooperation, and we would like to estimate pose from distances of up to seven meters and heights of up to three meters.



Figure 4.3: Regions of interest in our humanoid model. There are a total of 43 different body regions, color-coded as above. (a) The frontal view and (b) the dorsal view.

Figure 4.4: All the 16 characters we made for synthetic data generation. Subjects vary in age, weight, height, and gender.


Therefore we should define the camera location with respect to the aforementioned assumptions. We chose the following possible configurations for the camera. The height of the camera is assumed to be between one and three meters from the ground. We assume the person is at most four meters away from the sensor, and the relative azimuthal angle between the person and the camera spans the entire 2π range. See Figure 4.5 for a visualization of the defined camera parameters. The chosen parameters are for data generation purposes only; in Chapter 5 we describe how our method handles cases where the person is farther away. The intrinsic camera parameters such as the focal length and output size are carefully chosen based on the intrinsic depth camera parameters of the Kinect 2 device to ensure the synthetic data is as comparable to real data as possible.

Figure 4.5: An overview of the extrinsic camera parameters inside our data generation pipeline.
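One sample of the extrinsic parameters of Figure 4.5 can be turned into a camera pose as follows. The sketch assumes the subject stands at the origin of a z-up world frame and aims the camera at a point one metre above the ground; these conventions are illustrative, not the ones used in our Maya scripts.

```python
import numpy as np

def sample_camera(rng, theta_range=(-np.pi, np.pi), h_range=(1.0, 3.0), d_range=(1.5, 4.0)):
    """Sample (Theta, H, D) as in Figure 4.5 and build a camera position and rotation."""
    theta = rng.uniform(*theta_range)
    h = rng.uniform(*h_range)
    d = rng.uniform(*d_range)
    position = np.array([d * np.cos(theta), d * np.sin(theta), h])
    target = np.array([0.0, 0.0, 1.0])            # aim roughly at the subject's torso

    forward = target - position
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    R = np.stack([right, up, -forward], axis=1)   # camera axes as world-frame columns
    return position, R

pos, R = sample_camera(np.random.default_rng(0))
print(pos, R, sep="\n")
```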

4.4 Sampling Data

To generate data we follow the sampling process described in Algorithm 1. The first input to our algorithm is C, the pool of target 3d characters (e.g., the characters of Section 4.2). The second input is the range of camera locations L (e.g., the definition in Section 4.3). The third input is the pool of postures P (e.g., MotionCK). Finally, the last parameter is the total number of viewpoints n. The output sample

is a set S = {(D_i, G_i)}_{i=1}^{n}, where D_i and G_i are the depth and the groundtruth image as seen from the i-th camera.

Algorithm 1 Sample data
Input: C (pool of characters), L (range of camera locations), P (pool of postures), n (number of cameras).
1: procedure SAMPLE(C, L, P, n)
2:     c ~ Unif(C)                     ▷ select a random character
3:     l_{1:n} ~ Unif(L)               ▷ select n random camera locations
4:     p ~ Unif(P)                     ▷ select a posture
5:     S ← render the depth and groundtruth images of (c, p) from cameras l_{1:n}
6:     return S                        ▷ S = {(D_i, G_i)}_{i=1}^{n}

Separating the inputs to our function gives us full control over the data generation pipeline. In Section 4.5 we will generate multiple datasets with different inputs to this function. To implement Algorithm 1 we use Python for scripting and Maya for rendering. We generate over 2 million samples within our pipeline for training, validation, and testing.
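A minimal Python sketch of Algorithm 1 follows; sample_location and render_view are hypothetical stand-ins for the camera sampler and the Maya rendering step, and the pools are treated as plain sequences.

import random

def sample_scene(characters, sample_location, postures, n, render_view, rng=random):
    # One draw of Algorithm 1: returns S = [(D_1, G_1), ..., (D_n, G_n)].
    #   characters      : pool C of 3d characters.
    #   sample_location : callable drawing one camera location from L.
    #   postures        : pool P of postures.
    #   n               : number of cameras.
    #   render_view     : callable producing the (depth, groundtruth) pair for one camera.
    c = rng.choice(characters)                         # select a random character
    locations = [sample_location() for _ in range(n)]  # select n random locations
    p = rng.choice(postures)                           # select a posture
    return [render_view(c, p, loc) for loc in locations]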

4.5 Datasets

In order to apply curriculum learning we make datasets of different complexity to train our models. We first start training on the simplest dataset and then gradually increase the complexity of the data to adapt our models. Further details on our training procedure are described in Chapter 5. The datasets that we have created are summarized in Table 4.1. Our first dataset is Easy-Pose, which is the simplest of the three. To generate Easy-Pose we select the subset of postures in MotionCK that are labeled with 'Walk' or 'Run', and pick only one 3d character from our models. The second dataset, Inter-Pose, extends Easy-Pose by adding more variation to the possible postures. A hypothetical model making the transition from Easy-Pose to Inter-Pose is required to learn more postures for the same character. The final dataset, Hard-Pose, includes all the 3d characters of Section 4.2.

Table 4.1: Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter, and D refers to the distance parameter as described in Figure 4.5. The simple set is the subset of postures that have the label 'walk' or 'run'. Going from the first dataset to the second requires pose adaptation, while going from the second to the third requires shape adaptation.

Dataset       Postures          Characters   Camera Parameters                            Samples
Easy-Pose     simple (~10K)     1            Θ ~ U(−π,π), H ~ U(1,1.5)m, D ~ U(1.5,4)m    1M
Inter-Pose    MotionCK (100K)   1            Θ ~ U(−π,π), H ~ U(1,1.5)m, D ~ U(1.5,4)m    1.3M
Hard-Pose     MotionCK (100K)   16           Θ ~ U(−π,π), H ~ U(1,3)m, D ~ U(1.5,4)m      300K

Each dataset of Table 4.1 has a train, test, and validation set with mutually exclusive sets of postures. We generate all the data with n = 3 cameras. In Figure 4.6, Figure 4.7, and Figure 4.8 you can find sample training data from Easy-Pose, Inter-Pose, and Hard-Pose respectively.

4.6 Discussion

In Section 4.1 we extracted a set of representative postures from the publicly available CMU Mocap dataset. Even though this dataset has over four million frames of real human motion capture, it does not cover the entire set of possible postures. Shotton et al. [36] further collect real data within their problem context to improve generalization, but they do not release it for public use. To create a better representation of human posture we suggest two potential approaches: collecting more data, and using human inverse kinematics algorithms. While collecting more motion capture data can be prohibitively expensive, it guarantees the validity of the collected postures. Alternatively, it is also possible to enumerate over a space of body configurations and rely on HumanIK to calculate appropriate joint angles. However, a majority of the postures generated with this procedure could be unrealistic and useless without an effective pruning strategy.


Figure 4.6: Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.


Figure 4.7: Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.


Figure 4.8: Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.

In Section 4.2 we created 16 characters to generate data with. The chosen regions of interest are defined heuristically while considering the previous work. Since we were able to achieve satisfactory results (presented in Chapter 6) we did not spend time on other region selection schemes. One potential future work here is to experiment with different region definitions to gauge the room for improvement.

In Section 4.3 we defined a subspace to sample uniformly for the camera location parameter. An idea that we did not explore is to generate datasets with different elevations. Since at test time we have knowledge of the camera location, it is possible to use only models that are trained for that specific elevation.

Chapter 5

Multiview Pose Estimation

In this chapter we present a general framework for multiview depth-based pose estimation. Our approach is to define a sequence of four tasks that must be addressed in order to predict the pose. Each task is a specific problem for which we can apply a multitude of approaches. The four stages of our framework are depicted in Figure 5.1. In the first stage we perform background subtraction on the input depth image to separate out the human pixels. For each set of identified human pixels we then generate a pixel-wise classification where each pixel is labeled according to the body regions in Figure 4.3. Recall that each Kinect has an independent machine running an instance of Kinect Service (see Figure 3.2). The first two stages of our pipeline can be run on each machine in a distributed fashion, or run centrally within Smart Home Core. The next step is to aggregate information from all the cameras into a single unified coordinate space. This aggregation will result in a labeled point cloud of the human body. The final step is to perform pose estimation on the labeled point cloud of the previous stage. In the following sections we go through each step of this pipeline to provide an in-depth description of each task. We then discuss the potential design choices and present the motivation behind our chosen methods. We end this chapter in Section 5.5 by discussing alternative design choices and potential future research directions.


Figure 5.1: Our framework consists of four stages – human segmentation, pixel-wise classification, classification aggregation, and pose estimation (the first two run per Kinect) – through which we gradually build higher-level abstractions. The final output is an estimate of human posture.

5.1 Human Segmentation

Human segmentation is a binary background/foreground classification task that assigns a label y ∈ {0,1} to each pixel. The purpose of this task is to separate people from the background so that individuals can be processed in isolation. While devising an exact solution to this problem is arguably a challenge in the RGB domain, it is possible to build sufficiently accurate methods in the depth domain. With depth data it is possible to mark the boundary pixels of a human body by simply examining the discontinuities – unfortunately, in the RGB domain this cue is unreliable. Generating a pixel mask for each input is generally treated as a pre-processing step for which efficient tools already exist, and in the pose estimation literature it is commonly assumed to be given [36, 43, 44]. While theoretically it is possible to use any classification model for this task, random forests have been shown to be particularly efficient. Since the rest of the pipeline is likely to require more sophisticated models to yield a good result, we also use random decision forest classifiers for this step, as is commonly practiced in the literature. More specifically, we use the implementation of the Kinect SDK to execute this step of the pipeline. A sample output of this step is shown in Figure 5.2. Note that after this stage of the pipeline we will be looking at individual human subjects.


Figure 5.2: Sample human segmentation in the first stage of our pose estimation pipeline.

5.2 Pixel-wise Classification

Given a masked depth image of a human subject, our task is to assign a class label y ∈ Y to each pixel, where Y is the set of our body classes as defined in Figure 4.3. This formulation of the subproblem has previously been motivated by Shotton et al. [36], but we have a few differences from the previous work. Shotton et al. [36] assume the user is facing the camera, leading to a simplified classification problem because it is no longer necessary to distinguish between the left and right side of the body. Furthermore, Shotton et al. use body labels with only 21 regions which wrap around the person. In this work we extend this to 43 regions to distinguish the right and left sides of the body (see Figure 4.3). In our context, unlike the case of Shotton et al., it is not possible to make classification decisions locally. A left or right hand label, for instance, depends on the existence of a frontal or dorsal label for the head at a different spatial location; and even then, there are still special cases that must be taken into account. The other assumption that we are relaxing is the relative distance between the camera and the user. The home entertainment scenario of [36] is no longer valid in our context, and thus our system should be able to handle a greater variety of viewpoints. Furthermore, in our context the user is not necessarily a co-operative agent, which further complicates our task. We may be able to assume co-operation

by focusing on use cases such as distant physiotherapy, but the general monitoring problem does not admit this simplification. We approach the classification problem from a new perspective that allows the embedding of higher-level inter-class spatial dependencies. In the following sections we fully describe our pixel-wise classifier. In Section 5.2.1 we describe how background/foreground masks and depth images are used for normalization. In Section 5.2.2 we describe the CNN architecture that takes the normalized image as input and generates a densely classified output image. Further discussion of our architecture is presented in Section 5.2.3.

5.2.1 Preprocessing the Depth Image

The first step in our image classification pipeline is to normalize the input image to make it consistent across all possible inputs. The input to this stage is a depth image with a foreground mask of the target subject whose pose needs to be estimated. We first quantize and linearly map the depth value range [50, 800]cm to the range [0, 255]. We then crop the depth image using the foreground mask and scale the image to fit in a 190 × 190 pixel window while preserving the aspect ratio. Finally, we translate all the depth values so that the average depth is approximately 160cm. After adding a 30-pixel margin we have a 250 × 250 pixel image which is used in the next stage. See the examples in Figure 5.3.
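The sketch below illustrates this normalization with NumPy and OpenCV. The constants follow the description above, but details such as the interpolation mode and the way the average-depth shift interacts with the quantization are our own assumptions rather than the exact thesis implementation.

import numpy as np
import cv2

def normalize_depth(depth_cm, mask, out_size=250, fit_size=190, margin=30,
                    near=50.0, far=800.0, target_cm=160.0):
    # depth_cm: depth image in centimetres; mask: boolean foreground mask.
    # 1. Quantize: linearly map [50, 800] cm to [0, 255].
    depth = (np.clip((depth_cm - near) / (far - near), 0.0, 1.0) * 255.0).astype(np.float32)
    depth[~mask] = 0.0
    # 2. Crop to the foreground bounding box.
    ys, xs = np.nonzero(mask)
    crop = depth[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # 3. Scale to fit a 190 x 190 window while preserving the aspect ratio.
    scale = fit_size / max(crop.shape)
    new_wh = (max(1, round(crop.shape[1] * scale)), max(1, round(crop.shape[0] * scale)))
    crop = cv2.resize(crop, new_wh, interpolation=cv2.INTER_NEAREST)
    crop_mask = cv2.resize(crop_mask.astype(np.uint8), new_wh,
                           interpolation=cv2.INTER_NEAREST).astype(bool)
    # 4. Shift depth values so the average foreground depth maps to ~160 cm.
    target = (target_cm - near) / (far - near) * 255.0
    crop[crop_mask] += target - crop[crop_mask].mean()
    # 5. Centre the crop inside a 250 x 250 canvas, leaving a 30-pixel margin.
    out = np.zeros((out_size, out_size), dtype=np.float32)
    y0 = margin + (fit_size - crop.shape[0]) // 2
    x0 = margin + (fit_size - crop.shape[1]) // 2
    out[y0:y0 + crop.shape[0], x0:x0 + crop.shape[1]] = crop
    return out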

5.2.2 Dense Segmentation with Deep Convolutional Networks

We use Convolutional Neural Networks (CNN) to generate a densely classified output from the normalized input depth image. Our network architecture is inspired by the work of Long et al. [26] in image segmentation. We use deconvolution outputs and fuse the result with information from the lower layers to generate a densely classified depth image. Our approach takes advantage of the information in neighboring pixels; this is in contrast to random-forest-based methods such as [37], where each pixel is evaluated independently. The particular architecture that we have chosen is presented in Figure 5.4. The input to this network is a single channel of normalized depth data.


Figure 5.3: Sample input and output of the normalization process. (a,b) the input from two views, (c,d) the corresponding foreground mask, (e,f) the normalized image output. The output is rescaled to 250×250 pixels. The depth data is from the Berkeley MHAD [31] dataset.


Figure 5.4: Our CNN architecture. The input is a 250×250 normalized depth image. The first row of the network generates a 44 × 14 × 14 coarsely classified output with a high stride. The network then learns deconvolution kernels that are fused with information from the lower layers to generate a finely classified output. Like [26], we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in parentheses within each block is the number of channels.

In the first row of operations in Figure 5.4 our network generates a 14 × 14 output with 44 channels. The remainder of the network is responsible for learning deconvolution kernels to generate a dense classification output. After each deconvolution we fuse the output with the lower-layer features through summation. The final deconvolution operation, with a kernel size of 19 × 19, enforces the spatial dependency of adjacent pixels within a 19-pixel neighborhood. At the end, our network gives a 250 × 250 output with 44 channels – one per class and one for the background label. This stage of our pipeline can either be run independently within each Kinect Service, or executed on all the data at once within Smart Home Core.

5.2.3 Designing the Deep Convolutional Network

To the best of our knowledge, the general approach to designing deep convolutional networks is by trial and error. Most of the current applications of CNNs simply reuse previously successful architectures such as VGGNet [38] or AlexNet [24].

Such applications often fine-tune an architecture pretrained on a dataset such as ImageNet [35] on the specific target data. We initially started our experiments by training window classifiers that label each pixel locally by only observing a 50 × 50 pixel window. The final architecture for our window classifier is the first row of Figure 5.4, where a 14 × 14 output with 44 channels is generated. In the window classification setting, the output is simply 1 × 1 with 44 channels – one per class and one for the background. After training the initial window classifier we fix the parameters and extend the network with deconvolution layers to get the final architecture of Figure 5.4. We learn the parameters of the newly added layers by training the entire network on densely labeled input data. Using the deconvolution approach of Long et al. [26] to generate a densely classified output was a particularly attractive choice because of its compatibility with our window classifier. During our experiments we learned that a separate step for window classification was unnecessary, and thus we abandoned the initial two-step approach and trained the entire network in an end-to-end fashion.

5.3 Classification Aggregation

After generating a densely classified depth image, we use the extrinsic camera parameters of our set-up to reconstruct a merged point cloud in a reference coordinate system. It is possible to apply various filtering and aggregation techniques; however, we have found this simple and fast approach to be sufficiently effective. At this point we have a labeled point cloud, which is the final result after combining all the views. We then extract a feature f from our merged point cloud to be used in the next stage of our pose estimation pipeline. For each class in our point cloud we extract the following features:

• The median location.

• The covariance matrix.

• The eigenvalues of the covariance matrix.

• The standard deviation within each dimension.

• The minimum and maximum values in each dimension.

The final feature f is the concatenation of all the above features into a single feature vector f ∈ R^{1032}. We chose these features so that feature extraction runs in real-time. It is also possible to apply more computationally expensive data summarization techniques to generate possibly better features for the next stage.
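The per-class summaries listed above reduce to a few NumPy operations; the sketch below is our own illustration (24 values per class, 43 classes, giving the 1032-dimensional vector f), and the handling of nearly empty classes is an assumption.

import numpy as np

def point_cloud_features(points, labels, num_classes=43):
    # points: (N, 3) merged point cloud; labels: (N,) body-part class ids.
    features = []
    for c in range(num_classes):
        pts = points[labels == c]
        if len(pts) < 2:
            features.append(np.zeros(24))   # assumption: missing classes contribute zeros
            continue
        cov = np.cov(pts, rowvar=False)     # 3x3 covariance matrix
        features.append(np.concatenate([
            np.median(pts, axis=0),         # median location (3)
            cov.ravel(),                    # covariance matrix (9)
            np.linalg.eigvalsh(cov),        # its eigenvalues (3)
            pts.std(axis=0),                # per-dimension standard deviation (3)
            pts.min(axis=0),                # per-dimension minimum (3)
            pts.max(axis=0),                # per-dimension maximum (3)
        ]))
    return np.concatenate(features)         # f in R^1032 for 43 classes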

5.4 Pose Estimation

We treat the problem of pose estimation as regression. For each joint j in our skeleton we would like to learn a function F_j(·) that predicts the location of joint j given the feature vector f. After examining a few design choices with real-time performance, such as linear regression and neural networks, we learned that simple linear regression gives the best trade-off between complexity and performance. Our linear regression is a least-squares formulation with an ℓ2 regularizer, which is also known as ridge regression (Equation 5.1).

\arg\min_{W_j \in \mathbb{R}^{1032 \times 3},\, b_j \in \mathbb{R}^{3}} \; \frac{1}{2n} \sum_{i=1}^{n} \left\lVert W_j^{T} f^{i} + b_j - y_j^{i} \right\rVert_2^2 \; + \; \frac{\lambda_j}{2} \left( \mathrm{Tr}(W_j^{T} W_j) + \lVert b_j \rVert_2^2 \right) \qquad (5.1)

Our regression problem for each joint j is defined in Equation 5.1. For each joint j we would like to learn a matrix W_j ∈ R^{1032×3} and a bias term b_j ∈ R^{3} using the regularization parameter λ_j. If we append a constant one to the feature vector f, we can absorb the bias term b_j into W_j and arrive at the closed-form solution shown in Equation 5.2.

W_j = \left( F^{T} F + n \lambda_j I \right)^{-1} F^{T} Y_j \qquad (5.2)

Here F ∈ R^{n×1033} is the design matrix of all the training features and Y_j ∈ R^{n×3} holds the corresponding coordinates for the j-th joint. Having a closed-form solution allows fast hyperparameter optimization of λ_j. We also experimented with the LASSO counterpart to obtain sparser solutions, but the improvements were negligible while the optimization took substantially more time. If the input data is

over a sequence, we further smooth the predictions temporally by calculating a weighted average with the previous estimate (Equation 5.3).

\hat{Y}^{s}_{t} = (1 - \eta)\, \hat{Y}_{t-1} + \eta\, \hat{Y}_{t}, \qquad 0 \le \eta \le 1 \qquad (5.3)

where Ŷ^s_t is the smoothed estimate at time t and Ŷ_t is the original estimate at time t. The regularizer hyper-parameters and the optimal smoothing weights are chosen automatically by cross-validation over the training data. For pose estimation it is also possible to apply more complex methods such as structural SVMs [42] or Gaussian Processes [41]. However, if we choose more complicated methods we will also need more data and computational power. Since we need to evaluate on real data, and each dataset comes with its own definition of the skeleton, we prefer the simplest approach in order to simultaneously maintain real-time performance and robustness against overfitting. Because the joint definitions differ across datasets, we train this part of our pipeline separately for each dataset.
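The closed-form solution of Equation 5.2 and the smoothing of Equation 5.3 take only a few lines of NumPy; the sketch below is illustrative (the bias is absorbed by appending a constant one to each feature vector, and using the previous estimate for smoothing is our choice of implementation detail).

import numpy as np

def fit_ridge(F, Y, lam):
    # Equation 5.2: W = (F^T F + n*lam*I)^(-1) F^T Y
    # F: (n, 1033) design matrix (features with a constant 1 appended); Y: (n, 3).
    n, d = F.shape
    return np.linalg.solve(F.T @ F + n * lam * np.eye(d), F.T @ Y)

def predict_joint(W, f):
    # Predict one joint location from a single 1032-dimensional feature vector f.
    return np.append(f, 1.0) @ W

def smooth(prev_estimate, current_estimate, eta):
    # Equation 5.3: weighted average of the previous and current estimates, 0 <= eta <= 1.
    return (1.0 - eta) * prev_estimate + eta * current_estimate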

5.5 Discussion

In Section 5.2 we presented a CNN architecture to generate densely classified outputs. Our architecture builds on the ideas of Long et al. [26]; however, there has been a recent surge of architectures, such as the ones presented by Zheng et al. [46] and Hu and Ramanan [20], that directly model CRFs inside the CNN. One future direction is to investigate more recent CNN architectures to improve the classification.

In Section 5.3 we aggregated the classification results of all the views through simple algebraic operations. By using temporal information and applying filtering strategies we could eliminate some noise in the final labeled point cloud, which is likely to lead to minor improvements in the pipeline. Another possible direction is to explore various shape summarization techniques to generate better features.

The final step of pose estimation in Section 5.4 is a simple linear regression which ignores the kinematic constraints. While the presented CNN architecture has control over kinematic constraints, one potential direction is to also add kinematic constraints in the final prediction stage while maintaining real-time performance. A kinematic model that also incorporates temporal consistency is likely to

significantly improve the results. Collecting real data with multiple Kinect sensors is also a valuable future direction that would greatly benefit the community and help with further development of multiview pose estimation methods.

Chapter 6

Evaluation

In this chapter we provide our evaluation results on three datasets: (i) UBC3V Synthetic, (ii) Berkeley MHAD [31], and (iii) EVAL [12]. To train and evaluate our deep network model we use Caffe [22]. Since each dataset has a specific definition of joint locations, we only need to train the regression part of our pipeline (see Section 5.4) on each dataset.

Evaluation Metrics. There are two common evaluation metrics for the pose estimation task: (i) mean joint prediction error and (ii) mean average precision at threshold. Mean joint prediction error is the measure of the average error incurred in the prediction of each joint location. A mean average error of 10cm for a joint simply indicates that we incur an average error of 10cm per estimate. Mean average precision at threshold is the fraction of predictions that are within the threshold distance of the groundtruth. A mean average precision of 70% for a threshold of 5cm means that our estimates are within a 5cm radius of the groundtruth 70% of the time. Because of the errors in groundtruth annotations it is also common to report only the mean average precision at a 10cm threshold [12, 43, 44]. A short sketch of both metrics follows the dataset overview below.

Datasets. The only publicly available dataset for multiview depth at the moment of writing is the Berkeley MHAD [31]. Note that our target is pose estimation with multiple depth cameras; therefore we only qualitatively evaluate our CNN body part classifier on single-camera datasets, as our pose estimation technique is not applicable to single-depth-camera settings. Our evaluation also includes the results on our synthetic dataset with three depth cameras.
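Both metrics reduce to simple operations on the per-joint Euclidean errors; the following sketch is our own illustration of the definitions above.

import numpy as np

def joint_errors(pred, gt):
    # pred, gt: (num_frames, num_joints, 3) joint locations in cm.
    return np.linalg.norm(pred - gt, axis=-1)

def mean_joint_error(pred, gt):
    # Mean joint prediction error in cm, averaged over frames and joints.
    return joint_errors(pred, gt).mean()

def mean_average_precision(pred, gt, threshold_cm=10.0):
    # Fraction of joint predictions within threshold_cm of the groundtruth.
    return (joint_errors(pred, gt) <= threshold_cm).mean()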

UBC3V Synthetic. For evaluation we use the Test set of Hard-Pose. This dataset consists of 19000 body postures with 16 characters viewed from three cameras at random locations. Note that these 19000 postures are not present in the training set of our dataset and have not been observed before. The groundtruth and the extrinsic camera parameters come directly from the synthetic data. For more information and sample data see Chapter 4.

Berkeley MHAD [31]. This dataset includes 12 subjects performing 11 actions while being recorded by 12 cameras, two Kinect 1 devices, an Impulse motion capture system, four microphones, and six accelerometers. The mocap sequence generated by the Impulse motion capture system is the groundtruth for pose estimation on this dataset. A sample frame from all 12 subjects is shown in Figure 6.1. Note that we only use the depth information from the two Kinects for pose estimation and ignore all other sources of information.

EVAL [12]. This dataset consists of 24 sequences of three different characters performing eight activities each. Since this dataset is created for single-view pose estimation, we only qualitatively evaluate our body part classifier of Section 5.2 on this data. See Figure 6.2 for three random frames from this dataset.

6.1 Training the Dense Depth Classifier

Our initial attempts to train the deep network presented in Section 5.2.2 on the Hard-Pose dataset did not yield satisfying results. We experimented with various optimization techniques and configurations, but the accuracy of the network on dense classification did not go beyond 50%. Resorting to the curriculum learning idea of Bengio et al. [2] (see Section 2.3), we simplified the problem by defining easier datasets that we call Easy-Pose and Inter-Pose (see Table 4.1). We start training the network with the Easy-Pose dataset. Each iteration consists of eight densely classified images, and we stop at 250k iterations, reaching a dense classification accuracy of 87.8%. We then fine-tune the resulting network on Inter-Pose, initially starting at an accuracy of 78% and terminating at iteration 150k with an accuracy of 82%. Interestingly, the performance on Easy-Pose is preserved throughout this fine-tuning stage. Finally, we start fine-tuning on the Hard-Pose dataset and stop after 88k iterations.

Figure 6.1: Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.

Initially this network evaluates to 73%, and by the termination point we have an accuracy of 81%. The evolution of our three networks is shown in Table 6.1. Notice how the final accuracy improved from 50% to 81% by controlling the difficulty of the training instances that our network sees. Our experiments demonstrate a real application of curriculum learning [2] in practice. All of our networks are trained with Stochastic Gradient Descent (SGD) with a momentum of 0.99. The initial learning rate is set to 0.01 and multiplied by 10^{-1} every 30k iterations.

Figure 6.2: Front depth camera samples of all the subjects in the EVAL [12] dataset.

            Easy-Pose        Inter-Pose       Hard-Pose
            Start    End     Start    End     Start    End
Net 1       0%       87%     –        –       –        –
Net 2       87%      87%     78%      82%     –        –
Net 3       87%      85%     82%      79%     73%      81%

Table 6.1: The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2, respectively.

The weight decay parameter is set to 5·10^{-5}. In hindsight, all of the above stages of training can be run in approximately 5 days on a Tesla K40 GPU. Since we experimented with multiple architectures simultaneously, the exact time required to train these networks is not available. If we only extract the most likely class from the output, the CNN takes only 6ms to process each image on the GPU. However, calculating the exponents and normalizing the measures to get the full probability distribution for each pixel can cost up to an extra 40ms. To maintain real-time 30fps performance on multiple Kinects, we discard the full probability output and only use the most likely class for each pixel.
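The staged training of Section 6.1 can be summarized by the schematic loop below; train is a hypothetical wrapper around the Caffe solver, and only the hyperparameters quoted in the text (batch size 8, momentum 0.99, initial learning rate 0.01 decayed by 0.1 every 30k iterations, weight decay 5e-5) are taken from the thesis.

# Schematic curriculum schedule; `train` is a hypothetical helper that runs the
# Caffe solver for `iters` iterations starting from `init_weights`.
SOLVER = dict(batch_size=8, momentum=0.99, base_lr=0.01,
              lr_decay=0.1, lr_step=30_000, weight_decay=5e-5)

def curriculum(train):
    net1 = train(dataset="Easy-Pose",  iters=250_000, init_weights=None, **SOLVER)
    net2 = train(dataset="Inter-Pose", iters=150_000, init_weights=net1, **SOLVER)
    net3 = train(dataset="Hard-Pose",  iters=88_000,  init_weights=net2, **SOLVER)
    return net3  # the network used throughout the remaining evaluation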

6.2 Evaluation on UBC3V Synthetic

This dataset includes the groundtruth body part classification and pose annotation. Since the annotations come from synthetic data, there are no errors associated with the annotation. For real-world data, however, the annotations are likely to be erroneous to a small extent. Having groundtruth body part classes and postures allows us to separate the evaluation of the dense classifier from that of the pose estimates. That is, we can evaluate the pose estimates assuming a perfect dense classification is available, and then compare the results with those obtained from the densely classified depth image generated by our CNN. This separation gives us insight into how improvements in dense classification are likely to affect the pose estimates, and whether we should spend time improving the dense depth classifier or the pose estimation algorithm. For training we follow the multi-step fine-tuning procedure described in Section 6.1. We first train the network on the Train set of Easy-Pose. We then successively fine-tune on the Train sets of Inter-Pose and Hard-Pose. We refer to the third fine-tuned network as Net 3, which we will be using throughout the remainder of the thesis.

6.2.1 Dense Classification

The Test set of Hard-Pose includes 57057 depth frames with synthetically generated class annotations. This dataset is generated from a pool of 19000 postures that have not been seen by our classifier at any point. Furthermore, each frame of this dataset is generated from a random viewpoint. The reference class numbers are shown in Figure 6.3. The confusion matrix of our classifier is shown in Figure 6.4. Figure 6.5 displays a few sample classification outputs and the corresponding groundtruth images in their original size. Figure 6.6 shows a few enlarged sample classification outputs and the corresponding groundtruth. Note that for visualization we only use the most likely class at each pixel. The accuracy of Net 3 on the Test set is 80.6%, similar to the reported accuracy on the Validation set in Table 6.1. As is evident from Figure 6.5, the network correctly identifies the direction of the human body and assigns the appropriate left/right classes. However, the network seems to ignore sudden depth discontinuities in the classification (see the last row of Figure 6.5).

48 39 41 40 4243 34 33 32 31 10 9 38 37 36 35 17 18 31 32 33 34 29 11 35 36 37 38 30 29 13 12 15 16 14

1 2 7 5

3 4 8 6

19 20 20 19

21 22 22 21

23 24 24 23

27 28 26 25

Figure 6.3: The reference groundtruth classes of the UBC3V synthetic data.

6.2.2 Pose Estimation

We evaluate our linear regression on the groundtruth classes and on the classification output of our CNN. The estimates derived from the groundtruth serve as a lower bound on the error of the pose estimation algorithm. The mean average joint prediction error is shown in Figure 6.7. Our system achieves an average pose estimation error of 2.44cm on the groundtruth and 5.64cm on the Net 3 output. The gap between the two results is due to dense classification errors. This difference is smaller on easy-to-recognize body parts and gets larger on hard-to-recognize classes such as hands or feet. It is possible to reduce this gap by using more sophisticated pose estimation methods at the cost of more computation. In Figure 6.8 we compare the precision at threshold. The accuracy at 10cm for the groundtruth and Net 3 is 99.1% and 88.7% respectively.


Figure 6.4: The confusion matrix of the Net 3 estimates on the Test set of Hard-Pose (rows: true class, columns: estimated class).

6.3 Evaluation on Berkeley MHAD

This dataset has a total of 659 sequences of 12 actors over 11 actions with 5 repetitions¹. There are two Kinect 1 devices on opposite sides of the subjects capturing the depth information. This dataset defines 35 joints over the entire body (for a full list see Figure 6.11). The groundtruth pose here is the motion capture data. At the moment of writing there is no standard protocol for the evaluation of pose estimation techniques on this dataset. The leave-one-out approach is a common practice for single-view pose estimation.

¹ One sequence is missing.

Figure 6.5: The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right). The images are in their original size.


Figure 6.6: The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).

However, each action has five repetitions, and we argue that leave-one-out may not be a fair indicator of performance because the method can adapt to the shape of the test subject from the other sequences to obtain a better result. Furthermore, we are no longer restricted to only a few sequences of data as in previous datasets. To evaluate the performance on this dataset we take the harder leave-one-subject-out approach; that is, for the evaluation of each subject we train our system on all the other subjects. This protocol ensures that no extra physical information is leaked during training and provides a measure of robustness to shape variation.

52 20

18 Groundtruth Net 3

16

14

12

10

8

6

Mean Average Error (cm) Error Average Mean 4

2

0 Head Neck Spine2Spine1SpineHip RHip RKneeRFootLHip LKneeLFootRShoulderRElbowRHandLShoulderLElbowLHand

Figure 6.7: Mean average joint prediction error on the groundtruth and on the Net 3 classification output. The error bars are one standard deviation. The average error on the groundtruth is 2.44cm, and on Net 3 it is 5.64cm.

The Kinect depth images of this dataset are captured with Kinect 1 sensors, which have different intrinsic camera parameters than the Kinect 2. The difference in focal length and principal point offset can be eliminated by a simple scaling and translation of the depth image. To make the depth images of this dataset compatible with our pipeline, we resize and translate the provided depth images to match the intrinsic camera parameters of a Kinect 2 sensor. To verify the correctness of our procedure, we generate a point cloud from the final output using the Kinect 2 intrinsic camera parameters and compare the output cloud with the original point cloud generated from the Kinect 1 depth images. To measure the discrepancy between the two point clouds we run ICP for one iteration and calculate the objective value – we simply pick the translation and scale parameters that minimize the error objective between the two clouds.
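One way to implement this scale-and-translate adjustment is with a single affine warp, as sketched below. A pixel (u1, v1) seen with intrinsics (fx1, fy1, cx1, cy1) maps to u2 = (fx2/fx1)(u1 − cx1) + cx2, and similarly for v; the resampling choices here are our own assumptions, not the thesis implementation, and the calibrated Kinect 1 and Kinect 2 intrinsics must be supplied.

import numpy as np
import cv2

def adapt_intrinsics(depth_k1, K1, K2, out_size):
    # K1, K2: 3x3 intrinsic matrices; out_size: (width, height) of the Kinect 2 image.
    sx, sy = K2[0, 0] / K1[0, 0], K2[1, 1] / K1[1, 1]
    tx = K2[0, 2] - sx * K1[0, 2]
    ty = K2[1, 2] - sy * K1[1, 2]
    A = np.float32([[sx, 0, tx], [0, sy, ty]])
    # Nearest-neighbour resampling avoids mixing depth values across boundaries;
    # the depth values themselves are unchanged because the scene geometry is the same.
    return cv2.warpAffine(depth_k1, A, out_size, flags=cv2.INTER_NEAREST)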

53 1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

Mean Average Precision Average Mean 0.2 Groundtruth 0.1 Net 3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Threshold (cm)

Figure 6.8: Mean average precision of the groundtruth dense labels and of the Net 3 dense classification output, with an accuracy at the 10cm threshold of 99.1% and 88.7%, respectively.

6.3.1 Dense Classification

To reuse the CNN that we trained on the synthetic data in Section 6.2.1, we adjust the depth images using the procedure described earlier. After this step, we simply feed the depth image to the CNN to get dense classification results. Figure 6.9 shows the output of our dense classifier from the two Kinects on a few random frames. Even though the network has only been trained on synthetic data, it generalizes well to the real test data. As demonstrated in Figure 6.9, the network has also successfully captured the long-distance spatial relationships needed to correctly classify pixels based on the orientation of the body. The right column of Figure 6.9 shows an instance of high partial classification error due to occlusion. In the back image, the network mistakenly believes that the chair legs are the subject's hands. However, once the back data is merged with the front data we get a reasonable estimate (see Figure 6.10).


Figure 6.9: Dense classification result of Net 3 together with the original depth image (front and back Kinect views) on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.

6.3.2 Pose Estimation

We use the groundtruth motion capture joint locations to train our system. For each test subject we train our system on the other subjects' sequences. The final result is an average over all the test subjects. Figure 6.11 shows the mean average joint prediction error. The total average joint prediction error is 5.01cm. The torso joints are easier for our system to localize than the hand joints, a behavior similar to the synthetic data results. However, it must be noted that even the groundtruth motion capture on smaller body parts such as hands or feet is biased and has high variance. During visual inspection of Berkeley MHAD we noticed that, on some frames, especially when the subject bends over, the hands' location is outside of the body point cloud or even outside the frame, and clearly erroneous. The overall average precision at 10cm is 93%.
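The leave-one-subject-out protocol amounts to a simple loop over subjects; the sketch below assumes hypothetical fit_regressors and evaluate helpers wrapping the ridge regression of Section 5.4.

import numpy as np

def leave_one_subject_out(features_by_subject, joints_by_subject, fit_regressors, evaluate):
    # features_by_subject / joints_by_subject: dicts mapping subject id to arrays.
    subjects = sorted(features_by_subject)
    errors = []
    for held_out in subjects:
        train_ids = [s for s in subjects if s != held_out]
        F = np.concatenate([features_by_subject[s] for s in train_ids])
        Y = np.concatenate([joints_by_subject[s] for s in train_ids])
        model = fit_regressors(F, Y)          # one ridge regressor per joint
        errors.append(evaluate(model, features_by_subject[held_out],
                               joints_by_subject[held_out]))
    return float(np.mean(errors))             # average over all test subjects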

Figure 6.10: Blue denotes the motion capture groundtruth on the Berkeley MHAD [31] dataset and red denotes the linear regression pose estimate.


Figure 6.11: Pose estimation mean average error per joint on the Berkeley MHAD [31] dataset. The joints are: Hip, Spine1, Spine2, Spine3, Neck, Neck1, Head, RLShoulder, RShoulder, RArm, RElbow, RForearm, RHand, RHFingerBase, LLShoulder, LShoulder, LArm, LElbow, LForearm, LHand, LHFingerBase, RHip, RULeg, RKnee, RLLeg, RFoot, RToeBase, RToe, LHip, LULeg, LKnee, LLLeg, LFoot, LToeBase, and LToe.

An interesting observation is the similarity of the performance on the Berkeley MHAD data and on the synthetic data in Figure 6.7. This suggests that, at least for the applied methods, the synthetic data is a reasonable proxy for evaluating performance, which has also been suggested by Shotton et al. [37]. Figure 6.12 shows the accuracy at threshold for joint location predictions. We also compare our performance with Michel et al. [28] in Table 6.2. Since they use an alternative definition of the skeleton that is derived from their shape model, we only evaluate over the subset of our joints that are closest to the locations used by Michel et al. [28]. Note that the method of [28] uses predefined shape parameters that are optimized for each subject a priori and does not operate in real-time.

56 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2

Mean Average Precision Average Mean 0.1 0 0 2 4 6 8 10 12 14 16 18 20 Threshold (cm)

Figure 6.12: Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.

                       Subjects                    Actions
                       Mean    Std    Acc (%)      Mean    Std    Acc (%)
OpenNI [28]            5.45    4.62   86.3         5.29    4.95   87.3
Michel et al. [28]     3.93    2.73   96.3         4.18    3.31   94.4
Ours                   3.39    1.12   96.8         2.78    1.5    98.1

Table 6.2: Mean and standard deviation of the prediction error when testing on subjects and when testing on actions, using the joint definitions of Michel et al. [28]. We also report and compare the accuracy at the 10cm threshold.

In contrast, our method does not depend on shape attributes and operates in real-time. Following the procedure of [28] we evaluate our performance by testing on the subjects and by testing on the actions. Our method improves the previous mean joint prediction error from 3.93cm to 3.39cm (13%) when tested on subjects and from 4.18cm to 2.78cm (33%) when tested on actions.


Figure 6.13: Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has only been trained on synthetic data.

6.4 Evaluation on EVAL

There are a total of 24 sequences of 3 subjects with 16 joints. To generate a depth image from this dataset we must project the provided point cloud back into the original camera view and then rescale the image to resemble a Kinect 2 output. To verify the correctness of our depth image, we generate a point cloud from this image with the Kinect 2 parameters and compare it against the original point cloud provided in the dataset. Three sample outputs of our procedure are presented in Figure 6.2. Figure 6.13 shows four random dense classification outputs from this dataset. The first column of Figure 6.13 shows an instance of the network failing to confidently label the data with front or back classes, but the general locations of the torso, head, feet, and hands are correctly determined. The accuracy of our preliminary results suggests that single-depth-camera pose estimation techniques can benefit from using the output of our dense classifier.
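The projection of the provided point cloud back into a depth image is a standard pinhole projection with a z-buffer; the sketch below is our own illustration, and the commented intrinsics are placeholders rather than calibrated values.

import numpy as np

def project_to_depth(points, fx, fy, cx, cy, width, height):
    # points: (N, 3) cloud in the camera frame (z pointing forward, metric units).
    depth = np.full((height, width), np.inf, dtype=np.float32)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    valid = Z > 0
    u = np.round(fx * X[valid] / Z[valid] + cx).astype(int)
    v = np.round(fy * Y[valid] / Z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Keep the closest point per pixel (a simple z-buffer).
    np.minimum.at(depth, (v[inside], u[inside]), Z[valid][inside])
    depth[np.isinf(depth)] = 0.0
    return depth

# Example with placeholder Kinect 2-like intrinsics (substitute calibrated values):
# depth = project_to_depth(cloud, fx=365.0, fy=365.0, cx=256.0, cy=212.0,
#                          width=512, height=424)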

Chapter 7

Discussion and Conclusion

We presented an efficient and inexpensive markerless pose estimation system that uses only a few Kinect sensors. Our system only assumes the availability of calibrated depth cameras and is capable of real-time performance without requiring an explicit shape model of the subject or co-operation by the subject. While our main goal is to estimate the posture in real-time for smart homes, our system can also be used as a reasonably accurate and inexpensive replacement for commercial motion capture solutions in applications that do not require precise measurements. The non-intrusive nature of our system can also facilitate the development of easy-to-use virtual reality or augmented reality platforms.

The subproblems of our pose estimation pipeline as described in Chapter 5 are all open to further improvement. Our results in Chapter 6 suggest that improving the dense depth classifier of Section 5.2 is a worthwhile path to explore. For a thorough discussion on this topic we refer the reader to Section 5.5.

The supporting infrastructure of our pose estimation is a scalable and modular software framework for smart homes that orchestrates multiple Kinect devices in real-time. By tackling the technical challenges, our platform enables research on multiview depth-based pose estimation. The modular structure of our system simplifies the integration of more sources of information for the smart home application. Our platform is developed only to the extent required by our research on pose estimation. Adding more features, such as analysis of the auditory

signals to support voice-activated commands, is one of the many exciting research directions that our platform supports.

The training of our system was made possible by generating a dataset of 6 million synthetic depth frames. Our data generation process depended on a set of human postures collected from real data. The 100K set of postures (see Section 4.1) that we used is arguably not representative of every possible human posture. An interesting future research direction is to build automated systems that generate random, but plausible, body configurations.

Our experiments demonstrated an application of curriculum learning in practice, and our system exceeded the state-of-the-art multiview pose estimation performance on the Berkeley MHAD [31] dataset.

Bibliography

[1] A. Baak, M. Müller, G. Bharaj, H. P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision. 2013. → pages 3, 12

[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, 2009. → pages 15, 45, 46

[3] P. J. Besl and H. D. McKay. A method for registration of 3-D shapes. Transactions on Pattern Analysis and Machine Intelligence, 1992. → pages 19

[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision, 2009. → pages 11

[5] L. Bourdev, F. Yang, and R. Fergus. Deep poselets for human detection. arXiv preprint arXiv:1407.0717, 2014. → pages 11

[6] B. Chen, P. Perona, and L. Bourdev. Hierarchical cascade of classifiers for efficient poselet evaluation. In British Machine Vision Conference, 2014. → pages 11

[7] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015. → pages 15

[8] P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. Theory of Computing, 2012. → pages 12

[9] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981. → pages 19

[10] Y. Furukawa and J. Ponce. Accurate camera calibration from multi-view stereo and bundle adjustment. In Computer Vision and Pattern Recognition, 2008. → pages 19

[11] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a single time-of-flight camera. In Computer Vision and Pattern Recognition, 2010. → pages 9

[12] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real-time human pose tracking from range data. In European Conference on Computer Vision, 2012. → pages x, xi, 9, 44, 45, 47, 58

[13] S. Ge and G. Fan. Non-rigid articulated point set registration for human pose estimation. In Winter Applications of Computer Vision, 2015. → pages 3, 13

[14] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In International Conference on Computer Vision, 2011. → pages 10

[15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In Computer Vision and Pattern Recognition, 2014. → pages 11

[16] D. F. Glas, D. Brščić, T. Miyashita, and N. Hagita. SNAPCAT-3D: Calibrating networks of 3d range sensors for pedestrian tracking. In International Conference on Robotics and Automation, 2015. → pages 19

[17] L. He, G. Wang, Q. Liao, and J. H. Xue. Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing, 2015. → pages 3

[18] T. Helten, A. Baak, G. Bharaj, M. Müller, H. P. Seidel, and C. Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In International Conference on 3D Vision, 2013. → pages 9

[19] L. Heng, G. H. Lee, and M. Pollefeys. Self-calibration and visual slam with a multi-camera system on a micro aerial vehicle. In Robotics: Science and Systems (RSS), 2014. → pages 19

[20] P. Hu and D. Ramanan. Bottom-up and top-down reasoning with convolutional latent-variable models. arXiv preprint arXiv:1507.05699, 2015. → pages 14, 42

[21] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In International Conference on Computer Vision, 2013. → pages 10

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In International Conference on Multimedia, 2014. → pages 44

[23] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In Association for the Advancement of Artificial Intelligence, 2015. → pages 15

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012. → pages 39

[25] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010. → pages 15

[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015. → pages x, 14, 37, 39, 40, 42

[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004. → pages 19

[28] D. Michel, C. Panagiotakis, and A. A. Argyros. Tracking the articulated motion of the human body with two RGBD cameras. Machine Vision and Applications, 2014. → pages vii, 13, 56, 57

[29] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Transactions on Pattern Analysis and Machine Intelligence, 36, 2014. → pages 24

[30] A. Myronenko and X. Song. Point set registration: Coherent point drift. Transactions on Pattern Analysis and Machine Intelligence, 2010. → pages 13

[31] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Berkeley MHAD: A comprehensive multimodal human action database. In Winter Applications of Computer Vision, 2013. → pages ix, x, xi, 5, 38, 44, 45, 46, 55, 56, 57, 60

[32] S. Pellegrini, K. Schindler, and D. Nardi. A generalisation of the ICP algorithm for articulated bodies. In British Machine Vision Conference, 2008. → pages 13

[33] A. Phan and F. P. Ferrie. Towards 3D human posture estimation using multiple kinects despite self-contacts. In IAPR International Conference on Machine Vision Applications, 2015. → pages 14

[34] G. Pons-Moll, J. Taylor, J. Shotton, A. Hertzmann, and A. Fitzgibbon. Metric regression forests for human pose estimation. In British Machine Vision Conference, 2013. → pages 10

[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. → pages 40

[36] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, and A. Kipman. Efficient human pose estimation from single depth images. Transactions on Pattern Analysis and Machine Intelligence, 2013. → pages 3, 6, 12, 13, 22, 23, 29, 35, 36

[37] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013. → pages 10, 37, 56

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. → pages 39

[39] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, 2014. → pages 12

[40] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition, 2014. → pages 12

[41] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In Computer Vision and Pattern Recognition, 2008. → pages 42

[42] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. Transactions on Pattern Analysis and Machine Intelligence, 2013. → pages 12, 42

[43] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Computer Vision and Pattern Recognition, 2014. → pages 3, 9, 13, 35, 44

[44] H. Yub Jung, S. Lee, Y. Seok Heo, and I. Dong Yun. Random tree walk toward instantaneous 3d human pose estimation. In Computer Vision and Pattern Recognition, 2015. → pages 3, 10, 13, 35, 44

[45] P. Zhang, K. Siu, J. Zhang, C. K. Liu, and J. Chai. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. Transactions on Graphics, 2014. → pages 14

[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision, 2015. → pages 15, 42
