University of Nevada, Reno

Deep Learning Based Robust Human Body Segmentation for Pose Estimation from RGB-D Sensors

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

by

David Frank

Dr. David Feil-Seifer, Thesis Advisor

May, 2016

THE GRADUATE SCHOOL

We recommend that the thesis prepared under our supervision by

DAVID FRANK

Entitled

Deep Learning Based Robust Human Body Segmentation For Pose Estimation From RGB-D Sensors

be accepted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Dr. David Feil-Seifer, Advisor

Dr. Monica Nicolescu, Committee Member

Dr. Jacqueline Snow, Graduate School Representative

David W. Zeh, Ph.D., Dean, Graduate School

May, 2016


Abstract

This project focuses on creating a system for human body segmentation meant to be used for pose estimation. Recognizing a human figure in a cluttered environment is a challenging problem. Current systems for pose estimation assume that there are no objects around the person, which restricts their use in real-world scenarios. This project is based on recent advances in deep learning, a field of machine learning suited to difficult vision problems. The project provides a complete pipeline for training and using a system to estimate the pose of a human. It contains a data generation module that creates the training data for the deep learning module. The deep learning module is the main contribution of this work and provides a robust method for segmenting the body parts of a human. Finally, the project includes a pose estimation module that reduces the detailed output of the deep learning module to a pose skeleton.

Acknowledgments

This material is based in part upon work supported by: NASA Space Grant: NNX10AN23H, the Nevada Governor’s Office of Economic Development (NV-GOED: OSP-1400872), and Flirtey Technology Pty Ltd., and by Cubix Corporation through use of their PCIe slot expansion hardware solutions and HostEngine. Software used in the implementation of this project includes Blender, MakeHuman, OpenEXR, HDF5, the Point Cloud Library, and Torch7. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NASA, NV-GOED, Cubix Corporation, Blender Foundation, Open Perception, The HDF Group, MakeHuman Team, Flirtey Technology Pty Ltd., Industrial Light & Magic, Deepmind Technologies, NYU, NEC Laboratories America, or IDIAP Research Institute.

Advisor:

Dr. David Feil-Seifer

Committee members:

Dr. Monica Nicolescu
Dr. Jacqueline Snow

Essential Guidance:

Dr. Richard Kelley

Computational Horsepower Courtesy of:

Dr. Frederick C. Harris, Jr.

Network and Computer Wizard:

Zachary Newell

Provider of Template and Elusive Information About Graduating:

Jessica Smith

Contents

Abstract

Acknowledgments

List of Tables

List of Figures

1 Introduction

2 Background
  2.1 Pose Estimation
  2.2 Deep Learning
  2.3 Related Work

3 Data Generation
  3.1 Creating Data in Blender
    3.1.1 Human Model
    3.1.2 Clutter and Occlusions
  3.2 Post Processing
  3.3 Data Sets

4 Network Training
  4.1 Network Structure

5 Pose Estimation
  5.1 Point Cloud Representation
  5.2 Pose Skeleton

6 Results
  6.1 Mask Images
    6.1.1 Set 1
    6.1.2 Set 2
    6.1.3 Set 3
    6.1.4 Set 4
  6.2 Pose Estimation
  6.3 Real Data
  6.4 Discussion

7 Conclusion
  7.1 Summary
  7.2 Future Work

A Supporting Software
  A.1 Blender
  A.2 MakeHuman
  A.3 OpenEXR
  A.4 Numpy
  A.5 HDF5
  A.6 Torch
  A.7 PCL

Bibliography

List of Tables

3.1 The body parts of interest and their color labels

6.1 Per class accuracy for set 1
6.2 Per class accuracy for set 2
6.3 Per class accuracy for set 3
6.4 Per class accuracy for set 4

List of Figures

2.1 Examples of pose skeletons detected by the Microsoft Kinect. These poses record 2D position in the frame, as well as the depth
2.2 The stages of the Kinect system. The left images show the depth data, the middle images show the segmentations from the RDF, and the right images show the 3D pose estimation
2.3 A simple Random Decision Tree for predicting Titanic survivors. The numbers at the leaves give the percentage for survival followed by the number of samples at that leaf
2.4 A single neuron in a neural network
2.5 A simple neural network showing the connections between neurons
2.6 An example of a convolutional neural network. Each pixel in later layers is taken from a window of pixels in the previous layers

3.1 The human model textured for labeling in a default pose as it appears in Blender
3.2 The human model textured for labeling in a default pose as it is rendered
3.3 The scene after the model has been posed and clutter objects have been added as seen in Blender
3.4 The scene after the model has been posed and clutter objects have been added as rendered for viewing
3.5 Example labeled images from Set 1
3.6 Example labeled images from Set 2
3.7 Example labeled images from Set 3
3.8 Example labeled images from Set 4

4.1 An outline of the fully convolutional network. Convolutional layers are shown as vertical lines with the number of feature planes they contain above them. Max pooling layers are shown as rectangles (all used kernels of 2x2)

5.1 A point cloud representation of a person and the corresponding pose vectors

6.1 Network predictions for Set 1 data
6.2 Good performance network predictions for Set 2 data
6.3 The hybrid network running on an image similar to those in Set 1
6.4 Reduced performance network predictions for Set 2 data
6.5 Different views of the 3D reconstruction of a person; large dots mark the point centers
6.6 The orientation of the body parts shown as vectors
6.7 Performance on real data. Top left: RGB image, not used in processing. Top right: depth image. Bottom: network predictions

Chapter 1

Introduction

Person detection and pose estimation are common needs in Human-Robot Interaction (HRI). Person detection is simply recognizing that there is a person nearby; the information gained may be no more than the location of a person. This is useful, but lacks many important clues about what a person is doing. Pose estimation goes a step past simple detection and gives a more complete description of the person, such as the location of the person's arms and legs. With this information, a robot can gain a more complete understanding of a person than it can just by knowing where the person is. For example, a waving gesture may indicate that the person is trying to gain the robot's attention, while crossed arms may indicate an unwillingness to interact. This project focuses mainly on pose estimation. For a robotic system, the ability to locate a person is essential for achieving a basic level of social interaction [12]. For example, a robotic waiter may need to recognize when a patron has approached it, or it may need to see when somebody is looking in its direction and waving. Tele-presence rehabilitation is another application [16]: a person can engage in rehabilitation exercises monitored by a pose tracker to ensure that they are doing the exercises correctly. Many methods exist for pose estimation [13]. One of the most prominent is used by Microsoft for the Kinect [18] [19]. This method uses depth images and a two-stage system for estimating poses. First, a mask image is produced from the depth image that labels each pixel as a body part of interest. Second, this mask image is used to calculate the center of each body part.

By using the depth data, that center can then be placed into the scene to get the 3D location of that body part. The method in [18] and other pose estimation methods such as [21] do not account for objects within the vicinity of the person. The environment needs to be structured so that any environment or non-person objects are easy to isolate from the person data. For example, in an entertainment scenario the person can be assumed to be at a certain known distance from the sensor, which eliminates objects not near this assumed location, and the floor around the person can be removed with well-known plane detection techniques. The method used in this project follows a structure very similar to the one used by [18] for the Xbox Kinect. Depth images are used since they have several advantages over color or black and white images for this application. They are unaffected by changes in clothing color and texture, which helps to remove unnecessary information from the image. Depth images also naturally lend themselves to creating three dimensional (3D) representations of the environment, which makes recovering a 3D pose simple. The first stage of the system uses deep learning techniques to produce a mask image. Recently, a method for using convolutional neural networks (CNNs) to segment images was proposed, called Fully Convolutional Networks (FCNs) [11]. This method is adapted in this project to take depth images as input and produce a mask image; this is the main novelty of this project. Machine learning systems, in general, are based on taking known examples and using them to configure a system to recognize similar events in the future. This typically requires a large amount of labeled data, which can be time consuming and difficult to acquire. In order to acquire the training data for this system with minimal cost and effort, synthetic data are used. The second stage fuses the mask image and the depth data to create a labeled 3D point cloud representation of the person. A point cloud is an arrangement of points in physical space. This point cloud is then refined into a representation of the body parts that gives a center location and an orientation vector.

The main contribution of this work is the creation of the deep learning based system for creating mask images. This system is meant to take advantage of graphics processing unit (GPU) acceleration to improve training times and to run on consumer-grade hardware. It is focused on maintaining robustness when objects that cannot be easily removed from the scene are in the vicinity of the person. A possible method for pose estimation is proposed, but ultimately the implementation of this stage can be altered to fit the needs of whatever system is using it. After training, networks are shown to handle data where a person is facing the sensor and where clutter objects are present. Networks trained on data where the person can have any rotation do not produce satisfactory results. The following chapters explore these key concepts in more detail. Chapter 2 presents background on pose estimation and deep learning, along with related work in image segmentation and pose estimation. The system for creating the data sets is discussed in Chapter 3. The network structure and training methods for creating the first stage of the system are discussed in Chapter 4. A proposed method for pose estimation that gives the center and orientation of each body part is discussed in Chapter 5. The results of the network body segmentations are discussed in detail in Chapter 6, along with observations on the pose estimation and performance on real data. This project utilizes many different freely available programs and libraries, each of which is presented in Appendix A.

Chapter 2

Background

This chapter focuses on the ideas and concepts essential for implementing this project. The main focus is on current solutions for pose detection and information about deep learning. Also included is prior work similar to this project.

2.1 Pose Estimation

Pose estimation is a field that focuses on recovering the articulation of a body composed of rigid parts connected by joints [14][13]. Human pose estimation means that the human body is modeled as a series of rigid parts and joints. The full model is referred to as a "pose skeleton". The Microsoft Xbox 360 Kinect produces a pose skeleton that is utilized for video game and entertainment purposes [18][19]. This skeleton allows players to interact with video games through body movement. Similarly, a robot can use a pose skeleton to perceive a person; indeed, the Kinect has been integrated into many robotic platforms. The pose skeleton gives a richer description of a person than a location does. For example, a pose skeleton can be used to recognize a person waving or to locate their hands in order to pass an object to them. Examples of the pose skeletons given by the Microsoft Kinect are shown in Figure 2.1. The Kinect is a consumer-grade RGB-D sensor, which means that it captures a color image and a depth image where each pixel is the distance to the nearest object. It uses only the depth image to produce the pose skeleton.

Figure 2.1: Examples of pose skeletons detected by the Microsoft Kinect. These poses record 2D position in the frame, as well as the depth

Unlike color images, depth images are texture invariant. For example, two images of the same person in the same spot, one where the person is wearing a solid red shirt and the other a plaid shirt, will appear almost identical as depth images. This is desirable since the appearance of clothing does not affect the pose skeleton, yet it can cause confusion in the system. To get a pose skeleton, the RGB-D data are interpreted by Microsoft's software using a two-stage system. First, the depth image is used as input to a Random Decision Forest (RDF) classifier, which produces a mask image where each pixel is labeled as a body part of interest. The second stage takes the mask image and marks a pixel at the center of each body part. By using the depth image, that pixel location can be placed into the 3D environment. Figure 2.2 shows the stages of the system. The first stage of the Kinect system requires an RDF that is capable of predicting the body parts. An RDF is an ensemble of Random Decision Trees (RDTs). An RDT functions by making a series of decisions based on the input data; an example is shown in Figure 2.3.¹ Each node of the tree either has two children or is a leaf node. For nodes with children, the input data are sent to one of the child nodes based on a binary decision. The decision compares the data against some set values; the method for setting these values is explained later. Once the data reach a leaf node, a prediction is made for the label of the data; again, how this prediction is set will be explained later. An RDT is trained by setting the values at the comparison nodes and leaf nodes. To do this, example data with known labels are used. The RDTs are grown by taking random features from the input data and finding the threshold that best separates the remaining data samples. This process is repeated until the tree has reached a set depth; the leaf nodes then contain a distribution of the labels of the samples that reached them.

¹ Image by: Stephen Milborrow. Used under license: http://creativecommons.org/licenses/by-sa/3.0/legalcode

Figure 2.2: The stages of the Kinect system. The left images show the depth data, the middle images show the segmentations from the RDF, and the right images show the 3D pose estimation

Figure 2.3: A simple Random Decision Tree for predicting Titanic survivors. The numbers at the leaves give the percentage for survival followed by the number of samples at that leaf.

The built-in Kinect software system uses 3 trees in the RDF. Each tree has a maximum depth of 20. The input features are a vector of length 2000, where each element is computed by Equation 2.1. In this equation, I is the image, x is the two-component vector specifying the pixel being focused on, u and v are two-component offset vectors, and d_I is the depth probe that gives the depth at the specified pixel. Essentially, this feature takes two points from around pixel x and computes the difference in depth between them. The offset vectors for each feature are chosen randomly in order to get a wide range of features. Each of these features is individually weak, but the RDF learns how to combine them to classify each pixel. The RDF is trained on synthetic and partially synthetic data; a system for generating synthetic data for this project is detailed in Chapter 3.

f_\theta(I, x) = d_I\!\left(x + \frac{u}{d_I(x)}\right) - d_I\!\left(x + \frac{v}{d_I(x)}\right) \qquad (2.1)
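As a concrete illustration of Equation 2.1, the following numpy sketch computes one such depth-comparison feature for a single pixel. The function name, the offsets, and the synthetic depth image are hypothetical and chosen only for the example; they are not taken from the Kinect implementation.

```python
import numpy as np

def depth_feature(depth, x, u, v):
    """Illustrative version of Equation 2.1: probe the depth at two offsets
    around pixel x, scaling the offsets by the depth at x so the feature is
    roughly invariant to the person's distance from the sensor."""
    rows, cols = depth.shape

    def probe(p):
        # Clamp the probe location to the image bounds before reading it.
        r = int(np.clip(round(p[0]), 0, rows - 1))
        c = int(np.clip(round(p[1]), 0, cols - 1))
        return depth[r, c]

    d_x = probe(x)
    return (probe(np.asarray(x) + np.asarray(u) / d_x) -
            probe(np.asarray(x) + np.asarray(v) / d_x))

# Tiny synthetic example: a flat background at 10 m with a "person" at 2 m.
depth = np.full((480, 640), 10.0)
depth[100:400, 250:390] = 2.0
f = depth_feature(depth, x=(250, 320), u=(0.0, 300.0), v=(300.0, 0.0))
```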

RDFs are a powerful tool for machine learning, but they have some drawbacks. Features need to be extracted by some other method before the data can be used by an RDF, which means that weaknesses in the feature extraction stage will produce poor performance. They are known to over-fit to small changes in data, so sensor noise can become a problem if the tree was not trained on data with a similar noise pattern. Once trained, they cannot be easily modified: the only way to update them is to add more trees to the forest, and this may not be effective since the trees in the old forest may overwhelm the predictions from the new trees. While they can be parallelized to some degree, they do not see significant speedup from GPU training or execution. In contrast, deep learning can learn its own feature extraction, is more often robust to small changes in the data, can be updated without growing the classifier, and benefits from significant GPU speed-up.

Figure 2.4: A single neuron in a neural network

2.2 Deep Learning

Deep learning is a machine learning technique that focuses on extracting features from the data as well as classifying the input samples. Deep learning systems operate on the data with as few preprocessing steps as possible in order to extract features that are not obvious to close human inspection. The basic building blocks of a deep learning system are neural networks. A neural network is composed of "neurons"; each neuron in the network has many inputs and one output [3]. The input values each have weights that determine how much they impact the output of the neuron. Figure 2.4 shows a visual representation of a neuron, and Equation 2.2 shows how to compute the output of a neuron in the first layer of a network.

s_j^{(1)} = \sum_i x_i \, w_{i \to j}^{(in \to 1)} \qquad (2.2)

The input for a neuron can be data or the output of a previous layer of neurons. Each neuron can also have a bias that does not depend on previous layers. The output value of the neuron is then set by an activation function; common activations are the rectified linear unit (ReLU), sigmoid, and tanh. Equation 2.3 gives the softplus activation, a smooth approximation of the ReLU.

Figure 2.5: A simple neural network showing the connections between neurons

\sigma(x) = \ln(1 + e^x) \qquad (2.3)

Neural networks are then constructed by layering neurons. Each layer contains a certain number of neurons, set by the programmer. Each neuron in a layer receives as input the values of every neuron in the previous layer. Nonlinear activations are commonly placed between layers; this allows many complex combinations of features to be captured. The first and last layers of the network are special: the first layer takes data as input, and the last layer produces labels, or something close to them, as output. The other layers in the network are referred to as "hidden." An example of a neural network is shown in Figure 2.5, and Equation 2.4 shows how an entire layer of neurons can be calculated with a matrix multiplication.

S^{(1)} = X W^{(in \to 1)} \qquad (2.4)
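As a small illustration of Equations 2.2 and 2.4, the numpy sketch below computes one layer of a network as a matrix multiplication followed by a ReLU activation. The layer sizes, random inputs, and zero biases are arbitrary choices for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 input samples with 6 features each (X in Equation 2.4).
X = rng.normal(size=(4, 6))

# Weights from the inputs to a first layer of 3 neurons (W in Equation 2.4),
# plus one bias per neuron.
W = rng.normal(size=(6, 3))
b = np.zeros(3)

S = X @ W + b              # weighted sum at every neuron in the layer
A = np.maximum(S, 0.0)     # ReLU activation applied elementwise
```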

To optimize the weights at each neuron, a process called back propagation is used [5]. The essence of the back propagation algorithm is to update the weights throughout the network so that they produce better results on the training data. First, a forward propagation is run through the network on a training sample. Using this output and the known label, the error C is computed with a cost function specified by the programmer. Next, the error is propagated back through the network. The error at the last layer is given by Equation 2.5, and the errors at the earlier layers are given by Equation 2.6, where w^l refers to the weights at layer l, δ^l is the error at layer l, z^l is the weighted input to layer l, and σ′ is the derivative of the activation used at that layer. Together these equations give the error at each neuron in the network.

\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad (2.5)

\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \qquad (2.6)

Next, the weights are updated so that the error is reduced. Equation 2.7 gives the gradient of the cost with respect to any bias in the network, and Equation 2.8 gives the gradient with respect to the weights, which depends on the activations a^{l-1} coming from the previous layer. Each weight and bias is then adjusted in the direction opposite its gradient, scaled by a learning rate chosen by the programmer. This process is repeated for each training sample and can be done on batches by using stochastic gradient descent.

\frac{\partial C}{\partial b_j^l} = \delta_j^l \qquad (2.7)

\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \qquad (2.8)
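The back propagation equations above can be written almost directly in numpy. The sketch below does so for a two-layer network with sigmoid activations and a quadratic cost; the shapes, random values, and learning rate are illustrative only and are not the configuration used in this project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 1))       # one training sample with 5 features
y = np.array([[1.0], [0.0]])      # its known two-class label

# Two layers: 5 -> 4 -> 2, with weights W[l] and biases b[l].
W = [rng.normal(size=(4, 5)), rng.normal(size=(2, 4))]
b = [np.zeros((4, 1)), np.zeros((2, 1))]

# Forward pass, keeping the weighted inputs z and activations a.
zs, activations = [], [x]
for Wl, bl in zip(W, b):
    zs.append(Wl @ activations[-1] + bl)
    activations.append(sigmoid(zs[-1]))

# Equation 2.5: error at the last layer (quadratic cost, so grad_a C = a - y).
deltas = [(activations[-1] - y) * sigmoid_prime(zs[-1])]

# Equation 2.6: propagate the error back to the earlier layer.
deltas.insert(0, (W[1].T @ deltas[0]) * sigmoid_prime(zs[0]))

# Equations 2.7 and 2.8 give the gradients; apply a small gradient step.
learning_rate = 0.1
for l in range(2):
    W[l] -= learning_rate * (deltas[l] @ activations[l].T)
    b[l] -= learning_rate * deltas[l]
```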

A convolutional neural network (CNN) is a neural network with a special layer structure that is frequently used in image processing; CNNs are biologically inspired and mimic how animals process sight [6]. In a convolutional layer, the same neurons are repeatedly applied across the image, each seeing only a set window of the input data instead of all of the outputs from the previous layer [1]. An example of a convolutional neural network is shown in Figure 2.6; note how pixels in later layers are based on a window of surrounding pixels in the previous layer.

Equation 2.9 shows how to compute the activation x^l_{ij} at position (i, j) of layer l with an m × m window and a stride of s. Libraries for computing convolutions often provide versions of this computation that are optimized for efficiency.

x_{ij}^l = \text{bias} + \sum_{a=0}^{m} \sum_{b=0}^{m} w_{ab} \, y_{(i + a \cdot s)(j + b \cdot s)} \qquad (2.9)

The other basic type of layer in a CNN is a max pooling layer. This layer downsamples an image with a kernel window of n × m; in each window the maximum value is selected and becomes a single pixel in the next layer. A common kernel is 2 × 2, which halves each dimension of the output. Convolutional layers have a few advantages over fully connected layers. First, they are translation invariant, so it does not matter where in an image an object of interest appears. Second, they are much smaller in size, since the same neurons are applied over the entire image instead of needing connections for a whole densely connected layer. The lack of global information at each layer is somewhat of a disadvantage, but it is mitigated by the fact that, in images, pixels far away from the pixel being focused on rarely contain information relevant to it. CNNs have been applied to many problems, with especially good results in classification. For example, a CNN in [9] was trained on a data set of 1.3 million images belonging to 1000 different classes; it achieved a top-5 error rate of 18.9%, which beat the state of the art. CNNs have also been used for facial recognition [10], speech recognition [2], and modeling sentences [7]. Recent work has shown that CNNs can be used to create per-pixel semantic labels [11].
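A direct, unoptimized implementation of Equation 2.9 and of 2 × 2 max pooling is sketched below in numpy; real libraries compute the same results far more efficiently, and the random image and kernel here are placeholders.

```python
import numpy as np

def convolve(y, w, bias=0.0, stride=1):
    """Naive single-channel convolution in the spirit of Equation 2.9."""
    m = w.shape[0]
    out_h = (y.shape[0] - m) // stride + 1
    out_w = (y.shape[1] - m) // stride + 1
    x = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = y[i * stride:i * stride + m, j * stride:j * stride + m]
            x[i, j] = bias + np.sum(w * window)
    return x

def max_pool(y, k=2):
    """k-by-k max pooling: keep the largest value in each window."""
    out_h, out_w = y.shape[0] // k, y.shape[1] // k
    return y[:out_h * k, :out_w * k].reshape(out_h, k, out_w, k).max(axis=(1, 3))

image = np.random.rand(8, 8)           # stand-in for one depth-image plane
kernel = np.random.rand(3, 3)          # one learned 3x3 filter
feature_map = convolve(image, kernel)  # 6x6 output
pooled = max_pool(feature_map)         # 3x3 output
```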

Figure 2.6: An example of a convolutional neural network. Each pixel in later layers is taken from a window of pixels in the previous layers

This provides the basis for using CNNs to create per-pixel labels identifying body parts. Long, Shelhamer, and Darrell call their network structure a Fully Convolutional Network (FCN). In an FCN, each layer is a convolution or a max pooling. This means that the network only uses the efficient convolutional layers and does not need the expensive dense layers. The strength of CNNs lies in their ability to learn feature extractions that cannot be predicted and programmed effectively, unlike RDFs, which require feature extraction to be designed by a programmer. This allows the feature extraction step to be optimized as part of the learning process.

2.3 Related Work

The method used in [18] was also applied in [8] to get the pose of a single hand. It uses the same staged system, with an RDF creating a per-pixel labeled mask image followed by finding the centroid of each segment. The training process uses entirely synthetic data, similar to this system; however, the authors note that hands have very little variation between people. This may indicate that more human models will be needed to account for the variability in body types. Methods that require minimal learning also exist, but are often complicated to construct, such as [15]. Instead of learning, this approach relies on prior knowledge about the human body and the relation of body parts to each other, and it explicitly handles issues like self occlusion. Mori et al. make use of edge detection and superpixels to find body segments.

The data set consists of baseball players and contains a wide variety of poses, but there is no mention of occlusions other than self occlusion. Overall, this system was effective on images from the data set it was given; however, extending and adapting it would be laborious, as each stage of the system would need to be tweaked by an expert. Neural networks have been used for pose detection from RGB images in [20]. The poses are given as the 2D pixel locations of the center of each body part. This system has two stages: first, a CNN refines the information from the image, and the output of this network becomes the input to a linear regression layer; the second stage minimizes the distance between the true pose and the pose output by the network. This approach requires data sets to be labeled manually, which is a time consuming and laborious process. RGB images are also highly variable, so capturing a diverse data set may be difficult. Another CNN based approach with goals similar to pose detection was proposed in [21]. This network finds correspondences between a known model and an image being classified. It can handle full-to-full, full-to-partial, and partial-to-partial depth maps of a person. The full-to-partial case is the most relevant to this project, since this project has the entire human model available for training and the input data will be partial views of a person. This system can classify each pixel of the human scan with a very verbose descriptor of where on the body the pixel lies. The method used in [18], and the method used in this project, use a comparatively coarse classification of body parts with sharper boundaries between them. The output of [21] could be made coarser by doing a nearest neighbor sampling of all of the pixels; that might result in boundaries between the body parts that are much smoother than in [18] and could cut down on outliers. Thus, that work could be extended to find poses. The main weakness of [21] is that occlusions are not handled. Every pixel that is input to the system comes from the person, and there is no mention of how non-person pixels are separated. It is likely that they are removed by assuming that there are no objects near the person and then removing the floor and walls with a simple non-learning process. Extending such a verbose method to handle non-person data in the depth map could be extremely difficult.

Chapter 3

Data Generation

This chapter introduces the system used for creating the synthetic training data. It goes into detail on how images are processed to be suitable for the neural network and on the differences between the training sets. Most computations described in this chapter are implemented using numpy (more information about numpy is given in A.4).

3.1 Creating Data in Blender

The starting point for creating the synthetic data is the free program Blender. This program allows users to create images by placing and manipulating objects within a scene (more information about Blender is given in A.1). In this case, a human model is posed and rotated, then occlusion and clutter objects are added to the scene. The image is then rendered and saved as an EXR file, which allows the depth and the color components of the scene to be saved together. More information about the EXR image format is given in A.3. The depth channel of the EXR image is the depth image of the scene, similar to what a depth sensor would produce. The RGB components of the EXR image show the color view of the scene, which has been set up to give the labels for each pixel.

Figure 3.1: The human model textured for labeling in a default pose as it appears in Blender

3.1.1 Human Model

The human model in the scene was created using another free program called MakeHuman. This program creates articulated human figures that can be imported into Blender (more information can be found in A.2). The underlying kinematic pose skeleton closely approximates every human joint. Figure 3.1 shows the human model as it appears within Blender, and Figure 3.2 shows how the model in Figure 3.1 is rendered for viewing. When posing the model, it is desirable to avoid invalid configurations; such samples will not occur in actual use, so having them in the data set provides less benefit than having a valid configuration. Invalid poses are ones in which part of the model intersects another part, such as an arm passing through the torso. To make sure a pose is valid, the locations of certain danger joints are checked. The points that can collide with other parts are the hands, elbows, feet, and knees. The locations of these points are compared to the other danger points as well as to the head, torso, abdomen, and hips.

Figure 3.2: The human model textured for labeling in a default pose as it is rendered

If any are within a set radius of each other, the pose is considered invalid and a new one is generated. The model can be given different skin textures; normally this is used to give different appearances to the model, but in this project the skin texture is used to label the entire model as the various body parts of interest. The body parts of interest and their color labels are shown in Table 3.1.
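The collision check described above can be sketched as follows. This is a hedged example: the joint names, coordinates, and radius are hypothetical, since the actual script reads the joint locations from the posed model inside Blender.

```python
import numpy as np

# Joints that can collide with the body, and the parts they are checked against.
DANGER_JOINTS = ["hand.L", "hand.R", "elbow.L", "elbow.R",
                 "foot.L", "foot.R", "knee.L", "knee.R"]
TARGET_JOINTS = DANGER_JOINTS + ["head", "torso", "abdomen", "hips"]

def pose_is_valid(joint_positions, radius=0.08):
    """Reject a pose if any danger joint comes within `radius` meters of
    another danger joint or of the head, torso, abdomen, or hips."""
    for a in DANGER_JOINTS:
        for b in TARGET_JOINTS:
            if a == b:
                continue
            dist = np.linalg.norm(np.asarray(joint_positions[a]) -
                                  np.asarray(joint_positions[b]))
            if dist < radius:
                return False
    return True
```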

3.1.2 Clutter and Occlusions

To simulate objects in the environment that may be in the vicinity of a human subject, clutter objects can be added to the scene. For example, if a person is close to a chair, then the chair is considered a clutter object. In the training set, randomly configured geometric objects are used to approximate clutter objects.¹ The shapes that can be added to the scene include cubes, cones, and spheres. Each object is placed in a random position around the person and given a random orientation. Each object also has parameters that change its appearance.

¹ While these geometric objects are not realistic, they create realistic occlusions.

Table 3.1: The body parts of interest and their color labels

Body Part            Color
Head Left            Bright Red
Head Right           Dark Red
Torso Left           Bright Blue
Torso Right          Dark Blue
Upper Arm Left       Bright Yellow
Upper Arm Right      Dark Yellow
Lower Arm Left       Bright Cyan
Lower Arm Right      Dark Cyan
Upper Leg Left       Bright Green
Upper Leg Right      Dark Green
Lower Leg Left       Bright Magenta
Lower Leg Right      Dark Magenta

Cubes have differing side lengths. Spheres have differing radii. Cones can vary in length and in the radius at each end. Figure 3.3 shows an example of a scene that has had clutter objects added to it, and Figure 3.4 shows how that scene is rendered for viewing.

Figure 3.3: The scene after the model has been posed and clutter objects have been added as seen in Blender 20

Figure 3.4: The scene after the model has been posed and clutter objects have been added as rendered for viewing
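Inside Blender, the clutter objects described in this subsection can be added from a Python script along the following lines. This is a hedged sketch: the keyword arguments accepted by the bpy.ops.mesh operators vary between Blender versions, and the placement ranges and scales here are arbitrary.

```python
import random
import bpy

ADD_PRIMITIVE = [
    bpy.ops.mesh.primitive_cube_add,
    bpy.ops.mesh.primitive_uv_sphere_add,
    bpy.ops.mesh.primitive_cone_add,
]

def add_clutter(count):
    """Scatter `count` random primitives around the origin, where the
    human model is assumed to stand."""
    for _ in range(count):
        add = random.choice(ADD_PRIMITIVE)
        add(location=(random.uniform(-1.5, 1.5),
                      random.uniform(-1.5, 1.5),
                      random.uniform(0.0, 2.0)),
            rotation=(random.uniform(0.0, 6.28),
                      random.uniform(0.0, 6.28),
                      random.uniform(0.0, 6.28)))
        # Vary the object's proportions through its scale.
        obj = bpy.context.active_object
        obj.scale = tuple(random.uniform(0.2, 0.8) for _ in range(3))

add_clutter(random.randint(3, 5))
```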

3.2 Post Processing

Once the data are generated in Blender, more steps need to be completed for the data to be usable for training. In general, the RGB data need to be converted into a single integer label for each pixel, and the depth data need to have a threshold applied. The label data in the image are in the three RGB channels, and each pixel needs to be mapped to a single integer value. During the rendering process, many of the RGB values will be slightly changed from the values set on the human model. To deal with this, each pixel is assigned the closest of the original RGB values; the RGB values can then be easily mapped to the corresponding label integers. When no object is in view for a pixel, the depth at that point is given as the highest value that can be stored. This causes overflow errors during training, so a threshold is applied to the data. In this case, 10.0 is used since none of the common RGB-D sensors can sense farther than 10.0 meters. Once an image has been fully processed, it is added to an HDF5 file that contains images from the data set. HDF5 allows for managing large amounts of data easily and can quickly access data stored on a hard drive; HDF5 is detailed in A.5. This is essential for this project since the data sets are much too large to fit into the computer's main memory; a data set containing 100,000 processed images is approximately 200 GB.
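A minimal sketch of this post-processing step, assuming the rendered EXR has already been read into numpy arrays, is shown below. The palette values, dataset names, and the use of h5py are illustrative only, not the project's exact code.

```python
import numpy as np
import h5py

# One RGB reference color per label; the row index doubles as the integer label.
PALETTE = np.array([[0, 0, 0],        # 0: non person
                    [255, 0, 0],      # 1: head left
                    [128, 0, 0]],     # 2: head right ... and so on
                   dtype=np.float32)

def rgb_to_labels(rgb):
    """Assign each pixel the label whose reference color is closest."""
    diff = rgb[:, :, None, :] - PALETTE[None, None, :, :]
    return np.argmin(np.sum(diff ** 2, axis=-1), axis=-1).astype(np.uint8)

def clamp_depth(depth, max_range=10.0):
    """Replace the 'no hit' depth values with the 10 m sensor limit."""
    return np.minimum(depth, max_range)

def append_sample(h5_path, depth, labels, index):
    """Append one processed depth/label pair to the growing HDF5 data set."""
    with h5py.File(h5_path, "a") as f:
        f.create_dataset("depth/%06d" % index, data=depth, compression="gzip")
        f.create_dataset("labels/%06d" % index, data=labels, compression="gzip")
```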

3.3 Data Sets

Each data set has different characteristics that determine how difficult it is to classify. In general, data sets with a larger number of possible configurations are considered more difficult. Every set allows the joints of the human to move into any valid configuration. Some sets restrict the rotation of the person relative to the sensor; for example, one set may only allow the person to face the sensor with their head near the top of the frame. Some sets have occlusions or clutter of various types, and the amount of clutter can also vary between sets.

Figure 3.5: Example labeled images from Set 1

Set 1 is the easiest and contains images that are very similar to the entertainment scenario that the Kinect functions on. The person does not have any global rotation and there are no clutter objects. Examples are shown in Figure 3.5.

Figure 3.6: Example labeled images from Set 2

Set 2 is a hybrid set that has a slight amount of global rotation, from -15 to 15 degrees in the y axis, and just a few (3 - 5) clutter objects. This means that the person is still roughly facing the sensor and there are a few objects, but not so many that they should overwhelm the image. Examples are shown in Figure 3.6.

Figure 3.7: Example labeled images from Set 3

Set 3 is a medium difficulty set and contains images that have global rotation and a simple occlusion that blocks out a random percentage of the image. This set is meant to train a network that can handle situations where the person is not facing the sensor. Examples are shown in Figure 3.7. Set 4 is a hard set that contains images that have global rotation and 10-30 random clutter objects. This set is harder than Set 2 because it contains many more clutter objects and allows any rotation of the person. It is meant to replicate the toughest conditions under which a sensor may be viewing a person. Examples are shown in Figure 3.8.

Figure 3.8: Example labeled images from Set 4

Chapter 4

Network Training

This chapter goes into detail about the first stage of the system, which produces mask images. It uses a variant of the Convolutional Neural Network (CNN) called a Fully Convolutional Network (FCN). A description of CNNs and other general neural network concepts is given in Section 2.2.

4.1 Network Structure

The network structure takes inspiration from the FCN in [11]. Similar to the network in that paper, this one only uses convolutional layers for learning. The other, non-learning layers are the max poolings, the rectified linear (ReLU) activations, and the log likelihood function applied at the end of the network. The initial layers of the network have a few convolutional layers followed by a max pooling layer. As the network progresses, the max pooling layers reduce the size of the image planes while the convolutional layers increase the number of feature planes; a visual description of the network is shown in Figure 4.1. This allows more global information to be acquired at each step. Finally, the images are scaled back up and the feature planes are merged in one step called a deconvolution. The final output of the network has the same width and height as the input; each pixel is a vector the size of the number of classes, 13 in this case. Each element of this vector gives the log likelihood that the corresponding class is the correct class, and the maximum element gives the predicted class.

Figure 4.1: An outline of the fully convolutional network. Convolutional layers are shown as vertical lines with the number of feature planes they contain above them. Max pooling layers are shown as rectangles (all used kernels of 2x2)

All of the weights of the network are initialized randomly; this is in contrast to [11], where weights were taken from an existing classification network. Using weights from another network was not an option for this project since there were no publicly available trained networks that operate on depth data. For optimization, stochastic gradient descent was used with a learning rate of 0.1, learning rate decay of 0.001, weight decay of 0.0001, and momentum of 0.5. The high initial learning rate allowed the network to quickly learn a basic representation of the data since no prior weights were used to initialize the network. The learning rate decay brought the learning rate down to a level more suitable for fine tuning on later images and epochs. Running the network with a consistently high learning rate causes it not to converge to a decent optimum, while running with a consistently low value greatly increases training time. The criterion optimized was the negative log likelihood. This criterion makes the last layer of the network produce the log likelihood for each class. It is effective since it removes the need for the network to directly predict the class, which can cause problems during training. To recover a single prediction, the class with the maximum log likelihood is taken. For efficient training, the GPU accelerated library Torch was used [4]. The network in [11] uses almost 12 GB of memory on the GPU, which means that only the most high end GPUs, such as the NVIDIA K40, can run it. These GPUs are not viable for most end users since they are very expensive and can only be installed in desktop computers. The network in this project uses about 3 GB, which means that it can run on more consumer friendly GPUs such as an NVIDIA GTX 780 or certain models of the NVIDIA GTX 960. Training the network on an NVIDIA GTX 780 takes about 1 day to run 3 epochs. In practice, training did not see much improvement after 3 epochs. Running a single forward propagation on an image takes 8.5 milliseconds on a GTX 960. Once a network is trained, it can take as input a depth image and produce a mask image, as discussed in Chapter 3, that can then be used by the second stage of the system to generate a pose.
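Recovering the mask image from the trained network's output is a per-pixel argmax over the class dimension. A minimal numpy sketch, assuming the output is stored with the 13 class planes first:

```python
import numpy as np

def output_to_mask(log_likelihoods):
    """Collapse a (13, height, width) array of per-class log likelihoods
    into a (height, width) mask of predicted class indices."""
    return np.argmax(log_likelihoods, axis=0).astype(np.uint8)

# Example with random scores standing in for the network output.
fake_output = np.random.rand(13, 480, 640)
mask = output_to_mask(fake_output)        # values 0-12, one per pixel
```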

Chapter 5

Pose Estimation

This chapter proposes a system for generating a 3D representation of the person. It uses the mask image from the first stage and the depth image from the sensor. The final step is to create a simplified version of the person called a pose skeleton. The Point Cloud Library (PCL) is used for point cloud manipulation. Information about PCL is given in A.7.

5.1 Point Cloud Representation

With a mask image predicting the body part for each pixel, the mask image is fused with the depth image in order to get a point cloud representation of the human. The goal is to create a highly detailed 3D representation of the person in the world space. Each body part is maintained as a separate cloud in order to work with individual pieces as needed. The mask image is given in terms of pixels, which are discrete coordinates ranging up to the maximum height and width of the image; in this project the dimensions of the images are 480 × 640. These must be transformed into points within the world space, which are given as continuous XYZ coordinates. Doing this requires the depth and the camera intrinsic parameters. The depth is taken directly from the depth image. The camera intrinsic parameters needed are the center point of the image and the focal lengths in the height and width directions of the camera. Equations 5.1 and 5.2 show how to get the X and Y world positions of pixel (h, w). The Z position in the world is

simply the depth.

X = \frac{(h - \text{center}_y) \cdot \text{depth}}{\text{focal}_y} \qquad (5.1)

Y = \frac{(w - \text{center}_x) \cdot \text{depth}}{\text{focal}_x} \qquad (5.2)

Once the point cloud is completed, points that are known or very likely to be bad are trimmed from it. In particular, all points that lie near the depth threshold are assumed to be bad, since they are most likely caused by points on the mask image extending beyond the actual person.
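Following Equations 5.1 and 5.2, the fusion of the mask and depth images can be sketched as below. The intrinsic parameter values shown are placeholders, since the real values come from the sensor's calibration.

```python
import numpy as np

def masked_point_cloud(depth, mask, label,
                       center=(239.5, 319.5), focal=(525.0, 525.0),
                       max_range=10.0):
    """Return the XYZ points of every pixel carrying `label`, using the
    pinhole projection of Equations 5.1 and 5.2."""
    center_y, center_x = center
    focal_y, focal_x = focal
    h, w = np.nonzero(mask == label)            # pixel rows and columns
    z = depth[h, w]
    keep = z < max_range - 1e-3                 # trim points at the depth limit
    h, w, z = h[keep], w[keep], z[keep]
    x = (h - center_y) * z / focal_y
    y = (w - center_x) * z / focal_x
    return np.stack([x, y, z], axis=1)          # one XYZ row per point
```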

5.2 Pose Skeleton

The point cloud gives a detailed representation of the person; to get a simpler representation, a pose is computed by reducing the point clouds to a few key descriptors. First, the center of each body part is found. This is simply the mean over every dimension. This gives a location in 3D space for the body part, but does not determine an orientation for the part. To get an orientation, Principal Component Analysis (PCA) [17] is used to find the direction along which the point cloud has the highest variability. Equation 5.3 shows how to calculate this direction given the data X, the XYZ locations of each point in the cloud, by maximizing over the vector w. Since the body parts defined in the labeling all have an approximately cylindrical shape, this vector gives the central axis of the body part. The results of the entire procedure are shown in Figure 5.1.

w_{(1)} = \arg\max_{\|w\|=1} \left\{ \|Xw\|^2 \right\} = \arg\max_{\|w\|=1} \left\{ w^T X^T X w \right\} \qquad (5.3)

This process is one of many that could be applied to the same point cloud data to get a pose representation. Another possible solution would be to fit a prior estimate of the shape of the body part to the data.

Figure 5.1: A point cloud representation of a person and the corresponding pose vectors

Ultimately, the choice depends on the application that the pose skeleton is intended for. This procedure was chosen due to its robustness to information lost to occlusions; as long as most of the body part is visible, the first vector given by PCA remains relatively unchanged.
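The centroid and the principal axis of Equation 5.3 can be computed for each body part cloud as in the sketch below; here the leading eigenvector of X^T X is taken from numpy's eigensolver rather than from PCL, purely for illustration.

```python
import numpy as np

def part_center_and_axis(points):
    """points is an (N, 3) array of XYZ positions for one body part.
    Returns the part's center and the unit vector of largest variance."""
    center = points.mean(axis=0)
    X = points - center                    # work with centered data
    _, eigvecs = np.linalg.eigh(X.T @ X)   # symmetric 3x3 eigenproblem
    axis = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    return center, axis / np.linalg.norm(axis)
```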

Chapter 6

Results

Once the network is trained, the output can be visualized and the per-pixel accuracy can be analyzed. To visualize the output, the mask images can be compared to the known labels. The 3D point clouds can also be visualized along with the center points of each body part. The results were scored using a pure accuracy measure. A separate network was trained from scratch on each data set. More details about each set are given in Section 3.3. The pose estimations cannot be scored empirically, since there is no ground truth to compare them to. However, since the performance of the pose estimator is heavily dependent on the quality of the mask images, the focus is on evaluating the first stage.

6.1 Mask Images

6.1.1 Set 1

Set 1 is the easiest set and most closely resembles an entertainment scenario. After 3 epochs, the network reached an overall accuracy of 97.25% on a test set of 15,000 images; subsequent epochs did not significantly improve accuracy. Accuracy per class is shown in Table 6.1. Examples of the network segmentations are shown in Figure 6.1. The example images show that confusion can occur when separate body parts are in close proximity; these errors may be due to the large amount of down-sampling done in the network and could be mitigated by adding skip layers.

Table 6.1: Per class accuracy for set 1

Class                Accuracy %
Non Person           99.15
Head Left            70.55
Torso Left           91.11
Upper Arm Left       72.89
Lower Arm Left       73.37
Upper Leg Left       80.52
Lower Leg Left       66.31
Head Right           67.19
Torso Right          81.04
Upper Arm Right      74.56
Lower Arm Right      63.32
Upper Leg Right      72.28
Lower Leg Right      57.18

(a) True segmentation (b) Network segmentation

Figure 6.1: Network predictions for Set 1 data

Table 6.2: Per class accuracy for set 2

Class                Accuracy %
Non Person           97.75
Head Left            77.26
Torso Left           91.04
Upper Arm Left       79.27
Lower Arm Left       77.45
Upper Leg Left       85.28
Lower Leg Left       80.22
Head Right           73.35
Torso Right          84.39
Upper Arm Right      77.54
Lower Arm Right      69.13
Upper Leg Right      78.42
Lower Leg Right      72.16

Table 6.3: Per class accuracy for set 3

Class                Accuracy %
Non Person           98.92
Head Left            13.22
Torso Left           60.51
Upper Arm Left       4.21
Lower Arm Left       3.19
Upper Leg Left       1.53
Lower Leg Left       32.94
Head Right           3.91
Torso Right          37.50
Upper Arm Right      3.38
Lower Arm Right      5.79
Upper Leg Right      1.32
Lower Leg Right      10.59

6.1.2 Set 2

Set 2 is the hybrid set that includes some clutter objects and very little global rotation. After 3 epochs, the network reaches results that are on par with the Set 1 network, with a global accuracy of 96.50%. The network is able to reliably predict each body part, as shown in Table 6.2. By observation, the network mostly removes cubes, planes, and spheres from the segmentation, as shown in Figure 6.2; however, it shows confusion when there is a cone nearby with dimensions similar to a body part, such as in Figure 6.4. This may indicate that the network has learned decent geometric representations of the body parts but has not learned higher level descriptions of the parts, such as the idea that there cannot be more than one "right arm."

6.1.3 Set 3

Set 3 contains global rotations and simple occlusions. Multiple configurations of the network settings were tried, but results remained unsatisfactory. The results in Table 6.3 show that the network is unable to reliably classify most body parts. This may be due to the increased configuration space compared to Set 1. Since the size of the network has been reduced relative to the network in [11], it may lack the capacity to handle the global rotations.

6.1.4 Set 4

Set 4 contains global rotations as well as numerous clutter objects.

(a) True segmentation (b) Network segmentation

Figure 6.2: Good performance network predictions for Set 2 data

Figure 6.3: The hybrid network running on an image similar to those in Set 1

(a) True segmentation (b) Network segmentation

Figure 6.4: Reduced performance network predictions for Set 2 data

Table 6.4: Per class accuracy for set 4

Class                Accuracy %
Non Person           100.00
Head Left            0.00
Torso Left           0.00
Upper Arm Left       0.00
Lower Arm Left       0.00
Upper Leg Left       0.00
Lower Leg Left       0.00
Head Right           0.00
Torso Right          0.00
Upper Arm Right      0.00
Lower Arm Right      0.00
Upper Leg Right      0.00
Lower Leg Right      0.00

Similar to the results for Set 3, the network is unable to produce satisfactory results. This is not surprising, given that the network was unable to function accurately on Set 3, which is an easier set. The results in Table 6.4 show that the network exclusively predicts the non-person class; since this class is the most common in the set, it may be that the network cannot escape this local optimum.

6.2 Pose Estimation

The results of the pose estimation system proposed in Section 5.2 are presented for observation here. The 3D reconstruction of the person is shown in Figure 6.5. Most parts have their centers in sensible positions; however, the proposed center for the upper arm, shown in bright yellow, is clearly not correct.


Figure 6.5: Different views of the 3D reconstruction of a person. Large dots mark the point centers

This is due to the incorrect classification of a large portion of the lower arm as upper arm. With this method of pose estimation, the results are highly dependent on the performance of the previous stage. The pose vectors in Figure 6.6 show the proposed orientations of the body parts. Again, the bright yellow arm vector is not reasonable due to the significant incorrect classifications. However, the proposed orientations for the well segmented parts, such as the legs and the other arm, are reasonable.

Figure 6.6: The orientation of the body parts shown as vectors

Figure 6.7: Performance on real data. Top left: RGB image, not used in processing. Top right: depth image. Bottom: network predictions

6.3 Real Data

Images of an actual subject were taken with the Microsoft Kinect to observe the general performance on real data. The image was edited to remove known non-person pixels, such as the floor and walls, to make it close to the images in Set 1. The results are shown in Figure 6.7. Ground truth for the real data has not been established, but the predictions appear to be sensible.

6.4 Discussion

The networks trained on Sets 1 and 2 produce good results, while the networks trained on Sets 3 and 4 do not. Sets 1 and 2 both keep the person roughly facing the sensor, while Sets 3 and 4 allow the person to rotate freely. Interestingly, the network trained on Set 2 is able to handle occlusions and clutter objects, so it may be that global rotations add too much variation to the data set for a network of this size to handle. Networks may occasionally "favor" one side of a left-right pair of body parts, as seen with the torso parts in Table 6.1. This is likely due to the border between the parts being classified as one part; the choice may be arbitrary and due to the ambiguity of the border. In addition to being smaller than the network in [11], this network must segment parts that do not have sharply defined borders. For example, the border between the lower and upper arm may not be obvious when the arm is outstretched. With the RGB images used in [11], most classes have distinct color contrasts even when they are adjacent.

Chapter 7

Conclusion

7.1 Summary

This thesis presented a system for creating human body segmentations using a deep learning based Fully Convolutional Network. The network was trained to be robust to some occlusion and clutter in the scene. Of the networks that were trained, those focused on identifying a person facing the sensor were able to achieve satisfactory results. The ones trained on data sets that allowed for large global rotations were not able to perform well. A method for pose estimation was proposed that is aimed at being resistant to losses due to occlusions. This method gives a center point of each body part in 3D space and a vector describing its orientation. This stage is purposefully meant to be switched out with any other method to fit the needs of the implementation.

7.2 Future Work

Since the pipeline has been designed to retrain networks on different data sets, more networks can be trained on sets reflecting different scenarios. These can be variations in the clutter objects or variations in the human model; for example, a data set could include child models to create a system that detects adults and children in the same setting. Other pose estimation methods can also be explored. To evaluate the effectiveness of these methods, a way to compare them to a ground truth pose would be needed.

Other possible methods could include fitting known shapes of body parts to the point cloud or segmenting the body into more parts in order to connect the centers to find an orientation. It may be possible to use both the RGB data and the depth data as input for the network. This may help with some of the problems encountered with the depth data, but it would require a way to create data sets with realistic RGB data. The system needs to be improved to handle large global rotations. This may mean training a larger network and waiting for consumer grade GPUs to catch up in specifications. It may also be possible to train an ensemble of segmentation networks, each for a different view, such as one each for the front, back, and sides, in order to achieve a better global model of the person. To handle the erroneous pose estimates for body parts that have been poorly segmented, a more robust method may be needed.

Appendix A

Supporting Software

This work utilizes many different pieces of freely available supporting software. This appendix describes each one in detail and contains a link to a website with more information about the software.

A.1 Blender

www.blender.org

Blender is an image rendering program. The possibilities for its use are nearly limitless, but in this project it is used to create images from a human model and simple objects placed within a scene. More information about how it is used is provided in Chapter 3. Blender can be used interactively by a user to set up and edit detailed scenes; this feature was used in the prototyping stages of the project. Blender can also be run automatically with Python scripts. This automated process is used to create a large number of images without needing supervision, which is essential later in the project.

A.2 MakeHuman

www.makehuman.org

MakeHuman is a program that creates detailed and posable human models that can be used in Blender. This model is placed into the scene in Blender and is commonly referred to in this project as the "human model". It is purely synthetic and not derived from an actual person.

A.3 OpenEXR

www.openexr.com

EXR is a file format that is typically used for highly detailed images. In this case, its most important feature is the 'Z' channel that stores the depth of each pixel. Blender can export this format for rendered images, which allows depth images to be easily generated by Blender.

A.4 Numpy

www.numpy.org

Numpy is a module for Python that allows for powerful scientific computing. It efficiently handles large matrices, which is essential for processing the images.

A.5 HDF5

www.hdfgroup.org/HDF5

HDF5 is a file format and library that allows for large amounts of data to be stored and accessed quickly and efficiently. This is essential for managing data sets in this project that can take up 200 GB. The efficient access speeds up almost all aspects of the project since less time is being spent on I/O operations.

A.6 Torch

torch.ch

Torch is a library for machine learning that leverages GPUs to speed up computations. It has support for the latest deep learning structures. In this project, it is used to train and test a convolutional neural network.

A.7 PCL

pointclouds.org

PCL stands for Point Cloud Library. Point clouds are large arrangements of points in 2D or 3D space, and PCL provides many methods for analyzing them. In this project, PCL is used to manage a point cloud representation of a human in 3D space.

Bibliography

[1] Convolutional Neural Networks (LeNet). DeepLearning 0.1 documentation.
[2] Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. In INTERSPEECH, pages 3366–3370, 2013.
[3] Ross Berteig. Neural network technology, 1996.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. Neural Information Processing Systems, 2011.
[5] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[6] David H. Hubel and Torsten N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.
[7] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[8] C. Keskin, F. Kirac, Y. E. Kara, and L. Akarun. Real time hand pose estimation using depth sensors. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1228–1234, Nov 2011.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] Steve Lawrence, C. Lee Giles, Ah Chung Tsoi, and Andrew D. Back. Face recognition: A convolutional neural-network approach. Neural Networks, IEEE Transactions on, 8(1):98–113, 1997.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
[12] V. Micelli, K. Strabala, and S. Srinivasa. Perception and control challenges for effective human-robot handoffs. RSS 2011 RGB-D Workshop, 2011.
[13] T. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 2006.

[14] Thomas B. Moeslund and Erik Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
[15] Greg Mori, Xiaofeng Ren, Alexei A. Efros, and Jitendra Malik. Recovering human body configurations: Combining segmentation and recognition. In Computer Vision and Pattern Recognition, volume 2, pages II–326. IEEE, 2004.
[16] S. Obdržálek, Gregorij Kurillo, Jay Han, Ted Abresch, Ruzena Bajcsy, et al. Real-time human pose detection and tracking for tele-rehabilitation in virtual reality. Studies in Health Technology and Informatics, 173:320–324, 2012.
[17] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[18] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1304, 2011.
[19] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2013.
[20] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[21] Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, and Hao Li. Dense human body correspondences using convolutional networks. arXiv preprint arXiv:1511.05904, 2015.