University of Nevada, Reno

Deep Learning Based Robust Human Body Segmentation for Pose Estimation from RGB-D Sensors

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

by

David Frank

Dr. David Feil-Seifer, Thesis Advisor

May, 2016

THE GRADUATE SCHOOL

We recommend that the thesis prepared under our supervision by

DAVID FRANK

Entitled

Deep Learning Based Robust Human Body Segmentation For Pose Estimation From RGB-D Sensors

be accepted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Dr. David Feil-Seifer, Advisor

Dr. Monica Nicolescu, Committee Member

Dr. Jacqueline Snow, Graduate School Representative

David W. Zeh, Ph.D., Dean, Graduate School

May, 2016


Abstract

This project focuses on creating a system for human body segmentation meant to be used for pose estimation. Recognizing a human figure in a cluttered environment is a challenging problem. Current systems for pose estimation assume that there are no objects around the person, which restricts their use in real-world scenarios. This project is based on recent advances in deep learning, a field of machine learning suited to difficult vision problems. The project provides a complete pipeline for training and using a system to estimate the pose of a human. It contains a data generation module that creates the training data for the deep learning module. The deep learning module is the main contribution of this work and provides a robust method for segmenting the body parts of a human. Finally, the project includes a pose estimation module that reduces the detailed output of the deep learning module to a pose skeleton.

Acknowledgments

This material is based in part upon work supported by: NASA Space Grant: NNX10AN23H, the Nevada Governor’s Office of Economic Development (NV-GOED: OSP-1400872), and Flirtey Technology Pty Ltd., and by Cubix Corporation through use of their PCIe slot expansion hardware solutions and HostEngine. Software used in the implementation of this project includes Blender, MakeHuman, OpenEXR, HDF5, the Point Cloud Library, and Torch7. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NASA, NV-GOED, Cubix Corporation, Blender Foundation, Open Perception, The HDF Group, MakeHuman Team, Flirtey Technology Pty Ltd., Industrial Light & Magic, Deepmind Technologies, NYU, NEC Laboratories America, or IDIAP Research Institute.

Advisor:

Dr. David Feil-Seifer

Committee members:

Dr. Monica Nicolescu
Dr. Jacqueline Snow

Essential Guidance:

Dr. Richard Kelley

Computational Horsepower Courtesy of:

Dr. Frederick C. Harris, Jr.

Network and Computer Wizard:

Zachary Newell

Provider of Template and Elusive Information About Graduating:

Jessica Smith

Contents

Abstract

Acknowledgments

List of Tables

List of Figures

1 Introduction

2 Background
  2.1 Pose Estimation
  2.2 Deep Learning
  2.3 Related Work

3 Data Generation
  3.1 Creating Data in Blender
    3.1.1 Human Model
    3.1.2 Clutter and Occlusions
  3.2 Post Processing
  3.3 Data Sets

4 Network Training
  4.1 Network Structure

5 Pose Estimation
  5.1 Point Cloud Representation
  5.2 Pose Skeleton

6 Results
  6.1 Mask Images
    6.1.1 Set 1
    6.1.2 Set 2
    6.1.3 Set 3
    6.1.4 Set 4
  6.2 Pose Estimation
  6.3 Real Data
  6.4 Discussion

7 Conclusion
  7.1 Summary
  7.2 Future Work

A Supporting Software
  A.1 Blender
  A.2 MakeHuman
  A.3 OpenEXR
  A.4 Numpy
  A.5 HDF5
  A.6 Torch
  A.7 PCL

Bibliography

List of Tables

3.1 The body parts of interest and their color labels

6.1 Per class accuracy for set 1
6.2 Per class accuracy for set 2
6.3 Per class accuracy for set 3
6.4 Per class accuracy for set 4

List of Figures

2.1 Examples of pose skeletons detected by the Microsoft Kinect. These poses record 2D position in the frame, as well as the depth
2.2 The stages of the Kinect system. The left images show the depth data, the middle images show the segmentations from the RDF, and the right images show the 3D pose estimation
2.3 A simple Random Decision Tree for predicting Titanic survivors. The numbers at the leaves give the percentage for survival followed by the number of samples at that leaf
2.4 A single neuron in a neural network
2.5 A simple neural network showing the connections between neurons
2.6 An example of a convolutional neural network. Each pixel in later layers is taken from a window of pixels in the previous layers

3.1 The human model textured for labeling in a default pose as it appears in Blender
3.2 The human model textured for labeling in a default pose as it is rendered
3.3 The scene after the model has been posed and clutter objects have been added as seen in Blender
3.4 The scene after the model has been posed and clutter objects have been added as rendered for viewing
3.5 Example labeled images from Set 1
3.6 Example labeled images from Set 2
3.7 Example labeled images from Set 3
3.8 Example labeled images from Set 4

4.1 An outline of the fully convolutional network. Convolutional layers are shown as vertical lines with the number of feature planes they contain above them. Max pooling layers are shown as rectangles (all used kernels of 2x2)

5.1 A point cloud representation of a person and the corresponding pose vectors

6.1 Network predictions for Set 1 data
6.2 Good performance network predictions for Set 2 data
6.3 The hybrid network running on an image similar to those in Set 1
6.4 Reduced performance network predictions for Set 2 data
6.5 Different views of the 3D reconstruction of a person; large dots mark the point centers
6.6 The orientation of the body parts shown as vectors
6.7 Performance on real data. Top left: RGB image, not used in processing. Top right: depth image. Bottom: network predictions

Chapter 1

Introduction

Person detection and pose estimation are common needs in Human-Robot Interaction (HRI). Person detection is simply recognizing that there is a person nearby; the information gained may be no more than the location of a person. This is useful, but lacks many important clues about what a person is doing. Pose estimation goes a step past simple detection and gives a more complete description of the person, such as the location of the person's arms and legs. With this information, a robot can gain a more complete understanding of a person than it can just by knowing where the person is. For example, a waving gesture may indicate that the person is trying to gain the robot's attention, while crossed arms may indicate an unwillingness to interact. This project focuses mainly on pose estimation. For a robotic system, the ability to locate a person is essential for achieving a basic level of social interaction [12]. For example, a robotic waiter may need to recognize when a patron has approached it, or it may need to see when somebody is looking in its direction and waving. Tele-presence rehabilitation is another application [16]: a person can engage in rehabilitation exercises monitored by a pose tracker to ensure that they are doing the exercises correctly. Many methods exist for pose estimation [13]. One of the most prominent is used by Microsoft for the Kinect [18] [19]. This method uses depth images and a two-stage system for estimating poses. First, a mask image is produced from the depth image that labels each pixel as a body part of interest. Second, this mask image is used to calculate the center of each body part.

By using the depth data, that center can then be placed into the scene to get the 3D location of that body part. The method in [18] and other pose estimation methods such as [21] do not account for objects within the vicinity of the person. The environment needs to be structured so that any environment or non-person objects are easy to isolate from the person data. For example, in an entertainment scenario the person can be assumed to be at a certain known distance from the sensor, which eliminates objects not near this assumed location, and the floor around the person can be removed with well-known plane detection techniques. The method used in this project follows a structure very similar to the one used by [18] for the Xbox Kinect. Depth images are used since they have several advantages over color or black and white images for this application. They are unaffected by changes in clothing color and texture, which helps to remove unnecessary information from the image. Depth images also naturally lend themselves to creating three dimensional (3D) representations of the environment, which makes recovering a 3D pose simple. The first stage of the system uses deep learning techniques to produce a mask image. Recently, a method for using convolutional neural networks (CNNs) to segment images was proposed, called Fully Convolutional Networks (FCNs) [11]. This method is adapted in this project to take depth images as input and produce a mask image; this is the main novelty of this project. Machine learning systems, in general, are based on taking known examples and using them to configure a system to recognize similar events in the future. This typically requires a large amount of labeled data, which can be time consuming and difficult to acquire. In order to acquire the training data for this system with minimal cost and effort, synthetic data are used. The second stage fuses the mask image and the depth data to create a labeled 3D point cloud representation of the person. A point cloud is an arrangement of points in physical space. This point cloud is then refined into a representation of the body parts that gives a center location and an orientation vector.

The main contribution of this work is the creation of the deep learning based system for creating mask images. This system is meant to take advantage of graphics processing unit (GPU) acceleration to improve training times and to run on consumer-grade hardware. It is focused on maintaining robustness when objects that cannot be easily removed from the scene are in the vicinity of the person. A possible method for pose estimation is proposed, but ultimately the implementation of this stage can be altered to fit the needs of whatever system is using it. After training, networks are shown to handle data where a person is facing the sensor and where clutter objects are present. Networks trained on data where the person can have any rotation do not produce satisfactory results. The following chapters explore these key concepts in more detail. Chapter 2 presents background on pose estimation and deep learning, along with related work in image segmentation and pose estimation. The system for creating the data sets is discussed in Chapter 3. The network structure and training methods for creating the first stage of the system are discussed in Chapter 4. A proposed method for pose estimation that gives the center and orientation of each body part is discussed in Chapter 5. The results of the network body segmentations are discussed in detail in Chapter 6, along with observations on the pose estimation and performance on real data. This project utilizes many different freely available programs and libraries, each of which is presented in Appendix A.

Chapter 2

Background

This chapter focuses on the ideas and concepts essential for implementing this project. The main focus is on current solutions for pose detection and information about deep learning. Also included is prior work similar to this project.

2.1 Pose Estimation

Pose estimation is a field that focuses on recovering the articulation of a body composed of rigid parts connected by joints [14][13]. Human pose estimation means that the human body is modeled as a series of rigid parts and joints. The full model is referred to as a "pose skeleton". The Microsoft Xbox 360 Kinect produces a pose skeleton that is utilized for video game and entertainment purposes [18][19]. This skeleton allows players to interact with video games through body movement. Similarly, a robot can use a pose skeleton to perceive a person; indeed, the Kinect has been integrated into many robotic platforms. The pose skeleton gives a richer description of a person than a location does. For example, a pose skeleton can be used to recognize a person waving or to locate their hands in order to pass an object to them. Examples of the pose skeletons given by the Microsoft Kinect are shown in Figure 2.1. The Kinect is a consumer-grade RGB-D sensor, which means that it captures a color image and a depth image where each pixel is the distance to the nearest object. It uses only the depth image to produce the pose skeleton.

Figure 2.1: Examples of pose skeletons detected by the Microsoft Kinect. These poses record 2D position in the frame, as well as the depth

Unlike color images, depth images are texture invariant. For example, two images of the same person in the same spot, one where the person is wearing a solid red shirt and the other a plaid shirt, will appear almost identical as depth images. This is desirable since the appearance of clothing does not affect the pose skeleton, yet it can cause confusion in the system. To get a pose skeleton, the RGB-D data are interpreted by Microsoft's software using a two-stage system. First, the depth image is used as input to a Random Decision Forest (RDF) classifier, which produces a mask image where each pixel is labeled as a body part of interest. The second stage takes the mask image and marks a pixel at the center of each body part. By using the depth image, that pixel location can be placed into the 3D environment. Figure 2.2 shows the stages of the system. The first stage of the Kinect system requires an RDF that is capable of predicting the body parts. An RDF is an ensemble of Random Decision Trees (RDTs). An RDT functions by making a series of decisions based on the input data; an example is shown in Figure 2.3.¹ Each node of the tree either has two children or is a leaf node. For nodes with children, the input data are sent to one of the child nodes based on a binary decision. The decision compares the data against some set values; the method for setting these values is explained later. Once the data reach a leaf node, a prediction is made for the label of the data; again, how this prediction is set will be explained later. An RDT is trained by setting the values at the comparison nodes and leaf nodes. To do this, example data with known labels are used. The RDTs are grown by taking random features from the input data and finding the threshold that best separates the remaining data samples. This process is repeated until the tree has reached a set depth; the leaf nodes then contain a distribution of the labels of the samples that reached them.

¹ Image by: Stephen Milborrow. Used under license: http://creativecommons.org/licenses/by-sa/3.0/legalcode

Figure 2.2: The stages of the Kinect system. The left images show the depth data, the middle images show the segmentations from the RDF, and the right images show the 3D pose estimation

Figure 2.3: A simple Random Decision Tree for predicting Titanic survivors. The numbers at the leaves give the percentage for survival followed by the number of samples at that leaf.

The built-in Kinect software system uses 3 trees in the RDF. Each tree has a maximum depth of 20. The input features are a vector of length 2000, where each element is computed by Equation 2.1. In this equation, I is the image, x is the two-component vector specifying the pixel being focused on, u and v are two-component offset vectors, and d_I is the depth probe that gives the depth at the specified pixel. Essentially, this feature takes two points from around pixel x and computes the difference in depth between them. The offset vectors for each feature are chosen randomly in order to get a wide range of features. Each of these features is individually weak, but the RDF learns how to combine them to classify each pixel. The RDF is trained on synthetic and partially synthetic data; a system for generating synthetic data for this project is detailed in Chapter 3.

f_\theta(I, x) = d_I\!\left(x + \frac{u}{d_I(x)}\right) - d_I\!\left(x + \frac{v}{d_I(x)}\right) \qquad (2.1)
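As a concrete illustration of Equation 2.1, the following numpy sketch computes one such depth-comparison feature for a single pixel. The function name, the offsets, and the synthetic depth image are hypothetical and chosen only for the example; they are not taken from the Kinect implementation.

```python
import numpy as np

def depth_feature(depth, x, u, v):
    """Illustrative version of Equation 2.1: probe the depth at two offsets
    around pixel x, scaling the offsets by the depth at x so the feature is
    roughly invariant to the person's distance from the sensor."""
    rows, cols = depth.shape

    def probe(p):
        # Clamp the probe location to the image bounds before reading it.
        r = int(np.clip(round(p[0]), 0, rows - 1))
        c = int(np.clip(round(p[1]), 0, cols - 1))
        return depth[r, c]

    d_x = probe(x)
    return (probe(np.asarray(x) + np.asarray(u) / d_x) -
            probe(np.asarray(x) + np.asarray(v) / d_x))

# Tiny synthetic example: a flat background at 10 m with a "person" at 2 m.
depth = np.full((480, 640), 10.0)
depth[100:400, 250:390] = 2.0
f = depth_feature(depth, x=(250, 320), u=(0.0, 300.0), v=(300.0, 0.0))
```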

RDFs are a powerful tool for machine learning, but they have some drawbacks. Features need to be extracted by some other method before the data can be used by an RDF, which means that weaknesses in the feature extraction stage will produce poor performance. They are known to over-fit to small changes in data, so sensor noise can become a problem if the tree was not trained on data with a similar noise pattern. Once trained, they cannot be easily modified: the only way to update them is to add more trees to the forest, and this may not be effective since the trees in the old forest may overwhelm the predictions from the new trees. While they can be parallelized to some degree, they do not see significant speedup from GPU training or execution. In contrast, deep learning can learn its own feature extraction, is more often robust to small changes in the data, can be updated without growing the classifier, and benefits from significant GPU speed-up.

Figure 2.4: A single neuron in a neural network

2.2 Deep Learning

Deep learning is a machine learning technique that focuses on extracting features from the data as well as classifying the input samples. Deep learning systems operate on the data with as few preprocessing steps as possible in order to extract features that are not obvious to close human inspection. The basic building blocks of a deep learning system are neural networks. A neural network is composed of "neurons"; each neuron in the network has many inputs and one output [3]. The input values each have weights that determine how much they impact the output of the neuron. Figure 2.4 shows a visual representation of a neuron, and Equation 2.2 shows how to compute the output of a neuron in the first layer of a network.

s_j^{(1)} = \sum_i x_i \, w_{i \to j}^{(in \to 1)} \qquad (2.2)

The input for a neuron can be data or the output of a previous layer of neurons. Each neuron can also have a bias that does not depend on previous layers. The output value of the neuron is then set by an activation function; common activations are the rectified linear unit (ReLU), sigmoid, and tanh. Equation 2.3 gives the softplus activation, a smooth approximation of the ReLU.

Figure 2.5: A simple neural network showing the connections between neurons

\sigma(x) = \ln(1 + e^x) \qquad (2.3)

Neural networks are then constructed by layering neurons. Each layer contains a certain number of neurons, set by the programmer. Each neuron in a layer receives as input the values of every neuron in the previous layer. Nonlinear activations are commonly placed between layers; this allows many complex combinations of features to be captured. The first and last layers of the network are special: the first layer takes data as input, and the last layer produces labels, or something close to them, as output. The other layers in the network are referred to as "hidden." An example of a neural network is shown in Figure 2.5, and Equation 2.4 shows how an entire layer of neurons can be calculated with a matrix multiplication.

S^{(1)} = X W^{(in \to 1)} \qquad (2.4)
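As a small illustration of Equations 2.2 and 2.4, the numpy sketch below computes one layer of a network as a matrix multiplication followed by a ReLU activation. The layer sizes, random inputs, and zero biases are arbitrary choices for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 input samples with 6 features each (X in Equation 2.4).
X = rng.normal(size=(4, 6))

# Weights from the inputs to a first layer of 3 neurons (W in Equation 2.4),
# plus one bias per neuron.
W = rng.normal(size=(6, 3))
b = np.zeros(3)

S = X @ W + b              # weighted sum at every neuron in the layer
A = np.maximum(S, 0.0)     # ReLU activation applied elementwise
```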

To optimize the weights at each neuron, a process called back propagation is used [5]. The essence of the back propagation algorithm is to update the weights throughout the network so that they produce better results on the training data. First, a forward propagation is run through the network on a training sample. Using this output and the known label, the error C is computed with a cost function specified by the programmer. Next, the error is propagated back through the network. The error at the last layer is given by Equation 2.5, and the errors at the earlier layers are given by Equation 2.6, where w^l refers to the weights at layer l, δ^l is the error at layer l, z^l is the weighted input to layer l, and σ′ is the derivative of the activation used at that layer. Together these equations give the error at each neuron in the network.

\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad (2.5)

\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \qquad (2.6)

Next, the weights are updated so that the error is reduced. Equation 2.7 gives the gradient of the cost with respect to any bias in the network, and Equation 2.8 gives the gradient with respect to the weights, which depends on the activations a^{l-1} coming from the previous layer. Each weight and bias is then adjusted in the direction opposite its gradient, scaled by a learning rate chosen by the programmer. This process is repeated for each training sample and can be done on batches by using stochastic gradient descent.

\frac{\partial C}{\partial b_j^l} = \delta_j^l \qquad (2.7)

\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \qquad (2.8)
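The back propagation equations above can be written almost directly in numpy. The sketch below does so for a two-layer network with sigmoid activations and a quadratic cost; the shapes, random values, and learning rate are illustrative only and are not the configuration used in this project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 1))       # one training sample with 5 features
y = np.array([[1.0], [0.0]])      # its known two-class label

# Two layers: 5 -> 4 -> 2, with weights W[l] and biases b[l].
W = [rng.normal(size=(4, 5)), rng.normal(size=(2, 4))]
b = [np.zeros((4, 1)), np.zeros((2, 1))]

# Forward pass, keeping the weighted inputs z and activations a.
zs, activations = [], [x]
for Wl, bl in zip(W, b):
    zs.append(Wl @ activations[-1] + bl)
    activations.append(sigmoid(zs[-1]))

# Equation 2.5: error at the last layer (quadratic cost, so grad_a C = a - y).
deltas = [(activations[-1] - y) * sigmoid_prime(zs[-1])]

# Equation 2.6: propagate the error back to the earlier layer.
deltas.insert(0, (W[1].T @ deltas[0]) * sigmoid_prime(zs[0]))

# Equations 2.7 and 2.8 give the gradients; apply a small gradient step.
learning_rate = 0.1
for l in range(2):
    W[l] -= learning_rate * (deltas[l] @ activations[l].T)
    b[l] -= learning_rate * deltas[l]
```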

A convolutional neural network (CNN) is a neural network with a special layer structure that is frequently used in image processing; CNNs are biologically inspired and mimic how animals process sight [6]. In a convolutional layer, the same neurons are repeatedly applied across the image, each seeing only a set window of the input data instead of all of the outputs from the previous layer [1]. An example of a convolutional neural network is shown in Figure 2.6; note how pixels in later layers are based on a window of surrounding pixels in the previous layer.

Equation 2.9 shows how to compute the activation x^l_{ij} at position (i, j) of layer l with an m × m window and a stride of s. Libraries for computing convolutions often provide versions of this computation that are optimized for efficiency.

x_{ij}^l = \text{bias} + \sum_{a=0}^{m} \sum_{b=0}^{m} w_{ab} \, y_{(i + a \cdot s)(j + b \cdot s)} \qquad (2.9)

The other basic type of layer in a CNN is a max pooling layer. This layer downsamples an image with a kernel window of n × m; in each window the maximum value is selected and becomes a single pixel in the next layer. A common kernel is 2 × 2, which halves each dimension of the output. Convolutional layers have a few advantages over fully connected layers. First, they are translation invariant, so it does not matter where in an image an object of interest appears. Second, they are much smaller in size, since the same neurons are applied over the entire image instead of needing connections for a whole densely connected layer. The lack of global information at each layer is somewhat of a disadvantage, but it is mitigated by the fact that, in images, pixels far away from the pixel being focused on rarely contain information relevant to it. CNNs have been applied to many problems, with especially good results in classification. For example, a CNN in [9] was trained on a data set of 1.3 million images belonging to 1000 different classes; it achieved a top-5 error rate of 18.9%, which beat the state of the art. CNNs have also been used for facial recognition [10], speech recognition [2], and modeling sentences [7]. Recent work has shown that CNNs can be used to create per-pixel semantic labels [11].
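A direct, unoptimized implementation of Equation 2.9 and of 2 × 2 max pooling is sketched below in numpy; real libraries compute the same results far more efficiently, and the random image and kernel here are placeholders.

```python
import numpy as np

def convolve(y, w, bias=0.0, stride=1):
    """Naive single-channel convolution in the spirit of Equation 2.9."""
    m = w.shape[0]
    out_h = (y.shape[0] - m) // stride + 1
    out_w = (y.shape[1] - m) // stride + 1
    x = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = y[i * stride:i * stride + m, j * stride:j * stride + m]
            x[i, j] = bias + np.sum(w * window)
    return x

def max_pool(y, k=2):
    """k-by-k max pooling: keep the largest value in each window."""
    out_h, out_w = y.shape[0] // k, y.shape[1] // k
    return y[:out_h * k, :out_w * k].reshape(out_h, k, out_w, k).max(axis=(1, 3))

image = np.random.rand(8, 8)           # stand-in for one depth-image plane
kernel = np.random.rand(3, 3)          # one learned 3x3 filter
feature_map = convolve(image, kernel)  # 6x6 output
pooled = max_pool(feature_map)         # 3x3 output
```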

Figure 2.6: An example of a convolutional neural network. Each pixel in later layers is taken from a window of pixels in the previous layers

This provides the basis for using CNNs to create per-pixel labels identifying body parts. Long, Shelhamer, and Darrell call their network structure a Fully Convolutional Network (FCN). In an FCN, each layer is a convolution or a max pooling. This means that the network only uses the efficient convolutional layers and does not need the expensive dense layers. The strength of CNNs lies in their ability to learn feature extractions that cannot be predicted and programmed effectively, unlike RDFs, which require feature extraction to be designed by a programmer. This allows the feature extraction step to be optimized as part of the learning process.

2.3 Related Work

The method used in [18] was also applied in [8] to get the pose of a single hand. It uses the same staged system, with an RDF creating a per-pixel labeled mask image followed by finding the centroid of each segment. The training process uses entirely synthetic data, similar to this system; however, the authors note that hands have very little variation between people. This may indicate that more human models will be needed to account for the variability in body types. Methods that require minimal learning also exist, but are often complicated to construct, such as [15]. Instead of learning, this approach relies on prior knowledge about the human body and the relation of body parts to each other, and it explicitly handles issues like self occlusion. Mori et al. make use of edge detection and superpixels to find body segments.

The data set consists of baseball players and contains a wide variety of poses, but there is no mention of occlusions other than self occlusion. Overall, this system was effective on images from the data set it was given; however, extending and adapting it would be laborious, as each stage of the system would need to be tweaked by an expert. Neural networks have been used for pose detection from RGB images in [20]. The poses are given as the 2D pixel locations of the center of each body part. This system has two stages: first, a CNN refines the information from the image, and the output of this network becomes the input to a linear regression layer; the second stage minimizes the distance between the true pose and the pose output by the network. This approach requires data sets to be labeled manually, which is a time consuming and laborious process. RGB images are also highly variable, so capturing a diverse data set may be difficult. Another CNN based approach with goals similar to pose detection was proposed in [21]. This network finds correspondences between a known model and an image being classified. It can handle full-to-full, full-to-partial, and partial-to-partial depth maps of a person. The full-to-partial case is the most relevant to this project, since this project has the entire human model available for training and the input data will be partial views of a person. This system can classify each pixel of the human scan with a very verbose descriptor of where on the body the pixel lies. The method used in [18], and the method used in this project, use a comparatively coarse classification of body parts with sharper boundaries between them. The output of [21] could be made coarser by doing a nearest neighbor sampling of all of the pixels; that might result in boundaries between the body parts that are much smoother than in [18] and could cut down on outliers. Thus, that work could be extended to find poses. The main weakness of [21] is that occlusions are not handled. Every pixel that is input to the system comes from the person, and there is no mention of how non-person pixels are separated. It is likely that they are removed by assuming that there are no objects near the person and then removing the floor and walls with a simple non-learning process. Extending such a verbose method to handle non-person data in the depth map could be extremely difficult.

Chapter 3

Data Generation

This chapter introduces the system used for creating the synthetic training data. It goes into detail on how images are processed to be suitable for the neural network and on the differences between the training sets. Most computations described in this chapter are implemented using numpy (more information about numpy is given in A.4).

3.1 Creating Data in Blender

The starting point for creating the synthetic data is the free program Blender. This program allows users to create images by placing and manipulating objects within a scene (more information about Blender is given in A.1). In this case, a human model is posed and rotated, then occlusion and clutter objects are added to the scene. The image is then rendered and saved as an EXR file, which allows the depth and the color components of the scene to be saved together. More information about the EXR image format is given in A.3. The depth channel of the EXR image is the depth image of the scene, similar to what a depth sensor would produce. The RGB components of the EXR image show the color view of the scene, which has been set up to give the labels for each pixel.

Figure 3.1: The human model textured for labeling in a default pose as it appears in Blender

3.1.1 Human Model

The human model in the scene was created using another free program called MakeHuman. This program creates articulated human figures that can be imported into Blender (more information can be found in A.2). The underlying kinematic pose skeleton closely approximates every human joint. Figure 3.1 shows the human model as it appears within Blender, and Figure 3.2 shows how the model in Figure 3.1 is rendered for viewing. When posing the model, it is desirable to avoid invalid configurations; such samples will not occur in actual use, so having them in the data set provides less benefit than having a valid configuration. Invalid poses are ones in which part of the model intersects another part, such as an arm passing through the torso. To make sure a pose is valid, the locations of certain danger joints are checked. The points that can collide with other parts are the hands, elbows, feet, and knees. The locations of these points are compared to the other danger points as well as to the head, torso, abdomen, and hips.

Figure 3.2: The human model textured for labeling in a default pose as it is rendered

If any are within a set radius of each other, the pose is considered invalid and a new one is generated. The model can be given different skin textures; normally this is used to give different appearances to the model, but in this project the skin texture is used to label the entire model as the various body parts of interest. The body parts of interest and their color labels are shown in Table 3.1.
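The collision check described above can be sketched as follows. This is a hedged example: the joint names, coordinates, and radius are hypothetical, since the actual script reads the joint locations from the posed model inside Blender.

```python
import numpy as np

# Joints that can collide with the body, and the parts they are checked against.
DANGER_JOINTS = ["hand.L", "hand.R", "elbow.L", "elbow.R",
                 "foot.L", "foot.R", "knee.L", "knee.R"]
TARGET_JOINTS = DANGER_JOINTS + ["head", "torso", "abdomen", "hips"]

def pose_is_valid(joint_positions, radius=0.08):
    """Reject a pose if any danger joint comes within `radius` meters of
    another danger joint or of the head, torso, abdomen, or hips."""
    for a in DANGER_JOINTS:
        for b in TARGET_JOINTS:
            if a == b:
                continue
            dist = np.linalg.norm(np.asarray(joint_positions[a]) -
                                  np.asarray(joint_positions[b]))
            if dist < radius:
                return False
    return True
```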

3.1.2 Clutter and Occlusions

To simulate objects in the environment that may be in the vicinity of a human subject, clutter objects can be added to the scene. For example, if a person is close to a chair, then the chair is considered a clutter object. In the training set, randomly configured geometric objects are used to approximate clutter objects.¹ The shapes that can be added to the scene include cubes, cones, and spheres. Each object is placed in a random position around the person and given a random orientation. Each object also has parameters that change its appearance.

¹ While these geometric objects are not realistic, they create realistic occlusions.

Table 3.1: The body parts of interest and their color labels

Body Part            Color
Head Left            Bright Red
Head Right           Dark Red
Torso Left           Bright Blue
Torso Right          Dark Blue
Upper Arm Left       Bright Yellow
Upper Arm Right      Dark Yellow
Lower Arm Left       Bright Cyan
Lower Arm Right      Dark Cyan
Upper Leg Left       Bright Green
Upper Leg Right      Dark Green
Lower Leg Left       Bright Magenta
Lower Leg Right      Dark Magenta

Cubes have differing side lengths. Spheres have differing radii. Cones can vary in length and in the radius at each end. Figure 3.3 shows an example of a scene that has had clutter objects added to it, and Figure 3.4 shows how that scene is rendered for viewing.

Figure 3.3: The scene after the model has been posed and clutter objects have been added as seen in Blender 20

Figure 3.4: The scene after the model has been posed and clutter objects have been added as rendered for viewing
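Inside Blender, the clutter objects described in this subsection can be added from a Python script along the following lines. This is a hedged sketch: the keyword arguments accepted by the bpy.ops.mesh operators vary between Blender versions, and the placement ranges and scales here are arbitrary.

```python
import random
import bpy

ADD_PRIMITIVE = [
    bpy.ops.mesh.primitive_cube_add,
    bpy.ops.mesh.primitive_uv_sphere_add,
    bpy.ops.mesh.primitive_cone_add,
]

def add_clutter(count):
    """Scatter `count` random primitives around the origin, where the
    human model is assumed to stand."""
    for _ in range(count):
        add = random.choice(ADD_PRIMITIVE)
        add(location=(random.uniform(-1.5, 1.5),
                      random.uniform(-1.5, 1.5),
                      random.uniform(0.0, 2.0)),
            rotation=(random.uniform(0.0, 6.28),
                      random.uniform(0.0, 6.28),
                      random.uniform(0.0, 6.28)))
        # Vary the object's proportions through its scale.
        obj = bpy.context.active_object
        obj.scale = tuple(random.uniform(0.2, 0.8) for _ in range(3))

add_clutter(random.randint(3, 5))
```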

3.2 Post Processing

Once the data are generated in Blender, more steps need to be completed for the data to be usable for training. In general, the RGB data need to be converted into a single integer label for each pixel, and the depth data need to have a threshold applied. The label data in the image are in the three RGB channels, and each pixel needs to be mapped to a single integer value. During the rendering process, many of the RGB values will be slightly changed from the values set on the human model. To deal with this, each pixel is assigned the closest of the original RGB values; the RGB values can then be easily mapped to the corresponding label integers. When no object is in view for a pixel, the depth at that point is given as the highest value that can be stored. This causes overflow errors during training, so a threshold is applied to the data. In this case, 10.0 is used since none of the common RGB-D sensors can sense farther than 10.0 meters. Once an image has been fully processed, it is added to an HDF5 file that contains images from the data set. HDF5 allows for managing large amounts of data easily and can quickly access data stored on a hard drive; HDF5 is detailed in A.5. This is essential for this project since the data sets are much too large to fit into the computer's main memory; a data set containing 100,000 processed images is approximately 200 GB.
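A minimal sketch of this post-processing step, assuming the rendered EXR has already been read into numpy arrays, is shown below. The palette values, dataset names, and the use of h5py are illustrative only, not the project's exact code.

```python
import numpy as np
import h5py

# One RGB reference color per label; the row index doubles as the integer label.
PALETTE = np.array([[0, 0, 0],        # 0: non person
                    [255, 0, 0],      # 1: head left
                    [128, 0, 0]],     # 2: head right ... and so on
                   dtype=np.float32)

def rgb_to_labels(rgb):
    """Assign each pixel the label whose reference color is closest."""
    diff = rgb[:, :, None, :] - PALETTE[None, None, :, :]
    return np.argmin(np.sum(diff ** 2, axis=-1), axis=-1).astype(np.uint8)

def clamp_depth(depth, max_range=10.0):
    """Replace the 'no hit' depth values with the 10 m sensor limit."""
    return np.minimum(depth, max_range)

def append_sample(h5_path, depth, labels, index):
    """Append one processed depth/label pair to the growing HDF5 data set."""
    with h5py.File(h5_path, "a") as f:
        f.create_dataset("depth/%06d" % index, data=depth, compression="gzip")
        f.create_dataset("labels/%06d" % index, data=labels, compression="gzip")
```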

3.3 Data Sets

Each data set has different characteristics that determine how difficult it is to classify. In general, data sets with a larger number of possible configurations are considered more difficult. Every set allows the joints of the human to move into any valid configuration. Some sets restrict the rotation of the person relative to the sensor; for example, one set may only allow the person to face the sensor with their head near the top of the frame. Some sets have occlusions or clutter of various types, and the amount of clutter can also vary between sets.

Figure 3.5: Example labeled images from Set 1

Set 1 is the easiest and contains images that are very similar to the entertainment scenario that the Kinect functions on. The person does not have any global rotation and there are no clutter objects. Examples are shown in Figure 3.5.

Figure 3.6: Example labeled images from Set 2

Set 2 is a hybrid set that has a slight amount of global rotation, from -15 to 15 degrees in the y axis, and just a few (3 - 5) clutter objects. This means that the person is still roughly facing the sensor and there are a few objects, but not so many that they should overwhelm the image. Examples are shown in Figure 3.6.

Figure 3.7: Example labeled images from Set 3

Set 3 is a medium difficulty set and contains images that have global rotation and a simple occlusion that blocks out a random percentage of the image. This set is meant to train a network that can handle situations where the person is not facing the sensor. Examples are shown in Figure 3.7. Set 4 is a hard set that contains images that have global rotation and 10-30 random clutter objects. This set is harder than Set 2 because it contains many more clutter objects and allows any rotation of the person. It is meant to replicate the toughest conditions under which a sensor may be viewing a person. Examples are shown in Figure 3.8.

Figure 3.8: Example labeled images from Set 4

Chapter 4

Network Training

This chapter goes into detail about the first stage of the system, which produces mask images. It uses a variant of the Convolutional Neural Network (CNN) called a Fully Convolutional Network (FCN). A description of CNNs and other general neural network concepts is given in Section 2.2.

4.1 Network Structure

The network structure takes inspiration from the FCN in [11]. Similar to the network in that paper, this one only uses convolutional layers for learning. The other, non-learning layers are the max poolings, the rectified linear (ReLU) activations, and the log likelihood function applied at the end of the network. The initial layers of the network have a few convolutional layers followed by a max pooling layer. As the network progresses, the max pooling layers reduce the size of the image planes while the convolutional layers increase the number of feature planes; a visual description of the network is shown in Figure 4.1. This allows more global information to be acquired at each step. Finally, the images are scaled back up and the feature planes are merged in one step called a deconvolution. The final output of the network has the same width and height as the input; each pixel is a vector the size of the number of classes, 13 in this case. Each element of this vector gives the log likelihood that the corresponding class is the correct class, and the maximum element gives the predicted class.

Figure 4.1: An outline of the fully convolutional network. Convolutional layers are shown as vertical lines with the number of feature planes they contain above them. Max pooling layers are shown as rectangles (all used kernels of 2x2)

All of the weights of the network are initialized randomly; this is in contrast to [11], where weights were taken from an existing classification network. Using weights from another network was not an option for this project since there were no publicly available trained networks that operate on depth data. For optimization, stochastic gradient descent was used with a learning rate of 0.1, learning rate decay of 0.001, weight decay of 0.0001, and momentum of 0.5. The high initial learning rate allowed the network to quickly learn a basic representation of the data since no prior weights were used to initialize the network. The learning rate decay brought the learning rate down to a level more suitable for fine tuning on later images and epochs. Running the network with a consistently high learning rate causes it not to converge to a decent optimum, while running with a consistently low value greatly increases training time. The criterion optimized was the negative log likelihood. This criterion makes the last layer of the network produce the log likelihood for each class. It is effective since it removes the need for the network to directly predict the class, which can cause problems during training. To recover a single prediction, the class with the maximum log likelihood is taken. For efficient training, the GPU accelerated library Torch was used [4]. The network in [11] uses almost 12 GB of memory on the GPU, which means that only the most high end GPUs, such as the NVIDIA K40, can run it. These GPUs are not viable for most end users since they are very expensive and can only be installed in desktop computers. The network in this project uses about 3 GB, which means that it can run on more consumer friendly GPUs such as an NVIDIA GTX 780 or certain models of the NVIDIA GTX 960. Training the network on an NVIDIA GTX 780 takes about 1 day to run 3 epochs. In practice, training did not see much improvement after 3 epochs. Running a single forward propagation on an image takes 8.5 milliseconds on a GTX 960. Once a network is trained, it can take as input a depth image and produce a mask image, as discussed in Chapter 3, that can then be used by the second stage of the system to generate a pose.
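Recovering the mask image from the trained network's output is a per-pixel argmax over the class dimension. A minimal numpy sketch, assuming the output is stored with the 13 class planes first:

```python
import numpy as np

def output_to_mask(log_likelihoods):
    """Collapse a (13, height, width) array of per-class log likelihoods
    into a (height, width) mask of predicted class indices."""
    return np.argmax(log_likelihoods, axis=0).astype(np.uint8)

# Example with random scores standing in for the network output.
fake_output = np.random.rand(13, 480, 640)
mask = output_to_mask(fake_output)        # values 0-12, one per pixel
```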

Chapter 5

Pose Estimation

This chapter proposes a system for generating a 3D representation of the person. It uses the mask image from the first stage and the depth image from the sensor. The final step is to create a simplified version of the person called a pose skeleton. The Point Cloud Library (PCL) is used for point cloud manipulation. Information about PCL is given in A.7.

5.1 Point Cloud Representation

With a mask image predicting the body part for each pixel, the mask image is fused with the depth image in order to get a point cloud representation of the human. The goal is to create a highly detailed 3D representation of the person in the world space. Each body part is maintained as a separate cloud in order to work with individual pieces as needed. The mask image is given in terms of pixels, which are discrete coordinates ranging up to the maximum height and width of the image; in this project the dimensions of the images are 480 × 640. These must be transformed into points within the world space, which are given as continuous XYZ coordinates. Doing this requires the depth and the camera intrinsic parameters. The depth is taken directly from the depth image. The camera intrinsic parameters needed are the center point of the image and the focal lengths in the height and width directions of the camera. Equations 5.1 and 5.2 show how to get the X and Y world positions of pixel (h, w). The Z position in the world is

simply the depth.

X = \frac{(h - \text{center}_y) \cdot \text{depth}}{\text{focal}_y} \qquad (5.1)

Y = \frac{(w - \text{center}_x) \cdot \text{depth}}{\text{focal}_x} \qquad (5.2)

Once the point cloud is completed, points that are known or very likely to be bad are trimmed from it. In particular, all points that lie near the depth threshold are assumed to be bad, since they are most likely caused by points on the mask image extending beyond the actual person.
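Following Equations 5.1 and 5.2, the fusion of the mask and depth images can be sketched as below. The intrinsic parameter values shown are placeholders, since the real values come from the sensor's calibration.

```python
import numpy as np

def masked_point_cloud(depth, mask, label,
                       center=(239.5, 319.5), focal=(525.0, 525.0),
                       max_range=10.0):
    """Return the XYZ points of every pixel carrying `label`, using the
    pinhole projection of Equations 5.1 and 5.2."""
    center_y, center_x = center
    focal_y, focal_x = focal
    h, w = np.nonzero(mask == label)            # pixel rows and columns
    z = depth[h, w]
    keep = z < max_range - 1e-3                 # trim points at the depth limit
    h, w, z = h[keep], w[keep], z[keep]
    x = (h - center_y) * z / focal_y
    y = (w - center_x) * z / focal_x
    return np.stack([x, y, z], axis=1)          # one XYZ row per point
```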

5.2 Pose Skeleton

The point cloud gives a detailed representation of the person; to get a simpler representation, a pose is computed by reducing the point clouds to a few key descriptors. First, the center of each body part is found. This is simply the mean over every dimension. This gives a location in 3D space for the body part, but does not determine an orientation for the part. To get an orientation, Principal Component Analysis (PCA) [17] is used to find the direction along which the point cloud has the highest variability. Equation 5.3 shows how to calculate this direction given the data X, the XYZ locations of each point in the cloud, by maximizing over the vector w. Since the body parts defined in the labeling all have an approximately cylindrical shape, this vector gives the central axis of the body part. The results of the entire procedure are shown in Figure 5.1.

w_{(1)} = \arg\max_{\|w\|=1} \left\{ \|Xw\|^2 \right\} = \arg\max_{\|w\|=1} \left\{ w^T X^T X w \right\} \qquad (5.3)

This process is one of many that could be applied to the same point cloud data to get a pose representation. Another possible solution would be to fit a prior estimate of the shape of the body part to the data.

Figure 5.1: A point cloud representation of a person and the corresponding pose vectors

Ultimately, the choice depends on the application that the pose skeleton is intended for. This procedure was chosen due to its robustness to information lost to occlusions; as long as most of the body part is visible, the first vector given by PCA remains relatively unchanged.
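The centroid and the principal axis of Equation 5.3 can be computed for each body part cloud as in the sketch below; here the leading eigenvector of X^T X is taken from numpy's eigensolver rather than from PCL, purely for illustration.

```python
import numpy as np

def part_center_and_axis(points):
    """points is an (N, 3) array of XYZ positions for one body part.
    Returns the part's center and the unit vector of largest variance."""
    center = points.mean(axis=0)
    X = points - center                    # work with centered data
    _, eigvecs = np.linalg.eigh(X.T @ X)   # symmetric 3x3 eigenproblem
    axis = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    return center, axis / np.linalg.norm(axis)
```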

Chapter 6

Results

Once the network is trained, the output can be visualized and the per-pixel accuracy can be analyzed. To visualize the output, the mask images can be compared to the known labels. The 3D point clouds can also be visualized along with the center points of each body part. The results were scored using a pure accuracy measure. A separate network was trained from scratch on each data set. More details about each set are given in Section 3.3. The pose estimations cannot be scored empirically, since there is no ground truth to compare them to. However, since the performance of the pose estimator is heavily dependent on the quality of the mask images, the focus is on evaluating the first stage.

6.1 Mask Images

6.1.1 Set 1

Set 1 is the easiest set and most closely resembles an entertainment scenario. After 3 epochs, the network reached an overall accuracy of 97.25% on a test set of 15,000 images; subsequent epochs did not significantly improve accuracy. Accuracy per class is shown in Table 6.1. Examples of the network segmentations are shown in Figure 6.1. The example images show that confusion can occur when separate body parts are in close proximity; these errors may be due to the large amount of down-sampling done in the network and could be mitigated by adding skip layers.

Table 6.1: Per class accuracy for set 1

Class                Accuracy %
Non Person           99.15
Head Left            70.55
Torso Left           91.11
Upper Arm Left       72.89
Lower Arm Left       73.37
Upper Leg Left       80.52
Lower Leg Left       66.31
Head Right           67.19
Torso Right          81.04
Upper Arm Right      74.56
Lower Arm Right      63.32
Upper Leg Right      72.28
Lower Leg Right      57.18

(a) True segmentation (b) Network segmentation

Figure 6.1: Network predictions for Set 1 data

Table 6.2: Per class accuracy for set 2

Class                Accuracy %
Non Person           97.75
Head Left            77.26
Torso Left           91.04
Upper Arm Left       79.27
Lower Arm Left       77.45
Upper Leg Left       85.28
Lower Leg Left       80.22
Head Right           73.35
Torso Right          84.39
Upper Arm Right      77.54
Lower Arm Right      69.13
Upper Leg Right      78.42
Lower Leg Right      72.16

Table 6.3: Per class accuracy for set 3

Class                Accuracy %
Non Person           98.92
Head Left            13.22
Torso Left           60.51
Upper Arm Left       4.21
Lower Arm Left       3.19
Upper Leg Left       1.53
Lower Leg Left       32.94
Head Right           3.91
Torso Right          37.50
Upper Arm Right      3.38
Lower Arm Right      5.79
Upper Leg Right      1.32
Lower Leg Right      10.59

6.1.2 Set 2

Set 2 is the hybrid set that includes some clutter objects and very little global rotation. After 3 epochs, the network reaches results that are on par with the Set 1 network, with a global accuracy of 96.50%. The network is able to reliably predict each body part, as shown in Table 6.2. By observation, the network mostly removes cubes, planes, and spheres from the segmentation, as shown in Figure 6.2; however, it shows confusion when there is a cone nearby with dimensions similar to a body part, such as in Figure 6.4. This may indicate that the network has learned decent geometric representations of the body parts but has not learned higher level descriptions of the parts, such as the idea that there cannot be more than one "right arm."

6.1.3 Set 3

Set 3 contains global rotations and simple occlusions. Multiple configurations of the network settings were tried, but results remained unsatisfactory. The results in Table 6.3 show that the network is unable to reliably classify most body parts. This may be due to the increased configuration space compared to Set 1. Since the size of the network has been reduced relative to the network in [11], it may lack the capacity to handle the global rotations.

6.1.4 Set 4

Set 4 contains global rotations as well as numerous clutter objects.

(a) True segmentation (b) Network segmentation

Figure 6.2: Good performance network predictions for Set 2 data

Figure 6.3: The hybrid network running on an image similar to those in Set 1

(a) True segmentation (b) Network segmentation

Figure 6.4: Reduced performance network predictions for Set 2 data

Table 6.4: Per class accuracy for set 4

Class                Accuracy %
Non Person           100.00
Head Left            0.00
Torso Left           0.00
Upper Arm Left       0.00
Lower Arm Left       0.00
Upper Leg Left       0.00
Lower Leg Left       0.00
Head Right           0.00
Torso Right          0.00
Upper Arm Right      0.00
Lower Arm Right      0.00
Upper Leg Right      0.00
Lower Leg Right      0.00

Similar to the results for Set 3, the network is unable to produce satisfactory results. This is not surprising, given that the network was unable to function accurately on Set 3, which is an easier set. The results in Table 6.4 show that the network exclusively predicts the non-person class; since this class is the most common in the set, it may be that the network cannot escape this local optimum.

6.2 Pose Estimation

The results of the pose estimation system proposed in Section 5.2 are presented for observation here. The 3D reconstruction of the person is shown in Figure 6.5. Most parts have their centers in sensible positions; however, the proposed center for the upper arm, shown in bright yellow, is clearly not correct.


Figure 6.5: Different views of the 3D reconstruction of a person. Large dots mark the point centers

This is due to the incorrect classification of a large portion of the lower arm as upper arm. With this method of pose estimation, the results are highly dependent on the performance of the previous stage. The pose vectors in Figure 6.6 show the proposed orientations of the body parts. Again, the bright yellow arm vector is not reasonable due to the significant incorrect classifications. However, the proposed orientations for the well segmented parts, such as the legs and the other arm, are reasonable.

Figure 6.6: The orientation of the body parts shown as vectors

Figure 6.7: Performance on real data. Top left: RGB image, not used in processing. Top right: depth image. Bottom: network predictions

6.3 Real Data

Images of an actual subject were taken with the Microsoft Kinect to observe the general performance on real data. The image was edited to remove known non-person pixels, such as the floor and walls, to make it close to the images in Set 1. The results are shown in Figure 6.7. Ground truth for the real data has not been established, but the predictions appear to be sensible.

6.4 Discussion

The networks trained on Sets 1 and 2 produce good results, while the networks trained on Sets 3 and 4 do not. Sets 1 and 2 both keep the person roughly facing the sensor, while Sets 3 and 4 allow the person to rotate freely. Interestingly, the network trained on Set 2 is able to handle occlusions and clutter objects, so it may be that global rotations add too much variation to the data set for a network of this size to handle. Networks may occasionally "favor" one side of a left-right pair of body parts, as seen with the torso parts in Table 6.1. This is likely due to the border between the parts being classified as one part; the choice may be arbitrary and due to the ambiguity of the border. In addition to being smaller than the network in [11], this network must segment parts that do not have sharply defined borders. For example, the border between the lower and upper arm may not be obvious when the arm is outstretched. With the RGB images used in [11], most classes have distinct color contrasts even when they are adjacent.

Chapter 7

Conclusion

7.1 Summary

This thesis presented a system for creating human body segmentations using a deep learning based Fully Convolutional Network. The network was trained to be robust to some occlusion and clutter in the scene. Of the networks that were trained, those focused on identifying a person facing the sensor were able to achieve satisfactory results. The ones trained on data sets that allowed for large global rotations were not able to perform well. A method for pose estimation was proposed that is aimed at being resistant to losses due to occlusions. This method gives a center point of each body part in 3D space and a vector describing its orientation. This stage is purposefully meant to be switched out with any other method to fit the needs of the implementation.

7.2 Future Work

Since the pipeline has been designed to retrain networks on different data sets, more networks can be trained on sets reflecting different scenarios. These can be variations in the clutter objects or variations in the human model; for example, a data set could include child models to create a system that detects adults and children in the same setting. Other pose estimation methods can also be explored. To evaluate the effectiveness of these methods, a way to compare them to a ground truth pose would be needed.

Other possible methods could include fitting known shapes of body parts to the point cloud or segmenting the body into more parts in order to connect the centers to find an orientation. It may be possible to use both the RGB data and the depth data as input for the network. This may help with some of the problems encountered with the depth data, but it would require a way to create data sets with realistic RGB data. The system needs to be improved to handle large global rotations. This may mean training a larger network and waiting for consumer grade GPUs to catch up in specifications. It may also be possible to train an ensemble of segmentation networks, each for a different view, such as one each for the front, back, and sides, in order to achieve a better global model of the person. To handle the erroneous pose estimates for body parts that have been poorly segmented, a more robust method may be needed.

Appendix A

Supporting Software

This work utilizes many different pieces of freely available supporting software. This appendix describes each one in detail and contains a link to a website with more information about the software.

A.1 Blender

www.blender.org

Blender is an image rendering program. The possibilities for its use are nearly limitless, but in this project it is used to create images from a human model and simple objects placed within a scene. More information about how it is used is provided in Chapter 3. Blender can be used interactively by a user to set up and edit detailed scenes; this feature was used in the prototyping stages of the project. Blender can also be run automatically with Python scripts. This automated process is used to create a large number of images without needing supervision, which is essential later in the project.

A.2 MakeHuman

www.makehuman.org

MakeHuman is a program that creates detailed and posable human models that can be used in Blender. This model is placed into the scene in Blender and is commonly referred to in this project as the "human model". It is purely synthetic and not derived from an actual person.

A.3 OpenEXR

www.openexr.com

EXR is a file format that is typically used for highly detailed images. In this case, its most important feature is the 'Z' channel that stores the depth of each pixel. Blender can export this format for rendered images, which allows depth images to be easily generated by Blender.

A.4 Numpy

www.numpy.org

Numpy is a module for Python that allows for powerful scientific computing. It efficiently handles large matrices, which is essential for processing the images.

A.5 HDF5

www.hdfgroup.org/HDF5

HDF5 is a file format and library that allows for large amounts of data to be stored and accessed quickly and efficiently. This is essential for managing data sets in this project that can take up 200 GB. The efficient access speeds up almost all aspects of the project since less time is being spent on I/O operations.

A.6 Torch

torch.ch

Torch is a library for machine learning that leverages GPUs to speed up computations. It has support for the latest deep learning structures. In this project, it is used to train and test a convolutional neural network.

A.7 PCL

pointclouds.org

PCL stands for Point Cloud Library. Point clouds are large arrangements of points in 2D or 3D space, and PCL provides many methods for analyzing them. In this project, PCL is used to manage a point cloud representation of a human in 3D space.

Bibliography

[1] Convolutional Neural Networks (LeNet). DeepLearning 0.1 documentation.
[2] Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. In INTERSPEECH, pages 3366–3370, 2013.
[3] Ross Berteig. Neural network technology, 1996.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. Neural Information Processing Systems, 2011.
[5] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[6] David H. Hubel and Torsten N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.
[7] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[8] C. Keskin, F. Kirac, Y. E. Kara, and L. Akarun. Real time hand pose estimation using depth sensors. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1228–1234, Nov 2011.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] Steve Lawrence, C. Lee Giles, Ah Chung Tsoi, and Andrew D. Back. Face recognition: A convolutional neural-network approach. Neural Networks, IEEE Transactions on, 8(1):98–113, 1997.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
[12] V. Micelli, K. Strabala, and S. Srinivasa. Perception and control challenges for effective human-robot handoffs. RSS 2011 RGB-D Workshop, 2011.
[13] T. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 2006.

[14] Thomas B. Moeslund and Erik Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
[15] Greg Mori, Xiaofeng Ren, Alexei A. Efros, and Jitendra Malik. Recovering human body configurations: Combining segmentation and recognition. In Computer Vision and Pattern Recognition, volume 2, pages II–326. IEEE, 2004.
[16] S. Obdržálek, Gregorij Kurillo, Jay Han, Ted Abresch, Ruzena Bajcsy, et al. Real-time human pose detection and tracking for tele-rehabilitation in virtual reality. Studies in Health Technology and Informatics, 173:320–324, 2012.
[17] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[18] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1304, 2011.
[19] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2013.
[20] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[21] Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, and Hao Li. Dense human body correspondences using convolutional networks. arXiv preprint arXiv:1511.05904, 2015.