Generating Human Images and Ground Truth Using Computer Graphics
University of California, Los Angeles

Generating Human Images and Ground Truth using Computer Graphics

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Statistics

by

Weichao Qiu

2016

© Copyright by Weichao Qiu 2016

Abstract of the Thesis

Generating Human Images and Ground Truth using Computer Graphics

by

Weichao Qiu
Master of Science in Statistics
University of California, Los Angeles, 2016
Professor Alan Loddon Yuille, Chair

How to provide high quality data for computer vision is a challenge. Researchers have spent a lot of effort creating image datasets with more images and more detailed annotation. Computer graphics (CG) is a way of creating synthetic images; during image synthesis, many types of information about the CG scene can be exported as ground truth annotation. In this thesis, we develop a pipeline to synthesize realistic human images and automatically generate detailed annotation at the same time. We use 2D annotation to control the pose of the CG human model, which enables our images to contain more poses than motion capture based methods. The synthetic images are used to train and evaluate a human pose estimation algorithm to show their usefulness.

The thesis of Weichao Qiu is approved.

Yingnian Wu
Hongjing Lu
Alan Loddon Yuille, Committee Chair

University of California, Los Angeles
2016

To my parents, who provide courage and support for me to pursue my dream
To my grandpa, who sparked my interest in science

Table of Contents

1 Introduction
2 Synthesize Human Images
  2.1 Related Work
  2.2 Computer Graphics Human Model
  2.3 Control Pose of the Human Model
  2.4 Automatic Ground Truth Generation
  2.5 Image Synthesis Pipeline
3 2D Human Pose Estimation
  3.1 Model Review
  3.2 Experiment Setup
  3.3 Transfer Between Datasets
  3.4 Diagnostic Result for Appearance Transfer
4 Conclusion
References

List of Figures

1.1 Two types of information that are difficult to annotate: (a) line segments in an image, from an unpublished work; (b) optical flow in a temporal sequence, from [1]
1.2 Ground truth generated during image synthesis. From left to right: real image, synthetic image, depth, joint locations, semantic part annotation. The pose of the human model is obtained from the annotation of the real image.
2.1 Realistic synthetic images: (a) the virtual scene (top) and its corresponding real photograph (bottom) in the game GTA5; (b) an indoor scene created by a virtual artist using Blender; (c) a digital human avatar
2.2 Typical images from different sources; from left to right: 3D Warehouse, TurboSquid, MakeHuman, Daz 3D, SCAPE
2.3 (a) Motion capture data is recorded indoors and subjects are asked to perform a limited number of tasks, image from [2]; (b) in the real world, humans perform a large range of poses by interacting with the environment, image from [3]
2.4 (a) The bone structure of a MakeHuman model; (b) the 3D joint information recovered from 2D annotation. The human model contains more bones in the shoulder and spine.
2.5 The corresponding real image and synthetic images based on the estimated 3D pose, under different camera viewpoints
2.6 Illustration of the image synthesis pipeline
3.1 Visual comparison between the real LSP and the synthetic L23 datasets
3.2 The loss for training the unary CNN in [4]: (a) trained on synthetic data; (b) trained on real data

List of Tables

2.1 Comparison of human image datasets
3.1 Test performance on the real LSP dataset. The two models are based on [5] and trained on synthetic images and real images respectively.
3.2 Test performance on the synthetic L23 dataset. The two models are based on [5] and trained on synthetic images and real images respectively.

Acknowledgments

First, I would like to thank my advisor, Prof. Alan Yuille. He was very supportive of this project and provided continuous support to make the idea come true. He also spent his precious time giving detailed feedback on the draft of this thesis.

My sincere thanks also go to my lab-mates, Peng Wang and Xianjie Chen. The idea for this project came from brainstorming between us. They also provided guidance and support when the project got stuck.

I also want to thank the mentors in my research career: Prof. Zhuowen Tu, Prof. Xiang Bai, Prof. Hongjing Lu, Prof. Xinggang Wang and Prof. Jiayi Ma. I learned a lot by working with them.

Last but not least, I want to thank Prof. Yingnian Wu, Chunyu Wang, Junhua Mao, Lingxi Xie and Jiamei Shuai for helpful discussions and suggestions.

CHAPTER 1

Introduction

Computer vision benefits greatly from the development of datasets, which are used to train and benchmark algorithms. Researchers therefore put much energy into creating better image datasets, which are becoming larger and include more detailed annotation. But creating a large dataset with detailed annotation is expensive. Computer graphics provides an alternative approach to this problem: using synthetic images to build datasets has many benefits and can be a good complement to the traditional approach. In this thesis we build a pipeline to synthesize human images with detailed annotation. In addition to using motion capture data, we propose a novel way to use 2D joint annotation to control the human pose in synthetic images. This technique enables us to create many varied poses.

Researchers have spent a lot of effort creating better datasets [6, 7, 8]. The number of images keeps growing; a larger dataset covers more cases in the real world, which helps avoid dataset bias [9] and overfitting, and more images enable us to train models with higher capacity [10]. Datasets are also becoming more detailed [11, 12, 13, 14, 15, 16, 17]. At the beginning, objects in images were only annotated with bounding boxes; now, people are more interested in finding the boundary of an object and its semantic parts. Richer annotation allows algorithms to extract more information from an image and perform fine-grained tasks. Building an algorithm that fully understands an image and passes the Visual Turing Test [18] is the Holy Grail for computer vision researchers.

Many techniques and tools have been developed to make annotation easier and faster. Crowd-sourcing services such as Amazon Mechanical Turk make annotation much cheaper, but annotating a large number of images in detail is still hard and expensive. By contrast, computer graphics is good at creating large and detailed datasets. Once the 3D scene is defined, rendering a new image incurs no extra cost. Many types of annotation, such as depth and boundaries, can be exported from the 3D scene directly, and information not contained in the 3D scene can be annotated manually. In short, after annotating the 3D scene, many images and their annotations can be synthesized at almost no cost.
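To make this concrete, the sketch below shows how ground truth such as a depth map and 2D joint locations could be exported alongside each rendered image. It is a minimal illustration, assuming a Blender 2.7x scene with its default render layer and a rigged human model; the object name "Armature" and the output paths are placeholders, not the actual pipeline code described in Chapter 2.

```python
import bpy
from bpy_extras.object_utils import world_to_camera_view

scene = bpy.context.scene

# Enable the Z (depth) pass on the default render layer (Blender 2.7x API).
scene.render.layers["RenderLayer"].use_pass_z = True

# Build a small compositor graph that writes the depth pass to disk.
scene.use_nodes = True
tree = scene.node_tree
tree.nodes.clear()
render_layers = tree.nodes.new(type="CompositorNodeRLayers")
depth_output = tree.nodes.new(type="CompositorNodeOutputFile")
depth_output.base_path = "/tmp/ground_truth"       # illustrative output folder
depth_output.format.file_format = "OPEN_EXR"       # EXR keeps depth as float values
tree.links.new(render_layers.outputs["Z"], depth_output.inputs[0])

# Render the color image; the depth map is written by the file output node.
scene.render.filepath = "/tmp/ground_truth/rgb.png"
bpy.ops.render.render(write_still=True)

# Project the 3D joints of the rigged human model into the image to obtain
# 2D joint annotation for free ("Armature" is a placeholder object name).
rig = bpy.data.objects["Armature"]
for bone in rig.pose.bones:
    head_world = rig.matrix_world * bone.head      # armature space -> world space (2.7x operator)
    uv = world_to_camera_view(scene, scene.camera, head_world)
    print(bone.name, uv.x, uv.y)                   # normalized image coordinates of the joint
```

Writing the depth pass to OpenEXR keeps the values in floating point, so the exported map can be used directly without quantization.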
Besides being larger and more detailed, computer graphics based methods have advantages that the traditional approach cannot match. First, computer graphics can generate information about a scene that is hard to annotate, such as depth, edges and optical flow. Fig. 1.1 shows two types of annotation that are hard for human labelers to produce but easy to obtain using computer graphics. Second, every variable of a scene, such as material, lighting and camera viewpoint, can be controlled. This lets us generate images with different configurations to diagnose a model [19, 20]. Third, instead of providing a fixed set of images, a computer graphics based dataset can provide data according to the needs of an algorithm. Training can start from simple, clean images and then move to harder cases by adding occlusion and motion blur, and the learning algorithm can interact with the synthesis system, which is similar to how humans learn: we explore the world by freely moving our heads to find interesting or hard examples.

Figure 1.1: Two types of information that are difficult to annotate: (a) line segments in an image, from an unpublished work; (b) optical flow in a temporal sequence, from [1]

Training computer vision systems in a virtual world is not a new idea and dates back to the beginning of the field, e.g., Blocks World [21]. But the computer graphics based approach was unpopular for a long time for various reasons: it was hard to render realistic images, expensive to create a virtual environment, and slow to render a large number of images. Recent advances in the CG industry have made synthetic images much better. In particular, fast GPUs make it possible to use physically realistic rendering at a large scale, and more high quality models, open source movies and games make it easier to generate synthetic images.

Recently, the authors of [22, 23] used computer graphics models from 3D Warehouse to synthesize images for training state-of-the-art object detectors; they mainly focused on rendering rigid objects. In this work, we study how to create synthetic human images with detailed annotations. Annotating images with different types of information enables us to train a complex model that works on multiple tasks, and different types of information in a system can be shared. For example, after estimating the joint locations of a human, it is easier to estimate the semantic part segmentation [24].
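As an illustration of the second advantage noted above, the controllability of scene variables, the following sketch renders the same scene from a sweep of camera azimuths, the kind of controlled variation used to diagnose a model's sensitivity to viewpoint. It assumes the same hypothetical Blender 2.7x setup as the earlier sketch; the orbit radius, height and output paths are illustrative.

```python
import math
import bpy

scene = bpy.context.scene
camera = scene.camera
radius, height = 4.0, 1.6                          # illustrative orbit parameters

# Render the same scene from a sweep of camera azimuths; every other scene
# variable stays fixed, so differences in model performance can be attributed
# to viewpoint alone.
for azimuth in range(0, 360, 45):
    theta = math.radians(azimuth)
    camera.location = (radius * math.cos(theta), radius * math.sin(theta), height)
    # Point the camera back at the origin (cameras look along their local -Z axis).
    direction = -camera.location
    camera.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    scene.render.filepath = "/tmp/viewpoints/azimuth_%03d.png" % azimuth
    bpy.ops.render.render(write_still=True)
```

The same pattern applies to other scene variables, such as lighting or material, by setting the corresponding properties before each render.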