Generating Human Images and Ground Truth Using Computer Graphics


University of California, Los Angeles

Generating Human Images and Ground Truth Using Computer Graphics

A thesis submitted in partial satisfaction of the requirements for the degree of
Master of Science in Statistics

by Weichao Qiu

2016

© Copyright by Weichao Qiu 2016

Abstract of the Thesis

Generating Human Images and Ground Truth Using Computer Graphics
by Weichao Qiu
Master of Science in Statistics
University of California, Los Angeles, 2016
Professor Alan Loddon Yuille, Chair

Providing high-quality data for computer vision is a challenge, and researchers have spent a great deal of effort creating image datasets with more images and more detailed annotation. Computer graphics (CG) offers a way to create synthetic images, and during image synthesis many types of information about the CG scene can be exported as ground-truth annotation. In this thesis, we develop a pipeline that synthesizes realistic human images and automatically generates detailed annotation at the same time. We use 2D annotation to control the pose of the CG human model, which enables our images to contain more poses than motion-capture-based methods. The synthetic images are used to train and evaluate a human pose estimation algorithm to demonstrate their usefulness.

The thesis of Weichao Qiu is approved.

Yingnian Wu
Hongjing Lu
Alan Loddon Yuille, Committee Chair

University of California, Los Angeles, 2016

To my parents, who provided courage and support for me to pursue my dream.
To my grandpa, who sparked my interest in science.

Table of Contents

1  Introduction
2  Synthesizing Human Images
   2.1  Related Work
   2.2  Computer Graphics Human Model
   2.3  Controlling the Pose of the Human Model
   2.4  Automatic Ground Truth Generation
   2.5  Image Synthesis Pipeline
3  2D Human Pose Estimation
   3.1  Model Review
   3.2  Experiment Setup
   3.3  Transfer Between Datasets
   3.4  Diagnostic Results for Appearance Transfer
4  Conclusion
References

List of Figures

1.1  Two types of information that are difficult to annotate: (a) line segments in an image, from an unpublished work; (b) optical flow in a temporal sequence, from [1].
1.2  Ground truth generated during image synthesis. From left to right: real image, synthetic image, depth, joint locations, semantic part annotation. The pose of the human model is obtained from the annotation of the real image.
2.1  Realistic synthetic images: (a) a virtual scene (top) and its corresponding real photograph (bottom) from the game GTA5; (b) an indoor scene created by an artist using Blender; (c) a digital human avatar.
2.2  Typical images from different sources. From left to right: 3D Warehouse, TurboSquid, MakeHuman, Daz 3D, SCAPE.
2.3  (a) Motion capture data is recorded indoors and subjects are asked to perform a limited number of tasks (image from [2]); (b) in the real world, humans perform a wide range of poses by interacting with the environment (image from [3]).
2.4  (a) The bone structure of a MakeHuman model; (b) the 3D joint information recovered from 2D annotation. The human model contains more bones in the shoulder and spine.
2.5  A real image and the corresponding synthetic images, based on the estimated 3D pose, rendered under different camera viewpoints.
2.6  Illustration of the image synthesis pipeline.
3.1  Visual comparison between the real LSP dataset and the synthetic L23 dataset.
3.2  The loss when training the unary CNN of [4]: (a) trained on synthetic data; (b) trained on real data.

List of Tables

2.1  Comparison of human image datasets.
3.1  Test performance on the real LSP dataset. The two models are based on [5] and trained on synthetic images and real images, respectively.
3.2  Test performance on the synthetic L23 dataset. The two models are based on [5] and trained on synthetic images and real images, respectively.

Acknowledgments

First, I would like to thank my advisor, Prof. Alan Yuille. He was very supportive of this project and provided continuous support to make the idea come true. He also spent his precious time giving detailed feedback on the draft of this thesis. My sincere thanks also go to my lab-mates, Peng Wang and Xianjie Chen. The idea for this project came from brainstorming among us, and they provided guidance and support when the project got stuck. I also want to thank the mentors in my research career: Prof. Zhuowen Tu, Prof. Xiang Bai, Prof. Hongjing Lu, Prof. Xinggang Wang, and Prof. Jiayi Ma. I learned a lot by working with them. Last but not least, I want to thank Prof. Yingnian Wu, Chunyu Wang, Junhua Mao, Lingxi Xie, and Jiamei Shuai for helpful discussions and suggestions.

CHAPTER 1
Introduction

Computer vision benefits greatly from datasets, which are used to train and benchmark algorithms, so researchers are eager to create better ones: datasets are becoming larger and include more detailed annotation. But creating a large dataset with detailed annotation is expensive. Computer graphics provides an alternative approach to this problem; using synthetic images to build datasets has many benefits and can be a good complement to the traditional approach. In this thesis we build a pipeline to synthesize human images with detailed annotation. In addition to using motion capture data, we propose a novel way to use 2D joint annotation to control the human pose in synthetic images, which enables us to create a much wider variety of poses.

Researchers have spent a lot of effort creating better datasets [6, 7, 8]. The number of images keeps growing: a larger dataset covers more of the real world, which helps avoid dataset bias [9] and overfitting, and more images also make it possible to train models with higher capacity [10]. Datasets are also becoming more detailed [11, 12, 13, 14, 15, 16, 17]. In the beginning, objects in images were annotated only with bounding boxes; now people are more interested in finding the boundary of an object and its semantic parts. Richer annotation allows algorithms to extract more information from an image and perform fine-grained tasks. Building an algorithm that fully understands an image and passes the Visual Turing Test [18] is the Holy Grail for computer vision researchers.

Many techniques and tools have been developed to make annotation easier and faster. Crowd-sourcing services such as Amazon Mechanical Turk make annotation much cheaper, but annotating a large number of images in detail is still hard and expensive. By contrast, computer graphics is well suited to creating large and detailed datasets. Once the 3D scene is defined, rendering a new image incurs no extra cost. Many types of annotation, such as depth and boundaries, can be exported from the 3D scene directly, and information not contained in the 3D scene can be annotated manually. In short, after annotating the 3D scene, many images and their annotations can be synthesized at almost no cost.
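As a concrete illustration of exporting annotation from a 3D scene, the sketch below projects the 3D joint positions of a virtual human onto the image plane of a pinhole camera, yielding 2D joint-location ground truth with no manual labeling. This is a minimal sketch of the general idea rather than the actual pipeline of this thesis (which is built on Blender and MakeHuman); the joint list, the camera parameters, and the project_joints helper are assumptions made for the example.

```python
import numpy as np

def project_joints(joints_3d, K, R, t):
    """Project 3D joint positions (world frame) onto the image plane.

    joints_3d: (N, 3) joint positions in world coordinates.
    K:         (3, 3) camera intrinsic matrix.
    R, t:      (3, 3) rotation and (3,) translation mapping world -> camera.
    Returns an (N, 2) array of pixel coordinates, i.e. the 2D joint
    annotation that comes for free once the 3D scene is known.
    """
    cam = joints_3d @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Toy example with made-up numbers: three joints of a standing figure.
joints_3d = np.array([[0.0, 1.7, 3.0],   # head
                      [0.0, 1.0, 3.0],   # pelvis
                      [0.2, 0.0, 3.0]])  # right foot
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)            # camera placed at the world origin

print(project_joints(joints_3d, K, R, t))
```

The same principle applies to depth, part segmentation, and optical flow: each is a deterministic function of the scene description, so it can be rendered alongside the color image instead of being annotated by hand.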
Besides being larger and more detailed, computer-graphics-based methods have advantages that the traditional approach cannot match. First, computer graphics can generate information about a scene that is hard to annotate, such as depth, edges, and optical flow; Fig. 1.1 shows two types of annotation that are difficult for human labelers to produce but easy to obtain with computer graphics. Second, every variable of a scene, such as material, lighting, and camera viewpoint, can be controlled. This lets us generate images under different configurations in order to diagnose a model [19, 20]; a sketch of such a configuration sweep appears at the end of this chapter. Third, instead of providing a fixed set of images, a computer-graphics-based dataset can provide data according to the needs of an algorithm. This lets us start training on simple, clean images and then move to harder cases by adding occlusion and motion blur. It also lets the learning algorithm interact with the synthesis system, which is similar to how humans learn: we explore the world, freely moving our heads to find interesting or hard examples to learn from.

[Figure 1.1: Two types of information that are difficult to annotate. (a) Line segments in an image, from an unpublished work. (b) Optical flow in a temporal sequence, from [1].]

Training computer vision systems in a virtual world is not a new idea; it dates back to the beginning of the field, e.g., Blocks World [21]. But the computer-graphics-based approach was unpopular for a long time for several reasons: it was hard to render realistic images, expensive to create a virtual environment, and slow to render large numbers of images. Recent advances in the CG industry have made synthetic images much better. In particular, fast GPUs make physically realistic rendering feasible at large scale, and a growing supply of high-quality models, open-source movies, and games makes it easier to generate synthetic images.

Recently, the authors of [22, 23] used computer graphics models from 3D Warehouse to synthesize images for training state-of-the-art object detectors, focusing mainly on rigid objects. In this work, we study how to create synthetic human images with detailed annotation. Annotating images with different types of information enables us to train a complex model that works on multiple tasks, and different types of information in a system can be shared: for example, after estimating the joint locations of a human, it is easier to estimate the semantic part segmentation [24].
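To make the diagnosis idea mentioned above concrete, the sketch below enumerates a grid of rendering configurations by sweeping camera viewpoint and lighting, so that every synthesized image is tagged with the exact scene parameters that produced it. It is a hypothetical illustration under assumed parameter names; the real renderer (Blender in this thesis) is stood in for by a stub.

```python
import itertools
from dataclasses import dataclass

@dataclass
class RenderConfig:
    azimuth_deg: float      # camera azimuth around the subject
    elevation_deg: float    # camera elevation above the ground plane
    light_intensity: float  # strength of the key light

def diagnostic_configs():
    """Enumerate a grid of scene configurations for model diagnosis."""
    azimuths = range(0, 360, 45)   # 8 viewpoints around the subject
    elevations = (0, 15, 30)       # low, medium, high camera
    lights = (0.5, 1.0, 2.0)       # dim, normal, bright lighting
    for az, el, li in itertools.product(azimuths, elevations, lights):
        yield RenderConfig(az, el, li)

def render(config):
    """Stub for the real renderer; here it only reports the configuration."""
    print(f"render: azimuth={config.azimuth_deg}, "
          f"elevation={config.elevation_deg}, light={config.light_intensity}")

for cfg in diagnostic_configs():
    render(cfg)  # 8 x 3 x 3 = 72 images, each with known scene parameters
```

Because every image carries its exact scene parameters, a drop in pose-estimation accuracy can be attributed to a specific factor such as viewpoint or lighting rather than to the dataset as a whole.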