
Be Your Own Prada: Fashion Synthesis with Structural Coherence

Shizhan Zhu¹  Sanja Fidler²,³  Raquel Urtasun²,³,⁴  Dahua Lin¹  Chen Change Loy¹
¹Department of Information Engineering, The Chinese University of Hong Kong
²University of Toronto  ³Vector Institute  ⁴Uber Advanced Technologies Group
{szzhu, dhlin, ccloy}@ie.cuhk.edu.hk, {fidler, urtasun}@cs.toronto.edu

Abstract

We present a novel and effective approach for generating new clothing on a wearer through generative adversarial learning. Given an input image of a person and a sentence describing a different outfit, our model "redresses" the person as desired, while keeping the wearer and her/his pose unchanged. Generating new outfits with precise regions conforming to a language description while retaining the wearer's body structure is a new and challenging task. Existing generative adversarial networks are not ideal for ensuring global structural coherence given both the input photograph and the language description as conditions. We address this challenge by decomposing the complex generative process into two conditional stages. In the first stage, we generate a plausible semantic segmentation map that obeys the wearer's pose as a latent spatial arrangement; an effective spatial constraint is formulated to guide the generation of this map. In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map. We extend the DeepFashion dataset [8] by collecting sentence descriptions for 79K images. We demonstrate the effectiveness of our approach through both quantitative and qualitative evaluations, and also conduct a user study.

Figure 1. Given an original wearer's input photo (left) and different textual descriptions (second column), our model generates new outfits onto the photograph (right three columns) while preserving the pose and body shape of the wearer.

1. Introduction

Imagine that you could be your own fashion designer, able to seamlessly transform your current outfit in a photo into a completely new one by simply describing it in words (Figure 1). In just minutes you could design and "try on" hundreds of different shirts, dresses, or even styles, allowing you to easily discover what you look good in. The goal of this paper is to develop a method that can generate new outfits onto existing photos in a way that preserves structural coherence from multiple perspectives:

1. Retaining the body shape and pose of the wearer,
2. Producing regions and the associated textures that conform to the language description, and
3. Enforcing coherent visibility of body parts.

Meeting all these requirements at the same time is a very challenging task. First, the input image is the only source from which we can mine body shape information. With only a single view of the wearer, it is nontrivial to recover the body shape accurately. Moreover, we do not want the shape of the generated outfit to be limited by the original garments of the wearer. For example, replacing the original long-sleeve shirt with a short-sleeve garment would require the model to hallucinate the person's arms and skin.
Conventional 2D non-parametric methods or 3D graphics approaches meet the first requirement through structural constraints derived from human priors. These constraints can take the form of accurate physical measurements (e.g. height, waist, hip, arm length) used to create 3D virtual bodies [10]; manual manipulation of sliders such as height, weight, and waist girth [19]; or indications of joint positions and a rough sketch outlining the human body silhouette [16]. All these methods require explicit human intervention at test time, which limits their applicability in practical settings. In addition, as these methods provide no obvious way to incorporate textual descriptions to condition the synthesis process, it is non-trivial to fulfil the second requirement with them. Lastly, they do not meet the third requirement, as they do not support hallucination of missing parts.

Generative Adversarial Networks (GANs) [2] are an appealing alternative to these conventional methods. In previous work, DCGAN [11], a GAN formulation combined with convolutional networks, has been shown to be an effective model for producing realistic images. Moreover, it allows for an end-to-end embedding of textual descriptions to condition the image generation. However, the task of clothing generation presents two significant challenges that are difficult to address with the standard DCGAN. First, it directly targets the pixel values and provides no mechanism to enforce structural coherence with respect to the input. Second, it tends to average out the pixels [14], resulting in various artifacts, e.g. blurry boundaries, as shown by our experiments.

To tackle this problem, we propose an effective two-stage GAN framework that generates shape and textures in separate stages. In the first stage, we aim to generate a plausible human segmentation map that specifies the regions for body parts and the upper-body garment. This stage is responsible for preserving the body shape and ensuring the coherent visibility of parts based on the description. In the second stage, the generator takes both the produced segmentation map and the textual description as conditions, and renders the region-specific texture onto the photograph.

To ensure the structural coherence of the synthesized image with respect to the input image (i.e. preserving the body shape and pose of the wearer), we present an effective spatial constraint that can be derived from the input photograph. We formulate it carefully so that it does not contradict the textual description when both are used to condition the first-stage GAN. In addition, we introduce a new compositional mapping layer into the second-stage GAN to enforce region-specific texture rendering guided by the segmentation map. In contrast to existing GANs that perform non-compositional synthesis, the new mapping layer generates more coherent visibility of body parts with region-specific textures.

To train our model, we extend the DeepFashion dataset [8] by annotating a subset of 79K upper-body images with sentence descriptions and human body annotations¹. Extensive quantitative and qualitative comparisons are performed against existing GAN baselines and 2D non-parametric approaches. We also conduct a user study to obtain an objective evaluation of both the shape and image generation results.

¹ The data and code can be found at http://mmlab.ie.cuhk.edu.hk/projects/FashionGAN/.
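To make the two-stage decomposition described above concrete, the following is a minimal PyTorch-style sketch of the data flow only. Everything in it is our own illustration: the tiny MLP generators, the tensor sizes, the class grouping, and the exact form of the spatial constraint (here, a merged and heavily downsampled segmentation of the original photo) are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from math import prod

def spatial_constraint(seg_onehot, groups, size=(8, 8)):
    """Coarse, clothing-agnostic body prior (our reading of the idea):
    merge the fine segmentation channels of the ORIGINAL photo into a few
    broad regions and downsample, keeping pose and rough body shape while
    discarding the old garment's exact silhouette.
    seg_onehot: (B, C, H, W) one-hot map; groups: channel-index groups."""
    merged = torch.stack([seg_onehot[:, g].sum(1) for g in groups], dim=1)
    return F.interpolate(merged, size=size, mode='bilinear', align_corners=False)

class TinyG(nn.Module):
    """Minimal MLP stand-in for either GAN generator (illustration only;
    the real generators are convolutional networks)."""
    def __init__(self, in_dim, out_shape):
        super().__init__()
        self.out_shape = out_shape
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, prod(out_shape)))
    def forward(self, *conds):
        x = torch.cat([c.flatten(1) for c in conds], dim=1)
        return self.net(x).view(-1, *self.out_shape)

B, z_dim, text_dim, hw = 2, 100, 128, 32
groups = [[0], [1], [2], [3, 4, 5, 6]]            # e.g. bg / hair / face / body
G_shape = TinyG(z_dim + 4 * 8 * 8 + text_dim, (7, hw, hw))    # stage-1 generator
G_image = TinyG(z_dim + 7 * hw * hw + text_dim, (3, hw, hw))  # stage-2 generator

seg0 = F.one_hot(torch.randint(0, 7, (B, hw, hw)), 7).permute(0, 3, 1, 2).float()
text = torch.randn(B, text_dim)                   # sentence embedding (assumed given)

# Stage 1: noise + body prior + text -> new segmentation map.
prior = spatial_constraint(seg0, groups)
new_seg = G_shape(torch.randn(B, z_dim), prior, text).softmax(dim=1)

# Stage 2: noise + generated segmentation map + text -> final image.
image = G_image(torch.randn(B, z_dim), new_seg, text).tanh()
```

The key design point the sketch tries to capture is that the text conditions both stages, while the original photo enters stage 1 only through the coarse body prior, so the generated outfit is free to differ in shape from the original garments.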
2. Related Work

Generative Adversarial Networks (GANs) [2] have shown impressive results in generating new images, e.g. faces [11], indoor scenes [15], fine-grained objects like birds [12], or clothes [17]. Training GANs with conditions incorporates further information to guide the generation process. Existing works have explored various conditions, from category labels [9] and text [12] to encoded feature vectors [17]. Different from the studies above, ours aims at generating the target image by using the spatial configuration of the input image as a condition. The spatial configuration is carefully formulated so that it is agnostic to the clothing worn in the original image and captures only information about the user's body.

Several studies transfer an input image to a new one. Ledig et al. [6] apply the GAN framework to super-resolve a low-resolution image. Isola et al. [3] use a conditional GAN to transfer across image domains, e.g. from edge maps to real images, or from daytime images to night-time. Zhu et al. [20] change the viewing angle of an existing object. Johnson et al. [4] apply GANs to neural style transfer. All these studies share a common feature: the image is transformed globally on the texture level but is not region-specific. In this study, we explore a new compositional mapping method that allows region-specific texture generation, providing richer textures for different body regions.

Several recent studies explore improved image generation by stacking GANs. Our work is somewhat similar in spirit to [15, 18]: our idea is to have the first stage create the basic composition, and the second stage add the necessary refinements to the image generated in the first stage. However, the proposed FashionGAN differs from S2GAN [15] in that the latter aims at synthesizing a surface map from a random vector in its first stage. In contrast, our goal is to generate a plausible mask whose structure conforms to a given photograph and language description, which requires us to design additional spatial constraints and design coding as conditions. Furthermore, these two conditions should not contradict each other. Similarly, our work requires additional constraints that are not explored in [18]. Compositional mapping is not explored in the aforementioned studies either.

Yoo et al. [17] propose an image-conditional image generation model to perform domain transfer, e.g. generating a piece of clothing from an input image of a dressed person. Our work differs in that we aim at changing the outfit of a person into a newly designed one based on a textual description. Rendering new outfits onto photographs with unconstrained human poses brings additional difficulties compared with work that generates pieces of clothing at a fixed view angle, as in [17].

Figure 2. Proposed framework. Given an input photograph of a person and a sentence description of a new desired outfit, our model first generates a segmentation map S̃ using the generator of the first GAN. We then render the new image with a second GAN, guided by the segmentation map generated in the previous step.
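Since compositional mapping is the main point of contrast with prior GANs above, a small sketch may help. The code below is our own reading of "region-specific texture rendering guided by the segmentation map": each semantic region gets its own rendering branch, and the segmentation masks gate where each branch contributes. The branch design (1x1 convolutions) and the soft-mask compositing are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class CompositionalMapping(nn.Module):
    """Illustrative compositional mapping layer: each semantic region gets
    its own 1x1-conv "renderer", and the outputs are composited with the
    segmentation map as soft spatial masks, keeping textures region-specific.
    (A sketch of the idea only; the paper's actual layer may differ.)"""
    def __init__(self, feat_ch, out_ch, n_regions):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(feat_ch, out_ch, kernel_size=1) for _ in range(n_regions))

    def forward(self, feats, seg):
        # feats: (B, feat_ch, H, W) generator features
        # seg:   (B, n_regions, H, W) soft segmentation masks summing to 1
        out = 0
        for k, branch in enumerate(self.branches):
            out = out + seg[:, k:k + 1] * branch(feats)   # mask-gated rendering
        return out

# Usage: composite 7 region-specific renderings into one RGB image.
layer = CompositionalMapping(feat_ch=64, out_ch=3, n_regions=7)
feats = torch.randn(2, 64, 32, 32)
seg = torch.softmax(torch.randn(2, 7, 32, 32), dim=1)
rgb = torch.tanh(layer(feats, seg))      # (2, 3, 32, 32)
```

Because each pixel's output is dominated by the branch of its own region, textures cannot bleed across region boundaries the way they can in a non-compositional generator, which is consistent with the coherence argument made above.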