<<

-to-Image Translation with Conditional Adversarial Networks

Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros

Berkeley AI Research (BAIR) Laboratory, UC Berkeley {isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu

Labels to Street Scene Labels to Facade BW to Color

input output Aerial to Map

input output input output Day to Night Edges to Photo

input output input output input output

Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Abstract 1. Introduction Many problems in image processing, , and computer vision can be posed as “translating” an input We investigate conditional adversarial networks as a image into a corresponding output image. Just as a concept general-purpose solution to image-to-image translation may be expressed in either English or French, a scene may problems. These networks not only learn the mapping from be rendered as an RGB image, a gradient field, an edge map, input image to output image, but also learn a loss func- a semantic label map, etc. In analogy to automatic language tion to train this mapping. This makes it possible to apply translation, we define automatic image-to-image translation the same generic approach to problems that traditionally as the task of translating one possible representation of a arXiv:1611.07004v3 [cs.CV] 26 Nov 2018 would require very different loss formulations. We demon- scene into another, given sufficient training data (see Figure strate that this approach is effective at synthesizing photos 1). Traditionally, each of these tasks has been tackled with from label maps, reconstructing objects from edge maps, separate, special-purpose machinery (e.g., [16, 25, 20,9, and colorizing images, among other tasks. Indeed, since the 11, 53, 33, 39, 18, 58, 62]), despite the fact that the setting release of the pix2pix software associated with this pa- is always the same: predict pixels from pixels. Our goal in per, a large number of internet users (many of them artists) this paper is to develop a common framework for all these have posted their own experiments with our system, further problems. demonstrating its wide applicability and ease of adoption The community has already taken significant steps in this without the need for parameter tweaking. As a commu- direction, with convolutional neural nets (CNNs) becoming nity, we no longer hand-engineer our mapping functions, the common workhorse behind a wide variety of image pre- and this work suggests we can achieve reasonable results diction problems. CNNs learn to minimize a loss – without hand-engineering our loss functions either. an objective that scores the quality of results – and although the learning process is automatic, a lot of manual effort still

1 x G(x) y goes into designing effective losses. In other words, we still G have to tell the CNN what we wish it to minimize. But, just D D like King Midas, we must be careful what we wish for! If we take a naive approach and ask the CNN to minimize the fake real Euclidean distance between predicted and ground truth pix- els, it will tend to produce blurry results [43, 62]. This is x x because Euclidean distance is minimized by averaging all Figure 2: Training a conditional GAN to map edges→photo. The plausible outputs, which causes blurring. Coming up with discriminator, D, learns to classify between fake (synthesized by loss functions that force the CNN to do what we really want the generator) and real {edge, photo} tuples. The generator, G, – e.g., output sharp, realistic images – is an open problem learns to fool the discriminator. Unlike an unconditional GAN, and generally requires expert knowledge. both the generator and discriminator observe the input edge map. It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguish- large body of literature has considered losses of this kind, able from reality”, and then automatically learn a loss func- with methods including conditional random fields [10], the tion appropriate for satisfying this goal. Fortunately, this is SSIM metric [56], feature matching [15], nonparametric exactly what is done by the recently proposed Generative losses [37], the convolutional pseudo-prior [57], and losses Adversarial Networks (GANs) [24, 13, 44, 52, 63]. GANs based on matching covariance statistics [30]. The condi- learn a loss that tries to classify if the output image is real tional GAN is different in that the loss is learned, and can, in or fake, while simultaneously training a generative model theory, penalize any possible structure that differs between to minimize this loss. Blurry images will not be tolerated output and target. since they look obviously fake. Because GANs learn a loss Conditional GANs We are not the first to apply GANs that adapts to the data, they can be applied to a multitude of in the conditional setting. Prior and concurrent works have tasks that traditionally would require very different kinds of conditioned GANs on discrete labels [41, 23, 13], text [46], loss functions. and, indeed, images. The image-conditional models have In this paper, we explore GANs in the conditional set- tackled image prediction from a normal map [55], future ting. Just as GANs learn a generative model of data, condi- frame prediction [40], product photo generation [59], and tional GANs (cGANs) learn a conditional generative model image generation from sparse annotations [31, 48] (c.f. [47] [24]. This makes cGANs suitable for image-to-image trans- for an autoregressive approach to the same problem). Sev- lation tasks, where we condition on an input image and gen- eral other papers have also used GANs for image-to-image erate a corresponding output image. mappings, but only applied the GAN unconditionally, re- GANs have been vigorously studied in the last two lying on other terms (such as L2 regression) to force the years and many of the techniques we explore in this pa- output to be conditioned on the input. These papers have per have been previously proposed. Nonetheless, ear- achieved impressive results on inpainting [43], future state lier papers have focused on specific applications, and prediction [64], image manipulation guided by user con- it has remained unclear how effective image-conditional straints [65], style transfer [38], and superresolution [36]. GANs can be as a general-purpose solution for image-to- Each of the methods was tailored for a specific applica- image translation. Our primary contribution is to demon- tion. Our framework differs in that nothing is application- strate that on a wide variety of problems, conditional specific. This makes our setup considerably simpler than GANs produce reasonable results. Our second contri- most others. bution is to present a simple framework sufficient to Our method also differs from the prior works in several achieve good results, and to analyze the effects of sev- architectural choices for the generator and discriminator. eral important architectural choices. Code is available at Unlike past work, for our generator we use a “U-Net”-based https://github.com/phillipi/pix2pix. architecture [50], and for our discriminator we use a convo- lutional “PatchGAN” classifier, which only penalizes struc- 2. Related work ture at the scale of image patches. A similar PatchGAN ar- chitecture was previously proposed in [38] to capture local Structured losses for image modeling Image-to-image style statistics. Here we show that this approach is effective translation problems are often formulated as per- clas- on a wider range of problems, and we investigate the effect sification or regression (e.g., [39, 58, 28, 35, 62]). These of changing the patch size. formulations treat the output space as “unstructured” in the sense that each output pixel is considered conditionally in- 3. Method dependent from all others given the input image. Condi- tional GANs instead learn a structured loss. Structured GANs are generative models that learn a mapping from losses penalize the joint configuration of the output. A random noise vector z to output image y, G : z y [24]. In → contrast, conditional GANs learn a mapping from observed Encoder-decoder U-Net image x and random noise vector z, to y, G : x, z y. { } → The generator G is trained to produce outputs that cannot be distinguished from “real” images by an adversarially trained x y x y discriminator, D, which is trained to do as well as possible at detecting the generator’s “fakes”. This training procedure is diagrammed in Figure2. Figure 3: Two choices for the architecture of the generator. The 3.1. Objective “U-Net” [50] is an encoder-decoder with skip connections be- The objective of a conditional GAN can be expressed as tween mirrored layers in the encoder and decoder stacks.

cGAN (G, D) =Ex,y[log D(x, y)]+ L 3.2. Network architectures Ex,z[log(1 D(x, G(x, z))], (1) − We adapt our generator and discriminator architectures where G tries to minimize this objective against an ad- from those in [44]. Both generator and discriminator use versarial D that tries to maximize it, i.e. G∗ = modules of the form convolution-BatchNorm-ReLu [29]. arg minG maxD cGAN (G, D). Details of the architecture are provided in the supplemen- L To test the importance of conditioning the discriminator, tal materials online, with key features discussed below. we also compare to an unconditional variant in which the discriminator does not observe x: 3.2.1 Generator with skips

GAN (G, D) =Ey[log D(y)]+ A defining feature of image-to-image translation problems L is that they map a high resolution input grid to a high resolu- Ex,z[log(1 D(G(x, z))]. (2) − tion output grid. In addition, for the problems we consider, Previous approaches have found it beneficial to mix the the input and output differ in surface appearance, but both GAN objective with a more traditional loss, such as L2 dis- are renderings of the same underlying structure. Therefore, tance [43]. The discriminator’s job remains unchanged, but structure in the input is roughly aligned with structure in the the generator is tasked to not only fool the discriminator but output. We design the generator architecture around these also to be near the ground truth output in an L2 sense. We considerations. also explore this option, using L1 distance rather than L2 as Many previous solutions [43, 55, 30, 64, 59] to problems L1 encourages less blurring: in this area have used an encoder-decoder network [26]. In such a network, the input is passed through a series of lay- L1(G) = Ex,y,z[ y G(x, z) 1]. (3) ers that progressively downsample, until a bottleneck layer, L k − k at which point the process is reversed. Such a network re- Our final objective is quires that all information flow pass through all the layers,

∗ including the bottleneck. For many image translation prob- G = arg min max cGAN (G, D) + λ L1(G). (4) G D L L lems, there is a great deal of low-level information shared between the input and output, and it would be desirable to Without z, the net could still learn a mapping from x shuttle this information directly across the net. For exam- to y, but would produce deterministic outputs, and there- ple, in the case of image colorization, the input and output fore fail to match any distribution other than a delta func- share the location of prominent edges. tion. Past conditional GANs have acknowledged this and To give the generator a means to circumvent the bottle- provided Gaussian noise z as an input to the generator, in neck for information like this, we add skip connections, fol- addition to x (e.g., [55]). In initial experiments, we did not lowing the general of a “U-Net” [50]. Specifically, we find this strategy effective – the generator simply learned add skip connections between each layer i and layer n i, − to ignore the noise – which is consistent with Mathieu et where n is the total number of layers. Each skip connec- al. [40]. Instead, for our final models, we provide noise tion simply concatenates all channels at layer i with those only in the form of dropout, applied on several layers of our at layer n i. generator at both training and test time. Despite the dropout − noise, we observe only minor stochasticity in the output of 3.2.2 Markovian discriminator (PatchGAN) our nets. Designing conditional GANs that produce highly stochastic output, and thereby capture the full entropy of the It is well known that the L2 loss – and L1, see Fig- conditional distributions they model, is an important ques- ure4 – produces blurry results on image generation prob- tion left open by the present work. lems [34]. Although these losses fail to encourage high- frequency crispness, in many cases they nonetheless accu- 4. Experiments rately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to To explore the generality of conditional GANs, we test enforce correctness at the low frequencies. L1 will already the method on a variety of tasks and datasets, including both do. graphics tasks, like photo generation, and vision tasks, like semantic segmentation: This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to Semantic labels photo, trained on the Cityscapes • ↔ force low-frequency correctness (Eqn.4). In order to model dataset [12]. high-frequencies, it is sufficient to restrict our attention to Architectural labels photo, trained on CMP Facades • → the structure in local image patches. Therefore, we design [45]. a discriminator architecture – which we term a PatchGAN Map aerial photo, trained on data scraped from • ↔ – that only penalizes structure at the scale of patches. This Google Maps. discriminator tries to classify if each N N patch in an im- BW color photos, trained on [51]. × • → age is real or fake. We run this discriminator convolution- Edges photo, trained on data from [65] and [60]; bi- • → ally across the image, averaging all responses to provide the nary edges generated using the HED edge detector [58] ultimate output of D. plus postprocessing. Sketch photo: tests edges photo models on human- In Section 4.4, we demonstrate that N can be much • → → smaller than the full size of the image and still produce drawn sketches from [19]. Day night, trained on [33]. high quality results. This is advantageous because a smaller • → Thermal color photos, trained on data from [27]. PatchGAN has fewer parameters, runs faster, and can be • → Photo with missing pixels inpainted photo, trained applied to arbitrarily large images. • → on Paris StreetView from [14]. Such a discriminator effectively models the image as a Markov random field, assuming independence between pix- Details of training on each of these datasets are provided els separated by more than a patch diameter. This connec- in the supplemental materials online. In all cases, the in- tion was previously explored in [38], and is also the com- put and output are simply 1-3 channel images. Qualita- mon assumption in models of texture [17, 21] and style tive results are shown in Figures8,9, 11, 10, 13, 14, 15, [16, 25, 22, 37]. Therefore, our PatchGAN can be under- 16, 17, 18, 19, 20. Several failure cases are highlighted stood as a form of texture/style loss. in Figure 21. More comprehensive results are available at https://phillipi.github.io/pix2pix/. Data requirements and speed We note that decent re- 3.3. Optimization and inference sults can often be obtained even on small datasets. Our fa- cade training set consists of just 400 images (see results in To optimize our networks, we follow the standard ap- Figure 14), and the day to night training set consists of only proach from [24]: we alternate between one gradient de- 91 unique webcams (see results in Figure 15). On datasets scent step on D, then one step on G. As suggested in of this size, training can be very fast: for example, the re- the original GAN paper, rather than training G to mini- sults shown in Figure 14 took less than two hours of training mize log(1 D(x, G(x, z)), we instead train to maximize − on a single Pascal Titan X GPU. At test time, all models run log D(x, G(x, z)) [24]. In addition, we divide the objec- in well under a second on this GPU. tive by 2 while optimizing D, which slows down the rate at which D learns relative to G. We use minibatch SGD and 4.1. Evaluation metrics apply the Adam solver [32], with a learning rate of 0.0002, Evaluating the quality of synthesized images is an open and momentum parameters β = 0.5, β = 0.999. 1 2 and difficult problem [52]. Traditional metrics such as per- At inference time, we run the generator net in exactly pixel mean-squared error do not assess joint statistics of the the same manner as during the training phase. This differs result, and therefore do not measure the very structure that from the usual protocol in that we apply dropout at test time, structured losses aim to capture. and we apply batch normalization [29] using the statistics of To more holistically evaluate the visual quality of our re- the test batch, rather than aggregated statistics of the train- sults, we employ two tactics. First, we run “real vs. fake” ing batch. This approach to batch normalization, when the perceptual studies on Amazon Mechanical Turk (AMT). batch size is set to 1, has been termed “instance normal- For graphics problems like colorization and photo gener- ization” and has been demonstrated to be effective at im- ation, plausibility to a human observer is often the ultimate age generation tasks [54]. In our experiments, we use batch goal. Therefore, we test our map generation, aerial photo sizes between 1 and 10 depending on the experiment. generation, and image colorization using this approach. Input Ground truth L1 cGAN L1 + cGAN

Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see https://phillipi.github.io/pix2pix/ for additional examples. Second, we measure whether or not our synthesized lution images, but exploited fully-convolutional translation cityscapes are realistic enough that off-the-shelf recognition (described above) to test on 512 512 images, which were × system can recognize the objects in them. This metric is then downsampled and presented to Turkers at 256 256 × similar to the “inception score” from [52], the object detec- resolution. For colorization, we trained and tested on tion evaluation in [55], and the “semantic interpretability” 256 256 resolution images and presented the results to × measures in [62] and [42]. Turkers at this same resolution. “FCN-score” While quantitative evaluation of genera- AMT perceptual studies For our AMT experiments, we tive models is known to be challenging, recent works [52, followed the protocol from [62]: Turkers were presented 55, 62, 42] have tried using pre-trained semantic classifiers with a series of trials that pitted a “real” image against a to measure the discriminability of the generated stimuli as a “fake” image generated by our algorithm. On each trial, pseudo-metric. The intuition is that if the generated images each image appeared for 1 second, after which the images are realistic, classifiers trained on real images will be able disappeared and Turkers were given unlimited time to re- to classify the synthesized image correctly as well. To this spond as to which was fake. The first 10 images of each end, we adopt the popular FCN-8s [39] architecture for se- session were practice and Turkers were given feedback. No mantic segmentation, and train it on the cityscapes dataset. feedback was provided on the 40 trials of the main experi- We then score synthesized photos by the classification accu- ment. Each session tested just one algorithm at a time, and racy against the labels these photos were synthesized from. Turkers were not allowed to complete more than one ses- sion. 50 Turkers evaluated each algorithm. Unlike [62], ∼ 4.2. Analysis of the objective function we did not include vigilance trials. For our colorization ex- periments, the real and fake images were generated from the Which components of the objective in Eqn.4 are impor- same grayscale input. For map aerial photo, the real and tant? We run ablation studies to isolate the effect of the L1 ↔ fake images were not generated from the same input, in or- term, the GAN term, and to compare using a discriminator der to make the task more difficult and avoid floor-level re- conditioned on the input (cGAN, Eqn.1) against using an sults. For map aerial photo, we trained on 256 256 reso- unconditional discriminator (GAN, Eqn.2). ↔ × L1 L1+cGAN that the output look realistic. This variant results in poor performance; examining the results reveals that the gener- ator collapsed into producing nearly the exact same output regardless of input photograph. Clearly, it is important, in this case, that the loss measure the quality of the match be- Encoder-decoder tween input and output, and indeed cGAN performs much better than GAN. Note, however, that adding an L1 term also encourages that the output respect the input, since the

U-Net L1 loss penalizes the distance between ground truth out- puts, which correctly match the input, and synthesized out- puts, which may not. Correspondingly, L1+GAN is also Figure 5: Adding skip connections to an encoder-decoder to create effective at creating realistic renderings that respect the in- a “U-Net” results in much higher quality results. put label maps. Combining all terms, L1+cGAN, performs Loss Per-pixel acc. Per-class acc. Class IOU similarly well. L1 0.42 0.15 0.11 Colorfulness A striking effect of conditional GANs is GAN 0.22 0.05 0.01 cGAN 0.57 0.22 0.16 that they produce sharp images, hallucinating spatial struc- L1+GAN 0.64 0.20 0.15 ture even where it does not exist in the input label map. One L1+cGAN 0.66 0.23 0.17 might imagine cGANs have a similar effect on “sharpening” Ground truth 0.80 0.26 0.21 in the spectral dimension – i.e. making images more color- Table 1: FCN-scores for different losses, evaluated on Cityscapes ful. Just as L1 will incentivize a blur when it is uncertain ↔ labels photos. where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of sev- Loss Per-pixel acc. Per-class acc. Class IOU eral plausible color values a pixel should take on. Specially, Encoder-decoder (L1) 0.35 0.12 0.08 Encoder-decoder (L1+cGAN) 0.29 0.09 0.05 L1 will be minimized by choosing the median of the condi- U-net (L1) 0.48 0.18 0.13 tional probability density function over possible colors. An U-net (L1+cGAN) 0.55 0.20 0.14 adversarial loss, on the other hand, can in principle become Table 2: FCN-scores for different generator architectures (and ob- aware that grayish outputs are unrealistic, and encourage jectives), evaluated on Cityscapes labels↔photos. (U-net (L1- matching the true color distribution [24]. In Figure7, we cGAN) scores differ from those reported in other tables since batch investigate whether our cGANs actually achieve this effect size was 10 for this experiment and 1 for other tables, and random on the Cityscapes dataset. The plots show the marginal dis- variation between training runs.) tributions over output color values in Lab color space. The Discriminator ground truth distributions are shown with a dotted line. It receptive field Per-pixel acc. Per-class acc. Class IOU is apparent that L1 leads to a narrower distribution than the 1 1 0.39 0.15 0.10 × 16 16 0.65 0.21 0.17 ground truth, confirming the hypothesis that L1 encourages × 70 70 0.66 0.23 0.17 average, grayish colors. Using a cGAN, on the other hand, × 286 286 0.42 0.16 0.11 × pushes the output distribution closer to the ground truth. Table 3: FCN-scores for different receptive field sizes of the dis- criminator, evaluated on Cityscapes labels→photos. Note that in- 4.3. Analysis of the generator architecture put images are 256 × 256 pixels and larger receptive fields are padded with zeros. A U-Net architecture allows low-level information to shortcut across the network. Does this lead to better results? Figure4 shows the qualitative effects of these variations Figure5 and Table2 compare the U-Net against an encoder- on two labels photo problems. L1 alone leads to reason- → decoder on cityscape generation. The encoder-decoder is able but blurry results. The cGAN alone (setting λ = 0 in created simply by severing the skip connections in the U- Eqn.4) gives much sharper results but introduces visual ar- Net. The encoder-decoder is unable to learn to generate tifacts on certain applications. Adding both terms together realistic images in our experiments. The advantages of the (with λ = 100) reduces these artifacts. U-Net appear not to be specific to conditional GANs: when We quantify these observations using the FCN-score on both U-Net and encoder-decoder are trained with an L1 loss, the cityscapes labels photo task (Table1): the GAN-based → the U-Net again achieves the superior results. objectives achieve higher scores, indicating that the synthe- sized images include more recognizable structure. We also 4.4. From PixelGANs to PatchGANs to ImageGANs test the effect of removing conditioning from the discrimi- nator (labeled as GAN). In this case, the loss does not pe- We test the effect of varying the patch size N of our dis- nalize mismatch between the input and output; it only cares criminator receptive fields, from a 1 1 “PixelGAN” to a × L1 1 1 16 16 70 70 286 286 × × × ×

L LLL L L

−1 −1 −1 −1 CVPRFigure 6: Patch−1−1 size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become CVPR CVPRCVPR −3 −3 CVPRCVPR #385blurry and− desaturated3−3 under L1. The 1x1 PixelGAN encourages greater−3 −3 color diversity but has no effect on spatial statistics. The 16x16 #385 #385#385 −5 −5 #385#385 −5−5 −5 −5 PatchGAN creates locally sharpCVPRCVPRCVPR results, 2016 2016 2016 Submission Submissionbut Submission also leads #385. #385. #385. to CONFIDENTIALtiling CONFIDENTIAL CONFIDENTIAL artifacts beyond REVIEW REVIEW REVIEW the COPY. scale COPY. COPY. DOit DO canDO NOT NOT NOTobserve. DISTRIBUTE. DISTRIBUTE. DISTRIBUTE. The 70×70 PatchGAN forces −7 −7 −7−7 −7 −7 L1 L1 L1 L1 L1 L1 × outputs that are sharp, even ifcGAN incorrect, in both the spatial and spectral (colorfulness) dimensions.cGAN The full 286 286 ImageGAN produces cGANcGAN −9 cGANcGAN −9−9 L1+cGAN −9 −9 L1+cGAN −9 L1+cGANL1+cGAN L1+cGANL1+cGAN L1+pixelcGAN × L1+pixelcGAN results that are visually similarL1+pixelcGANL1+pixelcGAN to the 70 70 PatchGAN, but somewhat lower qualityL1+pixelcGAN accordingL1+pixelcGAN to our FCN-score metric (Table3). Please −11 GroundGround truth truth −11−11 GroundGround truth truth −11 Ground truth Ground truth −11 −11 648648 https://phillipi.github.io/pix2pix/0 20 40 60 80 100 0 20 40 60 80 100 702702 648 see 0 0 2020 4040 60 60 80 80 100100 for additional0 0 examples.20 20 40 40 60 60 80 80 100 100 702 bb a a b b a a 649649 LLL b a b a 703703 649 703

−1 −1 −1 −1−1 −1−1 −1 −1 −1 −1 650650650 −1−1 −1 −1 704704704 −3 −3 −3 −3−3 −3−3 −3 −3 −3 −3 −3−3 −3 −3 HistogramHistogram intersection intersection )

) Histogram intersection )

651 ) 705 651 ) 705 ) )

651 ) Histogram intersection 705 ) b a b a L L b ( a ( ( ( L ( ( −5−5 −5−5 −5 −5 −5 −5 againstagainst ground ground truth truth ( −5 ( −( 5 −5 −5 −5 −5 against ground truth P P P

−5 P P 652652 P against ground truth 706706 P P 652 P LossLoss LaLa b b 706 −7−7 −7−7 −7 −7 −7 −7 Loss La b log log log −−77 log log log −7 −7 −7 −7 Loss L a b log −7 log 653653 log L1L1 707707 653 L1 L1L1 0.810.81 0.69 0.69 0.70 0.70 707 cGANcGAN L1 0.81 0.69 0.70 −−99 −9−9 cGAN −9−9 −9 −9 −9 −9 L1 0.81 0.69 0.70 −9 −9 L1+cGANL1+cGAN −9 −9 −9 cGANcGAN 0.87 0.870.740.74 0.840.84 654654 L1+cGAN cGAN 0.87 0.74 0.84 708708 654 L1+pixelcGANL1+pixelcGAN cGAN 0.87 0.74 0.84 708 −11−11 GroundL1+pixelcGAN truth −11−11 −11−11 −11 −11 0.86 0.82 −−1111 −11 Ground truth −11 −11 −11 L1+cGANL1+cGAN 0.860.840.84 0.82 −11 Ground truth L1+cGAN 0.86 0.84 0.82 655655 7070 9090 110110 130130 150150 7070 9090 110110 130130 70 70 90 90 110110 130130 150 150 70 70 90 90 110 110 130 130 709709 655 00 207020 404090 6060110 8080130 100100150 70 90 110 130 70 90 110 130 150 70 L1+cGAN90 110 0.86 1300.84 0.82 709 0 20 40 60 80 100 b PixelGANPixelGANPixelGAN0.830.830.83 0.68 0.68 0.68 0.78 0.78 0.78 LbLb a a aa bb 0.83 0.68 0.78 656656656 (a)(a)Lb a(b)(b)a (c)(c) PixelGAN 710710710 −−11 (a) −1−1 (b) (c) (d)(d) −1 (a) −1 (b) (c) (d) 657657657 FigureFigureFigure 5: 5: 5: Color Color Color distribution distribution distribution matching matching matching property property property of of of the the the cGAN, cGAN, cGAN, tested tested tested on on on Cityscapes. Cityscapes. Cityscapes. (c.f. (c.f. (c.f. Figure Figure Figure 1 of1 1 of theof the the original original original(d) GAN GAN GAN paper paper paper [14 []). [1414]). Note]). Note Note 711711711 −−33 −3−3 that−3 the histogram intersection−3 scores are dominated by differences in the high probability region, which are imperceptible in the plots, 658658658 thatthat the the histogram histogram intersection intersection scores scores are are dominated dominated by by differences differences in in the the high high probability probability region, region, which which are are imperceptible imperceptible in in the the plots, plots, 712712712 Figure− 7:−55 Color distribution matching−5−5 property of the cGAN, tested on Cityscapes. (c.f. Figure 1 of the original GAN paper [24]). Note whichwhich−5 show show log log probability probability and and−5 therefore therefore emphasize emphasize differences differences in in the the low low probability probability regions. regions. 659659659 which show log probability and therefore emphasize differences in the low probability regions. 713713713 that the−−77 histogram intersection scores−7−7 are dominated by differences in the high probability region, which are imperceptible in the plots, −7 −7 660660660 which show log probability and therefore emphasize differences in the low probability regions. 714714714 −−99 L1−9−9 1x1 16x16 70x70 256x256 661 −9 L1−9 1x1 16x16 70x70 256x256 715 661661 L1 1x1 16x16 70x70 256x256 715715 −−1111 −11−11 −11 −11 662662 7070 9090 110110 130130 150150 7070 9090 110110 130130 716716 662 70 90 110 130 150 70 90 110 130 716 663 1 Photo Map Map Photo 717 663663 full 286 286 “ImageGAN” . Figure6 shows qualitative → → 717717 Loss % Turkers labeled real % Turkers labeled real 664664 × 718718 664 results of this analysis and Table3 quantifies the effects us- L1 2.8% 1.0% 0.8% 0.3% 718 665 ± ± 719 665665 ing the FCN-score. Note that elsewhere in this paper, unless L1+cGAN 6.1% 1.3% 18.9% 2.5% 719719 666 FigureFigure 6: 6: Patch Patch size size variations. variations. Uncertainty Uncertainty in in the the output output manifests manifests itself itself differently differently for for different different loss± loss functions. functions. Uncertain Uncertain± regions regions become become 720 666666 specified,Figure all 6: experiments Patch size variations. use 70 Uncertainty70 PatchGANs, in the output and manifests for itselfTable differently 4: AMT for “realdifferent vs fake” loss functions. test on maps Uncertain↔aerial regions photos. become 720720 667 blurryblurry and and desaturated desaturated under under L1. L1.× The The 1x1 1x1 PixelGAN PixelGAN encourages encourages greater greater color color diversity diversity but but has has no no effect effect on on spatial spatial statistics. statistics. The The 16x16 16x16 721 667667 this sectionblurry and all desaturated experiments under use L1. an L1+cGANThe 1x1 PixelGAN loss. encourages greater color diversity but has no effect on spatial statistics. The 16x16 721721 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe.% The Turkers 70x70 labeled PatchGANreal forces 668668 PatchGANPatchGAN creates creates locally locally sharp sharp results, results, but but also also leads leads to to tiling tiling artifacts artifacts beyond beyondMethod the the scale scale it it can can observe. observe. The The 70x70 70x70 PatchGAN PatchGAN forces forces 722722 668 Theoutputs PixelGAN that are sharp, has evenno effect if incorrect, on spatial in both sharpnessthe spatial and but spectral (coforfulness)L2 regression dimensions. from The [62] full 256x25616.3% ImageGAN2.4% produces 722 669 outputsoutputs that that are are sharp, sharp, even even if if incorrect, incorrect, in in both both the the spatial spatial and and spectral spectral (coforfulness) (coforfulness) dimensions. dimensions. The The full full 256x256 256x256± ImageGAN ImageGAN produces produces 723 669669 doesresults increase that are the visually colorfulness similar toof the the 70x70 results PatchGAN, (quantified but somewhat in lower qualityZhang according et al. 2016 to our [62] FCN-score 27.8% metric2.7% (Table 2). 723723 670 resultsresults that that are are visually visually similar similar to to the the 70x70 70x70 PatchGAN, PatchGAN, but but somewhat somewhat lower lower quality quality according according to to our our FCN-score FCN-score metric± metric (Table (Table22).). 724 670670 Ours 22.5% 1.6% 724724 671 Figure7). For example, the bus in Figure6 is painted gray ± 725 671671 Table 5: AMT “real vs fake” test on colorization. 725725 672 when the net is trainedClassification with an L1 loss, Ours but becomes red Input Ground truth L1 cGAN 726 672672 ClassificationClassification Ours Ours InputInput Ground Ground truth truth L1 L1 cGAN cGAN 726726 673 with theL2L2 PixelGAN [ [4444]] (rebal.) (rebal.) loss. Color[44 [44]] (L1histogram (L1 + + cGAN) cGAN) matching Ground Ground is truth truth a 727 673673 L2 [44] (rebal.) [44] (L1 + cGAN) Ground truth 727727 674 common problem in image processing [49], and PixelGANs generator convolutionally, on larger images than those on 728 674674 728728 675 may be a promising lightweight solution. which it was trained. We test this on the map aerial photo 729 675675 ↔ 729729 676 Using a 16 16 PatchGAN is sufficient to promote sharp task. After training a generator on 256 256 images, we test 730 676676 × × 730730 677 outputs, and achieves good FCN-scores, but also leads to it on 512 512 images. The results in Figure8 demonstrate 731 677677 × 731731 678 tiling artifacts. The 70 70 PatchGAN alleviates these theFigure effectiveness 8: Applying of a this conditional approach. GAN to semantic segmentation. 732 678678 × Figure 8: Applying a conditional GAN to semantic segmentation. 732732 679 artifacts and achieves slightly better scores. be- TheFigure cGAN 8: producesApplying sharpa conditional images GAN that look to semantic at glance segmentation. like the 733 679 The cGAN produces sharp images that look at glance like the 733 680679 286 286 4.5.groundThe Perceptual cGANtruth, but produces in validation fact include sharpimages many small, that hallucinatedlook at glance objects. like the 734733 680 yond this, to the full ImageGAN, does not ap- ground truth, but in fact include many small, hallucinated objects. 734 681680 × ground truth, but in fact include many small, hallucinated objects. 735734 681 pear to improve the visual quality of the results, and in 735 682681 We validate the perceptual realism of our results on the 736735 682 fact gets a considerably lower FCN-score (Table3). This 736 683682 tasks of map aerial photograph and grayscale color. Re- 737736 683 may be because the ImageGAN has many more parameters nearly discrete,↔ rather than “images”, with their→ continuous- 737 684683 sultsnearlynearly of our discrete, discrete, AMT ratherexperiment rather than than “images”, “images”, for map with withphoto their their are continuous- continuous- given in 738737 684 and greater depth than the 70 70 PatchGAN, and may be valued variation. Although cGANs achieve↔ some success, 738 685684 × Tablevaluedvalued4. The variation. variation. aerial photos Although Although generated cGANs cGANs by achieve achieve our method some some fooled success, success, 739738 685 harder to train. they are far from the best available method for solving this 739 686685 participantstheythey are are far far on from from18.9% the theof best best trials, available available significantly method methodabove for for solving solving the L1 this this 740739 686 Fully-convolutional translation An advantage of the problem: simply using L1 regression gets better scores than 740 687686 baseline,problem:problem: which simply simply produces using using L1 L1 blurry regression regression results gets gets and better better nearly scores scores never than than 741740 687 PatchGAN is that a fixed-size patch discriminator can be using a cGAN, as shown in Table 4. We argue that for vision 741 688687 fooledusingusing participants. a a cGAN, cGAN, as as shown Inshown contrast, in in Table Table in44 the.. We We photo argue argue that thatmap for for direc- vision vision 742741 688 applied to arbitrarily large images. We may also apply the problems, the goal (i.e. predicting output close→ to ground 742 688 Figure 7: Colorization results of conditional GANs versus the L2 problems, the goal (i.e. predicting output close6.1 to ground 742 689 Figure 7: Colorization results of conditional GANs versus the L2 tiontruth)problems, our may method be the less only goal ambiguous (i.e.fooled predicting participants than graphics output on tasks, close% and to of ground re- tri- 743 689689 regressionFigure 7: fromColorization [44] and results the full of method conditional (classification GANs versus with the re- L2 743743 690 1 als,truth)truth) and this may may was be be notless less significantly ambiguous ambiguous than thandifferent graphics graphics than tasks, tasks, the perfor- and and re- re- 744 690 Webalancing)regressionregression achieve this from from from variation [ [46 [4444].]] and andThe in patch the the cGANs full fullsize method methodby can adjusting produce (classification (classification the compelling depth of with with the col- re- re- construction losses like L1 are mostly sufficient. 744 691690 GANbalancing) discriminator. from Details [46 of]. this The process, cGANs and can the produce discriminator compelling architec- col- manceconstructionconstruction of the L1 losses losses baseline like like L1 L1(based are are mostly mostly on bootstrap sufficient. sufficient. test). This 745744 691 orizationsbalancing) (first from two [46 rows),]. The but cGANs havea can common produce failure compelling mode of col- 745 692691 746745 tures,producingorizations areorizations provided a (first (first grayscale in the two two in the rows), rows),or supplemental desaturated but but have haveresult materials a a common common (last online. row). failure failure mode mode of of may be because minor structural errors are more visible 692692 746746 693 producingproducing a a grayscale grayscale or or desaturated desaturated result result (last (last row). row). 747 694693693 4. Conclusion 748747747 695694694 To begin to test this, we train a cGAN (with/without L1 4.4. Conclusion Conclusion 749748748 696695695 loss)ToTo on begin begin cityscape to to test test photo this, this, we welabels. train train a a Figure cGAN cGAN8 (with/without (with/withoutshows qualita- L1 L1 The results in this paper suggest that conditional adver- 750749749 ! 697696696 tiveloss)loss) results, on on cityscape cityscape and quantitative photo photo labels. classificationlabels. Figure Figure accuracies88 showsshows arequalita- qualita- re- sarialTheThe networks results results are in in this a this promising paper paper suggest suggest approach that that for conditional conditional many image- adver- adver- 751750750 !! 698697697 portedtivetive results, results, in Table and and4. quantitative quantitative Interestingly, classification classification cGANs, trained accuracies accuracieswithout are arethe re- re- to-imagesarialsarial networks networks translation are are tasks, a a promising promising especially approach approach those involving for for many many highly image- image- 752751751 699698698 L1portedported loss, in in are Table Table able4 to4.. solveInterestingly, Interestingly, this problem cGANs, cGANs, at a reasonable trained trainedwithoutwithout degreethethe structuredto-imageto-image graphical translation translation outputs. tasks, tasks, especially especially These networks those those involving involving learn a highlyloss highly 753752752 700699699 ofL1L1 accuracy. loss, loss, are are able ableTo our to to solve solve knowledge, this this problem problem this is at at the a a reasonable reasonablefirst demonstra- degree degree adaptedstructuredstructured to the graphical graphical task and outputs. dataoutputs. at hand, These These which networks networks makes learn themlearn aap- a loss loss 754753753 701700700 tionofof accuracy. accuracy. of GANs To To successfully our our knowledge, knowledge, generating this this is is “labels”, the the first first whichdemonstra- demonstra- are plicableadaptedadapted in to to a thewide the task task variety and and data of data settings. at at hand, hand, which which makes makes them them ap- ap- 755754754 701701 tiontion of of GANs GANs successfully successfully generating generating “labels”, “labels”, which which are are plicableplicable in in a a wide wide variety variety of of settings. settings. 755755 7 77 Map to aerial photo Aerial photo to map

input output input output

Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256 × 256 resolution, and run convo- lutionally on the larger images at test time). Contrast adjusted for clarity. Classification Ours Input Ground truth L1 cGAN L2 [62] (rebal.) [62] (L1 + cGAN) Ground truth

Figure 10: Applying a conditional GAN to semantic segmenta- tion. The cGAN produces sharp images that look at glance like the ground truth, but in fact include many small, hallucinated ob- jects.

ble5). We also tested the results of [62] and a variant of their method that used an L2 loss (see [62] for details). The Figure 9: Colorization results of conditional GANs versus the L2 conditional GAN scored similarly to the L2 variant of [62] regression from [62] and the full method (classification with re- (difference insignificant by bootstrap test), but fell short of balancing) from [64]. The cGANs can produce compelling col- [62]’s full method, which fooled participants on 27.8% of orizations (first two rows), but have a common failure mode of trials in our experiment. We note that their method was producing a grayscale or desaturated result (last row). specifically engineered to do well on colorization. in maps, which have rigid , than in aerial pho- 4.6. Semantic segmentation tographs, which are more chaotic. We trained colorization on ImageNet [51], and tested Conditional GANs appear to be effective on problems on the test split introduced by [62, 35]. Our method, with where the output is highly detailed or photographic, as is L1+cGAN loss, fooled participants on 22.5% of trials (Ta- common in image processing and graphics tasks. What by Kaihu Chen by Jack Qiao by Mario Klingemann #fotogenerator

sketch by Ivy Tsai by Bertrand Gondouin by Brannon Dorsey sketch by Yann LeCun Figure 11: Example applications developed by online community based on our pix2pix codebase: #edges2cats [3] by Christopher Hesse, Background removal [6] by Kaihu Chen, Palette generation [5] by Jack Qiao, Sketch → Portrait [7] by Mario Klingemann, Sketch→ Pokemon [1] by Bertrand Gondouin, “Do As I Do” pose transfer [2] by Brannon Dorsey, and #fotogenerator by Bosman et al. [4].

Loss Per-pixel acc. Per-class acc. Class IOU L1 0.86 0.42 0.35 cGAN 0.74 0.28 0.22 L1+cGAN 0.83 0.36 0.29 Table 6: Performance of photo→labels on cityscapes. about vision problems, like semantic segmentation, where the output is instead less complex than the input? To begin to test this, we train a cGAN (with/without L1 loss) on cityscape photo labels. Figure 10 shows qualita- Figure 12: Learning to see: Gloomy Sunday: An interactive artis- → tive results, and quantitative classification accuracies are re- tic demo developed by Memo Akten [8] based on our pix2pix ported in Table6. Interestingly, cGANs, trained without the codebase. Please click the image to play the video in a browser. L1 loss, are able to solve this problem at a reasonable degree the pix2pix code we released. Nonetheless, they demon- of accuracy. To our knowledge, this is the first demonstra- strate the promise of our approach as a generic commodity tion of GANs successfully generating “labels”, which are tool for image-to-image translation problems. nearly discrete, rather than “images”, with their continuous- valued variation2. Although cGANs achieve some success, 5. Conclusion they are far from the best available method for solving this problem: simply using L1 regression gets better scores than The results in this paper suggest that conditional adver- using a cGAN, as shown in Table6. We argue that for vi- sarial networks are a promising approach for many image- sion problems, the goal (i.e. predicting output close to the to-image translation tasks, especially those involving highly ground truth) may be less ambiguous than graphics tasks, structured graphical outputs. These networks learn a loss and reconstruction losses like L1 are mostly sufficient. adapted to the task and data at hand, which makes them ap- plicable in a wide variety of settings. 4.7. Community-driven Research Acknowledgments: We thank Richard Zhang, Deepak Since the initial release of the paper and our pix2pix Pathak, and Shubham Tulsiani for helpful discussions, Sain- codebase, the Twitter community, including computer vi- ing Xie for help with the HED edge detector, and the online sion and graphics practitioners as well as visual artists, have community for exploring many applications and suggesting successfully applied our framework to a variety of novel improvements. Thanks to Christopher Hesse, Memo Ak- image-to-image translation tasks, far beyond the scope of ten, Kaihu Chen, Jack Qiao, Mario Klingemann, Brannon the original paper. Figure 11 and Figure 12 show just a Dorsey, Gerda Bosman, Ivy Tsai, and Yann LeCun for al- few examples from the #pix2pix hashtag, including Back- lowing the use of their creations in Figure 11 and Figure 12. ground removal, Palette generation, Sketch Portrait, → This work was supported in part by NSF SMA-1514512, Sketch Pokemon, ”Do as I Do” pose transfer, Learning → NGA NURI, IARPA via Air Force Research Laboratory, In- to see: Gloomy Sunday, as well as the bizarrely popular tel Corp, Berkeley Deep Drive, and hardware donations by #edges2cats and #fotogenerator. Note that these applica- Nvidia. J.-Y.Z. is supported by the Facebook Graduate Fel- tions are creative projects, were not obtained in controlled, lowship. Disclaimer: The views and conclusions contained scientific conditions, and may rely on some modifications to herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorse- 2Note that the label maps we train on are not exactly discrete valued, as they are resized from the original maps using bilinear interpolation and ments, either expressed or implied, of IARPA, AFRL or the saved as jpeg images, with some compression artifacts. U.S. Government. Input Ground truth Output Input Ground truth Output

Figure 13: Example results of our method on Cityscapes labels→photo, compared to ground truth.

Input Ground truth Output Input Ground truth Output

Figure 14: Example results of our method on facades labels→photo, compared to ground truth. Input Ground truth Output Input Ground truth Output

Figure 15: Example results of our method on day→night, compared to ground truth.

Input Ground truth Output Input Ground truth Output

Figure 16: Example results of our method on automatically detected edges→handbags, compared to ground truth. Input Ground truth Output Input Ground truth Output

Figure 17: Example results of our method on automatically detected edges→shoes, compared to ground truth.

Input Output Input Output Input Output

Figure 18: Additional results of the edges→photo models applied to human-drawn sketches from [19]. Note that the models were trained on automatically detected edges, but generalize to human drawings Figure 19: Example results on photo inpainting, compared to [43], on the Paris StreetView dataset [14]. This experiment demonstrates that the U-net architecture can be effective even when the predicted pixels are not geometrically aligned with the information in the input – the information used to fill in the central hole has to be found in the periphery of these photos.

Input Ground truth Output Input Ground truth Output

Figure 20: Example results on translating thermal images to RGB photos, on the dataset from [27].

Day Night Labels Facade Labels Street scene

Edges Shoe Edges Handbag Sketch Shoe Sketch Handbag

Figure 21: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results. References [21] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.4 [1] Bertrand gondouin. https://twitter.com/ [22] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer bgondouin/status/818571935529377792. using convolutional neural networks. CVPR, 2016.4 Accessed, 2017-04-21.9 [23] J. Gauthier. Conditional generative adversarial nets for [2] Brannon dorsey. https://twitter.com/ convolutional face generation. Class Project for Stanford brannondorsey/status/806283494041223168. CS231N: Convolutional Neural Networks for Visual Recog- Accessed, 2017-04-21.9 nition, Winter semester, (5):2, 2014.2 [3] Christopher hesse. https://affinelayer.com/ [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, pixsrv/. Accessed: 2017-04-21.9 D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen- [4] Gerda bosman, tom kenter, rolf jagerman, and daan gosman. erative adversarial nets. In NIPS, 2014.2,4,6,7 https://dekennisvannu.nl/site/artikel/ Help-ons-kunstmatige-intelligentie-testen/ [25] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, 2001.1,4 9163. Accessed: 2017-08-31.9 [5] Jack qiao. http://colormind.io/blog/. Accessed: [26] G. E. Hinton and R. R. Salakhutdinov. Reducing the Science 2017-04-21.9 dimensionality of data with neural networks. , 313(5786):504–507, 2006.3 [6] Kaihu chen. http://www.terraai.org/ imageops/index.html. Accessed, 2017-04-21. [27] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon. Mul- 9 tispectral pedestrian detection: Benchmark dataset and base- line. In CVPR, 2015.4, 13, 16 [7] Mario klingemann. https://twitter.com/ quasimondo/status/826065030944870400. [28] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Accessed, 2017-04-21.9 Color!: Joint End-to-end Learning of Global and Local Im- [8] Memo akten. https://vimeo.com/260612034. Ac- age Priors for Automatic Image Colorization with Simulta- cessed, 2018-11-07.9 neous Classification. ACM Transactions on Graphics (TOG), 35(4), 2016.2 [9] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, 2005.1 [29] S. Ioffe and C. Szegedy. Batch normalization: Accelerating [10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and deep network training by reducing internal covariate shift. In A. L. Yuille. Semantic image segmentation with deep con- ICML, 2015.3,4 volutional nets and fully connected crfs. In ICLR, 2015.2 [30] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for [11] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. real-time style transfer and super-resolution. In ECCV, 2016. Sketch2photo: internet image montage. ACM Transactions 2,3 on Graphics (TOG), 28(5):124, 2009.1 [31] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning [12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, to generate images of outdoor scenes from attributes and se- R. Benenson, U. Franke, S. Roth, and B. Schiele. The mantic layouts. arXiv preprint arXiv:1612.00215, 2016.2 cityscapes dataset for semantic urban scene understanding. [32] D. Kingma and J. Ba. Adam: A method for stochastic opti- In CVPR, 2016.4, 16 mization. ICLR, 2015.4 [13] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep gen- [33] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient erative image models using a laplacian pyramid of adversar- attributes for high-level understanding and editing of outdoor ial networks. In NIPS, 2015.2 scenes. ACM Transactions on Graphics (TOG), 33(4):149, [14] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What 2014.1,4, 16 makes paris look like paris? ACM Transactions on Graphics, [34] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoen- 31(4), 2012.4, 13, 17 coding beyond pixels using a learned metric. In [15] A. Dosovitskiy and T. Brox. Generating images with per- ICML, 2016.3 ceptual similarity metrics based on deep networks. In NIPS, [35] G. Larsson, M. Maire, and G. Shakhnarovich. Learning rep- 2016.2 resentations for automatic colorization. ECCV, 2016.2,8, [16] A. A. Efros and W. T. Freeman. Image quilting for texture 16 synthesis and transfer. In SIGGRAPH, 2001.1,4 [36] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Te- [17] A. A. Efros and T. K. Leung. Texture synthesis by non- jani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single im- parametric sampling. In ICCV, 1999.4 age super-resolution using a generative adversarial network. [18] D. Eigen and R. Fergus. Predicting depth, surface normals In CVPR, 2017.2 and semantic labels with a common multi-scale convolu- [37] C. Li and M. Wand. Combining markov random fields and tional architecture. In ICCV, 2015.1 convolutional neural networks for image synthesis. CVPR, [19] M. Eitz, J. Hays, and M. Alexa. How do humans sketch 2016.2,4 objects? In SIGGRAPH, 2012.4, 12 [38] C. Li and M. Wand. Precomputed real-time texture synthe- [20] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. sis with markovian generative adversarial networks. ECCV, Freeman. Removing camera shake from a single photo- 2016.2,4 graph. ACM Transactions on Graphics (TOG), 25(3):787– [39] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional 794, 2006.1 networks for semantic segmentation. In CVPR, 2015.1,2,5 [40] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale [58] S. Xie and Z. Tu. Holistically-nested edge detection. In video prediction beyond mean square error. ICLR, 2016.2, ICCV, 2015.1,2,4 3 [59] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel- [41] M. Mirza and S. Osindero. Conditional generative adversar- level domain transfer. ECCV, 2016.2,3 ial nets. arXiv preprint arXiv:1411.1784, 2014.2 [60] A. Yu and K. Grauman. Fine-Grained Visual Comparisons [42] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adel- with Local Learning. In CVPR, 2014.4 son, and W. T. Freeman. Visually indicated sounds. In [61] A. Yu and K. Grauman. Fine-grained visual comparisons CVPR, 2016.5 with local learning. In CVPR, 2014. 16 [43] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. [62] R. Zhang, P. Isola, and A. A. Efros. Colorful image coloriza- Efros. Context encoders: Feature learning by inpainting. In tion. ECCV, 2016.1,2,5,7,8, 16 CVPR, 2016.2,3, 13, 17 [63] J. Zhao, M. Mathieu, and Y.LeCun. Energy-based generative [44] A. Radford, L. Metz, and S. Chintala. Unsupervised repre- adversarial network. In ICLR, 2017.2 sentation learning with deep convolutional generative adver- [64] Y. Zhou and T. L. Berg. Learning temporal transformations sarial networks. In ICLR, 2016.2,3, 16 from time-lapse videos. In ECCV, 2016.2,3,8 ˇ [45]R. S. Radim Tylecek.ˇ Spatial pattern templates for recogni- [65] J.-Y. Zhu, P. Krahenb¨ uhl,¨ E. Shechtman, and A. A. Efros. tion of objects with regular structure. In German Conference Generative visual manipulation on the natural image mani- on Pattern Recognition, 2013.4, 16 fold. In ECCV, 2016.2,4, 16 [46] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.2 [47] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. In ICLR Workshop, 2017. 2 [48] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.2 [49] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Ap- plications, 21:34–41, 2001.7 [50] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolu- tional networks for biomedical image segmentation. In MIC- CAI, 2015.2,3 [51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.4,8, 16 [52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Rad- ford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.2,4,5 [53] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.1 [54] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normal- ization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.4 [55] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.2, 3,5 [56] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to struc- tural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.2 [57] S. Xie, X. Huang, and Z. Tu. Top-down learning for struc- tured labeling with convolutional pseudoprior. In ECCV, 2015.2 6. Appendix 286 286 discriminator: C64-C128-C256-C512-C512-C512× 6.1. Network architectures We adapt our network architectures from those in [44]. Code for the models is available at 6.2. Training details https://github.com/phillipi/pix2pix. Random jitter was applied by resizing the 256 256 input × Let Ck denote a Convolution-BatchNorm-ReLU layer images to 286 286 and then randomly cropping back to × with k filters. CDk denotes a Convolution-BatchNorm- size 256 256. × Dropout-ReLU layer with a dropout rate of 50%. All convo- All networks were trained from scratch. Weights were lutions are 4 4 spatial filters applied with stride 2. Convo- × initialized from a Gaussian distribution with mean 0 and lutions in the encoder, and in the discriminator, downsample standard deviation 0.02. by a factor of 2, whereas in the decoder they upsample by a Cityscapes labels photo 2975 training images from → factor of 2. the Cityscapes training set [12], trained for 200 epochs, with random jitter and mirroring. We used the Cityscapes val- 6.1.1 Generator architectures idation set for testing. To compare the U-net against an encoder-decoder, we used a batch size of 10, whereas for The encoder-decoder architecture consists of: the objective function experiments we used batch size 1. encoder: C64-C128-C256-C512-C512-C512-C512-C512 We find that batch size 1 produces better results for the U- net, but is inappropriate for the encoder-decoder. This is decoder: CD512-CD512-CD512-C512-C256-C128-C64 because we apply batchnorm on all layers of our network, and for batch size 1 this operation zeros the activations on After the last layer in the decoder, a convolution is ap- the bottleneck layer. The U-net can skip over the bottleneck, plied to map to the number of output channels (3 in general, but the encoder-decoder cannot, and so the encoder-decoder except in colorization, where it is 2), followed by a Tanh requires a batch size greater than 1. Note, an alternative function. As an exception to the above notation, Batch- strategy is to remove batchnorm from the bottleneck layer. Norm is not applied to the first C64 layer in the encoder. See errata for more details. All ReLUs in the encoder are leaky, with slope 0.2, while Architectural labels photo 400 training images from ReLUs in the decoder are not leaky. → [45], trained for 200 epochs, batch size 1, with random jitter The U-Net architecture is identical except with skip con- and mirroring. Data were split into train and test randomly. nections between each layer i in the encoder and layer n i − Maps aerial photograph 1096 training images in the decoder, where n is the total number of layers. The ↔ skip connections concatenate activations from layer i to scraped from Google Maps, trained for 200 epochs, batch layer n i. This changes the number of channels in the size 1, with random jitter and mirroring. Images were − decoder: sampled from in and around New York City. Data were then split into train and test about the median latitude of the U-Net decoder: CD512-CD1024-CD1024-C1024-C1024-C512 sampling region (with a buffer region added to ensure that -C256-C128 no training pixel appeared in the test set). BW color 1.2 million training images (Imagenet train- → ing set [51]), trained for 6 epochs, batch size 4, with only ∼ 6.1.2 Discriminator architectures mirroring, no random jitter. Tested on subset of Imagenet The 70 70 discriminator architecture is: val set, following protocol of [62] and [35]. C64-C128-C256-C512× Edges shoes 50k training images from UT Zappos50K → After the last layer, a convolution is applied to map to dataset [61] trained for 15 epochs, batch size 4. Data were a 1-dimensional output, followed by a Sigmoid function. split into train and test randomly. As an exception to the above notation, BatchNorm is not Edges Handbag 137K Amazon Handbag images from → applied to the first C64 layer. All ReLUs are leaky, with [65], trained for 15 epochs, batch size 4. Data were split into slope 0.2. train and test randomly. All other discriminators follow the same basic architec- Day night 17823 training images extracted from 91 → ture, with depth varied to modify the receptive field size: webcams, from [33] trained for 17 epochs, batch size 4, 1 1 discriminator: with random jitter and mirroring. We use 91 webcams as × C64-C128 (note, in this special case, all convolutions are training, and 10 webcams for test. 1 1 spatial filters) Thermal color photos 36609 training images from set × → 16 16 discriminator: 00–05 of [27], trained for 10 epochs, batch size 4. Images × C64-C128 from set 06-11 are used for testing. Photo with missing pixels inpainted photo 14900 → training images from [14], trained for 25 epochs, batch size 4, and tested on 100 held out images following the split of [43]. 6.3. Errata For all experiments reported in this paper with batch size 1, the activations of the bottleneck layer are zeroed by the batchnorm operation, effectively making the innermost layer skipped. This issue can be fixed by removing batch- norm from this layer, as has been done in the public code. We observe little difference with this change and therefore leave the experiments as is in the paper. 6.4. Change log arXiv v2 Reran generator architecture comparisons (Section 4.3) with batch size equal to 10 rather than 1, so that bottleneck layer is not zeroed (see Errata). Reran FCN-scores with minor details cleaned up (re- sults saved losslessly as pngs, removed unecessary downsampling). FCN-scores computed using scripts at https://github.com/phillipi/pix2pix/tree/ master/scripts/eval cityscapes, commit d7e7b8b. Updated several figures and text. Added addi- tional results on thermal color photos and inpainting, as → well as community contributions. arXiv v3 Added additional results on community contri- butions. Fixed minor typos.