Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis

Yiyi Liao¹,²,* Katja Schwarz¹,²,* Lars Mescheder¹,²,³,† Andreas Geiger¹,²
¹Max Planck Institute for Intelligent Systems, Tübingen  ²University of Tübingen  ³Amazon, Tübingen
{firstname.lastname}@tue.mpg.de

* Joint first authors with equal contribution.
† This work was done prior to joining Amazon.

Abstract

In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent wrt. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.

Figure 1: Motivation. While classical rendering pipelines require a holistic scene representation, image-based generative neural networks are able to synthesize images from a latent code and can hence be trained directly from images. We propose to consider the generative process in both 3D and 2D space, yielding a 3D controllable image synthesis model which learns both the 3D content creation process as well as the rendering process jointly from raw images.

1. Introduction

Realistic image synthesis is a fundamental task in computer vision and graphics with many applications including gaming, simulation, and virtual reality. In all these applications, it is essential that the image synthesis algorithm allows for controlling the 3D content and produces coherent images when changing the camera position and object locations. Imagine exploring a virtual reality room. When walking around and manipulating objects in the room, it is essential to observe coherent images, i.e., manipulating the pose of one object should not alter properties (such as the color) of any other object in the scene.

Currently, image synthesis in these applications is realized using rendering engines (e.g., OpenGL) as illustrated in Fig. 1 (top). This approach provides full control over the 3D content since the input to the rendering engine is a holistic description of the 3D scene. However, creating photorealistic content and 3D assets for a movie or video game is an extremely expensive and time-consuming process and requires the concerted effort of many 3D artists. In this paper, we therefore ask the following question: Is it possible to learn the simulation pipeline including 3D content creation from raw 2D image observations?

Recently, Generative Adversarial Networks (GANs) have achieved impressive results for photorealistic image synthesis [21, 22, 29] and hence emerged as a promising alternative to classical rendering algorithms, see Fig. 1 (middle). However, a major drawback of image-based GANs is that the learned latent representation is typically "3D entangled". That is, the latent dimensions do not automatically expose physically meaningful 3D properties such as camera viewpoint, object pose or object entities. As a result, 2D GANs fall short in terms of controllability compared to classical rendering algorithms. This limits their utility for many applications including virtual reality, data augmentation and simulation.

Contribution: In this work, we propose the new task of 3D Controllable Image Synthesis. We define this task as an unsupervised learning problem, where a 3D controllable generative image synthesis model that allows for manipulating 3D scene properties is learned without 3D supervision. The 3D controllable properties include the 3D pose, shape and appearance of multiple objects as well as the viewpoint of the camera. Learning such a model from 2D supervision alone is very challenging as the model must reason about the modular composition of the scene as well as physical 3D properties of the world such as light transport.

Towards a solution to this problem, we propose a novel formulation that combines the advantages of deep generative models with those of traditional rendering pipelines. Our approach enables 3D controllability when synthesizing new content while learning meaningful representations from images alone. Our key idea is to learn the image generation process jointly in 3D and 2D space by combining a 3D generator with a differentiable renderer and a 2D image synthesis model as illustrated in Fig. 1 (bottom). This allows our model to learn abstract 3D representations which conform to the physical image formation process, thereby retaining interpretability and controllability. We demonstrate that our model is able to disentangle the latent 3D factors of simple scenes composed of multiple objects and allows for synthesizing images that are consistent wrt. changes in viewpoint or object pose as illustrated in Fig. 2. We also demonstrate that reasoning in 3D improves consistency wrt. pose and viewpoint changes. Our code and data are provided at https://github.com/autonomousvision/controllable_image_synthesis.

Figure 2: 3D Controllable Image Synthesis. We propose a model for 3D controllable image synthesis which allows for manipulating the viewpoint and 3D pose of individual objects. Here, we illustrate consistent translation in 3D space, object rotation and viewpoint changes.

2. Related Work

Image Decomposition: Several works have considered the problem of decomposing a scene, conditioned on an input image [4, 9, 11–13, 24, 39, 43]. One line of work uses RNNs to sequentially decompose the scene [4, 9, 24, 39], whereas other approaches iteratively refine the image segmentation [11–13, 43]. While early works consider very simple scenes [9, 24] or binarized images [12, 13, 43], recent works [4, 11] also handle scenes with occlusions. In contrast to the problem of learning controllable generative models, the aforementioned approaches are purely discriminative and are not able to synthesize novel scenes. Furthermore, they operate purely in the 2D image domain and thus lack an understanding of the physical 3D world.

Image Synthesis: Unconditional Generative Adversarial Networks (GANs) [10, 30, 34] have greatly advanced photorealistic image synthesis by learning deep networks that are able to sample from the natural image manifold. More recent works condition these models on additional input information (e.g., semantic segmentations) to guide the image generation process [2, 6, 19, 20, 26, 36, 42, 46, 50]. Another line of work learns interpretable, disentangled features directly from the input data for manipulating the generated images [7, 22, 35, 49]. More recent work aims at controlling the image generation process at the object level [8, 40, 48]. The key insight is that this splits the complex task of image generation into easier subproblems, which benefits the quality of the generated images. However, all of these methods operate based on a 2D understanding of the scene. While some of these works are able to disentangle 3D pose in the latent representation [7], full 3D control in terms of object translations and rotations as well as novel viewpoints remains a challenging task. In this work, we investigate the problem of learning an interpretable intermediate representation at the object level that is able to provide full 3D control over all objects in the scene.

3D-Aware Image Synthesis: Several works focus on 3D controllable image synthesis with 3D supervision [45, 52] or using 3D information as input [1]. In this work, we instead focus on learning from 2D images alone. The problem of learning discriminative 3D models from 2D images is often addressed using differentiable rendering [15, 23, 25, 27]. Our work is related to these works in that we also exploit a differentiable renderer. However, unlike the aforementioned methods, we combine differentiable rendering with generative models in 3D and 2D space. We learn an abstract 3D representation that can be transformed into photorealistic images via a neural network. Previous works in this direction can be categorized as either implicit or explicit, depending on whether the learned features have a physical meaning.

Implicit methods [37, 47] apply a rotation to a 3D latent feature vector to generate transformed images through a multilayer perceptron. This feature vector is a global representation as it influences the entire image. In contrast, explicit methods [15, 32, 38] exploit feature volumes that can be differentiably projected into 2D image space. DeepVoxels [38] proposes a novel view synthesis approach which unprojects multiple images into a 3D volume and subsequently generates new images by projecting the 3D feature volume to novel viewpoints. Albeit generating high quality images, their model is not generative and requires hundreds of images from the same object for training. In contrast, HoloGAN [32] and PLATONICGAN [15] are generative models that are able to synthesize 3D consistent images of single objects by generating volumetric 3D representations based on a differentiable mapping. However, all these approaches are only able to learn models for single objects or static scenes. Therefore, their controllability is limited to object-centric rotations or camera viewpoint changes. In contrast, we propose a model for learning disentangled 3D representations which allows for manipulating multiple objects in the scene individually.

3. Method

Our goal is to learn a controllable generative model for image synthesis from raw image data. We advocate that image synthesis should benefit from an interpretable 3D representation as this allows to explicitly take into account the image formation process (i.e., perspective projection, occlusion, light transport), rather than trying to directly learn the distribution of 2D images. We thus model the image synthesis process jointly in 3D and 2D space by first generating an abstract 3D representation that is subsequently projected to and refined in the 2D image domain.

Our model is illustrated in Fig. 3. We first sample a latent code that represents all properties of the generated scene. The latent code is passed to a 3D generative model which generates a set of 3D abstract object representations in the form of 3D primitives as well as a background representation. The 3D representations are projected onto the image plane where a 2D generator transforms them into object appearances and composites them into a coherent image. As supervision at the primitive level is hard to acquire, we propose an approach based on adversarial training which does not require 3D supervision.

We now specify our model components in detail. One contribution of our paper is to analyze and compare various 3D representations regarding their suitability for this task. We thus first define the 3D primitive representations that we consider in this work. In the next section, we describe our model for generating the primitive parameters as well as the rendering step that projects the 3D primitives to 2D feature maps in the image plane. Finally, we describe the 2D generator which synthesizes and composites the image as well as the loss functions that we use for training our model in an end-to-end fashion from unlabeled images.
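To make the three-stage structure concrete, the sketch below wires the stages together in the order described above. It is only a schematic illustration under our own naming and shape assumptions (the stage bodies are stubs, not the authors' architecture); its purpose is to show what flows between the 3D generator, the projection layer and the 2D generator.

```python
# Schematic sketch of the pipeline (3D generator -> differentiable projection
# -> 2D generator -> composition). All names, shapes and the stub stage
# implementations are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
N_OBJECTS, M_POINTS, K_FEAT, RES = 3, 128, 8, 64

def generator_3d(z):
    """Map a latent code to N foreground primitives plus one background primitive."""
    return [{"scale": np.ones(3), "rotation": np.eye(3),
             "translation": rng.normal(size=3),
             "features": rng.normal(size=(M_POINTS, K_FEAT))}
            for _ in range(N_OBJECTS + 1)]

def project(primitive):
    """Stub for the differentiable projection: feature, alpha and depth maps."""
    return (rng.normal(size=(RES, RES, K_FEAT)),      # X_i
            rng.uniform(size=(RES, RES)),             # A_i
            rng.uniform(1.0, 5.0, size=(RES, RES)))   # D_i

def generator_2d(x, a, d):
    """Stub for the per-primitive refinement network: color, alpha, depth."""
    return np.tanh(x[..., :3]), a, d

def synthesize(z):
    layers = [generator_2d(*project(o)) for o in generator_3d(z)]
    colors = np.stack([c for c, _, _ in layers])
    alphas = np.stack([a for _, a, _ in layers])
    depths = np.stack([d for _, _, d in layers])
    # Simple per-pixel "nearest primitive wins" fusion; the proper alpha
    # composition in ascending depth order is described in Sec. 3.4.
    nearest = depths.argmin(axis=0)
    h, w = np.indices(nearest.shape)
    return colors[nearest, h, w] * alphas[nearest, h, w, None]

print(synthesize(rng.normal(size=256)).shape)  # (64, 64, 3)
```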
3.1. 3D Representations

We aim for a disentangled 3D object representation that is suitable for end-to-end learning while at the same time allowing full control over the geometric properties of each individual scene element, including the object's scale and pose. We refer to this abstract object representation using the term "primitive" in the following.

Further, we use the term "disentangled representation" to describe a set of distinct, informative factors of variation in the observations [3, 28]. We model these factors with a set of 3D primitives O = {o_bg, o_1, ..., o_N} to disentangle the foreground objects and the scene background. Each foreground object is described by a set of attributes o_i = (s_i, R_i, t_i, φ_i) which represent the properties of each object instance. Here, s_i ∈ R^3 denotes the 3D object scale, R_i ∈ SO(3) and t_i ∈ R^3 are the primitive's 3D pose parameters, and φ_i is a feature vector determining its appearance. To model the scene background, we introduce an additional background primitive o_bg defined accordingly.

As it is unclear which 3D representation is best suited for our task, we compare and discuss the following feature representations in our experimental evaluation:

Point Clouds: Point clouds are one of the simplest forms to represent 3D information. We represent each primitive with a set of sparse 3D points. More specifically, each φ_i = (φ_i^l, φ_i^f) represents the locations φ_i^l ∈ R^{M×3} and feature vectors φ_i^f ∈ R^{M×K} of M points with latent feature dimension K. We apply scaling s_i, rotation R_i and translation t_i to the point locations φ_i^l.

Cuboids and Spheres: We further consider cuboids and spheres as representations which are more closely aligned with classical mesh-based models used in computer graphics. In this case, the feature vector φ_i represents a texture map that is attached to the surface of the cuboid or sphere, respectively. The geometry is determined by scaling, rotating and translating the primitive via s_i, R_i and t_i.

Background: In order to allow for changes in the camera viewpoint, we do not use a vanilla 2D GAN for generating a background image. Instead, we represent the background using a spherical environment map. More specifically, we attach the background feature map φ_bg as a texture map to the inside of a sphere located at the origin, i.e., R_bg = I and t_bg = 0. The scale s_bg of the background sphere is fixed to a value that is large enough to enclose the entire scene including the camera.
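As a concrete illustration of the primitive parameterization above, the snippet below encodes o_i = (s_i, R_i, t_i, φ_i) for the point-cloud case and applies the scale, rotation and translation to the canonical point locations. The class name, the toy sizes M and K, and the example values are our own assumptions for illustration.

```python
# Sketch of a point-cloud primitive o_i = (s_i, R_i, t_i, phi_i) and the
# similarity transform applied to its point locations (Sec. 3.1).
from dataclasses import dataclass
import numpy as np

@dataclass
class Primitive:
    scale: np.ndarray        # s_i in R^3
    rotation: np.ndarray     # R_i in SO(3), shape (3, 3)
    translation: np.ndarray  # t_i in R^3
    locations: np.ndarray    # phi_i^l in R^{M x 3}, canonical point locations
    features: np.ndarray     # phi_i^f in R^{M x K}, per-point appearance features

    def posed_points(self) -> np.ndarray:
        """Apply scaling, rotation and translation to the canonical points."""
        scaled = self.locations * self.scale      # per-axis scaling
        rotated = scaled @ self.rotation.T        # rotate into the world frame
        return rotated + self.translation         # translate

# Example: one primitive with M = 4 points and K = 2 feature channels.
rng = np.random.default_rng(0)
prim = Primitive(scale=np.array([1.0, 0.5, 2.0]),
                 rotation=np.eye(3),
                 translation=np.array([0.0, 0.0, 3.0]),
                 locations=rng.uniform(-0.5, 0.5, size=(4, 3)),
                 features=rng.normal(size=(4, 2)))
print(prim.posed_points().shape)  # (4, 3)
```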

Figure 3: Method Overview. Our model comprises three main parts: a 3D Generator which maps a latent code z drawn from a Gaussian distribution into a set of abstract 3D primitives {o_i}, a Differentiable Projection layer which takes each 3D primitive o_i as input and outputs a feature map X_i, an alpha map A_i and a depth map D_i, and a 2D Generator which refines them and produces the final composite image Î. We refer to the entire model as g_θ: z ↦ Î and train it by minimizing a compactness loss L_com, a geometric consistency loss L_geo and an adversarial loss L_adv which compares the predicted image Î to images I from the training set. The last primitive o_bg generates the background image X'_bg. The flag c determines if the adversarial loss compares full composite images (c = 1) or only the background (c = 0) to the training set.

3.2. 3D Generator

The first part of our neural network is an implicit generative 3D model which generates the set of latent 3D primitives {o_bg, o_1, ..., o_N} from a noise vector z ∼ N(0, I). We use a fully connected model architecture with a fixed number of N + 1 heads to generate the attributes for the N foreground primitives and the background. Note that early layers share weights such that all primitives are generated jointly and share information. More formally we have:

$g_\theta^{3D}: \mathbf{z} \mapsto \{\mathbf{o}_{bg}, \mathbf{o}_1, \dots, \mathbf{o}_N\}$   (1)

where θ denote the parameters of the 3D generation layer. Fig. 3 (left) illustrates our 3D generator. Details about the architecture can be found in the supplementary.

3.3. Differentiable Projection

The differentiable projection layer takes each of the generated primitives o_i ∈ O as input and converts each of them separately into a feature map X_i ∈ R^{W×H×F}, where W × H are the dimensions of the output image and F denotes the number of feature channels. In addition, it generates a coarse alpha map A_i ∈ R^{W×H} and an initial depth map D_i ∈ R^{W×H} for each primitive which are further refined by the 2D generator that is described in the next section. We implement the differentiable projection layer as follows, assuming (without loss of generality) a fixed calibrated camera with intrinsics K ∈ R^{3×3}.

Cuboids, Spheres and Background: As cuboids, spheres and the background are represented in terms of a surface mesh, we exploit a differentiable mesh renderer to project them onto the image domain. More specifically, we use the Soft Rasterizer of Liu et al. [27] and adapt it to our purpose. In addition to the projected features X_i, we obtain an alpha map A_i by smoothing the silhouette of the projected object using a Gaussian kernel. We further output the depth map D_i of the projected mesh.

Point Clouds: For point clouds we follow prior works [18] and represent them in a smooth fashion using isotropic 3D Gaussians. More specifically, we project all features onto an initially empty feature map X_i and smooth the result using a Gaussian blur kernel. This allows us to learn both the locations φ_i^l as well as the features φ_i^f while also backpropagating gradients to the pose parameters {s_i, R_i, t_i} of the i-th primitive. Since point clouds are sparse, we determine the initial alpha map A_i and depth map D_i by additionally projecting a cuboid with the same scale, rotation and translation onto the image plane as described above.
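For the point-cloud branch described above, the following forward-only sketch projects posed points with fixed intrinsics K, splats their features onto an initially empty map and smooths the result with a Gaussian kernel. It is a plain numpy illustration under assumed image size and kernel width; the paper's layer is implemented with differentiable renderers [18, 27] so that gradients reach the pose parameters.

```python
# Forward-only sketch of the point-cloud projection (Sec. 3.3): perspective
# projection with intrinsics K, nearest-pixel splatting and Gaussian smoothing.
# Image size, kernel width and all names are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def project_points(points, features, K, resolution=(64, 64), sigma=1.5):
    """points: (M, 3) in camera coordinates (z > 0), features: (M, C)."""
    h, w = resolution
    feat_map = np.zeros((h, w, features.shape[1]))
    depth_map = np.full((h, w), np.inf)

    uvw = points @ K.T                    # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3]         # normalize by depth
    px = np.round(uv).astype(int)

    for (u, v), z, f in zip(px, points[:, 2], features):
        if 0 <= u < w and 0 <= v < h and z < depth_map[v, u]:
            depth_map[v, u] = z           # keep the closest point per pixel
            feat_map[v, u] = f

    # Smooth the sparse splats with an isotropic Gaussian (per channel).
    feat_map = gaussian_filter(feat_map, sigma=(sigma, sigma, 0))
    return feat_map, depth_map

# Toy pinhole camera and M = 200 random points in front of it.
rng = np.random.default_rng(0)
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])
pts = rng.uniform([-0.5, -0.5, 2.0], [0.5, 0.5, 3.0], size=(200, 3))
feats = rng.normal(size=(200, 8))
X_i, D_i = project_points(pts, feats, K)
print(X_i.shape, D_i.shape)  # (64, 64, 8) (64, 64)
```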

3.4. 2D Generator

Learning a 3D primitive for each object avoids explicitly reconstructing their exact 3D models, but the projected features remain abstract. We learn to transform this abstract representation into a photorealistic image using a 2D generative model. More specifically, for each primitive, we use a fully convolutional network which takes the features, the initial alpha map and the initial depth map as input and refines them, yielding a color rendering, a refined alpha map and a refined depth map of the corresponding object:

$g_\theta^{2D}: \mathbf{X}_i, \mathbf{A}_i, \mathbf{D}_i \mapsto \mathbf{X}'_i, \mathbf{A}'_i, \mathbf{D}'_i$   (2)

where θ denote the parameters of the network that are shared across the objects and the background¹. We use a standard encoder-decoder structure based on ResNet [14] as our 2D generator. See the supplementary material for details. It is important to note that we perform amodal object prediction, i.e., all primitives are predicted separately without taking into account occlusion. Thus, our 2D generator can learn to generate the entire object for each primitive, even if it is partially occluded in some of the images.

The final step in our 2D generation layer is to combine the individual predictions into a composite image Î. Towards this goal, we fuse all predictions using alpha composition in ascending order of depth at each pixel. Implementation details are provided in the supplementary.

3.5. Loss Functions

We train the entire model end-to-end using adversarial training. Importantly, we do not rely on supervision in the form of labeled 3D primitives, instance segmentations or pose annotations. The only input to our method are images of scenes with a varying number of objects in different poses, from varying viewpoints and with varying backgrounds. Learning such a model without supervision is a challenging task with many ambiguities. The model could for instance learn to explain two objects with a single primitive or even to generate all foreground objects with the background model, setting all alpha maps A_i to zero. We therefore introduce multiple loss functions that encourage a disentangled and interpretable 3D representation while at the same time synthesizing images from the training data distribution. Our loss L is composed of three terms:

$\mathcal{L}(\theta, \psi) = \sum_{c \in \{0,1\}} \mathcal{L}_{adv}(\theta, \psi, c) + \mathcal{L}_{com}(\theta) + \mathcal{L}_{geo}(\theta)$   (3)

Adversarial Loss: We use a standard adversarial loss [10] to encourage that the images generated by our model follow the data distribution. Let I denote an image sampled from the data distribution p_D(I) and let g_θ: z ↦ Î denote the entire generative model. Let further d_ψ(I) denote the discriminator. Our adversarial loss is formulated as follows:

$\mathcal{L}_{adv}(\theta, \psi, c) = \mathbb{E}_{p(\mathbf{z})}\big[f(d_\psi(g_\theta(\mathbf{z}, c), c))\big] + \mathbb{E}_{p_D(\mathbf{I}|c)}\big[f(-d_\psi(\mathbf{I}, c))\big]$   (4)

with f(t) = −log(1 + exp(−t)). Note that we condition both the generator g_θ(z, c) as well as the discriminator d_ψ(I, c) on an additional variable c ∈ {0, 1} which determines if the generated/observed image is a full composite image (c = 1) or a background image (c = 0). This condition is helpful to disentangle foreground objects from the background as evidenced by our experiments. For training our model, we thus collect two datasets: one that includes foreground objects and one with empty background scenes.

Compactness Loss: To bias solutions towards compact representations and to encourage the 3D primitives to tightly encase the objects, we minimize the projected shape of each object. We formulate this constraint as a penalty on the l1-norm of the alpha maps A_i of each object:

$\mathcal{L}_{com}(\theta) = \mathbb{E}_{p(\mathbf{z})}\Big[\sum_{i=1}^{N} \max\Big(\tau, \frac{\|\mathbf{A}_i\|_1}{H \times W}\Big)\Big]$   (5)

Here, τ = 0.1 is a truncation threshold which avoids shrinkage below a fixed minimum size and A_i depends² on the model parameters θ and the latent code z.

Geometric Consistency Loss: To favor solutions that are consistent across camera viewpoints and 3D object poses, we follow [33] and encourage the learned generative model to conform to multi-view geometry constraints. For instance, a change in the pose parameters (R_i, t_i) should change the pose of the object but should alter neither its color nor its identity. We formulate this constraint as follows:

$\mathcal{L}_{geo}(\theta) = \mathbb{E}_{p(\mathbf{z})}\Big[\sum_{i=1}^{N} \|\mathbf{A}'_i \odot (\mathbf{X}'_i - \tilde{\mathbf{X}}'_i)\|_1\Big] + \mathbb{E}_{p(\mathbf{z})}\Big[\sum_{i=1}^{N} \|\mathbf{A}'_i \odot (\mathbf{D}'_i - \tilde{\mathbf{D}}'_i)\|_1\Big]$   (6)

Here, (X'_i, D'_i) are the outputs of the 2D generator for latent code z. (X̃'_i, D̃'_i) correspond to the outputs when using the same latent code z, but adding random noise to the pose parameters of each primitive and warping the results back to the original viewpoint. The warping function is determined by the predicted depth and the relative transformation between the two views. The operator ⊙ denotes elementwise multiplication and makes sure that geometric consistency is only enforced inside the foreground region, i.e., where A'_i = 1. Intuitively, the geometric consistency loss encourages objects with the same appearance but which are observed from a different view to agree both in terms of appearance and depth. At the same time it serves as an unsupervised loss for training the depth prediction model. For details on the warping process, we refer the reader to the supplemental material.

¹We slightly abuse notation and use the same symbol θ to describe the parameters of the 2D and the 3D generative models.
²We omit this dependency here for clarity.
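Two of the operations above are easy to state compactly in code: the depth-ordered alpha composition of Sec. 3.4 and the compactness penalty of Eq. (5). The sketch below spells out both for a single latent sample; the function names and toy shapes are our own, and in the actual model these operations act on the refined maps (X'_i, A'_i, D'_i) inside an end-to-end differentiable graph.

```python
# Sketch of depth-ordered alpha composition (Sec. 3.4) and the compactness
# loss of Eq. (5) for one sample. Shapes and helper names are illustrative.
import numpy as np

def alpha_composite(colors, alphas, depths):
    """colors: (N, H, W, 3), alphas: (N, H, W), depths: (N, H, W).
    Fuses the per-primitive predictions front-to-back at every pixel."""
    order = np.argsort(depths, axis=0)          # ascending depth per pixel
    image = np.zeros(colors.shape[1:])
    transmittance = np.ones(colors.shape[1:3])  # light not yet absorbed
    for rank in range(colors.shape[0]):
        idx = order[rank]                        # (H, W) primitive indices
        a = np.take_along_axis(alphas, idx[None], axis=0)[0]
        c = np.take_along_axis(colors, idx[None, ..., None], axis=0)[0]
        image += transmittance[..., None] * a[..., None] * c
        transmittance *= (1.0 - a)
    return image

def compactness_loss(alphas, tau=0.1):
    """Eq. (5): mean alpha per object, truncated at tau, summed over objects."""
    n, h, w = alphas.shape
    per_object = np.abs(alphas).sum(axis=(1, 2)) / (h * w)  # ||A_i||_1 / (H*W)
    return np.maximum(tau, per_object).sum()

rng = np.random.default_rng(0)
N, H, W = 3, 64, 64
img = alpha_composite(rng.uniform(size=(N, H, W, 3)),
                      rng.uniform(size=(N, H, W)),
                      rng.uniform(1.0, 5.0, size=(N, H, W)))
print(img.shape, compactness_loss(rng.uniform(size=(N, H, W))))
```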

3.6. Training

We train our model using RMSprop [41] and a learning rate of 10^-4. To stabilize GAN training, we use gradient penalties [29] and spectral normalization [31] for the discriminator. Following [22], we use adaptive instance normalization [17] for the 2D generator. We randomly sample the camera viewpoint during training from an upper-half sphere around the origin.

Figure 4: Datasets. Random samples from each dataset (Car w/o BG, Car with BG, Indoor, Fruit).

4. Experiments

In this section, we first compare our approach to several baselines on the task of 3D controllable image generation, both on synthetic and real data. Next, we conduct a thorough ablation study to better understand the influence of different representations and architecture components.

Dataset: We render synthetic datasets using objects from ShapeNet [5], considering three datasets with varying difficulty. Two datasets contain cars, one with and the other without background. For both datasets, we randomly sample 1 to 3 cars from a total of 10 different car models. Our third dataset is the most challenging of these three. It comprises indoor scenes containing objects of different categories, including chairs, tables and sofas. As background we use empty room images from Structured3D [51], a synthetic dataset with photo-realistic 2D images. For each dataset we render 48k real images I and 32k unpaired background images I_bg for training, as well as 9.6k images for validation and testing. The image resolution is 64 × 64 for all datasets. In addition to the synthetic datasets, we apply our method to a real-world dataset containing 800 images of 5 fruits with 5 different backgrounds. In order to train our model, we also collect a set of unpaired background images. Sample images from our datasets are shown in Fig. 4.

Baselines: We first compare our method to several baselines on the Car dataset and the Indoor dataset. Vanilla GAN [29]: a state-of-the-art 2D GAN with gradient penalties. Layout2Im [50]: a generative 2D model that generates images conditioned on 2D bounding boxes. 2D baseline: to investigate the advantage of our 3D representation, we implement a 2D version of our method by replacing the 3D generator with a 2D feature generator, i.e., learning a set of 2D feature maps instead of 3D primitives. Ours w/o c: we train our method without background images I_bg to investigate if the method is able to disentangle foreground and background without extra supervision. All implementation details are provided in the supplementary.

Metrics: We measure the quality of the generated images using the Fréchet Inception Distance (FID) [16]. More specifically, we compute the FID score between the distribution of real images I and the distribution of generated images Î, excluding the background images I_bg. In our ablation study, we also report FID_i to quantify the disentanglement of object instances. FID_i measures the distance between rendered single-object images and our images for each primitive before alpha composition. Besides evaluating on the generated samples, we also apply additional random rotation and translation to the 3D primitives and report FID scores on these transformed samples (FID_t, FID_R). Similarly, we apply random translation to Layout2Im and the 2D baseline. We investigate how well our model captures the underlying 3D distribution in the supplementary.
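For reference, the FID scores used above reduce to the closed-form Fréchet distance between Gaussian fits of Inception features from real and generated images; FID_t and FID_R simply swap in features of transformed samples. The sketch below shows that final computation only, with random placeholder features standing in for Inception activations.

```python
# Sketch of the Frechet Inception Distance between two feature sets:
# FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
# Extracting Inception activations is omitted; the inputs below are placeholders.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))            # e.g. pooled feature vectors
fake = rng.normal(loc=0.1, size=(500, 64))
print(frechet_distance(real, fake))
```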
4.1. Controllable Image Generation

We now report our results for the challenging task of 3D controllable image generation on synthetic and real data.

Synthetic Data: Table 1 compares the FID scores on the Car and Indoor datasets. Comparing the quantitative results, we see that our method achieves a competitive FID score compared to Layout2Im. However, Layout2Im requires 2D bounding boxes as supervision while our method operates in an unsupervised manner. To test our hypothesis that 3D representations are helpful for controllable image synthesis, we measure FID scores FID_t and FID_R on transformed samples. Our results show that the FID scores are relatively stable wrt. random rotations and translations, demonstrating that the perturbed images follow the same distribution. Without background supervision (w/o c), our method can disentangle foreground from background if the background appearance is simple (e.g., Car dataset). However, in the case of more complex background appearance (e.g., Indoor dataset) the foreground primitives vanish while the background generates the entire image, resulting in a higher FID score. In contrast, with unpaired background images as supervision, our method is able to disentangle foreground from background even in the presence of complex backgrounds.

Table 1: FID on Car dataset and Indoor dataset.

In Fig. 5 and Fig. 6, we show qualitative results when transforming the objects. By translating the 2D bounding boxes, Layout2Im achieves 2D controllability that can even handle occlusion. However, the images lack consistency: moving an object changes its identity, indicating a failure of the model to disentangle the latent factors. Moreover, Layout2Im fuses the objects with a recurrent module. Therefore, manipulating one object also affects other objects in the scene as well as the background. Note for example how the black car in the leftmost image in Fig. 5 becomes yellow during translation or how the background in both Fig. 5 and Fig. 6 varies. Our 2D baseline is able to better disentangle the objects, while it struggles to learn the correct occlusion relationship. In contrast, our model is able to disentangle the latent factors correctly and produces images that are consistent wrt. object rotation and translation.

Figure 5: Car Dataset. We translate one object for all methods. Additionally, we rotate one object with our method which cannot be achieved with the baselines.

Figure 6: Indoor Dataset. We translate one object for all methods. Additionally, we rotate one object with our method which cannot be achieved with the baselines.

Real Data: Qualitative results of our method on the real-world dataset are shown in Fig. 7. We see that our method is able to synthesize plausible images even from real data. Making use of our disentangled 3D representation, we are able to change the arrangement of fruits while our model properly handles occlusions and shadows.

Figure 7: Fruit Dataset. We translate one fruit in each row.

4.2. Ablation Study

We now compare different representations and analyze multiple variants of our framework in an ablation study. In this study, we focus on the Car dataset without background, which we use for all experiments.

Point cloud, sphere or cuboid? We first compare the performance of different 3D representations for the task of 3D controllable image synthesis. Table 2 compares the different representations in terms of their 3D controllability (FID_t, FID_R) and disentanglement capability (FID_i). We observe that the different representations achieve very similar FID scores, suggesting that the precise form of the 3D representation is less relevant than the general concept of a joint 3D-2D representation. This can also be observed from Fig. 8, which illustrates the different representations and the corresponding generated images. The sphere representation performs best overall. We hypothesize that it is superior to the cuboid representation as it does not suffer as much from surface discontinuities in 3D space.

Table 2: Ablation Study. FID on Car dataset without background wrt. different primitive representations and architecture components.

Figure 8: Ablation Study. (a) point cloud, (b) cuboid, (c) sphere, (d) deformable sphere w/o g_θ^{2D}, (e) single primitive. The first two rows show the projected primitives A'_i (colors denote instances) and the composite images. Note that we visualize all primitives in one image while they are rendered and fed into the 2D generator individually during training.

What is the advantage of disentanglement? To investigate the advantage of our object-wise generation pipeline, we create a variant of our method using only a single primitive to represent the entire scene. This is illustrated in Fig. 8 (e) and evaluated quantitatively in Table 2. To ensure a fair comparison, the single-primitive model has the same capacity as our multi-primitive model. Not surprisingly, the single-primitive model is also able to generate images with low FID scores. However, this technique does not allow full control over the individual objects and is not able to generate single-object images. Moreover, the number of objects changes when the camera viewpoint changes as shown in Fig. 9 (top row).

Is the learned representation consistent in 3D? This task asks for a model that can generate scenes consistent across camera viewpoints and 3D object poses. To test this ability, we rotate the camera and generate new samples as shown in Fig. 9. Our method generates locally coherent images due to its architecture design. In contrast, the single-primitive baseline fails to preserve the object identity. In addition, we evaluate if adding the geometric consistency loss can further improve performance in Fig. 9. While we do not observe significant performance gains, geometric consistency allows for learning to synthesize depth alongside object appearance. See the supplementary for qualitative results.

Figure 9: Evaluation of Rotation Consistency. Images sampled from 180° of camera rotation. Top row: single primitive representation. Middle and bottom rows: multi-primitive model w/o and with the geometric consistency loss. The number of objects varies wrt. camera rotation for the single primitive representation.

Can we learn the 3D model directly? Inspired by previous works [23, 27] that learn a 3D mesh directly from 2D images, we also explore whether we can directly learn an accurate 3D model in our generative setting. In other words, we investigate if the 2D generator is helpful or not. To answer this question, we remove the 2D generator from the pipeline and directly add the adversarial loss on the rendered image X_i. More specifically, we render a point cloud with N × M points and task the 3D generative model with predicting accurate 3D reconstructions. To allow the generator to predict accurate shape and texture, we learn to deform a sphere similar to recent mesh-based reconstruction techniques [23, 44]. Fig. 8 (d) shows that it is indeed possible to learn a reasonable shape in this way. However, the image fidelity is inferior compared to images generated with our 2D generator, as shown in Table 2.

4.3. Failure Cases

We show several failure cases of our method in Fig. 10. Occasionally, a single primitive generates multiple objects (top) or the object's identity changes when the viewpoint change is very large (bottom). We believe that stronger inductive biases are required to tackle these problems.

Figure 10: Failure Cases. Two fruits are generated from the same primitive (red in the alpha map) in the first example. The object identity changes in the second example.

5. Conclusion

We consider this paper as a first step towards unsupervised learning of 3D controllable image synthesis. We demonstrate that modeling both in 3D and 2D space is crucial for accurate and view-consistent results. We show that our method is able to successfully disentangle multiple objects while providing controllability in terms of camera viewpoint and object poses. We believe that incorporating stronger inductive biases about object shape or scene geometry will be key for tackling more challenging scenarios. We hope that our results on this new task inspire further research in this exciting area.

Acknowledgments

We acknowledge support from the BMBF through the Tübingen AI Center (FKZ: 01IS18039B).
1 versarial nets. In Advances in Neural Information Processing [22] T. Karras, S. Laine, and T. Aila. A style-based generator ar- Systems (NIPS), 2016. 2 chitecture for generative adversarial networks. In Proceed- [8] M. Engelcke, A. R. Kosiorek, O. P. Jones, and I. Posner. ings of the IEEE Conference on Computer Vision and Pattern GENESIS: generative scene inference and sampling with Recognition, 2019. 1, 2, 6 object-centric latent representations. arXiv.org, 1907.13052, [23] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh ren- 2019. 2 derer. In Proc. IEEE Conf. on Computer Vision and Pattern [9] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepes- Recognition (CVPR), 2018. 2, 8 vari, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, re- [24] A. R. Kosiorek, H. Kim, Y. W. Teh, and I. Posner. Sequen- peat: Fast scene understanding with generative models. In tial attend, infer, repeat: Generative modelling of moving Advances in Neural Information Processing Systems (NIPS), objects. In Advances in Neural Information Processing Sys- 2016. 2 tems (NIPS), 2018. 2 [10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, [25] N. K. L., P. Mandikal, V. Jampani, and R. V. Babu. DIF- D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. FER: moving beyond 3d reconstruction with differentiable Generative adversarial nets. In Advances in Neural Informa- feature rendering. In Proc. IEEE Conf. on Computer Vision tion Processing Systems (NIPS), 2014. 2, 5 and Pattern Recognition (CVPR) Workshops, 2019. 2 [11] K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner. Multi- [26] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu. Layout- object representation learning with iterative variational infer- gan: Generating graphic layouts with wireframe discrimina- ence. In Proc. of the International Conf. on Machine learn- tors. arXiv.org, 1901.06767, 2019. 2 ing (ICML), 2019. 2 [27] S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differen- [12] K. Greff, A. Rasmus, M. Berglund, T. H. Hao, H. Valpola, tiable renderer for image-based 3d reasoning. In Proc. of the and J. Schmidhuber. Tagger: Deep unsupervised perceptual IEEE International Conf. on Computer Vision (ICCV), 2019. grouping. In Advances in Neural Information Processing 2, 4, 8 Systems (NIPS), 2016. 2 [28] F. Locatello, S. Bauer, M. Lucic, G. Ratsch,¨ S. Gelly, [13] K. Greff, S. van Steenkiste, and J. Schmidhuber. Neural ex- B. Scholkopf,¨ and O. Bachem. Challenging common as- pectation maximization. In Advances in Neural Information sumptions in the unsupervised learning of disentangled rep- Processing Systems (NIPS), 2017. 2 resentations. In Proc. of the International Conf. on Machine [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning learning (ICML), 2019. 3 for image recognition. In Proc. IEEE Conf. on Computer [29] L. Mescheder, A. Geiger, and S. Nowozin. Which training Vision and Pattern Recognition (CVPR), 2016. 5 methods for gans do actually converge? In Proc. of the In- [15] P. Henzler, N. J. Mitra, and T. Ritschel. Escaping plato’s ternational Conf. on (ICML), 2018. 1, 6, cave: 3d shape from adversarial rendering. In Proc. of the 7, 8

5879 [30] L. Mescheder, S. Nowozin, and A. Geiger. The numerics [46] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu. of GANs. In Advances in Neural Information Processing Occlusion aware unsupervised learning of optical flow. In Systems (NIPS), 2017. 2 Proc. IEEE Conf. on Computer Vision and Pattern Recogni- [31] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spec- tion (CVPR), 2018. 2 tral normalization for generative adversarial networks. In [47] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and Proc. of the International Conf. on Learning Representations G. J. Brostow. Interpretable transformations with encoder- (ICLR), 2018. 6 decoder networks. In Proc. of the IEEE International Conf. [32] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. on Computer Vision (ICCV), 2017. 3 Yang. Hologan: Unsupervised learning of 3d representa- [48] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: tions from natural images. In Proc. of the IEEE International layered recursive generative adversarial networks for image Conf. on Computer Vision (ICCV), 2019. 3 generation. In Proc. of the International Conf. on Learning [33] A. Noguchi and T. Harada. RGBD-GAN: unsupervised Representations (ICLR), 2017. 2 3d representation learning from natural image datasets via [49] B. Zhao, B. Chang, Z. Jie, and L. Sigal. Modular genera- RGBD image synthesis. arXiv.org, 1909.12573, 2019. 5 tive adversarial networks. In Proc. of the European Conf. on [34] A. Radford, L. Metz, and S. Chintala. Unsupervised repre- Computer Vision (ECCV), 2018. 2 sentation learning with deep convolutional generative adver- [50] B. Zhao, L. Meng, W. Yin, and L. Sigal. Image generation sarial networks. arXiv.org, 1511.06434, 2015. 2 from layout. In Proc. IEEE Conf. on Computer Vision and [35] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disen- Pattern Recognition (CVPR), 2019. 2, 6, 7 tangle factors of variation with manifold interaction. In Proc. [51] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou. of the International Conf. on Machine learning (ICML), Structured3d: A large photo-realistic dataset for structured 2014. 2 3d modeling. arXiv.org, 1908.00222, 2019. 6 [36] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and [52] J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenen- H. Lee. Learning what and where to draw. In Advances in baum, and B. Freeman. Visual object networks: Image gen- Neural Information Processing Systems (NIPS), 2016. 2 eration with disentangled 3d representations. In Advances in [37] H. Rhodin, M. Salzmann, and P. Fua. Unsupervised Neural Information Processing Systems (NIPS), 2018. 2 geometry-aware representation for 3d human pose estima- tion. In Proc. of the European Conf. on Computer Vision (ECCV), 2018. 3 [38] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer. Deepvoxels: Learning persistent 3d fea- ture embeddings. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019. 3 [39] A. Stanic and J. Schmidhuber. R-SQAIR: relational sequen- tial attend, infer, repeat. In Advances in Neural Information Processing Systems (NIPS) Workshops, 2019. 2 [40] T. D. Tamar Rott Shaham and T. Michaeli. Singan: Learning a generative model from a single natural image. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2019. 2 [41] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. 2012. 6 [42] M. O. Turkoglu, W. Thong, L. J. Spreeuwers, and B. Ki- canaoglu. 
A layer-based sequential framework for scene gen- eration with gans. In Proc. of the Conf. on Artificial Intelli- gence (AAAI), 2019. 2 [43] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhu- ber. Relational neural expectation maximization: Unsuper- vised discovery of objects and their interactions. In Proc. of the International Conf. on Learning Representations (ICLR), 2018. 2 [44] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb im- ages. In Proc. of the European Conf. on Computer Vision (ECCV), 2018. 8 [45] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In Proc. of the Eu- ropean Conf. on Computer Vision (ECCV), 2016. 2
