Generative Adversarial Transformers

Drew A. Hudson§ 1 C. Lawrence Zitnick 2

Abstract

We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining linearly-efficient computation that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer.

Figure 1. Sample images generated by the GANsformer, along with a visualization of the model attention maps.

1. Introduction

The cognitive science literature speaks of two reciprocal mechanisms that underlie human perception: the bottom-up processing, proceeding from the retina up to the cortex, as local elements and salient stimuli hierarchically group together to form the whole (Gibson, 2002), and the top-down processing, where surrounding context, selective attention and prior knowledge inform the interpretation of the particular (Gregory, 1970). While their respective roles and dynamics are being actively studied, researchers agree that it is the interplay between these two complementary processes that enables the formation of our rich internal representations, allowing us to perceive the world in its fullest and create vivid imageries in our mind's eye (Intaitė et al., 2013).

Nevertheless, the very mainstay and foundation of computer vision over the last decade – the Convolutional Neural Network – surprisingly does not reflect this bidirectional nature that so characterizes the human visual system, and rather displays a one-way feed-forward progression from raw sensory signals to higher representations. CNNs' local receptive field and rigid computation reduce their ability to model long-range dependencies or develop a holistic understanding of global shapes and structures that goes beyond the brittle reliance on texture (Geirhos et al., 2019), and in the generative domain especially, they are linked to considerable optimization and stability issues (Zhang et al., 2019) due to their fundamental difficulty in coordinating between fine details across a generated scene. These concerns, along with the inevitable comparison to cognitive visual processes, beg the question of whether convolution alone provides a complete solution, or some key ingredients are still missing.

1 Computer Science Department, Stanford University, CA, USA. 2 Facebook AI Research, CA, USA. Correspondence to: Drew A. Hudson.
§ I wish to thank Christopher D. Manning for the fruitful discussions and constructive feedback in developing the bipartite transformer, especially when explored for language, as well as for the kind financial support that allowed this work to happen.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Meanwhile, the NLP community has witnessed a major revolution with the advent of the Transformer network (Vaswani et al., 2017), a highly-adaptive architecture centered around relational attention and dynamic interaction. In response, several attempts have been made to integrate the transformer into computer vision models, but so far they have met only limited success due to scalability limitations stemming from their quadratic mode of operation.

Motivated to address these shortcomings and unlock the full potential of this promising network for the field of computer vision, we introduce the Generative Adversarial Transformer, or GANsformer for short, a simple yet effective generalization of the vanilla transformer, explored here for the task of visual synthesis. The model features a bipartite construction for computing soft attention that iteratively aggregates and disseminates information between the generated image features and a compact set of latent variables, to enable bidirectional interaction between these dual representations. This proposed design achieves a favorable balance, being capable of flexibly modeling global phenomena and long-range interactions on the one hand, while featuring an efficient setup that still scales linearly with the input size on the other. As such, the GANsformer can sidestep the computational costs and applicability constraints incurred by prior work, due to the dense and potentially excessive pairwise connectivity of the standard transformer (Zhang et al., 2019; Brock et al., 2019), and successfully advance generative modeling of images and scenes.

We study the model's quantitative and qualitative behavior through a series of experiments, where it achieves state-of-the-art performance for a wide selection of datasets, of both simulated as well as real-world kinds, obtaining particularly impressive gains in generating highly-compositional multi-object scenes. As indicated by our analysis, the GANsformer requires fewer training steps and fewer samples than competing approaches to successfully synthesize images of high quality and diversity. Further evaluation provides robust evidence for the network's enhanced transparency and compositionality, while ablation studies empirically validate the value and effectiveness of our approach. We then present visualizations of the model's produced attention maps to shed more light upon its internal representations and synthesis process. All in all, as we will see through the rest of the paper, by bringing the renowned GANs and Transformer architectures together under one roof, we can integrate their complementary strengths, to create a strong, compositional and efficient network for visual generative modeling.

2. Related Work

Generative Adversarial Networks (GANs), originally introduced in 2014 (Goodfellow et al., 2014), have made remarkable progress over the past few years, with significant advances in training stability and dramatic improvements in image quality and diversity, turning them nowadays into a leading paradigm in visual synthesis (Radford et al., 2016; Brock et al., 2019; Karras et al., 2019). In turn, GANs have been widely adopted for a rich variety of tasks, including image-to-image translation (Isola et al., 2017), super-resolution (Ledig et al., 2017), style transfer (Choi et al., 2018), and representation learning (Donahue et al., 2016), to name a few. But while automatically produced images of faces, single objects or natural scenery have reached astonishing fidelity, becoming nearly indistinguishable from real samples, the unconditional synthesis of more structured or compositional scenes is still lagging behind, suffering from inferior coherence, reduced geometric consistency and, at times, lack of global coordination (Johnson et al., 2018; Casanova et al., 2020; Zhang et al., 2019). As of now, faithful generation of structured scenes is thus yet to be reached.

Concurrently, the last couple of years saw impressive progress in the field of NLP, driven by the innovative architecture called Transformer (Vaswani et al., 2017), which has attained substantial gains within the language domain and consequently sparked considerable interest across the community (Vaswani et al., 2017; Devlin et al., 2019). In response, several attempts have been made to incorporate self-attention constructions into vision models, most commonly for image recognition, but also in segmentation (Fu et al., 2019), detection (Carion et al., 2020), and synthesis (Zhang et al., 2019). From a structural perspective, these can be roughly divided into two streams: those that apply local attention operations, failing to capture global interactions (Cordonnier et al., 2020; Parmar et al., 2019; Hu et al., 2019; Zhao et al., 2020; Parmar et al., 2018), and others that borrow the original transformer structure as-is and perform attention globally across the entire image, resulting in prohibitive computation due to its quadratic complexity, which fundamentally restricts its applicability to low-resolution layers only (Zhang et al., 2019; Brock et al., 2019; Bello et al., 2019; Wang et al., 2018; Esser et al., 2020; Dosovitskiy et al., 2020; Jiang et al., 2021). A few other works proposed sparse, discrete or approximated variations of self-attention, either within the adversarial or autoregressive contexts, but they still fall short of reducing the memory footprint and computation costs to a sufficient degree (Shen et al., 2020; Ho et al., 2019; Huang et al., 2019; Esser et al., 2020; Child et al., 2019).

Figure 2. We introduce the GANsformer network, which leverages a bipartite structure to allow long-range interactions while evading the quadratic complexity standard transformers suffer from. We present two novel attention operations over the bipartite graph: simplex and duplex. The former permits communication in one direction – in the generative context, from the latents to the image features – while the latter enables both top-down and bottom-up connections between these two variable groups.

Compared to these prior works, the GANsformer stands out as it manages to avoid the high costs incurred by self-attention, employing instead bipartite attention between the image features and a small collection of latent variables. Its design fits naturally with the generative goal of transforming source latents into an image, facilitating long-range interaction without sacrificing computational efficiency. Rather, the network maintains a scalable linear efficiency across all layers, realizing the transformer's full potential. In doing so, we seek to take a step forward in tackling the challenging task of compositional scene generation. Intuitively, and as is later corroborated by our findings, holding multiple latent variables that interact through attention with the evolving generated image may serve as a structural prior that promotes the formation of compact and compositional scene representations, as the different latents may specialize for certain objects or regions of interest. Indeed, as demonstrated in section 4, the Generative Adversarial Transformer achieves state-of-the-art performance in synthesizing both controlled and real-world indoor and outdoor scenes, while showing indications for a semantic compositional disposition along the way.

In designing our model, we drew inspiration from multiple lines of research on generative modeling, compositionality and scene understanding, including techniques for scene decomposition, object discovery and representation learning. Several approaches, such as (Burgess et al., 2019; Greff et al., 2019; Eslami et al., 2016; Engelcke et al., 2020), perform iterative variational inference to encode scenes into multiple slots, but are mostly applied in the contexts of synthetic and oftentimes fairly rudimentary 2D settings. Works such as Capsule networks (Sabour et al., 2017) leverage ideas from psychology about Gestalt principles (Smith, 1988; Hamlyn, 2017), perceptual grouping (Brooks, 2015) or analysis-by-synthesis (Biederman, 1987), and, like us, introduce ways to piece together visual elements to discover compound entities or, in the cases of Set Transformers (Lee et al., 2019) and A2-Nets (Chen et al., 2018), group local information into global aggregators, which proves useful for a broad spectrum of tasks, spanning unsupervised segmentation (Greff et al., 2017; Locatello et al., 2020), clustering (Lee et al., 2019), image recognition (Chen et al., 2018), NLP (Ravula et al., 2020) and viewpoint generalization (Kosiorek et al., 2019). However, our work stands out, incorporating new ways to integrate information between nodes, as well as novel forms of attention (Simplex and Duplex) that iteratively update and refine the assignments between image features and latents, and is the first to explore these techniques in the context of high-resolution generative modeling.

Most related to our work are certain GAN models for conditional and unconditional visual synthesis: a few methods (van Steenkiste et al., 2020; Gu et al., 2020; Nguyen-Phuoc et al., 2020; Ehrhardt et al., 2020) utilize multiple replicas of a generator to produce a set of image layers that are then combined through alpha-composition. As a result, these models make quite strong assumptions about the independence between the components depicted in each layer. In contrast, our model generates one unified image through a cooperative process, coordinating between the different latents through the use of soft attention. Other works, such as SPADE (Park et al., 2019; Zhu et al., 2020), employ region-based feature modulation for the task of layout-to-image translation, but, contrary to us, use fixed segmentation maps or static class embeddings to control the visual features. Of particular relevance is the prominent StyleGAN model (Karras et al., 2019; 2020), which utilizes a single global style vector to consistently modulate the features of each layer. The GANsformer generalizes this design, as multiple style vectors impact different regions in the image concurrently, allowing for a spatially finer control over the generation process. Finally, while StyleGAN broadcasts information in one direction from the global latent to the local image features, our model propagates information both from latents to features and vice versa, enabling top-down and bottom-up reasoning to occur simultaneously.¹

¹ Note however that our model certainly does not claim to serve as a biologically-accurate reflection of cognitive top-down processing. Rather, this analogy played as a conceptual source of inspiration that aided us through the idea development.

Figure 3. Model Structure. Left: The GANsformer layer, composed of a bipartite attention to propagate information from the latents to the evolving features grid, followed by a standard convolution and upsampling. These layers are stacked multiple times starting from a 4×4 grid, until a final high-resolution image is produced. Right: The latents and image features attend to each other to capture the scene compositionality. The GANsformer's compositional latent space contrasts with the StyleGAN monolithic one, where a single latent modulates uniformly the whole scene.

3. The Generative Adversarial Transformer

The Generative Adversarial Transformer is a type of Generative Adversarial Network, which involves a generator network (G) that maps a sample from the latent space to the output space (e.g. an image), and a discriminator network (D) which seeks to discern between real and fake samples (Goodfellow et al., 2014). The two networks compete with each other through a minimax game until reaching an equilibrium. Typically, each of these networks consists of multiple layers of convolution, but in the GANsformer case we instead construct them using a novel architecture, called Bipartite Transformer, formally defined below.

The section is structured as follows: we first present a formulation of the Bipartite Transformer, a general-purpose generalization of the Transformer (section 3.1). Then, we provide an overview of how the transformer is incorporated into the generator and discriminator framework (section 3.2). We conclude by discussing the merits and distinctive properties of the GANsformer that set it apart from the traditional GAN and transformer networks (section 3.3).

3.1. The Bipartite Transformer

The standard transformer network consists of alternating multi-head self-attention and feed-forward layers. We refer to each pair of self-attention and feed-forward layers as a transformer layer, such that a Transformer is considered to be a stack composed of several such layers. The Self-Attention operator considers all pairwise relations among the input elements, so as to update each single element by attending to all the others. The Bipartite Transformer generalizes this formulation, featuring instead a bipartite graph between two groups of variables (in the GAN case, latents and image features). In the following, we consider two forms of attention that could be computed over the bipartite graph – Simplex attention and Duplex attention – depending on the direction in which information propagates²: either in one way only, or both in top-down and bottom-up ways. While for clarity purposes we present the technique here in its one-head version, in practice we make use of a multi-head variant, in accordance with (Vaswani et al., 2017).

² In computer networks, simplex refers to single-direction communication, while duplex refers to communication in both ways.

3.1.1. SIMPLEX ATTENTION

We begin by introducing the simplex attention, which distributes information in a single direction over the Bipartite Transformer. Formally, let X^{n×d} denote an input set of n vectors of dimension d (where, for the image case, n = W×H), and Y^{m×d} denote a set of m aggregator variables (the latents, in the case of the generator). We can then compute attention over the derived bipartite graph between these two groups of elements. Specifically, we define:

Attention(Q, K, V) = softmax(QKᵀ / √d) V
a_b(X, Y) = Attention(q(X), k(Y), v(Y))

where a stands for Attention, and q(·), k(·), v(·) are functions that respectively map elements into queries, keys, and values, all maintaining the same dimensionality d. We also provide the query and key mappings with respective positional encoding inputs, to reflect the distinct position of each element in the set (e.g. in the image) (further details on the specifics of the positional encoding scheme in section 3.2).

We can then combine the attended information with the input elements X. Whereas the standard transformer implements an additive update rule of the form

u^a(X, Y) = LayerNorm(X + a_b(X, Y))

(where Y = X in the standard self-attention case), we instead use the retrieved information to control both the scale and the bias of the elements in X, in line with the practices promoted by the StyleGAN model (Karras et al., 2019). As our experiments indicate, such multiplicative integration enables significant gains in the model performance. Formally:

u^s(X, Y) = γ(a_b(X, Y)) ⊙ ω(X) + β(a_b(X, Y))

where γ(·), β(·) are mappings that compute multiplicative and additive styles (gain and bias), maintaining a dimension of d, and ω(X) = (X − µ(X)) / σ(X) normalizes each element with respect to the other features. This update rule fuses together the normalization and the information propagation from Y to X, by essentially letting Y control the statistical tendencies of X, which for instance can be useful in the case of visual synthesis for generating particular objects or entities.
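To make the update rule above concrete, the following is a minimal PyTorch sketch of single-head simplex attention with multiplicative modulation. It illustrates the equations rather than the official implementation (available at the repository linked in the abstract); the per-position normalization used for ω, the omission of positional encodings, and the module layout are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SimplexAttention(nn.Module):
    """Single-head simplex attention from latents Y to image features X (a sketch).

    Implements u^s(X, Y) = gamma(a_b(X, Y)) * omega(X) + beta(a_b(X, Y)),
    where a_b(X, Y) attends from X (queries) to Y (keys and values).
    """
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.gamma, self.beta = nn.Linear(d, d), nn.Linear(d, d)  # gain and bias styles

    def forward(self, X, Y):
        # X: (batch, n, d) image features; Y: (batch, m, d) latent variables
        d = X.shape[-1]
        attn = torch.softmax(self.q(X) @ self.k(Y).transpose(-2, -1) / d ** 0.5, dim=-1)
        a = attn @ self.v(Y)                                   # (batch, n, d) retrieved content
        # omega(X): normalize each element (here per position, a simplifying assumption)
        omega = (X - X.mean(-1, keepdim=True)) / (X.std(-1, keepdim=True) + 1e-8)
        return self.gamma(a) * omega + self.beta(a)            # multiplicative scale-and-bias update
```

Note that the attention matrix has shape n×m, so the cost of this update grows only linearly in the number of image positions n for a fixed, small number of latents m.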

Figure 4. Samples of images generated by the GANsformer for the CLEVR, Bedroom and Cityscapes datasets, and a visualization of the produced attention maps. The different colors correspond to the latents that attend to each region.

3.1.2. DUPLEX ATTENTION

We can go further and consider the variables Y to possess a key-value structure of their own (Miller et al., 2016): Y = (K^{m×d}, V^{m×d}), where the values store the content of the Y variables (e.g. the randomly sampled latents for the case of GANs), while the keys track the centroids of the attention-based assignments from X to Y, which can be computed as K = a_b(Y, X). Consequently, we can define a new update rule, which is later empirically shown to work more effectively than the simplex attention:

u^d(X, Y) = γ(a(X, K, V)) ⊙ ω(X) + β(a(X, K, V))

This update compounds two attention operations on top of each other: we first compute soft attention assignments between X and Y, and then refine the assignments by considering their centroids, analogously to the k-means algorithm (Lloyd, 1982; Locatello et al., 2020).

Finally, to support bidirectional interaction between the elements, we can chain two reciprocal simplex attentions from X to Y and from Y to X, obtaining the duplex attention, which alternates computing Y := u^a(Y, X) and X := u^d(X, Y), such that each representation is refined in light of its interaction with the other, integrating together bottom-up and top-down interactions.
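Continuing the previous sketch (same hedges: a rough illustration, not the official implementation), duplex attention can be written as a bottom-up update of Y followed by a top-down, centroid-based update of X. Reusing the bottom-up attention weights as the assignment matrix for the centroids is a simplification on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuplexAttention(nn.Module):
    """Duplex attention sketch: Y carries a key-value structure (K, V), where the
    keys track centroids of the attention-based assignments from X to Y."""
    def __init__(self, d):
        super().__init__()
        self.qy, self.ky, self.vy = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.qx = nn.Linear(d, d)
        self.gamma, self.beta = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, X, Y):
        # X: (batch, n, d) image features; Y: (batch, m, d) latent variables
        d = X.shape[-1]
        # Bottom-up step Y := u^a(Y, X): additive, LayerNorm-style update.
        attn_y = torch.softmax(self.qy(Y) @ self.ky(X).transpose(-2, -1) / d ** 0.5, dim=-1)
        Y = F.layer_norm(Y + attn_y @ self.vy(X), Y.shape[-1:])
        # Keys as centroids of the soft assignments from X to Y; we reuse attn_y as
        # the assignment matrix, a simplification of K = a_b(Y, X).
        K = attn_y @ X                                           # (batch, m, d)
        # Top-down step X := u^d(X, Y): attend to keys K, retrieve the latents' content (V = Y).
        attn_x = torch.softmax(self.qx(X) @ K.transpose(-2, -1) / d ** 0.5, dim=-1)
        a = attn_x @ Y
        omega = (X - X.mean(-1, keepdim=True)) / (X.std(-1, keepdim=True) + 1e-8)
        return self.gamma(a) * omega + self.beta(a), Y           # updated X and refined Y
```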
3.1.3. OVERALL ARCHITECTURE STRUCTURE

Vision-specific adaptations. In the standard Transformer used for NLP, each self-attention layer is followed by a Feed-Forward FC layer that processes each element independently (which can be deemed a 1×1 convolution). Since our case pertains to images, we use instead a kernel size of k = 3 after each application of attention. We also apply a Leaky ReLU nonlinearity after each convolution (Maas et al., 2013) and then upsample or downsample the features X, as part of the generator or discriminator respectively, following e.g. StyleGAN2 (Karras et al., 2020). To account for the features' location within the image, we use a sinusoidal positional encoding along the horizontal and vertical dimensions for the visual features X (Vaswani et al., 2017), and a trained embedding for the set of latent variables Y.

Overall, the bipartite transformer is thus composed of a stack that alternates attention (simplex or duplex) and convolution layers, starting from a 4×4 grid up to the desirable resolution. Conceptually, this structure fosters an interesting communication flow: rather than densely modeling interactions among all the pairs of pixels in the images, it supports adaptive long-range interaction between far-away pixels in a moderated manner, passing through a compact and global bottleneck that selectively gathers information from the entire input and distributes it to relevant regions. Intuitively, this form can be viewed as analogous to the top-down notions discussed in section 1, as information is propagated in the two directions, both from the local pixel to the global high-level representation and vice versa.

We note that both the simplex and the duplex attention operations enjoy a bilinear efficiency of O(mn), thanks to the network's bipartite structure that considers all pairs of corresponding elements from X and Y. Since, as we see below, we maintain Y to be of a fairly small size, choosing m in the range of 8–32, this compares favorably to the potentially prohibitive O(n²) complexity of self-attention, which impedes its applicability to high-resolution images.
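As a rough sketch of how these pieces could compose into a single generator block, the following alternates bipartite attention with a 3×3 convolution, Leaky ReLU, and upsampling, as described above. It builds on the hypothetical SimplexAttention and DuplexAttention modules from the previous sketches; positional encodings are omitted for brevity.

```python
class GANsformerBlock(nn.Module):
    """One sketched generator block: bipartite attention over the feature grid,
    then a 3x3 convolution, LeakyReLU, and 2x upsampling (cf. Sec. 3.1.3)."""
    def __init__(self, d, duplex=True):
        super().__init__()
        self.attn = DuplexAttention(d) if duplex else SimplexAttention(d)
        self.conv = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.duplex = duplex

    def forward(self, X, Y):
        # X: (batch, d, H, W) feature grid; Y: (batch, m, d) latents
        b, d, h, w = X.shape
        flat = X.flatten(2).transpose(1, 2)              # (batch, H*W, d)
        if self.duplex:
            flat, Y = self.attn(flat, Y)                 # top-down and bottom-up updates
        else:
            flat = self.attn(flat, Y)                    # one-directional update of X
        X = flat.transpose(1, 2).reshape(b, d, h, w)
        X = F.leaky_relu(self.conv(X), 0.2)
        X = F.interpolate(X, scale_factor=2, mode="bilinear", align_corners=False)
        return X, Y
```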

Figure 5. Sample images and attention maps. Samples of images generated by the GANsformer for the CLEVR, LSUN-Bedroom and Cityscapes datasets, and a visualization of the produced attention maps. The different colors correspond to the latent variables that attend to each region. For the CLEVR dataset we show multiple attention maps from different layers of the model, revealing how the latent variables' roles change over the different layers – while they correspond to different objects as the layout of the scene is being formed in early layers, they behave similarly to a surface normal in the final layers of the generator.

3.2. The Generator and Discriminator Networks

We use the celebrated StyleGAN model as a starting point for our GAN design. Commonly, a generator network consists of a multi-layer CNN that receives a randomly sampled vector z and transforms it into an image. The StyleGAN approach departs from this design and, instead, introduces a feed-forward mapping network that outputs an intermediate vector w, which in turn interacts directly with each convolution through the synthesis network, globally controlling the feature map statistics of every layer.

Effectively, this approach attains a layer-wise decomposition of visual properties, allowing the model to control particular global aspects of the image such as pose, lighting conditions or color schemes, in a coherent manner over the entire image. But while StyleGAN successfully disentangles global properties, it is potentially limited in its ability to perform spatial decomposition, as it provides no means to control the style of localized regions within the generated image.

Luckily, the bipartite transformer offers a solution to meet this goal. Instead of controlling the style of all features globally, we use our new attention layer to perform adaptive, local, region-wise modulation. We split the latent vector z into k components, z = [z_1, ..., z_k], and, as in StyleGAN, pass each of them through a shared mapping network, obtaining a corresponding set of intermediate latent variables Y = [y_1, ..., y_k]. Then, during synthesis, after each CNN layer in the generator, we let the feature map X and the latents Y play the roles of the two element groups, and mediate their interaction through our new attention layer (either simplex or duplex). This setting thus allows for a flexible and dynamic style modulation at the region level. Since soft attention tends to group elements based on their proximity and content similarity (Vaswani et al., 2017), we see how the transformer architecture naturally fits into the generative task and proves useful in the visual domain, allowing the network to exercise finer control in modulating semantic regions. As we see in section 4, this capability turns out to be especially useful in modeling compositional scenes.
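As an illustration of the synthesis path just described, the following hedged sketch (with illustrative hyperparameters rather than the paper's configuration, and reusing the hypothetical GANsformerBlock from the previous sketch) splits z into k components, maps them through a shared mapping network, and lets the resulting latents modulate the feature grid at every layer:

```python
class GANsformerGenerator(nn.Module):
    """Sketch of the generator: z is split into k components that pass through a
    shared mapping network, and the resulting latents Y modulate the evolving
    feature grid region-wise at every layer (hyperparameters are illustrative)."""
    def __init__(self, k=16, d=128, num_blocks=6):
        super().__init__()
        self.mapping = nn.Sequential(nn.Linear(d, d), nn.LeakyReLU(0.2), nn.Linear(d, d))
        self.const = nn.Parameter(torch.randn(1, d, 4, 4))       # learned 4x4 starting grid
        self.blocks = nn.ModuleList([GANsformerBlock(d) for _ in range(num_blocks)])
        self.to_rgb = nn.Conv2d(d, 3, kernel_size=1)

    def forward(self, z):
        # z: (batch, k, d) -- the k independent latent components z_1, ..., z_k
        Y = self.mapping(z)                                      # shared mapping network -> y_1, ..., y_k
        X = self.const.expand(z.shape[0], -1, -1, -1)
        for block in self.blocks:
            X, Y = block(X, Y)                                   # attention-based region-wise modulation
        return torch.tanh(self.to_rgb(X))

# e.g. GANsformerGenerator()(torch.randn(2, 16, 128)) yields a (2, 3, 256, 256) image batch
```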

Table 1. Comparison between the GANsformer and competing methods for image synthesis. We evaluate the models along commonly used metrics such as FID, Inception, and Precision & Recall scores. FID is considered to be the most well-received of these, serving as a reliable indication of image fidelity and diversity. We compute each metric 10 times over 50k samples with different random seeds and report their average. GANsformer_s and GANsformer_d denote the simplex and duplex attention variants, respectively.

               CLEVR                                       LSUN-Bedroom
Model          FID ↓    IS ↑    Precision ↑   Recall ↑     FID ↓    IS ↑    Precision ↑   Recall ↑
GAN            25.02    2.17    21.77         16.76        12.16    2.66    52.17         13.63
k-GAN          28.29    2.21    22.93         18.43        69.90    2.41    28.71          3.45
SAGAN          26.04    2.17    30.09         15.16        14.06    2.70    54.82          7.26
StyleGAN2      16.05    2.15    28.41         23.22        11.53    2.79    51.69         19.42
VQGAN          32.60    2.03    46.55         63.33        59.63    1.93    55.24         28.00
GANsformer_s   10.26    2.46    38.47         37.76         8.56    2.69    55.52         22.89
GANsformer_d    9.17    2.36    47.55         66.63         6.51    2.67    57.41         29.71

               FFHQ                                        Cityscapes
Model          FID ↓    IS ↑    Precision ↑   Recall ↑     FID ↓    IS ↑    Precision ↑   Recall ↑
GAN            13.18    4.30    67.15         17.64        11.57    1.63    61.09         15.30
k-GAN          61.14    4.00    50.51          0.49        51.08    1.66    18.80          1.73
SAGAN          16.21    4.26    64.84         12.26        12.81    1.68    43.48          7.97
StyleGAN2       9.24    4.33    68.61         25.45         8.35    1.70    59.35         27.82
VQGAN          63.12    2.23    67.01         29.67       173.80    2.82    30.74         43.00
GANsformer_s    8.12    4.46    68.94         10.14        14.23    1.67    64.12          2.03
GANsformer_d    7.42    4.41    68.77          5.76         5.76    1.69    48.06         33.65

For the discriminator, we similarly apply attention after every convolution, in this case using trained embeddings to initialize the aggregator variables Y, which may intuitively represent background knowledge the model learns about the scenes. At the last layer, we concatenate these variables to the final feature map to make a prediction about the identity of the image source. We note that this construction holds some resemblance to the PatchGAN discriminator introduced by (Isola et al., 2017), but whereas PatchGAN pools features according to a fixed predetermined scheme, the GANsformer can gather the information in a more adaptive and selective manner. Overall, using this structure endows the discriminator with the capacity to likewise model long-range dependencies, which can aid it in its assessment of the image fidelity, allowing it to acquire a more holistic understanding of the visual modality.

In terms of the loss function, optimization and training configuration, we adopt the settings and techniques used in the StyleGAN and StyleGAN2 models (Karras et al., 2019; 2020), including in particular style mixing, stochastic variation, exponential moving average for weights, and a non-saturating logistic loss with lazy R1 regularization.

3.3. Summary

To recapitulate the discussion above, the GANsformer successfully unifies GANs and the Transformer for the task of scene generation. Compared to traditional GANs and transformers, it introduces multiple key innovations:

• Bipartite Structure that balances between expressiveness and efficiency, modeling long-range dependencies while maintaining linear computational costs.
• Compositional Latent Space with multiple variables that coordinate through attention to produce the image cooperatively, in a manner that matches the inherent compositionality of natural scenes.
• Bidirectional Interaction between the latents and the visual features, which allows the refinement and interpretation of each in light of the other.
• Multiplicative Integration rule to affect features' style more flexibly, akin to StyleGAN but in contrast to the transformer architecture.

As we see in the following section, the combination of these design choices yields a strong architecture that demonstrates high efficiency, improved latent space disentanglement, and enhanced transparency of its generation process.

4. Experiments

We investigate the GANsformer through a suite of experiments to study its quantitative performance and qualitative behavior. As we will see below, the GANsformer achieves state-of-the-art results, successfully producing high-quality images for a varied assortment of datasets: FFHQ for human faces (Karras et al., 2019), CLEVR for multi-object scenes (Johnson et al., 2017), and the LSUN-Bedroom (Yu et al., 2015) and Cityscapes (Cordts et al., 2016) datasets for challenging indoor and outdoor scenes. The use of these datasets and their reproduced images is only for the purpose of scientific communication. Further analysis conducted in sections 4.1, 4.2 and 4.3 provides evidence for several favorable properties the GANsformer possesses, including better data-efficiency, enhanced transparency, and stronger disentanglement compared to prior approaches. Section 4.4 then quantitatively assesses the network's semantic coverage of the natural image distribution for the CLEVR dataset, while ablation studies in the supplementary empirically validate the relative importance of each of the model's design choices. Taken altogether, our evaluation offers solid evidence for the GANsformer's effectiveness and efficacy in modeling compositional images and scenes.

We compare our network with multiple related approaches, including both baselines as well as leading models for image synthesis: (1) A baseline GAN (Goodfellow et al., 2014): a standard GAN that follows the typical convolutional architecture. (2) StyleGAN2 (Karras et al., 2020), where a single global latent interacts with the evolving image by modulating its style in each layer. (3) SAGAN (Zhang et al., 2019), a model that performs self-attention across all pixel pairs in the low-resolution layers of the generator and discriminator. (4) k-GAN (van Steenkiste et al., 2020), which produces k separated images that are then blended through alpha-composition. (5) VQGAN (Esser et al., 2020), which has been proposed recently and utilizes transformers for discrete autoregressive auto-encoding.

To evaluate all models under comparable conditions of training scheme, model size, and optimization details, we implement them all within the codebase introduced by the StyleGAN authors. The only exception is the recent VQGAN model, for which we use the official implementation. All models have been trained with images of 256×256 resolution and for the same number of training steps, roughly spanning a week on 2 NVIDIA V100 GPUs per model (or equivalently 3-4 days using 4 GPUs). For the GANsformer, we select k – the number of latent variables – from the range of 8–32. Note that increasing the value of k does not translate to an increased overall latent dimension; we rather kept the overall latent dimension equal across models. See supplementary material A for further implementation details, hyperparameter settings and training configuration.

As shown in table 1, our model matches or outperforms prior work, achieving substantial gains in terms of FID score, which correlates with image quality and diversity (Heusel et al., 2017), as well as other commonly used metrics such as Inception score (IS) and Precision and Recall (P&R).³ As could be expected, we obtain the least gains for the FFHQ human faces dataset, where naturally there is relatively lower diversity in image layouts. On the flip side, most notable are the significant improvements in performance for the CLEVR case, where our approach successfully lowers FID scores from 16.05 to 9.16, as well as the Bedroom dataset, where the GANsformer nearly halves the FID score from 11.32 to 6.5, being trained for an equal number of steps. These findings suggest that the GANsformer is particularly adept in modeling scenes of high compositionality (CLEVR) or layout diversity (Bedroom). Comparing between the Simplex and Duplex attention versions further reveals the strong benefits of integrating the reciprocal bottom-up and top-down processes together.

4.1. Data and Learning Efficiency

We examine the learning curves of our and competing models (figure 6, middle) and inspect samples of generated images at different stages of the training (figure 11 in the supplementary). These results both reveal that our model learns significantly faster than competing approaches, in the case of CLEVR producing high-quality images in approximately 3-times fewer training steps than the second-best approach. To explore the GANsformer's learning aptitude further, we have performed experiments where we reduced, to varied degrees, the size of the dataset that each model (and specifically, its discriminator) is exposed to during training (figure 6, rightmost). These results similarly validate the model's superior data-efficiency, especially when as little as 1k images are given to the model.

4.2. Transparency & Compositionality

To gain more insight into the model's internal representation and its underlying generative process, we visualize the attention distributions produced by the GANsformer as it synthesizes new images. Recall that at each layer of the generator, it casts attention between the k latent variables and the evolving spatial features of the generated image. From the samples in figures 4 and 5, we can see that particular latent variables tend to attend to coherent regions within the image in terms of content similarity and proximity. Figure 5 shows further visualizations of the attention computed by the model in various layers, showing how it behaves distinctively in different stages of the synthesis process. These visualizations imply that the latents carry a semantic sense, capturing objects, visual entities or constituent components of the synthesized scene. These findings can thereby attest to an enhanced compositionality that our model acquires through its multi-latent structure. Whereas models such as StyleGAN use a single monolithic latent vector to account for the whole scene and modulate features only at the global scale, our design lets the GANsformer exercise finer control in impacting features at the object granularity, while leveraging the use of attention to make its internal representations more explicit and transparent.

³ Note that while the StyleGAN paper (Karras et al., 2020) reports lower FID scores in the FFHQ and Bedroom cases, they obtain them by training their model for 5-7 times longer than our experiments (StyleGAN models are trained for up to 17.5 million steps, producing 70M samples and demanding over 90 GPU-days). To comply with a reasonable compute budget, in our evaluation we equally reduced the training duration for all models, maintaining the same number of steps.

To quantify the compositionality level exhibited by the model, we use a pre-trained segmentor (Wu et al., 2019) to produce semantic segmentations for a sample set of generated scenes, and use them to measure the correlation between the attention cast by latent variables and various semantic classes. In figure 7 in the supplementary, we present the classes that on average have shown the highest correlation with respect to latent variables in the model, indicating that the model coherently attends to semantic concepts such as windows, pillows, sidewalks and cars, as well as to coherent background regions like carpets, ceiling, and walls.
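The correlation measurement described above could be approximated along the following lines. This is a sketch of the analysis only: the function name, the per-image indicator-mask correlation, and the (omitted) averaging over samples are our assumptions rather than the paper's exact protocol.

```python
import numpy as np

def attention_class_correlation(attn, seg, num_classes):
    """attn: (k, H, W) attention maps of the k latents for one generated image.
    seg:  (H, W) integer semantic segmentation of the same image.
    Returns a (k, num_classes) matrix correlating each latent's attention map
    with each class indicator mask (averaging over many samples is assumed)."""
    corr = np.zeros((len(attn), num_classes))
    flat_attn = attn.reshape(len(attn), -1)
    for c in range(num_classes):
        mask = (seg.reshape(-1) == c).astype(np.float32)
        if mask.std() == 0:                      # class absent from (or covering all of) this image
            continue
        for i in range(len(attn)):
            corr[i, c] = np.corrcoef(flat_attn[i], mask)[0, 1]
    return corr
```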


Figure 6. (1-2) Performance as a function of the initial and final layers the attention is applied to. The more layers attention is applied to, the better the model's performance gets and the faster it learns, verifying the effectiveness of the GANsformer approach. (3) Learning curve for the GANsformer and competing approaches. (4) Data-efficiency for CLEVR.

Table 2. Chi-Square statistics of the generated scenes' distribution for the CLEVR dataset, based on 1k samples. Samples were processed by a pre-trained object detector to identify objects and semantic attributes, and to compute the properties' distribution over the generated scenes.

4.3. Disentanglement

                 GAN      StyleGAN   GANsformer_s   GANsformer_d
Object Area      0.038    0.035      0.045          0.068
Object Number    2.378    1.622      2.142          2.825
Co-occurrence   13.532    9.177      9.506         13.020
Shape            1.334    0.643      1.856          2.815
Size             0.256    0.066      0.393          0.427
Material         0.108    0.322      1.573          2.887
Color            1.011    1.402      1.519          3.189
Class            6.435    4.571      5.315         16.742

We consider the DCI metrics commonly used in the disentanglement literature (Eastwood & Williams, 2018) to provide more evidence for the beneficial impact our architecture has on the model's internal representations. These metrics assess the Disentanglement, Completeness and Informativeness of a given representation, essentially evaluating the degree to which there is a 1-to-1 correspondence between latent factors and global image attributes. To obtain the attributes, we consider the area size of each semantic class (bed, carpet, pillows) obtained through a pre-trained segmentor, and use them as the output response features for measuring the latent space disentanglement, computed over 1k images. We follow the protocol proposed by (Wu et al., 2020) and present the results in table 3. This analysis confirms that the GANsformer latent representations enjoy higher disentanglement when compared to the baseline StyleGAN approach.

Table 3. Disentanglement metrics (DCI), which assess the Disentanglement, Completeness and Informativeness of latent representations, computed over 1k CLEVR images. The GANsformer achieves the strongest results compared to competing approaches.

                   GAN      StyleGAN   GANsformer_s   GANsformer_d
Disentanglement    0.126    0.208      0.556          0.768
Modularity         0.631    0.703      0.891          0.952
Completeness       0.071    0.124      0.195          0.270
Informativeness    0.583    0.685      0.899          0.972
Informativeness'   0.434    0.332      0.848          0.963
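A sketch of how the output response features for the DCI computation could be derived, following the description in section 4.3. The helper names, the `segmentor` callable, and the data-collection loop are hypothetical; the actual protocol follows (Wu et al., 2020).

```python
import numpy as np
import torch

def class_area_features(seg, num_classes):
    """seg: (H, W) integer segmentation of one generated image. Returns the fraction
    of the image covered by each semantic class, used here as the response vector."""
    return np.array([(seg == c).mean() for c in range(num_classes)], dtype=np.float32)

def dci_dataset(generator, segmentor, num_samples, k, d, num_classes):
    """Collects (latent, response) pairs over generated images; `segmentor` is an
    assumed callable returning an (H, W) integer class map as a numpy array."""
    latents, responses = [], []
    with torch.no_grad():
        for _ in range(num_samples):
            z = torch.randn(1, k, d)
            seg = segmentor(generator(z))
            latents.append(z.flatten().numpy())
            responses.append(class_area_features(seg, num_classes))
    return np.stack(latents), np.stack(responses)
```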

4.4. Image Diversity

One of the advantages of compositional representations is that they can support combinatorial generalization – a key foundation of human intelligence (Battaglia et al., 2018). Inspired by this observation, we perform an experiment to measure that property in the context of visual synthesis of multi-object scenes. We use an object detector on the generated CLEVR scenes to extract the objects and properties within each sample. We then compute Chi-Square statistics on the sample set to determine the degree to which each model manages to cover the natural uniform distribution of CLEVR images. Table 2 summarizes the results, where we can see that our model obtains better scores across almost all the semantic properties of the image distribution. These metrics complement the common FID and IS scores as they emphasize structure over texture, focusing on object existence, arrangement and local properties, thereby further substantiating the model's compositionality.

5. Conclusion

We have introduced the GANsformer, a novel and efficient bipartite transformer that combines top-down and bottom-up interactions, and explored it for the task of generative modeling, achieving strong quantitative and qualitative results that attest to the model's robustness and efficacy. The GANsformer fits within the general philosophy that aims to incorporate stronger inductive biases into neural networks to encourage desirable properties such as transparency, data-efficiency and compositionality – properties which are at the core of human intelligence, and serve as the basis for our capacity to reason, plan, learn, and imagine. While our work focuses on visual synthesis, we note that the bipartite transformer is a general-purpose model, and expect it may be found useful for other tasks in both vision and language. Overall, we hope that our work will help take us a little closer in our collective search to bridge the gap between the intelligence of humans and machines.

References Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified gen- Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Al- erative adversarial networks for multi-domain image-to- varo Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz image translation. In 2018 IEEE Conference on Com- Malinowski, Andrea Tacchetti, David Raposo, Adam puter Vision and Pattern Recognition, CVPR 2018, Salt Santoro, Ryan Faulkner, et al. Relational inductive bi- Lake City, UT, USA, June 18-22, 2018, pp. 8789–8797. ases, deep learning, and graph networks. arXiv preprint IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018. arXiv:1806.01261, 2018. 00916. Irwan Bello, Barret Zoph, Quoc Le, Ashish Vaswani, and Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jonathon Shlens. Attention augmented convolutional Jaggi. On the relationship between self-attention and networks. In 2019 IEEE/CVF International Conference convolutional layers. In 8th International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), on Learning Representations, ICLR 2020, Addis Ababa, October 27 - November 2, 2019, pp. 3285–3294. IEEE, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. 2019. doi: 10.1109/ICCV.2019.00338. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Irving Biederman. Recognition-by-components: a theory of Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe human image understanding. Psychological review, 94 Franke, Stefan Roth, and Bernt Schiele. The cityscapes (2):115, 1987. dataset for semantic urban scene understanding. In 2016 Andrew Brock, Jeff Donahue, and Karen Simonyan. Large IEEE Conference on Computer Vision and Pattern Recog- scale GAN training for high fidelity natural image synthe- nition, CVPR 2016, Las Vegas, NV, USA, June 27-30, sis. In 7th International Conference on Learning Repre- 2016, pp. 3213–3223. IEEE Computer Society, 2016. doi: sentations, ICLR 2019, New Orleans, LA, USA, May 6-9, 10.1109/CVPR.2016.350. 2019. OpenReview.net, 2019. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Joseph L Brooks. Traditional and new principles of percep- Toutanova. BERT: Pre-training of deep bidirectional tual grouping. 2015. transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter Christopher P Burgess, Loic Matthey, Nicholas Watters, of the Association for Computational Linguistics: Hu- Rishabh Kabra, Irina Higgins, Matt Botvinick, and man Language Technologies, Volume 1 (Long and Short Alexander Lerchner. MONet: Unsupervised scene Papers), pp. 4171–4186, Minneapolis, Minnesota, June decomposition and representation. arXiv preprint 2019. Association for Computational Linguistics. doi: arXiv:1901.11390, 2019. 10.18653/v1/N19-1423.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nico- Jeff Donahue, Philipp Krahenb¨ uhl,¨ and Trevor Dar- las Usunier, Alexander Kirillov, and Sergey Zagoruyko. rell. Adversarial feature learning. arXiv preprint End-to-end object detection with transformers. In Eu- arXiv:1605.09782, 2016. ropean Conference on Computer Vision, pp. 213–229. Springer, 2020. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Arantxa Casanova, Michal Drozdzal, and Adriana Romero- Mostafa Dehghani, Matthias Minderer, Georg Heigold, Soriano. Generating unseen complex scenes: are we there Sylvain Gelly, et al. An image is worth 16x16 words: yet? arXiv preprint arXiv:2012.04027, 2020. Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. Cian Eastwood and Christopher KI Williams. A frame- In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, work for the quantitative evaluation of disentangled rep- Kristen Grauman, Nicolo` Cesa-Bianchi, and Roman Gar- resentations. In International Conference on Learning nett (eds.), Advances in Neural Information Processing Representations, 2018. Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, Sebastien´ Ehrhardt, Oliver Groth, Aron Monszpart, Mar- 2018, Montreal,´ Canada, pp. 350–359, 2018. tin Engelcke, Ingmar Posner, Niloy J. Mitra, and An- drea Vedaldi. RELATE: physically plausible multi-object Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. scene synthesis using structured latent spaces. In Hugo Generating long sequences with sparse transformers. Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria- arXiv preprint arXiv:1904.10509, 2019. Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Generative Adversarial Transformers

Neural Information Processing Systems 33: Annual Con- Klaus Greff, Raphael¨ Lopez Kaufman, Rishabh Kabra, ference on Neural Information Processing Systems 2020, Nick Watters, Christopher Burgess, Daniel Zoran, Loic NeurIPS 2020, December 6-12, 2020, virtual, 2020. Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative vari- Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, ational inference. In Kamalika Chaudhuri and Ruslan and Ingmar Posner. GENESIS: generative scene inference Salakhutdinov (eds.), Proceedings of the 36th Interna- and sampling with object-centric latent representations. tional Conference on , ICML 2019, 9- In 8th International Conference on Learning Represen- 15 June 2019, Long Beach, California, USA, volume 97 tations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, of Proceedings of Machine Learning Research, pp. 2424– 2020. OpenReview.net, 2020. 2433. PMLR, 2019. S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yu- Richard Langton Gregory. The intelligent eye. 1970. val Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E. Hinton. Attend, infer, repeat: Fast scene Jinjin Gu, Yujun Shen, and Bolei Zhou. Image process- understanding with generative models. In Daniel D. Lee, ing using multi-code GAN prior. In 2020 IEEE/CVF Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, Conference on Computer Vision and Pattern Recogni- and Roman Garnett (eds.), Advances in Neural Infor- tion, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, mation Processing Systems 29: Annual Conference on pp. 3009–3018. IEEE, 2020. doi: 10.1109/CVPR42600. Neural Information Processing Systems 2016, December 2020.00308. 5-10, 2016, Barcelona, Spain, pp. 3225–3233, 2016. David Walter Hamlyn. The psychology of perception: A Patrick Esser, Robin Rombach, and Bjorn¨ Ommer. Taming philosophical examination of Gestalt theory and deriva- transformers for high-resolution image synthesis. arXiv tive theories of perception, volume 13. Routledge, 2017. preprint arXiv:2012.09841, 2020. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Bernhard Nessler, and Sepp Hochreiter. Gans trained Fang, and Hanqing Lu. Dual attention network for scene by a two time-scale update rule converge to a local nash segmentation. In IEEE Conference on Computer Vision equilibrium. In Isabelle Guyon, Ulrike von Luxburg, and Pattern Recognition, CVPR 2019, Long Beach, CA, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. USA, June 16-20, 2019, pp. 3146–3154. Computer Vision Vishwanathan, and Roman Garnett (eds.), Advances in Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019. Neural Information Processing Systems 30: Annual Con- 00326. ference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 6626– Robert Geirhos, Patricia Rubisch, Claudio Michaelis, 6637, 2017. Matthias Bethge, Felix A. Wichmann, and Wieland Bren- del. Imagenet-trained cnns are biased towards texture; Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim increasing shape bias improves accuracy and robustness. Salimans. Axial attention in multidimensional transform- In 7th International Conference on Learning Representa- ers. arXiv preprint arXiv:1912.12180, 2019. tions, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Lo- OpenReview.net, 2019. cal relation networks for image recognition. In 2019 James J Gibson. A theory of direct visual perception. 
Vi- IEEE/CVF International Conference on Computer Vi- sion and Mind: selected readings in the philosophy of sion, ICCV 2019, Seoul, Korea (South), October 27 - perception, pp. 77–90, 2002. November 2, 2019, pp. 3463–3472. IEEE, 2019. doi: 10.1109/ICCV.2019.00356. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Zilong Huang, Xinggang Wang, Lichao Huang, Chang and . Generative adversarial networks. Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss- arXiv preprint arXiv:1406.2661, 2014. cross attention for semantic segmentation. In 2019 IEEE/CVF International Conference on Computer Vi- Klaus Greff, Sjoerd van Steenkiste, and Jurgen¨ Schmidhu- sion, ICCV 2019, Seoul, Korea (South), October 27 ber. Neural expectation maximization. In Isabelle Guyon, - November 2, 2019, pp. 603–612. IEEE, 2019. doi: Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, 10.1109/ICCV.2019.00069. Rob Fergus, S. V. N. Vishwanathan, and Roman Gar- nett (eds.), Advances in Neural Information Processing Monika Intaite,˙ Valdas Noreika, Alvydas Soliˇ unas,¯ and Systems 30: Annual Conference on Neural Information Christine M Falter. Interaction of bottom-up and top- Processing Systems 2017, December 4-9, 2017, Long down processes in the perception of ambiguous figures. Beach, CA, USA, pp. 6691–6701, 2017. Vision Research, 89:24–31, 2013. Generative Adversarial Transformers

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. 21-26, 2017, pp. 105–114. IEEE Computer Society, 2017. Efros. Image-to-image translation with conditional adver- doi: 10.1109/CVPR.2017.19. sarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, HI, USA, July 21-26, 2017, pp. 5967–5976. IEEE Com- Seungjin Choi, and Yee Whye Teh. Set transformer: puter Society, 2017. doi: 10.1109/CVPR.2017.632. A framework for attention-based permutation-invariant neural networks. In Kamalika Chaudhuri and Ruslan Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Trans- Salakhutdinov (eds.), Proceedings of the 36th Interna- gan: Two transformers can make one strong gan. arXiv tional Conference on Machine Learning, ICML 2019, 9- preprint arXiv:2102.07074, 2021. 15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 3744– Justin Johnson, Bharath Hariharan, Laurens van der Maaten, 3753. PMLR, 2019. Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional lan- Stuart Lloyd. Least squares quantization in pcm. IEEE guage and elementary visual reasoning. In 2017 IEEE transactions on information theory, 28(2):129–137, 1982. Conference on Computer Vision and Pattern Recogni- tion, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, Francesco Locatello, Dirk Weissenborn, Thomas Un- pp. 1988–1997. IEEE Computer Society, 2017. doi: terthiner, Aravindh Mahendran, Georg Heigold, Jakob 10.1109/CVPR.2017.215. Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object- centric learning with slot attention. arXiv preprint Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image gen- arXiv:2006.15055, 2020. eration from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Recti- pp. 1219–1228, 2018. fier nonlinearities improve neural network acoustic mod- Tero Karras, Samuli Laine, and Timo Aila. A style-based els. In Proc. icml, volume 30, pp. 3. Citeseer, 2013. generator architecture for generative adversarial networks. Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein In IEEE Conference on Computer Vision and Pattern Karimi, Antoine Bordes, and Jason Weston. Key-value Recognition, CVPR 2019, Long Beach, CA, USA, June 16- memory networks for directly reading documents. In Pro- 20, 2019, pp. 4401–4410. Computer Vision Foundation / ceedings of the 2016 Conference on Empirical Methods IEEE, 2019. doi: 10.1109/CVPR.2019.00453. in Natural Language Processing, pp. 1400–1409, Austin, Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Texas, November 2016. Association for Computational Jaakko Lehtinen, and Timo Aila. Analyzing and improv- Linguistics. doi: 10.18653/v1/D16-1147. ing the image quality of stylegan. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recogni- Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong- tion, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Liang Yang, and Niloy Mitra. Blockgan: Learning 3d pp. 8107–8116. IEEE, 2020. doi: 10.1109/CVPR42600. object-aware scene representations from unlabelled im- 2020.00813. ages. arXiv preprint arXiv:2002.08988, 2020.

Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, and Ge- Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan offrey E. Hinton. Stacked capsule autoencoders. In Zhu. Semantic image synthesis with spatially-adaptive Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, normalization. In IEEE Conference on Computer Vision Florence d’Alche-Buc,´ Emily B. Fox, and Roman Gar- and Pattern Recognition, CVPR 2019, Long Beach, CA, nett (eds.), Advances in Neural Information Processing USA, June 16-20, 2019, pp. 2337–2346. Computer Vision Systems 32: Annual Conference on Neural Information Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019. Processing Systems 2019, NeurIPS 2019, December 8-14, 00244. 2019, Vancouver, BC, Canada, pp. 15486–15496, 2019. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Ca- Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. ballero, Andrew Cunningham, Alejandro Acosta, An- Image transformer. In Jennifer G. Dy and Andreas Krause drew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan (eds.), Proceedings of the 35th International Conference Wang, and Wenzhe Shi. Photo-realistic single image on Machine Learning, ICML 2018, Stockholmsmassan,¨ super-resolution using a generative adversarial network. Stockholm, Sweden, July 10-15, 2018, volume 80 of Pro- In 2017 IEEE Conference on Computer Vision and Pat- ceedings of Machine Learning Research, pp. 4052–4061. tern Recognition, CVPR 2017, Honolulu, HI, USA, July PMLR, 2018. Generative Adversarial Transformers

Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Bello, Anselm Levskaya, and Jon Shlens. Stand-alone pp. 7794–7803. IEEE Computer Society, 2018. doi: 10. self-attention in vision models. In Hanna M. Wallach, 1109/CVPR.2018.00813. Hugo Larochelle, Alina Beygelzimer, Florence d’Alche-´ Buc, Emily B. Fox, and Roman Garnett (eds.), Advances Yuxin Wu, Alexander Kirillov, Francisco Massa, in Neural Information Processing Systems 32: Annual Wan-Yen Lo, and Ross Girshick. Detectron2. Conference on Neural Information Processing Systems https://github.com/facebookresearch/ 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, detectron2, 2019. BC, Canada, pp. 68–80, 2019. Zongze Wu, Dani Lischinski, and Eli Shechtman. Alec Radford, Luke Metz, and Soumith Chintala. Unsuper- StyleSpace analysis: Disentangled controls for StyleGAN vised representation learning with deep convolutional gen- image generation. arXiv preprint arXiv:2011.12799, erative adversarial networks. In Yoshua Bengio and Yann 2020. LeCun (eds.), 4th International Conference on Learning Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Representations, ICLR 2016, San Juan, Puerto Rico, May Funkhouser, and Jianxiong Xiao. Lsun: Construction 2-4, 2016, Conference Track Proceedings, 2016. of a large-scale image dataset using deep learning with Anirudh Ravula, Chris Alberti, Joshua Ainslie, Li Yang, humans in the loop. arXiv preprint arXiv:1506.03365, Philip Minh Pham, Qifan Wang, Santiago Ontanon, 2015. Sumit Kumar Sanghai, Vaclav Cvicek, and Zach Fisher. Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Etc: Encoding long and structured inputs in transformers. Augustus Odena. Self-attention generative adversarial 2020. networks. In Kamalika Chaudhuri and Ruslan Salakhutdi- Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dy- nov (eds.), Proceedings of the 36th International Confer- namic routing between capsules. In Isabelle Guyon, Ul- ence on Machine Learning, ICML 2019, 9-15 June 2019, rike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Long Beach, California, USA, volume 97 of Proceedings Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), of Machine Learning Research, pp. 7354–7363. PMLR, Advances in Neural Information Processing Systems 30: 2019. Annual Conference on Neural Information Processing Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring Systems 2017, December 4-9, 2017, Long Beach, CA, self-attention for image recognition. In 2020 IEEE/CVF USA, pp. 3856–3866, 2017. Conference on Computer Vision and Pattern Recognition, Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. Jia, and Ching-Hui Chen. Global self-attention networks 10073–10082. IEEE, 2020. doi: 10.1109/CVPR42600. for image recognition. arXiv preprint arXiv:2010.03019, 2020.01009. 2020. Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Barry Smith. Foundations of gestalt theory. 1988. SEAN: image synthesis with semantic region-adaptive normalization. In 2020 IEEE/CVF Conference on Com- Sjoerd van Steenkiste, Karol Kurach, Jurgen¨ Schmidhuber, puter Vision and Pattern Recognition, CVPR 2020, Seat- and Sylvain Gelly. Investigating object compositionality tle, WA, USA, June 13-19, 2020, pp. 5103–5112. IEEE, in generative adversarial networks. Neural Networks, 130: 2020. doi: 10.1109/CVPR42600.2020.00515. 309–325, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Process- ing Systems 30: Annual Conference on Neural Informa- tion Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017.

Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition,