AlphaGAN: Fully Differentiable Architecture Search for Generative Adversarial Networks

Anonymous Author(s) Affiliation Address email

Abstract

Generative Adversarial Networks (GANs) are formulated as minimax game problems, whereby generators attempt to approach real data distributions by virtue of adversarial learning against discriminators. In this work, we aim to boost model learning from the perspective of network architectures, by incorporating recent progress on automated architecture search into GANs. To this end, we propose a fully differentiable search framework for generative adversarial networks, dubbed alphaGAN. The searching process is formalized as solving a bi-level minimax optimization problem, where the outer-level objective aims to seek a suitable network architecture towards pure Nash Equilibrium conditioned on the network parameters optimized at the inner level. The entire optimization performs a first-order method by alternately optimizing the two-level objectives in a fully differentiable manner, enabling architecture search to be completed in an enormous search space. Extensive experiments on the CIFAR-10 and STL-10 datasets show that our algorithm can obtain high-performing architectures with only 3 GPU-hours on a single GPU, in a search space comprising approximately $2 \times 10^{11}$ possible configurations.

1 Introduction

Generative Adversarial Networks (GANs) [1] have shown promising performance on a variety of generative tasks (e.g., image generation [2], image translation [3, 4], dialogue generation [5], and image inpainting [6]). However, pursuing high-performance generative networks is non-trivial due to the non-convex, non-concave property of the problem. There is a rich history of research aiming to improve training stability and alleviate mode collapse by introducing new generative adversarial functions (e.g., the Wasserstein distance [7], least-squares loss [8], and hinge loss [9]) or regularization (e.g., gradient penalty [10, 11]).

Alongside the direction of improving loss functions, improving architectures has proven important for stabilizing training and improving generalization. Previous works [12, 9, 10, 13, 2] employ deep convolutional networks to boost the performance of GANs. However, such manual architecture design typically requires domain-specific knowledge from human experts, which is even more challenging for GANs due to the minimax formulation they intrinsically possess. Recent progress of architecture search on a variety of tasks [14, 15, 16, 17] has shown that remarkable results can be obtained by automating the architecture search process.

In this paper, we aim to address the problem of GAN architecture search from the perspective of game theory, since GAN training is essentially a minimax problem [1] targeting a pure Nash Equilibrium of the generator and the discriminator [18, 19]. From this perspective, we propose a fully differentiable architecture search framework for GANs, dubbed alphaGAN, in which a differentiable evaluation metric is introduced for guiding architecture search towards pure Nash Equilibrium [20]. Motivated by DARTS [15], we formulate the search process of alphaGAN as a bi-level minimax optimization

problem, and solve it efficiently via stochastic gradient-type methods. Specifically, the outer-level objective aims to optimize the generator architecture parameters towards pure Nash Equilibrium, whereas the inner-level constraint targets optimizing the weight parameters conditioned on the currently searched architecture.

This work is related to several recent methods: GAN architecture search has been performed with a reinforcement learning (RL)-based paradigm [21, 22, 23] or a gradient-based paradigm [24].

Extensive experiments, including comparisons to these methods and analysis of the searched architectures, demonstrate the effectiveness of the proposed algorithm in both performance and efficiency. Specifically, alphaGAN can discover high-performance architectures while being much faster than the other automated architecture search methods.

2 GAN Architecture Search as Fully Differentiable Optimization

Differentiable Architecture Search was first proposed in [15], where the problem is formulated as a bi-level optimization. Weight parameters and architecture parameters are optimized in the inner level and outer level, respectively. The objective function in both levels is the cross-entropy loss, which can reflect the quality of current models in classification tasks.

However, deploying such a framework to search architectures of GANs is non-trivial. The training of GANs corresponds to the optimization of a minimax problem, and many previous works [2, 25] have pointed out that the adversarial loss cannot reflect the quality of GANs. A suitable evaluation metric is therefore essential for a gradient-based NAS-GAN framework.

Evaluation function. Due to the intrinsic minimax property of GANs, the training process of GANs can be viewed as a zero-sum game as in [18, 1]. The universal objective of training GANs can consequently be regarded as reaching pure equilibrium. Hence, we adopt the primal-dual gap function [25, 26] for evaluating the generalization of vanilla GANs. Given a pair of G and D, the duality-gap function is defined as

$$V(G, D) = \mathrm{Adv}(G, \bar D) - \mathrm{Adv}(\bar G, D) := \max_{\bar D} \mathrm{Adv}(G, \bar D) - \min_{\bar G} \mathrm{Adv}(\bar G, D). \qquad (1)$$

The evaluation metric $V(G, D)$ is non-negative, and $V(G, D) = 0$ can only be achieved when the pure equilibrium holds. Function (1) provides a quantified measure of "how close the current GAN is to pure equilibrium", which can be used for assessing model capacity.
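As a rough illustration of how the gap can be estimated in practice, the sketch below approximates the best responses $\bar D$ and $\bar G$ by taking a few Adam steps from copies of the current players. It is a minimal sketch under stated assumptions, not the authors' released procedure: `adv_loss(G, D, real, z)`, `real_batches`, and `z_sampler` are assumed helpers.

```python
import copy
import torch

def estimate_duality_gap(G, D, adv_loss, real_batches, z_sampler,
                         steps=5, lr=2e-4):
    """Rough estimate of V(G, D) = max_D' Adv(G, D') - min_G' Adv(G', D).

    adv_loss(G, D, real, z) is an assumed helper returning the adversarial
    objective that D maximizes and G minimizes; real_batches is a list of
    validation batches and z_sampler(n) draws n latent vectors.
    """
    # Approximate the best-response discriminator by a few ascent steps.
    # (Gradients that land on the frozen player are simply discarded here.)
    D_bar = copy.deepcopy(D)
    opt_d = torch.optim.Adam(D_bar.parameters(), lr=lr, betas=(0.0, 0.999))
    for real in real_batches[:steps]:
        loss = -adv_loss(G, D_bar, real, z_sampler(real.size(0)))
        opt_d.zero_grad()
        loss.backward()
        opt_d.step()

    # Approximate the best-response generator by a few descent steps.
    G_bar = copy.deepcopy(G)
    opt_g = torch.optim.Adam(G_bar.parameters(), lr=lr, betas=(0.0, 0.999))
    for real in real_batches[:steps]:
        loss = adv_loss(G_bar, D, real, z_sampler(real.size(0)))
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()

    # Evaluate the gap on a held-out batch.
    real = real_batches[0]
    z = z_sampler(real.size(0))
    return adv_loss(G, D_bar, real, z) - adv_loss(G_bar, D, real, z)
```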

The architecture search for GANs can be formulated as a specific bi-level optimization problem:
$$\min_{\alpha} \Big\{ V(G, D) : (G, D) := \arg\min_{G}\max_{D} \mathrm{Adv}(G, D) \Big\}, \qquad (2)$$

where $V(G, D)$ is evaluated on the validation dataset and supervises the search for the optimal generator architecture as the outer-level problem, and the inner-level optimization of $\mathrm{Adv}(G, D)$ aims to learn suitable network parameters (for both the generator and the discriminator) for the GAN under the current architecture.

In this work, we exploit the hinge loss from [9, 27] as the generative adversarial function $\mathrm{Adv}(G, D)$.
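For reference, a minimal sketch of the hinge formulation, written (as is common in implementations) as two losses that D and G each minimize on raw discriminator logits; the function names are our own and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Discriminator side of the hinge objective on real/fake logits.
    return torch.mean(F.relu(1.0 - d_real)) + torch.mean(F.relu(1.0 + d_fake))

def g_hinge_loss(d_fake):
    # Generator side: push D(G(z)) up, i.e. minimize -E[D(G(z))].
    return -torch.mean(d_fake)
```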

AlphaGAN formulation. By integrating the generative adversarial function (i.e., the hinge loss) and the evaluation function (1) into the bi-level optimization (2), we obtain the final objective of the framework as follows,

$$\min_{\alpha} \; V_{\mathrm{val}}(G, D) = \mathrm{Adv}(G, \bar D) - \mathrm{Adv}(\bar G, D) \qquad (3)$$

$$\text{s.t.}\quad \omega \in \arg\min_{\omega_G}\max_{\omega_D} \mathrm{Adv}_{\mathrm{train}}(G, D), \qquad (4)$$

where the generator $G$ and the discriminator $D$ are parameterized with the variables $(\alpha_G, \omega_G)$ and $\omega_D$, respectively, $\bar D = \arg\max_{D} \mathrm{Adv}_{\mathrm{val}}(G, D)$, and $\bar G = \arg\min_{G} \mathrm{Adv}_{\mathrm{val}}(G, D)$. The detailed search algorithm and other details are given in the supplementary material.

Table 1: Comparison with state-of-the-art GANs on CIFAR-10. † denotes results reproduced by us, with the structure released by AutoGAN and trained under the same setting as AutoGAN.

| Architecture | Params (M) | FLOPs (G) | Search time (GPU-hours) | Search space | Search method | IS (↑ is better) | FID (↓ is better) |
|---|---|---|---|---|---|---|---|
| DCGAN [12] | - | - | - | - | manual | 6.64 ± 0.14 | - |
| SN-GAN [9] | - | - | - | - | manual | 8.22 ± 0.05 | 21.7 ± 0.01 |
| Progressive GAN [13] | - | - | - | - | manual | 8.80 ± 0.05 | - |
| WGAN-GP, ResNet [10] | - | - | - | - | manual | 7.86 ± 0.07 | - |
| AutoGAN [22] | 5.192 | 1.77 | - | ∼ 10^5 | RL | 8.55 ± 0.1 | 12.42 |
| AutoGAN† | 5.192 | 1.77 | 82 | ∼ 10^5 | RL | 8.38 ± 0.08 | 13.95 |
| AGAN [21] | - | - | 28800 | ∼ 20000 | RL | 8.29 ± 0.09 | 30.5 |
| Random search [30] | 2.701 | 1.11 | 40 | ∼ 2 × 10^11 | Random | 8.46 ± 0.09 | 15.43 |
| alphaGAN(l) | 8.618 | 2.78 | 22 | ∼ 2 × 10^11 | gradient | 8.71 ± 0.12 | 11.23 |
| alphaGAN(s) | 2.953 | 1.32 | 3 | ∼ 2 × 10^11 | gradient | 8.60 ± 0.11 | 11.85 |

3 Experiments

In this section, we conduct extensive experiments on CIFAR-10 [28] and STL-10 [29]. First, the generator architecture is searched on CIFAR-10, and the discretized optimal structure is fully re-trained from scratch following [22] in Section 3.1. We compare alphaGAN with other automated GAN methods on multiple measures to demonstrate its effectiveness. Second, the generalization of the searched architectures is verified by fully training them on STL-10 and evaluating them in Section 3.2. To further understand the properties of our method, a series of studies on the key components of the framework is presented in Section 3.3. Experiment details are given in the supplementary material.

3.1 Searching on CIFAR-10

We first compare our method with recent automated GAN methods. For a fair comparison, we report the performance of the best run (over 3 runs) for the reproduced baselines and ours in Table 1, and provide the performance of several representative works with manually designed architectures for reference. We provide a detailed analysis and discussion of the searching process, and the statistical properties of the architectures searched by alphaGAN are given in the supplementary material.

The performance of alphaGAN with two search configurations, obtained by adjusting the step sizes of the inner and outer loops, is shown in Table 1: alphaGAN(l) passes through an entire epoch of the training and validation sets for each loop (i.e., steps = 390), whereas alphaGAN(s) uses smaller intervals with steps = 20.

The results show that our method performs well in both settings and outperforms the other automated GAN methods in terms of both efficiency and performance. alphaGAN(l) obtains the lowest FID among all baselines. In particular, alphaGAN(s) attains the best tradeoff between efficiency and performance: it achieves comparable results while searching in a large search space (significantly larger than those of the RL-based baselines) in a considerably efficient manner (i.e., only 3 GPU-hours compared to the baselines requiring tens to thousands of GPU-hours). The architecture obtained by alphaGAN(s) is light-weight and computationally efficient, reaching a good trade-off between performance and time complexity.

3.2 Transferability on STL-10

To validate the transferability of the architectures obtained by alphaGAN, we directly train models with the obtained architectures on the STL-10 dataset. The results are shown in Table 2. Both alphaGAN(l) and alphaGAN(s) show remarkable superiority in performance over the baselines with either automated or manually designed architectures, revealing that the architectures searched by alphaGAN can be effectively exploited across datasets. It is somewhat surprising that alphaGAN(s) behaves best, achieving the best performance in both IS and FID. This also shows that, compared to increasing model complexity, an appropriate selection and composition of operations can contribute to model performance in a more efficient manner, which is consistent with the primary motivation of automating architecture search.

Table 2: Results on STL-10. The structures of alphaGAN(l) and alphaGAN(s) are searched on CIFAR-10 and fully trained on STL-10. † denotes reproduced results, with the architectural configurations released by the original papers.

| Architecture | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|
| SN-GAN [9] | - | - | 9.10 ± 0.04 | 40.1 ± 0.5 |
| ProbGAN [31] | - | - | 8.87 ± 0.095 | 46.74 |
| Improving MMD GAN [32] | - | - | 9.36 | 36.67 |
| AutoGAN [22] | 5.853 | 3.98 | 9.16 ± 0.12 | 31.01 |
| AutoGAN† | 5.853 | 3.98 | 9.38 ± 0.08 | 27.69 |
| AGAN [21] | - | - | 9.23 ± 0.08 | 52.7 |
| alphaGAN(l) | 9.279 | 6.26 | 9.53 ± 0.12 | 24.52 |
| alphaGAN(s) | 3.613 | 2.97 | 10.12 ± 0.13 | 22.43 |

3.3 Ablation Study

We conduct ablation experiments on CIFAR-10 to better understand the influence of its components under different configurations of both alphaGAN(l) and alphaGAN(s), focusing on two questions: the effect of searching the discriminator architecture, and how to obtain the optimal generator $\bar G$. More experiments are shown in the supplementary material.

Table 3: Ablation studies on CIFAR-10 (columns: Type, Search D?, Obtain $\bar G$ via updating $\bar\alpha_G$ and/or $\bar\omega_G$, IS, FID).

Search D's architecture or not? A natural question for alphaGAN is whether searching the discriminator structure can facilitate the searching and training of generators. The results in Table 3 show that searching the discriminator does not help the search for the optimal generator. We also conducted a trial in which GANs were trained with architectures obtained by searching both G and D, and the final performance is inferior to the setting of retraining with a given discriminator configuration. Simultaneously searching the architectures of both G and D potentially increases the effect of inferior discriminators, which may hamper the search for optimal generators conditioned on strong discriminators. In this regard, solely learning the generator's architecture may be a better choice.

How to obtain $\bar G$? In the definition of the duality gap, $\bar G$ and $\bar D$ denote the optima of G and D, respectively. As both the architecture and the network parameters are variables for G, we conduct experiments investigating the effect of updating $\bar\omega_G$ and $\bar\alpha_G$ for attaining $\bar G$. The results in Table 3 show that updating $\bar\omega_G$ alone achieves the best performance. Approximating $\bar G$ by updating $\bar\omega_G$ alone means that the architectures of G and $\bar G$ are identical, and hence optimizing the architecture parameters $\alpha_G$ in (3) can be viewed as compensating for the gap brought by the weight parameters $\omega_G$ and $\bar\omega_G$.

4 Conclusion

We presented alphaGAN, a fully differentiable architecture search framework for GANs, which is efficient and effective in seeking high-performing generator architectures from a vast number of possible configurations, achieving comparable or superior performance to state-of-the-art architectures that are either manually designed or automatically searched. In addition, the analysis of tracking the behavior of architecture performance and operation distribution gives some insights about architecture design, which may promote further research on architecture improvement. We mainly focused on vanilla GANs in this work and would like to extend the framework to conditional GANs, in which extra regularization on parts of the networks is typically imposed for task specialization, as future work.

Broader Impact

AlphaGAN will have potential positive impacts on tasks such as image/video generation, natural language generation, and other high-fidelity generation tasks. Researchers can utilize alphaGAN to design a powerful generative adversarial network with superior performance. On the other hand, we hope that this work can attract more attention in the AI research community to designing architectures for generation tasks rather than classification tasks. Moreover, the theory behind alphaGAN is transferable, so it can be applied to conditional GANs on several conditional generative tasks, e.g., style transfer, image-to-image translation, etc.

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[3] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

[4] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.

[5] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.

[6] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.

[7] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[8] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

[9] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[11] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

[12] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[13] Shaokai Ye, Xiaoyu Feng, Tianyun Zhang, Xiaolong Ma, Sheng Lin, Zhengang Li, Kaidi Xu, Wujie Wen, Sijia Liu, Jian Tang, et al. Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM. arXiv preprint arXiv:1903.09769, 2019.

[14] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[16] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.

[17] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

[18] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[20] John F. Nash et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.

[21] Hanchao Wang and Jun Huan. AGAN: Towards automated design of generative adversarial networks. arXiv preprint arXiv:1906.11080, 2019.

[22] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. AutoGAN: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3224–3234, 2019.

[23] Yuan Tian, Qin Wang, Zhiwu Huang, Wen Li, Dengxin Dai, Minghao Yang, Jun Wang, and Olga Fink. Off-policy reinforcement learning for efficient and effective GAN architecture search. arXiv preprint arXiv:2007.09180, 2020.

[24] Chen Gao, Yunpeng Chen, Si Liu, Zhenxiong Tan, and Shuicheng Yan. AdversarialNAS: Adversarial neural architecture search for GANs. arXiv preprint arXiv:1912.02037, 2019.

[25] Paulina Grnarova, Kfir Y. Levy, Aurelien Lucchi, Nathanael Perraudin, Ian Goodfellow, Thomas Hofmann, and Andreas Krause. A domain agnostic measure for monitoring and evaluating GANs. In Advances in Neural Information Processing Systems, pages 12069–12079, 2019.

[26] Cheng Peng, Hao Wang, Xiao Wang, and Zhouwang Yang. DG-GAN: The GAN with the duality gap, 2020.

[27] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[28] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.

[29] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[30] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.

[31] Hao He, Hao Wang, Guang-He Lee, and Yonglong Tian. ProbGAN: Towards probabilistic GAN with theoretical guarantees.

[32] Wei Wang, Yuan Sun, and Saman Halgamuge. Improving MMD-GAN training with repulsive loss function. arXiv preprint arXiv:1812.09916, 2018.

[33] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press, 1994.

[34] Ding-Zhu Du and Panos M. Pardalos. Minimax and Applications, volume 4. Springer Science & Business Media, 2013.

[35] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[36] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 2817–2826. JMLR.org, 2017.

[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[38] Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094, 2014.

[39] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and robustifying differentiable architecture search. arXiv preprint arXiv:1909.09656, 2019.

A Preliminaries

Minimax games have regained a lot of attention [33, 34] since they were popularized in machine learning, e.g., in generative adversarial networks (GANs) [18] and reinforcement learning [35, 36]. Given a function $\mathrm{Adv}: X \times Y \rightarrow \mathbb{R}$, we consider a minimax game and its dual form:
$$\min_{G}\max_{D} \mathrm{Adv}(G, D) = \min_{G}\Big(\max_{D} \mathrm{Adv}(G, D)\Big), \qquad \max_{D}\min_{G} \mathrm{Adv}(G, D) = \max_{D}\Big(\min_{G} \mathrm{Adv}(G, D)\Big).$$

The pure equilibrium [20] of a minimax game can be used to characterize the best decisions of the two players G and D in the above minimax game.

Definition 1. $(G, D)$ is called a pure equilibrium of the game $\min_{G}\max_{D} \mathrm{Adv}(G, D)$ if it holds that
$$\max_{\bar D} \mathrm{Adv}(G, \bar D) = \min_{\bar G} \mathrm{Adv}(\bar G, D), \qquad (5)$$

where $\bar G = \arg\min_{G} \mathrm{Adv}(G, D)$ and $\bar D = \arg\max_{D} \mathrm{Adv}(G, D)$. When the minimax game equals its dual problem, $(G, D)$ is a pure equilibrium of the game. Hence, the gap between the minimax problem and its dual form can be used to measure the degree of approaching pure equilibrium [25].

The Generative Adversarial Network (GAN) proposed in [1] is mathematically defined as a minimax game problem with a binary cross-entropy loss that compares the distributions of real images and of synthetic images generated by the model. Despite the remarkable progress achieved by GANs, training high-performance models is still challenging for many generative tasks due to their fragility to almost every factor in the training process. Architectures of GANs have proven useful for stabilizing training and improving generalization [9, 10, 13, 2], and we hope to discover architectures by automating the design process with limited computational resources in a principled, differentiable manner.

B Algorithm and Optimization

In this section, we give a detailed description of the training algorithm and optimization process of alphaGAN. We first describe the entire network structure of the generator and the discriminator, the search space of the generator, and the continuous relaxation of the architectural parameters.

Base backbone of G and D. An illustration of the entire structure of the generator and the discriminator is shown in Fig. 1. The generator is constructed by stacking several cells whose topology is identical to those in AutoGAN [22] and SN-GAN [9] (Fig. 2). Each cell, regarded as a directed acyclic graph, is comprised of nodes representing intermediate feature maps and edges connecting pairs of nodes via different operations. We apply a fixed network architecture for the discriminator, based on the conventional design of [9].

Search space of G. The search space is compounded from two types of operations, i.e., normal operations and up-sampling operations. The pool of normal operations, denoted as $\mathcal{O}_n$, is comprised of {conv_1x1, conv_3x3, conv_5x5, sep_conv_3x3, sep_conv_5x5, sep_conv_7x7}. The pool of up-sampling operations, denoted as $\mathcal{O}_u$, is comprised of {deconv, nearest, bilinear}, where "deconv" denotes the ConvTransposed_2x2 operation. Our method allows $(6^3 \times 3^2)^3 \times 3^3 \approx 2 \times 10^{11}$ possible configurations for the generator architecture, which is larger than the $\sim 10^5$ of AutoGAN [22].
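A sketch of the two candidate pools in PyTorch; the operation names follow the paper, while the factory functions and padding choices are our own assumptions for illustration.

```python
import torch.nn as nn

def conv(c, k):
    # Standard convolution with 'same' padding (channel-preserving).
    return nn.Conv2d(c, c, k, padding=k // 2)

def sep_conv(c, k):
    # Depthwise-separable convolution: depthwise k x k + pointwise 1 x 1.
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c),
                         nn.Conv2d(c, c, 1))

NORMAL_OPS = {
    'conv_1x1':     lambda c: conv(c, 1),
    'conv_3x3':     lambda c: conv(c, 3),
    'conv_5x5':     lambda c: conv(c, 5),
    'sep_conv_3x3': lambda c: sep_conv(c, 3),
    'sep_conv_5x5': lambda c: sep_conv(c, 5),
    'sep_conv_7x7': lambda c: sep_conv(c, 7),
}

UP_OPS = {
    'deconv':   lambda c: nn.ConvTranspose2d(c, c, 2, stride=2),
    'nearest':  lambda c: nn.Upsample(scale_factor=2, mode='nearest'),
    'bilinear': lambda c: nn.Upsample(scale_factor=2, mode='bilinear',
                                      align_corners=False),
}
```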

Continuous relaxation. The discrete selection of operations is approximated by a soft decision with a mutually exclusive function, following [15]. Formally, let $o \in \mathcal{O}_n$ denote a normal operation and $\alpha_{i,j}^{o}$ the architecture parameter associated with operation $o$ on the edge between node $i$ and its adjacent node $j$. Then the output on the edge induced by input node $i$ is
$$O_{i,j}(x) = \sum_{o \in \mathcal{O}_n} \frac{\exp\big(\alpha_{i,j}^{o}\big)}{\sum_{o' \in \mathcal{O}_n} \exp\big(\alpha_{i,j}^{o'}\big)}\, o(x), \qquad (6)$$
and the output of node $j$ is summed over all of its preceding nodes, i.e., $x^{j} = \sum_{i \in Pr(j)} O_{i,j}(x^{i})$. The selection of up-sampling operations follows the same procedure.
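The relaxation of Eq. (6) corresponds to a DARTS-style mixed operation. The sketch below, reusing the candidate pools sketched above, is an illustrative implementation and not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture over a candidate pool on one edge (Eq. (6))."""

    def __init__(self, channels, op_pool):
        super().__init__()
        self.ops = nn.ModuleList([make(channels) for make in op_pool.values()])
        # One architecture parameter per candidate operation on this edge.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

A node output is then the sum of the mixed-operation outputs from all of its predecessors, and discretization keeps the arg-max operation on each edge.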

Solving alphaGAN. We apply an alternating minimization method to solve alphaGAN with respect to the variables $\big((\omega_G, \omega_D), (\bar\omega_G, \bar\omega_D), \alpha_G\big)$, as summarized in Algorithm 1, which is a fully differentiable gradient-type algorithm.

Algorithm 1 Searching the architecture of alphaGAN

Parameters: Initialize weight parameters $(\omega_G^{1}, \omega_D^{1})$ and generator architecture parameters $\alpha_G^{1}$. Initialize base learning rate $\eta$, momentum parameter $\beta_1$, and exponential moving average parameter $\beta_2$ for the Adam optimizer.
1: for $k = 1, 2, \dots, K$ do
2:   Set $(\omega_G^{k,1}, \omega_D^{k,1}) = (\omega_G^{k}, \omega_D^{k})$ and set $\alpha_G^{k,1} = \alpha_G^{k}$;
3:   for $t = 1, 2, \dots, T$ do
4:     Sample real data $\{x^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_r$ from the training set and noise $\{z^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_z$; estimate the gradient of the Adv loss with $\{x^{(l)}, z^{(l)}\}$ at $(\omega_G^{k,t}, \omega_D^{k,t})$, dubbed $\nabla \mathrm{Adv}(\omega_G^{k,t}, \omega_D^{k,t})$;
5:     $\omega_D^{k,t+1} = \mathrm{Adam}\big(\nabla_{\omega_D} \mathrm{Adv}(\omega_G^{k,t}, \omega_D^{k,t}), \omega_D^{k,t}, \eta, \beta_1, \beta_2\big)$;
6:     $\omega_G^{k,t+1} = \mathrm{Adam}\big(\nabla_{\omega_G} \mathrm{Adv}(\omega_G^{k,t}, \omega_D^{k,t}), \omega_G^{k,t}, \eta, \beta_1, \beta_2\big)$;
7:   end for
8:   Set $(\omega_G^{k+1}, \omega_D^{k+1}) = (\omega_G^{k,T}, \omega_D^{k,T})$;
9:   Receive the architecture parameters $\alpha_G^{k}$ and the network weight parameters $(\omega_G^{k+1}, \omega_D^{k+1})$; estimate the weight parameters $(\bar\omega_G^{k+1}, \bar\omega_D^{k+1})$ of $(\bar G, \bar D)$ via Algorithm 2;
10:  for $s = 1, 2, \dots, S$ do
11:    Sample real data $\{x^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_r$ from the validation set and latent variables $\{z^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_z$; estimate the gradient of the duality gap $V$ with $\{x^{(l)}, z^{(l)}\}$ at $\alpha_G^{k,s}$, dubbed $\nabla V(\alpha_G^{k,s})$;
12:    $\alpha_G^{k,s+1} = \mathrm{Adam}\big(\nabla_{\alpha_G} V(\alpha_G^{k,s}), \alpha_G^{k,s}, \eta, \beta_1, \beta_2\big)$;
13:  end for
14:  Set $\alpha_G^{k+1} = \alpha_G^{k,S}$;
15: end for
16: Return $\alpha_G = \alpha_G^{K}$.

Algorithm 2 Solving $\bar G$ and $\bar D$

Parameters: Receive the architecture parameters $\alpha_G$ and weight parameters $(\omega_G, \omega_D)$. Initialize weight parameters $(\bar\omega_G^{1}, \bar\omega_D^{1}) = (\omega_G, \omega_D)$ for $(\bar G, \bar D)$. Initialize base learning rate $\eta$, momentum parameter $\beta_1$, and EMA parameter $\beta_2$ for the Adam optimizer.
1: for $r = 1, 2, \dots, R$ do
2:   Sample real data $\{x^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_r$ from the validation set and noise $\{z^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_z$; estimate the gradient of the Adv loss with $\{x^{(l)}, z^{(l)}\}$ at $(\omega_G, \bar\omega_D^{r})$, dubbed $\nabla \mathrm{Adv}(\omega_G, \bar\omega_D^{r})$;
3:   $\bar\omega_D^{r+1} = \mathrm{Adam}\big(\nabla_{\bar\omega_D} \mathrm{Adv}(\omega_G, \bar\omega_D^{r}), \bar\omega_D^{r}, \eta, \beta_1, \beta_2\big)$;
4: end for
5: for $r = 1, 2, \dots, R$ do
6:   Sample noise $\{z^{(l)}\}_{l=1}^{m} \sim \mathbb{P}_z$; estimate the gradient of the Adv loss with $\{z^{(l)}\}$ at $(\bar\omega_G^{r}, \omega_D)$, dubbed $\nabla \mathrm{Adv}(\bar\omega_G^{r}, \omega_D)$;
7:   $\bar\omega_G^{r+1} = \mathrm{Adam}\big(\nabla_{\bar\omega_G} \mathrm{Adv}(\bar\omega_G^{r}, \omega_D), \bar\omega_G^{r}, \eta, \beta_1, \beta_2\big)$;
8: end for
9: Return $(\bar\omega_G, \bar\omega_D) = (\bar\omega_G^{R}, \bar\omega_D^{R})$.

Algorithm 1 is composed of three parts. The first part (lines 3–8), called the 'weight_part', optimizes the weight parameters $\omega$ on the training dataset via the Adam optimizer [37]. The second part (line 9), called the 'test-weight_part', optimizes the weight parameters $(\bar\omega_G, \bar\omega_D)$, and the third part (lines 10–13), called the 'arch_part', optimizes the architecture parameters $\alpha_G$ by minimizing the duality gap. Both the 'test-weight_part' and the 'arch_part' are optimized over the validation dataset via the Adam optimizer. Algorithm 2 details the process of computing $\bar G$ and $\bar D$ by updating the weight parameters $(\bar\omega_G, \bar\omega_D)$, given the currently searched generator architecture parameters $\alpha_G$ and the related network weight parameters $(\omega_G, \omega_D)$. In summary, the variables $\big((\omega_G, \omega_D), (\bar\omega_G, \bar\omega_D), \alpha_G\big)$ are optimized in an alternating fashion.
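The overall structure of Algorithm 1 can be summarized by the sketch below. Everything passed in is an assumed helper or object (`train_batches`, `val_batches`, `sample_noise`, `adv_loss`, `approximate_best_responses`, which plays the role of Algorithm 2, and the three optimizers); the hinge losses are those sketched in Section 2.

```python
def search_alphagan(G, D, opt_wG, opt_wD, opt_alpha,
                    train_batches, val_batches, sample_noise,
                    adv_loss, approximate_best_responses,
                    K=100, T=20, S=20):
    """Sketch of Algorithm 1; all arguments are assumed helpers/objects."""
    for k in range(K):
        # weight_part: T adversarial steps on the training split (inner level).
        for real in train_batches(T):
            z = sample_noise(real.size(0))
            opt_wD.zero_grad()
            d_hinge_loss(D(real), D(G(z).detach())).backward()
            opt_wD.step()

            opt_wG.zero_grad()
            g_hinge_loss(D(G(sample_noise(real.size(0))))).backward()
            opt_wG.step()

        # test-weight_part: best responses (G_bar, D_bar) on the validation split.
        G_bar, D_bar = approximate_best_responses(G, D, val_batches)

        # arch_part: S steps on alpha_G, minimizing the duality gap (outer level).
        for real in val_batches(S):
            z = sample_noise(real.size(0))
            gap = adv_loss(G, D_bar, real, z) - adv_loss(G_bar, D, real, z)
            opt_alpha.zero_grad()
            gap.backward()
            opt_alpha.step()
    return G
```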

C Experiment Details

C.1 Searching on CIFAR-10

The CIFAR-10 dataset is comprised of 50000 training images with a resolution of 32x32. We randomly split the dataset into two sets during searching: one is used as the training set for optimizing the network parameters $\omega_G$ and $\omega_D$ (25000 images), and the other is used as the validation set for optimizing the architecture parameters $\alpha_G$ (25000 images). The number of search iterations for alphaGAN(l) and alphaGAN(s) is set to 100. The channel number is 256 for the generator and 128 for the discriminator, and the dimension of the noise vector is 128. For a fair comparison, the discriminator adopted in searching is the same as the discriminator in AutoGAN [22]. The batch sizes of both the generator and the discriminator are set to 64. The learning rates of the weight parameters $\omega_G$ and $\omega_D$ are 2e-4, and the learning rate of the architecture parameters $\alpha_G$ is 3e-4. We use Adam as the optimizer. The hyperparameters for optimizing the weight parameters $\omega_G$ and $\omega_D$ are set to 0.0 for $\beta_1$, 0.999 for $\beta_2$, and 0 for the weight decay. The hyperparameters for optimizing the architecture parameters $\alpha_G$ are set to 0.5 for $\beta_1$, 0.999 for $\beta_2$, and 1e-3 for the weight decay.
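In code, the two Adam configurations above would look roughly as follows; `weight_parameters()` and `arch_parameters()` are assumed accessors separating $\omega_G$ from $\alpha_G$, not part of a released API.

```python
import torch

opt_wG = torch.optim.Adam(G.weight_parameters(), lr=2e-4,
                          betas=(0.0, 0.999), weight_decay=0)
opt_wD = torch.optim.Adam(D.parameters(), lr=2e-4,
                          betas=(0.0, 0.999), weight_decay=0)
opt_alpha = torch.optim.Adam(G.arch_parameters(), lr=3e-4,
                             betas=(0.5, 0.999), weight_decay=1e-3)
```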

We use the entire training set of CIFAR-10 for retraining the network parameters after obtaining the architectures. The minibatch size is 128 for the generator and 64 for the discriminator. The channel number is set to 256 for the generator and 128 for the discriminator, and the dimension of the noise vector is 128. The discriminator exploited in the re-training phase is identical to that used during searching. The generator is trained for 50000 iterations. The learning rates of the generator and the discriminator are set to 2e-4. The hyperparameters of the Adam optimizer are set to 0.0 for $\beta_1$, 0.9 for $\beta_2$, and 0 for the weight decay.

When testing, 50000 images are generated with random noise, and IS [18] and FID [19] are used to evaluate the performance of the generators.
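A minimal sketch of the sampling step is given below; the IS/FID computation itself is assumed to come from standard external implementations and is not shown.

```python
import torch

@torch.no_grad()
def sample_for_evaluation(G, n_images=50000, batch_size=100,
                          z_dim=128, device='cuda'):
    """Generate images from random noise for computing IS and FID."""
    G.eval()
    batches = []
    for _ in range(n_images // batch_size):
        z = torch.randn(batch_size, z_dim, device=device)
        batches.append(G(z).cpu())
    return torch.cat(batches, dim=0)
```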

C.2 Transferability

The STL-10 dataset is comprised of ∼105k training images. We resize the images to 48x48 due to considerations of memory and computational overhead. The dimension of the noise vector is 128. We train the generator for 80000 iterations. The batch sizes for optimizing the generator and the discriminator are set to 128 and 64, respectively. The channel numbers of the generator and the discriminator are set to 256 and 128, respectively. The learning rates for the generator and the discriminator are both set to 2e-4. We also use Adam as the optimizer, where $\beta_1$ is set to 0.5, $\beta_2$ is set to 0.9, and the weight decay is set to 0.

D The structures of the generator and the discriminator

The entire structures of the generator and the discriminator are illustrated in Fig. 1.

The topology of the cells in the generator and the discriminator is illustrated in Fig. 2. In the cell of the generator, the edges from node 0 to node 1 and from node 0 to node 3 correspond to up-sampling operations, and the remaining edges are normal operations. In the cell of the discriminator, the edges from node 2 to node 4 and from node 3 to node 4 are avg_pool_2x2 operations with stride 2, the edges from node 0 to node 1 and from node 1 to node 2 are conv_3x3 operations with stride 1, and the edge from node 0 to node 3 is a conv_1x1 operation with stride 1.

The structures of alphaGAN(l) and alphaGAN(s) are shown in Fig. 4 and Fig. 3, respectively.

E Relation between performance and structure

The distributions of operations in 'superior' and 'inferior' generators are shown in Fig. 5 and Fig. 6, respectively. We make the following observations. First, for up-sampling operations, superior architectures tend to exploit 'nearest' or 'bilinear' interpolation rather than 'deconvolution' operations. Second, 'conv_1x1' operations dominate in cell_1 of superior generators, suggesting that convolutions with large kernel sizes may not be optimal when the spatial dimensions of the feature maps are relatively small (i.e., 8x8). Finally, convolutions with large kernels (e.g., conv_5x5, sep_conv_3x3, and sep_conv_5x5) are preferred at higher resolutions (i.e., cell_3 of superior generators), indicating the benefit of integrating information from relatively large receptive fields for low-level representations at high resolutions.

Figure 1: The topology of the generator and the discriminator. (a) G: input → FC → cell_1 → cell_2 → cell_3 → BN+ReLU+Conv+tanh → fake_sample. (b) D: input → cell_1 → cell_2 → cell_3 → cell_4 → GAP → FC → output.

Figure 2: The topology of the cells in the generator (a) and the discriminator (b), identical to those of AutoGAN [22] and SN-GAN [9].

Figure 3: The structure of alphaGAN(s).

F Generated Samples

Generated samples of alphaGAN(s) on STL-10 are shown in Fig. 7.

G Additional Results

In this section, we present additional experimental results and analysis (omitted from the main paper due to the page limit), including model scaling, intermediate architectures during searching, the Gumbel-max trick, warm-up, an ablation study on the step sizes of the 'arch_part', the effect of channel numbers during searching, searching on STL-10, and an analysis of failure cases. The 'baseline' in Table 4 denotes the structure searched under the default settings of alphaGAN.

G.1 Robustness on Model Scaling

Figure 4: The structure of alphaGAN(l).

It would be interesting to know how the architectures perform when scaling model complexity up or down. To this end, we introduce a ratio that simply re-scales the channel dimensions of the network configuration for the fully-training step. The relation between performance and parameter size is illustrated in Fig. 8. The range in which promising performance is attained is relatively narrow for alphaGAN(s), mainly caused by the light-weight property induced by the dominant depthwise separable convolutions. Light-weight architectures naturally result in highly sparse connections between network neurons, which may be sensitive to the configuration difference between searching and re-training. In contrast, alphaGAN(l) shows acceptable performance over a wide range of parameter sizes (from 2M to 18M). Nevertheless, both present some degree of robustness to scaling of the original searching configuration.
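For concreteness, the rescaling used in this experiment can be thought of as multiplying a base channel configuration by a single ratio; the helper below is a hypothetical illustration, and the rounding rule is our assumption.

```python
def scale_channels(base_channels, ratio):
    # Rescale every stage's channel count by `ratio`, keeping at least 1.
    return [max(1, int(round(c * ratio))) for c in base_channels]

# e.g. scale_channels([64, 128, 160, 256, 384], 0.5) -> [32, 64, 80, 128, 192]
```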

G.2 Intermediate Architectures in Searching

To understand the search process of alphaGAN, we track the intermediate structures of alphaGAN(s) and alphaGAN(l) during searching and fully train them on CIFAR-10 (Fig. 9). We observe a clear trend that the architectures are learned towards high performance during searching, though slight oscillation may happen. Specifically, alphaGAN(l) realizes a gradual improvement in performance during the process, while alphaGAN(s) displays faster convergence in the early stage of the process and achieves comparable results, indicating that solving the inner-level optimization problem by virtue of rough approximations (as using more steps can always achieve a closer approximation of the optimum) can significantly benefit the efficiency of solving the bi-level problem without sacrificing accuracy.

Figure 9: Tracking architectures during searching: (a) IS and (b) FID over search iterations. alphaGAN(s) is denoted by the blue curve with plus markers and alphaGAN(l) by the red curve with triangle markers.

Figure 5: The distributions of normal operations (per cell and edge) in superior and inferior generators.

Figure 6: The distributions of up-sampling operations (per cell and edge) in superior and inferior generators.

Figure 7: Generated samples of alphaGAN(s) on STL-10.

G.3 Gumbel-max Trick

Figure 8: Relation between model capacity and performance (IS and FID versus parameter size, compared with AutoGAN). To align the model capacities with AutoGAN, the channels for G and D in alphaGAN(l) are [64, 96, 128, 192, 256, 384], the channels for G and D in alphaGAN(s) are [64, 128, 160, 256, 384], and the channels for G and D in AutoGAN are [64, 128, 192, 256, 384, 512].

Table 4: Gumbel-max trick and Warm-up.

| Type | Name | Gumbel-max? | Fix alphas? | IS | FID |
|---|---|---|---|---|---|
| alphaGAN(l) | baseline | × | × | 8.51 ± 0.06 | 11.38 |
| alphaGAN(l) | Gumbel-max | ✓ | × | 8.48 ± 0.10 | 20.69 |
| alphaGAN(l) | Warm-up | × | ✓ | 8.34 ± 0.07 | 15.49 |
| alphaGAN(s) | baseline | × | × | 8.72 ± 0.11 | 12.86 |
| alphaGAN(s) | Gumbel-max | ✓ | × | 8.56 ± 0.06 | 15.66 |
| alphaGAN(s) | Warm-up | × | ✓ | 8.25 ± 0.12 | 19.07 |

The Gumbel-max trick [38] can be written as
$$\beta^{o'} = \frac{\exp\big((\alpha^{o'} + g^{o'})/\tau\big)}{\sum_{o \in \mathcal{O}_n} \exp\big((\alpha^{o} + g^{o})/\tau\big)}, \qquad (7)$$
where $\beta^{o'}$ is the probability of selecting operation $o'$ after Gumbel-max, $\alpha^{o'}$ represents the architecture parameter of operation $o'$, $\mathcal{O}_n$ is the operation search space, $g$ denotes samples drawn from the Gumbel(0, 1) distribution, and $\tau$ is the temperature controlling the sharpness of the distribution. Instead of continuous relaxation, the trick chooses a single operation on each edge, enabling discretization during searching. We compare the results of searching with and without the Gumbel-max trick. The results in Table 4 show that searching with Gumbel-max may not be an essential factor for obtaining high-performance generator architectures.
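A sketch of Eq. (7) with the common straight-through variant, so that a single operation is selected on the forward pass while gradients still flow through the soft weights; this mirrors standard implementations rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_weights(alpha, tau=1.0, hard=True):
    """Sample Gumbel noise and produce (optionally hard) selection weights."""
    u = torch.rand_like(alpha).clamp_min(1e-20)
    g = -torch.log(-torch.log(u))                         # Gumbel(0, 1) samples
    beta = F.softmax((alpha + g) / tau, dim=-1)           # Eq. (7)
    if hard:
        index = beta.argmax(dim=-1, keepdim=True)
        one_hot = torch.zeros_like(beta).scatter_(-1, index, 1.0)
        beta = one_hot + beta - beta.detach()             # straight-through
    return beta
```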

G.4 Warm-up protocols

The generator contains two groups of parameters, $(\omega_G, \alpha_G)$. The optimization of $\alpha_G$ is highly related to the network parameters $\omega_G$. Intuitively, pretraining the network parameters $\omega_G$ can benefit the search of architectures, since a better initialization may facilitate convergence. To investigate this effect, we fix $\alpha_G$ and only update $\omega_G$ during the initial half of the searching schedule, and then $\alpha_G$ and $\omega_G$ are optimized alternately. This strategy is denoted as 'Warm-up' in Table 4.

The results show that the strategy may not help performance: the IS of 'Warm-up' is slightly worse than that of the baseline, and its FID is worse than that of the baseline. However, it can benefit the searching efficiency: it takes ∼15 GPU-hours for alphaGAN(l) (compared to ∼22 GPU-hours for the baseline) and ∼1 GPU-hour for alphaGAN(s) (compared to ∼3 GPU-hours for the baseline).

G.5 Effect of Step Sizes

We analyze the effect of different step sizes for the 'arch_part', corresponding to the optimization of the architecture parameters $\alpha_G$ in Algorithm 1 (lines 10–13). Since alphaGAN(l) uses larger step sizes for the 'weight_part' and 'test-weight_part' than alphaGAN(s), the step size of the 'arch_part' can be adjusted over a wider range; we therefore conduct the experiments on alphaGAN(l), with results shown in Fig. 10. We observe that the method is fairly robust to different step sizes in terms of the IS metric, while performance in terms of the FID metric may be hampered by a less proper step size.

G.6 Effect of Channels in Searching

Under the default settings of alphaGAN, we search and re-train the networks with the same predefined channel dimensions (i.e., G_channels=256 and D_channels=128). To explore the impact of the channel dimensions used during searching on the final performance of the searched architectures, we adjust the channel numbers of the generator and the discriminator during searching based on the searching configuration of alphaGAN(s). The results are shown in Table 5. We observe that our method can achieve acceptable performance under a wide range of channel numbers (i.e., 32 to 256). We also find that using consistent channel dimensions during the searching and re-training phases is beneficial to the final performance.

Figure 10: The effect of different step sizes of the 'arch_part' on (a) IS and (b) FID.

Table 5: The channels in searching on alphaGAN(s).

| Search channels | Re-train channels | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|---|
| G_32 D_32 | G_32 D_32 | 0.109 | 0.02 | 7.10 ± 0.08 | 36.22 |
| G_32 D_32 | G_256 D_128 | 2.481 | 1.12 | 8.61 ± 0.12 | 14.98 |
| G_64 D_64 | G_64 D_64 | 0.403 | 0.212 | 7.97 ± 0.09 | 22.49 |
| G_64 D_64 | G_256 D_128 | 4.658 | 3.26 | 8.70 ± 0.17 | 14.02 |
| G_128 D_128 | G_128 D_128 | 1.967 | 0.91 | 8.26 ± 0.08 | 16.50 |
| G_128 D_128 | G_256 D_128 | 7.309 | 3.64 | 8.75 ± 0.09 | 13.02 |
| G_256 D_128 | G_256 D_128 | 2.953 | 1.32 | 8.72 ± 0.11 | 12.86 |
| G_256 D_128 | G_128 D_128 | 0.887 | 0.34 | 8.36 ± 0.08 | 17.12 |
| G_256 D_128 | G_64 D_64 | 0.296 | 0.09 | 7.73 ± 0.08 | 24.81 |
| G_256 D_128 | G_32 D_32 | 0.111 | 0.025 | 6.85 ± 0.1 | 35.6 |

When reducing channels during searching, we observe an increasing tendency towards depthwise convolutions with large kernels (e.g., 7x7), indicating that the operation selection induced by the automated mechanism adapts to the need of preserving the entire information flow (i.e., increasing information extraction along the spatial dimensions to compensate for the channel limits).

G.7 Searching on STL-10

We also search alphaGAN(s) on STL-10. The channel dimensions of the generator and the discriminator are set to 64 (due to GPU memory limits), and we use 48x48 as the resolution of the images. The rest of the experimental settings are the same as those for searching on CIFAR-10. The settings remain the same as in Section C.2 when retraining the networks.

The results of three runs are shown in Table 6. Our method achieves high performance on both STL-10 and CIFAR-10, demonstrating that the effectiveness and transferability of alphaGAN are not confined to a particular dataset. alphaGAN(s) remains efficient and can obtain a structure reaching the state of the art on STL-10 with only 2 GPU-hours. We also find that no failure case occurs in the three repeated experiments of alphaGAN(s), in contrast to CIFAR-10, which may be related to multiple latent factors that datasets intrinsically possess (e.g., resolution, categories); we leave this as future work.

Table 6: Search on STL-10. We search alphaGAN(s) on STL-10 and re-train the searched structures on STL-10 and CIFAR-10. In our repeated experiments, failure cases are prevented.

| Name | Search time (GPU-hours) | Dataset of re-training | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|---|---|
| Repeat_1 | ∼2 | STL-10 | 4.552 | 5.55 | 9.22 ± 0.08 | 25.42 |
| Repeat_2 | ∼2 | STL-10 | 2.475 | 2.01 | 9.66 ± 0.10 | 29.28 |
| Repeat_3 | ∼2 | STL-10 | 4.013 | 3.67 | 9.47 ± 0.10 | 26.61 |
| Repeat_1 | ∼2 | CIFAR-10 | 3.891 | 2.47 | 8.29 ± 0.17 | 13.94 |
| Repeat_2 | ∼2 | CIFAR-10 | 1.815 | 0.90 | 8.20 ± 0.13 | 16.54 |
| Repeat_3 | ∼2 | CIFAR-10 | 3.352 | 1.63 | 8.62 ± 0.11 | 12.64 |

Figure 11: Distributions of (a) normal operations and (b) up-sampling operations in normal cases and failure cases of alphaGAN.

Table 7: Repeated search on CIFAR-10.

| Name | Description | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|---|
| alphaGAN(s) | normal case | 4.475 | 2.36 | 8.44 ± 0.13 | 13.62 |
| alphaGAN(s) | normal case | 2.953 | 1.32 | 8.72 ± 0.11 | 12.86 |
| alphaGAN(s) | failure case | 2.994 | 1.08 | 6.77 ± 0.07 | 45.88 |
| alphaGAN(l) | normal case | 8.207 | 2.41 | 8.55 ± 0.08 | 15.42 |
| alphaGAN(l) | normal case | 8.618 | 2.78 | 8.51 ± 0.06 | 11.38 |
| alphaGAN(l) | failure case | 4.666 | 2.36 | 7.48 ± 0.1 | 52.58 |

G.8 Failure cases

As we pointed out in the main paper, the searching of alphaGAN can encounter failure cases, analogous to other NAS methods [39]. To better understand the method, we present a comparison between normal cases and failure cases in Table 7 and the distributions of operations in Fig. 11. We find that deconvolution operations dominate in these failure cases. To validate this, we conduct experiments on a variant that removes deconvolution operations from the search space, under the configuration of alphaGAN(s). The results (with 6 runs) in Table 8 show that failure cases are prevented in this scenario.

Table 8: Search w/o deconv on alphaGAN(s).

| Name | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|
| Repeat_1 | 4.594 | 2.20 | 8.29 ± 0.08 | 15.12 |
| Repeat_2 | 2.035 | 0.51 | 8.34 ± 0.10 | 14.92 |
| Repeat_3 | 1.586 | 0.55 | 8.24 ± 0.09 | 18.07 |
| Repeat_4 | 1.631 | 0.58 | 8.32 ± 0.09 | 15.85 |
| Repeat_5 | 1.631 | 0.60 | 8.43 ± 0.08 | 17.15 |
| Repeat_6 | 2.064 | 1.03 | 8.26 ± 0.11 | 16.00 |

We also test another setting in which the conv_1x1 operation is integrated with the interpolation operations (i.e., nearest and bilinear) to make them learnable, like deconvolution; we denote this as 'learnable interpolation'. The results (with 6 runs) under the configuration of alphaGAN(s) are shown in Table 9, suggesting that the failure cases can also be alleviated by this strategy.

Table 9: The effect of ’learnable interpolation’ on alphaGAN(s).

| Method | Name | Params (M) | FLOPs (G) | IS | FID |
|---|---|---|---|---|---|
| Learnable Interpolation | Repeat_1 | 2.775 | 0.99 | 8.43 ± 0.15 | 14.8 |
| Learnable Interpolation | Repeat_2 | 2.243 | 0.545 | 8.49 ± 0.12 | 18.82 |
| Learnable Interpolation | Repeat_3 | 3.500 | 0.99 | 8.35 ± 0.1 | 18.93 |
| Learnable Interpolation | Repeat_4 | 3.195 | 1.53 | 8.59 ± 0.1 | 13.22 |
| Learnable Interpolation | Repeat_5 | 2.968 | 0.82 | 8.22 ± 0.11 | 14.76 |
| Learnable Interpolation | Repeat_6 | 2.712 | 0.77 | 8.41 ± 0.11 | 13.47 |
