Arxiv:1910.04465V2 [Cs.CV] 16 Oct 2019

Searching for A Robust Neural Architecture in Four GPU Hours Xuanyi Dong1;2,Yi Yang1 1University of Technology Sydney 2Baidu Research [email protected], [email protected] Abstract 0 0 : input GDAS Conventional neural architecture search (NAS) ap- 3 : output on a DAG proaches are based on reinforcement learning or evolution- : sampled ary strategy, which take more than 3000 GPU hours to find : unsampled a good model on CIFAR-10. We propose an efficient NAS 1 approach learning to search by gradient descent. Our approach represents the search space as a directed acyclic graph (DAG). This DAG contains billions of sub-graphs, each of which indicates a kind of neural architecture. To 2 avoid traversing all the possibilities of the sub-graphs, we develop a differentiable sampler over the DAG. This sampler is learnable and optimized by the validation loss after training the sampled architecture. In this way, our approach can be trained in an end-to-end fashion by gra- 3 dient descent, named Gradient-based search using Differ- entiable Architecture Sampler (GDAS). In experiments, we Figure 1. We utilize a DAG to represent the search space of a neu- can finish one searching procedure in four GPU hours on ral cell. Different operations (colored arrows) transform one node CIFAR-10, and the discovered model obtains a test error (square) to its intermediate features (little circles). Meanwhile, of 2.82% with only 2.5M parameters, which is on par with each node is the sum of the intermediate features transformed from the state-of-the-art. Code is publicly available on GitHub: the previous nodes. As indicated by the solid connections, the https://github.com/D-X-Y/NAS-Projects. neural cell in the proposed GDAS is a sampled sub-graph of this DAG. Specifically, among the intermediate features between every two nodes, GDAS samples one feature in a differentiable way. 1. Introduction Designing an efficient and effective neural architecture better neural architectures than the human-invented archi- requires substantial human effort and takes a long time [7, tectures. Therefore, NAS is an important research topic in 9, 10, 12, 14, 20, 37, 44]. Since the birth of AlexNet [20] machine learning. arXiv:1910.04465v2 [cs.CV] 16 Oct 2019 in 2012, human experts have conducted a huge number of Most NAS approaches apply evolutionary algorithms experiments, and consequently devised several useful struc- (EA) [32, 23, 33] or reinforcement learning (RL) [46, 47, 3] tures, such as attention [7] and residual connection [12]. to design neural architectures automatically. In both RL- However, the infinite possible choices of network architec- based and EA-based approaches, their searching procedures ture make the manual search unfeasible [1]. Recently, neu- require the validation accuracy of numerous architecture ral architecture search (NAS) has increasingly attracted the candidates, which is computationally expensive [47, 32]. interest of researchers [1, 6, 17, 22, 25, 31, 46]. These ap- For example, the typical RL-based method utilizes the val- proaches learn to automatically discover good architectures. idation accuracy as a reward to optimize the architecture They can thus reduce the labour of human experts and find generator [46]. An EA-based method leverages the valida- ∗ tion accuracy to decide whether a model will be removed This paper was accepted to the IEEE CVPR 2019. from the population or not [33]. These approaches use a yPart of this work was done when Xuanyi Dong was a research intern with Baidu Research. large amount of computational resources, which is ineffi- zCorresponding author: Yi Yang. cient and unaffordable. This motivates researchers to reduce the computational cost. sub-graph at one training iteration, accelerating the search- In this paper, we propose a Gradient-based searching procedure. Besides, the sampling in GDAS is learnable ing approach using Differentiable Architecture Sampling and contributes to finding a better cell. (GDAS). It can search for a robust neural architecture in 3. GDAS delivers a strong empirical performances while four hours with a single V100 GPU. GDAS significantly using fewer GPU resources. On CIFAR-10, GDAS can fin- improves efficiency compared to the previous methods. We ish one searching procedure in several GPU hours and dis- start by searching for a robust neural “cell” instead of a neu- cover a robust neural network with a test error of 2.82%. On ral network [46, 47]. A neural cell contains multiple func- PTB, GDAS discovers a RNN model with a test perplexity tions to transform features, and a neural network consists of of 57.5. Moreover, the networks discovered on CIFAR and many copies of the discovered neural cell [22, 47]. Fig. 1 il- PTB can be successfully transferred to ImageNet and WT2. lustrates our searching procedure in detail. We represent the search space of a cell by a DAG. Every grey square node in- 2. Related Work dicates a feature tensor, numbered by the computation order. Recently, researchers have made significant progress in Different colored arrows indicate different kinds of opera- automatically discovering good architectures [46, 47, 23, tions, which transform one node into its intermediate fea- 22, 33]. Most NAS approaches can be categorized in two tures. Meanwhile, each node is the sum of the intermedi- modalities: macro search and micro search. ate features transformed from the previous nodes. During Macro search algorithms aim to directly discover the training, the proposed GDAS samples a sub-graph from the entire neural networks [5, 4, 38, 46, 21]. To search convo- whole DAG, indicated by solid connections in Fig. 1. In this lutional neural networks (CNNs) [20], typical approaches sub-graph, each node only receives one intermediate feature apply RL to optimize the searching policy to discover ar- from every previous node. Specifically, among the interme- chitectures [1, 5, 46, 31]. Baker et al. [1] trained a learn- diate features between every two nodes, GDAS samples one ing agent by Q-learning to sequentially choose CNN lay- feature in a differentiable way. In this way, GDAS can be ers. Zoph and Le [46] utilized long short-term memory trained by gradient descent to discover a robust neural cell (LSTM) [13] as a controller to configure each convolu- in an end-to-end fashion. tional layer, such as the filter shape and the number of fil- The fast searching ability of GDAS is mainly due to the ters. In these macro search algorithms [1, 5, 46], the num- sampling behavior. A DAG contains hundreds of parametric ber of possible networks is exponential to the depth of a operations with millions of parameters. Directly optimizing network, e.g., a depth of 12 can result in more than 1029 this DAG [24] instead of sampling a sub-graph leads to two possible networks [31]. It is difficult and ineffective to disadvantages. First, it costs a lot of time to update nu- search networks in such a large search space, and, there- merous parameters in one training iteration, increasing the fore, these macro search methods [31, 46, 5] usually limit overall training time to more than one day [24]. Second, the CNN models to be shallow, e.g., a depth is less than 12. optimizing different operations together could make them Since macro-discovered networks are shallower than deep compete with each other. For example, different operations CNNs [12, 14], their accuracies are limited. In contrast, our could generate opposite values. The sum of these oppo- GDAS allows the network to be much deeper by stacking site values tends to vanish, breaking the information flow tens of discovered cells [47] and thus can achieve a better between the two connected nodes and destabilizing the op- accuracy. timization procedure. To solve these two problems, the pro- Micro search algorithms aim to discover neural cells posed GDAS samples a sub-graph at one training iteration. and design a neural architecture by stacking many copies As a result, we only need to optimize a part of the DAG of the discovered cells [47, 32, 33, 31]. A typical mi- at one iteration, which accelerates the training procedure. cro search approach is NASNet [47], which extends the Moreover, the inappropriate competition is avoided, which approach of [46] to search neural cells in the proposed makes the optimization effective. “NASNet search space”. Following NASNet [47], many In summary, GDAS has the following benefits: researchers propose their methods based on the NASNet 1. Compared to previous RL-based and EA-based meth- search space [22, 24, 5, 32]. For example, Real et al. [32] ods, GDAS makes the searching procedure differentiable, applied EA algorithm with a simple regularization tech- which allows us to end-to-end learn a robust searching rule nique to search neural cells. Liu et al. [22] proposed a by gradient descent. For RL-based and EA-based methods, progressive approach to search cells from shallow to deep feedback (reward) is obtained after a prolonged training tra- gradually. These micro search algorithms usually take more jectory, while feedback (loss) in our gradient-based method than 100 GPU days [22, 47]. Even though some of them re- is instant and is given in every iteration. As a result, the duce the searching cost, they still take more than one GPU optimization of GDAS is potentially more efficient. day [24]. Our GDAS is a also micro search algorithm, fo- 2. Instead of using the whole DAG, GDAS samples one cusing on search cost reduction. In experiments, we can find Reduction Reduction a robust network within fewer GPU hours, which is 1000× Image 3x3 block block block soft conv Cell Cell max less than the standard NAS approach [47].

Arxiv:1910.04465V2 [Cs.CV] 16 Oct 2019

Lecture 4 Feedforward Neural Networks, Backpropagation

Revisiting the Softmax Bellman Operator: New Beneﬁts and New Perspective

On the Learning Property of Logistic and Softmax Losses for Deep Neural Networks

CS281B/Stat241b. Statistical Learning Theory. Lecture 7. Peter Bartlett

Loss Function Search for Face Recognition

Deep Neural Networks for Choice Analysis: Architecture Design with Alternative-Specific Utility Functions Shenhao Wang Baichuan

Pseudo-Learning Effects in Reinforcement Learning Model-Based Analysis: a Problem Of

Lecture 18: Wrapping up Classification Mark Hasegawa-Johnson, 3/9/2019

Categorical Data

Mixed Pattern Recognition Methodology on Wafer Maps with Pre-Trained Convolutional Neural Networks

Reinforcement Learning with Dynamic Boltzmann Softmax Updates Arxiv

Rethinking Feature Distribution for Loss Functions in Image Classification