Arxiv:1910.04465V2 [Cs.CV] 16 Oct 2019
Total Page:16
File Type:pdf, Size:1020Kb
Searching for A Robust Neural Architecture in Four GPU Hours Xuanyi Dong1;2,Yi Yang1 1University of Technology Sydney 2Baidu Research [email protected], [email protected] Abstract 0 0 : input GDAS Conventional neural architecture search (NAS) ap- 3 : output on a DAG proaches are based on reinforcement learning or evolution- : sampled ary strategy, which take more than 3000 GPU hours to find : unsampled a good model on CIFAR-10. We propose an efficient NAS 1 approach learning to search by gradient descent. Our ap- proach represents the search space as a directed acyclic graph (DAG). This DAG contains billions of sub-graphs, each of which indicates a kind of neural architecture. To 2 avoid traversing all the possibilities of the sub-graphs, we develop a differentiable sampler over the DAG. This sam- pler is learnable and optimized by the validation loss af- ter training the sampled architecture. In this way, our ap- proach can be trained in an end-to-end fashion by gra- 3 dient descent, named Gradient-based search using Differ- entiable Architecture Sampler (GDAS). In experiments, we Figure 1. We utilize a DAG to represent the search space of a neu- can finish one searching procedure in four GPU hours on ral cell. Different operations (colored arrows) transform one node CIFAR-10, and the discovered model obtains a test error (square) to its intermediate features (little circles). Meanwhile, of 2.82% with only 2.5M parameters, which is on par with each node is the sum of the intermediate features transformed from the state-of-the-art. Code is publicly available on GitHub: the previous nodes. As indicated by the solid connections, the https://github.com/D-X-Y/NAS-Projects. neural cell in the proposed GDAS is a sampled sub-graph of this DAG. Specifically, among the intermediate features between every two nodes, GDAS samples one feature in a differentiable way. 1. Introduction Designing an efficient and effective neural architecture better neural architectures than the human-invented archi- requires substantial human effort and takes a long time [7, tectures. Therefore, NAS is an important research topic in 9, 10, 12, 14, 20, 37, 44]. Since the birth of AlexNet [20] machine learning. arXiv:1910.04465v2 [cs.CV] 16 Oct 2019 in 2012, human experts have conducted a huge number of Most NAS approaches apply evolutionary algorithms experiments, and consequently devised several useful struc- (EA) [32, 23, 33] or reinforcement learning (RL) [46, 47, 3] tures, such as attention [7] and residual connection [12]. to design neural architectures automatically. In both RL- However, the infinite possible choices of network architec- based and EA-based approaches, their searching procedures ture make the manual search unfeasible [1]. Recently, neu- require the validation accuracy of numerous architecture ral architecture search (NAS) has increasingly attracted the candidates, which is computationally expensive [47, 32]. interest of researchers [1, 6, 17, 22, 25, 31, 46]. These ap- For example, the typical RL-based method utilizes the val- proaches learn to automatically discover good architectures. idation accuracy as a reward to optimize the architecture They can thus reduce the labour of human experts and find generator [46]. An EA-based method leverages the valida- ∗ tion accuracy to decide whether a model will be removed This paper was accepted to the IEEE CVPR 2019. from the population or not [33]. These approaches use a yPart of this work was done when Xuanyi Dong was a research intern with Baidu Research. large amount of computational resources, which is ineffi- zCorresponding author: Yi Yang. cient and unaffordable. This motivates researchers to reduce the computational cost. sub-graph at one training iteration, accelerating the search- In this paper, we propose a Gradient-based search- ing procedure. Besides, the sampling in GDAS is learnable ing approach using Differentiable Architecture Sampling and contributes to finding a better cell. (GDAS). It can search for a robust neural architecture in 3. GDAS delivers a strong empirical performances while four hours with a single V100 GPU. GDAS significantly using fewer GPU resources. On CIFAR-10, GDAS can fin- improves efficiency compared to the previous methods. We ish one searching procedure in several GPU hours and dis- start by searching for a robust neural “cell” instead of a neu- cover a robust neural network with a test error of 2.82%. On ral network [46, 47]. A neural cell contains multiple func- PTB, GDAS discovers a RNN model with a test perplexity tions to transform features, and a neural network consists of of 57.5. Moreover, the networks discovered on CIFAR and many copies of the discovered neural cell [22, 47]. Fig. 1 il- PTB can be successfully transferred to ImageNet and WT2. lustrates our searching procedure in detail. We represent the search space of a cell by a DAG. Every grey square node in- 2. Related Work dicates a feature tensor, numbered by the computation order. Recently, researchers have made significant progress in Different colored arrows indicate different kinds of opera- automatically discovering good architectures [46, 47, 23, tions, which transform one node into its intermediate fea- 22, 33]. Most NAS approaches can be categorized in two tures. Meanwhile, each node is the sum of the intermedi- modalities: macro search and micro search. ate features transformed from the previous nodes. During Macro search algorithms aim to directly discover the training, the proposed GDAS samples a sub-graph from the entire neural networks [5, 4, 38, 46, 21]. To search convo- whole DAG, indicated by solid connections in Fig. 1. In this lutional neural networks (CNNs) [20], typical approaches sub-graph, each node only receives one intermediate feature apply RL to optimize the searching policy to discover ar- from every previous node. Specifically, among the interme- chitectures [1, 5, 46, 31]. Baker et al. [1] trained a learn- diate features between every two nodes, GDAS samples one ing agent by Q-learning to sequentially choose CNN lay- feature in a differentiable way. In this way, GDAS can be ers. Zoph and Le [46] utilized long short-term memory trained by gradient descent to discover a robust neural cell (LSTM) [13] as a controller to configure each convolu- in an end-to-end fashion. tional layer, such as the filter shape and the number of fil- The fast searching ability of GDAS is mainly due to the ters. In these macro search algorithms [1, 5, 46], the num- sampling behavior. A DAG contains hundreds of parametric ber of possible networks is exponential to the depth of a operations with millions of parameters. Directly optimizing network, e.g., a depth of 12 can result in more than 1029 this DAG [24] instead of sampling a sub-graph leads to two possible networks [31]. It is difficult and ineffective to disadvantages. First, it costs a lot of time to update nu- search networks in such a large search space, and, there- merous parameters in one training iteration, increasing the fore, these macro search methods [31, 46, 5] usually limit overall training time to more than one day [24]. Second, the CNN models to be shallow, e.g., a depth is less than 12. optimizing different operations together could make them Since macro-discovered networks are shallower than deep compete with each other. For example, different operations CNNs [12, 14], their accuracies are limited. In contrast, our could generate opposite values. The sum of these oppo- GDAS allows the network to be much deeper by stacking site values tends to vanish, breaking the information flow tens of discovered cells [47] and thus can achieve a better between the two connected nodes and destabilizing the op- accuracy. timization procedure. To solve these two problems, the pro- Micro search algorithms aim to discover neural cells posed GDAS samples a sub-graph at one training iteration. and design a neural architecture by stacking many copies As a result, we only need to optimize a part of the DAG of the discovered cells [47, 32, 33, 31]. A typical mi- at one iteration, which accelerates the training procedure. cro search approach is NASNet [47], which extends the Moreover, the inappropriate competition is avoided, which approach of [46] to search neural cells in the proposed makes the optimization effective. “NASNet search space”. Following NASNet [47], many In summary, GDAS has the following benefits: researchers propose their methods based on the NASNet 1. Compared to previous RL-based and EA-based meth- search space [22, 24, 5, 32]. For example, Real et al. [32] ods, GDAS makes the searching procedure differentiable, applied EA algorithm with a simple regularization tech- which allows us to end-to-end learn a robust searching rule nique to search neural cells. Liu et al. [22] proposed a by gradient descent. For RL-based and EA-based methods, progressive approach to search cells from shallow to deep feedback (reward) is obtained after a prolonged training tra- gradually. These micro search algorithms usually take more jectory, while feedback (loss) in our gradient-based method than 100 GPU days [22, 47]. Even though some of them re- is instant and is given in every iteration. As a result, the duce the searching cost, they still take more than one GPU optimization of GDAS is potentially more efficient. day [24]. Our GDAS is a also micro search algorithm, fo- 2. Instead of using the whole DAG, GDAS samples one cusing on search cost reduction. In experiments, we can find Reduction Reduction a robust network within fewer GPU hours, which is 1000× Image 3x3 block block block soft conv Cell Cell max less than the standard NAS approach [47].