RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Yingjie Miao∗, Xingyou Song∗, Daiyi Peng, Summer Yue, Eugene Brevdo, Aleksandra Faust
Google Research, Brain Team
∗Equal contribution. Order decided randomly. Correspondence to: {yingjiemiao,xingyousong,sandrafaust}@google.com

arXiv:2106.02229v1 [cs.LG] 4 Jun 2021. Preprint. Under review.

Abstract

We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of applying neural architecture search techniques in RL, and demonstrate that by simply replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is compatible with both off-policy and on-policy RL algorithms, needing only minor changes to preexisting code. Surprisingly, we find that the supernet can be used as an actor for inference to generate replay data in standard RL training loops, and can thus train end-to-end. Throughout this training process, we show that the supernet gradually learns better cells, leading to alternative architectures which can be highly competitive against manually designed policies, while also verifying previous design choices for RL policies.

1 Introduction and Motivation

Over the last decade, advancements in deep reinforcement learning have focused heavily on algorithmic improvements [25, 71], data augmentation [51, 35], infrastructure upgrades [15, 27], and even hyperparameter tuning [72, 16]. Automating such methods for RL has given rise to the field of Automated Reinforcement Learning (AutoRL). Surprisingly, however, one relatively underdeveloped and unexplored area is the automation of large-scale policy architecture design.

Recent work in larger-scale RL suggests that designing a policy's architecture can be just as important as the algorithm or the quality of data, if not more so, for metrics such as generalization, transferability, and efficiency. One surprising phenomenon found on Procgen [10], a procedural generation benchmark for RL, was that "IMPALA-CNN", a residual convolutional architecture from [15], could substantially outperform "NatureCNN", the standard 3-layer convolutional architecture used for Atari [44], in both the generalization and infinite-data regimes [12]. Even on a more basic level, [73] showed that CNN policies can generalize better than MLPs on GridWorld tasks.

Furthermore, the emphasis on strong visual processing is not restricted to game-like observations. In robotics, cameras collect very detailed real-world images as observations, which are visually similar to images from classification datasets. For instance, the Q-network architecture for robotic grasping found in QT-OPT [33, 52] is quite large, consisting of 15 convolution and 3 max-pooling operations for processing a (472 x 472)-sized image, more than 4x larger than ImageNet's (224 x 224) size [54]. As such RL policy networks gradually become larger and more sophisticated, the need for better RL architecture design increases as well.

One might wonder whether current architecture designs for RL policies are optimal, or whether there are alternative designs, which may respectively be verified or discovered using automated search techniques. This leads to the field of Neural Architecture Search (NAS), which has had an enormous impact on supervised learning (SL), ranging from image classification [74], object detection [75], and even
language modelling [60], with very efficient and practical methods [43, 48] having been developed as well. However, such NAS methods inherently take advantage of the modularity and reliability of training SL models from scratch. When training an SL model, there is usually only one module to be trained on a single machine, whose result only needs to be reported from a single seeded run due to high reproducibility. Unfortunately, RL possesses no such guarantees, and in fact is commonly characterized by two qualities which are exactly the opposite of those described above:

1. In most RL algorithms involving auto-differentiation, there are multiple distinct phases: an actor phase, which collects environment data via forward passes/inference of a policy, and a training phase, which updates parameters for potentially multiple network components (e.g. both the policy and the value function in policy gradient methods). There can even be a separate evaluation phase different from the actor phase (e.g. using greedy argmax evaluation instead of the stochastic actor, for both Q-learning and policy gradient methods). Such phases are also commonly distributed over different machines, adding an extra layer of complexity via inter-machine communication.

2. RL training curves possess considerably more noise than SL curves, with a much larger room for error due to sensitivity to hyperparameters [24, 3], which originally led to the criterion of reporting the mean and standard deviation over 3 seeded runs [24]. In fact, this noise can be nearly unbounded: in many cases, an unlucky, catastrophic seeded training run can simply lead to trivial reward even with an optimal architecture. Such noise can be problematic for the fast convergence of blackbox/evolutionary optimization methods.

From these issues, a practical question arises: Can we even perform NAS for RL at all? If so, what are the correct design choices?

Ideally, we desire an efficient search method which integrates naturally within the RL system with minimal modifications to the code and pipeline, and which is also undeterred by random noise. As it turns out, Differentiable Architecture Search (DARTS) [43] satisfies exactly these qualities. As demonstrated in Figure 1, DARTS combines model training and the architecture search procedure into a single differentiable supernet, which can simply replace the image encoder in standard RL algorithms (e.g. PPO [59] and Rainbow-DQN [25]). Rather than receiving feedback from potentially noisy evaluations over different distributed machines, the architecture search parameters in the supernet train alongside the weight parameters using the exact same loss function, which can also carry extra important information via terms not explicitly related to reward improvement (e.g. the value function error and entropy penalty, as well as potential exploration/novelty bonuses). The architecture search parameters are then discretized to select a final perception module whose performance is evaluated by training from scratch.

[Figure 1: Representation of our method, in which a DARTS supernet is simply inserted into the network components of a standard RL training pipeline, which may be potentially highly distributed. The diagram shows a learner network and actor networks, each built from mixed candidate operations (Op 1, Op 2, Op 3), connected to the environment and a replay/experience buffer.]
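To make this integration concrete, the sketch below illustrates a DARTS-style mixed operation serving as a drop-in image encoder for an RL policy. This is a minimal PyTorch illustration under our own assumptions, not the paper's implementation: the candidate operation set, the sequential stack of mixed ops (rather than a full cell DAG), and the `MixedOp`/`SupernetEncoder` names are hypothetical, chosen only to show how architecture logits can be trained by the same RL loss as the weights and later discretized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Differentiable mixture over candidate ops, weighted by architecture logits."""
    def __init__(self, channels):
        super().__init__()
        # Hypothetical candidate operation set; the actual search space may differ.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),  # 5x5 conv
            nn.Identity(),                                # skip connection
        ])

    def forward(self, x, alpha):
        # Softmax over architecture logits mixes all candidate ops differentiably.
        weights = F.softmax(alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class SupernetEncoder(nn.Module):
    """Stand-in for the policy's image encoder (e.g. replacing NatureCNN/IMPALA-CNN)."""
    def __init__(self, in_channels=3, channels=32, num_layers=3, num_ops=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.layers = nn.ModuleList(MixedOp(channels) for _ in range(num_layers))
        # Architecture parameters: one logit vector per layer (num_ops must match
        # the number of candidate ops). They receive gradients from the RL loss.
        self.alphas = nn.Parameter(torch.zeros(num_layers, num_ops))

    def forward(self, obs):
        x = F.relu(self.stem(obs))
        for layer, alpha in zip(self.layers, self.alphas):
            x = F.relu(layer(x, alpha))
        return x.flatten(start_dim=1)  # features for the policy/value heads

    def discretize(self):
        # After search, keep only the highest-weighted op per layer.
        return self.alphas.argmax(dim=-1).tolist()
```

In an actual training loop, the features from `forward` would feed the policy and value heads, and a single optimizer step over all of the encoder's parameters (convolution weights and `alphas` alike) would follow each RL loss computation, mirroring the end-to-end setup described above; `discretize` then yields the final cell to retrain from scratch.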
However, it is nonobvious that a DARTS supernet can work well with so many RL components. In particular, it has not been shown previously that supernets are capable of being used for inference, which is crucial for the actor phase of RL. In this paper, we demonstrate that such a setting is indeed possible, yielding an efficient search algorithm for generating perception modules optimized for specific RL environments. In summary, our contributions are:

• We demonstrate that DARTS can be integrated with existing RL algorithms in a minimally invasive and easy-to-use manner, agnostic to the specific pipeline. On the Procgen benchmark, the DARTS supernet efficiently produces training curves similar to baselines using both on-policy (PPO) and off-policy (Rainbow) methods, both of which require inference on the network.

• By visualizing discretized cell structures and inspecting their evaluated rewards, we show both qualitatively and quantitatively that the discretized cells start as suboptimal architectures and gradually evolve into better ones. We extensively ablate our method in the Appendix.

• By evaluating the final discretized cells, we discover that many environments benefit from learned, custom-tailored architectures, suggesting that human-designed architectures can be suboptimal. This opens up questions for future study, as large-scale RL can potentially benefit from more applications of NAS techniques.

2 Related Works

Most core NAS methods, formulated before the advent of differentiable search [43], can be thought of in terms of blackbox and evolutionary techniques. These include evolutionary techniques from before the deep learning era, such as NEAT [64], as well as modern techniques which propose models via a controller [74, 47], optimized by methods such as Regularized Evolution [53] and REINFORCE [66]. In the RL setting, fully blackbox NAS methods have been developed in [61, 21, 63], which do not apply gradient-based methods even for training weight parameters. Such methods utilize hundreds of CPU workers for objective evaluations but are usually unable to train policies with more than 10K parameters, due to the sample complexity of zeroth-order methods in high-dimensional parameter spaces [1]. However, such methods have proved useful on smaller RL search spaces, such as hyperparameter optimization [72, 19, 46, 31] and algorithm search/"learning to learn" methods [16, 9]. Simply training a full model to completion from scratch over numerous architecture suggestions is also prohibitively expensive. Numerous