
RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Yingjie Miao∗, Xingyou Song∗, Daiyi Peng, Summer Yue, Eugene Brevdo, Aleksandra Faust

Google Research, Brain Team

∗Equal contribution. Order decided randomly. Correspondence to: {yingjiemiao,xingyousong,sandrafaust}@google.com

Abstract

We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of applying neural architecture search techniques in RL, and demonstrate that by simply replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is also compatible with off-policy and on-policy RL algorithms, needing only minor changes in preexisting code. Surprisingly, we find that the supernet can be used as an actor for inference to generate replay data in standard RL training loops, and thus train end-to-end. Throughout this training process, we show that the supernet gradually learns better cells, leading to alternative architectures which can be highly competitive against manually designed policies, but also verify previous design choices for RL policies.

1 Introduction and Motivation

Over the last decade, recent advancements in deep reinforcement learning have heavily focused on algorithmic improvements [25, 71], data augmentation [51, 35], infrastructure upgrades [15, 27], and even hyperparameter tuning [72, 16]. Automating such methods for RL has given rise to the field of Automated Reinforcement Learning (AutoRL). Surprisingly, however, one relatively underdeveloped and unexplored area is automating large-scale policy architecture design. Recently, works in larger-scale RL suggest that designing a policy's architecture can be just as important as the algorithm or quality of data, if not more, for various metrics such as generalization, transferability and efficiency. One surprising phenomenon found on Procgen [10], a procedural generation benchmark for RL, was that "IMPALA-CNN", a residual convolutional architecture from [15], could substantially outperform "NatureCNN", the standard 3-layer convolutional architecture used for Atari [44], in both the generalization and infinite data regimes [12]. Even on a more basic level, [73] showed that CNN policies can generalize better than MLPs on GridWorld tasks. Furthermore, emphasis on strong visual processing is also not simply restricted to game-like observations. In the field of robotics, cameras collect very detailed real-world images for observations, which are visually similar to images from classification datasets. For instance, the Q-network architecture for robotic grasping found in QT-OPT [33, 52] is quite large, consisting of 15 convolution and 3 max-pooling operations for processing a (472 x 472)-sized image, more than 4x bigger than ImageNet's (224 x 224) size [54]. As such RL policy networks gradually become larger and more sophisticated, the need for better RL architecture design increases as well.

One might wonder if current architecture designs for RL policies are optimal, or if there are alternative designs, which may respectively be verified or discovered using automated search techniques. This leads to the field of Neural Architecture Search (NAS), which has shown enormous impact on supervised learning (SL), ranging from image classification [74] and object detection [75] to language modelling [60], with very efficient and practical methods [43, 48] having been developed as well. However, such NAS methods inherently take advantage of the modularity and reliability of training SL from scratch. When training an SL model, there is usually only one module to be trained on a single machine, whose result only needs to be reported from a single seeded run due to high reproducibility. Unfortunately, RL possesses no such guarantees, and in fact, is commonly characterized by two qualities which are exactly opposite to those above:

1. In most RL algorithms involving auto-differentiation, there are multiple different phases: an actor phase which collects environment data via forward passes/inference of a policy, in addition to a training phase which updates parameters for potentially multiple network components (e.g. both the policy and value function in policy gradient methods). There can also even be a separate evaluation phase different from the actor phase (e.g. using argmax greedy evaluation instead of the stochastic actor, for both Q-learning and policy gradient methods). Such phases are also commonly distributed over different machines, adding an extra layer of complexity via inter-machine communication.

2. RL training curves possess considerably more noise than SL curves, with a much larger room for error due to sensitivity to hyperparameters [24, 3], which originally led to the criterion of reporting mean and standard deviation over 3 seeded runs [24]. In fact, this noise can be nearly unbounded, as in many cases, an unlucky and catastrophic seeded training run can simply lead to trivial reward even with an optimal architecture. Such noise may potentially be problematic for fast convergence of blackbox/evolutionary optimization methods.

From these issues, a practical question therefore arises: Can we even perform NAS for RL at all? If so, what are the correct design choices? Ideally, we desire an efficient search method which integrates naturally within the RL system with minimal modifications to the code and pipeline, in addition to being undeterred by random noise. As it turns out, Differentiable Architecture Search (DARTS) [43] indeed satisfies these qualities. As demonstrated in Figure 1, DARTS combines the model training and architecture search procedure into one single differentiable supernet which can simply replace the image encoder in standard RL algorithms (e.g. PPO [59] and Rainbow-DQN [25]). Rather than receiving feedback from potentially noisy evaluations over different distributed machines, architecture search parameters in the supernet train alongside weight parameters using the same exact loss function, which can also contain extra important information via terms not explicitly related to reward-improvement (e.g. value function error and entropy penalty, as well as potentially exploration/novelty bonuses). The architecture search parameters are then discretized to select a final perception module whose performance is evaluated by training from scratch.

Figure 1: Representation of our method, in which a DARTS supernet is simply inserted into the network components of a standard RL training pipeline, which may be potentially highly distributed.

However, it is nonobvious that a DARTS supernet can work well with so many RL components. It has not been shown previously that supernets are capable of being used for inference, which is crucial for the actor phase in RL. In this paper, we demonstrate that such a setting is indeed possible, which produces an efficient search algorithm for generating perception modules optimized for specific RL environments. In summary, our contributions are:

• We demonstrate DARTS can be integrated with existing RL algorithms in a minimally invasive and easy-to-use manner, agnostic to the specific pipeline. On the Procgen benchmark, the DARTS supernet is able to efficiently produce similar training curves as baselines using both on-policy (PPO) and off-policy (Rainbow) methods, both of which require inference on the network.

• By visualizing discretized cell structures and inspecting their evaluated rewards, we show both qualitatively and quantitatively that the discretized cells start with suboptimal architectures and gradually evolve to better architectures. We extensively ablate our method further in the Appendix.

• By evaluating the final discretized cells, we discover that many environments benefit from learned, custom-tailored architectures, suggesting that human-designed architectures can be suboptimal. This opens up questions for future study, as large-scale RL can potentially benefit from more applications of NAS techniques.

2 Related Works

Most core NAS methods, formulated before the advent of differentiable search [43], can be thought of in terms of blackbox and evolutionary techniques. These include evolutionary techniques before the deep learning era such as NEAT [64], as well as modern techniques which propose models via a controller [74, 47], optimized by techniques such as Regularized Evolution [53] and REINFORCE [66]. In relation to the RL setting, fully blackbox NAS methods have been developed in [61, 21, 63], which do not apply gradient-based methods even for training weight parameters. Such methods utilize hundreds of CPU workers for objective evaluations, but unfortunately, are usually unable to train policies involving more than 10K parameters due to the sample complexity of zeroth-order methods in high-dimensional parameter spaces [1]. However, such methods have proved useful on smaller search spaces for RL, such as hyperparameter optimization [72, 19, 46, 31] and algorithm search/"learning to learn" methods [16, 9]. Simply training a full model to completion from scratch over numerous architecture suggestions is also prohibitively expensive. Numerous techniques have been developed to reduce hardware consumption and wall-clock time, which include applying early-stopping [4, 41] and reducing the search model size [75]. One of the most effective techniques consists of weight sharing [48], in which only one set of weight parameters is trained through an ensemble of models, otherwise known as a supernet. Differentiable methods such as DARTS [43] automatically train a supernet efficiently by inserting architecture selection variables inside the network, in order to be differentiable with respect to a loss. For RL, the only previously known application of DARTS is [2], which searches for the optimal way of combining the observation and action tensors together in off-policy QT-OPT [33]. However, this approach does not modify the image encoder, nor does it use the supernet for inference, as off-policy robotic data collected independently from the actual policy is used for training, akin to a supervised learning setup.

3 DARTS Background

The standard approach to building NAS-based networks is to stack layers, all of which use a common cell structure, but different weights. A cell is usually defined via construction of a Directed Acyclic Graph (DAG) which consists of N intermediate nodes. Each node is denoted as $x^{(i)}$, where $i$ is its topological order in the DAG, and $x^{(i)}$ may also represent the tensor value contained in the node. Each edge $(i,j)$ between two nodes $x^{(i)}$ and $x^{(j)}$ consists of an operation $o^{(i,j)}$ that maps $x^{(i)}$ to $o^{(i,j)}(x^{(i)})$; $x^{(j)}$ is constructed from $\{o^{(i,j)}(x^{(i)})\}_i$ via a merge operation such as summation. In the DARTS method, the supernet is constructed by representing each operation in the cell as a softmax mixture of all possible operations $\mathcal{O}$, i.e. $\bar{o}^{(i,j)}(x^{(i)}) = \sum_{o \in \mathcal{O}} p_o^{(i,j)} \cdot o(x^{(i)})$, where $p_o^{(i,j)} = \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$, and we denote the union of all possible architecture variables $\alpha_o^{(i,j)}$ as $\alpha$. For brevity of notation, we will usually describe our operation space as $\mathcal{O}_{base}$ instead, since the total space $\mathcal{O} = \mathcal{O}_{base} \cup \{\text{Zero}, \text{Skip}\}$ must contain the Zero and Skip Connection operations. Application-specific design knobs consist of the number of intermediate nodes as well as the method to aggregate intermediate nodes into a final output, which we specify for our case later in Section 4.1.
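
To make the mixed-operation computation concrete, here is a minimal NumPy sketch of a single DARTS edge (our own illustration, not the authors' code); the candidate operations, input shape, and variable names are placeholders.

```python
import numpy as np

def softmax(alpha):
    """Convert architecture logits alpha_o^{(i,j)} into mixture weights p_o^{(i,j)}."""
    z = np.exp(alpha - alpha.max())
    return z / z.sum()

def mixed_op(x, ops, alpha):
    """DARTS edge: a softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy usage with three placeholder operations on a (batch, H, W, depth) feature map.
ops = [lambda x: x,                   # Skip connection
       lambda x: np.zeros_like(x),    # Zero
       lambda x: np.maximum(x, 0.0)]  # a stand-in nonlinearity
x = np.random.randn(4, 8, 8, 16)
alpha = np.zeros(len(ops))            # architecture variables, trained jointly with the weights
y = mixed_op(x, ops, alpha)           # same shape as x
```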

Let $\mathcal{L}(\theta, \alpha)$ be a loss function defined with respect to the supernet induced by $(\theta, \alpha)$. In the standard optimization framework for SL where there is the notion of a validation set, the learning procedure consists of a bilevel optimization problem:

$$\alpha^* = \operatorname*{argmin}_{\alpha} \mathcal{L}_{val}(\theta^*, \alpha) \quad \text{s.t.} \quad \theta^* = \operatorname*{argmin}_{\theta} \mathcal{L}_{train}(\theta, \alpha)$$

While vanilla DARTS [43] proposes to solve this via a first/second order approximation, multiple followup works [42, 13, 30, 39, 7, 6, 8, 41, 70] simply alternate the gradient descent procedure between θ and α. Most of said works also address the common skip connection collapse issue, where the architecture variables may all converge towards the skip operation, which is hypothesized to be due to the effects of the two-player game between the weight variables and architecture variables. The supernet's cell is then discretized into a final cell in two steps. First, each edge is reduced to a single operation corresponding to $\operatorname*{argmax}_{o \in \mathcal{O}, o \neq \text{Zero}} p_o^{(i,j)}$, and its strength is defined by its softmax weight. Then, a pruning step occurs where each intermediate node retains the top 2 incoming edges measured by strength. After pruning, strengths of edges are discarded and thus are not used in evaluation. More details can be found in the original DARTS paper [43].
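
The two-step discretization above can be sketched as a small stand-alone routine; the edge/operation encoding below is our own assumption rather than the authors' data structure.

```python
import numpy as np

def discretize(alpha, op_names, zero_name="Zero", top_k=2):
    """alpha: {(i, j): logits over op_names}. Returns {node j: [(source i, op_name), ...]}."""
    edge_choice = {}
    for (i, j), logits in alpha.items():
        p = np.exp(logits - logits.max())
        p = p / p.sum()
        # Step 1: strongest operation per edge, excluding Zero.
        best = max((k for k in range(len(op_names)) if op_names[k] != zero_name),
                   key=lambda k: p[k])
        edge_choice[(i, j)] = (op_names[best], float(p[best]))
    # Step 2: keep only the top_k strongest incoming edges per node; strengths are then discarded.
    incoming = {}
    for (i, j), (op, strength) in edge_choice.items():
        incoming.setdefault(j, []).append((strength, i, op))
    return {j: [(i, op) for _, i, op in sorted(edges, reverse=True)[:top_k]]
            for j, edges in incoming.items()}

op_names = ["Zero", "Skip", "Conv3x3+ReLU", "Conv5x5+ReLU"]
alpha = {(0, 1): np.array([0.2, 0.1, 0.8, 0.3]),
         (0, 2): np.array([0.1, 0.3, 1.2, 0.2]),
         (1, 2): np.array([0.0, 0.9, 0.4, 0.1])}
print(discretize(alpha, op_names))
```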

4 Methodology: RL-DARTS

We are interested in applying the DARTS method to find better policy architectures for the Procgen benchmark [10], which tests for an agent's ability to learn in many diverse environments via procedural level generation. The two most common settings on Procgen are generalization (i.e. training over 200 levels and testing on the full distribution) and sample complexity (i.e. training over infinite levels from the full distribution). In our work, we primarily focus on the sample complexity/infinite level setting to reduce complexity and confounding factors. As mentioned earlier in Section 3, the original bilevel optimization framework is notoriously unstable due to skip connection collapse, and can require special techniques from followup works which are orthogonal to the purpose of this paper. Furthermore, in the Procgen benchmark, it has been noted that using finite training levels can potentially reduce training performance [10], which may also be problematic for a data-intensive DARTS search. The infinite level setting has been shown to strongly correlate with higher test performance [11, 51] in the generalization setting as well. Thus we choose to simply optimize both θ and α jointly:

$$\operatorname*{argmin}_{\theta, \alpha} \mathcal{L}(\theta, \alpha)$$
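
As a sketch of what joint optimization looks like in code (assuming a TensorFlow 2 setup as in Appendix C; `supernet` and `rl_loss` are hypothetical placeholders for the full agent and its algorithm-specific loss), both θ and α live in the same model and receive gradients from a single RL loss, with no separate validation set or alternating update:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(supernet, batch, rl_loss):
    """One joint gradient step on L(theta, alpha), e.g. the PPO loss or the Bellman error."""
    with tf.GradientTape() as tape:
        loss = rl_loss(supernet, batch)
    # trainable_variables contains both the weight variables theta and the architecture logits alpha.
    variables = supernet.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```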

For notation, we use standard conventions in the RL literature: for a given MDP M, we denote s, a, r, s′, a′ as the state, action, reward, next state, and next action respectively. π denotes the policy, γ is the discount factor, and D is the replay buffer which contains collected experiences (s, a, r, s′).

Off-Policy: For off-policy algorithms such as Rainbow [25] and vanilla DQN [45], we replace the entire image encoder with the supernet, and thus we may define Qθ,α(s, a) to be the Q-function utilizing model weights θ and architecture variables α. Then in the standard Q-learning setting, $\mathcal{L}$ can simply be defined to be the Bellman error loss, i.e. $\mathcal{L}(\theta, \alpha) = \left(Q_{\theta,\alpha}(s, a) - \mathcal{B}^* Q_{\theta,\alpha}(s, a)\right)^2$, where $\mathcal{B}^* Q_{\theta,\alpha}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s'|s,a)}\left[\max_{a'} Q_{\theta,\alpha}(s', a')\right]$ is the Bellman optimality operator. This loss, of course, may be modified with regularization terms used to improve the stability of training. Inference uses standard epsilon-greedy exploration, e.g. randomly sampling an action a with probability ε, and using the greedy action $\operatorname*{argmax}_a Q_{\theta,\alpha}(s, a)$ otherwise.

On-Policy: In standard practical policy-gradient methods such as PPO [59] and TRPO [58], there is a policy πθ(a|s) and also a value function Vθ(s), both of which share an image encoder fθ(s). For DARTS, we instead use a supernet encoder fθ,α(s), but still optimize the standard PPO objective with respect to θ and α. Our method is also compatible with recent works which seek to separate the policy and value networks [11, 50], which may lead to training two supernets. Standard inference is also via sampling from the (now supernet) policy, i.e. a ∼ πθ,α(a|s).
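
For concreteness, the following is a hedged TensorFlow 2 sketch of the off-policy loss above with the supernet plugged in as the image encoder; `encoder`, `q_head`, and their target copies are assumed modules (e.g. the DARTS supernet followed by a dense layer over actions), and Rainbow's distributional and n-step details are omitted.

```python
import tensorflow as tf

def q_values(encoder, q_head, obs):
    """Q_{theta,alpha}(s, .): supernet features followed by a per-action head."""
    return q_head(encoder(obs))

def bellman_loss(encoder, q_head, target_encoder, target_q_head, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors: a is int32 [B], r and done are float32 [B]
    q_sa = tf.gather(q_values(encoder, q_head, s), a, batch_dims=1)
    q_next = tf.reduce_max(q_values(target_encoder, target_q_head, s_next), axis=-1)
    target = r + gamma * (1.0 - done) * q_next
    # Gradients flow into both theta and alpha through q_sa; the bootstrap target is held fixed.
    return tf.reduce_mean(tf.square(q_sa - tf.stop_gradient(target)))
```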

4.1 Search Space Design

Rather than using supernets specific to supervised learning's settings such as 224 x 224 image sizes, we anchor the skeleton of our supernet on the IMPALA-CNN [15] architecture, which is manually optimized for training Procgen agents on 64 x 64 images [12, 10], yet flexible enough to admit modifications from NAS.

As shown in Figure 2, the IMPALA-CNN architecture conveniently stacks "convolutional sequences", where each sequence consists of (Conv3x3, MaxPool3x3, Residual Block 1, Residual Block 2); the MaxPool3x3 operation performs a "reduction" using a stride of 2, while the other operations maintain the same input/output shape and depth. We aim to search for a cell that replaces the residual block (originally consisting of Conv3x3's and ReLU's) in IMPALA-CNN, but keep the preceding Conv3x3 and MaxPool3x3, which can be viewed together as a pre-selected reduction cell. The stacking of sequences allows us to train a smaller proxy supernet consisting of depths (16,16,16) but evaluate discretized cells on larger models with depths (64,64,64,64,64).

Figure 2: Illustration of our supernet, with an IMPALA-CNN skeleton. Solid lines correspond to selected operations after discretizing the supernet, which contains all possible operations weighted using α. The RL-DARTS cell replaces the original Residual Blocks. Each sequence is allowed to use its own depth size. The original IMPALA-CNN uses depths of (16,32,32) in a sequence of size N=3.

The key ingredient of the search space is the set of user-defined operators $\mathcal{O}_{base}$. We consider the following for $\mathcal{O}_{base}$:

• Classic: {Conv3x3+ReLU, Conv5x5+ReLU, Dilated3x3+ReLU, Dilated5x5+ReLU}, which is standard in supervised learning [43, 75, 48]. The output of the entire cell simply uses the last intermediate node's output. We apply this for PPO, and use 3 intermediate nodes by default.

• Micro: {Conv3x3, ReLU, Tanh}. A more fine-grained search space, using 4 nodes, with the output consisting of depth-wise concatenation of all intermediate nodes, followed by a 1x1 convolution to maintain depth compatibility, similar to [43]. Note that Tanh is a standard nonlinearity used in continuous control [55, 62]. Because we found that Rainbow significantly underperformed with "Classic", we instead use "Micro" for Rainbow. A code sketch of both operation sets and the surrounding skeleton is given below.
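
The sketch below writes out the two operation sets and one convolutional sequence of the skeleton in TensorFlow 2. It is our own rendering under stated assumptions: the dilation rate of the dilated convolutions and the final dense width are guesses, and `make_cell` is a hypothetical factory returning either a supernet cell or a discretized cell.

```python
import tensorflow as tf

def conv(depth, k, dilation=1, relu=True):
    return tf.keras.layers.Conv2D(depth, k, padding="same", dilation_rate=dilation,
                                  activation="relu" if relu else None)

# "Classic" O_base (PPO, 3 intermediate nodes): convolution + ReLU variants.
CLASSIC = {
    "Conv3x3+ReLU":    lambda d: conv(d, 3),
    "Conv5x5+ReLU":    lambda d: conv(d, 5),
    "Dilated3x3+ReLU": lambda d: conv(d, 3, dilation=2),  # dilation rate is an assumption
    "Dilated5x5+ReLU": lambda d: conv(d, 5, dilation=2),
}

# "Micro" O_base (Rainbow, 4 nodes, outputs concatenated then mixed by a 1x1 conv).
MICRO = {
    "Conv3x3": lambda d: conv(d, 3, relu=False),
    "ReLU":    lambda d: tf.keras.layers.ReLU(),
    "Tanh":    lambda d: tf.keras.layers.Activation("tanh"),
}

def conv_sequence(x, depth, make_cell, num_cells=2):
    """One IMPALA-CNN sequence: fixed Conv3x3 + MaxPool3x3 reduction, then searched cells."""
    x = tf.keras.layers.Conv2D(depth, 3, strides=1, padding="same")(x)
    x = tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same")(x)
    for _ in range(num_cells):  # the cells replace the original residual blocks
        x = make_cell(x, depth)
    return x

def encoder(obs, make_cell, depths=(16, 16, 16)):
    """Proxy supernet uses depths (16,16,16); discretized cells are evaluated at (64,)*5."""
    x = obs
    for d in depths:
        x = conv_sequence(x, d, make_cell)
    x = tf.keras.layers.Flatten()(x)
    return tf.keras.layers.Dense(256, activation="relu")(x)  # dense width assumed
```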

5 Experiments

5.1 Experiment Setup

We use PPO [59] and Rainbow DQN [25], standard algorithms benchmarked on Procgen, to train the agent on the full distribution of levels (a.k.a. infinite levels) under the "easy" difficulty. Specific hyperparameter settings can be found in Appendix C. Unless specified, we by default use consistent hyperparameters for all comparisons found inside a figure, although different hyperparameters may be used across different figures due to changes in minibatch size when training models of different sizes. We use the original IMPALA-CNN cell as a strong hand-designed baseline when comparing both supernet and discrete cell performances.

5.2 RL-DARTS Performance

We compare the performance of learned cells from RL-DARTS with randomly searched cells and the IMPALA-CNN baseline. In Figures 3 and 4, we display the average normalized reward after 25M steps, as standard in ProcGen [10], for a subset of environments in which RL-DARTS performs competitively. The normalized reward for each environment is computed as $R_{norm} = (R_{raw} - R_{min})/(R_{max} - R_{min})$, where $R_{max}$ and $R_{min}$ are calculated using a combination of theoretical maximums and PPO-trained agents, and can be found in [10].
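
The normalization itself is the min-max rescaling sketched below; the numbers in the example call are placeholders, while the real per-environment constants R_min and R_max come from [10].

```python
def normalized_reward(r_raw, r_min, r_max):
    """Min-max normalization used for the reported scores."""
    return (r_raw - r_min) / (r_max - r_min)

# Placeholder example: raw return of 20 on an environment with R_min=5, R_max=30.
print(normalized_reward(20.0, r_min=5.0, r_max=30.0))  # 0.6
```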

Env        Baseline  RL-DARTS  Random
Bossfight     0.87      0.82     0.60
Climber       0.83      0.80     0.43
Coinrun       0.61      0.82    -0.04
Dodgeball     0.63      0.48     0.34
Fruitbot      0.96      0.89     0.85
Heist         0.58      0.38     0.20
Jumper        0.76      0.77     0.68
Leaper        0.57      0.85     0.08
Maze          0.99      0.86     0.68
Miner         0.72      0.86     0.63
Ninja         0.38      0.48     0.33
Starpilot     0.66      0.80     0.53

Figure 3: Left 2x2 Plot: Examples of evaluated cells with PPO using the "Classic" search space, using depths of (64,64,64,64,64) to emphasize cell differences. Right Table: Normalized Rewards in ProcGen across different search methods, evaluated at 25M steps. Blue values indicate largest score on the specific environment (as well as values within 0.03 of the largest), while Red indicates if the value is significantly lower (i.e. more than 20% drop from largest).

To provide a fair comparison against random search, our guiding principle is simple: random search should have at least the same computational cost as RL-DARTS search. The cost is defined as the total wall-clock time when searching under the same hardware. As we will see in Section 5.3, since the supernet uses approximately 3x more time than a fixed cell, we therefore sample 3 cells (discretized from randomly initialized α's) and evaluate their performance with depths (16, 16, 16) for 25M steps. The best of the 3 cells is used for full evaluation. Note that in [20], a different principle was used: the cost of a search algorithm should be measured by the total environment steps. Comparatively, our principle here is more favorable to random search, as a fixed cell takes much less walltime to reach the same number of environment steps than the supernet, and therefore random search gains more trials under the same walltime quota. We see that for PPO in Figure 3, the discretized cell outperforms the IMPALA-CNN baseline on quite a few environments, and remains competitive on others, with random search consistently underperforming.

Env        Baseline  RL-DARTS  Random
Bigfish       0.65      0.64     0.61
Caveflyer    -0.11      0.07    -0.01
Chaser        0.26      0.19     0.17
Coinrun       0.48      0.51     0.39
Dodgeball     0.43      0.22    -0.06
Fruitbot      0.74      0.61     0.57
Heist        -0.46     -0.52    -0.51
Jumper        0.09      0.17     0.14
Leaper       -0.04      0.20    -0.02
Maze          0.73      0.77     0.62
Ninja        -0.24      0.34     0.22

Figure 4: Same reporting format and setting as Fig. 3, but with Rainbow using the "Micro" search space.

For Rainbow in Figure 4, we also find strong returns for many environments, although the discretized cell is not without its flaws. For instance, on Dodgeball, the discretized cell significantly underperforms against the IMPALA-CNN baseline. We show in Figure 17 in Appendix B that one significant factor for underperformance is the supernet itself failing to train, which leads to cells with poorly selected combinations of operations. However, this is not the only scenario worthy of investigation. Below, we perform key ablations to understand the behavior of RL-DARTS, and answer the following main questions in our analysis:

1. Does the supernet train end-to-end at all? How is it related to discretization performance?

2. Does RL-DARTS find interesting architectures?

3. How does the cell structure change over time? Do the cells improve in evaluation performance throughout the evolution?

5.3 Training the Supernet

The first question we wish to answer is: Does the supernet train end-to-end at all? As mentioned previously in Section 4, the agent must also use the supernet in the actor phase in addition to the standard learning phase. In SL, it is often the case [43, 39] that the supernet underperforms compared to discretized cells. For RL, a negative feedback loop may occur where a suboptimal supernet collects poor data for training, which leads to an even worse supernet. However, we find that training the supernet end-to-end works fine as-is, even with minimal hyperparameter tuning for both PPO (Figure 5) and Rainbow (Figure 6).

Figure 5: Left: Softmax operation weights over all edges in the cell when training the supernet with PPO + "Classic" search space on Dodgeball. Zero operation weight is not shown to improve clarity. Note that by 25M steps, the operation choices have already converged. Right: Comparison between the supernet and vanilla IMPALA-CNN using (16,16,16) depths on various environments when using infinite levels.

Figure 6: Analogous settings to Figure 5, using Rainbow + "Micro" search space. Left: Softmax weights when training Rainbow with infinite levels on Bigfish. Right: Supernet vs IMPALA-CNN comparison with Rainbow.

When using the "Classic" search space with PPO, the supernet achieves competitive asymptotic performance against a vanilla IMPALA-CNN using the same depth sizes, albeit slower (Figure 5). Surprisingly, we find that the process strongly downweights all base operations except for the standard Conv3x3+ReLU, with 5x5 operations possessing the lowest softmax weights. We corroborate this result by performing evaluations over standard IMPALA-CNN cells using either purely 3x3 or 5x5 convolutions for the whole network. We indeed find that the 3x3 setting outperforms the 5x5 setting, as shown in Figure 13 in Appendix B.1, suggesting that α provides a meaningful signal about operation choice.

We further present ablations in Appendix A, which show that the search process is indeed providing meaningful feedback during training, as a poorly designed supernet can significantly falter. In terms of computational efficiency, we empirically find that using the supernet instead of the baseline IMPALA-CNN increases overall training time by ≈2.5x-3x, if using the same hardware and network sizes. For example, with a Tesla V100, in order to achieve 25M steps, training the Rainbow agent with the baseline IMPALA-CNN network takes ≈8 hours, while the DARTS supernet takes ≈30 hours. A similar scenario occurs with PPO, where the IMPALA-CNN takes ≈1 GPU day to reach 25M steps, while the DARTS supernet takes 2.5 GPU days. Further profiling suggests that the slowdown mostly comes from backpropagation of the supernet, as expected. Measuring the search cost of DARTS allowed us to design experiments for fair comparison against other search methods such as random search, which we used in Section 5.2.

5.4 Correlation Between Supernet and Discretized Cells

Given that the discretized cell only explicitly depends on architecture variables α and not necessarily model weights θ, one may wonder: Is there a relationship between the performances of the supernet and its corresponding discretized cell? For instance, the degenerate/underperformance setting mentioned in Section 5.2 and Appendix B can be thought of as an extreme scenario. In order to compare the two performances, we must first prevent confounding factors arising from Rainbow's natural performance on an environment regardless of architecture. We thus divide the supernet and discrete cell scores by the score obtained by the IMPALA-CNN baseline. In Figure 7, when using Rainbow and observing across environments, we find that there is a significant correlation between supernet and discrete cell performances. This suggests that one way of improving search quality is to improve supernet training, via hyperparameter tuning or data augmentation [51].
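
The analysis reduces to the small computation sketched here (the arrays hold placeholder numbers, not the paper's scores): divide both the supernet and discretized-cell scores by the IMPALA-CNN score per environment, then correlate across environments.

```python
import numpy as np

supernet = np.array([8.0, 3.5, 6.0, 4.2])  # raw scores per environment (placeholders)
discrete = np.array([9.0, 3.0, 6.5, 4.0])
baseline = np.array([7.5, 4.0, 5.5, 4.5])  # IMPALA-CNN scores on the same environments

supernet_rel = supernet / baseline         # values above 1.0 beat the baseline
discrete_rel = discrete / baseline
print(np.corrcoef(supernet_rel, discrete_rel)[0, 1])
```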

Figure 7: Supernet and their corresponding discrete cell performances across all environments in Procgen using Rainbow, after normalizing using IMPALA-CNN’s performances. Thus, the black dashed line at 1.0 corresponds to IMPALA-CNN. Note that for the Ninja environment, both the supernet and discretized cell strongly outperform the baseline, suggesting a high correlation.

5.5 Discovered Cells

The second main question we wish to answer is: Does RL-DARTS find interesting architectures? In SL, a common issue that occurs with DARTS is skip connection collapse, which can result in degenerate cell structures and degraded performance. We find that RL-DARTS does not suffer from this issue when trained in joint optimization mode.

In Fig. 8, we display example discretized cells from converged supernets trained with PPO and Rainbow. Depending on the search space $\mathcal{O}_{base}$, these cells can resemble the original IMPALA-CNN, suggesting that such an architecture seems natural for the task, or can instead be highly nontrivial and have a combinatorial appearance.

Figure 8: Left: Cell found by PPO using the "Classic" search space and joint training on Dodgeball. Note its similarity to vanilla IMPALA-CNN. Right: Cell found by Rainbow using the "Micro" search space.

In addition to the final discretized cells, we study the evolution of discretized cells throughout the optimization trajectory of the supernet. We are interested in answering two sub-questions: 1) How does the cell structure change over time? 2) Do the cells improve in evaluation performance throughout the evolution? For 1), we display discretized cells when training a supernet with Rainbow on the Starpilot environment in Figure 9. As we can see, the earlier cell is dominated by skip connections and is suboptimal compared to the later cell, which possesses several local structures similar to Conv + ReLU residual connections. We provide more extensive cell evolution ablations for the 6 node setting in Figures 14, 15 in Appendix B.2 as well, displaying more sophisticated yet still interpretable changes in cell topology.

Figure 9: Evolution of discovered cells over a DARTS optimization process. Left: A cell discovered in the early stage which is dominated by skip connections. Right: A cell discovered in the end which possesses several local structures similar to Conv + ReLU residual connections.

For 2), as α changes during supernet training, so do the outputs of the discretization procedure on α. We collect all distinct output cells into a sequence, and evaluate each cell's performance by training from scratch for 25M steps, displayed in Figure 10. Depths (16,16,16) were used to match the supernet in order to prevent confounding results from increased model size. As we can see, the performance generally improves over time, indicating that supernet optimization selects better cells. However, we find that such behavior can be environment-dependent, as some environments possess less monotonic evaluation curves (see Figure 16 in Appendix B).

Figure 10: Evaluation of 9 distinct discrete cells in order from the trajectory of α on the Starpilot environment when using Rainbow.

We suspect that there are at least two factors which can affect the evaluation curve. First, as mentioned before, it is unclear whether it is optimal to use the supernet in the actor phase, which affects the data collected to train architecture variables. Secondly, there can be a high integrality gap when discretizing the supernet. As shown in [70], the discretization logic from vanilla DARTS (selection by magnitude of α) can be suboptimal.

6 Conclusion

Our research has demonstrated that it is entirely possible to apply NAS techniques to RL in a minimally invasive way. From a broader perspective, we believe that there could be multiple further uses for NAS in the RL setting, opening up the door to both NAS and RL practitioners alike.

Limitations: In this paper, we did not utilize the full spectrum of DARTS techniques and modifications, which can be explored in future work. Our design choices prioritized reducing the complexity of usage in an already complex RL pipeline, as well as avoiding confounding factors. For instance, the discretization method used in our paper is the default found in DARTS, which may be suboptimal and improved by more recent work [70]. Furthermore, as mentioned in Section 3, we did not use the bilevel approach due to its well-known instability concerns. However, finding a reliable bilevel algorithm in RL may be important for approaching sim-to-real in robotics [32] as well as generalization [38, 10]. Comparison against random search in NAS is also a highly debated topic [40]. As mentioned earlier in Section 5.2, random search with partial training (i.e. early stopping) can be an order of magnitude faster than with full training. However, due to the noise inherent in training RL models, it is unclear if early stopping can provide an accurate signal; for instance, in Figure 3, some of the training curves do not smoothly converge, but rather suddenly increase in performance in the middle of training. This also leads to discussion of blackbox NAS methods for RL, which we did not explore due to the difficulty of dealing with high noise, although noise-resilient methods have been previously developed [22, 36, 37]. Blackbox methods are better suited for non-differentiable objectives such as model latency and memory consumption to improve efficiency [67, 29, 56, 28], which may be important for reducing robotic deployment costs. In terms of applicability to benchmarks, it is clear that our results provide the most utility on vision-based RL tasks [68, 65], which may be further improved when searching for better pooling operations and different nonlinearities. NAS for sequence-to-sequence tasks [48, 60] may also be applied to search for RNN policies to improve memory [49, 18, 34], exploration [17] and adaptation [14, 69].

Negative Societal Impacts: Generally speaking, NAS methods can sacrifice model interpretability in order to achieve higher rewards. For the field of RL specifically, this may warrant more attention in AI safety when used for real-world robotic pipelines. Furthermore, as with any NAS research, the initial phase of discovery and experimentation may contribute to carbon emissions due to the computational costs of extensive tuning, which is exacerbated in the noisy RL setting. However, this is usually a means to an end for reducing carbon emissions, such as an efficient search algorithm which requires little tuning or modification, which this paper proposes.

7 Acknowledgements

We thank Sergio Guadarrama and Oscar Ramirez for their support on TF-Agents, Gabe Barth-Maron and Bobak Shahriari for their support on Acme, and Jakob Bauer for discussions on DARTS implementations. We further thank Quoc Le, Dale Schuurmans, Izzeddin Gur, Sagi Perel, and Daniel Golovin for their feedback on the paper, as well as Yao Lu, Michael Ryoo, and Krzysztof Choromanski for relevant discussions in robotics, and Jack Parker-Holder and Yiding Jiang for discussions around AutoRL.

10 References [1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In A. T. Kalai and M. Mohri, editors, COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 28–40. Omnipress, 2010. [2] I. Akinola, A. Angelova, Y. Lu, Y. Chebotar, D. Kalashnikov, J. Varley, J. Ibarz, and M. S. Ryoo. Visionary: Vision architecture discovery for robot learning. In ICRA, 2021. [3] M. Andrychowicz, A. Raichuk, P. Stanczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, S. Gelly, and O. Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. CoRR, abs/2006.05990, 2020. [4] B. Baker, O. Gupta, R. Raskar, and N. Naik. Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018. [5] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 449–458. PMLR, 2017. [6] X. Chu, X. Wang, B. Zhang, S. Lu, X. Wei, and J. Yan. DARTS-: robustly stepping out of performance collapse without indicators. CoRR, abs/2009.01027, 2020. [7] X. Chu, B. Zhang, and X. Li. Noisy differentiable architecture search. CoRR, abs/2005.03566, 2020. [8] X. Chu, T. Zhou, B. Zhang, and J. Li. Fair DARTS: eliminating unfair advantages in differentiable architecture search. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XV, volume 12360 of Lecture Notes in Computer Science, pages 465–480. Springer, 2020. [9] J. Co-Reyes, Y. Miao, D. Peng, E. Real, S. Levine, Q. V. Le, H. Lee, and A. Faust. Evolving reinforcement learning algorithms. In International Conference on Learning Representations (ICLR), 2021. [10] K. Cobbe, C. Hesse, J. Hilton, and J. Schulman. Leveraging procedural generation to benchmark rein- forcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 2048–2056. PMLR, 2020. [11] K. Cobbe, J. Hilton, O. Klimov, and J. Schulman. Phasic policy gradient. CoRR, abs/2009.04416, 2020. [12] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Confer- ence on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1282–1289. PMLR, 2019. [13] X. Dong and Y. Yang. Searching for a robust neural architecture in four GPU hours. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1761–1770. Computer Vision Foundation / IEEE, 2019. [14] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl$ˆ2$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016. [15] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. 
Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1406–1415. PMLR, 2018. [16] A. Faust, A. Francis, and D. Mehta. Evolving rewards to automate reinforcement learning. In 6th ICML Workshop on Automated Machine Learning, 2019. [17] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. [18] M. Fortunato, M. Tan, R. Faulkner, S. Hansen, A. P. Badia, G. Buttimore, C. Deck, J. Z. Leibo, and C. Blundell. Generalization of reinforcement learners with working and episodic memory. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12448–12457, 2019. [19] J. K. H. Franke, G. Köhler, A. Biedenkapp, and F. Hutter. Sample-efficient automated deep reinforcement learning. CoRR, abs/2009.01555, 2020.

11 [20] J. K. H. Franke, G. Köhler, A. Biedenkapp, and F. Hutter. Sample-efficient automated deep reinforcement learning. CoRR, abs/2009.01555, 2020. [21] A. Gaier and D. Ha. Weight agnostic neural networks. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5365–5379, 2019. [22] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang. Gradientless descent: High- dimensional zeroth-order optimization. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. [23] S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, N. Wu, E. Kokiopoulou, L. Sbaiz, J. Smith, G. Bartók, J. Berent, C. Harris, V. Vanhoucke, and E. Brevdo. TF- Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018. [Online; accessed 25-June-2019]. [24] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In S. A. McIlraith and K. Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), , Louisiana, USA, February 2-7, 2018, pages 3207–3214. AAAI Press, 2018. [25] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In S. A. McIlraith and K. Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3215–3222. AAAI Press, 2018. [26] M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, S. Henderson, A. Novikov, S. G. Colmenarejo, S. Cabi, C. Gulcehre, T. L. Paine, A. Cowie, Z. Wang, B. Piot, and N. de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020. [27] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. [28] A. Howard, R. Pang, H. Adam, Q. V. Le, M. Sandler, B. Chen, W. Wang, L. Chen, M. Tan, G. Chu, V. Vasudevan, and Y. Zhu. Searching for mobilenetv3. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 1314–1324. IEEE, 2019. [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. [30] A. Hundt, V. Jain, and G. D. Hager. sharpdarts: Faster and more accurate differentiable architecture search. CoRR, abs/1903.09900, 2019. [31] M. Jaderberg, V. 
Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population based training of neural networks. CoRR, abs/1711.09846, 2017. [32] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 12627–12637. Computer Vision Foundation / IEEE, 2019. [33] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 2018. [34] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [35] I. Kostrikov, D. Yarats, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. CoRR, abs/2004.13649, 2020. [36] O. Krause. Large-scale noise-resilient evolution-strategies. In A. Auger and T. Stützle, editors, Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019, Prague, Czech Republic, July 13-17, 2019, pages 682–690. ACM, 2019.

12 [37] B. Letham, B. Karrer, G. Ottoni, and E. Bakshy. Constrained bayesian optimization with noisy experiments. CoRR, abs/1706.07094, 2017. [38] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspec- tives on open problems. CoRR, abs/2005.01643, 2020. [39] G. Li, G. Qian, I. C. Delgadillo, M. Müller, A. K. Thabet, and B. Ghanem. SGAS: sequential greedy architecture search. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, , WA, USA, June 13-19, 2020, pages 1617–1627. IEEE, 2020. [40] L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In A. Globerson and R. Silva, editors, Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, volume 115 of Proceedings of Machine Learning Research, pages 367–377. AUAI Press, 2019. [41] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li. DARTS+: improved differentiable architecture search with early stopping. CoRR, abs/1909.06035, 2019. [42] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, volume 11205 of Lecture Notes in Computer Science, pages 19–35. Springer, 2018. [43] H. Liu, K. Simonyan, and Y. Yang. DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. [44] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. [45] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nat., 518(7540):529–533, 2015. [46] J. Parker-Holder, V. Nguyen, and S. J. Roberts. Provably efficient online hyperparameter optimization with population-based bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [47] D. Peng, X. Dong, E. Real, M. Tan, Y. Lu, G. Bender, H. Liu, A. Kraft, C. Liang, and Q. Le. Pyglove: Symbolic programming for automated machine learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [48] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4092–4101. PMLR, 2018. [49] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In D. Precup and Y. W. 
Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2827–2836. PMLR, 2017. [50] R. Raileanu and R. Fergus. Decoupling value and policy for generalization in reinforcement learning. CoRR, abs/2102.10330, 2021. [51] R. Raileanu, M. Goldstein, D. Yarats, I. Kostrikov, and R. Fergus. Automatic data augmentation for generalization in deep reinforcement learning. CoRR, abs/2006.12862, 2020. [52] K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari. Rl-cyclegan: Reinforcement learning aware simulation-to-real. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 11154–11163. IEEE, 2020. [53] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 4780–4789. AAAI Press, 2019. [54] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, 2015. [55] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864, 2017.

13 [56] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4510–4520. IEEE Computer Society, 2018. [57] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. [58] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1889–1897. JMLR.org, 2015. [59] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. [60] D. R. So, Q. V. Le, and C. Liang. The evolved transformer. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5877–5886. PMLR, 2019. [61] X. Song, K. Choromanski, J. Parker-Holder, Y. Tang, D. Peng, D. Jain, W. Gao, A. Pacchiano, T. Sarlós, and Y. Yang. ES-ENAS: combining evolution strategies with neural architecture search at no extra cost for reinforcement learning. CoRR, abs/2101.07415, 2021. [62] X. Song, Y. Jiang, S. Tu, Y. Du, and B. Neyshabur. Observational overfitting in reinforcement learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. [63] K. O. Stanley and R. Miikkulainen. Efficient reinforcement learning through evolving neural network topologies. In W. B. Langdon, E. Cantú-Paz, K. E. Mathias, R. Roy, D. Davis, R. Poli, K. Balakrishnan, V. G. Honavar, G. Rudolph, J. Wegener, L. Bull, M. A. Potter, A. C. Schultz, J. F. Miller, E. K. Burke, and N. Jonoska, editors, GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA, 9-13 July 2002, pages 569–577. Morgan Kaufmann, 2002. [64] K. O. Stanley and R. Miikkulainen. Evolving neural network through augmenting topologies. Evol. Comput., 10(2):99–127, 2002. [65] A. Stone, O. Ramirez, K. Konolige, and R. Jonschkowski. The distracting control suite - A challenging benchmark for reinforcement learning from pixels. CoRR, abs/2101.02722, 2021. [66] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, [NIPS Conference, , Colorado, USA, November 29 - December 4, 1999], pages 1057–1063. The MIT Press, 1999. [67] M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 2019. [68] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller. 
Deepmind control suite. CoRR, abs/1801.00690, 2018. [69] J. Wang, Z. Kurth-Nelson, H. Soyer, J. Z. Leibo, D. Tirumala, R. Munos, C. Blundell, D. Kumaran, and M. M. Botvinick. Learning to reinforcement learn. In G. Gunzelmann, A. Howes, T. Tenbrink, and E. J. Davelaar, editors, Proceedings of the 39th Annual Meeting of the Cognitive Science Society, CogSci 2017, London, UK, 16-29 July 2017. cognitivesciencesociety.org, 2017. [70] R. Wang, M. Cheng, X. Chen, X. Tang, and C.-J. Hsieh. Rethinking architecture selection in differentiable nas. In International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. [71] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In M. Balcan and K. Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, , NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1995–2003. JMLR.org, 2016. [72] B. Zhang, R. Rajan, L. Pineda, N. O. Lambert, A. Biedenkapp, K. Chua, F. Hutter, and R. Calandra. On the importance of hyperparameter optimization for model-based reinforcement learning. CoRR, abs/2102.13651, 2021. [73] C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893, 2018. [74] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.

[75] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8697–8710. IEEE Computer Society, 2018.

APPENDIX

A What Affects Supernet Training?

Given the positive training results we demonstrated in the main body of the paper, one may wonder: can any supernet, no matter how poorly designed or set up, still train well in the RL setting? If so, this would imply that the search method is not producing meaningful signals, but instead random ones. We refute this hypothesis by performing ablations over our supernet training in order to better understand which components affect its performance. We ultimately show that the search space and architecture variables play a very significant role in its optimization, thus validating our method.

A.1 Role of Search Space

We remove the ReLU nonlinearities from the "Classic" search space, so that Obase = {Conv3x3, Conv5x5, Dilated3x3, Dilated5x5} and thus the DARTS cell consists of only linear operations. As shown in Fig. 11, this leads to a dramatic decrease in supernet performance, providing evidence that the search space matters greatly.

Figure 11: Supernet training using PPO on Dodgeball with infinite levels, when using the "Classic" search space with/without ReLU nonlinearities, under the same hyperparameters.

A.2 Uniform Architecture Variables

We further demonstrate the importance of the architecture variables α on training. We see that in Fig. 12, freezing α to be uniform throughout training makes the Rainbow agent unable to train at all. This suggests that it is crucial for α to properly route useful operations throughout the network.

Figure 12: Supernet training using Rainbow on Dodgeball with infinite levels, when using the "Micro" search space with/without trainable architecture variables α, under the same hyperparameters.

B What Affects Discrete Cell Performance?

B.1 Softmax Weights vs Discretization

As seen from Figure 5 in the main body of the paper, the DARTS supernet strongly downweights Conv5x5+ReLU operations when using the "Classic" search space with PPO. In order to verify the predictive power of the softmax weights, we also performed evaluations using purely 3x3 or 5x5 convolutions on a large IMPALA-CNN with (64,64,64,64,64) depths as a proxy. We see that the Conv3x3 setting indeed outperforms Conv5x5, corroborating the result that during training, α strongly upweights the Conv3x3+ReLU operation and downweights Conv5x5+ReLU.

Figure 13: Large IMPALA-CNNs evaluated on Dodgeball using either Infinite or 200 levels with PPO. For the 200 level setting, lighter colors correspond to test performance.

B.2 Discrete Cell Evolutions

Along with Figure 9 in the main body of the paper, we also compare extra examples of discretizations before and after supernet training, to display reasonable behaviors induced by the trajectory of α. In Figure 14, we use PPO with the "Classic" search space, but replace the 3-node 2-cell search with a 6-node 1-cell search in the IMPALA sequence, for a larger search space. We find that the discretized cell initially uses multiple different operations, while at the end, the majority of operations are Conv3x3+ReLU. In Figure 15, the discretized cell initially uses a large number of skip connections as well as dead-end nodes. However, at convergence, it eventually utilizes all nodes to compute the final output.

Figure 14: Comparison of discretized cells before and after supernet training, on Starpilot using PPO with 6 nodes.

Figure 15: Comparison of discretized cells before and after supernet training, on Plunder using PPO with 6 nodes.

Curiously, we find that the skip connection between the input and output appears commonly throughout many searches.

For the Rainbow setting, in Figure 10 in the main body of the paper, we saw that when the search process is successful, the supernet's training trajectory induces discretized cells which improve evaluation performance as well. The cells discovered later generally perform better than cells discovered earlier in the supernet training process. In Figure 16, we show more examples of such evaluation curves.

Figure 16: Evaluated discretized cells discovered throughout training the supernet with Rainbow. To save computation, we evaluate every 2nd cell that was discovered.

B.3 Degenerated Supernet

One case in which a discretized cell may underperform is if its corresponding supernet fails to learn. For example, in Figure 17, we see that when using Rainbow, Supernet 2 shows no sign of learning, and its corresponding discretized cell converges to a degenerate cell consisting of only nonlinear operations without any usage of the Conv3x3 operation. During evaluation, the degenerate cell also produces a trivial learning curve.


Figure 17: (a) Two supernets trained on the Dodgeball environment using Rainbow, with corresponding discretized cells evaluated using 3 random seeds. (b) Discretized cell from Supernet 1. (c) Discretized cell from Supernet 2.

C Hyperparameters

In our code, we entirely use TensorFlow 2 for auto-differentiation. We used the April 2020 version of Procgen, which uses an MIT License. For compute, we used either P100 or V100 GPUs based on convenience and availability. Below are the hyperparameter settings for specific methods. For all training curves, we use the common standard for reporting in RL (i.e. plotting mean and standard deviation across 3 seeds).

C.1 DARTS

Initially, we swept the softmax temperature in order to find a stable default value that could be used for all environments. For PPO, the sweep was across the set {5.0, 10.0, 15.0}. For Rainbow, the sweep was across {10.0, 20.0, 50.0}. For the tabular reported scores in Figures 3 and 4, we used a consistent softmax temperature of 5.0 for PPO, and 10.0 for Rainbow.
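
For reference, a temperature-scaled softmax over the architecture logits is sketched below; we assume the temperature is applied by dividing the logits before the softmax, which is the standard convention.

```python
import numpy as np

def softmax_with_temperature(alpha, temperature):
    """Higher temperatures flatten the mixture weights over candidate operations."""
    z = np.exp((alpha - alpha.max()) / temperature)
    return z / z.sum()

logits = np.array([1.0, 0.0, -1.0])
print(softmax_with_temperature(logits, temperature=5.0))  # close to uniform
print(softmax_with_temperature(logits, temperature=1.0))  # more peaked
```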

C.2 Rainbow-DQN

We use Acme [26] for our code infrastructure. We use a learning rate of $5 \times 10^{-5}$, a batch size of 256, an n-step size of 7, and a discount factor of 0.99. For the prioritized replay buffer [57], we use a priority exponent of 0.8, an importance sampling exponent of 0.2, and a replay buffer capacity of 500K. For particular environments (Bigfish, Bossfight, Chaser, Dodgeball, Miner, Plunder, Starpilot), we use an n-step size of 2 and a replay buffer capacity of 10K. For C51 [5], we use 51 atoms, with $v_{min} = 0$ and $v_{max} = 1.0$. As a preprocessing step, we normalize the environment rewards by dividing the raw rewards by the max possible rewards reported in [10].

C.3 PPO

We use TF-Agents [23] for our code infrastructure, along with equivalent PPO hyperparameters found in [10]. Due to necessary changes in minibatch size when applying DARTS modules or networks with higher GPU memory usage, we swept the learning rate across $\{1 \times 10^{-4}, 2.5 \times 10^{-4}, 5 \times 10^{-4}\}$ and the number of epochs across {1, 2, 3}. The default minibatch size with a (16, 32, 32) IMPALA-CNN is 2048. For other models, we use the maximum power-of-2 minibatch size before encountering GPU out-of-memory issues on a standard 16 GB GPU. Thus, for a (16, 16, 16) DARTS supernet, we set the minibatch size to 256, and for evaluation with a (64, 64, 64, 64, 64) discretized CNN, we use a minibatch size of 256 as well. Our hyperparameter grid search for the evaluation led to an optimal setting of learning rate $= 1 \times 10^{-4}$ and number of epochs $= 1$.
