Under review as a conference paper at ICLR 2019

SAMPLE EFFICIENT DEEP NEUROEVOLUTION IN LOW DIMENSIONAL LATENT SPACE

Anonymous authors Paper under double-blind review

ABSTRACT

Current deep models are usually trained over large parameter search spaces for complex learning tasks, e.g. playing video games, which requires billions of samples and thousands of search steps to obtain significant performance. This raises the question of whether we can make use of the sequential data generated during training, encode input samples, and evolve in a low dimensional parameter space with latent state input in a fast and efficient manner. Here we give an affirmative answer: we train a VAE to encode input samples, then an RNN to model environment dynamics and handle temporal information, and finally evolve a low dimensional policy network in latent space. We demonstrate that this approach is surprisingly efficient: our experiments on Atari games show that within 10M frames and 30 evolution steps of training, our algorithm achieves results competitive with ES, A3C, and DQN, which require hundreds of millions to billions of frames.

1 INTRODUCTION

Training an agent that can handle difficult tasks in a complex environment is always challenging for artificial intelligence. Most works addressing such problems use reinforcement learning algorithms, e.g. learning a value function (Watkins & Dayan, 1992), a policy function (Kakade, 2002), or both (Sutton et al., 2000). Combined with deep neural networks, the resulting deep reinforcement learning has achieved great success in playing Atari games (Mnih et al., 2015), operating robots (Yahya et al., 2017), and defeating human experts in challenging competitions like Go and Dota (Silver et al., 2016; OpenAI, 2018). Recently, a line of work has combined deep neural networks with evolution strategies to solve similar problems. This approach, called deep neuroevolution, achieves results competitive with deep reinforcement learning on Atari games and MuJoCo tasks in much shorter training time, owing to its outstanding scalability and ease of parallelization (Salimans et al., 2017; Such et al., 2017).

However, large-scale deep neuroevolution is both data and computation inefficient: it consumes thousands of CPU cores and needs billions of frames and thousands of generations to obtain significant results. In this work, we introduce a novel way to improve its efficiency: we train a variational autoencoder (VAE) to encode samples, and a recurrent neural network to model environment dynamics and handle temporal information; afterward, we train a two-layer policy network on latent state vectors using the covariance matrix adaptation evolution strategy (CMA-ES). We evaluate our sample efficient deep neuroevolution algorithm, SEDN for short, on 50 Atari games from OpenAI Gym (Brockman et al., 2016); experiments show that our approach surpasses DQN (Mnih et al., 2013), A3C (Mnih et al., 2016), and ES (Salimans et al., 2017) on several Atari games while using far fewer state frames, as shown in Table 1. Our key findings are listed below:

• Evolution goes faster with latent state input. SEDN takes fewer than 30 evolution steps with only 32 children per generation to surpass ES (Salimans et al., 2017) on several games; the latter needs thousands of children and generations.
• Training is more data efficient. Compared to ES and asynchronous RL methods, which need billions of frames to train, SEDN takes fewer than 10 million frames to reach its targets.
• Training is more computation efficient. On a workstation with one Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz and one Titan X Pascal GPU, SEDN finishes evolution in less than 4 hours, whereas ES needs a cluster.


2 METHOD

Many real-world decision making problems can be modeled as a Markov Decision Process (MDP). Since its introduction by Bellman (1957), the MDP framework has been applied in many disciplines, e.g. robotics, economics, and game playing. A Markov decision process is defined by: a set of states $S$; a set of actions $A$; transition probabilities over the next state given the current state and action, $P : S \times A \times S \to [0, 1]$; an initial state distribution $d : S \to [0, 1]$; and a reward function mapping states and actions to a scalar, $R : S \times A \to \mathbb{R}$. A finite time horizon episode $\tau$ of $T$ steps is a sequence of state-action-reward tuples $(s_0, a_0, r_0), (s_1, a_1, r_1), \dots, (s_t, a_t, r_t), \dots, (s_T, \emptyset, \emptyset)$ called a trajectory, where $s_t \in S$, $a_t \in A$, $r_t = R(s_t, a_t)$, $s_{t+1} \sim P(s_t, a_t)$, $s_0 \sim d(s)$, and $s_T$ is the terminal state. For a finite time horizon MDP, reinforcement learning (RL) algorithms learn a parameterized policy $\pi_\theta : S \times A \to [0, 1]$ that defines the probability of actions to be taken given the current state, and maximize the episodic reward $\sum_{t=0}^{T-1} r_t$, or formally:

$$\underset{\theta}{\text{maximize}} \quad \mathbb{E}\left[ \sum_{t=0}^{T-1} R(s_t, a_t) \right] \qquad (1)$$

$$\text{where} \quad s_0 \sim d(s), \quad a_t \sim \pi_\theta(s_t), \quad s_{t+1} \sim P(s_t, a_t) \qquad (2)$$

The objective above takes the expectation over episodic rewards, which can be approximated by sampling $N$ trajectories from the policy $\pi_\theta$ and the environment dynamics $P$ and $d$. Abusing notation so that $r(\tau)$ denotes the episodic reward and $p(\tau)$ the trajectory probability, we obtain the reformulation:

$$\underset{\theta}{\text{maximize}} \quad L(\theta) = \sum_{i=1}^{N} r(\tau_i)\, p(\tau_i) \qquad (3)$$

$$\text{where} \quad p(\tau) = d(s_0) \prod_{t=0}^{T-1} \pi_\theta(s_t, a_t)\, p(s_t, a_t, s_{t+1}), \qquad r(\tau) = \sum_{t=0}^{T-1} r_t \qquad (4)$$

Vanilla policy gradient (PG) methods optimize $L(\theta)$ by taking its gradient:

$$\nabla_\theta L \approx \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)\, \nabla_\theta \log p_\theta(\tau_i) \qquad (5)$$

$$= \frac{1}{N} \sum_{i=1}^{N} r(\tau_i) \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(s_t, a_t) \qquad (6)$$

and update the policy $\pi_\theta$ to maximize the expected episodic reward. Variants of policy gradient methods have achieved great success in recent years, e.g. TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017). Evolution strategies (ES) instead take another approach to the above optimization problem, requiring no differentiation: the gradient is approximated by sampling, as shown in Equation 7.

$$\nabla_\theta L \approx \frac{1}{\sigma^2}\, \mathbb{E}_{\hat{\theta} \sim \mathcal{N}(0, \sigma^2 I)} \left[ L(\theta + \hat{\theta})\, \hat{\theta} \right] \qquad (7)$$

Vanilla ES solves MDP problems iteratively, as shown in Algorithm 1: first, it samples Gaussian noise from a normal distribution; second, it injects the noise into the current policy and evaluates the perturbed policy in the environment, obtaining episodic rewards; third, it updates the current policy parameters with the product of episodic reward and injected noise. ES and PG can be interpreted as two faces of Gaussian smoothing of the policy: ES smooths in parameter space, whereas PG smooths in action space (Salimans et al., 2017).


Algorithm 1 Evolution Strategies
Input: learning rate $\alpha$, noise standard deviation $\sigma$, initial policy parameters $\theta$
1: for $t = 0, 1, \dots$ do
2:   Sample $N$ parameter noise vectors: $\hat{\theta}_i \sim \mathcal{N}(0, \sigma^2 I)$ for $i = 1, \dots, N$
3:   Execute the perturbed deterministic policies $\pi_{\theta_i}$, $a_t = \arg\max_a \pi_{\theta_i}(s_t)$, to produce trajectories $\tau_i$, where $\theta_i = \theta + \hat{\theta}_i$, for $i = 1, \dots, N$
4:   Update the policy parameters: $\theta \leftarrow \theta + \frac{\alpha}{N\sigma^2} \sum_{i=1}^{N} r(\tau_i)\, \hat{\theta}_i$
5: end for
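To make Algorithm 1 concrete, the following NumPy sketch implements the sampling and update step above (equivalently, the gradient estimator of Equation 7). The names vanilla_es and episode_reward and the hyperparameter values are illustrative assumptions, not the paper's implementation.

import numpy as np

def vanilla_es(theta, episode_reward, alpha=0.01, sigma=0.1, n_children=32, n_steps=30):
    # theta: flat policy parameter vector (np.ndarray)
    # episode_reward(theta): assumed rollout routine returning the episodic reward r(tau)
    for _ in range(n_steps):
        noise = sigma * np.random.randn(n_children, theta.size)       # theta_hat_i ~ N(0, sigma^2 I)
        rewards = np.array([episode_reward(theta + eps) for eps in noise])
        # Line 4 of Algorithm 1: weight each noise vector by its episodic reward
        theta = theta + alpha / (n_children * sigma ** 2) * (noise.T @ rewards)
    return theta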

ES enjoys several advantages over PG. First, it is a black-box optimization method that needs no gradients. In problems where gradients are unavailable, state encoding is not feasible, or the gradient is blocked by some operation in the network, ES has a wider range of applications than PG. Second, in low dimensional optimization problems like MuJoCo tasks, ES performs competitively with PG (Salimans et al., 2017) while being less sensitive to hyperparameters (Heidrich-Meisner & Igel, 2008). Third, the exploration-exploitation trade-off that RL must balance is built in: perturbing the policy parameters already produces diverse behaviors during evaluation, whereas RL has to rely on a stochastic policy to explore. Last, ES is highly scalable and can therefore be trained faster by a group of workers in parallel.

However, in an n-dimensional search space with large n, the number of exploration directions grows exponentially with n, so random sampling for gradient approximation becomes very inefficient. For video games like Atari, the high dimensional pixel state space induces a policy with more parameters to optimize, making ES even less efficient. Surprisingly, Salimans et al. (2017) show that even natural evolution strategies (NES) can beat common RL algorithms on several Atari games. Among the class of evolution strategies, the covariance matrix adaptation evolution strategy (CMA-ES) (Hansen & Ostermeier, 2001) is the most successful at solving low dimensional optimization problems. Would CMA-ES perform better or faster than NES on high dimensional tasks? To answer this question, one problem has to be solved: applying CMA-ES directly in a high dimensional space is infeasible, since it must maintain a covariance matrix of size $O(n^2)$. Recent advances in deep learning show promise for encoding high dimensional spatial-temporal information into low dimensional latent vectors (Oh et al., 2015; Pathak et al., 2018; Ha & Schmidhuber, 2018) by modeling environment dynamics. In this case, we can convert a high dimensional RL task into a low dimensional one, where CMA-ES can take advantage.

2.1 COVARIANCE MATRIX ADAPTATION EVOLUTION STRATEGY

CMA-ES improves on vanilla ES in two ways. First, vanilla ES samples noise from a fixed Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$ with only one degree of freedom, the step size $\sigma$; CMA-ES instead samples noise from $\mathcal{N}(0, C)$, where the covariance matrix $C$ has $n(n+1)/2$ degrees of freedom. Second, CMA-ES updates the sampling parameters $\sigma$ and $C$ by a rank-based method rather than updating the policy parameters with an approximated gradient.

We now explain rank-based ES, which is a selection-recombination process. Let $f$ denote the objective function, $\lambda$ the population size, $\mu$ the parent population size, and $\{w_i \mid w_1 > w_2 > \dots > w_\mu > 0,\ \sum_{i=1}^{\mu} w_i = 1,\ i = 1, \dots, \mu\}$ the recombination weights. Rank-based ES first samples $\lambda$ points from a Gaussian distribution, $x_i \sim m + \sigma \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$. Next, it evaluates the sampled points on the objective function and sorts them by cost such that $f(x_1) < f(x_2) < \dots < f(x_\lambda)$. It then recombines the best $\mu$ points to obtain the new mean and covariance matrix, $m \leftarrow \sum_{i=1}^{\mu} w_i x_i$ and $C \leftarrow \sum_{i=1}^{\mu} w_i (x_i - m)(x_i - m)^\top$. This procedure repeats for $G$ generations.

Similar to gradient based optimization methods, CMA-ES further uses momentum, step size control, and path accumulation for more stable optimization. Let $y_i^{(g+1)} \sim \mathcal{N}(0, C^{(g)})$, $i = 1, \dots, \lambda$, be the sampled noise to be injected and $x_i = m + \sigma y_i$, and suppose $\{y_i \mid i = 1, \dots, \mu\}$ are sorted such that $f(x_1) < f(x_2) < \dots < f(x_\lambda)$. Updating the covariance matrix with momentum gives:

$$C^{(g+1)} = (1 - c_\mu)\, C^{(g)} + c_\mu \sum_{i=1}^{\mu} w_i\, y_i^{(g+1)} y_i^{(g+1)\top} \qquad (8)$$


where $g$ is the generation index and $c_\mu$ is a momentum constant. To favor the direction of the best point in the current generation, the best point's covariance term is further added to Equation 8:

$$C^{(g+1)} = (1 - c_\mu - c_1)\, C^{(g)} + c_1\, y_1^{(g+1)} y_1^{(g+1)\top} + c_\mu \sum_{i=1}^{\mu} w_i\, y_i^{(g+1)} y_i^{(g+1)\top} \qquad (9)$$

In practice, an exponentially smoothed path of the parents' mean, called the evolution path, is used:

$$p_c^{(g+1)} = (1 - c_c)\, p_c^{(g)} + c_c\, \big(m^{(g+1)} - m^{(g)}\big) / \sigma^{(g)} \qquad (10)$$

$$C^{(g+1)} = (1 - c_\mu - c_1)\, C^{(g)} + c_1\, p_c^{(g+1)} p_c^{(g+1)\top} + c_\mu \sum_{i=1}^{\mu} w_i\, y_i^{(g+1)} y_i^{(g+1)\top} \qquad (11)$$

Next, we explain how to control the step size $\sigma$. The step size is related to the evolution path: it should be decreased if the path is short and increased if the path is long. In contrast to the evolution path $p_c$, here the search directions are corrected by using the conjugate evolution path $p_\sigma$:

$$p_\sigma^{(g+1)} = (1 - c_\sigma)\, p_\sigma^{(g)} + c_\sigma\, C^{(g)\,-\frac{1}{2}} \big(m^{(g+1)} - m^{(g)}\big) / \sigma^{(g)} \qquad (12)$$

To update the step size, we compare the length of the conjugate evolution path with the expected length of vectors sampled from the normal distribution $\mathcal{N}(0, I)$:

$$\ln \sigma^{(g+1)} = \ln \sigma^{(g)} + \frac{c_\sigma}{d_\sigma} \left( \frac{\| p_\sigma \|}{\mathbb{E}\, \| \mathcal{N}(0, I) \|} - 1 \right) \qquad (13)$$
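The NumPy sketch below applies the simplified update rules of Equations (8)-(13) for one generation: rank-based selection and recombination, the evolution path, the rank-one plus rank-mu covariance update, and step size control. The logarithmic weight schedule and the eigendecomposition used for $C^{-1/2}$ are common choices assumed here, not details prescribed by the text.

import numpy as np

def cma_generation(m, sigma, C, p_c, p_sigma, xs, fs, mu, c_c, c_sigma, c_1, c_mu, d_sigma):
    # xs: (lambda, n) sampled points; fs: their objective values (lower is better)
    n = m.size
    order = np.argsort(fs)                                   # rank-based selection
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))      # assumed weight schedule
    w /= w.sum()                                             # w_1 >= ... >= w_mu > 0, sum w_i = 1
    x_sel = xs[order[:mu]]
    y_sel = (x_sel - m) / sigma                              # y_i = (x_i - m) / sigma
    m_new = w @ x_sel                                        # recombination: new mean

    # Conjugate evolution path (Eq. 12): whiten the mean shift with C^{(g) -1/2}
    eigvals, B = np.linalg.eigh(C)
    C_inv_sqrt = B @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-20))) @ B.T
    p_sigma = (1 - c_sigma) * p_sigma + c_sigma * C_inv_sqrt @ (m_new - m) / sigma

    p_c = (1 - c_c) * p_c + c_c * (m_new - m) / sigma        # evolution path (Eq. 10)
    C = ((1 - c_1 - c_mu) * C                                # covariance update (Eq. 11)
         + c_1 * np.outer(p_c, p_c)                          # rank-one term
         + c_mu * (w[:, None] * y_sel).T @ y_sel)            # rank-mu term (Eq. 8)

    # Step size control (Eq. 13): compare ||p_sigma|| with E||N(0, I)||
    expected_norm = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / expected_norm - 1))
    return m_new, sigma, C, p_c, p_sigma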

Hansen (2016) suggests default hyperparameters for CMA-ES; we summarize the complete CMA-ES procedure in Algorithm 2.

Algorithm 2 CMA-ES
Input: $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $\lambda, \mu \in \mathbb{N}$ with $\mu < \lambda$, $C = I$, $p_c = 0$, $p_\sigma = 0$,
weights $w_i$ such that $\mu_w = 1 / \sum_{i=1}^{\mu} w_i^2 \approx 0.3\lambda$, $w_1 \ge \dots \ge w_\mu > 0$, $\sum_{i=1}^{\mu} w_i = 1$,
constants $c_c \approx 4/n$, $c_\sigma \approx 4/n$, $c_1 \approx 2/n^2$, $c_\mu \approx \mu_w/n^2$, $c_1 + c_\mu \le 1$, $d_\sigma \approx 1 + \sqrt{\mu_w/n}$
1: for each generation step do
2:   Sampling: $y_i \sim \mathcal{N}_i(0, C)$, $x_i = m + \sigma y_i$ for $i = 1, \dots, \lambda$
3:   Evaluation: obtain $f(x_i)$ and sort $\{x_i \mid i = 1, \dots, \lambda\}$ such that $f(x_1) \le \dots \le f(x_\lambda)$
4:   Mean update: $m \leftarrow \sum_{i=1}^{\mu} w_i x_i$, $y_w = \sum_{i=1}^{\mu} w_i y_i$
5:   Evolution path update: $p_c \leftarrow (1 - c_c)\, p_c + \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, y_w$
6:   Conjugate evolution path update: $p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C^{-\frac{1}{2}} y_w$
7:   Covariance matrix update: $C \leftarrow (1 - c_1 - c_\mu)\, C + c_1\, p_c p_c^\top + c_\mu \sum_{i=1}^{\mu} w_i\, y_i y_i^\top$
8:   Step size update: $\sigma \leftarrow \sigma \exp\!\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|p_\sigma\|}{\mathbb{E}\|\mathcal{N}(0, I)\|} - 1 \right) \right)$
9: end for

CMA-ES addresses typical difficulties of non-linear optimization such as ill-conditioning and ruggedness. The covariance matrix approximates the inverse Hessian, and the updates approximate the natural gradient by adapting the search metric towards a sphere, which can increase performance by orders of magnitude. Rank-based selection makes the method invariant to translation and rotation of the search space, and step size control facilitates fast convergence. However, CMA-ES is not directly applicable to the deep neural networks used in DRL, which have millions of parameters and would induce a covariance matrix of size $O(n^2)$. This motivates an intuitive approach: encode frames into small latent vectors that model the dynamics of the game environment, and evolve a policy in a much lower dimensional parameter space.
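Because the latent-space policy is small, off-the-shelf CMA-ES implementations apply directly. A minimal sketch of the ask/tell loop with pycma (Hansen, 2018) is shown below; evaluate_policy (an average episodic reward over rollouts), the parameter count, and the population size of 32 are assumptions used for illustration.

import cma  # pycma (Hansen, 2018)

def fitness(x):
    # pycma minimizes, so return the negative average episodic reward;
    # evaluate_policy is an assumed rollout routine for the flat parameter vector x
    return -evaluate_policy(x)

n_params = 256 * 32 + 32 * 18        # roughly a two-layer 256-32-n_a policy (n_a = 18 assumed)
es = cma.CMAEvolutionStrategy([0.0] * n_params, 0.1, {'popsize': 32})
while not es.stop():
    children = es.ask()                                # sample lambda candidate parameter vectors
    es.tell(children, [fitness(x) for x in children])  # rank-based CMA-ES update
    es.disp()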


(a) Variational Autoencoder (b) LSTM Forward Model Dynamics (c) Policy Network

Figure 1: Our SEDN model. The VAE encodes a state frame into a latent vector; the LSTM encodes the current latent state and action and predicts the next latent state; the policy network takes the current latent state and the previous timestep's RNN hidden state as input and predicts the action to execute.

2.2 ENVIRONMENT DYNAMICS MODELING FOR EPISODIC DATA ENCODING

How can we learn a model that encodes episode data $\tau : s_0, a_0, s_1, a_1, \dots, s_T, a_T$? We do so by modeling environment dynamics. Perhaps the earliest work on building a predictive model for vision based RL tasks was introduced by Schmidhuber & Huber (1991), which proposed using neural networks to predict the attention region given previous frames and actions. After the success of deep learning, Oh et al. (2015) proposed an encoding-transformation-decoding framework for action conditional video prediction and applied it to Atari games. More recently, Pathak et al. (2018) learn forward and inverse model dynamics to imitate an expert's behavior given only image sequences, and Ha & Schmidhuber (2018) learn forward model dynamics from random-policy demonstrations and train a controller entirely inside the dream.

In this work, we take a similar approach to encode episodic data. Specifically, we first train a variational autoencoder (VAE) that encodes spatial information, then a recurrent neural network (RNN) that encodes temporal information by learning the model dynamics. Let $\phi_e : S \to \mathbb{R}^n$ and $\phi_d : \mathbb{R}^n \to S$ denote our encoder and decoder networks, where $n$ is the latent space dimension, let $f : \mathbb{R}^n \times A \to \mathbb{R}^n$ be our RNN, and let $c : \mathbb{R}^n \times \mathbb{R}^m \to A$ be our policy network, where $m$ is the dimension of the RNN's hidden state. The encoder $\phi_e$ encodes a state frame $s_t$ into a latent vector $z_t$:

$$\mu(s_t),\ \sigma(s_t) = \phi_e(s_t) \qquad (14)$$

$$z_t \sim \mathcal{N}(\mu(s_t), \sigma(s_t)) \qquad (15)$$

The decoder $\phi_d$ decodes the latent vector back into a reconstructed state frame $\hat{s}_t$. The VAE is trained to minimize the reconstruction loss between $s_t$ and $\hat{s}_t$ plus the KL divergence between the latent distribution and $\mathcal{N}(0, I)$:


$$\hat{s}_t = \phi_d(z_t) \qquad (16)$$

$$L_{vae} = \| s_t - \hat{s}_t \|^2 + \mathrm{KL}\big[ \mathcal{N}(\mu(s_t), \sigma(s_t)) \,\|\, \mathcal{N}(0, I) \big] \qquad (17)$$
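A minimal PyTorch sketch of Equations (14)-(17) is given below; the log-variance parameterization and the encoder/decoder callables are implementation assumptions rather than details fixed by the paper.

import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, s_t):
    # encoder(s_t) is assumed to return the latent mean and log-variance (Eq. 14)
    mu, logvar = encoder(s_t)
    std = torch.exp(0.5 * logvar)
    z_t = mu + std * torch.randn_like(std)            # reparameterized sample z_t (Eq. 15)
    s_hat = decoder(z_t)                              # reconstruction (Eq. 16)
    recon = F.mse_loss(s_hat, s_t, reduction='sum')   # ||s_t - s_hat||^2
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL[N(mu, sigma) || N(0, I)]
    return recon + kl                                 # Eq. (17)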

The LSTM $f$ models the environment dynamics in latent state space and encodes temporal information:

$$\mu(\hat{s}_{t+1}),\ \sigma(\hat{s}_{t+1}) = f(z_t, a_t) \qquad (18)$$

$$\hat{z}_{t+1} \sim \mathcal{N}(\mu(\hat{s}_{t+1}), \sigma(\hat{s}_{t+1})) \qquad (19)$$

The LSTM is trained to minimize the KL divergence between the actual latent state distribution from the VAE encoder and the predicted latent state distribution:

$$\mu(s_{t+1}),\ \sigma(s_{t+1}) = \phi_e(s_{t+1}) \qquad (20)$$

$$L_{rnn} = \mathrm{KL}\big[ \mathcal{N}(\mu(\hat{s}_{t+1}), \sigma(\hat{s}_{t+1})) \,\|\, \mathcal{N}(\mu(s_{t+1}), \sigma(s_{t+1})) \big] \qquad (21)$$
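Equation (21) is the KL divergence between two diagonal Gaussians and has a closed form; below is a PyTorch sketch under an assumed log-variance parameterization.

import torch

def forward_model_loss(mu_pred, logvar_pred, mu_target, logvar_target):
    # KL[ N(mu_pred, sigma_pred) || N(mu_target, sigma_target) ], Eq. (21):
    # the LSTM's predicted next-latent distribution (Eqs. 18-19) against the
    # VAE encoder's distribution for the actual next frame (Eq. 20)
    var_pred, var_target = logvar_pred.exp(), logvar_target.exp()
    kl = 0.5 * (logvar_target - logvar_pred
                + (var_pred + (mu_pred - mu_target).pow(2)) / var_target
                - 1.0)
    return kl.sum(dim=-1).mean()   # sum over latent dimensions, average over batch/time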

The policy network takes the latent vector of the current state and the previous timestep's RNN hidden state as input; we use a deterministic policy in this work:

$$a_t = c(z_t, h_{t-1})$$

We illustrate our model in Figure 1.

3 EXPERIMENTS

Our complete training pipeline is summarized in Algorithm 3. We use OpenMPI's master-slave mode (Scarabello & Clipp, 2018) to parallelize policy evaluation during evolution steps, pycma (Hansen, 2018) as the CMA-ES optimizer, and PyTorch (Paszke et al., 2017) to train the VAE and RNN. We train the three modules separately. First, we collect episode data while training the policy network with CMA-ES. Second, we extract all video frames from the collected data and train our variational autoencoder. Third, we convert all state frames into latent state vectors using the trained VAE. Last, we train our LSTM on the converted episode data. This training cycle iterates until the end of training. During training, we keep the data buffer as a FIFO queue of size 1M to bound workstation memory. Training stops when the maximum number of frames or the maximum number of evolution steps is reached, whichever comes first. Our parallelization adopts a task-queue master-slave mode: the master node distributes remaining tasks as soon as any slave node becomes available, and a slave node requests a new task right after finishing its previous one. Unlike Salimans et al. (2017), our master-slave communication frequency is so low that it takes only a small fraction of the time spent on episode data sampling.

We conduct experiments on 50 Atari 2600 games in OpenAI Gym (Brockman et al., 2016). We choose a larger encoder convolutional network than vanilla DQN (Mnih et al., 2013) for better VAE performance: our experiments show poor decoding quality when following vanilla DQN's convolutional architecture. Our data pre-processing also differs from vanilla DQN: we resize image frames directly, without frame stacking, grayscale conversion, or frame skipping. We tried stacking latent state vectors as the policy network input but observed a performance drop; the likely reason is that the increased input dimension slows down evolution for the CMA-ES optimizer.

Our encoder network consists of 4 convolutional layers with 32, 64, 128, and 256 channels, followed by a flatten layer and a linear layer of 128 units. The convolutional layers use 4 × 4, 4 × 4, 3 × 3, 3 × 3 filters with strides 2, 2, 1, 1. The decoder network mirrors these operations, except that the last deconvolutional layer uses a 6 × 6 filter. Each convolutional and deconvolutional layer uses rectified activation, except the last convolutional layer, which uses tanh. Image frames are resized to (84, 84, 3) before being fed into the network. We set the latent state dimension to 128 and use a one-layer LSTM with 128 hidden units to model environment dynamics, so the input dimension of the policy network is 256. The policy network is a two-layer MLP of size 256 − 32 − n_a, where n_a is the action space dimension.
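The PyTorch sketch below instantiates the encoder and policy dimensions described above. The padding choices, the separate mean/log-variance heads, the ReLU in the policy MLP, and the one-hot action encoding fed to the LSTM are our assumptions; the layer widths, filters, and strides follow the text.

import torch
import torch.nn as nn

LATENT_DIM, HIDDEN_DIM, N_ACTIONS = 128, 128, 18   # 18 actions assumed for illustration

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=1), nn.Tanh(),  # last conv layer uses tanh
            nn.Flatten(),
        )
        flat = 256 * 15 * 15                       # spatial size for 84x84 inputs with the strides above
        self.mu = nn.Linear(flat, LATENT_DIM)      # mean head
        self.logvar = nn.Linear(flat, LATENT_DIM)  # log-variance head (assumed parameterization)

    def forward(self, frame):                      # frame: (batch, 3, 84, 84)
        h = self.conv(frame)
        return self.mu(h), self.logvar(h)

# One-layer forward model over the latent state plus a one-hot action embedding (assumed encoding)
forward_model = nn.LSTM(input_size=LATENT_DIM + N_ACTIONS, hidden_size=HIDDEN_DIM)

class Policy(nn.Module):
    # Two-layer 256-32-n_a controller on the concatenation [z_t, h_{t-1}]
    def __init__(self, n_actions=N_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + HIDDEN_DIM, 32), nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, z_t, h_prev):
        # deterministic policy: pick the argmax action (no gradients needed under CMA-ES)
        return self.net(torch.cat([z_t, h_prev], dim=-1)).argmax(dim=-1)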


The training of the VAE and RNN is triggered every 8 evolution steps. Each child policy is evaluated for 16 episodes, and the average episode reward is returned to the CMA-ES optimizer for its update. To visualize the training process of the policy network, we make a boxplot of the average episode reward distribution on Frostbite, shown in Figure 2. The maximum reward among policy networks increases significantly right after VAE and RNN training is triggered at the 8th and 16th generations. This agrees with our assumption that the VAE and RNN encode spatial-temporal information from past experience and provide a good feature representation for the policy to evolve in. It is worth noting that for some games, e.g. Private Eye, policy performance increases significantly even without updating the VAE and RNN; this observation agrees with Blundell et al. (2016), where features are extracted from a random matrix projection.

Figure 2: Boxplot of the training process on Frostbite. The maximum reward among policy networks increases significantly after VAE and RNN training is triggered at the 8th and 16th generations.

Figure 3: Boxplot of the training process on Atlantis. The maximum reward among policy networks drops significantly after VAE and RNN training is triggered at the 8th generation.

However, we also notice that for some games, e.g. Atlantis shown in Figure 3, policy performance drops significantly right after the VAE and RNN are updated. We find that the cause may be the poor encoding performance of the VAE. As shown in Figure 4, which compares the VAE's encoding-decoding on Frostbite and Atlantis, the decoded image can ignore tiny moving objects and leave only a static background; this issue occurs especially in rare frames far from the episode start. We found a similar phenomenon in games like Kangaroo, Gravitar, and Breakout.

Experimental results are summarized in Table 1. SEDN surpasses ES, DQN, and A3C in 7 of 50 Atari games while using only 10M frames, whereas DQN uses 200M frames and ES and A3C use 1B frames. Due to time and resource limitations, we did not tune SEDN's hyperparameters. We believe better encoding techniques addressing the above issues will further boost SEDN's performance; we leave this as an open direction for future work.

4 DISCUSSION

Whereas traditional reinforcement learning algorithms need to balance exploration against exploitation (Sutton et al., 1998), and even add perturbations to parameters to ensure good exploration (Plappert et al., 2017), evolution strategies drive exploration directly by sampling from the optimizer's distribution parameters during evolution (Salimans et al., 2017): exploration and exploitation are integrated. This is what makes ES so efficient in some low dimensional control tasks like CartPole. By encoding spatial-temporal information from a high dimensional pixel space into a low dimensional latent space, a challenging RL task can thus be converted into a simpler task where evolution strategies have advantages over deep reinforcement learning methods.


(a) Frostbite #1 (b) Frostbite #16 (c) Atlantis #1 (d) Atlantis #16

Figure 4: Comparison of the VAE's encoding-decoding on Frostbite and Atlantis. Upper: original image; lower: image decoded from the latent vector. For Frostbite, the decoded image agrees with the original at the 16th frame; for Atlantis, the decoder does not recover the small objects at the 16th frame.

Similar to what Salimans et al. (2017) report, ES does not perform well on some easy games like Enduro and Breakout. One possible reason is that the policy ES converges to stays close to its random initialization because of evolution path control, whereas the optimal policy for dense-reward games like Breakout lies far from the random initialization. Another reason is that games with long episodic time horizons, like Enduro, are very inefficient for ES to solve: given a limited number of frames, longer episodes mean fewer episodes, hence fewer and slower updates. In addition, episodic rewards ignore the temporal structure of rewards across time steps. This issue is critical in games like Enduro: Enduro is a car racing game in which the agent gains reward for passing other cars and is penalized when passed, and the final ranking counts as the episodic reward only after a long episode. A random policy therefore always obtains zero episodic reward, and ES learns nothing from such games. These two issues limit the applicability of vanilla ES and its variants and remain to be resolved in future work. A possible direction is a mixture of ES and RL, e.g. using ES as a warm start for an RL policy.

5 CONCLUSION

Our work introduces a fast and efficient evolution algorithm that trains deep neural networks to play Atari games. Our experiments show that encoding spatial-temporal information from a high dimensional pixel space into a low dimensional latent space makes ES fast, effective, and efficient: SEDN outperforms DQN, A3C, and ES on several Atari games within 30 evolution steps and 10M frames of experience. We conclude that SEDN is a competitive approach to solving challenging RL tasks. In future work, we plan to address the issues discussed in Section 4 by applying state-of-the-art computer vision techniques such as self-attention maps for better information encoding; we would also like to combine information encoding with RL to understand feature importance for RL.


Table 1: Results obtained by SEDN on Atari games over 16 re-runs with up to 30 random initial no-ops, compared with DQN (Mnih et al., 2013), A3C (Mnih et al., 2016), and ES (Salimans et al., 2017); the DQN, A3C, and ES data are taken from Salimans et al. (2017). The SEDN entry for Enduro is missing due to its long-episode problem discussed in Section 4.

Game                 DQN        A3C, FF    ES, FF     SEDN, Ours
Frames               200 M      1 B        1 B        10 M
Time                 7-10 day   4 day      4 hour     4 hour
Amidar               133.4      283.9      112        33.4
Assault              3332.3     3746.1     1673.9     175.9
Asterix              124.5      6723       1440       387.5
Asteroids            697.1      3009.4     1562       519.4
Atlantis             76108      772392     1267410    5237.5
Bank Heist           176.3      946        225        13.1
Battle Zone          17560      11340      16600      3625
Beam Rider           8672.4     13235.9    744        332.8
Berzerk              -          1433.4     686        224.2
Bowling              41.2       36.2       30         97.6
Boxing               25.8       33.7       49.8       55.3
Breakout             303.9      551.6      9.5        8.8
Centipede            3773.1     3306.5     7783.9     3586.9
Chopper Command      3046       4669       3710       3718.8
Crazy Climber        50992      101624     26430      23662.5
Demon Attack         12835.2    84997.5    1166.5     1812.4
Double Dunk          21.6       0.1        0.2        -8.8
Enduro               475.6      82.2       95         -
Fishing Derby        2.3        13.6       49         -75.4
Freeway              25.8       0.1        31         23.8
Frostbite            157.4      180.1      370        387.5
Gopher               2731.8     8442.8     582        242.5
Gravitar             216.5      269.5      805        131.2
Ice Hockey           3.8        4.7        4.1        -2.9
Kangaroo             2696       106        11200      262.5
Krull                3864       8066.6     8647.2     1238.2
Montezumas Revenge   50         53         0          0
Name This Game       5439.9     5614       4503       5403.1
Phoenix              -          28181.8    4041       1258.1
Pit Fall             -          123        0          0
Pong                 16.2       11.4       21         21
Private Eye          298.2      194.4      100        6674.6
Q*Bert               4589.8     13752.3    147.5      160.9
River Raid           4065.3     10001.2    5009       531
Road Runner          9264       31769      16590      26430.8
Robotank             58.5       2.3        11.9       4
Seaquest             2793.9     2300.2     1390       128.8
Skiing               -          13700      15442.5    16948.4
Solaris              -          1884.8     2090       2832.5
Space Invaders       1449.7     2214.7     678.5      148.8
Star Gunner          34081      64393      1470       337.5
Tennis               2.3        10.2       4.5        -2.1
Time Pilot           5640       5825       4970       1518.8
Tutankham            32.4       26.1       130.3      12.5
Up and Down          3311.3     54525.4    67974      8031.8
Venture              54         19         760        12.5
Video Pinball        20228.1    185852.6   22834.8    23058.4
Wizard of Wor        246        5278       3480       606.2
Yars Revenge         -          7270.8     16401.7    2953.7
Zaxxon               831        2659       6380       262.5

Algorithm 3 Complete Training Pipeline

Input: VAE encoder $\phi_{enc} : x_t \to z_t$, VAE decoder $\phi_{dec} : z_t \to x_t$, RNN $f : (z_t, a_t) \to z_{t+1}$, policy $p : (z_t, h_{t-1}) \to a_t$, and CMA-ES optimizer $g$
1: for each iteration do
2:   training data buffer $X = \{\}$
     // Train policy network and collect data
3:   for each evolution step do
4:     sample $\lambda$ children $\{c_1, \dots, c_\lambda\}$ from CMA-ES optimizer $g$
5:     for child $i$ from 1 to $\lambda$ do
6:       construct policy $p_i$ from $c_i$
7:       for each episode $j$ from 1 to $N$ do
8:         sample episode data $Y_{ij} = (s_0, a_0, \dots, s_T, a_T)$ and episode reward $R_{ij}$ with VAE, RNN, and policy $\phi_{enc}, f, p_i$
9:         append episode data to training data: $X \leftarrow X \cup \{Y_{ij}\}$
10:      end for
11:      calculate average return for child $i$: $\bar{R}_i = \sum_{j=1}^{N} R_{ij} / N$
12:    end for
13:    update CMA-ES optimizer: $g \leftarrow g(c_1, \bar{R}_1, c_2, \bar{R}_2, \dots, c_\lambda, \bar{R}_\lambda)$
14:  end for
     // Train VAE and extract frames
15:  training image frames $I = \{\}$
16:  for each $Y$ in $X$ do
17:    $(s_1, a_1, \dots, s_T, a_T) \leftarrow Y$
18:    $I \leftarrow I \cup \{s_1, s_2, \dots, s_T\}$
19:  end for
20:  train and update VAE $\phi_{enc}, \phi_{dec}$ with $I$
21:  extracted episode data $S = \{\}$
22:  for each $Y$ in $X$ do
23:    $(s_1, a_1, \dots, s_T, a_T) \leftarrow Y$
24:    $S \leftarrow S \cup \{\phi_{enc}(s_1), a_1, \dots, \phi_{enc}(s_T), a_T\}$
25:  end for
     // Train RNN
26:  for each $Y$ in $S$ do
27:    $(\phi_{enc}(s_1), a_1, \dots, \phi_{enc}(s_T), a_T) \leftarrow Y$
28:    $\mu_{1:T}, \sigma_{1:T} = \phi_{enc}(s_{1:T})$, $z_{1:T} \sim \mathcal{N}(\mu_{1:T}, \sigma_{1:T})$
29:    $\hat{\mu}_{2:T}, \hat{\sigma}_{2:T} = f(z_{1:T-1}, a_{1:T-1})$
30:    calculate KL-divergence loss $L = \mathrm{KL}[\mathcal{N}(\mu(\hat{s}_{t+1}), \sigma(\hat{s}_{t+1})) \,\|\, \mathcal{N}(\mu(s_{t+1}), \sigma(s_{t+1}))]$
31:    backpropagate the KL loss and update RNN $f$
32:  end for
33: end for
34: Construct final policy $p_f$ from the CMA-ES optimizer's favorite solution
35: return $\phi_{dec}, \phi_{enc}, f, p_f$


REFERENCES

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.

Nikolaus Hansen. pycma. https://github.com/CMA-ES/pycma, 2018.

Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.

Verena Heidrich-Meisner and Christian Igel. Evolution strategies for direct policy search. In International Conference on Parallel Problem Solving from Nature, pp. 428–437. Springer, 2008.

Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.

OpenAI. OpenAI Five, Jul 2018. URL https://blog.openai.com/openai-five/.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In International Conference on Learning Representations, 2018.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Luca Scarabello and Landon T. Clipp. mpi-master-slave. https://github.com/luca-s/mpi-master-slave, 2018.


Jürgen Schmidhuber and Rudolf Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02):125–134, 1991.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, 1998.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, and Sergey Levine. Collective robot reinforcement learning with distributed asynchronous guided policy search. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 79–86. IEEE, 2017.
