Arxiv:2006.06193V3 [Cs.LG] 10 Dec 2020 Exploration? to Address This Question, Several Settings Re- Ofﬂine to Elicit Desired Behavior

Exploration by Maximizing Renyi´ Entropy for Reward-Free RL Framework Chuheng Zhang* Yuanying Cai* Longbo Huang Jian Li Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University [email protected], [email protected], [email protected], [email protected] Abstract may need more new samples in some new task, and sometimes less), it is less clear how to evaluate the effectiveness Exploration is essential for reinforcement learning (RL). To face the challenges of exploration, we consider a reward-free of the exploration in the meta-training phase. Another set- RL framework that completely separates exploration from ting is to learn a policy in a reward-free environment and exploitation and brings new challenges for exploration al- test the policy under the task with a specific reward function gorithms. In the exploration phase, the agent learns an ex- (such as the score in Montezuma’s Revenge) without fur- ploratory policy by interacting with a reward-free environ- ther training with the task (Burda et al. 2019b; Ecoffet et al. ment and collects a dataset of transitions by executing the 2019; Burda et al. 2019a). However, there is no guarantee policy. In the planning phase, the agent computes a good pol- that the algorithm has fully explored the transition dynamics icy for any reward function based on the dataset without fur- of the environment unless we test the learned model for arbi- ther interacting with the environment. This framework is suit- trary reward functions. Recently, Jin et al. (2020) proposed able for the meta RL setting where there are many reward reward-free RL framework that fully decouples exploration functions of interest. In the exploration phase, we propose to maximize the Renyi´ entropy over the state-action space and and exploitation. Further, they designed a provably efficient justify this objective theoretically. The success of using Renyi´ algorithm that conducts a finite number of steps of reward- entropy as the objective results from its encouragement to ex- free exploration and returns near-optimal policies for arbi- plore the hard-to-reach state-actions. We further deduce a pol- trary reward functions. However, their algorithm is designed icy gradient formulation for this objective and design a prac- for the tabular case and can hardly be extended to continu- tical exploration algorithm that can deal with complex envi- ous or high-dimensional state spaces since they construct a ronments. In the planning phase, we solve for good policies policy for each state. given arbitrary reward functions using a batch RL algorithm. The reward-free RL framework is as follows: First, a set Empirically, we show that our exploration algorithm is effec- tive and sample efficient, and results in superior policies for of policies are trained to explore the dynamics of the reward- arbitrary reward functions in the planning phase. free environment in the exploration phase (i.e., the meta- training phase). Then, a dataset of trajectories is collected by executing the learned exploratory policies. In the plan- 1 Introduction ning phase (i.e., the meta-testing phase), an arbitrary re- The trade-off between exploration and exploitation is at the ward function is specified and a batch RL algorithm (Lange, core of reinforcement learning (RL). Designing efficient ex- Gabel, and Riedmiller 2012; Fujimoto, Meger, and Precup ploration algorithm, while being a highly nontrivial task, 2019) is applied to solve for a good policy solely based on is essential to the success in many RL tasks (Burda et al. the dataset, without further interaction with the environment. 2019b; Ecoffet et al. 2019). Hence, it is natural to ask the This framework is suitable to the scenarios when there are following high-level question: What can we achieve by pure many reward functions of interest or the reward is designed arXiv:2006.06193v3 [cs.LG] 10 Dec 2020 exploration? To address this question, several settings re- offline to elicit desired behavior. The key to success in this lated to meta reinforcement learning (meta RL) have been framework is to obtain a dataset with good coverage over all proposed (see e.g., Wang et al. 2016; Duan et al. 2016; Finn, possible situations in the environment with as few samples Abbeel, and Levine 2017). One common setting in meta RL as possible, which in turn requires the exploratory policy to is to learn a model in a reward-free environment in the meta- fully explore the environment. training phase, and use the learned model as the initialization Several methods that encourage various forms of ex- to fast adapt for new tasks in the meta-testing phase (Eysen- ploration have been developed in the reinforcement learn- bach et al. 2019; Gupta et al. 2018; Nagabandi et al. 2019). ing literature. The maximum entropy framework (Haarnoja Since the agent still needs to explore the environment un- et al. 2017) maximizes the cumulative reward in the mean- der the new tasks in the meta-testing phase (sometimes it while maximizing the entropy over the action space con- *These authors contributed equally to this work. ditioned on each state. This framework results in several Copyright © 2021, Association for the Advancement of Artificial efficient and robust algorithms, such as soft Q-learning Intelligence (www.aaai.org). All rights reserved. (Schulman, Chen, and Abbeel 2017; Nachum et al. 2017), SAC (Haarnoja et al. 2018) and MPO (Abdolmaleki et al. To efficiently explore under this framework, we propose 2018). On the other hand, maximizing the state space cov- a novel objective that maximizes the Renyi´ entropy over erage may results in better exploration. Various kinds of the state-action space in the exploration phase and justify objectives/regularizations are used for better exploration this objective theoretically. of the state space, including information-theoretic metrics • (Section 4) We propose a practical algorithm based on (Houthooft et al. 2016; Mohamed and Rezende 2015; Eysen- a derived policy gradient formulation to maximize the bach et al. 2019) (especially the entropy of the state space, Renyi´ entropy over the state-action space for the reward- e.g., Hazan et al. (2019); Islam et al. (2019)), the prediction free RL framework. error of a dynamical model (Burda et al. 2019a; Pathak et al. 2017; de Abril and Kanai 2018), the state visitation count • (Section 5) We conduct a wide range of experiments and (Burda et al. 2019b; Bellemare et al. 2016; Ostrovski et al. the results indicate that our algorithm is efficient and ro- 2017) or others heuristic signals such as novelty (Lehman bust in the exploration phase and results in superior per- and Stanley 2011; Conti et al. 2018), surprise (Achiam and formance in the downstream planning phase for arbitrary Sastry 2016) or curiosity (Schmidhuber 1991). reward functions. In practice, it is desirable to obtain a single exploratory policy instead of a set of policies whose cardinality may be 2 Preliminary as large as that of the state space (e.g., Jin et al. 2020; Misra A reward-free environment can be formulated as a con- et al. 2019). Our exploration algorithm outputs a single ex- trolled Markov process (CMP) 1 (S; A; P; µ, γ) which spec- ploratory policy for the reward-free RL framework by max- ifies the state space S with S = jSj, the action space A imizing the Renyi´ entropy over the state-action space in the with A = jAj, the transition dynamics P, the initial state exploration phase. In particular, we demonstrate the advan- distribution µ 2 ∆S and the discount factor γ. A (station- tage of using state-action space, instead of the state space via A ary) policy πθ : S! ∆ parameterized by θ specifies a very simple example (see Section 3 and Figure 1). More- the probability of choosing the actions on each state. The over, Renyi´ entropy generalizes a family of entropies, in- stationary discounted state visitation distribution (or sim- cluding the commonly used Shannon entropy. We justify the ply the state distribution) under the policy π is defined as use of Renyi´ entropy as the objective function theoretically π P1 t dµ(s) := (1−γ) t=0 γ Pr(st = sjs0 ∼ µ; π). The station- by providing an upper bound on the number of samples in ary discounted state-action visitation distribution (or simply the dataset to ensure that a near-optimal policy is obtained the state-action distribution) under the policy π is defined as for any reward function in the planning phase. π P1 t dµ(s; a) := (1 − γ) t=0 γ Pr(st = s; at = ajs0 ∼ µ; π). Further, we derive a gradient ascent update rule for max- π Unless otherwise stated, we use dµ to denote the state-action imizing the Renyi´ entropy over the state-action space. The π distribution. We also use ds ;a to denote the distribution derived update rule is similar to the vanilla policy gradi- 0 0 started from the state-action pair (s0; a0) instead of the ini- ent update with the reward replaced by a function of the tial state distribution µ. discounted stationary state-action distribution of the current When a reward function r : S × A ! R is speci- policy. We use variational autoencoder (VAE) (Kingma and fied, CMP becomes the Markov decision process (MDP) Welling 2014) as the density model to estimate the distri- (Sutton and Barto 2018) (S; A; ; r; µ, γ). The objective for bution. The corresponding reward changes over iterations P MDP is to find a policy πθ that maximizes the expected which makes it hard to accurately estimate a value func- πθ cumulative reward J(θ; r) := Es0∼µ[V (s0; r)], where tion under the current reward.

Load more