
Deep Hierarchical Reinforcement Learning in Partially Observable Markov Decision Processes

Le Pham Tuyen*, Ngo Anh Vien†, Md. Abu Layek*, TaeChoong Chung*
*Artificial Intelligence Lab, Computer Science and Engineering Department, Kyung Hee University, Yongin, Gyeonggi 446-701, South Korea
{tuyenple, layek, tcchung}@khu.ac.kr
†EEECS/ECIT, Queen's University Belfast, Belfast, UK
[email protected]

Abstract—In recent years, reinforcement learning has achieved many remarkable successes due to the growing adoption of deep learning techniques and the rapid growth in computing power. Nevertheless, it is well known that flat reinforcement learning algorithms are often not able to learn well and data-efficiently in tasks having hierarchical structures, e.g. consisting of multiple subtasks. Hierarchical reinforcement learning is a principled approach that is able to tackle these challenging tasks. On the other hand, many real-world tasks usually have only partial observability, in which state measurements are often imperfect and partially observable. The problems of RL in such settings can be formulated as a partially observable Markov decision process (POMDP). In this paper, we study hierarchical RL in POMDP, in which the tasks have only partial observability and possess hierarchical properties. We propose a hierarchical deep reinforcement learning approach for learning in hierarchical POMDP. The deep hierarchical RL algorithm is proposed to apply to both MDP and POMDP learning. We evaluate the proposed algorithm on various challenging hierarchical POMDP.

Index Terms—Hierarchical Deep Reinforcement Learning, Partially Observable MDP (POMDP), Semi-MDP, Partially Observable Semi-MDP (POSMDP)

*Corresponding author.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

Reinforcement Learning (RL) [1] is a subfield of machine learning focused on learning a policy in order to maximize the total cumulative reward in an unknown environment. RL has been applied to many areas such as robotics [2][3][4][5], finance [6][7][8], computer games [9][10][11] and other applications [12][13][14]. RL is divided into two approaches: the value-based approach and the policy-based approach [15]. A typical value-based approach tries to obtain an optimal policy by finding optimal value functions. The value functions are updated using the immediate reward and the discounted value of the next state. Some methods based on this approach are Q-learning, SARSA, and TD-learning [1]. In contrast, the policy-based approach directly learns a parameterized policy that maximizes the cumulative discounted reward. Some techniques used for searching for optimal parameters of the policy are policy gradient [16], expectation maximization [17], and evolutionary algorithms [18]. In addition, a hybrid of the value-based approach and the policy-based approach is called actor-critic [19]. Recently, RL algorithms integrated with a deep neural network (DNN) [10] achieved good performance, even better than human performance, at some tasks such as Atari games [10] and the game of Go [11]. However, obtaining good performance on a task consisting of multiple subtasks, such as Mazebase [20] and Montezuma's Revenge [9], is still a major challenge for flat RL algorithms. Hierarchical reinforcement learning (HRL) [21] was developed to tackle these domains. HRL decomposes a RL problem into a hierarchy of subproblems (subtasks), each of which can itself be a RL problem. Identical subtasks can be gathered into one group and controlled by the same policy. As a result, hierarchical decomposition represents the problem in a compact form and reduces the computational complexity. Various approaches to decomposing the RL problem have been proposed, such as options [21], HAMs [22][23], MAXQ [24], Bayesian HRL [25, 26, 27] and some other advanced techniques [28]. All approaches commonly consider the semi-Markov decision process [21] as a base theory. Recently, many studies have combined HRL with deep neural networks in different ways to improve performance on hierarchical tasks [29][30][31][32][33][34][35]. Bacon et al. [30] proposed an option-critic architecture, which has a fixed number of intra-options, each of which is followed by a "deep" policy. At each time step, only one option is activated and is selected by another policy that is called the "policy over options".

DeepMind [33] also proposed a deep hierarchical framework inspired by a classical framework called feudal reinforcement learning [36]. Similarly, Kulkarni et al. [37] proposed a deep hierarchical framework of two levels in which the high-level controller produces a subgoal and the low-level controller performs primitive actions to obtain the subgoal. This framework is useful for solving problems with multiple subgoals, such as Montezuma's Revenge [9] and games in Mazebase [20]. Other studies have tried to tackle more challenging problems in HRL, such as discovering subgoals [38] and adaptively finding the number of options [39].

Though many studies have made great efforts on this topic, most of them rely on an assumption of full observability,

where a learning agent can observe the environment states fully. In other words, the environment is represented as a Markov decision process (MDP). This assumption does not reflect the nature of real-world applications, in which the agent only observes partial information about the environment states. Therefore, the environment in this case is represented as a POMDP. In order for the agent to learn in such a POMDP environment, more advanced techniques are required in order to have a good prediction over environment responses, such as maintaining a belief distribution over unobservable states or, alternatively, using a recurrent neural network (RNN) [40] to summarize an observable history. Recently, there have been some studies using deep RNNs to deal with learning in POMDP environments [41][42].

In this paper, we develop a deep HRL algorithm for POMDP problems inspired by the deep HRL framework in [37]. The agent in this framework makes decisions through a hierarchical policy of two levels. The top policy determines the subgoal to be achieved, while the lower-level policy performs primitive actions to achieve the selected subgoal. To learn in POMDP, we represent all policies using RNNs. The RNN in the lower-level policies constructs an internal state to remember the whole states observed during the course of interaction with the environment. The top policy is a RNN that encodes a sequence of observations during the execution of a selected subgoal.

We highlight our contributions as follows. First, we exploit the advantages of RNNs to integrate with hierarchical RL in order to handle learning on challenging and complex tasks in POMDP. Second, this integration leads to a new hierarchical POMDP learning framework that is more scalable and data-efficient.

The rest of the paper is organized as follows. Section II reviews the underlying knowledge such as the semi-Markov decision process, the partially observable Markov decision process and deep reinforcement learning. Our contributions are described in Section III, which consists of two parts. The deep model part describes all elements in our algorithm, and the learning algorithm part summarizes the algorithm in a generalized form. The usefulness of the proposed algorithms is demonstrated through POMDP benchmarks in Section IV. Finally, the conclusion is given in Section V.

II. BACKGROUND

In this section, we briefly review all underlying theories that the content of this paper relies on: the Markov decision process, the semi-Markov decision process for hierarchical RL, the partially observable Markov decision process for RL, and deep reinforcement learning.

A. Markov Decision Process

A discrete MDP models a sequence of decisions of a learning agent interacting with an environment at some discrete time scale, t = 1, 2, .... Formally, an MDP consists of a tuple of five elements {S, A, P, r, γ}, where S is a discrete state space, A is a discrete action space, P(s_{t+1}, s_t, a_t) = p(s_{t+1}|s_t, a_t) is a transition function that measures the probability of obtaining a next state s_{t+1} given a current state-action pair (s_t, a_t), r(s_t, a_t) defines an immediate reward achieved at each state-action pair, and γ ∈ (0, 1) denotes a discount factor. The MDP relies on the Markov property that the next state only depends on the current state-action pair:

p(s_{t+1} | {s_1, a_1, s_2, a_2, ..., s_t, a_t}) = p(s_{t+1} | s_t, a_t).

A policy of a RL algorithm is denoted by π, which is the probability of taking an action a given a state s: π = P(a|s). The goal of a RL algorithm is to find an optimal policy π* in order to maximize the expected total discounted reward

J(π) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ].   (1)

B. Semi-Markov Decision Process for Hierarchical Reinforcement Learning

Learning over different levels of policy is the main challenge for hierarchical tasks. The semi-Markov decision process (SMDP) [21], which is an extension of the MDP, was developed to deal with this challenge. In this theory, the concept of "options" is introduced as a type of temporal abstraction. An "option" ξ ∈ Ξ is defined by three elements: an option's policy π^ξ, a termination condition β^ξ, and an initiation set I ⊆ S, denoting the set of states in the option. In addition, a policy over options μ(ξ|s) is introduced to select options. An option ξ is executed as follows. First, when an option's policy π^ξ is taken, the action a_t is selected based on π^ξ. The environment then transitions to state s_{t+1}. The option either terminates or continues according to a termination probability β^ξ(s_{t+1}). When the option terminates, the agent can select a next option based on μ(ξ|s_{t+1}). The total discounted reward received by executing option ξ is defined as

R_s^ξ = E[ Σ_{i=t}^{t+k-1} γ^{i-t} r^ξ(s_i, a_i) ].   (2)

The multi-time state transition model of option ξ [43], which is initiated in state s and terminates at state s' after k steps, has the formula

P_{ss'}^ξ = Σ_{k=1}^{∞} P^ξ(s', k | s, ξ) γ^k,   (3)

where P^ξ(s', k|s, ξ) is the joint probability of ending up at s' after k steps when taking option ξ at state s. Given this, we can write the Bellman equation for the value function of a policy μ over options:

V^μ(s) = Σ_{ξ∈Ξ} μ(ξ|s) [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V^μ(s') ]   (4)

and the option-value function:

Q^μ(s, ξ) = R_s^ξ + Σ_{s'} P_{ss'}^ξ Σ_{ξ'∈Ξ} μ(ξ'|s') Q^μ(s', ξ').   (5)

Similarly, the corresponding optimal Bellman equations are as follows:

V^*(s) = max_{ξ∈Ξ} [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V^*(s') ]   (6)

Q^*(s, ξ) = R_s^ξ + Σ_{s'} P_{ss'}^ξ max_{ξ'∈Ξ} Q^*(s', ξ').   (7)

The optimal Bellman equation can be computed using synchronous value iteration (SVI) [21], which iterates the following step for every state:

V(s) = max_{ξ∈Ξ} [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V(s') ].   (8)

When the option model is unknown, Q(s, ξ) can be estimated using a Q-learning algorithm with the update

Q(s, ξ) ← Q(s, ξ) + α [ r(s, ξ(s)) + γ^k max_{ξ'∈Ξ} Q(s', ξ') − Q(s, ξ) ],   (9)

where α denotes the learning rate, k denotes the number of time steps elapsing between s and s', and r denotes an immediate reward if ξ(s) is a primitive action; otherwise r is the total reward collected while executing option ξ(s).

C. Partially Observable Markov Decision Process in Reinforcement Learning

In many real-world tasks, the agent might not have full observability over the environment. Those tasks can in principle be formulated as a POMDP, which is defined as a tuple of six components {S, A, P, r, Ω, Z}, where S, A, P, r are the state space, action space, transition function and reward function, respectively, as in a Markov decision process. Ω and Z are the observation space and the observation model, respectively. If the POMDP model is known, the optimal approach is to maintain a hidden state b_t called the belief state. The belief defines the probability of being in state s according to the history of actions and observations. Given a new action and observation, belief updates are performed using the Bayes rule [44] and defined as follows:

b'(s') ∝ Z(s', a, o) Σ_{s∈S} P(s, a, s') b(s).   (10)

However, exact belief updates require a high computational cost and expensive memory [40]. Another approach is to use a Q-learning agent with function approximation, which uses the Q-learning algorithm for updating the policy. Because updating the Q-value from an observation can be less accurate than estimating it from a complete state, a better way would be for a POMDP agent using Q-learning to use the last k observations as input to the policy. Nevertheless, the problem with using a finite number of observations is that key-event observations far in the past would be neglected in future decisions. For this reason, a RNN is used to maintain a long-term state, as in [41]. Our model, which uses RNNs at different levels of the hierarchical policies, is expected to take advantage of this in POMDP environments.

D. Deep Reinforcement Learning

Recent advances in deep learning [45] are widely applied to reinforcement learning to form deep reinforcement learning. A few years ago, reinforcement learning still used "shallow" policies such as tabular, linear, radial basis networks, or neural networks of a few layers. Shallow policies have many limitations, especially in representing highly complex behaviors and in the computational cost of updating parameters. In contrast, deep neural networks in deep reinforcement learning can extract more information from the raw inputs by pushing them through multiple neural layers such as multilayer perceptron (MLP) layers and convolutional (CONV) layers. Multiple layers in DNNs can have a lot of parameters, allowing them to represent highly non-linear problems. The Deep Q-Network (DQN) was proposed recently by Google DeepMind [10]; it opened a new era of deep reinforcement learning and influenced most later studies in deep reinforcement learning. In terms of architecture, a Q-network parameterized by θ, e.g., Q(s, a|θ), is built on a convolutional neural network (CNN) which receives an input of four images of size 84 × 84 that is processed by three hidden CONV layers. The final hidden layer is a MLP layer with 512 rectifier units. The output layer is a MLP layer which has a number of outputs equal to the number of actions. In terms of the learning algorithm, DQN learns Q-value functions iteratively by updating the Q-value estimation via the temporal difference error:

Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) ).   (11)

In addition, the stability of DQN also relies on two mechanisms. The first mechanism is the experience replay memory, which stores transition data in the form of tuples {s, a, s', r}. It allows the agent to uniformly sample from and train on previous data (off-policy) in batches, thus reducing the variance of learning updates and breaking the temporal correlation among data samples. The second mechanism is the target Q-network, which is parameterized by θ', e.g., Q̂(s, a|θ'), and is a copy of the main Q-network. The target Q-network is used to estimate the loss function as follows:

L = E[ (y − Q(s, a|θ))^2 ],   (12)

where y = r + γ max_{a'} Q̂(s', a'|θ'). Initially, the parameters of the target Q-network are the same as those of the main Q-network. However, during the learning iterations, they are only updated at a specified time step. This update rule causes the target Q-network to decouple from the main Q-network and improves the stability of the learning process.

Many other models based on DQN have been developed, such as Double DQN [46], Dueling DQN [47], and Prioritized Experience Replay [48]. On the other hand, deep neural networks can also be integrated into methods other than estimating Q-values, such as representing the policy in policy search algorithms [5], estimating the advantage function [49], or mixing an actor network and a critic network [50].

Recently, researchers have used RNNs in reinforcement learning to deal with POMDP domains. RNNs have been successfully applied to domains such as natural language processing and speech recognition, and are expected to be advantageous in the POMDP domain, which needs to process a sequence of observations rather than a single input. Our proposed method uses RNNs not only for solving the POMDP domain but also for solving these domains in a hierarchical form of reinforcement learning.
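For a small discrete POMDP, the exact belief update of Eq. (10) takes only a few lines. The sketch below is a minimal illustration under assumed tabular arrays Z (observation model) and P (transition model); the array names and shapes are our own choice for the example and are not part of the paper.

import numpy as np

def belief_update(b, a, o, P, Z):
    """Exact Bayes belief update, Eq. (10): b'(s') ∝ Z(s',a,o) * sum_s P(s,a,s') b(s).

    b: belief over states, shape (S,)
    P: transition model, P[s, a, s'] = p(s'|s,a), shape (S, A, S)
    Z: observation model, Z[s', a, o] = p(o|s',a), shape (S, A, O)
    """
    predicted = b @ P[:, a, :]                 # sum_s P(s,a,s') b(s), shape (S,)
    unnormalized = Z[:, a, o] * predicted
    return unnormalized / unnormalized.sum()   # normalize so the belief sums to one

As the section notes, this exact update scales poorly with the size of the state space, which is one motivation for replacing the belief with a recurrent hidden state in the method proposed below.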
III. HIERARCHICAL DEEP RECURRENT Q-LEARNING IN PARTIALLY OBSERVABLE SEMI-MARKOV DECISION PROCESS

In this section, we describe the hierarchical deep recurrent Q-learning algorithm (hDRQN), our proposed framework. First, we describe the model of hDRQN and explain the way to learn this model. Second, we describe some components of our algorithm such as sampling strategies, subgoal definition, and extrinsic and intrinsic reward functions. We rely on a partially observable semi-Markov decision process (POSMDP) setting, where the agent follows a hierarchical policy to solve POMDP domains.

The POSMDP setting [43, 26] is described as follows. The domain is decomposed into multiple subdomains. Each subdomain is equivalent to an option ξ in the SMDP framework and has a subgoal g ∈ Ω that needs to be achieved before switching to another option. Within an option, the agent observes a partial state o ∈ Ω, takes an action a ∈ A, receives an extrinsic reward r^ex ∈ R, and transits to another state s' (but the agent only observes a part of the state, o' ∈ Ω). The agent executes option ξ until it terminates (either the subgoal g is obtained or the termination signal is raised). Afterward, the agent selects and executes another option. The sequence of options is repeated until the agent reaches the final goal. In addition, for obtaining subgoal g in each option, an intrinsic reward function r^in ∈ R is maintained. While the objective of a subdomain is to maximize the cumulative discounted intrinsic reward, the objective of the whole domain is to maximize the cumulative discounted extrinsic reward. Specifically, the belief update given a taken option ξ, an observation o and the current belief b is defined as

b'(s') ∝ Σ_{k=1}^{∞} γ^{k-1} Σ_{s} P(s', o, k | s, ξ) b(s),

where P(s', o, k|s, ξ) is a joint transition and observation function of the underlying POSMDP model of the environment.

We adopt notation similar to that of the two frameworks MAXQ [24] and Options [21] to describe our problem. We denote by Q^M(o_t, ξ|θ^M) the Q-value function of the meta-controller at state o_t, θ^M (in which we use a RNN to encode the past observations into its weights θ^M) and option (macro-action) ξ (assuming ξ has a corresponding subgoal g_t). We note that the pair {o_t, θ^M} represents the belief or history at time t, b_t (we will use them interchangeably: Q^M(b, ξ) or Q^M(o, ξ|θ^M)). The multi-time observation model of option ξ [43], which is initiated in belief b and observes o, has the following formula:

P(o|b, ξ) = Σ_{k=1}^{∞} γ^k Σ_{s'} Σ_{s} P(s', o, k | s, ξ) b(s).   (13)

Given the above, we can write the Bellman equation for the value function of the meta-controller policy π over options as

V^M(b) = Σ_{ξ} π(ξ|b) [ R_b^ξ + Σ_{o'} P(o'|b, ξ) V^M(b') ]   (14)

and the option-value function

Q^M(b, ξ) = R_b^ξ + Σ_{o'} P(o'|b, ξ) Σ_{ξ'∈Ξ} μ(ξ'|b') Q^M(b', ξ').   (15)

Similarly to the MAXQ framework, the reward term R_b^ξ is the total reward collected by executing the option ξ and is defined as V^ξ(b). Its corresponding Q-value function is defined as Q^S(b, a) (the value function for sub-controllers). When using a RNN, we also denote Q^S(b, a) as Q^S({o_t, g_t}, a|θ^S), in which θ^S are the weights of the sub-controller network that encodes previous observations, and {o_t, g_t} denotes the observation input to the sub-controller.

Our hierarchical frameworks are illustrated in Fig. 1. The framework in Fig. 1a is inspired by a related idea in [37]. The difference is that our framework is built on two deep recurrent neural policies: a meta-controller and a sub-controller. The meta-controller is equivalent to a "policy over subgoals" that receives the current observation o_t and determines a new subgoal g_t. The sub-controller is equivalent to the option's policy, which directly interacts with the environment by performing actions a_t. The sub-controller receives both g_t and o_t as inputs to the deep neural network. Each controller contains an arrow (Fig. 1) pointing to itself, which indicates that the controller employs a recurrent neural network. In addition, an internal component called the "critic" is used to determine whether the agent has achieved the subgoal or not and to produce an intrinsic reward that is used to learn the sub-controller. In contrast to the framework in Fig. 1a, the framework in Fig. 1b does not use the current observation to determine the subgoal in the meta-controller but instead uses the last hidden state h^S_t of the sub-controller. The last hidden state, which is inferred from a sequence of observations of the sub-controller, is expected to help the meta-controller correctly determine the next subgoal.

Fig. 1: Hierarchical deep recurrent Q-network frameworks. (a) Framework 1; (b) Framework 2.
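To make the difference between the two frameworks concrete, the sketch below shows one way the decision loop of Fig. 1 could be organized in Python. It is only an illustration of the control flow described above; the env, meta, sub and critic objects and their methods are hypothetical stand-ins introduced for this sketch, not an API from the paper.

def run_episode(env, meta, sub, critic, framework=2):
    # Decision loop of Fig. 1: the meta-controller picks subgoals and the
    # sub-controller acts until the critic reports the subgoal (or episode) done.
    o = env.reset()
    h_meta = meta.zero_state()
    h_sub = sub.zero_state()
    while not env.episode_over():
        # Framework 1 feeds the raw observation o_t to the meta-controller;
        # framework 2 feeds the sub-controller's last hidden state h^S_t instead.
        meta_input = o if framework == 1 else h_sub
        g, h_meta = meta.select_subgoal(meta_input, h_meta)
        h_sub = sub.zero_state()
        while not env.episode_over() and not critic.reached(o, g):
            a, h_sub = sub.select_action((o, g), h_sub)
            o, r_ext = env.step(a)                 # extrinsic reward from the environment
            r_int = critic.intrinsic_reward(o, g)  # intrinsic reward produced by the critic
            # both rewards would be stored for learning; see Section III-E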

As mentioned in the previous section, RNNs are used in our framework in order to enable learning in POMDP. Particularly, CNNs are used to learn a low-dimensional representation for image inputs, while the RNN is used to capture the temporal relation of the observation data. Without a recurrent layer as in RNNs, CNNs cannot accurately approximate the state feature from observations in a POMDP domain. The procedure of the agent is illustrated in Fig. 2. The meta-controller and sub-controller use the Deep Recurrent Q-Network (DRQN), as described in [41]. Particularly, at step t, the meta-controller takes an observation o_t from the environment (framework 1) or the last hidden state of the sub-controller generated by the previous sequence of observations (framework 2), extracts state features through some deep neural layers, internally constructs a hidden state h_t^M, and produces the Q subgoal values Q^M(o_t, g_t|θ^M). The Q subgoal values are then used to determine the next subgoal g_{t+k}. Similarly, the sub-controller receives both the observation o_t and the subgoal g_t, extracts their features, constructs a hidden state h_t^S, and produces the Q action values Q^S({o_t, g_t}, a_t|θ^S), which are used to determine the next action a_{t+1}. These explanations are formalized in the following equations:

Φ^M = f^extract(o_t)  (framework 1),   Φ^M = f^extract(h^S)  (framework 2)   (16)

h_t^M, Q^M(o_t, g_t|θ^M) = f^M(Φ^M, h_{t-1}^M)   (17)

Φ^S = f^extract(o_t, g_t)   (18)

h_t^S, Q^S({o_t, g_t}, a_t|θ^S) = f^S(Φ^S, h_{t-1}^S),   (19)

where Φ^M and Φ^S are the features after the extraction process of the meta-controller and sub-controller, respectively. f^M and f^S are the respective recurrent networks of the meta-controller and sub-controller. They receive the state features and a hidden state, and then provide the next hidden state and Q value function.

The networks for the controllers are illustrated in Fig. 3. For framework 1, we use the network shown in Fig. 3a for both the meta-controller and the sub-controller. A sequence of four CONV layers and ReLU layers interleaved together is used to extract information from raw observations. A RNN layer, specifically an LSTM, is employed after the feature extraction to memorize information from previous observations. The output of the RNN layer is split into an Advantage stream A and a Value stream V before being unified into the Q-value. This architecture inherits from the Dueling architecture [47], effectively learning states without having to learn the effect of each action on that state. For framework 2, we use the network in Fig. 3a for the sub-controller and the network in Fig. 3b for the meta-controller. The meta-controller in framework 2 uses as its state the internal hidden state from the sub-controller. By passing it through four fully connected layers and ReLU layers, the features of the meta-controller are extracted. The rest of the network is the same as the network for framework 1.

A. Learning Model

We use the state-of-the-art Double DQN to learn the parameters of the network. Particularly, the controllers estimate the following Q value functions:

Q^M(ô_t, g_t) = Q^M(ô_t, g_t) + α ( r^ex + γ^k max_{g_{t+k}} Q^M(ô_{t+k}, g_{t+k}) − Q^M(ô_t, g_t) )   (20)

and

Q^S({o_t, g_t}, a_t) = Q^S({o_t, g_t}, a_t) + α ( r^in + γ max_{a_{t+1}} Q^S({o_{t+1}, g_t}, a_{t+1}) − Q^S({o_t, g_t}, a_t) ),   (21)

where ô can be a direct observation or an internal hidden state generated by the sub-controller. Let θ^M and θ^S be the parameters (weights and biases) of the deep neural networks that parameterize the networks Q^M(ô_t, g_t) and Q^S({o_t, g_t}, a_t), correspondingly, e.g. Q^M(ô_t, g_t) = Q^M(ô_t, g_t|θ^M) and Q^S({o_t, g_t}, a_t) = Q^S({o_t, g_t}, a_t|θ^S). Then, Q^M and Q^S are trained by minimizing the loss functions L^M and L^S, correspondingly. L^M can be formulated as

L^M = E_{(o,g,o',g',r^ex) ∼ M^M} [ ( y_i^M − Q^M(o, g|θ^M) )^2 ],   (22)

where E denotes the expectation over a batch of data uniformly sampled from an experience replay M^M, i is the iteration number in the batch of samples, and

y_i^M = r^ex + γ Q^{M'}( o', argmax_{g'} Q^M(o', g'|θ^M) | θ^{M'} ).   (23)

Similarly, the formula of L^S is

L^S = E_{(o,g,a,r^in) ∼ M^S} [ ( y_i^S − Q^S({o, g}, a|θ^S) )^2 ],   (24)

where

y_i^S = r^in + γ Q^{S'}( {o', g}, argmax_{a'} Q^S({o', g}, a'|θ^S) | θ^{S'} ).   (25)

M^S is the experience replay that stores the transition data from the sub-controller, and Q^{S'} is the target network of Q^S. Intuitively, in contrast to DQN, which uses the maximum operator for both selecting and evaluating an action, Double DQN separately uses the main Q network to greedily estimate the next action and the target Q network to estimate its value. This method has been shown to achieve better performance than DQN on Atari games [46].

Fig. 2: Procedure of Hierarchical Deep Recurrent Q-Learning (hDRQN). (a) Framework 1; (b) Framework 2. The framework is based on [37].

B. Minibatch Sampling Strategy

For updating the RNNs in our model, we need to pass a sequence of samples. Particularly, some episodes from the experience replay are uniformly sampled and are processed from the beginning of the episode to the end of the episode. This strategy, called Bootstrapped Sequential Updates [41], is an ideal method to update RNNs because the hidden state can carry all information through the whole episode. However, this strategy is computationally expensive for a long episode, which can contain many time steps. Another approach proposed in [41] has been shown to achieve the same performance as Bootstrapped Sequential Updates while reducing the computational complexity. This strategy is called Bootstrapped Random Updates and proceeds as follows.

This strategy also randomly selects a batch of episodes from the experience replay. Then, for each episode, we begin at a random transition and select a sequence of n transitions. The value of n, which affects the performance of our algorithm, is analyzed in Section IV. We apply the same procedure of Bootstrapped Random Updates in our algorithm. In addition, the mechanism explained in [51] is also applied. That study explains a problem when updating DRQN: using the first observations in a sequence of transitions to update the Q value function might be inaccurate. Thus, the solution is to use only the last observations to update DRQN. Particularly, our method uses the last n/2 transitions to update the Q-value.
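A minimal sketch of this sampling scheme is given below, assuming the replay memory is simply a list of episodes, each of which is a list of transition tuples; this data layout is our assumption for illustration only.

import random

def sample_sequences(replay_memory, batch_size, n):
    """Bootstrapped Random Updates: pick random episodes, then a random
    length-n window inside each episode. Only the last n//2 transitions of
    each window are later used to compute the loss, as suggested in [51]."""
    batch = []
    episodes = random.sample(replay_memory, min(batch_size, len(replay_memory)))
    for episode in episodes:
        if len(episode) < n:
            continue                               # skip episodes shorter than the window
        start = random.randint(0, len(episode) - n)
        window = episode[start:start + n]
        batch.append((window, n // 2))             # remember how many tail steps to train on
    return batch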

C. Subgoal Definition

Our model is based on the "option" framework. Learning an option means using flat deep RL algorithms to achieve the subgoal of that option. However, discovering subgoals among the existing states of the environment is still one of the challenges in hierarchical reinforcement learning. To simplify the model, we assume that a set of pre-defined subgoals is provided in advance. The pre-defined subgoals are based on object-oriented MDPs [52], where entities or objects in the environment are encoded as subgoals.

Fig. 3: Network models. (a) The network for framework 1; (b) Meta-controller network in framework 2.

D. Intrinsic and Extrinsic Reward Functions

Traditional RL accumulates all rewards and penalties in a single reward function, which is hard to learn for a specified task in a complex domain. In contrast, hierarchical RL introduces the notions of an intrinsic reward function and an extrinsic reward function. Intrinsic motivation originally comes from psychology, cognitive science, and neuroscience [53] and has been applied to hierarchical RL [54][55][56][57][58][59]. Our framework follows the model of intrinsic motivation in [55]. Particularly, within an option (or skill), the agent needs to learn the option's policy (the sub-controller in our framework) to obtain a subgoal (a salient event) under the reinforcement of an intrinsic reward, while for the overall task a policy over options (the meta-controller) is learned to generate a sequence of subgoals while being reinforced by an extrinsic reward. Defining "good" intrinsic and extrinsic reward functions is still an open question in reinforcement learning, and it is difficult to find a reward function model that generalizes to all domains. To illustrate the notions above, Fig. 4 describes the domain of multiple goals in four-rooms, which is used to evaluate our algorithm in Section IV. The four-rooms map contains a number of objects: an agent (in black), two obstacles (in red) and two goals (in blue and green). These objects are randomly located on the map. At each time step, the agent has to take one of four actions: top, down, left or right, and it has to move to the goal locations in the proper order: the blue goal first and then the green goal. If the agent obtains all goals in the right order, it receives a big reward; otherwise, it only receives a small reward. In addition, the agent has to learn to avoid the obstacles if it does not want to be penalized. For this example, a salient event is equivalent to reaching a subgoal or hitting an obstacle. In addition, there are two skills the agent should learn. One is moving to the goals while correctly avoiding the obstacles, and the second is selecting which goal it should reach first. The intrinsic reward for each skill is generated based on the salient events encountered while exploring the environment. Particularly, for reaching a goal, the intrinsic reward includes the reward for reaching the goal successfully and the penalty if the agent encounters an obstacle. For reaching the goals in order, the reward includes a big reward if the agent reaches the goals in the proper order and a small reward if the agent reaches the goals in an improper order. A detailed explanation of the intrinsic and extrinsic rewards for this domain is included in Section IV.

E. Learning Algorithm

In this section, our contributions are summarized in the pseudo-code of Algorithm 1. The algorithm learns four neural networks: two networks for the meta-controller (Q^M and Q^{M'}) and two networks for the sub-controller (Q^S and Q^{S'}). They are parameterized by θ^M, θ^{M'}, θ^S and θ^{S'}, respectively. The architectures of the networks are described in Section III. In addition, the algorithm separately maintains two experience replay memories, M^M and M^S, to store transition data from the meta-controller and the sub-controller, respectively. Before starting the algorithm, the parameters of the main networks are randomly initialized and copied to the target networks. ε^M and ε^S are annealed from 1.0 to 0.1, which gradually increases the control of the controllers. The algorithm loops through a specified number of episodes (Line 9), and each episode is executed until the agent reaches a terminal state. To start an episode, first, a starting observation o_0 is obtained (Line 10). Next, the hidden states, which are inputs to the RNNs, are initialized with zero values (Lines 11 and 13) and are updated during the episode (Lines 15 and 17). Each subgoal is determined by passing the observation o or the hidden state h^S (depending on the framework) to the meta-controller (Line 15). Following an ε-greedy fashion, a subgoal is selected by the meta-controller if a random number is greater than ε; otherwise, a random subgoal is selected (Algorithm 2). The sub-controller is taught to reach the subgoal; when the subgoal is reached, a new subgoal is selected. The process is repeated until the final goal is obtained. The intrinsic reward is evaluated by the critic module and is stored in M^S (Line 19) for updating the sub-controller. Meanwhile, the extrinsic reward is directly received from the environment and is stored in M^M for updating the meta-controller (Line 24). Updating the controllers at Lines 20 and 21 is described in Section III-A and is summarized in Algorithm 3 and Algorithm 4.

Algorithm 1 hDRQN in POMDP
Require:
 1: POMDP M = {S, A, P, r, Ω, Z}
 2: Meta-controller with the main network Q^M and target network Q^{M'}, parameterized by θ^M and θ^{M'}, respectively
 3: Sub-controller with the main network Q^S and target network Q^{S'}, parameterized by θ^S and θ^{S'}, respectively
 4: Exploration rate ε^M for the meta-controller and ε^S for the sub-controller
 5: Experience replay memories M^M and M^S of the meta-controller and sub-controller, respectively
 6: A pre-defined set of subgoals G
 7: f^M and f^S are the recurrent networks of the meta-controller and sub-controller, respectively
Ensure:
 8: Initialize:
    - Experience replay memories M^M and M^S
    - Randomly initialize θ^M and θ^S
    - Assign values to the target networks: θ^{M'} ← θ^M and θ^{S'} ← θ^S
    - ε^M ← 1.0, decreasing to 0.1
    - ε^S ← 1.0, decreasing to 0.1
 9: for k = 1, 2, ..., K do
10:   Initialize the environment and get the start observation o
11:   Initialize hidden states: h^M ← 0
12:   while o is not terminal do
13:     Initialize hidden states: h^S ← 0
14:     Initialize the start observation: o_0 ← ô, where ô can be the observation o or the hidden state h^S
15:     Determine the subgoal: g, h^M ← EPS_GREEDY(ô, h^M, G, ε^M, Q^M, f^M)
16:     while o is not terminal and g is not reached do
17:       Determine the action: a, h^S ← EPS_GREEDY({o, g}, h^S, A, ε^S, Q^S, f^S)
18:       Execute action a, receive reward r, extrinsic reward r^ex, intrinsic reward r^in, and obtain the next state s'
19:       Store transition {{o, g}, a, r^in, {o', g'}} in M^S
20:       Update the sub-controller: SUB_UPDATE(M^S, Q^S, Q^{S'})
21:       Update the meta-controller: META_UPDATE(M^M, Q^M, Q^{M'})
22:       Transition to the next observation: o ← o'
23:     end while
24:     Store transition {o_0, g, r^ex_total, ô'} in M^M, where ô' can be the observation o' or the last hidden state h^S
25:   end while
26:   Anneal ε^M and ε^S
27: end for
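One detail worth making explicit is that Algorithm 1 fills the two replay memories at different time scales: M^S receives one transition per primitive step carrying the intrinsic reward (Line 19), while M^M receives one transition per completed subgoal carrying the extrinsic reward accumulated over the option (Line 24). The sketch below only illustrates this bookkeeping; the tuple layouts and the discounting helper are our own choices, stated here as assumptions rather than the paper's code.

def store_sub_transition(memory_s, o, g, a, r_in, o_next):
    # Line 19 of Algorithm 1: one sub-controller transition per primitive action.
    memory_s.append(((o, g), a, r_in, (o_next, g)))

def accumulate_extrinsic(r_ex_total, r_ex, steps_elapsed, gamma=0.99):
    # Extrinsic reward collected while the option runs, discounted by the number
    # of elapsed steps in the spirit of the option return in Eq. (2).
    return r_ex_total + (gamma ** steps_elapsed) * r_ex

def store_meta_transition(memory_m, o0, g, r_ex_total, o_hat_next):
    # Line 24 of Algorithm 1: one meta-controller transition per finished subgoal.
    # o0 and o_hat_next are raw observations in framework 1, or sub-controller
    # hidden states in framework 2.
    memory_m.append((o0, g, r_ex_total, o_hat_next))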

Algorithm 2 EPS_GREEDY(x, h, B, ε, Q, f)
Require:
 1: x: input of the Q network
 2: h: internal hidden state
 3: B: a set of outputs
 4: ε: exploration rate
 5: Q network and recurrent function f
Ensure:
 6: h ← f(x, h)
 7: if random() < ε then
 8:   o ← a random element from the set of outputs B
 9: else
10:   o ← argmax_{m∈B} Q(x, m)
11: end if
12: Return o, h

Algorithm 3 META_UPDATE(M^M, Q^M, Q^{M'})
Require:
 1: M^M: experience replay memory of the meta-controller
Ensure:
 2: Sample a mini-batch of {o, g, r^ex, o'} from M^M using the strategy explained in Section III-B
 3: Update the main network by minimizing the loss function
    L^M = E_{(o,g,o',g',r^ex) ∼ M^M} [ ( y_i^M − Q^M(o, g|θ^M) )^2 ],
    where y_i^M = r^ex + γ Q^{M'}( o', argmax_{g'} Q^M(o', g'|θ^M) | θ^{M'} )
 4: Update the target network: θ^{M'} ← τ θ^M + (1 − τ) θ^{M'}

Algorithm 4 SUB_UPDATE(M^S, Q^S, Q^{S'})
Require:
 1: M^S: experience replay memory of the sub-controller
Ensure:
 2: Sample a mini-batch of {{o, g}, a, r^in, {o', g'}} from M^S using the strategy explained in Section III-B
 3: Update the main network by minimizing the loss function
    L^S = E_{(o,g,a,r^in) ∼ M^S} [ ( y_i^S − Q^S({o, g}, a|θ^S) )^2 ],
    where y_i^S = r^in + γ Q^{S'}( {o', g}, argmax_{a'} Q^S({o', g}, a'|θ^S) | θ^{S'} )
 4: Update the target network: θ^{S'} ← τ θ^S + (1 − τ) θ^{S'}
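A compact Python rendering of Algorithm 2 and of the soft target update used in Algorithms 3-4 is shown below. It assumes a controller object exposing a recurrent step that returns Q-values and the next hidden state, and parameters stored as a dictionary of arrays; both interfaces are assumptions made for this sketch.

import random

def eps_greedy(x, h, outputs, eps, controller):
    """Algorithm 2: step the recurrent network, then pick a random output with
    probability eps, otherwise the output with the highest Q-value."""
    q_values, h_next = controller.step(x, h)       # h <- f(x, h), plus Q(x, .)
    if random.random() < eps:
        choice = random.choice(list(outputs))
    else:
        choice = max(outputs, key=lambda m: q_values[m])
    return choice, h_next

def soft_update(theta_target, theta_main, tau=0.001):
    """Target update of Algorithms 3-4: theta' <- tau*theta + (1-tau)*theta'."""
    return {k: tau * theta_main[k] + (1.0 - tau) * theta_target[k] for k in theta_main}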

Fig. 4: Example domain for illustrating the notions of intrinsic and extrinsic motivation.

IV. EXPERIMENTS

In this section, we evaluate two versions of the hierarchical deep recurrent network algorithm: hDRQNv1 is the algorithm formed by framework 1, and hDRQNv2 is the algorithm formed by framework 2. We compare them with flat algorithms (DQN, DRQN) and the state-of-the-art hierarchical RL algorithm (hDQN). The comparisons are performed on three domains. The domain of multiple goals in a gridworld is used to evaluate many aspects of the proposed algorithms. Meanwhile, the harder domain called multiple goals in four-rooms is used to benchmark the proposed algorithms. Finally, the most challenging game in ATARI 2600 [9], called Montezuma's Revenge, is used to confirm the efficiency of our proposed framework.

A. Implementation Details

We use Tensorflow [60] to implement our algorithms. The settings for each domain are different, but they have some commonalities, as follows. For the hDRQNv1 algorithm, the inputs to the meta-controller and sub-controller are an image of size 44 × 44 × 3 (a color image). The input image is resized from an observation of the area around the agent (either a 3 × 3 unit or a 5 × 5 unit area).
The image feature of 256 values, extracted through four CNN and ReLU layers, is put into an LSTM layer of 256 states to generate 256 output values, and an internal hidden state of 256 values is also constructed. For the hDRQNv2 algorithm, a hidden state of 256 values is put into the network of the meta-controller. The state is passed through four fully connected layers and ReLU layers instead of four CONV layers. The output is a feature of 256 values. The algorithm uses ADAM [61] for learning the neural network parameters with a learning rate of 0.001 for both the meta-controller network and the sub-controller. For updating the target network, a τ value of 0.001 is applied. The algorithm uses a discount factor of 0.99. The capacity of M^S and M^M is 1E5 and 5E4, respectively.

B. Multiple Goals in a Gridworld

The domain of multiple goals in a gridworld is a simpler form of multiple goals in four-rooms, which is described in Section III-D. In this domain, a gridworld map of size 11 × 11 units is used instead of the four-rooms map. At each time step, the agent only observes a part of the surrounding environment, either a 3 × 3 unit area (Fig. 5b) or a 5 × 5 unit area (Fig. 5c). The agent is allowed to choose one of four actions (top, left, right, bottom), which are deterministic. The agent cannot move if the action leads it into the wall. The rewards for the agent are defined as follows. If the agent hits an obstacle, it receives a penalty of minus one. If the agent reaches the two goals in the proper order, it receives a reward of one for each goal; otherwise, it only receives 0.01. To apply an hDRQN algorithm, an intrinsic reward function and an extrinsic reward function are defined as follows:

r^in = { 1    if the agent obtains the goal
       { −1   if the agent hits an obstacle   (26)

and

r^ex = { 1     if the agent reaches the goals in order
       { 0.01  otherwise.   (27)

Fig. 5: Multiple goals in gridworld. (a) Multiple goals in gridworld; (b) 3 × 3 unit; (c) 5 × 5 unit.
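The reward structure of Eqs. (26)-(27) can be expressed as two small functions. The sketch below is our own illustrative encoding of those rules; the event flags passed in are assumptions about how a gridworld simulator might report collisions and goal ordering, and the zero reward for all other steps is likewise our assumption.

def intrinsic_reward(reached_subgoal, hit_obstacle):
    # Eq. (26): +1 for obtaining the current subgoal, -1 for hitting an obstacle.
    if hit_obstacle:
        return -1.0
    return 1.0 if reached_subgoal else 0.0

def extrinsic_reward(goals_visited, correct_order=("blue", "green")):
    # Eq. (27): 1 when the goals are collected in the proper order (blue first,
    # then green), 0.01 otherwise.
    return 1.0 if tuple(goals_visited) == correct_order else 0.01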

The first evaluation, reported in Fig. 6, is a comparison of different lengths of selected transitions, as discussed in Section III-B. The agent in this evaluation can observe an area of 5 × 5 units. We report the performance over three runs of 20000 episodes, and each episode has 50 steps. The number of steps per episode ensures that the agent can explore any location on the map. In the figures on the left (hDRQNv1) and in the middle (hDRQNv2), we use a fixed length of meta-transitions (n^M = 1) and compare different lengths of sub-transitions. Conversely, the figures on the right show the performance of the algorithm using a fixed length of sub-transitions (n^S = 8) and compare different lengths of meta-transitions. With a fixed length of meta-transitions, the algorithm performs well with a long length of sub-transitions (n^S = 8 or n^S = 12); the performance decreases when the length of sub-transitions is decreased. Intuitively, the RNN needs a sequence of transitions that is long enough to increase the probability that the agent will reach the subgoal within that sequence. Another observation is that, with n^S = 8 or n^S = 12, there is little difference in performance. This is reasonable because only eight transitions are needed for the agent to reach the subgoals. For a fixed length of sub-transitions (n^S = 8), with the hDRQNv1 algorithm, the setting with n^M = 2 has low performance and high variance compared to the setting with n^M = 1. The reason is that while the sub-controller for the two settings has the same performance (Fig. 6f), the meta-controller with n^M = 1 performs better than the meta-controller with n^M = 2. Meanwhile, with the hDRQNv2 algorithm, the performance is the same for both settings n^M = 1 and n^M = 2. This means that the hidden state from the sub-controller is a better input for determining the subgoal than a raw observation, as it makes the algorithm independent of the length of the meta-transition. The number of steps to obtain the two goals in order is around 22.

The next evaluation is a comparison at different levels of observation. Fig. 7 shows the performance of the hDRQN algorithms with a 3 × 3 observable agent compared with a 5 × 5 observable agent and a fully observable agent. It is clear that a fully observable agent has more information around it than a 5 × 5 observable agent and a 3 × 3 observable agent; thus, the agent with a larger observation area can quickly explore and localize the environment completely. As a result, the performance of the agent with a larger observation area is better than that of the agent with a smaller observing ability. From the figure, the performance of a 5 × 5 observable agent using hDRQNv2 seems to converge faster than that of a fully observable agent. However, the performance of the fully observable agent surpasses the performance of the 5 × 5 observable agent in the end.

In the last evaluation of this domain, we compare the performance of the proposed algorithms with the well-known algorithms DQN, DRQN, and hDQN [37]. All algorithms assume that the agent only observes an area of 5 units around it. The results are shown in Fig. 8. For both domains of two goals and three goals, the hDRQN algorithms outperform the other algorithms, and hDRQNv2 has the best performance. The hDQN algorithm, which can operate in a hierarchical domain, is better than the flat algorithms but not better than the hDRQN algorithms. It might be that the hDQN algorithm is designed only for fully observed domains and has poor performance in partially observable domains.

1.50 sub-2 sub-2 sub-4 sub-4 1.25 1.5 1.5 sub-8 sub-8 1.00 sub-12 sub-12 1.0 1.0 0.75

0.50 Reward

Reward 0.5 Reward 0.5 0.25 hDRQNv1 meta-1 0.00 0.0 0.0 hDRQNv2 meta-1 0.25 hDRQNv1 meta-2 hDRQNv2 meta-2 0.50 0.5 0.5 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (a) hDRQNv1: reward (e) hDRQNv2: reward (i) Reward

1.00 sub-2 0.75 0.75 sub-4 0.75 sub-8 0.50 0.50 0.50 sub-12

0.25 0.25 0.25

0.00 0.00 0.00 Reward Reward Reward 0.25 0.25 0.25 sub-2 hDRQNv1 meta-1 sub-4 0.50 0.50 0.50 hDRQNv2 meta-1 sub-8 hDRQNv1 meta-2 0.75 sub-12 0.75 0.75 hDRQNv2 meta-2

0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (b) hDRQNv1: intrinsic (f) hDRQNv2: intrinsic (j) Intrinsic

1.6 sub-2 1.75 sub-2 1.75 hDRQNv1 meta-1 1.4 sub-4 sub-4 hDRQNv2 meta-1 sub-8 1.50 sub-8 1.50 hDRQNv1 meta-2 1.2 sub-12 sub-12 hDRQNv2 meta-2 1.25 1.25 1.0 1.00 1.00 0.8 Reward Reward 0.75 Reward 0.75 0.6 0.50 0.50 0.4 0.25 0.2 0.25 0.00 0.00 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (c) hDRQNv1: extrinsic (g) hDRQNv2: extrinsic (k) Extrinsic

50 50 50 hDRQNv1 meta-1 hDRQNv2 meta-1 45 45 45 hDRQNv1 meta-2 hDRQNv2 meta-2 40 40 40

35 35 35 Steps Steps Steps

30 30 30 sub-2 sub-2 sub-4 sub-4 25 25 25 sub-8 sub-8 sub-12 sub-12 20 20 20 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (d) hDRQNv1: steps (h) hDRQNv2: steps (l) Steps Fig. 6: Evaluation of different lengths of transitions. In the left-side figures, we use a fixed length of meta-transitions (nM = 1). The figures on the right use a fixed length of sub-transitions (nS = 8) .

C. Multiple Goals in a Four-rooms Domain

Fig. 7: Evaluation on different levels of observation.

In this domain, we apply the multiple goals domain to a complex map called four-rooms (Fig. 9). The dynamics of the environment in this domain are similar to those of the task in Section IV-B. The agent in this domain usually must pass through hallways to obtain goals that are randomly located in the four rooms. Originally, the four-rooms domain was an environment for testing hierarchical reinforcement learning algorithms [21].

The performance shown in Fig. 10a is averaged over three runs of 50000 episodes, and each episode has 50 time steps. Meanwhile, the performance shown in Fig. 10b is averaged over three runs of 100000 episodes, and each episode has 100 time steps. Similarly to the previous domain, the proposed algorithms outperform the other algorithms, especially the hDRQNv2 algorithm.

Fig. 8: Comparing hDRQN algorithms with some baseline algorithms. (a) Two goals in gridworld; (b) Three goals in gridworld.

Fig. 10: Comparing hDRQN algorithms with some baseline algorithms. (a) Two goals in four-rooms; (b) Three goals in four-rooms.

Fig. 9: Multiple goals in the four-rooms domain. (a) The map of multiple goals in the four-rooms domain; (b) 3 × 3 unit; (c) 5 × 5 unit.

D. Montezuma's Revenge Game in ATARI 2600

Montezuma's Revenge is the hardest game in ATARI 2600, where the DQN algorithm [10] can only achieve a score of zero. We use OpenAI Gym to simulate this domain [62]. The game is hard because the agent must execute a long sequence of actions until a state with a non-zero reward (delayed reward) can be visited. In addition, in order to obtain a state with larger rewards, the agent needs to reach a special state in advance. This paper evaluates our proposed algorithms on the first screen of the game (Fig. 11). Particularly, the agent, which only observes a part of the environment (Fig. 11b), needs to pass through the doors (the yellow lines in the top left and top right corners of the figure) to explore other screens. However, to pass through the doors, the agent first needs to pick up the key on the left side of the screen. Thus, the agent must learn to navigate to the key's location and then navigate back to a door to open the next screens. The agent earns 100 after it obtains the key and 300 after it reaches any door. In total, the agent can receive 400 for this screen.

The intrinsic reward function is defined to motivate the agent to explore the whole environment. Particularly, the agent receives an intrinsic value of 1 if it reaches a subgoal from another subgoal. The set of subgoals is pre-defined in Fig. 11a (the white rectangles). In contrast to the intrinsic reward function, the extrinsic reward function is defined as a reward value of 1 when the agent obtains the key or opens the doors. Because learning the meta-controller and the sub-controllers simultaneously is highly complex and time-consuming, we separate the learning process into two phases. In the first phase, we learn the sub-controllers completely such that the agent can explore the whole environment by moving between subgoals. In the second phase, we learn the meta-controller and sub-controllers altogether. The architecture of the meta-controller and the sub-controllers is described in Section IV-A.

The length of the sub-transition and meta-transition is 8 and 16, respectively. In this domain, the agent can observe an area of 70 × 70 pixels. The observation area is then resized to 44 × 44 to fit the input of a controller network. The performance of the proposed algorithms compared to the baseline algorithms is shown in Fig. 12. DQN reported a score of zero, which is similar to the result from [10]. DRQN, which can perform well in a partially observable environment, also achieves a score of zero because of the high hierarchical complexity of the domain. Meanwhile, hDQN can achieve a high score on this domain; however, it cannot perform well in the partial observability setting. The performance of hDQN in a full observability setting can be found in the paper of Kulkarni [37]. Our proposed algorithms can adapt to the partial observability setting and to hierarchical domains as well. The hDRQNv2 algorithm shows better performance than hDRQNv1. It seems that the difference in the architectures of the two frameworks (described in Section III) has affected their performance. Particularly, using the internal states of a sub-controller as the input to the meta-controller can give more information for prediction than using only a raw observation. To evaluate the effectiveness of the two algorithms, we report the success ratio for reaching the goal "key" in Fig. 13 and the number of time steps the agent explores each subgoal in Fig. 14. In Fig. 13, the agent using the hDRQNv2 algorithm almost always picks up the "key" at the end of the learning process.

Fig. 11: Montezuma's Revenge game in ATARI 2600. (a) The first screen of Montezuma's Revenge game; (b) Observable area.

Fig. 12: Comparing hDRQN algorithms with some baseline algorithms on Montezuma's Revenge game.

Fig. 13: Success ratio for reaching the goal "key".

Moreover, Fig. 14 shows that hDRQNv2 tends to explore more often the subgoals that are on the way to reaching the "key" (e.g. top-right-ladder, bottom-right-ladder, and bottom-left-ladder) while exploring less often other subgoals such as the left door and the right door.

Fig. 14: Statistic about the number of times the agent visits subgoals.
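As an illustration of how the pre-defined subgoal set and the exploration-driven intrinsic reward for this domain could be encoded, consider the sketch below. The subgoal names follow the legend of Fig. 14; the bounding-box coordinates and the overlap test are invented placeholders for illustration, not values used in the paper.

# Pre-defined subgoals of the first screen (white rectangles in Fig. 11a).
# The pixel boxes below are illustrative placeholders only.
SUBGOALS = {
    "key": (10, 100, 20, 120),
    "left-door": (0, 20, 30, 60),
    "right-door": (130, 20, 160, 60),
    "top-right-ladder": (120, 40, 140, 80),
    "bottom-right-ladder": (120, 140, 140, 180),
    "bottom-left-ladder": (20, 140, 40, 180),
}

def reached(agent_box, subgoal_box):
    # Simple axis-aligned overlap test between the agent and a subgoal region.
    ax1, ay1, ax2, ay2 = agent_box
    sx1, sy1, sx2, sy2 = subgoal_box
    return ax1 < sx2 and sx1 < ax2 and ay1 < sy2 and sy1 < ay2

def intrinsic_reward(agent_box, current_subgoal, previous_subgoal):
    # The agent gets 1 when it reaches the selected subgoal after leaving another one.
    if current_subgoal != previous_subgoal and reached(agent_box, SUBGOALS[current_subgoal]):
        return 1.0
    return 0.0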

V. CONCLUSION

We introduced a new hierarchical deep reinforcement learning algorithm that is a learning framework for both full observability (MDP) and partial observability (POMDP). The algorithm takes advantage of deep neural networks (DNN, CNN, LSTM) to produce hierarchical policies that can solve domains with a highly hierarchical nonlinearity. We showed that the framework performs well when learning in hierarchical POMDP environments. Nevertheless, our approach has some limitations, as follows. First, our framework is built on two levels of hierarchy, which does not fit domains with multiple levels of hierarchy. Second, in order to simplify the learning problem in hierarchical POMDP, we assume that the set of subgoals is predefined and fixed, because the problem of discovering a set of subgoals in POMDP is still a hard problem. In the future, we plan to improve our framework by tackling those problems. In addition, we could apply DRQN to multi-agent problems where the environment is partially observable and the task is hierarchical.

ACKNOWLEDGMENT

The authors are grateful to the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1D1A1B04036354).

REFERENCES

[1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[2] M. P. Deisenroth, G. Neumann, J. Peters et al., "A survey on policy search for robotics," Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1-142, 2013.
[3] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on. IEEE, 2006, pp. 2219-2225.
[4] ——, "Natural actor-critic," Neurocomputing, vol. 71, no. 7, pp. 1180-1190, 2008, Progress in Modeling, Theory, and Application of Computational Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231208000532
[5] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1889-1897.
[6] J. W. Lee, "Stock price prediction using reinforcement learning," in Industrial Electronics, 2001. Proceedings. ISIE 2001. IEEE International Symposium on, vol. 1. IEEE, 2001, pp. 690-695.
[7] J. Moody and M. Saffell, "Learning to trade via direct reinforcement," IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875-889, 2001.
[8] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653-664, 2017.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[11] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[12] G. Tesauro, D. Gondek, J. Lenchner, J. Fan, and J. M. Prager, "Simulation, learning, and optimization techniques in Watson's game strategies," IBM Journal of Research and Development, vol. 56, no. 3.4, pp. 16-1, 2012.
[13] G. Tesauro, "TD-Gammon: A self-teaching backgammon program," in Applications of Neural Networks. Springer, 1995, pp. 267-285.
[14] D. Ernst, M. Glavic, and L. Wehenkel, "Power systems stability control: Reinforcement learning framework," IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427-435, 2004.
[15] J. Kober and J. Peters, "Reinforcement learning in robotics: A survey," in Reinforcement Learning. Springer, 2012, pp. 579-610.
[16] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057-1063.
[17] P. Dayan and G. E. Hinton, "Using expectation-maximization for reinforcement learning," Neural Computation, vol. 9, no. 2, pp. 271-278, 1997.
[18] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, "Evolutionary algorithms for reinforcement learning," 1999.
[19] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008-1014.
[20] S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus, "Mazebase: A sandbox for learning from games," arXiv preprint arXiv:1511.07401, 2015.
[21] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181-211, 1999.
[22] R. E. Parr, Hierarchical Control and Learning for Markov Decision Processes. University of California, Berkeley, CA, 1998.
[23] R. Parr and S. J. Russell, "Reinforcement learning with hierarchies of machines," in Advances in Neural Information Processing Systems, 1998, pp. 1043-1049.
[24] T. G. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," Journal of Artificial Intelligence Research, vol. 13, no. 1, pp. 227-303, 2000.
[25] N. A. Vien, H. Q. Ngo, S. Lee, and T. Chung, "Approximate planning for Bayesian hierarchical reinforcement learning," Appl. Intell., vol. 41, no. 3, pp. 808-819, 2014.
[26] N. A. Vien and M. Toussaint, "Hierarchical Monte-Carlo planning," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015, pp. 3613-3619.

[27] N. A. Vien, S. Lee, and T. Chung, "Bayes-adaptive hierarchical MDPs," Appl. Intell., vol. 45, no. 1, pp. 112-126, 2016.
[28] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341-379, 2003.
[29] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, "Unifying count-based exploration and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 1471-1479.
[30] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," 2017.
[31] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg, "Multi-level discovery of deep options," arXiv preprint arXiv:1703.08294, 2017.
[32] S. Lee, S.-W. Lee, J. Choi, D.-H. Kwak, and B.-T. Zhang, "Micro-objective learning: Accelerating deep reinforcement learning through the discovery of continuous subgoals," arXiv preprint arXiv:1703.03933, 2017.
[33] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, "Feudal networks for hierarchical reinforcement learning," arXiv preprint arXiv:1703.01161, 2017.
[34] I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan, "Deep reinforcement learning with macro-actions," arXiv preprint arXiv:1606.04615, 2016.
[35] K. Arulkumaran, N. Dilokthanakul, M. Shanahan, and A. A. Bharath, "Classifying options for deep reinforcement learning," arXiv preprint arXiv:1604.08153, 2016.
[36] P. Dayan and G. E. Hinton, "Feudal reinforcement learning," in Advances in Neural Information Processing Systems, 1993, pp. 271-278.
[37] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 3675-3683.
[38] C.-C. Chiu and V.-W. Soo, "Subgoal identifications in reinforcement learning: A survey," in Advances in Reinforcement Learning. InTech, 2011.
[39] M. Stolle, "Automated discovery of options in reinforcement learning," Ph.D. dissertation, McGill University, 2004.
[40] K. P. Murphy, "A survey of POMDP solution techniques," Environment, vol. 2, p. X3, 2000.
[41] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," 2015.
[42] M. Egorov, "Deep reinforcement learning with POMDPs," 2015.
[43] C. C. White, "Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem," Operations Research, vol. 24, no. 2, pp. 348-358, 1976.
[44] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1, pp. 99-134, 1998.
[45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[46] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2016.
[47] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[48] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[49] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in International Conference on Machine Learning, 2016, pp. 2829-2838.
[50] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[51] G. Lample and D. S. Chaplot, "Playing FPS games with deep reinforcement learning," 2017.
[52] C. Diuk, A. Cohen, and M. L. Littman, "An object-oriented representation for efficient reinforcement learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 240-247.
[53] R. M. Ryan and E. L. Deci, "Intrinsic and extrinsic motivations: Classic definitions and new directions," Contemporary Educational Psychology, vol. 25, no. 1, pp. 54-67, 2000.
[54] A. Stout, G. D. Konidaris, and A. G. Barto, "Intrinsically motivated reinforcement learning: A promising framework for developmental robot learning," Massachusetts Univ. Amherst Dept. of Computer Science, Tech. Rep., 2005.
[55] A. G. Barto, "Intrinsically motivated learning of hierarchical collections of skills," 2004, pp. 112-119.
[56] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, "Intrinsically motivated reinforcement learning: An evolutionary perspective," IEEE Transactions on Autonomous Mental Development, vol. 2, no. 2, pp. 70-82, 2010.
[57] M. Frank, J. Leitner, M. Stollenga, A. Förster, and J. Schmidhuber, "Curiosity driven reinforcement learning for motion planning on humanoids," Frontiers in Neurorobotics, vol. 7, p. 25, 2014.
[58] S. Mohamed and D. J. Rezende, "Variational information maximisation for intrinsically motivated reinforcement learning," in Advances in Neural Information Processing Systems, 2015, pp. 2125-2133.
[59] J. Schmidhuber, "Formal theory of creativity, fun, and intrinsic motivation (1990-2010)," IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230-247, 2010.
[60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.
Barto, “Intrinsically motivated learning of hierar- grating temporal abstraction and intrinsic motivation,” chical collections of skills,” 2004, pp. 112–119. in Advances in Neural Information Processing Systems, [56] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, “Intrinsi- 2016, pp. 3675–3683. cally motivated reinforcement learning: An evolutionary [38] C.-C. Chiu and V.-W. Soo, “Subgoal identifications in perspective,” IEEE Transactions on Autonomous Mental reinforcement learning: A survey,” in Advances in Rein- Development, vol. 2, no. 2, pp. 70–82, 2010. forcement Learning. InTech, 2011. [57] M. Frank, J. Leitner, M. Stollenga, A. Forster,¨ and [39] M. Stolle, “Automated discovery of options in reinforce- J. Schmidhuber, “Curiosity driven reinforcement learning ment learning,” Ph.D. dissertation, McGill University, for motion planning on humanoids,” Frontiers in neuro- 2004. robotics, vol. 7, p. 25, 2014. [40] K. P. Murphy, “A survey of pomdp solution techniques,” [58] S. Mohamed and D. J. Rezende, “Variational information environment, vol. 2, p. X3, 2000. maximisation for intrinsically motivated reinforcement [41] M. Hausknecht and P. Stone, “Deep recurrent q-learning learning,” in Advances in neural information processing for partially observable mdps,” 2015. systems, 2015, pp. 2125–2133. [42] M. Egorov, “Deep reinforcement learning with pomdps,” [59] J. Schmidhuber, “Formal theory of creativity, fun, and 2015. intrinsic motivation (1990–2010),” IEEE Transactions on [43] C. C. White, “Procedures for the solution of a finite- Autonomous Mental Development, vol. 2, no. 3, pp. 230– horizon, partially observed, semi-markov optimization 247, 2010. problem,” Operations Research, vol. 24, no. 2, pp. 348– [60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, 358, 1976. C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, [44] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Is- “Planning and acting in partially observable ard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Leven- domains,” Artificial intelligence, vol. 101, no. 1, pp. 99– berg, D. Mane,´ R. Monga, S. Moore, D. Murray, C. Olah, 15

[61] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[62] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.