
Deep Hierarchical Reinforcement Learning in Partially Observable Markov Decision Processes

Le Pham Tuyen*, Ngo Anh Vien†, Md. Abu Layek*, TaeChoong Chung*
*Artificial Intelligence Lab, Computer Science and Engineering Department, Kyung Hee University, Yongin, Gyeonggi 446-701, South Korea
{tuyenple, layek, tcchung}@khu.ac.kr
†EEECS/ECIT, Queen's University Belfast, Belfast, UK
[email protected]

Abstract—In recent years, reinforcement learning has achieved many remarkable successes due to the growing adoption of deep learning techniques and the rapid growth in computing power. Nevertheless, it is well known that flat reinforcement learning algorithms are often not able to learn well and data-efficiently in tasks having hierarchical structures, e.g. consisting of multiple subtasks. Hierarchical reinforcement learning is a principled approach that is able to tackle these challenging tasks. On the other hand, many real-world tasks usually have only partial observability, in which state measurements are often imperfect and partially observable. The problems of RL in such settings can be formulated as a partially observable Markov decision process (POMDP). In this paper, we study hierarchical RL in POMDP, in which the tasks have only partial observability and possess hierarchical properties. We propose a hierarchical deep reinforcement learning approach for learning in hierarchical POMDP. The deep hierarchical RL algorithm is proposed to apply to both MDP and POMDP learning. We evaluate the proposed algorithm on various challenging hierarchical POMDP.

Index Terms—Hierarchical Deep Reinforcement Learning, Partially Observable MDP (POMDP), Semi-MDP, Partially Observable Semi-MDP (POSMDP)

*Corresponding author.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

Reinforcement Learning (RL) [1] is a subfield of machine learning focused on learning a policy in order to maximize the total cumulative reward in an unknown environment. RL has been applied to many areas such as robotics [2][3][4][5], finance [6][7][8], computer games [9][10][11] and other applications [12][13][14]. RL is divided into two approaches: the value-based approach and the policy-based approach [15]. A typical value-based approach tries to obtain an optimal policy by finding optimal value functions. The value functions are updated using the immediate reward and the discounted value of the next state. Some methods based on this approach are Q-learning, SARSA, and TD-learning [1]. In contrast, the policy-based approach directly learns a parameterized policy that maximizes the cumulative discounted reward. Some techniques used for searching for optimal parameters of the policy are policy gradient [16], expectation maximization [17], and evolutionary algorithms [18]. In addition, a hybrid of the value-based approach and the policy-based approach is called actor-critic [19]. Recently, RL algorithms integrated with a deep neural network (DNN) [10] achieved good performance, even better than human performance, at some tasks such as Atari games [10] and the game of Go [11]. However, obtaining good performance on a task consisting of multiple subtasks, such as Mazebase [20] and Montezuma's Revenge [9], is still a major challenge for flat RL algorithms. Hierarchical reinforcement learning (HRL) [21] was developed to tackle these domains. HRL decomposes a RL problem into a hierarchy of subproblems (subtasks), each of which can itself be a RL problem. Identical subtasks can be gathered into one group and controlled by the same policy. As a result, hierarchical decomposition represents the problem in a compact form and reduces the computational complexity. Various approaches to decomposing the RL problem have been proposed, such as options [21], HAMs [22][23], MAXQ [24], Bayesian HRL [25, 26, 27] and some other advanced techniques [28]. All approaches commonly consider the semi-Markov decision process [21] as a base theory. Recently, many studies have combined HRL with deep neural networks in different ways to improve performance on hierarchical tasks [29][30][31][32][33][34][35]. Bacon et al. [30] proposed an option-critic architecture, which has a fixed number of intra-options, each of which is followed by a "deep" policy. At each time step, only one option is activated and is selected by another policy that is called the "policy over options".

DeepMind [33] also proposed a deep hierarchical framework inspired by a classical framework called feudal reinforcement learning [36]. Similarly, Kulkarni et al. [37] proposed a deep hierarchical framework of two levels in which the high-level controller produces a subgoal and the low-level controller performs primitive actions to obtain the subgoal. This framework is useful for solving problems with multiple subgoals, such as Montezuma's Revenge [9] and games in Mazebase [20]. Other studies have tried to tackle more challenging problems in HRL, such as discovering subgoals [38] and adaptively finding the number of options [39].

Though many studies have made great efforts on this topic, most of them rely on an assumption of full observability,

where a learning agent can observe the environment states fully. In other words, the environment is represented as a Markov decision process (MDP). This assumption does not reflect the nature of real-world applications, in which the agent only observes partial information about the environment states. Therefore, the environment in this case is represented as a POMDP. In order for the agent to learn in such a POMDP environment, more advanced techniques are required in order to have a good prediction over environment responses, such as maintaining a belief distribution over unobservable states or, alternatively, using a recurrent neural network (RNN) [40] to summarize an observable history. Recently, there have been some studies using deep RNNs to deal with learning in POMDP environments [41][42].

In this paper, we develop a deep HRL algorithm for POMDP problems inspired by the deep HRL framework in [37]. The agent in this framework makes decisions through a hierarchical policy of two levels. The top policy determines the subgoal to be achieved, while the lower-level policy performs primitive actions to achieve the selected subgoal. To learn in POMDP, we represent all policies using RNNs. The RNN in the lower-level policies constructs an internal state to remember the whole states observed during the course of interaction with the environment. The top policy is a RNN that encodes a sequence of observations during the execution of a selected subgoal.

We highlight our contributions as follows. First, we exploit the advantages of RNNs to integrate with hierarchical RL in order to handle learning on challenging and complex tasks in POMDP. Second, this integration leads to a new hierarchical POMDP learning framework that is more scalable and data-efficient.

The rest of the paper is organized as follows. Section II reviews the underlying knowledge such as the semi-Markov decision process, the partially observable Markov decision process and deep reinforcement learning. Our contributions are described in Section III, which consists of two parts. The deep model part describes all elements in our algorithm, and the learning algorithm part summarizes the algorithm in a generalized form. The usefulness of the proposed algorithms is demonstrated through POMDP benchmarks in Section IV. Finally, the conclusion is given in Section V.

II. BACKGROUND

In this section, we briefly review all underlying theories that the content of this paper relies on: the Markov decision process, the semi-Markov decision process for hierarchical RL, the partially observable Markov decision process for RL, and deep reinforcement learning.

A. Markov Decision Process

A discrete MDP models a sequence of decisions of a learning agent interacting with an environment at some discrete time scale, t = 1, 2, .... Formally, an MDP consists of a tuple of five elements {S, A, P, r, γ}, where S is a discrete state space, A is a discrete action space, P(s_{t+1}, s_t, a_t) = p(s_{t+1}|s_t, a_t) is a transition function that measures the probability of obtaining a next state s_{t+1} given a current state-action pair (s_t, a_t), r(s_t, a_t) defines an immediate reward achieved at each state-action pair, and γ ∈ (0, 1) denotes a discount factor. The MDP relies on the Markov property that the next state only depends on the current state-action pair:

p(s_{t+1} | {s_1, a_1, s_2, a_2, ..., s_t, a_t}) = p(s_{t+1} | s_t, a_t).

A policy of a RL algorithm is denoted by π, which is the probability of taking an action a given a state s: π = P(a|s). The goal of a RL algorithm is to find an optimal policy π* in order to maximize the expected total discounted reward

J(π) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ].   (1)

B. Semi-Markov Decision Process for Hierarchical Reinforcement Learning

Learning over different levels of policy is the main challenge for hierarchical tasks. The semi-Markov decision process (SMDP) [21], which is an extension of the MDP, was developed to deal with this challenge. In this theory, the concept of "options" is introduced as a type of temporal abstraction. An "option" ξ ∈ Ξ is defined by three elements: an option's policy π^ξ, a termination condition β^ξ, and an initiation set I ⊆ S, denoting the set of states in the option. In addition, a policy over options μ(ξ|s) is introduced to select options. An option ξ is executed as follows. First, when an option's policy π^ξ is taken, the action a_t is selected based on π^ξ. The environment then transitions to state s_{t+1}. The option either terminates or continues according to a termination probability β^ξ(s_{t+1}). When the option terminates, the agent can select a next option based on μ(ξ|s_{t+1}). The total discounted reward received by executing option ξ is defined as

R_s^ξ = E[ Σ_{i=t}^{t+k-1} γ^{i-t} r^ξ(s_i, a_i) ].   (2)

The multi-time state transition model of option ξ [43], which is initiated in state s and terminates at state s' after k steps, has the formula

P_{ss'}^ξ = Σ_{k=1}^{∞} P^ξ(s', k | s, ξ) γ^k,   (3)

where P^ξ(s', k|s, ξ) is the joint probability of ending up at s' after k steps when taking option ξ at state s. Given this, we can write the Bellman equation for the value function of a policy μ over options:

V^μ(s) = Σ_{ξ∈Ξ} μ(ξ|s) [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V^μ(s') ]   (4)

and the option-value function:

Q^μ(s, ξ) = R_s^ξ + Σ_{s'} P_{ss'}^ξ Σ_{ξ'∈Ξ} μ(ξ'|s') Q^μ(s', ξ').   (5)

Similarly, the corresponding optimal Bellman equations are as follows:

V^*(s) = max_{ξ∈Ξ} [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V^*(s') ]   (6)

Q^*(s, ξ) = R_s^ξ + Σ_{s'} P_{ss'}^ξ max_{ξ'∈Ξ} Q^*(s', ξ').   (7)

The optimal Bellman equation can be computed using synchronous value iteration (SVI) [21], which iterates the following step for every state:

V(s) = max_{ξ∈Ξ} [ R_s^ξ + Σ_{s'} P_{ss'}^ξ V(s') ].   (8)

When the option model is unknown, Q(s, ξ) can be estimated using a Q-learning algorithm with the update

Q(s, ξ) ← Q(s, ξ) + α [ r(s, ξ(s)) + γ^k max_{ξ'∈Ξ} Q(s', ξ') − Q(s, ξ) ],   (9)

where α denotes the learning rate, k denotes the number of time steps elapsing between s and s', and r denotes an immediate reward if ξ(s) is a primitive action; otherwise r is the total reward collected while executing option ξ(s).

C. Partially Observable Markov Decision Process in Reinforcement Learning

In many real-world tasks, the agent might not have full observability over the environment. Those tasks can in principle be formulated as a POMDP, which is defined as a tuple of six components {S, A, P, r, Ω, Z}, where S, A, P, r are the state space, action space, transition function and reward function, respectively, as in a Markov decision process. Ω and Z are the observation space and the observation model, respectively. If the POMDP model is known, the optimal approach is to maintain a hidden state b_t called the belief state. The belief defines the probability of being in state s according to the history of actions and observations. Given a new action and observation, belief updates are performed using the Bayes rule [44] and defined as follows:

b'(s') ∝ Z(s', a, o) Σ_{s∈S} P(s, a, s') b(s).   (10)

However, exact belief updates require a high computational cost and expensive memory [40]. Another approach is to use a Q-learning agent with function approximation, which uses the Q-learning algorithm for updating the policy. Because updating the Q-value from an observation can be less accurate than estimating it from a complete state, a better way would be for a POMDP agent using Q-learning to use the last k observations as input to the policy. Nevertheless, the problem with using a finite number of observations is that key-event observations far in the past would be neglected in future decisions. For this reason, a RNN is used to maintain a long-term state, as in [41]. Our model, which uses RNNs at different levels of the hierarchical policies, is expected to take advantage of this in POMDP environments.

D. Deep Reinforcement Learning

Recent advances in deep learning [45] are widely applied to reinforcement learning to form deep reinforcement learning. A few years ago, reinforcement learning still used "shallow" policies such as tabular, linear, radial basis networks, or neural networks of a few layers. Shallow policies have many limitations, especially in representing highly complex behaviors and in the computational cost of updating parameters. In contrast, deep neural networks in deep reinforcement learning can extract more information from the raw inputs by pushing them through multiple neural layers such as multilayer perceptron (MLP) layers and convolutional (CONV) layers. Multiple layers in DNNs can have a lot of parameters, allowing them to represent highly non-linear problems. The Deep Q-Network (DQN) was proposed recently by Google DeepMind [10]; it opened a new era of deep reinforcement learning and influenced most later studies in deep reinforcement learning. In terms of architecture, a Q-network parameterized by θ, e.g., Q(s, a|θ), is built on a convolutional neural network (CNN) which receives an input of four images of size 84 × 84 that is processed by three hidden CONV layers. The final hidden layer is a MLP layer with 512 rectifier units. The output layer is a MLP layer which has a number of outputs equal to the number of actions. In terms of the learning algorithm, DQN learns Q-value functions iteratively by updating the Q-value estimation via the temporal difference error:

Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) ).   (11)

In addition, the stability of DQN also relies on two mechanisms. The first mechanism is the experience replay memory, which stores transition data in the form of tuples {s, a, s', r}. It allows the agent to uniformly sample from and train on previous data (off-policy) in batches, thus reducing the variance of learning updates and breaking the temporal correlation among data samples. The second mechanism is the target Q-network, which is parameterized by θ', e.g., Q̂(s, a|θ'), and is a copy of the main Q-network. The target Q-network is used to estimate the loss function as follows:

L = E[ (y − Q(s, a|θ))^2 ],   (12)

where y = r + γ max_{a'} Q̂(s', a'|θ'). Initially, the parameters of the target Q-network are the same as those of the main Q-network. However, during the learning iterations, they are only updated at a specified time step. This update rule causes the target Q-network to decouple from the main Q-network and improves the stability of the learning process.

Many other models based on DQN have been developed, such as Double DQN [46], Dueling DQN [47], and Prioritized Experience Replay [48]. On the other hand, deep neural networks can also be integrated into methods other than estimating Q-values, such as representing the policy in policy search algorithms [5], estimating the advantage function [49], or mixing an actor network and a critic network [50].

Recently, researchers have used RNNs in reinforcement learning to deal with POMDP domains. RNNs have been successfully applied to domains such as natural language processing and speech recognition, and are expected to be advantageous in the POMDP domain, which needs to process a sequence of observations rather than a single input. Our proposed method uses RNNs not only for solving the POMDP domain but also for solving these domains in a hierarchical form of reinforcement learning.
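For a small discrete POMDP, the exact belief update of Eq. (10) takes only a few lines. The sketch below is a minimal illustration under assumed tabular arrays Z (observation model) and P (transition model); the array names and shapes are our own choice for the example and are not part of the paper.

import numpy as np

def belief_update(b, a, o, P, Z):
    """Exact Bayes belief update, Eq. (10): b'(s') ∝ Z(s',a,o) * sum_s P(s,a,s') b(s).

    b: belief over states, shape (S,)
    P: transition model, P[s, a, s'] = p(s'|s,a), shape (S, A, S)
    Z: observation model, Z[s', a, o] = p(o|s',a), shape (S, A, O)
    """
    predicted = b @ P[:, a, :]                 # sum_s P(s,a,s') b(s), shape (S,)
    unnormalized = Z[:, a, o] * predicted
    return unnormalized / unnormalized.sum()   # normalize so the belief sums to one

As the section notes, this exact update scales poorly with the size of the state space, which is one motivation for replacing the belief with a recurrent hidden state in the method proposed below.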
III. HIERARCHICAL DEEP RECURRENT Q-LEARNING IN PARTIALLY OBSERVABLE SEMI-MARKOV DECISION PROCESS

In this section, we describe the hierarchical deep recurrent Q-learning algorithm (hDRQN), our proposed framework. First, we describe the model of hDRQN and explain the way to learn this model. Second, we describe some components of our algorithm such as sampling strategies, subgoal definition, and extrinsic and intrinsic reward functions. We rely on a partially observable semi-Markov decision process (POSMDP) setting, where the agent follows a hierarchical policy to solve POMDP domains.

The POSMDP setting [43, 26] is described as follows. The domain is decomposed into multiple subdomains. Each subdomain is equivalent to an option ξ in the SMDP framework and has a subgoal g ∈ Ω that needs to be achieved before switching to another option. Within an option, the agent observes a partial state o ∈ Ω, takes an action a ∈ A, receives an extrinsic reward r^ex ∈ R, and transits to another state s' (but the agent only observes a part of the state, o' ∈ Ω). The agent executes option ξ until it terminates (either the subgoal g is obtained or the termination signal is raised). Afterward, the agent selects and executes another option. The sequence of options is repeated until the agent reaches the final goal. In addition, for obtaining subgoal g in each option, an intrinsic reward function r^in ∈ R is maintained. While the objective of a subdomain is to maximize the cumulative discounted intrinsic reward, the objective of the whole domain is to maximize the cumulative discounted extrinsic reward. Specifically, the belief update given a taken option ξ, an observation o and the current belief b is defined as

b'(s') ∝ Σ_{k=1}^{∞} γ^{k-1} Σ_{s} P(s', o, k | s, ξ) b(s),

where P(s', o, k|s, ξ) is a joint transition and observation function of the underlying POSMDP model of the environment.

We adopt notation similar to that of the two frameworks MAXQ [24] and Options [21] to describe our problem. We denote by Q^M(o_t, ξ|θ^M) the Q-value function of the meta-controller at state o_t, θ^M (in which we use a RNN to encode the past observations into its weights θ^M) and option (macro-action) ξ (assuming ξ has a corresponding subgoal g_t). We note that the pair {o_t, θ^M} represents the belief or history at time t, b_t (we will use them interchangeably: Q^M(b, ξ) or Q^M(o, ξ|θ^M)). The multi-time observation model of option ξ [43], which is initiated in belief b and observes o, has the following formula:

P(o|b, ξ) = Σ_{k=1}^{∞} γ^k Σ_{s'} Σ_{s} P(s', o, k | s, ξ) b(s).   (13)

Given the above, we can write the Bellman equation for the value function of the meta-controller policy π over options as

V^M(b) = Σ_{ξ} π(ξ|b) [ R_b^ξ + Σ_{o'} P(o'|b, ξ) V^M(b') ]   (14)

and the option-value function

Q^M(b, ξ) = R_b^ξ + Σ_{o'} P(o'|b, ξ) Σ_{ξ'∈Ξ} μ(ξ'|b') Q^M(b', ξ').   (15)

Similarly to the MAXQ framework, the reward term R_b^ξ is the total reward collected by executing the option ξ and is defined as V^ξ(b). Its corresponding Q-value function is defined as Q^S(b, a) (the value function for sub-controllers). When using a RNN, we also denote Q^S(b, a) as Q^S({o_t, g_t}, a|θ^S), in which θ^S are the weights of the sub-controller network that encodes previous observations, and {o_t, g_t} denotes the observation input to the sub-controller.

Our hierarchical frameworks are illustrated in Fig. 1. The framework in Fig. 1a is inspired by a related idea in [37]. The difference is that our framework is built on two deep recurrent neural policies: a meta-controller and a sub-controller. The meta-controller is equivalent to a "policy over subgoals" that receives the current observation o_t and determines a new subgoal g_t. The sub-controller is equivalent to the option's policy, which directly interacts with the environment by performing actions a_t. The sub-controller receives both g_t and o_t as inputs to the deep neural network. Each controller contains an arrow (Fig. 1) pointing to itself, which indicates that the controller employs a recurrent neural network. In addition, an internal component called the "critic" is used to determine whether the agent has achieved the subgoal or not and to produce an intrinsic reward that is used to learn the sub-controller. In contrast to the framework in Fig. 1a, the framework in Fig. 1b does not use the current observation to determine the subgoal in the meta-controller but instead uses the last hidden state h^S_t of the sub-controller. The last hidden state, which is inferred from a sequence of observations of the sub-controller, is expected to help the meta-controller correctly determine the next subgoal.

Fig. 1: Hierarchical deep recurrent Q-network frameworks. (a) Framework 1; (b) Framework 2.
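To make the difference between the two frameworks concrete, the sketch below shows one way the decision loop of Fig. 1 could be organized in Python. It is only an illustration of the control flow described above; the env, meta, sub and critic objects and their methods are hypothetical stand-ins introduced for this sketch, not an API from the paper.

def run_episode(env, meta, sub, critic, framework=2):
    # Decision loop of Fig. 1: the meta-controller picks subgoals and the
    # sub-controller acts until the critic reports the subgoal (or episode) done.
    o = env.reset()
    h_meta = meta.zero_state()
    h_sub = sub.zero_state()
    while not env.episode_over():
        # Framework 1 feeds the raw observation o_t to the meta-controller;
        # framework 2 feeds the sub-controller's last hidden state h^S_t instead.
        meta_input = o if framework == 1 else h_sub
        g, h_meta = meta.select_subgoal(meta_input, h_meta)
        h_sub = sub.zero_state()
        while not env.episode_over() and not critic.reached(o, g):
            a, h_sub = sub.select_action((o, g), h_sub)
            o, r_ext = env.step(a)                 # extrinsic reward from the environment
            r_int = critic.intrinsic_reward(o, g)  # intrinsic reward produced by the critic
            # both rewards would be stored for learning; see Section III-E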

As mentioned in the previous section, RNNs are used in our framework in order to enable learning in POMDP. Particularly, CNNs are used to learn a low-dimensional representation for image inputs, while the RNN is used to capture the temporal relation of the observation data. Without a recurrent layer as in RNNs, CNNs cannot accurately approximate the state feature from observations in a POMDP domain. The procedure of the agent is illustrated in Fig. 2. The meta-controller and sub-controller use the Deep Recurrent Q-Network (DRQN), as described in [41]. Particularly, at step t, the meta-controller takes an observation o_t from the environment (framework 1) or the last hidden state of the sub-controller generated by the previous sequence of observations (framework 2), extracts state features through some deep neural layers, internally constructs a hidden state h_t^M, and produces the Q subgoal values Q^M(o_t, g_t|θ^M). The Q subgoal values are then used to determine the next subgoal g_{t+k}. Similarly, the sub-controller receives both the observation o_t and the subgoal g_t, extracts their features, constructs a hidden state h_t^S, and produces the Q action values Q^S({o_t, g_t}, a_t|θ^S), which are used to determine the next action a_{t+1}. These explanations are formalized in the following equations:

Φ^M = f^extract(o_t)  (framework 1),   Φ^M = f^extract(h^S)  (framework 2)   (16)

h_t^M, Q^M(o_t, g_t|θ^M) = f^M(Φ^M, h_{t-1}^M)   (17)

Φ^S = f^extract(o_t, g_t)   (18)

h_t^S, Q^S({o_t, g_t}, a_t|θ^S) = f^S(Φ^S, h_{t-1}^S),   (19)

where Φ^M and Φ^S are the features after the extraction process of the meta-controller and sub-controller, respectively. f^M and f^S are the respective recurrent networks of the meta-controller and sub-controller. They receive the state features and a hidden state, and then provide the next hidden state and Q value function.

The networks for the controllers are illustrated in Fig. 3. For framework 1, we use the network shown in Fig. 3a for both the meta-controller and the sub-controller. A sequence of four CONV layers and ReLU layers interleaved together is used to extract information from raw observations. A RNN layer, specifically an LSTM, is employed after the feature extraction to memorize information from previous observations. The output of the RNN layer is split into an Advantage stream A and a Value stream V before being unified into the Q-value. This architecture inherits from the Dueling architecture [47], effectively learning states without having to learn the effect of each action on that state. For framework 2, we use the network in Fig. 3a for the sub-controller and the network in Fig. 3b for the meta-controller. The meta-controller in framework 2 uses as its state the internal hidden state from the sub-controller. By passing it through four fully connected layers and ReLU layers, the features of the meta-controller are extracted. The rest of the network is the same as the network for framework 1.

A. Learning Model

We use the state-of-the-art Double DQN to learn the parameters of the network. Particularly, the controllers estimate the following Q value functions:

Q^M(ô_t, g_t) = Q^M(ô_t, g_t) + α ( r^ex + γ^k max_{g_{t+k}} Q^M(ô_{t+k}, g_{t+k}) − Q^M(ô_t, g_t) )   (20)

and

Q^S({o_t, g_t}, a_t) = Q^S({o_t, g_t}, a_t) + α ( r^in + γ max_{a_{t+1}} Q^S({o_{t+1}, g_t}, a_{t+1}) − Q^S({o_t, g_t}, a_t) ),   (21)

where ô can be a direct observation or an internal hidden state generated by the sub-controller. Let θ^M and θ^S be the parameters (weights and biases) of the deep neural networks that parameterize the networks Q^M(ô_t, g_t) and Q^S({o_t, g_t}, a_t), correspondingly, e.g. Q^M(ô_t, g_t) = Q^M(ô_t, g_t|θ^M) and Q^S({o_t, g_t}, a_t) = Q^S({o_t, g_t}, a_t|θ^S). Then, Q^M and Q^S are trained by minimizing the loss functions L^M and L^S, correspondingly. L^M can be formulated as

L^M = E_{(o,g,o',g',r^ex) ∼ M^M} [ ( y_i^M − Q^M(o, g|θ^M) )^2 ],   (22)

where E denotes the expectation over a batch of data uniformly sampled from an experience replay M^M, i is the iteration number in the batch of samples, and

y_i^M = r^ex + γ Q^{M'}( o', argmax_{g'} Q^M(o', g'|θ^M) | θ^{M'} ).   (23)

Similarly, the formula of L^S is

L^S = E_{(o,g,a,r^in) ∼ M^S} [ ( y_i^S − Q^S({o, g}, a|θ^S) )^2 ],   (24)

where

y_i^S = r^in + γ Q^{S'}( {o', g}, argmax_{a'} Q^S({o', g}, a'|θ^S) | θ^{S'} ).   (25)

M^S is the experience replay that stores the transition data from the sub-controller, and Q^{S'} is the target network of Q^S. Intuitively, in contrast to DQN, which uses the maximum operator for both selecting and evaluating an action, Double DQN separately uses the main Q network to greedily estimate the next action and the target Q network to estimate its value. This method has been shown to achieve better performance than DQN on Atari games [46].

Fig. 2: Procedure of Hierarchical Deep Recurrent Q-Learning (hDRQN). (a) Framework 1; (b) Framework 2. The framework is based on [37].

B. Minibatch Sampling Strategy

For updating the RNNs in our model, we need to pass a sequence of samples. Particularly, some episodes from the experience replay are uniformly sampled and are processed from the beginning of the episode to the end of the episode. This strategy, called Bootstrapped Sequential Updates [41], is an ideal method to update RNNs because the hidden state can carry all information through the whole episode. However, this strategy is computationally expensive for a long episode, which can contain many time steps. Another approach proposed in [41] has been shown to achieve the same performance as Bootstrapped Sequential Updates while reducing the computational complexity. This strategy is called Bootstrapped Random Updates and proceeds as follows.

This strategy also randomly selects a batch of episodes from the experience replay. Then, for each episode, we begin at a random transition and select a sequence of n transitions. The value of n, which affects the performance of our algorithm, is analyzed in Section IV. We apply the same procedure of Bootstrapped Random Updates in our algorithm. In addition, the mechanism explained in [51] is also applied. That study explains a problem when updating DRQN: using the first observations in a sequence of transitions to update the Q value function might be inaccurate. Thus, the solution is to use only the last observations to update DRQN. Particularly, our method uses the last n/2 transitions to update the Q-value.
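A minimal sketch of this sampling scheme is given below, assuming the replay memory is simply a list of episodes, each of which is a list of transition tuples; this data layout is our assumption for illustration only.

import random

def sample_sequences(replay_memory, batch_size, n):
    """Bootstrapped Random Updates: pick random episodes, then a random
    length-n window inside each episode. Only the last n//2 transitions of
    each window are later used to compute the loss, as suggested in [51]."""
    batch = []
    episodes = random.sample(replay_memory, min(batch_size, len(replay_memory)))
    for episode in episodes:
        if len(episode) < n:
            continue                               # skip episodes shorter than the window
        start = random.randint(0, len(episode) - n)
        window = episode[start:start + n]
        batch.append((window, n // 2))             # remember how many tail steps to train on
    return batch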

C. Subgoal Definition

Our model is based on the "option" framework. Learning an option means using flat deep RL algorithms to achieve the subgoal of that option. However, discovering subgoals among the existing states of the environment is still one of the challenges in hierarchical reinforcement learning. To simplify the model, we assume that a set of pre-defined subgoals is provided in advance. The pre-defined subgoals are based on object-oriented MDPs [52], where entities or objects in the environment are encoded as subgoals.

Fig. 3: Network models. (a) The network for framework 1; (b) Meta-controller network in framework 2.

D. Intrinsic and Extrinsic Reward Functions

Traditional RL accumulates all rewards and penalties in a single reward function, which is hard to learn for a specified task in a complex domain. In contrast, hierarchical RL introduces the notions of an intrinsic reward function and an extrinsic reward function. Intrinsic motivation originally comes from psychology, cognitive science, and neuroscience [53] and has been applied to hierarchical RL [54][55][56][57][58][59]. Our framework follows the model of intrinsic motivation in [55]. Particularly, within an option (or skill), the agent needs to learn the option's policy (the sub-controller in our framework) to obtain a subgoal (a salient event) under the reinforcement of an intrinsic reward, while for the overall task a policy over options (the meta-controller) is learned to generate a sequence of subgoals while being reinforced by an extrinsic reward. Defining "good" intrinsic and extrinsic reward functions is still an open question in reinforcement learning, and it is difficult to find a reward function model that generalizes to all domains. To illustrate the notions above, Fig. 4 describes the domain of multiple goals in four-rooms, which is used to evaluate our algorithm in Section IV. The four-rooms map contains a number of objects: an agent (in black), two obstacles (in red) and two goals (in blue and green). These objects are randomly located on the map. At each time step, the agent has to take one of four actions: top, down, left or right, and it has to move to the goal locations in the proper order: the blue goal first and then the green goal. If the agent obtains all goals in the right order, it receives a big reward; otherwise, it only receives a small reward. In addition, the agent has to learn to avoid the obstacles if it does not want to be penalized. For this example, a salient event is equivalent to reaching a subgoal or hitting an obstacle. In addition, there are two skills the agent should learn. One is moving to the goals while correctly avoiding the obstacles, and the second is selecting which goal it should reach first. The intrinsic reward for each skill is generated based on the salient events encountered while exploring the environment. Particularly, for reaching a goal, the intrinsic reward includes the reward for reaching the goal successfully and the penalty if the agent encounters an obstacle. For reaching the goals in order, the reward includes a big reward if the agent reaches the goals in the proper order and a small reward if the agent reaches the goals in an improper order. A detailed explanation of the intrinsic and extrinsic rewards for this domain is included in Section IV.

E. Learning Algorithm

In this section, our contributions are summarized in the pseudo-code of Algorithm 1. The algorithm learns four neural networks: two networks for the meta-controller (Q^M and Q^{M'}) and two networks for the sub-controller (Q^S and Q^{S'}). They are parameterized by θ^M, θ^{M'}, θ^S and θ^{S'}, respectively. The architectures of the networks are described in Section III. In addition, the algorithm separately maintains two experience replay memories, M^M and M^S, to store transition data from the meta-controller and the sub-controller, respectively. Before starting the algorithm, the parameters of the main networks are randomly initialized and copied to the target networks. ε^M and ε^S are annealed from 1.0 to 0.1, which gradually increases the control of the controllers. The algorithm loops through a specified number of episodes (Line 9), and each episode is executed until the agent reaches a terminal state. To start an episode, first, a starting observation o_0 is obtained (Line 10). Next, the hidden states, which are inputs to the RNNs, are initialized with zero values (Lines 11 and 13) and are updated during the episode (Lines 15 and 17). Each subgoal is determined by passing the observation o or the hidden state h^S (depending on the framework) to the meta-controller (Line 15). Following an ε-greedy fashion, a subgoal is selected by the meta-controller if a random number is greater than ε; otherwise, a random subgoal is selected (Algorithm 2). The sub-controller is taught to reach the subgoal; when the subgoal is reached, a new subgoal is selected. The process is repeated until the final goal is obtained. The intrinsic reward is evaluated by the critic module and is stored in M^S (Line 19) for updating the sub-controller. Meanwhile, the extrinsic reward is directly received from the environment and is stored in M^M for updating the meta-controller (Line 24). Updating the controllers at Lines 20 and 21 is described in Section III-A and is summarized in Algorithm 3 and Algorithm 4.

Algorithm 1 hDRQN in POMDP
Require:
 1: POMDP M = {S, A, P, r, Ω, Z}
 2: Meta-controller with the main network Q^M and target network Q^{M'}, parameterized by θ^M and θ^{M'}, respectively
 3: Sub-controller with the main network Q^S and target network Q^{S'}, parameterized by θ^S and θ^{S'}, respectively
 4: Exploration rate ε^M for the meta-controller and ε^S for the sub-controller
 5: Experience replay memories M^M and M^S of the meta-controller and sub-controller, respectively
 6: A pre-defined set of subgoals G
 7: f^M and f^S are the recurrent networks of the meta-controller and sub-controller, respectively
Ensure:
 8: Initialize:
    - Experience replay memories M^M and M^S
    - Randomly initialize θ^M and θ^S
    - Assign values to the target networks: θ^{M'} ← θ^M and θ^{S'} ← θ^S
    - ε^M ← 1.0, decreasing to 0.1
    - ε^S ← 1.0, decreasing to 0.1
 9: for k = 1, 2, ..., K do
10:   Initialize the environment and get the start observation o
11:   Initialize hidden states: h^M ← 0
12:   while o is not terminal do
13:     Initialize hidden states: h^S ← 0
14:     Initialize the start observation: o_0 ← ô, where ô can be the observation o or the hidden state h^S
15:     Determine the subgoal: g, h^M ← EPS_GREEDY(ô, h^M, G, ε^M, Q^M, f^M)
16:     while o is not terminal and g is not reached do
17:       Determine the action: a, h^S ← EPS_GREEDY({o, g}, h^S, A, ε^S, Q^S, f^S)
18:       Execute action a, receive reward r, extrinsic reward r^ex, intrinsic reward r^in, and obtain the next state s'
19:       Store transition {{o, g}, a, r^in, {o', g'}} in M^S
20:       Update the sub-controller: SUB_UPDATE(M^S, Q^S, Q^{S'})
21:       Update the meta-controller: META_UPDATE(M^M, Q^M, Q^{M'})
22:       Transition to the next observation: o ← o'
23:     end while
24:     Store transition {o_0, g, r^ex_total, ô'} in M^M, where ô' can be the observation o' or the last hidden state h^S
25:   end while
26:   Anneal ε^M and ε^S
27: end for
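One detail worth making explicit is that Algorithm 1 fills the two replay memories at different time scales: M^S receives one transition per primitive step carrying the intrinsic reward (Line 19), while M^M receives one transition per completed subgoal carrying the extrinsic reward accumulated over the option (Line 24). The sketch below only illustrates this bookkeeping; the tuple layouts and the discounting helper are our own choices, stated here as assumptions rather than the paper's code.

def store_sub_transition(memory_s, o, g, a, r_in, o_next):
    # Line 19 of Algorithm 1: one sub-controller transition per primitive action.
    memory_s.append(((o, g), a, r_in, (o_next, g)))

def accumulate_extrinsic(r_ex_total, r_ex, steps_elapsed, gamma=0.99):
    # Extrinsic reward collected while the option runs, discounted by the number
    # of elapsed steps in the spirit of the option return in Eq. (2).
    return r_ex_total + (gamma ** steps_elapsed) * r_ex

def store_meta_transition(memory_m, o0, g, r_ex_total, o_hat_next):
    # Line 24 of Algorithm 1: one meta-controller transition per finished subgoal.
    # o0 and o_hat_next are raw observations in framework 1, or sub-controller
    # hidden states in framework 2.
    memory_m.append((o0, g, r_ex_total, o_hat_next))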

Algorithm 2 EPS_GREEDY(x, h, B, ε, Q, f)
Require:
 1: x: input of the Q network
 2: h: internal hidden state
 3: B: a set of outputs
 4: ε: exploration rate
 5: Q network and recurrent function f
Ensure:
 6: h ← f(x, h)
 7: if random() < ε then
 8:   o ← a random element from the set of outputs B
 9: else
10:   o ← argmax_{m∈B} Q(x, m)
11: end if
12: Return o, h

Algorithm 3 META_UPDATE(M^M, Q^M, Q^{M'})
Require:
 1: M^M: experience replay memory of the meta-controller
Ensure:
 2: Sample a mini-batch of {o, g, r^ex, o'} from M^M using the strategy explained in Section III-B
 3: Update the main network by minimizing the loss function
    L^M = E_{(o,g,o',g',r^ex) ∼ M^M} [ ( y_i^M − Q^M(o, g|θ^M) )^2 ],
    where y_i^M = r^ex + γ Q^{M'}( o', argmax_{g'} Q^M(o', g'|θ^M) | θ^{M'} )
 4: Update the target network: θ^{M'} ← τ θ^M + (1 − τ) θ^{M'}

Algorithm 4 SUB_UPDATE(M^S, Q^S, Q^{S'})
Require:
 1: M^S: experience replay memory of the sub-controller
Ensure:
 2: Sample a mini-batch of {{o, g}, a, r^in, {o', g'}} from M^S using the strategy explained in Section III-B
 3: Update the main network by minimizing the loss function
    L^S = E_{(o,g,a,r^in) ∼ M^S} [ ( y_i^S − Q^S({o, g}, a|θ^S) )^2 ],
    where y_i^S = r^in + γ Q^{S'}( {o', g}, argmax_{a'} Q^S({o', g}, a'|θ^S) | θ^{S'} )
 4: Update the target network: θ^{S'} ← τ θ^S + (1 − τ) θ^{S'}
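A compact Python rendering of Algorithm 2 and of the soft target update used in Algorithms 3-4 is shown below. It assumes a controller object exposing a recurrent step that returns Q-values and the next hidden state, and parameters stored as a dictionary of arrays; both interfaces are assumptions made for this sketch.

import random

def eps_greedy(x, h, outputs, eps, controller):
    """Algorithm 2: step the recurrent network, then pick a random output with
    probability eps, otherwise the output with the highest Q-value."""
    q_values, h_next = controller.step(x, h)       # h <- f(x, h), plus Q(x, .)
    if random.random() < eps:
        choice = random.choice(list(outputs))
    else:
        choice = max(outputs, key=lambda m: q_values[m])
    return choice, h_next

def soft_update(theta_target, theta_main, tau=0.001):
    """Target update of Algorithms 3-4: theta' <- tau*theta + (1-tau)*theta'."""
    return {k: tau * theta_main[k] + (1.0 - tau) * theta_target[k] for k in theta_main}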

Fig. 4: Example domain for illustrating the notions of intrinsic and extrinsic motivation.

IV. EXPERIMENTS

In this section, we evaluate two versions of the hierarchical deep recurrent network algorithm: hDRQNv1 is the algorithm formed by framework 1, and hDRQNv2 is the algorithm formed by framework 2. We compare them with flat algorithms (DQN, DRQN) and the state-of-the-art hierarchical RL algorithm (hDQN). The comparisons are performed on three domains. The domain of multiple goals in a gridworld is used to evaluate many aspects of the proposed algorithms. Meanwhile, the harder domain called multiple goals in four-rooms is used to benchmark the proposed algorithms. Finally, the most challenging game in ATARI 2600 [9], called Montezuma's Revenge, is used to confirm the efficiency of our proposed framework.

A. Implementation Details

We use Tensorflow [60] to implement our algorithms. The settings for each domain are different, but they have some commonalities, as follows. For the hDRQNv1 algorithm, the inputs to the meta-controller and sub-controller are an image of size 44 × 44 × 3 (a color image). The input image is resized from an observation of the area around the agent (either a 3 × 3 unit or a 5 × 5 unit area).
The image feature of 256 values, extracted through four CNN and ReLU layers, is put into an LSTM layer of 256 states to generate 256 output values, and an internal hidden state of 256 values is also constructed. For the hDRQNv2 algorithm, a hidden state of 256 values is put into the network of the meta-controller. The state is passed through four fully connected layers and ReLU layers instead of four CONV layers. The output is a feature of 256 values. The algorithm uses ADAM [61] for learning the neural network parameters with a learning rate of 0.001 for both the meta-controller network and the sub-controller. For updating the target network, a τ value of 0.001 is applied. The algorithm uses a discount factor of 0.99. The capacity of M^S and M^M is 1E5 and 5E4, respectively.

B. Multiple Goals in a Gridworld

The domain of multiple goals in a gridworld is a simpler form of multiple goals in four-rooms, which is described in Section III-D. In this domain, a gridworld map of size 11 × 11 units is used instead of the four-rooms map. At each time step, the agent only observes a part of the surrounding environment, either a 3 × 3 unit area (Fig. 5b) or a 5 × 5 unit area (Fig. 5c). The agent is allowed to choose one of four actions (top, left, right, bottom), which are deterministic. The agent cannot move if the action leads it into the wall. The rewards for the agent are defined as follows. If the agent hits an obstacle, it receives a penalty of minus one. If the agent reaches the two goals in the proper order, it receives a reward of one for each goal; otherwise, it only receives 0.01. To apply an hDRQN algorithm, an intrinsic reward function and an extrinsic reward function are defined as follows:

r^in = { 1    if the agent obtains the goal
       { −1   if the agent hits an obstacle   (26)

and

r^ex = { 1     if the agent reaches the goals in order
       { 0.01  otherwise.   (27)

Fig. 5: Multiple goals in gridworld. (a) Multiple goals in gridworld; (b) 3 × 3 unit; (c) 5 × 5 unit.
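The reward structure of Eqs. (26)-(27) can be expressed as two small functions. The sketch below is our own illustrative encoding of those rules; the event flags passed in are assumptions about how a gridworld simulator might report collisions and goal ordering, and the zero reward for all other steps is likewise our assumption.

def intrinsic_reward(reached_subgoal, hit_obstacle):
    # Eq. (26): +1 for obtaining the current subgoal, -1 for hitting an obstacle.
    if hit_obstacle:
        return -1.0
    return 1.0 if reached_subgoal else 0.0

def extrinsic_reward(goals_visited, correct_order=("blue", "green")):
    # Eq. (27): 1 when the goals are collected in the proper order (blue first,
    # then green), 0.01 otherwise.
    return 1.0 if tuple(goals_visited) == correct_order else 0.01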

The first evaluation, reported in Fig. 6, is a comparison of different lengths of selected transitions, as discussed in Section III-B. The agent in this evaluation can observe an area of 5 × 5 units. We report the performance over three runs of 20000 episodes, and each episode has 50 steps. The number of steps per episode ensures that the agent can explore any location on the map. In the figures on the left (hDRQNv1) and in the middle (hDRQNv2), we use a fixed length of meta-transitions (n^M = 1) and compare different lengths of sub-transitions. Conversely, the figures on the right show the performance of the algorithm using a fixed length of sub-transitions (n^S = 8) and compare different lengths of meta-transitions. With a fixed length of meta-transitions, the algorithm performs well with a long length of sub-transitions (n^S = 8 or n^S = 12); the performance decreases when the length of sub-transitions is decreased. Intuitively, the RNN needs a sequence of transitions that is long enough to increase the probability that the agent will reach the subgoal within that sequence. Another observation is that, with n^S = 8 or n^S = 12, there is little difference in performance. This is reasonable because only eight transitions are needed for the agent to reach the subgoals. For a fixed length of sub-transitions (n^S = 8), with the hDRQNv1 algorithm, the setting with n^M = 2 has low performance and high variance compared to the setting with n^M = 1. The reason is that while the sub-controller for the two settings has the same performance (Fig. 6f), the meta-controller with n^M = 1 performs better than the meta-controller with n^M = 2. Meanwhile, with the hDRQNv2 algorithm, the performance is the same for both settings n^M = 1 and n^M = 2. This means that the hidden state from the sub-controller is a better input for determining the subgoal than a raw observation, as it makes the algorithm independent of the length of the meta-transition. The number of steps to obtain the two goals in order is around 22.

The next evaluation is a comparison at different levels of observation. Fig. 7 shows the performance of the hDRQN algorithms with a 3 × 3 observable agent compared with a 5 × 5 observable agent and a fully observable agent. It is clear that a fully observable agent has more information around it than a 5 × 5 observable agent and a 3 × 3 observable agent; thus, the agent with a larger observation area can quickly explore and localize the environment completely. As a result, the performance of the agent with a larger observation area is better than that of the agent with a smaller observing ability. From the figure, the performance of a 5 × 5 observable agent using hDRQNv2 seems to converge faster than that of a fully observable agent. However, the performance of the fully observable agent surpasses the performance of the 5 × 5 observable agent in the end.

In the last evaluation of this domain, we compare the performance of the proposed algorithms with the well-known algorithms DQN, DRQN, and hDQN [37]. All algorithms assume that the agent only observes an area of 5 units around it. The results are shown in Fig. 8. For both domains of two goals and three goals, the hDRQN algorithms outperform the other algorithms, and hDRQNv2 has the best performance. The hDQN algorithm, which can operate in a hierarchical domain, is better than the flat algorithms but not better than the hDRQN algorithms. It might be that the hDQN algorithm is designed only for fully observed domains and has poor performance in partially observable domains.

1.50 sub-2 sub-2 sub-4 sub-4 1.25 1.5 1.5 sub-8 sub-8 1.00 sub-12 sub-12 1.0 1.0 0.75

0.50 Reward

Reward 0.5 Reward 0.5 0.25 hDRQNv1 meta-1 0.00 0.0 0.0 hDRQNv2 meta-1 0.25 hDRQNv1 meta-2 hDRQNv2 meta-2 0.50 0.5 0.5 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (a) hDRQNv1: reward (e) hDRQNv2: reward (i) Reward

1.00 sub-2 0.75 0.75 sub-4 0.75 sub-8 0.50 0.50 0.50 sub-12

0.25 0.25 0.25

0.00 0.00 0.00 Reward Reward Reward 0.25 0.25 0.25 sub-2 hDRQNv1 meta-1 sub-4 0.50 0.50 0.50 hDRQNv2 meta-1 sub-8 hDRQNv1 meta-2 0.75 sub-12 0.75 0.75 hDRQNv2 meta-2

0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (b) hDRQNv1: intrinsic (f) hDRQNv2: intrinsic (j) Intrinsic

1.6 sub-2 1.75 sub-2 1.75 hDRQNv1 meta-1 1.4 sub-4 sub-4 hDRQNv2 meta-1 sub-8 1.50 sub-8 1.50 hDRQNv1 meta-2 1.2 sub-12 sub-12 hDRQNv2 meta-2 1.25 1.25 1.0 1.00 1.00 0.8 Reward Reward 0.75 Reward 0.75 0.6 0.50 0.50 0.4 0.25 0.2 0.25 0.00 0.00 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (c) hDRQNv1: extrinsic (g) hDRQNv2: extrinsic (k) Extrinsic

50 50 50 hDRQNv1 meta-1 hDRQNv2 meta-1 45 45 45 hDRQNv1 meta-2 hDRQNv2 meta-2 40 40 40

35 35 35 Steps Steps Steps

30 30 30 sub-2 sub-2 sub-4 sub-4 25 25 25 sub-8 sub-8 sub-12 sub-12 20 20 20 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 0 25 50 75 100 125 150 175 Timesteps*100 Timesteps*100 Timesteps*100 (d) hDRQNv1: steps (h) hDRQNv2: steps (l) Steps Fig. 6: Evaluation of different lengths of transitions. In the left-side figures, we use a fixed length of meta-transitions (nM = 1). The figures on the right use a fixed length of sub-transitions (nS = 8) .

C. Multiple Goals in a Four-rooms Domain

Fig. 7: Evaluation on different levels of observation.

In this domain, we apply the multiple goals domain to a complex map called four-rooms (Fig. 9). The dynamics of the environment in this domain are similar to those of the task in Section IV-B. The agent in this domain usually must pass through hallways to obtain goals that are randomly located in the four rooms. Originally, the four-rooms domain was an environment for testing hierarchical reinforcement learning algorithms [21].

The performance shown in Fig. 10a is averaged over three runs of 50000 episodes, and each episode has 50 time steps. Meanwhile, the performance shown in Fig. 10b is averaged over three runs of 100000 episodes, and each episode has 100 time steps. Similarly to the previous domain, the proposed algorithms outperform the other algorithms, especially the hDRQNv2 algorithm.

Fig. 8: Comparing hDRQN algorithms with some baseline algorithms. (a) Two goals in gridworld; (b) Three goals in gridworld.

Fig. 10: Comparing hDRQN algorithms with some baseline algorithms. (a) Two goals in four-rooms; (b) Three goals in four-rooms.

Fig. 9: Multiple goals in the four-rooms domain. (a) The map of multiple goals in the four-rooms domain; (b) 3 × 3 unit; (c) 5 × 5 unit.

D. Montezuma's Revenge Game in ATARI 2600

Montezuma's Revenge is the hardest game in ATARI 2600, where the DQN algorithm [10] can only achieve a score of zero. We use OpenAI Gym to simulate this domain [62]. The game is hard because the agent must execute a long sequence of actions until a state with a non-zero reward (delayed reward) can be visited. In addition, in order to obtain a state with larger rewards, the agent needs to reach a special state in advance. This paper evaluates our proposed algorithms on the first screen of the game (Fig. 11). Particularly, the agent, which only observes a part of the environment (Fig. 11b), needs to pass through the doors (the yellow lines in the top left and top right corners of the figure) to explore other screens. However, to pass through the doors, the agent first needs to pick up the key on the left side of the screen. Thus, the agent must learn to navigate to the key's location and then navigate back to a door to open the next screens. The agent earns 100 after it obtains the key and 300 after it reaches any door. In total, the agent can receive 400 for this screen.

The intrinsic reward function is defined to motivate the agent to explore the whole environment. Particularly, the agent receives an intrinsic value of 1 if it reaches a subgoal from another subgoal. The set of subgoals is pre-defined in Fig. 11a (the white rectangles). In contrast to the intrinsic reward function, the extrinsic reward function is defined as a reward value of 1 when the agent obtains the key or opens the doors. Because learning the meta-controller and the sub-controllers simultaneously is highly complex and time-consuming, we separate the learning process into two phases. In the first phase, we learn the sub-controllers completely such that the agent can explore the whole environment by moving between subgoals. In the second phase, we learn the meta-controller and sub-controllers altogether. The architecture of the meta-controller and the sub-controllers is described in Section IV-A.

The length of the sub-transition and meta-transition is 8 and 16, respectively. In this domain, the agent can observe an area of 70 × 70 pixels. The observation area is then resized to 44 × 44 to fit the input of a controller network. The performance of the proposed algorithms compared to the baseline algorithms is shown in Fig. 12. DQN reported a score of zero, which is similar to the result from [10]. DRQN, which can perform well in a partially observable environment, also achieves a score of zero because of the high hierarchical complexity of the domain. Meanwhile, hDQN can achieve a high score on this domain; however, it cannot perform well in the partial observability setting. The performance of hDQN in a full observability setting can be found in the paper of Kulkarni [37]. Our proposed algorithms can adapt to the partial observability setting and to hierarchical domains as well. The hDRQNv2 algorithm shows better performance than hDRQNv1. It seems that the difference in the architectures of the two frameworks (described in Section III) has affected their performance. Particularly, using the internal states of a sub-controller as the input to the meta-controller can give more information for prediction than using only a raw observation. To evaluate the effectiveness of the two algorithms, we report the success ratio for reaching the goal "key" in Fig. 13 and the number of time steps the agent explores each subgoal in Fig. 14. In Fig. 13, the agent using the hDRQNv2 algorithm almost always picks up the "key" at the end of the learning process.

Fig. 11: Montezuma's Revenge game in ATARI 2600. (a) The first screen of Montezuma's Revenge game; (b) Observable area.

Fig. 12: Comparing hDRQN algorithms with some baseline algorithms on Montezuma's Revenge game.

Fig. 13: Success ratio for reaching the goal "key".

Moreover, Fig. 14 shows that hDRQNv2 tends to explore more often the subgoals that are on the way to reaching the "key" (e.g. top-right-ladder, bottom-right-ladder, and bottom-left-ladder) while exploring less often other subgoals such as the left door and the right door.

Fig. 14: Statistic about the number of times the agent visits subgoals.
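As an illustration of how the pre-defined subgoal set and the exploration-driven intrinsic reward for this domain could be encoded, consider the sketch below. The subgoal names follow the legend of Fig. 14; the bounding-box coordinates and the overlap test are invented placeholders for illustration, not values used in the paper.

# Pre-defined subgoals of the first screen (white rectangles in Fig. 11a).
# The pixel boxes below are illustrative placeholders only.
SUBGOALS = {
    "key": (10, 100, 20, 120),
    "left-door": (0, 20, 30, 60),
    "right-door": (130, 20, 160, 60),
    "top-right-ladder": (120, 40, 140, 80),
    "bottom-right-ladder": (120, 140, 140, 180),
    "bottom-left-ladder": (20, 140, 40, 180),
}

def reached(agent_box, subgoal_box):
    # Simple axis-aligned overlap test between the agent and a subgoal region.
    ax1, ay1, ax2, ay2 = agent_box
    sx1, sy1, sx2, sy2 = subgoal_box
    return ax1 < sx2 and sx1 < ax2 and ay1 < sy2 and sy1 < ay2

def intrinsic_reward(agent_box, current_subgoal, previous_subgoal):
    # The agent gets 1 when it reaches the selected subgoal after leaving another one.
    if current_subgoal != previous_subgoal and reached(agent_box, SUBGOALS[current_subgoal]):
        return 1.0
    return 0.0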

V. CONCLUSION

We introduced a new hierarchical deep reinforcement learning algorithm that is a learning framework for both full observability (MDP) and partial observability (POMDP). The algorithm takes advantage of deep neural networks (DNN, CNN, LSTM) to produce hierarchical policies that can solve domains with a highly hierarchical nonlinearity. We showed that the framework performs well when learning in hierarchical POMDP environments. Nevertheless, our approach has some limitations, as follows. First, our framework is built on two levels of hierarchy, which does not fit domains with multiple levels of hierarchy. Second, in order to simplify the learning problem in hierarchical POMDP, we assume that the set of subgoals is predefined and fixed, because the problem of discovering a set of subgoals in POMDP is still a hard problem. In the future, we plan to improve our framework by tackling those problems. In addition, we could apply DRQN to multi-agent problems where the environment is partially observable and the task is hierarchical.

ACKNOWLEDGMENT

The authors are grateful to the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1D1A1B04036354).

REFERENCES

[1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[2] M. P. Deisenroth, G. Neumann, J. Peters et al., "A survey on policy search for robotics," Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1-142, 2013.
[3] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on. IEEE, 2006, pp. 2219-2225.
[4] ——, "Natural actor-critic," Neurocomputing, vol. 71, no. 7, pp. 1180-1190, 2008, Progress in Modeling, Theory, and Application of Computational Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231208000532
[5] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1889-1897.
[6] J. W. Lee, "Stock price prediction using reinforcement learning," in Industrial Electronics, 2001. Proceedings. ISIE 2001. IEEE International Symposium on, vol. 1. IEEE, 2001, pp. 690-695.
[7] J. Moody and M. Saffell, "Learning to trade via direct reinforcement," IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875-889, 2001.
[8] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653-664, 2017.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[11] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[12] G. Tesauro, D. Gondek, J. Lenchner, J. Fan, and J. M. Prager, "Simulation, learning, and optimization techniques in Watson's game strategies," IBM Journal of Research and Development, vol. 56, no. 3.4, pp. 16-1, 2012.
[13] G. Tesauro, "TD-Gammon: A self-teaching backgammon program," in Applications of Neural Networks. Springer, 1995, pp. 267-285.
[14] D. Ernst, M. Glavic, and L. Wehenkel, "Power systems stability control: Reinforcement learning framework," IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427-435, 2004.
[15] J. Kober and J. Peters, "Reinforcement learning in robotics: A survey," in Reinforcement Learning. Springer, 2012, pp. 579-610.
[16] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057-1063.
[17] P. Dayan and G. E. Hinton, "Using expectation-maximization for reinforcement learning," Neural Computation, vol. 9, no. 2, pp. 271-278, 1997.
[18] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, "Evolutionary algorithms for reinforcement learning," 1999.
[19] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008-1014.
[20] S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus, "Mazebase: A sandbox for learning from games," arXiv preprint arXiv:1511.07401, 2015.
[21] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181-211, 1999.
[22] R. E. Parr, Hierarchical Control and Learning for Markov Decision Processes. University of California, Berkeley, CA, 1998.
[23] R. Parr and S. J. Russell, "Reinforcement learning with hierarchies of machines," in Advances in Neural Information Processing Systems, 1998, pp. 1043-1049.
[24] T. G. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," Journal of Artificial Intelligence Research, vol. 13, no. 1, pp. 227-303, 2000.
[25] N. A. Vien, H. Q. Ngo, S. Lee, and T. Chung, "Approximate planning for Bayesian hierarchical reinforcement learning," Appl. Intell., vol. 41, no. 3, pp. 808-819, 2014.
[26] N. A. Vien and M. Toussaint, "Hierarchical Monte-Carlo planning," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015, pp. 3613-3619.

[27] N. A. Vien, S. Lee, and T. Chung, "Bayes-adaptive hierarchical MDPs," Appl. Intell., vol. 45, no. 1, pp. 112-126, 2016.
[28] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341-379, 2003.
[29] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, "Unifying count-based exploration and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 1471-1479.
[30] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," 2017.
[31] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg, "Multi-level discovery of deep options," arXiv preprint arXiv:1703.08294, 2017.
[32] S. Lee, S.-W. Lee, J. Choi, D.-H. Kwak, and B.-T. Zhang, "Micro-objective learning: Accelerating deep reinforcement learning through the discovery of continuous subgoals," arXiv preprint arXiv:1703.03933, 2017.
[33] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, "Feudal networks for hierarchical reinforcement learning," arXiv preprint arXiv:1703.01161, 2017.
[34] I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan, "Deep reinforcement learning with macro-actions," arXiv preprint arXiv:1606.04615, 2016.
[35] K. Arulkumaran, N. Dilokthanakul, M. Shanahan, and A. A. Bharath, "Classifying options for deep reinforcement learning," arXiv preprint arXiv:1604.08153, 2016.
[36] P. Dayan and G. E. Hinton, "Feudal reinforcement learning," in Advances in Neural Information Processing Systems, 1993, pp. 271-278.
[37] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 3675-3683.
[38] C.-C. Chiu and V.-W. Soo, "Subgoal identifications in reinforcement learning: A survey," in Advances in Reinforcement Learning. InTech, 2011.
[39] M. Stolle, "Automated discovery of options in reinforcement learning," Ph.D. dissertation, McGill University, 2004.
[40] K. P. Murphy, "A survey of POMDP solution techniques," Environment, vol. 2, p. X3, 2000.
[41] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," 2015.
[42] M. Egorov, "Deep reinforcement learning with POMDPs," 2015.
[43] C. C. White, "Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem," Operations Research, vol. 24, no. 2, pp. 348-358, 1976.
[44] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1, pp. 99-134, 1998.
[45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[46] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2016.
[47] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[48] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[49] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in International Conference on Machine Learning, 2016, pp. 2829-2838.
[50] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[51] G. Lample and D. S. Chaplot, "Playing FPS games with deep reinforcement learning," 2017.
[52] C. Diuk, A. Cohen, and M. L. Littman, "An object-oriented representation for efficient reinforcement learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 240-247.
[53] R. M. Ryan and E. L. Deci, "Intrinsic and extrinsic motivations: Classic definitions and new directions," Contemporary Educational Psychology, vol. 25, no. 1, pp. 54-67, 2000.
[54] A. Stout, G. D. Konidaris, and A. G. Barto, "Intrinsically motivated reinforcement learning: A promising framework for developmental robot learning," Massachusetts Univ. Amherst Dept. of Computer Science, Tech. Rep., 2005.
[55] A. G. Barto, "Intrinsically motivated learning of hierarchical collections of skills," 2004, pp. 112-119.
[56] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, "Intrinsically motivated reinforcement learning: An evolutionary perspective," IEEE Transactions on Autonomous Mental Development, vol. 2, no. 2, pp. 70-82, 2010.
[57] M. Frank, J. Leitner, M. Stollenga, A. Förster, and J. Schmidhuber, "Curiosity driven reinforcement learning for motion planning on humanoids," Frontiers in Neurorobotics, vol. 7, p. 25, 2014.
[58] S. Mohamed and D. J. Rezende, "Variational information maximisation for intrinsically motivated reinforcement learning," in Advances in Neural Information Processing Systems, 2015, pp. 2125-2133.
[59] J. Schmidhuber, "Formal theory of creativity, fun, and intrinsic motivation (1990-2010)," IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230-247, 2010.
[60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.
Barto, “Intrinsically motivated learning of hierar- grating temporal abstraction and intrinsic motivation,” chical collections of skills,” 2004, pp. 112–119. in Advances in Neural Information Processing Systems, [56] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, “Intrinsi- 2016, pp. 3675–3683. cally motivated reinforcement learning: An evolutionary [38] C.-C. Chiu and V.-W. Soo, “Subgoal identifications in perspective,” IEEE Transactions on Autonomous Mental reinforcement learning: A survey,” in Advances in Rein- Development, vol. 2, no. 2, pp. 70–82, 2010. forcement Learning. InTech, 2011. [57] M. Frank, J. Leitner, M. Stollenga, A. Forster,¨ and [39] M. Stolle, “Automated discovery of options in reinforce- J. Schmidhuber, “Curiosity driven reinforcement learning ment learning,” Ph.D. dissertation, McGill University, for motion planning on humanoids,” Frontiers in neuro- 2004. robotics, vol. 7, p. 25, 2014. [40] K. P. Murphy, “A survey of pomdp solution techniques,” [58] S. Mohamed and D. J. Rezende, “Variational information environment, vol. 2, p. X3, 2000. maximisation for intrinsically motivated reinforcement [41] M. Hausknecht and P. Stone, “Deep recurrent q-learning learning,” in Advances in neural information processing for partially observable mdps,” 2015. systems, 2015, pp. 2125–2133. [42] M. Egorov, “Deep reinforcement learning with pomdps,” [59] J. Schmidhuber, “Formal theory of creativity, fun, and 2015. intrinsic motivation (1990–2010),” IEEE Transactions on [43] C. C. White, “Procedures for the solution of a finite- Autonomous Mental Development, vol. 2, no. 3, pp. 230– horizon, partially observed, semi-markov optimization 247, 2010. problem,” Operations Research, vol. 24, no. 2, pp. 348– [60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, 358, 1976. C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, [44] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Is- “Planning and acting in partially observable ard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Leven- domains,” Artificial intelligence, vol. 101, no. 1, pp. 99– berg, D. Mane,´ R. Monga, S. Moore, D. Murray, C. Olah, 15

[61] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[62] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.