Renewal Monte Carlo: Renewal Theory Based Reinforcement Learning
Jayakumar Subramanian and Aditya Mahajan
arXiv:1804.01116v1 [cs.LG] 3 Apr 2018

Abstract—In this paper, we present an online reinforcement learning algorithm, called Renewal Monte Carlo (RMC), for infinite horizon Markov decision processes with a designated start state. RMC is a Monte Carlo algorithm and retains the advantages of Monte Carlo methods, including low bias, simplicity, and ease of implementation, while at the same time circumventing their key drawbacks of high variance and delayed (end of episode) updates. The key ideas behind RMC are as follows. First, under any reasonable policy, the reward process is ergodic; so, by renewal theory, the performance of a policy is equal to the ratio of the expected discounted reward to the expected discounted time over a regenerative cycle. Second, by carefully examining the expression for the performance gradient, we propose a stochastic approximation algorithm that only requires estimates of the expected discounted reward and discounted time over a regenerative cycle and of their gradients. We propose two unbiased estimators for evaluating performance gradients (a likelihood ratio based estimator and a simultaneous perturbation based estimator) and show that for both estimators RMC converges to a locally optimal policy. We generalize the RMC algorithm to post-decision state models and also present a variant that converges faster to an approximately optimal policy. We conclude by presenting numerical experiments on a randomly generated MDP, event-triggered communication, and inventory management.

Index Terms—Reinforcement learning, Markov decision processes, renewal theory, Monte Carlo methods, policy gradient, stochastic approximation

I. INTRODUCTION

In recent years, reinforcement learning [1]–[4] has emerged as a leading framework to learn how to act optimally in unknown environments. Policy gradient methods [5]–[10] have played a prominent role in the success of reinforcement learning. Such methods have two critical components: policy evaluation and policy improvement. In the policy evaluation step, the performance of a parameterized policy is evaluated, while in the policy improvement step, the policy parameters are updated using stochastic gradient ascent.

Policy gradient methods may be broadly classified into Monte Carlo methods and temporal difference methods. In Monte Carlo methods, the performance of a policy is estimated using the discounted return of a single sample path; in temporal difference methods, the value (or action-value) function is guessed and this guess is iteratively improved using temporal differences. Monte Carlo methods are attractive because they have zero bias, are simple and easy to implement, and work for both discounted and average reward setups as well as for models with continuous state and action spaces. However, they suffer from various drawbacks. First, they have a high variance because a single sample path is used to estimate performance. Second, they are not asymptotically optimal for infinite horizon models because it is effectively assumed that the model is episodic; in infinite horizon models, the trajectory is arbitrarily truncated to treat the model as episodic. Third, the policy improvement step cannot be carried out in tandem with policy evaluation: one must wait until the end of the episode to estimate the performance, and only then can the policy parameters be updated. It is for these reasons that Monte Carlo methods are largely ignored in the literature on policy gradient methods, which focuses almost exclusively on temporal difference methods such as actor-critic with eligibility traces [3].

In this paper, we propose a Monte Carlo method, which we call Renewal Monte Carlo (RMC), for infinite horizon Markov decision processes with a designated start state. Like Monte Carlo, RMC has low bias, is simple and easy to implement, and works for models with continuous state and action spaces. At the same time, it does not suffer from the drawbacks of typical Monte Carlo methods. RMC is a low-variance online algorithm that works for infinite horizon discounted and average reward setups. One does not have to wait until the end of the episode to carry out the policy improvement step; it can be carried out whenever the system visits the start state (or a neighborhood of it).

Although renewal theory is commonly used to estimate the performance of stochastic systems in the simulation optimization community [11], [12], those methods assume that the probability law of the primitive random variables and its weak derivative are known, which is not the case in reinforcement learning. Renewal theory is also commonly used in the engineering literature on queueing theory and on systems and control of Markov decision processes (MDPs) with average reward criteria and a known system model. There is some prior work on using renewal theory for reinforcement learning [13], [14], where renewal theory based estimators for the average return and the differential value function of average reward MDPs are developed. In RMC, renewal theory is used in a different manner, for discounted reward MDPs (and the results generalize to average cost MDPs).

This work was supported by the Natural Sciences and Engineering Research Council of Canada through NSERC Discovery Accelerator RGPAS 493011-16. The authors are with the Electrical and Computer Engineering Department, McGill University, Montreal, QC H3A 0E9, Canada (e-mails: [email protected], [email protected]).
II. RMC ALGORITHM

Consider a Markov decision process (MDP) with state $S_t \in \mathcal{S}$ and action $A_t \in \mathcal{A}$. The system starts in an initial state $s_0 \in \mathcal{S}$ and at time $t$:
1) there is a controlled transition from $S_t$ to $S_{t+1}$ according to a transition kernel $P(A_t)$;
2) a per-step reward $R_t = r(S_t, A_t, S_{t+1})$ is received.
The future is discounted at a rate $\gamma \in (0, 1)$.

A (time-homogeneous and Markov) policy $\pi$ maps the current state to a distribution on actions, i.e., $A_t \sim \pi(S_t)$. We use $\pi(a \mid s)$ to denote $\mathbb{P}(A_t = a \mid S_t = s)$. The performance of a policy $\pi$ is given by
$$ J_\pi = \mathbb{E}_{A_t \sim \pi(S_t)}\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s_0 \Big]. \quad (1) $$
We are interested in identifying an optimal policy, i.e., a policy that maximizes the performance. When $\mathcal{S}$ and $\mathcal{A}$ are Borel spaces, we assume that the model satisfies the standard conditions under which time-homogeneous Markov policies are optimal [15]. In the sequel, we present a sample path based online learning algorithm, which we call Renewal Monte Carlo (RMC), that identifies a locally optimal policy within the class of parameterized policies.

Suppose policies are parameterized by a closed and convex subset $\Theta$ of a Euclidean space. For example, $\Theta$ could be the weight vector in a Gibbs soft-max policy, the weights of a deep neural network, the thresholds in a control limit policy, and so on. Given $\theta \in \Theta$, we use $\pi_\theta$ to denote the policy parameterized by $\theta$ and $J_\theta$ to denote $J_{\pi_\theta}$. We assume that for all policies $\pi_\theta$, $\theta \in \Theta$, the designated start state $s_0$ is positive recurrent.

The typical approach for policy gradient based reinforcement learning is to start with an initial guess $\theta_0 \in \Theta$ and iteratively update it using stochastic gradient ascent. In particular, let $\widehat{\nabla} J_{\theta_m}$ be an unbiased estimator of $\nabla_\theta J_\theta \big|_{\theta = \theta_m}$; then update
$$ \theta_{m+1} = \big[ \theta_m + \alpha_m \widehat{\nabla} J_{\theta_m} \big]_\Theta, \quad (2) $$
where $[\theta]_\Theta$ denotes the projection of $\theta$ onto $\Theta$ and $\{\alpha_m\}_{m \ge 1}$ is a sequence of learning rates that satisfies the standard assumptions
$$ \sum_{m=1}^{\infty} \alpha_m = \infty \quad \text{and} \quad \sum_{m=1}^{\infty} \alpha_m^2 < \infty. \quad (3) $$
Under mild technical conditions [16], the above iteration converges to a $\theta^*$ that is locally optimal, i.e., $\nabla_\theta J_\theta \big|_{\theta = \theta^*} = 0$.
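The update in (2)-(3) is ordinary projected stochastic gradient ascent. As a point of reference before the renewal-based estimator is introduced, the following minimal Python sketch shows this generic loop; the gradient oracle `grad_estimate`, the box-shaped choice of $\Theta$ used for the projection, and the step-size constant are illustrative assumptions, not part of the paper.

```python
import numpy as np

def project(theta, lo, hi):
    """Projection [theta]_Theta, here onto an assumed box-shaped Theta."""
    return np.clip(theta, lo, hi)

def sga(grad_estimate, theta0, lo, hi, num_iters=1000, a=0.1):
    """Generic projected stochastic gradient ascent, as in Eq. (2).

    grad_estimate(theta) must return an unbiased estimate of grad J(theta);
    constructing such an estimate is exactly what RMC provides.
    """
    theta = np.asarray(theta0, dtype=float)
    for m in range(1, num_iters + 1):
        alpha_m = a / m  # step sizes satisfying Eq. (3): sum diverges, sum of squares converges
        theta = project(theta + alpha_m * np.asarray(grad_estimate(theta)), lo, hi)
    return theta
```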
In RMC, we approximate $\nabla_\theta J_\theta$ by a renewal theory based estimator, as explained below.

Let $\tau^{(n)}$ denote the stopping time when the system returns to the start state $s_0$ for the $n$-th time. In particular, let $\tau^{(0)} = 0$ and for $n \ge 1$ define
$$ \tau^{(n)} = \inf\{ t > \tau^{(n-1)} : s_t = s_0 \}. $$
For the $n$-th regenerative cycle, define the discounted reward and discounted time
$$ R^{(n)} = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)} - 1} \gamma^t R_t \quad \text{and} \quad T^{(n)} = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)} - 1} \gamma^t, \quad (4) $$
where $\Gamma^{(n)} = \gamma^{-\tau^{(n-1)}}$. By the strong Markov property, $\{R^{(n)}\}_{n \ge 1}$ and $\{T^{(n)}\}_{n \ge 1}$ are i.i.d. sequences. Let $R_\theta$ and $T_\theta$ denote $\mathbb{E}[R^{(n)}]$ and $\mathbb{E}[T^{(n)}]$, respectively. Define
$$ \widehat{R} = \frac{1}{N} \sum_{n=1}^{N} R^{(n)} \quad \text{and} \quad \widehat{T} = \frac{1}{N} \sum_{n=1}^{N} T^{(n)}, \quad (5) $$
where $N$ is a large number. Then, $\widehat{R}$ and $\widehat{T}$ are unbiased and asymptotically consistent estimators of $R_\theta$ and $T_\theta$. From ideas similar to standard renewal theory [17], we have the following.

Proposition 1 (Renewal Relationship): The performance of policy $\pi_\theta$ is given by
$$ J_\theta = \frac{R_\theta}{(1 - \gamma)\, T_\theta}. \quad (6) $$

PROOF: For ease of notation, define
$$ \widetilde{T}_\theta = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[ \gamma^{\tau^{(n)} - \tau^{(n-1)}} \big]. $$
Using the formula for a geometric series, we get that $T_\theta = (1 - \widetilde{T}_\theta)/(1 - \gamma)$. Hence,
$$ \widetilde{T}_\theta = 1 - (1 - \gamma)\, T_\theta. \quad (7) $$
Now, consider the performance:
$$ J_\theta = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[ \sum_{t=0}^{\tau^{(1)} - 1} \gamma^t R_t + \gamma^{\tau^{(1)}} \sum_{t=\tau^{(1)}}^{\infty} \gamma^{t - \tau^{(1)}} R_t \,\Big|\, S_0 = s_0 \Big] \overset{(a)}{=} R_\theta + \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[ \gamma^{\tau^{(1)}} \big]\, J_\theta = R_\theta + \widetilde{T}_\theta J_\theta, \quad (8) $$
where the second expression in (a) uses the independence of the random variables from $(0, \tau^{(1)} - 1)$ and those from $\tau^{(1)}$ onwards, which holds due to the strong Markov property. Substituting (7) in (8) and rearranging terms, we get the result of the proposition.

Differentiating both sides of Equation (6) with respect to $\theta$ (using the quotient rule), we get that
$$ \nabla_\theta J_\theta = \frac{H_\theta}{T_\theta^2 (1 - \gamma)}, \quad \text{where} \quad H_\theta = T_\theta \nabla_\theta R_\theta - R_\theta \nabla_\theta T_\theta. \quad (9) $$
Therefore, instead of using stochastic gradient ascent to find the maximum of $J_\theta$, we can use stochastic approximation to find the root of $H_\theta$.
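Proposition 1 suggests a direct way to estimate performance from data: run the policy, cut the trajectory at returns to $s_0$, average the per-cycle discounted reward and discounted time, and form the ratio in (6). The Python sketch below does this for a generic simulator; the `env_step` and `sample_action` callables, the cycle-length cap, and the assumption that states can be compared with `==` are illustrative choices, not part of the paper.

```python
import numpy as np

def run_cycle(env_step, sample_action, s0, gamma, max_len=10_000):
    """Simulate one regenerative cycle from s0 and return (R^(n), T^(n)).

    Discounting restarts at the beginning of the cycle, which matches the
    factor Gamma^(n) = gamma^(-tau^(n-1)) in Eq. (4).
    Assumes states are comparable with ==.
    """
    s, R, T, disc = s0, 0.0, 0.0, 1.0
    for _ in range(max_len):  # cap only to keep the sketch finite
        a = sample_action(s)
        s_next, r = env_step(s, a)
        R += disc * r
        T += disc
        disc *= gamma
        s = s_next
        if s == s0:  # returned to the start state: the cycle ends
            break
    return R, T

def estimate_performance(env_step, sample_action, s0, gamma, num_cycles=1000):
    """Estimate R_hat and T_hat as in Eq. (5), and J via the renewal relationship (6)."""
    cycles = [run_cycle(env_step, sample_action, s0, gamma) for _ in range(num_cycles)]
    R_hat = float(np.mean([c[0] for c in cycles]))
    T_hat = float(np.mean([c[1] for c in cycles]))
    J_hat = R_hat / ((1.0 - gamma) * T_hat)
    return R_hat, T_hat, J_hat
```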
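The paper's own gradient estimators (likelihood ratio based and simultaneous perturbation based) are developed later; purely to illustrate the structure of a stochastic approximation scheme that seeks the root of $H_\theta$, the sketch below plugs a simple two-point simultaneous-perturbation estimate of $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$ into $H_\theta = T_\theta \nabla_\theta R_\theta - R_\theta \nabla_\theta T_\theta$ and takes a step in that direction. The perturbation size, the reuse of `estimate_performance` from the previous sketch, and the specific update form are assumptions, not the paper's algorithm.

```python
import numpy as np

def rmc_step_spsa(theta, alpha, c, estimate_RT, rng=None):
    """One hypothetical stochastic-approximation step toward a root of H_theta (Eq. (9)).

    estimate_RT(theta) -> (R_hat, T_hat): Monte Carlo estimates over a batch of
    regenerative cycles, e.g. built from estimate_performance above.
    Gradients of R_theta and T_theta are approximated by a two-point
    simultaneous perturbation; this is only a stand-in for the estimators
    derived later in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation directions
    R_plus, T_plus = estimate_RT(theta + c * delta)
    R_minus, T_minus = estimate_RT(theta - c * delta)
    grad_R = (R_plus - R_minus) / (2.0 * c * delta)    # SPSA-style gradient estimate of R_theta
    grad_T = (T_plus - T_minus) / (2.0 * c * delta)    # SPSA-style gradient estimate of T_theta
    R_mid, T_mid = 0.5 * (R_plus + R_minus), 0.5 * (T_plus + T_minus)
    H_hat = T_mid * grad_R - R_mid * grad_T            # estimate of H_theta
    return theta + alpha * H_hat                       # project onto Theta afterwards if needed
```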