
Renewal Monte Carlo: Renewal theory based reinforcement learning

Jayakumar Subramanian and Aditya Mahajan

Abstract—In this paper, we present an online reinforcement learning algorithm, called Renewal Monte Carlo (RMC), for infinite horizon Markov decision processes with a designated start state. RMC is a Monte Carlo algorithm that retains the advantages of Monte Carlo methods, including low bias, simplicity, and ease of implementation, while at the same time circumventing their key drawbacks of high variance and delayed (end of episode) updates. The key ideas behind RMC are as follows. First, under any reasonable policy, the reward process is ergodic, so, by renewal theory, the performance of a policy is equal to the ratio of the expected discounted reward to the expected discounted time over a regenerative cycle. Second, by carefully examining the expression for the performance gradient, we propose a stochastic approximation algorithm that only requires estimates of the expected discounted reward and discounted time over a regenerative cycle and their gradients. We propose two unbiased estimators for evaluating performance gradients—a likelihood ratio based estimator and a simultaneous perturbation based estimator—and show that for both estimators, RMC converges to a locally optimal policy. We generalize the RMC algorithm to post-decision state models and also present a variant that converges faster to an approximately optimal policy. We conclude by presenting numerical experiments on a randomly generated MDP, event-triggered communication, and inventory management.

Index Terms—Reinforcement learning, Markov decision processes, renewal theory, Monte Carlo methods, policy gradient, stochastic approximation

I. INTRODUCTION

In recent years, reinforcement learning [1]–[4] has emerged as a leading framework to learn how to act optimally in unknown environments. Policy gradient methods [5]–[10] have played a prominent role in the success of reinforcement learning. Such methods have two critical components: policy evaluation and policy improvement. In the policy evaluation step, the performance of a parameterized policy is evaluated, while in the policy improvement step, the policy parameters are updated using stochastic gradient ascent.

Policy gradient methods may be broadly classified as Monte Carlo methods and temporal difference methods. In Monte Carlo methods, the performance of a policy is estimated using the discounted return of a single sample path; in temporal difference methods, the value (or action-value) function is guessed and this guess is iteratively improved using temporal differences. Monte Carlo methods are attractive because they have zero bias, are simple and easy to implement, and work for both discounted and average reward setups as well as for models with continuous state and action spaces. However, they suffer from various drawbacks. First, they have a high variance because a single sample path is used to estimate performance. Second, they are not asymptotically optimal for infinite horizon models because it is effectively assumed that the model is episodic; in infinite horizon models, the trajectory is arbitrarily truncated to treat the model as an episodic model. Third, the policy improvement step cannot be carried out in tandem with policy evaluation: one must wait until the end of the episode to estimate the performance, and only then can the policy parameters be updated. It is for these reasons that Monte Carlo methods are largely ignored in the literature on policy gradient methods, which almost exclusively focuses on temporal difference methods such as actor-critic with eligibility traces [3].

In this paper, we propose a Monte Carlo method—which we call Renewal Monte Carlo (RMC)—for infinite horizon Markov decision processes with a designated start state. Like Monte Carlo, RMC has low bias, is simple and easy to implement, and works for models with continuous state and action spaces. At the same time, it does not suffer from the drawbacks of typical Monte Carlo methods. RMC is a low-variance online algorithm that works for infinite horizon discounted and average reward setups. One does not have to wait until the end of the episode to carry out the policy improvement step; it can be carried out whenever the system visits the start state (or a neighborhood of it).

Although renewal theory is commonly used to estimate the performance of stochastic systems in the simulation optimization community [11], [12], those methods assume that the probability law of the primitive random variables and its weak derivative are known, which is not the case in reinforcement learning. Renewal theory is also commonly used in the engineering literature on queueing theory and on systems and control for Markov decision processes (MDPs) with average reward criteria and a known system model. There is some prior work on using renewal theory for reinforcement learning [13], [14], where renewal theory based estimators for the average return and the differential value function of average reward MDPs are developed. In RMC, renewal theory is used in a different manner, for discounted reward MDPs (and the results generalize to average cost MDPs).

(This work was supported by the Natural Sciences and Engineering Research Council of Canada through NSERC Discovery Accelerator RGPAS 493011-16. The authors are with the Electrical and Computer Engineering Department, McGill University, Montreal, QC H3A 0E9, Canada. E-mails: [email protected], [email protected].)

II. RMC ALGORITHM

Consider a Markov decision process (MDP) with state $S_t \in \mathcal{S}$ and action $A_t \in \mathcal{A}$. The system starts in an initial state $s_0 \in \mathcal{S}$ and at time $t$:

1) there is a controlled transition from $S_t$ to $S_{t+1}$ according to a transition kernel $P(A_t)$;
2) a per-step reward $R_t = r(S_t, A_t, S_{t+1})$ is received.

The future is discounted at a rate $\gamma \in (0, 1)$. A (time-homogeneous and Markov) policy $\pi$ maps the current state to a distribution on actions, i.e., $A_t \sim \pi(S_t)$. We use $\pi(a \mid s)$ to denote $\mathbb{P}(A_t = a \mid S_t = s)$. The performance of a policy $\pi$ is given by

  $J_\pi = \mathbb{E}_{A_t \sim \pi(S_t)}\big[\sum_{t=0}^{\infty} \gamma^t R_t \,\big|\, S_0 = s_0\big]$.   (1)

We are interested in identifying an optimal policy, i.e., a policy that maximizes the performance. When $\mathcal{S}$ and $\mathcal{A}$ are Borel spaces, we assume that the model satisfies the standard conditions under which time-homogeneous Markov policies are optimal [15]. In the sequel, we present a sample path based online learning algorithm, which we call Renewal Monte Carlo (RMC), which identifies a locally optimal policy within a class of parameterized policies.

Suppose policies are parameterized by a closed and convex subset $\Theta$ of a Euclidean space. For example, $\Theta$ could be the weight vector in a Gibbs soft-max policy, the weights of a deep neural network, the thresholds in a control limit policy, and so on. Given $\theta \in \Theta$, we use $\pi_\theta$ to denote the policy parameterized by $\theta$ and $J_\theta$ to denote $J_{\pi_\theta}$. We assume that for all policies $\pi_\theta$, $\theta \in \Theta$, the designated start state $s_0$ is positive recurrent.

The typical approach for policy gradient based reinforcement learning is to start with an initial guess $\theta_0 \in \Theta$ and iteratively update it using stochastic gradient ascent. In particular, let $\widehat{\nabla J}_{\theta_m}$ be an unbiased estimator of $\nabla_\theta J_\theta \big|_{\theta = \theta_m}$; then update

  $\theta_{m+1} = \big[\theta_m + \alpha_m \widehat{\nabla J}_{\theta_m}\big]_\Theta$,   (2)

where $[\theta]_\Theta$ denotes the projection of $\theta$ onto $\Theta$ and $\{\alpha_m\}_{m \ge 1}$ is a sequence of learning rates that satisfies the standard assumptions

  $\sum_{m=1}^{\infty} \alpha_m = \infty$ and $\sum_{m=1}^{\infty} \alpha_m^2 < \infty$.   (3)

Under mild technical conditions [16], the above iteration converges to a $\theta^*$ that is locally optimal, i.e., $\nabla_\theta J_\theta \big|_{\theta = \theta^*} = 0$. In RMC, we approximate $\nabla_\theta J_\theta$ by a renewal theory based estimator as explained below.

Let $\tau^{(n)}$ denote the stopping time when the system returns to the start state $s_0$ for the $n$-th time. In particular, let $\tau^{(0)} = 0$ and for $n \ge 1$ define

  $\tau^{(n)} = \inf\{t > \tau^{(n-1)} : s_t = s_0\}$.

We call the sequence of $(S_t, A_t, R_t)$ from $\tau^{(n-1)}$ to $\tau^{(n)} - 1$ the $n$-th regenerative cycle. Let $R^{(n)}$ and $T^{(n)}$ denote the total discounted reward and total discounted time of the $n$-th regenerative cycle, i.e.,

  $R^{(n)} = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} \gamma^t R_t$ and $T^{(n)} = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} \gamma^t$,   (4)

where $\Gamma^{(n)} = \gamma^{-\tau^{(n-1)}}$. By the strong Markov property, $\{R^{(n)}\}_{n \ge 1}$ and $\{T^{(n)}\}_{n \ge 1}$ are i.i.d. sequences. Let $R_\theta$ and $T_\theta$ denote $\mathbb{E}[R^{(n)}]$ and $\mathbb{E}[T^{(n)}]$, respectively. Define

  $\hat{R} = \frac{1}{N} \sum_{n=1}^{N} R^{(n)}$ and $\hat{T} = \frac{1}{N} \sum_{n=1}^{N} T^{(n)}$,   (5)

where $N$ is a large number. Then, $\hat{R}$ and $\hat{T}$ are unbiased and asymptotically consistent estimators of $R_\theta$ and $T_\theta$. From ideas similar to standard renewal theory [17], we have the following.

Proposition 1 (Renewal Relationship): The performance of policy $\pi_\theta$ is given by

  $J_\theta = \dfrac{R_\theta}{(1 - \gamma)\, T_\theta}$.   (6)

PROOF: For ease of notation, define

  $\mathsf{T}_\theta = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[\gamma^{\tau^{(n)} - \tau^{(n-1)}}\big]$.

Using the formula for a geometric series, we get that $T_\theta = (1 - \mathsf{T}_\theta)/(1 - \gamma)$. Hence,

  $\mathsf{T}_\theta = 1 - (1 - \gamma) T_\theta$.   (7)

Now, consider the performance:

  $J_\theta = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[\sum_{t=0}^{\tau^{(1)}-1} \gamma^t R_t + \gamma^{\tau^{(1)}} \sum_{t=\tau^{(1)}}^{\infty} \gamma^{t - \tau^{(1)}} R_t \,\Big|\, S_0 = s_0\Big]$
  $\overset{(a)}{=} R_\theta + \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[\gamma^{\tau^{(1)}}\big]\, J_\theta = R_\theta + \mathsf{T}_\theta J_\theta$,   (8)

where the second expression in (a) uses the independence of the random variables from $(0, \tau^{(1)} - 1)$ and those from $\tau^{(1)}$ onwards, due to the strong Markov property. Substituting (7) in (8) and rearranging terms, we get the result of the proposition. ∎

Differentiating both sides of (6) with respect to $\theta$, we get that

  $\nabla_\theta J_\theta = \dfrac{H_\theta}{T_\theta^2 (1 - \gamma)}$, where $H_\theta = T_\theta \nabla_\theta R_\theta - R_\theta \nabla_\theta T_\theta$.   (9)

Therefore, instead of using stochastic gradient ascent to find the maximum of $J_\theta$, we can use stochastic approximation to find a root of $H_\theta$. In particular, let $\hat{H}_m$ be an unbiased estimator of $H_{\theta_m}$. We then use the update

  $\theta_{m+1} = \big[\theta_m + \alpha_m \hat{H}_m\big]_\Theta$,   (10)

where $\{\alpha_m\}_{m \ge 1}$ satisfies the standard conditions on learning rates (3).

The above iteration converges to a locally optimal policy. Specifically, we have the following.

Theorem 1: Let $\hat{R}_m$, $\hat{T}_m$, $\widehat{\nabla R}_m$ and $\widehat{\nabla T}_m$ be unbiased estimators of $R_{\theta_m}$, $T_{\theta_m}$, $\nabla_\theta R_{\theta_m}$, and $\nabla_\theta T_{\theta_m}$, respectively, such that $\hat{T}_m \perp \widehat{\nabla R}_m$ and $\hat{R}_m \perp \widehat{\nabla T}_m$. (The notation $X \perp Y$ means that the random variables $X$ and $Y$ are independent.) Then,

  $\hat{H}_m = \hat{T}_m \widehat{\nabla R}_m - \hat{R}_m \widehat{\nabla T}_m$   (11)

is an unbiased estimator of $H_{\theta_m}$, and the sequence $\{\theta_m\}_{m \ge 1}$ generated by (10) converges almost surely with

  $\lim_{m \to \infty} \nabla_\theta J_\theta \big|_{\theta = \theta_m} = 0$.

PROOF: The unbiasedness of $\hat{H}_m$ follows immediately from the independence assumption. The convergence of $\{\theta_m\}_{m \ge 1}$ follows from [16, Theorem 2.2] and the fact that the model satisfies conditions (A1)–(A4) of [16, pp. 10–11]. ∎

In the remainder of this section, we present two methods for estimating the gradients of $R_\theta$ and $T_\theta$. The first is a likelihood ratio based gradient estimator, which works when the policy is differentiable with respect to the policy parameters. The second is a simultaneous perturbation based gradient estimator that uses finite differences, which is useful when the policy is not differentiable with respect to the policy parameters.

A. Likelihood ratio based gradient estimator

One approach to estimate the performance gradient is to use likelihood ratio based estimates [12], [18], [19]. Suppose the policy $\pi_\theta(a \mid s)$ is differentiable with respect to $\theta$. For any time $t$, define the likelihood function

  $\Lambda_t = \nabla_\theta \log[\pi_\theta(A_t \mid S_t)]$,   (12)

and for $\sigma \in \{\tau^{(n-1)}, \dots, \tau^{(n)} - 1\}$, define

  $R^{(n)}_\sigma = \Gamma^{(n)} \sum_{t=\sigma}^{\tau^{(n)}-1} \gamma^t R_t$ and $T^{(n)}_\sigma = \Gamma^{(n)} \sum_{t=\sigma}^{\tau^{(n)}-1} \gamma^t$.   (13)

In this notation, $R^{(n)} = R^{(n)}_{\tau^{(n-1)}}$ and $T^{(n)} = T^{(n)}_{\tau^{(n-1)}}$. Then, define the following estimators for $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$:

  $\widehat{\nabla R} = \frac{1}{N} \sum_{n=1}^{N} \sum_{\sigma=\tau^{(n-1)}}^{\tau^{(n)}-1} R^{(n)}_\sigma \Lambda_\sigma$,   (14)

  $\widehat{\nabla T} = \frac{1}{N} \sum_{n=1}^{N} \sum_{\sigma=\tau^{(n-1)}}^{\tau^{(n)}-1} T^{(n)}_\sigma \Lambda_\sigma$,   (15)

where $N$ is a large number.

Proposition 2: $\widehat{\nabla R}$ and $\widehat{\nabla T}$ defined above are unbiased and asymptotically consistent estimators of $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$.

PROOF: Let $\mathbb{P}_\theta$ denote the probability measure induced on the sample paths when the system follows policy $\pi_\theta$. For $t \in \{\tau^{(n-1)}, \dots, \tau^{(n)} - 1\}$, let $D^{(n)}_t$ denote the sample path $(S_s, A_s, S_{s+1})_{s=\tau^{(n-1)}}^{t}$ of the $n$-th regenerative cycle until time $t$. Then,

  $\mathbb{P}_\theta(D^{(n)}_t) = \prod_{s=\tau^{(n-1)}}^{t} \pi_\theta(A_s \mid S_s)\, P(S_{s+1} \mid S_s, A_s)$.

Therefore,

  $\nabla_\theta \log \mathbb{P}_\theta(D^{(n)}_t) = \sum_{s=\tau^{(n-1)}}^{t} \nabla_\theta \log \pi_\theta(A_s \mid S_s) = \sum_{s=\tau^{(n-1)}}^{t} \Lambda_s$.   (16)

Note that $R_\theta$ can be written as

  $R_\theta = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} \gamma^t\, \mathbb{E}_{A_t \sim \pi_\theta(S_t)}[R_t]$.

Using the log-derivative trick (for any distribution $p(x \mid \theta)$ and any function $f$, $\nabla_\theta \mathbb{E}_{X \sim p(X \mid \theta)}[f(X)] = \mathbb{E}_{X \sim p(X \mid \theta)}[f(X) \nabla_\theta \log p(X \mid \theta)]$), we get

  $\nabla_\theta R_\theta = \Gamma^{(n)} \sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} \gamma^t\, \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[R_t \nabla_\theta \log \mathbb{P}_\theta(D^{(n)}_t)\big]$
  $\overset{(a)}{=} \Gamma^{(n)}\, \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[\sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} \gamma^t R_t \sum_{\sigma=\tau^{(n-1)}}^{t} \Lambda_\sigma\Big]$
  $\overset{(b)}{=} \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[\sum_{\sigma=\tau^{(n-1)}}^{\tau^{(n)}-1} \Lambda_\sigma\, \Gamma^{(n)} \sum_{t=\sigma}^{\tau^{(n)}-1} \gamma^t R_t\Big]$
  $\overset{(c)}{=} \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[\sum_{\sigma=\tau^{(n-1)}}^{\tau^{(n)}-1} R^{(n)}_\sigma \Lambda_\sigma\Big]$,   (17)

where (a) follows from (16), (b) follows from changing the order of summation, and (c) follows from the definition of $R^{(n)}_\sigma$ in (13). $\widehat{\nabla R}$ is an unbiased and asymptotically consistent estimator of the right hand side of the first equation in (17). The result for $\widehat{\nabla T}$ follows from a similar argument. ∎

To satisfy the independence condition of Theorem 1, we use two independent sample paths: one to estimate $\hat{R}$ and $\hat{T}$, and the other to estimate $\widehat{\nabla R}$ and $\widehat{\nabla T}$. The complete algorithm is shown in Algorithm 1.

Algorithm 1: RMC algorithm with likelihood ratio based gradient estimates.
  Input: initial policy $\theta_0$, discount factor $\gamma$, initial state $s_0$, number of regenerative cycles $N$.
  for iteration $m = 0, 1, \dots$ do
    for regenerative cycle $n_1 = 1$ to $N$ do
      Generate the $n_1$-th regenerative cycle using policy $\pi_{\theta_m}$.
      Compute $R^{(n_1)}$ and $T^{(n_1)}$ using (4).
    Set $\hat{R}_m = \operatorname{average}(R^{(n_1)} : n_1 \in \{1, \dots, N\})$.
    Set $\hat{T}_m = \operatorname{average}(T^{(n_1)} : n_1 \in \{1, \dots, N\})$.
    for regenerative cycle $n_2 = 1$ to $N$ do
      Generate the $n_2$-th regenerative cycle using policy $\pi_{\theta_m}$.
      Compute $R^{(n_2)}_\sigma$, $T^{(n_2)}_\sigma$ and $\Lambda_\sigma$ for all $\sigma$.
    Compute $\widehat{\nabla R}_m$ and $\widehat{\nabla T}_m$ using (14) and (15).
    Set $\hat{H}_m = \hat{T}_m \widehat{\nabla R}_m - \hat{R}_m \widehat{\nabla T}_m$.
    Update $\theta_{m+1} = \big[\theta_m + \alpha_m \hat{H}_m\big]_\Theta$.
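As an illustration of Algorithm 1, the sketch below implements one policy update under stated assumptions: `sample_cycle(theta)` returns one regenerative cycle as a list of `(state, action, reward)` tuples, `score(theta, s, a)` returns $\Lambda = \nabla_\theta \log \pi_\theta(a \mid s)$ as an array with the shape of `theta`, and `project` is the projection onto $\Theta$. These interfaces are hypothetical; the paper does not prescribe an implementation.

```python
import numpy as np

def rmc_likelihood_ratio_iteration(theta, sample_cycle, score, alpha, gamma, N, project):
    """One iteration of Algorithm 1 (a sketch, not the authors' code)."""
    # First set of N cycles: estimate R_theta and T_theta as in (5).
    R_hats, T_hats = [], []
    for _ in range(N):
        cycle = sample_cycle(theta)
        discounts = gamma ** np.arange(len(cycle))
        rewards = np.array([r for (_, _, r) in cycle])
        R_hats.append(np.sum(discounts * rewards))
        T_hats.append(np.sum(discounts))
    R_hat, T_hat = float(np.mean(R_hats)), float(np.mean(T_hats))

    # Second, independent set of N cycles: estimate grad R and grad T as in (14)-(15).
    gradR = np.zeros_like(theta)
    gradT = np.zeros_like(theta)
    for _ in range(N):
        cycle = sample_cycle(theta)
        discounts = gamma ** np.arange(len(cycle))
        rewards = np.array([r for (_, _, r) in cycle])
        for sigma, (s, a, _) in enumerate(cycle):
            Lam = score(theta, s, a)                                   # Lambda_sigma, cf. (12)
            gradR += np.sum(discounts[sigma:] * rewards[sigma:]) * Lam  # R_sigma^(n) Lambda_sigma
            gradT += np.sum(discounts[sigma:]) * Lam                    # T_sigma^(n) Lambda_sigma
    gradR /= N
    gradT /= N

    H_hat = T_hat * gradR - R_hat * gradT        # (11)
    return project(theta + alpha * H_hat)        # projected ascent step (10)
```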

An immediate consequence of Theorem 1 is the following.

Corollary 1: The sequence $\{\theta_m\}_{m \ge 1}$ generated by Algorithm 1 converges to a local optimum. ∎

Remark 1: Algorithm 1 is presented in its simplest form. It is possible to use standard variance reduction techniques, such as subtracting a baseline [19]–[21], to reduce the variance. ∎

Remark 2: In Algorithm 1, we use two separate runs to compute $(\hat{R}_m, \hat{T}_m)$ and $(\widehat{\nabla R}_m, \widehat{\nabla T}_m)$ to ensure that the independence conditions of Proposition 2 are satisfied. In practice, we found that using a single run to compute both $(\hat{R}_m, \hat{T}_m)$ and $(\widehat{\nabla R}_m, \widehat{\nabla T}_m)$ has a negligible effect on the accuracy of convergence (but speeds up convergence by a factor of two). ∎

Remark 3: It has been reported in the literature [22] that using a biased estimate of the gradient, given by

  $R^{(n)}_\sigma = \Gamma^{(n)} \sum_{t=\sigma}^{\tau^{(n)}-1} \gamma^{t-\sigma} R_t$   (18)

(and a similar expression for $T^{(n)}_\sigma$), leads to faster convergence. We call this variant RMC with biased gradients and, in our experiments, found that it does converge faster than RMC. ∎
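To spell out the difference between (13) and (18), here is a small illustrative sketch (ours); `rewards` holds the per-step rewards of one regenerative cycle, indexed from the start of the cycle.

```python
import numpy as np

def tail_return_unbiased(rewards, gamma, sigma):
    """R_sigma^(n) of (13): discounting measured from the start of the cycle."""
    tail = np.asarray(rewards, dtype=float)[sigma:]
    return float(np.sum(gamma ** np.arange(sigma, sigma + len(tail)) * tail))

def tail_return_biased(rewards, gamma, sigma):
    """The estimate of (18) used by RMC-B: discounting restarted at sigma."""
    tail = np.asarray(rewards, dtype=float)[sigma:]
    return float(np.sum(gamma ** np.arange(len(tail)) * tail))

# The two differ only by the factor gamma**sigma (sigma counted from the start of
# the cycle); dropping that factor biases (14)-(15) but, as noted in Remark 3,
# was observed to speed up convergence in practice.
```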

B. Simultaneous perturbation based gradient estimator

Another approach to estimate the performance gradient is to use simultaneous perturbation based estimates [23]–[26]. The general one-sided form of such estimates is

  $\widehat{\nabla R}_\theta = \delta\, (\hat{R}_{\theta + c\delta} - \hat{R}_\theta)/c$,

where $\delta$ is a random variable with the same dimension as $\theta$ and $c$ is a small constant. The expression for $\widehat{\nabla T}_\theta$ is similar. When $\delta_i \sim \operatorname{Rademacher}(\pm 1)$, the above method corresponds to simultaneous perturbation stochastic approximation (SPSA) [23], [24]; when $\delta \sim \operatorname{Normal}(0, I)$, the above method corresponds to smoothed function stochastic approximation (SFSA) [25], [26].

Substituting the above estimates in (11) and simplifying, we get

  $\hat{H}_\theta = \delta\, (\hat{T}_\theta \hat{R}_{\theta + c\delta} - \hat{R}_\theta \hat{T}_{\theta + c\delta})/c$.

The complete algorithm is shown in Algorithm 2. Since $(\hat{R}_\theta, \hat{T}_\theta)$ and $(\hat{R}_{\theta + c\delta}, \hat{T}_{\theta + c\delta})$ are estimated from separate sample paths, $\hat{H}_\theta$ defined above is an unbiased estimator of $H_\theta$.

Algorithm 2: RMC algorithm with simultaneous perturbation based gradient estimates.
  Input: initial policy $\theta_0$, discount factor $\gamma$, initial state $s_0$, number of regenerative cycles $N$, constant $c$, perturbation distribution $\Delta$.
  for iteration $m = 0, 1, \dots$ do
    for regenerative cycle $n_1 = 1$ to $N$ do
      Generate the $n_1$-th regenerative cycle using policy $\pi_{\theta_m}$.
      Compute $R^{(n_1)}$ and $T^{(n_1)}$ using (4).
    Set $\hat{R}_m = \operatorname{average}(R^{(n_1)} : n_1 \in \{1, \dots, N\})$.
    Set $\hat{T}_m = \operatorname{average}(T^{(n_1)} : n_1 \in \{1, \dots, N\})$.
    Sample $\delta \sim \Delta$ and set $\theta'_m = \theta_m + c\delta$.
    for regenerative cycle $n_2 = 1$ to $N$ do
      Generate the $n_2$-th regenerative cycle using policy $\pi_{\theta'_m}$.
      Compute $R^{(n_2)}$ and $T^{(n_2)}$ using (4).
    Set $\hat{R}'_m = \operatorname{average}(R^{(n_2)} : n_2 \in \{1, \dots, N\})$.
    Set $\hat{T}'_m = \operatorname{average}(T^{(n_2)} : n_2 \in \{1, \dots, N\})$.
    Set $\hat{H}_m = \delta\, (\hat{T}_m \hat{R}'_m - \hat{R}_m \hat{T}'_m)/c$.
    Update $\theta_{m+1} = \big[\theta_m + \alpha_m \hat{H}_m\big]_\Theta$.

An immediate consequence of Theorem 1 is the following.

Corollary 2: The sequence $\{\theta_m\}_{m \ge 1}$ generated by Algorithm 2 converges to a local optimum. ∎
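The following sketch (ours) shows the gradient-free update of Algorithm 2. The helper `estimate_R_T(theta, N)`, assumed here, returns the averaged cycle estimates $(\hat R_\theta, \hat T_\theta)$ from $N$ fresh regenerative cycles under $\pi_\theta$.

```python
import numpy as np

def rmc_spsa_iteration(theta, estimate_R_T, alpha, c, N, project, rng,
                       perturbation="normal"):
    """One iteration of Algorithm 2 (a sketch, not the authors' code)."""
    # Unperturbed estimates from N regenerative cycles.
    R_hat, T_hat = estimate_R_T(theta, N)

    # Sample the perturbation direction delta ~ Delta.
    if perturbation == "rademacher":          # SPSA [23], [24]
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
    else:                                     # SFSA [25], [26]
        delta = rng.normal(size=theta.shape)

    # Perturbed estimates from a separate set of N cycles under theta + c * delta.
    R_pert, T_pert = estimate_R_T(theta + c * delta, N)

    # H_hat = delta (T_hat R_pert - R_hat T_pert) / c, followed by the projected step (10).
    H_hat = delta * (T_hat * R_pert - R_hat * T_pert) / c
    return project(theta + alpha * H_hat)
```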

III. RMC FOR POST-DECISION STATE MODEL

In many models, the state dynamics can be split into two parts: a controlled evolution followed by an uncontrolled evolution. For example, many continuous state models have dynamics of the form

  $S_{t+1} = f(S_t, A_t) + N_t$,

where $\{N_t\}_{t \ge 0}$ is an independent noise process. For other examples, see the inventory control and event-triggered communication models in Sec. V. Such models can be written in terms of a post-decision state model, described below.

Consider a post-decision state MDP with pre-decision state $S^-_t \in \mathcal{S}^-$, post-decision state $S^+_t \in \mathcal{S}^+$, and action $A_t \in \mathcal{A}$. The system starts at an initial state $s^+_0 \in \mathcal{S}^+$ and at time $t$:
1) there is a controlled transition from $S^-_t$ to $S^+_t$ according to a transition kernel $P^-(A_t)$;
2) there is an uncontrolled transition from $S^+_t$ to $S^-_{t+1}$ according to a transition kernel $P^+$;
3) a per-step reward $R_t = r(S^-_t, A_t, S^+_t)$ is received.

The future is discounted at a rate $\gamma \in (0, 1)$.

Remark 4: When $\mathcal{S}^+ = \mathcal{S}^-$ and $P^+$ is the identity, the above model reduces to the standard MDP model considered in Sec. II. When $P^+$ is a deterministic transition, the model reduces to a standard MDP model with post-decision states [27], [28]. ∎

As in Sec. II, we choose a (time-homogeneous and Markov) policy $\pi$ that maps the current pre-decision state $\mathcal{S}^-$ to a distribution on actions, i.e., $A_t \sim \pi(S^-_t)$. We use $\pi(a \mid s^-)$ to denote $\mathbb{P}(A_t = a \mid S^-_t = s^-)$. The performance when the system starts in post-decision state $s^+_0 \in \mathcal{S}^+$ and follows policy $\pi$ is given by

  $J_\pi = \mathbb{E}_{A_t \sim \pi(S^-_t)}\big[\sum_{t=0}^{\infty} \gamma^t R_t \,\big|\, S^+_0 = s^+_0\big]$.   (19)

As before, we are interested in identifying an optimal policy, i.e., a policy that maximizes the performance. When $\mathcal{S}$ and $\mathcal{A}$ are Borel spaces, we assume that the model satisfies the standard conditions under which time-homogeneous Markov policies are optimal [15].

Let $\tau^{(n)}$ denote the stopping times such that $\tau^{(0)} = 0$ and, for $n \ge 1$,

  $\tau^{(n)} = \inf\{t > \tau^{(n-1)} : s^+_{t-1} = s^+_0\}$.

The slightly unusual definition (using $s^+_{t-1} = s^+_0$ rather than the more natural $s^+_t = s^+_0$) is to ensure that the formulas for $R^{(n)}$ and $T^{(n)}$ used in Sec. II remain valid for the post-decision state model as well. Thus, using arguments similar to Sec. II, we can show that both variants of RMC presented in Sec. II converge to a locally optimal parameter $\theta$ for the post-decision state model as well.

IV. APPROXIMATE RMC

In this section, we present an approximate version of RMC (for the basic model of Sec. II). Suppose that the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ are separable metric spaces (with metrics $d_S$ and $d_A$).

Given an approximation constant $\rho \in \mathbb{R}_{>0}$, let $B^\rho = \{s \in \mathcal{S} : d_S(s, s_0) \le \rho\}$ denote the ball of radius $\rho$ centered at $s_0$. Given a policy $\pi$, let $\tau^{(n)}$ denote the stopping times of successive visits to $B^\rho$, i.e., $\tau^{(0)} = 0$ and, for $n \ge 1$,

  $\tau^{(n)} = \inf\{t > \tau^{(n-1)} : s_t \in B^\rho\}$.

Define $R^{(n)}$ and $T^{(n)}$ as in (4) and let $R^\rho_\theta$ and $T^\rho_\theta$ denote the expected values of $R^{(n)}$ and $T^{(n)}$, respectively. Define

  $J^\rho_\theta = \dfrac{R^\rho_\theta}{(1 - \gamma)\, T^\rho_\theta}$.

Theorem 2: Given a policy $\pi_\theta$, let $V_\theta$ denote its value function and let $\mathsf{T}^\rho_\theta = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}[\gamma^{\tau^{(1)}} \mid S_0 = s_0]$ (which is always less than $\gamma$). Suppose the following condition is satisfied:

(C) The value function $V_\theta$ is locally Lipschitz in $B^\rho$, i.e., there exists an $L_\theta$ such that for any $s, s' \in B^\rho$,
  $|V_\theta(s) - V_\theta(s')| \le L_\theta\, d_S(s, s')$.

Then

  $\big|J_\theta - J^\rho_\theta\big| \le \dfrac{L_\theta \mathsf{T}^\rho_\theta}{(1 - \gamma)\, T^\rho_\theta}\, \rho \le \dfrac{\gamma}{1 - \gamma}\, L_\theta \rho$.   (20)

PROOF: We follow an argument similar to Proposition 1:

  $J_\theta = V_\theta(s_0) = \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\Big[\sum_{t=0}^{\tau^{(1)}-1} \gamma^t R_t + \gamma^{\tau^{(1)}} \sum_{t=\tau^{(1)}}^{\infty} \gamma^{t - \tau^{(1)}} R_t \,\Big|\, S_0 = s_0\Big]$
  $\overset{(a)}{=} R^\rho_\theta + \mathbb{E}_{A_t \sim \pi_\theta(S_t)}\big[\gamma^{\tau^{(1)}} \mid S_0 = s_0\big]\, V_\theta(s_{\tau^{(1)}})$,   (21)

where (a) uses the strong Markov property. Since $V_\theta$ is locally Lipschitz with constant $L_\theta$ and $s_{\tau^{(1)}} \in B^\rho$, we have that

  $\big|J_\theta - V_\theta(s_{\tau^{(1)}})\big| = \big|V_\theta(s_0) - V_\theta(s_{\tau^{(1)}})\big| \le L_\theta \rho$.

Substituting the above in (21) gives

  $J_\theta \le R^\rho_\theta + \mathsf{T}^\rho_\theta (J_\theta + L_\theta \rho)$.

Substituting $T^\rho_\theta = (1 - \mathsf{T}^\rho_\theta)/(1 - \gamma)$ and rearranging the terms, we get

  $J_\theta \le J^\rho_\theta + \dfrac{L_\theta \mathsf{T}^\rho_\theta}{(1 - \gamma)\, T^\rho_\theta}\, \rho$.

The other direction can be proved using a similar argument. The second inequality in (20) follows from $\mathsf{T}^\rho_\theta \le \gamma$ and $T^\rho_\theta \ge 1$. ∎

Theorem 2 implies that we can find an approximately optimal policy by identifying policy parameters $\theta$ that maximize $J^\rho_\theta$. To do so, we can appropriately modify both variants of RMC defined in Sec. II to declare a renewal whenever the state lies in $B^\rho$.

For specific models, it may be possible to verify that the value function is locally Lipschitz (see Sec. V-C for an example). However, we are not aware of general conditions that guarantee local Lipschitz continuity of value functions. It is possible to identify sufficient conditions that guarantee global Lipschitz continuity of value functions (see [29, Theorem 4.1], [30, Lemma 1, Theorem 1], [31, Lemma 1]). We state these conditions below.

Proposition 3: Let $V_\theta$ denote the value function of any policy $\pi_\theta$. Suppose the model satisfies the following conditions:
1) The transition kernel $P$ is Lipschitz, i.e., there exists a constant $L_P$ such that for all $s, s' \in \mathcal{S}$ and $a, a' \in \mathcal{A}$,
  $\mathcal{K}\big(P(\cdot \mid s, a), P(\cdot \mid s', a')\big) \le L_P \big(d_S(s, s') + d_A(a, a')\big)$,
where $\mathcal{K}$ is the Kantorovich metric (also called the Kantorovich-Monge-Rubinstein metric or Wasserstein distance) between probability measures.
2) The per-step reward $r$ is Lipschitz, i.e., there exists a constant $L_r$ such that for all $s, s', s_+ \in \mathcal{S}$ and $a, a' \in \mathcal{A}$,
  $|r(s, a, s_+) - r(s', a', s_+)| \le L_r \big(d_S(s, s') + d_A(a, a')\big)$.
In addition, suppose the policy satisfies the following:
3) The policy $\pi_\theta$ is Lipschitz, i.e., there exists a constant $L_{\pi_\theta}$ such that for any $s, s' \in \mathcal{S}$,
  $\mathcal{K}\big(\pi_\theta(\cdot \mid s), \pi_\theta(\cdot \mid s')\big) \le L_{\pi_\theta}\, d_S(s, s')$.
4) $\gamma L_P (1 + L_{\pi_\theta}) < 1$.
5) The value function $V_\theta$ exists and is finite.
Then, $V_\theta$ is Lipschitz. In particular, for any $s, s' \in \mathcal{S}$,
  $|V_\theta(s) - V_\theta(s')| \le L_\theta\, d_S(s, s')$,
where
  $L_\theta = \dfrac{L_r (1 + L_{\pi_\theta})}{1 - \gamma L_P (1 + L_{\pi_\theta})}$.
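Before moving to the experiments, here is a concrete illustration (ours, not from the paper) of the modification described after Theorem 2: a regenerative cycle for approximate RMC ends as soon as the state re-enters the ball $B^\rho$. The `env_step` and `policy` interfaces and the Euclidean default metric are assumptions.

```python
import numpy as np

def run_approximate_cycle(env_step, policy, s0, gamma, rho, dist=None):
    """One regenerative cycle for approximate RMC: a renewal is declared whenever
    the state re-enters B_rho = {s : d_S(s, s0) <= rho}.

    `dist` is the state-space metric d_S (Euclidean distance by default).
    Setting rho = 0 recovers the exact-renewal cycles of Sec. II.
    """
    if dist is None:
        dist = lambda s, sp: float(np.linalg.norm(np.atleast_1d(s) - np.atleast_1d(sp)))
    s = s0
    R, T, disc = 0.0, 0.0, 1.0
    while True:
        a = policy(s)
        s_next, r = env_step(s, a)
        R += disc * r
        T += disc
        disc *= gamma
        s = s_next
        if dist(s, s0) <= rho:   # approximate renewal: state is within B_rho
            return R, T, s       # then J^rho = R_hat / ((1 - gamma) T_hat), cf. Theorem 2
```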

V. NUMERICAL EXPERIMENTS

We conduct three experiments to evaluate the performance of RMC: a randomly generated MDP, event-triggered communication, and inventory management.

A. Randomized MDP (GARNET)

In this experiment, we study a randomly generated GARNET(100, 10, 50) model [32], which is an MDP with 100 states, 10 actions, and a branching factor of 50 (which means that each row of every transition matrix has 50 non-zero elements, chosen Unif[0, 1] and normalized to sum to 1). For each state-action pair, with probability $p = 0.05$ the reward is chosen Unif[10, 100], and with probability $1 - p$ the reward is 0. The future is discounted by a factor of $\gamma = 0.9$. The first state is chosen as the start state. The policy is a Gibbs soft-max distribution parameterized by $100 \times 10$ (states $\times$ actions) parameters, where each parameter belongs to the interval $[-30, 30]$. The temperature of the Gibbs distribution is kept constant and equal to 1.

We compare the performance of RMC, RMC with biased gradient (denoted by RMC-B, see Remark 3), and actor-critic with eligibility traces for the critic [3] (which we refer to as SARSA-λ and abbreviate as S-λ in the plots), with $\lambda \in \{0, 0.25, 0.5, 0.75, 1\}$. For both RMC algorithms, we use the same runs to estimate the gradients (see Remark 2 in Sec. II). (Footnote: For all algorithms, the learning rate is chosen using ADAM [33] with default hyper-parameters and the α parameter of ADAM equal to 0.05 for RMC, RMC-B, and the actor in SARSA-λ; the learning rate is equal to 0.1 for the critic in SARSA-λ. For RMC and RMC-B, the policy parameters are updated after N = 5 renewals.) Each algorithm is run 100 times and the mean and standard deviation of the performance (as estimated by the algorithms themselves) are shown in Fig. 1a. The performance of the corresponding policy, evaluated by Monte Carlo evaluation over a horizon of 250 steps and averaged over 100 runs, is shown in Fig. 1b. The optimal performance computed using value iteration is also shown.

[Fig. 1: Performance of different learning algorithms on GARNET(100, 10, 50) with p = 0.05 and γ = 0.9. (a) The performance estimated by the algorithms online. (b) The performance estimated by averaging over 100 Monte Carlo evaluations for a rollout horizon of 250. The solid lines show the mean value and the shaded region shows the ± one standard deviation region.]

The results show that SARSA-λ learns faster (this is expected because the critic keeps track of the entire value function) but has higher variance and gets stuck in a local minimum. On the other hand, RMC and RMC-B learn more slowly but have low bias and do not get stuck in a local minimum. The same qualitative behavior was observed for other randomly generated models. Policy gradient algorithms only guarantee convergence to a local optimum; we are not sure why RMC and SARSA differ in which local minima they converge to. Also, it was observed that RMC-B (which is RMC with biased evaluation of the gradient) learns faster than RMC.
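For concreteness, here is a minimal sketch (ours) of the tabular Gibbs soft-max parameterization used in this experiment, returning both a sampled action and the score $\Lambda_t = \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ needed in (12). The interface details and the usage example are illustrative, not taken from the authors' code.

```python
import numpy as np

def gibbs_policy(theta, s, rng, temperature=1.0):
    """Sample an action from the tabular soft-max policy and return its score.

    theta has shape (n_states, n_actions); the score is
    d/d theta[s, b] log pi_theta(a | s) = (1{b == a} - pi_theta(b | s)) / temperature.
    """
    logits = theta[s] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = int(rng.choice(len(probs), p=probs))
    score = np.zeros_like(theta)
    score[s] = -probs / temperature
    score[s, a] += 1.0 / temperature
    return a, score

# Example usage (shapes as in the GARNET(100, 10, 50) experiment):
# rng = np.random.default_rng(0)
# theta = np.zeros((100, 10))
# a, Lam = gibbs_policy(theta, s=0, rng=rng)
```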

B. Event-Triggered Communication

In this experiment, we study an event-triggered communication problem that arises in networked control systems [34], [35]. A transmitter observes a first-order autoregressive process $\{X_t\}_{t \ge 1}$, i.e., $X_{t+1} = \alpha X_t + W_t$, where $\alpha, X_t, W_t \in \mathbb{R}$ and $\{W_t\}_{t \ge 1}$ is an i.i.d. process. At each time, the transmitter uses an event-triggered policy (explained below) to determine whether to transmit or not (denoted by $A_t = 1$ and $A_t = 0$, respectively). Transmission takes place over an i.i.d. erasure channel with erasure probability $p_d$. Let $S^-_t$ and $S^+_t$ denote the "error" between the source realization and its reconstruction at the receiver. It can be shown that $S^-_t$ and $S^+_t$ evolve as follows [34], [35]: when $A_t = 0$, $S^+_t = S^-_t$; when $A_t = 1$, $S^+_t = 0$ if the transmission is successful (w.p. $1 - p_d$) and $S^+_t = S^-_t$ if the transmission is not successful (w.p. $p_d$); and $S^-_{t+1} = \alpha S^+_t + W_t$. Note that this is a post-decision state model, where the post-decision state resets to zero after every successful transmission. (Footnote: Had we used the standard MDP model instead of the post-decision state model, this restart would not have always resulted in a renewal.)

The per-step cost has two components: a communication cost of $\lambda A_t$, where $\lambda \in \mathbb{R}_{>0}$, and an estimation error $(S^+_t)^2$. The objective is to minimize the expected discounted cost.
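A minimal sketch (ours) of the error dynamics and per-step cost just described is given below; the function names and the random number generator interface are assumptions, and the threshold form of the event-triggered policy that it pairs with is defined in the next paragraph.

```python
import numpy as np

def threshold_policy(s_pre, theta):
    """Event-triggered policy: transmit (A_t = 1) whenever |S_t^-| >= theta."""
    return 1 if abs(s_pre) >= theta else 0

def error_process_step(s_pre, a, alpha, p_d, lam, rng):
    """One step of the error process (a sketch, not the authors' code).

    Returns the post-decision error S_t^+, the next pre-decision error S_{t+1}^-,
    and the per-step cost lam * A_t + (S_t^+)^2.
    """
    if a == 1 and rng.random() > p_d:     # transmission attempted and not erased
        s_post = 0.0                      # error resets: this is the renewal state
    else:
        s_post = s_pre                    # no transmission, or packet dropped
    cost = lam * a + s_post ** 2
    s_pre_next = alpha * s_post + rng.normal(0.0, 1.0)   # S_{t+1}^- = alpha S_t^+ + W_t
    return s_post, s_pre_next, cost
```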

An event-triggered policy is a threshold policy that chooses $A_t = 1$ whenever $|S^-_t| \ge \theta$, where $\theta$ is a design choice. Under certain conditions, such an event-triggered policy is known to be optimal [34], [35]. When the system model is known, algorithms to compute the optimal $\theta$ are presented in [36], [37]. In this section, we use RMC to identify the optimal policy when the model parameters are not known.

In our experiment we consider an event-triggered model with $\alpha = 1$, $\lambda = 500$, $p_d \in \{0, 0.1, 0.2\}$, $W_t \sim \mathcal{N}(0, 1)$, $\gamma = 0.9$, and use the simultaneous perturbation variant of RMC to identify $\theta$. (Footnote: An event-triggered policy is a parametric policy, but $\pi_\theta(a \mid s^-)$ is not differentiable in $\theta$. Therefore, the likelihood ratio method cannot be used to estimate the performance gradient.) We run the algorithm 100 times and the results for the different choices of $p_d$ are shown in Fig. 2. (Footnote: We choose the learning rate using ADAM with default hyper-parameters and the α parameter of ADAM equal to 0.01. We choose c = 0.3, N = 100, and Δ = 𝒩(0, 1) in Algorithm 2.) For $p_d = 0$, the optimal threshold computed using [37] is also shown. The results show that RMC converges relatively quickly and has low bias across multiple runs.

[Fig. 2: Policy parameters versus number of samples (sample values averaged over 100 runs) for event-driven communication using RMC for different values of p_d. The solid lines show the mean value and the shaded area shows the ± one standard deviation region.]

C. Inventory Control

In this experiment, we study an inventory management problem that arises in operations research [38], [39]. Let $S_t \in \mathbb{R}$ denote the volume of goods stored in a warehouse, $A_t \in \mathbb{R}_{\ge 0}$ denote the amount of goods ordered, and $D_t$ denote the demand. The state evolves according to $S_{t+1} = S_t + A_t - D_{t+1}$. We work with the normalized cost function

  $C(s) = a_p s (1 - \gamma)/\gamma + a_h s\, \mathbb{1}_{\{s \ge 0\}} - a_b s\, \mathbb{1}_{\{s < 0\}}$,

where $a_p$ is the procurement cost, $a_h$ is the holding cost, and $a_b$ is the backlog cost (see [40, Chapter 13] for details). It is known that there exists a threshold $\theta$ such that the optimal policy is a base-stock policy with threshold $\theta$ (i.e., whenever the current stock level falls below $\theta$, one orders up to $\theta$). Furthermore, for $s \le \theta$, we have that [40, Sec. 13.2]

  $V_\theta(s) = C(s) + \dfrac{\gamma}{1 - \gamma}\, \mathbb{E}[C(\theta - D)]$.   (22)

So, for $B^\rho \subset (0, \theta)$, the value function is locally Lipschitz, with

  $L_\theta = a_h + \dfrac{1 - \gamma}{\gamma}\, a_p$.

So, we can use approximate RMC to learn the optimal policy.

In our experiments, we consider an inventory management model with $a_h = 1$, $a_b = 1$, $a_p = 1.5$, $D_t \sim \operatorname{Exp}(\lambda)$ with $\lambda = 0.025$, start state $s_0 = 1$, discount factor $\gamma = 0.9$, and use the simultaneous perturbation variant of approximate RMC to identify $\theta$. We run the algorithm 100 times and the results are shown in Fig. 3. (Footnote: We choose the learning rate using ADAM with default hyper-parameters and the α parameter of ADAM equal to 0.25. We choose c = 3.0, N = 100, and Δ = 𝒩(0, 1) in Algorithm 2, and choose ρ = 0.5 for approximate RMC. We bound the states within [−100.0, 100.0].) The optimal threshold and performance computed using [40, Sec. 13.2] are also shown. (Footnote: For Exp(λ) demand, the optimal threshold is (see [40, Sec. 13.2]) $\theta^* = \frac{1}{\lambda} \log\!\big(\frac{a_h + a_b}{a_h + a_p (1 - \gamma)/\gamma}\big)$.) The results show that RMC converges to an approximately optimal parameter value, with total cost within the bound predicted in Theorem 2.

[Fig. 3: (a) Policy parameters and (b) performance (total cost) versus number of samples (sample values averaged over 100 runs) for inventory control using RMC. The solid lines show the mean value and the shaded area shows the ± one standard deviation region. In (b), the performance is computed using (22) for the policy parameters given in (a). The red rectangular region shows the total cost bound given by Theorem 2.]
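As a small numerical check of the closed-form quantities quoted above (the helper names are ours), the following evaluates the optimal base-stock threshold from the footnote formula, the local Lipschitz constant $L_\theta$, and the right-hand bound in (20) for the experimental parameters.

```python
import numpy as np

def optimal_base_stock(a_h, a_b, a_p, gamma, lam):
    """Optimal threshold for Exp(lam) demand, per the formula in [40, Sec. 13.2]."""
    return (1.0 / lam) * np.log((a_h + a_b) / (a_h + a_p * (1.0 - gamma) / gamma))

def lipschitz_constant(a_h, a_p, gamma):
    """Local Lipschitz constant L_theta of the value function used with Theorem 2."""
    return a_h + (1.0 - gamma) / gamma * a_p

a_h, a_b, a_p, gamma, lam, rho = 1.0, 1.0, 1.5, 0.9, 0.025, 0.5
theta_star = optimal_base_stock(a_h, a_b, a_p, gamma, lam)   # ~= 21.6 for these values
L_theta = lipschitz_constant(a_h, a_p, gamma)                # ~= 1.17
error_bound = gamma / (1.0 - gamma) * L_theta * rho          # gamma/(1-gamma) * L_theta * rho = 5.25
print(theta_star, L_theta, error_bound)
```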

VI. CONCLUSIONS

We present a renewal theory based reinforcement learning algorithm called Renewal Monte Carlo (RMC). RMC retains the key advantages of Monte Carlo methods: it has low bias, is simple and easy to implement, and works for models with continuous state and action spaces. In addition, due to the averaging over multiple renewals, RMC has low variance. We generalized the RMC algorithm to post-decision state models and also presented a variant that converges faster to an approximately optimal policy, where the renewal state is replaced by a renewal set. The error in using such an approximation is bounded by the size of the renewal set.

In certain models, one is interested in the performance at a reference state that is not the start state. In such models, we can start with an arbitrary policy, ignore the trajectory until the reference state is visited for the first time, and use RMC from that time onwards (treating the reference state as the new start state).

The results presented in this paper also apply to average reward models, where the objective is to maximize

  $J_\pi = \lim_{T \to \infty} \dfrac{1}{T}\, \mathbb{E}_{A_t \sim \pi(S_t)}\Big[\sum_{t=0}^{T-1} R_t \,\Big|\, S_0 = s_0\Big]$.   (23)

Let the stopping times $\tau^{(n)}$ be defined as before, and define the total reward $R^{(n)}$ and duration $T^{(n)}$ of the $n$-th regenerative cycle as

  $R^{(n)} = \sum_{t=\tau^{(n-1)}}^{\tau^{(n)}-1} R_t$ and $T^{(n)} = \tau^{(n)} - \tau^{(n-1)}$.

Let $R_\theta$ and $T_\theta$ denote the expected values of $R^{(n)}$ and $T^{(n)}$ under policy $\pi_\theta$. Then, from standard renewal theory, the performance $J_\theta$ equals $R_\theta / T_\theta$ and, therefore, $\nabla_\theta J_\theta = H_\theta / T_\theta^2$, where $H_\theta$ is defined as in (9). We can use both variants of RMC presented in Sec. II to obtain estimates of $H_\theta$ and use these to update the policy parameters using (10).
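A sketch (ours) of the corresponding renewal estimates for the average reward setup: `cycles` is a list of per-step reward lists, one per regenerative cycle.

```python
import numpy as np

def average_reward_cycle_stats(cycles):
    """Renewal estimates for the average reward setup described above.

    Returns (R_hat, T_hat, J_hat) with J_hat = R_hat / T_hat, the renewal-reward
    estimate of the average reward objective (23).
    """
    R_hat = float(np.mean([np.sum(c) for c in cycles]))   # undiscounted cycle reward
    T_hat = float(np.mean([len(c) for c in cycles]))      # cycle length tau^(n) - tau^(n-1)
    return R_hat, T_hat, R_hat / T_hat
```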
ACKNOWLEDGMENT

The authors are grateful to Joelle Pineau for useful feedback and for suggesting the idea of approximate RMC.

REFERENCES

[1] D. Bertsekas and J. Tsitsiklis, Neuro-dynamic Programming. Athena Scientific, 1996.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[4] C. Szepesvári, Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
[5] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, Nov. 2000, pp. 1057–1063.
[6] S. M. Kakade, "A natural policy gradient," in Advances in Neural Information Processing Systems, Dec. 2002, pp. 1531–1538.
[7] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
[8] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), June 2015, pp. 1889–1897.
[9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[10] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.
[11] P. Glynn, "Optimization of stochastic systems," in Proc. Winter Simulation Conference, Dec. 1986, pp. 52–59.
[12] ——, "Likelihood ratio gradient estimation for stochastic systems," Communications of the ACM, vol. 33, pp. 75–84, 1990.
[13] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Trans. Autom. Control, vol. 46, no. 2, pp. 191–209, Feb. 2001.
[14] ——, "Approximate gradient methods in policy-space optimization of Markov reward processes," Discrete Event Dynamic Systems, vol. 13, no. 2, pp. 111–148, 2003.
[15] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer Science & Business Media, 1996, vol. 30.
[16] V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[17] W. Feller, An Introduction to Probability Theory and Its Applications. John Wiley and Sons, 1966, vol. 1.
[18] R. Y. Rubinstein, "Sensitivity analysis and performance extrapolation for computer simulation models," Operations Research, vol. 37, no. 1, pp. 72–81, 1989.
[19] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[20] E. Greensmith, P. L. Bartlett, and J. Baxter, "Variance reduction techniques for gradient estimates in reinforcement learning," Journal of Machine Learning Research, vol. 5, pp. 1471–1530, 2004.
[21] J. Peters and S. Schaal, "Policy gradient methods for robotics," in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2006, pp. 2219–2225.
[22] P. Thomas, "Bias in natural actor-critic algorithms," in International Conference on Machine Learning, June 2014, pp. 441–448.
[23] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Trans. Autom. Control, vol. 37, no. 3, pp. 332–341, 1992.
[24] J. L. Maryak and D. C. Chin, "Global random optimization by simultaneous perturbation stochastic approximation," IEEE Trans. Autom. Control, vol. 53, no. 3, pp. 780–783, Apr. 2008.
[25] V. Katkovnik and Y. Kulchitsky, "Convergence of a class of random search algorithms," Automation and Remote Control, vol. 33, no. 8, pp. 1321–1326, 1972.
[26] S. Bhatnagar, H. Prasad, and L. Prashanth, Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. Springer, 2013, vol. 434.
[27] B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis, "A neuro-dynamic programming approach to retailer inventory management," in 36th IEEE Conference on Decision and Control, vol. 4, Dec. 1997, pp. 4052–4057.
[28] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. John Wiley & Sons, 2011.
[29] K. Hinderer, "Lipschitz continuity of value functions in Markovian decision processes," Mathematical Methods of Operations Research, vol. 62, no. 1, pp. 3–22, Sep. 2005.
[30] E. Rachelson and M. G. Lagoudakis, "On the locality of action domination in sequential decision making," in 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010), Fort Lauderdale, US, Jan. 2010, pp. 1–8.
[31] M. Pirotta, M. Restelli, and L. Bascetta, "Policy gradient in Lipschitz Markov decision processes," Machine Learning, vol. 100, no. 2, pp. 255–283, Sep. 2015.
[32] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Department of Computing Science, University of Alberta, Canada, Tech. Rep., 2009.
[33] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] G. M. Lipsa and N. Martins, "Remote state estimation with communication costs for first-order LTI systems," IEEE Trans. Autom. Control, vol. 56, no. 9, pp. 2013–2025, Sep. 2011.
[35] J. Chakravorty, J. Subramanian, and A. Mahajan, "Stochastic approximation based methods for computing the optimal thresholds in remote-state estimation with packet drops," in Proc. American Control Conference, Seattle, WA, May 2017, pp. 462–467.
[36] Y. Xu and J. P. Hespanha, "Optimal communication logics in networked control systems," in 43rd IEEE Conference on Decision and Control, Dec. 2004, pp. 3527–3532.
[37] J. Chakravorty and A. Mahajan, "Fundamental limits of remote estimation of Markov processes under communication constraints," IEEE Trans. Autom. Control, vol. 62, no. 3, pp. 1109–1124, Mar. 2017.
[38] K. J. Arrow, T. Harris, and J. Marschak, "Optimal inventory policy," Econometrica: Journal of the Econometric Society, pp. 250–272, 1951.
[39] R. Bellman, I. Glicksberg, and O. Gross, "On the optimal inventory equation," Management Science, vol. 2, no. 1, pp. 83–104, 1955.
[40] P. Whittle, Optimization Over Time: Dynamic Programming and Optimal Control. John Wiley and Sons, Ltd., 1982.