arXiv:1507.01026v2 [cs.SY] 1 Oct 2015

Report LIDS-P-3174, May 2015 (Revised Sept. 2015). To appear in IEEE Transactions on Neural Networks, 2015.

Value and Policy Iteration in Optimal Control and Adaptive Dynamic Programming

Dimitri P. Bertsekas

(Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139, [email protected]. Many thanks are due to Huizhen (Janey) Yu for collaboration and helpful discussions in the course of related works.)

Abstract—In this paper, we consider discrete-time infinite horizon problems of optimal control to a terminal set of states. These are the problems that are often taken as the starting point for adaptive dynamic programming. Under very general assumptions, we establish the uniqueness of solution of Bellman's equation, and we provide convergence results for value and policy iteration.

I. INTRODUCTION

In this paper we consider a deterministic discrete-time optimal control problem involving the system

x_{k+1} = f(x_k, u_k),   k = 0, 1, . . . ,   (1)

where x_k and u_k are the state and control at stage k, taking values in sets X and U, respectively. No restrictions are placed on X and U: for example, they may be finite sets as in classical shortest path problems involving a graph, or they may be continuous spaces as in classical problems of control to the origin or some other terminal set. The control u_k must be chosen from a constraint set U(x_k) ⊂ U that may depend on the current state x_k. The cost for the kth stage, denoted g(x_k, u_k), is assumed nonnegative and may possibly take the value ∞:

0 ≤ g(x_k, u_k) ≤ ∞,   x_k ∈ X, u_k ∈ U(x_k),   (2)

[values g(x_k, u_k) = ∞ may be used to model constraints on x_{k+1}, for example]. We are interested in feedback policies of the form π = {μ_0, μ_1, . . .}, where each μ_k is a function mapping every x ∈ X into the control μ_k(x) ∈ U(x). The set of all policies is denoted by Π. Policies of the form π = {μ, μ, . . .} are called stationary, and for convenience, when confusion cannot arise, will be denoted by μ.

Given an initial state x_0, a policy π = {μ_0, μ_1, . . .}, when applied to the system (1), generates a unique sequence of state-control pairs (x_k, μ_k(x_k)), k = 0, 1, . . . , with cost

J_π(x_0) = lim_{k→∞} Σ_{t=0}^{k} g(x_t, μ_t(x_t)),   x_0 ∈ X,   (3)

[the limit exists thanks to the nonnegativity assumption (2)]. We view J_π as a function over X, and we refer to it as the cost function of π. For a stationary policy μ, the corresponding cost function is denoted by J_μ. The optimal cost function is defined as

J*(x) = inf_{π∈Π} J_π(x),   x ∈ X,

and a policy π* is said to be optimal if it attains the minimum for all x ∈ X, i.e.,

J_{π*}(x) = J*(x),   ∀ x ∈ X.

In the context of dynamic programming (DP for short), one hopes to prove that the optimal cost function J* satisfies Bellman's equation:

J*(x) = inf_{u∈U(x)} { g(x, u) + J*(f(x, u)) },   ∀ x ∈ X,   (4)

and that an optimal stationary policy may be obtained through the minimization in the right side of this equation. Note that Bellman's equation generically has multiple solutions, since adding a positive constant to any solution produces another solution. A classical result, stated in Prop. 4(a) of Section II, is that the optimal cost function J* is the "smallest" solution of Bellman's equation. In this paper we will focus on deriving conditions under which J* is the unique solution within a certain restricted class of functions.

In this paper, we will also consider finding J* with the classical algorithms of value iteration (VI for short) and policy iteration (PI for short). The VI algorithm starts from some nonnegative function J_0 : X ↦ [0, ∞], and generates a sequence of functions {J_k} according to

J_{k+1}(x) = inf_{u∈U(x)} { g(x, u) + J_k(f(x, u)) },   x ∈ X.   (5)

We will derive conditions under which J_k converges to J* pointwise.

The PI algorithm starts from a stationary policy μ^0, and generates a sequence of stationary policies {μ^k} via a sequence of policy evaluations to obtain J_{μ^k} from the equation

J_{μ^k}(x) = g(x, μ^k(x)) + J_{μ^k}(f(x, μ^k(x))),   x ∈ X,   (6)

interleaved with policy improvements to obtain μ^{k+1} from J_{μ^k} according to

μ^{k+1}(x) ∈ arg min_{u∈U(x)} { g(x, u) + J_{μ^k}(f(x, u)) },   x ∈ X.   (7)

We implicitly assume here that J_{μ^k} satisfies Eq. (6), which is true under the cost nonnegativity assumption (2) (cf. Prop. 4 in the next section). Also, for the PI algorithm to be well-defined, the minimum in Eq. (7) should be attained for each x ∈ X, which is true under some conditions that guarantee compactness of the level sets

{ u ∈ U(x) | g(x, u) + J_{μ^k}(f(x, u)) ≤ λ },   λ ∈ ℜ.

We will derive conditions under which J_{μ^k} converges to J* pointwise.
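As a concrete illustration of the VI iteration (5) and the PI iteration (6)-(7), the following sketch runs both on a small finite deterministic shortest path instance. The instance (four states, the arc costs, and the termination state 0) is hypothetical and chosen only to make the recursions executable; it is not part of the paper's development.

```python
# Minimal sketch of VI (Eq. (5)) and PI (Eqs. (6)-(7)) on a hypothetical
# four-state deterministic shortest path problem; state 0 is cost-free and absorbing.
import math

X = [0, 1, 2, 3]
U = {0: [0], 1: [0, 2], 2: [0, 3], 3: [2]}        # U(x): allowed successor states
def f(x, u): return u                             # moving to the chosen successor
def g(x, u): return 0.0 if x == 0 else (1.0 if u == 0 else 0.5)   # stage costs

def vi(J, iters=100):
    """Value iteration: J_{k+1}(x) = min_{u in U(x)} [ g(x,u) + J_k(f(x,u)) ]."""
    for _ in range(iters):
        J = {x: min(g(x, u) + J[f(x, u)] for u in U[x]) for x in X}
    return J

def policy_evaluation(mu):
    """Evaluate J_mu by following mu until the absorbing state is reached."""
    J = {}
    for x in X:
        cost, y, steps = 0.0, x, 0
        while y != 0 and steps < 1000:
            cost += g(y, mu[y]); y = f(y, mu[y]); steps += 1
        J[x] = cost if y == 0 else math.inf       # infinite cost if mu never terminates
    return J

def pi(mu, iters=10):
    """Policy iteration: policy evaluation (6) alternated with improvement (7)."""
    for _ in range(iters):
        J = policy_evaluation(mu)
        mu = {x: min(U[x], key=lambda u: g(x, u) + J[f(x, u)]) for x in X}
    return mu, policy_evaluation(mu)

print(vi({x: 0.0 for x in X}))                    # VI started from J_0 = 0
print(pi({0: 0, 1: 0, 2: 0, 3: 2}))               # PI started from a terminating policy
```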
In this paper, we will address the preceding questions for the case where there is a nonempty stopping set X_s ⊂ X, which consists of cost-free and absorbing states in the sense that

g(x, u) = 0,   x = f(x, u),   ∀ x ∈ X_s, u ∈ U(x).   (8)

Clearly, J*(x) = 0 for all x ∈ X_s, so the set X_s may be viewed as a desirable set of termination states that we are trying to reach or approach with minimum total cost. We will assume in addition that J*(x) > 0 for x ∉ X_s, so that

X_s = { x ∈ X | J*(x) = 0 }.   (9)

In the applications of primary interest, g is usually taken to be strictly positive outside of X_s to encourage asymptotic convergence of the generated state sequence to X_s, so this assumption is natural and often easily verifiable. Besides X_s, another interesting subset of X is

X_f = { x ∈ X | J*(x) < ∞ }.

Ordinarily, in practical applications, the states in X_f are those from which one can reach the stopping set X_s, at least asymptotically.

For an initial state x, we say that a policy π terminates starting from x if the state sequence {x_k} generated starting from x and using π reaches X_s in finite time, i.e., satisfies x_k̄ ∈ X_s for some index k̄. A key assumption in this paper is that the optimal cost J*(x) (if it is finite) can be approximated arbitrarily closely by using policies that terminate from x. In particular, in all the results and discussions of the paper we make the following assumption (except for Prop. 5, which provides conditions under which the assumption holds).

Assumption 1. The cost nonnegativity condition (2) and stopping set conditions (8)-(9) hold. Moreover, for every pair (x, ε) with x ∈ X_f and ε > 0, there exists a policy π that terminates starting from x and satisfies J_π(x) ≤ J*(x) + ε.

Specific and easily verifiable conditions that imply this assumption will be given in Section IV. A prominent case is when X and U are finite, so the problem becomes a deterministic shortest path problem with nonnegative arc lengths. If all cycles of the state transition graph have positive length, all policies π that do not terminate from a state x ∈ X_f must satisfy J_π(x) = ∞, implying that there exists an optimal policy that terminates from all x ∈ X_f. Thus, in this case Assumption 1 is naturally satisfied.

When X is the n-dimensional Euclidean space ℜ^n, a primary case of interest for this paper, it may easily happen that the optimal policies are not terminating from some x ∈ X_f, but instead the optimal state trajectories may approach X_s asymptotically. This is true for example in the classical linear-quadratic problem, where X = ℜ^n, X_s = {0}, U = ℜ^m, g is positive semidefinite quadratic, and f represents a linear system of the form x_{k+1} = Ax_k + Bu_k, where A and B are given matrices. However, we will show in Section IV that Assumption 1 is satisfied under some natural and easily verifiable conditions.

Regarding notation, we denote by ℜ and ℜ^n the real line and n-dimensional Euclidean space, respectively. We denote by E^+(X) the set of all functions J : X ↦ [0, ∞], and by J the set of functions

J = { J ∈ E^+(X) | J(x) = 0, ∀ x ∈ X_s }.   (10)

Since X_s consists of cost-free and absorbing states [cf. Eq. (8)], the set J contains the cost function J_π of all policies π, as well as J*. In our terminology, all equations, inequalities, and convergence limits involving functions are meant to be pointwise. Our main results are given in the following three propositions.

Proposition 1 (Uniqueness of Solution of Bellman's Equation). Let Assumption 1 hold. The optimal cost function J* is the unique solution of Bellman's equation (4) within the set of functions J.

There are well-known examples where g ≥ 0 but Assumption 1 does not hold, and there are additional solutions of Bellman's equation within J. The following is a two-state shortest path example, which is discussed in more detail in [12], Section 3.1.2, and [14], Example 1.1.

Example 1 (Counterexample for Uniqueness of Solution of Bellman's Equation). Let X = {0, 1}, where 0 is the unique cost-free and absorbing state, X_s = {0}, and assume that at state 1 we can stay at 1 at no cost, or move to 0 at cost 1. Here J*(0) = J*(1) = 0, so Eq. (9) is violated. It can be seen that

J = { J | J(0) = 0, J(1) ≥ 0 },

and that Bellman's equation is

J(0) = J(0),   J(1) = min{ J(1), 1 + J(0) }.

It can be seen that Bellman's equation has infinitely many solutions within J, the set { J | J(0) = 0, 0 ≤ J(1) ≤ 1 }.

Proposition 2 (Convergence of VI). Let Assumption 1 hold.

(a) The VI sequence {J_k} generated by Eq. (5) converges pointwise to J* starting from any function J_0 ∈ J with J_0 ≥ J*.

(b) Assume further that U is a metric space, and the sets U_k(x, λ) given by

U_k(x, λ) = { u ∈ U(x) | g(x, u) + J_k(f(x, u)) ≤ λ },

are compact for all x ∈ X, λ ∈ ℜ, and k, where {J_k} is the VI sequence generated by Eq. (5) starting from J_0 ≡ 0. Then the VI sequence {J_k} generated by Eq. (5) converges pointwise to J* starting from any function J_0 ∈ J.
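The role of the condition J*(x) > 0 for x ∉ X_s [Eq. (9)] in Props. 1 and 2 can be checked numerically on the two-state problem of Example 1, where that condition fails. In the sketch below the Bellman operator of Example 1 is written out directly; the starting values tried are arbitrary.

```python
# Numerical check of Example 1: X = {0, 1}, staying at 1 is free, moving 1 -> 0 costs 1.
# Every J with J(0) = 0 and 0 <= J(1) <= 1 is a fixed point of the Bellman operator,
# and VI started above J* need not reach J*(1) = 0 because Eq. (9)/Assumption 1 fails.
def bellman(J):
    # T J(0) = 0 (absorbing state), T J(1) = min{ J(1), 1 + J(0) }
    return {0: 0.0, 1: min(J[1], 1.0 + J[0])}

for c in [0.0, 0.3, 1.0, 5.0]:        # candidate starting values of J(1)
    J = {0: 0.0, 1: c}
    for _ in range(50):
        J = bellman(J)
    print(f"start J(1) = {c}: limit of VI = {J[1]}")
# start 0.0, 0.3, or 1.0: already fixed points; start 5.0: VI stops at 1.0, not J*(1) = 0.
```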

The compactness assumption of Prop. 2(b) is satisfied if U(x) is finite for all x ∈ X. Other easily verifiable assumptions implying this compactness assumption will be given later. Note that when there are solutions to Bellman's equation within J, in addition to J*, VI will not converge to J* starting from any of these solutions. However, it is also possible that Bellman's equation has J* as its unique solution within J, and yet VI does not converge to J* starting from the zero function because the compactness assumption of Prop. 2(b) is violated. There are several examples of this type in the literature, and the following example, an adaptation of Example 4.3.3 of [12], is a deterministic problem for which Assumption 1 is satisfied.

Example 2 (Counterexample for Convergence of VI). Let X = [0, ∞) ∪ {s}, with s being a cost-free and absorbing state, and let U = (0, ∞) ∪ {ū}, where ū is a special stopping control, which moves the system from states x ≥ 0 to state s at unit cost. The system has the form

x_{k+1} = x_k + u_k   if x_k ≥ 0 and u_k ≠ ū,
x_{k+1} = s           if x_k ≥ 0 and u_k = ū,
x_{k+1} = s           if x_k = s and u_k ∈ U.

The cost per stage has the form

g(x_k, u_k) = x_k   if x_k ≥ 0 and u_k ≠ ū,
g(x_k, u_k) = 1     if x_k ≥ 0 and u_k = ū,
g(x_k, u_k) = 0     if x_k = s and u_k ∈ U.

Let also X_s = {s}. Then it can be verified that

J*(x) = 1 if x ≥ 0,   J*(x) = 0 if x = s,

and that an optimal policy is to use the stopping control ū at every state (since using any other control at states x ≥ 0 leads to unbounded accumulation of positive cost). Thus it can be seen that Assumption 1 is satisfied. On the other hand, the VI algorithm is

J_{k+1}(x) = min{ 1 + J_k(s), inf_{u>0} [ x + J_k(x + u) ] }

for x ≥ 0, and J_{k+1}(s) = J_k(s), and it can be verified by induction that starting from J_0 ≡ 0, the sequence {J_k} is given for all k by

J_k(x) = min{1, kx} if x ≥ 0,   J_k(x) = 0 if x = s.

Thus J_k(0) = 0 for all k, while J*(0) = 1, so the VI algorithm fails to converge for the state x = 0. The difficulty here is that the compactness assumption of Prop. 2(b) is violated.

Proposition 3 (Convergence of PI). Let Assumption 1 hold. A sequence {J_{μ^k}} generated by the PI algorithm (6), (7) satisfies J_{μ^k}(x) ↓ J*(x) for all x ∈ X.

It is implicitly assumed in the preceding proposition that the PI algorithm is well-defined, in the sense that the minimization in the policy improvement operation (7) can be carried out for every x ∈ X. Easily verifiable conditions that guarantee this also guarantee the compactness condition of Prop. 2(b), and will be noted following Prop. 4 in the next section. Moreover, in Section IV we will prove a similar convergence result for a variant of the PI algorithm where the policy evaluation is carried out approximately through a finite number of VIs.

Example 3 (Counterexample for Convergence of PI). For a simple example where the PI sequence {J_{μ^k}} does not converge to J* if Assumption 1 is violated, consider the two-state shortest path problem of Example 1. Let μ be the suboptimal policy that moves from state 1 to state 0. Then J_μ(0) = 0, J_μ(1) = 1, and it can be seen that μ satisfies the policy improvement equation

μ(1) ∈ arg min{ 1 + J_μ(0), J_μ(1) }.

Thus PI may stop with the suboptimal policy μ.
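A small sketch of the PI iteration on the problem of Example 3 makes the stalling concrete. The tie-breaking rule used here (keep the current policy when the improvement step has a tie) is an assumption added only for the demonstration; with it, PI never leaves the suboptimal policy μ.

```python
# Sketch of the PI iteration (6)-(7) on the two-state problem of Examples 1 and 3,
# started from the suboptimal policy mu(1) = "move to 0".  Ties in the improvement
# step are resolved in favor of the current policy, so PI stops at mu even though
# J*(1) = 0 < J_mu(1) = 1 (Assumption 1 is violated in this problem).
STAY, MOVE = "stay", "move"

def evaluate(mu):
    # J_mu(0) = 0; J_mu(1) = 0 if mu(1) = stay (free self-loop), else 1 (one move to 0)
    return {0: 0.0, 1: 0.0 if mu == STAY else 1.0}

def improve(mu, J):
    q = {STAY: J[1], MOVE: 1.0 + J[0]}            # Q-factors at state 1
    best = min(q.values())
    return mu if q[mu] == best else min(q, key=q.get)   # keep mu on ties

mu = MOVE
for k in range(5):
    J = evaluate(mu)
    mu = improve(mu, J)
    print(k, mu, J)                               # stays at ("move", J(1) = 1) forever
```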
BACKGROUND VI algorithm is The issues discussed in this paper have received attention since Jk+1(x) = min 1+ Jk(s), inf x + Jk(x + u) the 60’s, originally in the work of Blackwell [15], who consid- u≥0   ered the case g ≤ 0, and the work by Strauch (Blackwell’s PhD  for x ≥ 0, and Jk+1(s) = Jk(s), and it can be verified by student) [30], who considered the case g ≥ 0. For textbook induction that starting from J0 ≡ 0, the sequence {Jk} is accounts we refer to [2], [25], [11], and for a more abstract given for all k by development, we refer to the monograph [12]. These works showed that the cases where g ≤ 0 (which corresponds to min{1, kx} if x ≥ 0, maximization of nonnegative rewards) and (which is Jk(x)= g ≥ 0 (0 if x = s. most relevant to the control problems of this paper) are quite different in structure. In particular, while VI converges to J ∗ Thus J (0) = 0 for all k, while J ∗(0) = 1, so the VI algorithm k starting for J ≡ 0 when g ≤ 0, this is not so when g ≥ 0; fails to converge for the state x =0. The difficulty here is that 0 a certain compactness condition is needed to guarantee this the compactness assumption of Prop. 2(b) is violated. [see Example 2, and part (d) of the following proposition]. Moreover when g ≥ 0, Bellman’s equation may have solutions Proposition 3 (Convergence of PI). Let Assumption 1 hold. Jˆ 6= J ∗ with Jˆ ≥ J ∗ (see Example 1), and VI will not A sequence {Jµk } generated by the PI algorithm (6), (7), ∗ ˆ ∗ converge to J starting from such J. In addition it is known satisfies J k (x) ↓ J (x) for all x ∈ X. µ that in general, PI need not converge to J ∗ and may instead It is implicitly assumed in the preceding proposition that the stop with a suboptimal policy (see Example 3). PI algorithm is well-defined in the sense that the minimization The following proposition gives the standard results when in the policy improvement operation (7) can be carried out for g ≥ 0 (see [2], Props. 5.2, 5.4, and 5.10, [11], Props. every x ∈ X. Easily verifiable conditions that guarantee this 4.1.1, 4.1.3, 4.1.5, 4.1.9, or [12], Props. 4.3.3, 4.3.9, and 4

4.3.14). These results hold for stochastic infinite horizon DP its solution, as well as the attendant questions of convergence problems with nonnegative cost per stage, and do not take of VI and PI. In particular, infinite horizon deterministic into account the favorable structure of deterministic problems optimal control for both discrete-time and continuous-time or the presence of the stopping set Xs. systems has been considered since the early days of DP in the works of Bellman. For continuous-time problems the questions Proposition 4. Let the nonnegativity condition (2) hold. discussed in the present paper involve substantial technical difficulties, since the analog of the (discrete-time) Bellman ∗ ˆ + (a) J satisfies Bellman’s equation (4), and if J ∈ E (X) equation (4) is the steady-state form of the (continuous- ˆ is another solution, i.e., J satisfies time) Hamilton-Jacobi-, a nonlinear partial Jˆ(x) = inf g(x, u)+ Jˆ f(x, u) , ∀ x ∈ X, differential equation the solution and analysis of which is u∈U(x) in general very complicated. A formidable difficulty is the (11)   potential lack of differentiability of the optimal cost function, then ∗ ˆ. J ≤ J even for simple problems such as time-optimal control of (b) For all stationary policies we have µ second order linear systems to the origin.

J_μ(x) = g(x, μ(x)) + J_μ(f(x, μ(x))),   ∀ x ∈ X.   (12)

(c) A stationary policy μ* is optimal if and only if

μ*(x) ∈ arg min_{u∈U(x)} { g(x, u) + J*(f(x, u)) },   ∀ x ∈ X.   (13)

(d) If U is a metric space and the sets

U_k(x, λ) = { u ∈ U(x) | g(x, u) + J_k(f(x, u)) ≤ λ }   (14)

are compact for all x ∈ X, λ ∈ ℜ, and k, where {J_k} is the sequence generated by VI [cf. Eq. (5)] starting from J_0 ≡ 0, then there exists at least one optimal stationary policy, and we have J_k → J*.

Compactness assumptions such as the one of part (d) above were originally given in [9], [10], and in [29]. They have been used in several other works, such as [3], [11], Prop. 4.1.9. In particular, the condition of part (d) holds when U(x) is a finite set for all x ∈ X. The condition of part (d) also holds when X = ℜ^n and, for each x ∈ X, the set

{ u ∈ U(x) | g(x, u) ≤ λ }

is a compact subset of ℜ^m for all λ ∈ ℜ, and g and f are continuous in u. The proof consists of showing by induction that the VI iterates J_k have compact level sets and hence are lower semicontinuous.

Let us also note a recent result of H. Yu and the author [35], where it was shown that J* is the unique solution of Bellman's equation within the class of all functions J ∈ E^+(X) that satisfy

0 ≤ J ≤ cJ* for some c > 0,   (15)

(we refer to [35] for discussion and references to antecedents of this result). Moreover, it was shown that VI converges to J* starting from any function satisfying the condition

J* ≤ J ≤ cJ* for some c > 0,

and, under the compactness conditions of Prop. 4(d), starting from any J that satisfies Eq. (15). The same paper and a related paper [5] discuss extensively PI algorithms for stochastic nonnegative cost problems.

For deterministic problems, there has been substantial research in the adaptive dynamic programming literature, regarding the validity of Bellman's equation and the uniqueness of its solution, as well as the attendant questions of convergence of VI and PI. In particular, infinite horizon deterministic optimal control for both discrete-time and continuous-time systems has been considered since the early days of DP in the works of Bellman. For continuous-time problems the questions discussed in the present paper involve substantial technical difficulties, since the analog of the (discrete-time) Bellman equation (4) is the steady-state form of the (continuous-time) Hamilton-Jacobi-Bellman equation, a nonlinear partial differential equation the solution and analysis of which is in general very complicated. A formidable difficulty is the potential lack of differentiability of the optimal cost function, even for simple problems such as time-optimal control of second order linear systems to the origin.

The analog of VI for continuous-time systems essentially involves the time integration of the Hamilton-Jacobi-Bellman equation, and its analysis must deal with difficult issues of stability and convergence to a steady-state solution. Nonetheless there have been proposals of continuous-time PI algorithms, in the early papers [26], [23], [28], [34], and the thesis [6], as well as more recently in several works; see e.g., the book [32], the survey [18], and the references quoted there. These works also address the possibility of value function approximation, similar to other approximation-oriented methodologies such as neuro-dynamic programming [4] and reinforcement learning [31], which consider primarily discrete-time systems. For example, among the restrictions of the PI method is that it must be started with a stabilizing controller; see for example the paper [23], which considered linear-quadratic continuous-time problems, and showed convergence to the optimal policy of the PI algorithm, assuming that an initial stabilizing linear controller is used. By contrast, no such restriction is needed in the PI methodology of the present paper; questions of stability are addressed only indirectly through the finiteness of the values J*(x) and Assumption 1.

For discrete-time systems there has been much research, both for VI and PI algorithms. For a selective list of recent references, which themselves contain extensive lists of other references, see the book [32], the papers [19], [16], [17], [22], [33], the survey papers in the edited volumes [27] and [21], and the special issue [20]. Some of these works relate to continuous-time problems as well, and in their treatment of algorithmic convergence, typically assume that X and U are Euclidean spaces, as well as continuity and other conditions on g, special structure of the system, etc. It is beyond our scope to provide a detailed survey of the state-of-the-art of the VI and PI methodology in the context of adaptive DP. However, it should be clear that the works in this field involve more restrictive assumptions than our corresponding results of Props. 1-3. Of course, these works also address questions that we do not, such as issues of stability of the obtained controllers, the use of approximations, etc. Thus the results of the present work may be viewed as new in that they rely on very general assumptions, yet do not address some important practical issues. The line of analysis of the present paper, which is based on general results of Markovian decision problem theory and abstract forms of dynamic programming, is also different from the lines of analysis of works in adaptive DP, which make heavy use of the deterministic character of the problem and control theoretic methods such as Lyapunov stability.

Still there is a connection between our line of analysis and Lyapunov stability. In particular, if π* is an optimal controller, i.e., J_{π*} = J*, then for every x_0 ∈ X_f, the state sequence {x_k} generated using π* and starting from x_0 remains within X_f and satisfies J*(x_k) ↓ 0. This can be seen by writing

J*(x_0) = Σ_{t=0}^{k−1} g(x_t, μ*_t(x_t)) + J*(x_k),   k = 1, 2, . . . ,

and using the facts g ≥ 0 and J*(x_0) < ∞. Thus an optimal controller, restricted to the subset X_f, may be viewed as a Lyapunov-stable controller where the Lyapunov function is J*.

On the other hand, existence of a "stable" controller does not necessarily imply that J* is real-valued. In particular, it may not be true that if the sequence {x_k} generated by an optimal controller starting from some x_0 converges to X_s, then we have J*(x_0) < ∞. The reason is that the cost per stage g may not decrease fast enough as we approach X_s. As an example, let

X = {0} ∪ { 1/m | m is a positive integer },

with X_s = {0}, and assume that there is a unique controller, which moves from 1/m to 1/(m+1) with incurred cost 1/m. Then we have J*(x) = ∞ for all x ≠ 0, despite the fact that the controller is "stable" in the sense that it generates a sequence {x_k} converging to 0 starting from every x_0 ≠ 0.
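A quick numerical check of the preceding example: the cost accumulated by the unique controller is a tail of the harmonic series, so it grows without bound even though the state converges to X_s. The horizons used below are arbitrary.

```python
# The controller drives 1/m to 1/(m+1) at cost 1/m, so the total cost over k stages,
# starting from x_0 = 1, is the k-th partial sum of the harmonic series (~ log k),
# which diverges; hence J*(x) = infinity for every x != 0 in this example.
def total_cost(m_start, horizon):
    return sum(1.0 / m for m in range(m_start, m_start + horizon))

for k in [10, 100, 10_000, 1_000_000]:
    print(k, round(total_cost(1, k), 2))    # grows without bound
```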

III. PROOFS OF THE MAIN RESULTS

Let us denote

Π_{T,x} = { π ∈ Π | π terminates from x },

and note the following key implication of Assumption 1:

J*(x) = inf_{π∈Π_{T,x}} J_π(x),   ∀ x ∈ X_f.   (16)

In the subsequent arguments, the significance of policies that terminate starting from some initial state x_0 is that the corresponding generated sequences {x_k} satisfy J(x_k) = 0 for all J ∈ J and k sufficiently large.

Proof of Prop. 1: Let Ĵ ∈ J be a solution of the Bellman equation (11), so that

Ĵ(x) ≤ g(x, u) + Ĵ(f(x, u)),   ∀ x ∈ X, u ∈ U(x),   (17)

while by Prop. 4(a), J* ≤ Ĵ. For any x_0 ∈ X_f and policy π = {μ_0, μ_1, . . .} ∈ Π_{T,x_0}, we have by using repeatedly Eq. (17),

J*(x_0) ≤ Ĵ(x_0) ≤ Ĵ(x_k) + Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)),   k = 1, 2, . . . ,

where {x_k} is the state sequence generated starting from x_0 and using π. Also, since π ∈ Π_{T,x_0}, and hence x_k ∈ X_s and Ĵ(x_k) = 0 for all sufficiently large k, we have

lim sup_{k→∞} { Ĵ(x_k) + Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)) } = lim_{k→∞} Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)) = J_π(x_0).

By combining the last two relations, we obtain

J*(x_0) ≤ Ĵ(x_0) ≤ J_π(x_0),   ∀ x_0 ∈ X_f, π ∈ Π_{T,x_0}.

Taking the infimum over π ∈ Π_{T,x_0} and using Eq. (16), it follows that J*(x_0) = Ĵ(x_0) for all x_0 ∈ X_f. Also for x_0 ∉ X_f, we have J*(x_0) = Ĵ(x_0) = ∞ [since J* ≤ Ĵ by Prop. 4(a)], so we obtain J* = Ĵ.

Proof of Prop. 2: (a) Suppose that J_0 ∈ J and J_0 ≥ J*. Starting with J_0, let us apply the VI operation to both sides of the inequality J_0 ≥ J*. Since J* is a solution of Bellman's equation and VI has a monotonicity property that maintains the direction of functional inequalities, we see that J_1 ≥ J*. Continuing similarly, we obtain J_k ≥ J* for all k. Moreover, we clearly have J_k(x) = 0 for all x ∈ X_s, so J_k ∈ J for all k. We now argue that since J_k is produced by k steps of VI starting from J_0, it is the optimal cost function of the k-stage version of the problem with terminal cost function J_0. Therefore, we have for every x_0 ∈ X and policy π = {μ_0, μ_1, . . .},

J*(x_0) ≤ J_k(x_0) ≤ J_0(x_k) + Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)),   k = 1, 2, . . . ,

where {x_t} is the state sequence generated starting from x_0 and using π. If x_0 ∈ X_f and π ∈ Π_{T,x_0}, we have x_k ∈ X_s and J_0(x_k) = 0 for all sufficiently large k, so that

lim sup_{k→∞} { J_0(x_k) + Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)) } = lim_{k→∞} Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)) = J_π(x_0).

By combining the last two relations, we obtain

J*(x_0) ≤ lim inf_{k→∞} J_k(x_0) ≤ lim sup_{k→∞} J_k(x_0) ≤ J_π(x_0),

for all x_0 ∈ X_f and π ∈ Π_{T,x_0}. Taking the infimum over π ∈ Π_{T,x_0} and using Eq. (16), it follows that lim_{k→∞} J_k(x_0) = J*(x_0) for all x_0 ∈ X_f. Since for x_0 ∉ X_f we have J*(x_0) = J_k(x_0) = ∞, we obtain J_k → J*.

(b) Let {J_k} be the VI sequence generated starting from some function J ∈ J. By the monotonicity of the VI operation, {J_k} lies between the sequence of VI iterates starting from the zero function [which converges to J* from below by Prop. 4(d)], and the sequence of VI iterates starting from J_0 = max{J, J*} [which converges to J* from above by part (a)].

Proof of Prop. 3: If μ is a stationary policy and μ̄ satisfies the policy improvement equation

μ̄(x) ∈ arg min_{u∈U(x)} { g(x, u) + J_μ(f(x, u)) },   x ∈ X,

[cf. Eq. (7)], we have for all x ∈ X,

J_μ(x) = g(x, μ(x)) + J_μ(f(x, μ(x)))
       ≥ min_{u∈U(x)} { g(x, u) + J_μ(f(x, u)) }   (18)
       = g(x, μ̄(x)) + J_μ(f(x, μ̄(x))),

where the first equality follows from Prop. 4(b) and the second equality follows from the definition of μ̄. Let us fix x and let {x_k} be the sequence generated starting from x and using μ̄. By repeatedly applying Eq. (18), we see that the sequence J̃_k(x) defined by

J̃_0(x) = J_μ(x),   J̃_1(x) = J_μ(x_1) + g(x, μ̄(x)),

and more generally,

J̃_k(x) = J_μ(x_k) + Σ_{t=0}^{k−1} g(x_t, μ̄(x_t)),   k = 1, 2, . . . ,

is monotonically nonincreasing. Thus, using also Eq. (18), we have

J_μ(x) ≥ min_{u∈U(x)} { g(x, u) + J_μ(f(x, u)) } = J̃_1(x) ≥ J̃_k(x),

for all x ∈ X and k ≥ 1. This implies that

J_μ(x) ≥ min_{u∈U(x)} { g(x, u) + J_μ(f(x, u)) }
       ≥ lim_{k→∞} J̃_k(x)
       = lim_{k→∞} { J_μ(x_k) + Σ_{t=0}^{k−1} g(x_t, μ̄(x_t)) }
       ≥ lim_{k→∞} Σ_{t=0}^{k−1} g(x_t, μ̄(x_t))
       = J_μ̄(x),

where the last inequality follows since J_μ ≥ 0. In conclusion, we have

J_μ(x) ≥ inf_{u∈U(x)} { g(x, u) + J_μ(f(x, u)) } ≥ J_μ̄(x),   x ∈ X.

Using μ^k and μ^{k+1} in place of μ and μ̄ in the preceding relation, we obtain for all x ∈ X,

J_{μ^k}(x) ≥ inf_{u∈U(x)} { g(x, u) + J_{μ^k}(f(x, u)) } ≥ J_{μ^{k+1}}(x).   (19)

Thus the sequence {J_{μ^k}} generated by PI converges monotonically to some function J_∞ ∈ E^+(X), i.e., J_{μ^k} ↓ J_∞. Moreover, by taking the limit as k → ∞ in Eq. (19), we have the two relations

J_∞(x) ≥ inf_{u∈U(x)} { g(x, u) + J_∞(f(x, u)) },   x ∈ X,

and

g(x, u) + J_{μ^k}(f(x, u)) ≥ J_∞(x),   x ∈ X, u ∈ U(x).

We now take the limit in the second relation as k → ∞, then the infimum over u ∈ U(x), and then combine with the first relation, to obtain

J_∞(x) = inf_{u∈U(x)} { g(x, u) + J_∞(f(x, u)) },   x ∈ X.

Thus J_∞ is a solution of Bellman's equation, satisfying J_∞ ∈ J (since J_{μ^k} ∈ J and J_{μ^k} ↓ J_∞), so by the uniqueness result of Prop. 1, we have J_∞ = J*.

IV. DISCUSSION, SPECIAL CASES, AND EXTENSIONS

In this section we elaborate on our main results and we derive easily verifiable conditions under which our assumptions hold.

A. Conditions that Imply Assumption 1

Consider Assumption 1. As noted in Section I, it holds when X and U are finite, a terminating policy exists from every x, and all cycles of the state transition graph have positive length. For the case where X is infinite, let us assume that X is a normed space with norm denoted ‖·‖, and say that π asymptotically terminates from x if the sequence {x_k} generated starting from x and using π converges to X_s in the sense that

lim_{k→∞} dist(x_k, X_s) = 0,

where dist(x, X_s) denotes the minimum distance from x to X_s,

dist(x, X_s) = inf_{y∈X_s} ‖x − y‖,   x ∈ X.

The following proposition provides readily verifiable conditions that guarantee Assumption 1.

Proposition 5. Let the cost nonnegativity condition (2) and stopping set conditions (8)-(9) hold, and assume further the following:

(1) For every x ∈ X_f and ε > 0, there exists a policy π that asymptotically terminates from x and satisfies J_π(x) ≤ J*(x) + ε.

(2) For every ε > 0, there exists a δ_ε > 0 such that for each x ∈ X_f with

dist(x, X_s) ≤ δ_ε,

there is a policy π that terminates from x and satisfies J_π(x) ≤ ε.

Then Assumption 1 holds.

Proof: Fix x ∈ X_f and ε > 0. Let π be a policy that asymptotically terminates from x, and satisfies J_π(x) ≤ J*(x) + ε, as per condition (1). Starting from x, this policy will generate a sequence {x_k} such that for some index k̄ we have

dist(x_k̄, X_s) ≤ δ_ε,

so by condition (2), there exists a policy π̄ that terminates from x_k̄ and is such that J_π̄(x_k̄) ≤ ε. Consider the policy π′ that follows π up to index k̄ and follows π̄ afterwards. This policy terminates from x and satisfies

J_π′(x) = J_{π,k̄}(x) + J_π̄(x_k̄) ≤ J_π(x) + J_π̄(x_k̄) ≤ J*(x) + 2ε,

where J_{π,k̄}(x) is the cost incurred by π starting from x up to reaching x_k̄.

Condition (1) of the preceding proposition requires that for states x ∈ X_f, the optimal cost J*(x) can be achieved arbitrarily closely with policies that asymptotically terminate from x. Problems for which condition (1) holds are those involving a cost per stage that is strictly positive outside of X_s. More precisely, condition (1) holds if for each δ > 0 there exists ε > 0 such that

inf_{u∈U(x)} g(x, u) ≥ ε,   ∀ x ∈ X such that dist(x, X_s) ≥ δ.   (20)

Then for any x and policy π that does not asymptotically terminate from x, we will have J_π(x) = ∞, so that if x ∈ X_f, all policies π with J_π(x) < ∞ must be asymptotically terminating from x. In applications, condition (1) is natural and consistent with the aim of steering the state towards the terminal set X_s with finite cost. Condition (2) is a "controllability" condition implying that the state can be steered into X_s with arbitrarily small cost from a starting state that is sufficiently close to X_s.

Example 4 (Linear System Case). Consider a linear system

x_{k+1} = Ax_k + Bu_k,

where A and B are given matrices, with the terminal set being the origin, i.e., X_s = {0}. We assume the following:

(a) X = ℜ^n, U = ℜ^m, and there is an open sphere R centered at the origin such that U(x) contains R for all x ∈ X.

(b) The system is controllable, i.e., one may drive the system from any state to the origin within at most n steps using suitable controls, or equivalently the matrix [B AB · · · A^{n−1}B] has rank n.

(c) g satisfies

0 ≤ g(x, u) ≤ β ( ‖x‖^p + ‖u‖^p ),   ∀ (x, u) ∈ V,

where V is some open sphere centered at the origin, β, p are some positive scalars, and ‖·‖ is the standard Euclidean norm.

Then condition (2) of Prop. 5 is satisfied, while x = 0 is cost-free and absorbing [cf. Eq. (8)]. Still, however, in the absence of additional assumptions, there may be multiple solutions to Bellman's equation within J.

As an example, consider the scalar system x_{k+1} = ax_k + u_k with X = U(x) = ℜ, and the quadratic cost g(x, u) = u^2. Then Bellman's equation has the form

J(x) = min_{u∈ℜ} { u^2 + J(ax + u) },   x ∈ ℜ,

and it is seen that the optimal cost function, J*(x) ≡ 0, is a solution. Let us assume that a > 1, so the system is unstable (the instability of the system is important for the purpose of this example). Then it can be verified that the quadratic function J(x) = (a^2 − 1)x^2, which belongs to J, also solves Bellman's equation. This is a case where the algebraic Riccati equation associated with the problem has two nonnegative solutions because there is no cost on the state, and a standard observability condition for uniqueness of solution of the Riccati equation is violated.

If on the other hand, in addition to (a)-(c), we assume that for some positive scalars γ, p, we have inf_{u∈U(x)} g(x, u) ≥ γ‖x‖^p for all x ∈ ℜ^n, then J*(x) > 0 for all x ≠ 0 [cf. Eq. (9)], while condition (1) of Prop. 5 is satisfied as well [cf. Eq. (20)]. Then by Prop. 5, Assumption 1 holds, and Bellman's equation has a unique solution within J.

There are straightforward extensions of the conditions of the preceding example to a nonlinear system. Note that even for a controllable system, it is possible that there exist states from which the terminal set cannot be reached, because U(x) may imply constraints on the magnitude of the control vector. Still the preceding analysis allows for this case.
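For the scalar example above, the two solutions of Bellman's equation can be exhibited by restricting VI to quadratic functions J(x) = P x^2, under which the Bellman operator reduces to a scalar Riccati-like map on the coefficient P. The following sketch (with the value a = 2 chosen arbitrarily) shows the two nonnegative fixed points P = 0 and P = a^2 − 1; the reduction to the coefficient map is a standard minimization over u and is noted in the comments.

```python
# Sketch for the scalar example (x_{k+1} = a x_k + u_k, g = u^2, a > 1): on quadratics
# J(x) = P x^2, minimizing u^2 + P(ax + u)^2 over u gives the coefficient map
# P -> a^2 P / (1 + P).  Its nonnegative fixed points are P = 0 (i.e., J* = 0) and
# P = a^2 - 1, matching the two solutions of Bellman's equation within J noted above.
a = 2.0

def bellman_coeff(P):
    return a * a * P / (1.0 + P)

for P0 in [0.0, 0.5, 100.0]:
    P = P0
    for _ in range(200):
        P = bellman_coeff(P)
    print(f"VI on coefficients from P0 = {P0}: limit P = {P:.6f}")
# P0 = 0 stays at 0 = J*; any P0 > 0 is driven to a^2 - 1 = 3, the second solution.
```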
B. An Optimistic Form of PI

Let us consider a variant of PI where policies are evaluated inexactly, with a finite number of VIs. In particular, this algorithm starts with some J_0 ∈ E^+(X), and generates a sequence of cost function and policy pairs {J_k, μ^k} as follows: Given J_k, we generate μ^k according to

μ^k(x) ∈ arg min_{u∈U(x)} { g(x, u) + J_k(f(x, u)) },   x ∈ X,   (21)

and then we obtain J_{k+1} with m_k ≥ 1 VIs using μ^k:

J_{k+1}(x_0) = J_k(x_{m_k}) + Σ_{t=0}^{m_k−1} g(x_t, μ^k(x_t)),   x_0 ∈ X,   (22)

where {x_t} is the sequence generated using μ^k and starting from x_0, and m_k are arbitrary positive integers. Here J_0 is a function in J that is required to satisfy

J_0(x) ≥ inf_{u∈U(x)} { g(x, u) + J_0(f(x, u)) },   ∀ x ∈ X.   (23)

For example J_0 may be equal to the cost function of some stationary policy, or be the function that takes the value 0 for x ∈ X_s and ∞ at x ∉ X_s. Note that when m_k ≡ 1 the method is equivalent to VI, while the case m_k = ∞ corresponds to the standard PI considered earlier. In practice, the most effective value of m_k may be found experimentally, with moderate values m_k > 1 usually working best. We refer to the textbooks [25] and [11] for discussions of this type of inexact PI algorithm (in [25] it is called "modified" PI, while in [11] it is called "optimistic" PI).

Proposition 6 (Convergence of Optimistic PI). Let Assumption 1 hold. For the PI algorithm (21)-(22), where J_0 belongs to J and satisfies the condition (23), we have J_k ↓ J*.
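A minimal sketch of the optimistic PI iteration (21)-(22) is given below, on a hypothetical finite instance (the same kind of toy shortest path problem used earlier). The initial function J_0, equal to 0 on X_s and ∞ elsewhere, is one of the choices permitted by condition (23).

```python
# Sketch of optimistic PI (21)-(22) on a hypothetical four-state problem: each cycle
# performs one policy improvement (21) and then m_k value iterations with the
# improved policy held fixed, which is equivalent to Eq. (22).
import math

X = [0, 1, 2, 3]                               # state 0 is cost-free and absorbing
U = {0: [0], 1: [0, 2], 2: [0, 3], 3: [2]}     # U(x): successor choices
g = lambda x, u: 0.0 if x == 0 else (1.0 if u == 0 else 0.5)
f = lambda x, u: u

def optimistic_pi(J, m_k=3, cycles=20):
    for _ in range(cycles):
        # policy improvement, Eq. (21)
        mu = {x: min(U[x], key=lambda u: g(x, u) + J[f(x, u)]) for x in X}
        # m_k value iterations using mu, Eq. (22)
        for _ in range(m_k):
            J = {x: g(x, mu[x]) + J[f(x, mu[x])] for x in X}
    return J

# J_0 = 0 on Xs and +infinity elsewhere satisfies condition (23).
J0 = {x: (0.0 if x == 0 else math.inf) for x in X}
print(optimistic_pi(J0))          # decreases monotonically to J*, as in Prop. 6
```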

Proof: We have for all x ∈ X,

J_0(x) ≥ inf_{u∈U(x)} { g(x, u) + J_0(f(x, u)) }
       = g(x, μ^0(x)) + J_0(f(x, μ^0(x)))
       ≥ J_1(x)
       ≥ g(x, μ^0(x)) + J_1(f(x, μ^0(x)))
       ≥ inf_{u∈U(x)} { g(x, u) + J_1(f(x, u)) }
       = g(x, μ^1(x)) + J_1(f(x, μ^1(x)))
       ≥ J_2(x),

where the first inequality is the condition (23), the second and third inequalities follow because of the monotonicity of the m_0 value iterations (22) for μ^0, and the fourth inequality follows from the policy improvement equation (21). Continuing similarly, we have

J_k(x) ≥ inf_{u∈U(x)} { g(x, u) + J_k(f(x, u)) } ≥ J_{k+1}(x),

for all x ∈ X and k. Moreover, since J_0 ∈ J, we have J_k ∈ J for all k. Thus J_k ↓ J_∞ for some J_∞ ∈ J, and similar to the proof of Prop. 3, it follows that J_∞ is a solution of Bellman's equation. Hence, by the uniqueness result of Prop. 1, we have J_∞ = J*.

C. Minimax Control to a Terminal Set of States

Our analysis can be readily extended to minimax problems with a terminal set of states. Here the system is

x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . . ,

where w_k is the control of an antagonistic opponent that aims to maximize the cost function. We assume that w_k is chosen from a given set W to maximize the sum of costs per stage, which are assumed nonnegative:

0 ≤ g(x, u, w) ≤ ∞,   x ∈ X, u ∈ U(x), w ∈ W.

We wish to choose a policy π = {μ_0, μ_1, . . .} to minimize the cost function

J_π(x_0) = sup_{w_k∈W, k=0,1,...} lim_{k→∞} Σ_{t=0}^{k} g(x_t, μ_t(x_t), w_t),

where (x_k, μ_k(x_k)) is a state-control sequence corresponding to π and the sequence {w_0, w_1, . . .}. We assume that there is a termination set X_s, the states of which are cost-free and absorbing, i.e.,

g(x, u, w) = 0,   x = f(x, u, w),

for all x ∈ X_s, u ∈ U(x), w ∈ W, and that all states outside X_s have strictly positive optimal cost, so that

X_s = { x ∈ X | J*(x) = 0 }.

The finite-state version of this problem has been discussed in [13], under the name robust shortest path planning, for the case where g can take both positive and negative values. A problem that is closely related is reachability of a target set in minimum time, which is obtained for

g(x, u, w) = 0 if x ∈ X_s,   g(x, u, w) = 1 if x ∉ X_s,

assuming also that the control process stops once the state enters the set X_s. Here w is a disturbance described by set membership (w ∈ W), and the objective is to reach the set X_s in the minimum guaranteed number of steps. The set X_f is the set of states for which X_s is guaranteed to be reached in a finite number of steps. Another related problem is reachability of a target tube, where for a given set X̂,

g(x, u, w) = 0 if x ∈ X̂,   g(x, u, w) = 1 if x ∉ X̂,

and the objective is to find the initial states starting from which we can guarantee to keep all future states within X̂. These two reachability problems were first formulated and analyzed as part of the author's Ph.D. thesis research [7], and the subsequent paper [8]. In fact the reachability algorithms given in these works are essentially special cases of the VI algorithm of the present paper, starting with appropriate initial functions J_0.

To extend our results to the general form of the minimax problem described above, we need to adapt the definition of termination. In particular, given a state x, in the minimax context we say that a policy π terminates from x if there exists an index k̄ [which depends on (π, x)] such that the sequence {x_k}, which is generated starting from x and using π, satisfies x_k̄ ∈ X_s for all sequences {w_0, . . . , w_{k̄−1}} with w_t ∈ W for all t = 0, . . . , k̄ − 1. Then Assumption 1 is modified to reflect this new definition of termination, and our results can be readily extended, with Props. 1, 2, 3, and 6, and their proofs, holding essentially as stated. The main adjustment needed is to replace expressions of the forms

g(x, u) + J(f(x, u))   and   J(x_k) + Σ_{t=0}^{k−1} g(x_t, u_t)

in these proofs with

sup_{w∈W} { g(x, u, w) + J(f(x, u, w)) }   and   sup_{w_t∈W, t=0,...,k−1} { J(x_k) + Σ_{t=0}^{k−1} g(x_t, u_t, w_t) },

respectively; see also [14] for a more abstract view of such lines of argument.
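The minimum-time reachability problem above lends itself to a short VI sketch: the iteration applies a minimization over u and a maximization over w to the per-stage cost that is 0 on X_s and 1 elsewhere. The one-dimensional instance below (states 0-5, adversarial push w ∈ {0, 1}) is hypothetical and only illustrates the minimax form of the recursion.

```python
# Sketch of minimax VI for target-set reachability:
#   J_{k+1}(x) = min_{u in U(x)} max_{w in W} [ g(x,u,w) + J_k(f(x,u,w)) ],
# with g = 0 on Xs and 1 elsewhere, so the limit is the minimum guaranteed number
# of steps to reach Xs.  The integer-line instance is hypothetical.
X = list(range(6))                                    # states 0..5, target Xs = {0}
Xs = {0}
U = {x: ([0] if x in Xs else [-2, -1]) for x in X}    # intended moves (stay on target)
W = [0, 1]                                            # adversarial push of 0 or +1

def f(x, u, w):
    return 0 if x in Xs else max(0, min(5, x + u + w))

def g(x, u, w):
    return 0.0 if x in Xs else 1.0

J = {x: 0.0 for x in X}                               # start VI from J_0 = 0
for _ in range(20):
    J = {x: min(max(g(x, u, w) + J[f(x, u, w)] for w in W) for u in U[x]) for x in X}
print(J)   # guaranteed number of steps to the target; e.g. J(1) = 1, J(2) = 2, ...
```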

V. CONCLUDING REMARKS

In this paper we have considered problems of deterministic optimal control to a terminal set of states subject to very general assumptions. Under reasonably practical conditions, we have established the uniqueness of solution of Bellman's equation, and the convergence of value and policy iteration algorithms, even when there are states with infinite optimal cost. Our analysis bypasses the need for assumptions involving the existence of globally stabilizing controllers, which guarantee that the optimal cost function J* is real-valued. This generality makes our results a convenient starting point for analysis of problems involving additional assumptions, and perhaps cost function approximations.

While we have restricted attention to undiscounted problems, the line of analysis of the present paper applies also to discounted problems with one-stage cost function g that may be unbounded from above. Similar but more favorable results can be obtained, thanks to the presence of the discount factor; see the author's paper [14], which contains related analysis for stochastic and minimax, discounted and undiscounted problems, with nonnegative cost per stage.

The results for these problems, and the results of the present paper, have a common ancestry. They fundamentally draw their validity from notions of regularity, which were developed in the author's abstract DP monograph [12] and were extended recently in [14]. Let us describe the regularity idea briefly, and its connection to the analysis of this paper. Given a set of functions S ⊂ E^+(X), we say that a collection C of policy-state pairs (π, x_0), with π ∈ Π and x_0 ∈ X, is S-regular if for all (π, x_0) ∈ C and J ∈ S, we have

J_π(x_0) = lim_{k→∞} { J(x_k) + Σ_{t=0}^{k−1} g(x_t, μ_t(x_t)) }.

In words, for all (π, x_0) ∈ C, J_π(x_0) can be obtained in the limit by VI starting from any J ∈ S. The favorable properties with respect to VI of an S-regular collection C can be translated into interesting properties relating to solutions of Bellman's equation and convergence of VI. In particular, the optimal cost function over the set of policies { π | (π, x) ∈ C },

J*_C(x) = inf_{π | (π,x)∈C} J_π(x),   x ∈ X,

under appropriate problem-dependent assumptions, is the unique solution of Bellman's equation within the set { J ∈ S | J ≥ J*_C }, and can be obtained by VI starting from any J within that set (see [14]).

Within the deterministic optimal control context of this paper, it works well to choose C to be the set of all (π, x) such that x ∈ X_f and π is terminating starting from x, and to choose S to be J, as defined by Eq. (10). Then, in view of Assumption 1, we have J*_C = J*, and the favorable properties of J*_C are shared by J*. For other types of problems different choices of C may be appropriate, and corresponding results relating to the uniqueness of solutions of Bellman's equation and the validity of value and policy iteration may be obtained; see [14].

VI. REFERENCES

[1] Bertsekas, D. P., and Rhodes, I. B., 1971. "On the Minimax Reachability of Target Sets and Target Tubes," Automatica, Vol. 7, pp. 233-241.
[2] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control: The Discrete Time Case, Academic Press, N. Y.; may be downloaded from http://web.mit.edu/dimitrib/www/home.html
[3] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. "An Analysis of Stochastic Shortest Path Problems," Math. of Operations Research, Vol. 16, pp. 580-595.
[4] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[5] Bertsekas, D. P., and Yu, H., 2015. "Stochastic Shortest Path Problems Under Weak Conditions," Lab. for Information and Decision Systems Report LIDS-2909, revision of March 2015; to appear in Math. of Operations Research.
[6] Beard, R. W., 1995. Improving the Closed-Loop Performance of Nonlinear Systems, Doctoral dissertation, Rensselaer Polytechnic Institute.
[7] Bertsekas, D. P., 1971. "Control of Uncertain Systems With a Set-Membership Description of the Uncertainty," Ph.D. Thesis, Dept. of EECS, MIT; may be downloaded from http://web.mit.edu/dimitrib/www/publ.html.
[8] Bertsekas, D. P., 1972. "Infinite Time Reachability of State Space Regions by Using Feedback Control," IEEE Trans. Automatic Control, Vol. AC-17, pp. 604-613.
[9] Bertsekas, D. P., 1975. "Monotone Mappings in Dynamic Programming," Proc. 1975 IEEE Conference on Decision and Control, Houston, TX, pp. 20-25.
[10] Bertsekas, D. P., 1977. "Monotone Mappings with Application in Dynamic Programming," SIAM J. on Control and Optimization, Vol. 15, pp. 438-464.

[11] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, Athena Scientific, Belmont, MA.
[12] Bertsekas, D. P., 2013. Abstract Dynamic Programming, Athena Scientific, Belmont, MA.
[13] Bertsekas, D. P., 2014. "Robust Shortest Path Planning and Semicontractive Dynamic Programming," Lab. for Information and Decision Systems Report LIDS-P-2915, Feb. 2014 (revised Jan. 2015).
[14] Bertsekas, D. P., 2015. "Regular Policies in Abstract Dynamic Programming," Lab. for Information and Decision Systems Report LIDS-3173, April 2015.
[15] Blackwell, D., 1965. "Positive Dynamic Programming," Proc. Fifth Berkeley Symposium Math. Statistics and Probability, pp. 415-418.
[16] Heydari, A., 2014. "Revisiting Approximate Dynamic Programming and its Convergence," IEEE Transactions on Cybernetics, Vol. 44, pp. 2733-2743.
[17] Heydari, A., 2014. "Stabilizing Value Iteration With and Without Approximation Errors," available at arXiv:1412.5675.
[18] Jiang, Y., and Jiang, Z. P., 2013. "Robust Adaptive Dynamic Programming for Linear and Nonlinear Systems: An Overview," Eur. J. Control, Vol. 19, pp. 417-425.
[19] Jiang, Y., and Jiang, Z. P., 2014. "Robust Adaptive Dynamic Programming and Feedback Stabilization of Nonlinear Systems," IEEE Trans. on Neural Networks and Learning Systems, Vol. 25, pp. 882-893.
[20] Lewis, F. L., Liu, D., and Lendaris, G. G., 2008. Special Issue on Adaptive Dynamic Programming and Reinforcement Learning in Feedback Control, IEEE Trans. on Systems, Man, and Cybernetics, Part B, Vol. 38, No. 4.
[21] Lewis, F. L., and Liu, D., (Eds.), 2013. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley, Hoboken, N. J.
[22] Liu, D., and Wei, Q., 2013. "Finite-Approximation-Error-Based Optimal Control Approach for Discrete-Time Nonlinear Systems," IEEE Transactions on Cybernetics, Vol. 43, pp. 779-789.
[23] Kleinman, D. L., 1968. "On an Iterative Technique for Riccati Equation Computations," IEEE Trans. Automatic Control, Vol. AC-13, pp. 114-115.
[24] Pallu de la Barriere, R., 1967. Optimal Control Theory, Saunders, Phila.; reprinted by Dover, N. Y., 1980.
[25] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming, J. Wiley, N. Y.
[26] Rekasius, Z. V., 1964. "Suboptimal Design of Intentionally Nonlinear Controllers," IEEE Trans. on Automatic Control, Vol. 9, pp. 380-386.
[27] Si, J., Barto, A., Powell, W., and Wunsch, D., (Eds.), 2004. Learning and Approximate Dynamic Programming, IEEE Press, N. Y.
[28] Saridis, G. N., and Lee, C.-S. G., 1979. "An Approximation Theory of Optimal Control for Trainable Manipulators," IEEE Trans. Syst., Man, Cybern., Vol. 9, pp. 152-159.
[29] Schal, M., 1975. "Conditions for Optimality in Dynamic Programming and for the Limit of n-Stage Optimal Policies to be Optimal," Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, Vol. 32, pp. 179-196.
[30] Strauch, R., 1966. "Negative Dynamic Programming," Ann. Math. Statist., Vol. 37, pp. 871-890.
[31] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning, MIT Press, Cambridge, MA.
[32] Vrabie, D., Vamvoudakis, K. G., and Lewis, F. L., 2013. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, The Institution of Engineering and Technology, London.
[33] Wei, Q., Wang, F. Y., Liu, D., and Yang, X., 2014. "Finite-Approximation-Error-Based Discrete-Time Iterative Adaptive Dynamic Programming," IEEE Transactions on Cybernetics, Vol. 44, pp. 2820-2833.
[34] Werbos, P. J., 1992. "Approximate Dynamic Programming for Real-Time Control and Neural Modeling," in Handbook of Intelligent Control (D. A. White and D. A. Sofge, eds.), Multiscience Press.
[35] Yu, H., and Bertsekas, D. P., 2013. "A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies," Lab. for Information and Decision Systems Report LIDS-P-2905, MIT; to appear in Math. of Operations Research.

Dimitri P. Bertsekas studied engineering at the National Technical University of Athens, Greece, obtained his MS in electrical engineering at the George Washington University, Wash. DC in 1969, and his Ph.D. in system science in 1971 at the Massachusetts Institute of Technology (M.I.T.).

Dr. Bertsekas has held faculty positions with the Engineering-Economic Systems Dept., Stanford University (1971-1974) and the Electrical Engineering Dept. of the University of Illinois, Urbana (1974-1979). Since 1979 he has been teaching at the Electrical Engineering and Computer Science Department of M.I.T., where he is currently McAfee Professor of Engineering. He consults regularly with private industry and has held editorial positions in several journals. His research has spanned several fields, including optimization, control, large-scale and distributed computation, and data communication networks, and is closely tied to his teaching and book authoring activities. He has written numerous research papers, and sixteen books, several of which are used as textbooks in M.I.T. classes.

Professor Bertsekas was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book "Neuro-Dynamic Programming" (co-authored with John Tsitsiklis), the 2001 ACC John R. Ragazzini Education Award, the 2009 INFORMS Expository Writing Award, the 2014 ACC Richard E. Bellman Control Heritage Award, the 2014 Khachiyan Prize for Life-Time Accomplishments in Optimization, and the SIAM/MOS 2015 George B. Dantzig Prize. In 2001 he was elected to the United States National Academy of Engineering.

Dr. Bertsekas' recent books are "Dynamic Programming and Optimal Control: 4th Edition" (2012), "Abstract Dynamic Programming" (2013), and "Convex Optimization Algorithms" (2015), all published by Athena Scientific.