Optimal and Autonomous Control Using Reinforcement Learning: A Survey

Bahare Kiumarsi, Member, IEEE, Kyriakos G. Vamvoudakis, Senior Member, IEEE, Hamidreza Modares, Member, IEEE, and Frank L. Lewis, Fellow, IEEE

Abstract—This paper reviews the current state of the art on reinforcement learning (RL)-based feedback control solutions to optimal regulation and tracking of single and multiagent systems. Existing RL solutions to both optimal H2 and H∞ control problems, as well as graphical games, will be reviewed. RL methods learn the solution to optimal control and game problems online and using measured data along the system trajectories. We discuss Q-learning and the integral RL algorithm as core algorithms for discrete-time (DT) and continuous-time (CT) systems, respectively. Moreover, we discuss a new direction of off-policy RL for both CT and DT systems. Finally, we review several applications.

Index Terms—Autonomy, data-based optimization, reinforcement learning (RL).

I. INTRODUCTION

Optimal control [1]–[6] is a mature mathematical discipline that finds optimal control policies for dynamical systems by optimizing user-defined cost functionals that capture desired design objectives. The two main principles for solving such problems are Pontryagin's maximum principle (PMP) and the dynamic programming (DP) principle. PMP provides a necessary condition for optimality. DP, on the other hand, provides a sufficient condition for optimality by solving a partial differential equation, known as the Hamilton–Jacobi–Bellman (HJB) equation [7], [8]. Classical optimal control solutions are offline and require complete knowledge of the system dynamics. Therefore, they are not able to cope with uncertainties and changes in the dynamics.

Machine learning [9]–[13] has been used to enable adaptive autonomy. Machine learning methods are grouped into supervised, unsupervised, and reinforcement learning, depending on the amount and quality of feedback about the system or task. In supervised learning, the feedback provided to the learning algorithm is a labeled training data set, and the objective is to build a model representing the learned relation between the input, output, and system parameters. In unsupervised learning, no feedback is provided to the algorithm, and the objective is to classify the sample set into different groups based on the similarity between the input samples. Finally, reinforcement learning (RL) is a goal-oriented learning tool wherein the agent or decision maker learns a policy to optimize a long-term reward by interacting with the environment. At each step, an RL agent gets evaluative feedback about the performance of its action, allowing it to improve the performance of subsequent actions [14]–[19]. The term RL covers work done in many areas, such as psychology, computer science, and economics. A more modern formulation of RL is called approximate DP (ADP). In a control engineering context, RL and ADP bridge the gap between traditional optimal control and adaptive control algorithms [20]–[26]. The goal is to learn the optimal policy and value function for a potentially uncertain physical system. Unlike traditional optimal control, RL finds the solution of the HJB equation online in real time. On the other hand, unlike traditional adaptive controllers, which are not usually designed to be optimal in the sense of minimizing cost functionals, RL algorithms are optimal. This has motivated control systems researchers to enable adaptive autonomy in an optimal manner by developing RL-based controllers.

A. Related Theoretical Work
The origin of RL is rooted in computer science, and RL has attracted increasing attention since the seminal work of [27] and [28]. The interest in RL within the control community dates back to the work of [15] and [29]–[31]. Watkins' Q-learning algorithm [32] has also made an impact by considering totally unknown environments. Extending RL algorithms to continuous-time (CT) and continuous-state systems was first performed in [33]. The work of [33] used knowledge of the system model to learn the optimal control policy. The work of [34]–[36] formulated and developed such ideas in a control-theoretic framework for CT systems. The control of switching and hybrid systems using ADP is considered in [37]–[41].

There are generally two basic tasks in RL algorithms: policy evaluation and policy improvement. Policy evaluation calculates the cost or value function related to the current policy, and policy improvement assesses the obtained value function and updates the current policy. The two main classes of RL algorithms used for performing these two steps are known as policy iteration (PI) [14] and value iteration (VI). PI and VI algorithms iteratively perform policy evaluation and policy improvement until an optimal solution is found. PI methods start with an admissible control policy [42], [43] and solve a sequence of Bellman equations to find the optimal control policy. In contrast to PI methods, VI methods do not require an initial stabilizing control policy. While most RL-based control algorithms are PI, some VI algorithms have also been developed to learn optimal control solutions. We mostly survey PI-based RL algorithms that are used for feedback control design.

RL algorithms in the control context have mainly been used to solve: 1) optimal regulation and optimal tracking of single-agent systems [1] and 2) optimal coordination of multiagent systems [44].
The objective of the optimal regulation problem is to design an optimal controller to assure that the states or outputs of the system converge to zero, or close to zero, while in the optimal tracking control problem, it is desired that the optimal controller make the states or outputs of the system track a desired reference trajectory. The goal in optimal coordination of multiagent systems is to design distributed control protocols, based only on locally available information of the agents, so that the agents achieve some team objective. This paper reviews existing RL-based algorithms for solving optimal regulation and tracking of single-agent systems and game-based coordination of multiagent systems.

Finally, RL algorithms used to solve optimal control problems are categorized into two classes of learning control methods, namely, on-policy and off-policy methods [14]. On-policy methods evaluate or improve the same policy as the one that is used to make decisions. In off-policy methods, these two functions are separated. The policy used to generate data, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the estimation policy or target policy. The learning process for the target policy is online, but the data used in this step can be obtained offline by applying the behavior policy to the system dynamics. Off-policy methods are data efficient and fast, since a stream of experiences obtained from executing a behavior policy is reused to update several value functions corresponding to different estimation policies. Moreover, off-policy algorithms take into account the effect of the probing noise needed for exploration. Fig. 1 shows the schematic of on-policy and off-policy RL.

Fig. 1. Two different categories of RL. In on-policy RL, the policy that is applied to the system (behavior policy) to generate data for learning is the same as the policy that is being learned (learned policy) to find the optimal control solution. In off-policy RL, on the other hand, these two policies are separated and can be different. (a) Off-policy RL diagram. (b) On-policy RL diagram.

B. Structure

This paper surveys the literature on RL and autonomy. In Section II, we present the optimal control problems for discrete-time (DT) dynamical systems and their online solutions using RL algorithms. This section includes the optimal regulation problem, the tracking control problem, and the H∞ problem for both linear and nonlinear DT systems. In Section III, we discuss several recent developments in using RL to design optimal controllers for CT dynamical systems. Nash game problems and their online solutions are discussed at the end of that section. The RL solution to games on graphs is presented in Section IV. Finally, we discuss applications and provide open research directions in Sections V and VI.

II. OPTIMAL CONTROL OF DT SYSTEMS AND ONLINE SOLUTIONS

Consider the nonlinear time-invariant system given as

$$x(k+1) = f(x(k)) + g(x(k))u(k), \qquad y(k) = l(x(k)) \qquad (1)$$

where $x(k)\in\mathbb R^n$, $u(k)\in\mathbb R^m$, and $y(k)\in\mathbb R^p$ represent the state of the system, the control input, and the output of the system, respectively. $f(x(k))\in\mathbb R^n$ is the drift dynamics, $g(x(k))\in\mathbb R^{n\times m}$ is the input dynamics, and $l(x(k))\in\mathbb R^p$ is the output dynamics. It is assumed that $f(0)=0$, that $f(x(k))+g(x(k))u(k)$ is locally Lipschitz, and that the system is stabilizable. This is a standard assumption to make sure the solution $x(k)$ of the system (1) is unique for any finite initial condition.

A. Optimal Regulation Problem

The goal of optimal regulation is to design an optimal control input to stabilize the system (1) while minimizing a predefined cost functional. Such an energy-related cost functional can be defined as

$$J = \sum_{i=0}^{\infty} U(x(i),u(i)) \equiv \sum_{i=0}^{\infty}\big(Q(x(i)) + u^T(i)Ru(i)\big)$$

where $Q(x)\ge 0$ and $R = R^T \succ 0$. Hence, the problem to be solved can be defined as

$$V^\star(x(k)) = \min_{u}\sum_{i=k}^{\infty}\big(Q(x(i)) + u^T(i)Ru(i)\big), \quad \forall x(k).$$

The value function for a given $u(k)$ can be defined as

$$V(x(k)) = \sum_{i=k}^{\infty} U(x(i),u(i)) \equiv \sum_{i=k}^{\infty}\big(Q(x(i)) + u^T(i)Ru(i)\big), \quad \forall x. \qquad (2)$$

An equivalent to (2) is the Bellman equation

$$V(x(k)) = U(x(k),u(k)) + V(x(k+1)). \qquad (3)$$

The associated Hamiltonian is defined by

$$H(x(k),u(k),V) = U(x(k),u(k)) + V(x(k+1)) - V(x(k)), \quad \forall x,u.$$

The Bellman optimality principle [1] gives the optimal value function as

$$V^\star(x(k)) = \min_{u}\big[U(x(k),u(k)) + V^\star(x(k+1))\big]$$

which is termed the DT HJB equation. The optimal control is derived to be

$$u^\star(x(k)) = \arg\min_{u}\big[U(x(k),u(k)) + V^\star(x(k+1))\big] = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V^\star(x(k+1))}{\partial x(k+1)}.$$

B. Special Case

For DT linear systems, the dynamics (1) become

$$x(k+1) = Ax(k) + Bu(k), \qquad y(k) = Cx(k) \qquad (4)$$

where $A$, $B$, and $C$ are constant matrices with appropriate dimensions.

By assuming that $Q(x) = x^T(k)Qx(k)$, $Q\succ 0$, in the Bellman equation, the value function is quadratic in the current state, so that

$$V(x(k)) = x^T(k)Px(k), \quad \forall x. \qquad (5)$$

Then, the DT HJB becomes the DT algebraic Riccati equation (DARE)

$$Q - P + A^TPA - A^TPB\big(R + B^TPB\big)^{-1}B^TPA = 0$$

and the optimal control input is

$$u^\star(k) = -\big(R + B^TPB\big)^{-1}B^TPA\,x(k), \quad \forall x.$$

C. Approximate Solution Using RL

The HJB equation is generally extremely difficult or even impossible to solve analytically, and one needs to approximate its solution. Existing methods for approximating the HJB equation and DARE solutions require complete knowledge of the system dynamics. The following PI algorithm can be used to approximate the HJB and DARE solutions by performing successive iterations.

1) Offline PI Algorithm: Algorithm 1 presents an offline solution to the DT HJB equation but requires complete knowledge of the system dynamics.

Algorithm 1 PI Algorithm to Find the Solution of the HJB
1: procedure
2: Given admissible policy $u_0(k)$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_{j+1}(x)$ using the Bellman equation
$$V_{j+1}(x(k)) = Q(x) + u_j^T(k)Ru_j(k) + V_{j+1}(x(k+1));$$
on convergence, set $V_{j+1}(x(k)) = V_j(x(k))$.
4: Update the control policy $u_{j+1}(k)$ using
$$u_{j+1}(k) = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}(x(k+1))}{\partial x(k+1)}.$$
5: Go to 3
6: end procedure
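For the linear special case of Section II-B, the policy-evaluation step of Algorithm 1 becomes a Lyapunov equation whose fixed point is the DARE solution. The following minimal, model-based sketch (not taken from the references; all matrices are assumed illustrative values) runs this offline PI loop with NumPy/SciPy and checks the result against a direct DARE solver.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Illustrative DT linear system x(k+1) = A x(k) + B u(k) (assumed values).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Step 2 of Algorithm 1: an admissible (stabilizing) initial policy u = -K x.
K = np.array([[0.5, 2.0]])
assert np.all(np.abs(np.linalg.eigvals(A - B @ K)) < 1.0), "K0 must be stabilizing"

for j in range(50):
    # Policy evaluation: V_{j+1}(x) = x^T P x solves the Lyapunov (Bellman) equation
    # (A - B K)^T P (A - B K) - P + Q + K^T R K = 0.
    Acl = A - B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Policy improvement: u_{j+1} = -(R + B^T P B)^{-1} B^T P A x.
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(K_new - K) < 1e-10:
        K = K_new
        break
    K = K_new

P_dare = solve_discrete_are(A, B, Q, R)
print("PI gain   :", K)
print("DARE gain :", np.linalg.solve(R + B.T @ P_dare @ B, B.T @ P_dare @ A))
print("||P - P_dare|| =", np.linalg.norm(P - P_dare))
```

The iteration typically converges in a handful of steps; as noted in Remark 1 below, for linear systems the quadratic value parameterization is exact, so this model-based PI recovers the DARE solution exactly (up to numerical tolerance).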

2) Actor and Critic Approximators: To approximate the solution of the HJB equation and obviate the requirement of complete knowledge of the system dynamics, the actor-critic structure has been widely used to find the online solution to the HJB. The critic approximator estimates the value function and is updated to minimize the Bellman error. The actor approximator approximates the control policy and is updated to minimize the value function [15], [29], [30], [45]. The value function is represented at each step as

$$\hat V_j(x(k)) = \hat W_{cj}^T\phi_c(x(k)) = \sum_{i=1}^{N_c}\hat w_{cj}^i\,\phi_{ci}(x(k)), \quad \forall x \qquad (6)$$

and the control input as

$$\hat u_j(k) = \hat W_{aj}^T\sigma_a(x(k)) = \sum_{i=1}^{N_a}\hat w_{aj}^i\,\sigma_{ai}(x(k)), \quad \forall x \qquad (7)$$

where $\phi_{ci}(x(k))$ and $\sigma_{ai}(x(k))$ are the basis functions, $\hat W_{cj} = [\hat w_{cj}^1\ \hat w_{cj}^2\ \ldots\ \hat w_{cj}^{N_c}]^T$ is the weight vector of the critic approximator with $N_c$ the number of basis functions used, and $\hat W_{aj} = [\hat w_{aj}^1\ \hat w_{aj}^2\ \ldots\ \hat w_{aj}^{N_a}]^T$ is the weight vector of the actor approximator with $N_a$ the number of basis functions used.

The Bellman equation (3) in terms of the critic approximator (6) is written as

$$\hat W_{c(j+1)}^T\phi_c(x(k)) = U(x(k),\hat u_j(k)) + \hat W_{c(j+1)}^T\phi_c(x(k+1)).$$

The gradient-descent tuning laws for the critic and actor approximators are given as

$$\hat W_{c(j+1)}^{l+1} = \hat W_{c(j+1)}^{l} - \alpha_1\,\phi_c(x(k))\Big(\big(\hat W_{c(j+1)}^{l}\big)^T\phi_c(x(k)) - U(x(k),\hat u_j(k)) - \big(\hat W_{c(j+1)}^{l}\big)^T\phi_c(x(k+1))\Big) \qquad (8)$$

$$\hat W_{a(j+1)}^{l+1} = \hat W_{a(j+1)}^{l} - \alpha_2\,\sigma_a(x(k))\Big(2R\big(\hat W_{a(j+1)}^{l}\big)^T\sigma_a(x(k)) + g^T(x(k))\frac{\partial\phi_c(x(k+1))}{\partial x(k+1)}\hat W_{c(j+1)}\Big)^T \qquad (9)$$

where $\alpha_1,\alpha_2 > 0$ are the tuning gains. Note that using the actor-critic structure to evaluate the value function and improve the policy for nonlinear systems does not need complete knowledge of the system dynamics. In [46], synchronous methods are given to tune the actor and critic approximators. In [47], an online approximator approach is used to find the solution of the HJB equation without requiring knowledge of the internal system dynamics. In [48], a greedy iterative heuristic DP method is introduced to obtain the optimal saturated controller using three neural networks that approximate the value function, the optimal control policy, and the model of the unknown plant.

Remark 1: Note that the structure of the value function approximator is an important factor in the convergence and performance of Algorithm 1. If an inappropriate value function approximator is chosen, the algorithm may never converge to an optimal solution. For linear systems, the form of the value function is known to be quadratic and, therefore, there is no error in its approximation. Consequently, Algorithm 1 converges to the global optimal solution for linear systems. That is, for linear systems, Algorithm 1 gives the exact solution to the DARE. For nonlinear systems, using single-layer neural networks for value function approximation may require a large number of activation functions to assure a good approximation error and, consequently, a near-optimal solution. Two-layer neural networks are used in [49] to achieve a better approximation error with a smaller number of activation functions. Moreover, error-tolerant ADP-based algorithms are presented for DT systems [50] that guarantee stability in the presence of approximation error in the value function approximation.
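As a concrete illustration of gradient-descent critic tuning of the kind in (8), the sketch below fits a quadratic critic $\hat W^T\phi_c(x)$ to the Bellman equation of a fixed stabilizing policy on a linear system by descending the squared Bellman (TD) error. All numerical values and the normalization term are assumptions for illustration, not taken from the cited references.

```python
import numpy as np

# Assumed DT linear system and a fixed stabilizing policy u = -K x.
A = np.array([[0.99, 0.1], [0.0, 0.95]])
B = np.array([[0.0], [0.1]])
K = np.array([[0.3, 1.0]])
Q, R = np.eye(2), np.array([[1.0]])

def phi(x):
    # Quadratic basis for the critic: [x1^2, x1*x2, x2^2].
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

rng = np.random.default_rng(0)
W = np.zeros(3)            # critic weights
alpha = 0.05               # tuning gain (alpha_1 in (8))

for episode in range(200):
    x = rng.uniform(-1, 1, size=2)
    for k in range(30):
        u = -K @ x
        x_next = A @ x + B @ u
        cost = x @ Q @ x + u @ R @ u                      # U(x(k), u(k))
        # Bellman (TD) error for the current weights.
        e = W @ phi(x) - cost - W @ phi(x_next)
        # Normalized gradient-descent step on the squared Bellman error.
        grad = phi(x) - phi(x_next)
        W = W - alpha * e * grad / (1.0 + grad @ grad)
        x = x_next

# For this quadratic critic, W encodes P = [[W0, W1/2], [W1/2, W2]].
P = np.array([[W[0], W[1]/2], [W[1]/2, W[2]]])
print("Learned P:\n", P)
```

In this scalar-feature setting the learned P approaches the Lyapunov solution for the fixed gain K, which is exactly what the policy-evaluation step of the actor-critic scheme is meant to produce before the actor update improves the policy.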
3) Event-Triggered RL [51]: In order to reduce the communication between the controller and the plant, one needs to use an event-triggered control algorithm. The event-triggering mechanism determines when the control signal has to be transmitted so that the resulting event-based control executions still achieve some degree of performance and stabilize the system. For nonlinear DT systems, three neural networks are used to find the event-triggered-based approximation of the HJB equation.

The nonlinear system can be represented as

$$x(k+1) = W_m^{\star T}\phi_m(x_m(k)) + \epsilon_m \qquad (10)$$

where $W_m^\star$ is the target weight matrix of the model network from the hidden layer to the output layer, $x_m(k)$ is the input vector of the hidden layer, and $\epsilon_m$ is the bounded approximation error.

The system (10), using the current estimates $\hat W_m$ of the ideal weights $W_m^\star$, is given as

$$\hat x(k+1) = \sum_{i=1}^{N_{mh}}\hat w_{mi}^2\,s_i(k)$$

with

$$s_i(k) = \frac{1-e^{-h_i(k)}}{1+e^{-h_i(k)}}, \qquad h_i(k) = \sum_{j=1}^{n+m}w_{mi,j}^1\,x_{mj}(k), \qquad i = 1,\ldots,N_{mh}$$

where $w_m^1$ and $w_m^2$ are the weight matrices, $h_i$ is the input of the $i$th hidden node, $s_i$ is its corresponding output, $N_{mh}$ is the number of hidden neurons, and $x_m(k)$ is the input of the model network, which includes the sampled state vector and the corresponding control law. The gradient descent can be used to update the weights of the network to minimize $e_m = \hat x(k+1) - x(k+1)$.

The value function can be approximated by a network similar to (6). The gradient descent can be used to update the weights of the network to minimize

$$e_c(k) = J(x(k_i)) - \big[\hat J(x(k_{i+1})) + U(k)\big]$$

where $J$ is the output of the critic network, defined as

$$J(x(k_i)) = \sum_{l=1}^{N_{ch}}w_{cl}^2\,q_l(k)$$

with

$$q_l(k) = \frac{1-e^{-p_l(k)}}{1+e^{-p_l(k)}}, \qquad p_l(k) = \sum_{j=1}^{n}w_{cl,j}^1\,x_{cj}(k_i), \qquad l = 1,\ldots,N_{ch}$$

where $p_l$ and $q_l$ are the input and the output of the hidden nodes of the critic network, respectively, and $N_{ch}$ is the total number of hidden nodes. The input of the critic network, $x_c(k_i)$, is the sampled state vector only, so there are $n$ input nodes in the critic network.

The sampled state $x(k_i)$ is used as input to learn the event-triggered control law, which is defined as

$$\mu(x(k_i)) = \sum_{l=1}^{N_{ah}}w_{al}^2\,v_l(k)$$

with

$$v_l(k) = \frac{1-e^{-t_l(k)}}{1+e^{-t_l(k)}}, \qquad t_l(k) = \sum_{j=1}^{n}w_{al,j}^1\,x_{aj}(k_i), \qquad l = 1,\ldots,N_{ah}$$

where $t_l$ and $v_l$ are the input and the output of the hidden nodes of the action network, respectively, and $N_{ah}$ is the total number of hidden nodes in the action network. The gradient descent can be used to update the weights of the actor network. In [52], an event-triggered finite-time optimal control scheme is designed for an uncertain nonlinear DT system.

4) Q-Function for the DT Linear Quadratic Regulation: For linear systems, the work of [32] proposed an action-dependent function (Q-function) instead of the value function in the Bellman equation to avoid knowing the system dynamics. This is termed Q-learning in the literature [53], [54]. Based on (3) and (5), the DT Q-function is defined as

$$Q(x(k),u(k)) = x^T(k)Qx(k) + u^T(k)Ru(k) + x^T(k+1)Px(k+1), \quad \forall u,x. \qquad (11)$$

Using the dynamics (4), the Q-function (11) becomes

$$Q(x(k),u(k)) = \begin{bmatrix}x(k)\\ u(k)\end{bmatrix}^T\begin{bmatrix}Q + A^TPA & A^TPB\\ B^TPA & R + B^TPB\end{bmatrix}\begin{bmatrix}x(k)\\ u(k)\end{bmatrix}.$$

Define

$$Q(x(k),u(k)) = \begin{bmatrix}x(k)\\ u(k)\end{bmatrix}^T\begin{bmatrix}S_{xx} & S_{xu}\\ S_{ux} & S_{uu}\end{bmatrix}\begin{bmatrix}x(k)\\ u(k)\end{bmatrix} \equiv Z^T(k)\,S\,Z(k)$$

for a kernel matrix $S$. By applying the stationarity condition $\partial Q(x(k),u(k))/\partial u(k) = 0$, one has

$$u^\star(k) = -\big(R + B^TPB\big)^{-1}B^TPA\,x(k)$$

and

$$u^\star(k) = -S_{uu}^{-1}S_{ux}\,x(k).$$

Algorithm 2 Q-Learning Algorithm for the DARE
1: procedure
2: Given admissible policy $u_0(k)$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $S_{j+1}$ using the Bellman equation
$$Z^T(k)S_{j+1}Z(k) = x^T(k)Qx(k) + u_j^T(k)Ru_j(k) + Z^T(k+1)S_{j+1}Z(k+1);$$
on convergence, set $S_{j+1} = S_j$.
4: Update the control policy $u_{j+1}(k)$ using
$$u_{j+1}(k) = -(S_{uu})_{j+1}^{-1}(S_{ux})_{j+1}\,x(k).$$
5: Go to 3
6: end procedure

Algorithm 2 is a model-free learning approach. This algorithm converges to the global optimal solution provided that a persistence of excitation (PE) condition is satisfied [55]. The PE condition guarantees the uniqueness of the policy-evaluation step at each iteration. However, full information of the states of the system is needed. In [56], output-feedback (OPFB) RL algorithms are derived for linear systems. These algorithms do not require any knowledge of the system dynamics and, as such, are similar to Q-learning; they have the added advantage of requiring only measurements of input/output data and not the full system state.
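A minimal data-based sketch of Algorithm 2 for the linear quadratic case is given below: the kernel $S_{j+1}$ is identified by least squares from measured $(x, u, x')$ tuples collected under probing noise, and the gain is updated as $K_{j+1} = S_{uu}^{-1}S_{ux}$. The plant matrices are assumed illustrative values and are used only to simulate data, never by the learner.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)
# Plant (used only to generate data; unknown to the learner).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1

def quad_feats(z):
    # Independent entries of z z^T (symmetric kernel parameterization).
    outer = np.outer(z, z)
    iu = np.triu_indices(len(z))
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)   # off-diagonal terms appear twice
    return scale * outer[iu]

def unpack_S(theta, dim):
    S = np.zeros((dim, dim))
    S[np.triu_indices(dim)] = theta
    return S + S.T - np.diag(np.diag(S))

K = np.array([[0.5, 2.0]])                       # admissible initial gain
for j in range(15):
    Phi, b = [], []
    x = rng.uniform(-1, 1, size=n)
    for k in range(400):                         # data under the behavior policy
        u = -K @ x + 0.3 * rng.standard_normal(m)            # probing noise for PE
        x_next = A @ x + B @ u
        u_next = -K @ x_next                                  # target-policy action
        z, z_next = np.concatenate([x, u]), np.concatenate([x_next, u_next])
        Phi.append(quad_feats(z) - quad_feats(z_next))        # Bellman eq. regressor
        b.append(x @ Q @ x + u @ R @ u)                       # stage cost
        x = x_next if np.linalg.norm(x_next) < 10 else rng.uniform(-1, 1, size=n)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(b), rcond=None)
    S = unpack_S(theta, n + m)
    K = np.linalg.solve(S[n:, n:], S[n:, :n])    # policy improvement: Suu^{-1} Sux

P = solve_discrete_are(A, B, Q, R)
print("Q-learning gain:", K)
print("Optimal gain   :", np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A))
```

The probing noise plays the role of the PE requirement mentioned above: without it, the least-squares problem for the kernel entries becomes rank deficient and the policy-evaluation step loses uniqueness.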
D. Optimal Tracking Problem

The goal now is to design an optimal control input to make the states of the system $x(k)$ follow a desired reference trajectory $x_d(k)$. Define the tracking error $e(k)$ as

$$e(k) = x(k) - x_d(k).$$

In the tracking problem, the control input consists of two terms: a feedforward term that guarantees tracking and a feedback term that stabilizes the system. The feedforward term can be obtained using the dynamics-inversion concept as

$$u_d(k) = g(x_d(k))^{-1}\big(x_d(k+1) - f(x_d(k))\big).$$

Consider the following cost functional:

$$J(e(k),u_e(k)) = \sum_{i=k}^{\infty}\big(e^T(i)Q_ee(i) + u_e^T(i)R_eu_e(i)\big)$$

where $Q_e\ge 0$ and $R_e = R_e^T\succ 0$. The feedback input can be found by applying the stationarity condition $\partial J(e,u_e)/\partial u_e = 0$ as

$$u_e^\star(k) = -\frac{1}{2}R_e^{-1}g^T(k)\frac{\partial J(e(k+1))}{\partial e(k+1)}.$$

Then, the optimal control input including both feedback and feedforward terms is

$$u^\star(k) = u_d(k) + u_e^\star(k).$$

Obtaining the feedforward part of the control input needs complete knowledge of the system dynamics and the reference trajectory dynamics. In [57] and [58], a new formulation is developed that gives both the feedback and feedforward parts of the control input simultaneously and thus enables RL algorithms to solve tracking problems without requiring complete knowledge of the system dynamics.

Assume now that the reference trajectory is generated by the following command generator model:

$$x_d(k+1) = \psi(x_d(k))$$

where $x_d(k)\in\mathbb R^n$. Then, an augmented system can be constructed in terms of the tracking error $e(k)$ and the reference trajectory $x_d(k)$ as

$$\begin{bmatrix}e(k+1)\\ x_d(k+1)\end{bmatrix} = \begin{bmatrix}f(e(k)+x_d(k)) - \psi(x_d(k))\\ \psi(x_d(k))\end{bmatrix} + \begin{bmatrix}g(e(k)+x_d(k))\\ 0\end{bmatrix}u(k) \equiv F(X(k)) + G(X(k))u(k)$$

where the augmented state is

$$X(k) = \begin{bmatrix}e(k)\\ x_d(k)\end{bmatrix}.$$

The new cost functional is defined as

$$J(x(0),x_d(0),u(k)) = \sum_{i=0}^{\infty}\gamma^{\,i}\big((x(i)-x_d(i))^TQ(x(i)-x_d(i)) + u^T(i)Ru(i)\big)$$

where $Q\ge 0$ and $R = R^T\succ 0$. The value function in terms of the states of the augmented system is written as

$$V(X(k)) = \sum_{i=k}^{\infty}\gamma^{\,i-k}U(X(i),u(i)) = \sum_{i=k}^{\infty}\gamma^{\,i-k}\big(X^T(i)Q_TX(i) + u^T(i)Ru(i)\big) \qquad (12)$$

where

$$Q_T = \begin{bmatrix}Q & 0\\ 0 & 0\end{bmatrix}$$

and $0 < \gamma \le 1$ is the discount factor.

Remark 2: Note that it is essential to use a discounted performance function for this formulation. If the reference trajectory does not go to zero, which is the case in most real applications, then the value is infinite without the discount factor, because the control input contains a feedforward part that depends on the reference trajectory, and thus $u^T(k)Ru(k)$ does not go to zero as time goes to infinity.

A difference equivalent to (12) is

$$V(X(k)) = U(X(k),u(k)) + \gamma V(X(k+1)).$$

The Hamiltonian for this problem is given as

$$H(X(k),u(k),V) = X^T(k)Q_TX(k) + u^T(k)Ru(k) + \gamma V(X(k+1)) - V(X(k)).$$

The optimal value can be found as [59]

$$V^\star(X(k)) = \min_u\big[U(X(k),u(k)) + \gamma V^\star(X(k+1))\big]$$

which is just the DT HJB tracking equation. The optimal control is then given as

$$u^\star(X(k)) = \arg\min_u\big[U(X(k),u(k)) + \gamma V^\star(X(k+1))\big] = -\frac{\gamma}{2}R^{-1}G^T(X)\frac{\partial V^\star(X(k+1))}{\partial X(k+1)}.$$

E. Approximate Solution Using RL

The solution of the DT HJB tracking equation can be approximated as follows.

1) Offline PI Algorithm: The PI algorithm finds the solution of the DT tracking HJB by iterating on the solution of the Bellman equation.

Algorithm 3 PI Algorithm to Find the Solution of the Tracking HJB
1: procedure
2: Given admissible policy $u_0(k)$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_{j+1}(X)$ using the Bellman equation
$$V_{j+1}(X(k)) = X^T(k)Q_TX(k) + u_j^T(k)Ru_j(k) + \gamma V_{j+1}(X(k+1));$$
on convergence, set $V_{j+1}(X(k)) = V_j(X(k))$.
4: Update the control policy $u_{j+1}(k)$ using
$$u_{j+1}(X(k)) = -\frac{\gamma}{2}R^{-1}G^T(X)\frac{\partial V_{j+1}(X(k+1))}{\partial X(k+1)}.$$
5: Go to 3
6: end procedure

The augmented system dynamics must be known in order to update the control input in Algorithm 3. Convergence properties of Algorithm 3 are similar to those of Algorithm 1 and are not discussed here.

2) Online Actor and Critic Approximators: To obviate the requirement of complete knowledge of the system dynamics or the reference trajectory dynamics, an actor-critic structure, similar to (6) and (7), is developed in [57] for solving the nonlinear optimal tracking problem. In [58], Q-learning is used to find the optimal solution for linear systems. Kiumarsi et al. [60] presented PI and VI algorithms to solve the linear quadratic tracker (LQT) ARE online, without requiring any knowledge of the system dynamics or full-state information, using only measured input and output data.
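For a linear plant and a linear command generator, the augmented formulation above reduces to a discounted LQT Riccati equation in the augmented state $X(k) = [e(k);\,x_d(k)]$. The sketch below is an illustrative, model-based value-iteration recursion for that discounted ARE under assumed matrices; it is not the data-based PI/VI scheme of [60], but it produces the gain that such data-based algorithms are meant to recover.

```python
import numpy as np

# Assumed linear plant x(k+1) = A x + B u and command generator xd(k+1) = F xd.
A = np.array([[1.0, 0.05], [0.0, 0.95]])
B = np.array([[0.0], [0.05]])
F = np.array([[np.cos(0.05), np.sin(0.05)],
              [-np.sin(0.05), np.cos(0.05)]])   # marginally stable reference
Q = np.diag([10.0, 1.0])
R = np.array([[1.0]])
gamma = 0.8                                      # discount factor (Remark 2)

n = A.shape[0]
# Augmented dynamics for X = [e; xd]:  e(k+1) = A e + (A - F) xd + B u.
T = np.block([[A, A - F], [np.zeros((n, n)), F]])
B1 = np.vstack([B, np.zeros((n, 1))])
QT = np.block([[Q, np.zeros((n, n))], [np.zeros((n, n)), np.zeros((n, n))]])

# Value iteration on the discounted tracking DARE:
# P <- QT + g*T'PT - g^2*T'P B1 (R + g*B1'P B1)^{-1} B1'P T,  g = gamma.
P = np.zeros((2 * n, 2 * n))
for _ in range(2000):
    G = np.linalg.solve(R + gamma * B1.T @ P @ B1, B1.T @ P @ T)
    P_new = QT + gamma * T.T @ P @ T - gamma**2 * T.T @ P @ B1 @ G
    if np.linalg.norm(P_new - P) < 1e-10:
        P = P_new
        break
    P = P_new

K = gamma * np.linalg.solve(R + gamma * B1.T @ P @ B1, B1.T @ P @ T)   # u = -K X
print("Tracking gain K (acts on [e; xd]):\n", K)
```

Because the reference generator here is only marginally stable, the iteration converges only thanks to the discount factor, which is exactly the point made in Remark 2.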

F. H∞ Control of DT Systems

The H∞ control problem can be considered as a zero-sum game, where the control and disturbance inputs are the minimizing and maximizing players, respectively [61]–[64]. For linear systems, the optimal solution to the zero-sum game problem leads to solving the GARE.

Consider now the dynamics with an added disturbance input

$$x(k+1) = f(x(k)) + g(x(k))u(k) + h(x(k))d(k) \qquad (13)$$

where $x(k)\in\mathbb R^n$ is a measurable state vector, $u(k)\in\mathbb R^m$ is the control input, $d(k)\in\mathbb R^q$ is the disturbance input, $f(x(k))\in\mathbb R^n$ is the drift dynamics, $g(x(k))\in\mathbb R^{n\times m}$ is the input dynamics, and $h(x(k))\in\mathbb R^{n\times q}$ is the disturbance input dynamics.

Define the cost functional to be optimized as

$$J(x(0),u(k),d(k)) = \sum_{i=0}^{\infty}\big(Q(x(i)) + u^T(i)Ru(i) - \beta^2d^T(i)d(i)\big)$$

where $Q(x)\ge 0$, $R = R^T\succ 0$, and $\beta\ge\beta^\star\ge 0$ with $\beta^\star$ the smallest $\beta$ such that the system is stabilized. The value function for given feedback control and disturbance policies can be defined as

$$V(x(k),u(k),d(k)) = \sum_{i=k}^{\infty}\big(Q(x(i)) + u^T(i)Ru(i) - \beta^2d^T(i)d(i)\big).$$

A difference equivalent to this is

$$V(x(k)) = Q(x) + u^T(k)Ru(k) - \beta^2d^T(k)d(k) + V(x(k+1)).$$

We can find the optimal value by solving the zero-sum game

$$V^\star(x(k)) = \min_u\max_d J(x(0),u,d)$$

subject to (13). It is worth noting that $u(k)$ is the minimizing player while $d(k)$ is the maximizing one.

In order to solve this zero-sum game, one needs to solve the following Hamilton–Jacobi–Isaacs (HJI) equation:

$$V^\star(x(k)) = Q(x) + u^{\star T}(k)Ru^\star(k) - \beta^2d^{\star T}(k)d^\star(k) + V^\star(x(k+1)).$$

Given a solution $V^\star$ to this equation, one can find the optimal control and the worst-case disturbance as

$$u^\star(k) = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V^\star(x(k+1))}{\partial x(k+1)} \qquad\text{and}\qquad d^\star(k) = \frac{1}{2\beta^2}h^T(x)\frac{\partial V^\star(x(k+1))}{\partial x(k+1)}$$

respectively.

G. Approximate Solution Using RL

In the following sections, on-policy and off-policy RL algorithms are presented to approximate the HJI and GARE solutions by successive approximations of the Bellman equations.

1) On-Policy RL to Solve H∞ Control of DT Systems: The following on-policy RL algorithm is given to solve the HJI.

Algorithm 4 PI Algorithm to Solve the HJI
1: procedure
2: Given admissible policies $u_0(k)$ and $d_0(k)$
3: for $j = 0,1,\ldots$, given $u_j$ and $d_j$, solve for the value $V_{j+1}(x(k))$ using the Bellman equation
$$V_{j+1}(x(k)) = Q(x) + u_j^T(k)Ru_j(k) - \beta^2d_j^T(k)d_j(k) + V_{j+1}(x(k+1));$$
on convergence, set $V_{j+1}(x(k)) = V_j(x(k))$.
4: Update the control policy $u_{j+1}(k)$ and disturbance policy $d_{j+1}(k)$ using
$$u_{j+1}(k) = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}(x(k+1))}{\partial x(k+1)}, \qquad d_{j+1}(k) = \frac{1}{2\beta^2}h^T(x)\frac{\partial V_{j+1}(x(k+1))}{\partial x(k+1)}.$$
5: Go to 3
6: end procedure

The PI algorithm (Algorithm 4) is an offline algorithm and requires complete knowledge of the system dynamics. The actor-critic structure can be used to design an online PI algorithm that simultaneously updates the value function and the policies and does not require knowledge of the drift dynamics. The value function and control input are approximated as in (6) and (7), respectively. The disturbance input is approximated as

$$\hat d_j(k) = \hat W_{dj}^T\sigma_d(x(k)) = \sum_{i=1}^{N_d}\hat w_{dj}^i\,\sigma_{di}(x(k))$$

where $\sigma_{di}(x(k))$ are the basis functions, $\hat W_{dj} = [\hat w_{dj}^1\ \hat w_{dj}^2\ \ldots\ \hat w_{dj}^{N_d}]^T$ is the weight vector, and $N_d$ is the number of basis functions used. The weights of the value function and control input approximators are updated using gradient descent as in (8) and (9), respectively. The gradient descent can also be used to update the weights of the disturbance input approximator.

A Q-learning algorithm is presented in [65] to find the optimal control input and the worst-case disturbance input for linear systems without requiring any knowledge of the system dynamics. However, the disturbance input needs to be updated in a prescribed manner. An off-policy RL algorithm is presented in [66] that does not need any knowledge of the system dynamics and in which the disturbance input does not need to be updated in a prescribed manner. Fig. 2 shows a schematic of off-policy RL for solving the zero-sum game problem.

Fig. 2. Off-policy RL for H∞ control. The behavior disturbance policy is the actual disturbance from the environment that cannot be specified. On the other hand, the learned disturbance policy uses the collected data to learn the worst-case policy.

2) Off-Policy RL for Solving the Zero-Sum Game Problem of DT Systems [66]: Consider the following system description:

$$x(k+1) = Ax(k) + Bu(k) + Dd(k) \qquad (14)$$

where $x\in\mathbb R^n$, $u\in\mathbb R^m$, and $d\in\mathbb R^q$ are the state, control, and disturbance inputs, and $A$, $B$, and $D$ are constant matrices of appropriate dimensions. To derive an off-policy RL algorithm, the original system (14) is rewritten as

$$x(k+1) = A_kx(k) + B\big(K_j^1x(k) + u(k)\big) + D\big(K_j^2x(k) + d(k)\big) \qquad (15)$$

where $A_k = A - BK_j^1 - DK_j^2$. In (15), the estimation policies are $u_j(k) = -K_j^1x(k)$ and $d_j(k) = -K_j^2x(k)$. By contrast, $u(k)$ and $d(k)$ are the behavior policies that are actually applied to (14) to generate the data for learning.

Using the value function (5) along the trajectories of (14), and expanding $x(k+1)$ according to (15), one obtains

$$x^T(k)P_{j+1}x(k) - x^T(k+1)P_{j+1}x(k+1) = x^T(k)Qx(k) + x^T(k)(K_j^1)^TRK_j^1x(k) - \beta^2x^T(k)(K_j^2)^TK_j^2x(k) - \big(u(k)+K_j^1x(k)\big)^TB^TP_{j+1}x(k+1) - \big(u(k)+K_j^1x(k)\big)^TB^TP_{j+1}A_kx(k) - \big(d(k)+K_j^2x(k)\big)^TD^TP_{j+1}x(k+1) - \big(d(k)+K_j^2x(k)\big)^TD^TP_{j+1}A_kx(k). \qquad (16)$$

Remark 3: Note that (16) does not explicitly depend on $K_{j+1}^1$ and $K_{j+1}^2$. Also, complete knowledge of the system dynamics is required for solving (16). In the following, it is shown how to find $(P_{j+1}, K_{j+1}^1, K_{j+1}^2)$ simultaneously from (16) without any knowledge of the system dynamics.

Given any matrix $M\in\mathbb R^{n\times m}$, $\mathrm{vec}(M)\in\mathbb R^{nm\times 1}$ is the transpose of the vector formed by stacking the rows of $M$. Using $a^TWb = (b^T\otimes a^T)\mathrm{vec}(W)$, (14), and $A_k = A - BK_j^1 - DK_j^2$, the off-policy Bellman equation (16) can be rewritten as

$$\big(x^T(k)\otimes x^T(k)\big)\mathrm{vec}(L_{j+1}^1) - \big(x^T(k+1)\otimes x^T(k+1)\big)\mathrm{vec}(L_{j+1}^1) + 2\big(x^T(k)\otimes(u(k)+K_j^1x(k))^T\big)\mathrm{vec}(L_{j+1}^2) - \big((K_j^1x(k)-u(k))^T\otimes(u(k)+K_j^1x(k))^T\big)\mathrm{vec}(L_{j+1}^3) + 2\big(x^T(k)\otimes(d(k)+K_j^2x(k))^T\big)\mathrm{vec}(L_{j+1}^4) - \big((K_j^1x(k)-u(k))^T\otimes(d(k)+K_j^2x(k))^T\big)\mathrm{vec}(L_{j+1}^5) + \big((d(k)-K_j^2x(k))^T\otimes(u(k)+K_j^1x(k))^T\big)\mathrm{vec}(L_{j+1}^6) + \big((d(k)-K_j^2x(k))^T\otimes(d(k)+K_j^2x(k))^T\big)\mathrm{vec}(L_{j+1}^7) = x^T(k)Qx(k) + x^T(k)(K_j^1)^TRK_j^1x(k) - \beta^2x^T(k)(K_j^2)^TK_j^2x(k) \qquad (17)$$

with $L_{j+1}^1 = P_{j+1}$, $L_{j+1}^2 = B^TP_{j+1}A$, $L_{j+1}^3 = B^TP_{j+1}B$, $L_{j+1}^4 = D^TP_{j+1}A$, $L_{j+1}^5 = D^TP_{j+1}B$, $L_{j+1}^6 = B^TP_{j+1}D$, and $L_{j+1}^7 = D^TP_{j+1}D$ the unknown variables in the Bellman equation (17). Then, using (17), one has

$$\psi_j\big[\mathrm{vec}(L_{j+1}^1)^T\ \mathrm{vec}(L_{j+1}^2)^T\ \mathrm{vec}(L_{j+1}^3)^T\ \mathrm{vec}(L_{j+1}^4)^T\ \mathrm{vec}(L_{j+1}^5)^T\ \mathrm{vec}(L_{j+1}^6)^T\ \mathrm{vec}(L_{j+1}^7)^T\big]^T = \phi_j \qquad (18)$$

where $\phi_j = [(\phi_j^1)^T\ \ldots\ (\phi_j^s)^T]^T$ and $\psi_j = [(\psi_j^1)^T\ \ldots\ (\psi_j^s)^T]^T$, with $\phi_j^i$ and $\psi_j^i$ defined as in [66]. One can solve (18) using a least-squares (LS) method.

Algorithm 5 Off-Policy PI to Find the Solution of the GARE
1: procedure
2: Set the iteration number $j = 0$ and start with an admissible control policy $u(k) = -K^1x(k) + e(k)$, where $e(k)$ is probing noise.
3: while $\|K_{j+1}^1 - K_j^1\|\ge\epsilon$ and $\|K_{j+1}^2 - K_j^2\|\ge\epsilon$ do
4: For $j = 0,1,2,\ldots$, solve (18) for $L_{j+1}^1,\ldots,L_{j+1}^7$ using LS.
5: Update the control and disturbance gains as
$$K_{j+1}^1 = \big(R + L_{j+1}^3 + L_{j+1}^6(\beta^2I - L_{j+1}^7)^{-1}L_{j+1}^5\big)^{-1}\big(L_{j+1}^2 + L_{j+1}^6(\beta^2I - L_{j+1}^7)^{-1}L_{j+1}^4\big)$$
$$K_{j+1}^2 = \big(L_{j+1}^7 - \beta^2I - L_{j+1}^5(R + L_{j+1}^3)^{-1}L_{j+1}^6\big)^{-1}\big(L_{j+1}^4 - L_{j+1}^5(R + L_{j+1}^3)^{-1}L_{j+1}^2\big).$$
6: $j = j + 1$
7: end while
8: end procedure

III. OPTIMAL CONTROL OF CT SYSTEMS AND ONLINE SOLUTIONS

Consider the nonlinear time-invariant system given as

$$\dot x(t) = f(x(t)) + g(x(t))u(t), \qquad y(t) = l(x(t)) \qquad (19)$$

where $x(t)\in\mathbb R^n$, $u(t)\in\mathbb R^m$, and $y(t)\in\mathbb R^p$ represent the state of the system, the control input, and the output of the system, respectively. $f(x(t))\in\mathbb R^n$ is the drift dynamics, $g(x(t))\in\mathbb R^{n\times m}$ is the input dynamics, and $l(x(t))\in\mathbb R^p$ is the output dynamics. It is assumed that $f(0)=0$, that $f(x(t))+g(x(t))u(t)$ is locally Lipschitz, and that the system is stabilizable.

A. Optimal Regulation Problem

The goal in optimal regulation is to design an optimal control input to assure that the states of the system (19) converge to zero while minimizing a cost functional. The cost functional is defined as

$$J(x(0),u) = \int_0^{\infty}r(x,u)\,dt \equiv \int_0^{\infty}\big(Q(x) + u^TRu\big)dt$$

where $Q(x)\ge 0$ and $R = R^T\succ 0$. The value function for an admissible control policy can be defined as

$$V(x,u) = \int_t^{\infty}r(x(\tau),u(\tau))\,d\tau \equiv \int_t^{\infty}\big(Q(x) + u^TRu\big)d\tau.$$

A differential equivalent to this is

$$r(x,u) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = 0, \qquad V(0) = 0$$

and the Hamiltonian is given by

$$H\Big(x,u,\frac{\partial V}{\partial x}\Big) = r(x,u) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big).$$

The optimal value is given by the Bellman optimality equation

$$r(x,u^\star) + \frac{\partial V^{\star T}}{\partial x}\big(f(x)+g(x)u^\star\big) = 0 \qquad (20)$$

which is just the CT HJB equation. The optimal control is then given as

$$u^\star(t) = \arg\min_u\Big[r(x,u) + \frac{\partial V^{\star T}}{\partial x}\big(f(x)+g(x)u\big)\Big] = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V^\star}{\partial x}. \qquad (21)$$
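For the linear-quadratic special case of (19)-(21), with $\dot x = Ax + Bu$ and $Q(x) = x^TQx$, the CT HJB reduces to the continuous-time algebraic Riccati equation and $u^\star = -R^{-1}B^TPx$. A short model-based sketch of this baseline, with assumed matrices, is given below; the RL algorithms of the next subsection approximate the same solution from measured data.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Assumed CT linear system xdot = A x + B u with quadratic cost.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# V*(x) = x^T P x solves A^T P + P A + Q - P B R^{-1} B^T P = 0 (CT HJB for LQ).
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)          # u* = -R^{-1} B^T P x = -K x
print("P =\n", P)
print("K =", K, " closed-loop eigs:", np.linalg.eigvals(A - B @ K))
```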

B. Approximate Solution Using RL

The following on-policy and off-policy integral RL (IRL) algorithms are presented to approximate the HJB solution by iterating on the Bellman equations.

1) On-Policy IRL: In [35] and [67], an equivalent formulation of the Bellman equation that does not involve the dynamics is found to be

$$V(x(t)) = \int_t^{t+T}\big(Q(x(\tau)) + u^T(\tau)Ru(\tau)\big)d\tau + V(x(t+T)) \qquad (22)$$

for any time $t\ge 0$ and time interval $T > 0$. This equation is called the IRL Bellman equation. The following PI algorithm can be implemented by iterating on the IRL Bellman equation and updating the control policy.

Algorithm 6 On-Policy IRL Algorithm to Find the Solution of the HJB
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_{j+1}(x)$ using the Bellman equation
$$V_{j+1}(x(t)) = \int_t^{t+T}\big(Q(x) + u_j^TRu_j\big)d\tau + V_{j+1}(x(t+T));$$
on convergence, set $V_{j+1}(x) = V_j(x)$.
4: Update the control policy $u_{j+1}$ using
$$u_{j+1}(t) = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}(x)}{\partial x}.$$
5: Go to 3
6: end procedure

The IRL algorithm (Algorithm 6) is online and does not require knowledge of the drift dynamics. To implement Step 3 of Algorithm 6, a neural-network-type structure, similar to (6) and (7), is used in [67] to approximate the value function. This algorithm is a sequential RL algorithm in the sense that the actor (policy improvement) and the critic (policy evaluation) are updated sequentially. Synchronous update laws for actor and critic were first introduced in [36] to update both actor and critic simultaneously while assuring system stability. Later, in [68] and [69], the synchronous actor-critic structure was augmented with system identification to avoid complete knowledge of the system dynamics.

2) On-Policy IRL With Experience Replay Learning Technique [70]–[72]: To speed up convergence and obtain an easy-to-check condition for the convergence of the IRL algorithm, recent transition samples are stored and repeatedly presented to the gradient-based update rule. A similar condition was proposed in [73] for adaptive control systems. This is a gradient-descent algorithm that not only minimizes the instantaneous temporal difference (TD) error but also minimizes the TD errors for the stored transition samples. Assume now that the value function $V(x)$ in the IRL Bellman equation (22) can be uniformly approximated as

$$\hat V(x) = \hat W_c^T\phi_1(x), \quad \forall x$$

where $\phi_1(x):\mathbb R^n\to\mathbb R^N$ is the basis function vector and $N$ is the number of basis functions. Therefore, the approximate IRL Bellman equation becomes

$$e_B = \Delta\phi_1(t)^T\hat W_c + \int_{t-T}^{t}\big(Q(x) + \hat u^TR\hat u\big)d\tau$$

where $\Delta\phi_1(t) = \phi_1(t) - \phi_1(t-T)$ and $e_B$ is the TD error after using the current critic approximator weights. To collect data in the history stack, consider $\Delta\phi_1(t_j)$ as the evaluated value of $\Delta\phi_1$ at the recorded time $t_j$. Then, define the Bellman equation error (TD error) at the recorded time $t_j$ using the current critic weight estimate $\hat W_c$ as

$$(e_B)_j = \Delta\phi_1(t_j)^T\hat W_c + \int_{t_j-T}^{t_j}\big(Q(x(\tau)) + \hat u^T(\tau)R\hat u(\tau)\big)d\tau.$$

The experience-replay-based gradient-descent algorithm for the critic NN is now given as

$$\dot{\hat W}_c = -\alpha_c\frac{\Delta\phi_1(t)}{\big(\Delta\phi_1(t)^T\Delta\phi_1(t)+1\big)^2}e_B - \alpha_c\sum_{j=1}^{l}\frac{\Delta\phi_1(t_j)}{\big(\Delta\phi_1(t_j)^T\Delta\phi_1(t_j)+1\big)^2}(e_B)_j.$$

The first term is a gradient update law for the instantaneous TD error, and the last term minimizes the TD errors for the samples stored in the history stack.

3) Event-Triggered On-Policy RL [74]: The event-triggered version of the optimal controller uses the sampled state information instead of the true one, and (21) becomes

$$u^\star(\hat x_i) = -\frac{1}{2}R^{-1}g^T(\hat x_i)\frac{\partial V^\star(\hat x_i)}{\partial x}, \quad \forall t\in(r_{i-1},r_i],\ i\in\mathbb N \qquad (23)$$

where $r_i$ is the $i$th consecutive sampling instant and $\hat x_i = x(r_i)$. Using the event-triggered controller (23), the HJB equation (20) becomes, $\forall x,\hat x_i\in\mathbb R^n$,

$$Q(x) + \frac{\partial V^{\star T}}{\partial x}\Big(f(x) - \frac{1}{2}g(x)R^{-1}g^T(\hat x_i)\frac{\partial V^\star(\hat x_i)}{\partial x}\Big) + \frac{1}{4}\frac{\partial V^{\star T}(\hat x_i)}{\partial x}g(\hat x_i)R^{-1}g^T(\hat x_i)\frac{\partial V^\star(\hat x_i)}{\partial x} = \big(u^\star(\hat x_i) - u_c^\star\big)^TR\big(u^\star(\hat x_i) - u_c^\star\big) \qquad (24)$$

where $u_c^\star$ is given by (21).

To solve the event-triggered HJB equation (24), the value function is approximated on a compact set $\Omega$ as

$$V^\star(x) = W_c^{\star T}\phi(x) + \epsilon_c(x), \quad \forall x\in\mathbb R^n \qquad (25)$$

where $\phi(x):\mathbb R^n\to\mathbb R^k$ is the basis function vector, $k$ is the number of basis functions, and $\epsilon_c(x)$ is the approximation error. Based on this, the optimal event-triggered controller in (23) can be rewritten as

$$u^\star(\hat x_i) = -\frac{1}{2}R^{-1}g^T(\hat x_i)\Big(\frac{\partial\phi(\hat x_i)^T}{\partial x}W_c^\star + \frac{\partial\epsilon_c(\hat x_i)}{\partial x}\Big), \quad t\in(r_{i-1},r_i]. \qquad (26)$$

The optimal event-triggered controller (26) can be approximated by the actor, for all $t\in(r_{i-1},r_i]$, as

$$u^\star(\hat x_i) = W_u^{\star T}\phi_u(\hat x_i) + \epsilon_u(\hat x_i), \quad \forall\hat x_i,\ i\in\mathbb N \qquad (27)$$

where $\phi_u(\hat x_i):\mathbb R^n\to\mathbb R^h$ is the basis function vector, $h$ is the number of basis functions, and $\epsilon_u(\hat x_i)$ is the approximation error.

The value function (25) and the optimal policy (27), using the current estimates $\hat W_c$ and $\hat W_u$ of the ideal weights $W_c^\star$ and $W_u^\star$, are given by the following critic and actor approximators:

$$\hat V(x) = \hat W_c^T\phi(x), \quad \forall x, \qquad \hat u(\hat x_i) = \hat W_u^T\phi_u(\hat x_i), \quad \forall\hat x_i.$$

The Bellman error is defined as

$$e_c = \hat W_c^T\frac{\partial\phi}{\partial x}\big(f(x) + g(x)\hat u(\hat x_i)\big) + r(x,\hat u)$$

with $r(x,\hat u) = Q(x) + \hat u(\hat x_i)^TR\hat u(\hat x_i)$. The weights $\hat W_c$ are tuned to minimize $K = \tfrac12 e_c^Te_c$ as

$$\dot{\hat W}_c = -\alpha_c\frac{\partial K}{\partial\hat W_c} = -\alpha_c\frac{w}{(w^Tw+1)^2}\big(w^T\hat W_c + r(x,\hat u)\big)$$

with $w = (\partial\phi/\partial x)\big(f(x) + g(x)\hat u(\hat x_i)\big)$.

In order to find the update law for the actor approximator, the following error is defined:

$$e_u = \hat W_u^T\phi_u(\hat x_i) + \frac{1}{2}R^{-1}g^T(\hat x_i)\frac{\partial\phi(\hat x_i)^T}{\partial x}\hat W_c, \quad \forall\hat x_i.$$

The weights $\hat W_u$ are tuned to minimize $E_u = \tfrac12 e_u^Te_u$ as

$$\dot{\hat W}_u = 0, \quad t\in(r_{i-1},r_i]$$

with the jump equation to compute $\hat W_u^+(r_i)$ given by

$$\hat W_u^+ = \hat W_u(t) - \alpha_u\phi_u(x(t))\Big(\hat W_u^T\phi_u(x(t)) + \frac{1}{2}R^{-1}g^T(x(t))\frac{\partial\phi(x(t))^T}{\partial x}\hat W_c\Big)^T, \quad t = r_i.$$

The convergence and stability are proved in [74]. The work of [75] extended [74] to systems with input constraints and without requiring knowledge of the system dynamics. Event-based RL approaches are also presented in [76]–[78] for interconnected systems.

4) Off-Policy IRL [79], [80]: In order to develop the off-policy IRL algorithm, the system dynamics (19) is rewritten as

$$\dot x(t) = f(x(t)) + g(x(t))u_j(t) + g(x(t))\big(u(t) - u_j(t)\big) \qquad (28)$$

where $u_j(t)$ is the policy to be updated. By contrast, $u(t)$ is the behavior policy that is actually applied to the system dynamics to generate the data for learning.

Differentiating $V_j(x)$ along the system dynamics (28) and using $u_{j+1}(t) = -\tfrac12 R^{-1}g^T(x)\,\partial V_j(x)/\partial x$ give

$$\dot V_j = \frac{\partial V_j^T(x)}{\partial x}\big(f + gu_j\big) + \frac{\partial V_j^T(x)}{\partial x}g\big(u - u_j\big) = -Q(x) - u_j^TRu_j - 2u_{j+1}^TR\big(u - u_j\big).$$

Integrating both sides of the above equation yields the off-policy IRL Bellman equation

$$V_j(x(t+T)) - V_j(x(t)) = \int_t^{t+T}\big(-Q(x) - u_j^TRu_j - 2u_{j+1}^TR(u - u_j)\big)d\tau. \qquad (29)$$

Iterating on the IRL Bellman equation (29) yields the following off-policy IRL algorithm.

Algorithm 7 Off-Policy IRL Algorithm to Find the Solution of the HJB
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_j$ and $u_{j+1}$ using the off-policy Bellman equation (29); on convergence, set $V_{j+1} = V_j$.
4: Go to 3
5: end procedure

Remark 4: Note that for a fixed control policy $u(t)$ (the policy that is applied to the system), the off-policy IRL Bellman equation (29) can be solved for both the value function $V_j$ and the updated policy $u_{j+1}$ simultaneously, without requiring any knowledge of the system dynamics.

To implement the off-policy IRL algorithm (Algorithm 7), an actor-critic structure, similar to (6) and (7), is used to approximate the value function and control policy [79].

For a linear system, the off-policy IRL Bellman equation (see [81]) is given as

$$x^T(t+T)P_jx(t+T) - x^T(t)P_jx(t) = \int_t^{t+T}\big(-x^TQx - u_j^TRu_j - 2u_{j+1}^TR(u - u_j)\big)d\tau. \qquad (30)$$

Iterating on the Bellman equation (30) yields the off-policy IRL algorithm for linear systems.

5) Robust Off-Policy IRL [79]: Consider the following uncertain nonlinear system:

$$\dot x(t) = f(x(t)) + g(x(t))\big[u(t) + \Delta(x,w)\big], \qquad \dot w(t) = \Delta_w(x,w) \qquad (31)$$

where $x(t)\in\mathbb R^n$ is the measured component of the state available for feedback control, $w(t)\in\mathbb R^p$ is the unmeasurable part of the state with unknown order $p$, $u(t)$ is the control input, $\Delta_w\in\mathbb R^p$ and $\Delta\in\mathbb R$ are unknown locally Lipschitz functions, and $f(x(t))\in\mathbb R^n$ and $g(x(t))\in\mathbb R^n$ are the drift dynamics and input dynamics, respectively. In order to develop the robust off-policy IRL algorithm, the system dynamics (31) is rewritten as

$$\dot x(t) = f(x(t)) + g(x(t))u_j(t) + g(x(t))v_j(t)$$

where $v_j = u + \Delta - u_j$.

Differentiating $V_j(x)$ along (31) and using $u_{j+1}(t) = -\tfrac12 R^{-1}g^T(x)\,\partial V_j(x)/\partial x$ give

$$\dot V_j = \frac{\partial V_j^T(x)}{\partial x}\big(f + gu_j + gv_j\big) = -Q(x) - u_j^TRu_j - 2u_{j+1}^TRv_j.$$

Integrating both sides of the above equation yields the robust off-policy IRL Bellman equation

$$V_j(x(t+T)) - V_j(x(t)) = \int_t^{t+T}\big(-Q(x) - u_j^TRu_j - 2u_{j+1}^TRv_j\big)d\tau. \qquad (32)$$

Using an actor-critic structure, similar to (6) and (7), for $V_j$ and $u_{j+1}$ in (32) yields

$$\sum_{i=1}^{N_c}\hat w_{cj}^i\big[\phi_{ci}(x(t+T)) - \phi_{ci}(x(t))\big] = \int_t^{t+T}\Big(-Q(x) - \hat u_j^TR\hat u_j - 2\Big(\sum_{i=1}^{N_a}\hat w_{aj}^i\sigma_{ai}(x)\Big)^TRv_j\Big)d\tau. \qquad (33)$$

Iterating on the IRL Bellman equation (33) provides the following robust off-policy IRL algorithm.

Algorithm 8 Robust Off-Policy IRL Algorithm
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_j$ and $u_{j+1}$ using the off-policy Bellman equation (33); on convergence, set $V_{j+1} = V_j$.
4: Go to 3
5: end procedure

C. Optimal Tracking Problem

The goal here is to design an optimal control input to make the states of the system $x(t)$ track a desired reference trajectory $x_d(t)$. Define the tracking error as

$$e(t) = x(t) - x_d(t).$$

Similar to the DT case, standard techniques find the feedback and feedforward parts of the control input separately using complete knowledge of the system dynamics. In [82] and [83], a new formulation is developed that gives both the feedback and feedforward parts of the control input simultaneously and thus enables RL algorithms to solve tracking problems without requiring complete knowledge of the system dynamics.

Assume that the reference trajectory is generated by the command generator model

$$\dot x_d(t) = h_d(x_d(t))$$

where $x_d(t)\in\mathbb R^n$. An augmented system can be constructed in terms of the tracking error $e(t)$ and the reference trajectory $x_d(t)$ as

$$\dot X(t) = \begin{bmatrix}\dot e(t)\\ \dot x_d(t)\end{bmatrix} = \begin{bmatrix}f(e(t)+x_d(t)) - h_d(x_d(t))\\ h_d(x_d(t))\end{bmatrix} + \begin{bmatrix}g(e(t)+x_d(t))\\ 0\end{bmatrix}u(t) \equiv F(X(t)) + G(X(t))u(t) \qquad (34)$$

where the augmented state is

$$X(t) = \begin{bmatrix}e(t)\\ x_d(t)\end{bmatrix}.$$

The cost functional is defined as

$$J(x(0),x_d(0),u(t)) = \int_0^{\infty}e^{-\eta\tau}\big((x(\tau)-x_d(\tau))^TQ(x(\tau)-x_d(\tau)) + u^T(\tau)Ru(\tau)\big)d\tau$$

where $Q\ge 0$ and $R = R^T\succ 0$. The value function in terms of the states of the augmented system is

$$V(X(t)) = \int_t^{\infty}e^{-\eta(\tau-t)}r(X,u)\,d\tau = \int_t^{\infty}e^{-\eta(\tau-t)}\big(X^T(\tau)Q_TX(\tau) + u^T(\tau)Ru(\tau)\big)d\tau \qquad (35)$$

where

$$Q_T = \begin{bmatrix}Q & 0\\ 0 & 0\end{bmatrix}$$

and $\eta\ge 0$ is the discount factor.

A differential equivalent to this is the Bellman equation

$$r(X,u) - \eta V + \frac{\partial V^T}{\partial X}\big(F(X) + G(X)u\big) = 0, \qquad V(0) = 0$$

and the Hamiltonian is given by

$$H\Big(X,u,\frac{\partial V}{\partial X}\Big) = r(X,u) - \eta V + \frac{\partial V^T}{\partial X}\big(F(X) + G(X)u\big).$$

The optimal value is given by

$$r(X,u^\star) - \eta V^\star + \frac{\partial V^{\star T}}{\partial X}\big(F(X) + G(X)u^\star\big) = 0$$

which is just the CT HJB tracking equation. The optimal control is then given as

$$u^\star(t) = \arg\min_u\Big[r(X,u) - \eta V^\star + \frac{\partial V^{\star T}}{\partial X}\big(F(X) + G(X)u\big)\Big] = -\frac{1}{2}R^{-1}G^T(X)\frac{\partial V^\star}{\partial X}.$$
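For a linear plant and a linear command generator, the discounted tracking HJB above reduces to a discounted ARE in the augmented state. A minimal model-based sketch is given below (assumed matrices, and using the standard shift of the augmented drift matrix by η/2 so that an ordinary CARE solver can be reused); the RL algorithms of the next subsection recover the same gain from data.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Assumed CT linear plant xdot = A x + B u tracking a sinusoidal reference
# generated by xd_dot = F xd (linear command generator model).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
F = np.array([[0.0, 1.0], [-1.0, 0.0]])        # harmonic-oscillator reference
Q = np.diag([20.0, 1.0])
R = np.array([[1.0]])
eta = 0.2                                       # discount factor (see (35))

n = A.shape[0]
# Augmented dynamics (34) for X = [e; xd]:  edot = A e + (A - F) xd + B u.
T = np.block([[A, A - F], [np.zeros((n, n)), F]])
B1 = np.vstack([B, np.zeros((n, 1))])
QT = np.block([[Q, np.zeros((n, n))], [np.zeros((n, n)), np.zeros((n, n))]])

# Discounted tracking ARE  T'P + PT - eta*P + QT - P B1 R^{-1} B1' P = 0
# is an ordinary CARE for the shifted matrix T - (eta/2) I.
P = solve_continuous_are(T - 0.5 * eta * np.eye(2 * n), B1, QT, R)
K = np.linalg.solve(R, B1.T @ P)                # u* = -R^{-1} G^T dV*/dX = -K X
print("Tracking gain on [e; xd]:\n", K)
```

Without the discount (eta = 0), the uncontrollable reference modes sit on the imaginary axis and the ARE has no stabilizing solution, which mirrors the role of the discount factor discussed for the DT tracker in Remark 2.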

D. Approximate Solution Using RL

Using the value function (35), and in a manner similar to the on-policy and off-policy IRL algorithms for optimal regulation, the following on-policy and off-policy IRL algorithms are developed to find the optimal solution of the CT HJB tracking equation.

1) On-Policy IRL: Similar to Algorithm 6 for optimal regulation, the following on-policy IRL algorithm is presented in [83] to solve the CT HJB tracking equation online and without requiring knowledge of the internal system dynamics.

Algorithm 9 On-Policy IRL Algorithm to Find the Solution of the Tracking HJB
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_{j+1}(X)$ using the Bellman equation
$$V_{j+1}(X(t)) = \int_t^{t+T}e^{-\eta(\tau-t)}\big(X^T(\tau)Q_TX(\tau) + u_j^T(\tau)Ru_j(\tau)\big)d\tau + e^{-\eta T}V_{j+1}(X(t+T));$$
on convergence, set $V_{j+1}(X) = V_j(X)$.
4: Update the control policy $u_{j+1}(t)$ using
$$u_{j+1}(t) = -\frac{1}{2}R^{-1}G^T(X)\frac{\partial V_{j+1}(X)}{\partial X}.$$
5: Go to 3
6: end procedure

The IRL algorithm (Algorithm 9) is an online algorithm and does not require knowledge of the internal system dynamics. In [83], an actor-critic structure, similar to (6) and (7), is used to implement Algorithm 9. In [82], the quadratic form of the value function, $X^T(t)PX(t)$, is used in the IRL Bellman equation for linear systems to find the solution of the LQT ARE.

2) Off-Policy IRL: Similar to off-policy IRL for optimal regulation, for the augmented system (34) and the discounted value function (35), the following off-policy IRL algorithm is developed in [84] to avoid the requirement of knowledge of the system dynamics.

Algorithm 10 Off-Policy IRL Algorithm to Find the Solution of the Tracking HJB
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$, solve for the value $V_{j+1}$ and $u_{j+1}$ using the off-policy Bellman equation
$$e^{-\eta T}V_{j+1}(X(t+T)) - V_{j+1}(X(t)) = \int_t^{t+T}e^{-\eta(\tau-t)}\big(-Q(X) - u_j^TRu_j - 2u_{j+1}^TR(u - u_j)\big)d\tau;$$
on convergence, set $V_{j+1} = V_j$.
4: Go to 3
5: end procedure

To implement the off-policy IRL algorithm (Algorithm 10), an actor-critic structure, similar to (6) and (7), is used to approximate the value function and control policy [84].

Remark 5: Most of the existing solutions to optimal control problems are presented for affine systems. Extensions of RL algorithms to nonaffine systems are considered in [85] for optimal regulation and in [86] for optimal tracking control problems.

E. H∞ Control of CT Systems

The CT system dynamics in the presence of a disturbance input is considered as

$$\dot x(t) = f(x(t)) + g(x(t))u(t) + h(x(t))d(t) \qquad (36)$$

where $x(t)\in\mathbb R^n$ is a measurable state vector, $u(t)\in\mathbb R^m$ is the control input, $d(t)\in\mathbb R^q$ is the disturbance input, $f(x(t))\in\mathbb R^n$ is the drift dynamics, $g(x(t))\in\mathbb R^{n\times m}$ is the input dynamics, and $h(x(t))\in\mathbb R^{n\times q}$ is the disturbance input dynamics. The performance index to be optimized is defined as

$$J(x(0),u,d) = \int_0^{\infty}\big(Q(x) + u^TRu - \beta^2d^Td\big)d\tau$$

where $Q(x)\ge 0$, $R = R^T\succ 0$, and $\beta\ge\beta^\star\ge 0$ with $\beta^\star$ the smallest $\beta$ such that the system is stabilized. The value function can be defined as

$$V(x(t),u,d) = \int_t^{\infty}\big(Q(x) + u^TRu - \beta^2d^Td\big)d\tau.$$

A differential equivalent to this is the Bellman equation

$$Q(x) + u^TRu - \beta^2d^Td + \frac{\partial V^T}{\partial x}\big(f(x) + g(x)u + h(x)d\big) = 0, \qquad V(0) = 0$$

and the Hamiltonian is given by

$$H\Big(x,u,d,\frac{\partial V}{\partial x}\Big) = Q(x) + u^TRu - \beta^2d^Td + \frac{\partial V^T}{\partial x}\big(f(x) + g(x)u + h(x)d\big).$$

Define the two-player zero-sum differential game as

$$V^\star(x(t)) = \min_u\max_d J(x(0),u,d)$$

subject to (36). In order to solve this zero-sum game, one needs to solve the following HJI equation:

$$Q(x) + (u^\star)^TRu^\star - \beta^2(d^\star)^Td^\star + \frac{\partial V^{\star T}}{\partial x}\big(f(x) + g(x)u^\star + h(x)d^\star\big) = 0.$$

Given a solution to this equation, one has

$$u^\star = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V^\star}{\partial x}, \qquad d^\star = \frac{1}{2\beta^2}h^T(x)\frac{\partial V^\star}{\partial x}.$$
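To make the HJI concrete, the following scalar sketch (all values assumed for illustration) computes the quadratic value $V^\star = px^2$, the optimal control $u^\star$, and the worst-case disturbance $d^\star$ in closed form and verifies that they satisfy the HJI above.

```python
import numpy as np

# Scalar CT example  xdot = a x + b u + h d,  running cost  q x^2 + r u^2 - beta^2 d^2.
a, b, h = 1.0, 1.0, 1.0
q, r, beta = 1.0, 1.0, 2.0

# With V*(x) = p x^2 the HJI reduces to the scalar quadratic
#   (h^2/beta^2 - b^2/r) p^2 + 2 a p + q = 0.
c2 = h**2 / beta**2 - b**2 / r
roots = np.roots([c2, 2 * a, q])

# Pick the positive root that makes the saddle-point closed loop stable.
def closed_loop(p):
    return a - (b**2 / r) * p + (h**2 / beta**2) * p

p = next(pp.real for pp in roots if pp.real > 0 and closed_loop(pp.real) < 0)

Ku = b * p / r            # u* = -(1/2) r^{-1} b dV*/dx = -Ku x
Kd = h * p / beta**2      # d* =  (1/(2 beta^2)) h dV*/dx =  Kd x
print(f"p = {p:.4f},  u* = -{Ku:.4f} x,  d* = {Kd:.4f} x")
print("HJI residual:", q + r * Ku**2 - beta**2 * Kd**2 + 2 * p * closed_loop(p))
```

The printed residual is the HJI left-hand side divided by x^2 and should be numerically zero; reducing beta toward the attenuation limit makes the quadratic lose its stabilizing root, which is the role of beta* above.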

F. Approximate Solution Using RL

The following PI algorithms can be used to approximate the HJI solution by iterating on the Bellman equation.

1) On-Policy IRL: In [87], an equivalent formulation of the Bellman equation that does not involve the dynamics is found to be

$$V(x(t-T)) = \int_{t-T}^{t}r(x(\tau),u(\tau),d(\tau))\,d\tau + V(x(t))$$

for any time $t\ge 0$ and time interval $T > 0$. This equation is called the IRL Bellman equation.

Algorithm 11 On-Policy IRL for Regulation in Zero-Sum Games (H∞ Control)
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$
4: for $i = 0,1,\ldots$, set $d_j^0 = 0$ and solve for the value $V_j^i(x)$ using Bellman's equation
$$Q(x) + \frac{\partial V_j^{iT}}{\partial x}\big(f(x) + g(x)u_j + h(x)d_j^i\big) + u_j^TRu_j - \beta^2(d_j^i)^Td_j^i = 0, \qquad V_j^i(0) = 0,$$
$$d_j^{i+1} = \frac{1}{2\beta^2}h^T(x)\frac{\partial V_j^i}{\partial x};$$
on convergence, set $V_{j+1}(x) = V_j^i(x)$.
5: Update the control policy $u_{j+1}$ using
$$u_{j+1} = -\frac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}}{\partial x}.$$
6: Go to 3
7: end procedure

The PI algorithm (Algorithm 11) is an offline algorithm and requires complete knowledge of the system dynamics. An actor-critic structure as in (6) and (7) can be used to implement an online PI algorithm that simultaneously updates the value function and the policies and does not require knowledge of the drift dynamics. Various update laws are developed for learning the actor, critic, and disturbance approximators' weights by minimizing the Bellman error [88]–[90].

2) Off-Policy IRL [80]: In order to develop the off-policy IRL algorithm for H∞ control, the system dynamics (36) is rewritten as

$$\dot x(t) = f(x(t)) + g(x(t))u_j(t) + g(x(t))\big(u(t) - u_j(t)\big) + h(x(t))d_j(t) + h(x(t))\big(d(t) - d_j(t)\big)$$

where $u_j(t)$ and $d_j(t)$ are the policies to be updated. By contrast, $u(t)$ and $d(t)$ are the behavior policies that are actually applied to the system dynamics to generate the data for learning.

Differentiating $V_j(x)$ along the above system dynamics and using $u_{j+1}(t) = -\tfrac12 R^{-1}g^T(x)\,\partial V_j(x)/\partial x$ and $d_{j+1}(t) = \tfrac{1}{2\beta^2}h^T(x)\,\partial V_j(x)/\partial x$ give

$$\dot V_j = \frac{\partial V_j^T(x)}{\partial x}\big(f + gu_j + hd_j\big) + \frac{\partial V_j^T(x)}{\partial x}g(u - u_j) + \frac{\partial V_j^T(x)}{\partial x}h(d - d_j) = -Q(x) - u_j^TRu_j + \beta^2d_j^Td_j - 2u_{j+1}^TR(u - u_j) + 2\beta^2d_{j+1}^T(d - d_j).$$

Integrating both sides of the above equation yields the H∞ off-policy IRL Bellman equation

$$V_j(x(t+T)) - V_j(x(t)) = \int_t^{t+T}\big(-Q(x) - u_j^TRu_j + \beta^2d_j^Td_j - 2u_{j+1}^TR(u - u_j) + 2\beta^2d_{j+1}^T(d - d_j)\big)d\tau. \qquad (37)$$

Iterating on the IRL Bellman equation (37) yields the following off-policy IRL algorithm.

Algorithm 12 Off-Policy IRL Algorithm to Find the Solution of the HJI
1: procedure
2: Given admissible policy $u_0$
3: for $j = 0,1,\ldots$, given $u_j$ and $d_j$, solve for the value $V_j$, $u_{j+1}$, and $d_{j+1}$ using the off-policy Bellman equation (37); on convergence, set $V_{j+1} = V_j$.
4: Go to 3
5: end procedure

Note that for a fixed control policy $u(t)$ and the actual disturbance (the policies that are applied to the system), the off-policy IRL Bellman equation (37) can be solved for the value function $V_j$, the learned control policy $u_{j+1}$, and the learned disturbance policy $d_{j+1}$ simultaneously, without requiring any knowledge of the system dynamics.

To implement the off-policy IRL algorithm (Algorithm 12), an actor-critic structure, similar to (6) and (7), is used to approximate the value function, control, and disturbance policies [80].

In [88], an off-policy IRL algorithm is presented to find the solution of the optimal tracking control problem without requiring any knowledge of the system dynamics or the reference trajectory dynamics.

G. Nash Games

In this section, online RL-based solutions to noncooperative games, played among N agents, are considered. Consider the N-player nonlinear time-invariant differential game

$$\dot x(t) = f(x(t)) + \sum_{j=1}^{N}g_j(x(t))u_j(t)$$

where $x\in\mathbb R^n$ is a measurable state vector, $u_j(t)\in\mathbb R^{m_j}$ are the control inputs, $f(x)\in\mathbb R^n$ is the drift dynamics, and $g_j(x)\in\mathbb R^{n\times m_j}$ is the input dynamics. It is assumed that $f(0) = 0$, that $f(x) + \sum_{j=1}^{N}g_j(x)u_j$ is locally Lipschitz, and that the system is stabilizable.

The cost functional associated with each player is defined as

$$J_i(x(0),u_1,u_2,\ldots,u_N) = \int_0^{\infty}r_i(x,u_1,\ldots,u_N)\,dt \equiv \int_0^{\infty}\Big(Q_i(x) + \sum_{j=1}^{N}u_j^TR_{ij}u_j\Big)dt, \quad \forall i\in\mathcal N := \{1,2,\ldots,N\}$$

where $Q_i(\cdot)\ge 0$, $R_{ii} = R_{ii}^T\succ 0$, $\forall i\in\mathcal N$, and $R_{ij} = R_{ij}^T\succeq 0$, $\forall j\ne i\in\mathcal N$.

The value for each player can be defined as

$$V_i(x,u_1,u_2,\ldots,u_N) = \int_t^{\infty}r_i(x,u_1,\ldots,u_N)\,d\tau, \quad \forall i\in\mathcal N,\ \forall x,u_1,u_2,\ldots,u_N.$$

Differential equivalents to each value function are given by the following Bellman equations:

$$r_i(x,u_1,\ldots,u_N) + \frac{\partial V_i^T}{\partial x}\Big(f(x) + \sum_{j=1}^{N}g_j(x)u_j\Big) = 0, \qquad V_i(0) = 0, \quad \forall i\in\mathcal N.$$

The Hamiltonian functions are defined as

$$H_i\Big(x,u_1,\ldots,u_N,\frac{\partial V_i}{\partial x}\Big) = r_i(x,u_1,\ldots,u_N) + \frac{\partial V_i^T}{\partial x}\Big(f(x) + \sum_{j=1}^{N}g_j(x)u_j\Big), \quad \forall i\in\mathcal N.$$

By applying the stationarity conditions, the associated feedback control policies are given by

$$u_i^\star = \arg\min_{u_i}H_i\Big(x,u_1,\ldots,u_N,\frac{\partial V_i^\star}{\partial x}\Big) = -\frac{1}{2}R_{ii}^{-1}g_i^T(x)\frac{\partial V_i}{\partial x}, \quad \forall i\in\mathcal N.$$

After substituting the feedback control policies into the Hamiltonians, one has the coupled Hamilton–Jacobi (HJ) equations

$$0 = Q_i(x) + \frac{1}{4}\sum_{j=1}^{N}\frac{\partial V_j^T}{\partial x}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^T(x)\frac{\partial V_j}{\partial x} + \frac{\partial V_i^T}{\partial x}\Big(f(x) - \frac{1}{2}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^T(x)\frac{\partial V_j}{\partial x}\Big), \quad \forall i\in\mathcal N. \qquad (38)$$

H. Approximate Solution Using RL

A PI algorithm is now developed by iterating on the Bellman equations to approximate the solution to the coupled HJ equations.

Algorithm 13 PI for Regulation in Nonzero-Sum Games
1: procedure
2: Given an N-tuple of admissible policies $\mu_i^{(0)}$, $\forall i\in\mathcal N$
3: while $\|V_i^{(k)} - V_i^{(k-1)}\|\ge\epsilon_{iac}$, $\forall i\in\mathcal N$, do
4: Solve for the N-tuple of costs $V_i^k(x)$ using the coupled Bellman equations
$$Q_i(x) + \frac{\partial V_i^{kT}}{\partial x}\Big(f(x) + \sum_{j=1}^{N}g_j(x)\mu_j^k\Big) + \sum_{j=1}^{N}\mu_j^{kT}R_{ij}\mu_j^k = 0, \qquad V_i^k(0) = 0, \quad \forall i\in\mathcal N.$$
5: Update the N-tuple of control policies $\mu_i^{k+1}$, $\forall i\in\mathcal N$, using
$$\mu_i^{k+1} = -\frac{1}{2}R_{ii}^{-1}g_i^T(x)\frac{\partial V_i^k}{\partial x}.$$
6: $k = k + 1$
7: end while
8: end procedure

1) On-Policy IRL: It is obvious that (38) requires complete knowledge of the system dynamics. In [87], an equivalent formulation of the coupled IRL Bellman equations that does not involve the dynamics is given as

$$V_i(x(t-T)) = \int_{t-T}^{t}r_i(x(\tau),u_1(\tau),\ldots,u_N(\tau))\,d\tau + V_i(x(t)), \quad \forall i\in\mathcal N$$

for any time $t\ge 0$ and time interval $T > 0$.

Let the value functions $V_i^\star$ be approximated on a compact set $\Omega$ as

$$V_i^\star(x) = W_i^T\phi_i(x) + \epsilon_i(x), \quad \forall x,\ \forall i\in\mathcal N$$

where $\phi_i(x):\mathbb R^n\to\mathbb R^{K_i}$, $\forall i\in\mathcal N$, are the basis function vectors, $K_i$, $\forall i\in\mathcal N$, is the number of basis functions, and $\epsilon_i(x)$ is the approximation error.

Assuming current weight estimates $\hat W_{ic}$, $\forall i\in\mathcal N$, the outputs of the critics are given by

$$\hat V_i(x) = \hat W_{ic}^T\phi_i(x), \quad \forall x,\ \forall i\in\mathcal N.$$

Using a procedure similar to that used for zero-sum games, the update laws can be written as

$$\dot{\hat W}_{ic} = -\alpha_i\frac{\Delta\phi_i(t)}{\big(\Delta\phi_i(t)^T\Delta\phi_i(t)+1\big)^2}\Big(\Delta\phi_i(t)^T\hat W_{ic} + \int_{t-T}^{t}\Big(Q_i(x) + \hat u_i^TR_{ii}\hat u_i + \sum_{j\ne i}\hat u_j^TR_{ij}\hat u_j\Big)d\tau\Big), \quad \forall i\in\mathcal N$$

and

$$\dot{\hat W}_{iu} = -\alpha_{iu}\Big(F_i\hat W_{iu} - L_i\,\Delta\phi_i(t)^T\hat W_{ic} - \frac{1}{4}\sum_{j=1}^{N}\frac{\partial\phi_i}{\partial x}g_i(x)R_{ii}^{-1}R_{ij}R_{ii}^{-1}g_i^T(x)\frac{\partial\phi_i^T}{\partial x}\,\hat W_{iu}\,\frac{\Delta\phi_i(t)^T}{\big(\Delta\phi_i(t)^T\Delta\phi_i(t)+1\big)^2}\hat W_{jc}\Big), \quad \forall i\in\mathcal N$$

respectively, with $\Delta\phi_i(t) = \phi_i(t) - \phi_i(t-T)$.

In [91], a model-free solution to nonzero-sum Nash games is presented. To obviate the requirement of complete knowledge of the system dynamics, system identification is used in [92].
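A minimal sketch of the policy-iteration structure of Algorithm 13 for a two-player linear-quadratic nonzero-sum game is given below (assumed matrices; the initial policies must be jointly stabilizing, and convergence of this coupled iteration is not guaranteed in general). Policy evaluation solves two coupled Lyapunov equations, and policy improvement updates each gain from its own value matrix.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Assumed two-player LQ nonzero-sum game: xdot = A x + B1 u1 + B2 u2.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[1.0], [0.0]])
Q1, Q2 = np.diag([2.0, 1.0]), np.diag([1.0, 2.0])
R11, R22 = np.array([[1.0]]), np.array([[1.0]])
R12, R21 = 0.5 * np.eye(1), 0.5 * np.eye(1)      # cross-weighting terms

# Admissible initial policies u_i = -K_i x (A - B1 K1 - B2 K2 is Hurwitz).
K1 = np.array([[0.0, 1.0]])
K2 = np.array([[1.0, 0.0]])

for k in range(100):
    Ac = A - B1 @ K1 - B2 @ K2
    # Policy evaluation (coupled Bellman/Lyapunov equations, cf. Algorithm 13):
    # Ac' Pi + Pi Ac + Qi + Ki' Rii Ki + Kj' Rij Kj = 0.
    P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
    P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K2.T @ R22 @ K2 + K1.T @ R21 @ K1))
    # Policy improvement: u_i = -(1/2) Rii^{-1} Bi' dVi/dx = -Rii^{-1} Bi' Pi x.
    K1n = np.linalg.solve(R11, B1.T @ P1)
    K2n = np.linalg.solve(R22, B2.T @ P2)
    if max(np.linalg.norm(K1n - K1), np.linalg.norm(K2n - K2)) < 1e-9:
        K1, K2 = K1n, K2n
        break
    K1, K2 = K1n, K2n

# Residual of player 1's coupled ARE at the computed policies (small on convergence).
Ac = A - B1 @ K1 - B2 @ K2
res1 = Ac.T @ P1 + P1 @ Ac + Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2
print("Nash gains:\n", K1, "\n", K2, "\nresidual norm:", np.linalg.norm(res1))
```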
IV. GRAPHICAL GAMES

Interactions among agents are modeled by a fixed strongly connected graph $\mathcal{G} = (\mathcal{V}_{\mathcal{G}}, \mathcal{E}_{\mathcal{G}})$ defined by a finite set $\mathcal{V}_{\mathcal{G}} = \{n_{1}, \ldots, n_{N}\}$ of $N$ agents and a set of edges $\mathcal{E}_{\mathcal{G}} \subseteq \mathcal{V}_{\mathcal{G}} \times \mathcal{V}_{\mathcal{G}}$ that represent interagent information exchange links. The set of neighbors of a node $n_{i}$ is the set of nodes with edges incoming to $n_{i}$ and is denoted by $\mathcal{N}_{i} := \{n_{j} : (n_{j}, n_{i}) \in \mathcal{E}_{\mathcal{G}}\}$. The adjacency matrix is $\mathcal{A}_{\mathcal{G}} = [\alpha_{ij}]_{N \times N}$ with weights $\alpha_{ij} > 0$ if $(n_{j}, n_{i}) \in \mathcal{E}_{\mathcal{G}}$ and zero otherwise. The diagonal degree matrix $D$ of the graph $\mathcal{G}$ is defined as $D = \mathrm{diag}(d_{i})$ with weighted degree $d_{i} = \sum_{j \in \mathcal{N}_{i}} \alpha_{ij}$.

Let the agent dynamics be modeled as
$$\dot{x}_{i}(t) = A x_{i}(t) + B_{i} u_{i}(t), \quad x_{i}(0) = x_{i0}, \ t \geq 0$$
where $x_{i}(t) \in \mathbb{R}^{n}$ is a measurable state vector, $u_{i}(t) \in \mathbb{R}^{m_{i}}$, $\forall i \in \mathcal{N} := \{1, \ldots, N\}$, is each control input (or player), and $A \in \mathbb{R}^{n \times n}$ and $B_{i} \in \mathbb{R}^{n \times m_{i}}$, $\forall i \in \mathcal{N}$, are the plant and input matrices, respectively.

The leader node/exosystem dynamics are
$$\dot{x}_{0} = A x_{0}.$$

The agents in the network seek to cooperatively and asymptotically track the state of the leader node/exosystem, i.e., $x_{i}(t) \to x_{0}(t)$, $\forall i \in \mathcal{N}$, while simultaneously minimizing distributed cost functionals.

Define the neighborhood tracking error for every agent as
$$\delta_{i} := \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} (x_{i} - x_{j}) + g_{i} (x_{i} - x_{0}), \quad \forall i \in \mathcal{N} \quad (39)$$
where $g_{i}$ is the pinning gain, with $g_{i} \neq 0$ if agent $i$ is pinned to the leader node. The dynamics of (39) are given by
$$\dot{\delta}_{i} = A \delta_{i} + (d_{i} + g_{i}) B_{i} u_{i} - \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} B_{j} u_{j}, \quad \forall i \in \mathcal{N} \quad (40)$$
with $\delta_{i} \in \mathbb{R}^{n}$.

The cost functional associated with each agent $i \in \mathcal{N}$ has the following form:
$$J_{i}(\delta_{i}(0); u_{i}, u_{\mathcal{N}_{i}}) = \frac{1}{2} \int_{0}^{\infty} r_{i}(\delta_{i}, u_{i}, u_{\mathcal{N}_{i}})\, dt \equiv \frac{1}{2} \int_{0}^{\infty} \Big( \delta_{i}^{T} Q_{i} \delta_{i} + u_{i}^{T} R_{ii} u_{i} + \sum_{j \in \mathcal{N}_{i}} u_{j}^{T} R_{ij} u_{j} \Big) dt, \quad \forall i \in \mathcal{N}$$
with user-defined matrices $Q_{i} \succeq 0$, $R_{ii} \succ 0$, and $R_{ij} \succeq 0$, $\forall i, j \in \mathcal{N}$, of appropriate dimensions, where $(\sqrt{Q_{i}}, A)$, $\forall i \in \mathcal{N}$, are detectable.

It is desired to find a graphical Nash equilibrium [93] $u_{i}^{\star}$ for all agents $i \in \mathcal{N}$ in the sense that
$$J_{i}(\delta_{i}(0); u_{i}^{\star}, u_{\mathcal{N}_{i}}^{\star}) \leq J_{i}(\delta_{i}(0); u_{i}, u_{\mathcal{N}_{i}}^{\star}), \quad \forall u_{i}, \ \forall i \in \mathcal{N}.$$
This can be expressed by the following coupled distributed minimization problems:
$$J_{i}(\delta_{i}(0); u_{i}^{\star}, u_{\mathcal{N}_{i}}^{\star}) = \min_{u_{i}} J_{i}(\delta_{i}(0); u_{i}, u_{\mathcal{N}_{i}}^{\star}), \quad \forall i \in \mathcal{N}$$
with the dynamics given in (40).

The ultimate goal is to find the distributed optimal value functions $V_{i}^{\star}$, $\forall i \in \mathcal{N}$, defined by
$$V_{i}^{\star}(\delta_{i}(t)) = \min_{u_{i}} \frac{1}{2} \int_{t}^{\infty} \Big( \delta_{i}^{T} Q_{i} \delta_{i} + u_{i}^{T} R_{ii} u_{i} + \sum_{j \in \mathcal{N}_{i}} u_{j}^{T} R_{ij} u_{j} \Big) dt, \quad \forall t, \delta_{i}, \ \forall i \in \mathcal{N}. \quad (41)$$

One can define the Hamiltonians associated with each neighborhood tracking error dynamics (40) and each $V_{i}^{\star}$ given in (41) as follows:
$$H_{i}\Big(\delta_{i}, u_{i}, u_{\mathcal{N}_{i}}, \frac{\partial V_{i}^{\star}}{\partial \delta_{i}}\Big) = \Big( \frac{\partial V_{i}^{\star}}{\partial \delta_{i}} \Big)^{T} \Big( A\delta_{i} + (d_{i}+g_{i}) B_{i} u_{i} - \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} B_{j} u_{j} \Big) + \frac{1}{2} \delta_{i}^{T} Q_{i} \delta_{i} + \frac{1}{2} u_{i}^{T} R_{ii} u_{i} + \frac{1}{2} \sum_{j \in \mathcal{N}_{i}} u_{j}^{T} R_{ij} u_{j}, \quad \forall \delta_{i}, u_{i}, \ \forall i \in \mathcal{N}.$$

According to the stationarity condition, the optimal control for each $i \in \mathcal{N}$ is found to be
$$u_{i}^{\star}(\delta_{i}) = \arg\min_{u_{i}} H_{i}\Big(\delta_{i}, u_{i}, u_{\mathcal{N}_{i}}, \frac{\partial V_{i}^{\star}}{\partial \delta_{i}}\Big) = -(d_{i}+g_{i}) R_{ii}^{-1} B_{i}^{T} \frac{\partial V_{i}^{\star}}{\partial \delta_{i}}, \quad \forall \delta_{i} \quad (42)$$
which should satisfy the appropriate distributed coupled HJ equations
$$H_{i}\Big(\delta_{i}, u_{i}^{\star}, u_{\mathcal{N}_{i}}^{\star}, \frac{\partial V_{i}^{\star}}{\partial \delta_{i}}\Big) = 0, \quad \forall i \in \mathcal{N}.$$
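The neighborhood tracking error (39) and its dynamics (40) are straightforward to compute from the graph data. The sketch below does this for a small assumed graph; the adjacency weights, pinning gains, and system matrices are illustrative placeholders and are not taken from the paper.

import numpy as np

# Assumed 3-agent directed graph: alpha[i, j] > 0 means agent i receives information from agent j.
alpha = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
g_pin = np.array([1.0, 0.0, 0.0])      # only agent 0 is pinned to the leader
d = alpha.sum(axis=1)                  # weighted in-degrees d_i

A = np.array([[0.0, 1.0], [-1.0, 0.0]])           # common plant matrix (illustrative)
B = [np.array([[0.0], [1.0]]) for _ in range(3)]  # input matrices B_i

def neighborhood_errors(x, x0):
    """delta_i = sum_j alpha_ij (x_i - x_j) + g_i (x_i - x_0); zero weights drop non-neighbors."""
    N = len(x)
    return [sum(alpha[i, j] * (x[i] - x[j]) for j in range(N))
            + g_pin[i] * (x[i] - x0) for i in range(N)]

def error_dynamics(delta, u):
    """delta_i_dot = A delta_i + (d_i + g_i) B_i u_i - sum_j alpha_ij B_j u_j, as in (40)."""
    N = len(delta)
    return [A @ delta[i] + (d[i] + g_pin[i]) * (B[i] @ u[i])
            - sum(alpha[i, j] * (B[j] @ u[j]) for j in range(N)) for i in range(N)]

# Example evaluation at arbitrary states and inputs.
x = [np.array([1.0, 0.0]), np.array([0.5, 0.2]), np.array([-0.3, 0.1])]
x0 = np.zeros(2)
u = [np.array([0.1]), np.array([-0.2]), np.array([0.0])]
delta = neighborhood_errors(x, x0)
print(delta)
print(error_dynamics(delta, u))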

Assume that the value functions are quadratic in the neighborhood tracking error, i.e., $V_{i}^{\star}(\delta_{i}): \mathbb{R}^{n} \to \mathbb{R}$ with
$$V_{i}^{\star}(\delta_{i}) = \frac{1}{2} \delta_{i}^{T} P_{i} \delta_{i}, \quad \forall \delta_{i}, \ \forall i \in \mathcal{N} \quad (43)$$
where $P_{i} = P_{i}^{T} > 0 \in \mathbb{R}^{n \times n}$, $\forall i \in \mathcal{N}$, are the unique matrices that solve the following complicated distributed coupled equations:
$$\delta_{i}^{T} P_{i} \Big( A\delta_{i} - (d_{i}+g_{i})^{2} B_{i} R_{ii}^{-1} B_{i}^{T} P_{i} \delta_{i} + \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} (d_{j}+g_{j}) B_{j} R_{jj}^{-1} B_{j}^{T} P_{j} \delta_{j} \Big) + \Big( A\delta_{i} - (d_{i}+g_{i})^{2} B_{i} R_{ii}^{-1} B_{i}^{T} P_{i} \delta_{i} + \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} (d_{j}+g_{j}) B_{j} R_{jj}^{-1} B_{j}^{T} P_{j} \delta_{j} \Big)^{T} P_{i} \delta_{i} + \sum_{j \in \mathcal{N}_{i}} (d_{j}+g_{j})^{2} \delta_{j}^{T} P_{j} B_{j} R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_{j}^{T} P_{j} \delta_{j} + (d_{i}+g_{i})^{2} \delta_{i}^{T} P_{i} B_{i} R_{ii}^{-1} B_{i}^{T} P_{i} \delta_{i} + \delta_{i}^{T} Q_{i} \delta_{i} = 0, \quad \forall i \in \mathcal{N}.$$

Using (43), the optimal control (42) for every player $i \in \mathcal{N}$ can be written as
$$u_{i}^{\star}(\delta_{i}) = -(d_{i}+g_{i}) R_{ii}^{-1} B_{i}^{T} P_{i} \delta_{i}, \quad \forall \delta_{i}.$$

A. Approximate Solution Using RL

A PI algorithm is now developed to solve the distributed coupled HJ equations as follows.

Algorithm 14 PI for Regulation in Graphical Games
1: procedure
2: Given an $N$-tuple of admissible policies $\mu_{i}^{(0)}$, $\forall i \in \mathcal{N}$
3: while $\| V_{i}^{\mu^{(k)}}(\delta_{i}) - V_{i}^{\mu^{(k-1)}}(\delta_{i}) \| \geq \epsilon_{iac}$, $\forall i \in \mathcal{N}$ do
4: Solve for the $N$-tuple of costs $V_{i}^{k}$ using the coupled Bellman equations
$$\delta_{i}^{T} Q_{i} \delta_{i} + \Big( \frac{\partial V_{i}^{k}}{\partial \delta_{i}} \Big)^{T} \Big( A\delta_{i} + (d_{i}+g_{i}) B_{i} \mu_{i}^{k} - \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} B_{j} \mu_{j}^{k} \Big) + \mu_{i}^{k\,T} R_{ii} \mu_{i}^{k} + \sum_{j \in \mathcal{N}_{i}} \mu_{j}^{k\,T} R_{ij} \mu_{j}^{k} = 0, \quad V_{i}^{k}(0) = 0.$$
5: Update the $N$-tuple of control policies $\mu_{i}^{k+1}$, $\forall i \in \mathcal{N}$, using
$$\mu_{i}^{k+1} = -(d_{i}+g_{i}) R_{ii}^{-1} B_{i}^{T} \frac{\partial V_{i}^{k}}{\partial \delta_{i}}.$$
6: $k = k+1$
7: end while
8: end procedure

The PI algorithm (Algorithm 14) is an offline algorithm. We now show how to develop an online RL-based algorithm for solving the coupled HJ equations.

Assume there exist constant weights $W_{i}$ such that the value functions $V_{i}^{\star}$ are approximated on a compact set $\Omega$ as
$$V_{i}^{\star}(\delta_{i}) = W_{i}^{T} \phi_{i}(\delta_{i}) + \epsilon_{i}(\delta_{i}), \quad \forall \delta_{i}, \ \forall i \in \mathcal{N}$$
where $\phi_{i}(\delta_{i}): \mathbb{R}^{n} \to \mathbb{R}^{K_{i}}$, $\forall i \in \mathcal{N}$, are the basis function vectors, $K_{i}$ is the number of basis functions, and $\epsilon_{i}(\delta_{i})$ is the approximation error.

Assuming current weight estimates $\hat{W}_{ic}$, $\forall i \in \mathcal{N}$, the outputs of the critic are given by
$$\hat{V}_{i}(\delta_{i}) = \hat{W}_{ic}^{T} \phi_{i}(\delta_{i}), \quad \forall i \in \mathcal{N}.$$

Using the current weight estimates, the approximate Lyapunov-like equations are given by
$$\hat{H}_{i}(\cdot) = r_{i}(\delta_{i}(t), \hat{u}_{i}, \hat{u}_{\mathcal{N}_{i}}) + \Big( \frac{\partial \hat{V}_{i}}{\partial \delta_{i}} \Big)^{T} \Big( A\delta_{i} + (d_{i}+g_{i}) B_{i} \hat{u}_{i} - \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} B_{j} \hat{u}_{j} \Big) = e_{i}, \quad \forall i \in \mathcal{N} \quad (44)$$
where $e_{i} \in \mathbb{R}$, $\forall i \in \mathcal{N}$, are the residual errors due to approximation and
$$\hat{u}_{i} = -(d_{i}+g_{i}) R_{ii}^{-1} B_{i}^{T} \Big( \frac{\partial \phi_{i}}{\partial \delta_{i}} \Big)^{T} \hat{W}_{iu}, \quad \forall i \in \mathcal{N}$$
where $\hat{W}_{iu}$ denotes the current estimate of the ideal weights $W_{i}$.

The critic weights are updated to minimize the squared residual errors $e_{i}$ in order to guarantee that $\hat{W}_{ic} \to W_{i}$, $\forall i \in \mathcal{N}$. For brevity, let
$$\sigma_{i} := \frac{\partial \phi_{i}}{\partial \delta_{i}} \Big( A\delta_{i} + (d_{i}+g_{i}) B_{i} \hat{u}_{i} - \sum_{j \in \mathcal{N}_{i}} \alpha_{ij} B_{j} \hat{u}_{j} \Big).$$
The tuning law for the critic is then given as
$$\dot{\hat{W}}_{ic} = -\alpha_{i} \frac{\sigma_{i}}{\big( \sigma_{i}^{T} \sigma_{i} + 1 \big)^{2}} \Big[ \sigma_{i}^{T} \hat{W}_{ic} + \frac{1}{2} \Big( \delta_{i}^{T} Q_{i} \delta_{i} + \hat{u}_{i}^{T} R_{ii} \hat{u}_{i} + \sum_{j \in \mathcal{N}_{i}} \hat{u}_{j}^{T} R_{ij} \hat{u}_{j} \Big) \Big], \quad \forall i \in \mathcal{N}$$
where $\alpha_{i} > 0$ is a tuning gain.
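To make the online critic update concrete, the following sketch performs one Euler step of the tuning law above for a single agent with one neighbor, assuming a quadratic basis $\phi_i(\delta_i)$. All numerical values, the basis choice, and the helper names are illustrative assumptions rather than quantities from the paper.

import numpy as np

# One agent i with a single neighbor j (illustrative data only).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B_i, B_j = np.array([[0.0], [1.0]]), np.array([[1.0], [0.0]])
alpha_ij, d_i, g_i = 1.0, 1.0, 1.0
Q_i, R_ii, R_ij = np.eye(2), np.eye(1), 0.5 * np.eye(1)

def phi(delta):
    """Assumed quadratic basis phi_i(delta_i) for a 2-D error."""
    return np.array([delta[0] ** 2, delta[0] * delta[1], delta[1] ** 2])

def dphi(delta):
    """Jacobian d(phi_i)/d(delta_i), shape (K_i, n)."""
    return np.array([[2 * delta[0], 0.0],
                     [delta[1], delta[0]],
                     [0.0, 2 * delta[1]]])

def critic_step(W_ic, W_iu, delta_i, u_j, alpha_gain, dt):
    """One Euler step of the normalized-gradient critic tuning law."""
    # actor output u_i_hat = -(d_i + g_i) R_ii^{-1} B_i^T dphi_i^T W_iu
    u_i = -(d_i + g_i) * np.linalg.solve(R_ii, B_i.T @ dphi(delta_i).T @ W_iu)
    # sigma_i = dphi_i (A delta_i + (d_i + g_i) B_i u_i - alpha_ij B_j u_j)
    sigma = dphi(delta_i) @ (A @ delta_i + (d_i + g_i) * (B_i @ u_i)
                             - alpha_ij * (B_j @ u_j))
    stage = 0.5 * (delta_i @ Q_i @ delta_i + u_i @ R_ii @ u_i + u_j @ R_ij @ u_j)
    e = sigma @ W_ic + stage               # approximate Hamiltonian residual
    W_ic = W_ic - alpha_gain * dt * sigma / (sigma @ sigma + 1.0) ** 2 * e
    return W_ic

W_ic = critic_step(W_ic=np.zeros(3), W_iu=np.ones(3),
                   delta_i=np.array([0.4, -0.2]), u_j=np.array([0.1]),
                   alpha_gain=5.0, dt=0.01)
print(W_ic)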

The actor tuning law is given as
$$\dot{\hat{W}}_{iu} = -\alpha_{iu} \Big[ F_{i} \hat{W}_{iu} - L_{i}\, \sigma_{i}^{T} \hat{W}_{ic} - \frac{1}{4} \sum_{j \in \mathcal{N}_{i}} \frac{\partial \phi_{i}}{\partial \delta_{i}} B_{i} R_{ii}^{-1} R_{ij} R_{ii}^{-1} B_{j}^{T} \Big( \frac{\partial \phi_{i}}{\partial \delta_{i}} \Big)^{T} \hat{W}_{iu} \frac{\sigma_{i}^{T}}{\Delta_{i}^{2}} \hat{W}_{ic} \Big], \quad \forall i \in \mathcal{N}$$
where $\alpha_{iu} > 0$ is a tuning gain, $F_{i}, L_{i} \succ 0$ are matrices that guarantee stability of the system, and
$$\Delta_{i} = \sigma_{i}^{T} \sigma_{i} + 1.$$

V. APPLICATIONS

RL has been successfully applied in many fields. Two such fields are presented below.

A. Robot Control and Navigation

In the last decade or so, the application of RL in robotics has increased steadily. In [94], RL is used to design a kinematic and dynamic tracking control scheme for a nonholonomic wheeled mobile robot. Reference [95] uses RL in the network-based tracking control system of the two-wheeled mobile robot Pioneer 2-DX. Neural network RL is used to design a controller for an active, simulated three-link biped robot in [96]. In [97], an RL-based actor-critic framework is used to control a fully actuated six-degrees-of-freedom autonomous underwater vehicle. Luy et al. [98] use RL to design robust adaptive tracking control laws with optimality for the synchronization of multiwheeled mobile robots over a communication graph. Reference [99] presents an optimal adaptive consensus-based formation control scheme over a finite horizon for networked mobile robots or agents in the presence of uncertain robot/agent dynamics. In [100], an IRL approach is employed to design a novel adaptive impedance control for a robotic exoskeleton with adjustable robot behavior. IRL is used to control a surface marine craft with unknown hydrodynamic parameters in [101]. In [102], RL is used along with a stabilizing feedback controller, such as a PID or linear quadratic regulator, to improve the performance of trajectory tracking in robot manipulators.

A mobile robot navigation task refers to planning a path with obstacle avoidance to a specified goal. RL-based mobile robot navigation mainly includes robot learning from expert demonstrations and robot self-learning and autonomous navigation. In the former, examples of standard behaviors are provided to the robot, which learns from these data and generalizes over all potential situations that are not given in the examples [103]–[112]. In the latter, RL techniques are used to train the robot via interaction with the surrounding environment; stability is not a concern in this type of task. Yang et al. [113] used continuous Q-learning for autonomous navigation of mobile robots. Lagoudakis and Parr [114] proposed least-squares PI (LSPI) for the robot navigation task. In that paper, and in some other similar works [115]–[117], neural networks are used to approximate the value function and the environment is assumed static. An autonomous agent needs to adapt its strategies to cope with changing surroundings and solve challenging navigation tasks [118], [119]. The methods in [115] and [116] are designed for environments with dynamic obstacles. A minimal self-learning example in this spirit is sketched below.
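The following toy sketch illustrates the self-learning navigation idea on a small grid world with standard tabular Q-learning. It is only a schematic illustration: the cited works operate on continuous state spaces with neural network approximators, while the grid, rewards, and hyperparameters here are invented for the example.

import numpy as np

# 5x5 grid; the robot starts at (0, 0), the goal is at (4, 4), and one cell is an obstacle.
GOAL, OBSTACLE = (4, 4), (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

def step(state, a):
    nxt = (min(max(state[0] + ACTIONS[a][0], 0), 4),
           min(max(state[1] + ACTIONS[a][1], 0), 4))
    if nxt == OBSTACLE:
        return state, -5.0, False                      # bump penalty, stay in place
    if nxt == GOAL:
        return nxt, +10.0, True
    return nxt, -1.0, False                            # small step cost

rng = np.random.default_rng(1)
Q = np.zeros((5, 5, len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.95, 0.2
for episode in range(500):
    s, done = (0, 0), False
    while not done:
        a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # one-step Q-learning update
        Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
        s = s2
print(np.argmax(Q[(0, 0)]))   # greedy first action learned from the start cell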
B. Power Systems

There have been several applications of RL in power systems. The work in [120] proposes a PI technique based on an actor-critic structure for an automatic voltage regulator system, for models both neglecting and including the sensor dynamics. In [121], a game-theory-based distributed controller is designed to provide the desired voltage magnitude and frequency at the load. The optimal switching between different topologies in step-down dc–dc voltage converters is investigated in [122]. The control of a boost converter using an RL-based nonlinear control strategy to attain a constant output voltage is presented in [123]. In [124], RL is used for distributed control of dc microgrids to establish coordination among active loads so that they collectively respond to any load change. In [125], RL is applied to the control of an electric water heater with 50 temperature sensors.

VI. CONCLUSION

In this paper, we have reviewed several RL techniques for solving optimal control problems in real time using data measured along the system trajectories. We have presented families of online RL-based solutions for optimal regulation, optimal tracking, Nash games, and graphical games. The complete dynamics of the systems do not need to be known for these RL-based online solution techniques. The experience replay technique used in RL, which speeds up learning by reusing collected data, was also discussed; algorithms that reuse data in this way are fast and data efficient.

The design of static RL-based OPFB controllers for nonlinear systems has not been investigated yet. Also, existing RL-based $H_\infty$ controllers require the disturbance to be measured during learning. Another interesting and important open research direction is to develop novel deep RL approaches for feedback control of nonlinear systems with high-dimensional inputs and unstructured input data. Deep neural networks can approximate a more accurate structure of the value function and avoid divergence of the RL algorithm and, consequently, instability of the feedback control system. Deep RL for feedback control, however, requires developing new learning algorithms that assure the stability of the feedback system in the sense of Lyapunov during learning.

REFERENCES

[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. New York, NY, USA: Wiley, 2012.
[2] R. F. Stengel, Optimal Control and Estimation (Dover Books on Mathematics). Mineola, NY, USA: Dover, 1986.
[3] D. Liberzon, Calculus of Variations and Optimal Control Theory—A Concise Introduction. Princeton, NJ, USA: Princeton Univ. Press, 2011.
[4] M. Athans and P. L. Falb, Optimal Control: An Introduction to the Theory and Its Applications (Dover Books on Engineering). Mineola, NY, USA: Dover, 2006.
[5] D. E. Kirk, Optimal Control Theory: An Introduction (Dover Books on Electrical Engineering). Mineola, NY, USA: Dover, 2012.
[6] A. E. Bryson, Applied Optimal Control: Optimization, Estimation and Control. New York, NY, USA: Halsted, 1975.
[7] W. M. McEneaney, Max-Plus Methods for Nonlinear Control and Estimation. Basel, Switzerland: Birkhäuser, 2006.
[8] W. H. Fleming and W. M. McEneaney, "A max-plus-based algorithm for a Hamilton–Jacobi–Bellman equation of nonlinear filtering," SIAM J. Control Optim., vol. 38, no. 3, pp. 683–710, 2000.
[9] M. T. Hagan, H. B. Demuth, M. Beale, and O. De Jesús, Neural Network Design, 2nd ed. New York, NY, USA: Martin, 2014.
[10] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255–260, 2015.
[11] C. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer-Verlag, 2009.
[13] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, 2nd ed. Cambridge, MA, USA: MIT Press, 2017.
[15] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[16] W. B. Powell, Approximate Dynamic Programming. Hoboken, NJ, USA: Wiley, 2007.
[17] X.-R. Cao, Stochastic Learning and Optimization. New York, NY, USA: Springer, 2007.
[18] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control. London, U.K.: Springer-Verlag, 2013.
[19] D. Liu, Q. Wei, D. Wang, X. Yang, and H. Li, Adaptive Dynamic Programming With Applications in Optimal Control. Springer, 2017.
[20] M. Krstić and I. Kanellakopoulos, Nonlinear and Adaptive Control Design (Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control). New York, NY, USA: Wiley, 1995.
[21] G. Tao, Adaptive Control Design and Analysis (Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control). New York, NY, USA: Wiley, 2003.
[22] P. Ioannou and B. Fidan, Adaptive Control Tutorial (Advances in Design and Control). Philadelphia, PA, USA: SIAM, 2006.
[23] N. Hovakimyan and C. Cao, L1 Adaptive Control Theory: Guaranteed Robustness With Fast Adaptation (Advances in Design and Control). Philadelphia, PA, USA: SIAM, 2010.
[24] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence and Robustness (Dover Books on Electrical Engineering). Mineola, NY, USA: Dover, 2011.
[25] K. J. Aström and B. Wittenmark, Adaptive Control (Dover Books on Electrical Engineering), 2nd ed. Mineola, NY, USA: Dover, 2013.
[26] P. Ioannou and J. Sun, Robust Adaptive Control. Mineola, NY, USA: Dover, 2013.
[27] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., Syst., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983.
[28] R. S. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
[29] P. J. Werbos, "Neural networks for control and system identification," in Proc. IEEE 28th Annu. Conf. Decision Control (CDC), Dec. 1989, pp. 260–265.
[30] P. J. Werbos, A Menu of Designs for Reinforcement Learning Over Time. Cambridge, MA, USA: MIT Press, 1991.
[31] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York, NY, USA: Van Nostrand, 1992.
[32] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge Univ., Cambridge, U.K., 1989.
[33] K. Doya, "Reinforcement learning in continuous time and space," Neural Comput., vol. 12, no. 1, pp. 219–245, 2000.
[34] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, May 2005.
[35] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, "Adaptive optimal control for continuous-time linear systems based on policy iteration," Automatica, vol. 45, no. 2, pp. 477–484, Feb. 2009.
[36] K. G. Vamvoudakis and F. L. Lewis, "Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[37] A. Heydari and S. N. Balakrishnan, "Optimal switching and control of nonlinear switching systems using approximate dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 6, pp. 1106–1117, Jun. 2014.
[38] A. Heydari and S. N. Balakrishnan, "Optimal switching between autonomous subsystems," J. Franklin Inst., vol. 351, no. 5, pp. 2675–2690, 2014.
[39] A. Heydari, "Optimal scheduling for reference tracking or state regulation using reinforcement learning," J. Franklin Inst., vol. 352, no. 8, pp. 3285–3303, 2015.
[40] A. Heydari and S. N. Balakrishnan, "Optimal switching between controlled subsystems with free mode sequence," Neurocomputing, vol. 149, pp. 1620–1630, Feb. 2015.
[41] A. Heydari, "Feedback solution to optimal switching problems with switching cost," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 10, pp. 2009–2019, Oct. 2016.
[42] R. J. Leake and R.-W. Liu, "Construction of suboptimal control sequences," SIAM J. Control, vol. 5, no. 1, pp. 54–63, Feb. 1967.
[43] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA, USA: MIT Press, 1960.
[44] F. L. Lewis, H. Zhang, K. Hengster-Movric, and A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches. New York, NY, USA: Springer-Verlag, 2014.
[45] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[46] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 2, pp. 425–436, Apr. 2007.
[47] T. Dierks and S. Jagannathan, "Online optimal control of nonlinear discrete-time systems using approximate dynamic programming," J. Control Theory Appl., vol. 9, no. 3, pp. 361–369, 2011.
[48] Y. Luo and H. Zhang, "Approximate optimal control for a class of nonlinear discrete-time systems with saturating actuators," Prog. Natural Sci., vol. 18, no. 8, pp. 1023–1029, 2008.
[49] Y. Sokolov, R. Kozma, L. D. Werbos, and P. J. Werbos, "Complete stability analysis of a heuristic approximate dynamic programming control design," Automatica, vol. 59, pp. 9–18, Sep. 2015.
[50] A. Heydari, "Theoretical and numerical analysis of approximate dynamic programming with approximation errors," J. Guid., Control, Dyn., vol. 39, no. 2, pp. 301–311, 2016.
[51] L. Dong, X. Zhong, C. Sun, and H. He, "Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1594–1605, Jul. 2016.
[52] A. Sahoo, H. Xu, and S. Jagannathan, "Near optimal event-triggered control of nonlinear discrete-time systems using neurodynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 9, pp. 1801–1815, Sep. 2016.
[53] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, "Adaptive linear quadratic control using policy iteration," in Proc. IEEE Amer. Control Conf., vol. 3, Jul. 1994, pp. 3475–3479.
[54] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279–292, 1992.

[55] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Aug. 2009.
[56] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 14–25, Feb. 2011.
[57] B. Kiumarsi and F. L. Lewis, "Actor–critic-based optimal tracking for partially unknown nonlinear discrete-time systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 140–151, Jan. 2015.
[58] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M. B. Naghibi-Sistani, "Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics," Automatica, vol. 50, no. 4, pp. 1167–1175, Apr. 2014.
[59] R. Bellman, "Dynamic programming and Lagrange multipliers," Proc. Nat. Acad. Sci. USA, vol. 42, no. 10, pp. 767–769, 1956.
[60] B. Kiumarsi, F. L. Lewis, M. B. Naghibi-Sistani, and A. Karimpour, "Optimal tracking control of unknown discrete-time linear systems using input-output measured data," IEEE Trans. Cybern., vol. 45, no. 12, pp. 2770–2779, Dec. 2015.
[61] G. Zames, "Feedback and optimal sensitivity: Model reference transformations, multiplicative seminorms, and approximate inverses," IEEE Trans. Autom. Control, vol. AC-26, no. 2, pp. 301–320, Apr. 1981.
[62] A. J. van der Schaft, "L2-gain analysis of nonlinear systems and nonlinear state-feedback H∞ control," IEEE Trans. Autom. Control, vol. 37, no. 6, pp. 770–784, Jun. 1992.
[63] T. Başar and P. Bernhard, H∞ Optimal Control and Related Minimax Design Problems. Boston, MA, USA: Birkhäuser, 1995.
[64] J. C. Doyle, K. Glover, P. P. Khargonekar, and B. A. Francis, "State-space solutions to standard H2 and H∞ control problems," IEEE Trans. Autom. Control, vol. 34, no. 8, pp. 831–847, Aug. 1989.
[65] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Model-free Q-learning designs for linear discrete-time zero-sum games with application to H∞ control," Automatica, vol. 43, no. 3, pp. 473–481, 2007.
[66] B. Kiumarsi, F. L. Lewis, and Z.-P. Jiang, "H∞ control of linear discrete-time systems: Off-policy reinforcement learning," Automatica, vol. 78, pp. 144–152, Apr. 2017.
[67] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp. 237–246, 2009.
[68] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 89–92, 2013.
[69] T. Dierks and S. Jagannathan, "Optimal control of affine nonlinear continuous-time systems," in Proc. Amer. Control Conf., Jun./Jul. 2010, pp. 1568–1573.
[70] R. Kamalapurkar, J. R. Klotz, and W. E. Dixon, "Concurrent learning-based approximate feedback-Nash equilibrium solution of N-player nonzero-sum differential games," IEEE/CAA J. Autom. Sinica, vol. 1, no. 3, pp. 239–247, Jul. 2014.
[71] R. Kamalapurkar, P. Walters, and W. E. Dixon, "Model-based reinforcement learning for approximate optimal regulation," Automatica, vol. 64, pp. 94–104, Feb. 2016.
[72] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, "Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems," Automatica, vol. 50, no. 1, pp. 193–202, 2014.
[73] G. Chowdhary, T. Yucelen, M. Mühlegg, and E. N. Johnson, "Concurrent learning adaptive control of linear systems with exponentially convergent bounds," Int. J. Adapt. Control Signal Process., vol. 27, no. 4, pp. 280–301, 2013.
[74] K. G. Vamvoudakis, "Event-triggered optimal adaptive control algorithm for continuous-time nonlinear systems," IEEE/CAA J. Autom. Sinica, vol. 1, no. 3, pp. 282–293, Jul. 2014.
[75] L. Dong, X. Zhong, C. Sun, and H. He, "Event-triggered adaptive dynamic programming for continuous-time systems with control constraints," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 8, pp. 1941–1952, Aug. 2017.
[76] V. Narayanan and S. Jagannathan, "Distributed adaptive optimal regulation of uncertain large-scale interconnected systems using hybrid Q-learning approach," IET Control Theory Appl., vol. 10, no. 12, pp. 1448–1457, 2016.
[77] V. Narayanan and S. Jagannathan, "Approximate optimal distributed control of uncertain nonlinear interconnected systems with event-sampled feedback," in Proc. IEEE 55th Conf. Decision Control (CDC), Dec. 2016, pp. 5827–5832.
[78] V. Narayanan and S. Jagannathan, "Distributed event-sampled approximate optimal control of interconnected affine nonlinear continuous-time systems," in Proc. Amer. Control Conf. (ACC), Jul. 2016, pp. 3044–3049.
[79] Y. Jiang and Z.-P. Jiang, "Robust adaptive dynamic programming and feedback stabilization of nonlinear systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 882–893, May 2014.
[80] B. Luo, H.-N. Wu, and T. Huang, "Off-policy reinforcement learning for H∞ control design," IEEE Trans. Cybern., vol. 45, no. 1, pp. 65–76, Jan. 2015.
[81] Y. Jiang and Z.-P. Jiang, "Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics," Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
[82] H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014.
[83] H. Modares and F. L. Lewis, "Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning," Automatica, vol. 50, no. 7, pp. 1780–1792, Jul. 2014.
[84] H. Modares, "Optimal tracking control of uncertain systems: On-policy and off-policy reinforcement learning approaches," Ph.D. dissertation, Dept. Electr. Eng., Univ. Texas Arlington, Arlington, TX, USA, 2015.
[85] T. Bian, Y. Jiang, and Z.-P. Jiang, "Adaptive dynamic programming and optimal control of nonlinear nonaffine systems," Automatica, vol. 50, no. 10, pp. 2624–2632, Oct. 2014.
[86] B. Kiumarsi, W. Kang, and F. L. Lewis, "H∞ control of nonaffine aerial systems using off-policy reinforcement learning," Unmanned Syst., vol. 4, no. 1, pp. 51–60, 2016.
[87] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles (Control, Robotics and Sensors). Edison, NJ, USA: IET, 2013.
[88] H. Modares, F. L. Lewis, and Z.-P. Jiang, "H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2550–2562, Oct. 2015.
[89] H. Li, D. Liu, and D. Wang, "Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 3, pp. 706–714, Jul. 2014.
[90] B. Luo and H.-N. Wu, "Computationally efficient simultaneous policy update algorithm for nonlinear H∞ state feedback control with Galerkin's method," Int. J. Robust Nonlinear Control, vol. 23, no. 9, pp. 991–1012, 2013.
[91] K. G. Vamvoudakis, "Non-zero sum Nash Q-learning for unknown deterministic continuous-time linear systems," Automatica, vol. 61, pp. 274–281, Nov. 2015.
[92] M. Johnson, R. Kamalapurkar, S. Bhasin, and W. E. Dixon, "Approximate N-player nonzero-sum game solution for an uncertain continuous nonlinear system," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, pp. 1645–1658, Aug. 2015.
[93] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598–1611, 2012.
[94] N. T. Luy, "Reinforcement learning-based optimal tracking control for wheeled mobile robot," in Proc. IEEE Int. Conf. Cyber Technol. Autom., Control, Intell. Syst. (CYBER), May 2012, pp. 371–376.
[95] M. Szuster and Z. Hendzel, "Discrete globalised dual heuristic dynamic programming in control of the two-wheeled mobile robot," Math. Problems Eng., vol. 2014, Sep. 2014, Art. no. 628798.
[96] A. Ghanbari, Y. Vaghei, and S. M. R. S. Noorani, "Neural network reinforcement learning for walking control of a 3-link biped robot," Int. J. Eng. Technol., vol. 7, no. 6, pp. 449–452, 2015.
[97] P. Walters, R. Kamalapurkar, and W. E. Dixon, "Online approximate optimal station keeping of an autonomous underwater vehicle," Sep. 2013. [Online]. Available: https://arxiv.org/abs/1310.0063
[98] N. T. Luy, N. T. Thanh, and H. M. Tri, "Reinforcement learning-based robust adaptive tracking control for multi-wheeled mobile robots synchronization with optimality," in Proc. IEEE Workshop Robot. Intell. Inform. Struct. Space (RiiSS), Apr. 2013, pp. 74–81.

[99] H. M. Guzey, H. Xu, and J. Sarangapani, "Neural network-based finite horizon optimal adaptive consensus control of mobile robot formations," Optim. Control Appl. Methods, vol. 37, no. 5, pp. 1014–1034, 2016.
[100] Z. Huang, J. Liu, Z. Li, and C.-Y. Su, "Adaptive impedance control of robotic exoskeletons using reinforcement learning," in Proc. Int. Conf. Adv. Robot. Mechatron. (ICARM), 2016, pp. 243–248.
[101] Z. Bell, A. Parikh, J. Nezvadovitz, and W. E. Dixon, "Adaptive control of a surface marine craft with parameter identification using integral concurrent learning," in Proc. IEEE 55th Conf. Decision Control (CDC), Dec. 2016, pp. 389–394.
[102] Y. P. Pane, S. P. Nageshrao, and R. Babuška, "Actor-critic reinforcement learning for tracking control in robotics," in Proc. IEEE 55th Conf. Decision Control (CDC), Dec. 2016, pp. 5819–5826.
[103] M. van Lent and J. E. Laird, "Learning procedural knowledge through observation," in Proc. ACM 1st Int. Conf. Knowl. Capture (K-CAP), 2001, pp. 179–186.
[104] A. Jain, B. Wojcik, T. Joachims, and A. Saxena, "Learning trajectory preferences for manipulators via iterative improvement," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 575–583.
[105] A. Alissandrakis, C. L. Nehaniv, and K. Dautenhahn, "Imitation with ALICE: Learning to imitate corresponding actions across dissimilar embodiments," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 32, no. 4, pp. 482–496, Jul. 2002.
[106] B. Robins, K. Dautenhahn, R. te Boekhorst, and A. Billard, Effects of Repeated Exposure to a Humanoid Robot on Children With Autism. London, U.K.: Springer, 2004, pp. 225–236.
[107] A. Ude, C. G. Atkeson, and M. Riley, "Programming full-body movements for humanoid robots by observation," Robot. Auto. Syst., vol. 47, nos. 2–3, pp. 93–108, 2004.
[108] J. Saunders, C. L. Nehaniv, and K. Dautenhahn, "Teaching robots by moulding behavior and scaffolding the environment," in Proc. 1st ACM SIGCHI/SIGART Conf. Hum.-Robot Interaction (HRI), 2006, pp. 118–125.
[109] P. Abbeel, A. Coates, and A. Y. Ng, "Autonomous helicopter aerobatics through apprenticeship learning," Int. J. Robot. Res., vol. 29, no. 13, pp. 1608–1639, 2010.
[110] F. Stulp, E. A. Theodorou, and S. Schaal, "Reinforcement learning with sequences of motion primitives for robust manipulation," IEEE Trans. Robot., vol. 28, no. 6, pp. 1360–1370, Dec. 2012.
[111] E. Rombokas, M. Malhotra, E. A. Theodorou, E. Todorov, and Y. Matsuoka, "Reinforcement learning and synergistic control of the ACT hand," IEEE/ASME Trans. Mechatronics, vol. 18, no. 2, pp. 569–577, Apr. 2013.
[112] E. Alibekov, J. Kubalik, and R. Babuska, "Policy derivation methods for critic-only reinforcement learning in continuous action spaces," IFAC-PapersOnLine, vol. 49, no. 5, pp. 285–290, 2016.
[113] G.-S. Yang, E.-K. Chen, and C.-W. An, "Mobile robot navigation using neural Q-learning," in Proc. Int. Conf. Mach. Learn. Cybern., vol. 1, Aug. 2004, pp. 48–52.
[114] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, pp. 1107–1149, Dec. 2003.
[115] B.-Q. Huang, G.-Y. Cao, and M. Guo, "Reinforcement learning neural network to the problem of autonomous mobile robot obstacle avoidance," in Proc. Int. Conf. Mach. Learn. Cybern., vol. 1, Aug. 2005, pp. 85–89.
[116] C. Li, J. Zhang, and Y. Li, "Application of artificial neural network based on Q-learning for mobile robot path planning," in Proc. IEEE Int. Conf. Inf. Acquisition, Aug. 2006, pp. 978–982.
[117] J. Qiao, Z. Hou, and X. Ruan, "Application of reinforcement learning based on neural network to dynamic obstacle avoidance," in Proc. Int. Conf. Inf. Autom., Jun. 2008, pp. 784–788.
[118] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., vol. 17, no. 39, pp. 1–40, 2016.
[119] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, "Trust region policy optimization," Feb. 2015. [Online]. Available: https://arxiv.org/abs/1502.05477
[120] L. B. Prasad, H. O. Gupta, and B. Tyagi, "Application of policy iteration technique based adaptive optimal control design for automatic voltage regulator of power system," Int. J. Elect. Power Energy Syst., vol. 63, pp. 940–949, Dec. 2014.
[121] K. G. Vamvoudakis and J. P. Hespanha, "Online optimal operation of parallel voltage-source inverters using partial information," IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4296–4305, May 2016.
[122] A. Heydari, "Optimal switching of DC–DC power converters using approximate dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2016.2635586.
[123] D. J. Pradeep, M. M. Noel, and N. Arun, "Nonlinear control of a boost converter using a robust regression based reinforcement learning algorithm," Eng. Appl. Artif. Intell., vol. 52, pp. 1–9, Jun. 2016.
[124] L.-L. Fan, V. Nasirian, H. Modares, F. L. Lewis, Y.-D. Song, and A. Davoudi, "Game-theoretic control of active loads in DC microgrids," IEEE Trans. Energy Convers., vol. 31, no. 3, pp. 882–895, Sep. 2016.
[125] F. Ruelens, B. Claessens, S. Quaiyum, B. De Schutter, R. Babuska, and R. Belmans, "Reinforcement learning applied to an electric water heater: From theory to practice," IEEE Trans. Smart Grid, to be published, doi: 10.1109/TSG.2016.2640184.

Bahare Kiumarsi (M'17) received the B.S. degree from the Shahrood University of Technology, Shahrood, Iran, in 2009, the M.S. degree from the Ferdowsi University of Mashhad, Mashhad, Iran, in 2013, and the Ph.D. degree in electrical engineering from the University of Texas at Arlington, Arlington, TX, USA, in 2017. Her current research interests include cyber-physical systems, reinforcement learning, and distributed control of multiagent systems. Dr. Kiumarsi was a recipient of the UT Arlington N. M. Stelmakh Outstanding Student Research Award and the UT Arlington Graduate Dissertation Fellowship in 2017.

Kyriakos G. Vamvoudakis (SM'15) was born in Athens, Greece. He received the Diploma degree (High Hons.) in electronic and computer engineering from the Technical University of Crete, Chania, Greece, in 2006, and the M.S. and Ph.D. degrees in electrical engineering from The University of Texas at Arlington, Arlington, TX, USA, in 2008 and 2011, respectively. From 2012 to 2016, he was a Project Research Scientist with the Center for Control, Dynamical Systems, and Computation, University of California, Santa Barbara, CA, USA. He is currently an Assistant Professor with the Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Tech, Blacksburg, VA, USA. He has co-authored one patent, more than 100 technical publications, and two books. His current research interests include game-theoretic control, network security, smart grid, and multiagent optimization. Dr. Vamvoudakis was a recipient of several international awards, including the 2016 International Neural Network Society Young Investigator Award, the Best Paper Award for Autonomous/Unmanned Vehicles at the 27th Army Science Conference in 2010, the Best Presentation Award at the World Congress of Computational Intelligence in 2010, and the Best Researcher Award from the Automation and Robotics Research Institute in 2011. He is currently an Associate Editor of the Journal of Optimization Theory and Applications, an Associate Editor of Control Theory and Technology, a Registered Electrical/Computer Engineer, and a member of the Technical Chamber of Greece.

Hamidreza Modares (M'15) received the B.Sc. degree from Tehran University, Tehran, Iran, in 2004, the M.Sc. degree from the Shahrood University of Technology, Shahrood, Iran, in 2006, and the Ph.D. degree from the University of Texas at Arlington (UTA), Arlington, TX, USA, in 2015. From 2006 to 2009, he was with the Shahrood University of Technology as a Senior Lecturer. From 2015 to 2016, he was a Faculty Research Associate with UTA. He is currently an Assistant Professor with the Missouri University of Science and Technology, Rolla, MO, USA. He has authored several journal and conference papers on the design of optimal controllers using reinforcement learning. His current research interests include cyber-physical systems, machine learning, distributed control, robotics, and renewable energy microgrids. Dr. Modares was a recipient of the Best Paper Award from the 2015 IEEE International Symposium on Resilient Control Systems, the Stelmakh Outstanding Student Research Award from the Department of Electrical Engineering, UTA, in 2015, and the Summer Dissertation Fellowship from UTA in 2015. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Frank L. Lewis (F'94) received the bachelor's degree in physics/electrical engineering and the M.S.E.E. degree from Rice University, Houston, TX, USA, the M.S. degree in aeronautical engineering from the University of West Florida, Pensacola, FL, USA, and the Ph.D. degree from Georgia Tech, Atlanta, GA, USA. He holds seven U.S. patents, and has authored numerous journal special issues, 363 journal papers, and 20 books, including Optimal Control, Aircraft Control, Optimal Estimation, and Robot Manipulator Control, which are used as university textbooks worldwide. His current research interests include feedback control, intelligent systems, cooperative control systems, and nonlinear systems. Dr. Lewis was a recipient of the Fulbright Research Award, the NSF Research Initiation Grant, the ASEE Terman Award, the International Neural Network Society Gabor Award, the U.K. Institute of Measurement and Control Honeywell Field Engineering Medal, the IEEE Computational Intelligence Society Neural Networks Pioneer Award, the AIAA Intelligent Systems Award, the Texas Regents Outstanding Teaching Award 2013, and the China Liaoning Friendship Award. He is a Founding Member of the Board of Governors of the Mediterranean Control Association. He is a member of the National Academy of Inventors, a fellow of IFAC, AAAS, and the U.K. Institute of Measurement and Control, PE Texas, a U.K. Chartered Engineer, the UTA Distinguished Scholar Professor, the UTA Distinguished Teaching Professor, the Moncrief-O'Donnell Chair with the University of Texas at Arlington Research Institute, Arlington, TX, USA, and the Qian Ren Thousand Talents Consulting Professor, Northeastern University, Shenyang, China.