Optimal Control and Reinforcement Learning of Switched Systems

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Hua Chen, B.E.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2018

Dissertation Committee:

Prof. Wei Zhang, Advisor
Prof. Andrea Serrani
Prof. Vadim Utkin

© Copyright by

Hua Chen

2018

Abstract

This dissertation studies optimal control and reinforcement learning of switched systems. Roughly speaking, a switched system consists of several subsystems and a switching signal determining which subsystem is being used to evolve the system dynamics at each time instant. Optimal control of such switched systems involves finding the discrete switching signal and the associated continuous input into the chosen subsystem to jointly optimize a certain performance index. It is widely known in the literature that optimal control of switched systems is challenging to solve, mainly due to the discrete nature of the switching signal, which makes the overall problem combinatorial. Two problems with different settings are considered in this dissertation.

The first problem we consider is general optimal control of continuous-time nonlinear switched systems. We focus on the so-called embedding-based approach. Rather than proposing new embedding-based algorithms, we develop a framework originating from a novel topological perspective of the embedding-based technique. The proposed framework unifies most existing embedding-based algorithms as special cases and provides guidance on how to construct new ones. The second problem studied in this dissertation is optimal control of discrete-time switched linear systems. Because accurate knowledge of the system dynamics is in general hard to obtain for practical systems, we do not assume knowledge of the system model. Instead, a simulator is adopted for generating the successor state given any state-input pair. Based on this simulator, we utilize the reinforcement learning framework to solve the problem in a model-free manner. Instead of directly applying existing neural-network-based algorithms, we develop a distinct Q-learning algorithm that explicitly incorporates analytical insights about the optimal solution from traditional optimal control. In particular, a specific parametric Q-function approximation is proposed. To update the involved parameters, two approaches based on different structural information of the underlying model are adopted.

This is dedicated to my parents.

Acknowledgments

This dissertation summarizes my six-and-a-half-year PhD study at The Ohio State University, which would not have been possible without the help of many people.

First of all, I would like to express my deepest gratitude to my PhD advisor, Prof. Wei Zhang, for his guidance, patience, and inspiration. I am also greatly indebted to him for his generous support of my various conference travels.

I would also like to thank Prof. Andrea Serrani and Prof. Vadim Utkin for serving on my PhD committee. Prof. Andrea Serrani’s enthusiasm toward his research topics has constantly been a strong motivation for me. His courses on linear system theory, nonlinear system theory, and adaptive control are among the best that I have ever taken. Although I have never taken a course from Prof. Utkin and have never directly worked with him, his seminal works on sliding mode control were a strong motivation for me to work on problems related to switched systems.

I would like to thank Prof. Antonio Conejo for many enlightening discussions when I was working on control problems in power systems. I am also greatly indebted to Dr. Jianming Lian from the Pacific Northwest National Laboratory for many fruitful discussions on voltage control problems and for offering me an internship there.

Many other individuals have also contributed to this dissertation, either directly or indirectly, including, but not limited to, David Casbeer, Krishna Kalyanam, Laurentiu Marinovici, Chin-Yao Chang, Kiryung Lee, Lin Zhao, Sen Li, Jianzhe Liu, Yueyun Lu, Yanzheng Zhu, Huaqing Xiong, Hao Li, Bowen Weng, and many more, for making my life at Ohio State memorable.

Finally, I want to thank my wife, Lingxiao Zhou, and my parents, Zhengping Chen and Huimin Ma, for their support, patience and unconditional love.

Vita

2012 ...... B.E., Zhejiang University

2012-present ...... Graduate Research Associate, Electrical and Computer Engineering, The Ohio State University

Publications

Research Publications

H. Chen, K. Krishnamoorthy, W. Zhang, D. Casbeer “Intruder Isolation on a General Road Network Under Partial Information”. IEEE Transactions on Control Systems Technology, vol. 25, no. 1, pp. 222-234, January 2017

H. Chen, W. Zhang “On weak topology for optimal control of switched nonlinear systems”. Automatica, vol. 81, pp. 409-415, July 2017

J. Liu, H. Chen, W. Zhang, B. Yurkovich, G. Rizzoni “Energy Management Problems Under Uncertainties for Grid-Connected Microgrids: a Chance Constrained Programming Approach”. IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2585-2596, November 2017

H. Chen, K. Krishnamoorthy, W. Zhang, D. Casbeer “Continuous-time intruder isolation using Unattended Ground Sensors on graphs”. American Control Conference (ACC), pp. 5270-5275, 2014

J. Lian, D. Wu, K. Kalsi, H. Chen “Theoretical Framework for Integrating Distributed Energy Resources into Distribution Systems”. Power & Energy Society General Meeting, pp. 1-5, 2017

H. Chen, W. Zhang, J. Lian, A. J. Conejo “Robust Distributed volt/var Control of Distribution Systems”. Conference on Decision and Control (CDC), pp. 6321-6326, 2017

H. Li, H. Chen, W. Zhang “On Model-free Reinforcement Learning for Switched Linear Systems: A Subspace Clustering Approach”. 56th Annual Allerton Conference on Communication, Control, and Computing, October 2018

Fields of Study

Major Field: Electrical and Computer Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Background
   1.1 Motivations and Overview
   1.2 Literature Review
       1.2.1 Classical Optimal Control
       1.2.2 Optimal Control of Continuous-time Switched Systems
       1.2.3 Optimal Control and Reinforcement Learning of Discrete-time Switched Systems
   1.3 Preview of Main Results and Contributions
   1.4 Organization

2. A Weak-Topology based Framework for Optimal Control of Continuous-time Switched Systems
   2.1 Introduction
   2.2 Problem Formulation and Preliminaries
   2.3 Weak Topologies and Infinite-dimensional Optimization
       2.3.1 Topologies and Weak Topologies
       2.3.2 Infinite-dimensional Optimization for Optimal Control
   2.4 A Unified Framework for Switched Optimal Control Problem
       2.4.1 Convergence Analysis and Proofs
   2.5 Case Studies
       2.5.1 Problem with terminal cost
       2.5.2 Problem with mode-dependent cost
   2.6 Conclusion

3. Reinforcement Learning for Switched Linear Systems
   3.1 Introduction
   3.2 Problem Formulation
   3.3 Model-based Optimal Control
       3.3.1 Dynamic Programming
       3.3.2 Switched Linear Quadratic Regulation
       3.3.3 Limitations
   3.4 Q-learning
   3.5 Main Results
       3.5.1 Q-function and parametric approximator
       3.5.2 Q-function Update
   3.6 Case Studies
       3.6.1 A Simple 2-Dimensional Example
       3.6.2 Another More Interesting 2-Dimensional Example
       3.6.3 A 3-Dimensional Example
   3.7 Conclusion

4. Contributions and Future Work
   4.1 Future Works

Bibliography

List of Tables

2.1 Configurations of DES system

List of Figures

2.1 Convergence of terminal states under different topologies
3.1 General Neural Network Structure
3.2 Original data distribution
3.3 Geometric Algorithm on 2-D Synthetic Data
3.4 Histogram of empirical error between the identified model and underlying true model
3.5 Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 1
3.6 Pij convergence with subspace clustering - Example 1
3.7 Pij convergence with geometric approach - Example 1
3.8 Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 2
3.9 Pij convergence with subspace clustering - Example 2
3.10 Pij convergence with geometric approach - Example 2
3.11 Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 3
3.12 Pij convergence with subspace clustering - Example 3
3.13 Pij convergence with geometric approach - Example 3

Chapter 1: Background

1.1 Motivations and Overview

In this dissertation, we study optimal control and reinforcement learning for switched systems. Roughly speaking, a switched system contains a number of subsystems and a switching signal determining which subsystem is being used to evolve the system dynamics at each time instant. Optimal control of such switched systems involves finding the discrete switching signal and the associated continuous input into the chosen subsystem to jointly optimize a certain performance index.

Switched systems have been extensively studied in the literature due to their strong capability of modeling various engineering phenomena involving multi-mode behaviors, such as power electronics [60], automotive systems [38, 68, 83], robotics [95], and manufacturing [23, 61]. Because of the presence of the discrete switching signal, classical theories and techniques for control systems cannot be directly applied to switched systems. During the past several decades, mathematical theories for switched systems have been developed. In particular, [47] provides an overview of several fundamental questions in control of switched systems, including stability and stabilizability properties for controlled switching, autonomous switching, and so on. From a higher-level perspective, switched systems serve as a particular class of hybrid systems, which involve both continuous and logical dynamics [16, 32, 33, 53].

Among the rich literature regarding control systems, optimal control is one of the most important topics apart from stability and stabilization problems. In particular, optimal control aims to optimize certain performance criteria associated with a control system by finding the best possible inputs. Depending on the problem setup, there are in general two types of solution notions for an optimal control problem: the open-loop solution and the closed-loop solution. For problems where initial conditions are given, the open-loop solution is commonly adopted, referring to a function of input over the given time horizon. Such an open-loop solution is typically independent of the real-time state (or output) information and needs to be re-computed once the initialization changes. The closed-loop solution, also known as a feedback law, is usually adopted for problems without a specifically given initial condition. Such a solution notion can be viewed as a function mapping from the system state (or output) to an admissible input. Provided such a closed-loop solution, once the current system state (or output) is available, the optimal input can be efficiently computed by directly evaluating the feedback law at the given state (or output). In fact, an open-loop solution can be viewed as a special case of a closed-loop solution evaluated along the corresponding state trajectory. Both types of solutions have been extensively studied in the literature in both continuous-time and discrete-time settings.

Optimal control of switched systems, which is the topic studied in this dissertation, is known to be challenging. On one hand, most existing approaches for solving optimal control problems require complete and accurate knowledge of the system dynamics, which is difficult to obtain, especially in the presence of multiple subsystems. On the other hand, even when the system dynamics are fully given, solving optimal control of switched systems is in general prohibitive, mainly due to the existence of discrete input signals that makes the problem combinatorial in nature. In order to develop practically significant solutions to switched optimal control problems, both issues need to be addressed. In this dissertation, we investigate potential solutions to the aforementioned issues.

1.2 Literature Review

1.2.1 Classical Optimal Control

Optimal control of classical dynamical systems has been investigated for decades. From the theoretical perspective, abstract mathematical conditions ensuring existence of optimal solutions, as well as necessary and sufficient conditions characterizing optimality of any given solution, have been developed in the literature [12, 21, 45, 41]. Among them, Pontryagin’s Maximum Principle is perhaps the most well-known necessary condition for optimality, and the Hamilton-Jacobi-Bellman equation, when solved globally, provides a necessary and sufficient condition for an optimum. The Hamilton-Jacobi-Bellman equation is essentially a nonlinear partial differential equation, and solving it globally is in general impractical. Theoretical studies of this partial differential equation have been conducted in the seminal works [6, 24, 25].

Practically, optimal control of continuous-time systems is often considered with certain initial conditions, and hence the goal is to find open-loop solutions. Such an optimal control problem is typically reformulated as an infinite-dimensional optimization problem [10, 30, 92]. In this case, the first order necessary conditions for optimality of the corresponding optimization problem can be directly translated into necessary conditions for optimality of the original optimal control problem. However, such conditions are typically weaker than the well-known Pontryagin’s Maximum Principle, as pointed out in [63]. Tractable numerical solutions to the infinite-dimensional optimization problems are available in the literature. In particular, similar to the finite-dimensional case, first order algorithms have been developed based on directional derivatives. Alternative approaches include sequential linear programming, sequential quadratic programming, and others.

Optimal control for discrete-time systems is another important topic, which is more influential for practical implementations because almost all computer systems operate in a discrete-time fashion. Similar to the continuous-time scenario, finding open-loop solutions in the discrete-time case can be formulated as a finite-dimensional optimization problem, which can be efficiently solved via numerous existing optimization techniques. The more interesting case in the discrete-time scenario is the problem of finding closed-loop solutions. Such a problem has been extensively investigated for decades [12, 41]. Both Pontryagin’s Maximum Principle and the Hamilton-Jacobi-Bellman equation have their discrete-time counterparts. In particular, Bellman’s equation, the discrete-time counterpart of the Hamilton-Jacobi-Bellman equation, gives a necessary and sufficient condition for optimality. Furthermore, dynamic programming [12] offers a powerful tool for solving Bellman’s equation. Multiple algorithms have been proposed for numerically implementing dynamic programming. Among them, value iteration and policy iteration are the two most well-known algorithms; convergence results for both have been established under mild conditions [13]. Despite the great success of dynamic programming, this classical technique suffers from two famous “curses”, namely the “curse of dimensionality” and the “curse of modeling”, referring, respectively, to the exponential growth of the computational power and memory needed for solving dynamic programming and to the difficulty of finding an accurate system model of the underlying physical phenomena. Furthermore, dynamic programming is often performed in an off-line fashion. All these features are undesirable when solving practical problems and hence greatly limit the applicability of dynamic programming algorithms. To develop tractable solutions, approximate versions of dynamic programming have been extensively studied in the literature [14, 64].
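As a concrete illustration of value iteration (policy iteration instead alternates policy evaluation and policy improvement), here is a minimal sketch for a finite-state, finite-action problem; the transition tensor `P`, stage cost `c`, and discount factor are hypothetical placeholders, not objects from this dissertation.

```python
import numpy as np

def value_iteration(P, c, gamma=0.95, tol=1e-8):
    """Value iteration for a finite-state, finite-action problem.
    P[a, s, t] is the probability of moving from state s to state t
    under action a; c[s, a] is the stage cost."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q(s, a) = c(s, a) + gamma * E[V(next state)]
        Q = c + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.min(axis=1)  # minimize over actions (cost setting)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)  # value function, greedy policy
        V = V_new
```

The exponential growth of `P` and `V` with the state dimension is precisely the curse of dimensionality mentioned above.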

In addition to the research on general conditions and properties of optimal control problems, there is another important research category in optimal control focusing on simple problems for which analytical properties of the optimal solution can be derived. The most famous case is the linear quadratic regulation problem, for which the optimal closed-loop control law is known to be linear in the state and the optimal cost function is quadratic in the state [75]. It is widely known that such an optimal solution can be efficiently computed via the algebraic Riccati equation, provided the system dynamics model, namely the system and cost matrices, is available. For certain classes of simple hybrid systems, the exact analytical structure of the optimal cost function and optimal control law has been derived as well [8, 15, 16, 103, 104, 105].
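For concreteness, a minimal sketch of that computation in the discrete-time case is given below; this is the standard LQR Riccati recursion, with placeholder matrices `A`, `B`, `Q`, `R`, not code from this dissertation.

```python
import numpy as np

def lqr_riccati(A, B, Q, R, max_iter=1000, tol=1e-10):
    """Iterate the discrete-time Riccati recursion
    P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA
    to convergence; return P and the gain K of the optimal law u = -Kx."""
    P = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ (A - B @ K)
        done = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if done:
            break
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K
```

The optimal cost is then $x_0^\top P x_0$, quadratic in the state, and the control law $u = -Kx$ is linear in the state, as stated above.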

1.2.2 Optimal Control of Continuous-time Switched Systems

The first problem studied in this dissertation is optimal control for continuous-time switched systems with a given initial condition. Such an optimal control problem requires finding both the continuous control input and the discrete switching signal over a specified time horizon. In addition to the difficulties of finding such open-loop solutions for traditional optimal control problems, the problem of interest has several distinct challenges, mainly due to the discrete nature of the switching signal, which makes the problem combinatorial. As a result, existing optimal control and optimization techniques cannot be directly applied.

Numerous approaches for solving this optimal control problem have been investigated in the literature. On one hand, from the theoretical perspective, the Maximum Principle has been extended to hybrid systems to characterize necessary optimality conditions for optimal hybrid control solutions [62, 71, 76, 77]. However, it is difficult to numerically compute the optimal solutions based on these abstract conditions [100]. On the other hand, from the practical perspective, several approaches have been proposed in the literature to numerically solve the problem. Most existing algorithms fall into two main categories: the two-stage optimization method and the embedding-based method.

The two-stage optimization method proposed in [97, 98, 99, 100, 101] aims to address the aforementioned challenge by iterating between two subproblems. At the first stage, a switching mode sequence is given and fixed, and the performance index is optimized over the switching instants and continuous inputs. At the second stage, the mode sequence is updated to improve the system performance. The first stage problem is typically formulated as a finite-dimensional optimization problem that is efficiently solvable via classical techniques if there is no continuous input involved. Such an optimization formulation becomes infinite-dimensional if continuous inputs are incorporated. A considerable amount of research has been devoted to finding solutions to the first stage problem, mainly focusing on designing specific algorithms for solving the infinite-dimensional optimization problem [5, 27, 34, 35, 51, 52]. It should be noted that the underlying optimization space is continuous, and hence such an optimization problem is solvable via classical first order algorithms for infinite-dimensional problems. For the second stage problem, various schemes have been considered in the literature for updating the switching mode sequence. Nonetheless, such schemes are mainly heuristic, and there is no general way of updating the switching modes with rigorous performance guarantees. Consequently, due to such restrictions on possible mode sequences, solutions obtained through this approach may be unsatisfactory.

More recently, an alternative approach based on the so-called embedding principle has been investigated [9, 86, 87, 95]. Recall that the main issue preventing us from directly applying classical infinite-dimensional optimization techniques is the discrete nature of the switching input. The key idea of the embedding-based approach to address this issue involves three steps. First, the discrete input space is relaxed to a continuous one. Then, the optimal control problem associated with the relaxed, continuous input space is solved. Finally, solutions to the relaxed problem are projected back to generate original solutions. Such an approach originates from the idea of relaxed optimal control problems, which is an important topic in the classical optimal control literature [10, 30, 31, 92]. Traditional relaxed optimal control considers the probability measure over the original input space as the new control input. Both the original and the relaxed optimal control problems can be reformulated as infinite-dimensional optimization problems. It has been shown in the literature that solutions to the relaxed problem exist under weaker conditions and are relatively easier to compute via classical infinite-dimensional optimization approaches [10, 30]. In addition, under certain mild conditions, any relaxed solution can be approximated arbitrarily well by an original one. This is widely known as the chattering lemma in the literature. However, the original version of this so-called chattering lemma reported in [10] cannot be directly applied to the optimal control of switched systems. In [9], the authors proved an extended version of the traditional chattering lemma that is compatible with switched systems, which has since been fundamental for most existing embedding-based algorithms. The corresponding relaxed problem is solved using the classical Maximum Principle. More recently, an alternative algorithm was proposed that solves the relaxed optimal control problem via first-order gradient-based optimization techniques [86]. A constructive wavelet-based approximation was also proposed as a numerical implementation of the extended chattering lemma.

1.2.3 Optimal Control and Reinforcement Learning of Discrete-time Switched Systems

Optimal control of discrete-time switched systems has also been studied in the literature for decades. In particular, by focusing on simple linear subsystems with quadratic cost, the analytical structure of the optimal cost function and the associated optimal control law can be derived. Based on an extension of traditional linear quadratic regulation results, namely the algebraic Riccati equation and the Riccati recursion, the optimal cost for the finite-horizon problem can be exactly characterized by the pointwise minimum of finitely many quadratic functions, and the associated optimal control law is piecewise linear [103, 104, 105]. Furthermore, it has been shown that the finite-horizon value function converges to the infinite-horizon value function, which does not hold for general optimal control problems. However, due to the combinatorial nature of the problem, finding such an exact characterization, i.e., all candidate matrices defining the quadratic functions involved in the minimization, is prohibitive, in fact NP-hard in general. In order to develop numerically tractable solutions, numerous approximate dynamic programming techniques have been proposed in the literature [11, 64]. One of the widely adopted approximation approaches builds upon rigorous analysis of the exact analytical structure of the optimal solutions and constructs approximate solutions with sub-optimality guarantees. This approximation approach has been applied to the optimal control of switched linear systems with quadratic cost for removing redundant quadratic functions in the definition of the optimal cost function in [103, 105]. Nevertheless, such an approximation technique requires knowledge of the analytical structure of the exact optimal solutions, which is in general difficult to obtain.
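Concretely, for a discrete-time switched linear system $x_{k+1} = A_{v_k} x_k + B_{v_k} u_k$ with mode-dependent quadratic stage costs, the finite-horizon value functions characterized in [103, 104, 105] take the min-of-quadratics form (the notation here is illustrative):

$$V_k(x) = \min_{P \in \mathcal{H}_k} x^\top P x,$$

where each $\mathcal{H}_k$ is a finite set of positive semidefinite matrices generated backward in time by applying the mode-wise Riccati mapping to every matrix in $\mathcal{H}_{k+1}$:

$$\mathcal{H}_k = \Big\{\, Q_i + A_i^\top P A_i - A_i^\top P B_i \big( R_i + B_i^\top P B_i \big)^{-1} B_i^\top P A_i \;\Big|\; P \in \mathcal{H}_{k+1}, \; i \in \{1, \dots, n_v\} \,\Big\}.$$

The number of candidate matrices can thus grow by a factor of $n_v$ at every step, which is why computing the exact characterization is prohibitive.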

Another important topic studied in approximate dynamic programming is to consider simulation-based approaches with parameterized function approximators. Such an approach is motivated by several well-known practical difficulties of implementing dynamic programming, including the famous dual curses, i.e., the curse of modeling and the curse of dimensionality, and the fact that dynamic programming is often performed in an offline manner. To be more specific, the main idea is to use parametric approximators as representations of value functions and to develop rules for updating the parameters based on data samples collected from interaction with the environment or a simulator. This simulation- and approximation-based approach is in fact the core of reinforcement learning [3, 78], a.k.a. adaptive dynamic programming [96] or neuro-dynamic programming [14].

Numerous reinforcement learning schemes have been proposed in the literature, such as Q-learning [93], actor-critic [43], direct policy search [79], and so on. Both linear and nonlinear parametric function approximations have been considered in the literature. In particular, it has been shown that a linear approximator suffices for simple linear quadratic regulation problems [20]. More recently, reinforcement learning for linear systems has been extensively studied in both continuous-time and discrete-time settings [44, 46, 91]. In [91], a policy iteration algorithm for optimal control of continuous-time linear systems with partially known dynamics was proposed. In particular, an iterative solution to the algebraic Riccati equation without knowledge of the system’s internal dynamics was developed. In [58], the authors extended the above results from regulation problems to tracking problems, adopting a similar approach. Optimal control of discrete-time linear systems has also been studied in [42]. All the aforementioned results on reinforcement learning for linear systems adopt a quadratic cost/reward and take advantage of the fact that the optimal value function is quadratic and characterized by the algebraic Riccati equation. More recently, these approaches have been extended to general nonlinear dynamics [48, 84, 94]. The dominating research direction in model-free reinforcement learning is to use deep neural networks as the nonlinear parameterized function approximator [57, 69, 70] and to use advanced training algorithms to update the deep neural network parameters, such as deep deterministic policy gradient (DDPG) [48], proximal policy optimization (PPO) [70], trust region policy optimization (TRPO) [69], and so on. Recent developments along this research direction have demonstrated impressive successes in solving some challenging practical problems [56, 57, 72].

1.3 Preview of Main Results and Contributions

The primary focus of this dissertation is to study optimal control of switched systems. Two problems are considered: the first studies optimal control of general switched nonlinear systems in the continuous-time setting, while the second studies optimal control of discrete-time switched linear systems using reinforcement learning.

In the first part of this dissertation, we focus on optimal control of general continuous-time switched nonlinear systems. In particular, we consider the aforementioned embedding-based approach. Instead of constructing new specific algorithms, we provide a framework unifying the understanding and analysis of different embedding-based algorithms, based on a novel topological viewpoint. In particular, by formulating both the original switched optimal control problem and the associated relaxed optimal control problem as infinite-dimensional optimizations, the relationship between these two problems can be viewed as a change of topology over the underlying optimization space. In fact, the classical relaxed optimal control problem can be viewed as optimizing under the weak-star topology over the relaxed control space. Nonetheless, the combinatorial nature of the switched optimal control problem precludes directly using the weak-star topology. By considering a more general topological notion, namely the weak topology, we show that most of the embedding-based algorithms can be unified into a general framework based on weak topology. Specifically, most existing embedding-based algorithms adopt the weak topology induced by the state trajectory. The proposed framework allows for alternative choices of topologies according to the particular underlying problem. In a nutshell, our framework constitutes a weak topology over the optimization space dictating the embedding procedure, an algorithm solving the relaxed optimization problem, and a projection operator generating the desired solution. Different selections of these components result in different embedding-based switched optimal control algorithms. We also derive a set of conditions on these components so that the resulting algorithm converges to a stationary point of the original problem under the selected weak topology. Two case-study examples are provided, illustrating the importance of choosing appropriate weak topologies according to the underlying optimal control problem and how the developed framework can be used to analyze and design embedding-based algorithms.

In the second part of this dissertation, we study the optimal control problem for discrete-time switched linear systems using reinforcement learning. Instead of assuming complete knowledge of the switched system model, we assume access to a simulator which can generate the successor state and the associated cost given any continuous state, continuous input, and discrete input (mode) at each time instant. Through interaction with such a simulator, we can collect data samples of state-input-cost tuples that will be used for updating the value function. According to the previous discussion, the value function of this switched optimal control problem can be approximated arbitrarily well by a piecewise quadratic function. Furthermore, this function can always be written as the pointwise minimum of a finite number of quadratic functions. Taking advantage of these properties, we develop a specific Q-learning algorithm by constructing a particular Q-function approximator that explicitly incorporates this analytical structure, together with an associated scheme for updating the parameters used in the approximator.
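To sketch the kind of approximator this suggests (the names, shapes, and discretized input search below are illustrative assumptions, not the dissertation's exact construction): the Q-function is represented by a finite family of quadratics per mode, and both evaluation and greedy selection reduce to pointwise minima.

```python
import numpy as np

def q_approx(z, P_list):
    """Parametric Q-function approximator: the pointwise minimum of a
    finite family of quadratics at the stacked vector z = (x, u)."""
    return min(float(z @ P @ z) for P in P_list)

def greedy_action(x, P_lists, u_candidates):
    """Greedy selection of (mode, input): minimize the approximate
    Q-value over modes and a finite set of candidate inputs."""
    best = None
    for v, P_list in enumerate(P_lists):
        for u in u_candidates:
            z = np.concatenate([x, u])
            q = q_approx(z, P_list)
            if best is None or q < best[0]:
                best = (q, v, u)
    return best[1], best[2]  # chosen mode and continuous input
```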

Updating these parameters is in fact crucial in the proposed learning algorithm. It is shown that, due to the particular value function structure, updating the proposed function approximator is essentially an unsupervised learning problem that is NP-hard to solve in general. By lifting the data samples into a higher-dimensional space, it is shown that the lifted data samples lie in several linear subspaces of the ambient space, again due to the particular value function structure. As a result, the parameter update can be performed by first clustering the data samples according to the underlying subspaces and then identifying the corresponding quadratic function for each cluster. The first step of this approach is essentially a subspace clustering problem, which has been extensively studied in the computer vision literature [28, 29, 65, 80, 81, 88, 89, 102]. Various algorithms for solving the subspace clustering problem have been proposed and analyzed in the literature. Among them, the sparse subspace clustering algorithm [28, 29] exploits the self-representativeness property of data samples. Such an approach originates from the observation that each data sample can be written as a linear combination of all other data samples. By promoting the sparsity of such a representation, one hopes that the resulting sparse representation would only use data samples belonging to the same subspace. Conditions guaranteeing the performance of the sparse subspace clustering algorithm have been extensively analyzed in the literature [73, 74]. In addition, classical K-means clustering algorithms have been extended to subspace scenarios [1, 19], and an improved algorithm with rigorous analysis was given recently [50]. Nonetheless, since most subspace clustering algorithms are developed in the context of computer vision, conditions guaranteeing correct clustering are unlikely to be satisfied for our problem. Moreover, the subspace clustering reformulation only utilizes the piecewise quadratic property of the value function structure, which is in fact weaker than the exact pointwise-minimum-of-quadratics property.

An alternative geometry-based approach is proposed for updating the parameters. Such an approach fully exploits the geometric structure of the underlying model. Instead of looking at the original set of data samples, we focus on a transformed data set which encodes an explicit geometric structure. Under some standard assumptions widely adopted in the unsupervised learning literature, the performance of the proposed algorithm is demonstrated numerically. Unfortunately, due to the underlying optimal control problem, such assumptions are not satisfied in general; in particular, the data distribution is skewed. A heuristic re-sampling technique for improving the performance is proposed. How to further modify this geometric approach to accommodate the skewed distribution with rigorous analysis remains open. Three case studies are provided to demonstrate the performance and limitations of the proposed algorithm.

1.4 Organization

In Chapter 2, we study the optimal control problem for continuous-time switched nonlinear systems. We focus on the embedding-based approach, which is motivated by relaxed optimal control and the infinite-dimensional reformulation technique. A novel viewpoint of this approach from the topological perspective is provided, which yields a framework unifying the analysis and design of various embedding-based algorithms. In Chapter 3, we shift our attention to optimal control of discrete-time switched linear systems. Different from the classical optimal control problem, we drop the assumption that knowledge about the system dynamics is accessible. A novel Q-learning algorithm based on theoretical results from the optimal control literature is developed. Concluding remarks and potential future research directions are summarized in Chapter 4.

Chapter 2: A Weak-Topology based Framework for Optimal Control of Continuous-time Switched Systems

2.1 Introduction

In this chapter, we study the optimal control problem for general continuous-time nonlinear switched systems. Apart from the difficulties in solving classical optimal control problems, such a switched optimal control problem suffers from additional challenges due to the existence of a discrete input, which makes the problem combinatorial in nature. Specifically, we consider the problem of finding both the continuous and discrete control inputs to jointly minimize a certain performance criterion for a given initial state. One of the most widely adopted approaches for solving this optimal control problem aims to find the so-called open-loop solution by reformulating the problem as an infinite-dimensional optimization problem. Traditionally, such an optimization problem can be solved via well-established off-the-shelf optimization tools. However, existing methods do not directly apply to our combinatorial problem.

To address this issue, we focus on the embedding-based approach. Instead of trying to develop novel algorithms, we construct a framework that unifies the understanding and analysis of different embedding-based algorithms based on a novel topological viewpoint. The main contributions of this chapter lie in two aspects. First, the proposed framework offers a unified weak topology formulation of switched optimal control problems which includes most existing embedding-based approaches as special cases. Second, the proposed framework provides more freedom to choose weak topologies, optimization algorithms, and projection operators, which expands the applicability of the embedding-based approach.

This chapter unfolds as follows: Chapter 2.2 formulates the switched optimal control problem and briefly introduces the relaxed optimal control problem. Chapter 2.3 reviews several important concepts and results in weak topology and infinite-dimensional optimization for classical optimal control problems. Our main results are given in Chapter 2.4, presenting the general framework and corresponding convergence analysis. Two numerical examples demonstrating the importance of weak topologies in switched optimal control problems and the usage of the proposed framework are presented in Chapter 2.5. Concluding remarks and possible future works are given in Chapter 2.6.

2.2 Problem Formulation and Preliminaries

In this chapter, we consider the following general continuous-time nonlinear switched system:

$$\dot{x}(t) = f_{v(t)}(t, x(t), u(t)), \quad x(0) = x_0, \quad t \in [0, T], \tag{2.1}$$

where $x(t) \in X \subset \mathbb{R}^{n_x}$ is the system state, $u(t) \in U \subset \mathbb{R}^{n_u}$ is the continuous control input, and $v(t) \in \Sigma \triangleq \{1, 2, \dots, n_v\}$ is the discrete switching signal which determines the active subsystem (mode) at every time instant $t$.

The above switched system (2.1) can be rewritten as a general nonlinear system with both continuous and discrete inputs. To see this, let

$$D \triangleq \Big\{ (d_1, \dots, d_{n_v}) \in \{0, 1\}^{n_v} \subset \mathbb{R}^{n_v} \;\Big|\; \sum_{i=1}^{n_v} d_i = 1 \Big\}$$

be the set of corners of the $n_v$-simplex; then (2.1) is equivalent to the following nonlinear system:

$$\dot{x} = \sum_{i=1}^{n_v} d_i(t) f_i(t, x(t), u(t)) \triangleq f(t, x(t), u(t), d(t)), \quad x(0) = x_0, \quad t \in [0, T], \tag{2.2}$$

where $d(t) = [d_1(t), \dots, d_{n_v}(t)] \in D$ is the discrete control input encoding the switching signal.
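For instance, with $n_v = 2$ modes, $D = \{(1, 0), (0, 1)\}$ and (2.2) reads

$$\dot{x} = d_1(t) f_1(t, x(t), u(t)) + d_2(t) f_2(t, x(t), u(t)),$$

so $d(t) = (1, 0)$ activates subsystem 1 and $d(t) = (0, 1)$ activates subsystem 2, recovering exactly the switching behavior of (2.1).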

The optimal control problem we consider in this chapter aims to find both the continuous control input, denoted by $u(\cdot): [0, T] \to U$, and the discrete control input $d(\cdot): [0, T] \to D$, to jointly minimize the following cost functional:

$$J(u, d) = \int_{t=0}^{T} \ell(x(t), u(t), d(t))\, dt + \ell_F(x(T)), \tag{2.3}$$

subject to (2.2) with initial condition $x(0) = x_0$. In the above cost functional, $\ell(\cdot, \cdot, \cdot): X \times U \times D \to \mathbb{R}_+$ is called the running cost function, which penalizes the system trajectory and control efforts, and $\ell_F(\cdot): X \to \mathbb{R}_+$ is the terminal cost function, which only penalizes the terminal state. In addition, we assume that the following state and continuous input constraints are imposed as well:

$$x(t) \in \Omega_x \triangleq \big\{ x \in X \;\big|\; g_x(x) \le 0 \big\}, \quad \forall t \in [0, T], \tag{2.4a}$$
$$u(t) \in \Omega_u \triangleq \big\{ u \in U \;\big|\; g_u(u) \le 0 \big\}, \quad \forall t \in [0, T]. \tag{2.4b}$$

The following standard assumptions are adopted throughout this chapter to ensure the existence and uniqueness of the state trajectory of system (2.2) and the well-posedness of the optimal control problem.

Assumption 2.1. The following Lipschitz conditions are assumed throughout this chapter.

1. $f(t, x, u, d)$ is locally Lipschitz continuous with respect to $x$ and $u$ for any given $d$, with a common Lipschitz constant $L$;

2. $g_x(x)$ and $g_u(u)$ are locally Lipschitz continuous with respect to their arguments, with a common Lipschitz constant $L$.

Here, we assume a common Lipschitz constant $L$ to simplify notation. All the results in this chapter extend immediately to the case where these functions have different Lipschitz constants.

In the context of this chapter, we assume that all control inputs belong to the following $\mathcal{L}^2$ space, which can be interpreted as the space of all functions with finite energy. In the literature, there are other possible choices of the underlying space, such as the space of essentially bounded functions, i.e., the $\mathcal{L}^\infty$ space, or the space of piecewise continuous functions.

Definition 2.1. We say a function $r: [0, T] \to R \subseteq \mathbb{R}^n$ belongs to $\mathcal{L}^2([0, T], R)$ if

$$\|r\|_{\mathcal{L}^2} \triangleq \left( \int_0^T \|r(t)\|_2^2 \, dt \right)^{\frac{1}{2}} < \infty, \tag{2.5}$$

where the integration is taken with respect to the Lebesgue measure.

Provided the above definition, let $\mathcal{U} = \mathcal{L}^2([0, T], U)$ be the space of continuous control inputs and let $\mathcal{D} = \mathcal{L}^2([0, T], D)$ be the space of discrete control inputs. Then, the optimal control problem of switched systems studied in this chapter is given below.

Problem 2.1. Given the switched system (2.2) and the cost functional (2.3), solve

$$\inf_{(u, d) \in \mathcal{U} \times \mathcal{D}} J(u, d) = \int_{t=0}^{T} \ell(x(t), u(t), d(t))\, dt + \ell_F(x(T)), \tag{2.6a}$$

subject to $\dot{x} = f(t, x(t), u(t), d(t)), \; x(0) = x_0, \; \forall t \in [0, T]$, (2.6b)

$g_x(x(t)) \le 0, \; g_u(u(t)) \le 0, \; \forall t \in [0, T]$. (2.6c)

Note that the above optimal control problem is combinatorial in nature due to the existence of the discrete input $d \in \mathcal{D}$, which prevents us from directly applying existing infinite-dimensional optimization techniques. One of the most well-known methods to address this issue is the embedding-based approach, which originates from the relaxed optimal control problem [10, 30, 92]. The key idea of relaxed optimal control is to consider the set of probability measures over the original control space as the new control set. There is a profound and rich literature on relaxed optimal control problems, mainly focusing on establishing conditions guaranteeing existence of solutions to the relaxed optimal control problem and relationships between solutions to the original problem and the relaxed problem. Here, we provide a very concise review of some main results in relaxed optimal control.

Consider a generic nonlinear system

$$\dot{x} = f(t, x(t), u(t)), \quad x(0) = x_0, \quad \forall t \in [0, T],$$

where $u(t) \in U$ is the control input. Under Assumption 2.1.1, the relaxed system is given by

$$\dot{x} = F(t, x(t))\mu(t),$$

where $F(t, x)\mu \triangleq \int_U f(t, x, u)\, \mu(du)$ and $\mu$ is a probability measure over $U$. Suppose the following state constraint is added to the problem:

$$x(t, \mu) \in X_c(t), \quad \forall t \in [0, T],$$

and suppose the cost functional can be defined for relaxed controls $\mu$, denoted by $J(\mu)$. Let $J^* = \inf_\mu J(\mu)$; the minimizing sequence is defined below.

Definition 2.2. $\{\mu^n(\cdot)\}$ is called a minimizing sequence if

1. $\lim_{n \to \infty} \operatorname{dist}(x(t, \mu^n), X_c(t)) = 0$;

2. $\limsup_{n \to \infty} J(\mu^n) \le J^*$.

With the above minimizing sequence, the existence of a relaxed optimal control can be established.

Theorem 2.1. Assume $f(t, x, u)$ satisfies Assumption 2.1.1 and the cost functional $J(\mu)$ is weakly lower-semicontinuous. Assume in addition that

- $J^* \in (-\infty, \infty)$;

- the state constraint set $X_c(t)$ is closed for all $t$.

Let $\{\mu^n(\cdot)\}$ be a minimizing sequence of relaxed controls such that $\{x^n(\cdot, \mu^n)\}$ is uniformly bounded. Then there exists a relaxed optimal control $\mu^*$ which is the weak limit of a subsequence of $\{\mu^n(\cdot)\}$.

The above existence result is a very particular version of the generic results provided in [30]. In fact, the notion of a minimizing sequence is crucial both theoretically and practically. Many iterative algorithms for solving the optimal control problem can actually be viewed as methods for constructing minimizing sequences.

The existence of an original optimal control has also been discussed. In addition, it has been shown that any relaxed solution can be approximated arbitrarily well by an original solution due to the following chattering lemma.

Lemma 2.1. Let the original system satisfy Assumption 2.1.1 and let $\mu(\cdot)$ be such that the state trajectory $x(t, \mu)$ of the relaxed problem exists on $[0, T]$. Then, for any $\epsilon > 0$, there exists a piecewise constant original control $u(\cdot)$ defined on $[0, T]$ such that the solution $x(t, u)$ to the original system exists and

$$\|x(t, u) - x(t, \mu)\| \le \epsilon, \quad \forall t \in [0, T].$$

However, these results are not directly applicable to our problem. In the context of optimal control for switched systems, due to the particular structure of the discrete input space, the set of probability measures over $D$ is in fact equivalent to the convex hull of $D$. Therefore, simply replacing $D$ with $\operatorname{Co}(D)$ in the original problem (2.6) results in a standard optimal control problem.

As mentioned above, both the switched optimal control problem (2.6) and the relaxed one can be formulated as infinite-dimensional optimizations. To see this, further letting $\mathcal{S} = \mathcal{U} \times \mathcal{D}$ be the hybrid input space and $\mathcal{R} = \mathcal{U} \times \mathcal{D}_r$ with $\mathcal{D}_r = \mathcal{L}^2([0, T], \operatorname{Co}(D))$ be the relaxed input space, we can rewrite the two problems in the following abstract forms:

$$P_{\mathcal{S}}: \quad \inf_{s \in \mathcal{S}} J(s) \quad \text{subject to} \quad \Psi(s) \le 0. \tag{2.7}$$

$$P_{\mathcal{R}}: \quad \inf_{r \in \mathcal{R}} J(r) \quad \text{subject to} \quad \Psi(r) \le 0. \tag{2.8}$$

Now, the above relaxed problem (2.8) is a standard infinite-dimensional optimization problem over a continuous space, which can then be solved using existing tools. The above-described approach is commonly referred to as the embedding-based approach in the literature on optimal control of switched systems. Although existing results in the relaxed optimal control literature on relationships between original and relaxed solutions do not apply directly to the embedding-based approach, the traditional chattering lemma has actually been extended to switched optimal control scenarios.

In this chapter, we develop a framework that unifies most existing embedding-based algorithms. Such a framework views the embedding-based approaches from a novel topological perspective. In particular, the relationship between the continuous optimization and the original combinatorial one is precisely characterized by a weak topology induced by the underlying switched optimal control problem.

Remark 2.1. It is worth mentioning that, although the above embedding-based framework is valid, finding an optimal solution to the relaxed problem is challenging in its own right. In fact, the resulting relaxed problem is in general an infinite-dimensional nonlinear optimization problem. Existence of optimal solutions to such a problem is typically characterized by abstract conditions that are difficult to check and do not lead to implementable algorithms. Following a similar idea used in finite-dimensional nonlinear optimization problems, first order necessary optimality conditions can be constructed for checking the quality of any given solution. Additionally, numerous algorithms utilize the derivative information involved in those first order conditions.

Remark 2.2. Recall that the infinite-dimensional optimization is just a reformulation of our original optimal control problem. Although the aforementioned first order conditions for optimization can be directly used as optimality conditions for the optimal control problem, they are actually different. To be exact, these first order conditions are weaker than the celebrated Pontryagin’s Maximum Principle. Detailed discussions about the relationship between these two kinds of optimality conditions can be found in [63]. In addition, it should be emphasized that the existence of optimal solutions to both the relaxed optimal control problem and the original switched optimal control problem is not assumed. As suggested by the discussions in [41], such existence results are challenging to obtain. An alternative philosophy is to numerically search for potential solutions and verify them using optimality conditions.

In the next section, we first briefly review some important concepts and results in weak topology and infinite-dimensional optimization that will be fundamental in our later discussions.

2.3 Weak Topologies and Infinite-dimensional Optimization

As discussed above, both the original and the relaxed optimal control problems can be formulated as infinite-dimensional optimization problems. Establishing the relationship between solutions to these problems is central to our development. In this section, we first provide a short mathematical introduction to topologies, which is fundamental to all the following discussions. Then, some important concepts and results for infinite-dimensional optimization are briefly reviewed.

2.3.1 Topologies and Weak Topologies

Topology is one of the most fundamental notions in mathematics; it characterizes open sets. The first thing to emphasize is that working almost exclusively with Euclidean spaces, especially $\mathbb{R}^n$, leads one to ignore the careful selection of the underlying topology, since the Euclidean norm or metric naturally induces a strong and well-behaved topology. However, we do not have the luxury of using such a strong topology in our problem.

To formally introduce the weak topology concept and associated results useful for our later discussion, the generic topology definition, which essentially defines open sets, is first given below.

Definition 2.3 (Topology). Let $X$ be a set and $\mathcal{P}(X)$ be the power set of $X$. Let $\mathcal{O} \subset \mathcal{P}(X)$ be a collection of subsets of $X$ such that

- $\emptyset \in \mathcal{O}$ and $X \in \mathcal{O}$;

- $\mathcal{O}$ is closed under finite intersections, i.e., if $O_1, \dots, O_m \in \mathcal{O}$ with $m < \infty$, then $\bigcap_{i=1}^{m} O_i \in \mathcal{O}$.

Then $\mathcal{T} = \big\{ \bigcup_{O \in \mathcal{O}'} O \;\big|\; \mathcal{O}' \subset \mathcal{O} \big\}$ is a topology on $X$, and $(X, \mathcal{T})$ is called a topological space.

The following example helps make this concept clearer.

Example 2.1. Consider the set $X = \{a, b\}$ containing only two points. Then, all possible topologies defined on this set are given by

- $\mathcal{T} = \{\emptyset, X\}$;

- $\mathcal{T} = \{\emptyset, X, \{a\}\}$;

- $\mathcal{T} = \{\emptyset, X, \{b\}\}$;

- $\mathcal{T} = \{\emptyset, X, \{a\}, \{b\}\}$.

Note that, for any non-empty set $X$, $\mathcal{T}_{id} = \{\emptyset, X\}$ and $\mathcal{T}_{di} = \mathcal{P}(X)$ are always two topologies over $X$. Commonly, $\mathcal{T}_{id}$ is called the indiscrete topology and $\mathcal{T}_{di}$ is called the discrete topology. It is then apparent that any other topology $\mathcal{T}$ over $X$ must satisfy $\mathcal{T}_{id} \subset \mathcal{T} \subset \mathcal{T}_{di}$. In addition, given two topologies $\mathcal{T}_1$ and $\mathcal{T}_2$ over $X$, we say $\mathcal{T}_1$ is weaker (or coarser) than $\mathcal{T}_2$ (or $\mathcal{T}_2$ is stronger (or finer) than $\mathcal{T}_1$) if $\mathcal{T}_1 \subset \mathcal{T}_2$.

In topological spaces, continuity is typically characterized by the openness of preimages of open sets under functions. Mathematically, let $g: X \to Y$ be a function mapping from one topological space $(X, \mathcal{T}_x)$ to another $(Y, \mathcal{T}_y)$. Then, $g$ is said to be continuous if for any $O_y \in \mathcal{T}_y$ we have $g^{-1}(O_y) \in \mathcal{T}_x$. With this continuity definition, we are ready to introduce the following key weak topology definition.

Definition 2.4 (Weak Topology). Let $G = \{g_i\}_{i \in I}$ be a family of functions $g_i: X \to Y_i$, $\forall i \in I$, mapping from a set $X$ to topological spaces $(Y_i, \mathcal{T}_i)$, respectively. The weak topology on $X$ induced by $G$, denoted by $\mathcal{T}_G$, is the collection of all unions of finite intersections of sets of the form $g_i^{-1}(O_i)$, where $i \in I$ and $O_i \in \mathcal{T}_i$.

In other words, the weak topology induced by $G$ is the weakest topology on $X$ that makes all functions $g_i$, $i \in I$, continuous.
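For instance, anticipating the framework of Chapter 2.4, one natural choice in the switched optimal control setting is to let $G$ consist of the single map sending a hybrid input to its state trajectory,

$$g: (u, d) \mapsto x(\cdot\,; u, d),$$

so that a sequence of inputs converges in $\mathcal{T}_G$ exactly when the corresponding state trajectories converge; as discussed in Chapter 1, this is the weak topology implicitly adopted by most existing embedding-based algorithms.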

Remark 2.3. Note that the above weak topology definition should not be confused with another commonly used notion called the weak topology on $X$, denoted by $\mathcal{T}_X^{w}$. Such a topology requires $X$ to be a normed space and is defined to be the weak topology induced by $X^*$, the dual space of $X$, i.e., the space containing all continuous linear functions from $X$ to its base field. Such a topology is an important topic studied in the topology literature. There is another, more relevant, topology, called the weak$^*$ (weak-star) topology, denoted by $\mathcal{T}_{X^*}^{w^*}$, which is defined on the dual space $X^*$ by all elements in $X$. Such a definition originates from the observation that any element of $X$ can be viewed as a bounded linear function on $X^*$. This weak$^*$ topology is in fact the mathematical foundation behind the idea of considering relaxed optimal control problems. In particular, the weak$^*$ topology provides an exact interpretation of the relationship between solutions to the original problem and the relaxed problem.

Remark 2.4. The structure of the weak topology $\mathcal{T}_G$ is determined by the family of functions $G = \{g_i\}_{i \in I}$. In particular, when $X$ is a normed vector space, the underlying topology people implicitly assume is the weak topology induced by the norm function $\|\cdot\|$.¹ Such a norm topology is commonly called the strong topology on $X$.

Equipped with the above topological notions, convergence of sequences in a topological space is defined below.

Definition 2.5. Let $(X, \mathcal{T})$ be a topological space. A sequence $\{x^n\}_{n \in \mathbb{N}}$ converges to $x \in X$ if and only if

$$\forall O \in \mathcal{T} \text{ such that } x \in O, \; \exists N \in \mathbb{N} \text{ such that } \forall n \ge N, \; x^n \in O.$$

Theorem 2.2. Let $\{x^n\}_{n \in \mathbb{N}}$ be a sequence in $X$. It converges to $x \in X$ in the weak topology $\mathcal{T}_G$ if and only if

$$\forall i \in I, \quad \lim_{n \to \infty} g_i(x^n) = g_i(x).$$

Theorem 2.3. Let $(Z, \mathcal{T})$ be a topological space, and let $\varphi: Z \to X$ be a mapping from $Z$ to another topological space $(X, \mathcal{T}_G)$. Then $\varphi$ is continuous for the topologies $\mathcal{T}$ and $\mathcal{T}_G$ if and only if, for every $i \in I$, $g_i \circ \varphi$ is continuous.

Theorem 2.2 establishes convergence of sequences in weak topologies using the functions defining the weak topologies. Such a convergence notion has actually been used extensively, sometimes implicitly, in the optimization literature. For example, as stated in [18], for unconstrained minimization problems, a sequence of solutions is called convergent if the corresponding cost sequence converges. Adopting the topological notion developed above, this condition can be interpreted as convergence of the sequence of solutions in the weak topology induced by the cost function. It is worth mentioning that, for finite-dimensional normed vector spaces, the norm topology coincides with the weak topology. Hence, by Theorem 2.3, there is no need to distinguish between convergence in the weak topology induced by the cost function and convergence in the norm topology, as long as the cost function is continuous with respect to the norm topology, which is a standard assumption.

¹Many norms can be defined on a space $X$, and each of them induces a norm topology. In this chapter, we assume $\mathcal{T}_{\|\cdot\|_X} = \mathcal{T}_{\|\cdot\|_{\mathcal{L}^2}}$ if $X$ is a function space and $\mathcal{T}_{\|\cdot\|_X} = \mathcal{T}_{\|\cdot\|_2}$ if $X$ is a Euclidean space.

In fact, topologies play a central role in general optimization problems. Specifically, the topology adopted over the underlying optimization space defines the open sets, which in turn affect the definitions of neighborhoods and local minimizers. In other words, adopting different topologies over the underlying optimization space may result in drastically different solutions. Hence, it is critical to use an appropriate topology when solving an optimization problem. In the sequel, we provide a general overview of the relationship between optimal control and infinite-dimensional optimization over topological spaces.

2.3.2 Infinite-dimensional Optimization for Optimal Control

Based on our previous discussion, both the original and the relaxed problems can be formulated as infinite-dimensional optimization problems. In this subsection, a general introduction to infinite-dimensional optimization will be provided.

Let $(U, \mathcal{T})$ be a topological space and let $J: U \to \bar{\mathbb{R}}$ be a (cost) functional. An abstract infinite-dimensional optimization problem is typically given by

$$\min_{u \in U} J(u). \tag{2.9}$$

Two kinds of solution notions are of interest, namely the local and global minimizers.

Definition 2.6. A point $u \in U$ is called

1. a local minimizer, if there exists an open set $O \in \mathcal{T}$ such that $u \in O$ and $J(u) \le J(u')$, $\forall u' \in O$;

2. a global minimizer, if $J(u) \le J(u')$, $\forall u' \in U$.

It is obvious from the above generic definition of minimizers that the underlying topology $\mathcal{T}$ plays a central role in characterizing the set of local minimizers. Different choices of $\mathcal{T}$ may result in drastically different sets of local minimizers. Equipped with the weak topology notion given in Definition 2.4, we are particularly interested in the local minimizers associated with a certain weak topology $\mathcal{T}_G$.

Definition 2.7. $u^* \in U$ is a local minimizer under topology $\mathcal{T}_G$ if there exists an $\epsilon > 0$ such that $J(u^*) \le J(u')$ for all $u'$ satisfying $\|g_i(u^*) - g_i(u')\| \le \epsilon$ for all $i \in I$.

This definition is in fact a restatement of the first part of Definition 2.6 obtained by fixing the general topology $\mathcal{T}$ to be a weak topology $\mathcal{T}_G$. This treatment enables us to manipulate the set of local minimizers of interest by selecting appropriate functions defining the weak topology.

Existence of solutions to the generic optimization problem can be argued in an abstract way, relying on two basic properties: compactness and lower-semicontinuity.

Definition 2.8. Let $(U, \mathcal{T})$ be a topological space, and let $J: U \to \bar{\mathbb{R}}$. The cost functional $J$ is lower semicontinuous at $u \in U$ if

$$J(u) \le \sup_{O \in \mathcal{T}:\, u \in O} \; \inf_{u' \in O} J(u').$$

In addition, if we further assume $U$ is a metric space, this definition can be equivalently characterized by sequences. Specifically, $J$ is lower semicontinuous at $u \in U$ if

$$J(u) \le \liminf_{n \to \infty} J(u^n)$$

for all sequences $\{u^n\}$ converging to $u$.

Theorem 2.4. Let $J : U \to \bar{\mathbb{R}}$ be lower semicontinuous and let $\{u \in U \mid J(u) \le M\}$ be non-empty and compact for some $M \in \mathbb{R}$. Then the infinite-dimensional optimization problem (2.9) has a global minimum.

In practice, it is in general difficult to find the solution based on these abstract conditions. Usually, iterative algorithms are implemented to obtain solutions that verify certain optimality conditions. In particular, the first-order necessary condition for optimality is one of the most widely adopted optimality conditions. Such a necessary condition is built upon derivatives of the cost functional and of any constraints, which are used in those iterative algorithms as well. For finite-dimensional problems, derivatives are defined in the classical manner. However, for infinite-dimensional problems, a more general derivative notion needs to be defined.

Definition 2.9. Suppose $(X, \mathcal{T}_X)$ and $(Y, \mathcal{T}_Y)$ are locally convex topological spaces. Then, the Gâteaux differential $dF(u; \psi)$ of $F : X \to Y$ at $u \in O \in \mathcal{T}_X$ in the direction $\psi \in X$ is given by
$$dF(u; \psi) = \lim_{\tau \to 0} \frac{F(u + \tau\psi) - F(u)}{\tau} = \left.\frac{d}{d\tau} F(u + \tau\psi)\right|_{\tau = 0}. \quad (2.10)$$
If the limit exists for all $\psi \in X$, then $F$ is Gâteaux differentiable at $u$.

In addition, if $X$ and $Y$ are normed vector spaces, then $F$ is called Fréchet differentiable if there exists a bounded linear operator $A : X \to Y$ such that
$$\lim_{\tau \to 0} \frac{\|F(u + \tau) - F(u) - A\tau\|_Y}{\|\tau\|_X} = 0, \quad (2.11)$$
where $F'(u) = A$ is called the Fréchet derivative at $u$.

Based on the above differential (derivative) notions, both necessary and sufficient conditions for local minimizers can be derived [54]. Here, we assume in addition that the cost functional $J$ in (2.9) is continuously Fréchet-differentiable on $U$. Then, the first-order necessary condition for optimality is given below.

Proposition 2.1. Let $u^*$ be a local minimizer for (2.9). Then $J'(u^*) = 0$.

Sufficient conditions based on second-order derivatives can also be derived, and all the above discussions can be extended to the constrained case with additional treatment.

Here, we have only reviewed several fundamental concepts and results in infinite-dimensional optimization. There is actually a much deeper connection between infinite-dimensional optimization and optimal control problems [30, 63]. In Chapter 2.5, we will provide a detailed construction of a first-order necessary optimality condition for a constrained problem based on the Fréchet derivative introduced above.

2.4 A Unified Framework for Switched Optimal Control Problems

In this section, we formally develop our unified topology based framework for analyzing and designing algorithms for switched optimal control problems. Such a framework is based on a novel weak topological viewpoint and classical results in infinite-dimensional optimization.

To begin with, it is widely accepted that solutions to optimization or optimal control problems are typically given by local minimizers as defined in Definition 2.7, given the underlying topology. Moreover, for general problems, such local minimizers are difficult to characterize. To address this issue, necessary and sufficient conditions have been extensively studied in the literature for characterizing these local solutions. Among the rich literature, first-order necessary conditions based on gradient information and second-order sufficient conditions based on Hessian (second-order derivative) information are the most widely used. Possible constraints involved in the optimization can be handled by introducing the tangent cone and considering the Lagrangian. Extensions of these conditions to infinite-dimensional problems have been established in the literature using the notions of the Gâteaux differential and the Fréchet derivative.

To avoid going into too many details when presenting our main results, we focus on the abstracted problems in (2.7) and (2.8). In addition, we construct our framework based on the mathematical optimization model developed in [63]. One of the most important features of such a model lies in the introduction of an optimality function, which provides an abstract representation of various necessary optimality conditions.

The formal definition of the optimality function is given as follows.

Definition 2.10. A function $\theta_S(\cdot) : S \to \mathbb{R}$ is an optimality function for $P_S$ in (2.7) if:

1. $\theta_S(s) \le 0$ for all $s \in S$;

2. if $s^*$ is a local minimizer of $P_S$, then $\theta_S(s^*) = 0$.

Remark 2.5. Oftentimes, the optimality function is required to be continuous (or upper semicontinuous) [63]. Such a condition is introduced to ensure that, in a topological space, if $s^*$ is an accumulation point of a sequence $\{s^n\}_{n \in \mathbb{N}}$ and $\liminf_{n \to \infty} \theta_S(s^n) = 0$, then $\theta_S(s^*) = 0$. However, in our problem we do not assume the existence of accumulation points of the sequence $\{s^n\}_{n \in \mathbb{N}}$. Hence, the continuity (or upper semicontinuity) condition is not necessary.

First of all, the above definition can be immediately extended to other cases, especially the relaxed problem $P_R$ in (2.8). We denote by $\theta$ a generic optimality function and by $\theta_R$ the optimality function for $P_R$ in the rest of this chapter without repeating the definition. In addition, the above optimality function definition is very abstract and not necessarily informative. For example, $\theta \equiv 0$ always qualifies as a valid optimality function for any problem, yet it contributes nothing toward solving the optimization problem. How to construct informative optimality functions is an important research topic per se. Commonly, as discussed in [63], there is a general procedure for converting first-order necessary conditions into a corresponding optimality function. An example describing how to construct an optimality function using a first-order necessary condition will be provided in the next section.

Recall that the relaxed problem (2.8) is a classical infinite-dimensional optimization problem and hence can be solved via existing algorithms. Adopting the optimality function based framework, we call $(\Gamma_R, \theta_R)$ an algorithm for the relaxed problem, where $\Gamma_R : R \to R$ is the iterative update rule. Such an algorithm is called admissible if, given any initial condition $r^0$, it generates a sequence $\{r^n\}_{n \in \mathbb{N}}$ in $R$ via
$$r^{n+1} = \begin{cases} \Gamma_R(r^n), & \text{if } \theta_R(r^n) < 0, \\ r^n, & \text{if } \theta_R(r^n) = 0, \end{cases} \quad (2.12)$$
and ensures $\lim_{n \to \infty} \theta_R(r^n) = 0$.

Based on the aforementioned background, our goal is to construct admissible optimization algorithms $(\Gamma_S, \theta_S)$ for the combinatorial problem $P_S$ (2.7). Specifically, we provide a unified framework for constructing such algorithms based on the relaxed algorithms $(\Gamma_R, \theta_R)$ and a novel weak topology-based viewpoint of the chattering lemma.

The topology based framework involves the following three key steps.

1. Relax the optimization space $S$ to a vector space $R$, select a weak topology function $g : R \to Y$, and construct a projection operator $R_k : R \to S$ associated with the weak topology $\mathcal{T}_g$.

2. Solve the relaxed optimization problem $P_R$ defined in (2.8) by designing a relaxed optimality function $\theta_R : R \to \mathbb{R}$ under the selected weak topology $\mathcal{T}_g$, and selecting (or constructing) a relaxed optimization algorithm $\Gamma_R : R \to R$.

3. Set $\theta_S = \theta_R|_S$ and $\Gamma_S = R_k \circ \Gamma_R$ with any initial condition $s^0 \in S$.

The main underlying idea of the proposed framework is to transform the switched optimization problem $P_S$ into a classical optimization problem $P_R$, which can be solved through classical optimization methods in functional spaces [41, 63]. The solution of $P_R$ will then be used to construct the solution to the original problem $P_S$. The key components of the framework include the relaxed optimization space $R$, the weak topology $\mathcal{T}_g$, the projection operator $R_k$, and the relaxed optimization algorithm characterized by $\theta_R$ and $\Gamma_R$.

In the rest of this section, we first show that $\theta_S = \theta_R|_S$ is an optimality function for $P_S$ and then derive conditions on the aforementioned key framework components that guarantee convergence of the sequence $\{s^n\}_{n \in \mathbb{N}}$ to a stationary point characterized by $\theta_S$.

2.4.1 Convergence Analysis and Proofs

The following assumptions on $R$, $\mathcal{T}_g$ and $R_k$ are adopted in the framework to ensure its validity.

Assumption 2.2.

1. $J(\cdot)$ and $\Psi(\cdot)$ are Lipschitz continuous under topology $\mathcal{T}_g$ with a common Lipschitz constant $L$.

2. $S$ is dense in $R$ under $\mathcal{T}_g$, i.e., $\forall r \in R$, $\forall \epsilon > 0$, $\exists s \in S$ s.t. $\|g(r) - g(s)\|_Y \le \epsilon$.

3. There exists a projection operator $R_k : R \to S$ associated with $\mathcal{T}_g$ and parametrized by $k = 1, 2, \ldots$, such that $\forall r \in R$, $\forall \epsilon > 0$, there exists a $\hat{k} \in \mathbb{N}$ such that
$$\|g(R_k(r)) - g(r)\|_Y \le e_{R_k}(k) \le \epsilon, \quad \forall k \ge \hat{k}. \quad (2.13)$$

Assumption 2.2.1 is a restatement of Assumption 2.1.2, which ensures the well-posedness of the problems $P_S$ and $P_R$. Assumptions 2.2.2 and 2.2.3 specify conditions on the relaxed space and weak topology that can be used in the framework.

In the following lemma, we show that $\theta_S = \theta_R|_S$ is an optimality function for $P_S$ if $\theta_R$ is a valid optimality function for $P_R$.

Lemma 2.2. If θR is a valid optimality function for PR, then θS = θR|S is a valid optimality function for PS .

Proof. To prove this lemma, we need to show that $\theta_S$ satisfies the conditions in Definition 2.10.

The first condition is trivially satisfied.

For the second condition, suppose $s^* \in S$ is a local minimizer for $P_S$ but $\theta_S(s^*) < 0$. Since $\theta_R(s^*) = \theta_S(s^*)$, by the definition of local minimizers for $P_R$, it follows that there exist an $r$ and a positive number $C$ such that $J(r) - J(s^*) \le -C$ and $\Psi(r) \le 0$. By Assumption 2.2, we have $|J(R_k(r)) - J(r)| \le L\|g(R_k(r)) - g(r)\|_Y \le L e_{R_k}(k)$. By adding and subtracting $J(r)$, it follows that
$$J(R_k(r)) - J(s^*) \le |J(R_k(r)) - J(r)| + J(r) - J(s^*) \le L e_{R_k}(k) - C. \quad (2.14)$$
Let $\epsilon = \frac{C}{2L}$; it follows that $L e_{R_k}(k) - C \le -\frac{C}{2} < 0$ for $k \ge k_1$ for some $k_1 \in \mathbb{N}$. Hence we have $J(R_k(r)) - J(s^*) < 0$ for $k \ge k_1$, which contradicts the assumption.

The constraint evaluation can be divided into two cases: Ψ(r) < 0 and Ψ(r) = 0.

For the first case, an argument similar to the discussion on the cost part can be applied, yielding that $\Psi(R_k(r)) \le 0$ for all $k \ge k_2$ for some $k_2 \in \mathbb{N}$. Set $\hat{k}_1 \ge \max\{k_1, k_2\}$; hence $J(R_k(r)) - J(s^*) < 0$ and $\Psi(R_k(r)) \le 0$ for all $k \ge \hat{k}_1$, which contradicts the assumption that $s^*$ is a local minimizer for $P_S$.

For the second case, suppose $\Psi(r) = 0$ for all $r \in \{s \in R \mid \Psi(s) \le 0\}$ such that $J(r) - J(s^*) \le -C$; then $\Psi(s) = 0$ for all $s \in \{s \in S \mid \Psi(s) \le 0\}$ such that $J(s) - J(s^*) \le -C$, which again contradicts the assumption that $s^*$ is a local minimizer for $P_S$.

To show the convergence of $\{s^n\}_{n \in \mathbb{N}}$, we adopt an idea similar to the sufficient descent property presented in [5]. In order to handle the projection step and the state constraints, we define two functions $\Phi : R \times \mathbb{N} \to \mathbb{R}$ and $\Delta : R \times R \to \mathbb{R}$ below:
$$\Phi(r, k) \triangleq \begin{cases} \max\big\{J(R_k \circ \Gamma_R(r)) - J(\Gamma_R(r)),\ \Psi(R_k \circ \Gamma_R(r)) - \Psi(\Gamma_R(r))\big\}, & \text{if } \Psi(r) \le 0, \\ \Psi(R_k \circ \Gamma_R(r)) - \Psi(\Gamma_R(r)), & \text{if } \Psi(r) > 0, \end{cases} \quad (2.15)$$
$$\Delta(r_1, r_2) \triangleq \begin{cases} \max\{J(r_2) - J(r_1),\ \Psi(r_2)\}, & \text{if } \Psi(r_1) \le 0, \\ \Psi(r_2) - \Psi(r_1), & \text{if } \Psi(r_1) > 0. \end{cases} \quad (2.16)$$
The function $\Phi$ compactly characterizes the change of the cost $J$ and the constraint $\Psi$ at a point $s$ under the projection operator $R_k$. For a feasible point, we care about the changes of both the cost and the constraint under $R_k$. For an infeasible point, we only care about the change of the constraint. The function $\Delta$ characterizes the value difference of $J$ and $\Psi$ between two points $r_1$ and $r_2$. If $r_1$ is feasible and $\Delta < 0$, the cost can be reduced while maintaining feasibility by moving from $r_1$ to $r_2$. Similarly, if $r_1$ is infeasible and $\Delta < 0$, then it is possible to reduce the infeasibility by moving from $r_1$ to $r_2$.

Under Assumption 2.2, the following bound on the function Φ can be established:

Lemma 2.3. There exists a $k^* \in \mathbb{N}$ such that, given $\omega \in (0, 1)$, for any $C > 0$, $\gamma_C > 0$, and for any $s \in S$ with $\theta_S(s) < -C$, we have
$$\Phi(s, k) \le (\omega - 1)\gamma_C\, \theta_S(s), \quad \forall k \ge k^*. \quad (2.17)$$

Proof. This is a straightforward result from Assumption 2.2.1, Assumption 2.2.2 and Lemma 2.2.

Employing the definition of the function $\Delta$ and the above two lemmas, our main result on the convergence of $\{s^n\}_{n \in \mathbb{N}}$ is presented below.

Theorem 2.5. If for each $C > 0$ there exists a $\gamma_C > 0$ such that for any $r \in R$ with $\theta_R(r) < -C$,
$$\Delta(r, \Gamma_R(r)) \le \gamma_C\, \theta_R(r) < 0, \quad (2.18)$$
then, for an appropriate choice of $k$ for $R_k$ and for any $s^0 \in S$, the following two conclusions hold:

1. if there exists an $n_0 \in \mathbb{N}$ such that $\Psi(s^{n_0}) \le 0$, then $\Psi(s^n) \le 0$ for all $n \ge n_0$;

2. $\lim_{n \to \infty} \theta_S(s^n) = 0$, i.e., the sequence $\{s^n\}_{n \in \mathbb{N}}$ converges asymptotically to a stationary point.

Proof. 1. Suppose there exists an $n_0$ such that $\Psi(s^{n_0}) \le 0$. Then, for $k \ge k^*$,
$$\begin{aligned} \Psi(s^{n_0+1}) &= \Psi(R_k \circ \Gamma_R(s^{n_0})) - \Psi(\Gamma_R(s^{n_0})) + \Psi(\Gamma_R(s^{n_0})) - \Psi(s^{n_0}) + \Psi(s^{n_0}) \\ &\le (\omega - 1)\gamma_C\, \theta_S(s^{n_0}) + \gamma_C\, \theta_R(s^{n_0}) + 0 \\ &= \omega\gamma_C\, \theta_S(s^{n_0}) < 0, \end{aligned} \quad (2.19)$$
where the inequality is due to Lemma 2.3 and the assumption in Theorem 2.5.

2. We need to consider two cases due to the different forms of $\Delta$ for different values of $\Psi$.

Case 1: $\Psi(s^n) > 0$ for all $n \in \mathbb{N}$, i.e., the entire sequence is infeasible.

Suppose $\lim_{n \to \infty} \theta_S(s^n) \ne 0$. Since $\theta_S(\cdot)$ is a non-positive function, there must exist $C > 0$ such that $\liminf_{n \to \infty} \theta_S(s^n) = -2C$. Hence, there exist an infinite subsequence $\{s^{n_m}\}$, where $m$ is the sub-index of the subsequence, and an $m_1 \in \mathbb{N}_+$ such that $\theta_S(s^{n_m}) < -C$ for all $m \ge m_1$. Then, for all $m \ge m_1$ and for $k \ge k^*$, we have
$$\begin{aligned} \Psi(s^{n_m+1}) - \Psi(s^{n_m}) &= \Psi(R_k \circ \Gamma_R(s^{n_m})) - \Psi(\Gamma_R(s^{n_m})) + \Psi(\Gamma_R(s^{n_m})) - \Psi(s^{n_m}) \\ &\le (\omega - 1)\gamma_C\, \theta_S(s^{n_m}) + \gamma_C\, \theta_R(s^{n_m}) \\ &= \omega\gamma_C\, \theta_S(s^{n_m}) < 0, \end{aligned} \quad (2.20)$$
where the inequality is due to Lemma 2.3 and the assumption in Theorem 2.5. This leads to $\liminf_{m \to \infty} \Psi(s^{n_m}) = -\infty$, which contradicts the lower boundedness of $\Psi$ implied by Assumption 2.1 and Assumption 2.2.

Case 2: There exists an $n_0$ such that $\Psi(s^{n_0}) \le 0$.

By the first conclusion, it follows that $\Psi(s^n) \le 0$ for all $n \ge n_0$. Suppose $\liminf_{n \to \infty} \theta_S(s^n) \ne 0$; then there exists $C > 0$ such that $\liminf_{n \to \infty} \theta_S(s^n) = -2C$. Hence, there exist an infinite subsequence $\{s^{n_m}\}$ and an $m_1 \in \mathbb{N}_+$ such that $\theta_S(s^{n_m}) < -C$ for all $m \ge m_1$. Then, for all $m \ge m_1$ and all $k \ge k^*$, we have
$$\begin{aligned} J(s^{n_m+1}) - J(s^{n_m}) &= J(R_k \circ \Gamma_R(s^{n_m})) - J(\Gamma_R(s^{n_m})) + J(\Gamma_R(s^{n_m})) - J(s^{n_m}) \\ &\le (\omega - 1)\gamma_C\, \theta_S(s^{n_m}) + \gamma_C\, \theta_R(s^{n_m}) \\ &= \omega\gamma_C\, \theta_S(s^{n_m}) < 0, \end{aligned} \quad (2.21)$$
where the inequality is due to Lemma 2.3 and the assumption in Theorem 2.5. This leads to $\liminf_{m \to \infty} J(s^{n_m}) = -\infty$, which contradicts the lower boundedness of $J$ implied by Assumption 2.1 and Assumption 2.2.

2.5 Case Studies

In this section, we provide two numerical examples to illustrate the importance of viewing the switched optimal control problem from the topological perspective and the usage of the proposed framework.

2.5.1 Problem with terminal cost

We first provide a numerical example to show the importance of choosing an appropriate topology in solving switched optimal control problems.

Consider the following 2-dimensional switched system with 3 modes:
$$\dot{x}(t) = f(x(t)) = \sum_{i=1}^{3} d_i(t) f_i(x(t)), \quad t \in [0, 2], \quad (2.22)$$
where
$$f_1(x) = \begin{bmatrix} 0 \\ -1.5 \end{bmatrix}, \quad f_2(x) = \begin{bmatrix} -1.5 \\ 0 \end{bmatrix}, \quad f_3(x) = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} x.$$

Suppose no constraint is imposed and only the terminal state is penalized according to $J(x) = x_1 x_2 \exp(-x_1^2 - x_2^2 - 0.3x_1)$. The cost function admits a local minimum at $A = [0.6361, -0.7071]^T$ and a global minimum at $B = [-0.7861, 0.7071]^T$, and the contour plot of the cost function is shown in Fig. 2.1. Since the cost function only penalizes the terminal state and there is no constraint, it is actually inappropriate to adopt the weak topology induced by the entire trajectory, denoted by $\mathcal{T}_x$. The weak topology induced by the terminal state function, $\mathcal{T}_F$, and the weak topology induced by the cost function, $\mathcal{T}_J$, are two better candidates for this particular problem. To illustrate this point, we implement two algorithms solving this problem, one designed under $\mathcal{T}_x$ and the other constructed under $\mathcal{T}_J$.

In the numerical simulation, the initial state is given by $x(0) = [1, 1.5]^T$ and the initial switching input is set to $d(0) = [0.7, 0.2, 0.1]^T$. The trajectories of the terminal states generated by the algorithms under $\mathcal{T}_x$ and $\mathcal{T}_J$ are shown in Fig. 2.1. It can be observed that the terminal states generated by the algorithm under $\mathcal{T}_x$ converge to the local minimum but not the global one, whereas the terminal states generated by the algorithm under $\mathcal{T}_J$ are able to jump away from the local minimum and converge to the global one. The reason is that the choice of weak topology determines the definition of local minimizers, which in turn affects the optimality function in the proposed framework. In particular, $A$ is a local minimum under $\mathcal{T}_x$, but it is no longer a local minimum under $\mathcal{T}_J$.
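For concreteness, the following minimal sketch (our own illustration, not the code used for the experiments) shows how the embedded system (2.22) can be rolled out under a relaxed input $d(t)$ with forward-Euler integration; the step size and the constant relaxed input are assumptions of the sketch.

```python
import numpy as np

def f_modes(x):
    """Vector fields f_1, f_2, f_3 of the three modes in (2.22)."""
    return [np.array([0.0, -1.5]),
            np.array([-1.5, 0.0]),
            np.array([-x[0], x[1]])]

def simulate_embedded(d, x0, T=2.0, dt=1e-3):
    """Euler-integrate x' = sum_i d_i(t) f_i(x); d maps t to a point on the 3-simplex."""
    x, t = np.asarray(x0, dtype=float), 0.0
    while t < T:
        x = x + dt * sum(di * fi for di, fi in zip(d(t), f_modes(x)))
        t += dt
    return x  # terminal state x(2), to be scored by the terminal cost J

# Example: the initial relaxed input d(0) = [0.7, 0.2, 0.1]^T held constant in time.
x_T = simulate_embedded(lambda t: np.array([0.7, 0.2, 0.1]), x0=[1.0, 1.5])
```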

Through this example, it is clear that selecting an appropriate underlying topology plays a crucial role in solving switched optimal control problems using the embedding-based approach. In the next subsection, we provide a more realistic example to illustrate the usage of the proposed framework.

2.5.2 Problem with mode-dependent cost

In this subsection, we consider an optimal control problem with mode-dependent cost. It is shown that the classical topology induced by the state trajectory of the original system fails to serve as the appropriate underlying topology in this problem as well. Different from the previous case, such a topology is actually too weak.

Figure 2.1: Convergence of terminal states under different topologies

To address this issue, a commonly adopted technique is to augment the system state. We argue that such an augmentation essentially yields a stronger underlying topology, making the new topology feasible.

Consider the Diesel-Electric Submarine (DES) problem studied in [22, 49, 66], which consists of seven subsystems given by:
$$\dot{x}(t) = f(x(t), u(t), d(t)) = \sum_{i=1}^{7} d_i(t)\left(f_i^a(x(t)) - f_i^b(u(t))\right), \quad t \in [0, T], \quad (2.23)$$
where
$$f_i^a(x(t)) = \frac{2 \times 10^4 c_i}{x(t) + 100}, \quad \forall i = 1, 2, \ldots, 7,$$
and
$$f_i^b(u(t)) = \begin{cases} \frac{1}{80} u^2(t) + 5, & \text{if } i = 1, \\ \frac{1}{120} u^2(t) + 8, & \text{if } i = 2, \ldots, 7. \end{cases}$$
In the above formulation, $x(t)$ is the battery charging level, $u(t)$ is the velocity of the submarine, and $T$ is the time horizon. The configurations are summarized in Table 2.1, where $s_n$ is a given constant capturing the noise level of the diesel engine.

Table 2.1: Configurations of the DES system

  i   |  1  |   2    |   3    |   4    |   5    |   6    |   7
 c_i  |  0  | 0.2564 | 0.3333 | 0.5128 | 0.6666 | 0.7692 |   1
 s_i  |  0  |  s_n   | 1.5s_n |  2s_n  |  3s_n  |  3s_n  | 4.5s_n

The DES optimal control problem is given as follows:
$$\begin{aligned} \underset{u,\,d}{\text{Minimize}} \quad & \int_0^T \alpha(x(t) - 100)^2 + \beta \ln(s(t) + 1)\, dt && (2.24a) \\ \text{subject to} \quad & x(t) \in [\underline{x}, \bar{x}],\ u(t) \in [\underline{u}, \bar{u}], && (2.24b) \\ & \int_0^T u(t)\, dt = L, && (2.24c) \end{aligned}$$
where $\alpha$ and $\beta$ are weighting factors and $s(t) = \sum_{i=1}^{7} d_i(t) s_i$. The first term in the cost function accounts for the deviation from the maximum charging level and the second term accounts for the exposure of the vessel due to recharging [66]. The constraint (2.24b) summarizes the battery charging level and submarine speed constraints, and (2.24c) is the submarine route constraint, with $L$ being the distance the submarine needs to travel.

In this problem, the cost function (2.24a) explicitly depends on $s(t)$, which is determined by the switching signal. Therefore, the classical choice of the weak topology induced by the state trajectory does not work for this problem, since the cost function is not Lipschitz continuous under such a weak topology, i.e., Assumption 2.2.1 is not satisfied.

To address this issue, we augment the state to incorporate the effect of the switching signal and adopt the state trajectory based topology induced by the augmented system. Note that, since the system is augmented, the new topology is stronger than the original one.

The DES system (2.23) can be simplified into a 2-mode switched system with an additional continuous input $c(t)$, which is the relaxed discrete input among modes 2 to 7. The new dynamics is given by:
$$\begin{aligned} \dot{x}(t) &= \sum_{i=2}^{7} d_i(t)\left(f_i^a(x(t)) - f_i^b(u(t))\right) - d_1(t) f_1^b(u(t)) \\ &= (1 - d_1(t))\left(\frac{2 \times 10^4 \times c(t)}{x(t) + 100} - f_2^b(u(t))\right) - d_1(t) f_1^b(u(t)) \\ &\triangleq \tilde{f}(x(t), u(t), c(t), d(t)), \end{aligned} \quad (2.25)$$

where $c(t)$ satisfies $\sum_{i=2}^{7} d_i(t) c_i = (1 - d_1(t))\, c(t)$. We augment the original system by introducing three new states encoding the equality constraint, the switching cost and the running cost, respectively, as follows:
$$\dot{x}_1(t) = \tilde{f}(x_1(t), u(t), c(t), d(t)), \quad (2.26a)$$
$$\dot{x}_2(t) = u(t), \quad (2.26b)$$
$$\dot{x}_3(t) = \ln(s(t) + 1), \quad (2.26c)$$
$$\dot{x}_4(t) = (x_1(t) - 100)^2, \quad (2.26d)$$
with constraints:
$$x_1(t) \in [\underline{x}, \bar{x}], \quad u(t) \in [\underline{u}, \bar{u}], \quad c(t) \in [c_2, c_7]. \quad (2.27)$$

The new cost function is given by:
$$J(u, c, d) = \alpha x_4(T) + \beta x_3(T) + \gamma (x_2(T) - L)^2, \quad (2.28)$$

where $\gamma$ is a new weighting factor. The modified DES problem is summarized as:
$$\begin{aligned} \underset{u,\,c,\,d}{\text{Minimize}} \quad & J(u, c, d) \\ \text{subject to} \quad & \text{(2.26a)--(2.26d) and (2.27).} \end{aligned} \quad (2.29)$$
The weak topology induced by the state trajectory of the new system (2.26) is appropriate for the modified DES problem. Now, we construct a first-order algorithm for this particular problem under the guidance of the proposed framework.

1. Relaxed space: for the augmented system, the relaxed control space would be $U_r = \{(u, c) \mid u \in [\underline{u}, \bar{u}],\ c \in [c_2, c_7]\}$ and $D_r = \{(d_1^r, d_2^r) \in [0, 1]^2 \mid d_1^r + d_2^r = 1\}$. Therefore, the function space used in the infinite-dimensional optimization problem is given by $R = \mathcal{U} \times \mathcal{D}_r$, where $\mathcal{U} = L^2([0, T], U_r)$ and $\mathcal{D}_r = L^2([0, T], D_r)$.

2. Weak topology: the weak topology is the one induced by the state trajectory of the augmented system (2.26).

3. Projection operator: the projection operator for the DES problem (2.29) consists of two projections. The first step projects the obtained relaxed input to the 2-mode switching signal, and the second step projects the 2-mode switching signal to the 7-mode switching signal.

The first-step projection $s = R_k^1(r) = (d_1^R(t), d_2^R(t))^T$ is the frequency modulation operator with frequency $2^k$, given by
$$d_1^R(t) = \begin{cases} 1, & \text{if } t \in (T_{i,1}, T_{i,2}], \\ 0, & \text{if } t \in (t_{i-1}, T_{i,1}] \cup (T_{i,2}, t_i], \end{cases} \qquad d_2^R(t) = 1 - d_1^R(t). \quad (2.30)$$
The $T_{i,j}$ above are given by $T_{i,1} = \frac{i-1}{2^k}T + \frac{1}{2}\int_{t_{i-1}}^{t_i} d_2^r(t)\,dt$ and $T_{i,2} = T_{i,1} + \int_{t_{i-1}}^{t_i} d_1^r(t)\,dt$, where $t_i = \frac{i}{2^k}T$. (An illustrative sketch of this first-step projection is given after this list.)

The second projection operator $R_k^2$ is given as follows. For each time period $(T_1, T_2)$, with $\delta t = T_2 - T_1$, obtained from the first-step projection such that $d_2^R(t) = 1$ for $t \in (T_1, T_2)$, and given the solution of $c(t)$ during such a period, the original switching signal $d_k$, $k \in I_k = \{2, 3, \ldots, 7\}$, is given by:
$$\begin{cases} d_j(t) = 1, & t \in (T_1, T_0], \\ d_{-[j]}(t) = 0, & t \in (T_1, T_0], \\ d_{j+1}(t) = 1, & t \in (T_0, T_2), \\ d_{-[j+1]}(t) = 0, & t \in (T_0, T_2), \end{cases} \quad (2.31)$$
where $-[k] = I_k \setminus \{k\}$ is the set of indices except for $k$, $j$ is such that $c_j \delta t \le \int_{T_1}^{T_2} c(t)\,dt \le c_{j+1}\delta t$, and $T_0$ is such that
$$T_0 = \frac{c_{j+1}T_2 - c_j T_1 - \int_{T_1}^{T_2} c(t)\,dt}{c_{j+1} - c_j}. \quad (2.32)$$

The overall projection operator is then given by the composition $R_k = R_k^2 \circ R_k^1$.

4. Optimality function: $\theta_R(s) = \min_{r \in R} DJ(s; r - s)$, where $DH(x; x')$ is the directional derivative of $H$ at $x$ along the direction $x'$.

5. Relaxed algorithm: $\Gamma_R$ is given by $\Gamma_R = \hat{\Gamma}^l$, where $\hat{\Gamma}$ is the standard steepest descent algorithm and $l$ is determined by verifying the condition of Theorem 2.5 as follows:
$$l = \min\{k \in \mathbb{N} \mid J(\hat{\Gamma}^k(s)) - J(s) \le \gamma_C\, \theta_R(s)\}, \quad (2.33)$$
where $\gamma_C$ is the constant in Theorem 2.5.
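As referenced in item 3, the following sketch (our own illustration, assuming $d_1^r$ is available as a callable and using simple trapezoidal quadrature) implements the first-step frequency-modulation projection $R_k^1$ of (2.30): on each dyadic subinterval, the relaxed duty cycle of mode 1 is realized as a single centered pulse whose width matches the integral of $d_1^r$.

```python
import numpy as np

def frequency_modulation(d1_relaxed, T, k, n_quad=200):
    """Return d_1^R(t) as in (2.30); d_2^R(t) is simply 1 - d_1^R(t)."""
    pulses = []
    for i in range(1, 2**k + 1):
        lo, hi = (i - 1) * T / 2**k, i * T / 2**k          # subinterval (t_{i-1}, t_i]
        s = np.linspace(lo, hi, n_quad)
        mass_d1 = np.trapz([d1_relaxed(t) for t in s], s)  # integral of d_1^r
        mass_d2 = (hi - lo) - mass_d1                      # integral of d_2^r = 1 - d_1^r
        T_i1 = lo + 0.5 * mass_d2                          # pulse start T_{i,1}
        pulses.append((T_i1, T_i1 + mass_d1))              # pulse end   T_{i,2}
    return lambda t: 1.0 if any(a < t <= b for a, b in pulses) else 0.0
```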

Proposition 2.2. The components specified as above satisfy the conditions of the topology based framework, i.e. given any initial condition, the sequence of switched inputs generated by the algorithm ΓS = Rk ◦ ΓR converges to a stationary point of this problem.

To prove this proposition, all the conditions in our framework need to be verified. Here, we only sketch the proof.

• Validity of $\theta_R$:

1. $\theta_R(s) = \min_{r \in R} DJ(s; r - s) \le DJ(s; s - s) = 0$;

2. Suppose $s$ is a local minimizer of $P_R$ but $\theta_R(s) < 0$; then $\exists s'$ such that $DJ(s; s' - s) < 0$. By the mean value theorem, $\exists \lambda \in (0, 1)$ such that $J(s + \lambda(s' - s)) - J(s) = \lambda DJ(s; s' - s) + o(\lambda) < 0$, which contradicts the local minimality of $s$ for sufficiently small $\lambda$.

• Assumption 2.2.(2): This condition is guaranteed by the chattering lemma [9, 10].

• Assumption 2.2.(3): The validity of this projection operator is ensured by an argument analogous to the proof of Theorem 1 in [9].

• Condition in Theorem 2.5: This is clearly satisfied due to our construction of $\Gamma_R$.

The DES problem was solved in [66] using an approach similar to the bi-level optimization method [100, 101], which first optimizes the switching instants for a given switching sequence and then updates the switching sequence. In [66], the switching sequence is chosen as $N$ copies of the switching input set. Once the switching instants for the $N$-copied switching sequence are optimized, one more copy of the discrete input set is added and $N$ is increased by 1. In addition, the assumption that the continuous control input $u(t)$ is a piecewise constant function which switches when the discrete-valued input switches was adopted in [66]. This assumption significantly restricts the space of potential solutions to the problem and hence limits the performance of the obtained solution.

We adopt an initialization similar to that in [66], and Euler integration is adopted to approximate the continuous-time state trajectory. Note that the consistent approximation analysis of the Euler integration is beyond the scope of this dissertation and hence omitted. The terminal time $T$ is set to 24, $\alpha = 0.001$, $\beta = 1$, $\gamma = 1$, $\underline{x} = 20$, $\bar{x} = 100$, $\underline{u} = 5$, $\bar{u} = 60$, $L = 840$, $s_n = 10$, and the initial state and control are set as $x_1(0) = 100$ and $u(0) = 30$. The initial relaxed switching signal is given by $d_1 = 1$.

The solution obtained by the constructed switched optimal control algorithm results in a cost of 17.44, which is much smaller than the cost induced by the solution presented in [66]. This example demonstrates how the proposed framework can be used to construct embedding based algorithms for switched optimal control problems.

2.6 Conclusion

In this chapter, we study the optimal control of continuous-time switched systems. A unified topology based framework for designing and analyzing various embedding based algorithms is developed.

Our framework views the embedding based approach as a change of topology over the optimization space. From this perspective, the proposed framework adopts the weak topology structure and develops a general procedure to construct various embedding based switched optimal control algorithms. A set of conditions guaranteeing convergence of the resulting algorithms is provided as well. Examples are shown to demonstrate the importance of selecting an appropriate weak topology and the usage of the proposed framework.

Future work includes the study of universal mechanisms for selecting appropriate weak topologies and other components involved in our framework for particular underlying problems.

Chapter 3: Reinforcement Learning for Switched Linear Systems

3.1 Introduction

In this chapter, we turn our attention to discrete-time switched systems, especially switched linear systems. Different from the previous chapter, for the optimal control problem considered here, we aim to find optimal feedback control laws rather than open-loop control sequences for given initial conditions.

Generally speaking, looking for feedback laws is more challenging than finding open-loop solutions in optimal control problems. Among the rich literature, dynamic programming offers a powerful tool for finding such laws. Another, more recent, approach, originating from the model predictive control literature, is to use multi-parametric programming to find the control laws. Despite these well-known results, it is widely known that there are two main practical challenges in applying these techniques, namely the "curse of modeling" and the "curse of dimensionality", referring to the difficulty of obtaining an accurate system model and the exponential growth of the computation and memory needed for solving the problem, respectively.

In addition to these challenges, the combinatorial nature associated with the switched optimal control problem arises here as well, which actually aggravates the two "curses" even more. In order to develop tractable solutions addressing the aforementioned issues, we consider a model-free reinforcement learning approach that tackles the two "curses" together.

The contributions of this chapter are mainly two-fold. First, this work is among the first to study model-free reinforcement learning for hybrid systems. By assuming a simulator which outputs the successive state and an associated cost given any state-input pair according to a switched linear model, a Q-learning algorithm is developed for finding a (sub-)optimal control policy without any knowledge about the system dynamics. Second, inspired by the analytical structure of optimal control for switched linear systems developed in the control literature, we propose a specific Q-function approximator and an associated update scheme instead of directly applying existing neural network-based algorithms.

The rest of this chapter is organized as follows: the discrete-time optimal control problem for switched systems is formulated in Chapter 3.2. In Chapter 3.3, classical model-based optimal control results are reviewed, whose solution structure will be used in the proposed algorithm. Chapter 3.4 presents the main idea of Q-learning, which is a classical algorithm in model-free reinforcement learning. The proposed Q-learning algorithm and the associated training method are presented in Chapter 3.5. Case studies are given in Chapter 3.6 and concluding remarks are summarized in Chapter 3.7.

3.2 Problem Formulation

In this chapter, we consider the discrete-time version of switched systems, given as follows:
$$x(t + 1) = f_{v(t)}(x(t), u(t)), \quad t \in \mathbb{Z}_+. \quad (3.1)$$

Similar to the previous chapter, $x(t) \in X \subset \mathbb{R}^{n_x}$ is the system state, and $s(t) = (u(t), v(t))$ will be referred to as a hybrid control pair, where $u(t) \in U \subset \mathbb{R}^{n_u}$ and $v(t) \in \Sigma \triangleq \{1, 2, \ldots, n_v\}$ are the continuous and discrete inputs, respectively. In particular, we consider the unconstrained scenario, i.e., we assume $X = \mathbb{R}^{n_x}$ and $U = \mathbb{R}^{n_u}$ in the rest of this chapter. Following standard notions in optimal control, we denote by the functions $\xi_t = (\mu_t, \nu_t) : \mathbb{R}^{n_x} \to \mathbb{R}^{n_u} \times \Sigma$ a hybrid control law, where $\mu_t : \mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$ is called the continuous control law and $\nu_t : \mathbb{R}^{n_x} \to \Sigma$ is called the switching control law. A sequence of hybrid control laws over time forms an infinite-horizon hybrid control policy, denoted by $\pi = \{\xi_0, \xi_1, \ldots\}$, or a finite-horizon hybrid control policy $\pi_N = \{\xi_0, \xi_1, \ldots, \xi_{N-1}\}$ if only a finite horizon is considered. A hybrid control policy is called stationary if $\xi_t = \xi$ for all decision stages $t \in \mathbb{Z}_+$. The performance of a given hybrid control policy $\pi$ (or $\pi_N$) for any given initial state $x(0) = x$ is evaluated by the following infinite-horizon (or finite-horizon) cost function:
$$J^\pi(x) = \lim_{N \to \infty} \sum_{t=0}^{N} \ell(x(t), \mu_t(x(t)), \nu_t(x(t))), \quad \text{subject to (3.1) with } x(0) = x, \quad (3.2a)$$
$$J_N^{\pi_N}(x) = \sum_{t=0}^{N-1} \ell(x(t), \mu_t(x(t)), \nu_t(x(t))) + J_f(x(N)), \quad \text{subject to (3.1) with } x(0) = x, \quad (3.2b)$$
where $J_f$ in (3.2b) is a nonnegative function called the terminal cost, penalizing the terminal state in the finite-horizon scenario, and $\ell(x, u, v)$ is the running cost per decision stage, satisfying the following positivity condition throughout this chapter, which guarantees that the limit in the above cost function is well-defined.

Assumption 3.1 (Positivity). The running cost ` satisfies

$$\ell(x, u, v) \ge 0, \quad \text{for all } (x, u, v) \in X \times U \times \Sigma. \quad (3.3)$$

Denote by $\Pi$ and $\Pi_N$ the sets of all admissible infinite-horizon and finite-horizon control policies; the optimal control problem is given as follows:

Problem 3.1. Find $\pi^*$ (resp. $\pi_N^*$) that minimizes (3.2a) (resp. (3.2b)), or equivalently,
$$(\pi^*, J^*) = \min_{\pi \in \Pi} \lim_{N \to \infty} \sum_{t=0}^{N} \ell(x(t), \mu_t(x(t)), \nu_t(x(t))), \quad \text{subj. to system (3.1) with } x(0) = x, \quad (3.4)$$
$$(\pi_N^*, J_N^*) = \min_{\pi_N \in \Pi_N} \sum_{t=0}^{N-1} \ell(x(t), \mu_t(x(t)), \nu_t(x(t))) + J_f(x(N)), \quad \text{subj. to system (3.1) with } x(0) = x. \quad (3.5)$$

Commonly, the optimal infinite-horizon cost function $J^*$ is referred to as the (infinite-horizon) value function, denoted by $V^* = J^*$, and the optimal finite-horizon cost function is referred to as the finite-horizon value function, denoted by $V_N = J_N^*$.

Solving Problem 3.1 with general nonlinear switched dynamics is known to be hard. In particular, classical optimal control requires complete knowledge about the system dynamics, which is difficult to obtain for practical problems. One possible way to solve Problem 3.2 is to first identify the system dynamics of each subsystem using various system identification techniques and then solve the corresponding optimal control problem based on the identified model. While such an approach is conceptually easy, its performance can be quite sensitive to modeling errors or unmodeled dynamics. A profound understanding of the impact of modeling error on model-based reinforcement learning is very challenging even for standard LQR problems [26]. Motivated by these observations, we focus on a model-free reinforcement learning approach that tackles the two "curses" together by directly searching for a good sub-optimal control policy. Specifically, we assume access to a switched system simulator which can be well approximated by a simple switched linear system. In other words, denoting by $\tilde{f}(x, u, v)$ the simulator dynamics, we assume $\tilde{f}(x, u, v) = A_v x + B_v u$ in this dissertation, where $A_v, B_v$ are unknown to us. Furthermore, the running cost per decision stage is chosen to be the following quadratic form:
$$\ell(x, u, v) = x^T Q_v x + u^T R_v u, \quad (3.6)$$
where $Q_v \in \mathbb{S}_+$ and $R_v \in \mathbb{S}_{++}$ are the penalizing matrices. Given the above discussion, the problem studied in this chapter is stated below.

Problem 3.2. Given the simulator dynamics
$$x^+ = \tilde{f}(x, u, v) = A_v x + B_v u, \quad (3.7)$$
and the quadratic cost function (3.6), find $\pi^*$ that minimizes the infinite-horizon cost function (3.2a), or equivalently,
$$\pi^* = \arg\min_{\pi \in \Pi} \lim_{N \to \infty} \sum_{t=0}^{N} x(t)^T Q_{\nu_t(x(t))} x(t) + \mu_t(x(t))^T R_{\nu_t(x(t))} \mu_t(x(t)), \quad \text{subject to } x(t+1) = \tilde{f}(x(t), \mu_t(x(t)), \nu_t(x(t))), \ \forall t \in \mathbb{Z}_+. \quad (3.8)$$
In the sequel, we first review the classical results from the optimal control literature for solving Problem 3.2. A particular Q-learning algorithm will then be developed that tries to incorporate analytical results from optimal control.
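To make the setting concrete, here is a minimal sketch of the black-box simulator assumed in Problem 3.2 (our own illustration; the class and method names are hypothetical). The matrices $(A_v, B_v)$ live inside the simulator and are never exposed to the learning agent, which only observes the successor state and the incurred running cost.

```python
import numpy as np

class SwitchedLinearSimulator:
    """Black-box oracle for (3.7) and (3.6); A_v, B_v are hidden from the agent."""
    def __init__(self, A, B, Q, R):
        self._A, self._B = A, B   # lists of matrices indexed by the mode v
        self._Q, self._R = Q, R
    def step(self, x, u, v):
        """Return x^+ = A_v x + B_v u and the running cost x'Q_v x + u'R_v u."""
        x_next = self._A[v] @ x + self._B[v] @ u
        cost = float(x @ self._Q[v] @ x + u @ self._R[v] @ u)
        return x_next, cost
```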

3.3 Model-based Optimal Control

In this section, we review the classical model-based optimal control results for the switched linear quadratic regulation (SLQR) problem, Problem 3.2. Dynamic programming serves as the key approach for solving classical optimal control problems. In particular, the value iteration algorithm and its convergence results will be introduced first. Then, the SLQR results based on value iteration are provided, which will be used extensively in our later discussion. At the end of this section, limitations of these classical results will be pointed out, which motivate our model-free solution.

3.3.1 Dynamic Programming

Given information about the system dynamics, the classical optimal control problem, Problem 3.1, is usually solved by the well-known dynamic programming approach. The key idea of using dynamic programming to solve the optimal control problem lies in the well-known Bellman Principle of Optimality [7], which states that any tail policy of an optimal policy remains optimal with regard to the state resulting from the previous decisions. In particular, it has been shown that the infinite-horizon value function $V^*$ satisfies the Bellman Equation, namely

$$V^*(x) = \min_{u,v}\{\ell(x, u, v) + V^*(f(x, u, v))\}, \quad \forall x \in X, \quad (3.9)$$
or equivalently, $V^*$ is a stationary point of the following Value Iteration operator.

Definition 3.1. Denote by $\mathcal{E}^+(X)$ the set of all functions $V : X \to \bar{\mathbb{R}}_+$. Then $\mathcal{T} : \mathcal{E}^+(X) \to \mathcal{E}^+(X)$ is called the Value Iteration operator, given by:
$$(\mathcal{T}V)(x) = \min_{u,v}\{\ell(x, u, v) + V(f(x, u, v))\}. \quad (3.10)$$

Essentially, the Value Iteration operator maps a positive function to another positive function. Given the infinite-horizon value function $V^*$, the optimal stationary policy can be obtained through the minimization on the right-hand side of (3.9). Therefore, the optimal control problem becomes finding the infinite-horizon value function $V^*$. Value Iteration is one of the most well-known algorithms proposed in the dynamic programming literature for finding value functions in both infinite-horizon and finite-horizon scenarios.

Generally speaking, Value Iteration starts with some initialization $V_0 : X \to \bar{\mathbb{R}}_+$ and generates a sequence of functions $\{V_k\}_{k=0}^{\infty}$ according to
$$V_{k+1} = \mathcal{T}V_k \triangleq \min_{u,v}\{\ell(x, u, v) + V_k(f(x, u, v))\}. \quad (3.11)$$

For finite-horizon problems, $V_0$ is typically chosen as the given terminal cost function $J_f$. It is clear that, due to the definition of the Value Iteration operator (3.10) and the Principle of Optimality, the above value iteration (3.11) gives the exact finite-horizon value function provided the aforementioned initialization is fulfilled, i.e., $V_N = \mathcal{T}^N J_f$ for all $N < \infty$.

For infinite-horizon problems, it is expected that the finite-horizon value functions will converge to the infinite-horizon one, i.e., $V^* = \lim_{N \to \infty} V_N$; however, such a relationship is not always true. Conditions guaranteeing the above desired convergence have been given in the literature [11, 12, 13].

Lemma 3.1. Under Assumption 3.1, if $\tilde{V} : X \to (-\infty, \infty]$ satisfies $\tilde{V} \ge \mathcal{T}\tilde{V}$ and $\tilde{V} \ge 0$, then $\tilde{V} \ge V^*$.

Proof. Since $(\mathcal{T}\tilde{V})(x) > -\infty$ for all $x \in X$, for any sequence $\{\epsilon_k\}$ with $\epsilon_k > 0$ there exists an admissible policy $\tilde{\pi} = \{\tilde{\xi}_1, \tilde{\xi}_2, \ldots\}$ such that, for any $x \in X$ and $k$,
$$\ell(x, \tilde{\xi}_k(x)) + \tilde{V}(f(x, \tilde{\xi}_k(x))) \le (\mathcal{T}\tilde{V})(x) + \epsilon_k. \quad (3.12)$$
Then we have, for any $x_0 \in X$,
$$\begin{aligned} V^*(x_0) &= \min_{\pi \in \Pi} \lim_{N \to \infty} \sum_{k=0}^{N-1} \ell(x_k, \xi_k(x_k)) \\ &\le \min_{\pi \in \Pi} \liminf_{N \to \infty}\left[\tilde{V}(x_N) + \sum_{k=0}^{N-1} \ell(x_k, \xi_k(x_k))\right] \\ &\le \liminf_{N \to \infty}\left[\tilde{V}(x_N) + \sum_{k=0}^{N-1} \ell(x_k, \tilde{\xi}_k(x_k))\right]. \end{aligned}$$
By (3.12) and $\tilde{V} \ge \mathcal{T}\tilde{V}$, it follows that
$$\begin{aligned} \tilde{V}(x_N) + \sum_{k=0}^{N-1} \ell(x_k, \tilde{\xi}_k(x_k)) &= \tilde{V}(f(x_{N-1}, \tilde{\xi}_{N-1}(x_{N-1}))) + \sum_{k=0}^{N-1} \ell(x_k, \tilde{\xi}_k(x_k)) \\ &\le \tilde{V}(x_{N-1}) + \sum_{k=0}^{N-2} \ell(x_k, \tilde{\xi}_k(x_k)) + \epsilon_{N-1} \\ &\le \tilde{V}(x_{N-2}) + \sum_{k=0}^{N-3} \ell(x_k, \tilde{\xi}_k(x_k)) + \epsilon_{N-2} + \epsilon_{N-1} \\ &\;\;\vdots \\ &\le \tilde{V}(x_0) + \sum_{k=0}^{N-1} \epsilon_k. \end{aligned}$$
Consequently, we have
$$V^*(x_0) \le \tilde{V}(x_0) + \lim_{N \to \infty} \sum_{k=0}^{N-1} \epsilon_k \quad (3.13)$$
for an arbitrary positive sequence $\{\epsilon_k\}$. By choosing $\{\epsilon_k\}$ such that $\lim_{N \to \infty} \sum_{k=0}^{N-1} \epsilon_k$ is arbitrarily small, the desired result follows.

Lemma 3.2. Under Assumption 3.1, a stationary policy $\xi$ is optimal if and only if $\mathcal{T}V^* = \mathcal{T}_\xi V^*$.

Theorem 3.1 ([13]). Under Assumption 3.1, assume further that $U$ is a metric space and that the sets $U_k(x, \eta)$ given by
$$U_k(x, \eta) = \{u \in U, v \in \Sigma \mid \ell(x, u, v) + V_k(f(x, u, v)) \le \eta\} \quad (3.14)$$
are compact for all $k \ge \bar{k}$ with some integer $\bar{k}$, for all $x \in X$ and $\eta \in \mathbb{R}$, where $\{V_k\}_{k=0}^{\infty}$ is the Value Iteration sequence generated by (3.11) with $V_0 \equiv 0$. Then $\{V_k\}_{k=0}^{\infty}$ converges pointwise to $V^*$ with arbitrary initialization $0 \le V_0 \le V^*$. Furthermore, there exists a stationary optimal policy, i.e., $\pi^* = (\xi^*, \xi^*, \ldots)$.

Proof. 1. Under Assumption 3.1, by $V_0 \le V^*$ we have
$$V_0 \le \mathcal{T}V_0 \le \ldots \le \mathcal{T}^k V_0 \le \ldots \le V^*.$$
By the monotonicity of value iteration and Bellman's equation $V^* = \mathcal{T}V^*$, the above relationship implies
$$\mathcal{T}^k V_0 \le \mathcal{T}^k V^* = V^*.$$
Hence, by the monotone convergence theorem, $\{\mathcal{T}^k V_0\}$ converges to some $V_\infty \le V^*$. By applying value iteration to the relationship
$$\mathcal{T}^k V_0 \le V_\infty \le V^*, \quad \forall k \in \mathbb{N},$$
we have
$$\mathcal{T}^{k+1} V_0(x) = \min_{u,v}\left\{\ell(x, u, v) + \mathcal{T}^k V_0(f(x, u, v))\right\} \le \mathcal{T}V_\infty(x).$$
Taking the limit as $k \to \infty$ yields the following important relationship:
$$V_\infty \le \mathcal{T}V_\infty. \quad (3.15)$$
Now, suppose that there is a state $\tilde{x}$ such that the above inequality is strict, i.e.,
$$V_\infty(\tilde{x}) < (\mathcal{T}V_\infty)(\tilde{x}).$$
Clearly, we have $V_\infty(\tilde{x}) < \infty$. Let $\tilde{\eta} = V_\infty(\tilde{x})$ and consider the following set for $k \ge \bar{k}$:
$$U_k(\tilde{x}, \tilde{\eta}) = \left\{u \in U(\tilde{x}), v \in \Sigma \mid \ell(\tilde{x}, u, v) + (\mathcal{T}^k V_0)(f(\tilde{x}, u, v)) \le \tilde{\eta}\right\}.$$
Due to the compactness assumption, it follows that a hybrid control pair $(u_k, v_k)$ attaining the following minimum exists:
$$(\mathcal{T}^{k+1} V_0)(\tilde{x}) = \min_{u \in U(\tilde{x}),\, v \in \Sigma}\left\{\ell(\tilde{x}, u, v) + (\mathcal{T}^k V_0)(f(\tilde{x}, u, v))\right\} = \ell(\tilde{x}, u_k, v_k) + (\mathcal{T}^k V_0)(f(\tilde{x}, u_k, v_k)).$$
Now, let us focus on the hybrid control sequence $\{u_m, v_m\}_{m=k}^{\infty}$. It follows from $\mathcal{T}^k V_0 \le \mathcal{T}^{k+1} V_0 \le \ldots \le V_\infty$ that
$$\ell(\tilde{x}, u_m, v_m) + (\mathcal{T}^k V_0)(f(\tilde{x}, u_m, v_m)) \le \ell(\tilde{x}, u_m, v_m) + (\mathcal{T}^m V_0)(f(\tilde{x}, u_m, v_m)) \le V_\infty(\tilde{x}), \quad \forall m \ge k.$$
By the definition of $U_k(\cdot, \cdot)$, it is clear that $\{(u_m, v_m)\}_{m=k}^{\infty} \subset U_k(\tilde{x}, V_\infty(\tilde{x}))$. By compactness of this set, there exists at least one $(\tilde{u}, \tilde{v})$ which is a limit point of $\{(u_m, v_m)\}_{m=k}^{\infty}$, and such a limit point of $\{(u_m, v_m)\}_{m=\bar{k}}^{\infty}$ satisfies
$$(\tilde{u}, \tilde{v}) \in \bigcap_{k=\bar{k}}^{\infty} U_k(\tilde{x}, V_\infty(\tilde{x})).$$
Consequently, we have
$$(\mathcal{T}^{k+1} V_0)(\tilde{x}) \le \ell(\tilde{x}, \tilde{u}, \tilde{v}) + (\mathcal{T}^k V_0)(f(\tilde{x}, \tilde{u}, \tilde{v})) \le V_\infty(\tilde{x}).$$
Taking the limit as $k \to \infty$ on both sides of the above inequality, we obtain
$$V_\infty(\tilde{x}) = \ell(\tilde{x}, \tilde{u}, \tilde{v}) + V_\infty(f(\tilde{x}, \tilde{u}, \tilde{v})) \ge (\mathcal{T}V_\infty)(\tilde{x}),$$
which contradicts the assumed strict inequality. Hence, we must have $V_\infty = \mathcal{T}V_\infty$. By Lemma 3.1, we then have $V_\infty \ge V^*$, which combined with $V_\infty \le V^*$ yields $V_\infty = V^*$, and the desired result directly follows.

2. To show the existence of an optimal stationary policy, notice that the argument in part 1 together with $V_\infty = V^*$ implies that $(\tilde{u}, \tilde{v})$ attains the minimum in
$$V^*(\tilde{x}) = \min_{u \in U(\tilde{x}),\, v \in \Sigma}\left\{\ell(\tilde{x}, u, v) + V^*(f(\tilde{x}, u, v))\right\}$$
for any $\tilde{x} \in X$ with $V^*(\tilde{x}) < \infty$. Consequently, the existence of an optimal stationary policy follows from Lemma 3.2.

Moreover, a similar convergence result can be established with an initialization $V_0$ satisfying $V^* \le V_0 \le cV^*$ for some $c > 0$. In such a scenario, the existence of and knowledge about a feasible set $X_\infty$ and a termination set $X_s$ are required, where
$$X_\infty = \{x \in X \mid V^*(x) = \infty\}, \quad X_s = \{x \in X \mid V^*(x) = 0\}.$$
However, the initialization condition $V^* \le V_0 \le cV^*$ is in general difficult to verify, as information about $V^*$ is not usually known a priori.

The above results essentially claim that both finite-horizon and infinite-horizon value functions can be obtained via Value Iteration under mild conditions.

For linear quadratic regulation problems, it is widely known in the literature that Bellman's equation (3.9) becomes the following discrete-time algebraic Riccati equation:
$$P = Q + A^T P A - A^T P B (R + B^T P B)^{-1} B^T P A, \quad (3.16)$$
where $A, B$ and $Q, R$ are the matrices defining the system and cost, respectively. The infinite-horizon value function is then given by the solution to (3.16) in the following quadratic form:
$$V^*(x) = x^T P x.$$
Similarly, the value iteration algorithm in the LQR case becomes the following Riccati recursion:
$$P_{k+1} = Q + A^T P_k A - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A.$$

These well-known results have been extended to the switched linear quadratic regulation problem, Problem 3.2, in the literature [103, 104, 105], and are reviewed below.

3.3.2 Switched Linear Quadratic Regulation

In this subsection, we review classical optimal control results for Problem 3.2 under the assumption that the simulator dynamics $\tilde{f}$, i.e., the matrices $A_v$ and $B_v$, are known precisely to us.

First of all, to ensure well-posedness of Problem 3.2, specifically finiteness of the optimal cost, the following assumption is adopted.

Assumption 3.2. System (3.7) is exponentially stabilizable.

Stabilizability of switched systems is an important research topic in the literature per se. An important result states that, for switched linear systems, exponential stabilizability is equivalent to asymptotic stabilizability [47]. As such, this assumption is not very restrictive. It can be guaranteed, for example, if one of the subsystems is stabilizable. However, it is worth mentioning that stabilizability of switched systems does not rely on stabilizability of the individual subsystems.

If the switched system dynamics are known, the problem of interest becomes a standard switched LQR problem, which has been studied extensively in the control literature. We now briefly review some important properties of value functions for the switched LQR problem, which motivates our Q-function approximation architecture later on.

Due to the linear dynamics and quadratic cost function in each subsystem, each finite-horizon value function $V_k$ and the associated control policy $\pi_k$ have some nice properties. To see this, let us first assume we pick a subsystem $v \in \Sigma$ and evolve the system using only this subsystem. In this case, the switched LQR problem degenerates into the classical LQR problem. It is well known that the solution to the LQR problem can be fully characterized by the following Riccati recursion:
$$P_{k+1} = \rho_v(P_k) = Q_v + A_v^T P_k A_v - A_v^T P_k B_v (R_v + B_v^T P_k B_v)^{-1} B_v^T P_k A_v, \quad (3.17a)$$
$$K_v(P_k) = (R_v + B_v^T P_k B_v)^{-1} B_v^T P_k A_v, \quad (3.17b)$$
where $\rho_v : \mathbb{S}_{++} \to \mathbb{S}_{++}$ is referred to as the Riccati mapping and $K_v(P_k)$ is the corresponding optimal feedback gain (Kalman gain) matrix.
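As a small illustration (ours, not from the cited works), the mode-wise Riccati mapping and Kalman gain (3.17) translate directly into numerical code:

```python
import numpy as np

def riccati_map(P, A, B, Q, R):
    """One application of rho_v in (3.17a) together with the gain K_v(P) of (3.17b)."""
    S = R + B.T @ P @ B
    K = np.linalg.solve(S, B.T @ P @ A)           # K_v(P) = (R_v + B_v'PB_v)^{-1} B_v'PA_v
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ K    # rho_v(P)
    return P_next, K
```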

For switched systems, one can choose different subsystems (modes) to evolve the system dynamics. Depending on which subsystem is activated, the Riccati recursion will evolve differently, resulting in different positive definite matrices. This can be described by a set-valued mapping, referred to as the Switched Riccati Mapping (SRM), defined by
$$\rho_\Sigma(\mathcal{H}) = \{\rho_v(P) \mid v \in \Sigma \text{ and } P \in \mathcal{H}\}, \quad (3.18)$$
where $\mathcal{H}$ is an arbitrary set that contains a finite number of positive definite matrices. Let $\mathcal{H}_k$ be a set of positive definite matrices that is generated recursively using the switched Riccati mapping:
$$\mathcal{H}_{k+1} = \rho_\Sigma(\mathcal{H}_k), \quad k = 0, 1, \ldots, \quad \text{with } \mathcal{H}_0 = \{P_f\}, \quad (3.19)$$
where $P_f$ is the matrix defining the terminal cost. Then it can be easily shown that the optimal $k$-horizon value function and the corresponding control law $\xi_k$ can be characterized exactly using $\mathcal{H}_k$.
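A sketch of the recursion (3.19), reusing `riccati_map` from the previous sketch, makes the growth of $\mathcal{H}_k$ explicit; here `modes` is an assumed list of $(A_v, B_v, Q_v, R_v)$ tuples, one per subsystem.

```python
def switched_riccati_sets(modes, P_f, k):
    """Generate H_k via H_{j+1} = rho_Sigma(H_j) with H_0 = {P_f}, cf. (3.18)-(3.19)."""
    H = [P_f]
    for _ in range(k):
        H = [riccati_map(P, A, B, Q, R)[0]        # rho_v(P) for every mode v and P in H
             for (A, B, Q, R) in modes for P in H]
    return H                                      # |H_k| = |Sigma|^k without pruning
```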

Lemma 3.3 ([104]). For any finite horizon $k$, we have:

1. The finite-horizon value function $V_k$ is a pointwise minimum of finitely many quadratic functions, i.e.,
$$V_k(z) = \min_{P \in \mathcal{H}_k} z^T P z, \quad \forall z \in \mathbb{R}^{n_x}. \quad (3.20)$$

2. The corresponding control law $\xi_k = (\mu_k, \nu_k)$ is given by
$$\mu_k(z) = -K_{\nu_k(z)}(P_k(z))\, z, \quad (3.21a)$$
$$\text{where } K_v(P) = (R_v + B_v^T P B_v)^{-1} B_v^T P A_v, \quad (3.21b)$$
$$(P_k(z), \nu_k(z)) = \arg\min_{P \in \mathcal{H}_k,\, v \in \Sigma} z^T \rho_v(P) z. \quad (3.21c)$$

Proof. This lemma is proved by induction.

1. $V_0(z) = z^T P_f z$, which satisfies the desired form.

2. Suppose $V_k(z) = \min_{P \in \mathcal{H}_k} z^T P z$, where $\mathcal{H}_k$ is obtained through (3.19). By value iteration, we have
$$\begin{aligned} V_{k+1}(z) &= \inf_{v \in \Sigma,\, u \in U} z^T Q_v z + u^T R_v u + V_k(A_v z + B_v u) \\ &= \inf_{v \in \Sigma,\, P \in \mathcal{H}_k,\, u \in U} z^T (Q_v + A_v^T P A_v) z + u^T (R_v + B_v^T P B_v) u + 2 z^T A_v^T P B_v u. \end{aligned}$$
Note that the infimum to be evaluated is quadratic in $u$; hence the optimal $u^*$ attaining the infimum can easily be found to be of the form $u^* = -K_v(P)z$ with $K_v(P)$ as in (3.21b). Substituting $u^*$ into the above formulation yields the desired pointwise-minimum quadratic structure.
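Continuing the sketches above, the characterization (3.20)-(3.21) can be evaluated at a state $z$ by enumerating all mode-matrix pairs (again an illustration, under the same assumed `modes` list and `riccati_map` helper):

```python
import numpy as np

def hybrid_control_law(z, H_k, modes):
    """Return (V_{k+1}(z), mu(z), nu(z)) by minimizing z' rho_v(P) z over H_k x Sigma."""
    best = None
    for v, (A, B, Q, R) in enumerate(modes):
        for P in H_k:
            P_new, K = riccati_map(P, A, B, Q, R)  # rho_v(P) and K_v(P)
            val = float(z @ P_new @ z)             # z' rho_v(P) z, cf. (3.21c)
            if best is None or val < best[0]:
                best = (val, v, -K @ z)            # mu(z) = -K_v(P) z, cf. (3.21a)
    return best[0], best[2], best[1]
```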

The above lemma provides an exact characterization of the finite-horizon value function for switched linear quadratic regulation problems. In addition, convergence of the above value iteration to the infinite-horizon value function can be established with certain suboptimal closed-loop performance guarantees.

Lemma 3.4 ([105]). Under Assumption 3.2, we have:

1. $|V_k(z) - V^*(z)| \le c_1 \gamma^k \|z\|^2$ for some finite constant $c_1$ and $\gamma \in (0, 1)$;

2. there exists a finite $k$ for which the stationary policy $\pi_k = \{\xi_k, \xi_k, \ldots\}$ is exponentially stabilizing and
$$J_{\pi_k}(z) \le V^*(z) + c_2 \|z\|^2$$
for some finite constant $c_2$.

Proof. The convergence results in this lemma can be proved either by checking the conditions provided in Theorem 3.1 and applying the conclusion therein, or by the argument provided in [104, 105]; the detailed bounds provided herein can be obtained via an analysis of the deviation. Here we provide a proof of only the convergence result, by checking the conditions in Theorem 3.1.

First, by the definition of the running cost, Assumption 3.1 is obviously satisfied. Second, we need to verify the compactness of the sets
$$U_k(x, \eta) = \{u \in U, v \in \Sigma \mid \ell(x, u, v) + V_k(f(x, u, v)) \le \eta\}$$
for all $k \ge \bar{k}$, for all $x \in X$, and $\eta \in \mathbb{R}$, where $\{V_k\}$ is generated with $V_0 \equiv 0$. Note that, in SLQR, $P_f = 0$, which implies $V_0 \equiv 0$. In addition, due to the discrete nature of $v$, we only need to focus on the continuous-input part of $U_k(\cdot, \cdot)$. By Lemma 3.3, we know
$$\ell(x, u, v) + V_k(f(x, u, v)) = \min_{P \in \mathcal{H}_k}\left\{x^T (Q_v + A_v^T P A_v) x + u^T (R_v + B_v^T P B_v) u + 2 x^T A_v^T P B_v u\right\}.$$
Since the above function is continuous and radially unbounded in $u$ for any $x$, its sub-level sets are compact. Therefore, for any $x$ and $\eta$, $U_k(x, \eta)$ is compact. By Theorem 3.1, it follows that $\{V_k\}$ converges pointwise to $V^*$ and there exists an optimal stationary policy.

According to Lemma 3.3 and Lemma 3.4, it can be seen that the optimal solution to Problem 3.2 can be approximated arbitrarily well by a pointwise minimum of a finite number of quadratic functions. However, due to the enumeration of all possible switching sequences, the size of $\mathcal{H}_k$ grows exponentially as $k$ increases, $|\mathcal{H}_k| = |\Sigma|^k$ to be precise. This issue makes the exact evaluation of $V_k$ numerically challenging, especially for large horizons, and in turn renders the approximation of the infinite-horizon value function intractable.

To alleviate this issue, sub-optimal solutions have been developed in the literature, building upon the observation that some matrices in a generic set $\mathcal{H}$ may not contribute (or contribute very little) to the overall minimum structure (3.20). In particular, a matrix $P' \in \mathcal{H}$ is called algebraically redundant if $z^T P' z \ge \min_{P \in \mathcal{H} \setminus \{P'\}} z^T P z$ for all $z$. If $P'$ is algebraically redundant, then it is safe to remove $P'$ from $\mathcal{H}_k$ without affecting the characterization of the value function at all. However, checking algebraic redundancy is challenging. A sufficient condition and a numerically tractable algorithm based on the S-procedure [18] have been given in [105]. Furthermore, a geometric interpretation of algebraic redundancy has been given as well in [105], which motivates the geometric identification approach used for the Q-function update proposed later in Chapter 3.5. In addition to the aforementioned algebraic redundancy, there is another notion, termed numerical redundancy, which helps further reduce the number of matrices involved in characterizing the value functions while allowing for an $\epsilon$ level of error. Formally, a matrix $P' \in \mathcal{H}$ is called numerically $\epsilon$-redundant if $\min_{P \in \mathcal{H} \setminus \{P'\}} z^T P z \le \min_{P \in \mathcal{H}} z^T (P + \epsilon I) z$ for all $z$. A slight modification to the aforementioned tractable algorithm for algebraic redundancy gives rise to a tractable algorithm for checking numerical redundancy.
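The exact redundancy tests above are LMI feasibility problems; purely as an illustration of the pruning idea (and explicitly not the S-procedure-based algorithm of [105]), the following sketch drops a matrix when, on sampled unit directions, its removal changes the pointwise minimum by at most $\epsilon$:

```python
import numpy as np

def prune_numerically_redundant(H, eps=1e-3, n_samples=2000):
    """Heuristic sampling check for numerical eps-redundancy; illustration only."""
    dim = H[0].shape[0]
    Z = np.random.randn(n_samples, dim)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)        # directions on the unit sphere
    vals = np.stack([np.einsum('nd,de,ne->n', Z, P, Z) for P in H])  # vals[i,n] = z_n' P_i z_n
    keep = list(range(len(H)))
    for i in range(len(H)):
        others = [j for j in keep if j != i]
        if others and np.all(vals[others].min(axis=0) <= vals[i] + eps):
            keep.remove(i)                               # dropping P_i costs at most eps
    return [H[i] for i in keep]
```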

3.3.3 Limitations

Generally speaking, despite the power of dynamic programming, there are several practical limitations to applying it to real-world problems. Among them, the most famous are the "curse of modeling" and the "curse of dimensionality". Roughly speaking, the "curse of modeling" refers to the fact that complete system model knowledge is required in order to perform dynamic programming. However, for practical problems, such an accurate model is in general challenging to obtain and may be subject to noise and changes. On the other hand, the "curse of dimensionality" refers to the fact that the amount of computation and memory storage for solving dynamic programming grows exponentially with the size of the problem. In addition, for combinatorial problems, these quantities explode very quickly. In particular, due to the existence of the discrete input $v(t)$, the SLQR problem is in fact combinatorial in nature and NP-hard in general.

The aforementioned limitations are strong motivations for considering simulation and approximation based approaches. In particular, the simulation based approach with parametric function approximation is the main focus of the rapidly growing reinforcement learning literature.

3.4 Q-learning

Among the rich literature of reinforcement learning, Q-learning is one of the most important algorithms that has been widely studied and adopted in practice. Q-learning is named after the Q-function [78] or Q-factor [11], where Q stands for quality.

The key idea behind Q-learning is to incorporate the input into the definition of the value function. In particular, the optimal Q-function for any state-input tuple $(x, u, v)$ is defined to be the sum of the one-step running cost for the given tuple and the optimal cost for the successive state, i.e., $Q^*(x, u, v) = \ell(x, u, v) + V^*(f(x, u, v))$. With this optimal Q-function definition, the Bellman equation (3.9) can be equivalently written as follows:
$$V^*(x) = \min_{u,v} Q^*(x, u, v), \quad (3.22)$$
where it can be easily checked that $Q^*(x, u, v)$ is the solution to
$$Q^*(x, u, v) = \ell(x, u, v) + \min_{u^+, v^+} Q^*(f(x, u, v), u^+, v^+). \quad (3.23)$$
Similarly, the Q-function for each finite horizon $k + 1$ can be defined as
$$Q_{k+1}(x, u, v) = \ell(x, u, v) + V_k(f(x, u, v)), \quad (3.24)$$
and the classical Value Iteration essentially becomes
$$V_{k+1}(x) = \min_{u,v} Q_{k+1}(x, u, v). \quad (3.25)$$
Combining the above two equations, we have
$$Q_{k+1}(x, u, v) = \ell(x, u, v) + V_k(f(x, u, v)) = \ell(x, u, v) + \min_{u^+, v^+} Q_k(f(x, u, v), u^+, v^+) \triangleq (\mathcal{T}_Q Q_k)(x, u, v), \quad (3.26)$$
which is referred to as Q-Iteration throughout this dissertation.

Theoretically speaking, the above Q-functions are mathematically no different from value functions, and hence Q-Iteration is mathematically equivalent to Value Iteration. In addition, Value Iteration for Q-functions is mathematically equivalent to classical Value Iteration for cost functions. Therefore, all exact theories and algorithms for value function computation directly apply to Q-functions. In particular, the convergence results for Value Iteration described in Theorem 3.1 directly apply to the above Q-Iteration (3.26).

Despite the mathematical equivalence between Q-functions and value functions, the introduction of Q-functions is significant from an implementation perspective. The most significant advantage of using Q-functions is that they allow us to conveniently implement the associated control policy in a model-free and online fashion. In particular, the optimal control policy $\xi^*(x)$ can be simply computed as $\xi^*(x) = \arg\min_{u,v} Q^*(x, u, v)$, provided the optimal Q-function $Q^*(x, u, v)$, without knowledge about the system dynamics $f(x, u, v)$. Therefore, focusing on Q-functions enables us to develop model-free solutions to the optimal control problem. Thanks to the equivalence of Q-functions and value functions, all approximate theories and algorithms studied in the approximate dynamic programming literature apply to Q-functions as well. As mentioned before, parametric approximations are typically used for representing Q-functions. This idea has been extensively studied and used in the approximate dynamic programming literature [11] and the reinforcement learning literature [55, 56, 57].

The general idea of the Q-learning scheme is that, instead of using the exact Q-function $Q_k(x, u, v)$ at each iteration in (3.26), a function approximation $\hat{Q}_k(x, u, v; \theta)$ is used, where $\theta$ denotes a vector containing all parameters in the approximation. Under the assumption that the same class of parameterized approximations is used for all $k$, the update of $\hat{Q}_k(x, u, v; \theta)$ can be considered as an update of the parameters $\theta$ directly. Denoting $\hat{Q}(x, u, v; \theta_k) = \hat{Q}_k(x, u, v; \theta)$, the Q-Iteration essentially becomes
$$\hat{Q}(x, u, v; \theta_{k+1}) = \ell(x, u, v) + \min_{u^+, v^+} \hat{Q}(f(x, u, v), u^+, v^+; \theta_k). \quad (3.27)$$
Adopting such a parametric approximation, the general Q-learning algorithm is given below.

Here, at steps 6, 7 and 8 in Algorithm 1, we assume both states and inputs can be drawn uniformly from their underlying spaces and that we can use our simulator to evolve each state-input pair in parallel, for simplicity of exposition. It should be noted that this is not realistic in practice, especially when we are interacting with the real environment. A practical alternative is to use the current state and run the simulator for several steps with a sequence of chosen inputs. In this case, the input sequence needs to satisfy certain conditions to ensure that the data samples $(x_i, u_i, v_i, x_i^+, y_i)$

Algorithm 1 Q-learning Framework

Input: Initial parameter $\theta_0$, tolerance $\epsilon_T$, $k_{\max}$
Output: $\theta^*$
1: $\theta^* = \theta_0$
2: for $k = 1$ to $k_{\max}$ do
3:   if $\|\theta_k - \theta_{k-1}\| \le \epsilon_T$ then
4:     $\theta^* \leftarrow \theta_k$; return $\theta^*$;
5:   else
6:     $x_i \sim \mathrm{Unif}(X)$ i.i.d., $\forall i = 1, \ldots, N$;  ▷ Draw $N$ random states
7:     $(u_i, v_i) \sim \mathrm{Unif}(U \times \Sigma)$ i.i.d., $\forall i = 1, \ldots, N$;  ▷ Draw $N$ associated random hybrid inputs
8:     $x_i^+ = \tilde{f}(x_i, u_i, v_i)$, $\forall i = 1, \ldots, N$;  ▷ Generate successive states using the simulator
9:     $y_i \leftarrow \ell(x_i, u_i, v_i) + \min_{u^+, v^+} \hat{Q}(x_i^+, u^+, v^+; \theta_{k-1})$, $\forall i = 1, \ldots, N$;  ▷ Compute new Q-function values
10:    $\theta_k = \arg\min_\theta \sum_{i=1}^{N} \|\hat{Q}(x_i, u_i, v_i; \theta) - y_i\|_2^2$;  ▷ Update $\theta$ using the collected data
11:   end if
12: end for
13: $\theta^* \leftarrow \theta_{k_{\max}}$

are well distributed. In the linear case, one such condition is the so-called persistence of excitation, which has been extensively studied in the classical adaptive control literature [4]; this suggests a potential relationship between reinforcement learning and adaptive control.
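For reference, one outer iteration of Algorithm 1 can be sketched as follows. This is our own illustration: `simulator`, `q_min` (the pointwise minimization of $\hat{Q}$ over hybrid inputs) and `fit_theta` (the least-squares fit of step 10) are hypothetical placeholders for the architecture chosen in Chapter 3.5, and the uniform sampling box is an assumption of the sketch.

```python
import numpy as np

def q_learning_iteration(simulator, q_min, fit_theta, theta, N, n_x, n_u, n_v, box=5.0):
    X = np.random.uniform(-box, box, (N, n_x))        # step 6: draw N random states
    U = np.random.uniform(-box, box, (N, n_u))        # step 7: random continuous inputs
    V = np.random.randint(n_v, size=N)                #         and random modes
    samples, targets = [], []
    for x, u, v in zip(X, U, V):
        x_next, cost = simulator.step(x, u, v)        # step 8: query the simulator
        targets.append(cost + q_min(x_next, theta))   # step 9: Bellman target y_i
        samples.append((x, u, v))
    return fit_theta(samples, np.array(targets))      # step 10: refit the parameters
```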

In the Q-learning framework, two key questions are how to select an appropriate parametric approximation function class Q̂(x, u, v; θ) and how to update the parameter θ, i.e., how to solve line 10 in Algorithm 1. The update of θ in line 10 is commonly referred to as "training" the approximator Q̂.

Both linear and nonlinear parametric approximation architectures, together with the associated training methods, have been studied in the literature. In the sequel, a brief review of these architectures is given.

Linear Approximator for DSLQR

Linear approximation architectures have been shown to work with theoretical guarantees for simple problems such as the discrete-time linear quadratic regulation (DTLQR) problem [44, 46]. In particular, based on the well-known fact that the value function of the DTLQR problem is quadratic in the state, it has been shown that Q-learning with a simple linear approximation is guaranteed to converge to the optimal Q-function Q∗. Mathematically, such a linear approximation takes the following form:

Q̂(x, u; θ) = [x; u]^T [H_xx, H_xu; H_ux, H_uu] [x; u] = z^T H z = vec(H)^T (z ⊗ z) ≜ θ^T φ(z),

where z = [x^T, u^T]^T and φ(z) is commonly referred to as the feature vector of the data sample z. Given such a linear approximation structure, the Q-Iteration becomes

θ_{k+1}^T φ(z) = ℓ(x, u) + θ_k^T φ(z^+),

which can be solved via recursive least squares [20] or other approaches [82].
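A minimal sketch of this linear architecture, assuming a batch least-squares fit in place of the recursive update of [20]; the helper names phi and fit_theta are ours:

import numpy as np

def phi(x, u):
    """Quadratic feature vector phi(z) = z kron z for z = [x; u]."""
    z = np.concatenate([np.atleast_1d(x), np.atleast_1d(u)])
    return np.kron(z, z)

def fit_theta(samples, targets):
    """Batch least-squares fit of theta so that theta' phi(z_i) ~= y_i.
    samples is a list of (x, u) pairs; targets a list of scalars y_i."""
    Phi = np.stack([phi(x, u) for x, u in samples])
    y = np.asarray(targets, dtype=float)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta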

For practical problems, the class of linear parametric functions may not be rich enough to represent the Q-functions; instead, nonlinear approximations are considered. Among the rich nonlinear function classes, the neural network architecture is one of the most widely adopted, and it has demonstrated impressive success in solving numerous difficult problems [55, 56, 57, 72].

General Nonlinear Approximator - Neural Networks

An artificial neural network typically consists of an input layer, an output layer, and a few hidden layers. Roughly speaking, the input layer encodes the input to the neural network according to certain rules, e.g., appending a constant 1 to the input vector to allow for affine transforms. The output layer computes a linear combination of all outputs from the last nonlinear layer. The hidden layers are constructed from simple building blocks, called perceptrons in [12, 14], each composed of a linear and a nonlinear layer, as shown in Figure 3.1(a). Elements of each nonlinear layer are commonly called neurons, and the outputs of the nonlinear layer in each perceptron are used as inputs to the linear layer of the next layer. In order to build a perceptron, the number of neurons used in each nonlinear layer and the activation function α : R → R used in each neuron need to be specified. The rectified linear unit (ReLU) [2, 59] is one of the most widely used activation functions. The parameters of such a neural network are the linear weights used in the linear layers of all the perceptrons. This particular structure enables simple computation of the gradient of the output with respect to the parameters through a procedure known as back-propagation [37, 67, 96]. Based on this gradient information, various training methods for updating the neural network parameters have been proposed, e.g., trust-region policy optimization (TRPO) [69], proximal policy optimization (PPO) [70], and deep deterministic policy gradient (DDPG) [48]. Apart from the value iteration framework discussed so far, there are several other general frameworks such as policy iteration, generalized policy iteration, and actor-critic.
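For concreteness, the forward pass of such a ReLU network can be sketched in a few lines of Python; this is a generic illustration of the layered structure described above, not code used in this dissertation:

import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def mlp_forward(x, weights):
    """Forward pass of a ReLU network: each hidden perceptron is a
    linear map followed by the elementwise nonlinearity; the output
    layer is a plain linear combination of the last hidden outputs.
    weights : list of (W, b) pairs, one per layer."""
    h = x
    for W, b in weights[:-1]:        # hidden perceptrons
        h = relu(W @ h + b)
    W_out, b_out = weights[-1]       # linear output layer
    return W_out @ h + b_out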

(a) Building Block of a Neural Network    (b) Overall Neural Network Structure with Multiple Layers

Figure 3.1: General Neural Network Structure

Despite the wide applicability of the deep neural network architecture for representing Q-function approximations, its theoretical properties have not been fully investigated. In particular, such a deep neural network architecture usually ignores the analytical structure of the value functions of the underlying optimal control problem, which may result in unsatisfactory performance. In the sequel, we exploit classical optimal control results for switched linear systems and develop a specific Q-function approximator and the associated training techniques.

3.5 Main Results

Given the aforementioned Q-learning framework, the key question now is how to design the parameterized Q-function approximator and the corresponding update scheme. Most existing Q-learning algorithms use deep neural networks (DNNs) as the parameterized approximator and use various advanced learning algorithms to update the weights of the DNN, which are the parameters of the overall Q-function approximator [36, 56, 57, 85]. Generally speaking, such algorithms are capable of dealing with a very general class of problems without specifically taking into account the underlying physics or structural properties. In our solution, instead of directly applying these generic algorithms, a particular Q-function approximator and an associated update scheme are developed which explicitly incorporate the analytical structure of the value functions of the optimal control problem discussed in Section 3.3.

3.5.1 Q-function and parametric approximator

As discussed in the previous section, Q-functions are defined through the associated value function in classical dynamic programming. By the switched linear quadratic regulation results and the Q-function definition, we know that the exact optimal Q-function for Problem (3.2) has the following form:

Q∗(x, u, v) = ℓ(x, u, v) + V∗(f(x, u, v))
            = min_{P ∈ H∗} [ x^T (Q_v + A_v^T P A_v) x + u^T (R_v + B_v^T P B_v) u + 2 x^T A_v^T P B_v u ].

Proposition 3.1. The sequence {Q_k} generated by the Q-Iteration (3.26) converges pointwise to Q∗ for any initialization satisfying 0 ≤ Q_0 ≤ Q∗.

This proposition immediately follows from the convergence results of value iteration (Theorem 3.1) and the mathematical equivalence between Q-functions and value functions. Nonetheless, it is worth mentioning that, although convergence is guaranteed for every initialization satisfying 0 ≤ Q_0 ≤ Q∗, a commonly chosen initialization is Q_0(x, u, v) = ℓ(x, u, v). Such an initialization originates from the classical initialization V_0 ≡ 0 in value iteration for infinite-horizon problems, which essentially means that no terminal cost is involved.

Because H∗ typically contains infinitely many quadratic functions, an exact characterization is in general impossible to obtain. To address this issue, it has been shown that finitely many quadratic functions can approximate the desired optimal function with a certain sub-optimality guarantee. Therefore, we adopt the following parametric Q-function approximation structure, which explicitly incorporates the value function structure:

Q(x, u, v) = x^T Q_v x + u^T R_v u + min_{P ∈ H_M} [x; u]^T [A_v, B_v]^T P [A_v, B_v] [x; u],    (3.28)

where H_M is a finite set of positive definite matrices with cardinality M serving as the parameters of this approximation.

Since we do not assume knowledge of the system dynamics, i.e., A_v and B_v, we treat [A_v, B_v]^T P [A_v, B_v] as a single parameter, which yields the following Q-function approximator:

Q̂(x, u, v | θ) = x^T Q_v x + u^T R_v u + min_{P̂ ∈ Ĥ_{M_v}} [x; u]^T P̂ [x; u],    (3.29)

where Ĥ_{M_v}, with M_v = M × n_v, collects the finitely many positive definite matrices maintained for each v ∈ Σ. As a result, the proposed Q-function approximator is a pointwise minimum of (at most) M_v quadratic functions of the state-input pair.

Note that, in the above Q-function, M is a design parameter tuning the accuracy of the approximation. In fact, there is a natural trade-off between accuracy and complexity when choosing M. Generally speaking, a larger M results in a more accurate approximation but introduces additional computation and requires many more data samples to be collected. Rigorous analysis of this trade-off and systematic schemes for choosing M remain open.
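Before turning to the update scheme, the following short Python sketch shows how the approximator (3.29) is evaluated for a given mode; the containers Q_mats, R_mats, and P_hat_sets are hypothetical names of ours for the problem data and the per-mode parameter sets:

import numpy as np

def q_hat(x, u, v, Q_mats, R_mats, P_hat_sets):
    """Evaluate (3.29): x'Q_v x + u'R_v u plus the pointwise minimum
    of the quadratics [x;u]' P_hat [x;u] stored for mode v."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = np.atleast_1d(np.asarray(u, dtype=float))
    z = np.concatenate([x, u])
    stage = x @ np.atleast_2d(Q_mats[v]) @ x + u @ np.atleast_2d(R_mats[v]) @ u
    return stage + min(z @ P @ z for P in P_hat_sets[v])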

3.5.2 Q-function Update

Due to the above specific structure, classical learning algorithms cannot be directly applied to update the Q-function approximator. To address this issue, we propose two approaches based on the particular structure of our Q-function approximator. First, inspired by the piecewise quadratic structure of the underlying value function, the Q-function update can be formulated as a subspace clustering problem, and two different subspace clustering algorithms are implemented for solving it. Nonetheless, the subspace clustering formulation does not fully exploit the exact pointwise minimum structure, which is actually stronger. With this in mind, we develop a novel alternative geometric approach.

At each iteration, the Q-learning framework generates a set of N data samples D = {(z_i, v_i, y_i)}_{i=1}^N, where z_i := [x_i^T, u_i^T]^T and y_i is the value of the associated value function. Updating the Q-function approximator can then be abstracted as the following problem.

Problem 3.3. Given a data set D = {(z_i, y_i)}_{i=1}^N with z_i ∈ R^n and y_i ∈ R, find a set of matrices {P_j}_{j=1}^M ⊂ S_{++}^n such that

y_i = min_{j=1,…,M} z_i^T P_j z_i,  ∀i.    (3.30)

Since we do not know the membership of the data samples with respect to their underlying quadratic functions, the above problem can be viewed as an unsupervised learning problem, which is known to be challenging. One widely adopted idea is to first cluster the data samples into several clusters and then identify the associated quadratic function for each cluster. Typically, the identification step is much easier: with enough data samples, it can be formulated as a standard least squares problem. The main bottleneck of this two-step approach is the clustering step. Effective clustering algorithms usually rely upon knowledge about the underlying model generating the data samples. In the context of Problem 3.3, we know the exact structural property of the underlying model. Based on this knowledge, we adopt two approaches to solve the problem: the subspace clustering approach utilizes the piecewise quadratic structure of the underlying model, while the geometric approach tries to fully exploit the pointwise minimum quadratic structure.

Subspace Clustering Approach

The first approach adopted is the so-called subspace clustering approach. It is motivated by the observation that a pointwise minimum of quadratic functions can be viewed as a piecewise linear function in a higher-dimensional space. To see this, we first notice that any quadratic form can be transformed into a linear form by lifting the space as follows:

z^T P z = trace(P z z^T) = vec(P)^T (z ⊗ z).    (3.31)

Moreover, letting Z_j := { z | z^T P_j z ≤ z^T P_l z, ∀l = 1, …, M, l ≠ j }, we have

min_{l=1,…,M} z^T P_l z = z^T P_j z,  if z ∈ Z_j.    (3.32)

Hence, the pointwise minimum of a finite number of quadratic functions of z can be viewed as a piecewise linear function of z ⊗ z. For notational simplicity, we will use p̂ = vec(P) and ẑ = z ⊗ z throughout this chapter.
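The lifting identity (3.31) can be checked numerically in a couple of lines; the random data below is purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)             # a random symmetric positive definite matrix
z = rng.standard_normal(n)

quad = z @ P @ z                        # z' P z
lifted = P.reshape(-1) @ np.kron(z, z)  # vec(P)' (z kron z), row-major vec
assert np.isclose(quad, lifted)         # identity (3.31) holds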

Based on the above discussion, it is not hard to see that if we focus on the lifted data samples (ẑ_i, y_i) ∈ R^{n²+1}, they actually lie in several subspaces of dimension at most n(n+1)/2, determined by the p̂_j, embedded in the ambient space. The number n(n+1)/2 comes from the observations that there are n(n−1)/2 redundant terms in ẑ due to symmetry and that the linear relationship between y and ẑ introduces one more linear dependence. Therefore, Problem 3.3 can be solved in a two-step manner. In the first step, we aim to cluster all the (lifted) data samples according to their underlying linear relationships (subspaces). Once such clusters are given, the associated matrices P can be easily obtained via numerous optimization techniques, e.g., least squares.

The clustering problem in the first step is exactly the subspace clustering problem, which aims to partition data samples according to their underlying subspaces. The subspace clustering problem has been extensively studied in the computer vision and image processing literature. In this dissertation, we adopt the sparse subspace clustering (SSC) algorithm [28, 29], which has been extensively studied, used, and analyzed in the literature.

The sparse subspace clustering algorithm exploits the self-expressiveness property, which states that each data point in a union of subspaces can be efficiently represented as a linear combination of the other points. Such a representation is not unique in general, but by promoting sparsity of the representation, it will ideally involve only a few points from the data point's own subspace.

Given a column-wise data matrix Z ∈ R^{n×N}, where n is the dimension of each data sample and N is the number of data samples, the SSC approach tries to solve the following global optimization problem:

min_{C ∈ R^{N×N}} ‖C‖_0  subject to  Z = ZC, diag(C) = 0,    (3.33)

where the constraint on the diagonal entries of C eliminates the trivial solution in which each data point is written as a linear combination of itself. This optimization problem is nonconvex due to the nonconvex ℓ_0 cost function. In practice, the following convex relaxation is used, where the ℓ_0 norm is replaced with the ℓ_1 norm, which is its tightest convex relaxation and is known to prefer sparse solutions:

min_{C ∈ R^{N×N}} ‖C‖_1  subject to  Z = ZC, diag(C) = 0.    (3.34)

Unfortunately, unless the data samples are drawn exactly from the underlying subspaces, the optimization problem (3.34) is in general infeasible. In order to allow for outliers and noise, multiple variants have been proposed. In particular, we solve the following relaxed optimization problem with a weighting factor γ:

min_{C ∈ R^{N×N}} ‖Z − ZC‖_F^2 + γ‖C‖_1  subject to  diag(C) = 0.    (3.35)

This convex optimization problem can be efficiently solved using various techniques, such as the alternating direction method of multipliers [17] or simple gradient-based techniques applied to a further simplified, linearized problem [40]. Once C is obtained from the above optimization, classical spectral clustering approaches [90] can then be applied to generate the clusters. The pseudo-code for the overall sparse subspace clustering algorithm is provided below.
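As a rough illustration, problem (3.35) can also be posed directly in a convex modeling tool such as CVXPY; this is a minimal sketch assuming the data matrix is small enough for a generic solver, not the ADMM solver of [17]:

import cvxpy as cp

def ssc_coefficients(Z, gamma=0.01):
    """Solve the relaxed SSC problem (3.35):
    minimize ||Z - Z C||_F^2 + gamma * ||C||_1 subject to diag(C) = 0."""
    N = Z.shape[1]
    C = cp.Variable((N, N))
    objective = cp.Minimize(cp.sum_squares(Z - Z @ C)
                            + gamma * cp.sum(cp.abs(C)))
    cp.Problem(objective, [cp.diag(C) == 0]).solve()
    return C.value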

Algorithm 2 Sparse Subspace Clustering

Input: column-wise data matrix Z, weighting factor γ, number of desired subspaces M
Output: partition of the data {C_j}_{j=1}^M
1: Solve min_{C ∈ R^{N×N}} ‖Z − ZC‖_F^2 + γ‖C‖_1 subject to diag(C) = 0;
2: Construct an affinity graph G with vertices representing the data samples and edge weights given by W = |C| + |C|^T;
3: Sort the eigenvalues σ_1 ≥ σ_2 ≥ … ≥ σ_N of the normalized Laplacian of G in descending order;
4: Apply a spectral clustering technique to the affinity graph G using M as the estimated number of clusters.

Given the clustering results {C_j}_{j=1}^M, the associated quadratic functions can be efficiently identified using various approaches. We apply the classical least squares method to obtain the quadratic function for each cluster Z_j = {z_{i_1}, …, z_{i_{N_j}}}, which is given by the following semidefinite program:

min_{P ∈ S_+^n} Σ_{m=1}^{N_j} ‖y_{i_m} − z_{i_m}^T P z_{i_m}‖_2^2.    (3.36)
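With the positive semidefiniteness constraint dropped, (3.36) reduces to ordinary least squares in the lifted coordinates z ⊗ z; the sketch below uses this relaxation (symmetrizing the estimate afterwards) rather than a full semidefinite programming solver, and the helper name identify_quadratic is ours:

import numpy as np

def identify_quadratic(zs, ys):
    """Fit a symmetric P with z_i' P z_i ~= y_i by linear regression
    on the lifted features z kron z. The PSD constraint of (3.36) is
    dropped in this sketch; the estimate is symmetrized afterwards."""
    Phi = np.stack([np.kron(z, z) for z in zs])      # N_j x n^2
    p_vec, *_ = np.linalg.lstsq(Phi, np.asarray(ys, dtype=float), rcond=None)
    n = np.asarray(zs[0]).shape[0]
    P = p_vec.reshape(n, n)
    return 0.5 * (P + P.T)                           # symmetric part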

Remark 3.1. Note that, although subspace clustering provides a candidate algorithm for updating the parameters in our Q-learning framework, it does not fully exploit the analytical structure. In particular, the idea of using a subspace clustering technique originates from the piecewise quadratic structure; the actual pointwise minimum structure is stronger and is not incorporated into the subspace clustering framework.

In addition, neither of the aforementioned subspace clustering algorithms ensures correct clustering under arbitrary conditions. Both algorithms have been rigorously analyzed and conditions ensuring correct clustering have been studied in the literature. Unfortunately, due to the underlying control problem, none of the existing conditions guaranteeing correct clustering is satisfied for our problem.

Geometric Approach

An alternative approach is proposed in this section based on geometric insights into the distribution of data samples satisfying (3.30). Note that, under the assumption that all data samples are drawn exactly from the underlying model (3.30), we have

y_i = min_{j=1,…,M} z_i^T P_j z_i
  ⟺  1 = min_{j=1,…,M} (1/y_i) z_i^T P_j z_i
  ⟺  1 = min_{j=1,…,M} (z_i/√y_i)^T P_j (z_i/√y_i).

Therefore, instead of focusing on the original data set D = {(z_i, y_i)}_{i=1}^N, we consider the new data set W = {w_i}_{i=1}^N, where w_i = z_i/√y_i. According to the above discussion, each data sample satisfies

1 = min_{j=1,…,M} w_i^T P_j w_i.

Therefore, each w_i lies on the surface of a union of concentric hyper-ellipsoids in R^n. Furthermore, it is well known that the principal axes of such an ellipsoid have lengths proportional to λ_{j,k}^{−1/2}, where λ_{j,k} denotes the k-th eigenvalue of the j-th matrix P_j. Consequently, the L2 norm of a transformed data sample w_i carries certain membership information. In particular, under the assumptions that the number of samples is large enough and that the data samples are uniformly distributed, the data sample with the largest L2 norm, denoted by w∗, belongs to the quadratic function defined by the P_j with the smallest eigenvalue. Additionally, data samples having a large inner product with w∗ (i.e., a small angle to w∗) are likely to belong to the same quadratic function. Inspired by these geometric insights, we propose a new algorithm for solving the unsupervised learning Problem 3.3. The pseudo-code of the proposed algorithm is given below.

Algorithm 3 Geometric Identification

Input: data set W = {w_i}_{i=1}^N, tolerance ε, threshold β
Output: number of quadratic functions H, corresponding matrices P_1, …, P_H and partition C_1, …, C_H
1: D = W and k = 1;  ▷ Working set containing the data still to be clustered
2: while D ≠ ∅ do
3:   Find w_{i∗} ∈ D such that ‖w_{i∗}‖_2 ≥ ‖w‖_2, ∀w ∈ D;
4:   Find a neighborhood of w_{i∗}, denoted by N_β(w_{i∗}) = {w_j ∈ D | ⟨w_j, w_{i∗}⟩ ≥ β};
5:   Identify P_k from the set N_β(w_{i∗}) by solving

     min_{P_k ∈ S_{++}} Σ_{w_j ∈ N_β(w_{i∗})} ‖1 − w_j^T P_k w_j‖_2^2;    (3.37)

6:   Given P_k, find C_k = {w_l : |1 − w_l^T P_k w_l| ≤ ε};
7:   D ← D \ C_k and k ← k + 1;
8: end while
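A compact Python rendering of Algorithm 3 is sketched below, dropping the positive definiteness constraint of (3.37) in favor of a plain least-squares fit in the lifted coordinates; the termination safeguards are ours and are not part of the pseudo-code above:

import numpy as np

def geometric_identification(W, eps=1e-5, beta=0.9):
    """Sketch of Algorithm 3. W is an (N, n) array of transformed
    samples w_i = z_i / sqrt(y_i). Returns the identified matrices
    and the corresponding index partition."""
    remaining = list(range(len(W)))
    matrices, clusters = [], []
    n = W.shape[1]
    while remaining:
        # Step 3: remaining sample with the largest l2 norm.
        i_star = max(remaining, key=lambda i: np.linalg.norm(W[i]))
        # Step 4: neighborhood by inner product with w_{i*}.
        hood = [i for i in remaining if W[i] @ W[i_star] >= beta]
        if i_star not in hood:
            hood.append(i_star)   # safeguard: keep the anchor in its neighborhood
        # Step 5: least-squares fit of P_k on the neighborhood (targets all 1).
        Phi = np.stack([np.kron(W[i], W[i]) for i in hood])
        p, *_ = np.linalg.lstsq(Phi, np.ones(len(hood)), rcond=None)
        P = 0.5 * (p.reshape(n, n) + p.reshape(n, n).T)
        # Step 6: collect every remaining sample consistent with P_k.
        cluster = [i for i in remaining if abs(1 - W[i] @ P @ W[i]) <= eps]
        if not cluster:
            cluster = [i_star]    # safeguard: guarantee termination
        matrices.append(P)
        clusters.append(cluster)
        # Step 7: remove the clustered samples and continue.
        remaining = [i for i in remaining if i not in cluster]
    return matrices, clusters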

In the above algorithm, ε and β are two tuning parameters affecting the accuracy. In particular, β is the key parameter determining the size of the data set used for identifying each quadratic function. If β ensures that all samples used for identifying each quadratic function come from the same underlying quadratic function, then the algorithm is guaranteed to output the correct clustering and identification results. However, due to the unsupervised nature of this identification problem, it is in general not possible to determine a priori a β ensuring such conditions. It is worth mentioning that the idea of using the inner product as a similarity criterion for data samples has emerged recently in the literature [39, 50].

Here, we test the proposed algorithm using synthetic data to further illustrate the main idea and how it works. A simple two-dimensional model is used in our test, i.e., z_i ∈ R² and y_i ∈ R. We randomly generate four positive definite matrices and generate the original data set D = {(z_i, y_i)}_{i=1}^N from this model. N is chosen to be 500 in our test and the matrices are

P_1 = [ 1.4409, −0.3049 ; −0.3049, 0.0962 ],   P_2 = [ 0.3372, −0.5887 ; −0.5887, 1.1757 ],
P_3 = [ 0.0641, −0.2099 ; −0.2099, 0.9716 ],   P_4 = [ 0.2142, 0.4714 ; 0.4714, 1.1343 ].

The generated data set D is shown in Figure 3.2 and the transformed data set W is shown in Figure 3.3(a). The result of applying our algorithm to the data set D is given in Figure 3.3(b).

Figure 3.2: Original data distribution

(a) Transformed Data Distribution    (b) Result from Algorithm 3

Figure 3.3: Geometric Algorithm on 2-D Synthetic Data

It is worth mentioning that the current version of the geometric approach is by no means perfect, in the sense that there is no theoretical guarantee on the correctness of the identification. In practice, the approach may also produce an inaccurate model, especially in high-dimensional scenarios. Here, we consider a 5-dimensional case in which the underlying model involves 4 matrices. We randomly generate 10000 data samples and randomly generate 500 scenarios of the matrix set. Performance is evaluated on 20000 randomly generated test data samples by computing the empirical error with respect to the underlying true model. A histogram of the result is depicted in Figure 3.4 below. From this figure, it can be seen that although the algorithm correctly identifies the underlying model for a large number of the generated scenarios, it fails occasionally. In fact, this issue becomes even worse in higher-dimensional scenarios and when the data samples are insufficient or skewed in distribution, as will be discussed in the sequel and demonstrated in the case studies.

Figure 3.4: Histogram of the empirical error between the identified model and the underlying true model

Despite the good performance of our algorithm on the synthetic data set, as mentioned above, several distinct features of the underlying control problem and the Q-learning framework limit the performance of the proposed geometric approach. In the traditional unsupervised learning literature, sufficiency and good distribution of the data samples are two fairly standard assumptions yielding good performance of learning algorithms. However, due to our underlying control problem and the adopted Q-learning framework, we do not have the freedom to make such assumptions. In particular, the distribution of the data samples collected from the simulator may be skewed or even degenerate. To see this, we first note that there is a natural trade-off between degeneracy and distribution of the data samples due to the underlying system dynamics. In particular, the underlying model we face in our original problem (3.29) is

y = min_P [x; u]^T [A_v, B_v]^T P [A_v, B_v] [x; u] = min_{P̂} [x; u]^T P̂ [x; u] = min_P (x^+)^T P x^+,

where [x; u] is the input to the simulator, which can be randomly generated. However, if we directly use the data set of pairs [x; u], the matrices defining the quadratic functions, P̂ = [A_v, B_v]^T P [A_v, B_v], actually drop rank, resulting in a nonempty null space. Consequently, the transformed data set no longer lies on surfaces of concentric hyper-ellipsoids; only after a projection would the data samples lie on such surfaces. Nonetheless, this projection, or equivalently the null space, is unknown a priori. Instead, thanks to the accessibility of the simulator, we have information about x^+; therefore, we can apply the geometric algorithm to x^+ to obtain the clustering results and then generate the associated matrices P̂ based on them. Although this idea addresses the degeneracy issue, we lose the uniform distribution property of the data samples x^+. How to rigorously characterize this trade-off is still an open question and will be investigated in future work.

One potential heuristic remedy is to consider adaptive sampling techniques. Specifically, we point out that the inaccuracy of the clustering or identification produced by the proposed geometric approach mainly comes from two sources. First and foremost, there is in general no guarantee that the data samples in the neighborhood set N_β(w_{i∗}) constructed in Algorithm 3 correspond to the same quadratic function defining w_{i∗}; this is in fact the core difficulty of the unsupervised learning (clustering) problem. In addition, even when this desired property holds, the identification step (3.37) may be inaccurate due to an insufficient number of points in N_β(w_{i∗}). To address these two potential issues, we propose a heuristic adaptive sampling technique. To begin with, we note that a symmetric n-dimensional matrix has at most n(n+1)/2 free variables. Hence, we need at least n(n+1)/2 samples in N_β(w_{i∗}) at each iteration to ensure that the least squares problem (3.37) admits an acceptable solution. In addition, as discussed above, we would like all data samples in N_β(w_{i∗}) to originate from the same underlying quadratic function. Taking advantage of the accessibility of the system simulator, if either of the above two conditions is not satisfied, an extra set of data samples is generated according to the following rule (Algorithm 4) and added to the data set. Essentially, we draw samples close to w_{i∗} by slightly perturbing the initial state and input fed to the simulator.

Algorithm 4 Extra Data Samples

Input: index i∗ from step 3 of Algorithm 3, data size M, and covariance matrices Ξ_x ∈ S_{++}^n and Ξ_u ∈ S_{++}^m
Output: new data samples {(x_j, u_j, x_j^+)}
1: for j = 1 to M do
2:   x_j = x_{i∗} + N(0, Ξ_x);
3:   u_j = u_{i∗} + N(0, Ξ_u);
4:   x_j^+ = A_{v_{i∗}} x_j + B_{v_{i∗}} u_j;
5: end for
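A direct Python transcription of this perturbation rule might look as follows; the simulator callable stands in for the unknown dynamics in step 4, and the function name extra_samples is ours:

import numpy as np

def extra_samples(x_star, u_star, v_star, simulator, M, cov_x, cov_u,
                  rng=np.random.default_rng()):
    """Algorithm 4: draw M samples near (x*, u*) by Gaussian
    perturbation and push each through the simulator."""
    samples = []
    for _ in range(M):
        x = x_star + rng.multivariate_normal(np.zeros(len(x_star)), cov_x)
        u = u_star + rng.multivariate_normal(np.zeros(len(u_star)), cov_u)
        samples.append((x, u, simulator(x, u, v_star)))
    return samples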

With our carefully designed Q-function approximator and the proposed updating schemes, the analytical properties of the optimal control solution discussed in Section 3.3 are successfully conveyed into the Q-learning framework. The overall Q-learning algorithm we propose is provided as follows.

Algorithm 5 Q-learning for Switched LQR

Input: system simulator f̃(x, u, v), initial Q-function parameters M and H^0 = {P_1^0, …, P_M^0}, error tolerance T
Output: H∗ = {P_1∗, …, P_M∗}
1: H∗ ← H^0;
2: for k = 1 to k_max do
3:   if ‖P_j^k − P_j^{k−1}‖_F ≤ T for all j = 1, …, M then
4:     H∗ ← H^k; return H∗;
5:   else
6:     x_i ~ Unif(X) i.i.d., ∀i = 1, …, N;
7:     (u_i, v_i) ~ Unif(U × Σ) i.i.d., ∀i = 1, …, N;
8:     x_i^+ = f̃(x_i, u_i, v_i), ∀i = 1, …, N;
9:     y_i ← ℓ(x_i, u_i, v_i) + min_{P ∈ H^k} [x_i; u_i]^T P [x_i; u_i], ∀i = 1, …, N;
10:    Update H^{k+1} using either of the proposed algorithms;
11:  end if
12: end for

The proposed Q-learning algorithm constitutes a very first step toward model-free reinforcement learning for hybrid systems, and it exhibits several features that are worth discussing.

First of all, in our Q-learning algorithm we actually abuse the simulator, in the sense that we can arbitrarily pick the state-input pair and use the simulator to evaluate the successor state. In the classical Q-learning framework, the simulator is used in a sequential fashion, meaning that one can only choose the initial state, and all subsequent states fed into the simulator come from previous iterations. Different choices of how to determine the inputs to the simulator may result in different performance of the learning algorithm. One intuitive way is to draw inputs according to a pre-defined distribution over the input space, e.g., uniform or normal. An alternative is to use the control law determined by the Q-function at the current step to generate the inputs. The trade-off between these two methods is known as the trade-off between exploration and exploitation. A widely used method is the so-called ε-greedy method, in which the input is determined by the current Q-function corrupted by noise, usually chosen as a zero-mean normal random variable with decaying variance. Various heuristics proposed in the literature have been added to the design of such sampling techniques for different underlying approaches to ensure the desired convergence behavior.

In addition, notice that in the proposed algorithm, the update of the Q-function at each iteration does not rely on the previous parameters at all. In other words, in Algorithm 5, the information in H^k is not conveyed to H^{k+1} directly but is only used in determining y_i in step 9. This differs significantly from the neural network based approaches, where the weights of the neural network are directly carried over to the next iteration as the initialization of the optimization procedure. Traditional training of neural networks is therefore somewhat biased: intuitively, on one hand such a bias may result in better convergence behavior of the overall algorithm, while on the other hand it is more easily trapped in local solutions. To the best of the authors' knowledge, rigorous analysis and theoretical discussion of this issue are missing.

All the issues discussed above motivate our future work, including the development of more efficient and reliable numerical algorithms and rigorous mathematical analysis of the performance of the proposed algorithm.

3.6 Case Studies

In this section, we test the proposed Q-learning algorithm on optimal control problems for discrete-time switched linear systems without knowledge of the dynamics.

3.6.1 A Simple 2-Dimensional Example

To begin with, we consider a simple 2-dimensional system involving two subsystems that has been analytically studied in the switched linear quadratic regulation literature:

A_1 = [ 2, 1 ; 0, 1 ],  B_1 = [ 1 ; 1 ],  A_2 = [ 2, 1 ; 0, 0.5 ],  B_2 = [ 1 ; 2 ],
Q_1 = Q_2 = I_2,  R_1 = R_2 = 1.
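For reproducibility, the corresponding black-box simulator and running cost used by Algorithm 5 can be sketched as follows; only the matrices come from the example above, while the function names are ours:

import numpy as np

A = {1: np.array([[2.0, 1.0], [0.0, 1.0]]),
     2: np.array([[2.0, 1.0], [0.0, 0.5]])}
B = {1: np.array([1.0, 1.0]),
     2: np.array([1.0, 2.0])}
Q = {1: np.eye(2), 2: np.eye(2)}
R = {1: 1.0, 2: 1.0}

def simulator(x, u, v):
    """Successor state x+ = A_v x + B_v u for Example 1."""
    return A[v] @ x + B[v] * u

def stage_cost(x, u, v):
    """Running cost l(x, u, v) = x'Q_v x + u'R_v u."""
    return x @ Q[v] @ x + R[v] * u * u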

The detailed configuration of the Q-learning algorithm in Algorithm 5 is as follows:

1. Main algorithm: M = 2, P_j^0 = 0, k_max = 20, and N = 500; note that we do not specify T here;

2. Subspace clustering: γ = 0.01 in Algorithm 2;

3. Geometric identification: ε = 10⁻⁵, and β is chosen such that we have enough data samples in the neighborhood set.

To evaluate the proposed algorithm, we adopt the standard reinforcement learning practice of simulating the system from 100 initial samples using the control law associated with the Q-function and observing the corresponding cost. In addition, as the termination condition of the main algorithm suggests, the performance of the algorithm can also be evaluated by comparing the matrices we obtain with the optimal ones

Figure 3.5: Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 1

Figure 3.6: Pij convergence with subspace clustering - Example 1

92 25

20

15

10

5

0 0 2 4 6 8 10 12 14 16 18 20

Figure 3.7: Pij convergence with geometric approach - Example 1

computed via classical switched LQR techniques. For this simple problem, the optimal solutions can be found easily. In practice, the first evaluation method is more widely used, since it is in general impossible to directly obtain the optimal solutions.

In Figure 3.5, the simulated costs during the learning processes with both training methods are plotted against the simulated cost associated with the SLQR optimal solution. It can be seen that the cost associated with the geometric approach empirically converges to the SLQR cost, whereas the subspace clustering based result exhibits some oscillations. This issue can be seen more clearly in Figures 3.6 and 3.7. In fact, the matrices obtained from the geometric approach converge to the ones corresponding to the optimal solution.

Another interesting observation is that, although there are 6 free parameters in each matrix to be identified, we observe only 3 parameters defining each matrix in the results of both approaches. This is due to the degeneracy issue discussed in the previous section.

3.6.2 Another More Interesting 2-Dimensional Example

The above problem admits a simple sub-optimal solution, namely using the optimal solution corresponding to a single subsystem all the time, owing to the stabilizability of both subsystems. A more interesting scenario is considered in this subsection, in which neither subsystem is stabilizable. The configuration of this problem is given below:

A_1 = [ 2, 0 ; 0, 2 ],  B_1 = [ 1 ; 2 ],  A_2 = [ 1.5, 1 ; 0, 1.5 ],  B_2 = [ 1 ; 0 ],
Q_1 = Q_2 = I_2,  R_1 = R_2 = 1.

Figure 3.8: Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 2

Figure 3.9: Pij convergence with subspace clustering - Example 2

Figure 3.10: Pij convergence with geometric approach - Example 2

The parameters used in the algorithm remain exactly the same as in the previous case, and the same evaluation methods are adopted. In Figure 3.8, the simulated costs are depicted; the convergence of the matrices defining the Q-functions is shown in Figures 3.9 and 3.10. From these figures, it can be seen that in this case the oscillatory behavior of the subspace clustering approach is more severe than in the previous case. This is probably because neither of the subsystems in the current example is stabilizable; consequently, a small deviation in the matrices defining the value (or Q-) function induces a relatively large error compared to the previous example.

Another feature, slightly difficult to notice from Figures 3.8 and 3.10, is that although the entries of each P matrix used in the Q-function approximator converge and the associated empirical cost appears to converge to the SLQR cost in Figure 3.8, there is in fact a very small gap between the two costs. This gap is due to the numerical ε-redundancy discussed in Section 3.3.

3.6.3 A 3-Dimensional Example

The main purpose of this example is to demonstrate the limitations of our current solution. As suggested by the discussion in Section 3.5, the proposed geometric approach may not be accurate in high-dimensional scenarios due to various factors such as insufficiency of data samples, skewed data distributions, carelessly chosen parameters, and so on. We consider a 3-dimensional example here to demonstrate that these issues arise ubiquitously, which is a strong motivation for developing better solutions. The configuration of the considered example is given below:

A_1 = [ 2, 0, 2 ; 3, 1, 3 ; 1, 0, 2 ],  A_2 = [ 3, 2, 0 ; 1, 3, 1 ; 0, 3, 2 ],  A_3 = [ 1, 3, 0 ; 2, 2, 0 ; 0, 2, 1 ],
B_1 = [ 0 ; 1 ; 0 ],  B_2 = [ 2 ; 2 ; 2 ],  B_3 = [ 3 ; 3 ; 2 ],
Q_1 = Q_2 = Q_3 = I_3,  R_1 = R_2 = R_3 = 1.

Similar to the previous example, it is easy to verify that none of the subsystems is stabilizable. All parameters in the proposed algorithm remain the same except M = 4 and N = 5000. We iterate the proposed Q-learning algorithm for 200 steps, and the simulation results are provided in Figures 3.11, 3.12, and 3.13.

From Figures 3.11(a), 3.12(a), and 3.13(a), which depict the results of the first 50 iteration steps, a conclusion similar to the previous example can be drawn: the proposed algorithm with the geometric approach converges to a sub-optimal solution, while the subspace clustering based version does not converge. In fact, the control policy generated by the subspace clustering approach is not even stabilizing.

However, looking at the entire horizon, i.e., Figures 3.11(b), 3.12(b), and 3.13(b), some small fluctuations occur occasionally. This is due to the nature of data-driven approaches: the algorithm occasionally fails to draw any sample corresponding to one of the underlying quadratic functions that contribute to the overall function value.

(a) Training with 50 steps    (b) Training with 200 steps

Figure 3.11: Empirical Costs Comparison among Subspace Clustering Approach, Geometric Approach and SLQR - Example 3

(a) Training with 50 steps    (b) Training with 200 steps

Figure 3.12: Pij convergence with subspace clustering - Example 3

(a) Training with 50 steps    (b) Training with 200 steps

Figure 3.13: Pij convergence with geometric approach - Example 3

3.7 Conclusion

In this chapter, we study optimal control of discrete-time switched linear systems using model-free reinforcement learning. Motivated by the special analytical value function structure of the underlying problem, we propose a novel Q-learning algorithm instead of directly applying existing neural network based techniques. In particular, a

specific parametric Q-function approximator explicitly incorporating the analytical value function structure is proposed. Two approaches for updating the parameters used in the approximation are described, exploiting different structural properties of the underlying approximator architecture.

Chapter 4: Contributions and Future Work

This dissertation studies optimal control of switched systems. Such problems are of great interest in the literature due to their wide applicability in diverse engineering fields. Roughly speaking, switched systems involve multiple operating modes (subsystems) and a switching signal orchestrating the active subsystem at each time instant. Optimal control of such switched systems aims to find both the continuous input and the switching signal to jointly optimize a certain system performance index. Apart from the challenges of solving classical optimal control problems, optimal control of switched systems suffers from additional difficulties, mainly due to the discrete nature of the switching signal, which makes the problem combinatorial.

In the first part of this dissertation, the problem of finding optimal open-loop solutions for general continuous-time switched nonlinear systems is discussed. We consider the embedding-based approach, which solves the problem by first relaxing the combinatorial (discrete) input space to a continuous one, then solving the optimal control problem with the relaxed input space, and finally projecting the relaxed solution back into the original combinatorial space. Exploiting the notion of weak topology, we provide a novel topological perspective on the embedding-based approach and develop an associated framework that unifies the understanding and analysis of most embedding-based algorithms. The major contributions of this work lie in the following aspects. First, the framework offers an abstract and high-level way of understanding embedding-based techniques for solving optimal control of switched systems. Moreover, the proposed framework streamlines the convergence analysis of embedding-based algorithms and can be viewed as general guidance for constructing new ones.

In the second part of this dissertation, we turn our attention to optimal control of discrete-time switched linear systems, which is a more structured problem. Motivated by the fact that accurate knowledge about system dynamics is in general challenging to obtain, we solve the optimal control problem in a model-free setting. Instead of requiring information about the system dynamics, we assume access to a system simulator that outputs the successor state and the associated running cost given any state-input pair. Utilizing such a simulator, a specific Q-learning algorithm is developed. Instead of directly applying existing neural network based techniques, a particular parametric Q-function approximator and the corresponding parameter update scheme are proposed to directly incorporate analytical insights about the optimal solution gained from the classical optimal control literature. The contributions of this work lie mainly in the following aspects. First of all, a novel Q-learning algorithm is proposed for solving optimal control of switched linear systems. In addition, the proposed solution explicitly incorporates knowledge about the analytical structure that the optimal solution possesses, rather than directly applying existing neural network based methods.

4.1 Future Work

There are several potential future research directions for both problems studied in this dissertation that are fundamentally important and numerically influential.

For the first problem, one promising and practically influential direction is to construct systematic ways of determining the underlying weak topology to be used in the framework. In particular, most existing approaches fail to handle cases involving switching costs. Inspired by the proposed framework, a new weak topology, different from the classically used trajectory-induced one, needs to be found. More broadly, new optimality conditions and algorithms for general optimization problems are always of great interest to the literature.

For the second problem, there are many more open questions yet to be answered. One of the most important directions is to develop more reliable updating schemes for the proposed Q-function that avoid abuse of the simulator and explicitly incorporate the skewed distribution information caused by the underlying dynamics. Rigorous mathematical analysis of the performance of the proposed training methods is essential for establishing performance guarantees for the overall Q-learning framework. Moreover, other reinforcement learning frameworks, such as policy iteration or actor-critic approaches, could be implemented to help improve numerical performance. On a higher level, an in-depth understanding of the differences between these general frameworks is of great interest to the literature. The idea of incorporating optimal control insights into the reinforcement learning framework for other hybrid systems, such as piecewise affine systems, was in fact the original motivation of this work. Furthermore, possibilities of bringing optimal control insights into general neural network based reinforcement learning algorithms are worth exploring as well.

Bibliography

[1] P. K. Agarwal and N. H. Mustafa, “K-means projective clustering,” in Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2004, pp. 155–165.

[2] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with rectified linear units,” in International Conference on Learning Representations, 2018.

[3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.

[4] K. J. Åström and B. Wittenmark, Adaptive Control. Courier Corporation, 2013.

[5] H. Axelsson, Y. Wardi, M. Egerstedt, and E. Verriest, “Gradient descent approach to optimal mode scheduling in hybrid dynamical systems,” Journal of Optimization Theory and Applications, vol. 136, no. 2, pp. 167–186, 2008.

[6] M. Bardi and I. Capuzzo-Dolcetta, Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Springer Science & Business Media, 2008.

[7] R. Bellman, Dynamic Programming. Courier Corporation, 1957.

[8] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos, “The explicit linear quadratic regulator for constrained systems,” Automatica, vol. 38, no. 1, pp. 3–20, 2002.

[9] S. C. Bengea and R. A. DeCarlo, “Optimal control of switching systems,” Automatica, vol. 41, no. 1, pp. 11–27, 2005.

[10] L. D. Berkovitz, Optimal Control Theory, ser. Applied Mathematical Sciences. Springer, 1974.

[11] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, 4th ed. Athena Scientific, 2012.

[12] ——, Dynamic Programming and Optimal Control, Vol. I, 4th ed. Athena Scientific, 2017.

[13] ——, “Value and policy iterations in optimal control and adaptive dynamic programming,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 500–509, 2017.

[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, 1st ed. Athena Scientific, 1996.

[15] F. Borrelli, M. Baotić, A. Bemporad, and M. Morari, “Dynamic programming for constrained optimal control of discrete-time linear hybrid systems,” Automatica, vol. 41, no. 10, pp. 1709–1721, 2005.

[16] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017.

[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[18] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[19] P. S. Bradley and O. L. Mangasarian, “K-plane clustering,” Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.

[20] S. J. Bradtke, “Reinforcement learning applied to linear quadratic regulation,” in Advances in Neural Information Processing Systems, 1993, pp. 295–302.

[21] A. E. Bryson, Applied Optimal Control: Optimization, Estimation and Control. Routledge, 2018.

[22] L. Caccetta, I. Loosen, and V. Rehbock, “Computational aspects of the optimal transit path problem,” Journal of Industrial and Management Optimization, vol. 4, no. 1, pp. 95–105, 2008.

[23] C. G. Cassandras, D. L. Pepyne, and Y. Wardi, “Optimal control of a class of hybrid systems,” IEEE Transactions on Automatic Control, vol. 46, no. 3, pp. 398–415, Mar 2001.

[24] M. G. Crandall, L. C. Evans, and P.-L. Lions, “Some properties of viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 282, no. 2, pp. 487–502, 1984.

[25] M. G. Crandall and P.-L. Lions, “Viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 277, no. 1, pp. 1–42, 1983.

[26] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” https://arxiv.org/abs/1710.01688, 2018.

[27] M. Egerstedt, Y. Wardi, and H. Axelsson, “Transition-time optimization for switched-mode dynamical systems,” IEEE Transactions on Automatic Control, vol. 51, no. 1, pp. 110–115, 2006.

[28] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 2790–2797.

[29] ——, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.

[30] H. O. Fattorini, Infinite Dimensional Optimization and Control Theory, ser. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1999.

[31] X. Ge, W. Kohn, A. Nerode, and J. B. Remmel, “Hybrid systems: Chattering approximation to relaxed controls,” in Hybrid Systems III. Springer, 1996, pp. 76–100.

[32] R. Goebel, R. G. Sanfelice, and A. R. Teel, “Hybrid dynamical systems,” IEEE Control Systems, vol. 29, no. 2, pp. 28–93, 2009.

[33] ——, Hybrid Dynamical Systems: Modeling, Stability, and Robustness. Princeton University Press, 2012.

[34] H. Gonzalez, R. Vasudevan, M. Kamgarpour, S. S. Sastry, R. Bajcsy, and C. J. Tomlin, “A descent algorithm for the optimal control of constrained nonlinear switched dynamical systems,” in 13th ACM International Conference on Hybrid Systems: Computation and Control. ACM, 2010, pp. 51–60.

[35] ——, “A numerical method for the optimal control of switched systems,” in 49th IEEE Conference on Decision and Control, 2010, pp. 7519–7526.

[36] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in International Conference on Machine Learning, 2016, pp. 2829–2838.

[37] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural networks for perception. Elsevier, 1992, pp. 65–93.

[38] S. Hedlund and A. Rantzer, “Optimal control of hybrid systems,” in 38th IEEE Conference on Decision and Control, vol. 4, 1999, pp. 3972–3977.

[39] A. Jalali and R. Willett, “Subspace clustering via tangent cones,” in Advances in Neural Information Processing Systems, 2017, pp. 6744–6753.

[40] S. Ji and J. Ye, “An accelerated gradient method for trace norm minimization,” in International Conference on Machine Learning. ACM, 2009, pp. 457–464.

[41] D. E. Kirk, Optimal Control Theory: An Introduction. Englewood Cliffs: Prentice-Hall, 1970.

[42] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, “Reinforcement q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,” Automatica, vol. 50, no. 4, pp. 1167–1175, 2014.

[43] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.

[44] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, 2009.

[45] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley & Sons, 2012.

[46] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012.

[47] D. Liberzon, Switching in Systems and Control. Springer Science & Business Media, 2003.

[48] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

[49] Q. Lin, R. Loxton, and K. L. Teo, “Optimal control of nonlinear switched systems: Computational methods and applications,” Journal of the Operations Research Society of China, vol. 1, no. 3, pp. 275–311, 2013.

[50] J. Lipor, D. Hong, D. Zhang, and L. Balzano, “Subspace clustering using ensembles of k-subspaces,” arXiv preprint arXiv:1709.04744, 2017.

[51] R. C. Loxton, K. L. Teo, V. Rehbock, and W. Ling, “Optimal switching instants for a switched-capacitor dc/dc power converter,” Automatica, vol. 45, no. 4, pp. 973–980, 2009.

[52] R. C. Loxton, K. L. Teo, and V. Rehbock, “Computational method for a class of switched system optimal control problems,” IEEE Transactions on Automatic Control, vol. 54, no. 10, pp. 2455–2460, 2009.

[53] J. Lygeros, S. Sastry, and C. Tomlin, “Hybrid systems: Foundations, advanced topics and applications,” under copyright to be published by Springer Verlag, 2012.

[54] H. Maurer and J. Zowe, “First and second-order necessary and sufficient optimality conditions for infinite-dimensional programming problems,” Mathematical Programming, vol. 16, no. 1, pp. 98–110, 1979.

[55] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.

[56] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.

[57] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[58] H. Modares and F. L. Lewis, “Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning,” IEEE Transactions on Automatic Control, vol. 59, no. 11, pp. 3051–3056, 2014.

[59] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning, 2010, pp. 807–814.

[60] F. M. Oettmeier, J. Neely, S. Pekarek, R. DeCarlo, and K. Uthaichana, “MPC of switching in a boost converter using a hybrid state model with a sliding mode observer,” IEEE Transactions on Industrial Electronics, vol. 56, no. 9, pp. 3453–3466, 2009.

[61] D. L. Pepyne and C. G. Cassandras, “Optimal control of hybrid systems in manufacturing,” Proceedings of the IEEE, vol. 88, no. 7, pp. 1108–1123, 2000.

[62] B. Piccoli, “Hybrid systems and optimal control,” in 37th IEEE Conference on Decision and Control, vol. 1, 1998, pp. 13–18.

[63] E. Polak, Optimization: Algorithms and Consistent Approximations, ser. Applied Mathematical Sciences. Springer-Verlag, 1997.

[64] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2007, vol. 703.

[65] S. Rao, R. Tron, R. Vidal, and Y. Ma, “Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1832–1845, 2010.

[66] V. Rehbock and L. Caccetta, “Two defence applications involving discrete valued optimal control,” ANZIAM Journal, vol. 44, pp. E33–E54, 2002.

[67] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in IEEE International Conference on Neural Networks. IEEE, 1993, pp. 586–591.

[68] M. Rinehart, M. Dahleh, D. Reed, and I. Kolmanovsky, “Suboptimal control of switched systems with an application to the disc engine,” IEEE Transactions on Control Systems Technology, vol. 16, no. 2, pp. 189–201, March 2008.

[69] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.

[70] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[71] M. S. Shaikh and P. E. Caines, “On the optimal control of hybrid systems: Optimization of trajectories, switching times, and location schedules,” in Hybrid systems: Computation and control. Springer, 2003, pp. 466–481.

[72] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.

[73] M. Soltanolkotabi and E. J. Candes, “A geometric analysis of subspace clustering with outliers,” The Annals of Statistics, vol. 40, no. 4, pp. 2195–2238, 2012.

[74] M. Soltanolkotabi, E. Elhamifar, and E. J. Candes, “Robust subspace clustering,” The Annals of Statistics, vol. 42, no. 2, pp. 669–699, 2014.

[75] E. D. Sontag, Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer Science & Business Media, 2013, vol. 6.

[76] H. J. Sussmann, “A maximum principle for hybrid optimal control problems,” in 38th IEEE Conference on Decision and Control, vol. 1, 1999, pp. 425–430.

[77] ——, “Set-valued differentials and the hybrid maximum principle,” in 39th IEEE Conference on Decision and Control, vol. 1, 2000, pp. 558–563.

[78] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT press, 1998.

[79] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.

[80] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.

[81] P. Tseng, “Nearest q-flat to m points,” Journal of Optimization Theory and Applications, vol. 105, no. 1, pp. 249–252, 2000.

[82] S. Tu and B. Recht, “Least-squares temporal difference learning for the linear quadratic regulator,” arXiv preprint arXiv:1712.08642, 2017.

[83] K. Uthaichana, R. DeCarlo, S. Bengea, M. Žefran, and S. Pekarek, “Hybrid optimal theory and predictive control for power management in hybrid electric vehicle,” Journal of Nonlinear Systems and Applications, vol. 2, no. 1-2, pp. 96–110, 2011.

[84] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.

[85] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning.” in AAAI, vol. 2. Phoenix, AZ, 2016, p. 5.

[86] R. Vasudevan, H. Gonzalez, R. Bajcsy, and S. S. Sastry, “Consistent approximations for the optimal control of constrained switched systems—part 1: A conceptual algorithm,” SIAM Journal on Control and Optimization, vol. 51, no. 6, pp. 4463–4483, 2013.

[87] ——, “Consistent approximations for the optimal control of constrained switched systems—part 2: An implementable algorithm,” SIAM Journal on Control and Optimization, vol. 51, no. 6, pp. 4484–4503, 2013.

[88] R. Vidal and P. Favaro, “Low rank subspace clustering,” Pattern Recognition Letters, vol. 43, pp. 47–61, 2014.

[89] R. Vidal, Y. Ma, and S. S. Sastry, Generalized Principal Component Analysis. Springer, 2016, vol. 5.

[90] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[91] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.

[92] J. Warga, Optimal Control of Differential and Functional Equations. Academic press, 2014.

[93] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[94] Q. Wei, D. Liu, and X. Yang, “Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 4, pp. 866–879, 2015.

[95] S. Wei, K. Uthaichana, M. Žefran, and R. DeCarlo, “Hybrid model predictive control for the stabilization of wheeled mobile robots subject to wheel slippage,” IEEE Transactions on Control Systems Technology, vol. 21, no. 6, pp. 2181–2193, Nov 2013.

[96] P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.

[97] X. Xu and P. J. Antsaklis, “A dynamic programming approach for optimal control of switched systems,” in 39th IEEE Conference on Decision and Control, vol. 2. IEEE, 2000, pp. 1822–1827.

[98] ——, “Optimal control of switched systems: New results and open problems,” in American Control Conference, vol. 4. IEEE, 2000, pp. 2683–2687.

[99] ——, “Stabilization of second-order LTI switched systems,” International Journal of Control, vol. 73, no. 14, pp. 1261–1279, 2000.

[100] ——, “Results and perspectives on computational methods for optimal control of switched systems,” in Hybrid Systems: Computation and Control. Springer, 2003, pp. 540–555.

[101] ——, “Optimal control of switched systems based on parameterization of the switching instants,” IEEE Transactions on Automatic Control, vol. 49, no. 1, pp. 2–16, 2004.

[102] T. Zhang, A. Szlam, and G. Lerman, “Median k-flats for hybrid linear modeling with many outliers,” in 12th IEEE International Conference on Computer Vision Workshops. IEEE, 2009, pp. 234–241.

[103] W. Zhang, A. Abate, and J. Hu, “Efficient suboptimal solutions of switched LQR problems,” in American Control Conference. IEEE, 2009, pp. 1084–1091.

[104] W. Zhang, A. Abate, J. Hu, and M. P. Vitus, “Exponential stabilization of discrete-time switched linear systems,” Automatica, vol. 45, no. 11, pp. 2526–2536, 2009.

[105] W. Zhang, J. Hu, and A. Abate, “Infinite-horizon switched LQR problems in discrete time: A suboptimal algorithm with performance analysis,” IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1815–1821, 2012.
