
A Mean-field Optimal Control Formulation for Global Optimization

Chi Zhang, Amirhossein Taghvaei, and Prashant G. Mehta

Abstract—A particle filter is introduced to numerically approximate a solution of the global optimization problem. The theoretical significance of this work comes from its variational aspects. Specifically, the particle filter is a controlled interacting particle system where the control input represents the solution of a mean-field type optimal control problem. Its parametric counterpart, obtained when a parametric form of the density is known a priori, is shown to be equivalent to the natural gradient algorithm. Explicit formulae for the filter are derived when the objective function is quadratic and the density is Gaussian. The optimal control construction of the particle filter is a significant departure from the classical importance sampling-resampling based approaches.

I. INTRODUCTION

This paper is concerned with the coupled ordinary differential equation-partial differential equation (ODE-PDE) model:

  ODE:  dX^i_t/dt = −∇φ(X^i_t, t),   X^i_0 ∼ p*_0          (1a)

  PDE:  −(1/p(x,t)) ∇·(p(x,t) ∇φ(x,t)) = h(x) − ĥ_t          (1b)

where X^i_t ∈ R^d is the state, at time t, of the i-th particle in a population of N particles; p*_0(x) is a (given) everywhere positive prior density on R^d; and the objective function h(x) is a real-valued function defined on R^d. The control function −∇φ(x,t), used to define the righthand-side of the ODE (1a), is obtained by solving, at each fixed time t > 0, the (Poisson equation) PDE (1b), where ∇ and ∇· denote the gradient and the divergence operators, respectively; p(x,t) is the density of the state X^i_t; and ĥ_t := ∫ h(x) p(x,t) dx. For the remainder of this paper, the coupled ODE-PDE model (1a)-(1b) is referred to as the particle filter.

The interest in such coupled ODE-PDE models comes from the feedback particle filter (FPF) [25], [24] and related particle flow algorithms for nonlinear filtering [19], [7], [12], [8]. The inspiration for controlling a single particle comes from the mean-field type control formalisms [15], [3], [4], [26] and control methods for optimal transportation [5], [6], where coupled ODE-PDE models also arise.

The particular model (1a)-(1b) is interesting because the Fokker-Planck equation for the density (formally) yields the replicator PDE:

  ∂p/∂t (x,t) = ∇·(p(x,t) ∇φ(x,t)) = −(h(x) − ĥ_t) p(x,t)

where the second equality follows from (1b). With the initial condition p(x,0) = p*_0(x), its closed-form solution is given by the Bayes' density:

  p(x,t) := p*_0(x) exp(−h(x)t) / ∫ p*_0(y) exp(−h(y)t) dy =: p*(x,t)          (2)

Consequently, as t → ∞, one would expect the particles X^i_t to converge to the global minimizer of the function h, in the support of the prior p*_0.
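As an illustration of (2) (this numerical sketch is not part of the original paper; the one-dimensional objective, prior, and grid resolution are illustrative choices), the Bayes' density can be evaluated on a grid and seen to concentrate near the global minimizer as t grows:

```python
import numpy as np

# Grid discretization of the real line (illustrative resolution).
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

def h(x):
    # Non-convex test objective; its global minimizer lies near x = 1.
    return 0.5 * (x - 1.0) ** 2 + 0.5 * np.sin(5.0 * x)

# Everywhere positive Gaussian prior p*_0 = N(0, 4), per Assumption A1.
p0 = np.exp(-x ** 2 / 8.0)
p0 /= p0.sum() * dx

for t in [0.0, 1.0, 10.0, 100.0]:
    # Bayes' density (2): p*(x,t) ∝ p*_0(x) exp(−h(x) t); h is shifted by its
    # minimum for numerical stability (the shift cancels in the normalization).
    w = p0 * np.exp(-(h(x) - h(x).min()) * t)
    p = w / (w.sum() * dx)
    print(f"t = {t:6.1f}: mode at x = {x[p.argmax()]:+.3f}")
```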
The use of probabilistic models to derive recursive sampling schemes is by now a standard solution approach to the global optimization problem: the model (2) appears in [23], with closely related variants given in [20], [13]. Importance sampling type schemes based on these and more general stochastic models appear in [29], [17], [18], [21].

The central question for this paper is whether there is an optimal control interpretation for the −∇φ term in (1a). Such an interpretation for the optimization algorithm (1a)-(1b), and more generally for the FPF, is of both theoretical and practical interest, e.g., in the development of numerical algorithms.

The main contribution of this paper is to describe an optimal control construction for −∇φ (Theorems 1 and 2). The construction is based on the following steps: (i) the density transport (2) is shown to be a gradient flow (steepest descent) in the space of densities, with respect to the Kullback–Leibler divergence pseudo-metric; (ii) a Lagrangian is defined by casting the gradient flow as a dynamic programming recursion; and (iii) the resulting mean-field type optimal control problem is solved to obtain the particle filter (1a)-(1b). The particle filter is shown to be the Hamilton's equation where the Poisson equation is the first order optimality condition obtained from the application of the Pontryagin's minimum principle.

For the case where the objective function h is a quadratic function and the prior p*_0 is a Gaussian density, the Poisson equation (1b) admits a closed-form solution. The resulting control law is affine in the state. The quadratic Gaussian case is an example of the more general parametric case where the density is of a (known) parametrized form. For the parametric case, the filter is shown to be equivalent to the finite-dimensional natural gradient algorithm in the space of parameters.

The outline of the remainder of this paper is as follows: The optimal control construction appears in Sec. II. The quadratic Gaussian and the more general parametric cases are discussed in Sec. III. All the proofs are contained in the Appendix.

C. Zhang, A. Taghvaei and P. G. Mehta are with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign (UIUC) [email protected]; [email protected]; [email protected]. Financial support from the NSF CMMI grants 1334987 and 1462773 is gratefully acknowledged. The conference version of this paper appears in [27].

Notation: The Euclidean space R^d is equipped with the Borel σ-algebra denoted as B(R^d). The space of Borel probability measures on R^d with finite second moment is denoted as P:

  P := { ρ : R^d → [0, ∞) measurable density : ∫ |x|² ρ(x) dx < ∞ }

The density for a Gaussian random variable with mean m and variance Σ is denoted as N(m, Σ). For vectors x, y ∈ R^d, the dot product is denoted as x·y and |x| := √(x·x); x^T denotes the transpose of the vector. Similarly, for a matrix K, K^T denotes the matrix transpose, and K ≻ 0 denotes positive-definiteness. C^k is used to denote the space of k-times continuously differentiable functions on R^d. For a function f, ∇f = (∂f/∂x_i) is used to denote the gradient vector, and D²f = (∂²f/∂x_i∂x_j) is used to denote the Hessian matrix. L^∞ denotes the space of bounded functions on R^d with associated norm denoted as ‖·‖_∞. L²(R^d; ρ) is the Hilbert space of square integrable functions on R^d equipped with the inner-product ⟨φ, ψ⟩ := ∫ φ(x)ψ(x)ρ(x) dx. The associated norm is denoted as ‖φ‖₂² := ⟨φ, φ⟩.

The following assumptions are made throughout the paper:

(i) Assumption A1: The prior probability density function p*_0 ∈ P and is of the form p*_0(x) = e^{−V_0(x)}, where V_0 ∈ C², D²V_0 ∈ L^∞, and

  liminf_{|x|→∞} ∇V_0(x) · (x/|x|) = ∞

(ii) Assumption A2: The function h ∈ C² ∩ L²(R^d; p*_0) with D²h ∈ L^∞ and

  liminf_{|x|→∞} ∇h(x) · (x/|x|) > −∞

(iii) Assumption A3: The function h has a unique minimizer x̄ ∈ R^d with minimum value h(x̄) =: h̄. Outside some compact set D ⊂ R^d, ∃ r > 0 such that

  h(x) > h̄ + r   ∀ x ∈ R^d \ D

Remark 1: Assumptions (A1) and (A2) are important to prove existence, uniqueness and regularity of the solutions of the Poisson equation [24], [16]. (A1) holds for densities with Gaussian tails. Assumption (A3) is used to obtain weak convergence of p*(x,t) to the Dirac delta at x̄. The uniqueness of the minimizer x̄ can be relaxed to obtain weaker conclusions on convergence (see Appendix C).

II. VARIATIONAL FORMULATION

A variational formulation of the Bayes formula is the following time-stepping procedure: For the discrete-time sequence {t_0, t_1, t_2, ..., t_N̄} with t_N̄ = T and increments Δt_n := t_n − t_{n−1}, set ρ_0 = p*_0 ∈ P and recursively define {ρ_n}_{n=1}^{N̄} ⊂ P by taking ρ_n ∈ P to minimize the functional

  I(ρ | ρ_{n−1}) := (1/Δt_n) D(ρ | ρ_{n−1}) + ∫ h(x)ρ(x) dx          (3)

where D denotes the relative entropy or Kullback–Leibler divergence,

  D(ρ | ρ_{n−1}) := ∫ ρ(x) ln( ρ(x)/ρ_{n−1}(x) ) dx

The proof that ρ_n = p*(x, t_n) (defined in (2)) is the minimizer is straightforward: by Jensen's formula, I(ρ | ρ_{n−1}) ≥ −(1/Δt_n) ln( ∫ ρ_{n−1}(y) exp(−h(y)Δt_n) dy ), with equality if and only if ρ = ρ_n.
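As a numerical sanity check of the time-stepping procedure (this sketch is not from the paper; the quadratic objective, prior, and grid are illustrative), the minimizer of (3) can be iterated on a grid via the multiplicative update ρ_n ∝ ρ_{n−1} exp(−h Δt_n) and compared against the Bayes' density (2) evaluated at t_n = nΔt:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
h = 0.5 * (x - 1.0) ** 2                        # illustrative quadratic objective
p0 = np.exp(-x ** 2 / 2.0)                      # Gaussian prior N(0, 1)
p0 /= p0.sum() * dx

dt, n_steps = 0.01, 500
rho = p0.copy()
for n in range(n_steps):
    rho = rho * np.exp(-h * dt)                 # minimizer of (3): Bayes update
    rho /= rho.sum() * dx

t = dt * n_steps
p_star = p0 * np.exp(-h * t)                    # Bayes' density (2) at time t
p_star /= p_star.sum() * dx
print("L1 gap between iterated (3) and (2):", np.abs(rho - p_star).sum() * dx)
```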
To obtain an optimal control formulation, the key idea is to view the gradient flow time-stepping procedure as a dynamic programming recursion from time t_{n−1} → t_n:

  ρ_n = argmin_{ρ^{(u)} ∈ P}  (1/Δt_n) D(ρ^{(u)} | ρ_{n−1}) + V(ρ^{(u)})

where the first term is referred to as the control cost and V(ρ^{(u)}) := ∫ ρ^{(u)}(x) h(x) dx is the cost-to-go. The notation ρ^{(u)} for the density corresponds to the following construction: Consider the differential equation

  dX^i_t/dt = u(X^i_t, t)

and denote the associated flow from t_{n−1} → t_n as x ↦ s_n(x). Under suitable assumptions on u (Lipschitz in x and continuous in t), the flow map s_n is a well-defined diffeomorphism on R^d and ρ^{(u)} := s_n^#(ρ_{n−1}), where s_n^# denotes the push-forward operator. The push-forward of a probability density ρ by a smooth map s is defined through the change-of-variables formula

  ∫ f(x) [s^#(ρ)](x) dx = ∫ f(s(x)) ρ(x) dx

for all continuous and bounded test functions f.

Via a formal but straightforward calculation, in the asymptotic limit as Δt_n → 0, the control cost is expressed in terms of the control u as

  (1/Δt_n) D(ρ^{(u)} | ρ_{n−1}) = (Δt_n/2) ∫ | (1/ρ_{n−1}) ∇·(ρ_{n−1} u) |² ρ_{n−1} dx + o(Δt_n)          (4)

(Indeed, ρ^{(u)} = ρ_{n−1} − Δt_n ∇·(ρ_{n−1} u) + o(Δt_n), and for a small perturbation the divergence is quadratic to leading order: D(ρ + δρ | ρ) = ½ ∫ |δρ|²/ρ dx + o(‖δρ‖²).)

These considerations help motivate the following optimal control problem:

  Minimize over u:  J(u) = ∫₀ᵀ L(ρ_t, u_t) dt + ∫ h(x) ρ_T(x) dx          (5)
  Constraint:       ∂ρ_t/∂t + ∇·(ρ_t u_t) = 0,   ρ_0(x) = p*_0(x)

where the Lagrangian is defined as

  L(ρ, u) := ½ ∫_{R^d} | (1/ρ(x)) ∇·(ρ(x)u(x)) |² ρ(x) dx + ½ ∫_{R^d} |h(x) − ĥ|² ρ(x) dx

where ĥ := ∫ h(x)ρ(x) dx.

The Hamiltonian is defined as

  H(ρ, q, u) := L(ρ, u) − ∫ ∇q(x) · (ρ(x)u(x)) dx          (6)

where q is referred to as the momentum.

Suppose ρ ∈ P is the density at time t. The value function is defined as

  V(ρ, t) := inf_u ∫_tᵀ L(ρ_s, u_s) ds          (7)

The value function is a functional on the space of densities. For a fixed ρ ∈ P and time t ∈ [0, T), the (Gâteaux) derivative of V is a function on R^d, and an element of the function space L²(R^d; ρ). This function is denoted as (∂V/∂ρ)(ρ,t)(x) for x ∈ R^d.

Additional details appear in Appendix A, where the following theorem is proved.

Theorem 1 (Finite-horizon optimal control): Consider the optimal control problem (5) with the value function defined in (7). Then V solves the following DP equation:

  ∂V/∂t (ρ,t) + inf_{u∈L²} H(ρ, (∂V/∂ρ)(ρ,t), u) = 0,   t ∈ [0, T)
  V(ρ, T) = ∫ h(x)ρ(x) dx

The solution of the DP equation is given by

  V(ρ, t) = ∫_{R^d} h(x)ρ(x) dx

and the associated optimal control is a solution of the following pde:

  (1/ρ(x)) ∇·(ρ(x)u(x)) = h(x) − ĥ,   ∀ x ∈ R^d          (8)

It is also useful to consider the following infinite-horizon version of the optimal control problem:

  Minimize over u:  J(u) = ∫₀^∞ L(ρ_t, u_t) dt
  Constraints:      ∂ρ_t/∂t + ∇·(ρ_t u_t) = 0,   ρ_0(x) = p*_0(x)          (9)
                    lim_{t→∞} ∫ h(x)ρ_t(x) dx = h(x̄)

For this problem, the value function is defined as

  V(ρ) = inf_u J(u)          (10)

The solution is given by the following theorem whose proof appears in Appendix A:

Theorem 2 (Infinite-horizon optimal control): Consider the infinite horizon optimal control problem (9) with the value function defined in (10). The value function is given by

  V(ρ) = ∫_{R^d} h(x)ρ(x) dx − h(x̄)

and the associated optimal control law is a solution of the pde (8).

The particle filter algorithm (1a)-(1b) in Sec. I is obtained by additionally requiring the solution u of (8) to be of the gradient form. One of the advantages of doing so is that the optimizing control law, obtained instead as the solution of (1b), is uniquely defined [16]. In part, this choice is guided by the L² optimality of the gradient form solution (the proof is omitted on account of space):

Lemma 1 (L² optimality): Consider the pde (8) where ρ and h satisfy Assumptions (A1)-(A2). The general solution is given by

  u = −∇φ + v

where φ is the solution of (1b), v solves ∇·(ρv) = 0, and

  ‖u‖₂² = ‖∇φ‖₂² + ‖v‖₂²

That is, u = −∇φ is the minimum L²-norm solution of (8).
Remark 2: In Appendix B, the Pontryagin's minimum principle of optimal control is used to express the particle filter (1a)-(1b) in its Hamilton's form:

  dX^i_t/dt = u(X^i_t, t),   X^i_0 ∼ p*_0
  0 ≡ H(p(·,t), h, u(·,t)) = min_{v∈L²} H(p(·,t), h, v)

The Poisson equation (1b) is simply the first order optimality condition to obtain a minimizing control. Under this optimal control, the density p(x,t) is the optimal trajectory. The associated optimal trajectory for the momentum (co-state) is a constant equal to its terminal value h(x).

The following theorem shows that the particle filter implements the Bayes' transport of the density, and establishes the asymptotic convergence for the density (the proof appears in Appendix C). We recall the notation for the two types of density in our analysis:
1) p(x,t): the density of X^i_t.
2) p*(x,t): the Bayes' density given by (2).

Theorem 3 (Bayes' exactness and convergence): Consider the ODE-PDE model (1a)-(1b). If p(·,0) = p*(·,0), we have for all t ≥ 0,

  p(·,t) = p*(·,t)

As t → ∞, ∫ h(x)p(x,t) dx decreases monotonically to h(x̄), and X^i_t → x̄ in probability.

The hard part of implementing the particle filter is solving the Poisson equation (1b). For the quadratic Gaussian case – where the objective function h is quadratic and the prior p*_0 is Gaussian – the solution can be obtained in an explicit form. This is the subject of Sec. III-A. The more general parametric case is the subject of Sec. III-B.

III. PARAMETRIC CASES

A. Quadratic Gaussian case

For the quadratic Gaussian problem, the solution of the Poisson equation can be obtained in an explicit form as described in the following lemma. The proof appears in Appendix D.

Lemma 2: Consider the Poisson equation (1b). Suppose the objective function h is a quadratic function such that h(x) → ∞ as |x| → ∞ and the density ρ is a Gaussian with mean m and variance Σ. Then the control function

  u(x) = −∇φ(x) = −K(x − m) − b          (11)

where the affine constant vector

  b = ∫ x (h(x) − ĥ) ρ(x) dx          (12)

and the gain matrix K = K^T ≻ 0 is the solution of the Lyapunov equation:

  ΣK + KΣ = ∫ (x − m)(x − m)^T (h(x) − ĥ) ρ(x) dx          (13)
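A minimal sketch of how the control (11) can be assembled numerically is given below (this is an illustration, not from the paper: the moment integrals (12)-(13) are taken as precomputed inputs, e.g., from quadrature or sampling), using SciPy's continuous Lyapunov solver for (13):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def affine_control(m, Sigma, b, C):
    """Control law (11): u(x) = -K(x - m) - b.

    b is the vector integral (12); C is the matrix integral on the
    righthand-side of (13). K solves the Lyapunov equation (13).
    """
    K = solve_continuous_lyapunov(Sigma, C)  # solves Sigma @ K + K @ Sigma = C
    return lambda x: -K @ (x - m) - b
```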

Using the affine control law (11), it is straightforward to verify that p(x,t) = p*(x,t) is a Gaussian whose mean m_t → x̄ and variance Σ_t → 0. The proofs of the following proposition and corollary appear in Appendix D:

Proposition 1: Consider the particle filter (1a) with the affine control law (11). Suppose the objective function h is a quadratic function such that h(x) → ∞ as |x| → ∞ and the prior density p*_0 is a Gaussian with mean m_0 and variance Σ_0. Then the posterior density p is a Gaussian whose mean m_t and variance Σ_t evolve according to

  dm_t/dt = −E[ X^i_t (h(X^i_t) − ĥ_t) ]
  dΣ_t/dt = −E[ (X^i_t − m_t)(X^i_t − m_t)^T (h(X^i_t) − ĥ_t) ]          (14)

where ĥ_t := E[h(X^i_t)].

Corollary 1: Under the hypothesis of Proposition 1, with an explicit form for the quadratic objective function h(x) = ½(x − x̄)^T H(x − x̄) + c where H = H^T ≻ 0, the expectations on the righthand-side of (14) are computed in closed-form and the resulting evolution is given by

  dm_t/dt = Σ_t H (x̄ − m_t)          (15a)
  dΣ_t/dt = −Σ_t H Σ_t          (15b)

whose explicit solution is given by

  m_t = m_0 + Σ_0 S_t^{-1} (x̄ − m_0)
  Σ_t = Σ_0 − Σ_0 S_t^{-1} Σ_0

where S_t := (1/t) H^{-1} + Σ_0. In particular, m_t → x̄ and Σ_t → 0.

In practice, the affine control law (11) is implemented as:

  dX^i_t/dt = −K^(N)_t (X^i_t − m^(N)_t) − b^(N)_t =: u^i_t          (16)

where the terms are approximated empirically from the ensemble {X^i_t}_{i=1}^N. The algorithm appears in Algorithm 1 (the dependence on time t is suppressed). As N → ∞, the approximations become exact and (11) represents the mean-field limit of the finite-N control in (16). Consequently, the empirical distribution of the ensemble approximates the posterior distribution (density) p*(x,t).

Remark 3: The finite-dimensional system (14) is the optimization counterpart of the Kalman filter. Likewise, the particle filter (16) is the counterpart of the ensemble Kalman filter.

While the affine control law (11) is optimal for the quadratic Gaussian case, it can be implemented in the more general non-quadratic non-Gaussian settings – as long as the various approximations can be obtained at each step. The situation is analogous to the filtering setup where the Kalman filter is often used as an approximate algorithm even in nonlinear non-Gaussian settings.

General purpose numerical algorithms for approximating the solution of the Poisson equation appear in our earlier papers [22]. These algorithms have been applied to several benchmark global optimization problems. These numerical results, including a comparison with other state-of-the-art algorithms, appear in the PhD thesis of the first author [28].

Remark 4: The particle filter (16) is related to the cross-entropy (CE) and the model reference adaptive search (MRAS) algorithms described in [20], [13]. In these algorithms, one samples recursively from a Gaussian density whose parameters are updated at each time-step via density projection. The update formulae in (14) are the counterparts of similar formulae obtained using the density projection in CE and MRAS. The particle filter (16) is simpler, or at least more direct, because (i) an explicit update for the parameters is not necessary, (ii) recursive sampling from the Gaussian density is avoided, and (iii) there is no need to calculate the importance weights.

Algorithm 1 Affine approximation of the control function
Input: {X^i}_{i=1}^N, {h(X^i)}_{i=1}^N
Output: {u^i}_{i=1}^N
1: Calculate m^(N) := (1/N) ∑_{i=1}^N X^i
2: Calculate Σ^(N) := (1/N) ∑_{i=1}^N (X^i − m^(N))(X^i − m^(N))^T
3: Calculate ĥ^(N) := (1/N) ∑_{i=1}^N h(X^i)
4: Calculate b^(N) := (1/N) ∑_{i=1}^N X^i (h(X^i) − ĥ^(N))
5: Calculate C^(N) := (1/N) ∑_{i=1}^N (X^i − m^(N))(X^i − m^(N))^T (h(X^i) − ĥ^(N))
6: Calculate K^(N) by solving Σ^(N) K^(N) + K^(N) Σ^(N) = C^(N)
7: Calculate u^i = −K^(N)(X^i − m^(N)) − b^(N)
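A direct NumPy/SciPy rendering of Algorithm 1, together with an Euler discretization of (16), is sketched below (the step size, particle count, and test objective are illustrative choices, not from the paper):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def algorithm_1(X, hX):
    """Algorithm 1: affine approximation of the control function.

    X: (N, d) array of particles X^i; hX: (N,) array of values h(X^i).
    Returns the (N, d) array of controls u^i.
    """
    N = len(X)
    m = X.mean(axis=0)                          # step 1: m^(N)
    Xc = X - m
    Sigma = Xc.T @ Xc / N                       # step 2: Sigma^(N)
    w = hX - hX.mean()                          # step 3: h(X^i) - hhat^(N)
    b = X.T @ w / N                             # step 4: b^(N)
    C = (Xc * w[:, None]).T @ Xc / N            # step 5: C^(N)
    K = solve_continuous_lyapunov(Sigma, C)     # step 6: K^(N)
    return -Xc @ K.T - b                        # step 7: u^i

# Illustrative run: quadratic objective with minimizer x_bar = (1, -2).
rng = np.random.default_rng(0)
x_bar = np.array([1.0, -2.0])
h = lambda X: 0.5 * ((X - x_bar) ** 2).sum(axis=1)
X = 3.0 * rng.normal(size=(1000, 2))            # prior ensemble ~ N(0, 9 I)
dt = 0.05
for _ in range(200):                            # Euler discretization of (16)
    X = X + dt * algorithm_1(X, h(X))
print("ensemble mean:", X.mean(axis=0))         # approaches x_bar
```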
B. Natural Gradient Algorithm

Consider next the case where the density has a known parametric form

  p(x,t) = ϱ(x; θ_t)          (17)

where θ_t ∈ R^M is the parameter vector. For example, in the quadratic Gaussian problem, ϱ is a Gaussian with parameters m_t and Σ_t.

For the parametric density, (∂/∂ϑ)(log ϱ(x;ϑ)) is an M × 1 column vector whose k-th entry is

  [ (∂/∂ϑ)(log ϱ(x;ϑ)) ]_k = (∂/∂ϑ_k)(log ϱ(x;ϑ))

for k = 1, ..., M.

The Fisher information matrix is an M × M matrix:

  G(ϑ) := ∫ (∂/∂ϑ)(log ϱ(x;ϑ)) (∂/∂ϑ)(log ϱ(x;ϑ))^T ϱ(x;ϑ) dx          (18)

By construction, G(ϑ) is symmetric and positive semidefinite. In the following, it is furthermore assumed that G(ϑ) is strictly positive definite, and thus invertible, for all ϑ ∈ R^M.

In terms of the parameter, the expected objective is

  e(ϑ) := ∫ h(x) ϱ(x;ϑ) dx

and its gradient is an M × 1 column vector:

  ∇e(ϑ) = ∫ h(x) (∂/∂ϑ)(log ϱ(x;ϑ)) ϱ(x;ϑ) dx          (19)

We are now prepared to describe the induced evolution for the parameter vector θ_t. The proof of the following proposition appears in Appendix E.

Proposition 2: Consider the particle filter (1a)-(1b). Suppose the density admits the parametric form (17) whose Fisher information matrix, defined in (18), is assumed to be invertible. Then the parameter vector θ_t is a solution of the following ordinary differential equation,

  dθ_t/dt = −G(θ_t)^{-1} ∇e(θ_t)          (20)

Remark 5: The filter (20) is the well known natural gradient algorithm; cf. [1], [11]. It has several variational interpretations:

(i) The filter can be obtained via a time stepping procedure, analogous to (3). The sequence {θ_n}_{n=1}^{N̄} is inductively defined as a minimizer of the function

  I(θ | θ_{n−1}) := (1/Δt_n) D(ϱ(·;θ) | ϱ(·;θ_{n−1})) + e(θ)

On taking the limit as Δt_n → 0, one arrives at the filter (20).

(ii) The optimal control interpretation of (20) is based on the Pontryagin's minimum principle (see also Remark 2). For the finite-dimensional problem, the Hamiltonian is

  H(θ, q, u) = L(θ, u) + q · u

where q ∈ R^M is the momentum. With θ̇ = u, the counterpart of (4) is

  (1/Δt_n) D(ϱ(·;θ) | ϱ(·;θ_{n−1})) = ½ u^T G(θ) u + o(Δt_n)

With ½ u^T G(θ) u as the control cost component in the Lagrangian, the first order optimality condition gives

  G(θ) u = −q = −∇e(θ)

where we have used the fact that e(θ) is the value function. Note that it was not necessary to write the explicit form of the Lagrangian to obtain the optimal control.

(iii) Finally, the filter (20) represents the gradient flow (in R^M) for the objective function e(θ) with respect to the Riemannian metric ⟨v, w⟩_θ = v^T G(θ) w for all v, w ∈ R^M.

Example 1: In the quadratic Gaussian case, the natural gradient algorithm (20) with parameters m_t and Σ_t reduces to (14).

Remark 6: While the systems (20) and (14) are finite-dimensional, the righthand-sides will still need to be approximated empirically. The convergence properties of a class of related algorithms are studied using a stochastic approximation framework in [14].
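A minimal Monte Carlo sketch of the natural gradient update (20) is given below for a one-dimensional Gaussian parametrization θ = (m, log σ) (the parametrization, sample size, and objective are illustrative assumptions; by Example 1, this recursion is equivalent to (14) in the Gaussian case):

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: 0.5 * (x - 1.0) ** 2              # illustrative objective, minimizer at 1

def score(x, m, log_s):
    """(d/dtheta) log rho(x; theta) for rho = N(m, sigma^2), theta = (m, log sigma)."""
    s2 = np.exp(2.0 * log_s)
    return np.stack([(x - m) / s2, (x - m) ** 2 / s2 - 1.0], axis=1)

theta, dt = np.array([-3.0, np.log(2.0)]), 0.05
for _ in range(400):                            # Euler discretization of (20)
    m, log_s = theta
    x = rng.normal(m, np.exp(log_s), size=5000)
    S = score(x, m, log_s)
    grad_e = S.T @ h(x) / len(x)                # Monte Carlo estimate of (19)
    G = S.T @ S / len(x)                        # Monte Carlo estimate of (18)
    theta = theta - dt * np.linalg.solve(G, grad_e)  # natural gradient step (20)
print("m ->", theta[0], "; sigma ->", np.exp(theta[1]))  # m near 1, sigma shrinking
```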
REFERENCES

[1] S.-I. Amari. Natural gradient works efficiently in learning. Neural Comput., 10(2):251–276, 1998.
[2] D. Bakry, I. Gentil, and M. Ledoux. Analysis and Geometry of Markov Diffusion Operators, volume 348. Springer Science & Business Media, 2013.
[3] A. Bensoussan, J. Frehse, and S. C. P. Yam. The master equation in mean field theory. Journal de Mathématiques Pures et Appliquées, 103(6):1441–1474, 2015.
[4] R. Brockett. Optimal control of the Liouville equation. AMS IP Studies in Advanced Mathematics, 39:23, 2007.
[5] Y. Chen, T. Georgiou, and M. Pavon. On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. J. Optimiz. Theory App., 169(2):671–691, 2016.
[6] Y. Chen, T. Georgiou, and M. Pavon. Optimal steering of a linear stochastic system to a final probability distribution, part I. IEEE Trans. Autom. Control, 61(5):1158–1169, 2016.
[7] F. Daum and J. Huang. Exact particle flow for nonlinear filters. In Proc. SPIE 7698, pages 76980I–1, 2010.
[8] F. Daum, J. Huang, and A. Noushin. Gromov's method for Bayesian stochastic particle flow: A simple exact formula for Q. In Proc. IEEE Int. Conf. Multisensor Fusion Integr. Intel. Syst. (MFI), pages 540–545, 2016.
[9] G. E. Dullerud and F. Paganini. A Course in Robust Control Theory: A Convex Approach, volume 36. Springer, 2013.
[10] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
[11] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. B, 73(2):123–214, 2011.
[12] J. Heng, A. Doucet, and Y. Pokern. Gibbs flow for approximate transport with applications to Bayesian computation. arXiv preprint:1509.08787.
[13] J. Hu, M. C. Fu, and S. I. Marcus. A model reference adaptive search method for global optimization. Oper. Res., 55(3):549–568, 2007.
[14] J. Hu and P. Hu. A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Trans. Autom. Control, 57(1):165–178, 2012.
[15] M. Huang, P. E. Caines, and R. P. Malhamé. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Autom. Control, 52(9):1560–1571, 2007.
[16] R. S. Laugesen, P. G. Mehta, S. P. Meyn, and M. Raginsky. Poisson's equation in nonlinear filtering. SIAM J. Control Optim., 53(1):501–525, 2015.
[17] B. Liu. Posterior exploration based sequential Monte Carlo for global optimization. arXiv preprint:1509.08870, 2016.
[18] B. Liu, S. Cheng, and Y. Shi. Particle filter optimization: A brief introduction. In Proc. 7th Int. Conf. Swarm Intel., pages 95–104, 2016.
[19] S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.
[20] R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodol. Comput. Appl., 1(2):127–190, 1999.
[21] P. Stinis. Stochastic global optimization as a filtering problem. J. Comput. Phys., 231(4):2002–2014, 2012.
[22] A. Taghvaei and P. G. Mehta. Gain function approximation in the feedback particle filter. In Proc. 55th IEEE Conf. Decision Control, pages 5446–5452. IEEE, 2016.
[23] Y. Wang and M. C. Fu. Model-based evolutionary optimization. In Proc. 2010 Winter Simul. Conf., pages 1199–1210, 2010.
[24] T. Yang, R. S. Laugesen, P. G. Mehta, and S. P. Meyn. Multivariable feedback particle filter. Automatica, 71(9):10–23, 2016.
[25] T. Yang, P. G. Mehta, and S. P. Meyn. Feedback particle filter. IEEE Trans. Autom. Control, 58(10):2465–2480, 2013.
[26] H. Yin, P. G. Mehta, S. P. Meyn, and U. V. Shanbhag. Synchronization of coupled oscillators is a game. IEEE Trans. Autom. Control, 57(4):920–935, 2012.
[27] C. Zhang. A particle system for global optimization. In Proc. 52nd IEEE Conf. Decision Control, pages 1714–1719, 2013.
[28] C. Zhang. Controlled Particle Systems for Nonlinear Filtering and Global Optimization. PhD thesis, University of Illinois, Urbana-Champaign, May 2017.
[29] E. Zhou, M. C. Fu, and S. I. Marcus. Particle filtering framework for a class of randomized optimization algorithms. IEEE Trans. Autom. Control, 59(4):1025–1030, 2014.

APPENDIX

A. Optimal control

Preliminaries: Consider a functional E : P → R mapping densities to real numbers. For a fixed ρ ∈ P, the (Gâteaux) derivative of E is a real-valued function on R^d, and an element of the function space L²(R^d; ρ). This function is denoted as (∂E/∂ρ)(ρ)(x) for x ∈ R^d, and defined as follows:

  (d/dt)|_{t=0} E(ρ_t) = −∫_{R^d} (∂E/∂ρ)(ρ)(x) ∇·(ρ(x)u(x)) dx

where ρ_t is a path in P such that ∂ρ_t/∂t = −∇·(ρ_t u) with ρ_0 = ρ, and u is an arbitrary vector-field on R^d. Similarly, (∂²E/∂ρ²)(ρ) ∈ L²(R^d × R^d) is the second (Gâteaux) derivative of the functional E if

  (d/dt)|_{t=0} (∂E/∂ρ)(ρ_t)(x) = −∫_{R^d} (∂²E/∂ρ²)(ρ)(x,y) ∇·(ρ(y)u(y)) dy

The optimal control problems (5) and (9) are examples of the mean-field type control problem introduced in [3]. The notation and methodology for the following proofs are based in part on [3].

Proof of Theorem 1: The value function V(ρ,t), defined in (7), is the solution of the DP equation:

  ∂V/∂t (ρ,t) + inf_{u∈L²} H(ρ, (∂V/∂ρ)(ρ,t), u) = 0,   t ∈ [0, T)          (21)
  V(ρ, T) = ∫ h(x)ρ(x) dx

In the following, we use the notation

  Θ = Θ(ρ,t)(x) := (∂V/∂ρ)(ρ,t)(x)

For a fixed ρ ∈ P and t ∈ [0, T), Θ is a function on R^d.

A necessary condition is obtained by considering the first variation of H. Suppose u is a minimizing control function. Then u satisfies the first order optimality condition:

  (d/dε)|_{ε=0} H(ρ, Θ, u + εv) = 0

where v is an arbitrary vector field on R^d. Explicitly,

  ∫ ∇( −(1/ρ)∇·(ρu) + Θ ) · v ρ dx = 0

or, in its strong form,

  −(1/ρ)∇·(ρu) + Θ = (constant)

Multiplying both sides by ρ and integrating yields the value of the constant as ∫ Θ(ρ,t)(x)ρ(x) dx =: Θ̂(ρ,t). Therefore, the minimizing control solves the pde

  (1/ρ) ∇·(ρu) = Θ − Θ̂

On substituting the optimal control law into the DP equation (21), the HJB equation for V is given by

  ∂V/∂t (ρ,t) + ½ ∫ |h − ĥ|² ρ dx − ½ ∫ |Θ(ρ,t) − Θ̂(ρ,t)|² ρ dx = 0,   t ∈ [0, T)
  V(ρ, T) = ∫ h(x)ρ(x) dx

The equation involves both V and Θ. One obtains the so-called master equation (see [3]), involving only Θ, by differentiating with respect to ρ:

  ∂Θ/∂t (ρ,t)(x) + ½|h(x) − ĥ|² − ½|Θ(ρ,t)(x) − Θ̂(ρ,t)|²
    − ∫ (Θ(ρ,t)(y) − Θ̂(ρ,t)) (∂Θ/∂ρ)(ρ,t)(y,x) ρ(y) dy = 0,   t ∈ [0, T)
  Θ(ρ, T) = h

It is easily verified that Θ(ρ,t) = h solves the master equation. The corresponding value function is V(ρ,t) = ∫ hρ dx.

Sufficiency: The proof that the proposed control law is a minimizer is as follows. Consider any arbitrary control law v_t with the resulting density ρ_t. Taking the time derivative of −∫ hρ_t dx:

  (d/dt)( −∫ hρ_t dx ) = ∫ (h − ĥ_t) ( (1/ρ_t)∇·(ρ_t v_t) ) ρ_t dx
    ≤ ½ ∫ ( |h − ĥ_t|² + |(1/ρ_t)∇·(ρ_t v_t)|² ) ρ_t dx
    = L(ρ_t, v_t)

On integrating both sides with respect to time,

  ∫_{R^d} hρ_0 dx ≤ ∫₀ᵀ L(ρ_t, v_t) dt + ∫_{R^d} hρ_T dx

where the equality holds with v_t = u_t (defined as the solution of (8)). Therefore,

  J(u) = ∫ hρ_0 dx ≤ J(v)

This also confirms that V(ρ,t) = ∫ hρ dx is the value function, and completes the proof of Theorem 1.

The analysis for the infinite horizon optimal control problem (9) is similar and described next.

Proof of Theorem 2: The infinite-horizon value function V^∞(ρ) := inf_u ∫₀^∞ L(ρ_t, u_t) dt is a solution of the DP equation:

  inf_{u∈L²} H(ρ, Θ^∞(ρ), u) = 0          (22)

where Θ^∞(ρ) := (∂V^∞/∂ρ)(ρ). By carrying out the first order analysis in an identical manner, it is readily verified that:
(i) A minimizing control u is a solution of the pde (8);
(ii) V^∞(ρ) = ∫ hρ dx − h(x̄) is a solution of the DP equation (22).

The sufficiency also follows similarly. With any arbitrary control v_t,

  ∫_{R^d} hρ_0 dx ≤ ∫₀^∞ L(ρ_t, v_t) dt + limsup_{t→∞} ∫_{R^d} hρ_t dx

with equality if v_t = u_t solves the pde (8). Using the boundary condition limsup_{t→∞} ∫ hρ_t dx = h(x̄),

  J(u) = ∫ hρ_0 dx − h(x̄) ≤ J(v)

B. Hamiltonian formulation

The Hamiltonian H is defined in (6). Suppose u_t is the optimal control and ρ_t is the corresponding optimal trajectory. Denote the trajectory for the co-state (momentum) as q_t. Using the Pontryagin's minimum principle, (ρ_t, q_t) satisfy the following Hamilton's equations:

  ∂ρ_t/∂t = (∂H/∂q)(ρ_t, q_t, u_t),   ρ_0 = p*_0
  ∂q_t/∂t = −(∂H/∂ρ)(ρ_t, q_t, u_t),   q_T = (∂/∂ρ) ∫ h(x)ρ(x) dx
  0 = H(ρ_t, q_t, u_t) = min_{v∈L²} H(ρ_t, q_t, v)

The calculus of variation argument in the proof of Theorem 1 shows that the minimizing control u_t solves the first order optimality equation

  (1/ρ_t) ∇·(ρ_t u_t) = q_t − q̂_t          (23)

where q̂_t := ∫ q_t(x)ρ_t(x) dx.

The explicit form of the Hamilton's equations is obtained by explicitly evaluating the derivatives along the optimal trajectory:

  (∂H/∂q)(ρ_t, q_t, u_t) = −∇·(ρ_t u_t)
  (∂H/∂ρ)(ρ_t, q_t, u_t) = ½|h − ĥ_t|² − ½|q_t − q̂_t|²

It is easy to verify that q_t ≡ h(x) satisfies both the boundary condition and the evolution equation for the momentum. This results in a simpler form of the Hamilton's equations:

  ∂ρ_t/∂t = −∇·(ρ_t u_t)
  0 = H(ρ_t, h, u_t) = min_{v∈L²} H(ρ_t, h, v)

In a particle filter implementation, the minimizing control u_t = −∇φ is obtained by solving the first order optimality equation (23) with q_t = h.

C. Bayes' exactness and convergence

Before proving Theorem 3, we state and prove the following technical lemma:

Lemma 3: Suppose the prior density p*_0(x) satisfies Assumption (A1) and the objective function h(x) satisfies Assumption (A2). Then for each fixed time t ≥ 0:
(i) The posterior density p*(x,t), defined according to (2), admits a spectral bound;
(ii) The objective function h ∈ L²(R^d; p*(·,t)).

Proof 1: Define V_t(x) := −log p*(x,t) = V_0(x) + t h(x) + γ_t, where γ_t := log( ∫ e^{−V_0(y)−th(y)} dy ). It is directly verified that V_t ∈ C² with D²V_t ∈ L^∞ and liminf_{|x|→∞} ∇V_t(x) · (x/|x|) = ∞. Therefore, the density p*(x,t) admits a spectral bound [Thm. 4.6.3 in [2]]. The function h is square-integrable because

  ∫ |h(x)|² p*(x,t) dx ≤ e^{−th̄−γ_t} ∫ |h(x)|² e^{−V_0(x)} dx < ∞

Proof of Theorem 3: Given any C¹ smooth and compactly supported (test) function f, using the elementary chain rule,

  (d/dt) f(X^i_t) = −∇φ(X^i_t) · ∇f(X^i_t)

On integrating and taking expectations,

  E[f(X^i_t)] = E[f(X^i_0)] − ∫₀ᵗ E[∇φ(X^i_s) · ∇f(X^i_s)] ds
             = E[f(X^i_0)] − ∫₀ᵗ E[(h(X^i_s) − ĥ_s) f(X^i_s)] ds

which is the weak form of the replicator pde. Note that the weak form of the Poisson equation is used to obtain the second equality. Since the test function f is arbitrary, the evolutions of p and p* are identical. That the control function is well-defined for each time follows from [16], based on the apriori estimates in Lemma 3 for p* = p.

The convergence proof is presented next. The proof here is somewhat more general than needed to prove the theorem. For a function h, we define the minimizing set:

  A_0 := { x ∈ R^d : h(x) = h̄ }

where it is recalled that h̄ = inf_{x∈R^d} h(x). In the following it is shown that, for any open neighborhood U of A_0,

  liminf_{t→∞} ∫_U p(x,t) dx = 1          (24)

It then follows that X^i_t converges in distribution, where the limiting distribution is supported on A_0 (Thm. 3.2.5 in [10]). If the minimizer is unique (i.e., A_0 = {x̄}), X^i_t converges to x̄ in probability.

The key to proving the convergence is the following property of the function h:

(P1): For each δ > 0, ∃ ε > 0 such that:

  |h(x) − h̄| ≤ ε  ⇒  dist(x, A_0) ≤ δ   ∀ x ∈ R^d

where dist(x, A_0) denotes the Euclidean distance of the point x from the set A_0. If the minimizer x̄ is unique, it equals |x − x̄|.

Any lower semi-continuous function satisfying Assumption (A3) also satisfies the property (P1): Suppose {x_n} is a sequence such that h(x_n) → h̄. Then {x_n} is precompact because h(x) > h̄ + r outside some compact set. Therefore, the limit set is non-empty and, because h is lower semi-continuous, for any limit point z, h̄ ≤ h(z) ≤ liminf_{x_n→z} h(x_n) = h̄. That is, z ∈ A_0.

The proof for (24) is based on the construction of a Lyapunov function: Denote A_ε := { x ∈ R^d : h(x) ≤ h̄ + ε } where ε > 0. By property (P1), given any open neighborhood U containing A_0, ∃ ε > 0 such that A_ε ⊂ U. A candidate Lyapunov function V_{A_ε}(µ) := −log(µ(A_ε)) is defined for measures µ with

everywhere positive density. By construction, V_{A_ε}(µ) ≥ 0 with equality iff µ(A_ε) = 1.

Let µ_t be the probability measure associated with p(x,t), i.e., µ_t(B) := ∫_B p(x,t) dx for all Borel measurable sets B ⊂ R^d. Since p(x,t) is a solution of the replicator pde,

  (d/dt) V_{A_ε}(µ_t) = −(d/dt) log(µ_t(A_ε))
    = (1/µ_t(A_ε)) ∫_{A_ε} (h(x) − ĥ_t) dµ_t(x)
    = (1 − µ_t(A_ε)) ( ∫_{A_ε} h dµ_t / µ_t(A_ε) − ∫_{A_ε^c} h dµ_t / µ_t(A_ε^c) ) ≤ 0

with equality iff µ_t(A_ε) = 1.

For the objective function h, a direct calculation also shows:

  (d/dt) ∫ h(x)p(x,t) dx = −∫ h(x)(h(x) − ĥ_t) p(x,t) dx = −∫ (h(x) − ĥ_t)² p(x,t) dx ≤ 0

with equality iff h = ĥ_t almost everywhere (with respect to the measure µ_t).

D. Quadratic Gaussian case

Proof of Lemma 2: We are interested in obtaining an explicit solution of the Poisson equation,

  −∇·(ρ(x)∇φ(x)) = (h(x) − ĥ) ρ(x)          (25)

Consider the solution ansatz:

  ∇φ(x) = K(x − m) + b          (26)

where the matrix K = K^T ∈ R^{d×d} and the vector b ∈ R^d are determined as follows:

(i) Multiply both sides of (25) by the vector x and integrate (element-by-element) by parts to obtain

  b = ∫ x (h(x) − ĥ) ρ(x) dx          (27)

(ii) Multiply both sides of (25) by the matrix (x − m)(x − m)^T and integrate by parts to obtain

  ΣK + KΣ = ∫ (x − m)(x − m)^T (h(x) − ĥ) ρ(x) dx          (28)

We have thus far not used the fact that the density ρ is Gaussian and the function h is quadratic. In the following, it is shown that the solution thus defined in fact solves the pde (25) under these conditions.

A radially unbounded quadratic function is of the general form:

  h(x) = ½ (x − x̄)^T H (x − x̄) + c

where the matrix H = H^T ≻ 0 and c is some constant. For a Gaussian density ρ with mean m and variance Σ ≻ 0, the integrals (27) and (28) are explicitly evaluated to obtain

  b = ∫ x (h(x) − ĥ) ρ(x) dx = ΣH(m − x̄)          (29a)
  ΣK + KΣ = ∫ (x − m)(x − m)^T (h(x) − ĥ) ρ(x) dx = ΣHΣ          (29b)

A unique positive-definite symmetric solution K exists for the Lyapunov equation (29b) because Σ ≻ 0 and ΣHΣ ≻ 0 [9].

On substituting the solution (26) into the Poisson equation (25) and dividing through by ρ, the two sides are:

  −(1/ρ) ∇·(ρ∇φ) = (x − m)^T Σ^{-1} (K(x − m) + b) − tr(K)
  h − ĥ = ½(x − x̄)^T H(x − x̄) − ½(m − x̄)^T H(m − x̄) − ½ tr(HΣ)

where tr(·) denotes the matrix trace. Using the formulae (29a)-(29b) for b and K, the two sides are seen to be equal.
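As a numerical sanity check of the closed-form evaluations (29a)-(29b) (not part of the paper; the dimension and problem data are illustrative), the Gaussian moment integrals (27)-(28) can be estimated by Monte Carlo and compared with ΣH(m − x̄) and ΣHΣ:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 1_000_000
A = rng.normal(size=(d, d)); H = A @ A.T + np.eye(d)       # H = H^T > 0
B = rng.normal(size=(d, d)); Sigma = B @ B.T + np.eye(d)   # Sigma = Sigma^T > 0
m, x_bar = rng.normal(size=d), rng.normal(size=d)

X = rng.multivariate_normal(m, Sigma, size=N)
hX = 0.5 * np.einsum('ni,ij,nj->n', X - x_bar, H, X - x_bar)
w = hX - hX.mean()
Xc = X - m

b_mc = X.T @ w / N                     # Monte Carlo estimate of integral (27)
C_mc = (Xc * w[:, None]).T @ Xc / N    # Monte Carlo estimate of integral (28)

rel = lambda est, ref: np.linalg.norm(est - ref) / np.linalg.norm(ref)
print("relative error (29a):", rel(b_mc, Sigma @ H @ (m - x_bar)))  # O(N^{-1/2})
print("relative error (29b):", rel(C_mc, Sigma @ H @ Sigma))        # O(N^{-1/2})
```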
Proof of Proposition 1: Using the affine control law (11), the particle filter is a linear system with a Gaussian prior:

  dX^i_t/dt = −K_t(X^i_t − m_t) − b_t,   X^i_0 ∼ N(m_0, Σ_0)          (30)

Therefore, the density of X^i_t is Gaussian for all t > 0. The evolution of the mean is obtained by taking an expectation of both sides of the ode (30):

  dm_t/dt = −b_t = −E[ X^i_t (h(X^i_t) − ĥ_t) ]

where (12) is used to obtain the second equality. The equation for the variance Σ_t of X^i_t is similarly obtained:

  dΣ_t/dt = −(K_t Σ_t + Σ_t K_t) = −E[ (X^i_t − m_t)(X^i_t − m_t)^T (h(X^i_t) − ĥ_t) ]

where (13) has been used.

Proof of Corollary 1: The closed-form odes (15a) and (15b) are obtained by using the explicit formulae (29a) and (29b) for b and K, respectively.

E. Parametric case

Proof of Proposition 2: The natural gradient ode (20) is obtained by applying the chain rule. In its parameterized form, the density p(x,t) = ϱ(x;θ_t) evolves according to the replicator pde:

  ∂ϱ/∂t (x;θ_t) = −(h(x) − ĥ_t) ϱ(x;θ_t)

Now, using the chain rule,

  ∂ϱ/∂t (x;θ_t) = ϱ(x;θ_t) ( (∂/∂ϑ)(log ϱ(x;θ_t)) )^T (dθ_t/dt)

where (∂/∂ϑ)(log ϱ) and dθ_t/dt are both M × 1 column vectors. Therefore, the replicator pde is given by

  ( (∂/∂ϑ)(log ϱ(x;θ_t)) )^T (dθ_t/dt) ϱ(x;θ_t) = −(h(x) − ĥ_t) ϱ(x;θ_t)

Multiplying both sides by the column vector (∂/∂ϑ)(log ϱ), integrating over the domain, and using the definitions (18) of the Fisher information matrix G and (19) for ∇e, one obtains

  G(θ_t) (dθ_t/dt) = −∇e(θ_t)

The ode (20) is obtained because G is assumed invertible.