Mean-field Markov decision processes with common noise and open-loop controls Médéric Motte, Huyên Pham
To cite this version: Médéric Motte, Huyên Pham. Mean-field Markov decision processes with common noise and open-loop controls. 2021. hal-02413114v2.
HAL Id: hal-02413114, https://hal.archives-ouvertes.fr/hal-02413114v2. Preprint submitted on 8 Sep 2021.
Médéric MOTTE∗    Huyên PHAM†
This version: June 2021. To appear in Annals of Applied Probability.
Abstract. We develop an exhaustive study of Markov decision processes (MDP) under mean-field interaction both on states and actions, in the presence of common noise and when optimization is performed over open-loop controls on infinite horizon. Such a model, called CMKV-MDP for conditional McKean-Vlasov MDP, arises and is obtained here rigorously, with a rate of convergence, as the asymptotic problem of N cooperative agents controlled by a social planner/influencer that observes the environment noises but not necessarily the individual states of the agents. We highlight the crucial role of relaxed controls and of the randomization hypothesis for this class of models with respect to classical MDP theory. We prove the correspondence between the CMKV-MDP and a general lifted MDP on the space of probability measures, and establish the dynamic programming Bellman fixed point equation satisfied by the value function, as well as the existence of ε-optimal randomized feedback controls. The arguments of proof involve an original measurable optimal coupling for the Wasserstein distance. This provides a procedure for learning strategies in a large population of interacting collaborative agents.
MSC Classification: 90C40, 49L20.
Key words: Mean-field, Markov decision process, conditional propagation of chaos, measurable cou- pling, randomized control.
∗ LPSM, Université de Paris, medericmotte at gmail.com. The author acknowledges support of the DIM MathInnov.
† LPSM, Université de Paris, and CREST-ENSAE, pham at lpsm.paris. This work was partially supported by the Chair Finance & Sustainable Development / the FiME Lab (Institut Europlace de Finance).
Contents

1 Introduction
2 The N-agent and the limiting McKean-Vlasov MDP
3 Lifted MDP on P(X)
  3.1 Case without common noise
  3.2 Case with finite state space X and with common noise
4 General case and Bellman fixed point equation in P(X)
  4.1 A general lifted MDP on P(X)
  4.2 Bellman fixed point on P(X)
  4.3 Building ε-optimal randomized feedback controls
  4.4 Relaxing the randomization hypothesis
  4.5 Feedback vs open-loop vs randomized controls
  4.6 Computing value function and ε-optimal strategies in CMKV-MDP
5 Conclusion
A Some useful results on conditional law
B Wasserstein convergence of the empirical law
C Proof of coupling results
1 Introduction
Optimal control of McKean-Vlasov (MKV) systems, also known as mean-field control (MFC), has sparked great interest in the domain of applied probability during the last decade. In these optimization problems, the transition dynamics of the system and the reward/gain function depend not only on the state and action of the agent/controller, but also on their probability distributions. Such problems are motivated by models of large populations of interacting cooperative agents obeying a social planner (center of decision), and are often justified heuristically as the asymptotic regime with an infinite number of agents under Pareto efficiency. They have found numerous applications in distributed energy, herd behavior, finance, etc. A large literature has already emerged on continuous-time models for the optimal control of McKean-Vlasov dynamics, and the dynamic programming principle (in other words, time consistency) has been established in this context in the papers [16], [19], [1], [8]. We point out the work [15], which is the first paper to rigorously connect mean-field control to large systems of controlled processes, see also the recent paper [10], and refer to the books [2], [5] for an overview of the subject.
Our work and main contributions. In this paper, we introduce a general discrete-time framework by providing an exhaustive study of Markov decision processes (MDP) under mean-field interaction in the presence of common noise, and when optimization is performed over open-loop controls on infinite horizon. Such a model is called conditional McKean-Vlasov MDP, abbreviated in the sequel as
CMKV-MDP; this set-up provides the mathematical framework for a theory of reinforcement learning with mean-field interaction. Let us first briefly describe and motivate the main features of our framework:
(i) The controls are open-loop: it emulates the generally unobservable nature of the states in mean- field situations (political opinions, targeted advertising, health conditions). Yet, the noises driving these states (people’s characteristics, habits, surroundings, and navigation data) are nowadays observable.
(ii) The controls are individualized: the action applied to an individual only depends upon her own noise. Indeed, for the strategy to be implementable in practice, the action must be computed by a program running on the individual’s own device (computer, phone), and, to limit data transfers, we impose that it only uses this device’s data.
(iii) The dynamics of individuals depend upon a common noise, emulating the fact that they are influenced by common information (public data) which may vary over time. By considering an i.i.d. common noise sequence, we shall see that the value function is characterized as the unique fixed point of a Bellman operator, which is the basis for designing algorithms in reinforcement learning.
Compared to continuous-time models, discrete-time McKean-Vlasov control problems have been less studied in the literature. In [18], the authors consider a finite-horizon problem without common noise and state the dynamic programming (Bellman) equation for MFC with closed-loop (also called feedback) controls, that are restricted to depend on the state. Very recently, the works [6], [12] addressed Bellman equations for MFC problems in the context of reinforcement learning. The paper [12] considers relaxed controls in their MFC formulation but without common noise, and derives the Bellman equation for the Q-value function as a deterministic control problem that we obtain here as a particular case (see our Remark 4.11). The framework in [6] is closest to ours by considering also common noise, however with the following differences: these authors restrict their attention to stationary feedback policies, and reformulate their MFC control problem as a MDP on the space of probability measures by deriving formally (leaving aside the measurability issues and assuming the existence of a stationary feedback control) the associated Bellman equation, which is then used for the development of Q-learning algorithms. Notice that [6], [12] do not consider dependence upon the probability distribution of the control in the state transition dynamics and reward function. Besides the introduction of a general framework including a mean-field dependence on the pair state/action, our first main contribution is to rigorously connect CMKV-MDP to a large but finite system of MDP with interacting processes. We prove the almost sure and L1 conditional propagation of chaos, i.e., the convergence, as the number of interacting agents N tends to infinity, of the state processes and gains of the N-individual population control problem towards the corresponding object in the CMKV-MDP. 
Furthermore, by relying on the rate of convergence in Wasserstein distance of the empirical measure, we give a rate of convergence for the limiting CMKV-MDP under suitable Lipschitz assumptions on the state transition and reward functions, which is new to the best of our knowledge. Our second contribution is to obtain the correspondence of our CMKV-MDP with a suitable lifted MDP on the space of probability measures. Starting from open-loop controls, this is achieved in general by introducing relaxed (i.e., measure-valued) controls in the enlarged state/action space, and by emphasizing the measurability issues arising in the presence of common noise and with a continuous state space. In the special case without common noise or with finite state space, the relaxed control in
the lifted MDP reduces to the usual notion in control theory, also known as mixed or randomized strategies in game theory. While it is known in standard MDPs that an optimal control (when it exists) is in pure form, relaxed controls appear naturally in MFC, where the social planner has to sample the distribution of actions instead of simply assigning the same pure strategy across the population in order to achieve the best possible collective gain. The reformulation of the original problem as a lifted MDP leads us to consider an associated dynamic programming equation written in terms of a Bellman fixed point equation in the space of probability measures. Our third contribution is to establish rigorously the Bellman equation satisfied by the state value function of the CMKV-MDP, and then by the state-action value function, called Q-function in the reinforcement learning terminology. This is obtained under the crucial assumption that the initial information filtration is generated by an atomless random variable, i.e., that it is rich enough, and calls upon original measurable optimal coupling results for the Wasserstein distance. Moreover, and this is our fourth contribution, the methodology of proof allows us to obtain as a by-product the existence of an ε-optimal control, which is constructed from randomized feedback policies under a randomization hypothesis. This shows in particular that the value function of the CMKV-MDP over open-loop controls is equal to the value function over randomized feedback controls, and we highlight that it may be strictly larger than the value function of the CMKV-MDP over "pure" feedback controls, i.e., without randomization. This is a notable difference with respect to the classical (without mean-field dependence) theory of MDPs as studied e.g. in [3], [21]. We discuss and illustrate with a set of simple examples the differences between control strategies (open-loop vs feedback), and the crucial role of the randomization hypothesis.
Finally, we discuss how to compute the value function and approximate optimal randomized feedback controls from the Bellman equation according to value or policy iteration methods, and by discretization of the state space and of the space of probability measures.
Outline of the paper. The rest of the paper is organized as follows. Section 2 carefully formulates both the N-individual model and the CMKV-MDP, and shows their connection by providing the rate of convergence of the former to the limiting MFC when N goes to infinity. In Section 3, we establish the correspondence of the CMKV-MDP with a lifted MDP on the space of probability measures with usual relaxed controls, when there is no common noise or when the state space is finite. In the general case, considered in Section 4, we show how to lift the CMKV-MDP by a suitable enlargement of the action space in order to obtain the correspondence with an MDP on the Wasserstein space. We then derive the associated Bellman fixed point equation satisfied by the value function, and obtain the existence of approximate randomized feedback controls. We also highlight the differences between open-loop, feedback, and randomized controls. We conclude in Section 5 by indicating some questions for future research. Finally, we collect in the Appendix some useful and technical results, including the measurable coupling arguments used in the proofs of the paper.
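To make the value iteration idea mentioned above concrete, here is a toy, self-contained Python sketch on a discretized space of probability measures: for X = {0, 1}, a measure µ is identified with p = µ({1}) on a grid. The transition map PHI, the reward R, and the discount BETA are invented for illustration (they are not the paper's objects); the sketch only shows that iterating a Bellman operator with discount β < 1 converges to its fixed point.

```python
# Toy value iteration for a lifted MDP on P(X), X = {0, 1} (illustrative only:
# PHI, R, BETA are invented, not taken from the paper).

GRID = [i / 10 for i in range(11)]    # discretization of P(X) via p = mu({1})
ACTIONS = [0.0, 1.0]
BETA = 0.5

def snap(p):
    """Project onto the grid (crude discretization of P(X))."""
    return min(GRID, key=lambda q: abs(q - p))

def PHI(p, a):
    """Next distribution of the population (illustrative dynamics)."""
    return snap(0.5 * p + 0.5 * a)

def R(p, a):
    """One-step collective reward (illustrative)."""
    return p - 0.1 * a

def bellman(V):
    """One application of the Bellman operator on the grid."""
    return {p: max(R(p, a) + BETA * V[PHI(p, a)] for a in ACTIONS)
            for p in GRID}

V = {p: 0.0 for p in GRID}
for _ in range(60):                   # BETA**60 is negligible
    V = bellman(V)

residual = max(abs(bellman(V)[p] - V[p]) for p in GRID)
print(residual)                       # near 0: V is (numerically) the fixed point
```

Since the Bellman operator is a β-contraction in the sup-norm, the residual after k iterations is of order β^k; a finer grid of P(X), or policy iteration, would follow the same pattern.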
Notations. Given two measurable spaces (X1, Σ1) and (X2, Σ2), we denote by pr1 (resp. pr2 ) the projection function (x1,x2) ∈ X1 ×X2 7→ x1 ∈ X1 (resp. x2 ∈ X2). For a measurable function Φ : X1 → X2, and a positive measure µ1 on (X1, Σ1), the pushforward measure Φ ⋆µ1 is the measure on (X2, Σ2) defined by
Φ ⋆ µ1(B2) = µ1( Φ^{−1}(B2) ),  for all B2 ∈ Σ2.

We denote by P(X1) the set of probability measures on X1, and by C(X1) the cylinder (or weak) σ-algebra on P(X1), that is, the smallest σ-algebra making all the functions µ ∈ P(X1) ↦ µ(B1) ∈ [0, 1] measurable, for B1 ∈ Σ1. A probability kernel ν on X1 × X2, denoted ν ∈ X̂2(X1), is a measurable mapping from (X1, Σ1) into (P(X2), C(X2)), and we shall write indifferently ν(x1, B2) = ν(x1)(B2), for all x1 ∈ X1, B2 ∈ Σ2. Given a probability measure µ1 on (X1, Σ1) and a probability kernel ν ∈ X̂2(X1), we denote by µ1 · ν the probability measure on (X1 × X2, Σ1 ⊗ Σ2) defined by

(µ1 · ν)(B1 × B2) = ∫_{B1} µ1(dx1) ν(x1, B2),  for all B1 ∈ Σ1, B2 ∈ Σ2.

Let X1 and X2 be two random variables valued respectively in X1 and X2, denoted Xi ∈ L^0(Ω; Xi). We denote by L(Xi) the probability distribution of Xi, and by L(X2|X1) the conditional probability distribution of X2 given X1. With these notations, when X2 = Φ(X1), we have L(X2) = Φ ⋆ L(X1). When (Y, d) is a compact metric space, the set P(Y) of probability measures on Y is equipped with the Wasserstein distance

W(µ, µ′) = inf { ∫_{Y×Y} d(y, y′) π(dy, dy′) : π ∈ Π(µ, µ′) },

where Π(µ, µ′) is the set of probability measures on Y × Y with marginals µ and µ′, i.e., pr1 ⋆ π = µ and pr2 ⋆ π = µ′. Since (Y, d) is compact, it is known (see e.g. Corollary 6.13 in [23]) that the Borel σ-algebra generated by the Wasserstein metric coincides with the cylinder σ-algebra on P(Y), i.e., the Wasserstein distance metrizes weak convergence. We also recall the dual Kantorovich-Rubinstein representation of the Wasserstein distance

W(µ, µ′) = sup { ∫_Y φ d(µ − µ′) : φ ∈ Llip(Y; ℝ), [φ]lip ≤ 1 },

where Llip(Y; ℝ) is the set of Lipschitz continuous functions φ from Y into ℝ, and [φ]lip = sup{ |φ(y) − φ(y′)| / d(y, y′) : y, y′ ∈ Y, y ≠ y′ }.
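As a concrete complement to the definitions above, the following minimal Python sketch (an illustration, not part of the paper's framework) computes the Wasserstein distance between two empirical measures on the real line with the same number of atoms; in dimension one, the optimal coupling simply matches the sorted atoms.

```python
def wasserstein1(xs, ys):
    """W1 between the uniform empirical measures (1/N) sum_i delta_{xs[i]}
    and (1/N) sum_i delta_{ys[i]}: on the real line, the optimal coupling
    pairs the sorted atoms."""
    assert len(xs) == len(ys), "equal numbers of atoms are assumed"
    n = len(xs)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / n

# Translating a measure by c moves it by exactly |c| in Wasserstein distance.
mu = [0.0, 1.0, 2.0]
nu = [0.5, 1.5, 2.5]
print(wasserstein1(mu, nu))  # 0.5
```

Any 1-Lipschitz test function φ gives a lower bound ∫ φ d(µ − µ′) ≤ W(µ, µ′), which is the dual representation recalled above.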
2 The N-agent and the limiting McKean-Vlasov MDP
We formulate the mean-field Markov Decision Process (MDP) in a large population model with indistinguishable agents i ∈ ℕ* = ℕ \ {0}.
Let X (the state space) and A (the action space) be two compact Polish spaces equipped respectively with their metrics d and dA. We denote by P(X) (resp. P(A)) the space of probability measures on X (resp. A), equipped respectively with the Wasserstein distances W and WA. We also consider the product space X × A, equipped with the metric d((x, a), (x′, a′)) = d(x, x′) + dA(a, a′), for x, x′ ∈ X, a, a′ ∈ A, and the associated space of probability measures P(X × A), equipped with its Wasserstein distance W. Let G, E, and E^0 be three measurable spaces, representing respectively the initial information, idiosyncratic noise, and common noise spaces. We denote by Π_OL the set of sequences (π_t)_{t∈ℕ} (called open-loop policies), where π_t is a measurable function from G × E^t × (E^0)^t into A, for t ∈ ℕ. Let (Ω, F, ℙ) be a probability space on which are defined the following families of mutually i.i.d. random variables:
• (Γ^i, ξ^i)_{i∈ℕ*} (initial information and initial states), valued in G × X;
• (ε^i_t)_{i∈ℕ*, t∈ℕ} (idiosyncratic noises), valued in E, with probability distribution λ_ε;
• ε^0 := (ε^0_t)_{t∈ℕ} (common noise), valued in E^0.

We assume that F contains an atomless random variable, i.e., F is rich enough, so that any probability measure ν on X (resp. A or X × A) can be represented as the law of some random variable Y on X (resp. A or X × A), and we then write Y ∼ ν, i.e., L(Y) = ν. Given an open-loop policy π, we associate to each individual i ∈ ℕ* the open-loop control α^{i,π} defined by

α^{i,π}_t = π_t( Γ^i, (ε^i_s)_{s≤t}, (ε^0_s)_{s≤t} ),  t ∈ ℕ.
In other words, an open-loop control is a non-anticipative process that depends on the initial information and on the past idiosyncratic and common noises, but not on the states of the agent, in contrast with closed-loop controls. Given N ∈ ℕ* and π ∈ Π_OL, the state process of agent i = 1, ..., N in the N-agent MDP is given by the dynamical system
X^{i,N,π}_0 = ξ^i,
X^{i,N,π}_{t+1} = F( X^{i,N,π}_t, α^{i,π}_t, (1/N) Σ_{j=1}^N δ_{(X^{j,N,π}_t, α^{j,π}_t)}, ε^i_{t+1}, ε^0_{t+1} ),  t ∈ ℕ,

where F is a measurable function from X × A × P(X × A) × E × E^0 into X, called the state transition function. The i-th individual contribution to the influencer's gain over an infinite horizon is defined by
J^{N,π}_i := Σ_{t=0}^∞ β^t f( X^{i,N,π}_t, α^{i,π}_t, (1/N) Σ_{j=1}^N δ_{(X^{j,N,π}_t, α^{j,π}_t)} ),  i = 1, ..., N,

where the reward f is a measurable real-valued function on X × A × P(X × A), assumed to be bounded (recall that X and A are compact spaces), and β is a positive discount factor in [0, 1). The influencer's renormalized and expected gains are
J^{N,π} := (1/N) Σ_{i=1}^N J^{N,π}_i,   V^{N,π} := E[ J^{N,π} ],

and the optimal value of the influencer is V^N := sup_{π ∈ Π_OL} V^{N,π}. Observe that the agents are indistinguishable in the sense that the initial pairs of information/state (Γ^i, ξ^i)_i and the idiosyncratic noises are i.i.d., and the state transition function F, the reward function f, and the discount factor β do not depend on i.
Let us now consider the asymptotic problem when the number of agents N goes to infinity. In view of the propagation of chaos argument, we expect the state process of agent i ∈ ℕ* in the infinite population model to be governed by the conditional McKean-Vlasov dynamics
X^{i,π}_0 = ξ^i,
X^{i,π}_{t+1} = F( X^{i,π}_t, α^{i,π}_t, ℙ^0_{(X^{i,π}_t, α^{i,π}_t)}, ε^i_{t+1}, ε^0_{t+1} ),  t ∈ ℕ.   (2.1)

Here, we denote by ℙ^0 and E^0 the conditional probability and expectation given the common noise ε^0; then, for a random variable Y valued in Y, we denote by ℙ^0_Y or L^0(Y) its conditional law given ε^0, which is a random variable valued in P(Y) (see Lemma A.2). The i-th individual contribution to the influencer's gain in the infinite population model is

J^π_i := Σ_{t=0}^∞ β^t f( X^{i,π}_t, α^{i,π}_t, ℙ^0_{(X^{i,π}_t, α^{i,π}_t)} ),  i ∈ ℕ*,

and we define the conditional gain, expected gain, and optimal value, respectively, by
J^π := E^0[ J^π_i ] = E^0[ J^π_1 ],  i ∈ ℕ*  (by indistinguishability of the agents),
V^π := E[ J^π ],   V := sup_{π ∈ Π_OL} V^π.   (2.2)

Problem (2.1)-(2.2) is called the conditional McKean-Vlasov Markov decision process, CMKV-MDP for short. The main goal of this section is to rigorously connect the finite-agent model to the infinite population model with mean-field interaction by proving the convergence of the N-agent MDP to the CMKV-MDP. First, we prove the almost sure convergence of the state processes under continuity assumptions on the state transition function.
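The finite-population system above is straightforward to simulate. In the Python sketch below, the state space is X = [0, 1], the transition F and reward f are invented Lipschitz stand-ins whose mean-field term is the empirical mean of the (state, action) pairs (a special case of dependence on the empirical measure), and the open-loop control is a constant, standing in for a policy π_t(Γ^i, (ε^i_s)_{s≤t}, (ε^0_s)_{s≤t}); none of these specific choices come from the paper.

```python
import random

# Illustrative N-agent simulation with invented F, f on X = [0, 1].

def F(x, a, mean_xa, eps, eps0):
    # clamp so the state stays in the compact set X = [0, 1]
    return min(1.0, max(0.0, 0.5 * x + 0.3 * mean_xa + 0.1 * a + eps + eps0))

def f(x, a, mean_xa):
    return x - 0.05 * a + 0.1 * mean_xa

def simulate(N=50, T=30, beta=0.9, seed=0):
    """Renormalized gain J^{N,pi} over a truncated horizon T."""
    rng = random.Random(seed)
    X = [rng.random() for _ in range(N)]          # initial states xi^i
    gain = 0.0
    for t in range(T):
        eps0 = rng.gauss(0.0, 0.02)               # common noise eps^0_{t+1}
        A = [0.5 for _ in range(N)]               # constant open-loop actions
        m = sum(x + a for x, a in zip(X, A)) / N  # empirical mean-field term
        gain += beta ** t * sum(f(x, a, m) for x, a in zip(X, A)) / N
        X = [F(x, a, m, rng.gauss(0.0, 0.05), eps0)   # idiosyncratic noises
             for x, a in zip(X, A)]
    return gain

print(round(simulate(), 3))
```

Re-running this with increasing N and comparing the resulting gains is exactly the convergence question addressed by Proposition 2.1 and Theorem 2.1 below.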
Proposition 2.1 Assume that for all (x_0, a, ν_0, e^0) ∈ X × A × P(X × A) × E^0, the (random) function

(x, ν) ∈ X × P(X × A) ↦ F(x, a, ν, ε^1_1, e^0) ∈ X

is continuous at (x_0, ν_0) a.s., i.e.,

ℙ[ lim_{(x,ν)→(x_0,ν_0)} F(x, a, ν, ε^1_1, e^0) = F(x_0, a, ν_0, ε^1_1, e^0) ] = 1.

Then, for any π ∈ Π_OL, X^{i,N,π}_t → X^{i,π}_t a.s. as N → ∞, for i ∈ ℕ*, t ∈ ℕ, and
W( (1/N) Σ_{i=1}^N δ_{(X^{i,N,π}_t, α^{i,π}_t)}, ℙ^0_{(X^{i,π}_t, α^{i,π}_t)} ) → 0 a.s.,   (1/N) Σ_{i=1}^N d( X^{i,N,π}_t, X^{i,π}_t ) → 0 a.s., as N → ∞.

Furthermore, if we assume that for all a ∈ A, the function (x, ν) ∈ X × P(X × A) ↦ f(x, a, ν) is continuous, then
J^{N,π}_i → J^π_i a.s.,   J^{N,π} → J^π a.s.,   V^{N,π} → V^π, as N → ∞,  and  lim inf_{N→∞} V^N ≥ V.
Remark 2.1 We stress the importance of making the continuity assumption for F only a.s. with respect to the noise ε^1_1. Indeed, most of the practical applications we have in mind concern finite spaces X, for which F cannot be, strictly speaking, continuous, since it maps the continuous space P(X × A) into a finite space X. However, in this case, the discontinuities will essentially be placed on the boundaries between the regions where F is constant. Our assumption then simply formalizes the idea that these boundaries depend upon ε^1_1 in such a way that no point (x_0, a, ν_0, e^0) ∈ X × A × P(X × A) × E^0 has a positive probability (w.r.t. ε^1_1) of lying on a boundary. In other words, we assume that the discontinuity sets always vary with ε^1_1, which is a natural phenomenon. ✷
Proof. Fix π ∈ Π_OL. We omit the dependence of the state processes and controls on π, and denote

ν^N_t := (1/N) Σ_{i=1}^N δ_{(X^{i,N}_t, α^i_t)},  ν^{N,∞}_t := (1/N) Σ_{i=1}^N δ_{(X^i_t, α^i_t)},  and  ν_t := ℙ^0_{(X^i_t, α^i_t)}.

(1) We first prove the convergence of the trajectories, i.e., for all i ∈ ℕ*,

ℙ[ lim_{N→∞} (X^{i,N}_t, ν^N_t) = (X^i_t, ν_t) ] = 1,   ℙ[ lim_{N→∞} (1/N) Σ_{j=1}^N d( X^{j,N}_t, X^j_t ) = 0 ] = 1,

by induction on t ∈ ℕ.

- Initialization. For t = 0, we have X^{i,N}_0 = ξ^i = X^i_0 and α^i_0 = π_0(Γ^i), for all N ∈ ℕ* and i ∈ ℕ*, which obviously implies that X^{i,N}_0 → X^i_0 a.s. and (1/N) Σ_{j=1}^N d( X^{j,N}_0, X^j_0 ) → 0 as N → ∞. Moreover, W(ν^N_0, ν_0) = W(ν^{N,∞}_0, ν_0), which converges to zero a.s. when N goes to infinity, by the weak convergence of empirical measures (see [22]) and the fact that the Wasserstein distance metrizes weak convergence.

- Hereditary property. We have

X^{i,N}_{t+1} = F( X^{i,N}_t, α^i_t, ν^N_t, ε^i_{t+1}, ε^0_{t+1} )  and  X^i_{t+1} = F( X^i_t, α^i_t, ν_t, ε^i_{t+1}, ε^0_{t+1} ).   (2.3)

By a simple conditioning, we notice that ℙ[ lim_{N→∞} X^{i,N}_{t+1} = X^i_{t+1} ] = E[ P( (X^{i,N}_t, ν^N_t)_N, X^i_t, ν_t, α^i_t ) ], where

P( (x_N, ν_N)_N, x, ν, a ) = ℙ[ lim_{N→∞} F( x_N, a, ν_N, ε^i_{t+1}, ε^0_{t+1} ) = F( x, a, ν, ε^i_{t+1}, ε^0_{t+1} ) ].

By the continuity assumption on F, we see that P( (x_N, ν_N)_N, x, ν, a ) is bounded from below by 1_{lim_{N→∞} (x_N, ν_N) = (x, ν)}, and thus

ℙ[ lim_{N→∞} X^{i,N}_{t+1} = X^i_{t+1} ] ≥ ℙ[ lim_{N→∞} (X^{i,N}_t, ν^N_t) = (X^i_t, ν_t) ].

This proves, by the induction hypothesis, that ℙ[ lim_{N→∞} X^{i,N}_{t+1} = X^i_{t+1} ] = 1.

Let us now show that (1/N) Σ_{i=1}^N d( X^{i,N}_{t+1}, X^i_{t+1} ) → 0 a.s. From (2.3), we have

d( X^{i,N}_{t+1}, X^i_{t+1} ) ≤ Δ_{D_N} F( X^i_t, α^i_t, ν_t, ε^i_{t+1}, ε^0_{t+1} ),

with D_N := max[ d( X^{i,N}_t, X^i_t ), W( ν^N_t, ν_t ) ], and

Δ_y F(x, a, ν, e, e^0) := sup_{(x′,ν′)∈D} { d( F(x′, a, ν′, e, e^0), F(x, a, ν, e, e^0) ) 1_{max[d(x′,x), W(ν′,ν)] ≤ y} },

where D is a fixed countable dense subset of the separable space X × P(X × A), which implies that (y, x, a, ν, e, e^0) ↦ Δ_y F(x, a, ν, e, e^0) is a measurable function. Fix ε > 0, and let Δ_X denote the diameter of the compact metric space X. We thus have, for any η > 0,

d( X^{i,N}_{t+1}, X^i_{t+1} ) ≤ d( X^{i,N}_{t+1}, X^i_{t+1} ) 1_{D_N ≤ η} + d( X^{i,N}_{t+1}, X^i_{t+1} ) 1_{D_N > η}
  ≤ Δ_η F( X^i_t, α^i_t, ν_t, ε^i_{t+1}, ε^0_{t+1} ) + Δ_X 1_{d(X^{i,N}_t, X^i_t) > η} + Δ_X 1_{W(ν^N_t, ν_t) > η},

and thus

(1/N) Σ_{i=1}^N d( X^{i,N}_{t+1}, X^i_{t+1} ) ≤ (1/N) Σ_{i=1}^N Δ_η F( X^i_t, α^i_t, ν_t, ε^i_{t+1}, ε^0_{t+1} ) + (Δ_X/(ηN)) Σ_{i=1}^N d( X^{i,N}_t, X^i_t ) + Δ_X 1_{W(ν^N_t, ν_t) > η}.

The second and third terms on the right-hand side converge to 0 by the induction hypothesis, and by Proposition B.1, the first term converges to

E^0[ Δ_η F( X^1_t, α^1_t, ν_t, ε^1_{t+1}, ε^0_{t+1} ) ] = E^0[ E[ Δ_η F( x^1_t, a^1_t, ν_t, ε^1_{t+1}, e^0_{t+1} ) ]_{(x^1_t, a^1_t, e^0_{t+1}) := (X^1_t, α^1_t, ε^0_{t+1})} ].

As η → 0, the inner expectation tends to zero by the continuity assumption on F and by dominated convergence. Then, the outer expectation converges to zero by conditional dominated convergence, and will thus be smaller than ε/2 for η small enough, which implies that (1/N) Σ_{i=1}^N d( X^{i,N}_{t+1}, X^i_{t+1} ) will be smaller than ε for N large enough.

Let us finally prove that W( ν^N_{t+1}, ν_{t+1} ) → 0 a.s. We have W( ν^N_{t+1}, ν_{t+1} ) ≤ W( ν^N_{t+1}, ν^{N,∞}_{t+1} ) + W( ν^{N,∞}_{t+1}, ν_{t+1} ). To dominate the first term, notice that, given a random variable U_N ∼ U({1, ..., N}), the random measure ν^N_{t+1} (resp. ν^{N,∞}_{t+1}) is, at fixed ω ∈ Ω, the law of the pair of random variables ( X^{U_N,N}_{t+1}(ω), α^{U_N}_{t+1}(ω) ) (resp. ( X^{U_N}_{t+1}(ω), α^{U_N}_{t+1}(ω) )), where we stress that only U_N is random here, essentially selecting each atom of these empirical measures with probability 1/N. Thus, by definition of the Wasserstein distance, W( ν^N_{t+1}, ν^{N,∞}_{t+1} ) is dominated by E[ d( X^{U_N,N}_{t+1}(ω), X^{U_N}_{t+1}(ω) ) ] = (1/N) Σ_{i=1}^N d( X^{i,N}_{t+1}(ω), X^i_{t+1}(ω) ), which has been proved to converge to zero. For the second term, observe that α^i_{t+1} = π_{t+1}( Γ^i, (ε^i_s)_{s≤t+1}, (ε^0_s)_{s≤t+1} ), and by Proposition A.1, there exists a measurable function f_{t+1} : X × G × E^{t+1} × (E^0)^{t+1} → X such that X^i_{t+1} = f_{t+1}( Γ^i, (ε^i_s)_{s≤t+1}, (ε^0_s)_{s≤t+1} ). From Proposition B.1, we then deduce that W( ν^{N,∞}_{t+1}, ν_{t+1} ) converges to zero as N goes to infinity. This concludes the induction.

(2) Let us now study the convergence of the gains. By the continuity assumption on f, we have f( X^{i,N}_t, α^i_t, ν^N_t ) → f( X^i_t, α^i_t, ν_t ) a.s. as N → ∞, for all t ∈ ℕ. Thus, as f is bounded, we get by dominated convergence that J^{N,π}_i → J^π_i a.s. Let us now study the convergence of J^{N,π} to J^π. We write

| J^{N,π} − J^π | ≤ (1/N) Σ_{i=1}^N | J^{N,π}_i − J^π_i | + | (1/N) Σ_{i=1}^N J^π_i − J^π | =: S^1_N + S^2_N.

The second term S^2_N converges a.s. to zero by Propositions A.1 and B.1, as N goes to infinity. On the other hand,

S^1_N ≤ Σ_{t=0}^∞ β^t Δ_N(f),  with  Δ_N(f) := (1/N) Σ_{i=1}^N | f( X^{i,N}_t, α^i_t, ν^N_t ) − f( X^i_t, α^i_t, ν_t ) |.

By the same argument as above in (1), for showing that (1/N) Σ_{i=1}^N d( X^{i,N}_{t+1}, X^i_{t+1} ) → 0 a.s., we prove that Δ_N(f) tends a.s. to zero as N → ∞.
Since f is bounded, we deduce by the dominated convergence theorem that S^1_N converges a.s. to zero as N goes to infinity, and thus J^{N,π} → J^π a.s. By dominated convergence, we then also obtain that V^{N,π} = E[J^{N,π}] → E[J^π] = V^π. Finally, by considering an ε-optimal policy π_ε for V, we have

lim inf_{N→∞} V^N ≥ lim_{N→∞} V^{N,π_ε} = V^{π_ε} ≥ V − ε,

which implies, by sending ε to zero, that lim inf_{N→∞} V^N ≥ V. ✷

Next, under Lipschitz assumptions on the state transition and reward functions, we prove the corresponding convergence in L^1, which implies the convergence of the optimal values, and also provides a rate of convergence in terms of the rate of convergence in Wasserstein distance of the empirical measure.
(HFlip) There exists K_F > 0 such that for all a ∈ A, e^0 ∈ E^0, x, x′ ∈ X, ν, ν′ ∈ P(X × A),

E[ d( F(x, a, ν, ε^1_1, e^0), F(x′, a, ν′, ε^1_1, e^0) ) ] ≤ K_F ( d(x, x′) + W(ν, ν′) ).

(Hflip) There exists K_f > 0 such that for all a ∈ A, x, x′ ∈ X, ν, ν′ ∈ P(X × A),

| f(x, a, ν) − f(x′, a, ν′) | ≤ K_f ( d(x, x′) + W(ν, ν′) ).
Remark 2.2 Here again, we stress the importance of making the regularity assumption for F in expectation only. By the same argument as in Remark 2.1, when X is finite, F cannot be, strictly speaking, Lipschitz. However, F can be Lipschitz in expectation, e.g., once integrated w.r.t. the idiosyncratic noise, and, again, this is a very natural assumption. ✷
In the sequel, we shall denote by ∆X the diameter of the metric space X , and define
M_N := sup_{ν ∈ P(X×A)} E[ W(ν_N, ν) ],   (2.9)
where ν_N is the empirical measure ν_N = (1/N) Σ_{n=1}^N δ_{Y_n}, and (Y_n)_{1≤n≤N} are i.i.d. random variables with law ν. We recall in the next lemma recent results about non-asymptotic bounds for the mean rate of convergence in Wasserstein distance of the empirical measure.
Lemma 2.1 We have M_N → 0 as N → ∞. Furthermore:

• if X × A ⊂ ℝ^d for some d ∈ ℕ*, then M_N = O(N^{−1/2}) for d = 1, M_N = O(N^{−1/2} log(1 + N)) for d = 2, and M_N = O(N^{−1/d}) for d ≥ 3;

• if, for all δ > 0, the smallest number of balls with radius δ covering the compact metric set X × A with diameter Δ_{X×A} is smaller than O( (Δ_{X×A}/δ)^θ ) for some θ > 2, then M_N = O(N^{−1/θ}).

Proof. The second point is proved in [11], and the third one in [4]. ✷
Remark 2.3 In the case where the state and action spaces X and A are finite, let φ : X × A ↪ ℝ be any injective function. Then φ^{−1} is necessarily Lipschitz. From the dual Kantorovich representation of the Wasserstein distance, we have

M_N ≤ sup_{ν ∈ P(X×A)} E[ sup { ∫_{X×A} g d(ν_N − ν) : g ∈ Llip(X × A; ℝ), [g]lip ≤ 1 } ]
    ≤ sup_{ν ∈ P(X×A)} E[ sup { ∫_ℝ (g ∘ φ^{−1}) d( (φ⋆ν)_N − φ⋆ν ) : g ∈ Llip(X × A; ℝ), [g]lip ≤ 1 } ]
    ≤ C sup_{ν ∈ P(X×A)} E[ W( (φ⋆ν)_N, φ⋆ν ) ] ≤ C sup_{ν ∈ P(φ(X×A))} E[ W(ν_N, ν) ],

where C is the Lipschitz constant of φ^{−1}. Thus, by the second point in Lemma 2.1 (case d = 1), M_N = O(N^{−1/2}). ✷
Theorem 2.1 Assume (HFlip). For all i ∈ ℕ*, t ∈ ℕ,

sup_{π ∈ Π_OL} E[ d( X^{i,N,π}_t, X^{i,π}_t ) ] = O(M_N),   (2.10)

sup_{π ∈ Π_OL} E[ W( (1/N) Σ_{i=1}^N δ_{(X^{i,N,π}_t, α^{i,π}_t)}, ℙ^0_{(X^{i,π}_t, α^{i,π}_t)} ) ] = O(M_N).   (2.11)

Furthermore, if we assume (Hflip), and set γ := min( 1, |ln β| / (ln 2K_F)_+ ), there exists a constant C = C(K_F, K_f, β, γ) (explicit in the proof) such that for N large enough,

sup_{π ∈ Π_OL} | V^{N,π} − V^π | ≤ C M_N^γ,

and thus |V^N − V| = O(M_N^γ). Consequently, any ε-optimal policy for the CMKV-MDP is O(ε)-optimal for the N-agent MDP problem, and conversely, any ε-optimal policy for the N-agent MDP problem is O(ε)-optimal for the CMKV-MDP, for N large enough, namely such that M_N^γ = O(ε).

Proof. Given π ∈ Π_OL, denote ν^{N,π}_t := (1/N) Σ_{i=1}^N δ_{(X^{i,N,π}_t, α^{i,π}_t)}, ν^{N,∞,π}_t := (1/N) Σ_{i=1}^N δ_{(X^{i,π}_t, α^{i,π}_t)}, and ν^π_t := ℙ^0_{(X^{i,π}_t, α^{i,π}_t)}. By definition, α^{i,π}_t = π_t( Γ^i, (ε^i_s)_{s≤t}, (ε^0_s)_{s≤t} ), and by Lemma A.1, we have X^{i,π}_t = f^π_t( ξ^i, Γ^i, (ε^i_s)_{s≤t}, (ε^0_s)_{s≤t} ) for some measurable function f^π_t ∈ L^0( X × G × E^t × (E^0)^t; X ). By the definition of M_N in (2.9), we have

E[ W( ν^{N,∞,π}_t, ν^π_t ) ] ≤ M_N,  for all N ∈ ℕ*, π ∈ Π_OL.   (2.13)

Let us now prove (2.10) by induction on t ∈ ℕ. At t = 0, X^{i,N,π}_0 = X^{i,π}_0 = ξ^i, and the result is obvious. Now assume that it holds true at time t ∈ ℕ, and let us show that it then holds true at time t + 1. By a simple conditioning argument, E[ d( X^{i,N,π}_{t+1}, X^{i,π}_{t+1} ) ] = E[ Δ( X^{i,N,π}_t, X^{i,π}_t, α^{i,π}_t, ν^{N,π}_t, ν^π_t, ε^0_{t+1} ) ], where

Δ(x, x′, a, ν, ν′, e^0) = E[ d( F(x, a, ν, ε^i_{t+1}, e^0), F(x′, a, ν′, ε^i_{t+1}, e^0) ) ] ≤ K_F ( d(x, x′) + W(ν, ν′) ),   (2.14)

by (HFlip). On the other hand, we have

E[ W( ν^{N,π}_t, ν^π_t ) ] ≤ E[ W( ν^{N,π}_t, ν^{N,∞,π}_t ) ] + E[ W( ν^{N,∞,π}_t, ν^π_t ) ] ≤ E[ d( X^{i,N,π}_t, X^{i,π}_t ) ] + M_N,   (2.15)

where we used the fact that W( ν^{N,π}_t, ν^{N,∞,π}_t ) ≤ (1/N) Σ_{i=1}^N d( X^{i,N,π}_t, X^{i,π}_t ), and (2.13). It follows from (2.14)-(2.15) that

E[ d( X^{i,N,π}_{t+1}, X^{i,π}_{t+1} ) ] ≤ K_F ( 2 E[ d( X^{i,N,π}_t, X^{i,π}_t ) ] + M_N ),  for all π ∈ Π_OL,   (2.16)

which proves that sup_{π ∈ Π_OL} E[ d( X^{i,N,π}_{t+1}, X^{i,π}_{t+1} ) ] = O(M_N) by the induction hypothesis, and thus (2.10). Plugging (2.10) into (2.15) then yields (2.11).
Let us now prove the convergence of the gains. From (Hflip) and (2.15), we have

| V^{N,π} − V^π | ≤ K_f Σ_{t=0}^∞ β^t ( E[ d( X^{i,N,π}_t, X^{i,π}_t ) ] + E[ W( ν^{N,π}_t, ν^π_t ) ] )
  ≤ K_f ( 2 Σ_{t=0}^∞ β^t δ^N_t + M_N/(1 − β) ),  for all π ∈ Π_OL,   (2.17)

where we set δ^N_t := sup_{π ∈ Π_OL} E[ d( X^{i,N,π}_t, X^{i,π}_t ) ]. From (2.16), we have δ^N_{t+1} ≤ 2 K_F δ^N_t + K_F M_N, t ∈ ℕ, with δ^N_0 = 0, and so by induction

δ^N_t ≤ (K_F/(1 − 2K_F)) M_N + s_t( (K_F/|2K_F − 1|) M_N ),  where s_t(m) := m (2K_F)^t, m ≥ 0,

and where we may assume w.l.o.g. that 2K_F ≠ 1. Observing that we obviously have δ^N_t ≤ Δ_X (the diameter of X), we deduce that

Σ_{t=0}^∞ β^t δ^N_t ≤ (K_F/((1 − 2K_F)(1 − β))) M_N + S( (K_F/|2K_F − 1|) M_N ),   (2.18)

with S(m) := Σ_{t=0}^∞ β^t min( s_t(m), Δ_X ), m ≥ 0.

If 2βK_F < 1, we clearly have

S(m) ≤ Σ_{t=0}^∞ m (2βK_F)^t = m / (1 − 2βK_F),

and so, by (2.17)-(2.18), |V^{N,π} − V^π| = O(M_N). Let us now study the case 2βK_F > 1. In this case, in particular 2K_F > 1, so t ↦ s_t(m) is nondecreasing, and thus

S(m) = Σ_{t=0}^∞ β^t min( s_t(m), Δ_X ) ≤ (1/β) Σ_{t=0}^∞ ∫_t^{t+1} β^s min( m (2K_F)^s, Δ_X ) ds
  = (1/β) ∫_0^∞ e^{−|ln β| s} min( m e^{ln(2K_F) s}, Δ_X ) ds.   (2.19)

Let t_⋆ be such that m e^{ln(2K_F) t_⋆} = Δ_X, i.e., t_⋆ = ln(Δ_X/m) / ln(2K_F). Then

∫_0^∞ e^{−|ln β| s} min( m e^{ln(2K_F) s}, Δ_X ) ds ≤ m ∫_0^{t_⋆} e^{ln(2βK_F) s} ds + Δ_X ∫_{t_⋆}^∞ e^{ln(β) s} ds
  ≤ (m/ln(2βK_F)) ( e^{ln(2βK_F) t_⋆} − 1 ) − (Δ_X/ln β) e^{ln(β) t_⋆}.

Substituting t_⋆ by its explicit value, and using e^{ln(2βK_F) t_⋆} = (Δ_X/m)^{1 − |ln β|/ln(2K_F)} and e^{ln(β) t_⋆} = (m/Δ_X)^{|ln β|/ln(2K_F)}, we obtain

∫_0^∞ e^{−|ln β| s} min( m e^{ln(2K_F) s}, Δ_X ) ds = O( m^{min(1, |ln β|/ln(2K_F))} ) = O( m^γ ).

This, combined with (2.17), (2.18) and (2.19), concludes the proof. ✷
Remark 2.4 If the Lipschitz constant in (HFlip) satisfies β2KF < 1, then we can take γ = 1 in the rate of convergence (2.6) of the optimal value. In the particular case when F and f depend on the joint distribution ν ∈ P(X × A) only through its marginals on P(X ) and P(A), which is the usual framework considered in controlled mean-field dynamics, then a careful look in the above proof shows that the rate of convergence of the CMKV-MDP will be expressed in terms of
$$\tilde M_N := \max\Big\{ \sup_{\mu \in \mathcal{P}(\mathcal{X})} \mathbb{E}\big[\mathcal{W}(\mu_N, \mu)\big],\ \sup_{\upsilon \in \mathcal{P}(A)} \mathbb{E}\big[\mathcal{W}_A(\upsilon_N, \upsilon)\big] \Big\},$$
instead of $M_N$ in (2), where here $\mu_N$ (resp. $\upsilon_N$) is the empirical measure associated to $\mu$ (resp. $\upsilon$) $\in \mathcal{P}(\mathcal{X})$ (resp. $\mathcal{P}(A)$). From Lemma 2.1, the speed of convergence of $\tilde M_N$ is faster than that of $M_N$. For instance, when $\mathcal{X} \subset \mathbb{R}$ and $A \subset \mathbb{R}$, then $\tilde M_N = O(N^{-1/2})$, while $M_N = O\big(N^{-1/2}\log(1+N)\big)$. The propagation of chaos in mean-field control problems has been addressed in several papers, see [15], [5], [17], [7]. For instance, the recent paper [7], dealing with continuous-time mean-field
diffusion processes, shows that the propagation of chaos holds when the optimization is performed over asymmetric open-loop controls, where each individual control is allowed to depend upon all the past noises (including the noises from other individuals), in contrast to our model where the optimization is made over symmetric "individualized" open-loop controls (each control $\alpha^i$ only depends upon the $i$-th individual's noise and the common noise). However, no rate of convergence is provided in [7]. To the best of our knowledge, a convergence rate has been obtained only in the continuous-time case with feedback controls and finite horizon in Chapter 6, vol. 2 of [5]. The originality of Theorem 2.1, besides providing a convergence rate in the individualized open-loop case, is that we are dealing with an infinite horizon. The above proof illustrates how two phenomena are involved in this result: for $t \le t_\star$, it is the regularity of $F$ which ensures the closeness of the $N$-agent processes and the McKean-Vlasov processes, and thus the closeness of their rewards. As $t$ becomes larger, because the horizon is infinite, the $N$-agent and McKean-Vlasov processes drift away from each other due to error accumulation, but the compactness of $\mathcal{X}$ prevents them from getting too far away from each other, and for $t > t_\star$, the discounting factor $\beta^t$, by becoming smaller and smaller, makes the discounted rewards tend geometrically to $0$ and become negligible. What is remarkable is that these combined phenomena work together well enough to still ensure a fast convergence, even in infinite horizon, in $O(M_N^\gamma)$, with an explicit order $\gamma \le 1$. ✷
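The $O(N^{-1/2})$ empirical-measure rate invoked above can be observed numerically. The following is our own illustrative sketch (not from the paper: the choice $\mu = \mathcal{U}([0,1])$ and all function names are ours), estimating $\mathbb{E}[\mathcal{W}(\mu_N,\mu)]$ for $\mathcal{X} \subset \mathbb{R}$ by Monte Carlo, where the $\mathcal{W}_1$ distance between the empirical measure and $\mathcal{U}([0,1])$ is computed exactly through quantile functions:

```python
import random

def w1_empirical_vs_uniform(xs):
    """Exact W1 distance between the empirical measure of xs and U([0,1]),
    via the quantile representation W1 = int_0^1 |F_N^{-1}(q) - q| dq."""
    xs = sorted(xs)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        a, b = i / n, (i + 1) / n  # on (a, b], the empirical quantile is x
        # closed form of int_a^b |x - q| dq, depending on the position of x
        if x <= a:
            total += (a - x + b - x) * (b - a) / 2
        elif x >= b:
            total += (x - a + x - b) * (b - a) / 2
        else:
            total += ((x - a) ** 2 + (b - x) ** 2) / 2
    return total

def mean_w1(n_samples, reps, rng):
    """Monte Carlo estimate of E[W1(mu_N, mu)] for mu = U([0,1])."""
    return sum(
        w1_empirical_vs_uniform([rng.random() for _ in range(n_samples)])
        for _ in range(reps)
    ) / reps

rng = random.Random(0)
w100 = mean_w1(100, 200, rng)
w400 = mean_w1(400, 200, rng)
print(w100, w400)
```

Multiplying $N$ by $4$ roughly halves the estimate, consistent with the $N^{-1/2}$ rate of $\tilde M_N$.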
3 Lifted MDP on P(X )
Theorem 2.1 justifies the CMKV-MDP as a macroscopic approximation of the $N$-agent MDP problem with mean-field interaction. Observe that the computation of the conditional gain, expected gain, and optimal value of the CMKV-MDP in (2) only requires the variables associated to one agent, called the representative agent. Therefore, we place ourselves in a reduced universe, dismissing the other individuals' variables, and renaming the representative agent's initial information by $\Gamma$, initial state by
$\xi$, and idiosyncratic noise by $(\varepsilon_t)_{t\in\mathbb{N}}$. In the sequel, we shall denote by $\mathcal{G} = \sigma(\Gamma)$ the $\sigma$-algebra generated by the random variable $\Gamma$, hence representing the initial information filtration, and by $L^0(\mathcal{G};\mathcal{X})$ the set of $\mathcal{G}$-measurable random variables valued in $\mathcal{X}$. We shall assume that the initial state $\xi \in L^0(\mathcal{G};\mathcal{X})$, which means that the policy has access to the agent's initial state through the initial information filtration $\mathcal{G}$. An open-loop control for the representative agent is a process $\alpha$ which is adapted to the filtration generated by $(\Gamma, (\varepsilon_s)_{s\le t}, (\varepsilon_s^0)_{s\le t})_{t\in\mathbb{N}}$, and associated to an open-loop policy by $\alpha_t = \alpha_t^\pi := \pi_t(\Gamma, (\varepsilon_s)_{s\le t}, (\varepsilon_s^0)_{s\le t})$ for some $\pi \in \Pi_{OL}$. We denote by $\mathcal{A}$ the set of open-loop controls, and given $\alpha \in \mathcal{A}$, $\xi \in L^0(\mathcal{G};\mathcal{X})$, the state process $X = X^{\xi,\alpha}$ of the representative agent is governed by
$$X_{t+1} = F\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t,\alpha_t)}, \varepsilon_{t+1}, \varepsilon^0_{t+1}\big), \quad t \in \mathbb{N}, \qquad X_0 = \xi. \qquad (3.1)$$
For $\alpha = \alpha^\pi$, $\pi \in \Pi_{OL}$, we write indifferently $X^{\xi,\pi} = X^{\xi,\alpha}$, and the expected gain $V^\alpha = V^\pi$ is equal to
$$V^\alpha(\xi) = \mathbb{E}\Big[ \sum_{t\in\mathbb{N}} \beta^t f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t,\alpha_t)}\big) \Big],$$
where we stress the dependence upon the initial state $\xi$. The value function of the CMKV-MDP is then defined by
$$V(\xi) = \sup_{\alpha\in\mathcal{A}} V^\alpha(\xi), \quad \xi \in L^0(\mathcal{G};\mathcal{X}).$$
Let us now show how one can lift the CMKV-MDP to a (classical) MDP on the space of probability measures $\mathcal{P}(\mathcal{X})$. We set $\mathbb{F}^0$ as the filtration generated by the common noise $\varepsilon^0$. Given an open-loop control $\alpha \in \mathcal{A}$ and its state process $X = X^{\xi,\alpha}$, denote by $\{\mu_t = \mathbb{P}^0_{X_t},\ t \in \mathbb{N}\}$ the random $\mathcal{P}(\mathcal{X})$-valued process, and notice from Proposition A.1 that $(\mu_t)_t$ is $\mathbb{F}^0$-adapted. From (3.1), and recalling the pushforward measure notation, we have
$$\mu_{t+1} = F\big(\cdot,\cdot,\mathbb{P}^0_{(X_t,\alpha_t)},\cdot,\varepsilon^0_{t+1}\big) \star \big(\mathbb{P}^0_{(X_t,\alpha_t)} \otimes \lambda_\varepsilon\big), \quad \text{a.s.} \qquad (3.2)$$
As the probability distribution $\lambda_\varepsilon$ of the idiosyncratic noise is a fixed parameter, the above relation means that $\mu_{t+1}$ only depends on $\mathbb{P}^0_{(X_t,\alpha_t)}$ and $\varepsilon^0_{t+1}$. Moreover, by introducing the so-called relaxed control associated to the open-loop control $\alpha$ as
$$\hat\alpha_t(x) = \mathcal{L}^0\big(\alpha_t \,|\, X_t = x\big), \quad t \in \mathbb{N},$$
which is valued in $\hat A(\mathcal{X})$, the set of probability kernels on $\mathcal{X} \times A$ (see Lemma A.2), we see from Bayes' formula that $\mathbb{P}^0_{(X_t,\alpha_t)} = \mu_t \cdot \hat\alpha_t$. The dynamics relation (3.2) is then written as
$$\mu_{t+1} = \hat F(\mu_t, \hat\alpha_t, \varepsilon^0_{t+1}), \quad t \in \mathbb{N},$$
where the function $\hat F : \mathcal{P}(\mathcal{X}) \times \hat A(\mathcal{X}) \times E^0 \to \mathcal{P}(\mathcal{X})$ is defined by
$$\hat F(\mu, \hat a, e^0) = F\big(\cdot,\cdot,\mu\cdot\hat a,\cdot,e^0\big) \star \big((\mu\cdot\hat a) \otimes \lambda_\varepsilon\big). \qquad (3.3)$$
On the other hand, by the law of iterated conditional expectation, the expected gain can be written as
$$V^\alpha(\xi) = \mathbb{E}\Big[ \sum_{t\in\mathbb{N}} \beta^t\, \mathbb{E}^0\big[ f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t,\alpha_t)}\big) \big] \Big],$$
with the conditional expectation term equal to
$$\mathbb{E}^0\big[ f\big(X_t, \alpha_t, \mathbb{P}^0_{(X_t,\alpha_t)}\big) \big] = \hat f(\mu_t, \hat\alpha_t),$$
where the function $\hat f : \mathcal{P}(\mathcal{X}) \times \hat A(\mathcal{X}) \to \mathbb{R}$ is defined by
$$\hat f(\mu, \hat a) = \int_{\mathcal{X}\times A} f(x, a, \mu\cdot\hat a)\, (\mu\cdot\hat a)(dx, da). \qquad (3.4)$$
The above derivation suggests to consider an MDP with state space $\mathcal{P}(\mathcal{X})$, action space $\hat A(\mathcal{X})$, state transition function $\hat F$ as in (3.3), discount factor $\beta \in [0,1)$, and reward function $\hat f$ as in (3.4). A key point is to endow $\hat A(\mathcal{X})$ with a suitable $\sigma$-algebra in order to have measurable functions $\hat F$, $\hat f$, and $\mathbb{F}^0$-adapted processes $\hat\alpha$ valued in $\hat A(\mathcal{X})$, so that the MDP with characteristics $(\mathcal{P}(\mathcal{X}), \hat A(\mathcal{X}), \hat F, \hat f, \beta)$ is well-posed. This issue is investigated in the next sections, first in special cases, and then in the general case by a suitable enlargement of the action space.
3.1 Case without common noise
When there is no common noise, the original state transition function $F$ is defined from $\mathcal{X}\times A\times\mathcal{P}(\mathcal{X}\times A)\times E$ into $\mathcal{X}$, and the associated function $\hat F$ is then defined from $\mathcal{P}(\mathcal{X}) \times \hat A(\mathcal{X})$ into $\mathcal{P}(\mathcal{X})$ by
$$\hat F(\mu, \hat a) = F\big(\cdot,\cdot,\mu\cdot\hat a,\cdot\big) \star \big((\mu\cdot\hat a) \otimes \lambda_\varepsilon\big).$$
In this case, we are simply reduced to a deterministic control problem on the state space $\mathcal{P}(\mathcal{X})$ with dynamics
$$\mu_{t+1} = \hat F(\mu_t, \kappa_t), \quad t \in \mathbb{N}, \qquad \mu_0 = \mu \in \mathcal{P}(\mathcal{X}),$$
controlled by $\kappa = (\kappa_t)_{t\in\mathbb{N}} \in \hat{\mathcal{A}}$, the set of deterministic sequences valued in $\hat A(\mathcal{X})$, and cumulated gain/value function:
$$\hat V^\kappa(\mu) = \sum_{t=0}^\infty \beta^t \hat f(\mu_t, \kappa_t), \qquad \hat V(\mu) = \sup_{\kappa\in\hat{\mathcal{A}}} \hat V^\kappa(\mu), \quad \mu \in \mathcal{P}(\mathcal{X}),$$
where the bounded function $\hat f : \mathcal{P}(\mathcal{X}) \times \hat A(\mathcal{X}) \to \mathbb{R}$ is defined as in (3.4). Notice that there are no measurability issues for $\hat F$, $\hat f$, as the problem is deterministic and all the quantities defined above are well-defined. We aim to prove the correspondence and equivalence between the MKV-MDP and the above deterministic control problem. From a similar derivation as in (3.2)-(3.4) (by taking the law directly under $\mathbb{P}$ instead of $\mathbb{P}^0$), we clearly see that for any $\alpha \in \mathcal{A}$, $V^\alpha(\xi) = \hat V^{\hat\alpha}(\mu)$, with $\mu = \mathcal{L}(\xi)$, and $\hat\alpha = \mathcal{R}_\xi(\alpha)$, where $\mathcal{R}_\xi$ is the relaxed operator
$$\mathcal{R}_\xi : \mathcal{A} \longrightarrow \hat{\mathcal{A}}, \qquad \alpha = (\alpha_t)_t \longmapsto \hat\alpha = (\hat\alpha_t)_t : \hat\alpha_t(x) = \mathcal{L}\big(\alpha_t \,|\, X_t^{\xi,\alpha} = x\big), \quad t \in \mathbb{N},\ x \in \mathcal{X}.$$
It follows that $V(\xi) \le \hat V(\mu)$. In order to get the reverse inequality, we have to show that $\mathcal{R}_\xi$ is surjective. Notice that this property is not always satisfied: for instance, when the $\sigma$-algebra generated by $\xi$ is equal to $\mathcal{G}$, then for any $\alpha \in \mathcal{A}$, $\alpha_0$ is $\sigma(\xi)$-measurable at time $t = 0$, and thus $\mathcal{L}(\alpha_0|\xi)$ is a Dirac distribution, hence cannot be equal to an arbitrary probability kernel $\kappa_0 = \hat a \in \hat A(\mathcal{X})$. We shall then make the following randomization hypothesis.

Rand$(\xi,\mathcal{G})$: There exists a uniform random variable $U \sim \mathcal{U}([0,1])$ which is $\mathcal{G}$-measurable and independent of $\xi \in L^0(\mathcal{G};\mathcal{X})$.
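To make the lifted dynamics concrete, here is a minimal numerical sketch (our own toy example, not from the paper: the spaces, the transition function `F`, the noise law and all names are hypothetical choices) of one step $\mu_{t+1} = \hat F(\mu_t, \kappa_t)$ in the finite case: the joint law $\mu\cdot\hat a$ is formed from the state distribution and the relaxed control kernel, then pushed forward through $F$ and the noise law $\lambda_\varepsilon$:

```python
from itertools import product

# Toy finite model (all choices hypothetical, for illustration only):
# states X = {0,1}, actions A = {0,1}, idiosyncratic noise E = {0,1}.
X, A, E = [0, 1], [0, 1], [0, 1]
lam = {0: 0.7, 1: 0.3}  # noise law lambda_eps

def F(x, a, nu, e):
    """Hypothetical mean-field transition: next state is a parity of
    (state, action, noise), plus a drift driven by the mean action
    under the joint law nu -- this is where the interaction enters."""
    mean_action = sum(p * aa for (xx, aa), p in nu.items())
    drift = 1 if mean_action > 0.5 else 0
    return (x + a + e + drift) % 2

def joint(mu, ahat):
    """The joint law mu . ahat on X x A, from state law mu and kernel ahat."""
    return {(x, a): mu[x] * ahat[x][a] for x, a in product(X, A)}

def F_hat(mu, ahat):
    """Lifted dynamics: pushforward of (mu . ahat) x lambda_eps by F."""
    nu = joint(mu, ahat)
    mu_next = {x: 0.0 for x in X}
    for (x, a), p in nu.items():
        for e in E:
            mu_next[F(x, a, nu, e)] += p * lam[e]
    return mu_next

mu0 = {0: 0.7, 1: 0.3}
ahat = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # a relaxed control kernel
mu1 = F_hat(mu0, ahat)
print(mu1)
```

Iterating `F_hat` with a sequence of kernels $(\kappa_t)_t$ produces the whole deterministic trajectory $(\mu_t)_t$.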
Remark 3.1 The randomization hypothesis Rand$(\xi,\mathcal{G})$ implies in particular that $\Gamma$ is atomless, i.e., $\mathcal{G}$ is rich enough, and thus $\mathcal{P}(\mathcal{X}) = \{\mathcal{L}(\zeta) : \zeta \in L^0(\mathcal{G};\mathcal{X})\}$. Furthermore, it means that there is extra randomness in $\mathcal{G}$ besides $\xi$, so that one can freely randomize, via the uniform random variable $U$, the first action given $\xi$ according to any probability kernel $\hat a$. Moreover, one can extract from $U$, by standard separation of the decimals of $U$ (see Lemma 2.21 in [14]), an i.i.d. sequence of uniform variables $(U_t)_{t\in\mathbb{N}}$, which are $\mathcal{G}$-measurable, independent of $\xi$, and can then be used to randomize the subsequent actions. ✷
Theorem 3.1 (Correspondence in the no common noise case)
Assume that Rand$(\xi,\mathcal{G})$ holds true. Then $\mathcal{R}_\xi$ is surjective from $\mathcal{A}$ into $\hat{\mathcal{A}}$, and we have $V(\xi) = \hat V(\mu)$, for $\mu = \mathcal{L}(\xi)$. Moreover, for $\epsilon \ge 0$, if $\alpha^\epsilon \in \mathcal{A}$ is an $\epsilon$-optimal control for $V(\xi)$, then $\mathcal{R}_\xi(\alpha^\epsilon) \in \hat{\mathcal{A}}$ is an $\epsilon$-optimal control for $\hat V(\mu)$, and conversely, if $\hat\alpha^\epsilon \in \hat{\mathcal{A}}$ is an $\epsilon$-optimal control for $\hat V(\mu)$, then any $\alpha^\epsilon \in \mathcal{R}_\xi^{-1}(\hat\alpha^\epsilon)$ is an $\epsilon$-optimal control for $V(\xi)$. Consequently, an optimal control for $V(\xi)$ exists iff an optimal control for $\hat V(\mu)$ exists.
Proof. In view of the above discussion, we only need to prove the surjectivity of $\mathcal{R}_\xi$. Fix a control $\kappa \in \hat{\mathcal{A}}$ for the MDP on $\mathcal{P}(\mathcal{X})$. By Lemma 2.22 in [14], for all $t \in \mathbb{N}$, there exists a measurable function $a_t : \mathcal{X}\times[0,1] \to A$ such that $\mathcal{L}(a_t(x,U)) = \kappa_t(x)$ for all $x \in \mathcal{X}$. It is then clear that the control $\alpha$ defined recursively by $\alpha_t := a_t(X_t^{\xi,\alpha}, U_t)$, where $(U_t)_t$ is an i.i.d. sequence of $\mathcal{G}$-measurable uniform variables independent of $\xi$ under Rand$(\xi,\mathcal{G})$, satisfies $\mathcal{L}(\alpha_t \,|\, X_t^{\xi,\alpha} = x) = \kappa_t(x)$ (observing that $U_t$ is independent of $X_t^{\xi,\alpha}$), and thus $\hat\alpha = \kappa$, which proves the surjectivity of $\mathcal{R}_\xi$. ✷
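For a finite action set, a measurable function $a_t$ as in Lemma 2.22 of [14] can be realized by inverse-CDF selection. The following sketch (the toy kernel $\kappa$ and all names are our own) checks empirically that $a_t(x,U) \sim \kappa_t(x)$ when $U \sim \mathcal{U}([0,1])$:

```python
import random

# Hypothetical finite action set and probability kernel kappa(x) in P(A)
# (both are toy choices, standing in for kappa_t(x) in the proof).
A = ["left", "stay", "right"]
kappa = {0: [0.2, 0.3, 0.5], 1: [0.7, 0.1, 0.2]}  # kappa[x][i] = P(A[i] | x)

def a_t(x, u):
    """Measurable selection with a_t(x, U) ~ kappa(x) for U ~ U([0,1]):
    generalized-inverse (inverse-CDF) sampling on a finite action set."""
    cum = 0.0
    for p, action in zip(kappa[x], A):
        cum += p
        if u < cum:
            return action
    return A[-1]  # guard against floating-point round-off

rng = random.Random(42)
n = 20000
freq = {a: 0 for a in A}
for _ in range(n):
    freq[a_t(0, rng.random())] += 1
emp = {a: freq[a] / n for a in A}
print(emp)  # empirical frequencies close to kappa[0]
```

Feeding the i.i.d. uniforms $(U_t)_t$ extracted in Remark 3.1 into such maps implements the randomized control $\alpha_t = a_t(X_t^{\xi,\alpha}, U_t)$.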
Remark 3.2 The above correspondence result shows in particular that the value function V of the MKV-MDP is law invariant, in the sense that it depends on its initial state ξ only via its probability law µ = L(ξ), for ξ satisfying the randomization hypothesis. ✷
3.2 Case with finite state space $\mathcal{X}$ and with common noise

We consider the case with common noise but when the state space $\mathcal{X}$ is finite, i.e., its cardinality $\#\mathcal{X}$ is finite, equal to $n$. In this case, one can identify $\mathcal{P}(\mathcal{X})$ with the simplex $\mathcal{S}^{n-1} = \{p = (p_i)_{i=1,\dots,n} \in [0,1]^n : \sum_{i=1}^n p_i = 1\}$, by associating any probability distribution $\mu \in \mathcal{P}(\mathcal{X})$ with its weights $(\mu(\{x\}))_{x\in\mathcal{X}} \in \mathcal{S}^{n-1}$. We also identify the action space $\hat A(\mathcal{X})$ with $\mathcal{P}(A)^n$ by associating any probability kernel $\hat a \in \hat A(\mathcal{X})$ with $(\hat a(x))_{x\in\mathcal{X}} \in \mathcal{P}(A)^n$, and thus $\hat A(\mathcal{X})$ is naturally endowed with the product $\sigma$-algebra of the Wasserstein metric space $\mathcal{P}(A)$.
Lemma 3.1 Suppose that $\#\mathcal{X} = n < \infty$. Then $\hat F$ in (3.3) is a measurable function from $\mathcal{S}^{n-1}\times\mathcal{P}(A)^n\times E^0$ into $\mathcal{S}^{n-1}$, and $\hat f$ in (3.4) is a real-valued measurable function on $\mathcal{S}^{n-1}\times\mathcal{P}(A)^n$. Moreover, for any $\xi \in L^0(\mathcal{G};\mathcal{X})$ and $\alpha \in \mathcal{A}$, the $\mathcal{P}(A)^n$-valued process $\hat\alpha$ defined by $\hat\alpha_t(x) = \mathcal{L}^0(\alpha_t\,|\,X_t^{\xi,\alpha} = x)$, $t \in \mathbb{N}$, $x \in \mathcal{X}$, is $\mathbb{F}^0$-adapted.
Proof. By Lemma A.1, it is clear, by measurable composition, that we only need to prove that $\Psi : (\mu,\hat a) \in (\mathcal{P}(\mathcal{X}), \hat A(\mathcal{X})) \mapsto \mu\cdot\hat a \in \mathcal{P}(\mathcal{X}\times A)$ is measurable. However, in this finite state space case, $\mu\cdot\hat a$ is here simply equal to $\sum_{x\in\mathcal{X}} \mu(x)\hat a(x)$, and thus $\Psi$ is clearly measurable. ✷

In view of Lemma 3.1, the MDP with characteristics $(\mathcal{P}(\mathcal{X}) \equiv \mathcal{S}^{n-1}, \hat A(\mathcal{X}) \equiv \mathcal{P}(A)^n, \hat F, \hat f, \beta)$ is well-posed. Let us then denote by $\hat{\mathcal{A}}$ the set of $\mathbb{F}^0$-adapted processes valued in $\mathcal{P}(A)^n$, and given $\kappa \in \hat{\mathcal{A}}$, consider the controlled dynamics in $\mathcal{S}^{n-1}$
$$\mu_{t+1} = \hat F(\mu_t, \kappa_t, \varepsilon^0_{t+1}), \quad t \in \mathbb{N}, \qquad \mu_0 = \mu \in \mathcal{S}^{n-1}, \qquad (3.5)$$
and the associated expected gain and value function
$$\hat V^\kappa(\mu) = \mathbb{E}\Big[\sum_{t=0}^\infty \beta^t \hat f(\mu_t, \kappa_t)\Big], \qquad \hat V(\mu) = \sup_{\kappa\in\hat{\mathcal{A}}} \hat V^\kappa(\mu). \qquad (3.6)$$
We aim to prove the correspondence and equivalence between the CMKV-MDP and the MDP (3.5)-(3.6). From the derivation in (3.2)-(3.4) and by Lemma 3.1, we see that for any $\alpha \in \mathcal{A}$, $V^\alpha(\xi) = \hat V^{\hat\alpha}(\mu)$, where $\mu = \mathcal{L}(\xi)$ and $\hat\alpha = \mathcal{R}^0_\xi(\alpha)$, where $\mathcal{R}^0_\xi$ is the relaxed operator
$$\mathcal{R}^0_\xi : \mathcal{A} \longrightarrow \hat{\mathcal{A}}, \qquad \alpha = (\alpha_t)_t \longmapsto \hat\alpha = (\hat\alpha_t)_t : \hat\alpha_t(x) = \mathcal{L}^0\big(\alpha_t\,|\,X_t^{\xi,\alpha} = x\big), \quad t \in \mathbb{N},\ x \in \mathcal{X}. \qquad (3.7)$$
It follows that $V(\xi) \le \hat V(\mu)$. In order to get the reverse inequality from the surjectivity of $\mathcal{R}^0_\xi$, we again need, as in the no common noise case, some randomization hypothesis. It turns out that when $\mathcal{X}$ is finite, this randomization hypothesis simply reduces to the atomless property of $\Gamma$.
Lemma 3.2 Assume that $\Gamma$ is atomless, i.e., $\mathcal{G}$ is rich enough. Then any $\xi \in L^0(\mathcal{G};\mathcal{X})$ taking a countable number of values satisfies Rand$(\xi,\mathcal{G})$.
Proof. Let $S$ be a countable set s.t. $\xi \in S$ a.s. and $\mathbb{P}[\xi = x] > 0$ for all $x \in S$. Fix $x \in S$ and denote by $\mathbb{P}_x$ the probability "knowing $\xi = x$", i.e., $\mathbb{P}_x[B] := \frac{\mathbb{P}[B,\, \xi = x]}{\mathbb{P}[\xi = x]}$ for all $B \in \mathcal{F}$. It is clear that, endowing $\Omega$ with this probability, $\Gamma$ is still atomless, and so there exists a $\mathcal{G}$-measurable random variable $U_x$ that is uniform under $\mathbb{P}_x$. Then the random variable $U := \sum_{x\in S} U_x \mathbf{1}_{\xi=x}$ is a $\mathcal{G}$-measurable random variable which is uniform under $\mathbb{P}_x$ for all $x \in S$, which implies that it is a uniform variable under $\mathbb{P}$, independent of $\xi$. ✷
Theorem 3.2 (Correspondence with the MDP on $\mathcal{P}(\mathcal{X})$ in the $\mathcal{X}$ finite case)
Assume that $\mathcal{G}$ is rich enough. Then $\mathcal{R}^0_\xi$ is surjective from $\mathcal{A}$ into $\hat{\mathcal{A}}$, and $V(\xi) = \hat V(\mu)$, for any $\mu \in \mathcal{P}(\mathcal{X})$, $\xi \in L^0(\mathcal{G};\mathcal{X})$ s.t. $\mu = \mathcal{L}(\xi)$. Moreover, for $\epsilon \ge 0$, if $\alpha^\epsilon \in \mathcal{A}$ is an $\epsilon$-optimal control for $V(\xi)$, then $\mathcal{R}^0_\xi(\alpha^\epsilon) \in \hat{\mathcal{A}}$ is an $\epsilon$-optimal control for $\hat V(\mu)$. Conversely, if $\hat\alpha^\epsilon \in \hat{\mathcal{A}}$ is an $\epsilon$-optimal control for $\hat V(\mu)$, then any $\alpha^\epsilon \in (\mathcal{R}^0_\xi)^{-1}(\hat\alpha^\epsilon)$ is an $\epsilon$-optimal control for $V(\xi)$. Consequently, an optimal control for $V(\xi)$ exists iff an optimal control for $\hat V(\mu)$ exists.

Proof. From the derivation in (3.5)-(3.6), we only need to prove the surjectivity of $\mathcal{R}^0_\xi$. Fix $\kappa \in \hat{\mathcal{A}}$ and let $\pi_t \in L^0((E^0)^t; \hat A(\mathcal{X}))$ be such that $\kappa_t = \pi_t((\varepsilon^0_s)_{s\le t})$. As $\mathcal{X}$ is finite, by definition of the $\sigma$-algebra on $\hat A(\mathcal{X})$, $\pi_t$ can be seen as a measurable function in $L^0((E^0)^t\times\mathcal{X}; \mathcal{P}(A))$. Let $\varphi \in L^0(A, \mathbb{R})$ be an embedding as in Lemma C.2. By Lemma A.1, we know that $\varphi\star\pi_t$ is in $L^0((E^0)^t\times\mathcal{X}; \mathcal{P}(\mathbb{R}))$. Given $m \in \mathcal{P}(\mathbb{R})$, we denote by $F_m^{-1}$ the generalized inverse of its distribution function, and it is known that the mapping $m \in (\mathcal{P}(\mathbb{R}), \mathcal{W}) \mapsto F_m^{-1} \in (L^1_{caglad}(\mathbb{R}), \|\cdot\|_1)$ is an isometry and is thus measurable. Therefore, $F^{-1}_{\varphi\star\pi_t}$ is in $L^0((E^0)^t\times\mathcal{X}; (L^1_{caglad}(\mathbb{R}), \|\cdot\|_1))$. Finally, the mapping $(f, u) \in (L^1_{caglad}(\mathbb{R}), \|\cdot\|_1) \times ([0,1], \mathcal{B}([0,1])) \mapsto f(u) \in (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable, since it is the limit of the sequence $n\sum_{i\in\mathbb{Z}} \mathbf{1}_{[\frac{i+1}{n},\frac{i+2}{n})}(u) \int_{i/n}^{(i+1)/n} f(y)\,dy$ when $n \to \infty$. Therefore, the mapping
$$a_t : (E^0)^t\times\mathcal{X}\times[0,1] \longrightarrow A, \qquad \big((e^0_s)_{s\le t},\, x,\, u\big) \longmapsto \varphi^{-1}\circ F^{-1}_{\varphi\star\pi_t((e^0_s)_{s\le t},\,x)}(u)$$
is measurable. We thus define, by induction, $\alpha_t := a_t((\varepsilon^0_s)_{s\le t}, X_t^{\xi,\alpha}, U_t)$. By construction and by the generalized inverse simulation method, it is clear that $\hat\alpha_t = \kappa_t$. ✷
Remark 3.3 We point out that when both the state space $\mathcal{X}$ and the action space $A$ are finite, equipped with the metrics $d(x,x') := \mathbf{1}_{x\neq x'}$, $x,x' \in \mathcal{X}$, and $d_A(a,a') := \mathbf{1}_{a\neq a'}$, $a,a' \in A$, the transition function $\hat F$ and reward function $\hat f$ of the lifted MDP on $\mathcal{P}(\mathcal{X})$ inherit the Lipschitz conditions (HFlip) and (Hflip) used for the propagation of chaos. Indeed, it is known that the Wasserstein distance obtained from $d$ (resp. $d_A$) coincides with twice the total variation distance, and thus with the $L^1$ distance when naturally embedding $\mathcal{P}(\mathcal{X})$ (resp. $\mathcal{P}(A)$) in $[0,1]^{\#\mathcal{X}}$ (resp. $[0,1]^{\#A}$). Thus, embedding $\hat A(\mathcal{X})$ in $M_{\#\mathcal{X},\#A}([0,1])$, the set of $\#\mathcal{X}\times\#A$ matrices with coefficients valued in $[0,1]$, we have
$$\big\|\hat F(\mu, \hat a, e^0) - \hat F(\mu', \hat a', e^0)\big\|_1 \;\le\; (1 + K_F)\Big(2\|\mu - \mu'\|_1 + \sup_{x\in\mathcal{X}}\|\hat a_{x,\cdot} - \hat a'_{x,\cdot}\|_1\Big).$$
We obtain a similar property for $\hat f$. In other words, lifting the CMKV-MDP not only turns it into an MDP, but its state and action spaces $[0,1]^{\#\mathcal{X}}$ and $[0,1]^{\#\mathcal{X}\times\#A}$ are very standard, and its dynamics and reward are Lipschitz functions with constants of the order of $K_F$ and $K_f$ with respect to the norm $\|\cdot\|_1$. Thus, due to the standard nature of this MDP, most MDP algorithms can be applied, and their speed can be expressed simply in terms of the original parameters $K_F$ and $K_f$ of the CMKV-MDP. ✷
Remark 3.4 As in the no common noise case, the correspondence result in the finite state space case for $\mathcal{X}$ shows notably that the value function of the CMKV-MDP is law-invariant. The general case (common noise and continuous state space $\mathcal{X}$) raises multiple issues for establishing the equivalence between the CMKV-MDP and the lifted MDP on $\mathcal{P}(\mathcal{X})$. First, we have to endow the action space $\hat A(\mathcal{X})$ with a suitable $\sigma$-algebra for the lifted MDP to be well-posed: on the one hand, this $\sigma$-algebra has to be large enough to make the functions $\hat F : \mathcal{P}(\mathcal{X})\times\hat A(\mathcal{X})\times E^0 \to \mathcal{P}(\mathcal{X})$ and $\hat f : \mathcal{P}(\mathcal{X})\times\hat A(\mathcal{X}) \to \mathbb{R}$ measurable, and on the other hand, it should be small enough to make the process $\hat\alpha = \mathcal{R}^0_\xi(\alpha)$ $\mathbb{F}^0$-adapted for any control $\alpha \in \mathcal{A}$ in the CMKV-MDP. Beyond the well-posedness issue of the lifted MDP, the second important concern is the surjectivity of the relaxed operator $\mathcal{R}^0_\xi$ from $\mathcal{A}$ into $\hat{\mathcal{A}}$. Indeed, if we try to adapt the proof of Theorem 3.2 to the case of a continuous state space $\mathcal{X}$, the issue is that we cannot in general equip $\hat A(\mathcal{X})$ with a $\sigma$-algebra such that $L^0((E^0)^t; \hat A(\mathcal{X})) = L^0((E^0)^t\times\mathcal{X}; \mathcal{P}(A))$, and thus we cannot see $\pi_t \in L^0((E^0)^t; \hat A(\mathcal{X}))$ as an element of $L^0((E^0)^t\times\mathcal{X}; \mathcal{P}(A))$, which is crucial because the control $\alpha$ (such that $\hat\alpha = \kappa$) is defined with $\alpha_t$ explicitly depending upon $\pi_t((\varepsilon^0_s)_{s\le t}, X_t)$. In the next section, we shall fix these measurability issues in the general case, and prove the correspondence between the CMKV-MDP and a general lifted MDP on $\mathcal{P}(\mathcal{X})$. ✷
4 General case and Bellman fixed point equation in P(X )
We address the general case with common noise and possibly continuous state space $\mathcal{X}$. Our aim is to state the correspondence of the CMKV-MDP with a suitable lifted MDP on $\mathcal{P}(\mathcal{X})$, associated to a Bellman fixed point equation characterizing the value function, and to obtain as a by-product an $\epsilon$-optimal control. We proceed as follows:

(i) We first introduce a well-posed lifted MDP on $\mathcal{P}(\mathcal{X})$ by enlarging the action space to $\mathcal{P}(\mathcal{X}\times A)$, and call $\tilde V$ the corresponding value function, which satisfies $V(\xi) \le \tilde V(\mu)$, for $\mu = \mathcal{L}(\xi)$.

(ii) We then consider the Bellman equation associated to this well-posed lifted MDP on $\mathcal{P}(\mathcal{X})$, which admits a unique fixed point, called $V^\star$.

(iii) Under the randomization hypothesis for $\xi$, we show the existence of an $\epsilon$-randomized feedback policy, which yields both an $\epsilon$-randomized feedback control for the CMKV-MDP and an $\epsilon$-optimal feedback control for $\tilde V$. This proves that $V(\xi) = \tilde V(\mu) = V^\star(\mu)$, for $\mu = \mathcal{L}(\xi)$.

(iv) Under the condition that $\mathcal{G}$ is rich enough, we conclude that $V$ is law-invariant and equal to $\tilde V = V^\star$, hence satisfies the Bellman equation. Finally, we show how to compute, from the Bellman equation, by value or policy iteration, an approximate optimal strategy and value function.
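As a preview of the fixed point argument behind steps (ii) and (iv): the Bellman operator of a discounted MDP is a $\beta$-contraction in sup-norm, so value iteration converges geometrically to the unique fixed point. A minimal sketch on a small hypothetical finite MDP (the toy transition and reward tables are ours, not the lifted MDP on $\mathcal{P}(\mathcal{X})$ itself):

```python
# Value iteration on a small *hypothetical* finite MDP, illustrating the
# beta-contraction of the Bellman operator T in sup-norm.
beta = 0.9
n_states, n_actions = 3, 2
# P[s][a][s2] = transition probability, r[s][a] = reward (toy numbers)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
    [[0.2, 0.2, 0.6], [0.5, 0.0, 0.5]],
]
r = [[1.0, 0.5], [0.0, 2.0], [0.3, 0.1]]

def bellman(V):
    """One application of the Bellman operator T."""
    return [
        max(
            r[s][a] + beta * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]

V = [0.0] * n_states
gaps = []  # sup-norm distance between successive iterates
for _ in range(100):
    V_next = bellman(V)
    gaps.append(max(abs(v1 - v0) for v0, v1 in zip(V, V_next)))
    V = V_next
print(V, gaps[-1])
```

Each gap is at most $\beta$ times the previous one, which is exactly the contraction property that gives existence and uniqueness of $V^\star$.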
4.1 A general lifted MDP on $\mathcal{P}(\mathcal{X})$

We start again from the relation (3.2) describing the evolution of $\mu_t = \mathbb{P}^0_{X_t}$, $t \in \mathbb{N}$, for a state process $X_t = X_t^{\xi,\alpha}$ controlled by $\alpha \in \mathcal{A}$:
$$\mu_{t+1} = F\big(\cdot,\cdot,\mathbb{P}^0_{(X_t,\alpha_t)},\cdot,\varepsilon^0_{t+1}\big) \star \big(\mathbb{P}^0_{(X_t,\alpha_t)}\otimes\lambda_\varepsilon\big), \quad \text{a.s.} \qquad (4.1)$$
Now, instead of disintegrating, as in Section 3, the conditional law of the pair $(X_t,\alpha_t)$ as $\mathbb{P}^0_{(X_t,\alpha_t)} = \mu_t\cdot\hat\alpha_t$, where $\hat\alpha = \mathcal{R}^0_\xi(\alpha)$ is the relaxed control in (3.7), we directly consider the control process $\boldsymbol{\alpha}_t = \mathbb{P}^0_{(X_t,\alpha_t)}$, $t \in \mathbb{N}$, which is $\mathbb{F}^0$-adapted (see Proposition A.1) and valued in the space of probability measures $\mathbb{A} := \mathcal{P}(\mathcal{X}\times A)$, naturally endowed with the $\sigma$-algebra of its Wasserstein metric. Notice that this $\mathbb{A}$-valued control $\boldsymbol{\alpha}$ obtained from the CMKV-MDP has to satisfy by definition the marginal constraint $\mathrm{pr}_1 \star \boldsymbol{\alpha}_t = \mu_t$ at any time $t$. In order to tackle this marginal constraint, we shall rely on the following coupling results.
Lemma 4.1 (Measurable coupling)
There exists a measurable function $\zeta \in L^0(\mathcal{P}(\mathcal{X})^2\times\mathcal{X}\times[0,1]; \mathcal{X})$ s.t. for any $(\mu,\mu') \in \mathcal{P}(\mathcal{X})^2$, if $\xi \sim \mu$ and $U$ is a uniform random variable independent of $\xi$, then:

• $\zeta(\mu,\mu',\xi,U) \sim \mu'$.

• (i) When $\mathcal{X} \subset \mathbb{R}$:
$$\mathbb{E}\big[d\big(\xi, \zeta(\mu,\mu',\xi,U)\big)\big] = \mathcal{W}(\mu,\mu').$$
(ii) In general, when $\mathcal{X}$ is Polish: $\forall \varepsilon > 0$, $\exists \eta > 0$ s.t.
$$\mathcal{W}(\mu,\mu') < \eta \;\Longrightarrow\; \mathbb{E}\big[d\big(\xi, \zeta(\mu,\mu',\xi,U)\big)\big] < \varepsilon.$$

Proof. See Appendix C. ✷
Remark 4.1 Lemma 4.1 can be seen as a measurable version of the well-known coupling result in optimal transport, which states that given $\mu, \mu' \in \mathcal{P}(\mathcal{X})$, there exist random variables $\xi$ and $\xi'$ with $\mathcal{L}(\xi) = \mu$, $\mathcal{L}(\xi') = \mu'$ such that $\mathcal{W}(\mu,\mu') = \mathbb{E}[d(\xi,\xi')]$. A similar measurable optimal coupling is proved in [9] under the assumption that there exists a transfer function realizing an optimal coupling between $\mu$ and $\mu'$. However, such a transfer function does not always exist, for instance when $\mu$ has atoms but $\mu'$ does not. Lemma 4.1 builds a measurable coupling without making such an assumption (essentially using the uniform variable $U$ to randomize when $\mu$ has atoms). ✷
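On the real line, the optimal coupling alluded to in Remark 4.1 is the monotone (quantile) rearrangement $\xi' = F_{\mu'}^{-1}(F_\mu(\xi))$, which is well defined without extra randomization when $\mu$ is atomless. The following sketch (the distributions are our own illustrative choices) checks numerically that its transport cost matches $\mathcal{W}_1(\mu,\mu')$:

```python
import random

# Monotone (quantile) coupling on the real line: for atomless mu, the map
# xi' = F_{mu'}^{-1}(F_mu(xi)) realizes the W1-optimal coupling; the extra
# uniform U in Lemma 4.1 is only needed to randomize when mu has atoms.
# Toy distributions (our choice): mu = U([0,1]), mu' = U([2,5]).
F_mu = lambda x: x                  # cdf of U([0,1])
Q_mu_prime = lambda q: 2 + 3 * q    # quantile function of U([2,5])

rng = random.Random(1)
n = 100_000
cost = sum(
    abs(x - Q_mu_prime(F_mu(x))) for x in (rng.random() for _ in range(n))
) / n
# Exact value: W1(mu, mu') = int_0^1 |q - (2 + 3q)| dq = 3
print(cost)
```

The Monte Carlo cost agrees with the exact Wasserstein distance, illustrating case (i) of Lemma 4.1.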
From the measurable coupling function $\zeta$ of Lemma 4.1, we define the coupling projection $p : \mathcal{P}(\mathcal{X})\times\mathbb{A} \to \mathbb{A}$ by
$$p(\mu, \mathfrak{a}) = \mathcal{L}\big(\zeta(\mathrm{pr}_1\star\mathfrak{a},\, \mu,\, \xi',\, U),\ \alpha_0'\big), \quad \mu \in \mathcal{P}(\mathcal{X}),\ \mathfrak{a} \in \mathbb{A},$$
where $(\xi',\alpha_0') \sim \mathfrak{a}$, and $U$ is a uniform random variable independent of $\xi'$.

Lemma 4.2 (Measurable coupling projection)
The coupling projection $p$ is a measurable function from $\mathcal{P}(\mathcal{X})\times\mathbb{A}$ into $\mathbb{A}$, and for all $(\mu,\mathfrak{a}) \in \mathcal{P}(\mathcal{X})\times\mathbb{A}$:
$$\mathrm{pr}_1\star p(\mu,\mathfrak{a}) = \mu, \quad \text{and if } \mathrm{pr}_1\star\mathfrak{a} = \mu, \text{ then } p(\mu,\mathfrak{a}) = \mathfrak{a}. \qquad (4.2)$$
Proof. By construction, it is clear that $\zeta(\mu,\mu,\xi,U) = \xi$, and so relation (4.2) is obvious. The only result that is not trivial is the measurability of $p$. Observe that $p(\mu,\mathfrak{a}) = g(\mu,\mathfrak{a},\cdot,\cdot,\cdot)\star(\mathfrak{a}\otimes\mathcal{U}([0,1]))$, where $g$ is the measurable function
$$g : \mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{X}\times A)\times\mathcal{X}\times A\times[0,1] \longrightarrow \mathcal{X}\times A, \qquad (\mu,\, \mathfrak{a},\, x,\, a,\, u) \longmapsto \big(\zeta(\mathrm{pr}_1\star\mathfrak{a},\, \mu,\, x,\, u),\ a\big).$$
We thus conclude by Lemma A.1. ✷
By using this coupling projection $p$, we see that the dynamics (4.1) can be written as
$$\mu_{t+1} = \tilde F(\mu_t, \boldsymbol{\alpha}_t, \varepsilon^0_{t+1}), \quad t \in \mathbb{N}, \qquad (4.3)$$
where the function $\tilde F : \mathcal{P}(\mathcal{X})\times\mathbb{A}\times E^0 \to \mathcal{P}(\mathcal{X})$, defined by
$$\tilde F(\mu, \mathfrak{a}, e^0) = F\big(\cdot,\cdot,p(\mu,\mathfrak{a}),\cdot,e^0\big)\star\big(p(\mu,\mathfrak{a})\otimes\lambda_\varepsilon\big),$$
is clearly measurable. Let us also define the measurable function $\tilde f : \mathcal{P}(\mathcal{X})\times\mathbb{A} \to \mathbb{R}$ by
$$\tilde f(\mu, \mathfrak{a}) = \int_{\mathcal{X}\times A} f\big(x, a, p(\mu,\mathfrak{a})\big)\, p(\mu,\mathfrak{a})(dx, da).$$
The MDP with characteristics $(\mathcal{P}(\mathcal{X}), \mathbb{A} = \mathcal{P}(\mathcal{X}\times A), \tilde F, \tilde f, \beta)$ is then well-posed. Let us then denote by $\mathbf{A}$ the set of $\mathbb{F}^0$-adapted processes valued in $\mathbb{A}$, and given an open-loop control $\nu \in \mathbf{A}$, consider the controlled dynamics
$$\mu_{t+1} = \tilde F(\mu_t, \nu_t, \varepsilon^0_{t+1}), \quad t \in \mathbb{N}, \qquad \mu_0 = \mu \in \mathcal{P}(\mathcal{X}), \qquad (4.4)$$
with associated expected gain/value function
$$\tilde V^\nu(\mu) = \mathbb{E}\Big[\sum_{t\in\mathbb{N}} \beta^t \tilde f(\mu_t, \nu_t)\Big], \qquad \tilde V(\mu) = \sup_{\nu\in\mathbf{A}} \tilde V^\nu(\mu). \qquad (4.5)$$
Given $\xi \in L^0(\mathcal{G};\mathcal{X})$ and $\alpha \in \mathcal{A}$, we set $\boldsymbol{\alpha} = \mathcal{L}^0_\xi(\alpha)$, where $\mathcal{L}^0_\xi$ is the lifted operator
$$\mathcal{L}^0_\xi : \mathcal{A} \longrightarrow \mathbf{A}, \qquad \alpha = (\alpha_t)_t \longmapsto \boldsymbol{\alpha} = (\boldsymbol{\alpha}_t)_t : \boldsymbol{\alpha}_t = \mathbb{P}^0_{(X_t^{\xi,\alpha},\, \alpha_t)}, \quad t \in \mathbb{N}.$$
By construction from (4.1), we see that $\mu_t = \mathbb{P}^0_{X_t^{\xi,\alpha}}$, $t \in \mathbb{N}$, follows the dynamics (4.4) with the control $\nu = \mathcal{L}^0_\xi(\alpha) \in \mathbf{A}$. Moreover, by the law of iterated conditional expectation and the definition of $\tilde f$, the expected gain of the CMKV-MDP can be written as
$$V^\alpha(\xi) = \mathbb{E}\Big[\sum_{t\in\mathbb{N}} \beta^t\, \mathbb{E}^0\big[f\big(X_t^{\xi,\alpha}, \alpha_t, \mathbb{P}^0_{(X_t^{\xi,\alpha},\alpha_t)}\big)\big]\Big] = \mathbb{E}\Big[\sum_{t\in\mathbb{N}} \beta^t \tilde f\big(\mathbb{P}^0_{X_t^{\xi,\alpha}}, \boldsymbol{\alpha}_t\big)\Big] = \tilde V^{\boldsymbol{\alpha}}(\mu), \quad \text{with } \mu = \mathcal{L}(\xi). \qquad (4.6)$$
It follows that $V(\xi) \le \tilde V(\mu)$, for $\mu = \mathcal{L}(\xi)$. Our goal is to prove the equality, which implies in particular that $V$ is law-invariant, and to obtain as a by-product the corresponding Bellman fixed point equation that characterizes analytically the solution to the CMKV-MDP.
4.2 Bellman fixed point on $\mathcal{P}(\mathcal{X})$

We derive and study the Bellman equation corresponding to the general lifted MDP (4.4)-(4.5) on $\mathcal{P}(\mathcal{X})$. By defining this MDP on the canonical space $(E^0)^{\mathbb{N}}$, we identify $\varepsilon^0$ with the canonical identity function on $(E^0)^{\mathbb{N}}$, and $\varepsilon^0_t$ with the $t$-th projection on $(E^0)^{\mathbb{N}}$. We also denote by $\theta : (E^0)^{\mathbb{N}} \to (E^0)^{\mathbb{N}}$ the shifting operator, defined by $\theta((e^0_t)_{t\in\mathbb{N}}) = (e^0_{t+1})_{t\in\mathbb{N}}$. Via this identification, an open-loop control $\nu \in \mathbf{A}$ is a sequence $(\nu_t)_t$ where $\nu_t$ is a measurable function from $(E^0)^t$ into $\mathbb{A}$, with the convention that $\nu_0$ is simply a constant in $\mathbb{A}$. Given $\nu \in \mathbf{A}$ and $e^0 \in E^0$, we define $\vec\nu^{\,e^0} := (\vec\nu^{\,e^0}_t)_t \in \mathbf{A}$, where $\vec\nu^{\,e^0}_t(\cdot) := \nu_{t+1}(e^0, \cdot)$, $t \in \mathbb{N}$. Given $\mu \in \mathcal{P}(\mathcal{X})$ and $\nu \in \mathbf{A}$, we denote by $(\mu_t^{\mu,\nu})_t$ the solution to (4.4) on the canonical space, which satisfies the flow property
$$\big(\mu_{t+1}^{\mu,\nu},\ \nu_{t+1}\big) \;\equiv\; \Big(\mu_t^{\mu_1^{\mu,\nu},\,\vec\nu^{\,\varepsilon^0_1}}\big(\theta(\varepsilon^0)\big),\ \vec\nu^{\,\varepsilon^0_1}_t\big(\theta(\varepsilon^0)\big)\Big), \quad t \in \mathbb{N}.$$
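The flow property simply says that running the controlled recursion $t+1$ steps from $\mu$ coincides with running $t$ steps from $\mu_1$ with the shifted control and noise sequences. A scalar toy sketch (with a hypothetical placeholder for the measure-valued dynamics $\tilde F$; all names and numbers are ours):

```python
# Scalar toy sketch of the flow property of a controlled recursion.
def F_tilde(mu, a, e):
    """Hypothetical scalar stand-in for the measure-valued dynamics."""
    return 0.5 * mu + 0.3 * a + 0.2 * e

def run(mu, controls, noises):
    """Iterate the recursion mu_{t+1} = F_tilde(mu_t, nu_t, eps_t)."""
    for a, e in zip(controls, noises):
        mu = F_tilde(mu, a, e)
    return mu

mu0 = 0.4
controls = [0.1, 0.7, 0.2, 0.9]   # stands for (nu_t)_t
noises = [0.3, 0.8, 0.5, 0.1]     # stands for (eps^0_t)_t
full = run(mu0, controls, noises)
# One step, then run with the theta-shifted control/noise sequences:
mu1 = F_tilde(mu0, controls[0], noises[0])
shifted = run(mu1, controls[1:], noises[1:])
print(full, shifted)  # equal
```

This identity is what allows the value function of the lifted MDP to satisfy a time-homogeneous Bellman fixed point equation.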