Mean-Field Langevin Dynamics and Neural Networks

Zhenjie Ren

CEREMADE, Université Paris-Dauphine
Joint works with Giovanni Conforti, Kaitong Hu, Anna Kazeykina, David Siska, Lukasz Szpruch, Xiaolu Tan, Junjian Yang

MATH-IMS Joint Colloquium Series August 28, 2020

Classical Langevin dynamics and non-convex optimization

The Langevin dynamics was first introduced in statistical physics to describe the motion of a particle with position X and velocity V in a potential field ∇_x f, subject to damping and random collisions.

Overdamped Langevin dynamics:
\[ dX_t = -\nabla_x f(X_t)\,dt + \sigma\,dW_t \]
Underdamped Langevin dynamics:
\[ dX_t = V_t\,dt, \qquad dV_t = \big(-\nabla_x f(X_t) - \gamma V_t\big)\,dt + \sigma\,dW_t \]

Under mild conditions, the two Markov diffusions admit unique invariant measures whose densities read:
\[ m^*(x) = C\,e^{-\frac{2}{\sigma^2} f(x)} \;\to\; \delta_{\arg\min f} \quad \text{(overdamped)}, \qquad m^*(x,v) = C\,e^{-\frac{2}{\sigma^2}\big(f(x) + \frac{1}{2}|v|^2\big)} \;\to\; \delta_{(\arg\min f,\,0)} \quad \text{(underdamped)}, \]
where the convergence holds as σ ↓ 0. In particular, f does NOT need to be convex.
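
As a quick numerical illustration of these two dynamics, here is a minimal Euler-Maruyama sketch in Python; the double-well potential f(x) = (x^2 - 1)^2, the step size, the noise level and the horizon are all illustrative choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):            # f(x) = (x^2 - 1)^2, a non-convex double well
    return 4.0 * x * (x**2 - 1.0)

sigma, gamma, dt, n_steps = 1.0, 1.0, 1e-3, 200_000

# Overdamped: dX = -f'(X) dt + sigma dW
x = 0.0
xs = np.empty(n_steps)
for k in range(n_steps):
    x += -grad_f(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    xs[k] = x

# Underdamped: dX = V dt,  dV = (-f'(X) - gamma V) dt + sigma dW
x, v = 0.0, 0.0
xs_u = np.empty(n_steps)
for k in range(n_steps):
    x += v * dt
    v += (-grad_f(x) - gamma * v) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    xs_u[k] = x

# Both chains visit both wells even though f is not convex.
print("overdamped  mean/std after burn-in:", xs[50_000:].mean(), xs[50_000:].std())
print("underdamped mean/std after burn-in:", xs_u[50_000:].mean(), xs_u[50_000:].std())
```

The histogram of the overdamped chain should approach the density proportional to e^{-2f(x)/σ²} stated above.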

Relation with classical algorithms

If we overlook the Brownian noise, then
  the overdamped process  ⇒  the gradient descent algorithm,
  the underdamped process ⇒  the Hamiltonian gradient descent algorithm,
but their convergence to the minimizer is ensured only for a convex potential function f.

Taking into account the Brownian noise with constant σ, we may produce samplings of the invariant measures:
  the overdamped Langevin  ⇔  MCMC,
  the underdamped Langevin ⇔  Hamiltonian MCMC.
The convergence rate of MCMC algorithms is in general dimension dependent! One may diminish σ ↓ 0 along the simulation ⇒ simulated annealing, as in the sketch below.
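
A minimal sketch of this contrast between plain gradient descent and Langevin with a decreasing noise level (simulated annealing); the tilted double-well potential, the cooling schedule for σ and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f(x):                       # double well with a deeper minimum near x = -1
    return 4.0 * x * (x**2 - 1.0) + 0.3

x_gd, x_sa = 1.0, 1.0                # both chains started in the shallow well
dt = 1e-3
for k in range(200_000):
    x_gd += -grad_f(x_gd) * dt                        # plain gradient descent
    sigma_k = 1.0 / np.sqrt(1.0 + k * dt)             # sigma decreased along the run
    x_sa += -grad_f(x_sa) * dt + sigma_k * np.sqrt(dt) * rng.standard_normal()

print("gradient descent ends near", round(x_gd, 2))   # stuck in the local minimum
print("annealed Langevin ends near", round(x_sa, 2))  # typically reaches the global one
```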

Deep neural networks

The deep neural networks have won, and continue to gain, impressive success in various applications. Mathematically speaking, we may approximate a given function f with the parametrized function
\[ f(z) \approx \varphi_n \circ \cdots \circ \varphi_1(z), \qquad \text{where } \varphi_i(z) := \sum_{k=1}^{n_i} c^i_k\,\varphi\big(A^i_k z + b^i_k\big), \]
and φ is a given non-constant, bounded, continuous activation function. The expressiveness of the neural network is ensured by the universal representation theorem. However, the efficiency of such over-parametrized, non-convex optimization is still a mystery for mathematical analysis. It is natural to study this problem using mean-field Langevin equations.
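
To fix ideas, a small numpy sketch of this parametrization, with tanh as the bounded activation; the widths and the random parameters are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_layer(n_units, d_in, d_out):
    """One layer  phi_i(z) = sum_k c_k * phi(A_k z + b_k)  with phi = tanh."""
    A = rng.standard_normal((n_units, d_in))
    b = rng.standard_normal(n_units)
    c = rng.standard_normal((n_units, d_out)) / n_units
    return lambda z: np.tanh(A @ z + b) @ c

d = 3
layers = [make_layer(50, d, d), make_layer(50, d, d), make_layer(50, d, 1)]

def network(z):
    for layer in layers:            # the composition phi_n o ... o phi_1
        z = layer(z)
    return z

print(network(np.ones(d)))          # a scalar output for a 3-dimensional input
```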

Two-layer Network and Mean-field Langevin Equation  Table of Contents

1 Two-layer Network and Mean-field Langevin Equation

2 Application to GAN

3 Deep neural network and MFL system

4 Game on random environment

Two-layer Network and Mean-field Langevin Equation  Two-layer neural network

In the work with K. Hu, D. Siska, L. Szpruch '19, we focused on the two-layer network, and aimed at minimizing
\[ \inf_{n,\,(c_k,A_k,b_k)_k} \mathbb{E}\Big[ \Big| f(Z) - \frac{1}{n}\sum_{k=1}^{n} c_k\,\varphi(A_k Z + b_k) \Big|^2 \Big], \]
where Z represents the data and E is the expectation under the law of the data. In the mean-field reformulation over the law ν = Law(C, A, B) of the parameters, this reads
\[ \inf_{\nu = \mathrm{Law}(C,A,B)} F(\nu), \qquad \text{where } F(\nu) := \mathbb{E}\Big[ \big| f(Z) - \mathbb{E}^{\nu}\big[C\,\varphi(AZ + B)\big] \big|^2 \Big]. \]
Note that F is convex in ν. Take Ent(·), the relative entropy w.r.t. the Lebesgue measure, as a regularizer, and note that Ent(·) is strictly convex; we therefore consider
\[ \inf_{\nu}\; F(\nu) + \frac{\sigma^2}{2}\,\mathrm{Ent}(\nu). \]
How to characterize the minimizer of a function of probability measures?
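
A minimal sketch of the un-regularized objective F(ν) with ν replaced by the empirical measure of n neurons, so that E^ν[·] is a particle average; the target f, the data law and the width are toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 500                                   # number of neurons = particles (C_k, A_k, b_k)
C = rng.standard_normal(n)
A = rng.standard_normal(n)
b = rng.standard_normal(n)

def f_target(z):
    return np.sin(3.0 * z)

def F_hat(C, A, b, z_batch):
    """F(nu) = E|f(Z) - E^nu[C phi(AZ+B)]|^2, with nu = empirical law of the n particles."""
    pred = np.tanh(np.outer(z_batch, A) + b) @ C / n
    return np.mean((f_target(z_batch) - pred) ** 2)

z = rng.standard_normal(2000)             # data Z ~ N(0, 1)
print("risk of the random two-layer network:", F_hat(C, A, b, z))
```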

Two-layer Network and Mean-field Langevin Equation  Derivatives of functions of probabilities

Let F : P(R^d) → R. Denote its derivative by δF/δm : P(R^d) × R^d → R:
given m, m', denote m^λ := (1 − λ)m + λm'; we have
\[ F(m^{\varepsilon}) - F(m) = \int_0^{\varepsilon}\!\int_{\mathbb{R}^d} \frac{\delta F}{\delta m}(m^{\lambda}, x)\,(m' - m)(dx)\,d\lambda. \]
E.g. (a) F(m) := E^m[φ(X)], then δF/δm(m, x) = φ(x);
     (b) F(m) := g(E^m[φ(X)]), then δF/δm(m, x) = ġ(E^m[φ(X)]) φ(x).
If we further assume F is convex, we have εF(m') + (1 − ε)F(m) ≥ F(m^ε), i.e. ε(F(m') − F(m)) ≥ F(m^ε) − F(m), hence
\[ F(m') - F(m) \;\ge\; \frac{1}{\varepsilon}\int_0^{\varepsilon}\!\int_{\mathbb{R}^d} \frac{\delta F}{\delta m}(m^{\lambda}, x)\,(m' - m)(dx)\,d\lambda \;\xrightarrow{\;\varepsilon \downarrow 0\;}\; \int_{\mathbb{R}^d} \frac{\delta F}{\delta m}(m, x)\,(m' - m)(dx). \]
Therefore, a sufficient condition for m being a minimizer would be
\[ \frac{\delta F}{\delta m}(m, x) = C \ \text{for all } x, \qquad \text{equivalently} \qquad D_m F(m, x) := \nabla_x \frac{\delta F}{\delta m}(m, x) = 0 \ \text{for all } x, \]
where D_m F denotes the intrinsic derivative.
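
A quick numerical sanity check of example (b), comparing the finite difference (F(m^ε) - F(m))/ε with the first-order term ∫ (δF/δm)(m, x)(m' - m)(dx) for two discrete measures; g, φ and the two sample clouds are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

phi = np.tanh
g   = lambda u: u ** 2            # F(m) = g(E^m[phi(X)])
g_d = lambda u: 2.0 * u           # its derivative g'

x  = rng.standard_normal(1000)        # support of m  (uniform weights)
xp = rng.standard_normal(1000) + 1.0  # support of m'

def F_mix(eps):
    """F(m^eps) for m^eps = (1 - eps) m + eps m'."""
    return g((1.0 - eps) * phi(x).mean() + eps * phi(xp).mean())

# flat derivative from example (b):  dF/dm(m, x) = g'(E^m[phi(X)]) * phi(x)
dFdm = lambda pts: g_d(phi(x).mean()) * phi(pts)

eps = 1e-4
lhs = (F_mix(eps) - F_mix(0.0)) / eps
rhs = dFdm(xp).mean() - dFdm(x).mean()     # integral of dF/dm against (m' - m)
print(f"finite difference: {lhs:.6f}   first-order term: {rhs:.6f}")
```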

Two-layer Network and Mean-field Langevin Equation  First order condition of minimizers

In the presence of the entropy regularizer, we can also prove that the first order equation is a necessary condition for being a minimizer.

Theorem (Hu, R., Siska, Szpruch, '19)
Under mild conditions, if m^* = \arg\min_m \big( F(m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m) \big), then
\[ D_m F(m^*, x) + \frac{\sigma^2}{2}\,\nabla \ln m^*(x) = 0, \quad \text{for all } x. \qquad (1) \]
Conversely, if F is convex, (1) implies that m^* is the minimizer.

Note that the density of m^* satisfies
\[ m^*(x) = C\,e^{-\frac{2}{\sigma^2}\,\frac{\delta F}{\delta m}(m^*, x)}. \]

Two-layer Network and Mean-field Langevin Equation  Link to Overdamped Mean-field Langevin equation

The first order equation has a clear link to a Fokker-Planck equation: writing ∇ ln m^* = ∇m^*/m^*, multiplying by m^* and taking the divergence,
\[ D_m F(m^*, x) + \frac{\sigma^2}{2}\,\nabla \ln m^*(x) = 0 \;\Longrightarrow\; \nabla\cdot\Big( D_m F(m^*, x)\,m^*(x) + \frac{\sigma^2}{2}\,\nabla m^*(x) \Big) = 0. \]

Theorem (Hu, R., Siska, Szpruch, '19)
Under mild conditions, if m^* = \arg\min_m \big( F(m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m) \big), then m^* is a stationary solution to the Fokker-Planck equation
\[ \partial_t m = \nabla\cdot\Big( D_m F(m, \cdot)\,m + \frac{\sigma^2}{2}\,\nabla m \Big). \qquad (2) \]

It is well-known that the equation (2) characterizes the marginal law of the mean-field Langevin (MFL) dynamics:
\[ dX_t = -D_m F(m_t, X_t)\,dt + \sigma\,dW_t, \qquad m_t = \mathrm{Law}(X_t). \]
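
In practice the MFL dynamics can be simulated by an interacting particle system, replacing m_t by the empirical measure of the particles; for the two-layer regression objective above this amounts to noisy gradient descent on the neurons. A rough sketch with a toy target, tanh activation and illustrative step sizes; D_mF is evaluated at the empirical measure by Monte-Carlo over a mini-batch.

```python
import numpy as np

rng = np.random.default_rng(5)

n, sigma, dt, n_steps = 300, 0.1, 0.05, 3000
C = rng.standard_normal(n); A = rng.standard_normal(n); B = rng.standard_normal(n)
f_target = lambda z: np.sin(3.0 * z)

def dmF(C, A, B, z):
    """D_m F(m, (c, a, b)) at the empirical measure m of the particles, Monte-Carlo in Z."""
    act  = np.tanh(np.outer(z, A) + B)          # (batch, n)
    res  = f_target(z) - act @ C / n            # residual f(Z) - E^m[C phi(AZ + B)]
    dact = 1.0 - act ** 2                       # tanh'
    gC = -2.0 * np.mean(res[:, None] * act, axis=0)
    gA = -2.0 * np.mean(res[:, None] * C * dact * z[:, None], axis=0)
    gB = -2.0 * np.mean(res[:, None] * C * dact, axis=0)
    return gC, gA, gB

for _ in range(n_steps):
    z = rng.standard_normal(256)                            # fresh mini-batch of data
    for P, g in zip((C, A, B), dmF(C, A, B, z)):            # X_{t+dt} = X_t - D_mF dt + sigma dW
        P += -g * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)

z = rng.standard_normal(2000)
pred = np.tanh(np.outer(z, A) + B) @ C / n
print("final risk (Monte-Carlo):", np.mean((f_target(z) - pred) ** 2))
```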

Two-layer Network and Mean-field Langevin Equation  Link to Underdamped Mean-field Langevin equation

Different from the above, introduce the velocity variable V and consider the minimization
\[ \inf_{m = \mathrm{Law}(X,V)}\; F(m^X) + \frac{1}{2}\,\mathbb{E}^m\big[|V|^2\big] + \frac{\sigma^2}{2\gamma}\,\mathrm{Ent}(m). \]
The first order condition reads
\[ D_m F(m^X, x) + \frac{\sigma^2}{2\gamma}\,\nabla_x \ln m(x,v) = 0 \qquad \text{and} \qquad v + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m(x,v) = 0. \]
One can again directly verify that the minimizer is an invariant measure of the underdamped MFL equation:
\[ dX_t = V_t\,dt, \qquad dV_t = \big(-D_m F(m^X_t, X_t) - \gamma V_t\big)\,dt + \sigma\,dW_t. \]

Two-layer Network and Mean-field Langevin Equation  Difficulties: invariant measure of MFL

For the mean-field diffusion, the existence and uniqueness of the invariant measures is non-trivial, and the convergence of the marginal laws towards the invariant measure, if it exists, is one of the long-standing problems in probability.

A simple example: the mean-field Ornstein-Uhlenbeck process
\[ dX_t = \big(\alpha\,\mathbb{E}[X_t] - X_t\big)\,dt + dW_t: \]
  α < 1  ⇒  there exists a unique invariant measure;
  α > 1  ⇒  no invariant measure;
  α = 1  ⇒  there exist multiple invariant measures.

Given a convex F, the existence and uniqueness of the invariant measure of the MFL is due to that of the minimizer m^*, thanks to the first order condition. It remains to study the convergence of the marginal laws to the invariant measure.
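
A small particle simulation of the mean-field Ornstein-Uhlenbeck example, approximating E[X_t] by the empirical mean of the cloud; the particle number, step size and initial condition are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate(alpha, n=5000, dt=1e-2, n_steps=2000):
    x = rng.standard_normal(n) + 2.0           # start the cloud away from 0
    for _ in range(n_steps):
        mean = x.mean()                        # E[X_t] approximated by the particle average
        x += (alpha * mean - x) * dt + np.sqrt(dt) * rng.standard_normal(n)
    return x.mean(), x.std()

for alpha in (0.5, 1.0, 1.5):
    m, s = simulate(alpha)
    print(f"alpha = {alpha}:  mean of the cloud ~ {m:10.2f},  std ~ {s:.2f}")
```

For alpha = 0.5 the cloud settles around 0, for alpha = 1.5 the mean blows up, and for alpha = 1 the mean stays near its initial value, in line with the trichotomy above.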

Two-layer Network and Mean-field Langevin Equation  Gradient flow and its analog

Define the energy functions for both the overdamped and the underdamped cases:
\[ U(m) = F(m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m), \qquad \hat U(m) = F(m^X) + \frac{1}{2}\,\mathbb{E}^m\big[|V|^2\big] + \frac{\sigma^2}{2\gamma}\,\mathrm{Ent}(m). \]
For the convergence towards the invariant measure, it is crucial to observe:

Theorem (Overdamped MFL, Hu, R., Siska, Szpruch, '19)
\[ dU(m_t) = -\mathbb{E}\Big[ \big| D_m F(m_t, X_t) + \tfrac{\sigma^2}{2}\,\nabla_x \ln m_t(X_t) \big|^2 \Big]\,dt \quad \text{for all } t > 0. \]

Theorem (Underdamped MFL, Kazeykina, R., Tan, Yang, '20)
\[ d\hat U(m_t) = -\gamma\,\mathbb{E}\Big[ \big| V_t + \tfrac{\sigma^2}{2\gamma}\,\nabla_v \ln m_t(X_t, V_t) \big|^2 \Big]\,dt \quad \text{for all } t > 0. \]

This is due to the generalized Itô calculus and the time-reversal of diffusions.

Two-layer Network and Mean-field Langevin Equation  Convergence for convex F

Overdamped: recalling dU(m_t) = -E[ |D_m F(m_t, X_t) + (σ²/2) ∇_x ln m_t(X_t)|² ] dt, heuristically U(m_t) decreases till m_t hits m^* such that
\[ D_m F(m^*, x) + \frac{\sigma^2}{2}\,\nabla_x \ln m^*(x) = 0, \qquad \text{i.e. } m^* = \arg\min_m U(m), \]
and m_t ≡ m^* afterwards.

Underdamped: recalling d\hat U(m_t) = -γ E[ |V_t + (σ²/2γ) ∇_v ln m_t(X_t, V_t)|² ] dt, similarly \hat U(m_t) shall decrease till m_t hits m^* such that
\[ v + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m^*(x, v) = 0. \]

Two-layer Network and Mean-field Langevin Equation  A bit more rigorously...

The intuition above can be materialized by LaSalle's invariance principle and the functional inequalities. Define the set of cluster points
\[ w(m_0) := \Big\{ m : \exists\,(m_{t_n})_{n\in\mathbb{N}} \text{ s.t. } \lim_{n\to\infty} W_1(m_{t_n}, m) = 0 \Big\}. \]
The invariance principle says:
\[ \mathrm{Law}(X_0) \in w(m_0) \;\Longrightarrow\; \mathrm{Law}(X_t) \in w(m_0) \ \text{for all } t > 0. \]
We can prove that
  Overdamped:  m^* ∈ w(m_0) ⇒ D_m F(m^*, x) + (σ²/2) ∇_x ln m^*(x) = 0, so m^* = \arg\min_m U(m); hence w(m_0) = \{\arg\min_m U(m)\} and m_t → \arg\min_m U(m) in W_1;
  Underdamped: m^* ∈ w(m_0) ⇒ v + (σ²/2γ) ∇_v ln m^*(x, v) = 0, i.e. m^*(x, v) = g(x)\,e^{-\frac{\gamma}{\sigma^2}|v|^2}.

Two-layer Network and Mean-field Langevin Equation  Complete the proof for Underdamped MFL

Recall that
\[ m^* \in w(m_0) \;\Longrightarrow\; v + \frac{\sigma^2}{2\gamma}\,\nabla_v \ln m^*(x, v) = 0 \;\Longrightarrow\; m^*(x, v) = g(x)\,e^{-\frac{\gamma}{\sigma^2}|v|^2}. \]
Consider any smooth function h with compact support, and let Law(X_0) ∈ w(m_0). By Itô's formula,
\[ d\big(V_t h(X_t)\big) = \Big( \dot h(X_t)\cdot V_t\,V_t + h(X_t)\big( -D_m F(m^X_t, X_t) - \gamma V_t \big) \Big)\,dt + dM_t, \]
and taking expectations, using the Gaussian form of the v-marginal above and integrating by parts in x,
\[ 0 = \mathbb{E}\Big[ \frac{\sigma^2}{2\gamma}\,\dot h(X_t) - h(X_t)\,D_m F(m^X_t, X_t) \Big] = \mathbb{E}\Big[ h(X_t)\Big( -\frac{\sigma^2}{2\gamma}\,\nabla_x \ln m_t(X_t, V_t) - D_m F(m^X_t, X_t) \Big) \Big]. \]
Since h is arbitrary,
\[ \frac{\sigma^2}{2\gamma}\,\nabla_x \ln m_t(X_t, V_t) + D_m F(m^X_t, X_t) = 0 \;\Longrightarrow\; m_t \equiv \arg\min_m \hat U(m) \;\Longrightarrow\; w(m_0) = \{\arg\min_m \hat U(m)\}. \]

Two-layer Network and Mean-field Langevin Equation  Convergence rate for special case

For possibly non-convex F such that D_m F(m, x) bears a small mean-field dependence, we can prove contraction results using synchronous-reflection couplings.

Theorem (Overdamped MFL, Hu, R., Siska, Szpruch, '19)
\[ W_1(m_t, m'_t) \le C\,e^{-\lambda t}\,W_1(m_0, m'_0). \]

Theorem (Underdamped MFL, Kazeykina, R., Tan, Yang, '20)
\[ W_\psi(m_t, m'_t) \le C\,e^{-\lambda t}\,W_\psi(m_0, m'_0), \qquad \text{with the semi-metric } W_\psi(m, m') = \inf\Big\{ \int \psi\big((x,v),(x',v')\big)\,\pi(dx, dy) : \pi \text{ is a coupling of } m, m' \Big\}. \]

Regretfully, the small mean-field dependence assumption corresponds to the over-regularized problem in the context of neural networks.
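
A rough numerical illustration of the synchronous-coupling mechanism behind such contraction estimates, on the simple mean-field Ornstein-Uhlenbeck drift with a small mean-field coefficient alpha rather than on the neural-network objective itself; in one dimension W_1 between two particle clouds can be computed by sorting. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

n, dt, sigma, alpha = 2000, 1e-2, 1.0, 0.3     # alpha = strength of the mean-field term
x  = rng.standard_normal(n) - 3.0              # two different initial clouds
xp = rng.standard_normal(n) + 3.0

def w1(a, b):                                   # W1 in one dimension = sorted L1 distance
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

for k in range(600):
    dw = np.sqrt(dt) * rng.standard_normal(n)  # synchronous coupling: same noise for both
    x  += (alpha * x.mean()  - x ) * dt + sigma * dw
    xp += (alpha * xp.mean() - xp) * dt + sigma * dw
    if k % 100 == 0:
        print(f"t = {k * dt:4.1f}   W1 ~ {w1(x, xp):.4f}")
```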

Application to GAN  Table of Contents

1 Two-layer Network and Mean-field Langevin Equation

2 Application to GAN

3 Deep neural network and MFL system

4 Game on random environment

Application to GAN  GAN and zero-sum game

The Generative Adversarial Network (GAN) aims at sampling a target probability measure μ̂ ∈ P(R^{n_1}) which is only empirically known. Taking the Wasserstein distance as an example, we aim at sampling μ̂ by
\[ \min_{\mu} W_1(\mu, \hat\mu) = \min_{\mu}\,\sup_{f \in \mathrm{Lip}_1} \int f(x)\,(\mu - \hat\mu)(dx) \;\approx\; \min_{\mu}\,\sup_{f \in E} \int f(x)\,(\mu - \hat\mu)(dx), \qquad \text{where } E = \Big\{ z \mapsto \mathbb{E}^m[\varphi(X, z)] : X \sim m \in \mathcal{P}(\mathbb{R}^{n_2}) \Big\}. \]
GAN can be viewed as a zero-sum game between the generator and the discriminator:
\[ \text{Gen.:} \quad \inf_{\mu \in \mathcal{P}(\mathbb{R}^{n_1})} \int \mathbb{E}^m[\varphi(X, z)]\,(\mu - \hat\mu)(dz) + \frac{\sigma^2}{2}\big(\mathrm{Ent}(\mu) - \mathrm{Ent}(m)\big), \]
\[ \text{Discr.:} \quad \inf_{m \in \mathcal{P}(\mathbb{R}^{n_2})} -\int \mathbb{E}^m[\varphi(X, z)]\,(\mu - \hat\mu)(dz) + \frac{\sigma^2}{2}\big(\mathrm{Ent}(m) - \mathrm{Ent}(\mu)\big). \]
In particular, the maps μ ↦ F(μ, m) and m ↦ F(μ, m), where F(μ, m) := \int \mathbb{E}^m[\varphi(X, z)]\,(\mu - \hat\mu)(dz), are linear.

Application to GAN  The feedback of the generator

Due to the linearity, the solution to the generator's problem given m (the choice of the discriminator) is explicit:
\[ \mu^*[m](z) = C(m)^{-1}\,e^{-\frac{2}{\sigma^2}\,\mathbb{E}^m[\varphi(X, z)]}, \]
where C(m) is the normalization constant. Therefore the value of the game can be rewritten as
\[ \min_m \max_\mu\; -F(\mu, m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m) - \frac{\sigma^2}{2}\,\mathrm{Ent}(\mu) \;=\; \min_m\; G(m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m), \qquad \text{where } G(m) := -F(\mu^*[m], m) - \frac{\sigma^2}{2}\,\mathrm{Ent}(\mu^*[m]) \ \text{is convex.} \]
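
A sketch of this explicit best response computed on a grid from a particle approximation of the discriminator law m; the feature φ and all sizes are toy choices.

```python
import numpy as np

rng = np.random.default_rng(8)

sigma = 1.0
X = rng.standard_normal(400)                     # particles approximating the discriminator law m
phi = lambda x, z: np.cos(x * z)                 # toy feature phi(x, z)

z_grid = np.linspace(-5.0, 5.0, 1001)
dz = z_grid[1] - z_grid[0]
pot = np.array([phi(X, z).mean() for z in z_grid])     # E^m[phi(X, z)] on the grid

dens = np.exp(-2.0 / sigma ** 2 * pot)                 # mu*[m](z) up to normalization
dens /= dens.sum() * dz                                # normalization constant C(m)

print("mu*[m] integrates to", dens.sum() * dz)         # ~ 1.0 by construction
```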

Application to GAN  A convergent algorithm

Therefore the choice of the discriminator at the equilibrium is the invariant measure of the MFL dynamics
\[ dX_t = -D_m G(m_t, X_t)\,dt + \sigma\,dW_t, \]
and the intrinsic derivative can be computed explicitly. Indeed,
\[ G(m) = -\int \mathbb{E}^m[\varphi(X, z)]\,\big(\mu^*[m] - \hat\mu\big)(dz) - \frac{\sigma^2}{2}\,\mathrm{Ent}(\mu^*[m]) = \int \mathbb{E}^m[\varphi(X, z)]\,\hat\mu(dz) + \frac{\sigma^2}{2}\,\ln C(m). \]
Recall C(m) = \int e^{-\frac{2}{\sigma^2}\,\mathbb{E}^m[\varphi(X, z)]}\,dz, and thus
\[ \frac{\delta G}{\delta m}(m, x) = \int \varphi(x, z)\,\hat\mu(dz) - \frac{1}{C(m)}\int e^{-\frac{2}{\sigma^2}\,\mathbb{E}^m[\varphi(X, z)]}\,\varphi(x, z)\,dz = \int \varphi(x, z)\,\big(\hat\mu - \mu^*[m]\big)(dz), \]
so that
\[ D_m G(m, x) = \nabla_x\!\int \varphi(x, z)\,\big(\hat\mu - \mu^*[m]\big)(dz). \]
Finally note that μ^*[m] can be sampled by MCMC.
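
Putting the pieces together, a minimal one-dimensional sketch of the resulting algorithm: the discriminator particles follow the overdamped MFL dynamics with drift -D_mG(m_t, ·), while the generator best response μ*[m_t] is recomputed on a grid at every step (instead of by MCMC). The target μ̂ = exp(1) mimics the toy example of the next slide; the feature φ((a, b, c), z) = c tanh(az + b), the step sizes and the particle numbers are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(9)

sigma, dt, n_steps = 1.0, 5e-3, 1000
n_disc = 100                                    # discriminator particles x = (a, b, c)
a = rng.standard_normal(n_disc)
b = rng.standard_normal(n_disc)
c = rng.standard_normal(n_disc)

mu_hat = rng.exponential(1.0, size=1000)        # samples of the target law exp(1)
z_grid = np.linspace(-2.0, 10.0, 401)           # grid carrying the generator density
dz = z_grid[1] - z_grid[0]

def weighted_grads(z, w):
    """Weighted averages of grad_x phi(x, z) for phi((a, b, c), z) = c * tanh(a z + b)."""
    t = np.tanh(np.outer(z, a) + b)             # (len(z), n_disc)
    s = (1.0 - t ** 2) * c                      # c * tanh'
    ga = (w[:, None] * s * z[:, None]).sum(axis=0)
    gb = (w[:, None] * s).sum(axis=0)
    gc = (w[:, None] * t).sum(axis=0)
    return ga, gb, gc

w_hat = np.full(mu_hat.size, 1.0 / mu_hat.size)
for _ in range(n_steps):
    # generator best response on the grid:  mu*[m](z) ~ exp(-2/sigma^2 E^m[phi(X, z)])
    disc = np.tanh(np.outer(z_grid, a) + b) @ c / n_disc
    dens = np.exp(-2.0 / sigma ** 2 * disc)
    dens /= dens.sum() * dz

    # particle drift:  -D_m G(m, x) = -int grad_x phi(x, z) (mu_hat - mu*[m])(dz)
    g_hat = weighted_grads(mu_hat, w_hat)
    g_gen = weighted_grads(z_grid, dens * dz)
    noise = sigma * np.sqrt(dt)
    for p, gh, gg in zip((a, b, c), g_hat, g_gen):
        p += -(gh - gg) * dt + noise * rng.standard_normal(n_disc)

print("mean of the trained generator:", (dens * z_grid).sum() * dz)   # exp(1) has mean 1
```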

Application to GAN  A toy example

Here we sample the law μ̂ = exp(1) with μ_0 = N(0, 1).

Figure: potential function value.
Figure: histogram.

Application to GAN  Toy example with underdamped MFL

Similarly, we can train the discriminator by the underdamped MFL dynamics
\[ dX_t = V_t\,dt, \qquad dV_t = \big( -D_m G(m^X_t, X_t) - \gamma V_t \big)\,dt + \sigma\,dW_t. \]

Figure: energy value.
Figure: histogram.

Deep neural network and MFL system  Table of Contents

1 Two-layer Network and Mean-field Langevin Equation

2 Application to GAN

3 Deep neural network and MFL system

4 Game on random environment

Deep neural network and MFL system  Optimization on random environment/with marginal constraint

More recently, with G. Conforti and A. Kazeykina, we discovered that the previous analysis can be generalized to optimization on a random environment. Consider the optimization over π ∈ P(R^d × Y):
\[ \min_{\pi:\ \pi_Y = m}\; F(\pi) + \frac{\sigma^2}{2}\,\mathrm{Ent}(\pi \,|\, \mathrm{Leb} \times m), \]
where m is a fixed law on the environment Y (a Polish space).

Deep neural network and MFL system  First order condition

It is crucial to observe that for convex F we have
\[ F(\pi') - F(\pi) \;\ge\; \int_{\mathbb{R}^d \times Y} \frac{\delta F}{\delta m}(\pi, x, y)\,(\pi' - \pi)(dx, dy). \]
Since π_Y = π'_Y = m, a sufficient condition for π to be a minimizer is that δF/δm(π, x, y) does NOT depend on x, i.e.
\[ \nabla_x \frac{\delta F}{\delta m}(\pi, x, y) = 0 \quad \text{for all } x, \ m\text{-a.s. } y. \]

Theorem (Conforti, Kazeykina, R., '20)
Under mild conditions, if π^* ∈ \arg\min_{\pi_Y = m} \big( F(\pi) + \frac{\sigma^2}{2}\,\mathrm{Ent}(\pi \,|\, \mathrm{Leb} \times m) \big) and we write π^*(dx, dy) = π^*(x|y)\,dx\,m(dy), then
\[ \nabla_x \frac{\delta F}{\delta m}(\pi^*, x, y) + \frac{\sigma^2}{2}\,\nabla_x \ln \pi^*(x|y) = 0, \quad \text{for all } x, \ m\text{-a.s. } y. \qquad (3) \]
Conversely, if F is convex, (3) implies that π^* is the minimizer.

Deep neural network and MFL system  Minimizer and invariant measure of MFL system

Write V^σ(π) := F(π) + (σ²/2) Ent(π | Leb × m) for the regularized objective. Due to the first order condition, the minimizer of V^σ is closely related to the invariant measure of the overdamped MFL system
\[ dX^y_t = -\nabla_x \frac{\delta F}{\delta m}\big(\pi_t, X^y_t, y\big)\,dt + \sigma\,dW_t, \qquad \pi_t = \mathrm{Law}(X_t, Y). \]
In particular, we know:
  if F is convex, π^* = \arg\min_{\pi_Y = m} V^σ(π) if and only if π^* is the invariant measure;
  for general F, if the MFL system has a unique invariant measure π^*, then π^* = \arg\min_{\pi_Y = m} V^σ(π).

Theorem (Conforti, Kazeykina, R., '20)
Under mild conditions, the MFL system admits a unique strong solution and
\[ dV^\sigma(\pi_t) = -\mathbb{E}\Big[ \big| \nabla_x \tfrac{\delta F}{\delta m}(\pi_t, X_t, Y) + \nabla_x \ln \pi_t(X_t | Y) \big|^2 \Big]\,dt, \quad \text{for } t > 0. \]

Zhenjie Ren (CEREMADE) MF Langevin 28/08/2020 26 / 37 • For possibly non-convex but with small MF-dependence F , we can prove the contraction result: Theorem (Conforti, Kazeykina, R., ’20) Under particular conditions, we have

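To make the dynamics concrete, here is a minimal particle sketch of this overdamped MFL system for a toy convex functional. The choice $F(\pi) = \tfrac12\,\mathbb{E}_\pi[|X - a(Y)|^2] + \tfrac12\,|\mathbb{E}_\pi[X]|^2$ with $a(y) = \sin(2\pi y)$, the uniform marginal $m$ and all numerical parameters are assumptions made only for this illustration; for this $F$ one has $\nabla_x \frac{\delta F}{\delta m}(\pi, x, y) = (x - a(y)) + \mathbb{E}_\pi[X]$, and the $Y$-marginal stays fixed because only the $x$-coordinates move.

    import numpy as np

    # Toy convex functional (an assumption for this sketch):
    #   F(pi) = 1/2 E_pi[|X - a(Y)|^2] + 1/2 |E_pi[X]|^2,
    # so grad_x dF/dm(pi, x, y) = (x - a(y)) + E_pi[X].
    def a(y):
        return np.sin(2 * np.pi * y)            # hypothetical target profile

    rng = np.random.default_rng(0)
    N, sigma, dt, n_steps = 2000, 0.5, 0.01, 5000

    # Fixed Y-marginal m = Uniform[0,1]; only the x-coordinates move.
    y = rng.uniform(0.0, 1.0, size=N)
    x = rng.normal(0.0, 1.0, size=N)

    for _ in range(n_steps):
        drift = -((x - a(y)) + x.mean())        # -grad_x dF/dm(pi_t, x, y); mean-field term via x.mean()
        x += drift * dt + sigma * np.sqrt(dt) * rng.normal(size=N)

    # Inspect the conditional means E[X | Y in bin]; for this F they should settle near a(y).
    bins = np.linspace(0.0, 1.0, 11)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y >= lo) & (y < hi)
        print(f"y in [{lo:.1f},{hi:.1f}):  E[X|Y] ~ {x[mask].mean():+.3f}")

With these choices each conditional law $\pi_t(\cdot|y)$ should settle, up to Gaussian fluctuations of variance $\sigma^2/2$, around $a(y)$, so the printed bin averages make the invariant measure visible.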

Deep neural network and MFL system Convergence towards the invariant measure

• In case $F$ is convex, $V^\sigma$ again serves as a Lyapunov function for the dynamical system $(\pi_t)$. Provided that $\mathcal{Y}$ is countable or equal to $\mathbb{R}^n$, we can show that $\pi_t$ converges to $\pi^*$ in $W_2$, based on LaSalle's invariance principle.

• For possibly non-convex $F$ with small mean-field dependence, we can prove the following contraction result.

Theorem (Conforti, Kazeykina, R., ’20)
Under particular conditions, we have
\[
W_1(\pi_t, \pi'_t) \le C e^{-\gamma t}\, W_1(\pi_0, \pi'_0), \qquad \text{where } W_1(\pi, \pi') = \int W_1\big(\pi(\cdot|y), \pi'(\cdot|y)\big)\, m(dy).
\]
The constant $\gamma$ can be computed, and once $\gamma > 0$ the MFL system has a unique invariant measure, equal to the minimizer of $V^\sigma$, towards which the marginal laws converge.

Deep neural network and MFL system Important example: Optimal control

Let $\mathcal{Y} = [0, T]$ and $m = \mathrm{Leb}_{[0,T]}$. Consider the relaxed optimal control problem
\[
\inf_{\pi_Y = m} \int_0^T\!\!\int L(y, S_y, x)\,\pi(x|y)\,dx\,dy + g(S_T) + \frac{\sigma^2}{2}\,\mathrm{Ent}(\pi|\mathrm{Leb}),
\qquad \text{where } S_y = S_0 + \int_0^y\!\!\int \varphi(u, S_u, x)\,\pi(x|u)\,dx\,du.
\]
Define the Hamiltonian function $H(y, s, x, p) = L(y, s, x) + p \cdot \varphi(y, s, x)$. We may compute $\frac{\delta F}{\delta m}(\pi, x, y) = H(y, S_y, x, P_y)$, where
\[
P_y = \nabla_s g(S_T) + \int_y^T\!\!\int \nabla_s H(u, S_u, x, P_u)\,\pi(x|u)\,dx\,du.
\]
The paper with K. Hu and A. Kazeykina ’19 was devoted to this example and connects it to the deep neural network.

Deep neural network and MFL system Deep neural network associated to relaxed controlled process

The Euler scheme introduces the forward propagation of a neural network:
\[
S_{t_{i+1}} \approx S_{t_i} + \frac{\delta t}{n_{t_{i+1}}} \sum_{j=1}^{n_{t_{i+1}}} \varphi\big(t_i, S_{t_i}, X^j_{t_{i+1}}, Z\big), \qquad \text{where } Z \text{ is the data.}
\]

[Figure: Neural network corresponding to the relaxed controlled process. The input $Z$ and initial state $S_0$ feed layer $t_1$; each layer $t_i$ carries neurons $X^1_{t_i}, \dots, X^n_{t_i}$ and passes the state $S_{t_i}$ forward; the output is $S_T$.]

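A minimal sketch of this forward pass; the parametrization $\varphi(t, s, x, z) = x_1 \tanh(x_2 s + x_3 z)$, in which each neuron $x = (x_1, x_2, x_3)$ packs its own weights, as well as the scalar data and the layer sizes, are hypothetical choices made only for illustration.

    import numpy as np

    # Hypothetical activation phi(t, s, x, z); each neuron x = (x1, x2, x3) packs its own weights.
    def phi(t, s, x, z):
        return x[..., 0] * np.tanh(x[..., 1] * s + x[..., 2] * z)

    rng = np.random.default_rng(1)
    N_layers, n_neurons = 10, 50                    # layers t_1, ..., t_N and neurons per layer
    dt = 1.0 / N_layers                             # delta t of the Euler scheme
    X = rng.normal(size=(N_layers, n_neurons, 3))   # X^j_{t_{i+1}}: the neurons of each layer
    z, s = 0.7, 0.0                                 # data Z and initial state S_0 (scalars here)

    # Forward propagation: S_{t_{i+1}} = S_{t_i} + dt * average_j phi(t_i, S_{t_i}, X^j_{t_{i+1}}, Z)
    for i in range(N_layers):
        s = s + dt * phi(i * dt, s, X[i], z).mean()

    print("output S_T =", s)

The average over the neurons of each layer is exactly the average pooling that characterizes the architecture on the next slide.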

Deep neural network and MFL system Mean-field Langevin system ≈ Backward propagation

The architecture of the network is characterized by the average pooling after each layer! The gradients of the parameters are easy to compute, due to the chain rule (or backward propagation):
\[
X^y_{s_{j+1}} = X^y_{s_j} - \delta s\,\mathbb{E}\big[\nabla_x H\big(y, S^y_{s_j}, X^y_{s_j}, P^y_{s_j}, Z\big)\big] + \sigma\,\delta W_{s_j}, \qquad \text{with } \delta s = s_{j+1} - s_j,
\]
\[
\text{where } P^{y_{i-1}}_s = P^{y_i}_s - \delta y \sum_{j=1}^{n_{y_i}} \nabla_s H\big(y_i, S^{y_i}_s, X^{y_i,j}_s, P^{y_i}_s, Z\big), \qquad P^T_s = \nabla_s g(S^T_s, Z),
\]
and $(\delta W_{s_j})$ are independent copies of $\mathcal{N}(0, \delta s)$. Clearly, it is a discretization of the MF Langevin system
\[
dX^y_s = -\mathbb{E}\big[\nabla_x H\big(y, S^y_s, X^y_s, P^y_s, Z\big)\big]\,ds + \sigma\,dW_s,
\qquad \text{where } dP^y_s = -\mathbb{E}\big[\nabla_s H\big(y, S^y_s, X^y_s, P^y_s, Z\big)\big]\,dy, \quad P^T_s = \nabla_s g(S^T_s, Z).
\]
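
Continuing the sketch above, here is one such noisy backward-propagation step, again under assumed ingredients: the same hypothetical $\varphi$, no running cost ($L = 0$), terminal cost $g(s, z) = \tfrac12 (s - z)^2$, so that $H = p\,\varphi$, and gradients of $H$ taken by finite differences purely to keep the sketch short.

    import numpy as np

    # Assumed toy specification: L = 0, g(s, z) = (s - z)^2 / 2,
    # phi(t, s, x, z) = x1 * tanh(x2*s + x3*z), hence H(y, s, x, p, z) = p * phi(y, s, x, z).
    def phi(t, s, x, z):
        return x[..., 0] * np.tanh(x[..., 1] * s + x[..., 2] * z)

    def H(t, s, x, p, z):
        return p * phi(t, s, x, z)

    rng = np.random.default_rng(2)
    N_layers, n_neurons, dt = 5, 20, 0.2
    sigma, ds, eps = 0.1, 0.05, 1e-5                # noise level, training step, finite-difference step
    X = rng.normal(size=(N_layers, n_neurons, 3))
    z_data = rng.uniform(-1.0, 1.0, size=16)        # a small batch of scalar data Z

    def forward(X, z):
        S = np.zeros(N_layers + 1)
        for i in range(N_layers):
            S[i + 1] = S[i] + dt * phi(i * dt, S[i], X[i], z).mean()
        return S

    def langevin_step(X):
        grad = np.zeros_like(X)
        for z in z_data:
            S = forward(X, z)
            P = S[-1] - z                           # P^T = grad_s g(S_T, Z) for g = (s - z)^2 / 2
            for i in reversed(range(N_layers)):
                # grad_x H of every neuron of layer i, by central finite differences
                for j in range(n_neurons):
                    for k in range(3):
                        xp, xm = X[i, j].copy(), X[i, j].copy()
                        xp[k] += eps
                        xm[k] -= eps
                        grad[i, j, k] += (H(i * dt, S[i], xp, P, z)
                                          - H(i * dt, S[i], xm, P, z)) / (2 * eps)
                # propagate the adjoint P backward through layer i (average pooling over the layer)
                dHds = (H(i * dt, S[i] + eps, X[i], P, z)
                        - H(i * dt, S[i] - eps, X[i], P, z)).mean() / (2 * eps)
                P = P + dt * dHds
        grad /= len(z_data)                         # E over the data Z
        return X - ds * grad + sigma * np.sqrt(ds) * rng.normal(size=X.shape)

    for _ in range(50):
        X = langevin_step(X)
    print("training loss:", np.mean([(forward(X, z)[-1] - z) ** 2 / 2 for z in z_data]))

The per-neuron drift is $\nabla_x H$ averaged over the data, as in the update above; the layer averages enter only through $S$ and $P$.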

Game on random environment Table of Contents

1 Two-layer Network and Mean-field Langevin Equation

2 Application to GAN

3 Deep neural network and MFL system

4 Game on random environment

Game on random environment Nash equilibrium

Consider the game in which the $i$-th player chooses a probability measure $\pi^i$ on $\mathbb{R}^{n_i} \times \mathcal{Y}$ as his strategy, so as to minimize his objective function $F^i(\pi^i, \pi^{-i})$, where $\pi^{-i}$ is the joint strategy of the other players on $\prod_{j \ne i} \mathbb{R}^{n_j} \times \mathcal{Y}$. We require that the marginal law of $\pi^i$ on $\mathcal{Y}$ equals the fixed law $m \in \mathcal{P}(\mathcal{Y})$.

$\pi^*$ is a Nash equilibrium if
\[
\pi^{*,i} \in \arg\min_{\pi^i :\, \pi^i_Y = m} F^i(\pi^i, \pi^{*,-i}) + \frac{\sigma^2}{2}\,\mathrm{Ent}(\pi^i \,|\, \mathrm{Leb} \times m), \qquad \forall i.
\]
Due to the previous first order condition we have:

Theorem (Conforti, Kazeykina, R., ’20)
If $\pi$ is a Nash equilibrium, then for $i = 1, \dots, n$,
\[
\nabla_{x^i} \frac{\delta F^i}{\delta \nu}(\pi^i, \pi^{-i}, x^i, y) + \frac{\sigma^2}{2}\,\nabla_{x^i} \ln \pi^i(x^i|y) = 0 \qquad \text{for all } x^i \in \mathbb{R}^{n_i}, \ m\text{-a.s. } y \in \mathcal{Y}.
\]
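
As in the single-player case, integrating this condition in $x^i$ gives an implicit Gibbs form for each player (a direct consequence, recorded here only for intuition):
\[
\pi^i(x^i|y) = \frac{1}{Z^i(y)}\,\exp\Big(-\frac{2}{\sigma^2}\,\frac{\delta F^i}{\delta \nu}(\pi^i, \pi^{-i}, x^i, y)\Big).
\]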

Game on random environment Uniqueness: Monotonicity condition

Theorem (Conforti, Kazeykina, R., ’20)
Denote $\bar{x} = (x, y)$. The functions $(F^i)_{i=1,\dots,n}$ satisfy the monotonicity condition if for all $\pi, \pi'$ we have
\[
\sum_{i=1}^n \int \Big( \frac{\delta F^i}{\delta \nu}(\pi^i, \pi^{-i}, \bar{x}^i) - \frac{\delta F^i}{\delta \nu}(\pi'^i, \pi'^{-i}, \bar{x}^i) \Big)\,(\pi - \pi')(d\bar{x}) \;\ge\; 0.
\]
We have the following results:
(i) if $n = 1$, a function $F$ satisfies the monotonicity condition if and only if it is convex;
(ii) in general ($n \ge 1$), if $(F^i)_{i=1,\dots,n}$ satisfy the monotonicity condition, then for any two Nash equilibria $\pi^*, \pi'^* \in \Pi$ we have $(\pi^*)^i = (\pi'^*)^i$ for all $i = 1, \dots, n$.

Game on random environment Proof of uniqueness

Sketch of proof: since $(F^i)$ is monotone,
\[
\sum_{i=1}^n \int \Big( \frac{\delta F^i}{\delta \nu}(\pi^i, \pi^{-i}, \bar{x}^i) - \frac{\delta F^i}{\delta \nu}(\pi'^i, \pi'^{-i}, \bar{x}^i) \Big)\,(\pi - \pi')(d\bar{x}) \;\ge\; 0.
\]
Together with the first order condition of equilibrium, we obtain
\[
\sum_{i=1}^n \int \big( -\ln \pi^i(x^i|y) + \ln \pi'^i(x^i|y) \big)\,(\pi - \pi')(d\bar{x}) \;\ge\; 0,
\]
that is,
\[
-\sum_{i=1}^n \Big( \mathrm{Ent}\big(\pi^i \,|\, \pi'^i\big) + \mathrm{Ent}\big(\pi'^i \,|\, \pi^i\big) \Big) \;\ge\; 0.
\]
Therefore $\pi^i = \pi'^i$ for all $i$.

Game on random environment Mean-field Langevin system and convergence to equilibrium

Again the first order condition inspires the form of the MFL dynamics:
\[
dX^{i,y}_t = -\nabla_{x^i} \frac{\delta F^i}{\delta \nu}(\pi^i_t, \pi^{-i}_t, X^{i,y}_t, y)\,dt + \sigma\,dW^i_t.
\]
In particular, if the game admits at least one Nash equilibrium and the MFL system has a unique invariant measure, then the invariant measure is an equilibrium.

In the context of games it is in general hard to find a Lyapunov function. Still, if the coefficient $\nabla_{x^i} \frac{\delta F^i}{\delta \nu}(\pi^i, \pi^{-i}, x^i, y)$ bears small mean-field dependence, we can still prove the contraction result
\[
W_1(\pi_t, \pi'_t) \le C e^{-\gamma t}\, W_1(\pi_0, \pi'_0).
\]
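
A minimal particle sketch of such a coupled system for a hypothetical two-player game; the quadratic objectives in the comments, the resulting drifts $x^1 + \mathbb{E}[X^2]$ and $x^2 - \mathbb{E}[X^1]$, and all numerical parameters are assumptions, and the random environment $y$ is dropped to keep the example short.

    import numpy as np

    # Hypothetical two-player game (assumptions for this sketch):
    #   F^1(pi^1, pi^2) = 1/2 E_{pi^1}[|X^1|^2] + E_{pi^1}[X^1] * E_{pi^2}[X^2],
    #   F^2(pi^1, pi^2) = 1/2 E_{pi^2}[|X^2|^2] - E_{pi^1}[X^1] * E_{pi^2}[X^2],
    # so grad_{x^1} dF^1/dnu = x^1 + E[X^2] and grad_{x^2} dF^2/dnu = x^2 - E[X^1].
    rng = np.random.default_rng(3)
    N, sigma, dt, n_steps = 2000, 0.3, 0.01, 3000

    def run(x1, x2):
        """Coupled MFL dynamics of the two particle clouds."""
        for _ in range(n_steps):
            m1, m2 = x1.mean(), x2.mean()
            x1 = x1 - (x1 + m2) * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
            x2 = x2 - (x2 - m1) * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
        return x1, x2

    # Two very different initial laws: both runs should contract towards the same
    # invariant measure, whose means are zero for this game.
    a1, a2 = run(rng.normal(5.0, 1.0, N), rng.normal(-3.0, 1.0, N))
    b1, b2 = run(rng.normal(-4.0, 2.0, N), rng.normal(6.0, 2.0, N))
    print("player 1 means:", a1.mean(), b1.mean())
    print("player 2 means:", a2.mean(), b2.mean())

The two initializations should end with nearly identical statistics, illustrating the contraction stated above towards the unique invariant measure, which for this toy game is also its equilibrium.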

Game on random environment Conclusion

• Mean-field Langevin dynamics is a natural model for analyzing the (Hamiltonian) gradient descent for overparametrized non-convex optimization.
• The calculus involving measure derivatives characterizes the first order conditions and allows an Itô-type calculus.
• Relaxed control (in continuous or discrete time) can be viewed as an optimization with a marginal constraint.

References: One layer: Mei, Montanari, Nguyen ’18; Hu, R., Siska, Szpruch ’19. Deep/Neuron ODE: Hu, R., Kazeykina ’19; Jabir, Siska, Szpruch ’19. Game on random environment: Conforti, R., Kazeykina ’20. Stochastic control: Siska, Szpruch ’20. Underdamped MFL: Kazeykina, R., Tan, Yang ’20 ...

Game on random environment

Thank you for your attention!