Mean-Field Langevin Dynamics and Neural Networks

Zhenjie Ren, CEREMADE, Université Paris-Dauphine
Joint works with Giovanni Conforti, Kaitong Hu, Anna Kazeykina, David Siska, Lukasz Szpruch, Xiaolu Tan, Junjian Yang
MATH-IMS Joint Applied Mathematics Colloquium Series, August 28, 2020
Zhenjie Ren (CEREMADE) MF Langevin 28/08/2020 1 / 37

Classical Langevin dynamics and non-convex optimization

The Langevin dynamics was first introduced in statistical physics to describe the motion of a particle with position $X$ and velocity $V$ in a potential field $\nabla_x f$, subject to damping and random collisions.

Overdamped Langevin dynamics:
\[ dX_t = -\nabla_x f(X_t)\,dt + \sigma\,dW_t \]

Underdamped Langevin dynamics:
\[ dX_t = V_t\,dt, \qquad dV_t = -\big(\nabla_x f(X_t) + \gamma V_t\big)\,dt + \sigma\,dW_t \]

Under mild conditions, the two Markov diffusions admit unique invariant measures whose densities read:

Overdamped:
\[ m^*(x) = C\,e^{-\frac{2}{\sigma^2} f(x)} \;\xrightarrow[\sigma \downarrow 0]{}\; \delta_{\arg\min f} \]

Underdamped:
\[ m^*(x,v) = C\,e^{-\frac{2}{\sigma^2}\left(f(x) + \frac{1}{2}|v|^2\right)} \;\xrightarrow[\sigma \downarrow 0]{}\; \delta_{(\arg\min f,\,0)} \]

In particular, $f$ does NOT need to be convex.

Relation with classical algorithms

If we overlook the Brownian noise, then
- the overdamped process gives the gradient descent algorithm;
- the underdamped process gives the Hamiltonian gradient descent algorithm.
But their convergence to the minimizer is ensured only for a convex potential function $f$.

Taking into account the Brownian noise with constant $\sigma$, we may produce samples of the invariant measures:
- the overdamped Langevin dynamics corresponds to MCMC;
- the underdamped Langevin dynamics corresponds to Hamiltonian MCMC.
The convergence rate of MCMC algorithms is in general dimension dependent! One may diminish $\sigma \downarrow 0$ along the simulation, which yields simulated annealing.
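The non-convex sampling property can be illustrated numerically. Below is a minimal Euler-Maruyama sketch of the overdamped dynamics (my own illustrative example, not from the talk) on a tilted double-well potential: even chains started in the shallow well mostly end up near the global minimum.

```python
import numpy as np

def overdamped_langevin(grad_f, x0, sigma, dt=0.01, n_steps=20000, seed=0):
    """Euler-Maruyama scheme for dX_t = -grad f(X_t) dt + sigma dW_t,
    run for a whole batch of chains at once (x0 is an array)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x += -grad_f(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Non-convex double-well f(x) = (x^2 - 1)^2 + 0.3 x: the tilt makes the
# left well (near x = -1) the global minimum.
grad_f = lambda x: 4.0 * x * (x**2 - 1.0) + 0.3

# 200 chains, all started in the *wrong* (right) well:
samples = overdamped_langevin(grad_f, x0=np.ones(200), sigma=0.7)
print(np.mean(samples < 0.0))  # most chains cross the barrier to the deeper well
```

With a constant $\sigma$ the chains sample (approximately) the Gibbs density $C e^{-2f/\sigma^2}$, which concentrates the mass in the deeper well; decreasing $\sigma$ along the run would give simulated annealing.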
Deep neural networks

Deep neural networks have achieved, and continue to achieve, impressive success in various applications. Mathematically speaking, we may approximate a given function $f$ with the parametrized function
\[ f(z) \approx \varphi_n \circ \cdots \circ \varphi_1(z), \qquad \text{where } \varphi_i(z) := \sum_{k=1}^{n_i} c_k^i\, \varphi(A_k^i z + b_k^i) \]
and $\varphi$ is a given non-constant, bounded, continuous activation function. The expressiveness of the neural network is ensured by the universal representation theorem. However, the efficiency of such over-parametrized, non-convex optimization is still a mystery for mathematical analysis. It is natural to study this problem using mean-field Langevin equations.
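As a concrete reading of the composition above, here is a small numpy sketch in which each layer computes $\sum_k c_k\,\varphi(A_k z + b_k)$ with vector-valued coefficients $c_k$ so that layers compose (the dimensions, widths and tanh activation are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in, n_units, d_out, scale=1.0):
    """Random parameters for phi_i(z) = sum_k c_k * phi(A_k @ z + b_k):
    each A_k is a row of A, each c_k a column of C, so the output lies
    in R^{d_out} and layers can be composed."""
    A = rng.normal(0.0, scale / np.sqrt(d_in), size=(n_units, d_in))
    b = rng.normal(0.0, scale, size=n_units)
    C = rng.normal(0.0, scale / n_units, size=(d_out, n_units))
    return C, A, b

def apply_layer(z, params, phi=np.tanh):
    C, A, b = params
    return C @ phi(A @ z + b)   # sum_k c_k * phi(A_k . z + b_k)

def network(z, layers, phi=np.tanh):
    """Evaluate phi_n o ... o phi_1 at z."""
    for params in layers:
        z = apply_layer(z, params, phi)
    return z

# A 3-layer map R^4 -> R with widths n_i = 64 (illustrative):
layers = [make_layer(4, 64, 16), make_layer(16, 64, 16), make_layer(16, 64, 1)]
y = network(rng.normal(size=4), layers)
print(y.shape)  # (1,)
```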
Table of Contents

1 Two-layer Network and Mean-field Langevin Equation
2 Application to GAN
3 Deep neural network and MFL system
4 Game on random environment

Two-layer Network and Mean-field Langevin Equation

Two-layer neural network

In the work with K. Hu, D. Siska, L. Szpruch '19, we focused on the two-layer network and aimed at minimizing
\[ \inf_{n,\,(c_k,A_k,b_k)_k} \mathbb{E}\Big[\Big(f(Z) - \frac{1}{n}\sum_{k=1}^{n} c_k\, \varphi(A_k Z + b_k)\Big)^2\Big], \]
where $Z$ represents the data and $\mathbb{E}$ is the expectation under the law of the data. In the mean-field reformulation, the empirical average over the neurons is replaced by an expectation $\mathbb{E}_\nu[c\,\varphi(AZ+b)]$ over a probability measure $\nu$ on the parameters, which defines an objective $F(\nu)$. Note that $F$ is convex in $\nu$. Take $\mathrm{Ent}(\cdot)$, the relative entropy with respect to the Lebesgue measure, as a regularizer, and note that $\mathrm{Ent}(\cdot)$ is strictly convex. How can we characterize the minimizer of a function of probability measures?
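The entropy-regularized problem suggests a particle algorithm: run noisy gradient descent on the neurons $(c_k, A_k, b_k)$, an Euler discretization of the mean-field Langevin dynamics, in which the Gaussian noise of size $\sigma$ plays the role of the entropy term. A minimal numpy sketch (the target $\sin(3z)$, the tanh activation and all constants are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: fit f(z) = sin(3z) with z uniform on [-1, 1].
Z = rng.uniform(-1.0, 1.0, size=256)
Y = np.sin(3.0 * Z)

n = 200                          # number of neurons (particles)
theta = rng.normal(size=(n, 3))  # each row is one particle (c_k, A_k, b_k)

def predict(theta, z):
    c, A, b = theta[:, 0], theta[:, 1], theta[:, 2]
    # mean-field normalisation: (1/n) sum_k c_k tanh(A_k z + b_k)
    return np.tanh(np.outer(z, A) + b) @ c / len(theta)

def risk(theta):
    return np.mean((Y - predict(theta, Z)) ** 2)

sigma, dt = 0.05, 0.05
r0 = risk(theta)
for _ in range(2000):
    c, A, b = theta[:, 0], theta[:, 1], theta[:, 2]
    H = np.tanh(np.outer(Z, A) + b)          # H[j, k] = tanh(A_k Z_j + b_k)
    res = predict(theta, Z) - Y
    dH = (1.0 - H ** 2) * c                  # chain rule through tanh
    # gradient of the linear derivative of F, evaluated at each particle
    g = np.stack([2.0 * (res @ H), 2.0 * ((res * Z) @ dH), 2.0 * (res @ dH)],
                 axis=1) / len(Z)
    # noisy gradient step = Euler scheme for the mean-field Langevin dynamics
    theta += -g * dt + sigma * np.sqrt(dt) * rng.normal(size=theta.shape)
print(r0, risk(theta))  # the regularized risk decreases along the dynamics
```

The drift is the gradient of the linear functional derivative $\frac{\delta F}{\delta\nu}$ at each particle; by the convexity of $F$ plus the strict convexity of the entropy term, the regularized objective has a unique minimizer, which is what makes this noisy training amenable to analysis.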
