Mean-Field Langevin Dynamics and Neural Networks
Zhenjie Ren, CEREMADE, Université Paris-Dauphine
Joint works with Giovanni Conforti, Kaitong Hu, Anna Kazeykina, David Siska, Lukasz Szpruch, Xiaolu Tan, Junjian Yang
MATH-IMS Joint Applied Mathematics Colloquium Series, August 28, 2020

Classical Langevin dynamics and non-convex optimization

The Langevin dynamics was first introduced in statistical physics to describe the motion of a particle with position $X$ and velocity $V$ in a potential field $\nabla_x f$, subject to damping and random collisions.

Overdamped Langevin dynamics:
$$dX_t = -\nabla_x f(X_t)\,dt + \sigma\,dW_t$$

Underdamped Langevin dynamics:
$$dX_t = V_t\,dt, \qquad dV_t = -\nabla_x f(X_t)\,dt - \gamma V_t\,dt + \sigma\,dW_t$$

Under mild conditions, the two Markov diffusions admit unique invariant measures, whose densities read:

Overdamped: $m^*(x) = C e^{-\frac{2}{\sigma^2} f(x)} \;\longrightarrow\; \delta_{\arg\min f(x)}$ as $\sigma \downarrow 0$

Underdamped: $m^*(x,v) = C e^{-\frac{2}{\sigma^2}\left(f(x) + \frac{1}{2}|v|^2\right)} \;\longrightarrow\; \delta_{(\arg\min f(x),\,0)}$ as $\sigma \downarrow 0$

In particular, $f$ does NOT need to be convex.
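To make the sampling picture concrete, here is a minimal numerical sketch (not from the talk), assuming a one-dimensional double-well potential $f(x) = (x^2-1)^2 + 0.3x$ chosen purely for illustration: it is non-convex, and the tilt makes the left well near $x \approx -1$ the global minimum. An Euler–Maruyama discretization of the overdamped dynamics lets chains started in the wrong well cross the barrier and settle according to the Gibbs density $m^*(x) \propto e^{-2f(x)/\sigma^2}$.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative double well f(x) = (x^2 - 1)^2 + 0.3 x;
    # the tilt 0.3 x makes the well near x = -1 the global minimum.
    return 4.0 * x * (x**2 - 1.0) + 0.3

def overdamped_langevin(x0, sigma, dt, n_steps, n_chains, rng):
    """Euler-Maruyama scheme for dX_t = -grad f(X_t) dt + sigma dW_t."""
    x = np.full(n_chains, x0, dtype=float)
    for _ in range(n_steps):
        x += -grad_f(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_chains)
    return x

rng = np.random.default_rng(0)
samples = overdamped_langevin(x0=1.0, sigma=1.0, dt=1e-3,
                              n_steps=50_000, n_chains=10_000, rng=rng)
# All chains start in the shallow right well, yet at stationarity the
# empirical law follows m*(x) ~ exp(-2 f(x) / sigma^2), which puts most
# of its mass in the deeper left well (no convexity of f required).
print("fraction of chains in the left well:", np.mean(samples < 0.0))
```

Shrinking $\sigma$ concentrates $m^*$ ever more sharply around the global minimizer, which is precisely the bridge from sampling to non-convex optimization.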
Relation with classical algorithms

If we overlook the Brownian noise, then
- the overdamped process $\Rightarrow$ the gradient descent algorithm;
- the underdamped process $\Rightarrow$ the Hamiltonian gradient descent algorithm.
But their convergence to the minimizer is ensured only for a convex potential function $f$.

Taking into account the Brownian noise with a constant $\sigma$, we may produce samples from the invariant measures:
- the overdamped Langevin $\Leftrightarrow$ MCMC;
- the underdamped Langevin $\Leftrightarrow$ Hamiltonian MCMC.
The convergence rate of MCMC algorithms is in general dimension dependent!

One may diminish $\sigma \downarrow 0$ along the simulation $\Rightarrow$ simulated annealing.
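The following sketch continues the previous one (same illustrative double-well potential) and implements the annealing idea: the noise level $\sigma_t$ is slowly decreased during the simulation, so the chains first explore like an MCMC sampler and then freeze near the global minimizer. The polynomial decay schedule below is an arbitrary choice made for brevity; classical guarantees for simulated annealing require slower, logarithmic cooling.

```python
import numpy as np

def grad_f(x):
    # Same illustrative double well as in the previous sketch.
    return 4.0 * x * (x**2 - 1.0) + 0.3

def annealed_langevin(x0, sigma0, dt, n_steps, n_chains, rng):
    """Overdamped Langevin with noise level sigma_t decreasing to 0."""
    x = np.full(n_chains, x0, dtype=float)
    for t in range(1, n_steps + 1):
        sigma_t = sigma0 / (1.0 + 1e-4 * t) ** 0.5   # illustrative schedule
        x += -grad_f(x) * dt + sigma_t * np.sqrt(dt) * rng.standard_normal(n_chains)
    return x

rng = np.random.default_rng(1)
final = annealed_langevin(x0=1.0, sigma0=1.5, dt=1e-3,
                          n_steps=200_000, n_chains=2_000, rng=rng)
# With sigma held fixed the chains keep fluctuating between the wells;
# with sigma_t -> 0 most chains settle near the global minimizer x ~ -1.
print("mean:", final.mean(), "std:", final.std())
```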
Deep neural networks

Deep neural networks have achieved, and continue to achieve, impressive success in various applications. Mathematically speaking, we may approximate a given function $f$ with the parametrized function
$$f(z) \approx \varphi_n \circ \cdots \circ \varphi_1(z), \qquad \text{where } \varphi_i(z) := \sum_{k=1}^{n_i} c^i_k\, \varphi(A^i_k z + b^i_k)$$
and $\varphi$ is a given non-constant, bounded, continuous activation function.

The expressiveness of the neural network is ensured by the universal representation theorem. However, the efficiency of such over-parametrized, non-convex optimization remains a mystery for mathematical analysis. It is natural to study this problem using mean-field Langevin equations.

Table of Contents
1 Two-layer Network and Mean-field Langevin Equation
2 Application to GAN
3 Deep neural network and MFL system
4 Game on random environment

Two-layer Network and Mean-field Langevin Equation

Two-layer neural network

In the work with K. Hu, D. Siska, L. Szpruch '19, we focused on the two-layer network, and aimed at minimizing
$$\inf_{n,\,(c_k,A_k,b_k)_k} \mathbb{E}\Big[\Big(f(Z) - \frac{1}{n}\sum_{k=1}^{n} c_k\,\varphi(A_k Z + b_k)\Big)^2\Big],$$
where $Z$ represents the data and $\mathbb{E}$ is the expectation under the law of the data.

How to characterize the minimizer of a function of probabilities? Viewing the empirical law of the parameters $(c_k, A_k, b_k)$ as a probability measure $\nu$ turns the objective into a function $F(\nu)$ of a measure. Note that $F$ is convex in $\nu$. Take $\mathrm{Ent}(\cdot)$, the relative entropy w.r.t. the Lebesgue measure, as a regularizer, and note that $\mathrm{Ent}(\cdot)$ is strictly convex.
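As a preview of the mean-field viewpoint, here is a particle sketch in which every concrete choice (the target $\sin(3z)$, the tanh activation, the data law, the step size and noise level) is made up for illustration. The $n$ neurons of the $1/n$-normalized two-layer network act as $n$ particles, and noisy gradient descent on their parameters is an Euler–Maruyama discretization of the mean-field Langevin dynamics, with the injected Gaussian noise accounting for the entropy regularizer $\mathrm{Ent}(\cdot)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up 1-d regression task: learn f(z) = sin(3 z) from 512 samples.
Z = rng.uniform(-1.0, 1.0, size=(512, 1))
y = np.sin(3.0 * Z[:, 0])

n = 200                                  # neurons = particles
c = rng.standard_normal(n)               # outer weights c_k
A = rng.standard_normal((n, 1))          # inner weights A_k
b = rng.standard_normal(n)               # biases b_k

def network(Z):
    # (1/n) sum_k c_k tanh(A_k z + b_k): bounded activation, 1/n scaling.
    return (np.tanh(Z @ A.T + b) @ c) / n

sigma, lr = 0.05, 0.2
for _ in range(3000):
    h = np.tanh(Z @ A.T + b)             # hidden activations, shape (512, n)
    err = h @ c / n - y                  # residuals, shape (512,)
    # Per-particle gradient of the first variation of F at the current
    # empirical measure (the 1/n of the forward pass cancels here).
    gc = err @ h / len(Z)
    gh = err[:, None] * (1.0 - h**2) * c / len(Z)
    gA = gh.T @ Z
    gb = gh.sum(axis=0)
    # Langevin step: gradient descent plus Gaussian noise (entropy term).
    w = sigma * np.sqrt(lr)
    c += -lr * gc + w * rng.standard_normal(n)
    A += -lr * gA + w * rng.standard_normal((n, 1))
    b += -lr * gb + w * rng.standard_normal(n)

print("train MSE:", np.mean((network(Z) - y) ** 2))
```

As $n \to \infty$ the empirical measure of the particles follows the mean-field Langevin equation, whose invariant measure minimizes the entropy-regularized objective; making this rigorous is the subject of what follows.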