Continuity equations, PDEs, probabilities, and gradient flows∗
Xiaohui Chen
This version: July 5, 2020
Contents

1 Continuity equation
2 Heat equation
  2.1 Heat equation as a PDE in space and time
  2.2 Heat equation as a (continuous) Markov semi-group of operators
  2.3 Heat equation as a (negative) gradient flow in Wasserstein space
3 Fokker-Planck equations: diffusions with drift
  3.1 Mehler flow
  3.2 Fokker-Planck equation as a stochastic differential equation
  3.3 Rate of convergence for Wasserstein gradient flows of the Fokker-Planck equation
  3.4 Problem set
4 Nonlinear parabolic equations: interacting diffusions
  4.1 McKean-Vlasov equation
  4.2 Mean field asymptotics
  4.3 Application: landscape of two-layer neural networks
5 Diffusions on manifold
  5.1 Heat equation on manifold
  5.2 Application: manifold clustering
6 Nonlinear geometric equations
  6.1 Mean curvature flow
  6.2 Problem set
A From SDE to PDE, and back
B Logarithmic Sobolev inequalities and relatives
  B.1 Gaussian Sobolev inequalities
  B.2 Euclidean Sobolev inequalities
C Riemannian geometry: some basics
  C.1 Smooth manifolds
  C.2 Tensors
  C.3 Tangent and cotangent spaces
  C.4 Tangent bundles and tensor fields
  C.5 Connections and curvatures
  C.6 Riemannian manifolds and geodesics
  C.7 Volume forms

∗Working draft in progress; comments are welcome (email: [email protected]).
1 Continuity equation
Let Ω ⊂ ℝ^n be a spatial domain. Consider the continuity equation (CE):

∂_t µ_t + ∇·(µ_t v_t) = 0,  (1)

where µ_t is a probability measure (typically absolutely continuous with a density) on Ω, v_t : Ω → ℝ^n is a velocity vector field on Ω, and ∇·v is the divergence of a vector field v.

There are several meanings of solving the continuity equation (1). Given the vector field v_t, we can speak of a classical (or strong) solution of the partial differential equation (PDE) by thinking of µ_t(x) as a differentiable function of the two variables x and t. We can also consider a distributional solution by integrating against some "nice" class of test functions in (x, t). If the continuity equation (1) is satisfied, then, for any finite time horizon T > 0, we can integrate against a C^1 function ψ : Ω × [0, T] → ℝ with bounded support and apply integration by parts:

0 = ∫_0^T ∫_Ω ψ ∂_t µ_t + ∫_0^T ∫_Ω ψ ∇·(µ_t v_t)  (2)
  = −∫_0^T ∫_Ω (∂_t ψ) µ_t − ∫_0^T ∫_Ω ⟨∇ψ, v_t⟩ µ_t,  (3)

where there is no contribution from the boundary because ψ has compact support, so the divergence theorem applies. Here we implicitly assumed conservation of mass of µ_t (the first law of thermodynamics), so that no mass escapes through the boundary (if Ω is bounded) or near infinity (if Ω is unbounded). Compared with the strong solution, the distributional solution (3) does not require differentiability of µ_t, since the derivatives are moved from µ_t(x) onto ψ(x, t).

Suppose further that we can interchange ∂_t and ∫_Ω in the first integral of (2), while keeping the second integral as in (3); then we can take a smaller class of test functions, depending only on the spatial domain Ω, so that the solution space becomes larger. Why don't we interchange ∂_t and ∫_Ω in the first integral of (3)? Because if we take test functions depending only on Ω, the time derivative ∂_t ψ appearing in (3) would always equal zero. Let φ : Ω → ℝ be such a C^1 test function with bounded support. To make sense of differentiation outside the integral, we clearly need t ↦ ∫_Ω φ µ_t to be absolutely continuous in t.

In addition, we need the identity

d/dt ∫_Ω φ µ_t − ∫_Ω ⟨∇φ, v_t⟩ µ_t = 0  (4)

to hold pointwise (in t), so that (3) equals zero. This motivates the following definition.
Definition 1.1 (Weak solution of the continuity equation). We say that the density µ_t is a weak solution in the distributional sense if for any C^1 test function φ : Ω → ℝ with bounded support, the function t ↦ ∫_Ω φ µ_t is absolutely continuous in t, and for a.e. t, we have

d/dt ∫_Ω φ µ_t = ∫_Ω ⟨∇φ, v_t⟩ µ_t.  (5)

In these notes, we make (more) sense of the dynamic behaviors of the continuity equation, and illustrate its links to classical PDEs, probability (such as Markov processes and stochastic differential equations), and trajectory analysis (such as gradient flows). Specifically, we would like to understand the following question: how does the weak solution of the continuity equation in an infinite-dimensional metric space (typically a Wasserstein space) connect with the classical solution of a PDE in a finite-dimensional space (typically Euclidean space and time)? We start from the (linear) heat equation as a concrete example.
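As a quick numerical sanity check (an illustration, not part of the notes), the weak-solution identity (5) can be tested on the heat flow of Section 2, where in one dimension µ_t = N(0, 2t) and v_t(x) = x/(2t); the test function φ(x) = cos x is not compactly supported, but it decays fast enough against the Gaussian for numerical purposes.

```python
import numpy as np

# Heat flow on R: mu_t = N(0, 2t) with velocity field v_t(x) = x/(2t) (see Section 2).
# Check the weak-solution identity (5):  d/dt ∫ φ dµ_t = ∫ φ'(x) v_t(x) dµ_t(x).
x = np.linspace(-30.0, 30.0, 200_001)
dx = x[1] - x[0]

def mu(t):
    """Density of N(0, 2t) on the grid."""
    return np.exp(-x**2 / (4 * t)) / np.sqrt(4 * np.pi * t)

phi, dphi = np.cos(x), -np.sin(x)
t, h = 1.0, 1e-5

# Left-hand side by a central finite difference in t, integrals by Riemann sums.
lhs = ((phi * mu(t + h)).sum() - (phi * mu(t - h)).sum()) * dx / (2 * h)
rhs = (dphi * (x / (2 * t)) * mu(t)).sum() * dx

# Closed form: ∫ cos(x) dN(0, 2t) = e^{-t}, so both sides equal -e^{-1} at t = 1.
print(lhs, rhs)
```

Both sides agree with the closed-form value −e^{−t}, which follows from ∫ cos(x) dN(0, σ²) = e^{−σ²/2} with σ² = 2t.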
2 Heat equation
2.1 Heat equation as a PDE in space and time

Recall that the heat equation (HE) on ℝ^n is defined as:
(∂t − ∆)u = ∂tu − ∇ · ∇u = 0, (6)
where u : ℝ^n × [0, ∞) → ℝ and u(x, t) is a two-variable function of space and time. The fundamental solution of the heat equation (6) is given by

u(x, t) = H(x, 0, t),  (7)

where H(x, y, t) is the heat kernel (sometimes also called the Green's function):

H(x, y, t) = (4πt)^{−n/2} exp(−|x − y|^2/(4t)),  y ∈ ℝ^n, t > 0.  (8)

The classical solution u is just a (positive) function of the two variables x and t. One can think of H(x, y, t) as the transition density from x to y in time 2t.

Let Ω = ℝ^n and µ_t(x) = u(x, t). Clearly µ_t > 0 is the probability density of the Gaussian distribution N(0, 2tI_n), and the continuity equation (1) reads

∂_t µ_t − ∇·(µ_t ∇µ_t/µ_t) = 0.  (9)

In this case, the velocity vector field is given by

v_t(x) = −(∇µ_t/µ_t)(x) = −(∇u/u)(x, t),  (10)

where the last equality is justified by the equivalence of the weak solution and the classical PDE solution (because u(·, ·) is Lipschitz continuous and v_t(·) is Lipschitz). Since

∇u = (4πt)^{−n/2} exp(−|x|^2/(4t)) · (−x/(2t)) = −u (I_n/(2t)) x = −u v_t(x),  (11)

the velocity vector field v_t : ℝ^n → ℝ^n is a linear map that can be represented by an n × n matrix:

v_t = I_n/(2t).  (12)

Thus in the heat equation, the velocity vector field v_t = (2t)^{−1} I_n does not depend on the location x (i.e., it is location-free) and it dies off as t → ∞. The vanishing velocity means that the particles moving according to the heat equation will eventually converge to an equilibrium distribution given by a harmonic function ∆u = 0. If a boundary value is imposed (either a Dirichlet problem or a Neumann problem), or the heat growth at infinity is not too fast, then the solution of the harmonic equation is unique.

Let p ≥ 1 and let P_p(ℝ^n) be the collection of probability measures on ℝ^n for which the p-Wasserstein distance is well-defined, i.e.,

P_p(ℝ^n) = {µ ∈ P(ℝ^n) : ∫_{ℝ^n} |x − x_0|^p µ(dx) < ∞ for some x_0 ∈ ℝ^n},  (13)

where P(ℝ^n) contains all probability measures on ℝ^n. Let W_p be the p-Wasserstein distance

W_p^p(µ, ν) = min{ ∫ |x − y|^p γ(dx, dy) : γ ∈ Γ(µ, ν) },  (14)

where Γ(µ, ν) is the set of all couplings with marginal distributions µ and ν.
Definition 2.1 (Metric derivative). Let (µ_t)_{t>0} be an absolutely continuous curve in the Wasserstein (metric) space (P_p(ℝ^n), W_p). The metric derivative at time t of the curve t ↦ µ_t w.r.t. W_p is defined as

|µ'|_p(t) = lim_{h→0} W_p(µ_{t+h}, µ_t) / |h|.  (15)

We write |µ'|(t) = |µ'|_2(t). We can compute the metric derivative of the fundamental solution curve of the heat equation. Note that µ_t is the Gaussian density N(0, 2tI_n) for each t > 0.

Lemma 2.2 (Wasserstein distance between two Gaussians, cf. Remark 2.31 in [11]). Let µ_1 = N(m_1, Σ_1) and µ_2 = N(m_2, Σ_2). Then the optimal transport map (i.e., the Monge map) from µ_1 to µ_2 is given by

T(x) = m_2 + A(x − m_1),  (16)

where

A = Σ_1^{−1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2},  (17)

and the squared 2-Wasserstein distance between µ_1 and µ_2 equals

W_2^2(µ_1, µ_2) = |m_1 − m_2|^2 + tr{Σ_1 + Σ_2 − 2(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}},  (18)

where the last term is the (squared) Bures distance on positive-semidefinite matrices. In particular, if Σ_1Σ_2 = Σ_2Σ_1, then

W_2^2(µ_1, µ_2) = |m_1 − m_2|^2 + ‖Σ_1^{1/2} − Σ_2^{1/2}‖_F^2.  (19)
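Lemma 2.2 is easy to verify numerically. The sketch below (an illustration, not part of the notes) implements (18) with an eigendecomposition-based matrix square root, tests a commuting pair where (19) applies, and checks the heat-flow curve µ_t = N(0, 2tI_n) used in the next computation.

```python
import numpy as np

def psd_sqrt(S):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1,S1) and N(m2,S2), eq. (18)."""
    r1 = psd_sqrt(S1)
    bures = np.trace(S1 + S2 - 2.0 * psd_sqrt(r1 @ S2 @ r1))
    return float(np.sum((m1 - m2) ** 2) + bures)

# Commuting case (19): Sigma1 = I, Sigma2 = 4I in R^2, m1 = 0, m2 = (3, 0):
# W2^2 = |m1 - m2|^2 + ||I - 2I||_F^2 = 9 + 2 = 11.
w2sq = gaussian_w2_sq(np.zeros(2), np.eye(2), np.array([3.0, 0.0]), 4.0 * np.eye(2))
print(w2sq)  # 11.0 (up to rounding)

# Heat-flow curve mu_t = N(0, 2t I_n): W2^2(mu_{t+h}, mu_t) = 2n (sqrt(t+h) - sqrt(t))^2.
n, t, h = 3, 1.0, 0.5
lhs = gaussian_w2_sq(np.zeros(n), 2 * (t + h) * np.eye(n), np.zeros(n), 2 * t * np.eye(n))
rhs = 2 * n * (np.sqrt(t + h) - np.sqrt(t)) ** 2
print(lhs, rhs)
```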
In the Gaussian case, we have

W_2^2(µ_{t+h}, µ_t) = ‖√(2(t+h)) I_n − √(2t) I_n‖_F^2 = 2n(√(t+h) − √t)^2,  (20)

which gives a formula for |µ'|(t):

|µ'|(t) = lim_{h→0} √(2n) (√(t+h) − √t)/h = √(2n) · 1/(2√t) = √(n/(2t)).  (21)

On the other hand, recalling (12), we get

‖v_t‖^2_{L^2(µ_t)} := ∫ |v_t(x)|^2 µ_t(dx) = (1/(4t^2)) ∫ |x|^2 µ_t(dx) = tr(2tI_n)/(4t^2) = n/(2t).  (22)

This implies that

|µ'|(t) = ‖v_t‖_{L^2(µ_t)} = √(n/(2t)).  (23)

The equivalence in (23) is a much more general fact; see Theorem 3.1 and Corollary 3.2 in Section 3. Below we give a heuristic interpretation, based on the theory of optimal transport, of why (23) is intuitively true. Consider p ≥ 1. If (µ_t)_{t>0} is an absolutely continuous curve in W_p, then there is an optimal transport map T : Ω → Ω moving particles from µ_t to µ_{t+h} that minimizes ∫ |T(x) − x|^p µ_t(dx), i.e.,

W_p^p(µ_t, µ_{t+h}) = ∫ |T(x) − x|^p µ_t(dx) = ∫ |(T − id)(x)|^p µ_t(dx).  (24)
The "discrete velocity" of the particle located at x at time t (i.e., the time derivative of T) is given by

v_t(x) = (T(x) − x)/h,  (25)

so that we have

‖v_t‖^p_{L^p(µ_t)} = ∫ |(T(x) − x)/h|^p µ_t(dx) = W_p^p(µ_t, µ_{t+h})/|h|^p.  (26)

Letting h → 0, we expect that ‖v_t‖_{L^p(µ_t)} = |µ'|_p(t). The above heuristic argument (cf. (25)) suggests considering the following gradient flow (GF) for moving particles in ℝ^n:

y'_x(t) = v_t(y_x(t)),
y_x(0) = x,  (27)

where y_x(t) is the time-t position of the particle initially at y_x(0) = x, i.e., it is the trajectory of the particle starting from x. Thus the gradient flow of the heat equation is given by the following Cauchy problem:
y'_x(t) = (2t)^{−1} y_x(t)  (28)

with the initial datum y_x(0) = x. This is a first-order linear (homogeneous) ordinary differential equation posed as an initial value problem, whose solution can be easily computed. We can take a basic solution as

y_x(t) = C√t,  t > 0,  (29)

for any constant C ∈ ℝ (which may depend on the starting point x). Let Y_t : Ω → Ω be the flow of the vector field v_t on Ω, defined through Y_t(x) = y_x(t) in (27). Note that Y_t is indeed a flow on Ω, since Y_0(x) = y_x(0) = x and Y_t(Y_s(x)) = Y_t(y_x(s)) = y_{y_x(s)}(t) = y_x(s + t) = Y_{s+t}(x) for any s, t ≥ 0, so that (Y_t) defines an action of the additive semigroup ℝ_+ = [0, ∞) on Ω. Then the pushforward measure µ_t = (Y_t)_♯ µ_0 is a weak solution of the continuity equation ∂_t µ_t + ∇·(µ_t v_t) = 0 in (1), because for any C^1 test function φ : Ω → ℝ with bounded support,

d/dt ∫ φ dµ_t = d/dt ∫ φ d(Y_t)_♯ µ_0 = d/dt ∫ (φ ∘ Y_t) dµ_0 = d/dt ∫ φ(y_x(t)) dµ_0(x)
= ∫ ⟨∇φ(y_x(t)), y'_x(t)⟩ dµ_0(x) = ∫ ⟨∇φ(y_x(t)), v_t(y_x(t))⟩ dµ_0(x)
= ∫ ⟨∇φ(Y_t), v_t(Y_t)⟩ dµ_0 = ∫ ⟨∇φ, v_t⟩ d(Y_t)_♯ µ_0 = ∫ ⟨∇φ, v_t⟩ dµ_t.
2.2 Heat equation as a (continuous) Markov semi-group of operators

Next we see how to construct the stochastic process from the heat equation. Recall that H(x, y, t) is the heat kernel in (8) coming from the fundamental solution of the heat equation (6). Denote by γ the standard Gaussian measure on ℝ^n, i.e., the density of γ equals (2π)^{−n/2} exp(−|x|^2/2). Given a measurable function f : ℝ^n → ℝ, we define the (linear) integral operator

P_t f(x) = ∫_{ℝ^n} f(z) H(z, x, t) dz,  t > 0, x ∈ ℝ^n,  (30)

which is sometimes referred to as the forward evolution. Then one can verify that for all s, t > 0,

P_{t+s} f = P_t(P_s f) = P_s(P_t f),  (31)

lim_{t↓0} P_t f(x) = f(x).  (32)
Thus (P_t)_{t≥0} forms a continuous semi-group of linear operators on L^p(γ), which allows us to define a (homogeneous) continuous Markov process (X_t)_{t≥0} on ℝ^n via

E[f(X_{s+t}) | X_r, r ≤ t] = (P_s f)(X_t),  γ-almost surely,  (33)

for all bounded measurable functions f and all s, t ≥ 0. This Markov process is called the heat diffusion process, which we show below is the standard Wiener process (or Brownian motion) rescaled by a factor of √2. It is also easy to check (by smooth approximation and Taylor expansion) that the generator A of the semi-group (P_t)_{t≥0} is given by the Laplacian operator A = ∆, in the sense that

Af = lim_{t↓0} (P_t f − f)/t = ∆f,  (34)

where the limit exists in L^2(γ). Indeed, note that the semi-group operator can be represented as

P_t f(x) = E_{ξ∼γ}[f(x + √(2t) ξ)].  (35)
For small enough t > 0 and smooth f : ℝ^n → ℝ, Taylor expansion yields

P_t f(x) − f(x) = E_{ξ∼γ}[f(x + √(2t) ξ)] − f(x)
= E_{ξ∼γ}[f(x) + √(2t) ξ·∇f(x) + (1/2)(2t) ξ^T Hess_f(x) ξ + higher-order terms] − f(x)
= t tr{E_{ξ∼γ}[ξξ^T] Hess_f(x)} + o(t) = t tr Hess_f(x) + o(t) = t ∆f(x) + o(t).

This is not surprising in view of the heat equation ∂_t u = ∆u, where the rate of change (i.e., the time derivative) in the heat diffusion process equals the Laplacian.

As opposed to the semi-group integral operators (P_t), the generator A is a (linear) differential operator that unfolds P_t and generates the stochastic process X_t. The semi-group integral operators are useful for finding the stochastic differential equations (SDEs) from the evolution PDEs of the probability density functions; the generator does the reverse job by converting the SDEs to the evolution PDEs of the densities. We shall see an example on the Fokker-Planck equation in Section 3.

We remark that for the heat equation one can do explicit calculations on P_t and directly write down the Markov process and its SDE by using (33), as follows. Using the Markov property (33) with f(x) = 1(x ≤ y), we get

P(X_{s+t} ≤ y | X_r, r ≤ t) = P_{ξ∼γ}(X_t + √(2s) ξ ≤ y)  for all y ∈ ℝ^n,  (36)

where P_{ξ∼γ}(·) is the probability taken only w.r.t. ξ. This implies that

P(X_{s+t} − X_t ≤ y | X_r, r ≤ t) = P_{ξ∼γ}(√(2s) ξ ≤ y) = Φ(y/√(2s)),  (37)

where Φ(y) is the cumulative distribution function of γ. The last display shows that the process (X_t)_{t≥0} has independent increments (i.e., X_{s+t} − X_t does not depend on the past values) and X_{s+t} − X_t ∼ N(0, 2sI_n). Thus the heat diffusion process (X_t)_{t≥0} is simply the √2-rescaled standard Wiener process (W_t)_{t≥0} on ℝ^n, which we write as

dX_t = √2 dW_t.  (38)

Note that the factor √2 is due to the fact that the Wiener process solves a slightly different version of the heat equation, ∂_t u − (1/2)∆u = 0. For X_0 = 0, we have E[X_t] = 0 and a per-coordinate variance Var(X_t) = 2t, which means that, after time t, the typical displacement of the heat diffusion process is of order √(2t), which should be compared with the position of the trajectory y_x(t) ∝ √t at time t given in (29).
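The correspondence (38) is easy to simulate; the following sketch (an illustration, not part of the notes; path count and step size are arbitrary choices) runs an Euler-Maruyama discretization of dX_t = √2 dW_t and checks that the per-coordinate variance grows like 2t, matching the heat kernel (8).

```python
import numpy as np

# Euler-Maruyama for the heat diffusion dX_t = sqrt(2) dW_t in (38).
# After time T, each coordinate should be distributed as N(0, 2T),
# matching the heat kernel (8).
rng = np.random.default_rng(0)
n_paths, T, n_steps = 200_000, 1.0, 100
dt = T / n_steps

X = np.zeros(n_paths)
for _ in range(n_steps):
    X = X + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

print(X.mean(), X.var())  # close to 0 and 2T = 2
```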
2.3 Heat equation as a (negative) gradient flow in Wasserstein space

We have seen from (27) that the heat equation induces a gradient flow of moving particles in ℝ^n. In this section, we show that the heat equation can also be viewed as a (negative) gradient flow of the entropy in the Wasserstein space of probability measures. To do this, we first need to make sense of what we mean by a "derivative" of the entropy (or of more general functionals) with respect to a density in the infinite-dimensional Wasserstein (metric) space.
Definition 2.3 (First variation, Chapter 7 in [13]). Let ρ be a density on Ω. Given a functional F : P(Ω) → ℝ, we call δF/δρ(ρ) : Ω → ℝ, if it exists, the unique (up to additive constants) measurable function such that

d/dh|_{h=0} F(ρ + hχ) = lim_{h→0} (F(ρ + hχ) − F(ρ))/h = ∫ (δF/δρ)(ρ) dχ  (39)

holds for every mean-zero perturbation χ such that ρ + hχ ∈ P(Ω) for all small enough h. The function δF/δρ(ρ) : Ω → ℝ is the first variation of F at ρ.

In ℝ^n, the first variation behaves like the directional derivative projected onto all possible directions χ ∈ ℝ^n. For example, take F(x) = |x|^2/2 for x ∈ ℝ^n. Then ∇F(x) = x and, for any y ∈ ℝ^n,

lim_{h→0} (|x + hy|^2 − |x|^2)/(2h) = lim_{h→0} (⟨x, y⟩ + (h/2)|y|^2) = ⟨x, y⟩ = ⟨∇F(x), y⟩.  (40)

Comparing (39) with (40), we may interpret

∫ (δF/δρ)(ρ) dχ = ⟨(δF/δρ)(ρ), χ⟩  (41)

as an inner product in a Hilbert space. Thus the first variation is an infinite-dimensional analog of the gradient in finite-dimensional Euclidean space, and δF/δρ(ρ) can be viewed as a gradient on P(Ω). Note that the first variation is defined only up to additive constants, since ∫ c dχ = 0 for any constant c ∈ ℝ. Below are two important functionals that will be extremely useful in studying the heat equation, or more generally the Fokker-Planck equation in Section 3.

Example 2.4 ("Generalized" entropy). Let f : ℝ → ℝ be a convex superlinear function and

F(ρ) = ∫ f(ρ(x)) dx = ∫ f ∘ ρ.  (42)
Clearly

(F(ρ + hχ) − F(ρ))/h = ∫ ((f(ρ(x) + hχ(x)) − f(ρ(x)))/(hχ(x))) χ(x) dx.

Letting h → 0, we get

d/dh|_{h=0} F(ρ + hχ) = ∫ f'(ρ(x)) χ(x) dx = ∫ f'(ρ) dχ,  (43)

which implies that

δF/δρ(ρ) = f'(ρ).  (44)

For the special case where f(ρ) = ρ log ρ is the entropy, its first variation at ρ is given by

δF/δρ(ρ) = 1 + log ρ.  (45)
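The first-variation formula (45) can also be checked by finite differences; the grid, the reference density ρ, and the perturbation χ below are illustrative choices (not from the notes).

```python
import numpy as np

# Finite-difference check of (45): for F(rho) = ∫ rho log rho dx, the first
# variation is 1 + log(rho), i.e.  d/dh F(rho + h*chi)|_{h=0} = ∫ (1 + log rho) chi dx
# for a mean-zero perturbation chi.
x = np.linspace(-5.0, 5.0, 20_001)
dx = x[1] - x[0]

rho = np.exp(-x**2 / 2)
rho /= rho.sum() * dx                          # normalize to a probability density

# Smooth perturbation with ∫ chi dx = 0 (up to negligible tail truncation):
# ∫ (x + x^2 - 1) e^{-x^2/2} dx = 0 + sqrt(2*pi) - sqrt(2*pi) = 0.
chi = (x + x**2 - 1.0) * np.exp(-x**2 / 2)

def F(r):
    return (r * np.log(r)).sum() * dx

h = 1e-6
finite_diff = (F(rho + h * chi) - F(rho - h * chi)) / (2 * h)
first_var = ((1.0 + np.log(rho)) * chi).sum() * dx
print(finite_diff, first_var)  # both ≈ -sqrt(2*pi)
```

For this particular choice of ρ and χ, both sides equal −√(2π) by a direct Gaussian-moment calculation.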
Example 2.5 (Potential energy). Let V : Ω → ℝ be a potential function and consider the energy functional

F(ρ) = ∫ V dρ = ∫ V(x) ρ(x) dx.  (46)

Compute

(F(ρ + hχ) − F(ρ))/h = ∫ V(x) ((ρ(x) + hχ(x) − ρ(x))/h) dx = ∫ V(x) χ(x) dx = ∫ V dχ.

Thus we see that

δF/δρ(ρ) = V  (47)

does not depend on ρ. Intuitively, the first variation of a functional F (either entropy or energy) at ρ is the rate of change of F as the distribution of particles on Ω is perturbed around ρ.

Given a functional F : P(Ω) → ℝ, the minimizing movement scheme introduced by Jordan-Kinderlehrer-Otto [6] (sometimes also called the JKO scheme) is a time-discretized version of gradient flows that solves a sequence of iterated minimization problems (in the context of (P(Ω), W_2)):

ρ^h_{k+1} = argmin_{ρ∈P(Ω)} { F(ρ) + W_2^2(ρ, ρ^h_k)/(2h) }.  (48)

By strong duality,

W_2^2(ρ, ρ^h_k) = max_{ϕ∈Φ_c(Ω)} ∫_Ω ϕ dρ + ∫_Ω ϕ^c dρ^h_k,  (49)

where ϕ^c(y) = inf_{x∈Ω} {|x − y|^2 − ϕ(x)} is the c-transform¹. The functions ϕ realizing the maximum on the right-hand side of (49) are called the Kantorovich potentials for the transport from ρ to ρ^h_k. For each ρ^h_k, W_2^2(ρ, ρ^h_k) is convex in ρ since it is a supremum of linear functionals of ρ. Now, differentiating (48) w.r.t. ρ, the first-order optimality condition is implied by

(δF/δρ)(ρ^h_{k+1}) + ϕ/h = constant,  (50)
where ϕ now is the Kantorovich potential for the transport from ρ^h_{k+1} to ρ^h_k (not in the reversed direction). Here we implicitly assumed the uniqueness of the c-concave Kantorovich potential. From Brenier's polarization theorem, we know that the optimal map T̃ from ρ^h_{k+1} to ρ^h_k and the potential ϕ : Ω → ℝ are linked through

T̃(x) = x − ∇ϕ(x).  (51)

So the velocity vector from time t + h to t (note the reversed direction!) is given by

ṽ_t(x) = (T̃(x) − x)/h = −∇ϕ(x)/h = ∇[(δF/δρ)(ρ^h_{k+1})](x).  (52)

Reversing the time direction (v_t = −ṽ_t) and letting h → 0 (using the continuity of ρ), we get

v_t(x) = −∇[(δF/δρ)(ρ)](x)  (53)
¹Here c stands for a general cost function, and the c-transform in general is defined as ϕ^c(y) = inf_{x∈Ω}{c(x, y) − ϕ(x)}. A function ϕ is said to be c-concave if there exists a χ such that ϕ = χ^c, and Φ_c(Ω) is the set of all c-concave functions.
and we obtain the following continuity equation

∂_t ρ − ∇·(ρ ∇(δF/δρ)(ρ)) = 0  (54)

in the Wasserstein space of measures. If we choose the entropy functional F(ρ) = ∫ f(ρ(x)) dx with f(ρ) = ρ log ρ, then

δF/δρ(ρ) = 1 + log ρ,  and  ∇(δF/δρ)(ρ) = ∇ρ/ρ = ∇ log ρ,  (55)

so that the continuity equation in (54) reads
∂tρ − ∇ · (ρ ∇ log ρ) = 0. (56)
Now recall that we can rewrite the heat equation as:

0 = (∂_t − ∆)u = ∂_t u − ∇·(u ∇u/u) = ∂_t u − ∇·(u ∇ log u),  (57)

where we can think of µ_t = u(·, t) and v_t = −∇ log u in the continuity equation (1). Thus we see that (56) and (57) are really the same continuity equation associated with the heat equation. However, they are viewed as different gradient flows, in the spaces (P_2(Ω), W_2) and Ω, respectively. In either case, we call it the heat flow.
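As a small numerical illustration (not part of the notes) that (56) and (57) are the heat equation in disguise, one can check on a grid that ∂_t ρ_t = ∇·(ρ_t ∇ log ρ_t) for the Gaussian ρ_t = N(0, 2t):

```python
import numpy as np

# For rho_t = N(0, 2t), check  ∂_t rho = ∇·(rho ∇ log rho)  (= ∆ rho), cf. (56)-(57).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def rho(t):
    return np.exp(-x**2 / (4 * t)) / np.sqrt(4 * np.pi * t)

t, h = 1.0, 1e-5
dt_rho = (rho(t + h) - rho(t - h)) / (2 * h)   # time derivative by central difference

r = rho(t)
flux = r * np.gradient(np.log(r), dx)          # rho ∇ log rho = ∇ rho
div_flux = np.gradient(flux, dx)               # ∇·(rho ∇ log rho)

err = np.max(np.abs(dt_rho - div_flux))
print(err)  # small discretization error
```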
3 Fokker-Planck equations: diffusions with drift
We start from the equivalence of the metric derivative and the velocity vector field.
Theorem 3.1. If (µ_t)_{t∈[0,1]} is an absolutely continuous curve in W_p, then for a.e. t ∈ [0, 1] there is a velocity vector field v_t ∈ L^p(µ_t; Ω), Ω ⊂ ℝ^n, such that:

1. µ_t is a weak solution of the continuity equation ∂_t µ_t + ∇·(µ_t v_t) = 0 in the sense of distributions;

2. for a.e. t, we have ‖v_t‖_{L^p(µ_t)} ≤ |µ'|_p(t).

Conversely, if (µ_t)_{t∈[0,1]} are measures in P_p(ℝ^n) and v_t ∈ L^p(µ_t; Ω) for each t with ∫_0^1 ‖v_t‖_{L^p(µ_t)} dt < ∞, satisfying the continuity equation ∂_t µ_t + ∇·(µ_t v_t) = 0, then

1. (µ_t)_{t∈[0,1]} is an absolutely continuous curve in W_p;

2. for a.e. t, we have |µ'|_p(t) ≤ ‖v_t‖_{L^p(µ_t)}.

Corollary 3.2. If (µ_t)_{t∈[0,1]} is an absolutely continuous curve in W_p, then the velocity vector field given in part 1 of Theorem 3.1 must satisfy

|µ'|_p(t) = ‖v_t‖_{L^p(µ_t)}.  (58)
As in the heat equation, we can take two alternative perspectives on the continuity equation structure and the gradient flow, in the spaces (P_2(Ω), W_2) and Ω, respectively. First, if we choose the functional

F(ρ) = ∫ f(ρ(x)) dx + ∫ V(x) dρ(x) = ∫ f ∘ ρ + ∫ V dρ,  (59)

where f(ρ) = ρ log ρ is the entropy and V : Ω → ℝ is a potential (independent of ρ), then the first variation of F at ρ is given by

δF/δρ(ρ) = 1 + log ρ + V.  (60)

Thus,

∇(δF/δρ)(ρ) = ∇ log ρ + ∇V = ∇ρ/ρ + ∇V,  (61)

and the continuity equation (54) for this entropy+potential functional F (note that F is convex in ρ) becomes

0 = ∂_t ρ − ∇·(ρ (∇ρ/ρ + ∇V)) = ∂_t ρ − ∆ρ − ∇·(ρ ∇V),  (62)

or alternatively, we may write

∂_t ρ = ∆ρ + ⟨∇ρ, ∇V⟩ + ρ∆V,  (63)

which is the Fokker-Planck equation (FPE). Hence if we define the drift Laplacian operator L_V as

L_V ρ = ∆ρ + ⟨∇ρ, ∇V⟩ = e^{−V} ∇·(e^{V} ∇ρ),  (64)

then the Fokker-Planck equation (63) is nothing but a drift heat equation:

∂_t ρ − L_V ρ − ρ∆V = 0,  (65)

which describes the time evolution of the density of particles under the influence of drag forces and random forces (as in Brownian motion, or heat diffusion in the pure diffusion case). For a general potential V, from the continuity equation (62), the velocity vector field v_t is given by

v_t = −∇ρ_t/ρ_t − ∇V.  (66)
Thus we can write down the gradient flow of v_t as in (27). In particular, the Mehler flow (with the potential V(x) = |x|^2/4) is given by

y'_x(t) = −(∇ρ_t/ρ_t + ∇V)(y_x(t)) = −(∇ρ_t/ρ_t)(y_x(t)) − y_x(t)/2,  (67)

with the initial datum y_x(0) = x; for Gaussian ρ_t this is again a first-order linear ordinary differential equation posed as an initial value problem.
3.1 Mehler flow

It is interesting to note that, as a special case, if V(x) = |x|^2/4, then ∇V = x/2, ∆V = n/2, and the drift heat equation (65) is the Mehler flow:

(∂_t − L_M)u = 0,  (68)

where L_M is the Mehler operator defined as

L_M(ρ) = L_{|x|^2/4}(ρ) + (n/2)ρ.  (69)

It is well-known that the Mehler flow can be obtained from the heat flow in the following sense. Let u(x, t) be a smooth function and define the rescaled version

ũ(x, t) = t^{n/2} u(√t x, t) → c (4π)^{−n/2} exp(−|x|^2/4)  as t → ∞,  (70)

where c = ∫_{ℝ^n} u and the convergence holds by the central limit theorem (CLT). Changing the time clock t = e^s and setting v(x, s) = ũ(x, e^s) for s ≥ 0, one can verify that

(∂_s − L_M)v = e^{ns/2 + s} (∂_t − ∆)u.  (71)

Thus u solves the heat equation (∂_t − ∆)u = 0 if and only if v solves the drift heat equation (∂_s − L_M)v = 0. In particular, this implies that if we consider the fundamental solution of the heat equation given by (7) and (8):

u(x, t) = (4πt)^{−n/2} exp(−|x|^2/(4t)),  t > 0,  (72)

then

v(x, s) = (4π)^{−n/2} exp(−|x|^2/4)  (73)

is constant in time. Thus ∂_s v(x, s) = 0 and we see that L_M(v) = 0, i.e., v is an L_M-harmonic function, so that v(x) is a static point that minimizes the Mehler energy: for f : ℝ^n × [0, ∞) → ℝ,

E_t(f) = −∫ f L_M(f) e^{|x|^2/4} = ∫ (|∇f|^2 − (n/2) f^2) e^{|x|^2/4} ≥ 0.  (74)

Since

d/dt E_t(f) = −2⟨f_t, L_M f⟩_M = −2E_t(f),  (75)

where ⟨f, g⟩_M = ∫_{ℝ^n} f(x) g(x) e^{|x|^2/4} dx defines an inner product, we see that the Mehler flow is the negative gradient flow of twice the Mehler energy (74) in ℝ^n. Note that (75) implies the exponential rate of convergence for the Mehler flow (i.e., the Fokker-Planck equation with V(x) = |x|^2/4) in the Euclidean space ℝ^n: for t ≥ 0,

E_t(f) = E_0(f) e^{−2t}.  (76)

In Section 3.3 below, we shall also look at the rate of convergence of the gradient flow of the Fokker-Planck equation (in particular the Mehler flow) to v in the Wasserstein space (P(ℝ^n), W_2).
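The rescaling (70) can also be observed numerically. The sketch below (the box initial condition u_0 = 1_{[−1,1]}, so c = 2, is an illustrative choice, not from the notes) convolves u_0 with the heat kernel and compares the rescaled solution with the limiting Gaussian profile.

```python
import numpy as np

# Check (70) in dimension n = 1: for u solving the heat equation from u_0 = 1_[-1,1]
# (so c = ∫ u_0 = 2), the rescaling  t^{1/2} u(sqrt(t) x, t)  approaches
# c (4π)^{-1/2} exp(-x^2/4) as t grows.
y = np.linspace(-1.0, 1.0, 2001)          # support of u_0
dy = y[1] - y[0]
x = np.linspace(-6.0, 6.0, 241)
t = 400.0

# u(z, t) = ∫ (4πt)^{-1/2} exp(-(z - y)^2/(4t)) u_0(y) dy, evaluated at z = sqrt(t) x.
z = np.sqrt(t) * x
kernel = np.exp(-(z[:, None] - y[None, :])**2 / (4 * t)) / np.sqrt(4 * np.pi * t)
u_rescaled = np.sqrt(t) * kernel.sum(axis=1) * dy

limit = 2.0 * np.exp(-x**2 / 4) / np.sqrt(4 * np.pi)
err = np.max(np.abs(u_rescaled - limit))
print(err)  # decays like O(1/t)
```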
3.2 Fokker-Planck equation as a stochastic differential equation

The Fokker-Planck equation (in its most general form) is a PDE that describes the time evolution of the density of particles of a diffusion process with a (deterministic) drift and (random) noise, which is governed by the following stochastic differential equation (SDE) (sometimes referred to as the Langevin diffusion):

dX_t = m(X_t, t) dt + σ(X_t, t) dW_t,  (77)

where m(X_t, t) is the drift coefficient vector in ℝ^n and σ(X_t, t) is an n × n matrix. Here (W_t) is again the standard Wiener process on ℝ^n. The SDE in (77) is understood in the integral form:

X_{t+s} − X_s = ∫_s^{s+t} m(X_u, u) du + ∫_s^{s+t} σ(X_u, u) dW_u  (78)

as a sum of a Lebesgue integral and an Itô integral. In this model, both the random drift coefficient vector m(X_t, t) and the random matrix σ(X_t, t) are path/state and time dependent. The (standard) n-dimensional Wiener process is the special case where m(X_t, t) = 0 and σ(X_t, t) = I_n. For n = 1, the density evolution equation of (77) is given by

∂_t ρ(x, t) = −∂_x(m(x, t) ρ(x, t)) + ∂_x^2(D(x, t) ρ(x, t)),  (79)

where D(X_t, t) = σ(X_t, t)^2/2 is the diffusion coefficient. (In Appendix A, we show how to convert the SDE of sample paths (77) to the evolution PDE of densities (79).) In the one-dimensional heat diffusion case (where there is no drift, m(x, t) = 0) with D(X_t, t) = 1 (i.e., σ(X_t, t) = √2), the evolution of the density ρ(x, t) is governed by

∂_t ρ − ∂_x^2 ρ = 0,  (80)

which is just the heat equation on ℝ. The higher-dimensional analog ∂_t u − ∆u = 0 can also be obtained by noting that the density evolution equation now becomes

∂_t ρ(x, t) = −∇·(m(x, t) ρ(x, t)) + Σ_{i,j=1}^n (∂^2/∂x_i∂x_j)(D_{ij}(x, t) ρ(x, t)),  (81)

where D(x, t) = σ(x, t) σ(x, t)^T/2 is the diffusion tensor.

How does the density evolution equation (81) link to the continuity equation versions (62) and (63)? With the diffusion tensor D(x, t) = I_n, (81) reads
∂tρ = −∇ · (mρ) + ∆ρ. (82)
Comparing the last display with the continuity equation (62):
∂_t ρ − ∆ρ − ∇·(ρ ∇V) = 0,  (83)

we see that

m = −∇V,  (84)

which means that the mean drift vector m in the Fokker-Planck equation proceeds in the negative gradient direction of the potential V (subject, of course, to the diffusion effect given by the Laplacian ∆). Combining this with the fact that the heat flow proceeds in the negative gradient direction of the entropy, we see that the Fokker-Planck equation is the negative gradient flow of the entropy+potential functional

F(ρ) = ∫ f ∘ ρ + ∫ V dρ  (85)

(the entropy term describes the microscopic behavior, the potential term the macroscopic behavior) in the Wasserstein space (P_2(Ω), W_2).

We remark that the special case m(x, t) = −x/2 (i.e., V(x) = |x|^2/4) and D(x, t) = 1 of the Langevin diffusion (77) is often called the Ornstein-Uhlenbeck (OU) process in ℝ^n:

dX_t = −(X_t/2) dt + √2 dW_t,  (86)

which is simply a mean-reverting diffusion process. The semi-group operator of the OU process is given by

P_t f(x) = E_{ξ∼π}[f(e^{−t/2} x + √(1 − e^{−t}) ξ)],  t ≥ 0,  (87)

where π is again the √2-rescaled standard Gaussian measure γ on ℝ^n, i.e., π(x) = (4π)^{−n/2} exp(−|x|^2/4). The OU semi-group (P_t)_{t≥0} admits π as its stationary measure, and P_t f converges to πf := E_{ξ∼π}[f(ξ)] in L^2(π): for any bounded measurable function f : ℝ^n → ℝ,

‖P_t f − πf‖^2_{L^2(π)} = E_{ξ'∼π}( E_{ξ∼π}[f(e^{−t/2} ξ' + √(1 − e^{−t}) ξ)] − E_{ξ∼π}[f(ξ)] )^2
≤ E_{ξ'∼π} E_{ξ∼π}[f(e^{−t/2} ξ' + √(1 − e^{−t}) ξ) − f(ξ)]^2  (88)
= E[f(e^{−t/2} ξ' + √(1 − e^{−t}) ξ) − f(ξ)]^2  (89)
→ 0, as t → ∞,  (90)

where (88) follows from Jensen's inequality, (89) from Fubini's theorem, and (90) from the dominated convergence theorem (since f ∈ L^2(π)). In addition, the generator A of (P_t)_{t≥0} is given by

Af = (d/dt)|_{t=0} P_t f = −(1/2)⟨∇f, x⟩ + ∆f,  (91)

which is called the Ornstein-Uhlenbeck (OU) operator (we also write A = L_{OU}). Comparing (91) with the Mehler operator in (69):

L_M(ρ) = L_{|x|^2/4}(ρ) + (n/2)ρ = ∆ρ + (1/2)⟨∇ρ, x⟩ + (n/2)ρ,  (92)

which gives the forward Fokker-Planck equation (cf. Appendix A for more details), the generator in (91) gives the backward Fokker-Planck equation. Integrating by parts w.r.t. dx, we conclude that L_M = A^∗, which means that the Mehler operator (forward equation) is the adjoint of the generator, i.e., of the OU operator (backward equation); that is, we have L_M = L_{OU}^∗ in L^2(dx). This holds for a general potential V, not just V(x) = |x|^2/4. (Here we need to be slightly careful about the reference measure: if we consider L^2(e^{−|x|^2/4} dx) = L^2(dπ), then L^∗_{|x|^2/4} = L_{OU} in L^2(dπ).)
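The mean reversion of the OU process (86) toward π = N(0, 2) is easy to simulate; the following one-dimensional Euler-Maruyama sketch (step size, horizon, and starting point are illustrative choices, not from the notes):

```python
import numpy as np

# Euler-Maruyama for the OU process (86): dX_t = -(X_t/2) dt + sqrt(2) dW_t.
# Its stationary distribution is pi = N(0, 2): long-run mean -> 0, variance -> 2.
rng = np.random.default_rng(1)
n_paths, T, dt = 50_000, 20.0, 0.01
n_steps = int(T / dt)

X = np.full(n_paths, 5.0)                      # start far from equilibrium
for _ in range(n_steps):
    X = X - 0.5 * X * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

print(X.mean(), X.var())  # close to 0 and 2
```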
To summarize, given a general Fokker-Planck equation:

0 = ∂_t ρ_t − ∆ρ_t − ∇·(ρ_t ∇V)  (93)
  = ∂_t ρ_t − L_V ρ_t − ρ_t ∆V,  (94)

where (93) is the continuity equation version and (94) is the drift heat equation version, if we assume it admits a stationary distribution π(x) = (1/Z) e^{−V(x)} on ℝ^n (cf. Section 3.3 ahead for more details), then the generator A of the drift heat diffusion process (i.e., the Langevin diffusion) is given by

A = L̄_V^∗,  (95)

where

L̄_V ρ = L_V ρ + ρ∆V = ∆ρ + ⟨∇ρ, ∇V⟩ + ρ∆V,  (96)

and the semi-group (P_t)_{t≥0} is given by

P_t f(x) = E_{ξ∼π}[f(e^{−t} ∇V(x) + √(1 − e^{−2t}) ξ)],  t ≥ 0.  (97)
Letting t → ∞, we see that P_t f asymptotically converges to the stationary mean πf (in L^2(π)). Thus from the evolution PDE of the probability density functions, we can fully characterize the related SDEs. The reverse direction, from SDEs to PDEs, can be found in Appendix A, where we use Itô's formula to show that a measure solution ρ_t to the continuity equation can be seen as the law at time t of the process (X_t)_{t≥0} solving the SDE. This justifies the equivalence between the PDEs and the SDEs.
3.3 Rate of convergence for Wasserstein gradient flows of the Fokker-Planck equation

If Z = ∫_{ℝ^n} e^{−V(x)} dx < ∞, then the Fokker-Planck equation has a stationary distribution on ℝ^n:

π(x) = (1/Z) e^{−V(x)},  (98)

where 0 < Z < ∞ is a normalization constant. To see this, recall that F(ρ) = ∫ ρ log ρ + ∫ V dρ and δF/δρ(π) = 1 + log π + V. Since ∇π = Z^{−1} e^{−V} (−∇V) = −π∇V,

π ∇(δF/δρ)(π) = π(∇π/π + ∇V) = ∇π + π∇V = 0.  (99)

Then,

∂_t π = ∇·(π ∇(δF/δρ)(π)) = 0,  (100)

which implies that π is a stationary point of the Fokker-Planck continuity equation. In this case, the gradient flow of the functional F can be viewed as the gradient flow of the relative entropy between ρ and π (cf. (104) in Definition 3.3 below):

F(ρ) + log Z = ∫ ρ [log(ρ/e^{−V}) + log Z] = ∫ ρ log(ρ/π) = H(ρ‖π),  (101)

so that

δF/δρ(ρ) = δ(F + log Z)/δρ(ρ) = δH(·‖π)/δρ(ρ).  (102)
15 Now suppose the Fokker-Planck equation admits a stationary distribution. Given an initial distribution ρ0, we would like to ask how fast does the Wasserstein gradient flow (ρt)t>0 converge to the stationary distribution π? In a nutshell, the answer is given by the following statement.
If the stationary measure π satisfies a logarithmic Sobolev inequality (LSI), then ρt converges to π exponentially fast in time t (under numerous distance or divergence measures).
Suppose π is a probability measure (i.e., 0 < Z < ∞). Since the convergence quantities involved depend only on the (first and second) derivatives of the density π, without loss of generality we may assume Z = 1 and
π = e−V . (103)
Definition 3.3 (Relative entropy and relative Fisher information). Let p, q be two probability measures on ℝ^n such that p ≪ q. Then the relative entropy (or Kullback-Leibler divergence) between p and q is defined as

H(p‖q) = ∫ log(dp/dq) dp.  (104)

In particular, if p ≪ dx and q ≪ dx with densities p and q, then

H(p‖q) = ∫ p log(p/q) dx.  (105)

Let π, µ be two probability measures on ℝ^n such that µ ≪ π and dµ/dπ = ρ; then the relative Fisher information between µ and π is defined as

I(µ‖π) = ∫ (|∇ρ|^2/ρ) dπ.  (106)

We state a powerful information inequality that implies the LSI.

Theorem 3.4 (HWI++ inequality [5]). Let (ℝ^n, |·|, π) be a probability space such that

dπ(x) = e^{−V(x)} dx,  (107)
where V : ℝ^n → [0, ∞) is a C^∞(ℝ^n) function such that Hess(V) ⪰ κI_n for some parameter κ ∈ ℝ. Then for any µ_0, µ_1 ∈ P_2(ℝ^n) such that H(µ_0‖π) < ∞,

H(µ_1‖π) − H(µ_0‖π) ≤ W_2(µ_0, µ_1) √I(µ_1‖π) − (κ/2) W_2^2(µ_0, µ_1).  (108)

Theorem 3.4 implies several classical functional and information inequalities.

Corollary 3.5. In the setting of Theorem 3.4, assume further that π ∈ P_2(ℝ^n). Then we have

1. HWI inequality (Theorem 3 in [10]): for any ν ∈ P_2(ℝ^n),

H(ν‖π) ≤ W_2(ν, π) √I(ν‖π) − (κ/2) W_2^2(ν, π).  (109)
2. Talagrand's T_2-inequality: for any ν ∈ P_2(R^n) with finite second moment,

(κ/2) W_2²(ν, π) ≤ H(ν‖π). (110)

3. Logarithmic Sobolev inequality: if κ > 0, then for any ν ∈ P(R^n),

H(ν‖π) ≤ (1/(2κ)) I(ν‖π). (111)

Combining (110) and (111), we have that if κ > 0, then

W_2(ν, π) ≤ √I(ν‖π) / κ, ∀ν ≪ π. (112)

In Appendix B and ??, we discuss more details of the LSIs and Talagrand's transportation inequalities.
Proof of Corollary 3.5. Corollary 3.5 follows from Theorem 3.4 with specific choices.
1. Take µ0 = π and µ1 = ν.
2. Take µ0 = ν and µ1 = π.
3. Take µ_0 = π and µ_1 = ν to get the HWI inequality in part 1, and then maximize its right-hand side √I(ν‖π) x − (κ/2) x² over x ≥ 0, where the maximum occurs at x* = (1/κ) √I(ν‖π). Then a standard approximation argument is sufficient.
There are various ways of measuring the gap between the gradient flow (ρ_t)_{t≥0}, as a solution of the Fokker-Planck equation, and the stationary distribution π: total variation (as in Meyn-Tweedie's Markov chain approach), L²-norm, relative entropy, Wasserstein distance, etc. Below in Theorem 3.6, we state the exponential rate of convergence under the relative entropy.

Theorem 3.6 (Exponential rate of convergence for the Fokker-Planck gradient flow: relative entropy). Let (ρ_t)_{t≥0} solve the (gradient drift) Fokker-Planck equation:
∂tρt = ∇ · (∇ρt + ρt∇V ), (113)
where the potential V ≥ 0 satisfies V ∈ C²(R^n). If the Bakry-Émery criterion

Hess(V) ⪰ κ I_n (114)

holds for some κ > 0 (i.e., Hess V(x) ⪰ κ I_n for all x ∈ R^n), then

H(ρ_t‖π) ≤ e^{−2κt} H(ρ_0‖π), t ≥ 0, (115)

where π(x) = e^{−V(x)} is the stationary distribution for the Fokker-Planck equation (113).
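Theorem 3.6 can be checked in closed form for the quadratic potential V(x) = κx²/2, an Ornstein-Uhlenbeck example of our own choosing: the solution of (113) started from a Gaussian N(m_0, s_0²) stays Gaussian, with mean m_t = m_0 e^{−κt} and variance s_t² = 1/κ + (s_0² − 1/κ) e^{−2κt}, and π = N(0, 1/κ). The sketch below evaluates both sides of (115) using the standard Gaussian relative entropy formula; the specific numbers are arbitrary.

```python
import numpy as np

kappa, m0, s0_sq = 0.5, 3.0, 4.0      # our example; Hess(V) = kappa > 0

def kl_to_pi(m, s_sq):
    # H(N(m, s_sq) || N(0, 1/kappa)), the Gaussian relative entropy (104)
    u = kappa * s_sq
    return 0.5 * (u + kappa * m**2 - 1.0 - np.log(u))

ts = np.linspace(0.0, 5.0, 101)
m_t = m0 * np.exp(-kappa * ts)                               # mean of rho_t
s_sq_t = 1.0 / kappa + (s0_sq - 1.0 / kappa) * np.exp(-2.0 * kappa * ts)

H_t = kl_to_pi(m_t, s_sq_t)                # relative entropy along the flow
bound = np.exp(-2.0 * kappa * ts) * kl_to_pi(m0, s0_sq)      # RHS of (115)
```

Along the whole trajectory H_t stays below the exponential envelope, with equality at t = 0.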
Note that the Bakry-Émery criterion is a strong convexity (sometimes called κ-convexity) requirement on the potential V. From the sampling point of view, the Bakry-Émery criterion requires that the stationary distribution π be strictly log-concave in order to obtain an exponential rate of convergence under the relative entropy. It is known that a strictly log-concave probability density satisfies an LSI, and vice versa (cf. Problem 3.17 in [15]). In the celebrated result of Bakry and Émery, a simple sufficient condition is given to ensure that the density satisfies an LSI.

Lemma 3.7 (Bakry-Émery criterion). Let V : R^n → R be a C²(R^n) function and dπ = e^{−V} dx be a probability measure such that Hess(V) ⪰ κ I_n for some κ > 0. Then π satisfies the LSI with parameter κ:

H(ν‖π) ≤ (1/(2κ)) I(ν‖π) (116)

for all ν ∈ P(R^n) such that ν ≪ π.

However, it is an open question whether or not all log-concave densities satisfy an LSI. It is even an open question whether or not all log-concave distributions satisfy a Poincaré inequality with a dimension-free constant. This is the Kannan-Lovász-Simonovits (KLS) conjecture [1, 3].

Conjecture 3.8 (Kannan-Lovász-Simonovits conjecture, cf. Conjecture 1.2 in [1]). There exists an absolute constant C > 0 such that for any log-concave probability measure ν on R^n, we have

Var_ν(f) := E_ν|f − E_ν[f]|² ≤ C λ_ν² E_ν|∇f|² (117)

for any locally Lipschitz (i.e., Lipschitz on any Euclidean ball) function f ∈ L²(ν), where λ_ν is the square root of the largest eigenvalue of the covariance matrix E_{X∼ν}[XX^T].

Currently, the best known result is that, for an isotropic n-dimensional log-concave distribution, the dimension-dependent constant C(n) is of the order n^{1/4} [8].

We also note that the Bakry-Émery criterion can be replaced by a slightly stronger Otto-Villani criterion (with an additional finite second moment assumption), so that we can obtain both the LSI and the T_2 inequality.
Lemma 3.9 (Otto-Villani criterion, Theorem 1 in [10]). Let V : R^n → R be a C²(R^n) function and dπ = e^{−V} dx be a probability measure with a finite second moment such that Hess(V) ⪰ κ I_n for some κ > 0. Then π satisfies the LSI and Talagrand's T_2 inequality, both with parameter κ:

H(ν‖π) ≤ (1/(2κ)) I(ν‖π) and (κ/2) W_2²(ν, π) ≤ H(ν‖π) (118)

for all ν ∈ P(R^n) such that ν ≪ π.

With the (slightly) stronger Otto-Villani criterion, an exponential rate of convergence also holds under the Wasserstein distance [2]. In fact, the rate of convergence for the (more general) non-gradient drift Fokker-Planck equation is established in [2].

Theorem 3.10 (Exponential rate of convergence for the Fokker-Planck gradient flow: Wasserstein distance). Let (ρ_t)_{t≥0} solve the (gradient drift) Fokker-Planck equation (113) with the stationary distribution dπ = e^{−V} dx satisfying the Otto-Villani criterion for some κ > 0 (i.e., dπ = e^{−V} dx has a finite second moment such that V ∈ C²(R^n) and Hess(V) ⪰ κ I_n). Then,

W_2(ρ_t, π) ≤ e^{−κt} W_2(ρ_0, π). (119)
(Stochastic) proof of Theorem 3.10. Using the relation between the PDE and SDE, the solution to the (gradient drift) Fokker-Planck equation (113) is the density of the stochastic process (X_t)_{t≥0} solving the following SDE:

dX_t = −∇V(X_t) dt + √2 dW_t, (120)

where (W_t)_{t≥0} is the standard Wiener process on R^n and the initial datum has distribution ρ_0 (cf. Section 3.2 for more details of the equivalence).

Let µ_0 and ν_0 be two probability measures on R^n, and (X_0, Y_0) be a coupling at the initial time with marginal distributions X_0 ∼ µ_0 and Y_0 ∼ ν_0 such that

E|X_0 − Y_0|² = W_2²(µ_0, ν_0). (121)

Then we run two coupled copies of the SDE with (X_t)_{t≥0} (resp. (Y_t)_{t≥0}) as the solution to (120) starting from X_0 ∼ µ_0 (resp. Y_0 ∼ ν_0), both driven by the same Wiener process (W_t)_{t≥0}. Then,

(d/dt) E|X_t − Y_t|² = −2 E⟨X_t − Y_t, ∇V(X_t) − ∇V(Y_t)⟩. (122)

Note that for any x, y ∈ R^n,

V(x) = V(y) + ⟨x − y, ∇V(y)⟩ + (1/2)(x − y)^T Hess_V(z)(x − y), (123)
V(y) = V(x) + ⟨y − x, ∇V(x)⟩ + (1/2)(y − x)^T Hess_V(z')(y − x), (124)

where z, z' are on the line segment joining x and y. Adding the last two equalities, we get

⟨x − y, ∇V(x) − ∇V(y)⟩ = (1/2)(x − y)^T [Hess_V(z) + Hess_V(z')](x − y). (125)
If Hess(V) ⪰ κ I_n for some κ > 0, then

⟨x − y, ∇V(x) − ∇V(y)⟩ ≥ κ|x − y|², ∀x, y ∈ R^n. (126)

Then Grönwall's lemma yields that

E|X_t − Y_t|² ≤ e^{−2κt} E|X_0 − Y_0|². (127)

By definition of the Wasserstein distance,

W_2²(µ_t, ν_t) ≤ E|X_t − Y_t|². (128)

Now we have

W_2²(µ_t, ν_t) ≤ e^{−2κt} W_2²(µ_0, ν_0), (129)

which gives an exponential contraction between any two solutions µ_t and ν_t to the Fokker-Planck equation (113). In particular, (119) follows from choosing ν_0 (and thus all ν_t, t ≥ 0) as the stationary solution π = e^{−V}.
For the Mehler flow, recall that the stationary distribution is given by

π(x) = (4π)^{−n/2} exp(−|x|²/4), (130)

which is a √2-rescaled standard Gaussian distribution γ on R^n. By the Gaussian LSI for γ, we know that π satisfies a similar LSI and thus the Mehler flow has an exponential rate of convergence to its stationary distribution π.

Corollary 3.11 (Exponential rate of convergence for the Mehler flow). We have

H(ρ_t‖π) ≤ e^{−t} H(ρ_0‖π) and W_2(ρ_t, π) ≤ e^{−t/2} W_2(ρ_0, π), t ≥ 0, (131)

where π is the stationary distribution of the Mehler flow (ρ_t)_{t≥0} with V(x) = |x|²/4. In particular, ρ_t converges weakly to π (in distribution) as t → ∞.

Proof of Corollary 3.11. For the Mehler flow, V(x) = |x|²/4, ∇V = x/2, and Hess(V) = (1/2) I_n. So κ = 1/2. Then Corollary 3.11 follows from Theorem 3.6 and Theorem 3.10.

Corollary 3.11 is a quantitative version of Boltzmann's H-theorem, which states that the total entropy of an isolated system can never decrease over time (i.e., the second law of thermodynamics):

(d/dt) H(ρ_t‖π) = −I(ρ_t‖π) ≤ 0 with π = e^{−V}/Z. (132)
3.4 Problem set

Problem 3.1. Prove the central limit theorem for the Mehler flow in (70).
Problem 3.2. Derive the Mehler energy formula E_t(f) in (74) and show that

(d/dt) E_t(f) = −2 E_t(f). (133)

Problem 3.3. Let A be the Ornstein-Uhlenbeck operator defined in (91) and L_M the Mehler operator defined in (69).

1. Show that L_M = A* in L²(dx).

2. Show that L_M* = A in L²(e^{−|x|²/4} dx).

Problem 3.4. Prove Boltzmann's H-theorem in (132).

Problem 3.5. The KLS conjecture (Conjecture 3.8) implies a weaker variance conjecture: there exists an absolute constant C > 0 such that for any log-concave probability measure ν on R^n, we have

Var_{X∼ν}(|X|²) ≤ C λ_ν² E_{X∼ν}[|X|²], (134)

where λ_ν is the square root of the largest eigenvalue of the covariance matrix Σ = E_{X∼ν}[XX^T]. In particular, if Σ = I_n (the isotropic case), then the variance conjecture reads

Var_{X∼ν}(|X|²) ≤ Cn. (135)

It is easy to see that the variance conjecture in (134) is a special case of the KLS conjecture by considering f(x) = |x|².
1. Gaussian Poincaré inequality. Let γ be the standard Gaussian measure on R^n and f : R^n → R be a Lipschitz function. Show that

Var_γ(f) ≤ E_γ[|∇f|²], (136)

where the equality is attained for f(x) = x_1 + ··· + x_n, x = (x_1, ..., x_n).

2. Show that the Gaussian Poincaré inequality (136) is sharp (up to a multiplicative constant) for f(x) = x_1² + ··· + x_n².

3. Exponential Poincaré inequality. Let X be a Laplace (i.e., double-exponential) random variable on R (i.e., the density ν of X is given by ν(x) = (1/2) e^{−|x|} for x ∈ R). Show that

Var(f(X)) ≤ 4 E[f'(X)²]. (137)

4. Now, using the Efron-Stein inequality (i.e., a special case of the tensorization technique), show that

Var_{ν^{⊗n}}(f) ≤ 4 E_{ν^{⊗n}}[|∇f|²]. (138)
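For part 1 above, the inequality can be verified numerically for a specific test function; the choice f(x) = sin(x) below is ours, and Gauss-Hermite quadrature makes the check deterministic (in closed form, Var_γ(sin) = (1 − e^{−2})/2 and E_γ[cos²] = (1 + e^{−2})/2).

```python
import numpy as np

# Check Var_gamma(f) <= E_gamma[|f'|^2] (Gaussian Poincare inequality (136))
# for f = sin in one dimension, via Gauss-Hermite quadrature (weight e^{-t^2}).
nodes, weights = np.polynomial.hermite.hermgauss(80)

def gauss_mean(g):
    # E[g(X)] for X ~ N(0,1), using the substitution x = sqrt(2) * t
    return np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

var_f = gauss_mean(lambda x: np.sin(x) ** 2) - gauss_mean(np.sin) ** 2
energy = gauss_mean(lambda x: np.cos(x) ** 2)   # E[|f'|^2]
```

The gap between the two sides shows that sin is not an equality case, consistent with equality holding only for linear functions.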
4 Nonlinear parabolic equations: interacting diffusions
4.1 McKean-Vlasov equation

Consider the following nonlinear continuity equation:

∂_t ρ = ∆ρ − ∇·(ρ ∫_{R^n} b(·, y) ρ_t(dy)), (139)

where b : R^n × R^n → R^n is a bounded Lipschitz function and ρ_t(·) = ρ(·, t). Note that (139) is a nonlinear parabolic PDE, which arises from McKean's example of an interacting diffusion process (in statistical physics), described by the following SDE: starting from X_0,

dX_t = (∫_{R^n} b(X_t, y) ρ_t(dy)) dt + √2 dW_t, (140)

where ρ_t(dy) is the distribution of X_t. In (140), the drift coefficient depends on the process state X_t and its marginal density ρ_t. Here we consider a simpler version of the McKean-Vlasov equations where the diffusion coefficient is constant. Note that (139) is the forward equation corresponding to the time marginals of the nonlinear process (140) (cf. Appendix A). Indeed, by Itô's lemma, for f ∈ C_b²(R^n),

df(X_t) = [∆f(X_t) + ⟨∇f(X_t), ∫ b(X_t, y) ρ_t(dy)⟩] dt + ⟨∇f(X_t), dW_t⟩, (141)

which implies that the generator A of (X_t)_{t≥0} is given by

Af = ∆f + ⟨∇f, ∫ b(·, y) ρ_t(dy)⟩. (142)

Thus we get the adjoint operator

A*ρ = ∆ρ − ∇·(ρ ∫ b(·, y) ρ_t(dy)), (143)
which entails the forward equation (139). Conversely, one can construct a unique nonlinear process with the time marginal distributions satisfying the continuity equation (139).

Theorem 4.1 (Theorem 1.1 in [14]). Existence and uniqueness, both trajectorial and in distribution, hold for the solution of (139).

A construction of (a system of) the interacting diffusion processes is given in (145) below, based on the interacting N-particle system.
4.2 Mean field asymptotics

Consider the N-particle system: for each particle i = 1, ..., N,
dX_t^{i,N} = (1/N) Σ_{j=1}^N b(X_t^{i,N}, X_t^{j,N}) dt + √2 dW_t^i, (144)
starting from X_0^{i,N}. Note that Theorem 4.1 allows us to construct the processes (X_t^i)_{t≥0} for each i ≥ 1 as the solution of

X_t^i = X_0^i + √2 W_t^i + ∫_0^t ∫ b(X_s^i, y) ρ_s(dy) ds, (145)

where ρ_s(dy) is the distribution of X_s^i. The following theorem gives a guarantee that the (weak) solution of the McKean-Vlasov equation is governed by the mean-field limit.

Theorem 4.2 (Finite-N approximation error for the mean-field limit). For any i ≥ 1 and T > 0, we have

sup_N √N E[sup_{t≤T} |X_t^{i,N} − X_t^i|] < ∞. (146)

The Vlasov equation is a special case of (144) where b(x, x') = ∇U(x − x') and U : R^n → R is a potential for the particle interactions. In this case, the system of Langevin equations for the N-particle system becomes
dX_t^{i,N} = (1/N) Σ_{j=1}^N ∇U(X_t^{i,N} − X_t^{j,N}) dt + √2 dW_t^i. (147)
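A minimal simulation of the particle system (144) is sketched below, with the linear interaction b(x, y) = y − x (equivalently, U(z) = −|z|²/2 in (147)); the interaction and all sizes are our own choices. Each particle is attracted to the empirical mean, so in the mean-field limit it behaves as an Ornstein-Uhlenbeck process around E[X_0] with stationary variance 1, and for large N the empirical mean is nearly frozen.

```python
import numpy as np

rng = np.random.default_rng(1)

# dX^i = (Xbar - X^i) dt + sqrt(2) dW^i, an instance of (144) with b(x,y) = y - x
N, dt, n_steps = 2000, 0.01, 300       # time horizon T = 3
X = rng.normal(0.0, 3.0, N)            # chaotic initialization, rho_0 = N(0, 9)
mean0 = X.mean()

for _ in range(n_steps):
    drift = X.mean() - X               # (1/N) * sum_j b(X^i, X^j)
    X = X + drift * dt + np.sqrt(2.0 * dt) * rng.normal(size=N)

mean_T, var_T = X.mean(), X.var()      # mean stays ~ mean0; variance ~ 1
```

The empirical mean performs a Brownian motion of variance 2t/N, which vanishes as N grows, a simple instance of the mean-field picture.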
If the initial distribution of X_0 is chaotic (i.e., X_0 ∼ ρ_0^{⊗N}) and U is smooth, then for each t, we have the mean-field limit:

ρ_t^{(N)} ⇀ ρ_t, as N → ∞, (148)

where ρ_t^{(N)} = N^{−1} Σ_{i=1}^N δ_{X_t^{i,N}} is the empirical measure of the particles X_t^{i,N} and ⇀ denotes weak convergence.

On the other hand, the Vlasov equation (139) can be viewed as a W_2-gradient flow for the functional

F(ρ) = ∫ ρ log ρ + (1/2) ∬ U(x − y) dρ(x) dρ(y), (149)
because the first variation of the interaction potential

U(ρ) = (1/2) ∬ U(x − y) dρ(x) dρ(y) (150)

at ρ equals (δU/δρ)(ρ) = U ∗ ρ. Since ∇(U ∗ ρ) = (∇U) ∗ ρ, the continuity equation in (56) with F given in (149) becomes

∂_t ρ − ∆ρ − ∇·(ρ (∇U) ∗ ρ) = 0, (151)

which can be thought of as a nonlinear Fokker-Planck equation (62) with a state- and density-dependent free energy functional (see the similar comments for the SDE version (140)).
4.3 Application: landscape of two-layer neural networks

We present an application of the mean-field asymptotics to approximate the dynamics of stochastic gradient descent (SGD) for learning two-layer neural networks [9].

In supervised machine learning problems, one often wants to learn the relationship between (x, y) ∈ R^n × R, where x is the feature vector and y is the label such that (x, y) ∼ µ. Suppose a sample of independent and identically distributed (i.i.d.) data (x_i, y_i)_{i=1}^m is drawn and observed from the joint distribution µ. In a two-layer neural network, the dependence of y on x is modeled as

ŷ(x; θ) = (1/N) Σ_{j=1}^N σ_*(x; θ_j), (152)

where N is the number of hidden units (i.e., neurons), σ_* : R^n × R^p → R is an activation function, and θ_j ∈ R^p are the parameters of the j-th hidden unit. Here we collectively denote all parameters as θ = (θ_1, ..., θ_N). Let ℓ(ŷ, y) = (ŷ − y)² be the square loss function. Given the activation function σ_*, the learning problem aims to find the "best" parameters θ such that the finite-N risk (i.e., generalization error)

R_N(θ) = E[ℓ(ŷ(x; θ), y)] = E[ŷ(x; θ) − y]² (153)

is minimized. The parameter learning is often carried out by SGD, which iteratively updates the parameters by the following rule:

θ_j^{k+1} = θ_j^k + 2 s_k (y_k − ŷ(x_k; θ^k)) ∇_{θ_j} σ_*(x_k; θ_j^k), (154)
where θ^k = (θ_1^k, ..., θ_N^k) denotes the parameters after k iterations and s_k is a step size. Note that we can decompose the finite-N risk as

R_N(θ) = R_♯ + (2/N) Σ_{j=1}^N V(θ_j) + (1/N²) Σ_{i,j=1}^N U(θ_i, θ_j), (155)

where R_♯ = E[y²] is the constant risk of ŷ = 0, V(θ_j) = −E[y σ_*(x; θ_j)], and U(θ_i, θ_j) = E[σ_*(x; θ_i) σ_*(x; θ_j)]. Let ρ^{(N)} = N^{−1} Σ_{j=1}^N δ_{θ_j} be the empirical measure of θ_1, ..., θ_N. Then R_N in (155) can be written as

R_N(θ) = R_♯ + 2 ∫ V(θ) ρ^{(N)}(dθ) + ∬ U(θ_1, θ_2) ρ^{(N)}(dθ_1) ρ^{(N)}(dθ_2), (156)
which suggests considering the minimization of a mean-field limit risk (as N → ∞) defined for probability measures ρ ∈ P(R^p):

R(ρ) = R_♯ + 2 ∫ V(θ) ρ(dθ) + ∬ U(θ_1, θ_2) ρ(dθ_1) ρ(dθ_2). (157)

Under mild assumptions,

inf_{θ∈(R^p)^{⊗N}} R_N(θ) = inf_{ρ∈P(R^p)} R(ρ) + O(1/N), (158)

so that the mean-field limit provides a good approximation of the finite-N risk. In addition, if the SGD (154) has a chaotic initialization θ^0 = (θ_1^0, ..., θ_N^0) ∼ ρ_0^{⊗N} and its step size is chosen as s_k = ε ξ(kε) for some sufficiently regular function ξ : R_+ → R_+, then the SGD for estimating θ is also well approximated by a continuum dynamics on P(R^p) in the sense that, for each fixed t ≥ 0,

ρ_{t/ε}^{(N)} ⇀ ρ_t, as N → ∞, ε → 0, (159)
where ρ_{t/ε}^{(N)} = N^{−1} Σ_{j=1}^N δ_{θ_j^k} is the empirical distribution after k SGD steps (with k = ⌊t/ε⌋) and (ρ_t)_{t≥0} is a weak solution of the following distributional dynamics:

∂_t ρ_t = 2 ξ(t) ∇_θ · (ρ_t ∇_θ Ψ(θ; ρ_t)) (160)

and

Ψ(θ; ρ) = V(θ) + ∫ U(θ, θ_2) ρ(dθ_2). (161)
In particular, if ξ(t) ≡ 1, then the distributional dynamics in (160) and (161) is the Wasserstein gradient flow of the risk R(ρ) in (P_2(R^p), W_2).

Hence the mean-field approximation provides an "averaging out" perspective on the complexity of the landscape of neural networks, which can be useful to prove the convergence of the SGD algorithm or its variants.
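On a finite dataset the expectations defining R_N, V, and U become empirical averages, so the decomposition (155) is an exact algebraic identity and can be checked directly; the activation σ_*(x; θ) = tanh(⟨x, θ⟩) and all sizes below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

m, n, N = 50, 3, 4                     # samples, feature dim, hidden units
xs = rng.normal(size=(m, n))
ys = rng.normal(size=m)
thetas = rng.normal(size=(N, n))

acts = np.tanh(xs @ thetas.T)          # m x N matrix of sigma_*(x_i; theta_j)
y_hat = acts.mean(axis=1)              # two-layer network output, cf. (152)
R_N = np.mean((y_hat - ys) ** 2)       # finite-N risk (153)

R_sharp = np.mean(ys ** 2)                    # risk of the zero predictor
V = -np.mean(ys[:, None] * acts, axis=0)      # V(theta_j) = -E[y sigma_*]
U = acts.T @ acts / m                         # U(theta_i, theta_j)

R_decomposed = R_sharp + 2.0 / N * V.sum() + U.sum() / N**2   # RHS of (155)
```

Expanding the square (ŷ − y)² and interchanging the sums reproduces the three terms, which is exactly what the assertion below verifies numerically.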
5 Diffusions on manifold
5.1 Heat equation on manifold

In order to speak of the heat equation on a manifold, we first need to extend the concepts of the gradient ∇, the divergence ∇·, and the Laplace operator ∆ = ∇·∇ from Euclidean space to the manifold setting.

Definition 5.1 (Divergence). Let X be a vector field on a smooth manifold (M, O, A) with connection ∇. On a chart (U, x), the divergence ∇· : Γ(TM) → C^∞(M) (sometimes also written as div(·)) is defined as

div(X) := ∇·X = (∇_{∂/∂x^i} X)^i, (162)

where ∇_{∂/∂x^i} X is the covariant derivative of X in the direction of ∂/∂x^i and the last expression is implied by the Einstein summation convention (cf. Appendix C.2 for the notation).
Remark 5.2. Often we also write the divergence as ∇_M· or div_M(·) to denote explicitly that it is a divergence on a (curved) manifold (rather than on a Euclidean space with a global coordinate system). Moreover, even though the divergence is defined on charts, one can show that the divergence of a vector field is chart-independent. Similar comments apply to the gradient and the Laplace-Beltrami operator below.

The next question is how to define an analog of the gradient vector field. Recall that for a (smooth) function f : R^n → R, the gradient is a map ∇f : R^n → R^n. Viewing the gradient as a vector field on R^n, we get the gradient operator ∇ : C^∞(R^n) → Γ(T R^n), where Γ(T R^n) is the set of all vector fields on the tangent bundle T R^n ≅ R^n (cf. Appendix C.4 for more details). Replacing the Euclidean space R^n by a Riemannian manifold M, we can define the gradient on such manifolds.

Definition 5.3 (Gradient on manifold). On a Riemannian manifold (M, O, A, g), the gradient is defined as the map

∇ : C^∞(M) → Γ(TM), f ↦ ∇f := ♭^{−1}(df), for f ∈ C^∞(M), (163)

where ♭ := ♭_g : Γ(TM) → Γ(T*M) is the metric-induced (invertible) musical map defined in (270) (cf. Appendix C.6).

With the concepts of gradient (Definition 5.3) and divergence (Definition 5.1), we can talk about the Laplace-Beltrami operator on a Riemannian manifold.

Definition 5.4 (Laplace-Beltrami operator). On a Riemannian manifold (M, O, A, g), the Laplace-Beltrami operator is the linear operator ∆ : C^∞(M) → C^∞(M) defined as

∆f = ∇·∇f. (164)

In particular, on a (positively) oriented Riemannian manifold (M, O, A↑, g), the Laplace-Beltrami operator in a given chart (U, x) ∈ A↑ can be expressed as (cf. Chapter 1 in [12])
∆f = (1/√det(g)) ∂/∂x^i (√det(g) g^{ij} ∂f/∂x^j). (165)

The (linear) heat equation on a Riemannian manifold (M, O, A, g) is defined as
(∂t − ∆)u = 0, (166) where u : M ×[0, ∞) → R is a smooth function and ∆ is the Laplace-Beltrami operator (164).
5.2 Application: manifold clustering

We present an application of manifold clustering based on the idea of heat diffusion on Riemannian manifolds [4].

For k = 1, ..., K, let M_k be compact n_k-dimensional Riemannian manifolds that are disjoint. Suppose we observe a sequence of independent random variables X_1, ..., X_m taking values in M := ⊔_{k=1}^K M_k. Suppose further that there exists a clustering structure G_1*, ..., G_K* (i.e., a partition of [m] := {1, ..., m} such that [m] = ⊔_{k=1}^K G_k*) such that each of the m data points belongs to one of the K clusters: if i ∈ G_k*, then X_i ∼ µ_k for some probability measure µ_k supported on M_k. Given the observations X_1, ..., X_m, the manifold clustering task is to develop (computationally feasible) algorithms with strong theoretical guarantees to recover the true clustering structure G_1*, ..., G_K*.
25 6 Nonlinear geometric equations
6.1 Mean curvature flow

Now we turn to nonlinear and geometric continuity equations that describe the (time) evolution of manifolds deformed by a certain volume functional. Let M := M^n ⊂ R^N be an n-dimensional closed submanifold properly embedded in an ambient Euclidean space R^N. For p ∈ M, denote by T_pM the tangent space to M at p, and by T_p^⊥M the orthogonal complement of T_pM. Let

A : T_pM × T_pM → T_p^⊥M (167)

be the second fundamental form, a bilinear and symmetric map defined as

A(u, v) = (∇_v u)^⊥, (168)

where u and v are two vector fields tangent to M, and ∇_v u is the Euclidean covariant derivative of u in the direction of v (cf. the definition of covariant derivative in Appendix C.5). The mean curvature is defined as

H = −tr(A) (169)

such that H(p) = −Σ_{i=1}^n (∇_{e_i} e_i)^⊥, where (e_i)_{i=1}^n is an orthonormal basis (ONB) for T_pM. Given an infinitely differentiable (i.e., smooth), compactly supported, normal vector field v on M, consider the one-parameter variation
M_{s,v} = {x + s v(x) : x ∈ M}, (170)

which gives a curve s ↦ M_{s,v} in the space of submanifolds with M_{0,v} = M. To study the geometric flow of the one-parameter family of submanifolds, we first need to compute the time derivative of volume, which is given by the first variation formula of volume.

Lemma 6.1 (First variation of the volume functional). The first variation formula of volume is given by

(d/ds)|_{s=0} Vol(M_{s,v}) = ∫_M ⟨v, H⟩, (171)

where H is the mean curvature vector in (169) and the integration over M is with respect to the volume form dVol_M.

Example 6.2 (n-sphere). Consider the n-sphere M = {x ∈ R^{n+1} : |x| = r} with radius r embedded in R^{n+1} for n ≥ 1, and let v = x/|x| be the unit normal vector field (since M is a hypersurface in R^{n+1}). Consequently the one-parameter variation of M is given by

M_{s,v} = {x + s x/|x| : x ∈ M} = {(1 + s/r) x : |x| = r}. (172)
Then,

Vol(M_{s,v}) = (1 + s/r)^n Vol(M) (173)

and

(Vol(M_{s,v}) − Vol(M))/s = (((1 + s/r)^n − 1)/s) Vol(M). (174)

Letting s → 0, we get

(d/ds)|_{s=0} Vol(M_{s,v}) = (n/r) Vol(M). (175)

On the other hand,

⟨H, v⟩ = −Σ_{i=1}^n ⟨∇_{e_i}^⊥ e_i, v⟩ = Σ_{i=1}^n ⟨e_i, ∇_{e_i} v⟩ = ∇_M · v = ∇_M · (x/|x|) = n/r, (176)

which implies that

∫_M ⟨H, v⟩ = (n/r) Vol(M). (177)

Combining (175) and (177), we see that (171) indeed holds for the n-sphere.

Definition 6.3 (Mean curvature flow). A one-parameter family of n-dimensional submanifolds M_t^n ⊂ R^N is said to flow (or move by motion) by the mean curvature flow (MCF) if
x_t = −H, (178)

where x_t = (∂x/∂t)^⊥ is the normal component of the time derivative of the position vector x. If n = 1, then (178) is also called the curve-shortening flow. In light of the first variation formula (171), we see that the mean curvature flow (178) is the negative gradient flow of volume.

Let M_t^n ⊂ R^N be n-dimensional submanifolds flowing by the MCF. By Lemma ??, the coordinate functions x_1, ..., x_N of the ambient Euclidean space restricted to the evolving submanifolds satisfy the (nonlinear) heat equation
(∂t − ∆Mt )xi = 0, i = 1,...,N. (179)
In fact, the heat equation (179) is a characterization of the mean curvature flow (178).

Definition 6.4 (Gaussian surface area or volume). Let M ⊂ R^N be an n-dimensional submanifold. Define the Gaussian surface area or volume as

F(M) = (4π)^{−n/2} ∫_M e^{−|x|²/4}. (180)
6.2 Problem set

Problem 6.1. Prove the first variation formula of volume in Lemma 6.1.

Problem 6.2. Prove that the coordinate functions satisfy the nonlinear heat equation (179).
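Related to Problem 6.1, the n-sphere computation in Example 6.2 admits a one-line numerical check (the values below are arbitrary choices of ours): differentiating Vol(M_{s,v}) = (1 + s/r)^n Vol(M) at s = 0 recovers (n/r) Vol(M), i.e., ∫_M ⟨v, H⟩.

```python
# Finite-difference check of (175): d/ds of (1 + s/r)^n Vol(M) at s = 0
# equals (n/r) Vol(M), matching the first variation formula (171).
n, r, vol = 4, 2.0, 10.0

def vol_s(s):
    return (1.0 + s / r) ** n * vol    # Vol(M_{s,v}) from (173)

h = 1e-6
deriv = (vol_s(h) - vol_s(-h)) / (2.0 * h)   # central difference at s = 0
expected = n / r * vol                       # (n/r) Vol(M)
```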
A From SDE to PDE, and back
We derive the PDE from the associated SDE. The derivation is based on Itô's calculus and a guess-and-verify trick. Let (X_t)_{t≥0} be a one-dimensional diffusion process of the form
dXt = m(Xt, t) dt + σ(Xt, t) dWt. (181)
For a twice differentiable function f(x, t), Itô's lemma gives
df(X_t, t) = (∂_t f + m ∂_x f + D ∂_x² f) dt + σ ∂_x f dW_t, (182)

where D(X_t, t) = σ(X_t, t)²/2. Equation (182) implies that the backward equation is given by

∂_t f(x, t) + m(x, t) ∂_x f(x, t) + D(x, t) ∂_x² f(x, t) = 0. (183)

Indeed, if (183) holds, then (182) implies that for any T > t,

f(X_T, T) − f(X_t, t) = ∫_t^T σ(X_s, s) ∂_x f(X_s, s) dW_s. (184)
Taking the conditional expectation on X_t, we get

E[f(X_T, T) | X_t] − f(X_t, t) = E[∫_t^T σ(X_s, s) ∂_x f(X_s, s) dW_s | X_t] = 0, (185)

where the last step follows from the fact that an Itô integral is a martingale. Now we need to verify that the solution f to the equation
f(X_t, t) = E[f(X_T, T) | X_t] (186)

should satisfy the assumed backward equation (183). (In finance, the function f(x, t) is called the value function at time t, and (186) expresses it in terms of the value function at a future time point T > t.) Using T = t + dt and the Itô formula trick (i.e., expanding the time derivative to first order and the spatial derivative to second order, due to the quadratic variation of (W_t)_{t≥0}), we see that a solution to (186) satisfies the backward equation (183). Now if we define the differential operator A as
Af = m ∂_x f + D ∂_x² f, (187)

then the backward equation (183) can be written as
∂tf + Af = 0. (188)
The operator A is called the generator of the diffusion process (X_t)_{t≥0}. Next we find the adjoint operator A* in L²(dx) satisfying ⟨Af, ρ⟩ = ⟨f, A*ρ⟩ for all ρ and f, where ⟨f, g⟩ = ∫ fg. Integrating by parts, we see that the adjoint operator A* in L²(dx) is given by

A*ρ = −∂_x(mρ) + ∂_x²(Dρ). (189)

Thus the forward equation is determined by the adjoint operator:
∂_t ρ = A*ρ. (190)

Note that (190) is nothing but the one-dimensional Fokker-Planck equation in PDE form:

∂_t ρ(x, t) = −∂_x(m(x, t) ρ(x, t)) + ∂_x²(D(x, t) ρ(x, t)), (191)

where ρ(x, t) is the density of particles at position x at time t. This is indeed a forward equation, since it predicts ρ(x, t) from the initial datum ρ(x, 0). The process of deriving the SDE from the PDE in the Fokker-Planck case is described in Section 3.2.
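The duality ⟨Af, ρ⟩ = ⟨f, A*ρ⟩ behind (189) can be verified numerically: for rapidly decaying test functions, the boundary terms in the integration by parts vanish. The Ornstein-Uhlenbeck coefficients m(x) = −x, D = 1 and the Gaussian test functions below are our own choices.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
m, D = -x, np.ones_like(x)            # OU drift and diffusion coefficients

f = np.exp(-(x - 1.0) ** 2)           # test function for the generator
rho = np.exp(-(x + 0.5) ** 2 / 2.0)   # test density (unnormalized)

def d(u):
    return np.gradient(u, dx)         # finite-difference d/dx

Af = m * d(f) + D * d(d(f))                  # generator (187)
A_star_rho = -d(m * rho) + d(d(D * rho))     # adjoint (189)

lhs = np.sum(Af * rho) * dx                  # <Af, rho>
rhs = np.sum(f * A_star_rho) * dx            # <f, A* rho>
```

Up to finite-difference error, the two pairings agree, which is exactly the integration-by-parts computation defining A*.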
28 B Logarithmic Sobolev inequalities and relatives
A probability measure π is said to satisfy the logarithmic Sobolev inequality (LSI) if there exists a κ > 0 such that for any ν ∈ P(R^n) with ν ≪ π,

H(ν‖π) ≤ (1/(2κ)) I(ν‖π). (192)
B.1 Gaussian Sobolev inequalities

Let γ be the standard Gaussian measure on R^n, i.e., γ(x) = (2π)^{−n/2} exp(−|x|²/2).

Lemma B.1 (Gaussian LSI: information-theoretic version). For any ν ∈ P(R^n) with ν ≪ γ,

H(ν‖γ) ≤ (1/2) I(ν‖γ). (193)

From Lemma B.1, we see that the standard Gaussian measure γ satisfies the LSI with κ = 1. The equality in (193) is achieved if ν is a translation of γ. To see this, take

ν(x) = (2π)^{−n/2} exp(−|x − y|²/2), y ∈ R^n. (194)

Then

H(ν‖γ) = ∫ log(ν/γ) dν = −(1/2) ∫ (|x − y|² − |x|²) dν
        = ∫ ⟨x, y⟩ dν − (1/2) ∫ |y|² dν = |y|² − (1/2)|y|² = (1/2)|y|². (195)

On the other hand, since ν ≪ γ, we have

ρ(x) = dν/dγ = exp(−|x − y|²/2 + |x|²/2) = exp(⟨x, y⟩) exp(−|y|²/2) (196)

and

∇ρ(x) = y exp(⟨x, y⟩) exp(−|y|²/2) = yρ. (197)

Then

I(ν‖γ) = ∫ (|y|²ρ²/ρ) dγ = ∫ |y|²ρ dγ = |y|². (198)

Combining (195) and (198), we conclude H(ν‖γ) = (1/2) I(ν‖γ). In fact, translation is the only case that is currently known to achieve equality. Hence we pose the following conjecture.

Conjecture B.2 (Characterization of the equality case in the Gaussian LSI). The equality H(ν‖γ) = (1/2) I(ν‖γ) in the Gaussian LSI is achieved if and only if ν is a translation of γ, i.e., ν(x) = γ(x + y) for some y ∈ R^n.

Lemma B.3 (Gaussian LSI: functional version). Let f : R^n → R be a smooth function such that f ≥ 0 and ∫ f² dγ = 1. Then

∫ f² log f² dγ ≤ 2 ∫ |∇f|² dγ. (199)
Equivalently, we can write

Ent_γ(f²) ≤ 2 ∫ |∇f|² dγ, (200)

where Ent_γ(g) = ∫ g log g dγ is the entropy of the probability density g ≥ 0.

It is easy to see that Lemma B.3 and Lemma B.1 are equivalent. Let g = f². Then ∇g = 2f∇f and ∇f = (1/(2√g)) ∇g. Note that (199) is equivalent to

∫ g log g dγ ≤ 2 ∫ |(1/(2√g)) ∇g|² dγ = (1/2) ∫ (|∇g|²/g) dγ. (201)

Since g ≥ 0 and ∫ g dγ = 1, we can define a probability measure µ such that dµ/dγ = g. Then (201) can be written as

H(µ‖γ) ≤ (1/2) I(µ‖γ). (202)

Since g is arbitrary, the equivalence between Lemma B.3 and Lemma B.1 follows.
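Beyond the translation family (194), the Gaussian LSI (193) can be checked in closed form on the two-parameter family ν = N(µ, s²) (our own exercise; the formulas below are the standard Gaussian KL and relative Fisher information): H(ν‖γ) = (µ² + s² − 1 − log s²)/2 and I(ν‖γ) = µ² + (s² − 1)²/s², so the LSI gap (1/2)I − H = (log s² + 1/s² − 1)/2 is nonnegative and vanishes exactly when s² = 1, consistent with Conjecture B.2.

```python
import numpy as np

mus = np.linspace(-3.0, 3.0, 25)
s2s = np.linspace(0.1, 5.0, 25)
M, S2 = np.meshgrid(mus, s2s)

H = 0.5 * (M**2 + S2 - 1.0 - np.log(S2))     # H(nu || gamma)
I = M**2 + (S2 - 1.0) ** 2 / S2              # I(nu || gamma)
gap = 0.5 * I - H                            # LSI (193): gap >= 0

# equality case: s^2 = 1 (a translation of gamma), here with mu = 2
H_eq = 0.5 * (2.0**2 + 1.0 - 1.0 - np.log(1.0))
I_eq = 2.0**2 + (1.0 - 1.0) ** 2 / 1.0
```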
B.2 Euclidean Sobolev inequalities

There are Euclidean LSIs, and the following Lemma B.4 is a version with sharp constant.

Lemma B.4 (Euclidean LSI). Let f : R^n → R be a smooth function such that ∫ f² dx = 1. Then

∫ f² log f² dx ≤ (n/2) log((2/(nπe)) ∫ |∇f|² dx). (203)

Lemma B.5 (Sobolev inequality). Let f : R^n → R be a smooth function with compact support. Then for all n ≥ 3,

(∫ |f|^{2n/(n−2)})^{(n−2)/n} ≤ C(n) ∫ |∇f|², (204)

where

C(n) = (1/(πn(n − 2))) (Γ(n)/Γ(n/2))^{2/n}, n ≥ 3. (205)

Note that the Gaussian LSIs (Lemma B.1 and B.3) are dimension-free inequalities, while the Euclidean LSI (Lemma B.4) and the Sobolev inequality (Lemma B.5) are not. The constants in (203) and (204) are both sharp. Indeed, a direct computation recovers the constant 2/(nπe) in (203) from (205). Let m = nk with k → ∞ and take a function f : R^{nk} → R such that ∫ f² dx = 1. Then

C(m) = (1/(πnk(nk − 2))) (Γ(nk)/Γ(nk/2))^{2/nk} ∼ (1/(πn²k²)) (2nk/e) = 2/(nkπe) = 2/(mπe), (206)

where we used Stirling's approximation

(nk)! ∼ √(2πnk) (nk/e)^{nk}. (207)

C Riemannian geometry: some basics
In this section, we collect some basic facts and results about Riemannian geometry.
C.1 Smooth manifolds

Definition C.1 (Topological manifold). A topological space (M, O) is said to be an n-dimensional topological manifold if for any p ∈ M, there is an open set U ∈ O containing p such that there exists a map x : U → x(U) ⊂ R^n (equipped with the standard topology on R^n) satisfying: (i) x is invertible, i.e., there is a map x^{−1} : x(U) → U; (ii) x is continuous; (iii) x^{−1} is continuous. In other words, x is a homeomorphism between U and x(U).

The pair (U, x) in Definition C.1 is called a chart and x is called the chart map. For p ∈ U and x(p) = (x^1(p), ..., x^n(p)), x^i(p) is the i-th coordinate of x(p). An atlas is a collection of charts A = {(U_α, x_α) : α ∈ A} such that M = ∪_{α∈A} U_α.

For two charts (U, x) and (V, y) such that U ∩ V ≠ ∅, the map y ∘ x^{−1} : R^n ⊃ x(U ∩ V) → y(U ∩ V) ⊂ R^n is said to be the chart transition map. We say (U, x) and (V, y) are ♣-compatible if either U ∩ V = ∅, or U ∩ V ≠ ∅ and the chart transition maps y ∘ x^{−1} : x(U ∩ V) → y(U ∩ V) and x ∘ y^{−1} : y(U ∩ V) → x(U ∩ V) have a certain ♣-property (as maps from R^n to R^n). For instance, the ♣-property can be C^0(R^n), C^1(R^n), ..., C^∞(R^n), and so on. An atlas A is ♣-compatible if the transition maps between any two charts in A have the ♣-property.

Definition C.2 (Smooth manifold). A smooth manifold (M, O, A) is a topological manifold (M, O) equipped with a C^∞(R^n)-compatible atlas A. In these notes, we shall only consider smooth manifolds unless otherwise indicated.

Definition C.3 (Smooth functions on manifold). Let (M, O, A) be a smooth manifold. A function f : M → R is said to be a smooth map (or C^∞-map) if f ∘ y^{−1} : R^n → R is smooth for every chart y in the atlas A.

According to Definition C.3, the coordinate functions x^1, ..., x^n of any chart (U, x) in a C^∞-compatible atlas are all smooth maps.
A map φ : M → N is smooth if there is a chart (U, x) in M and a chart (V, y) in N such that φ(U) ⊂ V and the map y ∘ φ ∘ x^{−1} : R^m ⊃ x(U) → y(V) ⊂ R^n is C^∞.
Definition C.4 (Diffeomorphism). Let (M, O_M, A_M) and (N, O_N, A_N) be two smooth manifolds. We say that (M, O_M, A_M) and (N, O_N, A_N) are diffeomorphic if there exists a bijection φ : M → N such that φ : M → N and φ^{−1} : N → M are both smooth maps.
C.2 Tensors

Let (V, +, ·) be a vector space and (V*, ⊕, ⊙) its dual space, where

V* = {φ : V → R linear} =: Hom(V, R) (208)

is the set of linear functionals on V. An element φ ∈ V* is called a covector.

Definition C.5 (Tensor). Let (V, +, ·) be a vector space and r, s ∈ N_0 := {0, 1, ...}. An (r, s)-tensor T over V is an R-multi-linear map

T : V* × ··· × V* (r times) × V × ··· × V (s times) → R. (209)

By Definition C.5, a covector φ ∈ V* is a (0, 1)-tensor over V. If dim(V) < ∞, then V ≅ (V*)* is an isomorphism, so that v ∈ V can be identified with a linear map V* → R, which means that v is a (1, 0)-tensor.
Let V be an n-dimensional vector space with an (arbitrarily chosen) basis (e_1, ..., e_n). Then the dual basis (ε^1, ..., ε^n) for V* is uniquely determined by

ε^i(e_j) = δ^i_j, (210)

where δ^i_j = 1 if i = j, and δ^i_j = 0 if i ≠ j.

Definition C.6 (Components of tensor). Let T be an (r, s)-tensor over an n-dimensional vector space V with n < ∞. Let (e_1, ..., e_n) be a basis of V and (ε^1, ..., ε^n) be the dual basis of V*. Then the components of T w.r.t. the chosen basis are defined as the n^{r+s} real numbers (sometimes called coefficients)

T^{i_1,...,i_r}_{j_1,...,j_s} = T(ε^{i_1}, ..., ε^{i_r}, e_{j_1}, ..., e_{j_s}) (211)

for i_1, ..., i_r, j_1, ..., j_s ∈ {1, ..., n}.

As an example, consider a (1, 1)-tensor T with components given by T^i_j = T(ε^i, e_j). Then for any v ∈ V and φ ∈ V*, we can express
T(φ, v) = T(Σ_{i=1}^n φ_i ε^i, Σ_{j=1}^n v^j e_j) = Σ_{i=1}^n Σ_{j=1}^n φ_i v^j T(ε^i, e_j) = Σ_{i=1}^n Σ_{j=1}^n φ_i v^j T^i_j.    (212)
Thus the components fully determine the tensor (given the basis). To avoid writing too many sums, we typically use the Einstein summation convention:
T(φ, v) = φ_i v^j T^i_j,    (213)

where the repeated up-and-down indices i and j are summed over.
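As a quick numerical illustration (not part of the notes), take V = R^3 with the standard basis and a (1, 1)-tensor given by a 3×3 component array T^i_j. The double contraction (213) then reproduces the direct evaluation T(φ, v) = φ(T v), where the (1, 1)-tensor is identified with a linear map in the usual way:

```python
# A (1,1)-tensor on R^3, identified with the linear map v -> T v and
# evaluated as T(phi, v) = phi(T v).  Its components T^i_j w.r.t. the
# standard basis are just the matrix entries.
T = [[2.0, 0.0, 1.0],
     [1.0, 3.0, 0.0],
     [0.0, 1.0, 4.0]]
phi = [1.0, -2.0, 0.5]   # covector components phi_i
v   = [3.0, 1.0, 2.0]    # vector components v^j

# Einstein summation (213): T(phi, v) = phi_i v^j T^i_j
contraction = sum(phi[i] * v[j] * T[i][j]
                  for i in range(3) for j in range(3))

# Direct evaluation: phi(T v)
Tv = [sum(T[i][j] * v[j] for j in range(3)) for i in range(3)]
direct = sum(phi[i] * Tv[i] for i in range(3))

assert abs(contraction - direct) < 1e-12
print(contraction)
```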
C.3 Tangent and cotangent spaces
Definition C.7 (Velocity). Let (M, O, A) be a smooth manifold and γ : R → M be a curve (at least C^1). Let p ∈ M and γ(t_0) = p for some t_0 ∈ R. The velocity of γ at p is the linear map defined as
v_{γ,p} : C^∞(M) ∼−→ R,    f ↦ v_{γ,p}(f) = (f ◦ γ)'(t_0).    (214)
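Definition (214) can be tried out numerically. A minimal sketch (assuming M = R^2 with the identity chart, and hypothetical choices of γ and f): the velocity v_{γ,p}(f) = (f ◦ γ)'(t_0) is approximated by a central finite difference.

```python
import math

# Sketch on M = R^2 (identity chart): velocity as a directional derivative.
def gamma(t):                       # a smooth curve with gamma(0) = p = (1, 0)
    return (math.cos(t), math.sin(t))

def f(p):                           # a smooth function f : M -> R
    a, b = p
    return a**2 + 2.0 * b

def velocity(gamma, f, t0, h=1e-6):
    # central finite-difference approximation of (f ∘ gamma)'(t0)
    return (f(gamma(t0 + h)) - f(gamma(t0 - h))) / (2.0 * h)

v = velocity(gamma, f, 0.0)
# Analytically (f ∘ gamma)(t) = cos^2 t + 2 sin t, whose derivative at 0 is 2.
assert abs(v - 2.0) < 1e-6
print(v)
```

This matches the intuition below that v_{γ,p} acts as the directional derivative along γ at p.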
Note that (C^∞(M), ⊕, ⊙) forms a vector space equipped with (f ⊕ g)(p) = f(p) + g(p) and (λ ⊙ g)(p) = λ · g(p). Note further that (C^∞(M), ⊕, ⊙) is not only a vector space, it is also a ring. However, (C^∞(M), ⊕, ⊙) is not a field: a smooth function that vanishes at some point of M has no multiplicative inverse.

Definition C.8 (Tangent vector space). Let (M, O, A) be a smooth manifold. For each p ∈ M,

T_pM := {v_{γ,p} : γ is a smooth curve in M through p}    (215)

is the tangent space to M at p.
Intuitively, we may understand the velocity vγ,p (i.e., a tangent vector in TpM) as the directional derivative along the curve γ at p. The following Lemma C.9 confirms that the tangent space TpM is indeed a vector space.
Lemma C.9. For each p ∈ M, T_pM is a vector space of linear maps from C^∞(M) to R.
Proof of Lemma C.9. We shall define ⊕ and ⊙ to make (T_pM, ⊕, ⊙) a vector space. For f ∈ C^∞(M), we define
⊕ : T_pM × T_pM → Hom(C^∞(M), R),    (v_{γ,p} ⊕ v_{δ,p})(f) = v_{γ,p}(f) + v_{δ,p}(f),    (216)

and
⊙ : R × T_pM → Hom(C^∞(M), R),    (s ⊙ v_{γ,p})(f) = s · v_{γ,p}(f).    (217)
We still need to show that there are smooth curves σ and τ such that v_{σ,p} = v_{γ,p} ⊕ v_{δ,p} and v_{τ,p} = s ⊙ v_{γ,p}. We first construct the curve τ via
τ : R → M,    t ↦ τ(t) = γ(st + t_0) =: (γ ◦ μ_s)(t),    (218)

where μ_s(t) = st + t_0. Now τ(0) = γ(t_0) = p and by the chain rule,
v_{τ,p}(f) = (f ◦ τ)'(0) = (f ◦ γ ◦ μ_s)'(0) = μ_s'(0) · (f ◦ γ)'(μ_s(0))
           = s · (f ◦ γ)'(t_0) = s · v_{γ,p}(f) = (s ⊙ v_{γ,p})(f).    (219)
Next choose a chart (U, x) such that p ∈ U, and define
σ_x : R → M,    t ↦ σ_x(t) = x^{-1}((x ◦ γ)(t_0 + t) + (x ◦ δ)(t_1 + t) − (x ◦ γ)(t_0)),    (220)

where γ(t_0) = δ(t_1) = p. Then σ_x(0) = p and by the chain rule,
v_{σ_x,p}(f) = (f ◦ σ_x)'(0) = ((f ◦ x^{-1}) ◦ (x ◦ σ_x))'(0)
             = (x^i ◦ σ_x)'(0) · ∂_i(f ◦ x^{-1})(x(σ_x(0)))
             = [(x^i ◦ γ)'(t_0) + (x^i ◦ δ)'(t_1)] · ∂_i(f ◦ x^{-1})(x(p))    (221)
             = (x^i ◦ γ)'(t_0) · ∂_i(f ◦ x^{-1})(x(p)) + (x^i ◦ δ)'(t_1) · ∂_i(f ◦ x^{-1})(x(p))
             = ((f ◦ x^{-1}) ◦ (x ◦ γ))'(t_0) + ((f ◦ x^{-1}) ◦ (x ◦ δ))'(t_1)
             = (f ◦ γ)'(t_0) + (f ◦ δ)'(t_1) = v_{γ,p}(f) + v_{δ,p}(f) = (v_{γ,p} ⊕ v_{δ,p})(f),

where the repeated index i is summed over by the Einstein convention, and the last line does not depend on the choice of the chart (U, x).
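The chart construction (220) can also be checked numerically. A minimal sketch (assuming M = R^2 with the identity chart x = id, and hypothetical curves γ, δ through p = (1, 0)): the curve σ built as in (220) indeed satisfies v_{σ,p}(f) = v_{γ,p}(f) + v_{δ,p}(f).

```python
import math

# Sketch: verify the additivity in (221) on M = R^2 with the identity chart.
def gamma(t):                 # gamma(t0) = p = (1, 0) at t0 = 0
    return (math.cos(t), math.sin(t))

def delta(t):                 # delta(t1) = p = (1, 0) at t1 = 1
    return (t, t - 1.0)

def sigma(t, t0=0.0, t1=1.0):
    # (220) with x = id: sigma(t) = gamma(t0 + t) + delta(t1 + t) - gamma(t0)
    g, d, g0 = gamma(t0 + t), delta(t1 + t), gamma(t0)
    return (g[0] + d[0] - g0[0], g[1] + d[1] - g0[1])

def f(p):                     # a smooth test function f : M -> R
    a, b = p
    return a * b + 3.0 * a

def deriv(curve, f, t, h=1e-6):
    # central finite-difference approximation of (f ∘ curve)'(t)
    return (f(curve(t + h)) - f(curve(t - h))) / (2.0 * h)

lhs = deriv(sigma, f, 0.0)                              # v_{sigma,p}(f)
rhs = deriv(gamma, f, 0.0) + deriv(delta, f, 1.0)       # v_{gamma,p}(f) + v_{delta,p}(f)
assert abs(lhs - rhs) < 1e-5
print("v_sigma,p(f) = v_gamma,p(f) + v_delta,p(f) verified")
```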
Notation. In the proof of Lemma C.9, we have seen that for a curve γ : R → M such that γ(0) = p, where (U, x) is a chart containing p,