
Continuity equations, PDEs, probabilities, and flows∗

Xiaohui Chen

This version: July 5, 2020

Contents

1 Continuity equation

2 Heat equation
  2.1 Heat equation as a PDE in space and time
  2.2 Heat equation as a (continuous) Markov semi-group of operators
  2.3 Heat equation as a (negative) gradient flow in Wasserstein space

3 Fokker-Planck equations: diffusions with drift
  3.1 Mehler flow
  3.2 Fokker-Planck equation as a stochastic differential equation
  3.3 Rate of convergence for Wasserstein gradient flows of the Fokker-Planck equation
  3.4 Problem set

4 Nonlinear parabolic equations: interacting diffusions
  4.1 McKean-Vlasov equation
  4.2 Mean field asymptotics
  4.3 Application: landscape of two-layer neural networks

5 Diffusions on manifold
  5.1 Heat equation on manifold
  5.2 Application: manifold clustering

6 Nonlinear geometric equations
  6.1 Mean curvature flow
  6.2 Problem set

A From SDE to PDE, and back

B Logarithmic Sobolev inequalities and relatives
  B.1 Gaussian Sobolev inequalities
  B.2 Euclidean Sobolev inequalities

∗Working draft in progress and comments are welcome (email: [email protected]).

C Riemannian geometry: some basics
  C.1 Smooth manifolds
  C.2 Tensors
  C.3 Tangent and cotangent spaces
  C.4 Tangent bundles and tensor fields
  C.5 Connections and curvatures
  C.6 Riemannian manifolds and geodesics
  C.7 Volume forms

1 Continuity equation

Let $\Omega \subset \mathbb{R}^n$ be a spatial domain. Consider the continuity equation (CE):

$$\partial_t \mu_t + \nabla\cdot(\mu_t v_t) = 0, \qquad (1)$$

where $\mu_t$ is a probability measure on $\Omega$ (typically absolutely continuous with a density), $v_t : \Omega \to \mathbb{R}^n$ is a velocity vector field on $\Omega$, and $\nabla\cdot v$ is the divergence of a vector field $v$.

There are several meanings of solving the continuity equation (1). Given the vector field $v_t$, we can speak of a classical (or strong) solution of the partial differential equation (PDE) by thinking of $\mu_t(x)$ as a differentiable function of the two variables $x$ and $t$. We can also consider a distributional solution by integrating against some "nice" class of test functions in $(x,t)$. If the continuity equation (1) is satisfied, then for any finite time horizon $T > 0$ we can integrate against a $C^1$ function $\psi : \Omega \times [0,T] \to \mathbb{R}$ with bounded support and apply integration by parts:

$$0 = \int_0^T\!\!\int_\Omega \psi\, \partial_t \mu_t + \int_0^T\!\!\int_\Omega \psi\, \nabla\cdot(\mu_t v_t) \qquad (2)$$
$$\phantom{0} = -\int_0^T\!\!\int_\Omega (\partial_t \psi)\, \mu_t - \int_0^T\!\!\int_\Omega \langle \nabla\psi, v_t\rangle\, \mu_t, \qquad (3)$$

where there is no boundary contribution because $\psi$ has compact support, so the divergence theorem applies. Here we implicitly assumed conservation of mass of $\mu_t$ (the "first law of thermodynamics"), so that no mass escapes at the boundary (if $\Omega$ is bounded) or near infinity (if $\Omega$ is unbounded). Compared with the strong solution, the distributional formulation (3) does not require differentiability of $\mu_t$, since the derivative is moved from $\mu_t(x)$ onto $\psi(x,t)$.

Suppose further that we can interchange $\partial_t$ and $\int_\Omega$ in the first integral of (2), while keeping the second integral as in (3). Then we can take a smaller class of test functions, depending only on the spatial domain $\Omega$, so that the solution space is larger. Why shouldn't we interchange $\partial_t$ and $\int_\Omega$ in the first integral of (3)? Because if the test functions depend only on $x \in \Omega$, then the time derivative inside the integral in (3) would always equal zero, which does not make sense.

Let $\phi : \Omega \to \mathbb{R}$ be such a $C^1$ test function with bounded support. To make sense of differentiation outside the integral, we clearly need $t \mapsto \int_\Omega \phi\, \mu_t$ to be absolutely continuous in $t$. In addition, we need the identity

$$\frac{d}{dt}\int_\Omega \phi\, \mu_t - \int_\Omega \langle \nabla\phi, v_t\rangle\, \mu_t = 0 \qquad (4)$$

to hold pointwise (for a.e. $t$), so that (3) equals zero. This motivates the following definition.

Definition 1.1 (Weak solution of the continuity equation). We say that $\mu_t$ is a weak solution of (1) in the distributional sense if, for any $C^1$ test function $\phi : \Omega \to \mathbb{R}$ with bounded support, the function $t \mapsto \int_\Omega \phi\, \mu_t$ is absolutely continuous in $t$, and for a.e. $t$ we have

$$\frac{d}{dt}\int_\Omega \phi\, \mu_t = \int_\Omega \langle \nabla\phi, v_t\rangle\, \mu_t. \qquad (5)$$

In these notes, we make (more) sense of the dynamic behavior of the continuity equation and illustrate its links to classical PDEs, to probability (Markov processes and stochastic differential equations), and to trajectory analysis (gradient flows). Specifically, we would like to understand how the weak solution of the continuity equation in an infinite-dimensional metric space (typically a Wasserstein space) connects with the classical solution of PDEs in a finite-dimensional space (typically Euclidean space and time). We start from the (linear) heat equation as a concrete example.
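As a concrete sanity check of Definition 1.1 (a numerical sketch, not part of the formal development; it anticipates the heat-flow example of Section 2, where $\mu_t = N(0, 2t)$ in $d = 1$ with velocity $v_t(x) = x/(2t)$), one can verify the identity (5) for the test function $\phi(x) = \cos x$, for which both sides equal $-e^{-t}$:

```python
import numpy as np

# Check of the weak-solution identity (5) for the 1-d heat flow:
# mu_t = N(0, 2t) with velocity v_t(x) = x/(2t), test function phi = cos.
x = np.linspace(-30.0, 30.0, 20001)
dx = x[1] - x[0]

def mu(t):
    return np.exp(-x**2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)

def lhs(t, dt=1e-5):
    # d/dt of  int phi d(mu_t), by a centered finite difference in t
    f = lambda s: np.sum(np.cos(x) * mu(s)) * dx
    return (f(t + dt) - f(t - dt)) / (2.0 * dt)

def rhs(t):
    # int <grad phi, v_t> d(mu_t),  with grad phi = -sin
    return np.sum(-np.sin(x) * (x / (2.0 * t)) * mu(t)) * dx

t0 = 0.7
# Analytically both sides equal -exp(-t0), since E[cos X] = e^{-Var/2}
# for a centered Gaussian X.
print(lhs(t0), rhs(t0), -np.exp(-t0))
```

The agreement of the two sides (here to four or more digits) is exactly the statement that the Gaussian curve is a weak solution in the sense of Definition 1.1.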

2 Heat equation

2.1 Heat equation as a PDE in space and time

Recall that the heat equation (HE) on $\mathbb{R}^n$ is defined as:

$$(\partial_t - \Delta)u = \partial_t u - \nabla\cdot\nabla u = 0, \qquad (6)$$

where $u : \mathbb{R}^n \times [0,\infty) \to \mathbb{R}$ and $u(x,t)$ is a two-variable function of space and time. The fundamental solution of the heat equation (6) is given by

$$u(x,t) = H(x, 0, t), \qquad (7)$$

where $H(x,y,t)$ is the heat kernel (sometimes also called the Green's function):

$$H(x,y,t) = (4\pi t)^{-n/2} \exp\Big(-\frac{|x-y|^2}{4t}\Big), \qquad y \in \mathbb{R}^n,\ t > 0. \qquad (8)$$

The classical solution $u$ is just a (positive) function of the two variables $x$ and $t$. One can think of $H(x,y,t)$ as the transition density from $x$ to $y$ in time $2t$.

Let $\Omega = \mathbb{R}^n$ and $\mu_t(x) = u(x,t)$. Clearly $\mu_t > 0$ is the probability density of the Gaussian distribution $N(y, 2tI_n)$, and the continuity equation (1) reads

$$\partial_t \mu_t - \nabla\cdot\Big(\mu_t\, \frac{\nabla\mu_t}{\mu_t}\Big) = 0. \qquad (9)$$

In this case, the velocity vector field is given by

$$v_t(x) = -\frac{\nabla\mu_t}{\mu_t}(x) = -\frac{\nabla u}{u}(x,t), \qquad (10)$$

where the last equality is justified by the equivalence of the weak solution and the classical PDE solution (because $u(\cdot,\cdot)$ is Lipschitz continuous and $v_t(\cdot)$ is Lipschitz). Since (taking $y = 0$ as in (7))

$$\nabla u = (4\pi t)^{-n/2} \exp\Big(-\frac{|x|^2}{4t}\Big)\Big(-\frac{x}{2t}\Big) = -u\, \frac{I_n}{2t}\, x = -u\, v_t(x), \qquad (11)$$

the velocity vector field $v_t : \mathbb{R}^n \to \mathbb{R}^n$ is linear and can be represented by an $n\times n$ matrix:

$$v_t = \frac{I_n}{2t}. \qquad (12)$$

Thus in the heat equation, the velocity vector field $v_t = (2t)^{-1}I_n$ does not depend on the location $x$ (i.e., it is location-free), and it dies off as $t \to \infty$. The vanishing velocity means that the particles moving according to the heat equation eventually converge to an equilibrium distribution given by a harmonic function, $\Delta u = 0$. If a boundary value is imposed (either a Dirichlet or a Neumann problem), or if the growth of the heat at infinity is not too fast, then the harmonic limit is unique.

Let $p \ge 1$ and let $\mathcal{P}_p(\mathbb{R}^n)$ be the collection of probability measures on $\mathbb{R}^n$ for which the $p$-Wasserstein distance is well-defined, i.e.,

$$\mathcal{P}_p(\mathbb{R}^n) = \Big\{\mu \in \mathcal{P}(\mathbb{R}^n) : \int_{\mathbb{R}^n} |x - x_0|^p\, \mu(dx) < \infty \text{ for some } x_0 \in \mathbb{R}^n\Big\}, \qquad (13)$$

where $\mathcal{P}(\mathbb{R}^n)$ contains all probability measures on $\mathbb{R}^n$. Let $W_p$ be the $p$-Wasserstein distance

$$W_p^p(\mu, \nu) = \min\Big\{\int |x - y|^p\, \gamma(dx, dy) : \gamma \in \Gamma\Big\}, \qquad (14)$$

where $\Gamma$ is the set of all couplings with marginal distributions $\mu$ and $\nu$.

Definition 2.1 (Metric derivative). Let $(\mu_t)_{t\ge 0}$ be an absolutely continuous curve in the Wasserstein (metric) space $(\mathcal{P}_p(\mathbb{R}^n), W_p)$. The metric derivative at time $t$ of the curve $t \mapsto \mu_t$ w.r.t. $W_p$ is defined as

$$|\mu'|_p(t) = \lim_{h\to 0} \frac{W_p(\mu_{t+h}, \mu_t)}{|h|}. \qquad (15)$$

We write $|\mu'|(t) = |\mu'|_2(t)$.

We can compute the metric derivative of the fundamental solution curve of the heat equation. Note that $\mu_t$ is the Gaussian density $N(y, 2tI_n)$ for each $t > 0$.

Lemma 2.2 (Wasserstein distance between two Gaussians, cf. Remark 2.31 in [11]). Let $\mu_1 = N(m_1, \Sigma_1)$ and $\mu_2 = N(m_2, \Sigma_2)$. Then the optimal transport map (i.e., the Monge map) from $\mu_1$ to $\mu_2$ is given by

$$T(x) = m_2 + A(x - m_1), \qquad (16)$$

where

$$A = \Sigma_1^{-1/2}\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\big)^{1/2}\Sigma_1^{-1/2}, \qquad (17)$$

and the squared 2-Wasserstein distance between $\mu_1$ and $\mu_2$ equals

$$W_2^2(\mu_1, \mu_2) = |m_1 - m_2|^2 + \mathrm{tr}\Big\{\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\big)^{1/2}\Big\}, \qquad (18)$$

where the last term is the squared Bures distance on positive-semidefinite matrices. In particular, if $\Sigma_1\Sigma_2 = \Sigma_2\Sigma_1$, then

$$W_2^2(\mu_1, \mu_2) = |m_1 - m_2|^2 + \big|\Sigma_1^{1/2} - \Sigma_2^{1/2}\big|_F^2. \qquad (19)$$
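Lemma 2.2 is easy to check numerically. The sketch below (Python/NumPy; a verification aid, not part of the notes) implements (18) and confirms the commuting case (19) for diagonal covariances:

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric PSD square root via an eigendecomposition.
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2sq_gauss(m1, S1, m2, S2):
    # Squared 2-Wasserstein distance between N(m1,S1) and N(m2,S2), eq. (18).
    r1 = psd_sqrt(S1)
    bures = np.trace(S1 + S2 - 2.0 * psd_sqrt(r1 @ S2 @ r1))
    return np.sum((m1 - m2) ** 2) + bures

# Commuting (here: diagonal) case, eq. (19).
m1, m2 = np.zeros(3), np.ones(3)
S1, S2 = np.diag([1.0, 2.0, 3.0]), np.diag([0.5, 1.5, 2.5])
lhs = w2sq_gauss(m1, S1, m2, S2)
rhs = np.sum((m1 - m2) ** 2) + np.sum(
    (np.sqrt(np.diag(S1)) - np.sqrt(np.diag(S2))) ** 2)
print(lhs, rhs)   # equal up to floating-point error
```

The same function can be used to check the metric-derivative computation for the heat-flow curve $\mu_t = N(0, 2tI_n)$ below, by evaluating $W_2(\mu_{t+h}, \mu_t)/h$ for small $h$.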

In the Gaussian case, we have

$$W_2^2(\mu_{t+h}, \mu_t) = \big|\sqrt{2(t+h)}\,I_n - \sqrt{2t}\,I_n\big|_F^2 = 2n\big(\sqrt{t+h} - \sqrt{t}\big)^2, \qquad (20)$$

which gives a formula for $|\mu'|(t)$:

$$|\mu'|(t) = \lim_{h\to 0} \sqrt{2n}\,\frac{\sqrt{t+h} - \sqrt{t}}{|h|} = \sqrt{2n}\cdot\frac{1}{2\sqrt{t}} = \sqrt{\frac{n}{2t}}. \qquad (21)$$

On the other hand, recalling (12), we get

$$\|v_t\|_{L^2(\mu_t)}^2 := \int |v_t(x)|^2\, \mu_t(dx) = \frac{1}{4t^2}\int |x|^2\, \mu_t(dx) = \frac{\mathrm{tr}(2tI_n)}{4t^2} = \frac{n}{2t}. \qquad (22)$$

This implies that

$$|\mu'|(t) = \|v_t\|_{L^2(\mu_t)} = \sqrt{\frac{n}{2t}}. \qquad (23)$$

The equivalence in (23) is a much more general fact; see Theorem 3.1 and Corollary 3.2 in Section 3 ahead. Below we give a heuristic interpretation, based on the theory of optimal transport, of why (23) is intuitively true. Consider $p \ge 1$. If $(\mu_t)_{t\ge 0}$ is an absolutely continuous curve in $W_p$, then there is an optimal transport map $T : \Omega \to \Omega$ moving particles from $\mu_t$ to $\mu_{t+h}$ that minimizes $\int |T(x) - x|^p\, \mu_t(dx)$, i.e.,

$$W_p^p(\mu_t, \mu_{t+h}) = \int |T(x) - x|^p\, \mu_t(dx) = \int |(T - \mathrm{id})(x)|^p\, \mu_t(dx). \qquad (24)$$

The "discrete velocity" of the particle located at $x$ at time $t$ (i.e., the time-difference quotient of $T$) is given by

$$v_t(x) = \frac{T(x) - x}{h}, \qquad (25)$$

so that we have

$$\|v_t\|_{L^p(\mu_t)}^p = \int \Big|\frac{T(x) - x}{h}\Big|^p\, \mu_t(dx) = \frac{W_p^p(\mu_t, \mu_{t+h})}{|h|^p}. \qquad (26)$$

Letting $h \to 0$, we expect that $\|v_t\|_{L^p(\mu_t)} = |\mu'|_p(t)$.

The above heuristic argument (cf. (25)) suggests considering the following gradient flow (GF) for moving particles in $\mathbb{R}^n$:

$$\begin{cases} y_x'(t) = v_t(y_x(t)), \\ y_x(0) = x, \end{cases} \qquad (27)$$

where $y_x(t)$ is the time-$t$ position of the particle initially at $y_x(0) = x$, i.e., it is the trajectory of the particle starting from $x$. Thus the gradient flow of the heat equation is given by the following Cauchy problem:

$$y_x'(t) = (2t)^{-1}\, y_x(t) \qquad (28)$$

with the initial datum $y_x(0) = x$. This is an initial value problem for a first-order linear (homogeneous) ordinary differential equation, whose solution is easily computed. We can take a basic solution as

$$y_x(t) = C x \sqrt{t}, \qquad t > 0, \qquad (29)$$

for any constant $C \in \mathbb{R}$.

Let $Y_t : \Omega \to \Omega$ be the flow of the vector field $v_t$ on $\Omega$ defined through $Y_t(x) = y_x(t)$ in (27). Note that $Y_t$ is indeed a flow on $\Omega$ since $Y_0(x) = y_x(0) = x$ and

$$Y_t(Y_s(x)) = Y_t(y_x(s)) = y_{y_x(s)}(t) = y_x(s+t) = Y_{s+t}(x)$$

for any $s, t \ge 0$, so that $(Y_t)$ is an action on $\Omega$ of the additive group $\mathbb{R}_+ = [0,\infty)$. Then the pushforward measure $\mu_t = (Y_t)_\#\mu_0$ is a weak solution of the continuity equation $\partial_t\mu_t + \nabla\cdot(\mu_t v_t) = 0$ in (1), because for any $C^1$ test function $\phi : \Omega \to \mathbb{R}$ with bounded support,

$$\frac{d}{dt}\int \phi\, d\mu_t = \frac{d}{dt}\int \phi\, d(Y_t)_\#\mu_0 = \frac{d}{dt}\int (\phi\circ Y_t)\, d\mu_0 = \frac{d}{dt}\int \phi(y_x(t))\, d\mu_0(x)$$
$$= \int \langle \nabla\phi(y_x(t)), y_x'(t)\rangle\, d\mu_0(x) = \int \langle \nabla\phi(y_x(t)), v_t(y_x(t))\rangle\, d\mu_0(x)$$
$$= \int \langle \nabla\phi(Y_t), v_t(Y_t)\rangle\, d\mu_0 = \int \langle \nabla\phi, v_t\rangle\, d(Y_t)_\#\mu_0 = \int \langle \nabla\phi, v_t\rangle\, d\mu_t.$$
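The pushforward claim can be checked at the particle level (a Monte Carlo sketch in Python/NumPy; sample size and seed are arbitrary choices). Solving $y'(t) = y(t)/(2t)$ from time $s$ gives the flow map $y(t) = \sqrt{t/s}\,y(s)$, and pushing samples of $\mu_s = N(0, 2sI_n)$ through it should reproduce $\mu_t = N(0, 2tI_n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, t = 2, 0.5, 2.0
X = rng.normal(scale=np.sqrt(2.0 * s), size=(100000, n))  # particles ~ mu_s
Y = np.sqrt(t / s) * X                                    # flow map from time s to t
print(Y.mean(axis=0), Y.var(axis=0))  # mean ~ 0, variance ~ 2t = 4 per coordinate
```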

2.2 Heat equation as a (continuous) Markov semi-group of operators

Next we see how to construct the stochastic process from the heat equation. Recall that $H(x,y,t)$ is the heat kernel in (8) coming from the fundamental solution of the heat equation (6). Denote by $\gamma$ the standard Gaussian measure on $\mathbb{R}^n$, i.e., the density of $\gamma$ equals $(2\pi)^{-n/2}\exp(-|x|^2/2)$. Given a measurable function $f : \mathbb{R}^n \to \mathbb{R}$, we define the (linear) integral operator

$$P_t f(x) = \int_{\mathbb{R}^n} f(z)\, H(z, x, t)\, dz, \qquad t > 0,\ x \in \mathbb{R}^n, \qquad (30)$$

which is sometimes referred to as the forward evolution. Then one can verify that for all $s, t > 0$,

$$P_{t+s}f = P_t(P_s f) = P_s(P_t f), \qquad (31)$$

$$\lim_{t\downarrow 0} P_t f(x) = f(x). \qquad (32)$$

Thus $(P_t)_{t\ge 0}$ forms a continuous semi-group of linear operators on $L^p(\gamma)$, which allows us to define a (homogeneous) continuous Markov process $(X_t)_{t\ge 0}$ on $\mathbb{R}^n$ via

$$\mathbb{E}[f(X_{s+t}) \mid X_r,\ r \le t] = (P_s f)(X_t), \qquad \gamma\text{-almost surely}, \qquad (33)$$

for all bounded measurable functions $f$ and all $s, t \ge 0$. This Markov process is called the heat diffusion process, which we show below is the Wiener process (or Brownian motion) rescaled by a factor of $\sqrt{2}$. It is also easy to check (by smooth approximation and Taylor expansion) that the generator $A$ of the semi-group $(P_t)_{t\ge 0}$ is the Laplacian operator $A = \Delta$, in the sense that

$$Af = \lim_{t\downarrow 0} \frac{P_t f - f}{t} = \Delta f, \qquad (34)$$

where the limit exists in $L^2(\gamma)$. Indeed, note that the semi-group operator can be represented as

$$P_t f(x) = \mathbb{E}_{\xi\sim\gamma}\big[f(x + \sqrt{2t}\,\xi)\big]. \qquad (35)$$

For small enough $t > 0$ and smooth $f : \mathbb{R}^n \to \mathbb{R}$, Taylor expansion yields

$$P_t f(x) - f(x) = \mathbb{E}_{\xi\sim\gamma}\big[f(x + \sqrt{2t}\,\xi)\big] - f(x)$$
$$= \mathbb{E}_{\xi\sim\gamma}\Big[f(x) + \sqrt{2t}\,\xi\cdot\nabla f(x) + \frac{1}{2}\,2t\,\xi^T \mathrm{Hess}_f(x)\,\xi + \text{higher-order terms}\Big] - f(x)$$
$$= t\,\mathrm{tr}\big\{\mathbb{E}_{\xi\sim\gamma}[\xi\xi^T]\,\mathrm{Hess}_f(x)\big\} = t\,\mathrm{tr}\big\{\mathrm{Hess}_f(x)\big\} = t\,\Delta f(x).$$

This is not surprising in view of the heat equation $\partial_t u = \Delta u$: the rate of change (i.e., the time derivative) in the heat diffusion process equals the Laplacian.

As opposed to the semi-group of integral operators $(P_t)$, the generator $A$ is a (linear) differential operator that unfolds $P_t$ and generates the stochastic process $X_t$. The semi-group integral operators are useful for finding the stochastic differential equations (SDEs) from the evolution PDEs of probability density functions, while the generator does the reverse job of converting the SDEs to the evolution PDEs of densities. We shall see an example on the Fokker-Planck equation in Section 3.

We remark that for the heat equation, one can do explicit calculations on $P_t$ and directly write down the Markov process by using (33), and its SDE, as follows. Using the Markov property (33) with $f(x) = \mathbb{1}(x \le y)$, we get

$$\mathbb{P}(X_{s+t} \le y \mid X_r,\ r \le t) = \mathbb{P}_{\xi\sim\gamma}(X_t + \sqrt{2s}\,\xi \le y) \quad \text{for all } y \in \mathbb{R}^n, \qquad (36)$$

where $\mathbb{P}_{\xi\sim\gamma}(\cdot)$ is the probability taken only w.r.t. $\xi$. This implies that

$$\mathbb{P}(X_{s+t} - X_t \le y \mid X_r,\ r \le t) = \mathbb{P}_{\xi\sim\gamma}(\sqrt{2s}\,\xi \le y) = \Phi\Big(\frac{y}{\sqrt{2s}}\Big), \qquad (37)$$

where $\Phi(y)$ is the cumulative distribution function of $\gamma$. The last display shows that the process $(X_t)_{t\ge 0}$ has independent increments (i.e., $X_{s+t} - X_t$ does not depend on the past values) and $X_{s+t} - X_t \sim N(0, 2sI_n)$. Thus the heat diffusion process $(X_t)_{t\ge 0}$ is simply the standard Wiener process $(W_t)_{t\ge 0}$ on $\mathbb{R}^n$ rescaled by $\sqrt{2}$, which we write as

$$dX_t = \sqrt{2}\, dW_t. \qquad (38)$$

Note that the factor $\sqrt{2}$ is due to the fact that the Wiener process solves a slightly different version of the heat equation, $\partial_t u - \frac{1}{2}\Delta u = 0$.

For $X_0 = 0$, we have $\mathbb{E}[X_t] = 0$ and $\mathrm{Var}(X_t) = 2t$, which means that, after time $t$, the typical displacement of the heat diffusion process is of order $\sqrt{2t}$; this should be compared with the position of the trajectory $y_x(t)$ at time $t$ given in (29).
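The generator computation (34)-(35) can be checked directly: for $f(x) = |x|^2$ one has $\Delta f = 2n$ and, exactly, $P_t f(x) = |x|^2 + 2tn$, so the difference quotient equals $2n$ for every $t > 0$. A Monte Carlo sketch (Python/NumPy; the evaluation point, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 3, 0.01
x = np.array([1.0, -2.0, 0.5])
xi = rng.normal(size=(200000, n))                  # xi ~ gamma = N(0, I_n)
Ptf = np.mean(np.sum((x + np.sqrt(2.0 * t) * xi) ** 2, axis=1))  # eq. (35)
gen = (Ptf - np.sum(x ** 2)) / t                   # difference quotient in (34)
print(gen)   # close to Delta f = 2n = 6 (up to Monte Carlo error)
```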

2.3 Heat equation as a (negative) gradient flow in Wasserstein space

We have seen from (27) that the heat equation induces a gradient flow of moving particles in $\mathbb{R}^n$. In this section, we show that the heat equation can also be viewed as a (negative) gradient flow of the entropy in the Wasserstein space of probability measures. To do this, we first need to make sense of what we mean by a "derivative" of the entropy (or of more general functionals) with respect to a density in the infinite-dimensional Wasserstein (metric) space.

Definition 2.3 (First variation, Chapter 7 in [13]). Let $\rho$ be a density on $\Omega$. Given a functional $F : \mathcal{P}(\Omega) \to \mathbb{R}$, we call $\frac{\delta F}{\delta\rho}(\rho) : \Omega \to \mathbb{R}$, if it exists, the unique (up to additive constants) measurable function such that

$$\frac{d}{dh}\Big|_{h=0} F(\rho + h\chi) = \lim_{h\to 0} \frac{F(\rho + h\chi) - F(\rho)}{h} = \int \frac{\delta F}{\delta\rho}(\rho)\, d\chi \qquad (39)$$

holds for every mean-zero perturbation $\chi$ such that $\rho + h\chi \in \mathcal{P}(\Omega)$ for all small enough $h$. The function $\frac{\delta F}{\delta\rho}(\rho) : \Omega \to \mathbb{R}$ is the first variation of $F$ at $\rho$.

In $\mathbb{R}^n$, the first variation behaves like the directional derivative projected onto all possible directions $\chi \in \mathbb{R}^n$. For example, take $F(x) = \frac{1}{2}|x|^2$ for $x \in \mathbb{R}^n$. Then $\nabla F(x) = x$ and, for any $y \in \mathbb{R}^n$,

$$\lim_{h\to 0} \frac{|x + hy|^2 - |x|^2}{2h} = \lim_{h\to 0}\Big(\langle x, y\rangle + \frac{h}{2}|y|^2\Big) = \langle x, y\rangle = \langle \nabla F(x), y\rangle. \qquad (40)$$

Comparing (39) with (40), we may interpret

$$\int \frac{\delta F}{\delta\rho}(\rho)\, d\chi = \Big\langle \frac{\delta F}{\delta\rho}(\rho), \chi\Big\rangle \qquad (41)$$

as an inner product in a Hilbert space. Thus the first variation is an infinite-dimensional analog of the gradient in finite-dimensional Euclidean space, and $\frac{\delta F}{\delta\rho}(\rho)$ can be viewed as a gradient in $\mathcal{P}(\Omega)$. Note that the first variation is defined only up to additive constants since $\int c\, d\chi = 0$ for any constant $c \in \mathbb{R}$.

Below are two important functionals that will be extremely useful in studying the heat equation, and more generally the Fokker-Planck equation in Section 3.

Example 2.4 ("Generalized" entropy). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex superlinear function and

$$F(\rho) = \int f(\rho(x))\, dx = \int f\circ\rho. \qquad (42)$$

Clearly

$$\frac{F(\rho + h\chi) - F(\rho)}{h} = \int \frac{f(\rho(x) + h\chi(x)) - f(\rho(x))}{h\chi(x)}\, \chi(x)\, dx.$$

Letting $h \to 0$, we get

$$\frac{d}{dh}\Big|_{h=0} F(\rho + h\chi) = \int f'(\rho(x))\,\chi(x)\, dx = \int f'(\rho)\, d\chi, \qquad (43)$$

which implies that

$$\frac{\delta F}{\delta\rho}(\rho) = f'(\rho). \qquad (44)$$

For the special case where $f(\rho) = \rho\log\rho$ is the entropy, its first variation at $\rho$ is given by

$$\frac{\delta F}{\delta\rho}(\rho) = 1 + \log\rho. \qquad (45)$$

Example 2.5 (Potential). Let $V : \Omega \to \mathbb{R}$ be a potential function and consider the energy functional

$$F(\rho) = \int V\, d\rho = \int V(x)\rho(x)\, dx. \qquad (46)$$

Compute

$$\frac{F(\rho + h\chi) - F(\rho)}{h} = \int V(x)\,\frac{\rho(x) + h\chi(x) - \rho(x)}{h}\, dx = \int V(x)\chi(x)\, dx = \int V\, d\chi.$$

Thus we see that

$$\frac{\delta F}{\delta\rho}(\rho) = V \qquad (47)$$

is a fixed function that does not depend on $\rho$.

Intuitively, the first variation of a functional $F$ (either entropy or energy) at $\rho$ is the rate of change of $F$ as the distribution of particles on $\Omega$ is perturbed away from $\rho$.

Given a functional $F : \mathcal{P}(\Omega) \to \mathbb{R}$, the minimizing movement scheme introduced by Jordan-Kinderlehrer-Otto [6] (sometimes also called the JKO scheme) is a time-discretized version of gradient flows that solves a sequence of iterated minimization problems (here in the context of $(\mathcal{P}(\Omega), W_2)$):

$$\rho_{k+1}^h = \mathop{\mathrm{argmin}}_{\rho\in\mathcal{P}(\Omega)}\; F(\rho) + \frac{W_2^2(\rho, \rho_k^h)}{2h}. \qquad (48)$$

By strong duality,

$$W_2^2(\rho, \rho_k^h) = \max_{\varphi\in\Phi_c(\Omega)} \int_\Omega \varphi\, d\rho + \int_\Omega \varphi^c\, d\rho_k^h, \qquad (49)$$

where $\varphi^c(y) = \inf_{x\in\Omega}\{|x-y|^2 - \varphi(x)\}$ is the $c$-transform.¹ The function $\varphi$ realizing the maximum on the right-hand side of (49) is called the Kantorovich potential for the transport from $\rho$ to $\rho_k^h$. For each $\rho_k^h$, $W_2^2(\rho, \rho_k^h)$ is convex in $\rho$ since it is a supremum of linear functionals of $\rho$. Now, differentiating (48) w.r.t. $\rho$, the first-order optimality condition is implied by

$$\frac{\delta F}{\delta\rho}(\rho_{k+1}^h) + \frac{\varphi}{h} = \text{constant}, \qquad (50)$$

where $\varphi$ now is the Kantorovich potential for the transport from $\rho_{k+1}^h$ to $\rho_k^h$ (not in the reversed direction). Here we implicitly assumed the uniqueness of the $c$-concave Kantorovich potential. From Brenier's polarization theorem, we know that the optimal map $\tilde T$ from $\rho_{k+1}^h$ to $\rho_k^h$ and the potential $\varphi : \Omega \to \mathbb{R}$ are linked through

$$\tilde T(x) = x - \nabla\varphi(x). \qquad (51)$$

So the velocity vector from time $t+h$ to $t$ (note the reversed direction!) is given by

$$\tilde v_t = \frac{\tilde T(x) - x}{h} = -\frac{\nabla\varphi(x)}{h} = \nabla\Big(\frac{\delta F}{\delta\rho}(\rho_{k+1}^h)\Big)(x). \qquad (52)$$

Reversing the time direction ($v_t = -\tilde v_t$) and letting $h \to 0$ (using the continuity of $\rho$), we get

$$v_t(x) = -\nabla\Big(\frac{\delta F}{\delta\rho}(\rho)\Big)(x) \qquad (53)$$

¹ Here $c$ stands for a general cost function, and the $c$-transform in general is defined as $\varphi^c(y) = \inf_{x\in\Omega}\{c(x,y) - \varphi(x)\}$. A function $\varphi$ is said to be $c$-concave if there exists a $\chi$ such that $\varphi = \chi^c$, and $\Phi_c(\Omega)$ is the set of all $c$-concave functions.

and we get the following continuity equation

$$\partial_t\rho - \nabla\cdot\Big(\rho\,\nabla\Big(\frac{\delta F}{\delta\rho}(\rho)\Big)\Big) = 0 \qquad (54)$$

in the Wasserstein space of measures. If we choose the entropy functional $F(\rho) = \int f(\rho(x))\, dx$ with $f(\rho) = \rho\log\rho$, then

$$\frac{\delta F}{\delta\rho}(\rho) = 1 + \log\rho, \quad\text{and}\quad \nabla\Big(\frac{\delta F}{\delta\rho}(\rho)\Big) = \frac{\nabla\rho}{\rho} = \nabla\log\rho, \qquad (55)$$

so that the continuity equation in (54) reads

$$\partial_t\rho - \nabla\cdot(\rho\,\nabla\log\rho) = 0. \qquad (56)$$

Now recall that we can rewrite the heat equation as

$$0 = (\partial_t - \Delta)u = \partial_t u - \nabla\cdot\Big(u\,\frac{\nabla u}{u}\Big) = \partial_t u - \nabla\cdot(u\,\nabla\log u), \qquad (57)$$

where we can take $\mu_t = u(\cdot, t)$ and $v_t = -\nabla\log u$ in the continuity equation (1). Thus we see that (56) and (57) are really the same continuity equation associated with the heat equation. However, they are viewed as different gradient flows, in the spaces $(\mathcal{P}_2(\Omega), W_2)$ and $\Omega$ respectively. In either case, we call it the heat flow.
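The JKO scheme (48) can be carried out explicitly for the entropy along centered one-dimensional Gaussians, which makes the statement "heat flow = Wasserstein gradient flow of the entropy" computable. Under the Gaussian ansatz $\rho = N(0, \sigma^2)$ (an assumption made purely for illustration), $F(\rho) = -\log\sigma + \text{const}$ and, by (19), $W_2^2(N(0,a^2), N(0,b^2)) = (a-b)^2$; each JKO step then minimizes $-\log\sigma + (\sigma - \sigma_k)^2/(2h)$, whose first-order condition $\sigma^2 - \sigma\sigma_k - h = 0$ has a closed-form positive root:

```python
import numpy as np

def jko_entropy(sigma0, h, nsteps):
    # Minimizing movement (48) for F(rho) = int rho log rho restricted to
    # N(0, sigma^2): each step solves sigma^2 - sigma*sigma_k - h = 0.
    s = sigma0
    for _ in range(nsteps):
        s = 0.5 * (s + np.sqrt(s * s + 4.0 * h))
    return s

sigma0, h, T = 1.0, 1e-3, 1.0
s_T = jko_entropy(sigma0, h, int(T / h))
print(s_T ** 2)   # close to the heat-flow value sigma0^2 + 2T = 3
```

As $h \to 0$ each step gives $\sigma^2 \approx \sigma_k^2 + 2h$, i.e., exactly the variance growth $\sigma^2(t) = \sigma_0^2 + 2t$ of the heat equation.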

3 Fokker-Planck equations: diffusions with drift

We start from the equivalence of the metric derivative and the velocity vector field.

Theorem 3.1. If $(\mu_t)_{t\in[0,1]}$ is an absolutely continuous curve in $W_p$, then for a.e. $t \in [0,1]$ there is a velocity vector field $v_t \in L^p(\mu_t; \Omega)$, $\Omega \subset \mathbb{R}^n$, such that:

1. $\mu_t$ is a weak solution of the continuity equation $\partial_t\mu_t + \nabla\cdot(\mu_t v_t) = 0$ in the sense of distributions;

2. for a.e. $t$, we have $\|v_t\|_{L^p(\mu_t)} \le |\mu'|_p(t)$.

Conversely, if $(\mu_t)_{t\in[0,1]}$ are measures in $\mathcal{P}_p(\mathbb{R}^n)$ and $v_t \in L^p(\mu_t; \Omega)$ for each $t$ with $\int_0^1 \|v_t\|_{L^p(\mu_t)}\, dt < \infty$ satisfy the continuity equation $\partial_t\mu_t + \nabla\cdot(\mu_t v_t) = 0$, then

1. $(\mu_t)_{t\in[0,1]}$ is an absolutely continuous curve in $W_p$;

2. for a.e. $t$, we have $|\mu'|_p(t) \le \|v_t\|_{L^p(\mu_t)}$.

Corollary 3.2. If $(\mu_t)_{t\in[0,1]}$ is an absolutely continuous curve in $W_p$, then the velocity vector field given in part 1 of Theorem 3.1 must satisfy

$$|\mu'|_p(t) = \|v_t\|_{L^p(\mu_t)}. \qquad (58)$$

As in the heat equation, we can take two alternative perspectives on the continuity equation structure and the gradient flow, in the spaces $(\mathcal{P}_2(\Omega), W_2)$ and $\Omega$. First, if we choose the functional

$$F(\rho) = \int f(\rho(x))\, dx + \int V(x)\, d\rho(x) = \int f\circ\rho + \int V\, d\rho, \qquad (59)$$

where $f(\rho) = \rho\log\rho$ is the entropy and $V : \Omega \to \mathbb{R}$ is a potential (independent of $\rho$), then the first variation of $F$ at $\rho$ is given by

$$\frac{\delta F}{\delta\rho}(\rho) = 1 + \log\rho + V. \qquad (60)$$

Thus,

$$\nabla\Big(\frac{\delta F}{\delta\rho}(\rho)\Big) = \nabla\log\rho + \nabla V = \frac{\nabla\rho}{\rho} + \nabla V, \qquad (61)$$

and the continuity equation for this entropy+potential functional $F$ (note that $F$ is convex in $\rho$) becomes

$$0 = \partial_t\rho - \nabla\cdot\Big(\rho\Big(\frac{\nabla\rho}{\rho} + \nabla V\Big)\Big) = \partial_t\rho - \Delta\rho - \nabla\cdot(\rho\,\nabla V), \qquad (62)$$

or alternatively we may write

$$\partial_t\rho = \Delta\rho + \langle\nabla\rho, \nabla V\rangle + \rho\,\Delta V, \qquad (63)$$

which is the Fokker-Planck equation (FPE). Hence if we define the drift Laplacian operator $L_V$ as

$$L_V\rho = \Delta\rho + \langle\nabla\rho, \nabla V\rangle = e^{-V}\,\nabla\cdot(e^{V}\nabla\rho), \qquad (64)$$

then the Fokker-Planck equation (63) is nothing but the drift heat equation:

$$\partial_t\rho - L_V\rho - \rho\,\Delta V = 0, \qquad (65)$$

which describes the time evolution of the density of particles under the influence of drag forces and random forces (as for Brownian motion, or heat diffusion in the pure diffusion case). For a general potential $V$, from the continuity equation (62), the velocity vector field $v_t$ is given by

$$v_t = -\frac{\nabla\rho_t}{\rho_t} - \nabla V. \qquad (66)$$

Thus we can write down the gradient flow of $v_t$ as in (27). In particular, the Mehler flow (with the potential $V(x) = \frac{1}{4}|x|^2$) is given by

$$y_x'(t) = -\Big(\frac{\nabla\rho_t}{\rho_t} + \nabla V\Big)(y_x(t)) = -\frac{\nabla\rho_t}{\rho_t}(y_x(t)) - \frac{y_x(t)}{2} \qquad (67)$$

with the initial datum $y_x(0) = x$, which again is an initial value problem for a first-order linear (homogeneous) ordinary differential equation.

3.1 Mehler flow

It is interesting to note that, as a special case, if $V(x) = \frac{1}{4}|x|^2$, then $\nabla V = \frac{x}{2}$, $\Delta V = \frac{n}{2}$, and the drift heat equation (65) is the Mehler flow:

$$(\partial_t - L_M)u = 0, \qquad (68)$$

where $L_M$ is the Mehler operator defined as

$$L_M(\rho) = L_{\frac{|x|^2}{4}}(\rho) + \frac{n}{2}\rho. \qquad (69)$$

It is well known that the Mehler flow can be obtained from the heat flow in the following sense. Let $u(x,t)$ be a smooth function and define the rescaled version

$$\tilde u(x,t) = t^{n/2}\, u(\sqrt{t}\,x, t) \to c\,(4\pi)^{-n/2}\exp\Big(-\frac{|x|^2}{4}\Big) \quad \text{as } t\to\infty, \qquad (70)$$

where $c = \int_{\mathbb{R}^n} u$ and the convergence holds by the central limit theorem (CLT). Changing the time clock $t = e^s$ and setting $v(x,s) = \tilde u(x, e^s)$ for $s \ge 0$, one can verify that

$$(\partial_s - L_M)v = e^{\frac{ns}{2}+s}\,(\partial_t - \Delta)u. \qquad (71)$$

Thus $u$ solves the heat equation $(\partial_t - \Delta)u = 0$ if and only if $v$ solves the drift heat equation $(\partial_s - L_M)v = 0$. In particular, this implies that if we consider the fundamental solution of the heat equation given by (7) and (8),

$$u(x,t) = (4\pi t)^{-n/2}\exp\Big(-\frac{|x|^2}{4t}\Big), \qquad t > 0, \qquad (72)$$

then

$$v(x,s) = (4\pi)^{-n/2}\exp\Big(-\frac{|x|^2}{4}\Big) \qquad (73)$$

is constant in time. Thus $\partial_s v(x,s) = 0$ and we see that $L_M(v) = 0$, i.e., $v$ is an $L_M$-harmonic function, so that $v(x)$ is a static point that minimizes the Mehler energy: for $f : \mathbb{R}^n\times[0,\infty) \to \mathbb{R}$,

$$E_t(f) = -\int f\, L_M(f)\, e^{\frac{|x|^2}{4}} = \int \Big(|\nabla f|^2 - \frac{n}{2}f^2\Big)\, e^{\frac{|x|^2}{4}} \ge 0. \qquad (74)$$

Since

$$\frac{d}{dt}E_t(f) = -2\langle f_t, L_M f\rangle_M = -2E_t(f), \qquad (75)$$

where $\langle f, g\rangle_M = \int_{\mathbb{R}^n} f(x)g(x)\, e^{\frac{|x|^2}{4}}\, dx$ defines an inner product, we see that $L_M$ generates the twice-negative gradient flow of the Mehler energy (74) in $\mathbb{R}^n$. Note that (75) implies the exponential rate of convergence for the Mehler flow (i.e., the Fokker-Planck equation with $V(x) = \frac{1}{4}|x|^2$) in the Euclidean space $\mathbb{R}^n$: for $t \ge 0$,

$$E_t(f) = E_0(f)\, e^{-2t}. \qquad (76)$$

In Section 3.3 below, we shall also look at the rate of convergence of the gradient flow of the Fokker-Planck equation (in particular the Mehler flow) to $v$ in the Wasserstein space $(\mathcal{P}(\mathbb{R}^n), W_2)$.
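The rescaling (70) is easy to visualize numerically in $d = 1$ (a sketch; the two-bump initial density is an arbitrary illustrative choice). Starting the heat flow from $u_0 = \frac{1}{2}N(-a, s^2) + \frac{1}{2}N(a, s^2)$, the solution $u(\cdot, t) = u_0 * H_t$ is the same mixture with variance $s^2 + 2t$, and $\tilde u(x,t) = t^{1/2}u(\sqrt{t}x, t)$ approaches $(4\pi)^{-1/2}e^{-x^2/4}$ (here $c = \int u_0 = 1$):

```python
import numpy as np

def gauss(x, m, var):
    return np.exp(-(x - m) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def u(x, t, a=3.0, s=1.0):
    # Heat flow started from a symmetric two-bump density: u(.,t) = u0 * H_t.
    return 0.5 * gauss(x, -a, s**2 + 2.0*t) + 0.5 * gauss(x, a, s**2 + 2.0*t)

x = np.linspace(-10.0, 10.0, 2001)
limit = np.exp(-x**2 / 4.0) / np.sqrt(4.0 * np.pi)

def err(t):
    # sup-norm gap between the rescaled solution (70) and its CLT limit
    return np.max(np.abs(np.sqrt(t) * u(np.sqrt(t) * x, t) - limit))

print(err(10.0), err(10000.0))   # the gap shrinks as t grows
```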

3.2 Fokker-Planck equation as a stochastic differential equation

The Fokker-Planck equation (in its most general form) is a PDE that describes the time evolution of the density of particles of a diffusion process with a (deterministic) drift and (random) noise, governed by the following stochastic differential equation (SDE) (sometimes referred to as the Langevin diffusion):

$$dX_t = m(X_t, t)\, dt + \sigma(X_t, t)\, dW_t, \qquad (77)$$

where $m(X_t, t)$ is the drift coefficient vector in $\mathbb{R}^n$ and $\sigma(X_t, t)$ is an $n\times n$ matrix. Here $(W_t)$ is again the standard Wiener process on $\mathbb{R}^n$. The SDE (77) is understood in the integral form

$$X_{t+s} - X_s = \int_s^{s+t} m(X_u, u)\, du + \int_s^{s+t} \sigma(X_u, u)\, dW_u \qquad (78)$$

as the sum of a Lebesgue integral and an Itô integral. In this model, both the drift coefficient vector $m(X_t, t)$ and the matrix $\sigma(X_t, t)$ are path/state- and time-dependent. The (standard) $n$-dimensional Wiener process is the special case where $m(X_t, t) = 0$ and $\sigma(X_t, t) = I_n$. For $n = 1$, the density evolution equation of (77) is given by

$$\partial_t\rho(x,t) = -\partial_x\big(m(x,t)\rho(x,t)\big) + \partial_x^2\big(D(x,t)\rho(x,t)\big), \qquad (79)$$

where $D(x,t) = \sigma(x,t)^2/2$ is the diffusion coefficient. (In Appendix A, we show how to convert the SDE of sample paths (77) into the evolution PDE of densities (79).) In the one-dimensional heat diffusion case, where there is no drift ($m(x,t) = 0$) and $D(x,t) = 1$ (i.e., $\sigma(x,t) = \sqrt{2}$), the evolution of the density $\rho(x,t)$ is governed by

$$\partial_t\rho - \partial_x^2\rho = 0, \qquad (80)$$

which is just the heat equation on $\mathbb{R}$. The higher-dimensional analog $\partial_t u - \Delta u = 0$ can also be recovered by noting that the density evolution equation now becomes

$$\partial_t\rho(x,t) = -\nabla\cdot\big(m(x,t)\rho(x,t)\big) + \sum_{i,j=1}^n \frac{\partial^2}{\partial x_i\partial x_j}\big(D_{ij}(x,t)\rho(x,t)\big), \qquad (81)$$

where $D(x,t) = \sigma(x,t)\sigma(x,t)^T/2$ is the diffusion tensor. How does the density evolution equation (81) link to the continuity equation versions (62) and (63)? With the diffusion tensor $D(x,t) = I_n$, (81) reads

$$\partial_t\rho = -\nabla\cdot(m\rho) + \Delta\rho. \qquad (82)$$

Comparing the last display with the continuity equation (62):

$$\partial_t\rho - \Delta\rho - \nabla\cdot(\rho\,\nabla V) = 0, \qquad (83)$$

we see that

$$m = -\nabla V, \qquad (84)$$

which means that the drift vector $m$ in the Fokker-Planck equation proceeds in the negative gradient direction of minimizing the potential $V$ (subject, of course, to the diffusion effect given by the Laplacian $\Delta$). Combining this with the fact that the heat flow proceeds in the negative gradient direction of the entropy, we see that the Fokker-Planck equation is the negative gradient flow of the entropy+potential functional

$$F(\rho) = \underbrace{\int f\circ\rho}_{\text{microscopic behavior}} + \underbrace{\int V\, d\rho}_{\text{macroscopic behavior}} \qquad (85)$$

in the Wasserstein space $(\mathcal{P}_2(\Omega), W_2)$.

We remark that the special case $m(x,t) = -\frac{x}{2}$ (i.e., $V(x) = \frac{1}{4}|x|^2$) and $D(x,t) = 1$ in the Langevin diffusion (77) is often called the Ornstein-Uhlenbeck (OU) process on $\mathbb{R}^n$:

$$dX_t = -\frac{X_t}{2}\, dt + \sqrt{2}\, dW_t, \qquad (86)$$

which is simply a mean-reverting diffusion process. The semi-group operator of the OU process is given by

$$P_t f(x) = \mathbb{E}_{\xi\sim\pi}\Big[f\Big(e^{-t/2}x + \sqrt{1 - e^{-t}}\,\xi\Big)\Big], \qquad t \ge 0, \qquad (87)$$

where $\pi$ is again the $\sqrt{2}$-rescaled standard Gaussian measure $\gamma$ on $\mathbb{R}^n$, i.e., $\pi(x) = (4\pi)^{-n/2}\exp(-|x|^2/4)$. The OU semi-group $(P_t)_{t\ge 0}$ admits $\pi$ as stationary measure, and convergence to it holds in $L^2(\pi)$: for any bounded measurable function $f : \mathbb{R}^n \to \mathbb{R}$,

$$\|P_t f - \pi f\|_{L^2(\pi)}^2 = \mathbb{E}_{\xi'\sim\pi}\Big|\mathbb{E}_{\xi\sim\pi}\big[f(e^{-t/2}\xi' + \sqrt{1-e^{-t}}\,\xi)\big] - \mathbb{E}_{\xi\sim\pi}[f(\xi)]\Big|^2$$
$$\le \mathbb{E}_{\xi'\sim\pi}\,\mathbb{E}_{\xi\sim\pi}\Big[f(e^{-t/2}\xi' + \sqrt{1-e^{-t}}\,\xi) - f(\xi)\Big]^2 \qquad (88)$$
$$= \mathbb{E}\Big[f(e^{-t/2}\xi' + \sqrt{1-e^{-t}}\,\xi) - f(\xi)\Big]^2 \qquad (89)$$
$$\to 0 \quad \text{as } t\to\infty, \qquad (90)$$

where (88) follows from Jensen's inequality, (89) from Fubini's theorem, and (90) from the dominated convergence theorem (since $f \in L^2(\pi)$). In addition, the generator $A$ of $(P_t)_{t\ge 0}$ is given by

$$Af = \frac{d}{dt}\Big|_{t=0} P_t f = -\frac{1}{2}\langle\nabla f, x\rangle + \Delta f, \qquad (91)$$

which is called the Ornstein-Uhlenbeck (OU) operator (we also write $A = L_{OU}$). Comparing (91) with the Mehler operator in (69),

$$L_M(\rho) = L_{\frac{|x|^2}{4}}(\rho) + \frac{n}{2}\rho = \Delta\rho + \frac{1}{2}\langle\nabla\rho, x\rangle + \frac{n}{2}\rho, \qquad (92)$$

which gives the forward Fokker-Planck equation (cf. Appendix A for more details), the generator in (91) gives the backward Fokker-Planck equation. Integrating by parts w.r.t. $dx$, we conclude that $L_M = A^*$, which means that the Mehler operator (forward equation) is the adjoint of the generator, i.e., of the OU operator (backward equation); that is, we have $L_M = L_{OU}^*$ in $L^2(dx)$. This holds for a general potential $V$, not just $V(x) = \frac{1}{4}|x|^2$. (Here we need to be slightly careful about the reference measure: if we consider $L^2(e^{-\frac{|x|^2}{4}}\,dx) = L^2(d\pi)$, then $L_{\frac{|x|^2}{4}} = L_{OU}^*$ in $L^2(d\pi)$.)

To summarize, given a general Fokker-Planck equation:

$$0 = \partial_t\rho_t - \Delta\rho_t - \nabla\cdot(\rho_t\,\nabla V) \qquad (93)$$
$$\phantom{0} = \partial_t\rho_t - L_V\rho_t - \rho_t\,\Delta V, \qquad (94)$$

where (93) is the continuity equation version and (94) is the drift heat equation version, if we assume it admits a stationary distribution $\pi(x) = \frac{1}{Z}e^{-V(x)}$ on $\mathbb{R}^n$ (cf. Section 3.3 ahead for more details), then the generator $A$ of the drift heat diffusion process (i.e., the Langevin diffusion) is given by

$$A = \bar L_V^*, \qquad (95)$$

where

$$\bar L_V\rho = L_V\rho + \rho\,\Delta V = \Delta\rho + \langle\nabla\rho, \nabla V\rangle + \rho\,\Delta V, \qquad (96)$$

and the semi-group $(P_t)_{t\ge 0}$ is given by

$$P_t f(x) = \mathbb{E}_{\xi\sim\pi}\big[f\big(e^{-t}\nabla V(x) + \sqrt{1 - e^{-2t}}\,\xi\big)\big], \qquad t \ge 0. \qquad (97)$$

Letting $t\to\infty$, we see that $P_t f$ asymptotically converges to the stationary mean $\pi f$ (in $L^2(\pi)$). Thus, from the evolution PDE of the probability density functions, we can fully characterize the related SDEs. The reverse direction, from SDEs to PDEs, can be found in Appendix A, where we use Itô's formula to show that a measure solution $\rho_t$ of the continuity equation can be seen as the law at time $t$ of the process $(X_t)_{t\ge 0}$ solving the SDE. This justifies the equivalence between the PDEs and the SDEs.
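The Langevin SDE (77) with $m = -\nabla V$ and $\sigma = \sqrt{2}\,I_n$ is straightforward to simulate with the Euler-Maruyama scheme. The sketch below (Python/NumPy; step size, horizon, and seed are arbitrary choices) uses $V(x) = |x|^2/4$ in $d = 1$, i.e., the OU process (86), whose empirical law should settle near the stationary distribution $\pi = N(0, 2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, nsteps, npart = 0.01, 2000, 50000
X = np.zeros(npart)                  # all particles start at the origin (d = 1)
for _ in range(nsteps):
    drift = -X / 2.0                 # m = -grad V for V(x) = x^2/4
    X = X + drift * dt + np.sqrt(2.0 * dt) * rng.normal(size=npart)
print(X.mean(), X.var())   # mean ~ 0, variance ~ 2 (stationary pi = N(0, 2))
```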

3.3 Rate of convergence for Wasserstein gradient flows of the Fokker-Planck equation

If $Z = \int_{\mathbb{R}^n} e^{-V(x)}\, dx < \infty$, then the Fokker-Planck equation has a stationary distribution on $\mathbb{R}^n$:

$$\pi(x) = \frac{1}{Z}e^{-V(x)}, \qquad (98)$$

where $0 < Z < \infty$ is a normalization constant. To see this, recall that $F(\rho) = \int\rho\log\rho + \int V\, d\rho$ and $\frac{\delta F}{\delta\rho}(\pi) = \log\pi + V$ (up to an additive constant). Since $\nabla\pi = Z^{-1}e^{-V}(-\nabla V) = -\pi\nabla V$,

$$\pi\,\nabla\Big(\frac{\delta F}{\delta\rho}(\pi)\Big) = \pi\,\frac{\nabla\pi}{\pi} + \pi\,\nabla V = \nabla\pi + \pi\,\nabla V = 0. \qquad (99)$$

Then,

$$\partial_t\pi = \nabla\cdot\Big(\pi\,\nabla\Big(\frac{\delta F}{\delta\rho}(\pi)\Big)\Big) = 0, \qquad (100)$$

which implies that $\pi$ is a stationary point of the Fokker-Planck continuity equation. In this case, the gradient flow of the functional $F$ can be viewed as the gradient flow of the relative entropy between $\rho$ and $\pi$ (cf. (104) in Definition 3.3 below):

$$F(\rho) + \log Z = \int\rho\Big[\log\Big(\frac{\rho}{e^{-V}}\Big) + \log Z\Big] = \int\rho\log\Big(\frac{\rho}{\pi}\Big) = H(\rho\|\pi), \qquad (101)$$

so that

$$\frac{\delta F}{\delta\rho}(\rho) = \frac{\delta(F + \log Z)}{\delta\rho}(\rho) = \frac{\delta H(\cdot\|\pi)}{\delta\rho}(\rho). \qquad (102)$$

Now suppose the Fokker-Planck equation admits a stationary distribution. Given an initial distribution $\rho_0$, we would like to ask: how fast does the Wasserstein gradient flow $(\rho_t)_{t\ge 0}$ converge to the stationary distribution $\pi$? In a nutshell, the answer is given by the following statement.

If the stationary measure π satisfies a logarithmic Sobolev inequality (LSI), then ρt converges to π exponentially fast in time t (under numerous distance or divergence measures).

Suppose $\pi$ is a probability measure (i.e., $0 < Z < \infty$). Since the quantities involved in the convergence depend only on the (first and second) derivatives of the density $\pi$, without loss of generality we may assume $Z = 1$ and

$$\pi = e^{-V}. \qquad (103)$$

Definition 3.3 (Relative entropy and relative Fisher information). Let $p, q$ be two probability measures on $\mathbb{R}^n$ such that $p \ll q$. Then the relative entropy (or Kullback-Leibler divergence) between $p$ and $q$ is defined as

$$H(p\|q) = \int \log\frac{dp}{dq}\, dp. \qquad (104)$$

In particular, if $p \ll dx$ and $q \ll dx$ (with densities again denoted $p$ and $q$), then

$$H(p\|q) = \int p\,\log\frac{p}{q}\, dx. \qquad (105)$$

Let $\pi, \mu$ be two probability measures on $\mathbb{R}^n$ such that $\mu \ll \pi$ and $\frac{d\mu}{d\pi} = \rho$; then the relative Fisher information between $\mu$ and $\pi$ is defined as

$$I(\mu\|\pi) = \int \frac{|\nabla\rho|^2}{\rho}\, d\pi. \qquad (106)$$

We now state a powerful information inequality that implies the LSI.

Theorem 3.4 (HWI++ inequality [5]). Let $(\mathbb{R}^n, |\cdot|, \pi)$ be a probability space such that

$$d\pi(x) = e^{-V(x)}\, dx, \qquad (107)$$

n ∞ n where V : R → [0, ∞) is a C (R ) function such that Hess(V )  κIn for some parameter n κ ∈ R. Then for any µ0, µ1 ∈ P2(R ) such that H(µ0kπ) < ∞, κ H(µ kπ) − H(µ kπ) W (µ , µ )pI(µ kπ) − W 2(µ , µ ). (108) 1 0 6 2 0 1 1 2 2 0 1 Theorem 3.4 implies several classical functional and information inequalities. n Corollary 3.5. In the setting of Theorem 3.4 and assume further π ∈ P2(R ), then we have n 1. HWI inequality (Theorem 3 in [10]): for any ν ∈ P2(R ), κ H(νkπ) W (ν, π)pI(νkπ) − W 2(ν, π). (109) 6 2 2 2

2. Talagrand's $T_2$-inequality: for any $\nu \in \mathcal{P}_2(\mathbb{R}^n)$ with finite second moment,
$$\frac{\kappa}{2}\, W_2^2(\nu,\pi) \le H(\nu\|\pi). \qquad (110)$$

3. Logarithmic Sobolev inequality: if $\kappa > 0$, then for any $\nu \in \mathcal{P}(\mathbb{R}^n)$,
$$H(\nu\|\pi) \le \frac{1}{2\kappa}\, I(\nu\|\pi). \qquad (111)$$

Combining (110) and (111), we have that if $\kappa > 0$, then
$$W_2(\nu,\pi) \le \frac{\sqrt{I(\nu\|\pi)}}{\kappa}, \qquad \forall\, \nu \ll \pi. \qquad (112)$$

In Appendix B and ??, we discuss more details of the LSIs and Talagrand's transportation inequalities.

Proof of Corollary 3.5. Corollary 3.5 follows from Theorem 3.4 with specific choices.

1. Take µ0 = π and µ1 = ν.

2. Take µ0 = ν and µ1 = π.

3. Take $\mu_0 = \pi$ and $\mu_1 = \nu$ to get the HWI inequality in part 1, and then maximize its right-hand side $-\frac{\kappa}{2}x^2 + \sqrt{I(\nu\|\pi)}\,x$ over $x = W_2(\nu,\pi)$, where the maximizer occurs at $x^* = \frac{1}{\kappa}\sqrt{I(\nu\|\pi)}$. Then a standard approximation argument is sufficient.



There are various ways of measuring the gap between the gradient flow $(\rho_t)_{t>0}$ as a solution of the Fokker-Planck equation and the stationary distribution $\pi$: total variation (as in Meyn-Tweedie's Markov chain approach), $L^2$-norm, relative entropy, Wasserstein distance, etc. Below in Theorem 3.6, we state the exponential rate of convergence under the relative entropy.

Theorem 3.6 (Exponential rate of convergence for the Fokker-Planck gradient flow: relative entropy). Let $(\rho_t)_{t>0}$ solve the (gradient drift) Fokker-Planck equation:
$$\partial_t\rho_t = \nabla\cdot(\nabla\rho_t + \rho_t\nabla V), \qquad (113)$$
where the potential $V \ge 0$ satisfies $V \in C^2(\mathbb{R}^n)$. If the Bakry-Émery criterion
$$\mathrm{Hess}(V) \succeq \kappa I_n \qquad (114)$$
holds for some $\kappa > 0$ (i.e., $\mathrm{Hess}\,V(x) \succeq \kappa I_n$ for all $x \in \mathbb{R}^n$), then
$$H(\rho_t\|\pi) \le e^{-2\kappa t}\, H(\rho_0\|\pi), \qquad t \ge 0, \qquad (115)$$
where $\pi(x) = e^{-V(x)}$ is the stationary distribution for the Fokker-Planck equation (113).
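As a quick sanity check on (115), take the hypothetical quadratic potential $V(x) = \kappa x^2/2$ in dimension one, so that $\pi = N(0, 1/\kappa)$ and the Fokker-Planck flow keeps a Gaussian initial datum Gaussian; both sides of the bound are then available in closed form (the initial moments below are arbitrary choices):

```python
import math

def kl_gauss(m, s2, kappa):
    """H(N(m, s2) || N(0, 1/kappa)) for the stationary pi = N(0, 1/kappa)."""
    return 0.5 * (kappa * s2 - 1.0 - math.log(kappa * s2)) + 0.5 * kappa * m ** 2

kappa = 1.0          # V(x) = kappa x^2 / 2, so Hess(V) = kappa and pi = N(0, 1/kappa)
m0, s20 = 1.0, 0.5   # Gaussian initial datum rho_0 = N(m0, s20)
H0 = kl_gauss(m0, s20, kappa)

for t in (0.5, 1.0, 2.0, 4.0):
    # explicit Gaussian solution of the flow: Ornstein-Uhlenbeck moments
    mt = m0 * math.exp(-kappa * t)
    s2t = s20 * math.exp(-2 * kappa * t) + (1.0 - math.exp(-2 * kappa * t)) / kappa
    # entropy bound (115): H(rho_t || pi) <= exp(-2 kappa t) H(rho_0 || pi)
    assert kl_gauss(mt, s2t, kappa) <= math.exp(-2 * kappa * t) * H0 + 1e-12
```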

Note that the Bakry-Émery criterion is a strong convexity (sometimes called $\kappa$-convexity) requirement for the potential $V$. From the sampling point of view, the Bakry-Émery criterion requires that the stationary distribution $\pi$ be strictly log-concave in order to obtain an exponential rate of convergence under the relative entropy. It is known that a strictly log-concave probability density satisfies an LSI, and vice versa (cf. Problem 3.17 in [15]). In the celebrated result of Bakry and Émery, a simple sufficient condition is given to ensure that the density satisfies an LSI.

Lemma 3.7 (Bakry-Émery criterion). Let $V : \mathbb{R}^n \to \mathbb{R}$ be a $C^2(\mathbb{R}^n)$ function and $d\pi = e^{-V}\,dx$ be a probability measure such that $\mathrm{Hess}(V) \succeq \kappa I_n$ for some $\kappa > 0$. Then $\pi$ satisfies the LSI with parameter $\kappa$:
$$H(\nu\|\pi) \le \frac{1}{2\kappa}\, I(\nu\|\pi) \qquad (116)$$
for all $\nu \in \mathcal{P}(\mathbb{R}^n)$ such that $\nu \ll \pi$.

However, it is an open question whether or not all log-concave densities satisfy an LSI. It is even an open question whether or not all log-concave distributions satisfy a Poincaré inequality with a dimension-free constant. This is the Kannan-Lovász-Simonovits (KLS) conjecture [1, 3].

Conjecture 3.8 (Kannan-Lovász-Simonovits conjecture, cf. Conjecture 1.2 in [1]). There exists an absolute constant $C > 0$ such that for any log-concave probability measure $\nu$ on $\mathbb{R}^n$, we have
$$\mathrm{Var}_\nu(f) := \mathbb{E}_\nu|f - \mathbb{E}_\nu[f]|^2 \le C\lambda_\nu^2\, \mathbb{E}_\nu|\nabla f|^2 \qquad (117)$$
for any locally Lipschitz (i.e., Lipschitz on any Euclidean ball) function $f \in L^2(\nu)$, where $\lambda_\nu$ is the square root of the largest eigenvalue of the covariance matrix $\mathbb{E}_{X\sim\nu}[XX^T]$.

Currently, the best known result is that, for an isotropic $n$-dimensional log-concave distribution, the dimension-dependent constant $C(n)$ is of the order $n^{1/4}$ [8].

We also note that the Bakry-Émery criterion can be replaced by a slightly stronger Otto-Villani criterion (with an additional finite second moment assumption), so that we can obtain both the LSI and the $T_2$ inequality.
Lemma 3.9 (Otto-Villani criterion, Theorem 1 in [10]). Let $V : \mathbb{R}^n \to \mathbb{R}$ be a $C^2(\mathbb{R}^n)$ function and $d\pi = e^{-V}\,dx$ be a probability measure with a finite second moment such that $\mathrm{Hess}(V) \succeq \kappa I_n$ for some $\kappa > 0$. Then $\pi$ satisfies the LSI and Talagrand's $T_2$ inequality, both with parameter $\kappa$:
$$H(\nu\|\pi) \le \frac{1}{2\kappa}\, I(\nu\|\pi) \quad\text{and}\quad \frac{\kappa}{2}\, W_2^2(\nu,\pi) \le H(\nu\|\pi) \qquad (118)$$
for all $\nu \in \mathcal{P}(\mathbb{R}^n)$ such that $\nu \ll \pi$.

With the (slightly) stronger Otto-Villani criterion, the exponential rate of convergence also holds under the Wasserstein distance [2]. In fact, the rate of convergence for the (more general) non-gradient drift Fokker-Planck equation is established in [2].

Theorem 3.10 (Exponential rate of convergence for the Fokker-Planck gradient flow: Wasserstein distance). Let $(\rho_t)_{t>0}$ solve the (gradient drift) Fokker-Planck equation (113) with the stationary distribution $d\pi = e^{-V}\,dx$ satisfying the Otto-Villani criterion for some $\kappa > 0$ (i.e., $d\pi = e^{-V}\,dx$ has finite second moment, $V \in C^2(\mathbb{R}^n)$, and $\mathrm{Hess}(V) \succeq \kappa I_n$). Then,
$$W_2(\rho_t,\pi) \le e^{-\kappa t}\, W_2(\rho_0,\pi). \qquad (119)$$

(Stochastic) proof of Theorem 3.10. Using the relation between the PDE and SDE, the solution to the (gradient drift) Fokker-Planck equation (113) is the density of the stochastic process $(X_t)_{t>0}$ solving the following SDE:
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t, \qquad (120)$$
where $(W_t)_{t>0}$ is the standard Wiener process on $\mathbb{R}^n$ and the initial datum has distribution $\rho_0$ (cf. Section 3.2 for more details of the equivalence).

Let $\mu_0$ and $\nu_0$ be two probability measures on $\mathbb{R}^n$, and $(X_0, Y_0)$ be a coupling at the initial time with the marginal distributions $X_0 \sim \mu_0$ and $Y_0 \sim \nu_0$ such that
$$\mathbb{E}|X_0 - Y_0|^2 = W_2^2(\mu_0,\nu_0). \qquad (121)$$

Then we run two coupled copies of the SDE with $(X_t)_{t>0}$ (resp. $(Y_t)_{t>0}$) as the solution to (120) starting from $X_0 \sim \mu_0$ (resp. $Y_0 \sim \nu_0$), both driven by the same Wiener process $(W_t)_{t>0}$. Then,
$$\frac{d}{dt}\,\mathbb{E}|X_t - Y_t|^2 = -2\,\mathbb{E}\langle X_t - Y_t,\, \nabla V(X_t) - \nabla V(Y_t)\rangle. \qquad (122)$$
Note that for any $x, y \in \mathbb{R}^n$,
$$V(x) = V(y) + \langle x - y, \nabla V(y)\rangle + \tfrac12 (x-y)^T \mathrm{Hess}_V(z)(x-y), \qquad (123)$$
$$V(y) = V(x) + \langle y - x, \nabla V(x)\rangle + \tfrac12 (y-x)^T \mathrm{Hess}_V(z')(y-x), \qquad (124)$$
where $z, z'$ are on the line segment joining $x$ and $y$. Adding the last two equalities, we get
$$\langle x - y, \nabla V(x) - \nabla V(y)\rangle = \tfrac12 (x-y)^T[\mathrm{Hess}_V(z) + \mathrm{Hess}_V(z')](x-y). \qquad (125)$$

If $\mathrm{Hess}(V) \succeq \kappa I_n$ for some $\kappa > 0$, then
$$\langle x - y, \nabla V(x) - \nabla V(y)\rangle \ge \kappa|x - y|^2, \qquad \forall x, y \in \mathbb{R}^n. \qquad (126)$$
Then Grönwall's lemma yields that
$$\mathbb{E}|X_t - Y_t|^2 \le e^{-2\kappa t}\,\mathbb{E}|X_0 - Y_0|^2. \qquad (127)$$
By definition of the Wasserstein distance,
$$W_2^2(\mu_t,\nu_t) \le \mathbb{E}|X_t - Y_t|^2. \qquad (128)$$
Now we have
$$W_2^2(\mu_t,\nu_t) \le e^{-2\kappa t}\, W_2^2(\mu_0,\nu_0), \qquad (129)$$
which gives an exponential contraction between any two solutions $\mu_t$ and $\nu_t$ of the Fokker-Planck equation (113). In particular, (119) follows from choosing $\nu_0$ (and thus all $\nu_t$, $t > 0$) as the stationary solution $\pi = e^{-V}$.
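The synchronous coupling in this proof can be reproduced numerically with Euler-Maruyama, sharing the Brownian increments between the two copies; the quadratic potential $V(x) = x^2/2$ below is a hypothetical choice with $\kappa = 1$, and the initial laws are arbitrary Gaussians:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
kappa, h, T, n = 1.0, 1e-3, 2.0, 10000   # step size h, horizon T, n coupled pairs
grad_V = lambda z: kappa * z             # V(x) = kappa x^2 / 2, so Hess(V) = kappa

X = rng.normal(3.0, 1.0, n)              # X_0 ~ mu_0
Y = rng.normal(0.0, 1.0, n)              # Y_0 ~ nu_0
d0 = np.mean((X - Y) ** 2)
for _ in range(int(T / h)):
    dW = math.sqrt(h) * rng.standard_normal(n)   # the SAME noise drives both copies
    X = X - grad_V(X) * h + math.sqrt(2.0) * dW
    Y = Y - grad_V(Y) * h + math.sqrt(2.0) * dW

d_final = np.mean((X - Y) ** 2)
# contraction (129): E|X_t - Y_t|^2 <= exp(-2 kappa t) E|X_0 - Y_0|^2
assert d_final <= math.exp(-2 * kappa * T) * d0 + 1e-9
```

For this linear drift the shared noise cancels exactly, so the pairwise distance contracts deterministically at rate $\kappa$.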

For the Mehler flow, recall that the stationary distribution is given by
$$\pi(x) = (4\pi)^{-n/2}\exp\Big(-\frac{|x|^2}{4}\Big), \qquad (130)$$
which is a $\sqrt{2}$-rescaled standard Gaussian distribution $\gamma$ on $\mathbb{R}^n$. By the Gaussian LSI for $\gamma$, we know that $\pi$ satisfies a similar LSI and thus the Mehler flow has an exponential rate of convergence to its stationary distribution $\pi$.

Corollary 3.11 (Exponential rate of convergence for the Mehler flow). We have
$$H(\rho_t\|\pi) \le e^{-t}\, H(\rho_0\|\pi) \quad\text{and}\quad W_2(\rho_t,\pi) \le e^{-t/2}\, W_2(\rho_0,\pi), \qquad t \ge 0, \qquad (131)$$
where $\pi$ is the stationary distribution of the Mehler flow $(\rho_t)_{t>0}$ with $V(x) = \frac{|x|^2}{4}$. In particular, $\rho_t$ converges weakly to $\pi$ (in distribution) as $t \to \infty$.

Proof of Corollary 3.11. For the Mehler flow, $V(x) = \frac{|x|^2}{4}$, $\nabla V = \frac{x}{2}$, and $\mathrm{Hess}(V) = \frac12 I_n$. So $\kappa = \frac12$. Then Corollary 3.11 follows from Theorem 3.6 and Theorem 3.10.

Corollary 3.11 is a quantitative version of Boltzmann's H-theorem stating that the total entropy of an isolated system can never decrease over time (i.e., the second law of thermodynamics):
$$\frac{d}{dt}\, H(\rho_t\|\pi) = -I(\rho_t\|\pi) \le 0 \quad\text{with}\quad \pi = \frac{e^{-V}}{Z}. \qquad (132)$$

3.4 Problem set

Problem 3.1. Prove the central limit theorem for the Mehler flow in (70).

Problem 3.2. Derive the Mehler energy formula $E_t(f)$ in (74) and show that
$$\frac{d}{dt}\, E_t(f) = -2 E_t(f). \qquad (133)$$

Problem 3.3. Let $A$ be the Ornstein-Uhlenbeck operator defined in (91) and $L_M$ the Mehler operator defined in (69).

1. Show that $L_M^* = A$ in $L^2(dx)$.

2. Show that $L_{\frac{|x|^2}{4}} = A^*$ in $L^2(e^{-\frac{|x|^2}{4}}\,dx)$.

Problem 3.4. Prove Boltzmann's H-theorem in (132).

Problem 3.5. The KLS conjecture (Conjecture 3.8) implies a weaker variance conjecture: there exists an absolute constant $C > 0$ such that for any log-concave probability measure $\nu$ on $\mathbb{R}^n$, we have
$$\mathrm{Var}_{X\sim\nu}(|X|^2) \le C\lambda_\nu^2\, \mathbb{E}_{X\sim\nu}[|X|^2], \qquad (134)$$
where $\lambda_\nu$ is the square root of the largest eigenvalue of the covariance matrix $\Sigma = \mathbb{E}_{X\sim\nu}[XX^T]$. In particular, if $\Sigma = I_n$ is isotropic, then the variance conjecture reads
$$\mathrm{Var}_{X\sim\nu}(|X|^2) \le Cn. \qquad (135)$$
It is easy to see that the variance conjecture in (134) is a special case of the KLS conjecture by considering $f(x) = |x|^2$.

1. Gaussian Poincaré inequality. Let $\gamma$ be the standard Gaussian measure on $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$ be a Lipschitz function. Show that
$$\mathrm{Var}_\gamma(f) \le \mathbb{E}_\gamma[|\nabla f|^2], \qquad (136)$$
where the equality is attained for $f(x) = x_1 + \cdots + x_n$ and $x = (x_1,\dots,x_n)$.

2. Show that the Gaussian Poincaré inequality (136) is sharp (up to a multiplicative constant) for $f(x) = x_1^2 + \cdots + x_n^2$.

3. Exponential Poincaré inequality. Let $X$ have the Laplace (i.e., double-exponential) distribution on $\mathbb{R}$ (i.e., the density $\nu$ of $X$ is given by $\nu(x) = \frac12 e^{-|x|}$ for $x \in \mathbb{R}$). Show that
$$\mathrm{Var}(f(X)) \le 4\,\mathbb{E}[f'(X)^2]. \qquad (137)$$

4. Now using the Efron-Stein inequality (i.e., a special case of the tensorization technique), show that
$$\mathrm{Var}_{\nu^{\otimes n}}(f) \le 4\,\mathbb{E}_{\nu^{\otimes n}}[|\nabla f|^2]. \qquad (138)$$
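For illustration, the Gaussian Poincaré inequality (136) in part 1 can be checked by Monte Carlo in dimension one; the test function $f(x) = \sin x$ below is a hypothetical choice for which both sides have closed forms:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)          # samples from gamma in dimension 1

# f(x) = sin(x): closed forms are
#   Var_gamma(f) = (1 - e^{-2}) / 2  and  E_gamma[|f'|^2] = (1 + e^{-2}) / 2
var_f = np.var(np.sin(x))
grad2 = np.mean(np.cos(x) ** 2)

assert var_f <= grad2                               # Poincare inequality (136)
assert abs(var_f - (1 - np.exp(-2)) / 2) < 5e-3     # Monte Carlo vs closed form
assert abs(grad2 - (1 + np.exp(-2)) / 2) < 5e-3
```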

4 Nonlinear parabolic equations: interacting diffusions

4.1 McKean-Vlasov equation

Consider the following nonlinear continuity equation:
$$\partial_t\rho = \Delta\rho - \nabla\cdot\Big(\rho\int_{\mathbb{R}^n} b(\cdot,y)\,\rho_t(dy)\Big), \qquad (139)$$
where $b : \mathbb{R}^n\times\mathbb{R}^n \to \mathbb{R}^n$ is a bounded Lipschitz function and $\rho_t(\cdot) = \rho(\cdot,t)$. Note that (139) is a nonlinear parabolic PDE, which arises from McKean's example of the interacting diffusion process (in statistical physics), described by the following SDE: starting from $X_0$,
$$dX_t = \Big(\int_{\mathbb{R}^n} b(X_t,y)\,\rho_t(dy)\Big)\,dt + \sqrt{2}\,dW_t, \qquad (140)$$
where $\rho_t(dy)$ is the distribution of $X_t$. In (140), the drift coefficient depends on the process state $X_t$ and its marginal density $\rho_t$. Here we consider a simpler version of the McKean-Vlasov equations where the diffusion coefficient is constant. Note that (139) is the forward equation corresponding to the time marginals of the nonlinear process (140) (cf. Appendix A). Indeed, by Itô's lemma, for $f \in C_b^2(\mathbb{R}^n)$,
$$df(X_t) = \Big[\Delta f(X_t) + \Big\langle \nabla f(X_t), \int b(X_t,y)\,\rho_t(dy)\Big\rangle\Big]\,dt + \sqrt{2}\,\langle\nabla f(X_t), dW_t\rangle, \qquad (141)$$
which implies that the generator $A$ of $(X_t)_{t>0}$ is given by
$$Af = \Delta f + \Big\langle \nabla f, \int b(\cdot,y)\,\rho_t(dy)\Big\rangle. \qquad (142)$$

Thus we get the adjoint operator
$$A^*\rho = \Delta\rho - \nabla\cdot\Big(\rho\int b(\cdot,y)\,\rho_t(dy)\Big), \qquad (143)$$
which entails the forward equation (139). Conversely, one can construct a unique nonlinear process with time marginal distributions satisfying the continuity equation (139).

Theorem 4.1 (Theorem 1.1 in [14]). There is existence and uniqueness, trajectorial and in distribution, for the solution of (139). A construction of (a system of) the interacting diffusion processes is given in (145) below based on the interacting N-particle system.

4.2 Mean field asymptotics

Consider the N-particle system: for each particle $i = 1,\dots,N$,
$$dX_t^{i,N} = \frac{1}{N}\sum_{j=1}^N b(X_t^{i,N}, X_t^{j,N})\,dt + \sqrt{2}\,dW_t^i, \qquad (144)$$
starting from $X_0^{i,N}$. Note that Theorem 4.1 allows us to construct the processes $(X_t^i)_{t>0}$ for each $i \ge 1$ as the solution of
$$X_t^i = X_0^i + \sqrt{2}\,W_t^i + \int_0^t\!\!\int b(X_s^i, y)\,\rho_s(dy)\,ds, \qquad (145)$$
where $\rho_s(dy)$ is the distribution of $X_s^i$. The following theorem gives a guarantee that the (weak) solution of the McKean-Vlasov equation is governed by the mean-field limit.

Theorem 4.2 (Finite-N approximation error for the mean-field limit). For any $i \ge 1$ and $T > 0$, we have
$$\sup_N \sqrt{N}\,\mathbb{E}\Big[\sup_{t\le T}|X_t^{i,N} - X_t^i|\Big] < \infty. \qquad (146)$$

The Vlasov equation is a special case of (144) where $b(x,x') = \nabla U(x - x')$ and $U : \mathbb{R}^n \to \mathbb{R}$ is a potential for the particle interactions. In this case, the system of Langevin equations for the N-particle system becomes
$$dX_t^{i,N} = \frac{1}{N}\sum_{j=1}^N \nabla U(X_t^{i,N} - X_t^{j,N})\,dt + \sqrt{2}\,dW_t^i. \qquad (147)$$

If the initial distribution of $X_0$ is chaotic (i.e., $X_0 \sim \rho_0^{\otimes N}$) and $U$ is smooth, then for each $t$, we have the mean-field limit:
$$\rho_t^{(N)} \rightharpoonup \rho_t, \quad\text{as } N \to \infty, \qquad (148)$$
where $\rho_t^{(N)} = N^{-1}\sum_{i=1}^N \delta_{X_t^{i,N}}$ is the empirical measure of $X_t^{i,N}$ and $\rightharpoonup$ denotes weak convergence.

On the other hand, the Vlasov equation (139) can be viewed as a $W_2$-gradient flow for the functional
$$F(\rho) = \int \rho\log\rho + \frac12\iint U(x-y)\,d\rho(x)\,d\rho(y), \qquad (149)$$

because the first variation of the interaction potential
$$\mathcal{U}(\rho) = \frac12\iint U(x-y)\,d\rho(x)\,d\rho(y) \qquad (150)$$
at $\rho$ equals $\frac{\delta\mathcal{U}}{\delta\rho}(\rho) = U * \rho$. Since $\nabla(U*\rho) = (\nabla U)*\rho$, the continuity equation in (56) with $F$ given in (149) becomes
$$\partial_t\rho - \Delta\rho - \nabla\cdot(\rho\,(\nabla U)*\rho) = 0, \qquad (151)$$
which can be thought of as a nonlinear Fokker-Planck equation (62) with a state- and density-dependent free energy functional (see the similar comments for the SDE version (140)).
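A minimal sketch of simulating the N-particle system (144) with Euler-Maruyama; the linear interaction kernel $b(x,y) = -(x-y)$ below is a hypothetical choice, convenient because the pairwise drift collapses to $\bar X - X_i$ and each particle then behaves like an Ornstein-Uhlenbeck process around the empirical mean:

```python
import numpy as np

rng = np.random.default_rng(2)
N, h, steps = 500, 1e-2, 200            # N particles, Euler-Maruyama step h
X = rng.normal(0.0, 2.0, N)             # chaotic initial condition: i.i.d. from rho_0

for _ in range(steps):
    # drift (1/N) sum_j b(X_i, X_j) with b(x, y) = -(x - y): equals mean(X) - X_i
    drift = X.mean() - X
    X = X + drift * h + np.sqrt(2 * h) * rng.standard_normal(N)

# for this kernel the fluctuation around the mean is an OU process with
# stationary variance 1, so the empirical variance should settle near 1
assert np.isfinite(X).all()
assert 0.5 < X.var() < 2.0
```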

4.3 Application: landscape of two-layer neural networks

We present an application of the mean-field asymptotics to approximate the dynamics of stochastic gradient descent (SGD) for learning two-layer neural networks [9]. In supervised machine learning problems, one often wants to learn the relationship between $(x,y) \in \mathbb{R}^n\times\mathbb{R}$, where $x$ is the feature vector and $y$ is the label such that $(x,y) \sim \mu$. Suppose a sample of independent and identically distributed (i.i.d.) data $(x_i,y_i)_{i=1}^m$ is drawn and observed from the joint distribution $\mu$. In a two-layer neural network, the dependence of $y$ on $x$ is modeled as
$$\hat{y}(x;\theta) = \frac{1}{N}\sum_{j=1}^N \sigma_*(x;\theta_j), \qquad (152)$$
where $N$ is the number of hidden units (i.e., neurons), $\sigma_* : \mathbb{R}^n\times\mathbb{R}^p \to \mathbb{R}$ is an activation function, and $\theta_j \in \mathbb{R}^p$ are the parameters of the $j$-th hidden unit. Here we collectively denote all parameters as $\theta = (\theta_1,\dots,\theta_N)$. Let $\ell(\hat{y},y) = (\hat{y}-y)^2$ be the square loss function. Given the activation function $\sigma_*$, the learning problem aims to find the "best" parameters $\theta$ such that the finite-N risk (i.e., generalization error)

$$R_N(\theta) = \mathbb{E}[\ell(\hat{y}(x;\theta), y)] = \mathbb{E}[\hat{y}(x;\theta) - y]^2 \qquad (153)$$
is minimized. The parameter learning is often carried out by SGD, which iteratively updates the parameters by the following rule:
$$\theta_j^{k+1} = \theta_j^k + 2s_k\big(y_k - \hat{y}(x_k;\theta^k)\big)\,\nabla_{\theta_j}\sigma_*(x_k;\theta_j^k), \qquad (154)$$
where $\theta^k = (\theta_1^k,\dots,\theta_N^k)$ denotes the parameters after $k$ iterations and $s_k$ is a step size. Note that we can decompose the finite-N risk as
$$R_N(\theta) = R_\sharp + \frac{2}{N}\sum_{j=1}^N V(\theta_j) + \frac{1}{N^2}\sum_{i,j=1}^N U(\theta_i,\theta_j), \qquad (155)$$
where $R_\sharp = \mathbb{E}[y^2]$ is the constant risk of $\hat{y} = 0$, $V(\theta_j) = -\mathbb{E}[y\,\sigma_*(x;\theta_j)]$, and $U(\theta_i,\theta_j) = \mathbb{E}[\sigma_*(x;\theta_i)\,\sigma_*(x;\theta_j)]$. Let $\rho^{(N)} = N^{-1}\sum_{j=1}^N\delta_{\theta_j}$ be the empirical measure of $\theta_1,\dots,\theta_N$. Then $R_N$ in (155) can be written as
$$R_N(\theta) = R_\sharp + 2\int V(\theta)\,\rho^{(N)}(d\theta) + \iint U(\theta_1,\theta_2)\,\rho^{(N)}(d\theta_1)\,\rho^{(N)}(d\theta_2), \qquad (156)$$

which suggests considering the minimization of a mean-field limit risk (as $N\to\infty$) defined for a probability measure $\rho \in \mathcal{P}(\mathbb{R}^p)$:
$$R(\rho) = R_\sharp + 2\int V(\theta)\,\rho(d\theta) + \iint U(\theta_1,\theta_2)\,\rho(d\theta_1)\,\rho(d\theta_2). \qquad (157)$$

Under mild assumptions,
$$\inf_{\theta\in(\mathbb{R}^p)^{\otimes N}} R_N(\theta) = \inf_{\rho\in\mathcal{P}(\mathbb{R}^p)} R(\rho) + O(1/N), \qquad (158)$$
so that the mean-field limit provides a good approximation of the finite-N risk. In addition, if the SGD (154) has a chaotic initialization $\theta^0 = (\theta_1^0,\dots,\theta_N^0) \sim \rho_0^{\otimes N}$ and the step size is chosen as $s_k = \varepsilon\,\xi(k\varepsilon)$ for some sufficiently regular function $\xi : \mathbb{R}_+ \to \mathbb{R}_+$, then the SGD for estimating $\theta$ is also well approximated by a continuum dynamics on $\mathcal{P}(\mathbb{R}^p)$ in the sense that, for each fixed $t \ge 0$,
$$\rho_{t/\varepsilon}^{(N)} \rightharpoonup \rho_t, \quad\text{as } N\to\infty,\ \varepsilon\to0, \qquad (159)$$
where $\rho_k^{(N)} = N^{-1}\sum_{j=1}^N \delta_{\theta_j^k}$ is the empirical distribution after $k$ SGD steps (here $k = t/\varepsilon$) and $(\rho_t)_{t\ge0}$ is a weak solution of the following distributional dynamics:
$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big) \qquad (160)$$
and
$$\Psi(\theta;\rho) = V(\theta) + \int U(\theta,\theta_2)\,\rho(d\theta_2). \qquad (161)$$

In particular, if $\xi(t) \equiv 1$, then the distributional dynamics in (160) and (161) is the Wasserstein gradient flow of the risk $R(\rho)$ in $(\mathcal{P}_2(\mathbb{R}^p), W_2)$. Hence the mean-field approximation provides an "averaging out" perspective on the complexity of the landscape of neural networks, which can be useful to prove the convergence of the SGD algorithm or its variants.
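The risk decomposition (155)-(156) is an exact algebraic identity once the expectations are taken over the observed sample, so it can be verified to machine precision; the tanh activation and the synthetic Gaussian data below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, N = 400, 3, 10                      # m samples, input dim n, N hidden units
X = rng.standard_normal((m, n))           # synthetic features (hypothetical data)
y = rng.standard_normal(m)                # synthetic labels
theta = rng.standard_normal((N, n))
sigma = np.tanh(X @ theta.T)              # sigma[i, j] = sigma_*(x_i; theta_j)

yhat = sigma.mean(axis=1)                 # two-layer network output (152)
R_N = np.mean((yhat - y) ** 2)            # empirical risk (153)

# decomposition (155), with expectations replaced by empirical means
R_sharp = np.mean(y ** 2)
V = -np.mean(y[:, None] * sigma, axis=0)  # V(theta_j) = -E[y sigma_*(x; theta_j)]
U = sigma.T @ sigma / m                   # U(theta_i, theta_j)
decomposed = R_sharp + 2 * V.sum() / N + U.sum() / N ** 2
assert abs(R_N - decomposed) < 1e-10      # (155)/(156) hold exactly
```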

5 Diffusions on manifold

5.1 Heat equation on manifold

In order to speak of the heat equation on a manifold, we first need to extend the concepts of the gradient vector field $\nabla$, the divergence $\nabla\cdot$, and the Laplace operator $\Delta = \nabla\cdot\nabla$ from the Euclidean space to the manifold setting.

Definition 5.1 (Divergence). Let $X$ be a vector field on a smooth manifold $(M,\mathcal{O},\mathcal{A})$ with connection $\nabla$. On a chart $(U,x)$, the divergence $\nabla\cdot : \Gamma(TM) \to \mathbb{R}$ (sometimes also written as $\mathrm{div}(\cdot)$) is defined as
$$\mathrm{div}(X) := \nabla\cdot X = \big(\nabla_{\frac{\partial}{\partial x^i}} X\big)^i, \qquad (162)$$
where $\nabla_{\frac{\partial}{\partial x^i}} X$ is the covariant derivative of $X$ in the direction of $\frac{\partial}{\partial x^i}$ and the last expression is implied by the Einstein summation convention (cf. Appendix C.2 for the notation).

Remark 5.2. Often we also write the divergence as $\nabla_M\cdot$ or $\mathrm{div}_M(\cdot)$ to denote explicitly that it is a divergence on a (curved) manifold (rather than on a Euclidean space with a global coordinate system). Moreover, even though the divergence is defined on charts, one can show that the divergence of a vector field is chart-independent. Similar comments apply to the gradient and the Laplace-Beltrami operator below.

The next question is how to define an analog of the gradient vector field. Recall that for a (smooth) function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient is a vector-valued map $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$. Viewing gradient vectors as a vector field on $\mathbb{R}^n$, we get the gradient operator $\nabla : C^\infty(\mathbb{R}^n) \to \Gamma(T\mathbb{R}^n)$, where $\Gamma(T\mathbb{R}^n)$ is the set of all vector fields on the tangent bundle $T\mathbb{R}^n \cong \mathbb{R}^n\times\mathbb{R}^n$ (cf. Appendix C.4 for more details). Replacing the Euclidean space $\mathbb{R}^n$ by a Riemannian manifold $M$, we can define the gradient on such manifolds.

Definition 5.3 (Gradient on manifold). On a Riemannian manifold $(M,\mathcal{O},\mathcal{A},g)$, the gradient is defined as the map
$$\nabla : C^\infty(M) \to \Gamma(TM), \qquad f \mapsto \nabla f := \flat^{-1}(df), \quad\text{for } f \in C^\infty(M), \qquad (163)$$
where $\flat := \flat_g : \Gamma(TM) \to \Gamma(T^*M)$ is the metric-induced (invertible) musical map defined in (270) (cf. Appendix C.6).

With the concepts of gradient (Definition 5.3) and divergence (Definition 5.1), we can talk about the Laplace-Beltrami operator on a Riemannian manifold.

Definition 5.4 (Laplace-Beltrami operator). On a Riemannian manifold $(M,\mathcal{O},\mathcal{A},g)$, the Laplace-Beltrami operator is the linear operator $\Delta : C^\infty(M) \to C^\infty(M)$ defined as
$$\Delta f = \nabla\cdot\nabla f. \qquad (164)$$
In particular, on a (positively) oriented Riemannian manifold $(M,\mathcal{O},\mathcal{A}^\uparrow,g)$, the Laplace-Beltrami operator in a given chart $(U,x) \in \mathcal{A}^\uparrow$ can be expressed as (cf. Chapter 1 in [12])

$$\Delta f = \frac{1}{\sqrt{\det(g)}}\,\frac{\partial}{\partial x^i}\Big(\sqrt{\det(g)}\; g^{ij}\,\frac{\partial f}{\partial x^j}\Big). \qquad (165)$$

The (linear) heat equation on a Riemannian manifold $(M,\mathcal{O},\mathcal{A},g)$ is defined as
$$(\partial_t - \Delta)u = 0, \qquad (166)$$
where $u : M\times[0,\infty) \to \mathbb{R}$ is a smooth function and $\Delta$ is the Laplace-Beltrami operator (164).

5.2 Application: manifold clustering

We present an application of manifold clustering based on the idea of heat diffusion on Riemannian manifolds [4]. For $k = 1,\dots,K$, let $M_k$ be compact $n_k$-dimensional Riemannian manifolds that are disjoint. Suppose we observe a sequence of independent random variables $X_1,\dots,X_m$ taking values in $M := \sqcup_{k=1}^K M_k$. Suppose further that there exists a clustering structure $G_1^*,\dots,G_K^*$ (i.e., a partition of $[m] := \{1,\dots,m\}$ such that $[m] = \sqcup_{k=1}^K G_k^*$) such that each of the $m$ data points belongs to one of the $K$ clusters: if $i \in G_k^*$, then $X_i \sim \mu_k$ for some probability measure $\mu_k$ supported on $M_k$. Given the observations $X_1,\dots,X_m$, the manifold clustering task is to develop (computationally feasible) algorithms with strong theoretical guarantees to recover the true clustering structure $G_1^*,\dots,G_K^*$.

6 Nonlinear geometric equations

6.1 Mean curvature flow

Now we turn to the nonlinear and geometric continuity equations that describe the (time) evolution of manifolds deformed by a certain volume functional. Let $M := M^n \subset \mathbb{R}^N$ be an $n$-dimensional closed submanifold properly embedded in an ambient Euclidean space $\mathbb{R}^N$. For $p \in M$, denote $T_pM$ as the tangent space to $M$ at $p$, and $T_p^\perp M$ as the orthogonal complement of $T_pM$. Let
$$A : T_pM\times T_pM \to T_p^\perp M \qquad (167)$$
be the second fundamental form, bilinear and symmetric, defined as
$$A(u,v) = (\nabla_v u)^\perp, \qquad (168)$$
where $u$ and $v$ are two vector fields tangent to $M$, and $\nabla_v u$ is the Euclidean covariant derivative of $u$ in the direction of $v$ (cf. the definition of covariant derivative in Appendix C.5). The mean curvature is defined as
$$H = -\mathrm{tr}(A) \qquad (169)$$
such that $H(p) = -\sum_{i=1}^n (\nabla_{e_i} e_i)^\perp$, where $(e_i)_{i=1}^n$ is an orthonormal basis (ONB) for $T_pM$.

Given an infinitely differentiable (i.e., smooth), compactly supported, normal vector field $v$ on $M$, consider the one-parameter variation
$$M_{s,v} = \{x + s\,v(x) : x \in M\}, \qquad (170)$$
which gives a curve $s \mapsto M_{s,v}$ in the space of submanifolds with $M_{0,v} = M$. To study the geometric flow of the one-parameter family of submanifolds, we first need to compute the time derivative of volume, which is given by the first variation formula of volume.

Lemma 6.1 (First variation of the volume functional). The first variation formula of volume is given by
$$\frac{d}{ds}\Big|_{s=0}\mathrm{Vol}(M_{s,v}) = \int_M \langle v, H\rangle, \qquad (171)$$
where $H$ is the mean curvature vector in (169) and the integration over $M$ is with respect to the volume form $d\mathrm{Vol}_M$.

Example 6.2 ($n$-sphere). Consider the $n$-sphere $M = \{x\in\mathbb{R}^{n+1} : |x| = r\}$ with radius $r$ embedded in $\mathbb{R}^{n+1}$ for $n \ge 1$, and let $v = x/|x|$ be the unit normal vector field (since $M$ is a hypersurface in $\mathbb{R}^{n+1}$). Consequently the one-parameter variation of $M$ is given by
$$M_{s,v} = \Big\{x + s\,\frac{x}{|x|} : x\in M\Big\} = \Big\{\Big(1+\frac{s}{r}\Big)x : |x| = r\Big\}. \qquad (172)$$

Then,
$$\mathrm{Vol}(M_{s,v}) = \Big(1 + \frac{s}{r}\Big)^n \mathrm{Vol}(M) \qquad (173)$$
and
$$\frac{\mathrm{Vol}(M_{s,v}) - \mathrm{Vol}(M)}{s} = \frac{(1 + \frac{s}{r})^n - 1}{s}\,\mathrm{Vol}(M). \qquad (174)$$
Letting $s \to 0$, we get
$$\frac{d}{ds}\Big|_{s=0}\mathrm{Vol}(M_{s,v}) = \frac{n}{r}\,\mathrm{Vol}(M). \qquad (175)$$
On the other hand,
$$\langle H, v\rangle = -\sum_{i=1}^n\langle\nabla_{e_i}^\perp e_i,\, v\rangle = \sum_{i=1}^n\langle e_i,\, \nabla_{e_i} v\rangle = \nabla_M\cdot v = \nabla_M\cdot\Big(\frac{x}{|x|}\Big) = \frac{n}{r}, \qquad (176)$$
which implies that
$$\int_M\langle H, v\rangle = \frac{n}{r}\,\mathrm{Vol}(M). \qquad (177)$$
Combining (175) and (177), we see that (171) indeed holds for the $n$-sphere.

Definition 6.3 (Mean curvature flow). A one-parameter family of $n$-dimensional submanifolds $M_t^n \subset \mathbb{R}^N$ is said to flow (or move by motion) by the mean curvature flow (MCF) if

$$x_t = -H, \qquad (178)$$
where $x_t = \frac{\partial x}{\partial t}$ is the normal component of the time derivative of the position vector $x$. If $n = 1$, then (178) is also called the curve-shortening flow. In light of the first variation formula (171), we see that the mean curvature flow (178) is the negative gradient flow of volume.

Let $M_t^n \subset \mathbb{R}^N$ be $n$-dimensional submanifolds flowing by the MCF. By Lemma ??, the coordinate functions $x_1,\dots,x_N$ of the ambient Euclidean space restricted to the evolving submanifolds satisfy the (nonlinear) heat equation
$$(\partial_t - \Delta_{M_t})\,x_i = 0, \qquad i = 1,\dots,N. \qquad (179)$$

In fact, the heat equation (179) is a characterization of the mean curvature flow (178).

Definition 6.4 (Gaussian surface area or volume). Let $M \subset \mathbb{R}^N$ be an $n$-dimensional submanifold. Define the Gaussian surface area or volume as
$$F(M) = (4\pi)^{-n/2}\int_M e^{-\frac{|x|^2}{4}}. \qquad (180)$$

6.2 Problem set

Problem 6.1. Prove the first variation formula of volume in Lemma 6.1.

Problem 6.2. Prove that the coordinate functions satisfy the nonlinear heat equation (179).

A From SDE to PDE, and back

We derive the PDE from the associated SDE. The derivation is based on Itô's calculus and a guess-and-verify trick. Let $(X_t)_{t>0}$ be a one-dimensional diffusion process of the form

dXt = m(Xt, t) dt + σ(Xt, t) dWt. (181)

For a twice differentiable function $f(x,t)$, Itô's lemma gives

$$df(X_t,t) = (\partial_t f + m\,\partial_x f + D\,\partial_x^2 f)\,dt + \sigma\,\partial_x f\,dW_t, \qquad (182)$$
where $D(X_t,t) = \frac12\,\sigma(X_t,t)^2$. Equation (182) implies that the backward equation is given by
$$\partial_t f(x,t) + m(x,t)\,\partial_x f(x,t) + D(x,t)\,\partial_x^2 f(x,t) = 0. \qquad (183)$$
Indeed, if (183) holds, then (182) implies that for any $T > t$,
$$f(X_T,T) - f(X_t,t) = \int_t^T \sigma(X_s,s)\,\partial_x f(X_s,s)\,dW_s. \qquad (184)$$

Taking the conditional expectation on $X_t$, we get
$$\mathbb{E}[f(X_T,T)\,|\,X_t] - f(X_t,t) = \mathbb{E}\Big[\int_t^T \sigma(X_s,s)\,\partial_x f(X_s,s)\,dW_s\,\Big|\,X_t\Big] = 0, \qquad (185)$$
where the last step follows from the fact that the Itô integral is a martingale (so the conditional expectation of the increment vanishes). Now we need to verify that the solution $f$ to the equation
$$f(X_t,t) = \mathbb{E}[f(X_T,T)\,|\,X_t] \qquad (186)$$
should satisfy the assumed backward equation (183). (In finance, the function $f(x,t)$ is called the value function at time $t$, and (186) represents the value function at a future time point $T > t$.) Using $T = t + dt$ and the Itô formula trick (i.e., expanding the time derivative to the first order and the spatial derivative to the second order due to the quadratic variation of $(W_t)_{t>0}$), we see that the solution to (186) satisfies the backward equation (183). Now if we define the differential operator $A$ as
$$Af = m\,\partial_x f + D\,\partial_x^2 f, \qquad (187)$$
then the backward equation (183) can be written as
$$\partial_t f + Af = 0. \qquad (188)$$

The operator $A$ is called the generator of the diffusion process $(X_t)_{t>0}$. Then we find the adjoint operator $A^*$ in $L^2(dx)$ satisfying $\langle Af, \rho\rangle = \langle f, A^*\rho\rangle$ for all $\rho$ and $f$, where $\langle f,g\rangle = \int fg$. Integrating by parts, we see that the adjoint operator $A^*$ in $L^2(dx)$ is given by
$$A^*\rho = -\partial_x(m\rho) + \partial_x^2(D\rho). \qquad (189)$$
Thus the forward equation is determined by the adjoint operator:
$$\partial_t\rho = A^*\rho. \qquad (190)$$
Note that (190) is nothing but the one-dimensional Fokker-Planck equation in PDE form:
$$\partial_t\rho(x,t) = -\partial_x\big(m(x,t)\rho(x,t)\big) + \partial_x^2\big(D(x,t)\rho(x,t)\big), \qquad (191)$$
where $\rho(x,t)$ is the density of particles at position $x$ at time $t$. This is indeed a forward equation since it predicts $\rho(x,t)$ from the initial datum $\rho(x,0)$. The process of deriving the SDE from the PDE in the Fokker-Planck case is described in Section 3.2.
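The adjoint relation $\langle Af,\rho\rangle = \langle f, A^*\rho\rangle$ behind (189) can be checked on a grid with finite differences; the drift $m(x) = -x$ and the non-constant $\sigma$ below are hypothetical choices, and both test functions decay fast enough that the boundary terms from integration by parts vanish:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 16001)
dx = x[1] - x[0]
m = -x                                    # drift m(x) = -x (hypothetical choice)
D = 0.5 * (1.0 + 0.2 * np.sin(x)) ** 2    # D = sigma^2 / 2 with a non-constant sigma

f = np.exp(-0.5 * (x - 1.0) ** 2)         # rapidly decaying: no boundary terms
rho = np.exp(-0.5 * (x + 1.0) ** 2)

d = lambda g: np.gradient(g, dx)          # second-order finite differences
Af = m * d(f) + D * d(d(f))               # generator (187)
Astar_rho = -d(m * rho) + d(d(D * rho))   # adjoint (189)

lhs = np.sum(Af * rho) * dx               # <Af, rho>
rhs = np.sum(f * Astar_rho) * dx          # <f, A* rho>
assert abs(lhs - rhs) < 1e-3              # agree up to discretization error
```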

B Logarithmic Sobolev inequalities and relatives

A probability measure $\pi$ is said to satisfy the logarithmic Sobolev inequality (LSI) if there exists a $\kappa > 0$ such that for any $\nu \in \mathcal{P}(\mathbb{R}^n)$ with $\nu \ll \pi$,
$$H(\nu\|\pi) \le \frac{1}{2\kappa}\, I(\nu\|\pi). \qquad (192)$$

B.1 Gaussian Sobolev inequalities

Let $\gamma$ be the standard Gaussian measure on $\mathbb{R}^n$, i.e., $\gamma(x) = (2\pi)^{-n/2}\exp(-|x|^2/2)$.

Lemma B.1 (Gaussian LSI: information-theoretic version). For any $\nu \in \mathcal{P}(\mathbb{R}^n)$ with $\nu \ll \gamma$,
$$H(\nu\|\gamma) \le \frac12\, I(\nu\|\gamma). \qquad (193)$$

From Lemma B.1, we see that the standard Gaussian measure $\gamma$ satisfies the LSI with $\kappa = 1$. The equality in (193) is achieved if $\nu$ is a translation of $\gamma$. To see this, take
$$\nu(x) = (2\pi)^{-n/2}\exp\Big(-\frac{|x-y|^2}{2}\Big), \qquad y \in \mathbb{R}^n. \qquad (194)$$
Then
$$H(\nu\|\gamma) = \int\log\Big(\frac{\nu}{\gamma}\Big)\,d\nu = -\frac12\int(|x-y|^2 - |x|^2)\,d\nu = \int\langle x,y\rangle\,d\nu - \frac12\int|y|^2\,d\nu = |y|^2 - \frac12|y|^2 = \frac12|y|^2. \qquad (195)$$
On the other hand, since $\nu \ll \gamma$, we have
$$\rho(x) = \frac{d\nu}{d\gamma} = \exp\Big(-\frac{|x-y|^2}{2} + \frac{|x|^2}{2}\Big) = \exp(\langle x,y\rangle)\exp\Big(-\frac{|y|^2}{2}\Big) \qquad (196)$$
and
$$\nabla\rho(x) = y\exp(\langle x,y\rangle)\exp\Big(-\frac{|y|^2}{2}\Big) = y\rho. \qquad (197)$$
Then
$$I(\nu\|\gamma) = \int\frac{|y|^2\rho^2}{\rho}\,d\gamma = \int|y|^2\rho\,d\gamma = |y|^2. \qquad (198)$$
Combining (195) and (198), we conclude $H(\nu\|\gamma) = \frac12 I(\nu\|\gamma)$. In fact, translation is the only case currently known to achieve equality. Hence we pose the following conjecture.

Conjecture B.2 (Characterization of the equality case in the Gaussian LSI). The equality case $H(\nu\|\gamma) = \frac12 I(\nu\|\gamma)$ in the Gaussian LSI is achieved if and only if $\nu$ is a translation of $\gamma$, i.e., $\nu(x) = \gamma(x+y)$ for some $y \in \mathbb{R}^n$.

Lemma B.3 (Gaussian LSI: functional version). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function such that $f \ge 0$ and $\int f^2\,d\gamma = 1$. Then
$$\int f^2\log f^2\,d\gamma \le 2\int|\nabla f|^2\,d\gamma. \qquad (199)$$

Equivalently, we can write
$$\mathrm{Ent}_\gamma(f^2) \le 2\int|\nabla f|^2\,d\gamma, \qquad (200)$$
where $\mathrm{Ent}_\gamma(g) = \int g\log g\,d\gamma$ is the entropy of the probability density $g \ge 0$.

It is easy to see that Lemmas B.3 and B.1 are equivalent. Let $g = f^2$. Then $\nabla g = 2f\nabla f$ and $\nabla f = \frac{1}{2\sqrt{g}}\nabla g$. Note that (199) is equivalent to
$$\int g\log g\,d\gamma \le 2\int\Big|\frac{1}{2\sqrt{g}}\nabla g\Big|^2\,d\gamma = \frac12\int\frac{|\nabla g|^2}{g}\,d\gamma. \qquad (201)$$

Since $g \ge 0$ and $\int g\,d\gamma = 1$, we can define a probability measure $\mu$ such that $\frac{d\mu}{d\gamma} = g$. Then (201) can be written as
$$H(\mu\|\gamma) \le \frac12\, I(\mu\|\gamma). \qquad (202)$$
Since $g$ is arbitrary, the equivalence between Lemmas B.3 and B.1 follows.
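The equality case computed in (195)-(198) can be confirmed by numerical quadrature in dimension one; the translation $y = 1.3$ below is an arbitrary choice:

```python
import numpy as np

y = 1.3                                   # translation parameter
x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]
gamma = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
nu = np.exp(-0.5 * (x - y) ** 2) / np.sqrt(2 * np.pi)

H = np.sum(nu * np.log(nu / gamma)) * dx                       # cf. (195)
rho = nu / gamma                                               # = e^{xy - y^2/2}, cf. (196)
I = np.sum((np.gradient(rho, dx) ** 2 / rho) * gamma) * dx     # cf. (198)

assert abs(H - y ** 2 / 2) < 1e-5          # H(nu || gamma) = |y|^2 / 2
assert abs(I - y ** 2) < 1e-3              # I(nu || gamma) = |y|^2
```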

B.2 Euclidean Sobolev inequalities

There are Euclidean LSIs, and the following Lemma B.4 is a version with sharp constant.

Lemma B.4 (Euclidean LSI). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function such that $\int f^2\,dx = 1$. Then
$$\int f^2\log f^2\,dx \le \frac{n}{2}\log\Big(\frac{2}{n\pi e}\int|\nabla f|^2\,dx\Big). \qquad (203)$$

Lemma B.5 (Sobolev inequality). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function with compact support. Then for all $n \ge 3$,
$$\Big(\int|f|^{\frac{2n}{n-2}}\Big)^{\frac{n-2}{n}} \le C(n)\int|\nabla f|^2, \qquad (204)$$
where
$$C(n) = \frac{1}{\pi n(n-2)}\Big(\frac{\Gamma(n)}{\Gamma(n/2)}\Big)^{2/n}, \qquad n \ge 3. \qquad (205)$$

Note that the Gaussian LSIs (Lemmas B.1 and B.3) are dimension-free inequalities, while the Euclidean LSI (Lemma B.4) and the Sobolev inequality (Lemma B.5) are not. The constants in (203) and (204) are both sharp. Indeed, a direct computation recovers the constant $\frac{2}{n\pi e}$ from (205). Let $m = nk$ with $k \to \infty$ and take a function $f : \mathbb{R}^{nk} \to \mathbb{R}$ such that $\int f^2\,dx = 1$. Then
$$C(m) = \frac{1}{\pi nk(nk-2)}\Big(\frac{\Gamma(nk)}{\Gamma(nk/2)}\Big)^{\frac{2}{nk}} \sim \frac{1}{\pi n^2k^2}\cdot\frac{2nk}{e} = \frac{2}{nk\pi e} = \frac{2}{m\pi e}, \qquad (206)$$
where we used Stirling's approximation
$$(nk)! \sim \sqrt{2\pi nk}\,\Big(\frac{nk}{e}\Big)^{nk}. \qquad (207)$$

C Riemannian geometry: some basics

In this section, we collect some basic facts and results about Riemannian geometry.

C.1 Smooth manifolds

Definition C.1 (Topological manifold). A topological space $(M,\mathcal{O})$ is said to be an $n$-dimensional topological manifold if for any $p \in M$, there is an open set $U \in \mathcal{O}$ containing $p$ such that there exists a map $x : U \to x(U) \subset \mathbb{R}^n$ (equipped with the standard topology on $\mathbb{R}^n$) satisfying: (i) $x$ is invertible, i.e., there is a map $x^{-1} : x(U) \to U$; (ii) $x$ is continuous; (iii) $x^{-1}$ is continuous. In other words, $x$ is a homeomorphism between $U$ and $x(U)$.

The pair $(U,x)$ in Definition C.1 is called a chart and $x$ is called the chart map. For $p \in U$ and $x(p) = (x^1(p),\dots,x^n(p))$, $x^i(p)$ is the $i$-th coordinate of $x(p)$. An atlas is a collection of charts $\mathcal{A} = \{(U_\alpha, x_\alpha) : \alpha \in A\}$ such that $M = \cup_{\alpha\in A} U_\alpha$.

For two charts $(U,x)$ and $(V,y)$ such that $U\cap V \ne \emptyset$, the map $y\circ x^{-1} : \mathbb{R}^n \supset x(U\cap V) \to y(U\cap V) \subset \mathbb{R}^n$ is said to be the chart transition map. We say $(U,x)$ and $(V,y)$ are ♣-compatible if either $U\cap V = \emptyset$, or $U\cap V \ne \emptyset$ and the chart transition maps $y\circ x^{-1} : x(U\cap V) \to y(U\cap V)$ and $x\circ y^{-1} : y(U\cap V) \to x(U\cap V)$ have a certain ♣-property (as maps from $\mathbb{R}^n$ to $\mathbb{R}^n$). For instance, the ♣-property can be $C^0(\mathbb{R}^n)$, $C^1(\mathbb{R}^n)$, ..., $C^\infty(\mathbb{R}^n)$, and so on. An atlas $\mathcal{A}$ is ♣-compatible if the transition maps between any two charts in $\mathcal{A}$ have the ♣-property.

Definition C.2 (Smooth manifold). A smooth manifold $(M,\mathcal{O},\mathcal{A})$ is a topological manifold $(M,\mathcal{O})$ equipped with a $C^\infty(\mathbb{R}^n)$-compatible atlas $\mathcal{A}$.

In these notes, we shall only consider smooth manifolds unless otherwise indicated.

Definition C.3 (Smooth functions on manifold). Let $(M,\mathcal{O},\mathcal{A})$ be a smooth manifold. A function $f : M \to \mathbb{R}$ is said to be a smooth map (or $C^\infty$-map) if $f\circ y^{-1} : y(V) \subset \mathbb{R}^n \to \mathbb{R}$ is smooth for every chart $(V,y)$ in the atlas $\mathcal{A}$.

According to Definition C.3, the coordinate functions $x^1,\dots,x^n$ of any chart $(U,x)$ in a $C^\infty$-compatible atlas are all smooth maps.
A map $\phi : M \to N$ is smooth if there are a chart $(U,x)$ in $M$ and a chart $(V,y)$ in $N$ such that $\phi(U) \subset V$ and the map $y\circ\phi\circ x^{-1} : \mathbb{R}^m \supset x(U) \to y(V) \subset \mathbb{R}^n$ is $C^\infty$.

Definition C.4 (Diffeomorphism). Let (M, OM , AM ) and (N, ON , AN ) are two smooth mani- folds. We say that (M, OM , AM ) and (N, ON , AN ) are diffeomorphic if there exists a bijection φ : M → N such that φ : M → N and φ−1 : N → M are both smooth maps.

C.2 Tensors Let (V, +, ·) be a and (V ∗, ⊕, ) be the , where

∗ ∼ V = {φ : V −→ R} =: Hom(V, R) (208) is the set of linear functionals on V . An element φ ∈ V ∗ is called a covector.

Definition C.5 (Tensor). Let (V, +, ·) be a vector space and r, s ∈ N0 := {0, 1,... }. An (r,s)-tensor T over V is an R-multi-linear map ∗ ∗ ∼ T : V × · · · × V × V × · · · × V −→ R. (209) | {z } | {z } r times s times By Definition C.5, a covector φ ∈ V ∗ is a (0, 1)-tensor over V . If dim(V ) < ∞, then ∼ ∗ ∗ ∗ ∼ V = (V ) is an isomorphism so that v ∈ V can be identified as a linear map V −→ R, which means that v is a (1, 0)-tensor.

31 Let V be an n-dimensional vector space with an (arbitrarily chosen) basis (e1, . . . , en). Then the dual basis (ε1, . . . , εn) for V ∗ is uniquely determined by

i i ε (ej) = δj, (210)

i i where δj = 1 if i = j, and δj = 0 if i 6= j. Definition C.6 (Components of tensor). Let T be an (r, s)-tensor over an n-dimensional vector 1 n space V such that n < ∞. Let (e1, . . . , en) be a basis of V and (ε , . . . , ε ) be the dual basis of V ∗. Then the components of T w.r.t. the chosen basis are defined as the (r + s)n real numbers (or sometimes called coefficients)

T^{i_1,...,i_r}{}_{j_1,...,j_s} = T(ε^{i_1}, . . . , ε^{i_r}, e_{j_1}, . . . , e_{j_s}) (211)

for i_1, . . . , i_r, j_1, . . . , j_s ∈ {1, . . . , n}.

As an example, consider a (1, 1)-tensor T with components given by T^i_j = T(ε^i, e_j). Then for any v ∈ V and φ ∈ V^∗, we can express

T(φ, v) = T( Σ_{i=1}^n φ_i ε^i, Σ_{j=1}^n v^j e_j ) = Σ_{i=1}^n Σ_{j=1}^n φ_i v^j T(ε^i, e_j) = Σ_{i=1}^n Σ_{j=1}^n φ_i v^j T^i_j. (212)

Thus the components fully determine the tensor (given the basis). To avoid writing out too many sums, we typically use the Einstein summation convention:

T(φ, v) = φ_i v^j T^i_j, (213)

where the repeated up-and-down indices i and j are summed over.
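The component formula (212)-(213) is easy to sanity-check numerically. Here is a quick sketch (not part of the notes) using numpy, where `T`, `phi`, and `v` are randomly generated component arrays and `np.einsum` carries out the Einstein summation.

```python
import numpy as np

# A (1,1)-tensor on an n-dimensional space is determined, in a chosen basis,
# by its n^2 components T^i_j (Definition C.6); evaluating T(phi, v) then
# reduces to the double sum phi_i v^j T^i_j of (212)-(213).
rng = np.random.default_rng(0)
n = 3
T = rng.standard_normal((n, n))   # components T^i_j
phi = rng.standard_normal(n)      # covector components phi_i
v = rng.standard_normal(n)        # vector components v^j

# Einstein summation: repeated up/down indices i and j are summed over.
value = np.einsum("i,j,ij->", phi, v, T)

# Sanity check against the explicit double sum in (212).
explicit = sum(phi[i] * v[j] * T[i, j] for i in range(n) for j in range(n))
assert np.isclose(value, explicit)
```

The subscript string `"i,j,ij->"` says: contract the index i of `phi` with the first index of `T` and the index j of `v` with the second, producing a scalar.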

C.3 Tangent and cotangent spaces

Definition C.7 (Velocity). Let (M, O, A) be a smooth manifold and γ : R → M be a curve (at least C^1). Let p ∈ M and γ(t_0) = p for some t_0 ∈ R. The velocity of γ at p is the linear map defined as

v_{γ,p} : C^∞(M) → R, f ↦ v_{γ,p}(f) = (f ◦ γ)'(t_0). (214)
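As a concrete sanity check (a sketch, not part of the notes), one can realize Definition C.7 on M = R^2, where the velocity acting on a smooth f is literally (f ◦ γ)'(t_0); the curve γ, the function f, and the point t_0 below are illustrative choices.

```python
import sympy as sp

# Velocity as a directional derivative (Definition C.7): for a curve
# gamma: R -> R^2 and smooth f, v_{gamma,p}(f) = (f o gamma)'(t0).
t = sp.symbols("t")
gamma = (sp.cos(t), sp.sin(t))          # a curve on the unit circle in R^2
x, y = sp.symbols("x y")
f = x**2 + y                            # a smooth test function

t0 = sp.pi / 4
v_f = sp.diff(f.subs({x: gamma[0], y: gamma[1]}), t).subs(t, t0)

# In coordinates this equals gamma'(t0) . grad f(gamma(t0)), cf. (222).
grad = [sp.diff(f, x), sp.diff(f, y)]
p = [g.subs(t, t0) for g in gamma]
chain = sum(sp.diff(g, t).subs(t, t0) * gi.subs({x: p[0], y: p[1]})
            for g, gi in zip(gamma, grad))
assert sp.simplify(v_f - chain) == 0
```

Both expressions evaluate to √2/2 − 1 at t_0 = π/4, illustrating that the velocity is chart-independent even though its components are not.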

Note that (C^∞(M), ⊕, ⊙) forms a vector space equipped with (f ⊕ g)(p) = f(p) + g(p) and (λ ⊙ g)(p) = λ · g(p). Note further that (C^∞(M), ⊕, ⊙) is not only a vector space, it is also a ring. However, (C^∞(M), ⊕, ⊙) is not a field!

Definition C.8 (Tangent vector space). Let (M, O, A) be a smooth manifold. For each p ∈ M,

T_pM := { v_{γ,p} : γ is a smooth curve in M (through p) } (215)

is the tangent space to M at p.

Intuitively, we may understand the velocity vγ,p (i.e., a tangent vector in TpM) as the directional derivative along the curve γ at p. The following Lemma C.9 confirms that the tangent space TpM is indeed a vector space.

Lemma C.9. For each p ∈ M, T_pM is a vector space of linear maps from C^∞(M) to R.

Proof of Lemma C.9. We shall define ⊕ and to make (TpM, ⊕, ) a vector space. For f ∈ C∞(M), we define

∞ ⊕ : TpM × TpM → Hom(C (M), R), (vγ,p ⊕ vδ,p)(f) = vγ,p(f) + vδ,p(f), (216) and

∞ : R × TpM → Hom(C (M), R), (s vγ,p)(f) = s · vγ,p(f). (217)

We still need to show that there are smooth curves σ and τ such that vσ,p = vγ,p ⊕ vδ,p and vτ,p = s vγ,p. We first construct the curve τ via

τ : R → M, t 7→ τ(t) = γ(st + t0) =: γ ◦ µs(t), (218) where µs(t) = st + t0. Now τ(0) = γ(t0) = p and by the chain rule,

0 0 0 0 0 vτ,p(f) = (f ◦ τ) (0) = (f ◦ γ ◦ µs) (0) = µs(0) · (f ◦ γ) (µs(0)) 0 (219) = s · (f ◦ γ) (t0) = s · vγ,p(f) = (s vγ,p)(f).

Next choose a chart (U, x) such that p ∈ U, and define

σx : R → M, −1 t 7→ σx(t) = x ((x ◦ γ)(t0 + t) + (x ◦ δ)(t1 + t) − (x ◦ γ)(t0)), (220) where γ(t0) = δ(t1) = p. Then σx(0) = p and by the chain rule,

v_{σ_x,p}(f) = (f ◦ σ_x)'(0) = ((f ◦ x^{-1}) ◦ (x ◦ σ_x))'(0)
= (x ◦ σ_x)^i{}'(0) · (∂_i(f ◦ x^{-1}))(x(σ_x(0)))
= ((x ◦ γ)^i{}'(t_0) + (x ◦ δ)^i{}'(t_1)) · (∂_i(f ◦ x^{-1}))(x(p)) (221)
= (x ◦ γ)^i{}'(t_0) · (∂_i(f ◦ x^{-1}))(x(p)) + (x ◦ δ)^i{}'(t_1) · (∂_i(f ◦ x^{-1}))(x(p))
= ((f ◦ x^{-1}) ◦ (x ◦ γ))'(t_0) + ((f ◦ x^{-1}) ◦ (x ◦ δ))'(t_1)
= (f ◦ γ)'(t_0) + (f ◦ δ)'(t_1) = v_{γ,p}(f) + v_{δ,p}(f) = (v_{γ,p} ⊕ v_{δ,p})(f),

where the last line does not depend on the choice of the chart (U, x). □

Notation. In the proof of Lemma C.9, we have seen that for a curve γ : R → M such that γ(0) = p, where (U, x) is a chart containing p,

v_{γ,p}(f) = (x ◦ γ)^i{}'(0) · (∂_i(f ◦ x^{-1}))(x(p)), f : M → R smooth, (222)

where we abbreviate γ̇^i_x(0) := (x ◦ γ)^i{}'(0) and (∂f/∂x^i)_p := (∂_i(f ◦ x^{-1}))(x(p)), and where we reserve ∂_i g for the i-th Euclidean partial derivative of a function g : R^n → R. Thus we write the velocity vector v_{γ,p} ∈ T_pM as a linear map from C^∞(M) to R defined as

v_{γ,p} = γ̇^i_x(0) (∂/∂x^i)_p, (223)

where ((∂/∂x^i)_p)_{i=1,...,n} is the chart-induced basis of T_pM and the γ̇^i_x(0) are the components of v_{γ,p} w.r.t. this basis. We emphasize that the (∂f/∂x^i)_p in (223) are not partial derivatives of f: it does not make sense to speak of partial derivatives of a function defined on a manifold, where there is no global canonical basis such as in R^n. Whenever we see (∂f/∂x^i)_p, we should translate back to its definition in (222).

Lemma C.10 (Chart-induced basis of tangent space). Let (M, O, A) be a smooth manifold and (U, x) be a chart in A that contains p. Then

(∂/∂x^1)_p, . . . , (∂/∂x^n)_p (224)

form a basis of T_pU. In particular, dim(T_pM) = n = dim(M), where dim(T_pM) is the vector space dimension of T_pM and dim(M) is the dimension of the topological manifold (M, O).

Proof of Lemma C.10. We have shown in (223) that every v_{γ,p} ∈ T_pU can be expressed as a linear combination of (∂/∂x^1)_p, . . . , (∂/∂x^n)_p. It remains to show that these vectors are linearly independent. Since A is a C^∞(R^n)-compatible atlas, x^j : U → R is smooth. By (222),

α^i (∂/∂x^i)_p (x^j) = α^i (∂_i(x^j ◦ x^{-1}))(x(p)) = α^i δ^j_i = α^j, (225)

since x^j ◦ x^{-1} : R^n → R satisfies (x^j ◦ x^{-1})(c_1, . . . , c_n) = c_j. If

0 = α^i (∂/∂x^i)_p, (226)

then α^j = 0 for all j = 1, . . . , n, which means that the chart-induced basis vectors in (224) are linearly independent. □

Definition C.11 (Cotangent space). The cotangent space T^∗_pM is the dual space of T_pM, i.e.,

T^∗_pM = {φ : T_pM → R linear} (227)

is the set of linear functionals on T_pM.

Example C.12 (Gradient is a covector in the cotangent space). Let f ∈ C^∞(M) and consider

(df)_p : T_pM → R, X ↦ (df)_p(X) = Xf, for X ∈ T_pM. (228)

We call (df)_p the gradient of f at p ∈ M. Clearly (df)_p ∈ T^∗_pM (i.e., (df)_p is a covector) and it is a (0, 1)-tensor over the vector space T_pM. Thus the components of (df)_p w.r.t. the chart-induced basis of (U, x) (cf. (211)) are given by

((df)_p)_j = (df)_p((∂/∂x^j)_p) = (∂f/∂x^j)_p, for j = 1, . . . , n. (229)

The interpretation of the gradient is as follows. The directional derivative Xf of f along a tangent vector X at p is the gradient (df)_p evaluated at X. Thus the gradient (df)_p encodes the directional derivatives of f along all possible tangent vectors at the point p.

Lemma C.13 (Dual basis of cotangent space). Let (M, O, A) be a smooth manifold and (U, x) be a chart in A that contains p, where x^j : U → R is the j-th coordinate of x. Then

(dx^1)_p, . . . , (dx^n)_p (230)

form a dual basis of T^∗_pM w.r.t. the basis (∂/∂x^1)_p, . . . , (∂/∂x^n)_p of T_pM.

Proof of Lemma C.13. The lemma follows from

(dx^i)_p((∂/∂x^j)_p) = (∂x^i/∂x^j)_p = δ^i_j (231)

and the dual basis definition in (210). □

Now suppose we have two charts (U, x) and (V, y) such that U ∩ V ≠ ∅. Take a point p ∈ U ∩ V and a tangent vector X ∈ T_pM. Then we may express X in the two charts using different local coordinate systems:

X = X^i_x (∂/∂x^i)_p = X^j_y (∂/∂y^j)_p. (232)

Let f ∈ C^∞(M). Applying the (multi-variable) chain rule, we compute

(∂/∂x^i)_p f = (∂_i(f ◦ x^{-1}))(x(p)) = (∂_i((f ◦ y^{-1}) ◦ (y ◦ x^{-1})))(x(p))
= (∂_i(y^j ◦ x^{-1}))(x(p)) · (∂_j(f ◦ y^{-1}))(y(p))
= (∂y^j/∂x^i)_p (∂f/∂y^j)_p. (233)

Thus the last two displays imply that

X^i_x (∂y^j/∂x^i)_p (∂/∂y^j)_p = X^j_y (∂/∂y^j)_p. (234)

Since the (∂/∂y^j)_p are basis vectors, we obtain the formula for the change of vector components under a change of chart:

X^j_y = (∂y^j/∂x^i)_p X^i_x, (235)

which is a linear map at the given point p. However, we should emphasize that the global chart transition map is in general nonlinear.

We can also derive a formula for the change of covector components under a change of chart. Let ω ∈ T^∗_pM. Then we may write

ω = ω_{(x)i} (dx^i)_p = ω_{(y)j} (dy^j)_p. (236)

By computations similar to those in the tangent space, one can show that

ω_{(y)j} = (∂x^i/∂y^j)_p ω_{(x)i}, (237)

and the matrix ((∂x^i/∂y^j)_p) is the inverse of the matrix ((∂y^j/∂x^i)_p).
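The transformation rule (235) can be checked symbolically for the change from Cartesian to polar coordinates on R^2 \ {0}. The sketch below (not part of the notes) computes the Jacobian (∂y^j/∂x^i)_p with sympy and applies it to the components of the Cartesian basis vector ∂/∂x^1 at the point (1, 1).

```python
import sympy as sp

# Change of vector components under a change of chart (235):
# X^j_y = (dy^j/dx^i)_p X^i_x, with x = Cartesian and y = polar coordinates.
a, b = sp.symbols("a b", positive=True)        # Cartesian coordinates (x^1, x^2)
r = sp.sqrt(a**2 + b**2)
theta = sp.atan2(b, a)
J = sp.Matrix([[sp.diff(r, a), sp.diff(r, b)],
               [sp.diff(theta, a), sp.diff(theta, b)]])   # Jacobian (dy^j/dx^i)

Xx = sp.Matrix([1, 0])                         # components of d/dx^1 in chart x
Xy = (J * Xx).subs({a: 1, b: 1})               # components in chart y at p=(1,1)

# At (1,1): dr/dx = 1/sqrt(2) and dtheta/dx = -1/2.
assert sp.simplify(Xy[0] - 1 / sp.sqrt(2)) == 0
assert sp.simplify(Xy[1] + sp.Rational(1, 2)) == 0
```

The Jacobian is evaluated at the point p, in line with the remark that (235) is linear at each point even though the chart transition map is globally nonlinear.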

C.4 Tangent bundles and tensor fields

We have defined the tangent and cotangent spaces at a given point p ∈ M in Section C.3. In this section, we extend these pointwise concepts to fields on the whole manifold, based on the theory of bundles.

Definition C.14 (Bundle and fiber). A bundle is a triple (E, M, π) such that

π : E → M, (238)

where E is a smooth manifold (the “total space”), M is a smooth manifold (the “base space”), and π is a surjective smooth map (the “projection map”). For p ∈ M, the fiber over p is the pre-image π^{-1}({p}) of {p}.

Consider a smooth n-dimensional manifold (M, O, A). Let

TM = ⊔_{p∈M} T_pM (disjoint union) (239)

be the total space and π : TM → M be the surjective map defined via X ↦ p, where p is the unique point in M such that X ∈ T_pM. We would first like to turn TM into a topological manifold such that π is continuous. We consider the coarsest such topology:

O_{TM} = {π^{-1}(U) : U ∈ O}, (240)

which is sometimes also referred to as the initial topology w.r.t. π. Next we need to construct a C^∞-atlas on TM from the C^∞-atlas on M. Let

ATM = {(T U, ξx):(U, x) ∈ A}, (241)

2n where the chart map ξx : TU → R is defined as

X ↦ ξ_x(X) = ( (x^1 ◦ π)(X), . . . , (x^n ◦ π)(X), (dx^1)_{π(X)}(X), . . . , (dx^n)_{π(X)}(X) ), (242)

where the first n entries are the (U, x)-coordinates of the base point π(X) and the last n entries are the components of X w.r.t. (U, x).

Note that

ξ_x^{-1} : ξ_x(TU) → TU,
(a^1, . . . , a^n, b^1, . . . , b^n) ↦ b^1 (∂/∂x^1)_{x^{-1}(a^1,...,a^n)} + · · · + b^n (∂/∂x^n)_{x^{-1}(a^1,...,a^n)}, (243)

where x^{-1}(a^1, . . . , a^n) = π(X) is the base point. Then we need to check that the atlas A_{TM} is smooth. Let (U, x) and (V, y) be two charts in A such that U ∩ V ≠ ∅. Combining (242) and (243), we can compute the chart transition map from R^{2n} to R^{2n} in A_{TM} as follows:

(ξ_y ◦ ξ_x^{-1})(a^1, . . . , a^n, b^1, . . . , b^n) = ξ_y( b^j (∂/∂x^j)_{x^{-1}(a^1,...,a^n)} )
= ( . . . , (y^i ◦ π)( b^j (∂/∂x^j)_{x^{-1}(a^1,...,a^n)} ), . . . ; . . . , (dy^i)_{x^{-1}(a^1,...,a^n)}( b^j (∂/∂x^j)_{x^{-1}(a^1,...,a^n)} ), . . . )
= ( . . . , (y^i ◦ x^{-1})(a^1, . . . , a^n), . . . ; . . . , b^j (∂y^i/∂x^j)_{x^{-1}(a^1,...,a^n)}, . . . ). (244)

Note that (y^i ◦ x^{-1})(a^1, . . . , a^n) is just the i-th coordinate of the chart transition map of A, and

(∂y^i/∂x^j)_{x^{-1}(a^1,...,a^n)} = ∂_j(y^i ◦ x^{-1})(x(x^{-1}(a^1, . . . , a^n))) = ∂_j(y^i ◦ x^{-1})(a^1, . . . , a^n). (245)

Since y ◦ x^{-1} is smooth, it follows that ξ_y ◦ ξ_x^{-1} is also smooth. Thus (TM, O_{TM}, A_{TM}) is a smooth manifold. Because π is a projection from TM to M, both of which are smooth manifolds, we conclude that π is a smooth map. Thus the triple (TM, M, π) has a bundle structure, which is referred to as the tangent bundle.

Definition C.15 (Tangent bundle). For a smooth manifold (M, O, A), the triple (TM, M, π) is called the tangent bundle. For simplicity, we sometimes call TM the tangent bundle.

Why do we care about tangent bundles? The reason is that any smooth vector field on a smooth manifold M can be represented as a smooth section of the tangent bundle (TM, M, π).

Definition C.16 (Section). Let (E, M, π) be a bundle. A section of the bundle is a map χ : M → E such that π ◦ χ = id_M, where id_M is the identity map on M.

Recall that (C^∞(M), ⊕, ⊙) is not only a vector space, it is also a ring. Here we slightly abuse the notation by denoting this ring as (C^∞(M), +, ·). Define

Γ(TM) = {χ : M → TM : χ is a smooth section}. (246)

The key idea is that one can think of Γ(TM) as the collection of smooth vector fields on M. Note that χ : M → TM is such that p ↦ χ(p) ∈ T_pM. Thus, given a smooth map f ∈ C^∞(M), we can view χf : M → R as defined via p ↦ χ(p)f ∈ R. In words, a smooth vector field χ ∈ Γ(TM) on M acting on a smooth function f ∈ C^∞(M) gives another smooth function χf ∈ C^∞(M). Now we want to give more algebraic structure to the set Γ(TM). Let χ, χ̃ ∈ Γ(TM) be two smooth vector fields on M and f, g ∈ C^∞(M). Define

(χ ⊕ χ̃)(f) = χf + χ̃f, (247)
(g ⊙ χ)(f) = g · χ(f). (248)

Then (Γ(TM), ⊕, ⊙) would be a vector space if C^∞(M) were a field. However, C^∞(M) is only a ring, so this makes (Γ(TM), ⊕, ⊙) a C^∞(M)-module. Informally, we call Γ(TM) the “tangent field.” Similarly, one can consider the covector fields on the cotangent bundle T^∗M and construct the sections Γ(T^∗M) as a C^∞(M)-module. In words, an element of Γ(T^∗M) takes a vector field

in Γ(TM) and produces a smooth function in C^∞(M). Informally, we also call Γ(T^∗M) the “cotangent field.”

Definition C.17 (Tensor field). An (r, s)-tensor field T over Γ(TM) is a C^∞(M)-multi-linear map

T : Γ(T^∗M) × · · · × Γ(T^∗M) (r times) × Γ(TM) × · · · × Γ(TM) (s times) → C^∞(M). (249)

By convention, a (0, 0)-tensor field is a smooth function in C^∞(M). In the language of tensor fields, an element of the cotangent field Γ(T^∗M) is a (0, 1)-tensor field, which takes a vector field in Γ(TM) and produces a smooth function in C^∞(M). Below we give such an example.

Example C.18 (Gradient tensor field as a cotangent field). The C^∞(M)-linear map

df : Γ(TM) → C^∞(M), χ ↦ df(χ) := χf, (250)

where as before (χf)(p) = χ(p)f gives the directional derivative of f along the vector χ(p) at p, is a (0, 1)-tensor field over the smooth vector fields Γ(TM) on M. In other words, the gradient tensor field df gives the directional derivatives of f (on the whole manifold M) along all possible smooth vector fields χ, whereas we recall that the gradient (df)_p gives the directional derivatives of f along all possible tangent vectors at a given point p.

C.5 Connections and curvatures

Let X ∈ Γ(TM) be a smooth vector field on M. (Note that we slightly change the notation for a vector field from χ in Appendix C.4 to X.) In this section, we wish to extend the notion of the directional derivative X : C^∞(M) → C^∞(M), defined through (Xf)(p) = X(p)f for p ∈ M, to the notion of connections on tensor fields (cf. Examples C.12 and C.18), which allows us to define straight lines and eventually leads to curvatures.

Definition C.19 (Connection). A connection (sometimes also referred to as an affine connection) ∇ on a smooth manifold (M, O, A) is a map taking a pair of a vector (field) X on M and an (r, s)-tensor field T over Γ(TM) and sending them to an (r, s)-tensor (field) ∇_X T, satisfying:

∞ 1. ∇X f = Xf for f ∈ C (M) (i.e., f is a (0, 0)-tensor); 2. Additivity (in T ): for (r, s)-tensor fields T and S,

∇X (T + S) = ∇X T + ∇X S; (251)

3. Leibniz rule (in T ): for ω ∈ Γ(T ∗M),Y ∈ Γ(TM), and (1, 1)-tensor field T ,

∇X (T (ω, Y )) = (∇X T )(ω, Y ) + T (∇X ω, Y ) + T (ω, ∇X Y ); (252)

4. C∞(M)-linearity (in X): for f ∈ C∞(M),

∇fX+Z T = f∇X T + ∇Z T. (253)

We say that ∇_X T is the covariant derivative of T in the direction of X.

Remark C.20 (Remark on the Leibniz rule). First, the Leibniz rule (252) is equivalent to the tensor product form. Let T be a (p, q)-tensor field and S be an (r, s)-tensor field. The tensor product T ⊗ S of T and S is defined as the (p + r, q + s)-tensor field such that

(T ⊗ S)(ω1, . . . , ωp+r,X1,...Xq+s) (254) = T (ω1, . . . , ωp,X1,...,Xq) · S(ωp+1, . . . , ωp+r,Xq+1,...,Xq+s),

∗ for ω1, . . . , ωp+r ∈ Γ(T M) and X1,...,Xq+s ∈ Γ(TM). Then the Leibniz rule (252) can be expressed in terms of the tensor product:

∇_X(T ⊗ S) = (∇_X T) ⊗ S + T ⊗ (∇_X S). (255)

Second, the Leibniz rule (252) can be extended to higher-order tensors, which is useful for describing curvatures of smooth manifolds. For instance, for a (1, 2)-tensor field T, the Leibniz rule in T becomes

∇_X(T(ω, Y, Z)) = (∇_X T)(ω, Y, Z) + T(∇_X ω, Y, Z) + T(ω, ∇_X Y, Z) + T(ω, Y, ∇_X Z). (256)

 Now we can equip a smooth manifold with an additional connection structure. A smooth manifold with connection is a quadruple of structures (M, O, A, ∇). Intuitively, one can think of ∇_X as an extension of the directional derivative X, and of ∇ as an extension of the differential d, where both extensions are from smooth functions to tensor fields.

How many ways are there to determine a connection ∇ on a smooth manifold (M, O, A)? It turns out that the degree of freedom (without imposing extra structures) is high: infinitely many connections can be given on a smooth manifold. To see this, let X and Y be vector fields on M. Take a chart (U, x) and note that ∂/∂x^1, . . . , ∂/∂x^n are (1, 0)-tensor fields (through an isomorphic identification). By the C^∞(M)-linearity (253) and the Leibniz rule (255) (in the tensor product form), we may compute

∇_X Y = ∇_{X^i ∂/∂x^i} (Y^j ∂/∂x^j) = X^i ∇_{∂/∂x^i} (Y^j ∂/∂x^j)
= X^i (∇_{∂/∂x^i} Y^j) ∂/∂x^j + X^i Y^j ∇_{∂/∂x^i} (∂/∂x^j)
= X^i (∂Y^j/∂x^i) ∂/∂x^j + X^i Y^j ∇_{∂/∂x^i} (∂/∂x^j), (257)

where we write ∇_{∂/∂x^i} (∂/∂x^j) =: Γ^k_{ji} ∂/∂x^k, so that the Γ^k_{ji} are the (chart-dependent) connection coefficient functions (on M) of ∇ w.r.t. (U, x).

Definition C.21 (Christoffel symbols). Given a smooth manifold (M, O, A) and a chart (U, x) ∈ A, the Christoffel symbols Γ^k_{ji} := Γ_{(x)}{}^k_{ji} are the connection coefficient functions defining a connection on the smooth manifold via

Γ^k_{ji} : U → R, p ↦ Γ^k_{ji}(p) := dx^k( ∇_{∂/∂x^i} ∂/∂x^j )(p). (258)

In a chart (U, x), using the Christoffel symbols, we can write the vector field in (257) as

∇_X Y = X^i (∂Y^j/∂x^i) ∂/∂x^j + X^i Y^j Γ^k_{ji} ∂/∂x^k = X^i [ (∂Y^j/∂x^i) ∂/∂x^j + Y^j Γ^k_{ji} ∂/∂x^k ], (259)

or we can write down its components

(∇_X Y)^m = X(Y^m) + Γ^m_{ji} Y^j X^i, (260)

where Y^m is the m-th component of Y and the products in this expression are all pointwise products of C^∞ functions. By the duality between bases and the Leibniz rule, one can show that

(∇_X ω)_m = X(ω_m) − Γ^j_{mi} ω_j X^i. (261)

Based on (260) and (261), given an n-dimensional smooth manifold (M, O, A), we can determine a connection ∇ from the n^3-many Christoffel symbols Γ^k_{ji}, and we are left with a huge freedom in choosing the Γ^k_{ji} in order to determine a manifold with smooth connection (M, O, A, ∇). A first step to narrow down the class of connections is to require a torsion-free structure.

Definition C.22 (Torsion). Let (M, O, A, ∇) be a smooth manifold with connection. The torsion of ∇ is the (1, 2)-tensor field

T (ω, X, Y ) = ω(∇X Y − ∇Y X − [X,Y ]), (262) where the Lie bracket [X,Y ] is the vector field defined by

[X,Y ]f = X(Y f) − Y (Xf), f ∈ C∞(M). (263)

The manifold (M, O, A, ∇) (or simply the connection ∇) is said to be torsion-free if T ≡ 0.

It is easy to check that the torsion T(ω, X, Y) defined in (262) is indeed a C^∞(M)-multi-linear map. Note that, for a chart-induced basis,

[∂/∂x^i, ∂/∂x^j] = 0, (264)

so that the components of the torsion are given by T^k_{ij} = Γ^k_{ji} − Γ^k_{ij}. Thus, in a chart-induced basis, the connection ∇ is torsion-free if the Christoffel symbols are symmetric: Γ^k_{ji} = Γ^k_{ij}.

In order to define a curvature, it is instructive to first think about what straight lines look like in a curved smooth manifold. (It is intuitively clear what straight lines are in Euclidean spaces.) This leads to the notion of autoparallely transported curves.

Definition C.23 (Parallel transport). A vector field X on M is said to be parallelly transported along a smooth curve γ : R → M if

∇vγ X = 0, (265) where vγ denotes the velocity vector field along γ. Definition C.24 (Autoparallely transported curve). A smooth curve γ : R → M is said to be autoparallely transported if

∇vγ vγ = 0, (266) where vγ again denotes the velocity vector field along γ.

In words, the velocity vector field along an autoparallely transported curve is constant along the curve. We can view such curves as having constant velocity, which mimics the “no-acceleration” situation of straight lines in R^n.

Definition C.25 (Riemann curvature). The Riemann curvature of a connection ∇ is the (1, 3)-tensor field

R(ω, Z, X, Y) = ω(∇_X ∇_Y Z − ∇_Y ∇_X Z − ∇_{[X,Y]} Z). (267)

It is also easy to check that the Riemann curvature R(ω, Z, X, Y) defined in (267) is indeed a C^∞(M)-multi-linear map. In a chart-induced basis ∂/∂x^1, . . . , ∂/∂x^n, the components of the Riemann curvature tensor can be expressed in terms of the Christoffel symbols:

R^a_{bcd} = R(dx^a, ∂/∂x^b, ∂/∂x^c, ∂/∂x^d)
= (∂/∂x^c) Γ^a_{db} − (∂/∂x^d) Γ^a_{cb} + Γ^a_{cs} Γ^s_{db} − Γ^a_{ds} Γ^s_{cb}. (268)

Remark C.26 (Euclidean space with Cartesian coordinates). Consider the Euclidean space as a smooth manifold (R^n, O, A) equipped with the standard topology (i.e., the topology generated by open subsets of R^n) and a smooth atlas A. Under different atlases, we may have a Euclidean space with different coordinate systems, such as Cartesian and polar coordinate systems. By default, if we assume the chart (R^n, id_{R^n}) ∈ A and, for this chart,

Γ^k_{ji} = dx^k( ∇^e_{∂/∂x^i} ∂/∂x^j ) = 0, (269)

where ∂/∂x^1, . . . , ∂/∂x^n is the global Cartesian basis, then we call ∇^e the Euclidean connection and the resulting manifold with connection (R^n, O, A, ∇^e) the n-dimensional Euclidean space. Usually, we suppress the superscript e and simply write ∇ = ∇^e.
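The chart dependence of the connection coefficients is easy to see computationally: the Euclidean connection has vanishing Christoffel symbols in Cartesian coordinates (269), yet in polar coordinates the same connection has nonzero symbols. The sketch below (not part of the notes) computes them with sympy from the polar-coordinate metric diag(1, r^2), using the Levi-Civita formula (281) stated in Section C.6.

```python
import sympy as sp

# Christoffel symbols of the Euclidean connection in polar coordinates (r, theta),
# computed from the metric diag(1, r^2) via formula (281).
r, th = sp.symbols("r theta", positive=True)
x = [r, th]
g = sp.Matrix([[1, 0], [0, r**2]])   # Euclidean metric in polar coordinates
ginv = g.inv()

def Gamma(i, j, k):
    # Gamma^i_{jk} = (1/2) g^{iq} (d_j g_{kq} + d_k g_{jq} - d_q g_{jk})
    return sum(ginv[i, q] * (sp.diff(g[k, q], x[j]) + sp.diff(g[j, q], x[k])
                             - sp.diff(g[j, k], x[q])) for q in range(2)) / 2

assert sp.simplify(Gamma(0, 1, 1) + r) == 0       # Gamma^r_{theta theta} = -r
assert sp.simplify(Gamma(1, 0, 1) - 1 / r) == 0   # Gamma^theta_{r theta} = 1/r
```

So the vanishing of the Christoffel symbols is a statement about a chart, not about the connection itself.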

C.6 Riemannian manifolds and geodesics

Now we consider a smooth manifold with a metric structure and construct a connection on such a metric manifold. On a metric manifold, we can speak of metric quantities such as the speed or length of curves. It turns out that, with an additional metric-compatibility requirement (plus the torsion-free requirement), one can uniquely determine a connection on the metric manifold such that the straight lines are the same as the length-minimizing curves.

Let g : Γ(TM) × Γ(TM) → C^∞(M) be a symmetric (0, 2)-tensor field (i.e., g(X, Y) = g(Y, X) for all vector fields X and Y in Γ(TM)). We first define the so-called musical map

[ : Γ(TM) → Γ(T ∗M), X 7→ [(X) such that [(X)(Y ) = g(X,Y ). (270)

By construction, the musical map [ := [_g depends on the metric tensor g. For a vector field X ∈ Γ(TM), [(X) ∈ Γ(T^∗M) is a (0, 1)-tensor with components given by

([(X))_a = [(X)(∂/∂x^a) = g(X, ∂/∂x^a) = g(X^m ∂/∂x^m, ∂/∂x^a) = X^m g(∂/∂x^m, ∂/∂x^a) = X^m g_{am}, (271)

where we used the C^∞(M)-multi-linearity of g.

Definition C.27 (Metric). A (non-degenerate) metric g on a smooth manifold (M, O, A) is a symmetric (0, 2)-tensor field such that the musical map [ is a C^∞-isomorphism (i.e., [ is invertible) between the tangent field Γ(TM) and the cotangent field Γ(T^∗M).

Given a (symmetric and non-degenerate) metric g, one can define its inverse g^{-1} as the symmetric (2, 0)-tensor field given by

g−1 : Γ(T ∗M) × Γ(T ∗M) → C∞(M), (ω, σ) 7→ ω([−1(σ)). (272)

For a covector field ω ∈ Γ(T ∗M), [−1(ω) ∈ Γ(TM) is a (1, 0)-tensor as a vector field (via an isomorphic identification) with components given by

([^{-1}(ω))^a = dx^a([^{-1}(ω)) = g^{-1}(dx^a, ω) = g^{-1}(dx^a, ω_m dx^m) = ω_m g^{-1}(dx^a, dx^m) = ω_m g^{am}, (273)

where we used the C^∞(M)-multi-linearity of g^{-1} and we denote by g^{am} the components of g^{-1}. Informally, in a chart, a symmetric (0, 2)-tensor field such as g can be represented by a symmetric matrix, and a symmetric matrix has an eigendecomposition with real eigenvalues. The eigenvalues themselves are chart-dependent, but the numbers of positive and negative eigenvalues are not; this sign pattern is called the signature of g.

Definition C.28 (Riemannian metric). A metric is called Riemannian (or sometimes positive-definite) if its signature is (+, . . . , +). Metrics with all other signatures are called pseudo-Riemannian metrics.

Definition C.29 (Riemannian manifold). A Riemannian manifold (M, O, A, g) is a smooth manifold (M, O, A) equipped with a Riemannian metric tensor g.

Because a Riemannian metric is positive-definite, it defines an inner product structure on the tangent field Γ(TM). In particular, on a Riemannian metric manifold (M, O, A, g), the speed s(t) of a curve at γ(t) is defined as

s(t) = sqrt( g(v_γ, v_γ)|_{γ(t)} ), (274)

where v_γ is the velocity vector field along the curve (i.e., v_{γ,t} is the velocity of the curve γ at t).

Definition C.30 (Length). Let γ : [0, 1] → M be a smooth curve. Then the length of γ is defined as

L(γ) = ∫_0^1 s(t) dt = ∫_0^1 sqrt( g(v_γ, v_γ)|_{γ(t)} ) dt. (275)

Lemma C.31 (Length is preserved under reparametrization). Let γ : [0, 1] → M be a smooth curve and σ : [0, 1] → [0, 1] be a smooth, bijective, and increasing function. Then

L(γ) = L(γ ◦ σ). (276)
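Lemma C.31 can be illustrated numerically (a sketch, not part of the notes): the length of the unit circle in R^2 with the Euclidean metric comes out as 2π both under a constant-speed parametrization and under the reparametrization σ(t) = t^3.

```python
import numpy as np

def length(theta_of_t, n=20_001):
    # curve gamma(t) = (cos theta(t), sin theta(t)) on the unit circle in R^2
    t = np.linspace(0.0, 1.0, n)
    th = theta_of_t(t)
    x, y = np.cos(th), np.sin(th)
    dx, dy = np.gradient(x, t), np.gradient(y, t)   # finite-difference velocity
    speed = np.sqrt(dx**2 + dy**2)                  # sqrt(g(v, v)) for Euclidean g
    h = t[1] - t[0]
    return (speed.sum() - 0.5 * (speed[0] + speed[-1])) * h  # trapezoid rule

L1 = length(lambda t: 2 * np.pi * t)      # constant-speed parametrization
L2 = length(lambda t: 2 * np.pi * t**3)   # reparametrized by sigma(t) = t^3
assert abs(L1 - 2 * np.pi) < 1e-3 and abs(L2 - 2 * np.pi) < 1e-3
```

The speed profiles differ (constant 2π versus 6πt^2), but the integral of the speed, i.e., the length, is the same.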

Definition C.32 (Geodesic). A curve γ : [0, 1] → M is called a geodesic on a Riemannian manifold (M, O, A, g) if it is a stationary curve w.r.t. the length functional L in (275).

Recall that the signature of a Riemannian manifold is (+, . . . , +). Thus, given two end points on M, a geodesic corresponding to a stationary point of L is a (locally) length-minimizing curve on the manifold. Nevertheless, we should note that a geodesic does not always exist: consider the two points (+1, +1) and (−1, −1) in R^2 \ {(0, 0)} with the Euclidean metric on R^2. This problem can be fixed by considering admissible curves, i.e., piecewise smooth curve segments. Setting

d(p, q) = inf{ L(γ) : γ : [0, 1] → M is admissible, γ(0) = p, γ(1) = q }, (277)

then (M, d) is a metric space [7, Chapter 2] (sometimes referred to as a length space). If (M, d) is a complete metric space, then M is a geodesically complete manifold/space, or simply a geodesic space.

Theorem C.33 (Levi-Civita connection). On a Riemannian manifold (M, O, A, g), there is a unique connection ∇ that is torsion-free (T = 0) and metric-compatible (∇g = 0). This connection is called the Levi-Civita connection (or sometimes the Riemannian connection).

For any vector fields X, Y, Z in the tangent bundle TM, the Leibniz rule (252) reads

∇_X(g(Y, Z)) = (∇_X g)(Y, Z) + g(∇_X Y, Z) + g(Y, ∇_X Z). (278)

Since g(Y, Z) ∈ C^∞(M) and ∇_X g = 0, metric compatibility can be equivalently expressed as

X(g(Y, Z)) = g(∇_X Y, Z) + g(Y, ∇_X Z). (279)

From now on, on a Riemannian manifold (M, O, A, g), the connection endowed will always be the Levi-Civita connection (without further mention). In this case, metric compatibility implies that a straight line between two points on the manifold is the same as a geodesic that minimizes the length functional L in (275). That is, once we work on a Riemannian manifold, the Riemann curvature tensor will be (implicitly) induced from the Levi-Civita connection. However, we should highlight that the Riemann curvature tensor can be defined through any connection, without referring to the Riemannian metric.

Lemma C.34 (Existence and uniqueness of geodesics). Let (M, O, A, g) be a Riemannian manifold. For every point p ∈ M and v ∈ T_pM, there is a unique maximal geodesic γ := γ_v : I → M with γ(0) = p and γ'(0) = v, defined on some open interval I containing 0.

The proof of Lemma C.34 can be found in Theorem 4.27 and Corollary 4.28 in [7].

Definition C.35 (Exponential map). Let (M, O, A, g) be a Riemannian manifold and p ∈ M. The exponential map exp_p : T_pM → M is defined by

exp_p(v) = γ_v(1), (280)

where γ_v is the unique maximal geodesic defined on an open interval containing [0, 1] (cf. Lemma C.34).

Note that for a smooth manifold (M, O, A), without an additional connection ∇ or metric g structure, we cannot speak of the shape of the manifold. Given two end points, the straight line determined by the autoparallely transported curves coincides with the geodesic determined by the metric g if the coefficient functions of the Levi-Civita connection satisfy

Γ^i_{jk} = (1/2) g^{iq} ( ∂g_{kq}/∂x^j + ∂g_{jq}/∂x^k − ∂g_{jk}/∂x^q ), (281)

which comes from the geodesic equation.
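A small symbolic check (not part of the notes): the constant-speed equator of the round 2-sphere satisfies the geodesic equation x''^k + Γ^k_{ij} x'^i x'^j = 0 underlying (281), using the Christoffel symbols from (283) below.

```python
import sympy as sp

# Check that the constant-speed equator satisfies the geodesic equation
# x''^k + Gamma^k_{ij} x'^i x'^j = 0 on the round 2-sphere.
t = sp.symbols("t")
theta = sp.pi / 2            # equator: theta constant
phi = 2 * sp.pi * t          # constant-speed parametrization

G_th_phph = -sp.sin(theta) * sp.cos(theta)    # Gamma^theta_{phi phi}, cf. (283)
G_ph_thph = sp.cos(theta) / sp.sin(theta)     # Gamma^phi_{theta phi}, cf. (283)

eq_theta = sp.diff(theta, t, 2) + G_th_phph * sp.diff(phi, t) ** 2
eq_phi = sp.diff(phi, t, 2) + 2 * G_ph_thph * sp.diff(theta, t) * sp.diff(phi, t)
assert sp.simplify(eq_theta) == 0 and sp.simplify(eq_phi) == 0
```

With the non-affine parametrization ϕ(t) = 2πt^3 from (285), ϕ'' ≠ 0 and the autoparallel equation fails even though the traced-out curve is the same; autoparallel transport singles out the affine parametrizations of a geodesic.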

Example C.36 (Round metric on 2-sphere). Consider the 2-sphere manifold (S^2, O, A) and a chart map x(p) := (x^1(p), x^2(p)) = (θ, ϕ) such that θ ∈ (0, π) and ϕ ∈ (0, 2π). We now determine the shape of the smooth 2-sphere manifold by the round metric. Under the chart map x, we may equip the 2-sphere smooth manifold (S^2, O, A) with the round metric whose components are given by

(g_{ij}(x^{-1}(θ, ϕ)))_{i,j=1,2} = [ 1, 0 ; 0, sin^2(θ) ]. (282)

It is straightforward to check that the metric g in (282) corresponds to a connection ∇^{round} with the connection coefficients in (281) given by

Γ^1_{22}(x^{-1}(θ, ϕ)) = − sin(θ) cos(θ), Γ^2_{12} = Γ^2_{21} = cot(θ), (283)

and the other five Christoffel symbols are all zero. Consider the equator curve γ of the round 2-sphere parameterized by

θ(t) := (θ ◦ γ)(t) = (x^1 ◦ γ)(t) = π/2, (284)
ϕ(t) := (ϕ ◦ γ)(t) = (x^2 ◦ γ)(t) = 2πt^3. (285)

Obviously, θ'(t) = 0 and ϕ'(t) = 6πt^2. Then the length of γ can be computed using the components of the round metric tensor:

L(γ) = ∫_0^1 sqrt( g_{ij}(x^{-1}(θ(t), ϕ(t))) (x^i ◦ γ)'(t) (x^j ◦ γ)'(t) ) dt
= ∫_0^1 sqrt( 1 · 0 + sin^2(π/2) · 36π^2 t^4 ) dt (286)
= 6π ∫_0^1 t^2 dt = 2π, (287)

which is the same as the length of the equator curve under the constant-speed parametrization ϕ(t) = 2πt. In the special case of the round sphere, this calculation verifies the length preservation under reparametrization stated in Lemma C.31.

From the metric tensor g, we can define more curvature tensors on a Riemannian manifold via contraction.

Definition C.37 (Curvature tensors). Let (M, O, A, g) be a Riemannian manifold.

1. The Riemann-Christoffel curvature is a (0, 4)-tensor defined by

R_{abcd} = g_{am} R^m_{bcd}. (288)

2. The Ricci curvature is a (0, 2)-tensor defined by

R_{ab} = R^m_{amb}. (289)

3. The scalar curvature is defined by

R = (g^{-1})^{ab} R_{ab}. (290)

4. The Einstein curvature is a (0, 2)-tensor defined by

G_{ab} = R_{ab} − (1/2) R g_{ab}. (291)
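The contractions above can be carried out explicitly for the round 2-sphere of Example C.36. The following sympy sketch (not part of the notes) computes the Christoffel symbols via (281), the Riemann components via (268), the Ricci tensor via (289), and the scalar curvature via (290), recovering R = 2 for the unit round sphere.

```python
import sympy as sp

# Curvature of the round 2-sphere (Example C.36).
th, ph = sp.symbols("theta varphi")
x = [th, ph]
g = sp.Matrix([[1, 0], [0, sp.sin(th) ** 2]])   # round metric (282)
ginv = g.inv()
n = 2

# Christoffel symbols Gamma[i][j][k] = Gamma^i_{jk}, cf. (281).
Gamma = [[[sum(ginv[i, q] * (sp.diff(g[k, q], x[j]) + sp.diff(g[j, q], x[k])
                             - sp.diff(g[j, k], x[q])) for q in range(n)) / 2
           for k in range(n)] for j in range(n)] for i in range(n)]
assert sp.simplify(Gamma[0][1][1] + sp.sin(th) * sp.cos(th)) == 0   # cf. (283)

def Riem(a, b, c, d):
    # R^a_{bcd} from (268)
    expr = sp.diff(Gamma[a][d][b], x[c]) - sp.diff(Gamma[a][c][b], x[d])
    expr += sum(Gamma[a][c][s] * Gamma[s][d][b]
                - Gamma[a][d][s] * Gamma[s][c][b] for s in range(n))
    return sp.simplify(expr)

# Ricci (289) and scalar curvature (290).
Ric = sp.Matrix(n, n, lambda a, b: sum(Riem(m, a, m, b) for m in range(n)))
R_scalar = sp.simplify(sum(ginv[a, b] * Ric[a, b]
                           for a in range(n) for b in range(n)))
assert sp.simplify(R_scalar - 2) == 0
```

One also finds Ric = g in this example, consistent with the sphere having constant positive curvature.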

C.7 Volume forms

Consider the problem of integrating functions over a smooth manifold (M, O, A).

Definition C.38 (Volume form). On an n-dimensional smooth manifold (M, O, A), a (0, n)-tensor field Ω (over Γ(TM)) is called a volume form if

1. Non-vanishing: Ω 6= 0 on M;

2. Totally anti-symmetric: Ω(...,Xi,...,Xj,... ) = −Ω(...,Xj,...,Xi,... ) for all Xi,Xj ∈ Γ(TM) and i, j = 1, . . . , n.

We can construct a (natural) volume form on a Riemannian manifold. Let (M, O, A, g) be a Riemannian manifold. Take an arbitrary chart (U, x) and let the components of the tensor field Ω be given by

Ω_{i_1,...,i_n} = sqrt(det(g_{ij})) ε_{i_1,...,i_n}, (292)

where ε_{i_1,...,i_n} is totally anti-symmetric in its indices with the normalization ε_{1,...,n} = 1, so that ε_{i_1,...,i_n} is the sign of the permutation (i_1, . . . , i_n) of (1, . . . , n) (and zero if an index repeats). The ε_{i_1,...,i_n}'s are called the Levi-Civita symbols. One can check that Ω_{i_1,...,i_n} is well-defined if and only if, for any pair of charts (U, x) and (U, y),

det(∂y/∂x) = det(∂_i(y^j ◦ x^{-1})) > 0. (293)

Condition (293) means that the chart transition maps are oriented.

Definition C.39 (Oriented atlas). Let A be a smooth atlas. Then A↑ ⊂ A is called a (positively) oriented sub-atlas of A if for any two charts (U, x), (V, y) ∈ A↑, the chart transition maps y ◦ x^{-1} and x ◦ y^{-1} are oriented:

det(∂y/∂x) > 0 or det(∂x/∂y) > 0 on U ∩ V. (294)

Let (M, O, A↑) be a smooth oriented manifold and (U, x) ∈ A↑ be an oriented chart. Given a volume form Ω, we can define a scalar density on M by

ω(p) = Ω_{i_1,...,i_n}(p) ε^{i_1,...,i_n} for p ∈ M, (295)

where ε^{i_1,...,i_n} = ε_{i_1,...,i_n}. Note that Ω_{i_1,...,i_n} is chart-dependent, and so is ω. Moreover, the change of the scalar density under a change of chart is given by

ω_{(y)} = det(∂x/∂y) ω_{(x)} (296)

for any two charts (U, x), (U, y) ∈ A↑.

Let (M, O, A↑, g) be an oriented metric manifold. On one chart domain (U, x), we can define the integration of a function f : U → R as

∫_U f := ∫_{x(U)} sqrt( det(g_{ij})(x^{-1}(α)) ) (f ◦ x^{-1})(α) dα. (297)

One can check that (297) is well-defined and does not depend on the chart map on the same chart domain U. To extend the integration to the whole manifold, we then need a partition of unity. Let (U_i, x_i) ∈ A↑ and ϱ_i : U_i → R, i = 1, . . . , N, be a finite collection of continuous functions such that for any p ∈ M, we have Σ_i ϱ_i(p) = 1, where the sum is taken over the i such that p ∈ U_i. Then we define the integration of a function f : M → R as

∫_M f = Σ_{i=1}^N ∫_{U_i} (ϱ_i f). (298)

Combining (297) and (298), we can define the (natural) volume form of a Riemannian manifold as follows.

Definition C.40 (Volume form of oriented Riemannian manifold). Let (M, O, A↑, g) be an oriented Riemannian manifold and (U, x) ∈ A↑ be an oriented chart. The volume form of the manifold M in an oriented chart (i.e., in local coordinates) is defined as

dVol = sqrt(det(g)) dx^1 ∧ · · · ∧ dx^n, (299)

where ∧ denotes the wedge product. The integration of a function f : M → R is defined as

∫_M f := ∫_M f(p) dVol(p). (300)

Acknowledgement

These lecture notes were written while the author was visiting the Institute for Data, Systems, and Society (IDSS) at the Massachusetts Institute of Technology. The author would like to thank Sinho Chewi, Tobias Colding, Thibaut Le Gouic, Philippe Rigollet, Austin Stromme, and Yun Yang for their helpful comments and feedback. This work is supported in part by NSF CAREER award DMS-1752614 and a Simons Fellowship in Mathematics.

References

[1] David Alonso-Gutiérrez and Jesús Bastero. Approaching the Kannan-Lovász-Simonovits and Variance Conjectures. Springer, Cham, 2015.

[2] François Bolley, Ivan Gentil, and Arnaud Guillin. Convergence to equilibrium in Wasserstein distance for Fokker-Planck equations. Journal of Functional Analysis, 263(8):2430–2457, 2012.

[3] Silouanos Brazitikos, Apostolos Giannopoulos, Petros Valettas, and Beatrice-Helen Vritsiou. Geometry of Isotropic Convex Bodies, volume 196 of Mathematical Surveys and Monographs. American Mathematical Society, 2014.

[4] Xiaohui Chen and Yun Yang. Diffusion k-means clustering on manifolds: provable exact recovery via semidefinite relaxations. Applied and Computational Harmonic Analysis (arXiv:1903.04416), 2020.

[5] Ivan Gentil, Christian Léonard, Luigia Ripani, and Luca Tamanini. An entropic interpolation proof of the HWI inequality. Stochastic Processes and their Applications, 2019.

[6] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

[7] John Lee. Introduction to Riemannian Manifolds. Graduate Texts in Mathematics. Springer International Publishing, 2nd edition, 2018.

[8] Y. T. Lee and S. S. Vempala. Eldan's stochastic localization and the KLS hyperplane conjecture: An improved lower bound for expansion. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 998–1007, Oct 2017.

[9] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[10] F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

[11] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport: With Applications to Data Science. Now Foundations and Trends, 2019.

[12] Steven Rosenberg. The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. London Mathematical Society Student Texts. Cambridge University Press, 1997.

[13] Filippo Santambrogio. Optimal Transport for Applied Mathematicians - Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser, 2015.

[14] Alain-Sol Sznitman. Topics in propagation of chaos. In Paul-Louis Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIX — 1989, pages 165–251, Berlin, Heidelberg, 1991. Springer Berlin Heidelberg.

[15] Ramon van Handel. Probability in high dimension. APC 550 Lecture Notes, Princeton University, 2016.
