arXiv:1806.09087v3 [math.PR] 7 Sep 2020

The CLT in high dimensions: quantitative bounds via martingale embedding

Ronen Eldan∗, Dan Mikulincer†, Alex Zhai‡

September 8, 2020

Abstract

We introduce a new method for obtaining quantitative convergence rates for the central limit theorem (CLT) in a high dimensional setting. Using our method, we obtain several new bounds for convergence in transportation distance and entropy, and in particular: (a) we improve the best known bound, obtained by the third named author [46], for convergence in transportation distance of bounded random vectors; (b) we derive the first non-asymptotic convergence rate for the entropic CLT in arbitrary dimension, for general log-concave random vectors (this adds to [20], where a finite Fisher information is assumed); (c) we give an improved bound for convergence in transportation distance under a log-concavity assumption, and an improvement for both metrics under the assumption of strong log-concavity. Our method is based on martingale embeddings and specifically on the Skorokhod embedding constructed in [22].

∗ Weizmann Institute of Science. Incumbent of the Elaine Blond career development chair. Partially supported by Israel Science Foundation grant 715/16.
† Weizmann Institute of Science. Supported by an Azrieli Fellowship award from the Azrieli Foundation.
‡ Stanford University. Partially supported by a Stanford Graduate Fellowship.

1 Introduction

Let X^(1), ..., X^(n) be i.i.d. random vectors in R^d. By the central limit theorem, it is well-known that, under mild conditions, the sum (1/√n) Σ_{i=1}^n X^(i) converges to a Gaussian. With d fixed, there is an extensive literature showing that the distance from the Gaussian under various metrics decays at the rate 1/√n as n → ∞, and that this is optimal.

However, in high-dimensional settings, it is often the case that the dimension d is not fixed but rather grows with n. It then becomes necessary to understand how the convergence rate depends on the dimension, and here the optimal dependence is not well understood. We present a new technique for proving central limit theorems that is suitable for establishing quantitative estimates for the convergence rate in the high-dimensional setting. The technique, which is described in more detail in Section 1.1 below, is based on pathwise analysis: we first couple the random vector with a Brownian motion via a martingale embedding. This gives rise to a coupling between the sum and a Brownian motion, as a manifestation of the concentration of the quadratic variation of the martingale embedding. We use a multidimensional version of a Skorokhod embedding, inspired by a construction of the first named author [22].

Using our method, we prove new bounds on quadratic transportation (also known as "Kantorovich" or "Wasserstein") distance in the CLT, and in the case of log-concave distributions, we also give bounds for entropy distance. Let W_2(A, B) denote the quadratic transportation distance between two d-dimensional random vectors A and B. That is,

$$\mathcal{W}_2(A, B) = \inf_{\substack{(X,Y)\ \mathrm{s.t.}\ X \sim A,\, Y \sim B}} \sqrt{\mathbb{E}\left[\|X - Y\|_2^2\right]},$$

where the infimum is taken over all couplings of the vectors A and B. As a first demonstration of our method, we begin with an improvement to the best known convergence rate in the case of bounded random vectors.

Theorem 1. Let X be a random d-dimensional vector. Suppose that E[X] = 0 and ‖X‖ ≤ β almost surely for some β > 0. Let Σ = Cov(X), and let G ∼ N(0, Σ) be a Gaussian with covariance Σ. If {X^(i)}_{i=1}^n are i.i.d. copies of X and S_n = (1/√n) Σ_{i=1}^n X^(i), then

$$\mathcal{W}_2(S_n, G) \le \frac{\beta\sqrt{d}\,\sqrt{32 + 2\log_2(n)}}{\sqrt{n}}.$$

Theorem 1 improves a result of the third named author [46] that gives a bound of order β√d log n/√n under the same conditions. It was noted in [46] that when X is supported on a lattice βZ^d, then the quantity W_2(S_n, G) is of order β√d/√n. Thus, Theorem 1 is within a √(log n) factor of optimal.

When the distribution of X is isotropic and log-concave, we can improve the bounds guaranteed by Theorem 1. In this case, however, a more general bound has already been established in [20]; see the discussion below.

Theorem 2. Let X be a random d-dimensional vector. Suppose that the distribution of X is log-concave and isotropic. Let G ∼ N(0, I_d) be a standard Gaussian. If {X^(i)}_{i=1}^n are i.i.d. copies of X and S_n = (1/√n) Σ_{i=1}^n X^(i), then there exists a universal constant C > 0 such that, if d ≥ 8,

$$\mathcal{W}_2(S_n, G) \le \frac{C d^{3/4} \ln(d) \sqrt{\ln(n)}}{\sqrt{n}}.$$

Remark 3. We actually prove the slightly stronger bound

$$\mathcal{W}_2(S_n, G) \le \frac{C \kappa_d \ln(d) \sqrt{d \ln(n)}}{\sqrt{n}},$$

where

$$\kappa_d := \sup_{\substack{\mu \ \text{isotropic,}\\ \text{log-concave}}} \left\| \int_{\mathbb{R}^d} x_1\, x \otimes x \,\mu(dx) \right\|_{HS}, \qquad (1)$$

as defined in [21]. Results in [21] and [35] imply that κ_d = O(d^{1/4}), leading to the bound in Theorem 2. If the thin-shell conjecture (see [2], as well as [11]) is true, then the bound is improved to κ_d = O(√(ln(d))), which yields

$$\mathcal{W}_2(S_n, G) \le \frac{C \sqrt{d \ln(d)^3 \ln(n)}}{\sqrt{n}}.$$

By considering, for example, a random vector uniformly distributed on the unit cube, one can see that the above bound is sharp up to the logarithmic factors.
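The dimensional behaviour discussed above can also be probed numerically. The following sketch is an illustrative addition (not part of the original text): it estimates W_2(S_n, G) when X is uniform on {−1, 1}^d, a lattice-type example in the spirit of the discussion after Theorem 1. Since both S_n and G are then product measures, W_2^2 is the sum of the coordinate-wise squared distances, and each coordinate is a normalized Rademacher sum whose distance to N(0, 1) is of order 1/√n.

    # Hypothetical illustration: empirical W_2(S_n, G) for X uniform on {-1, 1}^d,
    # estimated coordinate-wise by matching sorted samples to Gaussian quantiles.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def w2_coordinate(n, n_samples=200_000):
        # One coordinate of S_n: a normalized sum of n Rademacher signs.
        s = rng.choice([-1.0, 1.0], size=(n_samples, n)).sum(axis=1) / np.sqrt(n)
        s.sort()
        q = norm.ppf((np.arange(1, n_samples + 1) - 0.5) / n_samples)  # Gaussian quantiles
        return np.sqrt(np.mean((s - q) ** 2))

    d = 100
    for n in [25, 100, 400]:
        print(n, np.sqrt(d) * w2_coordinate(n), np.sqrt(d / n))  # both decay like sqrt(d/n)

On this example the estimate tracks the √(d/n) scale, in line with the lattice lower bound quoted after Theorem 1; it is meant only as a sanity check, not as evidence for the general log-concave rates above.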

Remark 4. To compare with the previous theorem, note that if Cov(X) = I_d, then E‖X‖² = d. Thus, in applying Theorem 1 we must take β ≥ √d, and the resulting bound is then of order at least d√(log n)/√n.

Next, we describe our results regarding the convergence rate in entropy. If A and B are random vectors such that A has density f with respect to the law of B, then the relative entropy of A with respect to B is given by

$$\mathrm{Ent}(A \,\|\, B) = \mathbb{E}\left[\ln\left(f(A)\right)\right].$$

As a warm-up, we first use our method to recover the entropic CLT in any fixed dimension. In dimension one this was first established by Barron [6]. The same methods may also be applied to prove a multidimensional analogue. See [14] for a more quantitative version of the theorem.
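To illustrate the definition of relative entropy above with a concrete case (this example is an addition, not part of the original text), one may take A ∼ N(0, σ²I_d) and B = G ∼ N(0, I_d); a direct computation with Gaussian densities gives

$$\mathrm{Ent}\left( \mathcal{N}(0, \sigma^2 I_d) \,\|\, \mathcal{N}(0, I_d) \right) = \frac{d}{2}\left( \sigma^2 - 1 - \ln \sigma^2 \right),$$

which is nonnegative, vanishes exactly when σ = 1, and grows linearly in the dimension d.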

Theorem 5. Suppose that Ent(X ‖ G) < ∞. Then one has

$$\lim_{n \to \infty} \mathrm{Ent}(S_n \,\|\, G) = 0.$$

The next result gives the first non-asymptotic convergence rate for the entropic CLT, again under the log-concavity assumption (other non-asymptotic results appear in previous works, notably [20], but require additional assumptions; see below).

Theorem 6. Let X be a random d-dimensional vector. Suppose that the distribution of X is log-concave and isotropic. Let G ∼ N(0, I_d) be a standard Gaussian. If {X^(i)}_{i=1}^n are i.i.d. copies of X and S_n = (1/√n) Σ_{i=1}^n X^(i), then

$$\mathrm{Ent}(S_n \,\|\, G) \le \frac{C d^{10} \left(1 + \mathrm{Ent}(X \,\|\, G)\right)}{n},$$

for a universal constant C > 0.

Our method also yields a different (and typically stronger) bound if the distribution is strongly log-concave.

Theorem 7. Let X be a d-dimensional random vector with E[X] = 0 and Cov(X) = Σ. Suppose further that X is 1-uniformly log-concave (i.e. it has a probability density e^{−φ(x)} satisfying ∇²φ ⪰ I_d) and that Σ ⪰ σI_d for some σ > 0. Let G ∼ N(0, Σ) be a Gaussian with the same covariance as X and let γ ∼ N(0, I_d) be a standard Gaussian. If {X^(i)}_{i=1}^n are i.i.d. copies of X and S_n = (1/√n) Σ_{i=1}^n X^(i), then

$$\mathrm{Ent}(S_n \,\|\, G) \le \frac{2\left(d + 2\,\mathrm{Ent}(X \,\|\, \gamma)\right)}{\sigma^4 n}.$$

Remark 8. The theorem can be applied when X is isotropic and σ-uniformly log-concave for some σ > 0. In this case, a change of variables shows that √σ X is 1-uniformly log-concave and has σI_d as its covariance matrix. Since relative entropy to a Gaussian is invariant under affine transformations, if G ∼ N(0, I_d) is a standard Gaussian, we get

$$\mathrm{Ent}(S_n \,\|\, G) = \mathrm{Ent}\left(\sqrt{\sigma} S_n \,\|\, \sqrt{\sigma} G\right) \le \frac{2\left(d + 2\,\mathrm{Ent}(\sqrt{\sigma} X \,\|\, G)\right)}{\sigma^4 n}.$$
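Remark 8 uses the fact that relative entropy is invariant under invertible affine transformations. For completeness (this short verification is an addition, not part of the original text): if T is an invertible affine map and A has density f with respect to the law of B, then TA has density f ∘ T^{-1} with respect to the law of TB, so

$$\mathrm{Ent}(TA \,\|\, TB) = \mathbb{E}\left[ \ln f\left(T^{-1}(TA)\right) \right] = \mathbb{E}\left[ \ln f(A) \right] = \mathrm{Ent}(A \,\|\, B).$$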

1.1 An informal description of the method

Let B_t be a standard Brownian motion in R^d with an associated filtration F_t. The following definition will be central to our method:

Definition 9. Let X_t be a martingale satisfying dX_t = Γ_t dB_t for some adapted process Γ_t taking values in the positive semi-definite cone, and let τ be a stopping time. We say that the triplet (X_t, Γ_t, τ) is a martingale embedding of the measure µ if X_τ ∼ µ.
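The simplest instance of Definition 9 is one-dimensional: taking X_t = B_t, Γ_t ≡ 1 and τ the first exit time of (−1, 1) gives a martingale embedding of the uniform measure on {−1, 1}. The following simulation is an illustrative addition (not part of the original text) checking this numerically.

    # Minimal sketch: (B_t, 1, tau), with tau the first exit time of (-1, 1), embeds the
    # uniform measure on {-1, +1}. Many discretized Brownian paths are simulated at once.
    import numpy as np

    rng = np.random.default_rng(1)
    dt, n_paths = 1e-3, 50_000
    x = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)           # paths that have not yet left (-1, 1)
    while alive.any():
        x[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
        alive &= np.abs(x) < 1.0
    x_tau = np.sign(x)                             # X_tau, up to the small discretization overshoot
    print("P(X_tau = +1) ~", (x_tau == 1).mean())  # close to 1/2
    print("E[X_tau]      ~", x_tau.mean())         # close to 0, the mean of the embedded measure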

Note that if Γ_t is deterministic, then X_t has a Gaussian law for each t. At the heart of our proof is the following simple idea: summing up n independent copies of a martingale embedding of µ, we end up with a martingale embedding of µ^{*n} whose associated covariance process has the form

$$\sqrt{\sum_{i=1}^n \left(\Gamma_t^{(i)}\right)^2}.$$

By the law of large numbers, this process is well concentrated, and thus the resulting martingale is close to a Brownian motion. This suggests that it would be useful to couple the sum process Σ_{i=1}^n X_t^{(i)} with the "averaged" process whose covariance is given by

$$\sqrt{\mathbb{E}\left[\sum_{i=1}^n \left(\Gamma_t^{(i)}\right)^2\right]}$$

(this process is a Brownian motion up to a deterministic time change). Controlling the error in the coupling naturally leads to a bound on transportation distance. For relative entropy, we can reformulate the discrepancies in the coupling in terms of a predictable drift and deduce bounds by a judicious application of Girsanov's theorem.

In order to derive quantitative bounds, one needs to construct a martingale embedding in a way that makes the fluctuations of the process Γ_t tractable. The specific choices of Γ_t that we consider are based on a construction introduced in [22]. This construction is also related to the entropy minimizing process used by Föllmer ([29, 30], see also Lehec [36]) and to the stochastic localization which was used in [21]. Such techniques have recently gained prominence and have been used, among other things, to improve known bounds for the KLS conjecture [21, 35], calculate large deviations of non-linear functions [23] and study tubular neighborhoods of complex varieties [34].

The basic idea underlying the construction of the martingale is a certain measure-valued Markov process driven by a Brownian motion. This process interpolates between a given measure and a delta measure via multiplication by infinitesimal linear functions. The Doob martingale associated to the delta measure (the conditional expectation of the measure, based on the past) will be a martingale embedding for the original measure. This construction is described in detail in Subsection 2.3 below.
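As a concrete and purely illustrative rendering of the last paragraph (an addition, not part of the original text), the following sketch simulates the closely related stochastic localization process of [21], i.e. the special case C_t ≡ I_d of the construction of Subsection 2.3, on a finitely supported measure: the density of µ_t is proportional to µ(x) exp(⟨c_t, x⟩ − t‖x‖²/2) with dc_t = a_t dt + dB_t, the measure almost surely collapses to a single atom, and the law of that atom is µ itself. All numerical choices below (support points, weights, step size) are arbitrary.

    # Sketch of stochastic localization (the C_t = I_d case) on a discrete measure mu.
    import numpy as np

    rng = np.random.default_rng(2)
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])  # support of mu
    p = np.array([0.4, 0.3, 0.2, 0.1])                                     # weights of mu

    def localize(dt=0.05, t_max=400.0):
        c, t = np.zeros(points.shape[1]), 0.0
        while t < t_max:
            # density of mu_t relative to mu: proportional to exp(<c_t, x> - t |x|^2 / 2)
            logw = np.log(p) + points @ c - 0.5 * t * (points ** 2).sum(axis=1)
            w = np.exp(logw - logw.max())
            w /= w.sum()
            if w.max() > 0.999:                    # mu_t has essentially become a point mass
                break
            a = w @ points                         # barycenter a_t of mu_t (the Doob martingale)
            c += a * dt + np.sqrt(dt) * rng.standard_normal(c.shape)
            t += dt
        return int(w.argmax())

    counts = np.bincount([localize() for _ in range(500)], minlength=len(p))
    print(counts / counts.sum())   # empirical law of the collapsed atom, approximately p

The Doob-martingale interpretation is visible in the code: a_t is the conditional expectation of the eventual atom given the past, exactly as described above.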

1.2 Related work

Multidimensional central limit theorems have been studied extensively since at least the 1940's [8] (see also [10] and references therein). In particular, the dependence of the convergence rate on the dimension was studied by Nagaev [38], Senatov [44], Götze [31], Bentkus [7], and Chen and Fang [18], among others. These works focused on convergence in probabilities of convex sets. We mention that in dimension 1, the picture is much clearer and tight estimates are known under various metrics ([9, 12, 13, 26, 42, 43]).

More recently, dependence on dimension in the high-dimensional CLT has also been studied for Wishart matrices (Bubeck and Ganguly [17], Eldan and Mikulincer [25]), maxima of sums of independent random vectors (Chernozhukov, Chetverikov, and Kato [19]), and transportation distance ([46]). As mentioned earlier, Theorem 1 is directly comparable to an earlier result

of the third named author [46], improving on it by a factor of √(log n) (see also the earlier work [45]). We refer to [46] for a discussion of how convergence in transportation distance may be related to convergence in probabilities of convex sets.

As mentioned above, Theorem 2 is not new, and follows from a result of Courtade, Fathi and Pananjady [20, Theorem 4.1]. Their technique employs Stein's method (see also [16], for a different approach using Stein's method) in a novel way which is also applicable to entropic CLTs (see below). In a subsequent work [27], similar bounds are derived for convergence in the p'th-Wasserstein transportation metric.

Regarding entropic CLTs, it was shown by Barron [6] that convergence occurs as long as the distribution of the summand has finite relative entropy (with respect to the Gaussian). However, establishing explicit rates of convergence does not seem to be a straightforward task. Even in the restricted setting of log-concave distributions, not much is known. One of the only quantitative results is Proposition 4.3 in [20], which gives near optimal convergence, provided that the distribution has finite Fisher information. We do not know of any results prior to Theorem 6 which give entropy distance bounds of the form poly(d)/n for a sum of general log-concave vectors.

A one-dimensional result was established by Artstein, Ball, Barthe, and Naor [3] and independently by Barron and Johnson [33], who showed an optimal O(1/n) convergence rate in relative entropy for distributions having a spectral gap (i.e. satisfying a Poincaré inequality). This was later improved by Bobkov, Chistyakov, and Götze [14, 15], who derive an Edgeworth-type expansion for the entropy distance which also applies to higher dimensions. However, although their estimates contain very precise information as n → ∞, the given error term is only asymptotic in n and no explicit dependence on the measure or on the dimension is given (in fact, the dependence derived from the method seems to be exponential in the dimension d).

A related "entropy jump" bound was proved by Ball and Nguyen [5] for log-concave random vectors in arbitrary dimensions (see also [4]). Essentially, the bound states that for two i.i.d. random vectors X and Y, the relative entropy Ent((X + Y)/√2 ‖ G) is strictly less than Ent(X ‖ G), where the amount is quantified by the spectral gap for the distribution of X. Repeated application gives a bound for entropy of sums of i.i.d. log-concave vectors in any dimension, but the bound is far from optimal. It is not apparent to us whether the method of [5] can be extended to provide quantitative estimates for convergence in the entropic CLT.

1.3 Notation

We work in R^d equipped with the Euclidean norm, which we denote by ‖·‖. For a positive semi-definite symmetric matrix A we denote by √A the unique positive semi-definite matrix B for which the relation B² = A holds. For symmetric matrices A and B we use A ⪯ B to signify that B − A is a positive semi-definite matrix. By A† we denote the pseudo-inverse of A. Put succinctly, this means that in A† every non-zero eigenvalue of A is inverted. For a random matrix A, we will write E[A]† for the pseudo-inverse of its expectation.

If B_t is the standard Brownian motion in R^d, then for any adapted process F_t we denote by ∫_0^t F_s dB_s the Itô stochastic integral. We refer by Itô's isometry to the fact that

$$\mathbb{E}\left[\left\| \int_0^t F_s\, dB_s \right\|^2\right] = \mathbb{E}\left[\int_0^t \|F_s\|_{HS}^2\, ds\right]$$

when F_t is adapted to the natural filtration of B_t.

µ will always stand for a probability measure. To avoid confusion, when integrating with respect to µ on R^d, we will use the notation ∫ ... µ(dx). For a measure-valued stochastic process µ_t, the expression dµ_t refers to the stochastic derivative of the process. A measure µ on R^d is said to be log-concave if it is supported on some subspace of R^d and, relative to the Lebesgue measure of that subspace, it has a density ρ, twice differentiable almost everywhere, for which

$$-\nabla^2 \log(\rho(x)) \succeq 0 \quad \text{for all } x,$$

where ∇² denotes the Hessian matrix, in the Alexandrov sense. If in addition there exists a σ > 0 such that

$$-\nabla^2 \log(\rho(x)) \succeq \sigma I_d \quad \text{for all } x,$$

we say that µ is σ-uniformly log-concave. The measure µ is called isotropic if it is centered and its covariance matrix is the identity, i.e.,

$$\int_{\mathbb{R}^d} x\, \mu(dx) = 0 \quad \text{and} \quad \int_{\mathbb{R}^d} x \otimes x\, \mu(dx) = I_d.$$

Finally, as a convention, we use the letters C,C′,c,c′ to represent positive universal constants whose values may change between different appearances.
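As a small aside (added for illustration only), the two matrix operations fixed in this subsection, the positive semi-definite square root √A and the pseudo-inverse A†, can be computed numerically as follows.

    # The PSD square root and the Moore-Penrose pseudo-inverse used throughout.
    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.standard_normal((4, 3))
    A = M @ M.T                                   # a rank-3 positive semi-definite matrix

    vals, vecs = np.linalg.eigh(A)
    sqrt_A = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T   # sqrt(A): sqrt_A @ sqrt_A = A
    pinv_A = np.linalg.pinv(A)                                     # A^dagger: non-zero eigenvalues inverted

    print(np.allclose(sqrt_A @ sqrt_A, A))        # True
    print(np.allclose(A @ pinv_A @ A, A))         # True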

Acknowledgments We are extremely grateful to the anonymous referee for his/her careful reading of this manuscript. His/her efforts have greatly improved the presentation and overall readability.

2 Obtaining convergence rates from martingale embeddings

Suppose that we are given a measure µ and a corresponding martingale embedding (X_t, Γ_t, τ). The goal of this section is to express bounds for the corresponding CLT convergence rates (of the sum of independent copies of µ-distributed random vectors) in terms of the behavior of the process Γ_t and τ.

Throughout this section we fix a measure µ on R^d whose expectation is 0, a random vector X ∼ µ, and a corresponding Gaussian G ∼ N(0, Σ), where Cov(X) = Σ. Also, the sequence {X^(i)}_{i=1}^∞ will denote independent copies of X, and we write S_n := (1/√n) Σ_{i=1}^n X^(i) for their normalized sum. Finally, we use B_t to denote a standard Brownian motion on R^d adapted to a filtration F_t.

2.1 A bound for Wasserstein-2 distance

The following is our main bound for convergence in Wasserstein distance.

Theorem 10. Let S_n and G be defined as above and let (X_t, Γ_t, τ) be a martingale embedding of µ. Set Γ_t = 0 for t > τ. Then

$$\mathcal{W}_2^2(S_n, G) \le \int_0^\infty \min\left( \frac{1}{n} \mathrm{Tr}\left( \mathbb{E}\left[\Gamma_t^4\right] \mathbb{E}\left[\Gamma_t^2\right]^\dagger \right),\; 4\, \mathrm{Tr}\left( \mathbb{E}\left[\Gamma_t^2\right] \right) \right) dt.$$

To illustrate how such a result might be used, let us assume, for simplicity, that Γ_t ≺ kI_d almost surely for some k > 0 and that τ has a sub-exponential tail, i.e., there exist positive constants C, c > 0 such that for any t > 0,

$$P(\tau > t) \le C e^{-ct}. \qquad (2)$$

Under these assumptions,

$$\mathcal{W}_2^2(S_n, G) \le \int_0^\infty \min\left( \frac{1}{n} \mathrm{Tr}\left( \mathbb{E}\left[\Gamma_t^4\right] \mathbb{E}\left[\Gamma_t^2\right]^\dagger \right),\; 4 k^2 d\, P(\tau > t) \right) dt \le \int_0^{\frac{\log(n)}{c}} \frac{d k^2}{n}\, dt + \int_{\frac{\log(n)}{c}}^{\infty} 4 C d k^2 e^{-ct}\, dt = \frac{d k^2 \log(n)}{c\, n} + \frac{4 C d k^2}{c\, n}.$$

Towards the proof, we will need the following technical lemma.

Lemma 1. Let A, B be positive semi-definite matrices with ker(A) ⊂ ker(B). Then,

$$\mathrm{Tr}\left( \left( \sqrt{A} - \sqrt{B} \right)^2 \right) \le \mathrm{Tr}\left( (A - B)^2 A^\dagger \right).$$

Proof. Since A and B are positive semi-definite, ker(√A + √B) ⊂ ker(√A − √B). Thus, we have that

√A √B = √A √B √A + √B √A + √B † (3) − −       = A B + √A, √B √A + √B † . −  h i   So, 2 2 Tr √A √B = Tr A B + √A, √B √A + √B † . − − !     h i    Note that for any symmetric matrices X and Y , by the Cauchy-Schwartz inequality,

Tr (XY )2 Tr (XYXY ) Tr (XYYX) Tr (YXXY ) = Tr X2Y 2 . ≤ ≤ · Applying this to the above equation showsp 

2 2 2 Tr √A √B Tr A B + √A, √B √A + √B † . − ≤ − !     h i    Note that the commutator √A, √B is an anti-symmetric matrix, so that (A B) √A, √B + − √A, √B (A B) is anti-symmetrich asi well. Thus, for any symmetric matrix C, weh have thati − h i Tr (A B) √A, √B + √A, √B (A B) C =0. − −  h i h i   Also, since all eigenvalues of anti-symmetric matrices are purely imaginary, the square of such matrices must be negative definite. And again, for any symmetric positive definite matrix C,

7 2 2 it holds that C1/2 √A, √B C1/2 is negative definite and Tr √A, √B C 0. Using ≤ these observationsh we obtaini h i 

2 2 2 Tr A B + √A, √B √A + √B † Tr (A B)2 √A + √B † . − ! ≤ − !  h i       Finally, if C,X,Y are positive definite matrices with X Y then C1/2(Y X)C1/2 is pos-  − itive definite which shows Tr (CX) Tr (CY ). The assumption ker(A) ker(B) implies 2 ≤ ⊂ † √A + √B A†, which concludes the claim by     2 2 † 2 Tr (A B) √A + √B Tr (A B) A† − ≤ −   !   
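A quick numerical check of Lemma 1 can be run as follows; this is an illustrative addition (not part of the original text) and assumes the full-rank case, where the kernel condition is vacuous and A† = A^{-1}.

    # Sanity check of Lemma 1 on random positive definite matrices:
    # Tr((sqrt(A) - sqrt(B))^2) <= Tr((A - B)^2 A^{-1}).
    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(4)

    def random_pd(d):
        M = rng.standard_normal((d, d))
        return M @ M.T + 0.1 * np.eye(d)

    for _ in range(1000):
        A, B = random_pd(5), random_pd(5)
        sA, sB = sqrtm(A).real, sqrtm(B).real
        lhs = np.trace((sA - sB) @ (sA - sB))
        rhs = np.trace((A - B) @ (A - B) @ np.linalg.inv(A))
        assert lhs <= rhs + 1e-8
    print("Lemma 1 inequality held in all trials")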

(i) (i) (i) Proof of Theorem 10. Recall that (Xt, Γt, τ) is a martingale embedding of µ. Let Xt , Γt , τ (i) (i) be independent copies of the embedding. We can always set Γt =0 whenever t > τ , so that  n 2 ∞Γ(i)dB(i) µ. Define Γ˜ = 1 Γ(i) . Our first goal is to show t t ∼ t n t 0 r i=1 R P   ∞ 2 2 ˜ 2 2 (G,Sn) E Tr Γt E [Γt ] dt. (4) W ≤ " − !# Z0  q  2 The theorem will then follow by deriving suitable bounds for E Tr Γ˜ E [Γ2] us- t − t n     1 ∞ (i) (i) p ing Lemma 1. Consider the sum √n Γt dBt , which has the same law as Sn. It may be i=1 0 rewritten as P R ∞ Sn = Γ˜tdB˜t, Z0 1 (i) (i) ˜ ˜† where dBt := √n Γt i Γt dBt is a martingale whose quadratic variation matrix has deriva- tive satisfying 2 P d 1 (i) [B˜] = Γ˜† Γ Γ˜† I . (5) dt t n t t t  d i X   d (i) (in fact, as long as R is spanned by the images of Γt , this process is a Brownian motion). We may now decompose Sn as

∞ ∞ ˜2 ˜ ˜ ˜2 ˜ Sn = E Γt dBt + Γt E Γt dBt. (6) − ! Z0 r h i Z0 r h i

∞ ˜2 ˜ ˜2 2 Observe that G := E[Γt ]dBt has a Gaussian law and that E[Γt ]= E[Γt ]. By applying Itˆo’s 0 isometry, we may seeR q that G has the “correct” covariance in the sense that 2 2 ∞ ⊗ ∞ ∞ ⊗ Cov(G)= E E[Γ˜2]dB˜ = E Γ2dt = E Γ dB = Cov(X).  t t   t   t t  Z0 q Z0 Z0         8 The decomposition (6) induces a natural coupling between G and Sn, which shows, by another application of Itˆo’s isometry, that

2 ∞ (5) ∞ 2 2(G,S ) E Γ˜ E[Γ2] dB˜ Tr E Γ˜ E[Γ2] dt W2 n ≤  t − t t  ≤   t − t  Z   Z   0 q 0 q      ∞ 2 ˜ 2 = E Tr Γt E [Γt ] dt, " − !# Z0  q 

2 where the last equality is due to Fubini’s theorem. Thus, (4) is established. Since Γ˜ E [Γ2] t − t  ˜2 2   2 Γt + E[Γt ] , we have p   2 Tr E Γ˜ E [Γ2] 4Tr E[Γ2] . (7) t − t ≤ t "  #! q  n 2 1 (i) ˜ To finish the proof, write Ut := n Γt , so that Γt = √Ut. Since Γt is positive semi- i=1 2 definite, it is clear that ker (E [Γ ]) Pker( Ut). By Lemma 1, t ⊂ 2 2 2 2 2 † E Tr Ut E [Γt ] Tr E Ut E Γt E Γt " − !# ≤ −  q   h i  p n     2 2 1 (i) = Tr E Γ E Γ2 E Γ2 † n2 t − t t i=1 "  # ! X       1 2 = Tr E Γ4 E Γ2 E Γ2 † n t − t t 1  4  2       Tr E Γ E Γ † , ≤n t t       2 (i) 2 where we have used the fact E Γt = E [Γt ] in the second equality. Combining the last inequality with (7) and (4) produces the  required result.

2.2 A bound for the relative entropy As alluded to in the introduction, in order to establish bounds on the relative entropy we will use the existence of a martingale embedding to construct an Itˆo process whose martingale part has a deterministic quadratic variation. This will allow us to relate the relative entropy to a Gaussian with the norm of the drift term through the use of Girsanov’s theorem. As a technicality, we require the stopping time associated to the martingale embedding to be constant. Our main bound for the relative entropy reads,

Theorem 11. Let (X_t, Γ_t, 1) be a martingale embedding of µ. Assume that for every 0 ≤ t ≤ 1, E[Γ_t] ⪰ σ_t I_d ⪰ 0 and that Γ_t is invertible a.s. for t < 1. Then we have the following inequalities:

$$\mathrm{Ent}(S_n \,\|\, G) \le \frac{1}{n} \int_0^1 \frac{\mathbb{E}\left[ \mathrm{Tr}\left( \left(\Gamma_t^2 - \mathbb{E}\left[\Gamma_t^2\right]\right)^2 \right) \right]}{(1-t)^2\, \sigma_t^2} \left( \int_t^1 \sigma_s^{-2}\, ds \right) dt,$$

and

$$\mathrm{Ent}(S_n \,\|\, G) \le \int_0^1 \frac{\mathrm{Tr}\left( \mathbb{E}\left[\Gamma_t^2\right] - \mathbb{E}\left[\tilde{\Gamma}_t\right]^2 \right)}{(1-t)^2} \left( \int_t^1 \sigma_s^{-2}\, ds \right) dt,$$

where

$$\tilde{\Gamma}_t = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( \Gamma_t^{(i)} \right)^2}$$

and Γ_t^(i) are independent copies of Γ_t.

The theorem relies on the following bound, whose proof is postponed to the end of the subsection.

d d d Lemma 2. Let Γ be an -adapted matrix-valued processes and let F : R R R × t Ft × → be almost surely invertible and locally Lipschitz. Denote Ft(x) := F (t, x) and let Xt, Mt be defined by t t Xt = ΓsdBs and Mt = Fs(Ms)dBs. Z0 Z0 Define the process Yt by t t s Γ F (Y ) Y = F (Y )dB + r − r r dB ds. t s s s 1 r r Z0 Z0 Z0 − Then, 1 1 2 1 Γs Fs(Ys) Ent (X M ) E F − (Y ) − dtds . 1|| 1 ≤  t t 1 s  Z Z HS 0 s −   Note that if the process Ft is deterministic, i.e. it is a constant function, then M1 has a Gaussian law, so that the lemma can be used to bound the relative entropy of X1 with respect to a Gaussian. (i) (i) Proof of Theorem 11. Let (Xt , Γt , 1) be independent copies of the martingale embedding. n t ˜ 1 (i) ˜ ˜ ˜ Consider the sum process Xt = √n Xt , which satisfies Xt = ΓsdBs where we define, as i=1 0 in the proof of Theorem 10, P R

n 2 1 (i) 1 1 (i) (i) Γ˜ := Γ and dB˜ = Γ˜− Γ dB . t vn t t √n t t t u i=1 u X   X t By assumption Γ˜t is invertible, which makes B˜t a Brownian motion. In this case, (X˜t, Γ˜t, 1) is a martingale embedding for the law of Sn. For the first bound, consider the process t 2 ˜ Mt = E [Γs]dBs. 0 Z p 10 By Itˆo’s isometry one has M (0, Σ). Also, by Jensen’s inequality 1 ∼N E [Γ2] E [Γ ] σ I . t  t  t d q 2 Using this observation and substituting E [Γt ] for a constant function Ft in Lemma 2 yields, 1 2 1 p˜ 2 Γt E [Γt ] 2 Ent (Sn G) E − σs− ds dt. (8) k ≤  1 t    Z0 p− HS Zt

With the use of Lemma 1 we obtain    

2 2 E Γ˜ E [Γ2] = E Tr Γ˜ E [Γ2] t − t t − t HS "   !# q q n 2 2 1 (i) 1 E Tr Γ E Γ2 E Γ2 − ≤   n t − t t  i=1 ! X       1  2 2 2  2 E Tr Γt E Γt . ≤ nσt − h  i Plugging the above into (8) shows the first bound. To see the second bound, we define a process

Mt′, which is similar to Mt, and is given by the equation t ˜ ˜ Mt′ := E Γs dBs. 0 Z h i Let Gn denote a Gaussian which is distributed as M1′ . For any s, we now have the following Cauchy-Schwartz type inequality

n n 2 2 n Γ(i) Γ(i) . s  s i=1 ! i=1 ! X  X Since the square root is monotone with respect to the order on positive definite matrices, this implies 1 n E Γ˜ E Γ(i) σ I . s  n s  s d " i=1 # h i X Thus, 2 1 1 1 Γ˜t E Γ˜t Ent(S G ) E E Γ˜ − − dsdt n|| n ≤  s 1 ht i  Z0 Zt h i − HS   2 1 ˜ ˜ 1  Γ t E Γt − 2 E σ− ds dt ≤  1 ht i   s  Z0 − Zt  HS  2    1 2 ˜ 1 Tr E [Γt ] E Γ t − 2 =   σ− ds dt. (1 t)2 h i  s  Z0 − Zt   Since Cov(G) = Cov(Sn), it is now easy to verify that Ent (Sn G) Ent (Sn Gn), which concludes the proof. || ≤ ||

A key component in the proof of the theorem lies in using the norm of an adapted process in order to bound the relative entropy. The following lemma embodies this idea. Its proof is based on a straightforward application of Girsanov's theorem. We provide a sketch and refer the reader to [36], where a slightly less general version of this lemma is given, for a more detailed proof.

Lemma 3. Let F : R × R^d → R^{d×d} be almost surely invertible and locally Lipschitz. Denote F_t(x) := F(t, x) and let M_t = ∫_0^t F_s(M_s) dB_s. For u_t, an adapted process, set Y_t := ∫_0^t F_s(Y_s) dB_s + ∫_0^t u_s ds. Then

$$\mathrm{Ent}(Y_1 \,\|\, M_1) \le \frac{1}{2} \int_0^1 \mathbb{E}\left[ \left\| F_t^{-1}(Y_t)\, u_t \right\|^2 \right] dt.$$

Proof. Since Mt is an Itˆodiffusion, by Girsanov’s theorem ( [39, Theorem 8.6.5]), the density of Yt t [0,1] with respect to that of Mt t [0,1] on the space of paths is given by { } ∈ { } ∈ 1 1 1 1 1 2 := exp F (Y )− u dB F (Y )− u dt . E − t t t t − 2 t t t  Z Z 0 0   If f is the density of Y1 with respect to M1, this implies

1= E [f(Y ) ] . 1 E By Jensen’s inequality

0 = ln (E [f(Y ) ]) E [ln (f(Y ) )] = E [ln(f(Y ))] + E [ln( )] . 1 E ≥ 1 E 1 E But, 1 1 1 2 E [ln( )] = E F − (Y )u dt, E −2 t t t Z 0 h i and E [ln(f(Y ))] = Ent(Y M ), 1 1|| 1 which concludes the proof.
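As a simple consistency check of Lemma 3 (this example is an addition, not part of the original text), take d = 1, F ≡ 1 and a deterministic constant drift u_t ≡ c. Then M_1 ∼ N(0, 1), Y_1 ∼ N(c, 1), and the bound holds with equality:

$$\mathrm{Ent}(Y_1 \,\|\, M_1) = \mathrm{Ent}\left( \mathcal{N}(c, 1) \,\|\, \mathcal{N}(0, 1) \right) = \frac{c^2}{2} = \frac{1}{2} \int_0^1 \mathbb{E}\left[ \left\| F_t^{-1}(Y_t)\, u_t \right\|^2 \right] dt.$$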

The proof of Lemma 2 now amounts to invoking the above bound with a suitable construc- tion of the drift process ut.

Proof of Lemma 2. By definition of the process Y_t, we have the following equality

1 1 t 1 1 Γ F (Y ) Y = F (Y )dB + s − s s dB dt = F (Y )dB + (Γ F (Y )) dB = X , 1 t t t 1 s s t t t t − t t t 1 Z0 Z0 Z0 − Z0 Z0 (9)

12 where we have used Fubini’s theorem in the penultimate equality. Now, consider the adapted process t Γ F (Y ) u = s − s s dB , t 1 s s Z0 − so that, dYt = Ft(Yt)dBt + utdt. Applying Lemma 3 and using Itˆo’s isometry, we get

1 1 t 2 1 2 1 Γs Fs(Ys) Ent(X M ) E F − (Y )u dt = E F − (Y ) − dB dt 1|| 1 ≤ t t t  t t 1 s s  Z Z Z 0 h i 0 0 − 1 t   2 1 Γs Fs(Ys) = E F − (Y ) − dsdt  t t 1 s  Z Z HS 0 0 −  1 1  2 1 Γs Fs(Ys) = E F (Y )− − dtds ,  t t 1 s  Z Z HS 0 s −   where last equality follows from another use of Fubini’s theorem.

2.3 A stochastic construction

In this section we introduce the main construction used in our proofs, a martingale process which meets the assumptions of Theorems 10 and 11. The construction in the next proposition is based on the Skorokhod embedding described in [22]. Most of the calculations in this subsection are very similar to what is done in [22], except that we allow some inhomogeneity in the quadratic variation according to the function C_t below. In particular, C_t will be a symmetric matrix almost surely, and we will denote the space of d × d symmetric matrices by Sym_d.

Proposition 1. Let µ be a probability measure on R^d with smooth density and bounded support. For a probability measure-valued process µ_t, let

$$a_t = \int_{\mathbb{R}^d} x\, \mu_t(dx), \qquad A_t = \int_{\mathbb{R}^d} (x - a_t)^{\otimes 2}\, \mu_t(dx)$$

denote its mean and covariance. Let C : R × Sym_d → Sym_d be a continuous function. Then, we can construct µ_t so that the following properties hold:

1. µ_0 = µ,

2. a_t is a stochastic process satisfying da_t = A_t C(t, A_t†) dB_t, where B_t is a standard Brownian motion on R^d, and

3. For any continuous and bounded ϕ : R^d → R, ∫_{R^d} ϕ(x) µ_t(dx) is a martingale.

Remark 12. We will be mainly interested in situations where µ_t converges almost surely to a point mass in finite time. In this case, we obtain a martingale embedding (a_t, A_t C(t, A_t†), τ) for µ, where τ is the first time that µ_t becomes a point mass.

13 In the sequel, we abbreviate Ct := C(t, At†). We first give an informal description of how µt+ǫ is constructed from µt for ǫ 0. Consider a stochastic process Xs 0 s 1 in which we ≤ ≤ first sample X µ and then set → { } 1 ∼ t 1 X = (1 s)a + sX + C− B , s − t 1 t s 1 where Bs is a standard Brownian bridge. We can write Xǫ = at + √ǫCt− Z, where Z is close to a standard Gaussian. We then take µt+ǫ to be the conditional distribution of X1 given Xǫ. This immediately ensures that property 3 holds and that at is a martingale. It remains to see why property 2 holds. A direct calculation with conditioned Brownian bridges gives a first-order approximation

1 −1 T 2 −1 (√ǫC Z ǫ(x at)) C (√ǫC Z ǫ(x at)) µ (dx) e− 2 t − − t t − − µ (dx) t+ǫ ∝ t √ǫ CtZ,x at +O(ǫ) e h − i µ (dx) ∝ t (1 + √ǫ C Z, x a )µ (dx). ≈ h t − ti t Then, to highest order, we have

at+ǫ at √ǫ CtZ, x at (x at) µt(dx)= √ǫAtCtZ, − ≈ d h − i − ZR which translates into property 2 as ǫ 0. → Observe that the procedure outlined above yields measures µt that have densities which are proportional to the original density µ times a Gaussian density. (This applies at least when At is non-degenerate; something similar also holds when At is degenerate, as we will see shortly.) Let us now perform the construction formally. We will proceed by iterating the following preliminary construction, which handles the case when At remains non-degenerate. Lemma 4. Let µ be a measure on Rd with smooth density and bounded support, and let C : R Sym Sym be a continuous map. Then, there is a measure-valued process µ and a × d → d t stopping time T such that µt satisfies the properties in Proposition 1 for t < T and the affine hull of the support of µT has dimension strictly less than d. Moreover, if µT is considered as a measure on this affine hull, it has a smooth density.

Proof. We will construct a (Rd Sym )-valued stochastic process (c , Σ˜ ) started at (c , Σ˜ )= × d t t 0 0 (0, Id). Let us write 1 1 Q (x)= x c , Σ˜ − (x c ) , t 2 − t t − t D dµ˜ 1 Ex 2 and let µ˜ be the probability measure satisfying (x) e 2 k k . We will then take µ to be dµ ∝ t µt(dx)= Ft(x)˜µ(dx), where

1 Qt(x) Qt(x) Ft(x)= e− , Zt = e− µ˜(dx). Z d t ZR 1 Note that since Σ˜ 0 = Id, we have µ0 = µ. In order to specify the process, it remains to construct (ct, Σ˜ t). We take it to be the solution to the SDE dc = Σ˜ C dB + Σ˜ C2(a c )dt, dΣ˜ = Σ˜ C2Σ˜ dt. t t t t t t t − t t − t t t 1 ˜ Conceptually, one can replace all instances of µ˜ with µ if we think of the initial value Σ0 as being an “infinite” multiple of identity. However, to avoid issues with infinities, we have expressed things in terms of µ˜ instead.

14 Note that the coefficients of this SDE are continuous functions of (ct, Σ˜ t) so long as Σ˜ t 0. By standard existence and uniqueness results, this SDE has a unique solution up to a stopping≻ time T (possibly T = ), at which point At (and hence Σ˜ t) becomes degenerate. Observe that, for every t, Σ˜ I and∞ so, the matrix process is continuous on the interval [0, T ]. t  d By a limiting procedure, it is easy to see that µT has a smooth density when considered as a measure on the affine hull of its support. (Indeed, its density is proportional to the conditional density of µ˜ times a Gaussian density.) It remains to verify that µt is a martingale and dat = AtCtdBt. By direct calculation, we have

˜ 1 2 d(Σt− )= Ct dt ˜ 1 2 2 d(Σt− ct)= Ct ctdt + Ct (at ct)dt + CtdBt 2 − = Ct atdt + CtdBt 1 dQ (x)= x, C2x C2a dt C dB t 2 t − t t − t t     Qt(x) Qt(x) 1 Qt(x) d(e− )= e− dQ (x)+ e− d[Q (x)] − t 2 t Qt(x) 2 = e− x, CtdBt + Ct atdt

Integrating against µ˜(dx), we obtain

2 dZt = Zt at,CtdBt + Ct atdt 1 1 1 1 dZt− = 2 dZt + 3 d[Zt]= at, CtdBt −Zt Zt Zt h − i Qt(x) 1 1 Qt(x) 1 Qt(x) dFt(x)= e− dZt− + Zt− d(e− )+ d[Zt− , e− ] = F (x) x a ,C dB . t ·h − t t ti

Thus, Ft(x) is a martingale for each fixed x, and furthermore,

dat = d xµt(dx)= xdµt(dx)= x(x at)Ctµt(dx)dBt = AtCtdBt. d d d − ZR ZR ZR

Proof of Proposition 1. We use the process given by Lemma 4, which yields a stopping time T1 and a measure µT1 with a strictly lower-dimensional support. If µT is a point mass, then we set µ = µ for all t T . t T ≥ Otherwise, by the smoothness properties of µT1 guaranteed by Lemma 4, we can recursively apply Lemma 4 again on µT1 conditioned on the affine hull of its support. Repeating this procedure at most d times gives us the desired process.

2.4 Properties of the construction

We record here various formulas pertaining to the quantities at, At, and µt constructed in Propo- sition 1.

15 Proposition 2. Let µ, Ct, and µt be as in Proposition 1. Then, there is a Symd-valued process Σ satisfying the following: { t}t>0 For all t, there is an affine subspace L = L Rd and a Gaussian measure γ on Rd, • t ⊂ t supported on L, with covariance Σt such that µt is absolutely continuous with respect to γt, and dµ t (x) µ(x), x L. dγt ∝ ∀ ∈ Σ is continuous and for almost every t obeys the differential equation • t d Σ = Σ C2Σ . dt t − t t t

1 limt 0+ Σt− =0. • → Proof. For 1 k d, let T denote the first time the measure µ is supported in a (d k)- ≤ ≤ k t − dimensional affine subspace, and denote by Lt the affine hall of the support of µt. We will define Σt inductively for each interval [Tk 1, Tk]. Recall from the proof of Proposition 1 that − µt is constructed by iteratively applying Lemma 4 to affine subspaces of decreasing dimension d,d 1,d 2,..., 1. Let Σ˜ k,t denote the quantity Σ˜ t, from the k-th application of Lemma 4, − ˜ − so that Σk,t is a linear operator on the subspace LTk . ˜ 1 1 For the base case 0 < t T1, take Σt =(Σ0−,t Id)− . A straightforward calculation shows ≤dµt − that over this time interval, dµ is proportional to the density of a Gaussian with covariance Σt. ˜ 1 1 Note that since Σ0−,0 = Id, we also have limt 0+ Σt− =0. → Now suppose that Σt has been defined up until time Tk; we will extend it to time Tk+1. Let L denote the affine hull of the support of µ , so that dim(L )= d k (if dim(L ) < d k, k Tk k − k − then we simply have T = T ). Then, for 0 t T T , we may set k+1 k ≤ ≤ k+1 − k 1 1 1 − ΣT +t := Σ˜ − + Σ− Id , k k,t Tk −   where the quantities involved are matrices over the subspace parallel to Lk but may also be regarded as degenerate bilinear forms in the ambient space Rd. First, observe that continuity of the processes Σ˜ k,t implies the same for Σt. Once again, a straightforward calculation shows that for T t < T , dµt is proportional to the density of a Gaussian with covariance Σ , where k ≤ k+1 dµ t we view µt and µ as densities on Lk (for µ, we take its conditional density on Lk). It remains only to show that Σt satisfies the required differential equation. From our con- 1 1 − struction, we see that Σ always takes the form Σ˜ − H , where H I and t t −  d   d Σ˜ = Σ˜ C2Σ˜ . dt t − t t t Then, we have

1 1 d 1 − d 1 1 − Σ = Σ˜ − H Σ˜ − Σ˜ − H dt t − t − dt t t −       1 d 1 = Σ Σ˜ − Σ˜ Σ˜ − Σ − t − t dt t t t     = Σ C2Σ , − t t t as desired.

16 3 2 Proposition 3. dAt = (x at)⊗ µt(dx)CtdBt AtCt Atdt Rd − − R Proof. We consider the Doob decomposition of At = Mt + Et, where Mt is a local martingale and Et is a process of bounded variation. By the previous two propositions and the definition of At, we have on one hand

2 2 2 2 dA = d x⊗ µ (dx) da⊗ = d x⊗ µ (dx) a da da a A C A dt. t t − t t − t ⊗ t − t ⊗ t − t t t RZd RZd Clearly the first 3 terms are local martingales, which shows, by the uniqueness of the Doob decomposition, dE = A C2A dt. On the other hand, one may also rewrite the above as t − t t t

2 2 dA =d (x a )⊗ µ (dx)= d (x a )⊗ µ (dx) t − t t − t t Zd Zd R R  2 = da (x a )µ (dx) (x a ) da µ (dx)+ (x a )⊗ dµ (dx) − t ⊗ − t t − − t ⊗ t t − t t RZd RZd RZd 2 (x a ) d[a ,µ (dx)] + d[a , a ] µ (dx). − − t ⊗ t t t t t t t RZd RZd

Note that the first 2 terms are equal to 0, since, by definition of at,

da (x a )µ (dx)= da (x a )µ (dx)=0. t ⊗ − t t t ⊗ − t t RZd RZd Also, the last 2 terms are clearly of bounded variation, which shows

2 3 dM = (x a )⊗ dµ (dx)= (x a )⊗ C µ (dx)dB . t − t t − t t t t RZd RZd

Define the stopping time τ = inf t At =0 . Then, at time τ, µτ is just a delta mass located at a and µ = µ for every s τ. A{ crucial| is} observation is τ s τ ≥

Proposition 4. Suppose that there exists constants t0 0 and c > 0 such that a.s. one of the following happens ≥

2 1. for every t0 c,

t0 2 2 2 2. λmin (Ct ) dt = , where λmin (Ct ) is the minimal eigenvalue of Ct , 0 ∞ R then τ is finite a.s. and in the second case τ t0. Moreover, if τ is finite a.s. then aτ has the law of µ. ≤

17 t 2 Proof. Consider the process Rt = At + AsCs Asds. For the first case, the previous propo- 0 sition shows that the real-valued processRTr (Rt) a positive local martingale; hence, a super- martingale. By the martingale convergence theorem Tr (Rt) converges to a limit almost surely. By our assumption, if τ = then ∞

∞ ∞ ∞ Tr(A C2A )dt Tr(A C2A )dt cdt = . t t t ≥ t t t ≥ ∞ Z0 tZ0 tZ0

This would imply that lim Tr(At)= which clearly cannot happen. t →∞ −∞

For the second case, under the event τ > t0 , by continuity of the process At there exists { } d a> 0 such that for every t [0, t0], there is a unit vector vt R for which vt, Atvt a. We then have, ∈ ∈ h i ≥

t0 t0 t0 Tr(A C2A )dt A v ,C2A v dt a2 λ (C2)dt = , t t t ≥ h t t t t ti ≥ min t ∞ Z0 Z0 Z0 which implies lim Tr(At)= . Again, this cannot happen and so P(τ > t0)=0. t t0 → −∞ To understand the law of a , let ϕ : Rd R be any continuous bounded function. By Property τ → 3 of Proposition 1 ϕ(x)µt(dx) is a martingale. We claim that it is bounded. Indeed, observe Rd that since µt is a probabilityR measure for every t, then

ϕ(x)µt(dx) max ϕ(x) . ≤ x | | RZd τ is finite a.s., so by the optional stopping theorem for continuous time martingales ( [39] Theorem 7.2.4)

E ϕ(x)µ (dx) = ϕ(x)µ(dx).  τ  RZd RZd   Since µτ is a delta mass, we have that ϕ(x)µτ (dx)= ϕ(aτ ) which finishes the proof. Rd R

We finish the section with an important property of the process At. Proposition 5. The rank of A is monotonic decreasing in t, and ker(A ) ker(A ) for t s. t t ⊂ s ≤

Proof. To see that rank(At) is indeed monotonic decreasing, let v0 be such that At0 v0 = 0 for some t0 > 0, we will show that for any t t0, Atv0 =0. In a similar fashion to Proposition 4, t ≥ 2 we define the process v0, Atv0 + v0, AsCs Asv0 ds, which is, using Proposition 3, a positive h i 0 h i local martingale and so a super-martingale.R This then implies that v0, Atv0 is itself a positive super-martingale. Since v , A v =0, we have that for any t th , v , Aiv =0 as well. h 0 t0 0i ≥ 0 h 0 t 0i

18 3 Convergence rates in transportation distance

3.1 The case of bounded random vectors: proof of Theorem 1 In this subsection we fix a measure µ on Rd and a random vector X µ with the assumption ∼ that X β almost surely for some β > 0. We also assume that E [X]=0. k k ≤

We define the martingale process at along with the stopping time τ as in Section 2.3, where t we take Ct = At†, so that at = AsAs†dBs. We denote Pt := AtAt†, and remark that since At 0 is symmetric, P is a projectionR matrix. As such, we have that for any t < τ, Tr (P ) 1. By t t ≥ Proposition 4, aτ has the law µ.

In light of the remark following Theorem 10, our first objective is to understand the expec- tation of τ.

Lemma 5. Under the boundedness assumption X β, we have E [τ] β2. k k ≤ ≤ Proof. Let H = a 2. By Itˆo’s formula and since P is a projection matrix, t k tk t dH =2 a ,P dB + Tr(P ) dt =2 a ,P dB + rank (P ) dt. t h t t ti t h t t ti t d 2 So, E [Ht]= E [rank (Pt)]. Since E [H ] β , dt ∞ ≤

∞ ∞ 2 β E [H ] E[H0]= E [rank (Pt)] dt P (τ > t) dt = E [τ] . ≥ ∞ − ≥ Z0 Z0

The above claim gives bounds on the expectation of τ, however in order to use Theorem 10, we need bounds for its tail behaviour in the sense of (2). To this end, we can use a bootstrap argument and invoke the above lemma with the measure µt in place of µ, recalling that X t ∞|F ∼ µt and noting that X t β almost surely. Therefore, we can consider the conditioned k ∞|F k ≤ stopping time τ t and get that |Ft − E [τ ] t + β2. |Ft ≤ The following lemma will make this precise.

2 Lemma 6. Suppose that, for the stopping time τ, it holds that for every t> 0, E [τ t] t+β a.s., then |F ≤ 1 i N, P τ i 2β2 . (10) ∀ ∈ ≥ · ≤ 2i Proof. Denote t = i 2β2. Since µ is Markovian, and by the law of total probability, for any i · t i N we have the relation ∈ 2 P (τ ti+1) P (τ > ti) ess sup P τ ti 2β ti , ≥ ≤ µti − ≥ |F 

19 where the essential supremum is taken over all possible states of µti . Using Markov’s inequality, we almost surely have

E [τ t ] 1 P τ t 2β2 − i|Fti , − i ≥ |Fti ≤ 2β2 ≤ 2  which is also true for the essential supremum. Clearly P (τ 0)=1 which finishes the proof. ≥

Proof of Theorem 1. Our objective is to apply Theorem 10, defining Xt = at and Γt = Pt so that (Xt, Γt, τ) becomes a martingale embedding according to Proposition 4. In this case, we have that Γt is a projection matrix almost surely. Thus,

Tr E[Γ4]E Γ2 † d, t t ≤   and   Tr E[Γ2] dP (τ > t) . t ≤ Therefore, if G and Sn are defined as in Theorem 10, then

2 2β log2(n) d ∞ 2 (S ,G) dt + 4dP(τ > t)dt W2 n ≤ n 2 Z0 2β logZ 2(n) 2dβ2 log (n) ∞ t 2 +4d P τ > 2β2 dt ≤ n 2β2 2     2β logZ 2(n) t ∞ (10) 2dβ2 log (n) 1 j 2β2 k 2 +4d dt ≤ n 2 2   2β logZ 2(n) 2dβ2 log (n) ∞ 1 2dβ2 log (n) 32dβ2 2 +8dβ2 2 + . ≤ n 2j ≤ n n j= log (n) ⌊X2 ⌋ Taking square roots, we finally have

β√d 32 + 2 log2(n) 2(Sn,G) , W ≤ p √n as required.

3.2 The case of log-concave vectors: proof of Theorem 2 µ µ In this section we fix µ to be an isotropic log concave measure. The processes at = at , At = At are defined as in Section 2.3 along with the stopping time τ. To define the matrix process Ct, we first define a new stopping time

T := 1 inf t A 2 . ∧ { |k tkop ≥ }

20 Ct is then defined in the following manner:

min(At†, Id) if t T Ct = ≤ (At† otherwise where, again, At† denotes the pseudo-inverse of At and min(At†, Id) is the unique matrix which is diagonalizable with respect to the same basis as At† and such that each of its eigenvalues corresponds to an an eigenvalue of A† truncated at 1. Since Tr A A† 1 whenever t τ, t t t ≥ ≤ then the conditions of Proposition 4 are clearly met for t0 =1 and aτ has the law of µ. In order to use Theorem 10, we will also need to demonstrate that τ has subexponential tails in the sense of (2). For this, we first relate τ to the stopping time T .

4 Lemma 7. τ < 1+ T .

Proof. Let Σt be as in Proposition 2. As described in the proposition, µt is proportional to µ times a Gaussian of covariance Σt, on an appropriate affine subspace. In this case, an application of the Brascamp-Lieb inequality (see [32] for details) shows that A = Cov(µ ) Σ . In t t  t particular, this means that for t > T , when restricted to the orthogonal complement of ker(At), the following inequality holds,

d Σ = Σ C2Σ I . dt t − t t t  − d So, τ T + Σ . ≤ k T kop It remains to estimate ΣT op. To this end, recall that for 0 < t T , we have At op 2, which implies k k ≤ k k ≤ d 1 Σ = Σ C2Σ Σ2. dt t − t t t  −4 t 1 2 Now, consider the differential equation f ′(t) = f(t) with f(T ) = Σ , which has − 4 k T kop solution 4 . By Gronwall’s inequality, lower bounds for f(t) = t T + 4 f(t) Σt op 0 < t − ΣT k k ≤ k kop T , and so, in particular, f(t) must remain finite within that interval. Consequently, we have 4 4 > T = Σ < . Σ ⇒ k T kop T k T kop We conclude that 4 τ T + Σ < 1+ , ≤ k T kop T as desired.

2 2 Lemma 8. There exist universal constants c,C > 0 such that if s>C κd ln(d) and d 8 then · ≥ cs P(τ >s) e− , ≤ where κd is the constant defined in (1). Proof. First, by using the previous claim, we may see that for any s 5, ≥

1 s 1 1 s 1 P (τ >s) P − P = P 5s− T = P max At op 2 . ≤ T ≥ 4 ≤ T ≥ 5 ≥ 0 t 5s−1 k k ≥      ≤ ≤   21 Recall from Proposition 3,

dA = (x a ) (x a ) C (x a ) , dB µ (dx) A C2A dt. t − t ⊗ − t h t − t ti t − t t t RZd

Since we are trying to bound the operator norm of At, we might as well just consider the matrix t ˜ 2 At = At + AsCs Asds. Note that, by definition of T , for any t T , 0 ≤ R t A C2A ds I . s s s  d Z0

Thus, for t [0, T ], ∈ 3I A + I A˜ A . (11) d  t d  t  t Also, A˜t can be written as,

dA˜ = (x a ) (x a ) C (x a ), dB µ (dx), A˜ = I . (12) t − t ⊗ − t h t − t ti t 0 d RZd The above shows

˜ P max At op 2 P max At op 2 . 0 t 5s−1 k k ≥ ≤ 0 t 5s−1 || || ≥  ≤ ≤   ≤ ≤  1 d We note than whenever A˜ 2 then also Tr A˜4 ln(d) 4 ln( ) 2, so that || t||op ≥ t ≥   1 ˜ ˜4 ln(d) 4 ln(d) P max At op 2 P max Tr At 2 0 t 5s−1 || || ≥ ≤ 0 t 5s−1 ≥  ≤ ≤   ≤ ≤    ˜4 ln(d) P max ln Tr At 2 ln(d) = P max (Mt + Et) 2 ln(d) , (13) ≤ 0 t 5s−1 ≥ 0 t 5s−1 ≥  ≤ ≤      ≤ ≤  ˜4 ln(d) where Mt and Et form the Doob-decomposition of ln Tr At . That is, Mt is a local martingale and Et is a process of bounded variation. To calculate the differential of the Doob- decomposition, fix t, let v , ..., v be the unit eigenvectors of A˜ and let α = v , A˜ v with 1 n t i,j h i t ji

dα = x, v x, v C x, dB µ (dx + a ), i,j h iih jih t ti t t RZd which follows from (12). Also define 1 ξi,j = x, vi x, vj Ctxµt(dx + at). √αi,iαj,j h ih i RZd So that d 2 dαi,j = √αi,iαj,j ξi,j, dBt , [αi,j]t = αi,iαj,j ξi,j . h i dt k k

22 Now, since vi is an eigenvector corresponding to the eigenvalue αi,i, we have

1/2 1/2 ξ = A˜− x, v A˜− x, v C xµ (dx + a ). i,j h t iih t ji t t t RZd

˜ 1/2 ˜1/2 If we define the measure µ˜t(dx) = det(At) µt(At dx + at), then µ˜t has the law of a centered ˜ 1/2 ˜ 1/2 log-concave random vector with covariance At− AtAt− Id. By making the substitution ˜ 1/2  y = At− x, the above expression becomes

ξ = y, v y, v C A˜1/2yµ˜ (dy). i,j h iih ji t t t RZd

˜1/2 ˜1/2 By (11) and the definition of T , Ct, for any t T , At 2Id and Ct Id. So, CtAt 2. ≤   op ≤ Under similar conditions, it was shown in [21], Lemma 3.2, that there exists a universal constant

C > 0 for which for any 1 i d, ξ 2 C. • ≤ ≤ k i,ik ≤ d 2 2 for any 1 i d, ξi,j Cκd. • ≤ ≤ j=1 k k ≤ P Furthermore, in the proof of Proposition 3.1 in the same paper it was shown

d dTr A˜4 ln(d) 4 ln(d) α4 ln(d) ξ , dB + 16Cκ2 ln(d)2Tr A˜4 ln(d) dt. t ≤ i,i h i,i ti d t i=1   X   So, using Itˆo’s formula with the function ln(x) we can calculate the differential of the Doob decomposition (13). Specifically, we use the fact that the second derivative of ln(x) is negative and get ˜4 ln(d) Tr At 2 2 2 2 dEt 16Cκ ln(d) = 16Cκ ln(d) , E0 = ln(d), ≤ d  ˜4 ln(d) d Tr At and   2 ˜4 ln(d) Tr At d 2 2 2 2 [M]t 16C ln(d) = 16C ln(d) . (14) dt ≤   ˜4 ln(d) Tr At   Hence, E t 16Cκ2 ln(d)2 + ln(d), which together with (13) gives t ≤ · n

1 2 2 P (τ >s) P max Mt 2 ln(d) ln(d) 80s− Cκd ln(d) s 5. ≤ 0 t 5s−1 ≥ − − ∀ ≥  ≤ ≤  Under the assumption s> 80Cκ2 ln(d)2, and since d 8, the above can simplify to d ≥ 1 P (τ >s) P max Mt ln(d) . (15) ≤ 0 t 5s−1 ≥ 2  ≤ ≤  To bound this last expression, we will apply the Dubins-Schwartz theorem to write

Mt = W[M]t ,

23 where Wt is some Brownian motion. Combining this with (15) gives ln(d) P (τ >s) P max W[M]t . ≤ 0 t 5s−1 ≥ 2  ≤ ≤ 

An application of Doob’s maximal inequality ( [41] Proposition I.1.8) shows that for any t′,K > 0 K2 P max Wt K exp . 0 t t′ ≥ ≤ − 2t  ≤ ≤   ′  We now integrate (14) and use the above inequality to obtain

ln(d) cs P max W[M]t e− , 0 t 5s−1 ≥ 2 ≤  ≤ ≤  where c> 0 is some universal constant.

Proof of Theorem 2. By definition of T and C , we have that for any t T , A C 2I and t ≤ t t  d for any t > T , A C = A A† I . We now invoke Theorem 10, with Γ = A C , for which t t t t  d t t t Tr E[Γ4]E Γ2 † 4d, t t ≤   and, by Lemma 8  

2 ct 2 2 Tr E[Γ ] 4dP (τ > t) 4de− t>C κ ln(d) . t ≤ ≤ ∀ · d If G is the standard d-dimensional  Gaussian, then the theorem yields

2 2 C κd ln(d) ln(n) · d ∞ 2(S ,G) 4 dt + 16dP (τ > t) W2 n ≤ n Z0 C κ2 ln(Zd)2 ln(n) · d 2 2 ∞ dC κd ln(d) ln(n) ct 4 · + 16d e− dt ≤ n C κ2 ln(Zd)2 ln(n) · d 2 2 d κd ln(d) ln(n) C′ · . ≤ n Thus Cκd ln(d) d ln(n) 2(Sn,G) , W ≤ √pn

4 Convergence rates in entropy

Throughout this section, we fix a centered measure µ on Rd with an invertible covariance matrix n (i) 1 (i) Σ and G (0, Σ). Let X be independent copies of X µ and Sn := √n X . ∼N { } ∼ i=1 Our goal is to study the quantity Ent (Sn G). In light of Theorem 11, we aimP to construct a martingale embedding (X , Γ , 1) such that||X µ and which satisfies appropriate bounds t t 1 ∼ 24 on the matrix Γt. Our construction uses the process at from Proposition 1 with the choice 1 Ct := 1 t Id. Property 2 in Proposition 1 gives − t A a = s dB . t 1 s s Z0 − Thus, we denote A Γ := t . t 1 t − 1 2 Since λmin(Ct ) = , Proposition 4 shows that the triplet (at, Γt, 1) is a martingale embed- 0 ∞ R (i) ding of µ. As above, the sequence Γt will denote independent copies of Γt and we define 2 ˜ n (i) Γt := i=1 Γt . r P  

4.1 Properties of the embedding The martingale embedding has several useful properties which we record in this section. First, we give an alternative description of the process which will be of use for us. Define the random process 1 1 2 v := arg min E ut , u 2 k k Z 0   1 where u varies over all t-adapted drifts such that B1 + utdt µ. Denote F 0 ∼ R t

Yt := Bt + vsds. Z0 In [24] (Section 2.2) it was shown that the density of the measure Y has the same dynamics 1|Ft as the density of µt. Thus, almost surely Y1 t µt and since at is the expectation of µt, we have the identity |F ∼ a = E [Y ] , (16) t 1|Ft and in particular we have a1 = Y1. Moreover, the same reasoning implies that At = Cov(Y1 t) and |F Cov(Y ) Γ = 1|Ft . (17) t 1 t − The process Yt goes back at least to the works of F¨ollmer [29,30]. In a later work, by Lehec [36], it is shown that vt is a martingale and that

1 1 Ent(Y γ)= E v 2 dt, (18) 1|| 2 k tk Z 0   where γ denotes the standard Gaussian.

25 Lemma 9. It holds that d E [Cov(Y )] = E [Γ2] . dt 1|Ft − t Proof. From (16), we have

2 2 2 2 Cov(Y )= E Y ⊗ E [Y ]⊗ = E Y ⊗ a⊗ . 1|Ft 1 |Ft − 1|Ft 1 |Ft − t     at is a martingale, hence d d E [Cov(Y )] = E [[a] ]= E Γ2 . (19) dt 1|Ft −dt t − t  

Our next goal is to recover vt from the martingale at.

t Γs Id Lemma 10. The drift vt satisfies that identity vt = 1−s dBs. Furthermore, 0 − R t 2 Tr E (Γs Id) E v 2 = − ds. (20) k tk (1 s)2 Z0  −    Proof. We begin by writing da = dB + (Γ I ) dB . t t t − d t Using Fubini’s theorem then yields

1 1 1 1 t Γ I Γ I (Γ I ) dB = s − d dtdB = s − d dB dt. s − d s 1 s s 1 s s Z0 Z0 Zs − Z0 Z0 −

t 1 Γs Id Therefore, defining v˜t = 1−s dBs we have that v˜t is a martingale. and that B1 + v˜tdt = a1. 0 − 0 R 1 R It follows that vt v˜t is a martingale and that (vt v˜t)dt = 0. We will now show that if a − 0 − 1 R martingale Qt satisfies Q0 =0 and Qtdt =0 a.s., then Qt =0 for every t [0, 1]. From this, 0 ∈ R t it will follow that vt =v ˜t. Indeed, write Qt = Qs′ dBs, for some adapted process Qt′ . Using 0 Fubini’s theorem, a calculation, similar to the oneR above, gives the identity,

1 1

0= Q dt = (1 t)Q′ dB . t − t t Z0 Z0

· Considering the martingale (1 t)Qt′ dBt, we now have, for any s [0, 1) 0 − ∈ R 1 s

0= E (1 t)Q′ dB = (1 t)Q′ dB .  − t t|Fs − t t Z0 Z0   26 Thus, Q′ = 0 almost surely, which implies, for every t [0, 1], Q = Q = 0. Therefore ∈ t 0 vt =v ˜t, or in other words t Γ I v = s − d dB . t 1 s s Z0 − Finally, equation (20) follows from a direct application of Itˆo’s isometry.

A combination of equations (18) and (20) gives the useful identity,

1 t 1 1 Tr E (Γ I )2 1 Tr E (Γ I )2 Ent (Y γ)= s − d dsdt = t − d dt. (21) 1|| 2 (1 s)2 2 1 t Z0 Z0  −  Z0  − 

The above lemma also affords a representation of E [Tr (Γ )] in terms of E v 2 . t k tk Lemma 11. It holds that  

E [Tr(Γ )] = d (1 t) d Tr (Σ) + E v 2 . t − − − k tk Proof. The identity can be obtained through integration by parts. By Lemma 10,

t 2 (20) Tr E (Γs Id) E[ v 2] = − ds k tk (1 s)2 Z0  −  t t t Tr (E [Γ2]) Tr (E [Γ ]) Tr (I ) = s ds 2 s ds + d ds. (1 s)2 − (1 s)2 (1 s)2 Z0 − Z0 − Z0 −

Since, by Lemma 9, d E [Cov (Y )] = E [Γ2] integration by parts shows dt 1|Ft − t t t t Tr (E [Γ2]) Tr (E [Cov (Y )]) Tr (E [Cov (Y )]) s ds = 1|Fs +2 1|Fs ds (1 s)2 − (1 s)2 (1 s)3 Z0 − − 0 Z0 −

t Tr (E [Γt]) Tr (E [Γs]) = Tr (Σ) +2 ds, − 1 t (1 s)2 − Z0 − where we have used (17) and the fact Cov (Y1 0) = Cov (Y1) = Σ. Plugging this into the previous equation shows |F

Tr (E [Γ ]) d E[ v 2] = Tr(Σ) t + d. k tk − 1 t 1 t − − − or equivalently E [Tr(Γ )] = d (1 t) d Tr (Σ) + E v 2 . t − − − k tk  

27 Next, as in Theorem 11, we define σt to be the minimal eigenvalue of E [Γt], so that

E [Γ ] σ I . t  t d Note that by Jensen’s inequality we also have

E Γ2 σ2I . (22) t  t d Lemma 12. Assume that Ent(Y γ) < . Then Γ is almost surely invertible for all t [0, 1) 1|| ∞ t ∈ and, moreover, there exists a constant m = mµ > 0 for which

σ m, t [0, 1). t ≥ ∀ ∈

Proof. We will show that for every 0 t < 1, σt > 0 and that there exists c > 0 such that 1 ≤ σt > 8 whenever t> 1 c. The claim will then follow by continuity of σt. The key to showing this is identity (21), due− to which,

1 2 1 Tr E (Γt Id) Ent (Y γ)= − dt. 1|| 2 1 t Z0  − 

Cov(Y1 t) |F Recall that, by Equation (17), Γt = 1 t and observe that, by Proposition 5, if Cov (Y1 s) − |F is not invertible for some 0 s< 1 then Cov (Y1 t) is also not invertible for any t>s. Under ≤ 1 2 |F Tr (Γt Id) ( − ) this event, we would have that 1 t dt = which, using the above display, implies s − ∞ that the probability of this eventR must be zero. Therefore, Γt is almost surely invertible and σ > 0 for all t [0, 1). t ∈ 1 Suppose now that for some t′ [0, 1], σ ′ . By Jensen’s inequality, we have ∈ t ≤ 8 Tr E (Γ I )2 Tr E [Γ I ]2 (1 σ )2 1 2σ . t − d ≥ t − d ≥ − t ≥ − t    1 t′ Since, by Lemma 9, E [Cov (Y )] is non increasing, for any t′ t t′ + − , 1|Ft ≤ ≤ 2

σt′ (1 t′) 1 t′ 1 σt − − 1 t′ = . ≤ 1 t ≤ 8(1 t − ) 4 − − ′ − 2 1 Now, assume by contradiction that there exists a sequence ti (0, 1) such that σti 8 ∈ 1 ti ≤ and lim ti = 1. By passing to a subsequence we may assume that ti+1 ti − for all i. i 2 →∞ − ≥ The assumption Ent(Y1 γ) < combined with Equation (21) and with the last two displays finally gives || ∞

− 1 ti 1 1 ti+ 2 2 Tr E (Γt Id) 1 2σ ∞ 1 1 ∞ 1 > − dt − t dt dt log 2 , ∞ 1 t ≥ 1 t ≥ 2 1 t ≥ 2   i=1 i=1 Z0 − Z0 − X Zti − X which leads to a contradiction and completes the proof.

28 4.2 Proof of Theorem 5

Thanks to the assumption Ent (Y1 G) < , an application of Lemma 12 gives that Γt is invertible almost surely, so we may|| invoke the∞ second bound in Theorem 11 to obtain

2 1 2 ˜ 1 Tr E [Γt ] E Γt − 2 Ent(S G)   σ− ds dt. n|| ≤ (1 t)2 h i  s  Z0 − Zt   The same lemma also shows that for some m> 0 one has

1 2 1 t σ− ds − . s ≤ m2 Zt Therefore, we attain that

2 1 Tr E [Γ2] E Γ˜ 1 t − t Ent(S G)  dt. (23) n|| ≤ m2 1 t h i Z0 − Next, observe that, by Itˆo’s isometry,

1 2 Cov(X)= E Γt dt. Z 0   2 Hence, as long as Cov(X) is finite, E [Γt ] is also finite for all t A where [0, 1] A is a set of measure 0. We will use this fact to show that ∈ \

2 2 ˜ lim Tr E Γt E Γt =0, t A. (24) n − ∀ ∈ →∞     h i 2 ˜ 2 (i) Indeed, by the law of large numbers, Γt almost surely converges to E [Γt ]. Since Γt n 2 1 (i) p   are integrable, we get that the sequence n Γt is uniformly integrable. We now use the i=1 inequality P   n n 1 2 1 2 Γ˜ Γ(i) + I Γ(i) + I , t  vn t d  n t d u i=1 i=1 u X   X   t to deduce that Γ˜t is uniformly integrable as well. An application of Vitali’s convergence theo- rem (see [28], for example) implies (24).

We now know that the integrand in the right hand side of (23) convergence to zero for almost every t. It remains to show that the expression converges as an integral, for which we intend to apply the dominated convergence theorem. It thus remains to show that the expression

2 Tr E [Γ2] E Γ˜ t − t   1 t h i − 29 is bounded by an integrable function, uniformly in n, which would imply that

lim Ent(Sn G)=0, n →∞ || and the proof would be complete. To that end, recall that the square root function is concave on positive definite matrices (see e.g., [1]), thus

$$\tilde{\Gamma}_t \succeq \frac{1}{n}\sum_{i=1}^n \Gamma_t^{(i)}.$$
It follows that

$$\mathrm{Tr}\left(\mathbb{E}[\Gamma_t^2] - \mathbb{E}\big[\tilde{\Gamma}_t\big]^2\right) \le \mathrm{Tr}\left(\mathbb{E}[\Gamma_t^2] - \mathbb{E}[\Gamma_t]^2\right) \le \mathrm{Tr}\,\mathbb{E}\left[(\Gamma_t - I_d)^2\right].$$
So we have

$$\frac{1}{m^2}\int_0^1 \frac{\mathrm{Tr}\left(\mathbb{E}[\Gamma_t^2] - \mathbb{E}\big[\tilde{\Gamma}_t\big]^2\right)}{1-t}\,dt \le \frac{1}{m^2}\int_0^1 \frac{\mathrm{Tr}\,\mathbb{E}\left[(\Gamma_t - I_d)^2\right]}{1-t}\,dt \overset{(21)}{=} \frac{2}{m^2}\,\mathrm{Ent}(Y_1 \| \gamma) < \infty.$$
This completes the proof.

4.3 Quantitative bounds for log-concave random vectors

In this section, we make the additional assumption that the measure $\mu$ is log-concave. Under this assumption, we show how one can obtain explicit convergence rates in the entropic CLT. Our aim is to use the bound in Theorem 11, for which we are required to obtain bounds on the process $\Gamma_t$. We begin by recording several useful facts concerning this process.

Lemma 13. The process $\Gamma_t$ has the following properties:
1. If $\mu$ is log-concave, then for every $t \in [0,1]$, $\Gamma_t \preceq \frac{1}{t} I_d$ almost surely.
2. If $\mu$ is also 1-uniformly log-concave, then for every $t \in [0,1]$, $\Gamma_t \preceq I_d$ almost surely.

Proof. Denote by $\rho_t$ the density of $Y_1 \mid \mathcal{F}_t$ with respect to the Lebesgue measure, with $\rho := \rho_0$ being the density of $\mu$. By Proposition 2 with $C_t = \frac{I_d}{1-t}$, we can calculate the ratio between $\rho_t$ and $\rho$. In particular, we have

$$\frac{d}{dt}\Sigma_t^{-1} = -\Sigma_t^{-1}\left(\frac{d}{dt}\Sigma_t\right)\Sigma_t^{-1} = \frac{1}{(1-t)^2}\, I_d.$$
Solving this differential equation with the initial condition $\Sigma_0^{-1} = 0$, we find that $\Sigma_t^{-1} = \frac{t}{1-t}\, I_d$. Since the ratio between $\rho_t$ and $\rho$ is proportional to the density of a Gaussian with covariance $\Sigma_t$, we thus have
$$-\nabla^2 \log(\rho_t) = -\nabla^2 \log(\rho) + \frac{t}{1-t}\, I_d.$$
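For completeness, the integration behind the formula for $\Sigma_t^{-1}$ above is elementary; with the initial condition $\Sigma_0^{-1} = 0$,
$$\Sigma_t^{-1} = \left(\int_0^t \frac{ds}{(1-s)^2}\right) I_d = \left(\frac{1}{1-t} - 1\right) I_d = \frac{t}{1-t}\, I_d.$$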

Now, if $\mu$ is log-concave then $Y_1 \mid \mathcal{F}_t$ is almost surely $\frac{t}{1-t}$-uniformly log-concave. By the Brascamp–Lieb inequality (as in [32]) we get $\mathrm{Cov}(Y_1 \mid \mathcal{F}_t) \preceq \frac{1-t}{t}\, I_d$ and, using (17),
$$\Gamma_t \preceq \frac{1}{t}\, I_d.$$
If $\mu$ is also 1-uniformly log-concave then $-\nabla^2 \log(\rho) \succeq I_d$ and almost surely
$$-\nabla^2 \log(\rho_t) \succeq \frac{1}{1-t}\, I_d.$$
By the same argument this implies $\Gamma_t \preceq I_d$.

The relative entropy to the Gaussian of a log-concave measure with non-degenerate covariance structure is finite (it is even universally bounded, see [37]). Thus, by Lemma 12, it follows that $\Gamma_t$ is invertible almost surely. This allows us to invoke the first bound of Theorem 11,

$$\mathrm{Ent}(S_n \| G) \le \frac{1}{n}\int_0^1 \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)^2 \sigma_t^2}\left(\int_t^1 \sigma_s^{-2}\,ds\right) dt. \qquad (25)$$
Attaining an upper bound on the right hand side amounts to a concentration estimate for the process $\Gamma_t^2$ and a lower bound on $\sigma_t$. These two tasks are the objective of the following two lemmas.

Lemma 14. If $\mu$ is log-concave and isotropic then for any $t \in [0,1)$,
$$\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] \le \frac{1-t}{t^2}\left(\frac{d(1+t)}{t^2} + 2\,\mathbb{E}\|v_t\|^2\right),$$
and
$$\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] \le C\,\frac{d^4}{(1-t)^4}$$
for a universal constant $C > 0$.

Proof. The isotropicity of $\mu$, used in conjunction with the formula given in Lemma 11, yields
$$\mathrm{Tr}\,\mathbb{E}\left[\Gamma_t^2\right] \ge \frac{1}{d}\left(\mathrm{Tr}\,\mathbb{E}[\Gamma_t]\right)^2 \ge d - 2(1-t)\,\mathbb{E}\|v_t\|^2,$$
where the first inequality follows by convexity. Since $\mu$ is log-concave, Lemma 13 ensures that, almost surely, $\Gamma_t \preceq \frac{1}{t}\, I_d$. Therefore,
$$\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] \le \mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \frac{1}{t^2}\, I_d\right)^2\right] = \frac{1}{t^4}\,\mathrm{Tr}\,\mathbb{E}\left[\left(I_d - t^2\Gamma_t^2\right)^2\right]$$
$$\le \frac{1}{t^4}\,\mathrm{Tr}\,\mathbb{E}\left[I_d - t^2\Gamma_t^2\right] \le \frac{1-t}{t^2}\left(\frac{d(1+t)}{t^2} + 2\,\mathbb{E}\|v_t\|^2\right),$$
which proves the first bound. Towards the second bound, we use (17) to write

$$\Gamma_t^2 \preceq \frac{1}{(1-t)^2}\,\mathbb{E}\left[Y_1^{\otimes 2} \mid \mathcal{F}_t\right]^2.$$
So,
$$\mathbb{E}\left\|\Gamma_t^2\right\|_{HS}^2 \le \frac{1}{(1-t)^4}\,\mathbb{E}\left[\left\|\,\|Y_1\|^2\, Y_1^{\otimes 2}\right\|_{HS}^2\right] \le \frac{1}{(1-t)^4}\,\mathbb{E}\left[\|Y_1\|^8\right].$$
For an isotropic log-concave measure, the expression $\mathbb{E}\|Y_1\|^8$ is bounded from above by $Cd^4$ for a universal constant $C > 0$ (see [40]). Thus,
$$\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] = \mathbb{E}\left\|\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right\|_{HS}^2 \le 2\,\mathbb{E}\left\|\Gamma_t^2\right\|_{HS}^2 \le C\,\frac{d^4}{(1-t)^4}.$$
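For completeness, the algebra behind the last step of the first bound, using the lower bound on $\mathrm{Tr}\,\mathbb{E}[\Gamma_t^2]$ displayed at the beginning of the proof, reads
$$\frac{1}{t^4}\,\mathrm{Tr}\,\mathbb{E}\left[I_d - t^2\Gamma_t^2\right] = \frac{d - t^2\,\mathrm{Tr}\,\mathbb{E}[\Gamma_t^2]}{t^4} \le \frac{d - t^2\left(d - 2(1-t)\,\mathbb{E}\|v_t\|^2\right)}{t^4} = \frac{1-t}{t^2}\left(\frac{d(1+t)}{t^2} + 2\,\mathbb{E}\|v_t\|^2\right).$$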

Lemma 15. Suppose that $\mu$ is log-concave and isotropic. Then there exists a universal constant $1 > c > 0$ such that:
1. For any $t \in [0, \frac{c}{d^2}]$, $\sigma_t \ge \frac{1}{2}$.
2. For any $t \in [\frac{c}{d^2}, 1]$, $\sigma_t \ge \frac{c}{t d^2}$.

Proof. By Lemma 9, we have

$$\frac{d}{dt}\,\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right] = -\mathbb{E}\left[\Gamma_t^2\right] \overset{(17)}{=} -\frac{\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)^2\right]}{(1-t)^2}.$$
Moreover, by convexity,

$$\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)^2\right] \preceq \mathbb{E}\left[\mathbb{E}\left[Y_1^{\otimes 2} \mid \mathcal{F}_t\right]^2\right] \preceq \mathbb{E}\left[\|Y_1\|^4\right] I_d.$$
It is known (see [40]) that when $\mu$ is log-concave and isotropic there exists a universal constant $C > 0$ such that $\mathbb{E}\|Y_1\|^4 \le Cd^2$. Consequently, $\frac{d}{dt}\,\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right] \succeq -\frac{Cd^2}{(1-t)^2}\, I_d$, and since $\mathrm{Cov}(Y_1 \mid \mathcal{F}_0) = I_d$,
$$\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right] \succeq \left(1 - Cd^2\int_0^t \frac{ds}{(1-s)^2}\right) I_d = \left(1 - \frac{Cd^2 t}{1-t}\right) I_d.$$
By increasing the value of $C$, we may legitimately assume that $Cd^2 \ge 1$; thus for any $t \in [0, \frac{1}{3Cd^2}]$ we get that
$$\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right] \succeq \frac{1}{2}\, I_d,$$
which implies $\sigma_t \ge \frac{1}{2}$ and completes the first part of the lemma. In order to prove the second part, we first write

$$\frac{d}{dt}\,\mathbb{E}[\Gamma_t] = \frac{d}{dt}\,\frac{\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right]}{1-t} \overset{(\text{Lemma } 9)}{=} \frac{\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right] - (1-t)\,\mathbb{E}[\Gamma_t^2]}{(1-t)^2} = \frac{\mathbb{E}[\Gamma_t] - \mathbb{E}[\Gamma_t^2]}{1-t}. \qquad (26)$$
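In more detail, the middle equality in (26) is the product rule combined with the derivative of $\mathbb{E}[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)]$ computed at the beginning of the proof:
$$\frac{d}{dt}\,\frac{\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right]}{1-t} = \frac{\frac{d}{dt}\,\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right]}{1-t} + \frac{\mathbb{E}\left[\mathrm{Cov}(Y_1 \mid \mathcal{F}_t)\right]}{(1-t)^2} = -\frac{\mathbb{E}[\Gamma_t^2]}{1-t} + \frac{\mathbb{E}[\Gamma_t]}{1-t}.$$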

Since, by Lemma 13, $\Gamma_t \preceq \frac{1}{t}\, I_d$, we have the bound
$$\frac{\mathbb{E}[\Gamma_t] - \mathbb{E}[\Gamma_t^2]}{1-t} \succeq \frac{1 - \frac{1}{t}}{1-t}\,\mathbb{E}[\Gamma_t] = -\frac{1}{t}\,\mathbb{E}[\Gamma_t].$$
Now, consider the differential equation $f'(t) = -\frac{f(t)}{t}$, $f\!\left(\frac{1}{3Cd^2}\right) = \frac{1}{2}$. Its unique solution is $f(t) = \frac{1}{6Cd^2 t}$. Thus, Grönwall's inequality shows that $\sigma_t \ge \frac{1}{6Cd^2 t}$, which concludes the proof.
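For completeness, the two elementary facts used in the proof above are as follows. First, if $Cd^2 \ge 1$ and $t \le \frac{1}{3Cd^2}$, then
$$\frac{Cd^2 t}{1-t} \le \frac{1/3}{1 - \frac{1}{3}} = \frac{1}{2},$$
which is the inequality behind the first part of the lemma. Second, the comparison ODE $f'(t) = -\frac{f(t)}{t}$ is solved by $f(t) = \frac{K}{t}$, and the initial condition $f\!\left(\frac{1}{3Cd^2}\right) = \frac{1}{2}$ forces $K = \frac{1}{6Cd^2}$.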

Proof of Theorem 6. Our objective is to bound from above the right hand side of Equation (25). As a consequence of Lemma 15, we have that for any $t \in [0,1)$,
$$\int_t^1 \sigma_s^{-2}\,ds \le C d^4 (1-t),$$
for some universal constant $C > 0$. It follows that the integral in (25) admits the bound

$$\int_0^1 \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)^2\sigma_t^2}\left(\int_t^1 \sigma_s^{-2}\,ds\right) dt \le Cd^4\int_0^1 \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)\sigma_t^2}\,dt.$$
Next, there exists a universal constant $C' > 0$ such that

$$Cd^4\int_0^{cd^{-2}} \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)\sigma_t^2}\,dt \le C'\int_0^{cd^{-2}} \frac{d^8}{(1-t)^5}\,dt \le C'd^8,$$
where we have used the second bound of Lemma 14 and the first bound of Lemma 15. Also, by applying the second bound of Lemma 15 when $t \in [cd^{-2}, d^{-1}]$ we get
$$Cd^4\int_{cd^{-2}}^{d^{-1}} \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)\sigma_t^2}\,dt \le C'\int_{cd^{-2}}^{d^{-1}} \frac{d^{12}\, t^2}{(1-t)^5}\,dt \le C'd^9.$$
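In more detail (a sketch, absorbing numerical constants into $C'$ and assuming $d \ge 2$ so that $(1-t)^{-5}$ stays bounded by a universal constant on $[0, d^{-1}]$): for $t \in [cd^{-2}, d^{-1}]$, Lemma 15 gives $\sigma_t^{-2} \le \frac{t^2 d^4}{c^2}$ and Lemma 14 bounds the numerator by $C\,\frac{d^4}{(1-t)^4}$, so the integrand is at most $C'\frac{d^{12} t^2}{(1-t)^5}$; moreover
$$\int_{cd^{-2}}^{d^{-1}} t^2\,dt \le \frac{d^{-3}}{3},$$
which yields the stated bound $C'd^9$.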

Finally, when $t > d^{-1}$, we have

$$Cd^4\int_{d^{-1}}^1 \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)\sigma_t^2}\,dt \le C'd^8\int_{d^{-1}}^1 \frac{t^2\,\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{1-t}\,dt$$
$$\le 2C'd^9\int_{d^{-1}}^1 \left(\frac{1}{t^2} + \mathbb{E}\|v_t\|^2\right) dt \overset{(18)}{\le} 4C'd^{10}\left(1 + \mathrm{Ent}(Y_1 \| G)\right),$$
where the first inequality uses Lemma 15 and the second one uses Lemma 14. This establishes
$$\mathrm{Ent}(S_n \| G) \le \frac{Cd^{10}\left(1 + \mathrm{Ent}(Y_1 \| G)\right)}{n}.$$
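For completeness, combining the three estimates above (over $[0, cd^{-2}]$, $[cd^{-2}, d^{-1}]$ and $[d^{-1}, 1]$) bounds the right hand side of (25) by
$$\frac{1}{n}\left(C'd^8 + C'd^9 + 4C'd^{10}\left(1 + \mathrm{Ent}(Y_1 \| G)\right)\right) \le \frac{Cd^{10}\left(1 + \mathrm{Ent}(Y_1 \| G)\right)}{n},$$
after adjusting the value of the universal constant $C$.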

Finally, we derive an improved bound for the case of 1-uniformly log-concave measures, based on the following estimates.

Lemma 16. Suppose that $\mu$ is 1-uniformly log-concave. Then for every $t \in [0,1)$:
1. $\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] \le 2(1-t)\left(d - \mathrm{Tr}(\Sigma) + \mathbb{E}\|v_t\|^2\right)$.
2. $\sigma_t \ge \sigma_0$.

Proof. By Lemma 13, we have that $\Gamma_t \preceq I_d$ almost surely. Using this together with the identity given by Lemma 11, and proceeding in a similar fashion to Lemma 14, we obtain
$$\mathrm{Tr}\,\mathbb{E}\left[\Gamma_t^2\right] \ge \frac{1}{d}\left(\mathrm{Tr}\,\mathbb{E}[\Gamma_t]\right)^2 \ge d - 2(1-t)\left(d - \mathrm{Tr}(\Sigma) + \mathbb{E}\|v_t\|^2\right),$$
and

$$\mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right] \le \mathrm{Tr}\,\mathbb{E}\left[\left(\Gamma_t^2 - I_d\right)^2\right] \le \mathrm{Tr}\,\mathbb{E}\left[I_d - \Gamma_t^2\right] \le 2(1-t)\left(d - \mathrm{Tr}(\Sigma) + \mathbb{E}\|v_t\|^2\right).$$
Also, recalling (26) and since $\Gamma_t \preceq I_d$, we get
$$\frac{d}{dt}\,\mathbb{E}[\Gamma_t] = \frac{\mathbb{E}[\Gamma_t] - \mathbb{E}[\Gamma_t^2]}{1-t} \succeq 0,$$
which shows that $\sigma_t$ is bounded from below by a non-decreasing function, and so $\sigma_t \ge \sigma_0$, which is the minimal eigenvalue of $\Sigma$.
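For completeness, the middle inequality in the first display uses the elementary matrix fact that $0 \preceq B \preceq I_d$ implies $B^2 \preceq B$; applied with $B = I_d - \Gamma_t^2$ (which satisfies $0 \preceq B \preceq I_d$ since $0 \preceq \Gamma_t \preceq I_d$), it gives
$$\left(\Gamma_t^2 - I_d\right)^2 = \left(I_d - \Gamma_t^2\right)^2 \preceq I_d - \Gamma_t^2.$$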

Proof of Theorem 7. Plugging the bounds given in Lemma 16 into Equation (25) yields

$$\mathrm{Ent}(S_n \| G) \le \frac{1}{n}\int_0^1 \frac{\mathbb{E}\,\mathrm{Tr}\left(\left(\Gamma_t^2 - \mathbb{E}[\Gamma_t^2]\right)^2\right)}{(1-t)^2\sigma_t^2}\left(\int_t^1 \sigma_s^{-2}\,ds\right) dt \le \frac{2\int_0^1\left(d + \mathbb{E}\|v_t\|^2\right) dt}{\sigma_0^4\, n} \overset{(18)}{=} \frac{2\left(d + 2\,\mathrm{Ent}(X \| \gamma)\right)}{\sigma_0^4\, n},$$
which completes the proof.
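In more detail, the second inequality follows by bounding the integrand pointwise: by Lemma 16, the numerator is at most $2(1-t)\left(d - \mathrm{Tr}(\Sigma) + \mathbb{E}\|v_t\|^2\right) \le 2(1-t)\left(d + \mathbb{E}\|v_t\|^2\right)$, while $\sigma_t \ge \sigma_0$ gives $\sigma_t^{-2} \le \sigma_0^{-2}$ and $\int_t^1 \sigma_s^{-2}\,ds \le \frac{1-t}{\sigma_0^2}$, so the integrand is bounded by $\frac{2\left(d + \mathbb{E}\|v_t\|^2\right)}{\sigma_0^4}$; integrating over $t \in [0,1]$ and applying (18) to $\int_0^1 \mathbb{E}\|v_t\|^2\,dt$ gives the stated bound.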

References

[1] T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra Appl., 26:203–241, 1979.

[2] Milla Anttila, Keith Ball, and Irini Perissinaki. The central limit problem for convex bodies. Trans. Amer. Math. Soc., 355(12):4723–4735, 2003.

[3] Shiri Artstein, Keith M. Ball, Franck Barthe, and Assaf Naor. On the rate of convergence in the entropic central limit theorem. Probab. Theory Related Fields, 129(3):381–390, 2004.

[4] Keith Ball, Franck Barthe, and Assaf Naor. Entropy jumps in the presence of a spectral gap. Duke Math. J., 119(1):41–63, 2003.

[5] Keith Ball and Van Hoang Nguyen. Entropy jumps for isotropic log-concave random vectors and spectral gap. Studia Math., 213(1):81–96, 2012.

[6] Andrew R. Barron. Entropy and the central limit theorem. Ann. Probab., 14(1):336–342, 1986.

[7] V. Bentkus. A Lyapunov type bound in $\mathbb{R}^d$. Teor. Veroyatn. Primen., 49(2):400–410, 2004.

[8] Harald Bergström. On the central limit theorem in the space $\mathbb{R}^k$, $k > 1$. Skand. Aktuarietidskr., 28:106–127, 1945.

[9] Andrew C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates. Trans. Amer. Math. Soc., 49:122–136, 1941.

[10] R. N. Bhattacharya. Refinements of the multidimensional central limit theorem and applications. Ann. Probability, 5(1):1–27, 1977.

[11] S. G. Bobkov and A. Koldobsky. On the central limit property of convex bodies. In Geometric aspects of functional analysis, volume 1807 of Lecture Notes in Math., pages 44–52. Springer, Berlin, 2003.

[12] Sergey G. Bobkov. Entropic approach to E. Rio’s central limit theorem for W2 transport distance. Statist. Probab. Lett., 83(7):1644–1648, 2013.

[13] Sergey G. Bobkov. Berry-Esseen bounds and Edgeworth expansions in the central limit theorem for transport distances. Probab. Theory Related Fields, 170(1-2):229–262, 2018.

[14] Sergey G. Bobkov, Gennadiy P. Chistyakov, and Friedrich Götze. Rate of convergence and Edgeworth-type expansion in the entropic central limit theorem. Ann. Probab., 41(4):2479–2512, 2013.

[15] Sergey G. Bobkov, Gennadiy P. Chistyakov, and Friedrich Götze. Berry-Esseen bounds in the entropic central limit theorem. Probab. Theory Related Fields, 159(3-4):435–478, 2014.

[16] Thomas Bonis. Stein's method for normal approximation in Wasserstein distances with application to the multivariate central limit theorem. arXiv preprint arXiv:1905.13615, 2019.

[17] Sébastien Bubeck and Shirshendu Ganguly. Entropic CLT and phase transition in high-dimensional Wishart matrices. International Mathematics Research Notices, page rnw243, 2016.

[18] Louis H. Y. Chen and Xiao Fang. Multivariate normal approximation by Stein's method: The concentration inequality approach. arXiv preprint arXiv:1111.4073, 2011.

[19] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist., 41(6):2786–2819, 2013.

[20] Thomas A. Courtade, Max Fathi, and Ashwin Pananjady. Existence of Stein kernels under a spectral gap, and discrepancy bounds. Ann. Inst. Henri Poincaré Probab. Stat., 55(2):777–790, 2019.

[21] Ronen Eldan. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geom. Funct. Anal., 23(2):532–569, 2013.

[22] Ronen Eldan. Skorokhod embeddings via stochastic flows on the space of Gaussian measures. volume 52, pages 1259–1280, 2016.

[23] Ronen Eldan. Gaussian-width gradient complexity, reverse log-Sobolev inequalities and nonlinear large deviations. Geom. Funct. Anal., 28(6):1548–1596, 2018.

[24] Ronen Eldan and James R. Lee. Regularization under diffusion and anticoncentration of the information content. Duke Math. J., 167(5):969–993, 2018.

[25] Ronen Eldan and Dan Mikulincer. Information and dimensionality of anisotropic random geometric graphs. arXiv preprint arXiv:1609.02490, 2016.

[26] Carl-Gustav Esseen. On the Liapounoff limit of error in the theory of probability, volume 28A. 1942.

[27] Max Fathi. Stein kernels and moment maps. Ann. Probab., 47(4):2172–2185, 2019.

[28] Gerald B. Folland. Real analysis. Pure and Applied Mathematics (New York). John Wiley & Sons, Inc., New York, second edition, 1999. Modern techniques and their applications, A Wiley-Interscience Publication.

[29] H. Föllmer. An entropy approach to the time reversal of diffusion processes. In Stochastic differential systems (Marseille-Luminy, 1984), volume 69 of Lect. Notes Control Inf. Sci., pages 156–163. Springer, Berlin, 1985.

[30] H. Föllmer. Time reversal on Wiener space. In Stochastic processes—mathematics and physics (Bielefeld, 1984), volume 1158 of Lecture Notes in Math., pages 119–129. Springer, Berlin, 1986.

[31] F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.

[32] Gilles Hargé. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probab. Theory Related Fields, 130(3):415–440, 2004.

[33] Oliver Johnson and Andrew Barron. Fisher information inequalities and the central limit theorem. Probab. Theory Related Fields, 129(3):391–409, 2004.

[34] Bo’az Klartag. Eldan’s stochastic localization and tubular neighborhoods of complex- analytic sets. J. Geom. Anal., 28(3):2008–2027, 2018.

[35] Yin Tat Lee and Santosh Srinivas Vempala. Eldan's stochastic localization and the KLS hyperplane conjecture: An improved lower bound for expansion. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 998–1007. IEEE, 2017.

[36] Joseph Lehec. Representation formula for the entropy and functional inequalities. Ann. Inst. Henri Poincaré Probab. Stat., 49(3):885–899, 2013.

[37] Arnaud Marsiglietti and Victoria Kostina. A lower bound on the differential entropy of log-concave random vectors with applications. Entropy, 20(3):Paper No. 185, 24, 2018.

[38] S. V. Nagaev. An estimate of the remainder term in the multidimensional central limit theorem. In Proceedings of the Third Japan-USSR Symposium on Probability Theory (Tashkent, 1975), pages 419–438. Lecture Notes in Math., Vol. 550, 1976.

[39] Bernt Øksendal. Stochastic differential equations. Universitext. Springer-Verlag, Berlin, sixth edition, 2003. An introduction with applications.

[40] G. Paouris. Concentration of mass on convex bodies. Geom. Funct. Anal., 16(5):1021–1049, 2006.

[41] Daniel Revuz and Marc Yor. Continuous martingales and Brownian motion, volume 293 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, third edition, 1999.

[42] Emmanuel Rio. Upper bounds for minimal distances in the central limit theorem. Ann. Inst. Henri Poincaré Probab. Stat., 45(3):802–817, 2009.

[43] Emmanuel Rio. Asymptotic constants for minimal distance in the central limit theorem. Electron. Commun. Probab., 16:96–103, 2011.

[44] V. V. Senatov. Some uniform estimates of the convergence rate in the multidimensional central limit theorem. Teor. Veroyatnost. i Primenen., 25(4):757–770, 1980.

[45] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/ log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC’11—Proceedings of the 43rd ACM Symposium on Theory of Computing, pages 685–694. ACM, New York, 2011.

[46] Alex Zhai. A high-dimensional CLT in $\mathcal{W}_2$ distance with near optimal convergence rate. Probab. Theory Related Fields, 170(3-4):821–845, 2018.
