arXiv:1509.03258v2 [math.PR] 13 Sep 2015

Entropic CLT and phase transition in high-dimensional Wishart matrices

Sébastien Bubeck* and Shirshendu Ganguly†

September 15, 2015

Abstract

We consider high dimensional Wishart matrices XX⊤/√d where the entries of X ∈ R^{n×d} are i.i.d. from a log-concave distribution. We prove an information theoretic phase transition: such matrices are close in total variation distance to the corresponding Gaussian ensemble if and only if d is much larger than n³. Our proof is entropy-based, making use of the chain rule for relative entropy along with the recursive structure in the definition of the Wishart ensemble. The proof crucially relies on the well known relation between Fisher information and entropy, a variational representation for Fisher information, concentration bounds for the spectral norm of a random matrix, and certain small ball probability estimates for log-concave measures.

*Microsoft Research; [email protected].
†University of Washington; [email protected].

1 Introduction

Let µ be a probability measure supported on R with zero mean and unit variance. We consider a Wishart matrix (with removed diagonal)

W = ( XX⊤ − diag(XX⊤) ) / √d,

where X is an n × d random matrix with i.i.d. entries from µ. The distribution of W, which we denote W_{n,d}(µ), is of importance in many areas of mathematics. Perhaps most prominently it arises in statistics as the distribution of covariance matrices, and in this case d can be thought of as the sample size and n as the number of parameters. Another application is in the theory of random graphs, where the thresholded matrix A_{i,j} = 1{W_{i,j} > τ} is the adjacency matrix of a random geometric graph on n vertices: each vertex is associated to a latent feature vector in R^d (namely the i-th row of X), and an edge is present between two vertices if the correlation between the underlying features is large enough. Wishart matrices also appear in physics, as a simple model of a random mixed quantum state, where n and d are the dimensions of the observable and unobservable states respectively.

The measure W_{n,d}(µ) becomes approximately Gaussian when d goes to infinity and n remains bounded (see Section 1.1). Thus in the classical regime of statistics, where the sample size is much larger than the number of parameters, one can use the well understood theory of Gaussian matrices to study the properties of W_{n,d}(µ). In this paper we investigate the extent to which this Gaussian picture remains relevant in the high-dimensional regime where the matrix size n also goes to infinity.
Our main result, stated informally, is the following universality of a critical dimension for sufficiently smooth measures µ (namely log-concave): the Wishart measure W_{n,d}(µ) becomes approximately Gaussian if and only if d is much larger than n³. From a statistical perspective this means that analyses based on Gaussian approximation of a Wishart are valid as long as the number of samples is at least the cube of the number of parameters. In the random graph setting this gives a dimension barrier to the extraction of geometric information from a network, as our result shows that all geometry is lost when the dimension of the latent feature space is larger than the cube of the number of vertices.
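To fix ideas, here is a small simulation sketch of the two ensembles being compared. This is only an illustration and not part of the argument; the choice of the uniform distribution for µ and the specific values of n and d are assumptions made purely for the example.

```python
import numpy as np

def wishart_no_diag(n, d, rng):
    # mu = uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance, log-concave.
    X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, d))
    W = X @ X.T / np.sqrt(d)
    np.fill_diagonal(W, 0.0)          # W_{n,d}(mu): Wishart with removed diagonal
    return W

def wigner_no_diag(n, rng):
    # G_n: symmetric, null diagonal, standard Gaussian entries above the diagonal.
    G = np.triu(rng.standard_normal((n, n)), k=1)
    return G + G.T

rng = np.random.default_rng(0)
n = 8
for d in (n**2, n**4):                # one value below and one above the n^3 threshold
    W = wishart_no_diag(n, d, rng)
    off = W[~np.eye(n, dtype=bool)]
    # Off-diagonal entries have mean 0 and variance ~ 1 for every d (see Section 1.1);
    # the theorem concerns the much finer statement that the joint law is close to G_n.
    print(d, off.mean().round(3), off.var().round(3))
```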

1.1 Main result

Writing X_i ∈ R^d for the i-th row of X, one has for i ≠ j, W_{i,j} = (1/√d) ⟨X_i, X_j⟩. In particular EW_{i,j} = 0 and EW_{i,j}W_{ℓ,k} = 1{ (i,j) = (ℓ,k) and i ≠ j }. Thus for fixed n, by the multivariate central limit theorem one has, as d goes to infinity,

W_{n,d}(µ) →_D G_n,

where G_n is the distribution of an n × n Wigner matrix with null diagonal and standard Gaussian entries off the diagonal (recall that a Wigner matrix is symmetric and the entries above the main diagonal are i.i.d.). Recall that the total variation distance between two measures λ, ν is defined as

TV(λ, ν) = sup_A | λ(A) − ν(A) |, where the supremum is over all measurable sets A. Our main result is the following:

Theorem 1 Assuming that µ is log-concave¹ and d/(n³ log⁵(d)) → +∞, one has

TV( W_{n,d}(µ), G_n ) → 0.   (1)

Observe that for (1) to be true one needs some kind of smoothness assumption on µ. Indeed if µ is purely atomic then so is W_{n,d}(µ), and thus its total variation distance to G_n is 1. We also remark that Theorem 1 is tight up to the logarithmic factor, in the sense that if d/n³ → 0, then

TV( W_{n,d}(µ), G_n ) → 1,   (2)

see Section 1.2 below for more details on this result. Finally we note that our proof in fact gives the following quantitative version of (1), where C > 1 is a universal constant,

TV( W_{n,d}(µ), G_n ) ≤ C ( √( n³ log(n) log⁴(d) / d ) + n^{3/2} / √d ).

1.2 Related work and ideas of proof In the case where µ is a standard Gaussian, Theorem 1 (without the logarithmic factor) was re- cently proven simultaneously and independently in Bubeck et al. [2014], Jiang and Li [2013]. We

¹A measure µ with density f is said to be log-concave if f(·) = e^{−ϕ(·)} for some convex function ϕ.

also observe that prior to these results, certain properties of a Gaussian Wishart were already known to behave as those of a Gaussian matrix, and for values of d much smaller than n³; see e.g. Johnstone [2001] for the largest eigenvalue at d ≈ n, and Aubrun et al. [2014] on whether the quantum state represented by the Wishart is separable at d ≈ n^{3/2}. The proof of Theorem 1 for the Gaussian case is simpler, as both measures have a known density with a rather simple form, and one can then explicitly compute the total variation distance as the L1 distance between the densities. We also note that Bubeck et al. [2014] implicitly proves (2) for this Gaussian case. Taking inspiration from the latter work, one can show that in the regime d/n³ → 0, for any µ (zero mean, unit variance and finite fourth moment²), one can distinguish W_{n,d}(µ) and G_n by considering the statistic A ∈ R^{n×n} ↦ Tr(A³). Indeed it turns out that the means of Tr(A³) under the two measures are respectively Θ(n³/√d) and zero, whereas the variances are respectively Θ(n³ + n⁵/d²) and Θ(n³).
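As an illustration of this distinguishing statistic, the following Monte Carlo sketch (the sizes n and d, the number of repetitions, and the Gaussian choice of µ are assumptions made only for the example) estimates the mean and standard deviation of Tr(A³) under both measures; for d well below n³ the Wishart mean of order n³/√d dominates the fluctuations of order √(n³).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, reps = 30, 1000, 200            # d = 1000 is far below n^3 = 27000

def sample_wishart(n, d):
    X = rng.standard_normal((n, d))   # standard Gaussian mu, for illustration only
    W = X @ X.T / np.sqrt(d)
    np.fill_diagonal(W, 0.0)
    return W

def sample_wigner(n):
    G = np.triu(rng.standard_normal((n, n)), k=1)
    return G + G.T

tr3 = lambda A: np.trace(A @ A @ A)
wishart = np.array([tr3(sample_wishart(n, d)) for _ in range(reps)])
wigner = np.array([tr3(sample_wigner(n)) for _ in range(reps)])

print("Wishart: mean %.0f, std %.0f" % (wishart.mean(), wishart.std()))
print("Wigner : mean %.0f, std %.0f" % (wigner.mean(), wigner.std()))
print("predicted mean gap n^3/sqrt(d) = %.0f" % (n**3 / np.sqrt(d)))
```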

Proving normal approximation results without the assumption of independence is a natural question and has been a subject of intense study over many years. One method that has found several applications in such settings is the so-called Stein's method of exchangeable pairs. Since Stein's original work (see Stein [1986]) the method has been considerably generalized to prove error bounds on convergence to the Gaussian distribution in various situations. The multidimensional case was treated first in Chatterjee and Meckes [2007]. For several applications of Stein's method in proving CLTs see Chatterjee [2014] and the references therein. In our setting note that

W = Σ_{i=1}^d ( X_i X_i⊤ − diag( X_i X_i⊤ ) ) / √d,

where the X_i ∈ R^n are the columns of X, i.i.d. vectors whose coordinates are i.i.d. samples from the one-dimensional measure µ. Considering Y_i = X_i X_i⊤ − diag(X_i X_i⊤) as a vector in R^{n²} and noting that |Y_i|³ ∼ n³, a straightforward application of Stein's method using exchangeable pairs (see the proof of [Chatterjee and Meckes, 2007, Theorem 7]) provides the following suboptimal bound: the Wishart ensemble converges to the Gaussian ensemble (convergence of integrals against 'smooth' enough test functions) when d ≫ n⁶. Whether there is a way to use Stein's method to recover Theorem 1 in any reasonable metric (total variation metric, Wasserstein metric, etc.) remains an open problem (see Section 6 for more on this).
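For concreteness, the decomposition of W into the i.i.d. summands Y_i used above can be checked directly in a few lines (the sizes and the Gaussian choice of µ are arbitrary assumptions for this sketch).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 50
X = rng.standard_normal((n, d))             # the columns X_i of X are i.i.d. vectors in R^n

def summand(x):
    # Y_i = X_i X_i^T - diag(X_i X_i^T), an n x n matrix (equivalently a vector in R^{n^2})
    M = np.outer(x, x)
    np.fill_diagonal(M, 0.0)
    return M

W_from_sum = sum(summand(X[:, i]) for i in range(d)) / np.sqrt(d)

W_direct = X @ X.T / np.sqrt(d)
np.fill_diagonal(W_direct, 0.0)
print(np.allclose(W_from_sum, W_direct))    # True: W is a normalized sum of d i.i.d. terms
```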

Our approach to proving (1) is information theoretic and hence completely different from Bubeck et al. [2014], Jiang and Li [2013] (this is a necessity since for a general µ there is no simple expression for the density of W_{n,d}(µ)). The first step in our proof, described in Section 2, is to use Pinsker's inequality to change the focus from total variation distance to the relative entropy (see also Section 2 for definitions). Together with the chain rule for relative entropy this allows us to bound the relative entropy of W_{n,d}(µ) with respect to G_n by induction on the dimension n. The base case essentially follows from the work of Artstein et al. [2004], who proved that the relative entropy between the standard one-dimensional Gaussian and (1/√d) Σ_{i=1}^d x_i, where x_1, ..., x_d ∈ R is an i.i.d. sequence from a log-concave measure µ, goes to 0 at a rate 1/d. One of the main technical contributions of our work is a certain generalization of the latter result to higher dimensions, see Theorem 2 in Section 3. Recently Ball and Nguyen [2012] also studied a high dimensional

2 Note that log-concavity implies exponential tails and hence existence of all moments.

generalization of the result in Ball et al. [2003] (which contains the key elements for the proof in Artstein et al. [2004]), but it seems that Theorem 2 is not comparable to the main theorem in Ball and Nguyen [2012].

Another important part of the induction argument, which is carried out in Section 4, relies on controlling from above the expectation of −logdet( (1/d) XX⊤ ), which should be understood as the relative entropy between a centered Gaussian with covariance given by (1/d) XX⊤ and a standard Gaussian in R^n. This leads us to study the probability that XX⊤ is close to being non-invertible. Denoting by s_min the smallest singular value of X, it suffices to prove a 'good enough' upper bound on P( s_min(X⊤) ≤ ε ) for all small ε. The case when the entries of X are Gaussian allows one to work with exact formulas and was studied in Edelman [1988], Sankar et al. [2006]. The last few years have seen tremendous progress in understanding the universality of the tail behavior of extreme singular values of random matrices with i.i.d. entries from general distributions. See Rudelson and Vershynin [2010] and the references therein for a detailed account of these results. Such estimates are quite delicate, and it is worthwhile to mention that the following estimate was proved only recently in Rudelson and Vershynin [2008]: let A ∈ R^{n×d} (with d ≥ n) be a rectangular matrix with i.i.d. subgaussian entries; then for all ε > 0,

P( s_min(A⊤) ≤ ε ( √d − √(n−1) ) ) ≤ (Cε)^{d−n+1} + c^d,

where c, C are independent of n, d. In full generality, such estimates are essentially sharp, since in the case where the entries are random signs s_min is zero with probability c^d. Unfortunately this type of bound is not useful for us, as we need to control P( s_min(X⊤) ≤ ε ) for arbitrarily small scales ε (indeed −logdet( (1/d) XX⊤ ) would blow up if s_min could be zero with non-zero probability). It turns out that the assumption of log-concavity of the distribution allows us to do that. To this end we use recent advances in Paouris [2012] on small ball probability estimates for such distributions: let Y ∈ R^n be an isotropic centered log-concave random variable and ε ∈ (0, 1/10); then one has P( |Y| ≤ ε√n ) ≤ (Cε)^{√n}. This together with an ε-net argument gives us the required control on P( s_min(X⊤) ≤ ε ).

We conclude the paper with several open problems in Section 6.
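To illustrate the role of the small ball estimate discussed above, here is a quick experiment: for log-concave entries the smallest singular value of X⊤ stays bounded away from zero once d is moderately larger than n, so logdet( (1/d) XX⊤ ) is well behaved. This is a minimal sketch, with an arbitrary centered-exponential choice for the log-concave µ and illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
for d in (2 * n, 10 * n, 100 * n):
    # Centered exponential entries: mean 0, variance 1, log-concave density.
    X = rng.exponential(1.0, size=(n, d)) - 1.0
    s = np.linalg.svd(X / np.sqrt(d), compute_uv=False)
    logdet = 2.0 * np.log(s).sum()          # logdet((1/d) X X^T) = 2 * sum of log singular values
    print(d, round(s.min(), 3), round(-logdet, 3))
# As d grows, the smallest singular value approaches 1 and -logdet((1/d)XX^T) shrinks,
# in line with the bound proved in Section 4.
```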

2 An induction proof via the chain rule for relative entropy

Recall that the (differential) entropy of a measure λ with a density f (all densities are understood with respect to the Lebesgue measure unless stated otherwise) is defined as:

Ent(λ) = Ent(f) = − ∫ f(x) log f(x) dx.

The relative entropy of a measure λ (with density f) with respect to a measure ν (with density g) is defined as

Ent(λ ‖ ν) = ∫ f(x) log( f(x) / g(x) ) dx.

With a slight abuse of notation we sometimes write Ent(Y ‖ ν) where Y is a random variable distributed according to some distribution λ. Pinsker's inequality gives:

TV( W_{n,d}(µ), G_n )² ≤ (1/2) Ent( W_{n,d}(µ) ‖ G_n ).

Next recall that the chain rule for relative entropy states that for any random variables Y₁, Y₂, Z₁, Z₂,

Ent( (Y₁, Y₂) ‖ (Z₁, Z₂) ) = Ent( Y₁ ‖ Z₁ ) + E_{y∼λ₁} Ent( Y₂ | Y₁ = y ‖ Z₂ | Z₁ = y ),

where λ₁ is the (marginal) distribution of Y₁, and Y₂ | Y₁ = y is used to denote the distribution of Y₂ conditionally on the event Y₁ = y (and similarly for Z₂ | Z₁ = y). Also observe that a sample from W_{n+1,d}(µ) can be obtained by adjoining to ( XX⊤ − diag(XX⊤) ) / √d (whose distribution is W_{n,d}(µ)) the column vector Xx/√d (and the row vector (Xx)⊤/√d), where x ∈ R^d has i.i.d. entries from µ. Thus denoting γ_n for the standard Gaussian measure in R^n we obtain

Ent( W_{n+1,d}(µ) ‖ G_{n+1} ) = Ent( W_{n,d}(µ) ‖ G_n ) + E_X Ent( Xx/√d | XX⊤ ‖ γ_n ).   (3)

By convexity of the relative entropy (see e.g., Cover and Thomas [1991]) one also has:

E_X Ent( Xx/√d | XX⊤ ‖ γ_n ) ≤ E_X Ent( Xx/√d | X ‖ γ_n ).   (4)

Next we need a simple lemma to rewrite the above term:

Lemma 1 Let A ∈ R^{n×d} and Q ∈ R^{n×n} be such that QAA⊤Q⊤ = I_n. Then one has, for any isotropic random variable X ∈ R^d,

Ent( AX ‖ γ_n ) = Ent( QAX ‖ γ_n ) + (1/2) Tr(AA⊤) − n/2 + logdet(Q).

Proof  Denote Φ_Σ for the density of a centered Gaussian with covariance matrix Σ, and let G ∼ γ_n. Also let f be the density of QAX. Then one has:

Ent( AX ‖ G ) = Ent( QAX ‖ QG )
 = ∫ f(x) log( f(x) / Φ_{QQ⊤}(x) ) dx
 = ∫ f(x) log( f(x) / Φ_{I_n}(x) ) dx + ∫ f(x) log( Φ_{I_n}(x) / Φ_{QQ⊤}(x) ) dx
 = Ent( QAX ‖ G ) + ∫ f(x) ( (1/2) x⊤(QQ⊤)^{-1}x − (1/2) x⊤x ) dx + logdet(Q)
 = Ent( QAX ‖ G ) + (1/2) Tr( (QQ⊤)^{-1} ) − n/2 + logdet(Q),

where for the last equality we used the fact that QAX is isotropic, that is ∫ f(x) x x⊤ dx = I_n. Finally it only remains to observe that Tr( (QQ⊤)^{-1} ) = Tr(AA⊤).

Combining (3) and (4) with Lemma 1 (noting that one can take Q = ( (1/d) XX⊤ )^{-1/2}), and using that E Tr( (1/d) XX⊤ ) = n, one obtains

Ent( W_{n+1,d}(µ) ‖ G_{n+1} )
 ≤ Ent( W_{n,d}(µ) ‖ G_n ) + E_X Ent( (XX⊤)^{-1/2} Xx | X ‖ γ_n ) − (1/2) E_X logdet( (1/d) XX⊤ ).   (5)

In Section 3 we show how to bound the term Ent( AX ‖ γ_n ) where A ∈ R^{n×d} has orthonormal rows (i.e., AA⊤ = I_n), and then in Section 4 we deal with the term −E_X logdet( (1/d) XX⊤ ).
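Lemma 1 can be sanity-checked numerically in the Gaussian case, where every quantity has a closed form (the sizes below and the use of a Gaussian X are assumptions made only for this check).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 12
A = rng.standard_normal((n, d)) / np.sqrt(d)      # a generic n x d matrix
S = A @ A.T                                       # covariance of AX when X ~ N(0, I_d)
Q = np.linalg.inv(np.linalg.cholesky(S))          # then Q A A^T Q^T = I_n

# For X ~ N(0, I_d): Ent(AX || gamma_n) = KL(N(0,S) || N(0,I_n)) has a closed form,
# and QAX ~ N(0, I_n), so Ent(QAX || gamma_n) = 0.
lhs = 0.5 * (np.trace(S) - n - np.linalg.slogdet(S)[1])
rhs = 0.0 + 0.5 * np.trace(A @ A.T) - n / 2 + np.linalg.slogdet(Q)[1]
print(np.isclose(lhs, rhs))                       # True: both sides of Lemma 1 agree
```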

3 A high dimensional entropic CLT

The main goal of this section is to prove the following high dimensional generalization of the entropic CLT of Artstein et al. [2004].

Theorem 2 Let Y ∈ R^d be a random vector with i.i.d. entries from a distribution ν with zero mean, unit variance, and spectral gap³ c ∈ (0, 1]. Let A ∈ R^{n×d} be a matrix such that AA⊤ = I_n. Let ε = max_{i∈[d]} (A⊤A)_{i,i} and ζ = max_{i,j∈[d], i≠j} |(A⊤A)_{i,j}|. Then one has

Ent( AY ‖ γ_n ) ≤ n min( 2(ε + ζ²d)/c, 1 ) Ent( ν ‖ γ₁ ).

Note that the assumption AA⊤ = I_n implies that the rows of A form an orthonormal system. In particular, if A is built by picking rows one after the other uniformly at random on the Euclidean sphere in R^d, conditionally on being orthogonal to the previous rows, then one expects that ε ≃ n/d and ζ ≃ √n/d. Theorem 2 then yields Ent( AY ‖ γ_n ) ≲ n²/d. Thus we already see appearing the term n³/d from Theorem 1, as we will sum the latter bound over the n rounds of induction (see Section 2). We also note that for the special case n = 1, Theorem 2 is slightly weaker than the result of Artstein et al. [2004], which involves the ℓ₄-norm of A. Sections 3.1 and 3.2 are dedicated to the proof of Theorem 2. Then in Section 3.3 we show how to apply this result to bound the term E_X Ent( QXx/√d | X ‖ γ_n ) from Section 2.

3.1 From entropy to Fisher information

For a density function w : R^n → R_+, we denote J(w) = ∫ |∇w(x)|²/w(x) dx for its Fisher information, and I(w) = ∫ ∇w(x)∇w(x)⊤/w(x) dx for the Fisher information matrix (if ν denotes the measure whose density is w, we may also write J(ν) instead of J(w)). Also denote P_t for the Ornstein-Uhlenbeck semigroup, that is, with G ∼ γ_n one has, for a random variable Z with density g,

P_t Z = exp(−t) Z + √( 1 − exp(−2t) ) G,

and P_t g is the density of P_t Z. The de Bruijn identity states that the Fisher information is the time derivative of the entropy along the Ornstein-Uhlenbeck semigroup; more precisely one has:

Ent( w ‖ γ_n ) = Ent(γ_n) − Ent(w) = ∫_0^∞ ( J(P_t w) − n ) dt.

Our objective is to prove a bound of the form (for some constant C depending on A)

Ent( AY ‖ γ_n ) ≤ C Ent( ν ‖ γ₁ ),   (6)

and thus, given the above identity, it suffices to show that for any t > 0,

J(h_t) − n ≤ C ( J(ν_t) − 1 ),   (7)

³A probability measure µ is said to have spectral gap c if for all smooth functions g with E_µ(g) = 0, we have E_µ(g²) ≤ (1/c) E_µ(g′²).

where h_t is the density of P_t AY (which is equal to the density of A P_t Y) and ν_t is such that P_t Y has distribution ν_t^{⊗d}. Furthermore, if e₁, ..., e_n denotes the canonical basis of R^n, then to prove (7) it is enough to show that for any i ∈ [n],

e_i⊤ I(h_t) e_i − 1 ≤ C_i ( J(ν_t) − 1 ),   (8)

where Σ_{i=1}^n C_i = C. We will show that one can take

C_i = 1 − c U_i² / ( c W_i + 2 V_i ),

d d where we denote B = A⊤A R × , and ∈ d d U = A2 (1 B ), W = A2 (1 B )2, V = (A B )2. i i,j − j,j i i,j − j,j i i,j j,k j=1 j=1 j,k [d],k=j X X ∈X 6 2 Straightforward calculations (using that Ui 1 ε, Wi 1, and Vi ζ d) show that one has 2 ≥ − ≤ ≤ n cUi 2 1 2n(ε + ζ d)/c where ε = maxi [d] Bi,i and ζ = maxi,j [d],i=j Bi,j , thus i=1 − cWi+2Vi ≤ ∈ ∈ 6 | | concluding the proof of Theorem 2. P   In the next subsection we prove (8) for a given t > 0 and i = 1. We use the following well known but crucial fact: the spectral gap of νt is in [c, 1] (see [Proposition 1, Ball et al. [2003]]). Denoting f for the density of ν , one has with ϕ = log f that J := J(ν ) = ϕ′′(x)dµ(x). t − t The last equality easily follows from the fact that for any t > 0 one has f ′′ = 0 (which itself R follows from the smoothness of ν induced by the convolution of ν with a Gaussian). t R 3.2 Variational representation of Fisher information Let Z Rd be a random variable with a twice continuously differentiable density w such that 2 w ∈ 2 n |∇ | R w < and w < , and let h the density of AZ . Our main tool is a remarkable formula from∞ Ball etk∇ al. [2003k ∞], which states the following:∈ for all e Rn and all sufficiently R R smooth map p : Rd Rd with Ap(x) = e, x Rd, one has (with Dp∈ denoting the Jacobian → ∀ ∈ matrix of p),

e⊤ I(h) e ≤ ∫ ( Tr( Dp(x)² ) + p(x)⊤ ( ∇²( −log w(x) ) ) p(x) ) w(x) dx.   (9)

For sake of completeness we include a short proof of this inequality in Section 5. Let (a₁, ..., a_d) be the first row of A. Following Artstein et al. [2004], to prove (7) we would like to use the above formula⁴ with p of the form (a₁ r(x₁), ..., a_d r(x_d)) for some map r : R → R. Since we need to satisfy Ap(x) = e₁, we adjust the formula accordingly and take

p(x) = ( I_d − A⊤A ) ( a₁ r(x₁), ..., a_d r(x_d) )⊤ + A⊤ e₁.

⁴Note that the smoothness assumptions on w are satisfied in our context since we consider a random variable convolved with a Gaussian.


p (x)= a + a (1 B )r(x ) B a r(x ), i i i − i,i i − i,j j j j [d],j=i ∈X6 and ∂pi a (1 B )r′(x ) if i = j (x)= i − i,i i ∂x Bi,jajr′(xj) otherwise. j  − d Next recall that we apply (9) to prove (8) where w(x)= i=1 f(xi), in which case we have (recall also the notation ϕ = log f): − Q d 2 2 p(x)⊤ ( log w(x))p(x)= p (x) ϕ′′(x ) ∇ − i i i=1 X 2 d = ϕ′′(x ) a + a (1 B )r(x ) B a r(x ) . i  i i − i,i i − i,j j j  i=1 j [d],j=i X ∈X6 We also have   d 2 2 2 2 2 Tr(Dp(x) )= a (1 B ) r′(x ) + B a a r′(x )r′(x ). i − i,i i i,j i j i j i=1 i,j [d],i=j X ∈X6 Putting the above together we obtain (with a slightly lengthy straightforward computation) that 2 e1⊤I(h)e1 is upper bounded by (recall also that i ai =1 and j Bi,jaj = ai since BA⊤ = A⊤) P P 2 2 2 2 J + W f(r′) + fϕ′′r + JV fr + J(W V ) fr (10) − Z Z  Z Z 2 +2U fϕ′′r J fr 2W fr fϕ′′r + M fr′ − − Z Z  Z  Z  Z  where d d U = a2(1 B ), W = a2(1 B )2, V = (B a )2, M = B2 a a . i − ii i − ii i,j j i,j i j i=1 i=1 i,j [d],i=j i,j [d],i=j X X ∈X6 ∈X6 Observe that by Cauchy-Schwarz inequality one has M V , and furthermore following Artstein et al. [2004] one also has with m = fr, ≤

( ∫ f r′ )² = ( ∫ f′ (r − m) )² = ( ∫ (f′/√f) √f (r − m) )² ≤ J ( ∫ f r² − m² ).

Thus we get from (10) and the above observations that e₁⊤ I(h_t) e₁ − J ≤ T(r), where

T(r) = W ( ∫ f (r′)² + ∫ f ϕ′′ r² ) + 2JV ∫ f r² + J(W − 2V) ( ∫ f r )²
 + 2U ( ∫ f ϕ′′ r − J ∫ f r ) − 2W ∫ f r ∫ f ϕ′′ r,

which is exactly the same quantity as the one obtained in Artstein et al. [2004]. The goal now is to optimize over r to make this quantity as negative as possible. Solving the above optimization problem is exactly the content of [Artstein et al., 2004, Section 2.4], and it yields the following bound:

e₁⊤ I(h_t) e₁ − 1 ≤ ( 1 − c U² / ( c W + 2V ) ) ( J − 1 ),

which is exactly the claimed bound in (8).

3.3 Using Theorem 2

Given (5) we want to apply Theorem 2 with A = (XX⊤)^{-1/2} X (also observe that the spectral gap assumption of Theorem 2 is satisfied since log-concavity and isotropy of µ imply that µ has a spectral gap in [1/12, 1], Bobkov [1999]). In particular we have A⊤A = X⊤ (XX⊤)^{-1} X, and thus, denoting X_i ∈ R^n for the i-th column of X, one has for any i, j ∈ [d],

(A⊤A)_{i,j} = X_i⊤ (XX⊤)^{-1} X_j = (1/d) X_i⊤ ( (1/d) XX⊤ )^{-1} X_j.

In particular this yields:

| (A⊤A)_{i,j} | ≤ (1/d) | X_i⊤ X_j | + (1/d) |X_i| · |X_j| · ‖ ( (1/d) XX⊤ )^{-1} − I_n ‖.

We now recall two important results on log-concave random vectors.⁵ First, Paouris' inequality Paouris [2012] states that for an isotropic, centered, log-concave random variable Y ∈ R^n one has for any t ≥ C,

P( |Y| ≥ (1 + t) √n ) ≤ exp( −c t √n ),

where c, C are universal constants. We also need an inequality proved by Adamczak, Litvak, Pajor and Tomczak-Jaegermann Adamczak et al. [2010], which states that for a sequence Y₁, ..., Y_d ∈ R^n of i.i.d. copies of Y, one has for any t ≥ 1 and ε ∈ (0, 1),

P( ‖ (1/d) Σ_{i=1}^d Y_i Y_i⊤ − I_n ‖ > ε ) ≤ exp( −c t √n ),

provided that d ≥ C (t⁴/ε²) log²( t⁴/ε² ) n.

Paouris' inequality directly yields that for any i ∈ [d], with probability at least 1 − δ, one has

|X_i| ≤ √n + (1/c) log(1/δ).

Furthermore, by Prekopa-Leindler, conditionally on X_j one has for i ≠ j that X_i⊤ X_j / |X_j| is a centered, isotropic, log-concave random variable. In particular, using again Paouris' inequality and independence of X_i and X_j, one obtains that for i ≠ j, with probability at least 1 − δ,

| X_i⊤ X_j | ≤ |X_j| ( 1 + (1/c) log(1/δ) ).

⁵We note that more classical inequalities could also be used here since the entries of X are independent. This would slightly improve the logarithmic factors but it would obscure the main message of this section, so we decided to use the more general inequalities for log-concave vectors.

Finally, the inequality from Adamczak, Litvak, Pajor and Tomczak-Jaegermann yields that with probability at least 1 − δ,

‖ (1/d) XX⊤ − I_n ‖ ≤ C′ √( n log(n) / d ) log²(1/δ).   (11)

Also note that if ‖A − I_n‖ ≤ ε < 1 then ‖A^{-1} − I_n‖ ≤ ε/(1 − ε). From now on C denotes a universal constant whose value can change at each occurrence. Putting together all of the above with a union bound, we obtain for d ≥ C n² that, with probability at least 1 − 1/d, for all i ≠ j,

|X_i| ≤ C ( √n + log(d) ),
| X_i⊤ X_j | ≤ C ( √n log(d) + log²(d) ),
‖ ( (1/d) XX⊤ )^{-1} − I_n ‖ ≤ C √( n log(n) / d ) log²(d) ≤ C √( log(n) / n ) log²(d).

This yields that, for i ≠ j,

| (A⊤A)_{i,j} | ≤ C ( √( n log(n) ) + log²(d) ) log²(d) / d,

and

| (A⊤A)_{i,i} | ≤ C ( n + log²(d) ) / d.

Thus, denoting ε = max_{i∈[d]} (A⊤A)_{i,i} and ζ = max_{i,j∈[d], i≠j} |(A⊤A)_{i,j}|, one has:

E min( ε + ζ²d, 1 ) ≤ C ( n log(n) + log⁴(d) ) log⁴(d) / d.

4 Small ball probability estimates

The goal of this section is to upper bound E[ −logdet( (1/d) XX⊤ ) ].

Lemma 2

E[ −logdet( (1/d) XX⊤ ) ] ≤ C ( √(n/d) + n²/d ).   (12)

Proof  We decompose this expectation on the event (and its complement) that the smallest eigenvalue λ_min of (1/d) XX⊤ is less than 1/2. We first write, using −log(x) ≤ 1 − x + (1 − x)² for x ≥ 1/2,

E[ −logdet( (1/d) XX⊤ ) 1{λ_min ≥ 1/2} ] ≤ E[ Tr( I_n − (1/d) XX⊤ ) + ‖ I_n − (1/d) XX⊤ ‖²_HS ].

Denote ζ for the 4th moment of µ. Then one has (recall that X_i ∈ R^d denotes the i-th row of X)

E[ Tr( I_n − (1/d) XX⊤ ) ] ≤ √( E[ Tr( I_n − (1/d) XX⊤ )² ] ) = √( E[ ( Σ_{i=1}^n ( 1 − |X_i|²/d ) )² ] ) = √( (ζ − 1) n/d ).

Similarly one can easily check that

E ‖ I_n − (1/d) XX⊤ ‖²_HS = (1/d²) E[ Σ_{i,j=1}^n ( ⟨X_i, X_j⟩ − d·1{i=j} )² ] = (n² − n)/d + (ζ − 1) n/d ≤ ζ n²/d.

Next note that by log-concavity of µ one has ζ ≤ 70, and thus we proved (for some universal constant C > 0):

1 n n2 E logdet XX⊤ 1 λ 1/2 C + . (13) − d { min ≥ } ≤ d d   r   We now take care of the integral on the event λ < 1/2 . First observe that the inequality (11) { min } from Adamczak, Litvak, Pajor and Tomczak-Jaegermann gives for d C, ≥ P(λ < 1/2) exp( d1/10). min ≤ − In particular we have for any ξ (0, 1): ∈ 1 E logdet XX⊤ 1 λ < 1/2 nE ( log(λ )1 λ < 1/2 ) − d { min } ≤ − min { min }    ∞ = n P( log(λ ) t)dt − min ≥ Zlog(2) 1/2 1 = n P(λ

Thus it remains to control P(λmin M) 1 ≥ ≤ exp( cM) where λ is the largest eigenvalue of XX⊤. Recall that − max d XX n 1 ⊤ n 1 P(λ 1/√s). min ≤ s | | max   We now use the Paouris small ball probability bound Paouris [2012] which states that for an isotropic centered log-concave random variable Y Rd, and any ε (0, 1/10), one has ∈ ∈ P( Y ε√d) (cε)√d. | | ≤ ≤ 11 Thus we obtain for d Cn2 ≥ P(λ

1 1/20 E logdet XX⊤ 1 λ n exp( d ), − d { min} ≤ −    and thus together with (13) it yields (12).

5 Proof of (9)

Recall that Z ∈ R^d is a random variable with a twice continuously differentiable density w such that ∫ |∇w|²/w < ∞ and ∫ ‖∇²w‖ < ∞, h is the density of AZ ∈ R^n (with AA⊤ = I_n), and also we fix e ∈ R^n and a sufficiently smooth map⁶ p : R^d → R^d with Ap(x) = e, ∀x ∈ R^d. We want to prove:

e⊤ I(h) e ≤ ∫_{R^d} ( Tr( Dp² ) + p⊤ ( ∇²( −log w ) ) p ) w.   (15)

First we rewrite the right hand side in (15) as follows:

∫_{R^d} ( Tr( Dp² ) + p⊤ ( ∇²( −log w ) ) p ) w = ∫_{R^d} ( ∇·(pw) )² / w.

The above identity is a straightforward calculation (with several applications of one-dimensional integration by parts, which are justified by the assumptions on p and w); see Ball et al. [2003] for more details. Now we rewrite the left hand side of (15). Using the notation g_x for the partial derivative of a function g in the direction x, we have

e⊤ I(h) e = ∫_{R^n} h_e² / h.

Next observe that for any x ∈ R^n one can write h(x) = ∫_{E^⊥} w( A⊤x + · ), where E ⊂ R^d is the n-dimensional subspace generated by the orthonormal rows of A, and thus, thanks to the assumptions on w, one has:

h_e(x) = ∫_{A⊤x + E^⊥} w_{A⊤e} = ∫_{A⊤x + E^⊥} ∇·( (A⊤e) w ).

The key step is now to remark that the condition ∀x, Ap(x) = e exactly means that the projection of p on E is A⊤e, and thus by the Divergence Theorem one has

∫_{A⊤x + E^⊥} ∇·( (A⊤e) w ) = ∫_{A⊤x + E^⊥} ∇·( p w ).

⁶For instance it is enough that p is twice continuously differentiable, and that the coordinate functions p_i and their derivatives ∂p_i/∂x_i, ∂p_i/∂x_j, ∂²p_i/∂x_i∂x_j are bounded.

The proof is concluded with a simple Cauchy-Schwarz inequality:

e⊤ I(h) e = ∫_{R^n} ( ∫_{A⊤x + E^⊥} ∇·(pw) )² / ( ∫_{A⊤x + E^⊥} w ) ≤ ∫_{R^n} ∫_{A⊤x + E^⊥} ( ∇·(pw) )² / w = ∫_{R^d} ( ∇·(pw) )² / w.

6 Open problems

This work leaves many questions open. A basic question is whether one could get away with less independence assumption on the matrix X. Indeed several of the estimates in Section 3 and Section 4 would work under the assumption that the rows (or the columns) of X are i.i.d. from a log-concave distribution in Rd (or Rn). However it seems that the core of the proof, namely the induction argument from Section 2, breaks without the independence assumption for the entries of X. Thus it remains open whether Theorem 1 is true with only row (or column) independence for X. We note that the case of row independence is probably much harder than column independence.

As we observed in Section 1.2, a natural alternative route to prove Theorem 1 (or possibly a variant of it with a different metric) would be to use Stein’s method. A straightforward application of existing results yield the suboptimal dimension dependency d n6 for convergence, and it is ≫ an intriguing open problem whether the optimal rate d n3 can be obtained with Stein’s method. ≫ In this paper we consider Wishart matrices with zeroed out diagonal elements in order to avoid further technical difficulties (also for many applications -such as the random geometric graph example- the diagonal elements do not contain relevant information). We believe that Theorem 1 remains true with the diagonal included (given an appropriate modification of the Gaussian en- semble). The main difficult is that in the chain rule argument one will have to deal with the law of the diagonal elements conditionally on the other entries. We leave this to further works, but we note that when µ is the standard Gaussian it is easy to conclude the calculations with these conditional laws.

In Eldan [2015] it is proven that when µ is a standard Gaussian and d/n → +∞, one has TV( W_{n,d}(µ), W_{n,d+1}(µ) ) → 0. It seems conceivable that the techniques developed in this paper could be useful to prove such a result for a more general class of distributions µ. However a major obstacle is that the tools from Section 3 are strongly tied to measuring the relative entropy with respect to a standard Gaussian (because it maximizes the entropy), and it is not at all clear how to adapt this part of the proof.

Finally one may be interested in understanding CLTs of the form (1) for higher-order interactions. More precisely, recall that by denoting X_i for the i-th column of X one can write XX⊤ = Σ_{i=1}^d X_i ⊗ X_i = Σ_{i=1}^d X_i^{⊗2}. For p ∈ N we may now consider the distribution W^{(p)}_{n,d} of (1/√d) Σ_{i=1}^d X_i^{⊗p} (for sake of consistency we should remove the non-principal terms in this tensor). The measures W^{(p)}_{n,d} have recently gained interest in the machine learning community, see Anandkumar et al. [2014]. It would be interesting to see if the method described in this paper can be used to understand how large d needs to be as a function of n and p so that W^{(p)}_{n,d} is close to being a Gaussian distribution.

Acknowledgements

The authors thank for some useful discussions. This work was completed while S.G. was an intern at Microsoft Research in Redmond. He thanks the Theory group for its hospitality.

References

Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantita- tive estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15: 2773–2832, 2014.

Shiri Artstein, Keith M Ball, Franck Barthe, and Assaf Naor. On the rate of convergence in the entropic central limit theorem. Probability theory and related fields, 129(3):381–390, 2004.

Guillaume Aubrun, Stanislaw J Szarek, and Deping Ye. Entanglement thresholds for random induced states. Communications on Pure and Applied Mathematics, 67(1):129–171, 2014.

Keith Ball and Van Hoang Nguyen. Entropy jumps for log-concave isotropic random vectors and spectral gap. Studia Math., 213(1):81–96, 2012.

Keith Ball, Franck Barthe, and Assaf Naor. Entropy jumps in the presence of a spectral gap. Duke Mathematical Journal, 119(1):41–63, 2003.

Sergey Bobkov. Isoperimetric and analytic inequalities for log-concave probability measures. An- nals of Probability, 27(4):1903–1921, 1999.

Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z. Rácz. Testing for high-dimensional geometry in random graphs. arXiv preprint arXiv:1411.5713, 2014. To appear in Random Structures and Algorithms.

Sourav Chatterjee. A short survey of Stein's method. arXiv preprint arXiv:1404.1392, 2014.

Sourav Chatterjee and Elizabeth Meckes. Multivariate normal approximation using exchangeable pairs. arXiv preprint math/0701464, 2007.

Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, 1991.

Alan Edelman. Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4):543–560, 1988.

Ronen Eldan. An efficiency upper bound for inverse covariance estimation. Journal of Mathematics, 207(1):1–9, 2015.

Tiefeng Jiang and Danning Li. Approximation of rectangular beta-Laguerre ensembles and large deviations. Journal of Theoretical Probability, pages 1–44, 2013.

Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.

Grigoris Paouris. Small ball probability estimates for log-concave measures. Transactions of the American Mathematical Society, 364(1):287–308, 2012.

Mark Rudelson and Roman Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.

Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme sin- gular values. arXiv preprint arXiv:1003.2990, 2010.

Arvind Sankar, Daniel A Spielman, and Shang-Hua Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications,28 (2):446–476, 2006.

Charles Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7: i–164, 1986.
