On Sketching Matrix Norms and the Top Singular Vector

Yi Li Huy L. Nguy˜ên University of Michigan, Ann Arbor Princeton University [email protected] [email protected] David P. Woodruff IBM Research, Almaden [email protected]

Abstract 1 Introduction Sketching is a prominent algorithmic tool for processing Sketching is an algorithmic tool for handling big data. A large data. In this paper, we study the problem of sketching matrix norms. We consider two sketching models. The first linear skech is a distribution over k × n matrices S, k  is bilinear sketching, in which there is a distribution over n, so that for any fixed vector v, one can approximate a pairs of r × n matrices S and n × s matrices T such that for function f(v) from Sv with high probability. The goal any fixed n×n matrix A, from S ·A·T one can approximate is to minimize the dimension k. kAkp up to an approximation factor α ≥ 1 with constant probability, where kAkp is a matrix . The second is The sketch Sv thus provides a compression of v, and general linear sketching, in which there is a distribution over is useful for compressed sensing and for achieving low- n2 k linear maps L : R → R , such that for any fixed n × n communication protocols in distributed models. One n2 matrix A, interpreting it as a vector in R , from L(A) one can perform complex procedures on the sketch which can approximate kAkp up to a factor α. We study some of the most frequently occurring matrix would be too expensive to perform on v itself, and this norms, which correspond to Schatten p-norms for p ∈ has led to the fastest known approximation algorithms {0, 1, 2, ∞}. The p-th Schatten norm of a rank-r matrix for fundamental problems in numerical linear algebra Pr p 1/p A is defined to be kAkp = ( i=1 σi ) , where σ1, . . . , σr and nearest neighbor search. Sketching is also the only are the singular values of A. When p = 0, kAk0 is defined to technique known for processing data in the turnstile be the rank of A. The cases p = 1, 2, and ∞ correspond to the trace, Frobenius, and operator norms, respectively. For model of data streams [22, 38], in which there is an bilinear sketches we show: underlying vector v initialized to 0n which undergoes a long sequence of additive positive and negative updates 2 4 1. For p = ∞ any sketch must have r · s = Ω(n /α ) of the form vi ← vi + ∆ to its coordinates vi. Given dimensions. This matches an upper bound of Andoni and Nguyen (SODA, 2013), and implies one cannot an update of the form vi ← vi + ∆, resulting in a new 0 approximate the top right singular vector v of A by vector v = v + ∆ · ei, we can add S(∆ · ei) to Sv to 0 0 1 2 0 a vector v with kv − vk2 ≤ 2 with r · s =o ˜(n ). obtain Lv (here ei is the i-th standard unit vector). A well-studied problem in the sketching litera- 2. For p ∈ {0, 1} and constant α, any sketch must have ture is approximating the frequency moments F (v), r · s ≥ n1− dimensions, for arbitrarily small constant p  > 0. which is equivalent to estimating the p-norms kvkp = Pn p 1/p ( i=1 |vi| ) , for p ∈ [0, ∞]. This problem was in- 3. For even integers p ≥ 2, we give a sketch with r · troduced by Alon, Matias, and Szegedy [1], and many 2−4/p −2 s = O(n  ) dimensions for obtaining a (1 + )- insights in information complexity [6, 9] and sketching approximation. This is optimal up to logarithmic factors, and is the first general subquadratic upper [1, 3, 7, 24] were made on the path to obtaining optimal bound for sketching the Schatten norms. bounds. In addition to being of theoretical interest, the problems have several applications. 
The value kvk0, by For general linear sketches our results, though not optimal, continuity, is equal to the support size of x, also known are qualitatively similar, showing that for p = ∞, k = as the number of distinct elements [16, 17]. The norm 3/2 4 √ Ω(n /α ) and for p ∈ {0, 1}, k = Ω( n). These give kvk1 is the Manhattan norm, which is a robust measure separations in the sketching complexity of Schatten-p norms of distance and is proportional to the variation distance with the corresponding vector p-norms, and rule out a table between distributions [19, 21, 28]. The Euclidean dis- lookup nearest-neighbor search for p = 1, making progress tance kvk2 is important in linear algebra problems [44], on a question of Andoni. and corresponds to the self-join size in databases [1]. Often one wishes to find or approximate the largest co- a factor α with constant probability. The goal is to ordinates of x, known as the heavy hitters [10, 13], and minimize k. kvk∞ is defined, by continuity, to equal maxi |vi|. Previous Results. Somewhat surprisingly the In this paper, we are interested in the analogous only known o(n2) upper bound for either model is problem of sketching matrix norms. We study some for p = 2, in which case one can achieve a bilinear of the most frequently occurring matrix norms, which sketch with r · s = O(1) [23]. Moreover, the only correspond to Schatten p-norms for p ∈ {0, 1, 2, ∞}. lower bounds known were those for estimating the p- The p-th Schatten norm of an n × n rank-r matrix A norm of a vector v, obtained for p > 2 by setting Pr p 1/p 1−2/p is defined to be kAkp = ( i=1 σi ) , where σ1, . . . , σr A = diag(v) and are of the form k = Ω(n log n) are the singular values of A. When p = 0, kAk0 is [5, 34, 40]. We note that the bit complexity lower defined to be the rank of A. The cases p = 1, 2, and ∞ bounds of [6, 9, 43] do not apply, since a single linear correspond to the trace norm, the Frobenius norm, and sketch (1, 1/M, 1/M 2, 1/M 3,..., 1/M n−1)T v is enough the operator norm, respectively. These problems have to recover v if its entries are bounded by M. The sketch- found applications in several areas; we refer the reader ing model thus gives a meaningful measure of complex- to [11] for graph applications for p = 0, to differential ity in the real RAM model. privacy [20, 33] and non-convex optimization [8, 15] for Thus, it was not even known if a sketching dimen- p = 1, and to the survey on numerical linear algebra for sion of r · s = O(1) was sufficient for bilinear sketches p ∈ {2, ∞} [35]. to obtain a constant-factor approximation to the rank In nearest neighbor search (NNS), one technique of- or Schatten 1-norm, or if k = Ω(n2) was required for ten used is to first replace each of the input objects general linear sketches. (points, images, matrices, etc.) with a small sketch, Our Results. We summarize our results for the then build a lookup table for all possible sketches to two sketching models in Table 1. We note that, prior support fast query time. In his talks at the Barriers to our work, for all p∈ / {2, ∞}, all upper bounds in in Computational Complexity II workshop and Math- the table were a trivial O(n2) while all lower bounds for ematical Foundations of Computer Science conference, p ≤ 2 were a trivial Ω(1), while for p > 2 they were a Alexandr Andoni states that a goal would be to design weaker Ω(n1−2/p log n). a NNS data structure for the Schatten norms, e.g., the For the bilinear sketching model, we have the trace or Schatten 1-norm (slide 31 of [2]). 
If a sketch following results. For the spectral norm, p = ∞, we for a norm has small size, then building a table lookup prove an Ω(n2/α4) bound for achieving a factor α- is feasible. approximation with constant probability, matching an Sketching Model. We give the first formal study upper bound achievable by an algorithm of [4]. This of the sketching complexity of the Schatten-p norms. generalizes to Schatten-p norms for p > 2, for which we A first question is what does it mean to have a linear prove an Ω(n2−4/p) lower bound, and give a matching sketch of a matrix, instead of a vector as is typically O(n2−4/p) upper bound for even integers p. For odd studied. We consider two sketching models. integers p we are only able to achieve this upper bound The first model we consider is bilinear sketching, in if we additionally assume that A is positive semi-definite which there is a distribution over pairs of r ×n matrices (PSD). For the rank, p = 0, we prove an Ω(n2) lower S and n × s matrices T such that for any fixed n × n bound, showing that no non-trivial sketching is possible. 1−ε matrix A, from S · A · T one can approximate kAkp Finally for p = 1, we prove an n lower bound for up to an approximation factor α ≥ 1 with constant arbitrarily small constant ε > 0. Note that our bounds probability, where kAkp is a . The goal are optimal in several cases, e.g., for p = ∞, for even is to minimize r · s. This model has been used in several integers p > 2, and for p = 0. streaming papers for sketching matrices [4, 14, 23], and For the general sketching model, we show the fol- as far as we are aware, all known sketches in numerical lowing. For the spectral norm, our bound is Ω(n3/2/α3), linear algebra applications have this form. It also has which although is a bit weaker than the bilinear case, the advantage that SAT can be computed quickly if S is super-linear in n. It implies, for example, that any and T have fast matrix multiplication algorithms. general sketching algorithm for approximating the top The second model is more general, which we dub right (or left) singular vector v of A by a vector v0 with 0 1 3/2 general linear sketching, and interprets the n×n matrix kv − v k2 ≤ requires k = Ω(n ). Indeed, other- 2 2 A as a vector in n . The goal is then to design a wise, with a sketch of O(n) dimensions one could store R 2 distribution over linear maps L : Rn → Rk, such that gA, for a random Gaussian vector g, and together with 0 for any fixed n × n matrix A, interpreting it as a vector v obtain a constant approximation to kAk∞, which n2 in R , from L(A) one can approximate kAkp up to our lower bound rules out. This bound naturally gen- Bilinear sketches General Sketches Schatten p-norm Lower Bound Upper Bound Lower Bound Upper Bound p = ∞ Ω(n2/α4) O(n2/α4) [4] Ω(n3/2/α3) O(n2/α4) [4] p > 2 Ω(n2−4/p) O(n2−4/p) even p Ω(n(3/2)(1−2/p)) O(n2−4/p) even p 2 2 √ 2 p = 0 Ω(n ) O(n ) Ω(√n) O(n ) p = 1 n1−ε for any ε > 0 O(n2) Ω( n) O(n2)

Table 1: Our results for approximating the Schatten-p norm up to a constant factor (except for the p = ∞ case) with constant probability in both the bilinear sketching and general sketching models. For the bilinear case, we look at the minimal r · s value, while for general linear sketches we look at the minimal value of k. For p = ∞, α ≥ 1 is the desired approximation factor.

(3/2)(1−2/p) eralizes to an Ω(n ) bound for√ p > 2. For bound for p > 2, we propose the following rotationally- p ∈ {0, 1} our bound is now a weaker Ω( n). However, invariant distributions with constant factor gap in it is the first super-constant lower bound for rank and Schatten norm: Schatten-1 norm, which in particular rules out a naïve T table lookup solution to the NNS problem, addressing • For p = 0, L1 = UV for n × n/2 i.i.d. Gaussian T a question of Andoni. U and V , while L2 = UV + γG for the same U Our Techniques for Bilinear Sketches. A stan- and V and G an n × n i.i.d. Gaussian matrix with dard technique in proving lower bounds is Yao’s mini- variance γ ≤ 1/poly(n). max principle which implies if there exists a distribution • For p > 2, L1 = G for an n × n i.i.d. Gaussian 1 T on sketches that succeeds on all n × n inputs matrices matrix, while L2 = G + n1/2−1/p uv for the same A with large probability, then for any distribution L on G and random n-dimensional Gaussian vectors u inputs A, there is a fixed pair S and T of r × n and and v. n × s matrices, respectively, which succeeds with large 0 0 probability over A ∼ L. Moreover, we can assume the The major difficulty is bounding dTV (L1, L2). For p > 2 rows of S are orthonormal, as well as the columns of T . this amounts to distinguishing g from g + h, where g This is because, given SAT , we can compute USAT V , is an r × s matrix of Gaussians and h is a random where U and V are arbitrary invertible r × r and s × s r × s matrix with (i, j)-th entry equal to uivj. The fact matrices, respectively. Thus, it suffices to give two dis- that h is random makes the probability density function tributions L1 and L2 on A for which the kAkp values of g + h intractable. Moreover, for each fixed h, the differ by a factor α w.h.p. in the two distributions, but variation distance of g and g+h is much larger than for a for any matrix S with orthonormal rows and T with random h, and the best lower bound we could obtain by 0 fixing h is Ω(n1−2/p) which would just match the vector orthonormal columns, the induced distributions L1 and 0 lower bound (up to a log n factor). Instead of bounding L2 on SAT , when A ∼ L1 and A ∼ L2, respectively, 0 0 d (L0 , L0 ) directly, we bound the χ2-divergence of L0 have low total variation distance dTV (L1, L2). TV 1 2 1 0 0 0 Since S has orthonormal rows and T has orthonor- and L2, which if o(1) implies dTV (L1, L2) = o(1). This mal columns, then if L1 and L2 are rotationally invari- idea was previously used in the context of sketching p- ant distributions, then SAT is equal in distribution to norms of vectors [5], improving the previous Ω(n1−2/p) bound to Ω(n1−2/p log n). Surprisingly, for the Schatten an r × s submatrix√ of A. This observation already suf- fices to get an Ω( n) bound on r · s for all Schatten p-norm, it improves a simple Ω(n1−2/p) bound by a quadratic factor to Ω(n2−4/p), which can similarly be p-norms for a fixed constant p 6= 2,√ using a√ result of Jiang [26] which shows that square o( n) × o( n) sub- used to show an Ω(n2/α4) bound for α-approximating matrices of an n × n matrix of i.i.d. Gaussians and an the spectral norm. One caveat is that if we were to 2 0 0 n × n orthonormal matrix have o(1) variation distance. directly compute the χ -divergence between L1 and L2, Note that both distributions are rotationally invariant it would be infinite once r · s ≥ n1−2/p. 
We fix this by 0 0 and by the Marčenko-Pastur Law have constant factor conditioning L1 and L2 on a constant probability event, 0 0 2 difference in Schatten-p norm w.h.p. We slightly gener- resulting in distributions Le1 and Le2 for which the χ - alize Jiang’s proof to show the variation distance is o(1) divergence is small. for any r × s submatrix of these distributions provided For p = 0, the problem amounts to distinguishing r · s < n1−ε for arbitrarily small ε > 0. an r × c submatrix Q of UV T from an r × s submatrix For our Ω(n2) bound for p = 0 and Ω(n2−4/p) of UV T + γG. Working directly with the density function of UV T is intractable. We instead provide an algorithmic proof to bound the variation distance. [30], which generalizes the more familiar Hanson-Wright See Theorem 3.5 for details. The proof also works for inequality for second order arrays P . arbitrary Q of size O(n2), implying a lower bound of For p ∈ {0, 1} we look at distinguishing an n × n Ω(n2) to decide if an n × n matrix is of rank at most Gaussian matrix G from a matrix (G0,G0M), where G0 n/2 or ε-far from rank n/2 (for constant ε), showing an is an n × n/2 Gaussian random matrix and M is a ran- algorithm of Krauthgamer and Sasson is optimal [29]. dom n/2 × n/2 orthogonal matrix. For all constants Our Algorithm. Due to these negative results, p 6= 2, the Schatten p-norms differ by a constant factor a natural question is whether non-trivial sketching in the two cases. Applying our sketching matrix L, we 0 0 is possible for any Schatten p-norm, other than the have L1 distributed as N(0,Ik), but L2 is the distribu- i 0 i 0 Frobenius norm. To show this is possible, given an tion of (Z1,...,Zk), where Zi = hA ,G i + hB ,G Mi n × n matrix A, we left multiply by an n × n matrix and each Li is written as the adjoined matrix (Ai,Bi) G of i.i.d. Gaussians and right multiply by an n × n for n × n/2 dimensional matrices Ai and Bi. For each matrix H of i.i.d. Gaussians, resulting in a matrix A0 fixed O, we can view Z as a k-dimensional Gaussian of the form G0ΣH0, where G0,H0 are i.i.d. Gaussian vector formed from linear combinations of entries of G0. and Σ is diagonal with the singular values of A on Thus the problem amounts to bounding the variation the diagonal. We then look at cycles in a submatrix distance between two zero-mean k-dimensional Gaus- 0 0 0 Pn 0 0 0 of G ΣH . The (i, j)-th entry of A is `=1 σ`Gi,`H`,j. sian vectors with different covariance matrices. For L1 0 Interestingly, for even p, for any distinct i1, . . . , ip/2 and the covariance matrix is the identity Ik, while for L2 it distinct j1, . . . , jp/2, is Ik +P for some perturbation matrix P . We show that with constant probability over M, the Frobenius√ norm [(A0 A0 ) · (A0 A0 ) ··· (A0 A0 )] E i1,j1 i2,j1 i2,j2 i3,j2 ip/2,jp/2 i1,jp/2 kP kF is small enough to give us an k = Ω( n) bound, p and so it suffices to fix M with this property. One may = kAkp. worry that fixing M reduces the variation distance—√ The row indices of A0 read from left to right form a in this case one can show that with k = O( n), dis- 0 0 cycle (i1, i2, i2, i3, i3, , ..., ip/2, ip/2, i1), which since also tributions L1 and L2 already have constant variation each column index occurs twice, results in an unbiased distance. estimator. We need to average over many cycles to We believe our work raises a number of intriguing reduce the variance, and one way to obtain these is to open questions. store a submatrix of A0 and average over all cycles in it. 
Open Question 1: Is it possible that for every odd While some of the cycles are dependent, their covariance integer p < ∞, the Schatten-p norm requires k = Ω(n2)? is small, and we show that storing an n1−2/p × n1−2/p Interestingly, odd and even p behave very differently 0 2 2 submatrix of A suffices. since for even p, we have kAkp = kA kp/2, where A Our Techniques for General Linear Sketches: is PSD. Note that estimating Schatten norms of PSD We follow the same framework for bilinear sketches. matrices A can be much easier: in the extreme case of The crucial difference is that since the input A is now p = 1 the Schatten norm kAk1 is equal to the trace of viewed as an n2-dimensional vector, we are not able to A, which√ can be computed with k = 1, while we show design two rotationally invariant distributions L1 and k = Ω( n) for estimating kAk1 for non-PSD A. n n2 L2, since unlike rotating A in R , rotating A in R Open Question 2: For general linear sketches our does not preserve its Schatten p-norm. Fortunately, for lower bound for the operator norm is Ω(n3/2/α3) for both of our lower bounds (p > 2) and (p ∈ {0, 1}) we α-approximation. Can this be improved to Ω(n2/α4), 0 can choose L1 and L2 so that L1 is the distribution which would match our lower bound for bilinear sketches of a k-dimensional vector g of i.i.d. Gaussians. For and the upper bound of [4]? Using the tightness of 0 p > 2, L2 is the distribution of g + h, where g is as Latała’s bounds for Gaussian chaoses, this would either 1 T 1 T k 0 0 before but h = n1/2−1/p (u L v, . . . , u L v) for random require a new conditioning of distributions L1 and L2, n-dimensional Gaussian u and v and where Li is the or bounding the variation distance without using the i-th row of the the sketching matrix L, viewed as an χ2-divergence. n × n matrix. We again use the χ2-divergence to 0 0 bound dTV (L1, L2) with appropriate conditioning. In 2 Preliminaries this case the problem reduces to finding tail bounds n×d Notation. Let R be the set of n × d real matrices for Gaussian chaoses of degree 4, namely, sums of the and O the orthogonal group of degree n (i.e., the set P 0 0 n form a,b,c,d Aa,b,c,d uaubvcvd for a 4th-order array P of of n × n orthogonal matrices). Let N(µ, Σ) denote n4 coefficients and independent n-dimensional Gaussian the (multi-variate) normal distribution of mean µ and vectors u, u0, v, v0. We use a tail bound of Latała covariance matrix Σ. We write X ∼ D for a random p 2 variable X subject to a probability distribution D. We dTV (µ, ν) ≤ χ (µ||ν). also use On to denote the uniform distribution over the orthogonal group of order n (i.e., endowed with the Proposition 2.2. ([25, p97]) It holds that 2 hx,x0i normalized Haar measure). We denote by G(m, n) the χ (N(0,In) ∗ µ||N(0,In)) ≤ Ee − 1, where ensemble of random matrices with entries i.i.d. N(0, 1). x, x0 ∼ µ are independent. For two n × n matrices X and Y , we define hX,Y i T P In the case of n = 1, if F (x) and G(x) are the cu- as hX,Y i = tr(X Y ) = i,j XijYij, i.e., the entrywise inner product of X and Y . mulative distribution functions of µ and ν, respectively, Singular values and Schatten norms. Consider the Kolmogorov distance is defined as a matrix A ∈ n×d. Then AT A is a positive semi- R √ dK (µ, ν) = sup |F (x) − G(x)|. definite matrix. The eigenvalues of AT A are called x the singular values of A, denoted by σ1(A) ≥ σ2(A) ≥ It follows easily that for continuous and bounded f, · · · ≥ σ (A) in decreasing order. Let r = rank(A). 
It d Z Z is clear that σ (A) = ··· = σ (A) = 0. The matrix r+1 d (2.1) fdµ − fdν ≤ kfk∞ · dK (µ, ν). A also has the following singular value decomposition T (SVD) A = UΣV , where U ∈ On, V ∈ Od and Σ is an If both µ and ν are compactly supported, it suffices n × d diagonal matrix with diagonal entries σ1(A), ... , to have f continuous and bounded on the union of the σmin{n,d}(A). Define supports of µ and ν. 1 Hanson-Wright Inequality. Suppose that µ is a r ! p X p distribution over . We say µ is subgaussian if there kAkp = (σi(A)) , p > 0 R exists a constant c > 0 such that Prx∼µ{|x| > t} ≤ i=1 2 e1−ct for all t ≥ 0. n×d then kAkp is a norm over R , called the p-th Schatten The following form of the Hanson-Wright inequality n×d norm, over R for p ≥ 1. When p = 1, it is also called on the tail bound of a quadratic form is contained in the trace norm or Ky-Fan norm. When p = 2, it is the proof of [41, Theorem 1.1] due to Rudelson and 2 exactly the Frobenius norm kAkF , recalling that σi(A) Vershynin. T 2 T are the eigenvalues of A A and thus kAkF = tr(A A). Let kAk denote the operator norm of A when treating Theorem 2.1. (Hanson-Wright inequality) Let d n n A as a linear operator from `2 to `2 . Besides, it holds u = (u1, . . . , un), v = (v1, . . . , vn) ∈ R be random that limp→∞ kAkp = σ1(A) = kAk and limp→0+ kAkp = vectors such that u1, . . . , un, v1, . . . , vn are i.i.d. sym- rank(A). We define kAk∞ and kAk0 accordingly in this metric subgaussian variables. Let A ∈ Rn×n be a fixed limit sense. matrix, then Finally note that A and AT have the same non-zero 2 T   t t  singular values, so kAkp = kA kp for all p. T T Pr{|u Av−Eu Av| > t} ≤ 2 exp −c min , 2 Distance between probability measures. Sup- kAk kAkF pose µ and ν are two probability measures over some for some constant c > 0, which depends only on the n Borel algebra B on R such that µ is absolutely contin- constant of the subgaussian distribution. uous with respect to ν. For a convex function φ : R → R such that φ(1) = 0, we define the φ-divergence Latała’s Tail Bound. Suppose that gi1 , . . . , gid are i.i.d. N(0, 1) random variables. The following re- Z dµ Dφ(µ||ν) = φ dν. sult, due to Latała [30], bounds the tails of Gaussian dν P chaoses ai1 ··· aid gi1 ··· gid . The Hanson-Wright in- In general Dφ(µ||ν) is not a distance because it is not equality above, when restricted to Gaussian random symmetric. variables, is a special case (d = 2) of this tail bound. The total variation distance between µ and ν, The proof of Latała’s tail bound was later simplified by denoted by dTV (µ, ν), is defined as Dφ(µ||ν) for φ(x) = Lehec [32].

|x − 1|. It can be verified that this is indeed a distance. Suppose that A = (ai)1≤i1,...,id≤n is a finite multi- The χ2-divergence between µ and ν, denoted by indexed matrix of order d. For i ∈ [n]d and I ⊆ [d], 2 2 χ (µ||ν), is defined as Dφ(µ||ν) for φ(x) = (x − 1) or define iI = (ij)j∈I . For disjoint nonempty subsets 2 φ(x) = x − 1. It can be verified that these two choices I1,...,Ik ⊆ [d] define of φ give exactly the same value of D (µ||ν). φ ( X (1) (k) kAkI1,...,Ik = sup aix ··· x : iI1 iIk Proposition 2.1. ([45, p90]) It holds that i  2 2 Theorem 3.1. Suppose that p > 2, ζ ∈ (0, 1) and X  (1) X  (1)   x ≤ 1,..., x ≤ 1 . r , s ≥ 4. Whenever r s ≤ cζ2n2(1−2/p), it holds that iI1 iIk n n n n i i  I1 Ik dTV (L1, L2) ≤ ζ + 0.009, where c > 0 is an absolute constant. Also denote by S(k, d) the set of all partitions of {1, . . . , d} into k nonempty disjoint sets I1,...,Ik. It Proof. For simplicity we write rn and sn as r and is easy to see that if a partition {I1,...,Ik} is finer s, respectively. The distribution L1 is identical to than another partition {J1,...,J`}, then kAkI1,...,Ik ≤ N(0,Irs), and the distribution L2 is a Gaussian mixture rs kAkJ1,...,J` . N(z, Irs) with shifted mean z ∈ R , where −1/2+1/p Theorem 2.2. For any t > 0 and d ≥ 2, zij = 5n uivj 1 ≤ i ≤ r, 1 ≤ j ≤ s.   For a given t > 0, define the truncated Gaussian d  X Y (j)  distribution, denoted by N (0,I ), as the marginal Pr a g ≥ t t n i ij   distribution of N(0,In) conditioned on the event that i j=1 2 2 kxk ≤ t, where x ∼ N(0,In). Since kxk ∼ χ (n), ( 2 ) √   k t the probability pn := Pr{kxk > 2 n} < 0.004 by ≤ Cd exp −cd min min , 2 1≤k≤d evaluating an integral of the p.d.f. of χ (n) distribution. (I1,...,Ik)∈S(k,d) kAkI1,...,Ik Consider an auxiliary random vector z˜ ∈ Rrs defined as where C , c > 0 are constants depending only on d. d d −1/2+1/p z˜ij = 5n u˜iv˜j 1 ≤ i ≤ r, 1 ≤ j ≤ s,

Distribution of Singular Values We shall need √ where (˜u1,..., u˜r) is drawn from N2 r(0,Ir) and the following two lemmata. √ (˜v1,..., v˜s) is drawn from N2 s(0,Is). Define an aux- Lemma 2.1. (Marčenko-Pastur Law [36, 18]) iliary distribution Le2 as N(˜z, Irs). It is not difficult to Suppose that X is a p × m matrix with i.i.d N(0, 1/p) see that entries. Consider the probability distribution FX (x)  1  associated with the spectrum of XT X as dTV (L2, Le2) ≤ max − 1, pr + ps < 0.009. 1 − pr − ps 1  T FX (x) = i : λi(X X) ≤ x . So we need only bound d (L , L ). It follows from m TV 1 e2 Proposition 2.1 and Proposition 2.2 that For γ ∈ (0, 1], define a distribution Gγ (x) with density p hz˜ ,z˜ i function pγ (x) as dTV (L1, Le2) ≤ Ee 1 2 − 1, p (b − x)(x − a) where the expectation is taken over independent z˜1 and pγ (x) = , x ∈ [a, b], 2πγx z˜2, which are both identically distributed as z˜. Next we compute Eehz˜1,z˜2i. where √ √ exp(hz˜ , z˜ i) a = (1 − γ)2, b = (1 + γ)2. Ez˜1,z˜2 1 2  −1+2/p X 0 0  = u,˜ v,˜ u˜0,v˜0 exp 5n u˜iv˜ju˜iv˜j Then when m → ∞, p → ∞ and p/m → γ ∈ (0, 1) it E i,j holds that the expected Kolmogorov distance  −1+2/p X 0 X 0  (3.2) = Eu,˜ v,˜ u˜0,v˜0 exp 5n u˜iu˜i v˜jv˜j . −1/2 i j E sup |FX (x) − Gγ (x)| = O(n ). x P 0 2k P 0 Now let us bound E| i u˜u˜ | . Note that | i u˜u˜ | ≤ 0 ku˜k2ku˜ k2 ≤ 4r. By our assumption that r ≥ 4 it holds Lemma 2.2. ([46]) Suppose that X ∼ G(p, m). Then 2k−1 2 2 that r ≥ 4, so t /r ≤ t for t ≤ 4r. Applying the with probability at least 1−e−t /2, it holds that σ (X) ≤ √ √ 1 Hanson-Wright inequality (Theorem 2.1) to the identity p + m + t. matrix Ir, we have that

3 Lower Bounds for Bilinear Sketches 2k  2k  Z 4r   X 0 X 0 3.1 The case of p > 2 Fix rn ≤ n and sn ≤ n. E u˜u˜ ≤ Pr u˜iu˜i > t dt 0 Let L1 = G(rn, sn) and L2 denote the distribution of i  i  −1/2+1/p T the upper-left rn × sn block of G + 5n uv , Z 4r  k −c t1/k/r r where G ∼ G(n, n), u, v ∼ N(0,In) and G, u, v are ≤ 2 e 1 dt ≤ 2 k! independent. 0 c1 for some absolute constant c1 > 0, where the integral 1/p+1/2 is evaluated by variable substitution. We continue kG + Xkp ≥ kXkp − kGkp ≥ (4.9 − 2.2)n bounding (3.2) using the Taylor series as below: 1/p+1/2 ≥ 1.2 · 2.2n ≥ 1.2kGkp

Ez˜1,z˜2 exp(hz˜1, z˜2i) with high probability.  ∞ −1+ 2 2k P 0 2k P 0 2k X (5n p ) | u˜u˜ | | v˜v˜ | ≤ E i E i (2k)! 3.2 General p > 0 The following theorem is a k=0 generalization of a result of Jiang [26]. Following his k ∞ 2(−1+ 2 ) ! X 5n p rs (k!)2 notation, we let Zn denote the distribution of the upper- ≤ 1 + 2 · c2 (2k)! left rn × sn block of an orthonormal matrix chosen k=1 1 1 uniformly at random from On and Gn ∼ √ G(rn, sn). ∞ k n X ζ2  ≤ 1 + 2 ≤ 1 + ζ2, 3 Theorem 3.3. If rnsn = o(n) and sn ≤ rn ≤ k=1 nd/(d+2) as n → ∞ for some integer d ≥ 1, then 2 2(1−2/p) 2 provided that 5rs/c1n ≤ ζ /3. Therefore limn→∞ dTV (Zn,Gn) = 0.

hz˜1,z˜2i 1/2 √ dTV (L1, L2) ≤ (ln e ) +dTV (L2, Le2) ≤ ζ+0.009, E Jiang’s original√ result restricts rn and sn to rn = o( n) as claimed. The absolute constant c in the statement and sn = o( n). We follow the general notation can be taken as c = c2/15. and outline in his paper, making a few modifications 1  to remove this restriction. We postpone the proof to It is not difficult to see that kGkp and kG + Appendix A. 1/p−1/2 T 5n uv kp differ by a constant factor with high The lower bound on bilinear sketches follows from probability. The lower bound on the dimension of the Theorem 3.3, with the hard pair of rotationally invariant bilinear sketch is immediate, using L1 and L2 as the distributions being a Gaussian random matrix versus a hard pair of distributions and the observations that (i) random orthonormal matrix. The proof follows from both distributions are rotationally invariant, and (ii) by Theorem 3.3 and is postponed to Appendix B. increasing the dimension by at most a constant factor, n×n one can assume that rn ≥ 4 and sn ≥ 4. Theorem 3.4. Let A ∈ R and p > 0. Suppose that an algorithm takes a bilinear sketch SAT (S ∈ r×n n×n R Theorem 3.2. Let A ∈ R and p > 2. Suppose that and T ∈ n×s) and computes Y with (1 − c )kAkp ≤ r×n R p p an algorithm takes a bilinear sketch SAT (S ∈ R p n×n Y ≤ (1 + cp)kAkp for any A ∈ R with probability and T ∈ n×s) and computes Y with (1 − c )kAkp ≤ R p p 3 |Ip−1| 1−η p n×n at least 4 , where cp = 2(I +1) . Then rs = Ω(n ) for Y ≤ (1 + cp)kAkp for any A ∈ R with probability p p p any constant η > 0. at least 3/4, where cp = (1.2 − 1)/(1.2 + 1). Then rs = Ω(n2(1−2/p)). 3.3 Rank (p = 0) Let S ⊂ [n]×[n] be a set of indices n×n Proof. It suffices to show that kGkp and kG + of an n × n matrix. For a distribution L over R , the 1/p−1/2 T |S| 5n uv kp differ by a constant factor with high entries of S induce a marginal distribution L(S) on R probability. Let X = 5n1/p−1/2uvT . Since X is of rank as one, the only non-zero singular value σ1(X) = kXkF ≥ 1/p+1/2 T 2 (Xp1,q1 ,Xp2,q2 ,...,Xp|S|,q|S| ),X ∼ L. 4.9 · n with high probability, since kuv kF ∼ 2 2 2 (χ (n)) , which is tightly concentrated around n . Theorem 3.5. Let U, V ∼ G(n, d), G ∼ γG(n, n) for On the other hand, combining Lemma 2.1 and γ = n−14. Consider two distributions L and L over p 1 2 Lemma 2.2 as well as (2.1) with f(x) = x on [0, 4], n×n T T R defined by UV and UV + G respectively. Let we can see that with probability 1 − o(1) it holds for S ⊂ [n] × [n]. When |S| ≤ d2, it holds that X ∼ √1 G(n, n) (note the normalization!) that n −2 d (3.5) dTV (L1(S), L2(S)) ≤ C|S| n + dc , p (3.3) kXkp = (Ip + o(1)) n, where C > 0 and 0 < c < 1 are absolute constants. where Z 4 p Proof. (sketch, see Appendix D for the full proof.) We p (4 − x)x p (3.4) I = x 2 · dx ≤ 2 . give an algorithm which gives a bijection f : |S| → p 2πx R 0 R|S| with the property that for all but a subset of 1/p 1/2+1/p 1/2+1/p |S| of measure o(1) under both L (S) and L (S), the Hence kGkp ≤ 1.1 · Ip n ≤ 1.1 · 2 · n R 1 2 with high probability. By the triangle inequality, probability density functions of the two distributions are equal up to a multiplicative factor of (1 ± 1/poly(n)). The idea is to start with the row vectors U1,...,Un Algorithm 1 The sketching algorithm for even p ≥ 4. of U and V1,...,Vn of V , and to iteratively perturb Input: n,  > 0, even integer p ≥ 4 and A ∈ n×n. T R them by adding γGi,j to UV for each (i, j) ∈ S. 
We 1: N ← Ω(−2) 0 0 0 0 1−2/p find new vectors U1,...,Un and V1 ,...,Vn of n × d 2: Let {Gi} and {Hi} be independent n × n matrices U 0 and V 0 so that (U 0)(V 0)T and UV T + γG matrices with i.i.d. N(0, 1) entries are equal on S. We do this in a way for which kU k = T i 2 3: Maintain each GiAHi , i = 1,...,N 0 0 (1±1/poly(n))kUi k2 and kVik2 = (1±1/poly(n))kVi k2 4: Compute Z as defined in (4.6) for all i, and so the marginal density function evaluated 5: return Z 0 0 on Ui (or Vj) is close to that evaluated on Ui (or Vj ), by definition. Moreover, our mapping is bijective, so the 0 0 0 0 joint distribution of (U1,...,Un,V1 ,...,Vn) is the same sketches, which can thus be implemented in the most as that evaluated of (U1,...,Un,V1,...,Vn) up to a general turnstile data stream model (arbitrary number (1 ± 1/poly(n))-factor. The bijection we create depends of positive and negative additive updates to entries T on properties of S, e.g., if the entry (UV )i,j = hUi,Vji given in an arbitrary order). The algorithm works for is perturbed, and more than d entries of the i-th row of arbitrary A when p ≥ 4 is an even integer. A appear in S, this places more than d constraints on Suppose that p = 2q. We define a cycle σ to Ui, but Ui is only d-dimensional. Thus, we must also be an ordered pair of a sequence of length q: σ = change some of the vectors Vj. We change those Vj for ((i1, . . . , iq), (j1, . . . , jq)) such that ir, jr ∈ [k] for all r, 0 which (i, j) ∈ Q and there are fewer than d rows i 6= i ir 6= is and jr 6= js for r 6= s. Now we associate with σ 0 for which (i , j) ∈ S; in this way there are fewer than q d constraints on Vj so it is not yet fixed. We can find Y Aσ = Ai`,j` Ai`+1,j` enough Vj with this property by the assumption that `=1 2 |S| ≤ d .  where we adopt the convention that ik+1 = i1. Let C In the theorem above, choose d = n/2 and so denote the set of cycles. We define rank(UV T ) ≤ n/2 while rank(G) = n with probability N 1. Note that both distributions are rotationally invari- 1 X 1 X (4.6) Z = (G AHT ) ant, and so the lower bound on bilinear sketches follows N |C| i i σ immediately. i=1 σ∈C for even p. Theorem 3.6. Let A ∈ Rn×n. Suppose that an al- r×n gorithm takes a bilinear sketch SAT (S ∈ R and With probability ≥ 3/4, the output Z n×s Theorem 4.1. T ∈ R ) and computes Y with (1 − cp) rank(A) ≤ returned by Algorithm 1 satisfies (1 − )kAkp ≤ Z ≤ n×n p Y ≤ (1 + cp) rank(A) for any A ∈ R with probability (1 + )kAkp when p is even. The algorithm is a bilinear at least 3/4, where c ∈ (0, 1/3) is a constant. Then p p sketch with r · s = O (−2n2−4/p). rs = Ω(n2). p

As an aside, given that w.h.p. over A ∼ L2 in See Appendix E for the proof. A similar algorithm Theorem 3.5, A requires modifying Θ(n2) of its entries works for odd p and PSD matrices A. See Appendix F to reduce its rank to at most d if d ≤ n/2, this implies for details. that we obtain an Ω(d2) bound on the non-adaptive query complexity of deciding if an n×n matrix is of rank 5 Lower Bounds for General Linear Sketches at most d or ε-far from rank d (for constant ε), showing 5.1 Lower bound for p > 2 Let G ∼ G(n, n), u, v ∼ an algorithm of Krauthgamer and Sasson is optimal [29]. N(0,In) be independent. Define two distributions: L1 = G(n, n), while L2 is the distribution induced by − 1 + 1 4 Bilinear Sketch Algorithms G + 5n 2 p uvT . By the Johnson-Lindenstrauss Transform, or in fact, A unit linear sketch can be described using an n×n any subspace embedding with near-optimal dimension matrix L. Applying this linear sketch to an n × n (which can lead to better time complexity [44, 12, 37, matrix Q results in hL, Qi. More generally, consider 39]), we can reduce the problem of general matrices k orthonormal linear sketches (which as was argued to square matrices (see Appendix C for details), and earlier, is equivalent to any k linear forms, since, given henceforth we shall assume the input matrices are a sketch, one can left-multiply it by an arbitrary matrix k square. to change its row space) corresponding to {Li}i=1 with T We present a sketching algorithm to compute a tr(Li Lj) = δij. p n×n (1+)-approximation of kAkp for A ∈ R using linear Let X ∼ L2 and Zi = hLi,Xi. We define k  1/2 a distribution Dn,k on R to be the distribution of k X X i 2 i 2 (Z1,...,Zk). =  (La,b) (Lc,d)  a,b,c,d i=1 √ = k Theorem 5.1. Suppose that p > 2. For all suffi- 3 (1− 2 ) ciently large n, whenever k ≤ cn 2 p , it holds that Partition of size 2 and 3. The norms√ are automati- dTV (N(0,Ik), Dn,k) ≤ 0.24, where c > 0 is an absolute cally upper-bounded by kAk{1,2,3,4} = k. constant. Partition of size 4. The only partition is {1}, {2}, {3}, {4}. We have

Proof. It is clear that Dn,k is a Gaussian mixture with k X T i 0T i 0 shifted mean kAk{1},{2},{3},{4} = sup u L vu L v 0 0 n−1 u,v,u ,v ∈S i=1 −1/2+1/p T T T T Xu,v = 5n (u L1v, u L2v, . . . , u Lkv) k ! 1 −1/2+1/p X T i 2 0 0T i 2 =: 5n Yu,v. ≤ sup huv ,L i + hu v ,L i ≤ 1 0 0 2 u,v,u ,v i=1 Without loss of generality we may assume that k ≥ 16. √ T Consider the event E = {kY k ≤ 4 k}. Since The last inequality follows the fact that uv is a unit u,v u,v 2 n2 i n2 2 vector in R and L ’s are orthonormal vectors in R . EkYu,vk2 = k, it follows from Markov’s inequality Latała’s inequality (Theorem 2.2) states that for that Pr {E } ≥ 15/16. Let D be the marginal √ u,v u,v en,k t ∈ [ k, k2], distribution of Dn,k conditioned on Eu,v. Then n X 0 0 o Pr Aa,b,c,duaubvcvd > t dTV (Den,k, Dn,k) ≤ 1/8 a,b,c,d ( 2 2 )! t t t 3 √ and it suffices to bound dTV (N(0,In), Den,k). Resort- ≤ C exp −c min √ , , , t 2 1 1 ing to χ divergence by invoking Proposition 2.1 and k k k 3

Proposition 2.2, we have that 2 ! t 3 p ≤ C exp −c · hXu,v ,X 0 0 i 1 1 dTV (N(0,In), Den,k) ≤ Ee u ,v − 1, k 3 0 0 where u, v, u , v ∼ N(0,In) conditioned on Eu,v and with no conditions imposed on u, v, u0, v0. It follows that Eu0,v0 . We first calculate that  Pr |hYu,v,Yu0,v0 i| > t Eu,vEu0,v0 n 2 Pr {|hYu,v,Yu0,v0 | > t} 2 −1+ p X X i i 0 0 hXu,v,Xu0,v0 i = cpn (L )ab(L )cduaubvcvd ≤ Pr{E 0 0 } Pr{E } a,b,c,d=1 i u ,v u,v 2 ! X 0 0 t 3 √ =: D Aa,b,c,duau vcv , 2 b d ≤ C2 exp −c · 1 , t ∈ [ k, k ]. a,b,c,d k 3

−1+ 2 2 p Note that conditioned on Eu,v and Eu0,v0 , where D = cpn and Aa,b,c,d is an array of order 4 such that |hYu,v,Yu0,v0 i| ≤ kYu,vk2kYu0,v0 k2 ≤ 16k. k X i i 1/2+ Aa,b,c,d = LabLcd. Let  = 1/8, then for t ∈ [k , 16k], it holds that i=1 ct2/3 ct2/3 We shall compute the partition norms of A as tD − ≤ − , a,b,c,d k1/3 2k1/3 needed in Latała’s tail bound. 3 (1− 2 ) Partition of size 1. The only possible partition is provided that k ≤ c0n 2 p . Integrating the tail bound {1, 2, 3, 4}. We have gives that

 21/2 Xu,v ,Xu0,v0 k ! Ee X X i i kAk{1,2,3,4} = L L Z 16k  a,b c,d  tD a,b,c,d i=1 = 1 + e Pr{|hYu,v,Yu0,v0 i| > t}dt 0  1/2 1/2+ k Z k X X j j tD i i = 1 + D e Pr{|hYu,v,Yu0,v0 i| > t}dt =  La,bLc,dLa,bLc,d 0 a,b,c,d i,j=1 Z 16k tD where N(0,Ik) is the standard k-dimensional Gaussian + D e Pr{|hYu,v,Yu0,v0 i| > t}dt k1/2+ distribution. Z k1/2+ Z 16k tD tD−ct2/3/k1/3 ≤ 1 + D e dt + C2D e dt 0 k1/2+ 16k 2/3 1/2+ Z ct Dk − 1/3 ≤ e + C2D e 2k dt Proof. The sketch can be written as a matrix Φg, where 1/2+ 2 k Φ ∈ k×n /2 is a random matrix that depends on A(i,σ),  2/3  R 1/2+ ck B(i,σ) and M, and g ∼ G(n2/2, 1). Assume that Φ has ≤ exp(Dk ) + 16C2kD · exp − 2 full row rank (we shall justify this assumption below). ! T c2 Fix Φ (by fixing M). Then Φg ∼ N(0, ΦΦ ). It is ≤ exp p ( 1 − 3 )(1− 2 ) known that ([27, Lemma 22]) n 4 2 p q 0 (1− 2 ) ! T T T 1 2 cc n p dTV (N(0, ΦΦ ),N(0,Ik)) ≤ tr(ΦΦ ) − k − ln |ΦΦ |, 0 2 2 (1− p ) + 16C2c cpn exp − 2 T where λ1, . . . , λk are the eigenvalues values of ΦΦ . ≤ 1.01 Write ΦΦT = I + P . Define an event E = n 2 o M : kP k2 ≤ 12 · k . When E happens, the eigenval- when n is large enough. It follows immediately that F ζ n q dTV (N(0,In), Den,k) ≤ 1/10 and thus 12 √k ues of P are bounded by ζ · n ≤ 2/3. Let µ1, . . . , µk

dTV (N(0,In), Dn,k) ≤ 1/10 + 1/8 < 0.24. be the eigenvalues of P , then λi = 1+µi and |µi| ≤ 2/3. Hence  v u k T uX The lower bound on the number of linear sketches dTV (N(0, ΦΦ ),N(0,Ik)) ≤ t (µi − ln(1 + µi)) follows immediately as a corollary. i=1 v n×n u k r Theorem 5.2. Let X ∈ and p > 2. Suppose uX q 12 k 2 R ≤ t µ2 = kP k2 ≤ · √ ≤ ζ, that an algorithm takes k linear sketches of X and i F ζ n 3 p p i=1 computes Y with (1 − cp)kXkp ≤ Y ≤ (1 + cp)kXkp for any X ∈ Rn×n with probability at least 3/4, where where we use that x − ln(1 + x) ≤ x2 for x ≥ −2/3. p p (3/2)(1−2/p) cp = (1.2 − 1)/(1.2 + 1). Then k = Ω(n ). Therefore, when E happens, Φ is of full rank and we can apply the total variation bound above. We claim that 2 2 2 5.2 General p ≥ 0 Consider a random matrix EPij ≤ 4/n for all i, j and thus EkP kF ≤ 4k /n, it then (G, GM), where G ∼ G(n, n/2) and M ∼ On/2. follows that Pr(E) ≥ 1−ζ/3 by Markov’s inequality and A unit linear sketch can be described using an n×n n×(n/2) 2 c 2 1 matrix L = (A, B), where A, B ∈ R such that dTV (Dn,k,N(0,Ik)) ≤ ζ + Pr(E ) ≤ ζ + ζ = ζ 2 2 3 3 3 kAkF + kBkF = 1. Applying this linear sketch to an n×(n/2) n × n matrix Q = (Q1,Q2) (where Q1,Q2 ∈ R ) as advertised. 2 results in hA, Q1i + hB,Q2i. Now we show that EPij ≤ 4/n for all i, j. Suppose More generally, consider k orthonormal linear that M = (mij). Notice that the r-th row of Φ is sketches (which as was argued earlier, is equivalent to (r) X (r) n A + B m , i = 1, . . . , n, ` = 1,..., . any k linear forms, since, given a sketch, one can left- i` ij `j 2 multiply it by an arbitrary matrix to change its row j k T space) corresponding to {Li}i=1 with tr(Li Lj) = δij. Hence by a straightforward calculation, the inner prod- Now we define a probability distribution Dn,k on uct of r-th and s-th row is k (i) (i) R . For each i write Li as Li = (A ,B ). Then X (r) (s) X (s) (r) (i) (j) (i) (j) hΦr·, Φs·i = δrs + A B m`j + A B m`j by orthonormality, hA ,A i + hB ,B i = δi,j. i` ij i` ij (i) (i) i,j,` i,j,` Define Zi = hA ,Gi + hB , GMi and Dn,k to be the X  (r) (s) (s) (r)  distribution of (Z1,...,Zk). = δrs + hA` ,Bj i + hA` ,Bj i m`j j,` Theorem 5.3. Let Dn,k be defined as above and ζ ∈ 3/2√ (r) (r) (0, 1). Then for k ≤ (ζ/3) n it holds that where Ai denotes the i-th column of A . Then

dTV (Dn,k,N(0,Ik)) ≤ ζ, Prs = tr(UM), where the matrix U is defined by [4] A. Andoni and H. L. Nguyen. Eigenvalues of a matrix in the streaming model. In SODA, 2013. (r) (s) (s) (r) uj` = hA` ,Bj i + hA` ,Bj i. [5] A. Andoni, H. L. Nguyen, Y. Polyansnkiy, and Y. Yu. Tight lower bound for linear sketches of moments. In Since ICALP, 2013. 2 nX (r) 2 X (s) 2 [6] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivaku- ujk ≤ 2 |A | |B | i ik i ij mar. An information statistics approach to data X (s)  X (r) o stream and communication complexity. J. Comput. + |A |2 |B |2 i ik i ij Syst. Sci., 68(4):702–732, 2004. [7] L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha. and thus Simpler algorithm for estimating frequency moments 2 X nX (r) 2 X (s) 2 of data streams. In SODA, pages 708–713, 2006. kUkF ≤ 2 |Aik | |Bij | + j,k i i [8] E. J. Candès and B. Recht. Exact matrix completion X (s)  X (r) o |A |2 |B |2 via convex optimization. Commun. ACM, 55(6):111– i ik i ij 119, 2012.   ≤ 2 kA(r)k2 kB(s)k2 + kA(s)k2 kB(r)k2 ≤ 2. [9] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal F F F F lower bounds on the multi-party communication com- We conclude that plexity of set disjointness. In CCC, 2003. [10] M. Charikar, K. Chen, and M. Farach-Colton. Find- 2 X 2 2 X ing frequent items in data streams. In Proceedings E[Prs] = ujkE[mkj]+ ujkui`E[mkjmi`] j,k (j,k)6=(i,`) of the 29th International Colloquium on Automata, 2 2 X 2 2kUkF 4 Languages and Programming (ICALP), pages 693–703, = ujk = ≤ . n j,k n n 2002. [11] H. Y. Cheung, L. C. Lau, and K. M. Leung. Graph This completes the proof.  connectivities, network coding, and expander graphs. In FOCS, pages 190–199, 2011. Without loss of generality, we can normalize our √ [12] K. L. Clarkson and D. P. Woodruff. Low rank ap- matrix by a factor of 1/ n. Let X ∼ √1 G(n, n) and n proximation and regression in input sparsity time. In Y = (G, GM), where G ∼ G(n, n/2) and M ∼ On/2. It STOC, pages 81–90, 2013. p p is not difficult to see that kXkp and kY kp differs by a [13] G. Cormode and S. Muthukrishnan. An improved constant with high probability. The following theorem data stream summary: the count-min sketch and its is now an immediate corollary of Theorem 5.3. The applications. J. Algorithms, 55(1):58–75, 2005. proof is postponed to Appendix H. [14] M. S. Crouch and A. McGregor. Periodicity and cyclic shifts via linear sketches. In APPROX-RANDOM, Theorem 5.4. Let X ∈ Rn×n and p ≥ 0. Suppose that pages 158–170, 2011. an algorithm takes k linear sketches of X and computes [15] A. Deshpande, M. Tulsiani, and N. K. Vishnoi. Algo- Y with rithms and hardness for subspace approximation. In SODA, pages 482–496, 2011. p p • (when p > 0) (1 − cp)kXkp ≤ Y ≤ (1 + cp)kXkp, [16] P. Flajolet and G. N. Martin. Probabilistic counting. or In Proceedings of the 24th Annual IEEE Symposium • (when p = 0) (1 − cp)kXk0 ≤ Y ≤ (1 + cp)kXk0 on Foundations of Computer Science (FOCS), pages 76–82, 1983. for any X ∈ n×n with probability at least 3 , where c R 4 √ p [17] P. Flajolet and G. N. Martin. Probabilistic counting is some constant depends only on p. Then k = Ω( n). algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182–209, 1985. [18] F. Götze and A. Tikhomirov. Rate of convergence in References probability to the marchenko-pastur law. Bernoulli, 10(3):pp. 503–548, 2004. [1] N. Alon, Y. Matias, and M. Szegedy. The Space [19] S. Guha, P. Indyk, and A. McGregor. Sketching Complexity of Approximating the Frequency Moments. information divergences. 
Machine Learning, 72(1-2):5– J. Comput. Syst. Sci., 58(1):137–147, 1999. 19, 2008. [2] A. Andoni. Nearest neighbor search in high- [20] M. Hardt, K. Ligett, and F. McSherry. A simple dimensional spaces. In the workshop: Bar- and practical algorithm for differentially private data riers in Computational Complexity II, 2010. release. In Advances in Neural Information Processing http://www.mit.edu/ andoni/nns-barriers.pdf. Systems 25, pages 2348–2356. 2012. [3] A. Andoni, R. Krauthgamer, and K. Onak. Streaming [21] P. Indyk. Stable distributions, pseudorandom gener- algorithms from precision sampling. In FOCS, pages ators, embeddings, and data stream computation. J. 363–372, 2011. ACM, 53(3):307–323, 2006. [22] P. Indyk. Sketching, streaming and sublinear-space [41] M. Rudelson and R. Vershynin. Hanson-Wright in- algorithms, 2007. Graduate course notes available at equality and sub-gaussian concentration. Manuscript. http://stellar.mit.edu/S/course/6/fa07/6.895/. http://arxiv.org/abs/1306.2872v2. [23] P. Indyk and A. McGregor. Declaring independence [42] M. Rudelson and R. Vershynin. The Littlewood- via the sketching of sketches. In SODA, pages 737– Offord problem and invertibility of random matrices. 745, 2008. Advances in Mathematics, 218(2):600 – 633, 2008. [24] P. Indyk and D. P. Woodruff. Optimal approximations [43] M. E. Saks and X. Sun. Space lower bounds for of the frequency moments of data streams. In STOC, distance approximation in the data stream model. In pages 202–208, 2005. STOC, pages 360–369, 2002. [25] Y. Ingster and I. A. Suslina. Nonparametric Goodness- [44] T. Sarlós. Improved approximation algorithms for of-Fit Testing Under Gaussian Models. Springer, 1st large matrices via random projections. In FOCS, pages edition, 2002. 143–152, 2006. [26] T. Jiang. How many entries of a typical orthogonal [45] A. B. Tsybakov. Introduction to Nonparametric Esti- matrix can be approximated by independent normals? mation. Springer, 1st edition, 2008. The Annals of Probability, 34(4):pp. 1497–1529, 2006. [46] R. Vershynin. Introduction to the non-asymptotic [27] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently analysis of random matrices. In Y. C. Eldar and learning mixtures of two gaussians. In Proceedings of G. Kutyniok, editors, Compressed Sensing: Theory and the 42nd STOC, pages 553–562, 2010. Applications. Cambridge University Press, 2011. [28] D. M. Kane, J. Nelson, E. Porat, and D. P. Woodruff. Fast moment estimation in data streams in optimal Appendices space. In Proceedings of the 43rd annual ACM sympo- A Proof of Theorem 3.3 sium on Theory of computing, STOC ’11, pages 745– 754, 2011. Proof. First we strengthen Lemma 2.6 in [26] and in [29] R. Krauthgamer and O. Sasson. Property testing of this proof we shall follow Jiang’s notation that the we data dimensionality. In SODA, pages 18–27, 2003. read the upper-left p × q block despite the usage of rn [30] R. Latała. Estimates of moments and tails of Gaussian and sn in the statement of the theorem. chaoses. Ann. Probab., 34(6):2315–2331, 2006. 0 It holds that fs(s, t) = −2/(1 − 2s − t) = −2 + [31] B. Laurent and P. Massart. Adaptive estimation of −1/(d+2) 0 1/(d+2) O(n ) and ft(s, t) = −1 + O(n ). The a quadratic functional by model selection. Annals of bounds for second-order derivatives of f(s, t) still hold. Statistics, 28(5):1302–1338, 2000. Thus [32] J. Lehec. Moments of the Gaussian chaos. In Sémi- naire de Probabilités XLIII, volume 2006 of Lecture ZZ   2 3kq 1 Notes in Math., pages 327–340. Springer, Berlin, 2011. 
Bn = n ln(1−2s−t)dsdt+ +O 2n n1/(d+2) [33] C. Li and G. Miklau. Measuring the achievable ZZ error of query sets under differential privacy. CoRR, = n2 ln(1 − 2s − t)dsdt + o(1). abs/1202.3399, 2012. [34] Y. Li and D. P. Woodruff. A tight lower bound for Now, we expand the Taylor series into r terms, high frequency moment estimation with small error. In RANDOM, 2013.  2 d (s+t) d+1 (s+t) [35] M. W. Mahoney. Randomized algorithms for matrices ln(1+s+t)− (s+t)− +···+(−1) and data. Foundations and Trends in Machine Learn- 2 d ing, 3(2):123–224, 2011. (s + t)d+1 ≤ . [36] V. A. Marčenko and L. A. Pastur. Distribution of d + 1 eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.), 72 (114):507–536, 1967. So [37] X. Meng and M. W. Mahoney. Low-distortion sub- Z v Z u space embeddings in input-sparsity time and applica- ln(1 + s + t)dsdt tions to robust linear regression. In STOC, pages 91– 0 0 d 100, 2013. X (−1)k+1 = ((u + v)k+2 − uk+2 − vk+2) [38] S. Muthukrishnan. Data Streams: Algorithms and k(k + 1)(k + 2) Applications. Foundations and Trends in Theoretical k=1 Computer Science, 1(2):117–236, 2005. + O((u + v)d+3) [39] J. Nelson and H. L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embed- as n → ∞. Substituting u = −(p + 2)/n and v = −q/n dings. CoRR, abs/1211.1002, 2012. back into two integrals in (2.5), (2.7) becomes [40] E. Price and D. P. Woodruff. Applications of the Shannon-Hartley theorem to data streams and sparse n2 Z v Z u recovery. In ISIT, pages 2446–2450, 2012. ln(1 + s + t)dsdt 2 0 0 d T d+2 1X 1 (p+q+2)k+2 − (p+2)k+2 − qk+2 tr((X X) ) = − +g ˜n d+2 , 2 k(k+1)(k+2) nk n k=1 where g ∈ [0, 2) for n sufficiently large. (p + q + 2)d+3  n + O We want to show that nd+1 q X and (A.1) Bn + f(λi) → 0 i=1 n2 Z v Z −2/n ln(1 + s + t)dsdt in probability as n → ∞. First, we look at the last 2 0 0 term. It is clear from d 1 X 1 (q + 2)k+2 − 2k+2 − qk+2 = − q 2 k(k + 1)(k + 2) nk T m X X X k=1 tr((Xn Xn) ) = xk1,ixk1,j1 xk2,j1 xk2,j2 m−1 m  qd+3  i=1 j∈[q] k∈[p] + O d+1 n ··· xkm−1,jm−2 xkm−1,jm−1 xkm,jm−1 xkm,i T m m Hence that E tr((Xn Xn) ) = O(pn qn), where the constant in d k+1 ! O(·) notation depends on m but not p and q. For any 1X 1 Xk+2piqk+2−i B = − +o(1).  > 0, n 2 k(k+1)(k+2) i nk k=1 i=1 T r+2 2  T r+2 r+2 E[(tr(Xn Xn) ) ] i k+2−i k Pr tr((X X ) > n ≤ Notice that for i ≥ 2, it holds that p q /n ≤ n n 2n2r+4 k 2 k p q /n → 0 as n → ∞. It follows that [q tr(XT X )2r+4]  p2r+4q2  ≤ E n n = O → 0 d 2n2r+4 2n2r+4 1 X pk+1q B = − + o(1). n 2 k(k + 1)nk as n → ∞. So the last term goes to 0 in probability. k=1 For the other terms, fix  > 0, we see that This is our desired form of Lemma 2.6.  T m T m m Now, following the proof of Lemma 2.7, we expand Pr | tr((Xn Xn) ) − E tr((Xn Xn) )| ≥ n into more terms [(tr((XT X )m)2] [q(tr((XT X )2m)] ≤ E n n ≤ E n n  x  x x2 xd+1 xd+2 1 2n2m 2n2m ln 1 − = − − −· · ·− − · . n n n2 (d + 1)nd+1 d + 2 (ξ − n)d+2  p2mq2  = O → 0 so 2n2m p+q+1 n−p−q−1 n−p−q−1 hence f(x) = x− x−· · ·− xd+1 2n 4n2 2(d+1)nd+1 q X p+q+1 T xd+2 f(λi) → E tr(X X) + g (x) , 2n n nd+1 i=1 d+1 where X n−p−q−1 T j − E tr((X X) ) d+1 2jnj n (n − p − q − 1) j=2 gn(x) = − d+2 . 2(d + 2)(ξ − n) in probability. It suffices to show that It is trivial to see that  d+1  1 p+q+1 T X n−p−q−1 T j sup |g (x)| ≤ . Bn+ tr(X X)− tr((X X) ) n d+2 2n E 2jnj E 0≤x≤αn (1 − α) j=2 √ √ 2 The eigenvalues are bounded by O(( p + q) ) = goes to 0. Rearrange the terms in the bracket as O(n1−1/(d+2)). Define Ω accordingly. 
Then d Xp+q+1 1  q (tr(XT X)j) − (tr(XT X)j+1) X jnj E (j+1)nj E f(λi) j=1 i=1 p + q + 1 T j+1 d+1 + E(tr(X X) ) p + q + 1 X n − p − q − 1 (d + 1)nd+1 = tr(XT X) − tr((XT X)j) 2n 2jnj j=2 The last term goes to 0 as E(tr(XT X)j+1) = O(pj+1q) O(d/2) rows while roughly maintaining the singular and pd+1/nd → 0. Hence it suffices to show that for values as follows. Call Φ a (d, δ)-subspace embedding each k matrix if with probability ≥ 1 − δ it holds that

k+1 p q p + q + 1 T k (1 − )kxk ≤ kΦxk ≤ (1 + )kxk − k + k E(tr(X X) ) k(k + 1)n kn for all x in a fixed d-dimensional subspace. In [44], it is 1 − (tr(XT X)k+1) → 0, proved that (k + 1)nk E n or Lemma C.1. Suppose that H ⊂ R is a d-dimensional subspace. Let Φ be an r-by-n random matrix with −pk+1q+(k+1)(p+q+1)E(tr(XTX)k)−kE(tr(XTX)k+1)entries i.i.d N(0, 1/r), where r = Θ(d/2 log(1/δ)). nk Then it holds with probability ≥ 1 − δ that → 0, (1 − )kxk ≤ kΦxk ≤ (1 + )kxk, ∀x ∈ H. which can be verified easily using that T m m m In fact we can use more modern subspace embeddings E tr((Xn Xn) ) = pn qn + o(n ). Therefore (A.1) holds and the rest of the argument in Jiang’s paper [44, 12, 37, 39] to improve the time complexity, though follows. since our focus is on the sketching dimension, we defer a thorough study of the time complexity to future work. B Proof of Theorem 3.4 Now we are ready for the subspace embedding transform on singular values, which follows from the Proof. Choose A from D0 with probability 1/2 and from min-max principle for singular values. On with probability 1/2. Let b ∈ {0, 1} indicate which distribution A is chosen from, that is, b = 0 when A Lemma C.2. Let Φ be a (d, δ)-subspace embedding ma- is chosen from D0 and b = 1 otherwise. Note that trix. Then, with probability ≥ 1 − δ, it holds that both D0 and O(n) are rotationally invariant, so without (1−)σ (ΦA) ≤ σ (A) ≤ (1+)σ (ΦA) for all 1 ≤ i ≤ d.  i i i loss of generality, we can assume that S = Ir 0 I  Proof. [of Lemma C.2] The min-max principle for sin- and T = s . Hence SAT ∼ Z when b = 0 and 0 n gular values says that SAT ∼ Gn when b = 1. Notice that the singular σi(A) = max min kAxk, values of an othorgonal matrix are all 1s. Then with Si x∈Si probability 1 − o(1) we have that kxk2=1

where Si runs through all i-dimensional subspace. Ob- (1 − cp)((Ip + o(1))n ≤ Y ≤ (1 + cp)((Ip + o(1))n, b = 0 serve that the range of A is a subspace of dimension at (1 − c )n ≤ Y ≤ (1 + c )n, b = 1 p p most d. It follows from Lemma C.1 that with probabil- so we can recover b from Y with probability 1 − o(1), ity ≥ 1 − δ, |Ip−1| provided that cp ≤ . d 2(Ip+1) (1 − )kAxk ≤ kΦAxk ≤ (1 + )kAxk, ∀x ∈ R . Consider the event E that the algorithm’s output indicates b = 1. Then as in the proof of Theorem 5.4, The claimed result follows immediately from the min- max principle for singular values. 1  dTV (Dn,k,N(0,Ik)) ≥ + o(1). 2 Let A˜ = (ΦA, 0) with zero padding so that A˜ is On the other hand, by Theorem 3.3, a square matrix of dimension O(d/2). Then by the 1−η ˜ dTV (Dn,k,N(0,Ik)) = o(1) when rs ≤ n for some preceding lemma, with probability ≥ 1 − δ, kAkp = 1−η η > 0. This contradiction implies that rs = Ω(n ) kΦAkp is a (1 ± )-approximation of kAkp for all p > 0. for any η > 0. Therefore we have reduced the problem to the case of square matrices. C Reduction to Square Matrices for the upper bound D Proof of Theorem 3.5 n×d 2 k Suppose that A ∈ R (n > d). When n = O(d/ ), Proof. Suppose that |S| = k and S = {(pi, qi)}i=1. By let A˜ = (A, 0) with zero padding so that A˜ is a square symmetry, without loss of generality, we can assume matrix of dimension n. Then kA˜kp = kAkp for all that S does not contain a pair of symmetric entries. p > 0. Otherwise, we can sketch the matrix with Throughout this proof we rewrite L1(S) as L1 and L2(S) as L2. Now, using new notation, let us denote T the marginal distribution of L2 with fixed G by L2(G). 1. ((TtU)(TtV ) )pq = (UV )pq + t and for all 0 0 T Then (p , q ) ∈ S \{(p, q)}, we have ((TtU)(TtV ) )p0q0 = T (UV )p0q0 . dTV (L1, L2) = sup | Pr(A) − Pr(A)| A⊂M( k) L1 L2 0 0 R 2. kU −TtUkF ≤ t , kV −TtV kF ≤ t with probability Z 2 d 1 − O(1/n + dc ), over the randomness of U and = sup Pr(A) − Pr(A|G)p(G)dG k L n2 L V . When this holds, we say that U and V are good, A⊂M(R ) 1 R 2 Z otherwise we say that they are bad.

D Proof of Theorem 3.5

Proof. Suppose that $|S| = k$ and $S = \{(p_i, q_i)\}_{i=1}^k$. By symmetry, without loss of generality, we can assume that $S$ does not contain a pair of symmetric entries. Throughout this proof we write $L_1(S)$ as $L_1$ and $L_2(S)$ as $L_2$, and we denote the marginal distribution of $L_2$ with $G$ fixed by $L_2(G)$. Then
$$d_{TV}(L_1, L_2) = \sup_{A \subset M(\mathbb{R}^k)} \Big|\Pr_{L_1}(A) - \Pr_{L_2}(A)\Big| = \sup_{A \subset M(\mathbb{R}^k)} \Big|\Pr_{L_1}(A) - \int_{\mathbb{R}^{n^2}} \Pr_{L_2}(A\,|\,G)\, p(G)\, dG\Big|$$
$$\le \sup_{A \subset M(\mathbb{R}^k)} \int_{\mathbb{R}^{n^2}} \Big|\Pr_{L_1}(A) - \Pr_{L_2}(A\,|\,G)\Big|\, p(G)\, dG$$
$$\le \sup_{A \subset M(\mathbb{R}^k)} \Big(\int_{F(\delta)} \Big|\Pr_{L_1}(A) - \Pr_{L_2}(A\,|\,G)\Big|\, p(G)\, dG + \int_{F(\delta)^c} \Big|\Pr_{L_1}(A) - \Pr_{L_2}(A\,|\,G)\Big|\, p(G)\, dG\Big)$$
(D.2) $\displaystyle \le \sup_{G \in F(\delta)} d_{TV}(L_1, L_2(G)) + 2\Pr\{F(\delta)^c\}$,

where $F(\delta) = \{G \in \mathbb{R}^{n\times n} : |G_{p_i,q_i}| \le \delta,\ \forall i = 1, \ldots, k\}$, $\Pr\{F(\delta)^c\}$ is the probability of the complement of $F(\delta)$ under the distribution of $G$, and we choose $\delta = n^{1/4}\gamma$. Recalling the p.d.f. of a Gaussian random variable and that $k \le n^2$, it follows from a union bound that

(D.3) $\Pr\{F(\delta)^c\} \le k e^{-\delta^2/(2\gamma^2)} = k e^{-n^{1/2}/2} \le n^{-3}$.

Now we examine $d_{TV}(L_1, L_2(G))$ with $G \in F(\delta)$. For notational convenience, let $\xi = (\xi_1, \ldots, \xi_k)^T = (G_{p_1,q_1}, \ldots, G_{p_k,q_k})^T$ and define $\xi^{(i)} = (\xi_1, \ldots, \xi_i, 0, \ldots, 0)^T$. Since, conditioned on $G$, the distribution $L_2(G)$ is $L_1$ shifted by $\xi$, applying the triangle inequality to a telescoping sum gives

(D.4) $\displaystyle d_{TV}(L_1, L_2(G)) = \sup_{A \subset M(\mathbb{R}^k)} \big|\Pr_{L_1}(A) - \Pr_{L_2(G)}(A)\big| = \sup_{A \subset M(\mathbb{R}^k)} \big|\Pr_{L_1}(A) - \Pr_{L_1}(A - \xi)\big|$

(D.5) $\displaystyle \le \sup_{A \subset M(\mathbb{R}^k)} \sum_{i=1}^k \big|\Pr_{L_1}(A - \xi^{(i-1)}) - \Pr_{L_1}(A - \xi^{(i)})\big|$,

where $M(\mathbb{R}^k)$ denotes the canonical Borel algebra on $\mathbb{R}^k$.

To bound (D.5), we need a way of bounding $|\Pr_{L_1}(A) - \Pr_{L_1}(A + t e_i)|$ for a value $t$ with $|t| \le \delta$. In this case, we say that we perturb the single entry $(p, q) := (p_i, q_i)$ of $UV^T$ by $t$ while fixing the remaining $k - 1$ entries. We claim that there exists a mapping $T_t : \mathbb{R}^{n\times d} \times \mathbb{R}^{n\times d} \to \mathbb{R}^{n\times d} \times \mathbb{R}^{n\times d}$ (we defer the construction to the end of the proof) for which the following three properties hold:

1. $((T_tU)(T_tV)^T)_{pq} = (UV^T)_{pq} + t$, and $((T_tU)(T_tV)^T)_{p'q'} = (UV^T)_{p'q'}$ for all $(p', q') \in S \setminus \{(p, q)\}$.

2. $\|U - T_tU\|_F \le t'$ and $\|V - T_tV\|_F \le t'$ with probability $1 - O(1/n^2 + dc^d)$ over the randomness of $U$ and $V$. When this holds, we say that $U$ and $V$ are good; otherwise we say that they are bad.

3. $T_{-t} \circ T_t = \mathrm{id}$.

Property 3 implies that $T_t$ is bijective. Define
$$E(x) = \{(U, V) : (UV^T)|_S = x\},$$
$$E_{good}(x) = \{(U, V) \in E(x) : (U, V) \text{ is good}\}, \qquad E_{bad}(x) = \{(U, V) \in E(x) : (U, V) \text{ is bad}\}.$$
Then, using these three properties of $T_t$ as well as the triangle inequality, and letting $p(U)$, $p(V)$ be the p.d.f.'s of $U$ and $V$, that is,

(D.6) $\displaystyle p(U) = \frac{1}{(2\pi)^{nd/2}} \exp\Big(-\frac{\|U\|_F^2}{2}\Big), \qquad p(V) = \frac{1}{(2\pi)^{nd/2}} \exp\Big(-\frac{\|V\|_F^2}{2}\Big),$

we have that
$$\Big|\Pr_{L_1}(A) - \Pr_{L_1}(A + t e_i)\Big| = \Big|\int_{E(A)} p(U)p(V)\, dU\, dV - \int_{E(A - t e_i)} p(U)p(V)\, dU\, dV\Big|$$
$$\le \Big|\int_{E_{good}(A)} p(U)p(V)\, dU\, dV - \int_{E_{good}(A)} p(T_tU)p(T_tV)\, dU\, dV\Big| + \int_{E_{bad}(A)} \big(p(U)p(V) + p(T_tU)p(T_tV)\big)\, dU\, dV$$
$$\le \int_{E_{good}(A)} |p(U) - p(T_tU)|\, p(V)\, dU\, dV + \int_{E_{good}(A)} p(T_tU)\, |p(V) - p(T_tV)|\, dU\, dV$$
(D.7) $\displaystyle \quad + O\Big(\frac{1}{n^2} + dc^d\Big).$

Here we used properties 1 and 3 to rewrite the integral over $E(A - t e_i)$ as an integral of $p(T_tU)p(T_tV)$ over $E(A)$, and property 2 to bound the contribution of the bad pairs by the term in (D.7). Using (D.6),

(D.8) $\displaystyle |p(U) - p(T_tU)| = p(U)\cdot\Big|1 - \exp\Big(\frac{\|U\|_F^2 - \|T_tU\|_F^2}{2}\Big)\Big|.$

Notice that $\|U\|_F^2 \sim \chi^2(nd)$, where $\chi^2(nd)$ denotes the chi-squared distribution with $nd$ degrees of freedom, and so by a tail bound for the $\chi^2$-distribution [31, Lemma 1], $\|U\|_F^2 \le 6nd - t'$ (recall that $t' = 1/n^4$) with probability at least $1 - e^{-nd} \ge 1 - n^{-3}$. When this happens, it follows for good $U$, from the triangle inequality, property 2 of $T_t$, and the fact that $t' = 1/n^4$, that
$$\big|\|U\|_F^2 - \|T_tU\|_F^2\big| = \big(\|U\|_F + \|T_tU\|_F\big)\big|\|U\|_F - \|T_tU\|_F\big| \le \big(2\|U\|_F + \|U - T_tU\|_F\big)\|U - T_tU\|_F \le 2\sqrt{6nd}\cdot t' \le 6n^{-3}.$$
Using $|1 - e^x| \le 2|x|$ for $|x| < 1$ and combining with (D.8), we have

(D.9) $|p(U) - p(T_tU)| \le p(U)\cdot 12n^{-3}.$

Similarly, $\|V\|_F^2 \le 6nd - t'$ with probability $\ge 1 - n^{-3}$, and when this happens, $|p(V) - p(T_tV)| \le p(V)\cdot 12n^{-3}$. It then follows that

(D.10) $\displaystyle \int_{E_{good}(A)} |p(U) - p(T_tU)|\, p(V)\, dU\, dV = O\Big(\frac{1}{n^3}\Big),$

(D.11) $\displaystyle \int_{E_{good}(A)} p(T_tU)\, |p(V) - p(T_tV)|\, dU\, dV = O\Big(\frac{1}{n^3}\Big).$

Plugging (D.10) and (D.11) into (D.7) yields that

(D.12) $\displaystyle \Big|\Pr_{L_1}(A) - \Pr_{L_1}(A + t e_i)\Big| = O\Big(\frac{1}{n^2} + dc^d\Big),$

which, combined with (D.5), (D.2), and (D.3), finally leads to
$$d_{TV}(L_1, L_2) \le \sup_{G \in F(\delta)} d_{TV}(L_1, L_2(G)) + 2\Pr\{F(\delta)^c\} = O\big(k(n^{-2} + dc^d)\big).$$

Construction of $T_t$. Now we construct $T_t$. Suppose the entry to be perturbed is $(p, q)$.

Case 1a. Suppose that the $p$-th row contains $s \le d$ entries read, say at columns $q_1, \ldots, q_s$. Without loss of generality we can assume that $s = d$: we preserve the entries in $S$ (perturbing one of them) and additionally preserve $d - s$ arbitrary further entries in the $p$-th row. Then we have ($U_i$ denotes the $i$-th row of $U$ and $V_i$ the $i$-th row of $V$, written as a column vector)
$$U_p\,\big(V_{q_1}\ V_{q_2}\ \cdots\ V_{q_d}\big) = x,$$
and we want to construct $U_p'$ such that
$$U_p'\,\big(V_{q_1}\ V_{q_2}\ \cdots\ V_{q_d}\big) = x + \Delta x,$$
where $\Delta x$ has the single nonzero entry $t$ in the coordinate corresponding to $q$, so that $\|\Delta x\| = |t| \le \delta$. Hence property 1 automatically holds, and

(D.13) $(U_p' - U_p)\,\big(V_{q_1}\ V_{q_2}\ \cdots\ V_{q_d}\big) = \Delta x.$

With probability 1, the matrix $\tilde V := (V_{q_1}, V_{q_2}, \ldots, V_{q_d})$ has rank $d$. Hence we can solve for $U_p' - U_p$ uniquely, and
$$\|U_p' - U_p\| \le \frac{\|\Delta x\|}{\sigma_d(\tilde V)} \le \frac{\delta}{\sigma_d(\tilde V)}.$$
Using the tail bound on the minimum singular value given in [42], we have $\Pr\{\sigma_d(\tilde V) \le \epsilon/\sqrt{d}\} \le C\epsilon + c^d$, where $C > 0$ and $0 < c < 1$ are absolute constants. Choosing $\epsilon = n^{-4}$ and recalling that $d \le n$, we see that with probability at least $1 - Cn^{-4} - c^d$ it holds that $\sigma_d(\tilde V) \ge 1/n^{9/2}$, and thus
$$\|U_p' - U_p\| \le n^{9/2}\delta = n^{9/2 + 1/4}\gamma \le n^{-5}.$$
This proves property 2 in this case. To show property 3 in this case, we can show something stronger, namely that $T_{-t} \circ T_t = \mathrm{id}$, i.e., this step is invertible: if we replace $\Delta x$ with $-\Delta x$ in (D.13), the solution $U_p' - U_p$ changes sign as well.

Case 1b. Suppose that the $q$-th column contains $s \le d$ entries read. Symmetrically to Case 1a, we keep $U' = U$ and obtain $\|V_q' - V_q\| \le n^{-5}$ with probability $\ge 1 - Cn^{-4} - c^d$. Invertibility holds as in Case 1a.

Case 2. Suppose that there are more than $d$ entries read in both the $p$-th row and the $q$-th column. Define
$$J = \{i \in [n] : \text{the } i\text{-th column has at most } d \text{ entries contained in } S\},$$
$$C_r = \{i \in [n] : (r, i) \in S\}, \qquad R_c = \{i \in [n] : (i, c) \in S\}.$$
Call the columns with index in $J$ good columns and those with index in $J^c$ bad columns. Note that $|J^c| \le d$, since the total number of entries in $S$ is at most $d^2$. Take the columns in $J^c \cap C_p$, and notice that $q \in J^c \cap C_p$. As in Case 1a, we can change $U_p$ to $U_p'$ such that $U'V^T$ agrees with the perturbed entry of $UV^T$ at $(p, q)$ and keeps the entries of $UV^T$ the same at $(p, q')$ for all $q' \in (J^c \cap C_p) \setminus \{q\}$, since $|(J^c \cap C_p) \setminus \{q\}| \le d - 1$.

However, this new choice $U_p'$ possibly causes $(U'V^T)_{p,i} \ne (UV^T)_{p,i}$ for $i \in C_p \cap J$. For each $i \in C_p \cap J$, we therefore also change $V_i$ to a vector $V_i'$, without affecting the entries read in any bad column. For each good column $i \in C_p \cap J$, the matrix $\tilde U$ used as in Case 1 (whose rows are $U_r$ for $r \in R_i$, with $U_p$ replaced by $U_p'$) is no longer i.i.d. Gaussian, because one of its rows has been changed; this change has $\ell_2$ norm at most $n^{-5}$ (the guarantee of property 2 in Case 1) with probability at least $1 - 4n^{-3} - c^d$. Since the minimum singular value is a 1-Lipschitz function of matrices, it is perturbed by at most $n^{-5}$. Hence for each good column $i$, with probability at least $1 - 4n^{-3} - c^d$, we have $\|V_i - V_i'\| \le n^{-5}$. Since there are at most $d$ good columns involved, a union bound gives that with probability at least $1 - 4/n^2 - dc^d$ we have $\|V - V'\|_F \le n^{-4}$. This concludes the proof of properties 1 and 2 in Case 2.

The final step is to verify property 3. Suppose that $T_t(U, V) = (U', V')$ and $T_{-t}(U', V') = (U'', V'')$; we want to show that $U'' = U$ and $V'' = V$. Observe that $V_i = V_i' = V_i''$ for $i \in J^c \cap C_p = \{q_1, \ldots, q_d\}$, so we have
$$(U_p' - U_p)\,\big(V_{q_1}\ \cdots\ V_{q_d}\big) = \Delta x, \qquad (U_p'' - U_p')\,\big(V_{q_1}\ \cdots\ V_{q_d}\big) = -\Delta x.$$
Adding the two equations yields
$$(U_p'' - U_p)\,\big(V_{q_1}\ \cdots\ V_{q_d}\big) = 0,$$
and thus $U_p'' = U_p$ provided that $(V_{q_1}, V_{q_2}, \ldots, V_{q_d})$ is invertible, which holds with probability 1. Since $U_i = U_i' = U_i''$ for all $i \ne p$, we have $U'' = U$.

Next we show that $V_i'' = V_i$ for each $i \in C_p \cap J$. Suppose that $R_i = \{p_1, \ldots, p_d\} \ni p$. Similarly to the above, we have
$$\begin{pmatrix} U_{p_1} \\ \vdots \\ U_p' \\ \vdots \\ U_{p_d} \end{pmatrix}(V_i' - V_i) = \begin{pmatrix} 0 \\ \vdots \\ -\langle U_p' - U_p,\ V_i\rangle \\ \vdots \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} U_{p_1} \\ \vdots \\ U_p'' \\ \vdots \\ U_{p_d} \end{pmatrix}(V_i'' - V_i') = \begin{pmatrix} 0 \\ \vdots \\ -\langle U_p'' - U_p',\ V_i'\rangle \\ \vdots \\ 0 \end{pmatrix}.$$
Adding the two equations and recalling that $U_p'' = U_p$ and $U_i = U_i' = U_i''$ for all $i \ne p$, we obtain that
$$\begin{pmatrix} U_{p_1} \\ \vdots \\ U_p \\ \vdots \\ U_{p_d} \end{pmatrix}(V_i'' - V_i) + \begin{pmatrix} 0 \\ \vdots \\ (U_p' - U_p)(V_i' - V_i) \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ \langle U_p' - U_p,\ V_i' - V_i\rangle \\ \vdots \\ 0 \end{pmatrix},$$
and the last two terms cancel, whence it follows immediately that $V_i'' = V_i$ provided that $(U_{p_1}^T, \ldots, U_{p_d}^T)$ is invertible, which again holds with probability 1. Together with $V_i'' = V_i' = V_i$ for all the remaining columns $i$, we conclude that $V'' = V$. $\Box$

E Proof of Theorem 4.1

We say that two cycles $\sigma = (\{i_s\}, \{j_s\})$ and $\tau = (\{i_s'\}, \{j_s'\})$ are $(m_1, m_2)$-disjoint if $|\{i_s\}\,\Delta\,\{i_s'\}| = 2m_1$ and $|\{j_s\}\,\Delta\,\{j_s'\}| = 2m_2$ (as multisets), denoted by $|\sigma\Delta\tau| = (m_1, m_2)$.

Proof. Let $A = U\Sigma V^T$ be the SVD of $A$, and let $G$ and $H$ be random matrices with i.i.d. $N(0,1)$ entries. By rotational invariance, $GAH^T$ is identically distributed as $G\Sigma H^T$. Let $q = p/2$, let $k = n^{1-2/p}$, and let $\tilde X$ be the upper-left $k\times k$ block of $G\Sigma H^T$. It is clear that
$$\tilde X_{s,t} = \sum_{i=1}^n \sigma_i G_{i,s} H_{i,t}.$$
Define
$$Y = \frac{1}{|C|}\sum_{\sigma\in C} \tilde X_\sigma,$$
where a cycle $\sigma = (\{i_s\}_{s=1}^q, \{j_s\}_{s=1}^q)$ defines $\tilde X_\sigma = \prod_{s=1}^q \tilde X_{i_s,j_s}\tilde X_{i_{s+1},j_s}$ with the convention $i_{q+1} = i_1$. Expanding,
$$\tilde X_\sigma = \sum_{\substack{\ell_1,\ldots,\ell_q \\ m_1,\ldots,m_q}} \prod_{s=1}^q \sigma_{\ell_s}\sigma_{m_s}\, G_{\ell_s,i_s} H_{\ell_s,j_s} G_{m_s,i_{s+1}} H_{m_s,j_s}.$$
It is easy to see that $\mathbb{E}\tilde X_\sigma = \sum_i \sigma_i^{2q} = \|A\|_p^p$ (the expectation vanishes unless all the $\{\ell_s\}$ and $\{m_s\}$ are the same), and thus $\mathbb{E}Y = \|A\|_p^p$. Now we compute $\mathbb{E}Y^2$:
$$\mathbb{E}Y^2 = \frac{1}{|C|^2}\sum_{m_1=0}^q\sum_{m_2=0}^q\ \sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau| = (m_1,m_2)}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau).$$
Suppose that $|\sigma\Delta\tau| = (m_1, m_2)$; then
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) = \sum_{\substack{\ell_1,\ldots,\ell_q,\ m_1,\ldots,m_q \\ \ell_1',\ldots,\ell_q',\ m_1',\ldots,m_q'}} \prod_{i=1}^q \sigma_{\ell_i}\sigma_{m_i}\sigma_{\ell_i'}\sigma_{m_i'} \cdot \mathbb{E}\Big\{\prod_{s=1}^q G_{\ell_s,i_s}G_{m_s,i_{s+1}}G_{\ell_s',i_s'}G_{m_s',i_{s+1}'}\Big\} \cdot \mathbb{E}\Big\{\prod_{s=1}^q H_{\ell_s,j_s}H_{m_s,j_s}H_{\ell_s',j_s'}H_{m_s',j_s'}\Big\}.$$
It is not difficult to see (see Appendix G for details) that
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) \lesssim_{m_1,m_2,p} \begin{cases} \|A\|_2^{2(2q-2m_1-1)}\,\|A\|_{2(m_2+1)}^{2(m_2+1)}\,\|A\|_4^{4(m_1-m_2)}, & m_2 \le m_1 \le q-1, \\ \|A\|_2^{2(2q-2m_2-1)}\,\|A\|_{2(m_1+1)}^{2(m_1+1)}\,\|A\|_4^{4(m_2-m_1)}, & m_1 \le m_2 \le q-1, \\ \|A\|_4^{4(q-m_2-1)}\,\|A\|_{2(m_2+1)}^{4(m_2+1)}, & m_1 = q,\ m_2 \le q-1, \\ \|A\|_4^{4(q-m_1-1)}\,\|A\|_{2(m_1+1)}^{4(m_1+1)}, & m_2 = q,\ m_1 \le q-1, \\ \|A\|_{2q}^{4q}, & m_1 = m_2 = q. \end{cases}$$
By Hölder's inequality, $\|A\|_2^2 \le n^{1-2/p}\|A\|_p^2$ and $\|A\|_{2(m+1)}^{2(m+1)} \le n^{1-2(m+1)/p}\|A\|_p^{2(m+1)}$ for $2(m+1) \le p$. Thus,
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) \lesssim_{m_1,m_2,p} \begin{cases} n^{p-m_1-m_2-2}\,\|A\|_p^{2p}, & m_1, m_2 \le q-1, \\ n^{q-m_1-1}\,\|A\|_p^{2p}, & m_2 = q,\ m_1 \le q-1, \\ n^{q-m_2-1}\,\|A\|_p^{2p}, & m_1 = q,\ m_2 \le q-1, \\ \|A\|_p^{2p}, & m_1 = m_2 = q. \end{cases}$$
There are $O_q(k^{q+m_1}\cdot k^{q+m_2}) = O_p(k^{p+m_1+m_2})$ pairs of $(m_1, m_2)$-disjoint cycles and $|C| = \Theta(k^p)$, so
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=(m_1,m_2)}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau) \le C_{m_1+m_2,p}\,\frac{n^{p-m_1-m_2-2}}{k^{p-m_1-m_2}}\,\|A\|_p^{2p} \le C_{m_1+m_2,p}\,\|A\|_p^{2p}, \qquad m_1, m_2 \le q-1,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=(m_1,m_2)}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau) \le C_{m_1+m_2,p}\,\frac{n^{q-m_1-1}}{k^{q-m_1}}\,\|A\|_p^{2p} \le C_{m_1,p}\,\|A\|_p^{2p}, \qquad m_2 = q,\ m_1 \le q-1,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=(m_1,m_2)}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau) \le C_{m_1+m_2,p}\,\frac{n^{q-m_2-1}}{k^{q-m_2}}\,\|A\|_p^{2p} \le C_{m_2,p}\,\|A\|_p^{2p}, \qquad m_1 = q,\ m_2 \le q-1,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=(m_1,m_2)}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau) \le \|A\|_p^{2p}, \qquad m_1 = m_2 = q.$$
Therefore $\mathbb{E}Y^2 \le C_p\|A\|_p^{2p}$ for some constant $C_p$ dependent on $p$ only. Since the algorithm outputs $Z = \frac{1}{N}\sum_{i=1}^N Y_i$, where the $Y_i$'s are i.i.d. copies of $Y$ and $N = \Omega_p(\epsilon^{-2})$, it follows that $\mathbb{E}Z = \mathbb{E}Y = \|A\|_p^p$ and
$$\mathrm{Var}(Z) = \frac{\mathrm{Var}(Y)}{N} \le \frac{\mathbb{E}Y^2}{N} \le \frac{\epsilon^2}{4}\,\|A\|_p^{2p}.$$
Finally, by Chebyshev's inequality,
$$\Pr\Big\{\big|Z - \|A\|_p^p\big| > \epsilon\|A\|_p^p\Big\} \le \frac{\mathrm{Var}(Z)}{\epsilon^2\|A\|_p^{2p}} \le \frac{1}{4},$$
which implies the correctness of the algorithm. The algorithm only reads the upper-left $k\times k$ block of each $G_iAH_i^T$, so it can be maintained in $O(Nk^2) = O_p(\epsilon^{-2}n^{2-4/p})$ space. $\Box$
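The estimator just analyzed is easy to run end-to-end. The following sketch (ours, not the paper's code) takes the cycle product $\tilde X_\sigma = \prod_s \tilde X_{i_s,j_s}\tilde X_{i_{s+1},j_s}$ exactly as written in the expansion above; the function name and the choices of $k$ and $N$ are ours, and cycles are enumerated by brute force, which is feasible only for small $k$ and $p$.

```python
import itertools
import numpy as np

# Illustrative bilinear estimator for even p (ad hoc constants, brute force).
def even_p_estimate(A, p, k, N, rng):
    n = A.shape[0]
    q = p // 2
    Ys = []
    for _ in range(N):
        G, H = rng.standard_normal((k, n)), rng.standard_normal((k, n))
        X = G @ A @ H.T                      # the k x k sketch block of G A H^T
        vals = []
        for ii in itertools.permutations(range(k), q):
            for jj in itertools.permutations(range(k), q):
                prod = 1.0
                for s in range(q):           # X~_sigma with i_{q+1} = i_1
                    prod *= X[ii[s], jj[s]] * X[ii[(s + 1) % q], jj[s]]
                vals.append(prod)
        Ys.append(np.mean(vals))             # one copy of Y
    return float(np.mean(Ys))                # Z, the average of N copies

rng = np.random.default_rng(3)
n, p = 30, 4
A = rng.standard_normal((n, n)) / np.sqrt(n)
est = even_p_estimate(A, p, k=6, N=400, rng=rng)   # k = 6 is about n^(1-2/p)
exact = float(np.sum(np.linalg.svd(A, compute_uv=False) ** p))
print(f"estimate {est:.3f} vs ||A||_p^p {exact:.3f}")
```

One can check by the same index-collapsing argument as in the proof that each $Y$ here is an unbiased estimate of $\|A\|_p^p$, so the printed values should agree up to the variance permitted by the analysis.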

F Algorithms for Odd p

Given integers $k$ and $p < k$, call a sequence $(i_1, \ldots, i_p)$ a cycle if $i_j \in [k]$ for all $j$ and $i_{j_1} \ne i_{j_2}$ for all $j_1 \ne j_2$. On a $k\times k$ matrix $A$, each cycle $\sigma$ defines a product
$$A_\sigma = \prod_{i=1}^p A_{\sigma_i, \sigma_{i+1}},$$
where we adopt the convention that $\sigma_{p+1} = \sigma_1$. Let $C$ denote the set of cycles. Call two cycles $\sigma, \tau \in C$ $m$-disjoint if $|\sigma\Delta\tau| = 2m$, where $\sigma$ and $\tau$ are viewed as multisets.

Algorithm 2 The sketching algorithm for odd $p \ge 3$.
Input: $n$, $\epsilon > 0$, odd integer $p \ge 3$, and PSD $A \in \mathbb{R}^{n\times n}$.
1: $N \leftarrow \Omega(\epsilon^{-2})$
2: Let $\{G_i\}_{i=1}^N$ be independent $n\times n$ matrices with i.i.d. $N(0,1)$ entries
3: Maintain each $G_iAG_i^T$, $i = 1, \ldots, N$
4: Compute $Z$ as defined in (F.14)
5: return $Z$

Theorem F.1. With probability $\ge 3/4$, the output $Z$ returned by Algorithm 2 satisfies $(1-\epsilon)\|A\|_p^p \le Z \le (1+\epsilon)\|A\|_p^p$ when $A$ is positive semi-definite. The algorithm is a bilinear sketch with $r\cdot s = O_p(\epsilon^{-2}n^{2-4/p})$.
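Before turning to the proof, here is an illustrative end-to-end rendering of Algorithm 2 (ours, not the paper's code): the function names are ours, the constants behind $N = \Omega(\epsilon^{-2})$ and $k = n^{1-2/p}$ are ad hoc, and cycles are enumerated by brute force, which is feasible only for small $k$ and $p$.

```python
import itertools
import numpy as np

def cycle_estimate(B, p):
    """Y = (1/|C|) sum_sigma B_sigma over all length-p cycles sigma,
    where B_sigma = prod_j B[i_j, i_{j+1}] with the convention i_{p+1} = i_1."""
    k = B.shape[0]
    vals = []
    for sigma in itertools.permutations(range(k), p):
        prod = 1.0
        for j in range(p):
            prod *= B[sigma[j], sigma[(j + 1) % p]]
        vals.append(prod)
    return float(np.mean(vals))

def algorithm2(A, p, eps, rng):
    n = A.shape[0]
    k = max(p + 1, int(np.ceil(n ** (1 - 2 / p))))  # block size; need k > p
    N = int(np.ceil(64 / eps**2))                   # N = Omega(eps^{-2}), ad hoc constant
    Ys = []
    for _ in range(N):
        G = rng.standard_normal((k, n))   # only k rows of G_i matter, since the
        B = G @ A @ G.T                   # algorithm reads just the k x k block
        Ys.append(cycle_estimate(B, p))
    return float(np.mean(Ys))             # Z as in (F.14)

rng = np.random.default_rng(1)
n, p, eps = 40, 3, 0.5
M = rng.standard_normal((n, n))
A = M @ M.T / n                           # a PSD test matrix
print("estimate:", algorithm2(A, p, eps, rng))
print("exact   :", float(np.sum(np.linalg.eigvalsh(A) ** p)))
```

On small inputs the printed estimate typically lands within the $(1\pm\epsilon)$ window of the true value; the $\Theta(k^p)$ cost of cycle enumeration is among what the $O_p(\cdot)$ notation in Theorem F.1 absorbs.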

Proof. [of Theorem F.1] Since $A$ is symmetric, it can be written as $A = O\Lambda O^T$, where $\Lambda$ is a diagonal matrix and $O$ an orthogonal matrix. Let $G$ be a random matrix with i.i.d. $N(0,1)$ entries. By rotational invariance, $GAG^T$ is identically distributed as $G\Lambda G^T$. Let $k = n^{1-2/p}$ and let $\tilde X$ be the upper-left $k\times k$ block of $G\Lambda G^T$. It is clear that
$$\tilde X_{s,t} = \sum_{i=1}^n \lambda_i G_{i,s} G_{i,t}.$$
Define
$$Y = \frac{1}{|C|}\sum_{\sigma\in C}\tilde X_\sigma.$$
Suppose that $\sigma = (i_1, \ldots, i_p)$; then
$$\tilde X_\sigma = \sum_{j_1,\ldots,j_p = 1}^n \lambda_{j_1}\cdots\lambda_{j_p} \prod_{\ell=1}^p G_{j_\ell,i_\ell} G_{j_\ell,i_{\ell+1}}.$$
It is easy to see that $\mathbb{E}\tilde X_\sigma = \sum_i \lambda_i^p = \|A\|_p^p$ (the expectation vanishes unless all the $j_\ell$'s are the same), and thus $\mathbb{E}Y = \|A\|_p^p$. Now we compute $\mathbb{E}Y^2$:
$$\mathbb{E}Y^2 = \frac{1}{|C|^2}\sum_{m=0}^p\ \sum_{\substack{\sigma,\tau\in C \\ \sigma,\tau\ m\text{-disjoint}}} \mathbb{E}(\tilde X_\sigma \tilde X_\tau).$$
It is not difficult to see (see Appendix G for details) that when $m \le p-2$, it holds for $m$-disjoint $\sigma$ and $\tau$ that
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) \lesssim_{m,p} \|A\|_2^{2(p-m-1)}\,\|A\|_{2(m+1)}^{2(m+1)}.$$
By Hölder's inequality, $\|A\|_2^2 \le n^{1-2/p}\|A\|_p^2$ and
$$\|A\|_{2(m+1)}^{2(m+1)} \le n^{1-\frac{2(m+1)}{p}}\,\|A\|_p^{2(m+1)}, \qquad 2(m+1) \le p.$$
Thus,
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) \lesssim_{m,p} \begin{cases} n^{p-m-2}\,\|A\|_p^{2p}, & 2(m+1) \le p, \\ k^{p-m-1}\,\|A\|_p^{2p}, & \text{otherwise}. \end{cases}$$
And when $\sigma, \tau$ are $(p-1)$- or $p$-disjoint, $\mathbb{E}(\tilde X_\sigma \tilde X_\tau) \le \|A\|_p^{2p}$. There are $O_p(k^{p+m})$ pairs of $m$-disjoint cycles and $|C| = \Theta(k^p)$. Hence
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=2m}} \mathbb{E}(\tilde X_\sigma\tilde X_\tau) \le C_{m,p}\,\frac{n^{p-m-2}}{k^{p-m}}\,\|A\|_p^{2p} \le C_{m,p}'\,\|A\|_p^{2p}, \qquad m \le \tfrac{p}{2} - 1,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=2m}} \mathbb{E}(\tilde X_\sigma\tilde X_\tau) \le C_{m,p}\,\frac{k^{p-m-1}}{k^{p-m}}\,\|A\|_p^{2p} \le C_{m,p}\,\|A\|_p^{2p}, \qquad \tfrac{p}{2} - 1 < m \le p-2,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=2m}} \mathbb{E}(\tilde X_\sigma\tilde X_\tau) \le \frac{1}{k}\,\|A\|_p^{2p}, \qquad m = p-1,$$
$$\frac{1}{|C|^2}\sum_{\substack{\sigma,\tau\in C \\ |\sigma\Delta\tau|=2m}} \mathbb{E}(\tilde X_\sigma\tilde X_\tau) \le \|A\|_p^{2p}, \qquad m = p.$$
Therefore $\mathbb{E}Y^2 \le C_p\|A\|_p^{2p}$ for some constant $C_p$ dependent on $p$ only. Hence, if we take multiple copies of this distribution as stated in Algorithm 2 and define

(F.14) $\displaystyle Z = \frac{1}{N}\sum_{i=1}^N \frac{1}{|C|}\sum_{\sigma\in C} (G_iAG_i^T)_\sigma =: \frac{1}{N}\sum_{i=1}^N Y_i,$

then the $Y_i$'s are i.i.d. copies of $Y$. It follows that $\mathbb{E}Z = \mathbb{E}Y = \|A\|_p^p$ and, for $N = \Omega_p(\epsilon^{-2})$,
$$\mathrm{Var}(Z) = \frac{\mathrm{Var}(Y)}{N} \le \frac{\mathbb{E}Y^2}{N} \le \frac{\epsilon^2}{4}\,\|A\|_p^{2p}.$$
Finally, by Chebyshev's inequality,
$$\Pr\Big\{\big|Z - \|A\|_p^p\big| > \epsilon\|A\|_p^p\Big\} \le \frac{\mathrm{Var}(Z)}{\epsilon^2\|A\|_p^{2p}} \le \frac{1}{4},$$
which implies the correctness of the algorithm. It is easy to observe that the algorithm only reads the upper-left $k\times k$ block of each $G_iAG_i^T$; thus it can be maintained in $O(Nk^2) = O_p(\epsilon^{-2}n^{2-4/p})$ space. $\Box$

G Omitted Details in the Proof of Theorem F.1

Suppose that $\sigma = (i_1, \ldots, i_p)$ and $\tau = (j_1, \ldots, j_p)$ are $m$-disjoint. Then
$$\mathbb{E}(\tilde X_\sigma \tilde X_\tau) = \sum_{\substack{\ell_1,\ldots,\ell_p \\ \ell_1',\ldots,\ell_p'}} \lambda_{\ell_1}\cdots\lambda_{\ell_p}\,\lambda_{\ell_1'}\cdots\lambda_{\ell_p'}\ \mathbb{E}\Big\{\prod_{s=1}^p G_{\ell_s,i_s}G_{\ell_s,i_{s+1}}G_{\ell_s',j_s}G_{\ell_s',j_{s+1}}\Big\}.$$
For the expectation to be non-zero, each entry of $G$ that appears must be repeated an even number of times. Hence, if some $i_s$ appears only once among the indices $\{i_s\}$ and $\{j_s\}$, it must hold that $\ell_s = \ell_{s+1}$. Thus, for each summation term, the indices $\{\ell_s\}$ break into a few blocks, in each of which all the $\ell_s$ take the same value; the same holds for $\{\ell_s'\}$. We also need to piece the blocks of $\{\ell_s\}$ together with those of $\{\ell_s'\}$. Hence the whole sum breaks into sums corresponding to different block configurations. For a given configuration, in which $\ell_1, \ldots, \ell_w$ are free variables with multiplicities $r_1, \ldots, r_w$ respectively, the sum is bounded by
$$C\cdot\sum_{\ell_1,\ldots,\ell_w} \lambda_{\ell_1}^{r_1}\cdots\lambda_{\ell_w}^{r_w} \le C\,\|A\|_{r_1}^{r_1}\cdots\|A\|_{r_w}^{r_w},$$
where the constant $C$ depends on the configuration only, and thus can be made to depend only on $m$ and $p$ by taking the maximum over all possible block configurations. Notice that in a configuration all the $r_i$'s are even, $\sum_i r_i = 2p$, and $w \le p - m$. Noting that
$$\|A\|_{r}^{r}\,\|A\|_{s}^{s} \le \|A\|_{r+1}^{r+1}\,\|A\|_{s-1}^{s-1} \quad (r \ge s), \qquad \|A\|_{r+s}^{r+s} \le \|A\|_r^r\,\|A\|_s^s,$$
it is easy to see that the worst-case configuration is $w = p - m$ with $(r_1, \ldots, r_w) = (2(m+1), 2, \ldots, 2)$, giving the bound
$$C\,\|A\|_2^{2(p-m-1)}\,\|A\|_{2(m+1)}^{2(m+1)}.$$
Finally, observe that the number of configurations is a constant depending on $p$ and $m$ only, which gives the claimed variance bound.
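The key identity behind this computation, $\mathbb{E}\tilde X_\sigma = \sum_i \lambda_i^p$, is easy to verify empirically. The following Monte Carlo check (ours, for illustration only) estimates the expected cycle product for a single cycle:

```python
import numpy as np

# Monte Carlo check of E[X~_sigma] = sum_i lambda_i^p for one cycle sigma
# of p distinct indices, where X~ is the k x k block of G^T Lambda G.
rng = np.random.default_rng(2)
n, k, p, trials = 6, 4, 3, 100_000
lam = rng.uniform(0.5, 1.5, size=n)          # eigenvalues of a PSD matrix
sigma = (0, 1, 2)                            # one cycle, with i_{p+1} = i_1

acc = 0.0
for _ in range(trials):
    G = rng.standard_normal((n, k))
    B = G.T @ np.diag(lam) @ G               # distributed as the sketch block
    acc += B[sigma[0], sigma[1]] * B[sigma[1], sigma[2]] * B[sigma[2], sigma[0]]
print("Monte Carlo:", acc / trials, " exact:", float(np.sum(lam**p)))
```

The two printed values agree up to Monte Carlo error of a few percent at this number of trials.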

H Proof of Theorem 5.4

Proof. First we show that $\|X\|_p^p$ and $\|Y\|_p^p$ differ by a constant factor. We have calculated $\|X\|_p^p = (I_p + o(1))n$ in (3.3). Similarly, it holds that
$$\|Y\|_p^p = (J_p + o(1))n,$$
where

(H.15) $\displaystyle J_p = \int_a^b \frac{\sqrt{(b-x)(x-a)}}{4\pi x}\, x^{p/2}\, dx, \qquad a, b = 3 \mp 2\sqrt{2}.$

Extend the definition of $I_p$ and $J_p$ to $p = 0$ by $I_0 = 1$ and $J_0 = 1/2$; this agrees with $\|X\|_0 = \mathrm{rank}(X)$. Now it suffices to show that $I_p \ne J_p$ when $p \ne 2$. Indeed, let us consider the general integral
$$\int_{(1-\sqrt\beta)^2}^{(1+\sqrt\beta)^2} \frac{\sqrt{\big((1+\sqrt\beta)^2 - x\big)\big(x - (1-\sqrt\beta)^2\big)}}{2\pi\beta x}\, x^\gamma\, dx,$$
of which $I_p$ and $J_p$ are the special cases $(\beta, \gamma) = (1, p/2)$ and $(\beta, \gamma) = (2, p/2)$, respectively. Changing the variable $y = (x - (1+\beta))/\sqrt\beta$, the integral above becomes
$$\frac{1}{2\pi}\int_{-2}^{2} \big(\sqrt\beta\, y + (1+\beta)\big)^{\gamma-1}\sqrt{4 - y^2}\, dy.$$
Hence
$$J_p - I_p = \frac{1}{2\pi}\int_{-2}^2 f_p(x)\,\sqrt{4 - x^2}\, dx,$$
where
$$f_p(x) = \big(\sqrt{2}\,x + 3\big)^{\frac{p}{2}-1} - (x + 2)^{\frac{p}{2}-1}.$$
One can verify that $f_p < 0$ on $(-2, 2)$ when $0 < p < 2$, that $f_p > 0$ on $(-2, 2)$ when $p > 2$, and that $f_p = 0$ when $p = 2$: since $(\sqrt 2\,x + 3) - (x + 2) = (\sqrt 2 - 1)x + 1 > 0$ on $(-2, 2)$, the sign of $f_p$ is the sign of the exponent $p/2 - 1$. It is easy to compute that $I_2 = 1$, and the case $p = 0$ is trivial. Therefore $I_p > J_p$ for $p \in [0, 2)$, $I_p < J_p$ for $p \in (2, \infty)$, and $I_p = J_p = 1$ when $p = 2$.

Now, denote by $D_0$ the distribution of $X$ and by $D_1$ that of $Y$, and suppose that the $k$ linear forms of the sketch are $a_1, \ldots, a_k$. Choose $X$ from $D_0$ with probability $1/2$ and from $D_1$ with probability $1/2$, and let $b \in \{0, 1\}$ indicate that $X \sim D_b$. Without loss of generality, we can assume that $a_1, \ldots, a_k$ are orthonormal. Hence the sketch $(\langle a_1, X\rangle, \ldots, \langle a_k, X\rangle) \sim N(0, I_k)$ when $b = 0$, and it is distributed as $D_{n,k}$ when $b = 1$. Let $W$ denote the algorithm's output, a $(1 \pm c_p)$-approximation of $\|X\|_p^p$. Then with probability $3/4 - o(1)$ we have
$$(1 - c_p)(I_p + o(1))n \le W \le (1 + c_p)(I_p + o(1))n \quad \text{if } b = 0,$$
$$(1 - c_p)(J_p + o(1))n \le W \le (1 + c_p)(J_p + o(1))n \quad \text{if } b = 1,$$
so whenever these bounds hold we can recover $b$ from $W$, provided that $c_p \le \frac{|I_p - J_p|}{2(I_p + J_p)}$.

Consider the event $E$ that the algorithm's output indicates $b = 1$. Then $\Pr(E\,|\,X \sim D_0) \le 1/4 + o(1)$ while $\Pr(E\,|\,X \sim D_1) \ge 3/4 - o(1)$. By the definition of total variation distance, $d_{TV}(D_{n,k}, N(0, I_k))$ is at least $|\Pr(E\,|\,X \sim D_0) - \Pr(E\,|\,X \sim D_1)|$, which is at least $\frac{1}{2} + o(1)$. On the other hand, by Theorem 5.3, $d_{TV}(D_{n,k}, N(0, I_k)) \le \frac{1}{4}$ when $k \le c\sqrt n$ for some $c > 0$, a contradiction. Therefore it must hold that $k > c\sqrt n$. $\Box$
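These sign claims are also easy to confirm numerically. The snippet below (ours, for illustration) evaluates $I_p$ and $J_p$ through the transformed integral above:

```python
import numpy as np
from scipy.integrate import quad

# I_p and J_p are the beta = 1 and beta = 2 cases of
# (1/2pi) * int_{-2}^{2} (sqrt(beta)*y + 1 + beta)^(p/2 - 1) * sqrt(4 - y^2) dy.
def moment(p, beta):
    f = lambda y: (np.sqrt(beta) * y + 1 + beta) ** (p / 2 - 1) * np.sqrt(4 - y * y)
    # For beta = 1, p = 0 the integrand has an integrable endpoint
    # singularity at y = -2, which the adaptive quadrature handles.
    val, _ = quad(f, -2.0, 2.0, limit=200)
    return val / (2 * np.pi)

for p in [0, 1, 2, 3, 4]:
    Ip, Jp = moment(p, 1.0), moment(p, 2.0)
    print(f"p = {p}: I_p = {Ip:.4f}, J_p = {Jp:.4f}")
# Expected pattern: I_0 = 1, J_0 = 1/2; I_p > J_p for p < 2;
# I_2 = J_2 = 1; and I_p < J_p for p > 2.
```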