
Applications of the Shannon-Hartley Theorem to Data Streams and Sparse Recovery

Eric Price (MIT)        David P. Woodruff (IBM Almaden)

Abstract—The Shannon-Hartley theorem bounds the maximum rate at which information can be transmitted over a Gaussian channel in terms of the ratio of the signal to noise power. We show two unexpected applications of this theorem in computer science: (1) we give a much simpler proof of an Ω(n^{1−2/p}) bound on the number of linear measurements required to approximate the p-th frequency moment in a data stream, and show a new distribution which is hard for this problem, (2) we show that the number of measurements needed to solve the k-sparse recovery problem on an n-dimensional vector x with the C-approximate ℓ2/ℓ2 guarantee is Ω(k log(n/k)/log C). We complement this result with an almost matching O(k log* k log(n/k)/log C) upper bound.

I. INTRODUCTION

Let S be a real-valued random variable with E[S²] = τ². Consider the random variable S + T, where T ∼ N(0, σ²) is additive white Gaussian noise of variance σ². The Shannon-Hartley theorem states that

  I(S; S + T) ≤ (1/2) · log(1 + τ²/σ²),

where I(X; Y) = h(X) − h(X|Y) is the mutual information between X and Y, and h(X) = −∫ f(x) log f(x) dx is the differential entropy of a random variable X with probability density function f. We show two unexpected applications of the Shannon-Hartley theorem in computer science, the first to estimating frequency moments in a data stream, and the second to approximating a vector by a sparse vector.

A. Sketching Frequency Moments

In the data stream literature, a line of work has considered the problem of estimating the frequency moments F_p(x) = ‖x‖_p^p = Σ_{i=1}^n |x_i|^p, where x ∈ ℝ^n and p ≥ 2. One usually wants a linear sketch, that is, we choose a random matrix A ∈ ℝ^{m×n} from a certain distribution, for m ≪ n, and compute Ax, from which one can output a constant-factor approximation to F_p(x) with high probability. Linearity is crucial for distributed computation, formalized in the MUD (Massive Unordered Distributed) model [9]. In this model the vector x is split into pieces x^1, ..., x^r, each of which is handled by a different machine. The machines individually compute Ax^1, ..., Ax^r, and an aggregation function combines these to compute Ax and estimate F_p(x). Linearity is also needed for network aggregation, which usually follows a bottom-up approach [18]: given a routing tree where the nodes represent sensors, starting from the leaves the aggregation propagates upwards to the root. We refer to the rows of A as measurements.

Alon, Matias, and Szegedy [2] initiated the line of work on frequency moments. There is a long line of upper bounds on the number of linear measurements; we refer the reader to the most recent works [3], [11] and the references therein. Similarly, we refer the reader to the most recent lower bounds [16], [22] and the references therein. The best upper and lower bounds for obtaining a (1 + ε)-approximation with probability at least 1 − δ have the form n^{1−2/p} · poly(ε^{−1} log(nδ^{−1})).

The existing lower bounds are rather involved, using the direct sum paradigm for information complexity [5]. Moreover, they apply to the number of bits rather than the number of linear measurements, and typically do not provide an explicit distribution which is hard. These issues can be resolved using techniques from [4], [20] and [15]. The resulting hard distribution is: choose x ∈ {−1, 0, 1}^n uniformly at random, and then with probability 1/2, replace a random coordinate x_i of x with a value in Θ(n^{1/p}). F_p(x) changes by a constant factor in the two cases, and so the approximation algorithm must determine which case we are in.

We instead consider the following continuous distribution: choose x to be a random N(0, I_n) vector, i.e., a vector whose coordinates are independent standard normal random variables. With probability 1/2, replace a random coordinate x_i of x with a value in Θ(n^{1/p}). The use of Gaussians instead of signs allows us to derive our lower bound almost immediately from the Shannon-Hartley theorem.
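As a quick illustration of why this continuous distribution separates the two cases, the following NumPy sketch draws instances from it and prints F_p(x)/n; the spike magnitude (4·G_p·n)^{1/p} matches the construction used in Section II, while n, p, and the number of trials are arbitrary illustrative choices.

import math
import numpy as np

def hard_instance(n, p, rng):
    # Draw x from the continuous hard distribution described above:
    # x ~ N(0, I_n), and with probability 1/2 a random coordinate is
    # replaced by a spike of magnitude Theta(n^{1/p}).
    G_p = 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)  # E|N(0,1)|^p
    x = rng.standard_normal(n)
    Z = int(rng.integers(0, 2))            # Z = 1 is the "spiked" case
    if Z == 1:
        x[rng.integers(0, n)] = (4 * G_p * n) ** (1 / p)
    return x, Z

def F_p(x, p):
    return np.sum(np.abs(x) ** p)

rng = np.random.default_rng(0)
n, p = 20000, 3
for _ in range(4):
    x, Z = hard_instance(n, p, rng)
    # Without the spike F_p concentrates near G_p * n; with it, near 5 * G_p * n,
    # so any constant-factor approximation to F_p reveals Z.
    print(Z, F_p(x, p) / n)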
We obtain an Ω(n^{1−2/p}) bound on the number of linear measurements required for estimating F_p, matching known bounds up to poly(ε^{−1} log(Mnδ^{−1})) factors. Our proof is much simpler than previous proofs.

Our new hard distribution may also more accurately model those signals x arising in practice, since it corresponds to a signal with support 1 which is corrupted by independent Gaussian noise in each coordinate. Identifying natural hard distributions has been studied for other data stream problems, see, e.g., [19] and [17].

B. Sparse Recovery

In the field of compressed sensing, a standard problem is that of stable sparse recovery: we want a distribution A of matrices A ∈ ℝ^{m×n} such that, for any x ∈ ℝ^n and with probability 1 − δ > 2/3 over A ∈ A, there is an algorithm to recover x̂ from Ax with

  ‖x̂ − x‖_p ≤ (1 + ε) · min_{k-sparse x'} ‖x − x'‖_p

for some ε > 0 and norm p. We call this a (1 + ε)-approximate ℓ_p/ℓ_p recovery scheme with failure probability δ. We will focus on the popular case of p = 2. For any constant δ > 0 and any ε satisfying ε = O(1) and ε = Ω(n^{−1/2}), the optimal number of measurements is Θ(k log(n/k)/ε). The upper bound is in [12], and the lower bound is given by [1], [6], [14], [20]; see [20] for a comparison of these works.

One question is if the number of measurements can be improved when the approximation factor C = 1 + ε is very large (i.e., ω(1)). In the limiting case of C = ∞, corresponding to sparse recovery in the absence of noise, it is known that O(k) measurements are sufficient [7]. However, the intermediate regime has not been well studied.

Using the Shannon-Hartley theorem, we prove an Ω(k log(n/k)/log C) lower bound on the number of measurements. We complement this with a novel sparse recovery algorithm, which builds upon [12] and [13], but is the first to obtain an improved bound for C > 1.
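For concreteness, the ℓ2/ℓ2 guarantee above can be checked mechanically: the minimizer over k-sparse x' keeps the k largest-magnitude entries of x, so the right-hand side is the ℓ2 norm of the remaining tail of x. The helper below is a hypothetical utility added here only to illustrate the definition; it is not part of any scheme in this paper.

import numpy as np

def meets_l2l2_guarantee(x, xhat, k, C):
    # Check ||xhat - x||_2 <= C * min over k-sparse x' of ||x - x'||_2.
    # The best k-sparse approximation keeps the k largest-magnitude entries,
    # so the minimum equals the l2 norm of the remaining tail of x.
    tail = np.sort(np.abs(x))[:-k] if k > 0 else np.abs(x)
    err_k = np.linalg.norm(tail)
    return np.linalg.norm(xhat - x) <= C * err_k

# Tiny usage example with a noisy 2-sparse signal (values are arbitrary).
rng = np.random.default_rng(1)
x = 0.01 * rng.standard_normal(100)
x[[3, 40]] = [5.0, -7.0]
xhat = np.zeros(100)
xhat[[3, 40]] = [5.0, -7.0]        # exact on the support, zero elsewhere
print(meets_l2l2_guarantee(x, xhat, k=2, C=2.0))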
Our bound is O(k + k log(n/k) log* k / log C), which matches our lower bound up to a log* k factor. Because log(1 + ε) ≈ ε, these results match the Θ((1/ε) · k log(n/k)) results for ε ≪ 1.

Related work. Related lower bounds have appeared in a number of recent works, including [6], [14], [1], [21], and [10]. See [20] for a comparison.

II. LOWER BOUND FOR FREQUENCY MOMENTS

This section is devoted to proving the following theorem:

Theorem 1: Any sketching algorithm for F_p up to a factor of (1 ± ε) for ε < 1/2, which succeeds with probability 1 − δ for a sufficiently small constant δ > 0, requires m = Ω(n^{1−2/p}).

Let G_p = E[|X|^p] where X ∼ N(0, 1). For constant p, G_p is Θ(1). Consider the following communication game between two players, Alice and Bob. Alice chooses a random ℓ ∈ [n] and associated standard unit vector e_ℓ = (0, ..., 0, 1, 0, ..., 0) ∈ ℝ^n. She also chooses w ∼ N(0, I_n). Then she chooses Z ∈ {0, 1} uniformly at random. If Z = 0, then Alice sets x = w. If Z = 1, then Alice sets x = (4G_p)^{1/p} · n^{1/p} · e_ℓ + w. She sets y = Ax, where A is the random matrix used for estimating F_p. She sends y to Bob, who runs the estimation procedure associated with A to recover an estimate r of F_p(x). If r ≥ 2G_p·n, then Bob sets Z' = 1, else Bob sets Z' = 0. We thus have a Markov chain (ℓ, Z) → x → y → Z'.

If A works for any x with probability 1 − δ, as a distribution over A, then there is a specific A and random seed such that A, together with the associated estimation procedure, succeeds with probability 1 − δ over x drawn from the distribution described above. Let us fix this choice of A and associated random seed, so that Alice and Bob run deterministic algorithms. Let m be the number of rows of A. We can assume the rows of A are orthonormal since this can be done in post-processing.

Lemma 2: I(Z; Z') = O(m/n^{1−2/p}).

Proof: Let the rows of A be denoted v^1, ..., v^m. Then we have that

  y_i = ⟨v^i, x⟩ = (4G_p)^{1/p} · n^{1/p} · ⟨v^i, e_ℓ⟩ · Z + w'_i,

where w'_i ∼ N(0, 1). Define z_i = (4G_p)^{1/p} · n^{1/p} · ⟨v^i, e_ℓ⟩ · Z, so y_i = z_i + w'_i. Then

  E_{Z,ℓ}[z_i²] = (1/2) · (4G_p·n)^{2/p} · E_ℓ[(v^i_ℓ)²] = (1/2) · (4G_p·n)^{2/p} / n = (1/2) · (4G_p)^{2/p} / n^{1−2/p} = Θ(1/n^{1−2/p}).

Hence, y_i = z_i + w'_i is a Gaussian channel with power constraint E[z_i²] = Θ(1/n^{1−2/p}) and noise variance E[(w'_i)²] = 1. By the Shannon-Hartley theorem,

  max_{v^i} I(z_i; y_i) ≤ (1/2) · log(1 + E[z_i²]/E[(w'_i)²]) = (1/2) · log(1 + Θ(1/n^{1−2/p})) = Θ(1/n^{1−2/p}).

By the data processing inequality for Markov chains and the chain rule for entropy,

  I(Z; Z') ≤ I(z; y) = h(y) − h(y|z)
           = h(y) − h(y − z|z)
           = h(y) − Σ_i h(w'_i | z, w'_1, ..., w'_{i−1})
           = h(y) − Σ_i h(w'_i)
           ≤ Σ_i h(y_i) − Σ_i h(w'_i)
           = Σ_i (h(y_i) − h(y_i|z_i)) = Σ_i I(y_i; z_i)
           ≤ O(m/n^{1−2/p}).

Proof of Theorem 1: If Z = 1, then ‖x‖_p^p ≥ 4G_p·n, and so any (1 ± ε)-approximation is at least 2G_p·n for ε < 1/2. On the other hand, if Z = 0, then E[‖x‖_p^p] = G_p·n, and since the |x_i|^p are i.i.d. (as we range over i) with bounded variance, by Bernstein's inequality, with probability at least 1 − 1/n, ‖x‖_p^p ≤ (4/3)·G_p·n. Hence, any (1 ± ε)-approximation is less than 2G_p·n for ε < 1/2. So if the algorithm succeeds with probability at least 1 − δ, then Z' = Z with probability at least 1 − δ − 1/n.

By Fano's inequality and the fact that Z, Z' ∈ {0, 1}, if q = Pr[Z' ≠ Z] then we have H(Z|Z') ≤ H(q) + q. Hence

  I(Z; Z') = H(Z) − H(Z|Z') ≥ 1 − (H(q) + q) ≥ 1/2

if q is less than a sufficiently small constant, which follows from δ being a sufficiently small constant. But by Lemma 2, I(Z; Z') = O(m/n^{1−2/p}). Hence m = Ω(n^{1−2/p}).
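A small numerical experiment makes the channel view of Lemma 2 tangible: with orthonormal rows, the signal component of each measurement has power close to (1/2)·(4G_p·n)^{2/p}/n = Θ(1/n^{1−2/p}), so the Shannon-Hartley bound caps the information each row carries about Z. The sizes and sample counts below are illustrative choices, not values from the paper.

import math
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4096, 32, 3
G_p = 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)   # E|N(0,1)|^p

Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
A = Q.T                                      # m x n matrix with orthonormal rows

spike = (4 * G_p * n) ** (1 / p)
powers = []
for _ in range(2000):
    Z = rng.integers(0, 2)
    ell = rng.integers(0, n)
    z = spike * A[:, ell] * Z                # signal component of y = A x
    powers.append(np.mean(z ** 2))

emp_power = np.mean(powers)                  # empirical E[z_i^2]
print(emp_power, 0.5 * (4 * G_p * n) ** (2 / p) / n)   # both Theta(1/n^{1-2/p})
print(m * 0.5 * np.log2(1 + emp_power))      # crude cap on bits learned about Z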

III. BOUNDS FOR SPARSE RECOVERY

A. Lower bound for C ≫ 1

For C = 1 + ε, a lower bound of Ω(k log(n/k)/ε) was shown in [1], [6], [14], [20] for any constant δ > 0 and ε satisfying ε = O(1) and ε = Ω(n^{−1/2}). As in the lower bound of [20], ours uses the Shannon-Hartley theorem, but this proof is simpler because it can use that C is large. We explain the approach and our modification, and refer the reader to [20] for more details.

This section will prove the following theorem:

Theorem 3: Any C-approximate ℓ2/ℓ2 recovery scheme with failure probability δ < 1/2 requires m = Ω(k log(n/k)/log C).

As in [20], let F ⊂ {S ⊂ [n] : |S| = k} be a family of k-sparse supports such that:
• |S Δ S'| ≥ k for S ≠ S' ∈ F,
• Pr_{S∈F}[i ∈ S] = k/n for all i ∈ [n], and
• log |F| = Ω(k log(n/k)).
A random linear code on [n/k]^k with relative distance 1/2 has these properties (see the discussion in [20]).

Let X = {x ∈ {0, ±1}^n : supp(x) ∈ F}. Let w ∼ N(0, α(k/n) · I_n) be i.i.d. normal with variance αk/n in each coordinate. Consider the following process. Alice chooses S ∈ F uniformly at random, then x ∈ X uniformly at random subject to supp(x) = S, then w ∼ N(0, α(k/n) · I_n). She sets y = A(x + w) and sends y to Bob. Bob performs sparse recovery on y to recover x' ≈ x, rounds to X by x̂ = arg min_{x̄∈X} ‖x̄ − x'‖_2, and sets S' = supp(x̂). This gives a Markov chain S → x → y → x' → S'.

If sparse recovery works for x + w with probability 1 − δ over A, then there is a fixed A and random seed such that sparse recovery works with probability 1 − δ over x + w; choose this A and random seed, so that Alice and Bob run deterministic algorithms on their inputs.

The next lemma uses the Shannon-Hartley theorem.

Lemma 4 (4.1 of [20]): I(S; S') = O(m log(1 + 1/α)).
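To make the support family F concrete, the sketch below builds k-sparse supports by choosing one coordinate per block of size n/k, as a codeword over [n/k]^k would, and empirically checks the symmetric-difference property. It uses independent random codewords rather than the random linear code the argument actually relies on, and all parameters are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 16
block = n // k                      # alphabet size n/k; one coordinate per block
num_codewords = 200

codewords = rng.integers(0, block, size=(num_codewords, k))
supports = [frozenset(j * block + c[j] for j in range(k)) for c in codewords]

# The family requires |S symmetric-difference S'| >= k for distinct supports;
# random codewords typically satisfy this comfortably at these sizes.
dists = [len(supports[i] ^ supports[j])
         for i in range(num_codewords) for j in range(i + 1, num_codewords)]
print(min(dists), k)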
We modify Lemma 4.3 of [20] to obtain our main lemma and theorem. It is simpler than [20] since, when C is large, the recovery algorithm cannot try to output many of the Gaussian coordinates in lieu of finding x.

Lemma 5: I(S; S') = Ω(k log(n/k)) if α = Ω(1/C).

Proof: The claim is that with probability at least 1/2, x̂ = x, and so S' = S. By Fano's inequality we will then have H(S|S') ≤ 1 + Pr[S' ≠ S] · log |F|, so

  I(S; S') = H(S) − H(S|S') ≥ −1 + (1/2) · log |F| = Ω(k log(n/k)).

Hence it suffices to show the claim.

To show the claim, we condition on successful sparse recovery, which happens with probability 1 − δ ≥ 2/3. Let z = x + w be the transmitted signal. We also condition on ‖w‖_∞² ≤ O((αk/n) · log n) and ‖w‖_2²/(αk) ≤ 2, which happen with probability at least 1 − o(1). So both events occur with probability at least 2/3 − o(1) > 1/2. Given this conditioning and that α = Ω(1/C), the best k-sparse approximation to z is x + w_S, where w_S is the restriction of w to coordinates in S.

Suppose x̂ ≠ x, so ‖x̂ − x'‖_2 ≤ ‖x − x'‖_2. Then because sparse recovery was successful, ‖z − x'‖_2 ≤ C‖w − w_S‖_2 ≤ C‖w‖_2. Hence

  ‖x̂ − x‖_2 ≤ ‖x̂ − x'‖_2 + ‖x' − x‖_2
            ≤ 2‖x' − x‖_2
            ≤ 2(‖x' − z‖_2 + ‖z − x‖_2)
            ≤ 2(C + 1)‖w‖_2
            ≤ 2(C + 1)·√(2αk),

which is less than √k for appropriate α = Ω(1/C). This is a contradiction, and so x̂ = x, as desired.

Proof of Theorem 3: Combining Lemma 4 with Lemma 5, Ω(k log(n/k)) = I(S; S') = O(m log C), from which m = Ω(k log(n/k)/log C).

B. Upper bound for C ≫ 1

We first focus on recovery of a single heavy coordinate. We then study recovery of 90% of the heavy hitters for general k. We conclude with recovery of all the heavy hitters.

1) k = 1: We observe 2r measurements, for some r = O(log_C n). Let D = C/16. For i ∈ [r], we choose pairwise independent hash functions h_i: [n] → [D] and s_i: [n] → {±1}. We then observe

  y_{2i} = Σ_j h_i(j) s_i(j) x_j,    y_{2i+1} = Σ_j s_i(j) x_j.

procedure IDENTIFYSINGLE(y, h)
  α_i ← ROUND(y_{2i}/y_{2i+1}) for i ∈ [r]
  c_j ← |{i ∈ [r] : h_i(j) = α_i}| for j ∈ [n]
  S ← {j ∈ [n] : c_j > 5r/8}
  if |S| = 1 then
    return the single j ∈ S
  else
    return ⊥
  end if
end procedure
Algorithm III.1: 1-sparse identification

Define x_{−j} to equal x over [n] \ {j} and 0 at j.

Lemma 6: Suppose there exists a j* ∈ [n] such that |x_{j*}| ≥ C‖x_{−j*}‖_2. Then if C is a sufficiently large constant, we can choose r = O(log_C n + log(1/δ)) and D = C/16 so that IDENTIFYSINGLE returns j* with probability 1 − δ.

Proof: The key claim is that, for α_i = ROUND(y_{2i}/y_{2i+1}), we have

  Pr[α_i ≠ h_i(j*)] ≤ 1/4.   (1)

Straightforward concentration inequalities then give the result. To get (1), define the "noise" terms β_i = Σ_{j≠j*} h_i(j) s_i(j) x_j and γ_i = Σ_{j≠j*} s_i(j) x_j. Then

  E[γ_i²] = ‖x_{−j*}‖_2²,    E[β_i²] ≤ D² · ‖x_{−j*}‖_2².

Thus with probability at least 1 − 2/9 > 3/4, |γ_i| ≤ 3‖x_{−j*}‖_2 and |β_i| ≤ 3D‖x_{−j*}‖_2. But then

  y_{2i}/y_{2i+1} = (h_i(j*) + s_i(j*) β_i/x_{j*}) / (1 + s_i(j*) γ_i/x_{j*}) = (h_i(j*) ± 3D/C) / (1 ± 3/C),

so

  |y_{2i}/y_{2i+1} − h_i(j*)| ≤ (3D/C + 3h_i(j*)/C) / (1 − 3/C) ≤ 6D/(C − 3),

so if D = C/16 ≤ (C − 3)/12, as happens for sufficiently large C, this is less than 1/2, and hence α_i = ROUND(y_{2i}/y_{2i+1}) = h_i(j*), giving (1).

Then by a Chernoff bound, Pr[j* ∉ S] = Pr[c_{j*} < 5r/8] = e^{−Ω(r)} < δ/2 for r = Ω(log(1/δ)). Suppose that j* ∈ S. In order for any j ≠ j* to lie in S, it must have h_i(j) = h_i(j*) for at least r/4 different i (because both match α_i for more than 5r/8 coordinates). But Pr[h_i(j) = h_i(j*)] = 1/D independently over i, so

  Pr[j ∈ S] ≤ (r choose r/4) · (1/D)^{r/4} ≤ (4e/D)^{r/4} = C^{−Ω(r)}

as long as C is larger than a fixed constant. For r = O(log_C(n/δ)) this gives Pr[j ∈ S] < δ/(2n), so a union bound gives that S = {j*} with probability 1 − δ.
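The following is a minimal NumPy rendering of Algorithm III.1 under the assumptions of Lemma 6. Fully random hash tables stand in for the pairwise independent families h_i and s_i, and the constants used to set r and D are illustrative rather than the ones required by the analysis.

import numpy as np

def identify_single(x, C=64.0, delta=0.1, rng=None):
    # Sketch of Algorithm III.1: recover the index of a single dominant
    # coordinate from 2r linear measurements.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(x)
    D = max(2, int(C / 16))
    r = int(8 * (np.log(n) / np.log(C) + np.log(1 / delta))) + 1
    h = rng.integers(1, D + 1, size=(r, n))      # h_i(j) in {1, ..., D}
    s = rng.choice([-1.0, 1.0], size=(r, n))     # s_i(j) in {+-1}
    y_even = (h * s * x).sum(axis=1)             # y_{2i} = sum_j h_i(j) s_i(j) x_j
    y_odd = (s * x).sum(axis=1)                  # y_{2i+1} = sum_j s_i(j) x_j
    alpha = np.rint(y_even / y_odd)              # alpha_i = ROUND(y_{2i}/y_{2i+1})
    counts = (h == alpha[:, None]).sum(axis=0)   # c_j = #{i : h_i(j) = alpha_i}
    S = np.flatnonzero(counts > 5 * r / 8)
    return int(S[0]) if len(S) == 1 else None    # None plays the role of "bottom"

# Usage: one coordinate dominating the rest, as Lemma 6 requires.
rng = np.random.default_rng(1)
x = 0.001 * rng.standard_normal(10_000)
x[1234] = 10.0
print(identify_single(x, rng=rng))               # expected output: 1234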
2) General k, finding most coordinates: For general k, we identify a set L of O(k) coordinates by partitioning the coordinates into O(k) sets of size Θ(n/k) and applying IDENTIFYSINGLE to each set. To be specific, we use a pairwise independent hash function h: [n] → [l] to partition into l sets.

To analyze how this performs, define the "error"

  Err²(x, k) = min_{k-sparse x'} ‖x − x'‖_2²

and the "heavy hitters"

  S = {i ∈ [n] : |x_i|² > (C²/k) · Err²(x, k)}.

Lemma 7: With O(k log_C(n/k)) measurements, this algorithm returns a set L of size O(k) such that each j ∈ S has j ∈ L with probability at least 3/4.

Proof: For each coordinate j ∈ S, Lemma 6 shows it will be recovered as long as three events hold: none of the other elements of the top k coordinates hash to the same value as j, the squared ℓ2 norm of the mass that hashes to the same value as j is no more than a constant factor times its expectation Err²(x, k)/k, and the algorithm in Lemma 6 does not fail. All these occur with constant probability if l = O(k), giving the result.

Corollary 8: With O(k log_C(n/k) · log(1/δ)) measurements, IDENTIFYMOST returns a set L of size O(k) such that each j ∈ S has j ∈ L with probability at least 1 − δ.

Proof: We repeat the method of Lemma 7 O(log(1/δ)) times, and take all coordinates that are listed in more than half the sets L_r. This at most doubles the output size, and by a Chernoff bound each j ∈ S lies in the output with probability at least 1 − δ.

Corollary 8 gives a good method for finding the heavy hitters, but we also need to estimate them.

3) Estimating coordinates: We estimate using Count-Sketch [8], with R = O(log(1/δ)) hash tables of size O(k/ε).

procedure IDENTIFYMOST(y)
  for r ∈ [R] do                              ▷ R = O(log(1/δ))
    L_r ← {IDENTIFYSINGLE(y^{(r,i)}) : i ∈ [l]}
  end for
  c_j ← |{r : j ∈ L_r}| for j ∈ [n]
  L ← {j : c_j > R/2}
  return L
end procedure

procedure ESTIMATEMOST(y, L)
  for r ∈ [R] do                              ▷ R = O(log(1/δ))
    x̂_j^{(r)} ← s_r(j) · y_{r, h_r(j)} for j ∈ L
  end for
  x̂_j ← median_r x̂_j^{(r)} for j ∈ L
  return x̂_L
end procedure
Algorithm III.2: Estimating most coordinates well

Lemma 9: Suppose |L| = O(k). With O((1/ε) · k · log(1/(fδ))) measurements, ESTIMATEMOST returns x̂_L so that, with probability 1 − δ,

  Err²(x_L − x̂_L, fk) ≤ ε · Err²(x, k).

Proof: The analysis of Count-Sketch [8] gives, for each j,

  Pr[|x̂_j − x_j|² > (ε/k) · Err²(x, k)] < fδ.

Thus with probability 1 − δ, at most f|L| = O(fk) of the j have |x̂_j − x_j|² > (ε/k) · Err²(x, k). Rescaling f and ε gives the result.

4) Recovering all the heavy hitters:

Lemma 10: The result x̂_L of IDENTIFYMOST followed by ESTIMATEMOST satisfies

  Err²(x − x̂_L, fk) ≤ C² · Err²(x, k)

with probability 1 − δ, and uses O(k log_C(n/k) · log(1/(fδ))) measurements.

Proof: Let T contain the largest k coordinates of x. By Corollary 8, each j ∈ S has j ∈ L with probability 1 − δf, so with probability 1 − δ we have |S \ L| ≤ fk. Then

  Err²(x − x̂_L, 2fk) ≤ Err²(x_L − x̂_L, fk) + ‖x_{[n]\(S∪L)}‖_2²
                      ≤ ε · Err²(x, k) + ‖x_{[n]\T}‖_2² + ‖x_{T\S}‖_2²
                      ≤ (ε + 1 + C²) · Err²(x, k)
                      ≤ 2C² · Err²(x, k)

with probability 1 − δ by Lemma 9. Rescale f, δ, and C to get the result.

procedure RECOVERALL(y)
  k' ← k, δ ← 1/16, x̂^{(0)} ← 0
  for r ∈ [R] do
    y' ← y − A^{(r)} x̂^{(r)}
    L^{(r)} ← IDENTIFYMOST(y', k', δ)
    v̂^{(r)} ← ESTIMATEMOST(y', k', δ, L^{(r)})
    x̂^{(r+1)} ← x̂^{(r)} + v̂^{(r)}
    Decrease k', ε, δ per Theorem 11
  end for
end procedure
Algorithm III.3: Recovering all coordinates
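To make the estimation step of Algorithm III.2 concrete, here is a minimal Count-Sketch-style sketch of ESTIMATEMOST: R hash tables of width O(k/ε), with each candidate coordinate estimated by a median over the tables. Random tables replace the pairwise independent hashes, and all parameter choices are illustrative only.

import numpy as np

def estimate_most(x, L, k, eps=0.5, R=9, rng=None):
    # Count-Sketch-style estimation: each candidate j in L is estimated by
    # the median of s_r(j) * y[r, h_r(j)] over R independent hash tables.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(x)
    width = max(1, int(2 * k / eps))
    h = rng.integers(0, width, size=(R, n))
    s = rng.choice([-1.0, 1.0], size=(R, n))
    # Linear measurements: y[r, b] = sum over {j : h_r(j) = b} of s_r(j) * x_j
    y = np.zeros((R, width))
    for r in range(R):
        np.add.at(y[r], h[r], s[r] * x)
    return {j: float(np.median(s[:, j] * y[np.arange(R), h[:, j]])) for j in L}

# Usage: estimate a few heavy coordinates of a noisy signal.
rng = np.random.default_rng(2)
x = 0.01 * rng.standard_normal(5000)
x[[7, 99, 2024]] += [3.0, -4.0, 5.0]
print(estimate_most(x, L=[7, 99, 2024], k=3, rng=rng))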
Theorem 11: RECOVERALL achieves C-approximate ℓ2/ℓ2 sparse recovery with O(k + (log* k) · k · log_C(n/k)) measurements and 3/4 success probability.

Proof: We will achieve D^{O(log* k)}-approximate recovery using O(k log_D(n/k)) measurements; substituting log C = log D · log* k gives the result. Define δ_i = 1/(8 · 2^i). Let f_0 = 1/16 and f_{i+1} = 2^{−1/(4^i f_i)}. Let k_i = k · ∏_{j<i} f_j, so k_i ≤ k/4^i. Suppose k = ω(1);

ACKNOWLEDGEMENTS

E. Price is supported in part by an NSF Graduate Research Fellowship.

REFERENCES

[1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Transactions on Information Theory, 56(10):5111–5130, 2010.
[2] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. JCSS, 58(1):137–147, 1999.
[3] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In FOCS, pages 363–372, 2011.
[4] Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In SODA, pages 1190–1197, 2010.
[5] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. JCSS, 68(4):702–732, 2004.
[6] E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector? Arxiv preprint arXiv:1104.5246, 2011.
[7] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[8] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP, pages 693–703, 2002.
[9] Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, and Zoya Svitkina. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), 2010.
[10] Alyson K. Fletcher, Sundeep Rangan, and Vivek K. Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758–5772, 2009.
[11] Sumit Ganguly. Polynomial estimators for high frequency moments. CoRR, abs/1104.4552, 2011.