Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Sub-Constant Error
T.S. Jayram*    David Woodruff†

* IBM Almaden, [email protected]
† IBM Almaden, [email protected]

Abstract

The Johnson-Lindenstrauss transform is a dimensionality reduction technique with a wide range of applications to theoretical computer science. It is specified by a distribution over projection matrices from R^n → R^k, where k ≪ n, and states that k = O(ε⁻² log 1/δ) dimensions suffice to approximate the norm of any fixed vector in R^n to within a factor of 1 ± ε with probability at least 1 − δ. In this paper we show that this bound on k is optimal up to a constant factor, improving upon a previous Ω((ε⁻² log 1/δ)/log(1/ε)) dimension bound of Alon. Our techniques are based on lower bounding the information cost of a novel one-way communication game and yield the first space lower bounds in a data stream model that depend on the error probability δ.

For many streaming problems, the most naïve way of achieving error probability δ is to first achieve constant probability, then take the median of O(log 1/δ) independent repetitions. Our techniques show that for a wide range of problems this is in fact optimal! As an example, we show that estimating the ℓ_p-distance for any p ∈ [0, 2] requires Ω(ε⁻² log n log 1/δ) space, even for vectors in {0, 1}^n. This is optimal in all parameters and closes a long line of work on this problem. We also show that estimating the number of distinct elements requires Ω(ε⁻² log 1/δ + log n) space, which is optimal if ε⁻² = Ω(log n). We also improve previous lower bounds for entropy in the strict turnstile and general turnstile models by a multiplicative factor of Ω(log 1/δ). Finally, we give an application to one-way communication complexity under product distributions, showing that unlike in the case of constant δ, the VC-dimension does not characterize the complexity when δ = o(1).

1 Introduction

The Johnson-Lindenstrauss transform is a fundamental dimensionality reduction technique with applications to many areas such as nearest-neighbor search [2, 25], compressed sensing [14], computational geometry [17], data streams [7, 24], graph sparsification [40], machine learning [32, 39, 43], and numerical linear algebra [18, 22, 37, 38]. It is given by a projection matrix that maps vectors in R^n to R^k, where k ≪ n, while seeking to approximately preserve their norm. The classical result states that k = O(ε⁻² log 1/δ) dimensions suffice to approximate the norm of any fixed vector in R^n to within a factor of 1 ± ε with probability at least 1 − δ. This is a remarkable result because the target dimension is independent of n. Because the transform is linear, applying it to a fixed set of vectors also preserves their pairwise distances, which is what is needed for most applications. The projection matrix is itself produced by a random process that is oblivious to the input vectors. Since the original work of Johnson and Lindenstrauss, it has been shown [1, 8, 21, 25] that the projection matrix could be constructed element-wise using the standard Gaussian distribution or even uniform ±1 variables [1]. By setting the target dimension to k = O(ε⁻² log 1/δ), the resulting matrix, suitably scaled, is guaranteed to approximate the norm of any single vector with failure probability δ.

Due to its algorithmic importance, there has been a flurry of research aiming to improve upon these constructions, addressing both the time needed to generate a suitable projection matrix and the time to produce the transform of the input vectors [2, 3, 4, 5, 33].
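As a concrete illustration of the guarantee above, the following minimal Python sketch implements the dense Gaussian construction just described, with i.i.d. N(0, 1) entries scaled by 1/√k. The function name jl_transform and the constant c standing in for the unspecified constant in the O(·) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a dense Gaussian JL transform (illustrative; the constant
# c in k = O(eps^-2 log 1/delta) is an assumption, not taken from the paper).
import numpy as np

def jl_transform(x, eps, delta, c=2.0, rng=None):
    """Project x in R^n down to R^k with k = ceil(c * eps^-2 * log(1/delta)).

    Entries of the projection matrix are i.i.d. N(0, 1), scaled by 1/sqrt(k)
    so that the squared norm of the image equals ||x||^2 in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.ceil(c * eps ** -2 * np.log(1.0 / delta)))
    A = rng.standard_normal((k, x.shape[0])) / np.sqrt(k)
    return A @ x

# Usage: with probability at least 1 - delta (for a suitable c), the embedded
# norm is within a 1 +/- eps factor of the true norm.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = jl_transform(x, eps=0.1, delta=0.01, rng=rng)
print(np.linalg.norm(y) / np.linalg.norm(x))  # typically lands in [0.9, 1.1]
```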
In the area of data streams, the Johnson-Lindenstrauss transform has been used in the seminal work of Alon, Matias and Szegedy [7] as a building block to produce sketches of the input that can be used to estimate norms. For a stream with poly(n) increments/decrements to a vector in R^n, the size of the sketch can be made to be O(ε⁻² log n log 1/δ). To achieve even better update times, Thorup and Zhang [42], building upon the Count Sketch data structure of Charikar, Chen, and Farach-Colton [16], use an ultra-sparse transform to estimate the norm, but then have to take a median of several estimators in order to reduce the failure probability. This is inherently non-linear but suggests the power of such schemes in addressing sparsity as a goal; in contrast, a single transform with constant sparsity per column fails to be an (ε, δ)-JL transform [20, 34].

In this paper, we consider the central lower bound question for Johnson-Lindenstrauss transforms: how good is the upper bound of k = O(ε⁻² log 1/δ) on the target dimension needed to approximate the norm of a fixed vector in R^n? Alon [6] gave a near-tight lower bound of Ω(ε⁻² (log 1/δ)/log(1/ε)), leaving an asymptotic gap of log(1/ε) between the upper and lower bounds. In this paper, we close the gap and resolve the optimality of Johnson-Lindenstrauss transforms by giving a lower bound of k = Ω(ε⁻² log 1/δ) dimensions. More generally, we show that any sketching algorithm for estimating the norm (whether linear or not) of vectors in R^n must use space at least Ω(ε⁻² log n log 1/δ) to approximate the norm within a 1 ± ε factor with a failure probability of at most δ. By a simple reduction, we show that this result implies the aforementioned lower bound on Johnson-Lindenstrauss transforms.

Our results come from lower-bounding the information cost of a novel one-way communication complexity problem. One can view our results as a strengthening of the augmented-indexing problem [9, 10, 18, 28, 35] to very large domains. Our technique is far-reaching, implying the first lower bounds for the space complexity of streaming algorithms that depend on the error probability δ. In many cases, our results are tight. For instance, for estimating the ℓ_p-norm for any p ≥ 0 in the turnstile model, we prove an Ω(ε⁻² log n log 1/δ) space lower bound for streams with poly(n) increments/decrements. This resolves a long sequence of work on this problem [26, 28, 44] and is simultaneously optimal in ε, n, and δ. For p ∈ [0, 2], this matches the upper bound of [28]. Indeed, in [28] it was shown how to achieve O(ε⁻² log n) space and constant probability of error. To reduce this to error probability δ, run the algorithm O(log 1/δ) times in parallel and take the median (a toy sketch of this amplification follows below). Surprisingly, this is optimal! For estimating the number of distinct elements in a data stream, we prove an Ω(ε⁻² log 1/δ + log n) space lower bound, improving upon the previous Ω(log n) bound of [7] and the Ω(ε⁻²) bound of [26, 44]. In [28, 29], an O(ε⁻² + log n)-space algorithm is given with constant probability of success.
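The generic amplification described above can be sketched as follows. This is a toy illustration, not the algorithm of [28]; the constants 6 and 4 and the helper names l2_estimate_once and l2_estimate are arbitrary assumptions.

```python
# Toy illustration of the naive amplification: each repetition is a constant
# failure probability estimate using an O(eps^-2)-dimensional Gaussian sketch;
# the median of O(log 1/delta) repetitions fails with probability at most
# delta. The constants 6 and 4 below are arbitrary assumptions.
import numpy as np

def l2_estimate_once(x, eps, rng):
    """A single (1 +/- eps)-estimate of ||x||_2 with some constant error probability."""
    k = int(np.ceil(6.0 / eps ** 2))
    A = rng.standard_normal((k, x.shape[0])) / np.sqrt(k)
    return np.linalg.norm(A @ x)

def l2_estimate(x, eps, delta, rng=None):
    """Median of O(log 1/delta) independent constant-error estimates."""
    rng = np.random.default_rng() if rng is None else rng
    reps = int(np.ceil(4.0 * np.log(1.0 / delta)))
    return float(np.median([l2_estimate_once(x, eps, rng) for _ in range(reps)]))

rng = np.random.default_rng(1)
x = rng.standard_normal(5_000)
print(l2_estimate(x, eps=0.1, delta=1e-6, rng=rng), np.linalg.norm(x))
```

For p ∈ [0, 2], the lower bound stated above says that the space used by this kind of repetition cannot be improved beyond constant factors.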
We also improve the previously known Ω(ε⁻² log n) bound for estimating the entropy in the turnstile model to Ω(ε⁻² log n log 1/δ), and we improve the previous Ω(ε⁻² log n / log 1/ε) bound [28] for estimating the entropy in the strict turnstile model to Ω(ε⁻² log n log 1/δ / log 1/ε). Entropy has become an important tool in databases as a way of understanding database design, enabling data integration, and performing data anonymization [41]. Estimating this quantity in an efficient manner over large sets is a crucial ingredient in performing this analysis (see the recent tutorial in [41] and the references therein).

Kremer, Nisan and Ron [30] showed the surprising theorem that for constant error probability δ, the one-way communication complexity of a function under product distributions coincides with the VC-dimension of the communication matrix of the function. We show that for sub-constant δ, such a nice characterization is not possible. Namely, we exhibit two functions with the same VC-dimension whose communication complexities differ by a multiplicative log 1/δ factor.

Organization: In Section 2, we give preliminaries on communication and information complexity. In Section 3, we give our lower bound for augmented-indexing over larger domains. In Section 4, we give the improved lower bound for Johnson-Lindenstrauss transforms and the streaming and communication applications mentioned above.

2 Preliminaries

Let [a, b] denote the set of integers {i | a ≤ i ≤ b}, and let [n] = [1, n]. Random variables will be denoted by upper-case Roman or Greek letters, and the values they take by (typically corresponding) lower-case letters. Probability distributions will be denoted by lower-case Greek letters. A random variable X with distribution µ is denoted by X ∼ µ. If µ is the uniform distribution over a set U, then this is also denoted as X ∈_R U.

2.1 One-way Communication Complexity

Let D denote the input domain and O the set of outputs. Consider the two-party communication model, where Alice holds an input x ∈ D and Bob holds an input y ∈ D. Their goal is to solve some relation problem Q ⊆ D × D × O. For each (x, y) ∈ D², the set Q_xy = {z | (x, y, z) ∈ Q} represents the set of possible answers on input (x, y). Let L ⊆ D² be the set of legal, or promise, inputs.
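To make the one-way model concrete, the toy sketch below simulates the trivial one-way protocol for an indexing-style problem over an alphabet [a], the flavor of problem behind the augmented-indexing lower bounds referenced in the Organization paragraph. The problem parameters, helper names, and the choice to have Alice send her entire input are illustrative assumptions, not the paper's construction.

```python
# Toy simulation of the one-way model above for an indexing-style problem:
# Alice holds x in [a]^m, Bob holds an index i (in the augmented version he
# also knows the prefix x[:i], which this trivial protocol ignores) and must
# output x[i]. Alice speaks once; sending her whole input costs m*ceil(log2 a)
# bits. All names and parameters here are illustrative assumptions.
import math
import random

def alice_message(x, a):
    """Alice's single message: a fixed-length binary encoding of her entire input."""
    b = math.ceil(math.log2(a))
    return "".join(format(symbol, f"0{b}b") for symbol in x)

def bob_output(message, i, a):
    """Bob decodes the i-th symbol from Alice's message."""
    b = math.ceil(math.log2(a))
    return int(message[i * b:(i + 1) * b], 2)

random.seed(0)
a, m = 16, 8
x = [random.randrange(a) for _ in range(m)]
msg = alice_message(x, a)
assert bob_output(msg, 5, a) == x[5]
print(f"{len(msg)} bits communicated for m = {m}, a = {a}")
```

The lower bounds of Section 3 concern how much communication protocols of this one-way form must use on the large-domain version of this problem, even when randomization and failure probability δ are allowed.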