Non-Adaptive Adaptive Sampling on Turnstile Streams
Sepideh Mahabadi∗   Ilya Razenshteyn†   David P. Woodruff‡   Samson Zhou§

April 24, 2020

Abstract

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying matrix $A \in \mathbb{R}^{n \times d}$, where $n \gg d$, with probabilities proportional to their distances to the subspace of the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams, using space $\mathrm{poly}(d, k, \log n)$, where $k$ is the number of adaptive sampling rounds to be performed.

Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve the state of the art or have only been previously studied in the more relaxed row-arrival model. We give the first relative-error algorithm for column subset selection on turnstile streams. We show our adaptive sampling algorithm also gives the first relative-error algorithm for subspace approximation on turnstile streams that returns $k$ noisy rows of $A$. The quality of the output can be improved to a $(1+\epsilon)$-approximation at the tradeoff of a bicriteria algorithm that outputs a larger number of rows. We then give the first algorithm for projective clustering on turnstile streams that uses space sublinear in $n$. In fact, we use space $\mathrm{poly}\left(d, k, s, \frac{1}{\epsilon}, \log n\right)$ to output a $(1+\epsilon)$-approximation, where $s$ is the number of $k$-dimensional subspaces.
Our adaptive sampling primitive also provides the first algorithm for volume maximization on turnstile streams. We complement our volume maximization algorithmic results with lower bounds that are tight up to lower order terms, even for multi-pass algorithms. By a similar construction, we also obtain lower bounds for volume maximization in the row-arrival model, which we match with competitive upper bounds.

arXiv:2004.10969v1 [cs.DS] 23 Apr 2020

∗Toyota Technological Institute at Chicago (TTIC). E-mail: [email protected]
†Microsoft Research. E-mail: [email protected]
‡School of Computer Science, Carnegie Mellon University. E-mail: [email protected]
§School of Computer Science, Carnegie Mellon University. E-mail: [email protected]

1 Introduction

Data summarization is a fundamental task in data mining, machine learning, statistics, and applied mathematics. The goal is to find a set $S$ of $k$ rows of a matrix $A \in \mathbb{R}^{n \times d}$ that optimizes some predetermined function that quantifies how well $S$ represents $A$. For example, row subset selection seeks $S$ to well-approximate $A$ with respect to the spectral or Frobenius norm, subspace approximation asks to minimize the sum of the distances of the rows of $A$ from $S$, while volume maximization wants to maximize the volume of the parallelepiped spanned by the rows of $S$.

Due to their applications in data science, data summarization problems are particularly attractive to study for big data models. The streaming model of computation is an increasingly popular model for describing large datasets whose overwhelming size places restrictions on the space available to algorithms. For turnstile streams that implicitly define $A$, the matrix initially starts as the all zeros matrix and receives a large number of updates to its coordinates. Once the updates are processed, they cannot be accessed again and hence any information not stored is lost forever.
The goal is then to perform some data summarization task after the stream is completed without storing $A$ in its entirety.

Adaptive sampling is a useful algorithmic paradigm that yields many data summarization algorithms in the centralized setting [DV06, DV07, DRVW06]. The idea is that $S$ begins as the empty set and some row $A_i$ of $A$ is sampled with probability $\frac{\|A_i\|_2^p}{\|A\|_{p,2}^p}$, where $p \in \{1, 2\}$ and $\|A\|_{p,q} = \left(\sum_{i=1}^{n} \left(\sum_{j=1}^{d} |A_{i,j}|^q\right)^{p/q}\right)^{1/p}$. As $S$ is populated, the algorithm adaptively samples rows of $A$, so that at each iteration, row $A_i$ is sampled with probability proportional to the $p$-th power of the distance of the row from $S$. That is, $A_i$ is sampled with probability $\frac{\|A_i(I - M^\dagger M)\|_2^p}{\|A(I - M^\dagger M)\|_{p,2}^p}$, where $M$ is the matrix formed by the rows in $S$. The procedure is repeated $k$ times until we obtain $k$ rows of $A$, which then form our summary of the matrix $A$. Unfortunately, adaptive sampling seems like an inherently sequential procedure and thus the extent of its capabilities has not been explored in the streaming model.

1.1 Our Contributions

In this paper, we show that although adaptive sampling seems like an iterative procedure, we do not need multiple passes over the stream to perform adaptive sampling. This is particularly surprising since any row $A_i$ of $A$ can be made irrelevant, i.e., given zero probability of being sampled, in future rounds if some row along the same direction as $A_i$ is sampled in the present round. Yet somehow we must still output rows of $A$ while storing a sublinear number of rows more or less non-adaptively. The challenge seems compounded by the turnstile model, since updates can be made to arbitrary elements of the matrix, but somehow we need to recover the rows with the largest norms. For example, if the last update in the stream is substantially larger than the previous updates, an adaptive sampler must return the entire row, even though this update could be an entry in any row of $A$.
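For reference, the centralized adaptive sampling procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch of the classical offline algorithm, not the paper's streaming algorithm; the function name and interface are our own.

```python
import numpy as np

def adaptive_sample(A, k, p=2, rng=None):
    """Offline adaptive sampling sketch: pick k row indices of A, each
    round sampling row i with probability proportional to the p-th power
    of its distance to the row span of the previously selected rows."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    S = []
    M = np.zeros((0, d))                   # rows selected so far
    for _ in range(k):
        # Residual after projecting onto the row span of M: A(I - M^+ M).
        R = A - A @ np.linalg.pinv(M) @ M if S else A.copy()
        dist = np.linalg.norm(R, axis=1) ** p
        total = dist.sum()
        if total == 0:                     # all rows already lie in the span
            break
        i = rng.choice(n, p=dist / total)
        S.append(int(i))
        M = np.vstack([M, A[i]])
    return S
```

Note that once a row is selected, any row in its span has zero residual and can never be sampled again; this is exactly the sequential dependence that makes a one-pass implementation look impossible.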
To build our adaptive sampler, we first give an algorithm that performs a single round of sampling. Namely, given a matrix $A \in \mathbb{R}^{n \times d}$ that is defined over a turnstile stream and post-processing query access to a matrix $P \in \mathbb{R}^{d \times d}$, we first give $L_{p,q}$ samplers for $AP$.

Theorem 1.1 Let $\epsilon > 0$, $q = 2$, and $p \in \{1, 2\}$. There exists a one-pass streaming algorithm that takes rows of a matrix $A \in \mathbb{R}^{n \times d}$ as a turnstile stream and post-processing query access to a matrix $P \in \mathbb{R}^{d \times d}$ after the stream, and with high probability, samples an index $i \in [n]$ with probability $(1 \pm \mathcal{O}(\epsilon))\frac{\|A_i P\|_q^p}{\|AP\|_{p,q}^p} + \frac{1}{\mathrm{poly}(n)}$. The algorithm uses $\mathrm{poly}\left(d, \frac{1}{\epsilon}, \log n\right)$ bits of space. (See Theorem 2.7 and Theorem A.4.)

We remark that our techniques can be extended to $p \in (0, 2]$ but we only require $p \in \{1, 2\}$ for the purposes of our applications. Now, suppose we want to perform adaptive sampling of a row of $A \in \mathbb{R}^{n \times d}$ with probability proportional to its distance or squared distance from some subspace $H \in \mathbb{R}^{i \times d}$, where $i$ is any integer. Then by taking $P = I - H^\dagger H$ and either $L_{1,2}$ or $L_{2,2}$ sampling, we select rows of $A$ with probability roughly proportional to the distance or squared distance from $H$. We can thus simulate $k$ rounds of adaptive sampling in a stream, despite its seemingly inherent sequential nature.

Theorem 1.2 Let $A \in \mathbb{R}^{n \times d}$ be a matrix and $q_S$ be the probability of selecting a set $S \subset [n]$ of $k$ rows of $A$ according to $k$ rounds of adaptive sampling with respect to either the distances to the selected subspace in each iteration or the squared distances to the selected subspace in each iteration. There exists an algorithm that takes inputs $A$ through a turnstile stream and $\epsilon > 0$, and outputs a set $S \subset [n]$ of $k$ indices such that if $p_S$ is the probability of the algorithm outputting $S$, then $\sum_S |p_S - q_S| \le \epsilon$. The algorithm uses $\mathrm{poly}\left(d, k, \frac{1}{\epsilon}, \log n\right)$ bits of space. (See Theorem 3.4 and Theorem A.7.)
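To make the reduction concrete, the following offline sketch computes the single-round distribution induced by the projection $P = I - H^\dagger H$: with this choice of $P$, the row norms $\|A_i P\|_2$ are exactly the distances of the rows of $A$ to the row span of $H$, so $L_{p,2}$ sampling on $AP$ is one round of adaptive sampling. The streaming algorithm of Theorem 1.1 samples from this distribution without storing $A$; here we simply evaluate it with the full matrix in hand, and the function name is our own.

```python
import numpy as np

def round_distribution(A, H, p=2):
    """One round of adaptive sampling, phrased as L_{p,2} sampling on A @ P.

    P = I - H^+ H projects onto the orthogonal complement of the row span
    of H, so ||A_i P||_2 is the distance of row i to that span.  Returns
    the vector of probabilities ||A_i P||_2^p / ||A P||_{p,2}^p."""
    d = A.shape[1]
    P = np.eye(d) - np.linalg.pinv(H) @ H   # projector onto span(H)^perp
    w = np.linalg.norm(A @ P, axis=1) ** p
    return w / w.sum()
```

For example, with $H$ spanned by $e_1$, any row already lying in the span of $H$ receives probability zero, while the remaining rows are weighted by their squared distances to that span (for $p = 2$).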
In other words, our output distribution is close in total variation distance to the desired adaptive sampling distribution. Our algorithm is the first to perform adaptive sampling on a stream; existing implementations require extended access to the matrix, such as in the centralized or distributed models, for subsequent rounds of sampling. Moreover, if the set $S$ of $k$ indices output by our algorithm is $s_1, \ldots, s_k$, then our algorithm also returns a set of rows $r_1, \ldots, r_k$ so that if $R_0 = \emptyset$ and $R_i = r_1 \circ \cdots \circ r_i$ for $i \in [k]$, then $r_i = u_{s_i} + v_i$, where $u_{s_i} = A_{s_i}(I - R_{i-1}^\dagger R_{i-1})$ is the projection of the sampled row to the space orthogonal to the previously selected rows, and $v_i$ is some small noisy vector formed by linear combinations of other rows in $A$ such that $\|v_i\|_2 \le \epsilon \|u_{s_i}\|_2$.

Thus we do not return the true rows of $A$ corresponding to the indices in $S$, but we output a small noisy perturbation of each of the rows, which we call noisy rows and which suffices for a number of applications previously unexplored in turnstile streams. Crucially, the noisy perturbation in each of our output rows can be bounded in norm not only relative to the norm of the true row, but also relative to the residual. In fact, our previous example of a long stream of small updates followed by a single arbitrarily large update shows that it is impossible to return the true rows of $A$ in sublinear space.