Turnstile Streaming Might as Well Be Linear Sketches

Yi Li, Max-Planck Institute for Informatics, [email protected]
Huy L. Nguyên, Princeton University, [email protected]
David P. Woodruff, IBM Research, Almaden, [email protected]

ABSTRACT

In the turnstile model of data streams, an underlying vector x ∈ {−m, −m+1, . . . , m−1, m}^n is presented as a long sequence of positive and negative integer updates to its coordinates. A randomized algorithm seeks to approximate a function f(x) with constant probability while only making a single pass over this sequence of updates and using a small amount of space. All known algorithms in this model are linear sketches: they sample a matrix A from a distribution on integer matrices in the preprocessing phase, and maintain the linear sketch A · x while processing the stream. At the end of the stream, they output an arbitrary function of A · x. One cannot help but ask: are linear sketches universal?

In this work we answer this question by showing that any 1-pass constant probability streaming algorithm for approximating an arbitrary function f of x in the turnstile model can also be implemented by sampling a matrix A from the uniform distribution on O(n log m) integer matrices, with entries of magnitude poly(n), and maintaining the linear sketch Ax. Furthermore, the logarithm of the number of possible states of Ax, as x ranges over {−m, −m + 1, . . . , m}^n, plus the amount of randomness needed to store A, is at most a logarithmic factor larger than the space required of the space-optimal algorithm. Our result shows that to prove space lower bounds for 1-pass streaming algorithms, it suffices to prove lower bounds in the simultaneous model of communication complexity, rather than the stronger 1-way model. Moreover, the fact that we can assume we have a linear sketch with polynomially-bounded entries further simplifies existing lower bounds, e.g., for frequency moments we present a simpler proof of the Ω̃(n^{1−2/k}) bit complexity lower bound without using communication complexity.

Categories and Subject Descriptors

F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms

Theory

Keywords

streaming algorithms, linear sketches, lower bounds, communication complexity

1. INTRODUCTION

In the turnstile model of data streams [28, 34], there is an underlying n-dimensional vector x which is initialized to ~0 and evolves through an arbitrary finite sequence of additive updates to its coordinates. These updates are fed into a streaming algorithm, and have the form x ← x + e_i or x ← x − e_i, where e_i is the i-th standard unit vector. This changes the i-th coordinate of x by an additive 1 or −1. At the end of the stream, x is guaranteed to satisfy the promise that x ∈ {−m, −m + 1, . . . , m}^n, where throughout we assume that m ≥ 2n. The goal of the streaming algorithm is to make one pass over the data and to use limited memory to compute functions of x, such as the frequency moments [1], the number of distinct elements [19], the empirical entropy [12], the heavy hitters [14, 17], and, treating x as a matrix, various quantities in numerical linear algebra such as a low rank approximation [15]. Since computing these quantities exactly or deterministically often requires a prohibitive amount of space, these algorithms are usually randomized and approximate.

Curiously, all known algorithms in the turnstile model have the following form: they sample an r × n matrix A from a distribution on integer matrices in a preprocessing phase, and then maintain the "linear sketch" A · x during the stream.^1 For known solutions to the above problems, the sketch A · x is maintained over the integers (rather than, say, over a finite field). From A · x, the algorithm then computes an arbitrary function of A · x, which should, with constant probability, be a good approximation to the desired function f(x).

^1 There are algorithms in the insertion-only model, where all the updates are required to be positive, which are not linear sketches, e.g., the algorithm for frequency moments given by Alon et al. [1].
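To make the linear-sketch template just described concrete, the following is a minimal Python sketch of a turnstile algorithm that maintains A · x under ±e_i updates. The random sign matrix and the crude squared-norm estimator are illustrative stand-ins only, not the construction analyzed in this paper:

import random

def sample_sketch_matrix(r, n, seed=0):
    """Toy preprocessing step: sample an r x n integer matrix A.
    (Here: random signs; the paper's construction is different.)"""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(r)]

class LinearSketch:
    """Maintains the sketch A*x over the integers under turnstile updates."""
    def __init__(self, A):
        self.A = A
        self.state = [0] * len(A)          # A * x for the current x

    def update(self, i, delta):
        """Process the update x <- x + delta * e_i (delta is +1 or -1)."""
        for row in range(len(self.A)):
            self.state[row] += delta * self.A[row][i]

    def output(self):
        """An arbitrary function of A*x; here a crude squared-l2 estimate."""
        r = len(self.state)
        return sum(v * v for v in self.state) / r

# Example: n = 4, stream of updates (coordinate, +1/-1).
sketch = LinearSketch(sample_sketch_matrix(r=8, n=4))
for i, delta in [(0, 1), (2, 1), (2, 1), (0, -1), (3, 1)]:
    sketch.update(i, delta)
print(sketch.output())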
Linear sketches are well-suited for turnstile streaming algorithms since, given the state A · x and an additive update x ← x + e_i, the new state A · (x + e_i) can be computed as A · x + A_i, where A_i is the i-th column of A. This raises the natural question of whether or not all turnstile streaming algorithms need to in fact be linear sketches.

Several works already consider the question of lower bounds on the dimension of linear sketches themselves, rather than trying to obtain bit complexity lower bounds [4, 27, 32, 36]. For instance, Andoni et al. [4] prove a lower bound on the dimension of a randomized linear sketch for estimating frequency moments and state "We stress that essentially all known algorithms in the general streaming model are in fact linear estimators." The relationship between the space complexity of the optimal turnstile algorithm and that of a linear sketch is thus important for understanding the applicability of lower bounds for linear sketches.^2

^2 We note that the sketching lower bounds mentioned are for real input vectors, so one would still need to obtain dimension lower bounds for linear sketches for integer input vectors to apply them to linear sketches in the turnstile model.

Some progress has been made on the above question. Ganguly [22] shows that for approximating the ℓ1-heavy hitters, any deterministic streaming algorithm might as well be a linear sketch over the integers. Ganguly also generalized this reduction to deterministic streaming algorithms for approximating the ℓp-heavy hitters, for every 1 ≤ p ≤ 2 [23].

In a related work, Feldman et al. [18] introduced the MUD model for computational tasks, which stands for massive, unordered, distributed algorithms, and which contains linear sketches as a special case. The authors show that any deterministic streaming algorithm for computing a symmetric (order-invariant) total single-valued function exactly can be simulated by a MUD algorithm. They partially extend this to approximation algorithms, but require that the approximating function is also a symmetric function of its input, rather than only requiring that the function being approximated is symmetric. For randomized algorithms they require that for each fixed random string, the streaming algorithm computes a symmetric function of its input.

Thus, existing results on this problem were either for deterministic algorithms and specific problems, or did not work for arbitrary approximation algorithms.

Our Contributions: We show that any 1-pass constant probability streaming algorithm for approximating an arbitrary function f of x in the turnstile model can be implemented by sampling a matrix A from the uniform distribution with support on O(n log m) integer matrices, each with entries of magnitude poly(n log m), and maintaining the linear sketch A · x over the integers. Furthermore, the logarithm of the number of possible states of A · x, as x ranges over {−m, −m + 1, . . . , m}^n, plus the number of random bits needed to sample A, is at most an O(log m) factor larger than the space required of the space-optimal algorithm for approximating f in the turnstile model. Our result helps explain why all known streaming algorithms in the turnstile model are linear sketches.

We note that the logarithmic factor loss in our result is necessary for some problems, e.g., for the function f(x) = x_1 mod 2. In this case, the optimal algorithm has two states and maintains x_1 mod 2, while the optimal sketching algorithm Ax over the integers must have at least log m states, since with large constant probability A · (1, 0, . . . , 0) ≠ 0, and by scaling, A · (i, 0, . . . , 0) results in a different state for each i ∈ [m].

While the distribution on sketching matrices A that we create can be sampled from with an O(log n + log log m) bit random seed, in general the O(n log m) matrices in its support cannot be represented in small space. The output function of the sketching algorithm, given A · x and the random bits used to sample A, may also not be computable in small space. These issues can be handled by allowing the sketching algorithm to be non-uniform, though there are other ways of addressing them as well. These are discussed further below, and do not affect one of our main applications to lower bounds, described next.

Communication Complexity: One consequence of our result is for proving lower bounds on the space required of algorithms in the turnstile model, which is often done using communication complexity. Typically one creates a communication problem in which there are two or more players P_1, . . . , P_k, each with an input X_i, and lower bounds the communication required among the players to compute a function f(X_1, . . . , X_k). For 1-pass lower bounds, it suffices to consider 1-way communication, in which P_1 speaks to P_2, who speaks to P_3, etc., and P_k outputs the answer. To obtain lower bounds in the turnstile model, each player P_i creates a data stream σ_i from his/her input X_i. Then P_1 runs a data stream algorithm A on σ_1, and transmits the memory contents of A to P_2, who continues the computation of A on σ_2, etc. At the end, A will have been executed on the stream σ_1 ◦ σ_2 ◦ · · · ◦ σ_k. If the output of A determines f with constant probability, then the space of the streaming algorithm is at least the randomized 1-way communication complexity of f divided by k.

While for some problems randomized 1-way communication lower bounds are easy via a reduction from the Indexing problem [31], for other problems such as k-player Set Disjointness and Gap-ℓ∞ [7, 13, 26, 30], 1-way lower bounds are not much easier to prove than 2-way communication lower bounds with current techniques. A weaker model than the 1-way model is the simultaneous communication model. In this model there are k players P_1, . . . , P_k, each with an input X_i, but the communication is even more restricted. There is an additional player, called the referee, and each of the players P_i can only send a single message, and only to the referee, though they may share a common random string. The referee announces the output, which should equal f(X_1, . . . , X_k) with constant probability. Before our work, one could not use randomized simultaneous communication lower bounds for f to obtain space lower bounds for streaming algorithms, since there is no simulation in which the state of a streaming algorithm can be passed sequentially among the players. However, since our result shows that the optimal streaming algorithm can be implemented using a sketching matrix A, up to a logarithmic factor, each player P_i can compute A · x_i, where x_i is the underlying vector associated with the stream σ_i that P_i creates from his/her input X_i to the communication game. The referee uses linearity to compute A · (x_1 + x_2 + · · · + x_k), from which it can compute the output of the streaming algorithm.

Our result thus shows that it suffices to consider simultaneous communication complexity to prove lower bounds. This makes progress on Question 19 on Sketching versus Streaming in the IITK Open Problems in Data Streams list from 2006, which asks to show that any symmetric function that admits a good streaming algorithm also admits a sketching algorithm. Our result shows this even for non-symmetric functions (in the turnstile streaming model). One well-studied problem for which it is easier to prove lower bounds in the simultaneous model is k-player Set Disjointness; see Theorem 18 of [6], which predates the lower bound [7] in the one-way model. This problem is used to prove lower bounds for approximating the p-th frequency moment F_p = Σ_{i=1}^{n} |x_i|^p, p > 2, in the turnstile model, for which it is known that Θ̃(n^{1−2/p}) bits of space is necessary and sufficient. This problem has a long history [1, 2, 3, 7, 10, 11, 13, 16, 20, 21, 24, 25, 29, 33, 39]. While we can use the simultaneous lower bound for k-player Set Disjointness to prove an Ω̃(n^{1−2/p}) bound for frequency moments, we give an even simpler proof for linear sketches with polynomially bounded entries, without using communication complexity at all, in Appendix C. By our reduction, this gives a bit complexity lower bound for turnstile streaming algorithms in general.
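A small sketch of the simultaneous-model simulation described above, under the assumption that the streaming algorithm has already been converted into a linear sketch with a matrix A derived from the shared random string: each player sends only A · x_i to the referee, who uses linearity to recover A · (x_1 + · · · + x_k). The toy sign matrix and inputs below are illustrative assumptions only:

import random

def shared_sketch_matrix(r, n, seed):
    """All players derive the same matrix A from the shared random seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(r)]

def player_message(A, x):
    """Player i's single message to the referee: the vector A * x_i."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def referee(messages):
    """The referee adds the sketches; by linearity this equals A*(x_1+...+x_k),
    and the sketching algorithm's output function can then be applied to it."""
    return [sum(col) for col in zip(*messages)]

n, r, seed = 6, 4, 1234
A = shared_sketch_matrix(r, n, seed)
inputs = [[1, 0, 2, 0, 0, 0], [0, -1, 0, 0, 3, 0], [0, 0, 0, 1, 0, -2]]
msgs = [player_message(A, x) for x in inputs]
combined = referee(msgs)
x_total = [sum(col) for col in zip(*inputs)]
assert combined == player_message(A, x_total)   # linearity check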
Non-uniformity. It may not be possible to represent the sketching matrix A or compute the output given Ax in small space, even though A is sampleable with only O(log n + log log m) random bits. A natural way of addressing this is to allow the streaming algorithm to be non-uniform, meaning that for each n it has the uniform distribution on our O(n log m) matrices A hardwired into a read-only tape. We could further hardwire the output of Az for each possible value of Az and each of the O(n log m) possible A. The number of such outputs is (mn)^{O(rank(A))}; since the algorithm uses Θ(rank(A) log(mn)) bits of space to maintain Ax, the algorithm does not need additional space to index into this list of outputs. This hardwiring is analogous to the distinction between circuits and Turing machines, and is sometimes the definition used for streaming algorithms, see, e.g., [9].

A second way of addressing this is, since our procedure for finding these O(n log m) matrices A with associated outputs is constructive, when processing an update x ← x + e_i the algorithm could run this procedure to reconstruct A_i and add A_i to the sketch. This process is consistent across different updates since the algorithm stores the same random seed. Similarly, it could run a constructive procedure to compute its output. Before and after processing each update or output, the state of the streaming algorithm consists only of its random seed together with its sketch Ax for the current x. However, the algorithm is allowed more space while processing an update or computing the output.

This non-uniformity does not affect our application to using simultaneous communication complexity to prove streaming lower bounds, since the parties can locally create the same O(n log m) matrices A, and use the common random seed to sample an A. Moreover, the local space and time complexities are not counted in the communication lower bounds, and so the referee can use additional space to compute the output.

Our Techniques: Our starting point is an elegant work of Ganguly [22] which introduces concepts such as path-reversibility and path-independence for modeling a streaming algorithm by a deterministic automaton. One property of deterministic automata is that if the input vector x goes to multiple different states (induced by distinct input streams), then these states can be "merged" and the output of the new state can be chosen to be the output of any of the states being merged. We deal with automata which only have large success probability on an input distribution, and so this is no longer true, i.e., if the merging is not done carefully our success probability could drop dramatically. We instead partition the state space into connected components and perform random walks in these components, where the walks correspond to variable-length streams with underlying input ~0. Our outputs are defined by samples from the stationary distribution of these walks.

For each fixed random string of the automaton we obtain a deterministic automaton whose state space is isomorphic to the quotient module Z^n/M for a submodule M of Z^n. In Ganguly's work [22], for his specific problem of approximating the ℓ1-heavy hitters he could then remove torsion from Z^n/M so that this quotient module is free. This is not possible for general problems (e.g., when computing x_1 mod 2 the quotient module would be Z/2Z and removing torsion would make the quotient module 0). We instead crucially rely on the Smith Normal Form of Z^n/M, which allows us to write the states of the automaton as Bx mod q for an integer matrix B with r rows, where q = (q_1, . . . , q_r) is an integer vector.

The remainder of our proof has two main ingredients. The first is in showing how to go from Bx mod q to a linear sketch A over the integers whose number of states is not much larger than the original number of states. The main difficulty here is that each of these states should be the image of an x ∈ {−m, −m + 1, . . . , m}^n, not of an arbitrary x ∈ Z^n, and we do not know how large the set {Bx mod q | x ∈ {−m, −m + 1, . . . , m}^n} is. The second ingredient is to show that while the entries of A may be exponentially large, we can construct an integer matrix with polynomially bounded entries without increasing the number of states by much. Our procedure reduces the coefficients of random linear combinations of the rows of A modulo random small primes.

2. PRELIMINARIES

Notation. Let Z denote the set of integers and Z_{|m|} = {−m, . . . , m}. For a random variable X, we write X ∼ D if X is subject to distribution D. We shall denote automata by script letters A, B, . . . and matrices and modules by the regular italic letters A, B, . . . . We also define a mod 0 = a for all a ∈ Z and allow ourselves to say that 0 is divisible by any integer a, denoted as a|0, even for a = 0.

Modules and Smith Normal Forms. Throughout this paper we assume that the 'scalars' of a module correspond to the ring of integers, Z, and we tailor the definitions accordingly.

DEFINITION 1 (MODULE). A Z-module (or module for short) is an additive abelian group A together with a function Z × A → A (the image of (r, a) being denoted by ra) such that for all r, s ∈ Z and a, b ∈ A these four properties hold: (1) r(a + b) = ra + rb, (2) (r + s)a = ra + sa, (3) r(sa) = (rs)a, (4) 1a = a.

Notions such as submodules, module homomorphisms and finitely generated modules can be defined similarly as for groups, and we omit the definitions. Analogously to vector spaces, we can define the notions of linear independence, basis, submodule spanned by a set, etc., and we omit the definitions here. However, since Z is not a field, a Z-module A may not have a basis. In case it has a basis, we call A a free Z-module, or a free module for short. Note that every submodule of a free Z-module is free. Also, if F is a free module, then any bases of F have the same cardinality. Hence we can define the dimension (or rank) of F to be the cardinal number of any basis of F. As in the case of vector spaces, if we have a linearly independent set of F we can always extend it to a basis of F.

The following is an important structure theorem of modules.

THEOREM 1 (STRUCTURE THEOREM). Let A be a finitely generated module over Z. Then A ≅ Z_{a_1} ⊕ Z_{a_2} ⊕ · · · ⊕ Z_{a_r} ⊕ Z^t for some integers a_1 | a_2 | · · · | a_r. The numbers r, t and a_1, . . . , a_r are uniquely determined by A.

The structure of a module is closely related to the Smith Normal Form. We explore the relation below.

DEFINITION 2. A matrix A ∈ Z^{n×n} is called unimodular if det A = 1 or −1.

Note that the inverse of a unimodular matrix is an integer matrix and also unimodular.

THEOREM 2 (SMITH NORMAL FORM). Let A ∈ Z^{m×n}. Then A can be written as A = SDT, where S ∈ Z^{m×m} and T ∈ Z^{n×n} are unimodular, and D ∈ Z^{m×n} is a rectangular diagonal matrix with diagonal elements a_1, . . . , a_r, 0, . . . , 0 for some r ≤ min{m, n} and nonzero a_1, . . . , a_r with a_1 | · · · | a_r. The matrix D is called the Smith Normal Form of A, and the diagonal entries a_1, . . . , a_r are uniquely determined by A.

To connect free modules with Theorem 2, consider the following problem: Suppose that M is a submodule of a free module F. We want to find a basis {f_1, . . . , f_n} of F together with integers a_1, . . . , a_r ∈ Z with a_1 | · · · | a_r such that {a_1 f_1, . . . , a_r f_r} is a basis of M. We say {f_1, . . . , f_n} is a compatible basis of F and M.

Pick an arbitrary basis e_1, . . . , e_n of F and suppose that M admits a basis b_1, . . . , b_m with coefficient matrix A ∈ Z^{m×n} with respect to e_1, . . . , e_n, that is, b_i = Σ_j A_{ij} e_j. Reduce A to its Smith Normal Form A = SDT; then

(b_1, . . . , b_m)^T = S · D · T (e_1, . . . , e_n)^T =: S · D · (f_1, . . . , f_n)^T,

where D is the Smith Normal Form of A. It is clear that f_1, . . . , f_n is a basis of F, and it is easy to check that {a_1 f_1, . . . , a_r f_r} is a basis of M.

Finally, in connection with Theorem 1, we note that F/M ≅ Z/(a_1) ⊕ · · · ⊕ Z/(a_r) ⊕ Z^{n−m}. This fact will be repeatedly used in Section 4. Our convention that 0 divides 0 allows us to write a_1 | · · · | a_{r+n−m}, where a_{r+1} = · · · = a_{r+n−m} = 0, so that in a more unified notation, F/M ≅ Z/(a_1) ⊕ · · · ⊕ Z/(a_{r+n−m}).

Deterministic Stream Automata. The results in this section are largely due to Ganguly [22]. We consider only the problems in which the input is a vector v ∈ Z^n represented as a data stream σ = (σ_1, σ_2, . . . ), in which each element σ_i is an element of Σ = {e_1, . . . , e_n, −e_1, . . . , −e_n} (where the e_i's are canonical basis vectors) such that Σ_i σ_i = v, and in this case we write v = freq σ.
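Returning to the Smith normal form machinery above, here is a small illustration (a sketch assuming SymPy's smith_normal_form helper; the matrix is an arbitrary toy example): computing D for the coefficient matrix of a submodule M ⊆ Z^3 and reading off the invariant factors a_1 | a_2 that describe Z^3/M as in Theorem 1.

from sympy import Matrix, ZZ
from sympy.matrices.normalforms import smith_normal_form

# Rows are generators b_1, b_2 of a submodule M of Z^3,
# written in the standard basis e_1, e_2, e_3.
B = Matrix([[2, 4, 4],
            [6, 6, 12]])

D = smith_normal_form(B, domain=ZZ)
print(D)   # diagonal entries a_1 | a_2 are the invariant factors of M

# Z^3 / M  ≅  Z/(a_1) ⊕ Z/(a_2) ⊕ Z   (one free factor, since rank(B) = 2 < 3)
invariant_factors = [D[i, i] for i in range(min(D.shape)) if D[i, i] != 0]
print(invariant_factors)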

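A coset automaton of the kind discussed in Our Techniques, whose state is Bx mod q for the current frequency vector x, can itself be maintained in a streaming fashion, exactly like a linear sketch but with modular arithmetic. The matrix B and moduli q below are hypothetical toy values, not ones produced by the paper's reduction:

class CosetAutomaton:
    """State is (B x) mod q, i.e., the coset of x in Z^n / M,
    where M is the kernel of the map x -> (B x) mod q."""
    def __init__(self, B, q):
        assert len(B) == len(q)
        self.B, self.q = B, q
        self.state = [0] * len(B)

    def update(self, i, delta):
        """Turnstile update x <- x + delta * e_i."""
        for row in range(len(self.B)):
            self.state[row] = (self.state[row] + delta * self.B[row][i]) % self.q[row]

# Toy example: n = 3, two rows, moduli q = (2, 5).
B = [[1, 0, 1],
     [3, 1, 0]]
q = [2, 5]
aut = CosetAutomaton(B, q)
for i, delta in [(0, 1), (1, 1), (0, 1), (2, -1)]:
    aut.update(i, delta)
print(aut.state)   # equals (B x) mod q for x = (2, 1, -1)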
DEFINITION 3. A deterministic stream automaton A is a deterministic Turing machine that uses two tapes, a one-way (unidirectional) read-only input tape and a (bidirectional) two-way work-tape. The input tape contains the input stream σ. After processing its input, the automaton writes an output, denoted by φ_A(σ), on the work-tape.

A configuration of a stream automaton A is modeled as a triple (q, h, w), where q is a state of the finite control, h is the current head position of the work-tape, and w is the content of the work-tape. The set of configurations of a stream automaton A that are reachable from the initial configuration o on some input stream is denoted by C(A). A stream automaton is a tuple (n, C, o, ⊕, φ), where n specifies the dimension of the underlying vector, ⊕ : C × Σ → C is the configuration transition function, o is the initial configuration of the automaton, φ : C → Z^{p(n)} is the output function, and p(n) is the dimension of the output. For a stream σ we also write φ(o ⊕ σ) as φ(σ) for simplicity.

The set of configurations of an automaton A that are reachable from the origin o on some input stream σ with ‖freq σ‖_∞ ≤ m is denoted by C(A, m). The space of the automaton A with stream parameter m is defined as S(A, m) = log |C(A, m)|.

DEFINITION 4. Let A and B be two stream automata. We say B is an output restriction of A if for every stream σ there exists a stream σ′ such that freq σ′ = freq σ and φ_B(σ) = φ_A(σ′).

A problem over a data stream is characterized by a family of binary relations P_n ⊆ Z^{p(n)} × Z^n, n ≥ 1. We say an automaton A solves a problem P (with domain size n) if for every input stream σ it holds that (φ_A(σ), freq σ) ∈ P_n. It is clear that if A solves a problem and B is an output restriction of A, then B solves the same problem.

Now we introduce more concepts of different kinds of automata. A stream automaton A is said to be path-reversible if for every stream σ and configuration s, s ⊕ (σ ◦ σ^{−1}) = s, where σ^{−1} denotes the inverse stream of σ (obtained by reversing σ and negating each update); it is said to be path independent if for each configuration s and input stream σ, s ⊕ σ is dependent only on freq σ and s.

Suppose that A is a path independent automaton. We can define a function + : Z^n × C → C as x + a = a ⊕ σ, where freq σ = x. Since A is a path independent automaton, the function + is well-defined. In [22] it is proved that

THEOREM 3. Suppose that A is a path independent automaton with initial configuration o. Let M = {x ∈ Z^n : x + o = 0 + o}. Then M is a submodule of Z^n, and the mapping x + M ↦ x + o is a set isomorphism between Z^n/M and the set of reachable configurations {x + o : x ∈ Z^n}.

The submodule M above is called the kernel of A. The theorem implies that A gives the same output for all vectors in x + M, and that the transition function of A is the canonical addition on Z^n/M. Conversely, given a module M ⊆ Z^n one can construct an automaton A whose states are the cosets of M and whose transition function is the canonical addition on Z^n/M. Furthermore, A has poly(n) states in its finite control. This construction uses the optimal space.

3. RANDOMIZED STREAM AUTOMATA

In this section, we shall extend the notions and results of deterministic stream automata to randomized stream automata. The following definition first appeared in [23], but the author did not define its space complexity.

DEFINITION 6. A randomized stream automaton is a deterministic stream automaton with one additional tape for the random bits. The random bit string R is initialized on the random bit tape before any input record is read; thereafter the random bit string is used in a two-way read-only manner. The rest of the execution proceeds as in a deterministic stream automaton.

A randomized stream automaton A is said to be path-independent (respectively, path-reversible) if for each randomness R the deterministic instance A_R is path-independent (respectively, path-reversible). The space complexity of A with stream width parameter m is defined as

S(A, m) = max_R {|R| + S(A_R, m)}.

Next we show how to reduce a general automaton to a path independent automaton. The reduction of deterministic automata is due to Ganguly [22], where he first reduces a general automaton to a path reversible automaton and thence to a path independent automaton. We take the same approach for randomized automata. The difficulty is in defining the output of the reduced automaton.

THEOREM 4. Suppose that a randomized path reversible automaton A succeeds in solving problem P on any stream with probability at least 1 − δ. Let Π be an arbitrary distribution over streams. There exists a deterministic path independent automaton B with S(B, m) ≤ S(A, m) + O(log n) which succeeds with probability at least 1 − 2δ on Π.

PROOF.
Let S be the set of all streams of length at most L whose Suppose σ and τ are two streams. Let σ ◦ τ be the stream obtained frequency vector is ~0. We choose L large enough such that for any by concatenating τ to the end of σ, so freq(σ◦τ) = freq σ+freq τ. two states o1 and o2 of A, if there exists a stream σ with freq σ = ~0 −1 The inverse stream of σ is denoted by σ and defined inductively such that o1 ⊕ σ = o2, then at least one such σ is included in S. −1 −1 −1 −1 −1 A finite bound L = L(C(A, m), n) can be obtained as follows. as follows: ei = −ei, −ei = ei and (σ ◦ τ) = τ ◦ σ . A path in the transition graph of A is a linear combination of a DEFINITION 5. A stream automaton A is said to be path inde- simple path and simple cycles of the graph. Thus, whether there pendent if for each configuration s and input stream σ, s ⊕ σ is is freq σ = ~0 such that o1 ⊕ σ = o2 can be reduced to whether there is a non-negative solution (multiplicities of the simple cycles) (V 0,E0) be the directed acyclic graph where each vertex repre- for a system of n linear equations (all frequencies add up to 0). sents a strongly connected component of G. Let rep : V 0 → V be A bound on the magnitude of the solution can be found in [38]. a map such that rep(v) is a (fixed) arbitrary vertex in the strongly We also include in S the empty stream. Pick W to be a number connected component v and com : V → V 0 a map from a vertex large enough such that a random walk on an undirected graph of v ∈ G to its strongly connected component. We call a strongly |C(A, m)| vertices and nL edges would mix. An upper bound for connected component of G (correspondingly a vertex in G0) termi- L W is 2O((2n) ). Construct a distribution Π0 as follows. For each nal if there is no edge from it to the rest of the graph. Define a map 0 0 0 stream τ ∈ Π, we include new streams τ ◦ σ1 ◦ · · · ◦ σW in Π for α : V → V where α(v) is an arbitrary terminal vertex reachable 0 all σ1, . . . , σW ∈ S. from v. Let the states of B be the set of terminal vertices of G . 0 By Yao’s principle, there exists a choice of randomness Define the transition function ⊕ on the states of B as R and thus a deterministic automaton AR such that AR succeeds 0 α(s) ⊕ ei = α(com(rep(α(s)) ⊕ ei)). on Π0 with probability at least 1−δ. Let G be the associated multi- 0 graph of the states of AR where two vertices o1, o2 are connected It is clear that the transition function ⊕ is well-defined: for any s, t 0 0 by an arc if there exists σ ∈ S such that o1 ⊕ σ = o2. Since AR with α(s) = α(t), we have α(s)⊕ ei = α(t)⊕ ei. It is easy to see is path reversible, the graph G is undirected. Since S contains the that B is path reversible. The proof is postponed to Appendix A. empty stream, a random walk in a connected component C of G 0 0 LEMMA 6. Consider a terminal vertex u ∈ V . We have u ⊕ π converges to a stationary distribution C . e ◦ −e = u. Construct a randomized automaton B as follows. Define a state i i of B for each connected component of G. Let o1 and o2 be two Next we set the initial state of B. Let u0 be the initial state of AR states of AR in the same connected component in G, i.e., there and let π be the stationary distribution for the random walk starting exists a stream σ ∈ S such that o1 ⊕ σ = o2. Let o3 = o1 ⊕ ei from u0 in G. Note that G is directed so the stationary distribu- and o4 = o2 ⊕ ei. Since AR is path reversible, o3 ⊕ −ei = o1, tion is dependent on the initial state. 
Notice that π is a mixture of so o3 ⊕ (−ei ◦ σ ◦ ei) = o4, which implies that o3 and o4 are in stationary distributions of random walks in terminal components the same connected component. Therefore, the transitions among reachable from that initial state. The initial state of B is picked states of B are well-defined and it is clear that B is path-reversible. randomly from V 0 according to the mixing weight of the terminal Furthermore, B is path independent because there is no stream of components in π. frequency ~0 changing B from a state to a different state. The output Finally, the output of each state of B is picked randomly from of B on a state is picked randomly from the outputs of A according the outputs of the states in the corresponding terminal component to πC . in G according to the stationary distribution of a random walk in Fix a σ ∈ Π. We analyze the probability of AR succeeding over that component. all choices of s1, . . . , sW . Let C be the connected component of G We now show that the expected failure probability of B is no containing the state of AR after processing σ. Because W is large more than 3δ. It then follows from an averaging argument that enough, the distribution of the final state of AR is close (within δ in there exists a deterministic B with failure probability at most 3δ. statistical distance) to πC . Thus, the probability that AR succeeds We shall further need the following two lemmata, whose proofs is at most δ more than the expected success probability of B. are postponed to Appendix A. By an averaging argument, there exists a deterministic automa- ton B achieving success probability at least as high as that of the LEMMA 7. Let s be a vertex in a terminal component of G. For randomized B, which is at least 1 − 2δ. any stream σ, there is only one terminal component reachable from s ⊕ σ via streams of frequency ~0. THEOREM 5. Suppose that a randomized algorithm A solves P on any stream with probability at least 1 − δ. Let Π be an ar- LEMMA 8. If AR starts from a state u ∈ G in a terminal com- bitrary distribution over streams. There exists a deterministic path ponent C, and B starts from C, then after every transition, the state reversible automaton B with S(B, m) ≤ S(A, m) + O(log n) of B always corresponds to the unique terminal component reach- ~ which solves P with probability at least 1 − 3δ on Π. able from the state of A via streams of frequency 0.

PROOF. Let S be the set of all streams of length at most L whose Fix σ ∈ Π. We analyze the probability of AR succeeding over frequency vector is ~0. We choose L large enough such that for any all choices of s1, . . . , s2W . Because of the aforementioned mixing two states o1 and o2 of A, if there exists a stream σ with freq σ = time of G, the distribution of u0 ⊕ s1 ◦ · · · ◦ sW is within statis- ~0 such that o1 ⊕ σ = o2, then at least one such σ is included tical distance δ from π. Thus, the statistical distance between the in S. A finite bound for L can be obtained as before. We also marginal distribution of u0 ⊕ s1 ◦ · · · ◦ sW over strongly connected include in S the empty stream. Pick W to be a number large enough components of G and the distribution of the initial state of B is at such that a lazy random walk from a fixed vertex on a directed most δ. S(A,m) L graph of 2 vertices and n edges and a positive probability Consider a fixing of s1, . . . , sW such that u1 = u0 ⊕s1 ◦· · · sW of staying in every step would get to within statistical distance δ belongs to a terminal component C of G. We compare the success from a stationary distribution. Note that for the same graph, the probability of A with the success probability of B when its initial stationary distribution is dependent on the fixed starting vertex but state is C. By Lemma 7, there is only one terminal component D 0 this is not a problem for the proof. Construct Π as follows. For reachable from u1 via streams of frequency ~0. Again by the mixing each stream τ ∈ Π, we include new streams s1 ◦ s2 ◦ · · · ◦ sW ◦ time of G, the distribution of u1 ⊕ sW +1 ◦ · · · ◦ s2W is within 0 σ ◦ sW +1 ◦ · · · ◦ s2W in Π for all s1, . . . , s2W ∈ S. statistical distance δ from the stationary distribution of the random Let G be the associated multi-graph of the states of AR where walk in D, which, by Lemma 8, is the state of B. Therefore, the two vertices o1, o2 are connected by an arc if there exists σ ∈ S success probability of A and B differ by at most δ. such that o1 ⊕ σ = o2. Since S contains the empty stream, every In summary, over all random choices, the success probability of vertex in G has a self-loop. A and B differ by at most 2δ and the desirable conclusion fol- Construct a randomized automaton B as follows. Let G0 = lows. Both theorems above conclude with the existence of a determin- Algorithm 1 Extending the module M istic automaton that succeeds with probability ≥ 1−δ on Π for any // ds+1 = ··· = dn = 0 0 given distribution Π over the inputs. By Yao’s minimax theorem, 1: di ← di for i = 1, . . . , ` there exists a randomized automaton that succeeds with probability 2: for i ← ` + 1 to n do 0 0 0 ≥ 1 − δ on any input. But, the number of random bits needed by 3: Mi ← hd1f1, . . . , di−1fi−1, difi, di+1fi+1, . . . , dnfni n the randomized automaton, i.e., the number of different determin- 4: if (kfi + Mi) ∩ Z|m| = ∅ for all k = 1, . . . , di − 1 then. istic path independent automata used by the randomized automa- When di = 0 it means for all k 6= 0 0 ton, could be unbounded. However, following an argument due 5: di ← 1 to Newman [35], it suffices for the randomized automaton to pick 6: else 0 uniformly at random one of O(n log m) deterministic automata. 7: di ← di Therefore, the additional space needed for the random bits is only 8: end if O(log n + log log m). For completeness, we include the argument 9: end for 0 0 0 below. 10: return M ← hd1f1, . . . , dnfni

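The derandomization step stated next (Theorem 9, via Newman's argument [35]) boils down to a Chernoff-plus-union-bound calculation over the (2m+1)^n possible inputs. A small sketch of the parameter setting, with the leading constant treated as a hypothetical placeholder:

import math

def num_instances(n, m, delta, c=8.0):
    """Number t of deterministic instances to sample so that, by a Chernoff
    bound, the empirical success probability on any fixed input deviates by
    more than delta only with probability < (2m+1)^(-n); a union bound over
    all (2m+1)^n inputs then leaves a fixed set of t instances that works for
    every input. The constant c is a placeholder, not the paper's constant."""
    return math.ceil(c * n * math.log(2 * m + 1) / delta ** 2)

t = num_instances(n=1000, m=10 ** 6, delta=0.05)
print(t, "instances, i.e., about", math.ceil(math.log2(t)), "extra random bits")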
THEOREM 9. Let A be a randomized automaton solving prob- lem P on n with failure probability at most δ. There is a ran- Z|m| PROOF. Suppose that (f , . . . , f ) is a compatible basis of M domized automaton B that only needs to pick uniformly at random 1 n and n, and {d f , . . . , d f } is a basis of M, where d |d | · · · |d one of O(δ−2n log m) deterministic instances of A and solves P Z 1 1 s s 1 2 s and d = ··· = d = 1 for some ` ≤ s. Run Algorithm 1 and we on n with failure probability at most 2δ. 1 ` Z|m| claim that the returned M 0 is as desired.

PROOF. Let A1,..., AO(nδ−2 log m) be independent draws of We first show that Property 1 is maintained throughout the loop. n deterministic automata picked by B. Fix an input x ∈ Z|m|. Let In addition to Mi defined on Line 3, we also define pA(x) be the fraction of the automata among A1,...,AO(nδ−2 log m) 0 0 0 Mi = hd1f1, . . . , di−1fi−1, fi, di+1fi+1, . . . , dnfni. that solve problem P correctly on x and pB(x) be the probability n that B solves P on x correctly. By a Chernoff bound, we have that If (kfi + Mi) ∩ = ∅ for all k = 1, . . . , di − 1, then for any −2n Z|2m| Pr{|pA(x)−pB(x)| ≥ δ} ≤ exp(−O(n log m)) < (2m+1) . n 0 n z ∈ Z|2m| it holds that z ∈ Mi whenever z ∈ Mi . Suppose that Taking a union bound over all choices of x ∈ Z|m|, we have 0 z ∈ Mi , we write Pr{|pA(x) − pB(x)| ≥ δ for all x} > 0. Therefore, there exists 0 0 a set of A1,..., AO(nδ−2 log m) such that |pA(x) − pB(x)| ≤ δ z = c1d1f1 +···+ci−1di−1fi−1 +cifi +ci+1di+1fi+1 +···+csdsfs for all x ∈ n . The automaton B simply samples uniformly at Z|m| =: kfi + m, random from this set of deterministic automata. for some k ∈ {0, . . . , di − 1} and m ∈ Mi. Since (kfi + Mi) ∩ n Combining all theorems in this section immediately yields the Z|2m| = ∅ for all k = 1, . . . , di − 1, it must hold that k = 0 and following corollary. thus z ∈ Mi. Property 1 can be proved inductively using the above argument as the inductive step. Property 2 is immediate. −1 THEOREM 10. Suppose that a randomized algorithm A solves Now we prove Property 3. Let S = (f1, . . . , fn) and I = {i : 0 P on any stream with probability at least 1 − δ. There exists a di = di 6= 1} =: {i1, . . . , ir}, then the map g(x) = ((Sx)i1 mod n n randomized automaton B which solve P on Z|m| with probabil- di1 ,..., (Sx)ir mod dir ), x ∈ Z , induces an isomorphism n 0 ity at least 1 − 7δ such that S(B, m) ≤ S(A, m) + O(log n + Z /M ' Z/(di1 ) ⊕ · · · ⊕ Z/(dir ), where it holds automati- 1 log log m + log δ ). cally that di1 |di2 | · · · |dir . To simplify the notation, without loss of generality, we assume that ij = j for 1 ≤ i ≤ r. For each j ∈ [r], it follows from the algorithm that there exists some kj ∈ 4. REDUCTION TO LINEAR SKETCHES 0 n {1, . . . , dij } such that (kj fij +Mij )∩Z|2m| 6= ∅. Pick an arbitrary 0 n n x ∈ (k f +M )∩ , then g(x ) = (0,..., 0, k , 0,..., 0). EMMA j j ij ij Z|2m| j j L 11. Let M ⊆ Z be a module. Then there exists a r 0 r n P submodule M ⊆ M such that We then have 2 distinct vectors in Z|2mn|, namely, j=1 σj xj n 0 0 σ ∈ {0, 1}r 2r M 0 1. For all x, y ∈ Z|m|, it holds that x+M = y+M whenever with , that correspond to distinct cosets of . x + M = y + M. 0 i 2. M admits a basis b1, . . . , bt such that kbik∞ ≤ C2 m for LEMMA 13. Suppose that m ≥ 2n. Given a deterministic some absolute constant C > 0. path-independent automaton A, one can construct a deterministic free automaton B such that n The algorithm to find the basis is standard, so the proof is postponed 1. B is an output restriction of A on Z|m/(2n)|; to Appendix B. 2. S(B, m/(2n)) ≤ S(A, m) log m + O(log n); 2 3. The sketching matrix ΦB satisfies kΦBkmax ≤ exp(O(n + n LEMMA 12. Suppose that m ≥ 1 and M ⊆ Z is a module. n log m)). Then there exists a module M 0 ⊃ M such that the following are PROOF. Let M be the kernel of A and dim M = d. By true. 0 0 n Lemma 11 we may assume that M0 has a basis b1, . . . , bd such 1. For all x, y ∈ Z|m|, it holds that x + M = y + M whenever i n×d that kbik∞ = O(2 m). Let B = (b1, . . . , bd) ∈ Z and x + M 0 = y + M 0. n RBT = diag(c1, . . . 
, cd) (where c1| · · · |cd) be the Smith Nor- 2. The compatible basis of M and Z is also a compatible basis n×n d×d 0 n mal Form of B, where R ∈ Z and T ∈ Z are unimodular of M and Z . n 0 matrices. Now invoking Lemma 12, we can extend M0 to a module 3. Suppose that Z /M ' Z/(q1) ⊕ Z/(q2) ⊕ · · · ⊕ Z/(qr) M such that for some q1|q2| · · · |qr, where qi 6= 1 for all i. It then holds n t (a) Z /M ' Z/(q1) ⊕ · · · Z/(qr) ⊕ Z for some q1|q2| · · · |qr that [x + M 0]: x ∈ n ≥ 2r. Z|2mn| and q1 ≥ 2, r+t n (b) there are at least 2 cosets intersecting Z|m/(2n)| Summing the two probabilities above, we see that (c) there exists a submatrix S ∈ (r+t)×n such that the isomor- Z Prp {hρi, ΦAxi = 0} ≤ Prp {hρi, ΦAxi mod pi = 0} < 1/M. phism in (a) is given by i i Thus x 7→ ((Sx)1 mod q1,..., (Sx)r mod qr, (Sx)r+1,..., (Sx)r+t). t 2 Prp ,...,p ,ρ ,...,ρ {ΦBz = 0} < (1/M) = 1/|X| . From the proof of Lemma 12, we actually have that (q1, . . . , qr) 1 t 1 t is a subsequence of (c1, . . . , cd) and S is formed by r + t rows of Now, taking a union bound over at most |X|2 pairs of (x, y) such R . Without loss of generality, we can further assume that for each that ΦAx 6= ΦAy, we see that there exists a choice of p1, . . . , pt i ≤ r i S {0, 1, . . . , q −1} , the entries of -th row of are contained in i and ρ1, . . . , ρt such that the associated ΦB meets our requirement. by taking the remainders modulo qi. Then for each i ≤ r, |Sx|i ≤ Finally, n qim/2 for all x ∈ Z|m/(2n)|, so given some 0 ≤ k < qi there are n at most m/2 different (Sx)i such that (Sx)i mod qi = k. S(B, m) ≤ log2 {ΦBx : x ∈ Z|m|} + O(log n) We construct a new automaton B whose states correspond to t n ≤ log2(kΦB kmax · mn) + O(log n) XB = {Sx : x ∈ Z|m/(2n)|}, it is clear that B is a free automaton 3 and can be made an output restriction of A. This proves conclusion ≤ t log2(M mn) + O(log n) 1. ≤ 2 log2 |X| · (3 + logM (mn)) + O(log n) Let XA be the set of states of A. For each (k1, . . . , kr, kr+1, . . . , kt) ∈ ≤ 2(3 + log (mn))S(A, m) + O(log n). XA, where 0 ≤ ki < qi for all 1 ≤ i ≤ r, there are at most M (m/2)r different states Sx of B such that

((Sx) mod q ,..., (Sx) mod q , (Sx) ,..., (Sx) ) 1 1 r r r+1 t Combining Lemma 13 and Lemma 14, we conclude this section = (k1, . . . , kr, kr+1, . . . , kr+t). with r This implies that |XB| ≤ (m/2) |XA| and thus THEOREM 15. Suppose that m ≥ 2n. Given a deterministic

S(B, (m/2n)) ≤ r log(m/2) + log |XA| + O(log n) path-independent automaton A, one can construct a deterministic free automaton B such that ≤ log |XA| log(m/2) + log |XA| + O(log n) n 1. B is an output restriction of A on Z|m/(2n)|; ≤ S(A, m) log m + O(log n), 2. S(B, m/(2n)) ≤ 8S(A, m) log m + O(log n); 2 3. The sketching matrix ΦB satisfies kΦBkmax = O((n + where we used the fact (b) that S(A, m/(2n)) ≥ r + t for the 3 second inequality. This proves conclusion 2. n log m) ). n Next we show Conclusion 3. Note that kBkmax = O(2 m). By [37, Chapter 8], we can find S such that Finally, combining with Theorem 10 we conclude the paper with 2n+5 √ 4n 4n+5 4n+1 kSkmax ≤ n ( nkBkmax) kBkmax = n kBkmax . THEOREM 16 (MAIN). Let m, n be positive integers. Sup- pose that a randomized algorithm A solves P on any stream with It follows immediately that kΦBkmax = kRkmax ≤ kSkmax = 2 probability at least 1 − δ. There exists a randomized free automa- exp(O(n + n log m)). n ton B which solve P on Z|m| with probability at least 1 − 7δ such LEMMA 14. Suppose that A is a deterministic automaton with that M 1. S(B, m) = O(S(A, 2mn) log(mn) log n + log log m + sketching matrix ΦA such that kΦAkmax ≤ 2 for some M ≥ log(1/δ)); max{20 log2(mn), 100}. Then there exists a deterministic au- tomaton B such that 2. Each deterministic instance of B corresponds to a sketching 2 3 n matrix ΦB that satisfies kΦBkmax = O((n +n log(mn)) ). 1. B is an output restriction of A on Z|m|; 2. S(B, m) ≤ 2(3 + logM (mn))S(A, m) + O(log n); n 3 REMARK 17. On input x ∈ , it is possible that the fre- 3. ΦB has integer entries such that kΦBkmax ≤ M . Z|m| 2 3 M 2 3 M quency vector, at intermediate steps, has coordinates exceeding m. PROOF. Suppose that x ∈ {−4m n 2 ,..., 4m n 2 }\ However the automaton B can easily handle this by maintaining {0} and P = {p is a prime : p ≤ M 10}. Choose p uniformly 2 3 M ΦBx mod (kΦB kmaxmn). Because of the bound on the number at random from P . Since x has most log2(4m n 2 ) ≤ 1.2M 3 of rows of ΦB in Theorem 15, the space of B remains bounded by distinct prime factors and |P | ≥ M /(11 ln M), we see that O(S(A, 2mn) log(mn) + log n + log log m + log(1/δ)). Pr{p divides x} ≤ (13.2 ln M)/M 2. Suppose that ΦA has r rows and rank(ΦA) = r. Let X = n Acknowledgments {ΦAx : x ∈ Z|m|}. Let p1, . . . , pt be i.i.d. uniform on P and S be a random t×r matrix of entries i.i.d uniform over {0,..., 2M −1}, Yi Li and David Woodruff would like to thank the Simons Insti- tute for the Theory of Computing where part of this work was where t = 2 logM |X|. Construct a matrix ΦB such that (ΦB)ij = (SΦA)ij mod pi. To show that B can be realized as an output done. Huy L. Nguyen would like to thank IBM Research, Almaden, n where part of this work was done and to acknowledge the support of restriction of A, we shall show that if for any x, y ∈ Z|m|, if ΦAx 6= ΦAy then ΦBx 6= ΦBy. Let z = x − y, then ΦAz 6= NSF grant CCF 0832797, an IBM Ph.D. Fellowship, and a Siebel Scholarship. 0. Then for each i, Prρi {hρi, ΦAzi = 0} ≤ 1/(2M). Since 2 3 M |hρi, ΦAzi| ≤ 4m n 2 , it follows from the argument at the be- ginning of the proof that 5. REFERENCES  [1] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity Prpi hρi, ΦAzi mod pi = 0 hρi, ΦAzi 6= 0 2 of Approximating the Frequency Moments. JCSS, ≤ (13.2 ln M)/M < 1/(2M) 58(1):137–147, 1999. [2] A. Andoni. High frequency moment via max stability. [22] S. Ganguly. Lower bounds on frequency estimation of data Available at streams. 
In Proceedings of the 3rd international conference http://web.mit.edu/andoni/www/papers/fkStable.pdf. on Computer science: theory and applications, CSR’08, [3] A. Andoni, R. Krauthgamer, and K. Onak. Streaming pages 204–215, 2008. algorithms via precision sampling. In FOCS, pages 363–372, [23] S. Ganguly. Distributing frequency-dependent data stream 2011. computations. In Proceedings of the Fifteenth Australasian [4] A. Andoni, H. L. Nguyên, Y. Polyanskiy, and Y. Wu. Tight Symposium on Computing: The Australasian Theory - lower bound for linear sketches of moments. In ICALP (1), Volume 94, CATS ’09, pages 163–170, 2009. pages 25–32, 2013. [24] S. Ganguly. Polynomial estimators for high frequency [5] Z. Bar-Yossef. The Complexity of Massive Data Set moments. CoRR, abs/1104.4552, 2011. Computations. PhD thesis, University of California, [25] S. Ganguly. A lower bound for estimating high moments of a Berkeley, 2002. data stream. CoRR, abs/1201.0253, 2012. [6] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. [26] A. Gronemeier. Asymptotically optimal lower bounds on the Information theory methods in communication complexity. nih-multi-party information complexity of the and-function In IEEE Conference on Computational Complexity, pages and disjointness. In STACS, pages 505–516, 2009. 93–102, 2002. [27] M. Hardt and D. P. Woodruff. How robust are linear sketches [7] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. to adaptive inputs? In STOC, pages 121–130, 2013. An information statistics approach to data stream and [28] P. Indyk. Sketching, streaming and sublinear-space communication complexity. J. Comput. Syst. Sci., algorithms, 2007. Graduate course notes available at http: 68(4):702–732, 2004. //stellar.mit.edu/S/course/6/fa07/6.895/. [8] A. Barvinok. Integer Points in Polyhedra. Contemporary [29] P. Indyk and D. P. Woodruff. Optimal approximations of the mathematics. European Mathematical Society, 2008. frequency moments of data streams. In STOC, pages [9] P. Beame, T. S. Jayram, and A. Rudra. Lower bounds for 202–208, 2005. randomized read/write stream algorithms. In STOC, pages [30] T. S. Jayram. Hellinger strikes back: A note on the 689–698, 2007. multi-party information complexity of and. In [10] L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha. Simpler APPROX-RANDOM, pages 562–573, 2009. algorithm for estimating frequency moments of data streams. [31] I. Kremer, N. Nisan, and D. Ron. On randomized one-round In SODA, pages 708–713, 2006. communication complexity. Computational Complexity, [11] V. Braverman and R. Ostrovsky. Recursive sketching for 8(1):21–49, 1999. frequency moments. CoRR, abs/1011.2571, 2010. [32] Y. Li, H. L. Nguyen, and D. P. Woodruff. On sketching [12] A. Chakrabarti, K. Do Ba, and S. Muthukrishnan. Estimating matrix norms and the top singular vector. In SODA, 2014. Entropy and Entropy Norm on Data Streams. In STACS, [33] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error pages 196–205, 2006. lp-sampling with applications. In SODA, 2010. [13] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower [34] S. Muthukrishnan. Data Streams: Algorithms and bounds on the multi-party communication complexity of set Applications. Foundations and Trends in Theoretical disjointness. In CCC, pages 107–117, 2003. Computer Science, 1(2):117–236, 2005. [14] M. Charikar, K. Chen, and M. Farach-Colton. Finding [35] I. Newman. Private vs. common random bits in frequent items in data streams. In ICALP, pages 693–703, communication complexity. Inf. Process. 
Lett., pages 67–71, 2002. 1991. [15] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra [36] E. Price and D. P. Woodruff. Lower bounds for adaptive in the streaming model. In STOC, pages 205–214, 2009. sparse recovery. In SODA, pages 652–663, 2013. [16] D. Coppersmith and R. Kumar. An improved data stream [37] A. Storjohann. Algorithms for Matrix Canonical Forms. PhD algorithm for frequency moments. In Proceedings of the 15th thesis, Eidgenössische Technische Hochschule Zürich, 2000. Annual ACM-SIAM Symposium on Discrete Algorithms [38] J. von zur Gathen and M. Sieveking. A bound on solutions of (SODA), pages 151–156, 2004. linear integer equalities and inequalities. Proceedings of the [17] G. Cormode and S. Muthukrishnan. An improved data American Mathematical Society, 72(1):155–158, 1978. stream summary: the count-min sketch and its applications. [39] D. P. Woodruff. Optimal space lower bounds for all J. Algorithms, 55(1):58–75, 2005. frequency moments. In Proceedings of the 15th Annual [18] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and ACM-SIAM Symposium on Discrete Algorithms (SODA), Z. Svitkina. On distributing symmetric streaming pages 167–175, 2004. computations. ACM Transactions on Algorithms, 6(4), 2010. [19] P. Flajolet and G. N. Martin. Probabilistic counting. In Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 76–82, APPENDIX 1983. [20] S. Ganguly. Estimating frequency moments of data streams A. OMITTED DETAILS IN THE PROOF OF using random linear combinations. In Proceedings of the 8th THEOREM 5 International Workshop on Randomization and Computation (RANDOM), pages 369–380, 2004. 0 PROOFOF LEMMA 6. Let v = u ⊕ e = α(com(rep(u) ⊕ [21] S. Ganguly. A hybrid algorithm for estimating frequency i e )). Let σ be a stream with freq(σ) = ~0 such that (rep(u) ⊕ moments of data streams, 2004. Manuscript. i (ei)) ⊕ σ = rep(v). Note that freq(ei ◦ σ ◦ −ei) = ~0 and u is terminal, we have α(com(rep(u) ⊕ ei ◦ σ ◦ −ei)) = u. Thus, completeness. Let

0 0 0 u ⊕ ei ◦ −ei = α(com(rep(v) ⊕ −ei)) z = [α1]b1 + ··· + [αt]bt ∈ M . = α(com(rep(u) ⊕ e ◦ σ ◦ −e )) i i It is clear that z − z0 ∈ M. If z − z0 6= 0, then there exists a = u. maximum i such that αi is not an integer, so

0 z − z = (α1 − [α1])b1 + ··· + (αi − [αi])bi, PROOFOF LEMMA 7. Let u, v be two vertices in some (possi- where αi − [αi] > 0. Let ˜bi be the orthogonal projection of bi onto bly different) terminal components reachable from s ⊕ σ. Assume ⊥ span{b1, . . . , bi−1} . Then that s⊕σ⊕σu = u, s⊕σ⊕σv = v, where freq(σu) = freq(σv) = ~ 0 0. Since s belongs to a terminal component and freq(σ ◦ σu ◦ d(z − z , span{b1, . . . , bi−1}) = (αi − [αi])k˜bik2 −1 ~ 0 ~ σ ) = 0, there is a stream σu of frequency 0 such that u ⊕ ˜ −1 0 −1 0 d(bi, span{b1, . . . , bi−1}) = kbik2 σ ◦ σu = s. We have u ⊕ σ ◦ σu ◦ σ ◦ σv = v and −1 0 ~ freq(σ ◦ σu ◦ σ ◦ σv) = 0. This implies that u, v must belong to so the same terminal component since they are both in terminal com- 0 ponents. d(z − z , span(b1, . . . , bi−1)) < d(bi, span(b1, . . . , bi−1)). 0 0 PROOFOF LEMMA 8. We prove by induction. Let u be the Now, let x be the orthogonal projection of z−z onto span(b1, . . . , bi−1) 0 current state of AR and C the current state of B. By the induction such that hypothesis, C0 = α(com(u0)). Consider the transition induced 0 x = β1b1 + ··· + βi−1bi−1, β1, . . . , βi−1 ∈ by processing ei. Because C is the unique terminal component R 0 ~ reachable from u , there is a stream σ of frequency 0 such that and let 0 0 (u ⊕ ei) ⊕ −ei ◦ σ = rep(C ). By definition, the state of B after 0 0 0 the transition is α(com(rep(C ) ⊕ ei)) = α(com((u ⊕ ei) ⊕ x = [β1]b1 + ··· + [βi−1]bi−1, −ei ◦ σ ◦ ei)), which is the unique terminal component reachable then x0 ∈ M ∩ span(b , . . . , b ), x − x0 ∈ P(b , . . . , b ) and 0 ~ 1 i−1 1 i−1 from u ⊕ ei via streams of frequency 0. z − z0 − x0 ∈ M \ X. It follows that

0 0 0 0 0 B. PROOF OF LEMMA 11 d(z − z − x , P(b1, . . . , bi−1)) ≤ d(z − z − x , x − x ) = d(z − z0, x) 0 Algorithm 2 Finding an independent set of M = d(z − z , span{b1, . . . , bi−1})

1: X ← {0} < d(bi, span{b1, . . . , bi−1}) 2: t ← 0 n ≤ d(bi, P(b1, . . . , bi−1)). 3: while (M \ X) ∩ Z|2m| 6= ∅ do n 4: Pick y ∈ (M \ X) ∩ Z|2m| Observe that our choice of bi actually guarantees that 5: if t = 0 then 6: b1 ← y d(bi, P(b1, . . . , bi−1)) ≤ d(w, P(b1, . . . , bi−1)), 7: else ∀w ∈ M \ span{b1, . . . , bi−1} 8: ρ ← d(y, P(b1, . . . , bt)) 0 9: Y ← {x ∈ M \ X : d(x, P(b1, . . . , bt)) ≤ ρ} We meet a contradiction. Therefore, z = z and αi ∈ Z for all 10: bt+1 ← arg minz∈Y d(z, P(b1, . . . , bt)) . pick an i. arbitrary one if there is a tie 11: end if 12: t ← t + 1 C. FREQUENCY MOMENTS 13: X = span{b1, . . . , bt} Provided that streaming algorithms can be implemented by a lin- 14: end while ear sketch without much loss in space complexity, a lower bound 15: return b1, . . . , bt of frequency moment problem follows easily, without resorting to complicated communication complexity. C.1 Information Theory and Hellinger Distance PROOF. We run Algorithm 2 to find linearly independent vec- Suppose that X is a discrete random variable on Ω with distri- tors b1, . . . , bt ∈ M for some t. Since Y is a finite set, it is always bution p(x). Then the entropy H(X) of the random variable X is possible to find a bi+1 in Line 10. The algorithm will always termi- P n defined by H(X) = − x∈Ω p(x) log2 p(x). The joint entropy nate since Z|2m| is a finite set, and it is easy to see that the returned H(X,Y ) of a pair of discrete random variables (X,Y ) with joint set of vectors b1, . . . , bt are linearly independent. It is not diffi- i distribution p(x, y) is defined as H(X,Y ) = − P P p(x, y) log p(x, y). cult to verify inductively that kbik∞ ≤ C2 m for some absolute x y constant C > 0. We also define the conditional entropy H(X|Y ) as H(X|Y ) = 0 P Now let M = hb1, . . . , bti, the module generated by b1, . . . , bt. y H(X|Y = y) Pr{Y = y}, where H(X|Y = y) is the en- 0 n X {Y = It is clear that M is a submodule of M. Suppose that x, y ∈ Z|m| tropy of the conditional distribution of given the event and x + M = y + M. We want to show that x + M 0 = y + M 0. y}. The mutual information I(X; Y ) is the relative entropy be- Let z = x − y. By the termination condition of the algorithm, tween the joint distribution and the product distribution p(x)p(y) (where p(x) and p(y) are marginal distributions), i.e., I(X; Y ) = there exist α1, . . . , αt ∈ R such that z = α1b1 + ··· + αtbt. It P p(x,y) is a standard argument as in [8, Lemma 10.2] to show that all αi’s x,y p(x, y) log p(x)p(y) . The following are the basic properties are integers and thus z ∈ M 0. We include the argument below for regarding entropy and mutual information. PROPOSITION 18. Let X,Y,Z be discrete random variables Next we prove (1). Observe that defined on ΩX , ΩY , ΩZ respectively and f a function defined on X Ω. Then I(Φx; x) = I(Φx; xi|x1, . . . , xi−1) i 1. 0 ≤ H(X) ≤ log |ΩX |, the right equality is attained iff X is X uniform on ΩX . = H(xi|x1, . . . , xi−1) − H(xi|Φx, x1, . . . , xi−1) 2. Conditioning reduces entropy: H(X|Y ) ≤ H(X). i 3. I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) ≥ 0 X = H(x ) − H(x |Φx, x , . . . , x ) and the equality is attained iff X and Y are independent; i i 1 i−1 i 4. Chain rule for mutual information: I(X,Y ; Z) = I(X; Z)+ I(X; Y |Z); (since xi’s are independent) X 5. If X,Y are jointly independent of Z then H(X|Y,Z) = ≥ H(xi) − H(xi|Φx)(conditioning reduces entropy) H(X|Y ). i X = I(Φx; xi) DEFINITION 7. 
The Hellinger distance h(P, Q) between probability distributions P and Q on a domain Ω is defined by

h^2(P, Q) = 1 − Σ_{ω∈Ω} sqrt(P(ω) Q(ω)) = (1/2) Σ_{ω∈Ω} (sqrt(P(ω)) − sqrt(Q(ω)))^2.

One can verify that the Hellinger distance is a metric satisfying the triangle inequality, see, e.g., [7]. The following proposition connects the Hellinger distance and the total variation distance.

PROPOSITION 19 (see, e.g., [5]). h^2(P, Q) ≤ d_TV(P, Q) ≤ sqrt(2) · h(P, Q).

In connection with mutual information, we have that

LEMMA 20 ([7]). Let F_{z_1} and F_{z_2} be two random variables. Let Z denote a random variable with uniform distribution in {z_1, z_2}. Suppose F(z) is independent of Z for each z ∈ {z_1, z_2}. Then I(Z; F(Z)) ≥ h^2(F_{z_1}, F_{z_2}).

C.2 Lower bound

Suppose that x ∈ R^n. We say that a data stream algorithm solves the (ε, p)-NORM problem if its output X satisfies (1 − ε)‖x‖_p ≤ X ≤ (1 + ε)‖x‖_p with probability ≥ 1 − δ.

Let µ be a distribution on R^n defined as follows. Choose x uniformly at random from {0, 1}^n and i uniformly at random from {1, . . . , n}. With probability 1/2, let x_i = (2n)^{1/p}. Let µ be the distribution of the resulting x.

THEOREM 21. For any p > 2, any randomized free automaton that solves the (c, p)-NORM problem (c < sqrt(2)) for x ∼ µ with probability ≥ 19/20, where m ≤ poly(n), requires Ω(n^{1−2/p}/ log^2 n) space.

PROOF. In the light of Theorem 15, we may increase the space complexity by a log m factor and assume that the algorithm maintains a linear sketch whose sketching matrix Φ ∈ Z^{k×n} satisfies ‖Φ‖_max = O(poly(n)). Suppose that Φ = (φ_1, . . . , φ_n) and x ∼ {0, 1}^n. We shall show that

I(Φx; x) = Ω(n^{1−2/p}).    (1)

Assume this is true for the moment. On the other hand, since the entries of Φ are bounded by poly(n), we have

I(Φx; x) ≤ H(Φx) ≤ log((poly(n))^k) = O(k log n),

and it follows that k = Ω(n^{1−2/p}/ log n), as desired.

It remains to prove (1). By the chain rule for mutual information, the independence of the coordinates of x, and the fact that conditioning reduces entropy (Proposition 18),

I(Φx; x) ≥ Σ_i I(x_i φ_i + R_i; x_i),    where R_i = Σ_{j≠i} x_j φ_j.

It suffices to show that I(x_i φ_i + R_i; x_i) = Ω(n^{−2/p}) for Ω(n) values of i. By correctness of the sketch, it must hold that

h^2((2n)^{1/p} φ_i + R_i, j_0 φ_i + R_i) = Ω(1) for Ω(n) values of i,

for either j_0 = 0 or j_0 = 1. Observe that h^2(j φ_i + R_i, (j−1) φ_i + R_i) = h^2(φ_i + R_i, R_i) for all j ≥ 1. No matter whether j_0 = 0 or 1, we therefore have

Ω(1) = h^2((2n)^{1/p} φ_i + R_i, j_0 φ_i + R_i) ≤ ( Σ_{j=j_0+1}^{(2n)^{1/p}} h(j φ_i + R_i, (j−1) φ_i + R_i) )^2 = O(n^{2/p} h^2(φ_i + R_i, R_i)),

whence it follows immediately that

I(x_i φ_i + R_i; x_i) ≥ h^2(φ_i + R_i, R_i) = Ω(1/n^{2/p}),

where we invoked Lemma 20 for the inequality, noting that R_i and x_i are independent.
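A small numerical illustration of the Hellinger machinery used above (a sketch with an arbitrary toy distribution standing in for R_i, not the actual sketch distribution): the triangle inequality bounds the distance between the endpoints of a chain of shifted distributions by the sum of consecutive distances, and each consecutive distance is the same, exactly as in the chain of inequalities in the proof.

import math

def hellinger(p, q):
    """Hellinger distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(0.5 * s)

def shift(p, j):
    """Distribution of X + j when X ~ p (integer-valued)."""
    return {k + j: v for k, v in p.items()}

# Toy base distribution playing the role of R_i.
R = {0: 0.5, 1: 0.3, 2: 0.2}
t = 5
endpoints = hellinger(shift(R, 0), shift(R, t))
chain = sum(hellinger(shift(R, j - 1), shift(R, j)) for j in range(1, t + 1))
print(endpoints, "<=", chain)     # triangle inequality along the chain
# Each consecutive distance is identical, since shifting both arguments by 1 changes nothing:
print({hellinger(shift(R, j - 1), shift(R, j)) for j in range(1, t + 1)})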