Turnstile Streaming Might as Well Be Linear Sketches

Yi Li, Max-Planck Institute for Informatics, [email protected]
Huy L. Nguyên, Princeton University, [email protected]
David P. Woodruff, IBM Research, Almaden, [email protected]

ABSTRACT

In the turnstile model of data streams, an underlying vector x ∈ {−m, −m+1, . . . , m−1, m}^n is presented as a long sequence of positive and negative integer updates to its coordinates. A randomized algorithm seeks to approximate a function f(x) with constant probability while only making a single pass over this sequence of updates and using a small amount of space. All known algorithms in this model are linear sketches: they sample a matrix A from a distribution on integer matrices in the preprocessing phase, and maintain the linear sketch A · x while processing the stream. At the end of the stream, they output an arbitrary function of A · x. One cannot help but ask: are linear sketches universal?

In this work we answer this question by showing that any 1-pass constant probability streaming algorithm for approximating an arbitrary function f of x in the turnstile model can also be implemented by sampling a matrix A from the uniform distribution on O(n log m) integer matrices, with entries of magnitude poly(n), and maintaining the linear sketch Ax. Furthermore, the logarithm of the number of possible states of Ax, as x ranges over {−m, −m + 1, . . . , m}^n, plus the amount of randomness needed to store A, is at most a logarithmic factor larger than the space required of the space-optimal algorithm. Our result shows that to prove space lower bounds for 1-pass streaming algorithms, it suffices to prove lower bounds in the simultaneous model of communication complexity, rather than the stronger 1-way model. Moreover, the fact that we can assume we have a linear sketch with polynomially-bounded entries further simplifies existing lower bounds, e.g., for frequency moments we present a simpler proof of the Ω̃(n^{1−2/k}) bit complexity lower bound without using communication complexity.

Categories and Subject Descriptors

F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms

Theory

Keywords

streaming algorithms, linear sketches, lower bounds, communication complexity

1. INTRODUCTION

In the turnstile model of data streams [28, 34], there is an underlying n-dimensional vector x which is initialized to ~0 and evolves through an arbitrary finite sequence of additive updates to its coordinates. These updates are fed into a streaming algorithm, and have the form x ← x + e_i or x ← x − e_i, where e_i is the i-th standard unit vector. This changes the i-th coordinate of x by an additive 1 or −1. At the end of the stream, x is guaranteed to satisfy the promise that x ∈ {−m, −m + 1, . . . , m}^n, where throughout we assume that m ≥ 2n. The goal of the streaming algorithm is to make one pass over the data and to use limited memory to compute functions of x, such as the frequency moments [1], the number of distinct elements [19], the empirical entropy [12], the heavy hitters [14, 17], and, treating x as a matrix, various quantities in numerical linear algebra such as a low rank approximation [15]. Since computing these quantities exactly or deterministically often requires a prohibitive amount of space, these algorithms are usually randomized and approximate.

Curiously, all known algorithms in the turnstile model have the following form: they sample an r × n matrix A from a distribution on integer matrices in a preprocessing phase, and then maintain the "linear sketch" A · x during the stream.^1 For known solutions to the above problems, the sketch A · x is maintained over the integers (rather than, say, over a finite field). From A · x, the algorithm then computes an arbitrary function of A · x, which should, with constant probability, be a good approximation to the desired function f(x).

^1 There are algorithms in the insertion-only model, where all the updates are required to be positive, which are not linear sketches, e.g., the algorithm for frequency moments given by Alon et al. [1].
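To make the linear-sketch template just described concrete, the following is a minimal Python sketch of a turnstile algorithm that maintains A · x under ±e_i updates. The random sign matrix and the crude squared-norm estimator are illustrative stand-ins only, not the construction analyzed in this paper:

import random

def sample_sketch_matrix(r, n, seed=0):
    """Toy preprocessing step: sample an r x n integer matrix A.
    (Here: random signs; the paper's construction is different.)"""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(r)]

class LinearSketch:
    """Maintains the sketch A*x over the integers under turnstile updates."""
    def __init__(self, A):
        self.A = A
        self.state = [0] * len(A)          # A * x for the current x

    def update(self, i, delta):
        """Process the update x <- x + delta * e_i (delta is +1 or -1)."""
        for row in range(len(self.A)):
            self.state[row] += delta * self.A[row][i]

    def output(self):
        """An arbitrary function of A*x; here a crude squared-l2 estimate."""
        r = len(self.state)
        return sum(v * v for v in self.state) / r

# Example: n = 4, stream of updates (coordinate, +1/-1).
sketch = LinearSketch(sample_sketch_matrix(r=8, n=4))
for i, delta in [(0, 1), (2, 1), (2, 1), (0, -1), (3, 1)]:
    sketch.update(i, delta)
print(sketch.output())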
Linear sketches are well-suited for turnstile streaming algorithms since, given the state A · x and an additive update x ← x + e_i, the new state A · (x + e_i) can be computed as A · x + A_i, where A_i is the i-th column of A. This raises the natural question of whether or not all turnstile streaming algorithms need to in fact be linear sketches.

Several works already consider the question of lower bounds on the dimension of linear sketches themselves, rather than trying to obtain bit complexity lower bounds [4, 27, 32, 36]. For instance, Andoni et al. [4] prove a lower bound on the dimension of a randomized linear sketch for estimating frequency moments and state "We stress that essentially all known algorithms in the general streaming model are in fact linear estimators." The relationship between the space complexity of the optimal turnstile algorithm and that of a linear sketch is thus important for understanding the applicability of lower bounds for linear sketches.^2

^2 We note that the sketching lower bounds mentioned are for real input vectors, so one would still need to obtain dimension lower bounds for linear sketches for integer input vectors to apply them to linear sketches in the turnstile model.

Some progress has been made on the above question. Ganguly [22] shows that for approximating the ℓ1-heavy hitters, any deterministic streaming algorithm might as well be a linear sketch over the integers. Ganguly also generalized this reduction to deterministic streaming algorithms for approximating the ℓp-heavy hitters, for every 1 ≤ p ≤ 2 [23].

In a related work, Feldman et al. [18] introduced the MUD model for computational tasks, which stands for massive, unordered, distributed algorithms, and which contains linear sketches as a special case. The authors show that any deterministic streaming algorithm for computing a symmetric (order-invariant) total single-valued function exactly can be simulated by a MUD algorithm. They partially extend this to approximation algorithms, but require that the approximating function is also a symmetric function of its input, rather than only requiring that the function being approximated is symmetric. For randomized algorithms they require that for each fixed random string, the streaming algorithm computes a symmetric function of its input.

Thus, existing results on this problem were either for deterministic algorithms and specific problems, or did not work for arbitrary approximation algorithms.

Our Contributions: We show that any 1-pass constant probability streaming algorithm for approximating an arbitrary function f of x in the turnstile model can be implemented by sampling a matrix A from the uniform distribution with support on O(n log m) integer matrices, each with entries of magnitude poly(n log m), and maintaining the linear sketch A · x over the integers. Furthermore, the logarithm of the number of possible states of A · x, as x ranges over {−m, −m + 1, . . . , m}^n, plus the number of random bits needed to sample A, is at most an O(log m) factor larger than the space required of the space-optimal algorithm for approximating f in the turnstile model. Our result helps explain why all known streaming algorithms in the turnstile model are linear sketches.

We note that the logarithmic factor loss in our result is necessary for some problems, e.g., for the function f(x) = x_1 mod 2. In this case, the optimal algorithm has two states and maintains x_1 mod 2, while the optimal sketching algorithm Ax over the integers must have at least log m states, since with large constant probability A · (1, 0, . . . , 0) ≠ 0, and by scaling, A · (i, 0, . . . , 0) results in a different state for each i ∈ [m].

While the distribution on sketching matrices A that we create can be sampled from with an O(log n + log log m) bit random seed, in general the O(n log m) matrices in its support cannot be represented in small space. The output function of the sketching algorithm, given A · x and the random bits used to sample A, may also not be computable in small space. These issues can be handled by allowing the sketching algorithm to be non-uniform, though there are other ways of addressing them as well. These are discussed further below, and do not affect one of our main applications to lower bounds, described next.

Communication Complexity: One consequence of our result is for proving lower bounds on the space required of algorithms in the turnstile model, which is often done using communication complexity. Typically one creates a communication problem in which there are two or more players P_1, . . . , P_k, each with an input X_i, and lower bounds the communication required among the players to compute a function f(X_1, . . . , X_k). For 1-pass lower bounds, it suffices to consider 1-way communication, in which P_1 speaks to P_2, who speaks to P_3, etc., and P_k outputs the answer. To obtain lower bounds in the turnstile model, each player P_i creates a data stream σ_i from his/her input X_i. Then P_1 runs a data stream algorithm A on σ_1, and transmits the memory contents of A to P_2, who continues the computation of A on σ_2, etc. At the end, A will have been executed on the stream σ_1 ◦ σ_2 ◦ · · · ◦ σ_k. If the output of A determines f with constant probability, then the space of the streaming algorithm is at least the randomized 1-way communication complexity of f divided by k.

While for some problems randomized 1-way communication lower bounds are easy via a reduction from the Indexing problem [31], for other problems such as k-player Set Disjointness and Gap-ℓ∞ [7, 13, 26, 30], 1-way lower bounds are not much easier to prove than 2-way communication lower bounds with current techniques. A weaker model than the 1-way model is the simultaneous communication model. In this model there are k players P_1, . . . , P_k, each with an input X_i, but the communication is even more restricted. There is an additional player, called the referee, and each of the players P_i can only send a single message, and only to the referee, though they may share a common random string. The referee announces the output, which should equal f(X_1, . . . , X_k) with constant probability. Before our work, one could not use randomized simultaneous communication lower bounds for f to obtain space lower bounds for streaming algorithms, since there is no simulation in which the state of a streaming algorithm can be passed sequentially among the players. However, since our result shows that the optimal streaming algorithm can be implemented using a sketching matrix A, up to a logarithmic factor, each player P_i can compute A · x_i, where x_i is the underlying vector associated with the stream σ_i that P_i creates from his/her input X_i to the communication game. The referee uses linearity to compute A · (x_1 + x_2 + · · · + x_k), from which it can compute the output of the streaming algorithm.

Our result thus shows that it suffices to consider simultaneous communication complexity to prove lower bounds. This makes progress on Question 19 on Sketching versus Streaming in the IITK Open Problems in Data Streams list from 2006, which asks to show that any symmetric function that admits a good streaming algorithm also admits a sketching algorithm. Our result shows this even for non-symmetric functions (in the turnstile streaming model). One well-studied problem for which it is easier to prove lower bounds in the simultaneous model is k-player Set Disjointness; see Theorem 18 of [6], which predates the lower bound [7] in the one-way model. This problem is used to prove lower bounds for approximating the p-th frequency moment F_p = Σ_{i=1}^{n} |x_i|^p, p > 2, in the turnstile model, for which it is known that Θ̃(n^{1−2/p}) bits of space is necessary and sufficient. This problem has a long history [1, 2, 3, 7, 10, 11, 13, 16, 20, 21, 24, 25, 29, 33, 39]. While we can use the simultaneous lower bound for k-player Set Disjointness to prove an Ω̃(n^{1−2/p}) bound for frequency moments, we give an even simpler proof for linear sketches with polynomially bounded entries, without using communication complexity at all, in Appendix C. By our reduction, this gives a bit complexity lower bound for turnstile streaming algorithms in general.
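A small sketch of the simultaneous-model simulation described above, under the assumption that the streaming algorithm has already been converted into a linear sketch with a matrix A derived from the shared random string: each player sends only A · x_i to the referee, who uses linearity to recover A · (x_1 + · · · + x_k). The toy sign matrix and inputs below are illustrative assumptions only:

import random

def shared_sketch_matrix(r, n, seed):
    """All players derive the same matrix A from the shared random seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(r)]

def player_message(A, x):
    """Player i's single message to the referee: the vector A * x_i."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def referee(messages):
    """The referee adds the sketches; by linearity this equals A*(x_1+...+x_k),
    and the sketching algorithm's output function can then be applied to it."""
    return [sum(col) for col in zip(*messages)]

n, r, seed = 6, 4, 1234
A = shared_sketch_matrix(r, n, seed)
inputs = [[1, 0, 2, 0, 0, 0], [0, -1, 0, 0, 3, 0], [0, 0, 0, 1, 0, -2]]
msgs = [player_message(A, x) for x in inputs]
combined = referee(msgs)
x_total = [sum(col) for col in zip(*inputs)]
assert combined == player_message(A, x_total)   # linearity check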
Non-uniformity. It may not be possible to represent the sketching matrix A or compute the output given Ax in small space, even though A is sampleable with only O(log n + log log m) random bits. A natural way of addressing this is to allow the streaming algorithm to be non-uniform, meaning that for each n it has the uniform distribution on our O(n log m) matrices A hardwired into a read-only tape. We could further hardwire the output of Az for each possible value of Az and each of the O(n log m) possible A. The number of such outputs is (mn)^{O(rank(A))}; since the algorithm uses Θ(rank(A) log(mn)) bits of space to maintain Ax, the algorithm does not need additional space to index into this list of outputs. This hardwiring is analogous to the distinction between circuits and Turing machines, and is sometimes the definition used for streaming algorithms, see, e.g., [9].

A second way of addressing this is, since our procedure for finding these O(n log m) matrices A with associated outputs is constructive, when processing an update x ← x + e_i the algorithm could run this procedure to reconstruct A_i and add A_i to the sketch. This process is consistent across different updates since the algorithm stores the same random seed. Similarly, it could run a constructive procedure to compute its output. Before and after processing each update or output, the state of the streaming algorithm consists only of its random seed together with its sketch Ax for the current x. However, the algorithm is allowed more space while processing an update or computing the output.

This non-uniformity does not affect our application to using simultaneous communication complexity to prove streaming lower bounds, since the parties can locally create the same O(n log m) matrices A, and use the common random seed to sample an A. Moreover, the local space and time complexities are not counted in the communication lower bounds, and so the referee can use additional space to compute the output.

Our Techniques: Our starting point is an elegant work of Ganguly [22] which introduces concepts such as path-reversibility and path-independence for modeling a streaming algorithm by a deterministic automaton. One property of deterministic automata is that if the input vector x goes to multiple different states (induced by distinct input streams), then these states can be "merged" and the output of the new state can be chosen to be the output of any of the states being merged. We deal with automata which only have large success probability on an input distribution, and so this is no longer true, i.e., if the merging is not done carefully our success probability could drop dramatically. We instead partition the state space into connected components and perform random walks in these components, where the walks correspond to variable-length streams with underlying input ~0. Our outputs are defined by samples from the stationary distribution of these walks.

For each fixed random string of the automaton we obtain a deterministic automaton whose state space is isomorphic to the quotient module Z^n/M for a submodule M of Z^n. In Ganguly's work [22], for his specific problem of approximating the ℓ1-heavy hitters he could then remove torsion from Z^n/M so that this quotient module is free. This is not possible for general problems (e.g., when computing x_1 mod 2 the quotient module would be Z/2Z and removing torsion would make the quotient module 0). We instead crucially rely on the Smith Normal Form of Z^n/M, which allows us to write the states of the automaton as Bx mod q for an integer matrix B with r rows, where q = (q_1, . . . , q_r) is an integer vector.

The remainder of our proof has two main ingredients. The first is in showing how to go from Bx mod q to a linear sketch A over the integers whose number of states is not much larger than the original number of states. The main difficulty here is that each of these states should be the image of an x ∈ {−m, −m + 1, . . . , m}^n, not of an arbitrary x ∈ Z^n, and we do not know how large the set {Bx mod q | x ∈ {−m, −m + 1, . . . , m}^n} is. The second ingredient is to show that while the entries of A may be exponentially large, we can construct an integer matrix with polynomially bounded entries without increasing the number of states by much. Our procedure reduces the coefficients of random linear combinations of the rows of A modulo random small primes.

2. PRELIMINARIES

Notation. Let Z denote the set of integers and Z_{|m|} = {−m, . . . , m}. For a random variable X, we write X ∼ D if X is subject to distribution D. We shall denote automata by script letters A, B, . . . and matrices and modules by the regular italic letters A, B, . . . . We also define a mod 0 = a for all a ∈ Z and allow ourselves to say that 0 is divisible by any integer a, denoted as a|0, even for a = 0.

Modules and Smith Normal Forms. Throughout this paper we assume that the 'scalars' of a module correspond to the ring of integers, Z, and we tailor the definitions accordingly.

DEFINITION 1 (MODULE). A Z-module (or module for short) is an additive abelian group A together with a function Z × A → A (the image of (r, a) being denoted by ra) such that for all r, s ∈ Z and a, b ∈ A these four properties hold: (1) r(a + b) = ra + rb, (2) (r + s)a = ra + sa, (3) r(sa) = (rs)a, (4) 1a = a.

Notions such as submodules, module homomorphisms and finitely generated modules can be defined similarly as for groups, and we omit the definitions. Analogously to vector spaces, we can define the notions of linear independence, basis, submodule spanned by a set, etc., and we omit the definitions here. However, since Z is not a field, a Z-module A may not have a basis. In case it has a basis, we call A a free Z-module, or a free module for short. Note that every submodule of a free Z-module is free. Also, if F is a free module, then any bases of F have the same cardinality. Hence we can define the dimension (or rank) of F to be the cardinal number of any basis of F. As in the case of vector spaces, if we have a linearly independent set of F we can always extend it to a basis of F.

The following is an important structure theorem of modules.

THEOREM 1 (STRUCTURE THEOREM). Let A be a finitely generated module over Z. Then A ≅ Z_{a_1} ⊕ Z_{a_2} ⊕ · · · ⊕ Z_{a_r} ⊕ Z^t for some integers a_1 | a_2 | · · · | a_r. The numbers r, t and a_1, . . . , a_r are uniquely determined by A.

The structure of a module is closely related to the Smith Normal Form. We explore the relation below.

DEFINITION 2. A matrix A ∈ Z^{n×n} is called unimodular if det A = 1 or −1.

Note that the inverse of a unimodular matrix is an integer matrix and also unimodular.

THEOREM 2 (SMITH NORMAL FORM). Let A ∈ Z^{m×n}. Then A can be written as A = SDT, where S ∈ Z^{m×m} and T ∈ Z^{n×n} are unimodular, and D ∈ Z^{m×n} is a rectangular diagonal matrix with diagonal elements a_1, . . . , a_r, 0, . . . , 0 for some r ≤ min{m, n} and nonzero a_1, . . . , a_r with a_1 | · · · | a_r. The matrix D is called the Smith Normal Form of A, and the diagonal entries a_1, . . . , a_r are uniquely determined by A.

To connect free modules with Theorem 2, consider the following problem: Suppose that M is a submodule of a free module F. We want to find a basis {f_1, . . . , f_n} of F together with integers a_1, . . . , a_r ∈ Z with a_1 | · · · | a_r such that {a_1 f_1, . . . , a_r f_r} is a basis of M. We say {f_1, . . . , f_n} is a compatible basis of F and M.

Pick an arbitrary basis e_1, . . . , e_n of F and suppose that M admits a basis b_1, . . . , b_m with coefficient matrix A ∈ Z^{m×n} with respect to e_1, . . . , e_n, that is, b_i = Σ_j A_{ij} e_j. Reduce A to its Smith Normal Form A = SDT; then

(b_1, . . . , b_m)^T = S · D · T (e_1, . . . , e_n)^T =: S · D · (f_1, . . . , f_n)^T,

where D is the Smith Normal Form of A. It is clear that f_1, . . . , f_n is a basis of F, and it is easy to check that {a_1 f_1, . . . , a_r f_r} is a basis of M.

Finally, in connection with Theorem 1, we note that F/M ≅ Z/(a_1) ⊕ · · · ⊕ Z/(a_r) ⊕ Z^{n−m}. This fact will be repeatedly used in Section 4. Our convention that 0 divides 0 allows us to write a_1 | · · · | a_{r+n−m}, where a_{r+1} = · · · = a_{r+n−m} = 0, so that in a more unified notation, F/M ≅ Z/(a_1) ⊕ · · · ⊕ Z/(a_{r+n−m}).

Deterministic Stream Automata. The results in this section are largely due to Ganguly [22]. We consider only the problems in which the input is a vector v ∈ Z^n represented as a data stream σ = (σ_1, σ_2, . . . ), in which each element σ_i is an element of Σ = {e_1, . . . , e_n, −e_1, . . . , −e_n} (where the e_i's are canonical basis vectors) such that Σ_i σ_i = v, and in this case we write v = freq σ.
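Returning to the Smith normal form machinery above, here is a small illustration (a sketch assuming SymPy's smith_normal_form helper; the matrix is an arbitrary toy example): computing D for the coefficient matrix of a submodule M ⊆ Z^3 and reading off the invariant factors a_1 | a_2 that describe Z^3/M as in Theorem 1.

from sympy import Matrix, ZZ
from sympy.matrices.normalforms import smith_normal_form

# Rows are generators b_1, b_2 of a submodule M of Z^3,
# written in the standard basis e_1, e_2, e_3.
B = Matrix([[2, 4, 4],
            [6, 6, 12]])

D = smith_normal_form(B, domain=ZZ)
print(D)   # diagonal entries a_1 | a_2 are the invariant factors of M

# Z^3 / M  ≅  Z/(a_1) ⊕ Z/(a_2) ⊕ Z   (one free factor, since rank(B) = 2 < 3)
invariant_factors = [D[i, i] for i in range(min(D.shape)) if D[i, i] != 0]
print(invariant_factors)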

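A coset automaton of the kind discussed in Our Techniques, whose state is Bx mod q for the current frequency vector x, can itself be maintained in a streaming fashion, exactly like a linear sketch but with modular arithmetic. The matrix B and moduli q below are hypothetical toy values, not ones produced by the paper's reduction:

class CosetAutomaton:
    """State is (B x) mod q, i.e., the coset of x in Z^n / M,
    where M is the kernel of the map x -> (B x) mod q."""
    def __init__(self, B, q):
        assert len(B) == len(q)
        self.B, self.q = B, q
        self.state = [0] * len(B)

    def update(self, i, delta):
        """Turnstile update x <- x + delta * e_i."""
        for row in range(len(self.B)):
            self.state[row] = (self.state[row] + delta * self.B[row][i]) % self.q[row]

# Toy example: n = 3, two rows, moduli q = (2, 5).
B = [[1, 0, 1],
     [3, 1, 0]]
q = [2, 5]
aut = CosetAutomaton(B, q)
for i, delta in [(0, 1), (1, 1), (0, 1), (2, -1)]:
    aut.update(i, delta)
print(aut.state)   # equals (B x) mod q for x = (2, 1, -1)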
DEFINITION 3. A deterministic stream automaton A is a deterministic Turing machine that uses two tapes, a one-way (unidirectional) read-only input tape and a (bidirectional) two-way work-tape. The input tape contains the input stream σ. After processing its input, the automaton writes an output, denoted by φ_A(σ), on the work-tape.

A configuration of a stream automaton A is modeled as a triple (q, h, w), where q is a state of the finite control, h is the current head position of the work-tape, and w is the content of the work-tape. The set of configurations of a stream automaton A that are reachable from the initial configuration o on some input stream is denoted by C(A). A stream automaton is a tuple (n, C, o, ⊕, φ), where n specifies the dimension of the underlying vector, ⊕ : C × Σ → C is the configuration transition function, o is the initial configuration of the automaton, φ : C → Z^{p(n)} is the output function, and p(n) is the dimension of the output. For a stream σ we also write φ(o ⊕ σ) as φ(σ) for simplicity.

The set of configurations of an automaton A that are reachable from the origin o on some input stream σ with ‖freq σ‖_∞ ≤ m is denoted by C(A, m). The space of the automaton A with stream parameter m is defined as S(A, m) = log |C(A, m)|.

DEFINITION 4. Let A and B be two stream automata. We say B is an output restriction of A if for every stream σ there exists a stream σ′ such that freq σ′ = freq σ and φ_B(σ) = φ_A(σ′).

A problem over a data stream is characterized by a family of binary relations P_n ⊆ Z^{p(n)} × Z^n, n ≥ 1. We say an automaton A solves a problem P (with domain size n) if for every input stream σ it holds that (φ_A(σ), freq σ) ∈ P_n. It is clear that if A solves a problem and B is an output restriction of A, then B solves the same problem.

Now we introduce more concepts of different kinds of automata. A stream automaton A is said to be path-reversible if for every stream σ and configuration s, s ⊕ (σ ◦ σ^{−1}) = s, where σ^{−1} denotes the inverse stream of σ (obtained by reversing σ and negating each update); it is said to be path independent if for each configuration s and input stream σ, s ⊕ σ is dependent only on freq σ and s.

Suppose that A is a path independent automaton. We can define a function + : Z^n × C → C as x + a = a ⊕ σ, where freq σ = x. Since A is a path independent automaton, the function + is well-defined. In [22] it is proved that

THEOREM 3. Suppose that A is a path independent automaton with initial configuration o. Let M = {x ∈ Z^n : x + o = 0 + o}. Then M is a submodule of Z^n, and the mapping x + M ↦ x + o is a set isomorphism between Z^n/M and the set of reachable configurations {x + o : x ∈ Z^n}.

The submodule M above is called the kernel of A. The theorem implies that A gives the same output for all vectors in x + M, and that the transition function of A is the canonical addition on Z^n/M. Conversely, given a module M ⊆ Z^n one can construct an automaton A whose states are the cosets of M and whose transition function is the canonical addition on Z^n/M. Furthermore, A has poly(n) states in its finite control. This construction uses the optimal space.

3. RANDOMIZED STREAM AUTOMATA

In this section, we shall extend the notions and results of deterministic stream automata to randomized stream automata. The following definition first appeared in [23], but the author did not define its space complexity.

DEFINITION 6. A randomized stream automaton is a deterministic stream automaton with one additional tape for the random bits. The random bit string R is initialized on the random bit tape before any input record is read; thereafter the random bit string is used in a two-way read-only manner. The rest of the execution proceeds as in a deterministic stream automaton.

A randomized stream automaton A is said to be path-independent (respectively, path-reversible) if for each randomness R the deterministic instance A_R is path-independent (respectively, path-reversible). The space complexity of A with stream width parameter m is defined as

S(A, m) = max_R {|R| + S(A_R, m)}.

Next we show how to reduce a general automaton to a path independent automaton. The reduction of deterministic automata is due to Ganguly [22], where he first reduces a general automaton to a path reversible automaton and thence to a path independent automaton. We take the same approach for randomized automata. The difficulty is in defining the output of the reduced automaton.

THEOREM 4. Suppose that a randomized path reversible automaton A succeeds in solving problem P on any stream with probability at least 1 − δ. Let Π be an arbitrary distribution over streams. There exists a deterministic path independent automaton B with S(B, m) ≤ S(A, m) + O(log n) which succeeds with probability at least 1 − 2δ on Π.

PROOF.
Let S be the set of all streams of length at most L whose Suppose σ and τ are two streams. Let σ ◦ τ be the stream obtained frequency vector is ~0. We choose L large enough such that for any by concatenating τ to the end of σ, so freq(σ◦τ) = freq σ+freq τ. two states o1 and o2 of A, if there exists a stream σ with freq σ = ~0 −1 The inverse stream of σ is denoted by σ and defined inductively such that o1 ⊕ σ = o2, then at least one such σ is included in S. −1 −1 −1 −1 −1 A finite bound L = L(C(A, m), n) can be obtained as follows. as follows: ei = −ei, −ei = ei and (σ ◦ τ) = τ ◦ σ . A path in the transition graph of A is a linear combination of a DEFINITION 5. A stream automaton A is said to be path inde- simple path and simple cycles of the graph. Thus, whether there pendent if for each configuration s and input stream σ, s ⊕ σ is is freq σ = ~0 such that o1 ⊕ σ = o2 can be reduced to whether there is a non-negative solution (multiplicities of the simple cycles) (V 0,E0) be the directed acyclic graph where each vertex repre- for a system of n linear equations (all frequencies add up to 0). sents a strongly connected component of G. Let rep : V 0 → V be A bound on the magnitude of the solution can be found in [38]. a map such that rep(v) is a (fixed) arbitrary vertex in the strongly We also include in S the empty stream. Pick W to be a number connected component v and com : V → V 0 a map from a vertex large enough such that a random walk on an undirected graph of v ∈ G to its strongly connected component. We call a strongly |C(A, m)| vertices and nL edges would mix. An upper bound for connected component of G (correspondingly a vertex in G0) termi- L W is 2O((2n) ). Construct a distribution Π0 as follows. For each nal if there is no edge from it to the rest of the graph. Define a map 0 0 0 stream τ ∈ Π, we include new streams τ ◦ σ1 ◦ · · · ◦ σW in Π for α : V → V where α(v) is an arbitrary terminal vertex reachable 0 all σ1, . . . , σW ∈ S. from v. Let the states of B be the set of terminal vertices of G . 0 By Yao’s principle, there exists a choice of randomness Define the transition function ⊕ on the states of B as R and thus a deterministic automaton AR such that AR succeeds 0 α(s) ⊕ ei = α(com(rep(α(s)) ⊕ ei)). on Π0 with probability at least 1−δ. Let G be the associated multi- 0 graph of the states of AR where two vertices o1, o2 are connected It is clear that the transition function ⊕ is well-defined: for any s, t 0 0 by an arc if there exists σ ∈ S such that o1 ⊕ σ = o2. Since AR with α(s) = α(t), we have α(s)⊕ ei = α(t)⊕ ei. It is easy to see is path reversible, the graph G is undirected. Since S contains the that B is path reversible. The proof is postponed to Appendix A. empty stream, a random walk in a connected component C of G 0 0 LEMMA 6. Consider a terminal vertex u ∈ V . We have u ⊕ π converges to a stationary distribution C . e ◦ −e = u. Construct a randomized automaton B as follows. Define a state i i of B for each connected component of G. Let o1 and o2 be two Next we set the initial state of B. Let u0 be the initial state of AR states of AR in the same connected component in G, i.e., there and let π be the stationary distribution for the random walk starting exists a stream σ ∈ S such that o1 ⊕ σ = o2. Let o3 = o1 ⊕ ei from u0 in G. Note that G is directed so the stationary distribu- and o4 = o2 ⊕ ei. Since AR is path reversible, o3 ⊕ −ei = o1, tion is dependent on the initial state. 
Notice that π is a mixture of so o3 ⊕ (−ei ◦ σ ◦ ei) = o4, which implies that o3 and o4 are in stationary distributions of random walks in terminal components the same connected component. Therefore, the transitions among reachable from that initial state. The initial state of B is picked states of B are well-defined and it is clear that B is path-reversible. randomly from V 0 according to the mixing weight of the terminal Furthermore, B is path independent because there is no stream of components in π. frequency ~0 changing B from a state to a different state. The output Finally, the output of each state of B is picked randomly from of B on a state is picked randomly from the outputs of A according the outputs of the states in the corresponding terminal component to πC . in G according to the stationary distribution of a random walk in Fix a σ ∈ Π. We analyze the probability of AR succeeding over that component. all choices of s1, . . . , sW . Let C be the connected component of G We now show that the expected failure probability of B is no containing the state of AR after processing σ. Because W is large more than 3δ. It then follows from an averaging argument that enough, the distribution of the final state of AR is close (within δ in there exists a deterministic B with failure probability at most 3δ. statistical distance) to πC . Thus, the probability that AR succeeds We shall further need the following two lemmata, whose proofs is at most δ more than the expected success probability of B. are postponed to Appendix A. By an averaging argument, there exists a deterministic automa- ton B achieving success probability at least as high as that of the LEMMA 7. Let s be a vertex in a terminal component of G. For randomized B, which is at least 1 − 2δ. any stream σ, there is only one terminal component reachable from s ⊕ σ via streams of frequency ~0. THEOREM 5. Suppose that a randomized algorithm A solves P on any stream with probability at least 1 − δ. Let Π be an ar- LEMMA 8. If AR starts from a state u ∈ G in a terminal com- bitrary distribution over streams. There exists a deterministic path ponent C, and B starts from C, then after every transition, the state reversible automaton B with S(B, m) ≤ S(A, m) + O(log n) of B always corresponds to the unique terminal component reach- ~ which solves P with probability at least 1 − 3δ on Π. able from the state of A via streams of frequency 0.

PROOF. Let S be the set of all streams of length at most L whose Fix σ ∈ Π. We analyze the probability of AR succeeding over frequency vector is ~0. We choose L large enough such that for any all choices of s1, . . . , s2W . Because of the aforementioned mixing two states o1 and o2 of A, if there exists a stream σ with freq σ = time of G, the distribution of u0 ⊕ s1 ◦ · · · ◦ sW is within statis- ~0 such that o1 ⊕ σ = o2, then at least one such σ is included tical distance δ from π. Thus, the statistical distance between the in S. A finite bound for L can be obtained as before. We also marginal distribution of u0 ⊕ s1 ◦ · · · ◦ sW over strongly connected include in S the empty stream. Pick W to be a number large enough components of G and the distribution of the initial state of B is at such that a lazy random walk from a fixed vertex on a directed most δ. S(A,m) L graph of 2 vertices and n edges and a positive probability Consider a fixing of s1, . . . , sW such that u1 = u0 ⊕s1 ◦· · · sW of staying in every step would get to within statistical distance δ belongs to a terminal component C of G. We compare the success from a stationary distribution. Note that for the same graph, the probability of A with the success probability of B when its initial stationary distribution is dependent on the fixed starting vertex but state is C. By Lemma 7, there is only one terminal component D 0 this is not a problem for the proof. Construct Π as follows. For reachable from u1 via streams of frequency ~0. Again by the mixing each stream τ ∈ Π, we include new streams s1 ◦ s2 ◦ · · · ◦ sW ◦ time of G, the distribution of u1 ⊕ sW +1 ◦ · · · ◦ s2W is within 0 σ ◦ sW +1 ◦ · · · ◦ s2W in Π for all s1, . . . , s2W ∈ S. statistical distance δ from the stationary distribution of the random Let G be the associated multi-graph of the states of AR where walk in D, which, by Lemma 8, is the state of B. Therefore, the two vertices o1, o2 are connected by an arc if there exists σ ∈ S success probability of A and B differ by at most δ. such that o1 ⊕ σ = o2. Since S contains the empty stream, every In summary, over all random choices, the success probability of vertex in G has a self-loop. A and B differ by at most 2δ and the desirable conclusion fol- Construct a randomized automaton B as follows. Let G0 = lows. Both theorems above conclude with the existence of a determin- Algorithm 1 Extending the module M istic automaton that succeeds with probability ≥ 1−δ on Π for any // ds+1 = ··· = dn = 0 0 given distribution Π over the inputs. By Yao’s minimax theorem, 1: di ← di for i = 1, . . . , ` there exists a randomized automaton that succeeds with probability 2: for i ← ` + 1 to n do 0 0 0 ≥ 1 − δ on any input. But, the number of random bits needed by 3: Mi ← hd1f1, . . . , di−1fi−1, difi, di+1fi+1, . . . , dnfni n the randomized automaton, i.e., the number of different determin- 4: if (kfi + Mi) ∩ Z|m| = ∅ for all k = 1, . . . , di − 1 then. istic path independent automata used by the randomized automa- When di = 0 it means for all k 6= 0 0 ton, could be unbounded. However, following an argument due 5: di ← 1 to Newman [35], it suffices for the randomized automaton to pick 6: else 0 uniformly at random one of O(n log m) deterministic automata. 7: di ← di Therefore, the additional space needed for the random bits is only 8: end if O(log n + log log m). For completeness, we include the argument 9: end for 0 0 0 below. 10: return M ← hd1f1, . . . , dnfni

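The derandomization step stated next (Theorem 9, via Newman's argument [35]) boils down to a Chernoff-plus-union-bound calculation over the (2m+1)^n possible inputs. A small sketch of the parameter setting, with the leading constant treated as a hypothetical placeholder:

import math

def num_instances(n, m, delta, c=8.0):
    """Number t of deterministic instances to sample so that, by a Chernoff
    bound, the empirical success probability on any fixed input deviates by
    more than delta only with probability < (2m+1)^(-n); a union bound over
    all (2m+1)^n inputs then leaves a fixed set of t instances that works for
    every input. The constant c is a placeholder, not the paper's constant."""
    return math.ceil(c * n * math.log(2 * m + 1) / delta ** 2)

t = num_instances(n=1000, m=10 ** 6, delta=0.05)
print(t, "instances, i.e., about", math.ceil(math.log2(t)), "extra random bits")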
THEOREM 9. Let A be a randomized automaton solving prob- lem P on n with failure probability at most δ. There is a ran- Z|m| PROOF. Suppose that (f , . . . , f ) is a compatible basis of M domized automaton B that only needs to pick uniformly at random 1 n and n, and {d f , . . . , d f } is a basis of M, where d |d | · · · |d one of O(δ−2n log m) deterministic instances of A and solves P Z 1 1 s s 1 2 s and d = ··· = d = 1 for some ` ≤ s. Run Algorithm 1 and we on n with failure probability at most 2δ. 1 ` Z|m| claim that the returned M 0 is as desired.

PROOF. Let A1,..., AO(nδ−2 log m) be independent draws of We first show that Property 1 is maintained throughout the loop. n deterministic automata picked by B. Fix an input x ∈ Z|m|. Let In addition to Mi defined on Line 3, we also define pA(x) be the fraction of the automata among A1,...,AO(nδ−2 log m) 0 0 0 Mi = hd1f1, . . . , di−1fi−1, fi, di+1fi+1, . . . , dnfni. that solve problem P correctly on x and pB(x) be the probability n that B solves P on x correctly. By a Chernoff bound, we have that If (kfi + Mi) ∩ = ∅ for all k = 1, . . . , di − 1, then for any −2n Z|2m| Pr{|pA(x)−pB(x)| ≥ δ} ≤ exp(−O(n log m)) < (2m+1) . n 0 n z ∈ Z|2m| it holds that z ∈ Mi whenever z ∈ Mi . Suppose that Taking a union bound over all choices of x ∈ Z|m|, we have 0 z ∈ Mi , we write Pr{|pA(x) − pB(x)| ≥ δ for all x} > 0. Therefore, there exists 0 0 a set of A1,..., AO(nδ−2 log m) such that |pA(x) − pB(x)| ≤ δ z = c1d1f1 +···+ci−1di−1fi−1 +cifi +ci+1di+1fi+1 +···+csdsfs for all x ∈ n . The automaton B simply samples uniformly at Z|m| =: kfi + m, random from this set of deterministic automata. for some k ∈ {0, . . . , di − 1} and m ∈ Mi. Since (kfi + Mi) ∩ n Combining all theorems in this section immediately yields the Z|2m| = ∅ for all k = 1, . . . , di − 1, it must hold that k = 0 and following corollary. thus z ∈ Mi. Property 1 can be proved inductively using the above argument as the inductive step. Property 2 is immediate. −1 THEOREM 10. Suppose that a randomized algorithm A solves Now we prove Property 3. Let S = (f1, . . . , fn) and I = {i : 0 P on any stream with probability at least 1 − δ. There exists a di = di 6= 1} =: {i1, . . . , ir}, then the map g(x) = ((Sx)i1 mod n n randomized automaton B which solve P on Z|m| with probabil- di1 ,..., (Sx)ir mod dir ), x ∈ Z , induces an isomorphism n 0 ity at least 1 − 7δ such that S(B, m) ≤ S(A, m) + O(log n + Z /M ' Z/(di1 ) ⊕ · · · ⊕ Z/(dir ), where it holds automati- 1 log log m + log δ ). cally that di1 |di2 | · · · |dir . To simplify the notation, without loss of generality, we assume that ij = j for 1 ≤ i ≤ r. For each j ∈ [r], it follows from the algorithm that there exists some kj ∈ 4. REDUCTION TO LINEAR SKETCHES 0 n {1, . . . , dij } such that (kj fij +Mij )∩Z|2m| 6= ∅. Pick an arbitrary 0 n n x ∈ (k f +M )∩ , then g(x ) = (0,..., 0, k , 0,..., 0). EMMA j j ij ij Z|2m| j j L 11. Let M ⊆ Z be a module. Then there exists a r 0 r n P submodule M ⊆ M such that We then have 2 distinct vectors in Z|2mn|, namely, j=1 σj xj n 0 0 σ ∈ {0, 1}r 2r M 0 1. For all x, y ∈ Z|m|, it holds that x+M = y+M whenever with , that correspond to distinct cosets of . x + M = y + M. 0 i 2. M admits a basis b1, . . . , bt such that kbik∞ ≤ C2 m for LEMMA 13. Suppose that m ≥ 2n. Given a deterministic some absolute constant C > 0. path-independent automaton A, one can construct a deterministic free automaton B such that n The algorithm to find the basis is standard, so the proof is postponed 1. B is an output restriction of A on Z|m/(2n)|; to Appendix B. 2. S(B, m/(2n)) ≤ S(A, m) log m + O(log n); 2 3. The sketching matrix ΦB satisfies kΦBkmax ≤ exp(O(n + n LEMMA 12. Suppose that m ≥ 1 and M ⊆ Z is a module. n log m)). Then there exists a module M 0 ⊃ M such that the following are PROOF. Let M be the kernel of A and dim M = d. By true. 0 0 n Lemma 11 we may assume that M0 has a basis b1, . . . , bd such 1. For all x, y ∈ Z|m|, it holds that x + M = y + M whenever i n×d that kbik∞ = O(2 m). Let B = (b1, . . . , bd) ∈ Z and x + M 0 = y + M 0. n RBT = diag(c1, . . . 
, cd) (where c1| · · · |cd) be the Smith Nor- 2. The compatible basis of M and Z is also a compatible basis n×n d×d 0 n mal Form of B, where R ∈ Z and T ∈ Z are unimodular of M and Z . n 0 matrices. Now invoking Lemma 12, we can extend M0 to a module 3. Suppose that Z /M ' Z/(q1) ⊕ Z/(q2) ⊕ · · · ⊕ Z/(qr) M such that for some q1|q2| · · · |qr, where qi 6= 1 for all i. It then holds n t (a) Z /M ' Z/(q1) ⊕ · · · Z/(qr) ⊕ Z for some q1|q2| · · · |qr that [x + M 0]: x ∈ n ≥ 2r. Z|2mn| and q1 ≥ 2, r+t n (b) there are at least 2 cosets intersecting Z|m/(2n)| Summing the two probabilities above, we see that (c) there exists a submatrix S ∈ (r+t)×n such that the isomor- Z Prp {hρi, ΦAxi = 0} ≤ Prp {hρi, ΦAxi mod pi = 0} < 1/M. phism in (a) is given by i i Thus x 7→ ((Sx)1 mod q1,..., (Sx)r mod qr, (Sx)r+1,..., (Sx)r+t). t 2 Prp ,...,p ,ρ ,...,ρ {ΦBz = 0} < (1/M) = 1/|X| . From the proof of Lemma 12, we actually have that (q1, . . . , qr) 1 t 1 t is a subsequence of (c1, . . . , cd) and S is formed by r + t rows of Now, taking a union bound over at most |X|2 pairs of (x, y) such R . Without loss of generality, we can further assume that for each that ΦAx 6= ΦAy, we see that there exists a choice of p1, . . . , pt i ≤ r i S {0, 1, . . . , q −1} , the entries of -th row of are contained in i and ρ1, . . . , ρt such that the associated ΦB meets our requirement. by taking the remainders modulo qi. Then for each i ≤ r, |Sx|i ≤ Finally, n qim/2 for all x ∈ Z|m/(2n)|, so given some 0 ≤ k < qi there are n at most m/2 different (Sx)i such that (Sx)i mod qi = k. S(B, m) ≤ log2 {ΦBx : x ∈ Z|m|} + O(log n) We construct a new automaton B whose states correspond to t n ≤ log2(kΦB kmax · mn) + O(log n) XB = {Sx : x ∈ Z|m/(2n)|}, it is clear that B is a free automaton 3 and can be made an output restriction of A. This proves conclusion ≤ t log2(M mn) + O(log n) 1. ≤ 2 log2 |X| · (3 + logM (mn)) + O(log n) Let XA be the set of states of A. For each (k1, . . . , kr, kr+1, . . . , kt) ∈ ≤ 2(3 + log (mn))S(A, m) + O(log n). XA, where 0 ≤ ki < qi for all 1 ≤ i ≤ r, there are at most M (m/2)r different states Sx of B such that

((Sx) mod q ,..., (Sx) mod q , (Sx) ,..., (Sx) ) 1 1 r r r+1 t Combining Lemma 13 and Lemma 14, we conclude this section = (k1, . . . , kr, kr+1, . . . , kr+t). with r This implies that |XB| ≤ (m/2) |XA| and thus THEOREM 15. Suppose that m ≥ 2n. Given a deterministic

S(B, (m/2n)) ≤ r log(m/2) + log |XA| + O(log n) path-independent automaton A, one can construct a deterministic free automaton B such that ≤ log |XA| log(m/2) + log |XA| + O(log n) n 1. B is an output restriction of A on Z|m/(2n)|; ≤ S(A, m) log m + O(log n), 2. S(B, m/(2n)) ≤ 8S(A, m) log m + O(log n); 2 3. The sketching matrix ΦB satisfies kΦBkmax = O((n + where we used the fact (b) that S(A, m/(2n)) ≥ r + t for the 3 second inequality. This proves conclusion 2. n log m) ). n Next we show Conclusion 3. Note that kBkmax = O(2 m). By [37, Chapter 8], we can find S such that Finally, combining with Theorem 10 we conclude the paper with 2n+5 √ 4n 4n+5 4n+1 kSkmax ≤ n ( nkBkmax) kBkmax = n kBkmax . THEOREM 16 (MAIN). Let m, n be positive integers. Sup- pose that a randomized algorithm A solves P on any stream with It follows immediately that kΦBkmax = kRkmax ≤ kSkmax = 2 probability at least 1 − δ. There exists a randomized free automa- exp(O(n + n log m)). n ton B which solve P on Z|m| with probability at least 1 − 7δ such LEMMA 14. Suppose that A is a deterministic automaton with that M 1. S(B, m) = O(S(A, 2mn) log(mn) log n + log log m + sketching matrix ΦA such that kΦAkmax ≤ 2 for some M ≥ log(1/δ)); max{20 log2(mn), 100}. Then there exists a deterministic au- tomaton B such that 2. Each deterministic instance of B corresponds to a sketching 2 3 n matrix ΦB that satisfies kΦBkmax = O((n +n log(mn)) ). 1. B is an output restriction of A on Z|m|; 2. S(B, m) ≤ 2(3 + logM (mn))S(A, m) + O(log n); n 3 REMARK 17. On input x ∈ , it is possible that the fre- 3. ΦB has integer entries such that kΦBkmax ≤ M . Z|m| 2 3 M 2 3 M quency vector, at intermediate steps, has coordinates exceeding m. PROOF. Suppose that x ∈ {−4m n 2 ,..., 4m n 2 }\ However the automaton B can easily handle this by maintaining {0} and P = {p is a prime : p ≤ M 10}. Choose p uniformly 2 3 M ΦBx mod (kΦB kmaxmn). Because of the bound on the number at random from P . Since x has most log2(4m n 2 ) ≤ 1.2M 3 of rows of ΦB in Theorem 15, the space of B remains bounded by distinct prime factors and |P | ≥ M /(11 ln M), we see that O(S(A, 2mn) log(mn) + log n + log log m + log(1/δ)). Pr{p divides x} ≤ (13.2 ln M)/M 2. Suppose that ΦA has r rows and rank(ΦA) = r. Let X = n Acknowledgments {ΦAx : x ∈ Z|m|}. Let p1, . . . , pt be i.i.d. uniform on P and S be a random t×r matrix of entries i.i.d uniform over {0,..., 2M −1}, Yi Li and David Woodruff would like to thank the Simons Insti- tute for the Theory of Computing where part of this work was where t = 2 logM |X|. Construct a matrix ΦB such that (ΦB)ij = (SΦA)ij mod pi. To show that B can be realized as an output done. Huy L. Nguyen would like to thank IBM Research, Almaden, n where part of this work was done and to acknowledge the support of restriction of A, we shall show that if for any x, y ∈ Z|m|, if ΦAx 6= ΦAy then ΦBx 6= ΦBy. Let z = x − y, then ΦAz 6= NSF grant CCF 0832797, an IBM Ph.D. Fellowship, and a Siebel Scholarship. 0. Then for each i, Prρi {hρi, ΦAzi = 0} ≤ 1/(2M). Since 2 3 M |hρi, ΦAzi| ≤ 4m n 2 , it follows from the argument at the be- ginning of the proof that 5. REFERENCES  [1] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity Prpi hρi, ΦAzi mod pi = 0 hρi, ΦAzi 6= 0 2 of Approximating the Frequency Moments. JCSS, ≤ (13.2 ln M)/M < 1/(2M) 58(1):137–147, 1999. [2] A. Andoni. High frequency moment via max stability. [22] S. Ganguly. Lower bounds on frequency estimation of data Available at streams. 
In Proceedings of the 3rd international conference http://web.mit.edu/andoni/www/papers/fkStable.pdf. on Computer science: theory and applications, CSR’08, [3] A. Andoni, R. Krauthgamer, and K. Onak. Streaming pages 204–215, 2008. algorithms via precision sampling. In FOCS, pages 363–372, [23] S. Ganguly. Distributing frequency-dependent data stream 2011. computations. In Proceedings of the Fifteenth Australasian [4] A. Andoni, H. L. Nguyên, Y. Polyanskiy, and Y. Wu. Tight Symposium on Computing: The Australasian Theory - lower bound for linear sketches of moments. In ICALP (1), Volume 94, CATS ’09, pages 163–170, 2009. pages 25–32, 2013. [24] S. Ganguly. Polynomial estimators for high frequency [5] Z. Bar-Yossef. The Complexity of Massive Data Set moments. CoRR, abs/1104.4552, 2011. Computations. PhD thesis, University of California, [25] S. Ganguly. A lower bound for estimating high moments of a Berkeley, 2002. data stream. CoRR, abs/1201.0253, 2012. [6] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. [26] A. Gronemeier. Asymptotically optimal lower bounds on the Information theory methods in communication complexity. nih-multi-party information complexity of the and-function In IEEE Conference on Computational Complexity, pages and disjointness. In STACS, pages 505–516, 2009. 93–102, 2002. [27] M. Hardt and D. P. Woodruff. How robust are linear sketches [7] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. to adaptive inputs? In STOC, pages 121–130, 2013. An information statistics approach to data stream and [28] P. Indyk. Sketching, streaming and sublinear-space communication complexity. J. Comput. Syst. Sci., algorithms, 2007. Graduate course notes available at http: 68(4):702–732, 2004. //stellar.mit.edu/S/course/6/fa07/6.895/. [8] A. Barvinok. Integer Points in Polyhedra. Contemporary [29] P. Indyk and D. P. Woodruff. Optimal approximations of the mathematics. European Mathematical Society, 2008. frequency moments of data streams. In STOC, pages [9] P. Beame, T. S. Jayram, and A. Rudra. Lower bounds for 202–208, 2005. randomized read/write stream algorithms. In STOC, pages [30] T. S. Jayram. Hellinger strikes back: A note on the 689–698, 2007. multi-party information complexity of and. In [10] L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha. Simpler APPROX-RANDOM, pages 562–573, 2009. algorithm for estimating frequency moments of data streams. [31] I. Kremer, N. Nisan, and D. Ron. On randomized one-round In SODA, pages 708–713, 2006. communication complexity. Computational Complexity, [11] V. Braverman and R. Ostrovsky. Recursive sketching for 8(1):21–49, 1999. frequency moments. CoRR, abs/1011.2571, 2010. [32] Y. Li, H. L. Nguyen, and D. P. Woodruff. On sketching [12] A. Chakrabarti, K. Do Ba, and S. Muthukrishnan. Estimating matrix norms and the top singular vector. In SODA, 2014. Entropy and Entropy Norm on Data Streams. In STACS, [33] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error pages 196–205, 2006. lp-sampling with applications. In SODA, 2010. [13] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower [34] S. Muthukrishnan. Data Streams: Algorithms and bounds on the multi-party communication complexity of set Applications. Foundations and Trends in Theoretical disjointness. In CCC, pages 107–117, 2003. Computer Science, 1(2):117–236, 2005. [14] M. Charikar, K. Chen, and M. Farach-Colton. Finding [35] I. Newman. Private vs. common random bits in frequent items in data streams. In ICALP, pages 693–703, communication complexity. Inf. Process. 
Lett., pages 67–71, 2002. 1991. [15] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra [36] E. Price and D. P. Woodruff. Lower bounds for adaptive in the streaming model. In STOC, pages 205–214, 2009. sparse recovery. In SODA, pages 652–663, 2013. [16] D. Coppersmith and R. Kumar. An improved data stream [37] A. Storjohann. Algorithms for Matrix Canonical Forms. PhD algorithm for frequency moments. In Proceedings of the 15th thesis, Eidgenössische Technische Hochschule Zürich, 2000. Annual ACM-SIAM Symposium on Discrete Algorithms [38] J. von zur Gathen and M. Sieveking. A bound on solutions of (SODA), pages 151–156, 2004. linear integer equalities and inequalities. Proceedings of the [17] G. Cormode and S. Muthukrishnan. An improved data American Mathematical Society, 72(1):155–158, 1978. stream summary: the count-min sketch and its applications. [39] D. P. Woodruff. Optimal space lower bounds for all J. Algorithms, 55(1):58–75, 2005. frequency moments. In Proceedings of the 15th Annual [18] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and ACM-SIAM Symposium on Discrete Algorithms (SODA), Z. Svitkina. On distributing symmetric streaming pages 167–175, 2004. computations. ACM Transactions on Algorithms, 6(4), 2010. [19] P. Flajolet and G. N. Martin. Probabilistic counting. In Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 76–82, APPENDIX 1983. [20] S. Ganguly. Estimating frequency moments of data streams A. OMITTED DETAILS IN THE PROOF OF using random linear combinations. In Proceedings of the 8th THEOREM 5 International Workshop on Randomization and Computation (RANDOM), pages 369–380, 2004. 0 PROOFOF LEMMA 6. Let v = u ⊕ e = α(com(rep(u) ⊕ [21] S. Ganguly. A hybrid algorithm for estimating frequency i e )). Let σ be a stream with freq(σ) = ~0 such that (rep(u) ⊕ moments of data streams, 2004. Manuscript. i (ei)) ⊕ σ = rep(v). Note that freq(ei ◦ σ ◦ −ei) = ~0 and u is terminal, we have α(com(rep(u) ⊕ ei ◦ σ ◦ −ei)) = u. Thus, completeness. Let

0 0 0 u ⊕ ei ◦ −ei = α(com(rep(v) ⊕ −ei)) z = [α1]b1 + ··· + [αt]bt ∈ M . = α(com(rep(u) ⊕ e ◦ σ ◦ −e )) i i It is clear that z − z0 ∈ M. If z − z0 6= 0, then there exists a = u. maximum i such that αi is not an integer, so

0 z − z = (α1 − [α1])b1 + ··· + (αi − [αi])bi, PROOFOF LEMMA 7. Let u, v be two vertices in some (possi- where αi − [αi] > 0. Let ˜bi be the orthogonal projection of bi onto bly different) terminal components reachable from s ⊕ σ. Assume ⊥ span{b1, . . . , bi−1} . Then that s⊕σ⊕σu = u, s⊕σ⊕σv = v, where freq(σu) = freq(σv) = ~ 0 0. Since s belongs to a terminal component and freq(σ ◦ σu ◦ d(z − z , span{b1, . . . , bi−1}) = (αi − [αi])k˜bik2 −1 ~ 0 ~ σ ) = 0, there is a stream σu of frequency 0 such that u ⊕ ˜ −1 0 −1 0 d(bi, span{b1, . . . , bi−1}) = kbik2 σ ◦ σu = s. We have u ⊕ σ ◦ σu ◦ σ ◦ σv = v and −1 0 ~ freq(σ ◦ σu ◦ σ ◦ σv) = 0. This implies that u, v must belong to so the same terminal component since they are both in terminal com- 0 ponents. d(z − z , span(b1, . . . , bi−1)) < d(bi, span(b1, . . . , bi−1)). 0 0 PROOFOF LEMMA 8. We prove by induction. Let u be the Now, let x be the orthogonal projection of z−z onto span(b1, . . . , bi−1) 0 current state of AR and C the current state of B. By the induction such that hypothesis, C0 = α(com(u0)). Consider the transition induced 0 x = β1b1 + ··· + βi−1bi−1, β1, . . . , βi−1 ∈ by processing ei. Because C is the unique terminal component R 0 ~ reachable from u , there is a stream σ of frequency 0 such that and let 0 0 (u ⊕ ei) ⊕ −ei ◦ σ = rep(C ). By definition, the state of B after 0 0 0 the transition is α(com(rep(C ) ⊕ ei)) = α(com((u ⊕ ei) ⊕ x = [β1]b1 + ··· + [βi−1]bi−1, −ei ◦ σ ◦ ei)), which is the unique terminal component reachable then x0 ∈ M ∩ span(b , . . . , b ), x − x0 ∈ P(b , . . . , b ) and 0 ~ 1 i−1 1 i−1 from u ⊕ ei via streams of frequency 0. z − z0 − x0 ∈ M \ X. It follows that

0 0 0 0 0 B. PROOF OF LEMMA 11 d(z − z − x , P(b1, . . . , bi−1)) ≤ d(z − z − x , x − x ) = d(z − z0, x) 0 Algorithm 2 Finding an independent set of M = d(z − z , span{b1, . . . , bi−1})

1: X ← {0} < d(bi, span{b1, . . . , bi−1}) 2: t ← 0 n ≤ d(bi, P(b1, . . . , bi−1)). 3: while (M \ X) ∩ Z|2m| 6= ∅ do n 4: Pick y ∈ (M \ X) ∩ Z|2m| Observe that our choice of bi actually guarantees that 5: if t = 0 then 6: b1 ← y d(bi, P(b1, . . . , bi−1)) ≤ d(w, P(b1, . . . , bi−1)), 7: else ∀w ∈ M \ span{b1, . . . , bi−1} 8: ρ ← d(y, P(b1, . . . , bt)) 0 9: Y ← {x ∈ M \ X : d(x, P(b1, . . . , bt)) ≤ ρ} We meet a contradiction. Therefore, z = z and αi ∈ Z for all 10: bt+1 ← arg minz∈Y d(z, P(b1, . . . , bt)) . pick an i. arbitrary one if there is a tie 11: end if 12: t ← t + 1 C. FREQUENCY MOMENTS 13: X = span{b1, . . . , bt} Provided that streaming algorithms can be implemented by a lin- 14: end while ear sketch without much loss in space complexity, a lower bound 15: return b1, . . . , bt of frequency moment problem follows easily, without resorting to complicated communication complexity. C.1 Information Theory and Hellinger Distance PROOF. We run Algorithm 2 to find linearly independent vec- Suppose that X is a discrete random variable on Ω with distri- tors b1, . . . , bt ∈ M for some t. Since Y is a finite set, it is always bution p(x). Then the entropy H(X) of the random variable X is possible to find a bi+1 in Line 10. The algorithm will always termi- P n defined by H(X) = − x∈Ω p(x) log2 p(x). The joint entropy nate since Z|2m| is a finite set, and it is easy to see that the returned H(X,Y ) of a pair of discrete random variables (X,Y ) with joint set of vectors b1, . . . , bt are linearly independent. It is not diffi- i distribution p(x, y) is defined as H(X,Y ) = − P P p(x, y) log p(x, y). cult to verify inductively that kbik∞ ≤ C2 m for some absolute x y constant C > 0. We also define the conditional entropy H(X|Y ) as H(X|Y ) = 0 P Now let M = hb1, . . . , bti, the module generated by b1, . . . , bt. y H(X|Y = y) Pr{Y = y}, where H(X|Y = y) is the en- 0 n X {Y = It is clear that M is a submodule of M. Suppose that x, y ∈ Z|m| tropy of the conditional distribution of given the event and x + M = y + M. We want to show that x + M 0 = y + M 0. y}. The mutual information I(X; Y ) is the relative entropy be- Let z = x − y. By the termination condition of the algorithm, tween the joint distribution and the product distribution p(x)p(y) (where p(x) and p(y) are marginal distributions), i.e., I(X; Y ) = there exist α1, . . . , αt ∈ R such that z = α1b1 + ··· + αtbt. It P p(x,y) is a standard argument as in [8, Lemma 10.2] to show that all αi’s x,y p(x, y) log p(x)p(y) . The following are the basic properties are integers and thus z ∈ M 0. We include the argument below for regarding entropy and mutual information. PROPOSITION 18. Let X,Y,Z be discrete random variables Next we prove (1). Observe that defined on ΩX , ΩY , ΩZ respectively and f a function defined on X Ω. Then I(Φx; x) = I(Φx; xi|x1, . . . , xi−1) i 1. 0 ≤ H(X) ≤ log |ΩX |, the right equality is attained iff X is X uniform on ΩX . = H(xi|x1, . . . , xi−1) − H(xi|Φx, x1, . . . , xi−1) 2. Conditioning reduces entropy: H(X|Y ) ≤ H(X). i 3. I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) ≥ 0 X = H(x ) − H(x |Φx, x , . . . , x ) and the equality is attained iff X and Y are independent; i i 1 i−1 i 4. Chain rule for mutual information: I(X,Y ; Z) = I(X; Z)+ I(X; Y |Z); (since xi’s are independent) X 5. If X,Y are jointly independent of Z then H(X|Y,Z) = ≥ H(xi) − H(xi|Φx)(conditioning reduces entropy) H(X|Y ). i X = I(Φx; xi) DEFINITION 7. 
The Hellinger distance h(P, Q) between probability distributions P and Q on a domain Ω is defined by

h^2(P, Q) = 1 − Σ_{ω∈Ω} sqrt(P(ω) Q(ω)) = (1/2) Σ_{ω∈Ω} (sqrt(P(ω)) − sqrt(Q(ω)))^2.

One can verify that the Hellinger distance is a metric satisfying the triangle inequality, see, e.g., [7]. The following proposition connects the Hellinger distance and the total variation distance.

PROPOSITION 19 (see, e.g., [5]). h^2(P, Q) ≤ d_TV(P, Q) ≤ sqrt(2) · h(P, Q).

In connection with mutual information, we have that

LEMMA 20 ([7]). Let F_{z_1} and F_{z_2} be two random variables. Let Z denote a random variable with uniform distribution in {z_1, z_2}. Suppose F(z) is independent of Z for each z ∈ {z_1, z_2}. Then I(Z; F(Z)) ≥ h^2(F_{z_1}, F_{z_2}).

C.2 Lower bound

Suppose that x ∈ R^n. We say that a data stream algorithm solves the (ε, p)-NORM problem if its output X satisfies (1 − ε)‖x‖_p ≤ X ≤ (1 + ε)‖x‖_p with probability ≥ 1 − δ.

Let µ be a distribution on R^n defined as follows. Choose x uniformly at random from {0, 1}^n and i uniformly at random from {1, . . . , n}. With probability 1/2, let x_i = (2n)^{1/p}. Let µ be the distribution of the resulting x.

THEOREM 21. For any p > 2, any randomized free automaton that solves the (c, p)-NORM problem (c < sqrt(2)) for x ∼ µ with probability ≥ 19/20, where m ≤ poly(n), requires Ω(n^{1−2/p}/ log^2 n) space.

PROOF. In the light of Theorem 15, we may increase the space complexity by a log m factor and assume that the algorithm maintains a linear sketch whose sketching matrix Φ ∈ Z^{k×n} satisfies ‖Φ‖_max = O(poly(n)). Suppose that Φ = (φ_1, . . . , φ_n) and x ∼ {0, 1}^n. We shall show that

I(Φx; x) = Ω(n^{1−2/p}).    (1)

Assume this is true for the moment. On the other hand, since the entries of Φ are bounded by poly(n), we have

I(Φx; x) ≤ H(Φx) ≤ log((poly(n))^k) = O(k log n),

and it follows that k = Ω(n^{1−2/p}/ log n), as desired.

It remains to prove (1). By the chain rule for mutual information, the independence of the coordinates of x, and the fact that conditioning reduces entropy (Proposition 18),

I(Φx; x) ≥ Σ_i I(x_i φ_i + R_i; x_i),    where R_i = Σ_{j≠i} x_j φ_j.

It suffices to show that I(x_i φ_i + R_i; x_i) = Ω(n^{−2/p}) for Ω(n) values of i. By correctness of the sketch, it must hold that

h^2((2n)^{1/p} φ_i + R_i, j_0 φ_i + R_i) = Ω(1) for Ω(n) values of i,

for either j_0 = 0 or j_0 = 1. Observe that h^2(j φ_i + R_i, (j−1) φ_i + R_i) = h^2(φ_i + R_i, R_i) for all j ≥ 1. No matter whether j_0 = 0 or 1, we therefore have

Ω(1) = h^2((2n)^{1/p} φ_i + R_i, j_0 φ_i + R_i) ≤ ( Σ_{j=j_0+1}^{(2n)^{1/p}} h(j φ_i + R_i, (j−1) φ_i + R_i) )^2 = O(n^{2/p} h^2(φ_i + R_i, R_i)),

whence it follows immediately that

I(x_i φ_i + R_i; x_i) ≥ h^2(φ_i + R_i, R_i) = Ω(1/n^{2/p}),

where we invoked Lemma 20 for the inequality, noting that R_i and x_i are independent.
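A small numerical illustration of the Hellinger machinery used above (a sketch with an arbitrary toy distribution standing in for R_i, not the actual sketch distribution): the triangle inequality bounds the distance between the endpoints of a chain of shifted distributions by the sum of consecutive distances, and each consecutive distance is the same, exactly as in the chain of inequalities in the proof.

import math

def hellinger(p, q):
    """Hellinger distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(0.5 * s)

def shift(p, j):
    """Distribution of X + j when X ~ p (integer-valued)."""
    return {k + j: v for k, v in p.items()}

# Toy base distribution playing the role of R_i.
R = {0: 0.5, 1: 0.3, 2: 0.2}
t = 5
endpoints = hellinger(shift(R, 0), shift(R, t))
chain = sum(hellinger(shift(R, j - 1), shift(R, j)) for j in range(1, t + 1))
print(endpoints, "<=", chain)     # triangle inequality along the chain
# Each consecutive distance is identical, since shifting both arguments by 1 changes nothing:
print({hellinger(shift(R, j - 1), shift(R, j)) for j in range(1, t + 1)})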