Beating CountSketch for Heavy Hitters in Insertion Streams
Vladimir Braverman∗ (Johns Hopkins University, Baltimore, MD, USA), Stephen R. Chestnut† (ETH Zurich, Zurich, Switzerland), Nikita Ivkin‡ (Johns Hopkins University, Baltimore, MD, USA), David P. Woodruff (IBM Research, Almaden, San Jose, CA, USA)

∗ This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639, by the Google Faculty Award, and by DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.
† This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639.
‡ This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639 and by DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.

ABSTRACT

Given a stream $p_1, \ldots, p_m$ of items from a universe $U$, which, without loss of generality, we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \varepsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \varepsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\varepsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found.

In this paper we show one can achieve $O(\log n \log\log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including:

1. The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014).

2. A way to estimate the $\ell_\infty$ norm of a stream up to additive error $\varepsilon \sqrt{F_2}$ with $O(\log n \log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion-only streams.

Categories and Subject Descriptors

F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; G.3 [Mathematics of Computing]: Probability and Statistics

Keywords

Data Streams, Heavy Hitters, Chaining

1. INTRODUCTION

There are numerous applications of data streams, in which the elements $p_i$ may be numbers, points, edges in a graph, and so on. Examples include internet search logs, network traffic, sensor networks, and scientific data streams (such as in astronomy, genomics, physical simulations, etc.). The sheer size of the dataset often imposes very stringent requirements on an algorithm's resources. In many cases only a single pass over the data is feasible, such as in network applications, since if the data on a network is not physically stored somewhere, it may be impossible to make a second pass over it. There are multiple surveys and tutorials in the algorithms, database, and networking communities on the recent activity in this area; we refer the reader to [43, 6] for more details and motivation.

Finding heavy hitters, also known as the top-$k$, most popular items, elephants, or iceberg queries, is arguably one of the most fundamental problems in data streams. It has applications in flow identification at IP routers [21], iceberg queries [22], iceberg datacubes [8, 24], association rules, and frequent itemsets [2, 47, 51, 26, 25]. Formally, we are given a stream $p_1, \ldots, p_m$ of items from a universe $U$, which, without loss of generality, we identify with the set $\{1, 2, \ldots, n\}$. We make the common assumption that $\log m = O(\log n)$, though our results generalize naturally to any $m$ and $n$. Let $f_i$ denote the frequency, that is, the number of occurrences, of item $i$. We would like to find those items $i$ for which $f_i$ is large, i.e., the "heavy hitters". In this paper we consider algorithms that are allowed one pass over the stream and must use as little space (memory) in bits as possible, maintaining a succinct summary ("sketch") so that after processing the stream, we can output the identities of all of the heavy hitters from the summary with large probability.
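To make the two guarantees concrete, here is a small brute-force (non-streaming) Python illustration; the function name heavy_hitters, the toy stream, and the choice eps = 0.5 are ours for exposition only and are not part of the paper's algorithms.

    from collections import Counter
    import math

    def heavy_hitters(stream, eps):
        """Brute-force (non-streaming) l1- and l2-heavy hitters of a stream."""
        freq = Counter(stream)
        m = len(stream)                          # stream length
        F2 = sum(f * f for f in freq.values())   # second frequency moment
        l1 = {j for j, f in freq.items() if f >= eps * m}              # f_j >= eps*m
        l2 = {j for j, f in freq.items() if f >= eps * math.sqrt(F2)}  # f_j >= eps*sqrt(F2)
        return l1, l2

    # m = 100: item 0 occurs 10 times, 90 other items occur once each.
    stream = [0] * 10 + list(range(1, 91))
    l1, l2 = heavy_hitters(stream, eps=0.5)
    print(l1)  # set(): no item reaches 0.5 * m = 50 occurrences
    print(l2)  # {0}:   f_0 = 10 >= 0.5 * sqrt(100 + 90) ~ 6.9

Note that the $\ell_1$ set is always contained in the $\ell_2$ set, matching the discussion of the two guarantees below.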
There are various notions of what it means for $f_i$ to be large. One such notion is that we should return all indices $i \in [n]$ for which $f_i \geq \varepsilon m$, for a parameter $\varepsilon \in (0, 1)$, and no index $i$ for which $f_i \leq (\varepsilon - \phi) m$, for a parameter $\phi$. It is typically assumed that $\phi \geq c\varepsilon$ for an absolute constant $c > 0$, and we will do so in this paper. This notion has been extensively studied, so much so that the same streaming algorithm for it was re-invented multiple times. The first algorithm was given by Misra and Gries [41], who achieved $O((\log n)/\varepsilon)$ bits of space. The algorithm was rediscovered by Demaine et al. [20], and then later rediscovered by Karp et al. [33]; Cormode and Muthukrishnan [18] state that "papers on frequent items are a frequent item!" While these algorithms are deterministic, there are also several randomized algorithms, including the Count-Min sketch [19], sticky sampling [37], lossy counting [37], sample and hold [21], multi-stage bloom filters [14], sketch-guided sampling [34], and CountSketch [16].
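As a concrete illustration of the counter-based approach of Misra and Gries mentioned above, here is a minimal Python sketch (ours, not the paper's): with at most $k - 1$ counters it retains every item of frequency greater than $m/k$, undercounting each by at most $m/k$.

    def misra_gries(stream, k):
        """One-pass Misra-Gries summary using at most k - 1 counters.

        Every item with true frequency f_i > m/k survives in `counters`,
        and each reported count undercounts f_i by at most m/k.
        """
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # All counters occupied: decrement everything, dropping zeros.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    # Item 1 has f_1 = 5 > m/k = 9/3, so it must survive (its count may be low).
    print(misra_gries([1, 2, 1, 3, 1, 2, 1, 4, 1], k=3))  # {1: 3}

Setting $k = \lceil 1/\phi \rceil$ and filtering with a second pass (or accepting the additive $m/k$ slack) yields the $\ell_1$-guarantee in $O((\log n)/\phi)$ bits.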
A useful (slightly suboptimal) intuition is that one can sample $O((\log(1/\varepsilon))/\varepsilon)$ random stream positions, remember the identities of these positions, and then maintain the counts of these items. By coupon-collector arguments, all heavy hitters will be found this way, and one can filter out the spurious ones (those with $f_i \leq (\varepsilon - \phi) m$).

One of these techniques, CountSketch [16], refined in [50],¹ gives a way of finding the $\ell_2$-heavy hitters of a stream. Those are the items for which $f_i^2 \geq \varepsilon^2 F_2$. Notice that this guarantee is significantly stronger than the aforementioned guarantee that $f_i \geq \varepsilon m$, which we will call the $\ell_1$-guarantee. Indeed, if $f_i \geq \varepsilon m$, then $f_i^2 \geq \varepsilon^2 m^2 \geq \varepsilon^2 F_2$, so an algorithm for finding the $\ell_2$-heavy hitters will find all items satisfying the $\ell_1$-guarantee. On the other hand, given a stream of $n$ distinct items in which $f_i = \sqrt{n}$ for a single $i \in [n]$, yet $f_j = 1$ for all $j \neq i$, an algorithm satisfying the $\ell_2$-heavy hitters guarantee will identify item $i$ with constant $\varepsilon$, whereas an algorithm that only has the $\ell_1$-guarantee would need to set $\varepsilon = 1/\sqrt{n}$, using $\Omega(\sqrt{n} \log n)$ bits of space. In fact, $\ell_2$-heavy hitters are in some sense the best one can hope for with a small amount of space in a stream, as it is known for $p > 2$ that finding those $i$ for which $f_i \geq \varepsilon F_p^{1/p}$ requires $n^{1-2/p}$ bits of space [7, 15]. The CountSketch has broad applications in compressed sensing [23, 46, 40] and numerical linear algebra [17, 39, 44, 10], and is often used as a subroutine in other data stream algorithms.

¹ … be eliminated. See the appendix of [11] for a counterexample and explanation.

Despite the success in obtaining space-optimal streaming algorithms for estimating moments and $p$-norms, the space complexity of finding $\ell_2$-heavy hitters has remained a glaringly open problem. It is known that if one allows deletions in the stream, in addition to insertions, then $\Theta(\log^2 n)$ bits of space is optimal [5, 31]. However, in many cases we just have a stream of insertions, such as in the model studied in the seminal paper of Alon, Matias, and Szegedy [3].

1.1 Our Contributions

The main result of this paper is the near resolution of the open question above.

Theorem 1 ($\ell_2$-Heavy Hitters). For any $\varepsilon > 0$, there is a 1-pass algorithm in the insertion-only model that, with probability at least $2/3$, finds all those indices $i \in [n]$ for which $f_i \geq \varepsilon \sqrt{F_2}$, and reports no indices $i \in [n]$ for which $f_i \leq \frac{\varepsilon}{2} \sqrt{F_2}$. The space complexity is $O(\frac{1}{\varepsilon^2} \log \frac{1}{\varepsilon} \log n \log\log n)$ bits.

The intuition of the proof is as follows. Suppose there is a single $\ell_2$-heavy hitter $H$, $\varepsilon > 0$ is a constant, and we are trying to find the identity of $H$. Suppose further we could identify a substream $S'$ where $H$ is very heavy; specifically, we want the frequencies in the substream to satisfy $f_H^2 / \mathrm{poly}(\log n) \geq \sum_{j \in S', j \neq H} f_j^2$. Suppose also that we could find certain $R = O(\log n)$ "breakpoints" in the stream corresponding to jumps in the value of $f_H$, that is, we knew a sequence $p_{q_1} < p_{q_2} < \cdots < p_{q_R}$ of positions in the stream at which $f_H$ increases by a multiplicative factor of $(1 + 1/\Theta(R))$.

Given all of these assumptions, in between breakpoints we can partition the universe randomly into two pieces and run an $F_2$-estimate [3] (AMS sketch) on each piece. Since $f_H^2$ is more than a $\mathrm{poly}(\log n)$ factor larger than $\sum_{j \in S', j \neq H} f_j^2$, while in between consecutive breakpoints the squared frequency of $H$ grows by $\Omega(f_H^2 / \log n)$, it follows that $H$ contributes a constant fraction of the $F_2$-value in between consecutive breakpoints. So, upon choosing the constants appropriately, the larger in magnitude of the two AMS sketches identifies a bit of information about $H$ with probability, say, 90%.
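To see why the larger of the two sketches reveals a bit about $H$, consider the following toy Python simulation (ours; the parameters are illustrative, and the stored sign table stands in for the 4-wise independent hash functions a space-efficient AMS sketch would use).

    import random

    class AMS:
        """Basic AMS F2 estimator [3]: average of r copies of (sum_i sign(i) * f_i)^2."""
        def __init__(self, r=16, seed=0):
            self.rng = random.Random(seed)
            self.signs = [{} for _ in range(r)]  # toy: remember each item's +/-1 sign
            self.sums = [0] * r

        def update(self, item):
            for t in range(len(self.sums)):
                if item not in self.signs[t]:
                    self.signs[t][item] = self.rng.choice((-1, 1))
                self.sums[t] += self.signs[t][item]

        def estimate(self):
            return sum(z * z for z in self.sums) / len(self.sums)

    random.seed(1)
    n, H = 1000, 42
    side = {i: random.getrandbits(1) for i in range(n)}  # random 2-piece partition
    stream = [H] * 300 + [random.randrange(n) for _ in range(1000)]  # H dominates F2
    random.shuffle(stream)

    sketches = (AMS(seed=2), AMS(seed=3))
    for x in stream:
        sketches[side[x]].update(x)

    # The piece with the larger F2 estimate is almost surely the one containing H,
    # revealing one bit of information about H's identity.
    guess = max((0, 1), key=lambda b: sketches[b].estimate())
    print(guess == side[H])  # True with high probability

Repeating this with fresh random partitions accumulates bits of $H$'s identity, which is the role the breakpoints and substream $S'$ play in the proof sketch above.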