STREAMING FOR MINING FREQUENT ITEMS

by

Nikita Ivkin

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland October, 2018

© 2018 by Nikita Ivkin

All rights reserved

Abstract

The streaming model has been supplying solutions for handling enormous data flows for over 20 years now. The model assumes sequential data access and posits sublinear memory as its primary restriction. Although the majority of the algorithms are randomized and approximate, the field facilitates numerous applications, from handling networking traffic to analyzing cosmology simulations and beyond. This thesis focuses on one of the most foundational and well-studied problems, that of finding heavy hitters, i.e. frequent items:

1. We challenge the long-standing complexity gap in finding heavy hitters with the ℓ₂ guarantee in the insertion-only model and present the first optimal algorithm, with a space complexity of O(1) words and O(1) update time. Our result improves on the Count Sketch algorithm with space and time complexity of O(log n) by Charikar et al. 2002 [39].

2. We consider the ℓ₂-heavy hitter problem in the interval query setting, which is rapidly emerging in the field. Compared to the well-known sliding window model, where an algorithm is required to report the function of interest computed over the last N updates, the interval query setting provides query flexibility: at any moment t one can query the function value on any interval (t₁, t₂) ⊆ (t − N, t). We present the first ℓ₂-heavy hitter algorithm in that model and extend the result to the estimation of all streamable functions of a frequency vector.

3. We provide an experimental study of the recent space-optimal result on streaming quantiles by Karnin et al. 2016 [85]. The problem can be considered a generalization of heavy hitters. Additionally, we suggest several variations of the algorithm which improve the running time from O(1/ε) to O(log 1/ε), provide a twice better space vs. precision trade-off, and extend the algorithm to the case of weighted updates.

4. We establish a connection between finding "halos", i.e. dense areas, in cosmology N-body simulations and finding heavy hitters. We build the first streaming-based halo finder and scale it up to handle datasets with up to 10^12 particles via GPU boosting, sampling, and parallel I/O. We investigate its behavior and compare it to traditional in-memory halo finders. Our solution pushes the memory footprint from several terabytes down to less than a gigabyte, therefore making the problem feasible for small servers and even desktops.

Primary readers: Vladimir Braverman, Alexander Szalay
Secondary reader: Raman Arora

Acknowledgements

I would like to thank my advisor Vladimir Braverman for all the support and guidance during the entire Ph.D. program. I especially appreciate his effort in introducing me to other researchers in the field and encouraging me to find new collaborations on my own. I have gained a lot in the skill of controlling my time more efficiently and managing it mindfully while going through several teaching assistantships during my Ph.D. I would like to express my gratitude to all the researchers I had a chance to collaborate with: Tamas Budavari, Mohammad Hajiaghayi, Michael Jacobs, Zohar Karnin, Kevin Lang, Edo Liberty, Gerard Lemson, Morteza Monemizadeh, Muthu Muthukrishnan, Jelani Nelson, Mark Neyrinck, Alex Szalay, David Woodruff, Sepehr Assadi, Stephen Chestnut, Hossein Esfandiari, Ruoyuan Gao, Srini Suresh Kumar, Zaoxing Liu, Teodor Marinov, Poorya Mianjy, Jalaj Upadhyay, Xin Wang, Lin Yang, Zhengyu Wang. I owe a very important debt to all the professors whose classes I had a chance to participate in and learn a lot from. Each class was a big excitement for me thanks to Yanif Ahmad, Raman Arora, Amitabh Basu, Vladimir Braverman, Michael Dinitz, Jim Fill and Rene Vidal. I would like to offer my special thanks to Deborah DeFord, Cathy Thornton, Zachary Burwell, Laura Graham, Tonette Harris, Shani McPherson, Joanne Selinski, and Javonnia Thomas for all the help from the administrative side of the department. I am deeply grateful to all my family and all my friends who were there for me when I needed it the most. Special thanks to all the readers of my thesis: Raman Arora, Vladimir Braverman and Alex Szalay.

iv Finally, I would like to acknowledge that my work was financially supported by the fol- lowing grants and awards: NSA-H98230-13-C-0265, NSF IIS -1447639, DARPA -111477, Research Trends Det. HR0011-16-P-0014 NSF EAGER -16050041, ONR N00014-18-1-2364, CISCO 90073352.

Contents

Abstract ii

Acknowledgements iv

1 Introduction 1

1.1 Streaming model ...... 1
1.2 Contribution ...... 4

1.2.1 ℓ₂ heavy hitters algorithm with fewer words ...... 4
1.2.2 Streaming Algorithms for cosmological N-body simulations ...... 5
1.2.3 Monitoring the Network with Interval Queries ...... 6
1.2.4 Streaming quantiles algorithms with small space and update time ...... 7

2 ℓ₂ heavy hitters algorithm with fewer words 8
2.1 Introduction ...... 8
2.2 Beating CountSketch for Heavy Hitters in Insertion Streams ...... 13
2.2.1 Introduction ...... 13

2.2.2 ℓ₂ heavy hitters algorithm ...... 19
2.2.3 Chaining Inequality ...... 36
2.2.4 Reduced randomness ...... 40

2.2.5 F₂ at all points ...... 45

2.3 BPTree: an ℓ₂ heavy hitters algorithm using constant memory ...... 49
2.3.1 Introduction ...... 49
2.3.2 Algorithm and analysis ...... 54
2.3.3 Experimental Results ...... 73
2.4 Conclusion ...... 79

3 Monitoring the Network with Interval Queries 80

3.1 Introduction ...... 80
3.2 Preliminaries ...... 82
3.3 Interval Algorithms ...... 87
3.4 Evaluation ...... 104
3.5 Conclusion ...... 108

4 Streaming quantiles algorithms with small space and update time 116

4.1 Introduction ...... 116
4.2 A unified view of previous randomized solutions ...... 119
4.3 Our Contribution ...... 123
4.4 Experimental Results ...... 134
4.5 Conclusion ...... 140

5 Finding haloes in cosmological N-body simulations 141

5.1 Introduction ...... 141
5.2 Streaming Algorithms for Halo Finders ...... 145
5.2.1 Methodology ...... 145
5.2.2 Implementation ...... 151

5.2.3 Evaluation ...... 156
5.3 Scalable Streaming Tools for Analyzing N-body Simulations ...... 165
5.3.1 Methodology ...... 165
5.3.2 Implementation ...... 173
5.3.3 Evaluation ...... 182
5.4 Conclusion ...... 196

Bibliography 206

List of Figures

2.1 In this example of the execution of HH1, the randomized label h(H) of the heavy hitter H begins with 01 and ends with 00. Each node corresponds to a round of HH1, which must follow the path from H0 to HR for the output to be correct. ...... 60
2.2 Success rate for HH2 on four types of streams with n = 10^8 and heavy hitter frequency α√n. ...... 76
2.3 Update rate in updates/ms (•) and storage in kB (◦) for HH2 and CountSketch with the CW trick hashing. ...... 77

3.1 Interval (bucket) structure for EH and SH. ...... 85
3.2 Interval query in the prism of EH. ...... 86
3.3 Average frequency estimation error for flows in 10-20k interval. ...... 105
3.4 Average frequency estimation error for flows for various suffix lengths on the NY2018 dataset. ...... 106

3.5 Average L2 norm estimation error for flows in 10-20k intervals...... 107

3.6 Average L2 norm estimation error for flows for various suffix lengths on the NY2018 dataset. ...... 108
3.7 Quality of HH solution for 10k-20k interval (first experiment). Precision and Recall. ...... 109

3.8 Quality of HH solution for 10k-20k interval (first experiment). F1 Measure. . 110

3.9 Quality of HH solution for varying suffix lengths (second experiment). Precision. ...... 111
3.10 Quality of HH solution for varying suffix lengths (second experiment). Recall. ...... 112

3.11 Quality of HH solution for varying suffix lengths (second experiment). F1 Measure. ...... 113
3.12 Average entropy relative error for 10-20k intervals. ...... 114
3.13 Average entropy relative error for various suffix lengths on the NY2018 dataset. ...... 115

4.1 One pair compression: initially each item has weight w, compression introduces w error for inner queries and no error for outer queries. ...... 119

4.2 Compaction procedure: rank error w is introduced to inner queries q_{2,4}, no error to outer queries q_{1,3,5}. ...... 120
4.3 Compactor saturation: vanilla KLL vs. lazy KLL. ...... 124

4.4 Compaction with an equally spread error: every query q_{2,3,4,5} is either inner or outer equiprobably. ...... 125
4.5 Example of one full sweep in 4 stages, each stage depicts pair chosen for the compaction, updated threshold θ and new items arrived (shadow bars). ...... 128
4.6 Intuition behind base2update algorithm. ...... 129
4.7 Compressing pair in the weighted compactor. ...... 133

4.8 Figures 4.8a, 4.8b, 4.8c, 4.8e, 4.8f depict the trade-off between maximum error over all queried quantiles and space allocated to the sketch: figures 4.8a, 4.8c, 4.8b show the results on the randomly ordered streams but in different axes, figure 4.8e shows the results for the sorted stream, stream ordered according to zoom-in pattern, and stream with Gaussian distribution, 4.8f shows the approximation ratio for CAIDA dataset. Figure 4.8d shows the trade-off between error and the length of the stream. ...... 137

5.1 Count-Sketch Algorithm. ...... 150
5.2 Pick-and-Drop Algorithm. ...... 151
5.3 Halo mass distribution of various halo finders. ...... 152
5.4 Count-Sketch Algorithm. ...... 153
5.5 Pick-and-Drop Sampling. ...... 154
5.6 Halo Finder Procedure. ...... 155
5.7 Measures of the disagreement between PD and CS, and various in-memory algorithms. The percentage shown is the fraction of haloes farther than a half-cell diagonal (0.5√3 Mpc/h) from PD or CS halo positions. ...... 156
5.8 The number of top-1000 FoF haloes farther than a distance d away from any top-1000 halo from the algorithm of each curve. ...... 157
5.9 Number of detected halos by our two algorithms. The solid lines correspond to (CS) and the dashed lines to (PD). The dotted line at k = 1000 shows our selection criteria. The x axis is the threshold in the number of particles allocated to the heavy hitter. The cyan color denotes the total number of detections, the blue curves are the true positives (TP), and the red curves are the false positives (FP). ...... 161

5.10 This ROC curve shows the tradeoff between true and false detections as a function of threshold. The figure plots TPR vs FPR on a log-log scale. The two thresholds are shown with symbols, the circle denotes 1000, and the square is 900. ...... 162
5.11 The top 1000 heavy hitters are rank-ordered by the number of their particles. We also computed a rank of the corresponding FoF halo. The linked pairs of ranks are plotted. One can see that if we adopted a cut at k = 900, it would eliminate a lot of the false positives. ...... 163
5.12 Each line on the graph represents the top 1000 halo centers found with Pick-and-Drop sampling, Count-Sketch, and in-memory algorithms, as described in section 5.2.2. The comparison with FOF is shown in Fig. 5.8. The shaded area (too small to be visible) shows the variation due to randomness. ...... 164
5.13 Finding approximate dense areas with the help of a regular mesh and a streaming solution for finding the top k most frequent items in the stream. ...... 167
5.14 Count Sketch subroutine on an example stream: each non-heavy item appears twice, heavy hitter (5) appears 7 times, a random +1/−1 bit is assigned to each item, the algorithm maintains the sum of the random bits, and the final sum is an unbiased estimator of the heavy hitter frequency having the same sign as its random bit. ...... 168
5.15 Count Sketch algorithm scheme: bucket hash to identify the counter to which we should add the sign hash. Repeat t times to recover the IDs. ...... 170
5.16 Dependency of time performance on sampling rate. ...... 182
5.17 Cell density distribution for the top 0.5·10^6 cells found by Count Sketch (in green) and the top 10^7 cells found by exact counting (in blue). ...... 185

5.18 Relative error vs. rank for (a) cell size 0.1 Mpc/h and (b) cell size 1 Mpc/h. Each experiment was carried 20 times. Dashed lines depict the maximum and the minimum, while the solid line shows the average over those 20 runs for each rank value. ...... 198
5.19 Relative error vs. δ of the cell. ...... 199
5.20 Distribution of absolute error for different ranks. ...... 199
5.21 Count distortion for the cell size = 0.1 Mpc/h on the top and for the cell size = 1 Mpc/h on the bottom. ...... 200
5.22 Relative error for the counts in the output of the Count Sketch algorithm and Count Min Sketch algorithm, cell size = 0.1 Mpc/h. ...... 201
5.23 Relative error for the counts in the output of the Count Sketch algorithm with different sampling rates, cell size = 1 Mpc/h. ...... 201
5.24 Relative error for the counts in the output of the Count Sketch algorithm with different sampling rates, cell size = 0.1 Mpc/h. ...... 202
5.25 Relative error for the counts in the output of the Count Sketch algorithm with different internal parameters, cell size = 0.1 Mpc. Color is the height of the CS table, and line type is the width of CS table: solid is 16·10^6, dashed is 8·10^6, dash-dotted is 4·10^6, and dotted is 10^6 columns. ...... 202
5.26 Finding halos from heavy cells exactly by running any offline in-memory algorithm on the subset of particles belonging to the top heaviest cells. ...... 203

5.27 Comparison of the 2-point correlation functions of excursion sets determined using the exact counts and the Count Sketch results for 4 over-density levels. The numbers in parentheses indicate the number of cells that was found and used in the calculation of ξ. Clearly the results of applying the spatial statistic to the Count Sketch result is equivalent to that of the exact counts. The radius R is in the natural, co-moving units of the simulations, Mpc/h. ...... 203
5.28 Two-point correlation functions of excursion sets, defined as sets of cells with a certain lower limit on the over-density. In this plot the results of the count-sketch algorithm for detecting heavy-hitters is used to determine the excursion sets. The number next to the line segments in the legend gives the over-density, the numbers in parentheses indicate the number of cells at that over-density. The radius R is in the natural, co-moving units of the simulations, Mpc/h. ...... 204
5.29 Relative error for the counts in output of Count Sketch algorithm for Millennium XXL dataset. ...... 205
5.30 Comparison of CS 2-pt correlation function for excursion sets in 0.2 Mpc cells with δ ≥ 20000 for the XXL (dots) compared to the exact result for the Millennium run. The two results are compatible with each other, with deviations explained by discreteness effects in the much sparser Millennium result. The radius R is in the natural, co-moving units of the simulations, Mpc/h. ...... 205

List of Tables

2.1 Notation and parameters used throughout the current section. ...... 17
2.2 Random vectors for CountSieve. Each vector is independent of the others, and Z = (Z_i)_{i∈[k]} is sampled independently for every instance of JLBP. ...... 24

2.3 Average and maximum F2 tracking error over 10 streams for different choices of b and r...... 74

3.1 Space complexity, update and query time for all proposed algorithms. . . . 98

4.1 Possible outcomes for the rank query q...... 124

Chapter 1

Introduction

The rates at which data is generated are growing tremendously, with examples across daily life (social media, fitness tracking, delivery vehicle tracking), infrastructure (networking telemetry, cloud services), science (bioinformatics, astrophysics), and beyond. This growth confronts modern technology with a large pool of new problems in which the data is too large to be stored, either because storage is absent or because the storage hardware is too slow. A network switch cannot store, or keep in memory, the trillions of packets that pass through it daily; Twitter needs to mine the hot topics over a window of the last 15 minutes, which cannot be done offline; cosmology simulations are too big to fit into the memory of a regular-size server. These are just a few examples where the analysis can be done neither offline nor in memory, and the data stream model naturally fits them and numerous other problems. The model is, however, very restrictive, and many problems provably cannot be solved in it. We describe the model in the following section.

1.1 Streaming model

The streaming model gained its fame after the seminal paper on estimating ℓ_p norms by Alon, Matias, and Szegedy [6]. The authors consider a stream of updates S = {s_1, ..., s_m}, where s_i can represent any object relevant to the problem to be solved: numbers, graph edges, vectors in a high-dimensional space, etc. For simplicity we will take s_i to be an integer: s_i ∈ [n] = {1, ..., n}. The algorithm reads the data items s_i one at a time, is allowed to make only a single pass over the dataset, and at the end of the stream it should return the value of the function of interest f(S). The algorithm is expected to optimize the time complexity of an update and of a final query; however, minimizing the space complexity is the first priority. The streaming model requires the algorithm's memory footprint to grow sublinearly in the size of the dataset. Under these strict limitations the majority of problems provably have no solution, but a compromise can be achieved if one allows the algorithm, first, to return only an approximation rather than the exact value, and second, to be randomized, i.e. to fail with some small probability. In these settings, the streaming model has found applications in a variety of fields: networking [14, 15], bioinformatics [125], machine learning [60], astrophysics [102], security [127], databases [129, 123], sensor networks [98], finance [38], optimization [5], etc. The motivation for each requirement of the model might seem too strict and unnatural, therefore we give several relevant examples below to show where these requirements originate:

1. A network switch processes up to a billion packets a day, but the memory available for monitoring purposes can be as little as one hundred megabytes, and packets cannot be written to storage at the rate at which they arrive. A widespread denial-of-service (DDoS) attack can be detected by the unusually high packet flow to the server under attack. Without knowing the target of the attack, the switch would have to maintain the flow size to every server, which is impossible due to the large number of servers and the small memory allowance on the switch. However, in this problem the exact values of the flows are not critically important, and an approximation can be used instead.

2. To perform distributed computation efficiently, one needs to balance out the load among the computing machines (workers). Given the internal distribution of the data, the problem becomes trivial. However, finding the distribution is challenging and is often done on a single machine in a dynamic setting, i.e. tasks arrive one by one at the controlling node, which should distribute the flow of tasks among the computing nodes (workers). As in the previous example, the data cannot be stored on the controlling node; however, an exact balance is not required, so an approximate distribution is enough.

3. The analysis of cosmological N-body simulations requires finding halos (dense areas) in a dataset where 90% of the particles are treated as noise. Simulations operate with more than a billion particles, and this number is growing, while all conventional approaches require loading the entire dataset into memory. Researchers without access to machines with terabytes of memory cannot work with such datasets. Alternatively, a streaming-based solution lets the task be performed on small servers with large storage but little memory. This example highlights the existing gap between the cost of storage and the cost of memory.

Due to the variety of applications, the streaming model has been considered in different settings. The definition given earlier corresponds to the cash-register model, where items can only be "added", also known as the additions-only model. Alternatively, one can allow "deletions" in the stream, i.e. the stream can be represented as a sequence of pairs (s_j, δ_j) with δ_j ∈ {−1, +1}. Then, at the end of the stream, the frequency of each item i can be written as f_i = ∑_{j: s_j = i} δ_j. This model is often called the turnstile model. For example, the result mentioned earlier [6] provides a (1 ± ε)-approximation of ‖f‖₂² using only O(ε⁻² log n) bits of space in the turnstile model, where f is the vector defined by the f_i's. Certain applications need to work with infinite streams, where the target function should be evaluated only on the most recent items: given an infinite stream S = {s_1, ..., s_t, ...}, at any moment t report f({s_{t−N}, ..., s_t}), where N is so large that the window {s_{t−N}, ..., s_t} cannot be stored explicitly. Note that this model, called sliding windows, differs substantially from the turnstile model: in the latter the algorithm is always told which item to "delete", while in the former the algorithm must at each moment t delete the item s_{t−N}, which might not be stored explicitly anywhere. For instance, Twitter shows the most discussed topics over the last hour, rather than over all time. In the current chapter we only aim to give a brief glance at the model, and the reader is encouraged to read more about the topic in [113, 3].
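To make the stream models above concrete, here is a minimal Python sketch (illustrative only; the item values and toy streams are made up) of how the frequency vector is accumulated in the cash-register and turnstile settings:

```python
from collections import Counter

def cash_register(stream):
    """Cash-register (additions-only) model: each update s_j adds one to f_{s_j}."""
    f = Counter()
    for s in stream:
        f[s] += 1
    return f

def turnstile(stream):
    """Turnstile model: updates are pairs (s_j, delta_j) with delta_j in {-1, +1},
    so that f_i equals the sum of delta_j over all j with s_j = i."""
    f = Counter()
    for s, delta in stream:
        f[s] += delta
    return f

print(cash_register([1, 2, 1, 3, 1]))                    # Counter({1: 3, 2: 1, 3: 1})
print(turnstile([(1, +1), (2, +1), (1, +1), (2, -1)]))   # item 1 -> 2, item 2 -> 0
```

These helpers store the whole frequency vector and are only meant to pin down the semantics; the point of the streaming algorithms discussed in this thesis is precisely to avoid such linear-memory accounting.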

1.2 Contribution

1.2.1 ℓ₂ heavy hitters algorithm with fewer words

Given a stream of updates S : s_1, s_2, ..., s_m with s_j ∈ {1, ..., n}, we consider the problem of finding all (ε, ℓ_p) heavy hitters in the cash-register model for p = 2. Formally, item i is an (ε, ℓ_p) heavy hitter in the stream S if f_i > ε‖f‖_p, where f_i is the number of updates with id i, f_i = |{j | s_j = i}|, and f is the frequency vector, i.e. the i-th component of this vector equals f_i. Note that the ℓ₂ guarantee is significantly stronger than the ℓ₁ guarantee. In 1996, the seminal paper by Alon, Matias, and Szegedy [6], among other breakthrough results, introduced the first algorithm for estimating ‖f‖₂ in the streaming model. Later, in 2002, Charikar, Chen, and Farach-Colton [39] presented Count Sketch, an algorithm which uses the ‖f‖₂ estimation procedure from [6] as a subroutine to find all (ε, ℓ₂) heavy hitters. Count Sketch works in the turnstile model and requires O(log n) words of memory. It is known that the memory requirement of Count Sketch is tight for the turnstile model [82, 10]. However, no better solution was known for the cash-register model, and the only known lower bound is the naive one of Ω(1) words, coming from the need to store the identity of the heavy hitter. In Chapter 2, we present the algorithm CountSieve, capable of finding all ℓ₂ heavy hitters in the cash-register model with a space complexity of O(log log n) words; we then improve it further by presenting the BPTree algorithm, which works with only O(1) words of space. Both algorithms take advantage of a Dudley-like chaining argument applied to the ℓ₂ norm estimator.
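As a point of reference for the definition above, the following offline helper (an illustrative sketch, not one of the streaming algorithms of Chapter 2) computes the (ε, ℓ₂) heavy hitters exactly; the toy stream also shows why the ℓ₂ guarantee is stronger, since an item of frequency √n among n unit-frequency items is an ℓ₂ heavy hitter while occupying only a negligible fraction of the stream:

```python
import math
from collections import Counter

def l2_heavy_hitters_exact(stream, eps):
    """Offline reference for the definition: all items i with f_i > eps * ||f||_2.
    It stores the whole frequency vector, which streaming algorithms must avoid."""
    f = Counter(stream)
    l2 = math.sqrt(sum(c * c for c in f.values()))
    return {i for i, c in f.items() if c > eps * l2}

# n - 1 distinct items appearing once plus one item appearing sqrt(n) times:
# the frequent item is an l2 heavy hitter yet only about 1% of the stream.
n = 10_000
stream = list(range(1, n)) + [0] * int(math.sqrt(n))
print(l2_heavy_hitters_exact(stream, eps=0.5))   # {0}
```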

1.2.2 Streaming Algorithms for cosmological N-body simulations

Cosmological N-body simulations are essential for studies of the large-scale distribution of matter and galaxies in the Universe. This analysis often involves finding clusters of particles and retrieving their properties. Detecting such "halos" among a very large set of particles is a computationally intensive problem, usually executed on the same supercomputers that produced the simulations and requiring terabytes of memory. Working with the simulation output in a streaming setting can potentially make the problem of finding halos, and other postprocessing analytical tasks, feasible for smaller machines or even desktops. In Chapter 5 we present a novel connection between N-body simulations and streaming algorithms. In particular, we investigate a link between halo finders and the problem of finding heavy hitters in a data stream, which should greatly reduce the computational resource requirements, especially the memory needs. Based on this connection, we can build a new halo finder by running efficient heavy hitter algorithms as a black box. We implement two representatives of the family of heavy hitter algorithms, the Count-Sketch algorithm (CS) and Pick-and-Drop sampling (PD), and evaluate their accuracy and memory usage. Comparison with other halo-finding algorithms from [92] shows that our halo finder can locate the largest halos using significantly less memory and with comparable running time. This streaming approach makes it possible to run and analyze extremely large data sets from N-body simulations on a smaller machine, rather than on supercomputers. Our findings demonstrate the connection between the halo search problem and streaming algorithms, which we further investigate in order to scale from proof-of-concept experiments with relatively small datasets to larger state-of-the-art datasets. We present a robust streaming tool that leverages GPU boosting, sampling, and parallel I/O to significantly improve performance and scalability. Our rigorous analysis of the sketch parameters improves the initial results from finding the centers of the 10^3 largest halos to 10^4-10^5, and reveals the trade-offs between memory, running time, and the number of halos. Our experiments show that our tool can scale to datasets with up to 10^12 particles while using less than an hour of running time on a single Nvidia GTX 1080 GPU.

1.2.3 Monitoring the Network with Interval Queries

Modern network telemetry systems collect and analyze massive amounts of raw data in a space-efficient manner by taking advantage of streaming algorithms. Many statistics can be efficiently computed over a sliding window, providing crucial information about only the recent updates; however, many analytical tasks require more advanced capabilities, such as drill-down queries that allow iterative refinement of the search space, i.e. at any time moment t the algorithm is required to report statistics over a given time interval (t₁, t₂). Recently, a new model was introduced in [14] to accommodate this need; [14] also presented an efficient sketching algorithm for finding ℓ₁ heavy hitters in that model. We will refer to it as the interval query model. In Chapter 3, we present the first algorithm to find ℓ₂ heavy hitters in the interval query model. Using the technique of recursive sketching, we further generalize our result to a much wider class of functions of the frequency vector, including entropy estimation, count distinct, etc. We implement the algorithm and evaluate its performance on network switch datasets from CAIDA [137].

1.2.4 Streaming quantiles algorithms with small space and update time

Approximating quantiles and distributions over streaming data has been studied for roughly two decades now [2, 85, 104, 105, 143, 112]. The problem requires a data structure that takes as input a multiset S = {s_i}_{i=1}^n and, upon any query φ, returns the φn-th item of the sorted S. Recently, Karnin, Lang, and Liberty [85] proposed the first asymptotically optimal algorithm for doing so. Chapter 4 complements their theoretical result by providing improved variants of their algorithm. It improves the accuracy/space trade-off by provably decreasing the upper bound by almost a factor of two, which we also verify experimentally. Our techniques exponentially reduce the worst-case update time from O(1/ε) down to O(log(1/ε)). We also suggest two algorithms for a weighted stream of updates (a_i, w_i), with worst-case update times O(log²(1/ε)) and O(log(1/ε)) respectively, which is a significant improvement over the naive extensions that require O((max_i w_i) log(1/ε)) update time.
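For reference, the exact (non-streaming) version of the quantile query can be stated in a few lines; the sketches discussed in Chapter 4 answer the same query with rank error at most εn while storing far fewer than n items. The helper below is illustrative only and is not part of the algorithms in this thesis:

```python
def exact_quantile(items, phi):
    """Offline reference: the item of rank floor(phi * n) in the sorted multiset,
    for phi in (0, 1]."""
    s = sorted(items)
    rank = max(1, int(phi * len(s)))    # 1-based rank
    return s[rank - 1]

data = [7, 1, 5, 3, 9, 5, 2, 8, 6, 4]
print(exact_quantile(data, 0.5))        # 5, the rank-5 item of the sorted multiset
```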

Chapter 2

ℓ₂ heavy hitters algorithm with fewer words

This chapter is based on [31] and [32].

2.1 Introduction

As emphasized in Chapter 1, there are numerous applications of data streams, and the elements of the stream p_i may be numbers, points, edges in a graph, and so on. Examples include internet search logs, network traffic, sensor networks, and scientific data streams (such as in astronomy, genomics, physical simulations, etc.). The sheer size of the dataset often imposes very stringent requirements on an algorithm's resources. Moreover, in many cases only a single pass over the data is feasible, such as in network applications, since if the data on a network is not physically stored somewhere, it may be impossible to make a second pass over it. There are multiple surveys and tutorials in the algorithms, database, and networking communities on the recent activity in this area; we refer the reader to [113, 11] for more details and motivations underlying this area.

Within the study of streaming algorithms, the problem of finding frequent items is one of the most well-studied and core problems, with work on the problem beginning in 1981 [22, 23]. It has applications in flow identification at IP routers [53], iceberg queries [55], iceberg datacubes [20, 69], association rules, and frequent itemsets [4, 126, 142, 75, 68]. Aside from being an interesting problem in its own right, algorithms for finding frequent items are used as subroutines to solve many other streaming problems, such as moment estimation [80], entropy estimation [36, 72], ℓ_p-sampling [110], finding duplicates [62], and several others.

Formally, we are given a stream p_1, ..., p_m of items from a universe U, which, without loss of generality, we identify with the set {1, 2, ..., n}. We make the common assumption that log m = O(log n), though our results generalize naturally to any m and n. Let f_i denote the frequency, that is, the number of occurrences, of item i. We would like to find those items i for which f_i is large, i.e., the "heavy hitters". Stated simply, the goal is to report a list of items that appear at least τ times, for a given threshold τ. Naturally, the threshold τ should be chosen to depend on some measure of the size of the stream. The point of a frequent items algorithm is to highlight a small set of items that are frequency outliers. A choice of τ that is independent of f misses the point; it might be that all frequencies are larger than τ. With this in mind, previous work has parameterized τ in terms of different norms of f, with MisraGries [109] and CountSketch [39] being two of the most influential examples. A value ε > 0 is chosen, typically a small constant independent of n or m, and τ is set to be ε‖f‖₁ = εm or ε‖f‖₂.

These are called the ℓ₁ and ℓ₂ guarantees, respectively. Choosing the threshold τ in this manner immediately limits the focus to outliers, since no more than 1/ε items can have frequency larger than ε‖f‖₁ and no more than 1/ε² can have frequency ε‖f‖₂ or larger.

A moment's thought will lead one to conclude that the ℓ₂ guarantee is stronger, i.e. harder to achieve, than the ℓ₁ guarantee, because ‖x‖₁ ≥ ‖x‖₂ for all x ∈ ℝⁿ. Indeed, the ℓ₂ guarantee is much stronger. Consider a stream with all frequencies equal to 1 except one, which is equal to j. With ε = 1/3, achieving the ℓ₁ guarantee only requires finding an item with frequency j = n/2, which means that it occupies more than one-third of the positions in the stream, whereas achieving the ℓ₂ guarantee would require finding an item with frequency j = √n; such an item is a negligible fraction of the stream!

As we discuss further below, the algorithms achieving the ℓ₂ guarantee, like CountSketch [39], achieve essentially the best space-to-τ trade-off. But since the discovery of CountSketch, which uses O(ε⁻² log n) words of memory, it has been an open problem to determine the smallest space possible for achieving the ℓ₂ guarantee. Since the output is a list of up to ε⁻² integers in [n], Ω(ε⁻²) words of memory are necessary. Work on the heavy hitters problem began in 1981 with the MJRTY algorithm of [22, 23], an algorithm using only two machine words of memory that could identify an item whose frequency was strictly more than half the stream length. This result was generalized by the MisraGries algorithm in [109], which, for any 0 < ε ≤ 1/2, uses 2(⌈1/ε⌉ − 1) counters to identify every item that occurs strictly more than εm times in the stream. This data structure was rediscovered at least twice afterward [48, 86] and became also known as the Frequent algorithm. It has implementations that use O(1/ε) words of memory, O(1) expected update time to process a stream item (using hashing), and O(1/ε) query time to report all the frequent items. Similar space requirements and running times for finding ε-frequent items were later achieved by the SpaceSaving [107] and LossyCounting [103] algorithms. A later analysis of these algorithms in [19] showed that they not only identify the

heavy hitters, but also provide estimates of the frequencies of the heavy hitters. Specifically, when using O(k/ε) counters they provide, for each heavy hitter i ∈ [n], an estimate f̃_i of the frequency f_i such that |f̃_i − f_i| ≤ (ε/k) · ‖f_tail(k)‖₁ ≤ (ε/k)‖f‖₁. Here f_tail(k) is the vector f in which the largest k entries have been replaced by zeros (and thus the norm of f_tail(k) can never be larger than that of f). We call this the ((ε/k), k)-tail guarantee. A recent work [21] shows that for 0 < α < ε ≤ 1/2, all ε-heavy hitters can be found together with approximations f̃_i such that |f̃_i − f_i| ≤ α‖f‖₁, and the space complexity is O(α⁻¹ log(1/ε) + ε⁻¹ log n + log log ‖f‖₁) bits.

All of the algorithms in the previous paragraph work in one pass over the data in the insertion-only model, also known as the cash-register model [113], where deletions from the stream are not allowed. Subsequently, many algorithms have been discovered that work in more general models such as the strict turnstile and general turnstile models. In the turnstile model, the vector f ∈ ℝⁿ receives updates of the form (i, Δ), which triggers the change f_i ← f_i + Δ; note that we recover the insertion-only model by setting Δ = 1 for every update. The value Δ is assumed to be a bounded-precision integer fitting in a machine word, which can be either positive or negative. In the strict turnstile model we are given the promise that f_i ≥ 0 at all times in the stream. That is, items cannot be deleted if they were never inserted in the first place. In the general turnstile model no such restriction is promised (i.e. entries of f are allowed to be negative). This can be useful when tracking differences or changes across streams. For example, if f¹ is the query stream vector with f¹_i being the number of times word i was queried to a search engine yesterday, and f² is the similar vector corresponding to today, then finding heavy coordinates in the vector f = f¹ − f², which corresponds to a sequence of updates with Δ = +1 (from yesterday) followed by updates with Δ = −1 (from today), can be used to track changes in the queries over the past day.
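For reference, a minimal implementation of the MisraGries / Frequent algorithm described earlier in this section might look as follows; this is a standard textbook rendition rather than code from this thesis, and the parameter choice and example stream are illustrative:

```python
import math

def misra_gries(stream, eps):
    """MisraGries / Frequent: with k = ceil(1/eps) - 1 counters, every item that
    occurs strictly more than eps * m times survives in the summary, and each
    stored count underestimates the true frequency by at most m / (k + 1)."""
    k = max(1, math.ceil(1 / eps) - 1)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement all counters; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1] * 60 + [2] * 25 + list(range(3, 18))   # m = 100, item 1 is 0.6-heavy
print(misra_gries(stream, eps=0.25))                # item 1 is guaranteed to survive
```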


In the general turnstile model, an ε-heavy hitter in the ℓ_p norm is defined as an index i ∈ [n] such that |f_i| ≥ ε‖f‖_p, where ‖f‖_p is defined as (∑_{i=1}^n |f_i|^p)^{1/p}. The CountMin sketch treats the case p = 1 and uses O(ε⁻¹ log n) memory to find all ε-heavy hitters and achieve the (ε, 1/ε)-tail guarantee [44]. The CountSketch treats the case p = 2 and uses O(ε⁻² log n) memory, achieving the (ε, 1/ε²)-tail guarantee. It was later shown in [82] that the CountSketch actually solves ℓ_p-heavy hitters for all 0 < p ≤ 2 using O(ε^{-p} log n) memory and achieving the (ε, 1/ε^p)-tail guarantee. In fact, they showed something stronger: that any ℓ₂ heavy hitters algorithm with error parameter ε^{p/2} achieving the tail guarantee automatically solves the ℓ_p heavy hitters problem with error parameter ε for any p ∈ (0, 2]. In this sense, solving the heavy hitters problem for p = 2 with tail error, as CountSketch does, provides the strongest guarantee among all p ∈ (0, 2].
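A minimal CountSketch point-estimator in the spirit of [39] is sketched below. The hash functions are simulated with seeded pseudo-random generators rather than the pairwise/four-wise independent families assumed in the analysis, and all parameter values are illustrative:

```python
import random
import statistics

class CountSketch:
    """Minimal CountSketch: `rows` rows of `buckets` counters, one bucket hash and
    one sign hash per row; the frequency estimate is the median over rows of
    sign * counter. Seeded random.Random instances stand in for the limited-
    independence hash families of the analysis."""

    def __init__(self, rows=5, buckets=256, seed=0):
        self.rows, self.buckets, self.seed = rows, buckets, seed
        self.table = [[0] * buckets for _ in range(rows)]

    def _hashes(self, item, r):
        rng = random.Random(hash((self.seed, r, item)))
        return rng.randrange(self.buckets), rng.choice((-1, 1))

    def update(self, item, delta=1):
        for r in range(self.rows):
            b, s = self._hashes(item, r)
            self.table[r][b] += s * delta

    def estimate(self, item):
        vals = []
        for r in range(self.rows):
            b, s = self._hashes(item, r)
            vals.append(s * self.table[r][b])
        return statistics.median(vals)

cs = CountSketch()
for x in [0] * 1000 + list(range(1, 5000)):   # one heavy item among many light ones
    cs.update(x)
print(cs.estimate(0))   # close to 1000; error scales like ||f_tail||_2 / sqrt(buckets)
```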

Identifying ℓ₂ heavy hitters is optimal in another sense, too. When p > 2, Hölder's Inequality gives ε‖f‖_p ≥ (ε/n^{1/2−1/p})‖f‖₂. Hence, one can use an ℓ₂ heavy hitters algorithm to identify items with frequency at least ε‖f‖_p, for p > 2, by setting the heaviness parameter of the ℓ₂ algorithm to ε/n^{1/2−1/p}. The space needed to find ℓ_p heavy hitters with a CountSketch is therefore O(ε⁻² n^{1−2/p} log n), which is known to be optimal [97]. We conclude that the ℓ₂ guarantee leads to the best space-to-frequency-threshold ratio among all p > 0. It is worth pointing out that both the CountMin sketch and the CountSketch are randomized algorithms, and with small probability 1/n^c (for a user-specified constant c > 0) they can fail to achieve their stated guarantees. The work [82] also showed that the CountSketch algorithm is optimal: they showed that any algorithm, even in the strict turnstile model,

solving ℓ_p heavy hitters even with 1/3 failure probability must use Ω(ε^{-p} log n) memory. Of note is that the MisraGries and other algorithms in the insertion-only model solve ℓ₁ heavy hitters using (optimal) O(1/ε) memory, whereas the CountMin and CountSketch algorithms use a larger Θ(ε⁻¹ log n) memory in the strict turnstile model, which is optimal in that model. Thus there is a gap of log n between the space complexities of ℓ₁ heavy hitters in the insertion-only and strict turnstile models.

In Section 2.2 we present an algorithm that solves the ℓ₂ heavy hitters problem in the insertion-only model using only O((1/ε²) log(1/ε) log log n) words of memory; later, in Section 2.3, we further improve the memory usage and present an algorithm using only O((1/ε²) log(1/ε)) words.

2.2 Beating CountSketch for Heavy Hitters in Insertion Streams

This section is based on [31], work done in collaboration with Braverman V., Chestnut S. and Woodruff D.

2.2.1 Introduction

Our contribution. The main result of this section is the near resolution of the open question above.

Theorem 1 (ℓ₂-Heavy Hitters). For any ε > 0, there is a 1-pass algorithm in the insertion-only model that, with probability at least 2/3, finds all those indices i ∈ [n] for which f_i ≥ ε√F₂, and reports no indices i ∈ [n] for which f_i ≤ (ε/2)√F₂. The space complexity is O((1/ε²) log(1/ε) log n log log n) bits.

The intuition of the proof is as follows. Suppose there is a single ℓ₂-heavy hitter H, ε > 0 is a constant, and we are trying to find the identity of H. Suppose further we could identify a substream S′ where H is very heavy; specifically, we want the frequencies in the substream to satisfy f_H²/poly(log n) ≥ ∑_{j∈S′, j≠H} f_j². Suppose also that we could find certain R = O(log n) "breakpoints" in the stream corresponding to jumps in the value of f_H, that is, we knew a sequence p_{q₁} < p_{q₂} < ··· < p_{q_R} which corresponds to positions in the stream at which f_H increases by a multiplicative factor of (1 + 1/Θ(R)).

Given all of these assumptions, in between breakpoints we can partition the universe randomly into two pieces and run an F₂-estimate [6] (AMS sketch) on each piece. Since f_H² is more than a poly(log n) factor times ∑_{j∈S′, j≠H} f_j², while in between each pair of consecutive breakpoints the squared frequency of H is Ω(f_H²/log n), it follows that H contributes a constant fraction of the F₂-value in between consecutive breakpoints, and so, upon choosing the constants appropriately, the larger in magnitude of the two AMS sketches will identify a bit of information about H, with probability say 90%. This is our algorithm Sieve. Since we have Θ(log n) breakpoints, in total we will learn all log n bits of information needed to identify H. One view of this algorithm is that it is a sequential implementation of the multiple repetitions of CountSketch, namely, we split the stream at the breakpoints and perform one "repetition" on each piece while discarding all but the single bit of information we learn about H in between breakpoints.

However, it is not at all clear how to (1) identify S′ and (2) find the breakpoints. For this, we resort to the theory of Gaussian and Bernoulli processes. Throughout the stream we can maintain a sum of the form X_t = ∑_{i=1}^n f_i^{(t)} Z_i, where the Z_i are independent Normal(0, 1) or Rademacher random variables. Either distribution is workable. One might think that, as one walks through a stream of length poly(n), there will be times for which this sum is much larger than √F₂; indeed, the latter is the standard deviation, and a naïve union bound, if tight, would imply positions in the stream for which |X_t| is as large as √(F₂ log n). It turns out that this cannot happen! Using a generic chaining bound developed by Fernique and Talagrand [135], we can prove that there exists a universal constant

14 Chapter 2. `2 heavy hitters algorithm with fewer words C0 such that E 0p sup |Xt| ≤ C F2. t

We call this the Chaining Inequality.

We now randomly partition the universe into O(1/ε²) "parts" and run our algorithm independently on each part. This ensures that, for a large constant C, H is C-heavy, meaning f_H² ≥ C(F₂ − f_H²), where here we abuse notation and use F₂ to denote the moment of the part containing H. We run the following two-stage algorithm independently on each part. The first stage, called Amplifier, consists of L = O(log log n) independent and concurrent repetitions of the following: randomly split the set of items into two buckets and maintain two Bernoulli processes, one for the updates in each bucket. By the Chaining Inequality, a Markov bound, and a union bound, the total F₂ contribution, excluding that of H, in each piece in each repetition at all times in the stream will be O(√(F₂ − f_H²)). Since H is sufficiently heavy, this means that after some time t*, its piece will be larger in magnitude in most, say 90%, of the L repetitions. Furthermore, H will be among only n/2^{Ω(L)} = n/poly(log n) items with this property. At this point, we can restrict our attention to a substream containing only those items. The substream has the property that its F₂ value, not counting H, will be a factor 1/log² n times the F₂ value of the original stream, making H Ω(log² n)-heavy. Finally, to find the breakpoints, our algorithm Timer maintains a Bernoulli process on the substream, and every time the Bernoulli sum increases by a multiplicative (1 + 1/Θ(R)) factor, it creates a new breakpoint. By the Chaining Inequality applied in each consecutive interval of breakpoints, the F₂ of all items other than H in the interval is at most O(log n) larger than its expectation, while the squared frequency of H on the interval is at least f_H²/log n. Since H is Ω(log² n)-heavy, this makes f_H² the dominant fraction of F₂ on the interval.

One issue with the techniques above is that they assume a large number of random bits can be stored. A standard way of derandomizing this, due to Indyk [78] and based on Nisan's pseudorandom generator (PRG) [116], would increase the space complexity by a log n factor, which is exactly what we are trying to avoid. Besides, it is not clear we can even apply Indyk's method, since our algorithm decides at certain points in the stream to create new F₂-sketches based on the past, whereas Indyk's derandomization relies on maintaining a certain type of linear sum in the stream, so that reordering of the stream does not change the output distribution. A first observation is that the only places we need more than limited independence are in maintaining a collection of O(log n) hash functions and the stochastic process ∑_{i=1}^n f_i Z_i throughout the stream. The former can, in fact, be derandomized along the lines of Indyk's method [78].

In order to reduce the randomness needed for the stochastic process, we use a Johnson-Lindenstrauss transformation to reduce the number of Rademacher (or Gaussian) random variables needed. The idea is to reduce the frequency vector to O(log n) dimensions with JL and run the Bernoulli process in this smaller-dimensional space. The Bernoulli process becomes ∑_{i=1}^{O(log n)} Z_i (T f)_i, where T is the JL matrix. The same technique is used by Meka for approximating the supremum of a Gaussian process [106]. It works because the Euclidean length of the frequency vector describes the variance and covariances of the process, hence the transformed process has roughly the same covariance structure as the original process. An alternative perspective on this approach is that we use the JL transformation in reverse, as a pseudorandom generator that expands O(log n) random bits into O(n) random variables which fool our algorithm using the Bernoulli process. In Section 2.2.5 we also use our techniques to prove the following.

Theorem 2 (F₂ at all points). For any ε > 0, there is a 1-pass algorithm in the insertion-only model that, with probability at least 2/3, outputs a (1 ± ε)-approximation of F₂ at all points in the stream. It uses O((1/ε²) log n (log(1/ε) + log log n)) bits of space.

L : amplifier size, O(log log n)
τ : round expansion, 100(R + 1)
δ : small constant, Ω(1)
S^{t₁:t₂} : interval of the stream, (p_{t₁+1}, ..., p_{t₂})
H : heavy hitter id, ∈ [n]
e_j : jth unit vector
T : JL transformation, ∈ ℝ^{k×n}
f_H^{(k)} : frequency on S^{0:k}
m : stream length, poly(n)
f^{(k₁:k₂)} : frequency on S^{k₁:k₂}, f^{(k₂)} − f^{(k₁)}
n : domain size
R : # of Sieve rounds, O(log n)
k : JL dimension, O(log n)
C′ : Chaining Ineq. const., O(1)
d : dim. of Bern. proc., O(log δ⁻¹)
C : large const., ≥ d^{3/2} C′/δ

TABLE 2.1: Notation and parameters used throughout the current section.

Preliminaries. Given a stream S = (p₁, p₂, ..., p_m), with p_i ∈ [n] for all i, we define the frequency vector at time 0 ≤ t ≤ m to be the vector f^{(t)} with coordinates f_j^{(t)} := #{t′ ≤ t | p_{t′} = j}. When t = m we simply write f := f^{(m)}. Given two times t₁ ≤ t₂ we use f^{(t₁:t₂)} for the vector f^{(t₂)} − f^{(t₁)}. Notice that all of these vectors are nonnegative because S has no deletions. An item H ∈ [n] is said to be an α-heavy hitter, for α > 0, if f_H² ≥ α ∑_{j≠H} f_j². The goal of our main algorithm, CountSieve, is to identify a single α-heavy hitter for α a large constant. We will assume log m = O(log n), although our methods apply even when this is not true. It will occasionally be helpful to assume that n is sufficiently large. This is without loss of generality since in the case n = O(1) the problem can be solved exactly in O(log m) bits.

A streaming algorithm is allowed to read one item at a time from the stream in the order given. The algorithm is also given access to a stream of random bits; it must pay to store any bits that it accesses more than once, and it is only required to be correct with constant probability strictly greater than 1/2. Note that by repeating such an algorithm k times and taking a majority vote, one can improve the success probability to 1 − 2^{−Ω(k)}. We measure the storage used by the algorithm on the worst-case stream, i.e. worst-case item frequencies and order, with the worst-case outcome of its random bits.

The AMS sketch [6] is a linear sketch for estimating F₂. The sketch contains O(ε⁻² log δ⁻¹) independent sums of the form ∑_{j=1}^n S_j f_j, where S₁, S₂, ..., S_n are four-wise independent Rademacher random variables. By averaging and taking medians it achieves a (1 ± ε)-approximation to F₂ with probability at least 1 − δ.
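The toy implementation below shows the averaging-and-medians structure of the AMS estimator; fully random signs stand in for the four-wise independent Rademacher variables of the analysis, and the repetition counts are illustrative rather than the O(ε⁻² log δ⁻¹) of the guarantee:

```python
import random
import statistics

def ams_f2(stream, reps=9, copies=32, seed=0):
    """Toy AMS F2 estimator: each copy keeps one sum X = sum_j S_j * f_j with a
    fixed random sign S_j per item; X^2 is an unbiased estimate of F2.
    Averages of `copies` sums are taken inside each of `reps` groups and the
    median of the group averages is returned."""
    sums = [[0] * copies for _ in range(reps)]
    for x in stream:
        for r in range(reps):
            for c in range(copies):
                sign = random.Random(hash((seed, r, c, x))).choice((-1, 1))
                sums[r][c] += sign
    averages = [sum(s * s for s in row) / copies for row in sums]
    return statistics.median(averages)

stream = [0] * 100 + list(range(1, 401))   # F2 = 100^2 + 400 = 10400
print(ams_f2(stream))                      # typically within roughly 10% of 10400
```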

A Gaussian process is a stochastic process (X_t)_{t∈T} such that every finite subcollection (X_t)_{t∈T′}, for T′ ⊆ T, has a multivariate Gaussian distribution. When T is finite (as in this section), every Gaussian process can be expressed as a linear transformation of a multivariate Gaussian vector with mean 0 and covariance I. Similarly, a Bernoulli process (X_t)_{t∈T}, T finite, is a stochastic process defined as a linear transformation of a vector of i.i.d. Rademacher (i.e. uniform ±1) random variables. Underpinning our results is an analysis of the Gaussian process

X_t = ∑_{j∈[n]} Z_j f_j^{(t)},  for t = 0, ..., m,

where Z₁, ..., Z_n ~ N(0, 1) are independent standard Normal random variables. The Bernoulli analogue of our Gaussian process replaces the distribution of the random vector Z by Z₁, ..., Z_n ~ i.i.d. Rademacher. Properties of the Normal distribution make it easier for us to analyze the Gaussian process rather than its Bernoulli cousin. On the other hand, we find Bernoulli processes more desirable for computational tasks. Existing tools, which we discuss further in Section 2.2.3 and Section 2.2.4, allow us to transfer the needed

properties of a Gaussian process to its Bernoulli analogue.

A k × n matrix T is a (1 ± γ)-embedding of a set of vectors X ⊆ ℝⁿ if

(1 − γ)‖x − y‖₂ ≤ ‖Tx − Ty‖₂ ≤ (1 + γ)‖x − y‖₂,

for all x, y ∈ X ∪ {0}. We also call such a linear transformation a JL Transformation. It is well known that taking the entries of the matrix T to be i.i.d. Normal random variables with mean 0 and variance 1/k produces a JL transformation with high probability. Many other randomized and deterministic constructions exist; we will use the recent construction of Kane, Meka, and Nelson [84]. The development and analysis of our algorithm rely on several parameters, some of which have already been introduced. Table 2.1 lists these along with the rest of the parameters and some other notation for reference. In particular, the values C, d, δ, and γ are constants that we will choose in order to satisfy several inequalities. We will choose δ and γ to be small, say 1/200, and d = O(log 1/δ). C and C′ are sufficiently large constants, in particular C ≥ dC′/δ.
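As a quick numerical illustration of the embedding property, using the classical randomized Gaussian construction mentioned above (not the derandomized generator of [84] that our algorithm actually uses, and with illustrative sizes n and k), one can check that a k × n matrix with i.i.d. N(0, 1/k) entries approximately preserves the norm of a fixed vector:

```python
import math
import random

def gaussian_jl_matrix(k, n, seed=0):
    """k x n matrix with i.i.d. N(0, 1/k) entries, the classical randomized JL
    construction."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(n)] for _ in range(k)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

n, k = 2000, 200
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
T = gaussian_jl_matrix(k, n)
Tx = [sum(row[i] * x[i] for i in range(n)) for row in T]
print(norm(x), norm(Tx))   # the two norms typically agree within a few percent
```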

2.2.2 ℓ₂ heavy hitters algorithm

This section describes the algorithm CountSieve, which solves the heavy hitter problem for the case of a single heavy hitter, i.e. top-1, in O(log n log log n) bits of space, and proves Theorem 1. By definition, the number of ε-heavy hitters is at most 1 + 1/ε², so, upon hashing the universe into O(1/ε²) parts, the problem of finding all ε-heavy hitters reduces to finding a single heavy hitter in each part. Collisions can easily be handled by repeating the algorithm O(log 1/ε) times. When ε = Ω(1), using this reduction incurs only a constant factor increase in space over the single heavy hitter problem.

Suppose the stream has only a single heavy hitter H ∈ [n]. Sequentially, over the course of reading the stream, CountSieve will hash the stream into two separate substreams for O(log n) repetitions, and in each repetition it will try to determine which of the two substreams has the heavy hitter using the AMS sketch. With high probability, H has a unique sequence of hashes, so if we correctly identify the stream containing H every time then we can correctly identify H. This holds even if we only correctly identify the stream containing H in a large constant fraction of the repetitions. CountSketch accomplishes this by performing the O(log n) rounds of hashing in parallel, with Ω(log² n) bits of storage. One of our innovations is to implement this scheme sequentially by specifying intervals of updates, which we call rounds, during each of which we run the two AMS sketches. In total there could be as many as Θ(log² n) of these rounds, but we will discard all except the last R = O(log n) of them.

Algorithm 1 One Bernoulli process.
procedure BP(Stream S)
    Sample Z₁, ..., Z_n ~ i.i.d. Rademacher
    return ⟨Z, f^{(t)}⟩ at each time t
end procedure

Algorithm 1 is a simplified version of the Bernoulli process used by CountSieve. It has all of the properties we need for correctness of the algorithm, but it requires too many random bits. Chief among these properties is the control on the supremum of the process. The Chaining Inequality gives us a uniform bound on the maximum value of the BP process in terms of the standard deviation of the last value. This property is formalized by the definition of a tame process.

Definition 3. Let f^{(t)} ∈ ℝⁿ, for t ∈ [m], and let T : ℝⁿ → ℝ^k be a matrix. Let Z be a d × k matrix of i.i.d. Rademacher random variables. A d-dimensional Bernoulli process y_t = d^{−1/2} Z T f^{(t)}, for t ∈ [m], is tame if, with probability at least 1 − δ,

‖y_t‖₂ ≤ C √(∑_{j=1}^n f_j²),  for all t ∈ [m].  (2.1)

The definition anticipates our need for dimension reduction in order to reduce the number of random bits needed for the algorithm. Our first use for it is for BP, which is very simple with d = 1 and T the identity matrix. BP requires n random bits, which is too many for a practical streaming algorithm. JLBP, Algorithm 2, exists to fix this problem. Still, if one is willing to disregard the storage needed for the random bits, BP can be substituted everywhere for JLBP without affecting the correctness of our algorithms because our proofs only require that the processes are tame, and BP produces a tame process, as we will now show. We have a similar lemma for JLBP.

Lemma 4 (BP Correctness). Let f^{(t)}, for t ∈ [m], be the frequency vectors of an insertion-only stream. The sequence Z f^{(t)} returned by the algorithm BP is a tame Bernoulli process.

Proof. By the Chaining Inequality, Theorem 14, there exists C′ such that E sup_t |X_t| ≤ C′ (∑_j f_j²)^{1/2}. Let F be the event that condition (2.1) holds. Then, for C ≥ C′/δ, Markov's Inequality implies that

Pr(F) = Pr( sup_t |X_t| ≤ C √(∑_j f_j²) ) ≥ 1 − δ.

In order to reduce the number of random bits needed for the algorithms, we first apply a JL transformation T to the frequency vector. The intuition for this comes from the covariance structure of the Bernoulli process, which is what governs the behavior of the process and is fundamental for the Chaining Inequality. The variance of an increment

of the Bernoulli process between times s and t > s is ‖f^{(s:t)}‖₂². The JL property of the matrix T guarantees that this value is well approximated by ‖T f^{(s:t)}‖₂², which is the increment variance of the reduced-dimension process. Slepian's Lemma (Lemma 15) is a fundamental tool in the theory of Gaussian processes that allows us to draw a comparison between the suprema of the processes by comparing the increment variances instead. Thus, for Z₁, ..., Z_n ~ i.i.d. Rademacher, the expected supremum of the process X_t = ∑_{i=1}^n Z_i f_i^{(t)} is closely approximated by that of X′_t = ∑_{i=1}^k Z_i (T f^{(t)})_i, and the latter uses only k = O(log n) random bits. The following lemma formalizes this discussion; its proof is given in Section 2.2.4.

Lemma 5 (JLBP Correctness). Suppose the matrix T used by JLBP is a (1 ± γ)-embedding of (f^{(t)})_{t∈[m]}. For any d ≥ 1, the sequence (1/√d) Z T f^{(t)} returned by JLBP is a tame d-dimensional Bernoulli process. Furthermore, there exists d₀ = O(log δ⁻¹) such that for any d ≥ d₀ and H ∈ [n] it holds that Pr(1/2 ≤ ‖d^{−1/2} Z T e_H‖ ≤ 3/2) ≥ 1 − δ.

Algorithm 2 A Bernoulli process with fewer random bits.
procedure JLBP(Stream S)
    Let T be a JL Transformation            ▷ The same T will suffice for all instances
    Sample Z ∈ {−1, 1}^{d×k}, s.t. Z_{i,j} ~ i.i.d. Rademacher
    return (1/√d) Z T f^{(t)} at each time t
end procedure
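A runnable sketch of JLBP under simplifying assumptions is given below: T is a classical Gaussian JL matrix standing in for the construction of [84], the values of k and d are illustrative, and the code maintains T f^{(t)} incrementally, outputting ‖y_t‖₂ = ‖d^{−1/2} Z T f^{(t)}‖₂ at every step. For a tame process this quantity stays within a constant factor of √F₂ at all times.

```python
import math
import random

def jlbp(stream, n, k=64, d=8, seed=0):
    """Sketch of JLBP: T is a Gaussian JL matrix (illustrative stand-in), Z is a
    d x k Rademacher matrix, and the output at each time t is
    y_t = (1/sqrt(d)) * Z * T * f^(t), maintained incrementally."""
    rng = random.Random(seed)
    T = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(n)] for _ in range(k)]
    Z = [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(d)]
    tf = [0.0] * k                        # running value of T f^(t)
    norms = []
    for s in stream:                      # update s increments coordinate s of f
        for i in range(k):
            tf[i] += T[i][s]
        y = [sum(Z[r][i] * tf[i] for i in range(k)) / math.sqrt(d) for r in range(d)]
        norms.append(math.sqrt(sum(v * v for v in y)))
    return norms

stream = [3] * 200 + [i % 50 for i in range(500)]
norms = jlbp(stream, n=50)
f2 = 210 ** 2 + 49 * 10 ** 2              # exact F2 of the final frequency vector
print(max(norms), math.sqrt(f2))          # for a tame process max ||y_t|| = O(sqrt(F2))
```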

Now that we have established the tameness of our Bernoulli processes, let us explain how we can exploit it. We typically exploit tameness in two ways: one works by splitting the stream according to the items, and the second splits the stream temporally. Given a stream and a tame Bernoulli process on that stream, every substream defines another Bernoulli process, and the substream processes are tame as well. One way to use this is for heavy hitters. If there is a heavy hitter H, then the substream consisting of all updates except those to the heavy hitter produces a tame process whose maximum is bounded by C(F₂ − f_H²)^{1/2}, so the value of the process in BP is Z_H f_H ± C(F₂ − f_H²)^{1/2}. When H is sufficiently heavy, this means that the absolute value of the output of BP tracks the value of f_H; for example, if H is a 4C²-heavy hitter then the absolute value of BP's output is always a (1 ± 1/2)-approximation to f_H. Another way we exploit tameness is for approximating F₂ at all points. We select a sequence of times t₁ < t₂ < ··· < t_j ∈ [m] and consider the prefixes of the stream that end at times t₁, t₂, etc. For each t_i, the prefix stream ending at time t_i is tame with the upper bound depending on the stream's F₂ value at time t_i. If the times t_i are chosen in close enough succession, this observation allows us to transform the uniform additive approximation guarantee into a uniform multiplicative approximation.

Description of CountSieve. CountSieve primarily works in two stages that operate con- currently. Each stage uses independent pairs of Bernoulli processes to determine bits of the identity of the heavy hitter. The first stage is the Amplifier, which maintains L = O(log log n) independent pairs of Bernoulli processes. The second stage is the Timer and Sieve. It consists of a series of rounds where one pair of AMS sketches is maintained during each round. CountSieve and its subroutines are described formally in Algorithm 4. The random variables they use are listed in Table 2.2. Even though we reduce the number of random bits needed for each Bernoulli process to a manageable O(log n) bits, the storage space for the random values is still an issue because the algorithm maintains O(log n) independent hash functions until the end of the stream. We explain how to overcome this barrier in Section 2.2.4 as well as show that the JL generator of [84] suffices.

$A_{\ell,1}, \dots, A_{\ell,n} \sim$ 4-wise independent Bernoulli    $Z_1, \dots, Z_k \sim$ i.i.d. Rademacher
$B_{r,1}, \dots, B_{r,n} \sim$ 4-wise independent Bernoulli    $R_{r,1}, \dots, R_{r,n} \sim$ 4-wise independent Rademacher

TABLE 2.2: Random vectors for CountSieve. Each vector is independent of the others, and Z = (Zi)i∈[k] is sampled independently for every instance of JLBP.

We can now state an algorithm that maintains a pair of Bernoulli processes and prove that the bits that it outputs favor the process in the pair with the heavy hitter.

Algorithm 3 Split the vector f into two parts depending on A and run a Bernoulli process on each part. Return the identity of the larger estimate at each time.

procedure PAIR(Stream S, A_1, ..., A_n ∈ {0, 1})
    For b ∈ {0, 1}, let S_b be the restriction of S to {j ∈ [n] | A_j = b}
    X_0^{(t)} ← JLBP(S_0) at each time t
    X_1^{(t)} ← JLBP(S_1) at each time t
    b_t ← argmax_{b∈{0,1}} ‖X_b^{(t)}‖_2
    return b_1, b_2, ...
end procedure
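A possible sketch of PAIR, built on the JLBP sketch above: split the items by the bits $A$, run one process on each half, and report which half has the larger norm. The bits $A$ and the parameters are assumptions for illustration.

```python
import math

def pair(stream, A, n, k=32, d=4):
    """Illustrative sketch of PAIR (Algorithm 3). Assumes the JLBP sketch above;
    A maps each item to a bit in {0, 1}."""
    procs = {0: JLBP(n, k, d, seed=1), 1: JLBP(n, k, d, seed=2)}
    state = {0: [0.0] * d, 1: [0.0] * d}
    bits = []
    for item in stream:
        b = A[item]                       # which half of the split `item` is in
        state[b] = procs[b].update(item)  # advance that half's Bernoulli process
        norms = {c: math.sqrt(sum(x * x for x in state[c])) for c in (0, 1)}
        bits.append(0 if norms[0] > norms[1] else 1)
    return bits
```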

Lemma 6 (Pair Correctness). Let $t_0 \in [m]$ be an index such that
$$(f_H^{(t_0)})^2 > 4C^2 \sum_{j \neq H} f_j^2.$$
Let $A_1, \dots, A_n$ be pairwise independent Bernoulli random variables and let $b_1, b_2, \dots, b_m$ be the sequence returned by Pair$(f, A_1, \dots, A_n)$. Then
$$\Pr(b_t = A_H \text{ for all } t \ge t_0) \ge 1 - 3\delta$$
and, for every $j \in [n] \setminus \{H\}$ and $t \ge t_0$,
$$\Pr(b_t = A_j) \le \tfrac{1}{2} + 3\delta.$$
Furthermore, if each JLBP is replaced by an AMS sketch of size $O(\log n \log \delta^{-1})$ then, for all $t \ge t_0$ and $j \neq H$, $\Pr(b_t = A_H) \ge 1 - 2\delta$ and $\Pr(b_t = A_j) \le \tfrac{1}{2} + 3\delta$.

Proof. Let $X_0^{(t)} = d^{-1/2} Z T f^{(t)}$ and $X_1^{(t)} = d^{-1/2} W T f^{(t)}$ be the two independent Bernoulli processes output by JLBP. Without loss of generality, suppose that $A_H = 1$, let $v = d^{-1/2} W T e_H$, and let $Y^{(t)} = X_1^{(t)} - f_H^{(t)} v$. By Lemma 5, with probability at least $1 - 2\delta$ all three of the following hold:

1. $\|X_0^{(t)}\|_2^2 \le C^2 \sum_{j : A_j = 0} f_j^2$, for all $t$,

2. $\|Y^{(t)}\|_2^2 \le C^2 \sum_{j \neq H,\, A_j = 1} f_j^2$, for all $t$, and

3. $\|v\|_2 \ge 1/2$.

If the three events above hold then, for all $t \ge t_0$,
$$\|X_1^{(t)}\|_2 - \|X_0^{(t)}\|_2 \ge \|v f_H^{(t)}\|_2 - \|Y^{(t)}\|_2 - \|X_0^{(t)}\|_2 \ge \frac{1}{2} f_H^{(t)} - C \sqrt{\sum_{j \neq H} f_j^2} > 0,$$
which establishes the first claim. The second claim follows from the first using
$$\Pr(b_t = A_j) = \Pr(b_t = A_j = A_H) + \Pr(b_t = A_j \neq A_H) \le \Pr(A_j = A_H) + \Pr(b_t \neq A_H) = \frac{1}{2} + 3\delta.$$

The third and fourth inequalities follow from the correctness of the AMS sketch [6].

Amplifier Correctness. The $L = O(\log\log n)$ instances of Pair maintained by Amplifier in the first stage of CountSieve serve to identify a substream containing roughly $n 2^{-L} = n/\mathrm{polylog}(n)$ elements in which $H$ appears as a $\mathrm{polylog}(n)$-heavy hitter. Correctness of Amplifier means that, after some "burn-in" period, which we allow to include the first $f_H/2$ updates to $H$, all of the subsequent updates to $H$ appear in the amplified substream while the majority of other items do not. This is Lemma 7.

Lemma 7 (Amplifier Correctness). Let $t_0 \in [m]$ be such that $(f_H^{(t_0)})^2 \ge 4C^2 \sum_{j \neq H} f_j^2$; let $a_t = (a_{1,t}, \dots, a_{L,t})$ denote the length-$L$ bit-vector output by the Amplifier at step $t$. Let $M_{j,t} = \#\{\ell \in [L] \mid a_{\ell,t} = A_{\ell,j}\}$ and $W = \{j \in [n] \setminus \{H\} \mid \exists t \ge t_0,\ M_{j,t} \ge 0.9L\}$. Then, with probability at least $1 - 2\delta$, both of the following hold:

1. for all $t \ge t_0$ simultaneously, $M_{H,t} \ge 0.9L$, and

2. $\sum_{j \in W} f_j^2 \le \exp(-\frac{L}{25}) \sum_{j \neq H} f_j^2$.

Proof. Let $N = \#\{\ell \mid \text{for all } t \ge t_0,\ a_{\ell,t} = A_{\ell,H}\}$. If $N \ge 0.9L$ then item 1 holds. Lemma 6 implies $\mathbb{E} N \ge (1 - 3\delta)L \ge 0.97L$, so Chernoff's Bound easily implies $\Pr(N < 0.9L) = O(2^{-L}) \le \delta$, where $\delta$ is a constant.

Now, let $j \neq H$ be a member of $W$ and suppose that $M_{H,t} \ge 0.9L$. Let $t \ge t_0$ be such that $M_{j,t} \ge 0.9L$. Then it must be that
$$M_j' := \#\{\ell \in [L] \mid A_{\ell,j} = A_{\ell,H}\} \ge 0.8L.$$
However, $\mathbb{E} M_j' = \frac{1}{2} L$ by pairwise independence. Let $E_j$ be the event $\{j \in W \text{ and } M_{H,t} \ge 0.9L\}$. Since the $L$ instances of Pair are independent, an application of Chernoff's Inequality proves that
$$\Pr(E_j) \le \Pr(M_j' \ge 0.8L) \le \exp\Big\{-\frac{0.6^2 L}{6}\Big\} \le e^{-L/20}.$$
We have
$$\mathbb{E}\Big(\sum_{j \in W} f_j^2\Big) = \mathbb{E}\Big(\sum_{j \neq H} 1_{E_j} f_j^2\Big) \le e^{-L/20} \sum_{j \neq H} f_j^2.$$
Therefore Markov's Inequality yields
$$\Pr\Big(\sum_{j \in W} f_j^2 \ge e^{-L/25} \sum_{j \neq H} f_j^2\Big) \le e^{-L/100} \le \delta.$$

The lemma follows by a union bound.

Timer and Sieve Correctness. The timing of the rounds in the second stage of CountSieve is determined by the algorithm Timer. Timer outputs a set of times $q_0, q_1, \dots, q_R$ that break the stream into intervals so that each interval has roughly a $1/\log n$ fraction of the occurrences of $H$ and not too many other items. Precisely, we want that $H$ is everywhere heavy for $q$, as stated in the following definition. When this holds, in every round the Pair is likely to identify one bit of $H$, and Sieve and Selector will be likely to correctly identify $H$ from these bits.

Definition 8. Given an item $H \in [n]$ and a sequence of times $q_0 < q_1 < \cdots < q_R$ in a stream with frequency vectors $(f^{(t)})_{t \in [m]}$, we say that $H$ is everywhere heavy for $q$ if, for all $1 \le r \le R$,
$$(f_H^{(q_{r-1}:q_r)})^2 \ge C^2 \sum_{j \neq H} (f_j^{(q_{r-1}:q_r)})^2.$$

Correctness for Timer means that enough rounds are completed and H is sufficiently heavy within each round, i.e., H is everywhere heavy for q.
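For intuition, here is a minimal sketch of the round-boundary rule used by Timer (its pseudocode appears in Algorithm 4 below): emit a boundary whenever the norm of a JLBP process crosses the next power of $(1 + 1/\tau)$, and keep the last $R+1$ boundaries. It reuses the illustrative JLBP sketch from earlier; the parameters are assumptions.

```python
import math

def timer(stream, n, R, tau, k=32, d=4):
    """Illustrative sketch of TIMER: watch ||Y_t||_2 for a JLBP process and emit
    a boundary each time it crosses the next power of (1 + 1/tau)."""
    proc = JLBP(n, k, d, seed=3)
    boundaries = [0]
    r = 1
    for t, item in enumerate(stream, start=1):
        y = proc.update(item)
        norm = math.sqrt(sum(v * v for v in y))
        while norm > (1.0 + 1.0 / tau) ** r:   # crossed the next threshold
            boundaries.append(t)
            r += 1
    return boundaries[-(R + 1):]               # the last R+1 boundaries
```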

Lemma 9 (Timer Correctness). Let S be a stream with an item H ∈ [n] such that the following hold:


Algorithm 4 Algorithm for a single F2 heavy hitter.
procedure COUNTSIEVE(Stream S = (p_1, p_2, ..., p_m))
    Maintain a_t = (a_{1,t}, a_{2,t}, ..., a_{L,t}) ← AMPLIFIER(S)
    Let t_1 < t_2 < ··· be the elements of {t ∈ [m] | A_{ℓ,p_t} = a_{ℓ,t} for at least 0.9L values of ℓ}
    Let S' = (p_{t_1}, p_{t_2}, ...)
    q_0, q_1, ..., q_R ← TIMER(S')
    b_1, b_2, ..., b_R ← SIEVE(S', q_0, ..., q_R)
    return SELECTOR(b_1, b_2, ..., b_R) based on S'
end procedure
procedure AMPLIFIER(Stream S)    ▷ Find a substream where H is polylog(n)-heavy
    for ℓ = 1, 2, ..., L do
        a_{ℓ,1}, a_{ℓ,2}, ..., a_{ℓ,m} ← PAIR(S, A_{ℓ,1}, ..., A_{ℓ,n})
    end for
    return a_{1,t}, ..., a_{L,t} at each time t
end procedure
procedure TIMER(Stream S)    ▷ Break the substream into rounds so H is heavy in each
    q'_0 = 0
    Y_t ← JLBP(S), for t = 1, 2, ..., over S
    For each r ≥ 1, find q'_r = min{t | ‖Y_t‖_2 > (1 + 1/τ)^r}
    Let q_0, q_1, ..., q_R be the last R + 1 of q'_0, q'_1, ...
    return q_0, q_1, ..., q_R
end procedure

procedure SIEVE(Stream S, q_0, ..., q_R)    ▷ Identify one bit of information from each round
    for r = 0, 1, ..., R − 1 do
        b_{q_r+1}, ..., b_{q_{r+1}} ← PAIR(S^{(q_r:q_{r+1})}, B_{r,1}, ..., B_{r,n})    ▷ Replace JLBP here with AMS
    end for
    return b_{q_1}, b_{q_2}, ..., b_{q_R}
end procedure
procedure SELECTOR(b_1, ..., b_R)    ▷ Determine H from the round winners
    return Any j* ∈ argmax_j #{r ∈ [R] : B_{r,j} = b_r}
end procedure
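As an illustration of how two simple pieces of Algorithm 4 fit together, the following sketch shows the supermajority filter that forms the substream S' and the majority vote performed by Selector. The arrays A, a_t and B are assumed to come from the Pair instances; everything here is a hedged sketch, not the thesis implementation.

```python
def supermajority_filter(stream, A, a_t_stream, L):
    """Keep the updates whose bits agree with at least 0.9L of the L Pair outputs.
    A[l][item] and a_t_stream[t][l] are assumed inputs."""
    kept = []
    for t, item in enumerate(stream):
        agree = sum(1 for l in range(L) if A[l][item] == a_t_stream[t][l])
        if agree >= 0.9 * L:
            kept.append(item)
    return kept

def selector(round_bits, B, n, R):
    """Sketch of SELECTOR: return the item whose hash bits B[r][j] agree with the
    recorded round winners round_bits[r] most often."""
    return max(range(n),
               key=lambda j: sum(1 for r in range(R) if B[r][j] == round_bits[r]))
```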

1. $f_H \ge \tau^4$,

2. $f_H^2 \ge 400 C^2 \sum_{j \neq H} f_j^2$, and

3. $(f_H^{(t^*:m)})^2 = \frac{1}{4} f_H^2 \ge 25 C^2 \tau^2 \sum_{j \neq H} (f_j^{(t^*:m)})^2$,

where $t^* = \min\{t \in [m] \mid f_H^{(t)} \ge 0.5 f_H\}$ and $C$ is the constant from Definition 3. If $q_0, q_1, \dots, q_R$ is the sequence output by Timer$(S)$ then, with probability at least $1 - 4\delta$, $H$ is everywhere heavy for $q$.

Proof. We begin by proving that at least $R$ rounds occur after $t^*$, which shows that $q_0, \dots, q_R$ is well defined, and then we show that $H$ is everywhere heavy. Let $Y_t$ be the sequence output by JLBP and let $X_t = Y_t - d^{-1/2} Z T e_H f_H^{(t)}$. $Y_t$ and $X_t$ are tame by Lemma 5, and $\Pr(0.5 \le \alpha \le 1.5) \ge 1 - \delta$, where $\alpha = \|d^{-1/2} Z T e_H\|_2$. Hereafter, we suppose that $\alpha \ge 1/2$ and the tameness property holds for $Y_t$ and $X_t$. With probability at least $1 - \delta$, simultaneously for all $t \in [m]$, we have
$$\|X_t\|_2^2 \le C^2 \sum_{j \neq H} f_j^2 \le \frac{1}{400} f_H^2. \quad (2.2)$$
Therefore, $\|Y_{t^*}\|_2 \le \|X_{t^*}\|_2 + \alpha f_H^{(t^*)} \le (\frac{\alpha}{2} + \frac{1}{20}) f_H$ and $\|Y_m\|_2 \ge \alpha f_H - \|X_m\|_2 \ge (\alpha - \frac{1}{20}) f_H$. This implies that the number of rounds completed after $t^*$, which is
$$\log_{1+1/\tau} \frac{\|Y_m\|_2}{\|Y_{t^*}\|_2} \ge \log_{1+1/\tau} \frac{\alpha - 1/20}{\alpha/2 + 1/20} \ge \log_{1+1/\tau}(3/2),$$
is at least $R + 1$ by our choice of $\tau = 100(R+1)$. Similarly $\|Y_{t^*}\|_2 \ge \alpha f_H^{(t^*)} - \|X_{t^*}\|_2 \ge (\frac{\alpha}{2} - \frac{1}{20}) f_H$. Therefore we also get $q_i > q_{i-1}$ because $(1 + \tau^{-1})\|Y_{t^*}\|_2 \ge 1$ by our assumption that $f_H \ge \tau^4$. Hence $q_0, \dots, q_R$ are distinct times.

Now we show that $H$ is everywhere heavy for $q$. Let $W_t = X_t - X_{t^*}$, for $t \ge t^*$. By design, $W_t - W_s = X_t - X_s$, for $s, t \ge t^*$. By Lemma 5, $W_t$ is also a tame process on the suffix of the original stream that has its first item at time $t^* + 1$. Specifically, with probability at least $1 - \delta$, for all $t \ge t^*$,
$$\|W_t\|_2^2 \le C^2 \sum_{j \neq H} (f_j^{(t^*:m)})^2 \le \frac{1}{400\tau^2} f_H^2.$$
This inequality, with two applications of the triangle inequality, implies
$$\alpha f_H^{(q_{i-1}:q_i)} \ge \|Y_{q_i} - Y_{q_{i-1}}\|_2 - \|W_{q_i} - W_{q_{i-1}}\|_2 \ge \|Y_{q_i} - Y_{q_{i-1}}\|_2 - \frac{2}{20\tau} f_H. \quad (2.3)$$
To complete the proof we must bound $\|Y_{q_i} - Y_{q_{i-1}}\|_2$ from below and then apply the heaviness, i.e., assumption 3. Equation (2.2) and the triangle inequality imply that, for every $t \ge t^*$, it holds that $\|Y_t\|_2 \ge \alpha f_H^{(t)} - \|X_t\|_2 \ge (\frac{\alpha}{2} - \frac{1}{20}) f_H$. Recalling the definition of $q_0', q_1', \dots$ from Timer, since $t^* \le q_0 < q_1 < \cdots < q_R$ and the rounds expand at a rate $(1 + 1/\tau)$,
$$\|Y_{q_{i+1}} - Y_{q_i}\|_2 \ge \frac{1}{\tau}\Big(\frac{\alpha}{2} - \frac{1}{20}\Big) f_H. \quad (2.4)$$
Using what we have already shown in (2.3) we have
$$\alpha f_H^{(q_i:q_{i+1})} \ge \frac{1}{\tau}\Big(\frac{\alpha}{2} - \frac{1}{20} - \frac{2}{20}\Big) f_H,$$
so dividing and using $\alpha \ge 1/2$ and $C$ sufficiently large we get
$$(f_H^{(q_i:q_{i+1})})^2 \ge \frac{1}{25\tau^2} f_H^2 \ge C^2 \sum_{j \neq H} (f_j^{(t^*:m)})^2 \ge C^2 \sum_{j \neq H} (f_j^{(q_i:q_{i+1})})^2.$$
Since this holds for all $i$, $H$ is everywhere heavy for $q$. We have used the tameness of the three processes ($X$, $Y$, and $W$) and the bounds on $\alpha$. Each of these fails with probability at most $\delta$, so the probability that Timer fails to achieve the condition that $H$ is everywhere heavy for $q$ is at most $4\delta$.

During each round, the algorithm Sieve uses a hash function $A$ to split the stream into two parts and then determines which part contains $H$ via Pair. For these instances of Pair, we replace the two instances of JLBP with two instances of AMS. This replacement helps us to hold down the storage when we later use Nisan's PRG, because computing the JL transformation $T$ from [84] requires $O(\log n \log\log n)$ bits. Applying Nisan's PRG to an algorithm that computes entries in $T$ would leave us with a bound of $O(\log n (\log\log n)^2)$. More details can be found in Section 2.2.4. A total of $O(\log n)$ rounds is enough to identify the heavy hitter, and the only information that we need to save from each round is the hash function $A$ and the last bit output by Pair. Selector does the work of finally identifying $H$ from the sequence of bits output by Sieve and the sequence of hash functions used during the rounds. We prove the correctness of Sieve and Selector together in the following lemma.

Lemma 10 (Sieve/Selector). Let q0, q1,..., qR be the sequence output by TIMER(S) and let b1,..., bR be the sequence output by SIEVE(S, q0,..., qR). If H is everywhere heavy for q on the stream S then, with probability at least 1 − δ, Selector(b1,..., bR) returns H.

Proof. Lemma 6 in the AMS case implies that the outcome of round $r$ satisfies $\Pr(b_r = B_{r,H}) \ge 1 - 3\delta$ and $\Pr(b_r = B_{r,j}) \le \frac{1}{2} + 3\delta$. The random bits used in each iteration of the for loop within Sieve are independent of the other iterations. Upon choosing the number of rounds $R = O(\log n)$ to be sufficiently large, Chernoff's Inequality implies that, with high probability, $H$ is the unique item in $\operatorname{argmax}_j \#\{r \in [R] \mid B_{r,j} = b_r\}$. Therefore, Selector returns $H$.

Algorithm 5 ℓ2 heavy hitters algorithm.
procedure ℓ2HEAVYHITTERS(S = (p_1, p_2, ..., p_m))
    Q ← O(log ε^{-1}), B ← O(ε^{-2})
    Select independent 2-universal hash functions h_1, ..., h_Q, h'_1, ..., h'_Q : [n] → [B] and σ_1, ..., σ_Q : [n] → {−1, 1}
    F̂_2 ← (1 ± ε/10) F_2 using AMS [6]
    Ĥ ← ∅
    for (q, b) ∈ [Q] × [B] do
        Let S_{q,b} be the stream of items i with h_q(i) = b
        c_{q,b} ← Σ_{j : h'_q(j) = b} σ_q(j) f_j    ▷ The CountSketch [39]
        Ĥ ← Ĥ ∪ {COUNTSIEVE(S_{q,b})}
    end for
    Remove from Ĥ any item i such that median_q{|c_{q,h'_q(i)}|} ≤ (3ε/4) √(F̂_2)
    return Ĥ
end procedure
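The following hedged sketch illustrates the reduction in Algorithm 5: hash items into buckets across several repetitions, run one single-heavy-hitter routine per bucket, keep CountSketch counters, and filter candidates by their estimated frequency. The routine count_sieve and the estimate f2_hat are assumed inputs, and the constants and hash families are simplified for illustration.

```python
import math
import random
import statistics

def l2_heavy_hitters(stream, n, eps, count_sieve, f2_hat):
    """Illustrative sketch of the reduction in Algorithm 5. `count_sieve(substream)`
    is assumed to return one candidate item (or None); `f2_hat` is an assumed
    estimate of F_2 (e.g. from an AMS sketch)."""
    Q = max(1, round(math.log2(1.0 / eps)))          # rows
    B = max(1, round(1.0 / eps ** 2))                # buckets per row
    rng = random.Random(0)
    h = [[rng.randrange(B) for _ in range(n)] for _ in range(Q)]    # split hashes
    hp = [[rng.randrange(B) for _ in range(n)] for _ in range(Q)]   # CountSketch hashes
    sgn = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(Q)]
    counts = [[0] * B for _ in range(Q)]
    buckets = [[[] for _ in range(B)] for _ in range(Q)]
    for item in stream:
        for q in range(Q):
            buckets[q][h[q][item]].append(item)      # substream for one CountSieve
            counts[q][hp[q][item]] += sgn[q][item]   # CountSketch counter
    candidates = {count_sieve(buckets[q][b]) for q in range(Q) for b in range(B)}
    candidates.discard(None)
    kept = set()
    for i in candidates:
        est = statistics.median(abs(counts[q][hp[q][i]]) for q in range(Q))
        if est > 0.75 * eps * math.sqrt(f2_hat):     # keep only heavy-looking items
            kept.add(i)
    return kept
```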

CountSieve Correctness. We now have everything in place to prove that CountSieve correctly identifies a sufficiently heavy heavy hitter $H$. As for the storage bound and Theorem 1, the entire algorithm fits within $O(\log n \log\log n)$ bits except for the $R = O(\log n)$ hash functions required by Sieve. We defer their replacement to Theorem 19 in Section 2.2.4.

Theorem 11 (CountSieve Correctness). If $H$ is a $400C^2$-heavy hitter then the probability that CountSieve returns $H$ is at least 0.95. The algorithm uses $O(\log n \log\log n)$ bits of storage and can be implemented with $O(\log n \log\log n)$ stored random bits.

Proof. We use Theorem 16 to generate the JL transformation $T$. Each of our lemmas requires that $T$ embeds a (possibly different) polynomially sized set of vectors, so, for $\delta = \Omega(1)$, Theorem 16 implies that, with probability at least $1-\delta$, $T$ embeds all of the necessary vectors with seed length $O(\log n)$, and the entries in $T$ can be computed with $O(\log n \log\log n)$ bits of space. Because of the heaviness assumption, the conclusion of Lemma 7 fails to hold for $t_0 = t^*$ (defined in Lemma 9) with probability at most $2\delta$. When that failure does not occur, the second and third hypotheses in Lemma 9 hold. The first hypothesis is that $f_H \ge \tau^4$; suppose it holds. Then the probability that $H$ fails to be everywhere heavy for the sequence $q$ that is output by Timer is at most $4\delta$. In this case, according to Lemma 10, Sieve and Selector correctly identify $H$ except with probability at most $\delta$. Therefore, the algorithm is correct with probability at least $1 - 8\delta \ge 0.95$, by choosing $\delta \le 1/200$. If $f_H < \tau^4$, then because $H$ is a heavy hitter, we get $\sum_{j\neq H} f_j^2 \le \tau^8 = O(\log^8 n)$. Then we choose the constant factor in $L$ large enough so that the second conclusion of Lemma 7 implies $\sum_{j\in W} f_j^2 \le e^{-L/25} < 1$. This means that $H$ is the only item that passes the amplifier for all $t \ge t^*$, and, no matter what sequence is output by Timer, $H$ is everywhere heavy because it is the only item in the substream. Thus, in this case the algorithm also outputs $H$.

Now we analyze the storage and randomness. Computing the entries in the Kane-Meka-Nelson JL matrix requires $O(\log n \log\log n)$ bits of storage, by Theorem 16, and there is only one of these matrices. Amplifier stores $L = O(\log\log n)$ counters. Sieve, Timer, and Selector each require $O(\log n)$ bits at a time (since we discard any value as soon as it is no longer needed). Thus the total working memory of the algorithm is $O(\log n \log\log n)$ bits. The random seed for the JL matrix has $O(\log n)$ bits. Each of the $O(\log\log n)$ Bernoulli processes requires $O(\log n)$ random bits. By Theorem 19 below, the remaining random bits can be generated with Nisan's generator using a seed of $O(\log n \log\log n)$ bits. Using Nisan's generator does not increase the storage of the algorithm. Accounting for all of these, the total number of random bits used by CountSieve, which also must be stored, is $O(\log n \log\log n)$. Therefore, the total storage used by the algorithm is $O(\log n \log\log n)$ bits.

Theorem 1 ($\ell_2$-Heavy Hitters). For any $\varepsilon > 0$, there is a 1-pass algorithm in the insertion-only model that, with probability at least 2/3, finds all those indices $i \in [n]$ for which $f_i \ge \varepsilon\sqrt{F_2}$, and reports no indices $i \in [n]$ for which $f_i \le \frac{\varepsilon}{2}\sqrt{F_2}$. The space complexity is $O(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon}\log n \log\log n)$ bits.

Proof. The algorithm is Algorithm 5. It has the form of a CountSketch [39] with $Q = O(\log 1/\varepsilon)$ "rows" and $B = 8(10C)^2/\varepsilon^2$ "buckets" per row, wherein we run one instance of CountSieve in each bucket to identify potential heavy hitters and also the usual CountSketch counter in each bucket. Finally, the algorithm discriminates against non-heavy hitters by testing their frequency estimates from the CountSketch. We will assume that the AMS estimate $\hat F_2$ is correct with probability at least 8/9.

Let $\mathcal{H}_k = \{i \mid f_i \ge \frac{\varepsilon}{k}\sqrt{F_2}\}$ and let $\hat H$ be the set of distinct elements returned by Algorithm 5. To prove the theorem, it is sufficient to prove that, with probability at least 2/3, $\mathcal{H}_1 \subseteq \hat H \subseteq \mathcal{H}_2$.

Let $H \in \mathcal{H}_1$ and consider the stream $S_{q, h_q(H)}$ at position $(q, h_q(H))$. We have
$$\mathbb{E}\Big(\sum_{\substack{j \neq H \\ h_q(j) = h_q(H)}} f_j^2\Big) \le \frac{\varepsilon^2}{8(10C)^2} F_2.$$
Let $E_{q,H}$ be the event that
$$\sum_{\substack{j \neq H \\ h_q(j) = h_q(H)}} f_j^2 \le \frac{\varepsilon^2}{(10C)^2} F_2,$$
so by Markov's Inequality $\Pr(E_{q,H}) \ge 7/8$. When $E_{q,H}$ occurs, $H$ is sufficiently heavy in $S_{q,h_q(H)}$ for CountSieve. By Theorem 11, with probability at least $\frac{7}{8} - \frac{1}{20} \ge 0.8$, CountSieve identifies $H$. Therefore, with the correct choice of the constant factor for $Q$, a Chernoff bound and a union bound imply that, with probability at least $1 - 1/9$, every item in $\mathcal{H}_1$ is returned at least once by a CountSieve.

Let $\hat H'$ denote the set $\hat H$ before any elements are removed in the final step. Since CountSieve identifies at most one item in each bucket, $|\hat H'| = O(\varepsilon^{-2}\log \varepsilon^{-1})$. By the correctness of CountSketch [39] and the fact that it is independent of $\hat H'$, we get that, with probability at least $1 - 1/9$, for all $i \in \hat H'$,
$$\Big| f_i - \operatorname{median}_q\{|c_{q,h'_q(i)}|\} \Big| \le \frac{\varepsilon}{10C}\sqrt{F_2}.$$
When this happens and the AMS estimate is correct, the final step of the algorithm correctly removes any items $i \notin \mathcal{H}_2$ and all items $i \in \mathcal{H}_1$ remain. This completes the proof of correctness.

The storage needed by the CountSketch is $O(BQ\log n)$, the total storage needed for all instances of CountSieve is $O(BQ\log n\log\log n)$, and the storage needed for AMS is $O(\varepsilon^{-2}\log n)$. Therefore, the total number of bits of storage is
$$O(BQ\log n\log\log n) = O\Big(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon}\log n\log\log n\Big).$$

Corollary 11.1. There exists an insertion-only streaming algorithm that returns an additive $\pm\varepsilon\sqrt{F_2}$ approximation to $\ell_\infty$ with probability at least 2/3 and requires $O(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon}\log n\log\log n)$ bits of space.

Proof. Use Algorithm 5. If no heavy hitter is returned then the $\ell_\infty$ estimate is 0; otherwise return the largest of the CountSketch medians among the discovered heavy hitters. The correctness follows from Theorem 1 and the correctness of CountSketch.

2.2.3 Chaining Inequality

We call these inequalities Chaining Inequalities after the Generic Chaining, which is the technique that we use to prove them. The book [136] by Talagrand contains an excellent exposition of the subject. Let $(X_t)_{t \in T}$ be a Gaussian process. The Generic Chaining technique concerns the study of the supremum of $X_t$ in a particular metric space related to the variances and covariances of the process. The metric space is $(T, d)$ where $d(s,t) = (\mathbb{E}(X_s - X_t)^2)^{1/2}$. The method takes any finite chain of finite subsets $T_0 \subseteq T_1 \subseteq \cdots \subseteq T_n \subseteq T$ and uses $(X_t)_{t \in T_i}$ as a sequence of successive approximations to $(X_t)_{t \in T}$, wherein $X_t$, for $t \notin T_i$, is approximated by the value of the process at some minimizer of $d(t, T_i) = \min\{d(t,s) \mid s \in T_i\}$. To apply the Generic Chaining one must judiciously choose the chain in order to get a good bound, and the best choice necessarily depends on the structure of the process. We will exploit the following lemma.

Lemma 12 ([136]). Let $\{X_t\}_{t\in T}$ be a Gaussian process and let $T_0 \subseteq T_1 \subseteq \cdots \subseteq T_n \subseteq T$ be a chain of sets such that $|T_0| = 1$ and $|T_i| \le 2^{2^i}$ for $i \ge 1$. Then
$$\mathbb{E}\sup_{t\in T} X_t \le O(1)\sup_{t\in T}\sum_{i\ge 0} 2^{i/2} d(t, T_i). \quad (2.5)$$

The Generic Chaining also applies to Bernoulli processes, but, for our purposes, it is enough that we can compare related Gaussian and Bernoulli processes.

Lemma 13 ([136]). Let $A \in \mathbb{R}^{m\times n}$ be any matrix and let $G$ and $B$ be $n$-dimensional vectors with independent coordinates distributed as $N(0,1)$ and Rademacher, respectively. Then the Gaussian process $X = AG$ and Bernoulli process $Y = AB$ satisfy $\mathbb{E}\sup_{t\in T} Y_t \le \sqrt{\frac{\pi}{2}}\,\mathbb{E}\sup_{t\in T} X_t$.


Theorem 14 (Chaining Inequality). Define independent $N(0,1)$ random variables $Z_1, \dots, Z_n$ and let $(f^{(t)})_{t\in[m]}$ be the sequence of frequency vectors of an insertion-only stream. There exists a universal constant $C' > 0$ such that if $X_t = \sum_{j=1}^n Z_j f_j^{(t)}$, for $t \in [m]$, then
$$\mathbb{E}\sup_i |X_i| \le C'\sqrt{\operatorname{Var}(X_m)} = C'\|f^{(m)}\|_2. \quad (2.6)$$
If $\bar Z_1, \dots, \bar Z_n$ are i.i.d. Rademacher and $Y_t = \sum_{j=1}^n \bar Z_j f_j^{(t)}$, for $t\in[m]$, then
$$\mathbb{E}\sup_i |Y_i| \le C'\sqrt{\operatorname{Var}(Y_m)} = C'\|f^{(m)}\|_2. \quad (2.7)$$

Proof. Let $T = [m]$. Define $T_0 = \{t_0\}$, where $t_0$ is the index such that $\operatorname{Var}(X_{t_0}) < 0.5\operatorname{Var}(X_m) \le \operatorname{Var}(X_{t_0+1})$, and $T_i = \{1, t_{i,1}, t_{i,2}, \dots\}$ where, for each index $j$, $t_{i,j} \in T_i$ satisfies $\operatorname{Var}(X_{t_{i,j}}) < \frac{j}{2^{2^i}}\operatorname{Var}(X_m) \le \operatorname{Var}(X_{t_{i,j}+1})$. This is well-defined because $\operatorname{Var}(X_t) = \|f^{(t)}\|_2^2$ is the second moment of an insertion-only stream, which must be monotonically increasing. By construction $|T_i| \le 2^{2^i}$ and, for each $t \in T$, there exists $t_{i,j} \in T_i$ such that
$$d(t, T_i) = \min(d(t, t_{i,j}), d(t, t_{i,j+1})) \le d(t_{i,j}, t_{i,j+1}) = \big(\mathbb{E}(X_{t_{i,j+1}} - X_{t_{i,j}})^2\big)^{1/2},$$
where the last inequality holds because $\mathbb{E}(X_t^2)$ is monotonically increasing with $t$.

Notice that every pair of increments has nonnegative covariance because the stream is insertion-only. Thus, the following is true:
$$d(t, T_i)^2 \le \mathbb{E}(X_{t_{i,j+1}} - X_{t_{i,j}})^2 \le \mathbb{E}(X_{t_{i,j+1}} - X_{t_{i,j}})^2 + 2\,\mathbb{E}\,X_{t_{i,j}}(X_{t_{i,j+1}} - X_{t_{i,j}}) = \mathbb{E}X_{t_{i,j+1}}^2 - \mathbb{E}X_{t_{i,j}}^2 \le \frac{j+1}{2^{2^i}}\mathbb{E}X_m^2 - \frac{j-1}{2^{2^i}}\mathbb{E}X_m^2 = \frac{2}{2^{2^i}}\mathbb{E}X_m^2.$$
Then we can conclude that
$$\sum_{i\ge 0} 2^{i/2} d(t, T_i) \le \sum_{i\ge 0} 2^{i/2}\sqrt{\frac{2}{2^{2^i}}\mathbb{E}X_m^2} = O(1)\sqrt{\operatorname{Var}(X_m)}.$$
Applying (2.5) we obtain $\mathbb{E}\sup_{t\in T} X_t \le O(1)\sqrt{\operatorname{Var}(X_m)}$. In order to bound the absolute value, observe
$$\sup_t |X_t| \le |X_1| + \sup_t |X_t - X_1| \le |X_1| + \sup_{s,t}(X_t - X_s) = |X_1| + \sup_t X_t + \sup_s(-X_s). \quad (2.8)$$
Thus, $\mathbb{E}\sup_t |X_t| \le \mathbb{E}|X_1| + 2\,\mathbb{E}\sup_t X_t \le O(1)\sqrt{\operatorname{Var}(X_m)}$, because $-X_t$ is also a Gaussian process with the same distribution as $X_t$, and $\mathbb{E}|X_1| = O(\sqrt{\operatorname{Var}(X_m)})$ because $\|f^{(1)}\|_2 = 1$. This establishes (2.6), and (2.7) follows immediately by an application of Lemma 13.

Theorem 14 would obviously not be true for a stream with deletions, since we may have $\operatorname{Var}(X_m) = 0$. One may wonder if the theorem would be true for streams with deletions upon replacing $\operatorname{Var}(X_m)$ by $\max_t \operatorname{Var}(X_t)$. This is not true, and a counterexample is the stream $(e_1, -e_1, e_2, \dots, e_n, -e_n)$, which yields $\max_t \operatorname{Var}(X_t) = 1$ but $\mathbb{E}\sup_t |X_t| = \Theta(\sqrt{\log n})$. Theorem 14 does not apply to the process output by JLBP, but the covariance structures of the two processes are very similar because $T$ is an embedding. We can achieve basically the same inequality for the JLBP process by applying Slepian's Lemma, mimicking the strategy in [106].

Lemma 15 (Slepian's Lemma [95]). Let $X_t$ and $Y_t$, for $t\in T$, be Gaussian processes such that $\mathbb{E}(X_s - X_t)^2 \le \mathbb{E}(Y_s - Y_t)^2$, for all $s,t\in T$. Then $\mathbb{E}\sup_{t\in T} X_t \le \mathbb{E}\sup_{t\in T} Y_t$.

Corollary 15.1 (Chaining Inequality with JL). Let $T$ be a $(1\pm\gamma)$-embedding of $(f^{(t)})_{t\in[m]}$ and let $Z_1, \dots, Z_k$ be i.i.d. $N(0,1)$. There exists a universal constant $C' > 0$ such that if $X_t = \langle Z, Tf^{(t)}\rangle$, for $t\in[m]$, then $\mathbb{E}\sup_i |X_i| \le C'\|f^{(m)}\|_2$. If $\bar Z_1, \dots, \bar Z_k$ are i.i.d. Rademacher and $Y_t = \langle\bar Z, Tf^{(t)}\rangle$, for $t\in[m]$, then $\mathbb{E}\sup_i |Y_i| \le C'\|f^{(m)}\|_2$.

Proof. Let $W_t$ be the Gaussian process from Theorem 14. Since $T$ is a JL transformation,
$$\mathbb{E}(X_t - X_s)^2 = \|Tf^{(s:t)}\|_2^2 \le (1+\gamma)^2\|f^{(s:t)}\|_2^2 = (1+\gamma)^2\,\mathbb{E}(W_t - W_s)^2.$$
The first claim of the corollary follows from Slepian's Lemma, Equation (2.8), and Theorem 14. The second inequality follows from the first and Lemma 13.

2.2.4 Reduced randomness

This section describes how CountSieve can be implemented with only $O(\log n \log\log n)$ random bits. There are two main barriers to reducing the number of random bits. We have already partially overcome the first barrier, which is to reduce the number of bits needed by a Bernoulli process from $n$, as in the algorithm BP, to $O(\log n)$ by introducing JLBP. JLBP runs $d = O(1)$ independent Bernoulli processes in dimension $k = O(\log n)$ for a total of $dk = O(\log n)$ random bits. This section proves the correctness of that algorithm. The second barrier is finding a surrogate for the $R$ independent vectors of pairwise independent Bernoulli random variables that are used during the rounds of Sieve. We must store their values so that Selector can retroactively identify a heavy hitter, but, naïvely, they require $\Omega(\log^2 n)$ random bits. We will show that one can use Nisan's pseudorandom generator (PRG) with seed length $O(\log n \log\log n)$ bits to generate these vectors.

A priori, it is not obvious that this is possible. The main sticking point is that the streaming algorithm that we want to derandomize must store the random bits it uses, which means that these count against the seed length for Nisan's PRG. Specifically, Nisan's PRG reduces the number of random bits needed by a space-$S$ algorithm using $R$ random bits to $O(S\log R)$. Because CountSieve must pay to store the $R$ random bits, the storage used is $S \ge R = \Omega(\log^2 n)$, so Nisan's PRG appears even to increase the storage used by the algorithm! We can overcome this by introducing an auxiliary (non-streaming) algorithm that has the same output as Sieve and Selector, but manages without storing all of the random bits. This method is similar in spirit to Indyk's derandomization of linear sketches using Nisan's PRG [78]. It is not a black-box reduction to the auxiliary algorithm, and it is only possible because we can exploit the structure of Sieve and Selector. We remark here that we are not aware of any black-box derandomization of the Bernoulli processes that suits our needs. This is for two reasons. First, we cannot reorder the stream for the purpose of the proof because the order of the computation is important; reordering the stream is needed for Indyk's argument [78] for applying Nisan's PRG. Second, the seed length of available generators is too large: typically, in our setting, we would require a seed of length at least $\log^{1+\delta} n$, for some $\delta > 0$.

The Bernoulli Process with O(log n) Random Bits. The main observation that leads to reducing the number of random bits needed by the algorithm is that the distribution of the corresponding Gaussian process depends only on the second moments of the increments. These moments are just the square of the Euclidean norm of the change in the frequency vector, so applying a Johnson-Lindenstrauss transformation to the frequency vector nearly preserves the distribution of the process and allows us to get away with $O(\log n)$ random bits. One trouble with this approach is that the heavy hitter $H$ could be "lost", by which we mean that although $\|Te_H\| \approx 1$ it may be that $\langle Z, Te_H\rangle \approx 0$, for the Rademacher random vector $Z$, whereupon $H$'s contribution to the sum $\langle Z, Tf^{(t)}\rangle$ is lost among the noise. To avoid this possibility we keep $d = O(1)$ independent Bernoulli processes in parallel. First, we state the correctness of the Johnson-Lindenstrauss transformation that we use and the storage needed for it.

Theorem 16 (Kane, Meka, & Nelson [84]). Let $V$ be a set of $n$ points in $\mathbb{R}^n$. For any constant $\delta > 0$ there exist $k = O(\gamma^{-2}\log(n/\delta))$ and a generator $G : \{0,1\}^{O(\log n)}\times[k]\times[n]\to\mathbb{R}$ such that, with probability at least $1-\delta$, the $k\times n$ matrix $T$ with entries $T_{ij} = G(R, i, j)$ is a $(1\pm\gamma)$-embedding of $V$, where $R\in\{0,1\}^{O(\log n)}$ is a uniformly random string. The value of $G(R,i,j)$ can be computed with $O(\log n\log\log n)$ bits of storage.

Lemma 5 (JLBP Correctness). Suppose the matrix $T$ used by JLBP is a $(1 \pm \gamma)$-embedding of $(f^{(t)})_{t \in [m]}$. For any $d \ge 1$, the sequence $\frac{1}{\sqrt{d}} Z T f^{(t)}$ returned by JLBP is a tame $d$-dimensional Bernoulli process. Furthermore, there exists $d_0 = O(\log \delta^{-1})$ such that for any $d \ge d_0$ and $H \in [n]$ it holds that $\Pr(\frac{1}{2} \le \|d^{-1/2} Z T e_H\| \le \frac{3}{2}) \ge 1 - \delta$.

Proof. Let $X_{i,t} = \sum_{j=1}^k Z_{ij}(Tf^{(t)})_j$ and
$$X_t = \Big\|\frac{1}{\sqrt d} Z T f^{(t)}\Big\|_2 = \Big(\frac{1}{d}\sum_{i=1}^d X_{i,t}^2\Big)^{1/2},$$
for $t = 1, \dots, m$. Each process $X_{i,t}$ is a Bernoulli process with $\operatorname{Var}(X_{i,t}) = \|Tf^{(t)}\|_2^2 \le (1+\gamma)^2\|f^{(t)}\|_2^2$ and, for $s < t$, $\mathbb{E}(X_{i,t}-X_{i,s})^2 = \|Tf^{(s:t)}\|_2^2 \le (1+\gamma)^2\|f^{(s:t)}\|_2^2$. Notice that, for all $i$, the processes $(X_{i,t})_{t\in[m]}$ have the same distribution. Let $X_t'$ be a Gaussian process that is identical to $X_{i,t}$, except that the Rademacher random variables are replaced by standard Gaussians. $X_t'$ and $X_{i,t}$ have the same means, variances, and covariances. Thus, $\mathbb{E}\sup_t |X_{i,t}| \le \sqrt{\frac{\pi}{2}}\,\mathbb{E}\sup_t |X_t'|$, by Lemma 13.

Let $N_1, \dots, N_n$ be i.i.d. $N(0,1)$. We will compare $X_t'$ against the Gaussian process $X_t'' = (1+\gamma)\frac{1}{\sqrt d}\langle N, f^{(t)}\rangle$. By the Chaining Inequality, there exists $C'$ such that $\mathbb{E}\sup_t|X_t''| \le C'\sqrt{\operatorname{Var}(X_m'')} = \frac{C'(1+\gamma)}{\sqrt d}\|f^{(m)}\|_2$. We have $\mathbb{E}(X_t''-X_s'')^2 = \frac{1}{d}(1+\gamma)^2\|f^{(s:t)}\|_2^2$, so by Slepian's Lemma applied to $X_t'$ and $X_t''$ and by (2.8) we have
$$\mathbb{E}\sup_t |X_{i,t}| \le \sqrt{\frac{\pi}{2}}\,\mathbb{E}\sup_t|X_t'| \le \sqrt{\frac{\pi}{2}}\,\sqrt d\,\mathbb{E}\sup_t|X_t''| \le \sqrt{\frac{\pi}{2}}\,(1+\gamma)C'\|f^{(m)}\|_2.$$
Now we apply Markov's Inequality to get $\Pr\big(\sup_t|X_{i,t}| \ge \frac{C}{\sqrt d}\|f^{(m)}\|_2\big) \le \frac{\delta}{d}$, by taking $C \ge \sqrt{\frac{\pi}{2}}(1+\gamma)C'd^{3/2}/\delta$. From a union bound we find $\Pr\big(\sup_{i,t}|X_{i,t}| \ge \frac{C}{\sqrt d}\|f^{(m)}\|_2\big) \le \delta$, and that event implies $\sup_t |X_t| \le C\|f^{(m)}\|_2$, which is (2.1) and proves that the process is tame.

For the second claim, we note that the matrix $\frac{1}{\sqrt d}Z$ is itself a type of Johnson-Lindenstrauss transformation (see [1]), hence $\frac12 \le \|d^{-1/2}ZTe_H\| \le \frac32$ with probability at least $1 - 2^{-d} \ge 1-\delta$. The last inequality follows by our choice of $d$.

Sieve and Selector. In the description of the algorithm, the Sieve and Selector use $O(\log n)$ many pairwise independent hash functions that are themselves independent. Nominally, this needs $O(\log^2 n)$ bits. However, as we show in this section, it is sufficient to use Nisan's pseudorandom generator [116] to generate the hash functions. This reduces the random seed length from $O(\log^2 n)$ to $O(\log n \log\log n)$. Recall the definition of a pseudorandom generator.

Definition 17. A function $G : \{0,1\}^m \to \{0,1\}^n$ is called a pseudorandom generator (PRG) for space($S$) with parameter $\varepsilon$ if for every randomized space($S$) algorithm $A$ and every input to it we have that
$$\|\mathcal{D}_y(A(y)) - \mathcal{D}_x(A(G(x)))\|_1 < \varepsilon,$$
where $y$ is chosen uniformly at random in $\{0,1\}^n$, $x$ uniformly in $\{0,1\}^m$, and $\mathcal{D}(\cdot)$ is the distribution of $\cdot$ as a vector of probabilities.

Nisan's PRG [116] is a pseudorandom generator for space $S$ with parameter $2^{-S}$ that stretches a seed of $O(S\log R)$ bits to $R$ bits. The total space used by Sieve and Selector is $O(\log n)$ bits for the algorithm workspace and $O(\log^2 n)$ bits to store the hash functions. We will be able to apply Nisan's PRG because Sieve only accesses the randomness in $O(\log n)$-bit chunks, where the $r$th chunk generates the 4-wise independent random variables needed for the $r$th round, namely $B_{r,1}, \dots, B_{r,n}$ and the bits for two instances of the AMS sketch. We can discard the AMS sketches at the end of each round, but in order to compute its output after reading the entire stream, Selector needs access to the bit sequence $b_1, b_2, \dots, b_R$ as well as $B_{r,i}$, for $r\in[R]$ and $i\in[n]$. Storing the $B$ random variables, by their seeds, requires $O(\log^2 n)$ bits. This poses a problem for derandomization with Nisan's PRG because it means that Sieve and Selector are effectively an $O(\log^2 n)$-space algorithm, even though most of the space is only used to store random bits.

We will overcome this difficulty by derandomizing an auxiliary algorithm. The auxiliary algorithm computes a piece of the information necessary for the outcome; specifically, for a given item $j\in[n]$ in the stream, the auxiliary algorithm will compute $N_j := \#\{r\mid b_r = B_{r,j}\}$, the number of times $j$ is on the "winning side", and compare that value to $3R/4$. Recall that the Selector outputs as the heavy hitter a $j$ that maximizes $N_j$. By Lemma 6 for the AMS case, $\mathbb{E}N_j$ is no larger than $(\frac12 + 3\delta)R$ if $j$ is not the heavy element, and $\mathbb{E}N_H$ is at least $(1-3\delta)R$ if $H$ is the heavy element. When the Sieve is implemented with fully independent rounds, Chernoff's Inequality implies that $N_H > 3R/4$ or $N_j \le 3R/4$ with high probability. When we replace the random bits for the independent rounds with bits generated by Nisan's PRG, we find that for each $j$, with high probability, $N_j$ remains on the same side of $3R/4$.

Here is a formal description of the auxiliary algorithm. The auxiliary algorithm takes as input the sequence $q_0, q_1, \dots, q_R$ (which is independent of the bits we want to replace with Nisan's PRG), the stream $S$, and an item label $j$, and it outputs whether $N_j > 3R/4$. It initializes $N_j = 0$, and then for each round $r = 1, \dots, R$ it draws $O(\log n)$ random bits and computes the output $b_r$ of the round. If $b_r = B_{r,j}$ then $N_j$ is incremented, and otherwise it remains unchanged during the round. The random bits used by each round are discarded at its end. At the end of the stream the algorithm outputs 1 if $N_j > 3R/4$.
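A minimal sketch of the auxiliary algorithm just described; run_round is a hypothetical stand-in for one Sieve round (Pair with AMS sketches plus the 4-wise independent bits $B_{r,\cdot}$), so only the per-item counter $N_j$ and a small workspace survive each round.

```python
def auxiliary_count(stream, boundaries, j, run_round):
    """Illustrative sketch of the auxiliary algorithm used for derandomization:
    replay the rounds defined by `boundaries`, drawing fresh bits per round
    (hidden inside the assumed callback `run_round`), and track only N_j."""
    N_j = 0
    R = len(boundaries) - 1
    for r in range(R):
        chunk = stream[boundaries[r]:boundaries[r + 1]]   # updates of round r
        b_r, B_rj = run_round(chunk, j)   # round winner and j's hash bit this round
        if b_r == B_rj:
            N_j += 1
        # the per-round randomness inside run_round is discarded here
    return 1 if N_j > 3 * R / 4 else 0
```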

Lemma 18. Let $X \in \{0,1\}$ be the bit output by the auxiliary algorithm, and let $\tilde X \in \{0,1\}$ be the bit output by the auxiliary algorithm when the random bits it uses are generated by Nisan's PRG with seed length $O(\log n\log\log n)$. Then $|\Pr(X = 1) - \Pr(\tilde X = 1)| \le \frac{1}{n^2}$.

Proof. The algorithm uses $O(\log n)$ bits of storage and $O(\log^2 n)$ bits of randomness. The claim follows by applying Nisan's PRG [116] with parameters $\varepsilon = 1/n^2$ and seed length $O(\log n\log\log n)$.

Theorem 19. Sieve and Selector can be implemented with O(log(n) log log n) random bits.

Proof. Let $N_j$ be the number of rounds $r$ for which $b_r = B_{r,j}$ when the algorithm is implemented with independent rounds, and let $\tilde N_j$ be that number of rounds when the algorithm is implemented with Nisan's PRG. Applying Lemma 18 we have, for every item $j$, that $|\Pr(\tilde N_j > 3R/4) - \Pr(N_j > 3R/4)| \le 1/n^2$. Thus, by a union bound, the probability that the heavy hitter $H$ is correctly identified changes by no more than $n/n^2 = 1/n$. The random seed requires $O(\log n\log\log n)$ bits of storage, and aside from the random seeds the algorithms use $O(\log n)$ bits of storage. Hence the total storage is $O(\log n\log\log n)$ bits.

2.2.5 F2 at all points

One approach to tracking $F_2$ at all times is to use the median of $O(\log n)$ independent copies of an $F_2$ estimator like the AMS algorithm [6]. A Chernoff bound drives the error probability to $1/\mathrm{poly}(n)$, which is small enough for a union bound over all times, but it requires $O(\log^2 n)$ bits of storage to maintain all of the estimators. The Chaining Inequality allows us to get a handle on the error during an interval of times. Our approach to tracking $F_2$ at all times is to take the median of $O(\log\frac1\varepsilon + \log\log n)$ Bernoulli processes. In any short enough interval, where $F_2$ changes by only a $(1+\Omega(\varepsilon^2))$ factor, each of the processes will maintain an accurate estimate of $F_2$ for the entire interval, with constant probability. Since there are only $O(\varepsilon^{-2}\log^2 n)$ intervals, we can apply Chernoff's Inequality to guarantee the tracking on every interval, which gives us the tracking at all times. This is a direct improvement over the $F_2$ tracking algorithm of [76], which for constant $\varepsilon$ requires $O(\log n(\log n + \log\log m))$ bits. The algorithm has the same structure as the AMS algorithm, except we replace their sketches with instances of JLBP. Theorem 2 follows immediately from Theorem 21.

Algorithm 6 An algorithm for approximating F2 at all points in the stream.
procedure F2ALWAYS(Stream S)
    N ← O(1/ε²), R ← O(log(ε⁻² log n))
    X_{i,r}^{(t)} ← JLBP(S) for i ∈ [N] and r ∈ [R]    ▷ Use a (1 ± ε/3)-embedding T in this step
    Y_r^{(t)} = (1/N) Σ_{i=1}^N ‖X_{i,r}^{(t)}‖_2^2
    return F̂_2^{(t)} = median_{r∈[R]}{Y_r^{(t)}} at each time t
end procedure
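The following hedged sketch illustrates F2ALWAYS: a median over R groups of the mean squared norm of N JLBP processes per group. It reuses the illustrative JLBP sketch from earlier, and the exact constants for N and R in the thesis are replaced by simple placeholder values.

```python
import math
import statistics

def f2_always(stream, n, eps, k=32, d=4):
    """Illustrative sketch of F2ALWAYS (Algorithm 6). Assumes the JLBP sketch
    above; N and R below are placeholder choices, not the thesis constants."""
    N = max(1, int(1.0 / eps ** 2))
    R = max(1, int(math.log(max(2.0, math.log(n + 2) / eps ** 2))) + 1)
    procs = [[JLBP(n, k, d, seed=100 * r + i) for i in range(N)] for r in range(R)]
    state = [[[0.0] * d for _ in range(N)] for _ in range(R)]
    estimates = []
    for item in stream:
        for r in range(R):
            for i in range(N):
                state[r][i] = procs[r][i].update(item)
        group_means = [sum(sum(x * x for x in state[r][i]) for i in range(N)) / N
                       for r in range(R)]
        estimates.append(statistics.median(group_means))   # F-hat_2 at this time
    return estimates
```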

Lemma 20. Let $N = O(\frac{1}{\delta\varepsilon^2})$ and let $X_i^{(t)}$, for $i = 1, \dots, N$, be independent copies of the output of JLBP$(S)$ using a fixed $(1\pm\frac\varepsilon8)$-embedding $T$ on an insertion-only stream $S$. Let $Y_t = \frac1N\sum_{i=1}^N\|X_i^{(t)}\|_2^2$. Suppose that for two given times $1 \le u < v \le m$ the stream satisfies $256 C^2 F_2^{(u:v)} \le \varepsilon^2 F_2^{(u)}$, where $F_2^{(u:v)} = \sum_{i=1}^n (f_i^{(u:v)})^2$ is the second moment of the change in the stream. Then
$$\Pr\big(|Y_t - F_2^{(t)}| \le \varepsilon F_2^{(t)},\ \text{for all } u \le t \le v\big) \ge 1 - 2\delta.$$

Proof. We first write $|Y_t - F_2^{(t)}| \le |Y_t - Y_u| + |Y_u - F_2^{(u)}| + |F_2^{(t)} - F_2^{(u)}|$. It follows from the arguments of AMS and the fact that $T$ is a $(1\pm\varepsilon/8)$-embedding that, with an appropriate choice for $N = O(\frac{1}{\delta\varepsilon^2})$, we arrive at
$$\Pr\big(|Y_u - F_2^{(u)}| \le \tfrac\varepsilon4 F_2^{(u)}\big) \ge 1 - \delta. \quad (2.9)$$
For the third term, we have $F_2^{(t)} \ge F_2^{(u)}$ because $t \ge u$ and the stream is insertion only. We can bound the difference with
$$F_2^{(t)} = \|f^{(u)} + f^{(u:t)}\|_2^2 \le \|f^{(u)}\|_2^2\Big(1 + \frac{\|f^{(u:t)}\|_2}{\|f^{(u)}\|_2}\Big)^2 \le F_2^{(u)}\Big(1 + \frac\varepsilon4\Big),$$
where the last inequality follows because $C \ge 2$ and $\varepsilon \le 1/2$.

For the first term, since the $X_i^{(t)}$, $i \in [N]$, are independent $d$-dimensional Bernoulli processes, it follows that
$$X^{(t)} = \frac{1}{\sqrt N}\big((X_1^{(t)})^T, (X_2^{(t)})^T, \dots, (X_N^{(t)})^T\big)^T$$
is an $Nd$-dimensional Bernoulli process. By Lemma 5 and due to the fact that $X^{(t)}$ can be represented as an output of the JLBP procedure, the process $X^{(u:t)} = X^{(t)} - X^{(u)}$ is a tame process, so with probability at least $1-\delta$, for all $u \le t \le v$ we have $\|X^{(u:t)}\|_2^2 \le C^2\sum_{j=1}^n (f_j^{(u:v)})^2$. Therefore, assuming the inequality inside (2.9),
$$Y_t = \|X^{(u)} + X^{(u:t)}\|_2^2 \le Y_u\Big(1 + \frac{\|X^{(u:t)}\|_2}{\|X^{(u)}\|_2}\Big)^2 \le Y_u\Big(1 + \frac{\sqrt{(1+\varepsilon)F_2^{(u:t)}}}{\sqrt{(1-\varepsilon)F_2^{(u)}}}\Big)^2 \le Y_u\Big(1 + \frac{2\varepsilon}{16C}\Big)^2 \le F_2^{(u)}(1 + \varepsilon/4),$$
where the last inequality follows because $C \ge 2$ and $\varepsilon \le 1/2$. The reverse bound $Y_t \ge F_2^{(u)}(1-\varepsilon/4)$ follows similarly upon applying the reverse triangle inequality in place of the triangle inequality. With probability at least $1 - 2\delta$,
$$|Y_t - F_2^{(t)}| \le |Y_t - Y_u| + |Y_u - F_2^{(u)}| + |F_2^{(t)} - F_2^{(u)}| \le \varepsilon F_2^{(u)} \le \varepsilon F_2^{(t)}.$$

Theorem 21. Let $S$ be an insertion-only stream and, for $t = 1, 2, \dots, m$, let $F_2^{(t)} = \sum_{i=1}^n (f_i^{(t)})^2$ and let $\hat F_2^{(t)}$ be the value that is output by Algorithm 6. Then
$$\Pr\big(|\hat F_2^{(t)} - F_2^{(t)}| \le \varepsilon F_2^{(t)},\ \text{for all } t\in[m]\big) \ge 2/3.$$
It uses $O\big(\frac{1}{\varepsilon^2}\log n\,(\log\log n + \log\frac1\varepsilon)\big)$ bits of space.

Proof. By Theorem 16, the (single) matrix used by all instances of JLBP is a $(1\pm\varepsilon/3)$-embedding with probability at least 0.99; henceforth assume it is so. Let $q_0 = 0$ and
$$q_i = \max\Big\{t\ \Big|\ F_2^{(t)} \le \Big(1+\frac{\varepsilon^2}{256C^2}\Big)^i\Big\},$$
until $q_K = m$ for some $K$. Notice that $K = O(\frac{1}{\varepsilon^2}\log n)$. Here, $C$ is the constant from Definition 3.

By definition of $q_i$ and using the fact that $(a-b)^2 \le a^2 - b^2$ for real numbers $0 \le b \le a$, we have $F_2^{(q_i:q_{i+1})} \le (F_2^{(q_{i+1})} - F_2^{(q_i)}) \le \frac{\varepsilon^2}{256C^2}F_2^{(q_i)}$. Applying Lemma 20 with $\delta = 1/10$, we have, for every $r\in[R]$ and $i\ge0$, that
$$\Pr\big(|Y_r^{(t)} - F_2^{(t)}| \le \varepsilon F_2^{(t)},\ \text{for all } q_i \le t \le q_{i+1}\big) \ge 0.8.$$
Thus, by a Chernoff bound, the median satisfies
$$\Pr\big(|\hat F_2^{(t)} - F_2^{(t)}| \le \varepsilon F_2^{(t)},\ \text{for all } q_i \le t \le q_{i+1}\big) \ge 1 - e^{-R/12} \ge 1 - \frac{1}{4K},$$
by our choice of $R = 12\log 4K = O(\log(\varepsilon^{-2}\log n))$. Thus, by a union bound over all of the intervals and the embedding $T$, we get $\Pr(|\hat F_2^{(t)} - F_2^{(t)}| \le \varepsilon F_2^{(t)},\ \text{for all } t\in[m]) \ge \frac23$, which completes the proof of correctness.

The algorithm requires, for the matrix $T$, the JL transform of Kane, Meka, and Nelson [84] with a seed length of $O(\log(n)\log(\frac1\varepsilon\log n))$ bits, and it takes only $O(\log(n/\varepsilon))$ bits of space to compute any entry of $T$. The algorithm maintains $NR = O(\varepsilon^{-2}\log(\frac1\varepsilon\log n))$ instances of JLBP, each of which requires $O(\log n)$ bits of storage for the sketch and random bits. Therefore, the total storage used by the algorithm is $O(\varepsilon^{-2}\log(n)\log(\frac1\varepsilon\log n))$.

2.3 BPTree: an $\ell_2$ heavy hitters algorithm using constant memory

This section is based on [32], work done in collaboration with Braverman V., Chestnut S., Woodruff D., Nelson J. and Wang Z.

2.3.1 Introduction

Our contributions. We provide a new one-pass algorithm, BPTree, which in the insertion-only model solves $\ell_2$ heavy hitters and achieves the $(\varepsilon, 1/\varepsilon^2)$-tail guarantee. For any constant $\varepsilon$ our algorithm only uses a constant $O(1)$ words of memory, which is optimal. This is the first optimal-space algorithm for $\ell_2$ heavy hitters in the insertion-only model for constant $\varepsilon$. The algorithm is described in Theorem 32.

En route to describing BPTree and proving its correctness, we describe another result that may be of independent interest. Theorem 22 is a new limited-randomness supremum bound for Bernoulli processes. Lemma 30 gives a more advanced analysis of the algorithm of Alon, Matias, and Szegedy (AMS) for approximating $\|f\|_2$ [6], showing that one can achieve the same (additive) error as the AMS algorithm at all points in the stream, at the cost of using 8-wise independent random signs rather than 4-wise independent signs. Note that Section 2.2 describes an algorithm using $O(\log\log n)$ words that does $F_2$ tracking in an insertion-only stream with a multiplicative error $(1\pm\varepsilon)$. The multiplicative guarantee is stronger, albeit with more space for the algorithm, but the result can be recovered as a corollary to our additive $F_2$ tracking theorem, which has a much simplified algorithm and analysis compared to Section 2.2.

After some preliminaries, Section 2.3.2 presents both algorithms and their analyses. The description of BPTree is split into three parts. Section 2.3.2 states and proves the chaining inequality. Section 2.3.3 presents the results of some numerical experiments.

Overview of approach. Here we describe the intuition for our heavy hitters algorithm in the case of a single heavy hitter $H \in [n]$ such that $f_H^2 \ge \frac{9}{10}\|f\|_2^2$. The reduction from multiple heavy hitters to this case is standard. Suppose also, for this discussion, that we knew a constant-factor approximation to $F_2 := \|f\|_2^2$. Our algorithm and its analysis use several of the techniques developed in Section 2.2. We briefly review that algorithm for comparison. Both CountSieve and BPTree share the same basic building block, which is a subroutine that identifies one bit of information about the identity of $H$. The one-bit subroutine hashes the elements of the stream into two buckets, computes one Bernoulli process in each bucket, and then compares the two values. The Bernoulli process is just the inner product of the frequency vector with a vector of Rademacher (i.e., uniform $\pm1$) random variables. The hope is that the Bernoulli process in the bucket with $H$ grows faster than the other one, so the larger of the two processes reveals which bucket contains $H$. In order to prove that the process with $H$ grows faster, Section 2.2 introduces a chaining inequality for insertion-only streams that bounds the supremum of the Bernoulli processes over all times. The one-bit subroutine essentially gives us a test that $H$ will pass with probability, say, at least 9/10 and that any other item passes with probability at most 6/10. The high-level strategy of both algorithms is to repeat this test sequentially over the stream.

CountSieve uses the one-bit subroutine in a two-part strategy to identify $\ell_2$ heavy hitters with $O(\log\log n)$ memory. The two parts are (1) amplify the heavy hitter so $f_H \ge (1 - \frac{1}{\mathrm{poly}(\log n)})\|f\|_2$ and (2) identify $H$ with independent repetitions of the one-bit subroutine. Part (1) winnows the stream from, potentially, $n$ distinct elements to at most $n/\mathrm{poly}(\log n)$ elements. The heavy hitter remains and, furthermore, we get $f_H \ge (1 - \frac{1}{\mathrm{poly}(\log n)})\|f\|_2$ because many of the other elements are removed. CountSieve accomplishes this by running $\Theta(\log\log n)$ independent copies of the one-bit subroutine in parallel, and discarding elements that do not pass a super-majority of the tests. A standard Chernoff bound implies that only $n/2^{O(\log\log n)} = n/\mathrm{poly}(\log n)$ items survive. Part (2) of the strategy identifies $\Theta(\log n)$ "break-points" where $\|f\|_2$ of the winnowed stream increases by approximately a $(1 + 1/\log n)$ factor from one break-point to the next. Because $H$ already accounts for nearly all of the value of $\|f\|_2$, it is still a heavy hitter within each of the $\Theta(\log n)$ intervals. CountSieve learns one bit of the identity of $H$ on each interval by running the one-bit subroutine. After all $\Theta(\log n)$ intervals are completed, the identity of $H$ is known.

BPTree merges the two parts of the above strategy. As above, the algorithm runs a series of $\Theta(\log n)$ rounds where the goal of each round is to learn one bit of the identity of $H$. The difference from CountSieve is that BPTree discards more items after every round, then recurses on learning the remaining bits. As the algorithm proceeds, it discards more and more items and $H$ becomes heavier and heavier in the stream. This is reminiscent of work on adaptive compressed sensing [79], but here we are able to do everything in a single pass given the insertion-only property of the stream. Given that the heavy hitter is even heavier, we can weaken our requirement on the two counters at the next level in the recursion tree: we now allow their suprema to deviate even further from their expectation, and this is precisely what saves us from having to worry that one of the $O(\log n)$ Bernoulli processes that we encounter while walking down the tree will have a supremum which is too large and cause us to follow the wrong path. The fact that the heavy hitter is even heavier also allows us to "use up" even fewer updates to the heavy hitter in the next level of the tree, so that overall we have enough updates to the heavy hitter to walk to the bottom of the tree.

Preliminaries. An insertion-only stream is a list of items $p_1, \dots, p_m \in [n]$. The frequency of $j$ at time $t$ is $f_j^{(t)} := \#\{i \le t \mid p_i = j\}$; $f^{(t)} \in \mathbb{Z}_{\ge0}^n$ is called the frequency vector. We denote $f := f^{(m)}$, $F_2^{(t)} = \sum_{i=1}^n (f_i^{(t)})^2$, $F_2 = \sum_{j=1}^n f_j^2$, and $F_0 = \#\{j\in[n] : f_j > 0\}$. An item $H\in[n]$ is an $\alpha$-heavy hitter if $f_H^2 \ge \alpha^2\sum_{j\neq H}f_j^2 = \alpha^2(F_2 - f_H^2)$.¹ For $W\subseteq[n]$, denote by $f^{(t)}(W)\in\mathbb{Z}_{\ge0}^n$ the frequency vector at time $t$ of the stream restricted to the items in $W$, that is, a copy of $f^{(t)}$ with the $i$th coordinate replaced by 0 for every $i\notin W$. We also define $f^{(s:t)}(W) := f^{(t)}(W) - f^{(s)}(W)$ and $F_2(W) = \sum_{j\in W}f_j^2$. In a case where the stream is semi-infinite (it has no defined end), $m$ should be taken to refer to the time of a query of

¹This definition is in a slightly different form from the one given in the introduction, but this form is more convenient when $f_H^2$ is very close to $F_2$.


interest. When no time is specified, quantities like $F_2$ and $f$ refer to the same query time $m$. Our algorithms make use of 2-universal (pairwise independent), 4-wise independent, and 8-wise independent hash functions. We will commonly denote such a function $h : [n]\to[p]$, where $p$ is a prime larger than $n$, or we may use $h : [n]\to\{0,1\}^R$, which may be taken to mean a function of the first type for some prime $p\in[2^{R-1}, 2^R)$. We use $h(x)_i$ to denote the $i$th bit, with the first bit most significant (big-endian). A crucial step in our algorithm involves comparing the bits of two values $a, b\in[p]$. Notice that, for any $0 \le r \le \lceil\log_2 p\rceil$, we have $a_i = b_i$, for all $1\le i\le r$, if and only if $|a - b| < 2^{\lceil\log_2 p\rceil - r}$. Therefore, the test $a_i = b_i$, for all $1\le i\le r$, can be performed with a constant number of operations.

We will use as a subroutine, and also compare our algorithm against, CountSketch [39]. To understand our results, one needs to know that CountSketch has two parameters, which determine the number of "buckets" and "repetitions" or "rows" in the table it stores. The authors of [39] denote these parameters $b$ and $r$, respectively. The algorithm selects, independently, $r$ functions $h_1, \dots, h_r$ from a 2-universal family with domain $[n]$ and range $[b]$ and $r$ functions $\sigma_1, \dots, \sigma_r$ from a 2-universal family with domain $[n]$ and range $\{-1,1\}$. CountSketch stores the value $\sum_{j : h_t(j) = i}\sigma_t(j)f_j$ in cell $(t, i)\in[r]\times[b]$ of the table.

In our algorithm we use the notation $1(A)$ to denote the indicator function of the event $A$; namely, $1(A) = 1$ if $A$ is true and 0 otherwise. We sometimes use $x \lesssim y$ to denote $x = O(y)$.
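The constant-time prefix test stated above can be written as a single subtraction and comparison. This tiny sketch just demonstrates the check described in the text; the function name and the example values are illustrative.

```python
def top_bits_equal(a, b, r, R):
    """Check the r most significant bits of two R-bit values via one subtraction,
    as stated in the preliminaries (R plays the role of ceil(log2 p))."""
    return abs(a - b) < (1 << (R - r))

# Example: 0b11010110 and 0b11010001 agree on their top 4 of 8 bits.
print(top_bits_equal(0b11010110, 0b11010001, r=4, R=8))  # True
```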

2.3.2 Algorithm and analysis

We will now describe and analyze the main algorithm, which is broken into several subroutines. The most important subroutine is HH1, Algorithm 7, which finds a single $O(1)$-heavy hitter assuming we have an estimate $\sigma$ of $\sqrt{F_2}$ such that $\sqrt{F_2} \le \sigma \le 2\sqrt{F_2}$. Next is HH2, Algorithm 8, which removes the assumption entailing $\sigma$ by repeatedly "guessing" values for $\sigma$ and restarting HH1 as more items arrive. The guessing in HH2 is where we need $F_2$ tracking. Finally, a well known reduction from finding $\varepsilon$-heavy hitters to finding a single $O(1)$-heavy hitter leads us to the main heavy hitters algorithm BPTree, which is formally described in Theorem 32. This section is organized as follows. The first subsection gives an overview of the algorithm and its analysis. Section 2.3.2 proves the bound on the expected supremum of the Bernoulli processes used by the algorithm. Section 2.3.2 uses the supremum bound to prove the correctness of the main subroutine HH1. Section 2.3.2 establishes the correctness of $F_2$ tracking. The subroutine HH2, which makes use of the $F_2$ tracker, and the complete algorithm BPTree are described and analyzed in Section 2.3.2.

Description of the algorithm. The crux of the problem is to identify one $K$-heavy hitter for some constant $K$. HH1, which we will soon describe in detail, accomplishes that task given a suitable approximation $\sigma$ to $\sqrt{F_2}$. HH2, which removes the assumption of knowing an approximation $\sigma \in [\sqrt{F_2}, 2\sqrt{F_2}]$, is described in Algorithm 8. The reduction from finding all $\varepsilon$-heavy hitters to finding a single $K$-heavy hitter is standard from the techniques of CountSketch; it is described in Theorem 32. HH1, Algorithm 7, begins with randomizing the item labels by replacing them with pairwise independent values on $R = \Theta(\log\min\{n, \sigma^2\})$ bits, via the hash function $h$.

Algorithm 7 Identify a heavy hitter.
procedure HH1(σ, p_1, p_2, ..., p_m)
    R ← 3⌊log_2(min{n, σ²} + 1)⌋
    Initialize b = b_1 b_2 ··· b_R = 0 ∈ [2^R]
    Sample h : [n] → {0, 1}^R from a 2-wise independent family
    Sample Z ∈ {−1, 1}^n 4-wise independent
    X_0, X_1 ← 0
    r ← 1, H ← −1
    β ← 3/4, c ← 1/32
    for t = 1, 2, ..., m and r < R do
        if h(p_t)_i = b_i, for all i ≤ r − 1 then
            H ← p_t
            X_{h(p_t)_r} ← X_{h(p_t)_r} + Z_{p_t}
            if |X_0 + X_1| ≥ cσβ^r then
                Record one bit b_r ← 1(|X_1| > |X_0|)
                Refresh (Z_i)_{i=1}^n, X_0, X_1 ← 0
                r ← r + 1
            end if
        end if
    end for
    return H
end procedure
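A hedged Python sketch of HH1 following the pseudocode above. Pseudo-random labels and signs replace the 2-wise and 4-wise independent families of the thesis, and the constants follow Algorithm 7; everything else about the interface is an assumption for illustration.

```python
import random

def hh1(sigma, stream, n):
    """Minimal sketch of HH1 (Algorithm 7): learn the hashed label of a heavy
    hitter one bit per round, discarding items whose prefix disagrees."""
    R = 3 * max(1, int(min(n, sigma ** 2) + 1).bit_length() - 1)
    rng = random.Random(0)
    h = [rng.getrandbits(R) for _ in range(n)]       # randomized R-bit labels
    sign = [rng.choice((-1, 1)) for _ in range(n)]
    X = [0.0, 0.0]
    r, H = 1, None
    beta, c = 3.0 / 4.0, 1.0 / 32.0
    prefix = 0                                       # learned bits b_1 ... b_{r-1}
    for p in stream:
        if r >= R:
            break
        if (h[p] >> (R - r + 1)) == prefix:          # item still active this round
            H = p
            bit = (h[p] >> (R - r)) & 1              # r-th bit of p's label
            X[bit] += sign[p]
            if abs(X[0] + X[1]) >= c * sigma * beta ** r:
                b_r = 1 if abs(X[1]) > abs(X[0]) else 0
                prefix = (prefix << 1) | b_r         # record the round's bit
                sign = [rng.choice((-1, 1)) for _ in range(n)]  # refresh signs
                X = [0.0, 0.0]
                r += 1
    return H
```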

Since $n$ and $\sigma^2 \ge F_2$ are both upper bounds for the number of distinct items in the stream, $R$ can be chosen so that every item receives a distinct hash value. Once the labels are randomized, HH1 proceeds in rounds wherein one bit of the randomized label of the heavy hitter is determined during each round. It completes all of the rounds and outputs the heavy hitter's identity within one pass over the stream. As the rounds proceed, items are discarded from the stream. The remaining items are called active. When the algorithm discards an item it will never reconsider it (unless the algorithm is restarted). In each round, it creates two Bernoulli processes $X_0$ and $X_1$. In the $r$th round, $X_0$ will be determined by the active items whose randomized labels have $r$th bit equal to 0, and $X_1$ by those with $r$th bit 1. Let $f_0^{(t)}, f_1^{(t)} \in \mathbb{Z}_{\ge0}^n$ be the frequency vectors of the active items in each category, respectively, initialized to 0 at the beginning of the round. Then the Bernoulli processes are $X_0^{(t)} = \langle Z, f_0^{(t)}\rangle$ and $X_1^{(t)} = \langle Z, f_1^{(t)}\rangle$, where $Z$ is a vector of 4-wise independent Rademacher random variables (i.e., the $Z_i$ are marginally uniformly random in $\{-1,1\}$).

The $r$th round ends when $|X_0 + X_1| > c\sigma\beta^{r-1}$, for specified constants $c$ and $\beta$.² At this point, the algorithm compares the values $|X_0|$ and $|X_1|$ and records the identity of the larger one as the $r$th bit of the candidate heavy hitter. All those items with $r$th bit corresponding to the smaller counter are discarded (made inactive), and the next round is started. After $R$ rounds are completed, if there is a heavy hitter then its randomized label will be known with good probability. The identity of the item can be determined by selecting an item in the stream that passes all of the $R$ bit-wise tests, or by inverting the hash function used for the label. If it is a $K$-heavy hitter, for a sufficiently large $K = O(1)$, then the algorithm will find it with probability at least 2/3. The algorithm is formally presented in Algorithm 7.

The most important technical component of the analysis is the following theorem, which is proved in Section 2.3.2. Theorem 22 gives us control of the evolution of $|X_0|$ and $|X_1|$ so we can be sure that the larger of the two identifies a bit of $H$.

Theorem 22. If $Z\in\{-1,1\}^n$ is drawn from a 4-wise independent family, then $\mathbb{E}\sup_t|\langle f^{(t)}, Z\rangle| < 23\cdot\|f^{(m)}\|_2$.

We will use C∗ < 23 to denote the optimal constant in Theorem 22. The key idea behind the algorithm is that as we learn bits of the heavy hitter and discard other items, it becomes easier to learn additional bits of the heavy hitter’s identity. With fewer items in the stream as the algorithm proceeds, the heavy hitter accounts for a larger and larger fraction of the remaining stream as time goes on. As the heavy hitter

² c = 1/32 and β = 3/4 would suffice.

gets heavier, the discovery of the bits of its identity can be sped up. When the stream does not contain a heavy hitter this acceleration of the rounds might not happen, though that is not a problem because when there is no heavy hitter the algorithm is not required to return any output. Early rounds will each use a constant fraction of the updates to the heavy hitter, but the algorithm will be able to finish all $R = \Theta(\log n)$ rounds because of the speed-up. The parameter $\beta$ controls the speed-up of the rounds. Any value of $\beta \in (\frac12, 1)$ can be made to work (possibly with an adjustment to $c$), but the precise value affects the heaviness requirement and the failure probability.

Proof of Theorem 22. Let $Z\in\{-1,1\}^n$ be random. We are interested in bounding $\mathbb{E}\sup_t|\langle f^{(t)}, Z\rangle|$. It was shown in [31] that if each entry in $Z$ is drawn independently and uniformly from $\{-1,1\}$, then $\mathbb{E}\sup_t|\langle f^{(t)}, Z\rangle| \lesssim \|f^{(m)}\|_2$. We show that this inequality still holds if the entries of $Z$ are drawn from a 4-wise independent family, which is used both in our analysis of HH1 and in our $F_2$ tracking algorithm. The following lemma is implied by [67].

Lemma 23 (Khintchine's inequality). Let $Z\in\{-1,1\}^n$ be chosen uniformly at random, and $x\in\mathbb{R}^n$ a fixed vector. Then for any even integer $p$, $\mathbb{E}\langle Z, x\rangle^p \le \sqrt{p}^{\,p}\cdot\|x\|_2^p$.

Proof of Theorem 22. To simplify notation, we first normalize the vectors in $\{f^{(0)} = 0, f^{(1)}, \dots, f^{(m)}\}$ (i.e., divide by $\|f^{(m)}\|_2$). Denote the set of these normalized vectors by $T = \{v_0, \dots, v_m\}$, where $\|v_m\|_2 = 1$. Recall that an $\varepsilon$-net of some set of points $T$ under some metric $d$ is a set of points $T'$ such that for each $t\in T$, there exists some $t'\in T'$ such that $d(t,t') \le \varepsilon$. For every $k\in\mathbb{N}$, we can find a $1/2^k$-net of $T$ in $\ell_2$ with size $|S_k| \le 2^{2k}$ by a greedy construction as follows.

To construct an ε-net for T, we first take v0, then choose the smallest i such that kvi − v0k2 > ε, and so on. To prove the number of elements selected is upper bounded by

57 Chapter 2. `2 heavy hitters algorithm with fewer words 2 1/ε , let u0, u1, u2,..., ut denote the vectors we selected accordingly, and note that the 2 second moments of u1 − u0, u2 − u1,..., ut − ut−1 are greater than ε . Because the vectors 2 ui − ui−1 have non-negative coordinates, kutk2 is lower bounded by the summation of 2 2 these moments, while on the other hand kutk2 ≤ 1. Hence the net is of size at most 1/ε . Let S be a set of vectors. Let Z ∈ {−1, 1}n be drawn from a p-wise independent family, where p is an even integer. By Markov and Khintchine’s inequality,

E p 1/p |hx, Zi| Pr(|hx, Zi| > λ · |S| · kxk2) < p λp · |S| · kxk √ 2 1  pp < · . |S| λ

Hence,

Z ∞ E sup|hx, Zi| = Pr(sup |hx, Zi| > u)du x∈S 0 x∈S 1/p = |S| · sup kxk2· x∈S Z ∞ 1/p Pr(sup |hx, Zi| > λ · |S| · sup kxk2)dλ 0 ∈ ∈ x S √ x S  Z ∞  p  1/p √ p < |S| · sup kxk2 · p + √ dλ x∈S p λ (union bound)   1/p √ 1 = |S| · sup kxk2 · p · 1 + x∈S p − 1

Now we apply a similar chaining argument as in the proof of Dudley’s inequality

k k k−1 (cf. [50]). For x ∈ T, let x denote the closest point to x in Sk. Then kx − x k2 ≤ k k−1 k k−1 kx − xk2 + kx − x k2 ≤ (1/2 ) + (1/2 ). Note that if for some x ∈ T one has that

xk = vt is the closest vector to x in Tk (under `2), then the closest vector xk−1 to x in

58 Chapter 2. `2 heavy hitters algorithm with fewer words 0 Tk−1 must either be the frequency vector vt0 in Tk−1 such that t is the smallest timestamp

after t of a vector in Tk−1, or the largest timestamp before t in Tk−1. Thus the size of k k−1 2k+1 {x − x |x ∈ T} is upper bounded by 2|Sk| ≤ 2 , implying for p = 4

∞ E sup |hx, Zi| ≤ ∑ E sup |hxk − xk−1, Zi| x∈T k=1 ∞ √  1  < 3 · 21/p p 1 + ∑ (22k)1/p · (1/2k) − p 1 k=1 < 23.

Identifying a single heavy hitter given an approximation to F2. This section analyzes the subroutine HH1, which is formally presented in Algorithm 7. The goal of this section is to prove Lemma 27, the correctness of HH1. We use H ∈ [n] to stand for the identity of the most frequent item in the stream. It is not assumed to be a heavy hitter unless explicitly stated. The first step of HH1 is to choose a hash function h : [n] → {0, 1}R, for R = O(log n), that relabels the universe of items [n]. For each r ≥ 0, let

Hr := {i ∈ [n] \{H} | h(i)k = h(H)k for all 1 ≤ k ≤ r},

∅ and let H¯ r := Hr−1 \Hr, with H¯ 0 = for convenience. By definition, HR ⊆ HR−1 ⊆

· · · ⊆ H0 = [n] \{H}, and, in round r ∈ [R], our hope is that the active items are those in

Hr−1. The point of randomizing the labels is as a sort of “load balancing” among the item labels. The idea is that each bit of h(H) partitions the active items into two roughly equal

59 Chapter 2. `2 heavy hitters algorithm with fewer words

H¯ 2 H1 0 HR−1 H0 0 0 0 HR 1 0 1 H2 1 1 H¯ R H¯ 1

FIGURE 2.1: In this example of the execution of HH1, the randomized label h(H) of the heavy hitter H begins with 01 and ends with 00. Each node in the tree corresponds to a round of HH1, which must follow the path from H0 to HR for the output to be correct.

sized parts, i.e. |Hr| ≈ |H¯ r| in every round r. This leads HH1 to discard roughly half of the active items after each round, allowing us to make progress on learning the (hashed) identity of a heavy hitter. We will make use of the randomized labels in the next section, within the proof of Lemma 24.

2 For h, we recommend choosing a prime p ≈ min{n, F2} and assigning the labels

h(i) = a0 + a1i mod p, for a0 and a1 randomly chosen in {0, 1, . . . , p − 1} and a1 6= 0.

We can always achieve this with R = 3 log2(min{n, F2} + 1), which is convenient for the upcoming analysis. This distribution on h is known to be a 2-wise independent family [35]. Note computing h(i) for any i takes O(1) time. It is also simple to invert: namely

−1 x = a1 (h(x) − a0) mod p, so x can be computed quickly from h(x) when p > n. Invert-

ing requires computing the inverse of a1 modulo p, which takes O(log min{n, F2}) time via repeated squaring, however this computation can be done once, for example during initialization of the algorithm, and the result stored for all subsequent queries. Thus, the

−1 time to compute a1 mod p is negligible in comparison to reading the stream. After randomizing the labels HH1 proceeds with the series of R rounds to identify the (randomized) label h(H) of the heavy hitter H. The sequence of rounds is depicted in Figure 2.1. Each node in the tree corresponds to one round of HH1. The algorithm

60 Chapter 2. `2 heavy hitters algorithm with fewer words traverses the tree from left to right as the rounds progress. Correctness of HH1 means it traverses the path from H0 to HR. The 0/1 labels on the path leading to HR are the bits of h(H), and when R = O(log n) is sufficiently large we get HR = {H} with high probability. Thus the algorithm correctly identifies h(H), from which it can determine H with the method discussed earlier. Now let us focus on one round and suppose that H is a K-heavy hitter, for some large constant K. Suppose the algorithm is in round r ≥ 1, and recall that the goal of the round is to learn the rth bit of h(H). Our hope is that the active items are those in Hr−1 (other- wise the algorithm will fail), which means that the algorithm has correctly discovered the

first r − 1 bits of h(H). The general idea is that HH1 partitions Hr−1 ∪ {H} into Hr ∪ {H} and H¯ r, creates a Bernoulli process for each of those sets of items, and compares the val- ues of the two Bernoulli processes to discern h(H)r. Suppose that the active items are indeed Hr−1 and, for the sake of discussion, that the rth bit of the heavy hitter’s label is h(H)r = 0. Then the Bernoulli processes X0 and X1, defined in Algorithm 7, have the following form

(s:t) (s:t) (s:t) X0(t) = ZH fH + ∑ Zi fi , X1(t) = ∑ Zi fi , i∈Hr i∈H¯ r where s ≤ m is the time of the last update to round r − 1 and t is the current time. To simplify things a little bit we adopt the notation f (S) for the frequency vector re- (s:t) stricted to only items in S. For example, in the equations above become X0 = ZH fH + (s:t) (s:t) hZ, f (Hr)i and X1 = hZ, f (H¯ r)i.

The round is a success if |X0| > |X1| (because we assumed h(H)r = 0) at the first r time when |X0 + X1| > cσβ . When that threshold is crossed, we must have |X0| ≥ r r cσβ /2, |X1| ≥ cσβ /2, or both. The way we will ensure that the round is a success is by

61 Chapter 2. `2 heavy hitters algorithm with fewer words (s:t) establishing the following bound on the Bernoulli process X1: |X1| = |hZ, f (H¯ r)i| < cσβr/2, for all times t ≥ s. Of course, the round does not end until the threshold is

crossed, so we will also establish a bound on the complementary Bernoulli process |X0 − (s:t) (s:t) ¯ r ZH fH | = |hZ, f (Hr)i| < cσβ /2, at all times t ≥ s. When this holds we must r (s:t) r have |X0 + X1| > cσβ no later than the first time t where fH ≥ 2cσβ , so the round ends after at most 2cσβr updates to H. In total over all of the rounds this uses up no √ r more than ∑r≥0 2cσβ < fH updates to H, where we have used σ < 2 F2 < 3 fH by our assumption that H is a heavy hitter. In truth, both of those inequalities fail to hold with some probability, but the failure probability is O(1/βr2r/2) so the probability that the

r r/2 algorithm succeeds will turn out to be 1 − ∑r≥0 O(1/β 2 ) > 2/3. The next lemma establishes the control on the Bernoulli processes that we have just

described (compare the events Ei with the previous paragraph). We will use it later with K ≈ βr so that, while the rounds progress, the upper bounds on the process maxima and the failure probabilities both decrease geometrically as desired. This means that the lengths of the rounds decreases geometrically and the latter means that a union bound

2 suffices to guarantee that all of the events Ei occur. In our notation F2 − fH = F2(H0).

Lemma 24. For any r ∈ {0, 1, . . . , R} and K > 0, the events

  D (s:t) E 1/2 E2r−1 := max | Z, f (H¯ r) | ≤ KF2(H0) s,t≤m

and   D (s:t) E 1/2 E2r := max | Z, f (Hr) | ≤ KF2(H0) s,t≤m

have respective probabilities at least 1 − 4C∗ of occurring, where C < 23 is the constant from K2r/2 ∗ Theorem 22.

62 Chapter 2. `2 heavy hitters algorithm with fewer words Proof. By the Law of Total Probability and Theorem 22 with Markov’s Inequality we have

  (t) 1 1/2 Pr max |hZ, f (Hr)i| ≥ KF2(H0) t≤m 2    E (t) 1 1/2 = Pr max |hZ, f (Hr)i| ≥ KF2(H0) Hr t≤m 2 ( ) 2C F (H )1/2 2C F (H )1/2 ≤ E ∗ 2 r ≤ ∗ 2 0 1/2 1/2 r/2 , KF2(H0) KF2(H0) 2

where the last inequality is Jensen’s. The same holds if Hr is replaced by H¯ r. Applying the triangle inequality to get

(s:t) (s) (t) |hZ, f (Hr)i| ≤ |hZ, f (Hr)i| + |hZ, f (Hr)i|

we then find P(E ) ≥ 1 − 4C∗ . A similar argument proves P(E ) ≥ 1 − 4C∗ . 2r K2r/2 2r−1 K2r/2

From here the strategy to prove the correctness of HH1 is to inductively use Lemma 24 to bound the success of each round. The correctness of HH1, Lemma 27, follows directly from Lemma 25.

Let U be the event {h(j) 6= h(H) for all j 6= H, fj > 0} which has, by pairwise inde- −R 1 pendence, probability Pr(U) ≥ 1 − F02 ≥ 1 − 2 , recalling that F0 ≤ min{n, F2} min{n,F2} is the number of distinct items appearing before time m. The next lemma is the main proof of correctness for our algorithm. √ 0 0 p Lemma 25. Let K ≥ 128, c = 1/32, and β = 3/4. If 2K C∗ F2(H0) ≤ σ ≤ 2 2 fH and

0 p 1 √8 fH > 2K C∗ F2(H0) then, with probability at least 1 − 2 − 0 the algorithm min{F2,n} K c( 2β−1) HH1 returns H.

Proof. Recall that H is active during round r if it happens that h(H)i = bi, for all 1 ≤ i ≤ r − 1, which implies that updates from H are not discarded by the algorithm during round

63 Chapter 2. `2 heavy hitters algorithm with fewer words 0 r 2R r. Let K = K(r) = K cC∗β in Lemma 24, and let E be the event that U and ∩r=1Er both occur. We prove by induction on r that if E occurs then either br = h(H)r, for all r ∈ [R] or H is the only item appearing in the stream. In either case, the algorithm correctly outputs H, where in the former case it follows because E ⊆ U.

Let r ≥ 1 be such that H is still active in round r, i.e. bi = h(H)i for all 1 ≤ i ≤ r − 1. Note that all items are active in round 1. Since H is active, the remaining active items are exactly Hr−1 = Hr ∪ H¯ r. Let tr denote the time of the last update received during the rth round, and define t0 = 0. At time tr−1 ≤ t < tr we have

r cσβ > |X0 + X1|

(t :t) (tr−1:t) ¯ r−1 = |hZ, f (Hr ∪ Hr)i + ZH fH |

(tr−1:t) 1/2 ≥ fH − K(r − 1)F2(H0) ,

where the last inequality follows from the definition of E2(r−1). Rearranging and using the assumed lower bound on σ, we get the bound

1/2 K(r − 1) 1 r−1 K(r − 1)F2(H0) ≤ 0 σ = cσβ . (2.10) 2K C∗ 2

(tr−1:tr) (tr−1:t) 3 r−1 Therefore, by rearranging we see fH ≤ 1 + fH < 1 + 2 cσβ . That implies √ √ (tr) ∑r (tk−1:tk) 3 ∑r k−1 3 2c 3 2c fH = k=1 fH < r + 2 cσ k=1 β ≤ r + 1−β fH. Thus, if fH − 1−β fH > R then round r ≤ R is guaranteed to be completed and a further update to H appears after the √ 3 2c 1 round. Suppose, that is not the case, and rather R ≥ fH − 1−β fH ≥ 2 fH, where the last inequality follows from our choices β = 3/4 and c = 1/32. Then, by the definition of R,

2 2 2 1 2 9(1 + log2 8 fH) ≥ R ≥ 4 fH. One can check that this inequality implies that fH ≤ 104. 0 Now K ≥ 128 and the heaviness requirement of H implies that F2(H0) = 0. Therefore, H

64 Chapter 2. `2 heavy hitters algorithm with fewer words is the only item in the stream, and, in that case the algorithm will always correctly output H.

r Furthermore, at the end of round r, |X0 + X1| ≥ cσβ , so we have must have either r r |X0| ≥ cσβ /2 or |X1| ≥ cσβ /2. Both cannot occur for the following reason. The events

E2r−1 and E2r occur, recall these govern the non-heavy items contributions to X0 and X1, and these events, with the inequality (2.10), imply

1 |hZ, f (tr−1:tr)(H )i| ≤ K(r)F (H )1/2 < cσβr r 2 0 2

and the same holds for H¯ r. Therefore, the Bernoulli process not including H has its value r smaller than cσβ /2, and the other, larger process identifies the bit h(H)r. By induction, the algorithm completes every round r = 1, 2, . . . , R and there is at least one update to H after round R. This proves the correctness of the algorithm assuming the event E occurs. It remains to compute the probability of E. Lemma 24 provides the bound

R 1 8C∗ Pr(U and ∩2R E ) ≥ 1 − − ∑ i=1 i { }2 r/2 min n, F2 r=0 K(r)2 1 R 8 = 1 − − ∑ { }2 0 r r/2 min n, F2 r=0 K cβ 2 1 8 > 1 − − √ . 2 0 min{n, F2} K c( 2β − 1)

1/2 1/2 p p Proposition 26. Let α ≥ 1. If F2 ≤ σ ≤ 2F2 and fH ≥ α F2(H0) then α F2(H0) ≤ √ σ ≤ 2 2 fH.

Proof. 2 ≥ F ≥ f 2 = F − F (H ) ≥ ( − 1 )F ≥ 1 2 σ 2 H 2 2 0 1 1+α2 2 8 σ .

65 Chapter 2. `2 heavy hitters algorithm with fewer words Lemma 27 (HH1 Correctness). There is a constant K such that if H is a K-heavy hitter and √ √ F2 ≤ σ ≤ 2 F2, then with probability at least 2/3 algorithm HH1 returns H. HH1 uses O(1) words of storage.

Proof. The Lemma follows immediately from Proposition 26 and Lemma 25 by setting

0 13 14 K = 2 , which allows K = 2 C∗ ≤ 380, 000.

F2 Tracking. This section proves that the AMS algorithm with 8-wise, rather than 4-wise, independent random signs has an additive εF2 approximation guarantee at all points in the stream. We will use the tracking to “guess” a good value of σ for input to HH1, but, because the AMS algorithm is a fundamental streaming primitive, it is of independent interest from the BPTree algorithm. The following theorem is a direct consequence of Lemma 30 and [6].

Theorem 28. Let 0 < ε < 1. There is a streaming algorithm that outputs at each time t a value ˆ(t) (| ˆ(t) − (t)| ≤ ≤ ≤ ) ≥ − ( 1 1 ) F2 such that Pr F2 F2 εF2, for all 0 t m 1 δ. The algorithm use O ε2 log δ ( 1 1 ) words of storage and has O ε2 log δ update time.

Let us remark that it follows from Theorem 28 and a union bound that one can achieve

−2 a (1  ε) multiplicative approximation to F2 at all points in the stream using O(ε log log m) words. The proof that this works breaks the stream into O(log m) intervals of where the

change in F2 doubles. In its original form [6], the AMS sketch is a product of the form Π f , where Π is a random k × n matrix chosen with each row independently composed of 4-wise inde- pendent 1 random variables. The sketch uses k = Θ(1/ε2) rows to achieve a (1  ε)- approximation with constant probability. We show that the AMS sketch with k ' 1/ε2

rows and 8-wise independent entries provides `2-tracking with additive error εk f k2 at all

66 Chapter 2. `2 heavy hitters algorithm with fewer words (t) (m) times. We define vt = f /k f k2 so kvtk2 ≤ 1, for all t ≥ 0, and kvmk2 = 1. Define

T = {v0, v1,..., vm}. We use kAk to denote the spectral norm of A, which is equal to its largest singular value, and kAkF for the Frobenius norm, which is the Euclidean length of A when viewed as a vector. Our proof makes use of the following moment bound for quadratic forms. Recall that given a metric space (X, d) and ε > 0, an ε-net of X is a subset

N ⊆ X such that d(x, N) = infy∈N d(x, y) ≤ ε for all x ∈ X.

Rn×n Theorem 29 (Hanson-Wright [70]). For B ∈ symmetric with (Zi) uniformly random in n T T √ {−1, 1} , for all p ≥ 1, kZ BZ − EZ BZkp . pkBkF + pkBk.

Observe the sketch can be written Πx = AxZ, where

k n 1 i Ax := √ ∑ ∑ xjei ⊗ en(i−1)+j k i=1 j=1   −x− 0 ··· 0      0 −x− · · · 0  1   = √   . k  . . .   . . .    0 0 · · · −x−

E T E T T We are thus interested in bounding Z supx∈T |Z BxZ − Z BxZ|,for Bx = Ax Ax. Note

for any kxk2, kyk2 ≤ 1, T T kxx − yy kF ≤ 4kx − yk2. (2.11)

& 2 kn Lemma 30 (F2 tracking). If k 1/ε and Z ∈ {−1, +1} are 8-wise independent then

E Π (t) 2 (t) 2 2 sup k f k2 − k f k2 ≤ εk f k2. t

67 Chapter 2. `2 heavy hitters algorithm with fewer words T Proof. Let Ax, x ∈ T, as defined above and Bx = Ax Ax. By (2.11),

4 kBx − BykF ≤ √ kx − yk2, k

for all x, y ∈ T. In particular,

√ sup kBxk ≤ sup kBxkF ≤ 1/ k. x∈T x∈T

` ` Let T` be a (1/2 )-net of T under `2; we know we can take |T`| ≤ 4 . B` = {Bx : x ∈ T`} √ ` is a 1/ k2 -net under k · k and also under k · kF. For x ∈ T, let x` ∈ T` denote the closest ` = + ∑∞ ∆ ∆ = − element in T`, under 2. Then we can write Bx Bx0 `=1 x` , where x` Bx` Bx`−1 . T T E T T For brevity, we will also define γ(A) := |Z AZ − Z AZ |. Thus if the (Zi) are 2p-wise independent

∞ E E E ∆ sup γ(Bx) ≤ sup γ(Bx0 ) + sup ∑ γ( x` ) x∈T x∈T x∈T `=1 ∞ . p E ∆ √ + ∑ sup γ( x` ) (2.12) k `=1 x∈T

If A ∈ Rn×n is symmetric, then by the Hanson-Wright Inequality

" √ p  p# 1 C pkAkF CpkAk Pr(γ(A) > λ · S1/p) < · + S λ λ

68 Chapter 2. `2 heavy hitters algorithm with fewer words for some constant C > 0. Thus if A is a collection of such matrices, |A| = S, choosing ∗ √ u = C( p · supA∈A kAkF + p · supA∈A kAk)

Z ∞ E sup γ(A) = Pr(sup γ(A) > u)du A∈A 0 A∈A Z ∞ = S1/p · Pr(sup γ(A) > λ · S1/p)dλ 0 A∈A Z ∞ = S1/p(u∗ + Pr(sup γ(A) > λ · S1/p)dλ) ∗ u A∈A 1/p √ . S ( p · sup kAkF + p · sup kAk) (2.13) A∈A A∈A

Now by applying (2.13) to (2.12) repeatedly with

A = A` = {Bx` − Bx`−1 : x ∈ T},

√ noting sup kAk ≤ sup kAk ≤ 1/ k2` and |A | ≤ 2|T | ≤ 2 · 22`, A∈A` A∈A` F ` `

∞ ` −` p p22 p p E sup γ(Bx) . √ + ∑ √ . √ x∈T k `=1 k k

for p ≥ 4. Thus it suffices for the entries of Z to be 2p-wise independent, i.e. 8-wise independent.

The complete heavy hitters algorithm We will now describe HH2, formally Algorithm 8, which is an algorithm that removes the assumption on σ needed by HH1. It is followed by √ the complete algorithm BPTree. The step in HH2 that guesses an approximation σ for F2

works as follows. We construct the estimator Fˆ2 of the previous section to (approximately)

track F2. HH2 starts a new instance of HH1 each time the estimate Fˆ2 crosses a power of √ 2. Each new instance is initialized with the current estimate of F2 as the value for σ,

69 Chapter 2. `2 heavy hitters algorithm with fewer words Algorithm 8 Identify a heavy hitter by guessing σ. HH2 procedure (p1, p2,...,pm) Run Fˆ2 from Theorem 28 with ε = 1/100 and δ = 1/20 Start HH1 (1, p1,. . . , pm) ˆ(t) k Let t0 = 1 and tk = min{t | F2 ≥ 2 }, for k ≥ 1. for each time tk do HH1(( ˆ(tk))1/2 ) Start F2 , ptk , ptk+1,... pm Let Hk denote its output if it completes Discard Hk−2 and the copy of HH1 started at tk−2 end for return Hk−1 end procedure

but HH2 maintains only the two most recent copies of HH1. Thus, even though, overall, it may instantiate Ω(log n) copies of HH1 at most two will running concurrently and the total storage remains O(1) words. At least one of the thresholds will be the “right” one, √ √ in the sense that HH1 gets initialized with σ in the interval [ F2, 2 F2], so we expect the corresponding instance of HH1 to identify the heavy hitter, if one exists.

The scheme could fail if Fˆ2 is wildly inaccurate at some points in the stream, for ex- ample if Fˆ2 ever grows too large then the algorithm could discard every instance of HH1 that was correctly initialized. But, Theorem 28 guarantees that it fails only with small probability. We begin by proving the correctness of HH2 in Lemma 31 and then complete the de- scription and correctness of BPTree in Theorem 32.

Lemma 31. There exists a constant K > 0 and a 1-pass streaming algorithm HH2, Algorithm 8, such that if the stream contains a K-heavy hitter then with probability at least 0.6 HH2 returns the identity of the heavy hitter. The algorithm uses O(1) words of memory and O(1) update time.

Proof. The space and update time bounds are immediate from the description of the algo- rithm. The success probability follows by a union bound over the failure probabilities in

70 Chapter 2. `2 heavy hitters algorithm with fewer words Lemma 27 and Theorem 28, which are 1/3 and δ = 0.05 respectively. It remains to prove

that there is a constant K such that conditionally given the success of the F2 estimator, the hypotheses of Lemma 27 are satisfied by the penultimate instance of HH1 by HH2. Let K0 denote the constant from Lemma 27 and set K = 12K0, so if H is a K-heavy 0 √ 0 (t ) 1/2 (t) 1/2 hitter then for any α > 2/K and in any interval (t, t ] where (F2 ) − (F2 ) ≥ α F2 we will have

0 (t:t ) 1/2 (t:t0) (t0) (t) p fH + F2(H0) ≥ k f k2 ≥ k f k2 − k f k2 ≥ α F2.

If follows with in the stream pt, pt+1,..., pt0 the heaviness of H is at least

(t:t0) √ p fH α F2 − F2(H0) Kα 0 ≥ p ≥ Kα − 1 ≥ . (2.14) (t:t ) 1/2 F (H ) 2 F2 (H0) 2 0

(t:t0) Of course, if F2 (H0) = 0 the same heaviness holds. ˆ(tk−1) 1/2 Let k be the last iteration of HH2. By the definition of tk, we have (F2 ) ≥ 1 ˆ 1/2 1 p 4 (F2) ≥ 4 (1 − ε)F2. Similar calculations show that there exists a time t∗ > tk−1 (t ) (t ) (t ) ∗ 1/2 k−1 1/2 1 tk−1:t∗ ˆ k−1 1/2 tk−1:t∗ such that (F2 ) − (F2 ) ≥ 6 F2 and k f k2 ≤ (F2 ) ≤ 2k f k2. In ˆ(tk−1) particular, the second pair of inequalities implies that F2 is a good “guess” for σ on ∗ 0 the interval (tk−1, t ]. We claim H is a K heavy hitter on that interval, too. Indeed, be- 0 ∗ cause of (2.14), with α = 1/6, we get that H is a K -heavy hitter on the interval (tk−1, t ].

∗ This proves that the hypotheses of Lemma 27 are satisfied for the stream ptk−1+1,..., pt .

It follows that from Lemma 27 that HH1 correctly identifies Hk−1 = H on that substream

and the remaining updates in the interval (t∗, tm] do not affect the outcome.

A now standard reduction from ε-heavy hitters to O(1)-heavy hitters gives the fol- lowing theorem. The next section describes an implementation that is more efficient in

71 Chapter 2. `2 heavy hitters algorithm with fewer words practice.

Theorem 32. For any ε > 0 there is 1-pass streaming algorithm BPTree that, with probability

ε at least (1 − δ), returns a set of 2 -heavy hitters containing every ε-heavy hitter and an approxi- mate frequency for every item returned satisfying the (ε, 1/ε2)-tail guarantee. The algorithm uses ( 1 1 ) ( 1 ) ( −2 1 ) O ε2 log εδ words of space and has O log εδ update and O ε log εδ retrieval time.

Proof. The algorithm BPTree constructs a in the same manner as CountSketch where the items are hashed into b = O(1/ε2) buckets for r = O(log 1/εδ) repetitions. On the stream fed into each bucket we run an independent copy of HH2. A standard r × b CountSketch is also constructed. The constants are chosen so that when an ε-heavy hitter in the stream is hashed into a bucket it becomes a K-heavy hitter with probability at least 0.95. Thus, in any bucket with a the ε-heavy hitter, the heavy hitter is identified with probability at least 0.55 by Lemma 31 and the aforementioned hashing success probability. At the end of the stream, all of the items returned by instances of HH2 are collected and their frequencies checked using the CountSketch. Any items that cannot be ε-heavy hitters are discarded. The correctness of this algorithm, the bound on its success probability, and the (ε, 1/ε2)-tail guarantee follow directly from the correctness of CountSketch and the fact that no more than O(ε−2 log(1/δε)) items are identified as potential heavy hitters.

We can amplify the success probability of HH2 to any 1 − δ by running O(log(1/δ)) copies in parallel and taking a majority vote for the heavy hitter. This allows one to track O(1)-heavy hitters at all points in the stream with an additional O(log log m) fac- tor in space and update time. The reason is because there can be a succession of at most O(log m) 2-heavy hitters in the stream, since their frequencies must increase geometri- cally, so setting δ = Θ(1/ log m) is sufficient. The same scheme works for BPTree tree, as

72 Chapter 2. `2 heavy hitters algorithm with fewer words

well, and if one replaces each of the counters in the attached CountSketch with an F2-at- all-times estimator of [31] then one can track the frequencies of all ε-heavy hitters at all ( 1 ( + 1 )) times as well. The total storage becomes O ε2 log log n log ε words and the update 1 time is O(log log n + log ε ).

2.3.3 Experimental Results

We implemented HH2 in C to evaluate its performance and compare it against the CountS- ketch for finding one frequent item. The source code is available from the authors upon request. In practice, the hashing and repetitions perform predictably, so the most impor- tant aspect to understand the performance of BPTree is determine the heaviness constant K where HH2 reliably finds K-heavy hitters. Increasing the number of buckets that the algorithm hashes to effectively decreases n. Therefore, in order to maximize the “effec- tive” n of the tests that we can perform within a reasonable time, we will just compare CountSketch against HH2. The first two experiments help to determine some parameters for HH2 and the heavi- ness constant K. Afterwards, we compare the performance of HH2 and CountSketch for finding a single heavy hitter in the regime where the heavy hitter frequency is large enough so that both algorithms work reliably. Streams. The experiments were performed using four types of streams (synthetic data). In all cases, one heavy hitter is present. For a given n and α there are n items √ with frequency 1 and one item, call it H, with frequency α n. If α is not specified then it is taken to be 1. The four types of streams are (1) all occurrences of H occur at the start of the stream, (2) all occurrences of H at the end of the stream, (3) occurrences of H placed randomly in the stream, and (4) occurrences of H placed randomly in blocks of n1/4.

73 Chapter 2. `2 heavy hitters algorithm with fewer words

avg. maximum F2 tracking error worst maximum F2 tracking error b \ r 1 2 4 8 16 1 2 4 8 16 1 1.2 0.71 0.82 0.66 0.59 4.3 1.2 2.7 0.85 0.86 10 0.35 0.30 0.33 0.19 0.16 1.1 0.68 0.91 0.28 0.20 100 0.12 0.095 0.080 0.074 0.052 0.24 0.17 0.13 0.13 0.10 1000 0.044 0.030 0.028 0.018 0.017 0.076 0.060 0.045 0.029 0.024

TABLE 2.3: Average and maximum F2 tracking error over 10 streams for dif- ferent choices of b and r.

The experiments were run on a server with two 2.66GHz Intel Xenon X5650 proces- sors, each with 12MB cache, and 48GB of memory. The server was running Scientific Linux 7.2.

F2 tracking experiment. The first experiment tests the accuracy of the F2 tracking for

different parameter settings. We implemented the F2 tracking in C using the speed-up of

[140]. The algorithm uses the same r × b table as a CountSketch. To query F2 one takes the median of r estimates, each of which is the sum of the squares of the b entries in a row of the table. The same group of ten type (3) streams with n = 108 and α = 1 was used for each of the parameter settings.

The results are presented in Table 2.3. Given the tracker Fˆ2(t) and true evolution of the second moment F2(t), we measure the maximum F2 tracking error of one instance as maxt |Fˆ2(t) − F2(t)|/F2, where F2 is the value of the second moment at the end of the stream. We report the average maximum tracking error and the worst (maximum) maxi- mum tracking over each of the ten streams for every choice of parameters. The table indicates that, for every choice of parameter settings, the worst maximum tracking error is not much worse than the average maximum tracking error. We observe that the tracking error has relatively low variance, even when r = 1. It also shows that the smallest possible tracker, with r = b = 1, is highly unreliable.

74 Chapter 2. `2 heavy hitters algorithm with fewer words Implementations of HH2 and CountSketch. HH2 implementation details. We have implemented the algorithm HH2 as described in Algorithm 8. The maximum number of

rounds is R = min{d3 log2 ne, 64}. We implemented the four-wise independent hashing using the “CW” trick using the C code from [141] Appendix A.14. We use the code from Appendix A.3 of [141] to generate 2-universal random variables for random relabeling of the item. The F2 tracker from the previous section was used, we found experimentally that setting the tracker parameters as r = 1 and b = 30 is accurate enough for HH2. We also tried four-wise hashing with the tabulation-based hashing for 64-bit keys with 8 bit characters and compression as implemented in C in Appendix A.11 of [141]. This led to a 48% increase in speed (updates/millisecond), but at the cost of a 55 times increase in space. CountSketch implementation details. We implemented CountSketch in C as described in the original paper [39] with parameters that lead to the smallest possible space. We use the CountSketch parameters as b = 2 (number of buckets/row) and r = d3 + log2 ne (num- ber of rows). The choice of b is the smallest possible value. The choice of r is the minimum needed to guarantee that, with probability 7/8, there does not exist an item i ∈ [n] \{H} that collides the heavy hitter in every row of the data structure. In particular, if we use

0 N−r0 only r < r rows then we expect 2log2 collisions with the heavy hitter, which would break the guarantee of the CountSketch. Indeed, suppose there is a collision with the heavy hitter and consider a stream where all occurrences of H appear at the beginning, then CountSketch will not correctly return H as the most frequent item because some item that collides with it and appears after it will replace H as the candidate heavy hitter in the . In our experiments, the CountSketch does not reliably find the α-heavy hitter with these parameters when α < 32. This gives some speed and storage advantage to the CountSketch in the comparison against HH2, since b and/or r would need to increase to

75 Chapter 2. `2 heavy hitters algorithm with fewer words 1 rate 0.5 (1) start (2) end success (3) random 0 (4) blocks 1 2 4 8 16 32 64 α

8 FIGURE 2.2: Success rate for HH2 on four types of√ streams with n = 10 and heavy hitter frequency α n. make CountSketch perform as reliably as HH2 during these tests. We also tried implementing the four-wise hashing with the Thorup-Zhang tabulation. With the same choices of b and r this led to an 18% speed-up and a 192 times average increase in space. Since the hash functions are such a large part of the space and time needed by the data structure this could likely be improved by taking b > 2, e.g. b = 100, and r ≈ dlogb ne. No matter what parameters are chosen the storage will be larger than using the CW trick because each tabulation-based hash function occupies 38kB, which already ten times larger than the whole CountSketch table.

Heaviness. The goal of this experiment is to approximately determine the minimum √ value K where if fH ≥ K n then HH2 correctly identifies H. As shown in Lemma 31, K ≤ 12 · 380, 000 but we hope this is a very pessimistic bound. For this experiment, we take n = 108 and consider α ∈ {1, 2, 22, . . . , 26}. For each value of α and all four types of streams we ran HH2 one hundred times independently. Figure 2.2 displays the success rate, which is the fraction of the one hundred trials where HH2 correctly returned the heavy hitter. The figure indicates that HH2 succeeds reliably when α ≥ 32.

76 Chapter 2. `2 heavy hitters algorithm with fewer words 6,000

4 4,000 (kB)

2 (updates/ms)

2,000 space rate 0 0 106 107 108 109 n FIGURE 2.3: Update rate in updates/ms (•) and storage in kB (◦) for HH2 and CountSketch ( and , respectively) with the CW trick hashing.

HH2 versus CountSketch comparison. In the final experiment we compare HH2 against CountSketch. The goal is to understand space and time trade-off in a regime where both algorithms reliably find the heavy hitter. For each choice of parameters we compute the update rate of the CountSketch and HH2 (in updates/millisecond) and the storage used (in kilobytes) for all of the variables in the associated program. The results are presented in Figure 2.3. The figure shows that HH2 is much faster and about one third of the space. The dra- matic difference in speed is to be expected because two bottlenecks in CountSketch are computing the median of the Θ(log n) values and evaluating Θ(log n) hash functions. HH2 removes both bottlenecks. Furthermore, as the subroutine HH1 progresses a greater number of items are rejected from the stream, which means the program avoids the asso- ciated hash function evaluations in HH2. This phenomena is responsible for the observed increase in the update rate of HH2 as n increases. An additional factor that contributes to the speedup is amortization of the start-up time for HH2 and of the restart time for each copy of HH1.

77 Chapter 2. `2 heavy hitters algorithm with fewer words Experiments summary. We found HH2 to be faster and smaller than CountSketch. The number of rows strongly affects the running time of CountSketch because during each up- date r four-wise independent hash functions must be evaluated and of a median r values is computed. The discussion in Section 2.3.3 explains why the number of rows r cannot be reduced by much without abandoning the CountSketch guarantee or increasing the space. Thus, when there is a K-heavy hitter for sufficiently large K our algorithm significantly outperforms CountSketch. Experimentally we found K = 32 was large enough. The full BPTree data structure is needed to find an item with smaller frequency, but for finding an item of smaller frequency CountSketch could outperform BPTree until n is very large. For example, to identify an α-heavy hitter in the stream our experiments suggest that one can use a BPTree structure with about d(32/α)2e buckets per row. In comparison a CountSketch with roughly max{2, 1/α2} buckets per row should suffice. When α is a small constant, e.g. 0.1, what we find is that one can essentially reduce the number rows of the data structure from log(n) to just a few, e.g. one or two, at the cost of a factor 322 = 1024 increase in space.3 This brief calculation suggests that CountSketch will outperform BPTree when the heaviness is α < 1 until n & 21024—which is to say always in practice. On the other hand, our experiments demonstrate that HH2 clearly outperforms CountSketch with a sufficiently heavy heavy hitter. More experimental work is necessary to determine the heaviness threshold (as a function of n) where BPTree out- performs CountSketch. There are many parameters that affect the trade-offs among space, time, and accuracy, so such an investigation is beyond the scope of the preliminary results reported here.

3Recall, Ω(log n/ log(1/α)) rows are necessary CountSketch whereas BPTree needs only O(log 1/α) rows.

78 Chapter 2. `2 heavy hitters algorithm with fewer words 2.4 Conclusion

In this chapter we studied the heavy hitters problem, which is arguably one of the most important problems for data streams. The problem is heavily inspired from practice and algorithms for it are used in commercial systems. We presented two results leading to the

first space and time optimal algorithm for finding `2-heavy hitters, which is the strongest notion of heavy hitters achievable in polylogarithmic space. By optimal, we mean the time is O(1) to process each stream update, and the space is O(log n) bits of memory. These bounds match trivial lower bounds (for constant ε). We also provided new tech- niques which may be of independent interest: (1) a one-pass implementation of a multi- round adaptive compressed-sensing scheme where we use that after filtering a fraction of items, the heavy item is becoming even heavier (2) a derandomization of Bernoulli pro- cesses relevant in this setting using limited independence. Both are essential in obtaining an optimal heavy hitters algorithm with O(1) memory. Technique (1) illustrates a new power of insertion-only streams and technique (2) can be stated as a general chaining re- sult with limited independence in terms of the size of the nets used. Given the potential practical value of such an algorithm, we provided preliminary experiments showing our savings over previous algorithms.

79 Chapter 3

Monitoring the Network with Interval Queries

This chapter is based on the work done in collaboration with Liu Z., Braverman V., Ben- Basat R., Einziger G. and Friedman R.

3.1 Introduction

Network monitoring is at the heart of many networking protocols and network functions, such as traffic engineering [18], load balanced routing [108, 146], attack and anomaly de- tection [15, 52, 111, 128], and forensic analysis [88, 131]. Over the years, a large number of metrics have been defined, including per-flow frequency [51, 145], heavy hitter detection [16, 43], distinct heavy hitters [56], cardinality estimation [58, 71, 73], change detection [59], en- tropy estimation [9, 118], and more. With limited memory and computing resources on the network device, it is often infeasible to compute these statistics at line rate. Thus, approximated results are often a reasonable choice [145]. Interestingly, all the above metrics can be efficiently measured by streaming algo- rithms with a small amount of memory. To that end, [29] has shown that any metric

80 Chapter 3. Monitoring the Network with Interval Queries computable in polylogarithmic time per packet, can be obtained from the per-flow fre- quency and the L2-norm of the subsets of the flows in the stream. In other words, in- stead of maintaining separate data-structures and algorithms for each possible metric, it is enough to estimate the flow frequencies and L2-norm as promoted by the seminal UnivMon work [101]. Since computer networks usually operate continuously, for many applications, it is particularly important to measure the network statistics that only reflect the recent traf- fic. This model is known as the sliding window model [46], where the metrics are always computed over a fixed-size window of recent data. However, a measurement on a window of fixed size does not provide visibility into any intervals within the window, e.g., a heavy hitter detection algorithm computed over a 1 min window might not easily detect a 3-second microburst flow. Thus we are inter- ested in a more refined model in which the application provides a time interval of interest at query time. That is, the desired metric is estimated over a specific time interval rather than on a fixed sized sliding window. We refer to this model as the Interval Queries (IQ) model. This model is useful when there are multiple interesting intervals, or if the win- dow of interest is not known a-priory. Also, it enables performing drill-down queries of finer and finer intervals and comparing what happens at various intervals for root cause analysis. For example, a security application may use the IQ model to determine exactly when a suspicious pattern has emerged and how it changed over time. The IQ model was previously studied for L1 heavy hitters in [14]. Yet, that work is limited to flow size estimation and L1 heavy hitters only. In [115], work done in collaboration of Ivkin, Liu, Ben-Basat, Einziger, Friedman, and Bravermane, we introduce the first set of measure- ment algorithms for L2 heavy hitters and L2 estimation in the IQ model. Further, we extend our techniques to adapt UnivMon [101] the IQ model queries, where a variety of

81 Chapter 3. Monitoring the Network with Interval Queries traffic statistics can be efficiently measured. Therefore, we are the first to facilitate drill- down queries for a large variety of useful metrics using sub-linear space. We evaluate our algorithms using real Internet traces and show that they achieve good accuracy for network measurement tasks within acceptable memory space limitations.

3.2 Preliminaries

As it was introduced in Section 3.2 the streaming model targets the applications where the data items arrive sequentially, and each item is only accessed at the moment of its arrival.

One is given a stream of updates S = {s1,..., sm}, where si ∈ D and D is a dictionary of all possible elements, and the goal is to compute a target function f (S) while using the space sublinear in m and |D|. Space constraints often render the exact computation of a function infeasible; instead, streaming algorithms usually provide a (ε, δ)-approximation scheme. That is, randomized algorithms that return fˆ(S) ∈ (1  ε) f (S) with probability at least 1 − δ. For more details on the streaming model and its variations refer to [3, 113]. In many applications, the stream of data is considered to be infinite, and a target func- tion should be computed only on the last n updates and “forget” older ones. The Sliding Window model [46] addresses the pool of such problems. Formally, given a stream of updates S = {s1,..., st,...} and si ∈ D, the goal of a sliding window algorithm is to report f (t − n, t) = f (S(t − n, t)) = f ({st−n,..., st}) at any given moment t. Similarly, the algorithm should use the space sublinear in n and |D| and follow the approximation scheme fˆ(t − n, t) ∈ (1  ε) f (t − n, t). In this work, we address typical measurement tasks in the IQ model. First consid- ered in [14, 100], its goal is estimating a function over the interval (t1, t2) (of the stream S) that is specified at query time. Given a stream of updates S, the goal of an algorithm

82 Chapter 3. Monitoring the Network with Interval Queries

in the IQ model is to compute f (t1, t2) = f (S(t1, t2)) at any moment t, and any given

interval (t1, t2) ⊂ (t − n, t), while using space sublinear in n and |D|. In section 3.3,

we show that achieving approximation fˆ(t1, t2) = (1  ε) f (t1, t2) is infeasible as it re- quires Ω(n) bits of memory. Thus, we call an IQ algorithm (ε, δ)-approximate if it returns

fˆ(t1, t2) = f (t1, t2)  ε f (t1, t) with probability at least 1 − δ. That is, the allowed error is an ε fraction of the value of the function when applied on the suffix (t1, t) and not only on (t1, t2). Specifically, this means that if t2 is the current time, we get a multiplicative

(1 + ε)-approximation for a t1-sized window whose size is given at query time. Finding heavy hitters in streaming data is a well-studied problem in analysis of large datasets; optimal or nearly optimal results were achieved in different models [21, 24, 32, 33, 39, 44, 107, 109, 147]. In this section for the sake of completeness we give a brief overview of the heavy hitters problem, including the formal problem statement and major differences between L1 and L2 settings. For more details on the problem please refer to [3, 43, 113].

m Item i is an (ε, Lp)-heavy hitter in the stream S = {si}i=1, si ∈ {1, . . . , N}, if fi ≥ εLp( f ), where fi = #{j|sj = i} is number of occurrences of item i in the stream S, and Lp( f ) = q p ∑ p fi is the Lp norm of frequency vector f , and j-th coordinate in the vector equals to

fj. An approximation algorithm for Lp heavy hitters returns all the items that appear at ε least εLp times and no item that appears less than 2 Lp times and error with probability at most δ. It was shown in [12, 37] that for p > 2 any algorithm will require the space at least polynomial in m and N. Therefore, the central interest is around finding L1 and

L2 heavy hitters. Note that finding L2 is provably more difficult, compared to L1 heavy

hitters. While to be an (ε, L1)-heavy hitter an item needs to appear in a constant fraction

of the stream updates, in some cases to be an (ε, L2)-heavy hitter the item can appear just √ in O(1/ n) fraction of updates. Note that to catch such L1 heavy hitters using uniform

83 Chapter 3. Monitoring the Network with Interval Queries 2 sampling, one will need to sample at most O(1/ε ) items, while catching L2 heavy hitters

will require the number of samples to be polynomial in n. Moreover, any L2 algorithm can find all L1 heavy hitters while the opposite is not always the case. The L1 heavy hit- ters problem has optimal algorithms in both the cash register [107, 109] and turnstile [44] streaming models and was considered in both sliding windows [46] and Interval Query

[14] computational models. The L2 heavy hitters problem has tight results for both cash register [32] and turnstile [39] streaming models. Recent results on L2 heavy hitters in the sliding window were shown to be space optimal [33]; however, to the best of our knowledge, the problem was not considered in the Interval Query model. In this work, we consider the following approximation L2 heavy hitters problem in the Interval Query model.

Definition 33 ((ε, L2)-heavy hitters problem in IQ). For 0 < ε < 1 output set of items T ⊂

[N], such that T contains all items with fi(t1, t2) ≥ εL2(t1, t) and no items with fi(t1, t2) ≤ ε 2 L2(t1, t).

There is a strong connection between the Interval Query and Sliding Window (SW) models: any algorithm that solves the problem in IQ model can answer SW queries as

well, one only needs to query the largest permitted interval, i.e., (t1, t2) = (t − n, n). Therefore, we expect the current SW approaches to be useful for understanding the IQ

model. Further, we introduce some background on SW and (ε, L2)-heavy hitters algo- rithms in it. Currently, two general SW frameworks are known: Exponential Histogram [46] and Smooth Histogram [28]. We now provide a brief overview of the core techniques of those frameworks.

84 Chapter 3. Monitoring the Network with Interval Queries

B1 B2 B3 B4 Bk EH t − n t

SH

A1 A2 A3 A4

Ak

FIGURE 3.1: Interval (bucket) structure for EH and SH.

Exponential Histograms (EH). In [46], the authors suggest to break the sliding window

W = (t − n, n) into a sequence of k non-overlapping intervals B1, B2,..., Bk, as depicted in k Fig. 3.1. Window W is covered by ∪i=1Bi, and contains all Bi except the first one. Then, if a target function f admits a composable sketch, maintaining such a sketch on each bucket

0 k can provide us with an estimator for f on window W = ∪i=2Bi. Note, that f (W) is 0 0 “sandwiched” between f (W ) and f (B1 ∪ W ). Therefore, a careful choice of each bucket endpoints provides control over the difference between f (W) and f (W0), thereby making f (W0) a good estimator for f (W). As the window slides, new buckets are introduced, old buckets get expired and deleted, and buckets in between get merged. The EH approach admits non-negative, polynomially bounded functions f which admit composable sketch and are weakly additive, i.e., ∃Cf ≥ 1, such that ∀S1, S2:

f (S1) + f (S2) ≤ f (S1 ∪ S2) ≤ Cf ( f (S1) + f (S2)).

2 For such functions, [46] can ensure that k = O(Cf log n) by maintaining two invariants: Cf ∑k 1 ∑k f (Bj) ≤ k i=j f (Bj) and f (Bj−1) + f (Bj) ≤ k i=j f (Bj).

85 Chapter 3. Monitoring the Network with Interval Queries Smooth Histograms (SH). In [28], the authors relax the weak additivity to more general property of smoothness — ∃0 < β ≤ α ≤ ε ∀S1, S2, S3 :

(1 − β) f (S1 ∪ S2) ≤ f (S2) ⇒ (1 − α) f (S1 ∪ S2 ∪ S3) ≤ f (S2 ∪ S3).

Additionally, in SH buckets A1,..., Ak overlap; therefore, [28] extends the class of admit- ted target functions even further by relaxing the composability requirement. Similarly to [46], f (W) is ”sandwiched” between f (A1) and f (A2), see Figure 3.1. The memory 1 overhead is O( β log n) and the maintained invariants are (1 − α) f (Ai) ≤ f (Ai+1) and f (Ai+2) < (1 − β) f (Ai).

B1 B2 B3 B4 Bk

t − n t1 q t2 t

FIGURE 3.2: Interval query in the prism of EH.

Interval queries on EH and SH. The IQ model is more general than the SW model, however we use similar building blocks. Fig. 3.2 depicts query interval q = (t1, t2) and buckets of Exponential Histogram with window of size n. Note that q is ”sandwiched”

k k between B2 ∪ B3 ∪ B4 and B3, while f (∪j=2Bj) = (1  ε) f (t1, t) and f (∪j=4Bj) = (1 

ε) f (t2, t). Intuitively, f (t1, t2) can be approximated by f (B2 ∪ B3) with an additive error of ε f (t1, t). Similar approach can be applied to Smooth Histograms, if the sketches preserve approximation upon subtraction.

86 Chapter 3. Monitoring the Network with Interval Queries 3.3 Interval Algorithms

Lower bound.

Theorem 34. Any algorithm which maintains a sketch over the stream and at any moment t: Ω ∀t1, t2 ∈ (t − n, t) outputs Lˆ 2(t1, t2) = (1  α)L2(t1, t2) requires (n) bits of memory.

Proof. The proof uses a reduction from the INDEX problem in communication complexity. Alice is given a string x = {0, 1}n and Bob gets an index k ∈ {1, . . . , n}. Alice sends a message to Bob and he should report the value of xk. Bob can err with probability at most δ. A known lower bound on the size of the message, even for randomized algorithms, is Ω(n) [94]. Suppose there exists an algorithm which maintains a sketch over the stream and at any moment t, for any interval (t1, t2) ⊂ (t − n, t), outputs Lˆ 2(t1, t2) = (1  α)L2(t1, t2), while using only o(n) bits of space. Such an algorithm can distinguish between L2 and 2 L2(1 + 3α); denote p = (1 + 3α) . Alice encodes the string x as a stream of updates

f (x1), f (x2),..., f (xn), where f (1) = 1, . . . , 1 and f (0) = 1, 2, . . . , p with p updates in both √ cases. Therefore, the entire stream has length pn, L2( f (0)) = p, and L2( f (1)) = p. ( ( )) √ Hence, L2 f 1 = p and the algorithm can distinguish them. After running the stream L2( f (0)) through the algorithm, Alice sends the content of the data structure to Bob. Bob queries interval (t − pn + p(k − 1), t − pn + pk) and due to approximation guarantees of the al-

gorithm can infer whether xk = 0 or 1. Therefore the INDEX problem was resolved with a message size of o(n) bits, which contradicts the Ω(n) lower bound.

Algorithm 1 Braverman, Gelles, and Ostrovsky, in [25], presented the first L2 heavy hitter algorithm in the SW model. They apply the SH framework described earlier to

87 Chapter 3. Monitoring the Network with Interval Queries

get a (1 + ε)-approximation of L2 norm. In addition, each bucket of the SH maintains an instance of the Count Sketch algorithm [39]. Our algorithm, whose pseudo-code is given in Algorithm 9, utilizes a similar tech-

1 nique: when querying with an interval (t1, t2), it finds the largest SH suffix (a , t) con- 2 tained in (t1, t) and the largest suffix (a , t) contained in (t2, t). It then calculates Count Sketch of interval (a1, a2) as the difference of two Count Sketches: (a1, t) and (a2, t). The frequencies on the interval (t1, t2) can be approximated by the Count Sketch of interval (a1, a2), as we show later. Next, we prove that the described technique provides a good estimation for L2(t1, t2).

Lemma 35. Let A be an SH construction for (α, β)-smooth L2 and Ai = (ai, t) are the suffixes of A (see Fig. 3.1). Then for any interval (t1, t2):

√ 1 2 L2(a , a ) = L2(t1, t2)  2αL2(t1, t),

1 2 where a = min ai ≥ t1 and a = min ai ≥ t2.

1 2 2 1 Proof. From SH invariant [28]: L2(a , t) > (1 − α)L2(t1, t). then since L2(t1, t) ≥ L2(a , t) + 2 1 L2(t1, a ) √ 1 we have: L2(t1, a ) ≤ 2αL2(t1, t), √ √ 2 and similarly: L2(t2, a ) ≤ 2αL2(t2, t) ≤ 2αL2(t1, t).

Due to triangle inequality and monotonicity of L2: 1 1 1 1 2 1 2 L2(t1, t2) ≤ L2(t1, a ) + L2(a , t2) ≤ L2(t1, a ) + L2(a , a ), thus L2(a , a ) ≥ L2(t1, t2) − √ 2αL2(t1, t). 2 2 Similarly: L2(t1, t2) + L2(t2, a ) ≥ L2(t1, a ) 2 2 1 2 2 L2(t1, t2) ≥ L2(t1, a ) − L2(t2, a ) ≥ L2(a , a ) − L2(t2, a ) √ 1 2 L2(a , a ) ≤ L2(t1, t2) + 2αL2(t1, t).

88 Chapter 3. Monitoring the Network with Interval Queries

Algorithm 9 L2 heavy hitter algorithm based on [25] 1: function UPDATE(item) 2 4 2: SH L ( ) = ( ε ε ) maintain for 2 with α, β 27 , 215 3: on each bucket Ai maintain Count Sketch CSi 4: end function 5: function QUERY(t1, t2) 1 2 6: find a = min ai ≥ t1 and a = min ai ≥ t2 7: subtract sketches CS = CSa1 − CSa2 1 1 8: query L2 of suffix: Lˆ 2(a , t) = SH.query(a , t) 9: for each item i in CSa1 .heap do 10: fˆi(t1, t2) = CS.estimateFreq(i) ˆ 3ε ˆ 1 ˆ 11: if fi(t1, t2) > 4 L2(a , t) then report (i, fi) 12: end for 13: end function

Theorem 36. Algorithm 9 solves (ε, L2)-heavy hitters problem in the IQ model (Definition 33) ( 1 3 1 ) using O ε6 log n log δε bits of space.

Proof.

1 2 1 2 ∀i : fi(a , a ) = fi(t1, t2) − fi(t1, a ) + fi(t2, a ).

1 1 2 2 Note that fi(t1, a ) ≤ L2(t1, a ) and fi(t2, a ) ≤ L2(t2, a ). From Lemma 35 and due to the choice of SH parameters:

ε ε f (t , a1) ≤ L (t , t), f (t , a2) ≤ L (t , t) i 1 8 2 1 i 2 8 2 1

For a given stream S Count Sketch approximates the counts as follows:

ε fˆ (S) = f (S)  L (S) i i 8 2

89 Chapter 3. Monitoring the Network with Interval Queries ε and its heap contains all ( 8 , L2)-heavy hitters. Thus sketch CS = CSa1 − CSa2 will esti- mate the count of any queried item i as

ε ε fˆ (a1, a2) = f (a1, a2)  L (t , t) = f (t , t )  L (t , t). i i 8 2 1 i 1 2 4 2 1 where the last equality follows from the derivations above. To prove the correctness of the algorithm, we need to show that:

1. every (ε, L2)-heavy on (t1, t2) item will appear in the heap of CSa1 (line 8) and sur- vive pruning (line 10).

ε 2. no items fj(t1, t2) ≤ 2 L2(t1, t) will survive pruning.

Note that if item i is (ε, L2)-heavy on (t1, t2), then fi(t1, t2) > εL2(t1, t) which implies 1 2 7ε 1 7ε fi(a , a ) > 8 L2(t1, t) and fi(a , t) > 8 L2(t1, t). The latter one implies that item i will

be in the heap of CSa1 . The former one given that

ε 3 fˆ (a1, a2) > f (a1, a2) − L (t , t) > L (t , t) i i 8 2 1 4 2 1

2 Lˆ (a1 t) = (  ε )L (t t) i and 2 , 1 27 2 1, guarantees that heavy item will survive pruning proce- dure. Similarly, ε 3 fˆ (a1, a2) < f (a1, a2) + L (t , t) < L (t , t), i i 8 2 1 4 2 1

therefore light items will be filtered out.

1 δβ According to Theorem 3 from [28], SH approach requires O( β g(ε, n ) log n) bits of memory. Here, g(ε, δ) is the amount of memory needed for sketch to get ε approximation of target function with failure probability at most δ. Note that to avoid mistaken deletions of suffixes in the SH construction, every target function sketch should succeed. Therefore

90 Chapter 3. Monitoring the Network with Interval Queries δ L2 sketch should fail with probability at most O( n ). At the same time, Count Sketch is needed only at the moment of a query, therefore it should fail with probability at most ( δ ) ( 1 2 ) O log n . Each amplified L2 sketch, according to [6], requires O ε2 log n bits of space – same as the amount of memory that is needed for each Count Sketch. Thus, in total, the ( 1 3 ) algorithm requires O ε6 log n bits of memory.

Algorithm 2 A natural question to ask is whether the Exponential Histogram frame-

2 work can be used instead of Smooth Histograms. Note that the target function L2(·) ad- mits a composable sketch [6] and is weakly additive with Cf = 2. According to Theorem 7 of [46], an admittable target function f can be estimated with the relative error

2 Cf E ≤ (1 + ε) + C − 1 + ε, r k f

2 where ε is relative error of the L2 sketch and k is a parameter of EH framework. Note

that for Cf = 2, no k can get an error better than Er = 1 + o(1), which only implies a

2-approximation of the L2-norm on the sliding window. The IQ model is more general than SW, therefore, the same idea would not work out of the box. Instead, we suggest the following decoupling tweak to the weak additivity requirement:

0 0 ∃Cf , Cf ∀S1, S2 : f (S1 ∪ S2) ≤ Cf f (S1) + Cf f (S2).

Keeping the rest of the framework the same and repeating the argument as in Theorem 7 of [46], the relative error becomes:

2 Cf E ≤ (1 + ε) + C0 − 1 + ε. r k f

91 Chapter 3. Monitoring the Network with Interval Queries 0 2 To find appropriate constants Cf and Cf for L2, note that for any positive integers a and b and any ε ∈ (0, 1) :

2 1 √ 2 (a + b)2 ≤ a2 + (1 + ε)b2 − (√ a + εb)2 ≤ a2 + (1 + ε)b2. ε ε ε

2 1 2 2 2 0 Therefore, ∀S1, S2 : L2(S1 ∪ S2) ≤ 2ε L2(S1) + (1 + ε)L2(S2), i.e., Cf = ε and Cf = 1 + ε. = ( 1 ) ( + ) 2 Setting k O ε3 gives a 1 ε -approximation of L2 on sliding windows using EH framework. Note that the described tweak will work for any admittable function f for 0 which f (S1 ∪ S2) ≤ Cf f (S1) + Cf f (S2), as no other properties of L2 were used in the proof. Further, we argue that same approximation can be achieved with smaller k. We

2 maintain EH histogram framework for f = L2 as proposed in [46] with Cf = 2. However, p 2 when queried, we output f = L2 rather than f = L2. Note that, due to the triangle p p p inequality, f (S1 + S2) ≤ f (S1) + f (S2). Denote t0 = t − n; then the core derivation for the relative error Er from Theorem 7 of [46] can be rewritten as follows:

p p s s f (t0, t) − f (b1, t) f (t0, b1) f (b0, b1) Er = p ≤ ≤ f (t0, t) f (t0, t) f (b1, t)

s s r Cf ∑i>0 f (Bi) Cf f (b1, t) Cf ≤ ≤ ≤ . k f (b1, t) k f (b1, t) k

= ( 1 ) Setting k O ε2 provide necessary ε-approximation for L2 on the sliding window.

Lemma 37. Let B be an EH construction for L2 with parameters Cf and k, and let

Bi = (bi, bi+1) denote the buckets of B (see Fig. 3.1). Then for any interval (t1, t2):

r Cf L (b1, b3) = L (t , t )  L (t , t), 2 2 1 2 k 2 1

0 1 2 3 where (b , b ) is the bucket containing t1 and (b , b ) is the bucket containing t2.

92 Chapter 3. Monitoring the Network with Interval Queries 2 Proof. By monotonicity of L2 and EH invariant:

L2^2(t1, b1) ≤ L2^2(b0, b1) ≤ (Cf/k) ∑_{bi > b0} L2^2(Bi) ≤ (Cf/k) L2^2(b1, t).

Repeating the argument for (t2, b3) and taking the square root:

L2(t1, b1) ≤ √(Cf/k) L2(t1, t)   and   L2(t2, b3) ≤ √(Cf/k) L2(t1, t).

Applying the triangle inequality the same way as in the proof of Lemma 35 leads to the statement of the current lemma.

Algorithm 10 L2 heavy hitter algorithm based on EH
1: function UPDATE(item)
2:   maintain EH for L2^2 with k = O(1/ε^2), Cf = 2
3:   on each bucket Bi maintain a Count Sketch CSi
4: end function
5: function QUERY(t1, t2)
6:   find b0, . . . , b3 : t1 ∈ (b0, b1) = Bi and t2 ∈ (b2, b3) = Bj
7:   compute the union sketch CS = ∪_{b1 ≤ l ≤ b3} CSl
8:   query L2 of the suffix: L̂2(t1, t) = EH.query(b1, t)
9:   for each heavy hitter (i, f̂i) in CS do
10:    if f̂i > (3ε/4) L̂2(t1, t) then report (i, f̂i)
11:  end for
12: end function

Theorem 38. Algorithm 10 solves the (ε, L2)-heavy hitters problem in the IQ model (Definition 33) using O((1/ε^4) log^3 n log(1/δε)) bits of space.

Proof. Due to Lemma 37 for given parameters Cf and k:

∀i : fi(b1, b3) = fi(t1, t2) ± (ε/16) L2(t1, t).

The approximation guarantees of the Count Sketch CS can be rewritten as:

∀i : f̂i(b1, b3) = fi(b1, b3) ± (ε/16) L2(b1, t).

Therefore, ∀i : f̂i(b1, b3) = fi(t1, t2) ± (ε/8) L2(t1, t). The same lemma applied to the interval (t1, t), given an (ε/16)-approximation of L2 on each bucket Bi, leads to L̂2(t1, t) = (1 ± ε/8) L2(t1, t). Thus, any heavy item with fi(t1, t2) ≥ εL2(t1, t) will be reported:

f̂i(b1, b3) ≥ fi(t1, t2) − (ε/8) L2(t1, t) ≥ (7ε/8) L2(t1, t) ≥ (3ε/4) L̂2(t1, t).

Similarly, any item with fi(t1, t2) ≤ (ε/2) L2(t1, t) will be pruned away:

f̂i(b1, b3) ≤ (5ε/8) L2(t1, t) ≤ (3ε/4) L̂2(t1, t).

According to Theorem 7 from [46], the EH approach requires O(k g(ε, δ/n) log n) bits of memory, where g(ε, δ) is the amount of memory needed for a sketch to get a (1 + ε)-approximation

of the target function with a failure probability of at most δ. Similarly to the SH case, the L2 sketch should succeed on O(n) instances, while the Count Sketch only on O(log n). Thus, in total, the algorithm requires O((1/ε^4) log^3 n) bits of memory.

Algorithm 3 Recently, a new algorithm for finding L2-heavy hitters in the SW model was introduced in [33]. The authors show a significant improvement in space complexity and provide a matching lower bound. The memory footprint of the solution proposed in [33] is O((1/ε^2) log^2 n log(1/εδ)) bits, while the previous result [25] needed at least O((1/ε^4) log^3 n log(1/εδ)) bits. Although the new approach uses the SH framework, it differs conceptually in the way of


catching the heavy hitters. Recall that [25] requires a (1 + ε)-approximation of the L2 to make sure that no heavy hitters are lost between neighboring buckets.

A streaming L2-heavy hitter algorithm can report a heavy hitter after seeing it (ε/16)L2 times. [33] uses that property and approximates the counter for each reported item us-

ing a separate SH data structure. To report every (ε, L2)-heavy item and no (ε/2, L2)-heavy items, one only needs a constant-factor approximation both for L2 and for the frequencies of the potentially-heavy items. [33] exploits this and runs an SH with constant α and β both for

the L2-norm approximation and for independently tracking the frequency of each potentially-heavy item reported by the Count Sketch. Additionally, [33] suggests replacing the Count Sketch algorithm with BPTree [32], using shared randomness, and using the strong tracking argument from [31] to avoid a union bound over all bucket sketches. We reuse this approach for the IQ model. First, we show an SW solution for an ε-approximation of fi, which can also be considered as the sum problem on a binary zero-one stream, and prove the following guarantee in the IQ model:

f̂i(t1, t2) = fi(t1, t2) ± ε fi(t1, t).

Lemma 39. Let A be an SH construction for an (α, β)-smooth sum function S, and consider a zero-one stream. Denote by Ai = (ai, t) the suffixes of A (see Fig. 3.1). Then for any interval (t1, t2):

S(a1, a2) = S(t1, t2) ± αS(t1, t),

where a1 = min{ai ≥ t1} and a2 = min{ai ≥ t2}.

Proof. From the SH invariant [28]: S(a1, t) > (1 − α)S(t1, t). Since S(t1, t) = S(a1, t) + S(t1, a1), we have S(t1, a1) ≤ αS(t1, t), and similarly S(t2, a2) ≤ αS(t2, t). Therefore, the statement of the lemma follows from

S(t1, t2) = S(a1, a2) + S(t1, a1) − S(t2, a2).
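To make the suffix-difference idea behind Lemma 39 concrete, here is a toy Python sketch of a smooth-histogram-style counter for a 0/1 stream; the class name and the pruning rule are illustrative simplifications and not the thesis implementation (the prototype in this chapter is written in C):

class SuffixCounter:
    """Toy smooth-histogram-style structure for the SUM function on a 0/1 stream.
    Keeps a few suffix start positions a_1 < a_2 < ... with exact counts S(a_i, t);
    the pruning keeps the counts of neighbouring kept suffixes within a (1 - alpha) factor."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.t = 0
        self.suffixes = []                      # list of [start_position, count]

    def update(self, bit):
        self.t += 1
        for s in self.suffixes:                 # every kept suffix sees the new bit
            s[1] += bit
        self.suffixes.append([self.t, bit])     # a suffix starting "now"
        i = 0                                   # prune redundant middle suffixes
        while i + 2 < len(self.suffixes):
            if self.suffixes[i + 2][1] >= (1 - self.alpha) * self.suffixes[i][1]:
                del self.suffixes[i + 1]
            else:
                i += 1

    def query_suffix(self, t1):
        """Approximate S(t1, t) by the count of the first kept suffix at or after t1."""
        for start, count in self.suffixes:
            if start >= t1:
                return count
        return 0

    def query_interval(self, t1, t2):
        """Approximate S(t1, t2) as a difference of two suffix counts (Lemma 39)."""
        return self.query_suffix(t1) - self.query_suffix(t2)

For example, after feeding a random 0/1 stream, query_interval(2000, 6000) lands within roughly α·S(2000, t) of the exact number of ones in positions 2000 through 5999.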

Algorithm 11 L2 heavy hitter algorithm based on [33]
1: function INIT
2:   init SH_L2 with (α, β) = (1/10, 1/200)
3:   init a Count Sketch CSi on each SH_L2 bucket Ai
4:   init HHp, an array for potential heavy items
5:   if i ∈ HHp then HHp[i].SH tracks fi in SW
6: end function
7: function UPDATE(item)
8:   update SH_L2 and all CSi
9:   if item ∈ HHp then update HHp[item].SH
10:  for all i, for each item (j, f̂j) in CSi.heap
11:    if f̂j > (3ε/4) SH_L2.query(Ai) and j ∉ HHp:
12:      HHp.add(j)
13:      init HHp[j].SH with (α, β) = (ε/16, ε/16)
14: end function
15: function QUERY(t1, t2)
16:   a1 = min{ai ≥ t1} and L̂2(a1, t) = SH.query(a1, t)
17:   for each item i ∈ HHp do
18:     a1 = min{ai ≥ t1} and a2 = min{ai ≥ t2}
19:     f̂i(t1, t) = HHp[i].SH.query(a1, t)
20:     f̂i(t1, t2) = f̂i(t1, t) − HHp[i].SH.query(a2, t)
21:     if f̂i(t1, t2) > (3ε/4) L̂2(a1, t) then report (i, f̂i)
22:   end for
23: end function

Theorem 40. Algorithm 11 solves the (ε, L2)-heavy hitters problem in the IQ model (Definition 33) using O((1/ε^3) log^3 n log(1/δε)) bits of space.


Proof. First, let us show that all items with fi(t1, t2) ≥ εL2(t1, t) will appear in HHp. Denote a0 = max{ai ≤ t1}; then

fi(a0, t2) ≥ fi(t1, t2) ≥ εL2(t1, t) ≥ εL2(t1, t2).

Therefore, by the moment t2, the Count Sketch of the bucket (a0, t2) should have reported it in line 10 of Algorithm 11.

Now we argue that all heavy items will survive the pruning in line 19. Let (a0, t) be the

first bucket of HHp[i].SH; then we should consider two cases: t1 ≥ a0 and t1 ≤ a0. If t1 ≥ a0, then it follows from Lemma 39 that f̂i(t1, t2) ≥ fi(t1, t2) − (ε/16) fi(t1, t). At

the same time, the SH framework guarantees L̂2(t1, t) ≤ 1.1 L2(t1, t). Therefore, if fi(t1, t2) ≥

εL2(t1, t), then:

f̂i(t1, t2) ≥ (15ε/16) L2(t1, t) ≥ (3ε/4) L̂2(t1, t).

Recall that Count Sketch reports an item after seeing (ε/16)L2 of its instances; therefore, for t1 ≤ a0, using the same lemma we can conclude that f̂i(t1, t2) ≥ fi(t1, t2) − (ε/8) fi(t1, t), while the rest of the computation is the same. Therefore, all items with fi(t1, t2) ≥ εL2(t1, t) will be reported by Algorithm 11.

For every non-heavy item with fi(t1, t2) ≤ (ε/2) L2(t1, t), if i ∈ HHp, then from Lemma 39, f̂i(t1, t2) ≤ fi(t1, t2) + (ε/16) fi(t1, t) ≤ (ε/2) L2(t1, t) + (ε/16) L2(t1, t) ≤ (3ε/4) L̂2(t1, t), and item i will be pruned out in line 19. According to Theorem 3 from [33], the modified SH approach requires O((1/β) g(ε, δ) log n) bits of memory. Algorithm 11 uses β = O(1) for SH_L2, with g(ε, δ) = O((1/ε^2) log n) due to the L2 sketch and Count Sketch instances. Thus, SH_L2 requires O((1/ε^2) log^2 n) bits of space and has O(log n) buckets, each of which can potentially generate up to O(1/ε^2)

items in HHp. Therefore, in total, to track an ε-approximation to the frequencies of all potential

heavy hitters, the data structure spends O((1/ε^3) log^3 n) bits of space. Summing the two derived quantities, the space complexity of Algorithm 11 is O((1/ε^3) log^3 n log(1/δε)) bits.

Note that replacing Count Sketch in Algorithm 11 with BPTree [32] improves the space complexity by another log n factor.

Corollary 40.1. There exists an algorithm that solves the (ε, L2)-heavy hitters problem in the IQ model using O((1/ε^3) log^2 n log(1/δε)) bits of space.

In this work, we mainly focus on the trade-off between memory and precision. In Table 3.1 we compare the space complexity and the update and query times for all proposed algorithms. Optimizing the algorithms towards improving the query time is a subject of future research.

Alg.   Space complexity                 Update time       Query time
1      O(ε^-6 log^3 n log δ^-1)         O(ε^-4 log n)     O(ε^-2 log n)
2      O(ε^-4 log^3 n log δ^-1)         O(ε^-2 log n)     O(ε^-3 log n)
3      O(ε^-3 log^2 n log δ^-1)         O(log n)          O(ε^-2 log n)

TABLE 3.1: Space complexity, update and query time for all proposed algo- rithms.

Extending to a wider class of functions Many streaming algorithms and frameworks

use an L2-heavy hitter algorithm as a subroutine. One of them is UnivMon [101], the framework which brings recent results on universal sketching [34] to the field of network traffic analysis. The main power of the framework is its ability to maintain only one sketch for many target flow functions, rather than an ad-hoc sketch per function. The class of functions that can be queried is wide and covers the majority of those used


in practice; examples include the L0 and L2 norms and the entropy. Therefore, L2-heavy hitters is an important step towards UnivMon in the IQ model. For more details on universal sketching refer to [34]; here we cover only the necessary basics. Given a function g : N → N, the goal of universal sketching is to approximate G = ∑_{i=1}^{n} g(fi) by making one or several passes over the stream. Theorem 2 in [34] states: if g(x) grows slower than x^2, drops no faster than sub-polynomially, and has predictable local variability, then there is an algorithm that outputs an ε-approximation to G, using sub-polynomial space and only one pass over the data. The algorithm consists of two major subroutines: (g, ε)-heavy hitters and the Recursive Sketch. The first one finds all items

i such that g(fi) ≥ εG, together with an ε-approximation to g(fi). In [34], the authors show that if an item is (g, ε)-heavy then it is also (L2, ε/h)-heavy for a sub-polynomial h; therefore, Count Sketch can be used to find (g, ε)-heavy items. The Recursive Sketch was initially introduced in [80] and further generalized in [27]. It finds an ε-approximation of G using a (g, ε)-heavy hitters algorithm as a black box, by recursively subsampling the universe and estimating the sum G of the subsampled stream. In [34], the (g, ε)-heavy hitter algorithm requires a subroutine which finds all items i such that

fi(t1, t2) ≥ εL2(t1, t2). However, due to the limitations of the IQ model, all three proposed algorithms only find all items i such that fi(t1, t2) ≥ εL2(t1, t). Further, we adjust the argument from [34] to argue that one can use the algorithms from the previous sections to find (g, ε)-heavy hitters with the guarantee defined as follows:

Definition 41 ((ε, g)-heavy hitters problem in IQ). For 0 < ε < 1, output a set of items T ⊂ [N] such that T contains all items with g(fi(t1, t2)) ≥ εG(t1, t) = ε ∑_j g(fj(t1, t)) and no items with g(fi(t1, t2)) ≤ (ε/2) G(t1, t).

Propositions 15 and 16 in [34] show that if a function g is slow-jumping and slow-dropping, then there exists a sub-polynomial function H such that:

∀x ≤ y :   g(y) ≥ g(x)/H(y),   g(y) ≤ (y/x)^2 H(y) g(x).   (3.1)

We can adjust the argument of Lemmas 17 and 18 in [34] to handle the heavy hitters with additive error.

Lemma 42. Let HH(ε, δ) be an algorithm that solves the (ε, L2)-heavy hitters problem in the IQ model (Definition 33), and let g be a slow-jumping and slow-dropping function. Then HH(ε/(2H(n)), δ) solves the (ε, g)-heavy hitters problem in the IQ model (Definition 41).

Proof. Note that for any (g, ε)-heavy i:

g( fi(t1, t2)) ≥ ε ∑j g( fj(t1, t))

Therefore, applying the second statement of Equation 3.1:

g(fi(t1, t2)) ≥ (ε g(fi(t1, t2)) / H(n)) ∑_{fj(t1,t) ≤ fi(t1,t2)} fj^2(t1, t) / fi^2(t1, t2)

fi^2(t1, t2) ≥ (ε/H(n)) ∑_{fj(t1,t) ≤ fi(t1,t2)} fj^2(t1, t)

Similarly, applying the first statement of Equation 3.1:

g(fi(t1, t2)) ≥ (ε/H(n)) ∑_{fj(t1,t) ≥ fi(t1,t2)} g(fi(t1, t2))

Hence, there are at most H(n)/ε items with fj(t1, t) ≥ fi(t1, t2), and HH(ε/(2H(n)), δ) will detect item i.

Further, we argue that the Recursive Sketch with a (g, ε)-heavy hitter algorithm which finds

all i such that g(fi(t1, t2)) ≥ εG(t1, t) will return Ĝ(t1, t2) = G(t1, t2) ± εG(t1, t).

Algorithm 12 Recursive GSum(S0, ε) [34]
1: H1, . . . , Hφ, φ = O(log n), are pairwise independent 0/1 vectors
2: Sj is the subsampled stream {s ∈ Sj−1 : Hj(s) = 1}
3: Compute, in parallel, HHj = HH(Sj)
4: Compute Yφ = Gφ(t1, t2) ± εGφ(t1, t)
5: for j = φ − 1, . . . , 0 do
6:   compute Yj = 2Yj+1 + ∑_{i∈HHj} (1 − 2Hj+1(i)) ĝ(fi)
7: end for
8: return Y0

Theorem 43. Let HH be an algorithm that finds all i such that g(fi(t1, t2)) ≥ εG(t1, t), together with an ε-approximation of g(fi(t1, t2)). Then Algorithm 12 computes

Ĝ(t1, t2) = G(t1, t2) ± εG(t1, t)

and errs with probability at most 0.3. Its space overhead is O(log n).

Proof. Note that Gj is a G-sum computed on the subsampled stream Sj and, according to line 4 of Algorithm 12, Yφ = Gφ(t1, t2) ± εGφ(t1, t). Our goal is to evaluate the error propagation from the top level of subsampling φ down to level 0 and show that

Y0 = G(t1, t2) ± εG(t1, t).

Consider r.v. Xj:

Xj = ∑_{i∈HHj} g(fi) + 2 ∑_{i∉HHj} Hj+1(i) g(fi).

Xj is an unbiased estimator of Gj(t1, t2) with variance bounded as:

Var(Xj) = ∑_{i∉HHj} g^2(fi) ≤ εGj(t1, t) ∑_{i∉HHj} g(fi) ≤ εGj^2(t1, t)

by the definition of HHj and the monotonicity of G. Therefore, by conditioning on HHj's success and Chebyshev's inequality:

Pr(|Xj − Gj(t1, t2)| ≥ ε' Gj(t1, t)) ≤ ε/ε'^2 + δ.   (3.2)

By the definition of Hj+1, Xj can be rewritten as

Xj = 2Gj+1 + ∑_{i∈HHj} (1 − 2Hj+1(i)) g(fi).

Then, |Xj − Yj| ≤ 2|Gj+1 − Yj+1| + ∑_{i∈HHj} |ĝ(fi) − g(fi)|. To simplify further derivations, denote Ej^1 = |Xj − Gj|, Ej^2 = |Gj − Yj|, and Ej^3 = ∑_{i∈HHj} |ĝ(fi) − g(fi)|.

Ej^2 = |Gj − Yj| ≤ |Gj − Xj| + |Xj − Yj| ≤ Ej^1 + 2Ej+1^2 + Ej^3.

Therefore, the error propagates to layer 0 as: E0^2 ≤ E0^1 + 2E1^2 + E0^3 ≤ . . . ≤ 2^φ Eφ^2 + ∑_{j=0}^{φ} 2^j Ej^1 + ∑_{j=0}^{φ} 2^j Ej^3. Denote the event 2^φ Eφ^2 ≥ ε'' G(t1, t) as A, the event ∑_{j=0}^{φ} 2^j Ej^1 ≥ ε'' G(t1, t) as B, and the event ∑_{j=0}^{φ} 2^j Ej^3 ≥ ε'' G(t1, t) as C. Applying formula (3.2) for all j, Pr(B) is upper bounded by:

Pr( ε' ∑_{j=0}^{φ} 2^j Gj(t1, t) ≥ ε'' G(t1, t) ) + (φ + 1)(ε/ε'^2 + δ).

Note that E[∑_{j=0}^{φ} 2^j Gj(t1, t)] = (φ + 1) G(t1, t); therefore, by Markov's inequality:

Pr(B) ≤ (φ + 1) ε'/ε'' + (φ + 1)(ε/ε'^2 + δ).

Recall that HHj fails with probability at most δ, there are at most 1/ε items such that g(fi) ≥ εGj(t1, t), and if HHj succeeds then ĝ(fi) = (1 ± ε)g(fi) for all i such that g(fi) ≥ εGj(t1, t). Therefore, ∑_{i∈HHj} |ĝ(fi) − g(fi)| < εGj(t1, t), and we can bound Pr(C) from above by:

Pr( ε ∑_{j=0}^{φ} 2^j Gj(t1, t) ≥ ε'' G(t1, t) ) + (φ + 1)δ.

Finally, applying Markov's inequality: Pr(C) ≤ (φ + 1) ε/ε'' + (φ + 1)δ.

To bound Pr(A), recall that Yφ = Gφ(t1, t2) ± εGφ(t1, t) with probability at least 1 − δ and E[2^φ Gφ(t1, t)] = G(t1, t). Therefore, P(A) ≤ ε/ε'' + δ, and putting it all together:

P(A ∪ B ∪ C) ≤ (φ + 2) ε/ε'' + (φ + 1) ε'/ε'' + (φ + 1)(ε/ε'^2 + δ).

Choosing ε ≤ 0.1ε''^2/(φ + 1)^3 and ε' ≤ 0.1ε''/(φ + 1), we get the statement of the theorem:

P(|Y0 − G(t1, t2)| ≥ ε'' G(t1, t)) ≤ 0.3.

Extending to a wider class of queries Recall that the interval queries (t1, t2) introduced earlier were not measured in time, but rather in the number of packets passed through. However, in practice it is often more useful to query a statistic on a time-based interval, for example a one-hour interval half a day ago, or from 5PM to 6PM yesterday. All presented algorithms are easily extendable to answer time-based interval queries: one only needs to store a timestamp for each bucket of the SH or EH framework, and use that timestamp when searching for the buckets approximating the interval. Note that all presented algorithms support weighted packets, i.e., when each packet i arrives with its weight wi, which corresponds to the update fi = fi + wi.
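As a small illustration of the time-based extension, assuming each SH/EH bucket records the timestamp of its creation (hypothetical helper, not part of the C prototype):

import bisect

def bucket_index_for_time(bucket_timestamps, query_time):
    """Map a wall-clock query time to the first stored bucket whose timestamp is
    at or after it; that bucket boundary then approximates the time-based endpoint."""
    return bisect.bisect_left(bucket_timestamps, query_time)

# timestamps recorded when each bucket was created (hypothetical values)
stamps = [10.0, 17.5, 21.0, 30.2]
print(bucket_index_for_time(stamps, 18.0))   # 2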

3.4 Evaluation

Next, we evaluate our algorithms on various network measurement tasks. We have implemented a prototype in C and evaluated the accuracy vs. memory trade-off using four traces: the CAIDA Internet Traces “Equinix-Sanjose” from 2014 (SanJose2014) [74], “Equinix-Chicago” from 2016 (Chicago2016) [139], and “Equinix-NewYork” from 2018 (NewYork2018) [137]; and a data center trace from the University of Wisconsin (Univ2) [17]. All experiments are based on 10M-sized traces. We evaluate multiple measurement metrics in two experiments. In the first experiment, we select a packet once every 30k packets and estimate the frequency of the corresponding flow on an interval between 10k and 20k packets ago (suffix length of 20k). In the second, we estimate frequencies on varying suffixes and show how the suffix length affects the empirical error. To do so, we select a packet once per 200k packets and estimate the frequency of the corresponding flow for every possible suffix length from 100K to the window size of 1M. The depicted figures for the second experiment are for the NewYork2018 dataset. Frequency estimation: We start with the frequency estimation problem. The results in Figure 3.3 show the trade-off between the memory consumption and the empirical error for different network traces. As expected, having more memory increases the accuracy of frequency estimation. The difference between the traces is mainly attributed to the workload characteristics, namely how the L2 norm changes during each trace. Figure 3.4 shows results for the second experiment on the NY2018 trace. Notice that (i) longer suffixes indeed have larger estimation error than shorter ones. This is expected, as our analysis indicates that the error is proportional to the suffix length. (ii) More memory increases accuracy for every suffix length. This is also expected, as the error is proportional to the accuracy parameter ε, which decreases with the memory.


FIGURE 3.3: Average frequency estimation error for flows in 10-20k interval.

(iii) the average error is small for moderate memory consumption, e.g. given 18MB, the average error is less than 64 packets for all suffix lengths.

L2 norm estimation: We repeat the above experiments for L2 norm estimation. Figure 3.5 shows results for the first experiment. Notice that we get the same trend as before: more memory leads to better accuracy in estimating the L2 norm. Figure 3.6 shows results for the second experiment. As can be seen, (i) we get better L2 estimations for small suffixes, and (ii) additional memory increases the accuracy.

Heavy hitters estimation: We now evaluate L2 heavy hitters, with θ = 0.5%. We evaluate three metrics for heavy hitters estimation: (i) Precision: how many of the reported flows are indeed heavy hitters, (ii) Recall: how many of the real heavy hitters are reported, and (iii) the F1 measure, which factors both Precision and Recall into a


FIGURE 3.4: Average frequency estimation error for flows for various suffix lengths on the NY2018 dataset.

single measure. The perfect solution has F1 = 1, and the closer the value is to 1, the more accurate the solution. Figure 3.7 shows precision (solid curves) and recall (dashed) for various workloads for the first experiment. Notice that (i) increasing memory increases both precision and recall, and (ii) we get to around 90% recall with enough memory, and 95+% accuracy.

Figure 3.8 shows the F1 values in the same experiment. As can be observed, the value converges close to 1 as we increase the memory. Figures 3.9, 3.10, and 3.11 show results for the second experiment. Figure 3.9 shows

precision, Figure 3.10 shows recall, and Figure 3.11 shows the F1 metric. As can be observed, (i) longer suffixes yield less accurate results. For precision, there is a slight anomaly for the 2 and 4.5 MB sized algorithms. Note, however, that at the same time recall drops, and


FIGURE 3.5: Average L2 norm estimation error for flows in 10-20k intervals.

the overall quality (F1) remains constant. This means that fewer heavy hitters are detected, but also fewer non-heavy hitters are falsely detected. (ii) Increasing memory improves all quality metrics for any suffix length. Entropy estimation: Finally, we evaluate the error in entropy estimation. We used Algorithm 4 and extended UnivMon [101] to provide entropy estimations. Figures 3.12 and 3.13 show the results. As illustrated, we estimate the entropy for this interval very accurately throughout the range of memory. As expected, more memory means a better estimation.


FIGURE 3.6: Average L2 norm estimation error for flows for various suffix lengths on the NY2018 dataset.

3.5 Conclusion

Our study investigates the IQ model, which allows calculating metrics on any interval within a recent window. Such capabilities are often provided by databases that keep the data in full. Our novelty lies in providing such capabilities in a space-efficient manner, using considerably less space than any database. Thus, it is a step towards realizing efficient interval queries in network devices. We anticipate that such capabilities would enable a new generation of advanced network algorithms, given access to more fine-grained primitives. For example, security applications would be able to compare a variety of measurement metrics at different time scales to identify attacks more accurately.

Our work studies how to extend L2 heavy hitter algorithms to the IQ model. We justify


FIGURE 3.7: Quality of HH solution for 10k-20k interval (first experiment). Precision and Recall.

the focus on L2 as it is used as a building block in monitoring a large variety of network measurement tasks such as frequency estimation, the L2 norm, the number of distinct flows, and the entropy of the traffic distribution. We suggest three different techniques to do so, each improving on the space requirement of the previous one. We also extend the analysis of the previously suggested Exponential Histogram [46], which enables it to be used as a building block in the IQ model. Finally, we extend UnivMon [101], which is capable of all the network tasks mentioned above, to the IQ model based on our best

L2 heavy hitter algorithm. Our work is the first to provide these measurement tasks in the IQ model, and through an extensive evaluation on real Internet traces, we show that these capabilities are practical within the available memory range of network devices.


FIGURE 3.8: Quality of HH solution for 10k-20k interval (first experiment). F1 Measure.


FIGURE 3.9: Quality of HH solution for varying suffix lengths (second exper- iment). Precision.


FIGURE 3.10: Quality of HH solution for varying suffix lengths (second ex- periment). Recall.


FIGURE 3.11: Quality of HH solution for varying suffix lengths (second ex- periment). F1 Measure.


FIGURE 3.12: Average entropy relative error for 10-20k intervals.


FIGURE 3.13: Average entropy relative error for various suffix lengths on the NY2018 dataset.

Chapter 4

Streaming quantiles algorithms with small space and update time

The current chapter is based on the work done in collaboration with Liberty E., Karnin Z., Lang K., and Braverman V.

4.1 Introduction

Estimating the underlying distribution of data is crucial for many applications. It is common to approximate an entire Cumulative Distribution Function (CDF) or specific quantiles. The median (0.5 quantile) and the 95-th and 99-th percentiles are widely used in financial analytics, statistical tests, and system monitoring. Quantile summaries have found applications in databases [129, 123], sensor networks [98], logging systems [121], distributed systems [49], and decision trees [40]. While computing quantiles is conceptually very simple, using compressed summaries becomes unavoidable for very large data. Formally, the quantiles problem can be defined as follows: a data structure inputs a multiset S = {si}_{i=1}^{n}, and upon a query φ it should output the item with rank φn, i.e., the φn-th item in the sorted order of S. An approximate version of the problem relaxes the requirement and

admits any item with rank in [(φ − ε)n, (φ + ε)n]. In a randomized version, the algorithm is allowed to make a mistake with probability at most δ. Note that in order for the randomized version to provide a correct answer to all possible queries, it suffices to amplify the probability of success by running the algorithm with failure probability δε and applying the union bound over all O(1/ε) quantiles. In network monitoring [101] and other applications it is critical to maintain statistics while making only a single pass over the data and minimizing the communication and update time. As a result, the quantiles problem has been considered in several models: distributed settings [49, 65, 130], continuous monitoring [45, 147], streaming [2, 85, 104, 105, 143, 64, 66], and sliding windows [8, 100]. In the present chapter, the quantiles problem is considered in the standard streaming setting: at time 0 the initial set S0 is empty, and at every time moment t ∈ [n] the data structure is updated with a new item st, i.e.,

St = St−1 ∪ {st}; at the end of the stream, the data structure is queried with a quantile φ. The algorithm's space complexity and approximation guarantees should not depend on the order or the content of the updates st, and should depend on n at most sublinearly. In the pioneering paper [112], Munro and Paterson showed that one would need Ω(n^{1/p}) space and p passes over the dataset to find the median. They also suggested an optimal iterative algorithm to find it. Later, Manku et al. [104] showed that the first iteration of the algorithm in [112] can be used to solve the ε-approximate quantile problem in one pass using only O((1/ε) log^2 n) words of space. Note that, for a small enough ε, this is a significant improvement over the naive algorithm, which samples O((1/ε^2) log(1/ε)) items of the stream using reservoir sampling. The algorithm in [104] is deterministic; however, compared with reservoir sampling, it assumes the length of the stream to be known in advance. In many applications such an assumption is unrealistic. In their follow-up

paper [105], the authors suggested a randomized algorithm without that assumption. A further improvement by Agarwal et al. [2], via randomizing the core subroutine, pushed the space requirement down to O((1/ε) log^{3/2}(1/ε)). In addition, the new data structure has been proven to be fully mergeable. Greenwald and Khanna [64] presented an algorithm that maintains upper and lower bounds for each quantile individually, rather than one bound for all quantiles. It is deterministic and requires only O((1/ε) log εn) words of memory; however, it is not known to be fully mergeable. Later, Felber and Ostrovsky [57] suggested novel techniques for feeding sampled items into the sketch from [64] and improved the space complexity to O((1/ε) log(1/ε)) words of memory. Recently, Karnin et al. [85] presented an asymptotically optimal but non-mergeable data structure with space usage of O((1/ε) log log(1/ε)) words. In the same paper, the authors proved a matching lower bound and presented another algorithm which is fully merge-

able, however with a space requirement of O((1/ε) log^2 log(1/ε)) words. In the current chapter, we suggest several further improvements to the algorithms introduced in [85]. These improvements do not affect the asymptotic guarantees of [85], but improve the constant terms, both in theory and in practice. The suggested techniques also improve the worst-case update time. Additionally, we suggest two algorithms for the extended version of the problem where each update comes with a weight. All the algorithms discussed in this chapter are in the comparison model, i.e., the algorithm must explicitly store the items it will output and therefore cannot "compute" an item, in contrast to the fixed-universe model. For a more detailed review of quantiles algorithms in the streaming model we refer the reader to [66, 143].
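For concreteness, the approximation requirement can be spelled out as follows (a hypothetical standalone Python helper, not tied to any particular sketch):

import bisect

def exact_quantile(items, phi):
    """Exact phi-quantile: the item of rank floor(phi * n) in sorted order."""
    s = sorted(items)
    return s[min(len(s) - 1, int(phi * len(s)))]

def is_valid_answer(items, phi, answer, eps):
    """Check the eps-approximate quantile guarantee for a returned item."""
    s = sorted(items)
    n = len(s)
    lo, hi = bisect.bisect_left(s, answer), bisect.bisect_right(s, answer)
    return lo - eps * n <= phi * n <= hi + eps * n

stream = [7, 1, 5, 3, 9, 2, 8, 4, 6, 0]
print(exact_quantile(stream, 0.5))            # 5
print(is_valid_answer(stream, 0.5, 4, 0.2))   # True: rank of 4 is within 0.2*n of 0.5*n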

4.2 A unified view of previous randomized solutions

To introduce further improvements to streaming quantiles algorithms, we first re-explain the previous work using the simplified concepts of one-pair compression and a compactor. Consider a simple problem in which your dataset contains only two items a and b, while your data structure (DS) can only store one item. We focus on the comparison-based framework, where we can only compare items and cannot compute new items via operations such as averaging. In this framework, the only option for the DS is to pick one of them and store it explicitly. The stored item x is assigned weight 2. Given a rank query q, the DS reports 0 for q < x and 2 for q > x. Note that for q ∉ [a, b] the output of the DS is correct; however, for q ∈ [a, b] the correct rank is 1, and the DS introduces a +1/−1 error depending on which item it stored. From here on, we call q an inner query with respect to the compression (a, b) if q ∈ [a, b], and an outer query otherwise; this notion lets us distinguish the queries for which an error is introduced from those that were not influenced by a compression. Figure 4.1 depicts the above example of one-pair compression.


FIGURE 4.1: One pair compression: initially each item has weight w, com- pression introduces w error for inner queries and no error for outer queries.
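A minimal Python sketch of the one-pair compression and the rank error it introduces (illustrative function names, not part of the released code):

import random

def compress_pair(a, b, w):
    """One-pair compression: keep one of two weight-w items and double its weight."""
    return random.choice((a, b)), 2 * w

def rank_error(a, b, w, kept, q):
    """Error the compression introduces into the rank estimate of a query q."""
    lo, hi = min(a, b), max(a, b)
    if not (lo < q < hi):              # outer query: no error
        return 0
    true_contribution = w              # exactly one of the two items lies below q
    estimated = 2 * w if kept < q else 0
    return estimated - true_contribution   # +w or -w, equiprobably

kept, new_w = compress_pair(2, 9, w=1)
print(kept, new_w, rank_error(2, 9, 1, kept, q=5))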

The example gives rise to a high-level method for the original problem with a dataset of size n and a memory capacity of k items: 1) keep adding items to the DS until it is full; 2) choose any pair of items with the same weight and compress them. Notice that if

119 Chapter 4. Streaming quantiles algorithms with small space and update time we choose those pairs without care, in the worst-case we might end up representing the full dataset by its top k elements, introducing an error of almost n, that is much larger than εn. It follows that we should make sure that the pairs being compacted (i.e. compressed) should have their ranks as close as possible, thereby affecting as few queries as possible. This intuition is implemented via a compactor. First introduced by Manku et al. in [104], it defines an array of k items with weight w each, and a compaction procedure which compress all k items into k/2 items with weight 2w. A compaction procedure first sorts all items, then deletes either even or odd positions, and doubles the weight of the rest. Figure 4.2 depicts the error introduced for different rank queries q, by a compaction pro- cedure applied to an example array of items [1, 3, 5, 8]. Notice that the compactor utilizes the same idea as the one pair compression, but on the pairs of neighbors in the sorted ar- ray; thus by performing k/2 non-intersecting compressions it introduces an overall error of w as opposed to kw/2. The algorithm introduced in [104], defines a stack of H = O(log n/k) compactors, each of size k. Each compactor obtains as an input a stream and outputs a stream with half the size by performing a compact operation each time its buffer is full. The output of the final compactor is a stream of length k that can simply be stored in memory. The

Compactor contents:       1  3  5  8    (rank queries q1 < 1 < q2 < 3 < q3 < 5 < q4 < 8 < q5)
Keeping odd positions:    1  5
Keeping even positions:   3  8
Introduced rank error:    0, w, 0, w, 0  (for q1, q2, q3, q4, q5)

FIGURE 4.2: Compaction procedure: rank error w is introduced to inner queries q2,4, no error to outer queries q1,3,5

bottom compactor, which observes the raw items, has weight 1; the next one observes items of weight 2, and the top one items of weight 2^{H−1}. The output of a compaction on the h-th level is an input to the compactor on the (h + 1)-th level. Note that the error introduced on the h-th level is equal to the number of compactions m_h = n/(k w_h) times the error introduced by one compaction, w_h. The total error can be computed as

Err = ∑_{h=1}^{H} m_h w_h = Hn/k = O((n/k) log(n/k)).

Setting k = O((1/ε) log εn) leads to an approximation error of εn. The space is used by H compactors of size k each, i.e., O((1/ε) log^2 εn) words. Note that the algorithm is deterministic. Later, Agarwal et al. [2] suggested that the compactor choose the odd or even positions randomly and equiprobably, pushing the introduced error to zero in expectation. Additionally, the authors suggested a new way of feeding a subsampled stream into the data structure, recalling that O((1/ε^2) log(1/ε)) samples preserve quantiles with εn approximation error. The proposed algorithm requires O((1/ε) log^{3/2}(1/ε)) words of space and succeeds with high constant probability.
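A minimal sketch of this randomized compaction step (illustrative only, assuming an even number of buffered items):

import random

def compact(buffer, w):
    """Randomized compaction: sort the weight-w items and keep either the even or
    the odd positions (chosen equiprobably), doubling the weight of the survivors."""
    buf = sorted(buffer)
    offset = random.randint(0, 1)
    return buf[offset::2], 2 * w

print(compact([1, 3, 5, 8], w=1))   # ([1, 5], 2) or ([3, 8], 2)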

To prove the result, the authors introduced a random variable X_{i,h} denoting the error introduced by the i-th compaction at the h-th level. Then the overall error is:

Err = ∑_{h=1}^{H} ∑_{i=1}^{m_h} w_h X_{i,h}.

w_h X_{i,h} is bounded, has mean zero, and is independent of the other variables. Thus, by Hoeffding's inequality:

P(|Err| > εn) ≤ 2 exp( −ε^2 n^2 / ∑_{h=1}^{H} ∑_{i=1}^{m_h} w_h^2 ).

Setting w_h = 2^{h−1} and k = (1/ε)√(log 1/δ) keeps the error probability bounded by δ. The follow-up improvement by Karnin et al. [85] suggested to:

1. use exponentially decreasing size of the compactor,

2. keep only the top O(log 1/ε) compactors,

3. keep the size of the top O(log log 1/δ) compactors fixed,

4. replace the top O(log log 1/δ) compactors with the GK sketch [64].

Improvements (1) and (2) dropped the space down to O((1/ε)√(log 1/ε)), (3) pushed it further to O((1/ε) log^2 log(1/ε)), and (4) led to the optimal O((1/ε) log log(1/ε)). The authors also provided a matching lower bound. Note that the last solution is not mergeable due to the use of GK [64] as a subroutine. While (3) and (4) lead to an asymptotically better algorithm, their implementation is quite complicated for practical purposes and is mostly of theoretical interest. For these reasons, in the current chapter we build upon the KLL algorithm of [85] using only (1) and (2).

In [85], the authors show that if the size of the compactors decreases as k_h = c^{H−h} k for c ∈ (0.5, 1), then ∑_{h=1}^{H} ∑_{i=1}^{m_h} w_h^2 ≤ n^2 / (2c^2(2c−1) k^2). As previously, applying Hoeffding's inequality results in:

P(|Err| > εn) ≤ 2 exp(−C ε^2 k^2) ≤ δ,

where C = 2c^2(2c − 1). Setting k = O((1/ε)√(log 1/ε)) leads to the desired approximation guarantee for all O(1/ε) quantiles with constant probability. Note that the smallest meaningful compactor has size 2; thus the algorithm requires

k(1 + c + . . . + c^{log_{1/c} k}) + O(log n) = k/(1 − c) + O(log n)

words of memory, where the last term is due to the stack of compactors of size 2. The authors suggested replacing that stack with a basic sampler, which picks one item out of every w = 2^{H − log_{1/c} k} updates at random, is logically identical, but costs only O(1) words. The resulting space complexity is O((1/ε)√(log 1/ε)). We provide the pseudocode for the core routine in Algorithm 13.

Algorithm 13 Core routines for the KLL algorithm [85]
1: function KLL.UPDATE(item)
2:   if SAMPLER(item) then KLL[0].APPEND(item)
3:   for h = 1 . . . H do
4:     if LEN(KLL[h]) ≥ k_h then KLL.COMPACT(h)
5:   end for
6: end function
7: function KLL.COMPACT(h)
8:   KLL[h].SORT(); rb = RANDOM({0, 1})
9:   KLL[h + 1].APPEND(KLL[h][rb : : 2])
10:  KLL[h] = []
11: end function
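For concreteness, here is a minimal, self-contained Python sketch in the spirit of Algorithm 13. The class name and all details (no sampler, exact weight bookkeeping, capacities shrinking geometrically from the top) are illustrative simplifications, not the authors' released implementation:

import random

class SimpleKLL:
    """Minimal KLL-style quantile sketch (illustrative simplification)."""

    def __init__(self, k=128, c=2.0 / 3.0):
        self.k = k
        self.c = c
        self.compactors = [[]]          # compactors[h] holds items of weight 2**h

    def capacity(self, h):
        # the top compactor gets the full size k, lower ones shrink geometrically
        depth = len(self.compactors) - h - 1
        return max(2, int(self.k * (self.c ** depth)))

    def update(self, item):
        self.compactors[0].append(item)
        h = 0
        while h < len(self.compactors) and len(self.compactors[h]) >= self.capacity(h):
            self.compact(h)
            h += 1

    def compact(self, h):
        if h + 1 == len(self.compactors):
            self.compactors.append([])
        buf = sorted(self.compactors[h])
        leftover = [buf.pop()] if len(buf) % 2 else []   # keep one item if count is odd
        offset = random.randint(0, 1)                    # keep even or odd positions
        self.compactors[h + 1].extend(buf[offset::2])
        self.compactors[h] = leftover

    def quantile(self, phi):
        weighted = [(x, 2 ** h) for h, buf in enumerate(self.compactors) for x in buf]
        weighted.sort()
        total = sum(w for _, w in weighted)
        running = 0
        for x, w in weighted:
            running += w
            if running >= phi * total:
                return x
        return weighted[-1][0]

sketch = SimpleKLL(k=64)
data = list(range(100000))
random.shuffle(data)
for v in data:
    sketch.update(v)
print(sketch.quantile(0.5))   # lands near 50000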

4.3 Our Contribution

In the effort to obtain a practical method, throughout the chapter we maintain a convenient API for the algorithm where the only parameter it requires is the memory limit, as opposed to knowing the values of parameters such as ε and δ in advance. Our aim is to obtain the best possible guarantee given that memory limit. Such optimization (minimizing error under a fixed memory budget) is widely used in practical applications. Below we present a series of modifications that improve the approximation error both in theory and in practice. We then compare all of them experimentally in Section 4.4.

Lazy compactions We suggest that all the compactors share the pool of allocated memory and perform a compaction only when the pool is fully saturated. This way, each compaction is performed on a larger set of items, and the total number of compactions is lower. Since each compaction introduces a fixed amount of error, the total error introduced is lower. Algorithm 14 gives the formal lazy-compacting algorithm, and Figure 4.3 visualizes its advantage: in vanilla KLL each compactor is kept below its individual capacity, leaving memory underused, while in lazy KLL only the shared pool limit is enforced.

FIGURE 4.3: Compactor saturation: vanilla KLL vs. lazy KLL

Algorithm 14 Update procedure for lazy KLL
1: function KLL.UPDATE(item)
2:   if SAMPLER(item) then KLL[0].APPEND(item); itemsN++
3:   if itemsN > sketchSpace then
4:     for h = 1 . . . H do
5:       if LEN(KLL[h]) ≥ k_h then
6:         KLL.COMPACT(h); break
7:       end if
8:     end for
9:   end if
10: end function

                          ii    io    oi    oo
even → odd (w.p. 1/2)      0    −w    +w     0
odd → even (w.p. 1/2)      0    +w    −w     0

TABLE 4.1: Possible outcomes for the rank query q.

Reduced randomness Instead of flipping a coin during each compaction, we suggest doing it only for every other compaction, and for the others using the opposite of the previous

choice. This way, each coin flip defines two consecutive compactions: w.p. 1/2 it is even → odd (e → o), and w.p. 1/2 it is odd → even (o → e). Consider all possible options for a fixed rank query q in two consecutive compactions: io (inner-outer), oi, ii, oo (depicted in Table 4.1). Clearly, in expectation every two compactions introduce zero error; additionally, we conclude that the total variance introduced while processing the stream is half of that of vanilla KLL, as in vanilla KLL each compaction introduces an unbiased error bounded in absolute value by w, while in our modification the same variance is introduced only by every two compactions.

Equally Spread Error Recall that in the analysis of all compactor-based solutions [104, 2, 85, 143], during a single compaction we can distinguish two types of rank queries: inner queries, for which some error is introduced, and outer queries, for which no error is introduced. That being said, the existing algorithms do not exploit the fact that outer


FIGURE 4.4: Compaction with an equally spread error: every query q2,3,4,5 is either inner or outer equiprobably.

queries incur no error, and an adversarial stream might be arranged such that a particular query is inner in all compactions. We suggest probabilistically "spreading" the error of each compaction among all queries. Suppose k is odd; on each compaction we flip a coin and then compact either the items with indices 0 to k − 1 or those with indices 1 to k, equiprobably.

This way, a query q is outer w.p. 1/2 and no error is introduced; w.p. 1/4 it is inner and the introduced error is −w; and w.p. 1/4 it is inner with error +w. Thus, the estimator is still unbiased but the variance is cut in half. We note that the same analysis applies to two consecutive compactions using the reduced randomness improvement discussed in Section 4.3: the configurations (ii, io, oi, oo) of a query in two consecutive compactions described in Table 4.1 now happen with equal probability. Consequently, we have the same distribution for

the error: 0 w.p. 1/2, +w w.p. 1/4, and −w w.p. 1/4, meaning that the variance is cut in half compared to the worst-case analysis without the error-spreading improvement. Thus, if we apply both modifications, the total variance should become four times smaller. The algorithm including the error-spreading improvement is illustrated in Figure 4.4.
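A minimal sketch of the error-spreading compaction, under the assumption of an odd-sized buffer in which the left-out item simply stays behind (illustrative only):

import random

def compact_spread(buffer, w):
    """Error-spreading compaction: with an odd number of buffered items, leave out
    either the first or the last item equiprobably and compact the remaining
    even-sized block, so each query is inner or outer with equal probability."""
    buf = sorted(buffer)
    if random.random() < 0.5:
        block, leftover = buf[:-1], [buf[-1]]   # the last item stays in the buffer
    else:
        block, leftover = buf[1:], [buf[0]]     # the first item stays in the buffer
    offset = random.randint(0, 1)
    return block[offset::2], leftover, 2 * w

print(compact_spread([1, 3, 5, 8, 11], w=1))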

Sweep-compactor The error bound for all compactor-based algorithms follows from the property that every batch of k/2 compressions is disjoint. In this section we suggest compressing pairs one at a time, while making sure that every batch of at least k/2 compressions is disjoint. Compacting a single pair takes constant time; hence we reduce the worst-case update time from O(1/ε) to O(log(1/ε)). Additionally, for some data streams the disjoint batch size is strictly larger than k/2, resulting in a reduction of the overall error. The modified compactor operates in phases we call sweeps. It maintains the same buffer as before and an additional threshold θ initialized as −∞¹. As previously, the items in the buffer are stored in non-decreasing sorted order. When a pair must be compacted,

1Notice that −∞ is still defined in the comparison model

the algorithm chooses the pair of the two smallest items larger than θ; after the compaction, the algorithm updates θ to the value of the largest item of the compacted pair. If no such pair exists (due to θ being too large), we set θ to −∞, thereby starting a new sweep. Figure 4.5 demonstrates a single sweep, and the pseudocode for the modified compactor is provided in Algorithm 15.

Algorithm 15 Sweep compaction procedure
1: function KLLSWEEP.COMPACT(h)
2:   KLL[h].SORT()
3:   i* = argmin_i (KLL[h][i] ≥ KLL[h].θ)
4:   if i* == None then i* = 0
5:   KLL[h].θ = KLL[h][i* + 1]
6:   KLL[h].POP(i* + RANDBIT())
7:   return KLL[h].POP(i*)
8: end function

We note two important properties of a sweep in a buffer that can hold k (or k + 1) items; the proof of these properties is immediate. The first is that the overall number of pairs compacted in each sweep is at least k/2. The second is that the intervals of all the pairs in a single sweep do not intersect. These two properties together result in exactly the same worst-case guarantee as the ordinary compactor. However, notice that for an already sorted stream the modified compactor performs only a single sweep; hence in this scenario the resulting error would not be a sum of n/k i.i.d. error terms, each of magnitude w, but rather a single error term of magnitude w. Though this extreme situation may not happen very often, it is likely that the data admits some sorted subsequences and the average sweep contains more than k/2 pairs. We demonstrate this empirically in our experiments.

Weighted stream extension Consider the stream of updates (ai, wi), where each item ai comes with a weight wi. After feeding the stream into a data structure, it is queried with

a quantile φ, and should report an item x such that:

(φ − ε)W ≤ ∑_{i : ai < x} wi ≤ (φ + ε)W,   where W = ∑_i wi.


FIGURE 4.5: Example of one full sweep in 4 stages, each stage depicts pair chosen for the compaction, updated threshold θ and new items arrived (shadow bars).

Instead of replacing each weighted update (ai, wi) of the stream S with wi unit-weight copies of ai (forming a stream S'), one can break it into power-of-two weights:

wi = ∑_{j=0}^{⌊log wi⌋} 2^j (⌊wi / 2^j⌋ mod 2).

Note that this operation can be performed very efficiently, because the weight is typically stored in binary representation. Such a breakdown guarantees that each update in S causes no more than O(log w_m) updates in S', where w_m is the largest weight among the updates. Later we show how to push it down to only O(log 1/ε) updates, independent of w_m. Thus |S'| ≤ |S| log w_m. Recall that all compactor-based algorithms have a weight w_h = 2^h assigned to the compactor at level h. Therefore, instead of adding unitary updates to the lowest compactor, we distribute our power-of-two weighted updates among the corresponding compactors (including feeding into the sampler, if needed). The process is depicted in Figure 4.6. The algorithm's update procedure is presented in Algorithm 16.

[Figure sketch: an update (a, 93) is decomposed via the binary expansion of 93; the high-order power-of-two parts are pushed directly into the compactors at the matching levels, and the remaining low-order part (a, 13) is fed to the sampler.]

FIGURE 4.6: Intuition behind base2update algorithm
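A minimal sketch of this power-of-two breakdown (hypothetical helper name, not the thesis code):

def power_of_two_parts(w):
    """Split a positive integer weight into its binary power-of-two parts,
    returning (compactor level, weight 2**level) for every set bit."""
    parts = []
    level = 0
    while w:
        if w & 1:
            parts.append((level, 1 << level))
        w >>= 1
        level += 1
    return parts

print(power_of_two_parts(93))   # [(0, 1), (2, 4), (3, 8), (4, 16), (6, 64)]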

Theorem 44. Algorithm base2update processes a stream of weighted updates and outputs all ε-approximate quantiles with high probability, works in space O((1/ε) log^{1/2}(1/ε)), and has update time O(log^2(1/ε)).


Algorithm 16 Base2update update procedure
1: function KLL.PUSHITEMS((a, w))
2:   for h = log w − log(1/ε) . . . log w do
3:     if ⌊w / 2^{h−1}⌋ mod 2 = 1 then
4:       KLL[h].APPEND(a)
5:       compact if needed
6:     end if
7:   end for
8: end function
9: function KLL.UPDATE((a, w))
10:  if w < 2^{H+1} then
11:    KLL.PUSHITEMS((a, w))
12:  end if
13:  if w ∈ [2^{H+1}, (1/ε)2^H] then
14:    KLL.PUSHITEMS((a, w))
15:  end if
16:  if w > (1/ε)2^H then
17:    h* = argmax_h (2^{H+h}(1/ε) < w)
18:    delete the bottom h* compactors
19:    add h* empty compactors on the top
20:    KLL.PUSHITEMS((a, w))
21:  end if
22: end function

Proof. Algorithm 16 classifies all updates by their weight into three groups:

1. small: w < 2^{H+1}

2. medium: w ∈ [2^{H+1}, (1/ε)2^H]

3. large: ∃h* ≥ 0 : w > (1/ε)2^{H+h*}

For small updates, the weight can be represented as log wi updates with power-of-two weights. Note that the algorithm discards all except the largest O(log 1/ε) of them. In doing so, it

introduces in total an additional error of at most ∑_i εwi = εW. Then the algorithm adds every update with weight 2^h to the level-h compactor directly. This operation does not introduce any error until a compaction procedure is performed. For large updates, the weight is too large to be included in any compactor. However, it makes the contribution of the bottom h* compactors negligible compared with wi. Thus we can delete the bottom h* compactors and create h* new compactors on the top. Similarly to the

case described earlier, we discard at most εwi of weight; thus in total this could introduce up to εW error. After the h* new compactors are created, a large update can be classified as a small one and processed accordingly. On the other hand, medium updates are not large enough to let us discard the bottom compactors. As a result, we cannot create new compactors. Additionally, such an update could require up to O(1/ε) items to be pushed onto the top layer, which blows up the update time exponentially. To avoid such a loss in the update time, the compactor needs to support efficient insertion of many identical items at a time: this can be achieved if we allow storing a count next to the item itself. As we only need to do this for items which appear at least twice, this modification does not hurt the space complexity. Note that in all three cases the error introduced is bounded by 3εW: εW for discarding the bottom weights in the decomposition of wi, εW for discarding the bottom h* compactors, and εW

from running the KLL algorithm. Each case requires up to O(log 1/ε) updates to KLL. Using the sweep-compactor KLL as the core of the base2update algorithm, the total update time is O(log^2(1/ε)) and the space complexity is O((1/ε) log^{1/2}(1/ε)).

The Base2compactor algorithm does not require any stream transformation. To explain the idea behind it, we first introduce a weighted compactor and a weighted pair

compression. Suppose you are given two pairs (a, wa) and (b, wb), such that a < b; however, the data structure can store only one item. Due to the limitations of the comparison model, as described in Section 4.2, the only option is to pick either a or b. Base2compactor

chooses a w.p. wa/(wa + wb) and b w.p. wb/(wa + wb), assigns weight wa + wb to the chosen item, and drops the other one. For q ∉ [a, b] this operation does not introduce any error; however, for

q ∈ [a, b] the introduced error is +wb w.p. wa/(wa + wb) and −wa w.p. wb/(wa + wb). The expectation of

the error is zero, and a direct computation shows that its variance is wa·wb. To carefully con-

trol the variance, we introduce the weighted compactor as an array of pairs {(ai, wi)}_{i=1}^{k_h}

such that wi ∈ [w, 2w) (w is a characteristic of the given compactor), and the compaction procedure is similar to the unweighted case:

1. sort the array using ai as an index,

2. break the array into pairs of neighbors (aj, wj), (aj+1, wj+1)

3. compress each pair, using procedure described above

The whole process is depicted in Figure 4.7. The algorithm's update procedure for

a given input item (ai, wi) pushes it into the compactor with w such that wi ∈ [w, 2w), and then performs a compaction if needed. Note that the inputs of the compactor on the level with weight w have weights in the range [w, 2w), and the output of a compaction on the same level has items with weights in the range [2w, 4w); thus it can be pushed into the compactor on the next level.


FIGURE 4.7: Compressing pair in the weighted compactor
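A minimal sketch of the weighted pair compression described above (illustrative function name):

import random

def compress_weighted_pair(a, wa, b, wb):
    """Weighted pair compression: keep a with probability wa/(wa+wb), otherwise b;
    the survivor gets the combined weight wa+wb. For an inner query the rank error
    is +wb or -wa, which has zero mean and variance wa*wb."""
    if random.random() < wa / (wa + wb):
        return a, wa + wb
    return b, wa + wb

# a < b with weights in [w, 2w); the output weight lands in [2w, 4w)
print(compress_weighted_pair(3, 5, 7, 6))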

The analysis for this extension is in the same spirit as for unitary updates. Let X_{i,h} be a random variable which indicates the sign of the error introduced during the i-th compaction on the h-th level, and let it be equal to zero if no error is introduced. Note that during each weighted compaction, when all items have weights within the range [w, 2w), the maximum absolute error is bounded by 2w; thus the total error introduced is

Err = ∑_{h=1}^{H} ∑_{i=1}^{m_h} 2 w_h X_{i,h}.

Note that this value differs from the one in the analysis of vanilla KLL only by a factor of 2; thus all the previous derivations made for unitary updates still hold with one constant adjustment:

P(|Err| > εW) ≤ 2 exp(−(C/4) ε^2 k^2) < δ.

To reach the same approximation guarantees with the same failure probability, one needs to set k_new = 2k_old, i.e., this algorithm uses twice as much space compared to the naive implementation. Additionally, it requires storing the weight of each item explicitly, which might cause up to another factor of two in space complexity (depending on the memory required to store one item). To process arbitrary-weight updates, we suggest using the same technique as in the base2update algorithm. Then base2compactor requires at most O(1) updates to KLL to process each weighted update from the initial stream. Thus, if we use the sweep-compactor KLL as the

core algorithm, we get the following theorem.

Theorem 45. Algorithm base2compactor processes a stream of weighted updates and outputs all ε-approximate quantiles with high probability, works in space O((1/ε) log^{1/2}(1/ε)), and has update time O(log(1/ε)).

Note that the importance of the improvement in update time depends a lot on the application, in particular on the relation between n and 1/ε, and is debatable. Let us call a weighted stream "good" if it is long enough (n ≫ 1/ε) and all updates have moderate weights (there are no extreme situations where the weight of an update is a constant fraction of the total weight seen so far). For "good" streams, as soon as the sampling rate grows significantly larger than the typical weight of an update, the worst-case update time for KLL-ls (lazy sweep compactors) becomes O(log 1/ε) and the amortized time drops to O(1).

4.4 Experimental Results

Datasets. To study the algorithms' properties, we tested them on both synthetic and real datasets of various sizes, underlying distributions, and orders. Note that the approximation guarantees of all investigated algorithms do not depend on the data itself; however, the empirical error does depend on the order in which the dataset is fed into the algorithm. Surprisingly, the worst case is achieved when the dataset is randomly shuffled []. Therefore, we pay more attention to randomly ordered datasets in this section, while also experimenting with semi-random orders that resemble real-life applications more closely. Due to space limitations, we cannot present all the experiments in this chapter. We encourage the readers to refer to the interactive stand, while here we present only the most interesting and relevant findings.

Our experiments were carried out on two types of synthetic datasets. The length of the stream varies from 10^5 to 10^9 for both types of datasets. In the first type, we fixed

the distribution (ai ∈ {1, . . . , n} and each item appears exactly once) and varied the order of updates: randomly shuffled; sorted order (ascending and descending); semi-sorted (shuffle the entire array, then for i ∈ {1, . . . , √n} locally sort the i-th chunk of √n items); zoom-

in order (a_{2i} = i, a_{2i+1} = n − i); zoom-out order (a_{2i} = n/2 + i, a_{2i+1} = n/2 − i). In the second type, we fixed the order to be a random shuffle and checked the following distributions: uniform, Gaussian, and a mixture of Gaussians. Different variances and numbers of components were tested with the last two. To evaluate the performance of KLL and all proposed modifications on real data, we used two datasets; to highlight the advantage of the comparison-based model, both of them contain non-numeric updates: one contains strings and the other IP addresses. (1) Anonymized Internet Traces 2015 (CAIDA) [138]. The dataset contains anonymized passive traffic traces from the internet data collection monitor which belongs to CAIDA (Center for Applied Internet Data Analysis) and is located at an Equinix data center in Chicago, IL. While the dataset has a variety of additional information provided with each packet, for simplicity we consider only two fields, namely the source IP and the destination IP, as one update for the algorithm. The comparison model is lexicographic. We evaluate the performance on pieces of the dataset of different sizes: from 10^7 to 10^9. Note that evaluating the CDF of the underlying distribution of the traffic flow helps optimize packet management; thus CAIDA's datasets are widely used for verifying different sketching techniques that maintain various statistics over the flow, and for finding quantiles and heavy hitters specifically.

of 2016. The data is aggregated by day, i.e., within each day the data is sorted and each item is assigned the count of requests during that day. Every update in this dataset is the title of a Wikipedia page. We consider several modifications of the original dataset:

1. updates in the original order, while ignoring the weights,

2. updates are shuffled within each day, while ignoring the weights,

3. updates are shuffled within each day, while the weight indicates the number of appearances

Similarly to the CAIDA dataset, we consider Wiki datasets of sizes from 10^7 to 10^9. In our experiments, one update is a string which contains the name of the Wikipedia page; the comparison model is lexicographic.

Implementation and evaluation details All the algorithms and experimental settings are implemented in Python 3.6.3 and are publicly available via GitHub . The advantage of using a scripting language is fast prototyping; in our case it gives a chance to quickly test all the new tweaks to the algorithm and to distribute concise and readable code in the community, thus encouraging others to try their own modifications. The time performance of the algorithms is not the subject of research in the current chapter and should be investigated separately, specifically for the sweep-compactor KLL and the algorithms for weighted quantiles, which theoretically improve the worst-case update time exponentially in 1/ε. All algorithms in the current comparison are randomized; thus, for each experiment the presented results are averaged over 50 independent runs. KLL and all suggested modifications are compared with each other and with LWYC (algorithm Random from [77]). In [143] the authors carried out an experimental study of the algorithms from [104, 105, 2, 64] and concluded that their own algorithm LWYC, with space complexity O((1/ε) log^{3/2}(1/ε)), is



preferable to the others: better in accuracy than [64] and similar in accuracy to [105], while LWYC has simpler logic, i.e. it is easier to implement.

FIGURE 4.8: Figures 4.8a, 4.8b, 4.8c, 4.8e, 4.8f depict the trade-off between the maximum error over all queried quantiles and the space allocated to the sketch: Figures 4.8a, 4.8c, 4.8b show the results on randomly ordered streams but in different axes; Figure 4.8e shows the results for the sorted stream, the stream ordered according to the zoom-in pattern, and the stream with Gaussian distribution; Figure 4.8f shows the approximation ratio for the CAIDA dataset. Figure 4.8d shows the trade-off between error and the length of the stream.

As mentioned earlier, we compared our algorithms under fixed space restrictions. In other words, in all experiments we fixed the space allocated to the sketch and evaluated each algorithm based on the best accuracy it can achieve under that space limit. We measured the accuracy as the maximum deviation among all quantile queries, otherwise known as the Kolmogorov-Smirnov divergence, widely used to measure the distance between the CDFs of two distributions. Additionally, we measure separately the variance introduced

by the compaction steps and by sampling. Its value can help the user evaluate the accuracy of the output. Note that for KLL this value depends on the size of the stream and is independent of the arrival order of the items. In other words, the guarantees of KLL are the same for all types of streams, adversarial and structured. Some of our improvements change this property; recall that the sweep compactor KLL, when applied to sorted input, requires only a single sweep, i.e. compaction, per layer. For this reason, in our experiments we found the variance to depend not only on the internal randomness of the algorithm but also on the arrival order of the stream items.

Results. Figure 4.8a shows the actual error for LWYC, KLL, and its modifications. Note that the majority of the modifications presented in the current chapter can be combined for better performance; due to space limits we present only the following: KLL-l, KLL-lr, KLL-lre, KLL-lrs and KLL-les. For compactness, from here on we use the following encoding: "l" for lazy, "r" for reduced randomness, "e" for spread error, and "s" for sweep compactor. Solid lines in Figure 4.8a depict the actual error vs. the space allocated to the sketch, the dashed line depicts the improvement of KLL-lrs over LWYC, and the dotted line the improvement over KLL. Improvement is measured as error(LWYC)/error(KLL-lrs) and error(KLL)/error(KLL-lrs), correspondingly. First, we can see that all KLL-based algorithms provide an approximation ratio significantly better than LWYC as the space allocation grows, which confirms the theoretical guarantees, i.e. the dependency of the space complexity on ε in both algorithms. This is presented in more detail in Figure 4.8b, where the same results are plotted in the axes (error × space vs. log^{1/2}(1/error)). Although these axes are not very useful for practical tuning and optimizing, they show that the theoretical upper bounds are quite tight: all KLL-based algorithms can be approximated with linear functions, while LWYC clearly shows the

polynomial growth. Figure 4.8b also makes it obvious that vanilla KLL has a different constant compared to its modifications. Returning to Figure 4.8a, we can see that the improvement of the modified KLL over the vanilla version approaches a factor of two, which agrees with the theoretical argument that the variance of the total error introduced in KLL-lre is four times smaller than for KLL. Note that a smaller allocated space induces a higher portion of the final error coming from sampling; thus a larger sketch size would demonstrate a larger improvement factor. Figure 4.8a shows that KLL-lre performs up to 7 times better than LWYC, and the factor grows with the space allocated to the sketch. Similar experiments were carried out on all synthetic and real datasets; in Figure 4.8e we plot the results for sorted order, zoom-in order, and randomly shuffled items from a Gaussian distribution. Note that most compactor-based algorithms show their best performance on sorted datasets. Figure 4.8e shows that on the sorted stream vanilla KLL performs worse than LWYC. However, the sweep-compactor modification helps to beat LWYC even on sorted data, i.e. the sweep compactor based KLL-lrs is better than LWYC on all tested orders. The same figure shows a significant advantage of the proposed algorithms and modifications over LWYC on zoom-in ordered and Gaussian streams. Similar positive results were achieved on the real datasets: in Figure 4.8f one can see that the error introduced by KLL-lrs is half that of KLL and 8 times smaller than that of LWYC. Although theoretically none of the algorithms should depend on the length of the dataset, we verified this property in practice; the results can be seen in Figure 4.8d.

4.5 Conclusion

We verified experimentally that the KLL algorithm proposed by Karnin et al. [85] has the predicted asymptotic improvement over LWYC [143], which previously showed the best results among compactor-based algorithms. We proposed three modifications to KLL with provably better constants in the approximation bounds. Experiments verified the improvement in approximation: almost twice better than KLL and more than 8 times better than LWYC (and growing with the space allocated to the sketch). Moreover, the worst-case update time of the presented sweep-compactor based KLL is O(log 1/ε), which improves over the other compactor-based algorithms with O(1/ε). The two algorithms proposed for weighted streams improve over the naive extension from

O((max_i w_i) log(1/ε)) to O(log(1/ε)), while working with the same space complexity.

Chapter 5

Finding haloes in cosmological N-body simulations

5.1 Introduction

The main purpose of astrophysics as a field is to describe the universe and explain all of its observed properties. Cosmology in particular works with the distribution of matter in the universe on large scales. The specifics of the field make intensive computer simulations the only way to mimic controlled experiments and thereby verify hypotheses on the evolution of the matter distribution. In other words, simulations help us understand how matter organizes itself into galaxies, clusters of galaxies, and large-scale structures (e.g. [133]). Therefore, a large amount of effort is spent on running simulations modeling representative parts of the universe in ever greater detail. At a high level, a simulation operates on a set of particles and at each step iteratively computes first the gravitational force field caused by their interaction and then the corresponding change in the position and velocity of each particle. The nature of the simulations makes them computationally and memory intensive, pushing the hardware requirements beyond the

capabilities of many researchers: just hosting a single snapshot of a state-of-the-art simulation [7, 124] requires tens of terabytes of memory. Moreover, even just storing such a large snapshot is not only expensive but also challenging. The crucial step in the analysis of the simulations is to locate mass concentrations called "haloes" [92], where galaxies would be expected to form. This step is important to connect theory and observations, as galaxies are the most observable objects that trace the large-scale structure, but their precise spatial distribution is only established through these simulations. Finding haloes in the output of the simulations allows astronomers to compute important statistics, e.g., mass functions [87]. Although from an astronomical perspective the concept of a "halo" is fairly well understood, the mathematical definition of haloes in a simulation varies among simulation and analysis methods. For instance, [122] defines them as mass blobs around density peaks above some threshold; [47] defines them as the connected components of the distance graph on the particles. Another definition does not use the density at all and instead uses particle crossings [54]. Many algorithms have been developed to find haloes in simulations. The algorithms vary widely, even conceptually. The lack of agreement upon a single definition of a halo makes it difficult to uniquely compare the results of different halo-finding algorithms. Nevertheless, [91] evaluated 17 different algorithms, compared them using various criteria, and found broad agreement between them, but with many differences in detail for ambiguous cases. The Friends-of-Friends algorithm (FoF) [47] is often considered a standard approach, as it was among the first used and is conceptually simple. The drawbacks of FoF include that its simple density estimate can artificially link physically separate haloes together, and the arbitrariness of the linking length. A halo-finding comparison project [91] evaluated the results of 17 different halo-finding algorithms; further analysis appeared in [92].

Although there are a large number of algorithms and implementations [93, 61, 122, 90, 134, 144, 114, 47], all of them are generally computationally intensive, often requiring all particle positions and velocities to be loaded in memory simultaneously. In fact, most are run during the execution of the simulation itself, requiring comparable computational resources. However, in order to understand the systematic errors in such algorithms, it is often necessary to run multiple halo finders, often well after the original simulation has been run. Also, many of the newest simulations have several hundred billion to over a trillion particles [7, 124], with a very large memory footprint, making post-processing analysis infeasible without supercomputers of the same scale as those that created the simulations in the first place. In the current chapter we investigate a way to apply streaming algorithms as halo finders and compare the results to those of other algorithms participating in the Halo-Finding Comparison Project. Given that state-of-the-art cosmological simulations operate with over a trillion particles, compared to offline algorithms that require the input to be entirely in memory, the streaming model provides a way to process the data by making just one or a small number of passes over it, using only megabytes of memory instead of terabytes. Section 5.2 is based on work done in collaboration with Liu Z., Yang L., Neyrinck M., Lemson G., Szalay A., Braverman V., Budavari T., Burns R., and Wang X. [102]. In that work we investigate a novel connection between the problem of finding the most massive haloes in cosmological N-body simulations and the problem of finding heavy hitters in data streams. We have built a halo finder based on implementations of the Count-Sketch algorithm [39] and Pick-and-Drop sampling [26]. The halo finder successfully locates most (> 90%) of the top k ≈ 1000 largest haloes in a dark matter simulation and can be run on machines with relatively modest computing resources. However, in [102], all experiments were run on relatively small data streams with

at most 10^9 items. One of the reasons for that was the rather poor time performance of the underlying algorithms, which would cause every experiment to take more than a week to run. In Section 5.3, which is based on [81], work done in collaboration with Liu Z., Yang L., Kumar S., Neyrinck M., Lemson G., Szalay A., Braverman V., Budavari T., we improve the implementation and push the number of halo centers to be found to ∼ 10^4 – 10^5. Our tool needs less than 5 minutes to find the top 3 · 10^5 heavy cells on a dataset with 10^10 particles. Compared to previous results [102], which required more than 8 hours, this is more than a 100× improvement. This dataset consists of a snapshot of the Millennium dataset [133], and we use a grid of 10^11 cells in our algorithm for the approximation of the density field, which can be used further for astrophysical analysis. We port the entire Count-Sketch infrastructure onto the GPU and thus make the tool significantly outperform the previous approach. In our analysis, we carefully investigate the trade-off between memory and the quality of the result. In [102] the authors reduced the halo-finding problem to the problem of finding the top-k densest cells in a regular mesh. This reduction shows that these densest cells are closely related to the regions containing the heaviest haloes. In [81] we consider another possible application, that of determining statistics on "excursion sets". Kaiser [83] investigated the clustering properties of the regions with a density higher than the average in Gaussian random fields. He showed that such regions cluster more strongly than those with lower overdensities and that the strength of this effect increases with the density threshold. He used this as an explanation of the observed stronger clustering of galaxy clusters compared to the clustering of the galaxy distribution itself. Bardeen et al. [13] refined this argument, focusing on the peaks of the density field, the locations where galaxies and clusters are expected to form. This biased clustering phenomenon can be examined in an evolved

density field by filtering regions in the dark matter distribution field based on their density. This is equivalent to examining the "heavy hitters" in the counts-in-cells. We expect the randomized algorithms not to be exact, and it is interesting to investigate how this affects the clustering measure.

5.2 Streaming Algorithms for Halo Finders

This section is based on [102].

5.2.1 Methodology

Streaming and heavy hitters. As defined in Chapter 3.2, a data stream D = D(n, m) is an ordered sequence of objects p_1, p_2, . . . , p_n, where p_j ∈ {1, . . . , m}. The elements of the stream can represent any digital object: integers, real numbers of fixed precision, edges of a large graph, messages, images, web pages, etc. In practical applications, both n and m may be very large, and we are interested in algorithms with o(n + m) space. A streaming algorithm is an algorithm that can make a single pass over the input stream. The above constraints imply that a streaming algorithm is often a randomized algorithm that provides approximate answers with high probability. In practice, these approximate answers often suffice. We investigate the results of cosmological simulations where the number of particles will soon reach 10^12. Compared to offline algorithms that require the input to be entirely in memory, streaming algorithms provide a way to process the data using only megabytes of memory instead of gigabytes or terabytes in practice.

Further, we investigate the connection between the problems of finding haloes and finding heavy hitters; thus, for the sake of completeness, we review the heavy hitter problem and heavy hitter algorithms, first introduced in Section 3.2.

For each element i, its frequency f_i is the number of its occurrences in D. The k-th frequency moment of a data stream D is defined as $F_k(D) = \sum_{i=1}^{m} f_i^k$. We say that an element is "heavy" if it appears more times than a constant fraction of some L_p norm of the stream, where $L_p = (\sum_i f_i^p)^{1/p}$ for p ≥ 1. In this section, we consider the following heavy hitter problem.

Problem 1. Given a stream D of n elements, the ε-approximate (φ, L_p)-heavy hitter problem is to find a set of elements T such that:

• ∀i ∈ [m], f_i > φ L_p ⟹ i ∈ T.

• ∀i ∈ [m], f_i < (φ − ε) L_p ⟹ i ∉ T.

We allow the heavy hitter algorithms to use randomness; the requirement is that the correct answer should be returned with high probability. The heavy hitter problem is equivalent to the problem of approximately finding the k most frequent elements. Indeed, the top k most frequent elements are in the set of (φ, L_1)-heavy hitters of the stream, where φ = Θ(1/k). There is an Ω(1/ε^2) trade-off between the approximation error ε and the memory usage. Heavy hitter algorithms are building blocks of many data stream algorithms ([30, 80]). We treat the cosmological simulation data from [91] as a data stream. To do so, we apply an online transformation that we describe in the next section.

Data Transformation. In a cosmological simulation, dark matter particles form structures through gravitational clustering in a large box with periodic boundary conditions representing a patch of the simulated universe. The box we use [91] is of size 500 Mpc/h, or about 2 billion light-years. The simulation data consists of positions and velocities of 256^3, 512^3 or 1024^3 particles, each representing a huge number of physical dark-matter particles. They are distributed rather uniformly on large scales (≳ 50 Mpc/h) in the simulation box, clumping together on smaller scales. A halo is a clump of particles that are gravitationally bound. To apply the streaming algorithms, we transform the data. We discretize the spatial coordinates so that we have a finite number of types in our transformed data stream. We partition the simulation box into a grid of cubic cells and bin the particles into them. The cell size is chosen to be 1 Mpc/h so as to match the typical size of a large halo; there are thus 500^3 cells. This parameter can be modified in practical applications, but it relates to the space and time efficiency of the algorithm. We summarize the data transformation steps as follows.

• Partition the simulation box into a grid of cubic cells. Assign each cell a unique integer ID.

• After reading a particle, determine its cell. Insert that cell ID into the data stream.

Using the above transformation, streaming algorithms can process the particles in the same way as an integer data stream.
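To make this binning concrete, below is a minimal Python sketch of the transformation (the function and constant names are illustrative, not the thesis code); it assumes a periodic box of 500 Mpc/h and a cell size of 1 Mpc/h, as described above.

    BOX_SIZE = 500.0    # Mpc/h, periodic simulation box (as in the text)
    CELL_SIZE = 1.0     # Mpc/h, chosen to match a typical large-halo size
    N_SIDE = int(BOX_SIZE / CELL_SIZE)   # 500 cells per dimension

    def cell_id(x, y, z):
        """Map a particle position (in Mpc/h) to a unique integer cell ID."""
        ix = int(x / CELL_SIZE) % N_SIDE   # modulo handles the periodic boundary
        iy = int(y / CELL_SIZE) % N_SIDE
        iz = int(z / CELL_SIZE) % N_SIDE
        return (ix * N_SIDE + iy) * N_SIDE + iz

    def particle_stream(particles):
        """Yield one cell ID per particle: this is the integer stream fed to the sketch."""
        for x, y, z in particles:
            yield cell_id(x, y, z)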

Heavy Hitters and Dense Cells. For a heavy-hitter algorithm to save memory and time, the distribution of cell counts must be very non-uniform. The simulations begin with an almost uniform lattice of particles, but after gravity clusters them together, the density distribution in cells can be modeled by a lognormal PDF ([41], [89]):


\[
P_{\mathrm{LN}}(\delta) = \frac{1}{(2\pi\sigma_1^2)^{1/2}} \exp\left(-\frac{[\ln(1+\delta) + \sigma_1^2/2]^2}{2\sigma_1^2}\right)\frac{1}{1+\delta}, \qquad (5.1)
\]

where δ = ρ/ρ̄ − 1 is the overdensity, σ_1^2(R) = ln[1 + σ_nl^2(R)], and σ_nl^2(R) is the variance of the nonlinear density field in spheres of radius R. Our cells are cubic, not spherical; for theoretical estimates, we use a spherical top-hat of the same volume as a cell.

Let N be the number of cells, and P_c be the distribution of the number of particles per cell. The L_p heaviness φ_p can be estimated as

\[
\varphi_p \approx \frac{P_{200}}{\left(N \langle P_c^p \rangle\right)^{1/p}}, \qquad (5.2)
\]
where P_200 is the number of particles in a cell with density exactly 200ρ̄. This density threshold is a typical minimum density of a halo, coming from the spherical-collapse model. We theoretically estimated σ_nl for the cells in our density field by integrating the nonlinear power spectrum (using the fit of [132] and the cosmological parameters of the simulation) with a spherical top-hat window. The grid size in our algorithm is roughly 1.0 Mpc (500^3 cells in total), giving σ_nl(Cell) ≈ 10.75. We estimated φ_1 ≈ 10^{-6} and φ_2 ≈ 10^{-3}, matching in order of magnitude the measurement of the actual density variance from the simulation cells. These heaviness values are low enough to presume that a heavy-hitter algorithm will efficiently find cells corresponding to haloes.

Streaming Algorithms for the Heavy Hitter Problem. The above relation between the halo-finding problem and the heavy hitter problem encourages us to apply efficient streaming algorithms to build a new halo finder. Our halo finder takes a stream of particles, performs the data transformation described above and then applies a heavy hitter algorithm

to output the approximate top k heavy hitters in the transformed stream. These heavy hitters correspond to the densest cells in the simulation data as described earlier. In the first version of our halo finder, we use the Count-Sketch algorithm [39] and Pick-and-Drop Sampling [26].

The Count-Sketch Algorithm. For a more generalized description of the algorithm, please refer to [39]. For completeness, we summarize the algorithm as follows. The Count-Sketch algorithm uses a compact data structure to maintain the approximate counts of the top k most frequent elements in a stream. This data structure is an r × t matrix M representing estimated counts for all elements. These counts are calculated by two sets of hash functions: let h1, h2,..., hr be r hash functions, mapping the input items to {1, . . . , t}, where each hi is sampled uniformly from the hash function set H. Let s1, s2,..., sr be hash functions, mapping the input items to {+1, −1}, uniformly sampled from another hash function set S. We can interpret this matrix as an array of r hash tables, each containing t buckets.

There are two operations on the Count-Sketch data structure. Denote by M_{i,j} the j-th bucket in the i-th hash table:

• Add(M, p): for i ∈ [1, r], M_{i, h_i[p]} += s_i[p].

• Estimate(M, p): return median_i { M_{i, h_i[p]} · s_i[p] }.

The Add operation updates the approximate frequency for each incoming element, and the Estimate operation outputs the current approximate frequency. To maintain and store the estimated k most frequent elements, Count-Sketch also needs a priority queue data structure. The pseudocode of the Count-Sketch algorithm is presented in Figure 5.1. More details and theoretical guarantees are presented in [39].

procedure CountSketch(r, t, k, D)            ▷ D is a stream
    initialize an empty r × t matrix M
    initialize a min-priority queue Q of size k (the particle with the smallest count is on top)
    for i = 1, . . . , n and p_i ∈ D do
        Add(M, p_i)
        if p_i ∈ Q then
            increment the count of p_i in Q
        else if Estimate(M, p_i) > Q.top().count then
            Q.pop()
            Q.push(p_i, Estimate(M, p_i))
        end if
    end for
    return Q
end procedure

FIGURE 5.1: Count-Sketch Algorithm
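As an illustration of the same logic, here is a compact Python sketch of Count-Sketch with a tracked top-k set (Python is used only for illustration here; the halo finder itself is implemented in C++). The hash functions are simulated with Python's built-in hash on seeded tuples, and the top-k structure is a plain dictionary with linear-time eviction; both are simplified stand-ins for the pairwise-independent hash families and the priority queue assumed in the text.

    class CountSketch:
        def __init__(self, r, t, k, seed=0):
            self.r, self.t, self.k = r, t, k
            self.M = [[0] * t for _ in range(r)]          # r x t table of counters
            self.seeds = [(seed + i, seed + 1000 + i) for i in range(r)]
            self.tracked = {}                              # current top-k candidates

        def _hashes(self, p, i):
            hb = hash((self.seeds[i][0], p)) % self.t              # bucket hash -> [t]
            hs = 1 if hash((self.seeds[i][1], p)) % 2 else -1      # sign hash -> {+1, -1}
            return hb, hs

        def add(self, p):
            for i in range(self.r):
                hb, hs = self._hashes(p, i)
                self.M[i][hb] += hs

        def estimate(self, p):
            vals = sorted(self.M[i][hb] * hs
                          for i in range(self.r)
                          for hb, hs in [self._hashes(p, i)])
            return vals[len(vals) // 2]                    # median over the r rows

        def update(self, p):
            self.add(p)
            est = self.estimate(p)
            if p in self.tracked or len(self.tracked) < self.k:
                self.tracked[p] = est
            else:
                worst = min(self.tracked, key=self.tracked.get)   # a real implementation
                if est > self.tracked[worst]:                     # would use a priority queue
                    del self.tracked[worst]
                    self.tracked[p] = est

        def top_k(self):
            return sorted(self.tracked.items(), key=lambda kv: -kv[1])

For example, feeding the cell-ID stream from the earlier sketch through update() and reading top_k() at the end approximates the k densest cells.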

The Pick-and-Drop Sampling Algorithm. Pick-and-Drop sampling is a sampling-based streaming algorithm to approximate the heavy hitters. To describe the idea of Pick-and-Drop sampling, we view the data stream as a sequence of r blocks of size t. Define d_{i,j} as the j-th element of the i-th block, i.e. d_{i,j} = p_{t(i−1)+j} in the stream D. In each block of the stream, Pick-and-Drop sampling picks one random sample and records its remaining frequency in the block. The algorithm maintains the sample with the largest current counter and drops previous samples. The pseudocode of Pick-and-Drop sampling [26] is given in Figure 5.2, for which we need the following definitions. For i ∈ [r], j, s ∈ [t], q ∈ [m] define:

\[
f_{i,q} = |\{ j \in [t] : d_{i,j} = q \}|, \qquad (5.3)
\]

\[
a_{i,s} = |\{ j^* : s \leq j^* \leq t,\ d_{i,j^*} = d_{i,s} \}|. \qquad (5.4)
\]

The detailed implementation is given in Section 5.2.2.

procedure PickDrop(r, t, λ, D)
    sample S_1 uniformly at random from [t]
    L_1 ← d_{1,S_1}
    C_1 ← a_{1,S_1}
    u_1 ← 1
    for i = 2, . . . , r do
        sample S_i uniformly at random from [t]
        l_i ← d_{i,S_i},  c_i ← a_{i,S_i}
        if C_{i−1} < max(c_i, λ u_{i−1}) then
            L_i ← l_i
            C_i ← c_i
            u_i ← 1
        else
            L_i ← L_{i−1}
            C_i ← C_{i−1} + f_{i, L_{i−1}}
            u_i ← u_{i−1} + 1
        end if
    end for
    return {L_r, C_r}
end procedure

FIGURE 5.2: Pick-and-Drop Algorithm
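The following is a small Python sketch of a single Pick-and-Drop instance following the pseudocode above; it operates on an in-memory list purely for illustration (a streaming implementation would keep only the current sample, its counter, and u), and the names are illustrative rather than the thesis code.

    import random

    def pick_and_drop(stream, r, t, lam):
        """One Pick-and-Drop instance: view `stream` as r blocks of t items,
        return the surviving sample L and its accumulated counter C."""
        blocks = [stream[i * t:(i + 1) * t] for i in range(r)]

        def tail_count(block, s):
            # a_{i,s}: occurrences of block[s] from position s to the end of the block
            return block[s:].count(block[s])

        s = random.randrange(t)
        L, C, u = blocks[0][s], tail_count(blocks[0], s), 1
        for i in range(1, r):
            s = random.randrange(t)
            l_i, c_i = blocks[i][s], tail_count(blocks[i], s)
            if C < max(c_i, lam * u):
                L, C, u = l_i, c_i, 1            # drop the old sample, pick the new one
            else:
                C += blocks[i].count(L)          # f_{i,L}: keep counting the old sample
                u += 1
        return L, C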

5.2.2 Implementation

Simulation Data. The N-body simulation data we use as the input to our halo finder was used in the halo-finding comparison project [91] and consists of various resolutions (numbers of particles) of the MareNostrum Universe cosmological simulation [63]. These simulations ran in a 500 h^{-1} Mpc box, assuming a standard ΛCDM (cold dark matter and cosmological constant) cosmological model. In the first implementation of our halo finder, we consider two halo properties: the center position and the mass (the number of particles in it). We compare to the fiducial offline algorithm FoF. The distributions of halo sizes from different halo finders are presented in Fig. 5.3. Since our halo finder builds upon streaming algorithms for finding frequent

items, the algorithms need to transform the data as described in Section 5.2.1: dividing all the particles into small cells and labeling each particle with its associated cell ID.

FIGURE 5.3: Halo mass distribution of various halo finders.

For example, if an input dataset contains three particles p1, p2, p3 and they are all included in a cell of ID = 1, then the transformed data stream becomes 1, 1, 1. The most frequent element in the stream is obviously 1 and thus the cell 1 is the heaviest cell overall.

Implementation Details. Our halo finder implementation is written in C++ with the GNU GCC compiler 4.9.2. We implemented Count-Sketch and Pick-and-Drop sampling as the two algorithms to find heavy hitters.

Count-Sketch-based Halo Finder. There are three basic steps in the Count-Sketch algorithm, which returns the heavy cells and the number of particles associated with them: (1) allocate memory for the Count-Sketch data structure to hold current estimates

of cell frequencies; (2) use a priority queue to record the top k most frequent elements; (3) return the positions of the top k heavy cells.

FIGURE 5.4: Count-Sketch Algorithm

The Count-Sketch data structure is an r × t matrix. Following [42], we set r = log(n/ε) and t sufficiently large (> 1 million) to achieve an expected approximation error ε = 0.05. We build the matrix as a 2D array of r × t zeros. For each incoming element in the stream, an Add operation has to be executed, and an Estimate operation needs to be executed only when the element is not already in the queue.

Pick-and-Drop-based Halo Finder. In the Pick-and-Drop sampling based halo finder, we implement a general hash function H : N⁺ → {1, 2, . . . , ck}, where c ≥ 1, to boost the probability of successfully approximating the top k heaviest cells. We apply the hash function H to every incoming element and group elements with the same hash value together, so that the original stream is divided into ck smaller sub-streams. Meanwhile, we initialize ck instances of Pick-and-Drop sampling so that each PD instance will process

one sub-stream. The whole process of approximating the heavy hitters is presented in Figure 5.5. In this way, repeated items from the whole stream are routed to the same sub-stream, where they are much heavier. With high probability, each instance of Pick-and-Drop sampling will output the heaviest item in its sub-stream, and in total we will have ck output items. Because of the randomness of the sampling method, we expect some inaccurate heavy hitters among the total ck outputs. By setting a large c, most of the actual top k most frequent elements should be inside the ck outputs (raw data).

FIGURE 5.5: Pick-and-Drop Sampling

To get precise properties of haloes, such as the center and the mass, an offline algorithm such as FoF [47] can be applied to the particles inside the returned heavy cells and their neighbor cells. This needs an additional pass over the data, but we only need to store a small number of particles to run those offline in-memory algorithms. The whole process


of the halo finder is represented in Figure 5.6, where the heavy hitter algorithm can be regarded as a black box. That is, any theoretically efficient heavy hitter algorithm could be applied to further improve the memory usage and practical performance.

FIGURE 5.6: Halo Finder Procedure
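As a rough illustration of this second pass (the names and the 3×3×3 neighborhood convention are assumptions, not the thesis code), the following Python sketch keeps only the particles that fall into a reported heavy cell or one of its 26 neighbors, so that an in-memory finder such as FoF can then be run on this much smaller subset.

    def neighbor_cells(cid, n_side):
        """All cell IDs in the 3x3x3 neighborhood of cell `cid` in a periodic box."""
        ix, rest = divmod(cid, n_side * n_side)
        iy, iz = divmod(rest, n_side)
        cells = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    cells.append((((ix + dx) % n_side) * n_side + (iy + dy) % n_side) * n_side
                                 + (iz + dz) % n_side)
        return cells

    def second_pass(particles, heavy_cells, n_side, to_cell_id):
        """Keep only particles inside heavy cells or their neighbors (second pass over data)."""
        keep = set()
        for c in heavy_cells:
            keep.update(neighbor_cells(c, n_side))
        return [p for p in particles if to_cell_id(*p) in keep]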

Shifting Method. In the first pass of our halo finder, we only use the position of a heavy cell as the position of a halo. However, each heavy cell may contain several haloes and some of the haloes located on the edges between two cells cannot be recognized because the cell size in the data transformation step is fixed. To recover those missing haloes, we utilize a simple shifting method:

• Initialize 2d instances of Count-Sketch or Pick-and-Drop in parallel, where d is the dimension. Our simulation data reside in three dimensions, so d = 3.


FIGURE 5.7: Measures of the disagreement between PD and CS, and various in-memory algorithms. The percentage shown is the fraction of haloes farther than a half-cell diagonal (0.5√3 Mpc/h) from PD or CS halo positions.

• Move all the particles in one of the 2d directions by a distance of 0.5 Mpc/h (half of the cell size). In each of the 2d shifting processes, assign a Count-Sketch/Pick-and-Drop instance to run. By combining the results from the 2d shifting processes, we expect that the majority of the top k largest haloes are discovered. All the parallel instances of Count-Sketch/Pick-and-Drop are enabled by OpenMP 4.0 in C++ (a small illustrative sketch of the shifted cell-ID computation follows).
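A minimal Python sketch of the shifted binning, assuming half-cell shift vectors along the coordinate axes (the exact set of shift directions used in the C++ tool is not reproduced here; all names are illustrative):

    BOX = 500.0          # Mpc/h, periodic box size
    CELL = 1.0           # Mpc/h, cell size
    N_SIDE = int(BOX / CELL)

    def shifted_cell_id(x, y, z, shift):
        """Cell ID of a particle after applying a shift vector (in Mpc/h)."""
        idx = [int(((v + s) % BOX) / CELL) % N_SIDE for v, s in zip((x, y, z), shift)]
        return (idx[0] * N_SIDE + idx[1]) * N_SIDE + idx[2]

    # One sketch instance per shift; half-cell shifts along each axis are one possible choice.
    SHIFTS = [(0.5, 0, 0), (-0.5, 0, 0), (0, 0.5, 0),
              (0, -0.5, 0), (0, 0, 0.5), (0, 0, -0.5)]

    def feed_shifted(sketches, particles):
        """Feed every particle to every shifted sketch instance."""
        for x, y, z in particles:
            for sketch, shift in zip(sketches, SHIFTS):
                sketch.update(shifted_cell_id(x, y, z, shift))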

5.2.3 Evaluation

To evaluate how well streaming-based halo finders work, we mainly focus on testing them in the following aspects:



FIGURE 5.8: The number of top-1000 FoF haloes farther than a distance d away from any top-1000 halo from the algorithm of each curve.

• Correctness: Evaluate how close the positions of the top k largest haloes found by the streaming-based algorithms are to the top k largest haloes returned by widely used in-memory algorithms. Evaluate the trade-off between the selection threshold k and the quality of the result.

• Stability: Since streaming algorithms rely on randomness and may produce some incorrect results, we want to see how stable the streaming-based heavy hitter algorithms are.

• Memory Usage: The linear memory requirement is a "bottleneck" for all offline algorithms, and it is the central problem that we are trying to overcome by applying a streaming approach. Thus it is important to estimate, theoretically or experimentally, the memory usage of the Pick-and-Drop and Count-Sketch algorithms.

In the evaluation, all the in-memory algorithms we compare against were proposed in the Halo-Finding Comparison Project [91]. We test against the fiducial FOF method, as well as four others that find density peaks:

1. FOF by Davis et al. [47] "Plain-vanilla" Friends-of-Friends.

2. AHF by Knollmann & Knebe [93] Density-peak search with a recursively refined grid.

3. ASOHF by Planelles & Quilis [122] Finds spherical-overdensity peaks using adaptive density refinement.

4. BDM [90], run by Klypin & Ceverino "Bound Density Maxima" – finds gravitationally-bound spherical-overdensity peaks.

5. VOBOZ by Neyrinck et al. [114] "VOronoi BOund Zones" – finds gravitationally bound peaks using a Voronoi tessellation.

Correctness. As there is no agreed-upon rule for how to define the center and the boundary of a halo, it is impossible to theoretically define and deterministically verify the correctness of any halo finder. Therefore a comparison with the results of previous widely accepted halo finders seems to be the best practical verification of a new halo finder algorithm. To compare the outputs of two different halo finders we need to introduce some formal measure of similarity. The most straightforward way to compare them is to consider one of them, H, as ground truth and the other, E, as an estimator. Among these, the FOF algorithm is the oldest and the most widely used, thus in our initial evaluation we decided to concentrate on the comparison with FOF. Then the most natural measure of similarity is the number of elements of H that match elements in the output of

E. More formally, we define "matches" as follows: for a given θ we say that a center e_i ∈ E matches the element h_i ∈ H if dist(e_i, h_i) ≤ θ, where dist(·, ·) is the Euclidean distance. Then our measure of similarity is:

\[
Q(\theta) = Q(E_k, H_k, \theta) = \left|\left\{ h_i \in H_k : \min_{e_j \in E_k} \mathrm{dist}(h_i, e_j) < \theta \right\}\right|,
\]
where k represents the top k heaviest haloes. We compare the output of both streaming-based halo finders to the output of in-memory halo finders. We made comparisons for the 256^3, 512^3 and 1024^3-particle simulations, finding the top 1000 and top 10000 heaviest hitters. Since the comparison results in all cases were similar, the figures presented below are for the smallest dataset and k = 1000.
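For reference, the similarity measure Q(θ) can be computed with a few lines of Python (a brute-force distance computation is used for clarity; a k-d tree would be preferable for large catalogs, and periodic boundaries are ignored in this sketch):

    import numpy as np

    def q_theta(H, E, theta):
        """Count how many ground-truth centers in H have a match in E within distance theta.

        H, E: arrays of shape (k, 3) with halo-center positions in the same units (e.g. Mpc/h).
        """
        H = np.asarray(H, dtype=float)
        E = np.asarray(E, dtype=float)
        dists = np.linalg.norm(H[:, None, :] - E[None, :, :], axis=-1)   # |H| x |E| distances
        return int(np.sum(dists.min(axis=1) < theta))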

In Figure 5.7 we show, for each in-memory algorithm, the percentage of centers that were not found by the streaming-based halo finders. We can see that both the Count-Sketch and Pick-and-Drop algorithms missed not much more than 10 percent of the haloes in any of the results from the in-memory algorithms. To understand whether 10 percent means that two halo catalogs are close to each other or not, we choose one of the in-memory algorithms as a ground truth and compare how close the other in-memory algorithms are. Again, we choose the FOF algorithm as the ground truth. The comparison is depicted in Fig. 5.8. From this graph one can see that the outputs of the Count-Sketch and Pick-and-Drop based halo finders are closer to the FOF haloes than those of the other in-memory algorithms. This can be easily explained: after finding heavy cells we apply the same FoF to the heavy cells and their neighborhoods, thus the output should always have a structure similar to the output of in-memory FoF on the full dataset. Also from this graph one can see that each curve can be represented as a mix of two components, one of which is the component of a random distribution. Basically it means that beyond a distance of √3/2 all matches are the same as if we just placed a bunch of points at random. The classifier uses a top-k cut to select the halo candidates. Figure 5.9 shows how sensitive the results are to the selection threshold of k = 1000. It shows several curves, including the total number of heavy hitters, the ones close to an FoF group, which we can call true positives (TP), and the ones detected but not near an FoF object (false positives, FP). From the figure it is clear that the threshold of 1000 is close to the optimal detection threshold, preserving TP and minimizing FP. This corresponds to a true positive detection rate (TPR) of 96% and a false positive detection rate of 3.6%. If we lowered our threshold to k = 900, our TPR would drop to 91%, but the FPR would become even lower, 0.88%. These tradeoffs can be shown on a so-called ROC curve (receiver operating characteristic), where the TPR is plotted against the FPR. This shows how lowering the detection



threshold increases the true detections, but the false detection rate increases much faster. Using the ROC curve, shown below, we can see the position of the k = 1000 threshold as a circle and the k = 900 threshold as a square. Finally, we should also ask, besides the set comparison, how the individual particle cardinalities counted around the heavy hitters correspond to the FoF ones. Our particle counting is restricted to neighbouring cells, while FoF is not, so we will always be undercounting. To be less sensitive to such biases, we compare the rank ordering of the two particle counts in the two samples. Rank 1 is assigned to the most massive object in each set.

FIGURE 5.9: Number of detected halos by our two algorithms. The solid lines correspond to CS and the dashed lines to PD. The dotted line at k = 1000 shows our selection criterion. The x axis is the threshold on the number of particles allocated to the heavy hitter. The cyan color denotes the total number of detections, the blue curves are the true positives (TP), and the red curves are the false positives (FP).


FIGURE 5.10: This ROC curve shows the tradeoff between true and false detections as a function of the threshold. The figure plots TPR vs FPR on a log-log scale. The two thresholds are shown with symbols: the circle denotes 1000, and the square 900.

Stability. As most streaming algorithms utilize randomness, we estimate how stable our results are compared to the results from a deterministic search. In the deterministic search algorithm, we find the actual heavy cells by counting the number of particles inside each cell; we perform the comparison for the smallest dataset, containing 256^3 particles. To perform this evaluation we run 50 instances of each algorithm (denoting the

outputs as $\{C_{cs}^i\}_{i=1}^{50}$ and $\{C_{pd}^i\}_{i=1}^{50}$). We also count the number of cells of each result that match the densest cells returned by the deterministic search algorithm, $C_{ds}$. The normalized numbers of matches are $\rho_{pd}^i = |C_{pd}^i \cap C_{ds}| / |C_{ds}|$ and $\rho_{cs}^i = |C_{cs}^i \cap C_{ds}| / |C_{ds}|$, correspondingly. Our experiment showed:

$\mu(\rho_{cs}^i) = 0.946, \quad \sigma(\rho_{cs}^i) = 2.7 \cdot 10^{-7}$


$\mu(\rho_{pd}^i) = 0.995, \quad \sigma(\rho_{pd}^i) = 6 \cdot 10^{-7}$

FIGURE 5.11: The top 1000 heavy hitters are rank-ordered by the number of their particles. We also computed the rank of the corresponding FoF halo. The linked pairs of ranks are plotted. One can see that if we adopted a cut at k = 900, it would eliminate a lot of the false positives.

This means that the approximation error caused by randomness is very small compared with the error caused by the transition from overdense cells to halo centers. This fact can also be seen in Fig. 5.12: the shaded area below and above the red and green lines, which represents the range of outputs among the 50 instances, is very thin. Thus the output is very stable.

FIGURE 5.12: Each line on the graph represents the top 1000 halo centers found with Pick-and-Drop sampling, Count-Sketch, and in-memory algorithms, as described in Section 5.2.2. The comparison with FOF is shown in Fig. 5.8. The shaded area (too small to be visible) shows the variation due to randomness.

Memory Usage. Low memory usage is one of the most significant advantages of streaming approaches compared with current halo-finding solutions. To the best of our knowledge, even for the problem of locating the top 1000 largest haloes in simulation data with

1024^3 particles, there is no way to run other halo finder algorithms on a regular PC, since 1024^3 particles already need ≈ 12 GB of memory just to store all the particle coordinates; a computing cluster or even a supercomputer is necessary. So the application of streaming techniques introduces a new direction in the development of halo-finding algorithms. To find the top k heavy cells, Count-Sketch theoretically requires the following amount of space:
\[
O\!\left( k \log\frac{n}{\delta} + \frac{\sum_{q'=k+1}^{m} f_{q'}^2}{(\varepsilon f_k)^2} \log\frac{n}{\delta} \right),
\]
where 1 − δ is the probability of success and ε is the estimation error for Q_k, where Q_k is the frequency of the k-th heaviest cell. It is worth mentioning that in application to the heavy

cell search problem the second term is the dominating one. The first factor in the second term represents the linear dependency of the memory usage on the heaviness of the top k cells. Thus we can expect linear memory usage for a small dataset, but as the dataset grows the dependence becomes logarithmic if we assume the same level of heaviness. Experiments verify this observation: for the small dataset with 256^3 particles the algorithm used around 900 megabytes, while for the large dataset with 1024^3 particles the amount of space increased to almost 1000 megabytes, which is insignificant compared with the increase in dataset size. Thus, if we can assume almost constant heaviness of the top k cells, the memory grows logarithmically with the dataset size, which is why such an approach is scalable to even larger datasets. Experimentally, for this particular dataset the Pick-and-Drop algorithm showed much better memory usage than Count-Sketch. The actual memory usage was around 20 megabytes for the small dataset with 256^3 particles and around 30 megabytes for the large dataset with 1024^3 particles.

5.3 Scalable Streaming Tools for Analyzing N-body Simulations

This section is based on [81].

5.3.1 Methodology

In this section, we introduce our methods for efficiently analyzing cosmological datasets. First, we recap the concept of streaming and explain how the problem of estimating density statistics can be approached from the perspective of finding frequent items in the

stream. Then we go over several crucial details of the heavy hitter algorithm named Count Sketch [39]. In this section, we work with two cosmological N-body simulations with 10^10 and 3 × 10^11 particles, respectively, resulting in several terabytes of data. In this setting, typical approaches for finding haloes that require loading the data into memory become inapplicable on common computing devices (e.g. a laptop, desktop, or small workstation) for post-processing and analysis. In contrast, a streaming approach makes the analysis of such datasets feasible even on a desktop by lowering the memory footprint from several terabytes to less than a gigabyte. Much of the analysis of cosmological N-body simulations focuses on regions with a high concentration of particles, as was shown in Section 5.2. By putting a regular mesh on the simulation box, we can replace each particle with the ID of the cell it belongs to [102]. Then, using streaming algorithms, we can find the k most frequent cells, i.e. the cells with the largest number of particles (see Figure 5.13). Such statistics are very useful for analyzing the spatial distribution of particles at each iteration of the simulation, as shown in Section 5.2 and as we will show in the current section. One might think that this approach is too naive and that just keeping a counter for each cell would provide the exact solution with probability 1, which is much better than any streaming algorithm can offer. However, under the assumption that particles are not sorted in any way, the naive solution would increase memory usage to terabytes even for a mesh with only 10^12 cells in it. Finding frequent elements is one of the most studied problems in the streaming setting; moreover, it is often used as a subroutine in many other algorithms [99, 60, 80, 36, 110]. For the sake of completeness we repeat the problem definition in the current section with additional details. The frequency (or count) of the element i is the number of its occurrences in the stream S: f_i = |{j | s_j = i}|. We will call element i (α, ℓ_p)-heavy if f_i > α ℓ_p, where

ℓ_p = (∑_j f_j^p)^{1/p}.

FIGURE 5.13: Finding approximate dense areas with the help of a regular mesh and a streaming solution for finding the top k most frequent items in the stream.

An approximate scheme for the problem is the following:

Problem 1 (Heavy Hitter). Given a stream S of m elements, the ε-approximate (α, ℓ_p)-heavy hitter problem is to find a set of elements T such that:

• ∀i ∈ [n], f_i > α ℓ_p → i ∈ T

• ∀i ∈ [n], f_i < (α − ε) ℓ_p → i ∉ T

Note that ε in the definition above serves as slack, allowing the algorithm to output some items which are not (α, ℓ_p)-heavy hitters but are "ε-close" to them. Typically, a smaller input ε causes the algorithm to use more memory. Finding the k most frequent items in the stream is the same as finding all (α_k, ℓ_1)-heavy hitters, where α_k is the heaviness of the k-th most frequent item. Note that being ℓ_2-heavy is a weaker requirement than being ℓ_1-heavy: every ℓ_1-heavy item is ℓ_2-heavy, but the other way around it is not always the case. For example, consider the stream where all n items of the dictionary appear only once. To be found in such a stream, an ℓ_1-heavy hitter needs to appear more than εn times for some constant ε, while an ℓ_2-heavy hitter needs to appear just ε√n times. Catching an item that appears in the stream significantly less often is more difficult; thus finding all ℓ_2-heavy hitters is more challenging than finding all ℓ_1-heavy hitters.


FIGURE 5.14: Count Sketch subroutine on an example stream: each non-heavy item appears twice, the heavy hitter (5) appears 7 times, a random +1/−1 bit is assigned to each item, the algorithm maintains the sum of the random bits, and the final sum is an unbiased estimator of the heavy hitter frequency having the same sign as its random bit.

The problem of finding heavy hitters is well studied, and there are memory-optimal algorithms for ℓ_1 [109, 110] and ℓ_2 [36] heavy hitters, of which we are most interested in the latter. Here we describe the Count Sketch algorithm [36], which finds (2ε, ℓ_2)-heavy hitters using O(1/ε^2 log^2(mn)) bits of memory.

Count-Sketch Algorithms. Consider a simplified stream with only one heavy item i*, where every other item appears in the stream only once. Let h : [n] → {−1, +1} be a hash function which flips a +1/−1 coin for every item in the dictionary i ∈ [n]. If we go through the stream S = {s_1, . . . , s_m} and maintain a counter c = c + h(s_j), then at the end of the stream, with high probability, c will be equal to the contribution of i*, namely h(i*) · f_{i*}, plus some small noise, while the majority of the non-heavy contributors will cancel each

other. The absolute value of c can be considered as an approximation of the heavy item’s

frequency. At the same time, the sign of c coincides with the random bit assigned to heavy

items; thus, it helps us reveal the ID of the heavy hitter by eliminating from consideration all items of the opposite sign. Simply repeating the experiment t = O(log n) times in parallel will reveal the entire ID of the heavy item. However, if the number of repetitions is significantly smaller, we face the problem of collisions, i.e. there will be items with the same random bits as the heavy item in all experiments. Thus we end up with many false positives due to the indistinguishability of those items from the heavy hitter. An example stream is depicted in Figure 5.14. If our stream has k heavy hitters, all we need to do is randomly distribute all items of the dictionary into b = O(k) different substreams. Then, with high probability, none of the O(k) substreams will have more than one heavy hitter in it, so for each substream we can apply the same technique as before. Figure 5.15 depicts the high-level intuition of both ideas described above:

1. Bucket hash to distribute items among b substreams (buckets)

2. Sign hash to assign random bit to every update

3. Row of b counters to maintain the sum of random bits

4. t instances to recover the IDs.

Thus for each item update we need to calculate t bucket hashes (specifying which substream/bucket the item belongs to) and t sign hashes. We then update one counter in each row of the Count Sketch table, which is t counters. In total the algorithm requires t · b = O(log n) counters, which in turn requires O(log^2 n) bits of memory (for simplicity, here and later, we assume that log m = O(log n), i.e. in our application the mesh size is at most polynomially larger than the number of particles in the simulation). All of the statements above can be proven formally [39]. Here we only show that using such a counter c provides us with an unbiased estimator f̂_i = c · h(i) for the frequency of


FIGURE 5.15: Count Sketch algorithm scheme: bucket hash to identify the counter to which we should add the sign hash. Repeat t times to recover the IDs.

the item i:

\[
\forall i: \quad \mathbb{E}\big(c \cdot h(i)\big) = \mathbb{E}\Big(\sum_j f_j\, h(j)\, h(i)\Big) = \sum_{j \neq i} \mathbb{E}\big(f_j\, h(j)\, h(i)\big) + f_i = f_i,
\]
where the last equality is due to the 2-independence of the hash h(·). However, the variance of such an estimator might be quite large and depends mainly on the second frequency moment of the other items in the substream. At the same time, we know that with high probability there is only one heavy hitter in each substream, and we repeat the experiment t = O(log n) times. We take the median of those estimates, which reduces the variance and boosts the probability that the final estimator is within the approximation error of

the real value. Summarizing, we have a data structure containing b t = O(k) O(log n) × × counters which maintain good estimates for the frequencies of the top k most frequent items, but we still have to find the values of their IDs. There are three approaches to do this:

1. Count Sketch with Full Search (CSFS). When all stream updates are processed, we estimate the frequency of each possible item in the dictionary i ∈ [n] and find the top k most frequent.
Pros: updates are fast and easy to run in parallel.
Cons: post-processing becomes very slow as the size of the dictionary grows.

2. Count Sketch with Heap (CSHe). While processing each item, estimate its frequency and maintain a heap with the top k most frequent items.
Pros: post-processing takes zero time.
Cons: updates require extra log k time-steps to update the heap.

3. Count Sketch Hierarchical (CSHi). Maintain two sketches, the first one for the stream of super-items S′ = {s_j/1000} and the second one for the initial stream S = {s_j}. When all stream updates are processed, we first estimate the frequency of each possible super-item i ∈ [n/1000] in the dictionary of S′ and find the top k most frequent super-items K′ = {hh_j}_{j=1}^{k}; then we estimate the frequencies of all potentially heavy items i ∈ [n] such that i/1000 ∈ K′ and find the top k most frequent items. This way we reduce the number of potentially heavy items to check. If necessary, more than 2 layers might be created (a small sketch of this two-level query is given after this list).
Pros: post-processing is fast even for very large dictionaries.
Cons: update time is ρ times slower and the algorithm uses ρ times more memory, where ρ is the number of layers.
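The two-level query of CSHi can be sketched in a few lines of Python; the two Estimate functions are treated as black boxes, and the thresholds theta1, theta2 and the group size of 1000 follow the description above (the names are illustrative assumptions, not the thesis code).

    def hierarchical_query(estimate_super, estimate_cell, n, group=1000,
                           theta1=0.0, theta2=0.0):
        """Report heavy cells by first locating heavy super-cells, then scanning only inside them.

        estimate_super(j): frequency estimate of super-cell j (cells j*group .. (j+1)*group - 1)
        estimate_cell(i):  frequency estimate of cell i
        """
        heavy = []
        for j in range(n // group):
            if estimate_super(j) > theta1:
                for i in range(j * group, (j + 1) * group):
                    f_hat = estimate_cell(i)
                    if f_hat > theta2:
                        heavy.append((i, f_hat))
        return heavy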

CSFS contains a set of b × t counters M, t sign hash functions h_s : [n] → {−1, +1}, and t bucket hash functions h_b : [n] → [b], which decide which counter in each of the t rows an element i corresponds to. In addition, CSHe contains a heap of (item, frequency) pairs, and CSHi contains more than one set of counters, {M_i}_{i=1}^{ρ}. Let us define the three following operations:


• Add(M, s_j): ∀i ∈ [t] : M_{i, h_{i,b}(s_j)} += h_{i,s}(s_j)

• Estimate(M, j): return median{ M_{i, h_{i,b}(j)} · h_{i,s}(j) }_{i=1}^{t}

• UpdateHeap(H, j, f̂_j): if (j ∈ H): H[j] := f̂_j; else if (H.top().f̂ < f̂_j): H.pop(); H.push(j, f̂_j)

The Add() operation updates all the counters, Estimate() outputs the current approximation of the frequency of element j, and UpdateHeap() maintains the top k most frequent items via updates of (j, f̂_j). The pseudocode of the three discussed variants is given in Algorithms 17–19 below. A similar construction is used in the Count Min Sketch algorithm [44]. It uses similar logic and the same size table of counters; however, for each update it computes only one hash (to specify the bucket to be updated) rather than two as in Count Sketch, and it outputs the minimum over the estimates rather than the median. Thus the subroutines "Add" and "Estimate" are different (see the sketch after the following bullets):

• Add(M, s_j): ∀i ∈ [t] : M_{i, h_{i,b}(s_j)} += 1

• Estimate(M, j): return min{ M_{i, h_{i,b}(j)} }_{i=1}^{t}
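For comparison with the Count Sketch class sketched earlier, a minimal Python version of the Count-Min table looks as follows (again using seeded built-in hashing as a stand-in for proper hash families); note the single bucket hash per row, the +1 updates, and the minimum instead of the median.

    class CountMin:
        def __init__(self, r, t, seed=0):
            self.r, self.t = r, t
            self.M = [[0] * t for _ in range(r)]
            self.seeds = [seed + i for i in range(r)]

        def add(self, p):
            for i in range(self.r):
                self.M[i][hash((self.seeds[i], p)) % self.t] += 1    # one hash, +1 update

        def estimate(self, p):
            return min(self.M[i][hash((self.seeds[i], p)) % self.t]  # min over the rows
                       for i in range(self.r))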

We compare Count Sketch and Count Min Sketch experimentally. However, the latter only finds ℓ_1 heavy hitters, so we expect it to be outperformed by Count Sketch.

Algorithm 17 Count Sketch with Full Search (CSFS)
procedure Initialization
    initialize a b × t matrix of counters M with zeroes
end procedure
procedure Processing the stream
    for i = 1, . . . , m do
        Add(M, s_i)
    end for
end procedure
procedure Querying the data structure
    initialize a heap H of size k
    for j ∈ [n] do
        f̂_j = Estimate(M, j)
        UpdateHeap(H, j, f̂_j)
    end for
    for i ∈ [k] do
        (j, f̂_j) = H.pop()
        output (j, f̂_j)
    end for
end procedure

5.3.2 Implementation

Section 5.2 presented a halo finding tool using streaming algorithms that can be very useful even on systems with as little as 1 GB of memory. However, the running time of that tool was more than 8 hours on a desktop for a relatively small dataset. Here we provide a new algorithm based on an efficient GPU implementation. The core part of the halo finding tool relies on the implementation of the Count Sketch algorithm. All experiments in this section were carried out on an Intel Xeon X5650 CPU @ 2.67 GHz with 48 GB RAM and a Tesla C2050/C2070 GPU.

Count Sketch Implementation. The data flow of the Count Sketch algorithm consists of 5 basic stages, i.e. for each item we need to do the following:

1. Compute cell ID from the XYZ representation of the particle

2. Compute t bucket hashes and t sign hashes

3. Update t counters

4. Estimate the current frequency of the item (find the median of the t updated counters)

5. Update the heap with the current top-k if necessary

Algorithm 18 Count Sketch with Heap (CSHe)
procedure Initialization
    initialize a b × t matrix of counters M with zeroes
    initialize a heap H of size k
end procedure
procedure Processing the stream
    for i = 1, . . . , m do
        Add(M, s_i)
        f̂_i = Estimate(M, s_i)
        UpdateHeap(H, s_i, f̂_i)
    end for
end procedure
procedure Querying the data structure
    for i ∈ [k] do
        (j, f̂_j) = H.pop()
        output (j, f̂_j)
    end for
end procedure

Below we consider different implementations of the Count Sketch algorithm, with arguments for the architectural decisions made:

1. CPU: The purely CPU version of Count Sketch has all five stages implemented on the CPU and is described in detail in [102]. As depicted below, it takes 8.7 hours to process one snapshot of all particles from the Millennium dataset. In the breakdown of

the profiler output below, where the integer numbers denote the 5 stages of the Count Sketch algorithm and the fractions show the proportional amount of time spent on each stage, we can see that the second stage is computationally the most expensive. The most straightforward improvement is to "outsource" this computation to the GPU.

Algorithm 19 Count Sketch Hierarchical (CSHi)
procedure Initialization
    initialize two b × t matrices of counters M_1 and M_2 with zeroes
end procedure
procedure Processing the stream
    for i = 1, . . . , m do
        Add(M_1, s_i/1000)
        Add(M_2, s_i)
    end for
end procedure
procedure Querying the data structure
    for j ∈ [n/1000] do
        f̂_j = Estimate(M_1, j)
        if f̂_j > θ_1 then
            for j′ ∈ [1000j : 1000(j + 1)] do
                f̂_{j′} = Estimate(M_2, j′)
                if f̂_{j′} > θ_2 then
                    output (j′, f̂_{j′})
                end if
            end for
        end if
    end for
end procedure

We implemented this idea, and we describe it further below.

2. CPU + hashes on GPU: In this implementation, we "outsource" the most time-intensive operation, calculating hashes. Recall that we need to compute 2t hashes. As long as t is a relatively small number (≤ 16), naive parallelism, which suggests computing all hashes for each particle in t parallel threads, will not provide a significant speedup due to the inability to saturate all cores (∼ 2000) of the graphics card. Thus, to improve performance even further, we need to make use of data parallelism, which assumes computing hashes for a batch of updates at the same time. Such an approach is straightforward because computing hashes consists of identical operations required for all particles, and those operations can be performed independently. As illustrated below, the GPU computes hashes almost for free compared to stages 3, 4 and 5, and the total time drops by 35%. The next bottleneck is stage 3, during which the algorithm updates counters. Although it is just 2t increments or decrements, they happen at random places in the table of counters. This makes it impossible to use the CPU cache, and memory access becomes a bottleneck for the algorithm.

3. GPU + heap on CPU

Updating counters (stage 3) and estimating current frequencies (stage 4) are two closely connected stages. If we keep them together, we can significantly reduce the number of memory queries. Implementing a time-efficient heap (stage 5) on the GPU is quite challenging due to hardware limitations. Thus our next implementation takes advantage of the CPU for maintaining the heap, while doing all other computations and storing the table of counters on the GPU. The basic data flow can be described as follows:

(a) CPU sends a batch of particles in XYZ representation onto GPU

(b) GPU processes all particles in parallel: compute cell IDs, compute hashes, update counters and estimate frequencies

(c) GPU sends a batch of estimates back to the CPU

(d) CPU maintains heap with top k items using estimations from GPU

It can be seen below that adopting such an approach pushed the total time of the algorithm down to 38 minutes. In the breakdown of the profiler, one can see that updating the heap became a new bottleneck for the algorithm.
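The host-side loop for this CPU/GPU split can be sketched as follows. It assumes a device kernel processBatch that performs stages 1-4, and a standard-library min-heap on the host; all names are illustrative, and the real CSHe implementation additionally de-duplicates repeated cell IDs, which this sketch omits.

#include <cuda_runtime.h>
#include <algorithm>
#include <functional>
#include <queue>
#include <vector>
#include <cstdint>

// Assumed device kernel (stages 1-4): computes cell IDs from XYZ, hashes them,
// updates the Count Sketch counters and writes back a frequency estimate per particle.
__global__ void processBatch(const float3* xyz, int n,
                             uint64_t* cellIds, int64_t* estimates);

using Entry = std::pair<int64_t, uint64_t>;           // (estimate, cellId)

void streamSnapshot(const std::vector<float3>& particles, int batchSize, int k) {
    // Min-heap keyed by the estimate keeps the current top-k on the CPU (stage 5).
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> topK;
    float3*   dXyz;  cudaMalloc(&dXyz,  batchSize * sizeof(float3));
    uint64_t* dCell; cudaMalloc(&dCell, batchSize * sizeof(uint64_t));
    int64_t*  dEst;  cudaMalloc(&dEst,  batchSize * sizeof(int64_t));
    std::vector<uint64_t> hCell(batchSize);
    std::vector<int64_t>  hEst(batchSize);

    for (size_t off = 0; off < particles.size(); off += batchSize) {
        int n = (int)std::min((size_t)batchSize, particles.size() - off);
        // (a) send a batch of XYZ positions to the GPU
        cudaMemcpy(dXyz, particles.data() + off, n * sizeof(float3),
                   cudaMemcpyHostToDevice);
        // (b) GPU processes the batch: cell IDs, hashes, counter updates, estimates
        processBatch<<<(n + 255) / 256, 256>>>(dXyz, n, dCell, dEst);
        // (c) estimates (and their cell IDs) come back to the host
        cudaMemcpy(hCell.data(), dCell, n * sizeof(uint64_t), cudaMemcpyDeviceToHost);
        cudaMemcpy(hEst.data(),  dEst,  n * sizeof(int64_t),  cudaMemcpyDeviceToHost);
        // (d) CPU maintains the heap with the current top-k
        for (int i = 0; i < n; ++i) {
            topK.emplace(hEst[i], hCell[i]);
            if ((int)topK.size() > k) topK.pop();
        }
    }
    cudaFree(dXyz); cudaFree(dCell); cudaFree(dEst);
}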

4. GPU without heap

While the heap on the CPU is quite efficient, it still slows down the process quite seriously, especially when the top k gets larger and reaches 10^6. On large datasets this might cause many items to have an update time close to log k. Moreover, keeping the heap on the CPU forces the GPU to send a lot of data back to the CPU. Avoiding this data transfer would improve the slowest memory operation by a factor of 2. Thus we decided to switch from Count Sketch with Heap (CSHe) to Count Sketch with Full Search (CSFS), both of which were broadly described in the previous section. The CSFS algorithm works in two modes: update mode, which encompasses calculating hashes and updating counters, and estimate mode, which deals with estimating the frequency for all cells and emitting the top k. The CSFS algorithm is first invoked in update mode for the entire stream, and when it finishes, the generated table of counters is used as input to estimate mode. While in estimate mode, we still need to maintain the top k items and do it on the GPU. This can be done semi-dynamically by adding to an array all items which are larger than some threshold. Then, if we have more than k items, we raise the threshold and rearrange elements in the array, deleting those items which do not satisfy the new threshold. If we grow the threshold geometrically, we can guarantee that such a "cleaning" step will not happen too often. Such an approach cannot be applied to the CSHe algorithm due to the possibility of two updates for the same cell. In the figure below, the stream time, which includes only the update mode, takes only 3.5 minutes, while the estimate mode takes 25 minutes. The time of the estimate mode, i.e. the query time, depends linearly on the size of the mesh, due to the necessity to estimate the frequency for every cell in the mesh. For example, in the same experiment with a mesh of 5 · 10^8 cells, the query time would be less than 10 seconds.
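The semi-dynamic top-k collection of the estimate mode can be sketched as below. This only illustrates the threshold-and-compaction mechanism; the real implementation interleaves estimation and collection on the GPU, and all identifiers here are ours.

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>
#include <algorithm>

struct Candidate { uint64_t cell; int64_t est; };

// Device side: one thread per cell appends (cell, estimate) to a candidate buffer
// whenever the estimate exceeds the current threshold.
__global__ void collectHeavy(const int64_t* estimates, uint64_t numCells,
                             int64_t threshold, Candidate* buf, int* bufSize,
                             int capacity) {
    uint64_t c = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCells || estimates[c] <= threshold) return;
    int pos = atomicAdd(bufSize, 1);                 // reserve a slot in the buffer
    if (pos < capacity) buf[pos] = {c, estimates[c]};
}

// Host side: when more than k candidates accumulate, grow the threshold
// geometrically and drop everything below it; geometric growth keeps the number
// of such "cleaning" steps small.
void compactIfNeeded(std::vector<Candidate>& buf, size_t k, int64_t& threshold) {
    if (buf.size() <= k) return;
    threshold *= 2;
    buf.erase(std::remove_if(buf.begin(), buf.end(),
                             [&](const Candidate& c) { return c.est <= threshold; }),
              buf.end());
}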

5. GPU hierarchy

As already discussed in the previous section, one way to decrease the query time is to eliminate the full search and implement it as a search tree instead. In our case, the search tree (hierarchy) contains only two layers. By grouping cells together we can find the heavy super-cells first (using a small mesh), and then search for heavy cells only inside the heavy super-cells. We merge cells by their IDs into a top layer with a dictionary size of 10^8, find the top ∼ c · k super-cells, and then find the top k cells inside the selected heavy super-cells, where c > 1 is a small constant. As can be seen below, such an approach reduces the query time from 25 minutes down to 55 seconds. However, it requires twice the amount of memory due to the need to store a table of counters for each layer. It can also be observed that the time performance of the update mode gets worse, due to the necessity to calculate twice as many hashes and update twice as many counters. The total time of the algorithm is 5 minutes, which is very impressive for the size of the dataset and the mesh. The total performance improvement over the sequential CPU implementation is more than 100-fold.
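In host pseudocode (C++-style, with an assumed estimate() routine playing the role of Estimate in Algorithm 19, and illustrative names throughout), the two-layer query looks roughly as follows.

#include <cstdint>
#include <vector>

// Assumed: frequency estimate of one key from a Count Sketch table (coarse or fine).
int64_t estimate(const std::vector<int64_t>& table, int t, int b, uint64_t key);

// Two-layer query: scan the small super-cell mesh first, then descend only into
// super-cells whose estimated mass exceeds theta1 (cf. Algorithm 19).
std::vector<uint64_t> hierarchicalQuery(const std::vector<int64_t>& coarse,
                                        const std::vector<int64_t>& fine,
                                        int t, int b, uint64_t numSuperCells,
                                        uint64_t group,          // e.g. 1000 cells per super-cell
                                        int64_t theta1, int64_t theta2) {
    std::vector<uint64_t> heavy;
    for (uint64_t s = 0; s < numSuperCells; ++s) {
        if (estimate(coarse, t, b, s) <= theta1) continue;       // prune light super-cells
        for (uint64_t c = s * group; c < (s + 1) * group; ++c)   // full search inside
            if (estimate(fine, t, b, c) > theta2)
                heavy.push_back(c);
    }
    return heavy;
}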

Here we briefly introduce the key architectural decisions in the implementation of the "GPU without heap" version of the algorithm. While it is not the most efficient implementation, it is easier to explain, and it is straightforward to extend to the "hierarchical" version. The graphics processor is a separate device that has many limiting features compared to the CPU. In this project, all our experiments leverage the CUDA platform to make use of the graphics processor's capabilities [117].

A GeForce GTX 1080 has 20 multiprocessors (SM) each with 128 cores (threads). CUDA introduced a block/thread approach, such that all computations are grouped into blocks, where one block is always implemented on only one SM. Within a block, we can specify how to share computation between threads. CUDA has three layers of memory:

1. Global memory: accessible from anywhere on the device and typically quite large (up to 8 GB). It is also the only type of memory that can be used to copy to or from RAM. At the same time, it is the slowest memory on the device.

2. Shared memory: accessible from within the block and shared among all threads of that block. Shared memory is ∼ 10 times faster than global memory, however, it is very limited with ∼ 48 − 64KB per SM.

3. Registers: there are 2^15 32-bit registers per SM. They are as fast as shared memory, but visible only to the thread.

Storing the table of counters for Count Sketch is possible only in global memory. Primarily, this is due to the large size of the table of counters (∼ 1 GB). Secondly, counters are accessed in random order, which makes it impossible to keep any useful locality in the shared memory. In our implementation, each block is in charge of exactly one update of the stream. In order to make an update, one needs to calculate 2t hash functions and update t counters, thus we distribute this work among t threads, each calculating two hashes and updating one counter. Note that to avoid memory access conflicts we need to use atomic operations, which are available in CUDA. However, we expect the number of conflicts to be small: the typical width of the table is 10^7 counters and the maximum number of simultaneous requests is bounded by the number of GPU threads (∼ 2000 in our case), so the probability of a collision is negligible. In practice, we can see that using non-atomic operations would give us at most a 10% gain in time performance. The pseudo code for each thread is presented in Algorithm 20. After the stream is processed, we need to find the IDs of the heavy hitters. As described earlier, we need to find an estimate for each item in the dictionary. Here we use the same approach as for stream processing: each block is in charge of one cell, and each thread is in charge of an estimate based on one row of Count Sketch counters. The procedure for each thread is described in Algorithm 20.

Algorithm 20 GPU thread code for Count Sketch
 1: procedure UPDATE(cellID)
 2:     i = threadID
 3:     M[i, h_{i,b}(cellID)] += h_{i,s}(cellID)
 4: end procedure
 5:
 6: procedure ESTIMATE(cellID)
 7:     shared estimates[t]
 8:     shared median
 9:     j = threadID
10:     f̂ = M[j, h_{j,b}(cellID)] · h_{j,s}(cellID)
11:     estimates[j] = f̂
12:     synchronize
13:     above = 0, below = 0
14:     for i ∈ [t] do
15:         below += (estimates[i] < f̂)
16:         above += (estimates[i] > f̂)
17:     end for
18:     if above ≤ t/2 and below ≤ t/2 then
19:         median = f̂
20:     end if
21:     synchronize
22:     if j = 1 and median > θ then
23:         return median
24:     end if
25: end procedure
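A possible CUDA rendering of Algorithm 20 is shown below. It is a sketch rather than our exact kernel: the hash helpers bucketHash and signHash are assumed to exist on the device, and the output array is assumed to be zero-initialized by the host.

#include <cuda_runtime.h>
#include <cstdint>

__device__ int bucketHash(int row, uint64_t cellId, int b);   // assumed hash helpers
__device__ int signHash(int row, uint64_t cellId);            // returns +1 or -1

// Update: one block per stream update, one thread per sketch row (t threads).
// The signed increment goes through the unsigned 64-bit atomicAdd; two's-complement
// wrap-around makes the signed arithmetic come out correctly.
__global__ void csUpdate(int64_t* M, int b, const uint64_t* cellIds) {
    uint64_t cell = cellIds[blockIdx.x];
    int i = threadIdx.x;
    atomicAdd((unsigned long long*)&M[(size_t)i * b + bucketHash(i, cell, b)],
              (unsigned long long)(long long)signHash(i, cell));
}

// Estimate: one block per queried cell, t threads cooperate on the naive parallel
// median check; launch with t * sizeof(int64_t) bytes of dynamic shared memory.
__global__ void csEstimate(const int64_t* M, int b, int t,
                           const uint64_t* cellIds, int64_t theta, int64_t* out) {
    extern __shared__ int64_t est[];
    __shared__ int64_t median;
    uint64_t cell = cellIds[blockIdx.x];
    int j = threadIdx.x;
    est[j] = M[(size_t)j * b + bucketHash(j, cell, b)] * signHash(j, cell);
    __syncthreads();
    int above = 0, below = 0;
    for (int i = 0; i < t; ++i) {                    // check the median condition for est[j]
        below += (est[i] < est[j]);
        above += (est[i] > est[j]);
    }
    if (above <= t / 2 && below <= t / 2) median = est[j];
    __syncthreads();
    if (j == 0 && median > theta) out[blockIdx.x] = median;   // report only heavy cells
}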

Note that we find the median using a very naive algorithm: for each item of the array, check if it is a median by definition. That is, thread i is in charge of checking whether the number of estimates smaller than estimates[i] equals the number of estimates larger than estimates[i], and reporting/recording the found median if so. This is one of the reasons why all estimates should be reachable by all threads, and thus should be stored in the shared memory. While in a sequential implementation this approach would take O(t^2) time steps, here we use t parallel threads, ending up with a time complexity of O(t). To boost the time performance even further we can apply sampling. However, one should not expect performance to improve linearly with the sampling rate, because of the necessity to compute sampling hashes for all particles. The dependency of the time performance on the sampling rate is depicted in Figure 5.16. From that graph, one can see that changing the subsampling rate from 8 to 16 is the last significant improvement in time performance.

FIGURE 5.16: Dependency of time performance on sampling rate.

5.3.3 Evaluation

In this section, we present a tool which is capable of finding up to 10^5 − 10^6 densest cells in state-of-the-art cosmological simulations for an arbitrarily sized regular mesh. Moreover, the proposed technique makes these procedures available even on a desktop or a small server. In this section, we evaluate this claim. We do this in two steps. In the first, which we call the algorithmic evaluation, we compare the rank order produced by the heavy hitter algorithm directly to the exact results. In the second, we perform a scientific evaluation and analyze the effects of the randomized nature of the approximate algorithm on various statistical measures of astrophysical interest, namely the tail end of the counts-in-cells distribution and the spatial clustering of excursion sets.

Evaluation Setup. For testing and evaluation, we use the Millennium dataset [96] with 10^10 particles in a cube with side length 500 Mpc/h. The cell size in the grid is 0.1 Mpc/h, thus the total grid contains 1.25 × 10^11 cells. Our goal is to find the top 10^5 to 10^6 heaviest cells. These numbers are important for understanding some of the decisions in choosing the specific architecture of the implementation. The data, originally stored in the GADGET [133] format, is reorganized such that every 64 bits contain the 3 coordinates of one particle. This reorganization helps to reduce the number of global memory writes inside the GPU. After such a reorganization the entire dataset weighs in at 80 GB. One of the time performance bottlenecks in such a setting is reading data from the hard drive. We implemented a parallel I/O system that includes 3 SSDs and 24 HDDs without data replication, and this way we reduced the pure I/O time from 15 minutes to 20 seconds. For comparison purposes all experiments were carried out on two different hardware configurations:

1. AMD Phenom II X4 965 @ 3.4 GHz, 16 GB RAM, GPU GeForce GTX 1080.

2. Intel Xeon X5650 @ 2.67GHz, 48 GB RAM, GPU Tesla C2050/C2070.

Top-k Cells. First, let us introduce different ways of finding the top k most frequent cells with exact counts. Given a set of particles in the simulation box and a regular grid of a fixed size, we need to find the k cells of the grid containing the largest numbers of particles, together with the IDs of those cells. The algorithm is required to return an estimate of the number of particles in each cell. The most straightforward solution to this problem is to count the number of particles in each cell precisely. Such an exact algorithm might be described as follows:

1. Create a counter for every cell in the grid

2. While making a pass through the dataset, update the cell counters based on the position of the particles

3. Find the k "heaviest" cells and return their IDs and exact counts

This solution breaks down in step 1 once the size of the mesh is too large to store all counters in memory. It is possible to remove the memory problem at the expense of worse time performance by making multiple passes over the data, as in the following algorithm (a minimal code sketch follows the steps). Assuming the memory is about a factor 1/λ of the total size of the grid:

1. Create a counter for every cell in the range [i − n/λ, i], and use the basic algorithm above to find the k "heaviest" cells in the range; call them topK_i.

2. Repeat the previous step for all ranges i ∈ {n/λ, 2n/λ, . . . , n} and find the top k "heaviest" cells in ∪_i topK_i.
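The following compact C++ sketch illustrates this multi-pass exact counting; it assumes the stream of particle cell IDs can be replayed once per chunk, and for brevity it merges all counters of a chunk before pruning, whereas a real implementation would keep only the local top-k per chunk.

#include <cstdint>
#include <vector>
#include <algorithm>

struct CellCount { uint64_t cell; uint64_t count; };

// Multi-pass exact top-k: counters are kept only for n/lambda cells at a time,
// so memory stays a 1/lambda fraction of the full grid at the cost of lambda passes.
// 'stream' stands in for one sequential pass over the particles' cell IDs.
std::vector<CellCount> exactTopK(const std::vector<uint64_t>& stream,
                                 uint64_t n, int lambda, size_t k) {
    std::vector<CellCount> best;                       // union of the per-range top-k
    uint64_t chunk = (n + lambda - 1) / lambda;
    for (uint64_t lo = 0; lo < n; lo += chunk) {
        uint64_t hi = std::min(lo + chunk, n);
        std::vector<uint64_t> counts(hi - lo, 0);      // counters for this range only
        for (uint64_t cell : stream)                   // one full pass per range
            if (cell >= lo && cell < hi) counts[cell - lo]++;
        for (uint64_t c = 0; c < counts.size(); ++c)
            best.push_back({lo + c, counts[c]});
        // keep only the k heaviest cells seen so far
        size_t keep = std::min(k, best.size());
        std::partial_sort(best.begin(), best.begin() + keep, best.end(),
                          [](const CellCount& a, const CellCount& b) { return a.count > b.count; });
        best.resize(keep);
    }
    return best;
}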

This multi-pass trick becomes unfeasible when the size of the mesh grows too large compared to the available memory, as it would take too many passes over the data. However, to evaluate how well our algorithm approximates the exact top k with counts, we do need to have exact counts. That was one of the reasons why [102] restricted themselves to relatively small meshes. In the current section, we show results for meshes of sizes 10^8, 10^11 and 10^12. We provide an algorithmic evaluation only for 10^8, where the naive precise algorithm can be applied, and for 10^11, where we apply the trick described above and do 20 passes over the dataset. For the mesh of size 10^12, algorithmic evaluation is more challenging and therefore only the scientific evaluation will be performed.

Evaluation of algorithm. In this section, all experiments are for the mesh size 10^11. Fig. 5.17 shows the distribution of exact cell counts for the top 10^7 cells.

FIGURE 5.17: Cell density distribution for the top 0.5 · 10^6 cells found by Count Sketch (in green) and the top 10^7 cells found by exact counting (in blue).

Most experiments in this section use a Count Sketch with parameters t = 5, b = 10^7 and k = 5 · 10^5. A motivation for these values will be provided later. To understand how well the Count Sketch approximates the exact counts and how well it reproduces the rank order, we determine how the relative error grows with the rank inside the top-k cells. To do so, for each cell i we find its count c_i and rank r_i in the output of an exact algorithm and its count ĉ_i in the Count Sketch output. If cell i is not present in the Count Sketch output, i.e. not among its top k heaviest cells, we define ĉ_i = 0. Fig. 5.18a shows, in blue, the dependency on rank r_i of the relative error, defined as |c_i − ĉ_i|/c_i. Here we use a bin size of 100 in i for the averaging. The relative error shown in green is determined for cells which were among the top k for both the exact and the Count Sketch counts. By ignoring the cells not found in the Count Sketch, the relative error is artificially reduced. On the other hand, treating those cells as empty (ĉ_i = 0) pushes the error rate up significantly. This overestimates the error compared to the count that might have been determined had the Count Sketch included those cells, for example by using a larger value of k. As we can see, up to a rank of ∼ 250000 the algorithm works reliably and has quite a low approximation error. However, at higher ranks the error grows rapidly. The main cause of this is the loss of heavy cells, rather than a bad approximation of the counts for the cells that were accepted by the Count Sketch. This is shown by the fact that the green line remains low.
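For concreteness, the per-cell relative error used in these plots can be computed as below; this is only a sketch, with exactTop and sketchTop standing in for the top-k maps produced by the exact algorithm and by the Count Sketch, and with cells missing from the sketch output treated as ĉ_i = 0.

#include <cstdint>
#include <unordered_map>
#include <cmath>

// Relative error |c_i - chat_i| / c_i for one cell; a cell that the sketch did not
// report among its top-k is treated as chat_i = 0 (the "pessimistic" blue curve).
double relativeError(uint64_t cell,
                     const std::unordered_map<uint64_t, int64_t>& exactTop,
                     const std::unordered_map<uint64_t, int64_t>& sketchTop) {
    double c = (double)exactTop.at(cell);
    auto it = sketchTop.find(cell);
    double chat = (it == sketchTop.end()) ? 0.0 : (double)it->second;
    return std::fabs(c - chat) / c;
}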

Fig. 5.19 shows the same graphs, but now plotted against the over-density δ_i = (N_i − ⟨N⟩)/⟨N⟩ in cells. This quantity is more meaningful from an astrophysical point of view than the rank. It shows that the errors are stable over a large range of over-densities, but shoot up very quickly near a threshold. That threshold depends on the size of the cell, as the comparison in Fig. 5.19 shows. Note that the size of the cell for any specific dataset influences the number of particles in each of the top-k heavy cells and the `2 norm of the stream. There is a straightforward reason why the approximate algorithm loses so many heavy cells. Before we explain it, we need to point out three important facts. First, Fig. 5.20 shows that the absolute error is about constant for all cells. This can be understood from the theoretical arguments in [36], which state that all estimates have an additive approximation error of ε`2, i.e. for each cell the error does not depend on the count, but only on the `2 norm of the entire dataset. Second, as we can see from Fig. 5.17, the number of cells increases exponentially as the count goes down (this is a property of the mass function in the cosmological simulation). Third, for the cells with ranks near 250000, the actual count is ∼ 820, and for the cells with ranks near 500000 the actual count is ∼ 650.

While searching for the top-k cells using Count Sketch estimates we face two types of errors:

type 1: false rejection of heavy cells caused by underestimation of the true count due to approximation error;

type 2: false exclusion of heavy cells caused by overestimation of the counts of cells below the top-k selection criterion.

We expect that having 250000 elements with counts between ∼ 650 and ∼ 820, with an average approximation error of ∼ 80 (and ranging up to 250), will cause significant loss of heavy cells in the top k. We can see this in Fig. 5.18a, which also depicts the recovery rate. Thus we can conclude that the main cause for missing heavy cells in the output is the fact that many cells have counts which are relatively close to the approximation error of the algorithm. We have tested this conclusion by running the algorithm for larger cell sizes, with significantly larger expected counts. This increases the typical |c_i − c_j| for cells i and j with small rank distance. The result of an experiment with a cell size of 1 Mpc/h is shown in Fig. 5.18b, which corroborates our hypothesis. The difference in the results for different mesh sizes is even more obvious in the relative error vs. exact count graphs in Fig. 5.21a and Fig. 5.21b. Note that closer to the cut-off threshold the algorithm tends to overestimate the count rather than underestimate it. This behaviour is reasonable because only overestimated items pass the threshold test, while all items with underestimated counts are discarded by the algorithm.

We know that every `2 heavy hitters algorithm catches all `1-heavy items and some items which are `2-heavy but not `1-heavy. While asymptotically the space requirements for both kinds of algorithms are the same, the time performance of `1 algorithms is better in practice than that of `2 algorithms. It is therefore of interest to compare the two in the specific application to our problem. To do so, we compared the Count Sketch with the intuitively similar `1 Count Min Sketch algorithm. Fig. 5.22 shows that the approximation error differs significantly, with the Count Sketch algorithm giving much more accurate results.

As mentioned earlier, random sampling of the particles before feeding them to the algorithm can significantly improve the time performance of the entire procedure. To investigate the influence of such sampling on the approximation error we carried out experiments comparing different sampling rates for mesh sizes 1 Mpc/h and 0.1 Mpc/h. From Figures 5.23 and 5.24 we can see that in both cases a sampling rate of 1/16 still provides a tolerable approximation error. It is important to recall that the time performance does not scale linearly with the sampling rate due to the need to compute the sampling hash function for each element. This operation is comparable in time to processing the element through the entire data flow without skipping the Count Sketch.

The crucial advantage of the algorithm presented here compared to existing algorithms is the improvement in memory usage. Traditional algorithms often require complete snapshots to be loaded into memory, which for state-of-the-art cosmological simulations implies they cannot be analyzed on a small server or even one desktop. For a cell size of 1 Mpc/h and a box size of 500 Mpc/h our mesh would contain only 1.25 · 10^8 cells, which requires only 500 MB for a naive algorithm and provides an exact solution. Such a low memory footprint makes the naive solution feasible even for a laptop. For a cell size of 0.1 Mpc/h with the same box size the memory requirements would be 1000× larger and barely fit onto a mid-size server.

Next, we investigate the trade-off between time performance, memory requirement and approximation error in more detail. The Count Sketch data structure consists of t × b counters, which also sets the memory requirements. The graph in Fig. 5.25 shows the approximation error for different combinations of b and t.

The CS algorithm provides a tolerable error rate as long as t × b ≥ 64 · 10^6, except for the case (t, b) = (4, 16 · 10^6), which has too small a t, causing a high rate of false positives; we provide more details in the next paragraph. To better understand the spectrum of possible error rates, we consider the rates at rank 400000, where the frequencies of the cells are already quite low. For all combinations of the parameters, the algorithms start losing a significant fraction of correct heavy hitters near that value. The following table shows these error rates as a function of b and t.

t          16     12     8      16     12     16     8      12
b / 10^6   16     16     16     8      8      4      8      4
Error      0.27   0.34   0.42   0.47   0.51   0.64   0.67   0.72

Algorithms with a similar space usage (∝ t × b) have a similar error rate, but the solution with the larger number of rows is generally somewhat better. This can be explained with a theoretical argument: the largest portion of the relative error depicted is due to losing true heavy hitters, which happens because the algorithm finds "fake" heavy hitters, and those push the true heavy hitters with a smaller frequency out of the top group. A fake heavy hitter can appear only if it collides with some other true heavy hitter in at least half of the rows. Thus, the expected number of collisions can be computed as the total number of distinct items n times the probability of colliding with at least one heavy hitter, which is k/b, in each of t/2 independent experiments. Therefore the expected number of collisions is n(k/b)^{t/2}. From that dependency, we can see that for fixed t × b, the larger value of t is always better. For example, if we want to find only one heavy hitter and minimize the space usage, which is proportional to t × b, then the most efficient way is to take b = 2 and t = c log n. However, this would force us to increment or decrement O(log n) counters for each update, which is much slower than the O(log_b n) counters needed when b ≫ 2. Theoretically, running Count Sketch with t = 8 should be twice as slow as with t = 4, due to the need to compute twice as many hashes and increment twice as many counters. In practice, we observed almost exactly that, mostly because computing hashes and updating counters takes around 75% of the total running time in the "GPU hierarchy" implementation. From these examples, we can understand the nature of the space versus time trade-off, and we can see this behavior in the graph and in the table for the pairs 8 × 16 · 10^6 and 16 × 8 · 10^6, 16 × 4 · 10^6 and 8 × 8 · 10^6, and others. Note that increasing both b and t provides a better approximation and a lower false positive rate; however, increasing t significantly pushes the time and the space (if we keep b fixed) up, while increasing b does not influence the time performance but still pushes the memory usage up. In all our experiments we were limited by the memory of the GPU, which for both devices was only 8 GB.
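As a rough back-of-the-envelope illustration of the n(k/b)^{t/2} dependency (ignoring constant factors and the binomial correction), take n ≈ 1.25 · 10^11 cells, k = 5 · 10^5 and b = 16 · 10^6, so k/b = 1/32: with t = 16 the expected number of fake heavy hitters is about n · (1/32)^8 ≈ 0.1, whereas with t = 8 at the same width it is about n · (1/32)^4 ≈ 1.2 · 10^5. These numbers are only indicative, but they show why, at a fixed t × b budget, spending memory on more rows suppresses fake heavy hitters far more effectively than spending it on wider rows.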

Evaluation of the model. Here we evaluate the quality of the model for two specific problems: finding halos and the analysis of excursion sets. To do so, we try to solve the problem using Count Sketch and its ability to find the top k densest cells in the simulation box. In [102] the authors showed a simple way of using the heavy cells to find heavy clusters: make a second pass through the data set and store locally the particles which belong to one of the heavy cells. This is possible because the number of particles in those heavy cells is much smaller than that of the entire data set, and we can even store them in main memory. This implies that any traditional in-memory algorithm can be applied offline. This scheme is illustrated in Fig. 5.26. In this section, we will not repeat the entire chain of this computation, but will simply check the number of halos contained in the top k cells. It turns out that to find the centers of the 10^5 most massive halos we need to find the ∼ 1.8 · 10^5 heaviest cells, i.e. the centers of the top 10^5 most massive halos are contained in the top ∼ 1.8 · 10^5 heavy cells. Then running an offline in-memory halo finder, such as Friends of Friends [47] or any other halo finder of choice [91], on the particles located only inside the top 1.8 · 10^5 heavy cells (and their immediate neighbours) will provide more precise halo centers and the mass distribution for each halo. We emphasize that in the current manuscript we find only the centers of the haloes and leave finding the actual borders and mass distributions for future research. Hence the streaming approach can be considered as a sieve step, allowing us to efficiently remove the particles which are not in the largest halos from further consideration. The resulting filtered data set is significantly smaller in size, so one can apply offline algorithms. In [102] we showed how to find the 10^3 largest halos while working with a data set of size 10^9: find the top 2 · 10^3 heavy cells and run an offline algorithm on the particles that are located only inside the heavy cells. Applying the same approach to find the top 10^6 heaviest cells in the Millennium data set containing 10^10 particles would be challenging, but still manageable:

1. top 10^6 haloes contain ∼ 3.8 · 10^9 particles = 45 GB;

2. top 10^5 haloes contain ∼ 2.5 · 10^9 particles = 31 GB;

3. top 10^4 haloes contain ∼ 1.4 · 10^9 particles = 16 GB;

4. top 10^3 haloes contain ∼ 0.8 · 10^9 particles = 9 GB.

Thus we can indeed afford to run an offline in-memory halo finder and locate ∼ 10^5 haloes on a desktop or a small server. At first glance, it seems that the memory gain is not significant: the initial dataset weighs ∼ 90 GB, i.e. for the top ∼ 10^5 haloes the gain is at most a factor of 3 (a factor of 5 for the top 10^4), at the cost of an introduced approximation and a non-zero probability of failure. However, the initial dataset does not provide the option of running an offline halo finder sequentially in several passes on a machine with very low memory. Our filtering step provides the opportunity to run it on a machine with just 2 GB of memory; the algorithm will need to make more passes over the data, i.e. to find the top 10^4 haloes one will need to make ∼ 10 passes while working under the 2 GB memory restriction. We should also take into account that the number of particles in each halo grows with the size of the data set. Additionally, using a 10 times larger top will significantly increase the total number of particles one needs to store in memory while applying the offline algorithm. Hence, finding the top 10^5 haloes on the Millennium XXL dataset with 3 · 10^11 particles is less feasible as a low-memory solution:

1. top 10^6 haloes contain ∼ 31 · 10^9 particles = 372 GB;

2. top 10^5 haloes contain ∼ 9 · 10^9 particles = 108 GB;

3. top 10^4 haloes contain ∼ 2 · 10^9 particles = 24 GB;

4. top 10^3 haloes contain ∼ 0.4 · 10^9 particles = 4.8 GB.

From the list above we can see that to keep everything on a small server, the proposed approach can help to find at most the top 10^4 haloes in one extra pass, or the top 10^5 haloes in ∼ 8 − 10 passes, which we state as a result in the current section; we leave further improvement as a subject for future investigation. Note that the compression level for the top 10^5 haloes is 33× (148× for the top 10^4 haloes). However, the requirement to make more than 2 passes and to use ∼ 24 GB on the second pass is very restrictive, and better techniques should be proposed for post-processing. Among the most straightforward solutions are sampling and applying the streaming approach hierarchically for different cell sizes.

As described in the introduction, a connection can be made between the heavy hitters in the collection of grid cells and excursion sets of the density field. We want to determine the spatial clustering properties of these over-dense cells and determine whether the algorithm by which this set is determined has an influence on the spatial statistics. To do this we have extracted the locations of heavy hitter grid cells in the Count Sketch result and determined their clustering properties using the two-point correlation function ξ(R) [e.g. 120]. We compare this to the 2-pt function calculated on the cells in the exact excursion set. The adopted cell size is 0.1 Mpc/h. As the results in Fig. 5.27 show, for the over-densities that can be reliably probed with the streaming algorithm the exact and Count Sketch results are indistinguishable. The main deviations are due to discreteness effects for the smaller high-density samples. As an aside, we note that in Fig. 5.28, the higher-density cells cluster more strongly than the lower-density cells, as expected [83].

Millennium XXL. Running on the Millennium dataset, we could still find the "top-k" densest cells exactly using quite moderate time and memory. Even when the full density grid was too large to fit in memory, we could make multiple passes over the data and determine parts of the grid. Those experiments are necessary for evaluating the precision of the output of the randomized algorithm, but on our medium-sized server they take about a day to complete. In this section, we describe an experiment on the results of the Millennium XXL simulation [7], which contains around 300 billion particles, and hence is 30 times larger than the Millennium dataset. Its box size is 3 Gpc and we use a cell size of 0.2 Mpc. Thus our regular mesh contains ∼ 3.4 · 10^12 cells and would need 13.5 TB of RAM to be kept in memory completely (using 4-byte integer counters). While this is beyond the means of most clusters, our algorithm is able to solve the problem with a memory footprint under 4 GB while keeping the elapsed time under an hour. Before we describe some technical details of the experiment, we need to clarify the process of evaluation, as we are now not able to produce exact counts in a reasonable amount of time. Hence, we consider only the following two ways of evaluating the accuracy of our results:

1. From the exact counts: While we cannot determine the top-k most dense cells precisely, we can still make a second pass over the data and maintain counters for some subset of cells. We use this opportunity to evaluate the approximation error of the Count Sketch, but only for those cells which were output by the algorithm. Note that this verification is not as reliable as the one used earlier in this section because it does not catch any false negative items, i.e. the items which are supposed to be in the top-k but were lost by the algorithm. But this way we can evaluate the approximation error and get a preliminary estimate of the false positive rate.

2. From astrophysics: Running a simulation with a larger number of particles provides us with more stable quantities. While we do not have any way to verify them precisely, we know that the spatial statistics should be more or less close to those from smaller simulations. This evaluation is more qualitative than quantitative, but it will definitely be alarming if serious deviations are present.

We ran the "GPU hierarchical" version of the Count Sketch. We then made a second pass over the dataset in which we determined the exact counts, restricted to the cells found in the first pass. While we do not know the cutoff frequency for the top-k, we can still approximately estimate the false positive rate: if all cells in the top-k output by Count Sketch have frequencies larger than 1700, then every item with a true frequency less than 100 can be considered a false positive. Initially, we set the same number of counters in the Count Sketch table as before: 16 rows and 10^7 columns. However, the result was quite noisy and had a very high rate of false positives: around 60000 cells had a frequency lower than 100, while the top-k frequency cutoff is around 1700. Then we ran Count Sketch with 24 rows, and the number of collisions dropped to approximately 800. The graph depicting the relative error of the Count Sketch can be observed in Figure 5.29. It is evident that the approximation error is more than twice that of the experiment with the Millennium dataset (refer to Figure 5.18a). This can be explained by the size of the dataset: the algorithm's guaranteed approximation error is ε`2, and since the `2 norm of the XXL dataset is significantly larger, so is the error. If further astrophysical analysis requires a better approximation error, we can increase the width of the Count Sketch table. In Fig. 5.30 we compare the two-point correlation function for all cells with δ ≥ 20000 for both Millennium and Millennium XXL. For the Millennium XXL result we use the Count Sketch, for the Millennium we use the exact over-densities, both in cells of size 0.2 Mpc. The XXL has a volume that is 216× the volume of the Millennium run and hence much better statistics. Nevertheless, the results are compatible with each other. We ran the experiment on a small server with the following characteristics: Intel Xeon X5650 @ 2.67 GHz, 48 GB RAM, GPU Tesla C2050/C2070. Our I/O was based on 4 RAID-0 volumes of 6 hard drives each. The total time for the I/O is 30 minutes. Because the I/O is implemented in parallel, if the algorithm time exceeds the I/O time then the I/O comes "for free": we can read a new portion of the data from disk while the GPU is still processing the previous portion. Our algorithm time on the Tesla card is 8 hours. In contrast, on the GTX 1080 the estimated running time is expected to be less than an hour, which we extrapolated by running a small portion of the data. However, we were not able to carry out the entire experiment on the GTX 1080, lacking a server with this card and parallel I/O with a significant amount of storage.

For comparison, had we calculated the exact density field on a grid, we would have required about 13.5 TB of memory. Alternatively, on a machine with 48 GB RAM, we would need about 280 passes over the data to calculate the exact field in chunks small enough to fit on the machine.

5.4 Conclusion

In Section 5.2 we found a novel connection between the problem of finding the most massive halos in cosmological N-body simulations and the problem of finding heavy hitters in data streams. Based on this link, we built a halo finder on top of implementations of the Count-Sketch algorithm and Pick-and-Drop sampling. The halo finder successfully locates most (> 90%) of the k largest haloes using sub-linear memory. Most halo finders require the entire simulation to be loaded into memory; ours does not, and it could be run on the massive N-body simulations that are anticipated to arrive in the near future with relatively modest computing resources. In Section 5.2 we did not pay much attention to the time performance, and all experiments were mainly focused on the precision vs. memory trade-off. But both Count-Sketch and Pick-and-Drop sampling can easily be parallelized further to achieve significantly better performance. The majority of the computation in Count-Sketch is spent on calculating r × t hash functions. A straightforward way to improve the performance is to take advantage of the highly parallel GPU streaming processors for calculating this large number of hash functions.

in a batch streaming setting and ported it to a graphics processor (GPU) using the CUDA environment. This approach significantly improves the time performance while using much less memory, enabling the processing of very large datasets. We have benchmarked several implementations, varying time, precision, and memory usage. We conclude that GPUs offer a perfect infrastructure for supporting the batch streaming model. Note that in the current project, while all experiments were carried out on a single GPU, we did not change the Count Sketch data structure. Thus, two or more sketches computed on different nodes, if merged, will approximate the cell counts for the combined stream of updates. Therefore, this approach can be used in distributed settings, where each node has its own stream of updates and its own data sketch, and at the end all the sketches can be summed to find the heaviest cells. An implementation of this algorithm on distributed storage, using several GPUs, is crucial because I/O is the main bottleneck, and it will be considered in future work. Additionally, we will investigate the application of other classic streaming algorithms in a batch streaming model on the GPU. Among other future directions, we are considering structure finding in 6D space, where each particle is described by its velocity and location; we are also considering hierarchical sketch-based clustering, to find the top-k heaviest cells in meshes of different sizes in parallel. Though the emphasis in the entire chapter is on the technical application of these streaming algorithms in a new context, we showed, where possible, that these randomized algorithms provide results consistent with their exact counterparts. In particular, we can reproduce the positions of the most massive clusters and the two-point correlation function of highly non-linear excursion sets. The nature of these algorithms currently precludes the possibility of sampling the full density field or the full halo multiplicity function, though we are working on algorithms to at least approximate those statistics.
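The merging mentioned above relies only on the sketch being a linear function of the frequency vector: tables built with the same t, b and hash seeds merge by entry-wise addition. A minimal sketch of such a merge (illustrative types, not our distributed code) is:

#include <vector>
#include <cstdint>
#include <cassert>

// Merge two Count Sketch tables built with identical t, b and hash seeds:
// the entry-wise sum is exactly the sketch of the concatenated update streams.
std::vector<int64_t> mergeSketches(const std::vector<int64_t>& a,
                                   const std::vector<int64_t>& b) {
    assert(a.size() == b.size());
    std::vector<int64_t> merged(a.size());
    for (size_t i = 0; i < a.size(); ++i) merged[i] = a[i] + b[i];
    return merged;
}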


(A) Cell size = 0.1 Mpc/h

(B) Cell size = 1 Mpc/h

FIGURE 5.18: Relative error vs. rank for (a) cell size 0.1 Mpc/h and (b) cell size 1 Mpc/h. Each experiment was carried out 20 times. Dashed lines depict the maximum and the minimum, while the solid line shows the average over those 20 runs for each rank value.


FIGURE 5.19: Relative error vs. δ of the cell

FIGURE 5.20: Distribution of absolute error for different ranks


(A) cell size = 0.1 Mpc/h

(B) cell size = 1 Mpc/h

FIGURE 5.21: Count distortion for the cell size = 0.1 Mpc/h on the top and for the cell size = 1 Mpc/h on the bottom.


FIGURE 5.22: Relative error for the counts in the output of the Count Sketch algorithm and Count Min Sketch algorithm, cell size = 0.1 Mpc/h

FIGURE 5.23: Relative error for the counts in the output of the Count Sketch algorithm with different sampling rates, cell size = 1 Mpc/h


FIGURE 5.24: Relative error for the counts in the output of the Count Sketch algorithm with different sampling rates, cell size = 0.1 Mpc/h

FIGURE 5.25: Relative error for the counts in the output of the Count Sketch algorithm with different internal parameters, cell size = 0.1 Mpc. Color is the height of the CS table, and line type is the width of the CS table: solid is 16 · 10^6, dashed is 8 · 10^6, dash-dotted is 4 · 10^6, and dotted is 10^6 columns.


FIGURE 5.26: Finding halos from heavy cells exactly by running any offline in-memory algorithm on the subset of particles belonging to the top heaviest cells.

FIGURE 5.27: Comparison of the 2-point correlation functions of excursion sets determined using the exact counts and the Count Sketch results for 4 over-density levels. The numbers in parentheses indicate the number of cells that was found and used in the calculation of ξ. Clearly the results of applying the spatial statistic to the Count Sketch output are equivalent to those of the exact counts. The radius R is in the natural, co-moving units of the simulations, Mpc/h.


FIGURE 5.28: Two-point correlation functions of excursion sets, defined as sets of cells with a certain lower limit on the over-density. In this plot the results of the Count Sketch algorithm for detecting heavy hitters are used to determine the excursion sets. The number next to the line segments in the legend gives the over-density, and the numbers in parentheses indicate the number of cells at that over-density. The radius R is in the natural, co-moving units of the simulations, Mpc/h.


FIGURE 5.29: Relative error for the counts in the output of the Count Sketch algorithm for the Millennium XXL dataset.

FIGURE 5.30: Comparison of CS 2-pt correlation function for excursion sets in 0.2 Mpc cells with δ ≥ 20000 for the XXL (dots) compared to the exact result for the Millennium run. The two results are compatible with each other, with deviations explained by discreteness effects in the much sparser Millennium result. The radius R is in the natural, co-moving units of the simulations, Mpc/h.

Bibliography

[1] Dimitris Achlioptas. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins”. In: Journal of Computer and System Sciences 66.4 (2003), pp. 671– 687.

[2] Pankaj K Agarwal et al. “Mergeable summaries”. In: ACM Transactions on Database Systems (TODS) 38.4 (2013), p. 26.

[3] Charu C Aggarwal. Data streams: models and algorithms. Vol. 31. Springer Science & Business Media, 2007.

[4] Rakesh Agrawal and Ramakrishnan Srikant. “Fast Algorithms for Mining Associ- ation Rules in Large Databases”. In: VLDB. 1994, pp. 487–499.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. “First efficient convergence for streaming k- pca: a global, gap-free, and near-optimal rate”. In: Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on. IEEE. 2017, pp. 487–492.

[6] Noga Alon, Yossi Matias, and Mario Szegedy. “The space complexity of approximating the frequency moments”. In: Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. ACM. 1996, pp. 20–29.

[7] RE Angulo et al. “Scaling relations for galaxy clusters in the Millennium-XXL sim- ulation”. In: Monthly Notices of the Royal Astronomical Society 426.3 (2012), pp. 2046– 2062.

[8] Arvind Arasu and Gurmeet Singh Manku. “Approximate counts and quantiles over sliding windows”. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM. 2004, pp. 286–296.

[9] Eran Assaf et al. “Pay for a Sliding Bloom Filter and Get Counting, Distinct Elements, and Entropy for Free”. In: IEEE INFOCOM. 2018.

[10] Khanh Do Ba et al. “Lower Bounds for Sparse Recovery”. In: CoRR abs/1106.0365 (2011).

[11] Brian Babcock et al. “Models and Issues in Data Stream Systems”. In: PODS. Madi- son, Wisconsin, 2002.

[12] Ziv Bar-Yossef et al. “An information statistics approach to data stream and com- munication complexity”. In: Journal of Computer and System Sciences (2004).

[13] James M Bardeen et al. “The statistics of peaks of Gaussian random fields”. In: The Astrophysical Journal 304 (1986), pp. 15–61.

[14] Ran Ben Basat, Roy Friedman, and Rana Shahout. “Heavy Hitters over Interval Queries”. In: arXiv:1804.10740 (2018).

[15] Ran Ben-Basat et al. “Constant Time Updates in Hierarchical Heavy Hitters”. In: ACM SIGCOMM (2017).

[16] Ran Ben-Basat et al. “Randomized Admission Policy for Efficient Top-k and Fre- quency Estimation”. In: IEEE INFOCOM. 2017.

[17] Theophilus Benson, Aditya Akella, and David A. Maltz. “Network traffic charac- teristics of data centers in the wild”. In: ACM IMC. 2010.

[18] Theophilus Benson et al. “MicroTE: Fine Grained Traffic Engineering for Data Cen- ters”. In: ACM CoNEXT. 2011, p. 8.

[19] Radu Berinde et al. “Space-optimal heavy hitters with strong error bounds”. In: Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). 2009, pp. 157–166.

[20] Kevin S. Beyer and Raghu Ramakrishnan. “Bottom-Up Computation of Sparse and Iceberg CUBEs”. In: SIGMOD. 1999, pp. 359–370.

[21] Arnab Bhattacharyya, Palash Dey, and David P Woodruff. “An optimal algorithm for `1-heavy hitters in insertion streams and related problems”. In: ACM PODS. 2016.

[22] Robert S Boyer and J Strother Moore. A fast majority vote algorithm. SRI Interna- tional. Computer Science Laboratory, 1981.

[23] Robert S. Boyer and J. Strother Moore. “MJRTY: A Fast Majority Vote Algorithm”. In: Automated Reasoning: Essays in Honor of Woody Bledsoe. 1991, pp. 105–118.

[24] V. Braverman. “Sliding window algorithms”. In: Encyc. of Algorithms, 2004 ().

[25] Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. “How to catch l2-heavy- hitters on sliding windows”. In: Theoretical Computer Science (2014).

[26] Vladimir Braverman and Rafail Ostrovsky. “Approximating large frequency mo- ments with pick-and-drop sampling”. In: Approximation, Randomization, and Com- binatorial Optimization. Algorithms and Techniques. Springer, 2013, pp. 42–57.

[27] Vladimir Braverman and Rafail Ostrovsky. “Generalizing the Layering Method of Indyk and Woodruff: Recursive Sketches for Frequency-Based Vectors on Streams”. In: APPROX/RANDOM. 2013.

[28] Vladimir Braverman and Rafail Ostrovsky. “Smooth histograms for sliding windows”. In: Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on. IEEE. 2007, pp. 283–293.

[29] Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman. “Zero-One Laws for Sliding Windows and Universal Sketches”. In: APPROX/RANDOM. 2015.

[30] Vladimir Braverman et al. “An Optimal Algorithm for Large Frequency Moments Using O(n^{1−2/k}) Bits”. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. 2014, pp. 531–544.

[31] Vladimir Braverman et al. “Beating CountSketch for heavy hitters in insertion streams”. In: ACM STOC. 2016.

[32] Vladimir Braverman et al. “BPTree: an `2 heavy hitters algorithm using constant memory”. In: arXiv preprint arXiv:1603.00759 (2016).

[33] Vladimir Braverman et al. “Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows”. In: arXiv preprint arXiv:1805.00212 (2018).

[34] Vladimir Braverman et al. “Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors”. In: ACM PODS. 2016.

[35] Larry Carter and Mark N. Wegman. “Universal Classes of Hash Functions”. In: J. Comput. Syst. Sci. 18.2 (1979), pp. 143–154.

[36] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. “A near-optimal algorithm for estimating the entropy of a stream”. In: ACM Transactions on Algo- rithms (TALG) 6.3 (2010), p. 51.

[37] Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. “Near-optimal lower bounds on the multi-party communication complexity of set disjointness”. In: IEEE CCC. 2003.

[38] Badrish Chandramouli et al. “Data stream management systems for computa- tional finance”. In: Computer 43.12 (2010), pp. 45–52.

[39] Moses Charikar, Kevin Chen, and Martin Farach-Colton. “Finding frequent items in data streams”. In: International Colloquium on Automata, Languages, and Programming. Springer. 2002, pp. 693–703.

[40] Tianqi Chen and Carlos Guestrin. “Xgboost: A scalable tree boosting system”. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM. 2016, pp. 785–794.

[41] P. Coles and B. Jones. “A lognormal model for the cosmological mass distribution”. In: MNRAS 248 (Jan. 1991), pp. 1–13.

[42] Graham Cormode and Marios Hadjieleftheriou. “Finding Frequent Items in Data Streams”. In: Proc. VLDB Endow. 1.2 (Aug. 2008), pp. 1530–1541. ISSN: 2150-8097. DOI: 10.14778/1454159.1454225. URL: http://dx.doi.org/10.14778/1454159.1454225.

[43] Graham Cormode and Marios Hadjieleftheriou. “Methods for Finding Frequent Items in Data Streams”. In: J. VLDB (2010).

[44] Graham Cormode and Shan Muthukrishnan. “An improved data stream sum- mary: the count-min sketch and its applications”. In: Journal of Algorithms 55.1 (2005), pp. 58–75.

[45] Graham Cormode et al. “Holistic aggregates in a networked world: Distributed tracking of approximate quantiles”. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM. 2005, pp. 25–36.

[46] Mayur Datar et al. “Maintaining stream statistics over sliding windows”. In: SIAM journal on computing 31.6 (2002), pp. 1794–1813.

[47] Marc Davis et al. “The evolution of large-scale structure in a universe dominated by cold dark matter”. In: The Astrophysical Journal 292 (1985), pp. 371–394.

[48] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. “Frequency Estimation of Internet Packet Streams with Limited Space”. In: ESA. 2002, pp. 348–360.

[49] David J DeWitt, Jeffrey F Naughton, and Donovan A Schneider. “Parallel sorting on a shared-nothing architecture using probabilistic splitting”. In: Parallel and dis- tributed information systems, 1991., proceedings of the first international conference on. IEEE. 1991, pp. 280–291.

[50] Richard M. Dudley. “The sizes of compact subsets of Hilbert space and continuity of Gaussian processes”. In: J. Functional Analysis 1 (1967), pp. 290–330.

[51] G. Einziger, B. Fellman, and Y. Kassner. “Independent counter estimation buck- ets”. In: IEEE INFOCOM. 2015.

[52] Cristian Estan, Stefan Savage, and George Varghese. “Automatically Inferring Pat- terns of Resource Consumption in Network Traffic”. In: ACM SIGCOMM. 2003.

[53] Cristian Estan and George Varghese. “New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice”. In: ACM Trans. Comput. Syst. 21.3 (2003), pp. 270–313.

[54] B. L. Falck, M. C. Neyrinck, and A. S. Szalay. “ORIGAMI: Delineating Halos Using Phase-space Folds”. In: ApJ 754, 126 (Aug. 2012), p. 126. DOI: 10.1088/0004-637X/754/2/126. arXiv: 1201.2353.

[55] Min Fang et al. “Computing Iceberg Queries Efficiently”. In: VLDB. 1998, pp. 299– 310.

[56] Shir Landau Feibish et al. “Mitigating DNS random subdomain DDoS attacks by distinct heavy hitters sketches”. In: ACM/IEEE HotWeb 2017.

[57] David Felber and Rafail Ostrovsky. “A randomized online quantile summary in O((1/ε) log(1/ε)) words”. In: LIPIcs-Leibniz International Proceedings in Informatics. Vol. 40. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. 2015.

[58] Éric Fusy and Frécéric Giroire. “Estimating the number of active flows in a data stream over a sliding window”. In: ANALCO. 2007.

[59] Pedro Garcia-Teodoro et al. “Anomaly-Based Network Intrusion Detection: Tech- niques, Systems and Challenges”. In: Computers and Security (2009).

[60] Mina Ghashami et al. “Frequent directions: Simple and deterministic matrix sketch- ing”. In: SIAM Journal on Computing 45.5 (2016), pp. 1762–1792.

[61] Stuart PD Gill, Alexander Knebe, and Brad K Gibson. “The evolution of substruc- ture—I. A new identification method”. In: Monthly Notices of the Royal Astronomical Society 351.2 (2004), pp. 399–409.

[62] Parikshit Gopalan and Jaikumar Radhakrishnan. “Finding duplicates in a data stream”. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2009, pp. 402–411.

[63] S. Gottlöber and G. Yepes. “Shape, Spin, and Baryon Fraction of Clusters in the MareNostrum Universe”. In: ApJ 664 (July 2007), pp. 117–122. DOI: 10.1086/517907. eprint: astro-ph/0703164.

[64] Michael Greenwald and Sanjeev Khanna. “Space-efficient online computation of quantile summaries”. In: ACM SIGMOD Record. Vol. 30. 2. ACM. 2001, pp. 58–66.

[65] Michael B Greenwald and Sanjeev Khanna. “Power-conserving computation of order-statistics over sensor networks”. In: Proceedings of the twenty-third ACM SIGMOD- SIGACT-SIGART symposium on Principles of database systems. ACM. 2004, pp. 275– 285.

[66] Michael B Greenwald and Sanjeev Khanna. “Quantiles and equi-depth histograms over streams”. In: Data Stream Management. Springer, 2016, pp. 45–86.

[67] Uffe Haagerup. “The best constants in the Khintchine inequality”. In: Studia Math. 70.3 (1982), pp. 231–283.

[68] Jiawei Han, Jian Pei, and Yiwen Yin. “Mining Frequent Patterns without Candi- date Generation”. In: SIGMOD. 2000, pp. 1–12.

[69] Jiawei Han et al. “Efficient Computation of Iceberg Cubes with Complex Mea- sures”. In: SIGMOD. 2001, pp. 1–12.

[70] David Lee Hanson and Farroll Tim Wright. “A bound on tail probabilities for quadratic forms in independent random variables”. In: The Annals of Mathemati- cal Statistics 42.3 (1971), pp. 1079–1083.

[71] Hazar Harmouch and Felix Naumann. “Cardinality Estimation: An Experimental Survey”. In: J. VLDB (2017).

[72] Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. “Sketching and Streaming Entropy via Approximation Theory”. In: 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 2008, pp. 489–498.

[73] Stefan Heule, Marc Nunkesser, and Alexander Hall. “HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm”. In: ACM EDBT. 2013.

[74] Paul Hick. CAIDA Anonymized 2014 Internet Trace, equinix-chicago 2014-03-20 13:55 UTC, Direction B. http://www.caida.org/data/monitors/passive-equinix-chicago.xml.

[75] Christian Hidber. “Online Association Rule Mining”. In: SIGMOD. 1999, pp. 145– 156.

[76] Zengfeng Huang, Wai Ming Tai, and Ke Yi. “Tracking the Frequency Moments at All Times”. In: arXiv preprint arXiv:1412.1763 (2014).

[77] Zengfeng Huang et al. “Sampling based algorithms for quantile computation in sensor networks”. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM. 2011, pp. 745–756.

[78] Piotr Indyk. “Stable distributions, pseudorandom generators, embeddings, and data stream computation”. In: J. ACM 53.3 (2006), pp. 307–323.

[79] Piotr Indyk, Eric Price, and David P. Woodruff. “On the Power of Adaptivity in Sparse Recovery”. In: IEEE 52nd Annual Symposium on Foundations of Computer Sci- ence (FOCS). 2011, pp. 285–294.

[80] Piotr Indyk and David Woodruff. “Optimal Approximations of the Frequency Moments of Data Streams”. In: Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing. STOC ’05. Baltimore, MD, USA: ACM, 2005, pp. 202–208. ISBN: 1-58113-960-8. DOI: 10.1145/1060590.1060621. URL: http://doi.acm.org/10.1145/1060590.1060621.

[81] Nikita Ivkin et al. “Scalable streaming tools for analyzing N-body simulations: Finding halos and investigating excursion sets in one pass”. In: Astronomy and com- puting 23 (2018), pp. 166–179.

[82] Hossein Jowhari, Mert Saglam, and Gábor Tardos. “Tight bounds for Lp samplers, finding duplicates in streams, and related problems”. In: PODS. 2011, pp. 49–58.

[83] Nick Kaiser. “On the spatial correlations of Abell clusters”. In: The Astrophysical Journal 284 (1984), pp. L9–L12.

[84] Daniel Kane, Raghu Meka, and Jelani Nelson. “Almost optimal explicit Johnson- Lindenstrauss families”. In: Approximation, Randomization, and Combinatorial Opti- mization. Algorithms and Techniques. Springer, 2011, pp. 628–639.

[85] Zohar Karnin, Kevin Lang, and Edo Liberty. “Optimal quantile approximation in streams”. In: Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE. 2016, pp. 71–78.

[86] Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. “A simple algorithm for finding frequent elements in streams and bags”. In: ACM Trans. Database Syst. 28 (2003), pp. 51–55.

[87] Hannu Karttunen et al. Fundamental astronomy. Springer, 2016.

[88] Atul Kant Kaushik, Emmanuel S. Pilli, and R. C. Joshi. “Network Forensic Analysis by Correlation of Attacks with Network Attributes”. In: Information and Communication Technologies. 2010.

[89] Issha Kayo, Atsushi Taruya, and Yasushi Suto. “Probability Distribution Function of Cosmological Density Fluctuations from a Gaussian Initial Condition: Comparison of One-Point and Two-Point Lognormal Model Predictions with N-Body Simulations”. In: The Astrophysical Journal 561.1 (2001), p. 22. URL: http://stacks.iop.org/0004-637X/561/i=1/a=22.

[90] Anatoly Klypin and Jon Holtzman. “Particle-Mesh code for cosmological simulations”. In: arXiv preprint astro-ph/9712217 (1997).

[91] Alexander Knebe et al. “Haloes gone MAD: the halo-finder comparison project”. In: Monthly Notices of the Royal Astronomical Society 415.3 (2011), pp. 2293–2318.

[92] Alexander Knebe et al. “Structure finding in cosmological simulations: the state of affairs”. In: Monthly Notices of the Royal Astronomical Society 435.2 (2013), pp. 1618–1658.

[93] Steffen R Knollmann and Alexander Knebe. “AHF: Amiga’s halo finder”. In: The Astrophysical Journal Supplement Series 182.2 (2009), p. 608.

[94] Ilan Kremer, Noam Nisan, and Dana Ron. “On randomized one-round communication complexity”. In: Computational Complexity 8.1 (1999), pp. 21–49.

[95] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Vol. 23. Springer-Verlag, 1991.

[96] Gerard Lemson et al. “Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony”. In: arXiv preprint astro-ph/0608019 (2006).

[97] Yi Li and David P Woodruff. “A tight lower bound for high frequency moment estimation with small error”. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 2013, pp. 623–638.

[98] Zhenjiang Li et al. “Ubiquitous data collection for mobile users in wireless sensor networks”. In: INFOCOM, 2011 Proceedings IEEE. IEEE. 2011, pp. 2246–2254.

[99] Edo Liberty. “Simple and deterministic matrix sketching”. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2013, pp. 581–588.

[100] Xuemin Lin et al. “Continuously maintaining quantile summaries of the most recent n elements over a data stream”. In: Data Engineering, 2004. Proceedings. 20th International Conference on. IEEE. 2004, pp. 362–373.

[101] Zaoxing Liu et al. “One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon”. In: Proceedings of the 2016 ACM SIGCOMM Conference. SIGCOMM ’16. Florianopolis, Brazil: ACM, 2016, pp. 101–114. ISBN: 978-1-4503-4193-6. DOI: 10.1145/2934872.2934906. URL: http://doi.acm.org/10.1145/2934872.2934906.

[102] Zaoxing Liu et al. “Streaming Algorithms for Halo Finders”. In: e-Science (e-Science), 2015 IEEE 11th International Conference on. IEEE. 2015, pp. 342–351.

[103] Gurmeet Singh Manku and Rajeev Motwani. “Approximate Frequency Counts over Data Streams”. In: PVLDB 5.12 (2012), p. 1699.

[104] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. “Approximate medians and other quantiles in one pass and with limited memory”. In: ACM SIGMOD Record. Vol. 27. 2. ACM. 1998, pp. 426–435.

[105] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. “Random sampling techniques for space efficient online computation of order statistics of large datasets”. In: ACM SIGMOD Record. Vol. 28. 2. ACM. 1999, pp. 251–262.

[106] Raghu Meka. “A PTAS for computing the supremum of Gaussian processes”. In: FOCS. IEEE. 2012, pp. 217–222.

[107] A. Metwally, D. Agrawal, and A. El Abbadi. “Efficient computation of frequent and top-k elements in data streams”. In: ICDT. 2005.

[108] Rui Miao et al. “SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs”. In: ACM SIGCOMM. 2017.

[109] Jayadev Misra and David Gries. “Finding repeated elements”. In: Science of Computer Programming 2.2 (1982), pp. 143–152.

[110] Morteza Monemizadeh and David P Woodruff. “1-pass relative-error lp-sampling with applications”. In: Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms. SIAM. 2010, pp. 1143–1160.

[111] Masoud Moshref et al. “DREAM: Dynamic Resource Allocation for Software-defined Measurement”. In: ACM SIGCOMM. 2014.

[112] J Ian Munro and Mike S Paterson. “Selection and sorting with limited storage”. In: Theoretical computer science 12.3 (1980), pp. 315–323.

[113] S. Muthukrishnan. “Data Streams: Algorithms and Applications”. In: Foundations and Trends in Theoretical Computer Science 1.2 (2005). DOI: 10.1561/0400000002. URL: http://dx.doi.org/10.1561/0400000002.

[114] Mark C Neyrinck, Nickolay Y Gnedin, and Andrew JS Hamilton. “VOBOZ: an almost-parameter-free halo-finding algorithm”. In: Monthly Notices of the Royal Astronomical Society 356.4 (2005), pp. 1222–1232.

[115] Ran Ben Basat, Gil Einziger, Roy Friedman, Vladimir Braverman, Nikita Ivkin, and Zaoxing Liu. “Monitoring the Network with Interval Queries”. In submission. 2018.

[116] Noam Nisan. “Pseudorandom generators for space-bounded computation”. In: Combinatorica 12.4 (1992), pp. 449–461.

[117] NVIDIA. CUDA Programming Guide. 2010.

[118] George Nychis et al. “An Empirical Evaluation of Entropy-based Traffic Anomaly Detection”. In: ACM IMC. 2008.

[119] “Page view statistics for Wikimedia projects”. In: (2016). URL: https://dumps.wikimedia.org/other/pagecounts-raw/.

[120] Phillip James Edwin Peebles. The large-scale structure of the universe. Princeton University Press, 1980.

[121] Rob Pike et al. “Interpreting the data: Parallel analysis with Sawzall”. In: Scientific Programming 13.4 (2005), pp. 277–298.

[122] Susana Planelles and Vicent Quilis. “ASOHF: a new adaptive spherical overdensity halo finder”. In: Astronomy & Astrophysics 519 (2010), A94.

[123] Viswanath Poosala et al. “Improved histograms for selectivity estimation of range predicates”. In: ACM SIGMOD Record. Vol. 25. 2. ACM. 1996, pp. 294–305.

[124] Douglas Potter, Joachim Stadel, and Romain Teyssier. “PKDGRAV3: beyond trillion particle cosmological simulations for the next era of galaxy surveys”. In: Computational Astrophysics and Cosmology 4.1 (2017), p. 2.

[125] Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. “DSK: k-mer counting with very low memory usage”. In: Bioinformatics 29.5 (2013), pp. 652–653.

[126] Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe. “An Efficient Algorithm for Mining Association Rules in Large Databases”. In: VLDB. 1995, pp. 432–444.

[127] Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. “Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks”. In: Proceedings of the 5th USENIX conference on Hot topics in security. USENIX Association. 2010, pp. 1–8.

[128] Vyas Sekar et al. “LADS: Large-scale Automated DDoS Detection System.” In: USENIX ATC. 2006.

[129] P Griffiths Selinger et al. “Access path selection in a relational database management system”. In: Proceedings of the 1979 ACM SIGMOD international conference on Management of data. ACM. 1979, pp. 23–34.

[130] Nisheeth Shrivastava et al. “Medians and beyond: new aggregation techniques for sensor networks”. In: Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM. 2004, pp. 239–249.

[131] Haya Shulman and Michael Waidner. “Towards Forensic Analysis of Attacks with DNSSEC”. In: IEEE SPW. 2014.

[132] R. E. Smith et al. “Stable clustering, the halo model and non-linear cosmological power spectra”. In: MNRAS 341 (June 2003), pp. 1311–1332. DOI: 10.1046/j.1365-8711.2003.06503.x. eprint: astro-ph/0207664.

[133] Volker Springel. “The cosmological simulation code GADGET-2”. In: Monthly Notices of the Royal Astronomical Society 364.4 (2005), pp. 1105–1134.

[134] PM Sutter and PM Ricker. “Examining subgrid models of supermassive black holes in cosmological simulation”. In: The Astrophysical Journal 723.2 (2010), p. 1308.

[135] Michel Talagrand. “Majorizing Measures: The Generic Chaining”. In: The Annals of Probability 24.3 (1996).

[136] Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Vol. 60. Springer Science & Business Media, 2014.

[137] The CAIDA Anonymized Internet Traces equinix-nyc 2018-03-15, Dir. A.

[138] “The CAIDA UCSD Anonymized Internet Traces”. In: (2015). URL: http://data.caida.org/datasets/passive-2015/.

[139] The CAIDA UCSD Anonymized Internet Traces 2016 – January 21st. URL: http://www.caida.org/data/passive/passive_2016_dataset.xml.

[140] Mikkel Thorup and Yin Zhang. “Tabulation based 4-universal hashing with applications to second moment estimation”. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004. 2004, pp. 615–624. URL: http://dl.acm.org/citation.cfm?id=982792.982884.

[141] Mikkel Thorup and Yin Zhang. “Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation”. In: SIAM Journal on Computing 41.2 (2012), pp. 293–331.

[142] Hannu Toivonen. “Sampling Large Databases for Association Rules”. In: VLDB. 1996, pp. 134–145.

[143] Lu Wang et al. “Quantiles over data streams: an experimental study”. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM. 2013, pp. 737–748.

[144] Martin White, Lars Hernquist, and Volker Springel. “The halo model and numerical simulations”. In: The Astrophysical Journal Letters 550.2 (2001), p. L129.

[145] Li Yang et al. “CASE: Cache-assisted Stretchable Estimator for High Speed Per- flow Measurement”. In: IEEE INFOCOM. 2016.

[146] Sen Yang, Bill Lin, and Jun Xu. “Safe Randomized Load-Balanced Switching By Diffusing Extra Loads”. In: Proc. ACM Meas. Anal. Comput. Syst. (2017).

[147] Ke Yi and Qin Zhang. “Optimal tracking of distributed heavy hitters and quantiles”. In: Algorithmica 65.1 (2013), pp. 206–223.

Biography

Nikita Ivkin was born in Moscow, Russia. He obtained B.S. and M.S. degrees in Applied Mathematics from the Moscow Institute of Physics and Technology in 2011 and 2013, respectively. In 2016 he earned an M.S.E. degree from Johns Hopkins University.
