Sketching and Streaming High-Dimensional Vectors
Jelani Nelson
Sketching and Streaming High-Dimensional Vectors

by

Jelani Nelson

S.B., Massachusetts Institute of Technology (2005)
M.Eng., Massachusetts Institute of Technology (2006)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2011

© Massachusetts Institute of Technology 2011. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2011
Certified by: Erik D. Demaine, Professor, Thesis Supervisor
Certified by: Piotr Indyk, Professor, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Students

Abstract

A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms.

• The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves O(1) update and reporting times.
• A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.

• The first space-optimal algorithm for pth moment estimation in turnstile streams for 0 < p < 2, with matching lower bounds, and another space-optimal algorithm which also has a fast O(log^2(1/ε) log log(1/ε)) update time for (1 ± ε)-approximation.

• A general reduction from empirical entropy estimation in turnstile streams to moment estimation, providing the only known near-optimal space-complexity upper bound for this problem.

• A proof of the Johnson-Lindenstrauss lemma where every matrix in the support of the embedding distribution is much sparser than previously known constructions. In particular, to achieve distortion (1 ± ε) with probability 1 − δ, we embed into optimal dimension O(ε^{-2} log(1/δ)) and such that every matrix in the support of the distribution has O(ε^{-1} log(1/δ)) non-zero entries per column.

Thesis Supervisor: Erik D. Demaine, Professor
Thesis Supervisor: Piotr Indyk, Professor

Acknowledgments

Including my years as an undergraduate I have been at MIT for a full decade, and I have come across many friends, colleagues, research advisors, and teachers who have made the experience very enjoyable. For this, I am very grateful.

I thank my advisors Prof. Erik Demaine and Prof. Piotr Indyk for their guidance in my PhD studies. I was primarily interested in external memory data structures when I entered the program, an area within Erik's research interests, and Erik was very supportive while I explored questions in this realm. After taking a course on sketching and streaming algorithms, my research interests migrated toward that area, at which point Piotr began giving me much guidance and became an unofficial (and eventually official) co-advisor, and then even an officemate, if one considers the 6th floor lounge to be an office.
I have received a great amount of influence and guidance from Erik and Piotr, and I am grateful to have been their student.

I would like to thank my coauthors: Tim Abbott, Miki Ajtai, Michael Bender, Michael Burr, Timothy Chan, Erik Demaine, Marty Demaine, Ilias Diakonikolas, Martin Farach-Colton, Vitaly Feldman, Jeremy Fineman, Yonatan Fogel, Nick Harvey, Avinatan Hassidim, John Hugg, Daniel Kane, Bradley Kuszmaul, Stefan Langerman, Raghu Meka, Krzysztof Onak, Ely Porat, Eynat Rafalin, Kathryn Seyboth, David Woodruff, and Vincent Yeung. I have been fortunate to have enjoyed every collaboration I have been a part of with these colleagues, and I have learned much from them. I in particular collaborated quite frequently with Daniel and David, both of whom I knew even before beginning my PhD studies, and it has been a pleasure to work with them so often. I have learned a great deal by doing so.

While on the topic of learning from others, aside from advisors and coauthors I have been fortunate to have been surrounded by several colleagues and friends from whom I have learned a great deal concerning theoretical computer science and other relevant areas of mathematics: at MIT, at the IBM Almaden Research Center and Microsoft Research New England during summer internships, at the Technion in the summer prior to beginning my PhD, and elsewhere.

Last but not least, I would like to thank God and my parents for creating me and enabling me to have an enjoyable life thus far.

In memory of my high school mathematics teacher José Enrique Carrillo.

Contents

1 Introduction
  1.1 Notation and Preliminaries
  1.2 Distinct Elements
  1.3 Moment Estimation
  1.4 Entropy Estimation
  1.5 Johnson-Lindenstrauss Transforms
    1.5.1 Sparse Johnson-Lindenstrauss Transforms
    1.5.2 Derandomizing the Johnson-Lindenstrauss Transform
2 Distinct Elements
  2.1 Balls and Bins with Limited Independence
  2.2 F_0 estimation algorithm
    2.2.1 RoughEstimator
    2.2.2 Full algorithm
    2.2.3 Handling small F_0
    2.2.4 Running time
  2.3 ℓ_0 estimation algorithm
    2.3.1 A Rough Estimator for ℓ_0 estimation
  2.4 Further required proofs
    2.4.1 Various calculations for balls and bins
    2.4.2 A compact lookup table for the natural logarithm
3 Moment Estimation
  3.1 FT-Mollification
  3.2 Slow Moment Estimation in Optimal Space
    3.2.1 Alterations for handling limited precision
    3.2.2 Approximating the median of |D_p|
  3.3 Fast Moment Estimation in Optimal Space
    3.3.1 Estimating the contribution from heavy hitters
    3.3.2 Estimating the contribution from light elements
    3.3.3 The final algorithm: putting it all together
    3.3.4 A derandomized geometric mean estimator variant
    3.3.5 A heavy hitter algorithm for F_p
  3.4 Lower Bounds
  3.5 The additive log log d in ℓ_p bounds
    3.5.1 The upper bounds
    3.5.2 The lower bounds
4 Entropy Estimation
  4.1 Noisy Extrapolation
    4.1.1 Bounding the First Error Term
    4.1.2 Bounding the Second Error Term
  4.2 Additive Estimation Algorithm
  4.3 Multiplicative Estimation Algorithm
  4.4 Lower Bound
5 Johnson-Lindenstrauss Transforms
  5.1 Sparse JL transforms
    5.1.1 A Warmup Proof of the JL Lemma
    5.1.2 Code-Based Sparse JL Constructions
    5.1.3 Random Hashing Sparse JL Constructions
    5.1.4 Tightness of Analyses
  5.2 Numerical Linear Algebra Applications
    5.2.1 Linear regression
    5.2.2 Low rank approximation
  5.3 A proof of the Hanson-Wright inequality
  5.4 Derandomizing the JL lemma
6 Conclusion

List of Figures

2-1 RoughEstimator pseudocode. With probability 1 − o(1), F̃_0(t) = Θ(F_0(t)) at every point t in the stream for which F_0(t) ≥ K_RE. The value ρ is 0.99 · (1 − e^{−1/3}).
2-2 F_0 algorithm pseudocode. With probability 11/20, F̃_0 = (1 ± O(ε))F_0.
2-3 An algorithm skeleton for F_0 estimation.
3-1 Indyk's derandomized ℓ_p estimation algorithm pseudocode, 0 < p < 2, assuming infinite precision.
3-2 LightEstimator, for estimating contribution from non-heavy hitters.
4-1 Algorithm for additively approximating empirical entropy.
5-1 Three constructions to project a vector in R^d down to R^k. Figure (a) is the DKS construction in [35], and the two constructions we give in this work are represented in (b) and (c). The out-degree in each case is s, the sparsity.

List of Tables

1.1 Catalog of problems studied in this thesis, and our contributions.
1.2 Comparison of our algorithm to previous algorithms on estimating the number of distinct elements in a data stream. Update time not listed for algorithms assuming random oracle.
1.3 Comparison of our contribution to previous works on F_p estimation in data streams.

Chapter 1

Introduction

In 1975, Morris discovered the following surprising result: one can count up to m using exponentially fewer than log m bits. For example, he showed how to approximately count up to numbers as large as 130,000 using 8-bit registers, with 95% confidence that his returned estimate was within 24% of the true count. In general terms, the method allows one to approximately count the number of items in a sequence of length m using only O(log log m) bits, given only one pass over the sequence and such that the estimated sequence length has small relative error with large probability. The method was simple: maintain a value ν in memory which one imagines represents log(1 + m), then after seeing a new item in the sequence, increment ν with probability exponentially small in ν. Morris' algorithm was the first non-trivial low-space streaming algorithm: an algorithm that computes a function of its input given only one or a few passes over it.
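To make the update rule concrete, here is a minimal Python sketch of a base-2 variant of Morris' counter. It is illustrative only: Morris' actual scheme uses an increment probability of the form (1 + a)^{-ν} for a base 1 + a close to 1, which is what lets counts near 130,000 fit in an 8-bit register; the base-2 version below keeps the same structure but has larger variance, so the example averages independent copies to get a usable estimate.

```python
import random

def morris_count(stream):
    """Base-2 Morris counter: approximate the length of `stream`.

    Maintains nu, which one imagines represents log2(1 + count);
    each new item increments nu with probability 2**-nu.  The
    estimator 2**nu - 1 is unbiased for the true count, and nu
    itself needs only O(log log m) bits.
    """
    nu = 0
    for _ in stream:
        if random.random() < 2.0 ** -nu:
            nu += 1
    return 2 ** nu - 1

def averaged_estimate(m, copies=1000):
    """Average independent counters to shrink the relative error."""
    return sum(morris_count(range(m)) for _ in range(copies)) / copies
```

A single base-2 counter has variance roughly m^2/2, which is why the averaging step (or, in Morris' original analysis, a base much closer to 1) is needed before the estimate is trustworthy.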