Data Stream Algorithms Lecture Notes
Amit Chakrabarti
Dartmouth College
Latest Update: July 2, 2020

Preface

This book grew out of lecture notes for offerings of a course on data stream algorithms at Dartmouth, beginning with a first offering in Fall 2009. Its primary goal is to be a resource for students and other teachers of this material. The choice of topics reflects this: the focus is on foundational algorithms for space-efficient processing of massive data streams. I have emphasized algorithms that are simple enough to permit a clean and fairly complete presentation in a classroom, assuming not much more background than an advanced undergraduate student would have. Where appropriate, I have provided pointers to more advanced research results that achieve improved bounds using heavier technical machinery.

I would like to thank the many Dartmouth students who took various editions of my course, as well as researchers around the world who told me that they had found my course notes useful. Their feedback inspired me to make a book out of this material. I thank Suman Kalyan Bera, Sagar Kale, and Jelani Nelson for numerous comments and contributions. Of special note are the 2009 students whose scribing efforts got this book started: Radhika Bhasin, Andrew Cherne, Robin Chhetri, Joe Cooley, Jon Denning, Alina Djamankulova, Ryan Kingston, Ranganath Kondapally, Adrian Kostrubiak, Konstantin Kutzkow, Aarathi Prasad, Priya Natarajan, and Zhenghui Wang.
Amit Chakrabarti
April 2020

Dartmouth: CS 35/135

Contents

0 Preliminaries: The Data Stream Model
  0.1 The Basic Setup
  0.2 The Quality of an Algorithm’s Answer
  0.3 Variations of the Basic Setup

1 Finding Frequent Items Deterministically
  1.1 The Problem
  1.2 Frequency Estimation: The Misra–Gries Algorithm
  1.3 Analysis of the Algorithm
  1.4 Finding the Frequent Items
  Exercises

2 Estimating the Number of Distinct Elements
  2.1 The Problem
  2.2 The Tidemark Algorithm
  2.3 The Quality of the Algorithm’s Estimate
  2.4 The Median Trick
  Exercises

3 A Better Estimate for Distinct Elements
  3.1 The Problem
  3.2 The BJKST Algorithm
  3.3 Analysis: Space Complexity
  3.4 Analysis: The Quality of the Estimate
  3.5 Optimality
  Exercises

4 Approximate Counting
  4.1 The Problem
  4.2 The Algorithm
  4.3 The Quality of the Estimate
  4.4 The Median-of-Means Improvement
  Exercises

5 Finding Frequent Items via (Linear) Sketching
  5.1 The Problem
  5.2 Sketches and Linear Sketches
  5.3 CountSketch
    5.3.1 The Quality of the Basic Sketch’s Estimate
    5.3.2 The Final Sketch
  5.4 The Count-Min Sketch
    5.4.1 The Quality of the Algorithm’s Estimate
  5.5 Comparison of Frequency Estimation Methods
  Exercises

6 Estimating Frequency Moments
  6.1 Background and Motivation
  6.2 The (Basic) AMS Estimator for F_k
  6.3 Analysis of the Basic Estimator
  6.4 The Final Estimator and Space Bound
    6.4.1 The Soft-O Notation

7 The Tug-of-War Sketch
  7.1 The Basic Sketch
    7.1.1 The Quality of the Estimate
  7.2 The Final Sketch
    7.2.1 A Geometric Interpretation
  Exercises

8 Estimating Norms Using Stable Distributions
  8.1 A Different ℓ_2 Algorithm
  8.2 Stable Distributions
  8.3 The Median of a Distribution and its Estimation
  8.4 The Accuracy of the Estimate
  8.5 Annoying Technical Details

9 Sparse Recovery
  9.1 The Problem
  9.2 Special case: 1-sparse recovery
  9.3 Analysis: Correctness, Space, and Time
  9.4 General Case: s-sparse Recovery
  Exercises

10 Weight-Based Sampling
  10.1 The Problem
  10.2 The ℓ_0-Sampling Problem
    10.2.1 An Idealized Algorithm
    10.2.2 The Quality of the Output
  10.3 ℓ_2 Sampling
    10.3.1 An ℓ_2-sampling Algorithm
    10.3.2 Analysis:

11 Finding the Median
  11.1 The Problem
  11.2 Preliminaries for an Algorithm
  11.3 The Munro–Paterson Algorithm
    11.3.1 Computing a Core
    11.3.2 Utilizing a Core
    11.3.3 Analysis: Pass/Space Tradeoff
  Exercises

12 Geometric Streams and Coresets
  12.1 Extent Measures and Minimum Enclosing Ball
  12.2 Coresets and Their Properties
  12.3 A Coreset for MEB
  12.4 Data Stream Algorithm for Coreset Construction
  Exercises

13 Metric Streams and Clustering
  13.1 Metric Spaces
  13.2 The Cost of a Clustering: Summarization Costs
  13.3 The Doubling Algorithm
  13.4 Metric Costs and Threshold Algorithms
  13.5 Guha’s Cascading Algorithm
    13.5.1 Space Bounds
    13.5.2 The Quality of the Summary
  Exercises

14 Graph Streams: Basic Algorithms
  14.1 Streams that Describe Graphs
    14.1.1 Semi-Streaming Space Bounds
  14.2 The Connectedness Problem
  14.3 The Bipartiteness Problem
  14.4 Shortest Paths and Distance Estimation via Spanners
    14.4.1 The Quality of the Estimate
    14.4.2 Space Complexity: High-Girth Graphs and the Size of a Spanner
  Exercises

15 Finding Maximum Matchings
  15.1 Preliminaries
  15.2 Maximum Cardinality Matching
  15.3 Maximum Weight Matching
  Exercises

16 Graph Sketching
  16.1 The Value of Boundary Edges
    16.1.1 Testing Connectivity Using Boundary Edges
    16.1.2 Testing Bipartiteness
  16.2 The AGM Sketch: Producing a Boundary Edge

17 Counting Triangles
  17.1 A Sampling-Based Algorithm
  17.2 A Sketch-Based Algorithm
  Exercises

18 Communication Complexity and a First Look at Lower Bounds
  18.1 Communication Games, Protocols, Complexity
  18.2 Specific Two-Player Communication