New Directions in Streaming Algorithms
by
Ali Vakilian
B.S., Sharif University of Technology (2011)
M.S., University of Illinois at Urbana-Champaign (2013)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, September 2019.
© Massachusetts Institute of Technology 2019. All rights reserved.

Author ...... Department of Electrical Engineering and Computer Science, August 30, 2019
Certified by ...... Erik D. Demaine, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Certified by ...... Piotr Indyk, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by ...... Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

New Directions in Streaming Algorithms
by Ali Vakilian

Submitted to the Department of Electrical Engineering and Computer Science on August 30, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

Large volumes of available data have led to the emergence of new computational models for data analysis. One such model is captured by the notion of streaming algorithms: given a sequence of 푁 items, the goal is to compute the value of a given function of the input items using a small number of passes and an amount of space sublinear in 푁. Streaming algorithms have applications in many areas, such as networking and large-scale machine learning. Despite a large amount of work in this area over the last two decades, several aspects of streaming algorithms have remained poorly understood, such as (a) streaming algorithms for combinatorial optimization problems and (b) incorporating modern machine learning techniques into the design of streaming algorithms. In the first part of this thesis, we describe (essentially) optimal streaming algorithms for set cover and maximum coverage, two classic problems in combinatorial optimization. In the second part, we show how to augment classic streaming algorithms for the frequency estimation and low-rank approximation problems with machine learning oracles in order to improve their space-accuracy tradeoffs. The new algorithms combine the benefits of machine learning with the formal guarantees available through algorithm design theory.

Thesis Supervisor: Erik D. Demaine Title: Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Piotr Indyk Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

First and foremost, I would like to thank my advisors, Erik Demaine and Piotr Indyk. I had the chance to learn from Erik even before joining MIT, through his lectures on “Introduction to Algorithms” when I was a freshman at Sharif UT. His inspiring style of teaching played a significant role in my passion to pursue TCS in grad school. For me, it was very valuable to have the opportunity to work with Erik. Besides, Erik gave me the freedom to explore different areas and pursue my interests, while always being there to support and advise. The main topic of this thesis, streaming algorithms, and all results in this thesis are in collaboration with Piotr Indyk. His guidance and supervision have shaped my academic character, and his support during my PhD studies was significant both academically and personally. He always had time for me to discuss both research and non-research questions, and his continuous passion for science, research and mentoring young researchers will always be a perfect example for me. Next, I would like to thank Ronitt Rubinfeld, who kindly accepted to be on my thesis committee. Ronitt introduced me to other closely related areas of research in massive data analysis, namely Local Computation Algorithms and Sublinear Algorithms, both through her courses and through collaborations. Besides, I benefited a lot from her advice during my PhD years. I would also like to thank my MS advisor at UIUC, Chandra Chekuri. My first experience in doing research in TCS was in collaboration with Chandra, and I will always be grateful for his support and supervision. He always had time for me to discuss any topic, even after my graduation from UIUC. During my years in grad school, I had the chance to collaborate with many wonderful researchers, and this thesis would not exist without their collaboration. I would like to thank all my co-authors: Anders Aamand, Arturs Backurs, Chandra Chekuri, Yodsawalai Chodpathumwan, Erik Demaine, Alina Ene, Timothy Goodrich, Christoph Gunru, Sariel Har-Peled, Chen-Yu Hsu, Piotr Indyk, Dina Katabi, Kyle Kloster, Nitish Korula, Brian Lavallee, Quanquan Liu, Sepideh Mahabadi, Slobodan Mitrović, Amir Nayyeri, Krzysztof Onak, Merav Parter, Ronitt Rubinfeld, Baruch Schieber, Blair Sullivan, Arash Termehchy, Jonathan Ullman, Andrew van der Poel, Tal Wagner, Marianne Winslett, David Woodruff,

Anak Yodpinyanee and Yang Yuan. Thanks to all members of the Theory of Computation group, and in particular Maryam Aliakbarpour, Arturs, Mohammad Bavarian, Mina Dalirroyfard, Dylan Foster, Mohsen Ghaffari, Themis Gouleakis, Siddhartha Jayanti, Gautam Kamath, Pritish Kamath, William M. Leiserson, Quanquan, Sepideh, Saeed Mehraban, Slobodan, Cameron Musco, Christopher Musco, Merav, Madalina Persu, Ilya Razenshteyn, Nicholas Schiefer, Ludwig Schmidt, Adam Sealfon, Tal, Anak, Yang and Henry Yuen: I had a great time at MIT. Being away from home, I had wonderful years during my PhD thanks to my friends in the Boston area. I would like to thank all my friends, and in particular those at MIT: Ehsan, Ardavan, Asma, Maryam, Fereshteh, Elaheh, Sajjad, Mina, Alireza, Amir, Mehrdad, Zahra and Ali. Last but not least, I would like to thank my family. My parents, Amir and Maryam, were always sources of inspiration for me, and I am thankful for all their support and efforts for me since my childhood. Words are not enough to acknowledge what they have done for me. I would also like to thank my siblings, Mohsen and Fatemeh, who always care about me. Mohsen was my very first mentor since elementary school and had a significant role in shaping my interest in Math and CS. I have benefited a lot from his advice and help. Mohsen and Fatemeh, thanks for being so perfect! Sepideh, my beloved wife, is the most valuable gift I received from God during my PhD at MIT. Besides our several collaborations on different projects, which contribute to half of this thesis, I was very fortunate to have her honest and unconditional support in my ups and downs during my PhD.

“In the name of God, the most gracious, the most merciful”

To my beloved family

Contents

1 Introduction 23 1.1 Streaming Algorithms for Coverage Problems...... 24 1.1.1 Streaming Algorithms for Set Cover...... 25 1.1.2 Streaming Algorithms for Fractional Set Cover...... 25 1.1.3 Sublinear Algorithms for Set Cover...... 26 1.1.4 Streaming Algorithms for Maximum Coverage...... 26 1.2 Learning-Based Streaming Algorithms...... 27 1.2.1 Learning-Based Algorithms for Frequency Estimation...... 28 1.2.2 Learning-Based Algorithms for Low-Rank Approximation...... 29

I Sublinear Algorithms for Coverage Problems 31

2 Streaming Set Cover 33 2.1 Introduction...... 33 2.1.1 Related Work...... 33 2.1.2 Our Results...... 35 2.2 Streaming Algorithm for Set Cover...... 39 2.2.1 Analysis of IterSetCover ...... 41 2.3 Lower Bound for Single Pass Algorithms...... 44 2.4 Geometric Set Cover...... 50 2.4.1 Preliminaries...... 50

2.4.2 Description of GeomSC Algorithm...... 52 2.4.3 Analysis...... 52 2.5 Lower bound for Multipass Algorithms...... 55 2.6 Lower Bound for Sparse Set Cover in Multiple Passes...... 61

3 Streaming Fractional Set Cover 65 3.1 Introduction...... 65 3.1.1 Our Results...... 65 3.1.2 Other Related Work...... 66 3.1.3 Our Techniques...... 68 3.2 MWU Framework of the Streaming Algorithm for Fractional Set Cover... 70

3.2.1 Preliminaries of the MWU method for solving covering LPs...... 71
3.2.2 Semi-streaming MWU-based algorithm for fractional Set Cover...... 73
3.2.3 First Attempt: Simple Oracle and Large Width...... 74
3.3 Max Cover Problem and its Application to Width Reduction...... 76
3.3.1 The Maximum Coverage Problem...... 78
3.3.2 Sampling-Based Oracle for Fractional Max Coverage...... 82
3.3.3 Final Step: Running Several MWU Rounds Together...... 85
3.3.4 Extension to general covering LPs...... 86

4 Sublinear Set Cover 91 4.1 Introduction...... 91 4.1.1 Our Results...... 92 4.1.2 Related work...... 94 4.1.3 Our Techniques...... 94 4.2 Preliminaries...... 96 4.3 Lower Bounds for the Set Cover Problem...... 97

4.3.1 The Set Cover Lower Bound for Small Optimal Value 푘 ...... 99 4.3.2 The Set Cover Lower Bound for Large Optimal Value 푘...... 104 4.4 Sub-Linear Algorithms for the Set Cover Problem...... 111

4.4.1 First Algorithm: small values of 푘 ...... 113 4.4.2 Second Algorithm: large values of 푘 ...... 118 4.5 Lower Bound for the Cover Verification Problem...... 120 4.5.1 Underlying Set Structure...... 122 4.6 Generalized Lower Bounds for the Set Cover Problem...... 124

4.6.1 Construction of the Median Instance 퐼*...... 126
4.6.2 Distribution 풟(퐼*) of the Modified Instances Derived from 퐼*...... 130
4.7 Omitted Proofs from Section 4.3...... 133

5 Streaming Maximum Coverage 135 5.1 Introduction...... 135 5.1.1 Our Results...... 136 5.1.2 Our Techniques...... 136 5.1.3 Other Related Work...... 139 5.2 Preliminaries and Notations...... 139

5.2.1 Sampling Methods for Max 푘-Cover and Set Cover ...... 139 5.2.2 HeavyHitters and Contributing Classes...... 140

5.2.3 퐿0-Estimation...... 143 5.3 Estimating Optimal Coverage of Max 푘-Cover ...... 144 5.3.1 Universe Reduction...... 145

5.4 (훼, 훿, 휂)-Oracle of Max 푘-Cover ...... 148
5.4.1 Multi-layered Set Sampling: ∃훽 ≤ 훼 s.t. |풰^cmn_{훽푘}| ≥ 휎훽|풰|/훼 ...... 150
5.4.2 Heavy Hitters and Contributing Classes: |풞(OPT_large)| ≥ |풞(OPT_small)| ...... 152
5.4.3 Element Sampling: |풞(OPT_large)| < |풞(OPT_small)| ...... 157
5.5 Approximating Max 푘-Cover Problem in Edge Arrival Streams...... 161
5.6 Lower Bound for Estimating Max 푘-Cover in Edge Arrival Streams...... 165
5.7 Chernoff Bound for Applications with Limited Independence...... 167

5.7.1 An Application: Set Sampling with Θ(log(푚푛))-wise Independence. 168

5.8 Generalization of Section 5.4.2...... 169

II Learning-Based Streaming Algorithms 178

6 The Frequency Estimation Problem 179 6.1 Introduction...... 179 6.1.1 Related Work...... 181 6.2 Preliminaries...... 183 6.2.1 Algorithms for Frequency Estimation...... 184 6.2.2 Zipfian Distribution...... 184 6.3 Learning-Based Frequency Estimation Algorithms...... 185 6.4 Analysis...... 186 6.4.1 Tight Bounds for Count-Min with Zipfians...... 187 6.4.2 Learned Count-Min with Zipfians...... 190 6.4.3 (Nearly) tight Bounds for Count-Sketch with Zipfians...... 194 6.4.4 Learned Count-Sketch for Zipfians...... 201 6.5 Experiments...... 203 6.5.1 Internet Traffic Estimation...... 204 6.5.2 Search Query Estimation...... 207 6.5.3 Analyzing Heavy Hitter Models...... 209

7 The Low-Rank Approximation Problem 211 7.1 Introduction...... 211 7.1.1 Our Results...... 212 7.1.2 Related work...... 213 7.2 Preliminaries...... 214 7.3 Training Algorithm...... 215 7.4 Worst Case Bound...... 216 7.5 Experimental Results...... 219

7.5.1 Average test error...... 221 7.5.2 Comparing Random, Learned and Mixed...... 222

7.6 The case of 푚 = 1 ...... 222 7.7 Optimization Bounds...... 223 7.8 Generalization Bounds...... 226 7.8.1 Generalization Bound for Robust Solutions...... 228

List of Figures

2.1.1 A collection of 푛2/4 distinct rectangles (i.e., sets), each containing two points. 38

2.5.1 (a) shows an example of the communication Set Chasing(4, 3) and (b) is an instance of the communication Intersection Set Chasing(4, 3)...... 55

2.5.2 The gadgets used in the reduction of the communication Intersection Set Chasing problem to the communication Set Cover problem. (a) and (b) show the construction of the gadget for players 1 to 푝, and (c) and (d) show the construction of the gadget for players 푝 + 1 to 2푝...... 56

2.5.3 In (b) two Set Chasing instances merge in their first set of vertices and (c) shows the corresponding gadgets of these merged vertices in the communication Set Cover...... 58

2.5.4 In (푎), path 푄 is shown with black dashed arcs and (푏) shows the corresponding cover of path 푄...... 58

3.2.1 LP relaxation of Set Cover...... 70

3.2.2 LP relaxations of the feasibility variant of set cover and general covering problems...... 73

3.3.1 LP relaxation of weighted Max 푘-Cover...... 79

4.5.1 A basic slab and an example of a swapping operation...... 123

4.5.2 An example structure of a Yes instance; all elements are covered by the first 3 sets...... 123

4.6.1 Tables illustrating the representation of a slab under EltOf and SetOf before and after a swap; cells modified by swap(푒2, 푒4) between 푆3 and 푆9 are highlighted in red...... 125

6.3.1 Pseudo-code and block-diagram representation of our algorithms...... 185 6.5.1 Frequency of Internet flows...... 204 6.5.2 Comparison of our algorithms with Count-Min and Count-Sketch on Internet traffic data...... 206 6.5.3 Frequency of search queries...... 207 6.5.4 Comparison of our algorithms with Count-Min and Count-Sketch on search query data...... 208 6.5.5 ROC curves of the learned models...... 209

7.3.1 The 푖-th iteration of the power method...... 216
7.5.1 Test error by datasets and sketching matrices for 푘 = 10, 푚 = 20 ...... 220
7.5.2 Test error for Logo (left), Hyper (middle) and Tech (right) when 푘 = 10...... 220
7.5.3 Low rank approximation results for Logo video frame: the best rank-10 approximation (left), and rank-10 approximations reported by Algorithm 27 using a sparse learned sketching matrix (middle) and a sparse random sketching matrix (right)...... 221

List of Tables

2.1.1 Summary of past work and our results. Parameters 휌 and 휌g respectively denote the approximation factor of off-line algorithms for Set Cover and its geometric variant...... 34

4.1.1 A summary of our algorithms and lower bounds. We use the following notation: 푘 ≥ 1 denotes the size of the optimum cover; 훼 ≥ 1 denotes a parameter that determines the trade-off between the approximation quality and query/time complexities; 휌 ≥ 1 denotes the approximation factor of a “black box” algorithm for set cover used as a subroutine...... 93

5.1.1 The summary of known results on the space complexity of single-pass streaming algorithms for Max 푘-Cover...... 137
5.4.1 The values of parameters used in our algorithms...... 149

6.4.1 The performance of different algorithms on streams with frequencies obeying Zipf's Law. 푘 is the number of hash functions, 퐵 is the number of buckets, and 푛 is the number of distinct elements. The space complexity of all algorithms is Θ(퐵)...... 186

7.5.1 Test error in various settings...... 222 7.5.2 Performance of mixed sketches...... 222

List of Algorithms

1 A tight streaming algorithm for the (unweighted) Set Cover problem. Here, OfflineSC is an offline solver for Set Cover that provides a 휌-approximation, and 푐 is some appropriate constant...... 39
2 RecoverBit uses a protocol for (Many vs One)-Set Disjointness(푚, 푛) to recover Alice's sets ℱ_퐴 on Bob's side...... 47
3 A streaming algorithm for the Points-Shapes Set Cover problem...... 53
4 FracSetCover returns a (1 + 휀)-approximate solution of SetCover-LP...... 71
5 A generic implementation of FeasibilityTest. Its performance depends on the implementations of Oracle and UpdateProb. We will investigate different implementations of Oracle...... 75

6 HeavySetOracle computes 푝푆 of every set given the set system in a stream or stored memory, then returns the solution x that optimally places value ℓ on the corresponding entry. It reports INFEASIBLE if there is no sufficiently good solution...... 76

7 MaxCoverOracle returns a fractional ℓ-cover with weighted coverage at least 1 − 훽/3 w.h.p. if ℓ ≥ 푘. It provides no guarantee on its behavior if ℓ < 푘...... 80
8 The Prune subroutine lifts a solution in ℱ to a solution in ℱ˘ with the same MaxCover-LP objective value and width 1. The subroutine returns z, the amount by which members of ℱ˘ cover each element. The actual pruned solution x˘ may be computed but has no further use in our algorithm and thus is not returned...... 81

9 An efficient implementation of FeasibilityTest which performs 푝 passes and consumes 푂̃︀(푚푛^{푂(1/(푝휀))} + 푛) space...... 87

10 The procedure of constructing a modified instance of 퐼*...... 100
11 IterSetCover is the main procedure of the SmallSetCover algorithm for the Set Cover problem...... 115
12 OfflineSC(S, ℓ) invokes a black box that returns a cover of size at most 휌ℓ (if there exists a cover of size ℓ for S) for the Set Cover instance that is the projection of ℱ over S...... 116
13 The description of the SmallSetCover algorithm...... 117

14 A (휌+휀)-approximation algorithm for the Set Cover problem. We assume that the algorithm has access to EltOf and SetOf oracles for Set Cover(풰, ℱ), as well as |풰| and |ℱ|...... 119
15 The procedure of constructing a modified instance of 퐼*...... 131
16 A single pass streaming algorithm that, given a vector 푎 in a stream, for each contributing class 푅 returns an index 푗 along with a (1 ± (1/2))-estimate of its frequency...... 143

17 A single-pass streaming algorithm that uses an (훼, 훿, 휂)-oracle of Max 푘-Cover to compute a (1/푂̃︀(훼))-approximation of the optimal coverage size of Max 푘-Cover...... 146
18 An (훼, 훿, 휂)-oracle of Max 푘-Cover...... 150
19 An (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the number of common elements is large...... 151

20 An (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the majority of the coverage in an optimal solution is by the sets whose coverage contributions are at least a 1/(푠훼) fraction of the optimal coverage size...... 156
21 A single pass streaming algorithm that estimates the optimal coverage size of Max 푘-Cover(풰, ℱ) within a factor of 1/푂̃︀(훼) in 푂̃︀(푚/훼²) space. To return an 푂̃︀(푘/훼)-cover that achieves the computed estimate, it suffices to change the line marked by (⋆⋆) to return the corresponding 푂̃︀(푘/훼)-cover as well (see Section 5.5)...... 160
22 A single pass streaming algorithm that reports a (1/훼)-approximate solution of Max 푘-Cover(풰, ℱ) in 푂̃︀(훽 + 푘) space when the set of common elements is large...... 162

23 An (훼, 훿, 휂)-oracle of Max 푘-Cover that reports a (1/푂̃︀(훼))-approximate 푘-cover...... 164
24 An (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the majority of the coverage in an optimal solution is by the sets whose coverage contributions are at least a 1/(푠훼) fraction of the optimal coverage size. By modifying the lines marked by (⋆⋆) slightly, the algorithm returns the 푘-cover that achieves the returned coverage estimate (see Section 5.5 for more explanation)...... 172
25 A single pass streaming algorithm...... 176
26 Learning-Based Frequency Estimation...... 185

27 Rank-푘 approximation of a matrix 퐴 using a sketch matrix 푆 (refer to Section 4.1.1 of [52])...... 214
28 Differentiable implementation of SVD...... 216

Chapter 1

Introduction

In recent decades, massive datasets have arisen in numerous application areas such as genomics, finance, social networks and the Internet of Things. The main challenge of massive datasets is that traditional data-processing architectures are not capable of analyzing them efficiently. This state of affairs has led to the emergence of new computational models for data analysis. In particular, since a large fraction of datasets are generated as streams, algorithms for such data have been studied extensively in the streaming model. Formally, given a

sequence of 푁 items, the goal is to compute the value of a given function of the input items using a small number of passes and an amount of space sublinear in 푁. In this thesis, we advance the development of streaming algorithms in two new directions: (a) designing streaming algorithms for new classes of problems and (b) designing streaming algorithms using modern machine learning approaches. The streaming model has been studied since the early 80s. The focus of the early work was on processing numerical data, such as estimating heavy hitters, the number of distinct elements, and quantiles [74, 35, 13, 129]. Over the last two decades, there has been a significant body of work on designing streaming algorithms for discrete optimization and in particular graph problems [102, 143, 73]. However, until recently not much was known about the complexity of coverage problems, a well-studied class of problems in discrete optimization, in the streaming setting. In the first part of this thesis (Chapters 2-5), we provide optimal streaming algorithms for several variants of the set cover and maximum coverage problems, two

well-studied problems in discrete optimization. Since the classical algorithms have formal worst-case guarantees, they tend to work equally well on all inputs. However, in most applications, the underlying data has certain properties that, if discovered and harnessed, could enable much better performance. Inspired by the success of machine learning, in the second part of this thesis (Chapters 6-7), we design such “learning-based” algorithms for two fundamental tasks in data analysis: frequency estimation and low-rank approximation. We remark that these are the first learning-based streaming algorithms. For both of these problems, we design learning-based algorithms that, in theory and practice, achieve better performance (e.g., space complexity or approximation quality) than the existing algorithms, as long as the input stream has certain patterns. At the same time, the algorithms (usually) retain the worst-case guarantees of the existing algorithms for these problems even on adversarial inputs.

1.1. Streaming Algorithms for Coverage Problems

A fundamental class of problems in the area of combinatorial optimization involves minimizing a given cost function under a set of covering constraints. Perhaps the most well-studied problem in this family is set cover: given a collection 풮 of sets whose union is [푛], the goal is to identify the smallest sub-collection of 풮 whose union is [푛]. In the presence of massive data, new techniques are required to solve even this classic problem. The first part of this thesis is devoted to the design of efficient sublinear (space or time) algorithms for Set Cover and a closely related problem, Max 푘-Cover, here collectively called coverage problems, which have applications in many areas, including operations research, machine learning, information retrieval and data mining [86, 155, 48, 2, 115, 26, 117].

Although both Set Cover and Max 푘-Cover are NP-complete, a natural greedy algorithm, which iteratively picks the “best” remaining set, achieves a provably optimal approximation guarantee for these problems unless P = NP [72, 152, 14, 138, 63], and it is widely used in practice (a short reference implementation is sketched below). The algorithm often finds solutions that are very close to optimal. Unfortunately, due to its sequential nature, this algorithm does not scale very well to massive data sets (e.g., see Cormode et al. [56] for an experimental evaluation). This difficulty has motivated a considerable research effort whose goal is to design algorithms that are capable of handling large data efficiently on

modern architectures. Of particular interest are data stream algorithms, which compute the solution using only a small number of sequential passes over the data and a limited amount of memory.
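For concreteness, the following short Python sketch implements the classical (offline) greedy algorithm mentioned above. The instance encoding and function name are illustrative choices, not notation from this thesis.

```python
def greedy_set_cover(universe, sets):
    """Classical greedy Set Cover: repeatedly pick the set that covers the most
    yet-uncovered elements; returns the indices of the chosen sets."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # index of the set covering the most still-uncovered elements
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        cover.append(best)
        uncovered -= sets[best]
    return cover

if __name__ == "__main__":
    U = set(range(1, 8))
    F = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {6, 7}, {2, 5, 7}]
    print(greedy_set_cover(U, F))  # e.g. [0, 2, 3], an approximately optimal cover
```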

The study of the streaming Set Cover problem was initiated in [155], and since then there has been a large body of work on coverage problems in massive data analysis models, and in particular the streaming model [67, 61, 42, 97, 20, 29, 17, 105, 106, 107, 18, 123, 131, 19, 147, 5].

1.1.1. Streaming Algorithms for Set Cover

In the Set Cover problem, given a ground set of 푛 elements 풰 = {푒1, ··· , 푒푛} and a family of 푚 sets ℱ = {푆1, . . . , 푆푚} where 푚 ≥ 푛, the goal is to select a subset ℐ ⊆ ℱ such that ℐ covers 풰 and the number of sets in ℐ is as small as possible. In the streaming Set Cover problem as introduced in [155], the set of elements 풰 is stored in memory in advance; the sets 푆1, ··· , 푆푚 are stored consecutively in a read-only repository, and an algorithm can access the sets only by performing sequential scans of the repository. However, the amount of read-write memory available to the algorithm is limited, and is smaller than the input size (which could be as large as 푚푛). The objective is to design efficient approximation algorithms for the Set Cover problem that perform few passes over the data and use as little memory as possible. In Chapter 2, we present an efficient streaming algorithm for this problem, which has been proved to be optimal [17].

1.1.2. Streaming Algorithms for Fractional Set Cover

The LP relaxation of Set Cover is one of the most well-studied mathematical programs in approximation algorithm design. It is a continuous relaxation of the problem where each set 푆 ∈ ℱ can be selected “fractionally”, i.e., assigned a number 푥_푆 from [0, 1], such that for each element 푒 its “fractional coverage” ∑_{푆: 푒∈푆} 푥_푆 is at least 1, and the sum ∑_푆 푥_푆 is minimized (the full LP is written out below). To the best of our knowledge, it is not known whether there exists an efficient and accurate algorithm for this problem that uses only a logarithmic (or even a polylogarithmic) number of passes. This state of affairs is perhaps surprising, given the many recent developments on fast LP solvers [119, 174, 124, 12, 11, 167]. The only prior results on streaming Packing/Covering LPs were presented in [6], which studied the LP relaxation of Maximum Matching. In Chapter 3, we present the first (1 + 휀)-approximation algorithm for fractional Set Cover in the streaming model with a constant number of passes and a sublinear amount

25 of space.
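For reference, the covering LP discussed above can be written explicitly as follows (this is the standard formulation; Chapter 3 may use slightly different notation):

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{S \in \mathcal{F}} x_S \\
\text{subject to} \quad & \sum_{S \in \mathcal{F} :\, e \in S} x_S \ge 1 && \forall e \in \mathcal{U}, \\
                        & 0 \le x_S \le 1 && \forall S \in \mathcal{F}.
\end{align*}
```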

1.1.3. Sublinear Algorithms for Set Cover

Another natural question related to streaming Set Cover is the following: “is it possible to solve minimum set cover in sub-linear time?” This question was previously addressed in [145, 172], who showed that one can design constant running-time algorithms by simulating the greedy algorithm, under the assumption that the sets are of constant size and each element occurs in a constant number of sets. However, those constant-time algorithms have a few drawbacks: they only provide a mixed multiplicative/additive guarantee (the output cover size is guaranteed to be at most 푘 · ln 푛 + 휀푛), the dependence of their running times on the maximum set size is exponential, and they only output the (approximate) minimum set cover size, not the cover itself. From a different perspective, [119] (building on [84]) showed

that an 푂(1)-approximate solution to the fractional version of the problem can be found in 푂̃︀(푚푘² + 푛푘²) time.¹ Combining this algorithm with randomized rounding yields an 푂(log 푛)-approximate solution to Set Cover with the same complexity.

In Chapter 4, we initiate a systematic study of the complexity of sub-linear time algorithms for set cover with multiplicative approximation guarantees. Our upper bounds complement the aforementioned result of [119] by presenting algorithms which are fast when 푘 is large, as well as algorithms that provide more accurate solutions (even with a constant-factor approximation guarantee) using a sub-linear number of queries.² Equally importantly, we establish nearly matching lower bounds, some of which even hold for estimating the optimal cover size.

¹The method can be further improved to 푂̃︀(푚 + 푛푘) (N. Young, personal communication).
²Note that polynomial time algorithms with sub-logarithmic approximation are unlikely to exist.

1.1.4. Streaming Algorithms for Maximum Coverage

In Max 푘-Cover, given a ground set 풰 of 푛 elements, a family ℱ of 푚 sets (each a subset of 풰), and a parameter 푘, the goal is to select 푘 sets in ℱ whose union has the largest cardinality. Max 푘-Cover is also an important problem in submodular maximization.

The initial algorithms were developed in the set arrival model, where the input sets are listed contiguously. This restriction is natural from the perspective of submodular optimization, but limits the applicability of the algorithms.³ Avoiding this limitation can be difficult, as streaming algorithms can no longer operate on sets as “unit objects”. As a result, the first maximum coverage algorithms for the general edge arrival model, where (set, element) pairs can arrive in arbitrary order, have been developed only recently. In particular, [29] was the first to (explicitly) present a one-pass algorithm with space linear in 푚 and a constant approximation factor. We remark that many of the prior bounds (both upper and lower bounds) on the set cover and max 푘-cover problems in set-arrival streams also work in edge arrival streams (e.g., [61, 97, 20, 131, 17, 105]).

A particularly interesting line of research in set arrival streaming set cover and max 푘-cover is to design efficient algorithms that only use 푂̃︀(푛) space [155, 22, 67, 42, 131]. Previous work has shown that one can adapt the existing greedy algorithm of Max 푘-Cover to achieve a constant factor approximation in 푂̃︀(푛) space [155, 22] (later improved to 푂̃︀(푘) by [131]). However, the complexity of the problem in the “low space” regime is very different in edge-arrival streams: [29] showed that as long as the approximation factor is a constant, any algorithm must use Ω(푚) space. Still, our understanding of approximation/space trade-offs in the general case is far from complete. In Chapter 5, we provide tight bounds for the approximation/space tradeoffs of Max 푘-Cover in general edge-arrival streams. (The difference between the two arrival models is illustrated by the small example below.)
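To make the two arrival models concrete, the toy snippet below shows the same small instance encoded once as a set-arrival stream and once as an edge-arrival stream; the variable names are illustrative only and do not come from the thesis.

```python
# The same instance U = {1,...,5}, F = {S1, S2, S3} in the two streaming models.

# Set-arrival stream: each set arrives contiguously, as a whole object.
set_arrival_stream = [
    ("S1", {1, 2, 3}),
    ("S2", {3, 4}),
    ("S3", {4, 5}),
]

# Edge-arrival stream: (set, element) pairs may arrive in arbitrary order,
# so the elements of a single set can be scattered across the stream.
edge_arrival_stream = [
    ("S2", 4), ("S1", 1), ("S3", 5), ("S1", 3),
    ("S2", 3), ("S3", 4), ("S1", 2),
]
```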

1.2. Learning-Based Streaming Algorithms

Classical algorithms are best known for providing formal guarantees on their performance, but often fail to leverage useful patterns in their input data to improve their output. However, in most applications, the underlying data has certain properties that can lead to much better performance if exploited by algorithms. On the other hand, machine learning approaches (in particular, deep learning models) are highly successful at capturing and utilizing complex data patterns, but often lack formal error bounds. The last few years have witnessed a growing effort to bridge this gap and introduce algorithms that can adapt to data properties while delivering worst-case guarantees. For example, “learning-based” models have been integrated into the design of data structures [120, 137, 91], online algorithms [128, 151, 81], graph optimization [59], similarity search [156, 169] and compressive sensing [33]. This learning-based approach to algorithm design has attracted considerable attention due to its potential to significantly improve the efficiency of some of the most widely used algorithmic tasks. Indeed, many applications involve processing streams of data (e.g., videos, data logs, customer activities and social media) by executing the same algorithm on an hourly, daily or weekly basis. These data sets are typically not “random” or “worst-case”; instead, they come from some distribution with useful properties which does not change rapidly from execution to execution. This makes it possible to design better streaming algorithms tailored to the specific data distribution, trained on past instances of the problem. In the second part of this thesis, we develop the first learning-based algorithms in the streaming model, designing such algorithms for two basic problems in this area: frequency estimation (Chapter 6) and low-rank approximation (Chapter 7). For both of these problems, we design learning-based algorithms that empirically (and provably) achieve better performance (e.g., space complexity or approximation quality) compared to the existing algorithms if the input stream has certain patterns, while retaining the worst-case guarantees of the existing algorithms for these problems even on adversarial inputs.

³For example, consider a situation where the sets correspond to neighborhoods of vertices in a directed graph. Depending on the input representation, for each vertex, either the ingoing edges or the outgoing edges might be placed non-contiguously.

1.2.1. Learning-Based Algorithms for Frequency Estimation

Estimating the frequencies of elements in a data stream is one of the most fundamental subroutines in data analysis. It has applications in many areas of machine learning, including feature selection [4], ranking [66], semi-supervised learning [164] and natural language processing [82]. It has also been used for network measurements [70, 175, 127] and security [158]. Frequency estimation algorithms have been implemented in popular data processing libraries, such as Algebird at Twitter [36]. They can answer practical questions like: what are the most searched words on the Internet? or how much traffic is sent between any two machines in a network?

The frequency estimation problem is formalized as follows: given a sequence 푆 of elements from some universe 푈, for any element 푖 ∈ 푈, estimate 푓_푖, the number of times 푖 occurs in 푆. If one could store all arrivals from the stream 푆, one could sort the elements and compute

their frequencies. However, in big data applications, the stream is too large (and may be infinite) and cannot be stored. This challenge has motivated the development of streaming

algorithms, which read the elements of 푆 in a single pass and compute a good estimate of the frequencies using a limited amount of space. Over the last two decades, many such streaming algorithms have been developed, including Count-Sketch [43], Count-Min [57] and multi-stage filters [70]. The performance guarantees of these algorithms are well understood, with upper and lower bounds matching up to 푂(·) factors [110].

However, such streaming algorithms typically assume generic data and do not leverage useful patterns or properties of their input. For example, in text data, the word frequency is known to be inversely correlated with the length of the word. Analogously, in network data, certain applications tend to generate more traffic than others. If such properties can be harnessed, one could design frequency estimation algorithms that are much more efficient than the existing ones. Yet, it is important to do so in a general framework that can harness various useful properties, instead of using handcrafted methods specific to a particular pattern or structure (e.g., word length, application type). In Chapter 6, we introduce “learning-based” frequency estimation streaming algorithms.
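As a concrete baseline for Chapter 6, here is a minimal (non-learned) Count-Min sketch in Python. The salted built-in hash is a simplified stand-in for the pairwise-independent hash functions assumed in the analysis, so this is an illustrative sketch rather than the exact construction studied in the thesis.

```python
import random

class CountMin:
    """Basic Count-Min sketch: k hash rows, each with B buckets.
    Estimates never underestimate: f_i <= estimate(i)."""

    def __init__(self, k, B, seed=0):
        rng = random.Random(seed)
        self.B = B
        self.salts = [rng.getrandbits(64) for _ in range(k)]  # one salt per row
        self.tables = [[0] * B for _ in range(k)]

    def _bucket(self, salt, item):
        return hash((salt, item)) % self.B

    def update(self, item, count=1):
        for salt, table in zip(self.salts, self.tables):
            table[self._bucket(salt, item)] += count

    def estimate(self, item):
        return min(table[self._bucket(salt, item)]
                   for salt, table in zip(self.salts, self.tables))

# usage: feed the stream once, then query frequencies
cm = CountMin(k=4, B=256)
for word in ["a", "b", "a", "c", "a", "b"]:
    cm.update(word)
print(cm.estimate("a"))  # >= 3; equal to 3 unless hash collisions occur
```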

1.2.2. Learning-Based Algorithms for Low-Rank Approximation

Low-rank approximation is one of the most widely used tools in massive data analysis, machine learning and statistics, and has been a subject of many algorithmic studies. Formally,

in low-rank approximation, given an 푛 × 푑 matrix 퐴 and a parameter 푘, the goal is to compute a rank-푘 matrix

$$[A]_k = \operatorname*{argmin}_{A' :\ \mathrm{rank}(A') \le k} \|A - A'\|_F.$$

In particular, multiple algorithms developed over the last decade use the “sketching” approach; see e.g., [157, 171, 92, 52, 53, 144, 132, 34, 54]. The idea is to use efficiently computable random projections (a.k.a. “sketches”) to reduce the problem size before performing the low-rank decomposition, which makes the computation more space and time efficient. For example, [157, 52] show that if 푆 is a random matrix of size 푚 × 푛 chosen from an appropriate distribution⁴, for 푚 depending on 휀, then one can recover a rank-푘 matrix 퐴′ such that

$$\|A - A'\|_F \le (1 + \epsilon)\, \|A - [A]_k\|_F$$

by performing an SVD on 푆퐴 ∈ ℝ^{푚×푑} followed by some post-processing (a small NumPy sketch of this pipeline is given at the end of this section). Typically the sketch length 푚 is small, so the matrix 푆퐴 can be stored using little space (in the context of streaming algorithms) or efficiently communicated (in the context of distributed algorithms).

Furthermore, the SVD of 푆퐴 can be computed efficiently, especially after another round of sketching, reducing the overall computation time. In light of the success of learning-based approaches, it is natural to ask whether similar improvements in performance could be obtained for other sketch-based algorithms, notably

for low-rank decompositions. In particular, reducing the sketch length 푚 while preserving its accuracy would make sketch-based algorithms more efficient. Alternatively, one could

make sketches more accurate for the same values of 푚. In Chapter 7, we design the first “learning-based” (streaming) algorithm for the low-rank approximation problem.
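To make the sketch-and-solve pipeline above concrete, the following NumPy sketch computes a rank-푘 approximation by projecting 퐴 onto the row space of 푆퐴 and truncating. It is a simplified stand-in for Algorithm 27, with a dense Gaussian sketch used purely for illustration.

```python
import numpy as np

def sketched_rank_k(A, k, m, rng=np.random.default_rng(0)):
    """Rank-k approximation via a random sketch S (m x n): project A onto the
    row space of SA, then take the best rank-k matrix inside that subspace."""
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)           # dense Gaussian sketch
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)   # row space of SA
    V = Vt.T                                               # d x min(m, d), orthonormal columns
    AV = A @ V                                             # A restricted to that subspace
    U, s, Wt = np.linalg.svd(AV, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Wt[:k, :] @ V.T              # truncate to rank k
    return Ak

# usage: ||A - Ak||_F approaches the optimal rank-k error as m grows
A = np.random.default_rng(1).standard_normal((200, 50))
Ak = sketched_rank_k(A, k=5, m=20)
print(np.linalg.norm(A - Ak, "fro"))
```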

⁴Initial algorithms used matrices with independent sub-gaussian entries or randomized Fourier/Hadamard matrices [157, 171, 92]. Starting from the seminal work of [53], researchers began to explore sparse binary matrices, see e.g., [144, 132].

Part I

Sublinear Algorithms for Coverage Problems

Chapter 2

Streaming Set Cover

2.1. Introduction

In the streaming Set Cover problem as defined in [155], the set of elements 풰 = {푒1, ··· , 푒푛} is stored in memory in advance; the sets 푆1, ··· , 푆푚 are stored consecutively in a read-only repository, and an algorithm can access the sets only by performing sequential scans of the repository. However, the amount of read-write memory available to the algorithm is limited, and is smaller than the input size (which could be as large as 푚푛). The objective is to design an efficient approximation algorithm for the Set Cover problem that performs few passes over the data and uses as little memory as possible. The last few years have witnessed a rapid development of new streaming algorithms for the Set Cover problem, in both the theory and applied communities; see [155, 56, 122, 67, 61, 42]. Table 2.1.1 presents the approximation and space bounds achieved by those algorithms, as well as the lower bounds.¹

¹Note that the simple greedy algorithm can be implemented by either storing the whole input (in one pass), or by iteratively updating the set of yet-uncovered elements (in at most 푛 passes).

2.1.1. Related Work

The semi-streaming Set Cover problem was first studied by Saha and Getoor [155]. Their result for the Max 푘-Cover problem implies an 푂(log 푛)-pass 푂(log 푛)-approximation algorithm for the Set Cover problem that uses 푂̃︀(푛²) space. Adapting the standard greedy algorithm of Set Cover with a thresholding technique leads to an 푂(log 푛)-pass 푂(log 푛)-approximation using 푂̃︀(푛) space.

Result | Approximation | Passes | Space | Note
Greedy Algorithm | ln 푛 | 1 | 푂(푚푛) | Deterministic
Greedy Algorithm | ln 푛 | 푛 | 푂(푛) | Deterministic
[155] | 푂(log 푛) | 푂(log 푛) | 푂(푛² ln 푛) | Deterministic
[67] | 푂(√푛) | 1 | Θ̃(푛) | Randomized (tight bounds) & deterministic algorithm
[42] | 푂(푛^훿/훿) | 1/훿 − 1 | Θ̃(푛) | Randomized (tight bounds) & deterministic algorithm
[146] | (1/2) log 푛 | 푂(log 푛) | Ω̃(푚) | Randomized
[61] | 푂(4^{1/훿} 휌) | 푂(4^{1/훿}) | 푂̃︀(푚푛^훿) | Randomized
[61] | 푂(1) | 푂(log 푛) | Ω̃(푚푛) | Deterministic
Theorem 2.2.8 | 푂(휌/훿) | 2/훿 | 푂̃︀(푚푛^훿) | Randomized
Theorem 2.3.8 | 3/2 | 1 | Ω(푚푛) | Randomized
Theorem 2.5.4 | 1 | 1/(2훿) − 1 | Ω̃(푚푛^훿) | Randomized
Geometric Set Cover (Theorem 2.4.6) | 푂(휌_g) | 푂(1) | 푂̃︀(푛) | Randomized
푠-Sparse Set Cover (Theorem 2.6.6) | 1 | 1/(2훿) − 1 | Ω̃(푚푠) | Randomized

Table 2.1.1: Summary of past work and our results. Parameters 휌 and 휌_g respectively denote the approximation factor of off-line algorithms for Set Cover and its geometric variant.

In the 푂̃︀(푛)-space regime, Emek and Rosen studied one-pass streaming algorithms for the Set Cover problem [67] and gave a deterministic greedy-based 푂(√푛)-approximation for the problem. Moreover, they proved that their algorithm is tight, even for randomized algorithms. The lower/upper bound results of [67] also apply to a generalization

of the Set Cover problem, the 휀-Partial Set Cover(풰, ℱ) problem, in which the goal is to cover a (1 − 휀) fraction of the elements of 풰 while the size of the solution is compared to the size of an optimal cover of Set Cover(풰, ℱ). Very recently, Chakrabarti and Wirth extended the result of [67] and gave a trade-off streaming algorithm for the Set Cover problem in multiple passes [42]. They gave a deterministic algorithm with 푝 passes over the data stream that returns a (푝 + 1)푛^{1/(푝+1)}-approximate solution of the Set Cover problem in 푂̃︀(푛) space. Moreover, they

proved that achieving 0.99 푛^{1/(푝+1)}/(푝 + 1)² in 푝 passes using 푂̃︀(푛) space is not possible even for randomized protocols, which shows that their algorithm is tight up to a factor of (푝 + 1)³. Their result also works for the 휀-Partial Set Cover problem.

In a different regime, first studied by Demaine et al., the goal is to design “low”-approximation algorithms (depending on the computational model, this could mean 푂(log 푛) or 푂(1)) in the smallest possible space [61]. They proved that any constant-pass deterministic (log 푛/2)-approximation algorithm for the Set Cover problem requires Ω̃(푚푛) space. This shows that, unlike the results in the 푂̃︀(푛)-space regime, to obtain a sublinear "low"-approximation streaming algorithm for the Set Cover problem in a constant number of passes, randomness is necessary. Moreover, [61] presented an 푂(4^{1/훿})-approximation algorithm that makes 푂(4^{1/훿}) passes and uses 푂̃︀(푚푛^훿) memory space.

The Set Cover problem is not polynomially solvable even in restricted instances with points in ℝ² as elements and geometric objects (all disks, axis-parallel rectangles, or fat triangles) in the plane as sets [71, 75, 99]. As a result, there has been a large body of

work on designing approximation algorithms for geometric Set Cover problems; see for example [142, 3, 15, 51] and references therein.

2.1.2. Our Results

Despite the progress outlined above, some basic questions remained open. In particular:

(A) Is it possible to design a single-pass streaming algorithm with a “low” approximation factor² that uses sublinear (i.e., 표(푚푛)) space?

(B) If such single pass algorithms are not possible, what are the achievable trade-offs between the number of passes and space usage?

(C) Are there special instances of the problem for which more efficient algorithms can be designed?

²Note that the lower bound in [61] excluded this possibility only for deterministic algorithms, while the upper bounds in [67, 42] suffered from a polynomial approximation factor.

In this chapter, we make significant progress on each of these questions. Our upper and lower bounds are depicted in Table 2.1.1.

On the algorithmic side, we give an 푂(1/훿)-pass algorithm with strongly sub-linear 푂̃︀(푚푛^훿) space and a logarithmic approximation factor. This yields a significant improvement over the earlier algorithm of Demaine et al. [61], which used an exponentially larger number of passes. The trade-off offered by our algorithm matches the lower bound of Nisan [146] that holds at the endpoint of the trade-off curve, i.e., for 훿 = Θ(1/ log 푛), up to poly-logarithmic factors in space³. Furthermore, our algorithm is very simple and succinct, and therefore easy to implement and deploy. Our algorithm exhibits a natural tradeoff between the number of passes and space, which resembles tradeoffs achieved for other problems [87, 88, 90]. It is thus natural to conjecture that this tradeoff might be tight, at least for “low enough” approximation factors. We present a first step in this direction by showing a lower bound for the case when the approximation factor is equal to 1, i.e., the goal is to compute an optimal set cover. In particular, via an information-theoretic lower bound, we show that any streaming algorithm that computes set cover in (1/(2훿) − 1) passes must use Ω̃(푚푛^훿) space (even assuming exponential computational power) in the regime 푚 = 푂(푛). Furthermore, we show that a stronger lower bound holds if all the input sets are sparse, that is, if their cardinality is at most 푠. We prove a lower bound of Ω̃(푚푠) for 푠 = 푂(푛^훿) and 푚 = 푂(푛). We also consider the problem in the geometric setting in which the elements are points in

ℝ² and the sets are either discs, axis-parallel rectangles, or fat triangles in the plane. We show that a slightly modified version of our algorithm achieves the optimal 푂̃︀(푛) space to find an 푂(휌)-approximation in 푂(1) passes. Finally, we show that any randomized one-pass algorithm that distinguishes between covers of size 2 and 3 must use a linear (i.e., Ω(푚푛)) amount of space. This is the first result showing that a randomized, approximate algorithm cannot achieve a sub-linear space bound.

Recently, Assadi et al. [20] generalized this lower bound to any approximation ratio 훼 = 푂(√푛). More precisely, they showed that approximating Set Cover within any factor 훼 = 푂(√푛) in a single pass requires Ω(푚푛/훼) space.

³Note that to achieve a logarithmic approximation ratio we can use an off-line algorithm with approximation ratio 휌 = 1, i.e., one that runs in exponential time (see Theorem 2.2.8).

Our techniques: basic idea. Our algorithm is based on the idea that whenever a large enough set is encountered, we can immediately add it to the cover. Specifically, we guess

(up to a factor of two) the size of the optimal cover 푘. Thus, a set is “large” if it covers at least a 1/푘 fraction of the remaining elements. A small set, on the other hand, can cover only a “few” elements, and we can store (approximately) which elements it covers by storing (in memory) an appropriate random sample. At the end of the pass, we have (in memory) the projections of “small” sets onto the random sample, and we compute the optimal set cover for this projected instance using an offline solver. By carefully choosing the size of the random sample, this guarantees that only a small fraction of the set system remains uncovered. The algorithm then makes an additional pass to find the residual set system (i.e., the yet-uncovered elements), making two passes in each iteration, and continuing to the next iteration.

Thus, one can think about the algorithm as being based on a simple iterative “dimensionality reduction” approach. Specifically, in two passes over the data, the algorithm selects a “small” number of sets that cover all but an 푛^{−훿} fraction of the uncovered elements, while using only 푂̃︀(푚푛^훿) space. By performing the reduction step 1/훿 times we obtain a complete cover. The dimensionality reduction step is implemented by computing a small cover for a random subset of the elements, which also covers the vast majority of the elements in the ground set. This ensures that the remaining sets, when restricted to the random subset of the elements, occupy only 푂̃︀(푚푛^훿) space. As a result, the procedure avoids the complex set of recursive calls used in Demaine et al. [61], which leads to a simpler and more efficient algorithm.

Geometric results. Further using techniques and results from computational geometry we show how to modify our algorithm so that it achieves almost optimal bounds for the

Set Cover problem on geometric instances. In particular, we show that it gives an 푂(1)-pass 푂(휌)-approximation algorithm using 푂̃︀(푛) space when the elements are points in ℝ² and the sets are either discs, axis-parallel rectangles, or fat triangles in the plane. In particular, we

use the following surprising property of the set systems that arise out of points and disks: the number of sets is nearly linear as long as one considers only sets that contain “a few” points. More surprisingly, this property extends, with a twist, to certain geometric range spaces that might have a quadratic number of shallow ranges. Indeed, it is easy to show an example of 푛 points in the plane where there are Ω(푛²) distinct rectangles, each one containing exactly two points; see Figure 2.1.1.

Figure 2.1.1: A collection of 푛²/4 distinct rectangles (i.e., sets), each containing two points.

However, one can “split” such ranges into a small number of canonical sets, such that the number of shallow sets in the canonical set system is near linear. This enables us to store the small canonical sets encountered during the scan explicitly in memory, and still use only near-linear space. We note that the idea of splitting ranges into small canonical ranges is an old idea in

orthogonal range searching. It was used by Aronov et al. [15] for computing small 휀-nets for these range spaces. The idea, in the form we use, was further formalized by Ene et al. [68].

Lower bounds. The lower bounds for multi-pass algorithms for the Set Cover problem are obtained via a careful reduction from Intersection Set Chasing. The latter is a communication complexity problem in which 푛 players need to solve a certain “set-disjointness-like” problem in rounds. A recent paper [90] showed that this problem requires 푛^{1+Ω(1/푝)}/푝^{푂(1)} bits of communication for 푝 rounds. This yields our desired trade-off of Ω̃(푚푛^훿) space in 1/(2훿) passes for exact protocols for Set Cover in the communication model, and hence in the streaming model, for 푚 = 푂(푛). Furthermore, we show a stronger lower bound on the memory space of sparse instances of Set Cover in which all input sets have cardinality at most 푠. By a reduction from a variant of Equal Pointer Chasing, which maps the problem to a sparse instance of Set Cover, we show that in order to have an exact streaming algorithm for 푠-Sparse Set Cover with 표(푚푠) space, Ω(log 푛) passes are necessary. More precisely, any (1/(2훿) − 1)-pass exact randomized algorithm for 푠-Sparse Set Cover requires Ω̃(푚푠) memory space, if 푠 ≤ 푛^훿 and 푚 = 푂(푛).

Our single-pass lower bound proceeds by showing a lower bound for a one-way communication complexity problem in which one party (Alice) has a collection of sets, and the other party (Bob) needs to determine whether the complement of his set is covered by one of Alice's sets. We show that if Alice's sets are chosen at random, then Bob can decode Alice's input by employing a small collection of “query” sets. This implies that the amount of communication needed to solve the problem is linear in the description size of Alice's sets, which is Ω(푚푛).

Algorithm 1 A tight streaming algorithm for the (unweighted) Set Cover problem. Here, OfflineSC is an offline solver for Set Cover that provides a 휌-approximation, and 푐 is some appropriate constant.
1: procedure IterSetCover((풰, ℱ), 훿)
2:    ◁ try (in parallel) all possible (2-approximate) sizes of the optimal cover
3:    for 푘 ∈ {2^푖 | 0 ≤ 푖 ≤ log 푛} in parallel do    ◁ 푛 = |풰|
4:        sol ← ∅
5:        for 푖 = 1 to 1/훿 do
6:            let S be a sample of 풰 of size 푐휌푘푛^훿 log 푚 log 푛
7:            L ← S, ℱ_S ← ∅
8:            for 푆 ∈ ℱ do    ◁ by doing one pass
9:                if |L ∩ 푆| ≥ |S|/푘 then    ◁ size test
10:                   sol ← sol ∪ {푆}
11:                   L ← L ∖ 푆
12:               else
13:                   ℱ_S ← ℱ_S ∪ {푆 ∩ L}    ◁ store the set 푆 ∩ L explicitly in memory
14:           풟 ← OfflineSC(L, ℱ_S, 푘)
15:           sol ← sol ∪ 풟
16:           풰 ← 풰 ∖ ⋃_{푆 ∈ sol} 푆    ◁ by doing an additional pass over the data
17:       return best sol computed in all parallel executions.
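As a rough executable companion to Algorithm 1, the Python sketch below mimics its structure for a single guess of 푘. It is illustrative only: it reads an in-memory list of sets rather than a read-only stream, and it uses a simple greedy routine in place of OfflineSC.

```python
import math
import random

def greedy_cover(universe, indexed_sets):
    """Stand-in for OfflineSC: greedily cover `universe` using (index, set) pairs."""
    uncovered, chosen = set(universe), []
    while uncovered and indexed_sets:
        i, s = max(indexed_sets, key=lambda t: len(t[1] & uncovered))
        if not s & uncovered:
            break  # no remaining set makes progress
        chosen.append(i)
        uncovered -= s
    return chosen

def iter_set_cover(U, F, delta, k, c=4, rho=1.0):
    """One parallel branch of Algorithm 1, for a fixed guess k of the optimum cover size."""
    U, sol = set(U), set()
    n = len(U)
    for _ in range(max(1, round(1 / delta))):
        if not U:
            break
        sample_size = min(len(U), int(c * rho * k * n ** delta
                                      * math.log(len(F) + 2) * math.log(n + 2)))
        S = set(random.sample(sorted(U), sample_size))  # sampled uncovered elements
        L, projected = set(S), []
        for i, s in enumerate(F):                 # "first pass" over the sets
            if len(s & L) >= len(S) / k:          # size test: heavy set, pick it right away
                sol.add(i)
                L -= s
            else:
                projected.append((i, s & L))      # keep only the projection onto the sample
        sol.update(greedy_cover(L, projected))    # offline solve on the projected instance
        for i in sol:                             # "second pass": recompute uncovered elements
            U -= F[i]
    return sol
```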

2.2. Streaming Algorithm for Set Cover

In this section, we design an efficient streaming algorithm for the Set Cover problem that matches the known lower bounds for the problem. In the Set Cover problem, for a given set system (풰, ℱ), the goal is to find a subset ℐ ⊆ ℱ such that ℐ covers 풰 and its cardinality is minimum. In Algorithm 1, we sketch the IterSetCover algorithm.

In the IterSetCover algorithm, we have access to the OfflineSC subroutine, which solves the given Set Cover instance offline (using linear space) and returns a 휌-approximate solution, where 휌 could be anywhere between 1 and Θ(log 푛) depending on the computational model one assumes. With exponential computational power, we can achieve the optimal cover of the given Set Cover instance (휌 = 1); however, with polynomial computational power and under the P ≠ NP assumption, 휌 cannot be better than 푐 · ln 푛 where 푐 is a constant [72, 152, 14, 138, 63].

( , ). To this end, the algorithm , in parallel, all values 푘 in 2푖 0 푖 log 푛 . This 풰 ℱ { | ≤ ≤ } step will only increase the memory space requirement by a factor of log 푛. Consider the run of the IterSetCover algorithm, in which the guess 푘 is correct (i.e., OPT 푘 < 2 OPT , where OPT is an optimal solution). The idea is to go through 푂(1/훿) | | ≤ | | iterations such that each iteration only makes two passes and at the end of each iteration the number of uncovered elements reduces by a factor of 푛훿. Moreover, the algorithm is allowed to use 푂̃︀(푚푛훿) space. In each iteration, the algorithm starts with the current ground set of uncovered elements

, and copies it to a leftover set L. Let S be a large enough uniform sample of elements 풰 . In a single pass, using S, we estimate the size of all large sets in and add 푆 to 풰 ℱ ∈ ℱ the solution sol immediately (thus avoiding the need to store it in memory). Formally, if 푆 covers at least Ω( /푘) yet-uncovered elements of L then it is a heavy set, and the algorithm |풰| immediately adds it to the output cover. Otherwise, if a set is small, i.e., its covers less than

/푘 uncovered elements of L, the algorithm stores the set 푆 in memory. Fortunately, it is |풰| enough to store its projection over the sampled elements explicitly (i.e., 푆 L) – this requires ∩ remembering only the 푂( S /푘) indices of the elements of 푆 L. | | ∩ In order to show that a solution of the Set Cover problem over the sampled elements is a good cover of the initial Set Cover instance, we apply the relative (푝, 휀)-approximation sampling result of [100] (see Definition 2.2.4) and it is enough for S to be of size 푂̃︀(휌푘푛훿). Using relative (푝, 휀)-approximation sampling, we show that after two passes the number of

40 uncovered elements is reduced by a factor of 푛훿. Note that the relative (푝, 휀)-approximation sampling improves over the Element Sampling technique used in [61] with respect to the number of passes.

Since in each iteration we pick 푂(휌푘) sets and the number of uncovered elements decreases by a factor of 푛훿, after 1/훿 iterations the algorithm picks 푂(휌푘/훿) sets and covers all elements. Moreover, the memory space of the whole algorithm is 푂̃︀(휌푚푛훿) (see Lemma 2.2.2).

2.2.1. Analysis of IterSetCover

In the rest of this section we prove that the IterSetCover algorithm with high probabil- ity returns a 푂(휌/훿)-approximate solution of Set Cover( , ) in 2/훿 passes using 푂̃︀(푚푛훿) 풰 ℱ memory space.

Lemma 2.2.1. The number of passes the IterSetCover algorithm makes is 2/훿.

Proof: In each of the 1/훿 iterations of the IterSetCover algorithm, the algorithm makes two passes. In the first pass, based on the set of sampled elements S, it decides whether to pick a set or keep its projection over S (i.e., 푆 L) in the memory. Then the algorithm calls ∩ OfflineSC which does not require any passes over . The second pass is for computing ℱ the set of uncovered elements at the end of the iteration. We need this pass because we only know the projection of the sets we picked in the current iteration over S and not over the original set of uncovered elements. Thus, in total we make 2/훿 passes. Also note that for different guesses for the value of 푘, we run the algorithm in parallel and hence the total number of passes remains 2/훿. 

Lemma 2.2.2. The memory space used by the IterSetCover algorithm is 푂̃︀(푚푛훿).

Proof: In each iteration of the algorithm, it picks during the first pass at most 푚 sets (more precisely at most 푘 sets) which requires 푂(푚 log 푚) memory. Moreover, in the first pass we keep the projection of the sets whose projection over the uncovered sampled elements has

size at most S /푘. Since there are at most 푚 such sets, the total required space for storing | | the projections is bounded by 푂(︀휌푚푛훿 log 푚 log 푛)︀. Since in the second pass the algorithm only updates the set of uncovered elements, the amount of space required in the second pass is 푂(푛). Thus, the total required space to

41 perform each iteration of the IterSetCover algorithm is 푂̃︀(푚푛훿). Moreover, note that the algorithm does not need to keep the memory space used by the earlier iterations; thus,

훿 the total space consumed by the algorithm is 푂̃︀(푚푛 ).  Next we show the sets we picked before calling OfflineSC has large size on . 풰 Lemma 2.2.3. With probability at least 1 푚−푐 all sets that pass the “size test” in the − IterSetCover algorithm have size at least /푐푘. |풰| Proof: Let 푆 be a set of size less than /푐푘. In expectation, 푆 S is less than ( /푐푘) |풰| | ∩ | |풰| · ( S / ) = 휌푛훿 log 푚 log 푛. By Chernoff bound for large enough 푐, | | |풰|

Pr( 푆 S 푐휌푛훿 log 푚 log 푛) 푚−(푐+1). | ∩ | ≥ ≤

Applying the union bound, with probability at least 1 푚−푐, all sets passing “size test” have − size at least /(푐푘). |풰|  In what follows we define the relative (푝, 휀)-approximation sample of a set system and men- tion the result of Har-Peled and Sharir [100] on the minimum required number of sampled elements to get a relative (푝, 휀)-approximation of the given set system.

Definition 2.2.4. Let (V, ) be a set system, i.e., V is a set of elements and 2V is a ℋ ℋ ⊆ family of subsets of the ground set V. For given parameters 0 < 휀, 푝 < 1, a subset 푍 V is ⊆ a relative (푝, 휀)-approximation for (V, ), if for each 푆 , we have that if 푆 푝 V then ℋ ∈ ℋ | | ≥ | | 푆 푆 푍 푆 (1 휀)| | | ∩ | (1 + 휀)| |. − V ≤ 푍 ≤ V | | | | | | If the range is light (i.e., 푆 < 푝 V ) then it is required that | | | | 푆 푆 푍 푆 | | 휀푝 | ∩ | | | + 휀푝. V − ≤ 푍 ≤ V | | | | | | Namely, 푍 is (1 휀)-multiplicative good estimator for the size of ranges that are at least ± 푝-fraction of the ground set. The following lemma is a simplified variant of a result in Har-Peled and Sharir [100]–

indeed, a set system with 푀 sets, can have VC dimension at most log 푀. This simplified form

42 also follows by a somewhat careful but straightforward application of Chernoff’s inequality.

Lemma 2.2.5. Let ( , ) be a finite set system, and 푝, 휀, 푞 be parameters. Then, a random 풰 ℱ ′ (︁ )︁ sample of such that = 푐 log log 1 + log 1 , for an absolute constant 푐′ is a relative 풰 |풰| 휀2푝 |ℱ| 푝 푞 (푝, 휀)-approximation, for all ranges in , with probability at least (1 푞). ℱ − Lemma 2.2.6. Assuming OPT 푘 2 OPT , after any iteration, with probability at | | ≤ ≤ | | least 1 푚1−푐/4 the number of uncovered elements decreases by a factor of 푛훿, and this − iteration adds 푂(휌 OPT ) sets to the output cover. | | Proof: Let V be the set of uncovered elements at the beginning of the iteration and ⊆ 풰 note that the total number of sets that is picked during the iteration is at most (1 + 휌)푘 (see Lemma 2.2.3). Consider all possible such covers, that is = ′ ′ (1 + 휌)푘 , 풢 {ℱ ⊆ ℱ| |ℱ | ≤ } and observe that 푚(1+휌)푘. Let be the collection that contains all possible sets |풢| ≤ ℋ {︀ ⋃︀ ⃒ }︀ of uncovered elements at the end of the iteration, defined as = V 푆 ⃒ . ℋ ∖ 푆∈풞 풞 ∈ 풢 Moreover, set 푝 = 2/푛훿, 휀 = 1/2 and 푞 = 푚−푐 and note that 푚(1+휌)푘. Since |ℋ| ≤ |풢| ≤ ′ 푐 (log log 1 + log 1 ) 푐휌푘푛훿 log 푚 log 푛 = S for large enough 푐, by Lemma 2.2.5, S is a 휀2푝 |ℋ| 푝 푞 ≤ | | relative (푝, 휀)-approximation of (V, ) with (1 푞) probability. Let be the collection ℋ − 풟 ⊆ ℱ of sets picked during the iteration which covers all elements in S. Since S is a relative (푝, 휀)- approximation sample of (V, ) with probability at least 1 푚−푐, the number of uncovered ℋ − elements of V (or ) by is at most 휀푝 V = /푛훿. 풰 풟 | | |풰| Hence, in each iteration we pick 푂(휌푘) sets and at the end of iteration the number of 훿 uncovered elements reduces by 푛 .  Lemma 2.2.7. The IterSetCover algorithm computes a set cover of ( , ), whose size 풰 ℱ is within a 푂(휌/훿) factor of the size of an optimal cover with probability at least 1 푚1−푐/4. − Proof: Consider the run of IterSetCover for which the value of 푘 is between OPT and | | 2 OPT . In each of the (1/훿) iterations made by the algorithm, by Lemma 2.2.6, the number | | of uncovered elements decreases by a factor of 푛훿 where 푛 is the number of initial elements to be covered by the sets. Moreover, the number of sets picked in each iteration is 푂(휌푘). Thus after (1/훿) iterations, all elements would be covered and the total number of sets in the solution is 푂(휌 OPT /훿). Moreover by Lemma 2.2.6, the success probability of all the | | 1 푐 −1 iterations, is at least 1 1 (1/푚) 4 . − 훿푚푐/4 ≥ − 

43 Theorem 2.2.8. The IterSetCover( , , 훿) algorithm makes 2/훿 passes, uses 푂̃︀(푚푛훿) 풰 ℱ memory space, and finds a 푂(휌/훿)-approximate solution of the Set Cover problem with high probability. Furthermore, given enough number of passes the IterSetCover algorithm matches

the known lower bound on the memory space of the streaming Set Cover problem up to a polylog(푚) factor where 푚 is the number of sets in the input. Proof: The first part of the proof implied by Lemma 2.2.1, Lemma 2.2.2, and Lemma 2.2.7. As for the lower bound, note that by a result of Nisan [146], any randomized ( log 푛 )- 2 approximation protocol for Set Cover( , ) in the one-way communication model requires 풰 ℱ Ω(푚) bits of communication, no matter how many number of rounds it makes. This implies that any randomized 푂(log 푛)-pass, ( log 푛 )-approximation algorithm for Set Cover( , ) re- 2 풰 ℱ quires Ω(̃︀ 푚) space, even under the exponential computational power assumption. By the above, the IterSetCover algorithm makes 푂(1/훿) passes and uses 푂̃︀(푚푛훿) space to return a 1 -approximate solution under the exponential computational power 푂( 훿 ) assumption ( ). Thus by letting , we will have a log 푛 -approximation 휌 = 1 훿 = 푐/ log 푛 ( 2 ) streaming algorithm using 푂̃︀(푚) space which is optimal up to a factor of polylog(푚).  Theorem 2.2.8 provides a strong indication that our trade-off algorithm is optimal.

2.3. Lower Bound for Single Pass Algorithms

In this section, we study the Set Cover problem in the two-party communication model and give a tight lower bound on the communication complexity of the randomized protocols

solving the problem in a single round. In the two-party Set Cover, we are given a set of elements and there are two players Alice and Bob where each of them has a collection of 풰 subsets of , 퐴 and 퐵. The goal for them is to find a minimum size cover 퐴 퐵 풰 ℱ ℱ 풞 ⊆ ℱ ∪ ℱ covering while communicating the fewest number of bits from Alice to Bob (In this model 풰 Alice communicates to Bob and then Bob should report a solution).

Our main lower bound result for the single pass protocols for Set Cover is the following theorem which implies that the naive approach in which one party sends all of its sets to the the other one is optimal.

44 Theorem 2.3.1. Any single round randomized protocol that approximates Set Cover( , ) 풰 ℱ within a factor better than 3/2 and error probability 푂(푚−푐) requires Ω(푚푛) bits of commu- nication where 푛 = and 푚 = and 푐 is a sufficiently large constant. |풰| |ℱ| We consider the case in which the parties want to decide whether there exists a cover of size 2 for in 퐴 퐵 or not. If any of the parties has a cover of size at most 2 for , then 풰 ℱ ∪ ℱ 풰 it becomes trivial. Thus the question is whether there exist 푆푎 퐴 and 푆푏 퐵 such that ∈ ℱ ∈ ℱ 푆푎 푆푏. 풰 ⊆ ∪ A key observation is that to decide whether there exist 푆푎 퐴 and 푆푏 퐵 such that ∈ ℱ ∈ ℱ 푆푎 푆푎, one can instead check whether there exists 푆푎 퐴 and 푆푏 퐵 such that 풰 ⊆ ∪ ∈ ℱ ∈ ℱ 푆푎 푆푏 = . In other words we need to solve OR of a series of two-party Set Disjointness ∩ ∅ problems. In two-party Set Disjointness problem, Alice and Bob are given subsets of , 푆푎 풰 and 푆푏 and the goal is to decide whether 푆푎 푆푏 is empty or not with the fewest possible ∩ bits of communication. Set Disjointness is a well-studied problem in the communication complexity and it has been shown that any randomized protocol for Set Disjointness with 푂(1) error probability requires Ω(푛) bits of communication where 푛 = [27, 111, 153]. |풰| We can think of the following extensions of the Set Disjointness problem.

I. Many vs One: in this variant, Alice has 푚 subsets of , 퐴 and Bob is given a single set 풰 ℱ 푆푏. The goal is to determine whether there exists a set 푆푎 퐴 such that 푆푎 푆푏 = . ∈ ℱ ∩ ∅ II. Many vs Many: in this variant, each of Alice and Bob are given a collection of subsets

of and the goal for them is to determine whether there exist 푆푎 퐴 and 푆푏 퐵 풰 ∈ ℱ ∈ ℱ such that 푆푎 푆푏 = . ∩ ∅

Note that deciding whether two-party Set Cover has a cover of size 2 is equivalent to solving the (Many vs Many)-Set Disjointness problem. Moreover, any lower bound for (Many vs One)-Set Disjointness clearly implies the same lower bound for the (Many vs Many)-Set Disjointness problem. In the following theorem we show that any single-round randomized protocol that solves (Many vs One)-Set Disjointness(푚, 푛) with 푂(푚−푐) error probability requires Ω(푚푛) bits of communication.

Theorem 2.3.2. Any randomized protocol for (Many vs One)-Set Disjointness(푚, 푛) with

45 −푐 error probability that is 푂(푚 ) requires Ω(푚푛) bits of communication if 푛 푐1 log 푚 where ≥ 푐 and 푐1 are large enough constants. The idea is to show that if there exists a single-round randomized protocol for the problem with 표(푚푛) bits of communication and error probability 푂(푚−푐), then with constant proba- bility one can distinguish Ω(2푚푛) distinct inputs using 표(푚푛) bits which is a contradiction. Suppose that Alice has a collection of 푚 uniformly and independently random subsets of (in each of her subsets the probability that 푒 is in the subset is 1/2). Lets assume that 풰 ∈ 풰 there exists a single round protocol I for (Many vs One)-Set Disjointness(푛, 푚) with error probability 푂(푚−푐) using 표(푚푛) bits of communication. Let ExistsDisj be Bob’s algorithm in protocol I. Then we show that one can recover 푚푛 random bits with constant probability using ExistsDisj subroutine and the message 푠 sent by the first party in protocol I. The RecoverBit which is shown in Algorithm2, is the algorithm to recover random bits using protocol I and ExistsDisj. To this end, Bob gets the message 푠 communicated by protocol I from Alice and considers all subsets of size 푐1 log 푚 and 푐1 log 푚 + 1 of . Note that 푠 is communicated only once and 풰 thus the same 푠 is used for all queries that Bob makes. Then at each step Bob picks a random subset 푆푏 of size 푐1 log 푚 of and solve the (Many vs One)-Set Disjointness problem with 풰 input ( 퐴, 푆푏) by running ExistsDisj(푠, 푆푏). Next we show that if 푆푏 is disjoint from a set ℱ in 퐴, then with high probability there is exactly one set in 퐴 which is disjoint from 푆푏 ℱ ℱ (see Lemma 2.3.3). Thus once Bob finds out that his query, 푆푏, is disjoint from a set in 퐴, ℱ + he can query all sets 푆 푆푏 푒 푒 푆푏 and recover the set (or union of sets) in 퐴 푏 ∈ { ∪ | ∈ 풰 ∖ } ℱ that is disjoint from 푆푏. By a simple pruning step we can detect the ones that are union of more than one set in 퐴 and only keep the sets in 퐴. ℱ ℱ In Lemma 2.3.6, we show that the number of queries that Bob is required to make to

푐 recover 퐴 is 푂(푚 ) where 푐 is a constant. ℱ

Lemma 2.3.3. Let 푆푏 be a random subset of of size 푐 log 푚 and let 퐴 be a collection 풰 ℱ of m random subsets of . The probability that there exists exactly one set in 퐴 that is 풰 ℱ disjoint from is at least 1 . 푆푏 푚푐+1

46 Algorithm 2 RecoverBit uses a protocol for (Many vs One)-Set Disjointness(푚, 푛) to recover Alice’s sets, 퐴 in Bob’s side. ℱ 1: procedure RecoverBit( , 푠) 풰 2: 푎 3: forℱ ←푖 = ∅ 1 to 푚푐 log 푚 do 4: let 푆푏 be a random subset of of size 푐1 log 푚 풰 5: if ExistsDisj(푠, 푆푏) = true then 6: ◁ discovering the set (or union of sets) 7: ◁ in 퐴 disjoint from 푆푏 8: 푆 ℱ ← ∅ 9: for 푒 푆푏 do ∈ 풰 ∖ 10: if ExistsDisj(푆푏 푒, 푠) = false then 11: 푆 푆 푒 ∪ ′ ← ∪ ′ 12: if 푆 푎 s.t. 푆 푆 then ◁ pruning step ∃ ∈ ℱ ′ ⊂ 13: 푎 푎 푆 ℱ ← ℱ ∖ { } 14: 푎 푎 푆 15: elseℱ if← ℱ′ ∪ { s.t.} ′ then @푆 푎 푆 푆 ∈ ℱ ⊂ 16: 푎 푎 푆 ℱ ← ℱ ∪ { } 17: return 푎 ℱ Proof: The probability that 푆푏 is disjoint from exactly one set in 퐴 is ℱ

Pr(푆푏 is disjoint from 1 set in 퐴) Pr(푆푏 is disjoint from 2 sets in 퐴) ≥ ℱ − ≥ ℱ 1 (︂푚)︂ 1 1 ( )푐 log 푚 ( )2푐 log 푚 . ≥ 2 − 2 2 ≥ 푚푐+1

First we prove the first term in the above inequality. For an arbitrary set 푆 퐴, since any ∈ ℱ element is contained in with probability 1 , the probability that is disjoint from is 푆 2 푆 푆푏 (1/2)푐 log 푚.

−푐 log 푚 Pr(푆푏 is disjoint from at least one set in 퐴) 2 . ℱ ≥

(︀푚)︀ Moreover since there exist pairs of sets in 퐴, and for each 푆1, 푆2 퐴, the probability 2 ℱ ∈ ℱ −2푐 that 푆1 and 푆2 are disjoint from 푆푏 is 푚 ,

−(2푐−2) Pr(푆푏 is disjoint from at least two sets in 퐴) 푚 . ℱ ≤ 

A family of sets is called intersecting if and only if for any sets 퐴, 퐵 either both ℳ ∈ ℳ

47 퐴 퐵 and 퐵 퐴 are non-empty or both 퐴 퐵 and 퐵 퐴 are empty; in other words, there ∖ ∖ ∖ ∖ exists no 퐴, 퐵 such that 퐴 퐵. Let 퐴 be a collection of subsets of . We show ∈ ℳ ⊆ ℱ 풰 that with high probability after testing 푂(푚푐) queries for sufficiently large constant 푐, the

RecoverBit algorithm recovers 퐴 completely if 퐴 is intersecting. First we show that ℱ ℱ with high probability the collection 퐴 is intersecting. ℱ

Observation 2.3.4. Let 퐴 be a collection of 푚 uniformly random subsets of where ℱ 풰 −푐/4+2 푐 log 푚. With probability at least 1 푚 , 퐴 is an intersecting family. |풰| ≥ − ℱ 3 푛 Proof: The probability that 푆1 푆2 is ( ) and there are at most 푚(푚 1) pairs of sets in ⊆ 4 − 2 3 푛 푐 −2 퐴. Thus with probability at least 1 푚 ( ) 1 1/푚 4 , 퐴 is intersecting. ℱ − 4 ≥ − ℱ  Observation 2.3.5. The number of distinct inputs of Alice (collections of random subsets of ), that is distinguishable by RecoverBit is Ω(2푚푛). 풰 Proof: There are 2푚푛 collections of 푚 random subsets of . By Observation 2.3.4, Ω(2푚푛) 풰 of them are intersecting. Since we can only recover the sets in the input collection and not their order, the distinct number of input collection that are distinguished by RecoverBit

푚푛 is Ω( 2 ) which is Ω(2푚푛) for 푛 푐 log 푚. 푚! ≥ 

By Observation 2.3.4 and only considering the case such that 퐴 is intersecting, we have the ℱ following lemma.

Lemma 2.3.6. Let 퐴 be a collection of 푚 uniformly random subsets of and suppose ℱ 풰 푐 that 푐 log 푚. After testing at most 푚푐 queries, with probability at least (1 1 )푝푚 , |풰| ≥ − 푚 퐴 is fully recovered, where 푝 is the success rate of protocol I for the (Many vs One)-Set ℱ Disjointness problem.

Proof: By Lemma 2.3.3, for each 푆푏 of size 푐1 log 푚 the probability that 푆푏 is disjoint ⊂ 풰 푐1+1 from exactly one set in a random collection of sets 퐴 is at least 1/푚 . Given 푆푏 is disjoint ℱ from exactly one set in 퐴, due to symmetry of the problem, the chance that 푆푏 is disjoint ℱ from a specific set is at least 1 . After 푐1+2 queries where is a large 푆 퐴 푐 +2 훼푚 log 푚 훼 ∈ ℱ 푚 1 enough constant, for any 푆 퐴, the probability that there is not a query 푆푏 that is only ∈ ℱ 1 훼푚푐1+2 log 푚 −훼 log 푚 1 disjoint from 푆 is at most (1 푐 +2 ) 푒 = . − 푚 1 ≤ 푚훼 Thus after trying 훼푚푐1+2 log 푚 queries, with probability at least (1 1 ) (1 1 ), − 2푚훼−1 ≥ − 푚

48 for each 푆 퐴 we have at least one query that is only disjoint from 푆 (and not any other ∈ ℱ sets in 퐴 푆). ℱ ∖ Once we have a query subset 푆푏 which is only disjoint from a single set 푆 퐴, we can ∈ ℱ ask 푛 푐 log 푚 queries of size 푐1 log 푚+1 and recover 푆. Note that if 푆푏 is disjoint from more − than one sets in 퐴 simultaneously, the process (asking 푛 푐 log 푚 queries of size 푐1 log 푚+1) ℱ − will end up in recovering the union of those sets. Since 퐴 is an intersecting family with ℱ high probability (Observation 2.3.4), by pruning step in the RecoverBit algorithm we are

guaranteed that at the end of the algorithm, what we returned is exactly 퐴. Moreover the ℱ total number of queries the algorithm makes is at most

푛 (훼푚푐1+2 log 푚) 훼푚푐1+3 log 푚 푚푐 × ≤ ≤

for 푐 푐1 + 4. ≥ 푐 1 푚푐 Thus after testing 푚 queries, 퐴 will be recovered with probability at least (1 )푝 ℱ − 푚 where 푝 is the success probability of the protocol I for (Many vs One)-Set Disjointness(푚, 푛). Corollary 2.3.7. Let I be a protocol for (Many vs One)-Set Disjointness(푚, 푛) with error probability 푂(푚−푐) and 푠 bits of communication such that 푛 푐 log 푚 for large enough 푐. ≥ Then RecoverBit recovers 퐴 with constant success probability using 푠 bits of communi- ℱ cation.

By Observation 2.3.5, since RecoverBit distinguishes Ω(2푚푛) distinct inputs with con- stant probability of success (by Corollary 2.3.7), the size of message sent by Alice, should be

Ω(푚푛). This proves Theorem 2.3.2.

Proof of Theorem 2.3.1: As we showed earlier, the communication complexity of (Many vs One)-Set Disjointness is a lower bound for the communication complexity of Set Cover.

Theorem 2.3.2 showed that any protocol for (Many vs One)-Set Disjointness(푛, 퐴) with |ℱ | error probability less than 푂(푚−푐) requires Ω(푚푛) bits of communication. Thus any single- round randomized protocol for Set Cover with error probability 푂(푚−푐) requires Ω(푚푛) bits of communication. 

Since any 푝-pass streaming 훼-approximation algorithm for problem P that uses 푂(푠)

49 memory space, is a 푝-round two-party 훼-approximation protocol for problem P using 푂(푠푝) bits of communication [88], and by Theorem 2.3.1, we have the following lower bound for

Set Cover problem in the streaming model.

Theorem 2.3.8. Any single-pass randomized streaming algorithm for Set Cover( , ) that 풰 ℱ computes a (3/2)-approximate solution with probability Ω(1 푚−푐) requires Ω(푚푛) memory − space (assuming 푛 푐1 log 푚). ≥ 2.4. Geometric Set Cover

In this section, we consider the streaming Set Cover problem in the geometric settings. We present an algorithm for the case where the elements are a set of 푛 points in the plane R2 and the 푚 sets are either all disks, all axis-parallel rectangles, or all 훼-fat triangles (which for simplicity we call shapes) given in a data stream. As before, the goal is to find the minimum size cover of points from the given sets. We call this problem the Points-Shapes Set Cover problem.

Note that, the description of each shape requires 푂(1) space and thus the Points-Shapes Set Cover problem is trivial to be solved in 푂(푚 + 푛) space. In this setting the goal is to design an algorithm whose space is sub-linear in 푂(푚 + 푛). Here we show that almost the same algorithm as IterSetCover (with slight modifications) uses 푂̃︀(푛) space to find an 푂(휌)-approximate solution of the Points-Shapes Set Cover problem in constant passes.

2.4.1. Preliminaries

A triangle is called 훼-fat (or simply fat) if the ratio between its longest edge and its height △ on this edge is bounded by a constant 훼 > 1 (there are several equivalent definitions of 훼-fat triangles).

Definition 2.4.1. Let ( , ) be a set system such that is a set of points and is a 풰 ℱ 풰 ℱ collection of shapes, in the plane R2. The canonical representation of ( , ) is a collection ′ 풰 ℱ ℱ of regions such that the following conditions hold. First, each 푆′ ′ has 푂(1) description. ∈ ℱ Second, for each 푆′ ′, there exists 푆 such that 푆′ 푆 . Finally, for each ∈ ℱ ∈ ℱ ∩ 풰 ⊆ ∩ 풰 ′ ′ ′ ′ ′ 푆 , there exists 푐1 sets 푆 , , 푆 such that 푆 = (푆 푆 ) for some ∈ ℱ 1 ··· 푐1 ∈ ℱ ∩ 풰 1 ∪ · · · ∪ 푐1 ∩ 풰 constant 푐1.

50 The following two results are from [68] which are the formalization of the ideas in [15].

Lemma 2.4.2. (Lemma 4.18 in [68]) Given a set of points in the plane R2 and a parameter 풰 푤, one can compute a set ′ of 푂( 푤2 log ) axis-parallel rectangles with the following ℱtotal |풰| |풰| property. For an arbitrary axis-parallel rectangle 푆 that contains at most 푤 points of , there 풰 exist two axis-parallel rectangles 푆′ , 푆′ ′ whose union has the same intersection with 1 2 ∈ ℱtotal as 푆, i.e., 푆 = (푆′ 푆′ ) . 풰 ∩ 풰 1 ∪ 2 ∩ 풰 Lemma 2.4.3. (Theorem 5.6 in [68]) Given a set of points in R2, a parameter 푤 and 풰 a constant 훼, one can compute a set ′ of 푂( 푤3 log2 ) regions each having 푂(1) ℱtotal |풰| |풰| description with the following property. For an arbitrary 훼-fat triangle 푆 that contains at most 푤 points of , there exist nine regions from ′ whose union has the same intersection 풰 ℱtotal with as 푆. 풰 Using the above lemmas we get the following lemma.

Lemma 2.4.4. Let be a set of points in R2 and let be a set of shapes (discs, axis-parallel 풰 ℱ rectangles or fat triangles), such that each set in contains at most 푤 points of . Then, in ℱ 풰 a single pass over the stream of sets , one can compute the canonical representation ′ of ℱ ℱ ( , ). Moreover, the size of the canonical representation is at most 푂( 푤3 log2 ) and 풰 ℱ |풰| |풰| the space requirement of the algorithm is 푂̃︀( ′ ) = 푂̃︀( 푤3). |ℱ | |풰| Proof: For the case of axis-parallel rectangles and fat triangles, first we use Lemma 2.4.2 and

Lemma 2.4.3 to get the set ′ offline which require 푂̃︀( ′ ) = 푂̃︀( 푤3 log2 ) memory ℱtotal ℱtotal |풰| |풰| space. Then by making one pass over the stream of sets , we can find the canonical ℱ representation ′ by picking all the sets 푆′ ′ such that 푆′ 푆 for some ℱ ∈ ℱtotal ∩ 풰 ⊆ ∩ 풰 푆 . For discs however, we just make one pass over the sets and keep a maximal subset ∈ ℱ ℱ ′ such that for each pair of sets 푆′ , 푆′ ′ their projection on are different, i.e., ℱ ⊆ ℱ 1 2 ∈ ℱ 풰 푆′ = 푆′ . By a standard technique of Clarkson and Shor [50], it can be proved that 1 ∩ 풰 ̸ 2 ∩ 풰 the size of the canonical representation, i.e., 푆′ , is bounded by 푂( 푤2). Note that this is | | |풰| just counting the number of discs that contain at most 푤 points, namely the at most 푤-level discs. 

51 2.4.2. Description of GeomSC Algorithm

The outline of the GeomSC algorithm (shown in Algorithm3) is very similar to the Iter- SetCover algorithm presented earlier in Section 2.2. In the first pass, the algorithm picks all the sets that cover a large number of yet-uncovered

elements. Next, we sample S. Since we have removed all the ranges that have large size, in the first pass, the size of the remaining ranges restricted to the sample S is small. Therefore

by Lemma 2.4.4, the canonical representation of (S, S) has small size and we can afford to ℱ store it in the memory. We use Lemma 2.4.4 to compute the canonical representation S in ℱ one pass. The algorithm then uses the sets in S to find a cover sol for the points of S. ℱ S Next, in one additional pass, the algorithm replaces each set in by one of its supersets solS in . ℱ Finally, note that in IterSetCover, we are assuming that the size of the optimal

solution is 푂(푘). Thus it is enough to stop the iterations once the number of uncovered elements is less than 푘. Then we can pick an arbitrary set for each of the uncovered elements. This would add only 푘 more sets to the solution. Using this idea, we can reduce the size of the sampled elements down to 푛 훿 which would help us in getting near-linear 푐휌푘( 푘 ) log 푚 log 푛 space in the geometric setting. Note that the final pass of the algorithm can be embedded into the previous passes but for the sake of clarity we write it separately.

2.4.3. Analysis

By a similar approach to what we used in Section 2.2 to analyze the pass count and approx- imation guarantee of IterSetCover algorithm, we can show that the number of passes of

the GeomSC algorithm is 3/훿 + 1 (which can be reduced to 3/훿 with minor changes), and the algorithm returns an 푂(휌/훿)-approximate solution. Next, we analyze the space usage and the correctness of the algorithm. Note that our analysis in this section only works for

훿 1/4. ≤ Lemma 2.4.5. The algorithm uses 푂̃︀(푛) space. Proof: Consider an iteration of the algorithm. The memory space used in the first pass of

each iteration is 푂̃︀(푛). The size of S is 푐휌푘(푛/푘)훿 log 푚 log 푛 and after the first pass the size

52 Algorithm 3 A streaming algorithm for the Points-Shapes Set Cover problem. 1: procedure GeomSC(( , ), 훿) 2: for 푘 2푖 0 푖 풰logℱ푛 in parallel do ◁ 푛 = 3: let∈L { | and≤ ≤sol } |풰| 4: for 푖 =← 1 풰to 1/훿 do ← ∅ 5: for all 푆 do ◁ one pass 6: if 푆 ∈L ℱ /푘 then 7: |sol∩ | ≥sol |풰| 푆 8: L ←L 푆 ∪ { } ← ∖ 9: S sample of L of size 푐휌푘(푛/푘)훿 log 푚 log 푛 10: ← CompCanonicalRep |S| one pass S (S, , 푘 ) ◁ 11: ℱ ← OfflineSC ℱ solS (S, S) 12: for 푆← do ◁ one passℱ 13: if ∈ ℱ′ s.t. ′ then 푆 solS 푆 S 푆 S 14: ∃sol∈ sol 푆 ∩ ⊆ ∩ 15: ← ∪ { }′ solS solS 푆 16: L L← 푆 ∖ { } ← ∖ 17: for 푆 do ◁ final pass 18: if 푆∈ ℱL = then 19: sol∩ ̸ sol∅ 푆 20: L ←L 푆 ∪ { } ← ∖ 21: return smallest sol computed in parallel of each set is at most /푘. Thus using Chernoff bound for each set 푆 sol, |풰| ∈ ℱ ∖ S (︂ 4 S )︂ 1 Pr 푆 S > (1 + 2)|풰| | | exp | | ( )푐+1. | ∩ | 푘 × ≤ − 3푘 ≤ 푚 |풰| Thus, with probability at least 1 푚−푐 (by the union bound), all the sets that are not picked − in the first pass, cover at most 3 S /푘 = 푐휌(푛/푘)훿 log 푚 log 푛 elements of S. Therefore, we can | | use Lemma 2.4.4 to show that the number of sets in the canonical representation of (S, S) ℱ is at most

(︂3 S )︂3 푂( S | | log2 S ) = 푂(휌4푛 log4 푚 log6 푛), | | 푘 | | as long as 훿 1/4. To store each set in a canonical representation of (S, ) only constant ≤ ℱ space is required. Moreover, by Lemma 2.4.4, the space requirement of the second pass is

푂̃︀( S ) = 푂̃︀(푛). Therefore, the total required space is 푂̃︀(푛) and the lemma follows. |ℱ |  Theorem 2.4.6. Given a set system defined over a set of 푛 points in the plane, and a 풰 53 set of 푚 ranges (which are either all disks, axis-parallel rectangles, or fat triangles). Let ℱ 휌 be the quality of approximation to the offline set-cover solver we have, and let 0 < 훿 < 1/4 be an arbitrary parameter.

Setting 훿 = 1/4, the algorithm GeomSC with high probability returns an 푂(휌)-approximate solution of the optimal set cover solution for the instance ( , ). This algorithm uses 푂̃︀(푛) 풰 ℱ space, and performs constant passes over the data.

Proof: As before consider the run of the algorithm in which OPT 푘 < 2 OPT . Let | | ≤ | | V be the set of uncovered elements L at the beginning of the iteration and note that the

total number of sets that is picked during the iteration is at most (1 + 푐1휌)푘 where 푐1 is the constant defined in Definition 2.4.4. Let denote all possible such covers, that is 풢 {︀ ′ ⃒ ′ }︀ = ⃒ (1 + 푐1휌)푘 . Let be the collection that contains all possible set of 풢 ℱ ⊆ ℱ |ℱ | ≤ ℋ {︀ ⋃︀ ⃒ }︀ uncovered elements at the end of the iteration, defined as = V 푆 ⃒ . Set ℋ ∖ 푆∈풞 풞 ∈ 풢 ′ 푝 = (푘/푛)훿, 휀 = 1/2 and 푞 = 푚−푐. Since for large enough 푐, 푐 (log log 1 + log 1 ) 휀2푝 |ℋ| 푝 푞 ≤ 푐휌푘(푛/푘)훿 log 푚 log 푛 = S with probability at least 1 푚−푐, by Lemma 2.2.5, the set of | | − sampled elements S is a relative (푝, 휀)-approximation sample of (V, ). ℋ Let be the collection of sets picked in the third pass of the algorithm that covers all 풞 ⊆ ℱ elements in S. By Lemma 2.4.4, 푐1휌푘 for some constant 푐1. Since with high probability |풞| ≤ S is a relative (푝, 휀)-approximation sample of (V, ), the number of uncovered elements of ℋ V (or L) after adding to sol is at most 휀푝 V (푘/푛)훿. Thus with probability at least 풞 | | ≤ |풰| (1 푚−푐), in each iteration and by adding 푂(휌푘) sets, the number of uncovered elements − reduces by a factor of (푛/푘)훿. Therefore, after 4 iterations (for 훿 = 1/4) the algorithm picks 푂(휌푘) sets and with high probability the number of uncovered elements is at most 푛(푘/푛)훿/훿 = 푘. Thus, in the final pass the algorithm only adds 푘 sets to the solution sol, and hence the approximation factor

of the algorithm is 푂(휌).  Remark 2.4.7. The result of Theorem 2.4.6 is similar to the result of Agarwal and Pan [3]

– except that their algorithm performs 푂(log 푛) iterations over the data, while the algorithm of Theorem 2.4.6 performs only a constant number of iterations. In particular, one can use the algorithm of Agarwal and Pan [3] as the offline solver.

54 f3 f2 f1 f f f f f f 1 3 2 1 10 20 30 2 3 4

P3 P2 P1 P3 P2 P1 P4 P5 P6 (a) (b) Figure 2.5.1: (a) shows an example of the communication Set Chasing(4, 3) and (b) is an instance of the communication Intersection Set Chasing(4, 3). 2.5. Lower bound for Multipass Algorithms

In this section we give lower bound on the memory space of multipass streaming algorithms

for the Set Cover problem. Our main result is Ω(푚푛훿) space for streaming algorithms that return an optimal solution of the Set Cover problem in 푂(1/훿) passes for 푚 = 푂(푛). Our approach is to reduce the communication Intersection Set Chasing(푛, 푝) problem introduced by Guruswami and Onak [90] to the communication Set Cover problem.

Consider a communication problem P with 푛 players 푃1, , 푃푛. The problem P is a ··· (푛, 푟)-communication problem if players communicate in 푟 rounds and in each round they

speak in order 푃1, , 푃푛. At the end of the 푟th round 푃푛 should return the solution. ··· Moreover we assume private randomness and public messages. In what follows we define the

communication Set Chasing and Intersection Set Chasing problems.

Definition 2.5.1. The Set Chasing(푛, 푝) problem is a (푝, 푝 1) communication problem in − [푛] which the player 푖 has a function 푓푖 :[푛] 2 and the goal is to compute 푓⃗1(푓⃗2( 푓⃗푝( 1 ) )) → ··· { } ··· where ⃗ ⋃︀ . Figure 2.5.1(a) is an instance of the communication . 푓푖(푆) = 푠∈푆 푓푖(푠) Set Chasing(4, 3)

Definition 2.5.2. The Intersection Set Chasing(푛, 푝) is a (2푝, 푝 1) communication prob- − lem in which the first 푝 players have an instance of the Set Chasing(푛, 푝) problem and the other 푝 players have another instance of the Set Chasing(푛, 푝) problem. The output of the In- tersection Set Chasing(푛, 푝) is 1 if the solutions of the two instances of the Set Chasing(푛, 푝) intersect and 0 otherwise. Figure 2.5.1(b) shows an instance of the Intersection Set Chasing

(4, 3). The function 푓푖 of each player 푃푖 is specified by a set of directed edges form a copy of vertices labeled 1, , 푛 to another copy of vertices labeled 1, , 푛 . { ··· } { ··· } The communication Set Chasing problem is a generalization of the well-known commu-

55 fi 1 1 1 1 out(vi+1) in(vi ) vi+1 v i 2 2 2 2 out(vi+1) in(vi ) vi+1 v i 3 3 3 3 out(vi+1) in(vi ) vi+1 vi out(v4 ) in(v4) 4 4 i+1 i vi+1 vi Pi ei (a) (c)

fi0 1 1 in(u1) out(u1 ) ui ui+1 i i+1 2 2 2 2 in(ui ) out(ui+1) ui ui+1 3 3 3 3 in(ui ) out(ui+1) ui ui+1 4 4 4 4 in(ui ) out(ui+1) ui ui+1 Pp+i ep+i (b) (d) Figure 2.5.2: The gadgets used in the reduction of the communication Intersection Set Chasing problem to the communication Set Cover problem. (a) and (c) shows the construc- tion of the gadget for players 1 to 푝 and (c) and (d) shows the construction of the gadget for players 푝 + 1 to 2푝. nication Pointer Chasing problem in which player 푖 has a function 푓푖 :[푛] [푛] and the → goal is to compute 푓1(푓2( 푓푝(1) )). ··· ··· [90] showed that any randomized protocol that solves Intersection Set Chasing(푛, 푝) with error probability less than , requires 푛1+1/(2푝) bits of communication where 1/10 Ω( 푝16 log3/2 푛 ) 푛 is sufficiently large and 푝 log 푛 . In Theorem 2.5.4, we reduce the communication ≤ log log 푛 Intersection Set Chasing problem to the communication Set Cover problem and then give the first superlinear memory lower bound for the streaming Set Cover problem.

Definition 2.5.3 (Communication Set Cover(풰, ℱ, 푝) Problem). The communication Set Cover(푛, 푝) is a (푝, 푝 1) communication problem in which a collection of elements − 풰 is given to all players and each player 푖 has a collection of subsets of , 푖. The goal is to 풰 ℱ solve Set Cover( , 1 푝) using the minimum number of communication bits. 풰 ℱ ∪ · · · ∪ ℱ Theorem 2.5.4. Any (1/2훿 1) passes streaming algorithm that solves the Set Cover( , ) − 풰 ℱ optimally with constant probability of error requires Ω(̃︀ 푚푛훿) memory space where 훿 log log 푛 ≥ log 푛 and 푚 = 푂(푛).

56 Consider an instance ISC of the communication Intersection Set Chasing(푛, 푝). We con- struct an instance of the communication Set Cover( , , 2푝) problem such that solving Set 풰 ℱ Cover( , ) optimally determines whether the output of ISC is 1 or not. 풰 ℱ [푛] The instance ISC consists of 2푝 players. Each player 1, , 푝 has a function 푓푖 :[푛] 2 ··· → and each player 푝 + 1, , 2푝 has a function 푓 ′ :[푛] 2[푛] (see Figure 2.5.1). In ISC, each ··· 푖 → 1 푛 1 푛 function 푓푖 is shown by a set of vertices 푣 , , 푣 and 푣 , , 푣 such that there is a 푖 ··· 푖 푖+1 ··· 푖+1 푗 ℓ ′ directed edge from 푣 to 푣 if and only if ℓ 푓푖(푗). Similarly, each function 푓 is denoted 푖+1 푖 ∈ 푖 by a set of vertices 푢1, , 푢푛 and 푢1 , , 푢푛 such that there is a directed edge from 푢푗 푖 ··· 푖 푖+1 ··· 푖+1 푖+1 to 푢ℓ if and only if ℓ 푓 ′(푗) (see Figure 2.5.4(a) and Figure 2.5.4(b)). 푖 ∈ 푖 In the corresponding communication Set Cover instance of ISC, we add two elements in(푣푗) and out(푣푗) per each vertex 푣푗 where 푖 푝 + 1, 푗 푛. We also add two elements 푖 푖 푖 ≤ ≤ in(푢푗) and out(푢푗) per each vertex 푢푗 where 푖 푝 + 1, 푗 푛. In addition to these elements, 푖 푖 푖 ≤ ≤ for each player 푖, we add an element 푒푖 (see Figure 2.5.4(c) and Figure 2.5.4(d)). Next, we define a collection of sets in the corresponding Set Cover instance of ISC. For 푗 푗 ℓ each player 푃푖, where 1 푖 푝, we add a single set 푆 containing out(푣 ) and in(푣 ) for ≤ ≤ 푖 푖+1 푖 all out-going edges 푗 ℓ . Moreover, all 푗 sets contain the element . Next, for each (푣푖+1, 푣푖 ) 푆푖 푒푖 vertex 푗 we add a set 푗 that contains the two corresponding elements of 푗, 푗 and 푣푖 푅푖 푣푖 in(푣푖 ) 푗 . In Figure 2.5.4(c), the red rectangles denote -type sets and the curves denote out(푣푖 ) 푅 푆-type sets for the first half of the players.

Similarly to the sets corresponding to players 1 to 푝, for each player 푃푝+푖 where 1 푖 푝, ≤ ≤ we add a set 푗 containing 푗 and ℓ for all in-coming edges ℓ 푗 of 푗 푆푝+푖 in(푢푖 ) out(푢푖+1) (푢푖+1, 푢푖 ) 푢푖 (denoting ′−1 ). The set 푗 contains the element too. Next, for each vertex 푗 we 푓푖 (푗) 푆푝+푖 푒푝+푖 푢푖 add a set 푗 that contains the two corresponding elements of 푗, 푗 and 푗 . In 푇푝+푖 푢푖 in(푢푖 ) out(푢푖 ) Figure 2.5.4(d), the red rectangles denote 푇 -type sets and the curves denote 푆-type sets for the second half of the players.

At the end, we merge 푖 s and 푖 s as shown in Figure 2.5.3. After merging the corre- 푣1 푢1 sponding sets of 푣푗s (푅1, , 푅푛) and the corresponding sets of 푢푗 s (푇 1, , 푇 푛), we call 1 1 ··· 1 1 1 ··· 1 the merged sets 푇 1, , 푇 푛. 1 ··· 1 The main claim is that if the solution of ISC is 1 then the size of an optimal solution of its corresponding Set Cover instance SC is (2푝 + 1)푛 + 1; otherwise, it is (2푝 + 1)푛 + 2.

57 f1 f10 f1 f 0 1 f1 f10 1 1 1 1 v1 u1 v2 v1 u1 u2 2 2 2 2 2 2 2 2 u u2 v2 v1 u1 2 v2 3 3 3 3 3 u u3 v u2 v2 v1 1 2 2 4 4 4 4 4 4 u v u u2 v2 2 4 4 2 v1 1 in(v1) in(u1) P1 Pp+1 P1 Pp+1 P1 Pp+1 (a) (b) (c) Figure 2.5.3: In (b) two Set Chasing instances merge in their first set of vertices and (c) shows the corresponding gadgets of these merged vertices in the communication Set Cover.

f3 f2 f1 f10 f20 f30 1 2 3 4 P3 P2 P1 P4 P5 P6 (a) (b) Figure 2.5.4: In (푎), path 푄 is shown with black dashed arcs and (푏) shows the correspond- ing cover of path 푄. Lemma 2.5.5. The size of any feasible solution of SC is at least (2푝 + 1)푛 + 1.

Proof: For each player 푖 (1 푖 푝), since out(푣푗 )s are only covered by 푅푗 and 푆푗, at least ≤ ≤ 푖+1 푖+1 푖 1 푛 푗 푛 sets are required to cover out(푣 ), , out(푣 ). Moreover for player 푃푝, since in(푣 )s 푖+1 ··· 푖+1 푝+1 푗 1 1 푛 1 are only covered by 푅 and 푒푝 is only covered by 푆 , all 푛 + 1 sets 푅 , , 푅 , 푆 must 푝+1 푝 푝+1 ··· 푝+1 푝 be selected in any feasible solution of SC. Similarly for each player 푝+푖 (1 푖 푝), since in(푢푗)s are only covered by 푇 푗 and 푆푗 , at ≤ ≤ 푖 푖 푝+푖 least 푛 sets are required to cover in(푢1), , in(푢푛). Moreover, considering 푢1 , , 푢푛 , 푖 ··· 푖 푝+1 ··· 푝+1 since in(푢푗 ) is only covered by 푇 푗 , all 푛 sets 푇 1 , , 푇 푛 must be selected in any 푝+1 푝+1 푝+1 ··· 푝+1 feasible solution of SC.

All together, at least (2푝+1)푛+1 sets should be selected in any feasible solution of SC. Lemma 2.5.6. Suppose that the solution of ISC is 1. Then the size of an optimal solution of its corresponding Set Cover instance is exactly (2푝 + 1)푛 + 1.

Proof: By Lemma 2.5.5, the size of an optimal solution of S is at least (2푝 + 1)푛 + 1. Here we prove that (2푝 + 1)푛 + 1 sets suffice when the solution of ISC is 1. Let 푄 =

58 1 푗푝 푗2 푗1 ℓ1 ℓ2 ℓ푝 1 be a path in such that (since the solu- 푣푝+1, 푣푝 , . . . , 푣2 , 푣1 , 푢1 , 푢2 , . . . , 푢푝 , 푢푝+1 ISC 푗1 = ℓ1 tion of ISC is 1 such a path exists). The corresponding solution to 푄 can be constructed as follows (See Figure 2.5.4):

Pick 푆1 and all 푅푗 s (푛 + 1 sets). ∙ 푝 푝+1 For each 푣푗푖 in 푄 where 1 < 푖 푝, pick the set 푆푗푖 in the solution. Moreover, for each ∙ 푖 ≤ 푖−1 푗 such 푖 pick all sets 푅 where 푗 = 푗푖 (푛(푝 1) sets). 푖 ̸ − 푗1 ℓ1 푗1 푗 For 푣 (or 푢 ), pick the set 푆 . Moreover, pick all sets 푇 where 푗 = 푗1 (푛 sets). ∙ 1 1 푝+1 1 ̸ For each 푢ℓ푖 in 푄 where 1 < 푖 푝, pick the set 푆ℓ푖 in the solution. Moreover, for each ∙ 푖 ≤ 푝+푖 ℓ such 푖 pick all sets 푇 where ℓ = ℓ푖 (푛(푝 1) sets). 푖 ̸ − Pick all 푇 푗 s (푛 sets). ∙ 푝+1 It is straightforward to see that the solution constructed above is a feasible solution.  Lemma 2.5.7. Suppose that the size of an optimal solution of the corresponding Set Cover instance of ISC, SC, is (2푝 + 1)푛 + 1. Then the solution of ISC is 1.

Proof: As we proved earlier in Lemma 2.5.5, any feasible solution of SC picks 푅1 , , 푅푛 , 푆1 푝+1 ··· 푝+1 푝 and 푇 1 , , 푇 푛 . Moreover, we proved that for each 1 푖 < 푝, at least 푛 sets should 푝+1 ··· 푝+1 ≤ be selected from 푅1 , , 푅푛 , 푆1, , 푆푛. Similarly, for each 1 푖 푝, at least 푛 sets 푖+1 ··· 푖+1 푖 ··· 푖 ≤ ≤ should be selected from 푇 1, , 푇 푛, 푆1 , , 푆푛 . Thus if a feasible solution of SC, OPT, 푖 ··· 푖 푝+푖 ··· 푝+1 is of size (2푝 + 1)푛 + 1, it has exactly 푛 sets from each specified group. Next we consider the first half of the players and second half of the players separately.

Consider 푖 such that 1 푖 < 푝. Let 푆푗1 , , 푆푗푘 be the sets picked in the optimal solution ≤ 푖 ··· 푖 (because of there should be at least one set of form 푗 in OPT). Since each 푗 푒푖 푆푖 out(푣푖+1) 푗 푗 푗 is only covered by 푆 and 푅 , for all 푗 / 푗1, . . . , 푗푘 , 푅 should be selected in OPT. 푖 푖+1 ∈ { } 푖+1 푗 Moreover, for all 푗 푗1, , 푗푘 , 푅 should not be contained in OPT (otherwise the size ∈ { ··· } 푖+1 푗 of OPT would be larger than (2푝 + 1)푛 + 1). Consider 푗 푗1, . . . , 푗푘 . Since 푅 is not ∈ { } 푖+1 in OPT, there should be a set ℓ selected in OPT such that 푗 is contained in ℓ . 푆푖+1 in(푣푖+1) 푆푖+1 Thus by considering s in a decreasing order and using induction, if 푗 is in OPT then 푗 푆푖 푆푖 푣푖+1 is reachable form 1 . 푣푝+1 Next consider a set 푆푗 that is selected in OPT (1 푖 푝). By similar argument, 푝+푖 ≤ ≤ 푗 is not in OPT and there exists a set ℓ (or ℓ if ) in OPT such that 푗 푇푖 푆푝+푖−1 푆1 푖 = 1 out(푢푖 )

59 is contained in 푆ℓ . Let 푢ℓ1 , , 푢ℓ푘 be the set of vertices whose corresponding out 푝+푖−1 푖+1 ··· 푖+1 elements are in 푗 . Then by induction, there exists an index such that 푟 is reachable 푆푝+푖 푟 푣1 from 푣1 and 푢푟 is also reachable from all 푢ℓ1 , , 푢ℓ푘 . Moreover, the way we constructed 푝+1 1 푖+1 ··· 푖+1 the instance SC guarantees that all sets 푆1 , , 푆푛 contains out(푢1 ). Hence if the size of 2푝 ··· 2푝 푝+1 an optimal solution of SC is (2푝 + 1)푛 + 1 then the solution of ISC is 1.  Corollary 2.5.8. Intersection Set Chasing(푛, 푝) returns 1 if and only if the size of optimal solution of its corresponding Set Cover instance (as described here) is (2푝 + 1)푛 + 1.

Observation 2.5.9. Any streaming algorithm for Set Cover, , that in ℓ passes solves the ℐ problem optimally with a probability of error err and consumes 푠 memory space, solves the corresponding communication Set Cover problem in ℓ rounds using 푂(푠ℓ2) bits of communi- cation with probability error err.

Proof: Starting from player 푃1, each player runs over its input sets and once 푃푖 is done ℐ with its input, she sends the working memory of publicly to other players. Then next ℐ player starts the same routine using the state of the working memory received from the previous player. Since solves the Set Cover instance optimally after ℓ passes using 푂(푠) ℐ space with probability error err, applying as a black box we can solve P in ℓ rounds using ℐ 2 푂(푠ℓ ) bits of communication with probability error err.  Proof of Theorem 2.5.4: By Observation 2.5.9, any ℓ-round 푂(푠)-space algorithm that solves streaming Set Cover ( , ) optimally can be used to solve the communication Set Cover( , , 푝) 풰 ℱ 풰 ℱ problem in ℓ rounds using 푂(푠ℓ2) bits of communication. Moreover, by Corollary 2.5.8, we can decide the solution of the communication Intersection Set Chasing(푛, 푝) by solving its corresponding communication Set Cover problem. Note that while working with the corre- sponding Set Cover instance of Intersection Set Chasing(푛, 푝), all players know the collection ′ of elements and each player can construct its collection of sets 푖 using 푓푖 (or 푓 ). 풰 ℱ 푖 However, by a result of [90], we know that any protocol that solves the communication

Intersection Set Chasing(푛, 푝) problem with probability of error less than 1/10, requires 푛1+1/(2푝) bits of communication. Since in the corresponding instance of the Ω( 푝16 log3/2 푛 ) Set Cover communication Intersection Set Chasing(푛, 푝), = (2푝 + 1) 2푛 + 2푝 = 푂(푛푝) and |풰| × (2푝 + 1)푛 + 2푝푛 = 푂(푛푝), any (푝 1)-pass streaming algorithm that solves the |ℱ| ≤ −

60 problem optimally with a probability of error at most , requires 푛1+1/(2푝) Set Cover 1/10 Ω( 푝18 log3/2 푛 ) bits of communication. Then using Observation 2.5.9, since 훿 log log 푛 , any ( 1 1)-pass ≥ log 푛 2훿 − streaming algorithm of Set Cover that finds an optimal solution with error probability less than 1/10, requires Ω(̃︀ 훿) space. |ℱ| · |풰| 

2.6. Lower Bound for Sparse Set Cover in Multiple Passes

In this part we give a stronger lower bound for the instances of the streaming Set Cover problem with sparse input sets. An instance of the Set Cover problem is 푠-Sparse Set Cover, if for each set 푆 we have 푆 푠. We can us the same reduction approach described ∈ ℱ | | ≤ earlier in Section 2.5 to show that any (1/2훿 1)-pass streaming algorithm for 푠-Sparse Set − Cover requires Ω( 푠) memory space if 푠 < 훿 and = 푂( ). To prove this, we need |ℱ| |풰| ℱ 풰 to explain more details of the approach of [90] on the lower bound of the communication

Intersection Set Chasing problem. They first obtained a lower bound for Equal Pointer Chasing(푛, 푝) problem in which two instances of the communication Pointer Chasing(푛, 푝) are given and the goal is to decide whether these two instances point to a same value or not;

′ ′ 푓푝( 푓1(1) ) = 푓 ( 푓 (1) ). ··· ··· 푝 ··· 1 ··· Definition 2.6.1푟 ( -non-injective functions). A function 푓 :[푛] [푛] is called 푟-non- → injective if there exists 퐴 [푛] of size at least 푟 and 푏 [푛] such that for all 푎 퐴, ⊆ ∈ ∈ 푓(푎) = 푏.

Definition 2.6.2Pointer ( Chasing Problem). Pointer Chasing(푛, 푝) is a (푝, 푝 1) com- − munication problem in which the player 푖 has a function 푓푖 :[푛] [푛] and the goal is to → compute 푓1(푓2( 푓푝(1) )). ··· ··· Definition 2.6.3Equal ( Limited Pointer Chasing Problem). Equal Pointer Chasing(푛, 푝) is a (2푝, 푝 1) communication problem in which the first 푝 players have an instance of the − Pointer Chasing(푛, 푝) problem and the other 푝 players have another instance of the Pointer Chasing(푛, 푝) problem. The output of the Equal Pointer Chasing(푛, 푝) is 1 if the solutions of the two instances of Pointer Chasing(푛, 푝) have the same value and 0 otherwise. Further- more in another variant of pointer chasing problem, Equal Limited Pointer Chasing(푛, 푝, 푟),

if there exists 푟-non-injective function 푓푖, then the output is 1. Otherwise, the output is the

61 same as the value in Equal Pointer Chasing(푛, 푝).

For a boolean communication problem P, OR푡(P) is defined to be OR of 푡 instances of P

and the output of OR푡(P) is true if and only if the output of any of the 푡 instances is true.

Using a direct sum argument, [90] showed that the communication complexity of OR푡(Equal Limited Pointer Chasing(푛, 푝, 푟)) is 푡 times the communication complexity of Equal Limited Pointer Chasing(푛, 푝, 푟).

Lemma 2.6.4 ([90]). Let 푛, 푝, 푡 and 푟 be positive integers such that 푛 5푝, 푡 푛 and ≥ ≤ 4 푟 = 푂(log 푛). Then the amount of bits of communication to solve OR푡( Equal Limited ) with error probability less than is 푡푛 2 . Pointer Chasing(푛, 푝, 푟) 1/3 Ω( 16 ) 푂(푝푡 ) 푝 log 푛 − Lemma 2.6.5 ([90]). Let 푛, 푝, 푡 and 푟 be positive integers such that 푡2푝푟푝−1 < 푛/10. Then if there is a protocol that solves Intersection Set Chasing(푛, 푝) with probability of error less

than 1/10 using 퐶 bits of communication, there is a protocol that solves OR푡( Equal Lim- ited Pointer Chasing(푛, 푝, 푟)) with probability of error at most 2/10 using 퐶 + 2푝 bits of communication.

훿 Consider an instance of OR푡(Equal Limited Pointer Chasing (푛, 푝, 푟)) in which 푡 푛 , 푟 = ≤ log(푛), 푝 = 1 1 where 1 = 표(log 푛). By Lemma 2.6.4, the required amount of bits of com- 2훿 − 훿 munication to solve the instance with constant success probability is Ω(̃︀ 푡푛). Then,applying Lemma 2.6.5, to solve the corresponding Intersection Set Chasing, Ω(̃︀ 푡푛) bits of communi- cation is required.

In the reduction from OR푡(Equal Limited Pointer Chasing(푛, 푝, 푟)) to Intersection Set Chasing(푛, 푝) (proof of Lemma 2.6.5), the 푟-non-injective property is preserved. In other

words, in the corresponding Intersection Set Chasing instance each player’s functions 푓푖 : [푛] 4 [푛] 2 is union of 푡 푟-non-injective functions 푓푖(푎) := 푓푖,1(푎) 푓푖,푡(푎) . Given that → ∪ · · · ∪ none of the 푓푖,푗 functions is 푟-non-injective, the corresponding Set Cover instance will have sets of size at most 푟푡 (푆-type sets are of size at most 푡 for 1 푖 푝 and of size at most 푟푡 for ≤ ≤ 푝+1 푖 2푝). Since 푟 = 푂(log 푛), the corresponding Set Cover instance is 푂̃︀(푡)-sparse. As ≤ ≤ we showed earlier in the reduction from Intersection Set Chasing to Set Cover, the number

4The Intersection Set Chasing instance is obtained by overlaying the 푡 instances of Equal Pointer . To be more precise, the function of player in instance is −1 ( are Chasing(푛, 푝, 푟) 푖 푗 휋푖,푗 푓푖,푗 휋푖+1,푗 휋 randomly chosen permutation functions) and then stack the functions on top of each other.∘ ∘

62 of elements (and sets) in the corresponding Set Cover instance is 푂(푛푝). Thus we have the following result for 푠-Sparse Set Cover problem.

Theorem 2.6.6. For 푠 훿, any streaming algorithm that solves 푠-Sparse Set Cover( , ) ≤ |풰| 풰 ℱ optimally with probability of error less than 1/10 in ( 1 1) passes requires Ω(̃︀ 푠) memory 2훿 − |ℱ| space for = 푂( ). ℱ 풰

63 64 Chapter 3

Streaming Fractional Set Cover

3.1. Introduction

The LP relaxation of Set Cover (called SetCover-LP) is a continuous relaxation of the prob- lem where each set 푆 can be selected “fractionally”, i.e., assigned a number 푥푆 from ∈ ℱ , such that for each element its “fractional coverage” ∑︀ is at least , and the [0, 1] 푒 푆:푒∈푆 푥푆 1 sum ∑︀ is minimized. 푆 푥푆 Despite the significant developments in streaming Set Cover [67, 42, 61, 97, 20, 29, 17], the results for the fractional variant of the problem are still unsatisfactory. To the best of our knowledge, it is not known whether there exists an efficient and accurate algorithm for this problem that uses only a logarithmic (or even a poly logarithmic) number of passes. This state of affairs is perhaps surprising, given the many recent developments on fastLP solvers [119, 174, 124, 12, 11, 167]. To the best of our knowledge, the only prior results on streaming Packing/Covering LPs were presented in [6], which studied the LP relaxation of

Maximum Matching.

3.1.1. Our Results

In this chapter, we present the first (1 + 휀)-approximation algorithm for the fractional Set Cover in the streaming model with constant number of passes. Our algorithm performs 푂( 1 ) 푝 passes over the data stream and uses 푂̃︀(푚푛 푝휀 + 푛) memory space to return a (1 + 휀) approximate solution of the LP relaxation of Set Cover for positive parameter 휀 1/2. ≤ 65 We emphasize that similarly to the previous work on variants of Set Cover in streaming

setting, our result also holds for the edge arrival stream in which the pair of (푆푖, 푒푗) (edges) are stored in the read-only repository and all elements of a set are not necessarily stored consecutively.

3.1.2. Other Related Work

Set Cover. The Set Cover problem was first studied in the streaming model in[155], which presented an 푂(log 푛)-approximation algorithm in 푂(log 푛) passes and using 푂̃︀(푛) space. This approximation factor and the number of passes can be improved to 푂(log 푛) by adapting the greedy algorithm thresholding idea presented in [56] . In the low space

regime (푂̃︀(푛) space), Emek and Rosen [67] designed a deterministic single pass algorithm that achieves an 푂(√푛)-approximation. This is provably the best guarantee that one can hope for in a single pass even considering randomized algorithms. Later Chakrabarti and

Wirth [42] generalized this result and provided a tight trade-off bounds for Set Cover in multiple passes. More precisely, they gave an 푂(푝푛1/(푝+1))-approximate algorithm in 푝- passes using 푂̃︀(푛) space and proved that this is the best possible approximation ratio up to a factor of poly(푝) in 푝 passes and 푂̃︀(푛) space. A different line of work started by Demaine etal.[61] focused on designing a “low” ap- proximation algorithm (between Θ(1) and Θ(log 푛)) in the smallest possible amount of space. In contrast to the results in the 푂̃︀(푛) space regime, [61] showed that randomness is neces- sary: any constant pass deterministic algorithm requires Ω(푚푛) space to achieve constant approximation guarantee. Further, they provided a 푂(4푝 log 푛)-approximation algorithm that makes 푂(4푝) passes and uses 푂̃︀(푚푛1/푝 + 푛). Later Har-Peled et al. [97] improved the 1 algorithm to a 2푝-pass 푂(푝 log 푛)-approximation with memory space 푂̃︀(푚푛1/푝 + 푛) . The result was further improved by Bateni et al. where they designed a 푝-pass algorithm that returns a (1 + 휀) log 푛-approximate solution using 푚푛Θ(1/푝) memory [29]. As for the lower bounds, Assadi et al. [20] presented a lower bound of Ω(푚푛/훼) memory for any single pass streaming algorithm that computes a 훼-approxime solution. For the prob- lem of estimating the size of an optimal solution they prove Ω(푚푛/훼2) memory lower bound. 1In streaming model, space complexity is of interest and one can assume exponentital computation power. In this case the algorithms of [61, 97] save a factor of log 푛 in the approximation ratio.

66 For both settings, they complement the results with matching tight upper bounds. Further- more, Assadi [17] proved a lower bound on the space complexity of streaming algorithms for

Set Cover with multiple passes which is tight up to polylog factors: any 훼-approximation algorithm for Set Cover requires Ω(푚푛1/훼) space, even if it is allowed polylog(푛) passes over the stream, and even if the sets are arriving in a random order in the stream. Fur- ther, [17] provided the matching upper bound: a (2훼 + 1)-pass algorithm that computes a -approximate solution in 푚푛1/훼 푛 memory (assuming exponential computational (훼 + 휀) 푂̃︀( 휀2 + 휀 ) resource).

Maximum Coverage. The first result on streaming Max 푘-Cover showed how to com- pute a (1/4)-approximate solution in one pass using 푂̃︀(푘푛) space [155]. It was improved by Badanidiyuru et al. [22] to a (1/2 휀)-approximation algorithm that requires 푂̃︀(푛/휀) − space. Moreover, their algorithm works for a more general problem of Submodular Max- imization with cardinality constraints. This result was later generalized for the problem of non-monotone submodular maximization under constraints beyond cardinality [46]. Re- cently, McGregor and Vu [131] and Bateni et al. [29] independently obtained single pass

(1 1/푒 휀)-approximation with 푂̃︀(푚/휀2) space. On the lower bound side, [131] showed − − a lower bound of Ω(̃︀ 푚) for constant pass algorithm whose approximation is better than (1 1/푒). Moreover, [17] proved that any streaming (1 휀)-approximation algorithm of − − Max 푘-Cover in polylog(푛) passes requires Ω(̃︀ 푚/휀2) space even on random order streams and the case 푘 = 푂(1). This bound is also complemented by the 푂̃︀(푚푘/휀2) and 푂̃︀(푚/휀3) algorithms of [29, 131]. For more detailed survey of the results on streaming Max 푘-Cover refer to [29, 131, 17, 107].

Covering/Packing LPs. The study of LPs in streaming model was first discussed in the

work of Ahn and Guha [6] where they used multiplicative weights update (MWU) based tech- niques to solve the LP relaxation of Maximum (Weighted) Matching problem. They used the fact that MWU returns a near optimal fractional solution with small size support: first they solve the fractional matching problem, then solve the actual matching only considering the edges in the support of the returned fractional solution.

67 Our algorithm is also based on the MWU method, which is one of the main key techniques in designing fast approximation algorithms for Covering and Packing LPs [150, 173, 76, 16].

We note that the MWU method has been previously studied in the context of streaming and distributed algorithms, leading to efficient algorithms for a wide range of graph optimization problems [6, 23,7]. For a related problem, covering integer LP (covering ILP), Assadi et al. [20] designed

a one pass streaming algorithm that estimates the optimal solution of min c⊤x A⊤x { | ≥ 푛 within a factor of using 푚푛 where denotes the b, x 0, 1 훼 푂̃︀( 2 푏max + 푚 + 푛 푏max) 푏max ∈ { } } 훼 · · largest entry of b. In this problem, they assume that columns of A, constriants, are given one by one in the stream. In a different regime, [65] studied approximating the feasibility LP in streaming model with additive approximation. Their algorithm performs two passes and is most efficient when the input is dense.

3.1.3. Our Techniques

Preprocessing. Let 푘 denote the value of the optimal solution. The algorithm starts by picking a uniform fractional vector (each entry of value 푘 ) which covers all frequently 푂( 푚 ) occurring elements (those appearing in 푚 sets), and updates the uncovered elements in Ω( 푘 ) one pass. This step considerably reduces the memory usage as the uncovered elements have now lower occurrence (roughly 푚 ). Note that we do not need to assume the knowledge of 푘 the correct value 푘: in parallel we try all powers of (1 + 휀), denoting our guess by ℓ.

Multiplicative Weight Update (MWU). To cover the remaining elements, we employ

the MWU framework and show how to implement it in the streaming setting. In each iteration of MWU, we have a probability distribution p corresponding to the constraints (elements) and we need to satisfy the average covering constraint. More precisely, we need ∑︀ an oracle that assigns values to 푥푆 for each set 푆 so that 푝푆푥푆 1 subject to x 1 ℓ, 푆 ≥ ‖ ‖ ≤ where 푝푆 is the sum of probabilities of the elements in the set 푆. Then, the algorithm needs to update p according to the amount each element has been covered by the oracle’s solution. The simple greedy realization of the oracle can be implemented in the streaming

68 setting efficiently by computing all 푝푆 while reading the stream in one pass, then choosing

the heaviest set (i.e., the set with largest 푝푆) and setting its 푥푆 to ℓ. This approach works, except that the number of rounds 푇 required by the MWU framework is large. In fact, 휑 log 푛 , where is the width parameter (the maximum amount an oracle solution 푇 = Ω( 휀2 ) 휑 may over-cover an element), which is Θ(ℓ) in this naïve realization. Next, we show how to decrease 푇 in two steps.

Step 1. A first hope would be that there is a more efficient implementation oftheoracle which gives a better width parameter. Nonetheless, no matter how the oracle is implemented, if all sets in contain a fixed element 푒, then the width is inevitably Ω(ℓ). This observation ℱ implies that we need to work with a different set system that has small width, but atthe same time, it has the same objective value as of the optimal solution. Consequently, we consider the extended set system where we replace with all subsets of the sets in . This ℱ ℱ extended system preserves the optimality, and under this system we may avoid over-covering elements and obtain 푇 = 푂(log 푛) (for constant 휀). In order to turn a solution in our set system into a solution in the extended set system with small width, we need to remove the repeated elements from the sets in the solution so that every covered element appears exactly once, and thereby getting constant width. However, as a side effect, this reduces the total weight of the solution (∑︀ ), and 푆∈sol 푝푆푥푆 thus the average covering constraint might not be satisfied anymore. In fact, we need to come up with a guarantee that, on one hand, is preserved under the pruning step, and on the other hand, implies that the solution has large enough total weight Therefore, to fulfill the average constraint under the pruning step, the oracle must instead solve the maximum coverage problem: given a budget, choose sets to cover the largest (frac- tional) amount of elements. We first show that this problem can be solved approximately via the MWU framework using the simple oracle that picks the heaviest set, but this MWU algorithm still requires 푇 passes over the data. To improve the number of passes, we perform element sampling and apply the MWU algorithm to find an approximate maximum cover- age of a small number of sampled elements, whose subproblem can be stored in memory. Fortunately, while the number of fractional solutions to maximum coverage is unbounded,

69 by exploiting the structure of the solutions returned by the MWU method, we can limit the number of plausible solutions of this oracle and approximately solve the average constraint, thereby reducing the space usage to for a log 푛 -pass algorithm. 푂̃︀(푚) 푂( 휀2 )

Step 2. To further reduce the number of required passes, we observe that the weights of the constraints change slowly. Thus, in a single pass, we can sample the elements for multiple rounds in advance, and then perform rejection (sub-)sampling to obtain an unbiased set of samples for each subsequent round. This will lead to a streaming algorithm with 푝 passes and 푚푛푂(1/푝) space.

Extension. We also extend our result to handle general covering LPs. More specifically, in the LP relaxation of Set Cover, maximize c⊤x subject to Ax b and x 0, A has ≥ ≥ entries from 0, 1 whereas entries of b and c are all ones. If the non-zero entries instead { } belong to a range [1, 푀], we increase the number of sampled elements by poly(푀) to handle discrepancies between coefficients, leading toa poly(푀)-multiplicative overhead in the space usage.

3.2. MWU Framework of the Streaming Algorithm for Fractional Set Cover

In this section, we present a basic streaming algorithm that computes a (1 + 휀)-approximate solution of the LP relaxation of Set Cover for any 휀 > 0 via the MWU framework. We will, in the next section, improve it into an efficient algorithm that achieves the claimed 푂(푝) passes and 푂̃︀(푚푛^{1/푝}) space complexity.

SetCover-LP ◁ Input: 풰, ℱ
    minimize    ∑_{푆∈ℱ} 푥푆
    subject to  ∑_{푆:푒∈푆} 푥푆 ≥ 1    ∀푒 ∈ 풰
                푥푆 ≥ 0              ∀푆 ∈ ℱ

Figure 3.2.1: LP relaxation of Set Cover.

Algorithm 4 FracSetCover returns a (1 + 휀)-approximate solution of SetCover-LP.
1: procedure FracSetCover(휀)
2:   ◁ finds a feasible (1 + 휀)-approximate solution in 푂(log 푛/휀) iterations
3:   for ℓ ∈ {(1 + 휀/3)^푖 | 0 ≤ 푖 ≤ log_{1+휀/3} 푛} in parallel do
4:     xℓ ← FeasibilityTest(ℓ, 휀/3)
5:     ◁ returns a solution of objective value at most (1 + 휀/3)ℓ when ℓ ≥ 푘.
6:   return xℓ* where ℓ* ← min{ℓ : xℓ is not INFEASIBLE}

Let 풰 and ℱ be the ground set of elements and the collection of sets, respectively, and recall that |풰| = 푛 and |ℱ| = 푚. Let x ∈ R^푚 be a vector indexed by the sets in ℱ, where 푥푆 denotes the value assigned to the set 푆. Our goal is to compute an approximate solution to the LP in Figure 3.2.1. Throughout the analysis we assume 휀 ≤ 1/2, and ignore the case where some element never appears in any set, as it is easy to detect in a single pass that no cover is valid. For ease of reading, we write 푂̃︀ and Θ̃︀ to hide polylog(푚, 푛, 1/휀) factors.

Outline of the algorithm. Let 푘 denote the optimal objective value, and 0 < 휀 ≤ 1/2 be a parameter. The outline of the algorithm is shown in FracSetCover (Algorithm 4). This algorithm makes calls to the subroutine FeasibilityTest that, given a parameter ℓ, with high probability either returns a solution of objective value at most (1 + 휀/3)ℓ, or detects that the optimal objective value exceeds ℓ. Consequently, we may search for the right value of ℓ by considering all values in {(1 + 휀/3)^푖 | 0 ≤ 푖 ≤ log_{1+휀/3} 푛}. As for some value of ℓ it holds that 푘 ≤ ℓ ≤ 푘(1 + 휀/3), we obtain a solution of size at most (1 + 휀/3)ℓ ≤ (1 + 휀/3)(1 + 휀/3)푘 ≤ (1 + 휀)푘, which gives an approximation factor of (1 + 휀). This whole process of searching for 푘 increases the space complexity of the algorithm by at most a multiplicative factor of log_{1+휀/3} 푛 ≈ (3/휀) log 푛. The FeasibilityTest subroutine employs the multiplicative weights update method

(MWU) which is described next.
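To make the search over ℓ concrete, here is a minimal Python sketch of the outer loop of Algorithm 4, under the assumption that a callback feasibility_test(ℓ, 휀) (a stand-in for the FeasibilityTest subroutine described below) returns a solution vector when ℓ ≥ 푘 and None when it reports INFEASIBLE; in the streaming algorithm the guesses are run in parallel over the same passes, whereas the sketch runs them sequentially.

import math

def frac_set_cover(feasibility_test, n, eps):
    """Sketch of Algorithm 4: try guesses l = (1 + eps/3)^i in increasing order.

    feasibility_test(l, eps') is assumed to return a solution of objective
    value at most (1 + eps') * l when l >= k, or None when infeasible.
    """
    base = 1 + eps / 3
    num_guesses = int(math.log(n, base)) + 1
    for i in range(num_guesses + 1):
        l = base ** i
        x = feasibility_test(l, eps / 3)
        if x is not None:          # smallest feasible guess found
            return x
    return None                    # no cover exists (some element uncovered)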

3.2.1. Preliminaries of the MWU method for solving covering LPs

In the following, we describe the MWU framework. The claims presented here are standard

results of the MWU method. For more details, see e.g. Section 3 of [16]. Note that we introduce the general LP notation as it simplifies the presentation later on.

Let Ax ≥ b be a set of linear constraints, and let 풫 ≜ {x ∈ R^푚 : x ≥ 0} be the polytope of the non-negative orthant. For a given error parameter 0 < 훽 < 1, we would like to solve an approximate version of the feasibility problem by doing one of the following:

∙ Compute x̂ ∈ 풫 such that A푖x̂ − 푏푖 ≥ −훽 for every constraint 푖.

∙ Correctly report that the system Ax ≥ b has no solution in 풫.

The MWU method solves this problem assuming the existence of the following oracle that takes a distribution p over the constraints and finds a solution x̂ that satisfies the constraints on average over p.

Definition 3.2.1. Let 휑 ≥ 1 be a width parameter and 0 < 훽 < 1 be an error parameter. A (1, 휑)-bounded (훽/3)-approximate oracle is an algorithm that takes as input a distribution p and does one of the following:

∙ Returns a solution x̂ ∈ 풫 satisfying
  – p⊤Ax̂ ≥ p⊤b − 훽/3, and
  – A푖x̂ − 푏푖 ∈ [−1, 휑] for every constraint 푖.

∙ Correctly reports that the inequality p⊤Ax ≥ p⊤b has no solution in 풫.

The MWU algorithm for solving covering LPs involves 푇 rounds. It maintains the (non-negative) weight of each constraint in Ax ≥ b, which measures how much it has been satisfied by the solutions chosen so far. Let w^푡 denote the weight vector at the beginning of round 푡, and initialize the weights to w¹ ≜ 1. Then, for rounds 푡 = 1, . . . , 푇, define the probability vector p^푡 proportional to those weights w^푡, and use the oracle above to find a solution x^푡. If the oracle reports that the system p⊤Ax ≥ p⊤b is infeasible, the MWU algorithm also reports that the original system Ax ≥ b is infeasible, and terminates. Otherwise, define the cost vector incurred by x^푡 as m^푡 ≜ (1/휑)(Ax^푡 − b), then update the weights so that 푤^{푡+1}_푖 ≜ 푤^푡_푖(1 − 훽푚^푡_푖/6) and proceed to the next round. Finally, the algorithm returns the average solution x = (1/푇) ∑_{푡=1}^{푇} x^푡.

The theorem below (e.g., Theorem 3.5 of [16]) shows that 푇 = 푂(휑 log 푛/훽²) is sufficient for the MWU algorithm to correctly solve the problem, yielding A푖x̂ − 푏푖 ≥ −훽 for every constraint, where 푛 is the number of constraints. In particular, the algorithm requires 푇 calls to the oracle.

Theorem 3.2.2 (MWU Theorem [16]). For every 0 < 훽 < 1 and 휑 ≥ 1, the MWU algorithm either solves the Feasibility-Covering-LP problem up to an additive error of 훽 (i.e., attains A푖x − 푏푖 ≥ −훽 for every 푖) or correctly reports that the LP is infeasible, making only 푂(휑 log 푛/훽²) calls to a (1, 휑)-bounded (훽/3)-approximate oracle of the LP.
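As a concrete illustration of this loop (not of the streaming implementation developed below), the following Python sketch runs the generic MWU iteration against a user-supplied oracle callback; the names and the dense-matrix representation are assumptions made only for the sake of the example.

import numpy as np

def mwu_covering(A, b, oracle, phi, beta, T):
    """Generic MWU loop for the approximate covering feasibility problem.

    oracle(p) is assumed to return a solution x with p @ (A @ x) >= p @ b - beta/3
    and A @ x - b entrywise in [-1, phi], or None if no such x exists.
    Returns the average solution, or None if the LP is reported infeasible.
    """
    w = np.ones(A.shape[0])               # one weight per covering constraint
    solutions = []
    for _ in range(T):
        p = w / w.sum()                   # probability vector over constraints
        x = oracle(p)
        if x is None:                     # average constraint is infeasible
            return None
        m = (A @ x - b) / phi             # cost vector incurred by x
        w = w * (1 - beta * m / 6)        # under-covered constraints gain weight
        solutions.append(x)
    return np.mean(solutions, axis=0)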

3.2.2. Semi-streaming MWU-based algorithm for fractional Set Cover

Setting up our MWU algorithm. As described in the overview, we wish to solve, as a subroutine, the decision variant of SetCover-LP known as Feasibility-SC-LP given in Figure 3.2.2a, where the parameter ℓ serves as the guess for the optimal objective value.

Feasibility-SC-LP ◁ Input: 풰, ℱ, ℓ
    ∑_{푆∈ℱ} 푥푆 ≤ ℓ
    ∑_{푆:푒∈푆} 푥푆 ≥ 1    ∀푒 ∈ 풰
    푥푆 ≥ 0              ∀푆 ∈ ℱ
(a) LP relaxation of Feasibility Set Cover.

Feasibility-Covering-LP ◁ Input: A, b, c, ℓ
    c⊤x ≤ ℓ    (objective value)
    Ax ≥ b     (covering)
    x ≥ 0      (non-negativity)
(b) LP relaxation of Feasibility Covering.

Figure 3.2.2: LP relaxations of the feasibility variant of set cover and general covering problems.

To follow the conventional notation for solving LPs in the MWU framework, consider

the more standard form of covering LPs denoted as Feasibility-Covering-LP given in Figure 3.2.2b. For our purpose, A_{푛×푚} is the element-set incidence matrix indexed by 풰 × ℱ; that is, 퐴_{푒,푆} = 1 if 푒 ∈ 푆, and 퐴_{푒,푆} = 0 otherwise. The vectors b and c are both all-ones vectors indexed by 풰 and ℱ, respectively. We emphasize that, unconventionally for our system Ax ≥ b, there are 푛 constraints (i.e. elements) and 푚 variables (i.e. sets). Employing the MWU approach for solving covering LPs, we define the polytope

    풫ℓ ≜ {x ∈ R^푚 : c⊤x ≤ ℓ and x ≥ 0}.

Observe that by applying the MWU algorithm to this polytope 풫ℓ and the constraints Ax ≥ b, we obtain a solution x ∈ 풫ℓ such that A_푒(x/(1 − 훽)) ≥ (푏푒 − 훽)/(1 − 훽) = 1 = 푏푒, where A_푒 denotes the row of A corresponding to 푒. This yields a (1 + 푂(휀))-approximate solution for 훽 = 푂(휀). Unfortunately, we cannot implement the MWU algorithm on the full input in our streaming context. Therefore, the main challenge is to implement the following two subtasks of the MWU algorithm in the streaming setting. First, we need to design an oracle that solves the average constraint in the streaming setting. Moreover, we need to be able to efficiently update the weights for the subsequent rounds.

Covering the common elements. Before we proceed to applying the MWU framework, we add a simple first step to our implementation of FeasibilityTest (Algorithm 5) that will greatly reduce the amount of space required in implementing the MWU algorithm. This can be interpreted as the fractional version of Set Sampling described in [61]. In our subroutine, we partition the elements into the common elements that occur more frequently, which will be covered if we simply choose a uniform vector solution, and the rare elements

that occur less frequently, for which we perform the MWU algorithm to compute a good solution. In one pass we can find all frequently occurring elements by counting the number of sets containing each element. The amount of required space to perform this task is

푂(푛 log 푚). We call an element that appears in at least 푚/(훼ℓ) sets common, and we call it rare otherwise, where 훼 = Θ(휀). Since we are aiming for a (1 + 휀)-approximation, we can define x^cmn as a vector all of whose entries are 훼ℓ/푚. The total cost of x^cmn is 훼ℓ and all common elements are covered by x^cmn. Thus, throughout the algorithm we may restrict our attention to the rare elements.
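A minimal Python sketch of this preprocessing step, assuming the stream is presented as (set id, element list) pairs; the helper names are illustrative only.

from collections import Counter

def split_common_rare(stream, m, ell, alpha):
    """One pass: count element occurrences, then classify elements.

    An element is 'common' if it occurs in at least m / (alpha * ell) sets;
    such elements are covered by the uniform solution whose entries all equal
    alpha * ell / m, so only the rare elements are left for the MWU phase.
    """
    freq = Counter()
    for _, elements in stream:            # single pass over the set stream
        for e in elements:
            freq[e] += 1
    threshold = m / (alpha * ell)
    common = {e for e, c in freq.items() if c >= threshold}
    rare = {e for e in freq if e not in common}
    x_cmn_entry = alpha * ell / m         # weight placed on every set
    return common, rare, x_cmn_entry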

Our goal now is to construct an efficient MWU-based algorithm, which finds a solution x^rare covering the rare elements, with objective value at most ℓ/(1 − 훽) ≤ (1 + 휀 − 훼)ℓ. We note that our implementation does not explicitly maintain the weight vector w^푡 described in Section 3.2.1, but instead updates (and normalizes) its probability vector p^푡 in every round.

3.2.3. First Attempt: Simple Oracle and Large Width

A greedy solution for the oracle. We implement the oracle for the MWU algorithm such that 휑 = ℓ, thus requiring Θ(ℓ log 푛/훽²) iterations (Theorem 3.2.2). In each iteration, we need an oracle that finds some solution x ∈ 풫ℓ satisfying p⊤Ax ≥ p⊤b − 훽/3, or decides that no solution in 풫ℓ satisfies p⊤Ax ≥ p⊤b.

Algorithm 5 A generic implementation of FeasibilityTest. Its performance depends on the implementations of Oracle and UpdateProb. We will investigate different implementations of Oracle.
1: procedure FeasibilityTest(ℓ, 휀)
2:   훼, 훽 ← 휀/3, p^curr ← 1_{푛×1}  ◁ the initial prob. vector for the MWU algorithm on 풰
3:   ◁ compute a cover of common elements in one pass
4:   x^cmn ← (훼ℓ/푚) · 1_{푚×1}, freq ← 0_{푛×1}
5:   for all sets 푆 in the stream do
6:     for all elements 푒 ∈ 푆 do
7:       freq_푒 ← freq_푒 + 1
8:   if 푒 appears in more than 푚/(훼ℓ) sets (i.e. freq_푒 > 푚/(훼ℓ)) then  ◁ common element
9:     푝^curr_푒 ← 0
10:  p^curr ← p^curr/‖p^curr‖  ◁ p^curr represents the current prob. vector
11:  x^total ← 0_{푚×1}
12:  ◁ multiplicative weight update algorithm for covering rare elements
13:  for 푖 = 1 to 푇 do
14:    ◁ solve the corresp. oracle of MWU and decide if the solution is feasible
15:    try x ← Oracle(p^curr, ℓ, ℱ)
16:    x^total ← x^total + x  ◁ in one pass, update p according to x
17:    z ← 0_{푛×1}
18:    for all sets 푆 in the stream do
19:      for all elements 푒 ∈ 푆 do
20:        푧푒 ← 푧푒 + 푥푆
21:    if (p^curr)⊤z < 1 − 훽/3 then  ◁ detect infeasible solutions returned by Oracle
22:      report INFEASIBLE
23:    p^curr ← UpdateProb(p^curr, z)
24:  x^rare ← x^total/((1 − 훽)푇)  ◁ scale up the solution to cover rare elements
25:  return x^cmn + x^rare

Observe that p⊤Ax is maximized when we place value ℓ on 푥_{푆*}, where 푆* achieves the maximum value 푝푆 ≜ ∑_{푒∈푆} 푝푒. Further, for our application, b = 1, so p⊤b = 1. Our implementation HeavySetOracle of Oracle, given in Algorithm 6 below, is a deterministic greedy algorithm that finds a solution based on this observation. As A_푒x ≤ ‖x‖₁ ≤ ℓ, HeavySetOracle implements a (1, ℓ)-bounded (훽/3)-approximate oracle. Therefore, the implementation of FeasibilityTest with HeavySetOracle computes a solution of objective value at most (훼 + 1/(1 − 훽))ℓ < (1 + 휀)ℓ when ℓ ≥ 푘, as promised.

Finally, we track the space usage, which concludes the complexities of the current version

of our algorithm: it only stores vectors of length 푚 or 푛, whose entries each require a logarithmic number of bits, yielding the following theorem.

Algorithm 6 HeavySetOracle computes 푝푆 of every set given the set system in a stream or stored memory, then returns the solution x that optimally places value ℓ on the corresponding entry. It reports INFEASIBLE if there is no sufficiently good solution.
1: procedure HeavySetOracle(p, ℓ, ℱ)
2:   compute 푝푆 for all 푆 ∈ ℱ while reading the set system
3:   푆* ← argmax_{푆∈ℱ} 푝푆
4:   if 푝_{푆*} < (1 − 훽/3)/ℓ then
5:     report INFEASIBLE

6:   x ← 0_{푚×1}, 푥_{푆*} ← ℓ
7:   return x

Theorem 3.2.3. There exists a streaming algorithm that w.h.p. returns a (1 + 휀)-approximate fractional solution of SetCover-LP(풰, ℱ) in 푂(푘 log 푛/휀²) passes and using 푂̃︀(푚 + 푛) memory for any positive 휀 ≤ 1/2. The algorithm works in both set arrival and edge arrival streams.

The presented algorithm suffers from a large number of passes over the input. In particular,

we are interested in solving the fractional Set Cover in a constant number of passes using sublinear space. To this end, we first reduce the required number of rounds in MWU by a more complicated implementation of Oracle.
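For illustration, a single-pass Python sketch of the greedy oracle of Algorithm 6, with the same (set id, element list) stream representation as before and None standing in for INFEASIBLE:

def heavy_set_oracle(stream, p, ell, beta):
    """Greedy (1, ell)-bounded oracle: place the whole budget ell on the set S*
    maximizing p_S = sum of p_e over e in S.  Returns {set_id: value}, or None
    (INFEASIBLE) if even the heaviest set has p_S < (1 - beta/3) / ell.
    """
    best_set, best_weight = None, -1.0
    for set_id, elements in stream:       # compute p_S while reading the stream
        p_S = sum(p.get(e, 0.0) for e in elements)
        if p_S > best_weight:
            best_set, best_weight = set_id, p_S
    if best_weight < (1 - beta / 3) / ell:
        return None                       # no sufficiently good solution exists
    return {best_set: ell}                # x is zero on all other sets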

3.3. Max Cover Problem and its Application to Width Reduction

In this section, we improve the algorithm described in the previous section and prove the following result.

Theorem 3.3.1. There exists a streaming algorithm that w.h.p. returns a (1 + 휀)-approximate fractional solution of SetCover-LP(풰, ℱ) in 푝 passes and uses 푂̃︀(푚푛^{푂(1/(푝휀))} + 푛) memory for any 2 ≤ 푝 ≤ polylog(푛) and 0 < 휀 ≤ 1/2. The algorithm works in both set arrival and edge arrival streams.

Recall that in implementing Oracle, we must find a solution x of total size ‖x‖₁ ≤ ℓ with a sufficiently large weight p⊤Ax. Our previous implementation chooses only one good

entry 푥푆 and places its entire budget ℓ on this entry. As the width of the solution is roughly the maximum amount an element is over-covered by x, this implementation induces a width of ℓ. In this section, we design an oracle that returns a solution in which the budget is distributed more evenly among the entries of x to reduce the width. To this end, we design

an implementation of Oracle of the MWU approach based on the Max ℓ-Cover problem (whose precise definition will be given shortly). The solution to our Max ℓ-Cover aids in reducing the width of our Oracle solution to a constant, so the required number of rounds of the MWU algorithm decreases to 푂(log 푛/휀²), independent of ℓ. Note that, if the objective value of an optimal solution of Set Cover(풰, ℱ) is ℓ, then a solution of width 표(ℓ) may not exist, as shown in Lemma 3.3.2. This observation implies that we need to work with a different set system. Besides having small width, an optimal solution of the Set Cover instance on the new set system should have the same objective value as the optimal solution of Set Cover(풰, ℱ).

Lemma 3.3.2. There exists a set system in which the direct application of the MWU framework in computing a (1 + 휀)-approximate solution induces width 휑 = Ω(푘), where 푘 is the optimal objective value. Moreover, there exists a set system in which the approach from the previous section (which handles the frequent and rare elements differently) has width 휑 = Θ(푛) = Θ(√(푚/휀)).

Proof: For the first claim, we consider an arbitrary set system, then modify it by adding a common element 푒 to all sets. Recall that the MWU framework returns an average of the solutions from all rounds. Thus there must exist a round where the oracle returns a solution x of size ‖x‖₁ = Θ(푘). For the added element 푒, this solution has ∑_{푆:푒∈푆} 푥푆 = ∑_{푆∈ℱ} 푥푆 = Θ(푘), inducing width 휑 = Ω(푘).

For the second claim, consider the following set system with 푘 = √(푚/휀) and 푛 = 2푘 + 1. For 푖 = 1, . . . , 푘, let 푆푖 = {푒푖, 푒_{푘+푖}, 푒_{2푘+1}}, whereas the remaining 푚 − 푘 sets are arbitrary subsets of {푒1, . . . , 푒푘}. Observe that 푒_{푘+푖} is contained only in 푆푖, so 푥_{푆푖} = 1 in any valid set cover. Consequently, the solution x with 푥_{푆1} = ··· = 푥_{푆푘} = 1 and all other entries 0 forms the unique (fractional) minimum set cover of size 푘 = √(푚/휀). Next, recall that an element is considered rarely occurring if it appears in at most 푚/(훼ℓ) > 푚/(휀푘) sets. As 푒_{푘+1}, . . . , 푒_{2푘} each occur only once, and 푒_{2푘+1} appears in only 푘 = √(푚/휀) = 푚/(휀푘) sets, these 푘 + 1 elements are deemed rare and thus handled by the MWU framework. The solution computed by the MWU framework satisfies ∑_{푆:푒∈푆} 푥푆 ≥ 1 − 훽 for every 푒, and in particular, for each 푒 ∈ {푒_{푘+1}, . . . , 푒_{2푘}}. Therefore, the average solution places a total weight of at least (1 − 훽) · Θ(푘) on 푥_{푆1}, . . . , 푥_{푆푘}, so there must exist a round that places at least the same total weight on these sets. However, these 푘 sets all contain 푒_{2푘+1}, yielding ∑_{푆:푒_{2푘+1}∈푆} 푥푆 ≥ (1 − 훽) · Θ(푘) = Ω(푘), implying a width of Ω(푘) = Ω(√(푚/휀)). □

Extended set system. First, we consider the extended set system (풰, ℱ̆), where ℱ̆ is the collection containing all subsets of sets in ℱ; that is,

    ℱ̆ ≜ {푅 : 푅 ⊆ 푆 for some 푆 ∈ ℱ}.

It is straightforward to see that the optimal objective value of Set Cover over (풰, ℱ̆) is equal to that of (풰, ℱ): we only add subsets of the original sets to create ℱ̆, and we may replace any subset from ℱ̆ in our solution with its original set in ℱ. Moreover, we may prune any collection of sets from ℱ into a collection from ℱ̆ of the same cardinality so that this pruned collection not only covers the same elements, but also each of these elements is covered exactly once. This extended set system is defined for the sake of analysis only: we will never explicitly handle an exponential number of sets throughout our algorithm.

We define an ℓ-cover as a collection of sets of total weight ℓ. Although the pruning of an ℓ-cover reduces the width, the total weight p⊤Ax of the solution will decrease. Thus, we consider the weighted constraint of the form

    ∑_{푒∈풰} 푝푒 · min{1, ∑_{푆:푒∈푆} 푥푆} ≥ 1;

that is, we can only gain the value 푝푒 once, without any multiplicity larger than 1. The problem of maximizing the left-hand side is known as the weighted max coverage problem: for a parameter ℓ, find an ℓ-cover such that the total value of the 푝푒's of the covered elements is maximized.
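For a given fractional solution the truncated objective is easy to evaluate directly; the following small Python helper (with an assumed dictionary representation of the set system) does exactly that.

def weighted_coverage(x, p, sets_of):
    """Evaluate sum_e p_e * min(1, sum_{S: e in S} x_S).

    x       : dict mapping set id -> fractional value x_S
    p       : dict mapping element -> weight p_e
    sets_of : dict mapping element -> iterable of set ids containing it
    """
    total = 0.0
    for e, p_e in p.items():
        covered = sum(x.get(S, 0.0) for S in sets_of.get(e, ()))
        total += p_e * min(1.0, covered)   # each element counts at most once
    return total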

3.3.1. The Maximum Coverage Problem

In the design of our algorithm, we consider the weighted Max 푘-Cover problem, which is closely related to Set Cover. Extending upon the brief description given earlier, we fully specify the LP relaxation of this problem. In the weighted Max 푘-Cover(풰, ℱ, ℓ, p), given a ground set of elements 풰, a collection of sets ℱ over the ground set, a budget parameter ℓ, and a weight vector p, the goal is to return ℓ sets in ℱ whose weighted coverage, the total weight of all covered elements, is maximized.

MaxCover-LP ◁ Input: 풰, ℱ, ℓ, p
    maximize    ∑_{푒∈풰} 푝푒푧푒
    subject to  ∑_{푆:푒∈푆} 푥푆 ≥ 푧푒    ∀푒 ∈ 풰
                ∑_{푆∈ℱ} 푥푆 = ℓ
                0 ≤ 푧푒 ≤ 1           ∀푒 ∈ 풰
                푥푆 ≥ 0               ∀푆 ∈ ℱ

Figure 3.3.1: LP relaxation of weighted Max 푘-Cover.

Moreover, since we are aiming for a

fractional solution of Set Cover, we consider the LP relaxation of weighted Max 푘-Cover,

MaxCover-LP (see Figure 3.3.1); in this LP relaxation, 푧푒 denotes the fractional amount that an element is covered, and hence is capped at 1. As an intermediate goal, we aim to compute an approximate solution of MaxCover-LP, given that the optimal solution covers all elements in the ground set, or to correctly detect that no solution has weighted coverage of more than (1 − 휀). In our application, the vector p is always a probability vector: p ≥ 0 and ∑_{푒∈풰} 푝푒 = 1. We make the following useful observation.

Observation 3.3.3. Let 푘 be the value of an optimal solution of SetCover-LP(풰, ℱ) and let p be an arbitrary probability vector over the ground set. Then there exists a fractional solution of MaxCover-LP(풰, ℱ, ℓ, p) whose weighted coverage is one if ℓ ≥ 푘.

훿-integral near-optimal solutions of MaxCover-LP. Our plan is to solve MaxCover-LP over a randomly projected set system, and argue that with high probability this will result in a valid Oracle. Such an argument requires an application of the union bound over the set of solutions, which is generally of unbounded size. To this end, we consider a more restrictive domain of 훿-integral solutions: this domain has bounded size, but is still guaranteed to contain a sufficiently good solution.

Definition 3.3.4 (훿-integral solution). A fractional solution x_{푛×1} of an LP is 훿-integral if (1/훿) · x is an integral vector. That is, for each 푖 ∈ [푛], 푥푖 = 푣푖훿 where each 푣푖 is an integer.

Algorithm 7 MaxCoverOracle returns a fractional ℓ-cover with weighted coverage at least 1 − 훽/3 w.h.p. if ℓ ≥ 푘. It provides no guarantee on its behavior if ℓ < 푘.
1: procedure MaxCoverOracle((풰, ℱ), ℓ)
2:   x ← solution returned by MWU using HeavySetOracle on MaxCover-LP
3:   return x

Next we claim that MaxCoverOracle, given in Algorithm 7 above, which is the MWU algorithm with HeavySetOracle for solving MaxCover-LP, results in a 훿-integral solution.

Lemma 3.3.5. Consider a MaxCover-LP with the optimal objective value OPT (where the weights of elements form a probability vector). There exists a Θ(휀²MC/log 푛)-integral solution of MaxCover-LP whose objective value is at least (1 − 휀MC)OPT. In particular, if an optimal solution covers all elements of 풰 (ℓ ≥ 푘), MaxCoverOracle returns a solution whose weighted coverage is at least 1 − 휀MC in polynomial time.

Proof: Let (x*, z*) denote the optimal solution of value OPT to MaxCover-LP, which implies that ‖x*‖₁ ≤ ℓ and Ax* ≥ z*. Consider the following covering LP: minimize ‖x‖₁ subject to Ax ≥ z* and x ≥ 0. Clearly there exists an optimal solution of objective value ℓ, namely x*. This covering LP may be solved via the MWU framework. In particular, we may use the oracle that picks one set 푆 with maximum weight (as maintained in the MWU framework) and places its entire budget on 푥푆. For an accurate guess ℓ′ = Θ(ℓ) of the optimal value, this algorithm returns an average x of 푇 = Θ(ℓ′ log 푛/휀²MC) = Θ(ℓ log 푛/휀²MC) oracle solutions. Observe that the outputted solution is of the form 푥푆 = 푣푆 ℓ′/푇 = 푣푆훿, where 푣푆 is the number of rounds in which 푥푆 is chosen by the oracle, and 훿 = ℓ′/푇 = 휀²MC ℓ′/(ℓ log 푛) = Θ(휀²MC/log 푛). In other words, x is Θ(휀²MC/log 푛)-integral. By Theorem 3.2.2, x satisfies Ax ≥ (1 − 휀MC)z*. Then in MaxCover-LP, the solution (x, (1 − 휀MC)z*) yields coverage at least p⊤((1 − 휀MC)z*) = (1 − 휀MC)p⊤z* = (1 − 휀MC)OPT. □

Pruning a fractional ℓ-cover. In our analysis, we aim to solve the Set Cover problem under the extended set system. We claim that any solution x with coverage z in the actual set system may be turned into a pruned solution x̆ in the extended set system that provides the same coverage z, but satisfies the strict equality ∑_{푆̆∈ℱ̆:푒∈푆̆} 푥̆_푆̆ = 푧푒. Since 푧푒 ≤ 1, the pruned solution satisfies the condition for an oracle with width one.

Algorithm 8 The Prune subroutine lifts a solution in ℱ to a solution in ℱ̆ with the same MaxCover-LP objective value and width 1. The subroutine returns z, the amount by which members of ℱ̆ cover each element. The actual pruned solution x̆ may be computed but has no further use in our algorithm and thus is not returned.
1: procedure Prune(x)
2:   x̆ ← 0_{|ℱ̆|×1}, z ← 0_{푛×1}  ◁ maintain the pruned solution and its coverage amount
3:   for all 푆 ∈ ℱ do
4:     푆̆ ← 푆
5:     while 푥푆 > 0 do
6:       푟 ← min{푥푆, min_{푒∈푆̆}(1 − 푧푒)}  ◁ weight to be moved from 푥푆 to 푥̆_푆̆
7:       푥푆 ← 푥푆 − 푟, 푥̆_푆̆ ← 푥̆_푆̆ + 푟  ◁ move weight to the pruned solution
8:       for all 푒 ∈ 푆̆ do
9:         푧푒 ← 푧푒 + 푟  ◁ update coverage accordingly
10:      푆̆ ← 푆̆ ∖ {푒 ∈ 푆̆ : 푧푒 = 1}  ◁ remove 푒 with 푧푒 = 1 from 푆̆
11:  return z

Lemma 3.3.6. A fractional ℓ-cover x of (풰, ℱ) can be converted, in polynomial time, to a fractional ℓ-cover x̆ of (풰, ℱ̆) such that for each element 푒, its coverage 푧푒 = ∑_{푆̆∈ℱ̆:푒∈푆̆} 푥̆_푆̆ = min(∑_{푆:푒∈푆} 푥푆, 1).

Proof: Consider the algorithm Prune in Algorithm 8. As we pick a valid amount 푟 ≤ 푥푆 to move from 푥푆 to 푥̆_푆̆ at each step, x̆ must be an ℓ-cover (in the extended set system) when Prune finishes. Observe that if ∑_{푆:푒∈푆} 푥푆 < 1 then 푒 will never be removed from any 푆̆, so 푧푒 is increased by 푥푆 for every 푆, and thus 푧푒 = ∑_{푆:푒∈푆} 푥푆. Otherwise, the condition 푟 ≤ 1 − 푧푒 ensures that 푧푒 stops increasing precisely when it reaches 1. Each 푆 takes up to 푛 + 1 rounds in the while loop, as one element 푒 ∈ 푆̆ is removed at the end of each round. There are at most 푚 sets, so the algorithm must terminate (in polynomial time). We note that in Section 3.3.4, we need to adjust Prune to instead achieve the condition 푧푒 = min(A_푒x, 1) where entries of A are arbitrary non-negative values. We simply make the following modifications: choose 푟 ← min(푥푆, min_{푒∈푆̆} (1 − 푧푒)/퐴_{푒,푆}) and update 푧푒 ← 푧푒 + 푟 · 퐴_{푒,푆}, and the same proof follows. □

Remark that to update the weights in the MWU framework, it is sufficient to have the coverages ∑_{푆̆∈ℱ̆:푒∈푆̆} 푥̆_푆̆, which are the 푧푒's returned by Prune; the actual solution x̆ is not necessary. Observe further that our MWU algorithm can still use x instead of x̆ as its solution because x has no worse coverage than x̆ in every iteration, and so does the final, average solution. Lastly, notice that the coverage z returned by Prune has the simple formula 푧푒 = min(∑_{푆:푒∈푆} 푥푆, 1). That is, we introduce Prune to show the existence of x̆, but we never actually run Prune in our algorithm.
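For concreteness, a direct Python transcription of Prune, under an assumed dictionary representation of x and of the sets; as remarked above it is an analysis device only, and it returns just the coverage vector z.

def prune(x, elements_of):
    """Lift a fractional ell-cover x over F to one over the extended system.

    x           : dict set id -> x_S
    elements_of : dict set id -> list of elements of S
    Returns z, the coverage of each element by the pruned solution.
    """
    z = {}
    for S, val in x.items():
        remaining = val
        S_cur = [e for e in elements_of[S] if z.get(e, 0.0) < 1.0]
        while remaining > 1e-12 and S_cur:
            # largest weight movable without over-covering any element of S_cur
            r = min([remaining] + [1.0 - z.get(e, 0.0) for e in S_cur])
            remaining -= r
            for e in S_cur:
                z[e] = z.get(e, 0.0) + r
            S_cur = [e for e in S_cur if z[e] < 1.0]   # drop fully covered elements
    return z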

3.3.2. Sampling-Based Oracle for Fractional Max Coverage

In the previous section, we simply needed to compute the values 푝푆 in order to construct a solution for the Oracle. Here, as we aim to bound the width of Oracle, our new task is to find a fractional ℓ-cover x whose weighted coverage is at least 1 − 훽/3. The element sampling technique, which is also known from prior work in streaming Set Cover and Max 푘-Cover, is to sample a few elements and solve the problem over the sampled elements only. Then, by applying the union bound over all possible candidate solutions, it is shown that w.h.p. a nearly optimal cover of the sampled elements also covers a large fraction of the whole ground set. This argument applies to the aforementioned problems precisely because there are standard ways of bounding the number of all integral candidate solutions (e.g. ℓ-covers). However, in the fractional setting, there are infinitely many solutions. Consequently, we employ the notion of 훿-integral solutions, where the number of such solutions is bounded. In Lemma 3.3.5, we showed that there always exists a 훿-integral solution to MaxCover-LP whose coverage is at least a (1 − 휀MC)-fraction of an optimal solution. Moreover, the number of all possible solutions is bounded by the number of ways to divide the budget ℓ into ℓ/훿 equal parts of value 훿 and distribute them (possibly with repetition) among 푚 entries:

Observation 3.3.7. The number of feasible 훿-integral solutions to MaxCover-LP(풰, ℱ, ℓ, p) is 푂(푚^{ℓ/훿}) for any multiple ℓ of 훿.

Next, we design our algorithm using the element sampling technique: we show that a (1 − 훽/3)-approximate solution of MaxCover-LP can be computed using the projection of all sets in ℱ over a set of elements of size Θ(ℓ log 푛 log 푚푛/훽⁴) picked according to p. For every fractional solution (x, z) and subset of elements 풱 ⊆ 풰, let 풞_풱(x) ≜ ∑_{푒∈풱} 푝푒푧푒 denote the coverage of elements in 풱, where 푧푒 = min(1, ∑_{푆:푒∈푆} 푥푆). We may omit the subscript 풱 in 풞_풱 if 풱 = 풰.

The following lemma, which is essentially an extension of the Element Sampling lemma of [61] for our application, MaxCover-LP, shows that a (1 − 휀MC)-approximate ℓ-cover over a set of sampled elements of size Θ(ℓ log 푛 log 푚푛/훾⁴) w.h.p. has a weighted coverage of at least (1 − 2훾)(1 − 휀MC) if there exists a fractional ℓ-cover whose coverage is 1. Thus, choosing 휀MC = 훾 = 훽/9 yields the desired guarantee for MaxCoverOracle, leading to the performance given in Theorem 3.3.9.

Lemma 3.3.8. Let 휀MC and 훾 be parameters. Consider the MaxCover-LP(풰, ℱ, ℓ, p) with optimal solution of value OPT, and let ℒ be a multi-set of 푠 = Θ(ℓ log 푛 log(푚푛)/훾⁴) elements sampled independently at random according to the probability vector p. Let x^sol be a (1 − 휀MC)-approximate Θ(훾²/log 푛)-integral ℓ-cover over the sampled elements. Then with high probability, 풞(x^sol) ≥ (1 − 2훾)(1 − 휀MC)OPT.

Proof: Consider the MaxCover-LP(풰, ℱ, ℓ, p) with optimal solution (x^OPT, z^OPT) of value OPT, and let x^sol be a (1 − 휀MC)-approximate Θ(훾²/log 푛)-integral ℓ-cover over the sampled elements and z^sol be its corresponding coverage vector. Denote the sampled elements by ℒ = {푒̂₁, . . . , 푒̂푠}. Observe that by defining each X푖 as a random variable that takes the value 푧^OPT_{푒̂푖} with probability 푝_{푒̂푖} and 0 otherwise, the expected value of X = ∑_{푖=1}^{푠} X푖 is

    E[X] = ∑_{푖=1}^{푠} E[X푖] = 푠 · ∑_{푒∈풰} 푝푒 푧^OPT_푒 = 푠 · 풞(x^OPT) = 푠 · OPT.

Let 휏 = 푠(1 − 훾)OPT. Since X푖 ∈ [0, 1], by applying the Chernoff bound to X, we obtain

    Pr[풞_ℒ(x^OPT) ≤ 휏] = Pr[X ≤ (1 − 훾)E[X]] ≤ 푒^{−훾²E[X]/3} ≤ 푒^{−Ω(ℓ log(푚푛) log 푛/훾²)} = (푚푛)^{−Ω(ℓ log 푛/훾²)}.

Therefore, since x^sol is a (1 − 휀MC)-approximate solution of MaxCover-LP(ℒ, ℱ, ℓ, p), with probability 1 − (푚푛)^{−Ω(ℓ log 푛/훾²)}, we have 풞_ℒ(x^sol) ≥ (1 − 휀MC)휏.

Next, by a similar approach, we show that for any fractional solution x, if 풞_ℒ(x) ≥ 풞_ℒ(x^OPT), then with probability 1 − (푚푛)^{−Ω(ℓ log 푛/훾²)}, 풞(x) ≥ ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT. Consider a fractional ℓ-cover (x, z) whose coverage is less than ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT. Let Y푖 denote a random variable that takes value 푧_{푒̂푖} with probability 푝_{푒̂푖} and 0 otherwise, and define Y = ∑_{푖=1}^{푠} Y푖. Then, E[Y푖] = 풞(x) < ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT. For ease of analysis, let each Y̅푖 ∈ [0, 1] be an auxiliary random variable that stochastically dominates Y푖 with expectation E[Y̅푖] = ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT, and Y̅ = ∑_{푖=1}^{푠} Y̅푖, which stochastically dominates Y with expectation E[Y̅] = 푠 · ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT = (1 − 휀MC)휏/(1 + 훾). We then have

    Pr[풞_ℒ(x) > (1 − 휀MC)휏] = Pr[Y > (1 − 휀MC)휏] = Pr[Y > (1 + 훾)E[Y̅]]
        ≤ Pr[Y̅ > (1 + 훾)E[Y̅]] ≤ 푒^{−훾²E[Y̅]/3} ≤ (푚푛)^{−Ω(ℓ log 푛/훾²)},

using the fact that ((1 − 훾)/(1 + 훾))(1 − 휀MC) = Θ(1) for the range of parameters of interest. Thus,

    Pr[풞(x) ≤ ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT and 풞_ℒ(x) > (1 − 휀MC)휏] ≤ (푚푛)^{−Ω(ℓ log 푛/훾²)}.

In other words, except with probability (푚푛)^{−Ω(ℓ log 푛/훾²)}, a chosen solution x that offers at least as good empirical coverage over ℒ as x^OPT (namely x^sol) does have actual coverage of at least ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT. Since the total number of Θ(훾²/log 푛)-integral ℓ-covers is 푂(푚^{ℓ log 푛/훾²}) (Observation 3.3.7), applying the union bound, with probability at least 1 − 푂(푚^{ℓ log 푛/훾²}) · (푚푛)^{−Ω(ℓ log 푛/훾²)} = 1 − 1/poly(푚푛), a (1 − 휀MC)-approximate Θ(훾²/log 푛)-integral solution of MaxCover-LP(ℒ, ℱ, ℓ, p) has weighted coverage of at least ((1 − 훾)/(1 + 훾))(1 − 휀MC)OPT > (1 − 2훾)(1 − 휀MC)OPT over 풰. □

Theorem 3.3.9. There exists a streaming algorithm that w.h.p. returns a (1 + 휀)-approximate fractional solution of SetCover-LP(풰, ℱ) in 푂(log 푛/휀²) passes and uses 푂̃︀(푚/휀⁶ + 푛) memory for any positive 휀 ≤ 1/2. The algorithm works in both set arrival and edge arrival streams.

Proof: The algorithm clearly requires Θ(푇) passes to simulate the MWU algorithm. The required amount of memory, besides 푂̃︀(푛) for counting elements, is dominated by the projected set system. In each pass over the stream, we sample Θ(ℓ log 푚푛 log 푛/휀⁴) elements, and since they are rarely occurring, each is contained in at most Θ(푚/(휀ℓ)) sets. Finally, we run log_{1+Θ(휀)} 푛 = 푂(log 푛/휀) instances of the MWU algorithm in parallel to compute a (1 + 휀)-approximate solution. In total, our space complexity is Θ(ℓ log 푚푛 log 푛/휀⁴) · Θ(푚/(휀ℓ)) · 푂(log 푛/휀) = 푂̃︀(푚/휀⁶). □

3.3.3. Final Step: Running Several MWU Rounds Together

We complete our result by further reducing the number of passes at the expense of increasing the required amount of memory, yielding our full algorithm FastFeasibilityTest in Algorithm 9. More precisely, aiming for a 푝-pass algorithm, we show how to execute 푅 ≜ Θ(푇/푝) = Θ(log 푛/(푝훽²)) rounds of the MWU algorithm in a single pass. We show that this task may be accomplished with a multiplicative factor of 푓 · Θ(log 푚푛) increase in memory usage, where 푓 ≜ 푛^{Θ(1/(푝훽))}.

Advance sampling. Consider a sequence of 푅 consecutive rounds 푖 = 1, . . . , 푅. In order to implement the MWU algorithm for these rounds, we need (multi-)sets of sampled elements

1 푅 푖 1,..., 푅 according to probabilities p ,..., p , respectively (where p is the probability ℒ ℒ corresponding to round 푖). Since the probabilities of subsequent rounds are not known in advance, we circumvent this problem by choosing these sets 푖’s with probabilities according ℒ 1 to p , but the number of samples in each set will be 푖 = 푠 푓 Θ(log 푚푛) instead of 푠. |ℒ | · · 푖 ′ Then, once p is revealed, we sub-sample elements from 푖 to obtain as follow: for a (copy ℒ ℒ푖 푝푖 of) sampled element , add to ′ with probability 푒^ ; otherwise, simply discard it. 푒^ 푖 푒^ 푖 푝1푓 ∈ ℒ ℒ 푒^ Note that it is still left to be shown that the probability above is indeed at most 1. Since each 푒 was originally sampled with probability 푝1, then in ′ , the probability that a 푒 ℒ푖 sampled element is exactly 푖 . By having times the originally required 푒^ = 푒 푝푒/푓 푓 Θ(log 푚푛) · 푖 ′ ∑︀ 푝푒 number of samples 푠 in the first place, in expectation we still have E[ ] = 푖 = |ℒ푖| |ℒ | 푒∈풰 푓 (푠 푓 Θ(log 푚푛)) 1 = 푠 Θ(log 푚푛). Due to the Θ(log 푚푛) factor, by the Chernoff bound, we · · 푓 · conclude that with w.h.p. ′ 푠. Thus, we have a sufficient number of elements sampled |ℒ푖| ≥ with probability according to p푖 to apply Lemma 3.3.8, as needed.
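A short Python sketch of this advance-sampling-plus-rejection step, under the assumptions stated above (p1 and p_i are the round-1 and round-푖 probability vectors, and f bounds their ratio, as guaranteed by the probability-change bound shown below); the helper names are illustrative.

import random

def advance_sample(p1, s, f, c_log):
    """Draw s * f * c_log copies of elements according to p1 (round-1 probabilities)."""
    elements = list(p1.keys())
    weights = [p1[e] for e in elements]
    return random.choices(elements, weights=weights, k=int(s * f * c_log))

def rejection_subsample(L_i, p1, p_i, f):
    """Keep each pre-drawn copy e with probability p_i[e] / (p1[e] * f).

    Assuming p_i[e] <= f * p1[e] for every element (so the ratio is a valid
    probability), the kept copies are distributed as samples from p_i.
    """
    kept = []
    for e in L_i:
        if random.random() < p_i[e] / (p1[e] * f):
            kept.append(e)
    return kept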

Change in probabilities. As noted above, we must show that the probability that we

sub-sample each element is at most 1; that is, 푝^푖_푒/푝¹_푒 ≤ 푓 = 푛^{Θ(1/(푝훽))} for every element 푒 and every round 푖 = 1, . . . , 푅. We bound the multiplicative difference between the probabilities of two consecutive rounds as follows.

Lemma 3.3.10. Let p and p′ be the probability vectors of the elements before and after an update. Then for every element 푒, 푝′_푒 ≤ (1 + 푂(훽))푝푒.

Proof: Recall the weight update formula 푤^{푡+1}_푒 = 푤^푡_푒 (1 − 훽(Ă_푒x̆ − 푏푒)/(6휑)) of the MWU framework, where Ă_{푛×|ℱ̆|} represents the membership matrix corresponding to the extended set system (풰, ℱ̆). In our case, the desired coverage amount is 푏푒 = 1. By construction, we have Ă_푒x̆ = 푧푒 ≤ 1; therefore, our width is 휑 = 1, and −1 ≤ Ă_푒x̆ − 푏푒 ≤ 0. That is, the weight of each element cannot decrease, but may increase by at most a multiplicative factor of 1 + 훽/6, before normalization. Thus even after normalization no weight may increase by more than a factor of 1 + 훽/6 = 1 + 푂(훽). □

Therefore, after 푅 = Θ(log 푛/(푝훽²)) rounds, the probability of any element may increase by at most a factor of (1 + 푂(훽))^{Θ(log 푛/(푝훽²))} ≤ 푒^{Θ(log 푛/(푝훽))} = 푛^{Θ(1/(푝훽))} = 푓, as desired. This concludes the proof of Theorem 3.3.1.

Implementation details. We make a few remarks about the implementation given in Algorithm 9. First, even though we perform all sampling in advance, the decisions of MaxCoverOracle do not depend on any ℒ푖 of later rounds, and UpdateProb is entirely deterministic: there is no dependency issue between rounds. Next, we only need to perform

UpdateProb on the sampled elements ℒ = ℒ₁ ∪ · · · ∪ ℒ_푅 during the current 푅 rounds. We therefore denote the probabilities with a different vector q^푖 over the sampled elements ℒ only. Probabilities of elements outside ℒ are not required by MaxCoverOracle during these rounds, but we simply need to spend one more pass after executing 푅 rounds of MWU to aggregate the new probability vector p over all (rare) elements. Similarly, since MaxCoverOracle does not have the ability to verify, during the MWU algorithm, that each solution x^푖 returned by the oracle indeed provides sufficient coverage, we check all of them during this additional pass. Lastly, we again remark that this algorithm operates on the extended set system: the solution x returned by MaxCoverOracle has at least the same coverage as x̆. While x̆ is not explicitly computed, its coverage vector z can be computed exactly.

3.3.4. Extension to general covering LPs

Algorithm 9 An efficient implementation of FeasibilityTest which performs 푝 passes and consumes 푂̃︀(푚푛^{푂(1/(푝휀))} + 푛) space.
1: procedure FastFeasibilityTest(ℓ, 휀)
2:   훼, 훽 ← 휀/3, p^curr ← 1_{푛×1}  ◁ the initial prob. vector for the MWU algorithm on 풰
3:   ◁ see Algorithm 5 for implementation of the line below
4:   compute a cover of common elements in one pass
5:   x^total ← 0_{푚×1}
6:   ◁ MWU algorithm for covering rare elements
7:   for 푖 = 1 to 푝 do
8:     푅 ← Θ(log 푛/(푝훽²))  ◁ number of MWU iterations performed together
9:     ◁ in one pass, project all sets in ℱ over the collections of samples ℒ₁, . . . , ℒ_푅
10:    sample ℒ₁, . . . , ℒ_푅 according to p^curr, each of size ℓ푛^{Θ(1/(푝훽))} poly(log 푚푛)
11:    ℒ ← ℒ₁ ∪ · · · ∪ ℒ_푅, ℱ_ℒ ← ∅  ◁ ℒ is a set whereas ℒ₁, . . . , ℒ_푅 are multi-sets
12:    for all sets 푆 in the stream do
13:      ℱ_ℒ ← ℱ_ℒ ∪ {푆 ∩ ℒ}
14:    ◁ each pass simulates 푅 rounds of MWU
15:    for all 푒 ∈ ℒ do
16:      푞¹_푒 ← 푝^curr_푒  ◁ project p^curr_{푛×1} to q¹_{|ℒ|×1} over sampled elements
17:    q¹ ← q¹/‖q¹‖
18:    for all rounds 푖 = 1, . . . , 푅 do
19:      ℒ′푖 ← sample each elt 푒 ∈ ℒ푖 with probab. 푞^푖_푒/(푞¹_푒 푛^{Θ(1/(푝훽))})  ◁ rejection sampling
20:      x^푖 ← MaxCoverOracle(ℒ′푖, ℱ_ℒ, ℓ)  ◁ w.h.p. 풞(x^푖) ≥ 1 − 훽/3 when ℓ ≥ 푘
21:      ◁ in no additional pass, update probab. q over sampled elts accord. to x^푖
22:      z ← 0_{|ℒ|×1}  ◁ compute coverage over sampled elements
23:      for all element-set pairs 푒 ∈ 푆 where 푆 ∈ ℱ_ℒ do
24:        푧푒 ← min(푧푒 + 푥^푖_푆, 1)
25:      q^{푖+1} ← UpdateProb(q^푖, z)  ◁ only update weights of elements in ℒ
26:    ◁ in one pass, update probab. p^curr over all (rare) elts according to x¹, . . . , x^푅
27:    z¹, . . . , z^푅 ← 0_{푛×1}  ◁ compute coverage over all (rare) elements
28:    for all element-set pairs 푒 ∈ 푆 in the stream do
29:      for all rounds 푖 = 1, . . . , 푅 do
30:        푧^푖_푒 ← min(푧^푖_푒 + 푥^푖_푆, 1)
31:    for all rounds 푖 = 1, . . . , 푅 do
32:      if (p^curr)⊤z^푖 < 1 − 훽/3 then  ◁ detect infeasible solutions
33:        report INFEASIBLE
34:      x^total ← x^total + x^푖, p^curr ← UpdateProb(p^curr, z^푖)  ◁ perform actual updates
35:  x^rare ← x^total/((1 − 훽)푇)  ◁ scale up the solution to cover rare elements
36:  return x^cmn + x^rare

We remark that our MWU-based algorithm can be extended to solve a more general class of covering LPs. Consider the problem of finding a vector x that minimizes c⊤x subject to the constraints Ax ≥ b and x ≥ 0. In terms of the Set Cover problem, 퐴_{푒,푆} ≥ 0 indicates the multiplicity of an element 푒 in the set 푆, 푏푒 > 0 denotes the number of times we wish 푒 to

be covered, and 푐푆 > 0 denotes the cost per unit for the set 푆. Now define

    퐿 ≜ min_{(푒,푆):퐴_{푒,푆}≠0} 퐴_{푒,푆}/(푏푒푐푆)    and    푈 ≜ max_{(푒,푆)} 퐴_{푒,푆}/(푏푒푐푆).

Then, we may modify our algorithm to obtain the following result.

Theorem 3.3.11. There exists a streaming algorithm that w.h.p. returns a (1 + 휀)-approximate fractional solution to general covering LPs in 푝 passes and using 푂̃︀((푚푈/(휀⁶퐿)) · 푛^{푂(1/(푝휀))} + 푛) memory for any 3 ≤ 푝 ≤ polylog(푛), where the parameters 퐿 and 푈 are defined above. The algorithm works in both set arrival and edge arrival streams.

Proof: We modify our algorithm and provide an argument of its correctness as follows. First,

observe that we can convert the input LP into an equivalent LP with all entries 푏푒 = 푐푆 = 1

by simply replacing each 퐴_{푒,푆} with 퐴_{푒,푆}/(푏푒푐푆). Namely, let the new parameters be A′, b′ and c′, and consider the variable x′ where 푥′_푆 = 푐푆푥푆. It is straightforward to verify that c′⊤x′ = c⊤x and A′_푒x′ = A_푒x/푏푒, reducing the LP to the desired case. Thus, we may afford to record b and c, so that each value 퐴_{푒,푆}/(푏푒푐푆) may be computed on the fly. Henceforth we assume that all entries 푏푒 = 푐푆 = 1 and 퐴_{푒,푆} ∈ {0} ∪ [퐿, 푈]. Observe as well that the optimal objective value 푘 may be in the expanded range [1/푈, 푛/퐿], so the number of guesses must be increased from (log 푛)/휀 to (log(푛푈/퐿))/휀. Next consider the process for covering the rare elements. We instead use a uniform

solution x^cmn = (훼ℓ/푚) · 1. Observe that if an element occurs in at least 푚/(훼ℓ퐿) sets, then A_푒x^cmn = ∑_{푆:푒∈푆} 퐴_{푒,푆} · 훼ℓ/푚 ≥ (푚/(훼ℓ퐿)) · 퐿 · (훼ℓ/푚) = 1. That is, we must adjust our definition so that an element is considered common if it appears in at least 푚/(훼ℓ퐿) sets. Consequently, whenever we perform element sampling, the required amount of memory to store information of each element increases by a factor of 1/퐿.

Next consider Lemma 3.3.5, where we show the existence of integral solutions via the MWU algorithm with a greedy oracle. As the greedy implementation chooses a set 푆 and places the entire budget ℓ on 푥푆, the amount of coverage 퐴_{푒,푆}푥푆 may be as large as ℓ푈, as 퐴_{푒,푆} is no longer bounded by 1. Thus this application of the MWU algorithm has width 휑 = Θ(ℓ푈) and requires 푇 = Θ(ℓ푈 log 푛/휀²MC) rounds. Consequently, its solution becomes Θ(ℓ/푇) = Θ(휀²MC/(푈 log 푛))-integral. As noted in Observation 3.3.7, the number of potential solutions from the greedy oracle increases by a power of 푈. Then, in Lemma 3.3.8, we must reduce the error probability of each solution by the same power. We increase the number of samples 푠 by a factor of 푈 to account for this change, increasing the required amount of memory by the same factor.

As in the previous case, any solution x may always be pruned so that the width is reduced to 1: our algorithm Prune still works as long as the entries of A are non-negative. Therefore, the fact that entries of A may take on values other than 0 or 1 does not affect the number of rounds (or passes) of our overall application of the MWU framework. Thus, we may handle general covering LPs using a factor of 푂̃︀(푈/퐿) larger memory within the same number of passes. In particular, if the non-zero entries of the input are bounded in the range [1, 푀], this introduces a factor of 푂̃︀(푈/퐿) ≤ 푂̃︀(푀³) overhead in memory usage. □

Chapter 4

Sublinear Set Cover

4.1. Introduction

It is well-known that although the problem of finding an optimal solution is NP-complete, a natural greedy algorithm which iteratively picks the “best” remaining set is provably optimal.

The algorithm finds a solution of size at most 푘 ln 푛 where 푘 is the optimum cover size, and can be implemented to run in time linear in the input size. However, the input size itself

could be as large as Θ(푚푛), so for large data sets even reading the input might be infeasible. This raises a natural question: is it possible to solve minimum set cover in sub-linear time? This question was previously addressed in [145, 172], who showed that one can design constant running-time algorithms by simulating the greedy algorithm, under the assumption that the sets are of constant size and each element occurs in a constant number of sets. However, those constant-time algorithms have a few drawbacks: they only provide a mixed

multiplicative/additive guarantee (the output cover size is guaranteed to be at most 푘 · ln 푛 + 휖푛), the dependence of their running times on the maximum set size is exponential, and they only output the (approximate) minimum set cover size, not the cover itself. From a different perspective, [119] (building on [84]) showed that an 푂(1)-approximate solution to the fractional version of the problem can be found in 푂̃︀(푚푘² + 푛푘²) time.¹ Combining this algorithm with randomized rounding yields an 푂(log 푛)-approximate solution to Set Cover with the same complexity.

1 The method can be further improved to 푂̃︀(푚 + 푛푘) (N. Young, personal communication).

In this chapter we study the complexity of sub-linear time algorithms for set cover with multiplicative approximation guarantees. Our upper bounds complement the aforementioned result of [119] by presenting algorithms which are fast when 푘 is large, as well as algorithms that provide more accurate solutions that use a sub-linear number of queries.² Equally importantly, we establish nearly matching lower bounds, some of which even hold for estimating the optimal cover size. Our upper and lower bounds are presented in Table 4.1.1.

Data access model. As in the prior work [145, 172] on Set Cover, our algorithms and lower bounds assume that the input can be accessed via the adjacency-list oracle.3 More precisely, the algorithm has access to the following two oracles:

1. EltOf: Given a set 푆푖 and an index 푗, the oracle returns the 푗th element of 푆푖. If 푗 > |푆푖|, ⊥ is returned.

2. SetOf: Given an element 푒푖 and an index 푗, the oracle returns the 푗th set containing 푒푖. If 푒푖 appears in fewer than 푗 sets, ⊥ is returned.

This is a natural model, providing a “two-way” connection between the sets and the elements. Furthermore, for some graph problems modeled by Set Cover (such as Dominating Set or Vertex Cover), such oracles are essentially equivalent to the aforementioned incidence-list model studied in sub-linear graph algorithms. We also note that the other popular access model employing the membership oracle, where we can query whether an element 푒 is contained in a set 푆, is not suitable for Set Cover, as it can be easily seen that even checking whether a feasible cover exists requires Ω(푚푛) time.
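For concreteness, here is a minimal Python sketch of the two oracles built from an explicitly stored set system; this is only an illustration of the interface, since the point of the model is that the algorithm may touch the instance solely through these queries (None plays the role of ⊥).

class AdjacencyListOracle:
    """EltOf/SetOf access to a set system given as {set id: list of elements}."""

    def __init__(self, sets):
        self.elt_of = {S: list(elems) for S, elems in sets.items()}
        self.set_of = {}
        for S, elems in sets.items():
            for e in elems:
                self.set_of.setdefault(e, []).append(S)

    def EltOf(self, S, j):
        """Return the j-th element of S (1-indexed), or None if j > |S|."""
        elems = self.elt_of.get(S, [])
        return elems[j - 1] if 1 <= j <= len(elems) else None

    def SetOf(self, e, j):
        """Return the j-th set containing e (1-indexed), or None if e is in fewer than j sets."""
        sets = self.set_of.get(e, [])
        return sets[j - 1] if 1 <= j <= len(sets) else None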

4.1.1. Our Results

In this chapter, we present algorithms and lower bounds for the Set Cover problem. The results are summarized in Table 4.1.1. The NP-hardness of this problem (or even its 표(log 푛)-approximate version [72, 152, 14, 138, 63]) precludes the existence of highly accurate algorithms with fast running times, while (as we show) it is still possible to design algorithms

²Note that polynomial time algorithms with sub-logarithmic approximation factors are unlikely to exist. ³In the context of graph problems, this model is also known as the incidence-list model, and has been studied extensively; see e.g., [45, 80, 31].

Problem               Approximation   Constraints                      Query Complexity                          Section

Set Cover             훼휌 + 휀          훼 ≥ 2                            푂̃︀((1/휀)(푚(푛/푘)^{1/(훼−1)} + 푛푘))          4.4.1
                      휌 + 휀           –                                푂̃︀(푚푛/(푘휀²))                              4.4.2
                      훼               푘 < (푛/log 푚)^{1/(4훼+1)}         Ω̃(푚(푛/푘)^{1/(2훼)})                       4.6
                      훼 ≤ 1.01        푘 = 푂(푛/log 푚)                   Ω̃(푚푛/푘)                                  4.3.2

Cover Verification    –               푘 ≤ 푛/2                          Ω(푛푘)                                     4.5

Table 4.1.1: A summary of our algorithms and lower bounds. We use the following notation: 푘 ≥ 1 denotes the size of the optimum cover; 훼 ≥ 1 denotes a parameter that determines the trade-off between the approximation quality and query/time complexities; 휌 ≥ 1 denotes the approximation factor of a “black box” algorithm for set cover used as a subroutine.

Ω(̃︀ 푚푛/푘) if the approximation ratio is a small constant.

93 4.1.2. Related work

Sub-linear algorithms for Set Cover under the oracle model have been previously studied as an estimation problem; the goal is only to approximate the size of the minimum set cover

rather than constructing one. Nguyen and Onak [145] consider Set Cover under the oracle model we employ in this chapter, in a specific setting where both the maximum cardinal- ity of sets in , and the maximum number of occurrences of an element over all sets, are ℱ bounded by some constants 푠 and 푡; this allows algorithms whose time and query complex-

4 푠 ities are constant, (2(푠푡) /휀)푂(2 ), containing no dependency on 푛 or 푚. They provide an algorithm for estimating the size of the minimum set cover when, unlike our work, allowing both ln 푠 multiplicative and 휀푛 additive errors. Their result has been subsequently improved to (푠푡)푂(푠)/휀2 by Yoshida et al. [172]. Additionally, the results of Kuhn et al. [121] on general packing/covering LPs in the distributed model, together with the reduction method ℒ풪풞풜ℒ of Parnas and Ron [149], implies estimating set cover size to within a 푂(ln 푠)-multiplicative factor (with 휀푛 additive error), can be performed in (푠푡)푂(log 푠 log 푡)/휀4 time/query complexi- ties.

Set Cover can also be considered as a generalization of the Vertex Cover problem. The estimation variant of Vertex Cover under the adjacency-list oracle model has been studied in [149, 130, 148, 172]. Set Cover has been also studied in the sub-linear space context, most notably for the streaming model of computation [155, 67, 42, 20, 17, 29, 105, 61, 97]. In this model, there are algorithms that compute approximate set covers with only multiplicative errors. Our algorithms use some of the ideas introduced in the last two papers [61, 97].

4.1.3. Our Techniques

Overview of the algorithms. The algorithmic results presented in Section 4.4, use the techniques introduced for the streaming Set Cover problem by [61, 97] to get new results in the context of sub-linear time algorithms for this problem. Two components previously used for the set cover problem in the context of streaming are Set Sampling and Element Sampling. Assuming the size of the minimum set cover is 푘, Set Sampling randomly samples 푂̃︀(푘) sets and adds them to the maintained solution. This ensures that all the elements that are well represented in the input (i.e., appearing in at least 푚/푘 sets) are covered by the

94 sampled sets. On the other hand, the Element Sampling technique samples roughly 푂̃︀(푘/훿) elements, and finds a set cover for the sampled elements. It can be shown that the cover for

the sampled elements covers a (1 훿) fraction of the original elements. − Specifically, the first algorithm performs a constant number of iterations. Each iteration uses element sampling to compute a “partial” cover, removes the elements covered by the sets selected so far and recurses on the remaining elements. However, making this process work in sub-linear time (as opposed to sub-linear space) requires new technical development. For example, the algorithm of [97] relies on the ability to test membership for a set-element pair, which generally cannot be efficiently performed in our model. The second algorithm performs only one round of set sampling, and then identifies the elements that are not covered by the sampled sets, without performing a full scan of those sets. This is possible because with high probability only those elements that belong to few input sets are not covered by the sample sets. Therefore, we can efficiently enumerate all pairs (푒푖, 푆푗), 푒푖 푆푗, for those elements 푒푖 that were not covered by the sampled sets. We ∈ then run a black box algorithm only on the set system induced by those pairs. This approach lets us avoid the 푛푘 term present in the query and runtime bounds for the first algorithm, which makes the second algorithm highly efficient for large values of 푘.

The Set Cover lower bound for smaller optimal value 푘. We establish our lower bound for the problem of estimating the size of the minimum set cover, by constructing two distributions of set systems. All systems in the same distribution share the same optimal set cover size, but these sizes differ by a factor 훼 between the two distributions; thus, the algorithm is required to determine from which distribution its input set system is drawn, in order to correctly estimate the optimal cover size. Our distributions are constructed by a novel use of the probabilistic method. Specifically, we first probabilistically construct a set system called median instance (see Lemma 4.3.6): this set system has the property that

(a) its minimum set cover size is 훼푘 and (b) a small number of changes to the instance reduces the minimum set cover size to 푘. We set the first distribution to be always this median instance. Then, we construct the second distribution by a random process that performs the changes (depicted in Algorithm 10) resulting in a modified instance. This

95 process distributes the changes almost uniformly throughout the instance, which implies that the changes are unlikely to be detected unless the algorithm performs a large number of queries. We believe that this construction might find applications to lower bounds for other combinatorial optimization problems.

The Set Cover lower bound for larger optimal value 푘. Our lower bound for the problem of computing an approximate set cover leverages the construction above. We create a combined set system consisting of multiple modified instances all chosen independently at random, allowing instances with much larger 푘. By the properties of the random process generating modified instances, we observe that most of these modified instances have different optimal set cover solution, and that distinguishing these instances from one another requires many queries. Thus, it is unlikely for the algorithm to be able to compute an optimal solution to a large fraction of these modified instances, and therefore it fails to achieve the desired approximation factor for the overall combined instance.

The Cover Verification lower bound for a cover of size 푘. For Cover Verification, however, we instead give an explicit construction of the distributions. We first create an

underlying set structure such that initially, the candidate sets contain all but 푘 elements. Then we may swap in each uncovered element from a non-candidate set. Our set structure is systematically designed so that each swap only modifies a small fraction of the answers from all possible queries; hence, each swap is hard to detect without Ω(푛) queries. The distribution of valid set covers is composed of instances obtained by swapping in every uncovered element, and that of non-covers is similarly obtained but leaving one element uncovered.

4.2. Preliminaries

First, we formally specify the representation of the set structures of input instances, which applies to both Set Cover and Cover Verification. Our lower bound proofs rely mainly on the construction of instances that are hard to distinguish by the algorithm. To this end, we define the swap operation that exchanges a pair of elements between two sets, and how this is implemented in the actual representation.

96 Definition 4.2.1swap ( operation). Consider two sets 푆 and 푆′. A swap on 푆 and 푆′ is defined over two elements 푒, 푒′ such that 푒 푆 푆′ and 푒′ 푆′ 푆, where 푆 and 푆′ exchange 푒 ∈ ∖ ∈ ∖ and 푒′. Formally, after performing swap(푒, 푒′), 푆 = (푆 푒′ ) 푒 and 푆′ = (푆′ 푒 ) 푒′ . As ∪{ } ∖{ } ∪{ } ∖{ } for the representation via EltOf and SetOf, each application of swap only modifies 2 entries for each oracle. That is, if previously 푒 = EltOf(푆, 푖), 푆 = SetOf(푒, 푗), 푒′ = EltOf(푆′, 푖′), and 푆′ = SetOf(푒′, 푗′), then their new values change as follows: 푒′ = EltOf(푆, 푖), 푆′ = SetOf(푒, 푗), 푒 = EltOf(푆′, 푖′), and 푆 = SetOf(푒′, 푗′). In particular, we extensively use the property that the amount of changes to the oracle’s

answers incurred by each swap is minimal. We remark that when we perform multiple swaps on multiple disjoint set-element pairs, every swap modifies distinct entries and do not interfere with one another. Lastly, we define the notion of query-answer history, which is a common tool forestab- lishing lower bounds for sub-linear algorithms under query models.

Definition 4.2.2. By query-answer history, we denote the sequence of query-answer pairs

⟨(푞1, 푎1), (푞2, 푎2), . . . , (푞푟, 푎푟)⟩ recording the communication between the algorithm and the oracles, where each new query 푞_{푖+1} may only depend on the query-answer pairs (푞1, 푎1), . . . ,

(푞푖, 푎푖). In our case, each 푞푖 represents either a SetOf query or an EltOf query made by

the algorithm, and each 푎푖 is the oracle’s answer to that respective query according to the set structure instance.

4.3. Lower Bounds for the Set Cover Problem

In this section, we present lower bounds for Set Cover both for small values of the optimal cover size 푘 (in Section 4.3.1), and for large values of 푘 (in Section 4.3.2). For low values of 푘, we prove the following theorem whose proof is postponed to Section 4.6.

Theorem 4.3.1. For 2 ≤ 푘 ≤ (푛/(16훼 log 푚))^{1/(4훼+1)} and 1 < 훼 ≤ log 푛, any randomized algorithm that solves the Set Cover problem with approximation factor 훼 and success probability at least 2/3 requires Ω̃(푚(푛/푘)^{1/(2훼)}) queries.

Instead, in Section 4.3.1 we focus on the simple setting of this theorem, which applies to approximation protocols for distinguishing between instances with minimum set cover sizes 2 and 3, and show a lower bound of Ω̃(푚푛) (which is tight up to a polylogarithmic factor) for approximation factor 3/2. This simplification serves both clarity and the fact that the result for this case is used in Section 4.3.2 to establish our lower bound

for large values of 푘.

High level idea. Our approach for establishing the lower bound is as follows. First, we construct a median instance 퐼* for Set Cover, whose minimum set cover size is 3. We then apply a randomized procedure GenModifiedInst, which slightly modifies the median instance into a new instance containing a set cover of size 2. Applying Yao's principle, the distribution of the input to the deterministic algorithm is either 퐼* with probability 1/2, or a modified instance generated through GenModifiedInst(퐼*), whose distribution is denoted by 풟(퐼*), again with probability 1/2. Next, we consider the execution of the deterministic algorithm. We show that unless the algorithm asks at least Ω̃(푚푛) queries, the resulting query-answer history generated over 퐼* would be the same as those generated over instances constituting a constant fraction of 풟(퐼*), reducing the algorithm's success probability to below 2/3. More specifically, we will establish the following theorem.

Theorem 4.3.2. Any algorithm that can distinguish whether the input instance is 퐼* or belongs to 풟(퐼*) with probability of success greater than 2/3 requires Ω(푚푛/ log 푚) queries.

Corollary 4.3.3. For 1 < 훼 < 3/2 and 푘 ≤ 3, any randomized algorithm that approximates, within a factor of 훼, the size of the optimal cover for the Set Cover problem with success probability at least 2/3 requires Ω̃(푚푛) queries.

For simplicity, we assume that the algorithm has the knowledge of our construction (which may only strengthen our lower bounds); this includes 퐼* and 풟(퐼*), along with their representation via EltOf and SetOf. The objective of the algorithm is simply to distinguish them. Since we are distinguishing a distribution of instances 풟(퐼*) against a single instance 퐼*, we may individually upper bound the probability that each query-answer pair reveals the modified part of the instance, and then apply the union bound directly. However, establishing such a bound requires a certain set of properties that we obtain through a careful design of 퐼* and GenModifiedInst. We remark that our approach shows the hardness of distinguishing instances with different cover sizes. That is, our lower bound on the query complexity also holds for the problem of approximating the size of the minimum set cover (without explicitly finding one). Lastly, in Section 4.3.2 we provide a construction utilizing Theorem 4.3.2 to extend Corollary 4.3.3 and establish the following theorem on lower bounds for larger minimum set cover sizes.

Theorem 4.3.4. For any sufficiently small approximation factor 훼 ≤ 1.01 and 푘 = 푂(푚/log 푛), any randomized algorithm that computes an 훼-approximation to the Set Cover problem with success probability at least 0.99 requires Ω̃(푚푛/푘) queries.

4.3.1. The Set Cover Lower Bound for Small Optimal Value 푘

Construction of the Median Instance 퐼*. Let ℱ be a collection of 푚 sets such that (independently for each set-element pair (푆, 푒)) 푆 contains 푒 with probability 1 − 푝0, where 푝0 = √(9 log 푚 / 푛) (note that since we assume log 푚 ≤ 푛/푐 for large enough 푐, we can assume that 푝0 ≤ 1/2). Equivalently, we may consider the incidence matrix of this instance: each entry is either 0 (indicating 푒 ∉ 푆) with probability 푝0, or 1 (indicating 푒 ∈ 푆) otherwise. We write ℱ ∼ ℐ(풰, 푝0) to denote the collection of sets obtained from this construction.

Definition 4.3.5 (Median instance). An instance of Set Cover, 퐼, is a median instance if it satisfies all the following properties.

(a) No two sets cover all the elements. (The size of its minimum set cover is at least 3.)
(b) For any two sets, the number of elements not covered by the union of these sets is at most 18 log 푚.
(c) The intersection of any two sets has size at least 푛/8.
(d) For any pair of elements 푒, 푒′, the number of sets 푆 such that 푒 ∈ 푆 but 푒′ ∉ 푆 is at least (푚/4)√(9 log 푚/푛).
(e) For any triple of sets 푆, 푆1 and 푆2, |(푆1 ∩ 푆2) ∖ 푆| ≤ 6√(푛 log 푚).
(f) For each element, the number of sets that do not contain that element is at most 6푚√(log 푚/푛).

Lemma 4.3.6. There exists a median instance 퐼* satisfying all properties from Definition 4.3.5. In fact, with high probability, an instance drawn from the distribution in which Pr[푒 ∈ 푆] = 1 − 푝0 independently at random satisfies the median properties.

The proof of the lemma follows from standard applications of concentration bounds. Specifically, it follows from the union bound and Lemmas 4.7.1–4.7.6, appearing in Section 4.7.
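To make the random model concrete, here is a small sketch (Python) that draws ℱ ∼ ℐ(풰, 푝0) and empirically checks two of the median properties on the sampled instance; the parameters, the natural logarithm, and the helper names are illustrative assumptions used only for this example, not code from the thesis.

```python
import math
import random
from itertools import combinations

def sample_instance(n, m, seed=0):
    """Draw F ~ I(U, p0): each set contains each element independently with prob. 1 - p0."""
    rng = random.Random(seed)
    p0 = math.sqrt(9 * math.log(m) / n)
    sets = [{e for e in range(n) if rng.random() > p0} for _ in range(m)]
    return sets, p0

def check_some_median_properties(sets, n):
    """Check property (a) (no two sets cover everything) and property (c)
    (every pairwise intersection has size at least n/8) on the sample."""
    for S, T in combinations(sets, 2):
        if len(S | T) == n:
            return False
        if len(S & T) < n / 8:
            return False
    return True

sets, p0 = sample_instance(n=2000, m=40)
print(p0, check_some_median_properties(sets, 2000))
```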

Distribution 풟(퐼*) of Modified Instances 퐼′ Derived from 퐼*. Fix a median instance 퐼*. We now show that we may perform 푂(log 푚) swap operations on 퐼* so that the size of the minimum set cover in the modified instance becomes 2. Moreover, its incidence matrix differs from that of 퐼* in 푂(log 푚) entries. Consequently, the number of queries to EltOf and SetOf that induce different answers from those of 퐼* is also at most 푂(log 푚). We define 풟(퐼*) as the distribution of instances 퐼′ generated from a median instance 퐼* by GenModifiedInst(퐼*), given below in Algorithm 10, as follows. Assume that 퐼* = (풰, ℱ). We select two different sets 푆1, 푆2 from ℱ uniformly at random; we aim to turn these two sets into a set cover. To do so, we swap out some of the elements in 푆2 and bring in the uncovered elements. For each uncovered element 푒, we pick an element 푒′ ∈ 푆2 that is also covered by 푆1. Next, consider the candidate sets whose copy of 푒 we may exchange with 푒′ ∈ 푆2:

Definition 4.3.7 (Candidate set). For any pair of elements 푒, 푒′, the candidate sets of (푒, 푒′) are all sets that contain 푒 but not 푒′. The collection of candidate sets of (푒, 푒′) is denoted by Candidate(푒, 푒′). Note that Candidate(푒, 푒′) ≠ Candidate(푒′, 푒) (in fact, these two collections are disjoint).

Algorithm 10 The procedure of constructing a modified instance of 퐼*.
1: procedure GenModifiedInst(퐼* = (풰, ℱ))
2:   ℳ ← ∅
3:   pick two different sets 푆1, 푆2 from ℱ uniformly at random
4:   for all 푒 ∈ 풰 ∖ (푆1 ∪ 푆2) do
5:     pick 푒′ ∈ (푆1 ∩ 푆2) ∖ ℳ uniformly at random
6:     ℳ ← ℳ ∪ {푒′}
7:     pick a random set 푆 in Candidate(푒, 푒′)
8:     swap(푒, 푒′) between 푆, 푆2

We choose a random set 푆 from Candidate(푒, 푒′), and swap 푒 ∈ 푆 with 푒′ ∈ 푆2 so that 푆2 now contains 푒. We repeatedly apply this process for all initially uncovered 푒 so that eventually 푆1 and 푆2 form a set cover. We show that the proposed algorithm, GenModifiedInst, can indeed be executed without getting stuck.

Lemma 4.3.8. The procedure GenModifiedInst is well-defined under the precondition that the input instance 퐼* is a median instance.

Proof: To carry out the algorithm, we must ensure that the number of initially uncovered elements is at most the number of elements covered by both 푆1 and 푆2. This follows from the properties of median instances (see Definition 4.3.5): |풰 ∖ (푆1 ∪ 푆2)| ≤ 18 log 푚 by property (b), and the intersection of 푆1 and 푆2 has size at least 푛/8 by property (c). That is, in our construction there are sufficiently many possible choices for 푒′ to be matched and swapped with each uncovered element 푒. Moreover, by property (d) there are plenty of candidate sets 푆 for performing swap(푒, 푒′) with 푆2. □
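The following is a minimal sketch (Python) of the GenModifiedInst procedure of Algorithm 10. Sets are represented as Python sets of element identifiers, and the helper candidate is an illustrative implementation of Candidate(푒, 푒′), not code from the thesis; under the median properties the random choices below never run out of options.

```python
import random

def candidate(family, e, ep):
    """Candidate(e, e'): indices of sets containing e but not e'."""
    return [i for i, S in enumerate(family) if e in S and ep not in S]

def gen_modified_inst(universe, family, rng=random):
    """Sketch of GenModifiedInst: pick S1, S2 at random and swap elements so that
    {S1, S2} becomes a set cover; returns a modified copy of `family`."""
    fam = [set(S) for S in family]
    i1, i2 = rng.sample(range(len(fam)), 2)
    S1, S2 = fam[i1], fam[i2]
    matched = set()
    for e in universe - (S1 | S2):                    # initially uncovered elements
        ep = rng.choice(sorted((S1 & S2) - matched))  # partner covered by both S1 and S2
        matched.add(ep)
        j = rng.choice(candidate(fam, e, ep))         # a candidate set: contains e but not ep
        fam[j].remove(e); fam[j].add(ep)              # swap(e, ep) between fam[j] and S2
        S2.remove(ep);   S2.add(e)
    return fam
```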

Bounding the Probability of Modification. Let 풟(퐼*) denote the distribution of instances generated by GenModifiedInst(퐼*). If an algorithm were to distinguish between 퐼* and 퐼′ ∼ 풟(퐼*), it must find some cell in the EltOf or SetOf tables that would have been modified by GenModifiedInst, to confirm that GenModifiedInst was indeed executed; otherwise it would make wrong decisions half of the time. We will show an additional property of this distribution: no entry of EltOf or SetOf is significantly more likely than the others to be modified during the execution of GenModifiedInst. Consequently, no algorithm may strategically detect the difference between 퐼* and 퐼′ with the desired probability, unless the number of queries is asymptotically the reciprocal of the maximum probability of modification over all cells.

Define 푃Elt−Set : 풰 × ℱ → [0, 1] as the probability that an element is swapped by a set. More precisely, for an element 푒 ∈ 풰 and a set 푆 ∈ ℱ, if 푒 ∉ 푆 in the median instance 퐼*, then 푃Elt−Set(푒, 푆) = 0; otherwise, it is equal to the probability that 푆 swaps 푒. We note that these probabilities are taken over 퐼′ ∼ 풟(퐼*), where 퐼* is a fixed median instance. That is, as per Algorithm 10, they correspond to the random choices of 푆1, 푆2, the random matching ℳ between 풰 ∖ (푆1 ∪ 푆2) and 푆1 ∩ 푆2, and the random choice of a candidate set 푆 for each swap. We bound the values of 푃Elt−Set via the following lemma.

Lemma 4.3.9. For any 푒 ∈ 풰 and 푆 ∈ ℱ, 푃Elt−Set(푒, 푆) ≤ 4800 log 푚 / (푚푛), where the probability is taken over 퐼′ ∼ 풟(퐼*).

Proof: Let 푆1, 푆2 denote the first two sets picked (uniformly at random) from ℱ to construct a modified instance of 퐼*. For each element 푒 and set 푆 such that 푒 ∈ 푆 in the median instance 퐼*,

푃Elt−Set(푒, 푆) = Pr[푆 = 푆2] · Pr[푒 ∈ 푆1 ∩ 푆2] · Pr[푒 is matched to 풰 ∖ (푆1 ∪ 푆2) | 푒 ∈ 푆1 ∩ 푆2]
              + Pr[푆 ∉ {푆1, 푆2}] · Pr[푒 ∈ 푆 ∖ (푆1 ∪ 푆2) | 푒 ∈ 푆] · Pr[푆 swaps 푒 with 푆2 | 푒 ∈ 푆 ∖ (푆1 ∪ 푆2)],

where all probabilities are taken over 퐼′ ∼ 풟(퐼*). Next we bound each of the above six terms. Since we choose the sets 푆1, 푆2 randomly, Pr[푆 = 푆2] = 1/푚. We bound the second term by 1. For the third term, since we pick a matching uniformly at random among all possible (maximum) matchings between 풰 ∖ (푆1 ∪ 푆2) and 푆1 ∩ 푆2, by symmetry, the probability that a certain element 푒 ∈ 푆1 ∩ 푆2 is in the matching is (by properties (b) and (c) of median instances)

|풰 ∖ (푆1 ∪ 푆2)| / |푆1 ∩ 푆2| ≤ (18 log 푚)/(푛/8) = 144 log 푚 / 푛.

We bound the fourth term by 1. To compute the fifth term, let 푑푒 denote the number of sets in ℱ that do not contain 푒. By property (f) of median instances, the probability that 푒 ∈ 푆 is in 푆 ∖ (푆1 ∪ 푆2) given that 푆 ∉ {푆1, 푆2} is at most

푑푒(푑푒 − 1) / ((푚 − 1)(푚 − 2)) ≤ (36푚^2 · (log 푚)/푛) / (푚^2/2) = 72 log 푚 / 푛.

Finally, for the last term, note that by symmetry, each pair of matched elements 푒푒′ is picked by GenModifiedInst equiprobably. Thus, for any 푒 ∈ 푆 ∖ (푆1 ∪ 푆2), the probability that each element 푒′ ∈ 푆1 ∩ 푆2 is matched to 푒 is 1/|푆1 ∩ 푆2|. By properties (c)–(e) of median instances, the last term is at most

∑_{푒′ ∈ (푆1 ∩ 푆2) ∖ 푆} Pr[푒푒′ ∈ ℳ] · 1/|Candidate(푒, 푒′)|
  = |(푆1 ∩ 푆2) ∖ 푆| · (1/|푆1 ∩ 푆2|) · (1/|Candidate(푒, 푒′)|)
  ≤ 6√(푛 log 푚) · (1/(푛/8)) · (1/((푚/4)√(9 log 푚/푛))) = 64/푚.

Therefore,

푃Elt−Set(푒, 푆) ≤ (1/푚) · 1 · (144 log 푚 / 푛) + 1 · (72 log 푚 / 푛) · (64/푚) ≤ 4800 log 푚 / (푚푛). □

Proof of Theorem 4.3.2. Now we consider a median instance 퐼* and its corresponding family of modified instances 풟(퐼*). To prove the promised lower bound for randomized protocols distinguishing 퐼* and 퐼′ ∼ 풟(퐼*), we apply Yao's principle and instead show that no deterministic algorithm 풜 may determine whether the input is 퐼* or 퐼′ ∼ 풟(퐼*) with success probability at least 2/3 using 푟 = 표(푚푛/log 푚) queries. Recall that if 풜's query-answer history ⟨(푞1, 푎1), . . . , (푞푟, 푎푟)⟩ when executed on 퐼′ is the same as that of 퐼*, then 풜 must unavoidably return a wrong decision for the probability mass corresponding to 퐼′. We bound the probability of this event as follows.

Lemma 4.3.10. Let 푄 be the set of queries made by 풜 on 퐼*. Let 퐼′ ∼ 풟(퐼*), where 퐼* is a given median instance. Then the probability that 풜 returns different outputs on 퐼* and 퐼′ is at most (4800 log 푚 / (푚푛)) |푄|.

Proof: Let 풜(퐼) denote the algorithm's output for input instance 퐼 (whether the given instance is 퐼* or drawn from 풟(퐼*)). For each query 푞, let ans퐼(푞) denote the answer of 퐼 to query 푞. Observe that since 풜 is deterministic, if all of the oracle's answers to its previous queries are the same, then it must make the same next query. Combining this fact with the union bound, we may upper bound the probability that 풜 returns different outputs on 퐼* and 퐼′ ∼ 풟(퐼*) as follows:

Pr[풜(퐼*) ≠ 풜(퐼′)] ≤ ∑_{푡=1}^{|푄|} Pr[ans_{퐼*}(푞푡) ≠ ans_{퐼′}(푞푡)].

For each 푞 ∈ 푄, let 푆(푞) and 푒(푞) denote respectively the set and element queried by 푞. Applying Lemma 4.3.9, we obtain

Pr[풜(퐼*) ≠ 풜(퐼′)] ≤ ∑_{푡=1}^{|푄|} Pr[ans_{퐼*}(푞푡) ≠ ans_{퐼′}(푞푡)] ≤ ∑_{푡=1}^{|푄|} 푃Elt−Set(푒(푞푡), 푆(푞푡)) ≤ (4800 log 푚 / (푚푛)) |푄|. □

Proof of Theorem 4.3.2. If 풜 does not output correctly on 퐼*, the probability of success of 풜 is less than 1/2; thus, we can assume that 풜 returns the correct answer on 퐼*. This implies that 풜 returns an incorrect solution on the fraction of 퐼′ ∼ 풟(퐼*) for which 풜(퐼*) = 풜(퐼′). Now recall that the distribution in which we apply Yao's principle consists of 퐼* with probability 1/2, and an instance drawn from 풟(퐼*) also with probability 1/2. Then over this distribution, by Lemma 4.3.10,

Pr[풜 succeeds] ≤ 1 − (1/2) Pr_{퐼′∼풟(퐼*)}[풜(퐼*) = 풜(퐼′)] ≤ 1 − (1/2)(1 − (4800 log 푚 / (푚푛)) |푄|) = 1/2 + (2400 log 푚 / (푚푛)) |푄|.

Thus, if the number of queries made by 풜 is less than 푚푛/(14400 log 푚), then the probability that 풜 returns the correct answer over the input distribution is less than 2/3 and the proof is complete. □

4.3.2. The Set Cover Lower Bound for Large Optimal Value 푘.

Our construction of the median instance 퐼* and its associated distribution 풟(퐼*) of modified instances also leads to a lower bound of Ω̃(푚푛/푘) for the problem of computing an approximate solution to Set Cover. This lower bound matches the performance of our algorithm for large optimal value 푘 and shows that it is tight for some range of values of 푘, although it only applies to sufficiently small approximation factors 훼 ≤ 1.01.

Proof overview. We construct a distribution over compounds: a compound is a Set Cover instance that consists of 푡 = Θ(푘) smaller instances 퐼1, . . . , 퐼푡, where each of these 푡 instances is either the median instance 퐼* or a random modified instance drawn from 풟(퐼*). By our construction, a large majority of our distribution is composed of compounds that contain at least 0.2푡 modified instances 퐼푖 such that any deterministic algorithm 풜 must fail to distinguish 퐼푖 from 퐼* when it is only allowed to make a small number of queries. A deterministic 풜 can safely cover these modified instances with three sets, incurring a cost (sub-optimality) of 0.2푡. Still, 풜 may choose to cover such an 퐼푖 with two sets to reduce its cost, but it then must err on a different compound where 퐼푖 is replaced with 퐼*. We track the trade-off between the amount of cost that 풜 saves on these compounds by covering these 퐼푖's with two sets, and the amount of error its scheme incurs on other compounds. 풜 is allowed a small probability 훿 of making errors, which we then use to upper-bound the expected cost that 풜 may save, and conclude that 풜 still incurs an expected cost of 0.1푡 overall. We apply Yao's principle (for algorithms with errors) to obtain that randomized algorithms also incur an expected cost of 0.05푡, on compounds with optimal solution size 푘 ∈ [2푡, 3푡], yielding the impossibility result for computing solutions with approximation factor 훼 = (푘 + 0.05푡)/푘 > 1.01 when given insufficient queries. Next, we provide a high-level overview of the lower bound argument.

Compounds. Consider the median instance 퐼* and its associated distribution 풟(퐼*) of modified instances for Set Cover with 푛 elements and 푚 sets, and let 푡 = Θ(푘) be a positive integer parameter. We define a compound I = I(퐼1, 퐼2, . . . , 퐼푡) as a set structure instance consisting of 푡 median or modified instances 퐼1, 퐼2, . . . , 퐼푡, forming a set structure (풰^푡, ℱ^푡) of 푛′ = 푛푡 elements and 푚′ = 푚푡 sets, in such a way that each instance 퐼푖 occupies separate elements and sets. Since the optimal solution to each instance 퐼푖 is 3 if 퐼푖 = 퐼*, and 2 if 퐼푖 is any modified instance, the optimal solution for the compound is 2푡 plus the number of occurrences of the median instance; this optimal objective value is always Θ(푘).

Random distribution over compounds. Employing Yao's principle, we construct a distribution D of compounds I(퐼1, 퐼2, . . . , 퐼푡): it will be applied against any deterministic algorithm 풜 for computing an approximate minimum set cover, which is allowed to err on at most a 훿-fraction of the compounds from the distribution (for some small constant 훿 > 0). For each 푖 ∈ [푡], we pick 퐼푖 = 퐼* with probability 푐/(푚 choose 2), where 푐 > 2 is a sufficiently large constant. Otherwise, we simply draw a random modified instance 퐼푖 ∼ 풟(퐼*). We aim to show that, in expectation over D, 풜 must output a solution of size Θ(푡) more than the optimal set cover size of the given instance I ∼ D.

풜 frequently leaves many modified instances undetected. Consider an instance I containing at least 0.95푡 modified instances. These instances constitute at least a 0.99-fraction of D: the expected number of occurrences of the median instance in each compound is only 푐/(푚 choose 2) · 푡 = 푂(푡/푚^2), so by Markov's inequality, the probability that there are more than 0.05푡 median instances is at most 푂(1/푚^2) < 0.01 for large 푚. We make use of the following useful lemma, whose proof is deferred to the end of Section 4.3.2. In what follows, we say that the algorithm "distinguishes" or "detects the difference" between 퐼푖 and 퐼* if it makes a query that induces different answers, and thus may deduce that one of 퐼푖 or 퐼* cannot be the input instance. In particular, if 퐼푖 = 퐼*, then detecting the difference between them would be impossible.

Lemma 4.3.11. Fix 푀 ⊆ [푡] and consider the distribution over compounds I(퐼1, . . . , 퐼푡) with 퐼푖 ∼ 풟(퐼*) for 푖 ∈ 푀 and 퐼푖 = 퐼* for 푖 ∉ 푀. If 풜 makes at most 표(푚푛푡/log 푚) queries to I, then it may detect the differences between 퐼* and at least 0.75푡 of the modified instances {퐼푖}_{푖∈푀} with probability at most 0.01.

We apply this lemma for any |푀| ≥ 0.95푡 (although the statement holds for any 푀, even vacuously for |푀| < 0.75푡). Thus, for a 0.99 · 0.99 > 0.98-fraction of D, 풜 fails to identify, for at least 0.95푡 − 0.75푡 = 0.2푡 modified instances 퐼푖 in I, whether 퐼푖 is a median instance or a modified instance. Observe that the query-answer history of 풜 on such I would not change if we were to replace any combination of these 0.2푡 modified instances by copies of 퐼*. Consequently, if the algorithm were to correctly cover I by using two sets for some of these 퐼푖, it must unavoidably err (return a non-cover) on the compound where these 퐼푖's are replaced by copies of the median instance.

Charging argument. We call a compound I tough if 풜 does not err on I, and 풜 fails to detect at least 0.2푡 modified instances; denote by D_tough the conditional distribution of D restricted to tough instances. For tough I, let cost(I) denote the number of modified instances 퐼푖 that the algorithm decides to cover with three sets. That is, for each tough compound I, cost(I) measures how far the solution returned by 풜 is from the optimal set cover size. Then, there are at least 0.2푡 − cost(I) modified instances 퐼푖 that 풜 chooses to cover with only two sets despite not being able to verify whether 퐼푖 = 퐼* or not. Let 푅I denote the set of the indices of these modified instances, so |푅I| = 0.2푡 − cost(I). By doing so, 풜 then errs on the replaced compound 푟(I, 푅I), denoting the compound similar to I, except that each modified instance 퐼푖 for 푖 ∈ 푅I is replaced by 퐼*. In this event, we say that the tough compound I charges the replaced compound 푟(I, 푅I) via 푅I. Recall that the total error of 풜 is 훿: this quantity upper-bounds the total probability mass of charged instances, which we will then manipulate to obtain a lower bound on E_{I∼D}[cost(I)].

Instances must share optimal solutions for 푅 to charge the same replaced instance. Observe that many tough instances may charge the same replaced instance: we must handle these duplicities. First, consider two tough instances I^1 ≠ I^2 charging the same I_r = 푟(I^1, 푅) = 푟(I^2, 푅) via the same 푅 = 푅_{I^1} = 푅_{I^2}. As I^1 ≠ I^2 but 푟(I^1, 푅) = 푟(I^2, 푅), these tough instances differ on some modified instances with indices in 푅. Nonetheless, the query-answer histories of 풜 operating on I^1 and I^2 must be the same, as their instances in 푅 are both indistinguishable from 퐼* by the deterministic 풜. Since 풜 does not err on tough instances (by definition), both tough I^1 and I^2 must share the same optimal set cover on every instance in 푅. Consequently, for each fixed 푅, only tough instances that have the same optimal solution for the modified instances in 푅 may charge the same replaced instance via 푅.

Charged instance is much heavier than the charging instances combined. By our construction of I(퐼1, . . . , 퐼푡) drawn from D, Pr[퐼푖 = 퐼*] = 푐/(푚 choose 2) for the median instance. On the other hand, ∑_{푗=1}^{ℓ} Pr[퐼푖 = 퐼^푗] ≤ (1 − 푐/(푚 choose 2)) · (1/(푚 choose 2)) < 1/(푚 choose 2) for modified instances 퐼^1, . . . , 퐼^ℓ sharing the same optimal set cover, because they are all modified instances constructed to have the two sets chosen by GenModifiedInst as their optimal set cover: each pair of sets is chosen uniformly with probability 1/(푚 choose 2). Thus, the probability that 퐼* is chosen is more than 푐 times the total probability that any 퐼^푗 is chosen. Generalizing this observation, we consider tough instances I^1, I^2, . . . , I^ℓ charging the same I_r via 푅, and bound the difference in the probabilities that I_r and any I^푗 are drawn. For each index in 푅, it is more than 푐 times more likely for D to draw the median instance than any modified instance with a fixed optimal solution. Then, for the replaced compound I_r on which 풜 errs, 푝(I_r) ≥ 푐^{|푅|} · ∑_{푗=1}^{ℓ} 푝(I^푗) (where 푝 denotes the probability mass in D, not in D_tough). In other words, the probability mass of the replaced instance charged via 푅 is always at least 푐^{|푅|} times the total probability mass of the charging tough instances.

Bounding the expected cost using 훿. In our charging argument by tough instances above, we only bound the amount of charges on the replaced instances via a fixed 푅. As there are up to 2^푡 choices for 푅, we scale down the total amount charged to a replaced instance by a factor of 2^푡, so that ∑_{tough I} 푐^{|푅_I|} 푝(I)/2^푡 lower bounds the total probability mass of the replaced instances on which 풜 errs.

Let us first focus on the conditional distribution D_tough restricted to tough instances. Recall that at least a (0.98 − 훿)-fraction of the compounds in D are tough: 풜 fails to detect differences between 0.2푡 modified instances and the median instance with probability 0.98, and among these compounds, 풜 may err on at most a 훿-fraction. So in the conditional distribution D_tough over tough instances, the individual probability mass is scaled up to 푝^tough(I) ≤ 푝(I)/(0.98 − 훿). Thus,

∑_{tough I} 푐^{|푅_I|} 푝(I) / 2^푡 ≥ ∑_{tough I} 푐^{|푅_I|} (0.98 − 훿) 푝^tough(I) / 2^푡 = (0.98 − 훿) E_{I∼D_tough}[푐^{|푅_I|}] / 2^푡.

As the probability mass above cannot exceed the total allowed error 훿, we have

(훿/(0.98 − 훿)) · 2^푡 ≥ E_{I∼D_tough}[푐^{|푅_I|}] ≥ E_{I∼D_tough}[푐^{0.2푡 − cost(I)}] ≥ 푐^{0.2푡 − E_{I∼D_tough}[cost(I)]},

where Jensen's inequality is applied in the last step above. So,

E_{I∼D_tough}[cost(I)] ≥ 0.2푡 − (푡 + log(훿/(0.98 − 훿)))/log 푐 = (0.2 − 1/log 푐) 푡 − log(훿/(0.98 − 훿))/log 푐 ≥ 0.11푡,

for sufficiently large 푐 (and 푚) when choosing 훿 = 0.02. We now return to the expected cost over the entire distribution D. For simplicity, define cost(I) = 0 for any non-tough I. This yields E_{I∼D}[cost(I)] ≥ (0.98 − 훿) E_{I∼D_tough}[cost(I)] ≥ (0.98 − 훿) · 0.11푡 ≥ 0.1푡, establishing the expected cost of any deterministic 풜 with probability of error at most 0.02 over D.

Establishing the lower bound for randomized algorithms. Lastly, we apply Yao's principle⁴ to obtain that, for any randomized algorithm with error probability 훿/2 = 0.01, its expected cost under the worst input is at least (1/2) · 0.1푡 = 0.05푡. Recall now that our cost here lower-bounds the sub-optimality of the computed set cover (that is, the algorithm uses at least cost more sets to cover the elements than the optimal solution does). Since our input instances have optimal solution 푘 ∈ [2푡, 3푡] and the randomized algorithm returns a solution with cost at least 0.05푡 in expectation, it achieves an approximation factor of no better than 훼 = (푘 + 0.05푡)/푘 > 1.01 with 표(푚푛푡/log 푚) queries. Theorem 4.3.4 then follows, noting the substitution of our problem size: 푚푛푡/log 푚 = (푚′/푡)(푛′/푡)푡/log(푚′/푡) = Θ(푚′푛′/(푘′ log 푚′)).

⁴Here we use the Monte Carlo version where the algorithm may err, and use cost instead of the time complexity as our measure of performance. See, e.g., Proposition 2.6 in [139] and the description therein.

Proof of Lemma 4.3.11. First, we recall the following consequence of Lemma 4.3.10 for distinguishing between 퐼* and a random 퐼′ ∼ 풟(퐼*).

Corollary 4.3.12. Let 푞 be the number of queries made by 풜 on 퐼푖 ∼ 풟(퐼*) over 푛 elements and 푚 sets, where 퐼* is a median instance. Then the probability that 풜 detects a difference between 퐼푖 and 퐼* in one of its queries is at most 4800푞 log 푚 / (푚푛).

Marbles and urns. Fix a compound I(퐼1, . . . , 퐼푡). Let 푠 = 푚푛/(4800 log 푚), and then consider the following, entirely different, scenario. Suppose that we have 푡 urns, where each urn contains 푠 marbles. In the 푖th urn, in case 퐼푖 is a modified instance, we put one red marble and 푠 − 1 white marbles; otherwise, if 퐼푖 = 퐼*, we put in 푠 white marbles. Observe that the probability of obtaining a red marble by drawing 푞 marbles from a single urn without replacement is exactly 푞/푠 (for 푞 ≤ 푠). Now, we will relate the probability of drawing red marbles to the probability of successfully distinguishing instances. We emphasize that we are only comparing the probabilities of events for the sake of analysis, and we do not imply or suggest any direct analogy between the events themselves.

Corollary 4.3.12 above bounds the probability that the algorithm successfully distinguishes a modified instance 퐼푖 from 퐼* by 4800푞 log 푚 / (푚푛) = 푞/푠. Then, the probability of distinguishing between 퐼푖 and 퐼* using 푞 queries is bounded from above by the probability of obtaining a red marble after drawing 푞 marbles from an urn. Consequently, the probability that the algorithm distinguishes 3푡/4 instances is bounded from above by the probability of drawing the red marbles from at least 3푡/4 urns. Hence, to prove that the event of Lemma 4.3.11 occurs with probability at most 0.01, it is sufficient to upper-bound by 0.01 the probability that an algorithm obtains 3푡/4 red marbles.

Consider an instance of 푡 urns; for each urn 푖 ∈ [푡] corresponding to a modified instance 퐼푖, exactly one of its 푠 marbles is red. An algorithm may draw marbles from each urn, one by one without replacement, for potentially up to 푠 times. By the principle of deferred decisions, the red marble is equally likely to appear in any of these 푠 draws, independently of the events for other urns. Thus, we can create a tuple of 푡 random variables 풯 = (푇1, . . . , 푇푡) such that for each 푖 ∈ [푡], 푇푖 is chosen uniformly at random from {1, . . . , 푠}. The variable 푇푖 represents the number of draws required to obtain the red marble in the 푖th urn; that is, only the 푇푖th draw from the 푖th urn finds the red marble of that urn. In case 퐼푖 is a median instance, we simply set 푇푖 = 푠 + 1, indicating that the algorithm never detects any difference, as 퐼푖 and 퐼* are the same instance. We now show the following two lemmas in order to bound the number of red marbles the algorithm may encounter throughout its execution.

Lemma 4.3.13. Let 푏 > 3 be a fixed constant and define 풯high = {푖 | 푇푖 ≥ 푠/푏}. If 푡 ≥ 14푏, then |풯high| ≥ (1 − 2/푏)푡 with probability at least 0.99.

Proof: Let 풯low = {1, . . . , 푡} ∖ 풯high. Notice that for the 푖th urn, Pr[푖 ∈ 풯low] < 1/푏 independently of the other urns, and thus |풯low| is stochastically dominated by B(푡, 1/푏), the binomial distribution with 푡 trials and success probability 1/푏. Applying the Chernoff bound, we obtain

Pr[|풯low| ≥ 2푡/푏] ≤ 푒^{−푡/(3푏)} < 0.01.

Hence, |풯high| ≥ 푡 − 2푡/푏 = (1 − 2/푏)푡 with probability at least 0.99, as desired. □

Lemma 4.3.14. If the total number of draws made by the algorithm is less than (1 − 3/푏) · 푠푡/푏, then with probability at least 0.99, the algorithm will not obtain red marbles from at least 푡/푏 urns.

Proof: If the total number of such draws is less than (1 − 3/푏) · 푠푡/푏, then the number of draws from at least 3푡/푏 urns is less than 푠/푏 each. Assume the condition of Lemma 4.3.13: for at least (1 − 2/푏)푡 urns, 푇푖 ≥ 푠/푏. That is, the algorithm will not encounter a red marble if it makes less than 푠/푏 draws from such an urn. Then, there are at least 푡/푏 urns with 푇푖 ≥ 푠/푏 from which the algorithm makes less than 푠/푏 draws, and thus does not obtain a red marble. Overall this event holds with probability at least 0.99 due to Lemma 4.3.13. □

We substitute 푏 = 4 and assume sufficiently large 푡. Suppose that the deterministic algorithm makes less than (1 − 3/4) · 푠푡/4 = 푠푡/16 queries; then for a 0.99 fraction of all possible tuples 풯, there are 푡/4 instances 퐼푖 for which the algorithm fails to detect their differences from 퐼*: the probability of this event is lower-bounded by that of the event where the red marbles from those corresponding urns 푖 are not drawn. Therefore, the probability that the algorithm makes queries that detect differences between 퐼* and more than 3푡/4 of the instances 퐼푖 is bounded by 0.01, concluding our proof of Lemma 4.3.11. □
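A small Monte Carlo sketch (Python) of the urn experiment behind Lemma 4.3.13; the parameters 푡, 푠, 푏 and the number of repetitions below are arbitrary illustrative values, not tied to the actual problem sizes, and the code only illustrates the strategy-independent bound on |풯low|.

```python
import random

# Monte Carlo illustration of Lemma 4.3.13 (illustrative parameters):
# with T_i uniform on {1, ..., s}, the number of urns whose red marble would
# appear within the first s/b draws rarely reaches 2t/b.

def trial(t, s, b):
    """Return True if at least 2t/b urns have T_i < s/b in one sample."""
    low = sum(1 for _ in range(t) if random.randint(1, s) < s / b)
    return low >= 2 * t / b

def estimate(t=200, s=10_000, b=4, reps=20_000):
    """Empirical probability that |T_low| >= 2t/b; the lemma bounds it by e^{-t/(3b)}."""
    return sum(trial(t, s, b) for _ in range(reps)) / reps

if __name__ == "__main__":
    print(estimate())  # typically far below 0.01 once t >= 14b
```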

4.4. Sub-Linear Algorithms for the Set Cover Problem

In this section, we present two different approximation algorithms for Set Cover with sub-linear query complexity in the oracle model: SmallSetCover and LargeSetCover. Both of our algorithms rely on techniques from the recent developments on Set Cover in the streaming model. However, adapting those techniques to the oracle model requires novel insights and technical development.

Throughout the description of our algorithms, we assume that we have access to a black box subroutine that, given the full Set Cover instance (where all members of all sets are revealed), returns a 휌-approximate solution.⁵ The first algorithm (SmallSetCover) returns an (훼휌 + 휀)-approximate solution of the instance using 푂̃((1/휀)(푚(푛/푘)^{1/(훼−1)} + 푛푘)) queries, while the second algorithm (LargeSetCover) achieves an approximation factor of (휌 + 휀) using 푂̃(푚푛/(푘휀^2)) queries, where 푘 is the size of the minimum set cover. These algorithms can be combined so that the number of queries of the algorithm becomes asymptotically the minimum of the two:

Theorem 4.4.1. There exists a randomized algorithm for Set Cover in the oracle model that w.h.p.⁶ computes an 푂(휌 log 푛)-approximate solution and uses 푂̃(min{푚(푛/푘)^{1/log 푛} + 푛푘, 푚푛/푘}) = 푂̃(푚 + 푛√푚) queries.

Our algorithms use the following two sampling techniques developed for Set Cover in the streaming model [61]: Element Sampling and Set Sampling. The first technique, Element Sampling, states that in order to find a (1 − 훿)-cover of 풰 w.h.p., it suffices to solve Set Cover on a subset of elements of size 푂̃(휌푘 log 푚/훿) picked uniformly at random. It shows that we may restrict our attention to a subproblem with a much smaller number of elements, and our solution to the reduced instance will still cover a good fraction of the elements in the original instance. The next technique, Set Sampling, shows that if we pick ℓ sets uniformly at random from ℱ into the solution, then w.h.p. each element that is not covered by any of the picked sets only occurs in 푂̃(푚/ℓ) sets of ℱ; that is, we are left with a much sparser subproblem to solve. The formal statements of these sampling techniques are as follows. See [61] for the proofs.

⁵The approximation factor 휌 may take on any value between 1 and Θ(log 푛) depending on the computational model one assumes.
⁶An algorithm succeeds with high probability (w.h.p.) if its failure probability can be decreased to 푛^{−푐} for any constant 푐 > 0 without affecting its asymptotic performance, where 푛 denotes the input size.

Lemma 4.4.2 (Element Sampling). Consider an instance of Set Cover on (풰, ℱ) whose optimal cover has size at most 푘. Let 풰smp be a subset of 풰 of size Θ(휌푘 log 푚/훿) chosen uniformly at random, and let 풞smp ⊆ ℱ be a 휌-approximate cover for 풰smp. Then, w.h.p. 풞smp covers at least (1 − 훿)|풰| elements.

Lemma 4.4.3 (Set Sampling). Consider an instance (풰, ℱ) of Set Cover. Let ℱrnd be a collection of ℓ sets picked uniformly at random. Then, w.h.p. ℱrnd covers all elements that appear in Ω(푚 log 푛/ℓ) sets of ℱ.
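The two sampling primitives are simple to state in code. The sketch below (Python) shows how Element Sampling reduces the universe and how Set Sampling picks a random partial cover; offline_set_cover stands in for the 휌-approximate black box solver and is an assumption for illustration, not part of the thesis.

```python
import random

def element_sampling(universe, sets, sample_size, offline_set_cover):
    """Element Sampling: solve Set Cover on a random subset of the universe.

    `offline_set_cover(sub_universe, sets)` is an assumed rho-approximate black box;
    by Lemma 4.4.2 the returned cover w.h.p. covers all but a delta-fraction of
    `universe` when sample_size = Theta(rho * k * log m / delta).
    """
    sample = random.sample(sorted(universe), min(sample_size, len(universe)))
    return offline_set_cover(set(sample), sets)

def set_sampling(universe, sets, ell):
    """Set Sampling: pick ell random sets; by Lemma 4.4.3, w.h.p. every element
    still uncovered afterwards occurs in only O~(m / ell) sets of the family."""
    picked = random.sample(list(sets), min(ell, len(sets)))
    covered = set().union(*picked) if picked else set()
    uncovered = universe - covered
    return picked, uncovered
```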

4.4.1. First Algorithm: small values of 푘

The algorithm of this section is a modified variant of the streaming algorithm for Set Cover in [97] that works in the sub-linear query model. Similarly to the algorithm of [97], our algorithm SmallSetCover considers different guesses of the value of an optimal solution (휀^{−1} log 푛 guesses) and performs the core iterative algorithm IterSetCover for all of them in parallel. For each guess ℓ of the size of an optimal solution, IterSetCover goes through 훼 iterations and, by applying Element Sampling, guarantees that w.h.p. at the end of each iteration the number of uncovered elements is reduced by a factor of 푛^{1/훼}. Hence, after these iterations all elements will be covered. Furthermore, since the number of sets picked in each iteration is at most ℓ, the final solution has at most 휌ℓ sets, where 휌 is the performance of the offline block OfflineSC that IterSetCover uses to solve the reduced instances constructed by Element Sampling.

Although our general approach in IterSetCover is similar to the iterative core of the streaming algorithm for Set Cover, there are challenges that we need to overcome so that it works efficiently in the query model. Firstly, the approach of [97] relies on the ability to test membership for a set-element pair when executing its set filtering subroutine: given a subset S, the algorithm of [97] needs to compute |푆 ∩ S|, which cannot be implemented efficiently in the query model (in the worst case, it requires 푚|S| queries). Instead, here we employ Set Sampling, which w.h.p. guarantees that the number of sets containing any (yet) uncovered element is small.

The next challenge is achieving the 푚(푛/푘)^{1/(훼−1)} + 푛푘 query bound for computing an 훼-approximate solution. As mentioned earlier, both our approach and the algorithm of [97] need to run the algorithm in parallel for different guesses ℓ of the size of an optimal solution. However, since IterSetCover performs 푚(푛/ℓ)^{1/(훼−1)} + 푛ℓ queries, if SmallSetCover invokes IterSetCover with guesses in increasing order, then the query complexity becomes 푚푛^{1/(훼−1)} + 푛푘; on the other hand, if it invokes IterSetCover with guesses in decreasing order, then the query complexity becomes 푚(푛/푘)^{1/(훼−1)} + 푚푛. To solve this issue, SmallSetCover proceeds in two stages: in the first stage, it finds a (log 푛)-estimate of 푘 by invoking IterSetCover using 푚 + 푛푘 queries (assuming guesses are evaluated in increasing order), and then in the second stage it only invokes IterSetCover with approximation factor 훼 in the smaller 푂(log 푛)-approximate region around the (log 푛)-estimate of 푘 computed in the first stage. Thus, in our implementation, besides the desired approximation factor, IterSetCover receives an upper bound and a lower bound on the size of an optimal solution.

Now, we provide a detailed description of IterSetCover. It receives 훼, 휀, 푙 and 푢 as its arguments, and it is guaranteed that the size of an optimal cover of the input instance, 푘, is in [푙, 푢]. Note that the algorithm does not know the value of 푘, and the sampling techniques described above rely on 푘. Therefore, the algorithm needs to find a (1 + 휀)-estimate⁷ of 푘, denoted ℓ. This can be done by trying all powers of (1 + 휀) in [푙, 푢]. The parameter 훼 controls the trade-off between the query complexity and the approximation guarantee that the algorithm achieves. Moreover, we assume that the algorithm has access to a 휌-approximate black box solver of Set Cover.

IterSetCover first performs Set Sampling to cover all elements that occur in Ω̃(푚/ℓ) sets. Then it goes through 훼 − 2 iterations and in each iteration it performs Element Sampling with parameter 훿 = 푂̃((ℓ/푛)^{1/(훼−1)}). By Lemma 4.4.2, after (훼 − 2) iterations, w.h.p. only ℓ(푛/ℓ)^{1/(훼−1)} elements remain uncovered, for which the algorithm finds a cover by invoking the offline set cover solver. The parameters are set so that all (훼 − 1) instances that are required to be solved by the offline set cover solver (the (훼 − 2) instances constructed by Element Sampling and the final instance) are of size 푂̃(푚(푛/ℓ)^{1/(훼−1)}).

In the rest of this section, we show that SmallSetCover w.h.p. returns an almost (휌훼)-approximate solution of Set Cover(풰, ℱ) with query complexity 푂̃(푚(푛/푘)^{1/(훼−1)} + 푛푘), where 푘 is the size of a minimum set cover.

⁷The exact estimate that the algorithm works with is a (1 + 휀/(2휌훼))-estimate.

Theorem 4.4.4. SmallSetCover outputs an (훼휌 + 휀)-approximate solution of Set Cover(풰, ℱ) using 푂̃((1/휀)(푚(푛/푘)^{1/(훼−1)} + 푛푘)) queries w.h.p., where 푘 is the size of an optimal solution of (풰, ℱ).

Algorithm 11 IterSetCover is the main procedure of the SmallSetCover algorithm for the Set Cover problem.
1: procedure IterSetCover(훼, 휀, 푙, 푢)
2:   ◁ Try all (1 + 휀/(2훼휌))-approximate guesses of 푘
3:   for ℓ ∈ {(1 + 휀/(2훼휌))^푖 | log_{1+휀/(2훼휌)} 푙 ≤ 푖 ≤ log_{1+휀/(2훼휌)} 푢} do
4:     solℓ ← collection of ℓ sets picked uniformly at random ◁ Set Sampling
5:     풰rem ← 풰 ∖ ⋃_{푆∈solℓ} 푆 ◁ 푛ℓ EltOf queries
6:     for 푖 = 1 to 훼 − 2 do
7:       S ← sample of 풰rem of size 푂̃(휌ℓ(푛/ℓ)^{1/(훼−1)})
8:       풟 ← OfflineSC(S, ℓ)
9:       if 풟 = null then
10:        break ◁ Try the next value of ℓ
11:      solℓ ← solℓ ∪ 풟
12:      풰rem ← 풰rem ∖ ⋃_{푆∈풟} 푆 ◁ 휌푛ℓ EltOf queries
13:    if |풰rem| ≤ ℓ(푛/ℓ)^{1/(훼−1)} then ◁ Feasibility Test
14:      풟 ← OfflineSC(풰rem, ℓ)
15:      if 풟 ≠ null then
16:        solℓ ← solℓ ∪ 풟
17:        return solℓ

To analyze the performance of SmallSetCover, we first need to analyze the procedures invoked by SmallSetCover: IterSetCover and OfflineSC.

The OfflineSC procedure receives as input a subset of elements S and an estimate ℓ of the size of an optimal cover of S using sets in ℱ. OfflineSC first determines all occurrences of the elements of S in ℱ. Then it invokes a black box subroutine that returns a cover of size at most 휌ℓ (if there exists a cover of size ℓ for S) for the reduced Set Cover instance over S. Moreover, we assume that all subroutines have access to the EltOf and SetOf oracles, as well as to |풰| and |ℱ|.

Lemma 4.4.5. Suppose that each 푒 ∈ S appears in 푂̃(푚/ℓ) sets of ℱ, and assume that there exists a collection of ℓ sets in ℱ that covers S. Then OfflineSC(S, ℓ) returns a cover of S of size at most 휌ℓ using 푂̃(푚|S|/ℓ) queries.

Proof: Since each element of S is contained in 푂̃(푚/ℓ) sets of ℱ, the information required to solve the reduced instance on S can be obtained with 푂̃(푚|S|/ℓ) queries (i.e., 푂̃(푚/ℓ) SetOf queries per element in S). □

Lemma 4.4.6. The cover solℓ constructed by the outer loop of IterSetCover(훼, 휀, 푙, 푢) with parameter ℓ ≥ 푘 w.h.p. covers 풰.

Algorithm 12 OfflineSC(S, ℓ) invokes a black box that returns a cover of size at most 휌ℓ (if there exists a cover of size ℓ for S) for the Set Cover instance that is the projection of ℱ over S.
1: procedure OfflineSC(S, ℓ)
2:   ℱS ← ∅
3:   for all elements 푒 ∈ S do
4:     ℱ푒 ← the collection of sets containing 푒
5:     ℱS ← ℱS ∪ ℱ푒
6:   풟 ← solution of size at most 휌ℓ for Set Cover on (S, ℱS) constructed by a solver
7:     ◁ If there exists no such cover, then 풟 = null
8:   return 풟

Proof (of Lemma 4.4.6): After picking ℓ sets uniformly at random, by Set Sampling (Lemma 5.2.3), w.h.p. each element that is not covered by the sampled sets appears in 푂̃(푚/ℓ) sets of ℱ. Next, by Element Sampling (Lemma 4.4.2 with 훿 = (ℓ/푛)^{1/(훼−1)}), at the end of each inner iteration, w.h.p. the number of uncovered elements decreases by a factor of (ℓ/푛)^{1/(훼−1)}. Thus after at most (훼 − 2) iterations, w.h.p. fewer than ℓ(푛/ℓ)^{1/(훼−1)} elements remain uncovered. Finally, OfflineSC is invoked on the remaining elements; hence, solℓ w.h.p. covers 풰. □

Next we analyze the query complexity and the approximation guarantee of IterSetCover. As we only apply Element Sampling and Set Sampling polynomially many times, all invocations of the corresponding lemmas during an execution of the algorithm succeed w.h.p., so we assume their high probability guarantees for the proofs in the rest of this section.

Lemma 4.4.7. Given that 푙 ≤ 푘 ≤ 푢/(1 + 휀/(2훼휌)), w.h.p. IterSetCover(훼, 휀, 푙, 푢) finds a (휌훼 + 휀)-approximate solution of the input instance using 푂̃((1/휀)(푚(푛/푙)^{1/(훼−1)} + 푛푘)) queries.

Proof: Let ℓ푘 = (1 + 휀/(2훼휌))^{⌈log_{1+휀/(2훼휌)} 푘⌉} be the smallest power of 1 + 휀/(2훼휌) greater than or equal to 푘. Note that it is guaranteed that ℓ푘 ∈ [푙, 푢]. By Lemma 4.4.6, IterSetCover terminates with a guess value ℓ ≤ ℓ푘. In the following we compute the query complexity of a run of IterSetCover with a parameter ℓ ≤ ℓ푘.

The Set Sampling component picks ℓ sets and then updates the set of elements not covered by those sets, 풰rem, using 푂(푛ℓ) EltOf queries. Next, in each iteration of the inner loop, the algorithm samples a subset S of size 푂̃(ℓ(푛/ℓ)^{1/(훼−1)}) from 풰rem. Recall that, by Set Sampling (Lemma 5.2.3), each 푒 ∈ S ⊂ 풰rem appears in at most 푂̃(푚/ℓ) sets. Since each element in 풰rem appears in 푂̃(푚/ℓ) sets, OfflineSC returns a cover 풟 of size at most 휌ℓ using 푂̃(푚(푛/ℓ)^{1/(훼−1)}) SetOf queries (Lemma 4.4.5). By the guarantee of Element Sampling (Lemma 4.4.2), the number of elements in 풰rem that are not covered by 풟 is at most (ℓ/푛)^{1/(훼−1)} |풰rem|. Finally, at the end of each inner loop, the algorithm updates the set of uncovered elements 풰rem by using 푂̃(푛ℓ) EltOf queries. The Feasibility Test, which is passed w.h.p. for ℓ ≤ ℓ푘, ensures that the final run of OfflineSC performs 푂̃(푚(푛/ℓ)^{1/(훼−1)}) SetOf queries. Hence, the total number of queries performed in each iteration of the outer loop of IterSetCover with parameter ℓ ≤ ℓ푘 is 푂̃(푚(푛/ℓ)^{1/(훼−1)} + 푛ℓ).

By Lemma 4.4.6, if ℓ푘 ≤ 푢, then the outer loop of IterSetCover is executed for 푙 ≤ ℓ ≤ ℓ푘 before it terminates. Thus, the total number of queries made by IterSetCover is:

∑_{푖=⌈log_{1+휀/(2훼휌)} 푙⌉}^{log_{1+휀/(2훼휌)} ℓ푘} 푂̃( 푚 (푛/(1 + 휀/(2훼휌))^푖)^{1/(훼−1)} + 푛(1 + 휀/(2훼휌))^푖 )
  = 푂̃( 푚 (푛/푙)^{1/(훼−1)} log_{1+휀/(2훼휌)}(ℓ푘/푙) + 푛ℓ푘/(휀/(휌훼)) )
  = 푂̃( (1/휀)( 푚(푛/푙)^{1/(훼−1)} + 푛푘 ) ).

Now, we show that the number of sets returned by IterSetCover is not more than (훼휌 + 휀)ℓ푘. Set Sampling picks ℓ sets and each run of OfflineSC returns at most 휌ℓ sets. Thus the size of the solution returned by IterSetCover is at most (1 + (훼 − 1)휌)ℓ푘 < (훼휌 + 휀)푘. □

Next, we prove the main theorem of the section.

Algorithm 13 The description of the SmallSetCover algorithm.
1: procedure SmallSetCover(훼, 휀)
2:   sol ← IterSetCover(log 푛, 1, 1, 푛) ◁ Find a 휌 log 푛 estimate of 푘
3:   푘′ ← |sol|
4:   return IterSetCover(훼, 휀, ⌊푘′/(휌 log 푛)⌋, ⌈푘′(1 + 휀/(2훼휌))⌉)

Proof of Theorem 4.4.4. The algorithm SmallSetCover first finds a (휌 log 푛)-approximate solution sol of Set Cover(풰, ℱ) with 푂̃(푚 + 푛푘) queries by calling IterSetCover(log 푛, 1, 1, 푛). Having 푘 ≤ 푘′ = |sol| ≤ (휌 log 푛)푘, the algorithm calls IterSetCover with 훼 as the approximation factor and [⌊푘′/(휌 log 푛)⌋, ⌈푘′(1 + 휀/(2훼휌))⌉] as the range containing 푘. By Lemma 4.4.7, the second call to IterSetCover in SmallSetCover returns an (훼휌 + 휀)-approximate solution of Set Cover(풰, ℱ) using the following number of queries:

푂̃( (1/휀)( 푚 (푛/(푘/(휌 log 푛)))^{1/(훼−1)} + 푛푘 ) ) = 푂̃( (1/휀)( 푚(푛/푘)^{1/(훼−1)} + 푛푘 ) ). □

4.4.2. Second Algorithm: large values of 푘

The second algorithm, LargeSetCover, works strictly better than SmallSetCover for large values of 푘 (푘 ≥ √푚). The advantage of LargeSetCover is that it does not need to update the set of uncovered elements at any point, and it thereby avoids the additive 푛푘 term in the query complexity bound; the result of Section 4.5 suggests that the 푛푘 term may be unavoidable if one wishes to maintain the uncovered elements. Note that the guarantee of LargeSetCover is that at the end of the algorithm, w.h.p. the ground set 풰 is covered.

The algorithm LargeSetCover, given in Algorithm 14, first randomly picks 휀ℓ/3 sets. By Set Sampling (Lemma 5.2.3), w.h.p. every element that occurs in Ω̃(푚/(휀ℓ)) sets of ℱ will be covered by the picked sets. It then solves the Set Cover instance over the elements that occur in 푂̃(푚/(휀ℓ)) sets of ℱ by an offline solver of Set Cover using 푂̃(푚/(휀ℓ)) queries; note that this set of elements may include some already covered elements. In order to get the promised query complexity, LargeSetCover enumerates the guesses ℓ of the size of an optimal set cover in decreasing order. The algorithm returns feasible solutions for ℓ ≥ 푘, and once it cannot find a feasible solution for ℓ, it returns the solution constructed for the previous guess of 푘, i.e., ℓ(1 + 휀/(3휌)). Since LargeSetCover performs Set Sampling for 푂̃(휀^{−1}) iterations, w.h.p. the total query complexity of LargeSetCover is 푂̃(푚푛/(푘휀^2)). Note that testing whether the number of occurrences of an element is 푂̃(푚/(휀ℓ)) only requires a single query, namely SetOf(푒, 푐푚 log 푛/(휀ℓ)). We now prove the desired performance of LargeSetCover.

Lemma 4.4.8. LargeSetCover returns a (휌 + 휀)-approximate solution of Set Cover(풰, ℱ) w.h.p.

Proof: The algorithm LargeSetCover tries to construct set covers of decreasing sizes until

Algorithm 14 A (휌 + 휀)-approximation algorithm for the Set Cover problem. We assume that the algorithm has access to the EltOf and SetOf oracles for Set Cover(풰, ℱ), as well as to |풰| and |ℱ|.
1: procedure LargeSetCover(휀)
2:   ◁ try all (1 + 휀/(3휌))-approximate guesses of 푘
3:   for ℓ ∈ {(1 + 휀/(3휌))^푖 | 0 ≤ 푖 ≤ log_{1+휀/(3휌)} 푛} in decreasing order do
4:     rndℓ ← collection of 휀ℓ/3 sets picked uniformly at random ◁ Set Sampling
5:     ℱrare ← ∅; S ← ∅ ◁ rare elements and the sets containing them
6:     for all 푒 ∈ 풰 do
7:       if 푒 appears in < 푐푚 log 푛/(휀ℓ) sets then ◁ size test: SetOf(푒, 푐푚 log 푛/(휀ℓ))
8:         ℱ푒 ← collection of sets containing 푒 ◁ 푂̃(푚/(휀ℓ)) SetOf queries
9:         ℱrare ← ℱrare ∪ ℱ푒
10:        S ← S ∪ {푒}
11:    풟 ← solution of Set Cover(S, ℱrare) returned by a 휌-approximation algorithm
12:    if |풟| ≤ 휌ℓ then
13:      sol ← rndℓ ∪ 풟
14:    else
15:      return sol ◁ solution for the previous value of ℓ

it fails. Clearly, if 푘 ≤ ℓ, then the black box algorithm finds a cover of size at most 휌ℓ for any subset of 풰, because 푘 sets are sufficient to cover 풰. In other words, the algorithm does not terminate with ℓ ≥ 푘. Moreover, since the algorithm terminates when ℓ is smaller than 푘, the size of the set cover found by LargeSetCover is at most (휀/3 + 휌)(1 + 휀/(3휌))ℓ < (휀/3 + 휌)(1 + 휀/(3휌))푘 < (휌 + 휀)푘. □

Lemma 4.4.9. The number of queries made by LargeSetCover is 푂̃(푚푛/(푘휀^2)).

Proof: The value of ℓ in any successful iteration of the algorithm is greater than 푘/(휌 + 휀); otherwise, the size of the solution constructed by the algorithm is at most (휌 + 휀)ℓ < 푘, which is a contradiction.

Set Sampling guarantees that w.h.p. each uncovered element appears in 푂̃(푚/(휀ℓ)) sets, and thus the algorithm needs to perform 푂̃(푚푛/(휀ℓ)) SetOf queries to construct ℱrare. Moreover, the number of queries required in the size test step is 푂(푛), because we only need one SetOf query per element in 풰. Thus, the query complexity of LargeSetCover(휀) is bounded by

∑_{푖=log_{1+휀/(3휌)} (푘/(휌+휀))}^{log_{1+휀/(3휌)} 푛} 푂̃( 푛 + 푚푛/(휀(1 + 휀/(3휌))^푖) ) = 푂̃( (푛 + 푚푛/(휀푘)) log_{1+휀/(3휌)}(푛/푘) ) = 푂̃( 푚푛/(푘휀^2) ). □
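For contrast with the IterSetCover sketch above, the following is a compact sketch (Python) of the LargeSetCover loop; set_of_count, sets_containing, and offline_solver are assumed oracle and solver hooks for illustration, not part of the thesis, and the constant 푐 and polylogarithmic factors are dropped.

```python
import math
import random

def large_set_cover(universe, sets, eps, rho, m, offline_solver,
                    set_of_count, sets_containing):
    """Sketch of LargeSetCover: enumerate guesses ell in decreasing order,
    set-sample eps*ell/3 sets, solve the instance restricted to rare elements,
    and return the last feasible solution once a guess fails."""
    base = 1 + eps / (3 * rho)
    sol = None
    guesses = [base ** i for i in range(int(math.log(len(universe), base)) + 1)]
    for ell in reversed(guesses):
        rnd = random.sample(list(sets), min(int(eps * ell / 3) + 1, len(sets)))
        threshold = m * math.log(len(universe)) / (eps * ell)
        rare = [e for e in universe if set_of_count(e) < threshold]  # one SetOf query each
        family = {S for e in rare for S in sets_containing(e)}       # O~(m/(eps*ell)) queries per rare element
        cover = offline_solver(set(rare), family, ell)               # rho-approximate black box
        if cover is not None and len(cover) <= rho * ell:
            sol = set(rnd) | set(cover)
        else:
            return sol   # solution built for the previous (larger) guess
    return sol
```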

4.5. Lower Bound for the Cover Verification Problem

In this section, we give a tight lower bound on a feasibility variant of the Set Cover problem, which we refer to as Cover Verification. In Cover Verification(풰, ℱ, 푘), besides a collection ℱ of 푚 sets and 푛 elements 풰, we are given the indices of 푘 sets ℱ푘 ⊆ ℱ, and the goal is to determine whether they cover the whole universe 풰 or not. We note that, throughout this section, the parameter 푘 is a candidate for, but not necessarily the value of, the size of the minimum set cover.

A naive approach for this decision problem is to query all elements in the given 푘 sets and then check whether they cover 풰 or not; this approach requires 푂(푛푘) queries. However, in what follows we show that this approach is tight and no randomized protocol can decide whether the given 푘 sets cover the whole universe with probability of success at least 0.9 using 표(푛푘) queries.

Theorem 4.5.1. Any (randomized) algorithm for deciding whether a given collection of 푘 = Ω(log 푛) sets covers all elements with probability of success at least 0.9 requires Ω(푛푘) queries.

Proof: Observe that according to our instance construction, the algorithm may verify, with a single query, whether a certain swap occurs in a certain slab. Namely, it is sufficient to query an entry of EltOf or SetOf that would have been modified by that swap, and check whether it is actually modified or not. For simplicity, we assume that the algorithm has the knowledge of our construction. Further, without loss of generality, the algorithm does not make multiple queries about the same swap, or make a query that does not correspond to any swap.

We employ Yao's principle as follows: to prove a lower bound for randomized algorithms, we show a lower bound for any deterministic algorithm on a fixed distribution of input instances. Let 푠 = 푛 − 푘 be the number of possible swaps in each slab; assume 푠 = Θ(푛). We define our distribution of instances as follows: each of the 푠^푘 possible Yes instances occurs with probability 1/(2푠^푘), and each of the 푘푠^{푘−1} possible No instances occurs with probability 1/(2푘푠^{푘−1}). Equivalently speaking, we create a random Yes instance by making one swap on each basic slab. Then we make a coin flip: with probability 1/2 we pick a random slab and undo the swap on that slab to obtain a No instance; otherwise we leave it as a Yes instance. To prove by contradiction, assume there exists a deterministic algorithm that solves the

Cover Verification problem over this distribution of instances with 푟 = 표(푠푘) queries. Consider the Yes instances portion of the distribution, and observe that we may alternatively interpret the random process generating them as follows. For each slab, one of its 푠 possible swaps is chosen uniformly at random. This condition again follows the scenario considered in Section 4.3.2: we are given 푘 urns (slabs), each consisting of 푠 marbles (possible swap locations), and aim to draw the red marble (swapped entry) from a large fraction of these urns. Following the proof of Lemmas 4.3.13-4.3.14, we obtain that if the total number of queries made by the algorithm is less than (1 − 3/푏) · 푠푘/푏, then with probability at least 0.99, the algorithm will not see any swaps from at least 푘/푏 slabs.

Then, consider the corresponding No instances obtained by undoing the swap in one of the slabs of the Yes instance. Suppose that the deterministic algorithm makes less than (1 − 3/푏) · 푠푘/푏 queries; then for a 0.99 fraction of all possible tuples 풯, the output of the Yes instance is the same as the output of a 1/푏 fraction of No instances, namely when the slab containing no swap is one of the 푘/푏 slabs in which the algorithm has not detected a swap in the corresponding Yes instance; the algorithm must answer incorrectly on half of the corresponding weight in our distribution of input instances. Thus the probability of success for any algorithm with less than (1 − 3/푏) · 푠푘/푏 queries is at most

1 − Pr[|풯high| ≥ (1 − 2/푏)푘] · (1/푏) · (1/2) ≤ 1 − 0.495/푏 < 0.9,

for a sufficiently small constant 푏 > 3 (e.g., 푏 = 4). As 푠 = Θ(푛), by Yao's principle this implies the lower bound of Ω(푛푘) for the Cover Verification problem. □

While this lower bound does not directly lead to a lower bound on Set Cover, it suggests that verifying the feasibility of a solution may be even more costly than finding the approximate solution itself; any algorithm bypassing this Ω(푛푘) lower bound may not solve Cover Verification as a subroutine.

We prove our lower bound by designing Yes and No instances that are hard to distinguish, such that for a Yes instance the union of the given 푘 sets is 풰, while for a No instance their union covers only 푛 − 1 elements. Each Yes instance is indistinguishable from a good fraction of the No instances. Thus any algorithm must unavoidably answer incorrectly on half of these fractions, and fail to reach the desired probability of success.

4.5.1. Underlying Set Structure

Our instance contains 푛 sets and 푛 elements (so 푚 = 푛), where the first 푘 sets form ℱ푘, the candidate for the set cover we wish to verify. We first consider the incidence matrix representation, in which the rows represent the sets and the columns represent the elements. We focus on the first 푛/푘 elements, and consider a slab, composed of 푛/푘 columns of the incidence matrix. We define a basic slab as the structure illustrated in Figure 4.5.1 (for 푛 = 12 and 푘 = 3), where the cell (푖, 푗) is white if 푒푗 ∈ 푆푖, and is gray otherwise. The rows are divided into blocks of size 푘, where the first block, the query block, contains the rows whose sets we wish to check for coverage; notice that only the last element of the slab is not covered. More specifically, in a basic slab, the query block contains the sets 푆1, . . . , 푆푘, each of which is equal to {푒1, . . . , 푒_{푛/푘−1}}. The subsequent rows form the swapper blocks, each consisting of 푘 sets. The 푟th swapper block consists of the sets 푆_{푟푘+1}, . . . , 푆_{(푟+1)푘}, each of which is equal to {푒1, . . . , 푒_{푛/푘}} ∖ {푒푟}.

We perform one swap in this slab. Consider a parameter (푥, 푦) representing the index of a white cell within the query block. We exchange the color of this white cell with the gray cell on the same row, and similarly exchange the same pair of cells in row 푥 of swapper block 푦. An example is given in Figure 4.5.1; the dashed blue rectangle corresponds to the indices parameterizing possible swaps, and the red squares mark the modified cells. This modification corresponds to a single swap operation; in this example, choosing the index (3, 2) swaps (푒2, 푒4) between 푆3 and 푆9. Observe that there are 푘 × (푛/푘 − 1) = 푛 − 푘 possible swaps on a single slab, and any single swap allows the query sets to cover all 푛/푘 elements of the slab.

Lastly, we may create the full instance by placing all 푘 slabs together, as shown in Figure 4.5.2, shifting the elements' indices as necessary.

Figure 4.5.1: A basic slab and an example of a swapping operation; panel (a) shows a basic slab over 푒1, . . . , 푒4 with the query block 푆1–푆3 and swapper blocks 1–3, and panel (b) shows the result of performing a (3, 2)-swap.

Figure 4.5.2: An example structure of a Yes instance (three slabs over 푒1, . . . , 푒12 and sets 푆1, . . . , 푆12); all elements are covered by the first 3 sets.

The structure of our sets may be specified solely by the swaps made on these slabs. We define the structure of our instances as follows.

1. For a Yes instance, we make one random swap on each slab. This allows the first 푘 sets to cover all elements.

2. For a No instance, we make one random swap on each slab except for exactly one of them. In that slab, the last element is not covered by any of the first 푘 sets.

Now, to properly define an instance, we must describe our structure via EltOf and SetOf. We first create a temporary instance consisting of 푘 basic slabs, where none of the cells are swapped. We create the EltOf and SetOf lists by sorting each list in increasing order of indices. Each instance from the above construction can then be obtained by applying up to 푘 swaps on this temporary instance. Figure 4.6.1 provides a sample realization of a basic slab with EltOf and SetOf, as well as a sample result of applying a swap on this basic slab; these correspond to the incidence matrices in Figure 4.5.1a and Figure 4.5.1b, respectively. Such a construction can be extended to include all 푘 slabs. Observe here that no two distinct swaps modify the same entry; that is, the swaps do not interfere with one another on these two functions. We also note that many entries do not participate in any swap.
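The following is a small sketch (Python) that builds one basic slab in the block layout described above and applies an (푥, 푦)-swap to it; the parameters and representation (sets as lists of 1-based element indices) are illustrative assumptions for this example only, with 푘 dividing 푛.

```python
# A sketch of one basic slab of the Cover Verification construction and an (x, y)-swap on it.

def basic_slab(n, k):
    """Return the slab's sets as lists of element indices (1-based), for a single slab
    on elements e_1, ..., e_{n/k}: the query block S_1..S_k misses only e_{n/k}, and
    the r-th swapper block S_{rk+1}..S_{(r+1)k} misses only e_r."""
    w = n // k                                    # number of elements in the slab
    sets = {}
    for i in range(1, k + 1):                     # query block
        sets[i] = [e for e in range(1, w)]        # {e_1, ..., e_{w-1}}
    for r in range(1, w):                         # swapper blocks 1, ..., w-1
        for i in range(r * k + 1, (r + 1) * k + 1):
            sets[i] = [e for e in range(1, w + 1) if e != r]
    return sets

def apply_swap(sets, n, k, x, y):
    """Swap (x, y): query set S_x trades e_y for e_{n/k}, and set S_{y*k+x}
    (row x of swapper block y) trades e_{n/k} for e_y, so S_1..S_k cover the slab."""
    w = n // k
    sx, sy = x, y * k + x
    sets[sx] = [e for e in sets[sx] if e != y] + [w]
    sets[sy] = [e for e in sets[sy] if e != w] + [y]

slab = basic_slab(12, 3)
apply_swap(slab, 12, 3, 3, 2)                     # the (3, 2)-swap: e_2 <-> e_4 between S_3 and S_9
assert set(slab[3]) == {1, 3, 4} and set(slab[9]) == {1, 2, 3}
covered = set().union(*(slab[i] for i in range(1, 4)))
assert covered == {1, 2, 3, 4}                    # the query block now covers the whole slab
```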

4.6. Generalized Lower Bounds for the Set Cover Problem

In this section we generalize the approach of Section 4.3 and prove our main lower bound result (Theorem 4.3.1) for the number of queries required for approximating, within factor 훼, the size of an optimal solution to the Set Cover problem, where the input instance contains 푚 sets, 푛 elements, and a minimum set cover of size 푘. The structure of our proof is largely the same as in the simplified case, but the definitions and the details of our analysis will be more complicated. The size of the minimum set cover of the median instance will instead be at least 훼푘 + 1, and GenModifiedInst reduces this down to 푘. We now aim to prove the following statement, which implies the lower bound in Theorem 4.3.1.

Theorem 4.6.1. Let 푘 be the size of an optimal solution of 퐼* such that 1 < 훼 log 푛 and 1 ≤ (︁ )︁ 4훼+1 2 푘 푛 . Any algorithm that distinguishes whether the input instance is 퐼* ≤ ≤ 16훼 log 푚 124 Before: EltOf table for a basic slab After: EltOf table after applying a swap

EltOf 1 2 3 EltOf 1 2 3

푆1 푒1 푒2 푒3 푆1 푒1 푒2 푒3

푆2 푒1 푒2 푒3 푆2 푒1 푒2 푒3

푆3 푒1 푒2 푒3 푆3 푒1 푒4 푒3

푆4 푒2 푒3 푒4 푆4 푒2 푒3 푒4

푆5 푒2 푒3 푒4 푆5 푒2 푒3 푒4

푆6 푒2 푒3 푒4 푆6 푒2 푒3 푒4

푆7 푒1 푒3 푒4 푆7 푒1 푒3 푒4

푆8 푒1 푒3 푒4 푆8 푒1 푒3 푒4

푆9 푒1 푒3 푒4 푆9 푒1 푒3 푒2

푆10 푒1 푒2 푒4 푆10 푒1 푒2 푒4

푆11 푒1 푒2 푒4 푆11 푒1 푒2 푒4

푆12 푒1 푒2 푒4 푆12 푒1 푒2 푒4

Before: SetOf table for a basic slab

SetOf 1 2 3 4 5 6 7 8 9

푒1 푆1 푆2 푆3 푆7 푆8 푆9 푆10 푆11 푆12

푒2 푆1 푆2 푆3 푆4 푆5 푆6 푆10 푆11 푆12

푒3 푆1 푆2 푆3 푆4 푆5 푆6 푆7 푆8 푆9

푒4 푆4 푆5 푆6 푆7 푆8 푆9 푆10 푆11 푆12

After: SetOf table after applying a swap

SetOf 1 2 3 4 5 6 7 8 9

푒1 푆1 푆2 푆3 푆7 푆8 푆9 푆10 푆11 푆12

푒2 푆1 푆2 푆9 푆4 푆5 푆6 푆10 푆11 푆12

푒3 푆1 푆2 푆3 푆4 푆5 푆6 푆7 푆8 푆9

푒4 푆4 푆5 푆6 푆7 푆8 푆3 푆10 푆11 푆12

Figure 4.6.1: Tables illustrating the representation of a slab under EltOf and SetOf before and after a swap; cells modified by swap(푒2, 푒4) between 푆3 and 푆9 are highlighted in red. 125 or belongs to (퐼*) with probability of success at least 2/3 requires Ω(̃︀ 푚( 푛 )1/(2훼)) queries. 풟 푘 Proof: Applying the same argument as that of Lemma 4.3.10, we derive that the probability

that returns different outputs on 퐼* and 퐼′ is at most 풜

|푄| |푄| 푘 * ′ ∑︁ ∑︁ 64푝0 * ′ Pr (퐼 ) = (퐼 ) Pr ans퐼 (푞푡) = ans퐼 (푞푡) 푃Elt−Set(푒(푞푡), 푆(푞푡)) 2 푄 , 풜 ̸ 풜 ≤ ̸ ≤ ≤ 푚(1 푝0) | | 푡=1 푡=1 − via the result of Lemma 4.6.12. Then, over the distribution in which we applied Yao’s lemma, we have

(︂ 푘 )︂ 1 * ′ 1 64푝0 succeeds ′ * Pr 1 Pr퐼 ∼풟(퐼 )[ (퐼 ) = (퐼 )] 1 1 2 푄 풜 ≤ − 2 풜 풜 ≤ − 2 − 푚(1 푝0) | | 푘 − 1 32푝0 = + 2 푄 2 푚(1 푝0) | | − 1 1 32 (︂8(푘훼 + 2) log 푚)︂ 2훼 + 푄 ≤ 2 푚 푛 | |

where the last inequality follows from Lemma 4.6.2. Thus, if the number of queries made by

is less than 푚 ( 푛 )1/(2훼), then the probability that returns the correct answer 풜 192 8(푘훼+2) log 푚 풜 over the input distribution is less than 2/3 and the proof is complete. 

4.6.1. Construction of the Median Instance 퐼*.

Let be a collection of 푚 sets such that independently for each set-element pair (푆, 푒), 푆 con- ℱ 1/(훼푘) (︁ 8(훼푘+2) log 푚 )︁ tains 푒 with probability 1 푝0, where we modify the probability to 푝0 = . − 푛 We start by proving some inequalities involving 푝0 that will be useful later on, which hold for any 푘 in the assumed range. 1 (︁ )︁ 4훼+1 Lemma 4.6.2. For 2 푘 푛 , we have that ≤ ≤ 16훼 log 푚 푘/4 (a) 1 푝0 푝 , − ≥ 0 (b) 푘/4 , 푝0 1/2 ≤ 1 푝푘 (︁ )︁ 2훼 (c) 0 8(훼푘+2) log 푚 . 2 (1−푝0) ≤ 푛 Proof: Recall as well that 훼 > 1. In the given range of 푘, we have 푘4훼 푛 ≤ 16훼푘 log 푚 ≤ 126 푛 because 푘훼 2. Thus 8(훼푘+2) log 푚 ≥

1 1 (︂ )︂ 훼푘 (︂ )︂ 훼푘 8(훼푘 + 2) log 푚 1 −4/푘 푝0 = = 푘 . 푛 ≤ 푘4훼

−4/푘 − 4 ln 푘 4 ln 푘 4 −푥 푥 Next, rewrite 푘 = 푒 푘 and observe that < 1.5. Since 푒 1 for 푘 ≤ 푒 ≤ − 2 − 4 ln 푘 2 ln 푘 푘/4 − ln 푘 any 푥 < 1.5, we have 푝0 푒 푘 < 1 . Further, 푝 푒 = 1/푘. Hence ≤ − 푘 0 ≤ 푘/4 2 ln 푘 1 푝0 + 푝 1 + 1, implying the first statement. 0 ≤ − 푘 푘 ≤ The second statement easily follows as 푝푘/4 1/푘 1/2 since 푘 2. For the last 0 ≤ ≤ ≥ statement, we make use of the first statement:

1 푘 푘 (︂ )︂ 2훼 푝0 푝0 푘/2 8(훼푘 + 2) log 푚 2 푘/4 = 푝0 = (1 푝0) ≤ 2 푛 − (푝0 )

which completes the proof of the lemma.  Next, we give the new, generalized definition of median instances.

Definition 4.6.3 (Median instance). An instance of Set Cover, 퐼 = ( , ), is a median 풰 ℱ instance if it satisfies all the following properties.

(a) No 훼푘 sets cover all the elements. (The size of its minimum set cover is greater than 훼푘.) (b) The number of uncovered elements of the union of any sets is at most 푘. 푘 2푛푝0 (c) For any pair of elements 푒, 푒′, the number of sets 푆 s.t. 푒 푆 but 푒′ / 푆 is at least ∈ ℱ ∈ ∈ (1 푝0)푝0푚/2. − 푘−1 (d) For any collection of 푘 sets 푆1, , 푆푘, 푆푘 (푆1 푆푘−1) (1 푝0)(1 푝 )푛/2. ··· | ∩ ∪ · · · ∪ | ≥ − − 0 (e) For any collection of 푘 + 1 sets 푆, 푆1, , 푆푘, (푆푘 (푆1 푆푘−1)) 푆 2푝0(1 ··· | ∩ ∪ · · · ∪ ∖ | ≤ − 푘−1 푝0)(1 푝 )푛. − 0 (f) For each element, the number of sets that do not contain the element is at most (1 + 1 . 푘 )푝0푚

√︀ 푚 푛 1 * Lemma 4.6.4. For 푘 min , ( ) 4훼+1 , there exists a median instance 퐼 ≤ { 27 ln 푚 16훼 log 푚 } satisfying all the median properties from Definition 4.6.3. In fact, most of the instances constructed by the described randomized procedure satisfy the median properties.

127 Proof: The lemma follows from applying the union bound on the results of Lemmas 4.6.5–

4.6.10.  The proofs of the Lemmas 4.6.5–4.6.10 follow from standard applications of concentration bounds. We include them here for the sake of completeness.

−2 Lemma 4.6.5. With probability at least 1 푚 over ( , 푝0), the size of the minimum − ℱ ∼ ℐ 풰 set cover of the instance ( , ) is at least 훼푘 + 1. ℱ 풰 Proof: The probability that an element 푒 is covered by a specific collection of 훼푘 sets ∈ 풰 in is at most 1 푝훼푘 = 1 8(훼푘+2) log 푚 . Thus, the probability that the union of the 훼푘 ℱ − 0 − 푛 sets covers all elements in is at most (1 8(훼푘+2) log 푚 )푛 < 푚−8(훼푘+2). Applying the union 풰 − 푛 bound, with probability at least 1 푚−2 the size of an optimal set cover is at least 훼푘 + 1. −  −2 Lemma 4.6.6. With probability at least 1 푚 over ( , 푝0), any collection of 푘 sets − ℱ ∼ ℐ 풰 has at most 푘 uncovered elements. 2푛푝0

Proof: Let 푆1, , 푆푘 be a collection of 푘 sets from . For each element 푒 , the proba- ··· ℱ ∈ 풰 bility that is not covered by the union of the sets is 푘. Thus, 푒 푘 푝0

푘 훼푘 E[ (푆1 푆푘) ] = 푝0푛 푝0 푛 = 8(훼푘 + 2) log 푚. |풰 ∖ ∪ · · · ∪ | ≥

By Chernoff bound,

푝푘푛 푘 − 0 −(훼푘+2) log 푚 −푘−2 Pr (푆1 푆푘) 2푝 푛 푒 3 푒 푚 . |풰 ∖ ∪ · · · ∪ | ≥ 0 ≤ ≤ ≤

Thus with probability at least 1 푚−2, for any collection of 푘 sets in , the number of − ℱ uncovered elements by the union of the sets is at most 푘 . 2푝0푛 

′ Lemma 4.6.7. Suppose that ( , 푝0) and let 푒, 푒 be two elements in . Given 푘 1 ℱ ∼ ℐ 풰 풰 ≤ (︁ )︁ 4훼+1 푛 , with probability at least 1 푚−2, the number of sets 푆 such that 푒 푆 16훼 log 푚 − ∈ ℱ ∈ ′ but 푒 / 푆 is at least 푚푝0(1 푝0)/2. ∈ − ′ Proof: For each set 푆, Pr 푒 푆 and 푒 / 푆 = (1 푝0)푝0. This implies that the expected ∈ ∈ −

128 number of such sets 푆 satisfying the condition for 푒 and 푒′ is

푘/4 훼푘 푝0(1 푝0)푚 푝0 푝 푚 푝 푛 = 8(훼푘 + 2) log 푚 − ≥ · 0 · ≥ 0

by Lemma 4.6.2 and 푚 푛. By Chernoff bound, the probability that the number of sets ≥ ′ containing 푒 but not 푒 is less than 푚푝0(1 푝0)/2 is at most −

− 푝0(1−푝0)푚 −(훼푘+2) log 푚 −훼푘−2 푒 8 푒 푚 . ≤ ≤

Thus with probability at least 1 푚−2 property (c) holds for any pair of elements in . − 풰 

Lemma 4.6.8. Suppose that ( , 푝0) and let 푆1, , 푆푘 be 푘 different sets in . 1 ℱ ∼ ℐ 풰 ··· ℱ (︁ 푛 )︁ 4훼+1 −2 Given 푘 , with probability at least 1 푚 , 푆푘 (푆1 푆푘−1) ≤ 16훼 log 푚 − | ∩ ∪ · · · ∪ | ≥ 푘−1 (1 푝0)(1 푝 )푛/2. − − 0 푘−1 Proof: For each element 푒, Pr 푒 푆푘 (푆1 푆푘−1) = (1 푝0)(1 푝 ). This implies ∈ ∩ ∪ · · · ∪ − − 0 that the expected size of 푆푘 (푆1 푆푘−1) is ∩ ∪ · · · ∪

푘−1 푘/4 푘/4 훼푘 (1 푝0)(1 푝 )푛 푝 푝 푛 푝 푛 = 8(훼푘 + 2) log 푚. − − 0 ≥ 0 · 0 · ≥ 0

by Lemma 4.6.2. By Chernoff bound, the probability that 푆푘 (푆1 푆푘−1) (1 | ∩ ∪ · · · ∪ | ≤ − 푘−1 푝0)(1 푝 )푛/2 is at most − 0

(1−푝 )(1−푝푘−1)푛 − 0 0 −(훼푘+2) log 푚 −훼푘−2 푒 8 푒 푚 . ≤ ≤

−2 Thus with probability at least 1 푚 property (d) holds for any sets 푆1, , 푆푘 in . − ··· ℱ 

Lemma 4.6.9. Suppose that ( , 푝0) and let 푆1, , 푆푘 and 푆 be 푘+1 different sets in 1 ℱ ∼ ℐ 풰 ··· (︁ 푛 )︁ 4훼+1 −2 . Given 푘 , with probability at least 1 푚 , (푆푘 (푆1 푆푘−1)) 푆 ℱ ≤ 16훼 log 푚 − | ∩ ∪· · ·∪ ∖ | ≤ 푘−1 2푝0(1 푝0)(1 푝 )푛. − − 0

129 푘−1 Proof: For each element 푒, Pr 푒 (푆푘 (푆1 푆푘−1)) 푆 = 푝0(1 푝0)(1 푝 ). Then, ∈ ∩ ∪ · · · ∪ ∖ − − 0

푘−1 푘/4 푘/4 훼푘 E( (푆푘 (푆1 푆푘−1)) 푆 ) = 푝0(1 푝0)(1 푝0 )푛 푝0 푝0 푝0 푝0 푛 | ∩ ∪ · · · ∪ ∖ | − − ≥ · · ≥ = 8(훼푘 + 2) log 푚

by Lemma 4.6.2. By Chernoff bound, the probability that (푆푘 (푆1 푆푘−1)) 푆 | ∩ ∪ · · · ∪ ∖ | ≥ 푘−1 2푝0(1 푝0)(1 푝 )푛 is − − 0

푝 (1−푝 )(1−푝푘−1)푛 − 0 0 0 −2(훼푘+2) log 푚 −2훼푘−4 푒 3 푒 푚 . ≤ ≤

−2 Thus with probability at least 1 푚 property (e) holds for any sets 푆1, , 푆푘 and 푆 in − ··· .  ℱ 1 (︁ )︁ 4훼+1 Lemma 4.6.10. Given that 푘 푛 , for each element, the number of sets that ≤ 16훼 log 푚 do not contain the element is at most 1 . (1 + 푘 )푝0푚 1 (︁ )︁ 4훼+1 Proof: First, note that 푘 푛 √︀ 푚 as 푚 푛 and 훼 1. ≤ 16훼 log 푚 ≤ 27 ln 푚 ≥ ≥ Next, for each element , . This implies that . 푒 Pr푆∼ℱ [푒 / 푆] = 푝0 E푆( 푆 푒 / 푆 ) = 푝0푚 ∈ |{ | ∈ }| −푚푝0 1 2 By Chernoff bound, the probability that 푆 푒 / 푆 (1 + )푝0푚 is at most 푒 3푘 . Now |{ | ∈ }| ≥ 푘 if , then and thus this probability would be at most −푚 −3 for 푘 log 푛 푝0 1/푒 exp( 2 ) 푚 ≥ ≥ 3푒푘 ≤ √︀ 푚 −푚푛−1/훼푘 any 푘 . Otherwise, we have that the above probability is at most exp( 2 ) ≤ 27 ln 푚 3 log 푛 ≤ −푚1−1/훼푘 −3 exp( 2 ) 푚 given 푚 푛 and sufficiently large 푛. Thus with probability at least 3 log 푚 ≤ ≥ 1 푚−2 property (f) holds for any element 푒 . − ∈ 풰 

4.6.2. Distribution (퐼*) of the Modified Instances Derived from 퐼*. 풟 Fix a median instance 퐼*. We now show that we may perform 푂˜(푛1−1/훼푘1/훼) swap operations on 퐼* so that the size of the minimum set cover in the modified instance becomes 푘. So, the number of queries to EltOf and SetOf that induce different answers from those of 퐼* is at most 푂̃︀(푛1−1/훼푘1/훼). We define (퐼*) as the distribution of instances 퐼′ that is generated 풟 from a median instance 퐼* by GenModifiedInst(퐼*) given below in Algorithm 15. The main difference from the simplified version are that we nowselect 푘 different sets to turn

them into a set cover, and the swaps may only occur between 푆푘 and the candidates.

130 Algorithm 15 The procedure of constructing a modified instance of 퐼*. 1: procedure GenModifiedInst(퐼* = ( , )) 2: 풰 ℱ ℳ ← ∅ 3: pick 푘 different sets 푆1, 푆푘 from uniformly at random ··· ℱ 4: for all 푒 (푆1 푆푘) do ∈′ 풰 ∖ ∪ · · · ∪ 5: pick 푒 (푆푘 (푆1 푆푘−1)) uniformly at random 6: ∈ 푒푒∩′ ∪ · · · ∪ ∖ ℳ 7: pickℳ ←a ℳ random ∪ { } set 푆 in Candidate(푒, 푒′) ′ 8: swap(푒, 푒 ) between 푆, 푆푘 Lemma 4.6.11. The procedure GenModifiedInst is well-defined under the precondition

that the input instance 퐼* is a median instance. Proof: To carry out the algorithm, we must ensure that the number of the initially uncov-

ered elements is at most that of the elements covered by both 푆푘 and some other set from * 푆1, . . . , 푆푘−1. Since 퐼 is a median instance, by properties (b) and (d) from Definition 4.6.3, 푘 푘−1 these values satisfy (푆1 푆푘) 2푝 푛 and 푆푘 (푆1 푆푘−1) (1 푝0)(1 푝 )푛/2, |풰∖ ∪· · ·∪ | ≤ 0 | ∩ ∪· · ·∪ | ≥ − − 0 respectively. By Lemma 4.6.2, 푝푘/4 1/2. Using this and Lemma 4.6.2 again, 0 ≤

푘−1 푘/4 푘/4 푘/2 푘 (1 푝0)(1 푝 )푛/2 푝 푝 푛/2 푝 푛/2 2푝 푛. − − 0 ≥ 0 · 0 · ≥ 0 ≥ 0

That is, in our construction there are sufficiently many possible choices for 푒′ to be matched and swapped with each uncovered element 푒. Moreover, since 퐼* is a median instance, ′ Candidate(푒, 푒 ) (1 푝0)푝0푚/2 (by property (c)), and there are plenty of candidates for | | ≥ − each swap. 

Bounding the Probability of Modification. Similarly to the simplified case, define

푃Elt−Set : [0, 1] as the probability that an element is swapped by a set, and upper 풰 × ℱ → bound it via the following lemma.

64푝푘 Lemma 4.6.12. For any and , 0 where the probability 푒 푆 푃Elt−Set(푒, 푆) 2 ∈ 풰 ∈ ℱ ≤ (1−푝0) 푚 is taken over the random choices of 퐼′ (퐼*). ∼ 풟

Proof: Let 푆1, . . . , 푆푘 denote the first 푘 sets picked (uniformly at random) from to construct ℱ a modified instance of 퐼*. For each element 푒 and a set 푆 such that 푒 푆 in the basic instance ∈ 131 퐼*,

푃Elt−Set(푒, 푆) = Pr 푆 = 푆푘 Pr 푒 푖∈[푘−1]푆푖 푒 푆푘 · ∈ ∪ | ∈

Pr 푒 matches to ( 푖∈[푘]푆푖) 푒 푆푘 ( 푖∈[푘−1]푆푖) · 풰 ∖ ∪ | ∈ ∩ ∪

+ Pr 푆 / 푆1, . . . , 푆푘 Pr 푒 푆 ( 푖∈[푘]푆푖) 푒 푆 ∈ { }· ∈ ∖ ∪ | ∈

Pr 푆 swaps 푒 with 푆푘 푒 푆 (푆1 푆푘), · | ∈ ∖ ∪ · · · ∪ where all probabilities are taken over 퐼′ (퐼*). Next we bound each of the above six ∼ 풟 terms. Clearly, since we choose the sets 푆1, , 푆푘 randomly, Pr[푆 = 푆푘] = 1/푚. We bound ··· the second term by 1. Next, by properties (b) and (d) of median instances, the third term is at most

푘 푘 ( 푖∈[푘]푆푖) 2푝0푛 4푝0 |풰 ∖ ∪ | 푘−1 푛 2 . 푆푘 ( 푖∈[푘−1]푆푖) ≤ (1 푝0)(1 푝 ) ≤ (1 푝0) | ∩ ∪ | − − 0 2 −

We bound the fourth term by 1. Let 푑푒 denote the number of sets in that do not contain ℱ 푒. Using property (f) of median instances, the fifth term is at most

(︂ )︂푘 푑푒(푑푒 1) (푑푒 푘 + 1) 푑푒 (1 + 1/푘)푝0푚 − ··· − ( )푘 푒2푝푘, (푚 1)(푚 2) (푚 푘) ≤ 푚 1 ≤ 푚(1 1 ) ≤ 0 − − ··· − − − 푘+1

Finally for the last term, note that by symmetry, each pair of matched elements 푒푒′ is picked by GenModifiedInst equiprobably. Thus, for any 푒 푆 (푆1 푆푘), the ∈ ∖ ∪ · · · ∪ ′ 1 probability that each element 푒 푆푘 (푆1 푆푘−1) is matched to 푒 is . ∈ ∩ ∪ · · · ∪ |푆푘∩(푆1∪···∪푆푘−1)| By properties (c)-(e) of median instances, the last term is at most

∑︁ ′ ′ Pr 푒푒 Pr (푆, 푆푘) swap (푒, 푒 ) ′ ∈ ℳ · 푒 ∈(푆푘∩(∪푖∈[푘−1]푆푖))∖푆 1 1 (푆푘 ( 푖∈[푘−1]푆푖)) 푆 ′ ≤ | ∩ ∪ ∖ | · 푆푘 ( 푖∈[푘−1]푆푖) · Candidate(푒, 푒 ) | ∩ ∪ | | | 푘−1 1 1 2푝0(1 푝0)(1 푝0 )푛 푘−1 ≤ − − · (1 푝0)(1 푝 )푛/2 · 푝0(1 푝0)푚/2 − − 0 − 8

≤ (1 푝0)푚 −

132 Therefore,

푘 1 4푝0 2 푘 8 푃Elt−Set(푒, 푆) 1 2 + 1 푒 푝0 ≤ 푚 · · (1 푝0) · · (1 푝0)푚 푘 − 푘 −푘 4푝0 60푝0 64푝0 2 + 2 .  ≤ (1 푝0) (1 푝0)푚 ≤ (1 푝0) 푚 − − −

4.7. Omitted Proofs from Section 4.3

−1 Lemma 4.7.1. With probability at least 1 푚 over ( , 푝0), the size of the minimum − ℱ ∼ ℐ 풰 set cover of the instance ( , ) is greater than 2. ℱ 풰 Proof: The probability that an element 푒 is covered by two sets selected from is at ∈ 풰 ℱ most:

2 9 log 푚 Pr[푒 푆1 푆2] = 1 푝 = 1 . ∈ ∪ − 0 − 푛

9 log 푚 푛 −9 Thus, the probability that 푆1 푆2 covers all elements in is at most (1 ) < 푚 . ∪ 풰 − 푛 Applying the union bound, with probability at least 1 푚−1 the size of optimal set cover is − greater than 2. 

Lemma 4.7.2. Let 푆1 and 푆2 be two sets in where ( , 푝0). Then with probability ℱ ℱ ∼ ℐ 풰 −1 at least 1 푚 , (푆1 푆2) 18 log 푚. − |풰 ∖ ∪ | ≤ Proof: For an element , 2 9 log 푚 . So, . 푒 Pr[푒 / 푆1 푆2] = 푝0 = E[ (푆1 푆2) ] = 9 log 푚 ∈ ∪ 푛 |풰 ∖ ∪ | −9 log 푚/3 −3 By Chernoff bound, Pr[ (푆1 푆2) 18 log 푚] is at most 푒 푚 . Thus with |풰 ∖ ∪ | ≥ ≤ probability at least 1 푚−1, for any pair of sets in , the number of element not covered − ℱ by their union is at most 18 log 푚. 

Lemma 4.7.3. Let 푆1 and 푆2 be two sets in where ( , 푝0). Then 푆1 푆2 푛/8 ℱ ℱ ∼ ℐ 풰 | ∩ | ≥ with probability at least 1 푚−1. −

Proof: For each element 푒, it is either covered by both 푆1, 푆2, one of 푆1, 푆2 or none of them.

Since 푝0 1/2, the probability that an element is covered by both sets is greater than other ≤ cases, i.e., . Thus, . By Chernoff bound, Pr [푒 푆1 푆2] > 1/4 E[ (푆1 푆2) ] > 푛/4 ∈ ∩ |풰 ∖ ∩ | −1 Pr[ (푆1 푆2) 푛/8] is exponentially small. Thus with probability at least 1 푚 , the |풰 ∖ ∩ | ≤ − 133 intersection of any pairs of sets in is greater than 푛/8. ℱ  ′ Lemma 4.7.4. Suppose that ( , 푝0) and let 푒, 푒 be two elements in . With prob- ℱ ∼ ℐ 풰 풰 ability at least 1 푚−1, the number of sets 푆 such that 푒 푆 but 푒′ / 푆 is at least √ − ∈ ℱ ∈ ∈ 푚 9√ log 푚 . 4 푛

′ Proof: For each set 푆, Pr[푒 푆 and 푒 / 푆] = (1 푝0)푝0 푝0/2. This implies that the ∈ ∈ − ≥ √︁ expected number of 푆 satisfying the condition for 푒 and 푒′ is at least 푚 9 log 푚 and by 2 · 푛 Chernoff bound, the probability that the number of sets containing 푒 but not 푒′ is less than √ 푚 9√ log 푚 is exponentially small. Thus with probability at least 1 푚−1 property (d) holds 4 푛 − for any pair of elements in . 풰 

Lemma 4.7.5. Suppose that ( , 푝0) and let 푆1, 푆2 and 푆 be sets in . With proba- ℱ ∼ ℐ 풰 ℱ −1 bility at least 1 푛 , (푆1 푆2) 푆 6√푛 log 푚. − | ∩ ∖ | ≤ 2 Proof: For each element 푒, Pr[푒 (푆1 푆2) 푆] = (1 푝0) 푝0 푝0. This implies that the ∈ ∩ ∖ − ≤ expected size of (푆1 푆2) 푆 is less than √9푛 log 푚 and by Chernoff bound, the probability ∩ ∖ −1 that (푆1 푆2) 푆 6√푛 log 푚 is exponentially small. Thus with probability at least 1 푚 | ∩ ∖ | ≥ − property (푒) holds for any sets 푆1, 푆2 and 푆 in . ℱ  Lemma 4.7.6. For each element, the number of sets that do not contain the element is at √︁ most log 푚 . 6푚 푛 Proof: For each element , . This implies that is less 푒 Pr푆[푒 / 푆] = 푝0 E푆( 푆 푒 / 푆 ) √︁ ∈ |{ | ∈ }|√︁ than 푚 9 log 푚 and by Chernoff bound, the probability that 푆 푒 / 푆 2푚 9 log 푚 푛 |{ | ∈ }| ≥ 푛 is exponentially small. Thus with probability at least 1 푚−1 property (f) holds for any − element 푒 . ∈ 풰 

134 Chapter 5

Streaming Maximum Coverage

5.1. Introduction

In maximum 푘-coverage (Max 푘-Cover), given a ground set of 푛 elements, a family of 푚 풰 sets (each subset of ), and a parameter 푘, the goal is to select 푘 sets in whose union ℱ 풰 ℱ has the largest cardinality. The initial streaming algorithms for this problem were developed in the set arrival model, where the input sets are listed contiguously. This restriction is natural from the perspective of submodular optimization, but limits the applicability of the algorithms.1 Avoiding this limitation can be difficult, as streaming algorithms can no longer operate on sets as “unit objects”. As a result, the first maximum coverage algorithm for the general edge arrival model, where pairs of (set, element) can arrive in arbitrary order, have been developed recently. In particular [29] presented a one-pass algorithm with space linear

in 푚 and constant approximation factor.2 A particularly interesting line of research in set arrival streaming set cover and max

푘-cover is to design efficient algorithms that only use 푂̃︀(푛) space [155, 22, 67, 42, 131]. Pre- vious work have shown that we can adopt the existing greedy algorithm of Max 푘-Cover to achieve constant factor approximation in 푂̃︀(푛) space [155, 22] (which later improved to 1For example, consider a situation where the sets correspond to neighborhoods of vertices in a directed graph. Depending on the input representation, for each vertex, either the ingoing edges or the outgoing edges might be placed non-contiguously. 2We remark that many of the prior bounds (both upper and lower bounds) on set cover and max 푘-cover problems in set-arrival streams also work in edge arrival streams (e.g. [61, 97, 20, 131, 17, 105]). However, the design of efficient streaming algorithms for the coverage problems on edge arrival streams was firststudied explicitly in [29].

135 푂̃︀(푘) by [131]). However, the complexity of the problem in the “low space” regime is very different in edge-arrival streams: [29] showed that as long as the approximation factor is

a constant, any algorithm must use Ω(푚) space. Still, our understanding of approxima- tion/space tradeoffs in the general case is far from complete. Table 5.1.1 summarizes the known results.

5.1.1. Our Results

In this chapter, we complement the work of [29] by designing space-efficient (1/훼)-approximation algorithms for super-constant values of 훼. In fact, we show a tight (up to polylogarithmic factors) tradeoff between the two: the optimal space bound is 푂̃︀(푚/훼2) for estimating the 3 maximum coverage value, and 푂̃︀(푚/훼2+푘) for reporting an approximately optimal solution. The approximation factor 훼 can take any value in [1/Θ(̃︀ √푚), 1 1/푒). −

5.1.2. Our Techniques

In the edge arrival model, elements of each set can arrive irregularly and out of order. This necessitates the use of methods that aggregate the information about the input sets, or their coverage. In particular, distinct element sketches were used both in [29] (implicitly) and [131] (explicitly). In this chapter we expand the use of sketching toolkit. Specifically, in addition to distinct element estimation [13, 28, 112, 113, 32], we also need algorithms for heavy hitters

with respect to the 퐿2 norm [44, 165, 38, 37], as well as a frequency-based partitioning of elements, and detecting sets that “substantially contribute” to the solution [108]. Application

of vector-sketching techniques (e.g. 퐿푝-sampling/estimation and heavy hitters) in graph streaming settings have been studied extensively (e.g. [8,9, 89, 21, 49, 114]). We believe that our algorithms can lead to further connections between vector sketching methods and streaming algorithms for the coverage problems.

3We note that similar tradeoffs were previously obtained for the set cover problem, as[20] showed a Θ(푚푛/훼2) bound for estimation, and a Θ(푚푛/훼) bound for reporting. Interestingly, the 1/훼2 vs. 1/훼 gap does not occur for our problem. 4Their result works for the general submodular maximization assuming access to a value oracle that given a collection of sets computes their coverage. A careful adoption of their result to Max 푘-Cover (without the value oracle) uses 푂̃︀(푛) space.

136 Problem Stream Model Approximation Upper Bound Lower Bound

푚 [17] 1 휀 Ω(̃︀ 2 ) Estimation Edge Arrival − – 휀 1 1 휀 Ω( 푚 )[131] − 푒 − 푘2 푚 [29] 푂̃︀( 3 ) Edge Arrival 1 1 휀 휀 − 푒 − 푚 [131] Reporting 푂̃︀( 휀2 ) – 1 [155], 1 [22]4 푂(푛) Set Arrival 4 2 ̃︀ 1 푘 [131] 휀 푂̃︀( 3 ) 2 − 휀 Estimation Edge Arrival 푚 [here] 푚 [here],[29] 1/훼 푂̃︀( 훼2 ) Ω( 훼2 ) Reporting Edge Arrival 푚 [here] – 1/훼 푂̃︀(푘 + 훼2 ) Table 5.1.1: The summary of known results on the space complexity of single-pass stream- ing algorithms of Max 푘-Cover. Lower bound. Our algorithm was inspired by the lower bound. Specifically, it was pre-

viously shown by [29] that approximating Max 푘-Cover by a factor better than 2 requires Ω(푚) space. Similar approach works for larger values of 훼, by showing a reduction from the 훼-player set disjointess problem (DSJ[m]) with unique intersection guarantee (i.e., either players’ sets are disjoint or there is a unique item that appears in all sets) to the task of

훼-approximating Max 푘-Cover. The specific hard instances in the aforementioned lower bound can be distinguished in

the streaming model using space 푂(푚/훼2). To this end, we compute an 훼-approximation to

the 퐿∞-norm of a vector 푣 that, for each element 푒, counts the number of sets 푒 belongs to. 2 This problem can be solved in 푂(푚/훼 ) space, by using 퐿2-norm sketches [13]. This suggests that it might be possible to solve the general Max 푘-Cover using sketching techniques as well.

Upper bound. We start our algorithm with a “coverage boosting” universe reduction technique which constructs a reduced size instance (i.e., with reduced ground set) whose

optimal 푘-cover has constant fraction coverage (see Section 5.3.1). This step is particularly important as the space complexity of the existing methods for Max 푘-Cover is proportional to the reciprocal of the fraction of covered elements in an optimal solution.

137 Once we have a constant fraction coverage guarantee, our algorithm exploits three differ- ent approaches so that on any instance, at least one of them reports a “good” approximate solution.

Multi-layered set sampling. By extending the set sampling approach (see Section 5.2.1) and trying a larger range of sampling rate, 푘 훼푘 , we design a smooth variant of set sampling: [ 푚 , 푚 ] 5 a collection of sets sampled uniformly and independently at rate 푂̃︀(훽푘/푚) w.h.p., covers all elements that appear in at least 푚/(훽푘) sets. Besides expanding the application of set sampling in finding (1/훼)-approximate 푘-cover, this smooth variant implies more structure on the number of elements in a wider range of frequency levels which is specifically crucial in our approach for detecting sets with “low contribution”.

Unlike the set sampling based technique whose success in finding an 훼-approximate 푘- cover only depends on the structure of the set system, the performance of the next two approaches rely on the structure of optimal solutions as well: whether the majority of the coverage (in a specific optimal 푘-cover) is due to (few) “large” sets or, (many) “small” sets.

Heavy hitters and contributing frequencies. The high level idea in this approach is that if in an optimal solution, a sufficiently6 small number of sets cover the majority of the elements (covered by the optimal solution), it is enough to find a single large set, which naturally hints the use of ideas related to heavy hitters. For the sake of efficiency (in space complexity), we randomly partition sets into supersets of size at most 푘. However, once we merge sets into a single superset, we can no longer distinguish between their coverage and their total size. Since we combine sets at random, if all elements have “low” frequency in the set system, then the gap between the total size of all sets in a superset and their coverage is just 푂̃︀(1). This observation implies that if there is no “common” element in the set system, then we can use the total size of the sets in a superset as an estimate of its coverage size. To get around the case with (many) “common” elements, we show that performing the heavy hitter based algorithm on a sampled set of elements will find a sufficiently large superset as desired (see Section 5.8).

5In fact, 푂(log 푚푛)-wise independent is sufficient for all applications in this chapter. 6Depending on how large the value of 훼 is.

138 Detecting 푘-covers with many small sets. Finally, we address the case in which an optimal 푘-cover consists of many “small” sets. In this case, we can show that after subsam- pling sets uniformly with probability , a 푘 -cover with coverage at least times 1/훼 ( 훼 ) Θ(1/훼) the coverage of an optimal 푘-cover survives. This sampling method will save us a factor of 훼 in the memory usage of the algorithm. Further, by exploiting the structural prop- erty guaranteed due to the multi-layered set sampling7, we can show that element sampling

can save another factor of 훼 in the space complexity once applied to find a constant factor approximate 푘 of the subsampled sets. Max ( 훼 )-Cover 5.1.3. Other Related Work

Another important related question in this area is to design a “low-approximation” (i.e., better than the 2-approximation guarantee of the greedy approach) streaming algorithm for the max 푘-cover problem in the set arrival setting. Recently, Norouzi-Fard et al. [147] presented the first streaming algorithm that improves upon 2-approximation guarantee of greedy approach on random arrival streams. Very recently, Agrawal et al. [5] achieved an

8 almost (1 1/푒)-approximation in 푂̃︀(푛) space which is essentially the optimal bound [131]. − Still it is an important question to design such algorithms on adversarial order streams. We also remark that the algorithms of [147,5] do not work on edge arrival streams. In many scenarios, space is the most critical factor, and thus the question becomes: what approximation guarantees are possible within the given space bounds? This question has been studied before in the context of set cover in set arrival streams (e.g. [67, 42]), leading to poly(푛, 푚)-factor approximation algorithms.

5.2. Preliminaries and Notations

5.2.1. Sampling Methods for Max 푘-Cover and Set Cover Here we describe two sampling methods that have been used widely in the design of streaming algorithms for Max 푘-Cover and Set Cover [123, 61, 97, 20, 131, 17, 29, 105]. For a collection 7If the multi-layered set sampling fails to return a (1/훼)-approximate estimate, we can infer strong conditions on the maximum number of elements that belong to each frequency level in 푚 푚 . [ 푘 , 훼푘 ] 8Both [147,5] study the more general problem of submodular maximization and their results are stated with different notation and assuming oracle access. Here, we state their guarantees for max cover on set arrival streams

139 of sets , we define ( ) to denote the set of elements that are covered by ; ( ) := 풬 풞 풬 풬 풞 풬 ⋃︀ 푆. Moreover, we denote an optimal 푘-cover of ( , ) by OPT. 푆∈풬 풰 ℱ

Set Sampling. Roughly speaking, it says that by selecting sets uniformly at random, with high probability, all elements that appear in large number of sets will be covered.

Definition 5.2.1. An element 푒 is called 휆-common if it appears in at least 푐푚 polylog(푚, 푛)/휆 ∈ 풰 sets in . Furthermore, We denote the set of 휆-common elements by cmn. ℱ 풰휆 cmn cmn Observation 5.2.2. For any 0 휆1 휆2, . ≤ ≤ 풰휆1 ⊆ 풰휆2 Lemma 5.2.3 (Set Sampling [61]). Consider a set system ( , ) and let rnd be 풰 ℱ ℱ ⊆ ℱ a collection of sets such that each set 푆 is picked in rnd with probability 휆 . With high ℱ 푚 probability, rnd covers all elements that appear in Ω(̃︀ 푚/휆) sets (i.e. 휆-common elements). ℱ

Element Sampling for Max 푘-Cover . This sampling method shows that if we sample elements of uniformly with a large enough rate (i.e. proportional to (푘 )/ (OPT) ), 풰 |풰| |풞 | then a constant factor approximate 푘-cover over the sampled elements w.h.p., is a constant factor approximate solution of the original instance.

Lemma 5.2.4 (Element Sampling Lemma [123, 61]). Consider an instance of Max 푘-Cover( , ). Let’s assume that an optimal 푘-cover of ( , ) covers (1/휂)-fraction of . 풰 ℱ 풰 ℱ 풰 Let be a set of elements of size Θ(̃︀ 휂푘) picked uniformly at random. Then, with high ℒ ⊂ 풰 probability, a Θ(1)-approximate 푘-cover of ( , ) is a Θ(1)-approximate 푘-cover of ( , ). ℒ ℱ 풰 ℱ Observation 5.2.5. Let be a collection of (훽푘)-cover in . Then, in any partitioning of 풬 ℱ into 훽 groups, there exists one group with coverage at least ( ) /훽. In particular, an 풬 |풞 풬 | optimal 푘-cover in covers at least ( ) /훽. 풬 |풞 풬 | This simple observation is in particular interesting because it relates the task of 훼-approximating Max 푘-Cover to solving instances of Max (훽푘)-Cover where 훽 훼. ≤ 5.2.2. HeavyHitters and Contributing Classes

Suppose that a sequence of items 푝1, , 푝푇 arrive in a data stream where for each 푗 푇 , ··· ≤ 푝푗 [푚]. We can think of the stream as a sequence of (insertion only) updates on an initially ∈ zero vector ⃗푎 such that upon arrival of 푝푗 in the stream, ⃗푎[푗] ⃗푎[푗] + 1. Here, we review ← 140 the notion of 퐹2-heavy hitter and contributing coordinates that are used in our algorithm for approximating Max 푘-Cover.

Definition 5.2.6. Given an 푚-dimensional vector ⃗푎, an item 푗 (corresponding to ⃗푎[푗]) is a 2 ∑︀ 2 휑-HeavyHitter of 퐹2(⃗푎), if ⃗푎[푗] 휑 퐹2(⃗푎) = 휑 ⃗푎[푗] . Intuitively, the set of items ≥ · · 푗∈[푚] that appear frequently in the stream are the heavy hitters.

푖−1 푖 We conceptually partition coordinates of ⃗푎 into classes 푅푖 = 푗 2 < ⃗푎[푗] 2 . { | ≤ } 2푡 Definition 5.2.7. A class of coordinates 푅푡 is 훾-contributing if 푅푡 2 훾퐹2(⃗푎) = | | · ≥ ∑︀ 2. 훾 푗∈[푚] ⃗푎[푗]

Let 푅푡* be a 훾-contributing class and let 푛푡* denote the size of 푅푡* ; 푛푡* = 푅푡* . Further, | | * 푖*−1 푖* 푖* let’s assume that 푖 = log 푛푡* ; 2 < 푛푡* 2 . Let ℎ :[푚] [(12푚 log 푚)/2 ] be a ⌈ ⌉ ≤ → function chosen uniformly at random from a family of Θ(log(푚푛))-wise independent hash 푖* functions. We define 푖* as a sampled substream of the input stream with rate 1/2 . More 풮 precisely, 푖* only contains the updates corresponding to the coordinates 푖* = 푗 ℎ(푗) = 1 풮 ℱ { | } that are mapped to one under ℎ. Next, we show that the survived coordinates of 푅푡*

(푗 푅푡* ) in ⃗푎푖* , which is the vector ⃗푎 restricted to the items in 푖* , are Ω(̃︀ 훾)-HeavyHitters ∈ ℱ 2 of 퐹2(⃗푎푖* ); ⃗푎[푗] Ω(̃︀ 훾) 퐹2(⃗푎푖* ). Roughly speaking, if we subsample the stream so that ≥ · only polylog(푚) coordinates of 푅푡* survive, then with high probability these coordinates are Ω(̃︀ 훾)-HeavyHitters of the sampled substream.

Claim 5.2.8. With probability at least 1 푚−1, the number of survived coordinates in the − 푖* sampled substream 푖* is at least (6푚 log 푚)/2 . 풮

Proof: Let 푋푗 be a random variable which is one if item 푖푗 푖* and zero otherwise. We ∈ ℱ define 푋 := 푋1 + + 푋푚. Note that 푋푗 are Θ(log(푚푛))-wise independent and E[푋] = ··· * (12푚 log 푚)/2푖 . Then, by an application of Chernoff bound with limited independence (Lemma 5.7.3),

* Pr(푋 < (6푚 log 푚)/2푖 ) 푚−1. ≤

푖* Hence, with high probability, 푖* has size at least (6푚 log 푚)/2 . ℱ  2 푐 Lemma 5.2.9. With probability at least 1 2/(9 log 푛 log 푚), a coordinate 푗 푅푡* is a − ∈

141 훾 - in the sampled substream . ( 2 푐+1 ) HeavyHitter 푡* 162 log 푛 log 푚 풮

Proof: Let 푋푖 be a random variable corresponding to the 푖-th coordinate in 푅푡* such that if and zero otherwise. Moreover, we define . Then, 푋푖 = 1 푖 푖* 푋 := 푋1+ +푋푛 * E[푋] = ∈ ℱ ··· 푡 푖* (12푛푡* log 푚)/2 and by an application of Chernoff bound with limited independence,

√︀ Pr(푋 < 1) < Pr(푋 < (1 1/6)E[푋]) 푚−1. − ≤

푐 푖* Next, we show that with probability at least 1 1/ log 푚, 퐹2(⃗푎푖* ) 푂̃︀(퐹2(⃗푎)/2 ). It is − ≤ 푖* straightforward to check that E[퐹2(⃗푎푖* )] = (12퐹2(⃗푎) log 푚)/2 . Hence, by Markov’s inequal- ity,

2 푐+1 푖* 2 푐 Pr(퐹2(⃗푎푖* ) (108퐹2(⃗푎) log 푛 log 푚)/2 ) 1/(9 log 푛 log 푚). ≥ ≤

2 푐 Hence, with probability at least 1 2/(9 log 푛 log 푚), a coordinate 푗 푅푡* survives − ∈ 2 푐+1 푖* in 푖* and 퐹2(⃗푎푖* ) (108퐹2(⃗푎) log 푛 log 푚)/2 . Thus, with probability at least 1 ℱ ≤ − 2/(9 log2 푛 log푐 푚),

푖* 2 푡*−1 2 훾 훾 2 훾 (⃗푎[푖]) (2 ) 퐹2(⃗푣) ( )퐹2(⃗푣) = ( )퐹2(⃗푣) ≥ ≥ 4푛푡* ≥ 4 2푖* 108 log2 푛 log푐+1 푚 432 log2 푛 log푐+1 푚 ·

2 푐 In other words, with probability at least 1 2/(9 log 푛 log 푚), a coordinate 푖 푆푡* is a − ∈ 훾 - of * . ( 432 log2 푛 log푐+1 푚 ) HeavyHitter 퐹2(⃗푎푖 ) 

Next, we can use the exiting algorithms for 퐹2-HeavyHitters to complete this section.

Theorem 5.2.10 (퐹2-heavy hitters [44, 165, 38, 37]). Let’s assume that ⃗푎 is an 푚- dimensional vector initialized to zero. Let be a stream of items 푝1, , 푝푇 where for 풮 ··· each 푗 [푇 ], 푝푗 [푚]. Then, there is a single pass algorithm 퐹2-HeavyHitter that uses ∈ ∈ 2 푂̃︀(1/훾) space and with high probability returns all coordinates 푖 such that ⃗푎[푖] 훾퐹2(⃗푎). In ≥ addition, it returns (1 1 )-approximate values of these coordinates. ± 2 Finally, there exists an algorithm that with probability at least 1 2/(9 log 푛 log푐 푚), − finds at least one coordinate in each 훾-contributing class of ⃗푎 using 푂̃︀(1/훾) space.

Theorem 5.2.11 (훾-contributing [108]). Let’s assume that ⃗푎 is an 푚-dimensional vector

142 Algorithm 16 A single pass streaming algorithm that given a vector 푎 in a stream, for each contributing class 푅, returns an index 푗 along with a (1 (1/2))-estimate of its frequency. ± 1: procedure 퐹2-Contributing(훾, 푆) 2: ◁ Input: stream of updates 푖 [푚] on vector ⃗푎 3: ◁ 푆 is an upper bound∈ on the size of a 훾-contributing class 푖 4: for all 푛푡 2 푖 [log 푆] in parallel do ∈ { | ∈ } 5: ◁ 푛푡 : #coordinates in a 훾-contributing class 6: 훾 parameter of 휑 ( 푐+1 ) ◁ HeavyHitter ← 432 log 푛 log 푚 7: let HH휑 be a instance of 퐹2-HeavyHitter(휑) 8: 휌 (12 log 푚)/2푖 ◁ sample rate of the substream 9: pick← ℎ :[푚] [ 푚 ] from a family of Θ(log(푚푛))-wise independent hash functions → 휌 10: for all item 푝푖 in the input stream do 11: if ℎ(푝푖) = 1 then 12: feed 푝푖 to HH휑

13: return output of HH휑 14: ◁ returns heavy coordinates together with their approximate frequencies

initialized to zero. Let be a stream of items 푝1, , 푝푇 where for each 푗 [푇 ], 푝푗 [푚]. 풮 ··· ∈ ∈ Moreover, let’s assume no item in has frequency more than 푛. There exists a single 풮 pass algorithm 퐹2-Contributing that uses 푂̃︀(1/훾) space and with probability at least 1 − 2/(9 log 푛 log푐 푚) returns a coordinate 푖 from each 훾-contributing class. In addition, it returns (1 1 )-approximate frequency of these coordinates. ± 2 Proof: There are at most log 푛 (the total number of classes) 훾-contributing classes for

⃗푎 and for each 훾-contributing class 푅푡, by Lemma 5.2.9, with probability at least 1 − 2 푐 * 2/(9 log 푛 log 푚), a coordinate in 푅푡 will be a Ω(̃︀ 훾)-HeavyHitter of 퐹2(⃗푎푖* ) (where 푖 = ). By trying all values of * , with probability at least 2 log(푛푡) 푖 [log 푛] 1 log 푛( 2 푐 ) ⌈ ⌉ ∈ − 9 log 푛 log 푚 ≥ 2 1 푐 , 퐹2-Contributing algorithm outputs a coordinate from each 훾-contributing − 9 log 푛 log 푚 class. 

5.2.3. 퐿0-Estimation

Norm estimation is one of the fundamental problems in the area of streaming algorithms

where we are given an 푚-dimensional vector ⃗푎 which is initialized to zero and a sequence

of items 푝1, , 푝푇 (updates for the vector ⃗푎) where for each 푗 [푇 ], 푝푗 [푚] arrive ··· ∈ ∈ in a data stream. In the well-studied task 퐿0-estimation (also known as Count-distinct problem), the goal is to output a (1 휀)-estimate of the number of distinct elements (i.e., ± 143 퐿0(⃗푎) := 푖 ⃗푎[푖] = 0 ) after reading the whole stream. |{ | ̸ }|

Theorem 5.2.12 (퐿0-estimation [13, 28, 112, 113, 32]). Let’s assume that ⃗푎 is an 푚-

dimensional vector initialized to zero. Let be a stream of items 푝1, , 푝푇 where for each 풮 ··· 푗 [푇 ], 푖푗 [푚]. There exists a single pass algorithm that returns a (1 1/2)-approximation ∈ ∈ ± of 퐿0(⃗푎) and uses 푂̃︀(1) space.

5.3. Estimating Optimal Coverage of Max 푘-Cover

In this section, we describe the outline of our single-pass algorithm that approximates the

coverage size of an optimal 푘-cover of ( , ) within a factor of 훼 using 푂̃︀(푚/훼2) space in 풰 ℱ arbitrary order edge arrival streams. The input to our algorithm is 푘, 훼, 푛 = and 푚 = . |풰| |ℱ| In high level, we perform three different subroutines in parallel and show that for any given

Max 푘-Cover instance, at least one of the subroutines estimates the optimal coverage size within the desired factor in the promised space.

Theorem 5.3.1. For any 훼 [1/Θ(̃︀ √푚), 1/Θ(1))̃︀ , there exists one pass streaming algorithm ∈ that uses 푂̃︀(푚/훼2) space and with probability at least 3/4 computes the size an optimal coverage of Max 푘-Cover( , ) within a factor of 1/훼 in edge-arrival streams. 풰 ℱ Note that Theorem 5.3.1 together with the 푂(1)-approximation algorithms of [131, 29] that use 푂̃︀(푚) space, implies that for any 훼 [1/Θ(̃︀ √푚), 1 1/푒), there exists a single-pass ∈ − streaming algorithm that computes an 훼-approximation of the optimal coverage size of Max 푘-Cover( , ) in 푂̃︀(푚/훼2) space. Later in Section 5.5, we extend our approach fur- 풰 ℱ ther to achieve a single pass algorithm that computes an (1/훼)-approximate 푘-cover in 푂̃︀(푚/훼2 + 푘) space.

Theorem 5.3.2. For any 훼 [1/Θ(̃︀ √푚), 1/Θ(1))̃︀ , there exists a single-pass algorithm that ∈ uses 푂̃︀(푚/훼2 + 푘) space and with probability at least 3/4 returns an (1/훼)-approximate solution of Max 푘-Cover ( , ) in edge-arrival streams. 풰 ℱ Finally, we complement our upper bounds with a matching lower bound in Section 5.6.

Theorem 5.3.3. Any single pass (possibly randomized) algorithm on edge-arrival streams that (1/훼)-approximates the optimal coverage size of Max 푘-Cover requires Ω(푚/훼2) space.

144 As a first step, we provide a mapping from the ground set to a small size set of pseudo- 풰 elements such that the optimal 푘-cover on the pseudo-elements covers a constant fraction of the pseudo-elements. This reduction is in particular useful for bounding the number of required samples in methods such as element sampling.

5.3.1. Universe Reduction

In this section we show that in order to solve Max 푘-Cover on edge-arrival streams, it suffices to solve the instances whose optimal coverage size are at least a constant fraction of . |풰| To this end, suppose that we have an algorithm for Max 푘-Cover in edge-arrival streams 풜 with the following properties:

Definition 5.3.4훼, (( 훿, 휂)-oracle for Max 푘-Cover). An algorithm is an (훼, 훿, 휂)-oracle 풜 for Max 푘-Cover if it satisfies the following properties (훼 denotes the approximation guaran- tee, 훿 denotes the failure probability and 휂 is the promised coverage of an optimal 푘-cover):

If the optimal coverage size of Max 푘-Cover( , ) is at least /휂, then with proba- ∙ 풰 ℱ |풰| bility at least 1 훿, returns a (1/훼)-approximation of the optimal coverage size. − 풜 If returns 푧, then an optimal solution of Max 푘-Cover ( , ) with high probability, ∙ 풜 풰 ℱ has coverage at least 푧.

Using an (훼, 훿, 휂)-oracle of Max 푘-Cover, we design an (1/푂̃︀(훼))-approximation algorithm EstimateMaxCover for general Max 푘-Cover with success probability at least 1 푂̃︀(훿) − as in Algorithm 17.

As in EstimateMaxCover, let ℎ : [푧] be a hash function picked uniformly at 풰 → random from a family of 4-wise independent hash functions mapping the ground set 풰 onto pseudo-elements = 1, , 푧 . Furthermore, for a subset of elements 푆, we define 풱 { ··· } ⋃︀ . ℎ(푆) := 푒∈푆 ℎ(푒) Lemma 5.3.5. Let ℎ : [푧] be a hash function picked uniformly at random from a family 풰 → of 4-wise independent hash functions where 푧 32. Further, let 푆 be a subset of of size ≥ 풰 at least 푧. Then, with probability at least 3/4, ℎ(푆) 푧/4. | | ≥

Proof: For any pair of elements 푒푖, 푒푗 푆, let 푋푖,푗 be a random variable which is one if ∈ (i.e. they collide) and zero otherwise. Let ∑︀ denote the the ℎ(푒푖) = ℎ(푒푗) 푋 := 푒푖,푒푗 ∈푆 푋푖,푗

145 Algorithm 17 A single-pass streaming algorithm that uses an (훼, 훿, 휂)-oracle of Max 푘- Cover to compute an (1/푂̃︀(훼))-approximation of the optimal coverage size of Max 푘-Cover. 1: procedure EstimateMaxCover(푘, 훼) 2: ◁ Input: is an (훼, 훿, 휂)-oracle of Max 푘-Cover 3: ◁ 푆풜is an upper bound on the size of a 훾-contributing class 4: if 푘훼 푚 then 5: return≥ 푛/훼◁ trivial bound ◁ run for different guesses on the optimal coverage size in parallel 6: for all 푧 2푖 푖 [log 푛] do ∈ { | ∈ } 7: est푧 0 8: for 푖←= 1 to log(1/훿) do ◁ boost the success probability 9: pick ℎ푖 : [푧] from a family of 4-wise independent hash functions. 10: for all (푆,풰 푒 →) in the data stream do 11: feed (푆, ℎ푖(푒)) to (훼, 훿, 휂)-oracle 풜 12: est푧 max(output of on the stream constructed by ℎ푖, est푧) ← 풜 13: return max est푧 est푧 푧/(4훼) { | ≥ } total number of collision among the elements of 푆 under ℎ.

2 First, we show that if 푋 푆 /훾, then ℎ(푆) 훾/4. Let’s assume ℎ(푆) = 푣1, , 푣푞 ≤ | | | | ≥ { ··· } and let 푛푖 denote the number of elements in 푆 that are mapped to 푣푖 by ℎ. Then, the total number of collision, 푋, is

푞 (︂ )︂ 푞 2 ∑︁ 푛푖 ∑︁ 푛푖 1 푆 푆 푋 = ( )2 푞 (| |)2 = | | . 2 2 4 푞 4푞 푖=1 ≥ 푖=1 ≥ · ·

This implies that 푞 = ℎ(푆) 훾/4. Using this observation, it only remains to show that | | ≥ with probability at least 3/4, 푋 푆 2/푧. Since ℎ is selected from a family of 4-wise | | ≤ | | independent hash functions, 푋푖,푗 푒 ,푒 ∈푆 are pairwise independent. Hence, { } 푖 푗

∑︁ (︂ 푆 )︂ 1 푆 2 E[푋] = E[푋푖,푗] = | | ( ) | | , 2 · 푧 ≤ 2푧 푒푖,푒푗 ∈푆 ∑︁ (︂ 푆 )︂ 1 1 푧 Var[푋] = Var[푋푖,푗] = | | ( ) . 2 · 푧 − 푧2 ≥ 8 푒푖,푒푗 ∈푆

Applying Chebyshev’s inequality,

Pr(푋 > 푆 2/푧) Pr(푋 > E[푋] + Var[푋]) 1/Var[푋] 8/푧 1/4. | | ≤ ≤ ≤ ≤

146 Hence, with probability at least 3/4, 푋 푆 2/푧 which implies that with probability at least ≤ | | 3/4, ℎ(푆) 푧/4. | | ≥  Theorem 5.3.6. If there exists a single pass (훼, 훿, 휂)-oracle for Max 푘-Cover( , ) on 풰 ℱ edge-arrival streams that uses 푓(푚, 훼) space with 휂 4, then EstimateMaxCover is a ≥ (1/푂(훼))-approximation algorithm for Max 푘-Cover with failure probability at most 4훿 log 푛 that uses 푂̃︀(푓(푚, 훼)) space on edge-arrival streams.

Proof: Let OPT be an optimal solution of Max 푘-Cover( , ). First, we show that with high 풰 ℱ probability, EstimateMaxCover returns Ω( (OPT) /훼). Note that for each guess on the |풞 | optimal coverage size 푧 (OPT) , by Lemma 5.3.5, the probability that in none of log(1/훿) ≤ |풞 | iterations ℎ( (OPT)) > (OPT)/4 is at most 훿 (i.e., none of the iterations preserve the | 풞 | 풞 optimal coverage size up to a factor of 4). Moreover, by the guarantee of (훼, 훿, 휂)-oracles for Max 푘-Cover, each run of fails with probability at most 훿. Thus, by an application of union 풜 bound, with probability at least 1 2훿 log 푛, est푧 is at least 푧/(4훼) for all 푧 (OPT) . − ≤ |풞 | This in particular implies that the solution returned by EstimateMaxCover is at least

(OPT) /(8훼). Moreover, since the coverage of a 푘-cover never increases after applying the |풞 | “universe reduction” step (i.e. for each 푆 , ℎ( (푆)) 푆 ) and the estimate returned ⊆ 풰 | 풞 | ≤ | | by the (훼, 훿, 휂)-oracle is with high probability less than the optimal coverage size, the 풜 output of EstimateMaxCover is in [ (OPT) /(8훼), (OPT) ] with probability at least |풞 | |풞 | 1 4훿 log 푛. − Finally, since EstimateMaxCover runs (log 푛)(log 1/훿) instances of with parameter 풜 (훼, 훿, 휂) in parallel and each instance has 푚 sets, the total space of EstimateMaxCover

is 푂̃︀(푓(푚, 훼)). 

The universe reduction step basically enables us to only focus on the instances of Max 푘- Cover in which the optimal solution covers a constant fraction of the ground set, namely at least /4 elements. Next, in Section 5.4, we design an 푂̃︀(푚/훼)-space (훼, 훿, 휂)-oracle |풰| for Max 푘-Cover with 훼 = 1/Ω(1)̃︀ , 휂 4 and 훿 휀/ log 푛 (휀 < 1), which together with ≥ ≤ Theorem 5.3.6 completes the proof of Theorem 5.3.1. Our (훼, 훿, 휂)-oracle for Max 푘-Cover performs three different subroutines in parallel that together guarantee the required proper-

ties of (훼, 훿, 휂)-oracles and only use 푂̃︀(푚/훼2) space:

147 Set sampling based approach. This subroutine which provides the guarantee of ∙ (훼, 훿, 휂)-oracles when the number of common elements is large (see Definition 5.2.1) is an application of a “multi-layered” variant of set sampling. This subroutine is presented in Section 5.4.1.

HeavyHitter based approach. We relate the problem of 훼-estimating/approximating ∙ of Max 푘-Cover to the problem of finding contributing classes and heavy hitters on properly sampled substreams (see Section 5.2.2) when the main contribution of an

optimal solution of Max 푘-Cover is due to “large” sets. In particular, this subroutine finds an (1/훼)-estimation of the optimal coverage size when 훼 = Ω(푘). This approach is presented in Section 5.4.2.

Element sampling based approach. Finally, we employ element sampling together ∙ with a new sampling technique that samples a collection of sets to find a desired

estimate of Max 푘-Cover on instances for which the main contribution to an optimal solution comes from “small” sets. Here, we also take advantage of the structure guaran- teed by the multi-layered set sampling on the number of elements in different frequency levels. This subroutine is presented in Section 5.4.3.

5.4. (훼, 훿, 휂)-Oracle of Max 푘-Cover

In this section, we design the promised (훼, 훿, 휂)-oracle for Max 푘-Cover. Let OPT denote an optimal solution of Max 푘-Cover ( , ). As described in Definition 5.3.4, the solu- 풰 ℱ tion returned by a (훼, 훿, 휂)-oracle with high probability, is smaller than (OPT) and if |풞 | (OPT) /휂, with probability at least (1 훿), it outputs a value not smaller than |풞 | ≥ |풰| − (OPT) /훼. The following theorem together with Theorem 5.3.6 proves Theorem 5.3.1. |풞 | Theorem 5.4.1. Oracle(훼, 푘) performs a single pass on edge arrival streams of the set system ( , ) and implements a (푂̃︀(훼), 1/(log 푛 log푐 푚), 휂)-oracle of Max 푘-Cover( , ) us- 풰 ℱ 풰 ℱ ing 푂̃︀(푚/훼2) space. Proof: The proof follows from the guarantees of LargeCommon (Theorem 5.4.4), Large- Set (Theorem 5.4.8) and SmallSet (Theorem 5.4.22). The total space of the algorithm

148 is clearly 푂̃︀(푚/훼2) which is the space complexity of the each of subroutines invoked by Oracle. 

To design the promised (훼, 훿, 휂)-oracle, we design different subroutines such that each guarantees the properties required by the oracle if certain conditions based on the the size/value of following notions hold.

Common elements. An important property in design of our oracle is whether there exists

훽 훼 such that the number of (훽푘)-common elements is relatively large (see Definition 5.2.1). ≤ We also take advantage of another useful notion which is a property of a 푘-cover (though, here we only describe it for optimal 푘-covers).

Contribution to the optimal coverage. Given the input argument 훼 and a parameter s as defined in Table 5.4.1, we define the following notion of contribution for the sets in an (optimal) 푘-cover.

Definition 5.4.2. For a 푘-cover OPT = 푂1, , 푂푘 , we consider an arbitrary ordering { ··· } ′ ′ of the sets in OPT and define the contribution of 푂푖 to (OPT) as 푂 where 푂 := 풞 | 푖| 푖 ⋃︀ ′ ⋃︀ ′ 푂푖 푂푖. Note that 푂 are disjoint and 푂 = (OPT) = 푧. We (conceptually) ∖ 1≤푗<푖 푖 | 푖∈[푘] 푖| |풞 | define OPTlarge to be the collection of all sets in OPT that contribute more than 푧/(s훼) ′ ′ to (OPT) according to 푂 s; OPTlarge = 푂푖 OPT 푂 푧/(s훼) for s < 1 (as in 풞 푖 { ∈ | | 푖| ≥ } ′ Table 5.4.1). Note that since 푂 s are disjoint, OPTlarge s훼. 푖 | | ≤ 9 w 2500 log2(푚푛) 1 w = min 푘, 훼 , s = , t = , f = 7 log(푚푛), 휎 = 2 , 휂 = 4 { } 2500√2휂 log(s훼) log2(푚푛) · 훼 s 2500 log (푚푛) Table 5.4.1: The values of parameters used in our algorithms.

Design of (훿, 훼, 휂)-oracle of max 푘-cover. Here we sketch a high-level outline of our (훿, 훼, 휂)-oracle for Max 푘-Cover (refer to Algorithm 18 for a formal description). In the following cases, 1 (as in Table 5.4.1). 휎 = Ω( log2(푚푛) )

If there exists a 훽 훼 such that cmn 휎훽|풰| . In this case, by Observation 5.2.5, ∙ ≤ |풰훽푘 | ≥ 훼 to approximate Max 푘-Cover( , ) within a factor of 푂̃︀(훼), it suffices to find 훽푘 sets 풰 ℱ that cover cmn which can be done via set sampling (see Section 5.4.1). 풰훽푘

cmn 휎훽|풰| (OPTlarge) (OPT) /2 and 훽 훼, < . The subroutine for this case ∙ |풞 | ≥ |풞 | ∀ ≤ |풰훽푘 | 훼 which is presented in Section 5.4.2, handles the instances of the problem in which s훼 ≥ 149 Algorithm 18 An (훼, 훿, 휂)-oracle of Max 푘-Cover. 1: procedure Oracle(푘, 훼) 2: ◁ for instances in which 훽 훼 s.t. cmn 휎훽|풰| ∃ ≤ |풰훽푘 | ≥ 훼 3: solcmn LargeCommon(푘, 훼) 4: if s훼 ←2푘 then ≥ 5: ◁ if s훼 2푘, then OPTlarge OPT /2 ≥ | | ≥ | | 6: solHH LargeSet(푘, 훼, 푘) 7: else ←

8: ◁ for instances with s훼 < 2푘 and OPTlarge OPT /2 | | ≥ | | 9: solHH LargeSet(푘, 훼, 훼) ← 10: ◁ for instances with OPTlarge < OPT /2 | | | | 11: solsmall SmallSet(푘, 훼) ← 12: return max(solcmn, solHH, solsmall)

2푘 or, s훼 < 2푘 and there exists an optimal solution OPT such that (OPTlarge) |풞 | ≥ (OPT) /2. |풞 |

Claim 5.4.3. If s훼 2푘, then (OPTlarge) (OPT) /2. ≥ |풞 | ≥ |풞 | Proof: Consider the optimal solution OPT and ignore the sets in OPT whose contri-

bution to the coverage is less than (OPT) /(2푘). Note that the survived sets belong |풞 | |풞(OPT)| to OPTlarge and their total coverage is at least (OPT) 푘 (OPT) /2. |풞 |− · 2푘 ≥ |풞 | 

cmn 휎훽|풰| (OPTlarge) < (OPT) /2 and 훽 훼, < . In this case, the main ∙ |풞 | |풞 | ∀ ≤ |풰훽푘 | 훼 contribution to the coverage of OPT comes from “small” sets. This enables us to

show that if we sample sets with probability 1/훼, then Ω(1̃︀ /훼)-fraction of sets in OPT survive and with high probability, their coverage is Ω(̃︀ (OPT) /훼). In Section 5.4.3, |풞 | we show that element sampling method with some new ideas can take care of this case

which can only happen when s훼 < 2푘.

cmn 휎훽 5.4.1. Multi-layered Set Sampling: 훽 훼 s.t. |풰| ∃ ≤ |풰훽푘 | ≥ 훼 Here, we first guess the value of 훽 (more precisely, a 2-approximate estimate of 훽) and then pick 훽푘 sets rnd at random and compute their coverage in one pass using 푂̃︀(1) space. To get ℱ훽 the desired space complexity, we use the implementation of set sampling with 푂(log(푚푛)) random bits as described in Section 5.7.1.

Theorem 5.4.4. Consider an instance ( , ) of Max 푘-Cover. The LargeCommon al- 풰 ℱ 휎훽|풰| gorithm uses 푂̃︀(1) space and if there exists 훽 훼 such that cmn , then with ≤ |풰훽푘 | ≥ 훼 150 Algorithm 19 A (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the number of common elements is large. 1: procedure LargeCommon((푘, 훼)) 푖 2: for all 훽g 2 1 푖 log 훼 in parallel do ∈ { | ≤ ≤ } 3: ◁ perform set sampling in one pass using 푂̃︀(1) space. 푐푚 log 푚 4: pick ℎg : [ ] from Θ(log(푚푛))-wise independent hash functions ℱ → 훽g푘 5: let DEg be a (1 1/2)-approximation streaming algorithm of 퐿0-estimation 6: for all (푆, 푒) in± the data stream do 7: if ℎg(푆) = 1 then rnd 8: feed 푒 to DEg ◁ computing the coverage of ℱ훽g 9: if VAL(DEg) 휎훽g /(4훼) then ≥ |풰| 10: return 2VAL(DEg)/(3훽g) 11: return infeasible s.t. cmn 휎훽|풰| ◁ @훽 [훼] 훽푘 ∈ |풰 | ≥ 훼푘 high probability, the algorithm returns at least 휎 /(6훼). Moreover, with high probability |풰| the output of LargeCommon is smaller than the coverage size of an optimal solution of

Max 푘-Cover( , ). 풰 ℱ 푖 rnd Claim 5.4.5. For each 훽g 2 1 푖 log 훼 , with high probability, 훽g푘. ∈ { | ≤ ≤ } |ℱ훽g | ≤ Lemma 5.4.6. If there exists 훽 훼 such that cmn 휎훽 /훼, then with high probability ≤ |풰훽푘 | ≥ |풰| the output of LargeCommon is at least 휎 /(6훼). |풰| Proof: Let 2푖 be the smallest power of two which is larger than or equal to 훽; 푖 := log 훽 . ⌈ ⌉ 푖 Consider the iteration of LargeCommon in which 훽g = 2 . Since 2훽 > 훽g 훽 and by ≥ Observation 5.2.2,

휎훽 휎훽g cmn cmn |풰| |풰|. |풰훽g푘 | ≥ |풰훽푘 | ≥ 훼 ≥ 2훼

Hence, by the guarantee of existing streaming algorithms for 퐿0-estimation (Theorem 5.2.12)

1 휎훽g|풰| 휎훽g|풰| and set sampling (Lemma 5.2.3 and 5.7.7), w.h.p., VAL(DEg) = . Hence, ≥ 2 · 2훼 4훼 the estimate returned by the algorithm which is a lower bound on the coverage of the best

푘 sets in rnd (see Observation 5.2.5), w.h.p., is at least 2 1 휎훽g|풰| = 휎|풰| . ℱ훽g 3 · 훽g · 4훼 6훼 Moreover, it is straightforward to check that by the guarantee of the streaming algorithm for 퐿0-estimation (Theorem 5.2.12), the value returned by LargeCommon with high prob- ability is not more than the actual coverage of the best 푘-cover in the collection of sampled sets using . ℎg 

151 Lemma 5.4.7. If LargeCommon returns infeasible, then with high probability, for all

훽 훼, cmn 휎훽 /훼. ≤ |풰훽푘 | ≤ |풰| Proof: Since the algorithm returns infeasible, by the guarantee of the (1 1/2)-approximation ± algorithm for 퐿0-estimation (Theorem 5.2.12) and set sampling (Lemma 5.2.3), for all values 푖 of 훽g 2 푖 log 훼 , with high probability, ∈ { | ≤ }

cmn rnd ( ) 2VAL(DEg) < 휎훽g /(2훼). (5.4.1) |풰훽g푘 | ≤ |풞 ℱ훽g | ≤ |풰|

⌈log 훽⌉ Now, for any given value 훽 훼, consider 훽g := 2 (i.e. set 훽g to be the smallest power ≤ of two which is larger than or equal to 훽). By Observation 5.2.2, cmn cmn which |풰훽푘 | ≤ |풰훽g푘 | cmn together with Eq. 5.4.1 implies that 휎훽g /(2훼) 휎훽 /훼. |풰훽푘 | ≤ |풰| ≤ |풰|  Proof of Theorem 5.4.4: The guarantee on the output of LargeCommon follows from Lemma 5.4.6. Moreover, by Theorem 5.2.12, the total amount of space to compute the cover-

rnd age of each collection (via existing 퐿0-estimation algorithms in streams) is 푂̃︀(1). Hence, ℱ훽g the total space to compute the coverage of all log 훼 collections considered in LargeCommon

is 푂̃︀(1).  5.4.2. Heavy Hitters and Contributing Classes:

(OPTlarge) (OPTsmall) |풞 | ≥ |풞 | In this section, we show that if there exists an optimal solution OPT of Max 푘-Cover( , ) 풰 ℱ such that the main contribution in the coverage of OPT is due to large sets, which are

formally defined to be the sets whose contribution to (OPT) is at least (OPT) /(s훼), 풞 |풞 | then we can approximate the optimal coverage size within a factor of 1/푂̃︀(훼) by detecting Ω(̃︀ 훼2/푚)-HeavyHitters in properly sampled substreams. Following is the main result of this section.

Theorem 5.4.8. Consider an instance ( , ) of Max 푘-Cover. In a single pass, LargeSet 풰 ℱ uses 푂̃︀(푚/훼2) space and if the optimal coverage size of the instance is Ω( ), then with |풰| probability at least 1 1/(log 푛 log푐 푚), it returns at least Ω(̃︀ /훼). Moreover, with high − |풰| probability, the estimate returned by LargeSet is smaller than the optimal coverage size.

We defer the proof of Theorem 5.4.8 to Section 5.8. In this section, we prove the same

152 guarantees on the performance of a simplified variant of LargeSet, LargeSetSimple, when contains no “common” elements (wee will define them formally later in this section) 풰 which essentially presents the main technical ideas.

Partitioning sets into supersets. We partition sets of randomly into (푐푚 log 푚)/w ℱ supersets := 1, , (푐푚 log 푚)/w via a hash function ℎ : [(푐푚 log 푚)/w] chosen 풬 {풟 ··· 풟 } ℱ → from a family of Θ(log(푚푛))-wise independent hash functions. More precisely, each set

푆 belongs to the superset ℎ(푆). ∈ ℱ 풟 The parameter w denotes the desired upper bound on the maximum number of sets in a superset in defined by ℎ and is set to min(훼, 푘). In fact, given w, we define ℎ to be 풬 a function picked uniformly at random from a family of Θ(log(푚푛))-wise independent hash functions [(푐푚 log 푚)/w] . {ℱ → } Claim 5.4.9. With high probability, no superset in has more than w sets. 풬 Claim 5.4.10. With high probability, for each 푒 cmn and , the number of sets ∈ 풰 ∖ 풰w 풟 ∈ 풬 in that contain 푒 is at most f where f = Θ(log(푚푛)). 풟 This implies that assuming cmn = , to get a (1/푂̃︀(훼))-approximation of Max 푘-Cover( , ), 풰w ∅ 풰 ℱ it suffices to find a superset whose total size of its setsis Ω(1̃︀ /훼) times the optimal cover- age size. Now, we are ready to exploit the results on 퐹2-heavy hitters and 퐹2-contributing classes mentioned in Section 5.2.2 to describe our (훼, 훿, 휂)-oracle for Max 푘-Cover assuming cmn = . Later in Section 5.8, we show how to remove this assumption by performing our 풰w ∅ algorithm on a set of sampled elements in instead. 풰

Partitioning supersets by their total size. First, setting 푧 = (OPT) , we partition |풞 | the supersets in (conceptually) according to the total size of their sets into 푂(log 훼) classes 풬 153 as follows:

∑︁ 0 = 푆 푧/2 , 풬 {풟 | | | ≥ } 푆∈풟 푖+1 ∑︁ 푖 푖 = 푧/2 푆 < 푧/2 , 푖 [1, log(훼)) 풬 {풟 | ≤ | | } ∀ ∈ 푆∈풟 ∑︁ small = 푆 < 푧/훼 . 풬 {풟 | | | } 푆∈풟

Further, let 푛푖 denote the number of supersets in 푖; 푛푖 = 푖 . Next, we define the 풬 |풬 | vector ⃗푣 of size (푐푚 log 푚)/w such that ⃗푣[푖] = ∑︀ 푆 denotes the total size of the sets 푆∈풟푖 | | in 푖. In the following, we show that a subset of supersets with large total size form an 풟 2 2 Ω(̃︀ 푚/훼 )-contributing class of 퐹2(⃗푣) and any superset in this Ω(̃︀ 푚/훼 )-contributing class is a (1/훼)-approximate 푘-cover of ( , ). 풰 ℱ We consider the following two cases depending on whether the coordinates corresponding

to small supersets, small, contribute to 퐹2(⃗푣); 퐹2(⃗푣small) 퐹2(⃗푣)/2 where ⃗푣small denotes the 풬 ≥ vector ⃗푣 restricted to the coordinates corresponding to supersets in small. 풬

Case 1: Supersets with total size less than 푧/훼 contribute to 퐹2(⃗푣). This implies that 푐푚 log 푚 푧 2 2푐푚 log 푚 푧2 . 퐹2(⃗푣) 2퐹2(⃗푣small) 2 ( ) ( ) = 2 ≤ ≤ · w · 훼 w · 훼 2 Claim 5.4.11. If 퐹2(⃗푣small) 퐹2(⃗푣)/2, then there exists an Ω(̃︀ 훼 /푚)-contributing class 푖* ≥ 풬 * of 퐹2(⃗푣) for an index 푖 < log(s훼).

Proof: Since each set in OPTlarge has contribution at least 푧/(s훼) to the optimal coverage,

each set of OPTlarge lands in one of 0, , log(s훼)−1. Moreover, since OPTlarge has coverage 풬 ··· 풬 at least 푧/2,

log(s훼)−1 ∑︁ 푧 ∑︁ 푛푖 푂푖 (OPTlarge) 푧/2, · 2푖 ≥ | | ≥ |풞 | ≥ 푖=0 푂푖∈OPTlarge

* 푖* which implies that there exists an index 푖 < log(s훼) such that 푛푖* 2 /(2 log(s훼)). Hence, ≥ 154 훼2 푖* is an Ω(̃︀ )-contributing class of 퐹2(⃗푣): 풬 푚

푖* 2 푧 2 2 푧 w 2 1 푖*+1 푖* ( ) = 훼 퐹2(⃗푣) (2 s훼) |풬 | · 2푖*+1 ≥ 2 log(s훼) · 4(2푖* )2 2푐푚 log 푚 · · 2푖*+3 log(s훼) · B ≤ w 1 훼2 ( ) 퐹2(⃗푣) ≥ s훼 · 8푐 log(s훼) log 푚 · 푚 ·

w More formally, since = Ω(1)̃︀ (see Table 5.4.1), 푖* is a 휑1-contributing class of 퐹2(⃗푣) s훼 풬 where

w 1 훼2 훼2 휑1 = ( ) = Ω(̃︀ ). (5.4.2) s훼 · 8푐 log(s훼) log 푚 · 푚 푚 

Hence, by Theorem 5.2.11, a superset of total size at least 2 1 푧 will be identified by 3 · 2 · s훼 2 the subroutine 퐹2-Contributing(휑1, s훼) using 푂̃︀(푚/훼 ) space.

Remark 5.4.12. Recall that in order to find a coordinate in a 휑-contributing class 푅푡* ,

퐹2-Contributing subsamples the stream proportional to 1/ 푅푡* (so that only 푂(1) co- | | ordinates of 푅푡* survive) and then with high probability any survived coordinate of 푅푡* becomes a Ω(̃︀ 휑)-HeavyHitter in the sampled substream. However, here we show that there exists a 휑-contributing class 푅푡* whose intersection with OPTlarge is a 휑-contributing class of coordinates too. Hence, it suffices to only search for a coordinate ina 휑-contributing class

of size at most OPTlarge s훼. | | ≤

Case 2. Supersets with coverage less than 푧/훼 do not contribute to 퐹2(⃗푣).

Claim 5.4.13. If 퐹2(⃗푣small) < 퐹2(⃗푣)/2, then there exists an Ω(1)̃︀ -contributing class 푖* of 풬 * 퐹2(⃗푣) for an index 푖 < log 훼.

Proof: In this case, since supersets in small are not contributing, there exists an index 풬 * 푖 < log(훼) (note that we consider all classes 0, , log 훼−1 in this case) such that 풬 ··· 풬

푖* 2 푛푖* (푧/2 ) 퐹2(⃗푣)/(2 log 훼); · ≥

155 Algorithm 20 An (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the ma- jority of the coverage in an optimal solution is by the sets whose coverage contributions are at least 1/(s훼) fraction of the optimal coverage size.

1: procedure LargeSetSimple(( , w, thr1, thr2)) 2: ◁ Input: w is an upper bound풱 on the desired number of sets in a superset 2 3: ◁ Parameters: 휑1 = Ω(̃︀ 훼 /푚) and 휑2 = Ω(1)̃︀ 4: let Cntrsmall be an instance of 퐹2-Contributing(휑1, s훼) ◁ for Case 1 5: let be an instance of -Contributing 푐푚 log 푚 for Case Cntrlarge 퐹2 (휑2, w ) ◁ 2 6: pick ℎ : [(푐푚 log 푚)/w] from Θ(log(푚푛))-wise independent hash functions 7: for all (푆,ℱ 푒 →) in the data stream do 8: if 푒 then ∈ 풱 9: feed ℎ(푆) to both Cntrsmall and Cntrlarge ◁ output(Cntr) contains coordinates along with (1 1/2)-estimate of frequencies ±1 10: if there exists 푖 output(Cntrsmall) such that 푣˜푖 thr1 then ∈ ≥ 2 · 11: return 2˜푣푖/(3f) 1 12: if there exists 푖 output(Cntrlarge) such that 푣˜푖 thr2 then ∈ ≥ 2 · 13: return 2˜푣푖/(3f) 14: return infeasible

1 in other words, 푖* is a 휑2-contributing class of 퐹2(⃗푣) where 휑2 = ( ). 풬 2 log 훼  Note that, by Theorem 5.2.11, a superset of total size at least 2 1 푧 will be identified by 3 · 2 · 훼 the subroutine -Contributing 푐푚 log 푚 using space. 퐹2 (휑2, w ) 푂̃︀(1) Lemma 5.4.14. If (OPT) |풰| , then with probability at least 1 1/(3 log 푛 log푐 푚), the |풞 | ≥ 휂 − |풰| |풰| estimate returned by LargeSetSimple with parameters ( = , w, thr1 = , thr2 = ) 풱 풰 휂s훼 휂훼 has coverage at least /(3f휂훼) = Ω(̃︀ /훼). |풰| |풰| Proof: With probability at least 1 2/(9 log 푛 log푐 푚), the algorithm returns a superset whose − total size is at least 2 |풰| |풰| (in fact, if it is in Case 1, then the estimate is at least |풰| ). 3 · 2휂훼 ≥ 3휂훼 3s훼휂 Then, by Claim 5.4.10 and assumption cmn = , the coverage of the reported superset is at 풰w ∅ least 1 |풰| . f · 3휂훼  Lemma 5.4.15. The amount of space used by LargeSetSimple is 푂̃︀(푚/훼2).

Proof: By Theorem 5.2.11, the amount of space to perform Cntrsmall and Cntrlarge as defined in Algorithm 25 is respectively 2 and . 푂̃︀(1/휑1) = 푂̃︀(푚/훼 ) 푂̃︀(1/휑2) = 푂̃︀(1) 

156 5.4.3. Element Sampling: (OPTlarge) < (OPTsmall) |풞 | |풞 | Here we design an (훼, 훿, 휂)-oracle of Max 푘-Cover with the desired parameters for the case in which (OPTsmall) > (OPTlarge) . Intuitively speaking, in this case, after sampling |풞 | |풞 | each set in with probability Θ(1̃︀ /훼) still we can find 푂(푘/훼) sets whose coverage is at ℱ least Ω(̃︀ (OPT) /훼). As proved in Claim 5.4.3, if 훼 = Ω(̃︀ 푘), then the main contribution to |풞 | the coverage of OPT is due to OPTlarge and we can (1/훼)-approximate the optimal coverage size by LargeSet. Hence, in this section we assume that 훼 = 푂̃︀(푘). Moreover, throughout this section we assume that for all 훽 훼, cmn < 휎훽|풰| (otherwise, our multi-layered set ≤ |풰훽푘 | 훼 sampling approach described in Section 5.4.1 returns a (1/훼)-approximation of the optimal coverage size).

Lemma 5.4.16. Consider an instance of Max 푘-Cover ( , ). Suppose that is a collec- 풰 ℱ 풟 tion of 푘 disjoint sets with coverage 푧 such that no 푆 has size more than 푧/(s훼) where ∈ 풟 s < 1 and s = Ω(1)̃︀ . Let’s assume that smp := where each 푆 survives in with 풟 풟 ∩ℳ ∈ ℱ ℳ probability 푐/(s훼) where 푐 > 1 is a fixed constant. Then, with probability at least (1 6/푐), − smp has size at most (2푐푘)/(s훼) and covers at least (푐푧)/(2s훼) elements. 풟 ′ ′ Proof: Let = 푆 , , 푆 and for each 푖, let 푋푖 to be the random variable corresponding 풟 { 1 ··· 푘} ′ ′ ′ to 푆 such that 푋푖 = 푆 if 푆 smp and zero otherwise. 푖 | 푖| 푖 ∈ 풟 푐 ′ 푐 ′ 2 Claim 5.4.17. E[푋푖] = 푆 and Var[푋푖] 푆 . s훼 · | 푖| ≤ s훼 · | 푖|

Next, we define 푋 := 푋1 + + 푋푘. Note that E[푋] = (푐푧)/(s훼) and, by the pairwise ··· ′ independence of 푋푖s and the assumption that 푆 푧/(s훼), | 푖| ≤

푘 푐 ∑︁ Var[푋] 푆′ 2 s훼 푖 ≤ · 푖=1 | | 푐 푧 푧 s훼 ( )2 = 푐 ( )2, ≤ s훼 · · s훼 · s훼

Finally, applying Chebyshev inequality,

푐푧 푐푧 √푐 √푐 Pr[푋 < ] = Pr[푋 < ( ( 푧)] < 4/푐. 2s훼 s훼 − 2 · s훼 ·

Hence, with probability at least 1 4/푐, smp covers at least (푐푧)/(2s훼) elements. − 풟 157 Next, we show that with probability at least 1 2/푐, 0 < smp < (2푐푘)/(s훼). For each − |풟 | 푖, let 푌푖 denote the random variable corresponding to 푆푖 which is equal to one if 푆푖 smp ∈ 풟 and zero otherwise.

Claim 5.4.18. E[푌푖] = (푐/s훼) and Var[푌푖] (푐/s훼). ≤

We define 푌 = 푌1 + + 푌ℓ which denotes the size of smp. Then by pairwise independence ··· 풟 of 푌푖s, E[푌 ] = (푐푘)/(s훼) and Var[푌 ] (푐푘)/(s훼). Applying Chebyshev inequality (Pr[ 푌 ≤ | − E[푌 ] 푡Var[푌 ]] 1/(푡2Var[푌 ])), with probability at least 1 (s훼)/(푐푘) 1 2/푐 (since | ≥ ≤ − ≥ − in this case, 훼 2푘/s), 0 < smp < (2푐푘)/(s훼). ≤ |풟 | Hence, with probability at least (1 6/푐), smp is a subset of size at most (2푐푘)/(s훼) − 풟 that covers at least (푐푧)/(2s훼) elements.  Corollary 5.4.19. Consider an instance ( , ) of Max 푘-Cover and let OPT be an opti- 풰 ℱ 1 mal solution of this instance such that (OPTsmall) (OPT) /(2휂). Moreover, |풞 | ≥ 2 · |풞 | ≥ |풰| let be a collection of 푂̃︀( /훼) pairwise independent sets picked uniformly at random ℳ ⊂ ℱ |ℱ| such that each 푆 belongs to with probability 18/(s훼). With probability at least 2/3, ∈ ℱ ℳ Max ( 36푘 )-Cover( , ) has an optimal solution with coverage at least 9 /(s훼휂). s훼 풰 ℳ |풰|

Proof: By definition of OPTsmall, for each 푂 OPTsmall, the contribution of 푂 to OPT (i.e. ∈ 푂′) is at most 푧/(s훼) (see Definition 5.4.2). Then, the result follows from an application ′ of Lemma 5.4.16 on collection := 푂 푂 OPTsmall by setting 푐 = 18 and 푧 = 풟 { | ∈ } OPTsmall /(2휂). | | ≥ |풰|  Next, we show that we can perform “element sampling” and find an 푂̃︀(1)-approximation of Max ( 36푘 )-Cover of the specified instance in Corollary 5.4.19, ( , ), in one pass and s훼 풰 ℳ using 푂̃︀(푚/훼2) space. To this end, first we compute the space complexity of ( , ) where ℒ ℱ is a subset of size 푂̃︀(푘) which is picked by element sampling. ℒ ⊆ 풰 Lemma 5.4.20. Suppose that an optimal solution of Max ( 푘 )-Cover( , ) has coverage 훼 풰 ℱ /훾 = Ω(̃︀ /훼). Let be a collection of elements of size 푂̃︀( 푘훾 ) picked uniformly at |풰| |풰| ℒ ⊂ 풰 훼 random. With high probability, the total amount of space to store the set system ( , ) is ℒ ℱ 푂̃︀(푚/훼).

Proof: Recall that ( , ) has the property that for all 훽 훼, cmn < 휎훽 /훼 (otherwise, 풰 ℱ ≤ |풰훽푘 | |풰|

158 the result of Section 5.4.1 can be applied). Next, we (conceptually) partition the elements

in into log 훼 + 1 groups as follows: 풰

cmn 0 = 풲 풰 ∖ 풰훼푘 cmn cmn 푖 = 푖−1 푖 푖 [log 훼]. 풲 풰(훼/2 )푘 ∖ 풰(훼/2 )푘 ∀ ∈

푖−1 Note that 0 and for each 푖 [log 훼], 푖 휎 /2 . Since each element 푒 |풲 | ≤ |풰| ∈ |풲 | ≤ |풰| ∈ 풰 survives in with probability 푘훾 , w.h.p., for each , 휎훾푘 . 푂̃︀( ) 푖 [log 훼] 푖 = 푂̃︀(1 + 푖−1 ) ℒ 훼|풰| ∈ |풲 ∩ ℒ| 훼2 2푖푚 Furthermore, since each element in 푖 appears in at most 푂̃︀( ) sets in , the total amount 풲 훼푘 ℱ of space required to store ( , ) is at most ℒ ℱ

log 훼 log 훼 ∑︁ 푘훾 푚 ∑︁ 휎훾푘 2푖푚 푖 max freq(푒) = 푂̃︀( ) 푂̃︀( ) + 푂̃︀(1 + 푖−1 ) 푂̃︀( ) 푒∈풲푖 훼 훼푘 훼2 훼푘 푖=0 |풲 ∩ ℒ| · · 푖=1 · log 훼 훾푚 ∑︁ 2푖푚 휎훾푚 = 푂̃︀( ) + 푂̃︀( + ) 훼2 훼푘 훼2 푖=1

= 푂̃︀(푚/훼). 

Next, we show that after subsampling the sets by a factor of Θ(1̃︀ /훼), we can save another factor of Ω(̃︀ 훼) in the space complexity; in other words, ( , ) uses 푂̃︀(푚/훼2) space. Note ℒ ℳ that since 푘훼 may be as large as Ω(̃︀ 푚) we cannot hope to show directly that each element 푖 in 푖 appears in at most 푂̃︀(푚/(2 훼푘)). However, we can show that the total size of the 풲 intersection of all sets in with is 푂̃︀(푚/훼2) using the properties of the max cover ℳ ℒ instance.

Lemma 5.4.21. Suppose that an optimal solution of Max ( 푘 )-Cover( , ) has coverage 훼 풰 ℱ /훾 = Ω(̃︀ /훼). Let be a collection of elements of size 푂̃︀( 푘훾 ) picked uniformly at |풰| |풰| ℒ ⊂ 풰 훼 random and let be a collection of sets of size 푂̃︀(푚/훼) picked uniformly at random. ℳ ⊂ ℱ With high probability, the total amount of space required to store the set system ( , ) is ℒ ℳ 푂̃︀(푚/훼2).

Proof: First note that since an optimal ( 푘 )-cover of ( , ) has coverage /훾, with high 훼 풰 ℱ |풰| probability, for each set 푆 , 푆 = 푂̃︀(푘/훼). Moreover, by Lemma 5.4.20, the size ∈ ℳ | ∩ ℒ|

159 Algorithm 21 A single pass streaming algorithm that estimates the optimal coverage size of within a factor of in 2 space. To return a 푘 -cover Max 푘-Cover( , ) 1/푂̃︀(훼) 푂̃︀(푚/훼 ) 푂̃︀( 훼 ) that achieves the풰 computedℱ estimate it suffices to change the line marked by (⋆⋆) to return the corresponding 푘 -cover as well (see Section 5.5). 푂̃︀( 훼 ) 1: procedure SmallSet((푘, 훼)) −푖 2: for all 훾g 2 푖 [log 훼] in parallel do 3: is∈ an { estimate| ∈ of the} optimal coverage of 푘 -cover ◁ 훾g 푂̃︀( 훼 ) 4: for log 푛 times in parallel do 5: uniformly selected samples of size Θ(̃︀ 푚/훼) from 6: ℳ ←uniformly selected samples of size 푘 fromℱ Θ(̃︀ 훾g ( 훼 )) 7: initializeℒ ← S( , ) to be an empty set ◁ S( ·, ) stores풰( , ) 8: for all (푆, 푒)ℒinℳ the data stream do ℒ ℳ ℒ ℳ 9: if 푆 and 푒 푖 then 10: add∈ ℳ(푆, 푒) to∈S( ℒ , ) ℒ ℳ 11: if S( , ) > 푂̃︀(푚/훼2) then 12: terminateℒ ℳ 36푘 13: sol훾 maxℒ,ℳ 푂(1)-approx solution of Max ( )-Cover(S( , )) (⋆⋆) g ← { s훼 ℒ ℳ } 14: return |풰| sol sol max훾g ( 훾g ) 훾g = Ω(̃︀ 푘/훼) { 훾g(푘/훼) · | } of the intersection of all sets in with is 푂̃︀(푚/훼). Next, we (conceptually) partition the ℱ ℒ sets of into 푂(log 푘) groups based on their intersection size with as follows (푐 = 푂̃︀(1)): ℱ ℒ 1 푐푘 1 푐푘 푖 = 푆 푆 < , 1 푖 log 푘 풬 { ∈ ℱ | 2푖 · 훼 ≤ | ∩ ℒ| 2푖−1 · 훼 } ∀ ≤ ≤

Since the total size of the intersection of all sets with the sampled set is w.h.p. 푂̃︀(푚/훼), ℒ for each , 푂̃︀(푚/훼) 2푖·푚 . Since we have the assumption that 푚 (we 푖 log 푘 푖 푖 = 푂̃︀( ) 1 ≤ |풬 | ≤ (푐푘)/(2 훼) 푘 푘훼 ≥ took care of the case 푚 < 푘훼 in the first line of EstimateMaxCover separately), w.h.p., 2푖푚 for each 푖, 푖 = 푂̃︀( ). Hence, the total amount of space to store ( , ) is at 풬 |풬 ∩ ℳ| 푘훼 ℒ ℳ most

log 푘 ∑︁ 1 푐푘 2푖푚 푚 푂̃︀( ) = 푂̃︀( ). 2푖−1 훼 푘훼 훼2  푖=1 · ·

cmn 휎훽|풰| Theorem 5.4.22. If (OPTsmall) /(2휂) and for all 훽 훼, < , then with |풞 | ≥ |풰| ≤ |풰훽푘 | 훼 high probability, SmallSet outputs a (1/푂̃︀(훼))-approximate of the optimal coverage size of Max 푘-Cover( , ) in 푂̃︀(푚/훼2) space. 풰 ℱ Proof: By Corollary 5.4.19, for any sampled collection of sets of size Θ(̃︀ 푚/훼) (as in ℳ 160 SmallSet), with probability at least 2/3, there exists a subset of size at most ( 36푘 ) in s훼 ℳ whose coverage is 9 /(s훼휂) = /훾. Moreover, by the guarantee of element sampling, |풰| |풰| Lemma 5.2.4, when 훾/2 < 훾푔 훾, an 푂(1)-approximate solution of Max 푘-Cover( , ) ≤ ℒ ℳ with high probability is an 푂(1)-approximate solution of Max 푘-Cover( , ). Hence, with 풰 ℳ −1 probability 1 푛 , in at least one of the (log 푛) instances with the desired 훾g, sol훾 has − g |풰| 훾 (푘/훼) coverage Ω(̃︀ g ) = Ω(̃︀ 푘/훼) over the sampled elements . Note that we need to scale 훾 · |풰| ℒ sol by a factor of 푘 to reflects its coverage on . 훾g Θ(̃︀ /(훾g )) |풰| · 훼 풰 Further, by Lemma 5.4.21, the amount of space required to store each ( , ) with high ℒ ℳ probability is 푂̃︀(푚/훼2) and since SmallSet stores 푂̃︀(1) different instances ( , ), the total ℒ ℱ 2 space of the algorithm is 푂̃︀(푚/훼 ). 

To complete the SmallSet is indeed an (훼, 훿, 휂)-oracle with the desired parameters, we need to show that it never overestimates the optimal coverage size.

Lemma 5.4.23. The output of SmallSet with high probability is not larger than the op- timal coverage size of Max 푘-Cover( , ). 풰 ℱ 푘 Proof: Note that if the optimal coverage size Max 푂̃︀( )-Cover( , ) is less than Ω(̃︀ /(훼훾g)), 훼 ℒ ℳ |풰| then with high probability in none of the iterations sol . Hence, SmallSet log 푛 훾g = Ω(̃︀ 푘/훼) with high probability does not overestimate the size of an optimal 푂̃︀(푘/훼)-cover in ( , ). 풰 ℱ  5.5. Approximating Max 푘-Cover Problem in Edge Arrival Streams

In this section, we show that we can modify Oracle(푘, 훼) and the subroutines within it slightly to report a (1/훼)-approximate solution as well. Currently, in the algorithms for estimating maximum coverage size, we only maintain the coverage of a collection of sets. To

provide the actual (1/훼)-approximate solution, we also need to report the indices of the sets that belong to the collection.

Reporting a 푘-cover in LargeCommon. In this subroutine, the coverage of different covers with possibly size more than 푘 are computed. For the task of estimating the maximum coverage size, this does not hurt as long as we scale the coverage by a factor that bounds

how much larger the size of cover is compared to 푘. However, in the task of reporting an

161 Algorithm 22 A single pass streaming algorithm that reports a (1/훼)-approximate solution of Max 푘-Cover( , ) in 푂̃︀(훽 + 푘) space when the set of common elements is large. 풰 ℱ 1: procedure ReportLargeCommon((푘, 훼, 훽)) 푖 2: for all 훽g 2 0 푖 log 훽 in parallel do ∈ { | ≤ ≤ } 3: ◁ perform set sampling in one pass using 푂̃︀(1) space. 4: pick ℎ : [ 푐푚 log 푚 ] from Θ(log(푚푛))-wise independent hash functions ℱ → (훽g푘) 5: pick ℎ2 : [훽g log 푚] from Θ(log(푚푛))-wise independent hash functions ℱ → 6: let DEg,푖 푖 [훽g] be 훽g instances of (1/2)-approximation of 퐿0-estimation 7: for{ all (푆,| 푒) ∈in the} data stream do 8: if ℎ(푆) = 1 then rnd 9: feed 푒 to DEg,ℎ (푆) ◁ computing ( ) 2 풞 ℱ훽g,푖 * 10: if 푖 [log 훽] s.t. VAL(DEg,푖* ) 휎 /(4훼) then ∃ ∈ ≥ |풰| * 11: return 2VAL(DEg)/3 and 푆 ℎ(푆) = 1 and ℎ2(푆) = 푖 { | 3휎훽|풰| } 12: return infeasible ◁ no 훽 [훼] s.t. cmn = Ω(̃︀ ) ∃ ∈ |풰훽푘 | 훼푘 actual 푘-cover, it becomes crucial to have at most 푘 sets. To this end, instead of computing

the coverage of a randomly selected (훽g푘)-cover, we partition the collection into 푂̃︀(훽g) sub- collections each of size at most 푘 and by Observation 5.2.5, at least one of them achieves the reported coverage size up to polylogarithmic factors. Moreover, we need to maintain the coverage of 푂̃︀(훽g) = Θ(̃︀ 훽) covers which add an extra 훽g term in the space complexity (see Figure 22).

Lemma 5.5.1. The space complexity of ReportLargeCommon(푘, 훼, 훽) is 푂̃︀(훽 + 푘). Proof: Similarly to the space analysis of LargeCommon (Theorem 5.4.4) and by employ-

ing an existing (1 1/2)-approximation algorithm of 퐿0-estimation that uses 푂̃︀(1) space ± (Theorem 5.2.12), the total amount of space used in the algorithm to maintain DEg,푖 for

all 훽 log 훽 possible choices of (훽g, 푖) is 푂̃︀(훽). Moreover, the algorithm uses 푂̃︀(푘) space to compute the indices of the best 푘-cover. In total, the space complexity is 푂̃︀(훽 + 푘).  Note that the approximation analysis of ReportLargeCommon is essentially the same as the analysis of LargeCommon in Theorem 5.4.4.

Reporting a 푘-cover in LargeSet. This subroutine invokes LargeSetComplete (Algorithm 24) on different sets of sampled elements and in order to report anactual 푘- cover along with a (1/푂̃︀(훼))-approximation of the optimal coverage size, we modify the LargeSetComplete subroutine slightly.

162 In LargeSetComplete, it is always guaranteed that a superset with sufficiently large coverage is detected and its coverage is computed and reported. To get the actual collection

of sets that achieve the coverage, we need to try ℎ on all sets (which belong to [푚]) and report the indices of those that are mapped to the superset with large coverage. More precisely, anywhere in Figure 24 that the algorithm returns a (non infeasible) value (the lines that

are marked by (⋆⋆)), it also returns the set 푆 ℎ(푆) = 푖* which is guaranteed to have size { | } at most w = min(훼, 푘). Lemma 5.5.2. The space complexity of the modified LargeSetComplete with the sug- gested modification at lines is 푚 . (⋆⋆) 푂̃︀( 훼2 + 푘) Proof: Similarly to the space analysis of LargeSetComplete without the modification

(Lemma 5.8.7), the modified algorithm consumes 푂̃︀(푚/훼2) space to compute a (1/푂̃︀(훼))- approximation of the optimal coverage size. Moreover, the new modified algorithm uses

additional 푂̃︀(w) = 푂̃︀(푘) space to find the 푘-cover corresponding to the returned estimate of the optimal coverage size. Hence, the total space complexity of LargeSetComplete with the suggested modification at lines is 푚 . (⋆⋆) 푂̃︀( 훼2 + 푘)  Corollary 5.5.3. The space complexity of the LargeSet that invokes LargeSetCom- plete with the suggested modification at lines is 푚 . (⋆⋆) 푂̃︀( 훼2 + 푘)

Reporting a 푘-cover in SmallSet. This is the most straightforward case. The only required change is to modify the line marked by (⋆⋆) in SmallSet to return both an -approximation of the 푘 instance at line and its corresponding 푂(1) Max 푂̃︀( 훼 )-Cover (⋆⋆) 푘 -cover. Clearly, the extra amount of space to implement this modification is 푘 . 푂̃︀( 훼 ) 푂̃︀( 훼 ) Lemma 5.5.4. The space complexity of the modified SmallSet with the suggested modifi- cation at line is 푚 푘 . (⋆⋆) 푂̃︀( 훼2 + 훼 ) We also need to slightly modify the described Oracle in Algorithm 18 so that, besides

a (1/푂̃︀(훼))-estimate of the optimal coverage size, it reports a (1/푂̃︀(훼))-approximate 푘- cover. Note that it is crucial to run ReportLargeCommon with different parameters depending on whether s훼 2푘 and s훼 < 2푘. The main reason is that the space complexity ≥ of ReportLargeCommon as described above in Algorithm 22 has an additive Θ(훽) term

163 Algorithm 23 An (훼, 훿, 휂)-oracle of Max 푘-Cover that reports a (1/푂̃︀(훼))-approximate 푘-cover. 1: procedure ReportOracle((푘, 훼)) 2: if s훼 2푘 then 3: ◁ if≥ cmn 휎|풰| |풰푘 | ≥ 훼 4: solcmn ReportLargeCommon(푘, 훼, 1) ← 5: ◁ if s훼 2푘, then OPTlarge OPT /2 ≥ | | ≥ | | 6: solHH LargeSet(푘, 훼, 푘) ◁ the described modified LargeSetComplete 7: else ← 8: ◁ for instances in which 훽 훼 s.t. cmn 휎훽|풰| ∃ ≤ |풰훽푘 | ≥ 훼 9: solcmn ReportLargeCommon(푘, 훼, 훼) ← 10: ◁ for instances with s훼 < 2푘 and OPTlarge OPT /2 | | ≥ | | 11: solHH LargeSet(푘, 훼, 훼) ◁ the described modified LargeSetComplete ← 12: ◁ for instances with OPTlarge < OPT /2 | | | | 13: solsmall SmallSet(푘, 훼) ◁ with the described modification in line (⋆⋆) ← 14: ◁ each sol computed above is of from (estimate of the optimal coverage size, 푘-cover) 15: return max(solcmn, solHH, solsmall) ◁ max is over the estimate size of sols which can be as large as 훼 in our implementation of LargeCommon in Algorithm 19. By all minor modifications in subroutines LargeCommon, LargeSet and SmallSet, we design a modified variant of Oracle called ReportOracle as in Algorithm 23.

Theorem 5.5.5. ReportOracle(푘, 훼) performs a single pass on the edge arrival stream of the set system ( , ) and implements an (푂̃︀(훼), 1/(log 푛 log푐 푚), 휂)-oracle of Max 푘-Cover( , ) 풰 ℱ 풰 ℱ using 푂̃︀(푚/훼2 + 푘) space. Moreover, the modified oracle also reports a 푘-cover with the promised coverage size.

Proof: The approximation analysis of the algorithm is basically the same as the analysis of Oracle and we do not repeat it here. We only need show that structural property on the number of elements in different frequency levels provided by LargeCommon with input

parameters (푘, 훼, 1) is sufficient for the (modified variant of) LargeSet in the case s훼 푘. ≥ In LargeSet with input 훼 = Ω(̃︀ 푘), the only bound on the number of common elements that is used in the analysis is cmn = cmn 휎|풰| (in Theorem 5.8.6). In other words, in |풰w | |풰푘 | ≤ 훼 this case, unlike the case 훼 = 푂̃︀(푘), we do not need the guarantee on the size of cmn for 풰훽푘 all values of 훽 훼. Hence, for the case 훼 = Ω(̃︀ 푘), it is sufficient to invoke LargeCommon ≤ with 훽 = 1. Note that the values of the rest of parameters are the same as in Oracle. To bound the space complexity, we consider two cases: 1) s훼 2푘 and 2) s훼 < 2푘. ≥

164 1. s훼 ≥ 2푘. By Lemma 5.5.1 and Corollary 5.5.3, ReportLargeCommon(푘, 훼, 1) and LargeSet that invokes the modified variant of LargeSetComplete respectively use and 푚 space. Hence, the total amount of space used in the case is 푂̃︀(푘) 푂̃︀( 훼2 + 푘) 푚 . 푂̃︀( 훼2 + 푘)

2. s훼 < 2푘. In this case, by Lemma 5.5.1, ReportLargeCommon(푘, 훼, 훼) uses 푂̃︀(푘 + 훼) = 푂̃︀(푘) space. Moreover, by Corollary 5.5.3 and Lemma 5.5.4, the space complexity of the other two subroutines with the described modification is 푚 . Hence, the 푂̃︀( 훼2 + 푘) total amount of space that the algorithm uses in this case is 푚 . 푂̃︀( 훼2 + 푘) Finally, by the modifications we described above, ReportOracle correctly reports an

actual 푘-cover whose coverage is within a 푂̃︀(훼) factor of the optimal coverage as well.  The theorem above together with Theorem 5.3.6 completes the proof of Theorem 5.3.2.

5.6. Lower Bound for Estimating Max 푘-Cover in Edge Ar- rival Streams

By the result of [29], it is known that estimating the size of an optimal coverage of Max 푘- Cover within a factor of two requires Ω(푚) space. Their argument relies on a reduction from Set Disjointness problem and implies the mentioned bound for 1-cover instances. In the following, we generalize their approach and provide lower bounds for the all range of

approximation guarantees 훼 smaller than √푚. We remark that both our lower bound result and the lower bound result of [29] are basically similar to the lower bound of 퐿∞ and 퐿푘 estimation first proved in [13, 27].

The lower bound result we explain in this section is based on the well-known 푟-player Set Disjointness problem with unique set intersection promise which has been studied extensively in communication complexity (e.g. [28, 41, 85]). The setting of the problem is as follows:

There are 푟 players and each has a set 푇푖 [푚]. The promise is that the input is in one of ⊆ the following forms:

No Case: There is a unique element 푗 [푚] such that for all 푖 푟, 푗 푇푖. ∙ ∈ ≤ ∈ Yes Case: All sets are pair-wise disjoint. ∙ 165 Moreover, a round of communication consists of each player 푖 sending a message to player 푖 + 1 in order from 푖 = 1 to 푟 1. The goal is that at end of a single round, player 푟 − be able to correctly output whether the input belongs to the family of Yes instances or No instances. Chakrabarti et al. [41] showed the following tight lower bound on the one-

way communication complexity of the 푟-player Set Disjointness problem with unique set intersection promise.

Theorem 5.6.1 (From [41]). Any randomized one-way protocol that solves 푟-player Set Disjointness(푚) with success probability at least 2/3 requires Ω(푚/푟) bits of communication.

We remark that the same Ω(푚/푟) communication lower bound was later proved for the general model (i.e. with multiple rounds) by Gronemeier [85]. However, for our application, the lower bound on the one-way communication model suffices.

Corollary 5.6.2. Any single-pass streaming algorithm that solves 푟-player Set Disjointness(푚) with success probability at least 2/3 consumes Ω(푚/푟2) space.

Next, we sketch a reduction from 푟-player Set Disjointness(푚) to Max 푘-Cover with 푚 sets such that an 훼-approximation protocol of Max 푘-Cover solves the corresponding instance of 푟-player Set Disjointness(푚). To this end, consider an arbitrary instance of 훼-player Set Disjointness(푚) problem ℐ in which each player 푖 has a set 푇푖 [푚]. Define ℐ = 푒1, , 푒훼 to be the set of ⊂ 풰 { ··· } elements in the Max 1-Cover instance and for each player 푖 if 푗 푇푖 then add (푒푖, 푆푗,) ∈ to the stream. In other words, in the constructed Max 1-Cover( ℐ , ℐ := 푆1, , 푆푚 ) 풰 ℱ { ··· } instance, we have an element 푒푖 corresponding to each player 푖 and there exists a set 푆푗 corresponding to each item 푗 [푚]. Moreover, each set 푆푗 in the Max 1-Cover instance ∈ ( ℐ , ℐ ) denotes the set of players in the Set Disjointness(푚) instance whose input sets 풰 ℱ ℐ contain 푗; 푆푗 := 푖 [훼] 푗 푇푖 . { ∈ | ∈ } Claim 5.6.3. If is a No instance, then the optimal coverage of the Max 1-Cover instance ℐ ( ℐ , ℐ ) is 훼. 풰 ℱ Proof: In this case, by the unique intersection promise, there exists an item 푗 that belongs to all 푇푖 (for 푖 [훼]). Hence, by the construction of the Max 1-Cover instance, 푆푗 covers ∈ the whole ℐ . Thus, the optimal 1-cover has size 훼. 풰  166 Claim 5.6.4. If is a Yes instance, then the optimal coverage of the Max 1-Cover instance ℐ ( ℐ , ℐ ) is 1. 풰 ℱ

Proof: Since 푇푖s are disjoint, for each 푗 [푚], the set 푆푗 has cardinality one. ∈  Corollary 5.6.2 together with Claims 5.6.3 and 5.6.4 implies the stated lower bound on

훼-approximating the optimal coverage size of Max 푘-Cover in edge-arrival streams in The- orem 5.3.3: Any single pass (possibly randomized) algorithm on edge-arrival streams that

(1/훼)-approximates the optimal coverage size of Max 푘-Cover requires Ω(푚/훼2) space.

5.7. Chernoff Bound for Applications with Limited Inde- pendence

In this section, we mention some of the results in [159] on applications of Chernoff bound with limited independence that are used in our analysis.

Definition 5.7.1 (Family of 푑-wise Independent Hash Functions). A family of func- tions = ℎ :[푚] [푛] is 푑-wise independent, if for any set of 푑 distinct values 푥1, , 푥푑, ℋ { → } ··· the random variables ℎ(푥1), , ℎ(푥푑) are independent and uniformly distributed in [푛] when ··· ℎ is picked uniformly at random from . ℋ Next, we exploit the results that show for small values of 푑, we can store a family of 푑-wise hash function in small space and in the same time it suffices for our application of Chernoff bound.

Lemma 5.7.2 (Corollary 3.34 in [166]). For every values of 푚, 푛, and 푑, there is a fam- ily of 푑-wise independent hash functions = ℎ :[푚] [푛] such that a selecting a random ℋ { → } function from only requires 푑 log(푚푛). ℋ ·

Lemma 5.7.3 (Theorem 5 in [159]). Let 푋1, 푋푛 be binary 푑-wise independent ran- ··· dom variables and let 푋 := 푋1 + + 푋푛. Then, ···

⎧ 2 ⎨푒−E[푋]훿 /3 : if 훿 < 1 and 푑 = Ω(훿2E[푋]); Pr( 푋 E[푋] 훿E[푋]) | − | ≥ ≤ ⎩푒−E[푋]훿/3 : if 훿 1 and 푑 = Ω(훿E[푋]). ≥

Lemma 5.7.4 (Theorem 6 in [159]). Let 푋1, , 푋푛 and 푌1, , 푌푛 be Bernoulli trials ··· ··· 167 such that for each 푖, E[푋푖] = E[푌푖] = 푝푖. Let assume that 푌푖s are independent, but 푋푖s are only -wise independent. Further, let and respectively denote ∑︀푛 and 푑 푝(푟) 푝푑(푟) Pr( 푖=1 푌푖 = 푟) ∑︀푛 and let ∑︀푛 be the expected number of success in the trials. Pr( 푖=1 푋푖 = 푟) 휇 = 푖=1 푝푖 −퐷 If 푑 푒휇 + ln(1/푝(0)) + 푟 + 퐷, then 푝푑(푟) 푝(푟) 푒 푝(푟). ≥ | − | ≤ 5.7.1. An Application: Set Sampling with Θ(log(푚푛))-wise Indepen- dence

Consider a set system ( , ) and let ℎ : [(푐푚 log 푚)/훾] be a function selected uni- 풰 ℱ ℱ → formly at random from a family of Θ(log(푚푛))-wise independent hash functions where 푐 is a sufficiently large constant. Then, we think of our randomly selected sets rnd to be the ℱ collection of sets in that are mapped to one by ℎ; rnd := 푆 ℎ(푆) = 1 . ℱ ℱ { ∈ ℱ | } Lemma 5.7.5. Assuming 훾 6푐 log2 푚, with probability at least 1 푚−1, rnd 훾. ≥ − |ℱ | ≤ rnd Proof: Let 푋푖 be a random variable which is one if 푆푖 and zero otherwise. We define ∈ ℱ 푋 := 푋1 + +푋푚. Note that 푋푖 are Θ(log(푚푛))-wise independent and E[푋] = 훾/(푐 log 푚). ··· Then, by an application of Chernoff bound with limited independence (Lemma 5.7.3),

√︂6푐 Pr(푋 > 훾) Pr(푋 > (1 + log 푚)E[푋]) < 푚−1. ≤ 훾 ⏟ ⏞ ≤2훾/푐

Hence, with high probability, rnd has size at most 훾. ℱ  Next, we show that rnd covers the set of elements cmn (see Definition 5.2.1). ℱ 풰훾 Lemma 5.7.6. With probability at least 1 푛−1, rnd covers cmn. − ℱ 풰훾 cmn Proof: Let 푒 and let 푆1, , 푆푞 be the sets in that cover 푒: for each 푖 푞, 푒 푆푖. ∈ 풰훾 ··· ℱ ≤ ∈ rnd We define 푋푖 to be a random variable which is one if 푆푖 and is zero otherwise. We ∈ ℱ rnd also define 푋 := 푋1 + + 푋푞 to denote the number of sets in that cover 푒. Note that ··· ℱ 푋푖 are Θ(log(푚푛))-wise independent and E[푋] = (훾/(푐푚 log 푚)) 푞 log 푛 log(푚푛). Then, · ≥ applying Chernoff bound on random variables with limited independence (Lemma 5.7.3),

√︀ Pr(푋 < 1) < Pr(푋 < (1 (6 log 푛)/E[푋])E[푋]) < 푛−2. − ⏟ ⏞ ≤1/2

168 Hence, by union bound over all elements in cmn, with high probability, rnd covers cmn. 풰훾 ℱ 풰훾  Lemma 5.7.7. Θ(log(푚푛)) random bits suffice to implement set sampling method.

5.8. Generalization of Section 5.4.2

In this section we generalize the approach of Section 5.4.2 which relates the results on heavy hitters and contributing classes to approximating the optimal coverage size. In Section 5.4.2,

to simplify the presentation, we had the assumption that cmn is empty. This is in particular 풰w important since we can then assume that the total size of 푘-covers are roughly the same as their coverage size (up to polylogarithmic factors).

Here, we take care of the case in which cmn is non-empty and complete the description 풰w of our (훼, 훿, 휂)-oracle of Max 푘-Cover for the case (OPTlarge) (OPT) /2. The high |풞 | ≥ |풞 | level idea is to sample enough number of elements so that the algorithm using heavy hitters

and contributing classes still work but in the same time no w-common element is among the sampled elements with at least a constant probability.

Step 1. Sampling Elements. We sample a subset in which each element 푒 is in ℒ ⊂ 풰 ℒ with probability 휌 = (t s훼 휂)/ where t = 푂̃︀(1) (see Table 5.4.1 for the exact values). We · · |풰| implement the process of sampling via a hash function from a family of Θ(log(푚푛))-wise ℒ independent functions = ℎ : [ 푛 ] such that = 푒 ℎ(푒) = 1 . ℋ { 풰 → t·s훼·휂 } ℒ { ∈ 풰 | } Claim 5.8.1. With high probability, (휌 )/2 (3휌 )/2. |풰| ≤ |ℒ| ≤ |풰| For each collection of set , we define ′ to be the intersection of with ; ′ := 푆 푆 풟 풟 풟 ℒ 풟 { ∩ℒ | ∈ . 풟} Claim 5.8.2. If ( ) /(54f휂훼), then with probability at least 1 푚−2, ( ′) |풞 풟 | ≥ |풰| − |풞 풟 | ≥ 휌 ( ) /2. Moreover, if ( ) < /(54f휂훼), then with probability at least 1 푚−2, |풞 풟 | |풞 풟 | |풰| − ( ′) < ts/(36f). |풞 풟 | Proof: Since ts 27 24f log 푚 (as in Table 5.4.1), it follows from two applications of Chernoff ≥ · bound on random variables with limited independence (Lemma 5.7.3).  Similarly to Lemma 5.4.14, we have the following guarantee for LargeSetSimple using the sampled set of elements . ℒ 169 Lemma 5.8.3. If (OPT) /휂 and cmn = , then with probability at least 1 |풞 | ≥ |풰| ℒ ∩ 풰 ∅ − 푐 1/(2 log 푛 log 푚), the output of LargeSetSimple with parameters ( , w, 푆1 = sℒ훼, 푆2 = ℒ 푐푚 log 푚 |ℒ| |ℒ| , thr1 = , thr2 = ) is a superset whose coverage is at least /(54f휂훼). w 18휂s훼 6휂훼 |풰| Proof: By Claim 5.8.1, (휌 )/2 (3휌 )/2. Moreover, by Claim 5.8.2 and since |풰| ≤ |ℒ| ≤ |풰| −2 (OPT) /휂, with probability at least 1 푚 , (OPTlarge) 휌 (OPT) /4 |풞 | ≥ |풰| − |풞 ∩ ℒ| ≥ |풞 | ≥ /(6휂). We define 휂ℒ := 6휂 to denote the coverage of OPTlarge over the sampled elements |ℒ| . ℒ Now, consider the collection OPTlarge := 푂1, , 푂푞 . Since for each 푖 푞, the con- { ··· } ≤ ⋃︀ tribution of 푂푖 to the coverage of OPT is at least 1/(s훼) fraction (i.e. 푂푖 푂푗 | ∖ 푗<푖 | ≥ (OPT) /(s훼) /(휂s훼)), by Claim 5.8.2, with probability at least 1 푚−1, for all 푖 푞, |풞 | ≥ |풰| − ≤

⋃︁ (푂푖 푂푗) 휌 (OPT) /(2s훼). | ∖ 푗<푖 ∩ ℒ| ≥ |풞 |

This implies that 푂푖 푂푖 OPTlarge is a collection of sets whose contribution to { ∩ ℒ | ∈ } 휌|풞(OPT)| 3휌|풞(OPT)| (OPT) w.h.p. is at least ( )/( ) = 1/(3s훼). We define sℒ := 3s which 풞 ∩ ℒ 2s훼 2 denotes the contribution of sets in OPTlarge compared to the coverage of OPT over the sampled elements . ℒ |ℒ| |ℒ| By an application of Lemma 5.4.14 with parameters ( := , thr1 := , thr2 := ), 풱 ℒ 휂ℒsℒ훼 휂ℒ훼 with probability at least 1 1/(3 log 푛 log푐 푚), the algorithm returns a superset ′ whose − 풟푖 coverage on the sampled set is at least ℒ 휌 ts |ℒ| |풰| = . 3f휂ℒ훼 ≥ 36f훼휂 36f

−2 Then, by Claim 5.8.2, with probability at least 1 푚 , 푖 has coverage at least /(54f휂훼). − 풟 |풰| 

Step 2. Handling Common Elements. Next, we turn our attention to the case ℒ ∩ cmn = . Although common elements may be covered Ω(1)̃︀ times within a single superset, 풰w ̸ ∅ it is important to note that the contribution of common elements to all supersets are roughly the same.

Claim 5.8.4. Let cmn := cmn be the set of w-common elements that are sampled in . ℒw ℒ ∩ 풰w ℒ 170 Then, with high probability, for each superset , the total number of times that elements of 풟 cmn appear in (counting duplicates) belongs to [푃, 2푃 ] where 푃 is a fixed number larger ℒw 풟 than log 푛.

cmn Proof: Let 푒 be a w-common element and let 푆1, , 푆푞 be the collection of sets ∈ 풰w ··· that contain 푒. For a superset 푗, define 푋푖,푗 to be a binary random variable that denotes 풟 whether 푆푖 푗. Moreover, let 푌푗,푒 := 푋1,푗 + + 푋푞,푗. Then, E[푌푗,푒] = w푞/(푐푚 log 푚). ∈ 풟 ··· By an application of Chernoff bound with limited independence (Lemma 5.7.3) and since w푞 푐푚 log 푚 log 푛 log(푚푛) (see Definition 5.2.1), ≥ √︃ 6푐 log(푚푛)푚 log 푚 −2 Pr( 푌푗,푒 E[푌푗,푒] E[푌푗,푒]) (푚푛) . (5.8.1) | − | ≥ w푞 ≤ ⏟ ⏞ ≤1/3

Note that for any w-common element 푒 and any pair of supersets 푗, 푖, E[푌푗,푒] = E[푌푖,푒]. 풟 풟 In particular, we define 푌푒 := E[푌푗,푒] whose value is independet of the supersets. Next, we define ∑︀ to denote the expected contribution of -common elements to any 푌cmn := 푒∈풰 cmn 푌푒 w 2 superset. Hence, for each superset 푗, with probability at least 1 1/(푛푚 ), the total number 풟 − of times that w-common elements are covered by 푗 belongs to the range [2푌cmn/3, 4푌cmn/3]. 풟 Hence, with probability at least 1 (푚푛)−1, the contribution of sampled w-common elements − cmn to each superset belongs to [2푌cmn/3, 4푌cmn/3] where 푌cmn log 푛 log(푚푛) log 푛. ≥ |ℒw | ≥  Next, we show that if a w-common element is picked in the algorithm still does not ℒ return a superset with small coverage (though it may missed all large supersets). To this end, we modify LargeSetSimple and design a new subroutine LargeSetComplete as in Figure 24. The high level idea is to guarantee that if the main contribution of a superset is just from the duplicate counts of w-common elements ( 푃 ), it will not be returned. ∝ To achieve this, unlike LargeSetSimple we do not allow 퐹2-Contributing to check for all contributing class of any size (up to ). Instead, we set parameters 푆1 and 푆2 which |풬| denotes how large the size of a contributing class that we are looking for is. To handle the case in which the size of a contributing class is large (i.e. larger than 푆2), we sample supersets proportional to 1/푆2 and compute their coverage by existing 퐿0-estimation algorithms.

171 Algorithm 24 A (훼, 훿, 휂)-oracle of Max 푘-Cover that handles the case in which the majority of the coverage in an optimal solution is by the sets whose coverage contributions are at least 1/(s훼) fraction of the optimal coverage size. By modifying the lines marked by (⋆⋆) slightly, the algorithm returns the 푘-cover that achieves the returned coverage estimate (see Section 5.5 for more explanation).

1: procedure LargeSetComplete(( , w, 푆1, 푆2, thr1, thr2)) 2: ◁ Input: w is an upper bound on풱 the desired number of sets in a superset 2 3: ◁ Parameters: 휑1 = Ω(̃︀ 훼 /푚) and 휑2 = Ω(1)̃︀ 4: let Cntrsmall be an instance of 퐹2-Contributing(휑1, 푆1) ◁ for Case 1 5: let Cntrlarge be an instance of 퐹2-Contributing(휑2, 푆2) ◁ for Case 2 6: pick ℎ : [(푐푚 log 푚)/w] from Θ(log(푚푛))-wise independent hash functions 7: for all (푆,ℱ 푒 →) in the data stream do 8: if 푒 then ∈ 풱 9: feed ℎ(푆) to both Cntrsmall and Cntrlarge 10: ◁ output(Cntr) contains coordinates along with (1 1/2)-estimate of frequencies * 1 ± 11: if there exists s.t. * then 푖 output(Cntrsmall) 푣˜푖 2 thr1 ∈ ≥ · * 12: return 2˜푣푖* /(3f) (⋆⋆) ◁ add return 푆 ℎ(푆) = 푖 to get the 푘-cover * {1 | } 13: if there exists s.t. * then 푖 output(Cntrlarge) 푣˜푖 2 thr2 ∈ ≥ · * 14: return 2˜푣푖* /(3f) (⋆⋆) ◁ add return 푆 ℎ(푆) = 푖 to get the 푘-cover { | } 15: ◁ case 2: if size of the contributing class is large; Ω(̃︀ ) |풬| 16: let be a collection of size 12 log 푚/푆2 picked uniformly at random 17: forℳ all ⊂푖 풬 do ◁ DE to estimate the|풬| coverage of the supersets in ∈ ℳ ℒ 18: let DE푖 be a (1/2)-approximation algorithm of 퐿0-estimation initialized to zero 19: pick ℎ : [(푐푚 log 푚)/w] from Θ(log(푚푛))-wise independent hash functions 20: for all (푆,ℱ 푒 →) in the data stream do 21: if 푒 and ℎ(푆) then feed ℎ(푆) to DEℎ(푆) ∈ 풱 ∈ ℳ * 1 22: if there exists such that VAL * then 푖 (DE푖 ) 2 thr2 ∈ ℳ ≥ · * 23: return 2VAL(DE푖* )/3 (⋆⋆) ◁ add return 푆 ℎ(푆) = 푖 to get the 푘-cover { | } 24: return infeasible Lemma 5.8.5. Even if cmn = , with probability at least 1 푚−1, none of the solutions ℒ∩풰w ̸ ∅ − 푐푚 log 푚 returned by LargeSetComplete with parameters ( , w, 푆1 = sℒ훼, 푆2 = Θ(̃︀ ), thr1 = ℒ w |ℒ| |ℒ| , thr2 = ) is a superset whose coverage is at less than /(54f휂훼). 18휂s훼 6휂훼 |풰| Proof: Here, we need to revisit Case 1 and Case 2 of Section 5.4.2 and redo the calculations with respect to the sampled set of elements . ℒ

2 Case 1. Suppose that 퐹2-Contributing(Ω(̃︀ 훼 /푚), sℒ훼) returns a solution whose cover- age is less than (54f휂훼). |풰| Let ⃗푟 be a vector of size (푐푚 log 푚)/w whose 푖th entry denotes the total size of the

172 rare cmn ∑︀ rare intersection of sets in 푖 and := ; ⃗푟[푖] := 푆 . By Claim 5.8.2, 풟 ℒw ℒ ∖ 풰w 푆∈풟푖 | ∩ ℒw | −2 rare if ( 푗) < /(54f휂훼), with probability at least 1 푚 , ( 푖) < ( 푖) < |풞 풟 | |풰| − |풞 풟 ∩ ℒw | |풞 풟 ∩ ℒ| ts/(36f). Hence, together with Claim 5.4.10, with probability at least 1 2푚−2, ⃗푟[푖] < ts/36. − Similarly, let ⃗푣 be a vector of size (푐푚 log 푚)/w whose 푖th entry denotes the total size of ∑︀ the intersection of sets in 푖 with the sampled elements ; ⃗푣[푖] := 푆 . Note that 풟 ℒ 푆∈풟푖 | ∩ ℒ| Since 푃 1, for each superset 푖 with coverage less than /(54f휂훼), ⃗푣[푖] 2푃 + ⃗푟[푖] ≥ 풟 |풰| ≤ ≤ (ts/36)푃 . On the other hand, by Claim 5.8.4, with probability at least 1 푚−1, the value − of ⃗푣[푗] for all 푗 in the sampled substream is at least 푃 .

Moreover, since the size of contributing classes in this case is at most sℒ훼 = 3s훼 (more pre- cisely, we can always find at most sets that are 훼2 -contributing), by Claim 5.2.8, with 3s훼 Ω(̃︀ 푚 ) log 푚 probability at least 1 , all sampled substreams considered by 퐹2-Contributing(휑1, sℒ훼) − 푚 have size at least 푐푚 log 푚/(ws훼). Hence, by Claim 5.8.4, with probability at least 1 2 log 푚 , − 푚 푐푚 log 푚 2 퐹2(⃗푣smp) ( )푃 where ⃗푣smp is a vector corresponding to a sampled substream consid- ≥ ws훼 4 2 ered in 퐹2-Contributing(휑1, sℒ훼). Since s t 81/(2휂 log(s훼)) (see Table 5.4.1) and by ≤ the value of 휑1 (in Eq. 5.4.2), the following holds:

ts 2 2 푐푚 log 푚 2 ( ) 푃 < 휑1 푃 , 36 · ws훼

which implies that with probability at least 1 3 log 푚/푚, an entry corresponding to a − superset with coverage less than /(54f휂훼) cannot be a 휑1-HeavyHitter in any of the |풰| sampled substreams considered in 퐹2-Contributing(휑1, sℒ훼).

Case 2. The high level idea in this case is similar to the previous case. In Case 1, we heavily 2 use the fact that there exists a class containing at most sℒ훼 coordinates that is Ω(̃︀ 훼 /푚)- contributing. This observation is crucial because then we could argue that all sampled

substreams considered in 퐹2-Contributing(휑1, sℒ훼) have size at least Ω(̃︀ 푚/(wsℒ훼)) which rules out the possibility that a coordinate corresponding to a small superset is a Ω(̃︀ 훼2/푚)-

HeavyHitter for sufficiently small values of s (recall that sℒ = 3s). However, in this case, a contributing class may have size Ω(̃︀ 푚/w) which results in a

sampled substream with only 푂̃︀(1) coordinates in the run of 퐹2-Contributing! To address

173 the issue, we handle the case in which a contributing class has more than 푆2 coordinates separately (푆2 is a parameter to be determined shortly):

푐푚 log 푚 Ω(1)̃︀ -contributing class has size less than 푆2 := 훾. Since 푃 1 and by ∙ w · ≥ Claim 5.8.2 and 5.4.10, for each superset 푗 with coverage less than /(54f휂훼), with 풟 |풰| probability at least 1 2푚−2, ⃗푣[푗] (ts/36)푃 . On the other hand, by Claim 5.8.4, with − ≤ probability at least 1 푚−1, the value of ⃗푣[푗] for all 푗 in the sampled substream is at least − 푃 . Moreover, by Claim 5.2.8, with probability at least 1 log 푚 , all sampled substreams − 푚 considered in -Contributing 1 invoked by LargeSetComplete 퐹2 (휑2 = 2 log(훼) , 푆2) 3푐푚 log 푚 3 (which is to handle Case 2) have at least ( ) = coordinates. Hence, 퐹2(⃗푣smp) w·푆2 훾 ≥ 3 푃 2. By setting 훾 ·

3휑 1944 훾 < 2 = , (5.8.2) (ts/36)2 log(훼)t2s2

2 2 2 1 ⃗푣[푗] (ts/36) 푃 < ( 퐹2(⃗푣smp)), which implies that an entry corresponding to a ≤ 2 log 훼 superset with coverage less than /(27f휂훼) cannot be a 휑2-HeavyHitter in any of |풰| the sampled substream considered in 퐹2-Contributing(휑2, 푆2).

푐푚 log 푚 Ω(1)̃︀ -contributing class has size at least 푆2 := 훾. Here, we also need to ∙ w · consider an extra case compared to Lemma 5.4.14 and 5.8.3 because we do not allow 푆2 to try all values up to 푐푚 log 푚 . To address the case in which the number of coordinates ( w )

in a contributing class is larger than 푆2, we sample ℓ = (12 log 푚) /푆2 supersets |풬| ℳ uniformly at random from ; with high probability, contains a superset from the 풬 ℳ contributing class. Then, we compute the coverage of sampled supersets via an existing

algorithm for 퐿0-estimation. By Claim 5.4.13, the coverage of supersets corresponding

to the 휑2-contributing class whose size is larger than 푆2 on the sampled set is at ℒ least /(휂ℒ훼) ts/(12f). Hence, the algorithm finds a superset with coverage at least |ℒ| ≥ ts/(36f) on which by Claim 5.8.2, it implies that the returned superset has coverage ℒ at least /(54f휂s훼). |풰| 

Theorem 5.8.6. If (OPT) /휂, then with probability at least 1 1/(log 푛 log푐 푚), |풞 | ≥ |풰| − LargeSet(푘, 훼) returns at least /(54f휂훼). Moreover, if LargeSet(푘, 훼) returns a value |풰| 174 other than infeasible, then with probability at least 1 4푚−1, (OPT) /(54f휂훼). − |풞 | ≥ |풰| Proof: By Lemma 5.8.5, with probability at least 1 3푚−1, LargeSet will never return − a superset with coverage less than /(27f휂훼); either it returns a large enough estimate or |풰| returns infeasible. Here, we show that if (OPT) > /휂, then the algorithm will return |풞 | |풰| an estimate at least /(54f휂훼) with probability at least 1 1/(2 log 푛 log푐 푚) 푛−1. To this |풰| − − end, we show that with high probability, in one of the 푂(log 푛) parallel runs of LargeSet, the sampled sets of elements does not contain any common element. Then, by Lemma 5.8.3, ℒ the algorithm with probability at least 1 1/(2 log 푛 log푐 푚) returns /(54f휂훼) in the − |풰| iteration in which the sampled set of elements that does not contain any common element.

Now, we show that with probability at least 1 푛−1, the sampled set of one of 푂(log 푛) − parallel runs of LargeSet does not contain any common element. Let 푞 = cmn and define |풰w | 푌1, , 푌푞 to be independent Bernoulli trials with probability of success equal to 휌. Recall ··· that, we have the assumption that cmn 휎|풰| and since w 푘, cmn cmn 휎|풰| , |풰푘 | ≤ 훼 ≤ |풰w | ≤ |풰푘 | ≤ 훼

푞 ∑︁ 휎 휇 = E[ 푌푖] |풰| 휌 = ts휂휎 훼 푖=1 ≤ · 푞 ∑︁ 푞 −2휌푞 −2ts휂휎 Pr( 푌푖 = 0) = (1 휌) 푒 푒 푖=1 − ≥ ≥

cmn Next, let’s assume that = 푒1, , 푒푞 . Further, define 푋1, , 푋푞 to be random 풰w { ··· } ··· variables such that 푋푖 = 1 if 푒푖 . Hence, 푋1, , 푋푞 are Θ(log(푚푛))-wise independent ∈ ℒ ··· Bernoulli trials with success probability E[푋푖] = 휌. Next, we apply Lemma 5.7.4 with the following parameters:

푟 = 0, ln(1/푝(0)) 2ts휂휎, 휇 = ts휂휎, 퐷 = Θ(log(푚푛)) = 12 log(푚푛), ≤ and show that

푞 푞 cmn ∑︁ ∑︁ −퐷 −2ts휂휎 Pr( w = ) = Pr( 푋푖 = 0) Pr( 푌푖)(1 푒 ) 푒 /2. ℒ ∩ 풰 ∅ 푖=1 ≥ 푖=1 − ≥

175 Algorithm 25 A single pass streaming algorithm. 1: procedure LargeSet((푘, 훼, w)) 2: ◁ run LargeSetComplete on sampled elements in parallel for 푂(log 푛) instances 3: for 푂(log 푛) times in parallel do 4: let s.t. each 푒 (Θ(log(푚푛))-wise independent) w.p. 휌 = t·s훼·휂 ℒ ⊆ 풰 ∈ ℒ |풰| 5: , 푐푚 log 푚 , , 1944 푆1 sℒ훼 푆2 훾 thr1 /(18휂s훼) thr2 /(6휂훼) ◁ 훾 := 2 2 ← ← w · ← |ℒ| ← |ℒ| t s log 훼 6: sol LargeSetComplete( , w, 푆1, 푆2, thr1, thr2) 7: if sol←= infeasible then ℒ 8: return̸ /(54f휂훼) |풰| 9: return infeasible Hence, since ts휂휎 = 휂 = 푂(1) (see Table 5.4.1),

Pr(In all runs, cmn = ) (1 푒−2ts휂휎)푂(log 푛) 푛−1. ℒ ∩ 풰 ̸ ∅ ≤ − ≤

The second property follows from Lemma 5.8.5: if the algorithm returns a value other than infeasible, then with probability at least 1 4푚−1, (OPT) /(54f휂훼). − |풞 | ≥ |풰|  Lemma 5.8.7. The amount of space used by LargeSet is 푂̃︀(푚/훼2).

Proof: Note that LargeSet performs 푂(log 푛) instances of LargeSetComplete in par- allel. Hence, the total amount of space use by LargeSet is 푂(log 푛) times the space complexity of LargeSetComplete. Similarly to the space analysis of LargeSetSimple, the amount of space to perform

2 Cntrsmall and Cntrlarge as defined in LargeSetComplete is respectively 푂̃︀(1/휑1) = 푂̃︀(푚/훼 )

and 푂̃︀(1/휑2) = 푂̃︀(1). Moreover, for the last case in which the contributing class has size 푚 larger than 푆2, by Theorem 5.2.12, in total 푂( ) = 푂(1) space is required to compute the ̃︀ w·푆2 ̃︀ coverage of sampled supersets in . Note that, in all cases, by Lemma 5.7.2, the algorithm ℳ can store ℎ in 푂̃︀(1) space. 2 Hence, the total amount of space required to implement LargeSet is 푂̃︀(푚/훼 ).  Proof of Theorem 5.4.8: The guarantee on the quality of the returned estimate follows from Theorem 5.8.6 with w = min 훼, 푘 and s = 푂̃︀(w/훼) (as in Table 5.4.1). Moreover, { } Lemma 5.8.7 shows that the space complexity of LargeSet is 푂̃︀(푚/훼2). Moreover, since with high probability the estimate returned by the algorithm is a lower bound on the coverage size of a 푘-cover of , the output of LargeSet with high probability, ℱ

176 is smaller than the optimal coverage size of Max 푘-Cover( , ). 풰 ℱ 

177 Part II

Learning-Based Streaming Algorithms

178 Chapter 6

The Frequency Estimation Problem

6.1. Introduction

The frequency estimation problem is formalized as follows: given a sequence 푆 of elements

from some universe 푈, for any element 푖 푈, estimate 푓푖, the number of times 푖 occurs in 푆. ∈ If one could store all arrivals from the stream 푆, one could sort the elements and compute their frequencies. However, in big data applications, the stream is too large (and may be infinite) and cannot be stored. This challenge has motivated the development of streaming algorithms, which read the elements of 푆 in a single pass and compute a good estimate of the frequencies using a limited amount of space. Specifically, the goal of the problem is as follows.

Given a sequence 푆 of elements from 푈, the desired algorithm reads 푆 in a single pass while writing into memory 퐶 (whose size can be much smaller than the length of 푆). Then, given any element 푖 푈, the algorithm reports an estimation of 푓푖 based only on the content of 퐶. ∈ Over the last two decades, many such streaming algorithms have been developed, including Count-Sketch [43], Count-Min [57] and multi-stage filters [70]. The performance guarantees of these algorithms are well-understood, with upper and lower bounds matching up to 푂( ) · factors [110]. However, such streaming algorithms typically assume generic data and do not leverage useful patterns or properties of their input. For example, in text data, the word frequency is known to be inversely correlated with the length of the word. Analogously, in network data, certain applications tend to generate more traffic than others. If such properties

179 can be harnessed, one could design frequency estimation algorithms that are much more efficient than the existing ones. Yet, it is important to do so in a general framework that can harness various useful properties, instead of using handcrafted methods specific to a particular pattern or structure (e.g., word length, application type). In this chapter, we introduce learning-based frequency estimation streaming algorithms. Our algorithms are equipped with a learning model that enables them to exploit data prop- erties without being specific to a particular pattern or knowing the useful property apriori. We further provide theoretical analysis of the guarantees associated with such learning-based algorithms. We focus on the important class of “hashing-based” algorithms, which includes some of the most used algorithms such as Count-Min, Count-Median and Count-Sketch. Informally,

these algorithms hash data items into 퐵 buckets, count the number of items hashed into each bucket, and use the bucket value as an estimate of item frequency. The process can be repeated using multiple hash functions to improve accuracy. Hashing-based algorithms have several useful properties. In particular, they can handle item deletions, which are implemented by decrementing the respective counters. Furthermore, some of them (notably

Count-Min) never underestimate the true frequencies, i.e., 푓˜푖 푓푖 holds always. However, ≥ hashing algorithms lead to estimation errors due to collisions: when two elements are mapped to the same bucket, they affect each others’ estimates. Although collisions are unavoidable given the space constraints, the overall error significantly depends on the pattern of collisions. For example, collisions between high-frequency elements (“heavy hitters”) result in a large estimation error, and ideally should be minimized. The existing algorithms, however, use random hash functions, which means that collisions are controlled only probabilistically.

Our idea is to use a small subset of 푆, call it 푆′, to learn the heavy hitters. We can then assign heavy hitters their own buckets to avoid the more costly collisions. It is important to emphasize that we are learning the properties that identify heavy hitters as opposed to the identities of the heavy hitters themselves. For example, in the word frequency case, shorter words tend to be more popular. The subset 푆′ itself may miss many of the popular words, but whichever words popular in 푆′ are likely to be short. Our objective is not to learn the identity of high frequency words using 푆′. Rather, we hope that a learning model trained

180 on 푆′ learns that short words are more frequent, so that it can identify popular words even if they did not appear in 푆′. Our main contributions are as follows:

We introduce learning-based frequency estimation streaming algorithms, which learn ∙ the properties of heavy hitters in their input and exploit this information to reduce errors.

We provide performance guarantees showing that our learned algorithms can deliver ∙ a logarithmic factor improvement in the error bound over their non-learning coun- terparts. Furthermore, we show that our learning-based instantiation of Count-Min, a widely used algorithm, is asymptotically optimal among all instantiations of that algorithm. See Table 6.4.1 in Section 6.4 for the details.

We evaluate our learning-based algorithms using two real-world datasets: network ∙ traffic and search query popularity. In comparison to their non-learning counterparts, our algorithms yield performance gains that range from 18% to 71%.

6.1.1. Related Work

Frequency estimation in data streams. Frequency estimation, and the closely related problem of finding frequent elements in a data stream, are some of the most fundamental and well-studied problems in streaming algorithms, see [55] for an overview. Hashing-based algorithms such as Count-Sketch [43], Count-Min [57] and multi-stage filters [70] are widely used solutions for these problems. These algorithms also have close connections to sparse recovery and compressed sensing [40, 64], where the hashing output can be considered as a compressed representation of the input data [79]. Several “non-hashing” algorithms for frequency estimation have been also proposed [136, 62, 116, 133]. These algorithms do not possess many of the properties of hashing-based methods listed in the introduction (such as the ability to handle deletions), but they often have better accuracy/space tradeoffs. For a fair comparison, our evaluation focuses only on hashing algorithms. However, our approach for learning heavy hitters should be useful for non-hashing algorithms as well.

181 Some papers have proposed or analyzed frequency estimation algorithms customized to data that follows Zipf Law [43, 58, 133, 135, 154]; the last algorithm is somewhat similar to the “lookup table” implementation of the heavy hitter oracle that we use as a baseline in our experiments. Those algorithms need to know the data distribution a priori, and apply only to one distribution. In contrast, our learning-based approach applies to any data property or distribution, and does not need to know that property or distribution a priori.

Learning-based algorithms. Recently, researchers have begun exploring the idea of in- tegrating machine learning models into algorithm design. In particular, researchers have proposed improving compressed sensing algorithms, either by using neural networks to im- prove sparse recovery algorithms [140, 33], or by designing linear measurements that are optimized for a particular class of vectors [25, 141], or both. The latter methods can be viewed as solving a problem similar to ours, as our goal is to design “measurements” of

the frequency vector (푓1, 푓2 . . . , 푓|푈|) tailored to a particular class of vectors. However, the aforementioned methods need to explicitly represent a matrix of size 퐵 푈 , where 퐵 is the × | | number of buckets. Hence, they are unsuitable for streaming algorithms which, by definition, have space limitations much smaller than the input size. Another class of problems that benefited from machine learning is distance estimation, i.e., compression of high-dimensional vectors into compact representations from which one can estimate distances between the original vectors. Early solutions to this problem, such as Locality-Sensitive Hashing, have been designed for worst case vectors. Over the last decade, numerous methods for learning such representations have been developed [156, 169, 109, 168]. Although the objective of those papers is similar to ours, their techniques are not usable in our applications, as they involve a different set of tools and solve different problems. More broadly, there have been several recent papers that leverage machine learning to design more efficient algorithms. The authors of[59] show how to use reinforcement learn- ing and graph embedding to design algorithms for graph optimization (e.g., TSP). Other learning-augmented combinatorial optimization problems are studied in [101, 24, 128]. More recently, [120, 137] have used machine learning to improve indexing data structures, includ- ing Bloom filters that (probabilistically) answer queries of the form “is a given element inthe

182 data set?” As in those papers, our algorithms use neural networks to learn certain properties of the input. However, we differ from those papers both in our design and theoretical analy- sis. Our algorithms are designed to reduce collisions between heavy items, as such collisions greatly increase errors. In contrast, in existence indices, all collisions count equally. This also leads to our theoretical analysis being very different from that in[137].

6.2. Preliminaries

Estimation Error. To measure and compare the overall accuracy of different frequency estimation algorithms, we will use the expected estimation error which is defined as follows:

Let ℱ = {f_1, ..., f_n} and ℱ̃^𝒜 = {f̃_1, ..., f̃_n} respectively denote the actual frequencies and the estimated frequencies obtained from algorithm 𝒜 of the items in the input stream. We remark that when 𝒜 is clear from the context we denote ℱ̃^𝒜 as ℱ̃. Then we define

Err(ℱ, ℱ̃^𝒜) := E_{i ∼ 𝒟} [ |f_i − f̃_i| ],    (6.2.1)

where 𝒟 denotes the query distribution of the items. Here, we assume that the query distribution 𝒟 is the same as the frequency distribution of the items in the stream, i.e., for any i* ∈ [n], Pr_{i∼𝒟}[i = i*] ∝ f_{i*} (more precisely, for any i* ∈ [n], Pr_{i∼𝒟}[i = i*] = f_{i*}/N, where N = Σ_{i∈[n]} f_i denotes the total sum of all frequencies in the stream).

We note that the theoretical guarantees of frequency estimation algorithms are typically

phrased in the "(ε, δ)-form", e.g., Pr[ |f̃_i − f_i| > εN ] < δ for every i (see e.g., [57]). However, this formulation involves two objectives (ε and δ). We believe that the (single objective) expected error in Equation (6.2.1) is more natural from the machine learning perspective.
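To make the metric concrete, the following small Python sketch (not part of the original text) computes the expected error of Equation (6.2.1) for given true frequencies and estimates; by default it uses the frequency-proportional query distribution assumed above.

import numpy as np

def expected_error(f, f_est, query_probs=None):
    # Expected estimation error E_{i ~ D} |f_i - f_est_i| (Equation (6.2.1)).
    # By default the query distribution D is proportional to the frequencies,
    # i.e. Pr[i] = f_i / N with N = sum_i f_i.
    f = np.asarray(f, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    if query_probs is None:
        query_probs = f / f.sum()
    return float(np.sum(query_probs * np.abs(f - f_est)))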

Remark 6.2.1. As all upper/lower bounds in this paper are proved by bounding the expected error of estimating the frequency of a single item, E[|f̃_i − f_i|], then using linearity of expectation, we in fact obtain analogous bounds for any query distribution (p_i)_{i∈[n]}. In particular this means that the bounds of Table 6.4.1 for CM and CS hold for any query distribution. For L-CM and L-CS the factor of log(n/B)/log n gets replaced by Σ_{i=B_h+1}^{n} p_i, where B_h = Θ(B) is the number of buckets reserved for heavy hitters.

6.2.1. Algorithms for Frequency Estimation

In this section, we recap three variants of hashing-based algorithms for frequency estimation.

Single Hash Function. We have a single hash function h : U → [B] and an array C of size B. The algorithm maintains C such that at the end of the stream C[b] = Σ_{j : h(j)=b} f_j. For each i ∈ U, the frequency estimate f̃_i is equal to C[h(i)], and always satisfies f̃_i ≥ f_i.

Count-Min. We have k distinct hash functions h_i : U → [B] and an array C of size k × B. The algorithm maintains C, such that at the end of the stream we have C[ℓ, b] = Σ_{j : h_ℓ(j)=b} f_j. For each i ∈ U, the frequency estimate f̃_i is equal to min_{ℓ≤k} C[ℓ, h_ℓ(i)], and always satisfies f̃_i ≥ f_i.

Count-Sketch. Similarly to Count-Min, we have k distinct hash functions h_i : U → [B] and an array C of size k × B. Additionally, in Count-Sketch, we have k sign functions g_i : U → {−1, 1}, and the algorithm maintains C such that C[ℓ, b] = Σ_{j : h_ℓ(j)=b} f_j · g_ℓ(j). For each i ∈ U, the frequency estimate f̃_i is equal to the median of {g_ℓ(i) · C[ℓ, h_ℓ(i)]}_{ℓ≤k}. Note that unlike the previous two methods, here we may have f̃_i < f_i.
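For concreteness, here are minimal Python sketches of the two k-row algorithms just described. They are illustrative only: truly random hash and sign functions are simulated with random lookup tables, and the universe is assumed to be the integers [0, universe).

import numpy as np

class CountMin:
    # Count-Min with k rows of B counters; estimates never underestimate.
    def __init__(self, k, B, universe, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, B, size=(k, universe))   # k simulated hash functions
        self.C = np.zeros((k, B))

    def update(self, i, delta=1):
        for l in range(self.h.shape[0]):
            self.C[l, self.h[l, i]] += delta

    def query(self, i):
        return min(self.C[l, self.h[l, i]] for l in range(self.h.shape[0]))

class CountSketch:
    # Count-Sketch: adds random signs and reports the median estimate.
    def __init__(self, k, B, universe, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, B, size=(k, universe))
        self.g = rng.choice([-1, 1], size=(k, universe))   # sign functions
        self.C = np.zeros((k, B))

    def update(self, i, delta=1):
        for l in range(self.h.shape[0]):
            self.C[l, self.h[l, i]] += self.g[l, i] * delta

    def query(self, i):
        ests = [self.g[l, i] * self.C[l, self.h[l, i]] for l in range(self.h.shape[0])]
        return float(np.median(ests))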

6.2.2. Zipfian Distribution

In our analysis we assume that the frequency distribution of the items follows Zipf's law. That is, if we sort the items according to their frequencies, with no loss of generality assuming that f_1 ≥ f_2 ≥ ··· ≥ f_n, then for any j ∈ [n], f_j ∝ 1/j. Given that the frequencies of the items follow Zipf's law and assuming that the query distribution is the same as the distribution of the frequency of the items in the input stream (i.e., Pr_{i∼𝒟}[i*] = f_{i*}/N = 1/(i* · H_n), where H_n denotes the n-th harmonic number), we can write the expected error defined in (6.2.1) as follows:

Err(ℱ, ℱ̃^𝒜) = E_{i∼𝒟}[ |f_i − f̃_i| ] = (1/N) · Σ_{i∈[n]} |f̃_i − f_i| · f_i = (1/H_n) · Σ_{i∈[n]} |f̃_i − f_i| · f_i.    (6.2.2)

Throughout this paper, we present our results with respect to the objective function on the right hand side of (6.2.2), i.e., (1/H_n) · Σ_{i=1}^{n} |f̃_i − f_i| · f_i. We further assume that in fact f_i = 1/i. At first sight this assumption may seem strange since it says that item i appears a non-integral number of times in the stream. This is however just a matter of scaling, and the assumption is convenient as it removes the dependence on the length of the stream in our bounds.

Algorithm 26 Learning-Based Frequency Estimation
1: procedure LearnedSketch(B, B_r, HH, SketchAlg)
2:   for each stream element i do
3:     if HH(i) = 1 then                ◁ if i is a heavy item
4:       if i is already stored in a unique bucket then
5:         count_i ← count_i + 1
6:       else
7:         initialize count_i ← 1
8:     else
9:       feed i to SketchAlg

Figure 6.3.1: Pseudo-code and block-diagram representation of our algorithms. In the block diagram, each element i is first sent to the learned oracle; items predicted to be heavy are stored in unique buckets, while the remaining items are fed to SketchAlg (e.g., Count-Min).

6.3. Learning-Based Frequency Estimation Algorithms

We aim to develop frequency estimation algorithms that exploit data properties for better performance. To do so, we learn an oracle that identifies heavy hitters, and use the oracle to assign each heavy hitter its unique bucket to avoid collisions. Other items are simply hashed using any classic frequency estimation algorithm (e.g., Count-Min, or Count-Sketch), as shown in the block-diagram in Figure 6.3.1. This design has two useful properties: first, it allows us to augment a classic frequency estimation algorithm with learning capabilities, producing a learning-based counterpart that inherits the original guarantees of the classic algorithm. For example, if the classic algorithm is Count-Min, the resulting learning-based algorithm never underestimates the frequencies. Second, it provably reduces the estimation errors, and for the case of Count-Min it is (asymptotically) optimal.

Algorithm 26 provides pseudo-code for our design. The design assumes an oracle HH(i) that attempts to determine whether an item i is a "heavy hitter" or not. All items classified as heavy hitters are assigned to one of the B_r unique buckets reserved for heavy items. All other items are fed to the remaining B − B_r buckets using a conventional frequency estimation algorithm SketchAlg (e.g., Count-Min or Count-Sketch).

The estimation procedure is analogous. To compute f̃_i, the algorithm first checks whether i is stored in a unique bucket, and if so, reports its count. Otherwise, it queries the SketchAlg procedure. Note that if the element is stored in a unique bucket, its reported count is exact, i.e., f̃_i = f_i.
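A minimal Python sketch of Algorithm 26 together with its query procedure is given below. The names hh_oracle and sketch_alg are illustrative, and the explicit capacity cap on the unique buckets is an added assumption (the pseudo-code does not spell out what happens once the B_r unique buckets are exhausted).

class LearnedSketch:
    # Algorithm 26: items the oracle marks as heavy get exact unique buckets,
    # everything else is fed to a conventional sketch (e.g. Count-Min above).
    def __init__(self, hh_oracle, sketch_alg, num_unique_buckets):
        self.oracle = hh_oracle            # HH(i): predicted heavy / not heavy
        self.sketch = sketch_alg           # SketchAlg for the remaining items
        self.capacity = num_unique_buckets # B_r unique buckets (assumed cap)
        self.unique = {}                   # item id -> exact count

    def update(self, i, delta=1):
        if self.oracle(i) and (i in self.unique or len(self.unique) < self.capacity):
            self.unique[i] = self.unique.get(i, 0) + delta
        else:
            self.sketch.update(i, delta)

    def query(self, i):
        if i in self.unique:               # exact count, so f_est = f_i
            return self.unique[i]
        return self.sketch.query(i)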

Algorithm              | Expected Error, k = 1         | Expected Error, k > 1
Count-Min              | Θ(log n / B)                  | Θ(k · log(kn/B) / B)
Learned Count-Min      | Θ(log²(n/B) / (B · log n))    | Ω(log²(n/B) / (B · log n))
Count-Sketch           | Θ(log B / B)                  | Ω(k^{1/2} / (B · log k)) and O(k^{1/2} / B)
Learned Count-Sketch   | Θ(log(n/B) / (B · log n))     | Ω(log(n/B) / (B · log n))

Table 6.4.1: The performance of different algorithms on streams with frequencies obeying Zipf's law. Here k is the number of hash functions, B is the number of buckets, and n is the number of distinct elements. The space complexity of all algorithms is Θ(B).

The oracle is constructed using machine learning and trained with a small subset of S, call it S′. Note that the oracle learns the properties that identify heavy hitters, as opposed to the identities of the heavy hitters themselves. For example, in the case of word frequency, the oracle would learn that shorter words are more frequent, so that it can identify popular words even if they did not appear in the training set S′.

6.4. Analysis

Our algorithms combine simplicity with strong error bounds. Below, we summarize our theoretical results; the full statements and proofs appear in the subsections that follow. In particular, Table 6.4.1 lists the results proven in this chapter, where each row refers to a specific streaming algorithm and its corresponding error bound. First, we show that if the heavy hitter oracle is accurate, then the errors of the learned variants of Count-Min and Count-Sketch are, up to logarithmic factors, smaller than those of their standard counterparts. Second, we show that, asymptotically, our learned Count-Min algorithm cannot be improved any further by designing a better hashing scheme. Specifically, for the case of Learned Count-Min with a perfect oracle, our design achieves the same asymptotic error as the "Ideal Count-Min", which optimizes its hash function for the given input.

We remark that for both Count-Min and Count-Sketch we aim at analyzing the expected value of the variable Σ_{i∈[n]} f_i · |f̃_i − f_i|, where f_i = 1/i and f̃_i is the estimate of f_i output by the relevant sketching algorithm. Throughout this paper we use the following notation: for an event E we denote by [E] the random variable in {0, 1} which is 1 if and only if E occurs.

6.4.1. Tight Bounds for Count-Min with Zipfians

We begin by presenting our tight analysis of Count-Min with Zipfians. The main theorem is the following.

Theorem 6.4.1. Let n, B, k ∈ ℕ with k ≥ 2 and B ≤ n/k. Let further h_1, ..., h_k : [n] → [B] be independent and truly random hash functions. For i ∈ [n], define the random variable f̃_i = min_{ℓ∈[k]} ( Σ_{j∈[n]} [h_ℓ(j) = h_ℓ(i)] f_j ). For any i ∈ [n] it holds that

E[ |f̃_i − f_i| ] = Θ( log(n/B) / B ).
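Before turning to the discussion and proof, the following small numpy simulation (not part of the thesis) can be used to sanity-check the Θ(log(n/B)/B) behavior of Theorem 6.4.1 under the normalization f_j = 1/j; the parameter values in the trailing comment are arbitrary.

import numpy as np

def cm_point_error(n, B, k, trials=200, seed=0):
    # Average |f_est_i - f_i| for the item i = n // 2 under Count-Min with
    # truly random hash functions and Zipfian weights f_j = 1/j.
    rng = np.random.default_rng(seed)
    f = 1.0 / np.arange(1, n + 1)
    i = n // 2
    errs = []
    for _ in range(trials):
        h = rng.integers(0, B, size=(k, n))
        # f[h[l] == h[l, i]].sum() includes f_i itself, so est >= f_i.
        est = min(f[h[l] == h[l, i]].sum() for l in range(k))
        errs.append(est - f[i])
    return float(np.mean(errs))

# Expected to scale roughly like log(n/B)/B, e.g. cm_point_error(n=10_000, B=200, k=2).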

Replacing B by B/k in Theorem 6.4.1 and using linearity of expectation, we obtain the desired bound for Count-Min in the upper right-hand entry of Table 6.4.1. The natural assumption that B ≤ n/k simply says that the total number of buckets is upper bounded by the number of items. To prove Theorem 6.4.1 we start with the following special case of the theorem.

Lemma 6.4.2. Suppose that we are in the setting of Theorem 6.4.1 and further that¹ n = B. Then

E[ |f̃_i − f_i| ] = O(1/n).

¹In particular we dispense with the assumption that B ≤ n/k.

Proof: It suffices to show the result when k = 2, since adding more hash functions and corresponding tables only decreases the value of |f̃_i − f_i|. Define Z_ℓ = Σ_{j∈[n]∖{i}} [h_ℓ(j) = h_ℓ(i)] f_j for ℓ ∈ [2] and note that these variables are independent. For a given t ≥ 3/n we wish to upper bound Pr[Z_ℓ ≥ t]. Let s < t and note that if Z_ℓ ≥ t then either of the following two events must hold:

E_1: There exists a j ∈ [n] ∖ {i} with f_j > s and h_ℓ(j) = h_ℓ(i).
E_2: The set {j ∈ [n] ∖ {i} : h_ℓ(j) = h_ℓ(i)} contains at least t/s elements.

Union bounding we find that

Pr[Z_ℓ ≥ t] ≤ Pr[E_1] + Pr[E_2] ≤ 1/(ns) + (n choose t/s) · n^{−t/s} ≤ 1/(ns) + (es/t)^{t/s}.

Choosing s = t / log(tn), a simple calculation yields that Pr[Z_ℓ ≥ t] = O( log(tn) / (tn) ). As Z_1 and Z_2 are independent, Pr[Z ≥ t] = O( (log(tn)/(tn))² ), where Z = min(Z_1, Z_2) = f̃_i − f_i, so

E[Z] = ∫_0^∞ Pr[Z ≥ t] dt ≤ 3/n + ∫_{3/n}^∞ O( (log(tn)/(tn))² ) dt = O(1/n). □

Before proving the full statement of Theorem 6.4.1 we recall Bennett’s inequality.

Theorem 6.4.3 (Bennett's inequality [30]). Let X_1, ..., X_n be independent, mean zero random variables. Let S = Σ_{i=1}^{n} X_i, and let σ², M > 0 be such that Var[S] ≤ σ² and |X_i| ≤ M for all i ∈ [n]. For any t ≥ 0,

Pr[S ≥ t] ≤ exp( −(σ²/M²) · h(tM/σ²) ),

where h : ℝ_{≥0} → ℝ_{≥0} is defined by h(x) = (x + 1) log(x + 1) − x. The same tail bound holds on the probability Pr[S ≤ −t].

Remark 6.4.4. It is well known and easy to check that for x ≥ 0,

(1/2) · x log(x + 1) ≤ h(x) ≤ x log(x + 1).

We will use these asymptotic bounds repeatedly in this paper.

Proof of Theorem 6.4.1. We start out by proving the upper bound. Let N_1 = [B] ∖ {i} and N_2 = [n] ∖ ([B] ∪ {i}). Let b ∈ [k] be such that Σ_{j∈N_1} f_j · [h_b(j) = h_b(i)] is minimized. Note that b is itself a random variable. We also define

Y_1 = Σ_{j∈N_1} f_j · [h_b(j) = h_b(i)],
Y_2 = Σ_{j∈N_2} f_j · [h_b(j) = h_b(i)].

Clearly |f̃_i − f_i| ≤ Y_1 + Y_2. Using Lemma 6.4.2, we obtain that E[Y_1] = O(1/B). For Y_2 we observe that

E[Y_2 | b] = Σ_{j∈N_2} f_j / B = O( log(n/B) / B ).

We conclude that

E[ |f̃_i − f_i| ] ≤ E[Y_1] + E[Y_2] = E[Y_1] + E[E[Y_2 | b]] = O( log(n/B) / B ),

as desired.

Next we show the lower bound. For j ∈ [n] and ℓ ∈ [k] we define X_ℓ^{(j)} = f_j · ( [h_ℓ(j) = h_ℓ(i)] − 1/B ). Note that the variables (X_ℓ^{(j)})_{j∈[n]} are independent. We also define S_ℓ = Σ_{j∈N_2} X_ℓ^{(j)} for ℓ ∈ [k]. Observe that |X_ℓ^{(j)}| ≤ f_j ≤ 1/B for j ≥ B, E[X_ℓ^{(j)}] = 0, and that

Var[S_ℓ] = Σ_{j∈N_2} f_j² (1/B − 1/B²) ≤ 2/B².

Applying Bennett's inequality with σ² = 2/B² and M = 1/B thus gives that

Pr[S_ℓ ≤ −t] ≤ exp( −2 h(tB/2) ).

Defining W_ℓ = Σ_{j∈N_2} f_j · [h_ℓ(j) = h_ℓ(i)], it holds that E[W_ℓ] = Θ( log(n/B) / B ) and S_ℓ = W_ℓ − E[W_ℓ], so putting t = E[W_ℓ]/2 in the inequality above we obtain that

Pr[W_ℓ ≤ E[W_ℓ]/2] = Pr[S_ℓ ≤ −E[W_ℓ]/2] ≤ exp( −2 h( Ω(log(n/B)) ) ).

Appealing to Remark 6.4.4 and using that B ≤ n/k, the above bound becomes

Pr[W_ℓ ≤ E[W_ℓ]/2] ≤ exp( −Ω( log(n/B) · log(log(n/B) + 1) ) ) = exp( −Ω( log k · log(log k + 1) ) ) = k^{−Ω(log(log k + 1))}.    (6.4.1)

By the independence of the events (W_ℓ > E[W_ℓ]/2)_{ℓ∈[k]} we have that

Pr[ |f̃_i − f_i| ≥ E[W_ℓ]/2 ] ≥ (1 − k^{−Ω(log(log k+1))})^k = Ω(1),

and so E[ |f̃_i − f_i| ] = Ω(E[W_ℓ]) = Ω( log(n/B) / B ), as desired. □

6.4.2. Learned Count-Min with Zipfians

One hash function. By the tight analysis of Count-Min, we have the following theorem on the expected error of the learned Count-Min with k = 1. By taking B_1 = B_h = Θ(B) and B_2 = B − B_h = Θ(B) in the theorem below, the result on learned Count-Min for k = 1 claimed in Table 6.4.1 follows immediately.

Theorem 6.4.5. Let n, B_1, B_2 ∈ ℕ and let h : [n] ∖ [B_1] → [B_2] be an independent and truly random hash function. For i ∈ [n] ∖ [B_1], define the random variable f̃_i = Σ_{j∈[n]∖[B_1]} [h(j) = h(i)] f_j. For any i ∈ [n] ∖ [B_1] it holds that

E[ |f̃_i − f_i| ] = Θ( log(n/B_1) / B_2 ).

Multiple hash functions. Next we compute an asymptotic lower bound on the expected error of Count-Min, no matter how the hash functions are picked. In particular, this implies a lower bound on the expected error of the learned Count-Min.

Claim 6.4.6. For a set of items I, let f(I) denote the total frequency of all items in I; f(I) := Σ_{i∈I} f_i. In any hash function of the form h : I → [B_I] with minimum expected error, any item with frequency at least f(I)/B_I does not collide with any other items in I.

Proof: For each bucket b ∈ [B_I], let f_h(b) denote the total frequency of the items mapped to b under h; f_h(b) := Σ_{i∈I : h(i)=b} f_i. Then we can rewrite Err(ℱ(I), ℱ̃_h(I)) as

Err(ℱ(I), ℱ̃_h(I)) = Σ_{i∈I} f_i · (f_h(h(i)) − f_i) = Σ_{i∈I} f_i · f_h(h(i)) − Σ_{i∈I} f_i² = Σ_{b∈[B_I]} f_h(b)² − Σ_{i∈I} f_i².    (6.4.2)

Note that in (6.4.2) the second term is independent of h and is a constant. Hence, an optimal hash function minimizes the first term, Σ_{b∈[B_I]} f_h(b)².

Suppose that an item i* with frequency at least f(I)/B_I collides with a (non-empty) set of items I* ⊆ I ∖ {i*} under an optimal hash function h*. Since the total frequency of the items mapped to the bucket b* containing i* is greater than f(I)/B_I (i.e., f_{h*}(b*) > f(I)/B_I), there exists a bucket b such that f_{h*}(b) < f(I)/B_I. Next, we define a new hash function h with smaller estimation error compared to h*, which contradicts the optimality of h*:

h(i) = h*(i) if i ∈ I ∖ I*, and h(i) = b otherwise.

Formally,

Err(ℱ(I), ℱ̃_{h*}(I)) − Err(ℱ(I), ℱ̃_h(I)) = f_{h*}(b*)² + f_{h*}(b)² − f_h(b*)² − f_h(b)²
= (f_{i*} + f(I*))² + f_{h*}(b)² − f_{i*}² − (f_{h*}(b) + f(I*))²
= 2 f_{i*} · f(I*) − 2 f_{h*}(b) · f(I*)
= 2 f(I*) · (f_{i*} − f_{h*}(b)) > 0,

since f_{i*} > f(I)/B_I > f_{h*}(b). □

Next, we show that in any hash function h* : [n] → [B] with minimum expected error and assuming a Zipfian distribution, the Θ(B) most frequent items do not collide with any other items under h*.

Lemma 6.4.7. Suppose that B = n/γ where γ is a large enough constant, and assume that the items follow a Zipfian distribution. In any hash function h* : [n] → [B] with minimum expected error, none of the B/(2 ln γ) most frequent items collide with any other items (i.e., they are mapped to a singleton bucket).

Proof: Let i_{j*} be the most frequent item that is not mapped to a singleton bucket under h*. If j* > B/(2 ln γ) then the statement holds. Suppose it is not the case and j* ≤ B/(2 ln γ). Let I denote the set of items with frequency at most f_{j*} = 1/j* (i.e., I = {i_j | j ≥ j*}), and let B_I denote the number of buckets that the items with index at least j* are mapped to; B_I = B − (j* − 1). It is straightforward to show that f(I) < ln(n/j*) + 1. Next, by Claim 6.4.6, we show that h* does not hash the items {j*, ..., n} to B_I optimally. In particular, we show that the frequency of item j* is more than f(I)/B_I. To prove this, first we observe that the function g(j) := j · (ln(n/j) + 1) is strictly increasing in [1, n]. Hence, for any j* ≤ B/(2 ln γ),

j* · (ln(n/j*) + 1) ≤ (B/(2 ln γ)) · (ln(2γ ln γ) + 1)
≤ B · (1 − 1/(2 ln γ))    (since ln(2 ln γ) + 2 < ln γ for sufficiently large γ)
< B_I.

Thus, f_{j*} = 1/j* > (ln(n/j*) + 1)/B_I > f(I)/B_I, which implies that an optimal hash function must map j* to a singleton bucket. □

Theorem 6.4.8. If n/B is sufficiently large, then the expected error of any hash function that maps a set of n items following a Zipfian distribution to B buckets is Ω( ln²(n/B) / B ).

Proof: Let γ = n/B. By Lemma 6.4.7, in any hash function with minimum expected error, the B/(2 ln γ) most frequent items do not collide with any other items (i.e., they are mapped into a singleton bucket), where γ is a sufficiently large constant. Hence, the goal is to minimize (6.4.2) for the set of items I which consists of all items other than the B/(2 ln γ) most frequent items. Since the sum of squares of m items that sum to S is at least S²/m, the expected error of any optimal hash function is at least:

Err(ℱ(I), ℱ̃_{h*}(I)) = Σ_{b∈[B]} f_{h*}(b)² − Σ_{i∈I} f_i²    (by (6.4.2))
≥ (Σ_{i∈I} f_i)² / ( B (1 − 1/(2 ln γ)) ) − Σ_{i∈I} f_i²
≥ (ln(2γ ln γ) − 1)² / B − ( 2 ln γ / B + 1/n )
= Ω( ln² γ / B )    (for sufficiently large γ)
= Ω( ln²(n/B) / B ). □

Next, we show a more general statement which basically says that the estimation error of any Count-Min with B buckets is Ω( ln²(n/B) / B ), no matter how many rows it has.

Theorem 6.4.9. If n/B is sufficiently large, then the estimation error of any Count-Min that maps a set of n items following a Zipfian distribution to B buckets is Ω( ln²(n/B) / B ).

Proof: We prove the statement by showing a reduction that, given a Count-Min sketch CM(k) with hash functions h_1, ..., h_k ∈ {h : [n] → [B/k]}, constructs a single hash function h* : [n] → [B] whose estimation error is at most the estimation error of CM(k).

For each item i, let j* = argmin_{j∈[k]} C[j, h_j(i)] denote the row whose counter is returned by CM(k) as the estimate of f_i. Since the total number of buckets in CM(k) is B, the set of buckets that CM(k) uses to report the estimates of {f_i | i ∈ [n]} trivially has size at most B. We define h* as follows:

h*(i) = (j*, h_{j*}(i)) for each i ∈ [n], where j* = argmin_{j∈[k]} C[j, h_j(i)].

Unlike CM(k), h* maps each item to exactly one bucket in {C[ℓ, j] | ℓ ∈ [k], j ∈ [B/k]}. Hence, denoting by C′[b] the total frequency of the items that h* maps to bucket b, we have for each item i that C′[h*(i)] ≤ C[h*(i)] = f̃_i, where f̃_i is the estimate of f_i returned by CM(k). Moreover, since for each i, C′[h*(i)] ≥ f_i, we get Err(ℱ, ℱ̃_{h*}) ≤ Err(ℱ, ℱ̃_{CM(k)}). Finally, by Theorem 6.4.8, the estimation error of h* is Ω( ln²(n/B) / B ), which implies that the estimation error of CM(k) is Ω( ln²(n/B) / B ) as well. □

6.4.3. (Nearly) Tight Bounds for Count-Sketch with Zipfians

In this section we proceed to analyze Count-Sketch for Zipfians, using either a single hash function or several. We start with two simple lemmas which, for certain frequencies (f_i)_{i∈[n]} of the items in the stream, can be used to obtain good upper and lower bounds, respectively, on E[ |f̃_i − f_i| ] in Count-Sketch with a single hash function. We will use these two lemmas both in our analysis of standard and of learned Count-Sketch for Zipfians.

Lemma 6.4.10. Let w = (w_1, ..., w_n) ∈ ℝ^n, let η_1, ..., η_n be independent Bernoulli variables taking value 1 with probability p, and let σ_1, ..., σ_n ∈ {−1, 1} be independent Rademachers, i.e., Pr[σ_i = 1] = Pr[σ_i = −1] = 1/2. Let S = Σ_{i=1}^{n} w_i η_i σ_i. Then

E[ |S| ] = O( √p · ‖w‖_2 ).

Proof: Using that E[σ_i σ_j] = 0 for i ≠ j and Jensen's inequality,

E[ |S| ]² ≤ E[S²] = E[ Σ_{i=1}^{n} w_i² η_i² ] = p ‖w‖_2²,

from which the result follows. □

Lemma 6.4.11. Suppose that we are in the setting of Lemma 6.4.10. Let I ⊂ [n] and let w_I ∈ ℝ^n be defined by (w_I)_i = [i ∈ I] · w_i. Then

E[ |S| ] ≥ (1/2) p (1 − p)^{|I|−1} ‖w_I‖_1.

Proof: Let J = [n] ∖ I, S_1 = Σ_{i∈I} w_i η_i σ_i, and S_2 = Σ_{i∈J} w_i η_i σ_i. Let E denote the event that S_1 and S_2 have the same sign or S_2 = 0. Then Pr[E] ≥ 1/2 by symmetry. For i ∈ I we denote by A_i the event that {j ∈ I : η_j ≠ 0} = {i}. Then Pr[A_i] = p (1 − p)^{|I|−1}, and furthermore A_i and E are independent. If A_i ∩ E occurs, then |S| ≥ |w_i|, and as the events (A_i ∩ E)_{i∈I} are disjoint it thus follows that

E[ |S| ] ≥ Σ_{i∈I} Pr[A_i ∩ E] · |w_i| = (1/2) p (1 − p)^{|I|−1} ‖w_I‖_1. □

One hash function. We are now ready to commence our analysis of Count-Sketch for Zipfians. As in the discussion succeeding Theorem 6.4.1, the following theorem yields the desired result for a single hash function as presented in Table 6.4.1.

Theorem 6.4.12. Suppose that B ≤ n and let h : [n] → [B] and s : [n] → {−1, 1} be truly random hash functions. Define the random variable f̃_i = Σ_{j∈[n]} [h(j) = h(i)] s(j) f_j for i ∈ [n]. Then

E[ |f̃_i − s(i) f_i| ] = Θ( log B / B ).

Proof: Let i ∈ [n] be fixed. We start by defining N_1 = [B] ∖ {i} and N_2 = [n] ∖ ([B] ∪ {i}) and note that

|f̃_i − s(i) f_i| ≤ | Σ_{j∈N_1} [h(j) = h(i)] s(j) f_j | + | Σ_{j∈N_2} [h(j) = h(i)] s(j) f_j | := X_1 + X_2.

Using the triangle inequality,

E[X_1] ≤ (1/B) Σ_{j∈N_1} f_j = O( log B / B ).

Also, by Lemma 6.4.10, E[X_2] = O(1/B), and combining the two bounds we obtain the desired upper bound.

For the lower bound we apply Lemma 6.4.11 with I = N_1, concluding that

E[ |f̃_i − s(i) f_i| ] ≥ (1/(2B)) (1 − 1/B)^{|N_1|−1} Σ_{i∈N_1} f_i = Ω( log B / B ). □

Multiple hash functions. Let k ∈ ℕ be odd. For a tuple x = (x_1, ..., x_k) ∈ ℝ^k we denote by median x the median of the entries of x. The following theorem immediately leads to the result on CS with k ≥ 3 hash functions claimed in Table 6.4.1.

Theorem 6.4.13. Let k ≥ 3 be odd, n ≥ kB, and let h_1, ..., h_k : [n] → [B] and s_1, ..., s_k : [n] → {−1, 1} be truly random hash functions. For each i ∈ [n], define the random variable f̃_i = median_{ℓ∈[k]} ( Σ_{j∈[n]} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j ). Assume that k ≤ B.² Then

E[ |f̃_i − s(i) f_i| ] = Ω( 1 / (B √k log k) ), and E[ |f̃_i − s(i) f_i| ] = O( 1 / (B √k) ).

²This very mild assumption can probably be removed at the cost of a more technical proof. In our proof it can even be replaced by k ≤ B^{2−ε} for any ε = Ω(1).

The assumption n ≥ kB simply says that the total number of buckets is upper bounded by the number of items. Again using linearity of expectation for the summation over i ∈ [n] and replacing B by B/k, we obtain the claimed lower bound of Ω( √k / (B log k) ) and upper bound of O( √k / B ). We note that even if the bounds above are only tight up to a factor of log k, they still imply that it is asymptotically optimal to choose k = O(1). To settle the correct asymptotic growth is thus of merely theoretical interest. We need the following claim to prove the theorem.

Claim 6.4.14. Let I ⊂ ℝ be the closed interval centered at the origin of length 2t, i.e., I = [−t, t]. Suppose that 1/(√k B) ≤ t ≤ 1/(2B). For ℓ ∈ [k], Pr[X^{(ℓ)} ∈ I] = Ω(tB).

Proof: We first show that with probability Ω(1), X_2^{(ℓ)} lies in the interval [1/B, γ/B] for some constant γ. To see this we note that by Lemma 6.4.10, E[ |X_2^{(ℓ)}| ] = O(1/B), so it follows by Markov's inequality that if γ = O(1) is large enough, the probability that |X_2^{(ℓ)}| ≥ γ/B is at most 1/200. For a constant probability lower bound on |X_2^{(ℓ)}| we write

X_2^{(ℓ)} = Σ_{j ∈ N_2 ∩ {B+1,...,2B}} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j + Σ_{j ∈ N_2 ∩ {2B+1,...,n}} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j := S_1 + S_2.

Condition on S_2. If B ≥ 4, the probability that there exist exactly two j ∈ N_2 ∩ {B+1, ..., 2B} with h_ℓ(j) = h_ℓ(i) is at least

( |N_2 ∩ {B+1,...,2B}| choose 2 ) · (1/B²) · (1 − 1/B)^{B−2} ≥ (B−1 choose 2) · 1/(eB²) ≥ 1/(8e).

With probability 1/4 the corresponding signs s_ℓ(j) are both the same as that of S_2. By independence of s_ℓ and h_ℓ, the probability that this occurs is at least 1/(32e), and if it does, |X_2^{(ℓ)}| ≥ 1/B. Combining these two bounds it follows that |X_2^{(ℓ)}| ∈ [1/B, γ/B] with probability at least 1/(32e) − 1/200 ≥ 1/200. By symmetry, Pr[ X_2^{(ℓ)} ∈ [1/B, γ/B] ] ≥ 1/400 = Ω(1). Denote this event by E. Also let F be the event that |{j ∈ N_1 : h_ℓ(j) = h_ℓ(i)}| = 1. Then Pr[F] = Ω(1), and as E and F are independent, Pr[E ∩ F] = Ω(1). Conditioned on E ∩ F we now lower bound the probability that X^{(ℓ)} ∈ I. For this it suffices to fix X_2^{(ℓ)} = η/B for some 1 ≤ η ≤ γ and lower bound the probability that η/B + σ f ∈ [−t, t], where σ is a Rademacher and f ∈ {1/j : j ∈ N_1} is chosen uniformly at random. Let m_1, m_2 ∈ ℝ_{>0} be such that

η/B − 1/m_1 = −t and η/B − 1/m_2 = t.

Then m_2 − m_1 = 2tB² / (η² − t²B²) = Ω(tB²). Using that B is larger than a big enough constant and the mild assumption k ≤ B, we have that m_2 − m_1 = Ω(B/√k) = Ω(√B) ≥ 1, and so ⌊m_2 − m_1⌋ = Ω(m_2 − m_1) = Ω(tB²) as well. As we have |N_1| ≥ B − 1 options for f, it follows that

Pr[ η/B + σ f ∈ [−t, t] ] ≥ ⌊m_2 − m_1⌋ / (B − 1) = Ω(tB),

as desired. □

Proof of Theorem 6.4.13. If B (and hence k) is a constant, the result follows easily from Lemma 6.4.10, so in what follows we assume that B is larger than a sufficiently large constant.

We first prove the upper bound. Define N_1 = [B] ∖ {i} and N_2 = [n] ∖ ([B] ∪ {i}). For ℓ ∈ [k], let X_1^{(ℓ)} = Σ_{j∈N_1} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j and X_2^{(ℓ)} = Σ_{j∈N_2} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j. Finally write X^{(ℓ)} = X_1^{(ℓ)} + X_2^{(ℓ)}. As the absolute error in Count-Sketch with one pair of hash functions (h, s) is always upper bounded by the corresponding error in Count-Min with the single hash function h, we can use the bound in the proof of Lemma 6.4.2 to conclude that

Pr[ |X_1^{(ℓ)}| ≥ t ] = O( log(tB) / (tB) ),

when t ≥ 3/B. Also,

Var[X_2^{(ℓ)}] = Σ_{j∈N_2} f_j² (1/B − 1/B²) ≤ 2/B²,

so by Bennett's inequality (with M = 1/B and σ² = 2/B²) and Remark 6.4.4,

Pr[ |X_2^{(ℓ)}| ≥ t ] ≤ 2 exp( −2 h(tB/2) ) ≤ 2 exp( −(tB/2) log(tB/2 + 1) ) = O( log(tB) / (tB) ),

for t ≥ 3/B. It follows that for t ≥ 3/B,

Pr[ |X^{(ℓ)}| ≥ 2t ] ≤ Pr[ |X_1^{(ℓ)}| ≥ t ] + Pr[ |X_2^{(ℓ)}| ≥ t ] = O( log(tB) / (tB) ).

Let C be the implicit constant in the O-notation above. If |f̃_i − s(i) f_i| ≥ 2t, at least half of the values (|X^{(ℓ)}|)_{ℓ∈[k]} are at least 2t. For t ≥ 3/B it thus follows by a union bound that

Pr[ |f̃_i − s(i) f_i| ≥ 2t ] ≤ 2 (k choose ⌈k/2⌉) ( C log(tB) / (tB) )^{⌈k/2⌉} ≤ 2 ( 4C log(tB) / (tB) )^{⌈k/2⌉}.    (6.4.3)

If α = O(1) is chosen sufficiently large it thus holds that

∫_{α/B}^∞ Pr[ |f̃_i − s(i) f_i| ≥ t ] dt = 2 ∫_{α/(2B)}^∞ Pr[ |f̃_i − s(i) f_i| ≥ 2t ] dt ≤ (4/B) ∫_{α/2}^∞ ( 4C log t / t )^{⌈k/2⌉} dt ≤ 1/(B 2^k) ≤ 1/(B √k).

Here the first inequality uses (6.4.3) and a change of variable. The second inequality uses that (4C log t / t)^{⌈k/2⌉} ≤ (C′/t)^{2k/5} for some constant C′, followed by a calculation of the integral. For our upper bound it therefore suffices to show that ∫_0^{α/B} Pr[ |f̃_i − s(i) f_i| ≥ t ] dt = O( 1/(B√k) ).

For this, let 1/(√k B) ≤ t ≤ 1/B be fixed. If |f̃_i − s(i) f_i| ≥ t, at least half of the values (X^{(ℓ)})_{ℓ∈[k]} are at least t or at most −t. Let us focus on bounding the probability that at least half are at least t, the other bound being symmetric and giving an extra factor of 2 in the probability bound. By symmetry and Claim 6.4.14, Pr[X^{(ℓ)} ≥ t] = 1/2 − Ω(tB). For ℓ ∈ [k] we define Y_ℓ = [X^{(ℓ)} ≥ t], and we put S = Σ_{ℓ∈[k]} Y_ℓ. Then E[S] = k (1/2 − Ω(tB)). If at least half of the values (X^{(ℓ)})_{ℓ∈[k]} are at least t then S ≥ k/2. By Hoeffding's inequality we can bound the probability of this event by

Pr[S ≥ k/2] = Pr[S − E[S] = Ω(ktB)] ≤ exp( −Ω(k t² B²) ).

It follows that Pr[ |f̃_i − s(i) f_i| ≥ t ] ≤ 2 exp( −Ω(k t² B²) ). Thus

∫_0^{α/B} Pr[ |f̃_i − s(i) f_i| ≥ t ] dt ≤ 1/(B√k) + ∫_{1/(B√k)}^{1/(2B)} 2 exp( −Ω(k t² B²) ) dt + ∫_{1/(2B)}^{α/B} 2 exp( −Ω(k) ) dt
= O( 1/(B√k) ) + (1/(B√k)) ∫_1^∞ exp(−t²) dt = O( 1/(B√k) ).

This completes the proof of the upper bound, and we proceed with the lower bound. Fix ℓ ∈ [k] and let M_1 = [B log k] ∖ {i} and M_2 = [n] ∖ ([B log k] ∪ {i}). Write

S := Σ_{j∈M_1} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j + Σ_{j∈M_2} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j := S_1 + S_2.

We also define J := {j ∈ M_1 : h_ℓ(j) = h_ℓ(i)}. Let I ⊆ ℝ be the closed interval around s_ℓ(i) f_i of length 1/(B√k log k). We now upper bound the probability that S ∈ I conditioned on the value of S_2. To ease notation, the conditioning on S_2 is left out in what follows. Note first that

Pr[S ∈ I] = Σ_{r=0}^{|M_1|} Pr[S ∈ I | |J| = r] · Pr[|J| = r].

For a given r ≥ 1 we now proceed to bound Pr[S ∈ I | |J| = r]. This probability is the same as the probability that S_2 + Σ_{j∈R} σ_j f_j ∈ I, where R ⊆ M_1 is a uniformly random r-subset and the σ_j's are independent Rademachers. Suppose that we sample the elements from R as well as the corresponding signs (σ_j)_{j∈R} sequentially, and let us condition on the values and signs of the first r − 1 sampled elements. At this point at most (B log k)/√k + 1 possible samples for the last element in R bring S into I. Indeed, the distance between consecutive points in {f_j : j ∈ M_1} is at least 1/(B log k)², so at most

(1/(B√k log k)) · (B log k)² + 1 = (B log k)/√k + 1

samples bring S into I. For 1 ≤ r ≤ (B log k)/2 we can thus upper bound

Pr[S ∈ I | |J| = r] ≤ ( (B log k)/√k + 1 ) / ( |M_1| − r + 1 ) ≤ 2/√k + 2/(B log k) ≤ 3/√k.

By a standard Chernoff bound,

Pr[ |J| ≥ (B log k)/2 ] = exp( −Ω(B log k) ) = k^{−Ω(B)}.

If we assume that B is larger than a constant, then Pr[ |J| ≥ (B log k)/2 ] ≤ k^{−1}. Finally, Pr[|J| = 0] = (1 − 1/B)^{B log k} ≤ k^{−1}. Combining these three bounds,

Pr[S ∈ I] ≤ Pr[|J| = 0] + Σ_{r=1}^{(B log k)/2} Pr[S ∈ I | |J| = r] · Pr[|J| = r] + Σ_{r=(B log k)/2}^{|M_1|} Pr[|J| = r] = O(1/√k),

which holds even after removing the conditioning on S_2. We now show that with probability Ω(1) at least half the values (X^{(ℓ)})_{ℓ∈[k]} are at least 1/(2B√k log k). Let p_0 be the probability that X^{(ℓ)} ≥ 1/(2B√k log k). This probability does not depend on ℓ ∈ [k], and by symmetry and what we showed above, p_0 = 1/2 − O(1/√k). Define the function f : {0, ..., k} → ℝ by

f(t) = (k choose t) p_0^t (1 − p_0)^{k−t}.

Then f(t) is the probability that exactly t of the values (X^{(ℓ)})_{ℓ∈[k]} are at least 1/(2B√k log k). Using that p_0 = 1/2 − O(1/√k), a simple application of Stirling's formula gives that f(t) = Θ(1/√k) for t = ⌈k/2⌉, ..., ⌈k/2⌉ + √k when k is larger than some constant C. It follows that with probability Ω(1) at least half the (X^{(ℓ)})_{ℓ∈[k]} are at least 1/(2B√k log k), and in particular

E[ |f̃_i − f_i| ] = Ω( 1/(B√k log k) ).

Finally we handle the case where k ≤ C. It is easy to check (e.g. with Lemma 6.4.11) that X^{(ℓ)} = Ω(1/B) with probability Ω(1). Thus this happens for all ℓ ∈ [k] with probability Ω(1), and in particular E[ |f̃_i − f_i| ] = Ω(1/B), which is the desired bound for constant k. □

6.4.4. Learned Count-Sketch for Zipfians

We now proceed to analyze the learned Count-Sketch algorithm. First, we estimate the expected error when using a single hash function, and next we show that the expected error only increases when using more hash functions. Recall our assumption on the number of buckets B_h used to store the heavy hitters: B_h = Θ(B − B_h) = Θ(B).

One hash function. By taking B_1 = B_h = Θ(B) and B_2 = B − B_h = Θ(B) in the theorem below, the result on L-CS for k = 1 claimed in Table 6.4.1 follows immediately.

Theorem 6.4.15. Let h : [n] ∖ [B_1] → [B_2] and s : [n] → {−1, 1} be truly random hash functions, where n, B_1, B_2 ∈ ℕ and³ n − B_1 ≥ B_2 ≥ B_1. Define the random variable f̃_i = Σ_{j=B_1+1}^{n} [h(j) = h(i)] s(j) f_j for i ∈ [n] ∖ [B_1]. Then

E[ |f̃_i − s(i) f_i| ] = Θ( log((B_2 + B_1)/B_1) / B_2 ).

Proof: Let N_1 = [B_1 + B_2] ∖ ([B_1] ∪ {i}) and N_2 = [n] ∖ ([B_1 + B_2] ∪ {i}). Let X_1 = Σ_{j∈N_1} [h(j) = h(i)] s(j) f_j and X_2 = Σ_{j∈N_2} [h(j) = h(i)] s(j) f_j. By the triangle inequality and linearity of expectation,

E[ |X_1| ] = O( log((B_2 + B_1)/B_1) / B_2 ).

Moreover, it follows directly from Lemma 6.4.10 that E[ |X_2| ] = O(1/B_2). Thus

E[ |f̃_i − s(i) f_i| ] ≤ E[ |X_1| ] + E[ |X_2| ] = O( log((B_2 + B_1)/B_1) / B_2 ),

as desired.

3The first inequality is the standard assumption that we have at least as many items as buckets. The second inequality says that we use at least as many buckets for non-heavy items as for heavy items (which doesn’t change the asymptotic space usage).

For the lower bound on E[ |f̃_i − s(i) f_i| ] we apply Lemma 6.4.11 to obtain that

E[ |f̃_i − s(i) f_i| ] ≥ (1/(2B_2)) (1 − 1/B_2)^{|N_1|−1} Σ_{i∈N_1} f_i = Ω( log((B_2 + B_1)/B_1) / B_2 ). □

Corollary 6.4.16. Let h : [n] ∖ [B_h] → [B − B_h] and s : [n] → {−1, 1} be truly random hash functions, where n, B ∈ ℕ, B_h = Θ(B) and B_h ≤ B/2. Define the random variable f̃_i = Σ_{j=B_h+1}^{n} [h(j) = h(i)] s(j) f_j for i ∈ [n] ∖ [B_h]. Then

E[ |f̃_i − s(i) f_i| ] = Θ(1/B).

More hash functions. We now show that, like for Count-Min, using more hash functions does not decrease the expected error. We first state the Littlewood-Offord lemma as strengthened by Erdős.

Theorem 6.4.17 (Littlewood-Offord [126], Erdős [69]). Let a_1, ..., a_n ∈ ℝ with |a_i| ≥ 1 for i ∈ [n]. Let further σ_1, ..., σ_n ∈ {−1, 1} be random variables with Pr[σ_i = 1] = Pr[σ_i = −1] = 1/2, and define S = Σ_{i=1}^{n} σ_i a_i. For any v ∈ ℝ it holds that

Pr[ |S − v| ≤ 1 ] ≤ (n choose ⌊n/2⌋) / 2^n = O(1/√n).

Setting B_1 = B_h = Θ(B) and B_2 = B − B_h = Θ(B) in the theorem below gives the final bound from Table 6.4.1 on L-CS with k ≥ 3.

Theorem 6.4.18. Let n ≥ B_1 + B_2 ≥ 2B_1, let k ≥ 3 be odd, and let h_1, ..., h_k : [n] ∖ [B_1] → [B_2/k] and s_1, ..., s_k : [n] ∖ [B_1] → {−1, 1} be independent and truly random. Define the random variable f̃_i = median_{ℓ∈[k]} ( Σ_{j∈[n]∖[B_1]} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j ) for i ∈ [n] ∖ [B_1]. Then

E[ |f̃_i − s(i) f_i| ] = Ω(1/B_2).

Proof: Like in the proof of the lower bound of Theorem 6.4.13, it suffices to show that for each i the probability that the sum S_ℓ := Σ_{j∈[n]∖([B_1]∪{i})} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j lies in the interval I = [−1/(2B_2), 1/(2B_2)] is O(1/√k). Then at least half the (S_ℓ)_{ℓ∈[k]} are at least 1/(2B_2) with probability Ω(1) by an application of Stirling's formula, and it follows that E[ |f̃_i − s(i) f_i| ] = Ω(1/B_2).

Let ℓ ∈ [k] be fixed, N_1 = [2B_2] ∖ ([B_2] ∪ {i}), and N_2 = [n] ∖ (N_1 ∪ {i}), and write

S_ℓ = Σ_{j∈N_1} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j + Σ_{j∈N_2} [h_ℓ(j) = h_ℓ(i)] s_ℓ(j) f_j := X_1 + X_2.

Now condition on the value X_2 = x_2 of X_2. Letting J = {j ∈ N_1 : h_ℓ(j) = h_ℓ(i)}, it follows by Theorem 6.4.17 that

Pr[S_ℓ ∈ I | X_2 = x_2] = Σ_{J′ ⊆ N_1} O( Pr[J = J′] / √(|J′| + 1) ) = O( Pr[|J| < k/2] + 1/√k ).

An application of Chebyshev's inequality gives that Pr[|J| < k/2] = O(1/k), so Pr[S_ℓ ∈ I] = O(1/√k). Since this bound holds for any possible value of x_2, we may remove the conditioning and the desired result follows. □

Remark 6.4.19. The bound above is probably only tight for B_1 = Θ(B_2). Indeed, we know that it cannot be tight for all B_1 ≤ B_2, since when B_1 becomes very small, the bound from the standard Count-Sketch with k ≥ 3 takes over, and this is certainly worse than the bound in the theorem. It is an interesting open problem (that requires a better anti-concentration inequality than the Littlewood-Offord lemma) to settle the correct bound when B_1 ≪ B_2.

6.5. Experiments

Baselines. We compare our learning-based algorithms with their non-learning counterparts. Specifically, we augment Count-Min with a learned oracle as in Algorithm 26, and call the learning-augmented algorithm "Learned Count-Min". We then compare Learned Count-Min with traditional Count-Min. We also compare it with "Learned Count-Min with Ideal Oracle", where the neural-network oracle is replaced with an ideal oracle that knows the identities of the heavy hitters in the test data, and "Count-Min with Lookup Table", where the heavy hitter oracle is replaced with a lookup table that memorizes heavy hitters in the training set. The comparison with the latter baseline allows us to show the ability of Learned Count-Min to generalize and detect heavy items unseen in the training set. We repeat the evaluation where we replace Count-Min (CM) with Count-Sketch (CS) and the corresponding variants.

Figure 6.5.1: Frequency of Internet flows (log-log plot of flow frequency against sorted items).

Training a Heavy Hitter Oracle. We construct the heavy hitter oracle by training a neural network to predict the heaviness of an item. Note that the prediction of the network is not the final estimation. It is used in Algorithm 26 to decide whether to assign an item to a unique bucket. We train the network to predict the item counts (or the log of the counts) and minimize the squared loss of the prediction. Empirically, we found that when the counts of heavy items are a few orders of magnitude larger than the average counts (as is the case for the Internet traffic data set), predicting the log of the counts leads to more stable training and better results. Once the model is trained, we can select the optimal cutoff threshold using validation data, and use the model as the oracle described in Algorithm 26.
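A schematic PyTorch training loop for such an oracle is sketched below; the model, the featurization provided by loader, and all hyperparameters are placeholders rather than the exact setup used in the experiments.

import torch
import torch.nn as nn

def train_oracle(model, loader, epochs=10, lr=1e-3, predict_log=True):
    # Regress item counts (or log-counts), minimizing squared loss.
    # `loader` is assumed to yield (features, count) batches.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            target = torch.log1p(y) if predict_log else y   # log1p used for numerical safety
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), target)
            loss.backward()
            opt.step()
    return model

# After training, a cutoff on the predicted (log-)count, chosen on validation
# data, turns the regressor into the binary oracle HH(i) used in Algorithm 26.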

6.5.1. Internet Traffic Estimation

For our first experiment, the goal is to estimate the number of packets for each network flow. A flow is a sequence of packets between two machines on the Internet. It is identified by the IP addresses of its source and destination and the application ports. Estimating the size of each flow i, i.e., the number of its packets f_i, is a basic task in network management [161].

Dataset. The traffic data is collected at a backbone link of a Tier 1 ISP between Chicago and Seattle in 2016 [39]. Each recording session is around one hour. Within each minute, there are around 30 million packets and 1 million unique flows. For a recording session, we use the first 7 minutes for training, the following minute for validation, and estimate the packet counts in subsequent minutes. The distribution of packet counts over Internet flows is heavy tailed, as shown in Figure 6.5.1.

Model. The patterns of the Internet traffic are very dynamic, i.e., the flows with heavy traffic change frequently from one minute to the next. However, we hypothesize that the space of IP addresses should be smooth in terms of traffic load. For example, data centers at large companies and university campuses with many students tend to generate heavy traffic. Thus, though the individual flows from these sites change frequently, we could still discover regions of IP addresses with heavy traffic through a learning approach. We trained a neural network to predict the log of the packet counts for each flow. The model takes as input the IP addresses and ports in each packet. We use two RNNs to encode the source and destination IP addresses separately. The RNN takes one bit of the IP address at each step, starting from the most significant bit. We use the final states of the RNN as the feature vector for an IP address. The reason to use an RNN is that the patterns in the bits are hierarchical, i.e., the more significant bits govern larger regions in the IP space. Additionally, we use two-layer fully-connected networks to encode the source and destination ports. We then concatenate the encoded IP vectors, encoded port vectors, and the protocol type as the final features to predict the packet counts.⁴ The inference time is 2.8 microseconds per item on a single GPU without any optimizations.⁵

⁴We use RNNs with 64 hidden units. The two-layer fully-connected networks for the ports have 16 and 8 hidden units. The final layer before the prediction has 32 hidden units.
⁵Note that new specialized hardware such as Google TPUs, hardware accelerators and network compression [93, 163, 47, 94, 95] can drastically improve the inference time. Further, Nvidia has predicted that GPUs will get 1000x faster by 2025. Because of these trends, the overhead of neural network inference is expected to be less significant in the future [120].
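The following PyTorch sketch shows one way the described architecture could be wired together. The hidden sizes follow footnote 4, but the exact input encodings (32 IP bits, 16-bit port vectors, a scalar protocol feature) and layer choices are assumptions, not the precise model used in the experiments.

import torch
import torch.nn as nn

class FlowModel(nn.Module):
    # Predicts the log packet count of a flow from its header fields.
    def __init__(self, ip_hidden=64, port_hidden=(16, 8), head_hidden=32, proto_dim=1):
        super().__init__()
        self.src_rnn = nn.RNN(input_size=1, hidden_size=ip_hidden, batch_first=True)
        self.dst_rnn = nn.RNN(input_size=1, hidden_size=ip_hidden, batch_first=True)
        def port_encoder():                       # two-layer encoder per port
            return nn.Sequential(nn.Linear(16, port_hidden[0]), nn.ReLU(),
                                 nn.Linear(port_hidden[0], port_hidden[1]), nn.ReLU())
        self.src_port = port_encoder()
        self.dst_port = port_encoder()
        self.head = nn.Sequential(
            nn.Linear(2 * ip_hidden + 2 * port_hidden[1] + proto_dim, head_hidden),
            nn.ReLU(), nn.Linear(head_hidden, 1))

    def forward(self, src_bits, dst_bits, src_port_bits, dst_port_bits, proto):
        # src_bits, dst_bits: (batch, 32, 1), most significant bit first
        _, hs = self.src_rnn(src_bits)
        _, hd = self.dst_rnn(dst_bits)
        feats = torch.cat([hs[-1], hd[-1],
                           self.src_port(src_port_bits),
                           self.dst_port(dst_port_bits), proto], dim=-1)
        return self.head(feats).squeeze(-1)       # predicted log packet count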

Results. We plot the results of two representative test minutes (the 20th and 50th) in Figure 6.5.2. All plots in the figure refer to the estimation error (6.2.2) as a function of the used space. The space includes space for storing the buckets and the model. Since we use the same model for all test minutes, the model space is amortized over the 50-minute testing period.

Figure 6.5.2: Comparison of our algorithms with Count-Min and Count-Sketch on Internet traffic data. (a) Learned Count-Min, 20th test minute; (b) Learned Count-Min, 50th test minute; (c) Learned Count-Sketch, 20th test minute; (d) Learned Count-Sketch, 50th test minute. Each panel plots the average error per item (frequency) against space (MB) for the standard sketch, the lookup-table variant, the learned variant (NNet), and the learned variant with an ideal oracle.

Figure 6.5.2 reveals multiple findings. First, the figure shows that our learning-based algorithms exhibit a better performance than their non-learning counterparts. Specifically, Learned Count-Min, compared to Count-Min, reduces the error by 32% with space of 0.5 MB and 42% with space of 1.0 MB (Figure 6.5.2a). Learned Count-Sketch, compared to Count-Sketch, reduces the error by 52% at 0.5 MB and 57% at 1.0 MB (Figure 6.5.2c). In our experiments, each regular bucket takes 4 bytes. For the learned versions, we account for the extra space needed for the unique buckets to store the item IDs and the counts. One unique bucket takes 8 bytes, twice the space of a normal bucket.⁶

Second, the figure also shows that our neural-network oracle performs better than memorizing the heavy hitters in a lookup table. This is likely due to the dynamic nature of Internet traffic, i.e., the heavy flows in the training set are significantly different from those in the test data. Hence, memorization does not work well. On the other hand, our model is able to extract structures in the input that generalize to unseen test data.

Third, the figure shows that our model's performance stays roughly the same from the 20th to the 50th minute (Figure 6.5.2b and Figure 6.5.2d), showing that it learns properties of the heavy items that generalize over time.

Lastly, although we achieve significant improvement over Count-Min and Count-Sketch, our scheme can potentially achieve even better results with an ideal oracle, as shown by the dashed green line in Figure 6.5.2. This indicates potential gains from further optimizing the neural network model.

⁶By using hashing with open addressing, it suffices to store IDs hashed into log B_r + t bits (instead of whole IDs) to ensure there is no collision with probability 1 − 2^{−t}. log B_r + t is comparable to the number of bits per counter, so the space for a unique bucket is twice the space of a normal bucket.

Figure 6.5.3: Frequency of search queries (log-log plot of query frequency against sorted queries).

6.5.2. Search Query Estimation

For our second experiment, the goal is to estimate the number of times a search query appears.

Dataset. We use the AOL query log dataset, which consists of 21 million search queries collected from 650 thousand users over 90 days. The users are anonymized in the dataset. There are 3.8 million unique queries. Each query is a search phrase with multiple words (e.g., “periodic table element poster”). We use the first 5 days for training, the following day for validation, and estimate the number of times different search queries appear in subsequent days. The distribution of search query frequency follows the Zipfian law, as shown in Figure 6.5.3.

Model. Unlike traffic data, popular search queries tend to appear more consistently across multiple days. For example, "google" is the most popular search phrase on most days in the dataset. Simply storing the most popular words can easily yield a reasonable heavy hitter predictor. However, beyond remembering the popular words, other factors also contribute to the popularity of a search phrase that we can learn. For example, popular search phrases appearing in slightly different forms may be related to similar topics. Though not included in the AOL dataset, in general, metadata of a search query (e.g., the location of the search) can provide useful context for its popularity.

To construct the heavy hitter oracle, we trained a neural network to predict the number of times a search phrase appears. To process the search phrase, we train an RNN with LSTM cells that takes the characters of a search phrase as input. The final states encoded by the RNN are fed to a fully-connected layer to predict the query frequency. Our character vocabulary includes lower-case English alphabets, numbers, punctuation marks, and a token for unknown characters. We map the character IDs to embedding vectors before feeding them to the RNN.⁷ We choose an RNN due to its effectiveness in processing sequence data [162, 83, 120].

Figure 6.5.4: Comparison of our algorithms with Count-Min and Count-Sketch on search query data. (a) Learned Count-Min, 50th test day; (b) Learned Count-Min, 80th test day; (c) Learned Count-Sketch, 50th test day; (d) Learned Count-Sketch, 80th test day. Each panel plots the average error per item (frequency) against space (MB).

7We use an embedding size of 64 dimensions, an RNN with 256 hidden units, and a fully-connected layer with 32 hidden units.
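A rough PyTorch sketch of the described character-level model follows; the widths come from footnote 7, while the vocabulary size and other details are assumptions.

import torch
import torch.nn as nn

class QueryModel(nn.Module):
    # Character-level LSTM that predicts how often a search phrase appears.
    def __init__(self, vocab_size=64, emb=64, hidden=256, fc=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, fc), nn.ReLU(), nn.Linear(fc, 1))

    def forward(self, char_ids):                  # char_ids: (batch, seq_len) integer IDs
        x = self.embed(char_ids)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)       # predicted query frequency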

Results. We plot the estimation error vs. space for two representative test days (the 50th and 80th day) in Figure 6.5.4. As before, the space includes both the bucket space and the space used by the model. The model space is amortized over the test days since the same model is used for all days. Similarly, our learned sketches outperform their conventional counterparts. Learned Count-Min, compared to Count-Min, reduces the loss by 18% at 0.5 MB and 52% at 1.0 MB (Figure 6.5.4a). Learned Count-Sketch, compared to Count-Sketch, reduces the loss by 24% at 0.5 MB and 71% at 1.0 MB (Figure 6.5.4c). Further, our algorithm performs similarly on the 50th and the 80th day (Figure 6.5.4b and Figure 6.5.4d), showing that the properties it learns generalize over a long period.

The figures also show an interesting difference from the Internet traffic data: memorizing the heavy hitters in a lookup table is quite effective in the low space region. This is likely because the search queries are less dynamic compared to Internet traffic (i.e., top queries in the training set are also popular on later days). However, as the algorithm is allowed more space, memorization becomes ineffective.

Figure 6.5.5: ROC curves of the learned models. (a) Internet traffic, 50th test minute (AUC = 0.900); (b) Search query, 50th test day (AUC = 0.808).

6.5.3. Analyzing Heavy Hitter Models

We analyze the accuracy of the neural network heavy hitter models to better understand the results on the two datasets. Specifically, we use the models to predict whether an item is a heavy hitter (top 1% in counts) or not, and plot the ROC curves in Figure 6.5.5. The figures show that the model for the Internet traffic data has learned to predict heavy items more effectively, with an AUC score of 0.9. As for the model for search query data, the AUC score is 0.8. This also explains why we see larger improvements over non-learning algorithms in Figure 6.5.2.

Chapter 7

The Low-Rank Approximation Problem

7.1. Introduction

In the low-rank decomposition problem, given an n × d matrix A and a parameter k, the goal is to compute a rank-k matrix [A]_k defined as

[A]_k = argmin_{A' : rank(A') ≤ k} ‖A − A'‖_F.

Low-rank approximation is one of the most widely used tools in massive data analysis, machine learning and statistics, and has been a subject of many algorithmic studies. In particular, multiple algorithms developed over the last decade use the "sketching" approach, see e.g., [157, 171, 92, 52, 53, 144, 132, 34, 54]. Its idea is to use efficiently computable random projections (a.k.a., "sketches") to reduce the problem size before performing the low-rank decomposition, which makes the computation more space and time efficient. For example, [157, 52] show that if S is a random matrix of size m × n chosen from an appropriate distribution¹, for m depending on ε, then one can recover a rank-k matrix A' such that

‖A − A'‖_F ≤ (1 + ε) ‖A − [A]_k‖_F

1Initial algorithms used matrices with independent sub-gaussian entries or randomized Fourier/Hadamard matrices [157, 171, 92]. Starting from the seminal work of [53], researchers began to explore sparse binary matrices, see e.g., [144, 132]. In this chapter we mostly focus on the latter distribution.

by performing an SVD on SA ∈ ℝ^{m×d} followed by some post-processing. Typically the sketch length m is small, so the matrix SA can be stored using little space (in the context of streaming algorithms) or efficiently communicated (in the context of distributed algorithms).

Furthermore, the SVD of SA can be computed efficiently, especially after another round of sketching, reducing the overall computation time. See the survey [170] for an overview of these developments. In light of recent developments on learning-based algorithms [168, 59, 120, 24, 128, 151, 137, 141, 25, 33, 134, 96, 103, 1, 118], it is natural to ask whether similar improvements in performance could be obtained for other sketch-based algorithms, notably for low-rank decompositions. In particular, reducing the sketch length m while preserving its accuracy would make sketch-based algorithms more efficient. Alternatively, one could make sketches more accurate for the same values of m. This is the problem we address in this chapter.

7.1.1. Our Results

Our main finding is that learned sketch matrices can indeed yield (much) more accurate low-rank decompositions than purely random matrices. We focus our study on a streaming algorithm for low-rank decomposition due to [157, 52], described in more detail in Section 7.2.

Specifically, suppose we have a training set of matrices Tr = {A_1, ..., A_N} sampled from some distribution 𝒟. Based on this training set, we compute a matrix S* that (locally) minimizes the empirical loss

Σ_i ‖A_i − SCW(S*, A_i)‖_F    (7.1.1)

where SCW(S*, A_i) denotes the output of the aforementioned Sarlos-Clarkson-Woodruff streaming low-rank decomposition algorithm on matrix A_i using the sketch matrix S*. Once the sketch matrix S* is computed, it can be used instead of a random sketch matrix in all future executions of the SCW algorithm. We demonstrate empirically that, for multiple types of data sets, an optimized sketch matrix S* can substantially reduce the approximation loss compared to a random matrix S, sometimes by one order of magnitude (see Figure 7.5.1 or 7.5.2). Equivalently, the optimized sketch matrix can achieve the same approximation loss for lower values of m.

A possible disadvantage of learned sketch matrices is that an algorithm that uses them no longer offers worst-case guarantees. As a result, if such an algorithm is applied to an input matrix that does not conform to the training distribution, the results might be worse than if random matrices were used. To alleviate this issue, we also study mixed sketch matrices, where (say) half of the rows are trained and the other half are random. We observe that if such matrices are used in conjunction with the SCW algorithm, its results are no worse than if only the random part of the matrix was used (Theorem 7.4.1 in Section 7.4)². Thus, the resulting algorithm inherits the worst-case performance guarantees of the random part of the sketching matrix. At the same time, we show that mixed matrices still substantially reduce the approximation loss compared to random ones, in some cases nearly matching the performance of "pure" learned matrices with the same number of rows. Thus, mixed random matrices offer "the best of both worlds": improved performance for matrices from the training distribution, and worst-case guarantees otherwise.

Finally, in order to understand the theoretical aspects of our approach further, we study the special case of m = 1. This corresponds to the case where the sketch matrix S is just a single vector. Our results are two-fold:

∙ We give an approximation algorithm for minimizing the empirical loss as in Equation 7.1.1, with an approximation factor depending on the stable rank of matrices in the training set. See Appendix 7.7.

∙ Under certain assumptions about the robustness of the loss minimizer, we show generalization bounds for the solution computed over the training set. See Appendix 7.8.

7.1.2. Related work

As outlined in the introduction, over the last few years there have been multiple papers exploring the use of machine learning methods to improve the performance of "standard" algorithms. Among those, the closest to the topic of this chapter are the works on learning-based compressive sensing, such as [141, 25, 33, 134], and on learning-based streaming algorithms [103]. Since neither of these two lines of research addresses computing matrix spectra, the technical development therein was quite different from ours.

2We note that this property is non-trivial, in the sense that it does not automatically hold for all sketching algorithms. See Section 7.4 for further discussion.

Algorithm 27 Rank-k approximation of a matrix A using a sketch matrix S (refer to Section 4.1.1 of [52])
1: Input: A ∈ ℝ^{n×d}, S ∈ ℝ^{m×n}
2: U, Σ, V^⊤ ← CompactSVD(SA)    ◁ r = rank(SA), U ∈ ℝ^{m×r}, V ∈ ℝ^{d×r}
3: return [AV]_k V^⊤

In this chapter we focus on learning-based optimization of low-rank approximation algorithms that use linear sketches, i.e., that map the input matrix A into SA and perform computation on the latter. There are other sketching algorithms for low-rank approximation that involve non-linear sketches [125, 78, 77]. The benefit of linear sketches is that they are easy to update under linear changes to the matrix A, and (in the context of our work) that they are easy to differentiate, making it possible to compute the gradient of the loss function as in Equation 7.1.1. We do not know whether it is possible to use our learning-based approach for non-linear sketches, but we believe this is an interesting direction for future research.

7.2. Preliminaries

Notation. Consider a distribution 𝒟 on matrices A ∈ ℝ^{n×d}. We define the training set as {A_1, ..., A_N} sampled from 𝒟. For a matrix A, its singular value decomposition (SVD) can be written as A = UΣV^⊤ such that both U ∈ ℝ^{n×n} and V ∈ ℝ^{d×n} have orthonormal columns and Σ = diag{λ_1, ..., λ_d} is a diagonal matrix with nonnegative entries. Moreover, if rank(A) = r, then the first r columns of U are an orthonormal basis for the column space of A (we denote it as colsp(A)), the first r columns of V are an orthonormal basis for the row space of A (we denote it as rowsp(A)), and λ_i = 0 for i > r.³ In many applications it is quicker and more economical to compute the compact SVD, which only contains the rows and columns corresponding to the non-zero singular values of Σ: A = U^c Σ^c (V^c)^⊤, where U^c ∈ ℝ^{n×r}, Σ^c ∈ ℝ^{r×r} and V^c ∈ ℝ^{d×r}.

How sketching works. We start by describing the SCW algorithm for low-rank matrix approximation, see Algorithm 27. The algorithm computes the singular value decomposition of SA = UΣV^⊤, and computes the best rank-k approximation of AV. Finally it outputs [AV]_k V^⊤ as a rank-k approximation of A.

3The remaining columns of 푈 and 푉 respectively are orthonormal bases for the nullspace of 퐴 and 퐴⊤.

Note that if m is much smaller than d and n, the space bound of this algorithm is significantly better than when computing a rank-k approximation for A in the naïve way. Thus, minimizing m automatically reduces the space usage of the algorithm.
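A direct numpy transcription of Algorithm 27 (illustrative, using numpy's SVD in place of CompactSVD) looks as follows.

import numpy as np

def scw_rank_k(A, S, k):
    # Algorithm 27: rank-k approximation of A computed from the sketch SA.
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)   # rows of Vt span rowsp(SA)
    V = Vt.T
    AV = A @ V                                             # project A onto that subspace
    U2, s2, V2t = np.linalg.svd(AV, full_matrices=False)   # best rank-k approx of AV
    AV_k = (U2[:, :k] * s2[:k]) @ V2t[:k, :]
    return AV_k @ Vt                                       # [AV]_k V^T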

Sketching matrix. We use a matrix S that is sparse.⁴ Specifically, each column of S has exactly one non-zero entry, which is either +1 or −1. This means that the fraction of non-zero entries in S is 1/m. Therefore, one can use a vector to represent S, which is very memory efficient. It is worth noting, however, that after multiplying the sketching matrix S with other matrices, the resulting matrix (e.g., SA) is in general not sparse.
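A minimal constructor for such a sparse sketch, stored as one row index and one sign per column, might look as follows (illustrative).

import numpy as np

def make_sparse_sketch(m, n, seed=0):
    # Each of the n columns has a single nonzero entry, +1 or -1, in a
    # uniformly random row; S can therefore be stored as (rows, signs).
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S, rows, signs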

7.3. Training Algorithm

In this section, we describe our learning-based algorithm for computing a data-dependent sketch $S$. The main idea is to use back-propagation to compute the stochastic gradient of $S$ with respect to the rank-$k$ approximation loss in Equation (7.1.1), where the initial value of $S$ is the same random sparse matrix used in SCW. Once we have the stochastic gradient, we can run stochastic gradient descent (SGD) to optimize $S$ in order to improve the loss. Our algorithm maintains the sparse structure of $S$, and only optimizes the values of its $n$ non-zero entries (initially $+1$ or $-1$).

However, the standard SVD implementation (step 2 in Algorithm 27) is not differentiable, which means we cannot obtain the gradient in the straightforward way. To make the SVD step differentiable, we use the fact that the SVD procedure can be represented as $m$ individual top singular value decompositions (see e.g. [10]), and that every top singular value decomposition can be computed using the power method. See Figure 7.3.1 and Algorithm 28. We store the result of the $i$-th iteration in the $i$-th entry of the lists $U, \Sigma, V$, and finally concatenate all entries together to obtain the matrix (or, for $\Sigma$, diagonal) form of $U, \Sigma, V$. This allows gradients to flow easily.

Due to the extremely long computational chain, it is infeasible to write down the explicit form of the loss function or its gradients. However, just like modern deep neural networks compute their gradients, we use the autograd feature in PyTorch to numerically compute the gradient with respect to the sketching matrix $S$.

⁴The original papers [157, 52] used dense matrices, but the work of [53] showed that sparse matrices work as well. We use sparse matrices since they are more efficient to train and to operate on.

Algorithm 28 Differentiable implementation of SVD

1: Input: $A_1 \in \mathbb{R}^{m\times d}$ ($m < d$)
2: $U, \Sigma, V \leftarrow \{\}, \{\}, \{\}$
3: for $i \leftarrow 1 \ldots m$ do
4:   $v_1 \leftarrow$ random initialization in $\mathbb{R}^d$
5:   for $t \leftarrow 1 \ldots T$ do
6:     $v_{t+1} \leftarrow \frac{A_i^\top A_i v_t}{\|A_i^\top A_i v_t\|_2}$    ◁ power method
7:   $V[i] \leftarrow v_{T+1}$
8:   $\Sigma[i] \leftarrow \|A_i V[i]\|_2$
9:   $U[i] \leftarrow \frac{A_i V[i]}{\Sigma[i]}$
10:  $A_{i+1} \leftarrow A_i - \Sigma[i]\, U[i]\, V[i]^\top$
11: return $U, \Sigma, V$

Figure 7.3.1: The $i$-th iteration of the power method.

We emphasize again that our method only optimizes $S$ during the training phase. After $S$ is fully trained, we still call Algorithm 27 for low-rank approximation, which has exactly the same running time as the SCW algorithm, but with better performance.
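As a rough illustrative sketch of this training procedure (our own simplification, not the exact experimental code), the following PyTorch snippet keeps the sparsity pattern of $S$ fixed, recomputes a rank-$k$ approximation with a differentiable power-method SVD inside the loss, and updates only the non-zero values of $S$ by SGD. The helper names differentiable_scw_loss and train_sketch, the Frobenius-norm loss, and all hyperparameters are our own choices.

import torch

def power_method_svd(M, num_components, n_iters=20):
    # Differentiable top singular triplets of M via power iterations and deflation.
    Us, sigmas, Vs = [], [], []
    for _ in range(num_components):
        v = torch.randn(M.shape[1], dtype=M.dtype)
        for _ in range(n_iters):
            w = M.t() @ (M @ v)
            v = w / (w.norm() + 1e-12)
        sigma = (M @ v).norm()
        u = (M @ v) / (sigma + 1e-12)
        Us.append(u); sigmas.append(sigma); Vs.append(v)
        M = M - sigma * torch.outer(u, v)                  # deflate, continue with A_{i+1}
    return torch.stack(Us, 1), torch.stack(sigmas), torch.stack(Vs, 1)

def differentiable_scw_loss(A, values, rows, cols, m, k):
    # ||A - [AV]_k V^T||_F with S rebuilt from its trainable non-zero values.
    A = A.to(values.dtype)
    S = torch.zeros(m, A.shape[0], dtype=values.dtype)
    S[rows, cols] = values                                 # fixed sparsity, learned values
    _, _, V = power_method_svd(S @ A, m)                   # V: d x m, differentiable
    AV = A @ V
    U2, s2, V2 = power_method_svd(AV, k)
    AVk = U2 @ torch.diag(s2) @ V2.t()
    return (AVk @ V.t() - A).norm()

def train_sketch(train_set, m, k, epochs=10, lr=1.0):
    # SGD over the non-zero entries of S, initialized to random +/-1 as in SCW.
    n = train_set[0].shape[0]
    rows, cols = torch.randint(0, m, (n,)), torch.arange(n)
    values = (torch.randint(0, 2, (n,)).float() * 2 - 1).requires_grad_()
    opt = torch.optim.SGD([values], lr=lr)
    for _ in range(epochs):
        for A in train_set:
            opt.zero_grad()
            differentiable_scw_loss(A, values, rows, cols, m, k).backward()
            opt.step()
    return rows, cols, values.detach()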

7.4. Worst Case Bound

In this section, we show that concatenating two sketching matrices $S_1$ and $S_2$ (of sizes $m_1 \times n$ and $m_2 \times n$, respectively) into a single matrix $S_*$ (of size $(m_1+m_2)\times n$) will not increase the approximation loss of the final rank-$k$ solution computed by Algorithm 27, compared to the case in which only one of $S_1$ or $S_2$ is used as the sketching matrix. In the rest of this section, the sketching matrix $S_*$ denotes the concatenation of $S_1$ and $S_2$ as follows:

$$S_{*\,\left((m_1+m_2)\times n\right)} \;=\; \begin{bmatrix} S_{1\,(m_1\times n)} \\ S_{2\,(m_2\times n)} \end{bmatrix}$$

Formally, we prove the following theorem in this section.

Theorem 7.4.1. Let $U_*\Sigma_*V_*^\top$ and $U_1\Sigma_1V_1^\top$ respectively denote the SVD of $S_*A$ and $S_1A$. Then,

$$\|[AV_*]_k V_*^\top - A\|_F \;\le\; \|[AV_1]_k V_1^\top - A\|_F.$$

In particular, the above theorem implies that the output of Algorithm 27 with the sketching matrix $S_*$ is a better rank-$k$ approximation to $A$ than the output of the algorithm with $S_1$. In the rest of this section we prove Theorem 7.4.1.

Fact 7.4.2 (Pythagorean Theorem). If $A$ and $B$ are matrices with the same number of rows and columns, then $AB^\top = 0$ implies $\|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2$.

Claim 7.4.3. Let $V_1$ and $V_2$ be two matrices such that $\mathrm{colsp}(V_1) \subseteq \mathrm{colsp}(V_2)$. Then,

$$\min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_1);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F^2 \;\ge\; \min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_2);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F^2.$$

Before proving the main theorem, we state the following helpful lemma from [52].

Lemma 7.4.4 (Lemma 4.3 in [52]). Suppose that $V$ is a matrix with orthonormal columns. Then, a best rank-$k$ approximation to $A$ in $\mathrm{colsp}(V)$ is given by $[AV]_k V^\top$.

Proof: Note that $AVV^\top$ is the row projection of $A$ onto $\mathrm{colsp}(V)$. Then, for any conforming $Y$,

$$(A - AVV^\top)(AVV^\top - YV^\top)^\top = A(I - VV^\top)V(AV - Y)^\top = A(V - VV^\top V)(AV - Y)^\top = 0,$$

where the last equality follows from the fact that if $V$ has orthonormal columns then $VV^\top V = V$ (e.g., see Lemma 3.5 in [52]). Then, by the Pythagorean Theorem (Fact 7.4.2), we have

$$\|A - YV^\top\|_F^2 = \|A - AVV^\top\|_F^2 + \|AVV^\top - YV^\top\|_F^2. \qquad (7.4.1)$$

Since $V$ has orthonormal columns, $\|x^\top V^\top\| = \|x\|$ for any conforming $x$. Thus, for any $Z$ of rank at most $k$,

$$\|AVV^\top - [AV]_kV^\top\|_F = \|(AV - [AV]_k)V^\top\|_F = \|AV - [AV]_k\|_F \le \|AV - Z\|_F = \|AVV^\top - ZV^\top\|_F. \qquad (7.4.2)$$

Hence,

$$\|A - [AV]_kV^\top\|_F^2 = \|A - AVV^\top\|_F^2 + \|AVV^\top - [AV]_kV^\top\|_F^2 \qquad \text{(by (7.4.1))}$$
$$\le \|A - AVV^\top\|_F^2 + \|AVV^\top - ZV^\top\|_F^2 \qquad \text{(by (7.4.2))}$$
$$= \|A - ZV^\top\|_F^2. \qquad \text{(by (7.4.1) with } Y = Z\text{)}$$

This implies that $[AV]_kV^\top$ is a best rank-$k$ approximation of $A$ in $\mathrm{colsp}(V)$. $\square$

Since the above statement is a transposed version of the lemma from [52], we included the proof for completeness.

Proof (of Theorem 7.4.1): First, we show that $\mathrm{colsp}(V_1) \subseteq \mathrm{colsp}(V_*)$. By the properties of the (compact) SVD, $\mathrm{colsp}(V_1) = \mathrm{rowsp}(S_1A)$ and $\mathrm{colsp}(V_*) = \mathrm{rowsp}(S_*A)$. Since $S_*$ contains all rows of $S_1$,

$$\mathrm{colsp}(V_1) \subseteq \mathrm{colsp}(V_*). \qquad (7.4.3)$$

By Lemma 7.4.4,

$$\|A - [AV_*]_kV_*^\top\|_F = \min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_*);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F$$

$$\|A - [AV_1]_kV_1^\top\|_F = \min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_1);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F$$

Finally, together with (7.4.3) and Claim 7.4.3,

$$\|A - [AV_*]_kV_*^\top\|_F = \min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_*);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F \;\le\; \min_{\substack{\mathrm{rowsp}(X)\subseteq\mathrm{colsp}(V_1);\\ \mathrm{rank}(X)\le k}} \|X - A\|_F = \|A - [AV_1]_kV_1^\top\|_F,$$

which completes the proof. $\square$

Finally, we note that the property of Theorem 7.4.1 is not universal, i.e., it does not hold for all sketching algorithms for low-rank decomposition. For example, an alternative algorithm proposed in [54] proceeds by letting $Z$ be the top $k$ singular vectors of $SA$ and then reports $AZZ^\top$. It is not difficult to see that, by adding extra rows to the sketching matrix $S$, one can skew the output of the algorithm so that it is far from optimal.
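As a quick numerical sanity check of Theorem 7.4.1 (ours, not part of the original analysis), the following self-contained NumPy snippet verifies on random data that stacking a second sketch on top of $S_1$ never increases the error of Algorithm 27:

import numpy as np

def scw(A, S, k):
    # Algorithm 27: SVD of SA, then best rank-k approximation of AV, times V^T.
    V = np.linalg.svd(S @ A, full_matrices=False)[2].T
    AV = A @ V
    U, s, Vt = np.linalg.svd(AV, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k] @ V.T

rng = np.random.default_rng(2)
n, d, k, m1, m2 = 400, 200, 10, 20, 20
A = rng.standard_normal((n, d))

def sparse_sketch(m):
    S = np.zeros((m, n))
    S[rng.integers(0, m, n), np.arange(n)] = rng.choice([-1.0, 1.0], n)
    return S

S1, S2 = sparse_sketch(m1), sparse_sketch(m2)
S_star = np.vstack([S1, S2])                      # the stacked sketch S_* of Theorem 7.4.1

err = lambda S: np.linalg.norm(scw(A, S, k) - A)
assert err(S_star) <= err(S1) + 1e-8              # never worse than S_1 alone
print(err(S1), err(S_star))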

7.5. Experimental Results

The main question considered here is whether, for natural matrix datasets, optimizing the sketch matrix $S$ can improve the performance of the sketching algorithm for the low-rank decomposition problem. To answer this question, we implemented and compared the following methods for computing $S \in \mathbb{R}^{m\times n}$.

∙ Sparse Random. Sketching matrices are generated at random as in [53]. Specifically, we select a random hash function $h : [n] \to [m]$, and for all $i = 1 \ldots n$, $S_{h[i],i}$ is selected to be either $+1$ or $-1$ with equal probability. All other entries in $S$ are set to $0$. Therefore, $S$ has exactly $n$ non-zero entries.

∙ Dense Random. All the $nm$ entries in the sketching matrices are sampled from a Gaussian distribution (we include this method for comparison).

∙ Learned. Using the sparse random matrix as the initialization, we run Algorithm 28 to optimize the sketching matrix using the training set, and return the optimized matrix.

∙ Mixed (J). We first generate two sparse random matrices $S_1, S_2 \in \mathbb{R}^{\frac{m}{2}\times n}$ (assuming $m$ is even), and define $S$ to be their combination. We then run Algorithm 28 to optimize $S$ using the training set, but only $S_1$ will be updated, while $S_2$ is fixed. Therefore, $S$ is a mixture of a learned matrix and a random matrix, and the first matrix is trained jointly with the second one.

∙ Mixed (S). We first compute a learned matrix $S_1 \in \mathbb{R}^{\frac{m}{2}\times n}$ using the training set, and then append another sparse random matrix $S_2$ to get $S \in \mathbb{R}^{m\times n}$. Therefore, $S$ is a mixture of a learned matrix and a random matrix, but the learned matrix is trained separately.

Figure 7.5.1: Test error by datasets and sketching matrices for $k = 10$, $m = 20$ (methods: Learned, Sparse Random, Dense Random; datasets: Logo, Eagle, Friends, Hyper, Tech).

Figure 7.5.2: Test error for Logo (left), Hyper (middle) and Tech (right) when $k = 10$, as a function of the sketch size $m$ (methods: Learned, Sparse Random, Dense Random).

Datasets. We used a variety of datasets to test the performance of our methods:

∙ Videos⁵: Logo, Friends, Eagle. We downloaded three high-resolution videos from YouTube: a logo video, the Friends TV show, and an eagle nest cam. From each video, we collect 500 frames of size $1920 \times 1080 \times 3$ pixels, and use 400 (100) matrices as the training (test) set. We resize each frame into a $5760 \times 1080$ matrix.

∙ Hyper. We use matrices from HS-SOD, a dataset of hyperspectral images from natural scenes [104]. Each matrix has $1024 \times 768$ pixels, and we use 400 (100) matrices as the training (test) set.

∙ Tech. We use matrices from TechTC-300, a dataset for text categorization [60]. Each matrix has 835,422 rows, but on average only 25,389 of the rows contain non-zero entries. On average each matrix has 195 columns. We use 200 (95) matrices as the training (test) set.

⁵They can be downloaded from http://youtu.be/L5HQoFIaT4I, http://youtu.be/xmLZsEfXEgE and http://youtu.be/ufnf_q_3Ofg

Evaluation metric. To evaluate the quality of a sketching matrix $S$, it suffices to evaluate the output of Algorithm 27 using the sketching matrix $S$ on different input matrices $A$. We first define the optimal approximation loss for the test set Te as

$$\mathrm{App}^*_{\mathrm{Te}} \,:=\, \mathbb{E}_{A\sim \mathrm{Te}}\, \|A - [A]_k\|_F.$$

Note that $\mathrm{App}^*_{\mathrm{Te}}$ does not depend on $S$, and in general it is not achievable by any sketch $S$ with $m < d$, because of information loss. Based on the definition of the optimal approximation loss, we define the error of the sketch $S$ for Te as

$$\mathrm{Err}(\mathrm{Te}, S) \,:=\, \mathbb{E}_{A\sim \mathrm{Te}}\, \|A - \mathrm{SCW}(S, A)\|_F - \mathrm{App}^*_{\mathrm{Te}}.$$

In our datasets, some of the matrices have much larger singular values than others. To avoid imbalance in the dataset, we normalize the matrices so that their top singular values are all equal.
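For concreteness, a minimal sketch (ours) of these evaluation quantities; here scw_output is assumed to be a callable wrapping Algorithm 27 with a fixed sketch $S$:

import numpy as np

def best_rank_k(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def app_star(test_set, k):
    # App*_Te: average optimal rank-k approximation loss over the test set
    return np.mean([np.linalg.norm(A - best_rank_k(A, k)) for A in test_set])

def err(test_set, scw_output, k):
    # Err(Te, S) = E_A ||A - SCW(S, A)||_F - App*_Te
    return np.mean([np.linalg.norm(A - scw_output(A)) for A in test_set]) - app_star(test_set, k)

def normalize(A):
    # normalization used above: rescale so the top singular value equals 1
    return A / np.linalg.svd(A, compute_uv=False)[0]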

Figure 7.5.3: Low rank approximation results for Logo video frame: the best rank-10 approximation (left), and rank-10 approximations reported by Algorithm 27 using a sparse learned sketching matrix (middle) and a sparse random sketching matrix (right).

7.5.1. Average test error

We first test all methods on different datasets, with various combinations of $k$ and $m$. See Figure 7.5.1 for the results when $k = 10$, $m = 20$. As we can see, for the video datasets, learned sketching matrices can get $20\times$ better test error than the sparse random or dense random sketching matrices. For the other datasets, learned sketching matrices are still more than $2\times$ better. Similar improvements of the learned sketching matrices over the random sketching matrices can be observed when $k = 10$ and $m = 10, 20, 30, 40, \cdots, 80$; see Figure 7.5.2. We also include the test error results in Table 7.5.1 for the cases $k = 20, 30$. Finally, in Figure 7.5.3, we visualize an example output of the algorithm for the case $k = 10$, $m = 20$ on the Logo dataset.

Table 7.5.1: Test error in various settings.

k, m, Sketch       Logo    Eagle   Friends   Hyper   Tech
10, 10, Learned    0.39    0.31    1.03      1.25    6.70
10, 10, Random     5.22    6.33    11.56     7.90    17.08
10, 20, Learned    0.10    0.18    0.22      0.52    2.95
10, 20, Random     2.09    4.31    4.11      2.92    7.99
20, 20, Learned    0.61    0.66    1.41      1.68    7.79
20, 20, Random     4.18    5.79    9.10      5.71    14.55
20, 40, Learned    0.18    0.41    0.42      0.72    3.09
20, 40, Random     1.19    3.50    2.44      2.23    6.20
30, 30, Learned    0.72    1.06    1.78      1.90    7.14
30, 30, Random     3.11    6.03    6.27      5.23    12.82
30, 60, Learned    0.21    0.61    0.42      0.84    2.78
30, 60, Random     0.82    3.28    1.79      1.88    4.84

Table 7.5.2: Performance of mixed sketches.

k, m, Sketch         Logo    Hyper   Tech
10, 10, Learned      0.39    1.25    6.70
10, 10, Random       5.22    7.90    17.08
10, 20, Learned      0.10    0.52    2.95
10, 20, Mixed (J)    0.20    0.78    3.73
10, 20, Mixed (S)    0.24    0.87    3.69
10, 20, Random       2.09    2.92    7.99
10, 40, Learned      0.04    0.28    1.16
10, 40, Mixed (J)    0.05    0.34    1.31
10, 40, Mixed (S)    0.05    0.34    1.20
10, 40, Random       0.45    1.12    3.28
10, 80, Learned      0.02    0.16    0.31
10, 80, Random       0.09    0.32    0.80

7.5.2. Comparing Random, Learned and Mixed

In Table 7.5.2, we investigate the performance of the mixed sketching matrices by comparing them with random and learned sketching matrices. In all scenarios, mixed sketching matrices yield much better results than random sketching matrices, and sometimes the results are comparable to those of learned sketching matrices. This means that in most cases it suffices to train one half of the sketching matrix to obtain good empirical results, while at the same time, by our Theorem 7.4.1, the remaining random half of the sketch matrix provides a worst-case guarantee.

7.6. The case of 푚 = 1

In this section, we denote the SVD of $A$ as $U^A\Sigma^A(V^A)^\top$, where both $U^A$ and $V^A$ have orthonormal columns and $\Sigma^A = \mathrm{diag}\{\lambda^A_1, \cdots, \lambda^A_d\}$ is a diagonal matrix with nonnegative entries. For simplicity, we assume that for all $A \sim \mathcal{D}$, $1 = \lambda^A_1 \ge \cdots \ge \lambda^A_d$. We use $U^A_i$ to denote the $i$-th column of $U^A$, and similarly for $V^A_i$.

We want to find $[A]_k$, the rank-$k$ approximation of $A$. In general, it is hard to obtain a closed-form expression for the output of Algorithm 27. However, for $m = 1$, such an expression can be calculated. Indeed, if $m = 1$, the sketching matrix becomes a vector $s \in \mathbb{R}^{1\times n}$. Therefore $AV$ in Algorithm 27 has rank at most one, so it suffices to set $k = 1$. Consider a matrix $A \sim \mathcal{D}$ as the input to Algorithm 27. By calculation, $SA = \sum_i \lambda^A_i \langle s, U^A_i\rangle (V^A_i)^\top$, which is a (row) vector. For example, if $s = (U^A_1)^\top$, we obtain $\lambda^A_1 (V^A_1)^\top$. Since $SA$ is a vector, applying SVD to it is equivalent to performing normalization. Therefore,

$$V = \frac{\sum_{i=1}^d \lambda^A_i \langle s, U^A_i\rangle (V^A_i)^\top}{\sqrt{\sum_{i=1}^d (\lambda^A_i)^2 \langle s, U^A_i\rangle^2}}.$$

Ideally, we hope that $V$ is as close to $V^A_1$ as possible, because that means $[AV]_1V^\top$ is close to $\lambda^A_1 U^A_1 (V^A_1)^\top$, which captures the top singular component of $A$, i.e., the optimal solution. More formally,

$$AV = \frac{\sum_{i=1}^d (\lambda^A_i)^2 \langle s, U^A_i\rangle\, U^A_i}{\sqrt{\sum_{i=1}^d (\lambda^A_i)^2 \langle s, U^A_i\rangle^2}}.$$

We want to maximize its norm, which is:

$$\frac{\sum_{i=1}^d (\lambda^A_i)^4 \langle s, U^A_i\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2 \langle s, U^A_i\rangle^2}. \qquad (7.6.1)$$

We note that one can simplify (7.6.1) by considering only the contribution from the top left singular vector $U^A_1$, which corresponds to the maximization of the following expression:

$$\frac{\langle s, U^A_1\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2 \langle s, U^A_i\rangle^2}. \qquad (7.6.2)$$
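The derivation above can be checked numerically; the following snippet (ours, for illustration) confirms that for $m = 1$ the squared norm of $AV$ produced by Algorithm 27 equals expression (7.6.1):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 40))
A /= np.linalg.svd(A, compute_uv=False)[0]        # enforce lambda_1^A = 1
s = rng.standard_normal(50)                       # sketch vector, m = 1

SA = s @ A                                        # the 1 x d sketch
V = SA / np.linalg.norm(SA)                       # SVD of a vector = normalization
AV = A @ V

U, lam, _ = np.linalg.svd(A, full_matrices=False)
proj = (s @ U) ** 2                               # <s, U_i^A>^2
obj = np.sum(lam ** 4 * proj) / np.sum(lam ** 2 * proj)   # expression (7.6.1)
assert np.isclose(AV @ AV, obj)                   # ||AV||^2 matches (7.6.1)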

7.7. Optimization Bounds

Motivated by the empirical success of sketch optimization, we investigate the complexity of optimizing the loss function. We focus on the simple case where $m = 1$ and therefore $S$ is just a (dense) vector. Our main observation is that a vector $s$ picked uniformly at random from the $d$-dimensional unit sphere achieves an approximately optimal solution, with the approximation factor depending on the maximum stable rank of the matrices $A_1, \cdots, A_N$. This algorithm is not particularly useful for our purpose, as our goal is to improve over the random choice of the sketching matrix $S$. Nevertheless, it demonstrates that an algorithm with a non-trivial approximation factor exists.

Definition 7.7.1 (stable rank $r'$). For a matrix $A$, the stable rank of $A$ is defined as the squared ratio of the Frobenius and operator norms of $A$, i.e.,

$$r'(A) = \frac{\|A\|_F^2}{\|A\|^2} = \frac{\sum_i (\lambda^A_i)^2}{\max_i (\lambda^A_i)^2}.$$
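A one-line helper (ours) for computing the stable rank:

import numpy as np

def stable_rank(A):
    # r'(A) = ||A||_F^2 / ||A||^2 = (sum of squared singular values) / (largest squared)
    sv = np.linalg.svd(A, compute_uv=False)
    return np.sum(sv ** 2) / sv[0] ** 2

print(stable_rank(np.random.default_rng(4).standard_normal((100, 80))))   # between 1 and rank(A)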

Note that since we assume that for all matrices $A \sim \mathcal{D}$, $1 = \lambda^A_1 \ge \cdots \ge \lambda^A_d > 0$, for all these matrices $r'(A) = \sum_i (\lambda^A_i)^2$. First, we consider the simplified objective function as in (7.6.2).

Lemma 7.7.2. A vector $s$ picked uniformly at random from the $d$-dimensional unit sphere is an $O(r')$-approximation to the optimum value of the simplified objective function in Equation (7.6.2), where $r'$ is the maximum stable rank of the matrices $A_1, \cdots, A_N$.

Proof: We will show that

$$\mathbb{E}\left[\frac{\langle s, U^A_1\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2}\right] = \Omega(1/r'(A))$$

for all $A \sim \mathcal{D}$, where $s$ is a vector picked uniformly at random from $\mathbb{S}^{d-1}$. Since for all $A \sim \mathcal{D}$ we have $\frac{\langle s, U^A_1\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2} \le 1$, by the linearity of expectation the vector $s$ achieves an $O(r')$-approximation to the maximum value of the objective function

$$\sum_{j=1}^N \frac{\langle s, U^{A_j}_1\rangle^2}{\sum_{i=1}^d (\lambda^{A_j}_i)^2 \langle s, U^{A_j}_i\rangle^2}.$$

First, recall that to sample $s$ uniformly at random from $\mathbb{S}^{d-1}$ we can generate $s$ as $\sum_{i=1}^d \alpha_i U^A_i \big/ \sqrt{\sum_{i=1}^d \alpha_i^2}$, where $\alpha_i \sim \mathcal{N}(0,1)$ for all $i \le d$. This helps us evaluate $\mathbb{E}\big[\frac{\langle s, U^A_1\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2}\big]$ for an arbitrary matrix $A \sim \mathcal{D}$:

$$\mathbb{E}\left[\frac{\langle s, U^A_1\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2}\right] = \mathbb{E}\left[\frac{(\alpha_1)^2}{\sum_{i=1}^d (\lambda^A_i)^2\cdot(\alpha_i)^2}\right] \ge \mathbb{E}\left[\frac{(\alpha_1)^2}{\sum_{i=1}^d (\lambda^A_i)^2\cdot(\alpha_i)^2}\,\Big|\,\Psi_1\cap\Psi_2\right]\cdot\Pr(\Psi_1\cap\Psi_2),$$

where the events $\Psi_1, \Psi_2$ are defined as:

$$\Psi_1 \,:=\, \left[\,|\alpha_1| \ge \tfrac{1}{2}\,\right] \qquad\text{and}\qquad \Psi_2 \,:=\, \left[\,\sum_{i=2}^d (\lambda^A_i)^2(\alpha_i)^2 \le 2\cdot r'(A)\,\right].$$

Since the $\alpha_i$'s are independent, we have

$$\mathbb{E}\left[\frac{(\alpha_1)^2}{\sum_{i=1}^d (\lambda^A_i)^2(\alpha_i)^2}\right] \ge \mathbb{E}\left[\frac{(\alpha_1)^2}{(\alpha_1)^2 + 2\cdot r'(A)}\,\Big|\,\Psi_1\cap\Psi_2\right]\cdot\Pr(\Psi_1)\cdot\Pr(\Psi_2) \ge \frac{1}{8\cdot r'(A)+1}\cdot\Pr(\Psi_1)\cdot\Pr(\Psi_2),$$

where we used that $\frac{(\alpha_1)^2}{(\alpha_1)^2 + 2\cdot r'(A)}$ is increasing in $(\alpha_1)^2$ for $(\alpha_1)^2 \ge \frac{1}{4}$. It remains to prove that $\Pr(\Psi_1), \Pr(\Psi_2) = \Theta(1)$. We observe that, since $\alpha_1 \sim \mathcal{N}(0,1)$, we have

$$\Pr(\Psi_1) = \Pr\left(|\alpha_1| \ge \tfrac{1}{2}\right) = \Theta(1).$$

Similarly, by Markov's inequality, we have

$$\Pr(\Psi_2) = \Pr\left(\sum_{i=1}^d (\lambda^A_i)^2(\alpha_i)^2 \le 2r'(A)\right) \ge 1 - \Pr\left(\sum_{i=1}^d (\lambda^A_i)^2(\alpha_i)^2 > 2r'(A)\right) \ge \frac{1}{2}. \qquad\square$$

Next, we prove that a random vector $s \in \mathbb{S}^{d-1}$ achieves an $O(r'(A))$-approximation to the optimum of the main objective function in Equation (7.6.1).

Lemma 7.7.3. A vector $s$ picked uniformly at random from the $d$-dimensional unit sphere is an $O(r')$-approximation to the optimum value of the objective function in Equation (7.6.1), where $r'$ is the maximum stable rank of the matrices $A_1, \cdots, A_N$.

Proof: We assume that the vector $s$ is generated via the same process as in the proof of Lemma 7.7.2. It follows that

$$\mathbb{E}\left[\frac{\sum_{i=1}^d (\lambda^A_i)^4\langle s, U^A_i\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2}\right] \ge \mathbb{E}\left[\frac{(\alpha_1)^2}{\sum_{i=1}^d (\lambda^A_i)^2\cdot(\alpha_i)^2}\right] = \Omega(1/r'(A)). \qquad\square$$

7.8. Generalization Bounds

Define the loss function as

$$L(s) \,:=\, -\,\mathbb{E}_{A\sim\mathcal{D}}\left[\frac{\sum_{i=1}^d (\lambda^A_i)^4\langle s, U^A_i\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2}\right].$$

We want to find a vector $s \in \mathbb{S}^{d-1}$ to minimize $L(s)$, where $\mathbb{S}^{d-1}$ is the $d$-dimensional unit sphere. Since $\mathcal{D}$ is unknown, we instead optimize the empirical (training) loss

$$L_{\mathrm{Tr}}(s) \,:=\, -\,\frac{1}{N}\sum_{j=1}^N \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^2\langle s, U^{A_j}_i\rangle^2}.$$
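The empirical loss can be evaluated directly from the SVDs of the training matrices, as in the following sketch (ours):

import numpy as np

def empirical_loss(s, train_set):
    # L_Tr(s): negative average of the per-matrix objective (7.6.1)
    vals = []
    for A in train_set:
        U, lam, _ = np.linalg.svd(A, full_matrices=False)
        proj = (s @ U) ** 2                        # <s, U_i^A>^2
        vals.append(np.sum(lam ** 4 * proj) / np.sum(lam ** 2 * proj))
    return -np.mean(vals)

rng = np.random.default_rng(5)
train = [rng.standard_normal((50, 40)) for _ in range(8)]
train = [A / np.linalg.svd(A, compute_uv=False)[0] for A in train]   # lambda_1^A = 1
s = rng.standard_normal(50); s /= np.linalg.norm(s)                  # s on the unit sphere
print(empirical_loss(s, train))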

The importance of robust solutions. We start by observing that if $s$ minimizes the training loss $L_{\mathrm{Tr}}$, it is not necessarily true that $s$ is the optimal solution for the population loss $L$. For example, it could be the case that the matrices $\{A_j\}_{j=1,\cdots,N}$ are diagonal with only a single non-zero entry, located in the top row, while $s = (\varepsilon, \sqrt{1-\varepsilon^2}, 0, \cdots, 0)$ for $\varepsilon$ close to $0$. In this case, we know that $L_{\mathrm{Tr}}(s) = -1$, which is its minimum value.

However, such a solution is not robust. In the population distribution, if there exists a matrix $A$ such that $A = \mathrm{diag}(\sqrt{1-100\varepsilon^2}, 10\varepsilon, 0, 0, \cdots, 0)$, then inserting $s$ into (7.6.1) gives

$$\frac{\sum_{i=1}^d (\lambda^A_i)^4\langle s, U^A_i\rangle^2}{\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2} = \frac{(1-100\varepsilon^2)^2\varepsilon^2 + 10^4\varepsilon^4(1-\varepsilon^2)}{(1-100\varepsilon^2)\varepsilon^2 + 100\varepsilon^2(1-\varepsilon^2)} < \frac{\varepsilon^2 + 10^4\varepsilon^4}{101\varepsilon^2 - 100\varepsilon^4} = \frac{1 + 10^4\varepsilon^2}{101 - 100\varepsilon^2}.$$

The upper bound is very close to $0$ if $\varepsilon$ is small enough. This is because when the denominator is extremely small, the whole expression is susceptible to minor perturbations of $A$. This is a typical example showing the importance of finding a robust solution. Because of this issue, we will show a generalization guarantee for robust solutions $s$.

Definition of a robust solution. First, define the event $\zeta_{A,\delta,s} \,:=\, \mathbb{1}\left[\sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2 < \delta\right]$, i.e., the event that the denominator in the loss function is small. Ideally, we want this event to happen with small probability, which indicates that for most matrices the denominator is large and therefore $s$ is robust in general. We have the following definition of robustness.

Definition 7.8.1 ($(\rho,\delta)$-robustness). $s$ is $(\rho,\delta)$-robust with respect to $\mathcal{D}$ if $\mathbb{E}_{A\sim\mathcal{D}}[\zeta_{A,\delta,s}] \le \rho$. $s$ is $(\rho,\delta)$-robust with respect to Tr if $\mathbb{E}_{A\sim\mathrm{Tr}}[\zeta_{A,\delta,s}] \le \rho$.

For a given $\mathcal{D}$, we can define the robust solution set that includes all robust vectors.

Definition 7.8.2 ($(\rho,\delta)$-robust set). $M_{\mathcal{D},\rho,\delta}$ is defined to be the set of all vectors $s \in \mathbb{S}^{d-1}$ such that $s$ is $(\rho,\delta)$-robust with respect to $\mathcal{D}$.

Estimating $M_{\mathcal{D},\rho,\delta}$. The drawback of the above definition is that $M_{\mathcal{D},\rho,\delta}$ is defined by the unknown distribution $\mathcal{D}$, so for fixed $\delta$ and $\rho$, we cannot tell whether $s$ is in $M_{\mathcal{D},\rho,\delta}$ or not. However, we can estimate the robustness of $s$ using the training set. Specifically, we have the following lemma:

Lemma 7.8.3 (Estimating robustness). For a training set Tr of size $N$ sampled uniformly at random from $\mathcal{D}$, a given $s \in \mathbb{R}^d$, and a constant $1 > \eta > 0$, if $s$ is $(\rho,\delta)$-robust with respect to Tr, then with probability at least $1 - e^{-\frac{\eta^2\rho N}{2}}$, $s$ is $\left(\frac{\rho}{1-\eta}, \delta\right)$-robust with respect to $\mathcal{D}$.

Proof: Suppose that $\Pr_{A\sim\mathcal{D}}[\zeta_{A,\delta,s}] = \rho_1$, which means $\mathbb{E}\left[\sum_{A_i\in\mathrm{Tr}} \zeta_{A_i,\delta,s}\right] = \rho_1 N$. Since the events $\zeta_{A_i,\delta,s}$ are 0-1 random variables, by the Chernoff bound,

$$\Pr\left(\sum_{A_i\in\mathrm{Tr}} \zeta_{A_i,\delta,s} \le (1-\eta)\rho_1 N\right) \le e^{-\frac{\eta^2\rho_1 N}{2}}.$$

If $\rho_1 < \rho < \rho/(1-\eta)$, our claim is immediately true. Otherwise, we know $e^{-\frac{\eta^2\rho_1 N}{2}} \le e^{-\frac{\eta^2\rho N}{2}}$. Hence, with probability at least $1 - e^{-\frac{\eta^2\rho N}{2}}$, $N\rho \ge \sum_{A_i\in\mathrm{Tr}} \zeta_{A_i,\delta,s} > (1-\eta)\rho_1 N$. This implies that with probability at least $1 - e^{-\frac{\eta^2\rho N}{2}}$, $\rho_1 \le \frac{\rho}{1-\eta}$. $\square$

Lemma 7.8.3 implies that for a fixed solution $s$, if it is $(\rho,\delta)$-robust on Tr, then it is also $(O(\rho),\delta)$-robust on $\mathcal{D}$ with high probability. However, Lemma 7.8.3 only works for a single solution $s$, while there are infinitely many potential $s$ on the $d$-dimensional unit sphere. To remedy this problem, we discretize the unit sphere to bound the number of potential solutions. Classical results tell us that discretizing the unit sphere into a grid of edge length

$\sqrt{\varepsilon}$ gives $\frac{C_d}{\varepsilon^d}$ points on the grid for some constant $C_d$ (e.g., see Section 3.3 in [98] for more details). We will only consider these points as potential solutions, denoted $\hat{B}^d$. Thus, we can find a “robust” solution $s \in \hat{B}^d$ with decent probability, using Lemma 7.8.3 and a union bound.

Lemma 7.8.4 (Picking a robust $s$). For fixed constants $\rho > 0$ and $1 > \eta > 0$, with probability at least $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{2}}$, any $s \in \hat{B}^d$ that is $(\rho,\delta)$-robust with respect to Tr is $\left(\frac{\rho}{1-\eta}, \delta\right)$-robust with respect to $\mathcal{D}$.

Since we are working with discretized solutions, we need a new definition of the robust set.

Definition 7.8.5 (Discretized $(\rho,\delta)$-robust set). $\hat{M}_{\mathcal{D},\rho,\delta}$ is defined to be the set of all vectors $s \in \hat{B}^d$ such that $s$ is $(\rho,\delta)$-robust with respect to $\mathcal{D}$.

Using arguments similar to Lemma 7.8.4, we know all solutions from $\hat{M}_{\mathcal{D},\rho,\delta}$ are robust with respect to Tr as well.

Lemma 7.8.6. With probability at least $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3}}$, for a constant $\eta > 0$, all solutions in $\hat{M}_{\mathcal{D},\rho,\delta}$ are $((1+\eta)\rho, \delta)$-robust with respect to Tr.

Proof: Consider a fixed solution $s \in \hat{M}_{\mathcal{D},\rho,\delta}$. Note that $\mathbb{E}\left[\sum_{A_i\in\mathrm{Tr}} \zeta_{A_i,\delta,s}\right] = \rho N$ and the $\zeta_{A_i,\delta,s}$ are 0-1 random variables. Therefore, by the Chernoff bound,

$$\Pr\left(\sum_{A_i\in\mathrm{Tr}} \zeta_{A_i,\delta,s} \ge (1+\eta)\rho N\right) \le e^{-\frac{\eta^2\rho N}{3}}.$$

Hence, with probability at least $1 - e^{-\frac{\eta^2\rho N}{3}}$, $s$ is $((1+\eta)\rho, \delta)$-robust with respect to Tr. By a union bound over all points in $\hat{M}_{\mathcal{D},\rho,\delta} \subseteq \hat{B}^d$, the proof is complete. $\square$

7.8.1. Generalization Bound for Robust Solutions

Finally, we show generalization bounds for robust solutions. To this end, we use Rademacher complexity to prove the generalization bound. Define the Rademacher complexity $R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr})$ as

$$R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr}) \,:=\, \frac{1}{N}\,\mathbb{E}_{\sigma\sim\{\pm1\}^N}\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j\, \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4 \langle s, U^{A_j}_i\rangle^2}{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^2 \langle s, U^{A_j}_i\rangle^2}\right].$$

$R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr})$ is handy because of the following theorem (notice that the loss function takes values in $[-1, 0]$):

Theorem 7.8.7 (Theorem 26.5 in [160]). Given a constant $\delta > 0$, with probability at least $1-\delta$, for all $s \in \hat{M}_{\mathcal{D},\rho,\delta}$,

$$L(s) - L_{\mathrm{Tr}}(s) \le 2R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr}) + 4\sqrt{\frac{2\log(4/\delta)}{N}}.$$

That means it suffices to bound $R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr})$ to get the generalization bound.

Lemma 7.8.8 (Bound on $R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr})$). For a constant $\eta > 0$, with probability at least $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3}}$, $R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr}) \le (1+\eta)\rho + \frac{1-\delta}{2\delta} + \frac{d}{\sqrt{N}}$.

Proof: Define $\rho' = (1+\eta)\rho$. By Lemma 7.8.6, we know that with probability $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3}}$, any $s \in \hat{M}_{\mathcal{D},\rho,\delta}$ is $(\rho',\delta)$-robust with respect to Tr, hence $\sum_{A\in\mathrm{Tr}} \zeta_{A,\delta,s} \le \rho' N$. The analysis below is conditioned on this event.

Define $h_{A,\delta,s} \,:=\, \max\left\{\delta,\; \sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2\right\}$. We know that, with probability $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3}}$,

$$N\cdot R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr}) = \mathbb{E}_{\sigma\sim\{\pm1\}^N}\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j\, \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^2\langle s, U^{A_j}_i\rangle^2}\right] \le \rho' N + \mathbb{E}_{\sigma}\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j\, \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{h_{A_j,\delta,s}}\right], \qquad (7.8.1)$$

where (7.8.1) holds because, by definition, $h_{A,\delta,s} > \sum_{i=1}^d (\lambda^A_i)^2\langle s, U^A_i\rangle^2$ if and only if $\zeta_{A,\delta,s} = 1$, which happens for at most $\rho' N$ matrices. Note that for any matrix $A_j$,

$$\sigma_j\, \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^2\langle s, U^{A_j}_i\rangle^2} \le 1.$$

Now,

$$\mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j\, \frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{h_{A_j,\delta,s}}\right] \le \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \left(\mathbb{1}_{\sigma_j=1}\,\frac{\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2}{\delta} - \mathbb{1}_{\sigma_j=-1}\,\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2\right)\right] \qquad (7.8.2)$$

$$= \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \left(\sigma_j \sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2 + \mathbb{1}_{\sigma_j=1}\,\sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2\left(\frac{1}{\delta} - 1\right)\right)\right]$$

$$\le \frac{N}{2\delta} - \frac{N}{2} + \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j \sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2\right]. \qquad (7.8.3)$$

The first inequality (7.8.2) holds since $\frac{\sum_{i=1}^d (\lambda^{A_j}_i)^4\langle s, U^{A_j}_i\rangle^2}{h_{A_j,\delta,s}} \in [\delta, 1]$. It remains to bound the last term in (7.8.3).

$$\mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j \sum_{i=1}^d \left(\lambda^{A_j}_i\right)^4\langle s, U^{A_j}_i\rangle^2\right] \le \sum_{i=1}^d \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j \left\langle s, \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\rangle^2\right] \qquad (7.8.4)$$

By the contraction lemma for Rademacher complexity, we have

$$\mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j \left\langle s, \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\rangle^2\right] \le \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \sum_{j=1}^N \sigma_j \left\langle s, \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\rangle\right] = \mathbb{E}_\sigma\left[\sup_{s\in\hat{M}_{\mathcal{D},\rho,\delta}} \left\langle s, \sum_{j=1}^N \sigma_j \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\rangle\right] \le \mathbb{E}_\sigma\left\|\sum_{j=1}^N \sigma_j \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\|_2,$$

where the last inequality is by the Cauchy–Schwarz inequality. Now, by Jensen's inequality,

$$\mathbb{E}_\sigma\left\|\sum_{j=1}^N \sigma_j \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\|_2 \le \left(\mathbb{E}_\sigma\left\|\sum_{j=1}^N \sigma_j \left(\lambda^{A_j}_i\right)^2 U^{A_j}_i\right\|_2^2\right)^{1/2} = \left(\sum_{j=1}^N \left(\lambda^{A_j}_i\right)^4\right)^{1/2} \le \sqrt{N}. \qquad (7.8.5)$$

Combining (7.8.1), (7.8.3), (7.8.4) and (7.8.5), we have $R(\hat{M}_{\mathcal{D},\rho,\delta} \circ \mathrm{Tr}) \le \rho' + \frac{1-\delta}{2\delta} + \frac{d}{\sqrt{N}}$. $\square$

Combining with Theorem 7.8.7, we get our main theorem:

Theorem 7.8.9 (Main Theorem). Given a training set $\mathrm{Tr} = \{A_j\}_{j=1}^N$ sampled uniformly from $\mathcal{D}$, and fixed constants $1 > \rho \ge 0$, $\delta > 0$, $1 > \eta > 0$, if there exists a $(\rho,\delta)$-robust solution $s \in \hat{B}^d$ with respect to Tr, then with probability at least $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{2}} - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3(1-\eta)}}$, for any $s \in \hat{B}^d$ that is a $(\rho,\delta)$-robust solution with respect to Tr,

$$L(s) \le L_{\mathrm{Tr}}(s) + \frac{2(1+\eta)\rho}{1-\eta} + \frac{1-\delta}{\delta} + \frac{2d}{\sqrt{N}} + 4\sqrt{\frac{2\log(4/\delta)}{N}}.$$

Proof: Since we can find $s \in \hat{B}^d$ such that $s$ is $(\rho,\delta)$-robust with respect to Tr, by Lemma 7.8.4, with probability $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{2}}$, $s$ is $\left(\frac{\rho}{1-\eta},\delta\right)$-robust with respect to $\mathcal{D}$. Therefore, $s \in \hat{M}_{\mathcal{D},\frac{\rho}{1-\eta},\delta}$. By Lemma 7.8.8, with probability at least $1 - \frac{C_d}{\varepsilon^d}\, e^{-\frac{\eta^2\rho N}{3(1-\eta)}}$, $R(\hat{M}_{\mathcal{D},\frac{\rho}{1-\eta},\delta} \circ \mathrm{Tr}) \le \frac{(1+\eta)\rho}{1-\eta} + \frac{1-\delta}{2\delta} + \frac{d}{\sqrt{N}}$. Combined with Theorem 7.8.7, the proof is complete. $\square$

In summary, Theorem 7.8.9 states that if we can find a robust solution 푠 which “fits” the training set then it generalizes to the test set.

Bibliography

[1] A. Aamand, P. Indyk, and A. Vakilian. (learned) frequency estimation algorithms under zipfian distribution. arXiv preprint arXiv:1908.05198, 2019.

[2] Z. Abbassi, V. S. Mirrokni, and M. Thakur. Diversity maximization under matroid constraints. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 32–40, 2013.

[3] P. K. Agarwal and J. Pan. Near-linear algorithms for geometric hitting sets and set covers. In Proc. 30th Annu. Sympos. Comput. Geom. (SoCG), pages 271–279, 2014.

[4] A. Aghazadeh, R. Spring, D. LeJeune, G. Dasarathy, A. Shrivastava, and R. G. Bara- niuk. Mission: Ultra large-scale feature selection using count-sketches. In International Conference on Machine Learning (ICML), 2018.

[5] S. Agrawal, M. Shadravan, and C. Stein. Submodular secretary problem with shortlists. In 10th Innovations in Theoretical Computer Science Conference (ITCS), pages 1:1– 1:19, 2019.

[6] K. J. Ahn and S. Guha. Linear programming in the semi-streaming model with appli- cation to the maximum matching problem. In Proc. 38th Int. Colloq. Automata Lang. Prog. (ICALP), pages 526–538, 2011.

[7] K. J. Ahn and S. Guha. Access to data and number of iterations: Dual primal algo- rithms for maximum matching under resource constraints. In Proc. 27th ACM Sympos. Parallel Alg. Arch. (SPAA), pages 202–211, 2015.

[8] K. J. Ahn, S. Guha, and A. McGregor. Analyzing graph structure via linear mea- surements. In Proc. 23rd ACM-SIAM Sympos. Discrete Algs. (SODA), pages 459–467, 2012.

[9] K. J. Ahn, S. Guha, and A. McGregor. Graph sketches: sparsification, spanners, and subgraphs. In Proc. 31st ACM Sympos. on Principles of Database Systems (PODS), pages 5–14, 2012.

[10] Z. Allen-Zhu and Y. Li. Lazysvd: even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016.

[11] Z. Allen-Zhu and L. Orecchia. Nearly-linear time positive LP solver with faster convergence rate. In Proc. 47th Annu. ACM Sympos. Theory Comput. (STOC), pages 229–236, 2015.

[12] Z. Allen-Zhu and L. Orecchia. Using optimization to break the epsilon barrier: A faster and simpler width-independent algorithm for solving positive linear programs in parallel. In Proc. 26th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1439–1456, 2015.

[13] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1):137–147, 1999.

[14] N. Alon, D. Moshkovitz, and S. Safra. Algorithmic construction of sets for 푘- restrictions. ACM Trans. Algo., 2(2):153–177, 2006.

[15] B. Aronov, E. Ezra, and M. Sharir. Small-size 휀-nets for axis-parallel rectangles and boxes. SIAM J. Comput., 39(7):3248–3282, 2010.

[16] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta- algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[17] S. Assadi. Tight Space-Approximation Tradeoff for the Multi-Pass Streaming Set Cover Problem. In Proc. 36th ACM Sympos. on Principles of Database Systems (PODS), pages 321–335, 2017.

[18] S. Assadi, N. Karpov, and Q. Zhang. Distributed and streaming linear programming in low dimensions. In Proc. 38th ACM Sympos. on Principles of Database Systems (PODS), pages 236–253, 2019.

[19] S. Assadi and S. Khanna. Tight bounds on the round complexity of the distributed maximum coverage problem. In Proc. 29th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 2412–2431, 2018.

[20] S. Assadi, S. Khanna, and Y. Li. Tight bounds for single-pass streaming complexity of the set cover problem. In Proc. 48th Annu. ACM Sympos. Theory Comput. (STOC), pages 698–711, 2016.

[21] S. Assadi, S. Khanna, Y. Li, and G. Yaroslavtsev. Maximum matchings in dynamic graph streams and the simultaneous communication model. In Proc. 27th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1345–1364, 2016.

[22] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671–680, 2014.

[23] B. Bahmani, A. Goel, and K. Munagala. Efficient primal-dual graph algorithms for mapreduce. In International Workshop on Algorithms and Models for the Web-Graph, pages 59–78, 2014.

[24] M.-F. Balcan, T. Dick, T. Sandholm, and E. Vitercik. Learning to branch. In International Conference on Machine Learning (ICML), pages 353–362, 2018.

[25] L. Baldassarre, Y.-H. Li, J. Scarlett, B. Gözcü, I. Bogunovic, and V. Cevher. Learning-based compressive subsampling. IEEE Journal of Selected Topics in Signal Processing, 10(4):809–822, 2016.

[26] N. Bansal, A. Caprara, and M. Sviridenko. A new approximation method for set covering problems, with applications to multidimensional bin packing. SIAM Journal on Computing, 39(4):1256–1278, 2009.

[27] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Sys. Sci., 68(4):702–732, 2004.

[28] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1–10, 2002.

[29] M. Bateni, H. Esfandiari, and V. S. Mirrokni. Almost optimal streaming algorithms for coverage problems. In Proc. 29th ACM Sympos. Parallel Alg. Arch. (SPAA), pages 13–23, 2017.

[30] G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.

[31] S. Bhattacharya, M. Henzinger, D. Nanongkai, and C. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In Proc. 47th Annu. ACM Sympos. Theory Comput. (STOC), pages 173–182, 2015.

[32] J. Blasiok. Optimal streaming and tracking distinct elements with high probability. In Proc. 29th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 2432–2448, 2018.

[33] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning (ICML), pages 537–546, 2017.

[34] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.

[35] R. S. Boyer and J. S. Moore. Mjrty—a fast majority vote algorithm. In Automated Reasoning, pages 105–117. 1991.

[36] O. Boykin, A. Bryant, E. Chen, ellchow, M. Gagnon, M. Nakamura, S. Noble, S. Ritchie, A. Singhal, and A. Zymnis. Algebird. https://twitter.github.io/algebird/, 2016.

[37] V. Braverman, S. R. Chestnut, N. Ivkin, J. Nelson, Z. Wang, and D. P. Woodruff.

Bptree: An ℓ2 heavy hitters algorithm using constant memory. In Proc. 36th ACM Sympos. on Principles of Database Systems (PODS), pages 361–376, 2017.

[38] V. Braverman, S. R. Chestnut, N. Ivkin, and D. P. Woodruff. Beating countsketch for heavy hitters in insertion streams. In Proc. 48th Annu. ACM Sympos. Theory Comput. (STOC), pages 740–753, 2016.

[39] CAIDA. Caida internet traces 2016 chicago. http://www.caida.org/data/monitors/passive-equinix-chicago.xml.

[40] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52(2):489–509, 2006.

[41] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th Annual IEEE Conference on Computational Complexity, pages 107–117, 2003.

[42] A. Chakrabarti and A. Wirth. Incidence geometries and the pass complexity of semi- streaming set cover. In Proc. 27th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1365–1373, 2016.

[43] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. 29th Int. Colloq. Automata Lang. Prog. (ICALP), pages 693–703, 2002.

[44] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. 29th Int. Colloq. Automata Lang. Prog. (ICALP), pages 693–703, 2002.

[45] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning weight in sublinear time. SIAM Journal on computing, 34(6):1370–1379, 2005.

[46] C. Chekuri, S. Gupta, and K. Quanrud. Streaming algorithms for submodular function maximization. In International Colloquium on Automata, Languages, and Program- ming, pages 318–330, 2015.

[47] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfig- urable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.

[48] F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In Proc. 19th Int. Conf. World Wide Web (WWW), pages 231–240, 2010.

[49] R. Chitnis, G. Cormode, H. Esfandiari, M. Hajiaghayi, A. McGregor, M. Monem- izadeh, and S. Vorotnikova. Kernelization via sampling with applications to finding matchings and related problems in dynamic graph streams. In Proc. 27th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1326–1344, 2016.

[50] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II. Discrete Comput. Geom., 4:387–421, 1989.

[51] K. L. Clarkson and K. Varadarajan. Improved approximation algorithms for geometric set cover. Discrete Comput. Geom., 37(1):43–58, 2007.

[52] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proc. 41st Annu. ACM Sympos. Theory Comput. (STOC), pages 205–214, 2009.

[53] K. L. Clarkson and D. P. Woodruff. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54, 2017.

[54] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proc. 47th Annu. ACM Sympos. Theory Comput. (STOC), pages 163–172, 2015.

[55] G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. Pro- ceedings of the VLDB Endowment, 1(2):1530–1541, 2008.

[56] G. Cormode, H. J. Karloff, and A. Wirth. Set cover algorithms for very large datasets. In Proc. 19th ACM Conf. Info. Know. Manag. (CIKM), pages 479–488, 2010.

[57] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count- min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[58] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 44–55. SIAM, 2005.

[59] H. Dai, E. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimiza- tion algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6351–6361, 2017.

[60] D. Davidov, E. Gabrilovich, and S. Markovitch. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, pages 250–257, 2004.

[61] E. D. Demaine, P. Indyk, S. Mahabadi, and A. Vakilian. On streaming and commu- nication complexity of the set cover problem. In Proc. 28th Int. Symp. Dist. Comp. (DISC), volume 8784, pages 484–498, 2014.

[62] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. 10th Annu. European Sympos. Algorithms (ESA), pages 348–360, 2002.

[63] I. Dinur and D. Steurer. Analytical approach to parallel repetition. In Proc. 46th Annu. ACM Sympos. Theory Comput. (STOC), pages 624–633, 2014.

[64] D. L. Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.

[65] P. Drineas, R. Kannan, and M. W. Mahoney. Sampling sub-problems of heteroge- neous Max-Cut problems and approximation algorithms. In Annual Symposium on Theoretical Aspects of Computer Science, pages 57–68, 2005.

[66] F. Dzogang, T. Lansdall-Welfare, S. Sudhahar, and N. Cristianini. Scalable preference learning from data streams. In Proc. 24th Int. Conf. World Wide Web (WWW), pages 885–890, 2015.

[67] Y. Emek and A. Rosén. Semi-streaming set cover. In Proc. 41st Int. Colloq. Automata Lang. Prog. (ICALP), volume 8572 of Lect. Notes in Comp. Sci., pages 453–464, 2014.

[68] A. Ene, S. Har-Peled, and B. Raichel. Geometric packing under non-uniform con- straints. In Proc. 28th Annu. Sympos. Comput. Geom. (SoCG), pages 11–20, 2012.

[69] P. Erdös. On a lemma of littlewood and offord. Bulletin of the American Mathematical Society, 51(12):898–902, 1945.

[70] C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems (TOCS), 21(3):270–313, 2003.

[71] T. Feder and D. H. Greene. Optimal algorithms for approximate clustering. In Proc. 20th Annu. ACM Sympos. Theory Comput. (STOC), pages 434–444, 1988.

[72] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.

[73] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005.

[74] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applica- tions. Journal of computer and system sciences, 31(2):182–209, 1985.

[75] R. J. Fowler, M. S. Paterson, and S. L. Tanimoto. Optimal packing and covering in the plane are NP-complete. Inform. Process. Lett., 12(3):133–137, 1981.

[76] N. Garg and J. Koenemann. Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM Journal on Computing, 37(2):630–652, 2007.

[77] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762– 1792, 2016.

[78] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. In Proc. 25th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 707–717, 2014.

[79] A. Gilbert and P. Indyk. Sparse recovery using sparse matrices. Proceedings of the IEEE, 98(6):937–947, 2010.

[80] A. Goel, M. Kapralov, and S. Khanna. Perfect matchings in 푂(푛 log 푛) time in regular bipartite graphs. SIAM Journal on Computing, 42(3):1392–1404, 2013.

[81] S. Gollapudi and D. Panigrahi. Online algorithms for rent-or-buy with expert advice. In International Conference on Machine Learning (ICML), pages 2319–2327, 2019.

[82] A. Goyal, H. Daumé III, and G. Cormode. Sketch algorithms for estimating point queries in nlp. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1093–1103, 2012.

[83] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[84] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.

[85] A. Gronemeier. Asymptotically optimal lower bounds on the NIH-Multi-Party in- formation complexity of the AND-function and disjointness. In 26th International Symposium on Theoretical Aspects of Computer Science, (STACS), pages 505–516, 2009.

[86] T. Grossman and A. Wool. Computational experience with approximation algorithms for the set covering problem. Euro. J. Oper. Res., 101(1):81–92, 1997.

[87] S. Guha and A. McGregor. Lower bounds for quantile estimation in random-order and multi-pass streaming. In Proc. 34th Int. Colloq. Automata Lang. Prog. (ICALP), volume 4596 of Lect. Notes in Comp. Sci., pages 704–715, 2007.

[88] S. Guha and A. McGregor. Tight lower bounds for multi-pass stream computation via pass elimination. In Proc. 35th Int. Colloq. Automata Lang. Prog. (ICALP), volume 5125 of Lect. Notes in Comp. Sci., pages 760–772, 2008.

[89] S. Guha, A. McGregor, and D. Tench. Vertex and hyperedge connectivity in dy- namic graph streams. In Proc. 34th ACM Sympos. on Principles of Database Systems (PODS), pages 241–247, 2015.

[90] V. Guruswami and K. Onak. Superlinear lower bounds for multipass graph processing. In Proc. 28th Conf. Comput. Complexity (CCC), pages 287–298, 2013.

[91] A. Hadian and T. Heinis. Interpolation-friendly b-trees: Bridging the gap between algorithmic and learned indexes. In Advances in Database Technology - 22nd Interna- tional Conference on Extending Database Technology (EDBT), pages 710–713, 2019.

[92] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.

[93] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84, 2017.

[94] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254, 2016.

[95] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[96] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. In Conference On Learning Theory (COLT), pages 970–978, 2018.

[97] S. Har-Peled, P. Indyk, S. Mahabadi, and A. Vakilian. Towards tight bounds for the streaming set cover problem. In Proc. 35th ACM Sympos. on Principles of Database Systems (PODS), pages 371–383, 2016.

[98] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of computing, 8(1):321–350, 2012.

[99] S. Har-Peled and K. Quanrud. Approximation algorithms for polynomial-expansion and low-density graphs. In Proc. 23rd Annu. European Sympos. Algorithms (ESA), volume 9294 of Lect. Notes in Comp. Sci., pages 717–728.

[100] S. Har-Peled and M. Sharir. Relative (푝, 휀)-approximations in geometry. Discrete Comput. Geom., 45(3):462–496, 2011.

[101] H. He, H. Daume III, and J. M. Eisner. Learning to search in branch and bound algorithms. In Advances in neural information processing systems (NeurIPS), pages 3293–3301, 2014.

[102] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. pages 107–118, 1998.

[103] C.-Y. Hsu, P. Indyk, D. Katabi, and A. Vakilian. Learning-based frequency estimation algorithms. International Conference on Learning Representations (ICLR), 2019.

[104] N. Imamoglu, Y. Oishi, X. Zhang, G. Ding, Y. Fang, T. Kouyama, and R. Nakamura. Hyperspectral image dataset for benchmarking on salient object detection. In Tenth International Conference on Quality of Multimedia Experience, (QoMEX), pages 1–3, 2018.

[105] P. Indyk, S. Mahabadi, R. Rubinfeld, J. Ullman, A. Vakilian, and A. Yodpinyanee. Fractional set cover in the streaming model. Approximation, Randomization, and Combinatorial Optimization (APPROX/RANDOM), pages 198–217, 2017.

[106] P. Indyk, S. Mahabadi, R. Rubinfeld, A. Vakilian, and A. Yodpinyanee. Set cover in sub-linear time. In Proc. 29th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 2467–2486, 2018.

[107] P. Indyk and A. Vakilian. Tight trade-offs for the maximum k-coverage problem in the general streaming model. In Proc. 38th ACM Sympos. on Principles of Database Systems (PODS), pages 200–217, 2019.

[108] P. Indyk and D. P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proc. 37th Annu. ACM Sympos. Theory Comput. (STOC), pages 202–208, 2005.

[109] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.

[110] H. Jowhari, M. Sağlam, and G. Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the thirtieth ACM SIGMOD- SIGACT-SIGART symposium on Principles of database systems, pages 49–58, 2011.

[111] B. Kalyanasundaram and G. Schintger. The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992.

[112] D. M. Kane, J. Nelson, and D. P. Woodruff. Revisiting norm estimation in data streams. arXiv preprint arXiv:0811.3648, 2008.

[113] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Proc. 29th ACM Sympos. on Principles of Database Systems (PODS), pages 41–52, 2010.

[114] M. Kapralov, Y. T. Lee, C. Musco, C. Musco, and A. Sidford. Single pass spectral sparsification in dynamic streams. SIAM Journal on Computing, 46(1):456–477, 2017.

[115] N. Karmarkar and R. M. Karp. An efficient approximation scheme for the one- dimensional bin-packing problem. In Foundations of Computer Science, 1982. SFCS’08. 23rd Annual Symposium on, pages 312–320, 1982.

[116] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding fre- quent elements in streams and bags. ACM Transactions on Database Systems (TODS), 28(1):51–55, 2003.

[117] M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT press, 1994.

[118] M. Khani, M. Alizadeh, J. Hoydis, and P. Fleming. Adaptive neural signal detection for massive MIMO. CoRR, abs/1906.04610, 2019.

[119] C. Koufogiannakis and N. E. Young. A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica, 70(4):648–674, 2014.

[120] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 489–504, 2018.

[121] F. Kuhn, T. Moscibroda, and R. Wattenhofer. The price of being near-sighted. In Proc. 17th ACM-SIAM Sympos. Discrete Algs. (SODA), 2006.

[122] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. In Proc. 25th ACM Sympos. Parallel Alg. Arch. (SPAA), pages 1–10, 2013.

[123] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In Proc. 23rd ACM Sympos. Parallel Alg. Arch. (SPAA), pages 85–94, 2011.

[124] Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in 푂(푣푟푎푛푘) iterations and faster algorithms for maximum flow. In Proc. 55th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 424–433, 2014.

[125] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581–588, 2013.

[126] J. E. Littlewood and A. C. Offord. On the number of real roots of a random alge- braic equation. ii. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 35, pages 133–148. Cambridge University Press, 1939.

[127] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman. One sketch to rule them all: Rethinking network flow monitoring with univmon. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 101–114, 2016.

[128] T. Lykouris and S. Vassilvitskii. Competitive caching with machine learned advice. In International Conference on Machine Learning ICML, pages 3302–3311, 2018.

[129] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2):426–435, 1998.

[130] S. Marko and D. Ron. Distance approximation in bounded-degree and general sparse graphs. In Approximation, Randomization, and Combinatorial Optimization. Algo- rithms and Techniques, pages 475–486. 2006.

[131] A. McGregor and H. T. Vu. Better streaming algorithms for the maximum coverage problem. In 20th International Conference on Database Theory (ICDT), pages 22:1– 22:18, 2017.

[132] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proc. 45th Annu. ACM Sympos. Theory Comput. (STOC), pages 91–100, 2013.

[133] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412, 2005.

[134] C. Metzler, A. Mousavi, and R. Baraniuk. Learned d-amp: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems (NeurIPS), pages 1772–1783, 2017.

[135] G. T. Minton and E. Price. Improved concentration bounds for count-sketch. In Proc. 25th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 669–686, 2014.

[136] J. Misra and D. Gries. Finding repeated elements. Science of computer programming, 2(2):143–152, 1982.

[137] M. Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems (NeurIPS), pages 464–473, 2018.

[138] D. Moshkovitz. The projection games conjecture and the NP-hardness of ln 푛- approximating set-cover. In Approximation, Randomization, and Combinatorial Opti- mization. Algorithms and Techniques, pages 276–287. 2012.

[139] R. Motwani and P. Raghavan. Randomized algorithms. Chapman & Hall/CRC, 2010.

[140] A. Mousavi, G. Dasarathy, and R. G. Baraniuk. Deepcodec: Adaptive sensing and recovery via deep convolutional neural networks. arXiv preprint arXiv:1707.03386, 2017.

[141] A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 1336–1343, 2015.

[142] N. H. Mustafa, R. Raman, and S. Ray. Settling the apx-hardness status for geometric set cover. In Proc. 55th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 541–550, 2014.

[143] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.

[144] J. Nelson and H. L. Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proc. 54th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 117–126, 2013.

[145] H. N. Nguyen and K. Onak. Constant-time approximation algorithms via local im- provements. In Proc. 49th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 327–336, 2008.

[146] N. Nisan. The communication complexity of approximate set packing and covering. In Proc. 29th Int. Colloq. Automata Lang. Prog. (ICALP), pages 868–875, 2002.

[147] A. Norouzi-Fard, J. Tarnawski, S. Mitrovic, A. Zandieh, A. Mousavifar, and O. Svens- son. Beyond 1/2-approximation for submodular maximization on massive data streams. In International Conference on Machine Learning (ICML), pages 3826–3835, 2018.

[148] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proc. 23rd ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1123–1131, 2012.

[149] M. Parnas and D. Ron. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theoretical Computer Science, 381(1):183–196, 2007.

[150] S. A. Plotkin, D. B. Shmoys, and É. Tardos. Fast approximation algorithms for frac- tional packing and covering problems. Mathematics of Operations Research, 20(2):257– 301, 1995.

[151] M. Purohit, Z. Svitkina, and R. Kumar. Improving online algorithms via ml pre- dictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 9661–9670, 2018.

[152] R. Raz and S. Safra. A sub-constant error-probability low-degree test, and a sub- constant error-probability PCP characterization of NP. In Proc. 29th Annu. ACM Sympos. Theory Comput. (STOC), 1997.

[153] A. A. Razborov. On the distributional complexity of disjointness. Theo. Comp. Sci., 106(2):385–390, 1992.

[154] P. Roy, A. Khan, and G. Alonso. Augmented sketch: Faster and more accurate stream processing. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD), pages 1449–1463, 2016.

[155] B. Saha and L. Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In Proc. SIAM Int. Conf. Data Mining (SDM), pages 697–708, 2009.

[156] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approx- imate Reasoning, 50(7):969–978, 2009.

[157] T. Sarlos. Improved approximation algorithms for large matrices via random projec- tions. In Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 143–152, 2006.

[158] S. Schechter, C. Herley, and M. Mitzenmacher. Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. In Proceedings of the 5th USENIX conference on Hot topics in security, pages 1–8, 2010.

[159] J. P. Schmidt, A. Siegel, and A. Srinivasan. Chernoff–hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8(2):223–250, 1995.

[160] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[161] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and J. Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, pages 164–176, 2017.

[162] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (NeurIPS), pages 3104–3112, 2014.

[163] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.

[164] P. Talukdar and W. Cohen. Scaling graph-based semi supervised learning to large number of labels using count-min sketch. In Artificial Intelligence and Statistics, pages 940–947, 2014.

[165] M. Thorup and Y. Zhang. Tabulation-based 5-independent hashing with applica- tions to linear probing and second moment estimation. SIAM Journal on Computing, 41(2):293–331, 2012.

[166] S. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 7(1–3):1–336, 2012.

[167] D. Wang, S. Rao, and M. W. Mahoney. Unified acceleration method for packing and covering problems via diameter reduction. In Proc. 43rd Int. Colloq. Automata Lang. Prog. (ICALP), pages 50:1–50:13, 2016.

[168] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data - a survey. Proceedings of the IEEE, 104(1):34–57, 2016.

[169] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in neural infor- mation processing systems (NeurIPS), pages 1753–1760, 2009.

[170] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

[171] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.

[172] Y. Yoshida, M. Yamamoto, and H. Ito. Improved constant-time approximation algo- rithms for maximum matchings and other optimization problems. SIAM Journal on Computing, 41(4):1074–1093, 2012.

[173] N. E. Young. Randomized rounding without solving the linear program. In Proc. 6th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 170–178, 1995.

[174] N. E. Young. Nearly linear-work algorithms for mixed packing/covering and facility- location linear programs. arXiv preprint arXiv:1407.3015, 2014.

[175] M. Yu, L. Jose, and R. Miao. Software defined traffic measurement with opensketch. In NSDI, volume 13, pages 29–42, 2013.
