The Adaptive Complexity of Submodular Optimization


Citation Balkanski, Eric. 2019. The Adaptive Complexity of Submodular Optimization. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029793


a dissertation presented by Eric Balkanski to The School of Engineering and Applied Sciences

in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science

Harvard University Cambridge, Massachusetts

May 2019

© 2019 Eric Balkanski
All rights reserved.

Dissertation advisor: Yaron Singer

Eric Balkanski

The Adaptive Complexity of Submodular Optimization

Abstract

In this thesis, we develop a new optimization technique that leads to exponentially faster algorithms for solving submodular optimization problems. For the canonical problem of maximizing a non-decreasing submodular function under a cardinality constraint, it is well known that the celebrated greedy algorithm, which iteratively adds elements whose marginal contribution is largest, achieves a 1 − 1/e approximation, which is optimal. The optimal approximation guarantee of the greedy algorithm comes at a price of high adaptivity. The adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round. Since submodular optimization is regularly applied on very large datasets, adaptivity is crucial as algorithms with low adaptivity enable dramatic speedups in parallel computing time. Submodular optimization has been studied for well over forty years now, and somewhat surprisingly, there was no known constant-factor approximation algorithm for submodular maximization whose adaptivity is sublinear in the size of the ground set n. Our main contribution is a novel optimization technique called adaptive sampling which leads to constant factor approximation algorithms for submodular maximization in only logarithmically many adaptive rounds. This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms. Furthermore, we show that no algorithm can achieve a constant factor approximation in õ(log n) rounds. Thus, the adaptive complexity of submodular maximization, i.e., the minimum number of rounds r such that there exists an r-adaptive algorithm which achieves a constant factor approximation, is logarithmic up to lower order terms.

Contents

1 Introduction
  1.1 Adaptive Sampling: An Exponential Speedup
  1.2 Results Overview
    1.2.1 From Predictions to Decisions: Optimization from Samples
    1.2.2 Faster Parallel Algorithms Through Adaptive Sampling
  1.3 Preliminaries
  1.4 Discussion About Adaptivity
    1.4.1 Adaptivity in Other Areas
    1.4.2 Related Models of Parallel Computation
    1.4.3 Applications of Adaptivity

2 Non-Adaptive Optimization
  2.1 From Predictions to Decisions: Optimization from Samples
    2.1.1 The Optimization from Samples Model
    2.1.2 Optimization from Samples is Equivalent to Non-Adaptivity
    2.1.3 Overview of Results
  2.2 Optimization from Samples Algorithms
    2.2.1 Curvature
    2.2.2 Learning to Influence
    2.2.3 General Submodular Maximization
  2.3 The Limitations of Optimization from Samples
    2.3.1 A Framework for Hardness of Optimization from Samples
    2.3.2 Submodular Maximization
    2.3.3 Maximum Coverage
    2.3.4 Curvature
  2.4 References and Acknowledgments

3 Adaptive Optimization
  3.1 Adaptivity
    3.1.1 The Adaptive Complexity Model
    3.1.2 The Adaptivity Landscape for Submodular Optimization
    3.1.3 Main Result
    3.1.4 Adaptive Sampling: a Coupling of Learning and Optimization
    3.1.5 Overview of Results
  3.2 Adaptive Algorithms
    3.2.1 An Algorithm with Logarithmic Adaptivity
    3.2.2 Experiments
    3.2.3 The Optimal Approximation
    3.2.4 Non-monotone Functions
    3.2.5 Constraints
  3.3 Adaptivity Lower Bound
  3.4 References and Acknowledgments

Acknowledgments

First and foremost, I would like to thank my advisor Yaron Singer. Yaron has taught me everything about research, from long-term research vision to choosing the right font for a presentation. I have been extremely fortunate to have an advisor who has always believed in me, cares so much about my success, and tirelessly mentored and advised me through every stage of my PhD.

My family has been a vital source of support through the years. I am grateful to my parents, Cecile and Yves, for all the encouragement and freedom to explore every project and idea I have had since a very young age. Thank you to my siblings Sophie and Jef for all the good times laughing together and the endless singing in the car during vacations; I cannot wait for our next family vacation. Thanks also to my girlfriend Meghan for the emotional support and for dealing with me during paper deadlines.

I am very grateful for my two internships at Google Research NYC, which have played an important role in my PhD. During these two summers, I met and worked with wonderful people, broadened my research horizons, and explored new directions. Thank you in particular to my hosts and collaborators Umar Syed, Sergei Vassilvitskii, Balu Sivan, Renato Paes Leme, and Vahab Mirrokni.

I have also had the chance to work with incredible collaborators from whom I have learned a lot and who made significant contributions to this thesis: Aviad Rubinstein, Jason Hartline, Andreas Krause, Baharan Mirzasoleiman, Nicole Immorlica, Amir Globerson, Nir Rosenfeld, and Adam Breuer. I am also grateful for the numerous discussions about the content of this thesis with Thibaut Horel, whose breadth of knowledge has been very helpful.

I am fortunate to have spent my PhD working in a fun and warm environment. Thank you to my officemates through the years, Emma Heikensten, Siri Isaksson, Dimitris Kalimeris, Gal Kaplun, Sharon Qian, Greg Stoddard, Bo Waggoner, and Ming Yin; I will miss the great atmosphere in MD115. The EconCS group and Maxwell Dworkin have been amazingly friendly and supportive places to grow academically; thanks in particular to David Parkes, Yiling Chen, Jean Pouget-Abadie, Hongyao Ma, Chara Podimata, Jarek Blasiok, Debmalya Mandal, Preetum Nakkiran, Lior Seeman, Goran Radanovic, and Yang Liu, and to my research committee, Yaron Singer, David Parkes, Sasha Rush, and Michael Mitzenmacher.

I am fortunate for the countless and diverse opportunities I had during my undergraduate studies at Carnegie Mellon, which helped me grow. These opportunities led me to valuable research, teaching, leadership, social, and athletic experiences. Thank you in particular to my undergraduate research advisor, Ariel Procaccia, who was the first person to tell me he believed I could become a professor, and to my academic advisor John Mackey for all the opportunities.

This thesis was supported in part by a Google PhD Fellowship and a Smith Family Graduate Science and Engineering Fellowship.

Chapter 1

Introduction

The field of optimization has recently expanded to new application domains, where complex decision-making tasks are challenging existing frameworks and techniques. A main difficulty is that the scale of these tasks is growing at a vertiginous rate. Innovative frameworks and techniques are needed to capture modern application domains as well as address the challenges posed by large scale computation. We begin by discussing three application domains where optimization has recently played an important role.

The first domain is genomics. Recent developments in computational biology that allow processing large amounts of genomic data have been a catalyst for progress towards understanding the human genome. One approach that helps to understand and process massive gene datasets is to cluster gene sequences. Given gene clusters, a small representative subset of genes, in other words a summary of the entire dataset, can be obtained by choosing one gene sequence from each cluster. The problem of clustering gene sequences is an example of a large scale optimization problem.

A different domain is recommender systems. Twenty years ago, a Friday movie night would require browsing through multiple aisles at Blockbuster to find the right movie. Today, streaming services use recommender systems to suggest a personalized collection of movies to a user. Recommender systems are not limited to movies, but are also used for music and online retail. Given the preferences of different users, optimization techniques are used to find diverse and personalized collections of movies, songs, or products.

The third domain, ride-sharing services, did not even exist a decade ago. Ride-sharing services have revolutionized transportation by allowing passengers to request a ride to their desired destination in a few seconds using their smartphone. These companies face novel and complex optimization tasks. One example is driver dispatch, which is the problem of assigning drivers to different locations in order to match the demand from riders.

Gene sequences, user ratings for music and movies, and rider-driver locations are examples of large datasets that we wish to harness using optimization. Novel algorithmic techniques and frameworks that are adapted for these new domains are needed. A standard approach to these optimization problems is to use an algorithmic technique called the greedy approach. Informally, a greedy algorithm takes small local steps towards building a global solution to a problem. For example, for the driver dispatch problem in New York City, a greedy algorithm might first dispatch a first driver to bustling midtown Manhattan. Then, greedy could send another driver to a different location in the financial district, then a third driver to a concert in Brooklyn, and so on. This greedy technique is very popular in the field of algorithms; it has been (and still is) used to solve a wide variety of problems. In New York City, ride-sharing companies complete hundreds of thousands of trips per day [NYTimes, 2018]. In large cities, a greedy approach of picking only one location for each driver at a time by iteratively considering each driver one after the other is of course unreasonable.

Adaptivity. The greedy approach is unreasonable for large scale problems mainly because it is highly sequential: it only assigns a location to a single driver at each step of the algorithm. A crucial feature for an algorithm to scale to large instances is how parallelizable this algorithm is. Intuitively, a ride-sharing service should not pick the next location sequentially for one driver at a time, but should be choosing the locations of multiple drivers in parallel at the same time. Adaptivity is a convenient measure which captures this notion of sequentiality of an algorithm. In other words, it measures to what extent an algorithm can be parallelized for large scale problems.

Submodular optimization. In optimization, it is desirable to develop generic algorithmic solutions which can be applied to a large family of problems, in other words, algorithms which are not tailored to one specific application domain. One may believe that it is possible to use similar optimization techniques for a movie recommendation system and a music recommendation system. Perhaps more surprising is the fact that the problems from the three different application domains previously mentioned (genomics, recommender systems, and driver dispatch) all share some common feature, or structure, which allows grouping them in a single family of problems solved with the same algorithmic approaches. This family of problems is called submodular optimization problems. Submodularity captures a simple and natural diminishing returns property. An illustration of this diminishing returns property is that the gain from sending an additional driver to midtown Manhattan diminishes if there is already a large number of drivers in midtown Manhattan. Fundamental quantities that we often wish to optimize, such as coverage, diversity, entropy, or mutual information, all exhibit this diminishing returns property. This explains why a wide variety of problems, such as clustering, recommender systems, facility location, influence in networks, and users' preferences over goods, just to name a few, are all submodular.

1.1 Adaptive Sampling: An Exponential Speedup

In this thesis, we develop a novel optimization technique which leads to exponentially faster algorithms for solving submodular optimization problems. We begin by discussing this exponential speedup in this section. In Section 1.2, we give an overview of the results and present the organization of this thesis. In the preliminaries in Section 1.3, we introduce formal definitions and notation. We further discuss adaptivity in Section 1.4, in particular related work on adaptivity, related models of parallel computation, and application areas.

For the canonical problem of maximizing a non-decreasing submodular function under a cardinality constraint, it is well known that the celebrated greedy algorithm which iteratively adds elements whose marginal contribution is largest achieves a 1 − 1/e approximation [Nemhauser et al., 1978]. Furthermore, this approximation guarantee is optimal for any algorithm that uses polynomially-many value queries [Nemhauser and Wolsey, 1978]. The optimal approximation guarantee of the greedy algorithm comes at a price of high adaptivity. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round. Adaptivity provides a measure of efficiency of parallel computation (see Section 1.4.2 for related models of parallel computation). For a cardinality constraint k and a ground set of size n, the greedy algorithm is k-adaptive since it sequentially adds elements in k rounds. In each round it makes O(n) function evaluations to identify and include the element with maximal marginal contribution to the set of elements selected in previous rounds. In the worst case k ∈ Ω(n), and thus the greedy algorithm is Ω(n)-adaptive and its parallel running time is Ω(n). Since submodular optimization is regularly applied on very large datasets, adaptivity is crucial as algorithms with low adaptivity enable dramatic speedups in parallel computing time.
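The k-round structure of greedy can be sketched in a few lines (a hypothetical toy implementation, not code from the thesis; the function and helper names are illustrative):

```python
def greedy(f, ground_set, k):
    """Classic greedy for max f(S) s.t. |S| <= k.

    Each outer iteration is one adaptive round: the marginal-value
    queries inside it depend only on the solution built in previous
    rounds, never on each other -- hence k adaptive rounds in total.
    """
    S = set()
    for _ in range(k):  # k sequential (adaptive) rounds
        # O(n) mutually independent queries within this round
        gains = {a: f(S | {a}) - f(S) for a in ground_set - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.add(best)
    return S

# Toy coverage objective: element i covers the points in cover[i].
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"d"}}
f = lambda S: len(set().union(*(cover[i] for i in S))) if S else 0
S = greedy(f, set(cover), 2)
print(len(S), f(S))  # 2 3
```

Each iteration of the outer loop must wait for the previous one to finish, which is exactly why greedy is k-adaptive even though the queries within a single round are embarrassingly parallel.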
Submodular optimization has been studied for well over forty years now, and in the past decade there has been extensive study of submodular maximization for large datasets [Jegelka et al., 2011, Badanidiyuru et al., 2012b, Kumar et al., 2015a, Jegelka et al., 2013, Mirzasoleiman et al., 2013, Wei et al., 2014, Nishihara et al., 2014, Badanidiyuru and Vondrák, 2014, Pan et al., 2014, Badanidiyuru et al., 2014, Mirzasoleiman et al., 2015a, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2015b, Barbosa et al., 2015, Mirzasoleiman et al., 2016, Barbosa et al., 2016, Epasto et al., 2017]. Somewhat surprisingly, however, until very recently there was no known constant-factor approximation algorithm for submodular maximization whose adaptivity is sublinear in n.

Our main contribution is a novel optimization technique called adaptive sampling which leads to constant factor approximation algorithms for submodular maximization in only logarithmically many adaptive rounds. This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms. Furthermore, we show that no algorithm can achieve a constant factor approximation in õ(log n) rounds. Thus, the adaptive complexity of submodular maximization, i.e., the minimum number of rounds r such that there exists an r-adaptive algorithm which achieves a constant factor approximation, is logarithmic up to lower order terms.

The high level idea behind adaptive sampling is to adaptively construct distributions over sets of elements. Based on the value of samples from previous rounds, the algorithm constructs a new distribution at every round. It then samples multiple sets from this new distribution and performs function evaluations over these samples in parallel.
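As a rough illustration of this idea, the sketch below (a simplified toy, not the algorithm analyzed later in this thesis; the names and the particular filtering rule are illustrative assumptions) performs one adaptive-sampling round: it draws many sets from the current distribution, evaluates them in a single non-adaptive batch, and uses the observed values to narrow the distribution for the next round.

```python
import random

def adaptive_sampling_round(f, candidates, solution, k, samples=50):
    """One round of a simplified adaptive-sampling scheme.

    Every f-evaluation below depends only on information from previous
    rounds, so the whole batch forms a single adaptive round and could
    be executed in parallel.
    """
    t = max(1, k - len(solution))
    # Sample sets from the current distribution: uniform t-subsets.
    batch = [set(random.sample(sorted(candidates), min(t, len(candidates))))
             for _ in range(samples)]
    marginals = [f(solution | R) - f(solution) for R in batch]
    threshold = sum(marginals) / len(marginals)

    # Estimate each element's average contribution on top of a sample.
    def estimate(a):
        vals = [f(solution | R | {a}) - f(solution | R) for R in batch[:10]]
        return sum(vals) / len(vals)

    # Narrow the distribution: keep elements with high estimated value.
    survivors = {a for a in candidates if estimate(a) >= threshold / t}
    return survivors if survivors else candidates

# Toy coverage objective over a 5-point universe.
cover = {i: {i % 5, (2 * i + 1) % 5} for i in range(20)}
f = lambda S: len(set().union(*(cover[i] for i in S))) if S else 0
random.seed(0)
survivors = adaptive_sampling_round(f, set(cover), set(), k=3)
print(survivors <= set(cover))  # True
```

Repeating such rounds logarithmically many times, with the distribution reshaped after each batch, is the pattern behind the guarantees stated above.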

1.2 Results Overview

We study two closely related lines of work. The first is in the optimization from samples model, where the goal is to understand which guarantees are obtainable when optimizing an objective that is learned from data. The second is in the adaptive complexity model, which is motivated by results in the optimization from samples model and where we develop faster parallel algorithms for large scale applications in machine learning.

1.2.1 From Predictions to Decisions: Optimization from Samples

The first part of this thesis studies optimization from samples in Section 2. The traditional approach in optimization typically assumes there is an underlying model known to the algorithm designer, and the goal is to optimize an objective function defined through the model. In a routing problem, for example, the model is a weighted graph which encodes roads and their congestion, and the objective is to select a shortest route. In influence maximization, we are given a weighted graph which models the likelihood of individuals forwarding information, and the objective is to select a subset of nodes that maximizes the spread of information [Kempe et al., 2003]. In many applications like influence maximization or routing, we do not actually know the objective functions we wish to optimize since they depend on the behavior of the world generating the model. In such cases, we gather data about the objective function from past observations, such as yesterday's traffic or past information cascades in social networks. A reasonable approach is to use this data to learn a surrogate function that approximates the function generating the data and optimize the surrogate, as illustrated in Figure 1.1. However, this approach performs poorly even for simple examples as the optima of the learned function may be far from the optima of the true function [Narasimhan et al., 2015]. The sensitivity of optimization to the learning method raises the following question: can we actually optimize objective functions from the training data we use to learn them?

The model of optimization from samples. Answering this question requires synthesizing learning theory and the theory of optimization into a new theory for optimization from sampled data. We suggest the model of optimization from samples in Section 2.1, where the input is samples {(S_i, f(S_i))}_{i=1}^m, where the sets S_i are drawn from a distribution D, as in PAC-learning. The goal is to solve the problem max_{S ∈ M} f(S) for some function f : 2^N → ℝ and constraint M, as in optimization. If we believe that learning functions from data and then optimizing them leads to desirable outcomes, then functions that are learnable and optimizable when given f should be optimizable from samples.

Figure 1.1: Learning and optimization for influence maximization. The model is a weighted network, the observed data is cascades of information, and the decision is to find the most influential nodes.

Main result: there exist functions that are both learnable and optimizable but not optimizable from samples. Somewhat surprisingly, however, we show that this is not the case. Coverage functions are PMAC-learnable [Badanidiyuru et al., 2012a] and approximately optimizable under a cardinality constraint when given full information about f [Nemhauser et al., 1978], but there is no constant factor approximation for maximizing a coverage function under a cardinality constraint when given polynomially-many samples drawn from any distribution (Section 2.3). This result implies that in general there are no guarantees when optimizing a coverage function that is learned from sampled data, which is often the case for applications of maximum coverage in machine learning, mechanism design, and data-mining.

Positive results. For the special cases of functions with bounded curvature and influence in the stochastic block model, which models community structure, we develop optimization from samples algorithms with constant approximation guarantees (Section 2.2).

Future directions. Optimization from samples is a general framework for decision-making from data. Beyond the maximum coverage problem, one may ask whether other decision problems are approximable from samples. During an internship at Google, we considered the problem of cost sharing when the underlying cooperative game is not known and only samples are observed [Balkanski et al., 2017c]. Reinforcement learning is another interesting decision problem to explore from samples.

1.2.2 Faster Parallel Algorithms Through Adaptive Sampling

Since sharp impossibility results arise when given samples drawn from any distribution, we turned to an adaptive sampling model (Section 3). In adaptive sampling, similarly to active learning, the algorithm obtains multiple batches of samples, where each batch is drawn from a new distribution chosen by the algorithm based on the previous batches. The central question with adaptive sampling is then how many batches of samples are needed for optimizing the function. This question carries strong implications for parallelization since samples that are drawn from the same distribution are non-adaptive and can be evaluated in parallel. This connection to parallelization is formalized by the adaptive complexity model, which we introduced in the context of combinatorial optimization. In this model, the adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute function evaluations in parallel. An algorithm using r batches of samples is hence an r-adaptive algorithm.

Previous algorithms for submodular optimization have linear adaptivity. For the canonical problem of maximizing a monotone submodular function under a cardinality constraint k, the celebrated greedy algorithm which adds at every round the element with largest marginal contribution achieves a 1 − 1/e approximation [Nemhauser et al., 1978], which is optimal [Feige, 1998]. Although greedy achieves an optimal approximation, it is highly sequential and has k adaptive rounds. Since k ∈ Ω(n) in the worst case, greedy is an Ω(n)-adaptive algorithm. For large scale applications of submodularity in machine learning, such as data summarization, recommendation systems, clustering, and feature selection, we are naturally interested in algorithms with faster parallel runtime. Somewhat surprisingly, until very recently, no constant factor approximation algorithms with sublinear adaptivity were known for this problem (Figure 1.2).

Figure 1.2: The adaptivity landscape for submodular optimization.

Figure 1.3: The adaptivity landscape for submodular optimization including recent results.

Figure 1.4: For gene clustering and movie recommendation applications, adaptive sampling matches the performance of greedy with a significantly smaller number of adaptive rounds.

Main result: an algorithm with logarithmic adaptivity. In Section 3.2, we develop adaptive sampling algorithms with only logarithmically many batches of samples. Our main result is an O(log n)-adaptive algorithm with an approximation arbitrarily close to 1/3 for the problem of maximizing a monotone submodular function under a cardinality constraint (Section 3.2.1). This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms (Figure 1.3). Furthermore, we show in Section 3.3 that no algorithm can achieve a constant factor approximation in õ(log n) rounds.

Empirical evaluations. In Section 3.2.2 and Section 3.2.4, we analyzed the empirical performance of adaptive sampling algorithms for the following machine learning applications: movie recommendation, taxi dispatch, image summarization, revenue maximization in social networks, and traffic monitoring. These experiments demonstrated that adaptive sampling is a technique that is also powerful in practice. The empirical results showed significant speedups in parallel runtime while matching the performance of the standard greedy algorithm (Figure 1.4).

Algorithms:
  Family         Approximation              Rounds                  Section
  Submodular     Ω̃(n^{-1/4})               1                       2.2.3
  Influence      constant                   1                       2.2.2
  Curvature c    (1−c)/(1+c−c²) − o(1)      1                       2.2.1
  Submodular     1/3 − ε                    O((1/ε²) log n)         3.2.1
  Submodular     1 − 1/e − ε                O((1/ε²) log n)         3.2.3
  Non-monotone   1/(2e) − ε                 O((1/ε²) log² n)        3.2.4
  Matroid        1 − 1/e − ε                O((1/ε³) log² n)        3.2.5

Hardness:
  Family         Approximation              Rounds                  Section
  Submodular     O(n^{-1/4+ε})              1                       2.3.2
  Coverage       O(n^{-1/5+ε})              1                       2.3.3
  Curvature c    (1−c)/(1+c−c²) + o(1)      1                       2.3.4
  Submodular     1/log n                    O(log n / log log n)    3.3

Table 1.1: Overview of results. Unless otherwise specified, these results are for monotone submodular maximization under a cardinality constraint.

Additional results in the adaptive complexity model. Very recent work by our group and other groups has improved the approximation guarantee to the optimal 1 − 1/e (Section 3.2.3 and [Ene and Nguyen, 2019]), as well as the number of function evaluations [Fahrbach et al., 2019]. Non-monotone functions (Section 3.2.4 and [Ene et al., 2019, Fahrbach et al., 2018]) and more complex constraints (Section 3.2.5 and [Chekuri and Quanrud, 2019a,b, Ene et al., 2019]) have also been studied recently in the adaptive complexity model.

Future directions. At the Broad Institute and the Harvard Molecular and Cellular Biology department, large collections of genes are regularly clustered for DNA analysis. In preliminary experiments, the main runtime bottleneck with this clustering application was the precomputation of pairwise distances between gene sequences. This raises a new algorithmic challenge of reducing the number of pairwise distance computations while incurring only a small loss in the clustering objective. More generally, there remain many open problems on adaptivity and submodular optimization.

Organization. Table 1.1 summarizes our results and the organization of this thesis. We reference at the end of each chapter the specific papers covered in that chapter.

1.3 Preliminaries

Submodular optimization. A set function f : 2^N → ℝ maps subsets of elements S ⊆ N, where N is the ground set of n = |N| elements, to a value f(S) ∈ ℝ. The marginal contribution of a set X ⊆ N to a set S ⊆ N is defined as f_S(X) := f(S ∪ X) − f(S).¹

Definition 1. A function f : 2^N → ℝ is submodular if for any sets S ⊆ T ⊆ N and any element a ∈ N \ T, we have f_S(a) ≥ f_T(a).

A function is monotone if f(S) ≤ f(T) for all sets S ⊆ T. The canonical problem for submodular optimization is to maximize a monotone submodular function under a cardinality constraint, i.e., find a subset of elements S ⊆ N of size at most k which maximizes f:

max_{S : |S| ≤ k} f(S).

We measure the performance of an algorithm by its approximation guarantee.

Definition 2. Let F be a class of functions. An algorithm obtains an α-approximation for maximizing F under a constraint M if, for all f ∈ F, it finds a feasible solution S ∈ M such that f(S) ≥ α · max_{T ∈ M} f(T).

¹For readability we abuse notation and write a instead of {a} when evaluating a singleton a ∈ N.

It is well-known that constant factor approximation algorithms can be obtained for monotone submodular maximization under a cardinality constraint, but that it is impossible to maximize submodular functions exactly, i.e., with α = 1, in general.

Adaptivity. As is standard, we assume access to a value oracle for the function such that for any set S ⊆ N the oracle returns f(S) in O(1) time. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round.

Definition 3. Given a value oracle for f, an algorithm is r-adaptive if every query f(S) for the value of a set S occurs at a round i ∈ [r] such that S is independent of the values f(S′) of all other queries at round i, with at most poly(n) queries at every round.

In the next section, we discuss adaptivity and parallel computing.
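Operationally, a round is a batch of queries that are all known before any of them is answered, so they can be dispatched together. A minimal sketch using only the standard library (the oracle here is a stand-in, not an actual objective from this thesis):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_round(f, queries):
    """Evaluate one adaptive round: a batch of mutually independent
    queries executed in parallel. In wall-clock terms the round costs
    roughly one query-evaluation time, not len(queries) of them."""
    with ThreadPoolExecutor() as pool:
        return dict(zip(queries, pool.map(f, queries)))

# An r-adaptive algorithm issues r such batches; each batch may depend
# on the answers to all previous batches, but not on its own.
f = lambda S: len(S)  # stand-in for an expensive value oracle
round_1 = [frozenset({1, 2}), frozenset({3})]
answers = evaluate_round(f, round_1)
print(answers[frozenset({3})])  # 1
```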

Other classes of functions. In this thesis, we also consider classes of functions other than submodular. Coverage functions are the canonical example of submodular functions.

Definition 4. A function is called coverage if there exists a family of sets T₁, …, T_n that covers subsets of a universe U with weights w(a_j) for a_j ∈ U such that for all S, f(S) = Σ_{a_j ∈ ∪_{i ∈ S} T_i} w(a_j). A coverage function is polynomial-sized if the universe is of polynomial size in n.

Influence maximization is a generalization of maximizing coverage functions under a cardinality constraint.
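Definition 4 is easy to instantiate and to sanity-check against Definition 1; the small script below (illustrative names, not code from the thesis) builds a weighted coverage function and verifies submodularity by brute force:

```python
from itertools import combinations

def make_coverage(T, w):
    """Coverage function: f(S) = sum of w(a) over a in the union of T_i, i in S."""
    def f(S):
        covered = set().union(*(T[i] for i in S)) if S else set()
        return sum(w[a] for a in covered)
    return f

# Universe U = {a, b, c} with unit weights, covered by n = 3 sets.
T = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}
w = {"a": 1, "b": 1, "c": 1}
f = make_coverage(T, w)

# Brute-force check of Definition 1: f_S(a) >= f_T(a) for S ⊆ T, a ∉ T.
N = set(T)
subsets = [set(c) for r in range(len(N) + 1) for c in combinations(N, r)]
submodular = all(
    f(S | {a}) - f(S) >= f(Tset | {a}) - f(Tset)
    for S in subsets for Tset in subsets if S <= Tset
    for a in N - Tset
)
print(submodular)  # True
```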

An important property of submodular functions that has been heavily explored recently is that of curvature. Informally, the curvature is a measure of how far the function is from being modular.

Definition 5. A function f has curvature c ∈ [0, 1] if f_S(a) ≥ (1 − c)f(a) for any set S ⊆ N and element a ∈ N.

If c = 0, then the function f is modular, i.e., f(S) = Σ_{a ∈ S} f(a).

1.4 Discussion About Adaptivity

1.4.1 Adaptivity in Other Areas

Adaptivity has been heavily studied across a wide spectrum of areas in computer science. These areas include classical problems in theoretical computer science such as sorting and selection (e.g. [Valiant, 1975, Cole, 1988, Braverman et al., 2016]), where adaptivity is known under the term of parallel algorithms, and communication complexity (e.g. [Papadimitriou and Sipser, 1984, Duris et al., 1984, Nisan and Wigderson, 1991, Miltersen et al., 1995, Dobzinski et al., 2014, Alon et al., 2015]), where the number of rounds measures how much interaction is needed for a communication protocol. For the multi-armed bandits problem, the relationship of interest is between adaptivity and query complexity, instead of adaptivity and approximation guarantee. Recent work showed that Θ(log* n) adaptive rounds are necessary and sufficient to obtain the optimal worst case query complexity [Agarwal et al., 2017]. In the bandits setting, adaptivity is necessary to obtain non-trivial query complexity due to the noisy outcomes of the queries. In contrast, queries in submodular optimization are deterministic, and adaptivity is necessary to obtain a non-trivial approximation since there are at most polynomially many queries per round and the function is of exponential size. Adaptivity is also well-studied for the problems of sparse recovery (e.g. [Haupt et al., 2009a, Indyk et al., 2011, Haupt et al., 2009b, Ji et al., 2008, Malioutov et al., 2008, Aldroubi et al., 2008]) and property testing (e.g. [Canonne and Gur, 2017, Buhrman et al., 2012, Chen et al., 2017, Raskhodnikova and Smith, 2006, Servedio et al., 2015]). In these areas, it has been shown that adaptivity allows significant improvements compared to the non-adaptive setting, which is similar to the results shown in this thesis for submodular optimization. However, in contrast to all these areas, adaptivity had not been previously studied in the context of submodular optimization.

We note that the term adaptive submodular maximization has been previously used, but in an unrelated setting where the goal is to compute a policy which iteratively picks elements one by one, which, when picked, reveal stochastic feedback about the environment [Golovin and Krause, 2010].

1.4.2 Related Models of Parallel Computation

In this section, we discuss two related models, the Map-Reduce model for distributed computation and the PRAM model. These models are compared to the notion of adaptivity in the context of submodular optimization.

Map-Reduce

The problem of distributed submodular optimization has been extensively studied in the Map-Reduce model in the past decade. This framework is primarily motivated by large scale problems over massive data sets. At a high level, in the Map-Reduce framework [Dean and Ghemawat, 2008], an algorithm proceeds in multiple Map-Reduce rounds, where each round consists of a first step where the input to the algorithm is partitioned to be independently processed on different machines and of a second step where the outputs of this processing are merged. Notice that the notion of rounds in Map-Reduce is different from that of adaptivity, where one round of Map-Reduce usually consists of multiple adaptive rounds. The formal model of [Karloff et al., 2010] for Map-Reduce requires the number of machines and their memory to be sublinear. This framework for distributing the input to multiple machines with sublinear memory is designed to tackle issues related to massive data sets. Such data sets are too large to either fit or be processed by a single machine and the Map-Reduce framework formally models this need to distribute such inputs to multiple machines. Instead of addressing distributed challenges, adaptivity addresses the issue of sequentiality,

where each query evaluation requires a long time to complete and where these evaluations can be parallelized (see Section 1.4.3 for applications). In other words, while Map-Reduce

addresses the horizontal challenge of large scale problems, adaptivity addresses an orthogonal vertical challenge where long query-evaluation time is causing the main runtime bottleneck. A long line of work has studied problems related to submodular maximization in Map-Reduce, achieving different improvements on parameters such as the number of Map-Reduce rounds, the communication complexity, the approximation ratio, the family of functions, and the family of constraints (e.g. [Kumar et al., 2015a, Mirzasoleiman et al., 2013, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2015b, Barbosa et al., 2015, 2016, Epasto et al., 2017]). To the best of our knowledge, all the existing Map-Reduce algorithms for submodular optimization have adaptivity that is linear in n in the worst-case, which is exponentially larger than the adaptivity of our algorithm. This high adaptivity is caused by the distributed algorithms which are run on each machine. These algorithms are variants of the greedy algorithm and thus have adaptivity at least linear in k. We also note that our algorithm does not (at least trivially) carry over to the Map-Reduce setting.

PRAM

In the PRAM model, the notion of depth is closely related to the concept of adaptivity studied in this thesis. Our positive result extends to the PRAM model, showing that there is an Õ(log² n · d_f) depth algorithm with Õ(nk²) work whose approximation is arbitrarily close to 1/3 for maximizing any monotone submodular function under a cardinality constraint, where d_f is the depth required to evaluate the function on a set. The PRAM model is a generalization of the RAM model with parallelization: it is an idealized model of a shared memory machine with any number of processors which can

execute instructions in parallel. The depth of a PRAM algorithm is the number of parallel steps of this algorithm on the PRAM, in other words, it is the longest chain of dependencies

of the algorithm, including operations which are not necessarily queries. The problem of designing low-depth algorithms has been heavily studied (e.g. [Blelloch, 1996, Blelloch et al., 2011, Berger et al., 1989, Rajagopalan and Vazirani, 1998, Blelloch and Reid-Miller, 1998, Blelloch et al., 2012]). Thus, in addition to the number of adaptive rounds of querying, depth also measures the number of adaptive steps of the algorithms which are not queries. However, for the applications we consider, the runtime of the algorithmic computations which are not queries is usually insignificant compared to the time to evaluate a query. In addition, the PRAM model assumes that the input is loaded in memory, while we consider the value query model where the algorithm is given oracle access to a function of potentially exponential size. In crowdsourcing applications, for example, where the value of a set can be queried on a crowdsourcing platform, there does not necessarily exist a succinct representation of the underlying function. Our positive results extend with an additional Õ(d_f · log n) factor in the depth compared

to the number of adaptive rounds, where d_f is the depth required to evaluate the function on a set in the PRAM model. The operations that our algorithms perform at every round, which are maximum, summation, set union, and set difference over an input of size at most quasilinear, can all be executed by algorithms with logarithmic depth. A simple divide-and-conquer approach suffices for maximum and summation, while logarithmic depth for set union and set difference can be achieved with treaps [Blelloch and Reid-Miller, 1998].
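The divide-and-conquer reduction mentioned above can be sketched as follows. This is an illustrative simulation (not code from the thesis): each level of a balanced binary tree combines disjoint pairs, so all combinations on a level could run in parallel and n values need only ⌈log₂ n⌉ rounds.

```python
# Sketch: simulating a logarithmic-depth parallel reduction for
# operations such as maximum and summation.

import math

def parallel_reduce(values, combine):
    """Reduce `values` with `combine` using a balanced binary tree.

    Returns (result, rounds), where `rounds` is the tree depth, i.e. the
    number of parallel steps a PRAM would need.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # All pairs on a level are independent and could run in parallel.
        level = [
            combine(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)
        ]
        rounds += 1
    return level[0], rounds

maximum, depth = parallel_reduce(range(1000), max)
total, _ = parallel_reduce(range(1000), lambda a, b: a + b)
assert maximum == 999 and total == sum(range(1000))
assert depth == math.ceil(math.log2(1000))  # 10 parallel rounds
```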

1.4.3 Applications of Adaptivity

Beyond being a fundamental concept, adaptivity is important for applications where sequentiality is the main runtime bottleneck.

Crowdsourcing and data summarization. One class of problems where adaptivity plays an important role is human-in-the-loop problems. At a high level, these algorithms involve subtasks performed by the crowd. The intervention of humans in the evaluation of queries renders algorithms with a large number of adaptive rounds impractical. A crowdsourcing platform consists of posted tasks and crowdworkers who are remunerated for performing these posted tasks. For the submodular problem of data summarization, where the objective is to select a small representative subset of a dataset, the quality of subsets as representatives can be evaluated on a crowdsourcing platform [Tschiatschek et al., 2014, Singla et al., 2016, Braverman et al., 2016]. The algorithm must wait to obtain the feedback from the crowdworkers; however, it can send out a large number of tasks to be performed simultaneously by different crowdworkers.

Biological simulations. Adaptivity is also studied in molecular biology to simulate protein folding. Adaptive sampling techniques are used to obtain significant improvements in execution of simulations and discovery of low energy states [Bowman et al., 2011].

Experimental Design. In experimental design, the goal is to pick a collection of entities (e.g. subjects, chemical elements, data points) which obtains the best outcome when combined for an experiment. Experiments can be run in parallel and have a waiting time to observe the outcome [Frazier et al., 2010].

Influence Maximization. The submodular problem of influence maximization, initially studied by Domingos and Richardson [2001a], Richardson and Domingos [2002], Kempe et al. [2003], has since received considerable attention (e.g. [Chen et al., 2009, 2010, Goyal et al., 2011, Seeman and Singer, 2013, Horel and Singer, 2015, Badanidiyuru et al., 2016]). Influence maximization consists of finding the most influential nodes in a social network to maximize the spread of information in this network. Information does not spread instantly

and a waiting time occurs when observing the total number of nodes influenced by some seed set of nodes.

Advertising. In advertising, the goal is to select the optimal subset of advertisement slots to maximize objectives such as the click-through-rate or the number of products purchased by customers, which are objectives exhibiting diminishing returns [Alaei and Malekian, 2010, Devanur et al., 2016]. Naturally, a waiting time is incurred to observe the behavior of customers.

Chapter 2

Non-Adaptive Optimization

A natural starting point to study the number of rounds of adaptivity needed for optimization is to consider the case of a single round of adaptivity, which corresponds to non-adaptive algorithms. Non-adaptive algorithms may perform polynomially many function evaluations, but a function evaluation cannot depend on the outcome of another function evaluation. Even though the question of what is achievable in a single round of adaptivity is interesting in its own right, the main motivation for studying non-adaptivity is due to its role in the subtle interplay between machine learning and optimization, and specifically for data-driven optimization. We begin by discussing data-driven optimization through a novel model of optimization from samples. In Section 2.1.2, we explain why optimization from samples plays a crucial role for adaptivity and establish a formal connection between these two models.

2.1 From Predictions to Decisions: Optimization from

Samples

The traditional approach in optimization typically assumes there is an underlying model known to the algorithm designer, and the goal is to optimize an objective function defined


Figure 2.1: Learning and optimization illustrated for the problem of influence maximization. The model is a network, the observed data is cascades of information, and the decision is to find the most influential nodes.

through the model. In a routing problem, for example, the model can be a weighted graph which encodes roads and their congestion, and the objective is to select a route that minimizes expected travel time from source to destination. In influence maximization, we are given a weighted graph which models the likelihood of individuals forwarding information, and the objective is to select a subset of nodes to spread information and maximize the expected number of nodes that receive information [Kempe et al., 2003]. In many applications like influence maximization or routing, we do not actually know the objective functions we wish to optimize since they depend on the behavior of the world generating the model. In such cases, we gather information about the objective function from past observations. A reasonable approach is to learn a surrogate function that approximates the function generating the data (e.g. [Daneshmand et al., 2014, Du et al., 2013, Gomez-Rodriguez et al., 2010]) and optimize the surrogate, as illustrated in Figure 2.1. In routing, we may observe traffic, fit weights to a graph that represents congestion times, and optimize for the shortest path on the weighted graph learned from data. In influence maximization, we can observe information spreading in a social network, fit weights to a graph that encodes the influence model and optimize for the k most influential nodes. But what guarantees do we have with this approach?

One problem with optimizing a surrogate learned from data is that it may be inapproximable. For a problem like influence maximization, for example, even if a surrogate

f̃ : 2^N → ℝ approximates a submodular influence function f : 2^N → ℝ within a factor of (1 ± ε) for sub-constant ε > 0, in general there is no polynomial-time algorithm that can obtain a reasonable approximation to max_{S:|S|≤k} f̃(S) or max_{S:|S|≤k} f(S) [Hassidim and Singer, 2015]. A different concern is that the function learned from data may be approximable (e.g. if the surrogate remains submodular), but its optima are very far from the optima of the function generating the data. In influence maximization, even if the weights of the graph are learned within a factor of (1 ± ε) for sub-constant ε > 0, the optima of the surrogate may be a poor approximation to the true optimum [Narasimhan et al., 2015, He and Kempe, 2016]. The sensitivity of optimization to the nuances of the learning method therefore raises the following question:

Can we actually optimize objective functions from the training data we use to learn them?

Doing so requires synthesizing learning theory and the theory of optimization into a new theory for optimization from sampled data. Given sampled data, can we optimize functions for which we know how to predict from data and for which we know how to make decisions from a model? Such a theory has applications well beyond influence maximization or routing, as there are numerous examples where we make decisions from data and seek good outcomes.

In auctions, for example, the auctioneer aims to decide on prices or on an allocation of goods to buyers whose complex valuations are not known but may be inferred from data [Dobzinski

and Schapira, 2006, Lehmann et al., 2001]. In the learning to rank framework in information retrieval, an algorithm receives samples of ranked documents and the goal is to select the k documents that are the most relevant to a query [Kumar et al., 2015b]. In recommendation systems, optimal tagging problems seek to pick k tags for new content to maximize incoming traffic [Rosenfeld and Globerson, 2016].


Figure 2.2: Optimization from samples illustrated for the problem of influence maximization.

2.1.1 The Optimization from Samples Model

To formulate the notion of data-driven optimization, we suggest the model of optimization from samples (OPS). In optimization from samples, the input is samples {(S_i, f(S_i))}_{i=1}^m where the sets S_i are drawn from a distribution D, as in learning. The goal is to solve the problem max_{S∈M} f(S) for some constraint M, as in optimization. More formally:

Definition 6. A class of functions F : 2^N → ℝ is α-optimizable in M from samples over distribution D if there exists a (not necessarily polynomial time) algorithm whose input is a set of samples {(S_i, f(S_i))}_{i=1}^m, where f ∈ F and S_i is drawn i.i.d. from D, and which returns S ∈ M s.t.:

    Pr_{S_1,…,S_m ∼ D} [ f(S) ≥ α · max_{T∈M} f(T) ] ≥ 1 − δ,

where m ∈ poly(|N|) and δ ∈ [0, 1) is a constant.

This framework synthesizes the standard notions of learnability and optimizability, as

illustrated in Figure 2.2. The standard notion of learnability for set functions is PMAC-learnability, which is a generalization of the well-known PAC-learnability model. The standard notion of optimizability is that of APX. A class of functions and a constraint are in APX if, given value query access to the function (given S, the oracle returns f(S)), there is a polynomial-time algorithm that yields a constant-factor approximation for optimizing a function in that class under the constraint.

An algorithm with the above guarantees is an α-OPS algorithm. In this chapter we focus on the simplest constraint, where M = {S ⊆ N : |S| ≤ k} is a cardinality constraint. For a class of functions F we say that optimization from samples is possible when there exist some constant α ∈ (0, 1] and a distribution D s.t. F is α-optimizable from samples over D in M = {S : |S| ≤ k}. The following points are worth noting:

• Optimization from samples is defined per distribution. Note that if we demand opti- mization from samples to hold on all distributions, then trivially no function would be optimizable from samples (e.g. for the distribution which always returns the empty set);

• Optimization from samples seeks to approximate the global optimum. In learning, we evaluate a hypothesis on the same distribution we use to train it since it enables making a prediction about events that are similar to those observed. For optimization it is trivial to be competitive against a sample by simply selecting the feasible solution with maximal value from the set of samples observed. Since an optimization algorithm has the power to select any solution, the hope is that polynomially many samples contain enough information for optimizing the function. In influence maximization, for example, we are interested in selecting a set of influencers, even if we did not observe a set of highly influential individuals that initiate a cascade together.
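The second point above, that beating the best observed sample is trivial while approximating the global optimum is not, can be made concrete with a small sketch. All names and the toy objective here are hypothetical, not from the thesis:

```python
# Sketch: the trivial "best observed feasible sample" baseline in the
# OPS model, contrasted with the global optimum it need not reach.

import itertools
import random

def best_feasible_sample(samples, k):
    """Trivial baseline: return the best observed set of size at most k.

    `samples` is a list of (set, value) pairs, as in the OPS model.
    """
    feasible = [(S, v) for S, v in samples if len(S) <= k]
    return max(feasible, key=lambda p: p[1])[0]

# Toy objective over ground set {0, ..., 9}; modular, so the optimum is known.
random.seed(0)
ground = range(10)
f = lambda S: sum(a + 1 for a in set(S))

k = 3
samples = [(frozenset(random.sample(ground, k)), 0) for _ in range(20)]
samples = [(S, f(S)) for S, _ in samples]

S_baseline = best_feasible_sample(samples, k)
opt = max(f(S) for S in itertools.combinations(ground, k))
# The baseline is only as good as the luckiest sample; the optimum
# {7, 8, 9}, of value 27, need not appear among the observed samples.
assert f(S_baseline) <= opt == 27
```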

Optimization from samples is particularly interesting when functions are learnable and optimizable.

• APX-Optimizability. We are interested in functions f : 2^N → ℝ and constraints M such that, given access to a value oracle (given S the oracle returns f(S)), there exists a constant factor approximation algorithm for max_{S∈M} f(S). For this purpose, monotone submodular functions are a convenient class to work with, where the canonical problem is max_{|S|≤k} f(S). It is well known that there is a 1 − 1/e approximation algorithm for this problem [Nemhauser et al., 1978] and that this is tight using polynomially many value queries [Feige, 1998]. Influence maximization is an example of maximizing a monotone submodular function under a cardinality constraint [Kempe et al., 2003].

• PMAC-learnability. The standard framework in the literature for learning set

functions is Probably Mostly Approximately Correct (α-PMAC) learnability due to Balcan and Harvey [2011]. This framework nicely generalizes Valiant's notion of Probably Approximately Correct (PAC) learnability [Valiant, 1984]. Informally, PMAC-learnability guarantees that after observing polynomially many samples of sets and their function values, one can construct a surrogate function that is likely to, α-approximately, mimic the behavior of the function observed from the samples (see the full version of the paper for formal definitions). Since the seminal paper of Balcan and Harvey, there has been a great deal of work on learnability of submodular functions [Feldman and Kothari, 2014, Balcan et al., 2012, Badanidiyuru et al., Feldman and Vondrák, 2013, Feldman and Vondrák, 2015, Balcan, 2015].
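The 1 − 1/e guarantee mentioned in the APX-optimizability bullet is achieved by the classic greedy algorithm of Nemhauser et al. A minimal sketch, with a toy coverage function and hypothetical names (this is the textbook algorithm, not code from the thesis):

```python
def greedy(f, ground, k):
    """Classic greedy for max_{|S|<=k} f(S): add the element with the
    largest marginal contribution at each of k sequential steps.

    Note the adaptivity: each round's queries depend on the previous
    round's choice, so the greedy algorithm is k-adaptive.
    """
    S = set()
    for _ in range(k):
        a = max((x for x in ground if x not in S),
                key=lambda x: f(S | {x}) - f(S))
        S.add(a)
    return S

# Toy monotone submodular function: coverage of {1, 2, 3, 4}.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 2, 3}}
f = lambda S: len(set().union(*(cover[x] for x in S)) if S else set())

# Round 1 picks "d" (marginal 3); round 2 picks "c" (marginal 1).
assert f(greedy(f, cover.keys(), 2)) == 4
```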

2.1.2 Optimization from Samples is Equivalent to Non-Adaptivity

Non-adaptivity corresponds to one round of adaptivity. Informally, a non-adaptive algorithm may perform polynomially many function evaluations, but a function evaluation cannot depend on the outcome of another function evaluation.

Definition 7. An algorithm is non-adaptive if every query f(S) for the value of a set S is independent of the values f(S′) of all other queries S′ made by the algorithm, with at most poly(n) total number of queries.

An important observation is that optimization from samples algorithms are non-adaptive. The algorithm observes samples, but these samples are drawn independently from the same distribution. Perhaps less intuitive is the fact that any non-adaptive algorithm can be used

to construct an optimization from samples algorithm. We formalize the equivalence of these two notions with the following theorem. For the remainder of this chapter, we consider the optimization from samples model. However, all the results extend to the non-adaptive model.

Theorem 2.1.1. For any class of functions F, there exists a distribution D such that F is α-optimizable from samples if and only if there exists a non-adaptive algorithm that obtains, with probability 1 − δ, an α-approximation for optimizing F, for any constant δ > 0.

Proof. We first show that if there exists a distribution D such that F is α-optimizable from samples with algorithm A_samples, then there exists a non-adaptive algorithm A_non-adaptive that obtains, with probability 1 − δ, an α-approximation. Let m be the number of samples from D required by A_samples. We consider the algorithm A_non-adaptive which first samples m sets S_i ∼ D i.i.d. and queries these m sets to the oracle. Note that these are non-adaptive queries since they are drawn from the same distribution D. Given the values f(S_i) of these queries, A_non-adaptive then mimics A_samples and returns the same set S as A_samples. Since A_samples is an α-optimizable from samples algorithm, the set S returned by A_non-adaptive is an α-approximate solution with probability 1 − δ.

For the reverse direction, consider a non-adaptive algorithm A_non-adaptive that obtains, with probability 1 − δ, an α-approximation. Let S = {S_i}_{i=1}^m be the collection of sets non-adaptively queried by A_non-adaptive. Let D be the uniform distribution over S and let m′ be such that with m′ i.i.d. samples S′ from D, with probability 1 − δ, every set S ∈ S is also in S′. Let A_samples be the algorithm that, if every set S ∈ S is also in S′, mimics A_non-adaptive and returns the same solution S, and otherwise returns an arbitrary solution. By the guarantee of A_non-adaptive, A_samples obtains an α-approximate solution with probability 1 − δ if every set S ∈ S is also in S′.
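The forward direction of the proof can be sketched in a few lines: draw all m query sets i.i.d. up front, evaluate them in one batch, and only then run the OPS algorithm. Names here (the wrapper, the toy oracle and distribution) are illustrative, not from the thesis:

```python
# Sketch of Theorem 2.1.1, forward direction: an OPS algorithm run as a
# one-round, non-adaptive algorithm.

import random

def as_non_adaptive(ops_algorithm, distribution, m, oracle):
    """All m query sets are drawn i.i.d. from `distribution` before any
    value is seen, so no query depends on another query's answer; the
    oracle could evaluate them all in parallel."""
    query_sets = [distribution() for _ in range(m)]   # round 1: choose queries
    samples = [(S, oracle(S)) for S in query_sets]    # round 1: evaluate
    return ops_algorithm(samples)                     # no further queries

# Hypothetical instantiation: the trivial OPS algorithm that returns the
# best observed set, with D uniform over 3-subsets of {0, ..., 9}.
random.seed(1)
oracle = lambda S: sum(S)
D = lambda: frozenset(random.sample(range(10), 3))
best_observed = lambda samples: max(samples, key=lambda p: p[1])[0]

S = as_non_adaptive(best_observed, D, m=50, oracle=oracle)
assert len(S) == 3 and oracle(S) <= 9 + 8 + 7
```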

2.1.3 Overview of Results

We give an overview of the results presented in this chapter.

Main Result

If we believe that learning functions from data and then optimizing them leads to desirable

outcomes, then functions that are PMAC-learnable and in APX should be optimizable from samples. Somewhat surprisingly, however, we show that this is not the case. There is an

interesting class of functions, called coverage functions, that is PMAC-learnable and in APX but cannot be optimized from samples. Coverage functions are a canonical example of monotone submodular functions and are hence optimizable. In terms of learnability, for any constant ε > 0, coverage functions are (1 − ε)-PMAC learnable over any distribution [Badanidiyuru et al.], unlike monotone submodular functions which are generally not PMAC learnable [Balcan and Harvey, 2011]. In Section 2.3.3, we show that there is no constant factor approximation

for maximizing a coverage function using polynomially-many samples drawn from any distribution. Coverage functions are heavily used in mechanism design [Dobzinski and Schapira, 2006, Lehmann et al., 2001], machine learning [Guestrin et al., 2005, Swaminathan et al., 2009], data-mining [Chierichetti et al., 2010, Du et al., 2014b], privacy [Feldman and Kothari, 2014, Gupta et al., 2013], as well as influence maximization [Kempe et al., 2003, Seeman and Singer, 2013]. In many of these applications, the functions are learned from data and the goal is to optimize the function under a cardinality constraint.

Algorithms for Optimization from Samples

Despite the main result being an impossibility, there are classes of functions and distributions for which optimization from samples is possible.

Curvature. We show in Section 2.2.1 that for any monotone submodular function with curvature c there is a (1 − c)/(1 + c − c²) approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. The curvature assumption is crucial as the above impossibility result shows that no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable.

Influence maximization. We then consider in Section 2.2.2 the canonical problem of influence maximization in social networks. Since the seminal work of Kempe et al. [2003], there have been two largely disjoint efforts on this problem. The first studies the problem associated with learning the parameters of the generative influence model. The second focuses on the algorithmic challenge of identifying a set of influencers, assuming the parameters of the generative model are known. The above impossibility result implies that in general, if the generative model is not known but rather learned from training data, no algorithm can yield a constant factor approximation guarantee using polynomially-many samples, drawn from any distribution. We design a simple heuristic that overcomes this negative result in practice by leveraging the strong community structure of social networks. Although in general the approximation guarantee of our algorithm is necessarily unbounded, we show that this algorithm performs well experimentally. To justify its performance, we prove our algorithm obtains a constant factor approximation guarantee on graphs generated through the stochastic block model, traditionally used to model networks with community structure.

Submodular maximization. For general monotone submodular functions, we develop an Õ(n^{−1/4}) optimization from samples algorithm over some distribution D. This bound is essentially tight since submodular functions are not n^{−1/4+ε}-optimizable from samples over any distribution (see Section 2.3.2).

             Algorithms                                |             Hardness
Family       Apx                    Rounds  Section    | Family      Apx                    Rounds  Section
Submodular   Õ(n^{−1/4})            1       2.2.3      | Submodular  O(n^{−1/4+ε})          1       2.3.2
Influence    constant               1       2.2.2      | Coverage    O(n^{−1/5+ε})          1       2.3.3
Curvature    (1−c)/(1+c−c²) − o(1)  1       2.2.1      | Curvature   (1−c)/(1+c−c²) + o(1)  1       2.3.4

Table 2.1: Overview of results for non-adaptive optimization. Unless otherwise specified, these results are for monotone submodular maximization under a cardinality constraint. The most significant result is highlighted in bold.

Matching Lower Bounds

In Section 2.3.2 and Section 2.3.4, we show that the optimization from samples algorithms for submodular functions and functions with curvature are optimal. We show that, up to lower order terms, no algorithm can achieve a better approximation than the one obtained by these algorithms. A summary of the results for this chapter is provided in Table 2.1.

2.2 Optimization from Samples Algorithms

In this section, we describe optimization from samples algorithms for three important classes of functions. By Theorem 2.1.1, these are non-adaptive algorithms with guarantees that extend to the non-adaptive model described in Definition 7. These algorithms are for the problem of maximizing a function f in a class of functions F under a cardinality constraint k. In Section 2.2.1, for the class of monotone submodular functions with curvature c, we give a (1 − c)/(1 + c − c²) − o(1) approximation algorithm for optimization from samples from the uniform distribution over feasible sets, which we show is tight in Section 2.3.4. In Section 2.2.2, we show that despite the general negative results implied by Section 2.3.3 for influence maximization from samples, it is possible to obtain constant approximation guarantees when the underlying network exhibits a community structure. Finally, in Section 2.2.3, we show an Õ(n^{−1/4}) optimization from samples algorithm for general monotone submodular functions, which we show is tight up to lower order terms in Section 2.3.2.

2.2.1 Curvature

In this section, we consider the problem of optimization from samples of monotone submodular functions with bounded curvature. We show that for any monotone submodular function with curvature c there is a (1 − c)/(1 + c − c²) approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. In Section 2.3.4, we show that this algorithm is optimal. That is, for any c < 1, there exists a submodular function with curvature c for which no algorithm can achieve a better approximation. The curvature assumption is crucial as for general monotone submodular functions no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable (Section 2.3). In practice however, the functions we aim to optimize may be better behaved. An important property of submodular functions that has been heavily explored recently is that of curvature. Informally, the curvature is a measure of how far the function is from being modular.

Definition 8. A function f has curvature c ∈ [0, 1] if f_S(a) ≥ (1 − c)f(a) for any set S ⊆ N and element a ∈ N.

If c = 0, then the function f is modular, i.e. f(S) = Σ_{a∈S} f(a). Curvature plays an important role since the hard instances of submodular optimization often occur only when the curvature is unbounded, i.e., c close to 1. The hardness results for optimization from samples are no different, and apply when the curvature is unbounded. The curvature assumption has applications in problems such as maximum entropy sampling [Sviridenko et al., 2015], column-subset selection [Sviridenko et al., 2015], and submodular welfare [Vondrák, 2010].
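Definition 8 can be turned into a brute-force computation for intuition: the curvature is the smallest c for which every marginal contribution f_S(a) is at least (1 − c)f(a). This sketch (hypothetical helper, exponential time) is only meant to illustrate the definition:

```python
# Sketch: brute-force curvature of a set function per Definition 8.

from itertools import chain, combinations

def curvature(f, ground):
    """Smallest c in [0, 1] with f_S(a) >= (1 - c) f(a) for all S, a.

    Enumerates all subsets, so it is exponential in |ground|.
    """
    ground = list(ground)
    subsets = chain.from_iterable(
        combinations(ground, r) for r in range(len(ground) + 1))
    c = 0.0
    for S in subsets:
        for a in ground:
            if a in S or f({a}) == 0:
                continue
            marginal = f(set(S) | {a}) - f(set(S))  # f_S(a)
            c = max(c, 1 - marginal / f({a}))
    return c

# A modular function has curvature 0; coverage with overlaps does not.
modular = lambda S: sum(S)
assert curvature(modular, {1, 2, 3}) == 0.0

cover = {"a": {1, 2}, "b": {2, 3}}
coverage = lambda S: len(set().union(*(cover[x] for x in S)) if S else set())
# f_{{a}}("b") = 1 = (1 - c) * f({"b"}) = (1 - c) * 2 forces c = 1/2.
assert curvature(coverage, {"a", "b"}) == 0.5
```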

Related work on curvature. In the value oracle model, the greedy algorithm is a

(1 − e^{−c})/c approximation algorithm for cardinality constraints [Conforti and Cornuéjols, 1984]. Recently, Sviridenko et al. [2015] improved this approximation to 1 − c/e with variants of the continuous greedy and local search algorithms, which was shown to be tight. Submodular optimization and curvature have also been studied for more general constraints [Vondrák, 2010, Iyer and Bilmes, 2013] and submodular minimization [Iyer et al., 2013].

The algorithm. Algorithm 1 first estimates the expected marginal contribution of each element e_i to a uniformly random set of size k − 1, which we denote by R for the remainder of this section. These expected marginal contributions E_R[f_R(e_i)] are estimated with v̂_i. The estimates v̂_i are the differences between the average value avg(S_{k,i}) := (Σ_{T∈S_{k,i}} f(T))/|S_{k,i}| of the collection S_{k,i} of samples of size k containing e_i and the average value of the collection S_{k−1,−i} of samples of size k − 1 not containing e_i. We then wish to return the best set between the random set R and the set S consisting of the k elements with the largest estimates v̂_i. Since we do not know the value of S, we lower bound it with v̂_S using the curvature property. We estimate the expected value E_R[f(R)] of R with v̂_R, which is the average value of the collection S_{k−1} of all samples of size k − 1. Finally, we compare the values of S and R using v̂_S and v̂_R to return the best of these two sets.

Algorithm 1 A tight (1 − c)/(1 + c − c²) − o(1)-optimization from samples algorithm for monotone submodular functions with curvature c

Input: samples {(S_i, f(S_i))} where S_i ∼ U, the uniform distribution over feasible sets

1: v̂_i ← avg(S_{k,i}) − avg(S_{k−1,−i})
2: S ← argmax_{T:|T|=k} Σ_{i∈T} v̂_i
3: v̂_S ← (1 − c) Σ_{e_i∈S} v̂_i          ▷ a lower bound on the value of f(S)
4: v̂_R ← avg(S_{k−1})                    ▷ an estimate of the value of a random set R
5: if v̂_S ≥ v̂_R then
6:     return S
7: else
8:     return R
9: end if
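The pseudocode above can be sketched in Python. This is an illustrative reimplementation under assumed conventions (the sample representation and helper names such as `avg` are hypothetical), not the thesis's code:

```python
# Sketch of Algorithm 1: optimization from samples for a monotone
# submodular f with curvature c.

import random
from statistics import mean

def curvature_ops(samples, k, c, n):
    """`samples` is a list of (frozenset, value) pairs drawn from the
    uniform distribution over feasible sets; n is the ground set size."""
    ground = range(n)
    avg = lambda sets: mean(v for _, v in sets) if sets else 0.0

    # Line 1: v̂_i = avg value of size-k samples containing i, minus
    # avg value of size-(k-1) samples not containing i.
    v_hat = {}
    for i in ground:
        with_i = [(S, v) for S, v in samples if len(S) == k and i in S]
        without_i = [(S, v) for S, v in samples if len(S) == k - 1 and i not in S]
        v_hat[i] = avg(with_i) - avg(without_i)

    # Lines 2-4: top-k estimates, lower bound on f(S), estimate of E[f(R)].
    S = set(sorted(ground, key=lambda i: -v_hat[i])[:k])
    v_hat_S = (1 - c) * sum(v_hat[i] for i in S)
    v_hat_R = avg([(T, v) for T, v in samples if len(T) == k - 1])

    if v_hat_S >= v_hat_R:
        return S
    # Otherwise a random feasible set does better in expectation.
    return set(random.sample(ground, k))

# Toy usage with a modular function (curvature c = 0).
random.seed(0)
n, k = 8, 3
f = lambda S: sum(a + 1 for a in S)
samples = []
for _ in range(2000):
    size = random.choice([k - 1, k])
    T = frozenset(random.sample(range(n), size))
    samples.append((T, f(T)))
S = curvature_ops(samples, k, c=0, n=n)
assert len(S) == k
```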

The analysis. Without loss of generality, let S = {e_1, …, e_k} be the set defined in Line 2 of the algorithm and define S_i to be the first i elements in S, i.e., S_i := {e_1, …, e_i}. Similarly, for the optimal solution S*, we have S* = {e*_1, …, e*_k} and S*_i := {e*_1, …, e*_i}. We abuse notation and denote by f(R) and f_R(e) the expected values E_R[f(R)] and E_R[f_R(e)], where the randomization is over the random set R of size k − 1.

At a high level, the curvature property is used to bound the loss from f(S) to Σ_{i≤k} f_R(e_i) and from Σ_{i≤k} f_R(e*_i) to f(S*). By the algorithm, Σ_{i≤k} f_R(e_i) is greater than Σ_{i≤k} f_R(e*_i). When bounding the loss from Σ_{i≤k} f_R(e*_i) to f(S*), a key observation is that if this loss is large, then it must be the case that R has a high expected value. This observation is formalized in our analysis by bounding this loss in terms of f(R) and motivates Algorithm 1 returning the best of R and S. Lemma 2.2.1 is the main part of the analysis and gives an approximation for S. The approximation guarantee for Algorithm 1 (formalized as Theorem 2.2.1) follows by finding the worst-case ratios of f(R) and f(S).

Lemma 2.2.1. Let S be the set defined in Algorithm 1 and f(·) be a monotone submodular function with curvature c, then

    f(S) ≥ (1 − o(1)) v̂_S ≥ ((1 − c)(1 − c · f(R)/f(S*)) − o(1)) f(S*).

Proof. First, observe that

$$f(S) = \sum_{i \le k} f_{S_{i-1}}(e_i) \ge (1-c) \sum_{i \le k} f(e_i) \ge (1-c) \sum_{i \le k} f_R(e_i)$$

where the first inequality is by curvature and the second is by monotonicity. We now claim that w.h.p. and with a sufficiently large polynomial number of samples the estimates of the marginal contribution of an element are precise,

$$f_R(e_i) + \frac{f(S^{\star})}{n^2} \ge \hat{v}_i \ge f_R(e_i) - \frac{f(S^{\star})}{n^2},$$

and defer the proof to Claim 1. Thus $f(S) \ge (1-c) \sum_{i \le k} \hat{v}_i - f(S^{\star})/n \ge \hat{v}_S - f(S^{\star})/n$. Next, by the definition of $S$ in the algorithm, we get

$$\frac{\hat{v}_S}{1-c} = \sum_{i \le k} \hat{v}_i \ge \sum_{i \le k} \hat{v}_i^{\star} \ge \sum_{i \le k} f_R(e_i^{\star}) - \frac{f(S^{\star})}{n}.$$

It is possible to obtain a $1-c$ loss between $\sum_{i \le k} f_R(e_i^{\star})$ and $f(S^{\star})$ with a similar argument as in the first part. The key idea to improve this loss is to use the curvature property on the elements in $R$ instead of on the elements $e_i^{\star} \in S^{\star}$. By curvature, we have that $f_{S^{\star}}(R) \ge (1-c) f(R)$. We now wish to relate $f_{S^{\star}}(R)$ and $\sum_{i \le k} f_R(e_i^{\star})$. Note that $f(S^{\star}) + f_{S^{\star}}(R) = f(R \cup S^{\star}) = f(R) + f_R(S^{\star})$ by the definition of marginal contribution, and $\sum_{i \le k} f_R(e_i^{\star}) \ge f_R(S^{\star})$ by submodularity. We get $\sum_{i \le k} f_R(e_i^{\star}) \ge f(S^{\star}) + f_{S^{\star}}(R) - f(R)$ by combining the previous equation and inequality. By the previous curvature observation, we conclude that

$$\sum_{i \le k} f_R(e_i^{\star}) \ge f(S^{\star}) + (1-c) f(R) - f(R) = \left( 1 - c \cdot \frac{f(R)}{f(S^{\star})} \right) f(S^{\star}).$$

Claim 1. Let $f$ be a monotone submodular function. Then, with a sufficiently large polynomial number of samples, the estimates $\hat{v}_i$ and $\hat{v}_R$ are $f(S^{\star})/n^2$-close to $f_R(e_i)$ and $f(R)$ with high probability, i.e.,

$$f_R(e_i) + \frac{f(S^{\star})}{n^2} \ge \hat{v}_i \ge f_R(e_i) - \frac{f(S^{\star})}{n^2} \quad \text{and} \quad f(R) + \frac{f(S^{\star})}{n^2} \ge \hat{v}_R \ge f(R) - \frac{f(S^{\star})}{n^2}.$$

Proof. We assume that $k \le n/2$ (otherwise, a random subset of size $k$ is a $1/2$-approximation). The most likely size of a sample is $k$, so the probability that a sample is of size $k$ is at least $2/n$. Since $\binom{n}{k-1} \ge \binom{n}{k}/n$, the probability that a sample is of size $k-1$ is at least $2/n^2$. A given element $i$ has probability at least $1/n$ of being in a sample and probability at least $1/2$ of not being in a sample. Therefore, to observe at least $n^5$ samples of size $k$ which contain $i$ and at least $n^5$ samples of size $k-1$ which do not contain $i$, $n^8$ samples are sufficient with high probability. Since $f(S) \le f(S^{\star})$ for all samples $S$, by Hoeffding's inequality,

$$\Pr\left[ \left| \mathrm{avg}(\mathcal{S}_{k,i}) - \mathbb{E}_{S : |S|=k,\, i \in S}[f(S)] \right| \ge \frac{f(S^{\star})}{2n^2} \right] \le 2e^{-2n^5 (f(S^{\star})/2n^2)^2 / f(S^{\star})^2} \le 2e^{-n/2},$$

similarly,

$$\Pr\left[ \left| \mathrm{avg}(\mathcal{S}_{k-1,\bar{i}}) - \mathbb{E}_{S : |S|=k-1,\, i \notin S}[f(S)] \right| \ge \frac{f(S^{\star})}{2n^2} \right] \le 2e^{-n/2},$$

and

$$\Pr\left[ \left| \mathrm{avg}(\mathcal{S}_{k-1}) - \mathbb{E}_{S : |S|=k-1}[f(S)] \right| \ge \frac{f(S^{\star})}{2n^2} \right] \le 2e^{-n/2}.$$

Since $\hat{v}_i = \mathrm{avg}(\mathcal{S}_{k,i}) - \mathrm{avg}(\mathcal{S}_{k-1,\bar{i}})$, $f_R(e_i) = \mathbb{E}_{S : |S|=k,\, i \in S}[f(S)] - \mathbb{E}_{S : |S|=k-1,\, i \notin S}[f(S)]$, $\hat{v}_R = \mathrm{avg}(\mathcal{S}_{k-1})$, and $f(R) = \mathbb{E}_{S : |S|=k-1}[f(S)]$, the claim holds with high probability.

Combining Lemma 2.2.1 and the fact that we obtain value at least $\max\{f(R), (1-c)\sum_{i=1}^{k} \hat{v}_i\}$, we obtain the main result of this section.

Theorem 2.2.1. Let $f(\cdot)$ be a monotone submodular function with curvature $c$. Then Algorithm 1 is a $(1-c)/(1+c-c^2) - o(1)$ approximation algorithm for optimization from samples from the uniform distribution $\mathcal{U}$ over feasible sets.

Proof. The estimate $\hat{v}_R$ of $f(R)$ is precise: with sufficiently many samples, we have $f(R) + f(S^{\star})/n^2 \ge \hat{v}_R \ge f(R) - f(S^{\star})/n^2$ by standard concentration bounds and with high probability. In addition, by the first inequality in Lemma 2.2.1, $f(S) \ge (1 - o(1))\hat{v}_S$. So by the algorithm and the second inequality in Lemma 2.2.1, the approximation obtained by the returned set is at least

$$(1 - o(1)) \cdot \max\left\{ \frac{f(R)}{f(S^{\star})}, \frac{\hat{v}_S}{f(S^{\star})} \right\} \ge (1 - o(1)) \cdot \max\left\{ \frac{f(R)}{f(S^{\star})}, (1-c)\left( 1 - c \cdot \frac{f(R)}{f(S^{\star})} \right) \right\}.$$

Let $x := f(R)/f(S^{\star})$; the best of $x$ and $(1-c)(1-cx) - o(1)$ is minimized when $x = (1-c)(1-cx)$, i.e., when $x = (1-c)/(1+c-c^2)$. Thus, the approximation obtained is at least $(1-c)/(1+c-c^2) - o(1)$.
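The balancing argument above is easy to sanity-check numerically. The sketch below, with the illustrative helper name `worst_case_ratio`, verifies that $x = (1-c)/(1+c-c^2)$ is the fixed point where the two candidate ratios $x$ and $(1-c)(1-cx)$ meet:

```python
def worst_case_ratio(c):
    """Fixed point of x = (1 - c) * (1 - c * x): the worst-case value of
    f(R)/f(S*) in the proof of Theorem 2.2.1 (illustrative helper)."""
    return (1 - c) / (1 + c - c * c)

for c in [0.0, 0.1, 0.3, 0.5, 0.9]:
    x = worst_case_ratio(c)
    # At the fixed point, the two candidate ratios coincide.
    assert abs(x - (1 - c) * (1 - c * x)) < 1e-12
```

For instance, with curvature $c = 1/2$ the guarantee is $0.5/1.25 = 0.4$, up to the $o(1)$ term.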

2.2.2 Learning to Influence

For well over a decade now, there has been extensive work on the canonical problem of influence maximization in social networks. First posed by Domingos and Richardson [2001b], Richardson and Domingos [2002] and elegantly formulated and further developed by Kempe et al. [2003], influence maximization is the algorithmic challenge of selecting individuals who

can serve as early adopters of a new idea, product, or technology in a manner that will trigger a large cascade in the social network. In their seminal paper, Kempe, Kleinberg, and Tardos characterize a family of natural influence processes for which selecting a set of individuals that maximizes the resulting cascade reduces to maximizing a submodular function under a cardinality constraint. Since submodular functions can be maximized within a $1 - 1/e$ approximation guarantee, one can then obtain desirable guarantees for the influence maximization problem. There have since been two, largely separate, agendas of research on the problem. The first line of work is concerned with learning the underlying submodular function from observations of cascades [Liben-Nowell and Kleinberg, 2003, Adar and Adamic, 2005, Leskovec et al., 2007, Goyal et al., 2010, Chierichetti et al., 2011, Gomez-Rodriguez et al., 2011, Netrapalli and Sanghavi, 2012, Gomez-Rodriguez et al., 2010, Du et al., 2012, Abrahao et al., 2013, Du et al., 2013, Feldman and Kothari, 2014, De et al., 2014, Cheng et al., 2014, Daneshmand et al., 2014, Du et al., 2014a, Narasimhan et al., 2015, Honorio and Ortiz, 2015]. The second line of work focuses on algorithmic challenges revolving around maximizing influence, assuming the underlying function that generates the diffusion process is known [Kempe et al., 2005, Mossel and Roch, 2007, Seeman and Singer, 2013, Borgs et al., 2014, Hassidim and Singer, 2015, He and Kempe, 2016, Angell and Schoenebeck, 2016]. In this thesis, we consider the problem of learning to influence where the goal is to maximize influence from observations of cascades. This problem synthesizes both problems of learning the function from training data and of maximizing influence given the influence function. A natural approach for learning to influence is to first learn the influence function from cascades, and then apply a submodular optimization algorithm to the function learned from the data.
Somewhat counter-intuitively, it turns out that this approach yields desirable guarantees only under very strong learnability conditions. In some cases, when there are sufficiently many samples, and one can observe exactly which node attempts to influence whom at every time step, these learnability conditions can be met. A slight relaxation, however (e.g., when there are only partial observations [Narasimhan et al., 2015, He et al., 2016]), can lead to sharp inapproximability.

Learning to influence social networks. As with all impossibility results, the inapproximability discussed above holds for worst-case instances, and it may be possible that such instances are rare for influence in social networks. In the previous section, it was shown that when a submodular function has bounded curvature, there is a simple algorithm that can maximize the function under a cardinality constraint from samples. Unfortunately, simple examples show that submodular functions that dictate influence processes in social networks do not have bounded curvature. Are there other reasonable conditions on social networks that yield desirable approximation guarantees?

Main result. In this section we present a simple algorithm for learning to influence. This algorithm leverages the idea that social networks exhibit strong community structure. At a high level, the algorithm observes cascades and aims to select a set of nodes that are influential, but belong to different communities. Intuitively, when an influential node from a certain community is selected to initiate a cascade, the marginal contribution of adding another node from that same community is small, since the nodes in that community were likely already influenced. This observation can be translated into a simple algorithm which performs very well in practice. Analytically, since community structure is often modeled using stochastic block models, we prove that the algorithm obtains a constant factor approximation guarantee in such models, under mild assumptions.

The analysis for the approximation guarantees lies at the intersection of combinatorial optimization and random graph theory. We formalize the intuition that the algorithm leverages the community structure of social networks in the standard model used to analyze communities, which is the stochastic block model. Intuitively, the algorithm obtains good approximations by picking the nodes that have the largest individual influence while avoiding picking multiple nodes in the same community by pruning nodes with high influence overlap. The individual influence of nodes and their overlap are estimated by the algorithm with what we call first and second order marginal contributions of nodes, which can be estimated from samples. We then use phase transition results for Erdős–Rényi random graphs and branching process techniques to compare these individual influences for nodes in different communities in the stochastic block model and to bound the overlap of pairs of nodes.

The Model

We assume that the influence process follows the standard independent cascade model. In the independent cascade model, a node $a$ influences each of its neighbors $b$ with some probability $q_{ab}$, independently. Thus, given a seed set of nodes $S$, the set of nodes influenced is the set of nodes connected to some node in $S$ in the random subgraph of the network which contains every edge $ab$ independently with probability $q_{ab}$. We define $f(S)$ to be the expected number of nodes influenced by $S$ according to the independent cascade model over some weighted social network.
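The definition of $f(S)$ above suggests a direct Monte Carlo estimate: realize the random subgraph, then count the nodes connected to the seed set. The sketch below is an illustrative implementation (the edge-list representation, helper names, and parameters are our assumptions, not the thesis's code):

```python
import random
from collections import deque

def influence(edges, q, S, trials=2000, seed=0):
    """Monte Carlo estimate of f(S): the expected number of nodes
    connected to seed set S when each undirected edge e in `edges`
    survives independently with probability q[e] (independent cascade)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Realize the random subgraph for this trial.
        adj = {}
        for a, b in edges:
            if rng.random() < q[(a, b)]:
                adj.setdefault(a, []).append(b)
                adj.setdefault(b, []).append(a)
        # Count the nodes connected to some seed via BFS.
        reached, queue = set(S), deque(S)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in reached:
                    reached.add(v)
                    queue.append(v)
        total += len(reached)
    return total / trials
```

With a single edge of weight 1 and seed set $\{0\}$, every realization reaches both endpoints, so the estimate is exactly 2.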

The learning to influence model: optimization from samples for influence maximization. The learning to influence model is an interpretation of the optimization from samples model for the specific problem of influence maximization in social networks. We focus on bounded product distributions $\mathcal{D}$, so every node $a$ is, independently, in $S \sim \mathcal{D}$ with some probability $p_a \in [1/\mathrm{poly}(n), 1 - 1/\mathrm{poly}(n)]$. We assume this is the case throughout this section. We are given a collection of samples $\{(S_i, |cc(S_i)|)\}_{i=1}^{m}$ where the sets $S_i$ are the seed sets of nodes and $|cc(S_i)|$ is the number of nodes influenced by $S_i$, i.e., the number of nodes that are connected to $S_i$ in the random subgraph of the network. This number of nodes is a random variable with expected value $f(S_i) := \mathbb{E}[|cc(S_i)|]$ over the realization of the influence process. Each sample is an independent realization of the influence process. The goal is then to find, under a cardinality constraint $k$, a set of nodes $S$ which maximizes the influence in expectation, i.e., find a set $S$ of size at most $k$ which maximizes the expected number of nodes $f(S)$ influenced by seed set $S$.

Description of the Algorithm

We present the main algorithm, COPS. This algorithm is based on a novel optimization from samples technique which detects overlap in the marginal contributions of two different nodes, which is useful to avoid picking two nodes whose influence intersects over the same collection of nodes. COPS consists of two steps. It first orders nodes in decreasing order of first order marginal contribution, which is the expected marginal contribution of a node $a$ to a random set $S \sim \mathcal{D}$. Then, it iteratively removes nodes $a$ whose marginal contribution overlaps with the marginal contribution of at least one node before $a$ in the ordering. The solution is the $k$ first nodes in the pruned ordering.

Algorithm 2 COPS, learns to influence networks with COmmunity Pruning from Samples.

Input: samples $\mathcal{S} = \{(S, f(S))\}$, acceptable overlap $\alpha$.

Order nodes according to their first order marginal contributions.
Iteratively remove from this ordering nodes $a$ whose marginal contribution has overlap of at least $\alpha$ with at least one node before $a$ in this ordering.
return the $k$ first nodes in the ordering.

The strong performance of this algorithm for the problem of influence maximization is best explained with the concept of communities. Intuitively, this algorithm first orders nodes in decreasing order of their individual influence and then removes nodes which are in the same community. This second step allows the algorithm to obtain a diverse solution which influences multiple different communities of the social network. In comparison, the other optimization from samples algorithms only use first order marginal contributions and perform well if the function is close to linear. Due to the high overlap in influence between nodes in the same community, influence functions are far from linear, and these algorithms have poor performance for influence maximization since they only pick nodes from a very small number of communities.

Computing overlap using second order marginal contributions. We define second order marginal contributions, which are used to compute the overlap between the marginal contributions of two nodes.

Definition 9. The second order expected marginal contribution of a node $a$ to a random set $S$ containing node $b$ is

$$v_b(a) := \mathbb{E}_{S \sim \mathcal{D} : a \notin S,\, b \in S}[f(S \cup \{a\}) - f(S)].$$

The first order marginal contribution $v(a)$ of node $a$ is defined similarly as the marginal contribution of node $a$ to a random set $S$, i.e., $v(a) := \mathbb{E}_{S \sim \mathcal{D} : a \notin S}[f(S \cup \{a\}) - f(S)]$. These contributions can be estimated arbitrarily well for product distributions $\mathcal{D}$ by taking the difference between the average value of samples containing $a$ and $b$ and the average value of samples containing $b$ but not $a$.

The subroutine Overlap$(a, b, \alpha)$, $\alpha \in [0, 1]$, compares the second order marginal contribution of $a$ to a random set containing $b$ with the first order marginal contribution of $a$ to a random set. If $b$ causes the marginal contribution of $a$ to decrease by at least a factor of $1 - \alpha$, then we say that $a$ has marginal contribution with overlap of at least $\alpha$ with node $b$.

Algorithm 3 Overlap$(a, b, \alpha)$, returns true if $a$ and $b$ have marginal contributions that overlap by at least a factor $\alpha$.

Input: samples $\mathcal{S} = \{(S, f(S))\}$, node $a$, acceptable overlap $\alpha$.

If the second order marginal contribution $v_b(a)$ is at least a factor of $1 - \alpha$ smaller than the first order marginal contribution $v(a)$,
return: node $a$ has overlap of at least $\alpha$ with node $b$.

Overlap is used to detect nodes in the same community. In the extreme case where two nodes $a$ and $b$ are in a community $C$ where any node in $C$ influences all of community $C$, the second order marginal contribution $v_b(a)$ of $a$ to a random set $S$ containing $b$ is $v_b(a) = 0$, since $b$ already influences all of $C$ so $a$ does not add any value, while $v(a) \approx |C|$. In the opposite case where $a$ and $b$ are in two communities which are not connected in the network, we have $v(a) = v_b(a)$, since adding $b$ to a random set $S$ has no impact on the value added by $a$.
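Putting Algorithms 2 and 3 together, COPS admits a compact sketch. This is an illustrative implementation, not the thesis's code: the sample-average estimators of $v(a)$ and $v_b(a)$ and all helper names are our assumptions.

```python
def cops(samples, nodes, k, alpha):
    """Sketch of COPS: order nodes by first order marginal contribution,
    prune nodes whose contribution overlaps (by a factor >= alpha) with
    an earlier node, and return the first k survivors. `samples` is a
    list of (seed_set, value) pairs."""
    def avg(vals):
        return sum(vals) / len(vals) if vals else 0.0

    def v(a):  # first order marginal contribution of a
        return (avg([f for S, f in samples if a in S])
                - avg([f for S, f in samples if a not in S]))

    def v2(a, b):  # second order contribution of a to sets containing b
        return (avg([f for S, f in samples if a in S and b in S])
                - avg([f for S, f in samples if a not in S and b in S]))

    def overlap(a, b):
        # True if b shrinks a's contribution by a factor of at least 1 - alpha.
        return v2(a, b) <= (1 - alpha) * v(a)

    order = sorted(nodes, key=v, reverse=True)
    kept = []
    for a in order:
        if not any(overlap(a, b) for b in kept):
            kept.append(a)
    return kept[:k]
```

On a toy instance with a two-node community whose members fully overlap and one independent node, the pruning step removes the redundant community member and returns one node per community.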

Analyzing community structure. The main benefit of COPS is that it leverages the community structure of social networks. To formalize this explanation, we analyze our algorithm in the standard model used to study the community structure of networks, the stochastic block model. In this model, a fixed set of nodes $V$ is partitioned into communities $C_1, \ldots, C_{\ell}$. The network is then a random graph $G = (V, E)$ where edges are added to $E$ independently and where an intra-community edge is in $E$ with much larger probability than an inter-community edge. These edges are added with identical probability $q_C^{sb}$ for every edge in the same community $C$, but with different probabilities for edges inside different communities $C_i$ and $C_j$. We illustrate this model in Figure 2.3.

Figure 2.3: An illustration of the stochastic block model with communities $C_1$, $C_2$, $C_3$ and $C_4$ of sizes 6, 4, 4 and 4. The optimal solution for influence maximization with $k = 4$ is in green. Picking the $k$ first nodes in the ordering by marginal contributions without pruning, as in the previous section, leads to a solution with nodes from only $C_1$ (red). By removing nodes with overlapping marginal contributions, COPS obtains a diverse solution.
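Under the simplifying assumption of no inter-community edges used in the next section, a stochastic block model graph can be sampled in a few lines; the community sizes and probabilities below are illustrative parameters.

```python
import random

def sbm_graph(sizes, q_intra, seed=0):
    """Sample a stochastic block model graph with no inter-community
    edges (the simplification used in the analysis): community i has
    sizes[i] nodes, and each intra-community edge appears independently
    with probability q_intra[i]."""
    rng = random.Random(seed)
    edges, start = [], 0
    for size, q in zip(sizes, q_intra):
        for a in range(start, start + size):
            for b in range(a + 1, start + size):
                if rng.random() < q:
                    edges.append((a, b))
        start += size
    return edges
```

Communities occupy contiguous node ranges, so with `sizes=[3, 2]` and intra-community probability 1, the sampled edge list is exactly the two cliques $\{0, 1, 2\}$ and $\{3, 4\}$.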

Dense Communities and Small Seed Set in the Stochastic Block Model

In this section, we show that COPS achieves a $1 - O(|C_k|^{-1})$ approximation, where $C_k$ is the $k$th largest community, in the regime with dense communities and a small seed set, which is described below. We show that the algorithm picks a node from each of the $k$ largest communities with high probability, which is the optimal solution. In the next section, we show a constant factor approximation algorithm for a generalization of this setting, which requires a more intricate analysis. In order to focus on the main characteristics of the community structure as an explanation for the performance of the algorithm, we make the following simplifying assumptions for the analysis. We first assume that there are no inter-community edges.¹ We also assume that the random graph obtained from the stochastic block model is redrawn for every sample and that we aim to find a good solution in expectation over both the stochastic block model and the independent cascade model. Formally, let $G = (V, E)$ be the random graph over $n$ nodes obtained from an independent cascade process over the graph generated by the stochastic block model. Similarly as for the stochastic block model, edge probabilities for the independent cascade model may vary

¹The analysis easily extends to cases where inter-community edges form with probability significantly smaller than $q_C^{sb}$, for all $C$.

between different communities and are identical within a single community $C$, where all edges have weights $q_C^{ic}$. Thus, an edge $e$ between two nodes in a community $C$ is in $E$ with probability $p_C := q_C^{ic} \cdot q_C^{sb}$, independently for every edge, where $q_C^{ic}$ and $q_C^{sb}$ are the edge probabilities in the independent cascade model and the stochastic block model respectively. The total influence by seed set $S_i$ is then $|cc_G(S_i)|$ where $cc_G(S)$ is the set of nodes connected to $S$ in $G$, and we drop the subscript when it is clear from context. Thus, the objective function is $f(S) := \mathbb{E}_G[|cc(S)|]$. We describe the two assumptions for this section.

Dense communities. We assume that for the $k$ largest communities $C$, $p_C > 3 \log |C| / |C|$ and $C$ has super-constant size ($|C| = \omega(1)$). This assumption corresponds to communities where the probability $p_C$ that a node $a_i \in C$ influences another node $a_j \in C$ is large. Since the subgraph $G[C]$ of $G$ induced by a community $C$ is an Erdős–Rényi random graph, we get that $G[C]$ is connected with high probability. We first review Erdős–Rényi random graphs.

Erdős–Rényi random graphs. A $G_{n,p}$ Erdős–Rényi graph is a random graph over $n$ vertices where every edge realizes with probability $p$. Note that the graph obtained by the two-step process which consists of first the stochastic block model and then the independent cascade model is a union of $G_{|C|, p_C}$ for each community $C$. The following is a seminal result from Erdős and Rényi characterizing phase transitions for $G_{n,p}$ graphs.

Lemma 2.2.2. [Erdős and Rényi, 1960] Assume $C$ is a "dense" community, then the subgraph $G[C]$ of $G$ is connected with probability $1 - O(|C|^{-2})$.

Proof. Assume $p_C = c \log |C| / |C|$ for $c > 1$. From Theorem 4.6 in [Blum et al.], which presents the result from Erdős and Rényi [1960], the expected number of isolated vertices in $G(|C|, p_C)$ is

$$\mathbb{E}[i] = |C|^{1-c} + o(1),$$

and from Theorem 4.15 in [Blum et al.], the expected number of components of size between 2 and $|C|/2$ is $O(|C|^{1-2c})$. Thus the expected number of components of size at most $|C|/2$ is $O(|C|^{1-c})$ and the probability that the graph is connected is $1 - O(|C|^{1-c})$. Finally, since $c \ge 3$ for dense communities, the probability that the graph for community $C$ is connected is $1 - O(|C|^{-2})$.
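The connectivity threshold in Lemma 2.2.2 is easy to observe empirically. The sketch below (illustrative helper names, fixed RNG seed) samples $G_{n,p}$ at $p = 3 \log n / n$, the dense community regime, and checks connectivity with a BFS:

```python
import math
import random
from collections import deque

def gnp_connected(n, p, rng):
    """Sample a G(n, p) Erdős–Rényi graph and test connectivity via BFS."""
    adj = [[] for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                adj[a].append(b)
                adj[b].append(a)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

# At p = 3 log n / n, sampled graphs should almost always be connected.
rng = random.Random(0)
n = 200
hits = sum(gnp_connected(n, 3 * math.log(n) / n, rng) for _ in range(20))
```

With $n = 200$, the disconnection probability is on the order of $n^{-2}$, so essentially all 20 sampled graphs should be connected.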

Small seed set. We also assume that the seed sets $S \sim \mathcal{D}$ are small enough so that they rarely intersect with a fixed community $C$, i.e., $\Pr_{S \sim \mathcal{D}}[S \cap C = \emptyset] \ge 1 - o(1)$. This assumption corresponds to cases where the set of early influencers is small, which is usually the case in cascades.

The analysis in this section relies on two main lemmas. We first show that the first order marginal contribution of a node is approximately the size of the community it belongs to (Lemma 2.2.3). Thus, the ordering by marginal contributions orders elements by the size of the community they belong to. Then, we show that any node $a \in C$ such that there is a node $b \in C$ before $a$ in the ordering is pruned (Lemma 2.2.4). Regarding the distribution $S \sim \mathcal{D}$ generating the samples, as previously mentioned, we consider any bounded product distribution. This implies that w.p. $1 - 1/\mathrm{poly}(n)$, the algorithm can compute marginal contribution estimates $\tilde{v}$ that are all a $1/\mathrm{poly}(n)$-additive approximation to the true marginal contributions $v$. Thus, we give the analysis for the true marginal contributions, which, with probability $1 - 1/\mathrm{poly}(n)$ over the samples, easily extends to arbitrarily good estimates. The following lemma shows that the ordering by first order marginal contributions corresponds to the ordering by decreasing order of the sizes of the communities that nodes belong to.

Lemma 2.2.3. For all $a \in C$ where $C$ is one of the $k$ largest communities, the first order marginal contribution of node $a$ is approximately the size of its community, i.e., $(1 - o(1))|C| \le v(a) \le |C|$.

Proof. Assume $a$ is a node in one of the $k$ largest communities. Let $\mathcal{D}_a$ and $\mathcal{D}_{-a}$ denote the distribution $S \sim \mathcal{D}$ conditioned on $a \in S$ and $a \notin S$ respectively. We also denote marginal contributions by $f_S(a) := f(S \cup \{a\}) - f(S)$. We obtain

$$v(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a},\, G}[f_S(a)] \ge \Pr_{S \sim \mathcal{D}_{-a}}[S \cap C = \emptyset] \cdot \Pr_G[cc(a) = C] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a} : S \cap C = \emptyset,\; G : cc(a) = C}[f_S(a)]$$
$$= \Pr_{S \sim \mathcal{D}_{-a}}[S \cap C = \emptyset] \cdot \Pr_G[cc(a) = C] \cdot |C| \ge (1 - o(1)) \cdot |C|,$$

where the last inequality is by the small seed set assumption and since $C$ is connected with probability $1 - o(1)$ (Lemma 2.2.2 and $|C| = \omega(1)$ by the dense community assumption). For the upper bound, $v(a)$ is trivially at most the size of $a$'s community since there are no inter-community edges.

The next lemma shows that the algorithm does not pick two nodes in the same community.

Lemma 2.2.4. With probability $1 - o(1)$, for all pairs of nodes $a, b$ such that $a, b \in C$ where $C$ is one of the $k$ largest communities, Overlap$(a, b, \alpha) = \text{True}$ for any constant $\alpha \in [0, 1)$.

Proof. Let $a, b$ be two nodes in one of the $k$ largest communities $C$ and let $\mathcal{D}_{-a,b}$ denote the distribution $S \sim \mathcal{D}$ conditioned on $a \notin S$ and $b \in S$. Then,

$$v_b(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a)] \le \Pr[b \in cc(a)] \cdot 0 + \Pr[b \notin cc(a)] \cdot |C| = o(1) \le o(1) \cdot v(a),$$

where the last equality is since $G[C]$ is not connected w.p. $O(|C|^{-2})$ by Lemma 2.2.2 and since $|C| = \omega(1)$ by the dense community assumption, which concludes the proof.

By combining Lemmas 2.2.3 and 2.2.4, we obtain the main result for this section.

Theorem 2.2.2. In the dense communities and small seed set setting, COPS with $\alpha$-overlap allowed, for any constant $\alpha \in (0, 1)$, is a $1 - o(1)$-approximation algorithm for learning to influence from samples from a bounded product distribution $\mathcal{D}$.

Proof. First, we claim that a node $a \in C$ is not removed from the ordering if there is no other node from $C$ before $a$. For $b \notin C$, we have

$$v(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a)] = \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : b \in S] = \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a)] = v_b(a),$$

where the second equality is since $a$ and $b$ are in different communities and since $\mathcal{D}$ is a product distribution. Thus, Overlap$(a, b, \alpha) = \text{False}$ for any $\alpha \in (0, 1]$. Next, recall that $v(a) \le |C|$ for all $a \in C$. Thus, by Lemmas 2.2.3 and 2.2.4, COPS returns a set that contains one node from $k$ different communities whose sizes are at most a factor $1 - o(1)$ away from the sizes of the $k$ largest communities. Since the $k$ largest communities are connected with high probability, the optimal solution contains one node from each of the $k$ largest communities. Thus, we obtain a $1 - o(1)$ approximation.

Constant Approximation for General Stochastic Block Model

In this section, we relax the assumptions from the previous section and show that COPS is a constant factor approximation algorithm in this more demanding setting. Recall that $G$ is the random graph obtained from both the stochastic block model and the independent cascade model. A main observation used in the analysis is that the random subgraph $G[C]$, for some community $C$, is an Erdős–Rényi random graph $G_{|C|, p_C}$.

Relaxation of the assumptions. Instead of only considering dense communities where $p_C = \Omega((\log |C|)/|C|)$, we consider both tight communities $C$ where $p_C \ge (1 + \epsilon)/|C|$ for some constant $\epsilon > 0$ and loose communities $C$ where $p_C \le (1 - \epsilon)/|C|$ for some constant $\epsilon > 0$.² We also relax the small seed set assumption to the reasonable non-ubiquitous seed set assumption. Instead of having a seed set $S \sim \mathcal{D}$ rarely intersect with a fixed community $C$, we only assume that $\Pr_{S \sim \mathcal{D}}[S \cap C = \emptyset] \ge \epsilon$ for some constant $\epsilon > 0$. Again, since seed sets are of small sizes in practice, it seems reasonable that with some constant probability a community does not contain any seeds.

Overview of analysis. At a high level, the analysis exploits the remarkably sharp threshold for the phase transition of Erdős–Rényi random graphs. This phase transition (Lemma 2.2.5) tells us that a tight community $C$ contains w.h.p. a giant connected component with a constant fraction of the nodes from $C$. Thus, a single node from a tight community influences a constant fraction of its community in expectation. The ordering by first order marginal contributions thus ensures a constant factor approximation of the value from nodes in tight communities (Lemma 2.2.8). On the other hand, we show that a node from a loose community influences only at most a constant number of nodes in expectation (Lemma 2.2.6) by using branching processes. Since the algorithm checks for overlap using second order marginal contributions, the algorithm picks at most one node from any tight community (Lemma 2.2.9). Combining all the pieces together, we obtain a constant factor approximation (Theorem 2.2.3). We first state the result for the giant connected component in a tight community, which is an immediate corollary of the prominent giant connected component result in the Erdős–Rényi model.

Lemma 2.2.5. [Erdős and Rényi, 1960] Let $C$ be a tight community with $|C| = \omega(1)$, then $G[C]$ has a "giant" connected component containing a constant fraction of the nodes in $C$ w.p. $1 - o(1)$.

The following lemma analyzes the influence of a node in a loose community through the lens of Galton–Watson branching processes to show that such a node influences at most a constant number of nodes in expectation.

²Thus, we consider all possible sizes of communities except communities of size converging to exactly $1/p_C$, which is unlikely to occur in practice.

Lemma 2.2.6. Let $C$ be a loose community, then $f(\{a\}) \le c$ for all $a \in C$ and some constant $c$.

Proof. Fix a node $a \in C$. We consider a Galton–Watson branching process starting at individual $a$ where the number of offspring of an individual is $X = \mathrm{Binomial}(|C| - 1, p_C)$. We show that the expected total size $\mathbb{E}[s]$ of this branching process is $1/(1 - p_C \cdot (|C| - 1))$ and that this expected size upper bounds $f(\{a\})$.

We first argue that $\mathbb{E}[s] \ge f(\{a\})$. The expected number of nodes influenced by $a$ can be counted via a breadth first search (BFS) of community $C$ starting at $a$. The number of edges leaving a node in this BFS is $\mathrm{Binomial}(|C| - 1, p_C)$, which is exactly the number of offspring of an individual in the branching process. Since the nodes explored in the BFS are only the nodes not yet explored, the number of nodes explored by the BFS is upper bounded by the branching process and we get $\mathbb{E}[s] \ge f(\{a\})$.

Next, let $\mu = p_C \cdot (|C| - 1) < 1$ be the expected number of offspring of an individual in the branching process. Let $s_i$ be the number of individuals at generation $i$ of the branching process. We show by induction that $\mathbb{E}[s_i] = \mu^i$. The base case is trivial for $i = 1$. Next, for $i \ge 2$,

$$\mathbb{E}[s_i] = \sum_{j=0}^{\infty} \Pr[s_{i-1} = j] \cdot \mathbb{E}[s_i \mid s_{i-1} = j] = \sum_{j=0}^{\infty} \Pr[s_{i-1} = j] \cdot j \cdot \mu = \mu \cdot \mathbb{E}[s_{i-1}] = \mu^i,$$

where the last equality is by the inductive hypothesis. Thus, $\mathbb{E}[s] = \sum_{i=0}^{\infty} \mu^i = 1/(1 - \mu)$ since $\mu < 1$. Finally, since $C$ is loose, $\mu \le 1 - \epsilon$ for some constant $\epsilon > 0$ and $f(\{a\}) \le \mathbb{E}[s] \le 1/\epsilon$.
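The geometric-series bound $\mathbb{E}[s] = 1/(1-\mu)$ can be checked by simulation; the sketch below (illustrative helper names, fixed RNG seed) runs a Galton–Watson process with $\mathrm{Binomial}(|C|-1, p_C)$ offspring:

```python
import random

def branching_total(n_minus_1, p, rng, cap=10**6):
    """Total progeny of a Galton-Watson process with offspring law
    Binomial(n_minus_1, p), mirroring the BFS bound in the proof."""
    total = generation = 1
    while generation and total < cap:
        # Each individual in the current generation draws its offspring.
        generation = sum(sum(rng.random() < p for _ in range(n_minus_1))
                         for _ in range(generation))
        total += generation
    return total

# For a loose community, mu = p * (|C| - 1) < 1 and the expected total
# size is 1 / (1 - mu); here mu = 0.5, so the mean should be near 2.
rng = random.Random(1)
mean = sum(branching_total(49, 0.5 / 49, rng) for _ in range(4000)) / 4000
```

With $|C| = 50$ and $p_C = 0.5/49$, so $\mu = 0.5$, the empirical mean total size concentrates near $1/(1-\mu) = 2$.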

We can now upper bound the value of the optimal solution $S^{\star}$. Let $C_1, \ldots, C_t$, $t \le k$, be the tight communities that have at least one node in $C_i$ that is in the optimal solution $S^{\star}$ and that are of super-constant size, i.e., $|C_i| = \omega(1)$. Without loss of generality, we order these communities in decreasing order of their size $|C_i|$.

Lemma 2.2.7. Let $S^{\star}$ be the optimal set of nodes and let $C_i$ and $t$ be defined as above. There exists a constant $c$ such that $f(S^{\star}) \le \sum_{i=1}^{t} |C_i| + c \cdot k$.

Proof. Let $S_A^{\star}$ and $S_B^{\star}$ be a partition of the optimal nodes into nodes that are in tight communities with super-constant individual influence and nodes that are not in such a community. The influence $f(S_A^{\star})$ is trivially upper bounded by $\sum_{i=1}^{t} |C_i|$. Next, there exists some constant $c$ s.t. $f(S_B^{\star}) \le \sum_{a \in S_B^{\star}} f(\{a\}) \le c \cdot k$, where the first inequality is by submodularity and the second since nodes in loose communities have constant individual influence by Lemma 2.2.6 and nodes in tight communities without super-constant individual influence have constant influence by definition. We conclude that by submodularity, $f(S^{\star}) \le f(S_A^{\star}) + f(S_B^{\star}) \le \sum_{i=1}^{t} |C_i| + c \cdot k$.

Next, we argue that the solution returned by the algorithm is a constant factor away from $\sum_{i=1}^{t} |C_i|$.

Lemma 2.2.8. Let $a$ be the $i$th node in the ordering by first order marginal contribution after the pruning and $C_i$ be the $i$th largest tight community with super-constant individual influence and with at least one node in the optimal solution $S^{\star}$. Then, $f(\{a\}) \ge \epsilon |C_i|$ for some constant $\epsilon > 0$.

Proof. By definition of the $C_i$'s, we have $|C_1| \ge \cdots \ge |C_i|$, which are all tight communities. Let $b$ be a node in $C_j$ for $j \in [i]$, let $\mathbb{1}_{gc(C)}$ be the indicator variable indicating if there is a giant component in community $C$, and let $gc(C)$ be this giant component. We get

$$v(b) \ge \Pr[\mathbb{1}_{gc(C_j)}] \cdot \Pr_{S \sim \mathcal{D}_{-b}}[S \cap C_j = \emptyset] \cdot \Pr[b \in gc(C_j)] \cdot \mathbb{E}[|gc(C_j)| : b \in gc(C_j)] \ge (1 - o(1)) \cdot \epsilon_1 \cdot \epsilon_2 \cdot \epsilon_3 |C_j| \ge \epsilon |C_j|$$

for some constants $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon > 0$ by Lemma 2.2.5 and the non-ubiquitous assumption. Similarly as in Theorem 2.2.2, if $a$ and $b$ are in different communities, Overlap$(a, b, \alpha) = \text{False}$ for $\alpha \in (0, 1]$. Thus, there is at least one node $b \in \cup_{j=1}^{i} C_j$ at position $i$ or after in the ordering after the pruning, and $v(b) \ge \epsilon |C_j|$ for some $j \in [i]$. By the ordering by first order marginal contributions and since node $a$ is in $i$th position, $v(a) \ge v(b)$, and we get that $f(\{a\}) \ge v(a) \ge v(b) \ge \epsilon |C_j| \ge \epsilon |C_i|$.

Next, we show that the algorithm never picks two nodes from the same tight community.

Lemma 2.2.9. If $a, b \in C$ and $C$ is a tight community, then Overlap$(a, b, \alpha) = \text{True}$ for $\alpha = o(1)$.

Proof. Let a, b C s.t. C is a tight community. The marginal contribution of node a can œ i be decomposed into the whether a gc(C): œ

v(a)=Pr[a gc(C)] E [fS(a):a gc(C)] + Pr[a gc(C)] E [fS(a):a gc(C)] G S a G S a œ · ≥D≠ œ ”œ · ≥D≠ ”œ

Since is a product distribution, D

E [fS(a):a gc(C)] = (Pr[b gc(C)] + Pr [b gc(C),b S]) E [fS(a):a gc(C)] S a S S a, b ≥D≠ œ ”œ ≥D œ ”œ ≥D≠ ≠ œ

(1 + ‘)Pr[b gc(C)] E [fS(a):a gc(C)] S a, b Ø ”œ ≥D≠ ≠ œ

=(1+‘) E [fS(a):a gc(C)] S a,b ≥D≠ œ

for some constant ‘>0 since PrS a [b gc(C),b S] ‘1 for some constant ‘1 > 0. Since, ≥D≠ œ ”œ Ø

Pr[a gc(C)] E [fS(a):a gc(C)] ‘Õ C ‘Õ Pr[a gc(C)] E [fS(a):a gc(C)] G S a,b G S a œ ≥D≠ œ Ø | |Ø ”œ · ≥D≠ ”œ

50 for some constant ‘Õ > 0,weget

$$v(a) \ge (1 + \epsilon)\Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,-b}}[f_S(a) : a \in gc(C)] + \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \notin gc(C)]$$

$$\ge (1 + \epsilon\epsilon'/2)\Big(\Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,-b}}[f_S(a) : a \in gc(C)] + \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,-b}}[f_S(a) : a \notin gc(C)]\Big)$$

$$= (1 + \epsilon\epsilon'/2)\, v_b(a)$$

Thus, $\textsc{Overlap}(a, b, \alpha) = \text{True}$ for $\alpha = o(1)$.

We combine the above lemmas to obtain the approximation guarantee of COPS.

Theorem 2.2.3. With overlap allowed $\alpha = 1/\operatorname{poly}(n)$, COPS is a constant factor approximation algorithm for learning to influence from samples drawn from a bounded product distribution $\mathcal{D}$ in the setting with tight and loose communities and non-ubiquitous seed sets.

Proof. First, observe that $f(S) \ge k$ since nodes trivially influence themselves. Let $a_i$ be the node picked by the algorithm that is in the $i$th position of the ordering after the pruning, and assume $i \le t$. By Lemma 2.2.8, $f(\{a_i\}) \ge \epsilon |C_i|$, where $C_i$ is the $i$th largest tight community with super-constant individual influence and with at least one node in $S^\star$. Thus $a_i$, $i \in [t]$, is in a tight community: otherwise it would have constant influence by Lemma 2.2.6, which contradicts $f(\{a_i\}) \ge \epsilon |C_i|$. Since each $a_i$, $i \in [t]$, is in a tight community, by Lemma 2.2.9 we obtain that $a_1, \ldots, a_t$ are all in different communities.

We denote by $S_t$ the subset of the solution returned by COPS and obtain
$$f(S^\star) \le \sum_{i=1}^{t} |C_i| + c \cdot k \le \sum_{i=1}^{t} \frac{1}{\epsilon} f(\{a_i\}) + c \cdot f(S) = \frac{1}{\epsilon} f(S_t) + c \cdot f(S) \le c_1 \cdot f(S)$$
for some constants $\epsilon, c, c_1$, by Lemmas 2.2.7 and 2.2.8, and since $a_i, a_j$ are in different communities for $i, j \le t$.

Experiments

In this section, we compare the performance of COPS and three other algorithms on real and synthetic networks. We show that COPS performs well in practice: it outperforms the previous optimization from samples algorithm and gets closer to the solution obtained when given complete access to the influence function.

Experimental setup. The first synthetic network considered is the stochastic block model,

SBM 1, where communities have random sizes, with one community of size significantly larger than the other communities. We maintained the same expected community size as $n$ varied. In the second stochastic block model, SBM 2, all communities have the same expected size and the number of communities was fixed as $n$ varied. The third and fourth synthetic

networks were an Erdős–Rényi (ER) random graph and the preferential attachment model (PA). Experiments were also conducted on two publicly available real networks ([Leskovec and Krevl, 2015]). The first is a subgraph of the Facebook social network with $n = 4$k and $m = 88$k. The second is a subgraph of the DBLP co-authorship network, which has ground-truth communities as described in [Leskovec and Krevl, 2015], where nodes of degree at most 10 were pruned to obtain $n = 54$k and $m = 361$k, and where the 1.2k nodes with degree at least 50 were considered as potential nodes in the solution.
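The stochastic block model setups above can be sketched with a toy sampler; the parameter values below are illustrative placeholders of ours, not the ones used in the experiments:

```python
import random

def sample_sbm(sizes, p_in, p_out, rng):
    """Toy stochastic block model sampler: nodes are grouped into blocks of the
    given sizes; each intra-block pair is joined with probability p_in and each
    inter-block pair with probability p_out."""
    n = sum(sizes)
    block_of, b = [], 0
    for s in sizes:
        block_of += [b] * s
        b += 1
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < (p_in if block_of[u] == block_of[v] else p_out)]
    return n, edges

# An SBM 2-style instance: a fixed number of equally sized communities.
n, edges = sample_sbm([25, 25, 25, 25], 0.2, 0.01, random.Random(0))
```

An SBM 1-style instance would instead pass one block size much larger than the others.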

Benchmarks. We considered three different benchmarks to compare the COPS algorithm against. The standard Greedy algorithm in the value query model is an upper bound, since it is the optimal efficient algorithm given value query access to the function, while COPS is in the more restricted setting with only samples. MargI is the optimization from samples algorithm which picks the $k$ nodes with highest first order marginal contribution, as in the previous section, and does not use second order marginal contributions. Random simply returns a random set. All the samples are drawn from the product distribution with marginal

[Figure 2.4 panels: performance vs. $q$ and $k$ on DBLP and Facebook (top row), and vs. $n$ on SBM 1, SBM 2, Erdős–Rényi, and preferential attachment (bottom row); plotted values omitted.]

Figure 2.4: Empirical performance of COPS against the Greedy upper bound, the previous optimization from samples algorithm MargI and a random set.

probability $k/n$, so that samples have expected size $k$. Each point in a plot corresponds to the average performance of the algorithms over 10 trials. The default value for $k$ is $k = 10$. For the experiments on synthetic data, the default overlap allowed is $\alpha = 0.5$; for the Facebook experiments, $\alpha = 0.4$; and for the DBLP experiments, $\alpha = 0.2$. The default edge weights are chosen so that in the random realization of $G$ the average degree of the nodes is approximately 1.

Empirical evaluation. COPS significantly outperforms MargI, the previous optimization from samples algorithm, and gets much closer to the Greedy upper bound. We observe that the more community structure there is in the network, the better the performance of COPS is compared to MargI, e.g., on the SBMs versus ER and PA (which do not have a community structure). When the edge weight $q := q_{i.c.}$ for the cascades is small, the function is near-linear and MargI performs well, whereas when it is large, there is a lot of overlap and COPS performs better. The performance of COPS as a function of the overlap allowed can be

[Figure 2.5 plot: normalized performance vs. overlap allowed, with one curve for DBLP and one for Facebook; plotted values omitted.]

Figure 2.5: Performance of COPS as a function of the overlap $\alpha$ allowed. The performance is normalized so that the performances of Greedy and MargI correspond to values 1 and 0, respectively.

explained as follows: its performance slowly increases as the overlap allowed increases, since COPS can pick from a larger collection of nodes, until it drops when too much overlap is allowed and COPS picks mostly very close nodes from the same community. For SBM 1, with one larger community, MargI is trapped into only picking nodes from that larger community and performs even worse than Random. As $n$ increases, the number of nodes influenced increases roughly linearly for SBM 2, where the number of communities is fixed, since the number of nodes per community increases linearly; this is not the case for SBM 1.

2.2.3 General Submodular Maximization

We develop an $\tilde\Omega(n^{-1/4})$ optimization from samples algorithm over $\mathcal{D}$ for monotone submodular functions, for some distribution $\mathcal{D}$. This bound is essentially tight since submodular functions are not $n^{-1/4+\epsilon}$-optimizable from samples over any distribution (see Section 2.3.2). We first describe the distribution $\mathcal{D}$ for which the approximation holds. Then we describe the algorithm, which builds upon estimates of expected marginal contributions.

The distribution. Let $\mathcal{D}_i$ be the uniform distribution over all sets of size $i$. Define the distribution $\mathcal{D}_{sub}$ to be the distribution which draws from $\mathcal{D}_k$, $\mathcal{D}_{\sqrt n}$, and $\mathcal{D}_{\sqrt n + 1}$ at random.

Estimates of expected marginal contributions. We estimate $\hat v_i \approx \mathbb{E}_{S \sim \mathcal{D}_{\sqrt n} \mid e_i \notin S}[f_S(e_i)]$ with samples from $\mathcal{D}_{\sqrt n}$ and $\mathcal{D}_{\sqrt n + 1}$.

Algorithm 4 EEMC: Estimates the Expected Marginal Contribution $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt n} \mid e_i \notin S}[f_S(e_i)]$.
Input: $\mathcal{S} = \{S_j : (S_j, f(S_j)) \text{ is a sample}\}$
for $i \in [n]$ do
    $\mathcal{S}_{i,\sqrt n + 1} \leftarrow \{S : S \in \mathcal{S},\ e_i \in S,\ |S| = \sqrt n + 1\}$
    $\mathcal{S}_{-i,\sqrt n} \leftarrow \{S : S \in \mathcal{S},\ e_i \notin S,\ |S| = \sqrt n\}$
    $\hat v_i = \frac{1}{|\mathcal{S}_{i,\sqrt n + 1}|} \sum_{S \in \mathcal{S}_{i,\sqrt n+1}} f(S) - \frac{1}{|\mathcal{S}_{-i,\sqrt n}|} \sum_{S \in \mathcal{S}_{-i,\sqrt n}} f(S)$
end for
return $(\hat v_1, \ldots, \hat v_n)$
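As a sanity check, Algorithm 4 translates into a few lines of Python. This is our own sketch: the sample format is an assumption, and we require that sets of sizes $\lfloor\sqrt n\rfloor$ and $\lfloor\sqrt n\rfloor + 1$ both appear among the samples:

```python
import math

def eemc(samples, n):
    """Estimate expected marginal contributions (Algorithm 4 sketch).

    samples: list of (S, f(S)) pairs, S a set of elements from {0, ..., n-1}.
    Returns v_hat[i] approximating E_{S ~ D_sqrt(n)}[f_S(e_i) | e_i not in S]."""
    r = math.isqrt(n)  # floor of sqrt(n)
    v_hat = [0.0] * n
    for i in range(n):
        with_i = [f for S, f in samples if i in S and len(S) == r + 1]
        without_i = [f for S, f in samples if i not in S and len(S) == r]
        if with_i and without_i:  # otherwise leave the estimate at 0
            v_hat[i] = (sum(with_i) / len(with_i)
                        - sum(without_i) / len(without_i))
    return v_hat
```

On a linear function, enumerating all sets of the two relevant sizes recovers each element's weight exactly.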

By standard concentration bounds (Hoeffding's inequality, Lemma 2.2.10), these are good estimates of $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt n} \mid e_i \notin S}[f_S(e_i)]$ for product distributions $\mathcal{D}$ (Lemma 2.2.11).

Lemma 2.2.10 (Hoeffding's inequality). Let $X_1, \ldots, X_m$ be independent random variables with values in $[0, b]$. Let $\bar X = \frac{1}{m}\sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$,

$$\Pr\left[|\bar X - \mathbb{E}[\bar X]| \ge \epsilon\right] \le 2e^{-2m\epsilon^2/b^2}.$$

Lemma 2.2.11. With probability at least $1 - O(e^{-n})$, the estimates $\hat v_i$ defined above are $\epsilon$-accurate, for any $\epsilon \ge f(N)/\operatorname{poly}(n)$ and for all $e_i$, i.e.,

$$\left|\hat v_i - \mathbb{E}_{S \sim \mathcal{D}_{\sqrt n} \mid e_i \notin S}[f_S(e_i)]\right| \le \epsilon.$$

Proof. Let $\epsilon \ge f(N)/n^c$ for some constant $c$. With a sufficiently large polynomial number of samples, $\mathcal{S}_{i,\sqrt n+1}$ and $\mathcal{S}_{-i,\sqrt n}$ are of size at least $2n^{2c+1}$ with exponentially high probability. Then by Hoeffding's inequality (Lemma 2.2.10 with $m = 2n^{2c+1}$ and $b = f(N)$),

$$\Pr\left[\left|\frac{1}{|\mathcal{S}_{i,\sqrt n+1}|}\sum_{S\in\mathcal{S}_{i,\sqrt n+1}} f(S) - \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n+1}\mid e_i\in S}[f(S)]\right| \ge \epsilon/2\right] \le 2e^{-4n^{2c+1}(\epsilon/2)^2/f(N)^2} \le 2e^{-n}$$

and similarly,

$$\Pr\left[\left|\frac{1}{|\mathcal{S}_{-i,\sqrt n}|}\sum_{S\in\mathcal{S}_{-i,\sqrt n}} f(S) - \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}\mid e_i\notin S}[f(S)]\right| \ge \epsilon/2\right] \le 2e^{-n}.$$

By the definition of $\hat v_i$ and since

$$\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}\mid e_i\notin S}[f_S(e_i)] = \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n+1}\mid e_i\in S}[f(S)] - \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}\mid e_i\notin S}[f(S)],$$

the claim then holds with probability at least $1 - 4e^{-n}$.

The algorithm. We begin by computing the estimated expected marginal contributions of all elements. We then place the elements in $3\log n$ bins according to their estimated expected marginal contribution $\hat v_i$. The algorithm then simply returns either the best sample of size $k$ or a random subset of size $k$ of a random bin. Up to logarithmic factors, we can restrict our attention to just one bin. We give a formal description below.

Algorithm 5 An $\tilde\Omega(n^{-1/4})$-optimization from samples algorithm over $\mathcal{D}_{sub}$ for monotone submodular functions.
Input: $\mathcal{S} = \{S_i : (S_i, f(S_i)) \text{ is a sample}\}$
With probability $\frac12$:
    return $\operatorname{argmax}_{S \in \mathcal{S} : |S| = k} f(S)$   (best sample of size $k$)
With probability $\frac12$:
    $(\hat v_1, \ldots, \hat v_n) \leftarrow \text{EEMC}(\mathcal{S})$
    $\hat v_{\max} \leftarrow \max_i \hat v_i$
    for $j \in [3\log n]$ do
        $B_j \leftarrow \{i : \hat v_{\max}/2^{j} \le \hat v_i < \hat v_{\max}/2^{j-1}\}$
    end for
    Pick $j \in [3\log n]$ u.a.r.
    return $S$, a subset of $B_j$ of size $\min\{|B_j|, k\}$, u.a.r.   (a random set from a random bin)

Analysis of the algorithm. The main crux of this result is in the analysis of the algorithm. The analysis is divided into two cases, depending on whether a random set $S \sim \mathcal{D}_{\sqrt n}$ of size $\sqrt n$ has low value or not. Let $S^\star$ be the optimal solution.

• Assume that $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \le f(S^\star)/4$. Then optimal elements have large estimated expected marginal contribution $\hat v_i$ by submodularity. Let $B^\star$ be the bin with the largest value among the bins with contributions $\hat v \ge f(S^\star)/(4k)$. We argue that a random subset of $B^\star$ of size $k$ performs well. We first show that a random subset of $B^\star$ is a $|B^\star|/(4k\sqrt n)$-approximation. At a high level, a random subset $S$ of size $\sqrt n$ contains $|B^\star|/\sqrt n$ elements from bin $B^\star$ in expectation, and these elements $S_{B^\star} = S \cap B^\star$ each contribute at least $f(S^\star)/(4k)$ to $f(S_{B^\star})$. We then show that a random subset of $B^\star$ is an $\tilde\Omega(k/|B^\star|)$-approximation to $f(S^\star)$. The proof first shows that $f(B^\star)$ has high value by the assumption that a random set $S \sim \mathcal{D}_{\sqrt n}$ has low value, and then uses the fact that a subset of $B^\star$ of size $k$ is a $k/|B^\star|$-approximation to $B^\star$. Note that either $|B^\star|/(4k\sqrt n)$ or $\tilde\Omega(k/|B^\star|)$ is at least $\tilde\Omega(n^{-1/4})$.

• Assume that $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \ge f(S^\star)/4$. We argue that the best sample of size $k$ performs well. We first show that, by submodularity, a random set of size $k$ is a $k/(4\sqrt n)$-approximation, since a random set of size $k$ is a fraction $k/\sqrt n$ smaller than a random set from $\mathcal{D}_{\sqrt n}$ in expectation. We then show that the best sample of size $k$ is a $1/k$-approximation, since it contains the element with the highest value with high probability. Note that either $k/(4\sqrt n)$ or $1/k$ is at least $n^{-1/4}$.

We begin with the following useful lemma.

Lemma 2.2.12. For any monotone submodular function $f(\cdot)$, the value of a uniformly random set $S$ of size $k$ is, in expectation, a $k/n$-approximation to $f(N)$.

Proof. Partition the ground set into sets of size $k$ uniformly at random. A uniformly random set of this partition is a $k/n$-approximation to $f(N)$ in expectation by submodularity. A uniformly random set of this partition is also a uniformly random set of size $k$.
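Lemma 2.2.12 can be checked by brute force on a tiny coverage function (a toy instance of ours, not from the text): the average value over all size-$k$ sets is at least $(k/n)\, f(N)$.

```python
from itertools import combinations

# A small coverage function: each element covers a set of children.
cover = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}, 3: {"a", "c", "d"}}

def f(S):
    """Number of children covered by S (a monotone submodular function)."""
    return len(set().union(*(cover[i] for i in S))) if S else 0

n, k = 4, 2
vals = [f(S) for S in combinations(range(n), k)]
avg = sum(vals) / len(vals)
assert avg >= (k / n) * f(range(n))  # the k/n-approximation of Lemma 2.2.12
```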

In the first case of the analysis, we assume that $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \le f(S^\star)/4$. Let $j'$ be the largest $j$ such that bin $B_j$ contains at least one element $e_i$ such that $\hat v_i \ge f(S^\star)/(2k)$. So any element $e_i \in B_j$, $j \le j'$, is such that $\hat v_i \ge f(S^\star)/(4k)$. Define $B^\star = \operatorname{argmax}_{B_j : j \le j'} f(S^\star \cap B_j)$ to be the bin $B_j$ with high marginal contributions that has the highest value from the optimal solution. Let $t$ be the size of $B^\star$.

Lemma 2.2.13. If $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \le f(S^\star)/4$, then a uniformly random subset of bin $B^\star$ of size $\min\{k, t\}$ is a $(1 - o(1)) \cdot \min(1/4,\ t/(4k\sqrt n))$-approximation to $f(S^\star)$.

Proof. Note that

$$\begin{aligned}
\mathbb{E}_{S\sim\mathcal{D}_{t/\sqrt n}\mid S\subseteq B^\star}[f(S)] &\ge \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S\cap B^\star)] && \text{(submodularity)}\\
&\ge \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\Big[\textstyle\sum_{e_i\in S\cap B^\star} f_{(S\cap B^\star)\setminus e_i}(e_i)\Big] && \text{(submodularity)}\\
&\ge \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\Big[\textstyle\sum_{e_i\in S\cap B^\star} \mathbb{E}_{S'\sim\mathcal{D}_{\sqrt n}\mid e_i\notin S'}[f_{S'}(e_i)]\Big] && \text{(submodularity)}\\
&\ge \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\Big[\textstyle\sum_{e_i\in S\cap B^\star} (\hat v_i - f(N)/n^3)\Big]\\
&= \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[|S\cap B^\star|]\,(1 - o(1))\frac{f(S^\star)}{4k} && (\hat v_i \ge f(S^\star)/(4k) \text{ for } e_i\in B^\star)\\
&= (1-o(1))\frac{t f(S^\star)}{4k\sqrt n}
\end{aligned}$$

If $t/\sqrt n \ge k$, then a uniformly random subset of bin $B^\star$ of size $k$ is a $k\sqrt n/t$-approximation to $\mathbb{E}_{S\sim\mathcal{D}_{t/\sqrt n}\mid S\subseteq B^\star}[f(S)]$ by Lemma 2.2.12, so a $(1-o(1))/4$-approximation to $f(S^\star)$ by the above inequalities. Otherwise, if $t/\sqrt n < k$, a uniformly random subset of bin $B^\star$ of size $\min\{k, t\}$ has value at least $\mathbb{E}_{S\sim\mathcal{D}_{t/\sqrt n}\mid S\subseteq B^\star}[f(S)]$ by monotonicity, and is thus a $(1-o(1))\, t/(4k\sqrt n)$-approximation to $f(S^\star)$.

Lemma 2.2.14. If $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \le f(S^\star)/4$, a uniformly random subset of bin $B^\star$ of size $\min(k, t)$ is an $\tilde\Omega(\min(1, k/t))$-approximation to $f(S^\star)$.

Proof. We start by bounding the value of the optimal elements not in the bins $B_j$ with $j \le j'$.

$$\begin{aligned}
f(S^\star \setminus (\cup_{B_j : j\le j'} B_j)) &\le \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\big[f(S \cup (S^\star \setminus (\cup_{B_j : j\le j'} B_j)))\big] && \text{(monotonicity)}\\
&\le \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\Big[f(S) + \textstyle\sum_{e_i \in (S^\star\setminus(\cup_{B_j:j\le j'} B_j))\setminus S} \mathbb{E}_{S'\sim\mathcal{D}_{\sqrt n}\mid e_i\notin S'}[f_{S'}(e_i)]\Big] && \text{(submodularity)}\\
&\le f(S^\star)/4 + \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}\Big[\textstyle\sum_{e_i\in(S^\star\setminus(\cup_{B_j:j\le j'} B_j))\setminus S} (\hat v_i + f(N)/n^3)\Big] && \text{(assumption)}\\
&\le f(S^\star)/4 + f(S^\star)/2 + f(S^\star)/n && \text{(definition of } j')
\end{aligned}$$

Since $f(S^\star) \le f(S^\star \cap (\cup_{B_j:j\le j'} B_j)) + f(S^\star \setminus (\cup_{B_j:j\le j'} B_j))$ by submodularity, we get that $f(S^\star \cap (\cup_{B_j:j\le j'} B_j)) \ge f(S^\star)/5$. Since there are $3\log n$ bins, $f(S^\star \cap B^\star)$ is within a factor $3\log n$ of $f(S^\star)/5$ by submodularity and the definition of $B^\star$. By monotonicity, $f(B^\star)$ is within a factor $15\log n$ of $f(S^\star)$. Thus, by Lemma 2.2.12, a random subset $S$ of size $k$ of bin $B^\star$ is an $\tilde\Omega(\min(1, k/t))$-approximation to $f(S^\star)$.

Corollary 1. If $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \le f(S^\star)/4$, a uniformly random subset of size $\min\{k, |B_j|\}$ of a random bin $B_j$ is an $\tilde\Omega(n^{-1/4})$-approximation to $f(S^\star)$.

Proof. With probability $1/(3\log n)$, the random bin is $B^\star$, and we assume this is the case. By Lemmas 2.2.13 and 2.2.14, a random subset of $B^\star$ of size $\min(k, t)$ is both an $\Omega(\min(1, t/(k\sqrt n)))$- and an $\tilde\Omega(\min(1, k/t))$-approximation to $f(S^\star)$. Assume $t/(k\sqrt n) \le 1$ and $k/t \le 1$; otherwise we are done. Finally, note that if $t/k \ge n^{1/4}$, then $t/(k\sqrt n) \ge n^{-1/4}$; otherwise, $\tilde\Omega(k/t) \ge \tilde\Omega(n^{-1/4})$.

In the second case of the proof, we assume that $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \ge f(S^\star)/4$.

Lemma 2.2.15. For any monotone submodular function $f$, if $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \ge f(S^\star)/4$, then a uniformly random set of size $k$ is a $\min(1/4,\ k/(4\sqrt n))$-approximation to $f(S^\star)$.

Proof. If $k \ge \sqrt n$, then a uniformly random set of size $k$ is a $1/4$-approximation to $f(S^\star)$ by monotonicity. Otherwise, a uniformly random subset of size $k$ of $N$ is a uniformly random subset of size $k$ of a uniformly random subset of size $\sqrt n$ of $N$. So by Lemma 2.2.12,

$$\mathbb{E}_{S\sim\mathcal{D}_k}[f(S)] \ge \frac{k}{\sqrt n}\cdot \mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \ge \frac{k}{4\sqrt n}\cdot f(S^\star).$$

Lemma 2.2.16. For any monotone submodular function $f(\cdot)$, the sample $S$ with the largest value among at least $n\log n$ samples of size $k$ is a $1/k$-approximation to $f(S^\star)$ with high probability.

Proof. By submodularity, there exists an element $e_i^\star$ such that $\{e_i^\star\}$ is a $1/k$-approximation to the optimal solution. By monotonicity, any set which contains $e_i^\star$ is a $1/k$-approximation to the optimal solution. After observing $n\log n$ samples, the probability of never observing a set that contains $e_i^\star$ is polynomially small.

Corollary 2. If $\mathbb{E}_{S\sim\mathcal{D}_{\sqrt n}}[f(S)] \ge f(S^\star)/4$, then the sample of size $k$ with the largest value is a $\min(1/4,\ n^{-1/4}/4)$-approximation to $f(S^\star)$.

Proof. By Lemmas 2.2.15 and 2.2.16, the sample of size $k$ with the largest value is both a $\min(1/4,\ k/(4\sqrt n))$- and a $1/k$-approximation to $f(S^\star)$. If $k \ge n^{1/4}$, then $\min(1/4,\ k/(4\sqrt n)) \ge \min(1/4,\ n^{-1/4}/4)$; otherwise, $1/k \ge n^{-1/4}$.

By combining Corollaries 1 and 2, we obtain the main result of this section.

Theorem 2.2.4. Algorithm 5 is an $\tilde\Omega(n^{-1/4})$ optimization from samples algorithm over $\mathcal{D}_{sub}$ for monotone submodular functions.

2.3 The Limitations of Optimization from Samples

In this section, we show hardness results for optimization from samples from any distribution for three classes of functions. We begin with a general framework for showing hardness results in the optimization from samples model (Section 2.3.1). As a warm-up for coverage functions, we show in Section 2.3.2 that for submodular functions, there is no optimization from samples algorithm that obtains an $O(n^{-1/4+\epsilon})$ approximation. We then present the main result for the hardness of optimizing coverage functions from samples in Section 2.3.3. Finally, we show in Section 2.3.4 a $(1-c)/(1+c-c^2) + o(1)$ lower bound for monotone submodular functions with curvature $c$. For submodular functions and functions with curvature, these lower bounds are tight up to lower order terms with the upper bounds obtained for these functions in the previous section.

2.3.1 A Framework for Hardness of Optimization from Samples

The framework we introduce partitions the ground set of elements into good, bad, and masking elements. We derive two conditions on the values of these elements so that samples do not contain enough information to distinguish good and bad elements with high probability. We then give two additional conditions so that if an algorithm cannot distinguish good and bad elements, the solution returned by this algorithm has low value compared to the optimal set consisting of the good elements. We begin by defining the partition.

Definition 10. The collection of partitions $\mathcal{P}$ contains all partitions $P$ of the ground set $N$ into $r$ parts $T_1, \ldots, T_r$ of $k$ elements each and a part $M$ of the remaining $n - rk$ elements, where $n = |N|$.

The elements in $T_i$ are called the good elements, for some $i \in [r]$. The bad and masking elements are the elements in $T_{-i} := \cup_{j=1, j\ne i}^{r} T_j$ and $M$ respectively. Next, we define a class of functions $\mathcal{F}(g, b, m, m^+)$ such that $f \in \mathcal{F}(g, b, m, m^+)$ is defined in terms of good, bad, and masking functions $g$, $b$, and $m^+$, and a masking fraction $m \in [0, 1]$.³

Definition 11. Given functions $g, b, m, m^+$, the class of functions $\mathcal{F}(g, b, m, m^+)$ contains the functions $f^{P,i}$, where $P \in \mathcal{P}$ and $i \in [r]$, defined as

$$f^{P,i}(S) := (1 - m(S \cap M))\Big(g(S \cap T_i) + b(S \cap T_{-i})\Big) + m^+(S \cap M).$$

We use probabilistic arguments over the partition $P \in \mathcal{P}$ and the integer $i \in [r]$ chosen uniformly at random to show that for any distribution $\mathcal{D}$ and any algorithm, there exists a function in $\mathcal{F}(g, b, m, m^+)$ that the algorithm optimizes poorly given samples from $\mathcal{D}$. The functions $g, b, m, m^+$ have desired properties that are parametrized below. At a high level, the identical on small samples and masking on large samples properties imply that the samples do not contain enough information to learn $i$, i.e., to distinguish good and bad elements, even though the partition $P$ can be learned. The gap and curvature properties imply that if an algorithm cannot distinguish good and bad elements, then the algorithm performs poorly.

Definition 12. The class of functions $\mathcal{F}(g, b, m, m^+)$ has an $(\alpha, \beta)$-gap if the following conditions are satisfied for some $t$, where $\mathcal{U}(\mathcal{P})$ is the uniform distribution over $\mathcal{P}$.

1. Identical on small samples. For a fixed $S$ with $|S| \le t$, with probability $1 - n^{-\omega(1)}$ over partition $P \sim \mathcal{U}(\mathcal{P})$, $g(S \cap T_i) + b(S \cap T_{-i})$ is independent of $i$;
2. Masking on large samples. For a fixed $S$ with $|S| \ge t$, with probability $1 - n^{-\omega(1)}$ over partition $P \sim \mathcal{U}(\mathcal{P})$, the masking fraction is $m(S \cap M) = 1$;
3. $\alpha$-Gap. Let $S$ with $|S| = k$; then $g(S) \ge \max\{\alpha \cdot b(S),\ \alpha \cdot m^+(S)\}$;
4. $\beta$-Curvature. Let $S_1$ with $|S_1| = k$ and $S_2$ with $|S_2| = k/r$; then $g(S_1) \ge (1-\beta) \cdot r \cdot g(S_2)$.

³The notation $m^+$ refers to the role of this function, which is to maintain monotonicity of masking elements. These four functions are assumed to be normalized such that $g(\emptyset) = b(\emptyset) = m(\emptyset) = m^+(\emptyset) = 0$.

The following lemma reduces the problem of showing an impossibility result to constructing $g, b, m$, and $m^+$ which satisfy the above properties.

Lemma 2.3.1. Assume the functions $g, b, m, m^+$ have an $(\alpha, \beta)$-gap; then $\mathcal{F}(g, b, m, m^+)$ is not $2\max(1/(r(1-\beta)),\ 2/\alpha)$-optimizable from samples over any distribution $\mathcal{D}$.

Proof. Fix any distribution $\mathcal{D}$. We first claim that for a fixed set $S$, $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over a uniformly random partition $P \sim \mathcal{U}(\mathcal{P})$. If $|S| \le t$, then the claim holds immediately by the identical on small samples property. If $|S| \ge t$, then $m(S \cap M) = 1$ with probability $1 - n^{-\omega(1)}$ over $P$ by the masking on large samples property, and $f^{P,i}(S) = m^+(S \cap M)$.

Next, we claim that there exists a partition $P \in \mathcal{P}$ such that $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over $S \sim \mathcal{D}$. Denote the event that $f^{P,i}(S)$ is independent of $i$ by $I(S, P)$. By switching sums,

$$\sum_{P\in\mathcal{P}}\Pr(P\sim\mathcal{U}(\mathcal{P}))\sum_{S\in 2^N}\Pr(S\sim\mathcal{D})\,\mathbf{1}_{I(S,P)} = \sum_{S\in 2^N}\Pr(S\sim\mathcal{D})\sum_{P\in\mathcal{P}}\Pr(P\sim\mathcal{U}(\mathcal{P}))\,\mathbf{1}_{I(S,P)} \ge \sum_{S\in 2^N}\Pr(S\sim\mathcal{D})\left(1 - n^{-\omega(1)}\right) = 1 - n^{-\omega(1)},$$

where the inequality is by the first claim. Thus there exists some $P$ such that

$$\sum_{S\in 2^N}\Pr(S\sim\mathcal{D})\,\mathbf{1}_{I(S,P)} \ge 1 - n^{-\omega(1)},$$

which proves the desired claim.

Fix a partition $P$ such that the previous claim holds, i.e., $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over a sample $S \sim \mathcal{D}$. Then, by a union bound over the polynomially many samples, $f^{P,i}(S)$ is independent of $i$ for all samples $S$ with probability $1 - n^{-\omega(1)}$, and we assume this is the case for the remainder of the proof. It follows that the choices of the algorithm given samples from $f \in \{f^{P,i}\}_{i=1}^{r}$ are independent of $i$. Pick $i \in [r]$ uniformly at random and consider the (possibly randomized) set $S$ returned by the algorithm. Since $S$ is independent of $i$, we get $\mathbb{E}_{i,S}[|S \cap T_i|] \le k/r$. Let $S_{k/r} = \operatorname{argmax}_{S : |S| = k/r} g(S)$; we obtain

$$\begin{aligned}
\mathbb{E}_{i,S}\big[f^{P,i}(S)\big] &\le \mathbb{E}_{i,S}\big[g(S\cap T_i) + b(S\cap T_{-i}) + m^+(S\cap M)\big]\\
&\le g(S_{k/r}) + b(S) + m^+(S)\\
&\le \frac{1}{r(1-\beta)}\, g(T_i) + \frac{2}{\alpha}\, g(T_i)\\
&\le 2\max\left(\frac{1}{r(1-\beta)},\ \frac{2}{\alpha}\right) f^{P,i}(T_i),
\end{aligned}$$

where the first inequality holds since $m(S \cap M) \le 1$, the second by monotonicity and submodularity, and the third by the curvature and gap properties. Thus, there exists at least one $i$ such that the algorithm does not obtain a $2\max(1/(r(1-\beta)),\ 2/\alpha)$-approximation to $f^{P,i}(T_i)$, and $T_i$ is the optimal solution.

2.3.2 Submodular Maximization

Using the hardness framework from Section 2.3.1, it is relatively easy to show that there is no optimization from samples algorithm that obtains an $O(n^{-1/4+\epsilon})$ approximation for submodular functions over any distribution $\mathcal{D}$. The good, bad, and masking functions $g, b, m, m^+$ we use are:

$$\begin{aligned}
g(S) &= |S|,\\
b(S) &= \min(|S|,\ \log n),\\
m(S) &= \min(1,\ |S|/n^{1/2}),\\
m^+(S) &= n^{-1/4}\cdot\min(n^{1/2},\ |S|).
\end{aligned}$$
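The construction above can be written out directly. The sketch below builds $f^{P,i}$ from Definition 11 with these $g, b, m, m^+$; the partition and all concrete parameter values below are illustrative choices of ours:

```python
import math

def make_hard_instance(n, parts, M, i):
    """f^{P,i} from Definition 11, with g(S) = |S|, b(S) = min(|S|, log n),
    m(S) = min(1, |S|/sqrt(n)), m+(S) = n^(-1/4) * min(sqrt(n), |S|).

    parts: the parts T_1, ..., T_r; M: the masking part; i: index of the
    good part (0-based here for convenience)."""
    good = set(parts[i])
    bad = set().union(*(parts[j] for j in range(len(parts)) if j != i))
    M = set(M)

    def f(S):
        S = set(S)
        g = len(S & good)
        b = min(len(S & bad), math.log(n))
        mask = min(1.0, len(S & M) / n ** 0.5)
        m_plus = n ** -0.25 * min(n ** 0.5, len(S & M))
        return (1.0 - mask) * (g + b) + m_plus
    return f
```

For instance, with $n = 16$ and two parts of size 2, a set of good elements is worth its cardinality, while even the full masking part is worth only $n^{-1/4}\sqrt n$.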

It is easy to show that $\mathcal{F}(g, b, m, m^+)$ is a class of monotone submodular functions (Lemma 2.3.3). To derive the optimal $n^{-1/4+\epsilon}$ impossibility, we consider the cardinality constraint $k = n^{1/4-\epsilon/2}$ and the size of the partition to be $r = n^{1/4}$. We show that $\mathcal{F}(g, b, m, m^+)$ has an $(n^{1/4-\epsilon}, 0)$-gap.

Lemma 2.3.2. The class $\mathcal{F}(g, b, m, m^+)$ as defined above has an $(n^{1/4-\epsilon}, 0)$-gap with $t = n^{1/2+\epsilon/4}$.

Proof. We show that these functions satisfy the properties needed for an $(n^{1/4-\epsilon}, 0)$-gap.

• Identical on small samples. Assume $|S| \le n^{1/2+\epsilon/4}$. Then $|T_{-i}|\cdot|S|/n \le n^{1/2-\epsilon/2}\cdot n^{1/2+\epsilon/4}/n \le n^{-\epsilon/4}$, so by Lemma 2.3.6, $|S \cap T_{-i}| \le \log n$ with probability $1 - n^{-\omega(1)}$ over $P \sim \mathcal{U}(\mathcal{P})$. Thus

$$g(S \cap T_i) + b(S \cap T_{-i}) = |S \cap (\cup_{j=1}^{r} T_j)|$$

with probability $1 - n^{-\omega(1)}$ over $P$.

• Masking on large samples. Assume $|S| \ge n^{1/2+\epsilon/4}$. Then $|S \cap M| \ge n^{1/2}$ with exponentially high probability over $P \sim \mathcal{U}(\mathcal{P})$ by a Chernoff bound (Lemma 2.3.11), and $m(S \cap M) = 1$ with probability at least $1 - n^{-\omega(1)}$.

• Gap $\alpha = n^{1/4-\epsilon}$. Note that $g(S) = k = n^{1/4-\epsilon/2}$, $b(S) = \log n$, and $m^+(S) = n^{-\epsilon/2}$ for $|S| = k$, so $g(S) \ge n^{1/4-\epsilon}\, b(S)$ for $n$ large enough and $g(S) = n^{1/4}\, m^+(S)$.

• Curvature $\beta = 0$. The curvature $\beta = 0$ follows from $g$ being linear.

We now show that we obtain monotone submodular functions.

Lemma 2.3.3. The class of functions $\mathcal{F}(g, b, m, m^+)$ is a class of monotone submodular functions.

Proof. We show that the marginal contributions $f_S(e)$ of an element $e \in N$ to a set $S \subseteq N$ are such that $f_S(e) \ge f_T(e)$ for $S \subseteq T$ (submodularity) and $f_S(e) \ge 0$ for all $S$ (monotonicity), for all elements $e$. For $e \in T_j$, for all $j$, this follows immediately from $g$ and $b$ being monotone submodular. For $e \in M$, note that

$$f_S(e) = \begin{cases} -\dfrac{1}{n^{1/2}}\big(|S \cap T_i| + \min(|S \cap T_{-i}|,\ \log n)\big) + n^{-1/4} & \text{if } |S \cap M| < n^{1/2},\\[4pt] 0 & \text{otherwise,}\end{cases}$$

which is non-negative and non-increasing in $S$.

Together with Lemma 2.3.1, these two lemmas imply the hardness result.

Theorem 2.3.1. For every constant $\epsilon > 0$, there is no optimization from samples algorithm that obtains an $O(n^{-1/4+\epsilon})$ approximation for monotone submodular functions, for any distribution $\mathcal{D}$.

2.3.3 Maximum Coverage

We show that optimization from samples is in general impossible, over any distribution $\mathcal{D}$, even when the function is learnable and optimizable; this is the main result of this chapter. Specifically, we show that there exist no constant $\alpha$ and distribution $\mathcal{D}$ such that coverage functions are $\alpha$-optimizable from samples, even though they are $(1-\epsilon)$-PMAC learnable over any distribution $\mathcal{D}$ and can be maximized under a cardinality constraint within a factor of $1 - 1/e$. We begin by formally defining coverage functions.

Definition. A function is called coverage if there exists a family of sets $T_1, \ldots, T_n$ that cover subsets of a universe $U$ with weights $w(a_j)$ for $a_j \in U$ such that for all $S$, $f(S) = \sum_{a_j \in \cup_{i \in S} T_i} w(a_j)$. A coverage function is polynomial-sized if the universe is of polynomial size in $n$. Influence maximization is a generalization of maximizing coverage functions under a cardinality constraint.

Coverage functions are heavily used in machine learning [Swaminathan et al., 2009, Yue and Joachims, 2008, Guestrin et al., 2005, Krause and Guestrin, 2007, Antonellis et al., 2012, Lin and Bilmes, 2011, Takamura and Okumura, 2009], data-mining [Chierichetti et al., 2010, Du et al., 2014b, Saha and Getoor, 2009, Singer, 2012, Dasgupta et al., 2007, Gomez-Rodriguez et al., 2010], mechanism design [Dobzinski and Schapira, 2006, Lehmann et al., 2001, Dughmi and Vondrák, 2015, Dughmi et al., 2011, Buchfuhrer et al., 2010, Dughmi, 2011], privacy [Gupta et al., 2013, Feldman and Kothari, 2014], as well as influence maximization [Kempe et al., 2003, Seeman and Singer, 2013, Borgs et al., 2014]. In many of these applications, the functions are learned from data and the goal is to optimize the function under a cardinality constraint. In addition to learnability and optimizability, coverage functions have many other desirable properties. One important fact is that they are

parametric: if the sets $T_1, \ldots, T_n$ are known, then the coverage function is completely defined by the weights $\{w(a) : a \in U\}$. Our impossibility result holds even in the case where the

sets T1,...,Tn are known. We state the main result.

Theorem 2.3.2. No algorithm can obtain an approximation better than $2^{-\Omega(\sqrt{\log n})}$ for maximizing a polynomial-sized coverage function under a cardinality constraint, using polynomially many samples drawn from any distribution.

In Section 2.3.1, we constructed a framework which reduces the problem of proving information-theoretic lower bounds to constructing functions that satisfy certain properties. The main challenge is to construct coverage functions that satisfy these properties. We first state a definition of coverage functions that is equivalent to the traditional definition and that is used throughout this section.

Definition 13. A function $f : 2^N \to \mathbb{R}$ is coverage if there exists a bipartite graph $G = (N \cup \{a_j\}_j, E)$ between elements $N$ and children $\{a_j\}_j$ with weights $w(a_j)$ such that the value of a set $S$ is the sum of the weights of the children covered by $S$, i.e., for all $S \subseteq N$,

$$f(S) = \sum_{a_j : \exists\, e a_j \in E,\ e \in S} w(a_j).$$

A coverage function is polynomial-sized if the number of children is polynomial in $n = |N|$.
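This bipartite-graph view of a coverage function translates directly into code; a minimal evaluator (names are ours):

```python
def coverage_value(S, children_of, weight):
    """f(S) = total weight of the children adjacent to at least one element
    of S, per the bipartite-graph representation of a coverage function."""
    covered = set()
    for e in S:
        covered.update(children_of.get(e, ()))
    return sum(weight[a] for a in covered)
```

For example, with `children_of = {0: ["a", "b"], 1: ["b", "c"]}` and unit-free weights, the shared child "b" is counted only once for the set $\{0, 1\}$.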

The construction of good and bad coverage functions $g$ and $b$ that combine the identical on small samples property with a large $\alpha$-gap on large sets, as needed by the framework, is the main technical challenge. The bad function $b$ needs to increase slowly (or not at all) for large sets to obtain a large $\alpha$-gap, which requires a non-trivial overlap in the children covered by bad elements (this is related to coverage functions being second-order supermodular [Korula et al., 2015]). The overlap in children covered by good elements must then be similar (identical on small samples) while the good function still needs to grow quickly for large sets (large gap), as illustrated in Figure 2.6. We consider the cardinality constraint $k = n^{2/5-\epsilon}$ and a number of parts $r = n^{1/5-\epsilon}$. At a high level, the proof follows three main steps.

1. Constructing the good and bad functions. In the first part, we construct the good and bad functions whose values are identical on small samples for $t = n^{3/5+\epsilon}$, have gap $\alpha = n^{1/5-\epsilon}$, and curvature $\beta = o(1)$. These good and bad functions are affine combinations of primitives $\{C_p\}_{p \in \mathbb{N}}$, which are coverage functions with desirable properties;

2. Constructing the masking function. In the second part, we construct $m$ and $m^+$

Figure 2.6: Sketches of the desired $g(S)$ and $b(S)$ in the simple case where they only depend on $y = |S|$.

that are masking on large samples for $t = n^{3/5+\epsilon}$ and that have a gap $\alpha = n^{1/5}$. In this construction, masking elements cover the children from functions $g$ and $b$ such that $t$ masking elements cover all the children, but $k$ masking elements only cover an $n^{-1/5}$ fraction of them.

3. From exponential to polynomial-sized coverage functions. Lastly, we prove the hardness result for polynomial-sized coverage functions. This construction relies on constructions of $\ell$-wise independent variables to reduce the number of children.

Constructing the Good and the Bad Coverage Functions

We describe the construction of good and bad functions that are identical on small samples for $t = n^{3/5+\epsilon}$, with a gap $\alpha = n^{1/5-\epsilon}$ and curvature $\beta = o(1)$. To do so, we introduce a class of primitive functions $C_p$, through which we express the good and bad functions. For symmetric functions $h$ (i.e., whose value only depends on the size of the set), we abuse notation and simply write $h(y)$ instead of $h(S)$ for a set $S$ of size $y$.

Figure 2.7: The value of the coverage functions $C_p(y)$ for $1 \le p \le 7$ and sets of size $y \in [10]$.

The construction. We begin by describing the primitives we use for the good and bad functions. These primitives are the family $\{C_p\}_{p \in \mathbb{N}}$, which are symmetric and defined as

$$C_p(y) = p \cdot \left(1 - (1 - 1/p)^y\right).$$

These are coverage functions defined over an exponential number of children. We begin by showing that the primitives $C_p(y)$ (illustrated in Figure 2.7) are indeed coverage functions. It then follows that the functions $g$ and $b$ are coverage.

Claim 2. Consider the coverage function over ground set $N$ where for each set $S$, there is a child $a_S$ that is covered by exactly $S$, and child $a_S$ has weight $w(a_S) = p \cdot \Pr(S \sim B(N, 1/p))$, where the binomial distribution $B(N, 1/p)$ picks each element in $N$ independently with probability $1/p$. Then this coverage function is $C_p$.

Proof. Note that

$$\begin{aligned}
C_p(S) &= \sum_{T : |T \cap S| \ge 1} w(a_T)\\
&= \sum_{T : |T \cap S| \ge 1} p \cdot \Pr(T \sim B(N, 1/p))\\
&= p \cdot \Big(1 - \sum_{T : |T \cap S| = 0} \Pr(T \sim B(N, 1/p))\Big)\\
&= p \cdot \Big(1 - \Pr_{T \sim B(N, 1/p)}[|T \cap S| = 0]\Big)\\
&= p \cdot \Big(1 - \big(1 - \tfrac{1}{p}\big)^{|S|}\Big).
\end{aligned}$$

For a given $\ell \in [n]$, we construct $g$ and $b$ as affine combinations of $\ell$ coverage functions $C_{p_j}(y)$ weighted by variables $x_j$ for $j \in [\ell]$:

• The good function is defined as

$$g(y) := y + \sum_{j : x_j < 0} (-x_j)\, C_{p_j}(y).$$

• The bad function is defined as $b(S) = \sum_{j=1, j \ne i}^{r} b'(S \cap T_j)$, with

$$b'(y) := \sum_{j : x_j > 0} x_j\, C_{p_j}(y).$$

Overview of the analysis of the good and bad functions. Observe that if g(y) = b′(y) for all y ≤ ℓ for some sufficiently large ℓ, then we obtain the identical on small samples property. The main idea is to express these ℓ constraints as a system of linear equations Ax = y, where A_{ij} := C_{p_j}(i) and y_j := j, with i, j ∈ [ℓ]. We prove that this system has two crucial properties:

1. A is invertible. In Lemma 2.3.5 we show that there exists {p_j}_{j=1}^{ℓ} such that the matrix A is invertible, by interpreting its entries defined by C_{p_j} as non-zero polynomials of degree ℓ. This implies that the system of linear equations A·x = y can be solved, and that there exists a coefficient vector x* needed for our construction of the good and the bad functions;

2. ||x*||_∞ is bounded. In Lemma 2.3.8 we use Cramer's rule and Hadamard's inequality to prove that the entries of x* are bounded. This implies that the linear term y in g(y) dominates x*_j·C_{p_j}(y) for large y and all j. This then allows us to prove the curvature and gap properties.

These properties of A imply the desired properties of g and b for ℓ = log log n.
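The system Ax = y can be solved exactly over the rationals on a toy instance. In the sketch below (illustrative parameters only: p_j = j, which satisfies j ≤ p_j ≤ j(j+1), not the thesis's choice), the solution x* indeed satisfies the ℓ constraints Σ_j x*_j·C_{p_j}(i) = i, i.e., g(i) = b′(i) for all i ≤ ℓ.

```python
# A small exact instance of the linear system A x = y (a sketch, not the
# thesis's parameters): A[i][j] = C_{p_j}(i) and y_i = i for i, j in [l].
from fractions import Fraction

def C_exact(p: int, y: int) -> Fraction:
    """The primitive C_p(y) = p * (1 - ((p-1)/p)**y), exactly."""
    return p * (1 - Fraction(p - 1, p) ** y)

def solve(A, y):
    """Gauss-Jordan elimination over the rationals; assumes A invertible."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

l = 4
ps = list(range(1, l + 1))  # p_j = j satisfies j <= p_j <= j(j+1)
A = [[C_exact(p, i) for p in ps] for i in range(1, l + 1)]
x = solve(A, [Fraction(i) for i in range(1, l + 1)])
for i in range(1, l + 1):  # the identical-on-small-samples constraints
    assert sum(xj * C_exact(p, i) for xj, p in zip(x, ps)) == i
```

Because the arithmetic is exact, the constraints hold with equality, not just numerically.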

Lemma 2.3.4. For every constant ε > 0, there exist coverage functions g, b such that the identical on small samples property holds for t = n^{3/5+ε}, with gap α = n^{1/5−ε} and curvature β = o(1).

In the remainder of this section, we prove Lemma 2.3.4. Lemma 2.3.7 shows the identical on small samples property. It uses Lemma 2.3.6, which shows that if |S| ≤ n^{3/5+ε}, then with probability 1 − n^{−ω(1)}, |S ∩ T_j| ≤ log log n for all j. The property then follows from the system of linear equations. The gap and curvature properties are proven in Lemmas 2.3.9 and 2.3.10 using the fact that the term y in g dominates the other terms in g and b.

Lemma 2.3.5. The matrix A({p_j}_{j=1}^{ℓ}) is invertible for some set of integers {p_j}_{j=1}^{ℓ} such that j ≤ p_j ≤ j(j+1) for all 1 ≤ j ≤ ℓ.

Proof. The proof goes by induction on ℓ and shows that it is possible to pick p_ℓ such that the rows of A({p_j}_{j=1}^{ℓ}) are linearly independent. The base case is trivial. In the inductive step, assume p_1, …, p_{ℓ−1} have been picked so that the (ℓ−1) × (ℓ−1) matrix A({p_j}_{j=1}^{ℓ−1}) is invertible. We show that for some choice of integer p_ℓ ∈ [p_{ℓ−1}, ℓ(ℓ+1)] there does not exist a vector z such that Σ_{i≤ℓ} z_i·A_{i,j} = 0 for all j ≤ ℓ, where A = A({p_j}_{j=1}^{ℓ}). We write the first ℓ−1 entries of row A_ℓ as a linear combination of the other ℓ−1 rows:

    Σ_{i<ℓ} z_i·A_{i,j} = A_{ℓ,j}   for all j < ℓ.

Since A({p_j}_{j=1}^{ℓ−1}) is invertible by the inductive hypothesis, there exists a unique solution z* to the above system of linear equations. It remains to show that Σ_{i<ℓ} z*_i·A_{i,ℓ} ≠ A_{ℓ,ℓ}, which by the uniqueness of z* implies that there does not exist a vector z such that Σ_{i≤ℓ} z_i·A_{i,j} = 0 for all j ≤ ℓ. Observe that

    −A_{ℓ,ℓ} + Σ_{i<ℓ} z*_i·A_{i,ℓ} = (−(p_ℓ^ℓ − (p_ℓ − 1)^ℓ) + Σ_{i<ℓ} z*_i·(p_ℓ^i − (p_ℓ − 1)^i)·p_ℓ^{ℓ−i}) / p_ℓ^{ℓ−1}

and that

    −(p_ℓ^ℓ − (p_ℓ − 1)^ℓ) + Σ_{i<ℓ} z*_i·(p_ℓ^i − (p_ℓ − 1)^i)·p_ℓ^{ℓ−i}

is a non-zero polynomial of degree ℓ in p_ℓ that has at most ℓ roots. Therefore, there exists an integer p_ℓ with p_{ℓ−1} < p_ℓ ≤ p_{ℓ−1} + ℓ + 1 such that Σ_{i<ℓ} z*_i·A_{i,ℓ} ≠ A_{ℓ,ℓ}. By the inductive hypothesis, p_ℓ ≤ p_{ℓ−1} + ℓ + 1 ≤ (ℓ−1)ℓ + ℓ + 1 ≤ ℓ(ℓ+1).

We need the following lemma to show the identical on small samples property.

Lemma 2.3.6. Let T be a uniformly random set of size |T| and consider a set S such that |T|·|S|/n ≤ n^{−ε} for some constant ε > 0. Then Pr(|S ∩ T| ≥ ℓ) = n^{−Ω(ℓ)}.

Proof. We start by considering a subset L of S of size ℓ. We first bound the probability that L is a subset of T:

    Pr(L ⊆ T) ≤ Π_{e∈L} Pr(e ∈ T) ≤ Π_{e∈L} |T|/n = (|T|/n)^ℓ.

We then bound the probability that |S ∩ T| > ℓ with a union bound over the events that a set L is a subset of T, for all subsets L of S of size ℓ:

    Pr(|S ∩ T| > ℓ) ≤ Σ_{L⊆S : |L|=ℓ} Pr(L ⊆ T)
                   ≤ (|S| choose ℓ)·(|T|/n)^ℓ
                   ≤ (|T|·|S|/n)^ℓ
                   ≤ n^{−εℓ},

where the last inequality follows from the assumption that |T|·|S|/n ≤ n^{−ε}.
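The chain of inequalities in this proof can be checked numerically for concrete sizes. The sketch below (with assumed parameters |S| = |T| ≈ n^{0.3}, so that |T|·|S|/n ≤ n^{−0.4}) verifies that the union bound is dominated by (|T|·|S|/n)^ℓ, since (|S| choose ℓ) ≤ |S|^ℓ.

```python
# Numerical check of the proof of Lemma 2.3.6 (a sketch, not from the
# thesis): the union bound comb(|S|, l) * (|T|/n)**l is at most
# (|T|*|S|/n)**l because comb(|S|, l) <= |S|**l.
from math import comb

def union_bound(n, s, t, l):
    return comb(s, l) * (t / n) ** l

n = 10**6
t = s = int(n ** 0.3)            # then |T|*|S|/n <= n**(-0.4), i.e. eps = 0.4
for l in range(1, 8):
    assert union_bound(n, s, t, l) <= (t * s / n) ** l <= n ** (-0.4 * l)
```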

For coverage functions, we let ¸ = log log n.

Lemma 2.3.7. The identical on small samples property holds for t = n^{3/5+ε/2}.

Proof. Lemma 2.3.6 implies that |S ∩ T_j| ≤ ℓ = log log n for all j with probability 1 − n^{−ω(1)}, for a set S of size at most n^{3/5+ε/2}. Thus, g(S ∩ T_j) = b′(S ∩ T_j) for all j w.p. 1 − n^{−ω(1)} by the system of linear equations, which implies the identical on small samples property for t = n^{3/5+ε/2}.

The gap and curvature properties require bounding the coefficients x* (Lemma 2.3.8). We recall two basic results from linear algebra (Theorems 2.3.3 and 2.3.4) that are used to bound the coefficients.

Theorem 2.3.3 (Cramer's rule). Let A be an invertible matrix. The solution to the linear system Ax = y is given by x_i = det(A_i)/det(A), where A_i is the matrix A with the i-th column replaced by the vector y.

Theorem 2.3.4 (Hadamard's inequality). det(A) ≤ Π_i ||v_i||, where ||v_i|| denotes the Euclidean norm of the i-th column of A.

Lemma 2.3.8. Let x* be the solution to the system of linear equations A({p_j}_{j=1}^{ℓ})·x = y. Then the entries of this solution are bounded: |x*_i| ≤ ℓ^{O(ℓ^4)}.

Proof. Denote A := A({p_j}_{j=1}^{ℓ}). By Lemma 2.3.5, A is invertible, so let x* = A^{−1}·y. By Cramer's rule (Theorem 2.3.3), x*_i = det(A_i)/det(A), where A_i is A with the i-th column replaced by the vector y. Using the bound from Lemma 2.3.5, every entry in A can be represented as a rational number with numerator and denominator bounded by ℓ^{O(ℓ)}. We can multiply by all the denominators and get an integer matrix with positive entries bounded by ℓ^{O(ℓ^3)}. Now, by Hadamard's inequality (Theorem 2.3.4), the determinants of the integral A and all the A_i's are integers bounded by ℓ^{O(ℓ^4)}. Therefore every entry in x* can be written as a rational number with numerator and denominator bounded by ℓ^{O(ℓ^4)}.
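Both theorems can be exercised on a small exact example (illustrative, not part of the thesis's argument): Cramer's rule recovers the solution of a 3×3 rational system, and Hadamard's inequality bounds its determinant.

```python
# Cramer's rule and Hadamard's inequality (Theorems 2.3.3 and 2.3.4) on a
# small exact example -- a sanity check, not part of the thesis's proof.
from fractions import Fraction

def det(M):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

A = [[Fraction(v) for v in row] for row in ([2, 1, 1], [1, 3, 2], [1, 0, 0])]
y = [Fraction(v) for v in (3, 6, 1)]

# Cramer's rule: x_i = det(A_i) / det(A), A_i = A with column i replaced by y
x = []
for i in range(3):
    Ai = [[y[r] if c == i else A[r][c] for c in range(3)] for r in range(3)]
    x.append(det(Ai) / det(A))
assert all(sum(A[r][c] * x[c] for c in range(3)) == y[r] for r in range(3))

# Hadamard's inequality: |det A| <= product of column Euclidean norms,
# checked here in squared form to stay within exact rational arithmetic
col_norm_sq = [sum(A[r][c] ** 2 for r in range(3)) for c in range(3)]
assert det(A) ** 2 <= col_norm_sq[0] * col_norm_sq[1] * col_norm_sq[2]
```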

Using the bounds previously shown for x*, the following two lemmas establish the gap α and curvature β of the good and bad functions g(·) and b(·).

Lemma 2.3.9. The gap between the good and the bad functions g(·) and b(·) is at least α = n^{1/5−ε} for general coverage functions and at least α = 2^{Ω(√log n)} for polynomial-size coverage functions.

Proof. We show the gap between the good and the bad functions on a set S of size k. Recall that b(S) ≤ r·Σ_{j : x*_j > 0, j ≤ ℓ} x*_j·C_{p_j}(k). We can bound each summand as:

    x*_j·C_{p_j}(k) ≤ x*_j·p_j       (C_p and c_p upper bounded by p)
                    ≤ x*_j·ℓ(ℓ+1)    (Lemma 2.3.5)
                    ≤ ℓ^{O(ℓ^4)}     (Lemma 2.3.8),

and therefore b(S) ≤ r·ℓ·ℓ^{O(ℓ^4)} = r·ℓ^{O(ℓ^4)}. On the other hand, the good function is bounded from below by the cardinality: g(k) ≥ k. Plugging in k = n^{2/5−ε}, r = n^{1/5−ε}, and ℓ = log log n, we get the gap α:

    g(k)/b(S) ≥ n^{2/5−ε} / (n^{1/5}·(log log n)^{log^4 log n}) ≫ n^{1/5−ε}.

With k = 2^{√log n}, r = 2^{√log n / 2}, and ℓ = log log n, we get

    g(k)/b(S) ≥ 2^{√log n} / (2^{√log n / 2}·(log log n)^{log^4 log n}) = 2^{(1−o(1))·√log n / 2}.

Lemma 2.3.10. The curvature for both the general and polynomial-size good function is — = o(1).

Proof. Note that k/r ≥ 2^{√log n / 2} for both the general and polynomial-size cases. Thus, the curvature β is

    1 − g(k)/(r·g(k/r)) ≤ 1 − k/(r·(k/r) + r·(log log n)^{log^4 log n})
                        ≤ 1 − (1 + 2^{−√log n / 2}·(log log n)^{log^4 log n})^{−1}
                        = o(1),

where the first inequality follows from a reasoning similar to the one used to upper bound b(S) in Lemma 2.3.9.

Finally, combining Lemmas 2.3.7, 2.3.9, and 2.3.10, we get Lemma 2.3.4.

Constructing the Masking Function

Masking elements make good and bad elements indistinguishable on large samples.

The masking elements. The construction of the coverage functions g and b defined in the previous section is generalized so that we can add masking elements with desirable properties. For each child a_i in the coverage function defined by g + b, we divide a_i into n^{3/5} children a_{i,1}, …, a_{i,n^{3/5}} with equal weights w(a_{i,j}) = w(a_i)/n^{3/5} for all j. Each element covering a_i according to g and b now covers the children a_{i,1}, …, a_{i,n^{3/5}}. Note that the values g(S) and b(S) remain unchanged with this new construction, and thus the previous analysis still holds. Each masking element in M is defined by drawing j uniformly from [n^{3/5}] and having this element cover the children a_{i,j} for all i.

The masking function m^+(S) is the total weight covered by the masking elements in S, and the masking fraction m(S) is the fraction of j ∈ [n^{3/5}] such that j is drawn for at least one element in S.

Masking properties. Masking elements cover children that are already covered by good or bad elements. A large number of masking elements mask the good and bad elements, which implies that good and bad elements are indistinguishable.

• In Lemma 2.3.12 we prove that the masking property holds for t = n^{3/5+ε}.

• We show a gap α = n^{1/5} in Lemma 2.3.13: for any S with |S| ≤ k, we have g(S) ≥ n^{1/5}·m^+(S).

We begin by stating the Chernoff bound, used in Lemma 2.3.12.

Lemma 2.3.11 (Chernoff bound). Let X_1, …, X_m ∈ {0, 1} be independent indicator random variables. Let X = Σ_{i=1}^{m} X_i and μ = E[X]. For 0 < δ < 1,

    Pr(|X − μ| ≥ δμ) ≤ 2e^{−μδ²/3}.
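A seeded simulation (illustrative, not from the thesis) shows the bound in action for fair coins: the observed frequency of large deviations stays below 2e^{−μδ²/3}.

```python
# Empirical illustration of the Chernoff bound of Lemma 2.3.11 (a sketch,
# not from the thesis): seeded simulation of m fair coins, comparing the
# observed deviation frequency with the bound 2 * exp(-mu * delta**2 / 3).
import math
import random

rng = random.Random(0)  # fixed seed for reproducibility
m, trials, delta = 400, 2000, 0.25
mu = m * 0.5
bound = 2 * math.exp(-mu * delta ** 2 / 3)

deviations = 0
for _ in range(trials):
    X = sum(rng.random() < 0.5 for _ in range(m))
    if abs(X - mu) >= delta * mu:
        deviations += 1
assert deviations / trials <= bound
```

With these parameters a deviation of δμ = 50 is a five-standard-deviation event, so essentially no trial deviates, well below the bound of about 0.03.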

Lemma 2.3.12. Consider the construction above for masking elements. Then the masking on large samples property holds for t = n^{3/5+ε}.

Proof. First, we show that a large set S contains a large number of masking elements with exponentially high probability, i.e., with probability at least 1 − e^{−Ω(n^ε)} for some constant ε > 0. We then show that a large number of masking elements covers all the children with exponentially high probability, thus m(S ∩ M) = 1.

Masking elements are a 1 − o(1) fraction of N since there are n − rk = n − n^{1/5−ε}·n^{2/5−ε} masking elements. By the Chernoff bound (Lemma 2.3.11), a set of size at least n^{3/5+ε} contains at least n^{3/5+ε}/2 masking elements with exponentially high probability. By another Chernoff bound, with n^{3/5+ε}/2 masking elements, at least one of these elements covers a fixed child a_{i,j} with exponentially high probability. By a union bound, this holds for all j ∈ [n^{3/5}]. Finally, note that if a set of masking elements covers a_{i,1}, …, a_{i,n^{3/5}} for some i, this set covers a_{i,1}, …, a_{i,n^{3/5}} for all i. Thus, w.p. at least 1 − n^{−ω(1)}, m(S ∩ M) = 1.

Lemma 2.3.13. The masking function m has a gap α = n^{1/5+ε} with the good function g.

Proof. We first bound the value of all good and bad elements, and then bound the fraction of that total value obtained by k masking elements. The value of all bad elements is

    Σ_{j=1, j≠i}^{r} b′(T_j) = (r − 1)·Σ_{1≤j≤ℓ : x*_j ≥ 0} x*_j·C_{p_j}(k)   (definition of b′)
                             ≤ r·Σ_{1≤j≤ℓ} ℓ^{O(ℓ^4)}·C_{p_j}(k)             (Lemma 2.3.8)
                             ≤ r·Σ_{1≤j≤ℓ} ℓ^{O(ℓ^4)}·p_j                    (C_p, c_p ≤ p)
                             ≤ r·Σ_{1≤j≤ℓ} ℓ^{O(ℓ^4)}                        (Lemma 2.3.5)
                             ≤ o(k)
                             ≤ o(g(k)),

where the second-to-last inequality uses ℓ = log log n, r = n^{1/5−ε}, and k = n^{2/5−ε}. Now note that a masking element covers a 1/n^{3/5} fraction of the value of all good and bad elements by the above construction. Thus, k = n^{2/5−ε} masking elements cover at most a 1/n^{1/5+ε} fraction of the total value of all good and bad elements. Combining this with the total value of bad elements being upper bounded by o(g(k)) concludes the proof.

Combining Lemmas 2.3.4, 2.3.12, and 2.3.13, we obtain an (n^{1/5−ε}, o(1))-gap. The main result for exponential-size coverage functions then follows from Lemma 2.3.1.

An impossibility result for exponential size coverage functions. We have the four properties for an (n^{1/5−ε}, o(1))-gap. The functions f^{P,i} are coverage since g, b, and m^+ are coverage functions and m is the fraction of overlap between children from g, b, and m^+.

Claim 3. Coverage functions are not n^{−1/5+ε}-optimizable from samples over any distribution D, for any constant ε > 0.

From Exponential to Polynomial Size Coverage Functions

The construction above relies on the primitives C_p, which are defined with exponentially many children. In this section we modify the construction to use primitives c_p, which are coverage functions with polynomially many children. The function class F = (g, b, m, m^+) obtained then consists of coverage functions with polynomially many children. The functions c_p we construct satisfy c_p(y) = C_p(y) for all y ≤ ℓ, and thus the matrix A for polynomial-size coverage functions is identical to the general case. We lower the cardinality constraint to k = 2^{√log n} = |T_j| so that the functions c_p(S ∩ T_j) need to be defined over only 2^{√log n} elements. We also lower the number of parts to r = 2^{√log n / 2}.

Maintaining symmetry via ℓ-wise independence. The technical challenge in defining a coverage function with polynomially many children is in maintaining the symmetry of non-trivial size sets. To do so, we construct coverage functions {ζ^z}_{z∈[k]} for which the elements that cover a random child are approximately ℓ-wise independent. The next lemma reduces the problem to that of constructing coverage functions ζ^z that satisfy certain properties.

Lemma 2.3.14. Assume there exist symmetric (up to sets of size ℓ) coverage functions ζ^z with poly(n) children that are each covered by z ∈ [k] parents. Then, there exist coverage functions c_p with poly(n) children that satisfy c_p(S) = C_p(y) for all S such that |S| = y ≤ ℓ, and c_p(k) = C_p(k).

Proof. The proof starts from the construction for C_p with exponentially many children over a ground set of size p and modifies it into a coverage function with polynomially many children while satisfying the desired conditions. For each z ≤ k, replace all children in C_p that are covered by exactly z elements with ζ^z(·). Define C_p^z to be C_p restricted to those children that are covered by exactly z elements. Let the new children from ζ^z(·) be such that ζ^z(k) = C_p^z(k).

Clearly c_p has polynomially many children in n since each ζ^z(·) has polynomially many children. Then, note that

    c_p(k) = Σ_{z=1}^{k} ζ^z(k) = Σ_{z=1}^{k} C_p^z(k) = C_p(k).

Finally, we show that c_p(S) = C_p(y) for all S such that |S| = y ≤ ℓ. Note that it suffices to show that ζ^z(S) = C_p^z(y) for all z, which we prove by induction on y. The base case y = 0 is trivial. If y > 0, then consider some set S such that |S| = y and let e ∈ S. By the inductive hypothesis, ζ^z(S \ e) = C_p^z(y − 1). Let T be the set of all children in ζ^z(·) not covered by S \ e. Define ζ^z_T(·) to be ζ^z(·) restricted to the children in T. Since all children in T are covered by z distinct elements that are not in S \ e,

    Σ_{e′ ∉ S\e} ζ^z_T(e′) = z·(ζ^z(k) − ζ^z(y − 1)).

By the assumptions on ζ^z, we have ζ^z(S) = ζ^z((S \ e) ∪ e′) for any e′ ∉ S \ e. Combining this with both ζ^z(S) = ζ^z(S \ e) + ζ^z_T(e) and ζ^z((S \ e) ∪ e′) = ζ^z(S \ e) + ζ^z_T(e′),

    ζ^z_T(e) = ζ^z_T(e′) = z·(ζ^z(k) − ζ^z(y − 1))/(k − y + 1).

So,

    ζ^z(S) = ζ^z(S \ e) + ζ^z_T(e)
           = ζ^z(y − 1) + z·(ζ^z(k) − ζ^z(y − 1))/(k − y + 1)
           = C_p^z(y − 1) + z·(C_p^z(k) − C_p^z(y − 1))/(k − y + 1)
           = C_p^z(y),

where the last equality is obtained for C_p^z similarly to how it was obtained for ζ^z.

We now construct such ζ^z. Assume without loss of generality that k is prime (otherwise, pick some prime close to k). Given a ∈ [k]^ℓ and x ∈ [z], let h_a(x) := Σ_{i∈[ℓ]} a_i·x^i mod k. The children in ζ^z are U = {a ∈ [k]^ℓ : h_a(x_1) ≠ h_a(x_2) for all distinct x_1, x_2 ∈ [z]}. The k elements are {j : 0 ≤ j < k}, and element j covers child a if h_a(x) = j for some x ∈ [z].
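The construction can be materialized for tiny parameters. The sketch below (with assumed values k = 5, ℓ = 2, z = 3, and the covering rule that element j covers child a iff j lies in the image h_a([z])) checks the symmetry property targeted by Lemma 2.3.16: the number of children covered by every element of T depends only on |T|, for |T| ≤ ℓ.

```python
# The l-wise independent construction behind zeta^z, on tiny assumed
# parameters (a sketch, not the thesis's parameters): children are
# coefficient vectors a in [k]^l with h_a injective on [z], where
# h_a(x) = sum_i a_i * x**i mod k.
from itertools import combinations, product

k, ell, z = 5, 2, 3  # k prime

def h(a, x):
    return sum(ai * x ** i for i, ai in enumerate(a)) % k

children = [a for a in product(range(k), repeat=ell)
            if len({h(a, x) for x in range(z)}) == z]  # injective on [z]

def covers(T):
    """Number of children whose image h_a([z]) contains every j in T."""
    return sum(all(j in {h(a, x) for x in range(z)} for j in T)
               for a in children)

# Symmetry: the count depends only on |T|, for all |T| <= l
for size in range(1, ell + 1):
    counts = {covers(T) for T in combinations(range(k), size)}
    assert len(counts) == 1
```

The symmetry here is a consequence of the affine maps of Z_k acting on the children; the same count for every T of equal size is exactly what makes ζ^z symmetric on small sets.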

Lemma 2.3.15. Let a be a uniformly random child. Then Pr(h_a(x_1) = j_1, …, h_a(x_ℓ) = j_ℓ) is independent of the distinct x_1, …, x_ℓ ∈ [z] and j_1, …, j_ℓ ∈ [p]. More precisely,

    Pr(h_a(x_1) = j_1, …, h_a(x_ℓ) = j_ℓ) = Π_{i=1}^{ℓ} 1/(p + 1 − i).

Proof. It is well known that the random variables (h_a(0), …, h_a(z − 1)), where a is chosen uniformly at random from [p]^ℓ, are ℓ-wise independent since h_a(·) is a polynomial of degree ℓ − 1, so

    Pr_{a∼U([p]^ℓ)}(h_a(x_1) = j_1, …, h_a(x_ℓ) = j_ℓ) = Π_{i=1}^{ℓ} Pr(h_a(x_i) = j_i).

By throwing away all children a such that there exist distinct x_1, x_2 with h_a(x_1) = h_a(x_2), we obtain the following by combining with the symmetry of the children a removed (there exists exactly one polynomial defined by some a passing through any collection of ℓ points):

    Pr(h_a(x_1) = j_1, …, h_a(x_ℓ) = j_ℓ) = Π_{i=1}^{ℓ} Pr(h_a(x_i) = j_i | h_a(x_i) ∉ {j_1, …, j_{i−1}}).

Finally, note that

    Π_{i=1}^{ℓ} Pr(h_a(x_i) = j_i | h_a(x_i) ∉ {j_1, …, j_{i−1}}) = Π_{i=1}^{ℓ} 1/(p + 1 − i)

by the symmetry induced by a_0.

We are now ready to show the main lemma for the coverage functions ζ^z(·).

Lemma 2.3.16. The coverage function ζ^z is symmetric for all sets of size at most ℓ.

Proof. Let a be a child chosen uniformly at random and let S be a set of size at most ℓ. Then ζ^z(S) = k·Pr(∪_{j∈S} (∃x ∈ [z] s.t. h_a(x) = j)), and

    Pr(∪_{j∈S} (∃x ∈ [z] s.t. h_a(x) = j)) = Σ_{T⊆S} (−1)^{|T|+1}·Pr(T ⊆ {h_a(x) : x ∈ [z]})

by inclusion-exclusion. Note that Pr(T ⊆ {h_a(x) : x ∈ [z]}) only depends on the size of T by Lemma 2.3.15. Therefore ζ^z(S) only depends on the size of S, and ζ^z(·) is symmetric for all sets of size at most ℓ.

We construct g, b, m, and m^+ as in the general case but in terms of the primitives c_p instead of C_p. Since k is reduced, the gap α is reduced to 2^{Ω(√log n)}, and we obtain an (α = 2^{Ω(√log n)}, β = o(1))-gap for polynomial-sized coverage functions.

Lemma 2.3.17. There exist polynomial-sized coverage functions g, b, m, and m^+ that satisfy an (α = 2^{Ω(√log n)}, β = o(1))-gap with t = n^{3/5+ε}.

Proof. We construct g, b, m, and m^+ as in the general case but in terms of the primitives c_p instead of C_p. By Lemmas 2.3.14 and 2.3.16, we obtain the same matrix A and coefficients x* as in the general case, so the identical on small samples property holds. The masking on large samples property holds identically as for general coverage functions. The gap and curvature properties are shown in Lemmas 2.3.9 and 2.3.10.

Hardness of optimization from samples for coverage functions. We get our main result by combining Lemma 2.3.1 with this (α = 2^{Ω(√log n)}, β = o(1))-gap.

Theorem 2.3.5. For every constant ε > 0, coverage functions are not n^{−1/5+ε}-optimizable from samples over any distribution D. In addition, polynomial-sized coverage functions are not 2^{−Ω(√log n)}-optimizable from samples over any distribution D.

2.3.4 Curvature

We show that the approximation obtained for submodular functions with curvature in Section 2.2.1 is tight. For every c < 1, there exist monotone submodular functions that cannot be (1 − c)/(1 + c − c²)-optimized from samples. This impossibility result is information-theoretic: we show that with high probability the samples do not contain the right information to obtain a better approximation.

Technical overview. To obtain a tight bound, all the losses from Algorithm 1 must be tight. We need to obtain a 1 − c·f(R)/f(S*) gap between the contribution Σ_{i≤k} f_R(e*_i) of optimal elements to a random set and the value f(S*). This gap implies that as a set grows with additional random elements, the contribution of optimal elements must decrease. The main difficulty is in obtaining this decrease while maintaining random sets of small value, submodularity, and the curvature.

The ground set of elements is partitioned into three parts: the good elements G, the bad elements B, and the poor elements P. In relation to the analysis of the algorithm, the optimal solution S* is G, the set S consists mostly of elements in B, and a random set consists mostly of elements in P. The values of the good, bad, and poor elements are given by the good, bad, and poor functions g(·), b(·), and p(·), to be defined later, and the functions f(·) we construct for the impossibility result are:

    f^G(S) := g(S ∩ G, S ∩ P) + b(S ∩ B) + p(S ∩ P).

The value of the good function also depends on the poor elements, to obtain the decrease in marginal contribution of good elements mentioned above. The proof of the hardness result (Theorem 2.3.6) starts with concentration bounds in Lemma 2.3.18 to show that w.h.p. every sample contains a small number of good and bad elements and a large number of poor elements. Using these concentration bounds, Lemma 2.3.19 gives two conditions on the functions g(·), b(·), and p(·) to obtain the desired result. Informally, the first condition is that good and bad elements cannot be distinguished, while the second is that G has larger value than a set with a small number of good elements. We then construct these functions and show that they satisfy the two conditions in Lemma 2.3.20. Finally, Lemma 2.3.21 shows that f(·) is monotone submodular with curvature c.

Theorem 2.3.6. For every c < 1, there exists a hypothesis class of monotone submodular functions with curvature c that is not ((1 − c)/(1 + c − c²) + o(1))-optimizable from samples.

The remainder of this section is devoted to the proof of Theorem 2.3.6. Let ε > 0 be some small constant. The set of poor elements P is fixed and has size n − n^{2/3−ε}. The good elements G are then a uniformly random subset of P^C of size k := n^{1/3}; the remaining elements B are the bad elements. The following concentration bound is used to show that elements in G and B cannot be distinguished.

Lemma 2.3.18. All samples S are such that |S ∩ (G ∪ B)| ≤ log n and |S ∩ P| ≥ k − 2 log n with high probability.

Proof. We start by showing the first part. Consider a subset L of G ∪ B of size log n. We first bound the probability that L is a subset of a sample S:

    Pr(L ⊆ S) ≤ Π_{e∈L} Pr(e ∈ S) ≤ Π_{e∈L} k/n = (k/n)^{log n}.

We then bound the probability that |S ∩ (G ∪ B)| > log n with a union bound over the events that a set L is a subset of S, for all subsets L of G ∪ B of size log n:

    Pr(|S ∩ (G ∪ B)| > log n) ≤ Σ_{L⊆G∪B : |L|=log n} Pr(L ⊆ S)
                              ≤ (|G ∪ B| choose log n)·(k/n)^{log n}
                              ≤ (k·|G ∪ B|/n)^{log n}
                              ≤ n^{−ε log n}.

We now show that a sample is of size at least k − log n w.h.p., which, combined with the first part of the lemma, implies the second part. We lower bound the ratio of the number of

Figure 2.8: The symmetric functions g(s_G, s_P) and b(s_B), plotted against the set size s. The plot shows g(s, 0), g(s, s_P) for s_P ≥ k − 2 log n, and b(s), illustrating the 1/(1 + c − c²) loss for good elements and the 1 − c loss for bad elements.

sets of size k to the number of sets of size at most k − log n:

    (n choose k) / Σ_{i=0}^{k−log n} (n choose i)
        ≥ (n choose k) / (2·(n choose k − log n))
        = (1/2)·Π_{i=0}^{log n − 1} (n choose k−i) / (n choose k−i−1)
        = (1/2)·Π_{i=0}^{log n − 1} (n − k + i + 1)/(k − i)
        ≥ (1/2)·((n/k)·(1 − o(1)))^{log n}.

Since this ratio is n^{ω(1)}, a sample has size at least k − log n w.h.p.

We now give two conditions on the good, bad, and poor functions to obtain an impossibility result based on the above concentration bounds. The first condition ensures that good and bad elements cannot be distinguished. The second condition quantifies the gap between the value of k good elements and the value of a set with a small number of good elements. We denote by s_G the number of good elements in a set S, i.e., s_G := |S ∩ G|, and define s_B and s_P similarly. The good, bad, and poor functions are symmetric, meaning they each have equal value over sets of equal size, and we abuse notation with g(s_G, s_P) = g(S ∩ G, S ∩ P) and similarly for b(s_B) and p(s_P). Figure 2.8 is a simplified illustration of these two conditions.

Lemma 2.3.19. Consider sets S and S′, and assume g(·), b(·), and p(·) are such that

1. g(s_G, s_P) + b(s_B) = g(s′_G, s′_P) + b(s′_B) if

   • s_G + s_B = s′_G + s′_B ≤ log n and s_P, s′_P ≥ k − 2 log n;

2. g(s_G, s_P) + b(s_B) + p(s_P) < α·g(k, 0) if

   • s_G ≤ n^ε and s_G + s_B + s_P ≤ k;

then the hypothesis class of functions F = {f^G(·) : G ⊆ P^C, |G| = k} is not α-optimizable from samples.

Proof. By Lemma 2.3.18, for any two samples S and S′, we have s_G + s_B ≤ log n, s′_G + s′_B ≤ log n, and s_P, s′_P ≥ k − 2 log n with high probability. If s_G + s_B = s′_G + s′_B, then by the first assumption, g(s_G, s_P) + b(s_B) = g(s′_G, s′_P) + b(s′_B). Recall that G is a uniformly random subset of the fixed set P^C and that B consists of the remaining elements in P^C. Thus, w.h.p., the value f^G(S) of all samples S is independent of which random subset G is. In other words, no algorithm can distinguish good elements from bad elements with polynomially many samples. Let T be the set returned by the algorithm. Since any decision of the algorithm is independent of G, the expected number of good elements in T is t_G ≤ k·|G|/|G ∪ B| = k²/n^{2/3−ε} = n^ε. Thus,

    E_G[f^G(T)] = g(t_G, t_P) + b(t_B) + p(t_P) ≤ g(n^ε, t_P) + b(t_B) + p(t_P) < α·g(k, 0),

where the first inequality is by the submodularity and monotonicity properties of the good elements G for f^G(·), and the second inequality is by the second condition of the lemma. In expectation, the set T returned by the algorithm is therefore not an α-approximation to the solution G for at least one function f^G(·) ∈ F, and F is not α-optimizable from samples.

Constructing g(·), b(·), p(·). The goal is now to construct g(·), b(·), and p(·) that satisfy the above conditions. We start with the good and bad functions:

    g(s_G, s_P) = s_G·(1 − (1 − 1/(1 + c − c²))·s_P/(k − 2 log n))       if s_P ≤ k − 2 log n
    g(s_G, s_P) = s_G·1/(1 + c − c²)                                     otherwise

    b(s_B) = s_B·1/(1 + c − c²)                                          if s_B ≤ log n
    b(s_B) = (s_B − log n)·(1 − c)/(1 + c − c²) + log n·1/(1 + c − c²)   otherwise

These functions exactly exhibit the losses from the analysis of the algorithm in the case where the algorithm returns bad elements. As illustrated in Figure 2.8, there is a 1 − c loss between the contribution 1/(1 + c − c²) of a bad element to a random set and its contribution (1 − c)/(1 + c − c²) to a set with at least log n bad elements. There is also a 1/(1 + c − c²) loss between the contribution 1 of a good element to a set with no poor elements and its contribution 1/(1 + c − c²) to a random set. We add a function p(s_P) to f^G(·) so that it is monotone increasing when adding poor elements:

    p(s_P) = s_P·(1 − c)/(1 + c − c²)·k/(k − 2 log n)                                                           if s_P ≤ k − 2 log n
    p(s_P) = (s_P − (k − 2 log n))·(1 − c)²/(1 + c − c²) + (k − 2 log n)·(1 − c)/(1 + c − c²)·k/(k − 2 log n)   otherwise

The next two lemmas show that these functions satisfy the conditions of Lemma 2.3.19 and that f^G(·) is monotone submodular with curvature c, which concludes the proof of Theorem 2.3.6.
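The three functions can be implemented directly and the conditions of Lemma 2.3.19 checked for concrete parameters. The sketch below (with assumed values c = 1/2, k = 10^6, and L standing in for log n; not the thesis's asymptotic parameters) verifies the indistinguishability equality and the (1 − c)/(1 + c − c²) + o(1) gap.

```python
# The good, bad, and poor functions for c = 0.5 (a sketch with concrete
# assumed parameters, not the thesis's asymptotic ones): k plays the role
# of n**(1/3) and L the role of log n.

c = 0.5
D = 1 + c - c * c          # the recurring denominator 1 + c - c**2
k, L = 10**6, 20           # chosen so that k >> L

def g(sG, sP):
    if sP <= k - 2 * L:
        return sG * (1 - (1 - 1 / D) * sP / (k - 2 * L))
    return sG / D

def b(sB):
    if sB <= L:
        return sB / D
    return (sB - L) * (1 - c) / D + L / D

def p(sP):
    if sP <= k - 2 * L:
        return sP * ((1 - c) / D) * k / (k - 2 * L)
    return (sP - (k - 2 * L)) * (1 - c) ** 2 / D + k * (1 - c) / D

# Condition 1 of Lemma 2.3.19: samples with the same s_G + s_B <= log n
# and s_P >= k - 2 log n have equal value g + b.
for sG, sB in [(0, 7), (3, 4), (7, 0)]:
    assert abs(g(sG, k - 2 * L) + b(sB) - 7 / D) < 1e-9

# Condition 2 (no good elements): value at most ((1-c)/D + o(1)) * g(k, 0)
assert g(k, 0) == k
assert b(k) / g(k, 0) < (1 - c) / D + 1e-3
```

For c = 1/2 the gap is (1 − c)/(1 + c − c²) = 0.4, and the check confirms a set of k bad elements attains only a 0.4 + o(1) fraction of g(k, 0).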

Lemma 2.3.20. The functions g(·), b(·), and p(·) defined above satisfy the conditions of Lemma 2.3.19 with α = (1 − c)/(1 + c − c²) + o(1).

Proof. We start with the first condition. Assume s_G + s_B = s′_G + s′_B ≤ log n and s_P, s′_P ≥ k − 2 log n. Then,

    g(s_G, s_P) + b(s_B) = (s_G + s_B)·1/(1 + c − c²) = (s′_G + s′_B)·1/(1 + c − c²) = g(s′_G, s′_P) + b(s′_B).

For the second condition, assume s_G ≤ n^ε and s_G + s_B + s_P ≤ k. It is without loss of generality to assume that s_B + s_P ≥ k − n^ε; then

    f^G(S) ≤ (1 + o(1))·(s_B + s_P)·(1 − c)/(1 + c − c²) ≤ k·((1 − c)/(1 + c − c²) + o(1)).

We conclude by noting that g(k, 0) = k.

Lemma 2.3.21. The function f^G(·) is a monotone submodular function with curvature c.

Proof. We show that the marginal contributions are positive (monotonicity), decreasing (submodularity), but not by more than a 1 − c factor (curvature), i.e., that f_S(e) ≥ f_T(e) ≥ (1 − c)·f_S(e) ≥ 0 for all S ⊆ T and e ∉ T. Let e be a good element; then

    f^G_S(e) = 1 − (1 − 1/(1 + c − c²))·s_P/(k − 2 log n)   if s_P ≤ k − 2 log n
    f^G_S(e) = 1/(1 + c − c²)                               otherwise.

Since s_P ≤ t_P for S ⊆ T, we obtain f_S(e) ≥ f_T(e) ≥ 0. It is also easy to see that f_T(e) ≥ 1/(1 + c − c²) ≥ 1 − c ≥ (1 − c)·f_S(e). For bad elements,

    f^G_S(e) = 1/(1 + c − c²)          if s_B ≤ log n
    f^G_S(e) = (1 − c)/(1 + c − c²)    otherwise.

Thus, f_S(e) ≥ f_T(e) ≥ (1 − c)·f_S(e) ≥ 0 for all S ⊆ T and e ∉ T. Finally, for poor elements,

    f^G_S(e) = −(1 − 1/(1 + c − c²))·s_G/(k − 2 log n) + (1 − c)/(1 + c − c²)·k/(k − 2 log n)   if s_P ≤ k − 2 log n
    f^G_S(e) = (1 − c)²/(1 + c − c²)·k/(k − 2 log n)                                            otherwise.

Since s_G ≤ k,

    (1 − c)/(1 + c − c²)·k/(k − 2 log n) ≥ f^G_S(e) ≥ (1 − c)²/(1 + c − c²)·k/(k − 2 log n).

Consider S ⊆ T; then s_G ≤ t_G, and f_S(e) ≥ f_T(e) ≥ (1 − c)·f_S(e) ≥ 0.

2.4 References and Acknowledgments

The work in Section 2.2.1 and Section 2.3.4 on functions with curvature was produced in collaboration with Aviad Rubinstein and Yaron Singer and appeared in the NIPS 2016

publication The Power of Optimization from Samples (Balkanski et al. [2016]). The work in Section 2.2.2 on learning to influence was produced in collaboration with

Nicole Immorlica and Yaron Singer and appeared in the NIPS 2017 publication The Importance of Communities for Learning to Influence (Balkanski et al. [2017a]). The work in Section 2.2.3, Section 2.3.2, and Section 2.3.3 on coverage and submodular functions was produced in collaboration with Aviad Rubinstein and Yaron Singer and appeared

in the STOC 2017 publication The Limitations of Optimization from Samples (Balkanski et al. [2017b]).

Chapter 3

Adaptive Optimization

Since sharp impossibility results arise when given samples drawn from any distribution, we turn to an adaptive sampling model. In adaptive sampling, similarly as in active learning, the algorithm obtains multiple batches of samples where each batch is drawn from a new distribution chosen by the algorithm based on the previous batches. The central question with adaptive sampling is then how many batches of samples are needed for optimizing the function. This question carries strong implications for parallelization since samples that are drawn from the same distribution are non-adaptive and can be evaluated in parallel. This connection to parallelization is formalized by the adaptive complexity model, which we introduced in the context of combinatorial optimization. The adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute function evaluations in parallel. An algorithm using r batches of samples is hence an r-adaptive algorithm. Since submodular optimization is regularly applied on very large datasets, adaptivity is crucial as algorithms with low adaptivity enable dramatic speedups in parallel computing time. Submodular optimization has been studied for well over forty years now, and in the past decade there has been extensive study of submodular maximization for large datasets [Jegelka et al., 2011, Badanidiyuru et al., 2012b, Kumar et al., 2015a, Jegelka et al.,

92 2013, Mirzasoleiman et al., 2013, Wei et al., 2014, Nishihara et al., 2014, Badanidiyuru and Vondrák, 2014, Pan et al., 2014, Badanidiyuru et al., 2014, Mirzasoleiman et al., 2015a, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2015b, Barbosa et al., 2015, Mirzasoleiman et al., 2016, Barbosa et al., 2016, Epasto et al., 2017]. Somewhat surprisingly however, until very recently, there was no known constant-factor approximation algorithm for submodular maximization whose adaptivity is sublinear in n. In this chapter, we show that the adaptive complexity of submodular maximization, i.e., the minimum number of rounds r such that there exists an r-adaptive algorithm which achieves a constant factor approximation, is logarithmic up to lower order terms. We also obtain algorithms with logarithmic adaptivity for submodular functions with curvature, non-monotone submodular functions, and matroid constraints.

3.1 Adaptivity

3.1.1 The Adaptive Complexity Model

We first review the adaptive complexity model. As is standard, we assume access to a value oracle for the function such that for any set S ⊆ N the oracle returns f(S) in O(1) time. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round.

Definition 14. Given a value oracle for f, an algorithm is r-adaptive if every query f(S) for the value of a set S occurs at a round i ∈ [r] such that S is independent of the values f(S′) of all other queries at round i, with at most poly(n) queries at every round.

In Section 1.4.2, we discuss adaptivity and parallel computing.

Figure 3.1: The adaptivity landscape for submodular optimization. The figure plots approximation guarantees against rounds of adaptivity: the 1 − 1/e algorithms (Nemhauser-Wolsey 1978, Nemhauser-Wolsey-Fisher 1978, with the matching hardness of Feige 1998) and every other previously known constant approximation algorithm require k rounds, which is Ω(n) in the worst case, while the Θ̃(n^{−1/4}) hardness of Balkanski-Rubinstein-Singer 2017 applies to a single round.

3.1.2 The Adaptivity Landscape for Submodular Optimization

For the canonical problem of maximizing a monotone submodular function under a cardinality constraint $k$, very little was known about the adaptive complexity of this problem. The mostly negative results for optimization from samples from Chapter 2 are a consequence of non-adaptivity: any algorithm that only has access to samples of function values cannot make adaptive queries, and this restriction inhibits reasonable approximation guarantees. The main impossibility result for optimization from samples implies that the adaptive complexity must be strictly larger than 1. On the other hand, the celebrated greedy algorithm, which achieves the optimal $1 - 1/e$ approximation guarantee by iteratively adding the element with largest marginal contribution, is trivially $k$-adaptive. In the worst case, $k \in \Omega(n)$. All constant factor approximation algorithms we are aware of for maximizing a submodular function under a cardinality constraint are at best $k$-adaptive. So all we know is that the adaptive complexity is between 1 and $\Omega(n)$ (Figure 3.1).

What is the adaptive complexity of maximizing a submodular function?

Adaptivity is not only a fundamental theoretical concept but it also has important practical consequences. There is a wide variety of applications of submodular maximization where

function evaluations are easily parallelized but each evaluation requires a long time to complete. In crowdsourcing, for example, function evaluations depend on responses from human agents and highly sequential algorithms are impractical. Data summarization, experimental design, influence maximization, marketing, survey design, and biological simulations are all examples where the adaptive complexity of optimization largely determines the runtime bottleneck of the optimization algorithm (see Section 1.4.3 for a detailed discussion of these applications).

3.1.3 Main Result

Our main result is that the adaptive complexity of submodular maximization is $\tilde{\Theta}(\log n)$. This provides a characterization that is tight up to low-order terms, and an exponential improvement in the adaptivity over any known constant factor approximation algorithm for maximizing a monotone submodular function. Our characterization is composed of two major results. The first is an algorithm whose adaptivity is $O(\log n)$ and which obtains an approximation arbitrarily close to $1/3$.

Theorem. For the problem of monotone submodular maximization under a cardinality constraint and any constant $\epsilon > 0$, there exists an $O(\log n)$-adaptive algorithm which obtains, with probability $1 - o(1)$, a $(1/3 - \epsilon)$-approximation.

We complement the upper bound by showing that the adaptive complexity of submodular maximization is at least quasi-logarithmic: no $\tilde{o}(\log n)$-adaptive algorithm can obtain an approximation strictly better than $\frac{1}{\log n}$.

Theorem. For the problem of monotone submodular maximization under a cardinality constraint, there is no $\frac{\log n}{12 \log \log n}$-adaptive algorithm that obtains, with probability $\omega\big(\frac{1}{n}\big)$, a $\frac{1}{\log n}$-approximation.

In fact, we show the following more general impossibility result: for any $r \le \log n$, there is no $r$-adaptive algorithm that obtains, with probability $\omega\big(\frac{1}{n}\big)$, an $n^{-\frac{1}{2r+2}} \cdot (r+3)\log n$ approximation.

3.1.4 Adaptive Sampling: a Coupling of Learning and Optimization

Our motivation is to understand what are the necessary and sufficient conditions from a learnability model that yield desirable approximation guarantees for optimization. Since sharp impossibility results arise from learning with non-adaptive samples over any distribution, we turned to an adaptive sampling model [Thompson, 1990]. In adaptive sampling, the learner obtains $\text{poly}(n)$ samples drawn from a distribution of her choice in every round. Our $(1/3 - \epsilon)$-approximation $O(\log n)$-adaptive algorithm is achieved by adaptive sampling. Our hardness result holds for queries and hence also for adaptive sampling. This implies that in the realizable case, where there is a true underlying function generating the data, $\tilde{\Theta}(\log n)$ batches of adaptive samples are necessary and sufficient to approximately "learn to optimize" a monotone submodular function under a cardinality constraint.

3.1.5 Overview of Results

In Section 3.2, we present algorithms with low adaptivity as well as their analysis. In Section 3.2.1, we describe an adaptive sampling algorithm which requires $O(\log n)$ sequential rounds and achieves an approximation that is arbitrarily close to $1/3$ for monotone submodular maximization under a cardinality constraint. This result implies that the adaptive complexity of maximizing a monotone submodular function is $O(\log n)$. In Section 3.2.2, we conduct experiments on real data sets for movie recommendation and taxi dispatch systems which demonstrate the effectiveness of adaptive sampling in practice. In Section 3.2.3, we describe an algorithm whose approximation is arbitrarily close to the


optimal $1 - 1/e$ guarantee in $O(\log n)$ adaptive rounds for monotone submodular maximization under a cardinality constraint. This algorithm achieves an exponential speedup in parallel running time for submodular maximization at the expense of an arbitrarily small loss in approximation quality. This guarantee is optimal in both approximation and adaptivity, up to lower order terms.

[Figure: approximation versus rounds of adaptivity, including recent results: the optimal $1-1/e$ in $\tilde{\Theta}(\log n)$ rounds (B-Rubinstein-Singer and Ene-L.Nguyên 19), $1/2$ (under some condition) and $1/3$ in $\tilde{\Theta}(\log n)$ rounds (B-Singer 18), the $\tilde{\Theta}(n^{-1/4})$ one-round hardness (B-Rubinstein-Singer 17), the $\frac{1}{\log n}$ hardness (B-Singer 18), and the $1-1/e$ of Nemhauser-Wolsey-Fisher 78 and every other previously known constant approximation algorithm at $k$ rounds, $\Omega(n)$ in the worst case.]

Figure 3.2: The adaptivity landscape for submodular optimization including recent results.

In Section 3.2.4, we describe an algorithm for non-monotone functions whose approximation is arbitrarily close to $1/2e$ in $O(\log^2 n)$ adaptive rounds, for non-monotone submodular maximization under a cardinality constraint. This is also an exponential speedup in parallel running time over any previously studied algorithm for constrained non-monotone submodular maximization. Beyond its provable guarantees, the algorithm performs well in practice. Specifically, experiments on traffic monitoring and personalized data summarization applications show that the algorithm finds solutions whose values are competitive with state-of-the-art algorithms while running in exponentially fewer parallel iterations. Despite the burst in work on submodular maximization in the adaptive complexity model, the fundamental problem of maximizing a monotone submodular function under a matroid constraint has remained elusive. In particular, all known techniques fail for this problem and there are no known constant factor approximation algorithms whose adaptivity is sublinear

Algorithms:
    Family          Apx                     Rounds                                          Section
    Submodular      $1/3 - \epsilon$        $O((1/\epsilon^2) \log n)$                      3.2.1
    Submodular      $1 - 1/e - \epsilon$    $O((1/\epsilon^2) \log n)$                      3.2.3
    Non-monotone    $1/2e - \epsilon$       $O((1/\epsilon^2) \log^2 n)$                    3.2.4
    Matroid         $1 - 1/e - \epsilon$    $O((1/\epsilon^3) \log^2 (n/\epsilon^3))$       3.2.5

Hardness:
    Family          Apx                     Rounds                                          Section
    Submodular      $\frac{1}{\log n}$      $\frac{\log n}{\log \log n}$                    3.3

Table 3.1: Overview of results for adaptive optimization. Unless otherwise specified, these results are for monotone submodular maximization under a cardinality constraint.

in the rank of the matroid $k$ or, in the worst case, sublinear in the size of the ground set $n$. In Section 3.2.5, we describe an approximation algorithm for the problem of maximizing a monotone submodular function under a matroid constraint. The approximation guarantee of the algorithm is arbitrarily close to the optimal $1 - 1/e$ and it has near optimal adaptivity of $O(\log(n) \log(k))$. This result is obtained using a novel technique of adaptive sequencing which departs from previous techniques for submodular maximization in the adaptive complexity model. In addition to our main result, we show how to use this technique to design other approximation algorithms with strong approximation guarantees and polylogarithmic adaptivity. In Section 3.3, we show that no algorithm can achieve an approximation better than $O\big(\frac{1}{\log n}\big)$ with fewer than $O\big(\frac{\log n}{\log \log n}\big)$ rounds. This result implies that the adaptive complexity of maximizing a monotone submodular function under a cardinality constraint is $\tilde{\Theta}(\log n)$. We summarize the results in this chapter in Table 3.1.

3.2 Adaptive Algorithms

3.2.1 An Algorithm with Logarithmic Adaptivity

In this section, we show that the adaptive complexity of maximizing a monotone submodular function under a cardinality constraint is $O(\log n)$ via the Adaptive-Sampling algorithm, which has logarithmic adaptivity and obtains an approximation arbitrarily close to $1/3$. This algorithm uses two simple, yet powerful, adaptive sampling techniques as primitives. The first is down-sampling, which in each round maintains a uniform distribution over high-valued elements by iteratively discarding elements with low marginal contribution to a random set. The second primitive is up-sampling, which at every round identifies the elements with highest value and includes them in all future samples. Neither of these primitives achieves a constant factor approximation in $O(\log n)$ rounds, but an appropriate combination of them does.

Technical Overview

The main building block of the adaptive sampling algorithm is the construction, at every round $r$, of a meaningful distribution $\mathcal{D}_r$ with elements having marginal probabilities $p_1, \ldots, p_n$ of being drawn. We begin by presenting two simple primitives: down-sampling and up-sampling. In every round, down-sampling identifies elements $a_i \in N$ whose expected marginal contribution to a random set drawn according to $\mathcal{D}_r$ is sufficiently low and sets $p_i = 0$ for all future rounds. This approach achieves logarithmic adaptivity, but its approximation guarantee is $1/\log n$. The second approach, up-sampling, sets $p_i = 1$, for all future rounds, for all elements in the sample with highest value at that round. It achieves a constant approximation but at the cost of a linear adaptivity. Our main algorithm, Adaptive-Sampling, achieves logarithmic adaptivity and a constant factor approximation by shaping $\mathcal{D}_r$ via up-sampling at rounds where a random set has high value and down-sampling otherwise. The analysis then heavily exploits submodularity in non-trivial ways to bound the marginal contribution of elements to a random set drawn from $\mathcal{D}_r$, which evolves in every round.

Down-Sampling

The down-sampling algorithm is $O(\log n)$-adaptive but its approximation guarantee is $\frac{1}{\log n}$. We describe the algorithm and analyze its properties, which will later be used in the analysis of Adaptive-Sampling. In every round, as long as the expected value of a random subset of size $k$ of the surviving elements is not an $\alpha$-approximation of the value of the optimal solution OPT, the down-sampling algorithm discards all elements whose expected marginal contribution to a random set is below a fixed threshold $\Delta$. A formal description is included below.

Algorithm 6 Down-Sampling: discards a large number of elements at every round by sampling.
Input: approximation $\alpha$ and threshold parameter $\Delta$
  Initialize $S \leftarrow N$, $\mathcal{D}$ as the uniform distribution over sets of size $k$
  while $|S| > k$ and $E_{R \sim \mathcal{D}}[f(R)] < \alpha \text{OPT}$ do
    $S \leftarrow S \setminus \big\{a : E_{R \sim \mathcal{D}}\big[f_{R \setminus \{a\}}(a)\big] < \Delta\big\}$
    Update $\mathcal{D}$ to be uniform over subsets of $S$ of size $k$
  return $R \sim \mathcal{D}$

Algorithm 6 is an idealized description of the down-sampling algorithm. In practice, we cannot evaluate the exact expected value of a random set and we do not know OPT. Instead, we sample random sets from $\mathcal{D}$ at every round to estimate the expectations and guess OPT. For ease of notation and presentation, we analyze this idealized version of the algorithm, and later discuss the extension to the full algorithm.
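To make the sampled variant concrete, here is a minimal Python sketch of Down-Sampling; the names (`down_sampling`, `opt_guess`, `m`) and the toy coverage instance are hypothetical, expectations are replaced by empirical means over `m` samples per round, and a full implementation would also cap the number of rounds.

```python
import random

def down_sampling(f, N, k, opt_guess, alpha, delta, m=50, rng=random):
    """Sampled sketch of Algorithm 6: expectations over D are replaced by
    empirical means over m random size-k subsets of the surviving elements."""
    S = list(N)
    while len(S) > k:
        samples = [rng.sample(S, k) for _ in range(m)]
        # Terminate once a random set is (empirically) an alpha-approximation.
        if sum(f(R) for R in samples) / m >= alpha * opt_guess:
            break
        # Discard elements whose average marginal contribution to R \ {a} is low.
        def avg_marginal(a):
            vals = [f([b for b in R if b != a] + [a]) - f([b for b in R if b != a])
                    for R in samples]
            return sum(vals) / m
        S = [a for a in S if avg_marginal(a) >= delta]
    return rng.sample(S, min(k, len(S)))

# Toy coverage instance: elements 4 and 5 cover nothing and get discarded.
cover = [{0}, {1}, {2}, {3}, set(), set()]
f = lambda idxs: len(set().union(*(cover[i] for i in idxs))) if idxs else 0
result = down_sampling(f, range(6), k=2, opt_guess=2, alpha=1.0, delta=0.5,
                       rng=random.Random(0))
```

On this instance, one down-sampling round removes the two zero-value elements, after which a random pair of the survivors already matches the guessed optimum and is returned.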

This idealized version also has a nice interpretation via the multilinear extension of submodular functions as a search for a continuous point $x \in [0,1]^n$ which, at every iteration, is projected to a lower dimension on the boundary of the polytope of feasible points.

Algorithm 7 Down-Sampling-Continuous: a continuous description of Down-Sampling.
Input: approximation $\alpha$ and precision $\epsilon$
  $x \leftarrow \big(\frac{k}{n}, \ldots, \frac{k}{n}\big)$
  while $F(x) < \alpha \text{OPT}$ do
    $\mathcal{M} \leftarrow \big\{v : \|v\|_0 \le (1-\epsilon)\|x\|_0,\ \|v\|_1 = k,\ v \le \frac{1}{1-\epsilon} x\big\}$
    $x \leftarrow \operatorname{argmax}_{v \in \mathcal{M}} \langle v, \nabla F(x) \rangle$
  return $x$

The multilinear extension $F : [0,1]^n \to \mathbb{R}$ of a submodular function $f$ is a popular tool for continuous approaches to submodular optimization, where $F(x)$ is the expected value $E[f(R)]$ of a random set $R$ containing each element $a_i$ independently with probability $x_i$. An interpretation of this algorithm is a continuous point $x$ which, at every iteration, is projected to a lower dimension ($\|v\|_0 \le (1-\epsilon)\|x\|_0$) among the remaining dimensions ($v \le \frac{1}{1-\epsilon} x$) on the boundary of the polytope of feasible points ($\|v\|_1 = k$).
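Since $F(x)$ generally has no closed form, it is standard to estimate it by sampling random sets according to $x$ and averaging; a short Python sketch (the function name and sample count are hypothetical):

```python
import random

def multilinear_estimate(f, x, m=2000, rng=random):
    """Estimate F(x) = E[f(R)] by drawing R with each element i included
    independently with probability x[i], and averaging f over m draws."""
    total = 0.0
    for _ in range(m):
        R = [i for i, xi in enumerate(x) if rng.random() < xi]
        total += f(R)
    return total / m

# Sanity check on a modular function f(R) = |R|, where F(x) = sum(x) exactly.
f = lambda R: len(R)
est = multilinear_estimate(f, [0.5, 0.5, 0.5, 0.5], rng=random.Random(0))
```

Concentration arguments like those used later in this section bound how many such samples suffice for an $\epsilon$-accurate estimate.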

Analysis of down-sampling. The analysis of down-sampling largely relies on Lemma 3.2.1 and Lemma 3.2.2, which respectively bound the number of elements discarded at every round and the loss in the approximation due to these discarded elements. We discuss these lemmas in the following subsections. Recall that a function $f : 2^N \to \mathbb{R}_+$ is submodular if for every $S \subseteq T \subseteq N$ and $a \notin T$ we have that $f_S(a) \ge f_T(a)$, where $f_A(b)$ denotes the marginal contribution $f_A(b) = f(A \cup \{b\}) - f(A)$ of $b \in N$ to $A \subseteq N$. Such a function is monotone if $f(S) \le f(T)$ for all $S \subseteq T$. Finally, it is subadditive if $f(A \cup B) \le f(A) + f(B)$ for all $A, B \subseteq N$, which is satisfied by submodular functions.

The adaptivity of down-sampling. One crucial property of the down-sampling algorithm is that it is $O(\log n)$-adaptive. This is largely due to the fact that in every round a significant fraction of the remaining elements are discarded. Throughout the paper we use $\mathcal{U}(S, t)$ to denote the uniform distribution over subsets of $S$ of size $t$.

Lemma 3.2.1. Let $f : 2^N \to \mathbb{R}$ be a monotone submodular function. For all $S \subseteq N$, $t \in [n]$, and $\Delta > 0$, let $\mathcal{D} = \mathcal{U}(S, t)$ and the discarded elements be $S^- = \big\{a : E_{R \sim \mathcal{D}}\big[f_{R \setminus \{a\}}(a)\big] < \Delta\big\}$. Then:
$$|S \setminus S^-| \le \frac{E_{R \sim \mathcal{D}}[f(R)]}{t \cdot \Delta} \cdot |S|.$$
Proof. At a high level, we first lower bound the value of a random set $R \sim \mathcal{D}$ by the marginal contributions of the remaining elements $S \setminus S^-$. Then, we lower bound these marginal contributions with the threshold $\Delta$, since these elements must have large enough marginal contributions to not be removed. The first lower bound is the following:
$$E[f(R)] \ge E\big[f(R \cap (S \setminus S^-))\big] \qquad \text{monotonicity}$$
$$\ge E\Big[\sum_{a \in R \cap (S \setminus S^-)} f_{R \cap (S \setminus S^-) \setminus \{a\}}(a)\Big] \qquad \text{submodularity}$$
$$\ge E\Big[\sum_{a \in S \setminus S^-} \mathbb{1}_{a \in R} \cdot f_{R \setminus \{a\}}(a)\Big] \qquad \text{submodularity}$$
$$= \sum_{a \in S \setminus S^-} E\big[\mathbb{1}_{a \in R} \cdot f_{R \setminus \{a\}}(a)\big].$$
By bounding the marginal contribution of remaining elements with the threshold $\Delta$, we obtain
$$\sum_{a \in S \setminus S^-} E\big[\mathbb{1}_{a \in R} \cdot f_{R \setminus \{a\}}(a)\big] = \sum_{a \in S \setminus S^-} \Pr[a \in R] \cdot E\big[f_{R \setminus \{a\}}(a) \mid a \in R\big]$$
$$\ge \sum_{a \in S \setminus S^-} \Pr[a \in R] \cdot E\big[f_{R \setminus \{a\}}(a)\big] \qquad \text{submodularity}$$
$$\ge \sum_{a \in S \setminus S^-} \Pr[a \in R] \cdot \Delta \qquad \text{definition of } \Delta$$
$$= |S \setminus S^-| \cdot \frac{t}{|S|} \cdot \Delta \qquad \text{definition of } R$$
Notice that when $\Delta = c \cdot \frac{\alpha \text{OPT}}{k}$ for some $c > 1$, the lemma implies that if $E_{R \sim \mathcal{D}}[f(R)] < \alpha \text{OPT}$, then the number of elements remaining is reduced by a factor of at least $c$ at every round.

The approximation guarantee of down-sampling. The down-sampling algorithm is a $\frac{1}{\log n}$-approximation (Corollary 3). To analyze the value of sets that survive down-sampling, we show that the value $f(O \cap S^-)$ of discarded optimal elements is small. Thus, the optimal elements that are not discarded conserve a large fraction of OPT.

Lemma 3.2.2. Let $f : 2^N \to \mathbb{R}$ be monotone submodular with optimal solution $O$ and $\mathcal{D} = \mathcal{U}(S, t)$, for any $S \subseteq N$ and $t \in [n]$. The loss from discarding elements $S^- := \big\{a \in S : E_{R \sim \mathcal{D}}\big[f_{R \setminus \{a\}}(a)\big] < \Delta\big\}$ is approximately bounded by the value of $R \sim \mathcal{D}$:
$$f(O \cap S^-) \le |O \cap S^-| \cdot \Delta + E_{R \sim \mathcal{D}}[f(R)].$$

Proof. The value of $O \cap S^-$ is upper bounded using the threshold $\Delta$ for elements to be in $S^-$:
$$f(O \cap S^-) - E_R[f(R)] \le E_R\big[f_R(O \cap S^-)\big] \le E_R\Big[\sum_{a \in O \cap S^-} f_R(a)\Big] \le \sum_{a \in O \cap S^-} E_R\big[f_R(a)\big] \le |O \cap S^-| \cdot \Delta$$
where the first inequality is by monotonicity, the second by submodularity, the third by linearity of expectation, and the last by submodularity and the definition of $S^-$.

At this point, we can prove the following corollary about the down-sampling algorithm.

Corollary 3. Down-Sampling with $\Delta = \frac{\text{OPT}}{4k}$ and $\alpha = \frac{1}{\log n}$ is $O\big(\frac{\log n}{\log \log n}\big)$-adaptive and obtains, in expectation, a $\frac{1}{\log n}$-approximation.
Proof. Let $t = k$, $\Delta = \frac{\text{OPT}}{4k}$, $\alpha = \frac{1}{\log n}$, and $\epsilon = 1/2$ for Lemma 3.2.2. We first show the adaptive complexity and then the approximation guarantee. Recall that $\mathcal{D} = \mathcal{U}(S, t)$ is the uniform distribution over subsets of $S$ of size $t$.

The adaptive complexity. We bound the number of rounds $r$ where elements are removed from $S$. Notice that if Down-Sampling removes elements from $S$ at some round, then $E_{R \sim \mathcal{D}}[f(R)] < \alpha \text{OPT}$. We get
$$|S \setminus S^-| \le \frac{E_{R \sim \mathcal{D}}[f(R)]}{t \cdot \Delta} \cdot |S| \qquad \text{Lemma 3.2.1}$$
$$= \frac{E_{R \sim \mathcal{D}}[f(R)]}{\text{OPT}/4} \cdot |S|$$
$$\le \frac{4}{\log n} \cdot |S| \qquad \text{Algorithm}$$
Thus, after $r$ rounds, there are $|S| \le (4/\log n)^r \cdot n$ elements remaining. With
$$r = \frac{\log(n/k)}{\log\big(\frac{\log n}{4}\big)} \le \frac{\log n}{\log\big(\frac{\log n}{4}\big)},$$
we obtain that $|S| \le k$ and the algorithm terminates.

The approximation guarantee. There are two cases. If the algorithm terminates because $E[f(R)] \ge \alpha \text{OPT}$, then returning a random set $R$ is immediately (in expectation) a $\frac{1}{\log n}$-approximation with $\alpha = \frac{1}{\log n}$.

Otherwise, the algorithm returns $S$, and we assume this is the case for the remainder of this proof. Let $S_i$ and $S_i^-$ be the sets $S$ and $S^- := \big\{a \in S : E_{R \sim \mathcal{D}}\big[f_{R \setminus \{a\}}(a)\big] < \Delta\big\}$ at round $i \in [r]$ of Down-Sampling, where the notation $S$ and $S^-$ denotes by default these sets when the algorithm terminates. First, by monotonicity and subadditivity, we have
$$f(S) \ge f(O) - f(O \setminus S) = \text{OPT} - f\big(\cup_{i=1}^r (S_i^- \cap O)\big)$$
since $O \setminus S$ are the discarded optimal elements. If Down-Sampling removes elements from $S_i$ at some round $i$, then $E_{R \sim \mathcal{U}(S_i, k)}[f(R)] < \alpha \text{OPT}$ and we get
$$f\big(\cup_{i=1}^r (S_i^- \cap O)\big) \le \sum_{i=1}^r f(O \cap S_i^-) \qquad \text{subadditivity}$$
$$\le \sum_{i=1}^r \Big(|O \cap S_i^-| \cdot \Delta + E_{R \sim \mathcal{D}}[f(R)]\Big) \qquad \text{Lemma 3.2.2}$$
$$\le \big|O \cap \big(\cup_{i=1}^r S_i^-\big)\big| \cdot \Delta + \sum_{i=1}^r E_{R \sim \mathcal{D}}[f(R)]$$
$$\le k \cdot \Delta + r \cdot E_{R \sim \mathcal{D}}[f(R)]$$
$$\le \frac{\text{OPT}}{2} + r \alpha \text{OPT} \qquad \text{Algorithm}$$
By combining the above inequalities, we conclude that
$$f(S) \ge \text{OPT} - \Big(\frac{\text{OPT}}{2} + r \alpha \text{OPT}\Big) \ge \bigg(\frac{1}{2} - \frac{1}{\log\big(\frac{\log n}{4}\big)}\bigg) \text{OPT}.$$

It is important to note that $\Theta\big(\frac{1}{\log n}\big)$ is the best approximation the down-sampling algorithm can achieve, regardless of the number of rounds. There is a delicate tradeoff between the approximation obtained when the algorithm terminates due to $E[f(R)] \ge \alpha \text{OPT}$ and the one obtained when $|S| \le k$, such that more rounds do not improve the approximation guarantee. We argue that this tradeoff implies that more rounds do not improve the approximation guarantee for down-sampling. Notice that when the threshold $\alpha \text{OPT}$ to return $R$ increases, the threshold $\Delta = c \cdot \frac{\alpha \text{OPT}}{k}$ needed to remove elements also increases. If this threshold to remove elements increases, then the algorithm potentially discards optimal elements with higher value. Thus, the approximation guarantee obtained by the solution $S$ worsens. This tradeoff is independent of the number of rounds, and adding more rounds thus does not improve the approximation guarantee.

Up-Sampling

A second component of the main algorithm is up-sampling. Instead of discarding elements, the up-sampling algorithm adds elements which are included in all future samples. At each round, the sample containing the k/r new elements with highest value is added to the current solution X.

Algorithm 8 Up-Sampling: adds a large number of elements at every round by sampling.
Input: sample complexity $m$ and number of rounds $r$
  Initialize $X \leftarrow \emptyset$
  for $r$ rounds do
    Update $\mathcal{D}$ to be uniform over subsets of $N \setminus X$ of size $k/r$
    $X \leftarrow X \cup \operatorname{argmax}_{R_i} \{f(X \cup R_i) : R_i \sim \mathcal{D}\}_{i=1}^m$
  return $X$

Note that when $r = k$ this method is the celebrated greedy algorithm. In contrast to down-sampling, which obtains a logarithmic number of rounds and a logarithmic approximation, up-sampling is inherently sequential and only obtains an $O(r/k)$ approximation.
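A minimal Python sketch of Algorithm 8 (the names `up_sampling` and `m` and the toy coverage instance are hypothetical): each round samples m candidate blocks of size k/r and commits the best one to the solution.

```python
import random

def up_sampling(f, N, k, r, m=50, rng=random):
    """Sketch of Algorithm 8: in each of r rounds, sample m candidate blocks
    of size k // r from the remaining elements and commit to the best one."""
    X = []
    for _ in range(r):
        rest = [a for a in N if a not in X]
        candidates = [rng.sample(rest, k // r) for _ in range(m)]
        X = X + max(candidates, key=lambda R: f(X + R))
    return X

# Toy coverage instance: element 2 covers nothing.
cover = [{0}, {1}, set()]
f = lambda idxs: len(set().union(*(cover[i] for i in idxs))) if idxs else 0
solution = up_sampling(f, range(3), k=2, r=2, rng=random.Random(0))
```

With $r = k$ the blocks are singletons and the procedure is exactly a sampled greedy; with small $r$ it is $r$-adaptive but commits large blocks at once, which is the source of the weak $O(r/k)$ guarantee.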

Proposition 3.2.1. For any constant $c \le k/r$, Up-Sampling is an $r$-adaptive algorithm and obtains, w.p. $1 - o(1)$, a $\big(1 - \frac{1}{e}\big) \cdot \frac{c \cdot r}{k + c}$ approximation, with sample complexity $m = c n^{2+c} \log n$ at every round.
Proof. Let $S_i$ be the set $S$ at round $i$ of the up-sampling algorithm (Algorithm 8). We first argue that for all subsets $T$ of $N \setminus S_i$ of size $c$, $T$ is contained in at least one sample drawn at round $i+1$. First, assume $c = k/r$. Then this is the coupon collector problem and with $n^2 \cdot \binom{n}{c} \cdot \log \binom{n}{c}$ samples, all subsets $T$ of size $c$ are observed at least once with probability at least $1 - 1/n^2$. If $c < k/r$, the same guarantee holds a fortiori, since each sample of size $k/r$ contains $\binom{k/r}{c}$ subsets of size $c$.
Next, let $O_{S_{i-1}}$ denote a partition of the optimal elements $O \setminus S_{i-1}$ into at most $k/c + 1$ parts, each of size at most $c$. We get
$$\text{OPT} \le f(S_{i-1}) + \sum_{T \in O_{S_{i-1}}} f_{S_{i-1}}(T)$$
$$\le f(S_{i-1}) + \sum_{T \in O_{S_{i-1}}} \big(f(S_i) - f(S_{i-1})\big)$$
$$\le f(S_{i-1}) + (k/c + 1)\big(f(S_i) - f(S_{i-1})\big)$$
where the first inequality is by submodularity and the second is by monotonicity and since at least one sample contains the $c$ elements with largest marginal contribution to $S_{i-1}$. Thus, by a similar induction as in the analysis of the classical greedy algorithm,
$$f(S_i) \ge \Bigg(1 - \bigg(1 - \frac{1}{k/c + 1}\bigg)^i\Bigg) \text{OPT}.$$
Thus, with $i = r$, the set $S$ returned by the up-sampling algorithm obtains the following approximation:
$$1 - \bigg(1 - \frac{1}{k/c+1}\bigg)^r = 1 - \Bigg(\bigg(1 - \frac{1}{k/c+1}\bigg)^{k/c+1}\Bigg)^{r/(k/c+1)} \ge 1 - e^{-r/(k/c+1)} \ge \bigg(1 - \frac{1}{e}\bigg) \frac{cr}{k+c}.$$

Adaptive-Sampling: $O(\log n)$-Adaptivity and a Constant Factor Approximation

We build upon down- and up-sampling to obtain the main algorithm, Adaptive-Sampling. The algorithm maintains two sets, $S$ for down-sampling and $X$ for up-sampling. If a random subset has high expected value, then a sample of high value is added to the up-sampling set $X$. Otherwise, low-value elements can be discarded from the down-sampling solution $S$. A crucial subtlety is that this algorithm samples sets of size $k/r$ (rather than $k$) not only for up-sampling but also for down-sampling. The description below is an idealized version of the algorithm.

Algorithm 9 Adaptive-Sampling: down-samples or up-samples depending on context.
Input: approximation $\alpha$, threshold $\Delta$, sample complexity $m$, bound on up-sampling rounds $r$
  Initialize $X \leftarrow \emptyset$, $S \leftarrow N$
  while $|X| < k$ and $|X \cup S| > k$ do
    Update $\mathcal{D}$ to be uniform over subsets of $S \setminus X$ of size $k/r$
    if $E_{R \sim \mathcal{D}}[f_X(R)] \ge (\alpha/r)\text{OPT}$ then
      $X \leftarrow X \cup \operatorname{argmax}_{R_i} \{f(X \cup R_i) : R_i \sim \mathcal{D}\}_{i=1}^m$
    else
      $S \leftarrow S \setminus \big\{a : E_{R \sim \mathcal{D}}\big[f_{X \cup R \setminus \{a\}}(a)\big] < \Delta\big\}$
  return $X$ if $|X| = k$, or $X \cup S$ otherwise
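The idealized algorithm above can be sketched in Python with empirical means in place of exact expectations; the names (`adaptive_sampling`, `opt_guess`, `max_rounds`) are hypothetical conveniences, and the forced round cap mirrors the full version described later in this section.

```python
import random

def adaptive_sampling(f, N, k, opt_guess, r, alpha, delta, m=50,
                      max_rounds=100, rng=random):
    """Sketch of Algorithm 9: up-sample a high-value block into X when a
    random block is valuable, otherwise down-sample low-value elements of S."""
    X, S = [], list(N)
    block = max(1, k // r)
    for _ in range(max_rounds):
        if len(X) >= k or len(set(X) | set(S)) <= k:
            break
        pool = [a for a in S if a not in X]
        if len(pool) < block:
            break
        samples = [rng.sample(pool, block) for _ in range(m)]
        marginal = lambda R: f(X + R) - f(X)
        if sum(marginal(R) for R in samples) / m >= (alpha / r) * opt_guess:
            X = X + max(samples, key=marginal)      # up-sampling step
        else:                                       # down-sampling step
            def avg_marg(a):
                vals = [f(X + [b for b in R if b != a] + [a])
                        - f(X + [b for b in R if b != a]) for R in samples]
                return sum(vals) / m
            S = [a for a in S if avg_marg(a) >= delta]
    return X if len(X) == k else X + [a for a in S if a not in X]

# Toy coverage instance where every element is equally valuable.
cover = [{0}, {1}, {2}, {3}]
f = lambda idxs: len(set().union(*(cover[i] for i in idxs))) if idxs else 0
sol = adaptive_sampling(f, range(4), k=2, opt_guess=2, r=2, alpha=1.0,
                        delta=0.5, rng=random.Random(0))
```

On this symmetric instance every sampled block is valuable, so the sketch up-samples in both rounds and returns a solution of full value.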

The adaptivity of Adaptive-Sampling is $O(\log n)$. The adaptivity of Adaptive-Sampling is the sum of the number of up-sampling rounds and of the number of down-sampling rounds, which we denote by $r_u$ and $r_d$ respectively.

Lemma 3.2.3. Adaptive-Sampling is $(r_u + r_d)$-adaptive with $r_u + r_d \le r + \log_{c'} n$ when $\Delta = c' \cdot \frac{\alpha \text{OPT}}{k}$.

Proof. We first bound the number of rounds $r_d$ where Adaptive-Sampling down-samples. If $|S| \le \frac{k}{r}$, then $|S \cup X| \le k$ and the algorithm terminates, since $|X| \le k - \frac{k}{r}$ if the algorithm has not (yet) returned $X$. Notice that if Adaptive-Sampling removes elements from $S$ at some round, then $E_{R \sim \mathcal{D}}[f_X(R)] < \frac{\alpha}{r}\text{OPT}$. Thus, the number of elements remaining in $S$ after one round of removing elements from $S$ is
$$|S \setminus S^-| \le \frac{E_{R \sim \mathcal{D}}[f_X(R)]}{t \cdot \Delta} \cdot |S| \qquad \text{Lemma 3.2.1 with submodular function } f_X(\cdot)$$
$$\le \frac{\alpha \text{OPT}/r}{t \cdot \Delta} \cdot |S| \qquad \text{Algorithm}$$
$$\le \frac{1}{c'} \cdot |S| \qquad \Delta = c' \cdot \frac{\alpha \text{OPT}}{k} \text{ and } t = \frac{k}{r}$$
Thus, after $r_d$ rounds of removing elements, there are $|S| \le (1/c')^{r_d} \cdot n$ elements remaining. With
$$r_d = \frac{\log\big(\frac{n}{k/r}\big)}{\log c'} \le \frac{\log n}{\log c'},$$
$|S| \le k/r$ and the algorithm terminates. For the second part of the lemma, after $r$ rounds of up-sampling, $r$ disjoint samples of size $\frac{k}{r}$ have been added to $X$. Thus $|X| = k$ and the algorithm terminates with $r_u = r$.

Adaptive-Sampling is a constant factor approximation. We now analyze the approximation guarantee of Adaptive-Sampling.

Lemma 3.2.4. Adaptive-Sampling obtains, w.p. $1 - \delta$, a $\big(\frac{1}{3} - \epsilon\big)$-approximation and has sample complexity $m = \big(\frac{r}{\epsilon}\big)^2 \log\big(\frac{2r}{\delta}\big)$ at every round, with $\alpha = \frac{1}{3}$, $r = \frac{3}{\epsilon} \cdot \log_{1+\epsilon/2} n$, and $\Delta = \big(1 + \frac{\epsilon}{2}\big)\frac{\alpha \text{OPT}}{k}$.
Proof. We begin with the case where the algorithm returns $S \cup X$, which is the main component of the proof, and then we consider the case where it returns $X$. At a high level, the first part in analyzing $S \cup X$ consists of bounding $f(S \cup X)$ in terms of the loss from optimal elements $f(O \setminus S)$. Then, we use Lemma 3.2.2 to bound the loss from these elements at every round. A main theme of this proof is that we need to simultaneously deal with the up-sampling solution $X$ while analyzing the loss from $O \setminus S$.
Let $O = \{o_1, \ldots, o_k\}$ be the optimal solution indexed by an arbitrary order, and let $X_i$ and $S_i^-$ be the sets $X$ and $S^- = \big\{a : E_{R \sim \mathcal{D}}\big[f_{X \cup R \setminus \{a\}}(a)\big] < \Delta\big\}$ at the $i$th round of down-sampling, $i \in [r_d]$. First, by monotonicity, subadditivity, and again monotonicity, we get
$$f(S \cup X) \ge f(O) - f(O \setminus (S \cup X)) \ge \text{OPT} - f(O \setminus S).$$

The remainder of the proof bounds the loss $f(O \setminus S)$ from the optimal elements that were discarded from $S$. The elements in $O \setminus S$ have been discarded from $S$, so $O \setminus S = \cup_{i=1}^{r_d}\big(S_i^- \cap O\big)$, and we get
$$f(O \setminus S) = f\big(\cup_{i=1}^{r_d}(S_i^- \cap O)\big) \le f_X\big(\cup_{i=1}^{r_d}(S_i^- \cap O)\big) + f(X) \le \sum_{i=1}^{r_d} f_X(O \cap S_i^-) + f(S \cup X)$$
where the first inequality is by monotonicity and the second by subadditivity.

Next,
$$f_X\big(O \cap S_i^-\big) \le f_{X_i}\big(O \cap S_i^-\big) \le |O \cap S_i^-| \cdot \Delta + E[f_{X_i}(R)],$$
where the first inequality is by submodularity and the second is by Lemma 3.2.2, since $f_{X_i}(\cdot)$ is a submodular function. Thus,

$$f(O \setminus S) - f(S \cup X) \le \sum_{i=1}^{r_d} \Big(|O \cap S_i^-| \cdot \Delta + E[f_{X_i}(R)]\Big)$$
$$\le \big|O \cap \big(\cup_{i=1}^{r_d} S_i^-\big)\big| \cdot \Delta + r_d \cdot \frac{\alpha}{r}\text{OPT}$$
$$\le k \cdot \Delta + r_d \cdot \frac{\alpha}{r}\text{OPT}$$
$$\le \Big(1 + \frac{\epsilon}{2}\Big)\alpha \text{OPT} + r_d \cdot \frac{\alpha}{r}\text{OPT}$$
where $E[f_{X_i}(R)] \le \frac{\alpha}{r}\text{OPT}$ at a down-sampling round $i$ by the algorithm. By combining the previous inequalities, we get
$$f(S \cup X) \ge \text{OPT} - \Big(1 + \frac{\epsilon}{2}\Big)\alpha \text{OPT} - f(S \cup X) - \frac{r_d}{r}\alpha\text{OPT}, \quad \text{which gives} \quad f(S \cup X) \ge \Big(\frac{1}{3} - \epsilon\Big)\text{OPT},$$
where $r_d \le \log_{1+\epsilon/2} n$ by Lemma 3.2.3 with $c' = 1 + \epsilon/2$ and since $r = \frac{3}{\epsilon} \cdot \log_{1+\epsilon/2} n$.
What remains is the case where the algorithm returns $X$. Let $X_i$ and $R_i^+$ be the set $X$ and the sample $R$ added to $X$ at the $i$th round of up-sampling, $i \in [r]$. By a standard concentration bound (Lemma 3.2.5), with $m = (r/\epsilon)^2 \log(2r/\delta)$, w.p. $1 - \delta/r$, $f_{X_i}(R_i^+) \ge E_{R \sim \mathcal{D}}[f_{X_i}(R)] - \frac{\epsilon \text{OPT}}{r}$. By a union bound, this holds for all $r$ rounds of up-sampling with probability $1 - \delta$. We obtain
$$f(X) = \sum_{i=1}^r f_{X_i}\big(R_i^+\big) \ge \sum_{i=1}^r \Big(E_{R \sim \mathcal{D}}[f_{X_i}(R)] - \frac{\epsilon \text{OPT}}{r}\Big) \ge \sum_{i=1}^r \Big(\frac{\alpha \text{OPT}}{r} - \frac{\epsilon \text{OPT}}{r}\Big) = \Big(\frac{1}{3} - \epsilon\Big)\text{OPT}.$$

Lemma 3.2.5. For any $X, S \subseteq N$ such that $|X \cup R| \le k$, let $\mathcal{D} = \mathcal{U}\big(S, \frac{k}{r}\big)$ and $R^+ = \operatorname{argmax}_{i \in [m]} f(X \cup R_i)$. Then, with probability $1 - \delta$ over the samples drawn from $\mathcal{D}$,
$$f_X(R^+) \ge E_{R \sim \mathcal{D}}[f_X(R)] - \epsilon$$
with sample complexity $m = \frac{1}{2}\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2}{\delta}\big)$.
Proof. By Lemma 3.2.6, with $m = \frac{1}{2}\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2}{\delta}\big)$, with probability $1 - \delta$,
$$\big|v_X(S, t) - E_{R \sim \mathcal{D}}[f_X(R)]\big| \le \epsilon.$$
Since $v_X(S, t) = \frac{1}{m}\sum_{i=1}^m f_X(R_i)$, it must be the case that for at least one sample $R$ used to compute $v_X(S, t)$,
$$f_X(R) \ge E_{R \sim \mathcal{D}}[f_X(R)] - \epsilon.$$
We conclude by observing that the sample with largest marginal contribution $f_X(R) = f(X \cup R) - f(X)$ is returned.

The Full Algorithm

We discuss the two missing pieces needed to implement the idealized algorithm: estimating the expectations, and the assumption that we know OPT. To estimate expectations within arbitrarily fine precision $\epsilon > 0$ in one round, we query $m = \text{poly}(n)$ sets $X \cup R_1, \ldots, X \cup R_m$ where $R_1, \ldots, R_m$ are sampled according to $\mathcal{U}(S \setminus X, k/r)$. To guess OPT, we pick $\log_{1+\epsilon} n$ different values $v^\star$ as proxies for OPT, one of which must be an $\epsilon$ multiplicative approximation to OPT (Lemma 3.2.7, using submodularity). We then run the algorithm for each of these proxies in parallel, and return the solution with highest value. With these two final pieces, we obtain the main result for this section. We describe the implementable version Adaptive-Sampling-Full formally next.

Estimates of expectations in one round via sampling. We show that the expected value of a random set and the expected marginal contribution of elements to a random set can be estimated arbitrarily well in one round, which is needed for the Down-Sampling and Adaptive-Sampling algorithms. Recall that $\mathcal{U}(S, t)$ denotes the uniform distribution over subsets of $S$ of size $t$. The values we are interested in estimating are $E_{R \sim \mathcal{U}(S,t)}[f_X(R)]$ and $E_{R \sim \mathcal{U}(S,t)}\big[f_{X \cup R \setminus \{a\}}(a)\big]$. We denote the corresponding estimates by $v_X(S, t)$ and $v_X(S, t, a)$, which are computed in Algorithms 10 and 11. These algorithms first sample $m$ sets from $\mathcal{U}(S, t)$, where $m$ is the sample complexity, then query the desired sets to obtain random realizations of $f_X(R)$ and $f_{X \cup R \setminus \{a\}}(a)$, and finally average the $m$ random realizations of these values.

Algorithm 10 Estimate1: computes estimate $v_X(S, t)$ of $E_{R \sim \mathcal{U}(S,t)}[f_X(R)]$.
Input: set $S \subseteq N$, size $t \in [n]$, sample complexity $m$.
  Sample $R_1, \ldots, R_m$ i.i.d. $\sim \mathcal{U}(S, t)$
  Query $\{X, X \cup R_1, \ldots, X \cup R_m\}$
  $v_X(S, t) \leftarrow \frac{1}{m}\sum_{i=1}^m \big(f(X \cup R_i) - f(X)\big)$
  return $v_X(S, t)$

Algorithm 11 Estimate2: computes estimate $v_X(S, t, a)$ of $E_{R \sim \mathcal{U}(S,t)}\big[f_{X \cup R \setminus \{a\}}(a)\big]$.
Input: set $S \subseteq N$, size $t \in [n]$, sample complexity $m$, element $a \in N$.
  Sample $R_1, \ldots, R_m$ i.i.d. $\sim \mathcal{U}(S, t)$
  Query $\{X \cup R_1 \cup \{a\}, X \cup R_1 \setminus \{a\}, \ldots, X \cup R_m \cup \{a\}, X \cup R_m \setminus \{a\}\}$
  $v_X(S, t, a) \leftarrow \frac{1}{m}\sum_{i=1}^m \big(f(X \cup R_i \cup \{a\}) - f(X \cup R_i \setminus \{a\})\big)$
  return $v_X(S, t, a)$
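In Python, the two estimators can be sketched as follows (the function names are hypothetical; a real implementation would issue the m queries of each call in parallel, since they form a single adaptive round):

```python
import random

def estimate1(f, X, S, t, m, rng=random):
    """Empirical mean of f_X(R) over m samples R ~ U(S, t) (Algorithm 10)."""
    return sum(f(X + rng.sample(S, t)) - f(X) for _ in range(m)) / m

def estimate2(f, X, S, t, m, a, rng=random):
    """Empirical mean of f(X + R + {a}) - f(X + R), with a removed from the
    sampled R ~ U(S, t), averaged over m samples (Algorithm 11)."""
    total = 0.0
    for _ in range(m):
        R = [b for b in rng.sample(S, t) if b != a]
        total += f(X + R + [a]) - f(X + R)
    return total / m

# On a coverage function with disjoint covers, both estimates are exact.
cover = [{0}, {1}, {2}, {3}]
f = lambda idxs: len(set().union(*(cover[i] for i in idxs))) if idxs else 0
v_set = estimate1(f, [0], [1, 2, 3], t=2, m=20)        # f_X(R) = 2 for every R
v_elem = estimate2(f, [0], [1, 2, 3], t=2, m=20, a=1)  # marginal of 1 is always 1
```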

Using standard concentration bounds, the estimates computed by these algorithms are arbitrarily good for a sufficiently large sample complexity $m$.

Lemma 3.2.6. Let $m = \frac{1}{2}\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2}{\delta}\big)$; then for all $X, S \subseteq N$ and $t \in [n]$ such that $|X| + t \le k$, with probability at least $1 - \delta$ over the samples $R_1, \ldots, R_m$,
$$\big|v_X(S, t) - E_{R \sim \mathcal{U}(S,t)}[f_X(R)]\big| \le \epsilon.$$
Similarly, let $m = \frac{1}{2}\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2}{\delta}\big)$; then for all $X, S \subseteq N$, $t \in [n]$, and $a \in N$ such that $|X| + t \le k$, with probability at least $1 - \delta$ over the samples $R_1, \ldots, R_m$,
$$\big|v_X(S, t, a) - E_{R \sim \mathcal{U}(S,t)}\big[f_{X \cup R \setminus \{a\}}(a)\big]\big| \le \epsilon.$$
Thus, with $m = n\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2n}{\delta}\big)$ total samples in one round, with probability $1 - \delta$, it holds that $v_X(S, t)$ and $v_X(S, t, a)$, for all $a \in N$, are $\epsilon$-estimates.

Proof. Note that
$$E[v_X(S, t)] = E_{R \sim \mathcal{U}(S,t)}[f_X(R)] \quad \text{and} \quad E[v_X(S, t, a)] = E_{R \sim \mathcal{U}(S,t)}\big[f_{X \cup R \setminus \{a\}}(a)\big].$$
Since all queries are of size at most $k$, their values are all bounded by OPT. Thus, by Hoeffding's inequality with $m = \frac{1}{2}\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2}{\delta}\big)$, we get
$$\Pr\Big[\big|v_X(S, t) - E_{R \sim \mathcal{U}(S,t)}[f_X(R)]\big| \ge \epsilon\Big] \le 2e^{-\frac{2m\epsilon^2}{\text{OPT}^2}} \le \delta$$
for $\epsilon > 0$. Similarly, we get
$$\Pr\Big[\big|v_X(S, t, a) - E_{R \sim \mathcal{U}(S,t)}\big[f_{X \cup R \setminus \{a\}}(a)\big]\big| \ge \epsilon\Big] \le \delta.$$
Thus, with $m = n\big(\frac{\text{OPT}}{\epsilon}\big)^2 \log\big(\frac{2n}{\delta}\big)$ total samples in one round, by a union bound over each of the estimates holding with probability $1 - \delta/n$ individually, we get that all the estimates hold simultaneously with probability $1 - \delta$.

We can now describe the (almost) full version of the main algorithm which uses these estimates. One additional small difference with Adaptive-Sampling is that we force the algorithm to stop after $r^+$ rounds to obtain the adaptive complexity with probability 1. The loss from the event, happening with low probability, that the algorithm is forced to stop is accounted for in the $\delta$ probability of failure of the approximation guarantee of the algorithm.

Algorithm 12 Adaptive-Sampling-Proxy: simultaneously down-samples and up-samples.
Input: bounds on up-sampling rounds $r$ and on total rounds $r^+$, approximation $\alpha$, threshold parameter $\Delta$, sample complexity $m$, and proxy $v^\star$
  Initialize $X \leftarrow \emptyset$, $S \leftarrow N$, $t \leftarrow \frac{k}{r}$, $c \leftarrow 1$
  while $|X| < k$ and $c < r^+$ do
    $v_X(S, t) \leftarrow$ Estimate1$(S \setminus X, t, m)$
    if $v_X(S, t) \ge (\alpha/r)\, v^\star$ then
      $X \leftarrow X \cup \operatorname{argmax}_{R_i} f(X \cup R_i)$
    else
      for $a \in S$ do  (non-adaptive loop)
        $v_X(S, t, a) \leftarrow$ Estimate2$(S \setminus X, t, m, a)$
      $S \leftarrow S \setminus \{a : v_X(S, t, a) \le \Delta\}$
    $c \leftarrow c + 1$
  return $X$ if $|X| = k$, or $S \cup X$ otherwise

Estimating OPT. The main idea to estimate OPT is to have $O(\log n)$ values $v_i$ such that one of them is guaranteed to be a $(1 - \epsilon)$-approximation to OPT. To obtain such values, we use the simple observation that the singleton $a^\star$ with largest value is at least a $1/n$ approximation to OPT.

Lemma 3.2.7. Let $a^\star = \mathrm{argmax}_{a \in N} f(\{a\})$ be the optimal singleton, and

$v_i = (1+\epsilon)^i \cdot f(\{a^\star\}).$

Then, there exists some $i \in \left[\frac{\log n}{\log(1+\epsilon)}\right]$ such that

$\mathrm{OPT} \le v_i \le (1+\epsilon) \cdot \mathrm{OPT}.$

Proof. By submodularity, we get $f(\{a^\star\}) \ge \frac{1}{k}\mathrm{OPT} \ge \frac{1}{n}\mathrm{OPT}$. By monotonicity, we have $f(\{a^\star\}) \le \mathrm{OPT}$. Combining these two inequalities, we get $v_0 \le \mathrm{OPT} \le v_{\log n/\log(1+\epsilon)}$. By the definition of $v_i$, we then conclude that there must exist some $i \in \left[\frac{\log n}{\log(1+\epsilon)}\right]$ such that $\mathrm{OPT} \le v_i \le (1+\epsilon) \cdot \mathrm{OPT}$.

Since the solution obtained for the unknown vi which approximates OPT well is guaranteed to be a good solution, we run the algorithm in parallel for each of these values and return

the solution with largest value. We obtain the full algorithm Adaptive-Sampling-Full which we describe next.

Algorithm 13 Adaptive-Sampling-Full, simultaneously downsamples and upsamples.
Input: bounds on up-sampling rounds r and on total rounds r⁺, approximation α, threshold parameter τ, sample complexity m, and precision ε
Initialize L ← ∅
Query {{a₁}, ..., {aₙ}}
a* ← argmax_{aᵢ} f({aᵢ})
for i ∈ {0, ..., log_{1+ε/3} n} do    ▷ Non-adaptive loop
    v* ← (1+ε)^i · f({a*})
    Add solution from Adaptive-Sampling-Proxy(v*) to L
return argmax_{S ∈ L} f(S)

Theorem 3.2.2. For any $\epsilon, \delta > 0$, Adaptive-Sampling-Full is a $\left(\log_{1+\epsilon/3} n \cdot \left(\frac{3}{\epsilon} + 2\right)\right)$-adaptive algorithm that, w.p. $1-\delta$, obtains a $\left(\frac{1}{3} - \epsilon\right)$-approximation, with sample complexity at every round $m = \frac{64 n k^2}{\epsilon^2} \left(\log\left(\frac{2n}{\delta}\right) + \log^2_{1+\epsilon/3} n \cdot \log\left(\frac{2}{\delta}\right) \cdot \frac{1}{\epsilon}\right)$, for maximizing a monotone submodular function under a cardinality constraint, with parameters $r = \frac{3}{\epsilon} \cdot \log_{1+\epsilon/3} n$, $\alpha = \frac{1}{3}$, and $\tau = (1+\epsilon)\frac{v^\star}{3k}$.

3.2.2 Experiments

We conduct experiments on two datasets to empirically evaluate the performance of the adaptive sampling algorithm. We observe that it performs almost as well as the standard greedy algorithm, which achieves the optimal $1 - 1/e$ approximation, and outperforms two simple algorithms with low adaptivity. These experiments indicate that in practice, adaptive sampling performs significantly better than its worst-case 1/3 approximation guarantee. The first application is a movie recommendation system using the MovieLens dataset of movie ratings by users. The second application is a taxi dispatch system using the New York City trip record dataset of taxi trips.

Description of the algorithm. The Adaptive-Sampling algorithm is a generalization of the algorithm in Section 3.2.1 designed to achieve superior approximation guarantees for bounded curvature. The algorithm maintains two solutions X and S, initialized to the empty set and the ground set N respectively. At every round, the algorithm either adds k/r elements to X or discards from S a constant fraction of its remaining elements. The algorithm terminates when |X| = k or alternatively when sufficiently many elements have been discarded to get |X ∪ S| ≤ k. Thus, with r = O(log n), the algorithm has at most logarithmically many rounds. The algorithm is formally described below.

Figure 3.3: A map of New York City where the purple circles are of size proportional to the number of taxi trips with pick-ups that occurred in the corresponding neighborhood.

Algorithm 14 Adaptive-Sampling
Input: threshold τ, approximation α, samples m, rounds r
Initialize X ← ∅, S ← N
while |X| < k and |X ∪ S| > k do
    update D to be uniform over subsets of S of size k/r
    R ← argmax_{R ∈ {Rᵢ ∼ D}ᵢ₌₁ᵐ} f_X(R)
    M ← top k/r valued elements a with respect to f_X(a)
    if max{f_X(R), f_X(M)} ≥ α · OPT/r then
        add argmax{f_X(R), f_X(M)} to X, discard it from S
    else
        discard $\{a : \mathbb{E}_{R\sim\mathcal{D}}[f_{X\cup R\setminus\{a\}}(a)] < \tau\}$ from S
return X if |X| = k, or X ∪ S otherwise
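A simplified rendering of this add-or-filter loop in code, with expectations replaced by sample averages. It assumes an estimate `opt_est` of OPT is supplied (the full algorithm guesses it in parallel via Lemma 3.2.7) and adds a round guard for safety; it is a sketch, not the experiments' actual implementation.

```python
import random

def adaptive_sampling(f, N, k, r, alpha, opt_est, tau, m=30, rng=None):
    """Sketch of Algorithm 14. Each loop iteration is one adaptive round:
    every f-query inside it is independent of the others."""
    rng = rng or random.Random(0)
    X, S = set(), set(N)
    t = max(1, k // r)
    for _ in range(10 * max(r, 1)):            # round guard for this sketch
        if len(X) >= k or len(X | S) <= k:
            break
        pool = list(S - X)
        if len(pool) < t:
            break
        fX = lambda T: f(X | T) - f(X)
        samples = [set(rng.sample(pool, t)) for _ in range(m)]
        R = max(samples, key=fX)               # best sampled block
        M = set(sorted(pool, key=lambda a: fX({a}), reverse=True)[:t])
        best = max((R, M), key=fX)
        if fX(best) >= alpha * opt_est / r:    # upsample: add a block
            X |= best
            S -= best
        else:                                  # downsample: filter elements
            marg = {a: sum(fX(Rs | {a}) - fX(Rs - {a}) for Rs in samples) / m
                    for a in pool}
            S = {a for a in S if marg[a] >= tau}
    return X if len(X) >= k else X | S
```

The two branches mirror the dichotomy of the analysis: either a high-value block exists and is added, or a constant fraction of S fails the marginal-contribution threshold and is discarded.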

Algorithm 14 generalizes the adaptive sampling algorithm in Section 3.2.1 by not only considering the best sample R when adding elements to X, but also the set M of top k/r elements a with largest contribution $f_X(a)$. This generalization is needed to obtain, by a simple argument about curvature, a $1 - c$ approximation for functions with curvature c.


Figure 3.4: The Greedy, Adaptive-Sampling, TopK, and Random algorithms correspond to the black, red, blue, and green lines respectively. Figures 3.4(a) and 3.4(b) show the evolution of the value of the current solution of each algorithm at every round. The dotted lines indicate that the algorithm terminated at a previous round. Figures 3.4(c) and 3.4(d) show the final value obtained by each algorithm as a function of the weight parameter α and radius R for the movie recommendation and taxi applications respectively. The curvature of the functions increases as α and R increase.

Experimental Setup

We begin by describing the two datasets and the benchmarks for the experiments.

Datasets.

Movie recommendation system. The goal of a movie recommendation system is to find a personalized and diverse collection of movies to recommend to an individual user, given ratings of movies that this user has already seen. We use the MovieLens 1M dataset [Harper and Konstan, 2015] which contains 1 million ratings from 6000 users on 4000 movies. A standard approach to solve the problem of movie recommendation is low-rank matrix completion. This approach models the problem as an incomplete rating matrix with users as rows and movies as columns and aims to produce a complete matrix which agrees with the

incomplete matrix and has low rank. For a given user ui, the completed matrix then gives a

predicted score for each movie mj which we denote by vi,j. A high quality recommendation must also be diverse. We add a diversity term in the objective that is a coverage function

C where C(S) is the number of different genres covered by movies in S.¹ We obtain the

following objective for user ui:

$f_{i,\alpha}(S) = (1-\alpha)\sum_{m_j \in S} v_{i,j} + \alpha\, C(S)$

where α is a parameter controlling the weight of the objective on the individual movie scores versus the diversity term. Similar submodular objectives for movie recommendation systems have previously been used, e.g., [Mitrovic et al., 2017, Lindgren et al., 2015, Mirzasoleiman et al., 2016, Feldman et al., 2017]. The algorithm used for low-rank matrix completion is an iterative low-rank SVD decomposition algorithm from the python package fancyimpute [Rubinsteyn and Feldman, 2017] corresponding to the SVDimpute algorithm analyzed in Troyanskaya et al. [2001]. Unless otherwise specified, we set k = 100, α = 0.6, and number of rounds of adding elements r = 4 for adaptive sampling.
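The objective can be written down directly; the sketch below assumes illustrative predicted scores and a single genre per movie, as in the dataset description.

```python
def movie_objective(S, scores, genre_of, alpha):
    """f_{i,alpha}(S) = (1 - alpha) * sum of predicted scores of movies in S
    + alpha * number of genres covered; monotone submodular since it is the
    sum of a modular term and a coverage term."""
    return ((1 - alpha) * sum(scores[m] for m in S)
            + alpha * len({genre_of[m] for m in S}))

# illustrative predicted scores v_{i,j} and one genre per movie
scores = {"m1": 4.5, "m2": 4.0, "m3": 3.0}
genre_of = {"m1": "comedy", "m2": "comedy", "m3": "drama"}
val = movie_objective({"m1", "m3"}, scores, genre_of, alpha=0.6)
```

Note the diminishing returns: adding a movie whose genre is already covered contributes only its (weighted) score, not the diversity bonus.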

Taxi dispatch. In the taxi dispatch application, there are k taxis and the goal is to pick the k best locations to cover the maximum number of potential customers. We use 2 million taxi trips from June 2017 in the New York City taxi and limousine commission trip record dataset [NYC-Taxi-Limousine-Commission, 2017], illustrated in Figure 3.3. We assign a weight $w_i$ to each neighborhood $n_i \in N$ that is equal to the number of trips where the pick-up was in neighborhood $n_i$, where N is the collection of all neighborhoods. We then build a

coverage function $C_R(S)$ which is equal to the sum of the weights of neighborhoods $n_i$ that are reachable from at least one location in S, where reachable means that some $n_j \in S$ is at "as the crow flies" distance $d(i,j) \le R$ from $n_i$.

¹Each movie has one genre; for example, "romantic comedy" is one genre, which is different than the "romantic drama" genre.

Figure 3.5: Figures 3.5(a) and 3.5(b) show the evolution of the value of the current solution of Adaptive-Sampling at every round, for different values of the parameter r which controls the number of rounds of the algorithm. Figures 3.5(c) and 3.5(d) show how many rounds are needed by Greedy (in black) and Adaptive-Sampling (in red) to achieve 95 percent of the final value obtained by Greedy.

More precisely,

$C_R(S) = \sum_{n_i \in N} \mathbf{1}_{\exists\, n_j \in S \,:\, d(i,j) \le R} \cdot w_i.$

Unless otherwise specified, the parameters are k = 30, radius R = 1.5 km, and number of rounds of adding elements for adaptive sampling r = 3.
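The weighted coverage function above translates directly into code; the toy weights and distance matrix below are stand-ins for the neighborhood data.

```python
def taxi_coverage(S, weights, dist, radius):
    """C_R(S): total trip weight of the neighborhoods within `radius` of at
    least one chosen location in S (a monotone weighted coverage function)."""
    return sum(w for i, w in weights.items()
               if any(dist[i][j] <= radius for j in S))

# illustrative weights (trip counts) and as-the-crow-flies distances
weights = {0: 10, 1: 5, 2: 1}
dist = {0: {0: 0.0, 1: 2.0}, 1: {0: 2.0, 1: 0.0}, 2: {0: 3.0, 1: 0.5}}
covered = taxi_coverage({1}, weights, dist, radius=1.5)
```

As the radius R grows, a single location covers more neighborhoods, which is why the function's curvature increases with R.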

Benchmarks. We compare the performance of Adaptive-Sampling with three algorithms. The Greedy algorithm, which adds the element with largest marginal contribution at each round, is the standard algorithm for submodular optimization and obtains the optimal $1 - e^{-1}$ approximation (and $(1 - e^{-\kappa})/\kappa$ for functions with curvature κ [Conforti and Cornuéjols, 1984]) in linearly many rounds. It is used as an upper bound to measure the performance cost of obtaining logarithmic adaptivity with Adaptive-Sampling. The TopK algorithm picks the k elements a with largest singleton value f(a). This simple algorithm has one adaptive round and obtains a $1 - \kappa$ approximation for submodular functions with curvature κ. Its low adaptivity and its approximation guarantee make it a natural benchmark.

Finally, Random simply returns a random subset of size k and has 0 rounds of adaptivity.
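The three benchmarks are a few lines each; this is a sketch, not the experiments' code.

```python
import random

def greedy(f, N, k):
    """Standard greedy: k adaptive rounds, optimal 1 - 1/e approximation."""
    S = set()
    for _ in range(k):
        S.add(max((a for a in N if a not in S),
                  key=lambda a: f(S | {a}) - f(S)))
    return S

def topk(f, N, k):
    """One adaptive round: the k elements with largest singleton value."""
    return set(sorted(N, key=lambda a: f({a}), reverse=True)[:k])

def random_subset(N, k, rng=None):
    """Zero adaptive rounds: a uniformly random subset of size k."""
    rng = rng or random.Random(0)
    return set(rng.sample(list(N), k))
```

Note how the adaptivity differs: greedy's queries at round i depend on the i−1 previous answers, TopK's n singleton queries are all independent, and Random queries nothing.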

Experimental Results

General performance. We first analyze how the value of the solutions maintained by each

algorithm evolves at every round. In Figures 3.4(a) and 3.4(b), we observe that Adaptive-Sampling achieves a final value that is close to the one obtained by Greedy, but in a much smaller number of rounds. Adaptive-Sampling also significantly outperforms the two simple algorithms. There are rounds where the value of the Adaptive-Sampling solution does not increase; these correspond to rounds where elements are discarded, which then allow the algorithm to pick better elements in future rounds. For the movie recommender application,

the value of the solution obtained by Greedy increases linearly, but we emphasize that this function is not linear: movies that have the same genre as a movie already picked have their marginal contribution to the solution decreased by α. In these experiments,

Adaptive-Sampling uses only 100 samples at every round. In fact, we observe very similar performance for Adaptive-Sampling whether it uses 10 or 10K samples per round. Thus, the sample complexity is not an issue for Adaptive-Sampling in practice and can be much lower than the theoretical sample complexity needed for the approximation guarantee.

The role of curvature. Next, we analyze the performance of the algorithms as a function of curvature. Both functions have curvature c = 0 when α = 0 and R = 0 respectively, and the curvature increases as α and the radius increase. Again, we observe in Figures 3.4(c)

and 3.4(d) that Adaptive-Sampling obtains a solution of value very close to the value obtained by Greedy, and significantly better than the two simple algorithms in general, for any α and any radius. As implied by the theoretical bounds, Adaptive-Sampling, TopK, and Greedy all perform arbitrarily close to the optimal solution when the curvature is small. The gap between Adaptive-Sampling and Greedy is the largest for mid-range values of α and R. This can be explained by the design of the functions, which become "easier" to optimize as α and R increase, since any neighborhood covers a large fraction of the

total value when R is large and since there is always a large number of movies that have a genre that is not yet in the current solution.

Figures 3.5(c) and 3.5(d) show how many rounds are needed by Greedy and Adaptive-Sampling to obtain 95 percent of the value of the solution of Greedy. When the curvature is small, the k elements with largest contribution form a good solution, so Adaptive-Sampling only needs one round, whereas the value obtained by Greedy grows linearly, so it needs close to 95 percent of its k rounds. For the movie recommendation, since the value

obtained by Greedy always grows almost linearly, Greedy always needs 95 rounds for k = 100. For the taxi dispatch, since a small number of elements can have very large value for

large radius, the number of rounds needed by Greedy decreases for large radius, as does the number needed by Adaptive-Sampling. Similarly to the two previous figures showing the approximation, we observe that the setting where Adaptive-Sampling needs the largest number of rounds is for mid-range radius.

Number of rounds r versus performance. There is a tradeoff between the number of

rounds of Adaptive-Sampling and its performance. This tradeoff is more apparent for the taxi application than for the movie recommender application, where Adaptive-Sampling obtains high value after 2 rounds (Figures 3.5(a) and 3.5(b)). Overall, Adaptive-Sampling obtains a high value in a small number of rounds, but this value can be slightly improved by

increasing the number of rounds of Adaptive-Sampling.

3.2.3 The Optimal Approximation

Iterative-Filtering: An $O(\log^2 n)$-Adaptive Algorithm

In this section, we present the Iterative-Filtering algorithm which obtains an approximation arbitrarily close to $1 - 1/e$ in $O(\log^2 n)$ adaptive rounds. At a high level, the algorithm iteratively identifies large blocks of elements of high value and adds them to the

solution. There are $O(\log n)$ such iterations and each iteration requires $O(\log n)$ adaptive rounds, which amounts to $O(\log^2 n)$-adaptivity. The analysis in this section will later be used as we generalize this algorithm to one that obtains an approximation arbitrarily close to $1 - 1/e$ in $O(\log n)$ adaptive rounds.

Technical Overview

The main goal of this section is to achieve the optimal $1 - 1/e$ guarantee in $O(\log n)$ adaptive steps. The optimal $1 - 1/e$ approximation of the greedy algorithm stems from the guarantee that for any given set S there exists an element whose marginal contribution to S

is at least a $1/k$ fraction of the remaining optimal value $\mathrm{OPT} - f(S)$. A standard inductive argument then shows that iteratively adding the element whose marginal contribution is maximal results in the $1 - 1/e$ approximation guarantee. To obtain the $1 - 1/e$ guarantee in $r = O(\log n)$ adaptive steps rather than k, we could mimic this idea if in each adaptive step we could add a block of $k/r$ elements whose marginal contribution to the existing solution S

is at least a $1/r$ fraction of $\mathrm{OPT} - f(S)$. The entire challenge is in finding such a block of $k/r$ elements in $O(1)$ adaptive steps. A priori, this is a formidable task when $k/r$ is super-constant. In general, the maximal

marginal contribution over all sets of size $k/r$ is as low as $(\mathrm{OPT} - f(S))/r$. Finding a block of size t of maximal marginal contribution in polynomial time is as hard as solving the general problem of submodular maximization under cardinality constraint t, which, in general, cannot be approximated within any factor better than $1 - 1/e$ using polynomially-many queries [Nemhauser and Wolsey, 1978]. Furthermore, we know it is impossible to obtain any constant approximation in $o(\log n/\log\log n)$ adaptive rounds (Section 3.3). Despite this seeming difficulty, we show one can exploit a fundamental property of submodular functions to identify a block of size $k/r$ whose marginal contribution is arbitrarily

close to $(\mathrm{OPT} - f(S))/r$. In general, we show that for monotone submodular functions, while

it is hard to find a set of size k whose value is an arbitrarily good approximation to OPT, it is actually possible to find a set of size $k/r$ whose value is arbitrarily close to that of $\mathrm{OPT}/r$ in polynomial time for $r = O(\log n)$, even when $k/r$ is super-constant. We first describe an algorithm which progressively adds a subset of size $k/r$ to the existing solution S whose marginal contribution is arbitrarily close to $(\mathrm{OPT} - f(S))/r$. To do so, it uses $O(\log n)$ rounds in each such progression and it is hence $O(\log^2 n)$-adaptive. At a high level, in each iteration that it adds a block of size $k/r$, the algorithm carefully and aggressively filters elements in $O(\log n)$ rounds by considering their marginal contribution to a random set drawn from a distribution that evolves throughout the filtering iterations. We then generalize the algorithm so that, on average, every step of adding a block of $k/r$ elements is done in $O(1)$ adaptive steps. The main idea is to consider epochs, which consist of sequences of iterations such that, in the worst case, an iteration might still consist of $O(\log n)$ rounds, but the amortized number of rounds per iteration during an epoch is now constant.

Description of the algorithm. The Iterative-Filtering algorithm consists of r iterations which each add $k/r$ elements to the solution S. To find these elements the algorithm filters out elements from the ground set using the Filter subroutine and then adds a set of size $k/r$ sampled uniformly at random from the remaining elements. Let $\mathcal{U}(X,t)$ denote the uniform distribution over subsets of X of size t. Throughout the paper we always sample sets of size $t = k/r$ and therefore write $\mathcal{U}(X)$ instead of $\mathcal{U}(X, k/r)$ to simplify notation. The Iterative-Filtering algorithm is described formally below as Algorithm 15.

Algorithm 15 Iterative-Filtering
Input: constraint k, bound on number of iterations r
S ← ∅
for r iterations do
    X ← Filter(N, S, r)
    S ← S ∪ R, where R ∼ U(X)
return S

The Filter subroutine iteratively discards elements until a random set $R \sim \mathcal{U}(X)$ has marginal contribution arbitrarily close to the desired $(\mathrm{OPT} - f(S))/r$ value. In each iteration, the elements discarded from the set of surviving elements X are those whose marginal contribution to $R \sim \mathcal{U}(X)$ is low. Intuitively, Filter terminates quickly since if a random set has low expected marginal contribution, then there are many elements whose marginal contribution to a random set is low, and these elements are then discarded. The subroutine Filter is formally described below.

Algorithm 16 Filter(X, S, r)
Input: remaining elements X, current solution S, bound on number of outer-iterations r
while $\mathbb{E}_{R\sim\mathcal{U}(X)}[f_S(R)] < (1-\epsilon)(\mathrm{OPT} - f(S))/r$ do
    X ← X \ $\{a : \mathbb{E}_{R\sim\mathcal{U}(X)}[f_{S\cup R\setminus\{a\}}(a)] < (1+\epsilon/2)(1-\epsilon)(\mathrm{OPT} - f(S))/k\}$
return X

Both Iterative-Filtering and Filter are idealized versions of the algorithms we implement. This is due to the fact that we do not know the value of the optimal solution

OPT and we cannot compute expectations exactly. In practice, we can apply multiple guesses of OPT in parallel and estimate expectations by repeated sampling. For ease of presentation we analyze these idealized versions of the algorithms. In our analysis we assume that in

Iterative-Filtering, when $\mathbb{E}_{R\sim\mathcal{U}(X)}[f_S(R)] \ge t$, this implies that a random set $R \sim \mathcal{U}(X)$ respects $f_S(R) \ge t$.²
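A compact rendering of these two algorithms in code, with expectations replaced by sample averages and `opt` a guess of OPT, exactly as the preceding paragraph describes for the practical version. This is a sketch under those simplifications (including a termination guard absent from the idealized pseudocode), not the thesis's implementation.

```python
import random

def filter_step(f, X, S, r, k, opt, eps=0.2, m=30, rng=None):
    """Filter(X, S, r): while a random block R ~ U(X, k/r) is worth less than
    (1 - eps)(opt - f(S))/r in expectation, discard every element whose
    expected marginal contribution to R falls below the
    (1 + eps/2)(1 - eps)(opt - f(S))/k threshold."""
    rng = rng or random.Random(0)
    t = max(1, k // r)
    fS = lambda T: f(S | T) - f(S)
    gap = opt - f(S)
    while len(X) > t:
        samples = [set(rng.sample(list(X), t)) for _ in range(m)]
        if sum(fS(R) for R in samples) / m >= (1 - eps) * gap / r:
            break
        thresh = (1 + eps / 2) * (1 - eps) * gap / k
        marg = {a: sum(fS(R | {a}) - fS(R - {a}) for R in samples) / m
                for a in X}
        survivors = {a for a in X if marg[a] >= thresh}
        if survivors == X:          # termination guard for this sketch
            break
        X = survivors
    return X

def iterative_filtering(f, N, k, r, opt, rng=None):
    """Algorithm 15: r iterations, each adding a random block of size k/r
    drawn from the elements that survive Filter."""
    rng = rng or random.Random(0)
    S = set()
    for _ in range(r):
        X = filter_step(f, set(N) - S, S, r, k, opt)
        S |= set(rng.sample(list(X), min(len(X), max(1, k // r))))
    return S
```

All the queries inside one pass of the while loop are independent, which is what makes each filtering iteration a single adaptive round.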

Analysis. The analysis of Iterative-Filtering relies on two properties of its Filter subroutine: (1) the marginal contribution of the set of elements not discarded in Filter after

$O(r)$ iterations is arbitrarily close to $(\mathrm{OPT} - f(S))/r$ and (2) there are at most $k/r$ remaining elements after $O(r)$ rounds. We assume that $\epsilon > 0$ is a small constant in the analysis.

Bounding the value of elements that survive Filter. We first prove that the marginal

contribution of elements returned by Filter to the existing solution S is arbitrarily close to

$(\mathrm{OPT} - f(S))/r$. We do so by arguing that the set returned by Filter includes a subset of the optimal solution O with such marginal contribution. Let $\rho$ be the number of iterations of

the while loop in Filter. For a given iteration $i \in [\rho]$, let $R_i$ be a random set of size $k/r$ drawn

uniformly at random from $X_i$, where $X_i$ are the remaining elements at iteration i. Notice that by monotonicity and submodularity, $f_S(O) \ge \mathrm{OPT} - f(S)$. We first show that we can consider the marginal contribution of O not only to S but to $S \cup (\cup_{i=1}^{\rho} R_i)$, while suffering an arbitrarily

small loss. Considering the marginal contribution over random sets Ri is important to show that some optimal elements of high value must survive all rounds.

Lemma 3.2.8. Let $R_i \sim \mathcal{U}(X)$ be the random set at iteration i of Filter(N, S, r). For all $S \subseteq N$ and $r, \rho > 0$, if Filter(N, S, r) has not terminated after $\rho$ iterations, then

$\mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup (\cup_{i=1}^{\rho} R_i)}(O)\right] \ge \left(1 - \frac{\rho}{r}\right)\cdot(\mathrm{OPT} - f(S)).$

Proof. We exploit the fact that if Filter(N, S, r) has not terminated after $\rho$ iterations, then by the algorithm, the random set $R_i \sim \mathcal{U}(X)$ at iteration i has expected value that is upper bounded as follows:

$\mathbb{E}_{R_i}[f_S(R_i)] < \frac{1-\epsilon}{r}\,(\mathrm{OPT} - f(S))$

²Since we estimate $\mathbb{E}_{R\sim\mathcal{U}(X)}[f_S(R)]$ by sampling in the full version of the algorithm, there is at least one sample with value at least the estimated value of $\mathbb{E}_{R\sim\mathcal{U}(X)}[f_S(R)]$ that we can take.

for all $i \le \rho$. Next, by subadditivity, we have $\mathbb{E}_{R_1,\dots,R_\rho}\left[f_S(\cup_{i=1}^{\rho} R_i)\right] \le \sum_{i=1}^{\rho} \mathbb{E}_{R_i}[f_S(R_i)]$ and, by monotonicity, $\mathbb{E}_{R_1,\dots,R_\rho}\left[f_S(O \cup (\cup_{i=1}^{\rho} R_i))\right] \ge \mathrm{OPT} - f(S)$. Combining the above inequalities, we conclude that

$\mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup (\cup_{i=1}^{\rho} R_i)}(O)\right] = \mathbb{E}_{R_1,\dots,R_\rho}\left[f_S(O \cup (\cup_{i=1}^{\rho} R_i))\right] - \mathbb{E}_{R_1,\dots,R_\rho}\left[f_S(\cup_{i=1}^{\rho} R_i)\right] \ge \mathrm{OPT} - f(S) - \sum_{i=1}^{\rho} \mathbb{E}_{R_i}[f_S(R_i)] \ge \left(1 - \frac{\rho}{r}\right)\cdot(\mathrm{OPT} - f(S)).$

Next, we bound the value of elements that survive filtering rounds. To do so, we use Lemma 3.2.8 to show that there exists a subset T of the optimal solution O that survives $\rho$ rounds of filtering and that has marginal contribution to S arbitrarily close to $(\mathrm{OPT} - f(S))/r$.

Lemma 3.2.9. For all $S \subseteq N$ and $\epsilon > 0$, if $r \ge 20\rho\epsilon^{-1}$, then the elements $X_\rho$ that survive $\rho$ iterations of Filter(N, S, r) satisfy

$f_S(X_\rho) \ge \frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S)).$

Proof. At a high level, the proof first defines a subset T of the optimal solution O. Then, the remainder of the proof consists of two main parts. First, we show that elements in T survive $\rho$ iterations of Filter(N, S, r). Then, we show that

$f_S(T) \ge \frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S)).$

We introduce some notation. Let $O = \{o_1,\dots,o_k\}$ be the optimal elements in some arbitrary order and $O_\ell = \{o_1,\dots,o_\ell\}$. We define the following marginal contribution $\Delta_\ell$ of each optimal element $o_\ell$:

$\Delta_\ell := \mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup O_{\ell-1} \cup (\cup_{i=1}^{\rho} R_i \setminus \{o_\ell\})}(o_\ell)\right].$

We define T to be the set of optimal elements $o_\ell$ such that $\Delta_\ell \ge (1-\epsilon/4)\Delta$ where

$\Delta := \frac{1}{k}\left(1 - \frac{\rho}{r}\right)\cdot(\mathrm{OPT} - f(S)).$

We first argue that elements in T survive $\rho$ iterations of Filter(N, S, r). For element $o_\ell \in T$, we have

$\Delta_\ell \ge (1-\epsilon/4)\Delta \ge \frac{1}{k}\,(1-\epsilon/4)\left(1 - \frac{\rho}{r}\right)(\mathrm{OPT} - f(S)) \ge \frac{1}{k}\,(1+\epsilon/2)(1-\epsilon)\cdot(\mathrm{OPT} - f(S))$

where the last inequality is since $r \ge 20\rho\epsilon^{-1}$. Thus, at iteration $i \le \rho$, by submodularity,

$\mathbb{E}_{R_i}\left[f_{S \cup R_i \setminus \{o_\ell\}}(o_\ell)\right] \ge \mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup O_{\ell-1} \cup (\cup_{i=1}^{\rho} R_i \setminus \{o_\ell\})}(o_\ell)\right] = \Delta_\ell \ge \frac{1}{k}\,(1+\epsilon/2)(1-\epsilon)\cdot(\mathrm{OPT} - f(S))$

and $o_\ell$ survives all iterations $i \le \rho$, for all $o_\ell \in T$. Next, we argue that $f_S(T) \ge \frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S))$. Note that

$\sum_{\ell=1}^{k} \Delta_\ell \ge \mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup (\cup_{i=1}^{\rho} R_i)}(O)\right] \ge \left(1 - \frac{\rho}{r}\right)\cdot(\mathrm{OPT} - f(S)) = k\Delta,$

where the second inequality is by Lemma 3.2.8. Next, observe that

$k\Delta = \sum_{\ell=1}^{k} \Delta_\ell = \sum_{o_\ell \in T} \Delta_\ell + \sum_{o_j \in O \setminus T} \Delta_j \le \sum_{o_\ell \in T} \Delta_\ell + k(1-\epsilon/4)\Delta.$

By combining the two inequalities above, we get $\sum_{o_\ell \in T} \Delta_\ell \ge k\epsilon\Delta/4$. Thus, by submodularity,

$f_S(T) \ge \sum_{o_\ell \in T} f_{S \cup O_{\ell-1}}(o_\ell) \ge \sum_{o_\ell \in T} \mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S \cup O_{\ell-1} \cup (\cup_{i=1}^{\rho} R_i \setminus \{o_\ell\})}(o_\ell)\right] = \sum_{o_\ell \in T} \Delta_\ell \ge k\epsilon\Delta/4.$

We conclude that

$f_S(X_\rho) \ge f_S(T) \ge k\epsilon\Delta/4 = (\epsilon/4)\left(1 - \frac{\rho}{r}\right)\cdot(\mathrm{OPT} - f(S)) \ge \frac{1}{r}\cdot(1-\epsilon)\cdot(\mathrm{OPT} - f(S)),$

where the first inequality is by monotonicity and since $T \subseteq X_\rho$ is a set of surviving elements.

The adaptivity of Filter. The second part of the analysis bounds the number of adaptive

rounds of the Filter algorithm. A main lemma for this part, Lemma 3.2.10, shows that a constant fraction of elements are discarded at every round of filtering. Combined with the previous lemma that bounds the value of remaining elements, Lemma 3.2.11 then shows

that Filter has at most $O(\log n)$ rounds. The analysis showing that a constant fraction of elements are discarded at every round is similar to that of Section 3.2.1.

Lemma 3.2.10. Let $X_i$ and $X_{i+1}$ be the surviving elements at the start and end of iteration i of Filter(N, S, r). For all $S \subseteq N$ and $r, i, \epsilon > 0$, if Filter(N, S, r) does not terminate at iteration i, then

$|X_{i+1}| < \frac{|X_i|}{1+\epsilon/2}.$

Proof. At a high level, since the surviving elements must have high value and a random set has low value, we can use the thresholds to bound how many such surviving elements there can be while also having a random set of low value. To do so, we focus on the value $f(R_i \cap X_{i+1})$ of the surviving elements $X_{i+1}$ in a random set $R_i \sim \mathcal{U}(X_i)$.

$\mathbb{E}[f_S(R_i \cap X_{i+1})] \ge \mathbb{E}\left[\sum_{a \in R_i \cap X_{i+1}} f_{S \cup (R_i \cap X_{i+1}) \setminus \{a\}}(a)\right]$  (submodularity)

$\ge \mathbb{E}\left[\sum_{a \in X_{i+1}} \mathbf{1}_{a \in R_i} \cdot f_{S \cup R_i \setminus \{a\}}(a)\right]$  (submodularity)

$= \sum_{a \in X_{i+1}} \mathbb{E}\left[\mathbf{1}_{a \in R_i} \cdot f_{S \cup R_i \setminus \{a\}}(a)\right]$

$= \sum_{a \in X_{i+1}} \Pr[a \in R_i] \cdot \mathbb{E}\left[f_{S \cup R_i \setminus \{a\}}(a) \mid a \in R_i\right]$

$\ge \sum_{a \in X_{i+1}} \Pr[a \in R_i] \cdot \mathbb{E}\left[f_{S \cup R_i \setminus \{a\}}(a)\right]$  (submodularity)

$\ge \sum_{a \in X_{i+1}} \Pr[a \in R_i] \cdot (1+\epsilon/2)\,\frac{1}{k}\,(1-\epsilon)(\mathrm{OPT} - f(S))$  (algorithm)

$= |X_{i+1}| \cdot \frac{k}{r\,|X_i|} \cdot (1+\epsilon/2)\,\frac{1}{k}\,(1-\epsilon)(\mathrm{OPT} - f(S))$  (definition of $\mathcal{U}(X_i)$)

$= \frac{|X_{i+1}|}{r\,|X_i|}\,(1+\epsilon/2)(1-\epsilon)(\mathrm{OPT} - f(S)).$

Next, since elements are discarded, a random set must have low value by the algorithm,

$\frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S)) > \mathbb{E}[f_S(R_i)].$

By monotonicity, we get $\mathbb{E}[f_S(R_i)] \ge \mathbb{E}[f_S(R_i \cap X_{i+1})]$. Finally, by combining the above inequalities, we conclude that $|X_{i+1}| \le |X_i|/(1+\epsilon/2)$.

Thus, by the previous lemma, there are at most k/r surviving elements after logarithmically many filtering rounds and by Lemma 3.2.9, these remaining elements must have high value.

Thus, Filter terminates and we obtain the following main lemma for the number of rounds.

Lemma 3.2.11. For all $S \subseteq N$, if $r \ge 40\epsilon^{-2}\log n$, then Filter(N, S, r) terminates after at most $O(\log n)$ iterations.

Proof. If Filter(N, S, r) has not yet terminated after $2\epsilon^{-1}\log n$ iterations, then, by Lemma 3.2.10, at most $k/r$ elements survived these $\rho = 2\epsilon^{-1}\log n$ iterations. By Lemma 3.2.9, with $r \ge 20\rho\epsilon^{-1}$, the set $X_\rho$ of elements that survive these $2\epsilon^{-1}\log n$ iterations is such that $f_S(X_\rho) \ge \frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S))$. Since there are at most $k/r$ surviving elements, $R = X_\rho$ for $R \sim \mathcal{U}(X_\rho)$ and

$f_S(R) = f_S(X_\rho) \ge \frac{1}{r}\,(1-\epsilon)(\mathrm{OPT} - f(S)),$

and Filter(N,S,r) terminates at this iteration.

Main result for Iterative-Filtering. We are now ready to prove the main result for

Iterative-Filtering. By Lemma 3.2.11, at every iteration of Iterative-Filtering, in at most $O(\log n)$ iterations of Filter, the value of the solution S is increased by at least $(1-\epsilon)(\mathrm{OPT} - f(S))/r$ with $k/r$ new elements. The analysis of the $1 - 1/e - \epsilon$ approximation then follows similarly to the standard analysis of the greedy algorithm. Regarding the total number of rounds, we fix parameter $r = 40\epsilon^{-2}\log n$. There are at most r iterations of Iterative-Filtering, each of which consists of at most $O(\log n)$ iterations of Filter, and the queries at every iteration of Filter are non-adaptive.

Theorem 3.2.3. For any constant $\epsilon > 0$, Iterative-Filtering is an $O(\log^2 n)$-adaptive algorithm that obtains a $1 - 1/e - \epsilon$ approximation, with parameter $r = 40\epsilon^{-2}\log n$.

Proof. Let $S_i$ denote the solution S at the ith iteration of Iterative-Filtering. The algorithm increases the value of the solution S by at least $(1-\epsilon)(\mathrm{OPT} - f(S))/r$ at every iteration with $k/r$ new elements. Thus,

$f(S_i) \ge f(S_{i-1}) + \frac{1-\epsilon}{r}\,(\mathrm{OPT} - f(S_{i-1})).$

Next, we show by induction on i that

$f(S_i) \ge \left(1 - \left(1 - \frac{1-\epsilon}{r}\right)^i\right)\mathrm{OPT}.$

Observe that

$f(S_i) \ge f(S_{i-1}) + \frac{1-\epsilon}{r}\,(\mathrm{OPT} - f(S_{i-1})) = \frac{1-\epsilon}{r}\,\mathrm{OPT} + \left(1 - \frac{1-\epsilon}{r}\right)f(S_{i-1}) \ge \frac{1-\epsilon}{r}\,\mathrm{OPT} + \left(1 - \frac{1-\epsilon}{r}\right)\left(1 - \left(1 - \frac{1-\epsilon}{r}\right)^{i-1}\right)\mathrm{OPT} = \left(1 - \left(1 - \frac{1-\epsilon}{r}\right)^i\right)\mathrm{OPT}.$

Thus, with i = r, after r iterations of adding $k/r$ elements, we return a solution S such that

$f(S) \ge \left(1 - \left(1 - \frac{1-\epsilon}{r}\right)^r\right)\mathrm{OPT}$

and obtain

$f(S) \ge \left(1 - e^{-(1-\epsilon)}\right)\mathrm{OPT} \ge \left(1 - \frac{1+2\epsilon}{e}\right)\mathrm{OPT} \ge \left(1 - \frac{1}{e} - \epsilon\right)\mathrm{OPT},$

where the second inequality is since $e^x \le 1+2x$ for $0 < x \le 1$. Regarding the adaptivity, each of the r iterations of Iterative-Filtering consists of at most $O(\epsilon^{-1}\log n)$ iterations of Filter by Lemma 3.2.11, and the queries at every iteration of Filter are non-adaptive.
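The closing chain of inequalities is easy to sanity-check numerically, for instance verifying $1 - (1 - (1-\epsilon)/r)^r \ge 1 - 1/e - \epsilon$ over a small grid of parameters:

```python
import math

# check 1 - (1 - (1 - eps)/r)**r >= 1 - 1/e - eps on a grid of parameters,
# which is the step from the induction bound to the 1 - 1/e - eps guarantee
for eps in (0.01, 0.05, 0.1, 0.2):
    for r in (10, 100, 1000, 10**4):
        lhs = 1 - (1 - (1 - eps) / r) ** r
        assert lhs >= 1 - 1 / math.e - eps
```

As r grows, the left-hand side approaches $1 - e^{-(1-\epsilon)}$, matching the limit used in the proof.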

Amortized-Filtering: An $O(\log n)$-Adaptive Algorithm

In this section, we build on the algorithm and analysis from the previous section to obtain the main result of this paper. We present Amortized-Filtering, which accelerates Iterative-Filtering by using fewer filtering rounds while maintaining the same approximation guarantee. In particular, it obtains an approximation arbitrarily close to $1 - 1/e$ in logarithmically-many adaptive rounds.

Description of the algorithm. Amortized-Filtering iteratively adds a block of $k/r$ elements obtained using the Filter subroutine to the existing solution S, exactly as Iterative-Filtering. The improvement in adaptivity comes from the use of epochs. An epoch is a sequence of iterations during which the value of the solution S increases by at most $\epsilon(\mathrm{OPT} - f(S))/20$. During an epoch, the algorithm invokes Filter with the surviving elements from the previous iteration of Amortized-Filtering, rather than all elements in the ground set as in Iterative-Filtering. In a new epoch, Filter is then again invoked with the ground set. A formal description of an idealized version is included below.

Algorithm 17 Amortized-Filtering
Input: bound on number of iterations r
S ← ∅
for 20/ε epochs do
    X ← N, T ← ∅
    while $f_S(T) < (\epsilon/20)(\mathrm{OPT} - f(S))$ and $|S \cup T| < k$ do
        X ← Filter(X, S ∪ T, r)
        T ← T ∪ R, where R ∼ U(X)
    S ← S ∪ T
return S
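The epoch structure can be sketched as follows; here `filter_from` is a deliberately simplified stand-in for the Filter subroutine, so the sketch only illustrates how the epochs, the within-epoch set T, and the solution S interact, not the algorithm's guarantees.

```python
import random

def amortized_filtering(f, N, k, r, opt, eps=0.2, rng=None):
    """Epoch structure of Algorithm 17: within an epoch the filter is
    re-invoked on the surviving elements X instead of all of N; a new epoch
    restarts from the ground set once S has gained enough value."""
    rng = rng or random.Random(0)

    def filter_from(X, base):
        # simplified stand-in for Filter: keep elements with positive
        # marginal contribution to the current solution `base`
        return {a for a in X if f(base | {a}) - f(base) > 0} or X

    S = set()
    for _ in range(int(round(20 / eps))):
        X, T = set(N) - S, set()
        while (f(S | T) - f(S) < (eps / 20) * (opt - f(S))
               and len(S | T) < k and X):
            X = filter_from(X, S | T)
            R = set(rng.sample(list(X), min(len(X), max(1, k // r))))
            T |= R
            X -= R
        S |= T
        if len(S) >= k:
            break
    return S
```

Reusing the filtered set X across the iterations of an epoch is exactly what lets the filtering work be amortized: only a new epoch pays the cost of restarting from the full ground set.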

Analysis of Amortized-Filtering. As in the previous section, we analyze the idealized version described above. Our analysis for Amortized-Filtering relies on the properties of every epoch. In particular, we first show that during an epoch, the surviving elements X

have marginal contribution at least $\epsilon(\mathrm{OPT} - f(S))/20$ to $S \cup T$. Notice that the marginal contribution is with respect to the set $S \cup T$ and the value with respect only to S. We then show that for any epoch, the total number of iterations of Filter during that epoch is $O(\log n)$. We emphasize that an iteration of Filter is different from an iteration of the while-loop of Amortized-Filtering, i.e., an epoch consists of multiple iterations of Amortized-Filtering, each of which consists of multiple iterations of Filter. Since there are at most $20\epsilon^{-1}$ epochs, the amortized number of iterations of Filter per iteration of Amortized-Filtering is now constant.

Bounding the value of elements that survive an epoch. For any given epoch, we first bound the marginal contribution of O to $S \cup T$ and the random sets $\{R_i\}_{i=1}^{\rho}$ when there are $\rho$ iterations of filtering during the epoch. Similar to the previous section, we show that the marginal contribution of O to $S \cup T$ and the random sets is arbitrarily close to the desired $\mathrm{OPT} - f(S)$ value. The analysis is similar to the analysis of Lemma 3.2.8, except for a subtle yet crucial difference. The analysis in this section needs to handle the fact that the solution $S \cup T$ changes during the epoch. To do so we rely on the fact that the increase in the value of $S \cup T$ during an epoch is bounded.

Lemma 3.2.12. For any epoch j and $\epsilon > 0$, let $R_i \sim \mathcal{U}(X)$ be the random set at iteration i of filtering during epoch j. For all $r, \rho > 0$, if epoch j has not ended after $\rho$ iterations of filtering, then

$\mathbb{E}_{R_1,\dots,R_\rho}\left[f_{S_j^+ \cup (\cup_{i=1}^{\rho} R_i)}(O)\right] \ge \left(1 - \frac{\rho}{r} - \epsilon/20\right)\cdot(\mathrm{OPT} - f(S_j))$

where $S_j$ is the set S at epoch j and $S_j^+$ is the set $S \cup T$ at the last iteration of epoch j.

Proof. We introduce some notation and terminology. We call the iteration i of filtering during epoch j the ith iteration discarding elements inside of Filter since the beginning of epoch j, over the multiple invocations of Filter. An element survives $\rho$ iterations of Filter at epoch j if it has not been discarded at iteration i of filtering during epoch j, for all $i \le \rho$. Let $S_j$ denote the solution S at epoch $j \in [20\epsilon^{-1}]$, $S_j^+$ denote $S \cup T$ during the last iteration of Amortized-Filtering at epoch j, i.e., with the last T such that $f_S(T) < (\epsilon/20)(\mathrm{OPT} - f(S))$, and $S_{j,i}$ denote $S \cup T$ at iteration i of filtering during epoch j. Thus, for all $i_1 \le i_2$,

$S_j \subseteq S_{j,i_1} \subseteq S_{j,i_2} \subseteq S_j^+ \subseteq S_{j+1}$

and $f(S_j^+) - f(S_j) < (\epsilon/20)(\mathrm{OPT} - f(S_j))$.

Similarly as for Lemma 3.2.8, we exploit the fact that if Filter has not terminated after $\rho$ iterations, then by the algorithm, the random set $R_i \sim \mathcal{U}(X)$ at iteration i has low expected value. In addition, we also use the bound on the change in value of S during epoch j:

$$\begin{aligned}
\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+ \cup (\cup_{i=1}^{\rho} R_i)}(O)\right]
&= \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+}\left(O \cup \left(\cup_{i=1}^{\rho} R_i\right)\right)\right] - \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+}\left(\cup_{i=1}^{\rho} R_i\right)\right] \\
&\geq \text{OPT} - f(S_j^+) - \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+}\left(\cup_{i=1}^{\rho} R_i\right)\right] && \text{monotonicity} \\
&\geq \text{OPT} - f(S_j) - (\epsilon/20)\left(\text{OPT} - f(S_j)\right) - \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+}\left(\cup_{i=1}^{\rho} R_i\right)\right] && \text{same epoch} \\
&\geq (1 - \epsilon/20)\left(\text{OPT} - f(S_j)\right) - \sum_{i=1}^{\rho} \mathbb{E}_{R_i}\left[f_{S_j^+}(R_i)\right] && \text{subadditivity} \\
&\geq (1 - \epsilon/20)\left(\text{OPT} - f(S_j)\right) - \sum_{i=1}^{\rho} \mathbb{E}_{R_i}\left[f_{S_{j,i}}(R_i)\right] && \text{submodularity} \\
&\geq (1 - \epsilon/20)\left(\text{OPT} - f(S_j)\right) - \sum_{i=1}^{\rho} \frac{1-\epsilon}{r}\left(\text{OPT} - f(S_j)\right) && \text{algorithm} \\
&\geq \left(1 - \frac{\epsilon}{20} - \frac{\rho}{r}\right) \cdot \left(\text{OPT} - f(S_j)\right).
\end{aligned}$$

Next, we bound the value of elements that survive the filtering iterations during an epoch. The proof is similar to that of Lemma 3.2.9, modified to handle the fact that the solution $S$ evolves during an epoch.

Lemma 3.2.13. For any epoch $j$ and $\epsilon > 0$, if $r \geq 20\rho\epsilon^{-1}$, then the elements $X_\rho$ that survive $\rho$ iterations of filtering at epoch $j$ satisfy

$$f_{S_j^+}(X_\rho) \geq (\epsilon/4)(1-\epsilon)\left(\text{OPT} - f(S_j)\right),$$

where $S_j$ is the set $S$ at epoch $j$ and $S_j^+$ is the set $S \cup T$ at the last iteration of epoch $j$.

Proof. Let $j$ be any epoch. Similarly as for Lemma 3.2.9, the proof defines a subset $Q$ of the optimal solution $O$ and then shows that elements in $Q$ survive $\rho$ iterations of filtering at epoch $j$ and that $f_{S_j^+}(Q) \geq (\epsilon/4)(1-\epsilon)(\text{OPT} - f(S_j))$. We define the following marginal contribution $\Delta_\ell$ of each optimal element $o_\ell$:

$$\Delta_\ell := \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+ \cup O_{\ell-1} \cup ((\cup_{i=1}^{\rho} R_i) \setminus \{o_\ell\})}(o_\ell)\right].$$

We define $Q$ to be the set of optimal elements $o_\ell$ such that $\Delta_\ell \geq (1-\epsilon/4)\Delta$, where

$$\Delta := \frac{1}{k}\left(1 - \frac{\epsilon}{20} - \frac{\rho}{r}\right) \cdot \left(\text{OPT} - f(S_j)\right).$$

We first argue that elements in $Q$ survive $\rho$ iterations of filtering at epoch $j$. For element $o_\ell \in Q$, we have

$$\Delta_\ell \geq (1-\epsilon/4)\Delta \geq \frac{1}{k}(1-\epsilon/4)\left(1 - \frac{\epsilon}{20} - \frac{\rho}{r}\right)\cdot\left(\text{OPT} - f(S_j)\right) \geq \frac{1}{k}(1+\epsilon/2)(1-\epsilon)\cdot\left(\text{OPT} - f(S_j)\right)$$

where the third inequality is by the condition on $r$. Thus, at iteration $i \leq \rho$, by submodularity,

$$\mathbb{E}_{R_i}\left[f_{S_{j,i} \cup (R_i \setminus \{o_\ell\})}(o_\ell)\right] \geq \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+ \cup O_{\ell-1} \cup ((\cup_{i=1}^{\rho} R_i) \setminus \{o_\ell\})}(o_\ell)\right] = \Delta_\ell \geq \frac{1}{k}(1+\epsilon/2)(1-\epsilon)\cdot\left(\text{OPT} - f(S_j)\right)$$

and $o_\ell$ survives all iterations $i \leq \rho$, for all $o_\ell \in Q$.

Next, we argue that $f_{S_j^+}(Q) \geq (\epsilon/4)(1-\epsilon)(\text{OPT} - f(S_j))$. Note that

$$\sum_{\ell=1}^{k} \Delta_\ell \geq \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+ \cup (\cup_{i=1}^{\rho} R_i)}(O)\right] \geq \left(1 - \frac{\epsilon}{20} - \frac{\rho}{r}\right)\cdot\left(\text{OPT} - f(S_j)\right) = \Delta k,$$

where the second inequality is by Lemma 3.2.12. Next, observe that

$$\sum_{\ell=1}^{k} \Delta_\ell = \sum_{o_\ell \in Q} \Delta_\ell + \sum_{o_\ell \in O \setminus Q} \Delta_\ell \leq \sum_{o_\ell \in Q} \Delta_\ell + k(1-\epsilon/4)\Delta.$$

By combining the two inequalities above, we get $\sum_{o_\ell \in Q} \Delta_\ell \geq k\Delta\epsilon/4$. Thus, by submodularity,

$$f_{S_j^+}(Q) \geq \sum_{o_\ell \in Q} f_{S_j^+ \cup O_{\ell-1}}(o_\ell) \geq \sum_{o_\ell \in Q} \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S_j^+ \cup O_{\ell-1} \cup ((\cup_{i=1}^{\rho} R_i)\setminus\{o_\ell\})}(o_\ell)\right] = \sum_{o_\ell \in Q} \Delta_\ell \geq k\Delta\epsilon/4.$$

We conclude that

$$f_{S_j^+}(X_\rho) \geq f_{S_j^+}(Q) \geq k\Delta\epsilon/4 = (\epsilon/4)\left(1 - \frac{\epsilon}{20} - \frac{\rho}{r}\right)\cdot\left(\text{OPT} - f(S_j)\right) \geq (\epsilon/4)(1-\epsilon)\cdot\left(\text{OPT} - f(S_j)\right),$$

where the first inequality is by monotonicity and since $Q \subseteq X_\rho$ is a set of surviving elements.

The adaptivity of an epoch. The next lemma bounds the total number of iterations of filtering per epoch. At a high level, similarly as for Iterative-Filtering, a constant fraction of elements is discarded at each iteration of filtering by Lemma 3.2.10, and there are at most $k/r$ surviving elements after logarithmically many filtering rounds. Then, we use Lemma 3.2.13 and the fact that the surviving elements during an epoch have high contribution to show that the epoch terminates.

Lemma 3.2.14. In any epoch of Amortized-Filtering and for any $\epsilon \in (0, 1/2)$, if $r \geq 40\epsilon^{-2}\log n$, then there are at most $2\epsilon^{-1}\log n$ iterations of filtering during the epoch.

Proof. If an epoch $j$ has not yet terminated after $\rho = 2\epsilon^{-1}\log n$ iterations of filtering, then, by Lemma 3.2.10, at most $k/r$ elements survived these $\rho$ filtering iterations. We consider the set $T$ obtained after these $\rho$ filtering iterations. By Lemma 3.2.13, with $r \geq 20\rho\cdot\epsilon^{-1}$, the set $X_\rho$ of elements that survive these iterations is such that $f_{S\cup T}(X_\rho) \geq (\epsilon/4)\cdot(1-\epsilon)(\text{OPT} - f(S))$. Since there are at most $k/r$ surviving elements, $R = X_\rho$ for $R \sim \mathcal{U}(X_\rho)$ and

$$\mathbb{E}\left[f_{S\cup T}(R)\right] \geq f_{S\cup T}(X_\rho) \geq (\epsilon/4)\cdot(1-\epsilon)\left(\text{OPT} - f(S)\right) \geq \frac{1}{r}\cdot(1-\epsilon)\left(\text{OPT} - f(S\cup T)\right)$$

where the last inequality is by monotonicity. Thus, the current call to the Filter subroutine

terminates and $X_\rho$ is added to $T$ by the algorithm. Next,

$$f_S(T \cup X_\rho) \geq f_S(X_\rho) \geq f_{S\cup T}(X_\rho) \geq (\epsilon/4)\cdot(1-\epsilon)\left(\text{OPT} - f(S)\right) \geq (\epsilon/20)\left(\text{OPT} - f(S)\right)$$

where the first inequality is by monotonicity and the second by submodularity. Thus, epoch $j$ ends.

Main result. We are now ready to prove the main result of the paper, which is the analysis of Amortized-Filtering. There are two cases: either the algorithm terminates after $r$ iterations with $|S \cup T| = k$, or it terminates after $20\epsilon^{-1}$ epochs. With $r = O(\log n)$, there are at most $O(\log n)$ iterations of adding elements and at most $O(1)$ epochs with $O(\log n)$ filtering iterations per epoch. Thus the total number of adaptive rounds is $O(\log n)$.

Theorem 3.2.4. For any constant $\epsilon > 0$, when using parameter $r = 40\epsilon^{-2}\log n$, Amortized-Filtering obtains a $1 - 1/e - \epsilon$ approximation in $O(\log n)$ adaptive steps.

Proof. First, consider the case where the algorithm terminates after $r$ iterations of adding elements to $S$. Let $S_i$ denote the solution $S$ at the $i$th iteration. Amortized-Filtering increases the value of the solution $S$ by at least $(1-\epsilon)(\text{OPT} - f(S))/r$ at every iteration with $k/r$ new elements. Thus,

$$f(S_i) \geq f(S_{i-1}) + \frac{1-\epsilon}{r}\left(\text{OPT} - f(S_{i-1})\right)$$

and we obtain $f(S) \geq \left(1 - e^{-(1-\epsilon)}\right)\text{OPT} \geq \left(1 - e^{-1} - \epsilon\right)\text{OPT}$ similarly as for Theorem 3.2.3. Next, consider the case where the algorithm terminated after $(\epsilon/20)^{-1}$ epochs. At every epoch $j$, the algorithm increases the value of the solution $S$ by $(\epsilon/20)(\text{OPT} - f(S_j))$. Thus,

$$f(S_j) \geq f(S_{j-1}) + (\epsilon/20)\left(\text{OPT} - f(S_{j-1})\right).$$

Similarly as in the first case, we get that after $(\epsilon/20)^{-1}$ epochs, $f(S) \geq (1 - e^{-1})\text{OPT}$. The total number of rounds of adaptivity of Amortized-Filtering is at most $O(\epsilon^{-3}\log n)$, since there are at most $r = 40\epsilon^{-2}\log n$ iterations of adding elements and at most $(\epsilon/20)^{-1}$ epochs with, by Lemma 3.2.14, at most $2\epsilon^{-1}\log n$ filtering iterations each. The queries at each filtering iteration are independent and can be evaluated in parallel.

Similarly as for Iterative-Filtering, Amortized-Filtering is an idealized version of the full algorithm, since we do not know OPT and cannot compute expectations exactly. The full algorithm guesses OPT and estimates expectations arbitrarily well by sampling in one adaptive round, in the same manner as in Section 3.2.1. The algorithm is randomized due to the sampling at every round, and its analysis is nearly identical to that presented in this section while accounting for additional arbitrarily small errors due to the guessing of OPT and the estimates of the expectations. The main result is the following theorem for the full algorithm.

Theorem 3.2.5. For any $\epsilon \in (0, 1/2)$, there exists an algorithm that obtains a $1 - 1/e - \epsilon$ approximation with probability $1 - \delta$ in $O(\epsilon^{-2}\log n)$ adaptive steps. Its query complexity in each round is $O\left(nk\log^2(n)\,\epsilon^{-3}\log^5\left(\frac{n}{\delta}\right)\right)$.

3.2.4 Non-monotone Functions

Non-monotone submodular maximization is well-studied [Feige et al., 2011, Lee et al., 2009, Gupta et al., 2010, Feldman et al., 2011, Gharan and Vondrák, 2011, Buchbinder et al., 2014, Chekuri et al., 2015, Mirzasoleiman et al., 2016, Ene and Nguyen, 2016], particularly under a cardinality constraint [Lee et al., 2009, Gupta et al., 2010, Gharan and Vondrák, 2011, Buchbinder et al., 2014, Mirzasoleiman et al., 2016]. For maximizing a non-monotone submodular function under a cardinality constraint k, a simple randomized greedy algorithm that iteratively includes a random element from the set of k elements with largest marginal contribution at every iteration achieves a 1/e approximation to the optimal set of size k [Buchbinder et al., 2014]. For more general constraints, Mirzasoleiman et al. develop an algorithm with strong approximation guarantees that works well in practice [Mirzasoleiman et al., 2016]. While the algorithms for constrained non-monotone submodular maximization achieve strong approximation guarantees, their parallel runtime is linear in the size of the data

due to their high adaptivity. Informally, the adaptivity of an algorithm is the number of sequential rounds it requires when polynomially-many function evaluations can be executed in parallel in each round. The adaptivity of the randomized greedy algorithm is $k$ since it sequentially adds elements in $k$ rounds. The algorithm in Mirzasoleiman et al. is also $k$-adaptive, as is any known constant approximation algorithm for constrained non-monotone submodular maximization. In general, $k$ may be $\Omega(n)$, and hence the adaptivity as well as the parallel runtime of all known constant approximation algorithms for constrained submodular maximization are at least linear in the size of the data.

Technical Overview

Non-monotone submodular functions are notoriously challenging to optimize. Unlike in the monotone case, standard algorithms for submodular maximization such as the greedy algorithm perform arbitrarily poorly on non-monotone functions, and the best achievable approximation remains unknown.³ Since the marginal contribution of an element to a set is not guaranteed to be non-negative, an algorithm's local decisions in the early stages of optimization may contribute negatively to the value of its final solution. At a high level, we overcome this problem with an algorithmic approach that iteratively adds to the solution blocks of elements obtained after aggressively discarding other elements. Showing the guarantees of this algorithm for non-monotone functions requires multiple subtle components. Specifically, we require that at every iteration, any element is added to the solution with low probability. This requirement imposes a significant additional challenge beyond just finding a block of high contribution at every iteration, but it is needed to show that in future iterations there will exist a block with large contribution to the solution. Second, we introduce a pre-processing step that discards elements with negative expected marginal contribution to a random set drawn from some distribution. This pre-processing step is needed for two different arguments: the first is that a large number of elements are discarded at every iteration, and the second is that a random block has high value when there are $k$ surviving elements.

The Blits Algorithm

In this section, we describe the BLock ITeration Submodular maximization algorithm (henceforth Blits), which obtains an approximation arbitrarily close to $1/2e$ in $O(\log^2 n)$ adaptive rounds. Blits iteratively identifies a block of at most $k/r$ elements using a Sieve subroutine, treated as a black box in this section, and adds this block to the current solution $S$, for $r$ iterations.

³To date, the best upper and lower bounds are from [Buchbinder et al., 2014] and [Gharan and Vondrák, 2011], respectively, for non-monotone submodular maximization under a cardinality constraint.

Algorithm 18 Blits: the BLock ITeration Submodular maximization algorithm
Input: constraint $k$, bound on number of iterations $r$
$S \leftarrow \emptyset$
for $i = 1$ to $r$ do
    $S \leftarrow S \cup$ Sieve$(S, k, i, r)$
return $S$
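As a concrete illustration of this outer loop, here is a minimal Python sketch. The `greedy_block` stand-in for Sieve is purely illustrative (it is a sequential heuristic, not the low-adaptivity subroutine of Algorithm 19), and the toy cut objective is an assumption for the example:

```python
def blits(f, ground_set, k, r, sieve):
    """BLock ITeration sketch: add a block of at most k/r elements,
    chosen by `sieve`, to the solution S for r iterations."""
    S = set()
    for i in range(1, r + 1):
        S |= set(sieve(f, ground_set, S, k, i, r))
    return S

def greedy_block(f, ground_set, S, k, i, r):
    """Illustrative stand-in for Sieve: return the k/r elements with the
    largest marginal contribution (NOT the low-adaptivity Algorithm 19)."""
    candidates = [a for a in ground_set if a not in S]
    candidates.sort(key=lambda a: f(S | {a}) - f(S), reverse=True)
    return candidates[:max(1, k // r)]

# Toy non-monotone submodular objective: the cut function of a small graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
cut = lambda T: sum(1 for u, v in edges if (u in T) != (v in T))
print(blits(cut, range(4), k=2, r=2, sieve=greedy_block))  # → {0, 2}
```

Any subroutine with Sieve's interface can be swapped in for `greedy_block` without changing the outer loop.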

The main challenge is to find in logarithmically many rounds a block of size at most $k/r$ to add to the current solution $S$. Before describing and analyzing the Sieve subroutine, in the following lemma we reduce the problem of showing that Blits obtains a solution of value $\alpha v^\star/e$ to showing that Sieve finds a block with marginal contribution at least $\frac{\alpha}{r}\left(\left(1-\frac{1}{r}\right)^{i-1}v^\star - f(S_{i-1})\right)$ to $S$ at every iteration $i$, where we wish to obtain $v^\star$ close to OPT. The proof generalizes an argument in [Buchbinder et al., 2014].

Lemma 3.2.15. For any $\alpha \in [0, 1]$, assume that at iteration $i$ with current solution $S_{i-1}$, Sieve returns a random set $T_i$ s.t. $\mathbb{E}\left[f_{S_{i-1}}(T_i)\right] \geq \frac{\alpha}{r}\left(\left(1-\frac{1}{r}\right)^{i-1}v^\star - f(S_{i-1})\right)$. Then, $\mathbb{E}[f(S_r)] \geq \frac{\alpha}{e}\cdot v^\star$.

Proof. We show by induction that $\mathbb{E}[f(S_i)] \geq \frac{i\alpha}{r}\left(1-\frac{1}{r}\right)^{i-1}v^\star$. Observe that

$$\begin{aligned}
\mathbb{E}[f(S_i)] &= \mathbb{E}[f(S_{i-1})] + \mathbb{E}\left[f_{S_{i-1}}(T_i)\right] \\
&\geq \mathbb{E}[f(S_{i-1})] + \frac{\alpha}{r}\left(\left(1-\frac{1}{r}\right)^{i-1}v^\star - \mathbb{E}[f(S_{i-1})]\right) \\
&= \left(1-\frac{\alpha}{r}\right)\mathbb{E}[f(S_{i-1})] + \frac{\alpha}{r}\left(1-\frac{1}{r}\right)^{i-1}v^\star \\
&\geq \left(1-\frac{\alpha}{r}\right)\frac{(i-1)\alpha}{r}\left(1-\frac{1}{r}\right)^{i-2}v^\star + \frac{\alpha}{r}\left(1-\frac{1}{r}\right)^{i-1}v^\star && \text{inductive hypothesis} \\
&\geq \frac{i\alpha}{r}\left(1-\frac{1}{r}\right)^{i-1}v^\star. && \alpha \leq 1
\end{aligned}$$

Thus, with $i = r$,

$$\mathbb{E}[f(S_r)] \geq \alpha\left(1-\frac{1}{r}\right)^{r-1}v^\star \geq \frac{\alpha}{e}\cdot v^\star.$$
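The recursion in this proof can be checked numerically. The sketch below iterates the worst case, in which each block contributes exactly the assumed lower bound, and verifies that the final value is still at least $(\alpha/e)v^\star$ (the parameter values are illustrative):

```python
import math

def blits_value_lower_bound(alpha, v_star, r):
    """Iterate f(S_i) = f(S_{i-1}) + (alpha/r)((1 - 1/r)**(i-1) * v_star - f(S_{i-1})),
    i.e. every block contributes exactly the assumed lower bound."""
    value = 0.0
    for i in range(1, r + 1):
        value += (alpha / r) * ((1 - 1 / r) ** (i - 1) * v_star - value)
    return value

# The induction's conclusion E[f(S_r)] >= (alpha/e) v* holds for each r.
for r in (10, 100, 1000):
    assert blits_value_lower_bound(0.5, 1.0, r) >= 0.5 / math.e
```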

The advantage of Blits is that it terminates after $O(d\cdot\log n)$ adaptive rounds when using $r = O(\log n)$ and a Sieve subroutine that is $d$-adaptive. In the next section, we describe Sieve and prove that it respects the conditions of Lemma 3.2.15 in $d = O(\log n)$ rounds.

The Sieve Subroutine

In this section, we describe and analyze the Sieve subroutine. We show that for any constant $\epsilon > 0$, this algorithm finds in $O(\log n)$ rounds a block of at most $k/r$ elements with marginal contribution to $S$ that is at least $t/r$, with $t := \frac{1-\epsilon/2}{2}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S_{i-1})\right)$, when called at iteration $i$ of Blits. By Lemma 3.2.15 with $\alpha = \frac{1-\epsilon/2}{2}$ and $v^\star = (1-\epsilon/2)\text{OPT}$, this implies that Blits obtains an approximation arbitrarily close to $1/2e$ in $O(\log^2 n)$ rounds.

The Sieve algorithm, described formally below, iteratively discards elements from a set $X$ initialized to the ground set $N$. We denote by $\mathcal{U}(X)$ the uniform distribution over all subsets of $X$ of size exactly $k/r$, and by $\Delta(a, S, X)$ the expected marginal contribution of an element $a$ to the union of the current solution $S$ and a random set $R \sim \mathcal{U}(X)$, i.e.,

$$\Delta(a, S, X) := \mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_{S\cup(R\setminus\{a\})}(a)\right].$$

At every iteration, Sieve first pre-processes the surviving elements $X$ to obtain $X^+$, which is the set of elements $a \in X$ with non-negative expected marginal contribution $\Delta(a, S, X)$. After this pre-processing step, Sieve evaluates the marginal contribution $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right]$ of a random set $R \sim \mathcal{U}(X)$ without its elements not in $X^+$ (i.e., $R$ excluding its elements with negative expected marginal contribution). If the marginal contribution of $R\cap X^+$ is at least $t/r$, then $R\cap X^+$ is returned. Otherwise, the algorithm discards from $X$ the elements $a$ with expected marginal contribution $\Delta(a, S, X)$ less than $(1+\epsilon/4)\,t/k$. The algorithm iterates until either $\mathbb{E}\left[f_S(R\cap X^+)\right] \geq t/r$ or there are fewer than $k$ surviving elements, in which case Sieve returns a random set $R\cap X^+$ with $R\sim\mathcal{U}(X)$ and with dummy elements added to $X$ so that $|X| = k$. A dummy element $a$ is an element with $f_S(a) = 0$ for all $S$.

Algorithm 19 Sieve$(S, k, i, r)$
Input: current solution $S$ at outer-iteration $i \leq r$
$X \leftarrow N$, $t \leftarrow \frac{1-\epsilon/2}{2}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$
while $|X| > k$ do
    $X^+ \leftarrow \{a \in X : \Delta(a, S, X) \geq 0\}$
    if $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right] \geq t/r$ then return $R\cap X^+$, where $R\sim\mathcal{U}(X)$
    $X \leftarrow \{a \in X : \Delta(a, S, X) \geq (1+\epsilon/4)\,t/k\}$
$X \leftarrow X \cup \{k - |X|$ dummy elements$\}$
$X^+ \leftarrow \{a \in X : \Delta(a, S, X) \geq 0\}$
return $R\cap X^+$, where $R\sim\mathcal{U}(X)$
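A sampling-based Python sketch of this subroutine is given below. It assumes OPT is known (the full algorithm instead guesses it non-adaptively), estimates the expectations by Monte Carlo rather than exactly, and models dummy elements implicitly by allowing blocks smaller than $k/r$; it is a sketch of the idealized algorithm, not the full implementation analyzed here:

```python
import random

def sieve_sketch(f, ground, S, k, i, r, opt, eps=0.2, samples=100, rng=None):
    """Monte Carlo sketch of Algorithm 19. `opt` is assumed known and
    expectations are estimated from `samples` independent draws."""
    rng = rng or random.Random(0)
    S = set(S)
    X = [a for a in ground if a not in S]
    t = ((1 - eps / 2) / 2) * ((1 - 1 / r) ** (i - 1) * (1 - eps / 2) * opt - f(S))
    block = max(1, k // r)

    def rand_block():
        return rng.sample(X, min(block, len(X)))

    def delta(a):  # estimate of Delta(a, S, X) = E[f_{S ∪ (R \ {a})}(a)]
        total = 0.0
        for _ in range(samples):
            R = set(rand_block()) - {a}
            total += f(S | R | {a}) - f(S | R)
        return total / samples

    while len(X) > k:
        X_plus = {a for a in X if delta(a) >= 0}
        est = sum(f(S | (set(rand_block()) & X_plus)) - f(S)
                  for _ in range(samples)) / samples
        if est >= t / r:
            return [a for a in rand_block() if a in X_plus]
        survivors = [a for a in X if delta(a) >= (1 + eps / 4) * t / k]
        if len(survivors) == len(X):  # guard: the threshold removed nothing
            break
        X = survivors
    X_plus = {a for a in X if delta(a) >= 0}
    return [a for a in rand_block() if a in X_plus]

# Example on the cut function of an 8-node cycle (toy sizes).
edges = [(u, (u + 1) % 8) for u in range(8)]
cut = lambda T: sum(1 for u, v in edges if (u in T) != (v in T))
block = sieve_sketch(cut, range(8), set(), k=4, i=1, r=2, opt=8.0)
assert len(block) <= 2
```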

The above description is an idealized version of the algorithm. In practice, we do not know OPT and we cannot compute expectations exactly. Fortunately, we can apply multiple guesses for OPT non-adaptively and obtain arbitrarily good estimates of the expectations in one round by sampling. The sampling process for the estimates first samples $m$ sets from $\mathcal{U}(X)$, then queries the desired sets to obtain random realizations of $f_S(R\cap X^+)$ and $f_{S\cup(R\setminus\{a\})}(a)$, and finally averages the $m$ random realizations of these values. By standard concentration bounds, $m = O\left((\text{OPT}/\epsilon)^2\log(1/\delta)\right)$ samples are sufficient to obtain, with probability $1-\delta$, an estimate with an $\epsilon$ error. For ease of presentation and notation, we analyze the idealized version of the algorithm, which easily extends to the algorithm with estimates and guesses as in the previous section.
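This one-round estimation step can be sketched as follows; the helper name and the fixed sample budget are illustrative assumptions:

```python
import random

def estimate_expectation(sample_value, m, rng=None):
    """One adaptive round of estimation: average m independent realizations.
    By Hoeffding's bound, for values in [0, OPT] the error is at most eps
    with probability 1 - delta once m = O((OPT/eps)**2 * log(1/delta))."""
    rng = rng or random.Random(0)
    return sum(sample_value(rng) for _ in range(m)) / m

# Example: estimate E[f_S(R)] for R a uniformly random 2-subset of X,
# with f the cut function of a 4-cycle (toy instance).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cut = lambda T: sum(1 for u, v in edges if (u in T) != (v in T))
X = [0, 1, 2, 3]
est = estimate_expectation(lambda rng: cut(set(rng.sample(X, 2))), m=2000)
# True expectation is 16/6 ≈ 2.67: the two "diagonal" pairs cut 4 edges,
# the four adjacent pairs cut 2.
```

All $m$ queries are independent of one another, which is why they count as a single adaptive round.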

The approximation. Our goal is to show that Sieve returns a random block whose expected marginal contribution to $S$ is at least $t/r$. By Lemma 3.2.15, this implies that Blits obtains a $(1-\epsilon)/2e$ approximation.

Lemma 3.2.16. Assume $r \geq 20\rho\epsilon^{-1}$ and that, after at most $\rho - 1$ iterations of Sieve, Sieve returns a set $R$ at iteration $i$ of Blits. Then,

$$\mathbb{E}[f_S(R)] \geq \frac{t}{r} = \frac{1-\epsilon/2}{2r}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right).$$

The remainder of the analysis of the approximation is devoted to the proof of Lemma 3.2.16. First note that if Sieve returns $R\cap X^+$ in the middle of an iteration, then the desired bound on $\mathbb{E}[f_S(R)]$ follows immediately from the condition to return that block. Otherwise, Sieve returns $R$ due to $|X| \leq k$, and then the proof consists of two parts. First, we argue that when Sieve terminates, there exists a subset $T$ of $X$ for which $f_S(T) \geq t$. Then, we prove that such a subset $T$ of $X$ for which $f_S(T) \geq t$ not only exists, but is also returned by Sieve. We do this by proving a new general lemma for non-monotone submodular functions that may be of independent interest. This lemma shows that a random subset of $X$ of size $s$ well approximates the optimal subset of size $s$ in $X$.

Existence of a surviving block with high contribution to S. The main result in this section is Lemma 3.2.20, which shows that when Sieve terminates there exists a subset $T$ of $X$ s.t. $f_S(T) \geq t$. To prove this, we first prove Lemma 3.2.18, which argues that $\mathbb{E}[f(O\cup S)] \geq (1-1/r)^{i-1}\text{OPT}$. This bound explains the $(1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S_{i-1})$ term in $t$. For monotone functions, this is trivial since $f(O\cup S) \geq f(O) = \text{OPT}$ by definition of monotonicity. For non-monotone functions, this inequality does not hold. Instead, the approach used to bound $f(O\cup S)$ is to argue that any element $a\in N$ is added to $S$ by Sieve with probability at most $1/r$ at every iteration. The key to that argument is that in both cases where Sieve terminates we have $|X| \geq k$ (with $X$ possibly containing dummy elements), which implies that every element $a$ is in $R\sim\mathcal{U}(X)$ with probability at most $1/r$. We use the following lemma from Feige et al. [2011], which is useful for non-monotone functions:

Lemma 3.2.17 (Feige et al. [2011]). Let $g : 2^N \to \mathbb{R}$ be a non-negative submodular function. Denote by $A(p)$ a random subset of $A$ where each element appears with probability at most $p$ (not necessarily independently). Then,

$$\mathbb{E}[g(A(p))] \geq (1-p)\,g(\emptyset) + p\cdot g(A) \geq (1-p)\,g(\emptyset).$$
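The guarantee can be sanity-checked exhaustively in the special case where each element of $A$ is included independently with probability exactly $p$, using the non-monotone cut function of a small hypothetical graph:

```python
from itertools import combinations

# Exhaustive check of E[g(A(p))] >= (1 - p) g(empty) + p g(A) when each
# element of A is kept independently with probability exactly p, for the
# non-monotone cut function g of a small (hypothetical) graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

def cut(T):
    return sum(1 for u, v in edges if (u in T) != (v in T))

def expected_cut(A, p):
    """E[g(A(p))] by enumerating every subset of A with its probability."""
    A = sorted(A)
    total = 0.0
    for size in range(len(A) + 1):
        for kept in combinations(A, size):
            total += p ** size * (1 - p) ** (len(A) - size) * cut(set(kept))
    return total

A, p = {0, 2, 3}, 0.3
assert expected_cut(A, p) >= (1 - p) * cut(set()) + p * cut(A) - 1e-12
```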

Lemma 3.2.18. Let $S$ be the set obtained after $i-1$ iterations of Blits calling the Sieve subroutine. Then, $\mathbb{E}[f(O\cup S)] \geq (1-1/r)^{i-1}\text{OPT}$.

Proof. In both cases where Sieve terminates, $|X| \geq k$. Thus $\Pr[a\in R\sim\mathcal{U}(X)] = \frac{k/r}{|X|} \leq \frac{1}{r}$. This implies that at iteration $i$ of Blits, $\Pr[a\in S] \leq 1 - (1-1/r)^{i-1}$. Next, we define $g(T) := f(O\cup T)$, which is also submodular. By Lemma 3.2.17, we get

$$\mathbb{E}[f(S\cup O)] = \mathbb{E}[g(S)] \geq (1-1/r)^{i-1}g(\emptyset) = (1-1/r)^{i-1}\text{OPT}.$$

Let $\rho$, $X_j$, and $R_j$ denote the number of iterations of Sieve$(S, k, i, r)$, the set $X$ at iteration $j \leq \rho$ of Sieve, and the set $R_j \sim \mathcal{U}(X_j)$, respectively. We show that the expected marginal contribution of $O$ to $S\cup\left(\cup_{j=1}^{\rho} R_j\right)$ approximates $(1-1/r)^{i-1}\text{OPT} - f(S)$ well. This crucial fact allows us to argue about the value of optimal elements that survive iterations of Sieve.

Lemma 3.2.19. For all $r, \rho, \epsilon > 0$ s.t. $r \geq 20\rho\epsilon^{-1}$, if Sieve$(S, k, i, r)$ has not terminated after $\rho$ iterations, then

$$\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right] \geq \left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right).$$

Proof. We exploit the fact that if Sieve$(S, k, i, r)$ has not terminated after $\rho$ iterations, then, by the algorithm, the random set $R_j \sim \mathcal{U}(X_j)$ at iteration $j$ of Sieve has expected value that is upper bounded as follows:

$$\mathbb{E}_{R_j}\left[f_S(R_j)\right] < \frac{1-\epsilon/2}{2r}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$$

for all $j \leq \rho$. Next, by subadditivity, we have

$$\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_S\left(\cup_{j=1}^{\rho} R_j\right)\right] \leq \sum_{j=1}^{\rho}\mathbb{E}_{R_j}\left[f_S(R_j)\right].$$

Note that

$$\Pr_{R_j}[a\in R_j] \leq \frac{k/r}{|X_j|} \leq \frac{1}{r}$$

since $|X_j| > k$ during Sieve. Thus, by a union bound,

$$\Pr_{R_1,\ldots,R_\rho}\left[a\in\cup_{j=1}^{\rho} R_j\right] \leq \frac{\rho}{r} \leq \frac{\epsilon}{20}.$$

Next, define $g(T) = f(O\cup S\cup T)$, which is non-negative submodular. Thus, we have

$$\begin{aligned}
\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_S\left(O\cup\left(\cup_{j=1}^{\rho} R_j\right)\right)\right] &= \mathbb{E}_{R_1,\ldots,R_\rho}\left[g\left(\cup_{j=1}^{\rho} R_j\right)\right] - f(S) \\
&\geq \left(1-\frac{\epsilon}{20}\right)g(\emptyset) - f(S) && \text{Lemma 3.2.17} \\
&= \left(1-\frac{\epsilon}{20}\right)f(O\cup S) - f(S) \\
&\geq \left(1-\frac{\epsilon}{20}\right)(1-1/r)^{i-1}\text{OPT} - f(S). && \text{Lemma 3.2.18}
\end{aligned}$$

Combining the above inequalities, we conclude that

$$\begin{aligned}
\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right] &= \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_S\left(O\cup\left(\cup_{j=1}^{\rho} R_j\right)\right)\right] - \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_S\left(\cup_{j=1}^{\rho} R_j\right)\right] \\
&\geq \left(1-\frac{\epsilon}{20}\right)(1-1/r)^{i-1}\text{OPT} - f(S) - \sum_{j=1}^{\rho}\mathbb{E}_{R_j}\left[f_S(R_j)\right] \\
&\geq \left(1-\frac{\epsilon}{20}\right)(1-1/r)^{i-1}\text{OPT} - f(S) - \frac{\rho(1-\epsilon/2)}{2r}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right) \\
&\geq \left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right).
\end{aligned}$$

We are now ready to show that when Sieve terminates after $\rho$ iterations, there exists a subset $T$ of $X_\rho$ s.t. $f_S(T) \geq t$. At a high level, the proof defines $T$ to be a set of meaningful optimal elements, then uses Lemma 3.2.19 to show that these elements survive $\rho$ iterations of Sieve and respect $f_S(T) \geq t$.

Lemma 3.2.20. For all $r, \rho, \epsilon > 0$, if $r \geq 20\rho\epsilon^{-1}$, then there exists $T \subseteq X_\rho$ that survives $\rho$ iterations of Sieve$(S, k, i, r)$ and that satisfies $f_S(T) \geq \frac{1-\epsilon/10}{2}\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$.

Proof. At a high level, the proof first defines a subset $T$ of the optimal solution $O$. Then, the remainder of the proof consists of two main parts. First, we show that elements in $T$ survive $\rho$ iterations of Sieve$(S, k, i, r)$. Then, we show that $f_S(T) \geq \frac{1}{2}\left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$. We introduce some notation. Let $O = \{o_1,\ldots,o_k\}$ be the optimal elements in some arbitrary order and $O_\ell = \{o_1,\ldots,o_\ell\}$. We define the following marginal contribution $\Delta_\ell$ of optimal element $o_\ell$:

$$\Delta_\ell := \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup O_{\ell-1}\cup((\cup_{j=1}^{\rho} R_j)\setminus\{o_\ell\})}(o_\ell)\right].$$

We define $T$ to be the set of optimal elements $o_\ell$ such that $\Delta_\ell \geq \frac{1}{2}\Delta$, where

$$\Delta := \frac{1}{k}\cdot\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right].$$

We first argue that elements in $T$ survive $\rho$ iterations of Sieve$(S, k, i, r)$. For element $o_\ell \in T$, we have

$$\Delta_\ell \geq \frac{\Delta}{2} = \frac{1}{2k}\cdot\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right] \geq \frac{1}{2k}\left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right).$$

Thus, at iteration $j \leq \rho$, by submodularity,

$$\mathbb{E}_{R_j}\left[f_{S\cup(R_j\setminus\{o_\ell\})}(o_\ell)\right] \geq \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup O_{\ell-1}\cup((\cup_{j=1}^{\rho} R_j)\setminus\{o_\ell\})}(o_\ell)\right] = \Delta_\ell$$

$$\geq \frac{1}{2k}\left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right) \geq \frac{1-\epsilon/2}{2k}(1+\epsilon/4)\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right) = (1+\epsilon/4)\frac{t}{k},$$

and $o_\ell$ survives all iterations $j \leq \rho$, for all $o_\ell \in T$. Next, note that

$$\sum_{\ell=1}^{k} \Delta_\ell \geq \mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right] = \Delta k.$$

Next, observe that

$$\sum_{\ell=1}^{k} \Delta_\ell = \sum_{o_\ell\in T}\Delta_\ell + \sum_{o_\ell\in O\setminus T}\Delta_\ell \leq \sum_{o_\ell\in T}\Delta_\ell + \frac{k\Delta}{2}.$$

By combining the two inequalities above, we get $\sum_{o_\ell\in T}\Delta_\ell \geq \frac{k\Delta}{2}$. Thus, by submodularity,

$$f_S(T) \geq \sum_{o_\ell\in T} f_{S\cup O_{\ell-1}}(o_\ell) \geq \sum_{o_\ell\in T}\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup O_{\ell-1}\cup((\cup_{j=1}^{\rho} R_j)\setminus\{o_\ell\})}(o_\ell)\right] = \sum_{o_\ell\in T}\Delta_\ell \geq \frac{k\Delta}{2}.$$

We conclude that

$$f_S(T) \geq \frac{k\Delta}{2} = \frac{1}{2}\cdot\mathbb{E}_{R_1,\ldots,R_\rho}\left[f_{S\cup(\cup_{j=1}^{\rho} R_j)}(O)\right] \geq \frac{1}{2}\left(1-\frac{\epsilon}{10}\right)\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right).$$

A random subset approximates the best surviving block. In the previous part of the analysis, we showed the existence of a surviving set $T$ with contribution at least $\frac{1-\epsilon/10}{2}\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$ to $S$. In this part, we show that the random set $R\cap X^+$, with $R\sim\mathcal{U}(X)$, is a $1/r$ approximation to any surviving set $T\subseteq X^+$ when $|X| = k$. A key component of the algorithm for this argument to hold for non-monotone functions is the final pre-processing step that restricts $X$ to $X^+$ after adding dummy elements. We use this restriction to argue that every element $a\in R\cap X^+$ must contribute a non-negative expected value to the set returned.

Lemma 3.2.21. Assume Sieve returns $R\cap X^+$ with $R\sim\mathcal{U}(X)$ and $|X| = k$. For any $T\subseteq X^+$, we have $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right] \geq f_S(T)/r$.

Proof. Let $T\subseteq X^+$. First note that

$$\mathbb{E}\left[f_S(R\cap X^+)\right] = \mathbb{E}\left[f_S((R\cap X^+)\cap T)\right] + \mathbb{E}\left[f_{S\cup((R\cap X^+)\cap T)}\left((R\cap X^+)\setminus T\right)\right] = \mathbb{E}\left[f_S(R\cap T)\right] + \mathbb{E}\left[f_{S\cup(R\cap T)}\left((R\cap X^+)\setminus T\right)\right],$$

where the second equality is due to the fact that $T\subseteq X^+$. We first bound $\mathbb{E}[f_S(R\cap T)]$. Let $T = \{a_1,\ldots,a_\ell\}$ be some arbitrary ordering of the elements in $T$ and define $T_i = \{a_1,\ldots,a_i\}$.

Then,

$$\begin{aligned}
\mathbb{E}[f_S(R\cap T)] &= \mathbb{E}\left[\sum_{a_i\in R\cap T} f_{S\cup(R\cap T_{i-1})}(a_i)\right] \geq \mathbb{E}\left[\sum_{a_i\in R\cap T} f_{S\cup T_{i-1}}(a_i)\right] && \text{submodularity} \\
&= \mathbb{E}\left[\sum_{a_i\in T}\mathbb{1}_{a_i\in R}\cdot f_{S\cup T_{i-1}}(a_i)\right] = \sum_{a_i\in T} f_{S\cup T_{i-1}}(a_i)\cdot\mathbb{E}\left[\mathbb{1}_{a_i\in R}\right] \\
&= \frac{1}{r}\sum_{a_i\in T} f_{S\cup T_{i-1}}(a_i) = \frac{1}{r}f_S(T),
\end{aligned}$$

where we used that $\Pr[a_i\in R] = \frac{k/r}{|X|} = 1/r$. Next, we bound $\mathbb{E}\left[f_{S\cup(R\cap T)}\left((R\cap X^+)\setminus T\right)\right]$. Similarly as in the previous case, assume that $X^+ = \{a_1,\ldots,a_\ell\}$ is some arbitrary ordering of the elements in $X^+$ and define $X_i^+ = \{a_1,\ldots,a_i\}$. Observe that

$$\begin{aligned}
\mathbb{E}\left[f_{S\cup(R\cap T)}\left((R\cap X^+)\setminus T\right)\right] &= \mathbb{E}\left[\sum_{a_i\in(R\cap X^+)\setminus T} f_{S\cup(R\cap T)\cup((R\cap X_{i-1}^+)\setminus T)}(a_i)\right] \\
&\geq \mathbb{E}\left[\sum_{a_i\in(R\cap X^+)\setminus T} f_{S\cup(R\setminus\{a_i\})}(a_i)\right] && \text{submodularity} \\
&= \mathbb{E}\left[\sum_{a_i\in X^+\setminus T}\mathbb{1}_{a_i\in R}\cdot f_{S\cup(R\setminus\{a_i\})}(a_i)\right] \\
&= \sum_{a_i\in X^+\setminus T}\mathbb{E}\left[\mathbb{1}_{a_i\in R}\cdot f_{S\cup(R\setminus\{a_i\})}(a_i)\right] \\
&= \sum_{a_i\in X^+\setminus T}\Pr[a_i\in R]\cdot\mathbb{E}\left[f_{S\cup(R\setminus\{a_i\})}(a_i)\,\middle|\,a_i\in R\right] \\
&\geq \sum_{a_i\in X^+\setminus T}\Pr[a_i\in R]\cdot\mathbb{E}\left[f_{S\cup(R\setminus\{a_i\})}(a_i)\right] && \text{submodularity} \\
&\geq \sum_{a_i\in X^+\setminus T}\Pr[a_i\in R]\cdot 0 \geq 0. && a_i\in X^+
\end{aligned}$$

We conclude that $\mathbb{E}\left[f_S(R\cap X^+)\right] \geq \frac{1}{r}f_S(T)$.

There is a tradeoff, controlled by $t$, between the contribution $f_S(T)$ of the best surviving set $T$ and the contribution of a random set $R\cap X^+$ returned in the middle of an iteration, due to the thresholds $(1+\epsilon/4)\,t/k$ and $t/r$. The optimization of this tradeoff explains the $(1-\epsilon/2)/2$ term in $t$.

Proof of main lemma.

Proof of Lemma 3.2.16. There are two cases. If Sieve returns $R\cap X^+$ in the middle of an iteration, then by the condition to return that set, $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right] \geq t/r = \frac{1-\epsilon/2}{2r}\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$. Otherwise, Sieve returns $R\cap X^+$ with $|X| = k$. By Lemma 3.2.20, there exists $T\subseteq X_\rho$ that survives $\rho$ iterations of Sieve s.t. $f_S(T) \geq \frac{1-\epsilon/10}{2}\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$. Since there are at most $\rho - 1$ iterations of Sieve, $T$ survives each iteration and the final pre-processing. This implies that $T\subseteq X^+$ when the algorithm terminates. By Lemma 3.2.21, we then conclude that $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right] \geq f_S(T)/r \geq \frac{1-\epsilon/10}{2r}\left((1-1/r)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right) \geq t/r$.

The adaptivity of Sieve is $O(\log n)$

We now observe that the number of iterations of Sieve is $O(\log n)$. This logarithmic adaptivity is due to the fact that Sieve either returns a random set or discards a constant fraction of the surviving elements at every iteration. The pre-processing step to obtain $X^+$ is crucial to argue that, since a random subset $R\cap X^+$ has contribution below the $t/r$ threshold and since all elements in $X^+$ have non-negative marginal contributions, there exists a large set of elements in $X^+$ with expected marginal contribution to $S\cup R$ that is below the $(1+\epsilon/4)\,t/k$ threshold.

Lemma 3.2.22. Let $X_j$ and $X_{j+1}$ be the surviving elements $X$ at the start and end of iteration $j$ of Sieve$(S, k, i, r)$. For all $S\subseteq N$ and $r, j, \epsilon > 0$, if Sieve$(S, k, i, r)$ does not terminate at iteration $j$, then $|X_{j+1}| < |X_j|/(1+\epsilon/4)$.

Proof. At a high level, since the surviving elements must have high value while a random set has low value, the two thresholds bound how many elements can survive the iteration. To do so, we focus on the value $f(R_j\cap X_{j+1})$ of the surviving elements $X_{j+1}$ in a random set $R_j\sim\mathcal{U}(X_j)$. We denote by $\{a_1,\ldots,a_\ell\}$ the elements in $R_j\cap X_j^+$. Observe that

$$\begin{aligned}
\mathbb{E}\left[f_S(R_j\cap X_j^+)\right] &= \mathbb{E}\left[\sum_{i=1}^{\ell} f_{S\cup\{a_1,\ldots,a_{i-1}\}}(a_i)\right] \\
&\geq \mathbb{E}\left[\sum_{a\in R_j\cap X_j^+} f_{S\cup(R_j\setminus\{a\})}(a)\right] && \text{submodularity} \\
&= \mathbb{E}\left[\sum_{a\in X_j^+}\mathbb{1}_{a\in R_j}\cdot f_{S\cup(R_j\setminus\{a\})}(a)\right] \\
&= \sum_{a\in X_j^+}\mathbb{E}\left[\mathbb{1}_{a\in R_j}\cdot f_{S\cup(R_j\setminus\{a\})}(a)\right] \\
&= \sum_{a\in X_j^+}\Pr[a\in R_j]\cdot\mathbb{E}\left[f_{S\cup(R_j\setminus\{a\})}(a)\,\middle|\,a\in R_j\right] \\
&\geq \sum_{a\in X_j^+}\Pr[a\in R_j]\cdot\mathbb{E}\left[f_{S\cup(R_j\setminus\{a\})}(a)\right]. && \text{submodularity}
\end{aligned}$$

Then, for $a\in X_j^+$, either $a$ survives this iteration $j$ and $a\in X_{j+1}$, or it is filtered and $a\in X_j\setminus X_{j+1}$. If $a\in X_{j+1}$, then $\mathbb{E}\left[f_{S\cup(R_j\setminus\{a\})}(a)\right] \geq (1+\epsilon/4)\,t/k$ by the algorithm, and $\Pr[a\in R_j] = \frac{k}{r|X_j|}$ by the definition of $\mathcal{U}(X_j)$. Thus,

$$\Pr[a\in R_j]\cdot\mathbb{E}\left[f_{S\cup(R_j\setminus\{a\})}(a)\right] \geq \frac{1}{r|X_j|}\cdot(1+\epsilon/4)\,t.$$

If $a\in X_j\setminus X_{j+1}$, then $\mathbb{E}\left[f_{S\cup(R_j\setminus\{a\})}(a)\right] \geq 0$ by the definition of $X_j^+$. Putting all the previous pieces together, we get

$$\mathbb{E}\left[f_S(R_j\cap X_j^+)\right] \geq |X_{j+1}|\cdot\frac{1}{r|X_j|}\cdot(1+\epsilon/4)\,t.$$

Next, since elements are discarded, a random set must have low value by the algorithm: $t/r \geq \mathbb{E}\left[f_S(R_j\cap X_j^+)\right]$. Finally, by combining the above inequalities, we conclude that $|X_{j+1}| \leq |X_j|/(1+\epsilon/4)$.

Main Result for Blits

Theorem 3.2.6. For any constant $\epsilon > 0$, Blits initialized with $r = 20\epsilon^{-1}\log_{1+\epsilon/4}(n)$ is $O\left(\log^2 n\right)$-adaptive and obtains a $\frac{1-\epsilon}{2e}$ approximation.

Proof. By Lemma 3.2.16, we have $\mathbb{E}[f_S(R)] \geq \frac{1-\epsilon/2}{2}\cdot\frac{1}{r}\left(\left(1-\frac{1}{r}\right)^{i-1}(1-\epsilon/2)\text{OPT} - f(S)\right)$. Thus, by Lemma 3.2.15 with $\alpha = \frac{1-\epsilon/2}{2}$ and $v^\star = (1-\epsilon/2)\text{OPT}$, Blits returns $S$ that satisfies $\mathbb{E}[f(S)] \geq \frac{1-\epsilon/2}{2e}\cdot(1-\epsilon/2)\text{OPT} \geq \frac{1-\epsilon}{2e}\cdot\text{OPT}$. For adaptivity, note that each iteration of Sieve has two adaptive rounds: one for $\Delta(a, S, X)$ for all $a\in N$ and one for $\mathbb{E}_{R\sim\mathcal{U}(X)}\left[f_S(R\cap X^+)\right]$. Since $|X|$ decreases by a $1+\epsilon/4$ factor at every iteration of Sieve, every call to Sieve has at most $\log_{1+\epsilon/4}(n)$ iterations. Finally, as there are $r = 20\epsilon^{-1}\log_{1+\epsilon/4}(n)$ iterations of Blits, the adaptivity is $O\left(\log^2 n\right)$.

Experiments

Our goal in this section is to show that, beyond its provable guarantees, Blits performs well in practice across a variety of application domains. Specifically, we are interested in showing that, despite the fact that the parallel running time of our algorithm is smaller by several orders of magnitude than that of any known algorithm for maximizing non-monotone submodular functions under a cardinality constraint, the quality of its solutions is consistently competitive with or superior to those of state-of-the-art algorithms for this problem. To do so, we conduct two sets of experiments where the goal is to solve the problem of $\max_{S : |S|\leq k} f(S)$ given a function $f$ that is submodular and non-monotone. In the first set of experiments, we test our algorithm on the classic max-cut objective evaluated on graphs generated by various random graph models. In the second set of experiments, we apply our algorithm to a max-cut objective on a new road network dataset, and we also benchmark it on the three objective functions and datasets used in [Mirzasoleiman et al., 2016]. In each set of experiments, we compare the quality of solutions found by our algorithm to those found by several alternative algorithms.

Experiment set I: cardinality-constrained max-cut on synthetic graphs. Given an undirected graph $G = (N, E)$, recall that the cut induced by a set of nodes $S\subseteq N$, denoted $C(S)$, is the set of edges that have one endpoint in $S$ and the other in $N\setminus S$. The cut function $f(S) = |C(S)|$ is a quintessential example of a non-monotone submodular function. To study the performance of our algorithm on different cut functions, we use four well-studied random graph models that yield cut functions with different properties. For each of these graphs, we run the algorithms from Section 3.2.4 to solve $\max_{S : |S|\leq k} |C(S)|$ for different $k$:

• Erdős–Rényi. We construct a $G(n, p)$ graph with $n = 1000$ nodes and $p = 1/2$. We set $k = 700$. Since each node's degree is drawn from a binomial distribution, many nodes will have a similar marginal contribution to the cut function, and a random set $S$ may perform well.

• Stochastic block model. We construct an SBM graph with 7 disconnected clusters of 30 to 120 nodes and a high ($p = 0.8$) probability of an edge within each cluster. We set $k = 360$. Unlike for $G(n, p)$, here we expect a set $S$ to achieve high value only by covering all of the clusters.

• Barabási–Albert. We create a graph with n = 500 and m = 100 edges added per iteration. We set k = 333. We expect that a relatively small number of nodes will have

[Figure: four panels (ER Graph, SBM Graph, BA Graph, Configuration Graph) plotting solution value f(S) against rounds for each algorithm.]

Figure 3.6: Experiments Set 1: Random Graphs. Performance of Blits (red) and Blits+ (blue) versus RandomGreedy (yellow), P-Fantom (green), Greedy (dark blue), and Random (purple).

high degree in this model, so a set S consisting of these nodes will have much greater value than a random set.

• Configuration model. We generate a configuration model graph with n = 500 and a power law degree distribution with exponent 2. We set k = 333. Although configuration model graphs are similar to Barabási–Albert graphs, their high degree nodes are not connected to each other, and thus greedily adding these high degree nodes to S is a good heuristic.
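The cut objective shared by all four experiments is simple to evaluate. The following sketch (the edge-list representation and helper name are our own, not from the experimental code) illustrates it on a toy G(n, p) graph and shows why the objective is non-monotone:

```python
import random

def cut_value(edges, S):
    """Value of the cut induced by S: the number of edges with exactly
    one endpoint in S (the non-monotone submodular objective f)."""
    S = set(S)
    return sum(1 for u, v in edges if (u in S) != (v in S))

# Toy G(n, p) instance in the spirit of the Erdos-Renyi experiment.
random.seed(0)
n, p = 20, 0.5
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if random.random() < p]

# The empty set and the full vertex set both cut nothing, which is
# exactly why this objective is non-monotone.
assert cut_value(edges, []) == 0
assert cut_value(edges, range(n)) == 0
assert cut_value(edges, range(n // 2)) > 0
```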

Experiment set II: performance benchmarks on real data. To measure the performance of Blits on real data, we use it to optimize four different objective functions, each on a different dataset. Specifically, we consider a traffic monitoring application as well as three additional applications introduced and experimented with in [Mirzasoleiman et al., 2016]: image summarization, movie recommendation, and revenue maximization. We note that while these applications are sometimes modeled with monotone objectives, there are many advantages to using non-monotone objectives (see [Mirzasoleiman et al., 2016]). We briefly describe these objective functions and data here and provide additional details later on in this section.

• Traffic monitoring. Consider an application where a government has a budget to build a fixed set of monitoring locations to monitor the traffic that enters or exits a region via its transportation network. Here, the goal is not to monitor traffic circulating within the network, but rather to choose a set of locations (or nodes) such that the volume of traffic entering or exiting via this set is maximal. To accomplish this, we optimize a cut function defined on the weighted transportation network. More precisely, we seek to solve max_{S : |S| ≤ k} f(S), where f(S) is the sum of weighted edges (e.g. traffic counts between two points) that have one endpoint in S and another in N \ S. To conduct an experiment for this application, we reconstruct California's highway transportation network using data from the CalTrans PeMS system [CalTrans], which provides real-time traffic counts at over 40,000 locations on California's highways. At the end of this section, we detail this network reconstruction. The result is a directed network in which nodes are locations along each direction of travel on each highway and edges are the total count of vehicles that passed between adjacent locations in April 2018. We set k = 300 for this experiment.

• Image summarization. Here we must select a subset to represent a large, diverse set of images. This experiment uses 500 randomly chosen images from the 10K Tiny Images dataset [Krizhevsky and Hinton, 2009] with k = 80. We measure how well an image represents another by their cosine similarity.

• Movie recommendation. Here our goal is to recommend a diverse short list S of movies for a user based on her ratings of movies she has already seen. We conduct this experiment on a randomly selected subset of 500 movies from the MovieLens dataset [Harper and Konstan, 2015] of 1 million ratings by 6000 users on 4000 movies, with k = 200. Following [Mirzasoleiman et al., 2016], we define the similarity of one movie to another as the inner product of their raw movie ratings vectors.

• Revenue maximization. Here we choose a subset of k = 100 users in a social network to receive a product for free in exchange for advertising it to their network neighbors,

[Figure: four panels (Image Summarization, Movie Recommendation, Revenue Maximization on YouTube, Highway Network Objective) plotting solution value f(S) against rounds for each algorithm.]

Figure 3.7: Experiments Set 2: Real Data. Performance of Blits (red) and Blits+ (blue) versus RandomGreedy (yellow), P-Fantom (green), Greedy (dark blue), and Random (purple).

and the goal is to choose users in a manner that maximizes revenue. We conduct this experiment on 25 randomly selected communities (∼1000 nodes) from the 5000 largest communities in the YouTube social network [Feldman et al., 2015], and we randomly assign edge weights from U(0, 1).

Algorithms. We implement a version of Blits exactly as described in this paper as well as a slightly modified heuristic, Blits+. The only difference is that whenever a round of samples has marginal value exceeding the threshold, Blits+ adds the highest marginal value sample to its solution instead of a randomly chosen sample. Blits+ does not have any approximation guarantees but slightly outperforms Blits in practice. We compare these algorithms to several benchmarks:

• RandomGreedy. This algorithm adds an element chosen u.a.r. from the k elements with the greatest marginal contribution to f(S) at each round. It is a 1/e approximation for non-monotone objectives and terminates in k adaptive rounds [Buchbinder et al., 2014].

• P-Fantom. P-Fantom is a parallelized version of the Fantom algorithm in [Mirzasoleiman et al., 2016]. Fantom is the current state-of-the-art algorithm for non-monotone submodular objectives, and its main advantage is that it can maximize a non-monotone submodular function subject to a variety of intersecting constraints that are far more general than cardinality constraints. The parallel version, P-Fantom, requires O(k) rounds and gives a 1/6 − ε approximation.

We also compare our algorithm to two reasonable heuristics:

• Greedy. Greedy iteratively adds the element with the greatest marginal contribution at each round. It is k-adaptive and may perform arbitrarily poorly for non-monotone functions.

• Random. This algorithm merely returns a randomly chosen set of k elements. It performs arbitrarily poorly in the worst case but requires 0 adaptive rounds.
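For concreteness, RandomGreedy admits a compact implementation. The sketch below is our own simplification against a generic set function f (the dummy-element padding used in the analysis of Buchbinder et al. is omitted), not the implementation used in the experiments:

```python
import random

def random_greedy(f, ground_set, k, seed=0):
    """RandomGreedy sketch: in each of k rounds, add an element chosen
    u.a.r. among the k elements of largest marginal contribution."""
    rng = random.Random(seed)
    S = set()
    for _ in range(k):
        candidates = [a for a in ground_set if a not in S]
        if not candidates:
            break
        # Rank remaining elements by marginal contribution to S.
        candidates.sort(key=lambda a: f(S | {a}) - f(S), reverse=True)
        S.add(rng.choice(candidates[:k]))
    return S

# Example: a coverage-style submodular function over subsets of integers.
universe = [{0, 1}, {1, 2}, {2, 3}, {3, 4}, {0, 4}]
f = lambda S: len(set().union(*(universe[i] for i in S))) if S else 0
S = random_greedy(f, range(len(universe)), k=2)
assert len(S) == 2
```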

Experimental results. For each experiment, we analyze the value of the algorithms’ solutions over successive rounds (Fig. 3.6 and 3.7). The results support four conclusions.

First, Blits and/or Blits+ nearly always found solutions whose value matched or exceeded those of Fantom and RandomGreedy, the two alternatives we consider that offer approximation guarantees for non-monotone objectives. This also implies that Blits found solutions with value far exceeding its own approximation guarantee, which is less than that of RandomGreedy. Second, our algorithms also performed well against the top-performing algorithm, Greedy. Note that Greedy's solutions decrease in value after some number of rounds, as Greedy continues to add the element with the highest marginal contribution each round even when only negative elements remain. While Blits's solutions were slightly eclipsed by the maximum value found by Greedy in five of the eight experiments, our algorithms matched Greedy on Erdős–Rényi graphs, image summarization, and movie recommendation. Third, our algorithms achieved these high values despite the fact that their solutions S contained ∼10-15% fewer than k elements, as they removed negative elements before adding blocks to S at each round. This means that they could have actually achieved even higher values in each experiment if we had allowed them to run until |S| = k elements. Finally, we note that Blits achieved this performance in many fewer adaptive rounds than alternative algorithms. Here, it is also worth noting that for all experiments, we initialized Blits to use only 30 samples of size k/r per round, far fewer than the theoretical requirement necessary to fulfill its approximation guarantee. We therefore conclude that in practice, Blits's superior adaptivity does not come at a high price in terms of sample complexity.

Experiments and implementation details. Here we provide details regarding the data, objective functions, and implementations for our experiments.

California highway network experiment. Consider the following set of problems: a state authority must choose where to build freeway stations to measure the commodities that are imported or exported from the state; a country bordering a disease epidemic must decide which border entry points will receive extra equipment to monitor the health of those who wish to enter; a law enforcement agency is charged with deciding where to deploy roadblocks so as to maximize the chance of arresting a suspect fleeing to the border. These problems and many others share two aspects in common: First, success depends on choosing a set of locations (or nodes) on a transportation network such that the volume of traffic entering or exiting via this set is maximal. Note that this objective differs starkly from classic vertex cover objectives, as in our cases we are not interested in commodities/people circulating within the state, so the volume of traffic (edge weight) moving between the chosen set of monitoring locations provides no value. Second, government authorities often have fixed project budgets (or, in the case of law enforcement, a fixed number of officers) and monitoring equipment carries fixed costs regardless of where it is installed. Therefore, these problems are accurately modeled via a cardinality constraint on the number of monitoring locations (nodes). We thus propose that these problems can be modeled by applying the cardinality-constrained max-cut objective to a directed, weighted transportation network.

Figure 3.8: (left) Raw metadata latitude and longitude plotted for ∼40,000 traffic counting monitors along CA highways; and (right) inferred highway network (each highway plotted in a different color).

Specifically, we seek to solve max_{S : |S| ≤ k} f(S), where f(S) is the sum of weighted edges (e.g. traffic counts between two points) that have one endpoint in S and another in N \ S.

Road network reconstruction. We reconstruct California's highway and freeway transportation network using data from the California Department of Transportation's (CalTrans) PeMS system [CalTrans], which provides real-time traffic counts at over 40,000 locations along California's highways. Specifically, PeMS reports real-time traffic counts and metadata for each monitoring location (see Fig. 3.8), but does not provide network information such as which stations form a sequence along each highway, and only a subset of monitoring stations adjacent to highway intersections are noted as such. Therefore, we cross-reference each PeMS traffic monitoring station's latitude, longitude, highway, and directional metadata with the Google Maps Location API to infer this network. The result is the network plotted on the right side of Fig. 3.8. Specifically, nodes in this network are locations along each direction of travel on each highway and directed edges are the total count of vehicles that passed between adjacent locations for the month of April 2018. Finally, we use the Google API to infer the location of missing edges representing highway intersections, and

we impute these edges' respective edge weights using a simple linear model trained on the highway intersection traffic counts that are present in the data. In the spirit of our proposed applications, we restrict our network to the 22 highways comprising 3932 traffic monitors in LA and Ventura County, and we restrict our solution to nodes within a 10-mile radius of the Los Angeles network center.

Figure 3.9: The first 25 highway monitoring locations (red points) that Blits adds to its solution. Colored lines represent each of the 22 highways in the highway network inferred from metadata on the 3932 PeMS monitoring stations in Los Angeles and Ventura County.

Fig. 3.9 plots the first 25 highway monitoring locations chosen by Blits against this highway network.

Image, movie, and YouTube experiments. Image summarization experiment. In the image summarization application, we are given a large collection X of images and we must select a small representative subset S. Following [Mirzasoleiman et al., 2016], our goal is to choose our representative subset of images so that at least one image in this subset is similar to each image in the full collection, but the subset itself is diverse. These concerns inform the first and second terms of the non-monotone submodular objective function they propose:

f(S) = \sum_{i \in X} \max_{j \in S} s_{i,j} - \frac{1}{|X|} \sum_{j \in S} \sum_{k \in S} s_{j,k} \qquad (3.1)

where s_{i,j} represents the cosine similarity of image i to image j.

Image data. As in [Mirzasoleiman et al., 2016], we maximize this objective function on a randomly selected collection of 500 images from the Tiny Images 'test set' data [Krizhevsky and Hinton, 2009], where each image is a 32 by 32 pixel RGB image.
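A direct, unoptimized evaluation of objective (3.1) can be sketched as follows; the function name and the toy similarity matrix are illustrative assumptions, not part of the experimental code:

```python
def image_summary_value(sim, S):
    """Objective (3.1): a coverage term (each image is represented by its
    most similar selected image) minus a diversity penalty on S.
    sim[i][j] holds the cosine similarity s_{i,j}."""
    n = len(sim)
    if not S:
        return 0.0
    coverage = sum(max(sim[i][j] for j in S) for i in range(n))
    penalty = sum(sim[j][k] for j in S for k in S) / n
    return coverage - penalty

# Tiny 3-image similarity matrix (illustrative values only): images
# 0 and 1 are near-duplicates, image 2 is dissimilar to both.
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
# Picking two dissimilar images scores higher than two near-duplicates.
assert image_summary_value(sim, {0, 2}) > image_summary_value(sim, {0, 1})
```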

Movie recommendation experiment. The goal of a movie recommendation system is to recommend a diverse short list S of movies that are likely to be highly rated by a user based on the ratings she has assigned to movies she has already seen. [Mirzasoleiman et al., 2016] propose that this goal can be translated into the following non-monotone submodular objective function:

f(S) = \sum_{i \in S} \sum_{j \in X} s_{i,j} - 0.95 \sum_{j \in S} \sum_{k \in S} s_{j,k} \qquad (3.2)

where X is the set of all movies and s_{i,j} is a measure of the similarity between movies i and j.

Movie data. Following [Mirzasoleiman et al., 2016], we optimize this objective function on a randomly selected set of 500 movies from the MovieLens 1M dataset [Harper and Konstan, 2015], which contains 1 million ratings by 6000 users on 4000 movies. Because each user has rated only a small subset of the movies, we adopt the standard approach and use low-rank matrix completion to infer each user's rating for movies she has not seen in a manner that is consistent with her observed ratings. As in [Mirzasoleiman et al., 2016], we

use this completed ratings matrix to compute the movie similarity measure s_{i,j} by setting s_{i,j} equal to the inner product of the columns of raw movie ratings of movies i and j.

Revenue maximization experiment. [Mirzasoleiman et al., 2016] also consider a variant of the influence maximization problem. Here, we can choose a subset of k users of a social network who will receive a product for free in exchange for advertising it to their network neighbors, and the goal is to choose these users in a manner that maximizes revenue. More precisely, if we select the set S of users to receive the product for free, then our revenue can be modeled via the following submodular objective function:

f(S) = \sum_{i \in X \setminus S} \sqrt{\sum_{j \in S} w_{i,j}} \qquad (3.3)

where X is the set of all users (nodes) in the network and w_{i,j} is the network edge weight between users i and j. Note that unlike the other objective functions, eqn. 3.3 is monotone.
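Objective (3.3) is equally direct to evaluate. In this sketch (our own; the sparse dictionary representation of the weights is an assumption), each unselected user contributes the square root of her total edge weight into the seeded set S:

```python
from math import sqrt

def revenue(w, X, S):
    """Objective (3.3): each non-seeded user i contributes the square
    root of the total edge weight connecting i to the seeded set S.
    w stores each undirected edge weight once, keyed by an ordered pair."""
    S = set(S)
    return sum(sqrt(sum(w.get((i, j), 0.0) + w.get((j, i), 0.0) for j in S))
               for i in X if i not in S)

# Toy 4-user network with weights stored once per undirected edge.
w = {(0, 1): 0.5, (0, 2): 0.9, (1, 2): 0.4, (2, 3): 0.7}
X = range(4)
assert revenue(w, X, set()) == 0.0
assert revenue(w, X, {2}) > 0.0
```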

Revenue maximization data. As in [Mirzasoleiman et al., 2016], we conduct this experiment on social network data from the 5000 largest communities of the YouTube social network, which are comprised of 39,841 nodes and 224,234 undirected edges [Feldman et al., 2015]. Because edges in this data are unweighted, we randomly assign each edge a weight by drawing from the uniform distribution U(0, 1). We then optimize the revenue function on a randomly selected subset of 25 communities (∼1000 nodes).

Implementation details. The 8 plots in Fig. 3.6 and 3.7 comparing the performance of

Blits and Blits+ to alternatives show typical performance. Specifically, with the exception of the plots depicting results for the movie recommendation and revenue maximization experiments, each plot was generated from a single run of each algorithm. For these runs,

Blits and Blits+ were initialized with r = 10, ε = 0.3, and OPT set to the maximum value of a solution found by Greedy. This means that in practice, one could achieve higher values with Blits and Blits+ by running each algorithm multiple times in parallel (for which we do not incur an adaptivity cost) and picking the highest value S. For

the movie recommendation and revenue maximization experiments, we noted that Blits was more sensitive to the OPT parameter. On the movie recommendation experiment, we noted a significant increase in the value of the solution S returned by Blits as OPT was decreased from f(S_Greedy) to 0.7 f(S_Greedy). On the revenue maximization experiment, we noted a significant increase in the value of the solution S returned by Blits as r was reduced to 5 and OPT was increased from f(S_Greedy) to 3.5 f(S_Greedy), as this resulted in a more aggressive application of Sieve. After exploring this behavior, we therefore set OPT to these better performing values and conducted one run of Blits each to produce the data for the movie recommendation and revenue maximization experiment plots.

Finally, we note that the plot lines for P-Fantom are less smooth than those for the other algorithms because, unlike the other algorithms, P-Fantom does not construct a solution set S only by adding elements to S. Specifically, P-Fantom works by iteratively building S by calling a version of Greedy in which the element with the highest marginal contribution is only added to S if its marginal value exceeds one of |X| cleverly chosen thresholds, then paring S by testing whether some subset of S achieves higher value than S itself. Therefore, unlike for the other algorithms we consider, running P-Fantom with a given constraint k does not provide us with a single value of its partial solution in each round. Because it would be computationally costly to run P-Fantom for all of these constraints (i.e. for all k̂ ≤ k) in order to generate values to plot against the other algorithms, we instead chose 10 equally spaced values in the interval 0 < k̂ ≤ k and ran P-Fantom once for each.

3.2.5 Matroid Constraints

For the fundamental problem of maximizing a monotone submodular function under a general matroid constraint, it is well known since the late 70s that the greedy algorithm achieves a 1/2 approximation [Nemhauser et al., 1978] and that, even for the special case of a cardinality constraint, no algorithm can obtain an approximation guarantee better than 1 − 1/e using polynomially-many value queries [Nemhauser and Wolsey, 1978]. Thirty years later, in seminal work, Vondrák introduced the continuous greedy algorithm, which approximately maximizes the multilinear extension of the submodular function [Călinescu et al., 2007], and showed it obtains the optimal 1 − 1/e approximation guarantee [Vondrák, 2008]. Despite the surge of interest in adaptivity of submodular maximization, the problem of maximizing a monotone submodular function under a matroid constraint in the adaptive complexity model has remained elusive. As we later discuss, when it comes to matroid constraints there are fundamental limitations of the techniques developed previously in this chapter. The best known adaptivity for obtaining a constant factor approximation guarantee for maximizing a monotone submodular function under a matroid constraint is achieved by the greedy algorithm and is linear in the rank of the matroid. The best known adaptivity for obtaining the optimal 1 − 1/e guarantee is achieved by the continuous greedy and is linear in the size of the ground set.

Is there an algorithm whose adaptivity is sublinear in the rank of the matroid that obtains a constant factor approximation guarantee?

The main result of this section is an algorithm for the problem of maximizing a monotone submodular function under a matroid constraint whose approximation guarantee is arbitrarily close to the optimal 1 − 1/e and has near optimal adaptivity of O(log(n) log(k)).

Theorem. For any ε > 0, there is an O((log(n)/ε³) log(k/ε³)) adaptive algorithm that, with probability 1 − o(1), obtains a 1 − 1/e − O(ε) approximation for maximizing a monotone submodular function under a matroid constraint.

Our result provides an exponential improvement in the adaptivity for maximizing a monotone submodular function under a matroid constraint with an arbitrarily small loss in approximation guarantee. As we later discuss, beyond the information theoretic consequences,

this implies that a very broad class of combinatorial optimization problems can be solved exponentially faster in standard parallel computation models given appropriate representations of the matroid constraints. Our main result is largely powered by a new technique developed in this paper which we

call adaptive sequencing. This technique proves to be extremely powerful and is a departure from all previous techniques for submodular maximization in the adaptive complexity model. In addition to our main result we show that this technique gives us a set of other strong results that include:

• An O(log(n) log(k)) adaptive combinatorial algorithm that obtains a 1/2 − ε approximation for monotone submodular maximization under a matroid constraint (Theorem 3.2.7);

• An O(log(n) log(k)) adaptive combinatorial algorithm that obtains a 1/(P+1) − ε approximation for monotone submodular maximization under the intersection of P matroids (Theorem 3.2.8);

• An O(log(n) log(k)) adaptive algorithm that obtains an approximation of 1 − 1/e − ε for monotone submodular maximization under a partition matroid constraint that can be implemented in the PRAM model with polylogarithmic depth.

In addition to these results, the adaptive sequencing technique can be used to design algorithms that achieve the same results as those for a cardinality constraint in Section 3.2.3 and [Ene and Nguyen, 2019, Fahrbach et al., 2019], and for non-monotone submodular maximization under a cardinality constraint as in Section 3.2.4. We first give some basic definitions.

Matroids. A set system M ⊆ 2^N is a matroid if it satisfies the downward closed and augmentation properties. A set system M is downward closed if for all S ⊆ T such that T ∈ M, then S ∈ M. The augmentation property is that if S, T ∈ M and |S| < |T|, then there exists a ∈ T such that S ∪ {a} ∈ M. We call a set S ∈ M feasible or independent. The rank k = rank(M) of a matroid is the maximum size of an independent set S ∈ M. The rank rank(S) of a set S is the maximum size of an independent subset T ⊆ S. A set B ∈ M is called a base of M if |B| = rank(M). The matroid polytope P(M) is the collection of points x ∈ [0, 1]^n in the convex hull of the independent sets of M, or equivalently the points x such that \sum_{i \in S} x_i ≤ rank(S) for all S ⊆ [n].
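As a concrete instance of these definitions, a partition matroid (at most a given number of elements from each part of a partition of the ground set) yields a simple independence oracle; the representation below is our own illustration:

```python
from collections import Counter

def partition_matroid_oracle(part_of, capacity):
    """Return an independence oracle for the partition matroid in which
    a set S is independent iff it contains at most capacity[j] elements
    from each part j. part_of maps element -> part id."""
    def is_independent(S):
        counts = Counter(part_of[a] for a in S)
        return all(counts[j] <= capacity[j] for j in counts)
    return is_independent

# 6 elements in two parts: at most 2 from part 0 and 1 from part 1,
# so the rank of this matroid is 2 + 1 = 3.
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
indep = partition_matroid_oracle(part_of, {0: 2, 1: 1})
assert indep({0, 1, 3})       # a base of size rank(M) = 3
assert not indep({0, 1, 2})   # three elements from part 0
```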

The multilinear extension. The multilinear extension F : [0, 1]^n → R_+ of a function f maps a point x ∈ [0, 1]^n to the expected value of a random set R ∼ x containing each element i ∈ [n] with probability x_i independently, i.e., F(x) = E_{R∼x}[f(R)]. We note that given an oracle for f, one can estimate F(x) arbitrarily well in one round by querying in parallel a sufficiently large number of samples R_1, ..., R_m i.i.d. ∼ x and taking the average value of f(R_i) over i ∈ [m] [Chekuri et al., 2015, Chekuri and Quanrud, 2019a]. For ease of presentation, we assume throughout the paper that we are given access to an exact value oracle for F in addition to f. The results which rely on F then extend to the case where the algorithm is only given an oracle for f with an arbitrarily small loss in the approximation, no loss in the adaptivity, and an additional O(n log n) factor in the query complexity.⁴
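The one-round sampling estimator for F described above can be sketched as follows (function and variable names are ours). For a modular function the estimate can be checked against the closed form F(x) = Σ_i x_i f({i}):

```python
import random

def estimate_multilinear(f, x, m=2000, seed=0):
    """Estimate F(x) = E_{R~x}[f(R)] by averaging f over m random sets R,
    where R contains element i independently with probability x[i].
    The m queries to f are independent, hence cost one adaptive round."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        R = {i for i, xi in enumerate(x) if rng.random() < xi}
        total += f(R)
    return total / m

# For the modular function f(S) = |S|, F(x) is exactly sum(x).
f = lambda S: float(len(S))
x = [0.5, 0.25, 0.75]
est = estimate_multilinear(f, x, m=20000)  # true value is 1.5
```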

Technical Overview

The standard approach to obtain an approximation guarantee arbitrarily close to 1 − 1/e for maximizing a submodular function under a matroid constraint M is by the continuous greedy algorithm due to Vondrák [2008]. This algorithm approximately maximizes the multilinear extension F of the submodular function [Călinescu et al., 2007] in O(n) adaptive steps. In each step the algorithm updates a continuous solution x ∈ [0, 1]^n in the direction of

⁴With O(ε⁻² n log n) samples, F(x) is estimated within a (1 ± ε) multiplicative factor with high probability [Chekuri and Quanrud, 2019a].

1_S, where S is chosen by maximizing an additive function under a matroid constraint. In this paper we introduce the accelerated continuous greedy algorithm, whose approximation is arbitrarily close to the optimal 1 − 1/e. Similarly to continuous greedy, this algorithm approximately maximizes the multilinear extension by carefully choosing S ∈ M

and updating the solution in the direction of 1_S. In sharp contrast to continuous greedy, however, the choice of S is done in a manner that allows making a constant number of updates to the solution, each requiring O(log(n) log(k)) adaptive rounds. We do this by constructing a feasible set S using O(log(n) log(k)) adaptive rounds, at each one of the 1/λ iterations of accelerated continuous greedy, s.t. S approximately maximizes the contribution of taking

a step of constant size λ in the direction of 1_S. We construct S via a novel combinatorial algorithm. The new combinatorial algorithm achieves by itself a 1/2 approximation for submodular maximization under a matroid constraint in O(log(n) log(k)) adaptive rounds. This algorithm is developed using a fundamentally different approach from all previous low adaptivity

algorithms for submodular maximization. This new framework uses a single random sequence (a_1, ..., a_k) of elements. In particular, for each i ∈ [k], element a_i is chosen uniformly at random among all elements such that S ∪ {a_1, ..., a_i} ∈ M. This random feasibility of each element is central to the analysis. Informally, this ordering allows the sequence to navigate randomly through the matroid constraint. For each position i in this sequence, we analyze the

number of elements a such that S ∪ {a_1, ..., a_i} ∪ {a} ∈ M and f_{S ∪ {a_1,...,a_i}}(a) is large. The key observation is that if this number is large at a position i, by the randomness of the sequence,

f_{S ∪ {a_1,...,a_i}}(a_{i+1}) is large w.h.p., which is important for the approximation. Otherwise, if this number is low we discard a large number of elements, which is important for the adaptivity. We then analyze the approximation of the accelerated continuous greedy algorithm, which is the main result of the paper. We use the combinatorial algorithm to select S as the

direction and show F(x + λ1_S) − F(x) ≥ (1 − ε)λ(OPT − F(x)), which implies a 1 − 1/e − ε

approximation. Finally, we parallelize the matroid oracle queries. The random sequence generated in each iteration of the combinatorial algorithm is independent of function evaluations and requires zero adaptive rounds, though it sequentially queries the matroid oracle. For practical implementation it is important to parallelize the matroid queries to achieve fast parallel runtime. When given explicit matroid constraints such as for uniform or partition matroids, this parallelization is relatively simple. For general matroid constraints given via rank or independence oracles we show how to parallelize the matroid queries. We give upper and lower bounds by building on the seminal work of Karp, Upfal, and Wigderson on the parallel complexity of finding the base of a matroid [Karp et al., 1988]. For rank oracles we show how to execute the algorithms with O(log(n) log(k)) parallel steps, which matches the O(log(n) log(k)) adaptivity. For independence oracles we show how to execute the algorithm using Õ(n^{1/2}) steps of parallel matroid queries and give an Ω̃(n^{1/3}) lower bound even for additive functions and partition matroids.

Previous Optimization Techniques in the Adaptive Complexity Model

The random sequencing approach developed in this paper is a fundamental departure from the adaptive sampling approach introduced in Section 3.2.1 and employed in previous combinatorial algorithms that achieve low adaptivity for submodular maximization (previous sections in this chapter and [Ene and Nguyen, 2019, Fahrbach et al., 2019, 2018]). In adaptive sampling, an algorithm samples multiple large feasible sets at every iteration to determine elements which should be added to the solution or discarded. The issue with these uniformly random feasible sets is that, although they have a simple structure for uniform matroids, they are complex objects to generate and analyze for general matroid constraints. Chekuri and Quanrud recently obtained a 1 − 1/e − ε approximation in O(log² m log n) adaptive rounds for the family of m packing constraints, which includes partition and laminar

matroids [Chekuri and Quanrud, 2019a]. This setting was then also considered for non-monotone functions in [Ene and Nguyen, 2018]. Their approach also uses the continuous greedy algorithm, combined with a multiplicative weight update technique to handle the constraints. Since general matroids consist of exponentially many constraints, a multiplicative weight update approach over these constraints is not feasible. More generally, packing constraints assume an explicit representation of the matroid. For general matroid constraints, the algorithm is not given such a representation but an oracle. Access to an independence oracle for a matroid breaks these results: any constant factor approximation algorithm with an independence oracle must have Ω̃(n^{1/3}) sequential steps.

The Combinatorial Algorithm

In this section we describe a combinatorial algorithm used at every iteration of the

accelerated continuous greedy algorithm to find a direction 1_S for an update of a continuous solution. In the next section we will show how to use this algorithm as a subprocedure in the accelerated continuous greedy algorithm to achieve an approximation arbitrarily close to 1 − 1/e with O(log(n) log(k)) adaptivity. The optimization of this direction S is itself an instance of maximizing a monotone submodular function under a matroid constraint. The

main result of this section is an O(log(n) log(k)) adaptive algorithm, which we call Adaptive Sequencing, that returns a solution {a_i}_i s.t., for all i, the marginal contribution of a_i to {a_1, ..., a_{i−1}} is near optimal with respect to all elements a s.t. {a_1, ..., a_{i−1}, a} ∈ M. We note that this guarantee also implies that Adaptive Sequencing itself achieves an approximation that is arbitrarily close to 1/2 with high probability. Unlike all previous low-adaptivity combinatorial algorithms for submodular maximization,

the Adaptive Sequencing algorithm developed here does not iteratively sample large sets of elements in parallel at every iteration. Instead, it samples a single random sequence of elements in every iteration. Importantly, this sequence is generated without any function

evaluations, and therefore can be executed in zero adaptive rounds. The goal is then to identify a high-valued prefix of the sequence that can be added to the solution and to discard a large number of low-valued elements at every iteration. Identifying a high-valued prefix enables the approximation guarantee, and discarding a large number of elements in every iteration ensures low adaptivity.

Generating random feasible sequences. The algorithm crucially requires generating a random sequence of elements in zero adaptive rounds.

Definition 15. Given a matroid M, we say that (a_1, ..., a_{rank(M)}) is a random feasible sequence if for all i ∈ [rank(M)], a_i is an element chosen u.a.r. from {a : {a_1, ..., a_{i−1}, a} ∈ M}.

A simple way to obtain a random feasible sequence is by sampling feasible elements sequentially.

Algorithm 20 Random Sequence
Input: matroid M
for i = 1 to rank(M) do
    X ← {a : {a_1, ..., a_{i−1}, a} ∈ M}
    a_i ← a uniformly random element from X
return a_1, ..., a_{rank(M)}

It is immediate that Algorithm 20 outputs a random feasible sequence. Since Algorithm 20 is independent of f, its adaptivity is zero. For ease of presentation, we describe the algorithm

using Random Sequence as a subroutine, despite its sequential calls to the matroid oracle. Later, we show how to eciently parallelize this procedure using standard matroid oracles.
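To make the procedure concrete, here is a minimal sketch in Python (our own illustration; the names `random_feasible_sequence` and `is_feasible` are ours, and any downward-closed feasibility predicate can stand in for the matroid oracle). Note that no call to $f$ is made, which is why the procedure contributes zero adaptive rounds.

```python
import random

def random_feasible_sequence(ground_set, is_feasible, rng=None):
    """Sample a random feasible sequence: at step i, a_i is drawn uniformly
    among elements whose addition keeps the prefix independent in the matroid."""
    rng = rng or random.Random(0)
    prefix = []
    while True:
        # X = {a : {a_1, ..., a_{i-1}, a} is independent}
        X = [a for a in ground_set
             if a not in prefix and is_feasible(prefix + [a])]
        if not X:
            return prefix  # the prefix is now a basis: no feasible extension left
        prefix.append(rng.choice(X))

# Example with a uniform matroid of rank 3: any set of size <= 3 is independent.
seq = random_feasible_sequence(range(10), lambda S: len(S) <= 3)
```

On a uniform matroid of rank 3, the returned sequence is a uniformly ordered basis of three distinct elements.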

The algorithm. The main idea behind the algorithm is to generate a random feasible sequence in each adaptive round, and use that sequence to determine which elements should be added to the solution and which should be discarded from consideration. Given a position $i \in \{1, \ldots, l\}$ in a sequence $(a_1, a_2, \ldots, a_l)$, a subset $S$, and a threshold $t$, we say that an element $a$ is good if adding it to $S \cup \{a_1, \ldots, a_i\}$ satisfies the matroid constraint and its marginal contribution to $S \cup \{a_1, \ldots, a_i\}$ is at least the threshold $t$. In each adaptive round the algorithm generates a random feasible sequence and finds the index $i^\star$, which is the minimal index $i$ such that at most a $1 - \epsilon$ fraction of the surviving elements $X$ are good. The algorithm then adds the set $\{a_1, \ldots, a_{i^\star}\}$ to $S$. A formal description of the algorithm is included below. We use $\mathcal{M}(S, X) := \{T \subseteq X : S \cup T \in \mathcal{M}\}$ to denote the matroid over elements $X$ where a subset is feasible in $\mathcal{M}(S, X)$ if its union with the current solution $S$ is feasible according to $\mathcal{M}$.

Algorithm 21 Adaptive Sequencing
Input: function $f$, feasibility constraint $\mathcal{M}$
  $S \leftarrow \emptyset$, $t \leftarrow \max_{a \in N} f(a)$
  for $\Delta$ iterations do
    $X \leftarrow N$
    while $X \neq \emptyset$ do
      $a_1, \ldots, a_{\mathrm{rank}(\mathcal{M}(S,X))} \leftarrow$ Random Sequence($\mathcal{M}(S, X)$)
      $X_i \leftarrow \{a \in X : S \cup \{a_1, \ldots, a_i, a\} \in \mathcal{M}$ and $f_{S \cup \{a_1, \ldots, a_i\}}(a) \geq t\}$
      $i^\star \leftarrow \min\{i : |X_i| \leq (1 - \epsilon)|X|\}$
      $S \leftarrow S \cup \{a_1, \ldots, a_{i^\star}\}$
      $X \leftarrow X_{i^\star}$
    $t \leftarrow (1 - \epsilon)t$
  return $S$
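To fix ideas, the loop above can be sketched as follows. This is a simplified illustration under our own naming, with the matroid given by a downward-closed predicate `is_feasible`; the real algorithm evaluates all the sets $X_i$ in one parallel round, which this sequential sketch does not capture.

```python
import math
import random

def adaptive_sequencing(f, ground_set, is_feasible, eps=0.25, rng=None):
    """Sketch of Adaptive Sequencing: f is a monotone submodular function on
    frozensets and is_feasible a downward-closed matroid membership test."""
    rng = rng or random.Random(0)
    N = list(ground_set)
    fs = lambda S, a: f(S | {a}) - f(S)                   # marginal contribution
    S, k = frozenset(), len(N)                            # n upper-bounds the rank k
    t = max(f(frozenset([a])) for a in N)
    for _ in range(math.ceil(math.log(k / eps) / eps)):   # Delta outer iterations
        X = set(N) - S                                    # surviving elements
        while X:
            # random feasible sequence w.r.t. M(S, X): zero function queries
            seq, prefix = [], set(S)
            cands = [a for a in X if is_feasible(prefix | {a})]
            while cands:
                a = rng.choice(cands)
                seq.append(a)
                prefix.add(a)
                cands = [b for b in cands if b != a and is_feasible(prefix | {b})]
            if not seq:
                break                                     # no feasible extension
            # X_i for each prefix length i (one adaptive round in the algorithm)
            X_sets = []
            for i in range(len(seq) + 1):
                Si = S | frozenset(seq[:i])
                X_sets.append({a for a in X
                               if is_feasible(Si | {a}) and fs(Si, a) >= t})
            # i* = smallest i that discards at least an eps fraction of X
            i_star = next(i for i in range(len(seq) + 1)
                          if len(X_sets[i]) <= (1 - eps) * len(X))
            S = S | frozenset(seq[:i_star])
            X = X_sets[i_star]
        t *= (1 - eps)
    return S
```

For instance, with a modular $f$ under a rank-3 uniform matroid, the sketch returns a feasible set whose value is within the $1/2 - \mathcal{O}(\epsilon)$ factor proved below.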

Intuitively, adding $\{a_1, \ldots, a_{i^\star}\}$ to the current solution $S$ is desirable for two important reasons. First, for a random feasible sequence we have that $S \cup \{a_1, \ldots, a_{i^\star}\} \in \mathcal{M}$, and for each element $a_i$ at a position $i \leq i^\star$, there is a high likelihood that the marginal contribution of $a_i$ to the previous elements in the sequence is at least $t$. Second, by definition of $i^\star$, a constant fraction $\epsilon$ of the elements are not good at that position, and we discard these elements from $X$. This discarding guarantees that there are at most logarithmically many iterations until $X$ is empty.

The threshold $t$ maintains the invariant that it is approximately an upper bound on the optimal marginal contribution to the current solution. By submodularity, the optimal marginal contribution to $S$ decreases as $S$ grows. Thus, to maintain the invariant, the algorithm iterates over decreasing values of $t$. In particular, at each of $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon}\right)$ iterations, where $k := \mathrm{rank}(\mathcal{M})$, the algorithm decreases $t$ by a $1 - \epsilon$ factor when there are no more elements which can be added to $S$ with marginal contribution at least $t$, i.e., when $X$ is empty.

Adaptivity. In each inner-iteration the algorithm makes polynomially-many queries that are independent of each other. Indeed, in each iteration, we generate $X_1, \ldots, X_{k - |S|}$ non-adaptively and make at most $n$ function evaluations for each $X_i$. The adaptivity immediately follows from the definition of $i^\star$, which ensures that an $\epsilon$ fraction of the surviving elements in $X$ are discarded at every iteration.

Lemma 3.2.23. With $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon}\right)$, Adaptive Sequencing has adaptivity
$$\mathcal{O}\left(\log(n)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right).$$

Proof. The for loop has $\Delta$ iterations. The while loop has at most $\mathcal{O}(\epsilon^{-1}\log n)$ iterations since, by definition of $i^\star$, an $\epsilon$ fraction of the surviving elements are discarded from $X$ at every iteration. We can find $i^\star$ by computing $X_i$ for each $i \in [k]$ in parallel in one round.

We note that the query complexity of the algorithm is $\mathcal{O}\left(nk\log(n)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right)$ and can be improved to $\mathcal{O}\left(n\log(n)\log(k)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right)$ if we allow $\mathcal{O}\left(\log(n)\log(k)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right)$ adaptivity, by doing a binary search over the at most $k$ sets $X_i$ to find $i^\star$. The details can be found below.

Quasi-linear query complexity. The query complexity of Adaptive Sequencing and Accelerated Continuous Greedy can be improved from $\mathcal{O}(nk\log(n)\log(k))$ to quasi-linear with $\mathcal{O}(n\log(n)\log^2(k))$ queries if we allow $\mathcal{O}(\log(n)\log^2(k))$ rounds. This is done by finding $i^\star$ at every iteration of Adaptive Sequencing by doing binary search over $i \in [\mathrm{rank}(\mathcal{M}(S, X))]$ instead of computing $X_i$ for all $i$ in parallel. Since there are at most $k$ values of $i$, this decreases the query complexity of finding $i^\star$ from $nk$ to $n\log k$, but increases the adaptivity by a $\log k$ factor. An important property needed to perform binary search is that $|X_i|$ is decreasing in $i$. We show this with the following lemma.

Lemma 3.2.24. At every iteration of Adaptive Sequencing, $X_{i+1} \subseteq X_i$ for all $i < \mathrm{rank}(\mathcal{M}(S, X))$.

Proof. Assume $a \in X_{i+1}$. Thus, $S \cup \{a_1, \ldots, a_{i+1}, a\} \in \mathcal{M}$ and $f_{S \cup \{a_1, \ldots, a_{i+1}\}}(a) \geq t$. By the downward closed property of matroids, $S \cup \{a_1, \ldots, a_i, a\} \in \mathcal{M}$. By submodularity, $f_{S \cup \{a_1, \ldots, a_i\}}(a) \geq f_{S \cup \{a_1, \ldots, a_{i+1}\}}(a) \geq t$. We get that $a \in X_i$.

Corollary 4. If Adaptive Sequencing finds $i^\star$ by doing binary search, then its query complexity is $\mathcal{O}(n\log(n)\log^2(k))$ and its adaptivity is $\mathcal{O}(\log(n)\log^2(k))$.
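Since Lemma 3.2.24 guarantees that $|X_i|$ is non-increasing in $i$, the minimal index can be located by a standard binary search. A small self-contained sketch (our naming; in the algorithm each probe of $|X_i|$ costs one adaptive round of at most $n$ function evaluations, hence the $\log k$ extra rounds):

```python
def find_i_star(sizes, eps, x_size):
    """Binary search for i* = min{ i : sizes[i] <= (1-eps)*x_size }, assuming
    sizes is non-increasing (Lemma 3.2.24) and the condition holds at the last
    index (X at full rank is always empty), so an answer exists."""
    threshold = (1 - eps) * x_size
    lo, hi = 0, len(sizes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if sizes[mid] <= threshold:
            hi = mid          # condition holds here: i* is mid or earlier
        else:
            lo = mid + 1      # condition fails here: i* is strictly later
    return lo

# |X_0|, ..., |X_6| for a surviving set of size |X| = 10 and eps = 0.25
i_star = find_i_star([10, 10, 8, 7, 3, 1, 0], 0.25, 10)
```

Each probe inspects one prefix, so only $\mathcal{O}(\log k)$ of the $k$ candidate sets $X_i$ are ever evaluated.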

Approximation guarantee. The main result for the approximation guarantee is that the algorithm returns a solution $S = \{a_1, \ldots, a_l\}$ such that for all $i \leq l$, the marginal contribution obtained by $a_i$ to $\{a_1, \ldots, a_{i-1}\}$ is near optimal with respect to all elements $a$ such that $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$. To prove this we show that the threshold $t$ is an approximate upper bound on the maximum marginal contribution.

Lemma 3.2.25. Assume that $f$ is submodular and that $\mathcal{M}$ is downward closed. Then, at any iteration, $t \geq (1 - \epsilon)\max_{a : S \cup a \in \mathcal{M}} f_S(a)$.

Proof. The claim initially holds by the initial definitions $t = \max_{a \in N} f(a)$, $S = \emptyset$, and $X = N$. We show that this invariant is maintained through the algorithm when either $S$ or $t$ is updated.

First, assume that at some iteration of the algorithm we have $t \geq (1 - \epsilon)\max_{a : S \cup a \in \mathcal{M}} f_S(a)$ and that $S$ is updated to $S \cup \{a_1, \ldots, a_{i^\star}\}$. Then, for all $a$ such that $S \cup a \in \mathcal{M}$,
$$f_{S \cup \{a_1, \ldots, a_{i^\star}\}}(a) \leq f_S(a) \leq t/(1 - \epsilon),$$
where the first inequality is by submodularity and the second by the inductive hypothesis. Since $\{a : S \cup \{a_1, \ldots, a_{i^\star}\} \cup a \in \mathcal{M}\} \subseteq \{a : S \cup a \in \mathcal{M}\}$ by the downward closed property of $\mathcal{M}$,
$$\max_{a : S \cup \{a_1, \ldots, a_{i^\star}\} \cup a \in \mathcal{M}} f_{S \cup \{a_1, \ldots, a_{i^\star}\}}(a) \leq \max_{a : S \cup a \in \mathcal{M}} f_{S \cup \{a_1, \ldots, a_{i^\star}\}}(a).$$
Thus, when $S$ is updated to $S \cup \{a_1, \ldots, a_{i^\star}\}$, we have $t \geq (1 - \epsilon)\max_{a : S \cup \{a_1, \ldots, a_{i^\star}\} \cup a \in \mathcal{M}} f_S(a)$.

Next, consider an iteration where $t$ is updated to $t' = (1 - \epsilon)t$. By the algorithm, $X = \emptyset$ at that iteration with current solution $S$. Thus, by the algorithm, every $a \in N$ was discarded from $X$ at some previous iteration with current solution $S'$ such that $S' \cup \{a_1, \ldots, a_{i^\star}\} \subseteq S$. Since $a$ was discarded, it is either the case that $S' \cup \{a_1, \ldots, a_{i^\star}\} \cup a \notin \mathcal{M}$ or $f_{S' \cup \{a_1, \ldots, a_{i^\star}\}}(a) < t$. If $S' \cup \{a_1, \ldots, a_{i^\star}\} \cup a \notin \mathcal{M}$, then $S \cup a \notin \mathcal{M}$ by the downward closed property and since $S' \cup \{a_1, \ldots, a_{i^\star}\} \subseteq S$. Otherwise, $f_{S' \cup \{a_1, \ldots, a_{i^\star}\}}(a) < t$, and by submodularity,
$$f_S(a) \leq f_{S' \cup \{a_1, \ldots, a_{i^\star}\}}(a) < t.$$
Thus every $a$ with $S \cup a \in \mathcal{M}$ satisfies $f_S(a) < t$, and the new threshold satisfies $t' = (1 - \epsilon)t \geq (1 - \epsilon)\max_{a : S \cup a \in \mathcal{M}} f_S(a)$.

By exploiting the definition of $i^\star$ and the random feasible sequence property, we show that Lemma 3.2.25 implies that every element added to $S$ has near-optimal expected marginal contribution to $S$. We define $X_i^{\mathcal{M}} := \{a \in X : S \cup \{a_1, \ldots, a_i\} \cup a \in \mathcal{M}\}$.

Lemma 3.2.26. Assume that $a_1, \ldots, a_{\mathrm{rank}(\mathcal{M}(S,X))}$ is a random feasible sequence. Then for all $i \leq i^\star$,
$$\mathbb{E}_{a_i}\left[f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i)\right] \geq (1 - \epsilon)^2 \max_{a : S \cup \{a_1, \ldots, a_{i-1}\} \cup a \in \mathcal{M}} f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a).$$

Proof. By the random feasibility condition, we have $a_i \sim \mathcal{U}(X_{i-1}^{\mathcal{M}})$. We get
$$\Pr_{a_i}\left[f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i) \geq t\right] \cdot t = \frac{|X_{i-1}|}{|X_{i-1}^{\mathcal{M}}|} \cdot t \geq \frac{|X_{i-1}|}{|X|} \cdot t \geq (1 - \epsilon)(1 - \epsilon)\max_{a : S \cup \{a_1, \ldots, a_{i-1}\} \cup a \in \mathcal{M}} f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a),$$
where the equality is by definition of $X_{i-1}$, the first inequality since $X_{i-1}^{\mathcal{M}} \subseteq X$, and the second since $i \leq i^\star$ and by Lemma 3.2.25. Finally, note that $\mathbb{E}\left[f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i)\right] \geq \Pr\left[f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i) \geq t\right] \cdot t$.

Next, we show that if every element $a_i$ in a solution $S = \{a_1, \ldots, a_k\}$ of size $k = \mathrm{rank}(\mathcal{M})$ has near-optimal expected marginal contribution to $S_{i-1} := \{a_1, \ldots, a_{i-1}\}$, then we obtain an approximation arbitrarily close to $1/2$ in expectation.

Lemma 3.2.27. Assume that $S = \{a_1, \ldots, a_k\}$ is such that
$$\mathbb{E}_{a_i}\left[f_{S_{i-1}}(a_i)\right] \geq (1 - \epsilon)\max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a),$$
where $S_i = \{a_1, \ldots, a_i\}$. Then, for a matroid constraint $\mathcal{M}$, we have
$$\mathbb{E}[f(S)] \geq (1/2 - \mathcal{O}(\epsilon))\,\mathrm{OPT}.$$

Proof. Let $O = \{o_1, \ldots, o_k\}$ be an ordering of the optimal solution such that $\{a_1, \ldots, a_{i-1}, o_i\}$ is feasible for all $i$, which exists by the augmentation property of matroids. We get
$$\mathbb{E}[f(S)] = \sum_{i \in [k]} \mathbb{E}[f_{S_{i-1}}(a_i)] \geq (1 - \epsilon)\sum_{i \in [k]} \mathbb{E}[f_{S_{i-1}}(o_i)] \geq (1 - \epsilon)\,\mathbb{E}[f_S(O)] \geq (1 - \epsilon)(\mathrm{OPT} - \mathbb{E}[f(S)]),$$
where the second inequality is by submodularity and the third by monotonicity. Rearranging gives $\mathbb{E}[f(S)] \geq \frac{1 - \epsilon}{2 - \epsilon}\,\mathrm{OPT} \geq (1/2 - \mathcal{O}(\epsilon))\,\mathrm{OPT}$.

A corollary of the lemmas above is that Adaptive Sequencing has $\mathcal{O}(\log(n)\log(k))$ adaptive rounds and provides an approximation that is arbitrarily close to $1/2$, in expectation. To obtain this guarantee with high probability we can simply run parallel instances of the while-loop in the algorithm and include the elements obtained from the best instance. We also note that the solution $S$ returned by Adaptive Sequencing might have size smaller than $\mathrm{rank}(\mathcal{M})$, which causes an arbitrarily small loss for sufficiently large $\Delta$. We give the full details below.

From expectation to high probability for the combinatorial algorithm. We generalize Adaptive Sequencing to obtain an algorithm called Adaptive Sequencing++, described below, which achieves a $1/2 - \epsilon$ approximation with high probability, instead of in expectation. We note that this generalization is not needed when Adaptive Sequencing is used as a subroutine of Accelerated Continuous Greedy for the $1 - 1/e - \epsilon$ result.

Algorithm 22 Adaptive Sequencing++, Adaptive Sequencing with high probability guarantee
Input: function $f$, feasibility constraint $\mathcal{M}$
  $S \leftarrow \emptyset$, $t \leftarrow \max_{a \in N} f(a)$
  for $\Delta$ iterations do
    $X \leftarrow N$
    while $X \neq \emptyset$ do
      for $j = 1$ to $\rho$ do (non-adaptively and in parallel)
        $a_1, \ldots, a_{\mathrm{rank}(\mathcal{M}(S,X))} \leftarrow$ Random Sequence($\mathcal{M}(S, X)$)
        $X_i \leftarrow \{a \in X : S \cup \{a_1, \ldots, a_i, a\} \in \mathcal{M}$ and $f_{S \cup \{a_1, \ldots, a_i\}}(a) \geq t\}$
        $i^\star \leftarrow \min\{i : |X_i| \leq (1 - \epsilon)|X|\}$
        $S_j \leftarrow S \cup \{a_1, \ldots, a_{i^\star}\}$
        $X_j \leftarrow X_{i^\star}$
        $v_j \leftarrow \sum_{\ell=1}^{i^\star} f_{S \cup \{a_1, \ldots, a_{\ell-1}\}}(a_\ell)$
      $j^\star \leftarrow \mathrm{argmax}_{j \in [\rho]} v_j$
      $S \leftarrow S_{j^\star}$
      $X \leftarrow X_{j^\star}$
    $t \leftarrow (1 - \epsilon)t$
  return $S$
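The only change relative to Algorithm 21 is the best-of-$\rho$ selection inside the while-loop. A tiny sketch of that amplification step (our own naming; the $\rho$ trials share no state, so in the algorithm they run in parallel within a single adaptive round):

```python
import random

def best_of_rho(run_trial, rho, rng=None):
    """Run rho independent trials of one while-loop iteration and keep the one
    with the largest realized value v_j = sum of marginal contributions."""
    rng = rng or random.Random(0)
    # each trial returns a tuple (S_j, X_j, v_j); trials are independent
    outcomes = [run_trial(random.Random(rng.random())) for _ in range(rho)]
    return max(outcomes, key=lambda outcome: outcome[2])
```

Selecting the argmax over $v_j$ is what converts the in-expectation guarantee of a single trial into a with-high-probability guarantee, via the Chernoff bounds in the proof below.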

Theorem 3.2.7. For any $\epsilon > 0$, there is an $\mathcal{O}\left(\log(n)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right)$ adaptive algorithm that obtains a $1/2 - \mathcal{O}(\epsilon)$ approximation with probability $1 - o(1)$ for maximizing a monotone submodular function under a matroid constraint.

Proof. We set $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon}\right)$. Initially we have $t \leq \mathrm{OPT}$. After $\Delta$ iterations of Adaptive Sequencing, the final value of $t$ is $t_f \leq (1 - \epsilon)^\Delta\,\mathrm{OPT} = \mathcal{O}\left(\frac{\epsilon}{k}\right)\mathrm{OPT}$. We begin by adding dummy elements to $S$ so that $|S| = k$, which enables pairwise comparisons between $S$ and $O$. In particular, we consider $S'$, which is $S$ together with $\mathrm{rank}(\mathcal{M}) - |S|$ dummy elements $a_{|S|+1}, \ldots, a_k$ such that, for any $T$, $f_T(a) = t_f$. Thus, by Lemma 3.2.25, for dummy elements $a_i$, $f_{S_{i-1}}(a_i) = t_f \geq (1 - \epsilon)\max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a)$.

By Lemma 3.2.23, there are $\mathcal{O}(\Delta\log(n)/\epsilon)$ iterations of the while-loop. Since each iteration of the while-loop is non-adaptive, Adaptive Sequencing++ is $\mathcal{O}(\Delta\log(n)/\epsilon)$ adaptive.

Consider an iteration of the while-loop of Adaptive Sequencing++. We first argue that for each inner-iteration $j$, $\sum_{i \in [i^\star]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^2 i^\star t$ with constant probability. We first note that $\Pr_{a_i}\left[f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i) \geq t\right] \geq 1 - \epsilon$ by the definition of $i^\star$ and the random feasible sequence property. Let $Y$ be the number of indices $i \leq i^\star$ such that $f_{S \cup \{a_1, \ldots, a_{i-1}\}}(a_i) \geq t$. By a Chernoff bound, with $\mu = \mathbb{E}[Y] \geq (1 - \epsilon)i^\star$,
$$\Pr\left[Y \leq (1 - \epsilon)(1 - \epsilon)i^\star\right] \leq e^{-\epsilon^2(1 - \epsilon)i^\star/2} \leq e^{-\epsilon^2(1 - \epsilon)/2}.$$

Let $Z \leq \rho$ be the number of inner-iterations $j$ such that $Y \geq (1 - \epsilon)(1 - \epsilon)i^\star$. By a Chernoff bound, with $\mu = \mathbb{E}[Z] \geq (1 - e^{-\epsilon^2(1 - \epsilon)/2})\rho$,
$$\Pr\left[Z \leq \frac{1}{2}\left(1 - e^{-\epsilon^2(1 - \epsilon)/2}\right)\rho\right] \leq e^{-(1 - e^{-\epsilon^2(1 - \epsilon)/2})\rho/8}.$$

Thus, with $\rho = \mathcal{O}\left(\frac{1}{1 - e^{-\epsilon^2}}\log\left(\frac{\log n}{\epsilon\delta}\right)\right)$, we have that with probability $1 - \mathcal{O}(\epsilon\delta/\log n)$ there is at least one inner-iteration $j$ such that $Y \geq (1 - \epsilon)(1 - \epsilon)i^\star$, and thus $\sum_{i \in [i^\star]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^2 i^\star t$ for the iteration $j^\star$ selected by the algorithm. By Lemma 3.2.25,
$$\sum_{i \in [i^\star]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^3 \sum_{i \in [i^\star]} \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a).$$

By a union bound, this holds over all iterations of the while-loop of Adaptive Sequencing++ with probability $1 - \delta$, and summing over all positions we get
$$\sum_{i \in [k]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^3 \sum_{i \in [k]} \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a).$$

Let $O = \{o_1, \ldots, o_k\}$ be an ordering of the optimal solution such that $\{a_1, \ldots, a_{i-1}, o_i\}$ is feasible for all $i$, which exists by the augmentation property of matroids. We conclude that with probability $1 - \delta$,
$$f(S') = \sum_{i \in [k]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^3 \sum_{i \in [k]} \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a) \geq (1 - \epsilon)^3 \sum_{i \in [k]} f_{S_{i-1}}(o_i) \geq (1 - \epsilon)^3 f_{S'}(O) \geq (1 - \epsilon)^3(\mathrm{OPT} - f(S')),$$

and since
$$f(S) = f(S') - (\mathrm{rank}(\mathcal{M}) - |S|)\,t_f \geq f(S') - \mathcal{O}(\epsilon)\,\mathrm{OPT},$$
we conclude that $f(S) \geq (1/2 - \mathcal{O}(\epsilon))\,\mathrm{OPT}$.

Intersection of matroid constraints. Next, we generalize this result and obtain a $1/(P+1) - \mathcal{O}(\epsilon)$ approximation with high probability for the intersection of $P$ matroids. We consider the constraint $\mathcal{M} = \cap_{i=1}^P \mathcal{M}_i$, which is the intersection of $P$ matroids $\mathcal{M}_i$, i.e., $S \in \mathcal{M}$ if $S \in \mathcal{M}_i$ for all $i \leq P$. Similarly as for a single matroid constraint, we denote the size of the largest feasible set by $k$. We denote the rank of a set $S$ with respect to matroid $\mathcal{M}_j$ by $\mathrm{rank}_j(S)$. We define $\mathrm{span}_j(S)$, called the span of $S$ in $\mathcal{M}_j$, by
$$\mathrm{span}_j(S) = \{a \in N : \mathrm{rank}_j(S \cup a) = \mathrm{rank}_j(S)\}.$$

We will use the following claim.

Claim 4 (Prop. 2.2 in [Nemhauser et al., 1978]). If $\sum_{i=0}^{t-1}\sigma_i \leq t$ for all $t \in [k]$ and $p_{i-1} \geq p_i$, with $\sigma_i, p_i \geq 0$, then
$$\sum_{i=0}^{k-1} p_i\sigma_i \leq \sum_{i=0}^{k-1} p_i.$$

Similarly as for a single matroid, we give the approximation guarantee obtained by a solution $S$ with near-optimal marginal contributions for each $a_i \in S$.

Lemma 3.2.28. Assume that $S = \{a_1, \ldots, a_k\}$ is such that
$$f_{S_{i-1}}(a_i) \geq (1 - \epsilon)\max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a),$$
where $S_i = \{a_1, \ldots, a_i\}$. Then, if $\mathcal{M}$ is the intersection of $P$ matroids, we have
$$f(S) \geq \left(\frac{1}{P+1} - \mathcal{O}(\epsilon)\right)\mathrm{OPT}.$$

Proof. Since $S_i$ and $O$ are independent sets in $\mathcal{M}_j$, we have
$$|\mathrm{span}_j(S_i) \cap O| = \mathrm{rank}_j(\mathrm{span}_j(S_i) \cap O) \leq \mathrm{rank}_j(\mathrm{span}_j(S_i)) = \mathrm{rank}_j(S_i) \leq |S_i| = i.$$

Define $U_i = \cup_{j=1}^P \mathrm{span}_j(S_i)$ to be the set of elements which are not part of the maximization at index $i+1$ of the procedure, and hence cannot give value at that stage. We have
$$|U_i \cap O| = \left|\left(\cup_{j=1}^P \mathrm{span}_j(S_i)\right) \cap O\right| \leq \sum_{j=1}^P |\mathrm{span}_j(S_i) \cap O| \leq P \cdot i.$$

Let $V_i = (U_i \setminus U_{i-1}) \cap O$ be the elements of $O$ which are not part of the maximization at index $i$, but were part of the maximization at index $i-1$. If $a \in V_i$ then it must be that
$$(1 - \epsilon)f_{S_k}(a) \leq (1 - \epsilon)f_{S_{i-1}}(a) \leq \max_{b : S_{i-1} \cup b \in \mathcal{M}} f_{S_{i-1}}(b),$$
where the first inequality is due to submodularity of $f$. Hence, we can upper bound
$$\sum_{o \in O \setminus S_k} f_{S_k}(o) \leq \frac{1}{1 - \epsilon}\sum_{i=1}^k \sum_{o \in V_i} \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a) = \frac{1}{1 - \epsilon}\sum_{i=1}^k |V_i| \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a) \leq \frac{P}{1 - \epsilon}\sum_{i=1}^k \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a),$$
where the last inequality uses $\sum_{t=1}^i |V_t| = |U_i \cap O| \leq Pi$ and Claim 4. Together with $\mathrm{OPT} \leq f(O \cup S_k) \leq f(S_k) + \sum_{o \in O \setminus S_k} f_{S_k}(o)$ and the assumption $f_{S_{i-1}}(a_i) \geq (1 - \epsilon)\max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a)$, we get
$$f(S) \geq \left(\frac{1}{P+1} - \mathcal{O}(\epsilon)\right)\mathrm{OPT}$$
as required.

Since Lemma 3.2.25 only uses the downward closed property of $\mathcal{M}$, and since intersections of matroids are downward closed, Adaptive Sequencing++ obtains a solution $S = \{a_1, \ldots, a_k\}$ with near-optimal marginal contributions for each $a_i \in S$. Combined with the previous lemma, we obtain the result for intersections of matroids.

Theorem 3.2.8. For any $\epsilon > 0$, Adaptive Sequencing++ is an $\mathcal{O}\left(\log(n)\log\left(\frac{k}{\epsilon}\right)\frac{1}{\epsilon^2}\right)$ adaptive algorithm that obtains a $1/(P+1) - \mathcal{O}(\epsilon)$ approximation with probability $1 - o(1)$ for maximizing a monotone submodular function under the intersection of $P$ matroids.

Proof. The first part of the proof follows similarly to the proof of Theorem 3.2.7, using Lemma 3.2.25, which also holds for intersections of matroids, to obtain the near-optimal marginal contributions of each $a_i \in S$ with probability $1 - o(1)$:
$$\sum_{i \in [k]} f_{S_{i-1}}(a_i) \geq (1 - \epsilon)^3 \sum_{i \in [k]} \max_{a : S_{i-1} \cup a \in \mathcal{M}} f_{S_{i-1}}(a).$$
We then combine this with Lemma 3.2.28 to obtain the $1/(P+1) - \mathcal{O}(\epsilon)$ approximation with probability $1 - o(1)$.

The Accelerated Continuous Greedy Algorithm

In this section we describe the accelerated continuous greedy algorithm that achieves the main result of the paper. This algorithm employs the combinatorial algorithm from the previous section to construct a continuous solution which approximately maximizes the multilinear relaxation $F$ of the function $f$. This algorithm requires $\mathcal{O}(\log(n)\log(k))$ adaptive rounds and it produces a continuous solution whose approximation to the optimal solution is with high probability arbitrarily close to $1 - 1/e$. Finally, since the solution is continuous and we seek a feasible discrete solution, it requires rounding. Fortunately, by using either dependent rounding [Chekuri et al., 2009] or contention resolution schemes [Vondrák et al., 2011] this can be done with an arbitrarily small loss in the approximation guarantee without any function evaluations, and hence without any additional adaptive rounds.

The algorithm. The accelerated continuous greedy algorithm follows the same principle as the (standard) continuous greedy algorithm [Vondrák, 2008]: at every iteration, the solution $x \in [0,1]^n$ moves in the direction of a feasible set $S \in \mathcal{M}$. The crucial difference between the accelerated continuous greedy and the standard continuous greedy is in the choice of this set $S$ guiding the direction in which $x$ moves. This difference allows the accelerated continuous greedy to terminate after a constant number of iterations, each of which has $\mathcal{O}(\log(n)\log(k))$ adaptive rounds, in contrast to the continuous greedy which requires a linear number of iterations.

To determine the direction in every iteration, the accelerated continuous greedy applies Adaptive Sequencing on the surrogate function $g$ that measures the marginal contribution to $x$ when taking a step of size $\lambda$ in the direction of $S$. That is, $g(S) := F_x(\lambda S) = F(x + \lambda S) - F(x)$, where we abuse notation and write $\lambda S$ instead of $\lambda\mathbf{1}_S$ for $\lambda \in [0,1]$ and $S \subseteq N$. Since $f$ is a monotone submodular function, it is immediate that $g$ is monotone and submodular as well.
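Since the multilinear extension $F$ is generally accessed through sampling ($F(x) = \mathbb{E}[f(R_x)]$, where $R_x$ contains each element $a$ independently with probability $x_a$), the surrogate $g$ is in practice estimated by Monte Carlo. A rough sketch under our own naming (sample counts are illustrative and not tuned to the analysis):

```python
import random

def multilinear_estimate(f, x, samples=2000, seed=0):
    """Monte Carlo estimate of F(x) = E[f(R)], where R contains each
    element a independently with probability x[a]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        R = frozenset(a for a, p in x.items() if rng.random() < p)
        total += f(R)
    return total / samples

def surrogate_g(f, x, lam, T, **kwargs):
    """g(T) = F_x(lam * 1_T) = F(x + lam * 1_T) - F(x), capping coordinates at 1."""
    x_step = dict(x)
    for a in T:
        x_step[a] = min(1.0, x_step.get(a, 0.0) + lam)
    return multilinear_estimate(f, x_step, **kwargs) - multilinear_estimate(f, x, **kwargs)
```

For a modular $f$ the multilinear extension is linear, which makes the estimator easy to sanity-check against the exact value $\sum_a x_a$.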

Algorithm 23 Accelerated Continuous Greedy
Input: matroid $\mathcal{M}$, step size $\lambda$
  $x \leftarrow 0$
  for $1/\lambda$ iterations do
    define $g : 2^N \to \mathbb{R}$ to be $g(T) = F_x(\lambda T)$
    $S \leftarrow$ Adaptive Sequencing($g$, $\mathcal{M}$)
    $x \leftarrow x + \lambda S$
  return $x$
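The outer loop is short enough to sketch directly; the subroutine choosing the direction is pluggable (below it is a parameter, so any stand-in for Adaptive Sequencing can be used; all names are ours):

```python
def accelerated_continuous_greedy(choose_direction, n, lam):
    """Outer loop of Algorithm 23: 1/lam iterations, each moving the fractional
    solution x by a step of size lam along 1_S for a feasible set S."""
    x = [0.0] * n
    for _ in range(round(1 / lam)):
        S = choose_direction(x)          # stand-in for Adaptive Sequencing(g, M)
        for a in S:
            x[a] = min(1.0, x[a] + lam)  # x <- x + lam * 1_S, capped at 1
    return x

# Toy run: a fixed direction {0, 1} with step size 1/4 fills those coordinates.
x = accelerated_continuous_greedy(lambda x: {0, 1}, n=4, lam=0.25)
```

With constant $\lambda$ the loop runs a constant number of times, which is exactly where the adaptivity savings over the standard continuous greedy come from.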

The analysis shows that in every one of the $1/\lambda$ iterations, Adaptive Sequencing finds $S$ such that the contribution of taking a step of size $\lambda$ in the direction of $S$ is approximately a $\lambda$ fraction of $\mathrm{OPT} - F(x)$. For any $\lambda$, this is a sufficient condition for obtaining the $1 - 1/e - \epsilon$ guarantee.

The reason why the standard continuous greedy cannot be implemented with a constant number of rounds $1/\lambda$ is that in every round it moves in the direction of $\mathbf{1}_S$ for $S := \mathrm{argmax}_{T \in \mathcal{M}} \sum_{a \in T} g(a)$. When $\lambda$ is constant, $F_x(\lambda S)$ is arbitrarily low due to the potential overlap between high valued singletons (see discussion below). Selecting $S$ using Adaptive Sequencing is the crucial part of the accelerated continuous greedy which allows implementing it in a constant number of iterations.

Discussion on constant step size $\lambda$. In contrast to the continuous greedy, the accelerated continuous greedy uses a constant step size $\lambda$ to guarantee low adaptivity. The challenge with using constant $\lambda$ is that $F_x(\lambda S)$ is arbitrarily low with $S := \mathrm{argmax}_{T \in \mathcal{M}} \sum_{a \in T} g(a)$ due to the overlap in value of elements $a$ with high individual value $g(a)$.

For example, consider the ground set $N = A \cup B$ with
$$f(S) = \min(\log n, |S \cap A|) + |S \cap B|,$$
$x = 0$, and $S = A$. With $\lambda = 1/n$, we note that sampling $R \sim \lambda A$, where $R$ independently contains each element of $A$ with probability $1/n$, gives $|R| \leq \log n$ with high probability, and we get $F_x(\lambda A) = (1 - o(1))\lambda|A|$, which is near optimal for a step of size $\lambda$ toward a set of size $|A|$. However, with constant $\lambda$, sampling $R \sim \lambda A$ gives $|R| > \log n$ with high probability. Thus $F_x(\lambda A) \leq \log(n)$, which is arbitrarily far from optimal for $|A| = |B| \gg \log n$, since $F_x(\lambda B) = \lambda|B|$.

Analysis. We start by giving a sufficient condition on Adaptive Sequencing to obtain the $1 - 1/e - \mathcal{O}(\epsilon)$ approximation guarantee. The analysis is standard.

Lemma 3.2.29. For a given matroid $\mathcal{M}$, assume that Adaptive Sequencing outputs $S \in \mathcal{M}$ s.t. $\mathbb{E}_S[F_x(\lambda S)] \geq (1 - \epsilon)\lambda(\mathrm{OPT} - F(x))$ at every iteration of Accelerated Continuous Greedy. Then Accelerated Continuous Greedy outputs $x \in P(\mathcal{M})$ s.t. $\mathbb{E}[F(x)] \geq (1 - 1/e - \epsilon)\,\mathrm{OPT}$.

Proof. First, $x \in P(\mathcal{M})$ since it is a convex combination of $\lambda^{-1}$ vectors $\mathbf{1}_S$ with $S \in \mathcal{M}$. Next, let $x_i$ denote the solution $x$ at the $i$th iteration of Accelerated Continuous Greedy. The algorithm increases the value of the solution $x$ by at least $(1 - \epsilon) \cdot \lambda \cdot (\mathrm{OPT} - F(x))$ at every iteration. Thus,
$$F(x_i) \geq F(x_{i-1}) + (1 - \epsilon)\lambda\left(\mathrm{OPT} - F(x_{i-1})\right).$$

Next, we show by induction on $i$ that
$$F(x_i) \geq \left(1 - (1 - (1 - \epsilon)\lambda)^i\right)\mathrm{OPT}.$$

Observe that
$$F(x_i) \geq F(x_{i-1}) + (1 - \epsilon)\lambda(\mathrm{OPT} - F(x_{i-1})) = (1 - \epsilon)\lambda\,\mathrm{OPT} + (1 - (1 - \epsilon)\lambda)F(x_{i-1}) \geq (1 - \epsilon)\lambda\,\mathrm{OPT} + (1 - (1 - \epsilon)\lambda)\left(1 - (1 - (1 - \epsilon)\lambda)^{i-1}\right)\mathrm{OPT} = \left(1 - (1 - (1 - \epsilon)\lambda)^i\right)\mathrm{OPT}.$$

Thus, with $i = \lambda^{-1}$, we return a solution $x = x_{\lambda^{-1}}$ such that
$$F(x) \geq \left(1 - (1 - (1 - \epsilon)\lambda)^{\lambda^{-1}}\right)\mathrm{OPT}.$$

Next, since $1 - x \leq e^{-x}$ for all $x \in \mathbb{R}$, we have $(1 - (1 - \epsilon)\lambda)^{\lambda^{-1}} \leq e^{-(1 - \epsilon)}$. We conclude that
$$F(x) \geq \left(1 - e^{-(1 - \epsilon)}\right)\mathrm{OPT} = \left(1 - \frac{e^\epsilon}{e}\right)\mathrm{OPT} \geq \left(1 - \frac{1 + 2\epsilon}{e}\right)\mathrm{OPT} \geq \left(1 - \frac{1}{e} - \epsilon\right)\mathrm{OPT},$$
where the second inequality is since $e^x \leq 1 + 2x$ for $0 \leq x \leq 1$.

For a set $S = \{a_1, \ldots, a_k\}$ we define $S_i := \{a_1, \ldots, a_i\}$ and $S_{j:k} := \{a_j, \ldots, a_k\}$. We use this notation in the lemma below. The lemma is folklore.

Lemma 3.2.30. Let $\mathcal{M}$ be a matroid. Then for any feasible sets $S = \{a_1, \ldots, a_k\}$ and $O$ of size $k$, there exists an ordering of $O = \{o_1, \ldots, o_k\}$ where for all $i \in [k]$, $S_i \cup O_{i+1:k} \in \mathcal{M}$ and $S_i \cap O_{i+1:k} = \emptyset$.

Proof. The proof is by reverse induction. For $i = k$, we have $S_i \cup O_{i+1:k} = S_k = S \in \mathcal{M}$ since $S$ is feasible. Consider $i < k$ and assume that $S_{i+1} \cup O_{i+2:k} \in \mathcal{M}$ and $S_{i+1} \cap O_{i+2:k} = \emptyset$. Since $S_i \cup O_{i+2:k} \subseteq S_{i+1} \cup O_{i+2:k}$, we have $S_i \cup O_{i+2:k} \in \mathcal{M}$ by downward closedness, and $|S_i \cup O_{i+2:k}| < |O|$. By the augmentation property of matroids, there exists $o_{i+1} \in O \setminus (S_i \cup O_{i+2:k})$ such that $S_i \cup O_{i+2:k} + o_{i+1} = S_i \cup O_{i+1:k} \in \mathcal{M}$. Since $o_{i+1} \notin S_i$, we also have $S_i \cap O_{i+1:k} = \emptyset$.

The following lemma is key in our analysis. We argue that unless the algorithm has already constructed $S$ of sufficiently large value, the sum of the contributions of the optimal elements to $S$ is arbitrarily close to the desired $\lambda(\mathrm{OPT} - F(x))$.

Lemma 3.2.31. Assume that $g(S) \leq \lambda(\mathrm{OPT} - F(x))$. Then $\sum_i g_{S \setminus O_{i:k}}(o_i) \geq \lambda(1 - \lambda)(\mathrm{OPT} - F(x))$.

Proof. We first lower bound this sum of marginal contributions of optimal elements by the contribution of the optimal solution to the solution $x + \lambda S$ at the end of the iteration:
$$\sum_{i \in [k]} g_{S \setminus O_{i:k}}(o_i) = \sum_{i \in [k]} F_{x + \lambda(S \setminus O_{i:k})}(\lambda o_i) \geq \sum_{i \in [k]} F_{x + O_{i-1} + \lambda S}(\lambda o_i) \geq \lambda\sum_{i \in [k]} F_{x + O_{i-1} + \lambda S}(o_i) = \lambda F_{x + \lambda S}(O),$$
where the first inequality is by submodularity and the second by the multilinearity of $F$.

In the standard analysis of greedy algorithms the optimal solution $O$ may overlap with the current solution. In the continuous algorithm, since the algorithm takes steps of size $\lambda$, we can bound the overlap between the solution at this iteration $\lambda S$ and the optimal solution:
$$F_{x + \lambda S}(O) = F_x(O + \lambda S) - F_x(\lambda S) \geq F_x(O) - \lambda(\mathrm{OPT} - F(x)) \geq (1 - \lambda)(\mathrm{OPT} - F(x)),$$
where the first inequality is by monotonicity and the lemma assumption, and the second by monotonicity.

By Lemma 3.2.26, Adaptive Sequencing picks elements $a_i$ with near-optimal marginal contributions. Together with Lemma 3.2.31, we get the desired bound on the contribution of $\lambda S$ to $x$.

Lemma 3.2.32. Let $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon\lambda}\right)$ and $\lambda = \mathcal{O}(\epsilon)$. For any $x$ such that $F(x) < (1 - 1/e)\mathrm{OPT}$, the set $S$ returned by Adaptive Sequencing($g$, $\mathcal{M}$) satisfies
$$\mathbb{E}[F_x(\lambda S)] \geq (1 - \mathcal{O}(\epsilon))\lambda(\mathrm{OPT} - F(x)).$$

Proof. Initially, we have $t < \mathrm{OPT}$. After $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon\lambda}\right)$ iterations of the outer loop of Adaptive Sequencing, we get $t_f \leq (1 - \epsilon)^\Delta\,\mathrm{OPT} = \mathcal{O}\left(\frac{\epsilon\lambda}{k}\right)\mathrm{OPT}$. We begin by adding dummy elements to $S$ so that $|S| = k$, which enables pairwise comparisons between $S$ and $O$. In particular, we consider $S'$, which is $S$ together with $\mathrm{rank}(\mathcal{M}) - |S|$ dummy elements $a_{|S|+1}, \ldots, a_k$ such that, for any $y$ and $\lambda$, $F_y(\lambda a) = t_f$, which is the value of $t$ when Adaptive Sequencing terminates. Thus, by Lemma 3.2.25, for dummy elements $a_i$, $g_{S_{i-1}}(a_i) = t_f \geq (1 - \epsilon)\max_{a : S_{i-1} \cup a \in \mathcal{M}} g_{S_{i-1}}(a)$.

We will conclude the proof by showing that $S$ is a good approximation to $S'$. By Lemma 3.2.26, the contribution of $a_i$ to $S_{i-1}$ approximates the optimal contribution to $S_{i-1}$:
$$\mathbb{E}[F_x(\lambda S')] = \sum_{i=1}^k \mathbb{E}\left[g_{S_{i-1}}(a_i)\right] \geq (1 - \epsilon)^2 \sum_{i=1}^k \max_{a : S_{i-1} \cup a \in \mathcal{M}} g_{S_{i-1}}(a).$$

By Lemma 3.2.30 and submodularity, we have $\max_{a : S_{i-1} \cup a \in \mathcal{M}} g_{S_{i-1}}(a) \geq g_{S \setminus O_{i:k}}(o_i)$. By Lemma 3.2.31, we also have $\sum_{i=1}^k g_{S \setminus O_{i:k}}(o_i) \geq \lambda(1 - \lambda)(\mathrm{OPT} - F(x))$. Combining the previous pieces, we obtain
$$\mathbb{E}[F_x(\lambda S')] \geq (1 - \epsilon)^2\lambda(1 - \lambda)(\mathrm{OPT} - F(x)).$$

We conclude by removing the value of the dummy elements:
$$\mathbb{E}[F_x(\lambda S)] = \mathbb{E}\left[F_x(\lambda S') - F_{x + \lambda S}(\lambda(S' \setminus S))\right] \geq \mathbb{E}[F_x(\lambda S')] - k\,t_f \geq \mathbb{E}[F_x(\lambda S')] - \mathcal{O}(\epsilon)\lambda\,\mathrm{OPT}.$$
The lemma assumes that $F(x) < (1 - 1/e)\mathrm{OPT}$, so $\mathrm{OPT} \leq e(\mathrm{OPT} - F(x))$, and thus $\mathcal{O}(\epsilon)\lambda\,\mathrm{OPT} = \mathcal{O}(\epsilon)\lambda(\mathrm{OPT} - F(x))$. We conclude that $\mathbb{E}[F_x(\lambda S)] \geq (1 - \mathcal{O}(\epsilon))\lambda(\mathrm{OPT} - F(x))$.

The approximation guarantee of Accelerated Continuous Greedy follows from Lemmas 3.2.32 and 3.2.29, and the adaptivity from Lemma 3.2.23.

Theorem 3.2.9. For any $\epsilon > 0$, Accelerated Continuous Greedy makes
$$\mathcal{O}\left(\log(n)\log\left(\frac{k}{\epsilon^2}\right)\frac{1}{\epsilon^2}\right)$$
adaptive rounds and obtains a $1 - 1/e - \mathcal{O}(\epsilon)$ approximation in expectation for maximizing a monotone submodular function under a matroid constraint.

Proof. We use step size $\lambda = \mathcal{O}(\epsilon)$ for Accelerated Continuous Greedy and $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon\lambda}\right)$ outer-iterations for Adaptive Sequencing. Thus, by Lemma 3.2.23 and since there are $1/\lambda$ iterations, the adaptivity is $\mathcal{O}\left(\log(n)\log\left(\frac{k}{\epsilon^2}\right)\frac{1}{\epsilon^2}\right)$. By Lemma 3.2.32, we have $\mathbb{E}[F_x(\lambda S)] \geq (1 - \mathcal{O}(\epsilon))\lambda(\mathrm{OPT} - F(x))$ at every iteration $i$. Combining with Lemma 3.2.29, we obtain $\mathbb{E}[F(x)] \geq (1 - e^{-1} - \mathcal{O}(\epsilon))\mathrm{OPT}$.

It remains to round the solution $x$. We note that there exist rounding schemes with arbitrarily small loss that are independent of the function $f$ [Chekuri et al., 2009, Vondrák et al., 2011] (so they do not perform any queries to $f$). The set $S$ we obtain from rounding the solution $x$ returned by Accelerated Continuous Greedy with these techniques is thus a $1 - 1/e - \mathcal{O}(\epsilon)$ approximation with no additional adaptivity.

The final step in our analysis shows that the guarantee of Accelerated Continuous Greedy holds not only in expectation but also with high probability. To do so, we argue in the lemma below that if, over all iterations $i$, $F_x(\lambda S)$ is on average close to $\lambda(\mathrm{OPT} - F(x))$, then we obtain an approximation arbitrarily close to $1 - 1/e$ with high probability.

Lemma 3.2.33. Assume that Adaptive Sequencing outputs $S \in \mathcal{M}$ s.t. $F_x(\lambda S) \geq \alpha_i\lambda(\mathrm{OPT} - F(x))$ at every iteration $i$ of Accelerated Continuous Greedy, and that $\lambda\sum_{i=1}^{\lambda^{-1}}\alpha_i \geq 1 - \epsilon$. Then Accelerated Continuous Greedy outputs $x \in P(\mathcal{M})$ s.t. $F(x) \geq (1 - 1/e - \epsilon)\,\mathrm{OPT}$.

Proof. First, $x \in P(\mathcal{M})$ since it is a convex combination of $\lambda^{-1}$ vectors $\mathbf{1}_S$ with $S \in \mathcal{M}$. Next, let $x_i$ denote the solution $x$ at the $i$th iteration of Accelerated Continuous Greedy. The algorithm increases the value of the solution $x$ by at least $\alpha_i \cdot \lambda \cdot (\mathrm{OPT} - F(x))$ at every iteration. Thus,
$$F(x_i) \geq F(x_{i-1}) + \alpha_i\lambda\left(\mathrm{OPT} - F(x_{i-1})\right).$$

Next, we show by induction on $i$ that
$$F(x_i) \geq \left(1 - \prod_{j=1}^i (1 - \lambda\alpha_j)\right)\mathrm{OPT}.$$

Observe that
$$F(x_i) \geq F(x_{i-1}) + \alpha_i\lambda(\mathrm{OPT} - F(x_{i-1})) = \alpha_i\lambda\,\mathrm{OPT} + (1 - \alpha_i\lambda)F(x_{i-1}) \geq \alpha_i\lambda\,\mathrm{OPT} + (1 - \alpha_i\lambda)\left(1 - \prod_{j=1}^{i-1}(1 - \lambda\alpha_j)\right)\mathrm{OPT} = \left(1 - \prod_{j=1}^i (1 - \lambda\alpha_j)\right)\mathrm{OPT},$$
where the first inequality is by the assumption of the lemma, the second by the inductive hypothesis, and the equalities by rearranging the terms. Thus, with $i = \lambda^{-1}$, we return a solution $x = x_{\lambda^{-1}}$ such that
$$F(x) \geq \left(1 - \prod_{j=1}^{\lambda^{-1}} (1 - \lambda\alpha_j)\right)\mathrm{OPT}.$$

Since $1 - x \leq e^{-x}$ for all $x \in \mathbb{R}$,
$$1 - \prod_{j=1}^{\lambda^{-1}} (1 - \lambda\alpha_j) \geq 1 - \prod_{j=1}^{\lambda^{-1}} e^{-\lambda\alpha_j} = 1 - e^{-\lambda\sum_{j=1}^{\lambda^{-1}}\alpha_j} \geq 1 - e^{-(1 - \epsilon)} \geq 1 - e^{-1} - 2\epsilon/e \geq 1 - e^{-1} - \epsilon,$$
where the second-to-last inequality is since $e^x \leq 1 + 2x$ for $0 \leq x \leq 1$.

The approximation $\alpha_i$ obtained at iteration $i$ is $1 - \mathcal{O}(\epsilon)$ in expectation by Lemma 3.2.32. Thus, by a simple concentration bound, with high probability it is close to $1 - \mathcal{O}(\epsilon)$ on average over all iterations. Together with Lemma 3.2.33, this implies the $1 - 1/e - \epsilon$ approximation with high probability.

Theorem 3.2.10. Accelerated Continuous Greedy is an $\mathcal{O}\left(\frac{1}{\epsilon\lambda}\log(n)\log\left(\frac{k}{\epsilon\lambda}\right)\right)$ adaptive algorithm that, with probability $1 - \delta$, obtains a $1 - 1/e - \mathcal{O}(\epsilon)$ approximation for maximizing a monotone submodular function under a matroid constraint, with step size $\lambda = \mathcal{O}\left(\epsilon^2\log^{-1}\left(\frac{1}{\delta}\right)\right)$.

Proof. We use $\Delta = \mathcal{O}\left(\frac{1}{\epsilon}\log\frac{k}{\epsilon\lambda}\right)$ outer-iterations for Adaptive Sequencing. Thus, by Lemma 3.2.23, the adaptivity is $\mathcal{O}\left(\frac{1}{\epsilon\lambda}\log(n)\log\left(\frac{k}{\epsilon\lambda}\right)\right)$.

By Lemma 3.2.32, we have $F_x(\lambda S) \geq \alpha_i\lambda(\mathrm{OPT} - F(x))$ at every iteration $i$ with $\mathbb{E}[\alpha_i] \geq 1 - \epsilon'$, where $\epsilon' = \mathcal{O}(\epsilon)$. By a Chernoff bound with $\mathbb{E}\left[\lambda\sum_{i \in [\lambda^{-1}]}\alpha_i\right] \geq 1 - \epsilon'$,
$$\Pr\left[\lambda\sum_{i \in [\lambda^{-1}]}\alpha_i < (1 - \epsilon)(1 - \epsilon')\right] \leq e^{-\epsilon^2(1 - \epsilon')\lambda^{-1}/2}.$$
Thus, with probability $p = 1 - e^{-\epsilon^2(1 - \epsilon')\lambda^{-1}/2}$, we have $\lambda\sum_{i \in [\lambda^{-1}]}\alpha_i \geq 1 - \epsilon - \epsilon'$. By Lemma 3.2.33, we conclude that with probability $p$, $F(x) \geq (1 - e^{-1} - (\epsilon + \epsilon'))\mathrm{OPT}$. With step size $\lambda = \mathcal{O}(\epsilon^2/\log(1/\delta))$, we get that with probability $1 - \delta$, $F(x) \geq (1 - e^{-1} - \mathcal{O}(\epsilon))\mathrm{OPT}$.

It remains to round the solution $x$. We note that there exist rounding schemes with arbitrarily small loss that are independent of the function $f$ [Chekuri et al., 2009, Vondrák et al., 2011] (so they do not perform any queries to $f$). The set $S$ we obtain from rounding the solution $x$ returned by Accelerated Continuous Greedy with these techniques is thus a $1 - 1/e - \mathcal{O}(\epsilon)$ approximation with no additional adaptivity.

Parallelization of Matroid Oracle Queries

Throughout the paper we relied on Random Sequence as a simple procedure to generate a random feasible sequence to achieve our $\mathcal{O}(\log(n)\log(k))$ adaptive algorithm with an approximation arbitrarily close to $1 - 1/e$. Although Random Sequence has zero adaptivity, it makes $\mathrm{rank}(\mathcal{M})$ sequential steps depending on membership in the matroid $\mathcal{M}$ to generate the sets $X_1, \ldots, X_{\mathrm{rank}(\mathcal{M})}$. From a practical perspective, we may wish to accelerate this process via parallelization. In this section we show how to do so in the standard rank and independence oracle models for matroids.

Matroid rank oracles. Given a rank oracle for the matroid, we get an algorithm that only makes $\mathcal{O}(\log(n)\log(k))$ steps of matroid oracle queries and has polylogarithmic depth on a PRAM machine. Recall that a rank oracle for $\mathcal{M}$ is given a set $S$ and returns its rank, i.e., the maximum size of an independent subset $T \subseteq S$. The number of steps of matroid queries of an algorithm is the number of sequential steps it makes when polynomially-many queries to a matroid oracle for $\mathcal{M}$ can be executed in parallel in each step [Karp et al., 1988].⁵ We use a parallel algorithm from Karp et al. [1988] designed for constructing a base of a matroid with a rank oracle, and show that it satisfies the random feasibility property.

⁵More precisely, it allows $p$ queries per step and the results depend on $p$; we consider the case of $p = \mathrm{poly}(n)$.

Algorithm 24 Parallel Random Sequence for matroid constraint with rank oracle
Input: matroid $\mathcal{M}$, ground set $N$
  $b_1, \ldots, b_{|N|} \leftarrow$ random permutation of $N$
  $r_i \leftarrow \mathrm{rank}(\{b_1, \ldots, b_i\})$, for all $i \in \{1, \ldots, n\}$
  $a_i \leftarrow$ $i$th $b_j$ s.t. $r_j - r_{j-1} = 1$
  return $a_1, \ldots, a_\ell$
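A minimal sketch of Algorithm 24 (our naming; `rank` stands for the matroid rank oracle). The $|N|+1$ prefix ranks are independent queries, so in the matroid query model they can all be issued in a single parallel step:

```python
import random

def parallel_random_sequence(ground_set, rank, rng=None):
    """One step of rank queries: keep the elements of a random permutation
    at which the prefix rank increases; these form a random feasible sequence."""
    rng = rng or random.Random(0)
    b = list(ground_set)
    rng.shuffle(b)                                   # random permutation of N
    r = [rank(b[:i]) for i in range(len(b) + 1)]     # all prefix ranks, in parallel
    return [b[i - 1] for i in range(1, len(b) + 1) if r[i] - r[i - 1] == 1]

# Example with the rank oracle of a uniform matroid of rank 3.
seq = parallel_random_sequence(range(8), lambda S: min(len(S), 3))
```

By Lemma 3.2.35 below, the kept elements always form a base of the matroid.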

Lemma 3.2.34. Algorithm 24 satisfies the random feasibility condition.

Proof. Consider $a_i$ with $i \leq \ell$. Then $a_i = b_j$ for some $j \leq |N|$ such that $r_j = r_{j-1} + 1$. Since $b_1, \ldots, b_{|N|}$ is a random permutation, $b_j$ is a uniformly random element of $N \setminus \{b_1, \ldots, b_{j-1}\}$. We argue by induction that $\{a : \mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) - r_{j-1} = 1\}$ is the set of all elements $a = b_m$ for some $m \geq j$ such that $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$.

We first show that if $\mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) - r_{j-1} = 1$ then $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$. By the algorithm, $\mathrm{rank}(\{b_1, \ldots, b_{j-1}\}) = i - 1$ and by the inductive hypothesis, $\{a_1, \ldots, a_{i-1}\} \in \mathcal{M}$. Thus, $\{a_1, \ldots, a_{i-1}\}$ is an independent subset of $\{b_1, \ldots, b_{j-1}\}$ of maximum size. Let $S$ be an independent subset of $\{b_1, \ldots, b_{j-1}, a\}$ of maximum size. Since $\mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) - r_{j-1} = 1$, we have $|S| = |\{a_1, \ldots, a_{i-1}\}| + 1 = i$. Thus, by the augmentation property, there exists $b \in S$ such that $\{a_1, \ldots, a_{i-1}, b\} \in \mathcal{M}$. We have $b \neq b_{j'}$ for all $j' < j$, since otherwise $\{a_1, \ldots, a_{i-1}, b\}$ would be an independent subset of $\{b_1, \ldots, b_{j-1}\}$ of size $i$, which would contradict $\mathrm{rank}(\{b_1, \ldots, b_{j-1}\}) = i - 1$. Thus $b = a$ and $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$.

Next, we show that if $a = b_m$ for some $m \geq j$ such that $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$, then $\mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) - r_{j-1} = 1$. By the algorithm, $r_{j-1} = |\{a_1, \ldots, a_{i-1}\}| = i - 1$. Since $\{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$ and $\{a_1, \ldots, a_{i-1}, a\} \subseteq \{b_1, \ldots, b_{j-1}, a\}$, we have $\mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) \geq i$. Since the rank can only increase by one when adding an element, we have $\mathrm{rank}(\{b_1, \ldots, b_{j-1}, a\}) = i = r_{j-1} + 1$.

It is easy to see that Algorithm 24 has one step of rank queries. Karp et al. [1988] showed that Algorithm 24 constructs a base of $\mathcal{M}$.

Lemma 3.2.35 ([Karp et al., 1988]). Algorithm 24 returns a base of $\mathcal{M}$.

With Algorithm 24 as the Random Sequence subroutine for Adaptive Sequencing, we obtain the following result for rank oracles.

Theorem 3.2.11. For any $\varepsilon > 0$, there is an algorithm that obtains, with probability $1 - o(1)$, a $1/2 - O(\varepsilon)$ approximation with $O\left(\frac{\log(n)}{\varepsilon}\log\left(\frac{k}{\varepsilon^2}\right)\right)$ adaptivity and steps of matroid rank queries.

Proof. By Theorem 3.2.7, Adaptive Sequencing++ is a $1/2 - O(\varepsilon)$ approximation algorithm with $O(\log(n)\log(k))$ adaptivity if Random Sequence satisfies the random feasibility condition, which Algorithm 24 does by Lemma 3.2.34. Since there are $O(\log(n)\log(k))$ iterations of calling Random Sequence and Random Sequence has one step of rank queries, there are $O(\log(n)\log(k))$ total steps of rank queries.

This gives $O(\log(n)\log(k))$ adaptivity and steps of rank queries with a $1 - 1/e - \varepsilon$ approximation for maximizing the multilinear relaxation and a $1/2 - \varepsilon$ approximation for maximizing a monotone submodular function under a matroid constraint. In particular, we get polylogarithmic depth on a PRAM machine with a rank oracle.

Matroid independence oracles. Recall that an independence oracle for $\mathcal{M}$ is an oracle which, given $S \subseteq N$, answers whether $S \in \mathcal{M}$ or $S \notin \mathcal{M}$. We give a subroutine that requires $\tilde{O}(n^{1/2})$ steps of independence matroid oracle queries and show that $\tilde{\Omega}(n^{1/3})$ steps are necessary. Similar to the case of rank oracles, we use a parallel algorithm from Karp et al. [1988] for constructing a base of a matroid that can be used as the Random Sequence subroutine while satisfying the random feasibility condition.

$\tilde{O}(\sqrt{n})$ upper bound. We use the algorithm from Karp et al. [1988] for constructing a base of a matroid.

Algorithm 25 Parallel Random Sequence for matroid constraint with independence oracle
Input: matroid $\mathcal{M}$, ground set $N$
  $c \leftarrow 0$, $X \leftarrow N$
  while $|X| > 0$ do
    $b_1, \ldots, b_{|X|} \leftarrow$ random permutation of $X$
    $i^\star \leftarrow \max\{i : \{a_1, \ldots, a_c\} \cup \{b_1, \ldots, b_i\} \in \mathcal{M}\}$
    $a_{c+1}, \ldots, a_{c+i^\star} \leftarrow b_1, \ldots, b_{i^\star}$
    $c \leftarrow c + i^\star$
    $X \leftarrow \{a \in X : \{a_1, \ldots, a_c, a\} \in \mathcal{M}\}$
  return $a_1, \ldots, a_c$
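A code sketch of Algorithm 25 (illustrative Python; the `is_independent` callable is an assumption playing the role of the independence oracle, and each loop iteration corresponds to a constant number of parallel query steps):

```python
import random

def parallel_random_sequence_indep(N, is_independent):
    """Build a random feasible sequence with independence queries.

    Per iteration: one parallel step tests all prefixes of a random
    permutation of X, and a second filters X down to the elements that
    still extend the current sequence.
    """
    a, X = [], list(N)
    while X:
        b = list(X)
        random.shuffle(b)
        # Independence of prefixes is downward closed, so the feasible
        # prefixes form an interval [0, i_star].
        i_star = max(i for i in range(len(b) + 1)
                     if is_independent(set(a) | set(b[:i])))
        a += b[:i_star]
        # Keep only elements that still extend a (one parallel step).
        X = [x for x in X
             if x not in a and is_independent(set(a) | {x})]
    return a
```

The loop always terminates: either some prefix elements are added, or the first element of the permutation is removed by the filter.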

We show that Algorithm 25 satisfies the random feasibility condition required by Adaptive Sequencing.

Lemma 3.2.36. Algorithm 25 satisfies the random feasibility condition.

Proof. Consider $a_i$ for $i \le c$. By the algorithm, we have $a_i = b_j$ for some $j \le i^\star$ at some iteration $\ell$ with permutation $b_1, \ldots, b_{|X_\ell|}$ and counter $c_\ell$. By the definition of $b_j$ and $X_\ell$, we have that $b_j$ is a uniformly random element among all elements in $X_\ell \setminus \{b_1, \ldots, b_{j-1}\}$. Conditioned on $i^\star \ge j$, we have that $\{a_1, \ldots, a_{c_\ell}, b_1, \ldots, b_j\} \in \mathcal{M}$. By the downward closed property of matroids, $\{a \in X_\ell : \{a_1, \ldots, a_{c_\ell}, b_1, \ldots, b_{j-1}, a\} \in \mathcal{M}\} \subseteq X_\ell \setminus \{b_1, \ldots, b_{j-1}\}$. Thus $b_j = a_i$ is uniformly random over all $a \in X_\ell$ such that $\{a_1, \ldots, a_{c_\ell}, b_1, \ldots, b_{j-1}, a\} = \{a_1, \ldots, a_{i-1}, a\} \in \mathcal{M}$.

Karp et al. [1988] showed that this algorithm has $O(\sqrt{n})$ iterations.

Lemma 3.2.37 ([Karp et al., 1988]). Algorithm 25 has $O(\sqrt{n})$ steps of independence queries.

With Algorithm 25 as the Random Sequence subroutine for Adaptive Sequencing, we obtain the following result with independence oracles.

Theorem 3.2.12. There is an algorithm that obtains, with probability $1 - o(1)$, a $1/2 - O(\varepsilon)$ approximation with $O\left(\frac{\log(n)}{\varepsilon}\log\left(\frac{k}{\varepsilon^2}\right)\right)$ adaptivity and $O\left(\sqrt{n}\,\frac{\log(n)}{\varepsilon}\log\left(\frac{k}{\varepsilon^2}\right)\right)$ steps of independence queries.

Proof. By Theorem 3.2.7, Adaptive Sequencing++ is a $1/2 - \varepsilon$ approximation algorithm with $O(\log(n)\log(k))$ adaptivity if Random Sequence satisfies the random feasibility condition, which Algorithm 25 does by Lemma 3.2.36. Since there are $O(\log(n)\log(k))$ iterations of calling Random Sequence and Random Sequence has $O(\sqrt{n})$ steps of independence queries by Lemma 3.2.37, there are $O(\sqrt{n}\log(n)\log(k))$ total steps of independence queries.

This gives $O(\log(n)\log(k))$ adaptivity and $O(\sqrt{n}\log(n)\log(k))$ steps of independence queries with a $1 - 1/e - \varepsilon$ approximation for maximizing the multilinear relaxation and a $1/2 - \varepsilon$ approximation for maximizing a monotone submodular function under a matroid constraint. In particular, even with independence oracles we get a sublinear algorithm in the PRAM model.

$\tilde{\Omega}(n^{1/3})$ lower bound. We show that there is no algorithm which obtains a constant approximation with fewer than $\tilde{\Omega}(n^{1/3})$ steps of independence queries, even for a cardinality function $f(S) = |S|$. We do so by using the same construction of a hard matroid instance as in [Karp et al., 1988], which was used to show an $\tilde{\Omega}(n^{1/3})$ lower bound on the number of steps of independence queries for constructing a base of a matroid. Although the matroid instance is the same, we use a different approach since the proof technique of [Karp et al., 1988] does not hold in our case.

We first give the construction from [Karp et al., 1988]. The partition matroid $\mathcal{M}$ has $p = n^{1/3}/\log^2 n$ parts $P_1, \ldots, P_p$ of equal size $n^{2/3}\log^2 n$, and a set $S$ is independent if $|S \cap P_i| \le i\, n^{1/3}\log^2 n$ for all parts $P_i$. Informally, the hardness stems from the fact that an algorithm cannot learn part $P_{i+1}$ in $i$ steps of independence queries.

We lower bound the performance of any algorithm against a matroid chosen uniformly at random over all such partitions $P_1, \ldots, P_p$. The issue with applying the approach in [Karp et al., 1988] is that when it considers a query at some step $i$, it treats $P_i, \ldots, P_p$ as uniformly random parts of $N \setminus \cup_{\ell=1}^{i-1} P_\ell$. However, a query at step $j < i$ argues about independence of queries with some parts $P_i, \ldots, P_p$ under some conditioning on $P_i, \ldots, P_p$.

We introduce some notation. Let $\mathcal{M}(S)$ be the indicator variable for $S \in \mathcal{M}$. We denote by $S^j$ the elements in $S$ that are not in a part $P_i$ with $i < j$. We say that $S$ concentrates at step $i$ if

$$|S^i \cap P_j| \le (1 + 1/8i)\,\frac{|S^i|}{p - i}$$

for all $j > i$, and we use $c(S, i)$ for the indicator variable for $S$ concentrating at step $i$. Finally, $I(S, i)$ indicates if, for some $P_1, \ldots, P_i$, the answer $\mathcal{M}(S)$ of the independence oracle to query $S$ is independent of the randomization of parts $P_{i+1}, \ldots, P_p$ over $N \setminus \cup_{j=1}^{i} P_j$. The main lemma is the following.

Lemma 3.2.38. For any $i \in [p]$, with probability at least $1 - n^{-\Omega(\log n)}$ over $P_1, \ldots, P_i$, for all queries $S$ by an algorithm with $i$ steps of queries, the answer $\mathcal{M}(S)$ of the oracle is independent of $P_{i+1}, \ldots, P_p$, conditioned on $S$ concentrating over these parts, i.e., for all queries $S$ at step $i$:

$$\Pr_{P_1, \ldots, P_i}\left[I(S, i) \mid c(S, i)\right] \ge 1 - n^{-\Omega(\log n)}.$$

Proof. The proof is by induction on $i$. Consider $i > 0$ and assume that the claim holds for all queries $S$ at steps $j < i$. By a Chernoff bound,

$$\Pr_{P_j}\left[|S^i \cap P_j| \le (1 + 1/8i)\,\frac{|S^i|}{p - i}\right] \ge 1 - e^{-(1/8i)^2 |S^i| / 2(p - i)} \ge 1 - e^{-(1/8i)^2 n^{1/3}\log^2 n / 2} = 1 - e^{-\Omega(\log^2 n)} \ge 1 - n^{-\Omega(\log n)},$$

assuming that $|S^i|/(p - i) \ge n^{1/3}\log^2 n$ and since $i \le n^{1/3}$. By a union bound, for all queries $S$ at some step $j < i$ such that the parts $P_{j+1}, \ldots, P_{i-1}$ concentrate with $S$, the answer to $S$ is independent of the randomization of parts $P_i, \ldots, P_p$ over $N \setminus \cup_{\ell=1}^{i-1} P_\ell$, conditioned on these parts concentrating.

Thus, the decision of the algorithm to query a set $S$ at step $i$ is independent of the randomization of $P_i, \ldots, P_p$, conditioned on these parts concentrating with previous queries.

We first consider a uniformly random part $P_i$ over $N \setminus \cup_{\ell=1}^{i-1} P_\ell$. There are two cases for a query $S$ at step $i$:

• If $|S^i| > (1 + 1/4i)\, i\, n^{1/3}(p - i)$: then by a Chernoff bound with $\mu = \mathbb{E}_{P_i}\left[|S^i \cap P_i|\right] = (1 + 1/4i)\, i\, n^{1/3}\log^2 n$,

$$\Pr_{P_i}\left[|S^i \cap P_i| \le i\, n^{1/3}\log^2 n\right] = n^{-\Omega(\log n)}.$$

Thus, with probability $1 - n^{-\Omega(\log n)}$, $S \notin \mathcal{M}$ and this is independent of the randomization of $P_{i+1}, \ldots, P_p$.

• If $|S^i| \le (1 + 1/4i)\, i\, n^{1/3}(p - i)$: then, if $S$ concentrates with parts $P_{i+1}, \ldots, P_p$, by definition $|S^i \cap P_j| \le (1 + 1/8i)\,\frac{|S^i|}{p - i}$ and

$$|S^i \cap P_j| \le (1 + 1/4i)(1 + 1/8i)\, i\, n^{1/3}\log^2 n < (i + 3/4)\, n^{1/3}\log^2 n < j\, n^{1/3}\log^2 n$$

for $j > i$, and $S$ is feasible with respect to part $P_j$. So $\mathcal{M}(S)$ is independent of the randomization of $P_{i+1}, \ldots, P_p$, conditioned on $c(S, i)$.

The last piece needed from $P_i$ is that, due to the conditioning, it must concentrate with all queries from previous steps. As previously, this is the case with probability $1 - n^{-\Omega(\log n)}$. Combined with the above, we obtain $\Pr_{P_1, \ldots, P_i}\left[I(S, i) \mid c(S, i)\right] \ge 1 - n^{-\Omega(\log n)}$.

Theorem 3.2.13. For any constant $\alpha$, there is no algorithm with $\frac{n^{1/3}}{4\alpha\log^2 n} - 1$ steps of $\mathrm{poly}(n)$ matroid queries which, with probability strictly greater than $n^{-\Omega(\log n)}$, obtains an $\alpha$ approximation for maximizing a cardinality function under a partition matroid constraint when given an independence oracle.

Proof. Consider a uniformly random partition of the ground set into parts $P_1, \ldots, P_p$ with $p = n^{1/3}/\log^2 n$, each of size $n^{2/3}\log^2 n$, the partition matroid over these parts described previously, and the simple function $f(S) = |S|$. By a similar Chernoff bound as in Lemma 3.2.38, we have that $\Pr_{P_j}\left[|S^i \cap P_j| \le (1 + 1/8i)|S^i|/(p - i)\right] \ge 1 - n^{-\Omega(\log n)}$ for all queries $S$ at step $i$ and $j > i$. Thus $\Pr[c(S, i)] \ge 1 - n^{-\Omega(\log n)}$ for a query $S$ at step $i$, and by a union bound this holds for all queries by the algorithm. Thus, by Lemma 3.2.38, we have that for all queries $S$ by an algorithm with $p/(4\alpha) - 1$ steps, the answer of the oracle to query $S$ is independent of the randomization of $P_{p/(4\alpha)}, \ldots, P_p$ with probability $1 - n^{-\Omega(\log n)}$, conditioned on these parts concentrating with the queries, which they do with probability $1 - n^{-\Omega(\log n)}$.

Thus, the solution $S$ returned by the algorithm after $p/(4\alpha) - 1$ steps of matroid queries is independent of the randomization of $P_{p/(4\alpha)}, \ldots, P_p$ with probability $1 - n^{-\Omega(\log n)}$, conditioned on these parts concentrating with the queries.

Assume the algorithm returns $S$ such that $|S^{p/(4\alpha)}| > (1 + 1/8n^{1/3})(1 - 1/(4\alpha))\, p\, n^{2/3}\log^2 n/(4\alpha)$. Then, with $P_{p/(4\alpha)+1}$ a random part of $N \setminus \cup_{\ell=1}^{p/(4\alpha)} P_\ell$,

$$\mathbb{E}_{P_{p/(4\alpha)+1}}\left[\left|S^{p/(4\alpha)} \cap P_{p/(4\alpha)+1}\right|\right] = (1 + 1/8n^{1/3})\, n^{2/3}\log^2 n/(4\alpha).$$

By a Chernoff bound, we have that

$$\Pr_{P_{p/(4\alpha)+1}}\left[\left|S \cap P_{p/(4\alpha)+1}\right| > n^{2/3}\log^2 n/(4\alpha)\right] \ge 1 - n^{-\Omega(\log n)}$$

and thus $S \notin \mathcal{M}$ with probability $1 - n^{-\Omega(\log n)}$.

If the algorithm returns $S$ such that $|S^{p/(4\alpha)}| \le (1 + 1/8n^{1/3})(1 - 1/(4\alpha))\, p\, n^{2/3}\log^2 n/(4\alpha)$, then, if $S \in \mathcal{M}$, there are at most $p/(4\alpha) \cdot n^{1/3}\log^2(n)\, p/(4\alpha)$ elements in $S$ from the first $p/(4\alpha)$ parts. Thus

$$|S| \le (1 + 1/8n^{1/3})(1 - 1/(4\alpha))\, n/(4\alpha) + p/(4\alpha) \cdot n^{1/3}\log^2(n)\, p/(4\alpha).$$

Note that a base $B$ of the matroid has size

$$|B| = \sum_{i=1}^{p} i\, n^{1/3}\log^2 n = n^{1/3}\, n^{1/3}\left(n^{1/3}/\log^2 n + 1\right)/2 > n/(2\log^2 n).$$

The parts $P_{p/(4\alpha)}, \ldots, P_p$ concentrate with all the queries with probability $1 - n^{-\Omega(\log n)}$. Thus, the algorithm returns, with probability $1 - n^{-\Omega(\log n)}$, either a set $S \notin \mathcal{M}$ or a set $S$ such that $|S| < |B|/\alpha$. Thus there is at least one instance of parts $P_1, \ldots, P_p$ such that the algorithm does not obtain an $\alpha$ approximation with probability strictly greater than $n^{-\Omega(\log n)}$.

To the best of our knowledge, the gap between the lower and upper bounds of $\tilde{\Omega}(n^{1/3})$ and $\tilde{O}(n^{1/2})$ parallel steps for constructing a matroid basis given an independence oracle remains open since Karp et al. [1988]. Closing this gap for submodular maximization under a matroid constraint given an independence oracle is an interesting open problem that would also close the gap of Karp et al. [1988].

Discussion about Additional Results

We discuss several cases for which our results and techniques generalize.

Cardinality constraint. We first mention a generalization of Adaptive Sequencing that is an $O(\log(n))$ adaptive algorithm obtaining a $1 - 1/e - O(\varepsilon)$ approximation with probability $1 - o(1)$ for monotone submodular maximization under a cardinality constraint, which is the special case of a uniform matroid. Instead of sampling uniformly random subsets of $X$ of size $k/r$ as done in every iteration of the algorithm in Section 3.2.3, it is possible to generate a single sequence and then add elements to $S$ and discard elements from $X$ in the same manner as Adaptive Sequencing. We note that generating a random feasible sequence in parallel is trivial for a cardinality constraint $k$: one can simply pick $k$ elements uniformly at random. Similarly, the elements we add to the solution are approximately locally optimal and we discard a constant fraction of elements at every round. A main difference is that for the case of a cardinality constraint, setting the threshold $t$ to $t = (\mathrm{OPT} - f(S))/k$ is sufficient and, as shown in Section 3.2.3, this threshold only needs a constant number of updates. Thus, for the case of a cardinality constraint, we obtain an $O(\log n)$ adaptive algorithm with a variant of Adaptive Sequencing. In addition, the continuous greedy algorithm is not needed for a cardinality constraint since adding elements with marginal contribution which approximates $(\mathrm{OPT} - f(S))/k$ at every iteration guarantees a $1 - 1/e - \varepsilon$ approximation.
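As a rough illustration of this cardinality-constraint variant, the following Python sketch (our own simplification; the names and exact update rule are assumptions, and it presumes a value-oracle `f` and an estimate of OPT) generates a random sequence, adds the prefix whose elements clear the threshold $t = (\mathrm{OPT} - f(S))/k$, and discards below-threshold elements:

```python
import random

def adaptive_sequencing_cardinality(f, N, k, opt_estimate, eps=0.1):
    """Sketch of the O(log n)-round flavor of Adaptive Sequencing for
    a cardinality constraint; f is a set-value oracle on lists."""
    S, X = [], list(N)
    while len(S) < k and X:
        t = (1 - eps) * (opt_estimate - f(S)) / k  # the threshold
        random.shuffle(X)  # a random feasible sequence: any k elements
        seq = X[:k - len(S)]
        # Longest prefix in which every element clears the threshold.
        i = 0
        while i < len(seq) and f(S + seq[:i + 1]) - f(S + seq[:i]) >= t:
            i += 1
        S += seq[:i]
        # Discard elements whose marginal contribution fell below t.
        X = [a for a in X if a not in S and f(S + [a]) - f(S) >= t]
    return S
```

For a modular $f$ this matches the usual threshold-greedy guarantee; the thesis version additionally bounds the number of rounds and of threshold updates.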

Non-monotone functions. For the case of maximizing a non-monotone submodular function under a cardinality constraint, similarly to the monotone algorithm discussed above, we can also generate a single sequence instead of multiple random blocks of elements, as done in Section 3.2.4.

Partition matroids with explicit representation. Special families of matroids, such as graphical and partition matroids, have explicit representations. We consider the case where a partition matroid is given as input to the algorithm not as an oracle but with its explicit representation, meaning the algorithm is given the parts $P_1, \ldots, P_m$ of the partition matroid and the numbers $p_1, \ldots, p_m$ of elements of each part allowed by the matroid. For the more general setting of packing constraints given to the algorithm as a collection of $m$ linear constraints, as previously mentioned, Chekuri and Quanrud [2019a] developed a $O(\log^2(m)\log(n))$ adaptive algorithm that obtains with high probability a $1 - 1/e - \varepsilon$ approximation, and has polylogarithmic depth on a PRAM machine for partition matroids. In this case of partition matroids, we obtain a $O\left(\frac{\log(n)}{\varepsilon\lambda}\log\left(\frac{k}{\varepsilon\lambda}\right)\right)$ adaptive algorithm that, with probability $1 - \delta$, obtains a $1 - 1/e - O(\varepsilon)$ approximation with $\lambda = O\left(\varepsilon^2\log^{-1}\left(\frac{1}{\delta}\right)\right)$. This algorithm also has polylogarithmic depth. This algorithm uses Accelerated Continuous Greedy with the Random Sequence subroutine for rank oracles since a rank oracle for partition matroids can easily be constructed in polylogarithmic depth when given the explicit representation of the matroid. As mentioned in [Chekuri and Quanrud, 2019a], it is also possible to obtain a rounding scheme for partition matroids in polylogarithmic depth.
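Constructing that rank oracle from the explicit representation is immediate, since the rank of a partition matroid decomposes across parts; a minimal sketch (illustrative names):

```python
def partition_matroid_rank(S, parts, capacities):
    """Rank of S in the partition matroid with parts P_1, ..., P_m and
    capacities p_1, ..., p_m: each part contributes at most p_i
    elements. The per-part terms are independent of one another, so on
    a PRAM they can be computed with polylogarithmic depth."""
    S = set(S)
    return sum(min(len(S & set(P)), p)
               for P, p in zip(parts, capacities))
```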

Intersection of P matroids. We formally analyze the more general constraint consisting of the intersection of P matroids.

3.3 Adaptivity Lower Bound

In this section, we show that the adaptive complexity of maximizing a monotone submodular function under a cardinality constraint is $\tilde{\Omega}(\log n)$, with a hardness result showing that with strictly less than $\tilde{O}(\log n)$ rounds, the best approximation possible is $\frac{1}{\log n}$. With the algorithm from the previous section, we get that the adaptive complexity of submodular maximization is $\log n$, up to lower-order terms, to obtain a constant factor approximation.

Technical Overview

To bound the number of rounds necessary to obtain a certain approximation guarantee, we analyze the information that an algorithm can learn in one round that may depend on queries from previous rounds. Reasoning about these dependencies between rounds is the main challenge. To do so, we reduce the problem of finding an $r$-adaptive algorithm to the problem of finding an $(r - 1)$-adaptive algorithm over a family of functions with additional information. This approach is related to the round elimination technique used in communication complexity (e.g. [Miltersen et al., 1995]).

The Round Elimination Lemma

The following simple lemma gives two conditions that, if satisfied by some collections of functions, imply the desired hardness result. The main condition is that an $r$-adaptive algorithm for a family of functions $\mathcal{F}_r$ can be modified into an $(r - 1)$-adaptive algorithm for a more restricted family $\mathcal{F}_{r-1}$. The base case of this inductive argument is that with no queries, there does not exist any $\alpha$-approximation algorithm for $\mathcal{F}_0$. A similar round elimination technique is used in communication complexity to characterize the tradeoff between the number of rounds and the total amount of communication of a protocol. Here, the tradeoff is different and is between the number of rounds and the approximation of an algorithm.

Lemma 3.3.1. Assume $r \in \mathrm{poly}(n)$. If there exist families of functions $\mathcal{F}_0, \ldots, \mathcal{F}_r$ such that the following two conditions hold:

• Round elimination. For all $i \in \{1, \ldots, r\}$, if there exists an $i$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_i$, then there exists an $(i - 1)$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_{i-1}$;

• Last round. There does not exist a 0-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_0$.

Then, there is no $r$-adaptive algorithm that obtains, with probability $o(1)$, an $\alpha$-approximation for $\mathcal{F}_r$.

Proof. The proof is by induction on the number of rounds $r$. If $r = 0$, then by the last round condition, $\mathcal{F}_0$ is not $\alpha$-optimizable in 0 adaptive rounds. If $r > 0$, then assume by contradiction that there exists an $\alpha$-approximation $r$-adaptive algorithm for $\mathcal{F}_r$. By the round elimination condition, this implies that there exists an $\alpha$-approximation $(r - 1)$-adaptive algorithm for $\mathcal{F}_{r-1}$. This is a contradiction with the induction hypothesis for $r - 1$.

The Onion Construction

The main technical challenge is to "fit" $r + 1$ families of functions $\mathcal{F}_0, \ldots, \mathcal{F}_r$ in the class of submodular functions while also having every family $\mathcal{F}_i$ be significantly richer than $\mathcal{F}_{i-1}$. In our context, significantly richer means that an $i$-adaptive algorithm for $\mathcal{F}_i$ can be transformed into an $(i - 1)$-adaptive algorithm for $\mathcal{F}_{i-1}$. To do so, at a high level, we want to show that after one round of querying a function in $\mathcal{F}_i$, functions in $\mathcal{F}_{i-1}$ are indistinguishable. If functions in $\mathcal{F}_{i-1}$ are indistinguishable to an $i$-adaptive algorithm after one round of querying, then the last $i - 1$ rounds of this algorithm form an $(i - 1)$-adaptive algorithm for $\mathcal{F}_{i-1}$.

Figure 3.10: The partition of the elements into layers $L_0, \ldots, L_r, L^\star$ for the hard functions. An algorithm cannot learn $L_i$ before round $i + 1$ and $L^\star$ is the optimal solution.

We construct functions that depend on a partition $P$ of the ground set $N$ into layers $L_0, \ldots, L_r, L^\star$ (illustrated in Figure 3.10). The main motivation behind this layered construction is to create a hard instance such that an algorithm cannot distinguish layer $L_i$ from $L^\star$ before round $i + 1$. The size of the layers decreases as $i$ grows, with $L^\star$ being the smallest layer. More precisely, we set $|L_i| = n^{1 - \frac{i}{r+1}}$ for $i > 0$, $|L^\star| = n^{\frac{1}{2r+2}}$, and $L_0$ consists of the remaining elements. We define $\ell_i(S) := |L_i \cap S|$ and abuse notation with $\ell_i = \ell_i(S)$ when it is clear from context. The hard function is defined as

$$f^P(S) := \sum_{i=0}^{r} \min(\ell_i, \log^2 n) + \ell^\star + \min\left(\frac{|S|}{8n^{\frac{1}{r+1}}}, 1\right) \cdot \left(2n^{\frac{1}{2r+2}} - \left(\sum_{i=0}^{r} \min(\ell_i, \log^2 n) + \ell^\star\right)\right).$$

To gain some intuition about this function, we note the following two simple facts about $f^P$:

• If a query is large, i.e. $|S| \ge 8n^{\frac{1}{r+1}}$, then $f^P(S) = 2n^{\frac{1}{2r+2}}$. Informally, all the layers are hidden since no information can be learned about the partition from query $S$;

• On the other hand, if $|S \cap L_i| \le \log^2 n$, i.e. $\ell_i \le \log^2 n$, then elements in $L_i$ and $L^\star$ are indistinguishable to an algorithm that is given the value $f(S)$, since $\min(\ell_i, \log^2 n) = \ell_i$.

Thus, the queries need to be of size smaller than $8n^{\frac{1}{r+1}}$ while also containing at least $\log^2 n$ elements in $L_i$ for the algorithm to learn some information to distinguish layers $L_i$ and $L^\star$. Since the size of the layers diminishes at a rate faster than $\frac{\log^2 n}{8n^{\frac{1}{r+1}}}$, it is hard for the algorithm to distinguish layers $L_{i+1}$ and $L^\star$ if it has not distinguished $L_i$ and $L^\star$ in previous rounds. An interpretation of this construction is that an algorithm can only learn the outermost remaining layer at any round.
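For concreteness, $f^P$ can be written directly from its definition; this Python sketch is our own illustration (the natural logarithm is used for $\log^2 n$, and the layer sizes are not enforced):

```python
import math

def hard_function(S, layers, L_star, n, r):
    """The onion-construction function f^P: capped layer counts plus
    the |L* ∩ S| term, interpolated toward the flat value
    2 n^{1/(2r+2)} as |S| approaches 8 n^{1/(r+1)}."""
    log2n = math.log(n) ** 2
    capped = sum(min(len(S & L), log2n) for L in layers) + len(S & L_star)
    scale = min(len(S) / (8 * n ** (1 / (r + 1))), 1)
    return capped + scale * (2 * n ** (1 / (2 * r + 2)) - capped)
```

The two facts above are visible directly: any query of size at least $8n^{1/(r+1)}$ evaluates to the constant $2n^{1/(2r+2)}$, and a small intersection with a layer enters only through its uncapped count.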

Round Elimination for the Construction

In order to argue that functions in $\mathcal{F}_{i-1}$ are indistinguishable after one round of querying some $f \in \mathcal{F}_i$, we begin by reducing the problem of showing indistinguishability from non-adaptive queries to showing structural properties of a randomized collection of functions $\mathcal{F}_{R_i}$.

Lemma 3.3.2. Let $\mathcal{F}_{R_i}$ be a randomized collection of functions in some $\mathbf{F}$. Assume that for all $S \subseteq N$, with probability $1 - n^{-\omega(1)}$ over $\mathcal{F}_{R_i}$, for all $f_1, f_2 \in \mathcal{F}_{R_i}$, we have that $f_1(S) = f_2(S)$. Then, for any (possibly randomized) collection $\mathcal{Q}$ of $\mathrm{poly}(n)$ non-adaptive queries, there exists a deterministic collection of functions $\mathcal{F} \in \mathbf{F}$ such that with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{Q}$, for all queries $S \in \mathcal{Q}$ and all $f_1, f_2 \in \mathcal{F}$, $f_1(S) = f_2(S)$.

Proof. We denote by $I(\mathcal{F}, \mathcal{S})$ the event that $f_1(S) = f_2(S)$ for all functions $f_1, f_2$ in a collection of functions $\mathcal{F}$ and all sets $S$ in a collection of sets $\mathcal{S}$. Let $\mathcal{Q}$ be a randomized collection of $\mathrm{poly}(n)$ non-adaptive queries and let $\mathcal{F}_R$ be a collection of functions such that for all $S \subseteq N$, with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{F}_R$, for all $f_1, f_2 \in \mathcal{F}_R$, we have that $f_1(S) = f_2(S)$.

Let $\mathcal{S}$ be any realization of the randomized collection of queries $\mathcal{Q}$. By a union bound over the $\mathrm{poly}(n)$ queries $S \in \mathcal{S}$, $\Pr\left[I(\mathcal{F}_R, \mathcal{S})\right] \ge 1 - n^{-\omega(1)}$. Since $\mathcal{F}_R \in \mathbf{F}$, we obtain

$$\max_{\mathcal{F} \in \mathbf{F}} \Pr_{\mathcal{Q}}\left[I(\mathcal{F}, \mathcal{Q})\right] \ge \Pr_{\mathcal{F}_R}\Pr_{\mathcal{Q}}\left[I(\mathcal{F}_R, \mathcal{Q})\right] \ge 1 - n^{-\omega(1)}$$

and there exists some $\mathcal{F} \in \mathbf{F}$ such that with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{Q}$, for all queries $S \in \mathcal{Q}$, $f_1(S) = f_2(S)$ for all $f_1, f_2 \in \mathcal{F}$.

The randomized collection of functions $\mathcal{F}_{R_i}$. For round $r - i$, we define the randomized collection of functions $\mathcal{F}_{R_i} \in \mathbf{F}_{r-i}$, for some $\mathbf{F}_{r-i}$, needed for Lemma 3.3.2. Informally, the layers $L_0, \ldots, L_{i-1}$ are fixed and the $i$th layer is a random subset $R_i$ of the remaining elements $N \setminus \cup_{j=0}^{i-1} L_j$. The collection $\mathcal{F}_{R_i}$ is then all functions with layers $L_0, \ldots, L_{i-1}, R_i$. Formally, given $L_0, \ldots, L_{i-1}$, the randomized collection of functions $\mathcal{F}_{R_i}$ is

$$\mathcal{F}_{R_i}(L_0, \ldots, L_{i-1}) := \left\{ f^{(L_0, \ldots, L_{i-1}, R_i, S_{i+1}, \ldots, S_r, S^\star)} \,:\, \bigsqcup_{j \in \{i+1, \ldots, r, \star\}} S_j = N \setminus \left(\cup_{j=0}^{i-1} L_j \cup R_i\right) \right\}$$

where $R_i \sim \mathcal{U}\left(N \setminus \cup_{j=0}^{i-1} L_j,\; n^{1 - \frac{i}{r+1}}\right)$ is a uniformly random subset of size $n^{1 - \frac{i}{r+1}}$ of the remaining elements $N \setminus \cup_{j=0}^{i-1} L_j$ and $\sqcup$ denotes the disjoint union of sets. We define $\mathbf{F}_{r-i}$ to be the collection of all such $\mathcal{F}_{R_i}(L_0, \ldots, L_{i-1})$ over all subsets $R_i$ of the remaining elements $N \setminus \cup_{j=0}^{i-1} L_j$.

To show the indistinguishability property (Lemma 3.3.4) for Lemma 3.3.2, we need the following concentration bound, which shows that a small set $S$ has small intersection with small random sets with high probability.

Lemma 3.3.3. Let $R$ be a uniformly random subset of a set $T$. Consider a subset $S \subseteq T$ that is independent of the randomization of $R$ and such that $|S| \cdot |R| / |T| \le e^{-1}$. Then

$$\Pr\left[|S \cap R| \ge \log^2 n\right] \le n^{-\omega(1)}.$$

Proof. We start by considering a subset $L$ of $S$ of size $\log^2 n$. We first bound the probability that $L$ is a subset of $R$:

$$\Pr[L \subseteq R] \le \prod_{a \in L} \Pr[a \in R] \le \prod_{a \in L} \frac{|R|}{|T|} = \left(\frac{|R|}{|T|}\right)^{\log^2 n}.$$

We then bound the probability that $|S \cap R| \ge \log^2 n$ with a union bound over the events that a set $L$ is a subset of $R$, for all subsets $L$ of $S$ of size $\log^2 n$:

$$\Pr\left[|S \cap R| \ge \log^2 n\right] \le \sum_{L \subseteq S : |L| = \log^2 n} \Pr[L \subseteq R] \le \binom{|S|}{\log^2 n} \cdot \left(\frac{|R|}{|T|}\right)^{\log^2 n} \le \left(\frac{|S| \cdot |R|}{|T|}\right)^{\log^2 n} \le \left(e^{-1}\right)^{\log^2 n} = n^{-\log n},$$

where the last inequality follows from the assumption that $|S| \cdot |R| / |T| \le e^{-1}$.

Corollary 5. Assume $r \le \log n$. For all $i \in [r]$, let $L_1, \ldots, L_{i-1}$ be fixed layers and let $S$ be a set of size $|S| \le \frac{1}{8} n^{\frac{1}{r+1}}$ that is independent of the randomization of $R_i$. Then

$$\Pr_{R_i}\left[\left|S \cap \left(\cup_{j=i+1}^{r} L_j\right)\right| \le \log^2 n\right] \ge 1 - n^{-\omega(1)}.$$

Proof. Since $r \le \log n$, we get $|L_{j+1}| \le \frac{1}{2}|L_j|$ for all $j$, which implies that $\sum_{j=i}^{r} |L_j| < 2|L_i|$. Without loss of generality, assume $S \subseteq R_i \cup \left(\cup_{j=i+1}^{r} L_j\right)$. The claim then immediately follows from Lemma 3.3.3 with $T = N \setminus \left(\cup_{j=1}^{i-1} L_j\right) = R_i \cup \left(\cup_{j=i+1}^{r} L_j\right)$ and $R = N \setminus \left(\cup_{j=1}^{i-1} L_j \cup R_i\right) = \cup_{j=i+1}^{r} L_j$, since

$$\frac{|S| \cdot \left|\cup_{j=i+1}^{r} L_j\right|}{\left|R_i \cup \left(\cup_{j=i+1}^{r} L_j\right)\right|} \le \frac{\frac{1}{8} n^{\frac{1}{r+1}} \cdot 2n^{1 - \frac{i+1}{r+1}}}{n^{1 - \frac{i}{r+1}}} = \frac{1}{4} \le e^{-1}.$$

Next, we show the desired property of the randomized collection of functions $\mathcal{F}_{R_i}$ needed to apply Lemma 3.3.2.

Lemma 3.3.4. Assume $r \le \log n$. For all $S \subseteq N$ and $i \in [r]$, with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{F}_{R_i} \in \mathbf{F}_{r-i}$, for all $f_1, f_2 \in \mathcal{F}_{R_i}$, $f_1(S) = f_2(S)$.

Proof. If $|S| \ge 8n^{\frac{1}{r+1}}$, then

$$f^P(S) = 2n^{\frac{1}{2r+2}}$$

for all $f^P \in \mathcal{F}_{R_i}$ with probability 1.

If $|S| < 8n^{\frac{1}{r+1}}$, then by Corollary 5, $\Pr_{R_i}\left[\left|S \cap \left(\cup_{j=i+1}^{r} L_j\right)\right| \le \log^2 n\right] \ge 1 - n^{-\omega(1)}$. Thus, with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{F}_{R_i}$, for all $f^P \in \mathcal{F}_{R_i}$,

$$f^P(S) = \sum_{j=1}^{i} \min(\ell_j, \log^2 n) + \left|S \cap \left(\cup_{j=i+1}^{r} L_j\right)\right| + \ell^\star + \min\left(\frac{|S|}{8n^{\frac{1}{r+1}}}, 1\right) \cdot \left(2n^{\frac{1}{2r+2}} - \left(\sum_{j=1}^{i} \min(\ell_j, \log^2 n) + \left|S \cap \left(\cup_{j=i+1}^{r} L_j\right)\right| + \ell^\star\right)\right).$$

The size of $S \cap \left(\cup_{j=i+1}^{r} L_j\right) = S \cap \left(N \setminus \left(\cup_{j=0}^{i-1} L_j \cup R_i\right)\right)$ is the same for all $f \in \mathcal{F}_{R_i}$. Thus, we conclude that with probability $1 - n^{-\omega(1)}$ over the randomization of $\mathcal{F}_{R_i}$, for all $f_1, f_2 \in \mathcal{F}_{R_i}$, $f_1(S) = f_2(S)$.

Combining the two previous lemmas, we are ready to show the round elimination condition.

Lemma 3.3.5. Assume $r \le \log n$. For all $i \in \{1, \ldots, r\}$ and all collections of functions $\mathcal{F}_i \in \mathbf{F}_i$, if there exists an $i$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_i$, then there exists a collection of functions $\mathcal{F}_{i-1} \in \mathbf{F}_{i-1}$ such that there exists an $(i - 1)$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_{i-1}$.

Proof. Assume that for some $\mathcal{F}_i \in \mathbf{F}_i$, there exists an $i$-adaptive algorithm $\mathcal{A}$ that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation. Let $L_0, \ldots, L_{r-i}$ be the layers of all $f^P \in \mathcal{F}_i$ and consider the randomized collection of functions $\mathcal{F}_{R_{r-i+1}}(L_0, \ldots, L_{r-i}) \in \mathbf{F}_{i-1}$. By combining Lemma 3.3.2 and Lemma 3.3.4, there exists a deterministic collection of functions $\mathcal{F}_{i-1} \in \mathbf{F}_{i-1}$ such that, with probability $1 - n^{-\omega(1)}$ over the randomization of the algorithm $\mathcal{A}$ for the non-adaptive queries $\mathcal{Q}$ in the first round, for all queries $S \in \mathcal{Q}$, $f_1(S) = f_2(S)$ for all $f_1, f_2 \in \mathcal{F}_{i-1}$. Since $\mathcal{F}_{i-1} \subseteq \mathcal{F}_i$, algorithm $\mathcal{A}$ is also an $i$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_{i-1}$.

For optimizing $f \in \mathcal{F}_{i-1}$, the decisions of the algorithm are, with probability $1 - n^{-\omega(1)}$, independent of the queries made in the first round, which is the case when for all queries $S$, $f_1(S) = f_2(S)$ for all $f_1, f_2 \in \mathcal{F}_{i-1}$. Consider the algorithm $\mathcal{A}'$ that consists of the last $i - 1$ rounds of algorithm $\mathcal{A}$ when for all queries $S$ by $\mathcal{A}$ in the first round, $f_1(S) = f_2(S)$ for all $f_1, f_2 \in \mathcal{F}_{i-1}$. We get that $\mathcal{A}'$ is an $(i - 1)$-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $\alpha$-approximation for $\mathcal{F}_{i-1}$.

It remains to show the last round condition needed for Lemma 3.3.1.

Lemma 3.3.6. For all $\mathcal{F} \in \mathbf{F}_0$, there does not exist a 0-adaptive algorithm that obtains, with probability $n^{-\omega(1)}$, an $n^{-\frac{1}{2r+2}} \cdot \left((r + 2)\log^2 n + 1\right)$ approximation.

Proof. Let $\mathcal{F} \in \mathbf{F}_0$ and let $L_0, \ldots, L_{r-1}$ be the fixed layers for all functions in $\mathcal{F}$. Let $S$ be the solution returned by the algorithm. Consider $f_{R^\star} \sim \mathcal{F}$. Note that since the algorithm is 0-adaptive, $S$ is independent of the randomization of $R^\star$ as a uniformly random subset of $L_r \cup L^\star$. Since

$$\frac{|S| \cdot |R^\star|}{|L_r \cup L^\star|} \le \frac{\frac{1}{3} n^{\frac{1}{2r+2}} \cdot n^{\frac{1}{2r+2}}}{n^{1 - \frac{r}{r+1}}} \le e^{-1},$$

we get that

$$|S \cap L^\star| \le \log^2 n$$

with probability $1 - n^{-\omega(1)}$ by Lemma 3.3.3, and we assume this is the case for the remainder of the proof. Since $\ell^\star = |S \cap L^\star| \le \log^2 n$ and $|S| \le \frac{1}{3} n^{\frac{1}{2r+2}}$, we then obtain that

$$f_{R^\star}(S) \le (r + 1)\log^2 n + \log^2 n + \frac{\frac{1}{3} n^{\frac{1}{2r+2}}}{8 n^{\frac{1}{r+1}}} \cdot 2n^{\frac{1}{2r+2}} \le (r + 2)\log^2 n + 1.$$

Then, for all $S \subseteq N$ such that $|S| \le \frac{1}{3} n^{\frac{1}{2r+2}}$,

$$\mathbb{E}_{f_{R^\star} \sim \mathcal{U}(\mathcal{F})}\left[\frac{f_{R^\star}(S)}{f_{R^\star}(R^\star)}\right] \le \frac{(r + 2)\log^2 n + 1}{n^{\frac{1}{2r+2}}}.$$

Thus there exists at least one function $f \in \mathcal{F}$ such that, with probability $1 - n^{-\omega(1)}$, the algorithm does not obtain an $n^{-\frac{1}{2r+2}} \cdot \left((r + 2)\log^2 n + 1\right)$-approximation.

The next lemma shows that $f^P$ is monotone and submodular.

Lemma 3.3.7. For all partitions $P$, the function $f^P$ is monotone submodular.

Proof. Let $f$ be a function defined by a partition $P = (L_0, \ldots, L_r, L^\star)$. The marginal contribution $f_S(a_j) = f(S \cup \{a_j\}) - f(S)$ of an element $a_j \in L_j$ to a set $S$ is

$$f_S(a_j) = \begin{cases} 1 + \frac{1}{8n^{\frac{1}{r+1}}}\left(2n^{\frac{1}{2r+2}} - \sum_{i=1}^{r+1}\min(\ell_i, \log^2 n) - \ell^\star\right) - \frac{|S| + 1}{8n^{\frac{1}{r+1}}} & \text{case A} \\ \frac{1}{8n^{\frac{1}{r+1}}}\left(2n^{\frac{1}{2r+2}} - \sum_{i=1}^{r+1}\min(\ell_i, \log^2 n) - \ell^\star\right) & \text{case B} \\ 1 - \min\left(\frac{|S|}{8n^{\frac{1}{r+1}}}, 1\right) & \text{case C} \\ 0 & \text{case D} \end{cases}$$

where the cases are:

• case A if $|S| < 8n^{\frac{1}{r+1}}$, $j \ne r + 2$, and $\ell_j < \log^2 n$,
• case B if $|S| < 8n^{\frac{1}{r+1}}$, $j \ne r + 2$, and $\ell_j \ge \log^2 n$,
• case C if $|S| \ge 8n^{\frac{1}{r+1}}$, $j \ne r + 2$, and $\ell_j < \log^2 n$,
• and case D if $|S| \ge 8n^{\frac{1}{r+1}}$, $j \ne r + 2$, and $\ell_j \ge \log^2 n$.

Monotone. In case A, we have $1 - \frac{|S| + 1}{8n^{\frac{1}{r+1}}} \ge 0$ and $f_S(a_j) \ge 0$. It is easy to see that for all other cases, $f_S(a_j) \ge 0$ as well. Thus, $f$ is monotone.

Submodular. Within each case, the marginal contributions are decreasing, as $\ell_i$ increases, for all $i$, and as $|S|$ increases. It remains to show that the marginal contributions are decreasing from one case to another when $\ell_i$ and $|S|$ increase.

• Since $1 - \frac{|S| + 1}{8n^{\frac{1}{r+1}}} \ge 0$ in A, marginal contributions are decreasing from A to B.
• Since $2n^{\frac{1}{2r+2}} - \sum_{i=1}^{r+1}\min(\ell_i, \log^2 n) - \ell^\star \ge 1$ in A, they are decreasing from A to C.
• Since the marginal contributions are non-negative, they are decreasing from B and C to D.

By combining Lemmas 3.3.1, 3.3.5, 3.3.6, and 3.3.7, we obtain the main result for this section.

Theorem 3.3.1. For any $r \le \log n$, there is no $r$-adaptive algorithm for maximizing a monotone submodular function under a cardinality constraint that obtains, with probability $\omega\left(\frac{1}{n}\right)$, an $n^{-\frac{1}{2r+2}} \cdot (r + 3)\log^2 n$ approximation. In particular, there is no $\frac{\log n}{12 \log\log n}$-adaptive algorithm that obtains, with probability $\omega\left(\frac{1}{n}\right)$, a $\frac{1}{\log n}$-approximation.

3.4 References and Acknowledgments

The work in Section 3.2.1 and Section 3.3 which shows the adaptive complexity of submodular maximization was produced in collaboration with Yaron Singer and appeared in

the STOC 2018 publication The Adaptive Complexity of Maximizing a Submodular Function (Balkanski and Singer [2018a]). The work in Section 3.2.2 on empirical evaluations of adaptive sampling was produced in

collaboration with Yaron Singer and appeared in the ICML 2018 publication Approximation Guarantees for Adaptive Sampling (Balkanski and Singer [2018b]).

The work in Section 3.2.3, which obtains the optimal $1 - 1/e$ approximation, was produced in collaboration with Aviad Rubinstein and Yaron Singer and appeared in the SODA 2019

publication An Exponential Speedup in Parallel Running Time for Submodular Maximization without Loss in Approximation (Balkanski et al. [2019a]). A similar result was obtained independently and concurrently by Ene and Nguyen [2019]. The work in Section 3.2.4 on non-monotone functions was produced in collaboration with

Adam Breuer and Yaron Singer and appeared in the NeurIPS 2018 publication Non-monotone Submodular Maximization in Exponentially Fewer Iterations (Balkanski et al. [2018]). The work in Section 3.2.5 with matroid constraints was produced in collaboration with

Aviad Rubinstein and Yaron Singer and appeared in the STOC 2019 publication An Optimal Approximation for Submodular Maximization under a Matroid Constraint in the Adaptive Complexity Model (Balkanski et al. [2019b]). A similar result was obtained independently and concurrently by Ene et al. [2019] and Chekuri and Quanrud [2019b].

Bibliography

Bruno D. Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi. Trace complexity of network inference. In KDD, 2013.

Eytan Adar and Lada A. Adamic. Tracking information epidemics in blogspace. In WI, 2005.

Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In COLT, pages 39–75, 2017.

Saeed Alaei and Azarakhsh Malekian. Maximizing sequence-submodular functions and its application to online advertising. arXiv preprint arXiv:1009.4153, 2010.

Akram Aldroubi, Haichao Wang, and Kourosh Zarringhalam. Sequential adaptive compressed sampling via Huffman codes. arXiv preprint arXiv:0810.4916, 2008.

Noga Alon, Noam Nisan, Ran Raz, and Omri Weinstein. Welfare maximization with limited interaction. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 1499–1512. IEEE, 2015.

Rico Angell and Grant Schoenebeck. Don't be greedy: Leveraging community structure to find high quality seed sets for influence maximization. arXiv preprint arXiv:1609.06520, 2016.

Ioannis Antonellis, Anish Das Sarma, and Shaddin Dughmi. Dynamic covering for recommendation systems. In Proceedings of the 21st ACM international conference on Information and knowledge management, 2012.

Ashwinkumar Badanidiyuru and Jan Vondrák. Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1497–1514, 2014. doi: 10.1137/1.9781611973402.110. URL https://doi.org/10.1137/1.9781611973402.110.

Ashwinkumar Badanidiyuru, Shahar Dobzinski, Hu Fu, Robert Kleinberg, Noam Nisan, and Tim Roughgarden. Sketching valuation functions. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012.

Ashwinkumar Badanidiyuru, Shahar Dobzinski, Hu Fu, Robert Kleinberg, Noam Nisan, and Tim Roughgarden. Sketching valuation functions. SODA, 2012a.

Ashwinkumar Badanidiyuru, Shahar Dobzinski, Hu Fu, Robert Kleinberg, Noam Nisan, and Tim Roughgarden. Sketching valuation functions. In SODA, pages 1025–1035. Society for Industrial and Applied Mathematics, 2012b.

Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: massive data summarization on the fly. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 671–680, 2014. doi: 10.1145/2623330.2623637. URL http://doi.acm.org/10.1145/2623330.2623637.

Ashwinkumar Badanidiyuru, Christos Papadimitriou, Aviad Rubinstein, Lior Seeman, and Yaron Singer. Locally adaptive optimization: Adaptive seeding for monotone submodular functions. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 414–429. Society for Industrial and Applied Mathematics, 2016.

Maria-Florina Balcan. Learning submodular functions with applications to multi-agent systems. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 2015.

Maria-Florina Balcan and Nicholas J. A. Harvey. Learning submodular functions. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, 2011.

Maria-Florina Balcan, Florin Constantin, Satoru Iwata, and Lei Wang. Learning valuation functions. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, 2012.

Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function. STOC, 2018a.

Eric Balkanski and Yaron Singer. Approximation guarantees for adaptive sampling. ICML, 2018b.

Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The power of optimization from samples. NIPS, 2016.

Eric Balkanski, Nicole Immorlica, and Yaron Singer. The importance of communities for learning to influence. NIPS, 2017a.

Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The limitations of optimization from samples. STOC, 2017b.

Eric Balkanski, Umar Syed, and Sergei Vassilvitskii. Statistical cost sharing. NIPS, 2017c.

Eric Balkanski, Adam Breuer, and Yaron Singer. Non-monotone submodular maximization in exponentially fewer iterations. NeurIPS, 2018.

Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An exponential speedup in parallel running time for submodular maximization without loss in approximation. SODA, 2019a.

Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An optimal approximation for submodular maximization under a matroid constraint in the adaptive complexity model. STOC, 2019b.

Rafael Barbosa, Alina Ene, Huy Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In ICML, pages 1236–1244, 2015.

Rafael da Ponte Barbosa, Alina Ene, Huy L Nguyen, and Justin Ward. A new framework for distributed submodular maximization. In FOCS, pages 645–654. IEEE, 2016.

Bonnie Berger, John Rompel, and Peter W Shor. Efficient NC algorithms for set cover with applications to learning and geometry. In FOCS, pages 54–59. IEEE, 1989.

Guy E Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, 1996.

Guy E Blelloch and Margaret Reid-Miller. Fast set operations using treaps. In SPAA, pages 16–26, 1998.

Guy E Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, pages 23–32, 2011.

Guy E Blelloch, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Parallel and I/O efficient set covering algorithms. In SPAA, pages 82–90. ACM, 2012.

Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science.

Christian Borgs, Michael Brautbar, Jennifer Chayes, and Brendan Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 946–957. SIAM, 2014.

Gregory R Bowman, Vincent A Voelz, and Vijay S Pande. Taming the complexity of protein folding. Current opinion in structural biology, 21(1):4–11, 2011.

Mark Braverman, Jieming Mao, and S Matthew Weinberg. Parallel algorithms for select and partition with noisy comparisons. In STOC, pages 851–862, 2016.

Niv Buchbinder, Moran Feldman, Joseph Seffi Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1433–1452. Society for Industrial and Applied Mathematics, 2014.

Dave Buchfuhrer, Michael Schapira, and Yaron Singer. Computation and incentives in combinatorial public projects. In Proceedings of the 11th ACM conference on Electronic commerce, 2010.

Harry Buhrman, David García-Soriano, Arie Matsliah, and Ronald de Wolf. The non-adaptive query complexity of testing k-parities. arXiv preprint arXiv:1209.3849, 2012.

Gruia Călinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a submodular set function subject to a matroid constraint (extended abstract). In Integer Programming and Combinatorial Optimization, 12th International IPCO Conference, Ithaca, NY, USA, June 25-27, 2007, Proceedings, pages 182–196, 2007.

CalTrans. Pems: California performance measuring system. http://pems.dot.ca.gov/ [accessed: May 1, 2018].

Clement Canonne and Tom Gur. An adaptivity hierarchy theorem for property testing. arXiv preprint arXiv:1702.05678, 2017.

Chandra Chekuri and Kent Quanrud. Submodular function maximization in parallel via the multilinear relaxation. SODA, 2019a.

Chandra Chekuri and Kent Quanrud. Parallelizing greedy for submodular set function maximization in matroids and beyond. STOC, 2019b.

Chandra Chekuri, Jan Vondrák, and Rico Zenklusen. Dependent randomized rounding for matroid polytopes and applications. arXiv preprint arXiv:0909.4348, 2009.

Chandra Chekuri, TS Jayram, and Jan Vondrák. On multiplicative weight updates for concave and submodular function maximization. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 201–210. ACM, 2015.

Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208. ACM, 2009.

Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1029–1038. ACM, 2010.

Xi Chen, Rocco A Servedio, Li-Yang Tan, Erik Waingarten, and Jinyu Xie. Settling the query complexity of non-adaptive junta testing. arXiv preprint arXiv:1704.06314, 2017.

Justin Cheng, Lada A. Adamic, P. Alex Dow, Jon M. Kleinberg, and Jure Leskovec. Can cascades be predicted? In WWW, 2014.

Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. Max-cover in map-reduce. In Proceedings of the 19th international conference on World wide web, pages 231–240. ACM, 2010.

Flavio Chierichetti, Jon M. Kleinberg, and David Liben-Nowell. Reconstructing patterns of information diffusion from incomplete observations. In NIPS, 2011.

Richard Cole. Parallel merge sort. SIAM Journal on Computing, 17(4):770–785, 1988.

Michele Conforti and Gérard Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete applied mathematics, 1984.

Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 793–801, 2014.

Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, and Andrew Tomkins. The discoverability of the web. In Proceedings of the 16th international conference on World Wide Web, 2007.

Abir De, Sourangshu Bhattacharya, Parantapa Bhattacharya, Niloy Ganguly, and Soumen Chakrabarti. Learning a linear influence model from transient opinion dynamics. In CIKM, 2014.

Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Nikhil R Devanur, Zhiyi Huang, Nitish Korula, Vahab S Mirrokni, and Qiqi Yan. Whole-page optimization and submodular welfare maximization with online bidders. ACM Transactions on Economics and Computation, 4(3):14, 2016.

Shahar Dobzinski and Michael Schapira. An improved approximation algorithm for combinatorial auctions with submodular bidders. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1064–1073. Society for Industrial and Applied Mathematics, 2006.

Shahar Dobzinski, Noam Nisan, and Sigal Oren. Economic efficiency requires interaction. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 233–242. ACM, 2014.

Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 57–66. ACM, 2001a.

Pedro Domingos and Matthew Richardson. Mining the network value of customers. In KDD, 2001b.

Nan Du, Le Song, Alexander J. Smola, and Ming Yuan. Learning networks of heterogeneous influence. In NIPS, 2012.

Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3147–3155, 2013.

Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. Influence function learning in information diffusion networks. In ICML, 2014a.

Nan Du, Yingyu Liang, Maria-Florina F Balcan, and Le Song. Learning time-varying coverage functions. In Advances in neural information processing systems, pages 3374–3382, 2014b.

Shaddin Dughmi. A truthful randomized mechanism for combinatorial public projects via convex optimization. In Proceedings of the 12th ACM conference on Electronic commerce, 2011.

Shaddin Dughmi and Jan Vondrák. Limitations of randomized mechanisms for combinatorial auctions. Games and Economic Behavior, 2015.

Shaddin Dughmi, Tim Roughgarden, and Qiqi Yan. From convex optimization to randomized mechanisms: toward optimal combinatorial auctions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, 2011.

Pavol Duris, Zvi Galil, and Georg Schnitger. Lower bounds on communication complexity. In STOC, pages 81–91, 1984.

Alina Ene and Huy L Nguyen. Constrained submodular maximization: Beyond 1/e. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 248–257. IEEE, 2016.

Alina Ene and Huy L Nguyen. Submodular maximization with nearly-optimal approximation and adaptivity in nearly-linear time. arXiv preprint arXiv:1804.05379, 2018.

Alina Ene and Huy L Nguyen. Submodular maximization with nearly-optimal approximation and adaptivity in nearly-linear time. SODA, 2019.

Alina Ene, Huy L Nguyen, and Adrian Vladu. Submodular maximization with matroid and packing constraints in parallel. STOC, 2019.

Alessandro Epasto, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Bicriteria distributed submodular maximization in a few rounds. In SPAA, pages 25–33, 2017.

Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. Non-monotone submodular maximization with nearly optimal adaptivity complexity. arXiv, 2018.

Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. Submodular maximization with optimal approximation, adaptivity and query complexity. SODA, 2019.

Uriel Feige. A threshold of ln n for approximating set cover. JACM, 1998.

Uriel Feige, Vahab S Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011.

Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 570–579. IEEE, 2011.

Moran Feldman, Christopher Harshaw, and Amin Karbasi. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1), 33 pages, 2015.

Moran Feldman, Christopher Harshaw, and Amin Karbasi. Greed is good: Near-optimal submodular maximization via greedy optimization. arXiv preprint arXiv:1704.01652, 2017.

Vitaly Feldman and Pravesh Kothari. Learning coverage functions and private release of marginals. In Proceedings of The 27th Conference on Learning Theory, pages 679–702, 2014.

Vitaly Feldman and Jan Vondrák. Optimal bounds on approximation of submodular and XOS functions by juntas. In 54th Annual IEEE Symposium on Foundations of Computer Science, 2013.

Vitaly Feldman and Jan Vondrák. Tight bounds on low-degree spectral concentration of submodular and XOS functions. In Foundations of Computer Science, 2015.

Joseph D Frazier, Peter K Jimack, and Robert M Kirby. On the use of adjoint-based sensitivity estimates to control local mesh refinement. Commun Comput Phys, 7:631–8, 2010.

Shayan Oveis Gharan and Jan Vondrák. Submodular maximization by simulated annealing. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 1098–1116. Society for Industrial and Applied Mathematics, 2011.

Daniel Golovin and Andreas Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, pages 333–345, 2010.

Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pages 1019–1028, 2010.

Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. Learning influence probabilities in social networks. In KDD, 2010.

Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Celf++: optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th international conference companion on World wide web, pages 47–48. ACM, 2011.

Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sensor placements in gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 265–272, 2005.

Anupam Gupta, Aaron Roth, Grant Schoenebeck, and Kunal Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In International Workshop on Internet and Network Economics, pages 246–257. Springer, 2010.

Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. SIAM Journal on Computing, 42(4):1494– 1520, 2013.

F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), Article 19, 19 pages, December 2015. doi: http://dx.doi.org/10.1145/2827872.

Avinatan Hassidim and Yaron Singer. Submodular optimization under noise. 2015. Working paper.

Jarvis Haupt, Robert Nowak, and Rui Castro. Adaptive sensing for sparse signal recovery. In Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop, pages 702–707. IEEE, 2009a.

Jarvis D Haupt, Richard G Baraniuk, Rui M Castro, and Robert D Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pages 1551–1555. IEEE, 2009b.

Xinran He and David Kempe. Robust influence maximization. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 885–894, 2016.

Xinran He, Ke Xu, David Kempe, and Yan Liu. Learning influence functions from incomplete observations. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2065–2073, 2016. URL http://papers.nips.cc/paper/ 6181-learning-influence-functions-from-incomplete-observations.

Jean Honorio and Luis Ortiz. Learning the structure and parameters of large-population graphical games from behavioral data. Journal of Machine Learning Research, 16:1157–1210, 2015.

Thibaut Horel and Yaron Singer. Scalable methods for adaptively seeding a social network. In Proceedings of the 24th International Conference on World Wide Web, pages 441–451. International World Wide Web Conferences Steering Committee, 2015.

Piotr Indyk, Eric Price, and David P Woodruff. On the power of adaptivity in sparse recovery. In FOCS, pages 285–294. IEEE, 2011.

Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, 2013.

Rishabh K Iyer, Stefanie Jegelka, and Jeff A Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In NIPS, 2013.

Stefanie Jegelka, Hui Lin, and Jeff A Bilmes. On fast approximate submodular minimization. In Advances in Neural Information Processing Systems, pages 460–468, 2011.

Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodu- lar optimization. In Advances in Neural Information Processing Systems, pages 1313–1321, 2013.

Shihao Ji, Ya Xue, and Lawrence Carin. Bayesian compressive sensing. IEEE Transactions on Signal Processing, 56(6):2346–2356, 2008.

Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938–948, 2010.

Richard M. Karp, Eli Upfal, and Avi Wigderson. The complexity of parallel search. J. Comput. Syst. Sci., 36(2):225–253, 1988.

David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. KDD, 2003.

David Kempe, Jon M. Kleinberg, and Éva Tardos. Influential nodes in a diffusion model for social networks. In Automata, Languages and Programming, 32nd International Colloquium, ICALP 2005, Lisbon, Portugal, July 11-15, 2005, Proceedings, pages 1127–1138, 2005. doi: 10.1007/11523468_91. URL http://dx.doi.org/10.1007/11523468_91.

Nitish Korula, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Online submodular welfare maximization: Greedy beats 1/2 in random order. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, 2015.

Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, 2007.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.

Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. ACM Transactions on Parallel Computing, 2(3):14, 2015a.

Ravi Kumar, Andrew Tomkins, Sergei Vassilvitskii, and Erik Vee. Inverting a steady-state. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 359–368. ACM, 2015b.

Jon Lee, Vahab S Mirrokni, Viswanath Nagarajan, and Maxim Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 323–332. ACM, 2009.

Benny Lehmann, Daniel Lehmann, and Noam Nisan. Combinatorial auctions with decreasing marginal utilities. In Proceedings of the 3rd ACM conference on Electronic Commerce, pages 18–28. ACM, 2001.

Jure Leskovec and Andrej Krevl. Snap datasets, stanford large network dataset collection. 2015.

Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie S. Glance, and Matthew Hurst. Patterns of cascading behavior in large blog graphs. In SDM, 2007.

David Liben-Nowell and Jon M. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.

Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 510–520. Association for Computational Linguistics, 2011.

Erik M Lindgren, Shanshan Wu, and Alexandros G Dimakis. Sparse and greedy: Sparsifying submodular facility location problems. In NIPS Workshop on Optimization for Machine Learning, 2015.

Dmitry M Malioutov, Sujay Sanghavi, and Alan S Willsky. Compressed sensing with sequential observations. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3357–3360. IEEE, 2008.

Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In STOC, pages 103–111. ACM, 1995.

Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for dis- tributed submodular maximization. In STOC, pages 153–162, 2015.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, pages 2049–2057, 2013.

Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., pages 1812–1818, 2015a. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/ view/9956.

Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause. Distributed submodular cover: Succinctly summarizing massive data. In NIPS, pages 2881–2889, 2015b.

Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi. Fast constrained submodular maximization: Personalized data summarization. In ICML, pages 1358–1367, 2016.

Slobodan Mitrovic, Ilija Bogunovic, Ashkan Norouzi-Fard, Jakub M Tarnawski, and Volkan Cevher. Streaming robust submodular maximization: A partitioned thresholding approach. In NIPS, pages 4560–4569, 2017.

Elchanan Mossel and Sébastien Roch. On the submodularity of influence in social networks. In STOC, 2007.

Harikrishna Narasimhan, David C. Parkes, and Yaron Singer. Learnability of influence in networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3186–3194, 2015.

George L Nemhauser and Laurence A Wolsey. Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research, 3(3):177–188, 1978.

George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 1978.

Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In SIGMETRICS/Performance, 2012.

Noam Nisan and Avi Wigderson. Rounds in communication complexity revisited. In STOC, pages 419–429, 1991.

Robert Nishihara, Stefanie Jegelka, and Michael I Jordan. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems, pages 640–648, 2014.

NYC-Taxi-Limousine-Commission. New York City taxi and limousine commission trip record data. 2017. URL http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

NYTimes. Taxi drivers in new york are struggling. so are uber drivers., 2018. URL https: //www.nytimes.com/2018/06/17/nyregion/uber-taxi-drivers-struggle.html.

Xinghao Pan, Stefanie Jegelka, Joseph E Gonzalez, Joseph K Bradley, and Michael I Jordan. Parallel double greedy submodular maximization. In Advances in Neural Information Processing Systems, pages 118–126, 2014.

Christos H Papadimitriou and Michael Sipser. Communication complexity. Journal of Computer and System Sciences, 28(2):260–269, 1984.

Sridhar Rajagopalan and Vijay V Vazirani. Primal-dual RNC approximation algorithms for set cover and covering integer programs. SIAM Journal on Computing, 28(2):525–540, 1998.

Sofya Raskhodnikova and Adam Smith. A note on adaptivity in testing properties of bounded degree graphs. ECCC, TR06-089, 2006.

Matthew Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 61–70. ACM, 2002.

Nir Rosenfeld and Amir Globerson. Optimal tagging with markov chain optimization. CoRR, abs/1605.04719, 2016.

Alex Rubinsteyn and Sergey Feldman. fancyimpute, matrix completion and feature imputation algorithms. 2017. URL https://github.com/hammerlab/fancyimpute.

Barna Saha and Lise Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In SDM, volume 9, 2009.

Lior Seeman and Yaron Singer. Adaptive seeding in social networks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 459–468. IEEE, 2013.

Rocco A Servedio, Li-Yang Tan, and John Wright. Adaptivity helps for testing juntas. In Proceedings of the 30th Conference on Computational Complexity, pages 264–279. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2015.

Yaron Singer. How to win friends and influence people, truthfully: influence maximization mechanisms for social networks. In Proceedings of the fifth ACM international conference on Web search and data mining, 2012.

Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In AAAI, pages 2037–2043, 2016.

Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. In SODA, 2015.

Ashwin Swaminathan, Cherian V Mathew, and Darko Kirovski. Essential pages. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01, pages 173–182, 2009.

Hiroya Takamura and Manabu Okumura. Text summarization model based on maximum coverage problem and its variant. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009.

Steven K Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for dna microarrays. Bioinformatics, 17(6):520–525, 2001.

Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in neural information processing systems, pages 1413–1421, 2014.

Leslie G Valiant. Parallelism in comparison problems. SIAM Journal on Computing, 4(3):348–355, 1975.

Leslie G. Valiant. A Theory of the Learnable. Commun. ACM, 1984.

Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 17-20, 2008, pages 67–74, 2008.

Jan Vondrák. Submodularity and curvature: the optimal algorithm. RIMS, 2010.

Jan Vondrák, Chandra Chekuri, and Rico Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 783–792. ACM, 2011.

Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning, pages 1494–1502, 2014.

Yisong Yue and Thorsten Joachims. Predicting diverse subsets using structural SVMs. In Proceedings of the 25th international conference on Machine learning, 2008.
