The Adaptive Complexity of Submodular Optimization
Citation Balkanski, Eric. 2019. The Adaptive Complexity of Submodular Optimization. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029793
Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

The Adaptive Complexity of Submodular Optimization
a dissertation presented by Eric Balkanski to The School of Engineering and Applied Sciences
in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science
Harvard University Cambridge, Massachusetts
May 2019

© 2019 Eric Balkanski. All rights reserved.

Dissertation advisor: Yaron Singer
Eric Balkanski
The Adaptive Complexity of Submodular Optimization
Abstract
In this thesis, we develop a new optimization technique that leads to exponentially faster algorithms for solving submodular optimization problems. For the canonical problem of maximizing a non-decreasing submodular function under a cardinality constraint, it is well known that the celebrated greedy algorithm, which iteratively adds elements whose marginal contribution is largest, achieves a 1 − 1/e approximation, which is optimal. The optimal approximation guarantee of the greedy algorithm comes at a price of high adaptivity. The adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round. Since submodular optimization is regularly applied on very large datasets, adaptivity is crucial as algorithms with low adaptivity enable dramatic speedups in parallel computing time. Submodular optimization has been studied for well over forty years now, and somewhat surprisingly, there was no known constant-factor approximation algorithm for submodular maximization whose adaptivity is sublinear in the size of the ground set n. Our main contribution is a novel optimization technique called adaptive sampling which leads to constant factor approximation algorithms for submodular maximization in only logarithmically many adaptive rounds. This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms. Furthermore, we show that no algorithm can achieve a constant factor approximation in õ(log n) rounds. Thus, the adaptive complexity of submodular maximization, i.e., the minimum number of rounds r such that there exists an r-adaptive algorithm which achieves a constant factor approximation, is logarithmic up to lower order terms.
Contents
1 Introduction 1 1.1 Adaptive Sampling: An Exponential Speedup ...... 4 1.2 Results Overview ...... 5 1.2.1 From Predictions to Decisions: Optimization from Samples ...... 6 1.2.2 Faster Parallel Algorithms Through Adaptive Sampling ...... 8 1.3 Preliminaries ...... 12 1.4 Discussion About Adaptivity ...... 14 1.4.1 Adaptivity in Other Areas ...... 14 1.4.2 Related Models of Parallel Computation ...... 15 1.4.3 Applications of Adaptivity ...... 17
2 Non-Adaptive Optimization 20 2.1 From Predictions to Decisions: Optimization from Samples ...... 20 2.1.1 The Optimization from Samples Model ...... 23 2.1.2 Optimization from Samples is Equivalent to Non-Adaptivity . . . . . 25 2.1.3 Overview of Results ...... 27 2.2 Optimization from Samples Algorithms ...... 29 2.2.1 Curvature ...... 30 2.2.2 Learning to Influence ...... 35
2.2.3 General Submodular Maximization ...... 54 2.3 The Limitations of Optimization from Samples ...... 62 2.3.1 A Framework for Hardness of Optimization from Samples ...... 62 2.3.2 Submodular Maximization ...... 65 2.3.3 Maximum Coverage ...... 67 2.3.4 Curvature ...... 84 2.4 References and Acknowledgments ...... 91
3 Adaptive Optimization 92 3.1 Adaptivity ...... 93 3.1.1 The Adaptive Complexity Model ...... 93 3.1.2 The Adaptivity Landscape for Submodular Optimization ...... 94 3.1.3 Main Result ...... 95 3.1.4 Adaptive Sampling: a Coupling of Learning and Optimization . . . . 96 3.1.5 Overview of Results ...... 96 3.2 Adaptive Algorithms ...... 99 3.2.1 An Algorithm with Logarithmic Adaptivity ...... 99 3.2.2 Experiments ...... 117 3.2.3 The Optimal Approximation ...... 123 3.2.4 Non-monotone Functions ...... 141 3.2.5 Matroid Constraints ...... 166 3.3 Adaptivity Lower Bound ...... 205 3.4 References and Acknowledgments ...... 214
Acknowledgments
First and foremost, I would like to thank my advisor Yaron Singer. Yaron has taught me everything about research, from long-term research vision to choosing the right font for a presentation. I have been extremely fortunate to have an advisor who has always believed in me, cares so much about my success, and tirelessly mentored and advised me through every stage of my PhD. My family has been a vital source of support through the years. I am grateful to my parents, Cecile and Yves, for all the encouragement and freedom to explore every project and idea I had since a very young age. Thank you to my siblings Sophie and Jeff for all the good times laughing together and the endless singing in the car during vacations, I cannot wait for our next family vacation. Thanks also to my girlfriend Meghan for the emotional support and dealing with me during paper deadlines. I am very grateful for my two internships at Google Research NYC, which have played an important role in my PhD. During these two summers, I met and worked with wonderful people, broadened my research horizon, and explored new directions. Thank you in particular to my hosts and collaborators Umar Syed, Sergei Vassilvitskii, Balu Sivan, Renato Paes Leme, and Vahab Mirrokni. I have also had the chance to work with incredible collaborators from whom I have learned a lot and who made significant contributions to this thesis: Aviad Rubinstein, Jason Hartline, Andreas Krause, Baharan Mirzasoleiman, Nicole Immorlica, Amir Globerson, Nir Rosenfeld, and Adam Breuer. I am also grateful for the numerous discussions about the content of this thesis with Thibaut Horel, whose breadth of knowledge has been very helpful. I am fortunate to have spent my PhD working in a fun and warm environment. Thank you to my officemates through the years, Emma Heikensten, Siri Isaksson, Dimitris Kalimeris, Gal Kaplun, Sharon Qian, Greg Stoddard, Bo Waggoner and Ming Yin, I will miss the
great atmosphere in MD115. The EconCS group and Maxwell Dworkin have been amazingly friendly and supportive places to grow academically, thanks in particular to David Parkes, Yiling Chen, Jean Pouget-Abadie, Hongyao Ma, Chara Podimata, Jarek Blasiok, Debmalya Mandal, Preetum Nakkiran, Lior Seeman, Goran Radanovic, and Yang Liu, and to my research committee, Yaron Singer, David Parkes, Sasha Rush, and Michael Mitzenmacher. I am fortunate for the countless and diverse opportunities I had during my undergraduate studies at Carnegie Mellon which helped me grow. These opportunities led me to valuable research, teaching, leadership, social, and athletic experiences. Thank you in particular to my undergraduate research advisor, Ariel Procaccia, who was the first person to tell me he believed I could become a professor, and my academic advisor John Mackey for all the opportunities. This thesis was supported in part by a Google PhD Fellowship and a Smith Family Graduate Science and Engineering Fellowship.
Chapter 1
Introduction
The field of optimization has recently expanded to new application domains, where complex decision-making tasks are challenging existing frameworks and techniques. A main difficulty is that the scale of these tasks is growing at a vertiginous rate. Innovative frameworks and techniques are needed to capture modern application domains as well as address the challenges posed by large scale computation. We begin by discussing three application domains where optimization has recently played an important role. The first domain is genomics. Recent developments in computational biology that allow processing large amounts of genomic data have been a catalyst for progress towards understanding the human genome. One approach that helps to understand and process massive gene datasets is to cluster gene sequences. Given gene clusters, a small representative subset of genes, in other words a summary of the entire dataset, can be obtained by choosing one gene sequence from each cluster. The problem of clustering gene sequences is an example of a large scale optimization problem. A different domain is recommender systems. Twenty years ago, a Friday movie night would require browsing through multiple aisles at Blockbuster to find the right movie. Today, streaming services use recommender systems to suggest a personalized collection of movies to
a user. Recommender systems are not limited to movies, but are also used for music and online retail. Given the preferences of different users, optimization techniques are used to find diverse and personalized collections of movies, songs, or products. The third domain, ride-sharing services, did not even exist a decade ago. Ride-sharing services have revolutionized transportation by allowing passengers to request a ride to their desired destination in a few seconds using their smartphone. These companies face novel and complex optimization tasks. One example is driver dispatch, which is the problem of assigning drivers to different locations in order to match the demand from riders. Gene sequences, user ratings for music and movies, and rider-driver locations are examples of large datasets that we wish to harness using optimization. Novel algorithmic techniques and frameworks that are adapted for these new domains are needed. A standard approach to these optimization problems is to use an algorithmic technique called the greedy approach. Informally, a greedy algorithm takes small local steps towards building a global solution to a problem. For example, for the driver dispatch problem in New York City, a greedy algorithm might first dispatch a first driver to bustling midtown Manhattan. Then, greedy could send another driver to a different location in the financial district, then a third driver to a concert in Brooklyn, and so on. This greedy technique is very popular in the field of algorithms; it has been (and still is) used to solve a wide variety of problems. In New York City, ride-sharing companies complete hundreds of thousands of trips per day [NYTimes, 2018]. In large cities, a greedy approach of picking only one location for each driver at a time by iteratively considering each driver one after the other is of course unreasonable.
Adaptivity. The main reason the greedy approach is unreasonable for large scale problems is that it is highly sequential: it only assigns a location to a single driver at each step of the algorithm. A crucial feature for an algorithm to scale to large instances is how parallelizable this algorithm is. Intuitively, a ride-sharing service should not pick the next location sequentially for one driver at a time, but should be choosing the locations of multiple drivers in parallel and at the same time. Adaptivity is a convenient measure which captures this notion of sequentiality of an algorithm. In other words, it measures to what extent an algorithm can be parallelized for large scale problems.
Submodular optimization. In optimization, it is desirable to develop generic algorithmic solutions which can be applied to a large family of problems. In other words, algorithms which are not tailored to one specific application domain. One may believe that it is possible to use similar optimization techniques for a movie recommendation system and a music recommendation system. Perhaps more surprising is the fact that the problems from the three different application domains previously mentioned (genomics, recommender systems, and driver dispatch) all share some common feature, or structure, which allows grouping them in a single family of problems solved with the same algorithmic approaches. This family of problems is called submodular optimization problems. Submodularity captures a simple and natural diminishing returns property. An illustration of this diminishing returns property is that the gain from sending an additional driver to midtown Manhattan diminishes if there is already a large number of drivers in midtown Manhattan. Fundamental quantities that we often wish to optimize, such as coverage, diversity, entropy, or mutual information, all exhibit this diminishing returns property. This explains why a wide variety of problems such as clustering, recommender systems, facility location, influence in networks, users' preferences over goods, just to name a few, are all submodular.
1.1 Adaptive Sampling: An Exponential Speedup
In this thesis, we develop a novel optimization technique which leads to exponentially faster algorithms for solving submodular optimization problems. We begin by discussing this exponential speedup in this section. In Section 1.2, we give an overview of the results and present the organization of this thesis. In the preliminaries in Section 1.3, we introduce formal definitions and notation. We further discuss adaptivity in Section 1.4, in particular related work on adaptivity, related models of parallel computation, and application areas.

For the canonical problem of maximizing a non-decreasing submodular function under a cardinality constraint, it is well known that the celebrated greedy algorithm, which iteratively adds elements whose marginal contribution is largest, achieves a 1 − 1/e approximation [Nemhauser et al., 1978]. Furthermore, this approximation guarantee is optimal for any algorithm that uses polynomially-many value queries [Nemhauser and Wolsey, 1978]. The optimal approximation guarantee of the greedy algorithm comes at a price of high adaptivity. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round. Adaptivity provides a measure of efficiency of parallel computation (see Section 1.4.2 for related models of parallel computation). For a cardinality constraint k and a ground set of size n, the greedy algorithm is k-adaptive since it sequentially adds elements in k rounds. In each round it makes O(n) function evaluations to identify and include the element with maximal marginal contribution to the set of elements selected in previous rounds. In the worst case k ∈ Ω(n), and thus the greedy algorithm is Ω(n)-adaptive and its parallel running time is Ω(n). Since submodular optimization is regularly applied on very large datasets, adaptivity is crucial as algorithms with low adaptivity enable dramatic speedups in parallel computing time.
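The k sequential rounds of greedy are easy to see in code. The following Python sketch is purely illustrative (the coverage instance is invented for the example, and `f` stands in for any value oracle): each iteration issues O(n) marginal-contribution queries that are mutually independent, but the next iteration depends on the element just selected.

```python
def greedy(f, ground_set, k):
    """Greedy submodular maximization: k adaptive rounds,
    each making O(n) independent oracle queries."""
    S = set()
    for _ in range(k):
        # The queries below are independent of one another and could
        # run in parallel -- but round i+1 depends on round i's choice.
        best = max(sorted(ground_set - S),
                   key=lambda a: f(S | {a}) - f(S))
        S.add(best)
    return S

# Toy coverage instance: element i covers the set cover[i].
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"a"}}
f = lambda S: len(set().union(*[cover[i] for i in S]))
print(sorted(greedy(f, set(cover), 2)))  # [1, 3], covering all of a-d
```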
Submodular optimization has been studied for well over forty years now, and in
the past decade there has been extensive study of submodular maximization for large datasets [Jegelka et al., 2011, Badanidiyuru et al., 2012b, Kumar et al., 2015a, Jegelka et al., 2013, Mirzasoleiman et al., 2013, Wei et al., 2014, Nishihara et al., 2014, Badanidiyuru and Vondrák, 2014, Pan et al., 2014, Badanidiyuru et al., 2014, Mirzasoleiman et al., 2015a, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2015b, Barbosa et al., 2015, Mirzasoleiman et al., 2016, Barbosa et al., 2016, Epasto et al., 2017]. Somewhat surprisingly however, until very recently, there was no known constant-factor approximation algorithm for submodular maximization whose adaptivity is sublinear in n. Our main contribution is a novel optimization technique called adaptive sampling which leads to constant factor approximation algorithms for submodular maximization in only logarithmically many adaptive rounds. This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms. Furthermore, we show that no algorithm can achieve a constant factor approximation in õ(log n) rounds. Thus, the adaptive complexity of submodular maximization, i.e., the minimum number of rounds r such that there exists an r-adaptive algorithm which achieves a constant factor approximation, is logarithmic up to lower order terms. The high level idea behind adaptive sampling is to adaptively construct distributions over sets of elements. Based on the value of samples from previous rounds, the algorithm constructs a new distribution at every round. It then samples multiple sets from this new distribution and performs function evaluations over these samples in parallel.
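The round structure just described can be sketched abstractly. Everything below is a toy stand-in for the actual algorithms of Chapter 3: the multiplicative update rule is invented for illustration, and only the skeleton (sample a batch from the current distribution, evaluate the batch in parallel, update the distribution) reflects the adaptive sampling idea.

```python
import random

def adaptive_sampling_skeleton(f, ground_set, rounds, samples_per_round):
    """r rounds; each round draws a batch of sets from the current
    distribution and evaluates them with queries that are mutually
    independent, hence parallelizable."""
    p = {a: 0.5 for a in ground_set}  # include-probabilities
    best_value = 0
    for _ in range(rounds):
        batch = [{a for a in ground_set if random.random() < p[a]}
                 for _ in range(samples_per_round)]
        values = [f(S) for S in batch]  # one parallel round of queries
        top = batch[values.index(max(values))]
        best_value = max(best_value, max(values))
        # Toy update rule: boost elements appearing in the best sample.
        for a in ground_set:
            p[a] = min(1.0, 1.5 * p[a]) if a in top else 0.75 * p[a]
    return best_value

random.seed(0)
val = adaptive_sampling_skeleton(len, set(range(10)), 5, 20)
```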
1.2 Results Overview
We study two closely related lines of work. The first is in the optimization from samples model, where the goal is to understand which guarantees are obtainable when optimizing an objective that is learned from data. The second is in the adaptive complexity model, which
is motivated by results in the optimization from samples model and where we develop faster parallel algorithms for large scale applications in machine learning.
1.2.1 From Predictions to Decisions: Optimization from Samples
The first part of this thesis (Chapter 2) studies optimization from samples. The traditional approach in optimization typically assumes there is an underlying model known to the algorithm designer, and the goal is to optimize an objective function defined through the model. In a routing problem, for example, the model is a weighted graph which encodes roads and their congestion, and the objective is to select a shortest route. In influence maximization, we are given a weighted graph which models the likelihood of individuals forwarding information, and the objective is to select a subset of nodes that maximizes the spread of information [Kempe et al., 2003]. In many applications like influence maximization or routing, we do not actually know the objective functions we wish to optimize since they depend on the behavior of the world generating the model. In such cases, we gather data about the objective function from past observations, such as yesterday's traffic or past information cascades in social networks. A reasonable approach is to use this data to learn a surrogate function that approximates the function generating the data and optimize the surrogate, as illustrated in Figure 1.1. However, this approach performs poorly even for simple examples, as the optima of the learned function may be far from the optima of the true function [Narasimhan et al., 2015]. The sensitivity of optimization to the learning method raises the following question: can we actually optimize objective functions from the training data we use to learn them?
The model of optimization from samples. Answering this question requires synthesizing learning theory and the theory of optimization into a new theory for optimization from sampled data. We suggest the model of optimization from samples in Section 2.1, where the
Figure 1.1: Learning and optimization for influence maximization. The model is a weighted network, the observed data is cascades of information, and the decision is to find the most influential nodes.

input is samples {(S_i, f(S_i))}_{i=1}^m where the sets S_i are drawn from a distribution D, as in PAC-learning. The goal is to solve the problem max_{S ∈ M} f(S) for some function f : 2^N → R and constraint M, as in optimization. If we believe that learning functions from data and then optimizing them leads to desirable outcomes, then functions that are learnable and optimizable when given f should be optimizable from samples.
Main result: there exist functions that are both learnable and optimizable but not optimizable from samples. Somewhat surprisingly, however, we show that this is not the case. Coverage functions are PMAC-learnable [Badanidiyuru et al., 2012a] and approximately optimizable under a cardinality constraint when given full information about f [Nemhauser et al., 1978], but there is no constant factor approximation for maximizing a coverage function under a cardinality constraint when given polynomially-many samples drawn from any distribution (Section 2.3). This result implies that in general there are no guarantees when optimizing a coverage function that is learned from sampled data, which is often the case for applications of maximum coverage in machine learning, mechanism design, and data-mining.
Positive results. For the special cases of functions with bounded curvature and influence in the stochastic block model, which models community structure, we develop optimization
from samples algorithms with constant approximation guarantees (Section 2.2).
Future directions. Optimization from samples is a general framework for decision-making from data. Instead of the maximum coverage problem, one may ask whether other decision problems are approximable from samples. During an internship at Google, we considered the problem of cost sharing when the underlying cooperative game is not known and only samples are observed [Balkanski et al., 2017c]. Reinforcement learning is also another interesting decision problem to explore from samples.
1.2.2 Faster Parallel Algorithms Through Adaptive Sampling
Since sharp impossibility results arise when given samples drawn from any distribution, we turned to an adaptive sampling model (Chapter 3). In adaptive sampling, similarly to active learning, the algorithm obtains multiple batches of samples, where each batch is drawn from a new distribution chosen by the algorithm based on the previous batches. The central question with adaptive sampling is then how many batches of samples are needed for optimizing the function. This question carries strong implications for parallelization since samples that are drawn from the same distribution are non-adaptive and can be evaluated in parallel. This connection to parallelization is formalized by the adaptive complexity model, which we introduced in the context of combinatorial optimization. In this model, the adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute function evaluations in parallel. An algorithm using r batches of samples is hence an r-adaptive algorithm.
Previous algorithms for submodular optimization have linear adaptivity. For the canonical problem of maximizing a monotone submodular function under a cardinality constraint k, the celebrated greedy algorithm which adds at every round the element with
largest marginal contribution achieves a 1 − 1/e approximation [Nemhauser et al., 1978], which is optimal [Feige, 1998]. Although greedy achieves an optimal approximation, it is highly sequential and has k adaptive rounds. Since k ∈ Ω(n) in the worst case, greedy is an Ω(n)-adaptive algorithm. For large scale applications of submodularity in machine learning, such as data summarization, recommendation systems, clustering, and feature selection, we are naturally interested in algorithms with faster parallel runtime. Somewhat surprisingly, until very recently, no constant factor approximation algorithms with sublinear adaptivity were known for this problem (Figure 1.2).

Figure 1.2: The adaptivity landscape for submodular optimization.
Figure 1.3: The adaptivity landscape for submodular optimization including recent results.
Figure 1.4: For gene clustering and movie recommendation applications, adaptive sampling matches the performance of greedy with a significantly smaller number of adaptive rounds.
Main result: an algorithm with logarithmic adaptivity. In Section 3.2, we develop adaptive sampling algorithms with only logarithmically many batches of samples. Our main result is an O(log n)-adaptive algorithm with an approximation arbitrarily close to 1/3 for the problem of maximizing a monotone submodular function under a cardinality constraint (Section 3.2.1). This is an exponential speedup in the parallel runtime for submodular maximization compared to previous constant factor approximation algorithms (Figure 1.3). Furthermore, we show in Section 3.3 that no algorithm can achieve a constant factor approximation in õ(log n) rounds.
Empirical evaluations. In Section 3.2.2 and Section 3.2.4, we analyzed the empirical performance of adaptive sampling algorithms for the following machine learning applications: movie recommendation, taxi dispatch, image summarization, revenue maximization in social networks, and traffic monitoring. These experiments demonstrated that adaptive sampling is a technique that is also powerful in practice. The empirical results showed significant speedups in parallel runtime while obtaining matching performance with the standard greedy algorithm (Figure 1.4).
                            Algorithms                                |                       Hardness
Family        Apx                     Rounds                Section   | Family        Apx                     Rounds               Section
Submodular    Ω̃(n^(-1/4))            1                     2.2.3     | Submodular    O(n^(-1/4+ε))           1                    2.3.2
Influence     constant                1                     2.2.2     | Coverage      O(n^(-1/5+ε))           1                    2.3.3
Curvature c   (1-c)/(1+c-c²) - o(1)   1                     2.2.1     | Curvature c   (1-c)/(1+c-c²) + o(1)   1                    2.3.4
Submodular    1/3 - ε                 O((1/ε²) log n)       3.2.1     | Submodular    O(1/log n)              O(log n/log log n)   3.3
Submodular    1 - 1/e - ε             O((1/ε²) log n)       3.2.3     |
Non-monotone  1/(2e) - ε              O((1/ε²) log² n)      3.2.4     |
Matroid       1 - 1/e - ε             O((1/ε³) log²(n/ε³))  3.2.5     |
Table 1.1: Overview of results. Unless otherwise specified, these results are for monotone submodular maximization under a cardinality constraint. The most significant results are highlighted in bold.
Additional results in the adaptive complexity model. Very recent work by our group and other groups has improved the approximation guarantee to the optimal 1 − 1/e (Section 3.2.3 and [Ene and Nguyen, 2019]), as well as the number of function evaluations [Fahrbach et al., 2019]. Non-monotone functions (Section 3.2.4 and [Ene et al., 2019, Fahrbach et al., 2018]) and more complex constraints (Section 3.2.5 and [Chekuri and Quanrud, 2019a,b, Ene et al., 2019]) have also been studied recently in the adaptive complexity model.
Future directions. At the Broad Institute and the Harvard Molecular and Cellular Biology department, large collections of genes are regularly clustered for DNA analysis. In preliminary experiments, the main runtime bottleneck with this clustering application was the pre- computation of pairwise distances between gene sequences. This raises a new algorithmic challenge of reducing the number of pairwise distance computations while incurring only a small loss in the clustering objective. More generally, there remain many open problems on
adaptivity and submodular optimization.
Organization. Table 1.1 summarizes our results and the organization of this thesis. We reference at the end of each chapter the specific papers covered in that chapter.
1.3 Preliminaries
Submodular optimization. A set function f : 2^N → R maps subsets of elements S ⊆ N, where N is the ground set of n = |N| elements, to a value f(S) ∈ R. The marginal contribution of a set X ⊆ N to a set S ⊆ N is defined as f_S(X) := f(S ∪ X) − f(S).¹
Definition 1. A function f : 2^N → R is submodular if for any sets S ⊆ T ⊆ N and any element a ∈ N \ T, we have f_S(a) ≥ f_T(a).
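On small ground sets, Definition 1 can be checked by brute force. The helper below is an illustrative aid, not part of the thesis: it tests the diminishing-returns inequality f_S(a) ≥ f_T(a) over all nested pairs S ⊆ T.

```python
from itertools import combinations
import math

def is_submodular(f, N):
    """Brute-force check of Definition 1 on a small ground set N."""
    subsets = [set(c) for r in range(len(N) + 1)
               for c in combinations(N, r)]
    for S in subsets:
        for T in subsets:
            if S <= T:
                for a in N - T:
                    # require f_S(a) >= f_T(a)
                    if f(S | {a}) - f(S) < f(T | {a}) - f(T):
                        return False
    return True

# Any concave function of |S| has diminishing returns:
print(is_submodular(lambda S: math.sqrt(len(S)), {1, 2, 3, 4}))  # True
# A convex one does not:
print(is_submodular(lambda S: len(S) ** 2, {1, 2, 3}))  # False
```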
A function is monotone if f(S) ≤ f(T) for all sets S ⊆ T. The canonical problem for submodular optimization is to maximize a monotone submodular function under a cardinality constraint, i.e., find a subset of elements S ⊆ N of size at most k which maximizes f:

max_{S : |S| ≤ k} f(S).
We measure the performance of an algorithm by its approximation guarantee.
Definition 2. Let F be a class of functions. An algorithm obtains an α-approximation for maximizing F under a constraint M if, for all f ∈ F, it finds a feasible solution S ∈ M such that f(S) ≥ α · max_{T ∈ M} f(T).
¹For readability we abuse notation and write a instead of {a} when evaluating a singleton a ∈ N.
It is well-known that constant factor approximation algorithms can be obtained for monotone submodular maximization under a cardinality constraint, but that it is impossible to maximize submodular functions exactly (i.e., α = 1) in general.
Adaptivity. As standard, we assume access to a value oracle for the function such that for any set S ⊆ N the oracle returns f(S) in O(1) time. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when polynomially-many function evaluations can be executed in parallel in each round.
Definition 3. Given a value oracle for f, an algorithm is r-adaptive if every query f(S) for the value of a set S occurs at a round i ∈ [r] such that S is independent of the values f(S′) of all other queries at round i, with at most poly(n) queries at every round.
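Definition 3 is precisely what makes a round parallelizable: since no query in round i depends on the answer to any other query of round i, a whole batch can be dispatched at once. A minimal sketch with a fixed query schedule (the helper name and thread-pool choice are illustrative; in an actual r-adaptive algorithm, round i's batch would be computed from the answers of rounds 1, ..., i − 1):

```python
from concurrent.futures import ThreadPoolExecutor

def run_adaptive_rounds(f, query_batches):
    """query_batches[i] lists the sets queried in round i. Queries
    within a round are independent (Definition 3) and run in
    parallel; the rounds themselves run sequentially."""
    results = []
    with ThreadPoolExecutor() as pool:
        for batch in query_batches:          # r sequential rounds
            # all queries in this round are issued concurrently
            results.append(list(pool.map(f, batch)))
    return results

# A 2-adaptive schedule with 3 parallel queries per round,
# using the cardinality oracle f = len:
batches = [[{1}, {2}, {3}], [{1, 2}, {1, 3}, {2, 3}]]
print(run_adaptive_rounds(len, batches))  # [[1, 1, 1], [2, 2, 2]]
```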
In the next section, we discuss adaptivity and parallel computing.
Other classes of functions. In this thesis, we also consider classes of functions other than submodular. Coverage functions are the canonical example of submodular functions.
Definition 4. A function is called coverage if there exists a family of sets T_1, ..., T_n that covers subsets of a universe U with weights w(a_j) for a_j ∈ U such that for all S, f(S) = Σ_{a_j ∈ ∪_{i ∈ S} T_i} w(a_j). A coverage function is polynomial-sized if the universe is of polynomial size in n. Influence maximization is a generalization of maximizing coverage functions under a cardinality constraint.
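Definition 4 is straightforward to instantiate. In the sketch below, the family T_1, ..., T_n, the universe, and the weights are toy values chosen for illustration:

```python
def make_coverage(T, w):
    """Build the coverage function f(S) = sum of w(a) over universe
    elements a covered by the union of T_i for i in S (Definition 4)."""
    def f(S):
        covered = set().union(*[T[i] for i in S])
        return sum(w[a] for a in covered)
    return f

# Toy instance over the universe {u1, u2, u3, u4}.
T = {1: {"u1", "u2"}, 2: {"u2", "u3"}, 3: {"u4"}}
w = {"u1": 1.0, "u2": 2.0, "u3": 1.0, "u4": 3.0}
f = make_coverage(T, w)
print(f({1, 2}))     # 4.0: covers u1, u2, u3
print(f({1, 2, 3}))  # 7.0: covers the whole universe
```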
An important property of submodular functions that has been heavily explored recently
is that of curvature. Informally, the curvature is a measure of how far the function is from being modular.
Definition 5. A function f has curvature c ∈ [0, 1] if f_S(a) ≥ (1 − c)f(a) for any set S ⊆ N and element a ∈ N.
If c = 0, then the function f is modular, i.e., f(S) = Σ_{a ∈ S} f(a).

1.4 Discussion About Adaptivity
1.4.1 Adaptivity in Other Areas
Adaptivity has been heavily studied across a wide spectrum of areas in computer science. These areas include classical problems in theoretical computer science such as sorting and selection (e.g. [Valiant, 1975, Cole, 1988, Braverman et al., 2016]), where adaptivity is known under the term of parallel algorithms, and communication complexity (e.g. [Papadimitriou and Sipser, 1984, Duris et al., 1984, Nisan and Wigderson, 1991, Miltersen et al., 1995, Dobzinski et al., 2014, Alon et al., 2015]), where the number of rounds measures how much interaction is needed for a communication protocol. For the multi-armed bandits problem, the relationship of interest is between adaptivity and query complexity, instead of adaptivity and approximation guarantee. Recent work showed that Θ(log* n) adaptive rounds are necessary and sufficient to obtain the optimal worst case query complexity [Agarwal et al., 2017]. In the bandits setting, adaptivity is necessary to obtain non-trivial query complexity due to the noisy outcomes of the queries. In contrast, queries in submodular optimization are deterministic, and adaptivity is necessary to obtain a non-trivial approximation since there are at most polynomially many queries per round and the function is of exponential size. Adaptivity is also well-studied for the problems of sparse recovery (e.g. [Haupt et al., 2009a, Indyk et al., 2011, Haupt et al., 2009b, Ji et al., 2008, Malioutov et al., 2008, Aldroubi et al., 2008]) and property testing (e.g. [Canonne and Gur, 2017, Buhrman et al., 2012, Chen et al., 2017, Raskhodnikova and Smith, 2006, Servedio et al., 2015]). In these areas, it has been shown that adaptivity allows significant improvements compared to the non-adaptive setting, which is similar to the results shown in this thesis for submodular optimization. However, in contrast to all these areas, adaptivity had not been previously studied in the context of submodular optimization.
We note that the term adaptive submodular maximization has been previously used, but in an unrelated setting where the goal is to compute a policy which iteratively picks elements one by one, which, when picked, reveal stochastic feedback about the environment [Golovin and Krause, 2010].
1.4.2 Related Models of Parallel Computation
In this section, we discuss two related models, the Map-Reduce model for distributed computation and the PRAM model. These models are compared to the notion of adaptivity in the context of submodular optimization.
Map-Reduce
The problem of distributed submodular optimization has been extensively studied in the Map-Reduce model in the past decade. This framework is primarily motivated by large scale problems over massive data sets. At a high level, in the Map-Reduce framework [Dean and Ghemawat, 2008], an algorithm proceeds in multiple Map-Reduce rounds, where each round consists of a first step where the input to the algorithm is partitioned to be independently processed on different machines and of a second step where the outputs of this processing are merged. Notice that the notion of rounds in Map-Reduce is different than for adaptivity, where one round of Map-Reduce usually consists of multiple adaptive rounds. The formal model of [Karloff et al., 2010] for Map-Reduce requires the number of machines and their memory to be sublinear. This framework for distributing the input to multiple machines with sublinear memory is designed to tackle issues related to massive data sets. Such data sets are too large to either fit or be processed by a single machine, and the Map-Reduce framework formally models this need to distribute such inputs to multiple machines. Instead of addressing distributed challenges, adaptivity addresses the issue of sequentiality,
where each query evaluation requires a long time to complete and where these evaluations can be parallelized (see Section 1.4.3 for applications). In other words, while Map-Reduce
addresses the horizontal challenge of large scale problems, adaptivity addresses an orthogonal vertical challenge where long query-evaluation time is the main runtime bottleneck. A long line of work has studied problems related to submodular maximization in Map-Reduce, achieving different improvements on parameters such as the number of Map-Reduce rounds, the communication complexity, the approximation ratio, the family of functions, and the family of constraints (e.g. [Kumar et al., 2015a, Mirzasoleiman et al., 2013, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2015b, Barbosa et al., 2015, 2016, Epasto et al., 2017]). To the best of our knowledge, all the existing Map-Reduce algorithms for submodular optimization have adaptivity that is linear in n in the worst case, which is exponentially larger than the adaptivity of our algorithm. This high adaptivity is caused by the distributed algorithms which are run on each machine. These algorithms are variants of the greedy algorithm and thus have adaptivity at least linear in k. We also note that our algorithm does not (at least trivially) carry over to the Map-Reduce setting.
PRAM
In the PRAM model, the notion of depth is closely related to the concept of adaptivity studied in this thesis. Our positive result extends to the PRAM model, showing that there is an $\tilde{O}(\log^2 n \cdot d_f)$ depth algorithm with $\tilde{O}(nk^2)$ work whose approximation is arbitrarily close to $1/3$ for maximizing any monotone submodular function under a cardinality constraint, where $d_f$ is the depth required to evaluate the function on a set. The PRAM model is a generalization of the RAM model with parallelization; it is an idealized model of a shared memory machine with any number of processors which can
execute instructions in parallel. The depth of a PRAM algorithm is the number of parallel steps of this algorithm on the PRAM, in other words, it is the longest chain of dependencies
of the algorithm, including operations which are not necessarily queries. The problem of designing low-depth algorithms has been heavily studied (e.g. [Blelloch, 1996, Blelloch et al., 2011, Berger et al., 1989, Rajagopalan and Vazirani, 1998, Blelloch and Reid-Miller, 1998, Blelloch et al., 2012]). Thus, in addition to the number of adaptive rounds of querying, depth also measures the number of adaptive steps of the algorithm which are not queries. However, for the applications we consider, the runtime of the algorithmic computations which are not queries is usually insignificant compared to the time to evaluate a query. In addition, the PRAM model assumes that the input is loaded in memory, while we consider the value query model where the algorithm is given oracle access to a function of potentially exponential size. In crowdsourcing applications, for example, where the value of a set can be queried on a crowdsourcing platform, there does not necessarily exist a succinct representation of the underlying function. Our positive results extend with an additional $d_f \cdot \tilde{O}(\log n)$ factor in the depth compared
to the number of adaptive rounds, where $d_f$ is the depth required to evaluate the function on a set in the PRAM model. The operations that our algorithms perform at every round, which are maximum, summation, set union, and set difference over an input of size at most quasilinear, can all be executed by algorithms with logarithmic depth. A simple divide-and-conquer approach suffices for maximum and summation, while logarithmic depth for set union and set difference can be achieved with treaps [Blelloch and Reid-Miller, 1998].
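As an illustration of the divide-and-conquer reduction for maximum, here is a small Python sketch (ours, not from the thesis) that tracks the recursion depth explicitly; all comparisons at the same level are independent, so on a PRAM they would execute in one parallel step:

```python
import math

def reduce_max(values):
    """Pairwise divide-and-conquer maximum.

    Returns (maximum, depth). Comparisons at the same recursion level
    are independent of each other, so on a PRAM they execute in one
    parallel step and the total depth is ceil(log2(len(values))).
    """
    if len(values) == 1:
        return values[0], 0
    mid = len(values) // 2
    left_max, left_depth = reduce_max(values[:mid])
    right_max, right_depth = reduce_max(values[mid:])
    return max(left_max, right_max), 1 + max(left_depth, right_depth)

maximum, depth = reduce_max(list(range(1000)))
assert maximum == 999 and depth == math.ceil(math.log2(1000))
```

The same pairwise-combination pattern gives logarithmic depth for summation by replacing `max` with `+`.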
1.4.3 Applications of Adaptivity
Beyond being a fundamental concept, adaptivity is important for applications where sequentiality is the main runtime bottleneck.
Crowdsourcing and data summarization. One class of problems where adaptivity plays an important role is human-in-the-loop problems. At a high level, these algorithms involve subtasks performed by the crowd. The intervention of humans in the evaluation of queries makes algorithms with a large number of adaptive rounds impractical. A crowdsourcing platform consists of posted tasks and crowdworkers who are remunerated for performing these posted tasks. For the submodular problem of data summarization, where the objective is to select a small representative subset of a dataset, the quality of subsets as representatives can be evaluated on a crowdsourcing platform [Tschiatschek et al., 2014, Singla et al., 2016, Braverman et al., 2016]. The algorithm must wait to obtain the feedback from the crowdworkers; however, an algorithm can send out a large number of tasks to be performed simultaneously by different crowdworkers.
Biological simulations. Adaptivity is also studied in molecular biology to simulate protein folding. Adaptive sampling techniques are used to obtain significant improvements in execution of simulations and discovery of low energy states [Bowman et al., 2011].
Experimental Design. In experimental design, the goal is to pick a collection of entities (e.g. subjects, chemical elements, data points) which obtains the best outcome when combined for an experiment. Experiments can be run in parallel and have a waiting time to observe the outcome [Frazier et al., 2010].
Influence Maximization. The submodular problem of influence maximization, initially studied by Domingos and Richardson [2001a], Richardson and Domingos [2002], and Kempe et al. [2003], has since received considerable attention (e.g. [Chen et al., 2009, 2010, Goyal et al., 2011, Seeman and Singer, 2013, Horel and Singer, 2015, Badanidiyuru et al., 2016]). Influence maximization consists of finding the most influential nodes in a social network to maximize the spread of information in this network. Information does not spread instantly,
and a waiting time occurs when observing the total number of nodes influenced by some seed set of nodes.
Advertising. In advertising, the goal is to select the optimal subset of advertisement slots to maximize objectives such as the click-through rate or the number of products purchased by customers, objectives which exhibit diminishing returns [Alaei and Malekian, 2010, Devanur et al., 2016]. Naturally, a waiting time is incurred to observe the behavior of customers.
Chapter 2
Non-Adaptive Optimization
A natural starting point to study the number of rounds of adaptivity needed for optimization is to consider the case of a single round of adaptivity, which corresponds to non-adaptive algorithms. Non-adaptive algorithms may perform polynomially many function evaluations, but a function evaluation cannot depend on the outcome of another function evaluation. Even though the question of what is achievable in a single round of adaptivity is interesting in its own right, the main motivation for studying non-adaptivity is its role in the subtle interplay between machine learning and optimization, and specifically in data-driven optimization. We begin by discussing data-driven optimization through a novel model of optimization from samples. In Section 2.1.2, we explain why optimization from samples plays a crucial role for adaptivity and establish a formal connection between these two models.
2.1 From Predictions to Decisions: Optimization from Samples
The traditional approach in optimization typically assumes there is an underlying model known to the algorithm designer, and the goal is to optimize an objective function defined
[Figure 2.1: Learning and optimization illustrated for the problem of influence maximization. The model is a network, the observed data is cascades of information, and the decision is to find the most influential nodes.]
through the model. In a routing problem, for example, the model can be a weighted graph which encodes roads and their congestion, and the objective is to select a route that minimizes expected travel time from source to destination. In influence maximization, we are given a weighted graph which models the likelihood of individuals forwarding information, and the objective is to select a subset of nodes to spread information and maximize the expected number of nodes that receive information [Kempe et al., 2003]. In many applications like influence maximization or routing, we do not actually know the objective functions we wish to optimize since they depend on the behavior of the world generating the model. In such cases, we gather information about the objective function from past observations. A reasonable approach is to learn a surrogate function that approximates the function generating the data (e.g. [Daneshmand et al., 2014, Du et al., 2013, Gomez-Rodriguez et al., 2010]) and optimize the surrogate, as illustrated in Figure 2.1. In routing, we may observe traffic, fit weights to a graph that represents congestion times, and optimize for the shortest path on the weighted graph learned from data. In influence maximization, we can observe information spreading in a social network, fit weights to a graph that encodes the influence model, and optimize for the k most influential nodes. But what guarantees do we have with this approach?

One problem with optimizing a surrogate learned from data is that it may be inapproximable. For a problem like influence maximization, for example, even if a surrogate $\tilde{f} : 2^N \to \mathbb{R}$ approximates a submodular influence function $f : 2^N \to \mathbb{R}$ within a factor of $(1 \pm \epsilon)$ for sub-constant $\epsilon > 0$, in general there is no polynomial-time algorithm that can obtain a reasonable approximation to $\max_{S : |S| \le k} f(S)$ or $\max_{S : |S| \le k} \tilde{f}(S)$ [Hassidim and Singer, 2015]. A different concern is that the function learned from data may be approximable (e.g. if the surrogate remains submodular), but its optima are very far from the optima of the function generating the data. In influence maximization, even if the weights of the graph are learned within a factor of $(1 \pm \epsilon)$ for sub-constant $\epsilon > 0$, the optima of the surrogate may be a poor approximation to the true optimum [Narasimhan et al., 2015, He and Kempe, 2016]. The sensitivity of optimization to the nuances of the learning method therefore raises the following question:
Can we actually optimize objective functions from the training data we use to learn them?
Doing so requires synthesizing learning theory and the theory of optimization into a new theory for optimization from sampled data. Given sampled data, can we optimize functions for which we know how to predict from data and for which we know how to make decisions from a model? Such a theory has applications well beyond influence maximization or routing, as there are numerous examples where we make decisions from data and seek good outcomes.
In auctions, for example, the auctioneer aims to decide on prices or on an allocation of goods to buyers whose complex valuations are not known but may be inferred from data [Dobzinski
and Schapira, 2006, Lehmann et al., 2001]. In the learning to rank framework in information retrieval, an algorithm receives samples of ranked documents and the goal is to select the k documents that are the most relevant to a query [Kumar et al., 2015b]. In recommendation systems, optimal tagging problems seek to pick k tags for new content to maximize incoming tra c [Rosenfeld and Globerson, 2016].
[Figure 2.2: Optimization from samples illustrated for the problem of influence maximization.]
2.1.1 The Optimization from Samples Model
To formulate the notion of data-driven optimization, we suggest the model of optimization from samples (OPS). In optimization from samples, the input is samples $\{(S_i, f(S_i))\}_{i=1}^m$ where the sets $S_i$ are drawn from a distribution $\mathcal{D}$, as in learning. The goal is to solve the problem $\max_{S \in \mathcal{M}} f(S)$ for some constraint $\mathcal{M}$, as in optimization. More formally:

Definition 6. A class of functions $\mathcal{F} : 2^N \to \mathbb{R}$ is $\alpha$-optimizable in $\mathcal{M}$ from samples over distribution $\mathcal{D}$ if there exists a (not necessarily polynomial time) algorithm whose input is a set of samples $\{(S_i, f(S_i))\}_{i=1}^m$, where $f \in \mathcal{F}$ and $S_i$ is drawn i.i.d. from $\mathcal{D}$, and which returns $S \in \mathcal{M}$ such that
$$\Pr_{S_1, \dots, S_m \sim \mathcal{D}} \left[ f(S) \ge \alpha \cdot \max_{T \in \mathcal{M}} f(T) \right] \ge 1 - \delta,$$
where $m \in \mathrm{poly}(|N|)$ and $\delta \in [0, 1)$ is a constant.
This framework synthesizes the standard notions of learnability and optimizability, as
illustrated in Figure 2.2. The standard notion of learnability for set functions is PMAC-learnability, which is a generalization of the well-known PAC-learnability model. The standard notion of optimizability is that of APX. A class of functions and a constraint are in APX if, given value query access to the function (given $S$ the oracle returns $f(S)$), there is a polynomial-time algorithm that yields a constant-factor approximation for optimizing a function in that class under the constraint.
An algorithm with the above guarantees is an $\alpha$-OPS algorithm. In this chapter we focus on the simplest constraint, where $\mathcal{M} = \{S \subseteq N : |S| \le k\}$ is a cardinality constraint. For a class of functions $\mathcal{F}$, we say that optimization from samples is possible when there exist some constant $\alpha \in (0, 1]$ and some distribution $\mathcal{D}$ such that $\mathcal{F}$ is $\alpha$-optimizable from samples over $\mathcal{D}$ in $\mathcal{M} = \{S : |S| \le k\}$. The following points are worth noting:
• Optimization from samples is defined per distribution. Note that if we demand optimization from samples to hold on all distributions, then trivially no function would be optimizable from samples (e.g. for the distribution which always returns the empty set);
• Optimization from samples seeks to approximate the global optimum. In learning, we evaluate a hypothesis on the same distribution we use to train it since it enables making a prediction about events that are similar to those observed. For optimization it is trivial to be competitive against a sample by simply selecting the feasible solution with maximal value from the set of samples observed. Since an optimization algorithm has the power to select any solution, the hope is that polynomially many samples contain enough information for optimizing the function. In influence maximization, for example, we are interested in selecting a set of influencers, even if we did not observe a set of highly influential individuals that initiate a cascade together.
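To make the second point concrete, the trivial benchmark of being competitive against the samples themselves can be sketched in a few lines (an illustration of ours; the function name and sample format are assumptions):

```python
def best_observed_sample(samples, k):
    """Trivial baseline: return the best feasible set observed so far.

    `samples` is a list of (set, value) pairs and feasibility is |S| <= k.
    This is competitive against the samples, but gives no guarantee
    against the global optimum, which may never appear in the samples.
    """
    feasible = [(S, v) for S, v in samples if len(S) <= k]
    best_set, _ = max(feasible, key=lambda sv: sv[1])
    return best_set

samples = [({1, 2}, 5.0), ({3}, 2.0), ({1, 2, 3, 4}, 9.0)]
assert best_observed_sample(samples, k=2) == {1, 2}
```

The interesting question is precisely when an algorithm can do better than this baseline and compete with the global optimum.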
Optimization from samples is particularly interesting when functions are learnable and optimizable.
• APX-Optimizability. We are interested in functions $f : 2^N \to \mathbb{R}$ and constraints $\mathcal{M}$ such that, given access to a value oracle (given $S$ the oracle returns $f(S)$), there exists a constant factor approximation algorithm for $\max_{S \in \mathcal{M}} f(S)$. For this purpose, monotone submodular functions are a convenient class to work with, where the canonical problem is $\max_{|S| \le k} f(S)$. It is well known that there is a $1 - 1/e$ approximation algorithm for this problem [Nemhauser et al., 1978] and that this is tight using polynomially many value queries [Feige, 1998]. Influence maximization is an example of maximizing a monotone submodular function under a cardinality constraint [Kempe et al., 2003].
• PMAC-learnability. The standard framework in the literature for learning set functions is Probably Mostly Approximately Correct ($\alpha$-PMAC) learnability due to Balcan and Harvey [2011]. This framework nicely generalizes Valiant's notion of Probably Approximately Correct (PAC) learnability [Valiant, 1984]. Informally, PMAC-learnability guarantees that after observing polynomially many samples of sets and their function values, one can construct a surrogate function that is likely to, $\alpha$-approximately, mimic the behavior of the function observed from the samples (see the full version of the paper for formal definitions). Since the seminal paper of Balcan and Harvey, there has been a great deal of work on learnability of submodular functions [Feldman and Kothari, 2014, Balcan et al., 2012, Badanidiyuru et al., Feldman and Vondrák, 2013, Feldman and Vondrák, 2015, Balcan, 2015].
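The greedy algorithm behind the $1 - 1/e$ guarantee mentioned in the optimizability bullet can be sketched as follows (a standard textbook sketch, not code from the thesis), here applied to a small coverage function:

```python
def greedy(ground_set, f, k):
    """Greedy for max f(S) subject to |S| <= k, with f monotone submodular.

    Gives a (1 - 1/e)-approximation [Nemhauser et al., 1978]. Note the
    k sequential rounds: each selection depends on the previous ones,
    which is exactly the high adaptivity discussed in this thesis.
    """
    S = set()
    for _ in range(k):
        a = max(ground_set - S, key=lambda e: f(S | {e}) - f(S))
        S.add(a)
    return S

# Coverage function: f(S) is the size of the union of the covered sets.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
f = lambda S: len(set().union(*(cover[i] for i in S))) if S else 0
assert f(greedy(set(cover), f, k=2)) == 3
```

In the value oracle model each call `f(S | {e})` is a query; the loop makes its queries in k adaptive batches, one per selected element.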
2.1.2 Optimization from Samples is Equivalent to Non-Adaptivity
Non-adaptivity corresponds to one round of adaptivity. Informally, a non-adaptive algorithm may perform polynomially many function evaluations, but a function evaluation cannot depend on the outcome of another function evaluation.
Definition 7. An algorithm is non-adaptive if every query $f(S)$ for the value of a set $S$ is independent of the values $f(S')$ of all other queries $S'$ made by the algorithm, with at most poly($n$) queries in total.
An important observation is that optimization from samples algorithms are non-adaptive. The algorithm observes samples, but these samples are drawn independently from the same distribution. Perhaps less intuitive is the fact that any non-adaptive algorithm can be used to construct an optimization from samples algorithm. We formalize the equivalence of these two notions with the following theorem. For the remainder of this chapter, we consider the optimization from samples model. However, all the results extend to the non-adaptive model.
Theorem 2.1.1. For any class of functions $\mathcal{F}$, there exists a distribution $\mathcal{D}$ such that $\mathcal{F}$ is $\alpha$-optimizable from samples if and only if there exists a non-adaptive algorithm that obtains, with probability $1 - \delta$, an $\alpha$-approximation for optimizing $\mathcal{F}$, for any constant $\delta > 0$.
Proof. We first show that if there exists a distribution $\mathcal{D}$ such that $\mathcal{F}$ is $\alpha$-optimizable from samples with algorithm $\mathcal{A}_{\text{samples}}$, then there exists a non-adaptive algorithm $\mathcal{A}_{\text{non-adaptive}}$ that obtains, with probability $1 - \delta$, an $\alpha$-approximation. Let $m$ be the number of samples from $\mathcal{D}$ required by $\mathcal{A}_{\text{samples}}$.

We consider the algorithm $\mathcal{A}_{\text{non-adaptive}}$ which first samples $m$ sets $S_i \sim \mathcal{D}$ i.i.d. and queries these $m$ sets to the oracle. Note that these are non-adaptive queries since they are drawn from the same distribution $\mathcal{D}$. Given the values $f(S_i)$ of these queries, $\mathcal{A}_{\text{non-adaptive}}$ then mimics $\mathcal{A}_{\text{samples}}$ and returns the same set $S$ as $\mathcal{A}_{\text{samples}}$. Since $\mathcal{A}_{\text{samples}}$ is an $\alpha$-optimizable from samples algorithm, the set $S$ returned by $\mathcal{A}_{\text{non-adaptive}}$ is an $\alpha$-approximate solution with probability $1 - \delta$.

For the reverse direction, consider a non-adaptive algorithm $\mathcal{A}_{\text{non-adaptive}}$ that obtains, with probability $1 - \delta$, an $\alpha$-approximation. Let $\mathcal{S} = \{S_i\}_{i=1}^m$ be the collection of sets non-adaptively queried by $\mathcal{A}_{\text{non-adaptive}}$. Let $\mathcal{D}$ be the uniform distribution over $\mathcal{S}$ and let $m'$ be such that with $m'$ i.i.d. samples $\mathcal{S}'$ from $\mathcal{D}$, with probability $1 - \delta$, every set $S \in \mathcal{S}$ is also in $\mathcal{S}'$. Let $\mathcal{A}_{\text{samples}}$ be the algorithm that, if every set $S \in \mathcal{S}$ is also in $\mathcal{S}'$, mimics $\mathcal{A}_{\text{non-adaptive}}$ and returns the same solution $S$, and otherwise returns an arbitrary solution. By the guarantee of $\mathcal{A}_{\text{non-adaptive}}$, $\mathcal{A}_{\text{samples}}$ obtains an $\alpha$-approximate solution with probability $1 - \delta$ if every set $S \in \mathcal{S}$ is also in $\mathcal{S}'$.
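The first direction of the proof can be sketched programmatically (an illustration of ours; all names are assumptions): the non-adaptive algorithm fixes every query set before seeing any value, issues them as one batch, and then hands the (set, value) pairs to the OPS algorithm.

```python
import random

def non_adaptive_from_ops(ops_algorithm, sample_from_D, oracle, m):
    """Simulate an OPS algorithm non-adaptively: all m query sets are
    drawn i.i.d. from D up front, so no query depends on the value
    returned for any other query (one parallel batch of queries)."""
    query_sets = [sample_from_D() for _ in range(m)]
    samples = [(S, oracle(S)) for S in query_sets]
    return ops_algorithm(samples)

# Toy illustration: modular oracle, D uniform over pairs, and an OPS
# algorithm that simply returns the best observed sample.
random.seed(0)
oracle = lambda S: sum(S)
sample_from_D = lambda: frozenset(random.sample(range(6), 2))
ops = lambda samples: max(samples, key=lambda sv: sv[1])[0]
S = non_adaptive_from_ops(ops, sample_from_D, oracle, m=20)
assert len(S) == 2 and sum(S) <= 4 + 5
```

The reverse direction is the subtler one: the distribution must be engineered (uniform over the non-adaptive algorithm's query sets) so that enough samples contain every query with high probability.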
2.1.3 Overview of Results
We give an overview of the results presented in this chapter.
Main Result
If we believe that learning functions from data and then optimizing them leads to desirable
outcomes, then functions that are PMAC-learnable and in APX should be optimizable from samples. Somewhat surprisingly, however, we show that this is not the case. There is an
interesting class of functions, called coverage functions, that is PMAC-learnable and in APX but cannot be optimized from samples. Coverage functions are a canonical example of monotone submodular functions and are hence optimizable. In terms of learnability, for any constant $\epsilon > 0$, coverage functions are $(1-\epsilon)$-PMAC learnable over any distribution [Badanidiyuru et al.], unlike monotone submodular functions, which are generally not PMAC learnable [Balcan and Harvey, 2011]. In Section 2.3.3, we show that there is no constant factor approximation
for maximizing a coverage function using polynomially-many samples drawn from any distribution. Coverage functions are heavily used in mechanism design [Dobzinski and Schapira, 2006, Lehmann et al., 2001], machine learning [Guestrin et al., 2005, Swaminathan et al., 2009], data-mining [Chierichetti et al., 2010, Du et al., 2014b], privacy [Feldman and Kothari, 2014, Gupta et al., 2013], as well as influence maximization [Kempe et al., 2003, Seeman and Singer, 2013]. In many of these applications, the functions are learned from data and the goal is to optimize the function under a cardinality constraint.
Algorithms for Optimization from Samples
Despite the main result being an impossibility, there are classes of functions and distributions for which optimization from samples is possible.
Curvature. We show in Section 2.2.1 that for any monotone submodular function with curvature $c$ there is a $(1-c)/(1+c-c^2)$ approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. The curvature assumption is crucial, as the above impossibility result shows that no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable.
Influence maximization. We then consider in Section 2.2.2 the canonical problem of influence maximization in social networks. Since the seminal work of Kempe et al. [2003], there have been two largely disjoint efforts on this problem. The first studies the problem associated with learning the parameters of the generative influence model. The second focuses on the algorithmic challenge of identifying a set of influencers, assuming the parameters of the generative model are known. The above impossibility result implies that in general, if the generative model is not known but rather learned from training data, no algorithm can yield a constant factor approximation guarantee using polynomially-many samples drawn from any distribution. We design a simple heuristic that overcomes this negative result in practice by leveraging the strong community structure of social networks. Although in general the approximation guarantee of our algorithm is necessarily unbounded, we show that this algorithm performs well experimentally. To justify its performance, we prove that our algorithm obtains a constant factor approximation guarantee on graphs generated through the stochastic block model, traditionally used to model networks with community structure.
Submodular maximization. For general monotone submodular functions, we develop an $\tilde{\Omega}(n^{-1/4})$ optimization from samples algorithm over some distribution $\mathcal{D}$. This bound is essentially tight since submodular functions are not $n^{-1/4+\epsilon}$-optimizable from samples over any distribution (see Section 2.3.2).

Table 2.1: Overview of results for non-adaptive optimization. Unless otherwise specified, these results are for monotone submodular maximization under a cardinality constraint. The most significant result is highlighted in bold.

| Algorithms: Family | Apx | Rounds | Section | Hardness: Family | Apx | Rounds | Section |
| Submodular | $\tilde{\Omega}(n^{-1/4})$ | 1 | 2.2.3 | Submodular | $O(n^{-1/4+\epsilon})$ | 1 | 2.3.2 |
| Influence | constant | 1 | 2.2.2 | Coverage | $O(n^{-1/5+\epsilon})$ | 1 | 2.3.3 |
| Curvature | $(1-c)/(1+c-c^2) - o(1)$ | 1 | 2.2.1 | Curvature | $(1-c)/(1+c-c^2) + o(1)$ | 1 | 2.3.4 |
Matching Lower Bounds
In Section 2.3.2 and Section 2.3.4, we show that the optimization from samples algorithms for submodular functions and functions with curvature are optimal. We show that, up to lower order terms, no algorithm can achieve a better approximation than the one obtained by these algorithms. A summary of the results for this chapter is provided in Table 2.1.
2.2 Optimization from Samples Algorithms
In this section, we describe optimization from samples algorithms for three important classes of functions. By Theorem 2.1.1, these are non-adaptive algorithms with guarantees that extend to the non-adaptive model described in Definition 7. These algorithms are for the problem of maximizing a function $f$ in a class of functions $\mathcal{F}$ under a cardinality constraint $k$. In Section 2.2.1, for the class of monotone submodular functions with curvature $c$, we give a $(1-c)/(1+c-c^2) - o(1)$ approximation algorithm for optimization from samples from the uniform distribution over feasible sets, which we show is tight in Section 2.3.4. In Section 2.2.2, we show that despite the general negative result implied by Section 2.3.3 for influence maximization from samples, it is possible to obtain constant approximation guarantees when the underlying network exhibits a community structure. Finally, in Section 2.2.3, we show an $\tilde{\Omega}(n^{-1/4})$ optimization from samples algorithm for general monotone submodular functions, which we show is tight up to lower order terms in Section 2.3.2.
2.2.1 Curvature
In this section, we consider the problem of optimization from samples of monotone submodular functions with bounded curvature. We show that for any monotone submodular function with curvature $c$ there is a $(1-c)/(1+c-c^2)$ approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. In Section 2.3.4, we show that this algorithm is optimal. That is, for any $c < 1$, there exists a submodular function with curvature $c$ for which no algorithm can achieve a better approximation. The curvature assumption is crucial, as for general monotone submodular functions no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable (Section 2.3). In practice, however, the functions we aim to optimize may be better behaved. An important property of submodular functions that has been heavily explored recently is that of curvature. Informally, the curvature is a measure of how far the function is from being modular.
Definition 8. A function $f$ has curvature $c \in [0, 1]$ if $f_S(a) \ge (1 - c) f(a)$ for any set $S \subseteq N$ and element $a \in N$.
If $c = 0$, then the function $f$ is modular, i.e., $f(S) = \sum_{a \in S} f(a)$. Curvature plays an important role since the hard instances of submodular optimization often occur only when the curvature is unbounded, i.e., $c$ close to 1. The hardness results for optimization from samples are no different, and apply when the curvature is unbounded. The curvature assumption has applications in problems such as maximum entropy sampling [Sviridenko et al., 2015], column-subset selection [Sviridenko et al., 2015], and submodular welfare [Vondrák, 2010].
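To make Definition 8 concrete, curvature can be computed by brute force on tiny ground sets (a sketch of ours, exponential time, assuming $f(\{a\}) > 0$ for all $a$): a modular function comes out with $c = 0$, while overlapping coverage sets force $c > 0$.

```python
from itertools import combinations

def curvature(N, f):
    """Brute-force curvature: the smallest c with f_S(a) >= (1-c) f(a),
    i.e. c = 1 - min over a and S (with a not in S) of f_S(a) / f(a).
    Exponential time, so only meant as a sanity check on tiny examples;
    assumes f({a}) > 0 for every element a.
    """
    min_ratio = 1.0
    for a in N:
        rest = [e for e in N if e != a]
        for r in range(len(rest) + 1):
            for T in combinations(rest, r):
                S = set(T)
                marginal = f(S | {a}) - f(S)  # f_S(a)
                min_ratio = min(min_ratio, marginal / f({a}))
    return 1 - min_ratio

# A modular function has curvature 0 ...
weights = {0: 1.0, 1: 2.0, 2: 3.0}
modular = lambda S: sum(weights[e] for e in S)
assert curvature(set(weights), modular) == 0

# ... while a coverage function with overlapping sets does not.
cover = {0: {"a", "b"}, 1: {"b", "c"}}
coverage = lambda S: len(set().union(*(cover[e] for e in S))) if S else 0
assert abs(curvature(set(cover), coverage) - 0.5) < 1e-9
```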
Related work on curvature. In the value oracle model, the greedy algorithm is a $(1 - e^{-c})/c$ approximation algorithm for cardinality constraints [Conforti and Cornuéjols, 1984]. Recently, Sviridenko et al. [2015] improved this approximation to $1 - c/e$ with variants of the continuous greedy and local search algorithms, which was shown to be tight. Submodular optimization and curvature have also been studied for more general constraints [Vondrák, 2010, Iyer and Bilmes, 2013] and for submodular minimization [Iyer et al., 2013].
The algorithm. Algorithm 1 first estimates the expected marginal contribution of each element $e_i$ to a uniformly random set of size $k-1$, which we denote by $R$ for the remainder of this section. These expected marginal contributions $\mathbb{E}_R[f_R(e_i)]$ are estimated with $\hat{v}_i$. The estimates $\hat{v}_i$ are the differences between the average value $\mathrm{avg}(\mathcal{S}_{k,i}) := (\sum_{T \in \mathcal{S}_{k,i}} f(T))/|\mathcal{S}_{k,i}|$ of the collection $\mathcal{S}_{k,i}$ of samples of size $k$ containing $e_i$ and the average value of the collection $\mathcal{S}_{k-1,-i}$ of samples of size $k-1$ not containing $e_i$. We then wish to return the best set between the random set $R$ and the set $S$ consisting of the $k$ elements with the largest estimates $\hat{v}_i$. Since we do not know the value of $S$, we lower bound it with $\hat{v}_S$ using the curvature property. We estimate the expected value $\mathbb{E}_R[f(R)]$ of $R$ with $\hat{v}_R$, which is the average value of the collection $\mathcal{S}_{k-1}$ of all samples of size $k-1$. Finally, we compare the values of $S$ and $R$ using $\hat{v}_S$ and $\hat{v}_R$ to return the best of these two sets.
Algorithm 1 A tight $(1-c)/(1+c-c^2) - o(1)$ optimization from samples algorithm for monotone submodular functions with curvature $c$

Input: samples $\{(S_i, f(S_i))\}$ where $S_i \sim \mathcal{U}$, the uniform distribution over feasible sets

1: $\hat{v}_i \leftarrow \mathrm{avg}(\mathcal{S}_{k,i}) - \mathrm{avg}(\mathcal{S}_{k-1,-i})$
2: $S \leftarrow \operatorname{argmax}_{|T| = k} \sum_{i \in T} \hat{v}_i$
3: $\hat{v}_S \leftarrow (1-c) \sum_{e_i \in S} \hat{v}_i$  ▷ a lower bound on the value of $f(S)$
4: $\hat{v}_R \leftarrow \mathrm{avg}(\mathcal{S}_{k-1})$  ▷ an estimate of the value of a random set $R$
5: if $\hat{v}_S \ge \hat{v}_R$ then
6:   return $S$
7: else
8:   return $R$
9: end if
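A compact Python rendering of Algorithm 1 might look as follows (a sketch of ours; the sample format, helper names, and the exhaustive samples in the usage example are assumptions, and the tag "R" stands for returning a fresh uniformly random feasible set):

```python
from collections import defaultdict
from itertools import combinations

def curvature_ops(samples, n, k, c):
    """Sketch of Algorithm 1: optimization from samples with curvature c.

    `samples` is a list of (frozenset, value) pairs drawn from the
    uniform distribution over feasible sets. Returns ("S", top-k set)
    when the lower bound v_S beats the estimate v_R of a random set,
    and ("R", None) otherwise, meaning: output a random feasible set.
    """
    avg = lambda vals: sum(vals) / len(vals) if vals else 0.0
    with_i = defaultdict(list)     # values of size-k samples containing i
    without_i = defaultdict(list)  # values of size-(k-1) samples avoiding i
    size_k_minus_1 = []            # values of all size-(k-1) samples
    for S, v in samples:
        if len(S) == k:
            for i in S:
                with_i[i].append(v)
        elif len(S) == k - 1:
            size_k_minus_1.append(v)
            for i in set(range(n)) - S:
                without_i[i].append(v)

    v_hat = {i: avg(with_i[i]) - avg(without_i[i]) for i in range(n)}
    S = set(sorted(range(n), key=lambda i: -v_hat[i])[:k])  # top-k estimates
    v_hat_S = (1 - c) * sum(v_hat[i] for i in S)  # lower bound on f(S)
    v_hat_R = avg(size_k_minus_1)                 # estimate of E[f(R)]
    return ("S", S) if v_hat_S >= v_hat_R else ("R", None)

# On a modular function (c = 0) with all sets of sizes k and k-1 as
# samples, the estimates recover the weights and the top-k set wins.
n, k = 6, 2
w = [1, 2, 3, 4, 5, 6]
f = lambda T: sum(w[i] for i in T)
samples = [(frozenset(T), f(T)) for r in (k - 1, k)
           for T in combinations(range(n), r)]
tag, S = curvature_ops(samples, n, k, c=0)
assert tag == "S" and S == {4, 5}
```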
The analysis. Without loss of generality, let $S = \{e_1, \dots, e_k\}$ be the set defined in Line 2 of the algorithm and define $S_i$ to be the first $i$ elements in $S$, i.e., $S_i := \{e_1, \dots, e_i\}$. Similarly, for the optimal solution $S^\star$, we have $S^\star = \{e_1^\star, \dots, e_k^\star\}$ and $S_i^\star := \{e_1^\star, \dots, e_i^\star\}$. We abuse notation and denote by $f(R)$ and $f_R(e)$ the expected values $\mathbb{E}_R[f(R)]$ and $\mathbb{E}_R[f_R(e)]$, where the randomization is over the random set $R$ of size $k-1$.

At a high level, the curvature property is used to bound the loss from $f(S)$ to $\sum_{i \le k} f_R(e_i)$ and from $\sum_{i \le k} f_R(e_i^\star)$ to $f(S^\star)$. By the algorithm, $\sum_{i \le k} f_R(e_i)$ is greater than $\sum_{i \le k} f_R(e_i^\star)$. When bounding the loss from $\sum_{i \le k} f_R(e_i^\star)$ to $f(S^\star)$, a key observation is that if this loss is large, then it must be the case that $R$ has a high expected value. This observation is formalized in our analysis by bounding this loss in terms of $f(R)$, and motivates Algorithm 1 returning the best of $R$ and $S$. Lemma 2.2.1 is the main part of the analysis and gives an approximation for $S$. The approximation guarantee for Algorithm 1 (formalized as Theorem 2.2.1) follows by finding the worst-case ratios of $f(R)$ and $f(S)$.
Lemma 2.2.1. Let $S$ be the set defined in Algorithm 1 and $f(\cdot)$ be a monotone submodular function with curvature $c$. Then
$$f(S) \ge (1 - o(1)) \hat{v}_S \ge \left( (1-c) \left( 1 - c \cdot \frac{f(R)}{f(S^\star)} \right) - o(1) \right) f(S^\star).$$
Proof. First, observe that
$$f(S) \;=\; \sum_{i \le k} f_{S_{i-1}}(e_i) \;\ge\; (1-c) \sum_{i \le k} f(e_i) \;\ge\; (1-c) \sum_{i \le k} f_R(e_i),$$
where the first inequality is by curvature and the second is by submodularity. We now claim that w.h.p., and with a sufficiently large polynomial number of samples, the estimates of the marginal contribution of an element are precise,
$$f_R(e_i) + \frac{f(S^\star)}{n^2} \;\ge\; \hat{v}_i \;\ge\; f_R(e_i) - \frac{f(S^\star)}{n^2},$$
and defer the proof to Claim 1. Thus $f(S) \ge (1-c)\sum_{i \le k} \hat{v}_i - f(S^\star)/n \ge \hat{v}_S - f(S^\star)/n$. Next, by the definition of $S$ in the algorithm, we get
$$\frac{\hat{v}_S}{1-c} \;=\; \sum_{i \le k} \hat{v}_i \;\ge\; \sum_{i \le k} \hat{v}_{e_i^\star} \;\ge\; \sum_{i \le k} f_R(e_i^\star) - \frac{f(S^\star)}{n}.$$
It is possible to obtain a $1-c$ loss between $\sum_{i \le k} f_R(e_i^\star)$ and $f(S^\star)$ with a similar argument as in the first part. The key idea to improve this loss is to use the curvature property on the elements in $R$ instead of on the elements $e_i^\star \in S^\star$. By curvature, we have that $f_{S^\star}(R) \ge (1-c) f(R)$. We now wish to relate $f_{S^\star}(R)$ and $\sum_{i \le k} f_R(e_i^\star)$. Note that $f(S^\star) + f_{S^\star}(R) = f(R \cup S^\star) = f(R) + f_R(S^\star)$ by the definition of marginal contribution, and $\sum_{i \le k} f_R(e_i^\star) \ge f_R(S^\star)$ by submodularity. We get $\sum_{i \le k} f_R(e_i^\star) \ge f(S^\star) + f_{S^\star}(R) - f(R)$ by combining the previous equation and inequality. By the previous curvature observation, we conclude that
$$\sum_{i \le k} f_R(e_i^\star) \;\ge\; f(S^\star) + (1-c) f(R) - f(R) \;=\; \left(1 - c \cdot \frac{f(R)}{f(S^\star)}\right) f(S^\star).$$
Claim 1. Let $f$ be a monotone submodular function. Then, with a sufficiently large polynomial number of samples, the estimates $\hat{v}_i$ and $\hat{v}_R$ are $f(S^\star)/n^2$-close to $f_R(e_i)$ and $f(R)$ with high probability, i.e.,
$$f_R(e_i) + \frac{f(S^\star)}{n^2} \;\ge\; \hat{v}_i \;\ge\; f_R(e_i) - \frac{f(S^\star)}{n^2}, \qquad \text{and} \qquad f(R) + \frac{f(S^\star)}{n^2} \;\ge\; \hat{v}_R \;\ge\; f(R) - \frac{f(S^\star)}{n^2}.$$
Proof. We assume that $k \le n/2$ (otherwise, a random subset of size $k$ is a $1/2$-approximation). The most likely size of a sample is $k$, so the probability that a sample is of size $k$ is at least $2/n$. Since $\binom{n}{k-1} \ge \binom{n}{k}/n$, the probability that a sample is of size $k-1$ is at least $2/n^2$. A given element $i$ has probability at least $1/n$ of being in a sample and probability at least $1/2$ of not being in a sample. Therefore, to observe at least $n^5$ samples of size $k$ which contain $i$ and at least $n^5$ samples of size $k-1$ which do not contain $i$, $n^8$ samples are sufficient with high probability. Since $f(S) \le f(S^\star)$ for all samples $S$, by Hoeffding's inequality,
$$\Pr\left(\left|\mathrm{avg}(\mathcal{S}_{k,i}) - \mathbb{E}_{S : |S|=k,\, i \in S}[f(S)]\right| \ge \frac{f(S^\star)}{2n^2}\right) \;\le\; 2e^{-2n^5 (f(S^\star)/2n^2)^2 / f(S^\star)^2} \;\le\; 2e^{-n/2},$$
similarly,
$$\Pr\left(\left|\mathrm{avg}(\mathcal{S}_{k-1,-i}) - \mathbb{E}_{S : |S|=k-1,\, i \notin S}[f(S)]\right| \ge \frac{f(S^\star)}{2n^2}\right) \;\le\; 2e^{-n/2},$$
and
$$\Pr\left(\left|\mathrm{avg}(\mathcal{S}_{k-1}) - \mathbb{E}_{S : |S|=k-1}[f(S)]\right| \ge \frac{f(S^\star)}{2n^2}\right) \;\le\; 2e^{-n/2}.$$
Since $\hat{v}_i = \mathrm{avg}(\mathcal{S}_{k,i}) - \mathrm{avg}(\mathcal{S}_{k-1,-i})$, $f_R(e_i) = \mathbb{E}_{S : |S|=k,\, i \in S}[f(S)] - \mathbb{E}_{S : |S|=k-1,\, i \notin S}[f(S)]$, $\hat{v}_R = \mathrm{avg}(\mathcal{S}_{k-1})$, and $f(R) = \mathbb{E}_{S : |S|=k-1}[f(S)]$, the claim holds with high probability.
Combining Lemma 2.2.1 and the fact that we obtain value at least $\max\{f(R),\, (1-c)\sum_{i=1}^k \hat{v}_i\}$, we obtain the main result of this section.

Theorem 2.2.1. Let $f(\cdot)$ be a monotone submodular function with curvature $c$. Then Algorithm 1 is a $(1-c)/(1+c-c^2) - o(1)$ approximation algorithm for optimization from samples from the uniform distribution $\mathcal{U}$ over feasible sets.
Proof. The estimate $\hat{v}_R$ of $f(R)$ is precise: with sufficiently many samples, we have $f(R) + f(S^\star)/n^2 \ge \hat{v}_R \ge f(R) - f(S^\star)/n^2$ with high probability by standard concentration bounds. In addition, by the first inequality in Lemma 2.2.1, $f(S) \ge (1-o(1))\hat{v}_S$. So by the algorithm and the second inequality in Lemma 2.2.1, the approximation obtained by the returned set is at least
$$(1 - o(1)) \cdot \max\left\{\frac{f(R)}{f(S^\star)},\; \frac{\hat{v}_S}{f(S^\star)}\right\} \;\ge\; (1 - o(1)) \cdot \max\left\{\frac{f(R)}{f(S^\star)},\; (1-c)\left(1 - c \cdot \frac{f(R)}{f(S^\star)}\right)\right\}.$$
Let $x := f(R)/f(S^\star)$. The best of $f(R)/f(S^\star)$ and $(1-c)(1 - c \cdot f(R)/f(S^\star)) - o(1)$ is minimized when $x = (1-c)(1-cx)$, i.e., when $x = (1-c)/(1+c-c^2)$. Thus, the approximation obtained is at least $(1-c)/(1+c-c^2) - o(1)$.
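The closed form for the worst-case ratio can be verified with a few lines of arithmetic; the helper name `worst_case_ratio` below is ours, introduced only for this check.

```python
def worst_case_ratio(c):
    """Worst-case approximation ratio x solving the fixed-point equation
    x = (1 - c)(1 - c x) from the proof of Theorem 2.2.1."""
    return (1 - c) / (1 + c - c * c)

# the closed form is a fixed point of x = (1 - c)(1 - c x) for every curvature c
for c in (0.0, 0.25, 0.5, 0.9):
    x = worst_case_ratio(c)
    assert abs(x - (1 - c) * (1 - c * x)) < 1e-12
```

Note that at $c = 0$ (a linear function) the ratio is $1$, and it degrades smoothly as the curvature grows.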
2.2.2 Learning to Influence
For well over a decade now, there has been extensive work on the canonical problem of influence maximization in social networks. First posed by Domingos and Richardson [2001b], Richardson and Domingos [2002] and elegantly formulated and further developed by Kempe et al. [2003], influence maximization is the algorithmic challenge of selecting individuals who
can serve as early adopters of a new idea, product, or technology in a manner that will trigger a large cascade in the social network. In their seminal paper, Kempe, Kleinberg, and Tardos characterize a family of natural influence processes for which selecting a set of individuals that maximize the resulting cascade reduces to maximizing a submodular function under a cardinality constraint. Since submodular functions can be maximized within a $1 - 1/e$ approximation guarantee, one can then obtain desirable guarantees for the influence maximization problem.

There have since been two, largely separate, agendas of research on the problem. The first line of work is concerned with learning the underlying submodular function from observations of cascades [Liben-Nowell and Kleinberg, 2003, Adar and Adamic, 2005, Leskovec et al., 2007, Goyal et al., 2010, Chierichetti et al., 2011, Gomez-Rodriguez et al., 2011, Netrapalli and Sanghavi, 2012, Gomez-Rodriguez et al., 2010, Du et al., 2012, Abrahao et al., 2013, Du et al., 2013, Feldman and Kothari, 2014, De et al., 2014, Cheng et al., 2014, Daneshmand et al., 2014, Du et al., 2014a, Narasimhan et al., 2015, Honorio and Ortiz, 2015]. The second line of work focuses on algorithmic challenges revolving around maximizing influence, assuming the underlying function that generates the diffusion process is known [Kempe et al., 2005, Mossel and Roch, 2007, Seeman and Singer, 2013, Borgs et al., 2014, Hassidim and Singer, 2015, He and Kempe, 2016, Angell and Schoenebeck, 2016].

In this paper, we consider the problem of learning to influence where the goal is to maximize influence from observations of cascades. This problem synthesizes both problems of learning the function from training data and of maximizing influence given the influence function. A natural approach for learning to influence is to first learn the influence function from cascades, and then apply a submodular optimization algorithm on the function learned from data.
Somewhat counter-intuitively, it turns out that this approach yields desirable guarantees only under very strong learnability conditions. In some cases, when there are sufficiently many samples and one can observe exactly which node attempts to influence whom at every time step, these learnability conditions can be met. A slight relaxation, however (e.g., when there are only partial observations [Narasimhan et al., 2015, He et al., 2016]), can lead to sharp inapproximability.
Learning to influence social networks. As with all impossibility results, the inapproximability discussed above holds for worst case instances, and it may be possible that such instances are rare for influence in social networks. In the previous section, it was shown that when a submodular function has bounded curvature, there is a simple algorithm that can maximize the function under a cardinality constraint from samples. Unfortunately, simple examples show that submodular functions that dictate influence processes in social networks do not have bounded curvature. Are there other reasonable conditions on social networks that yield desirable approximation guarantees?
Main result. In this section we present a simple algorithm for learning to influence. This algorithm leverages the idea that social networks exhibit strong community structure. At a high level, the algorithm observes cascades and aims to select a set of nodes that are influential, but belong to different communities. Intuitively, when an influential node from a certain community is selected to initiate a cascade, the marginal contribution of adding another node from that same community is small, since the nodes in that community were likely already influenced. This observation can be translated into a simple algorithm which performs very well in practice. Analytically, since community structure is often modeled using stochastic block models, we prove that the algorithm obtains a constant factor approximation guarantee in such models, under mild assumptions.

The analysis for the approximation guarantees lies at the intersection of combinatorial optimization and random graph theory. We formalize the intuition that the algorithm leverages the community structure of social networks in the standard model used to analyze communities, which is the stochastic block model. Intuitively, the algorithm obtains good approximations by picking the nodes that have the largest individual influence while avoiding picking multiple nodes in the same community by pruning nodes with high influence overlap. The individual influence of nodes and their overlap are estimated by the algorithm with what we call first and second order marginal contributions of nodes, which can be estimated from samples. We then use phase transition results for Erdős–Rényi random graphs and branching process techniques to compare these individual influences for nodes in different communities in the stochastic block model and to bound the overlap of pairs of nodes.
The Model
We assume that the influence process follows the standard independent cascade model. In the independent cascade model, a node $a$ influences each of its neighbors $b$ with some probability $q_{ab}$, independently. Thus, given a seed set of nodes $S$, the set of nodes influenced is the set of nodes connected to some node in $S$ in the random subgraph of the network which contains every edge $ab$ independently with probability $q_{ab}$. We define $f(S)$ to be the expected number of nodes influenced by $S$ according to the independent cascade model over some weighted social network.
The learning to influence model: optimization from samples for influence maximization. The learning to influence model is an interpretation of the optimization from samples model for the specific problem of influence maximization in social networks. We focus on bounded product distributions $\mathcal{D}$, so every node $a$ is, independently, in $S \sim \mathcal{D}$ with some probability $p_a \in [1/\mathrm{poly}(n),\, 1 - 1/\mathrm{poly}(n)]$. We assume this is the case throughout the paper. We are given a collection of samples $\{(S_i, |\mathrm{cc}(S_i)|)\}_{i=1}^m$ where the sets $S_i$ are the seed sets of nodes and $|\mathrm{cc}(S_i)|$ is the number of nodes influenced by $S_i$, i.e., the number of nodes that are connected to $S_i$ in the random subgraph of the network. This number of nodes is a random variable with expected value $f(S_i) := \mathbb{E}[|\mathrm{cc}(S_i)|]$ over the realization of the influence process. Each sample is an independent realization of the influence process. The goal is then to find, under a cardinality constraint $k$, a set of nodes $S$ which maximizes the influence in expectation, i.e., a set $S$ of size at most $k$ which maximizes the expected number of nodes $f(S)$ influenced by seed set $S$.
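The sample-generation process above can be sketched as follows. This is an illustrative sketch: the function name `draw_samples` is ours, a single seed probability `p_node` is used for all nodes (the model allows a different $p_a$ per node), and edges are treated as undirected.

```python
import random

def draw_samples(nodes, edges, q, p_node, m, rng=None):
    """Generate m samples (S_i, |cc(S_i)|): each node joins the seed set
    independently with probability p_node, each edge e is live independently
    with probability q[e], and the observed value is the number of nodes
    connected to the seed set through live edges."""
    rng = rng or random.Random(0)
    samples = []
    for _ in range(m):
        S = frozenset(v for v in nodes if rng.random() < p_node)
        live = [e for e in edges if rng.random() < q[e]]
        adj = {v: [] for v in nodes}
        for a, b in live:
            adj[a].append(b)
            adj[b].append(a)
        # BFS from the seed set over live edges
        reached, frontier = set(S), list(S)
        while frontier:
            v = frontier.pop()
            for u in adj[v]:
                if u not in reached:
                    reached.add(u)
                    frontier.append(u)
        samples.append((S, len(reached)))
    return samples
```

Averaging the observed values $|\mathrm{cc}(S_i)|$ over many samples approximates the objective $f(S) = \mathbb{E}[|\mathrm{cc}(S)|]$ for sets drawn from $\mathcal{D}$.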
Description of the Algorithm
We present the main algorithm, COPS. This algorithm is based on a novel optimization from samples technique which detects overlap in the marginal contributions of two different nodes, which is useful to avoid picking two nodes that have intersecting influence over the same collection of nodes. COPS consists of two steps. It first orders nodes in decreasing order of first order marginal contribution, which is the expected marginal contribution of a node $a$ to a random set $S \sim \mathcal{D}$. Then, it iteratively removes nodes $a$ whose marginal contribution overlaps with the marginal contribution of at least one node before $a$ in the ordering. The solution is the $k$ first nodes in the pruned ordering.
Algorithm 2 COPS, learns to influence networks with COmmunity Pruning from Samples.
Input: samples $\mathcal{S} = \{(S, f(S))\}$, acceptable overlap $\alpha$.
Order nodes according to their first order marginal contributions.
Iteratively remove from this ordering nodes $a$ whose marginal contribution has overlap of at least $\alpha$ with the marginal contribution of at least one node before $a$ in this ordering.
return the $k$ first nodes in the ordering
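The two steps above can be sketched in Python. This is a minimal sketch under assumptions: the helper names (`v`, `v2`, `overlap`) are ours, marginal contributions are estimated by differences of sample averages, and a node is checked only against earlier nodes that survived the pruning.

```python
def cops(samples, nodes, k, alpha):
    """Sketch of COPS. `samples` is a list of (set, value) pairs; the
    first/second order contributions are estimated from sample averages."""
    def avg(vals):
        return sum(vals) / len(vals) if vals else 0.0

    def v(a):  # first order marginal contribution of a
        return (avg([val for S, val in samples if a in S])
                - avg([val for S, val in samples if a not in S]))

    def v2(a, b):  # second order contribution of a to sets containing b
        return (avg([val for S, val in samples if a in S and b in S])
                - avg([val for S, val in samples if a not in S and b in S]))

    def overlap(a, b):  # does b shrink a's contribution by a factor 1 - alpha?
        return v2(a, b) < (1 - alpha) * v(a)

    order = sorted(nodes, key=v, reverse=True)
    kept = []
    for a in order:
        if all(not overlap(a, b) for b in kept):
            kept.append(a)
        if len(kept) == k:
            break
    return kept

# toy instance: two "communities" {0,1,2} (value 3) and {3,4} (value 2)
U = range(5)
subs = [frozenset([i]) for i in U] + \
       [frozenset([i, j]) for i in U for j in U if i < j]
f = lambda S: 3 * bool(S & {0, 1, 2}) + 2 * bool(S & {3, 4})
toy = [(S, f(S)) for S in subs]
picked = cops(toy, list(U), 2, 0.2)  # one node from each community: [0, 3]
```

On the toy coverage instance, pruning removes the redundant nodes 1 and 2 (their second order contribution given node 0 collapses) and the solution spans both communities.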
The strong performance of this algorithm for the problem of influence maximization is best explained with the concept of communities. Intuitively, this algorithm first orders nodes in decreasing order of their individual influence and then removes nodes which are in the same community. This second step allows the algorithm to obtain a diverse solution which influences multiple different communities of the social network. In comparison, the other optimization from samples algorithms only use first order marginal contributions and perform well if the function is close to linear. Due to the high overlap in influence between nodes in the same community, influence functions are far from linear, and these algorithms perform poorly for influence maximization since they only pick nodes from a very small number of communities.
Computing overlap using second order marginal contributions. We define second order marginal contributions, which are used to compute the overlap between the marginal contributions of two nodes.

Definition 9. The second order expected marginal contribution of a node $a$ to a random set $S$ containing node $b$ is
$$v_b(a) := \mathbb{E}_{S \sim \mathcal{D} :\, a \notin S,\, b \in S}[f(S \cup \{a\}) - f(S)].$$
The first order marginal contribution $v(a)$ of node $a$ is defined similarly as the marginal contribution of node $a$ to a random set $S$, i.e., $v(a) := \mathbb{E}_{S \sim \mathcal{D} :\, a \notin S}[f(S \cup \{a\}) - f(S)]$. These contributions can be estimated arbitrarily well for product distributions $\mathcal{D}$ by taking the difference between the average value of samples containing $a$ and $b$ and the average value of samples containing $b$ but not $a$.
The subroutine Overlap$(a, b, \alpha)$, $\alpha \in [0, 1]$, compares the second order marginal contribution of $a$ to a random set containing $b$ with the first order marginal contribution of $a$ to a random set. If $b$ causes the marginal contribution of $a$ to decrease by at least a factor of $1 - \alpha$, then we say that $a$ has marginal contribution with overlap of at least $\alpha$ with node $b$.

Algorithm 3 Overlap$(a, b, \alpha)$, returns true if $a$ and $b$ have marginal contributions that overlap by at least a factor $\alpha$.
Input: samples $\mathcal{S} = \{(S, f(S))\}$, node $a$, acceptable overlap $\alpha$.
If the second order marginal contribution $v_b(a)$ is at least a factor of $1 - \alpha$ smaller than the first order marginal contribution $v(a)$,
return: node $a$ has overlap of at least $\alpha$ with node $b$
Overlap is used to detect nodes in the same community. In the extreme case where two nodes $a$ and $b$ are in a community $C$ where any node in $C$ influences all of community $C$, the second order marginal contribution $v_b(a)$ of $a$ to a random set $S$ containing $b$ is $v_b(a) = 0$, since $b$ already influences all of $C$ so $a$ does not add any value, while $v(a) \approx |C|$. In the opposite case, where $a$ and $b$ are in two communities which are not connected in the network, we have $v(a) = v_b(a)$, since adding $b$ to a random set $S$ has no impact on the value added by $a$.
Analyzing community structure. The main benefit of COPS is that it leverages the community structure of social networks. To formalize this explanation, we analyze our algorithm in the standard model used to study the community structure of networks, the stochastic block model. In this model, a fixed set of nodes $V$ is partitioned into communities $C_1, \dots, C_\ell$. The network is then a random graph $G = (V, E)$ where edges are added to $E$ independently, and where an intra-community edge is in $E$ with much larger probability than an inter-community edge. These edges are added with identical probability $q_C^{sb}$ for every edge in the same community $C$, but with different probabilities for edges inside different communities $C_i$ and $C_j$. We illustrate this model in Figure 2.3.
Figure 2.3: An illustration of the stochastic block model with communities $C_1$, $C_2$, $C_3$ and $C_4$ of sizes 6, 4, 4 and 4. The optimal solution for influence maximization with $k = 4$ is in green. Picking the $k$ first nodes in the ordering by marginal contributions without pruning, as in the previous section, leads to a solution with nodes from only $C_1$ (red). By removing nodes with overlapping marginal contributions, COPS obtains a diverse solution.
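A minimal generator for the stochastic block model described above can be sketched as follows; the function name and the use of a single inter-community probability `q_inter` (rather than a probability per community pair) are simplifying assumptions made for illustration.

```python
import random

def stochastic_block_model(community_sizes, q_intra, q_inter, rng=None):
    """Sample a graph from a simple stochastic block model: nodes are
    partitioned into communities of the given sizes, an intra-community
    edge of community c appears with probability q_intra[c] and an
    inter-community edge with probability q_inter, all independently."""
    rng = rng or random.Random(0)
    community, start = {}, 0
    for c, size in enumerate(community_sizes):
        for node in range(start, start + size):
            community[node] = c
        start += size
    n = start
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            p = q_intra[community[a]] if community[a] == community[b] else q_inter
            if rng.random() < p:
                edges.append((a, b))
    return edges, community
```

Composing this generator with an independent cascade realization (keeping each SBM edge live with probability $q_C^{ic}$) yields the two-step random graph $G$ analyzed in the next sections.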
Dense Communities and Small Seed Set in the Stochastic Block Model

In this section, we show that COPS achieves a $1 - O(|C_k|^{-1})$ approximation, where $C_k$ is the $k$th largest community, in the regime with dense communities and small seed sets, which is described below. We show that the algorithm picks a node from each of the $k$ largest communities with high probability, which is the optimal solution. In the next section, we show a constant factor approximation algorithm for a generalization of this setting, which requires a more intricate analysis.

In order to focus on the main characteristics of the community structure as an explanation for the performance of the algorithm, we make the following simplifying assumptions for the analysis. We first assume that there are no inter-community edges.¹ We also assume that the random graph obtained from the stochastic block model is redrawn for every sample and that we aim to find a good solution in expectation over both the stochastic block model and the independent cascade model.

Formally, let $G = (V, E)$ be the random graph over $n$ nodes obtained from an independent cascade process over the graph generated by the stochastic block model. Similarly as for the stochastic block model, edge probabilities for the independent cascade model may vary

¹The analysis easily extends to cases where inter-community edges form with probability significantly smaller than $q_C^{sb}$, for all $C$.
between different communities and are identical within a single community $C$, where all edges have weight $q_C^{ic}$. Thus, an edge $e$ between two nodes in a community $C$ is in $E$ with probability $p_C := q_C^{ic} \cdot q_C^{sb}$, independently for every edge, where $q_C^{ic}$ and $q_C^{sb}$ are the edge probabilities in the independent cascade model and the stochastic block model respectively. The total influence of seed set $S_i$ is then $|\mathrm{cc}_G(S_i)|$, where $\mathrm{cc}_G(S)$ is the set of nodes connected to $S$ in $G$, and we drop the subscript when it is clear from context. Thus, the objective function is $f(S) := \mathbb{E}_G[|\mathrm{cc}(S)|]$. We describe the two assumptions for this section.
Dense communities. We assume that for the $k$ largest communities $C$, $p_C > 3 \log |C| / |C|$ and $C$ has super-constant size ($|C| = \omega(1)$). This assumption corresponds to communities where the probability $p_C$ that a node $a_i \in C$ influences another node $a_j \in C$ is large. Since the subgraph $G[C]$ of $G$ induced by a community $C$ is an Erdős–Rényi random graph, we get that $G[C]$ is connected with high probability. We first review Erdős–Rényi random graphs.
Erdős–Rényi random graphs. A $G_{n,p}$ Erdős–Rényi graph is a random graph over $n$ vertices where every edge realizes with probability $p$. Note that the graph obtained by the two-step process which consists of first the stochastic block model and then the independent cascade model is a union of $G_{|C|, p_C}$ for each community $C$. The following seminal result from Erdős and Rényi characterizes phase transitions for $G_{n,p}$ graphs.
Lemma 2.2.2. [Erdős and Rényi, 1960] Assume $C$ is a "dense" community; then the subgraph $G[C]$ of $G$ is connected with probability $1 - O(|C|^{-2})$.

Proof. Assume $p_C = c \log |C| / |C|$ for $c > 1$. From Theorem 4.6 in [Blum et al.], which presents the result from Erdős and Rényi [1960], the expected number of isolated vertices $i$ in $G(|C|, p_C)$ is
$$\mathbb{E}[i] = |C|^{1-c} + o(1),$$
and from Theorem 4.15 in [Blum et al.], the expected number of components of size between 2 and $|C|/2$ is $O(|C|^{1-2c})$. Thus the expected number of components of size at most $|C|/2$ is $O(|C|^{1-c})$, and the probability that the graph is connected is $1 - O(|C|^{1-c})$. Finally, since $c \ge 3$ for dense communities, the probability that the graph for community $C$ is connected is $1 - O(|C|^{-2})$.
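The isolated-vertex count driving this proof can be checked numerically. The helper `expected_isolated` is ours: it computes the exact expectation $n(1-p)^{n-1}$ for $p = c \log n / n$, which the proof approximates by $n^{1-c}$.

```python
import math

def expected_isolated(n, c):
    """Exact expected number of isolated vertices in G(n, p) with
    p = c log n / n: each vertex is isolated with probability (1-p)^(n-1)."""
    p = c * math.log(n) / n
    return n * (1 - p) ** (n - 1)
```

For large $n$ the exact value tracks the $n^{1-c}$ asymptotic closely, e.g. at $n = 10^6$ and $c = 3$ the two agree to within a fraction of a percent, consistent with the $|C|^{1-c} + o(1)$ statement above.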
Small seed set. We also assume that the seed sets $S \sim \mathcal{D}$ are small enough that they rarely intersect with a fixed community $C$, i.e., $\Pr_{S \sim \mathcal{D}}[S \cap C = \emptyset] \ge 1 - o(1)$. This assumption corresponds to cases where the set of early influencers is small, which is usually the case in cascades.

The analysis in this section relies on two main lemmas. We first show that the first order marginal contribution of a node is approximately the size of the community it belongs to (Lemma 2.2.3). Thus, the ordering by marginal contributions orders elements by the size of the community they belong to. Then, we show that any node $a \in C$ such that there is a node $b \in C$ before $a$ in the ordering is pruned (Lemma 2.2.4). Regarding the distribution $\mathcal{D}$ generating the samples, as previously mentioned, we consider any bounded product distribution. This implies that w.p. $1 - 1/\mathrm{poly}(n)$, the algorithm can compute marginal contribution estimates $\tilde{v}$ that are all a $1/\mathrm{poly}(n)$-additive approximation to the true marginal contributions $v$. Thus, we give the analysis for the true marginal contributions, which, with probability $1 - 1/\mathrm{poly}(n)$ over the samples, easily extends to arbitrarily good estimates. The following lemma shows that the ordering by first order marginal contributions corresponds to the ordering by decreasing order of the community sizes that nodes belong to.
Lemma 2.2.3. For all $a \in C$ where $C$ is one of the $k$ largest communities, the first order marginal contribution of node $a$ is approximately the size of its community, i.e., $(1 - o(1))|C| \le v(a) \le |C|$.

Proof. Assume $a$ is a node in one of the $k$ largest communities. Let $\mathcal{D}_a$ and $\mathcal{D}_{-a}$ denote the distributions of $S \sim \mathcal{D}$ conditioned on $a \in S$ and $a \notin S$ respectively. We also denote marginal contributions by $f_S(a) := f(S \cup \{a\}) - f(S)$. We obtain
\begin{align*}
v(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a},\, G}[f_S(a)] &\ge \Pr_{S \sim \mathcal{D}_{-a}}[S \cap C = \emptyset] \cdot \Pr_G[\mathrm{cc}(a) = C] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a} :\, S \cap C = \emptyset,\; G :\, \mathrm{cc}(a) = C}[f_S(a)] \\
&= \Pr_{S \sim \mathcal{D}_{-a}}[S \cap C = \emptyset] \cdot \Pr_G[\mathrm{cc}(a) = C] \cdot |C| \\
&\ge (1 - o(1)) \cdot |C|,
\end{align*}
where the last inequality is by the small seed set assumption and since $C$ is connected with probability $1 - o(1)$ (Lemma 2.2.2 and $|C| = \omega(1)$ by the dense community assumption). For the upper bound, $v(a)$ is trivially at most the size of $a$'s community since there are no inter-community edges.
The next lemma shows that the algorithm does not pick two nodes in the same community.

Lemma 2.2.4. With probability $1 - o(1)$, for all pairs of nodes $a, b$ such that $a, b \in C$ where $C$ is one of the $k$ largest communities, Overlap$(a, b, \alpha) = \mathrm{True}$ for any constant $\alpha \in [0, 1)$.

Proof. Let $a, b$ be two nodes in one of the $k$ largest communities $C$, and let $\mathcal{D}_{-a,b}$ denote the distribution of $S \sim \mathcal{D}$ conditioned on $a \notin S$ and $b \in S$. Then,
$$v_b(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a)] \;\le\; \Pr[b \in \mathrm{cc}(a)] \cdot 0 + \Pr[b \notin \mathrm{cc}(a)] \cdot |C| \;=\; o(1) \;\le\; o(1) \cdot v(a),$$
where the equality holds since $G[C]$ is not connected w.p. $O(|C|^{-2})$ by Lemma 2.2.2 and since $|C| = \omega(1)$ by the dense community assumption, which concludes the proof.
By combining Lemmas 2.2.3 and 2.2.4, we obtain the main result for this section.
Theorem 2.2.2. In the dense communities and small seed set setting, COPS with $\alpha$-overlap allowed, for any constant $\alpha \in (0, 1)$, is a $1 - o(1)$-approximation algorithm for learning to influence from samples from a bounded product distribution $\mathcal{D}$.

Proof. First, we claim that a node $a \in C$ is not removed from the ordering if there is no other node from $C$ before $a$. For $b \notin C$, we have
$$v(a) = \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a)] = \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : b \in S] = \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a)] = v_b(a),$$
where the second equality holds since $a$ and $b$ are in different communities and since $\mathcal{D}$ is a product distribution. Thus, Overlap$(a, b, \alpha) = \mathrm{False}$ for any $\alpha \in (0, 1]$.

Next, recall that $v(a) \le |C|$ for all $a \in C$. Thus, by Lemmas 2.2.3 and 2.2.4, COPS returns a set that contains one node from each of $k$ different communities whose sizes are at most a factor $1 - o(1)$ away from the sizes of the $k$ largest communities. Since the $k$ largest communities are connected with high probability, the optimal solution contains one node from each of the $k$ largest communities. Thus, we obtain a $1 - o(1)$ approximation.
Constant Approximation for General Stochastic Block Model
In this section, we relax the assumptions from the previous section and show that COPS is a constant factor approximation algorithm in this more demanding setting. Recall that $G$ is the random graph obtained from both the stochastic block model and the independent cascade model. A main observation used in the analysis is that the random subgraph $G[C]$, for some community $C$, is an Erdős–Rényi random graph $G_{|C|, p_C}$.
Relaxation of the assumptions. Instead of only considering dense communities where $p_C = \Omega((\log |C|)/|C|)$, we consider both tight communities $C$ where $p_C \ge (1+\epsilon)/|C|$ for some constant $\epsilon > 0$ and loose communities $C$ where $p_C \le (1-\epsilon)/|C|$ for some constant $\epsilon > 0$.² We also relax the small seed set assumption to the reasonable non-ubiquitous seed set assumption. Instead of having a seed set $S \sim \mathcal{D}$ rarely intersect with a fixed community $C$, we only assume that $\Pr_{S \sim \mathcal{D}}[S \cap C = \emptyset] \ge \epsilon$ for some constant $\epsilon > 0$. Again, since seed sets are of small size in practice, it seems reasonable that with some constant probability a community does not contain any seeds.
Overview of analysis. At a high level, the analysis exploits the remarkably sharp threshold for the phase transition of Erdős–Rényi random graphs. This phase transition (Lemma 2.2.5) tells us that a tight community $C$ contains w.h.p. a giant connected component with a constant fraction of the nodes from $C$. Thus, a single node from a tight community influences a constant fraction of its community in expectation. The ordering by first order marginal contributions thus ensures a constant factor approximation of the value from nodes in tight communities (Lemma 2.2.8). On the other hand, we show that a node from a loose community influences only at most a constant number of nodes in expectation (Lemma 2.2.6) by using branching processes. Since the algorithm checks for overlap using second order marginal contributions, the algorithm picks at most one node from any tight community (Lemma 2.2.9). Combining all the pieces together, we obtain a constant factor approximation (Theorem 2.2.3). We first state the result for the giant connected component in a tight community, which is an immediate corollary of the prominent giant connected component result in the Erdős–Rényi model.
Lemma 2.2.5. [Erdős and Rényi, 1960] Let $C$ be a tight community with $|C| = \omega(1)$; then $G[C]$ has a "giant" connected component containing a constant fraction of the nodes in $C$ w.p. $1 - o(1)$.

The following lemma analyzes the influence of a node in a loose community through the
²Thus, we consider all possible sizes of communities except communities whose size converges to exactly $1/p_C$, which is unlikely to occur in practice.
lenses of Galton–Watson branching processes to show that such a node influences at most a constant number of nodes in expectation.

Lemma 2.2.6. Let $C$ be a loose community; then $f(\{a\}) \le c$ for all $a \in C$ and some constant $c$.
Proof. Fix a node $a \in C$. We consider a Galton–Watson branching process starting at individual $a$ where the number of offspring of an individual is $X = \mathrm{Binomial}(|C|-1, p_C)$. We show that the expected total size $s$ of this branching process is $1/(1 - p_C \cdot (|C|-1))$ and that this expected size upper bounds $f(\{a\})$.

We first argue that $\mathbb{E}[s] \ge f(\{a\})$. The expected number of nodes influenced by $a$ can be counted via a breadth first search (BFS) of community $C$ starting at $a$. The number of edges leaving a node in this BFS is $\mathrm{Binomial}(|C|-1, p_C)$, which is exactly the number of offspring of an individual in the branching process. Since the nodes explored by the BFS are only the nodes not yet explored, the number of nodes explored by the BFS is upper bounded by the size of the branching process, and we get $\mathbb{E}[s] \ge f(\{a\})$.

Next, let $\mu = p_C \cdot (|C|-1) < 1$ be the expected number of offspring of an individual in the branching process. Let $s_i$ be the number of individuals at generation $i$ of the branching process. We show by induction that $\mathbb{E}[s_i] = \mu^i$. The base case is trivial for $i = 1$. Next, for $i \ge 2$,
$$\mathbb{E}[s_i] = \sum_{j=0}^{\infty} \Pr[s_{i-1} = j] \cdot \mathbb{E}[s_i \mid s_{i-1} = j] = \sum_{j=0}^{\infty} \Pr[s_{i-1} = j] \cdot j \cdot \mu = \mu \cdot \mathbb{E}[s_{i-1}] = \mu^i,$$
where the last equality is by the inductive hypothesis. Thus, $\mathbb{E}[s] = \sum_{i=0}^{\infty} \mu^i = 1/(1-\mu)$ since $\mu < 1$. Finally, since $C$ is loose, $\mu \le 1 - \epsilon$ for some constant $\epsilon > 0$ and $f(\{a\}) \le \mathbb{E}[s] \le 1/\epsilon$.
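The geometric-series bound at the end of the proof can be checked numerically; `expected_branching_size` is an illustrative helper of ours that truncates the series at a finite number of generations.

```python
def expected_branching_size(mu, generations=200):
    """Expected total progeny of a subcritical Galton-Watson process with
    offspring mean mu < 1: the partial sum of mu^i over generations i,
    which converges to 1/(1 - mu)."""
    return sum(mu ** i for i in range(generations))

# a loose community: p_C <= (1 - eps)/|C|, so mu = p_C (|C| - 1) < 1
p_C, size = 0.9 / 100, 100
mu = p_C * (size - 1)
bound = expected_branching_size(mu)  # upper bounds f({a}) for a in C
```

With `mu` bounded away from 1, the truncated sum is already indistinguishable from the limit $1/(1-\mu)$, matching the constant upper bound on a loose-community node's influence.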
We can now upper bound the value of the optimal solution $S^\star$. Let $C_1, \dots, C_t$, $t \le k$, be the tight communities that have at least one node in the optimal solution $S^\star$ and that are of super-constant size, i.e., $|C_i| = \omega(1)$. Without loss of generality, we order these communities in decreasing order of their size $|C_i|$.
Lemma 2.2.7. Let $S^\star$ be the optimal set of nodes and let $C_i$ and $t$ be defined as above. There exists a constant $c$ such that $f(S^\star) \le \sum_{i=1}^t |C_i| + c \cdot k$.

Proof. Let $S_A^\star$ and $S_B^\star$ be a partition of the optimal nodes into nodes that are in tight communities with super-constant individual influence and nodes that are not in such a community. The influence $f(S_A^\star)$ is trivially upper bounded by $\sum_{i=1}^t |C_i|$. Next, there exists some constant $c$ such that $f(S_B^\star) \le \sum_{a \in S_B^\star} f(\{a\}) \le c \cdot k$, where the first inequality is by submodularity and the second holds since nodes in loose communities have constant individual influence by Lemma 2.2.6 and nodes in tight communities without super-constant individual influence have constant influence by definition. We conclude that, by submodularity, $f(S^\star) \le f(S_A^\star) + f(S_B^\star) \le \sum_{i=1}^t |C_i| + c \cdot k$.

Next, we argue that the solution returned by the algorithm is a constant factor away from $\sum_{i=1}^t |C_i|$.

Lemma 2.2.8. Let $a$ be the $i$th node in the ordering by first order marginal contribution after the pruning, and let $C_i$ be the $i$th largest tight community with super-constant individual influence and with at least one node in the optimal solution $S^\star$. Then, $f(\{a\}) \ge \epsilon |C_i|$ for some constant $\epsilon > 0$.
Proof. By definition of the $C_i$, we have $|C_1| \ge \cdots \ge |C_i|$, and these are all tight communities. Let $b$ be a node in $C_j$ for $j \in [i]$, let $\mathbb{1}_{gc(C)}$ be the indicator variable indicating if there is a giant component in community $C$, and let $gc(C)$ be this giant component. We get
$$v(b) \;\ge\; \Pr[\mathbb{1}_{gc(C_j)}] \cdot \Pr_{S \sim \mathcal{D}_{-b}}[S \cap C_j = \emptyset] \cdot \Pr[b \in gc(C_j)] \cdot \mathbb{E}[|gc(C_j)| : b \in gc(C_j)] \;\ge\; (1 - o(1)) \cdot \epsilon_1 \cdot \epsilon_2 \cdot \epsilon_3 |C_j| \;\ge\; \epsilon |C_j|$$
for some constants $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon > 0$ by Lemma 2.2.5 and the non-ubiquitous assumption. Similarly as in Theorem 2.2.2, if $a$ and $b$ are in different communities, Overlap$(a, b, \alpha) = \mathrm{False}$ for $\alpha \in (0, 1]$. Thus, there is at least one node $b \in \cup_{j=1}^i C_j$ at position $i$ or after in the ordering after the pruning, and $v(b) \ge \epsilon |C_j|$ for some $j \in [i]$. By the ordering by first order marginal contributions and since node $a$ is in $i$th position, $v(a) \ge v(b)$, and we get that $f(\{a\}) \ge v(a) \ge v(b) \ge \epsilon |C_j| \ge \epsilon |C_i|$.
Next, we show that the algorithm never picks two nodes from the same tight community.

Lemma 2.2.9. If $a, b \in C$ and $C$ is a tight community, then Overlap$(a, b, \alpha) = \mathrm{True}$ for $\alpha = o(1)$.
Proof. Let $a, b \in C$ such that $C$ is a tight community. The marginal contribution of node $a$ can be decomposed according to whether $a \in gc(C)$:

$$v(a) = \Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \in gc(C)] + \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \notin gc(C)].$$

Since $\mathcal{D}$ is a product distribution,

$$\mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \in gc(C)] = \left(\Pr[b \notin gc(C)] + \Pr_{S \sim \mathcal{D}}[b \in gc(C), b \notin S]\right) \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,-b}}[f_S(a) : a \in gc(C)]$$
$$\ge (1 + \epsilon) \Pr[b \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,-b}}[f_S(a) : a \in gc(C)]$$
$$= (1 + \epsilon) \, \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a) : a \in gc(C)]$$

for some constant $\epsilon > 0$, since $\Pr_{S \sim \mathcal{D}_{-a}}[b \in gc(C), b \notin S] \ge \epsilon_1$ for some constant $\epsilon_1 > 0$. Since

$$\Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a) : a \in gc(C)] \ge \epsilon' |C| \ge \epsilon' \cdot \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \notin gc(C)]$$

for some constant $\epsilon' > 0$, we get

$$v(a) \ge (1 + \epsilon) \Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a) : a \in gc(C)] + \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a}}[f_S(a) : a \notin gc(C)]$$
$$\ge (1 + \epsilon \epsilon' / 2) \left( \Pr_G[a \in gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a) : a \in gc(C)] + \Pr_G[a \notin gc(C)] \cdot \mathbb{E}_{S \sim \mathcal{D}_{-a,b}}[f_S(a) : a \notin gc(C)] \right)$$
$$= (1 + \epsilon \epsilon' / 2) \, v_b(a).$$

Thus, $\textsc{Overlap}(a, b, \alpha) = \text{True}$ for $\alpha = o(1)$.
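The overlap test used in the two lemmas above can be summarized in a short sketch. This is an illustrative reconstruction rather than the thesis's exact subroutine: we assume `overlap` receives the estimated first order marginal contribution $v(a)$ and the second order marginal contribution $v_b(a)$, and flags overlap when adding $b$ shrinks $a$'s contribution by more than a $(1 + \alpha)$ factor, matching the $(1 + \epsilon\epsilon'/2)$ gap derived in the proof.

```python
def overlap(v_a: float, vb_a: float, alpha: float) -> bool:
    """Hypothetical overlap test: flag a and b as overlapping when a's
    contribution drops by more than a (1 + alpha) factor once b is added,
    i.e. when v(a) >= (1 + alpha) * v_b(a)."""
    return v_a >= (1.0 + alpha) * vb_a

# Two nodes in the same tight community: a's contribution collapses once b
# is picked, so overlap is detected even for a small alpha.
print(overlap(v_a=10.0, vb_a=8.0, alpha=0.1))   # -> True
# Two nodes in different communities: a's contribution is unaffected by b.
print(overlap(v_a=10.0, vb_a=10.0, alpha=0.1))  # -> False
```

By Lemma 2.2.8 the gap between $v(a)$ and $v_b(a)$ is a constant factor for nodes in the same tight community, so any $\alpha = o(1)$ separates the two cases.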
We combine the above lemmas to obtain the approximation guarantee of COPS.
Theorem 2.2.3. With overlap allowed $\alpha = 1/\operatorname{poly}(n)$, COPS is a constant factor approximation algorithm for learning to influence from samples drawn from a bounded product distribution $\mathcal{D}$, in the setting with tight and loose communities and non-ubiquitous seed sets.
Proof. First, observe that $f(S) \ge k$ since nodes trivially influence themselves. Let $a_i$ be the node picked by the algorithm that is in the $i$th position of the ordering after the pruning, and assume $i \le t$. By Lemma 2.2.8, $f(\{a_i\}) \ge \epsilon |C_i|$, where $C_i$ is the $i$th largest tight community with super-constant individual influence and with at least one node in $S^\star$. Thus $a_i$, $i \in [t]$, is in a tight community: otherwise it would have constant influence by Lemma 2.2.6, which contradicts $f(\{a_i\}) \ge \epsilon |C_i|$. Since each $a_i$, $i \in [t]$, is in a tight community, by Lemma 2.2.9 we obtain that $a_1, \ldots, a_t$ are all in different communities.

We denote by $S_t = \{a_1, \ldots, a_t\}$ the corresponding subset of the solution returned by COPS and obtain

$$f(S^\star) \le \sum_{i=1}^{t} |C_i| + c \cdot k \le \sum_{i=1}^{t} \frac{1}{\epsilon} f(\{a_i\}) + c \cdot f(S) = \frac{1}{\epsilon} f(S_t) + c \cdot f(S) \le c_1 \cdot f(S)$$

for some constants $\epsilon, c, c_1$, by Lemmas 2.2.7 and 2.2.8 and since $a_i, a_j$ are in different communities for $i, j \le t$.
Experiments
In this section, we compare the performance of COPS and three other algorithms on real and synthetic networks. We show that COPS performs well in practice: it outperforms the previous optimization from samples algorithm and gets closer to the solution obtained when given complete access to the influence function.
Experimental setup. The first synthetic network considered is the stochastic block model,
SBM 1, where communities have random sizes, with one community of size significantly larger than the others. We maintained the same expected community size as n varied. In the second stochastic block model, SBM 2, all communities have the same expected size and the number of communities was fixed as n varied. The third and fourth synthetic
networks were an Erdős–Rényi (ER) random graph and the preferential attachment model (PA). Experiments were also conducted on two real networks publicly available ([Leskovec and Krevl, 2015]). The first is a subgraph of the Facebook social network with n = 4k and m = 88k. The second is a subgraph of the DBLP co-authorship network, which has ground truth communities as described in [Leskovec and Krevl, 2015], where nodes of degree at most 10 were pruned to obtain n = 54k, m = 361k, and where the 1.2k nodes with degree at least 50 were considered as potential nodes in the solution.
Benchmarks. We considered three different benchmarks to compare the COPS algorithm against. The standard Greedy algorithm in the value query model is an upper bound, since it is the optimal efficient algorithm given value query access to the function, while COPS is in the more restricted setting with only samples. MargI is the optimization from samples algorithm which picks the k nodes with highest first order marginal contribution, as in the previous section, and does not use second order marginal contributions. Random simply returns a random set. All the samples are drawn from the product distribution with marginal probability k/n, so that samples have expected size k. Each point in a plot corresponds to the average performance of the algorithms over 10 trials. The default value for k is k = 10. For the experiments on synthetic data, the default overlap allowed is α = 0.5; for the Facebook experiments, α = 0.4; and for the DBLP experiments, α = 0.2. The default edge weights are chosen so that in the random realization of G the average degree of the nodes is approximately 1.

Figure 2.4: Empirical performance of COPS against the Greedy upper bound, the previous optimization from samples algorithm MargI, and a random set.
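The sampling setup above can be sketched in a few lines; `draw_samples` is an illustrative helper for this document, not code from the experiments:

```python
import random

def draw_samples(n: int, k: int, num_samples: int, seed: int = 0):
    """Draw sample sets from the product distribution in which each of the
    n nodes is included independently with probability k/n, so that the
    expected sample size is k."""
    rng = random.Random(seed)
    p = k / n
    return [{i for i in range(n) if rng.random() < p} for _ in range(num_samples)]

samples = draw_samples(n=1000, k=10, num_samples=500)
avg_size = sum(len(S) for S in samples) / len(samples)
# The average sample size concentrates around k = 10.
```

In the experiments each such sample would be paired with its influence value $f(S)$ to form the observed data.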
Empirical evaluation. COPS significantly outperforms the previous optimization from samples algorithm MargI, getting much closer to the Greedy upper bound. We observe that the stronger the community structure of the network, the better the performance of COPS compared to MargI, e.g., the SBMs versus ER and PA (which do not have a community structure). When the edge weight $q := q_{i.c.}$ for the cascades is small, the function is near-linear and MargI performs well, whereas when it is large, there is a lot of overlap and COPS performs better. The performance of COPS as a function of the overlap allowed can be explained as follows: its performance slowly increases as the overlap allowed increases and COPS can pick from a larger collection of nodes, until it drops when too much overlap is allowed and COPS picks mostly very close nodes from the same community. For SBM 1, with one larger community, MargI is trapped into only picking nodes from that larger community and performs even worse than Random. As n increases, the number of nodes influenced increases roughly linearly for SBM 2, where the number of communities is fixed, since the number of nodes per community increases linearly; this is not the case for SBM 1.

Figure 2.5: Performance of COPS as a function of the overlap α allowed. The performance is normalized so that the performances of Greedy and MargI correspond to the values 1 and 0 respectively.
2.2.3 General Submodular Maximization
We develop an $\tilde{\Omega}(n^{-1/4})$ optimization from samples algorithm over a distribution $\mathcal{D}$ for monotone submodular functions. This bound is essentially tight, since submodular functions are not $n^{-1/4 + \epsilon}$-optimizable from samples over any distribution (see Section 2.3.2). We first describe the distribution for which the approximation holds. Then we describe the algorithm, which builds upon estimates of expected marginal contributions.
The distribution. Let $\mathcal{D}_i$ be the uniform distribution over all sets of size $i$. Define the distribution $\mathcal{D}^{sub}$ to be the distribution which draws from $\mathcal{D}_k$, $\mathcal{D}_{\sqrt{n}}$, and $\mathcal{D}_{\sqrt{n}+1}$ at random.
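A sampler for $\mathcal{D}^{sub}$ can be sketched as follows. This is an illustrative sketch in which, as an assumption, the three sizes are drawn uniformly and $\sqrt{n}$ is rounded down to an integer:

```python
import math
import random

def sample_from_d_sub(n: int, k: int, rng: random.Random):
    """Draw one set from D^sub: pick one of the three sizes k, sqrt(n),
    sqrt(n) + 1 uniformly at random, then return a uniformly random set
    of that size from the ground set [n]."""
    root = math.isqrt(n)
    size = rng.choice([k, root, root + 1])
    return set(rng.sample(range(n), size))

rng = random.Random(0)
sets = [sample_from_d_sub(n=100, k=5, rng=rng) for _ in range(20)]
# Every drawn set has size 5, 10, or 11.
```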
Estimates of expected marginal contributions. We estimate $\hat{v}_i \approx \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f_S(e_i)]$ with samples from $\mathcal{D}_{\sqrt{n}}$ and $\mathcal{D}_{\sqrt{n}+1}$.
Algorithm 4 EEMC: estimates the expected marginal contributions $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f_S(e_i)]$.
Input: $\mathcal{S} = \{S_j : (S_j, f(S_j)) \text{ is a sample}\}$
for $i \in [n]$ do
  $\mathcal{S}_{i, \sqrt{n}+1} \leftarrow \{S : S \in \mathcal{S},\ e_i \in S,\ |S| = \sqrt{n} + 1\}$
  $\mathcal{S}_{-i, \sqrt{n}} \leftarrow \{S : S \in \mathcal{S},\ e_i \notin S,\ |S| = \sqrt{n}\}$
  $\hat{v}_i = \frac{1}{|\mathcal{S}_{i,\sqrt{n}+1}|} \sum_{S \in \mathcal{S}_{i,\sqrt{n}+1}} f(S) - \frac{1}{|\mathcal{S}_{-i,\sqrt{n}}|} \sum_{S \in \mathcal{S}_{-i,\sqrt{n}}} f(S)$
end for
return $(\hat{v}_1, \ldots, \hat{v}_n)$
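Algorithm 4 translates directly into code. The sketch below assumes samples are given as (set, value) pairs and that $\sqrt{n}$ is an integer; elements with no qualifying samples get an estimate of 0, a robustness choice made here that is not specified in the pseudocode:

```python
import math
import random

def eemc(samples, n):
    """Estimate expected marginal contributions from (set, value) samples.

    For each element i, v_hat[i] is the average value of samples of size
    sqrt(n) + 1 containing i, minus the average value of samples of size
    sqrt(n) not containing i (Algorithm 4, EEMC)."""
    root = math.isqrt(n)
    v_hat = [0.0] * n
    for i in range(n):
        with_i = [f for S, f in samples if i in S and len(S) == root + 1]
        without_i = [f for S, f in samples if i not in S and len(S) == root]
        if with_i and without_i:
            v_hat[i] = sum(with_i) / len(with_i) - sum(without_i) / len(without_i)
    return v_hat

# Sanity check on the additive function f(S) = |S| (every marginal
# contribution is exactly 1): the estimates should recover this.
rng = random.Random(0)
n = 100
root = math.isqrt(n)  # 10
samples = []
for _ in range(2000):
    size = rng.choice([root, root + 1])
    S = frozenset(rng.sample(range(n), size))
    samples.append((S, float(len(S))))
v_hat = eemc(samples, n)
```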
By standard concentration bounds (Hoeffding's inequality, Lemma 2.2.10), these are good estimates of $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f_S(e_i)]$ for product distributions $\mathcal{D}$ (Lemma 2.2.11).
Lemma 2.2.10 (Hoeffding's inequality). Let $X_1, \ldots, X_m$ be independent random variables with values in $[0, b]$. Let $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$,

$$\Pr\left[\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \ge \epsilon\right] \le 2e^{-2m\epsilon^2 / b^2}.$$
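Read in the other direction, the bound tells us how many samples guarantee a target accuracy: $\Pr[|\bar{X} - \mathbb{E}[\bar{X}]| \ge \epsilon] \le \delta$ as soon as $m \ge b^2 \ln(2/\delta) / (2\epsilon^2)$. A small illustrative helper:

```python
import math

def hoeffding_sample_size(b: float, eps: float, delta: float) -> int:
    """Smallest m such that 2 * exp(-2 * m * eps**2 / b**2) <= delta."""
    return math.ceil(b * b * math.log(2.0 / delta) / (2.0 * eps * eps))

m = hoeffding_sample_size(b=1.0, eps=0.05, delta=0.01)
# Check that m samples indeed drive the failure probability below delta.
assert 2.0 * math.exp(-2.0 * m * 0.05**2 / 1.0**2) <= 0.01
```

In the proof of Lemma 2.2.11 below, this is applied with $b = f(N)$ and $m = 2n^{2c+1}$, which drives the failure probability down to $2e^{-n}$.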
Lemma 2.2.11. With probability at least $1 - O(e^{-n})$, the estimates $\hat{v}_i$ defined above are $\epsilon$-accurate for any $\epsilon \ge f(N)/\operatorname{poly}(n)$ and for all $e_i$, i.e.,

$$\left|\hat{v}_i - \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f_S(e_i)]\right| \le \epsilon.$$
Proof. Let $\epsilon \ge f(N)/n^c$ for some constant $c$. With a sufficiently large polynomial number of samples, $\mathcal{S}_{i,\sqrt{n}+1}$ and $\mathcal{S}_{-i,\sqrt{n}}$ are of size at least $2n^{2c+1}$ with exponentially high probability. Then by Hoeffding's inequality (Lemma 2.2.10 with $m = 2n^{2c+1}$ and $b = f(N)$),

$$\Pr\left[\left|\frac{1}{|\mathcal{S}_{i,\sqrt{n}+1}|} \sum_{S \in \mathcal{S}_{i,\sqrt{n}+1}} f(S) - \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}+1} | e_i \in S}[f(S)]\right| \ge \epsilon/2\right] \le 2e^{-4n^{2c+1}(\epsilon/2)^2 / f(N)^2} \le 2e^{-n}$$

and similarly,

$$\Pr\left[\left|\frac{1}{|\mathcal{S}_{-i,\sqrt{n}}|} \sum_{S \in \mathcal{S}_{-i,\sqrt{n}}} f(S) - \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f(S)]\right| \ge \epsilon/2\right] \le 2e^{-n}.$$

By the definition of $\hat{v}_i$ and since

$$\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f_S(e_i)] = \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}+1} | e_i \in S}[f(S)] - \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S}[f(S)],$$

the claim then holds with probability at least $1 - 4e^{-n}$.
The algorithm. We begin by computing the estimated expected marginal contributions of all elements. We then place the elements in $3 \log n$ bins according to their estimated expected marginal contribution $\hat{v}_i$. The algorithm then simply returns either the best sample of size $k$ or a random subset of size $k$ of a random bin. Up to logarithmic factors, we can restrict our attention to just one bin. We give a formal description below.
Algorithm 5 An $\tilde{\Omega}(n^{-1/4})$-optimization from samples algorithm over $\mathcal{D}^{sub}$ for monotone submodular functions.
Input: $\mathcal{S} = \{S_i : (S_i, f(S_i)) \text{ is a sample}\}$
With probability $\frac{1}{2}$:
  return $\operatorname{argmax}_{S \in \mathcal{S} : |S| = k} f(S)$ (best sample of size $k$)
With probability $\frac{1}{2}$:
  $(\hat{v}_1, \ldots, \hat{v}_n) \leftarrow \text{EEMC}(\mathcal{S})$
  $\hat{v}_{\max} \leftarrow \max_i \hat{v}_i$
  for $j \in [3 \log n]$ do
    $B_j \leftarrow \left\{i : \frac{\hat{v}_{\max}}{2^j} \le \hat{v}_i < \frac{\hat{v}_{\max}}{2^{j-1}}\right\}$
  end for
  Pick $j \in [3 \log n]$ u.a.r.
  return $S$, a subset of $B_j$ of size $\min\{|B_j|, k\}$ u.a.r. (a random set from a random bin)
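A sketch of Algorithm 5 in code. This is illustrative: it assumes $\sqrt{n}$ is an integer, inlines the EEMC estimate of Algorithm 4, uses Python's `random` for the coin flip and the bin choice, and (as a robustness choice not in the pseudocode) leaves non-positive estimates unbinned and puts the maximal estimate in the first bin:

```python
import math
import random

def ofs_algorithm(samples, n, k, rng):
    """Sketch of Algorithm 5: with prob. 1/2 return the best sample of size k;
    otherwise bin elements by estimated marginal contribution and return a
    uniformly random subset of size min(|B_j|, k) of a random bin B_j."""
    if rng.random() < 0.5:
        # Best sample of size k (assumes at least one such sample exists).
        size_k = [(S, f) for S, f in samples if len(S) == k]
        return set(max(size_k, key=lambda sf: sf[1])[0])
    # Estimate v_hat as in EEMC (Algorithm 4).
    root = math.isqrt(n)
    v_hat = []
    for i in range(n):
        with_i = [f for S, f in samples if i in S and len(S) == root + 1]
        without_i = [f for S, f in samples if i not in S and len(S) == root]
        est = 0.0
        if with_i and without_i:
            est = sum(with_i) / len(with_i) - sum(without_i) / len(without_i)
        v_hat.append(est)
    v_max = max(v_hat)
    num_bins = max(1, math.ceil(3 * math.log(n)))
    bins = [[] for _ in range(num_bins)]
    for i, v in enumerate(v_hat):
        if v <= 0:
            continue  # leave non-positive estimates unbinned (a choice made here)
        for j in range(1, num_bins + 1):
            if v_max / 2**j <= v < v_max / 2**(j - 1) or (j == 1 and v == v_max):
                bins[j - 1].append(i)
                break
    B = rng.choice(bins)
    return set(rng.sample(B, min(len(B), k)))

# Tiny usage sketch on f(S) = |S| with n = 64 (so sqrt(n) = 8) and k = 4.
rng = random.Random(1)
n, k = 64, 4
samples = []
for _ in range(600):
    size = rng.choice([k, 8, 9])
    S = frozenset(rng.sample(range(n), size))
    samples.append((S, float(len(S))))
out = ofs_algorithm(samples, n, k, rng)
```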
Analysis of the algorithm. The main crux of this result is in the analysis of the algorithm. The analysis is divided into two cases, depending on whether or not a random set $S \sim \mathcal{D}_{\sqrt{n}}$ of size $\sqrt{n}$ has low value. Let $S^\star$ be the optimal solution.
• Assume that $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \le f(S^\star)/4$. Then optimal elements have large estimated expected marginal contributions $\hat{v}_i$ by submodularity. Let $B^\star$ be the bin with the largest value among the bins with contributions $\hat{v}_i \ge f(S^\star)/(4k)$. We argue that a random subset of $B^\star$ of size $k$ performs well. We first show that a random subset of $B^\star$ is a $|B^\star|/(4k\sqrt{n})$-approximation. At a high level, a random subset $S$ of size $\sqrt{n}$ contains $|B^\star|/\sqrt{n}$ elements from bin $B^\star$ in expectation, and these $|B^\star|/\sqrt{n}$ elements $S_{B^\star}$ have contributions at least $f(S^\star)/(4k)$ to $S_{B^\star}$. We then show that a random subset of $B^\star$ is an $\tilde{\Omega}(k/|B^\star|)$-approximation to $f(S^\star)$. The proof first shows that $f(B^\star)$ has high value by the assumption that a random set $S \sim \mathcal{D}_{\sqrt{n}}$ has low value, and then uses the fact that a subset of $B^\star$ of size $k$ is a $k/|B^\star|$-approximation to $f(B^\star)$. Note that either $|B^\star|/(4k\sqrt{n})$ or $\tilde{\Omega}(k/|B^\star|)$ is at least $\tilde{\Omega}(n^{-1/4})$.

• Assume that $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \ge f(S^\star)/4$. We argue that the best sample of size $k$ performs well. We first show that, by submodularity, a random set of size $k$ is a $k/(4\sqrt{n})$-approximation, since a random set of size $k$ is a fraction $k/\sqrt{n}$ smaller than a random set from $\mathcal{D}_{\sqrt{n}}$ in expectation. We then show that the best sample of size $k$ is a $1/k$-approximation, since it contains the element with the highest value with high probability. Note that either $k/(4\sqrt{n})$ or $1/k$ is at least $n^{-1/4}$.
We begin with the following useful lemma.
Lemma 2.2.12. For any monotone submodular function $f(\cdot)$, the value of a uniformly random set $S$ of size $k$ is, in expectation, a $k/n$-approximation to $f(N)$.

Proof. Partition the ground set into $n/k$ sets of size $k$ uniformly at random. By submodularity, a uniformly random set of this partition is, in expectation, a $k/n$-approximation to $f(N)$. A uniformly random set of this partition is also a uniformly random set of size $k$.
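Lemma 2.2.12 is easy to check numerically. The sketch below evaluates a small monotone submodular function, here the square root of an additive function (a concave function of a modular function, made up for illustration), on many uniformly random sets of size $k$ and compares the average to $(k/n) \cdot f(N)$:

```python
import math
import random

def f(S, w):
    """A monotone submodular function: the square root of an additive
    (modular) function, f(S) = sqrt(sum of the weights of S)."""
    return math.sqrt(sum(w[i] for i in S))

rng = random.Random(0)
n, k = 30, 5
w = [rng.random() for _ in range(n)]
f_N = f(range(n), w)

# Average value of a uniformly random set of size k over many trials.
avg = sum(f(rng.sample(range(n), k), w) for _ in range(2000)) / 2000
# Lemma 2.2.12: the average is at least (k/n) * f(N).
```

For this concave-of-modular function the average is in fact close to $\sqrt{k/n} \cdot f(N)$, comfortably above the $k/n$ guarantee of the lemma.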
In the first case of the analysis, we assume that $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \le f(S^\star)/4$. Let $j'$ be the largest $j$ such that bin $B_j$ contains at least one element $e_i$ with $\hat{v}_i \ge f(S^\star)/(2k)$. So any element $e_i \in B_j$ with $j \le j'$ is such that $\hat{v}_i \ge f(S^\star)/(4k)$. Define $B^\star = \operatorname{argmax}_{B_j : j \le j'} f(S^\star \cap B_j)$ to be the bin with high marginal contributions that has the highest value from the optimal solution, and let $t$ be the size of $B^\star$.

Lemma 2.2.13. If $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \le f(S^\star)/4$, then a uniformly random subset of bin $B^\star$ of size $\min\{k, t\}$ is a $(1 - o(1)) \cdot \min(1/4, t/(4k\sqrt{n}))$-approximation to $f(S^\star)$.
Proof. Note that

$$\mathbb{E}_{S \sim \mathcal{D}_{t/\sqrt{n}} | S \subseteq B^\star}[f(S)] \ge \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[f(S \cap B^\star)\right] \qquad \text{(submodularity)}$$
$$\ge \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[\sum_{e_i \in S \cap B^\star} f_{(S \cap B^\star) \setminus e_i}(e_i)\right] \qquad \text{(submodularity)}$$
$$\ge \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[\sum_{e_i \in S \cap B^\star} \mathbb{E}_{S' \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S'}[f_{S'}(e_i)]\right] \qquad \text{(submodularity)}$$
$$\ge \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[\sum_{e_i \in S \cap B^\star} \left(\hat{v}_i - f(N)/n^3\right)\right]$$
$$\ge \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[|S \cap B^\star|\right] (1 - o(1)) \frac{f(S^\star)}{4k} \qquad (\hat{v}_i \ge f(S^\star)/(4k) \text{ for } e_i \in B^\star)$$
$$= (1 - o(1)) \frac{t f(S^\star)}{4k\sqrt{n}}.$$

If $t/\sqrt{n} \ge k$, then a uniformly random subset of bin $B^\star$ of size $k$ is a $k\sqrt{n}/t$-approximation to $\mathbb{E}_{S \sim \mathcal{D}_{t/\sqrt{n}} | S \subseteq B^\star}[f(S)]$ by Lemma 2.2.12, so a $(1 - o(1))/4$-approximation to $f(S^\star)$ by the above inequalities. Otherwise, if $t/\sqrt{n} < k$, a uniformly random subset of $B^\star$ of size $\min\{k, t\}$ has value at least $\mathbb{E}_{S \sim \mathcal{D}_{t/\sqrt{n}} | S \subseteq B^\star}[f(S)]$ by monotonicity, and is thus a $(1 - o(1)) t/(4k\sqrt{n})$-approximation to $f(S^\star)$.

Lemma 2.2.14. If $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \le f(S^\star)/4$, a uniformly random subset of bin $B^\star$ of size $\min(k, t)$ is an $\tilde{\Omega}(\min(1, k/t))$-approximation to $f(S^\star)$.

Proof. We start by bounding the value of the optimal elements not in bins $B_j$ with $j \le j'$:

$$f(S^\star \setminus (\cup_{j \le j'} B_j)) \le \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[f(S \cup (S^\star \setminus (\cup_{j \le j'} B_j)))\right] \qquad \text{(monotonicity)}$$
$$\le \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[f(S) + \sum_{e_i \in (S^\star \setminus (\cup_{j \le j'} B_j)) \setminus S} \mathbb{E}_{S' \sim \mathcal{D}_{\sqrt{n}} | e_i \notin S'}[f_{S'}(e_i)]\right] \qquad \text{(submodularity)}$$
$$\le f(S^\star)/4 + \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}\left[\sum_{e_i \in (S^\star \setminus (\cup_{j \le j'} B_j)) \setminus S} \left(\hat{v}_i + f(N)/n^3\right)\right] \qquad \text{(assumption)}$$
$$\le f(S^\star)/4 + f(S^\star)/2 + f(S^\star)/n \qquad \text{(definition of } j'\text{)}$$

Since $f(S^\star) \le f(S^\star \cap (\cup_{j \le j'} B_j)) + f(S^\star \setminus (\cup_{j \le j'} B_j))$ by submodularity, we get that $f(S^\star \cap (\cup_{j \le j'} B_j)) \ge f(S^\star)/5$. Since there are $3 \log n$ bins, $f(S^\star \cap B^\star)$ is within a $3 \log n$ factor of $f(S^\star)/5$ by submodularity and the definition of $B^\star$. By monotonicity, $f(B^\star)$ is within a $15 \log n$ factor of $f(S^\star)$. Thus, by Lemma 2.2.12, a random subset $S$ of size $k$ of bin $B^\star$ is an $\tilde{\Omega}(\min(1, k/t))$-approximation to $f(S^\star)$.

Corollary 1. If $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \le f(S^\star)/4$, a uniformly random subset of size $\min\{k, |B_j|\}$ of a random bin $B_j$ is an $\tilde{\Omega}(n^{-1/4})$-approximation to $f(S^\star)$.

Proof. With probability $1/(3 \log n)$, the random bin is $B^\star$, and we assume this is the case. By Lemmas 2.2.13 and 2.2.14, a random subset of $B^\star$ of size $\min(k, t)$ is both an $\Omega(\min(1, t/(k\sqrt{n})))$- and an $\tilde{\Omega}(\min(1, k/t))$-approximation to $f(S^\star)$. Assume $t/(k\sqrt{n}) \le 1$ and $k/t \le 1$, since otherwise we are done. Finally, note that if $t/k \ge n^{1/4}$, then $t/(k\sqrt{n}) = \Omega(n^{-1/4})$; otherwise, $\tilde{\Omega}(k/t) = \tilde{\Omega}(n^{-1/4})$.
In the second case of the proof, we assume that $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \ge f(S^\star)/4$.

Lemma 2.2.15. For any monotone submodular function $f$, if $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \ge f(S^\star)/4$, then a uniformly random set of size $k$ is a $\min(1/4, k/(4\sqrt{n}))$-approximation to $f(S^\star)$.

Proof. If $k \ge \sqrt{n}$, then a uniformly random set of size $k$ is a $1/4$-approximation to $f(S^\star)$ by monotonicity. Otherwise, a uniformly random subset of size $k$ of $N$ is a uniformly random subset of size $k$ of a uniformly random subset of size $\sqrt{n}$ of $N$. So by Lemma 2.2.12,

$$\mathbb{E}_{S \sim \mathcal{D}_k}[f(S)] \ge \frac{k}{\sqrt{n}} \cdot \mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \ge \frac{k}{4\sqrt{n}} \cdot f(S^\star).$$

Lemma 2.2.16. For any monotone submodular function $f(\cdot)$, the sample $S$ with the largest value among at least $n \log n$ samples of size $k$ is a $1/k$-approximation to $f(S^\star)$ with high probability.

Proof. By submodularity, there exists an element $e_i^\star$ such that $\{e_i^\star\}$ is a $1/k$-approximation to the optimal solution. By monotonicity, any set which contains $e_i^\star$ is a $1/k$-approximation to the optimal solution. After observing $n \log n$ samples, the probability of never observing a set that contains $e_i^\star$ is polynomially small.

Corollary 2. If $\mathbb{E}_{S \sim \mathcal{D}_{\sqrt{n}}}[f(S)] \ge f(S^\star)/4$, then the sample of size $k$ with the largest value is a $\min(1/4, n^{-1/4}/4)$-approximation to $f(S^\star)$.

Proof. By Lemmas 2.2.15 and 2.2.16, the sample of size $k$ with the largest value is both a $\min(1/4, k/(4\sqrt{n}))$- and a $1/k$-approximation to $f(S^\star)$. If $k \ge n^{1/4}$, then $\min(1/4, k/(4\sqrt{n})) \ge \min(1/4, n^{-1/4}/4)$; otherwise, $1/k \ge n^{-1/4}$.

By combining Corollaries 1 and 2, we obtain the main result for this section.

Theorem 2.2.4. Algorithm 5 is an $\tilde{\Omega}(n^{-1/4})$ optimization from samples algorithm over $\mathcal{D}^{sub}$ for monotone submodular functions.

2.3 The Limitations of Optimization from Samples

In this section, we show hardness results for optimization from samples from any distribution for three classes of functions. We begin with a general framework for showing hardness results in the optimization from samples model (Section 2.3.1).
As a warm-up for coverage functions, we show in Section 2.3.2 that for submodular functions, there is no optimization from samples algorithm that obtains an $O(n^{-1/4 + \epsilon})$ approximation. We then present the main result for the hardness of optimizing coverage functions from samples in Section 2.3.3. Finally, we show in Section 2.3.4 a $(1 - c)/(1 + c - c^2) + o(1)$ lower bound for monotone submodular functions with curvature $c$. For submodular functions and functions with curvature, these lower bounds are tight up to lower order terms with the upper bounds obtained for these functions in the previous section.

2.3.1 A Framework for Hardness of Optimization from Samples

The framework we introduce partitions the ground set of elements into good, bad, and masking elements. We derive two conditions on the values of these elements so that samples do not contain enough information to distinguish good and bad elements with high probability. We then give two additional conditions so that if an algorithm cannot distinguish good and bad elements, the solution returned by this algorithm has low value compared to the optimal set, which consists of the good elements. We begin by defining the partition.

Definition 10. The collection of partitions $\mathcal{P}$ contains all partitions $P$ of the ground set $N$ into $r$ parts $T_1, \ldots, T_r$ of $k$ elements each and a part $M$ of the remaining $n - rk$ elements, where $n = |N|$.

The elements in $T_i$ are called the good elements, for some $i \in [r]$. The bad and masking elements are the elements in $T_{-i} := \cup_{j=1, j \neq i}^{r} T_j$ and $M$ respectively. Next, we define a class of functions $\mathcal{F}(g, b, m, m^+)$ such that $f \in \mathcal{F}(g, b, m, m^+)$ is defined in terms of good, bad, and masking functions $g$, $b$, and $m^+$, and a masking fraction $m \in [0, 1]$.³

Definition 11. Given functions $g, b, m, m^+$, the class of functions $\mathcal{F}(g, b, m, m^+)$ contains the functions $f^{P,i}$, where $P \in \mathcal{P}$ and $i \in [r]$, defined as
≠ fl 3 fl fl ≠ 4 fl We use probabilistic arguments over the partition P and the integer i [r] chosen œP œ uniformly at random to show that for any distribution and any algorithm, there exists a D function in (g, b, m, m+) that the algorithm optimizes poorly given samples from . The F D functions g, b, m, m+ have desired properties that are parametrized below. At a high level, the identical on small samples and masking on large samples properties imply that the samples do not contain enough information to learn i, i.e. distinguish good and bad elements, even though the partition P can be learned. The gap and curvature property imply that if an algorithm cannot distinguish good and bad elements, then the algorithm performs poorly. Definition 12. The class of functions (g, b, m, m+) has an (–,—)-gap if the following F conditions are satisfied for some t, where ( ) is the uniform distribution over . U P P Ê(1) 1. Identical on small samples. For a fixed S : S t, with probability 1 n≠ over | |Æ ≠ partition P ( ), g(S Ti)+b(S T i) is independent of i; ≥U P fl fl ≠ Ê(1) 2. Masking on large samples. For a fixed S : S t, with probability 1 n≠ over | |Ø ≠ partition P ( ), the masking fraction is m(S M)=1; ≥U P fl 3. –-Gap. Let S : S = k, then g(S) max – b(S),– m+(S) ; | | Ø { · · } 4. —-Curvature. Let S : S = k and S : S = k/r, then g(S ) (1 —) r g(S ). 1 | 1| 2 | 2| 1 Ø ≠ · · 2 3The notation m+ refers to the role of this function, which is to maintain monotonicity of masking elements. These four functions are assumed to be normalized such that g( )=b( )=m( )=m+( )=0. ÿ ÿ ÿ ÿ 63 The following lemma reduces the problem of showing an impossibility result to constructing g, b, m, and m+ which satisfy the above properties. Lemma 2.3.1. Assume the functions g, b, m, m+ have an (–,—)-gap, then (g, b, m, m+) is F not 2 max(1/(r(1 —)), 2/–)-optimizable from samples over any distribution . ≠ D Proof. Fix any distribution . 
We first claim that for a fixed set $S$, $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over a uniformly random partition $P \sim \mathcal{U}(\mathcal{P})$. If $|S| \le t$, then the claim holds immediately by the identical on small samples property. If $|S| \ge t$, then $m(S \cap M) = 1$ with probability $1 - n^{-\omega(1)}$ over $P$ by the masking on large samples property, and $f^{P,i}(S) = m^+(S \cap M)$.

Next, we claim that there exists a partition $P \in \mathcal{P}$ such that $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over $S \sim \mathcal{D}$. Denote the event that $f^{P,i}(S)$ is independent of $i$ by $I(S, P)$. By switching sums,

$$\sum_{P \in \mathcal{P}} \Pr(P \sim \mathcal{U}(\mathcal{P})) \sum_{S \in 2^N} \Pr(S \sim \mathcal{D}) \, \mathbb{1}_{I(S,P)} = \sum_{S \in 2^N} \Pr(S \sim \mathcal{D}) \sum_{P \in \mathcal{P}} \Pr(P \sim \mathcal{U}(\mathcal{P})) \, \mathbb{1}_{I(S,P)} \ge \sum_{S \in 2^N} \Pr(S \sim \mathcal{D}) \left(1 - n^{-\omega(1)}\right) = 1 - n^{-\omega(1)},$$

where the inequality is by the first claim. Thus there exists some $P$ such that

$$\sum_{S \in 2^N} \Pr(S \sim \mathcal{D}) \, \mathbb{1}_{I(S,P)} \ge 1 - n^{-\omega(1)},$$

which proves the desired claim.

Fix a partition $P$ such that the previous claim holds, i.e., $f^{P,i}(S)$ is independent of $i$ with probability $1 - n^{-\omega(1)}$ over a sample $S \sim \mathcal{D}$. Then, by a union bound over the polynomially many samples, $f^{P,i}(S)$ is independent of $i$ for all samples $S$ with probability $1 - n^{-\omega(1)}$, and we assume this is the case for the remainder of the proof. It follows that the choices of the algorithm given samples from $f \in \{f^{P,i}\}_{i=1}^{r}$ are independent of $i$. Pick $i \in [r]$ uniformly at random and consider the (possibly randomized) set $S$ returned by the algorithm. Since $S$ is independent of $i$, we get $\mathbb{E}_{i,S}[|S \cap T_i|] \le k/r$. Let $S_{k/r} = \operatorname{argmax}_{S : |S| = k/r} g(S)$; we obtain

$$\mathbb{E}_{i,S}\left[f^{P,i}(S)\right] \le \mathbb{E}_{i,S}\left[g(S \cap T_i) + b(S \cap T_{-i}) + m^+(S \cap M)\right] \le g(S_{k/r}) + b(S) + m^+(S) \le \frac{1}{r(1 - \beta)} g(T_i) + \frac{2}{\alpha} g(T_i) \le 2 \max\left(\frac{1}{r(1 - \beta)}, \frac{2}{\alpha}\right) f^{P,i}(T_i),$$

where the first inequality is since $m(S \cap M) \le 1$, the second by monotonicity and submodularity, and the third by the curvature and gap properties.
Thus, there exists at least one $i$ such that the algorithm does not obtain a $2 \max(1/(r(1 - \beta)), 2/\alpha)$-approximation to $f^{P,i}(T_i)$, and $T_i$ is the optimal solution.

2.3.2 Submodular Maximization

Using the hardness framework from Section 2.3.1, it is relatively easy to show that there is no optimization from samples algorithm that obtains an $O(n^{-1/4 + \epsilon})$ approximation for submodular functions over any distribution $\mathcal{D}$. The good, bad, and masking functions $g, b, m, m^+$ we use are:

$$g(S) = |S|, \qquad b(S) = \min(|S|, \log n), \qquad m(S) = \min(1, |S|/n^{1/2}), \qquad m^+(S) = n^{-1/4} \cdot \min(n^{1/2}, |S|).$$

It is easy to show that $\mathcal{F}(g, b, m, m^+)$ is a class of monotone submodular functions (Lemma 2.3.3). To derive the optimal $n^{-1/4 + \epsilon}$ impossibility, we consider the cardinality constraint $k = n^{1/4 - \epsilon/2}$ and the size of the partition to be $r = n^{1/4}$. We show that $\mathcal{F}(g, b, m, m^+)$ has an $(n^{1/4 - \epsilon}, 0)$-gap.

Lemma 2.3.2. The class $\mathcal{F}(g, b, m, m^+)$ as defined above has an $(n^{1/4 - \epsilon}, 0)$-gap with $t = n^{1/2 + \epsilon/4}$.

Proof. We show that these functions satisfy the properties required for an $(n^{1/4 - \epsilon}, 0)$-gap.

• Identical on small samples. Assume $|S| \le n^{1/2 + \epsilon/4}$. Then $|T_{-i}| \cdot |S| / n \le n^{1/2 - \epsilon/2} \cdot n^{1/2 + \epsilon/4} / n \le n^{-\epsilon/4}$, so by Lemma 2.3.6, $|S \cap T_{-i}| \le \log n$ w.p. $1 - n^{-\omega(1)}$ over $P \sim \mathcal{U}(\mathcal{P})$. Thus

$$g(S \cap T_i) + b(S \cap T_{-i}) = |S \cap (\cup_{j=1}^{r} T_j)|$$

with probability $1 - n^{-\omega(1)}$ over $P$.

• Masking on large samples. Assume $|S| \ge n^{1/2 + \epsilon/4}$. Then $|S \cap M| \ge n^{1/2}$ with exponentially high probability over $P \sim \mathcal{U}(\mathcal{P})$ by the Chernoff bound (Lemma 2.3.11), and $m(S \cap M) = 1$ w.p. at least $1 - n^{-\omega(1)}$.

• Gap $\alpha = n^{1/4 - \epsilon}$. Note that $g(S) = k = n^{1/4 - \epsilon/2}$, $b(S) = \log n$, and $m^+(S) = n^{-\epsilon/2}$ for $|S| = k$, so $g(S) \ge n^{1/4 - \epsilon} b(S)$ for $n$ large enough and $g(S) = n^{1/4} \, m^+(S)$.
We show that the marginal contributions f (e) of an element e N to a set S N S œ ™ are such that f (e) f (e) for S T (submodular) and f (e) 0 for all S (monotone) for S Ø T ™ S Ø all elements e.Fore T , for all j, this follows immediately from g and b being monotone œ j submodular. For e M, note that œ 1 1/4 1/2 1/2 ( S Ti + min( S T i , log n)) + n≠ if S M Together with Lemma 2.3.1, these two lemmas imply the hardness result. Theorem 2.3.1. For every constant ‘>0, there is no optimization from samples algorithm 1/4+‘ that obtains an (n≠ ) approximation for monotone submodular functions, for any O distribution . D 2.3.3 Maximum Coverage We show that optimization from samples is in general impossible, over any distribution , D even when the function is learnable and optimizable, which is the main result of this chapter. Specifically, we show that there exists no constant – and distribution such that coverage D 67 functions are –-optimizable from samples, even though they are (1 ‘)-PMAC learnable over ≠ any distribution and can be maximized under a cardinality constraint within a factor of D 1 1/e. We begin by formally defining coverage functions. ≠ Definition. A function is called coverage if there exists a family of sets T1,...,Tn that covers subsets of a universe U with weights w(a ) for a U such that for all S, f(S)= j j œ aj i S Ti w(aj). A coverage function is polynomial-sized if the universe is of polynomial œfi œ sizeq in n. Influence maximization is a generalization of maximizing coverage functions under a cardinality constraint. 
Coverage functions are heavily used in machine learning [Swaminathan et al., 2009, Yue and Joachims, 2008, Guestrin et al., 2005, Krause and Guestrin, 2007, Antonellis et al., 2012, Lin and Bilmes, 2011, Takamura and Okumura, 2009], data mining [Chierichetti et al., 2010, Du et al., 2014b, Saha and Getoor, 2009, Singer, 2012, Dasgupta et al., 2007, Gomez-Rodriguez et al., 2010], mechanism design [Dobzinski and Schapira, 2006, Lehmann et al., 2001, Dughmi and Vondrák, 2015, Dughmi et al., 2011, Buchfuhrer et al., 2010, Dughmi, 2011], privacy [Gupta et al., 2013, Feldman and Kothari, 2014], as well as influence maximization [Kempe et al., 2003, Seeman and Singer, 2013, Borgs et al., 2014]. In many of these applications, the functions are learned from data and the goal is to optimize the function under a cardinality constraint. In addition to learnability and optimizability, coverage functions have many other desirable properties. One important fact is that they are parametric: if the sets $T_1, \ldots, T_n$ are known, then the coverage function is completely defined by the weights $\{w(a) : a \in U\}$. Our impossibility result holds even in the case where the sets $T_1, \ldots, T_n$ are known. We state the main result.