Decentralized Submodular Maximization: Bridging Discrete and Continuous Settings

Aryan Mokhtari 1 Hamed Hassani 2 Amin Karbasi 3

Abstract

In this paper, we showcase the interplay between discrete and continuous optimization in network-structured settings. We propose the first fully decentralized optimization method for a wide class of non-convex objective functions that possess a diminishing returns property. More specifically, given an arbitrary connected network and a global continuous submodular function, formed by a sum of local functions, we develop Decentralized Continuous Greedy (DCG), a message passing method that converges to the tight (1 − 1/e) approximation factor of the optimum global solution using only local computation and communication. We also provide strong convergence bounds as a function of network size and spectral characteristics of the underlying topology. Interestingly, DCG readily provides a simple recipe for decentralized discrete submodular maximization through the means of continuous relaxations. Formally, we demonstrate that by lifting the local discrete functions to continuous domains and using DCG as an interface, we can develop a consensus algorithm that also achieves the tight (1 − 1/e) approximation guarantee of the global discrete solution once a proper rounding scheme is applied.

1. Introduction

In recent years, we have reached unprecedented data volumes that are high dimensional and sit over (clouds of) networked machines. As a result, decentralized collection of these data sets along with accompanying distributed optimization methods are not only desirable but very often necessary (Boyd et al., 2011).

The focus of this paper is on decentralized optimization, the goal of which is to maximize/minimize a global objective function – distributed over a network of computing units – through local computation and communications among nodes. A canonical example in machine learning is fitting models using M-estimators where, given a set of data points, the parameters of the model are estimated through an empirical risk minimization (Vapnik, 1998). Here, the global objective function is defined as an average of local loss functions associated with each data point. Such local loss functions can be convex (e.g., logistic regression, SVM, etc.) or non-convex (e.g., non-linear square loss, robust regression, mixture of Gaussians, deep neural nets, etc.) (Mei et al., 2016). Due to the sheer volume of data points, these optimization tasks cannot be fulfilled on a single computing cluster node. Instead, we need to opt for decentralized solutions that can efficiently exploit dispersed (and often distant) computational resources linked through a tightly connected network. Furthermore, local computations should be light so that they can be done on single machines. In particular, when the data is high dimensional, extra care should be given to any optimization procedure that relies on projections over the feasibility domain.

In addition to large scale machine learning applications, decentralized optimization is a method of choice in many other domains such as Internet of Things (IoT) (Abu-Elkheir et al., 2013), remote sensing (Ma et al., 2015), multi-robot systems (Tanner & Kumar, 2005), and sensor networks (Rabbat & Nowak, 2004). In such scenarios, individual entities can communicate over a network and interact with the environment by exchanging the data generated through sensing. At the same time, they can react to events and trigger actions to control the physical world.
1 Laboratory for Information and Decision Systems, Massachusetts Institute of Technology. 2 Department of Electrical and Systems Engineering, University of Pennsylvania. 3 Department of Electrical Engineering and Computer Science, Yale University. Correspondence to: Aryan Mokhtari.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

These applications highlight another important aspect of decentralized optimization where private data is collected by different sensing units (Yang et al., 2017). Here again, we aim to optimize a global objective function while avoiding sharing the private data among computing units. Thus, by design, one cannot solve such private optimization problems in a centralized manner and should rely on decentralized solutions where local private computation is done where the data is collected.

Continuous submodular functions, a broad subclass of non-convex functions with the diminishing returns property, have recently received considerable attention (Bach, 2015; Bian et al., 2017). Due to their interesting structures that allow strong approximation guarantees (Mokhtari et al., 2018a; Bian et al., 2017), they have found various applications, including the design of online experiments (Chen et al., 2018), budget and resource allocations (Eghbali & Fazel, 2016; Staib & Jegelka, 2017), and learning assignments (Golovin et al., 2014). However, all the existing work suffers from centralized computing. Given that many information gathering, data summarization, and non-parametric learning problems are inherently related to large-scale submodular maximization, the demand for a fully decentralized solution is immediate. In this paper, we develop the first decentralized framework for both continuous and discrete submodular functions. Our contributions are as follows:

• Continuous submodular maximization: For any global objective function that is monotone and continuous DR-submodular and subject to any down-closed and bounded convex body, we develop Decentralized Continuous Greedy, a decentralized and projection-free algorithm that achieves the tight (1 − 1/e − ε) approximation guarantee in O(1/ε²) rounds of local communication.

• Discrete submodular maximization: For any global objective function that is monotone and submodular and subject to any matroid constraint, we develop a discrete variant of Decentralized Continuous Greedy that achieves the tight (1 − 1/e − ε) approximation ratio in O(1/ε³) rounds of communication.

All proofs are provided in the supplementary material.

2. Related Work

Decentralized optimization is a challenging problem as nodes only have access to separate components of the global objective function, while they aim to collectively reach the global optimum point. Indeed, one naive approach to tackle this problem is to broadcast local objective functions to all the nodes in the network and then solve the problem locally. However, this scheme requires high communication overhead and disregards the privacy associated with the data of each node. An alternative approach is the master-slave setting (Bekkerman et al., 2011; Shamir et al., 2014; Zhang & Lin, 2015) where, at each iteration, nodes use their local data to compute the information needed by the master node. Once the master node receives all the local information, it updates its decision and broadcasts the decision to all the nodes. Although this scheme protects the privacy of nodes, it is not robust to machine failures and is prone to high overall communication time. In decentralized methods, these issues are overcome by removing the master node and considering each node as an independent unit that is allowed to exchange information with its neighbors.

Convex decentralized consensus optimization is a relatively mature area with a myriad of primal and dual methods (Bertsekas & Tsitsiklis, 1989). Among primal methods, decentralized (sub)gradient descent is perhaps the most well-known algorithm, which is a mix of local gradient descent and successive averaging (Nedic & Ozdaglar, 2009; Yuan et al., 2016). It can also be interpreted as a penalty method that encourages agreement among neighboring nodes. This latter interpretation has been exploited to solve the penalized objective function using accelerated gradient descent (Jakovetić et al., 2014; Qu & Li, 2017), Newton's method (Mokhtari et al., 2017; Bajovic et al., 2017), or quasi-Newton algorithms (Eisen et al., 2017). The methods that operate in the dual domain consider a constraint that enforces equality between nodes' variables and solve the problem by ascending on the dual function to find optimal Lagrange multipliers. A short list of dual methods includes the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), the dual ascent algorithm (Rabbat et al., 2005), and augmented Lagrangian methods (Jakovetic et al., 2015; Chatzipanagiotis & Zavlanos, 2015). Recently, there have been many attempts to extend the tools in decentralized consensus optimization to the case where the objective function is non-convex (Di Lorenzo & Scutari, 2016; Sun et al., 2016; Hajinezhad et al., 2016; Tatarenko & Touri, 2017). However, such works are mainly concerned with reaching a stationary point and naturally cannot provide any optimality guarantee. In this paper, our focus is to provide the first decentralized algorithms for both discrete and continuous submodular functions.

It is known that the centralized greedy approach of (Nemhauser et al., 1978), and its many variants (Feige et al., 2011; Buchbinder et al., 2015; 2014; Feldman et al., 2017; Mirzasoleiman et al., 2016), reach tight approximation guarantees in various scenarios. As such methods are sequential in nature, they do not scale to massive datasets. To partially resolve this issue, MapReduce style methods, with a master-slave architecture, have been proposed (Mirzasoleiman et al., 2013; Kumar et al., 2015; da Ponte Barbosa et al., 2015; Mirrokni & Zadimoghaddam, 2015; Qu et al., 2015).

One can extend the notion of diminishing returns to continuous domains (Wolsey, 1982; Bach, 2015). Even though continuous submodular functions are not generally convex (nor concave), Hassani et al. (2017) showed that in the monotone setting and subject to a general bounded convex body constraint, stochastic gradient methods can achieve a 1/2 approximation guarantee. The approximation guarantee can be tightened to (1 − 1/e) by using Frank-Wolfe (Bian et al., 2017) or stochastic Frank-Wolfe (Mokhtari et al., 2018a).

3. Notation and Background

In this section, we review the notation that we use throughout the paper. We then give the precise definition of submodularity in discrete and continuous domains.

Notation. Lowercase boldface v denotes a vector and uppercase boldface W a matrix. The i-th element of v is written as v_i and the element on the i-th row and j-th column of W is denoted by w_{i,j}. We use ‖v‖ to denote the Euclidean norm of vector v and ‖W‖ to denote the spectral norm of matrix W. The null space of matrix W is denoted by null(W). The inner product of vectors x, y is indicated by ⟨x, y⟩, and the transposes of a vector v or matrix W are denoted by v† and W†, respectively. The vector 1_n ∈ R^n is the vector of all ones with n components, and the vector 0_p ∈ R^p is the vector of all zeros with p components.

Submodularity. A set function f : 2^V → R_+, defined on the ground set V, is called submodular if for all A, B ⊆ V, we have

    f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B).

We often need to maximize submodular functions subject to a down-closed set family I. In particular, we say I ⊆ 2^V is a matroid if 1) for any A ⊆ B ⊆ V, if B ∈ I, then A ∈ I, and 2) for any A, B ∈ I, if |A| < |B|, then there is an element e ∈ B \ A such that A ∪ {e} ∈ I.

The notion of submodularity goes beyond the discrete domain (Wolsey, 1982; Vondrák, 2007; Bach, 2015). Consider a continuous function F : X → R_+ where the set X ⊆ R^p_+ is of the form X = ∏_{i=1}^p X_i and each X_i is a compact subset of R_+. We call the continuous function F submodular if for all x, y ∈ X we have

    F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),    (1)

where x ∨ y := max(x, y) (component-wise) and x ∧ y := min(x, y) (component-wise). In this paper, our focus is on differentiable continuous submodular functions with two additional properties: monotonicity and diminishing returns. Formally, a submodular function F is monotone if

    x ≤ y  ⟹  F(x) ≤ F(y),    (2)

for all x, y ∈ X. Note that x ≤ y in (2) means that x_i ≤ y_i for all i = 1, ..., p. Furthermore, a differentiable submodular function F is called DR-submodular (i.e., shows diminishing returns) if the gradients are antitone, namely, for all x, y ∈ X we have

    x ≤ y  ⟹  ∇F(x) ≥ ∇F(y).    (3)

When the function F is twice differentiable, submodularity implies that all cross-second-derivatives are non-positive (Bach, 2015), and DR-submodularity implies that all second-derivatives are non-positive (Bian et al., 2017). In this work, we consider the maximization of continuous submodular functions subject to down-closed convex bodies C ⊂ R^p_+, defined as follows. For any two vectors x, y ∈ R^p_+ with x ≤ y, down-closedness means that if y ∈ C, then so is x. Note that for a down-closed set we have 0_p ∈ C.

4. Decentralized Submodular Maximization

In this section, we state the problem of decentralized submodular maximization in continuous and discrete settings.

Continuous Case. We consider a set of n computing machines/sensors that communicate over a graph to maximize a global objective function. Each machine can be viewed as a node i ∈ N := {1, ..., n}. We further assume that the possible communication links among nodes are given by a bidirectional connected communication graph G = (N, E) where each node can only communicate with its neighbors in G. We formally use N_i to denote node i's neighbors. In our setting, we assume that each node i ∈ N has access to a local function F_i : X → R_+. The nodes cooperate in order to maximize the aggregate monotone and continuous DR-submodular function F : X → R_+ subject to a down-closed convex body C ⊂ X ⊂ R^p_+, i.e.,

    max_{x ∈ C} F(x) = max_{x ∈ C} (1/n) Σ_{i=1}^n F_i(x).    (4)

The goal is to design a message passing algorithm to solve (4) such that: (i) at each iteration t, the nodes send their messages (and share their information) with their neighbors in G, and (ii) as t grows, all the nodes reach a point x ∈ C that provides a (near-)optimal solution for (4).

Discrete Case. Let us now consider the discrete counterpart of problem (4). In this setting, each node i ∈ N has access to a local set function f_i : 2^V → R_+. The nodes cooperate in maximizing the aggregate monotone submodular function f : 2^V → R_+ subject to a matroid constraint I, i.e.,

    max_{S ∈ I} f(S) = max_{S ∈ I} (1/n) Σ_{i=1}^n f_i(S).    (5)

Note that even in the centralized case, and under reasonable complexity-theoretic assumptions, the best approximation guarantee we can achieve for Problems (4) and (5) is (1 − 1/e) (Feige, 1998). In the following, we show that it is possible to achieve the same approximation guarantee in a decentralized setting.

5. Decentralized Continuous Greedy Method

In this section, we introduce the Decentralized Continuous Greedy (DCG) algorithm for solving Problem (4). Recall that in a decentralized setting, the nodes have to cooperate (i.e., send messages to their neighbors) in order to solve the global optimization problem. We will explain how such messages are designed and communicated in DCG. Each node i in the network keeps track of two local variables x_i, d_i ∈ R^p which are iteratively updated at each round t using the information gathered from the neighboring nodes. The vector x_i^t is the local decision variable of node i at step t, whose value we expect to eventually converge to the (1 − 1/e) fraction of the optimal solution of Problem (4).
The vector d_i^t is the estimate of the gradient of the global objective function that node i keeps at step t.

To properly incorporate the received information from their neighbors, nodes should assign nonnegative weights to their neighbors. Define w_{ij} ≥ 0 to be the weight that node i assigns to node j. These weights indicate the effect of the (variable or gradient) information nodes receive from their neighbors when they update their local (variable or gradient) information. Indeed, the weights w_{ij} must fulfill some requirements (later described in Assumption 1), but they are design parameters of DCG and can be properly chosen by the nodes prior to the implementation of the algorithm.

The first step at each round t of DCG is updating the local gradient approximation vectors d_i^t using local and neighboring gradient information. In particular, node i computes its vector d_i^t according to the update rule

    d_i^t = (1 − α) Σ_{j ∈ N_i ∪ {i}} w_{ij} d_j^{t−1} + α ∇F_i(x_i^t),    (6)

where α ∈ [0, 1] is an averaging coefficient. Note that the sum Σ_{j ∈ N_i ∪ {i}} w_{ij} d_j^{t−1} in (6) is a weighted average of node i's vector d_i^{t−1} and its neighbors' vectors d_j^{t−1}, evaluated at step t − 1. Hence, node i computes the vector d_i^t by evaluating a weighted average of its current local gradient ∇F_i(x_i^t) and the local and neighboring gradient information at step t − 1. Since the vector d_i^t is evaluated by aggregating gradient information from neighboring nodes, it is reasonable to expect that d_i^t becomes a proper approximation of the global objective function gradient (1/n) Σ_{k=1}^n ∇F_k(x) as time progresses. Note that to implement the update in (6), nodes should exchange their local vectors d_i^t with their neighbors.

Using the gradient approximation vector d_i^t, each node i evaluates its local ascent direction v_i^t by solving

    v_i^t = argmax_{v ∈ C} ⟨d_i^t, v⟩.    (7)

The update in (7) is also known as a conditional gradient update. Ideally, in a conditional gradient method, we should choose the feasible direction v ∈ C that maximizes the inner product with the full gradient vector (1/n) Σ_{k=1}^n ∇F_k(x_i^t). However, since in the decentralized setting the exact gradient (1/n) Σ_{k=1}^n ∇F_k(x_i^t) is not available at the i-th node, we replace it by its current approximation d_i^t and hence obtain the update rule (7).

After computing the local ascent directions v_i^t, the nodes update their local variables x_i^t by averaging their local and neighboring iterates and ascending in the direction v_i^t with stepsize 1/T, where T is the total number of iterations, i.e.,

    x_i^{t+1} = Σ_{j ∈ N_i ∪ {i}} w_{ij} x_j^t + (1/T) v_i^t.    (8)

The update rule (8) ensures that the neighboring iterates are not far from each other via the averaging term Σ_{j ∈ N_i ∪ {i}} w_{ij} x_j^t, while the iterates approach the maximizer of the global objective function by ascending in the conditional gradient direction v_i^t. The update in (8) requires a round of local communication among neighbors to exchange their local variables x_i^t. The steps of the DCG method are summarized in Algorithm 1.

Algorithm 1 DCG at node i
Require: Stepsize α and weights w_{ij} for j ∈ N_i ∪ {i}
 1: Initialize local vectors as x_i^0 = d_i^0 = 0_p
 2: Initialize neighbors' vectors as x_j^0 = d_j^0 = 0_p if j ∈ N_i
 3: for t = 1, 2, ..., T do
 4:   Compute d_i^t = (1 − α) Σ_{j ∈ N_i ∪ {i}} w_{ij} d_j^{t−1} + α ∇F_i(x_i^t);
 5:   Exchange d_i^t with neighboring nodes j ∈ N_i
 6:   Evaluate v_i^t = argmax_{v ∈ C} ⟨d_i^t, v⟩;
 7:   Update the variable x_i^{t+1} = Σ_{j ∈ N_i ∪ {i}} w_{ij} x_j^t + (1/T) v_i^t;
 8:   Exchange x_i^{t+1} with neighboring nodes j ∈ N_i
 9: end for

Algorithm 2 Discrete DCG at node i
Require: α, φ ∈ [0, 1] and weights w_{ij} for j ∈ N_i ∪ {i}
 1: Initialize local vectors as x_i^0 = d_i^0 = g_i^0 = 0
 2: Initialize neighbors' vectors as x_j^0 = d_j^0 = 0 if j ∈ N_i
 3: for t = 1, 2, ..., T do
 4:   Compute g_i^t = (1 − φ) g_i^{t−1} + φ ∇F̃_i(x_i^t);
 5:   Compute d_i^t = (1 − α) Σ_{j ∈ N_i ∪ {i}} w_{ij} d_j^{t−1} + α g_i^t;
 6:   Exchange d_i^t with neighboring nodes j ∈ N_i
 7:   Evaluate v_i^t = argmax_{v ∈ C} ⟨d_i^t, v⟩;
 8:   Update the variable x_i^{t+1} = Σ_{j ∈ N_i ∪ {i}} w_{ij} x_j^t + (1/T) v_i^t;
 9:   Exchange x_i^{t+1} with neighboring nodes j ∈ N_i;
10: end for
11: Apply proper rounding to obtain a solution for (5);

Indeed, the weights w_{ij} that nodes assign to each other cannot be arbitrary. In the following, we formalize the conditions that they should satisfy (Yuan et al., 2016).

Assumption 1 The weights that nodes assign to each other are nonnegative, i.e., w_{ij} ≥ 0 for all i, j ∈ N, and if node j is not a neighbor of node i, then the corresponding weight is zero, i.e., w_{ij} = 0 if j ∉ N_i. Further, the weight matrix W ∈ R^{n×n} with entries w_{ij} satisfies

    W† = W,   W 1_n = 1_n,   null(I − W) = span(1_n).    (9)

The first condition in (9) ensures that the weights are symmetric, i.e., w_{ij} = w_{ji}. The second condition guarantees that the weights each node assigns to itself and its neighbors sum up to 1, i.e., Σ_{j=1}^n w_{ij} = 1 for all i. Note that the condition W 1_n = 1_n implies that I − W is rank deficient. Hence, the last condition in (9) ensures that the rank of I − W is exactly n − 1. Indeed, it is possible to optimally design the weight matrix W to accelerate the averaging process as discussed in (Boyd et al., 2004), but this is not the focus of this paper. We emphasize that W is not a problem parameter, and we design it prior to running DCG.

Notice that the stepsize 1/T and the conditions in Assumption 1 on the weights w_{ij} are needed to ensure that the local variables x_i^t are in the feasible set C, as stated in the following proposition.

Proposition 1 Consider the DCG method outlined in Algorithm 1. If Assumption 1 holds and nodes start from x_i^0 = 0_p ∈ C, then the local iterates x_i^t are always in the feasible set C, i.e., x_i^t ∈ C for all i ∈ N and t = 1, ..., T.
Let us now explain how DCG relates to and innovates beyond the existing work in submodular maximization as well as decentralized optimization. Note that in order to solve Problem (4) in a centralized fashion (i.e., when every node has access to all the local functions), we can use the continuous greedy algorithm (Vondrák, 2008), a variant of the conditional gradient method. However, in decentralized settings, nodes only have access to their local gradients, and therefore, continuous greedy is not implementable. Similar to decentralized convex optimization, we can address this issue via local information aggregation. Our proposed DCG method incorporates the idea of choosing the ascent direction according to a conditional gradient update as is done in the continuous greedy algorithm (i.e., the update rule (7)), while it aggregates the global objective function information through local communications with neighboring nodes (i.e., the update rule (8)). Unlike traditional consensus optimization methods that require exchanging nodes' local variables only (Nedic & Ozdaglar, 2009; Nedic et al., 2010), DCG also requires exchanging local gradient vectors to achieve a (1 − 1/e) fraction of the optimal solution at each node (i.e., the update rule (6)). This major difference is due to the fact that in conditional gradient methods, unlike proximal gradient algorithms, the local gradients cannot be used instead of the global gradient. In other words, in the update rule (7), we cannot use the local gradients ∇F_i(x_i^t) in lieu of d_i^t. Indeed, there are settings for which such a replacement provides arbitrarily bad solutions. We formally characterize the convergence of DCG in Theorem 1.

5.1. Extension to the Discrete Setting

In this section, we show how DCG can be used for maximizing a decentralized submodular set function f, namely Problem (5), through its continuous relaxation. Formally, in lieu of solving Problem (5), we can form the following decentralized continuous optimization problem

    max_{x ∈ C} (1/n) Σ_{i=1}^n F_i(x),    (10)

where F_i is the multilinear extension of f_i, defined as

    F_i(x) = Σ_{S ⊆ V} f_i(S) ∏_{j ∈ S} x_j ∏_{j ∉ S} (1 − x_j),    (11)

and the down-closed convex set C = conv{1_I : I ∈ I} is the matroid polytope. Note that the discrete and continuous optimization formulations lead to the same optimal value (Calinescu et al., 2011).

Based on the expression in (11), computing the full gradient ∇F_i at each node i requires a computation that is exponential in |V|, since the number of summands in (11) is 2^{|V|}. As a result, in the discrete setting, we slightly modify the DCG algorithm and work with unbiased estimates of the gradient that can be computed in time O(|V|) (see Appendix 9.7 for one such estimator). More precisely, in the discrete setting, each node i ∈ N updates three local variables x_i^t, d_i^t, g_i^t ∈ R^{|V|}. The variables x_i^t, d_i^t play the same role as in DCG and are updated using the messages received from the neighboring nodes. The variable g_i^t at node i is defined to approximate the local gradient ∇F_i(x_i^t). Consider the vector ∇F̃_i(x_i^t) as an unbiased estimator of the local gradient ∇F_i(x_i^t) at time t, and define the vector g_i^t as the outcome of the recursion

    g_i^t = (1 − φ) g_i^{t−1} + φ ∇F̃_i(x_i^t),    (12)

where φ ∈ [0, 1] is the averaging parameter. We initialize all vectors as g_i^0 = 0 ∈ R^{|V|}. It was shown recently (Mokhtari et al., 2018a;b) that the averaging technique in (12) reduces the noise of the gradient approximations. Therefore, the sequence g_i^t approaches the true local gradient ∇F_i(x_i^t) as time progresses.

The steps of the Decentralized Continuous Greedy for the discrete setting are summarized in Algorithm 2. Note that the major difference between the Discrete DCG method (Algorithm 2) and the continuous DCG method (Algorithm 1) is in Step 5, in which the exact local gradient ∇F_i(x_i^t) is replaced by the stochastic approximation g_i^t, which only requires access to the computationally cheap unbiased gradient estimator ∇F̃_i(x_i^t). The communication complexity of both the discrete and continuous versions of DCG is the same at each round. However, since we use unbiased estimates of the local gradients ∇F_i(x_i^t), Discrete DCG takes more rounds to converge to a near-optimal solution compared to continuous DCG. We characterize the convergence of Discrete DCG in Theorem 2. Further, the implementation of Discrete DCG requires rounding the continuous solution to obtain a discrete solution for the original problem without any loss in terms of the objective function value. Provably lossless rounding schemes include pipage rounding (Calinescu et al., 2011) and contention resolution (Chekuri et al., 2014).

6. Convergence Analysis

In this section, we study the convergence properties of DCG in both continuous and discrete settings. In this regard, we assume that the following conditions hold.

Assumption 2 The Euclidean distances of the elements in the set C are uniformly bounded, i.e., for all x, y ∈ C we have

    ‖x − y‖ ≤ D.    (13)

Assumption 3 The local objective functions F_i(x) are monotone and DR-submodular. Further, their gradients are L-Lipschitz continuous over the set X, i.e., for all x, y ∈ X,

    ‖∇F_i(x) − ∇F_i(y)‖ ≤ L ‖x − y‖.    (14)

Assumption 4 The norms of the gradients ∇F_i(x) are bounded over the convex set C, i.e., for all x ∈ C, i ∈ N,

    ‖∇F_i(x)‖ ≤ G.    (15)

The condition in Assumption 2 guarantees that the diameter of the convex set C is bounded. Assumption 3 is needed to ensure that the local objective functions F_i are smooth. Finally, the condition in Assumption 4 enforces the gradient norms to be bounded over the convex set C. All these assumptions are customary and necessary in the analysis of decentralized algorithms; for more details, please check Section VII-B in Jakovetić et al. (2014).

We proceed to derive a constant factor approximation for DCG. Our main result is stated in Theorem 1. However, to better illustrate the main result, we first need to provide several definitions and technical lemmas. Let us begin by defining the average variables x̄^t as

    x̄^t = (1/n) Σ_{i=1}^n x_i^t.    (16)

In the following lemma, we establish an upper bound on the variation in the sequence of average variables {x̄^t}.

Lemma 1 Consider the proposed DCG algorithm defined in Algorithm 1. Further, recall the definition of x̄^t in (16). If Assumptions 1 and 2 hold, then the difference between two consecutive average vectors is upper bounded by

    ‖x̄^{t+1} − x̄^t‖ ≤ D/T.    (17)

Recall that at every node i, the messages are mixed using the coefficients w_{ij}, i.e., the i-th row of the matrix W. It is thus not hard to see that the spectral properties of W (e.g., the spectral gap) play an important role in the speed of achieving consensus in decentralized methods.

Definition 1 Consider the eigenvalues of W, which can be sorted in a nonincreasing order as 1 = λ_1(W) ≥ λ_2(W) ≥ ··· ≥ λ_n(W) > −1. Define β as the second largest magnitude of the eigenvalues of W, i.e.,

    β := max{|λ_2(W)|, |λ_n(W)|}.    (18)

As we will see, a mixing matrix W with smaller β has a larger spectral gap 1 − β, which yields faster convergence (Boyd et al., 2004; Duchi et al., 2012). In the following lemma, we derive an upper bound on the sum of the distances between the local iterates x_i^t and their average x̄^t, where the bound is a function of the spectral gap 1 − β, the size of the network n, and the total number of iterations T.

Lemma 2 Consider the proposed DCG algorithm defined in Algorithm 1. Further, recall the definition of x̄^t in (16). If Assumptions 1 and 2 hold, then for all t ≤ T we have

    ( Σ_{i=1}^n ‖x_i^t − x̄^t‖² )^{1/2} ≤ 2√n D / (T(1 − β)).    (19)

Let us now define d̄^t as the average of the local gradient approximations d_i^t at step t, i.e., d̄^t = (1/n) Σ_{i=1}^n d_i^t. We will show in the following that the vectors d_i^t also become uniformly close to d̄^t.
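The quantity β in Definition 1 is easy to compute for a given topology. The sketch below is ours: it builds Metropolis weights (one common construction satisfying Assumption 1; the paper does not prescribe this particular W) and contrasts a poorly connected line graph with a complete graph, mirroring the topologies used in Section 7.

```python
import numpy as np

def metropolis_weights(adj):
    # Symmetric, doubly stochastic weights satisfying Assumption 1:
    # w_ij = 1 / (1 + max(deg_i, deg_j)) on edges, diagonal fills the rest.
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def beta(W):
    # Second largest eigenvalue magnitude, as in (18).
    lams = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return lams[1]

n = 20
line = np.zeros((n, n), dtype=bool)
for i in range(n - 1):
    line[i, i + 1] = line[i + 1, i] = True
complete = ~np.eye(n, dtype=bool)

b_line = beta(metropolis_weights(line))
b_complete = beta(metropolis_weights(complete))
# The line graph mixes far more slowly: its spectral gap 1 - beta is tiny,
# while the complete graph with these weights averages in one step.
assert b_complete < 0.2 < 0.9 < b_line < 1.0
```

This is exactly the effect the bounds (19)–(25) capture through the 1/(1 − β) factors, and it matches the experimental contrast between the line and complete topologies in Figures 1 and 2.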
If Assumptions 1 and 3 hold, then Section VII-B in Jakovetic´ et al. (2014). 1/2 We proceed to derive a constant factor approximation for n ↵pnG dt d¯t 2 . (20) DCG. Our main result is stated in Theorem 1. However, k i k  1 (1 ↵) i=1 ! to better illustrate the main result, we first need to provide X several definitions and technical lemmas. Let us begin by defining the average variables x¯t as Lemma 3 guarantees that the individual local gradient ap- t ¯t proximation vectors di are close to the average vector d 1 n x¯t = xt. (16) if the parameter ↵ is small. To show that the gradient vec- n i t i=1 tors di, generated by DCG, approximate the gradient of the X Decentralized Submodular Maximization: Bridging Discrete and Continuous Settings global objective function, we further need to show that the where the expectation is with respect to the randomness of average vector d¯t approaches the global objective function the unbiased estimator. gradient F . We prove this claim in the following lemma. r In the following theorem, we show that Discrete DCG Lemma 4 Consider the proposed DCG algorithm defined achieves a (1 1/e) approximation ration for Problem (5). in Algorithm 1. If Assumptions 1-4 hold, then Theorem 2 Consider our proposed Discrete DCG algo- 1 n d¯t F (x¯t) rithm outlined in Algorithm 2. Recall the definition of the n r i i=1 multilinear extension function Fi in (11). If Assumptions X 1/2 2/3 1-5 hold and we set ↵ = T and = T , then for all t (1 ↵)LD LD (1 ↵) G + + . (21) nodes j the local variables xT obtained after running  ↵T T (1 ) 2N j ✓ ◆ Discrete DCG for T iterations satisfy ↵ =1/pT By combining Lemmas 3 and 4 and setting we T 1 1 E F (x ) (1 e )F (x⇤) , (25) can conclude that the local gradient approximation vector j O (1 )T 1/3 dt of each node i is within (1/pT ) distance of the global ✓ ◆ i O ⇥ ⇤ objective gradient F (x¯t) evaluated at x¯t. We use this ob- where x is the global maximizer of Problem (10). 
r ⇤ servation in the following theorem to show that the sequence of iterates generated by DCG achieves the tight (1 1/e) Theorem 2 states that the sequence of iterates generated by approximation ratio of the optimum global solution. Discrete DCG achieves the tight (1 1/e ✏) approximation guarantee for Problem (10) after (1/✏3) iterations. Theorem 1 Consider the proposed DCG method outlined O V in Algorithm 1. Further, consider x⇤ as the global maxi- Remark 1 For any submodular set function h :2 R ! mizer of Problem (4). If Assumptions 1-4 hold and we set with associated multilinear extension H, it can be shown ↵ =1/pT , for all nodes j , the local variable xT that its Lipschitz constant L and the gradient norm G are 2N j obtained after T iterations satisfies both bounded above by m V , where m is the maxi- f | | f 2 mum marginal value of f, i.e., mf =maxi V f( i ) (see, T 1 LD + GD GD p 2 { } F (x ) (1 e )F (x⇤) Hassani et al. (2017)). Similarly, it can be shown that for the j T 1/2 T 1/2(1 ) unbiased estimator in Appendix 9.7 we have mf V . LD2 GD + LD2  | | . (22) p 2T T (1 ) 7. Numerical Experiments Theorem 1 shows that the sequence of the local variables We will consider a discrete setting for our experiments t xj, generated by DCG, is able to achieve the optimal ap- and use Algorithm 2 to find a decentralized solution. The proximation ratio (1 1/e), while the error term vanishes main objective is to demonstrate how consensus is reached at a sublinear rate of (1/T 1/2), i.e., and how the global objective increases depending on the O topology of the network and the parameters of the algorithm. T 1 F (x ) (1 1/e)F (x⇤) , (23) j O (1 )T 1/2 For our experiments, we have used the MovieLens data set. ✓ ◆ It consists of 1 million ratings (from 1 to 5) by M = 6000 which implies that the iterate of each node reaches an ob- users for p = 4000 movies. We consider a network of jective value larger than (1 1/e ✏)OPT after (1/✏2) O n = 100 nodes. 
7. Numerical Experiments

We consider a discrete setting for our experiments and use Algorithm 2 to find a decentralized solution. The main objective is to demonstrate how consensus is reached and how the global objective increases depending on the topology of the network and the parameters of the algorithm.

For our experiments, we use the MovieLens data set. It consists of 1 million ratings (from 1 to 5) by M = 6000 users for p = 4000 movies. We consider a network of n = 100 nodes. The data has been distributed equally among the nodes of the network, i.e., the set of users has been partitioned into 100 equally-sized sets, and each node in the network has access to only one chunk (partition) of the data. The global task is to find a set of k movies that is most satisfactory to all the users (the precise formulation appears shortly). However, as each node in the network has access to the data of only a small portion of the users, the nodes have to cooperate to fulfill the global task.

We consider a well-motivated objective function for the experiments. Let $r_{\ell,j}$ denote the rating of user $\ell$ for movie $j$ (if such a rating does not exist in the data, we set $r_{\ell,j} = 0$). We associate to each user $\ell$ a "facility location" objective function $g_\ell(S) = \max_{j \in S} r_{\ell,j}$, where $S$ is any subset of the movies (i.e., the ground set $V$ is the set of the movies).
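For concreteness, a node's share of this objective can be evaluated directly; a minimal sketch (the array name and toy ratings are illustrative, not the MovieLens data):

```python
import numpy as np

def facility_location_value(S, ratings):
    """Sum over a node's users of max_{j in S} r_{l,j}.
    `ratings` is a (num_users, num_movies) array; empty S scores 0."""
    S = list(S)
    if not S:
        return 0.0
    # Each user's satisfaction is her highest rating inside S.
    return float(ratings[:, S].max(axis=1).sum())

# Toy node with 3 users and 4 movies.
R = np.array([[5., 0., 3., 0.],
              [0., 4., 0., 1.],
              [2., 2., 5., 0.]])
```

Adding a movie can only raise each user's max, and the gain shrinks as $S$ grows, so this objective is monotone submodular.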

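The centralized baseline reported in Figure 2 is the classical greedy algorithm, whose $(1-1/e)$ guarantee for monotone submodular objectives is due to Nemhauser et al. (1978); a minimal sketch under a $k$-cardinality constraint (toy data, illustrative names):

```python
import numpy as np

def greedy_max(value_fn, ground_set, k):
    """Repeatedly add the element with the largest marginal gain;
    for monotone submodular value_fn this is (1 - 1/e)-optimal
    under the constraint |S| <= k."""
    S = []
    for _ in range(k):
        gains = [(value_fn(S + [e]) - value_fn(S), e)
                 for e in ground_set if e not in S]
        gain, best = max(gains)
        if gain <= 0:
            break
        S.append(best)
    return S

# Toy ratings: 3 users x 4 movies; f is the facility-location objective.
R = np.array([[5., 0., 3., 0.],
              [0., 4., 0., 1.],
              [2., 2., 5., 0.]])
f = lambda S: 0.0 if not S else float(R[:, list(S)].max(axis=1).sum())
S = greedy_max(f, range(4), k=2)
```

On this toy instance greedy first picks the movie of highest total satisfaction and then the best complement.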
The function $g_\ell$ shows how much user $\ell$ will be "satisfied" by a subset $S$ of the movies. Recall that each node $i$ in the network has access to the data of a (small) subset of the users, which we denote by $\mathcal{U}_i$. The objective function associated with node $i$ is given by $f_i(S) = \sum_{\ell \in \mathcal{U}_i} g_\ell(S)$. With such a choice of the local functions, our global task is hence to solve Problem (5) when the matroid $\mathcal{I}$ is the $k$-uniform matroid (a.k.a. the $k$-cardinality constraint).

We consider three different choices for the underlying communication graph between the 100 nodes: a line graph (a simple path from node 1 to node 100), an Erdos-Renyi random graph (with average degree 5), and a complete graph. The matrix $W$ is chosen as follows, based on each of the three graphs. If $(i,j)$ is an edge of the graph, we let $w_{i,j} = 1/(1 + \max(d_i, d_j))$. If $(i,j)$ is not an edge and $i, j$ are distinct, we have $w_{i,j} = 0$. Finally, we let $w_{i,i} = 1 - \sum_{j \in \mathcal{N}_i} w_{i,j}$. It is not hard to show that the above choice for $W$ satisfies Assumption 1.

Figure 1. The logarithm of the distance-to-average at the final round $T$, plotted as a function of $T$. When the underlying graph is complete or Erdos-Renyi (ER) with a good average degree, consensus is achieved even for a small number of iterations $T$; for poorly connected graphs such as the line graph, reaching consensus requires a large number of iterations.

Figure 2. The average objective value as a function of the cardinality constraint $k$ for different choices of the communication graph and the number of iterations $T$. "ER" stands for the Erdos-Renyi graph with average degree 5, "Line" for the line graph, and "Complete" for the complete graph. Algorithm 2 was run for $T = 50$ and $T = 1000$.
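To see at small scale how topology drives consensus (the phenomenon reported in Figure 1), the sketch below builds the weights $w_{i,j} = 1/(1 + \max(d_i, d_j))$ used in our experiments and runs plain averaging steps $x_i \leftarrow \sum_j w_{i,j} x_j$; the ascent term of the actual algorithm is omitted, and sizes are scaled down:

```python
import numpy as np

def metropolis_weights(adj):
    """W with w_ij = 1/(1 + max(d_i, d_j)) on edges and
    w_ii = 1 - sum_{j != i} w_ij; symmetric and doubly stochastic."""
    n = len(adj)
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def distance_to_average(X):
    """(1/n) sum_i ||x_i - x_bar||, the consensus metric of Figure 1."""
    return float(np.linalg.norm(X - X.mean(axis=0), axis=1).mean())

n = 20
line = np.zeros((n, n))                      # path 0-1-...-19
for i in range(n - 1):
    line[i, i + 1] = line[i + 1, i] = 1.0
complete = np.ones((n, n)) - np.eye(n)

rng = np.random.default_rng(1)
X0 = rng.random((n, 5))                      # a 5-dim decision per node
dist = {}
for name, A in [("line", line), ("complete", complete)]:
    W, X = metropolis_weights(A), X0.copy()
    for _ in range(30):                      # 30 rounds of averaging
        X = W @ X
    dist[name] = distance_to_average(X)
```

Even on 20 nodes, 30 averaging rounds drive the complete graph essentially to exact consensus while the line graph remains far from it, mirroring the gap between the two curves in Figure 1.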
Figure 1 shows how consensus is reached with respect to each of the three underlying networks. To measure consensus, we plot the (logarithm of the) distance-to-average value $\frac{1}{n} \sum_{i=1}^{n} \| x_i^T - \bar{x}^T \|$ as a function of the total number of iterations $T$, averaged over many trials (see (16) for the definition of $\bar{x}^T$). It is easy to see that the distance to average is small if and only if all the local decisions $x_i^T$ are close to the average decision $\bar{x}^T$. As expected, it takes much less time to reach consensus when the underlying graph is fully connected (i.e., the complete graph). For the line graph, the convergence is very slow, as this graph has the least degree of connectivity.

Figure 2 depicts the objective value obtained by Discrete DCG (Algorithm 2) for the three networks considered above. More precisely, we plot the value $\frac{1}{n} \sum_{i=1}^{n} f(x_i^T)$ obtained at the end of Algorithm 2 as a function of the cardinality constraint $k$. We also compare these values with the value obtained by the centralized greedy algorithm (i.e., the centralized solution). A few comments are in order. The performance of Algorithm 2 is close to the centralized solution when the underlying graph is the Erdos-Renyi graph (with average degree 5) or the complete graph. This is because for both such graphs consensus is achieved from the early stages of the algorithm. By increasing $T$, we see that the performance becomes closer to the centralized solution. However, when the underlying graph is the line graph, consensus will not be achieved unless the number of iterations is significantly increased. Consequently, for a small number of iterations (e.g., $T \le 1000$) the performance of the algorithm will not be close to the centralized solution.

8. Conclusion

In this paper, we proposed the first fully decentralized optimization method for maximizing discrete and continuous submodular functions. We developed Decentralized Continuous Greedy (DCG), which achieves a $(1 - 1/e - \epsilon)$ approximation guarantee with $O(1/\epsilon^2)$ and $O(1/\epsilon^3)$ local rounds of communication in the continuous and discrete settings, respectively.

Acknowledgements

This work was done while A. Mokhtari was visiting the Simons Institute for the Theory of Computing, and his work was partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF grant #CCF-1740425. The work of A. Karbasi was supported by DARPA Young Faculty Award (D16AP00046) and AFOSR YIP (FA9550-18-1-0160).

References

Abu-Elkheir, Mervat, Hayajneh, Mohammad, and Ali, Najah Abu. Data management for the internet of things: Design primitives and solution. Sensors, 13(11):15582-15612, 2013.

Bach, F. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

Bajovic, Dragana, Jakovetic, Dusan, Krejic, Natasa, and Jerinkic, Natasa Krklec. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2):1171-1203, 2017.

Bekkerman, Ron, Bilenko, Mikhail, and Langford, John. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.

Bertsekas, Dimitri P. and Tsitsiklis, John N. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

Bian, Andrew An, Mirzasoleiman, Baharan, Buhmann, Joachim M., and Krause, Andreas. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In AISTATS, 2017.

Boyd, Stephen, Diaconis, Persi, and Xiao, Lin. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667-689, 2004.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

Buchbinder, Niv, Feldman, Moran, Naor, Joseph, and Schwartz, Roy. Submodular maximization with cardinality constraints. In SODA, pp. 1433-1452, 2014.

Buchbinder, Niv, Feldman, Moran, Naor, Joseph, and Schwartz, Roy. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384-1402, 2015.

Calinescu, Gruia, Chekuri, Chandra, Pál, Martin, and Vondrák, Jan. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740-1766, 2011.

Chatzipanagiotis, Nikolaos and Zavlanos, Michael M. On the convergence rate of a distributed augmented Lagrangian optimization algorithm. In ACC, 2015.

Chekuri, Chandra, Vondrák, Jan, and Zenklusen, Rico. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43(6):1831-1879, 2014.

Chen, Lin, Hassani, Hamed, and Karbasi, Amin. Online continuous submodular maximization. In AISTATS, 2018.

da Ponte Barbosa, Rafael, Ene, Alina, Nguyen, Huy L., and Ward, Justin. The power of randomization: Distributed submodular maximization on massive datasets. In ICML, 2015.

Di Lorenzo, Paolo and Scutari, Gesualdo. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120-136, 2016.

Duchi, John C., Agarwal, Alekh, and Wainwright, Martin J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.

Eghbali, Reza and Fazel, Maryam. Designing smoothing functions for improved worst-case competitive ratio in online optimization. In NIPS, pp. 3279-3287, 2016.

Eisen, Mark, Mokhtari, Aryan, and Ribeiro, Alejandro. Decentralized quasi-Newton methods. IEEE Transactions on Signal Processing, 65(10):2613-2628, 2017.

Feige, Uriel. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 1998.

Feige, Uriel, Mirrokni, Vahab S., and Vondrák, Jan. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133-1153, 2011.

Feldman, Moran, Harshaw, Christopher, and Karbasi, Amin. Greed is good: Near-optimal submodular maximization via greedy optimization. arXiv preprint arXiv:1704.01652, 2017.

Golovin, Daniel, Krause, Andreas, and Streeter, Matthew. Online submodular maximization under a matroid constraint with application to learning assignments. arXiv preprint arXiv:1407.1082, 2014.

Hajinezhad, Davood, Hong, Mingyi, Zhao, Tuo, and Wang, Zhaoran. NESTT: A nonconvex primal-dual splitting method for distributed and stochastic optimization. In NIPS, 2016.

Hassani, S. Hamed, Soltanolkotabi, Mahdi, and Karbasi, Amin. Gradient methods for submodular maximization. In NIPS, pp. 5843-5853, 2017.

Jakovetić, Dušan, Xavier, João, and Moura, José M. F. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131-1146, 2014.

Jakovetić, Dušan, Moura, José M. F., and Xavier, João. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922-936, 2015.

Kumar, Ravi, Moseley, Benjamin, Vassilvitskii, Sergei, and Vattani, Andrea. Fast greedy algorithms in MapReduce and streaming. TOPC, 2015.

Ma, Yan, Wu, Haiping, Wang, Lizhe, Huang, Bormin, Ranjan, Rajiv, Zomaya, Albert, and Jie, Wei. Remote sensing big data computing: Challenges and opportunities. Future Generation Computer Systems, 51:47-60, 2015.

Mei, Song, Bai, Yu, and Montanari, Andrea. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Mirrokni, Vahab S. and Zadimoghaddam, Morteza. Randomized composable core-sets for distributed submodular maximization. In STOC, pp. 153-162, 2015.

Mirzasoleiman, Baharan, Karbasi, Amin, Sarkar, Rik, and Krause, Andreas. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.

Mirzasoleiman, Baharan, Badanidiyuru, Ashwinkumar, and Karbasi, Amin. Fast constrained submodular maximization: Personalized data summarization. In ICML, pp. 1358-1367, 2016.

Mokhtari, Aryan, Ling, Qing, and Ribeiro, Alejandro. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1):146-161, 2017.

Mokhtari, Aryan, Hassani, Hamed, and Karbasi, Amin. Conditional gradient method for stochastic submodular maximization: Closing the gap. In AISTATS, pp. 1886-1895, 2018a.

Mokhtari, Aryan, Hassani, Hamed, and Karbasi, Amin. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018b.

Nedic, Angelia and Ozdaglar, Asuman. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.

Nedic, Angelia, Ozdaglar, Asuman, and Parrilo, Pablo A. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922-938, 2010.

Nemhauser, George L., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265-294, 1978.

Qu, Guannan, Brown, Dave, and Li, Na. Distributed greedy algorithm for satellite assignment problem with submodular utility function. IFAC-PapersOnLine, 2015.

Qu, Guannan and Li, Na. Accelerated distributed Nesterov gradient descent. arXiv preprint arXiv:1705.07176, 2017.

Rabbat, Michael and Nowak, Robert. Distributed optimization in sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 20-27. ACM, 2004.

Rabbat, Michael G., Nowak, Robert D., et al. Generalized consensus computation in networked systems with erasure links. In SPAWC, pp. 1088-1092, 2005.

Shamir, Ohad, Srebro, Nathan, and Zhang, Tong. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, pp. 1000-1008, 2014.

Staib, Matthew and Jegelka, Stefanie. Robust budget allocation via continuous submodular functions. In NIPS, 2017.

Sun, Ying, Scutari, Gesualdo, and Palomar, Daniel. Distributed nonconvex multiagent optimization over time-varying networks. In 50th Asilomar Conference on Signals, Systems and Computers, pp. 788-794, 2016.

Tanner, Herbert G. and Kumar, Amit. Towards decentralization of multi-robot navigation functions. In ICRA, pp. 4132-4137, 2005.

Tatarenko, Tatiana and Touri, Behrouz. Non-convex distributed optimization. IEEE Transactions on Automatic Control, 62(8):3744-3757, 2017.

Vapnik, Vladimir. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4.

Vondrák, Jan. Submodularity in combinatorial optimization. 2007.

Vondrák, Jan. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, pp. 67-74, 2008.

Wolsey, Laurence A. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385-393, 1982.

Yang, Yuchen, Wu, Longfei, Yin, Guisheng, Li, Lijie, and Zhao, Hongbin. A survey on security and privacy issues in internet-of-things. IEEE Internet of Things Journal, 4(5):1250-1258, 2017.

Yuan, Kun, Ling, Qing, and Yin, Wotao. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835-1854, 2016.

Zhang, Yuchen and Lin, Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In ICML, pp. 362-370, 2015.