Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap

Aryan Mokhtari (Massachusetts Institute of Technology), Hamed Hassani (University of Pennsylvania), Amin Karbasi (Yale University)

Abstract

In this paper, we study the problem of constrained and stochastic continuous submodular maximization. Even though the objective function is not concave (nor convex) and is defined in terms of an expectation, we develop a variant of the conditional gradient method, called Stochastic Continuous Greedy, which achieves a tight approximation guarantee. More precisely, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that Stochastic Continuous Greedy achieves a [(1 − 1/e)OPT − ε] guarantee (in expectation) with O(1/ε³) stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization. By using stochastic continuous optimization as an interface, we also provide the first (1 − 1/e) tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

1 Introduction

Many procedures in statistics and artificial intelligence require solving non-convex problems, including clustering [Abbasi and Younis, 2007], training deep neural networks [Bengio et al., 2007], and performing Bayesian optimization [Snoek et al., 2012], to name a few. Historically, the focus has been to convexify non-convex objectives; in recent years, there has been significant progress in optimizing non-convex functions directly. This direct approach has led to provably good guarantees for specific problem instances. Examples include latent variable models [Anandkumar et al., 2014], non-negative matrix factorization [Arora et al., 2016], robust PCA [Netrapalli et al., 2014], matrix completion [Ge et al., 2016], and training certain specific forms of neural networks [Mei et al., 2016]. However, it is well known that in general finding the global optimum of a non-convex problem is NP-hard [Murty and Kabadi, 1987]. This computational barrier has mainly shifted the goal of non-convex optimization towards two directions: a) finding an approximate local minimum by avoiding saddle points [Ge et al., 2015, Anandkumar and Ge, 2016, Jin et al., 2017, Paternain et al., 2017], or b) characterizing general conditions under which the underlying non-convex optimization is tractable [Hazan et al., 2016].

In this paper, we consider a broad class of non-convex optimization problems that possess special combinatorial structures. More specifically, we focus on constrained maximization of stochastic continuous submodular functions (CSF) that demonstrate diminishing returns, i.e., continuous DR-submodular functions:

    max_{x∈C} F(x) := max_{x∈C} E_{z∼P}[F̃(x, z)].   (1)

Here, the functions F̃ : X × Z → R₊ are stochastic, x ∈ X is the optimization variable, z ∈ Z is a realization of the random variable Z drawn from a distribution P, and X ⊆ R₊ⁿ is a compact set. Our goal is to maximize the expected value of the random functions F̃(x, z) over the convex body C ⊆ R₊ⁿ. Note that we only assume that F(x) is DR-submodular, and not necessarily the stochastic functions F̃(x, z). We also consider situations where the distribution P is either unknown (e.g., when the objective is given as an implicit stochastic model) or the domain of the random variable Z is very large (e.g., when the objective is defined in terms of an empirical risk), which makes the cost of computing the expectation very high. In these regimes, stochastic optimization methods, which operate on computationally cheap estimates of gradients, arise as natural solutions.
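For concreteness, here is a toy instance of Problem (1); this is our own illustrative construction, not an example from the paper. The expected objective is a non-concave quadratic that is monotone and DR-submodular, and an unbiased stochastic gradient oracle stands in for E_z[∇F̃(x, z)]:

```python
import random

# Toy instance of Problem (1) (our own example, not from the paper):
# F~(x, z) = <z, x> - 0.5 * x^T H x  with  E[z] = a, so that
# F(x) = E_z[F~(x, z)] = <a, x> - 0.5 * x^T H x.
# H has non-negative entries, so the Hessian of F is -H, i.e. all
# second derivatives are non-positive and F is continuous DR-submodular.
# F is also monotone on [0, 1]^3 since the gradient a - H x stays
# non-negative there.
n = 3
a = [1.0, 0.8, 0.6]
H = [[0.2, 0.1, 0.0],
     [0.1, 0.2, 0.1],
     [0.0, 0.1, 0.2]]

def grad_F(x):
    """Exact gradient of the expected objective: a - H x."""
    return [a[i] - sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]

def stoch_grad(x, rng):
    """Unbiased stochastic gradient: z - H x with z ~ N(a, 0.1^2 I)."""
    return [rng.gauss(a[i], 0.1) - sum(H[i][j] * x[j] for j in range(n))
            for i in range(n)]
```

Averaging many draws of `stoch_grad` recovers `grad_F`, which is exactly the regime Problem (1) assumes: the expectation is expensive (or unavailable), but single samples are cheap.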
In fact, very recently, it was shown in [Hassani et al., 2017] that stochastic gradient methods achieve a (1/2) approximation guarantee for Problem (1). The authors also showed that current versions of the conditional gradient method (a.k.a. Frank-Wolfe), such as continuous greedy [Vondrák, 2008] or its close variant [Bian et al.], can perform arbitrarily poorly in stochastic continuous submodular maximization settings.

Our contributions. We provide the first tight (1 − 1/e) approximation guarantee for Problem (1) when the continuous function F is monotone, smooth, and DR-submodular, and the constraint set C is a bounded convex body. To this end, we develop a novel conditional gradient method, called Stochastic Continuous Greedy (SCG), that produces a solution with an objective value larger than ((1 − 1/e)OPT − ε) after O(1/ε³) iterations while only having access to unbiased estimates of the gradients (here OPT denotes the optimal value of Problem (1)). SCG is also memory efficient in the following sense: in contrast to previously proposed conditional gradient methods in stochastic convex [Hazan and Luo, 2016] and non-convex [Reddi et al., 2016] settings, SCG does not require using a minibatch in each step. Instead, it simply averages over the stochastic estimates of the previous gradients. Notably, our result only requires the expected function F to be monotone and DR-submodular; the stochastic functions F̃ need be neither monotone nor DR-submodular. Moreover, in contrast to the result in [Bian et al.], which holds only for down-closed convex constraints, our result holds for any convex constraint.

Connection to Discrete Problems. Even though submodularity has been mainly studied in discrete domains [Fujishige, 2005], many efficient methods for optimizing such submodular set functions rely on continuous relaxations, either through the multilinear extension [Vondrák, 2008] (for maximization) or the Lovász extension [Lovász, 1983] (for minimization). In fact, Problem (1) has a discrete counterpart, recently considered in [Hassani et al., 2017, Karimi et al., 2017]:

    max_{S∈I} f(S) := max_{S∈I} E_{z∼P}[f̃(S, z)],   (2)

where the functions f̃ : 2^V × Z → R₊ are stochastic, S is the optimization set variable defined over a ground set V, z ∈ Z is the realization of a random variable Z drawn from the distribution P, and I is a general matroid constraint. Since P is unknown, problem (2) cannot be directly solved using the current state-of-the-art techniques. Instead, Hassani et al. [2017] showed that by lifting the problem to the continuous domain (via the multilinear relaxation) and using stochastic gradient methods on the continuous relaxation, one can reach a solution that is within a factor (1/2) of the optimum. Contemporarily, [Karimi et al., 2017] used a concave relaxation technique to provide a (1 − 1/e) approximation for the class of submodular coverage functions. Our work also closes the gap for stochastic submodular set maximization, namely Problem (2), by providing the first tight (1 − 1/e) approximation guarantee for general monotone submodular set functions subject to a matroid constraint.

Our result has important implications for Problem (2), that is, maximizing a stochastic discrete submodular function subject to a matroid constraint. Since the proposed SCG method works in stochastic settings, we can relax the discrete objective function f in Problem (2) to a continuous function F through the multilinear extension (note that expectation is a linear operator). Then we can maximize F to within a (1 − 1/e − ε) approximation of the optimum value by using only O(1/ε³) oracle calls to the stochastic gradients of F. Finally, a proper rounding scheme (such as the contention resolution method [Chekuri et al., 2014]) results in a feasible set whose value is a (1 − 1/e) approximation to the optimum set in expectation.

Notation. Lowercase boldface v denotes a vector and uppercase boldface A a matrix. We use ‖v‖ to denote the Euclidean norm of the vector v. The i-th element of the vector v is written as v_i, and the element in the i-th row and j-th column of the matrix A is denoted by A_{i,j}.

2 Related Work

Maximizing a deterministic submodular set function has been extensively studied. The celebrated result of Nemhauser et al. [1978] shows that a greedy algorithm achieves a (1 − 1/e) approximation guarantee for a monotone function subject to a cardinality constraint. It is also known that this result is tight under reasonable complexity-theoretic assumptions [Feige, 1998]. Recently, variants of the greedy algorithm have been proposed to extend the above result to non-monotone objectives and more general constraints [Feige et al., 2011, Buchbinder et al., 2015, 2014, Feldman et al., 2017]. While discrete greedy algorithms are fast, they usually do not provide the tightest guarantees for many classes of feasibility constraints. This is why continuous relaxations of submodular functions, e.g., the multilinear extension, have gained a lot of interest [Vondrák, 2008, Calinescu et al., 2011, Chekuri et al., 2014, Feldman et al., 2011, Gharan and Vondrák, 2011, Sviridenko et al., 2015]. In particular, it is known that the continuous greedy algorithm achieves a (1 − 1/e) approximation guarantee for monotone submodular functions under a general matroid constraint [Calinescu et al., 2011]. An improved ((1 − e^{−c})/c)-approximation guarantee can be obtained if f has curvature c [Vondrák, 2010].

Continuous submodularity naturally arises in many learning applications such as robust budget allocation [Staib and Jegelka, 2017, Soma et al., 2014], online resource allocation [Eghbali and Fazel, 2016], learning assignments [Golovin et al., 2014], as well as Adwords for e-commerce and advertising [Devanur and Jain, 2012, Mehta et al., 2007]. Maximizing a deterministic continuous submodular function dates back to the work of Wolsey [1982]. More recently, Chekuri et al. [2015] proposed a multiplicative weight update algorithm that achieves a (1 − 1/e − ε) approximation guarantee after Õ(n/ε²) oracle calls to gradients of a monotone smooth submodular function F (i.e., twice differentiable DR-submodular) subject to a polytope constraint. A similar approximation factor can be obtained after O(n/ε) oracle calls to gradients of F for monotone DR-submodular functions subject to a down-closed convex body using the continuous greedy method [Bian et al.]. However, such results require exact computation of the gradients ∇F, which is not feasible in Problem (1). An alternative approach is to modify the current algorithms by replacing the gradients ∇F(x_t) with their stochastic estimates ∇F̃(x_t, z_t); however, this modification may lead to arbitrarily poor solutions, as demonstrated in [Hassani et al., 2017]. Another alternative is to estimate the gradient by averaging over a (large) mini-batch of samples at each iteration. While this approach can potentially reduce the noise variance, it increases the computational complexity of each iteration and is not favorable. The work by Hassani et al. [2017] is perhaps the first attempt to solve Problem (1) using only stochastic estimates of gradients (without a large batch). They showed that the stochastic gradient ascent method achieves a (1/2 − ε) approximation guarantee after O(1/ε²) iterations. Although this work opens the door to maximizing stochastic CSFs using computationally cheap stochastic gradients, it fails to achieve the optimal (1 − 1/e) approximation. To close the gap, we propose in this paper Stochastic Continuous Greedy, which outputs a solution with function value at least ((1 − 1/e)OPT − ε) after O(1/ε³) iterations.

The focus of our paper is on the maximization of stochastic submodular functions. However, there are also very interesting results for the minimization of such functions [Staib and Jegelka, 2017, Ene et al., 2017, Chakrabarty et al., 2017].

3 Continuous Submodularity

We begin by recalling the definition of a submodular set function: a function f : 2^V → R₊, defined on the ground set V, is called submodular if for all subsets A, B ⊆ V we have

    f(A) + f(B) ≥ f(A ∩ B) + f(A ∪ B).

The notion of submodularity goes beyond the discrete domain [Wolsey, 1982, Vondrák, 2007, Bach, 2015]. Consider a continuous function F : X → R₊, where the set X is of the form X = ∏_{i=1}^n X_i and each X_i is a compact subset of R₊. We call the continuous function F submodular if for all x, y ∈ X we have

    F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),   (3)

where x ∨ y := max(x, y) (component-wise) and x ∧ y := min(x, y) (component-wise). In this paper, our focus is on differentiable continuous submodular functions with two additional properties: monotonicity and diminishing returns. Formally, a submodular function F is monotone (on the set X) if

    x ≤ y  ⟹  F(x) ≤ F(y),   (4)

for all x, y ∈ X. Note that x ≤ y in (4) means that x_i ≤ y_i for all i = 1, …, n. Furthermore, a differentiable submodular function F is called DR-submodular (i.e., shows diminishing returns) if its gradients are antitone, namely, for all x, y ∈ X we have

    x ≤ y  ⟹  ∇F(x) ≥ ∇F(y).   (5)

When the function F is twice differentiable, submodularity implies that all cross-second-derivatives are non-positive [Bach, 2015], i.e.,

    ∂²F(x)/∂x_i∂x_j ≤ 0  for all i ≠ j and all x ∈ X,   (6)

and DR-submodularity implies that all second derivatives are non-positive [Bian et al.], i.e.,

    ∂²F(x)/∂x_i∂x_j ≤ 0  for all i, j and all x ∈ X.   (7)
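These definitions are easy to probe numerically. The snippet below (our own example, not one from the paper) checks inequality (3) and the antitone-gradient property (5) on random points for F(x) = 1 − ∏_i (1 − x_i), a monotone DR-submodular function on [0, 1]^n:

```python
import random

# Sanity check of definitions (3) and (5) for
# F(x) = 1 - prod_i (1 - x_i) on [0, 1]^n, which is monotone and
# DR-submodular (it is the multilinear extension of a simple
# coverage function). Illustrative code, not from the paper.
n, rng = 4, random.Random(0)

def F(x):
    p = 1.0
    for xi in x:
        p *= 1.0 - xi
    return 1.0 - p

def grad(x):
    # dF/dx_i = prod_{j != i} (1 - x_j)
    g = []
    for i in range(n):
        p = 1.0
        for j in range(n):
            if j != i:
                p *= 1.0 - x[j]
        g.append(p)
    return g

for _ in range(1000):
    x = [rng.random() for _ in range(n)]
    y = [rng.random() for _ in range(n)]
    hi = [max(a, b) for a, b in zip(x, y)]   # x ∨ y
    lo = [min(a, b) for a, b in zip(x, y)]   # x ∧ y
    # continuous submodularity, inequality (3)
    assert F(x) + F(y) >= F(hi) + F(lo) - 1e-12
    # DR property (5): gradients antitone on the comparable pair lo <= hi
    assert all(gl >= gh - 1e-12 for gl, gh in zip(grad(lo), grad(hi)))
```

Both assertions hold for every sampled pair, as the theory guarantees; the small tolerance only absorbs floating-point round-off.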
We only assume that the expected Chakrabarty et al., 2017, iye]. objective function F is monotone and DR-submodular and the stochastic functions F˜(x, z) may not be mono- tone nor submodular. Since the objective function F 3 Continuous Submodularity is monotone and DR-submodular, continuous greedy algorithm [Bian et al., Calinescu et al., 2011] can be We begin by recalling the definition of a submodular used in principle to solve Problem (1). Note that V set function: A function f :2 R+, defined on the each update of continuous greedy requires computing ! Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap

However, if we only have access to the (computationally cheap) stochastic gradients ∇F̃(x, z), then the continuous greedy method is not directly usable [Hassani et al., 2017]. This limitation is due to the non-vanishing variance of the gradient approximations. To resolve this issue, we introduce a stochastic variant of the continuous greedy algorithm that reduces the noise of the gradient approximations via a common averaging technique in stochastic optimization [Ruszczyński, 1980, 2008, Yang et al., 2016, Mokhtari et al., 2017]. Let t ∈ N be a discrete time index and ρ_t a given stepsize which approaches zero as t approaches infinity. Our proposed gradient estimate d_t is defined by the recursion

    d_t = (1 − ρ_t) d_{t−1} + ρ_t ∇F̃(x_t, z_t),   (8)

where the initial vector is d_0 = 0. It can be shown that the averaging in (8) reduces the noise of the gradient approximation as time increases. More formally, the expected noise of the gradient estimate, E[‖d_t − ∇F(x_t)‖²], approaches zero asymptotically (Lemma 2). This property implies that the estimate d_t is a better candidate for approximating the gradient ∇F(x_t) than the unbiased estimate ∇F̃(x_t, z_t), which suffers from high variance. We therefore define the ascent direction v_t of SCG as

    v_t = argmax_{v∈C} ⟨d_t, v⟩,   (9)

which is a linear objective maximization over the convex set C. Indeed, if instead of the estimate d_t we used the exact gradient ∇F(x_t) in (9), the continuous greedy update would be recovered. Here, as in continuous greedy, the initial decision vector is the null vector, x_0 = 0. Further, the stepsize for updating the iterates is 1/T, and the variable x_t is updated as

    x_{t+1} = x_t + (1/T) v_t.   (10)

The stepsize 1/T and the initialization x_0 = 0 ensure that after T iterations the variable x_T ends up in the convex set C. We should highlight that the convex body C may not be down-closed or contain 0. Nonetheless, the solution x_T returned by SCG is a feasible point in C. The steps of the proposed SCG method are outlined in Algorithm 1.

Algorithm 1 Stochastic Continuous Greedy (SCG)
Require: Stepsizes ρ_t > 0. Initialize d_0 = x_0 = 0.
1: for t = 1, 2, …, T do
2:   Compute d_t = (1 − ρ_t) d_{t−1} + ρ_t ∇F̃(x_t, z_t);
3:   Compute v_t = argmax_{v∈C} ⟨d_t, v⟩;
4:   Update the variable x_{t+1} = x_t + (1/T) v_t;
5: end for
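A minimal sketch of Algorithm 1, assuming the cardinality polytope C = {x ∈ [0,1]^n : Σ_i x_i ≤ k} (so the linear maximization in step 3 is a top-k selection) and a toy coverage objective with artificially noisy gradients. The objective, constants, and names below are our own illustrative choices, not the paper's:

```python
import math, random

def lmo_topk(d, k):
    # Step 3 of Algorithm 1 for C = {x in [0,1]^n : sum_i x_i <= k}:
    # argmax_{v in C} <d, v> puts 1 on the k largest positive coords of d.
    order = sorted(range(len(d)), key=lambda i: d[i], reverse=True)
    v = [0.0] * len(d)
    for i in order[:k]:
        if d[i] > 0:
            v[i] = 1.0
    return v

def scg(noisy_grad, n, k, T, rng):
    x, d = [0.0] * n, [0.0] * n                  # x_0 = d_0 = 0
    for t in range(1, T + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)       # averaging weight (Lemma 2)
        g = noisy_grad(x, rng)
        d = [(1 - rho) * di + rho * gi for di, gi in zip(d, g)]  # recursion (8)
        v = lmo_topk(d, k)                                       # direction (9)
        x = [xi + vi / T for xi, vi in zip(x, v)]                # update (10)
    return x

# Example objective (ours): weighted coverage
# F(x) = sum_j w_j * (1 - prod_{i in C_j} (1 - x_i)),
# monotone DR-submodular; the oracle returns its exact gradient
# plus Gaussian noise, mimicking an unbiased stochastic gradient.
covers = [[0, 1], [1, 2], [2, 3], [0, 3]]
w = [1.0, 1.0, 1.0, 1.0]

def F_cov(x):
    return sum(wj * (1 - math.prod(1 - x[i] for i in Cj))
               for wj, Cj in zip(w, covers))

def noisy_grad(x, rng):
    g = [0.0] * len(x)
    for wj, Cj in zip(w, covers):
        for i in Cj:
            g[i] += wj * math.prod(1 - x[l] for l in Cj if l != i)
    return [gi + rng.gauss(0.0, 0.5) for gi in g]

x_T = scg(noisy_grad, n=4, k=2, T=300, rng=random.Random(1))
```

Despite the per-sample gradient noise, the averaged direction d_t stabilizes and the returned x_T is feasible (each coordinate in [0, 1], total mass at most k) with objective value well above the (1 − 1/e)OPT threshold on this toy instance.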
5 Convergence Analysis

In this section, we study the convergence properties of SCG for solving Problem (1). We first assume that the following conditions hold.

Assumption 1. The Euclidean norms of the elements in the constraint set C are uniformly bounded, i.e., for all x ∈ C we can write

    ‖x‖ ≤ D.   (11)

Assumption 2. The function F is DR-submodular and monotone. Further, its gradients are L-Lipschitz continuous over the set X, i.e., for all x, y ∈ X,

    ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖.   (12)

Assumption 3. The variance of the unbiased stochastic gradients ∇F̃(x, z) is bounded above by σ², i.e., for any vector x ∈ X we can write

    E[‖∇F̃(x, z) − ∇F(x)‖²] ≤ σ²,   (13)

where the expectation is with respect to the randomness of z ∼ P.

Due to the initialization of SCG (starting from 0), we need a bound on the furthest feasible solution from 0 that we can end up with; such a bound is guaranteed by Assumption 1. The condition in Assumption 2 ensures that the objective function F is smooth. Note again that F̃(·, z) may or may not have Lipschitz continuous gradients. Finally, the condition in Assumption 3 guarantees that the variance of the stochastic gradients ∇F̃(x, z) is bounded by a finite constant σ² < ∞, which is customary in stochastic optimization.

To study the convergence of SCG, we first derive an upper bound on the expected error of the gradient approximation, E[‖∇F(x_t) − d_t‖²], in the following lemma.

Lemma 1. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1. If Assumptions 1-3 are satisfied, then the sequence of expected squared gradient errors E[‖∇F(x_t) − d_t‖²] for the iterates generated by SCG satisfies

    E[‖∇F(x_t) − d_t‖²] ≤ (1 − ρ_t/2) E[‖∇F(x_{t−1}) − d_{t−1}‖²] + ρ_t² σ² + L²D²/T² + 2L²D²/(ρ_t T²).   (14)

Proof. See Section 9.1. ∎

The result in Lemma 1 shows that the expected squared error of the gradient approximation, E[‖∇F(x_t) − d_t‖²], decreases at each iteration by the factor (1 − ρ_t/2), provided that the remaining terms on the right-hand side of (14) are negligible relative to the term (1 − ρ_t/2)E[‖∇F(x_{t−1}) − d_{t−1}‖²]. This condition can be satisfied if the parameters {ρ_t} are chosen properly. We formalize this claim in the following lemma and show that the expected error E[‖∇F(x_t) − d_t‖²] converges to zero at a sublinear rate of O(t^{−2/3}).

Lemma 2. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1. If Assumptions 1-3 are satisfied and ρ_t = 4/(t + 8)^{2/3}, then for t = 0, …, T we have

    E[‖∇F(x_t) − d_t‖²] ≤ Q/(t + 9)^{2/3},   (15)

where Q := max{‖∇F(x_0) − d_0‖² 9^{2/3}, 16σ² + 3L²D²}.

Proof. See Section 9.2. ∎

Let us now use the result in Lemma 2 to show that the sequence of iterates generated by SCG reaches a (1 − 1/e) approximation for Problem (1).

Theorem 1. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1. If Assumptions 1-3 are satisfied and ρ_t = 4/(t + 8)^{2/3}, then the expected objective function value for the iterates generated by SCG satisfies the inequality

    E[F(x_T)] ≥ (1 − 1/e)OPT − 2DQ^{1/2}/T^{1/3} − LD²/(2T),   (16)

where OPT = max_{x∈C} F(x).

Proof. Let x* be the global maximizer within the constraint set C. Based on the smoothness of the function F with constant L we can write

    F(x_{t+1}) ≥ F(x_t) + ⟨∇F(x_t), x_{t+1} − x_t⟩ − (L/2) ‖x_{t+1} − x_t‖²
              = F(x_t) + (1/T) ⟨∇F(x_t), v_t⟩ − (L/(2T²)) ‖v_t‖²,   (17)

where the equality follows from the update in (10). Since v_t is in the set C, it follows from Assumption 1 that the norm ‖v_t‖² is bounded above by D². Apply this substitution and add and subtract the inner product ⟨d_t, v_t⟩/T on the right-hand side of (17) to obtain

    F(x_{t+1}) ≥ F(x_t) + (1/T) ⟨v_t, d_t⟩ + (1/T) ⟨v_t, ∇F(x_t) − d_t⟩ − LD²/(2T²)
              ≥ F(x_t) + (1/T) ⟨x*, d_t⟩ + (1/T) ⟨v_t, ∇F(x_t) − d_t⟩ − LD²/(2T²).   (18)

The second inequality in (18) holds since, based on (9), we can write ⟨x*, d_t⟩ ≤ max_{v∈C} ⟨v, d_t⟩ = ⟨v_t, d_t⟩. Now add and subtract the inner product ⟨x*, ∇F(x_t)⟩/T on the right-hand side of (18) to get

    F(x_{t+1}) ≥ F(x_t) + (1/T) ⟨x*, ∇F(x_t)⟩ + (1/T) ⟨v_t − x*, ∇F(x_t) − d_t⟩ − LD²/(2T²).   (19)

We further have ⟨x*, ∇F(x_t)⟩ ≥ F(x*) − F(x_t); this follows from the monotonicity of F as well as the concavity of F along positive directions; see, e.g., [Calinescu et al., 2011]. Moreover, by Young's inequality, the inner product ⟨v_t − x*, ∇F(x_t) − d_t⟩ is lower bounded by −(β_t/2)‖v_t − x*‖² − (1/(2β_t))‖∇F(x_t) − d_t‖² for any β_t > 0. Applying these substitutions to (19), we obtain

    F(x_{t+1}) ≥ F(x_t) + (1/T)(F(x*) − F(x_t)) − LD²/(2T²) − (1/(2T)) ( β_t ‖v_t − x*‖² + ‖∇F(x_t) − d_t‖²/β_t ).   (20)

Replacing ‖v_t − x*‖² by its upper bound 4D² and taking expectations in (20) yields

    E[F(x_{t+1})] ≥ E[F(x_t)] + (1/T) E[F(x*) − F(x_t)] − (1/(2T)) ( 4β_t D² + E[‖∇F(x_t) − d_t‖²]/β_t ) − LD²/(2T²).   (21)

Substituting E[‖∇F(x_t) − d_t‖²] by its upper bound Q/(t + 9)^{2/3} from (15), setting β_t = Q^{1/2}/(2D(t + 9)^{1/3}), and regrouping the resulting expression, we obtain

    E[F(x*) − F(x_{t+1})] ≤ (1 − 1/T) E[F(x*) − F(x_t)] + 2DQ^{1/2}/((t + 9)^{1/3} T) + LD²/(2T²).   (22)

Applying the inequality in (22) recursively for t =

0, …, T − 1, we obtain

    E[F(x*) − F(x_T)] ≤ (1 − 1/T)^T (F(x*) − F(x_0)) + Σ_{t=0}^{T−1} 2DQ^{1/2}/((t + 9)^{1/3} T) + Σ_{t=0}^{T−1} LD²/(2T²).   (23)

Simplifying the terms on the right-hand side of (23) leads to

    E[F(x*) − F(x_T)] ≤ (1/e)(F(x*) − F(x_0)) + 2DQ^{1/2}/T^{1/3} + LD²/(2T).   (24)

Using the fact that F(x_0) ≥ 0, the expression in (24) simplifies to

    E[F(x_T)] ≥ (1 − 1/e) F(x*) − 2DQ^{1/2}/T^{1/3} − LD²/(2T),   (25)

and the claim in (16) follows. ∎

The result in Theorem 1 shows that the sequence of iterates generated by SCG, which only has access to a noisy unbiased estimate of the gradient at each iteration, is able to achieve the optimal approximation bound (1 − 1/e), while the error term vanishes at a sublinear rate of O(T^{−1/3}).

6 Discrete Submodular Maximization

According to the results in Section 5, the SCG method achieves in expectation a (1 − 1/e)-optimal solution for Problem (1). The focus of this section is on extending this result to the discrete domain and showing that SCG can be applied to maximizing a stochastic submodular set function f, namely Problem (2), through the multilinear extension of the function f. To be more precise, in lieu of solving the problem in (2), one can solve the continuous optimization problem

    max_{x∈C} F(x),   (26)

where F is the multilinear extension of the function f, defined as

    F(x) = Σ_{S⊆V} f(S) ∏_{i∈S} x_i ∏_{j∉S} (1 − x_j),   (27)

and the convex set C = conv{1_I : I ∈ I} is the matroid polytope [Calinescu et al., 2011]. Note that in (27), x_i denotes the i-th element of the vector x.

Indeed, the continuous greedy algorithm is able to solve the program in (26); however, each iteration of the method is computationally costly due to the gradient evaluations ∇F(x). Instead, Badanidiyuru and Vondrák [2014] and Chekuri et al. [2015] suggested approximating the gradient using a sufficient number of samples from f. This mechanism still requires access to the set function f multiple times at each iteration, and hence is not feasible for solving Problem (2). The idea is then to use a stochastic (unbiased) estimate of the gradient ∇F. In Appendix 9.3, we provide a method to compute an unbiased estimate of the gradient using n samples from f̃(S_i, z), where z ∼ P and the S_i, i = 1, …, n, are carefully chosen sets. Indeed, the stochastic gradient ascent method proposed in [Hassani et al., 2017] can be used to solve the multilinear extension problem in (26) using unbiased estimates of the gradient at each iteration. However, stochastic gradient ascent fails to achieve the optimal (1 − 1/e) approximation. Further, the work in [Karimi et al., 2017] achieves a (1 − 1/e) approximate solution only when each f̃(·, z) is a coverage function. Here, we show that SCG achieves the first (1 − 1/e) tight approximation guarantee for the discrete stochastic submodular Problem (2). More precisely, we show that SCG finds a solution for (26) with an expected function value that is at least (1 − 1/e)OPT − ε in O(1/ε³) iterations.
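The coordinates of the multilinear extension satisfy ∇_j F(x) = E_{S∼x}[f(S ∪ {j}) − f(S \ {j})], where S ∼ x includes each element independently with probability x_i. This identity suggests a simple Monte-Carlo gradient estimator. The sketch below contrasts the exact coordinate gradient with its sampled estimate on our own toy coverage function (this is illustrative code, not the construction from Appendix 9.3):

```python
import itertools, random

# f: a small monotone submodular (coverage) set function on V = {0, 1, 2}.
sets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
def f(S):
    return len(set().union(*(sets[i] for i in S))) if S else 0

def exact_grad_j(x, j):
    # ∇_j F(x) = sum over S ⊆ V \ {j} of P[S] * (f(S ∪ {j}) - f(S)),
    # computed by full enumeration (only viable for tiny ground sets).
    V = [i for i in range(len(x)) if i != j]
    total = 0.0
    for r in range(len(V) + 1):
        for S in itertools.combinations(V, r):
            p = 1.0
            for i in V:
                p *= x[i] if i in S else 1 - x[i]
            total += p * (f(set(S) | {j}) - f(set(S)))
    return total

def sampled_grad_j(x, j, m, rng):
    # Monte-Carlo estimate: draw S ~ x (each i != j kept w.p. x_i)
    # and average the marginal value of adding j.
    acc = 0.0
    for _ in range(m):
        S = {i for i in range(len(x)) if i != j and rng.random() < x[i]}
        acc += f(S | {j}) - f(S)
    return acc / m
```

Since each sample is an unbiased draw of the marginal value, the estimator converges to the exact coordinate gradient at the usual O(1/√m) Monte-Carlo rate, which is exactly the kind of cheap stochastic gradient SCG consumes.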

To do so, we first show in the following lemma that the difference between any coordinate of the gradients of two consecutive iterates generated by SCG, i.e., ∇_j F(x_{t+1}) − ∇_j F(x_t) for j ∈ {1, …, n}, is bounded by ‖x_{t+1} − x_t‖ multiplied by a factor that is independent of the problem dimension n.

Lemma 3. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1 with iterates x_t, and recall the definition of the multilinear extension function F in (27). If we define r as the rank of the matroid I and m_f := max_{i∈{1,…,n}} f({i}), then

    |∇_j F(x_{t+1}) − ∇_j F(x_t)| ≤ m_f √r ‖x_{t+1} − x_t‖   (28)

holds for j = 1, …, n.

Proof. See Section 9.4. ∎

The result in Lemma 3 states that along an ascent direction of SCG, the gradient is m_f√r-Lipschitz continuous. Here, m_f is the maximum marginal value of the function f and r is the rank of the matroid. Using the result of Lemma 3 and a coordinate-wise analysis, the bounds in Theorem 1 can be improved and specialized to the multilinear extension maximization problem, as we show in the following theorem.

Theorem 2. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1. Recall the definition of the multilinear extension function F in (27) and the definitions of r and m_f in Lemma 3. Further, set the averaging parameter as ρ_t = 4/(t + 8)^{2/3}. If Assumptions 1 and 3 hold, then the iterate x_T generated by SCG satisfies the inequality

    E[F(x_T)] ≥ (1 − 1/e)OPT − 2DK/T^{1/3} − m_f √r D²/(2T),   (29)

where K := max{‖∇F(x_0) − d_0‖ 9^{1/3}, 4σ + √(3r) m_f D}.

Proof. The proof is similar to that of Theorem 1. The main difference is to carry out the analysis coordinate-wise and replace L by m_f√r, as shown in Lemma 3. For more details, see Section 9.5 in the supplementary material. ∎

The result in Theorem 2 indicates that the sequence of iterates generated by SCG achieves a (1 − 1/e)OPT − ε approximation guarantee. Note that the constants on the right-hand side of (29) are independent of n, except K, which depends on σ. It can be shown that, in the worst case, the variance σ depends on the size of the ground set n and the variance of the stochastic functions f̃(·, z).

Let us now explain how the variance of the stochastic gradients of F relates to the variance of the marginal values of f. Consider a generic submodular set function g and its multilinear extension G. It can be shown that

    ∇_j G(x) = G(x; x_j = 1) − G(x; x_j = 0).   (30)

Hence, from submodularity we have ∇_j G(x) ≤ g({j}). Using this simple fact we can deduce that

    E[‖∇F̃(x, z) − ∇F(x)‖²] ≤ n max_{j∈[n]} E[f̃({j}, z)²].   (31)

Therefore, the constant σ can be upper bounded by

    σ ≤ √n max_{j∈[n]} ( E[f̃({j}, z)²] )^{1/2}.   (32)

As a result, we have the following guarantee for SCG in the case of multilinear extension functions.

Corollary 1. Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 1. Suppose the conditions in Theorem 2 are satisfied. Then the sequence of iterates generated by SCG achieves a (1 − 1/e)OPT − ε solution after O(n^{3/2}/ε³) iterations.

Proof. According to the result in Theorem 2, SCG reaches a (1 − 1/e)OPT − O(n^{1/2}/T^{1/3}) solution after T iterations. Therefore, to achieve a ((1 − 1/e)OPT − ε) approximation, O(n^{3/2}/ε³) iterations are required. ∎

7 Numerical Experiments

In our experiments, we consider a movie recommendation application [Stan et al., 2017] consisting of N users and n movies. Each user i has a user-specific utility function f(·, i) for evaluating sets of movies. The goal is to find a set of k movies that, in expectation over users' preferences, provides the highest utility, i.e., max_{|S|≤k} f(S), where f(S) := E_{i∼P}[f(S, i)]. This is an instance of the (discrete) stochastic submodular maximization problem defined in (2). For simplicity, we assume f has the form of an empirical objective function, i.e., f(S) = (1/N) Σ_{i=1}^N f(S, i). In other words, the distribution P is assumed to be uniform over the integers between 1 and N. The continuous counterpart of this problem is to consider the multilinear extension F(·, i) of each function f(·, i) and solve the problem in the continuous domain as follows. Let F(x) = E_{i∼P}[F(x, i)] for x ∈ [0, 1]^n and define the constraint set C = {x ∈ [0, 1]^n : Σ_{i=1}^n x_i ≤ k}. The discrete and continuous optimization formulations lead to the same optimal value [Calinescu et al., 2011]:

    max_{S : |S|≤k} f(S) = max_{x∈C} F(x).   (33)

Therefore, by running SCG we can find a solution in the continuous domain that is at least a (1 − 1/e) approximation to the optimal value. By rounding that fractional solution (for instance via randomized Pipage rounding [Calinescu et al., 2011]), we obtain a set whose utility is at least (1 − 1/e) of the optimum solution set of size k. We note that randomized Pipage rounding does not need access to the value of f. We also remark that each iteration of SCG can be done very efficiently in O(n) time (the argmax step reduces to finding the largest k elements of a vector of length n). Therefore, this approach easily scales to big-data scenarios where the size of the data set N (e.g., the number of users) or the number of items n (e.g., the number of movies) is large.

In our experiments, we use the following baselines:

(i) Stochastic Continuous Greedy (SCG) with ρ_t = (1/2) t^{−2/3} and mini-batch size B. The details for computing an unbiased estimator of the gradient ∇F are given in Section 9.3 in the supplementary material.

(ii) Stochastic Gradient Ascent (SGA) of [Hassani et al., 2017], with stepsize µ_t = c/√t and mini-batch size B.

(iii) Frank-Wolfe (FW) variant of [Bian et al., Calinescu et al., 2011], with parameter T for the total number of iterations and batch size B (we further let α = 1, δ = 0; see Algorithm 1 in [Bian et al.] or the continuous greedy method of [Calinescu et al., 2011] for more details).

(iv) Batch-mode Greedy (Greedy), obtained by running the vanilla greedy algorithm (in the discrete domain) in the following way: at each round of the algorithm (for selecting a new element), B random
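The remark above that each SCG iteration runs in O(n) time can be made concrete: for C = {x ∈ [0,1]^n : Σ_i x_i ≤ k}, the argmax in step 3 of Algorithm 1 reduces to selecting the k largest (strictly positive) coordinates of d_t. A minimal sketch, ours rather than the authors' code:

```python
import heapq

def lmo_cardinality(d, k):
    # argmax_{v in C} <d, v> for C = {x in [0,1]^n : sum_i x_i <= k}:
    # set v_i = 1 on the k largest coordinates of d, keeping only those
    # with d_i > 0 (a heap gives O(n log k); selection gives O(n)).
    best = heapq.nlargest(k, range(len(d)), key=lambda i: d[i])
    v = [0.0] * len(d)
    for i in best:
        if d[i] > 0:
            v[i] = 1.0
    return v

v = lmo_cardinality([0.3, -0.1, 0.9, 0.5], k=2)  # selects coordinates 2 and 3
```

Keeping only strictly positive coordinates matters because C here contains 0, so a coordinate with a non-positive averaged gradient contributes nothing to ⟨d_t, v⟩.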

[Figure 1 appears here: three plots, (a) facility location, (b) facility location (runtime comparison), (c) concave-over-modular, comparing SCG, SGA, FW, and Greedy at various batch sizes B.]

Figure 1: Comparison of the performance of SGA, Greedy, FW, and SCG in a movie recommendation application. Fig. 1a illustrates the performance of the algorithms in terms of the facility-location objective value w.r.t. the cardinality constraint size k after T = 2000 iterations. Fig. 1b compares the considered methods in terms of runtime (for a fixed k = 40) by plotting the facility-location objective value vs. the number of (simple) function evaluations. Fig. 1c shows the concave-over-modular objective value vs. the size of the cardinality constraint k after running the algorithms for T = 2000 iterations.

users are picked and the function f is estimated by the average over the B selected users.

To run the experiments we use the MovieLens data set. It consists of 1 million ratings (from 1 to 5) by N = 6041 users for n = 4000 movies. Let r_{i,j} denote the rating of user i for movie j (if such a rating does not exist we assign r_{i,j} to 0). In our experiments, we consider two well-motivated objective functions. The first one is called "facility location," where the valuation function of user i is defined as f(S, i) = max_{j∈S} r_{i,j}. In words, the way user i evaluates a set S is by picking the highest-rated movie in S. Thus, the objective function is

    f_fac(S) = (1/N) Σ_{i=1}^{N} max_{j∈S} r_{i,j}.                    (34)

In our second experiment, we consider a different user-specific valuation function which is a concave function composed with a modular function, i.e., f(S, i) = (Σ_{j∈S} r_{i,j})^{1/2}. Again, by considering the uniform distribution over the set of users, we obtain

    f_con(S) = (1/N) Σ_{i=1}^{N} ( Σ_{j∈S} r_{i,j} )^{1/2}.            (35)

Figure 1 depicts the performance of the different algorithms for the two proposed objective functions. As Figures 1a and 1c show, the FW algorithm needs a higher mini-batch size to be comparable to SCG. Note that a smaller batch size leads to less computational effort (under the same values of B and T, the computational complexity of FW, SGA, and SCG is almost the same). Figure 1b shows the performance of the algorithms with respect to the number of times the (simple) functions (i.e., the f(·, i)'s) are evaluated. Note that the total number of (simple) function evaluations for SCG and SGA is nBT, where T is the number of iterations. Also, for Greedy the total number of evaluations is nkB. This further shows that SCG has a better computational complexity requirement than SGA as well as the Greedy algorithm (in the discrete domain).

8 Conclusion

In this paper, we provided the first tight approximation guarantee for maximizing a stochastic monotone DR-submodular function subject to a general convex body constraint. We developed Stochastic Continuous Greedy, which achieves a [(1 − 1/e)OPT − ε] guarantee (in expectation) with O(1/ε³) stochastic gradient computations. We also demonstrated that our continuous algorithm can be used to provide the first (1 − 1/e) tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint. We believe that our results provide an important step towards unifying discrete and continuous submodular optimization in stochastic settings.

Acknowledgements

This work was done while A. Mokhtari was visiting the Simons Institute for the Theory of Computing, and his work was partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF grant #CCF-1740425. The work of A. Karbasi was supported by DARPA Young Faculty Award (D16AP00046) and AFOSR YIP (FA9550-18-1-0160).
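As a concrete illustration of the two objectives used in the experiments, the facility-location value (34) and the concave-over-modular value (35) can be computed from a ratings matrix as follows. This is an illustrative sketch, not the authors' experimental code; the small matrix `R` is made up for the example (missing ratings stored as 0, as in the text).

```python
import numpy as np

def f_fac(R, S):
    """Facility-location objective (34): average over users of the
    highest rating among the selected movies S."""
    return np.mean(R[:, list(S)].max(axis=1))

def f_con(R, S):
    """Concave-over-modular objective (35): average over users of the
    square root of the sum of ratings of the selected movies S."""
    return np.mean(np.sqrt(R[:, list(S)].sum(axis=1)))

# Toy ratings matrix: 2 users x 3 movies (missing ratings stored as 0).
R = np.array([[5.0, 0.0, 3.0],
              [1.0, 4.0, 0.0]])
print(f_fac(R, {0, 2}))  # (max(5,3) + max(1,0)) / 2 = 3.0
print(f_con(R, {0, 2}))  # (sqrt(5+3) + sqrt(1+0)) / 2 ≈ 1.914
```

Both objectives are monotone submodular in S, which is what makes the greedy and continuous-greedy baselines applicable.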

References

A. A. Abbasi and M. Younis. A survey on clustering algorithms for wireless sensor networks. Computer Communications, 30(14):2826–2841, 2007.

A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Proceedings of the 29th Conference on Learning Theory, pages 81–102, 2016.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.

S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. SIAM Journal on Computing, 45(4):1582–1611, 2016.

F. Bach. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

A. Badanidiyuru and J. Vondrák. Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1497–1514, 2014.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160, 2007.

A. A. Bian, B. Mirzasoleiman, J. M. Buhmann, and A. Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. Submodular maximization with cardinality constraints. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1433–1452, 2014.

N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.

G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.

D. Chakrabarty, Y. T. Lee, A. Sidford, and S. C. Wong. Subquadratic submodular function minimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1220–1231, 2017.

C. Chekuri, J. Vondrák, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43(6):1831–1879, 2014.

C. Chekuri, T. Jayram, and J. Vondrák. On multiplicative weight updates for concave and submodular function maximization. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 201–210. ACM, 2015.

N. R. Devanur and K. Jain. Online matching with concave returns. In Proceedings of the 44th Symposium on Theory of Computing, pages 137–144, 2012.

R. Eghbali and M. Fazel. Designing smoothing functions for improved worst-case competitive ratio in online optimization. In Advances in Neural Information Processing Systems 29, pages 3279–3287, 2016.

A. Ene, H. L. Nguyen, and L. A. Végh. Decomposable submodular function minimization: Discrete and continuous. In Advances in Neural Information Processing Systems 30, pages 2874–2884, 2017.

U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634–652, 1998.

U. Feige, V. S. Mirrokni, and J. Vondrák. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011.

M. Feldman, J. Naor, and R. Schwartz. A unified continuous greedy algorithm for submodular maximization. In IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 570–579, 2011.

M. Feldman, C. Harshaw, and A. Karbasi. Greed is good: Near-optimal submodular maximization via greedy optimization. In Proceedings of the 30th Conference on Learning Theory, pages 758–784, 2017.

S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier, 2005.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.

R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems 29, pages 2973–2981, 2016.

S. O. Gharan and J. Vondrák. Submodular maximization by simulated annealing. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1098–1116, 2011.

D. Golovin, A. Krause, and M. Streeter. Online submodular maximization under a matroid constraint with application to learning assignments. arXiv preprint arXiv:1407.1082, 2014.

S. H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems 30, pages 5843–5853, 2017.

E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 1263–1271, 2016.

E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In Proceedings of the 33rd International Conference on Machine Learning, pages 1833–1841, 2016.

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.

M. R. Karimi, M. Lucic, S. H. Hassani, and A. Krause. Stochastic submodular maximization: The case of coverage functions. In Advances in Neural Information Processing Systems 30, pages 6856–6866, 2017.

L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani. AdWords and generalized online matching. Journal of the ACM, 54(5):22, 2007.

S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

A. Mokhtari, A. Koppel, G. Scutari, and A. Ribeiro. Large-scale nonconvex stochastic optimization by doubly stochastic successive convex approximation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4701–4705, 2017.

K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, 14(1):265–294, 1978.

P. Netrapalli, U. N. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain. Non-convex robust PCA. In Advances in Neural Information Processing Systems 27, pages 1107–1115, 2014.

S. Paternain, A. Mokhtari, and A. Ribeiro. A second order method for nonconvex optimization. arXiv preprint arXiv:1707.08028, 2017.

S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 1244–1251. IEEE, 2016.

A. Ruszczyński. Feasible direction methods for stochastic programming problems. Mathematical Programming, 19(1):220–229, 1980.

A. Ruszczyński. A merit function approach to the subgradient method with averaging. Optimisation Methods and Software, 23(1):161–172, 2008.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.

T. Soma, N. Kakimura, K. Inaba, and K. Kawarabayashi. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In Proceedings of the 31st International Conference on Machine Learning, pages 351–359, 2014.

M. Staib and S. Jegelka. Robust budget allocation via continuous submodular functions. In Proceedings of the 34th International Conference on Machine Learning, pages 3230–3240, 2017.

S. Stan, M. Zadimoghaddam, A. Krause, and A. Karbasi. Probabilistic submodular maximization in sub-linear time. In Proceedings of the 34th International Conference on Machine Learning, pages 3241–3250, 2017.

M. Sviridenko, J. Vondrák, and J. Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1134–1148, 2015.

J. Vondrák. Submodularity in combinatorial optimization. 2007.

J. Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 67–74. ACM, 2008.

J. Vondrák. Submodularity and curvature: The optimal algorithm (combinatorial optimization and discrete algorithms). RIMS Kokyuroku Bessatsu, B23:253–266, 2010.

L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.

Y. Yang, G. Scutari, D. P. Palomar, and M. Pesavento. A parallel decomposition method for nonconvex stochastic multi-agent optimization problems. IEEE Transactions on Signal Processing, 64(11):2949–2964, 2016.