A LEVEL-SET METHOD FOR CONVEX OPTIMIZATION WITH A FEASIBLE SOLUTION PATH

QIHANG LIN∗, SELVAPRABU NADARAJAH†, AND NEGAR SOHEILI‡

Abstract. Large-scale constrained convex optimization problems arise in several application domains. First-order methods are good candidates to tackle such problems due to their low iteration complexity and memory requirement. The level-set framework extends the applicability of first-order methods to tackle problems with complicated convex objectives and constraint sets. Current methods based on this framework either rely on the solution of challenging subproblems or do not guarantee a feasible solution, especially if the procedure is terminated before convergence. We develop a level-set method that finds an ε-relative optimal and feasible solution to a constrained convex optimization problem with a fairly general objective function and set of constraints, maintains a feasible solution at each iteration, and only relies on calls to first-order oracles. We establish the iteration complexity of our approach, also accounting for the smoothness and strong convexity of the objective function and constraints when these properties hold. The dependence of our complexity on ε is similar to the analogous dependence in the unconstrained setting, which is not known to be true for level-set methods in the literature. Nevertheless, ensuring feasibility is not free. The iteration complexity of our method depends on a condition number, while existing level-set methods that do not guarantee feasibility can avoid such dependence. We numerically validate the usefulness of ensuring a feasible solution path by comparing our approach with an existing level-set method on a Neyman-Pearson classification problem.

Key words. constrained convex optimization, level-set technique, first-order methods, complexity analysis

AMS subject classifications. 90C25, 90C30, 90C52

1. Introduction. Large-scale constrained convex optimization problems arise in several business, science, and engineering applications. A commonly encountered form for such problems is

(1)    f∗ := min_{x∈X} f(x)

(2)    s.t.  gi(x) ≤ 0,  i = 1, . . . , m,

where X ⊆ R^n is a non-empty closed and convex set, and f and gi, i = 1, . . . , m, are closed convex real-valued functions defined on X. Given ε > 0, a point x_ε is ε-optimal if it satisfies f(x_ε) − f∗ ≤ ε and ε-feasible if it satisfies max_{i=1,...,m} gi(x_ε) ≤ ε. In addition, given 0 < ε ≤ 1 and a feasible and suboptimal solution e ∈ X, a point x is ε-relative optimal with respect to e if it satisfies (f(x) − f∗)/(f(e) − f∗) ≤ ε.
Level-set techniques [2, 3, 15, 24, 39, 40] provide a flexible framework for solving the optimization problem (1)-(2). Specifically, this problem can be reformulated as a convex root finding problem L∗(t) = 0, where L∗(t) = min_{x∈X} max{f(x) − t; gi(x), i = 1, . . . , m} (see Lemaréchal et al. [24]). Then a bundle method [21, 23, 41] or an accelerated method [29, Section 2.3.5] can be applied to handle the minimization subproblem to estimate L∗(t) and facilitate the root finding process. In general, these methods solve the subproblem via a nontrivial quadratic program at each iteration.
An important variant of (1)-(2) arising in statistics, image processing, and signal processing occurs when f(x) takes a simple form, such as a gauge function, and there

∗Tippie College of Business, University of Iowa, USA, [email protected]
†College of Business Administration, University of Illinois at Chicago, USA, [email protected]
‡College of Business Administration, University of Illinois at Chicago, USA, [email protected]


is a single constraint with the structure ρ(Ax − b) ≤ λ, where ρ is a convex function, A is a matrix, b is a vector, and λ is a scalar. In this setting, Van Den Berg and Friedlander [39, 40] and Aravkin et al. [2] use a level-set approach to swap the roles of the objective function and constraints in (1)-(2) to obtain the convex root finding problem v(t) = 0 with v(t) := min_{x∈X, f(x)≤t} ρ(Ax − b) − λ. Harchaoui et al. [15] proposed a similar approach. The root finding problem arising in this approach is solved using inexact Newton and secant methods in [2, 39, 40] and [2], respectively. Since f(x) is assumed to be simple, projection onto the feasible set {x ∈ R^n | x ∈ X, f(x) ≤ t} is possible and the subproblem in the definition of v(t) can be solved by various first-order methods. Although the assumption on the objective function is important, the constraint structure is not limiting because (2) can be obtained by choosing A to be an identity matrix, b = 0, λ = 0, and ρ(x) = max_{i=1,...,m} gi(x).
The level-set techniques discussed above find an ε-feasible and ε-optimal solution to (1)-(2), that is, the terminal solution may not be feasible. This issue can be overcome when the infeasible terminal solution is also super-optimal and a strictly feasible solution to the problem exists. In this case, a radial-projection scheme [2, 35] can be applied to obtain a feasible and ε-relative optimal solution. Super-optimality of the terminal solution holds if the constraint f(x) ≤ t is satisfied at termination. This condition can be enforced when projections onto the feasible set with this constraint are computationally cheap, which typically requires f(x) to be simple (e.g., see [2]). Otherwise, such a guarantee does not exist.
Building on this rich literature, we develop a feasible level-set method that (i) employs a first-order oracle to solve subproblems and compute a feasible and ε-relative optimal solution to (1)-(2) with a general objective function and set of constraints while also maintaining feasible intermediate solutions at each iteration (i.e., a feasible solution path), and (ii) avoids potentially costly projections. Our method tackles a root finding problem involving the univariate function L∗(t), that is, L∗(t) = 0, which differs from the univariate function v(t) considered by the level-set methods in [2, 39, 40]. The level-set methods in [18] and [6] also solve L∗(t) = 0. However, exact computation of L∗(t) is assumed in the former case while the latter paper solves a potentially expensive quadratic problem when m is large to obtain inexact evaluations of this function. Level-set methods in the literature increase the level parameter t when moving towards the root whereas our procedure approaches the root by gradually decreasing t. This distinction is important to ensure feasibility.
Our focus on maintaining feasibility is especially desirable in situations where level-set methods need to be terminated before convergence, for instance due to a time constraint. In this case, our level-set approach would return a feasible, albeit possibly suboptimal, solution that can be implemented whereas the solution from existing level-set approaches may not be implementable due to potentially large constraint violations.
Convex optimization problems where constraints model operating conditions that need to be satisfied by solutions for implementation arise in several applications, including network and nonlinear resource allocation, signal processing, circuit design, and classification problems encountered in medical diagnosis and other areas [10, 11, 33, 36]. We numerically validate the usefulness of ensuring feasibility on a Neyman-Pearson classification problem, where we find that early termination of an existing level-set method can lead to highly infeasible solutions, while, as expected, our feasible level-set method provides a path of feasible solutions.
The radial sub-gradient method proposed by Renegar [35] and extended by Grimmer [14] also emphasizes feasibility. This method avoids projection and finds a feasible and ε-relative optimal solution by performing a line search at each iteration. This line


search, while easy to execute for some linear or conic programs, can be challenging for general convex programs (1)-(2). Moreover, it is unknown whether the iteration complexity of this approach can be improved by utilizing the potential smoothness and strong convexity of the objective function and constraints. In contrast, the feasible level-set approach in this paper avoids the need for line search, in addition to projection, and its iteration complexity leverages the strong convexity and smoothness of the objective and constraints when these properties hold.
Our work is also related to first-order methods, which are popular for solving large-scale convex optimization problems due to their low per-iteration cost and memory requirement, for example compared to interior point methods [16]. Most existing first-order methods find a feasible and ε-optimal solution to a version of the optimization problem (1)-(2) with a simple feasible region (e.g., a box, simplex, or ball), which makes projection onto the feasible set at each iteration inexpensive. Ensuring feasibility using projection becomes difficult for general constraints (2). Some sub-gradient methods, including the method by Nesterov [29, Section 3.2.4] and its recent variants by Lan and Zhou [22] and Bayandina [5], can be applied to (1)-(2) without the need for projection mappings. Augmented Lagrangian methods have been recently developed by Xu [42, 43, 44] for solving (1)-(2) with additional linear equality constraints and can avoid projections by minimizing an augmented Lagrangian function that integrates both the constraint and objective functions. However, these methods do not ensure feasibility of the solution as we do; instead, they find an ε-feasible and ε-optimal solution to (1)-(2) at convergence.
Given the focus of our research, one naturally wonders whether there is an associated computational cost of ensuring feasibility in the context of level-set methods. Our analysis suggests that the answer depends on the perspective one takes. First, consider the dependence of the iteration complexity on a condition measure as a criterion. A nice property of the level-set method in [2] is that its iteration complexity is independent of a condition measure for finding an ε-feasible and ε-optimal solution. Instead, finding an ε-relative optimal and feasible solution leads to iteration complexities with dependence on condition measures when (i) combining the level-set approach [2] and the transformation in [35], or (ii) using our feasible level-set approach. This suggests that a cost of ensuring feasibility is the presence of a condition measure in the iteration complexity. Second, suppose we consider the dependence of the number of iterations on ε as our assessment criterion. The complexity result of the level-set approach provided by [2] can be interpreted as the product of the iteration complexity of the oracle used to solve the subproblem and a O(log(1/ε)) factor. The dependence of our algorithm would be similar under a non-adaptive analysis akin to [2]. However, using a novel adaptive analysis we find that the iteration complexity of our feasible level-set approach can in fact eliminate the O(log(1/ε)) factor.
In other words, our complexity is analogous to the known complexity of first-order methods for unconstrained convex optimization, and it is unclear to us whether a similar adaptive analysis can be applied to the approach in [2]. Overall, in terms of the dependence on ε there appears to be no cost to ensuring feasibility.
The rest of this paper is organized as follows. Section 2 introduces the level-set formulation and our feasible level-set approach, also establishing its outer iteration complexity to compute an ε-relative optimal and feasible solution to the optimization problem (1)-(2). Section 3 describes two first-order oracles that solve the subproblem of our level-set formulation when the functions f and gi, i = 1, . . . , m, are smooth and non-smooth, and in each case accounts for the potential strong convexity of these functions. The inner iteration complexity of each oracle is also analyzed in this section.


Section 4 presents an adaptive complexity analysis to establish the overall iteration complexity of using our feasible level-set approach. Section 5 discusses the numerical performance of the proposed approach.

2. Feasible Level-Set Method. We begin by formally stating the assumptions used in our methodological developments.
Assumption 1. (a) The set X is a non-empty closed convex set and the functions f and gi, i = 1, . . . , m, are closed convex real-valued functions over this set.
(b) There exists a solution e ∈ X such that max_{i=1,...,m} gi(e) < 0 and f(e) > f∗. We label such a solution g-strictly feasible.
Given t ∈ R, we define

(3)    L∗(t) := min_{x∈X} L(t, x),

where L(t, x) := max{f(x) − t; gi(x), i = 1, . . . , m} for all x ∈ X. Let ∂L∗(t) denote the sub-differential of L∗ at t. Lemma 1 summarizes known properties of L∗(t).
Lemma 1 (Lemmas 2.3.4, 2.3.5, and 2.3.6 in [29]). It holds that
(a) L∗(t) is non-increasing and convex in t;
(b) L∗(t) ≤ 0 if t ≥ f∗ and L∗(t) > 0 if t < f∗. Moreover, under Assumption 1, it holds that L∗(t) < 0 for any t > f∗.
(c) Given ∆ ≥ 0, we have L∗(t) − ∆ ≤ L∗(t + ∆) ≤ L∗(t). Therefore, −1 ≤ ζ_{L∗} ≤ 0 for all ζ_{L∗} ∈ ∂L∗(t) and t ∈ R.
Proof. We only prove the second part of Item (b) as the proofs of the other items are provided in [29]. In particular, we show L∗(t) < 0 for any t > f∗ under Assumption 1. Let x∗ denote an optimal solution to (1)-(2). If t > f∗ we have f(x∗) − t < 0. Since f is a closed and convex function defined on X, it is continuous on this set. Hence, there exists δ > 0 such that for any x ∈ B_δ(x∗) ∩ X we have f(x) − t < 0, where B_δ(x∗) is a ball of radius δ centered at x∗. Take x̄ ∈ B_δ(x∗) ∩ X such that x̄ = λx∗ + (1 − λ)e for some λ ∈ (0, 1), where e is a solution that satisfies Assumption 1(b). By the convexity of the functions gi, i = 1, . . . , m, the feasibility of x∗, and the g-strict feasibility of e, we have gi(x̄) ≤ λgi(x∗) + (1 − λ)gi(e) < 0 for i = 1, . . . , m. In addition, since x̄ ∈ B_δ(x∗), we have f(x̄) − t < 0. Therefore, L∗(t) ≤ max{f(x̄) − t; gi(x̄), i = 1, . . . , m} < 0.
Lemma 1(b), specifically L∗(t) < 0 for any t > f∗, ensures that t∗ = f∗ is the unique root of L∗(t) = 0 and that the solution x∗(t∗) determining L∗(t∗) is an optimal solution to (1)-(2). In this case, it is possible to find an ε-relative optimal and feasible solution by closely approximating t∗ from above with a t̂ and then finding a point x̂ ∈ X with L(t̂, x̂) < 0. The latter condition ensures that x̂ is feasible. In order to obtain such a t̂, we start with an initial value t0 > t∗ and generate a decreasing sequence t0 ≥ t1 ≥ t2 ≥ · · · ≥ t∗, where tk is updated to tk+1 based on a term obtained by solving (3) with t = tk up to a certain optimality gap. We present this approach in Algorithm 1, which utilizes an oracle A defined as follows.
Definition 1. An algorithm is called an oracle A if, given t ≥ f∗, x ∈ X, α > 1, and γ̄ ≥ L(t, x) − L∗(t), it finds a point x+ ∈ X such that L∗(t) ≥ αL(t, x+).
Given a g-strictly feasible solution e and α > 1, Algorithm 1 finds an ε-relative optimal and feasible solution by calling an oracle A with input tuple (tk, xk, α, γ̄k) at each iteration k. This oracle provides a lower bound, αL(tk, xk+1) ≤ 0, and an upper bound, L(tk, xk+1) ≤ 0, for L∗(tk). We use the upper bound to update tk and the lower bound in the stopping criterion. In particular, Algorithm 1 terminates


Algorithm 1 Feasible Level-Set Method
1: Input: g-strictly feasible solution e, and parameters α > 1 and 0 < ε ≤ 1.
2: Initialization: Set x0 = e and t0 = f(x0).
3: for k = 0, 1, 2, . . . do
4:   Choose γ̄k such that γ̄k ≥ L(tk, xk) − L∗(tk).
5:   Generate xk+1 = A(tk, xk, α, γ̄k) such that L∗(tk) ≥ αL(tk, xk+1).
6:   if αL(tk, xk+1) ≥ εL(t0, x1) then
7:     Terminate and return xk+1.
8:   end if
9:   Set tk+1 = tk + (1/2)L(tk, xk+1).
10: end for
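To make the outer loop concrete, the following Python sketch mirrors the structure of Algorithm 1. It is an illustration only: the names oracle, L_fun, f, and gamma_bar_fn are placeholders for a user-supplied first-order oracle A, the function L(t, x), the objective f, and a rule for choosing γ̄k (e.g., as in Table 1); they are not part of the paper.

```python
def feasible_level_set(f, L_fun, oracle, gamma_bar_fn, e, alpha, eps, max_outer=1000):
    """Illustrative sketch of Algorithm 1 (feasible level-set method).

    f(x)            -- objective value
    L_fun(t, x)     -- max{f(x) - t; g_i(x), i = 1..m}
    oracle(t, x, alpha, gamma_bar) -- returns x_plus with L*(t) >= alpha * L(t, x_plus)
    gamma_bar_fn(k, t, x) -- an upper bound on L(t_k, x_k) - L*(t_k)
    e               -- a g-strictly feasible starting point
    """
    x, t = e, f(e)                       # x_0 = e and t_0 = f(x_0)
    threshold = None                     # will hold eps * L(t_0, x_1)
    for k in range(max_outer):
        gamma_bar = gamma_bar_fn(k, t, x)
        x_next = oracle(t, x, alpha, gamma_bar)
        val = L_fun(t, x_next)           # val <= 0, so x_next stays feasible
        if threshold is None:
            threshold = eps * val        # eps * L(t_0, x_1)
        if alpha * val >= threshold:     # stopping criterion of Algorithm 1
            return x_next
        t = t + 0.5 * val                # t_{k+1} = t_k + L(t_k, x_{k+1}) / 2
        x = x_next
    return x
```

Any routine satisfying Definition 1 can be plugged in as the oracle; §3 develops two such choices.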

[Two panels plot L∗(t) against t, with β marked on the vertical axis: (a) well-conditioned case; (b) poorly conditioned case.]

Fig. 1: Illustration of the condition measure β for t∗ = 0 and t0 = 1.

when αL(tk, xk+1) is greater than or equal to εL(t0, x1). We discuss possible choices for oracle A in §3 and its input γ̄ in Table 1. Theorem 2 shows that Algorithm 1 is a feasible level-set method, that is, the xk generated at each iteration k is feasible, and that it terminates with an ε-relative optimal solution. It also establishes its outer iteration complexity. This complexity depends on a condition measure

(4)    β := −L∗(t0)/(t0 − t∗),

for problem (1)-(2). The suboptimality of x0 = e implies t0 = f(x0) > t∗. By Lemma 1 we can show that β ∈ (0, 1]. Larger values of β signify a better conditioned problem, as illustrated in Figure 1, because we iteratively use a close approximation of L∗(t) to approach t∗ from t0. Therefore, a larger L∗(t) in absolute value intuitively suggests faster movement towards t∗. In addition, the following inequalities can be easily verified from the convexity of L∗(t) and Lemma 1(c):

(5)    −β = L∗(t0)/(t0 − t∗) ≥ L∗(t)/(t − t∗) ≥ −1,    ∀t ∈ (t∗, t0].


Theorem 2. Given α > 1, 0 < ε ≤ 1, and a g-strictly feasible solution e, Algorithm 1 maintains a feasible solution at each iteration and ensures

(6)    tk+1 − t∗ ≤ (1 − β/(2α))(tk − t∗).

In addition, it returns an ε-relative optimal solution to (1)-(2) in at most

K := ⌈(2α/β) ln(α²/(βε))⌉ + 1

outer iterations.
Proof. Notice that t0 > t∗ since t0 = f(e). Suppose t∗ ≤ tk ≤ t0 at iteration k. At iteration k + 1 we have

tk+1 − t∗ = tk − t∗ + (1/2)L(tk, xk+1)
          ≥ tk − t∗ + (1/2)L∗(tk)
          ≥ tk − t∗ − (1/2)(tk − t∗)
          ≥ 0,

where the first equality is obtained by substituting for tk+1 using the update equation for tk+1 in Algorithm 1, the first inequality using the definition of L∗(tk), the second inequality from (5) with t = tk, and the final inequality from tk − t∗ ≥ 0. This implies that tk ≥ t∗ for all k. In addition, the condition L∗(tk) ≥ αL(tk, xk+1) satisfied by the oracle A, α > 1, and Lemma 1(b) imply that L(tk, xk+1) ≤ L∗(tk)/α ≤ 0. It thus follows that tk+1 ≤ tk and the xk+1 maintained by Algorithm 1 is feasible for all k.
Next we establish (6) and the outer iteration complexity of Algorithm 1. Using the definition of tk+1, α > 1, and the condition L∗(tk) ≥ αL(tk, xk+1) satisfied by oracle A, we write

(7)    tk+1 − t∗ = tk − t∗ + (1/2)L(tk, xk+1) ≤ tk − t∗ + L∗(tk)/(2α).

Using (5) with t = tk to substitute for L∗(tk) in (7) gives the required inequality (6). Recursively using (6) starting from k = 0 gives

(8)    tk − t∗ ≤ (1 − β/(2α))^k (t0 − t∗).

To obtain the terminal condition, we note that, for k ≥ K − 1,

−αL(tk, xk+1) ≤ −αL∗(tk)
              ≤ α(tk − t∗)
              ≤ α(1 − β/(2α))^{K−1}(t0 − t∗)
              ≤ −(ε/α)L∗(t0)
              ≤ −εL(t0, x1),


where the first inequality holds by the definition of L∗(t); the second using (5) for t = tk; the third using (8); and the fourth by applying the definition of K, the inequality (1 − x)^a ≤ e^{−ax}, which holds for any x ∈ [0, 1] and a ≥ 0, and equation (4). The last inequality follows from the condition satisfied by the oracle at t0 and x1, that is, αL(t0, x1) ≤ L∗(t0). Hence, Algorithm 1 terminates after K outer iterations.
Finally we proceed to show that the terminal solution is ε-relative optimal. Observe that

(9)    f(xk+1) − f∗ = f(xk+1) − t∗ ≤ tk − t∗,

where the equality is a consequence of the definition of t∗ and the inequality is obtained using f(xk+1) ≤ tk, which holds because f(xk+1) − tk ≤ L(tk, xk+1) ≤ 0. Combining tk − t∗ ≤ (t0 − t∗)L∗(tk)/L∗(t0) from (5) and (9), we obtain f(xk+1) − f∗ ≤ (t0 − t∗)L∗(tk)/L∗(t0). Rearranging terms results in the inequality

(f(xk+1) − f∗)/(f(e) − f∗) ≤ L∗(tk)/L∗(t0),

where we use t∗ = f∗ and t0 = f(e). Indeed, L∗(tk)/L∗(t0) ≤ ε when the algorithm terminates. To verify this, the terminal condition and the property satisfied by oracle A indicate that L∗(tk) ≥ αL(tk, xk+1) ≥ εL(t0, x1) ≥ εL∗(t0), which further implies

L∗(tk)/L∗(t0) ≤ ε.

3. First-order oracles. In this section, we adapt first-order methods to be used as the oracle A in Algorithm 1 to solve (3) under different properties of the convex functions f and gi, i = 1, . . . , m. We adapt these methods in a manner that facilitates our adaptive complexity analysis in §4.
We assume that the functions f and gi, i = 1, . . . , m, are µ-convex for some µ ≥ 0. Namely, for h = f, g1, . . . , gm, we have

(10)    h(x) ≥ h(y) + ⟨ζh, x − y⟩ + (µ/2)‖x − y‖₂²

for any x and y in X and ζh ∈ ∂h(y), where ∂h(y) denotes the set of all sub-gradients of h at y ∈ X and ‖·‖₂ denotes the 2-norm. We call the functions f and gi, i = 1, . . . , m, convex if µ = 0 and strongly convex if µ > 0. Let

D := max_{x∈X} ‖x‖₂  and  M := max_{x∈X} max{‖ζf‖₂; ‖ζ_{gi}‖₂, i = 1, . . . , m},

where ζf ∈ ∂f(x) and ζ_{gi} ∈ ∂gi(x), i = 1, . . . , m. When f and gi, i = 1, . . . , m, are smooth, M := max_{x∈X} max{‖∇f(x)‖₂; ‖∇gi(x)‖₂, i = 1, . . . , m}. We make Assumption 2 in the rest of the paper.
Assumption 2. M < +∞ and either (i) µ > 0 or (ii) µ = 0 and D < +∞.
Assuming M to be finite is a common assumption in the literature when developing first-order methods for non-smooth problems (see, e.g., [8, 13, 27, 31, 34]). When µ > 0, the stopping criteria for our oracles are not contingent on D being finite. In contrast, when µ = 0, we do require the assumption of finite D to ensure the termination of the oracles. Notice that D < +∞ if and only if X is compact.


3.1. Smooth convex functions. In this subsection, in addition to Assumptions 1 and 2, we also assume that the functions in the objective and constraints of (1)-(2) are smooth, as formally stated below.

Assumption 3. Functions f and gi, i = 1, . . . , m, are S-smooth for some S ≥ 0, namely, for h = f, g1, . . . , gm, h(x) ≤ h(y) + ⟨∇h(y), x − y⟩ + (S/2)‖x − y‖₂² for any x and y in X.
A first-order method achieves a fast linear convergence rate when the objective function is both smooth and strongly convex [29]. Despite this smoothness assumption, since the function L(t, x) is a maximum of functions modeling the objective and constraints, it may be non-smooth in x. Nevertheless, we can approximate L(t, x) by a smooth and strongly convex function of x using a smoothing method [8, 30] and a standard regularization technique. Applying first-order methods to minimize this smooth approximation of L(t, x) leads to a lower iteration complexity than directly working with the original non-smooth function. The technique we utilize is a modified version of the smoothing and regularization reduction method by Allen-Zhu and Hazan [1]. However, since the formulation and the non-smooth structure of (3) are quite different from the problems considered in [1], we present below the reduction technique and conduct the complexity analysis under our framework. The exponentially smoothed and regularized version of L(t, x) is

(11)    L_{σ,λ}(t, x) := (1/σ) log( exp(σ(f(x) − t)) + Σ_{i=1}^{m} exp(σgi(x)) ) + (λ/2)‖x‖₂²,

where σ > 0 and λ ≥ 0 are smoothing and regularization parameters, respectively. Lemma 2 is based on related arguments in [8] and summarizes properties of L_{σ,λ}(t, x) and its “closeness” to L(t, x).
Lemma 2. Suppose that Assumptions 2 and 3 hold. Given σ > 0 and λ ≥ 0, the function L_{σ,λ}(t, x) is (i) (σM² + S)-smooth and (ii) (λ + µ)-convex. Moreover, we have

(12)    0 ≤ L_{σ,λ}(t, x) − L(t, x) ≤ log(m + 1)/σ + λD²/2.

Here, we follow the convention that 0 · ∞ = 0 so that λD² = 0 if λ = 0 and D = +∞.
Proof. Item (i) and the relationship (12) are respectively minor modifications of Proposition 4.1 and Example 4.9 in [8]. Item (ii) directly follows from the µ-convexity of f, gi, i = 1, . . . , m, and the λ-convexity of the function (λ/2)‖x‖₂².

We solve the following version of (3) with L(t, x) replaced by L_{σ,λ}(t, x):

min_{x∈X} L_{σ,λ}(t, x).
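To illustrate (11), the Python sketch below evaluates L_{σ,λ}(t, x) and its gradient, assuming callables f, grad_f, g_list, and grad_g_list for the objective and constraints are supplied by the user; the use of NumPy/SciPy and these names are illustrative choices, not part of the method.

```python
import numpy as np
from scipy.special import logsumexp

def smoothed_L(t, x, f, g_list, sigma, lam):
    """Value of L_{sigma,lambda}(t, x) in (11): log-sum-exp smoothing of
    max{f(x) - t; g_i(x)} plus the regularizer (lambda/2) * ||x||_2^2."""
    vals = np.array([f(x) - t] + [g(x) for g in g_list])
    return logsumexp(sigma * vals) / sigma + 0.5 * lam * np.dot(x, x)

def smoothed_L_grad(t, x, f, g_list, grad_f, grad_g_list, sigma, lam):
    """Gradient of L_{sigma,lambda}(t, .) at x: a softmax-weighted combination
    of grad f and grad g_i, plus the regularization term lambda * x."""
    vals = np.array([f(x) - t] + [g(x) for g in g_list])
    w = np.exp(sigma * (vals - vals.max()))   # numerically stable softmax weights
    w /= w.sum()
    grads = [grad_f(x)] + [gg(x) for gg in grad_g_list]
    return sum(wi * gi for wi, gi in zip(w, grads)) + lam * x
```

An accelerated gradient method applied to this smooth approximation only requires these two callables together with a projection onto X.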

Specifically, we can apply different accelerated gradient methods (e.g., [7, 20, 32, 38]) for a given t > t∗ and a sequence of (σ, λ) with σ increasing to infinity and λ decreasing to zero (see Remark 1 for further details). We refer to this gradient method as the restarted smoothing gradient method (RSGM) and summarize its steps in Oracle 1. The choice of (σ, λ) and the number of iterations performed by RSGM depend on whether f and gi, i = 1, . . . , m, are strongly convex (µ > 0) or convex (µ = 0). Moreover, RSGM is restarted after Ni accelerated gradient method iterations


with a smaller precision γi until this precision is below the required threshold of (1 − α)L(t, xi). Theorem 3 establishes that Oracle 1 returns a solution x+ ∈ X such that L∗(t) ≥ αL(t, x+). Hence, RSGM can be used as an oracle A (see Definition 1).

Oracle 1 Restarted Smoothing Gradient Method: RSGM(t, x, α, γ̄)
1: Input: Level t, initial solution x ∈ X, constant α > 1, and initial precision γ̄ ≥ L(t, x) − L∗(t).
2: Set x0 = x and γ0 = γ̄.
3: for i = 0, 1, 2, . . . do
4:   if γi ≤ (1 − α)L(t, xi) then
5:     Terminate and return x+ = xi.
6:   end if
7:   Set
       σi = 4 log(m + 1)/γi and λi = 0, if µ > 0;
       σi = 8 log(m + 1)/γi and λi = γi/(4D²), if µ = 0.
8:   Starting at xi, apply the accelerated gradient method [38, Algorithm 1] with the Euclidean distance function as the Bregman divergence to the optimization problem min_{x∈X} L_{σi,λi}(t, x) for
       Ni = ⌈max{ √(40S/µ), M√(160 log(m + 1)/(µγi)) }⌉, if µ > 0;
       Ni = ⌈max{ √(160SD²/γi), 4MD√(80 log(m + 1))/γi }⌉, if µ = 0,
     iterations and let xi+1 be the solution returned by this method.
9:   Set γi+1 = γi/2.
10: end for
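The restart structure of Oracle 1 can be sketched in Python as follows for the strongly convex case (µ > 0). The routine agd below is a stand-in for the accelerated gradient method of [38, Algorithm 1] applied to min_{x∈X} L_{σi,λi}(t, x); all names are illustrative and the snippet is a sketch under these assumptions, not the paper's implementation.

```python
import math

def rsgm_strongly_convex(t, x, alpha, gamma_bar, L_fun, agd, m, M, S, mu):
    """Sketch of Oracle 1 (RSGM) when mu > 0.

    L_fun(t, x) -- the level-set function max{f(x) - t; g_i(x)}
    agd(t, x0, sigma, lam, n_iters) -- placeholder accelerated gradient solver
    """
    gamma = gamma_bar
    while gamma > (1.0 - alpha) * L_fun(t, x):       # termination test of Oracle 1
        sigma = 4.0 * math.log(m + 1) / gamma        # smoothing parameter (mu > 0 case)
        lam = 0.0                                    # no extra regularization needed
        n_iters = math.ceil(max(
            math.sqrt(40.0 * S / mu),
            M * math.sqrt(160.0 * math.log(m + 1) / (mu * gamma))))
        x = agd(t, x, sigma, lam, n_iters)           # restart from the current iterate
        gamma = gamma / 2.0                          # halve the target precision
    return x
```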

Theorem 3. Suppose Assumptions 1, 2, and 3 hold. Given a level t ∈ (t∗, t0], a solution x ∈ X, and parameters α > 1 and γ̄ ≥ L(t, x) − L∗(t), Oracle 1 returns a solution x+ ∈ X that satisfies L∗(t) ≥ αL(t, x+) in at most

(√(40S/µ) + 1) log₂( αγ̄/((1 − α)L∗(t)) + 1 ) + 3M √( 160α log(m + 1)/(µ(1 − α)L∗(t)) )

iterations of the accelerated gradient method when µ > 0 and

12D √( 10Sα/((1 − α)L∗(t)) ) + 4MDα √(80 log(m + 1))/((1 − α)L∗(t)) + log₂( αγ̄/((1 − α)L∗(t)) + 1 )

such iterations when µ = 0 and D < +∞.


Proof. Let x∗_{σ,λ}(t) := arg min_{x∈X} L_{σ,λ}(t, x) and L∗_{σ,λ}(t) := L_{σ,λ}(t, x∗_{σ,λ}(t)). We will first show by induction that at each iteration i = 0, 1, 2, . . . of Oracle 1 the following inequality holds:

(13)    L(t, xi) − L∗(t) ≤ γi.

The validity of (13) for i = 0 holds because x0 = x, γ0 = γ̄, and we know that L(t, x) − L∗(t) ≤ γ̄. Suppose (13) holds for iteration i. We claim that it also holds for iteration i + 1 in both the strongly convex and convex cases. In fact, at iteration i + 1, we have

(14)
L(t, xi+1) − L∗(t) ≤ L_{σi,λi}(t, xi+1) − L∗_{σi,λi}(t) + log(m + 1)/σi + λiD²/2
                   ≤ 2(σiM² + S)‖xi − x∗_{σi,λi}(t)‖₂²/Ni² + log(m + 1)/σi + λiD²/2
                   ≤ 4(σiM² + S)(L_{σi,λi}(t, xi) − L∗_{σi,λi}(t))/((µ + λi)Ni²) + log(m + 1)/σi + λiD²/2
                   ≤ 4(σiM² + S)(γi + log(m + 1)/σi + λiD²/2)/((µ + λi)Ni²) + log(m + 1)/σi + λiD²/2
                   = 5γi(σiM² + S)/((µ + λi)Ni²) + γi/4
                   ≤ γi/8 + γi/8 + γi/4
                   = γi/2 = γi+1.

The first inequality follows from Lemma 2; the second inequality from the (σiM² + S)-smoothness of L_{σi,λi}(t, x) and the convergence property, i.e., L_{σi,λi}(t, xi+1) − L∗_{σi,λi}(t) ≤ (2/Ni²)(σiM² + S)‖xi − x∗_{σi,λi}(t)‖₂², of an accelerated gradient method [38, Corollary 1]; the third using the (λi + µ)-convexity of L_{σi,λi}(t, x) in x; and the fourth by combining Lemma 2 and the induction hypothesis that L(t, xi) − L∗(t) ≤ γi. The rest of the relationships follow from the inequalities 5γiσiM²/((µ + λi)Ni²) ≤ γi/8 and 5γiS/((µ + λi)Ni²) ≤ γi/8, which hold for both µ > 0 and µ = 0 according to the definitions of σi, λi, and Ni.
It can be easily verified that Oracle 1 terminates in at most I iterations, where I := log₂( αγ̄/((1 − α)L∗(t)) + 1 ) ≥ 0. In fact, according to (13), for any i ≥ I, we have L(t, xi) − L∗(t) ≤ γi = γ0/2^i ≤ γ̄/2^I ≤ ((1 − α)/α)L∗(t), which implies that αL(t, xi) ≤ L∗(t). Hence γi ≤ ((1 − α)/α)(αL(t, xi)) = (1 − α)L(t, xi) and Oracle 1 terminates.
We next analyze the total number of iterations across all calls of the accelerated gradient method [38, Algorithm 1] during the execution of Oracle 1.
Suppose µ > 0. The number of iterations taken by Oracle 1 can be bounded as follows:

Σ_{i=0}^{I−1} Ni ≤ Σ_{i=0}^{I−1} √(40S/µ) + M Σ_{i=0}^{I−1} √(160 log(m + 1)/(µγi)) + Σ_{i=0}^{I−1} 1
               = (√(40S/µ) + 1) I + M √(160 log(m + 1)/µ) Σ_{i=0}^{I−1} (1/γi)^{1/2}
               = (√(40S/µ) + 1) I + M √(160 log(m + 1)/(µγ0)) Σ_{i=0}^{I−1} 2^{i/2}
               = (√(40S/µ) + 1) I + M √(160 log(m + 1)/(µγ̄)) (1 − 2^{I/2})/(1 − √2)
               ≤ (√(40S/µ) + 1) log₂( αγ̄/((1 − α)L∗(t)) + 1 ) + 3M √( 160α log(m + 1)/(µ(1 − α)L∗(t)) ).

The first inequality above is obtained using the inequality ⌈a⌉ ≤ a + 1 for a ∈ R₊, and by bounding the maximum determining Ni by a sum of its terms; the first equality from simplifying terms; the second equality from using the definition of γi; the third equality from γ0 = γ̄ and the geometric sum formula Σ_{i=0}^{I−1} 2^{i/2} = (1 − 2^{I/2})/(1 − √2); and the last inequality by substituting for I, using the relationship √(a + 1) ≤ √a + 1 for a ∈ R₊ to obtain the inequality

(15)    2^{I/2} − 1 = ( αγ̄/((1 − α)L∗(t)) + 1 )^{1/2} − 1 ≤ ( αγ̄/((1 − α)L∗(t)) )^{1/2},

and bounding 1/(√2 − 1) above by 3.
Suppose µ = 0. The iteration complexity can be bounded using the following steps:

Σ_{i=0}^{I−1} Ni ≤ 4D Σ_{i=0}^{I−1} √(10S/γi) + 4MD Σ_{i=0}^{I−1} √(80 log(m + 1))/γi + Σ_{i=0}^{I−1} 1
               = 4D √(10S/γ0) Σ_{i=0}^{I−1} 2^{i/2} + 4MD (√(80 log(m + 1))/γ0) Σ_{i=0}^{I−1} 2^i + I
               = 4D √(10S/γ0) (1 − 2^{I/2})/(1 − √2) + 4MD (√(80 log(m + 1))/γ0) (1 − 2^I)/(1 − 2) + I
               ≤ 12D √( 10Sα/((1 − α)L∗(t)) ) + 4αMD √(80 log(m + 1))/((1 − α)L∗(t)) + log₂( αγ̄/((1 − α)L∗(t)) + 1 ).

The first inequality follows from replacing the maximum in the definition of Ni by the sum of its terms and using ⌈a⌉ ≤ a + 1 for a ∈ R₊, the first and second equalities by rearranging terms and using the geometric sum formula, respectively, and the last inequality by substituting for I and using the inequality (15).

Remark 1. Besides [38, Algorithm 1], other accelerated gradient methods with the Euclidean distance function as the Bregman divergence can be used in Line 8 of Oracle 1 as long as they ensure that the output xi+1 after Ni iterations started at xi satisfies L_{σi,λi}(t, xi+1) − L∗_{σi,λi}(t) ≤ C(σiM² + S)‖xi − x∗_{σi,λi}(t)‖₂²/Ni² for a constant C. Here, x∗_{σ,λ}(t) and L∗_{σ,λ}(t) are defined in the proof of Theorem 3. This property is needed to establish inequality (14). The methods with this property include [38, Algorithm 2] and the ones in [7, 20, 32]. Depending on C, the choice of Ni in Oracle 1 and the iteration complexity in Theorem 3 may differ by some constant factor when using different methods.


3.2. Non-smooth functions. In this subsection, we only enforce Assumptions 1 and 2. In other words, the functions f and gi, i = 1, . . . , m, can be non-smooth. Under Assumption 2, we have max_{x∈X} ‖ζ_{L,t}(x)‖₂ ≤ M for any ζ_{L,t}(x) ∈ ∂_x L(t, x) and t ∈ R, where ∂_x L(t, x) denotes the set of all sub-gradients of L(t, ·) at x with respect to x.
As oracle A, we choose the standard sub-gradient method with varying stepsizes from [28] and its variant in [19]¹, respectively, when µ = 0 and µ > 0 to directly solve (3). We present this sub-gradient method for both choices of µ in Oracle 2. The iteration complexity of Oracle 2 is well known and given in Proposition 1 without proof. The proof for µ = 0 can be found in [28, §2.2] and the one for µ > 0 is given in [19, Section 3.2]. Here we only provide arguments to show that Oracle 2 will return a solution with the desired optimality gap required by Algorithm 1, which then qualifies this algorithm to be used as an oracle A (see Definition 1).

Oracle 2 Sub-gradient Method: SGD(t, x, α)
1: Input: Level t, initial solution x ∈ X, and constant α > 1.
2: Set x0 = x.
3: for i = 0, 1, 2, . . . do
4:   Set
       ηi = 2/((i + 2)µ) and Ei := M²/((i + 2)µ), if µ > 0;
       ηi = √2 D/(√(i + 1) M) and Ei := 2√2 DM/√(i + 1), if µ = 0.
5:   Set
       x̄i = (2/((i + 1)(i + 2))) Σ_{j=0}^{i} (j + 1)xj, if µ > 0;
       x̄i = (1/(i + 1)) Σ_{j=0}^{i} xj, if µ = 0.
6:   Let xi+1 = arg min_{x∈X} (1/2)‖x − xi + ηi ζ_{L,t}(xi)‖₂².
7:   if Ei ≤ (1 − α)L(t, x̄i) then
8:     Terminate and return x+ = x̄i.
9:   end if
10: end for
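For the convex case (µ = 0), Oracle 2 can be sketched as a projected sub-gradient method with simple averaging. The callables subgrad_L (a sub-gradient of L(t, ·)) and proj_X (Euclidean projection onto X) are assumed to be supplied by the user, and the snippet is an illustration of the scheme rather than the exact implementation used in the paper.

```python
import math
import numpy as np

def sgd_oracle_convex(t, x, alpha, L_fun, subgrad_L, proj_X, M, D, max_iter=1000000):
    """Sketch of Oracle 2 when mu = 0 and D < +infinity.

    Stops once the gap bound E_i falls below (1 - alpha) * L(t, x_bar_i),
    which certifies L*(t) >= alpha * L(t, x_bar_i)."""
    x = np.asarray(x, dtype=float)
    x_sum = x.copy()                                         # running sum of x_0, ..., x_i
    for i in range(max_iter):
        x_bar = x_sum / (i + 1)                              # plain average x_bar_i
        E = 2.0 * math.sqrt(2.0) * D * M / math.sqrt(i + 1)  # optimality-gap bound E_i
        if E <= (1.0 - alpha) * L_fun(t, x_bar):
            return x_bar
        eta = math.sqrt(2.0) * D / (math.sqrt(i + 1) * M)    # step size eta_i
        x = proj_X(x - eta * subgrad_L(t, x))                # projected sub-gradient step
        x_sum = x_sum + x
    return x_bar
```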

Proposition 1 ([19, 28]). Given a level t ∈ (t∗, t0], a solution x ∈ X, and a parameter α > 1, Oracle 2 returns a solution x+ ∈ X that satisfies L∗(t) ≥ αL(t, x+) in at most

(16)    αM²/(µ(1 − α)L∗(t))

iterations of the sub-gradient method when µ > 0 and at most

(17)    8α²M²D²/((1 − α)L∗(t))²

¹The original algorithms proposed in [19] and [28] are stochastic sub-gradient methods. Here, we apply these methods with deterministic sub-gradients.


such iterations when µ = 0 and D < +∞.
Proof. We first consider the case when µ = 0 and D < +∞. Oracle 2 is the sub-gradient method in [28] with the distance-generating function (1/2)‖·‖₂². From equation (2.24) in [28], we have

L(t, x̄i) − L∗(t) ≤ 2√2 DM/√(i + 1) = Ei.

Hence, when i ≥ 8α²M²D²/((α − 1)L∗(t))² − 1, we have L(t, x̄i) − L∗(t) ≤ Ei ≤ ((1 − α)/α)L∗(t), which implies L∗(t) ≥ αL(t, x̄i). Hence, Ei ≤ ((1 − α)/α)L∗(t) ≤ (1 − α)L(t, x̄i), which indicates that Oracle 2 terminates in no more than 8α²M²D²/((α − 1)L∗(t))² iterations.
Now, suppose µ > 0. Oracle 2, in this case, is the sub-gradient method in [19]. By Section 3.2 in [19], we have

L(t, x̄i) − L∗(t) ≤ M²/((i + 2)µ).

Therefore, when i ≥ αM²/(µ(1 − α)L∗(t)) − 2, we have L(t, x̄i) − L∗(t) ≤ Ei ≤ ((1 − α)/α)L∗(t), which implies L∗(t) ≥ αL(t, x̄i). In addition, the algorithm terminates after at most this many iterations because Ei ≤ ((1 − α)/α)(αL(t, x̄i)) ≤ (1 − α)L(t, x̄i).
This proposition indicates that the sub-gradient method in [28] and its variant in [19] qualify as oracles A in Algorithm 1 when µ = 0 and µ > 0, respectively (see Definition 1). Since the iteration complexities (16) and (17) do not depend on the initial solution x, the worst-case complexity does not restrict this choice of x ∈ X. In addition, γ̄ is not needed for executing Oracle 2. Therefore, we omit it as an input to this algorithm even though the generic call to the oracle A in Algorithm 1 includes γ̄.

4. Overall iteration complexity. We present an adaptive complexity analysis to establish the overall complexity of Algorithm 1 when using the oracles A discussed in §3. We present two lemmas that are needed to show our main complexity result in Theorem 4.
Lemma 3. Suppose Algorithm 1 terminates at iteration K. Then for any 1 ≤ k ≤ K, we have

(18)    L∗(tk) < εL∗(t0)/(2α²).

Proof. We first use the relationship tk = tk−1 + (1/2)L(tk−1, xk) to show L∗(tk) ≤ (1/2)L∗(tk−1) for any k ≥ 1. Defining x∗_k := arg min_{x∈X} L(tk, x), we proceed as follows:

L∗(tk) ≤ max{f(x∗_{k−1}) − tk−1 − (1/2)L(tk−1, xk); gi(x∗_{k−1}), i = 1, . . . , m}
       = max{f(x∗_{k−1}) − tk−1; gi(x∗_{k−1}) + (1/2)L(tk−1, xk), i = 1, . . . , m} − (1/2)L(tk−1, xk)
       ≤ L∗(tk−1) − (1/2)L(tk−1, xk)
(19)   ≤ (1/2)L∗(tk−1),


where the second inequality holds since L(tk−1, xk) ≤ (1/α)L∗(tk−1) ≤ 0 and the third inequality follows from the inequality L∗(tk−1) ≤ L(tk−1, xk). Since Algorithm 1 has not terminated at iteration k − 1 for 1 ≤ k ≤ K, we have

(20)    εL∗(t0) ≥ εαL(t0, x1) > α²L(tk−1, xk) ≥ α²L∗(tk−1),

where the first inequality holds by the condition satisfied by oracle A, the second by the stopping criterion of Algorithm 1 not being satisfied, and the third by the definition of L∗(tk−1). The inequality (18) then follows by combining (19) and (20).
Lemma 4 and Theorem 4 require the condition measure β, which was previously defined in (4).
Lemma 4. Suppose Algorithm 1 terminates at iteration K. The following hold:
1. Σ_{k=0}^{K} 1/(−L∗(tk)) ≤ 4α³/(−β²εL∗(t0));
2. Σ_{k=0}^{K} 1/√(−L∗(tk)) ≤ 4√2 α²/(β^{1.5}√(−εL∗(t0)));

3. Σ_{k=0}^{K} 1/(L∗(tk))² ≤ 16α⁵/(3β³(εL∗(t0))²).

Proof. We only prove Item 1. Items 2 and 3 can be derived following analogous steps. We have

Σ_{k=0}^{K} 1/(−L∗(tk)) ≤ Σ_{k=0}^{K} 1/(β(tk − t∗))
                        ≤ Σ_{k=0}^{K} (1 − β/(2α))^{K−k}/(β(tK − t∗))
                        ≤ (1/(−βL∗(tK))) Σ_{k=0}^{K} (1 − β/(2α))^{K−k}
                        ≤ (2α²/(−βεL∗(t0))) Σ_{k=0}^{K} (1 − β/(2α))^{K−k}
                        = (2α²/(−βεL∗(t0))) [1 − (1 − β/(2α))^{K+1}]/(β/(2α))
                        ≤ 4α³/(−β²εL∗(t0)),

where the first inequality follows by (5); the second by (6); the third by (5); the fourth using (18) with k = K; the first equality by the geometric sum; and the fifth inequality by dropping negative terms.

We summarize the choice of the oracle A and γ̄k in each iteration of Algorithm 1 in Table 1 (Asms abbreviates Assumptions) and characterize the overall iteration complexity (measured by the total number of gradient iterations) of this algorithm in Theorem 4. The dependence on ε in our big-O complexity for solving the constrained convex optimization problem (1)-(2) and the known analogue of this complexity for


Table 1: Choices of the oracle A and γ̄k in Algorithm 1

Case | Oracle | Choice of γ̄k
Smooth (Asms 1, 2, & 3), strongly convex (µ > 0) | 1 | any γ̄0 ≥ −L∗(t0) if k = 0; −αL(tk−1, xk) if k ≥ 1
Smooth (Asms 1, 2, & 3), convex (µ = 0 & D < +∞) | 1 | 2MD
Non-smooth (Asms 1 & 2), strongly convex (µ > 0) | 2 | N/A
Non-smooth (Asms 1 & 2), convex (µ = 0 & D < +∞) | 2 | N/A

the unconstrained version of this problem are interestingly the same. However, unlike the unconstrained case, the complexity terms in Theorem 4 depend on the condition measure β.

Theorem 4. Suppose the oracle A and γ̄k in Algorithm 1 are chosen according to Table 1. Given 0 < ε ≤ 1, α > 1, and a g-strictly feasible solution e, Algorithm 1 returns an ε-relative optimal and feasible solution to (1)-(2) in at most

O( M/(β^{1.5}√(−µεL∗(t0))) + √(S/µ) (1/β) ln(1/(βε)) log(1/(−εL∗(t0))) ),

O( MD/(−β²εL∗(t0)) + D√S/(β^{1.5}√(−εL∗(t0))) ),  O( M²/(−µβ²εL∗(t0)) ),  and  O( M²D²/(β³(εL∗(t0))²) )

gradient iterations, respectively, when the functions f and gi, i = 1, . . . , m, are smooth strongly convex, smooth convex, non-smooth strongly convex, and non-smooth convex. Notice that we have only presented the terms ε, β, µ, S, D, and L∗(t0) in our big-O complexity. The exact numbers of iterations are given explicitly in the proof.

Remark 2. Based on Table 1, when the functions f, gi, i = 1, . . . , m, are smooth and µ > 0, the input γ̄ used in the first call of Oracle 1 should be bounded below by −L∗(t0). We do not specify the choice of such a γ̄0 in our complexity bound (see (25)). However, one simple choice of γ̄0, in this case, could be (1/(2µ))‖∇f(e)‖² since

−L∗(t0) ≤ − min_{x∈X} max{ ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖², gi(x), i = 1, . . . , m }
         ≤ − min_{x∈X} { ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖² }
         ≤ (1/(2µ))‖∇f(e)‖²,

where the first inequality follows from (10) and the third holds since x = e − (1/µ)∇f(e) solves min_{x∈R^n} { ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖² }.
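In code, the choice of γ̄0 suggested in Remark 2 is a one-liner; the callable grad_f and the constant mu are assumed to be available, and the snippet is only an illustration of this remark.

```python
import numpy as np

def gamma_bar_0(grad_f, e, mu):
    """Initial precision from Remark 2 when f is smooth and mu-strongly convex:
    gamma_bar_0 = ||grad f(e)||_2^2 / (2 mu) >= -L*(t_0)."""
    g = np.asarray(grad_f(e), dtype=float)
    return float(g @ g) / (2.0 * mu)
```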

Proof. [Smooth strongly convex case] We first show that we can guarantee γ̄k ≥ L(tk, xk) − L∗(tk) by choosing γ̄0 ≥ −L∗(t0) and γ̄k = −αL(tk−1, xk) for k ≥ 1. Hence γ̄k can be used as an input to Oracle 1.
When k = 0, we have L(t0, x0) = 0 and hence L(t0, x0) − L∗(t0) = −L∗(t0) ≤ γ̄0.


Take k ≥ 1. We show that

L(tk, xk) = max{f(xk) − tk−1 − (1/2)L(tk−1, xk); gi(xk), i = 1, . . . , m}
          = max{f(xk) − tk−1; gi(xk) + (1/2)L(tk−1, xk), i = 1, . . . , m} − (1/2)L(tk−1, xk)
          ≤ L(tk−1, xk) − (1/2)L(tk−1, xk)
(21)      ≤ 0,

where the first and the second inequalities hold because L(tk−1, xk) ≤ (1/α)L∗(tk−1) ≤ 0. Furthermore, since L∗(t) is a decreasing function in t, we have L∗(tk−1) ≤ L∗(tk) since tk ≤ tk−1. Combining this with (21) yields L(tk, xk) − L∗(tk) ≤ −L∗(tk−1) ≤ −αL(tk−1, xk) = γ̄k.
We next characterize the iteration complexity of oracle A in each iteration of Algorithm 1 using Theorem 3. By virtue of this theorem, the total number of gradient iterations performed by Algorithm 1 is at most

(22)
Σ_{k=0}^{K} [ (√(40S/µ) + 1) log₂( αγ̄k/((1 − α)L∗(tk)) + 1 ) + 3M √( 160α log(m + 1)/(µ(1 − α)L∗(tk)) ) ]
≤ (√(40S/µ) + 1) [ log₂( αγ̄0/((1 − α)L∗(t0)) + 1 ) + Σ_{k=1}^{K} log₂( (α − 2α²)L(tk−1, xk)/((1 − α)L∗(tk)) ) ]
  + 3M √( 160α log(m + 1)/(µ(α − 1)) ) Σ_{k=0}^{K} 1/√(−L∗(tk)),

where we have replaced γ̄k by −αL(tk−1, xk) for k > 0, and used the inequality (1 − α)L∗(tk) ≤ (1 − α)L∗(tk−1) ≤ α(1 − α)L(tk−1, xk). The first sum in (22) can be bounded as follows:

Σ_{k=1}^{K} log₂( (α − 2α²)L(tk−1, xk)/((1 − α)L∗(tk)) )
= log₂( Π_{k=1}^{K} (α − 2α²)L(tk−1, xk)/((1 − α)L∗(tk)) )
= log₂( ((2α² − α)/(α − 1))^K · (L(t0, x1)/L∗(tK)) · Π_{k=1}^{K−1} L(tk, xk+1)/L∗(tk) )
≤ log₂( ((2α² − α)/(α − 1))^K · L(t0, x1)/L∗(tK) )
= K log₂( (2α² − α)/(α − 1) ) + log₂( L(t0, x1)/L∗(tK) )
(23)
≤ ( ⌈(2α/β) ln(α²/(βε))⌉ + 1 ) log₂( (2α² − α)/(α − 1) ) + log₂( 2α²/ε ),

where the first inequality holds because L(tk, xk+1)/L∗(tk) ≤ 1 for all k = 0, 1, . . . , K, and the second inequality follows from setting K equal to ⌈(2α/β) ln(α²/(βε))⌉ + 1, using (18) for k = K, and the inequality L∗(t0) ≤ L(t0, x1) ≤ 0.


The second sum in (22) can be upper bounded using Item 2 of Lemma 4. Specifically,

(24)    Σ_{k=0}^{K} 3M √( 160α log(m + 1)/((1 − α)µL∗(tk)) ) ≤ (12Mα²/β^{1.5}) √( 320α log(m + 1)/(µ(1 − α)εL∗(t0)) ).

Adding (23) and (24), we obtain the following upper bound on the total number of iterations (22) required by Algorithm 1:

(25)    (√(40S/µ) + 1) [ ( ⌈(2α/β) ln(α²/(βε))⌉ + 1 ) log₂( (2α² − α)/(α − 1) ) + log₂( 2α²/ε ) + log₂( αγ̄0/((1 − α)L∗(t0)) + 1 ) ]
         + (12Mα²/β^{1.5}) √( 320α log(m + 1)/(µ(1 − α)εL∗(t0)) ).

[Smooth convex case] Notice that γ̄k = 2MD can be used as an input to Oracle 1 at every iteration k ≥ 1 since

L(tk, xk) − L∗(tk) ≤ −⟨ζ_{L,tk}(xk), x∗_k − xk⟩
                   ≤ ‖ζ_{L,tk}(xk)‖₂ ‖x∗_k − xk‖₂

                   ≤ 2MD,

where the first inequality follows from the convexity of L(t, x) in x, the second using the Cauchy-Schwarz inequality, and the last using the definitions of M and D. By Theorems 2 and 3, the total number of iterations required by Algorithm 1 is at most

Σ_{k=0}^{K} [ 12D √( 10αS/((1 − α)L∗(tk)) ) + 4MDα √(80 log(m + 1))/((1 − α)L∗(tk)) + log₂( αγ̄k/((1 − α)L∗(tk)) + 1 ) ].

An upper bound on this quantity is

Σ_{k=0}^{K} [ 12D √( 10αS/((1 − α)L∗(tk)) ) + 4MDα √(80 log(m + 1))/((1 − α)L∗(tk)) + √( 2MDα/((1 − α)L∗(tk)) ) ],

which is obtained by replacing γ̄k with 2MD for all k and using the inequality log(a) ≤ (a − 1)/√a for a > 1. Next, using Lemma 4, we obtain the required upper bound on the total number of iterations taken by Algorithm 1 as

(48Dα²/β^{1.5}) ( 20Sα/((1 − α)εL∗(t0)) )^{1/2} + (16MDα⁴/(β²(1 − α)εL∗(t0))) (80 log(m + 1))^{1/2} + (8α²/β^{1.5}) ( MDα/((1 − α)εL∗(t0)) )^{1/2}.

[Non-smooth strongly convex case] By Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 can be bounded above in a straightforward manner:

Σ_{k=0}^{K} αM²/(µ(1 − α)L∗(tk)) ≤ 4M²α⁴/(µβ²(1 − α)εL∗(t0)).


[Non-smooth convex case] In a similar fashion, by Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 is upper bounded by

Σ_{k=0}^{K} 8α²M²D²/((1 − α)L∗(tk))² ≤ 43M²D²α⁷/(β³((1 − α)εL∗(t0))²).

5. Numerical Experiments. In this section, we evaluate the numerical performance of the proposed feasible level-set method on the Neyman-Pearson classification problem [36, 37] in machine learning. Suppose that we have training data with samples represented in an n-dimensional feature space (i.e., R^n) and each sample having an associated positive (+1) or negative (−1) label. Let the sets {ξ0j}_{j=1}^{n0} and {ξ1j}_{j=1}^{n1}, respectively, define the partition of the training data based on positive and negative labels. Ideally, a binary classification problem attempts to find a classifier x ∈ R^n such that x⊤ξ0j > 1 for j = 1, . . . , n0 and x⊤ξ1j < −1 for j = 1, . . . , n1. Since such a classifier may not exist for a given dataset, the Neyman-Pearson classification problem instead finds an x ∈ R^n that minimizes the Type-II error (identifying a positive instance as negative) over the training set while placing an upper bound on the Type-I error (identifying a negative instance as positive). Errors are measured using a loss function φ(z) := max{1 − z, 0}, where misclassification on the positively and negatively labeled data is evaluated as φ(x⊤ξ0j) and φ(−x⊤ξ1j), respectively.
The optimization formulation of the Neyman-Pearson classification problem is

(26)    min_{‖x‖₂≤κ} f(x) := (1/n0) Σ_{j=1}^{n0} φ(x⊤ξ0j),  s.t.  g(x) := (1/n1) Σ_{j=1}^{n1} φ(−x⊤ξ1j) − r ≤ 0,

where the parameter κ in the constraint ‖x‖₂ ≤ κ controls overfitting and r > 0 specifies a bound on the Type-I error. The constraint g(x) ≤ 0 involving r can be interpreted as imposing the largest allowable Type-I error for a classifier x to be implementable in practice. Note that (26) is an instance of the optimization problem (1)-(2) with m = 1 and X = {x : ‖x‖₂ ≤ κ}. Since the objective and constraint functions in (26) are non-smooth and not strongly convex, we implemented the version of our feasible level-set method that uses Oracle 2 to solve L∗(t) = 0, where

(27)    L∗(t) = min_{‖x‖₂≤κ} max{ (1/n0) Σ_{j=1}^{n0} φ(x⊤ξ0j) − t, (1/n1) Σ_{j=1}^{n1} φ(−x⊤ξ1j) − r }.
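As an illustration of how an oracle would evaluate problem (26), the sketch below computes f, g, and one sub-gradient of each with the hinge loss. The matrices Xi0 and Xi1, whose rows are the positively and negatively labeled samples, and all function names are our assumptions for the sketch.

```python
import numpy as np

def np_values(x, Xi0, Xi1, r):
    """Objective f and constraint g of (26) with phi(z) = max(1 - z, 0)."""
    f_val = np.mean(np.maximum(1.0 - Xi0 @ x, 0.0))      # average hinge loss, positive samples
    g_val = np.mean(np.maximum(1.0 + Xi1 @ x, 0.0)) - r  # average hinge loss, negative samples, minus r
    return f_val, g_val

def np_subgradients(x, Xi0, Xi1):
    """One sub-gradient of f and one of g at x (phi is non-smooth at its kink)."""
    active0 = (Xi0 @ x < 1.0).astype(float)              # samples with 1 - x' xi_0j > 0
    sub_f = -(Xi0 * active0[:, None]).mean(axis=0)
    active1 = (Xi1 @ x > -1.0).astype(float)              # samples with 1 + x' xi_1j > 0
    sub_g = (Xi1 * active1[:, None]).mean(axis=0)
    return sub_f, sub_g
```

A sub-gradient of the max in (27) at level t is then obtained by selecting sub_f when f(x) − t attains the maximum and sub_g otherwise.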

We normalized each data sample to have a unit ℓ2-norm, which makes it possible for us to set M = 1 and D = 5 in this oracle.
As a benchmark, we consider a level-set method that solves L∗(t) = 0 using the inexact Newton approach discussed in [2]. This method executes (outer) level-set iterations to update t and calls at each outer iteration an oracle, which needs to satisfy certain conditions. These conditions are unlike the ones met by the oracles that we develop for the feasible level-set method. To elaborate, the oracle used by the inexact Newton method must return a triple (ℓk, uk, sk) such that (i) 0 < ℓk ≤ L∗(tk) ≤ uk ≤ αℓk for any k and α ∈ (1, 2) and (ii) sk defines a function t ↦ ℓk + sk(t − tk) that globally minorizes L∗(t). Next we define such an oracle as a mirror prox method [26] that solves a min-max reformulation of L∗(t) in (27). Leveraging the relationship max{a, b} = max_{θ∈[0,1]} θ·a + (1 − θ)·b, it is easy to verify that (27) can be restated as

min_{‖x‖₂≤κ} max_{θ∈[0,1]} { θ ( (1/n0) Σ_{j=1}^{n0} φ(x⊤ξ0j) − t ) + (1 − θ) ( (1/n1) Σ_{j=1}^{n1} φ(−x⊤ξ1j) − r ) }.


Next, replacing θ·φ(x⊤ξ0j) and (1 − θ)·φ(−x⊤ξ1j) by max_{w0j∈[0,θ]} w0j·(1 − x⊤ξ0j) and max_{w1j∈[0,1−θ]} w1j·(1 + x⊤ξ1j), respectively, we obtain the desired min-max formulation:

min_{‖x‖₂≤κ} max_{y∈Y} { Φ(x, y; t) := (1/n0) Σ_{j=1}^{n0} w0j (1 − x⊤ξ0j) + (1/n1) Σ_{j=1}^{n1} w1j (1 + x⊤ξ1j) − θt − (1 − θ)r },

where

Y := { (θ, (w0j)_{j=1}^{n0}, (w1j)_{j=1}^{n1}) ∈ R^{1+n0+n1} : 0 ≤ w0j ≤ θ, 0 ≤ w1j ≤ 1 − θ, θ ∈ [0, 1] }.

In the k-th iteration of the inexact Newton method, we apply the mirror prox method to solve this min-max problem with t = tk. The result is a sequence of solutions (xi, yi) ∈ X × Y for i = 0, 1, . . ., with primal-dual gaps max_{y∈Y} Φ(xi, y; tk) − min_{x∈X} Φ(x, yi; tk) that converge to zero as i increases. However, one need not run the mirror prox method until the gap is zero. It suffices to terminate when max_{y∈Y} Φ(xi, y; tk) ≤ α min_{x∈X} Φ(x, yi; tk) and then set uk = max_{y∈Y} Φ(xi, y; tk), ℓk = min_{x∈X} Φ(x, yi; tk), and sk = −θi, where θi is the first coordinate of the solution yi. The triple (ℓk, uk, sk) defined in this manner satisfies the conditions needed by the inexact Newton method. This property is easy to verify.
To compare the feasible level-set method and the inexact Newton method, we created test instances using the binary classification datasets RCV1, covtype, and News20 from the LIBSVM library [12]. The characteristics of these datasets are summarized in Table 2, where the density of the feature matrix is the fraction of nonzero entries. We set the parameters κ and r equal to 5 and 0.5, respectively. We implemented the algorithms in Matlab running on a 64-bit Microsoft Windows 10 machine with a 2.70 GHz Intel Core i7-6820HQ CPU and 8GB of memory.

Dataset | Source | Number of samples | Number of features | Density of feature matrix
RCV1 | [25] | 20,242 | 47,236 | 0.16%
covtype | [9] | 581,012 | 54 | 22.00%
News20 | [4, 17] | 19,996 | 1,355,191 | 0.04%

Table 2: Characteristics of three binary classification datasets from [12].

To ensure a fair comparison, we started the feasible level-set and inexact Newton methods at a g-strictly feasible solution x0 = arg min_{‖x‖₂≤κ} g(x), which was computed by running a sub-gradient method (Oracle 2) for a sufficiently large number of iterations. We also initialized the former algorithm with the level parameter t0 = f(x0) > f∗ and chose this parameter in the latter algorithm as t0 = f∗ − |f(x0) − f∗| so that the initial deviation from optimality, that is |t0 − f∗|, is the same in both cases. (The optimal value f∗ is found by running the feasible level-set method for a sufficiently large number of iterations.) We set the parameter α equal to 100 and 1.9 in the feasible level-set and the inexact Newton methods, respectively. This choice was based on experimentally testing candidate values for α from a discrete set and selecting the one that led to the best performance in each case.
Figure 2 displays the performance of each method. The y-axis of the subfigures in the first row reports the term f(x) − f∗, that is, it focuses on optimality, while this axis in the second row shows the feasibility of solutions by plotting g(x). We track these measures as a function of the number of data passes performed by each


[Figure 2 contains six panels, one column per dataset (RCV1, covtype, News20): the top row plots f(x) − f∗ and the bottom row plots g(x) against the number of data passes, for the feasible level-set method and the inexact Newton method.]

Fig. 2: Performance of the feasible level-set method and the inexact Newton method.

algorithm on the x-axis, where a data pass involves going over the entire training data and is a basic operation in the oracles used by both methods. Moreover, the time consumed in performing data passes dominates the total run time of both methods. Data passes are performed by the (inner) iterations of Oracle 2 and the mirror prox method used by the feasible level-set and inexact Newton methods, respectively. To indicate the values of f(x) − f∗ and g(x) corresponding to the solutions maintained at each level-set (outer) iteration we use line markers in Figure 2.
The results in Figure 2 clearly showcase the behavior of our feasible level-set method established in the theoretical analysis (Theorem 2). In all three datasets, the solutions maintained in the outer level-set iterations are feasible and sub-optimal (i.e., g(x) ≤ 0 and f(x) − f∗ ≥ 0). In contrast, the level-set iterations in the inexact Newton method maintain mostly infeasible and super-optimal (i.e., g(x) > 0 and f(x) − f∗ < 0) solutions even though we initialize the procedure with a feasible solution. Indeed, both approaches converge to essentially optimal solutions as iterations progress but, as alluded to in the introduction, the feasible level-set method has the advantage of providing feasible, and thus implementable, solutions if the algorithm is terminated before convergence. Finally, the run times for the feasible level-set method are 15, 93, and 65 seconds for the three datasets, respectively. The corresponding run times for the inexact Newton method are 25%-35% larger on all instances for the same number of data passes because the mirror prox method used by the inexact Newton method performs projections onto Y that are instead not needed in the feasible level-set approach.

REFERENCES

[1] Z. Allen-Zhu and E. Hazan, Optimal black-box reductions between optimization objectives, arXiv preprint arXiv:1603.05642, (2016).
[2] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, arXiv preprint arXiv:1602.01506, (2016).


[3] A. Y. Aravkin, J. V. Burke, and M. P. Friedlander, Variational properties of value functions, SIAM Journal on Optimization, 23 (2013), pp. 1689–1717.
[4] K. Bache and M. Lichman, UCI machine learning repository, (2013).
[5] A. Bayandina, Adaptive mirror descent for constrained optimization, arXiv preprint arXiv:1705.02029, (2017).
[6] A. Beck, A. Ben-Tal, and L. Tetruashvili, A sequential ascending parameter method for solving constrained minimization problems, SIAM Journal on Optimization, 22 (2012), pp. 244–260.
[7] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[8] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 22 (2012), pp. 557–580.
[9] J. A. Blackard, D. J. Dean, and C. W. Anderson, Covertype data set, in UCI Machine Learning Repository, K. Bache and M. Lichman, eds., URL: http://archive.ics.uci.edu/ml, 2013, University of California, Irvine, School of Information and Computer Sciences.
[10] K. M. Bretthauer and B. Shetty, The nonlinear knapsack problem–algorithms and applications, European Journal of Operational Research, 138 (2002), pp. 459–472.
[11] M. Chiang, Geometric programming for communication systems, Foundations and Trends in Communications and Information Theory, 2 (2005), pp. 1–154.
[12] R.-E. Fan and C.-J. Lin, LIBSVM data: Classification, regression and multi-label, (2011).
[13] R. M. Freund and H. Lu, New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure, arXiv preprint arXiv:1511.02974, (2015).
[14] B. Grimmer, Radial subgradient method, arXiv preprint arXiv:1703.09280v2, (2017).
[15] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization, Mathematical Programming, 152 (2015), pp. 75–112.
[16] A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods, (2011), pp. 121–148.
[17] S. S. Keerthi and D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research, 6 (2005), pp. 341–361.
[18] K. C. Kiwiel, Proximal level bundle methods for convex nondifferentiable optimization, saddle-point problems and variational inequalities, Mathematical Programming, 69 (1995), pp. 89–109.
[19] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002, (2012).
[20] G. Lan, Z. Lu, and R. D. C. Monteiro, Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming, Mathematical Programming, 126 (2011), pp. 1–29.
[21] G. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Mathematical Programming, 149 (2015), pp. 1–45.
[22] G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints, arXiv preprint arXiv:1604.03887, (2016).
[23] C. Lemaréchal, An extension of Davidon methods to non differentiable problems, Nondifferentiable Optimization, (1975), pp. 95–109.
[24] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov, New variants of bundle methods, Mathematical Programming, 69 (1995), pp. 111–147.
[25] D. D. Lewis, Y. Yang, T. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, 5 (2004), pp. 361–397.
[26] A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization, 15 (2004), pp. 229–251.
[27] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[28] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[29] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Kluwer, Netherlands, (2004).
[30] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127–152.


[31] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221–259.
[32] Y. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming, 140 (2013), pp. 125–161.
[33] M. Patriksson and C. Strömberg, Algorithms for the continuous nonlinear resource allocation problem–new implementations and numerical studies, European Journal of Operational Research, 243 (2015), pp. 703–722.
[34] E. Pee and J. O. Royset, On solving large-scale finite minimax problems using exponential smoothing, Journal of Optimization Theory and Applications, 148 (2011), pp. 390–421.
[35] J. Renegar, “Efficient” subgradient methods for general convex optimization, SIAM Journal on Optimization, 26 (2016), pp. 2649–2676.
[36] P. Rigollet and X. Tong, Neyman-Pearson classification, convexity and stochastic constraints, Journal of Machine Learning Research, 12 (2011), pp. 2831–2855.
[37] X. Tong, Y. Feng, and A. Zhao, A survey on Neyman-Pearson classification and suggestions for future research, Wiley Interdisciplinary Reviews: Computational Statistics, 8 (2016), pp. 64–81.
[38] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization, (2008).
[39] E. Van Den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 31 (2008), pp. 890–912.
[40] E. Van Den Berg and M. P. Friedlander, Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 21 (2011), pp. 1201–1229.
[41] P. Wolfe, A method of conjugate subgradients for minimizing nondifferentiable functions, Springer, 1975, pp. 145–173.
[42] Y. Xu, First-order methods for constrained convex programming based on linearized augmented Lagrangian function, arXiv preprint arXiv:1711.08020, (2017).
[43] Y. Xu, Global convergence rates of augmented Lagrangian methods for constrained convex programming, arXiv preprint arXiv:1711.05812, (2017).
[44] Y. Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints, arXiv preprint arXiv:1802.02724, (2018).
