A Unified Successive Pseudo-Convex Approximation Framework

Yang Yang and Marius Pesavento

Abstract—In this paper, we propose a successive pseudo-convex approximation algorithm to efficiently compute stationary points for a large class of possibly nonconvex optimization problems. The stationary points are obtained by solving a sequence of successively refined approximate problems, each of which is much easier to solve than the original problem. To achieve convergence, the approximate problem only needs to exhibit a weak form of convexity, namely, pseudo-convexity. We show that the proposed framework not only includes as special cases a number of existing methods, for example, the gradient method and the Jacobi algorithm, but also leads to new algorithms which enjoy easier implementation and faster convergence speed. We also propose a novel line search method for nondifferentiable optimization problems, which is carried out over a properly constructed differentiable function with the benefit of a simplified implementation as compared to state-of-the-art line search techniques that directly operate on the original nondifferentiable objective function. The advantages of the proposed algorithm are shown, both theoretically and numerically, by several example applications, namely, MIMO broadcast channel capacity computation, energy efficiency maximization in massive MIMO systems and LASSO in sparse signal recovery.

Index Terms—Energy efficiency, exact line search, LASSO, massive MIMO, MIMO broadcast channel, nonconvex optimization, nondifferentiable optimization, successive convex approximation.

arXiv:1506.04972v2 [math.OC] 7 Apr 2016

Y. Yang is with Intel Deutschland GmbH, Germany (email: [email protected]). M. Pesavento is with Communication Systems Group, Darmstadt University of Technology, Germany (email: [email protected]). The authors acknowledge the financial support of the Seventh Framework Programme for Research of the European Commission under grant number ADEL-619647 and the EXPRESS project within the DFG priority program CoSIP (DFG-SPP 1798).

I. INTRODUCTION

In this paper, we propose an iterative algorithm to solve the following general optimization problem:

  minimize_x f(x)
  subject to x ∈ X,        (1)

where X ⊆ R^n is a closed and convex set, and f(x): R^n → R is a proper and differentiable function with a continuous gradient. We assume that problem (1) has a solution.

Problem (1) also includes some classes of nondifferentiable optimization problems, if the nondifferentiable function g(x) is convex:

  minimize_x f(x) + g(x)
  subject to x ∈ X,        (2)

because problem (2) can be rewritten into a problem of the form of (1) with the help of auxiliary variables:

  minimize_{x,y} f(x) + y
  subject to x ∈ X, g(x) ≤ y.        (3)

We do not assume that f(x) is convex, so (1) is in general a nonconvex optimization problem. The focus of this paper is on the development of efficient iterative algorithms for computing the stationary points of problem (1). The optimization problem (1) represents a general class of optimization problems with a vast number of diverse applications. Consider for example the sum-rate maximization in the MIMO multiple access channel (MAC) [1], the broadcast channel (BC) [2, 3] and the interference channel (IC) [4, 5, 6, 7, 8, 9], where f(x) is the sum-rate function of multiple users (to be maximized) while the set X characterizes the users' power constraints. In the context of the MIMO IC, (1) is a nonconvex problem and NP-hard [5]. As another example, consider portfolio optimization, in which f(x) represents the expected return of the portfolio (to be maximized) and the set X characterizes the trading constraints [10]. Furthermore, in sparse (l1-regularized) linear regression, f(x) denotes the least square function and g(x) is the sparsity regularization function [11, 12].

Commonly used iterative algorithms belong to the class of descent direction methods, such as the conditional gradient method and the gradient projection method for the differentiable problem (1) [13] and the proximal gradient method for the nondifferentiable problem (2) [14, 15], which often suffer from slow convergence. To speed up the convergence, the block coordinate descent (BCD) method that uses the notion of the nonlinear best-response has been widely studied [13, Sec. 2.7]. In particular, this method is applicable if the constraint set of (1) has a Cartesian product structure X = X_1 × ... × X_K such that

  minimize_{x=(x_k)_{k=1}^K} f(x_1, ..., x_K)
  subject to x_k ∈ X_k, k = 1, ..., K.        (4)

The BCD method is an iterative algorithm: in each iteration, only one variable is updated by its best-response x_k^{t+1} = arg min_{x_k ∈ X_k} f(x_1^{t+1}, ..., x_{k−1}^{t+1}, x_k, x_{k+1}^t, ..., x_K^t) (i.e., the point that minimizes f(x) with respect to (w.r.t.) the variable x_k only, while the remaining variables are fixed to their values of the preceding iteration), and the variables are updated sequentially. This method and its variants have been successfully applied to many practical problems [1, 6, 7, 10, 16].

When the number of variables is large, the convergence speed of the BCD method may be slow due to the sequential nature of the update. A parallel variable update based on the best-response seems attractive as a means to speed up the updating procedure; however, the convergence of a parallel best-response algorithm is only guaranteed under rather restrictive conditions, cf. the diagonal dominance condition on the objective function f(x_1, ..., x_K) [17], which is not only difficult to satisfy but also hard to verify. If f(x_1, ..., x_K) is convex, the parallel algorithms converge if the stepsize is inversely proportional to the number of block variables K. This choice of stepsize, however, tends to be overly conservative in systems with a large number of block variables and inevitably slows down the convergence [2, 10, 18].

Recent progress in parallel algorithms has been made in [8, 9, 19, 20], in which it was shown that a stationary point of (1) can be found by solving a sequence of successively refined approximate problems of the original problem (1), and convergence to a stationary point is established if, among other conditions, the approximate function (the objective function of the approximate problem) and the stepsizes are properly selected. The parallel algorithms proposed in [8, 9, 19, 20] are essentially descent direction methods. A description of how to construct the approximate problem such that the convexity of the original problem is preserved as much as possible is also contained in [8, 9, 19, 20], to achieve faster convergence than standard descent direction methods such as the classical conditional gradient method and the gradient projection method.

Despite their novelty, the parallel algorithms proposed in [8, 9, 19, 20] suffer from two limitations. Firstly, the approximate function must be strongly convex; this is usually guaranteed by artificially adding a quadratic regularization term to the original objective function f(x), which however may destroy the desirable characteristic structure of the original problem that could otherwise be exploited, e.g., to obtain computationally efficient closed-form solutions of the approximate problems [6]. Secondly, the algorithms require the use of a decreasing stepsize. On the one hand, a slow decay of the stepsize is preferable to make notable progress and to achieve a satisfactory convergence speed; on the other hand, theoretical convergence is guaranteed only when the stepsize decays fast enough. In practice, it is a difficult task on its own to find a decay rate for the stepsize that provides a good trade-off between convergence speed and convergence guarantee, and current practices mainly rely on heuristics [19].

The contribution of this paper consists in the development of a novel iterative convex approximation method to solve problem (1). In particular, the advantages of the proposed iterative algorithm are the following:

1) The approximate function of the original problem (1) in each iteration only needs to exhibit a weak form of convexity, namely, pseudo-convexity. The proposed iterative method not only includes as special cases many existing methods, for example, [4, 8, 9, 19], but also opens new possibilities for constructing approximate problems that are easier to solve. For example, in the MIMO BC sum-rate maximization problem (Sec. IV-A), the new approximate problems can be solved in closed-form. We also show by a counterexample that the assumption on pseudo-convexity is tight, in the sense that if it is not satisfied, the algorithm may not converge.

2) The stepsizes can be determined based on the problem structure, typically resulting in faster convergence than in cases where constant stepsizes [2, 10, 18] and decreasing stepsizes [8, 19] are used. For example, a constant stepsize can be used when f(x) is given as the difference of two convex functions as in DC programming [21]. When the objective function is nondifferentiable, we propose a new exact/successive line search method that is carried out over a properly constructed differentiable function. Thus it is much easier to implement than state-of-the-art techniques that operate on the original nondifferentiable objective function directly.

In the proposed algorithm, the exact/successive line search is used to determine the stepsize, and it can be implemented in a centralized controller, whose presence is justified for particular applications, e.g., the base station in the MIMO BC and the portfolio manager in multi-portfolio optimization [10]. We remark that even in applications in which a centralized controller is not admitted, the line search procedure does not necessarily imply an increased signaling burden when it is implemented in a distributed manner among different distributed processors. For example, in the LASSO problem studied in Sec. IV-C, the stepsize based on the exact line search can be computed in closed-form and does not incur any additional signaling compared with predetermined stepsizes, e.g., decreasing stepsizes and constant stepsizes. Besides, even in cases where the line search procedure induces additional signaling, the burden is often fully amortized by the significant increase in the convergence rate.

The rest of the paper is organized as follows. In Sec. II we introduce the mathematical background. The novel iterative method is proposed and its convergence is analyzed in Sec. III; its connection to several existing descent direction algorithms is presented there. In Sec. IV, several applications are considered: the sum-rate maximization problem of the MIMO BC and the energy efficiency maximization of a massive MIMO system to illustrate the advantage of the proposed approximate function, and the LASSO problem to illustrate the advantage of the proposed stepsize. The paper is finally concluded in Sec. V.

Notation: We use x, x and X to denote a scalar, vector and matrix, respectively. We use X_jk to denote the (j,k)-th element of X; x_k is the k-th element of x where x = (x_k)_{k=1}^K, and x_{−k} denotes all elements of x except x_k: x_{−k} = (x_j)_{j=1, j≠k}^K. We denote x^{−1} as the element-wise inverse of x, i.e., (x^{−1})_k = 1/x_k. Notation x ∘ y and X ⊗ Y denotes the Hadamard product between x and y, and the Kronecker product between X and Y, respectively. The operator [x]_a^b returns the element-wise projection of x onto [a, b]: [x]_a^b ≜ max(min(x, b), a), and [x]^+ ≜ max(x, 0). We denote ⌈x⌉ as the smallest integer that is larger than or equal to x. We denote d(X) as the vector that consists of the diagonal elements of X, and diag(x) is a diagonal matrix whose diagonal elements are the same as x. We use 1 to denote the vector whose elements are equal to 1.

II. PRELIMINARIES ON DESCENT DIRECTION METHOD

AND CONVEX FUNCTIONS

In this section, we introduce the basic definitions and concepts that are fundamental in the development of the mathematical formalism used in the rest of the paper.

Stationary point. A point y ∈ X is a stationary point of (1) if

  (x − y)^T ∇f(y) ≥ 0, ∀ x ∈ X.        (5)

Condition (5) is the necessary condition for local optimality of the variable y. For nonconvex problems, where global optimality conditions are difficult to establish, the computation of stationary points of the optimization problem (1) is generally desired. If (1) is convex, stationary points coincide with (globally) optimal points and condition (5) is also sufficient for y to be (globally) optimal.

Descent direction. The vector d^t is a descent direction of the function f(x) at x = x^t if

  ∇f(x^t)^T d^t < 0.        (6)

If (6) is satisfied, the function f(x) can be decreased when x is updated from x^t along the direction d^t. This is because the Taylor expansion of f(x) around x = x^t is given by

  f(x^t + γ d^t) = f(x^t) + γ ∇f(x^t)^T d^t + o(γ),

where the first-order term is negative in view of (6). For sufficiently small γ, the first-order term dominates all higher-order terms. More rigorously, if d^t is a descent direction, there exists a γ̄^t > 0 such that [22, 8.2.1]

  f(x^t + γ d^t) < f(x^t), ∀ γ ∈ (0, γ̄^t).

Note that the converse is not necessarily true, i.e., f(x^{t+1}) < f(x^t) for arbitrary functions f(x) does not necessarily imply that x^{t+1} − x^t is a descent direction of f(x) at x = x^t.

Quasi-convex function. A function h(x) is quasi-convex if for any α ∈ [0, 1]:

  h((1 − α)x + αy) ≤ max(h(x), h(y)), ∀ x, y ∈ X.

A locally optimal point y of a quasi-convex function h(x) over a convex set X is also globally optimal, i.e.,

  h(x) ≥ h(y), ∀ x ∈ X.

Pseudo-convex function. A function h(x) is pseudo-convex if [23]

  ∇h(x)^T (y − x) ≥ 0 ⟹ h(y) ≥ h(x), ∀ x, y ∈ X.

Another equivalent definition of pseudo-convex functions is also useful in our context [23]:

  h(y) < h(x) ⟹ ∇h(x)^T (y − x) < 0.        (7)

In other words, h(y) < h(x) implies that y − x is a descent direction of h(x). A pseudo-convex function is also quasi-convex [23, Th. 9.3.5], and thus any locally optimal point of a pseudo-convex function is also globally optimal.

Convex function. A function h(x) is convex if

  h(y) ≥ h(x) + ∇h(x)^T (y − x), ∀ x, y ∈ X.

It is strictly convex if the above inequality is satisfied with strict inequality whenever x ≠ y. It is easy to see that a convex function is pseudo-convex.

Strongly convex function. A function h(x) is strongly convex with constant a if

  h(y) ≥ h(x) + ∇h(x)^T (y − x) + (a/2) ‖y − x‖₂², ∀ x, y ∈ X,

for some positive constant a. The relationship of functions with different degrees of convexity is summarized in Figure 1, where the arrow denotes implication in the direction of the arrow.

Figure 1. Relationship of functions with different degrees of convexity.

III. THE PROPOSED SUCCESSIVE PSEUDO-CONVEX APPROXIMATION ALGORITHM

In this section, we propose an iterative algorithm that solves (1) as a sequence of successively refined approximate problems, each of which is much easier to solve than the original problem (1), e.g., the approximate problem can be decomposed into independent subproblems that might even exhibit closed-form solutions.

In iteration t, let f̃(x; x^t) be the approximate function of f(x) around the point x^t. Then the approximate problem is

  minimize_x f̃(x; x^t)
  subject to x ∈ X,        (8)

and its optimal point and solution set are denoted as Bx^t and S(x^t), respectively:

  Bx^t ∈ S(x^t) ≜ {x* ∈ X : f̃(x*; x^t) = min_{x∈X} f̃(x; x^t)}.        (9)

We assume that the approximate function f̃(x; y) satisfies the following technical conditions:

(A1) The approximate function f̃(x; y) is pseudo-convex in x for any given y ∈ X;
(A2) The approximate function f̃(x; y) is continuously differentiable in x for any given y ∈ X and continuous in y for any x ∈ X;
(A3) The gradient of f̃(x; y) and the gradient of f(x) are identical at x = y for any y ∈ X, i.e., ∇_x f̃(y; y) = ∇_x f(y).

Based on (9), we define the mapping Bx that is used to generate the sequence of points in the proposed algorithm:

  X ∋ x ⟼ Bx ∈ X.        (10)

Given the mapping Bx, the following properties hold.
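The strict inclusions in the convexity hierarchy above can be spot-checked numerically. The following is a minimal sketch of ours (not from the paper), using the scalar function h(x) = x + x³: its derivative 1 + 3x² is positive, so h is increasing and hence pseudo-convex on R, yet h is not convex since h''(x) = 6x < 0 for x < 0.

```python
import numpy as np

def h(x):
    # Pseudo-convex but nonconvex on R.
    return x + x**3

def grad_h(x):
    return 1.0 + 3.0 * x**2

# Check the pseudo-convexity implication (7):
# h(y) < h(x)  =>  grad_h(x) * (y - x) < 0.
rng = np.random.default_rng(0)
xs = rng.uniform(-2.0, 2.0, size=1000)
ys = rng.uniform(-2.0, 2.0, size=1000)
pseudo_ok = all(grad_h(x) * (y - x) < 0
                for x, y in zip(xs, ys) if h(y) < h(x))

# Convexity fails: the first-order lower bound
# h(y) >= h(x) + h'(x)(y - x) is violated, e.g., at x = -1, y = -2:
# h(-2) = -10 while the bound gives -2 + 4*(-1) = -6.
convex_violated = h(-2.0) < h(-1.0) + grad_h(-1.0) * (-2.0 - (-1.0))
```

Since h is monotone, h(y) < h(x) forces y < x, and the positive derivative then makes the implication in (7) hold at every sampled pair, while the convexity inequality already fails at a single point.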

Proposition 1 (Stationary point and descent direction). Provided that Assumptions (A1)-(A3) are satisfied: (i) A point y is a stationary point of (1) if and only if y ∈ S(y) defined in (9); (ii) If y is not a stationary point of (1), then By − y is a descent direction of f(x):

  ∇f(y)^T (By − y) < 0.        (11)

Proof: See Appendix A.

If x^t is not a stationary point, according to Proposition 1, we define the vector update x^{t+1} in the (t+1)-th iteration as

  x^{t+1} = x^t + γ^t (Bx^t − x^t),        (12)

where γ^t ∈ (0, 1] is an appropriate stepsize that can be determined by either the exact line search (also known as the minimization rule) or the successive line search (also known as the Armijo rule). Since x^t ∈ X, Bx^t ∈ X and γ^t ∈ (0, 1], it follows from the convexity of X that x^{t+1} ∈ X for all t.

Exact line search. The stepsize is selected such that the function f(x) is decreased to the largest extent along the descent direction Bx^t − x^t:

  γ^t ∈ arg min_{0≤γ≤1} f(x^t + γ(Bx^t − x^t)).        (13)

With this stepsize rule, it is easy to see that if x^t is not a stationary point, then f(x^{t+1}) < f(x^t).

In the special case that f(x) in (1) is convex and γ* nulls the gradient of f(x^t + γ(Bx^t − x^t)), i.e., ∇_γ f(x^t + γ*(Bx^t − x^t)) = 0, then γ^t in (13) is simply the projection of γ* onto the interval [0, 1]:

  γ^t = [γ*]_0^1 = { 1,   if ∇_γ f(x^t + γ(Bx^t − x^t))|_{γ=1} ≤ 0,
                     0,   if ∇_γ f(x^t + γ(Bx^t − x^t))|_{γ=0} ≥ 0,
                     γ*,  otherwise.

If 0 ≤ γ^t = γ* ≤ 1, the constrained optimization problem in (13) is essentially unconstrained. In some applications it is possible to compute γ* analytically, e.g., if f(x) is quadratic as in the LASSO problem (Sec. IV-C). Otherwise, for general convex functions, γ* can be found efficiently by the bisection method as follows. Restricting the function f(x) to the line x^t + γ(Bx^t − x^t), the new function f(x^t + γ(Bx^t − x^t)) is convex in γ [24]. It thus follows that ∇_γ f(x^t + γ(Bx^t − x^t)) < 0 if γ < γ* and ∇_γ f(x^t + γ(Bx^t − x^t)) > 0 if γ > γ*. Given an interval [γ_low, γ_up] containing γ* (the initial values of γ_low and γ_up are 0 and 1, respectively), set γ_mid = (γ_low + γ_up)/2 and refine γ_low and γ_up according to the following rule:

  γ_low = γ_mid, if ∇_γ f(x^t + γ_mid(Bx^t − x^t)) < 0,
  γ_up = γ_mid, if ∇_γ f(x^t + γ_mid(Bx^t − x^t)) > 0.

The procedure is repeated finitely many times until the gap γ_up − γ_low is smaller than a prescribed precision.

Successive line search. If no structure in f(x) (e.g., convexity) can be exploited to efficiently compute γ^t according to the exact line search (13), the successive line search can instead be employed: given scalars 0 < α < 1 and 0 < β < 1, the stepsize γ^t is set to γ^t = β^{m_t}, where m_t is the smallest nonnegative integer m satisfying the following inequality:

  f(x^t + β^m(Bx^t − x^t)) ≤ f(x^t) + α β^m ∇f(x^t)^T (Bx^t − x^t).        (14)

Note that the existence of a finite m_t satisfying (14) is always guaranteed if Bx^t − x^t is a descent direction at x^t, i.e., ∇f(x^t)^T (Bx^t − x^t) < 0 [13]; from Proposition 1, inequality (14) thus always admits a solution.

The algorithm is formally summarized in Algorithm 1 and its convergence properties are given in the following theorem.

Algorithm 1 The iterative convex approximation algorithm for the differentiable problem (1)
Data: t = 0 and x^0 ∈ X.
Repeat the following steps until convergence:
  S1: Compute Bx^t using (9).
  S2: Compute γ^t by the exact line search (13) or the successive line search (14).
  S3: Update x^{t+1} according to (12) and set t ← t + 1.

Theorem 2 (Convergence to a stationary point). Consider the sequence {x^t} generated by Algorithm 1. Provided that Assumptions (A1)-(A3) as well as the following assumptions are satisfied:

(A4) The solution set S(x^t) is nonempty for t = 1, 2, ...;
(A5) Given any convergent subsequence {x^t}_{t∈T}, where T ⊆ {1, 2, ...}, the sequence {Bx^t}_{t∈T} is bounded.

Then any limit point of {x^t} is a stationary point of (1).

Proof: See Appendix B.

In the following we discuss some properties of the proposed Algorithm 1.

On the conditions (A1)-(A5). The only requirement on the convexity of the approximate function f̃(x; x^t) is that it is pseudo-convex, cf. (A1). To the best of our knowledge, these are the weakest conditions for descent direction methods available in the literature. As a result, they enable the construction of new approximate functions that can often be optimized more easily, or even in closed-form, resulting in a significant reduction of the computational cost. Assumptions (A2)-(A3) represent standard conditions for successive convex approximation techniques and are satisfied for many existing approximation functions, cf. Sec. III-B. Sufficient conditions for Assumptions (A4)-(A5) are that either the feasible set X in (8) is bounded or the approximate function in (8) is strongly convex [25]. We show how these assumptions are satisfied in the popular applications considered in Sec. IV.

On the pseudo-convexity of the approximate function. Assumption (A1) is tight, in the sense that if it is not satisfied, Proposition 1 may not hold. Consider the following simple example: f(x) = x³, where −1 ≤ x ≤ 1, and the point x^t = 0 at iteration t. Choosing the approximate function f̃(x; x^t) = x³, which is quasi-convex but not pseudo-convex, all assumptions except (A1) are satisfied. It is easy to see that Bx^t = −1; however, (Bx^t − x^t)∇f(x^t) = (−1 − 0) · 0 = 0, and thus Bx^t − x^t is not a descent direction, i.e., inequality (11) in Proposition 1 is violated.

On the stepsize. The stepsize can be determined in a more straightforward way if f̃(x; x^t) is a global upper bound of f(x) that is exact at x = x^t, i.e., assume that

(A6) f̃(x; x^t) ≥ f(x) and f̃(x^t; x^t) = f(x^t);

then Algorithm 1 converges under the choice γ^t = 1, which results in the update x^{t+1} = Bx^t. To see this, we first remark that γ = 1 must be an optimal point of the following problem:

  1 ∈ arg min_{0≤γ≤1} f̃(x^t + γ(Bx^t − x^t); x^t),        (15)

otherwise the optimality of Bx^t is contradicted, cf. (9). At the same time, it follows from Proposition 1 that ∇f̃(x^t; x^t)^T (Bx^t − x^t) < 0.

Algorithm 2 The iterative convex approximation algorithm for the nondifferentiable problem (2)
Data: t = 0 and x^0 ∈ X.
Repeat the following steps until convergence:
  S1: Compute Bx^t using (22).
  S2: Compute γ^t by the exact line search (23) or the successive line search (24).
  S3: Update x^{t+1} according to
    x^{t+1} = x^t + γ^t (Bx^t − x^t).
  Set t ← t + 1.
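Steps S1-S3 and the bisection-based exact line search can be made concrete with a small toy example of ours (not the paper's code): minimizing the convex quartic f(x) = (1/4)‖x‖⁴ + cᵀx over a box, with a gradient-projection-type surrogate (cf. Sec. III-B) so that Bx^t is a simple projection.

```python
import numpy as np

# Toy instance of Algorithm 1: minimize f(x) = 1/4 ||x||^4 + c^T x
# over the box X = [-1, 1]^n. The vector c is a hypothetical example.
c = np.array([1.0, -2.0, 0.5])

def f(x):
    return 0.25 * (x @ x) ** 2 + c @ x

def grad_f(x):
    return (x @ x) * x + c

def proj_X(x):
    # Projection onto the box [-1, 1]^n.
    return np.clip(x, -1.0, 1.0)

def best_response(x, s=0.5):
    # S1: minimizer of a quadratic surrogate over X (a projected gradient step).
    return proj_X(x - s * grad_f(x))

def exact_line_search(x, d, tol=1e-12):
    # S2: f(x + g*d) is convex in g here, so bisect on the sign of its
    # derivative, mirroring the refinement rule in the text.
    dphi = lambda g: grad_f(x + g * d) @ d
    if dphi(1.0) <= 0.0:
        return 1.0
    if dphi(0.0) >= 0.0:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.zeros(3)
for t in range(300):
    Bx = best_response(x)                  # S1
    gamma = exact_line_search(x, Bx - x)   # S2
    x = x + gamma * (Bx - x)               # S3
```

Because f is quartic here, the stepsize could also be obtained in closed form by solving a cubic in γ; the bisection is shown since it applies to any convex f.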
The successive line search over f̃(x^t + γ(Bx^t − x^t); x^t) thus yields a nonnegative and finite integer m_t such that for some 0 < α < 1 and 0 < β < 1:

  f̃(Bx^t; x^t) ≤ f̃(x^t + β^{m_t}(Bx^t − x^t); x^t)
              ≤ f̃(x^t; x^t) + α β^{m_t} ∇f̃(x^t; x^t)^T (Bx^t − x^t)
              = f(x^t) + α β^{m_t} ∇f(x^t)^T (Bx^t − x^t),        (16)

where the second inequality comes from the definition of the successive line search [cf. (14)] and the last equality follows from Assumptions (A3) and (A6). Invoking Assumption (A6) again, we obtain

  f(x^{t+1}) ≤ f(x^t) + α β^{m_t} ∇f(x^t)^T (Bx^t − x^t), with x^{t+1} = Bx^t.        (17)

The proof of Theorem 2 can be used verbatim to prove the convergence of Algorithm 1 with the constant stepsize γ^t = 1.

A. Nondifferentiable Optimization Problems

In the following we show that the proposed Algorithm 1 can also be applied to solve problem (3), and its equivalent formulation (2) which contains a nondifferentiable objective function. Suppose that f̃(x; x^t) is an approximate function of f(x) in (3) around x^t and that it satisfies Assumptions (A1)-(A3). Then the approximation of problem (3) around (x^t, y^t) is

  (Bx^t, y*(x^t)) ≜ arg min_{(x,y): x∈X, g(x)≤y} f̃(x; x^t) + y.        (18)

That is, we only need to replace the differentiable function f(x) by its approximate function f̃(x; x^t). To see this, it is sufficient to verify Assumption (A3) only:

  ∇_x (f̃(x^t; x^t) + y) = ∇_x (f(x^t) + y^t),
  ∇_y (f̃(x^t; x^t) + y) = ∇_y (f(x^t) + y) = 1.

Based on the exact line search, the stepsize γ^t in this case is given as

  γ^t ∈ arg min_{0≤γ≤1} { f(x^t + γ(Bx^t − x^t)) + y^t + γ(y*(x^t) − y^t) },        (19)

where y^t ≥ g(x^t). Then the variables x^{t+1} and y^{t+1} are defined as follows:

  x^{t+1} = x^t + γ^t (Bx^t − x^t),        (20a)
  y^{t+1} = y^t + γ^t (y*(x^t) − y^t).        (20b)

The convergence of Algorithm 1 with (Bx^t, y*(x^t)) and γ^t given by (18)-(19) directly follows from Theorem 2.

The point y^{t+1} given in (20b) can be further refined:

  f(x^{t+1}) + y^{t+1} = f(x^{t+1}) + y^t + γ^t (y*(x^t) − y^t)
                       ≥ f(x^{t+1}) + g(x^t) + γ^t (g(Bx^t) − g(x^t))
                       ≥ f(x^{t+1}) + g((1 − γ^t) x^t + γ^t Bx^t)
                       = f(x^{t+1}) + g(x^{t+1}),

where the first and the second inequality come from the facts that y^t ≥ g(x^t) and y*(x^t) = g(Bx^t), and from Jensen's inequality for the convex function g(x) [24], respectively. Since y^{t+1} ≥ g(x^{t+1}) by definition, the point (x^{t+1}, g(x^{t+1})) always yields a lower value of f(x) + y than (x^{t+1}, y^{t+1}), while (x^{t+1}, g(x^{t+1})) is still a feasible point of problem (3). The update (20b) is then replaced by the following enhanced rule:

  y^{t+1} = g(x^{t+1}).        (21)

Algorithm 1 with x^{t+1} given in (20a) and y^{t+1} given in (21) still converges to a stationary point of (3).

The notation in (18)-(19) can be simplified by removing the auxiliary variable y: Bx^t in (18) can be equivalently written as

  Bx^t = arg min_{x∈X} { f̃(x; x^t) + g(x) },        (22)

and combining (19) and (21) yields

  γ^t ∈ arg min_{0≤γ≤1} { f(x^t + γ(Bx^t − x^t)) + γ(g(Bx^t) − g(x^t)) }.        (23)

In the context of the successive line search, customizing the general definition (14) for problem (2) yields the choice γ^t = β^{m_t}, with m_t being the smallest nonnegative integer m that satisfies the inequality

  f(x^t + β^m(Bx^t − x^t)) − f(x^t) ≤ α β^m ∇f(x^t)^T (Bx^t − x^t) + (α − 1) β^m (g(Bx^t) − g(x^t)).        (24)

Based on the derivations above, the proposed algorithm for the nondifferentiable problem (2) is formally summarized in Algorithm 2.

It is much easier to calculate γ^t according to (23) than in state-of-the-art techniques that directly carry out the exact line

search over the original nondifferentiable objective function in (2) [26, Rule E], i.e., the update

  min_{0≤γ≤1} f(x^t + γ(Bx^t − x^t)) + g(x^t + γ(Bx^t − x^t)).

This is because the objective function in (23) is differentiable in γ, while state-of-the-art techniques involve the minimization of a nondifferentiable function. If f(x) exhibits a specific structure such as in quadratic functions, γ^t can even be calculated in closed-form. This property will be exploited to develop a fast and easily implementable algorithm for the popular LASSO problem in Sec. IV-C.

In the proposed successive line search, the left hand side of (24) depends on f(x) while the right hand side is linear in β^m. The proposed variation of the successive line search thus involves only the evaluation of the differentiable function f(x), and its computational complexity and signaling exchange (when implemented in a distributed manner) are thus lower than in state-of-the-art techniques (for example [26, Rule A'], [27, Equations (9)-(10)], [19, Remark 4] and [28, Algorithm 2.1]), in which the whole nondifferentiable function f(x) + g(x) must be repeatedly evaluated (for different m) and compared with a certain benchmark before m_t is found.

B. Special Cases and New Algorithms

In this subsection, we interpret some existing methods in the context of Algorithm 1 and show that they can be considered as special cases of the proposed algorithm.

Conditional gradient method: In this iterative algorithm for problem (1), the approximate function is given as the first-order approximation of f(x) at x = x^t [13, Sec. 2.2.2], i.e.,

  f̃(x; x^t) = ∇f(x^t)^T (x − x^t).        (25)

Then the stepsize is selected by either the exact line search or the successive line search.

Gradient projection method: In this iterative algorithm for problem (1), Bx^t is given by [13, Sec. 2.3]

  Bx^t = [x^t − s^t ∇f(x^t)]_X,

where s^t > 0 and [x]_X denotes the projection of x onto X. This is equivalent to defining f̃(x; x^t) in (9) as follows:

  f̃(x; x^t) = ∇f(x^t)^T (x − x^t) + (1/(2s^t)) ‖x − x^t‖₂²,        (26)

which is the first-order approximation of f(x) augmented by a quadratic regularization term that is introduced to improve the numerical stability [17]. A generalization of (26) is to replace the quadratic term by (x − x^t)^T H^t (x − x^t), where H^t ≻ 0 [27].

Proximal gradient method: If f(x) is convex and has a Lipschitz continuous gradient with constant L, the proximal gradient method for problem (2) has the following form [14, Sec. 4.2]:

  x^{t+1} = arg min_x { s^t g(x) + (1/2) ‖x − (x^t − s^t ∇f(x^t))‖₂² }
          = arg min_x { ∇f(x^t)^T (x − x^t) + (1/(2s^t)) ‖x − x^t‖₂² + g(x) },        (27)

where s^t > 0. In the context of the proposed framework (22), the update (27) is equivalent to defining f̃(x; x^t) as follows:

  f̃(x; x^t) = ∇f(x^t)^T (x − x^t) + (1/(2s^t)) ‖x − x^t‖₂²,        (28)

and setting the stepsize γ^t = 1 for all t. According to Theorem 2 and the discussion following Assumption (A6), the proposed algorithm converges under a constant unit stepsize if f̃(x; x^t) is a global upper bound of f(x), which is indeed the case when s^t ≤ 1/L in view of the descent lemma [13, Prop. A.24].

Jacobi algorithm: In problem (1), if f(x) is convex in each x_k, where k = 1, ..., K (but not necessarily jointly convex in (x_1, ..., x_K)), the approximate function is defined as [8]

  f̃(x; x^t) = Σ_{k=1}^K ( f(x_k, x_{−k}^t) + (τ_k/2) ‖x_k − x_k^t‖₂² ),        (29)

where τ_k ≥ 0 for k = 1, ..., K. The k-th component function f(x_k, x_{−k}^t) + (τ_k/2) ‖x_k − x_k^t‖₂² in (29) is obtained from the original function f(x) by fixing all variables except x_k, i.e., x_{−k} = x_{−k}^t, and further adding a quadratic regularization term. Since f̃(x; x^t) in (29) is convex, Assumption (A1) is satisfied. Based on the observation that

  ∇_{x_k} f̃(x^t; x^t) = [ ∇_{x_k} f(x_k, x_{−k}^t) + τ_k (x_k − x_k^t) ]|_{x_k = x_k^t} = ∇_{x_k} f(x^t),

we conclude that Assumption (A3) is satisfied by the choice of the approximate function in (29). The resulting approximate problem is given by

  minimize_{x=(x_k)_{k=1}^K} Σ_{k=1}^K ( f(x_k, x_{−k}^t) + (τ_k/2) ‖x_k − x_k^t‖₂² )
  subject to x ∈ X.        (30)

This is commonly known as the Jacobi algorithm. The structure inside the constraint set X, if any, may be exploited to solve (30) even more efficiently, for example when the constraint set X consists of separable constraints of the form Σ_{k=1}^K h_k(x_k) ≤ 0 for some convex functions h_k(x_k). Since subproblem (30) is convex, primal and dual decomposition techniques can readily be used to solve (30) efficiently [29] (such an example is studied in Sec. IV-A).

To guarantee convergence, the condition proposed in [9] is that τ_k > 0 for all k in (29), unless f(x) is strongly convex in each x_k. However, the strong convexity of f(x) in each x_k is a strong assumption that cannot always be satisfied, and the additional quadratic regularization term that is otherwise required may destroy the convenient structure that could otherwise be exploited, as we will show through an example application in the MIMO BC in Sec. IV-A. In the case τ_k = 0, convergence of the Jacobi algorithm (30) is only proved when f(x) is jointly convex in (x_1, ..., x_K) and the stepsize is inversely proportional to the number of variables K [2, 10, 13], namely, γ^t = 1/K. However, the resulting convergence speed is usually slow when K is large, as we will later demonstrate numerically in Sec. IV-A.

With the technical assumptions specified in Theorem 2, the convergence of the Jacobi algorithm with the approximate problem (30) and the successive line search is guaranteed even

t t T Algorithm 3 The Jacobi algorithm for problem (4) Since f2(x) is convex and f2(x) ≥ f2(x ) + ∇f2(x ) (x − 0 t Data: t = 0 and xk ∈ Xk for all k = 1,...,K. x ), Assumption (A6) is satisfied and the constant unit stepsize Repeat the following steps until convergence: can be chosen [30]. t S1: For k = 1,...,K, compute Bkx using (31). t S2: Compute γ by the exact line search (13) or the succes- IV. EXAMPLE APPLICATIONS sive line search (14). S3: Update xt+1 according to A. MIMO Broadcast Channel Capacity Computation In this subsection, we study the MIMO BC capacity com- xt+1 = xt + γt( xt − xt ), ∀k = 1,...,K. k k Bk k putation problem to illustrate the advantage of the proposed Set t ← t + 1. approximate function. Consider a MIMO BC where the channel matrix charac- terizing the transmission from the base station to user k is denoted by Hk, the transmit covariance matrix of the signal ˜ t from the base station to user k is denoted as Qk, and the when τk = 0. This is because f(x; x ) in (29) is already noise at each user k is an additive independent and identically convex when τk = 0 for all k and it naturally satisfies the pseudo-convexity assumption specified by Assumption (A1). distributed Gaussian vector with unit variance on each of its In the case that the constraint set X has a Cartesian product elements. Then the sum capacity of the MIMO BC is [31] structure (4), the subproblem (30) is naturally decomposed into PK H maximize log I + k=1HkQkHk K sub-problems, one for each variable, which are then solved {Qk} in parallel. In this case, the requirement in the convexity of PK subject to Qk  0, k = 1,...,K, k=1tr(Qk) ≤ P, (32) f(x) in each xk can even be relaxed to pseudo-convexity PK t P only (although the sum function k=1 f(xk, x−k) is not where is the power budget at the base station. 
necessarily pseudo-convex in x as pseudo-convexity is not Problem (32) is a convex problem whose solution cannot be preserved under nonnegative weighted sum operator), and this expressed in closed-form and can only be found iteratively. To t t K leads to the following update: Bx = (Bkx )k=1 and apply Algorithm 1, we invoke (29)-(30) and the approximate problem at the t-th iteration is t t Bkx ∈ arg min f(xk, x−k), k = 1,...,K, (31) PK t H xk∈Xk maximize k=1 log Rk(Q−k) + HkQkHk {Qk} t PK where Bkx can be interpreted as variable xk’s best-response subject to Qk  0, k = 1,...,K, tr(Qk) ≤ P, t k=1 to other variables x−k = (xj)j6=k when x−k = x−k. (33) The proposed Jacobi algorithm is formally summarized in t P t H Algorithm 3 and its convergence is proved in Theorem 3. where Rk(Q−k) , I + j6=k HjQjHj . The approximate function is concave in Q and differentiable in both Q and Theorem 3. Consider the sequence {xt} generated by Algo- Qt, and thus Assumptions (A1)-(A3) are satisfied. Since the rithm 3. Provided that f(x) is pseudo-convex in x for all k constraint set in (33) is compact, the approximate problem k = 1,...,K and Assumptions (A4)-(A5) are satisfied. Then (33) has a solution and Assumptions (A4)-(A5) are satisfied. any limit point of the sequence generated by Algorithm 3 is a Problem (33) is convex and the sum-power constraint stationary point of (4). coupling Q1,..., QK is separable, so dual decomposition Proof: See Appendix C. techniques can be used [29]. In particular, the constraint set The convergence condition specified in Theorem 3 relaxes has a nonempty interior, so strong holds and (33) can those in [8, 19]: f(x) only needs to be pseudo-convex in each be solved from the dual domain by relaxing the sum-power variable xk and no regularization term is needed (i.e., τk = 0). 
constraint into the Lagrangian [24]: To the best of our knowledge, this is the weakest convergence ( K ) P log R (Qt ) + H Q HH condition on Jacobi algorithms available in the literature. We t k=1 k −k k k k BQ = arg max K . K ? P will show in Sec. IV-B by an example application of the energy (Qk0)k=1 −λ ( k=1tr(Qk) − P ) efficiency maximization problem in massive MIMO systems (34) t t K ? how the weak assumption on the approximate function’s where BQ = (BkQ )k=1 and λ is the optimal Lagrange convexity proposed in Theorem 2 can be exploited to the multiplier that satisfies the following conditions: λ? ≥ 0, PK t ? PK t largest extent. Besides this, the line search usually yields much k=1 tr(BkQ ) − P ≤ 0, λ ( k=1 tr(BkQ ) − P ) = 0, and faster convergence than the fixed stepsize adopted in [2] even can be found efficiently using the bisection method. under the same approximate problem, cf. Sec. IV-A. The problem in (34) is uncoupled among different variables DC algorithm: If the objective function in (1) is the Qk in both the objective function and the constraint set, so it difference of two convex functions f1(x) and f2(x): can be decomposed into a set of smaller subproblems which t t K are solved in parallel: BQ = (BkQ )k=1 and f(x) = f1(x) − f2(x), t  t H ? the following approximate function can be used: BkQ = arg max log Rk(Q−k) + HkQkHk −λ tr(Qk) , Q 0 ˜ t t t T t k f(x; x ) = f1(x) − (f2(x ) + ∇f2(x ) (x − x )). (35) 8
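The interplay between the per-user subproblems (35) and the bisection search for λ* can be illustrated on a deliberately simplified scalar-channel instance; the true subproblem (35) is a matrix waterfilling, while the channel gains h and budget P below are hypothetical scalars:

```python
import numpy as np

# scalar-channel toy version of (34)-(35): maximize sum_k log(1 + h_k p_k)
# subject to sum_k p_k <= P, p_k >= 0; the gains h and budget P are hypothetical
h = np.array([2.0, 1.0, 0.5, 0.1])
P = 4.0

def p_of(lam):
    # per-user closed form ("waterfilling"): argmax_p log(1 + h*p) - lam*p
    return np.maximum(1.0 / lam - 1.0 / h, 0.0)

# bisection on the multiplier: the total power sum_k p_of(lam) decreases in lam
lo, hi = 1e-9, float(h.max())
for _ in range(100):
    lam = 0.5 * (lo + hi)
    if p_of(lam).sum() > P:
        lo = lam
    else:
        hi = lam
p = p_of(lam)

# sum-power constraint is met with equality, and all active users
# share the same water level 1/lam
assert abs(p.sum() - P) < 1e-6
assert np.allclose((p + 1.0 / h)[p > 1e-9], 1.0 / lam)
```

Bisection is applicable because the total transmit power is monotonically decreasing in the multiplier; the same monotonicity argument underlies the search for λ* after (34).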


Thus problem (33) also has a closed-form solution, up to a Lagrange multiplier that can be found efficiently using the bisection method. With the update direction BQ^t − Q^t, the base station can implement the exact line search to determine the stepsize using the bisection method described after (13) in Sec. III.

We remark that when the channel matrices H_k are rank deficient, problem (33) is convex but not strongly convex; the proposed algorithm with the approximate problem (33) nevertheless still converges. However, if the approximate function in [8] is used [cf. (29)], an additional quadratic regularization term must be included in (33) (and thus (35)) to make the approximate problem strongly convex, but the resulting approximate problem no longer exhibits a closed-form solution and is thus much more difficult to solve.

Simulations. The parameters are set as follows. The number of users is K = 20 and K = 100, the number of transmit and receive antennas is (5,4), and P = 10 dB. The simulation results are averaged over 20 instances.

We apply Algorithm 1 with the approximate problem (33) and the stepsize based on the exact line search, and compare it with the iterative algorithm proposed in [2, 18], which uses the same approximate problem (33) but with a fixed stepsize γ^t = 1/K (K is the number of users). It is easy to see from Figure 2 that the proposed method converges very fast (in fewer than 10 iterations) to the sum capacity, while the method of [2] requires many more iterations. This is due to the benefit of the exact line search applied in our algorithm over the fixed stepsize, which tends to be overly conservative. Employing the exact line search adds complexity as compared to the simple choice of a fixed stepsize; however, since the objective function of (32) is concave, the exact line search consists in maximizing a differentiable concave function of a scalar variable, and it can be carried out efficiently by the bisection method at affordable cost. More specifically, it takes 0.0023 seconds to solve problem (33) and 0.0018 seconds to perform the exact line search (the software/hardware environment is further specified in Sec. IV-C). Therefore, the overall CPU time (time per iteration × number of iterations) is still dramatically decreased due to the notable reduction in the number of iterations. Besides, in contrast to the method of [2], increasing the number of users K does not slow down the convergence, so the proposed algorithm is scalable in large networks.

Figure 2. MIMO BC: sum-rate versus the number of iterations.

We also compare the proposed algorithm with the iterative algorithm of [20], which uses the approximate problem (33) but with an additional quadratic regularization term, cf. (29), where τ_k = 10^{−5} for all k, and decreasing stepsizes γ^{t+1} = γ^t (1 − d γ^t), where d = 0.01 is the so-called decreasing rate that controls the rate of decrease of the stepsize. We can see from Figure 3 that the convergence behavior of [20] is rather sensitive to the decreasing rate d. The choice d = 0.01 performs well when the number of transmit and receive antennas is 5 and 4, respectively, but it is no longer a good choice when the number of transmit and receive antennas increases to 10 and 8, respectively. A good decreasing rate d usually depends on the problem parameters, and no general rule performs equally well for all choices of parameters.

Figure 3. MIMO BC: error e(Q^t) = ℜ(tr(∇f(Q^t)(BQ^t − Q^t))) versus the number of iterations.

We remark once again that the complexity of each iteration of the proposed algorithm is very low because of the existence of a closed-form solution to the approximate problem (33), while the approximate problem proposed in [20] does not exhibit a closed-form solution and can only be solved iteratively. Specifically, it takes CVX (version 2.0 [32]) 21.1785 seconds to solve the approximate problem of [20], while the proposed approximate problem (33) is solved in closed form (based on the dual approach (35), where λ* is found by bisection). Therefore, the overall complexity per iteration of the proposed algorithm is much lower than that of [20].

B. Energy Efficiency Maximization in Massive MIMO Systems

In this subsection, we study the energy efficiency maximization problem in massive MIMO systems to illustrate the advantage of the relaxed convexity requirement on the approximate function f̃(x; x^t) in the proposed iterative optimization approach: according to Assumption (A1), f̃(x; x^t) only needs to exhibit the pseudo-convexity property, rather than the convexity or strong convexity property that is conventionally required.

Consider a massive MIMO network with K cells, where each cell serves one user. The achievable transmission rate of user k in the uplink can be formulated in the following general form¹:

r_k(p) ≜ log(1 + w_kk p_k / (σ_k² + φ_k p_k + Σ_{j≠k} w_kj p_j)),  (36)

where p_k is the transmission power of user k, σ_k² is the covariance of the additive noise at the receiver of user k, while φ_k and {w_kj}_{k,j} are positive constants that depend on the channel conditions only. In particular, φ_k p_k accounts for the hardware impairments, and Σ_{j≠k} w_kj p_j accounts for the interference from other users [33].

¹ We assume a single resource block. However, the extension to multiple resource blocks is straightforward.

In 5G wireless communication networks, the energy efficiency is a key performance indicator. To address this issue, we look for the optimal power allocation that maximizes the energy efficiency:

maximize_p  Σ_{k=1}^K r_k(p) / (P_c + Σ_{k=1}^K p_k)
subject to  p ≤ p ≤ p̄,  (37)

where P_c is a positive constant representing the total circuit power dissipated in the network, and p = (p_k)_{k=1}^K and p̄ = (p̄_k)_{k=1}^K specify the lower and upper bound constraints, respectively.

Problem (37) is nonconvex and it is NP-hard to find a globally optimal point [5]. Therefore, we aim at finding a stationary point of (37) using the proposed algorithm. To begin with, we propose the following approximate function at p = p^t:

f̃(p; p^t) = Σ_{k=1}^K r̃_k(p_k; p^t) / (P_c + p_k + Σ_{j≠k} p_j^t),  (38)

where

r̃_k(p_k; p^t) ≜ r_k(p_k, p^t_{−k}) + Σ_{j≠k} (r_j(p^t) + (p_k − p_k^t) ∇_{p_k} r_j(p^t))  (39)

and

∇_{p_k} r_j(p^t) = w_jk / (σ_j² + φ_j p_j^t + Σ_{l=1}^K w_jl p_l^t) − w_jk / (σ_j² + φ_j p_j^t + Σ_{l≠j} w_jl p_l^t),  ∀ j ≠ k.

Note that the approximate function f̃(p; p^t) consists of K component functions, one for each variable p_k, and the k-th component function is constructed as follows: since r_k(p) is concave in p_k (as shown in Step 1 below) but r_j(p), j ≠ k, is not concave in p_k (as a matter of fact, it is convex in p_k), the concave function r_k(p_k, p^t_{−k}) is preserved in r̃_k(p_k; p^t) in (39), with p_{−k} fixed to p^t_{−k}, while the nonconcave functions {r_j(p)}_{j≠k} are linearized w.r.t. p_k at p = p^t. In this way, the partial concavity in the nonconcave function Σ_{j=1}^K r_j(p) is preserved in f̃(p; p^t). Similarly, since P_c + Σ_{j=1}^K p_j in the denominator is linear in p_k, we only set p_{−k} = p^t_{−k} there. Note that the division operator in the original problem (37) is kept in the approximate function f̃(p; p^t) in (38). Although it destroys the concavity (recall that a concave function divided by a linear function is no longer a concave function), the pseudo-concavity of r̃_k(p_k; p^t)/(P_c + p_k + Σ_{j≠k} p_j^t) is still preserved, as we show in two steps.

Step 1: The function r_k(p_k, p^t_{−k}) is concave in p_k. For simplicity of notation, we define the two constants c_1 ≜ w_kk/φ_k > 0 and c_2 ≜ (σ_k² + Σ_{j≠k} w_kj p_j^t)/φ_k > 0. The first-order and second-order derivatives of r_k(p_k, p^t_{−k}) w.r.t. p_k are

∇_{p_k} r_k(p_k, p^t_{−k}) = (1 + c_1)/((1 + c_1) p_k + c_2) − 1/(p_k + c_2),

∇²_{p_k} r_k(p_k, p^t_{−k}) = −(1 + c_1)²/((1 + c_1) p_k + c_2)² + 1/(p_k + c_2)²
 = (−2 c_1 c_2 p_k (1 + c_1) − (c_1² + 2 c_1) c_2²) / (((1 + c_1) p_k + c_2)² (p_k + c_2)²).

Since ∇²_{p_k} r_k(p_k, p^t_{−k}) < 0 when p_k ≥ 0, r_k(p_k, p^t_{−k}) is a concave function of p_k in the nonnegative orthant p_k ≥ 0 [24].

Step 2: Given the concavity of r_k(p_k, p^t_{−k}), the function r̃_k(p_k; p^t) is concave in p_k, as the remaining terms in (39) are linear in p_k. Since the denominator function in (38) is a linear (and thus convex) function of p_k, it follows from [34, Lemma 3.8] that r̃_k(p_k; p^t)/(P_c + p_k + Σ_{j≠k} p_j^t) is pseudo-concave.

Then we verify that the gradient of the approximate function and that of the original objective function are identical at p = p^t:

∇_{p_k} f̃(p; p^t)|_{p=p^t} = ∇_{p_k} [ r̃_k(p_k; p^t) / (P_c + p_k + Σ_{j≠k} p_j^t) ]|_{p_k = p_k^t}
 = Σ_{j=1}^K ( ∇_{p_k} r_j(p^t) (P_c + 1^T p^t) − r_j(p^t) ) / (P_c + 1^T p^t)²
 = ∇_{p_k} [ Σ_{j=1}^K r_j(p) / (P_c + Σ_{j=1}^K p_j) ]|_{p=p^t},  ∀ k.

Therefore Assumption (A3) is satisfied. Assumption (A2) is also satisfied, because both r̃_k(p_k; p^t) and P_c + p_k + Σ_{j≠k} p_j^t are continuously differentiable for any p^t ≥ 0.

Given the approximate function (38), the approximate problem at iteration t is thus

Bp^t = arg max_{p ≤ p ≤ p̄}  Σ_{k=1}^K r̃_k(p_k; p^t) / (P_c + p_k + Σ_{j≠k} p_j^t).  (40)

Assumptions (A4) and (A5) can be shown to hold by a procedure similar to that in the previous example application. Since the objective function in (37) is nonconcave, it may not be computationally affordable to perform the exact line search. Instead, the successive line search can be applied to calculate the stepsize. As a result, the convergence of the proposed algorithm with the approximate problem (40) and the successive line search follows from the same line of analysis as in the proof of Theorem 3.

The optimization problem in (40) can be decomposed into independent subproblems that are solved in parallel: Bp^t = (B_k p^t)_{k=1}^K with

B_k p^t = arg max_{p_k ≤ p_k ≤ p̄_k}  r̃_k(p_k; p^t) / (P_c + p_k + Σ_{j≠k} p_j^t),  k = 1, ..., K.  (41)

As we have just shown, the numerator and the denominator in (41) are concave and linear, respectively, so the optimization problem in (41) is a fractional programming problem and can be solved by Dinkelbach's algorithm [33, Algorithm 5]: given λ_k^{t,τ} (λ_k^{t,0} can be set to 0), the following optimization problem is solved in iteration τ + 1:

p_k(λ_k^{t,τ}) ≜ arg max_{p_k ≤ p_k ≤ p̄_k} { r̃_k(p_k; p^t) − λ_k^{t,τ} (P_c + p_k + Σ_{j≠k} p_j^t) },  (42a)

where r̃_k(p_k; p^t) is the numerator function in (41). The variable λ_k^{t,τ} is then updated in iteration τ + 1 as

λ_k^{t,τ+1} = r̃_k(p_k(λ_k^{t,τ}); p^t) / (P_c + p_k(λ_k^{t,τ}) + Σ_{j≠k} p_j^t).  (42b)

It follows from the convergence properties of Dinkelbach's algorithm that

lim_{τ→∞} p_k(λ_k^{t,τ}) = B_k p^t

at a superlinear convergence rate. Note that problem (42a) can be solved in closed form, as p_k(λ_k^{t,τ}) is simply the projection onto the interval [p_k, p̄_k] of the point that sets the gradient of the objective function in (42a) to zero. Finding that point is equivalent to finding the root of a polynomial of order 2, which admits a closed-form expression. Omitting the detailed derivations, the expression of p_k(λ_k^{t,τ}) is

p_k(λ_k^{t,τ}) = [ int_k(p^t) ( √( (2φ_k + w_kk)² − 4φ_k(φ_k + w_kk)(1 + w_kk/((π_k(p^t) − λ_k^{t,τ}) int_k(p^t))) ) − (2φ_k + w_kk) ) / (2φ_k(φ_k + w_kk)) ]_{p_k}^{p̄_k},  (43)

where π_k(p^t) ≜ Σ_{j≠k} ∇_{p_k} r_j(p^t) and int_k(p^t) ≜ σ_k² + Σ_{j≠k} w_kj p_j^t.

We finally remark that the approximate function in (38) is constructed in the same spirit as [8, 9, 35], by keeping as much concavity as possible, namely, r_k(p) in the numerator and P_c + Σ_{j=1}^K p_j in the denominator, and linearizing the nonconcave functions only, namely, Σ_{j≠k} r_j(p) in the numerator. Besides this, the denominator function is also kept. Therefore, the proposed algorithm is of a best-response nature and is expected to converge faster than gradient-based algorithms, which linearize the objective function Σ_{j=1}^K r_j(p)/(P_c + Σ_{j=1}^K p_j) in (37) completely. However, the convergence of the proposed algorithm with the approximate problem given in (40) cannot be derived from existing works, since the approximate function presents only a weak form of convexity, namely, pseudo-convexity, which is much weaker than those required in state-of-the-art convergence analyses, e.g., the uniform strong convexity in [8].

Simulations. The number of antennas at the BS in each cell is M = 50, and the channel from user j to cell k is h_kj ∈ C^{M×1}. We assume a setup similar to [33]: w_kk = |h_kk^H h_kk|², w_kj = |h_kk^H h_kj|² + ε h_kk^H D_j h_kk for j ≠ k, and φ_k = ε h_kk^H D_k h_kk, where ε = 0.01 is the error magnitude of the hardware impairments at the BS and D_j = diag({|h_jj(m)|²}_{m=1}^M). The noise covariance is σ_k² = 1, the hardware dissipated power P_c is 10 dBm, while p_k is −10 dBm and p̄_k is 10 dBm for all users. The benchmark algorithm is [33, Algorithm 1], which successively maximizes the following lower bound function of the objective function in (37), which is tight at p = p^t:

maximize_q  [ Σ_{k=1}^K (b_k^t + a_k^t log w_kk) + Σ_{k=1}^K a_k^t (q_k − log(σ_k² + φ_k e^{q_k} + Σ_{j≠k} w_kj e^{q_j})) ] / (P_c + Σ_{k=1}^K e^{q_k})
subject to  log(p_k) ≤ q_k ≤ log(p̄_k),  k = 1, ..., K,  (44)

where

a_k^t ≜ sinr_k(p^t)/(1 + sinr_k(p^t)),
b_k^t ≜ log(1 + sinr_k(p^t)) − a_k^t log(sinr_k(p^t)),

and

sinr_k(p) ≜ w_kk p_k / (σ_k² + φ_k p_k + Σ_{j≠k} w_kj p_j).

Denote the optimal variable of (44) as q^t (which can be found by Dinkelbach's algorithm); the variable p is then updated as p_k^{t+1} = e^{q_k^t} for all k = 1, ..., K. We thus refer to [33, Algorithm 1] as the successive lower bound maximization (SLBM) method.

In Figure 4, we compare the convergence behavior of the proposed method and the SLBM method in terms of both the number of iterations (the upper subplots) and the CPU time (the lower subplots), for two different numbers of users: K = 10 in Figure 4 (a) and K = 50 in Figure 4 (b). It is obvious that the convergence speed of the proposed algorithm in terms of the number of iterations is comparable to that of the SLBM method. However, we remark that the approximate problem (40) of the proposed algorithm is superior to that of the SLBM method in the following aspects.

Firstly, the approximate problem of the proposed algorithm consists of independent subproblems that can be solved in parallel, cf. (41), while each subproblem has a closed-form solution, cf. (42)-(43).
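As an illustration of (41)-(43), the following sketch runs Dinkelbach's iteration (42a)-(42b) for a single user, with the inner problem (42a) solved by the quadratic-root-plus-projection step behind (43); all numerical constants are hypothetical stand-ins for w_kk, φ_k, int_k(p^t), π_k(p^t) and P_c + Σ_{j≠k} p_j^t:

```python
import numpy as np

# hypothetical stand-ins for the constants in (41)-(43)
W, phi, I = 1.5, 0.05, 1.0   # w_kk, phi_k, int_k(p^t)
pi_k = -0.02                 # pi_k(p^t): slope of the linearized rates in (39)
offset = 0.3                 # constant part of (39) collected in the numerator
D0 = 1.2                     # P_c + sum_{j != k} p_j^t
lb, ub = 0.0, 10.0           # power limits

def numerator(p):            # \tilde r_k(p_k; p^t), cf. (39)
    return np.log(1.0 + W * p / (I + phi * p)) + offset + pi_k * p

def inner_max(lam):
    """Closed-form solution of (42a): the objective is concave in p and its
    stationary point solves a quadratic equation, cf. (43)."""
    mu = lam - pi_k          # stationarity: r_k'(p) = mu
    if mu <= 0:              # objective nondecreasing on [lb, ub]
        return ub
    disc = (I * (2 * phi + W)) ** 2 - 4 * phi * (phi + W) * (I ** 2 - W * I / mu)
    p = (-I * (2 * phi + W) + np.sqrt(disc)) / (2 * phi * (phi + W))
    return float(np.clip(p, lb, ub))

lam = 0.0                    # lambda_k^{t,0} = 0
for _ in range(30):          # Dinkelbach iterations (42a)-(42b)
    p = inner_max(lam)
    lam = numerator(p) / (D0 + p)

# cross-check: the fixed point maximizes the fractional objective (41)
grid = np.linspace(lb, ub, 200001)
p_grid = grid[np.argmax(numerator(grid) / (D0 + grid))]
assert abs(p - p_grid) < 1e-3
```

The multiplier λ converges to the maximal value of the fractional objective, and the agreement with the dense grid search confirms that the closed-form inner solution is exact.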

Figure 4. Energy efficiency maximization: achieved energy efficiency versus the number of iterations (upper subplots) and versus the CPU time (lower subplots); (a) number of users K = 10, (b) number of users K = 50.

However, the optimization variable in the approximate problem of the SLBM method (44) is a vector q ∈ R^{K×1}, and the approximate problem can only be solved by a general-purpose solver. In the simulations, we use the Matlab optimization toolbox to solve (44) and the iterative update specified in (42)-(43) to solve (40), where the stopping criterion for (42) is ‖λ^{t,τ+1} − λ^{t,τ}‖_∞ ≤ 10^{−5}. The upper subplots in Figure 4 show that the numbers of iterations required for convergence are approximately the same for the SLBM method when K = 10 in Figure 4 (a) and when K = 50 in Figure 4 (b). However, we see from the lower subplots in Figure 4 that the CPU time of each iteration of the SLBM method increases dramatically when K is increased from 10 to 50. On the other hand, the CPU time of the proposed algorithm does not change notably, because the operations are parallelizable² and the required CPU time is thus not affected by the problem size.

² By stacking the p_k(λ_k^{t,τ})'s into the vector form p(λ^{t,τ}) = (p_k(λ_k^{t,τ}))_{k=1}^K, we can see that only element-wise operations between vectors and matrix-vector multiplications are involved. The simulations on which Figure 4 is based are not performed in a real parallel computing environment with K processors, but only make use of the efficient linear algebraic implementations available in Matlab, which already implicitly admits a certain level of parallelism.

Secondly, since the variable substitution p_k = e^{q_k} is adopted in the SLBM method (we refer to [33] for more details), the lower bound constraint p_k = 0 (which corresponds to q_k = −∞) cannot be handled by the SLBM method numerically. This limitation impairs the applicability of the SLBM method in many practical scenarios.

C. LASSO

In this subsection, we study the LASSO problem to illustrate the advantage of the proposed line search method for nondifferentiable optimization problems. LASSO is an important and widely studied problem in sparse signal recovery [11, 12, 36, 37]:

minimize_x  (1/2)‖Ax − b‖₂² + µ‖x‖₁,  (45)

where A ∈ R^{N×K} (with N ≪ K), b ∈ R^{N×1} and µ > 0 are given parameters. Problem (45) is an instance of the general problem structure defined in (2) with the following decomposition:

f(x) ≜ (1/2)‖Ax − b‖₂²,  and  g(x) ≜ µ‖x‖₁.  (46)

Problem (45) is convex, but its objective function is nondifferentiable and it does not have a closed-form solution. To apply Algorithm 2, the scalar decomposition x = (x_k)_{k=1}^K is adopted. Recalling (22) and (29), the approximate problem is

Bx^t = arg min_x { Σ_{k=1}^K f(x_k, x^t_{−k}) + g(x) }.  (47)

Note that g(x) decomposes among the different components of x, i.e., g(x) = Σ_{k=1}^K g(x_k), so the vector problem (47) reduces to K independent scalar subproblems that can be solved in parallel:

B_k x^t = arg min_{x_k} { f(x_k, x^t_{−k}) + g(x_k) } = d_k(A^T A)^{−1} S_µ(r_k(x^t)),  k = 1, ..., K,

where d_k(A^T A) is the k-th diagonal element of A^T A, S_a(b) ≜ [b − a]⁺ − [−b − a]⁺ is the so-called soft-thresholding operator [37], and

r(x) ≜ d(A^T A) ◦ x − A^T (Ax − b),  (48)

or, more compactly:

Bx^t = (B_k x^t)_{k=1}^K = d(A^T A)^{−1} ◦ S_{µ1}(r(x^t)).  (49)
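As a concrete illustration, the operator S_a and the parallel best response (48)-(49) can be sketched in a few lines of numpy (the problem sizes, data and regularization gain below are arbitrary illustrations, not the simulation setup of this paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, mu = 20, 40, 0.5              # illustrative sizes and regularization gain
A = rng.standard_normal((N, K))
b = rng.standard_normal(N)
d = np.sum(A * A, axis=0)           # diag(A^T A), without forming A^T A

def soft(v, a):
    """Soft-thresholding S_a(v) = [v - a]^+ - [-v - a]^+ (element-wise)."""
    return np.maximum(v - a, 0.0) - np.maximum(-v - a, 0.0)

def best_response(x):
    """Parallel best response, eqs. (48)-(49)."""
    r = d * x - A.T @ (A @ x - b)   # eq. (48)
    return soft(r, mu) / d          # eq. (49)

# each B_k x solves its scalar subproblem exactly: perturbing a coordinate
# of the best response cannot decrease the coordinate-wise objective
x0 = rng.standard_normal(K)
Bx = best_response(x0)
for k in (0, 7, 23):
    def coord_obj(z):
        y = x0.copy(); y[k] = z
        return 0.5 * np.sum((A @ y - b) ** 2) + mu * abs(z)
    assert coord_obj(Bx[k]) <= coord_obj(Bx[k] + 1e-4)
    assert coord_obj(Bx[k]) <= coord_obj(Bx[k] - 1e-4)
```

Both S and the division by d(A^T A) act element-wise, so all K coordinates are updated simultaneously.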

Thus the update direction exhibits a closed-form expression. The stepsize based on the proposed exact line search (23) is

γ^t = arg min_{0≤γ≤1} { f(x^t + γ(Bx^t − x^t)) + γ (g(Bx^t) − g(x^t)) }
 = arg min_{0≤γ≤1} { (1/2)‖A(x^t + γ(Bx^t − x^t)) − b‖₂² + γ µ (‖Bx^t‖₁ − ‖x^t‖₁) }
 = [ − ( (Ax^t − b)^T A(Bx^t − x^t) + µ(‖Bx^t‖₁ − ‖x^t‖₁) ) / ( (A(Bx^t − x^t))^T (A(Bx^t − x^t)) ) ]₀¹.  (50)

The exact line search thus consists in solving a convex quadratic optimization problem with a scalar variable and a bound constraint, and it exhibits the closed-form solution (50). Therefore, both the update direction and the stepsize can be calculated in closed form. We name the proposed update (49)-(50) the Soft-Thresholding with Exact Line search Algorithm (STELA).

The proposed update (49)-(50) has several desirable features that make it appealing in practice. Firstly, in each iteration, all elements are updated in parallel based on the nonlinear best-response (49). This is in the same spirit as [19, 38], and the convergence speed is generally faster than that of BCD [39] or the gradient-based update [40]. Secondly, the proposed exact line search (50) not only yields notable progress in each iteration but also enjoys an easy implementation given the closed-form expression. The convergence speed is thus further enhanced as compared to the procedures proposed in [19, 27, 38], where either decreasing stepsizes are used [19] or the line search is performed over the original nondifferentiable objective function in (45) [27, 38]:

min_{0≤γ≤1} { (1/2)‖A(x^t + γ(Bx^t − x^t)) − b‖₂² + µ‖x^t + γ(Bx^t − x^t)‖₁ }.

Computational complexity. The computational overhead associated with the proposed exact line search (50) can be significantly reduced if (50) is carefully implemented, as outlined in the following. The most complex operations in (50) are the matrix-vector multiplications, namely, Ax^t − b in the numerator and A(Bx^t − x^t) in the denominator. On the one hand, the term Ax^t − b is already available from r(x^t), which is computed in order to determine the best-response in (49). On the other hand, the matrix-vector multiplication A(Bx^t − x^t) is also required for the computation of Ax^{t+1} − b, as the latter can alternatively be computed as

Ax^{t+1} − b = A(x^t + γ^t(Bx^t − x^t)) − b = (Ax^t − b) + γ^t A(Bx^t − x^t),  (51)

where then only an additional vector addition is involved. As a result, the stepsize (50) does not incur any additional matrix-vector multiplications, but only affordable vector-vector multiplications.

Signaling exchange. When A cannot be stored and processed by a centralized processing unit, a parallel architecture can be employed. Assume there are P + 1 (P ≥ 2) processors. We label the first P processors as local processors and the last one as the central processor, and partition A as

A = [A_1, A_2, ..., A_P],

where A_p ∈ R^{N×K_p} and Σ_{p=1}^P K_p = K. Matrix A_p is stored and processed by local processor p, and the following computations are decomposed among the local processors:

Ax = Σ_{p=1}^P A_p x_p,  (52a)
A^T (Ax − b) = (A_p^T (Ax − b))_{p=1}^P,  (52b)
d(A^T A) = (d(A_p^T A_p))_{p=1}^P,  (52c)

where x_p ∈ R^{K_p}. The central processor computes the best-response Bx^t in (49) and the stepsize γ^t in (50). The decomposition in (52) enables us to analyze the signaling exchange between local processor p and the central processor involved in (49) and (50)³.

³ Updates (49) and (50) can also be implemented by a parallel architecture without a central processor. In this case, the signaling is exchanged mutually between every two of the local processors, but the analysis is similar and the conclusion to be drawn remains the same: the proposed exact line search (50) does not incur additional signaling compared with predetermined stepsizes.

Figure 5. Operation flow and signaling exchange between local processor p and the central processor. A solid line indicates a computation that is locally performed by the central/local processor, and a solid line with an arrow indicates a signaling exchange between the central and local processor and the direction of the signaling exchange.

The signaling exchange is summarized in Figure 5. Firstly, the central processor sends Ax^t − b to the local processors (S1.1)⁴, and each local processor p, for p = 1, ..., P, first computes A_p^T(Ax^t − b) and then sends it back to the central processor (S1.2), which forms A^T(Ax^t − b) (S1.3) as in (52b), calculates r(x^t) as in (48) (S1.4) and then Bx^t as in (49) (S1.5). Then the central processor sends Bx_p^t − x_p^t to local processor p, for p = 1, ..., P (S2.1), and each local processor first computes A_p(Bx_p^t − x_p^t) and then sends it back to the central processor (S2.2), which forms A(Bx^t − x^t) (S2.3) as in (52a), calculates γ^t as in (50) (S2.4), and updates x^{t+1} (S3.1) and Ax^{t+1} − b (S3.2) according to (51). From Figure 5 we observe that the exact line search (50) does not incur any additional signaling compared with that of predetermined stepsizes (e.g., constant and decreasing stepsizes), because the signaling exchange in S2.1-S2.2 also has to be carried out for the computation of Ax^{t+1} − b in S3.2, cf. (51).

⁴ x⁰ is set to x⁰ = 0, so Ax⁰ − b = −b.

We finally remark that the proposed successive line search can also be applied here, and it exhibits a closed-form expression as well. However, since the exact line search yields faster convergence, we omit the details at this point.

Simulations. We first compare in Figure 6 the proposed algorithm STELA with FLEXA [19] in terms of the error criterion e(x^t), defined as

e(x^t) ≜ ‖ ∇f(x^t) − [∇f(x^t) − x^t]_{−µ1}^{µ1} ‖₂.  (53)

Note that x* is a solution of (45) if and only if e(x*) = 0 [28]. FLEXA is implemented as outlined in [19]; however, the selective update scheme of [19] is not implemented in FLEXA, because it is also applicable to STELA and it cannot eliminate the slow convergence and sensitivity of the decreasing stepsize. We also remark that the stepsize rule for FLEXA is γ^{t+1} = γ^t (1 − min(1, 10^{−4}/e(x^t)) d γ^t) [19], where d is the decreasing rate and γ⁰ = 0.9. The code and the data generating the figure can be downloaded online [41].

Note that the error e(x^t) plotted in Figure 6 does not necessarily decrease monotonically, while the objective function f(x^t) + g(x^t) always does, because STELA and FLEXA are descent direction methods. For FLEXA, when the decreasing rate is low (d = 10^{−4}), no improvement is observed after 100 iterations. As a matter of fact, the stepsize in those iterations is so large that the function value is actually dramatically increased, and thus the associated iterations are discarded in Figure 6. A similar behavior is also observed for d = 10^{−3}, until the stepsize becomes sufficiently small. When the stepsize decreases quickly (d = 10^{−1}), although improvement is made in all iterations, the asymptotic convergence speed is slow because the stepsize is too small to make notable progress. For this example, the choice d = 10^{−2} performs well, but the value of a good decreasing rate depends on the parameter setup (e.g., A, b and µ), and no general rule performs equally well for all choices of parameters. By comparison, the proposed algorithm STELA converges fast and exhibits stable performance without requiring any parameter tuning.

Figure 6. Convergence of STELA (proposed) and FLEXA (state-of-the-art) for LASSO: error versus the number of iterations.

We also compare in Figure 7 the proposed algorithm STELA with other competitive algorithms in the literature: FISTA [37], ADMM [12], GreedyBCD [42] and SpaRSA [43]. We simulate GreedyBCD of [42] because it exhibits guaranteed convergence. The dimension of A is 2000 × 4000 (the left column of Figure 7) and 5000 × 10000 (the right column). It is generated by the Matlab command randn, with each row being normalized to unity. The density (the proportion of nonzero elements) of the sparse vector x_true is 0.1 (the upper row of Figure 7), 0.2 (the middle row) and 0.4 (the lower row). The vector b is generated as b = A x_true + e, where e is drawn from an i.i.d. Gaussian distribution with variance 10^{−4}. The regularization gain µ is set to µ = 0.1‖A^T b‖_∞, which allows x_true to be recovered to a high accuracy [43].

The simulations are carried out under Matlab R2012a on a PC equipped with a Windows 7 64-bit Home Premium Edition operating system, an Intel i5-3210 2.50GHz CPU, and 8GB of RAM. All of the Matlab codes are available online [41]. The comparison is made in terms of the CPU time required until either a given error bound e(x^t) ≤ 10^{−6} or the maximum number of iterations, namely 2000, is reached. The running time consists of both the initialization stage required for preprocessing (represented by a flat curve) and the formal stage in which the iterations are carried out. For example, in the proposed algorithm STELA, d(A^T A) is computed⁵ in the initialization stage, since it is required in the iterative variable update of the formal stage, cf. (49). The simulation results are averaged over 20 instances.

⁵ The Matlab command is sum(A.^2,1), so the matrix-matrix multiplication between A^T and A is not required.

We observe from Figure 7 that the proposed algorithm STELA converges faster than all competing algorithms. Some further observations are in order.
• The proposed algorithm STELA is not sensitive to the density of the true signal x_true. When the density is increased from 0.1 (upper row) to 0.2 (middle row) and then to 0.4 (lower row), the CPU time increases only negligibly.
• The proposed algorithm STELA scales relatively well with the problem dimension. When the dimension of A is increased from 2000 × 4000 (the left column) to 5000 × 10000 (the right column), the CPU time is only marginally increased.
• The initialization stage of ADMM is time consuming because of some expensive matrix operations, e.g., AA^T, (I + (1/c)AA^T)^{−1} and A^T(I + (1/c)AA^T)^{−1}A (c is a given positive constant). More details can be found in [12, Sec. 6.4]. Furthermore, the CPU time of the initialization stage of ADMM increases dramatically when the dimension of A is increased from 2000 × 4000 to 5000 × 10000.
• SpaRSA performs better when the density of x_true is small, e.g., 0.1, than when it is large, e.g., 0.2 and 0.4.
• The asymptotic convergence speed of GreedyBCD is slow, because only one variable is updated in each iteration.

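The augmented Lagrangian approach used below to adapt STELA to basis pursuit can be illustrated on a scalar toy instance (minimize |x| subject to ax = b), where the inner l1 subproblem has a closed-form soft-thresholding solution; in the paper the inner problem is solved by STELA itself, so this Python sketch is only an illustration of the outer-loop bookkeeping, with function names and the toy setup of our own choosing:

```python
def soft(v, t):
    # soft-thresholding operator: S_t(v) = sign(v) * max(|v| - t, 0)
    return (abs(v) - t if abs(v) > t else 0.0) * (1.0 if v >= 0 else -1.0)

def bp_augmented_lagrangian(a, b, outer_iters=20):
    # Outer loop of the augmented Lagrangian scheme for basis pursuit,
    # specialized to the scalar problem: minimize |x| s.t. a*x = b:
    #   x^{t+1}   = argmin_x |x| + lam^t*(a*x - b) + (c^t/2)*(a*x - b)^2
    #   lam^{t+1} = lam^t + c^t*(a*x^{t+1} - b)
    #   c^{t+1}   = min(2*c^t, 10^2),  with c^0 = 10/||A^T b||_inf
    c = 10.0 / abs(a * b)
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        # closed-form inner solution by soft-thresholding; in the paper
        # this l1 subproblem is solved iteratively by STELA instead
        x = soft(b / a - lam / (c * a), 1.0 / (c * a * a))
        lam += c * (a * x - b)
        c = min(2.0 * c, 100.0)
    return x, lam
```

On the toy instance a = 2, b = 4 the iterates settle at the solution x = 2 with multiplier λ = −1/2 after a couple of outer iterations; the doubling of the penalty parameter mirrors the rule c^{t+1} = min(2c^t, 10^2) used in the experiments.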
[Figure 7 appears here: six panels plotting the error e(x^t) against time (sec) for STELA (proposed), ADMM, FISTA, GreedyBCD and SpaRSA.]

Figure 7. Time versus error of different algorithms for LASSO. In the left and right column, the dimension of A is 2000×4000 and 5000×10000, respectively. In the upper, middle and lower row, the density of x_true is 0.1, 0.2 and 0.4, respectively.

To further evaluate the performance of the proposed algorithm STELA, we test it on the benchmarking platform developed by the Optimization Group from the Department of Mathematics at the Darmstadt University of Technology^6 and compare it with different algorithms in various setups (data set, problem dimension, etc.) for the basis pursuit (BP) problem [44]:

    minimize_x ‖x‖_1   subject to Ax = b.

To adapt STELA for the BP problem, we use the augmented Lagrangian approach [13, 45]:

    x^{t+1} = arg min_x { ‖x‖_1 + (λ^t)^T(Ax − b) + (c^t/2)‖Ax − b‖_2^2 },
    λ^{t+1} = λ^t + c^t(Ax^{t+1} − b),

where c^{t+1} = min(2c^t, 10^2) (c^0 = 10/‖A^T b‖_∞), x^{t+1} is computed by STELA, and this process is repeated until λ^t converges. The numerical results summarized in [46] show that, although STELA must be called multiple times before the Lagrange multiplier λ converges, the proposed algorithm for BP based on STELA is very competitive in terms of running time and robust in the sense that it solved all problem instances in the test platform database.

^6 Project website: http://wwwopt.mathematik.tu-darmstadt.de/spear/

V. CONCLUDING REMARKS

In this paper, we have proposed a novel iterative algorithm based on convex approximation. The most critical requirement on the approximate function is that it is pseudo-convex. On the one hand, the relaxation of the assumptions on the approximate functions can make the approximate problems much easier to solve. We show by a counter-example that the assumption on pseudo-convexity is tight in the sense that when it is violated, the algorithm may not converge. On the other hand, the stepsize based on the exact/successive line search yields notable progress in each iteration. Additional structures can be exploited to assist with the selection of the stepsize, so that the algorithm can be further accelerated. The advantages and benefits of the proposed algorithm have been demonstrated using prominent applications in communication networks and signal processing, and they are also numerically consolidated. The proposed algorithm can readily be applied to solve other problems as well, such as portfolio optimization [10].

APPENDIX A
PROOF OF PROPOSITION 1

Proof: i) Firstly, suppose y is a stationary point of (1); it satisfies the first-order optimality condition:

    ∇f(y)^T(x − y) ≥ 0, ∀ x ∈ X.

Using Assumption (A3), we get

    ∇f˜(y; y)^T(x − y) ≥ 0, ∀ x ∈ X.

Since f˜(•; y) is pseudo-convex, the above condition implies

    f˜(x; y) ≥ f˜(y; y), ∀ x ∈ X.

That is, f˜(y; y) = min_{x∈X} f˜(x; y) and y ∈ S(y).

Secondly, suppose y ∈ S(y). We readily get

    ∇f(y)^T(x − y) = ∇f˜(y; y)^T(x − y) ≥ 0, ∀ x ∈ X,    (54)

where the equality and the inequality come from Assumption (A3) and the first-order optimality condition, respectively, so y is a stationary point of (1).

ii) From the definition of By, it is either

    f˜(By; y) = f˜(y; y),    (55a)

or

    f˜(By; y) < f˜(y; y).    (55b)

If (55a) is true, then y ∈ S(y) and, as we have just shown, it is a stationary point of (1). So only (55b) can be true. We know from the pseudo-convexity of f˜(x; y) in x (cf. Assumption (A1)) and (55b) that By ≠ y and

    ∇f˜(y; y)^T(By − y) = ∇f(y)^T(By − y) < 0,    (56)

where the equality comes from Assumption (A3).

APPENDIX B
PROOF OF THEOREM 2

Proof: Since Bx^t is the optimal point of (8), it satisfies the first-order optimality condition:

    ∇f˜(Bx^t; x^t)^T(x − Bx^t) ≥ 0, ∀ x ∈ X.    (57)

If (55a) is true, then x^t ∈ S(x^t) and it is a stationary point of (1) according to Proposition 1 (i). Besides, it follows from

(54) (with x = Bx^t and y = x^t) that ∇f(x^t)^T(Bx^t − x^t) ≥ 0. Note that equality is actually achieved, i.e.,

    ∇f(x^t)^T(Bx^t − x^t) = 0,

because otherwise Bx^t − x^t would be an ascent direction of f˜(x; x^t) at x = x^t and the definition of Bx^t would be contradicted. Then from the definition of the successive line search, we can readily infer that

    f(x^{t+1}) ≤ f(x^t).    (58)

It is easy to see that (58) holds for the exact line search as well.

If (55b) is true, x^t is not a stationary point and Bx^t − x^t is a strict descent direction of f(x) at x = x^t according to Proposition 1 (ii): f(x) is strictly decreased compared with f(x^t) if x is updated at x^t along the direction Bx^t − x^t. From the definition of the successive line search, there always exists a γ^t such that 0 < γ^t ≤ 1 and

    f(x^{t+1}) = f(x^t + γ^t(Bx^t − x^t)) < f(x^t).    (59)

This strict decreasing property also holds for the exact line search because it is the stepsize that yields the largest decrease, which is always larger than or equal to that of the successive line search.

We know from (58) and (59) that {f(x^t)} is a monotonically decreasing sequence and it thus converges. Besides, for any two (possibly different) convergent subsequences {x^t}_{t∈T1} and {x^t}_{t∈T2}, the following holds:

    lim_{t→∞} f(x^t) = lim_{T1∋t→∞} f(x^t) = lim_{T2∋t→∞} f(x^t).

Since f(x) is a continuous function, we infer from the preceding equation that

    f(lim_{T1∋t→∞} x^t) = f(lim_{T2∋t→∞} x^t).    (60)

Now consider any convergent subsequence {x^t}_{t∈T} with limit point y, i.e., lim_{T∋t→∞} x^t = y. To show that y is a stationary point, we first assume the contrary: y is not a stationary point. Since f˜(x; x^t) is continuous in both x and x^t by Assumption (A2) and {Bx^t}_{t∈T} is bounded by Assumption (A5), it follows from [25, Th. 1] that there exists a sequence {Bx^t}_{t∈Ts} with Ts ⊆ T such that it converges and lim_{Ts∋t→∞} Bx^t ∈ S(y). Since both f(x) and ∇f(x) are continuous, applying [25, Th. 1] again implies that there is a Ts′ such that Ts′ ⊆ Ts (⊆ T) and {x^{t+1}}_{t∈Ts′} converges to y′ defined as:

    y′ ≜ y + ρ(By − y),

where ρ is the stepsize when either the exact or successive line search is applied to f(y) along the direction By − y. Since y is not a stationary point, it follows from (59) that f(y′) < f(y), but this would contradict (60). Therefore y is a stationary point, and the proof is completed.

APPENDIX C
PROOF OF THEOREM 3

Proof: We first need to show that Proposition 1 still holds.

(i) We prove that y is a stationary point of (4) if and only if y_k ∈ arg min_{x_k∈X_k} f(x_k, y_{−k}) for all k.

Suppose y is a stationary point of (4); it satisfies the first-order optimality condition:

    ∇f(y)^T(x − y) = Σ_{k=1}^K ∇_k f(y)^T(x_k − y_k) ≥ 0, ∀ x ∈ X,

and it is equivalent to

    ∇_k f(y)^T(x_k − y_k) ≥ 0, ∀ x_k ∈ X_k.

Since f(x) is pseudo-convex in x_k, the above condition implies f(y_k, y_{−k}) = min_{x_k∈X_k} f(x_k, y_{−k}) for all k = 1, ..., K.

Suppose y_k ∈ arg min_{x_k∈X_k} f(x_k, y_{−k}) for all k = 1, ..., K. The first-order optimality condition yields

    ∇_k f(y)^T(x_k − y_k) ≥ 0, ∀ x_k ∈ X_k.

Adding the above inequality for all k = 1, ..., K yields

    ∇f(y)^T(x − y) ≥ 0, ∀ x ∈ X.

Therefore, y is a stationary point of (4).

(ii) We prove that if y is not a stationary point of (4), then ∇f(y)^T(By − y) < 0. It follows from the optimality of B_k y that

    f(B_k y, y_{−k}) ≤ f(y_k, y_{−k}),

and

    ∇_k f(B_k y, y_{−k})^T(x_k − B_k y) ≥ 0, ∀ x_k ∈ X_k.    (61)

Firstly, there must exist an index j such that

    f(B_j y, y_{−j}) < f(y_j, y_{−j}),    (62)

otherwise y would be a stationary point of (4). Since f(x) is pseudo-convex in x_k for k = 1, ..., K, it follows from (62) that

    ∇_j f(y)^T(B_j y − y_j) < 0.    (63)

Secondly, for any index k such that f(B_k y, y_{−k}) = f(y_k, y_{−k}), y_k minimizes f(x_k, y_{−k}) over x_k ∈ X_k and ∇_k f(y_k, y_{−k})^T(x_k − y_k) ≥ 0 for any x_k ∈ X_k. Setting x_k = B_k y yields

    ∇_k f(y_k, y_{−k})^T(B_k y − y_k) ≥ 0.    (64)

Similarly, setting x_k = y_k in (61) yields

    ∇_k f(B_k y, y_{−k})^T(y_k − B_k y) ≥ 0.    (65)

Adding (64) and (65), we can infer that (∇_k f(y) − ∇_k f(B_k y, y_{−k}))^T(y_k − B_k y) ≥ 0. Therefore, we can rewrite (65) as follows:

    0 ≤ ∇_k f(B_k y, y_{−k})^T(y_k − B_k y)
      = (∇_k f(B_k y, y_{−k}) − ∇_k f(y) + ∇_k f(y))^T(y_k − B_k y),

and thus

    ∇_k f(y)^T(B_k y − y_k) ≤ −(∇_k f(B_k y, y_{−k}) − ∇_k f(y))^T(B_k y − y_k) ≤ 0.    (66)

Adding (63) and (66) over all k = 1, ..., K yields

    ∇f(y)^T(By − y) = Σ_{k=1}^K ∇_k f(y)^T(B_k y − y_k) < 0.

That is, By − y is a descent direction of f(x) at the point y.

The proof of Theorem 2 can then be used verbatim to prove the convergence of the algorithm with the approximate problem (31) and the exact/successive line search.

REFERENCES

[1] W. Yu, W. Rhee, S. Boyd, and J. Cioffi, "Iterative Water-Filling for Gaussian Vector Multiple-Access Channels," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 145–152, Jan. 2004.
[2] N. Jindal, W. Rhee, S. Vishwanath, S. Jafar, and A. Goldsmith, "Sum Power Iterative Water-Filling for Multi-Antenna Gaussian Broadcast Channels," IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1570–1580, Apr. 2005.
[3] W. Yu, "Sum-capacity computation for the Gaussian vector broadcast channel via dual decomposition," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 754–759, Feb. 2006.
[4] S. Ye and R. Blum, "Optimized signaling for MIMO interference systems with feedback," IEEE Transactions on Signal Processing, vol. 51, no. 11, pp. 2839–2848, Nov. 2003.
[5] Z.-Q. Luo and S. Zhang, "Dynamic Spectrum Management: Complexity and Duality," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 57–73, Feb. 2008.
[6] S.-J. Kim and G. B. Giannakis, "Optimal Resource Allocation for MIMO Ad Hoc Cognitive Radio Networks," IEEE Transactions on Information Theory, vol. 57, no. 5, pp. 3117–3131, May 2011.
[7] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, "An Iteratively Weighted MMSE Approach to Distributed Sum-Utility Maximization for a MIMO Interfering Broadcast Channel," IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
[8] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, "Decomposition by Partial Linearization: Parallel Optimization of Multi-Agent Systems," IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 641–656, Feb. 2014.
[9] Y. Yang, G. Scutari, P. Song, and D. P. Palomar, "Robust MIMO Cognitive Radio Systems Under Interference Temperature Constraints," IEEE Journal on Selected Areas in Communications, vol. 31, no. 11, pp. 2465–2482, Nov. 2013.
[10] Y. Yang, F. Rubio, G. Scutari, and D. P. Palomar, "Multi-Portfolio Optimization: A Potential Game Approach," IEEE Transactions on Signal Processing, vol. 61, no. 22, pp. 5590–5602, Nov. 2013.
[11] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An Interior-Point Method for Large-Scale l1-Regularized Least Squares," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 606–617, Dec. 2007.
[12] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, 2010.
[13] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[14] N. Parikh and S. Boyd, "Proximal Algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
[15] P. L. Combettes and J.-C. Pesquet, "Proximal Splitting Methods in Signal Processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, ser. Springer Optimization and Its Applications, H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds. New York, NY: Springer New York, 2011, vol. 49, pp. 185–212.
[16] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.
[17] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.
[18] P. He, "Correction of Convergence Proof for Iterative Water-Filling in Gaussian MIMO Broadcast Channels," IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2539–2543, Apr. 2011.
[19] F. Facchinei, G. Scutari, and S. Sagratella, "Parallel Selective Algorithms for Nonconvex Big Data Optimization," IEEE Transactions on Signal Processing, vol. 63, no. 7, pp. 1874–1889, Apr. 2015.
[20] G. Scutari, F. Facchinei, L. Lampariello, and P. Song, "Distributed Methods for Constrained Nonconvex Multi-Agent Optimization—Part I: Theory," Oct. 2014, submitted to IEEE Transactions on Signal Processing. [Online]. Available: http://arxiv.org/abs/1410.4754
[21] R. Horst and N. Thoai, "DC programming: overview," Journal of Optimization Theory and Applications, vol. 103, no. 1, pp. 1–43, 1999.
[22] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. Academic, New York, 1970.
[23] O. L. Mangasarian, Nonlinear Programming. McGraw-Hill, 1969.
[24] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[25] S. M. Robinson and R. H. Day, "A sufficient condition for continuity of optimal sets in mathematical programming," Journal of Mathematical Analysis and Applications, vol. 45, no. 2, pp. 506–511, Feb. 1974.
[26] M. Patriksson, "Cost Approximation: A Unified Framework of Descent Algorithms for Nonlinear Programs," SIAM Journal on Optimization, vol. 8, no. 2, pp. 561–582, May 1998.
[27] P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, vol. 117, no. 1-2, 2009.
[28] R. H. Byrd, J. Nocedal, and F. Oztoprak, "An Inexact Successive Quadratic Approximation Method for Convex L-1 Regularized Optimization," Sep. 2013. [Online]. Available: http://arxiv.org/abs/1309.3529
[29] D. P. Palomar and M. Chiang, "A tutorial on decomposition methods for network utility maximization," IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1439–1451, Aug. 2006.
[30] R. Horst and N. V. Thoai, "DC programming: Overview," Journal of Optimization Theory and Applications, vol. 103, no. 1, pp. 1–43, 1999.
[31] W. Yu and J. M. Cioffi, "Sum Capacity of Gaussian Vector Broadcast Channels," IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 1875–1892, Sep. 2004.
[32] M. Grant and S. Boyd, "CVX: Matlab Software for Disciplined Convex Programming, version 2.0 beta," http://cvxr.com/, 2012.
[33] A. Zappone, L. Sanguinetti, G. Bacci, E. Jorswieck, and M. Debbah, "Energy-efficient power control: A look at 5G wireless technologies," IEEE Transactions on Signal Processing, vol. 64, no. 7, pp. 1668–1683, Apr. 2016.
[34] C. Olsson, A. P. Eriksson, and F. Kahl, "Efficient optimization for L∞-problems using pseudoconvexity," in 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[35] Y. Yang, G. Scutari, D. P. Palomar, and M. Pesavento, "A Parallel Decomposition Method for Nonconvex Stochastic Multi-Agent Optimization Problems," Dec. 2015, to appear in IEEE Transactions on Signal Processing.
[36] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267–288, 1996.
[37] A. Beck and M. Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, Jan. 2009.
[38] M. Elad, "Why simple shrinkage is still relevant for redundant representations?" IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5559–5569, 2006.
[39] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, Dec. 2007.
[40] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 586–597, Dec. 2007.
[41] Y. Yang, http://www.nts.tu-darmstadt.de/home_nts/staff_nts/mitarbeiterdetails_32448.en.jsp.
[42] Z. Peng, M. Yan, and W. Yin, "Parallel and distributed sparse optimization," 2013 Asilomar Conference on Signals, Systems and Computers, pp. 659–646, Nov. 2013.
[43] S. Wright, R. Nowak, and M. Figueiredo, "Sparse Reconstruction by Separable Approximation," IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2479–2493, Jul. 2009.
[44] D. Lorenz, M. Pfetsch, and A. Tillmann, "Solving Basis Pursuit: Heuristic Optimality Check and Solver Comparison," ACM Transactions on Mathematical Software, vol. 41, no. 2, pp. 1–29, 2015.
[45] R. T. Rockafellar, "Augmented Lagrangians and Applications of the Proximal Point Algorithm in Convex Programming," Mathematics of Operations Research, vol. 1, no. 2, pp. 97–116, 1976.
[46] J. Kuske and A. Tillmann, "Solving basis pursuit: Update," Tech. Rep., April 2016. [Online]. Available: http://www.mathematik.tu-darmstadt.de/~tillmann/docs/SolvingBPupdateApr2016.pdf