An exact penalty method for semidefinite-box constrained low-rank matrix optimization problems
Tianxiang Liu∗ Zhaosong Lu† Xiaojun Chen‡ Yu-Hong Dai§
September 9, 2018
Abstract. This paper considers a matrix optimization problem where the objective function is continuously differentiable and the constraints involve a semidefinite-box constraint and a rank constraint. We first replace the rank constraint by adding a non-Lipschitz penalty function to the objective and prove that this penalty problem is exact with respect to the original problem. Next, for the penalty problem, we present a nonmonotone proximal gradient (NPG) algorithm whose subproblem can be solved by Newton's method with global quadratic convergence. We also prove convergence of the NPG algorithm to a first-order stationary point of the penalty problem. Furthermore, based on the NPG algorithm, we propose an adaptive penalty method (APM) for solving the original problem. Finally, the efficiency of APM is shown via numerical experiments for the sensor network localization (SNL) problem and the nearest low-rank correlation matrix problem.
Keywords: rank constrained optimization, non-Lipschitz penalty, nonmonotone proximal gradient, penalty method.
1 Introduction
In this paper we consider the following constrained problem
$\min\, f(X)$  s.t.  $0 \preceq X \preceq I$,  $\mathrm{rank}(X) \le r$,  (1.1)
where $f : \mathcal{S}^n_+ \to \mathbb{R}$ is continuously differentiable with gradient $\nabla f$ being Lipschitz continuous, and $r < n$ is a given positive integer. Here, $\mathcal{S}^n_+$ denotes the cone of $n \times n$ symmetric positive semidefinite matrices, $I$ is the $n \times n$ identity matrix, and $0 \preceq X \preceq I$ means $X \in \mathcal{S}^n_+$ and $I - X \in \mathcal{S}^n_+$,
∗ Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China ([email protected]). This author was supported partly by the AMSS-PolyU Joint Research Institute Postdoctoral Scheme.
† Department of Mathematics, Simon Fraser University, Canada ([email protected]). This author was supported in part by an NSERC Discovery Grant.
‡ Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China ([email protected]). This author was supported in part by NSFC/Hong Kong Research Grant Council grant N-PolyU504/14.
§ Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China ([email protected]). This author was supported in part by the Chinese Natural Science Foundation (Nos. 11631013 and 11331012) and the National 973 Program of China (No. 2015CB856002).
which is referred to as a semidefinite-box constraint. Many application problems can be modeled by (1.1), including the wireless sensor network localization problem [4, 14] and the nearest low-rank correlation matrix problem [5, 12, 21].
Problem (1.1) is generally difficult to solve, due to the discontinuity and nonconvexity of the rank function. Recently, approximations of the rank function have been extensively studied. One well-known convex approximation is the nuclear norm $\|X\|_*$, namely, the sum of the singular values of $X$ (see for example [10]). For other work involving this approximation, see for example [7, 22, 23]. Besides, a nonconvex and nonsmooth approximation, the so-called Schatten $p$-norm $\|X\|_p^p = \sum_{i \ge 1} \sigma_i^p(X)$, where $p \in (0, 1)$ and $\sigma_i(X)$ is the $i$-th largest singular value, has attracted a lot of attention due to its good computational performance (see for example [14, 20, 17]). However, simply adding these approximations to the objective generally cannot guarantee a solution satisfying the rank constraint $\mathrm{rank}(X) \le r$, since they are not exact penalty functions for this constraint.
Inspired by the relation
  $\mathrm{rank}(X) \le r \iff \sum_{i=r+1}^n \lambda_i^p(X) = 0$  for $X \succeq 0$
and the good computational performance of the $p$-norm with $p \in (0, 1]$ for sparsity, we propose the following penalty model for problem (1.1):
  $\min_{0 \preceq X \preceq I} F_\mu(X) := f(X) + \mu \sum_{i=r+1}^n \lambda_i^p(X)$,  (1.2)
where $\mu > 0$ and $\lambda_i(X)$ ($i = 1, \ldots, n$) is the $i$-th largest eigenvalue of $X$. Such a penalty term with $p = 1$ has been used in [11] for solving a nearest low-rank correlation matrix problem. Nevertheless, we observe in numerical experiments that the penalty term with $p \in (0, 1)$ is generally more efficient than $p = 1$ in producing a low-rank solution of problem (1.1).
The main contributions of this paper are as follows.
• We propose a new penalty model (1.2) for the low-rank constrained problem (1.1) and prove that (1.2) is an exact penalty reformulation of (1.1) in the following sense: there exists some $\bar\mu > 0$ such that for any $\mu > \bar\mu$, $X^*$ is a global minimizer of problem (1.1) if and only if it is a global minimizer of problem (1.2). Furthermore, for any $\mu \ge \bar\mu$, any local minimizer of problem (1.1) is a local minimizer of problem (1.2).
• We propose a nonmonotone proximal gradient (NPG) method for solving the penalty model (1.2). Although the associated proximal subproblem is sophisticated and challenging due to the partial sum of eigenvalues, we reduce it to a set of univariate root-finding problems and show that they can be solved by Newton's method with global quadratic convergence.
• We propose an adaptive penalty method (APM) for (1.1) with a suitable updating scheme for the penalty parameter, in which each penalty subproblem is solved by the aforementioned NPG method. We establish its global convergence and also provide an estimate of the iteration complexity for finding an approximate stationary point of (1.1).
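As a small illustration of the penalty model (not part of the paper's algorithms), the term $\sum_{i=r+1}^n \lambda_i^p(X)$ in (1.2) can be evaluated with a dense eigenvalue decomposition. The following sketch is ours; the function name and the clipping safeguard against tiny negative eigenvalues are assumptions, not from the paper:

```python
import numpy as np

def rank_penalty(X, r, p):
    """Sum of the p-th powers of the n - r smallest eigenvalues of a
    symmetric PSD matrix X, i.e. the penalty term in (1.2)."""
    lam = np.linalg.eigvalsh(X)                       # eigenvalues in ascending order
    tail = np.clip(lam[: X.shape[0] - r], 0.0, None)  # n - r smallest, clipped at 0
    return float(np.sum(tail ** p))

# A rank-2 projector satisfies rank(X) <= 2, so its penalty is (numerically) zero.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((5, 2)))
X = Q @ Q.T
print(rank_penalty(X, r=2, p=0.5))   # near zero up to rounding
```

Consistent with the equivalence stated above, the penalty vanishes exactly on the feasible set of (1.1) and is positive otherwise.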
The rest of this paper is organized as follows. In Section 2, notation and preliminaries are given. In Section 3, we show that the penalty model (1.2) is an exact penalty reformulation of problem (1.1). In Section 4, we present an NPG algorithm for solving the penalty problem (1.2). In Section 5, we propose an APM for solving problem (1.1). In Section 6, we present numerical experiments for solving a sensor network localization problem and a nearest low-rank correlation matrix problem.
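Since the semidefinite-box constraint $0 \preceq X \preceq I$ recurs throughout the sections outlined above, it may help to note that projecting a symmetric matrix onto this set amounts to clipping its eigenvalues to $[0, 1]$. A minimal sketch, with our own helper name (not code from the paper):

```python
import numpy as np

def proj_box(X):
    """Projection of a symmetric matrix X onto {X : 0 <= X <= I} in the
    Loewner order: clip the eigenvalues of X to the interval [0, 1]."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.clip(lam, 0.0, 1.0)) @ U.T

A = np.array([[2.0, 0.3],
              [0.3, -1.0]])
P = proj_box(A)
lam = np.linalg.eigvalsh(P)
print(lam)  # all eigenvalues lie in [0, 1]
```

The decomposition-and-clip form follows because the Frobenius-norm projection of a symmetric matrix onto a spectral set acts eigenvalue-wise.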
2 Notation and preliminaries
The following notation will be used throughout this paper. Given any $x \in \mathbb{R}^n$, $x_{[i]}$ denotes the $i$-th largest entry of $x$ and $\mathrm{supp}(x)$ denotes the support of $x$, namely, $\mathrm{supp}(x) = \{i : x_i \ne 0\}$. The symbol $1_n$ denotes the all-ones vector of dimension $n$. Given $x, y \in \mathbb{R}^n$ and $\Omega \subseteq \mathbb{R}^n$, $x \le y$ means $x_i \le y_i$ for all $i$, and $\delta_\Omega(\cdot)$ is the indicator function of $\Omega$, i.e., $\delta_\Omega(x) = 0$ if $x \in \Omega$ and $\delta_\Omega(x) = \infty$ otherwise. For $x \in \mathbb{R}^n$ and a closed convex set $\Omega \subseteq \mathbb{R}^n$, $P_\Omega(x)$ is the projection of $x$ onto $\Omega$. The space of symmetric $n \times n$ matrices is denoted by $\mathcal{S}^n$. If $X \in \mathcal{S}^n$ is positive semidefinite, we write $X \succeq 0$. Given any $X$ and $Y$ in $\mathcal{S}^n$, $X \preceq Y$ means $Y - X$ is positive semidefinite. In addition, given matrices $X$ and $Y$ in $\mathcal{S}^n$, $\langle X, Y \rangle$ denotes their trace inner product. We use $\mathcal{C}$ and $\Omega$ to denote the feasible regions of problems (1.2) and (1.1), respectively, that is,
  $\mathcal{C} := \{X \in \mathcal{S}^n : 0 \preceq X \preceq I\}$,  $\Omega := \{X \in \mathcal{S}^n : 0 \preceq X \preceq I,\ \mathrm{rank}(X) \le r\}$.  (2.1)
Given any $X \in \mathcal{S}^n$, let $X_\Omega$ be a projection of $X$ onto $\Omega$, that is, $X_\Omega \in \Omega$ and
  $\|X - X_\Omega\|_F = \min_{Z \in \Omega} \|X - Z\|_F$.  (2.2)
Recall that $f$ is assumed to be continuously differentiable in $\mathcal{C}$. Since $\mathcal{C}$ is compact, it follows that $f$ is Lipschitz continuous in $\mathcal{C}$, that is, there exists some constant $L_f > 0$ such that
  $|f(X) - f(Y)| \le L_f \|X - Y\|_F$,  $\forall X, Y \in \mathcal{C}$.  (2.3)
Before ending this section, we present some preliminary technical results that will be used subsequently.
Lemma 2.1. Let $p \in (0, 1]$ and $X_\Omega$ be a projection of $X$ onto $\Omega$. Then it holds that
  $\|X - X_\Omega\|_F \le \sum_{i=r+1}^n \lambda_i^p(X)$,  $\forall X \in \mathcal{C}$.  (2.4)
Proof. By Proposition 2.6 of [19], it is not hard to show that
  $\|X - X_\Omega\|_F = \sqrt{\sum_{i=r+1}^n \lambda_i^2(X)}$,  $\forall X \in \mathcal{C}$.  (2.5)
Notice from (2.1) that $0 \le \lambda_i(X) \le 1$ for all $i$ and $X \in \mathcal{C}$. In view of this fact and $p \in (0, 1]$, one can observe that
  $\sqrt{\sum_{i=r+1}^n \lambda_i^2(X)} \le \sum_{i=r+1}^n \lambda_i(X) \le \sum_{i=r+1}^n \lambda_i^p(X)$,  $\forall X \in \mathcal{C}$.  (2.6)
It then follows from this relation and (2.5) that (2.4) holds as desired. This completes the proof.

3 Exact penalty reformulation

In this section we study the relationship between the penalty model (1.2) and problem (1.1). The following theorem shows that (1.2) is an exact penalty reformulation of (1.1) in terms of global minimizers.
Theorem 3.1.
Let $p \in (0, 1]$. For any $\mu \ge L_f$, any global minimizer of problem (1.1) is a global minimizer of problem (1.2). Conversely, for any $\mu > L_f$, any global minimizer of (1.2) is also a global minimizer of (1.1).
Proof. For the first part, let $X^*$ be a global minimizer of (1.1) and let $X$ be an arbitrary matrix in $\mathcal{C}$. Let $X_\Omega$ denote a projection of $X$ onto $\Omega$. Then we know from the global optimality of $X^*$ that $f(X_\Omega) \ge f(X^*)$. Using this relation and (2.3), we have
  $f(X^*) - f(X) = f(X^*) - f(X_\Omega) + f(X_\Omega) - f(X) \le f(X_\Omega) - f(X) \le L_f \|X - X_\Omega\|_F$.  (3.1)
This together with (2.4), $\mu \ge L_f$ and $\mathrm{rank}(X^*) \le r$ implies that
  $f(X) + \mu \sum_{i=r+1}^n \lambda_i^p(X) \ge f(X) + L_f \|X - X_\Omega\|_F \ge f(X^*) = f(X^*) + \mu \sum_{i=r+1}^n \lambda_i^p(X^*)$,
which together with the arbitrariness of $X \in \mathcal{C}$ and $X^* \in \mathcal{C}$ implies that $X^*$ is a global minimizer of (1.2).
For the second part, assume $\mu > L_f$. Let $X^*$ be a global minimizer of problem (1.2) and $X^*_\Omega$ be a projection of $X^*$ onto $\Omega$. It is easy to observe that if $X^* \in \Omega$, then it is a global minimizer of problem (1.1). Thus it suffices to show that $X^* \in \Omega$. Suppose for contradiction that $X^* \notin \Omega$. Then we have $\|X^* - X^*_\Omega\|_F > 0$, and hence
  $f(X^*_\Omega) \le f(X^*) + L_f \|X^* - X^*_\Omega\|_F < f(X^*) + \mu \|X^* - X^*_\Omega\|_F \le f(X^*) + \mu \sum_{i=r+1}^n \lambda_i^p(X^*) \le f(X^*_\Omega)$,
where the first inequality follows from (2.3), the second inequality is due to $\mu > L_f$, the third inequality is due to (2.4), and the last inequality follows from the global optimality of $X^*$ for (1.2) together with $\mathrm{rank}(X^*_\Omega) \le r$. These inequalities immediately lead to the contradiction $f(X^*_\Omega) < f(X^*_\Omega)$. This completes the proof.

We show in the next theorem that any local minimizer of problem (1.1) is also a local minimizer of problem (1.2), provided $\mu \ge L_f$.
Theorem 3.2. Let $p \in (0, 1]$. For any $\mu \ge L_f$, any local minimizer of problem (1.1) is a local minimizer of problem (1.2).
Proof. Suppose that $X^*$ is an arbitrary local minimizer of problem (1.1) and $\mu \ge L_f$. Then there exists some $\varepsilon > 0$ such that
  $f(X) \ge f(X^*)$,  $\forall X \in B(X^*; \varepsilon) \cap \Omega$.  (3.2)
It follows from (2.2) that for every $X \in B(X^*; \varepsilon/2)$,
  $\|X_\Omega - X^*\|_F \le \|X_\Omega - X\|_F + \|X - X^*\|_F \le 2\|X - X^*\|_F \le \varepsilon$,
where $X_\Omega$ is a projection of $X$ onto $\Omega$. This implies $X_\Omega \in B(X^*; \varepsilon) \cap \Omega$ for every $X \in B(X^*; \varepsilon/2)$. It follows from this and (3.2) that $f(X_\Omega) \ge f(X^*)$ for any $X \in B(X^*; \varepsilon/2)$. Using this relation and (2.3), we see that (3.1) also holds for every $X \in B(X^*; \varepsilon/2) \cap \mathcal{C}$. In view of (2.4), (3.1) and an argument similar to the proof of Theorem 3.1, one can obtain that for every $X \in B(X^*; \varepsilon/2) \cap \mathcal{C}$,
  $f(X) + \mu \sum_{i=r+1}^n \lambda_i^p(X) \ge f(X) + L_f \|X - X_\Omega\|_F \ge f(X^*) = f(X^*) + \mu \sum_{i=r+1}^n \lambda_i^p(X^*)$,
where the equality is due to $\mathrm{rank}(X^*) \le r$. Hence, $X^*$ is a local minimizer of problem (1.2). This completes the proof.

4 A nonmonotone proximal gradient method for solving (1.2)

In this section, we present a nonmonotone proximal gradient (NPG) method for solving problem (1.2), which is similar to the one proposed by Wright et al. [26]. We show that the subproblems arising in NPG can be efficiently solved. We also establish convergence of this method.

4.1 NPG algorithm and convergence

We first present an NPG method for solving problem (1.2).

Algorithm 1 Nonmonotone proximal gradient (NPG) method for (1.2)
Initialization. Let $0 < L_{\min} < L_{\max}$, $\gamma > 1$, $c > 0$ and an integer $N \ge 0$ be given. Choose an arbitrary $0 \preceq X^0 \preceq I$ and set $k = 0$.
Step 1. Choose $L_k^0 \in [L_{\min}, L_{\max}]$ arbitrarily. Set $L_k = L_k^0$.
(1a) Solve the subproblem
  $X^{k+1} \in \mathrm{Arg\,min}_{0 \preceq X \preceq I} \left\{ \langle \nabla f(X^k), X - X^k \rangle + \frac{L_k}{2} \|X - X^k\|_F^2 + \mu \sum_{i=r+1}^n \lambda_i^p(X) \right\}$.  (4.1)
(1b) Go to Step 2 if
  $F_\mu(X^{k+1}) \le \max_{[k-N]_+ \le i \le k} F_\mu(X^i) - \frac{c}{2} \|X^{k+1} - X^k\|_F^2$.  (4.2)
(1c) Set $L_k \leftarrow \gamma L_k$ and go to (1a).
Step 2. Set $k \leftarrow k + 1$ and go to Step 1.

Remark 4.1. (i) When $N = 0$, the sequence $\{F_\mu(X^k)\}$ is monotonically decreasing. Otherwise, it may increase at some iterations and thus the above method is generally a nonmonotone method.
(ii) A popular choice of $L_k^0$ is given by the following formula proposed by Barzilai and Borwein [1] (see also [2]):
  $L_k^0 = \max\left\{ L_{\min}, \min\left\{ L_{\max}, \frac{\langle S^k, Y^k \rangle}{\|S^k\|_F^2} \right\} \right\}$,  (4.3)
where $S^k = X^k - X^{k-1}$ and $Y^k = \nabla f(X^k) - \nabla f(X^{k-1})$.
We next study the convergence of the NPG method for solving problem (1.2). Before proceeding, we introduce two definitions, which can be found in [24].
Definition 4.1 (limiting subdifferential). For a lower semicontinuous function $g$ on $\mathcal{S}^n$, the limiting subdifferential of $g$ at $X \in \mathcal{S}^n$ is defined as
  $\partial g(X) := \left\{ V : \exists\, Z^k \xrightarrow{g} X,\ V^k \to V \text{ with } \liminf_{Z \to Z^k} \frac{g(Z) - g(Z^k) - \langle V^k, Z - Z^k \rangle}{\|Z - Z^k\|_F} \ge 0 \ \ \forall k \right\}$,
where $Z^k \xrightarrow{g} X$ means $Z^k \to X$ and $g(Z^k) \to g(X)$.
Definition 4.2 (first-order stationary point). We say that $X^*$ is a first-order stationary point of (1.2) if $X^* \in \mathcal{C}$ and
  $0 \in \nabla f(X^*) + \partial(\mu \Theta(X^*) + \delta_{\mathcal{C}}(X^*))$,  (4.4)
where $\Theta(X) := \sum_{i=r+1}^n \lambda_i^p(X)$, $\mathcal{C}$ is defined in (2.1) and $\partial(\cdot)$ is given in Definition 4.1.
Notice from [24, Theorem 10.1] and [24, Exercise 10.10] that any local minimizer $\bar X \in \mathcal{C}$ of (1.2) is a first-order stationary point of (1.2).
The following theorem states that at each outer iteration of Algorithm 1, the number of its inner iterations is uniformly bounded. Its proof is similar to that of [19, Theorem 4.2].
Theorem 4.1. For each $k \ge 0$, the inner termination criterion (4.2) is satisfied after at most
  $\max\left\{ \frac{\log(L_{\nabla f} + c) - \log(L_{\min})}{\log \gamma} + 1,\ 1 \right\}$
inner iterations, where $L_{\nabla f}$ is the Lipschitz constant associated with $\nabla f$.
We next show that any accumulation point of $\{X^k\}$ is a first-order stationary point of problem (1.2).
Theorem 4.2. Let the sequence $\{X^k\}$ be generated by Algorithm 1. The following statements hold:
(i) $\|X^{k+1} - X^k\|_F \to 0$ as $k \to \infty$;
(ii) any accumulation point of $\{X^k\}$ is a first-order stationary point of (1.2).
Proof. (i) The proof is similar to that of [26, Lemma 4].
(ii) Let $\bar L_k$ be the final value of $L_k$ at the $k$th outer iteration. It follows from Theorem 4.1 that $\{\bar L_k\}$ is bounded.
By the first-order optimality condition of (4.1), we have $X^{k+1} \in \mathcal{C}$ and
  $0 \in \nabla f(X^k) + \bar L_k (X^{k+1} - X^k) + \partial(\mu \Theta(X^{k+1}) + \delta_{\mathcal{C}}(X^{k+1}))$,  (4.5)
where $\Theta(X) := \sum_{i=r+1}^n \lambda_i^p(X)$. Notice that $\{X^k\} \subset \mathcal{C}$ and $\mathcal{C}$ is bounded. Hence, $\{X^k\}$ is bounded and has at least one accumulation point, say $X^*$. Let $K$ be a subsequence index set such that $\{X^k\}_K \to X^*$, which together with $\{X^k\} \subset \mathcal{C}$ and $\|X^{k+1} - X^k\|_F \to 0$ implies that $X^* \in \mathcal{C}$ and $\{X^{k+1}\}_K \to X^*$. Using this, the boundedness of $\{\bar L_k\}$, the continuity of $\nabla f$, and the outer semicontinuity of $\partial(\mu \Theta + \delta_{\mathcal{C}})$ [24, Proposition 8.7], and taking limits on both sides of (4.5) as $k \in K \to \infty$, we have
  $0 \in \nabla f(X^*) + \partial(\mu \Theta(X^*) + \delta_{\mathcal{C}}(X^*))$.
Hence, $X^*$ is a first-order stationary point of (1.2). This completes the proof.

4.2 An efficient algorithm for solving subproblem (4.1)

In this subsection we propose an efficient algorithm for solving subproblem (4.1). To proceed, we first consider the parametric univariate optimization problem
  $\min_{0 \le z \le 1} \Phi(z, t) := \frac{1}{2}(z - t)^2 + \nu z^p$  (4.6)
for $\nu > 0$ and $p \in (0, 1]$. Clearly, problem (4.6) has at least one optimal solution $z^*$ and $\Phi(z^*, t)$ is well defined for any $t \in (-\infty, \infty)$. In addition, it is not hard to see that for $p = 1$, problem (4.6) has the unique optimal solution $z^* = \min(1, \max(t - \nu, 0))$. In what follows, we study some properties of the optimal solution set of (4.6) for $p \in (0, 1)$.
Lemma 4.1. Let $Z^*(t)$ denote the set of optimal solutions of problem (4.6) for $t \in (-\infty, \infty)$ and $p \in (0, 1)$. Let
  $\alpha := \min\left\{ [2(1-p)\nu]^{\frac{1}{2-p}},\ 1 \right\}$,  $\beta := [\nu p(1-p)]^{\frac{1}{2-p}}$,  (4.7)
  $t_1 := \frac{\alpha}{2} + \nu \alpha^{p-1}$,  $t_2 := \max\left\{ \frac{1}{2} + \nu,\ 1 + \nu p \right\}$.  (4.8)
Then the following statements hold:
(i) $0 \in Z^*(t)$ if and only if $t \le t_1$;
(ii) $1 \in Z^*(t)$ if and only if $t \ge t_2$;
(iii) $Z^*(t) = \{z^*\} \subseteq [\beta, \min\{t, 1\})$ if and only if $t \in (t_1, t_2)$, where $z^*$ is the unique root of the equation
  $g(z) := z - t + \nu p z^{p-1} = 0$  (4.9)
in the interval $[\beta, \infty)$.
The proof of this lemma is given in the Appendix.
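The case analysis in Lemma 4.1 translates directly into a small numerical routine: return the closed-form answers for $t \le t_1$ and $t \ge t_2$, and otherwise locate the root of (4.9) by Newton's method. The sketch below uses our own helper name and minimal safeguards; it is not the paper's implementation:

```python
import numpy as np

def solve_univariate(t, nu, p, tol=1e-12, max_iter=50):
    """One optimal solution of min_{0<=z<=1} 0.5*(z-t)^2 + nu*z**p for
    p in (0,1), following the case analysis of Lemma 4.1."""
    alpha = min((2.0 * (1.0 - p) * nu) ** (1.0 / (2.0 - p)), 1.0)
    beta = (nu * p * (1.0 - p)) ** (1.0 / (2.0 - p))
    t1 = alpha / 2.0 + nu * alpha ** (p - 1.0)        # threshold (4.8), left
    t2 = max(0.5 + nu, 1.0 + nu * p)                  # threshold (4.8), right
    if t <= t1:
        return 0.0
    if t >= t2:
        return 1.0
    # Interior case: Newton's method on g(z) = z - t + nu*p*z**(p-1),
    # started at the midpoint of (beta, min(t, 1)) as in Remark 4.2.
    g = lambda z: z - t + nu * p * z ** (p - 1.0)
    if g(beta) == 0.0:
        return beta
    z = 0.5 * (beta + min(t, 1.0))
    for _ in range(max_iter):
        step = g(z) / (1.0 - nu * p * (1.0 - p) * z ** (p - 2.0))  # g(z)/g'(z)
        z -= step
        if abs(step) <= tol:
            break
    return z

z = solve_univariate(0.8, 0.1, 0.5)   # interior case: Newton root of (4.9)
```

Since $g$ is convex and increasing on $(\beta, \infty)$, the iteration overshoots the root at most once and then decreases monotonically to it, which is why no bracketing safeguard is included here.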
As an immediate consequence of Lemma 4.1, we obtain the following formula for computing an optimal solution of problem (4.6) for $p \in (0, 1)$.
Corollary 4.1. Let $Z^*(t)$ denote the set of optimal solutions of problem (4.6) for $t \in (-\infty, \infty)$ and $p \in (0, 1)$. Let $\beta$, $t_1$ and $t_2$ be defined in (4.7) and (4.8). Then we have $z^*(t) \in Z^*(t)$, where $z^* : \mathbb{R} \to [0, 1]$ is defined as follows:
  $z^*(t) = \begin{cases} 0 & \text{if } t \le t_1, \\ \tilde z^* & \text{if } t_1 < t < t_2, \\ 1 & \text{otherwise}, \end{cases}$  (4.10)
where $\tilde z^*$ is the unique root of equation (4.9) in $[\beta, \infty)$.
As seen from (4.10), the value of $z^*(t)$ is known exactly for $t \le t_1$ or $t \ge t_2$. Nevertheless, for $t \in (t_1, t_2)$, the exact value of $z^*(t)$ is typically unknown since equation (4.9) generally does not have a closed-form root. We next present an efficient numerical scheme for estimating the root $\tilde z^*$ of equation (4.9) by Newton's method.
Newton's method for solving (4.9): Let $\beta$, $t_1$, $t_2$ and $g(\cdot)$ be defined in (4.7), (4.8) and (4.9). Let $t \in (t_1, t_2)$ be given. If $g(\beta) = 0$, set $\tilde z^* = \beta$. Otherwise, choose $z_0 \in (\beta, \infty)$ and perform
  $z_{k+1} = z_k - g(z_k)/g'(z_k)$ for $k \ge 0$.  (4.11)
Remark 4.2. Recall from Lemma 4.1 (iii) that the unique root $\tilde z^*$ of equation (4.9) in $[\beta, \infty)$ lies in $[\beta, \min\{t, 1\})$. Therefore, for practical efficiency, it is natural to choose $z_0 = (\beta + \min\{t, 1\})/2$.
The following theorem shows that the above Newton's method is able to find an approximate root of equation (4.9) in $[\beta, \infty)$, and moreover, it is globally and quadratically convergent.
Theorem 4.3. Let $\beta$, $t_1$ and $t_2$ be defined in (4.7) and (4.8). Then for any $t \in (t_1, t_2)$ and $p \in (0, 1)$, Newton's method given above either finds the root $\tilde z^*$ of equation (4.9) or generates a sequence $\{z_k\}$ which is globally and quadratically convergent to $\tilde z^*$; in particular,
  $0 \le z_{k+1} - \tilde z^* \le \frac{\nu p(1-p)(2-p)(\tilde z^*)^{p-3}}{1 - \nu p(1-p)(\tilde z^*)^{p-2}} (z_k - \tilde z^*)^2$,  $\forall k \ge 1$.
Proof.
In view of Corollary 4.1, we know that for any $t \in (t_1, t_2)$ and $p \in (0, 1)$, equation (4.9) has a unique root $\tilde z^*$ in $[\beta, \infty)$. Therefore, if $g(\beta) = 0$, then $\tilde z^* = \beta$. Otherwise, $\tilde z^*$ is the unique root of (4.9) in $(\beta, \infty)$ and Newton's iteration (4.11) generates a sequence $\{z_k\}$. We have from (4.9) that
  $g'(z) = 1 - \nu p(1-p) z^{p-2}$,  $g''(z) = \nu p(1-p)(2-p) z^{p-3}$.  (4.12)
Notice that $g'(\beta) = 0$ and $\beta > 0$. It is easy to see that $g'(z) > 0$, $g''(z) > 0$ and $g''(z)$ is continuous for every $z \in (\beta, \infty)$. Hence, the assumptions of Lemma A.1 hold for $q = g$, $a = \beta$ and $z_* = \tilde z^*$. In addition, one can observe from (4.12) that $g'(\tilde z^*) = 1 - \nu p(1-p)(\tilde z^*)^{p-2}$ and $\max_{z \in [\tilde z^*, z_1]} g''(z) = \nu p(1-p)(2-p)(\tilde z^*)^{p-3}$. Therefore, the conclusion follows directly from Lemma A.1. This completes the proof.
The proof of the following lemma uses an approach similar to the one proposed in [18].
Lemma 4.2. Let $p \in (0, 1]$ and $\nu > 0$ be given, and let
  $V(t) := \underbrace{\min_{0 \le z \le 1} \left\{ \frac{1}{2}(z - t)^2 + \nu z^p \right\}}_{V_1(t)} - \underbrace{\min_{0 \le z \le 1} \frac{1}{2}(z - t)^2}_{V_2(t)}$,  $\forall t \in \mathbb{R}$.  (4.13)
Then $V(t)$ is increasing on $(-\infty, \infty)$.
Proof. It is not hard to observe from (4.13) that $V_1$, $V_2$ and $V$ are well defined. Also, we see from (4.6) that
  $V_1(t) = \min_{0 \le z \le 1} \Phi(z, t)$,  (4.14)
where $\Phi$ is defined in (4.6). Notice that $\Phi$ and $\nabla_t \Phi$ are continuous on $[0, \infty) \times \mathbb{R}$. Moreover,
  $|\nabla_t \Phi(z, t)| = |t - z| \le |t| + 1$,  $\forall z \in [0, 1]$.
Hence, $\Phi$ is locally Lipschitz in $t$, uniformly for all $z \in [0, 1]$. Using these facts, one can observe that the assumptions of [9, Theorem 2.1] hold for $g = \Phi$ and $U = [0, 1]$. Thus it follows from [9, Theorem 2.1] that $V_1$ is locally Lipschitz continuous on $\mathbb{R}$ and moreover
  $\partial V_1(t) = \mathrm{conv}(\{\nabla_t \Phi(z, t) : z \in Z^*(t)\}) = \mathrm{conv}(t - Z^*(t))$,  (4.15)
where $\partial V_1$ denotes the Clarke subdifferential of $V_1$, $\mathrm{conv}(\cdot)$ is the convex hull of the associated set, and $Z^*(t)$ denotes the set of optimal solutions of (4.14). Notice that $V_2$ is differentiable and moreover
  $\nabla V_2(t) = t - P_{[0,1]}(t)$,  $\forall t \in \mathbb{R}$.
It then follows from the above two equalities that