An Exact Penalty Method for Binary Optimization Based on MPEC Formulation

Ganzhao Yuan and Bernard Ghanem
King Abdullah University of Science and Technology (KAUST), Saudi Arabia
[email protected], [email protected]

Abstract

Binary optimization is a central problem in mathematical optimization and its applications are abundant. To solve this problem, we propose a new class of continuous optimization techniques, which is based on Mathematical Programming with Equilibrium Constraints (MPECs). We first reformulate the binary program as an equivalent augmented biconvex optimization problem with a bilinear equality constraint, then we propose an exact penalty method to solve it. The resulting algorithm seeks a desirable solution to the original problem via solving a sequence of linear programming convex relaxation subproblems. In addition, we prove that the penalty function, induced by adding the complementarity constraint to the objective, is exact, i.e., it has the same local and global minima as those of the original binary program when the penalty parameter exceeds some threshold. The convergence of the algorithm can be guaranteed, since it essentially reduces to block coordinate descent in the literature. Finally, we demonstrate the effectiveness of our method on the problem of dense subgraph discovery. Extensive experiments show that our method outperforms existing techniques, such as iterative hard thresholding and linear programming relaxation.

1 Introduction

In this paper, we mainly focus on the following binary optimization problem:

$$\min_{\mathbf{x}}\ f(\mathbf{x}),\ \ \text{s.t.}\ \mathbf{x}\in\{-1,+1\}^n,\ \mathbf{x}\in\Omega \qquad (1)$$

where the objective function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ is convex but not necessarily smooth on some convex set $\Omega$, and the non-convexity of (1) is only caused by the binary constraints. In addition, we assume $\{-1,+1\}^n\cap\Omega\neq\emptyset$.

The optimization in (1) describes many applications of interest in both computer vision and machine learning, including graph bisection (Goemans and Williamson, 1995; Keuchel et al., 2003), Markov random fields (Boykov, Veksler, and Zabih, 2001), the permutation problem (Jiang, Liu, and Wen, 2016; Fogel et al., 2015), graph matching (Cour, Srinivasan, and Shi, 2007; Toshev, Shi, and Daniilidis, 2007; Zaslavskiy, Bach, and Vert, 2009), image (co-)segmentation (Shi and Malik, 2000; Joulin, Bach, and Ponce, 2010), image registration (Wang et al., 2016), and social network analysis (e.g., subgraph discovery (Yuan and Zhang, 2013; Ames, 2015), biclustering (Ames, 2014), planted clique and biclique discovery (Ames and Vavasis, 2011), and community discovery (He et al., 2016; Chan and Yeung, 2011)), etc.

The binary optimization problem is difficult to solve, since it is NP-hard. One type of method to solve this problem is continuous in nature. The simplest way is to relax the binary constraint with Linear Programming (LP) relaxation constraints $-1\le\mathbf{x}\le 1$ and round the entries of the resulting continuous solution to the nearest integer at the end. However, not only may this solution not be optimal, it may not even be feasible and may violate some constraint. Another type of optimization focuses on the cutting-plane and branch-and-cut methods. The cutting-plane method solves the LP relaxation and then adds linear constraints that drive the solution towards integers. The branch-and-cut method partially develops a binary tree and iteratively cuts out the nodes having a lower bound that is worse than the current upper bound, while the lower bound can be found using convex relaxation, Lagrangian duality, or Lipschitz continuity. However, this class of methods ends up solving all $2^n$ convex subproblems in the worst case. Our algorithm aligns with the first research direction. It relies on solving a convex LP relaxation subproblem iteratively, but it provably terminates in a polynomial number of iterations.

In non-convex optimization, good initialization is very important to the quality of the solution. Motivated by this, several papers design smart initialization strategies and establish optimality qualifications of the solutions for non-convex problems. For example, the work of (Zhang, 2010) considers a multi-stage convex optimization algorithm to refine the global solution by the initial convex method; the work of (Candès, Li, and Soltanolkotabi, 2015) starts with a careful initialization obtained by a spectral method and improves this estimate by gradient descent; the work of (Jain, Netrapalli, and Sanghavi, 2013) uses the top-k singular vectors of the matrix as initialization and provides theoretical guarantees for a biconvex alternating minimization algorithm. The proposed method also uses a similar initialization strategy, since it reduces to convex LP relaxation in the first iteration.

The contributions of this paper are three-fold. (a) We reformulate the binary program as an equivalent augmented optimization problem with a bilinear equality constraint via a variational characterization of the binary constraint. Then, we propose an exact penalty method to solve it. The resulting algorithm seeks a desirable solution to the original binary program. (b) We prove that the penalty function, induced by adding the complementarity constraint to the objective, is exact, i.e., the set of its globally optimal solutions coincides with that of (1) when the penalty parameter exceeds some threshold. Thus, the convergence of the algorithm can be guaranteed, since it reduces to block coordinate descent in the literature (Tseng, 2001; Bolte, Sabach, and Teboulle, 2014). To our knowledge, this is the first attempt to solve general non-smooth binary optimization with guaranteed convergence. (c) We provide numerical comparisons with state-of-the-art techniques, such as iterative hard thresholding (Yuan and Zhang, 2013) and linear programming relaxation (Komodakis and Tziritas, 2007; Kumar, Kolmogorov, and Torr, 2009) on dense subgraph discovery. Extensive experiments demonstrate the effectiveness of our proposed method.
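To make the relax-and-round baseline above concrete, here is a minimal Python sketch (illustrative only, not from the paper): it assumes a smooth convex objective given by a hypothetical gradient callback `grad_f` and takes $\Omega=\mathbb{R}^n$, so that the box $[-1,1]^n$ is the only constraint; it minimizes the relaxation by projected gradient descent and rounds with the sign function at the end.

```python
import numpy as np

def relax_and_round(grad_f, n, steps=500, lr=0.01):
    """LP-style relax-and-round baseline for problem (1).

    Relaxes x in {-1,+1}^n to the box -1 <= x <= 1 (Omega assumed to be R^n),
    minimizes the convex objective by projected gradient descent, and rounds
    the continuous solution to the nearest binary point at the end.
    """
    x = np.zeros(n)
    for _ in range(steps):
        x = x - lr * grad_f(x)          # gradient step on the relaxation
        x = np.clip(x, -1.0, 1.0)       # projection onto the box [-1, 1]^n
    # Rounding: may lose optimality and can violate constraints in Omega.
    return np.where(x >= 0.0, 1.0, -1.0)

# Example: f(x) = 0.5 * ||x - c||^2 pulls entries toward a continuous target c.
rng = np.random.default_rng(0)
c = rng.uniform(-1, 1, size=8)
x_bin = relax_and_round(lambda x: x - c, n=8)
print(x_bin)  # entries of c rounded to +/-1
```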

Table 1: Existing continuous methods for binary optimization.

Relaxed Approximation:
- spectral relaxation (Cour and Shi, 2007): $\{-1,+1\}^n \approx \{\mathbf{x} \mid \|\mathbf{x}\|_2^2 = n\}$
- linear programming relaxation (Komodakis and Tziritas, 2007): $\{-1,+1\}^n \approx \{\mathbf{x} \mid -1 \le \mathbf{x} \le 1\}$
- SDP relaxation (Wang et al., 2016): $\{0,+1\}^n \approx \{\mathbf{x} \mid \mathbf{X}\succeq\mathbf{x}\mathbf{x}^T,\ \mathrm{diag}(\mathbf{X})=\mathbf{x}\}$; $\{-1,+1\}^n \approx \{\mathbf{x} \mid \mathbf{X}\succeq\mathbf{x}\mathbf{x}^T,\ \mathrm{diag}(\mathbf{X})=\mathbf{1}\}$
- doubly positive relaxation (Huang, Chen, and Guibas, 2014): $\{0,+1\}^n \approx \{\mathbf{x} \mid \mathbf{X}\succeq\mathbf{x}\mathbf{x}^T,\ \mathrm{diag}(\mathbf{X})=\mathbf{x},\ \mathbf{x}\ge 0,\ \mathbf{X}\ge 0\}$
- completely positive relaxation (Burer, 2009): $\{0,+1\}^n \approx \{\mathbf{x} \mid \mathbf{X}\succeq\mathbf{x}\mathbf{x}^T,\ \mathrm{diag}(\mathbf{X})=\mathbf{x},\ \mathbf{x}\ge 0,\ \mathbf{X}\text{ is CP}\}$
- SOCP relaxation (Kumar, Kolmogorov, and Torr, 2009): $\{-1,+1\}^n \approx \{\mathbf{x} \mid \langle\mathbf{X}-\mathbf{x}\mathbf{x}^T,\mathbf{L}\mathbf{L}^T\rangle\ge 0,\ \mathrm{diag}(\mathbf{X})=\mathbf{1}\},\ \forall\mathbf{L}$

Equivalent Optimization:
- iterative hard thresholding (Yuan and Zhang, 2013): $\min_{\mathbf{x}} \|\mathbf{x}-\mathbf{x}^0\|_2^2$, s.t. $\mathbf{x}\in\{-1,+1\}^n$
- piecewise separable reformulation (Zhang et al., 2007): $\{-1,+1\}^n \Leftrightarrow \{\mathbf{x} \mid (\mathbf{1}+\mathbf{x})\odot(\mathbf{1}-\mathbf{x})=\mathbf{0}\}$
- $\ell_0$ norm non-separable reformulation (Yuan and Ghanem, 2016b): $\{-1,+1\}^n \Leftrightarrow \{\mathbf{x} \mid \|\mathbf{x}+\mathbf{1}\|_0+\|\mathbf{x}-\mathbf{1}\|_0\le n\}$
- $\ell_2$ box non-separable reformulation (Murray and Ng, 2010): $\{-1,+1\}^n \Leftrightarrow \{\mathbf{x} \mid -1\le\mathbf{x}\le 1,\ \|\mathbf{x}\|_2^2=n\}$
- $\ell_p$ box non-separable reformulation (Wu and Ghanem, 2016): $\{-1,+1\}^n \Leftrightarrow \{\mathbf{x} \mid -1\le\mathbf{x}\le 1,\ \|\mathbf{x}\|_p^p=n,\ 0<p<\infty\}$
- $\ell_2$ box non-separable MPEC [This paper]: $\{-1,+1\}^n \Leftrightarrow \{\mathbf{x} \mid -1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \langle\mathbf{x},\mathbf{v}\rangle=n,\ \forall\mathbf{v}\}$

Notations. We use lowercase and uppercase boldfaced letters to denote real vectors and matrices, respectively. The Euclidean inner product between $\mathbf{x}$ and $\mathbf{y}$ is denoted by $\langle\mathbf{x},\mathbf{y}\rangle$ or $\mathbf{x}^T\mathbf{y}$. $\mathbf{X}\succeq 0$ means that matrix $\mathbf{X}$ is positive semidefinite. Finally, $\mathrm{sign}$ is a signum function with $\mathrm{sign}(0)=\pm 1$.

2 Related Work

This paper proposes a new continuous method for binary optimization. We briefly review existing related work in this research direction (see Table 1).

There are generally two types of methods in the literature. One is the relaxed approximation method. Spectral relaxation (Cour and Shi, 2007; Olsson, Eriksson, and Kahl, 2007; Shi and Malik, 2000) replaces the binary constraint with a spherical one and solves the problem using eigendecomposition. Despite its computational merits, it is difficult to generalize to handle linear or nonlinear constraints. Linear programming relaxation (Komodakis and Tziritas, 2007; Kumar, Kolmogorov, and Torr, 2009) transforms the NP-hard optimization problem into a convex box-constrained optimization problem, which can be solved by well-established optimization methods and software. Semi-Definite Programming (SDP) relaxation (Huang, Chen, and Guibas, 2014) uses a lifting technique $\mathbf{X}=\mathbf{x}\mathbf{x}^T$ and relaxes it to the convex conic constraint $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^T$ [1] to handle the binary constraint. Combining this with a unit-ball randomized rounding algorithm, the work of (Goemans and Williamson, 1995) proves that at least a factor of 87.8% of the global optimal solution can be achieved for the graph bisection problem. Since the original paper of (Goemans and Williamson, 1995), SDP has been applied to develop numerous approximation algorithms for NP-hard problems. As more constraints lead to tighter bounds for the objective, doubly positive relaxation considers constraining both the eigenvalues and the elements of the SDP solution to be nonnegative, leading to better solutions than canonical SDP methods. In addition, Completely Positive (CP) relaxation (Burer, 2010, 2009) further constrains the entries of the factorization $\mathbf{X}=\mathbf{L}\mathbf{L}^T$ of the solution to be nonnegative ($\mathbf{L}\ge 0$). It can be solved by tackling its associated dual co-positive program, which is related to the study of indefinite optimization and sum-of-squares optimization in the literature. Second-Order Cone Programming (SOCP) relaxation relaxes the SDP cone into the nonnegative orthant (Kumar, Kolmogorov, and Torr, 2009) using the fact that $\langle\mathbf{X}-\mathbf{x}\mathbf{x}^T,\mathbf{L}\mathbf{L}^T\rangle\ge 0,\ \forall\mathbf{L}$, resulting in a tighter bound than the LP method, but looser than that of the SDP method. Therefore, it can be viewed as a balance between efficiency and efficacy.

[1] Using the Schur complement lemma, one can rewrite $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^T$ as $\begin{pmatrix}\mathbf{X}&\mathbf{x}\\ \mathbf{x}^T&1\end{pmatrix}\succeq 0$.
Another type of methods for binary optimization relates to equivalent optimization. The iterative hard thresholding method directly handles the non-convex constraint via projection and has been widely used due to its simplicity and efficiency (Yuan and Zhang, 2013). However, this method is often observed to obtain sub-optimal accuracy and it is not directly applicable when the objective is non-smooth. A piecewise separable reformulation has been considered in (Zhang et al., 2007), which can exploit existing smooth optimization techniques. Binary optimization can also be reformulated as an $\ell_0$ norm semi-continuous optimization problem; thus, existing $\ell_0$ norm sparsity-constrained optimization techniques, such as the quadratic penalty decomposition method (Lu and Zhang, 2013) and the multi-stage convex optimization method (Zhang, 2010; Yuan and Ghanem, 2016b), can be applied. A continuous $\ell_2$ box non-separable reformulation [2] has been used in the literature (Raghavachari, 1969; Kalantari and Rosen, 1982), and a second-order interior point method (Murray and Ng, 2010; De Santis and Rinaldi, 2012) has been developed to solve this continuous reformulation; a quick numerical check of this equivalence is sketched below.

[2] They replace $\mathbf{x}\in\{0,1\}^n$ with $0\le\mathbf{x}\le 1,\ \mathbf{x}^T(\mathbf{1}-\mathbf{x})=0$. We extend this strategy to replace $\{-1,+1\}^n$ with $-1\le\mathbf{x}\le 1,\ (\mathbf{1}+\mathbf{x})^T(\mathbf{1}-\mathbf{x})=0$, which reduces to $\|\mathbf{x}\|_\infty\le 1,\ \|\mathbf{x}\|_2^2=n$.
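The $\ell_2$ box equivalence just cited rests on the fact that, over the box, $\|\mathbf{x}\|_2^2 \le n$ with equality exactly at the binary vertices. A small illustrative Python check (our own, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Any interior point of the box satisfies ||x||_2^2 < n ...
x_interior = rng.uniform(-0.99, 0.99, size=n)
assert np.sum(x_interior**2) < n
# ... while the constraint ||x||_2^2 = n together with -1 <= x <= 1
# is met only by binary vectors:
x_binary = rng.choice([-1.0, 1.0], size=n)
assert np.isclose(np.sum(x_binary**2), n)
# Pushing any coordinate strictly inside (-1, 1) breaks the equality:
x_perturbed = x_binary.copy(); x_perturbed[0] = 0.5
assert np.sum(x_perturbed**2) < n
print("l2-box check passed: ||x||_2^2 = n on the box forces |x_i| = 1")
```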
A continuous $\ell_p$ box non-separable reformulation has recently been used in (Wu and Ghanem, 2016), where an interesting geometric illustration of the $\ell_p$-box intersection has been shown [3]. In addition, they infuse this equivalence into the optimization framework of the Alternating Direction Method of Multipliers (ADMM). However, their guarantee of convergence is weak. In this paper, to tackle the problem of binary optimization, we propose a new framework that is based on Mathematical Programming with Equilibrium Constraints (MPECs). Our resulting algorithm is theoretically convergent and empirically effective.

[3] We adapt their formulation to our $\{-1,+1\}$ formulation.

Mathematical programs with equilibrium constraints are optimization problems where the constraints include complementarities or variational inequalities. They are difficult to deal with because their feasible region may not necessarily be convex, or even connected. Motivated by recent developments of MPECs for non-convex optimization (Yuan and Ghanem, 2015, 2016a,b), we consider a continuous $\ell_2$ box non-separable MPEC for binary optimization [4].

[4] For a $\{0,+1\}$ binary variable, we have: $\{0,+1\}^n\Leftrightarrow\{\mathbf{x}\mid 0\le\mathbf{x}\le 1,\ \|2\mathbf{v}-\mathbf{1}\|_2^2\le n,\ \langle 2\mathbf{x}-\mathbf{1},2\mathbf{v}-\mathbf{1}\rangle=n,\ \forall\mathbf{v}\}$.

3 An Exact Penalty Method

This section presents an exact penalty method for binary optimization, which is based on a new MPEC formulation. First, we present our reformulation of the binary constraint.

Lemma 1 ($\ell_2$ box non-separable MPEC). We define $\Theta\triangleq\{(\mathbf{x},\mathbf{v})\mid\mathbf{x}^T\mathbf{v}=n,\ \|\mathbf{v}\|_2^2\le n,\ -1\le\mathbf{x}\le 1\}$. Assume that $(\mathbf{x},\mathbf{v})\in\Theta$; then $\mathbf{x}\in\{-1,+1\}^n$, $\mathbf{v}\in\{-1,+1\}^n$, and $\mathbf{x}=\mathbf{v}$.

Proof. (i) Firstly, we prove that $\mathbf{x}\in\{-1,+1\}^n$. Using the definition of $\Theta$ and the Cauchy-Schwarz inequality, we have: $n=\mathbf{x}^T\mathbf{v}\le\|\mathbf{x}\|_2\|\mathbf{v}\|_2\le\sqrt{n}\,\|\mathbf{x}\|_2=\sqrt{n}\sqrt{\mathbf{x}^T\mathbf{x}}\le\sqrt{n}\sqrt{\|\mathbf{x}\|_1\|\mathbf{x}\|_\infty}\le\sqrt{n}\sqrt{\|\mathbf{x}\|_1}$. Thus, we obtain $\|\mathbf{x}\|_1\ge n$. We define $\mathbf{z}=|\mathbf{x}|$. Combining $\|\mathbf{x}\|_\infty\le 1$, we have the following constraint set for $\mathbf{z}$: $\sum_i z_i\ge n,\ 0\le\mathbf{z}\le 1$. Therefore, we have $\mathbf{z}=\mathbf{1}$ and it holds that $\mathbf{x}\in\{-1,+1\}^n$. (ii) Secondly, we prove that $\mathbf{v}\in\{-1,+1\}^n$. We have:

$$n=\mathbf{x}^T\mathbf{v}\le\|\mathbf{x}\|_\infty\|\mathbf{v}\|_1\le\|\mathbf{v}\|_1=|\mathbf{v}|^T\mathbf{1}\le\|\mathbf{v}\|_2\|\mathbf{1}\|_2 \qquad (2)$$

Thus, we obtain $\|\mathbf{v}\|_2\ge\sqrt{n}$. Combining $\|\mathbf{v}\|_2^2\le n$, we have $\|\mathbf{v}\|_2=\sqrt{n}$ and $\|\mathbf{v}\|_2\|\mathbf{1}\|_2=n$. By the Squeeze Theorem, all the equalities in (2) hold automatically. Using the equality condition of the Cauchy-Schwarz inequality, we have $|\mathbf{v}|=\mathbf{1}$ and it holds that $\mathbf{v}\in\{-1,+1\}^n$. (iii) Finally, since $\mathbf{x}\in\{-1,+1\}^n$, $\mathbf{v}\in\{-1,+1\}^n$, and $\langle\mathbf{x},\mathbf{v}\rangle=n$, we obtain $\mathbf{x}=\mathbf{v}$. □

Using Lemma 1, we can rewrite (1) in an equivalent form as follows:

$$\min_{-1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n}\ f(\mathbf{x}),\ \ \text{s.t.}\ \mathbf{x}^T\mathbf{v}=n,\ \mathbf{x}\in\Omega \qquad (3)$$

We remark that $\mathbf{x}^T\mathbf{v}=n$ is referred to as the complementarity (or equilibrium) constraint in the literature (Luo, Pang, and Ralph, 1996; Ralph and Wright, 2004), and it always holds that $\mathbf{x}^T\mathbf{v}\le\|\mathbf{x}\|_\infty\|\mathbf{v}\|_1\le\sqrt{n}\|\mathbf{v}\|_2\le n$ for any feasible $\mathbf{x}$ and $\mathbf{v}$.

We now present our exact penalty method for solving the optimization problem in (3). It is worthwhile to point out that there are many studies on exact penalty methods for MPECs (refer to (Luo, Pang, and Ralph, 1996; Hu and Ralph, 2004; Ralph and Wright, 2004; Yuan and Ghanem, 2016b) for examples), but they do not afford the exactness of our penalty problem. In an exact penalty method, we penalize the complementarity error directly by a penalty function. The resulting objective $\mathcal{J}:\mathbb{R}^n\times\mathbb{R}^n\rightarrow\mathbb{R}$ is defined in (7), where $\rho$ is the penalty parameter that is iteratively increased to enforce the bilinear constraint:

$$\mathcal{J}_\rho(\mathbf{x},\mathbf{v})=f(\mathbf{x})+\rho(n-\mathbf{x}^T\mathbf{v}),\ \ \text{s.t.}\ -1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \mathbf{x}\in\Omega \qquad (7)$$

In each iteration, we minimize over $\mathbf{x}$ and $\mathbf{v}$ alternatingly (Tseng, 2001; Bolte, Sabach, and Teboulle, 2014), while fixing the parameter $\rho$. We summarize our exact penalty method in Algorithm 1.

Algorithm 1 MPEC-EPM: An Exact Penalty Method for Solving MPEC Problem (3)
(S.0) Set $t=0$, $\mathbf{x}^0=\mathbf{v}^0=\mathbf{0}$, $\rho>0$, $\sigma>1$.
(S.1) Solve the following $\mathbf{x}$-subproblem [primal step]:
$$\mathbf{x}^{t+1}=\arg\min_{\mathbf{x}}\ \mathcal{J}(\mathbf{x},\mathbf{v}^t),\ \ \text{s.t.}\ -1\le\mathbf{x}\le 1,\ \mathbf{x}\in\Omega \qquad (4)$$
(S.2) Solve the following $\mathbf{v}$-subproblem [dual step]:
$$\mathbf{v}^{t+1}=\arg\min_{\mathbf{v}}\ \mathcal{J}(\mathbf{x}^{t+1},\mathbf{v}),\ \ \text{s.t.}\ \|\mathbf{v}\|_2^2\le n \qquad (5)$$
(S.3) Update the penalty every $T$ iterations:
$$\rho\Leftarrow\min(2L,\ \rho\times\sigma) \qquad (6)$$
(S.4) Set $t:=t+1$ and go to Step (S.1)
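A minimal Python sketch of Algorithm 1 follows (illustrative only; the paper's actual implementation is in Matlab). It assumes a smooth convex objective given by a hypothetical gradient callback `grad_f` and takes $\Omega=\mathbb{R}^n$, so that the $\mathbf{x}$-subproblem (4) reduces to a box-constrained problem handled here by a few projected gradient steps; the $\mathbf{v}$-update uses the closed form (9) derived in observation (c) below.

```python
import numpy as np

def mpec_epm(grad_f, n, L, rho=0.01, sigma=np.sqrt(10), T=10,
             outer=30, inner=50, eps=0.01):
    """Sketch of Algorithm 1 (MPEC-EPM); assumes Omega = R^n and smooth f."""
    x = np.zeros(n)
    v = np.zeros(n)
    step = 1.0 / (L + 1e-12)                 # conservative step size (illustrative)
    for t in range(outer * T):
        # (S.1) x-subproblem (4): minimize f(x) - rho*<x, v> over the box
        for _ in range(inner):
            x = np.clip(x - step * (grad_f(x) - rho * v), -1.0, 1.0)
        # (S.2) v-subproblem (5): closed form (9); any feasible v is optimal at x = 0
        nx = np.linalg.norm(x)
        if nx > 0:
            v = np.sqrt(n) * x / nx
        # (S.3) increase the penalty every T iterations, capped at 2L as in (6)
        if (t + 1) % T == 0:
            rho = min(2 * L, rho * sigma)
        # stop once the complementarity constraint holds up to eps
        if n - x.dot(v) <= eps:
            break
    return np.where(x >= 0.0, 1.0, -1.0)     # x is (near-)binary at termination
```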
We summarize our exact penalty method in 4For {0, +1} binary variable, we have: {0, +1}n ⇔ {x | 0 ≤ Algorithm1. The parameter T is the number of inner itera- 2 x ≤ 1, k2v − 1k2 ≤ n, h2x − 1, 2v − 1i = n, ∀v} tions for solving the biconvex problem and the parameter L is the Lipschitz constant of the objective function f(·). We inequalities: make the following observations about the algorithm. √ √ p 2 0 n − nkxk2 n − n (n − 1) + δ (a) Initialization. We initialize v to 0. This is for the sake > p of finding a reasonable local minimum in the first iteration, ksign(x) − xk2 (1 − δ)2 √ √ as it reduces to convex LP relaxation (Komodakis and Tzir- n − n( n − 1 + δ) itas, 2007) for the binary optimization problem. ≥ (1 − δ) (b) Exact property. One remarkable feature of our method √ √ √ n − n n − 1 nδ is the boundedness of the penalty parameter ρ (see Theo- = + rem1). Therefore, we terminate the optimization when the (1 − δ) (1 − δ) √ √ threshold is reached (see (6)). This distinguishes it from the n − n n − 1 > + 0 quadratic penalty method (Lu and Zhang, 2013), where the 1 penalty may become arbitrarily large for non-convex prob- √ √ √ lems. where we use the inequality a + b ≤ a + b, ∀a, b > 0 (c) v-Subproblem. Variable v in (5) is updated by solving and the fact that 0 < δ < 1. Since the lower bound above can the following convex problem: be applied to an arbitrary vector, we finish the proof of the first inequality. (ii) We prove the second inequality in (10). We have the following results: 1/4 > 0 ⇒ n2 −n+1/4 > vt+1 = arg min hv, −xt+1i s.t. kvk2 ≤ n (8) √ √ 2 n2 − n ⇒ (n − 1/2) > n2 − n ⇒ n − n2 − n > 1/2. When xt+1 = 0, any feasible solution is also an optimal solution. When xt+1 6= 0, the optimal solution will be The following lemma is useful in establishing the exact- 2 achieved at the constraint boundary with kvk2 = n and ness property of the penalty function in Algorithm1. 1 2 t+1 (8) is equivalent to solving: min 2 kvk − hv, x i. kvk2=n 2 2 Lemma 3. Consider the following optimization problem: Thus, we have the following optimal solution for v: ∗ ∗ (x , v ) = arg min Jρ(x, v). (11) ρ ρ 2 −1≤x≤1,kvk2≤n, x∈Ω  √ t+1 t+1 t+1 t+1 n · x /kx k2, x 6= 0; v = 2 (9) Assume that f(·) is a L-Lipschitz continuous convex func- any v with kvk2 ≤ n, otherwise. ∗ ∗ tion on −1 ≤ x ≤ 1. When ρ > 2L, hxρ, vρi = n will be achieved for any local optimal solution of (11). (d) x-Subproblem. Variable x in (4) is updated by solving a box constrained convex problem, which has no closed-form Proof. First of all, we focus on the v-subproblem in (11): ∗ T 2 ∗ solution in general. However, it can be solved using Nes- vρ = arg minv −x v, s.t. kvk2 ≤ n. Assume that xρ 6= ∗ √ ∗ ∗ terov’s proximal gradient method (Nesterov, 2003) or clas- 0, we have vρ = n · xρ/kxρk2 by (9). Then the biconvex sical/linearized ADM (He and Yuan, 2012). optimization problem reduces to the following: ∗ √ Theoretical Analysis. In the following, we present some xρ = arg min p(x) , f(x) + ρ(n − nkxk2) (12) theoretical analysis of our exact penalty method. The fol- x∈[−1,+1]n∩Ω lowing lemma is very crucial and useful in our proofs. ∗ For any xρ ∈ Ω, we derive the following inequalities: n ∗ ∗ Lemma 2. Let x ∈ R be an arbitrary vector with −1 ≤ 0.5ρksign(xρ) − xρk2  1, x > 0; √ ∗ x ≤ 1. We define sign(x) = ±1, x = 0; and assume ≤ ρ(n − nkxρk2) −1, x < 0. √ ∗ ∗ ∗ sign(x) 6= x. 
Lemma 2. Let $\mathbf{x}\in\mathbb{R}^n$ be an arbitrary vector with $-1\le\mathbf{x}\le 1$. We define

$$\mathrm{sign}(x)=\begin{cases}1,&x>0;\\ \pm 1,&x=0;\\ -1,&x<0,\end{cases}$$

and assume $\mathrm{sign}(\mathbf{x})\neq\mathbf{x}$. The following inequalities hold:

$$h(\mathbf{x})\triangleq\frac{n-\sqrt{n}\|\mathbf{x}\|_2}{\|\mathrm{sign}(\mathbf{x})-\mathbf{x}\|_2} > n-\sqrt{n^2-n} > 1/2 \qquad (10)$$
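Before the formal proof, the bound (10) is easy to sanity-check numerically. The snippet below (illustrative only, not part of the paper) evaluates $h(\mathbf{x})$ on random non-binary box points and confirms it exceeds both $n-\sqrt{n^2-n}$ and $1/2$.

```python
import numpy as np

def h(x):
    n = x.size
    s = np.where(x >= 0.0, 1.0, -1.0)        # sign with sign(0) taken as +1
    return (n - np.sqrt(n) * np.linalg.norm(x)) / np.linalg.norm(s - x)

rng = np.random.default_rng(2)
n = 10
for _ in range(10000):
    x = rng.uniform(-1, 1, size=n)           # almost surely sign(x) != x
    assert h(x) > n - np.sqrt(n**2 - n) > 0.5
print("Lemma 2 bound (10) held on all sampled points")
```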

Proof. (i) We prove the first inequality in (10). We define $N(\mathbf{x})$ as the number of $\pm 1$ binary entries in $\mathbf{x}$, i.e., $N(\mathbf{x})\triangleq\#(|\mathbf{x}|=1)$. Clearly, the objective function $h(\mathbf{x})$ decreases as $N(\mathbf{x})$ increases. Note that $N(\mathbf{x})\neq n$, since otherwise it violates the assumption that $\mathrm{sign}(\mathbf{x})\neq\mathbf{x}$. We consider the objective value $h(\mathbf{x})$ when $N(\mathbf{x})=n-1$. In this situation, there exists only one coordinate such that $\mathrm{sign}(x_i)\neq x_i$ with $x_i=\pm\delta$, $0<\delta<1$, and the remaining coordinates take binary values in $\{-1,+1\}$. Note that $\delta\neq 0$ and $\delta\neq 1$, since otherwise it also violates the assumption that $\mathrm{sign}(\mathbf{x})\neq\mathbf{x}$. Therefore, we derive the following inequalities:

$$h(\mathbf{x}) > \frac{n-\sqrt{n}\sqrt{(n-1)+\delta^2}}{\sqrt{(1-\delta)^2}} \ge \frac{n-\sqrt{n}(\sqrt{n-1}+\delta)}{1-\delta} = \frac{n-\sqrt{n}\sqrt{n-1}}{1-\delta}-\frac{\sqrt{n}\,\delta}{1-\delta} > \frac{n-\sqrt{n}\sqrt{n-1}}{1}+0$$

where we use the inequality $\sqrt{a+b}\le\sqrt{a}+\sqrt{b},\ \forall a,b>0$, and the fact that $0<\delta<1$. Since the lower bound above can be applied to an arbitrary vector, we finish the proof of the first inequality. (ii) We prove the second inequality in (10). We have the following results: $1/4>0 \Rightarrow n^2-n+1/4>n^2-n \Rightarrow (n-1/2)^2>n^2-n \Rightarrow n-1/2>\sqrt{n^2-n} \Rightarrow n-\sqrt{n^2-n}>1/2$. □

The following lemma is useful in establishing the exactness property of the penalty function in Algorithm 1.

Lemma 3. Consider the following optimization problem:

$$(\mathbf{x}^*_\rho,\mathbf{v}^*_\rho)=\arg\min_{-1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \mathbf{x}\in\Omega}\ \mathcal{J}_\rho(\mathbf{x},\mathbf{v}). \qquad (11)$$

Assume that $f(\cdot)$ is an $L$-Lipschitz continuous convex function on $-1\le\mathbf{x}\le 1$. When $\rho>2L$, $\langle\mathbf{x}^*_\rho,\mathbf{v}^*_\rho\rangle=n$ will be achieved for any local optimal solution of (11).

Proof. First of all, we focus on the $\mathbf{v}$-subproblem in (11): $\mathbf{v}^*_\rho=\arg\min_{\mathbf{v}}\ -\mathbf{x}^T\mathbf{v}$, s.t. $\|\mathbf{v}\|_2^2\le n$. Assuming that $\mathbf{x}^*_\rho\neq\mathbf{0}$, we have $\mathbf{v}^*_\rho=\sqrt{n}\cdot\mathbf{x}^*_\rho/\|\mathbf{x}^*_\rho\|_2$ by (9). Then the biconvex optimization problem reduces to the following:

$$\mathbf{x}^*_\rho=\arg\min_{\mathbf{x}\in[-1,+1]^n\cap\Omega}\ p(\mathbf{x})\triangleq f(\mathbf{x})+\rho(n-\sqrt{n}\|\mathbf{x}\|_2) \qquad (12)$$

For any $\mathbf{x}^*_\rho\in\Omega$, we derive the following inequalities:

$$\begin{aligned}0.5\rho\|\mathrm{sign}(\mathbf{x}^*_\rho)-\mathbf{x}^*_\rho\|_2 &\le \rho(n-\sqrt{n}\|\mathbf{x}^*_\rho\|_2)\\ &= [\rho(n-\sqrt{n}\|\mathbf{x}^*_\rho\|_2)+f(\mathbf{x}^*_\rho)]-f(\mathbf{x}^*_\rho)\\ &\le [\rho(n-\sqrt{n}\|\mathrm{sign}(\mathbf{x}^*_\rho)\|_2)+f(\mathrm{sign}(\mathbf{x}^*_\rho))]-f(\mathbf{x}^*_\rho)\\ &= f(\mathrm{sign}(\mathbf{x}^*_\rho))-f(\mathbf{x}^*_\rho)\\ &\le L\|\mathrm{sign}(\mathbf{x}^*_\rho)-\mathbf{x}^*_\rho\|_2\end{aligned} \qquad (13)$$

where the first step uses Lemma 2, i.e., $\|\mathrm{sign}(\mathbf{x})-\mathbf{x}\|_2\le 2(n-\sqrt{n}\|\mathbf{x}\|_2)$ for any $\mathbf{x}$ with $\|\mathbf{x}\|_\infty\le 1$; the third step uses the optimality of $\mathbf{x}^*_\rho$ in (12), i.e., $p(\mathbf{x}^*_\rho)\le p(\mathbf{y})$ for any $\mathbf{y}\in[-1,+1]^n\cap\Omega$; the fourth step uses the fact that $\mathrm{sign}(\mathbf{x}^*_\rho)\in\{-1,+1\}^n$ and $\sqrt{n}\|\mathrm{sign}(\mathbf{x}^*_\rho)\|_2=n$; and the last step exploits the Lipschitz continuity of $f(\cdot)$.

From (13), we have $\|\mathbf{x}^*_\rho-\mathrm{sign}(\mathbf{x}^*_\rho)\|_2\cdot(\rho-2L)\le 0$. Since $\rho-2L>0$, we conclude that it always holds that $\|\mathbf{x}^*_\rho-\mathrm{sign}(\mathbf{x}^*_\rho)\|_2=0$; thus, $\mathbf{x}^*_\rho\in\{-1,+1\}^n$. Finally, we have $\mathbf{v}^*_\rho=\sqrt{n}\cdot\mathbf{x}^*_\rho/\|\mathbf{x}^*_\rho\|_2=\mathbf{x}^*_\rho$ and $\langle\mathbf{x}^*_\rho,\mathbf{v}^*_\rho\rangle=n$. □

The following theorem shows that when the penalty parameter $\rho$ is larger than some threshold, the biconvex objective function in (7) is equivalent to the original constrained MPEC problem in (3). This essentially implies the theoretical convergence of the algorithm, since it reduces to well-known block coordinate descent in the literature [5].

[5] Specifically, using Tseng's convergence results of block coordinate descent for non-differentiable minimization (Tseng, 2001), one can guarantee that every cluster point of Algorithm 1 is also a stationary point. In addition, stronger convergence results (Bolte, Sabach, and Teboulle, 2014; Yuan and Ghanem, 2016b) can be obtained by combining a proximal strategy with the Kurdyka-Łojasiewicz inequality assumption on $\mathcal{J}(\cdot)$.

Theorem 1 (Exactness of the Penalty Function). Assume that $f(\cdot)$ is an $L$-Lipschitz continuous convex function on $-1\le\mathbf{x}\le 1$. When $\rho>2L$, the biconvex optimization $\min_{\mathbf{x},\mathbf{v}}\ \mathcal{J}_\rho(\mathbf{x},\mathbf{v})$, s.t. $-1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \mathbf{x}\in\Omega$ in (7) has the same local and global minima as the original problem in (3).

Proof. We let $\mathbf{x}^*$ be any global minimizer of (3) and $(\mathbf{x}^*_\rho,\mathbf{v}^*_\rho)$ be any global minimizer of (7) for some $\rho>2L$. (i) We first prove that $\mathbf{x}^*$ is also a global minimizer of (7). For any feasible $\mathbf{x}$ and $\mathbf{v}$, we derive the following inequalities:
$$\begin{aligned}\mathcal{J}_\rho(\mathbf{x},\mathbf{v}) &\ge \min_{\|\mathbf{x}\|_\infty\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \mathbf{x}\in\Omega}\ f(\mathbf{x})+\rho(n-\mathbf{x}^T\mathbf{v})\\ &= \min_{\|\mathbf{x}\|_\infty\le 1,\ \|\mathbf{v}\|_2^2\le n,\ \mathbf{x}\in\Omega}\ f(\mathbf{x}),\ \ \text{s.t.}\ \mathbf{x}^T\mathbf{v}=n\\ &= f(\mathbf{x}^*)+\rho(n-\mathbf{x}^{*T}\mathbf{v}^*) = \mathcal{J}_\rho(\mathbf{x}^*,\mathbf{v}^*)\end{aligned}$$

where the first equality holds due to the fact that the constraint $\mathbf{x}^T\mathbf{v}=n$ is satisfied at the optimal solution when $\rho>2L$ (see Lemma 3). Therefore, we conclude that any optimal solution of (3) is also an optimal solution of (7). (ii) We now prove that $\mathbf{x}^*_\rho$ is also a global minimizer of (3). For any feasible $\mathbf{x}$ and $\mathbf{v}$, we naturally have the following inequalities:

$$f(\mathbf{x}^*_\rho)-f(\mathbf{x}) = f(\mathbf{x}^*_\rho)+\rho(n-\mathbf{x}^{*T}_\rho\mathbf{v}^*_\rho)-f(\mathbf{x})-\rho(n-\mathbf{x}^T\mathbf{v}) = \mathcal{J}_\rho(\mathbf{x}^*_\rho,\mathbf{v}^*_\rho)-\mathcal{J}_\rho(\mathbf{x},\mathbf{v}) \le 0$$

where the first equality uses Lemma 3. Therefore, we conclude that any optimal solution of (7) is also an optimal solution of (3). (iii) In summary, we conclude that when $\rho>2L$, the biconvex optimization in (7) has the same local and global minima as the original problem in (3). □

The following theorem characterizes the convergence rate and asymptotic monotone property of Algorithm 1.

Theorem 2 (Convergence Rate and Asymptotic Monotone Property of Algorithm 1). Assume that $f(\cdot)$ is an $L$-Lipschitz continuous convex function on $-1\le\mathbf{x}\le 1$. Algorithm 1 will converge to a first-order KKT point in at most $\lceil(\ln(L\sqrt{2n})-\ln(\rho^0\epsilon))/\ln\sigma\rceil$ outer iterations [6], with accuracy at least $n-\mathbf{x}^T\mathbf{v}\le\epsilon$. Moreover, after $\langle\mathbf{x},\mathbf{v}\rangle=n$ is obtained, the sequence $\{f(\mathbf{x}^t)\}$ generated by Algorithm 1 is monotonically non-increasing.

[6] Every time we increase $\rho$, we call it one outer iteration.

Proof. We denote $s$ and $t$ as the outer and inner iteration counters in Algorithm 1, respectively. (i) We first prove the convergence rate of Algorithm 1. Assume that Algorithm 1 takes $s$ outer iterations to converge, and denote by $f'(\mathbf{x})$ a subgradient of $f(\cdot)$ at $\mathbf{x}$. According to the $\mathbf{x}$-subproblem in (12), if $\mathbf{x}^*$ solves (12), then we have the following mixed variational inequality condition (He and Yuan, 2012; Jiang et al., 2016):

$$\forall\mathbf{x}\in[-1,+1]^n\cap\Omega,\ \ \langle\mathbf{x}-\mathbf{x}^*,f'(\mathbf{x}^*)\rangle+\rho(n-\sqrt{n}\|\mathbf{x}\|_2)-\rho(n-\sqrt{n}\|\mathbf{x}^*\|_2)\ge 0.$$

Letting $\mathbf{x}$ be any feasible solution with $\mathbf{x}\in\{-1,+1\}^n\cap\Omega$, we have the following inequality:

$$n-\sqrt{n}\|\mathbf{x}^*\|_2 \le n-\sqrt{n}\|\mathbf{x}\|_2+\tfrac{1}{\rho}\langle\mathbf{x}-\mathbf{x}^*,f'(\mathbf{x}^*)\rangle \le \tfrac{1}{\rho}\|\mathbf{x}-\mathbf{x}^*\|_2\|f'(\mathbf{x}^*)\|_2 \le L\sqrt{2n}/\rho \qquad (14)$$

where the second inequality is due to the Cauchy-Schwarz inequality, and the third inequality is due to the fact that $\|\mathbf{x}-\mathbf{y}\|_2\le\sqrt{2n},\ \forall -1\le\mathbf{x},\mathbf{y}\le 1$, together with the Lipschitz continuity of $f(\cdot)$, i.e., $\|f'(\mathbf{x}^*)\|_2\le L$. (14) implies that when $\rho\ge L\sqrt{2n}/\epsilon$, Algorithm 1 achieves accuracy at least $n-\sqrt{n}\|\mathbf{x}\|_2\le\epsilon$. Noticing that $\rho^s=\sigma^s\rho^0$, we have that $\epsilon$ accuracy will be achieved when $\sigma^s\rho^0\ge L\sqrt{2n}/\epsilon$. Thus, we obtain

$$\sigma^s \ge \frac{L\sqrt{2n}}{\rho^0\epsilon}\ \Rightarrow\ s \ge \bigl(\ln(L\sqrt{2n})-\ln(\rho^0\epsilon)\bigr)/\ln\sigma$$

(ii) We now prove the asymptotic monotone property of Algorithm 1. We naturally derive the following inequalities:

$$f(\mathbf{x}^{t+1})-f(\mathbf{x}^t) \le \rho(n-\langle\mathbf{x}^t,\mathbf{v}^t\rangle)-\rho(n-\langle\mathbf{x}^{t+1},\mathbf{v}^t\rangle) = \rho\bigl(\langle\mathbf{x}^{t+1},\mathbf{v}^t\rangle-\langle\mathbf{x}^t,\mathbf{v}^t\rangle\bigr) \le \rho\bigl(\langle\mathbf{x}^{t+1},\mathbf{v}^{t+1}\rangle-\langle\mathbf{x}^t,\mathbf{v}^t\rangle\bigr) = 0$$

where the first inequality uses the fact that $f(\mathbf{x}^{t+1})+\rho(n-\langle\mathbf{x}^{t+1},\mathbf{v}^t\rangle)\le f(\mathbf{x}^t)+\rho(n-\langle\mathbf{x}^t,\mathbf{v}^t\rangle)$, which holds because $\mathbf{x}^{t+1}$ is the optimal solution of (4); the second inequality uses the fact that $-\langle\mathbf{x}^{t+1},\mathbf{v}^{t+1}\rangle\le-\langle\mathbf{x}^{t+1},\mathbf{v}^t\rangle$, which holds due to the optimality of $\mathbf{v}^{t+1}$ for (5); and the last step uses $\langle\mathbf{x},\mathbf{v}\rangle=n$. Note that the equality $\langle\mathbf{x},\mathbf{v}\rangle=n$ together with the feasible set $-1\le\mathbf{x}\le 1,\ \|\mathbf{v}\|_2^2\le n$ also implies that $\mathbf{x}\in\{-1,+1\}^n$. □
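To make the rate in Theorem 2 concrete, the snippet below (illustrative only; $\rho^0$, $\sigma$, and $\epsilon$ mirror the experimental settings reported in Section 4, while the Lipschitz constant L = 1e3 and n = 1e6 are assumed example values) evaluates the outer-iteration bound.

```python
import math

def outer_iteration_bound(L, n, rho0, sigma, eps):
    """Theorem 2 bound: smallest s with sigma^s * rho0 >= L*sqrt(2n)/eps."""
    return math.ceil((math.log(L * math.sqrt(2 * n)) - math.log(rho0 * eps))
                     / math.log(sigma))

print(outer_iteration_bound(L=1e3, n=1_000_000, rho0=0.01,
                            sigma=math.sqrt(10), eps=0.01))
# -> 21 outer iterations; the bound grows only logarithmically in L, n, 1/eps.
```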

[Figure 1 about here: eight panels plotting subgraph density (y-axis) against cardinality k from 100 to 5000 (x-axis) on (a) wordassociation, (b) enron, (c) uk-2007-05, (d) cnr-2000, (e) dblp-2010, (f) in-2004, (g) amazon-2008, (h) dblp-2011.]

Figure 1: Experimental results for dense subgraph discovery.

We have a few remarks on the theorems above. We assume that the objective function is $L$-Lipschitz continuous. However, such a hypothesis is not strict: because the solution $\mathbf{x}$ is defined on a compact set, the Lipschitz constant can always be computed for any continuous objective (e.g., a norm function or a min/max envelope function). In fact, it is equivalent to saying that the (sub-)gradient of the objective is bounded by $L$ [7]. Although exact penalty methods have been studied in the literature (Han and Mangasarian, 1979; Di Pillo and Grippo, 1989; Di Pillo, 1994), their results cannot directly apply here. The theoretical bound $2L$ (on the penalty parameter $\rho$) heavily depends on the specific structure of the optimization problem. Moreover, we also establish the convergence rate and asymptotic monotone property of our algorithm.

[7] For example, for the quadratic function $f(\mathbf{x})=0.5\mathbf{x}^T\mathbf{A}\mathbf{x}+\mathbf{x}^T\mathbf{b}$ with $\mathbf{A}\in\mathbb{R}^{n\times n}$ and $\mathbf{b}\in\mathbb{R}^n$, the Lipschitz constant is bounded by $L\le\|\mathbf{A}\mathbf{x}+\mathbf{b}\|\le\|\mathbf{A}\|\|\mathbf{x}\|+\|\mathbf{b}\|\le\|\mathbf{A}\|\sqrt{n}+\|\mathbf{b}\|$; for the $\ell_1$ regression function $f(\mathbf{x})=\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_1$ with $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $\mathbf{b}\in\mathbb{R}^m$, the Lipschitz constant is bounded by $L\le\|\mathbf{A}^T\partial|\mathbf{A}\mathbf{x}-\mathbf{b}|\|\le\|\mathbf{A}^T\|\sqrt{m}$.

Based on the discussions above, we summarize the merits of our MPEC-based exact penalty method as follows. (a) It exhibits strong convergence guarantees, since it essentially reduces to block coordinate descent in the literature. (b) It seeks desirable solutions, since the LP convex relaxation method in the first iteration provides a good initialization. (c) It is efficient, since it is amenable to the use of existing convex methods for solving the subproblems. (d) It has a monotone/greedy property owing to the complementarity constraints brought on by the MPEC: we penalize the complementarity error and ensure that it decreases in every iteration, leading to binary solutions.

4 Experimental Validation

This section demonstrates the advantages of our MPEC-based exact penalty method (MPEC-EPM) on the dense subgraph discovery problem. All codes are implemented in Matlab on an Intel 3.20GHz CPU with 8 GB RAM [8].

[8] For the purpose of reproducibility, we provide our MATLAB code at: yuanganzhao.weebly.com.

Dense subgraph discovery (Ravi, Rosenkrantz, and Tayi, 1994; Feige, Peleg, and Kortsarz, 2001; Yuan and Zhang, 2013) is a fundamental graph-theoretic problem, as it captures numerous graph mining applications, such as community finding, regulatory motif detection, and real-time story identification. It aims at finding the maximum-density subgraph on $k$ vertices, which can be formulated as the following binary program:

$$\max_{\mathbf{x}\in\{0,1\}^n}\ \mathbf{x}^T\mathbf{W}\mathbf{x},\ \ \text{s.t.}\ \mathbf{x}^T\mathbf{1}=k \qquad (15)$$

where $\mathbf{W}\in\mathbb{R}^{n\times n}$ is the adjacency matrix of the graph. Although the objective function in (15) may not be convex, one can append an additional term $\lambda\mathbf{x}^T\mathbf{x}$ to the objective with a sufficiently large $\lambda$ such that $\lambda\mathbf{I}-\mathbf{W}\succeq 0$ (similar to (Ghanem, Cao, and Wonka, 2015)). This is equivalent to adding a constant to the objective, since $\lambda\mathbf{x}^T\mathbf{x}=\lambda k$ in the effective domain. Therefore, we have the following equivalent problem:

$$\min_{\mathbf{x}\in\{0,1\}^n}\ f(\mathbf{x})\triangleq\mathbf{x}^T(\lambda\mathbf{I}-\mathbf{W})\mathbf{x},\ \ \text{s.t.}\ \mathbf{x}^T\mathbf{1}=k \qquad (16)$$

In the experiments, $\lambda$ is set to the largest eigenvalue of $\mathbf{W}$; a construction sketch follows Table 2.

Table 2: The statistics of the web graph datasets used in our dense subgraph discovery experiments.

Graph            # Nodes    # Arcs      Avg. Degree
wordassociation  10617      72172       6.80
enron            69244      276143      3.99
uk-2007-05       100000     3050615     30.51
cnr-2000         325557     3216152     9.88
dblp-2010        326186     1615400     4.95
in-2004          1382908    16917053    12.23
amazon-2008      735323     5158388     7.02
dblp-2011        986324     6707236     6.80
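For concreteness, the convexified objective in (16) can be set up in a few lines of Python (an illustrative sketch using scipy; the paper's actual implementation is in Matlab). Here $\lambda$ is the largest eigenvalue of W, so that $\lambda\mathbf{I}-\mathbf{W}\succeq 0$:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def build_dsd_objective(W):
    """Objective f(x) = x^T (lam*I - W) x of (16) for a symmetric adjacency W."""
    lam = eigsh(W.asfptype(), k=1, which='LA', return_eigenvectors=False)[0]
    f = lambda x: lam * x.dot(x) - x.dot(W @ x)
    grad_f = lambda x: 2.0 * (lam * x - W @ x)
    return f, grad_f, lam

# Tiny example: a 4-cycle; the densest 2-subgraph is any edge (density 2/2 = 1).
rows = [0, 1, 1, 2, 2, 3, 3, 0]
cols = [1, 0, 2, 1, 3, 2, 0, 3]
W = sp.csr_matrix((np.ones(8), (rows, cols)), shape=(4, 4))
f, grad_f, lam = build_dsd_objective(W)
x = np.array([1.0, 1.0, 0.0, 0.0])            # pick vertices {0, 1}
print(x @ (W @ x) / 2)                         # density x^T W x / k = 1.0
```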

[Figure 2 about here: eight panels plotting objective value (y-axis) against iteration 0–100 (x-axis) on enron, uk-2007-05, in-2004, and amazon-2008, for k = 3000 (first row, panels (a)–(d)) and k = 4000 (second row, panels (e)–(h)).]

Figure 2: Convergence curve for dense subgraph discovery on different datasets with k = 3000 (first row) and k = 4000 (second row).

Compared Methods. In our experiments, we compare the following methods with different cardinalities $k\in\{100, 1000, 2000, 3000, 4000, 5000\}$ on 8 datasets [9] (see Table 2), which contain up to 1 million nodes and 7 million arcs. (i) Feige's greedy algorithm (FEIGE) (Feige, Peleg, and Kortsarz, 2001) is included in our comparisons; this method is known to achieve the best approximation ratio for general $k$. (ii) Ravi's greedy algorithm (RAVI) (Ravi, Rosenkrantz, and Tayi, 1994) starts from a heaviest edge and repeatedly adds a vertex to the current subgraph so as to maximize the weight of the resulting new subgraph; it has an asymptotic performance guarantee of $\pi/2$ when the weights satisfy the triangle inequality. (iii) LP relaxation solves the capped simplex problem $\min_{\mathbf{x}} f(\mathbf{x})$, s.t. $0\le\mathbf{x}\le 1,\ \mathbf{x}^T\mathbf{1}=k$ by the proximal gradient descent method via $\mathbf{x}^{k+1}\Leftarrow\mathrm{proj}(\mathbf{x}^k-\nabla f(\mathbf{x}^k)/\eta)$ based on the current gradient $\nabla f(\mathbf{x}^k)$. Here, the projection operator $\mathrm{proj}(\mathbf{a})\triangleq\arg\min_{0\le\mathbf{x}\le 1,\ \mathbf{x}^T\mathbf{1}=k}\|\mathbf{x}-\mathbf{a}\|_2^2$ can be evaluated analytically and exactly in $n\log(n)$ time by a break-point search method (Helgason, Kennington, and Lall, 1980); a sketch of this projection is given after this list. We use the Matlab implementation provided in (Yuan and Ghanem, 2016b). $\eta$ is the gradient Lipschitz constant and is set to the largest eigenvalue of $\lambda\mathbf{I}-\mathbf{W}$. (iv) Truncated Power Method (TPM) (Yuan and Zhang, 2013) considers an iterative procedure that combines power iteration and hard-thresholding truncation. It works by greedily decreasing the objective while maintaining the desired binary property of the intermediate solutions. We use the code provided by the authors [10]. As suggested in (Yuan and Zhang, 2013), the initial solution is set to the indicator vector of the vertices with the top $k$ weighted degrees of the graph. (v) L2-box ADMM (Wu and Ghanem, 2016) applies ADMM directly to the $\ell_2$ box non-separable reformulation $\min_{\mathbf{x}}\ \mathbf{x}^T(\lambda\mathbf{I}-\mathbf{W})\mathbf{x}$, s.t. $0\le\mathbf{x}\le 1,\ \mathbf{x}^T\mathbf{1}=k,\ \|2\mathbf{x}-\mathbf{1}\|_2^2=n$. It introduces auxiliary variables to separate the two constraint sets and then performs block coordinate descent on each variable. (vi) MPEC-EPM (Algorithm 1) solves the NP-hard problem in (16) via successive convex LP relaxation. We stop Algorithm 1 when the complementarity constraint is satisfied up to a threshold, i.e., $n-\mathbf{x}^T\mathbf{v}\le\epsilon$, where $\epsilon$ is set to 0.01. Moreover, we choose $\rho=0.01$, $T=10$, $\sigma=\sqrt{10}$.

[9] http://law.di.unimi.it/datasets.php
[10] https://sites.google.com/site/xtyuan1980/publications
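The capped simplex projection used in item (iii) above can be sketched as follows. This illustrative Python version locates the optimal shift by bisection on the Lagrange multiplier of the constraint $\mathbf{x}^T\mathbf{1}=k$, rather than the O(n log n) break-point search of (Helgason, Kennington, and Lall, 1980) used in the paper's implementation, but it computes the same projection.

```python
import numpy as np

def proj_capped_simplex(a, k, iters=60):
    """Euclidean projection of a onto {x : 0 <= x <= 1, sum(x) = k}.

    The KKT conditions give x_i = clip(a_i + tau, 0, 1) for a scalar tau;
    sum_i clip(a_i + tau, 0, 1) is nondecreasing in tau, so tau can be
    located by bisection.
    """
    lo, hi = -a.max(), 1.0 - a.min()           # sum is 0 at lo and n at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(a + tau, 0.0, 1.0).sum() < k:
            lo = tau
        else:
            hi = tau
    return np.clip(a + 0.5 * (lo + hi), 0.0, 1.0)

a = np.array([0.9, 0.6, 0.3, -0.2])
x = proj_capped_simplex(a, k=2)
print(x, x.sum())                              # feasible point summing to 2
```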
Solution Quality. We compare the quality of the solution $\mathbf{x}^*$ by measuring the density of the extracted $k$-subgraphs, computed as $\mathbf{x}^{*T}\mathbf{W}\mathbf{x}^*/k$. Several observations can be drawn from Figure 1. (i) Both FEIGE and RAVI generally fail to solve the dense subgraph discovery problem and lead to solutions with low density. (ii) LP relaxation gives better performance than the state-of-the-art technique TPM in some cases. (iii) L2-box ADMM outperforms LP relaxation in all cases, but it generates unsatisfactory accuracy on 'dblp-2010', 'in-2004', 'amazon-2008' and 'dblp-2011'. (iv) Our proposed method MPEC-EPM generally outperforms all compared methods.

Convergence Curve. We show the convergence curves of the methods {LP, TPM, L2box-ADMM, MPEC-EPM} for dense subgraph discovery on different datasets. As can be seen in Figure 2, MPEC-EPM converges within 100 iterations. Moreover, its objective values generally decrease monotonically, and we attribute this to the greedy property of the penalty method.

Computational Efficiency. We provide some runtime comparisons for the four methods on different datasets. As can be seen in Table 3, even for a dataset such as 'dblp-2011', which contains about one million nodes and 7 million edges, all the methods terminate within 15 minutes. Moreover, our method is several times slower than LP and comparable with L2-box ADMM. This is expected, since (i) MPEC-EPM needs to call the LP procedure multiple times, and (ii) the methods {LP, L2-box ADMM, MPEC-EPM} are alternating methods with the same per-iteration computational complexity. Our method calls the convex LP procedure many times until convergence. Although we only implement a simple projection method, we argue that this convex LP procedure could be significantly accelerated further by integrating existing, more advanced optimization techniques (such as coordinate gradient descent). However, this is outside the scope of this paper and left as future work.

Table 3: CPU time (in seconds) comparisons.

Graph        LP    TPM   L2box-ADMM   MPEC-EPM
wordassoc.   1     1     7            2
enron        2     1     40           29
uk-2007-05   6     1     75           65
cnr-2000     16    1     210          209
dblp-2010    15    1     234          282
in-2004      79    2     834          1023
amazon-2008  49    5     501          586
dblp-2011    59    8     554          621

5 Conclusions and Future Work

This paper presents a new continuous MPEC-based optimization method to solve general binary programs. Although the problem is non-convex, we design an exact penalty method to solve its equivalent MPEC reformulation. It works by solving a sequence of convex relaxation subproblems, resulting in better and better approximations to the original non-convex formulation. We also shed some theoretical light on the equivalent formulation and the optimization algorithm. Experimental results on binary problems demonstrate that our method generally outperforms existing solutions in terms of solution quality.

As for future work, we plan to investigate the optimality qualification of our multi-stage convex relaxation method for some specific objective functions, e.g., as is done in (Goemans and Williamson, 1995; Zhang, 2010; Candès, Li, and Soltanolkotabi, 2015; Jain, Netrapalli, and Sanghavi, 2013).

Acknowledgments

This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding. Yuan is also supported by NSF-China (61402182). A special thanks is extended to Prof. Shaohua Pan and Dr. Li Shen (South China University of Technology) for their helpful discussions on this paper.

References

Ames, B. P. W., and Vavasis, S. A. 2011. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming 129(1):69–89.
Ames, B. P. 2014. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming 147(1-2):429–465.
Ames, B. P. 2015. Guaranteed recovery of planted cliques and dense subgraphs by convex relaxation. Journal of Optimization Theory and Applications 167(2):653–675.
Bolte, J.; Sabach, S.; and Teboulle, M. 2014. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming 146(1-2):459–494.
Boykov, Y.; Veksler, O.; and Zabih, R. 2001. Fast approximate energy minimization via graph cuts. TPAMI 23(11):1222–1239.
Burer, S. 2009. On the copositive representation of binary and continuous nonconvex quadratic programs. Mathematical Programming 120(2):479–495.
Burer, S. 2010. Optimizing a polyhedral-semidefinite relaxation of completely positive programs. Mathematical Programming Computation 2(1):1–19.
Candès, E. J.; Li, X.; and Soltanolkotabi, M. 2015. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4):1985–2007.
Chan, E. Y. K., and Yeung, D. 2011. A convex formulation of modularity maximization for community detection. In IJCAI, 2218–2225.
Cour, T., and Shi, J. 2007. Solving Markov random fields with spectral relaxation. In AISTATS, volume 2, 15.
Cour, T.; Srinivasan, P.; and Shi, J. 2007. Balanced graph matching. NIPS 19:313.
De Santis, M., and Rinaldi, F. 2012. Continuous reformulations for zero-one programming problems. Journal of Optimization Theory and Applications 153(1):75–84.
Di Pillo, G., and Grippo, L. 1989. Exact penalty functions in constrained optimization. SIAM Journal on Control and Optimization 27(6):1333–1360.
Di Pillo, G. 1994. Exact penalty methods. In Algorithms for Continuous Optimization. Springer. 209–253.
Feige, U.; Peleg, D.; and Kortsarz, G. 2001. The dense k-subgraph problem. Algorithmica 29(3):410–421.
Fogel, F.; Jenatton, R.; Bach, F. R.; and d'Aspremont, A. 2015. Convex relaxations for permutation problems. SIMAX 36(4):1465–1488.
Ghanem, B.; Cao, Y.; and Wonka, P. 2015. Designing camera networks by convex quadratic programming. Computer Graphics Forum (Proceedings of Eurographics).
Goemans, M. X., and Williamson, D. P. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM 42(6):1115–1145.
Han, S.-P., and Mangasarian, O. L. 1979. Exact penalty functions in nonlinear programming. Mathematical Programming 17(1):251–269.
He, B., and Yuan, X. 2012. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SINUM 50(2):700–709.
He, L.; Lu, C.; Ma, J.; Cao, J.; Shen, L.; and Yu, P. S. 2016. Joint community and structural hole spanner detection via harmonic modularity. In SIGKDD, 875–884.
Helgason, R.; Kennington, J.; and Lall, H. 1980. A polynomially bounded algorithm for a singly constrained quadratic program. Mathematical Programming 18(1):338–343.
Hu, X., and Ralph, D. 2004. Convergence of a penalty method for mathematical programming with complementarity constraints. Journal of Optimization Theory and Applications 123(2):365–390.
Huang, Q.; Chen, Y.; and Guibas, L. J. 2014. Scalable semidefinite relaxation for maximum a posterior estimation. In ICML, 64–72.
Jain, P.; Netrapalli, P.; and Sanghavi, S. 2013. Low-rank matrix completion using alternating minimization. In STOC, 665–674.
Jiang, B.; Lin, T.; Ma, S.; and Zhang, S. 2016. Structured nonconvex and nonsmooth optimization: Algorithms and iteration complexity analysis. arXiv preprint.
Jiang, B.; Liu, Y.-F.; and Wen, Z. 2016. ℓp-norm regularization algorithms for optimization over permutation matrices. SIAM Journal on Optimization (SIOPT) 26(4):2284–2313.
Joulin, A.; Bach, F. R.; and Ponce, J. 2010. Discriminative clustering for image co-segmentation. In CVPR, 1943–1950.
Kalantari, B., and Rosen, J. B. 1982. Penalty formulation for zero-one integer equivalent problem. Mathematical Programming 24(1):229–232.
Keuchel, J.; Schnörr, C.; Schellewald, C.; and Cremers, D. 2003. Binary partitioning, perceptual grouping, and restoration with semidefinite programming. TPAMI 25(11):1364–1379.
Komodakis, N., and Tziritas, G. 2007. Approximate labeling via graph cuts based on linear programming. TPAMI 29(8):1436–1453.
Kumar, M. P.; Kolmogorov, V.; and Torr, P. H. S. 2009. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR 10:71–106.
Lu, Z., and Zhang, Y. 2013. Sparse approximation via penalty decomposition methods. SIOPT 23(4):2448–2478.
Luo, Z.-Q.; Pang, J.-S.; and Ralph, D. 1996. Mathematical Programs with Equilibrium Constraints. Cambridge University Press.
Murray, W., and Ng, K. 2010. An algorithm for nonlinear optimization problems with binary variables. Computational Optimization and Applications 47(2):257–288.
Nesterov, Y. E. 2003. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers.
Olsson, C.; Eriksson, A. P.; and Kahl, F. 2007. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In CVPR, 1–8.
Raghavachari, M. 1969. On connections between zero-one integer programming and concave programming under linear constraints. Operations Research 17(4):680–684.
Ralph, D., and Wright, S. J. 2004. Some properties of regularization and penalization schemes for MPECs. Optimization Methods and Software 19(5):527–556.
Ravi, S. S.; Rosenkrantz, D. J.; and Tayi, G. K. 1994. Heuristic and special case algorithms for dispersion problems. Operations Research 42(2):299–310.
Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. TPAMI 22(8):888–905.
Toshev, A.; Shi, J.; and Daniilidis, K. 2007. Image matching via saliency region correspondences. In CVPR, 1–8. IEEE.
Tseng, P. 2001. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109(3):475–494.
Wang, P.; Shen, C.; van den Hengel, A.; and Torr, P. 2016. Large-scale binary quadratic optimization using semidefinite relaxation and applications. TPAMI.
Wu, B., and Ghanem, B. 2016. ℓp-box ADMM: A versatile framework for integer programming. arXiv preprint.
Yuan, G., and Ghanem, B. 2015. ℓ0TV: A new method for image restoration in the presence of impulse noise. In CVPR, 5369–5377.
Yuan, G., and Ghanem, B. 2016a. A proximal alternating direction method for semi-definite rank minimization. In AAAI, 2300–2308.
Yuan, G., and Ghanem, B. 2016b. Sparsity constrained minimization via mathematical programming with equilibrium constraints. arXiv preprint.
Yuan, X., and Zhang, T. 2013. Truncated power method for sparse eigenvalue problems. JMLR 14(1):899–925.
Zaslavskiy, M.; Bach, F. R.; and Vert, J. 2009. A path following algorithm for the graph matching problem. TPAMI 31(12):2227–2242.
Zhang, Z.; Li, T.; Ding, C.; and Zhang, X. 2007. Binary matrix factorization with applications. In ICDM, 391–400.
Zhang, T. 2010. Analysis of multi-stage convex relaxation for sparse regularization. JMLR 11:1081–1107.