
Chapter 7 / augmented Lagrangian / ADMM

Contents (class version)
7.0 Introduction
    Variable splitting
7.1 Convex conjugate
7.2 Method of Lagrange multipliers
    Lagrange dual function
    Lagrange dual problem
7.3 Augmented Lagrangian methods
    Alternating direction method of multipliers (ADMM)
    Binary classifier with hinge loss function via ADMM
    ADMM in general (sketch)
    Convergence
    Linearized augmented Lagrangian method (LALM)
    Primal-dual hybrid gradient method
    Near-circulant splitting method
7.4 Summary


7.0 Introduction

Despite the many optimization algorithms for many applications in previous chapters, there are still important families of optimization problems that cannot be addressed by the methods described so far. Here are two examples from machine learning. Consider robust regression with a sparsity regularizer:

$$\hat{x} = \arg\min_{x\in\mathbb{F}^N} \|Ax - y\|_1 + \beta \|x\|_1 .$$

The first 1-norm provides robustness to outliers, and the other one encourages $x$ to be sparse. This cost function is convex but non-smooth. Of course we could replace the first 1-norm by $\mathbf{1}'\,\psi.(\cdot\,; \delta)$ (e.g., a Huber function), but then we would need to choose the parameter $\delta$. Another example is binary classifier design using the hinge loss function with a sparsity regularizer:

$$\hat{x} = \arg\min_{x\in\mathbb{F}^N} \mathbf{1}'\, h.(Ax) + \beta \|x\|_1 .$$

Again, we could replace the non-differentiable hinge with the Huber hinge function, for some parameter $\delta$. A third example is robust dictionary learning [1, 2]:

$$\hat{D} = \arg\min_{D\in\mathcal{D}} \min_Z \|\operatorname{vec}(X - DZ)\|_1 + \beta \|\operatorname{vec}(Z)\|_1 .$$

This chapter discusses algorithms that can address such applications without approximations.

Variable splitting

Recall the analysis regularized LS cost function for $x \in \mathbb{F}^N$, $A \in \mathbb{F}^{M\times N}$ and $T \in \mathbb{F}^{K\times N}$:

$$\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|Tx\|_1 . \tag{7.1}$$

This problem is challenging because of the matrix $T$ inside the 1-norm, when $T$ is not unitary or diagonal. One way to address this challenge is to write an exactly equivalent constrained minimization problem involving auxiliary variable(s). The classical example for (7.1) is the following formulation:

$$\hat{x} = \arg\min_x \min_{z \,:\, z = Tx} \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 . \tag{7.2}$$

This variable splitting idea underlies many closely related methods:
• split Bregman [3]
• augmented Lagrangian (AL) methods [4, 5]
• alternating direction method of multipliers (ADMM) [6, 7] [8]
• Douglas-Rachford splitting method
• For surveys see [9] [10].
• For a unifying generalization called function-linearized proximal ADMM (FLiP-ADMM) see [11].
This topic continues Ch. 6 because the methods involve alternating minimization.

7.1 Convex conjugate

This chapter uses another way to "transform" functions, called the convex conjugate or Fenchel transform. As is often the case when working with convex functions, we need extended real numbers.

Define. The convex conjugate of a function $f : \mathbb{R}^N \mapsto \mathbb{R} \cup \{\pm\infty\}$ is the function $f^* : \mathbb{R}^N \mapsto \mathbb{R} \cup \{\pm\infty\}$ defined by

$$f^*(y) \triangleq \sup_{x\in\mathbb{R}^N} y'x - f(x). \tag{7.3}$$

Example. Consider the affine function $f(x) = a'x + b$ for some vector $a \in \mathbb{R}^N$ and constant $b \in \mathbb{R}$. Then

$$f^*(y) = \sup_x y'x - f(x) = \sup_x y'x - (a'x + b) = \sup_x (y - a)'x - b = \begin{cases} -b, & y = a \\ \infty, & y \neq a \end{cases} = \chi_{\{a\}}(y) - b.$$

Because of the supremum in the definition, $\infty$ arises fairly often when working with convex conjugates!

Convex conjugate properties

The convex conjugate of a closed function is again a closed function (one whose sublevel sets are closed sets).
Example. A convex function that is not closed is $\chi_{(0,1)}(x)$.
Scaling: $g(x) = f(ax) \implies g^*(y) = f^*(y/a)$ for $a \neq 0$.
Scaling: $g(x) = a f(x) \implies g^*(y) = a f^*(y/a)$ for $a > 0$.
Shift: $g(x) = f(x + b) \implies g^*(y) = f^*(y) - b'y$.
A key property is that the biconjugate $(f^*)^* = f^{**} = f$ when $f$ is convex and lower semi-continuous, so

$$f^{**}(u) = \sup_y u'y - f^*(y) = f(u). \tag{7.4}$$

Convex combination: $\alpha \in [0, 1] \implies (\alpha f + (1 - \alpha) g)^* \le \alpha f^* + (1 - \alpha) g^*$, if $f$ and $g$ have the same domain. Unfortunately there is no general linearity property. However, see the following question.

If $g(x) = f(Ax)$ where $A \in \mathbb{R}^{N\times N}$ is an invertible matrix and $x \in \mathbb{R}^N$, then $g^*(y) = $ ?
A: $f^*(Ay)$  B: $f^*(A'y)$  C: $f^*(A^{-1}y)$  D: $f^*(A^{-T}y)$  E: None  ?? ??

Example. Consider $f(x) = \|x\|_1$ for $x \in \mathbb{R}^N$. Then applying the definition (7.3):

$$f^*(y) = \sup_x y'x - f(x) = \sup_x y'x - \|x\|_1 = \sup_x \sum_{n=1}^N (y_n x_n - |x_n|) = \sum_{n=1}^N \sup_{x_n\in\mathbb{R}} (y_n x_n - |x_n|) = \sum_{n=1}^N \chi_{\{|y_n| \le 1\}} = \chi_{\{\|y\|_\infty \le 1\}}.$$

[Group work: sketch $y_n x_n - |x_n|$ versus $x_n$, e.g., for $y_n = 2/3$, to see that the supremum is $0$ when $|y_n| \le 1$ and $\infty$ otherwise.]

The convex conjugate of the 1-norm is the characteristic function of the unit-ball for the ∞-norm.

As expected based on the property (7.4), the biconjugate is:

$$f^{**}(x) = \sup_y x'y - f^*(y) = \sup_y x'y - \chi_{\{\|y\|_\infty \le 1\}} = \sup_y \sum_{n=1}^N \left( x_n y_n - \chi_{\{|y_n| \le 1\}} \right) = \sum_{n=1}^N \sup_{|y_n| \le 1} (x_n y_n) = \sum_{n=1}^N |x_n| = \|x\|_1 = f(x).$$

7.2 Method of Lagrange multipliers

The constrained optimization form (7.2) has the potential benefit that there is no matrix $T$ inside the 1-norm, but it also has the challenge that it involves the equality constraint $z = Tx$. The classic strategy for dealing with equality constraints is the method of Lagrange multipliers. To solve a constrained optimization problem of the form

$$\hat{x} = \arg\min_{x\in\mathbb{R}^N} f(x) \quad \text{s.t.} \quad h(x) = 0_M, \tag{7.5}$$

where $h : \mathbb{R}^N \mapsto \mathbb{R}^M$, we first define the Lagrangian function as follows (using a "+" per [12, §5.1.1]):

$$L(x, \gamma) \triangleq f(x) + \gamma' h(x), \tag{7.6}$$

where $\gamma \in \mathbb{R}^M$ is a vector of Lagrange multipliers or dual variables. We assume throughout that the set of feasible points is nonempty:

$$\left\{ x \in \mathbb{R}^N : h(x) = 0_M \right\} \neq \emptyset,$$

otherwise the problem is vacuous.

Assuming that both f and h are continuously differentiable, we then solve (7.5) by seeking stationary points of the Lagrangian, i.e., points (x, γ) where

$$0_N = \nabla_x L = \nabla f(x) + \nabla h(x)\, \gamma = \nabla f(x) + \sum_{m=1}^M \nabla h_m(x)\, \gamma_m$$

$$0_M = \nabla_\gamma L = h(x),$$

which involves solving $N + M$ equations in $N + M$ unknowns. In general there can be multiple stationary points, and one must evaluate $f(x)$ for all the candidates to find a global optimizer.

There are several nice pencil-and-paper examples on wikipedia. However, a serious limitation of the method for numerical optimization is that the critical points occur at saddle points of $L$, not at local minima or maxima, so standard numerical optimization methods are inapplicable. So the next few examples are all of the pencil-and-paper kind.

Example. Suppose $x$ represents the probability mass function of a discrete probability distribution, so $x_n \ge 0$ and $1_N' x = 1$, i.e., $h(x) = 0$ where $h(x) = 1'x - 1$. Let $H(x)$ denote the corresponding Shannon entropy:

$$H(x) = -\sum_{n=1}^N x_n \log x_n = -x' \log.(x).$$

To find the distribution $x$ that has the maximum entropy, we want to minimize $f(x) = -H(x)$ subject to the constraint $h(x) = 0$. The Lagrangian and its gradients are

$$L(x, \gamma) = f(x) + \gamma h(x) = x' \log.(x) + \gamma (1'x - 1)$$

$$\nabla_x L(x, \gamma) = 1 + \log.(x) + \gamma 1 = 1(1 + \gamma) + \log.(x), \qquad \nabla_\gamma L(x, \gamma) = 1'x - 1.$$

Setting $\nabla_x L = 0$ yields $x_n = e^{-(1+\gamma)}$, and combining with $\nabla_\gamma L = 0$ yields $N e^{-(1+\gamma)} = 1$, so $x_n = 1/N$. Thus the discrete uniform distribution $x = 1/N$ has maximum Shannon entropy.

Technically this problem also has an inequality constraint $x_n \ge 0$, but it turned out not to matter because the solution was positive. Lagrange theory can also handle inequality constraints [12, §5.1.1], but we will not need that generalization here. A quick numerical sanity check follows.
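For readers who like to verify such results numerically, here is a quick check using scipy; the code and its variable names are illustrative additions, not part of the original notes.

```python
# Hypothetical sanity check: minimize f(x) = sum_n x_n log x_n subject to
# 1'x = 1 (with bounds keeping x_n > 0); the minimizer should be x = 1/N.
import numpy as np
from scipy.optimize import minimize

N = 5
f = lambda x: np.sum(x * np.log(x))                    # f(x) = -H(x)
con = {"type": "eq", "fun": lambda x: np.sum(x) - 1}   # h(x) = 1'x - 1 = 0
x0 = np.random.default_rng(0).dirichlet(np.ones(N))    # random feasible start
res = minimize(f, x0, constraints=[con], bounds=[(1e-9, 1)] * N)
print(np.round(res.x, 4))  # expect approximately [0.2 0.2 0.2 0.2 0.2] = 1/N
```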

Example. Consider the case $f(x) = (x - 1)^2$ and $h(x) = x^2 - 9$. Here

$$L(x, \gamma) = f(x) + \gamma h(x) = (x - 1)^2 + \gamma (x^2 - 9)$$
$$\frac{\partial}{\partial x} L(x, \gamma) = 2(x - 1) + 2\gamma x = 0 \implies \gamma = 1/x - 1$$
$$\frac{\partial}{\partial \gamma} L(x, \gamma) = x^2 - 9 = 0 \implies x = \pm 3.$$

[Figure: surface plot of $L(x, \gamma)$ showing its saddle points.]

The figure shows $L(x, \gamma)$ and its saddle points at $(x, \gamma) = (3, -2/3)$ and $(-3, -4/3)$. Checking $f(x)$ at $x = \pm 3$ shows that $x = 3$ is the constrained minimizer. This is a trivial example, but it allows us to visualize $L(x, \gamma)$ to see the saddle points. Note that we are not trying to minimize the Lagrangian with respect to both $x$ and $\gamma$.

Example. For simplicity, consider the Tikhonov regularized cost function: (Read)

$$\hat{x} = \arg\min_x \Psi(x), \qquad \Psi(x) = \frac{1}{2} \|Ax - y\|_2^2 + \frac{1}{2} \beta \|Tx\|_2^2 .$$

Suppose (for illustration only) we use variable splitting to rewrite this as a constrained problem:

$$\hat{x} = \arg\min_x \min_{z \,:\, z = Tx} \frac{1}{2} \|Ax - y\|_2^2 + \frac{1}{2} \beta \|z\|_2^2 .$$

We can rewrite this optimization problem using $w = [x; z]$ as

$$\hat{w} = \arg\min_w f(w) \ \text{ s.t. } \ h(w) = 0, \qquad f(w) = \frac{1}{2} \|[A\ 0] w - y\|_2^2 + \frac{1}{2} \beta \|[0\ I] w\|_2^2, \qquad h(w) = [T\ {-I}] w.$$

The Lagrangian and its gradients are

$$L(w, \gamma) = f(w) + \gamma' h(w)$$
$$\nabla_w L(w, \gamma) = \begin{bmatrix} A' \\ 0 \end{bmatrix} ([A\ 0] w - y) + \beta \begin{bmatrix} 0 \\ I \end{bmatrix} ([0\ I] w) + \begin{bmatrix} T' \\ -I \end{bmatrix} \gamma$$

$$\nabla_\gamma L(w, \gamma) = [T\ {-I}] w.$$

Setting both of the gradients to zero and solving yields the solution $\hat{w} = [\hat{x}; \hat{z}]$, where

$$\hat{x} = (A'A + \beta T'T)^{-1} A'y, \qquad \hat{z} = T\hat{x}, \qquad \gamma_* = \beta \hat{z}.$$

This solution is unique provided the nullspaces of $A$ and $T$ intersect only at $0$. We knew this solution already, so this example just illustrates the ideas. See wikipedia for more examples. A small numerical sanity check follows.
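Here is a small numerical check of this stationary point, a sketch with random dense matrices; all names here are illustrative.

```python
# Verify that xhat = (A'A + beta*T'T)^{-1} A'y, zhat = T xhat, gamma = beta*zhat
# make the Lagrangian gradients vanish (assuming the nullspace condition holds).
import numpy as np

rng = np.random.default_rng(0)
M, N, K, beta = 8, 5, 4, 2.0
A = rng.standard_normal((M, N))
T = rng.standard_normal((K, N))
y = rng.standard_normal(M)

xhat = np.linalg.solve(A.T @ A + beta * T.T @ T, A.T @ y)
zhat = T @ xhat
gam = beta * zhat
grad_L_x = A.T @ (A @ xhat - y) + T.T @ gam                 # x-block of grad_w L
grad_Psi = A.T @ (A @ xhat - y) + beta * T.T @ (T @ xhat)   # gradient of Psi
print(np.allclose(grad_L_x, 0), np.allclose(grad_Psi, 0))   # True True
```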

Lagrange dual function (Read)

A thorough understanding of the method of Lagrange multipliers requires the study of duality, which is a major topic in EECS 600 and IOE 611. We provide a brief overview here.
Define. The Lagrange dual function for (7.6) is [12, p. 216]:

$$g(\gamma) = \inf_x L(x, \gamma) = \inf_x \left( f(x) + \gamma' h(x) \right). \tag{7.7}$$

This function is important because we are trying to minimize over $x$ but we also must consider $\gamma$ (due to the constraints).
Example. Consider the constraint $h(x) = 0$ for $h(x) = |x| - 3$ and the non-convex negative ReLU function $f(x) = -\max(x, 0)$. Sketching $L(x, \gamma) = f(x) + \gamma (|x| - 3)$ shows that the dual is the following concave function:

$$g(\gamma) = \begin{cases} -3\gamma, & \gamma \ge 1 \\ -\infty, & \gamma < 1. \end{cases}$$

[Figure: sketches of $L(x, \gamma)$ versus $x$ for $\gamma < 1$, $\gamma = 1$, $\gamma > 1$, and the resulting dual $g(\gamma)$. A numerical check follows.]
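A quick grid-based check of this dual (an illustrative addition; the grid minimum only approximates the infimum):

```python
# For gamma >= 1 the infimum of L(x, gamma) = -max(x,0) + gamma*(|x| - 3) is
# -3*gamma, attained at x = 0; for gamma < 1 it is -infinity, which shows up
# as the grid minimum decreasing without bound as the grid widens.
import numpy as np

x = np.linspace(-100, 100, 200001)
for gamma in [0.5, 1.0, 2.0]:
    L = -np.maximum(x, 0) + gamma * (np.abs(x) - 3)
    print(gamma, L.min(), -3 * gamma)  # matches -3*gamma when gamma >= 1
```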

Properties of Lagrange dual (Read)

Later we will attempt to maximize $g(\gamma)$ over $\gamma$, which is called the Lagrange dual problem [12, p. 223], whereas (7.5) is called the primal problem. An important property of the Lagrange dual is that it is a concave function, even if $f$ is not a convex function, because it is a point-wise infimum of affine functions.
Fact. The pointwise supremum of convex functions is a convex function.
Fact. The pointwise infimum of concave functions is a concave function.


For any γ, the Lagrange dual is a lower bound on the optimal cost function value [12, p. 216]:

$$g(\gamma) \le f_* \triangleq \inf_{x \,:\, h(x) = 0} f(x). \tag{7.8}$$

The reason is simple: if $x_0$ is a feasible point, i.e., $h(x_0) = 0$, then

$$g(\gamma) = \inf_x L(x, \gamma) \le L(x_0, \gamma) = f(x_0).$$

This inequality holds for every feasible point, so it also must hold for any best feasible point, implying (7.8).

Relating Lagrange dual to convex conjugate (Read)

The primary type of constraint relevant in this chapter is an equality constraint, like in (7.2). For the constrained minimization problem

$$\hat{x} = \arg\min_{x\in\mathbb{R}^N} f(x), \quad \text{s.t.} \quad Cx = d,$$

the Lagrangian is $L(x, \gamma) = f(x) + \gamma'(Cx - d)$, and the dual function is [12, p. 221]:

$$g(\gamma) = \inf_x L(x, \gamma) = \inf_x \left( f(x) + \gamma'(Cx - d) \right) = -\gamma'd + \inf_x \left( f(x) + (C'\gamma)'x \right) = -\gamma'd - f^*(-C'\gamma), \tag{7.9}$$

where $f^*$ denotes the convex conjugate of $f$.

Example. Consider the (noiseless) compressed sensing problem $\arg\min_x \|x\|_1$, s.t. $Ax = b$. Using the 1-norm conjugate example above (and the symmetry of the $\infty$-norm ball), the dual function is $g(\gamma) = -\gamma'b - \chi_{\{\|A'\gamma\|_\infty \le 1\}}$.

Lagrange dual problem (Read)

Because the Lagrange dual (7.8) provides a lower bound on $f_*$ for any $\gamma$, we can seek the best lower bound by maximizing it:

$$\gamma_* \triangleq \arg\max_\gamma g(\gamma) = \arg\min_\gamma \left( -g(\gamma) \right) \implies g(\gamma_*) \le f_* . \tag{7.10}$$

Finding $\gamma_*$ is called the Lagrange dual problem [12, p. 223]. It is a convex optimization problem because $-g(\cdot)$ is a convex function, even if the primal cost $f$ is not. The best lower bound satisfies the weak duality inequality:

$$d_* \triangleq g(\gamma_*) \le f_* .$$

(If d∗ = ∞, then the primal problem is infeasible.) The difference f∗ − d∗ is called the duality gap of the original problem, sometimes used for stopping criteria [12, p. 242].

Define. We say strong duality holds iff $d_* = f_*$, i.e., iff the optimal duality gap is zero.
Fact. If $f$ is a convex function and we have linear constraints like $h(x) = Cx - d = 0$, then strong duality holds [12, p. 226]. (There are other more general conditions, such as Slater's condition.)
Example. Find the minimum norm point on a hyperplane: $\arg\min_x \|x\|_2^2$, s.t. $Ax = b$. The dual problem turns out to be $\max_\gamma -\frac{1}{4}\gamma'AA'\gamma - b'\gamma$, as derived below.
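For completeness, here is the short algebra behind that dual (no new assumptions, just zeroing the $x$-gradient of the Lagrangian):

```latex
% L(x, gamma) = x'x + gamma'(Ax - b); the x-gradient vanishes at x = -A'gamma/2:
\begin{align*}
\nabla_x L(x, \gamma) &= 2x + A'\gamma = 0 \implies x = -\tfrac{1}{2} A'\gamma , \\
g(\gamma) &= L\bigl(-\tfrac{1}{2} A'\gamma, \, \gamma\bigr)
  = \tfrac{1}{4} \gamma' A A' \gamma - \tfrac{1}{2} \gamma' A A' \gamma - \gamma' b
  = -\tfrac{1}{4} \gamma' A A' \gamma - \gamma' b .
\end{align*}
```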

Using the dual problem to solve the primal problem (Read)

When strong duality holds, sometimes it is easier to solve the dual problem first, i.e., to find γ∗ in (7.10). Then we solve the original primal problem by finding a minimizer over x of the Lagrangian L(x, γ∗), i.e.,

$$\hat{x} = \arg\min_x L(x, \gamma_*) = \arg\min_x f(x) + \gamma_*' h(x).$$

This is now an unconstrained minimization problem. See [12, p. 248] for more details and examples.

7.3 Augmented Lagrangian methods

The augmented Lagrangian corresponding to the constrained optimization problem (7.2) is:

$$L(x, z; \gamma, \mu) \triangleq \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 + \operatorname{real}\{\langle \gamma, Tx - z \rangle\} + \frac{\mu}{2} \|Tx - z\|_2^2 , \tag{7.11}$$

where $\gamma \in \mathbb{C}^K$ is the vector of Lagrange multipliers and $\mu > 0$ is the AL penalty parameter, which affects the convergence rate but not the final estimate $\hat{x}$ (if unique). It is convenient to rewrite the AL in terms of a scaled dual variable $\eta \triangleq (1/\mu)\gamma$. Completing the square leads to the following scaled augmented Lagrangian:

$$L(x, z; \eta, \mu) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 + \frac{\mu}{2} \left( \|Tx - z + \eta\|_2^2 - \|\eta\|_2^2 \right). \tag{7.12}$$

A classical AL approach would alternate between two types of updates:
• primal update: descent steps for the primal variable $x$ and auxiliary variable $z$ (e.g., using POGM!?):

$$(x_{k+1}, z_{k+1}) = \arg\min_{x, z} L(x, z; \eta_k, \mu),$$

$$\eta_{k+1} = \eta_k + (T x_{k+1} - z_{k+1}) .$$

Alternating direction method of multipliers (ADMM)

The alternating direction method of multipliers (ADMM) alternates between updating the primal variable $x$ and the auxiliary variable $z$, instead of updating them jointly, as follows. Here is the AL function again:

$$L(x, z; \eta, \mu) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 + \frac{\mu}{2} \left( \|Tx - z + \eta\|_2^2 - \|\eta\|_2^2 \right).$$

The $z$ update is simply soft thresholding:

$$z_{k+1} = \operatorname{prox}_{\frac{\beta}{\mu} \|\cdot\|_1}(T x_k + \eta_k) = \operatorname{soft}.\!\left( T x_k + \eta_k,\, \frac{\beta}{\mu} \right).$$

The $x$ update minimizes a quadratic function; setting the gradient to zero yields:

$$x_{k+1} = \arg\min_x L(x, z_{k+1}; \eta_k, \mu) = (A'A + \mu T'T)^{-1} \left( A'y + \mu T'(z_{k+1} - \eta_k) \right). \tag{7.13}$$

Often we must use PCG to compute $x_{k+1}$ iteratively. The $\eta$ update is a gradient ascent step:

$$\eta_{k+1} = \eta_k + 1 \cdot \nabla_\eta \left. \left( \frac{1}{2} \|T x_{k+1} - z_{k+1} + \eta\|_2^2 - \frac{1}{2} \|\eta\|_2^2 \right) \right|_{\eta = \eta_k} = \eta_k + (T x_{k+1} - z_{k+1}).$$

• The unit step size here for the $\eta$ update ensures dual feasibility [7].

• A drawback of variable splitting methods is the need to select the parameter $\mu$. Adaptive methods for tuning $\mu$ have been proposed [7, 13–16].
• The above updates of $x$ and $z$ are sequential; parallel ADMM updates are also possible [17–19].
• For some cases, such as when $T$ is finite differences with periodic boundary conditions and $A'A$ is circulant, the $x$ update can be computed using FFTs.
• If the $x$ update requires an iterative algorithm (like PCG) to compute, and we run only a finite number of iterations, then it is called an inexact update.
• There are convergence guarantees for ADMM for inexact updates as long as the inexactness is small enough [6]. Although checking the convergence condition in the original paper is usually impractical, more recent variations of ADMM [20, 21] provide more practical conditions.
• For a proof of O(1/k) convergence rate, see [22].
• For a survey on ADMM, see [23].
• For tight convergence rates for nonconvex problems see [24].
• For accelerated ADMM using momentum see [25–28].
A minimal implementation sketch of these updates follows this list.
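Here is a minimal sketch of the ADMM updates above for problem (7.1), assuming $A$ and $T$ are small dense arrays so that a direct solve of the $x$ update is feasible (a real implementation might use PCG or FFTs instead); the function name and arguments are illustrative.

```python
import numpy as np

def soft(v, c):
    """Soft thresholding: the prox of c*||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - c, 0)

def admm(A, T, y, beta, mu, niter=100):
    """ADMM sketch for min_x 0.5*||A x - y||_2^2 + beta*||T x||_1 via z = T x."""
    x = np.zeros(A.shape[1])
    eta = np.zeros(T.shape[0])
    H = A.T @ A + mu * (T.T @ T)   # Hessian of the x update (7.13)
    Aty = A.T @ y
    for _ in range(niter):
        z = soft(T @ x + eta, beta / mu)                    # z update
        x = np.linalg.solve(H, Aty + mu * T.T @ (z - eta))  # x update (7.13)
        eta = eta + (T @ x - z)                             # dual update, unit step
    return x
```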

Binary classifier with hinge loss function via ADMM

Example. Consider the binary classifier problem from machine learning with a sparsity regularizer:

$$\hat{x} = \arg\min_{x\in\mathbb{R}^N} \Psi(x), \qquad \Psi(x) = 1_M' h.(Ax) + \beta \|x\|_1 ,$$

where $h(t)$ is the hinge loss. This is not a composite cost function because neither term is smooth. One variable splitting approach is to let $u = Ax$ and rewrite it as a constrained optimization problem

$$\hat{x} = \arg\min_{x\in\mathbb{R}^N} \min_{u \in \mathbb{R}^M \,:\, u = Ax} 1_M' h.(u) + \beta \|x\|_1 ,$$

for which the corresponding augmented Lagrangian is

$$L(x, u; \eta, \mu) = 1_M' h.(u) + \beta \|x\|_1 + \frac{\mu}{2} \left( \|Ax - u + \eta\|_2^2 - \|\eta\|_2^2 \right).$$

Which variable update is a LASSO-type problem? A: $x$  B: $u$  C: $\eta$  D: $\mu$  E: None  ??

The u update is a proximal operation.

The usual update for $\eta$ applies: $\eta_{k+1} = \eta_k + (A x_{k+1} - u_{k+1})$.
Conveniently, the updates use methods covered previously. You will explore this application further in a HW problem.

ADMM in general (sketch) (Read)

Consider the somewhat general constrained optimization problem that arises from variable splitting:

$$\min_{u, v} f_1(u) + f_2(v) \quad \text{s.t.} \quad Au + Bv - c = 0,$$

where $f_1$ and $f_2$ are convex functions. The corresponding Lagrangian function is

$$L(u, v, \gamma) = f_1(u) + f_2(v) + \gamma'(Au + Bv - c).$$

Following (7.9), the Lagrange dual function is

$$g(\gamma) = \inf_{u, v} L(u, v, \gamma) = -f_1^*(-A'\gamma) - f_2^*(-B'\gamma) - c'\gamma .$$

The corresponding dual problem is arg maxγ g(γ). One can think of ADMM as applying the proximal point algorithm to update the dual variable:

$$\gamma_{k+1} = \arg\min_\gamma -g(\gamma) + \frac{1}{2\alpha} \|\gamma - \gamma_k\|_2^2 .$$

Alternatively, one can think of applying gradient ascent to the Lagrangian

$$\gamma_{k+1} = \gamma_k + \alpha \nabla_\gamma L(u_{k+1}, v_{k+1}, \gamma_k).$$

In either case, we want to choose the step size α so that the dual function increases, leading to the update:

$$\gamma_{k+1} = \gamma_k + (A u_{k+1} + B v_{k+1} - c),$$

after we apply appropriate scaling. For details, see [7, Sect. 3.1].

Convergence

Convergence rate analysis for ADMM is fairly complicated in general and depends on the cost function properties [30]. A worst-case O(1/k) rate is shown in [22] under certain conditions. Remarkably, there are recent global convergence results for ADMM even for nonconvex, nonsmooth problems, under quite broad conditions [31].

Linearized augmented Lagrangian method (LALM)

The so-called linearized augmented Lagrangian method (LALM) is an alternative approach that replaces the expensive exact $x$ update (7.13) with a proximal point update:

$$x_{k+1} = \arg\min_x L(x, z_{k+1}; \eta_k, \mu) + \frac{1}{2} \|x - x_k\|_P^2 , \tag{7.14}$$

where $P$ is a positive semidefinite matrix that we must design. For the problem at hand, the natural choice for $P$ comes from the majorization ideas in Ch. 4, specifically

$$P \triangleq D - (A'A + \mu T'T),$$

where $D \succeq A'A + \mu T'T$ is any majorizing diagonal matrix, such as $|||A'A + \mu T'T|||_2 I$ or $\operatorname{Diag}\{|A'A + \mu T'T|\, 1\}$. With this choice of proximal point norm, the $x$ update simplifies (greatly!) to a diagonally preconditioned gradient descent step:

$$x_{k+1} = x_k - D^{-1} \left( A'(A x_k - y) + \mu T'(T x_k - z_{k+1} + \eta_k) \right). \tag{7.15}$$

The name "linearized" is mysterious; "majorized" seems more apt to me. There are convergence guarantees for this approach [15, 32, 33]. It is also called the split inexact Uzawa method [22, 34].

Augmented ADMM (alternate derivation of LALM)

The added proximal point term in (7.14) may seem arbitrary, so here is an alternate derivation. Introduce two auxiliary variables [35] [15] and rewrite the original problem (7.1) in this equivalent constrained form [36]:

$$\hat{x} = \arg\min_x \min_{w \,:\, w = Sx} \min_{z \,:\, z = Tx} \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 , \tag{7.16}$$

where $S$ is a matrix we will design later. The auxiliary variable $w$ seems superfluous, but it is still legitimate. This approach is called augmented ADMM in [15]. (A HW problem on the 1-norm regularized hinge loss function also can use two auxiliary variables.) The corresponding scaled augmented Lagrangian for constrained problem (7.16) is:

$$L(x, z, w; \eta_1, \eta_2, \mu_1, \mu_2) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|z\|_1 + \frac{\mu_1}{2} \left( \|Sx - w + \eta_1\|_2^2 - \|\eta_1\|_2^2 \right) + \frac{\mu_2}{2} \left( \|Tx - z + \eta_2\|_2^2 - \|\eta_2\|_2^2 \right).$$

Now apply ADMM to this AL function.
• The $z$ update is again soft thresholding: $z_{k+1} = \operatorname{soft}.(T x_k + \eta_2^{(k)},\, \beta/\mu_2)$.
• Update both $\eta_1$ and $\eta_2$ via the usual ascent steps.
• The update of $w$ is trivial:

$$w_{k+1} = S x_k + \eta_1^{(k)} . \tag{7.17}$$

The main change is to the x update. The AL is quadratic in x, so zeroing its gradient yields the x update:

$$x_{k+1} = (A'A + \mu_1 S'S + \mu_2 T'T)^{-1} \left( A'y + \mu_1 S'(w_{k+1} - \eta_1^{(k)}) + \mu_2 T'(z_{k+1} - \eta_2^{(k)}) \right).$$

We want to choose $S$ so that the Hessian is easy to invert. Similar to LALM, the natural choice is to draw inspiration from the majorizers in Ch. 4 and choose

$$S = \frac{1}{\sqrt{\mu_1}} \left( D - (A'A + \mu_2 T'T) \right)^{1/2},$$

where $D \succeq A'A + \mu_2 T'T$ is an easily inverted matrix (e.g., diagonal) and the superscript $1/2$ denotes the matrix principal square root. With this choice of $S$, by design the Hessian becomes

$$A'A + \mu_1 S'S + \mu_2 T'T = D.$$

So the $x$ update simplifies to

$$x_{k+1} = D^{-1} \left( A'y + \mu_1 S'(w_{k+1} - \eta_1^{(k)}) + \mu_2 T'(z_{k+1} - \eta_2^{(k)}) \right). \tag{7.18}$$

This looks simpler because inverting the diagonal matrix $D$ is trivial, but there is still work to do here: the product $S'(w_{k+1} - \eta_1^{(k)})$ appears nontrivial because of the matrix square root in the definition of $S$. However, using (7.17) enables the simplification:

$$\mu_1 S'(w_{k+1} - \eta_1^{(k)}) = \mu_1 S'(S x_k) = \mu_1 S^2 x_k = \left( D - (A'A + \mu_2 T'T) \right) x_k .$$

Substituting this into (7.18) and simplifying yields

$$x_{k+1} = D^{-1} \left( A'y + \left( D - (A'A + \mu_2 T'T) \right) x_k + \mu_2 T'(z_{k+1} - \eta_2^{(k)}) \right)$$

$$= x_k - D^{-1} \left( A'(A x_k - y) + \mu_2 T'(T x_k - z_{k+1} + \eta_2^{(k)}) \right). \tag{7.19}$$

In this form we never even need to compute $\eta_1$, so this algorithm is identical to the LALM update (7.15).

Summary of LALM

Here is a summary of the LALM aka augmented ADMM for the analysis regularized problem (7.1).
Initialize $x_0 \in \mathbb{F}^N$ and set $z_0 = T x_0$ and $\eta_0 = 0$. Choose $\mu > 0$. Design majorizer $D$. For $k = 0, 1, \ldots$:

$$z_{k+1} = \operatorname{soft}.(T x_k + \eta_k,\, \beta/\mu) \qquad \text{soft thresholding}$$
$$\tilde{x}_{k+1} = T'(T x_k - z_{k+1} + \eta_k) \qquad \text{denoising?}$$
$$x_{k+1} = x_k - D^{-1} \left( A'(A x_k - y) + \mu \tilde{x}_{k+1} \right) \qquad \text{data update}$$

$$\eta_{k+1} = \eta_k + (T x_{k+1} - z_{k+1}) \qquad \text{dual update}$$

Conveniently, this approach has no inner iterations [15, 36], so it is quite practical. (A minimal code sketch appears after the following quote.) Quoting [32, p. 390]: "The LALM iteration is convergent for any fixed AL penalty parameter $\mu > 0$ and any $A$ [37], while the standard AL method is convergent (in the primal) if $A$ has full column rank [6, Thm. 8]. Furthermore, even if the AL penalty parameter varies every iteration, [LALM] is convergent when $\mu$ is non-decreasing and bounded above [37]."
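Here is a minimal LALM sketch following the summary above, using the diagonal majorizer $D = \operatorname{Diag}\{|A'A + \mu T'T|\, 1\}$ mentioned earlier (illustrative names and dense arrays; not an optimized implementation):

```python
import numpy as np

def lalm(A, T, y, beta, mu, niter=100):
    """LALM (aka augmented ADMM) sketch for min_x 0.5*||Ax-y||_2^2 + beta*||Tx||_1."""
    soft = lambda v, c: np.sign(v) * np.maximum(np.abs(v) - c, 0)
    x = np.zeros(A.shape[1])
    eta = np.zeros(T.shape[0])
    H = A.T @ A + mu * (T.T @ T)
    d = np.abs(H).sum(axis=1)      # diagonal majorizer D = Diag{|A'A + mu T'T| 1}
    for _ in range(niter):
        z = soft(T @ x + eta, beta / mu)            # soft thresholding
        xt = T.T @ (T @ x - z + eta)                # "denoising" term
        x = x - (A.T @ (A @ x - y) + mu * xt) / d   # data update (7.15)
        eta = eta + (T @ x - z)                     # dual update
    return x
```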

Example. Here are low-dose chest X-ray CT images reconstructed from only 81 projection views (instead of the usual 984 views, so roughly a factor of 10 less radiation) [35]. The left image was reconstructed by filtered back-projection (FBP), which is the classical approach in CT. The right image was reconstructed using LALM with TV regularization and clearly shows better details.

[Figure: FBP reconstruction (left) vs. LALM with TV regularization (right).]

This is an example of why there is considerable interest in cost functions with terms like $\|Tx\|_1$.

Primal-dual hybrid gradient method

LALM is closely related [15, 32] to the Chambolle-Pock first-order primal-dual algorithm [38, 39], also known as the primal-dual hybrid gradient (PDHG) method. Generalizing our cost function:

$$\hat{x} = \arg\min_x g(Bx - b), \qquad \text{e.g., for any } \gamma > 0: \quad B = \begin{bmatrix} A \\ \gamma T \end{bmatrix}, \quad b = \begin{bmatrix} y \\ 0 \end{bmatrix}, \quad g\!\left( \begin{bmatrix} r \\ z \end{bmatrix} \right) = \frac{1}{2} \|r\|_2^2 + \frac{\beta}{\gamma} \|z\|_1 .$$

Then PDHG is [40]:

$$x_{k+1} = x_k - \frac{1}{\gamma} B' u_k$$

$$u_{k+1} = \operatorname{prox}_{\alpha g^*} \left( u_k + \alpha \left( B(2 x_{k+1} - x_k) - b \right) \right),$$

where $g^*$ denotes the convex conjugate of the function $g$.
• PDHG converges if $\gamma/\alpha \ge |||B'B|||_2$.
• It is popular because it does not involve inverting $B'B$.
• It is easy to implement, but can be slow to converge.
• It is also called the forward-backward primal-dual algorithm. See http://proximity-operator.net/tutorial.html

• See [41] for a coordinate-descent version.
• For a version for nonconvex problems, with acceleration and global convergence, see [42].
A minimal PDHG sketch follows this list.
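Here is a minimal PDHG sketch for the stacked formulation above. The prox of $\alpha g^*$ is evaluated via the Moreau identity $\operatorname{prox}_{\alpha g^*}(v) = v - \alpha \operatorname{prox}_{g/\alpha}(v/\alpha)$, and the prox of $g$ separates over the $(r, z)$ blocks. All names are illustrative assumptions; a practical implementation would avoid forming $B'B$ explicitly.

```python
import numpy as np

def pdhg(A, T, y, beta, gamma, niter=500):
    """PDHG sketch for min_x g(Bx - b) with B = [A; gamma*T], b = [y; 0]."""
    B = np.vstack([A, gamma * T])
    b = np.concatenate([y, np.zeros(T.shape[0])])
    M = A.shape[0]
    alpha = gamma / np.linalg.norm(B.T @ B, 2)  # so that gamma/alpha >= |||B'B|||_2

    def prox_g(w, c):  # prox of c*g, where g(r,z) = 0.5*||r||^2 + (beta/gamma)*||z||_1
        r = w[:M] / (1 + c)                     # prox of the quadratic block
        z = np.sign(w[M:]) * np.maximum(np.abs(w[M:]) - c * beta / gamma, 0)
        return np.concatenate([r, z])

    x = np.zeros(B.shape[1])
    u = np.zeros(B.shape[0])
    for _ in range(niter):
        x_new = x - (B.T @ u) / gamma
        v = u + alpha * (B @ (2 * x_new - x) - b)
        u = v - alpha * prox_g(v / alpha, 1 / alpha)  # Moreau: prox of alpha*g*
        x = x_new
    return x
```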

Near-circulant splitting method

The near-circulant splitting method is [40]:

$$x_{k+1} = x_k - M^{-1} B' u_k$$

$$u_{k+1} = \operatorname{prox}_{\alpha g^*} \left( u_k + \alpha \left( B(2 x_{k+1} - x_k) - b \right) \right). \tag{7.20}$$

• This method converges if $M \succeq \alpha B'B$.
• Notice there is nothing about circulant matrices in the algorithm itself, just a majorizer.
• The motivation in [40] is for cases where $B'B$ is nearly circulant, and $M$ is a circulant majorizer of the form $M = \gamma I + \alpha C$, where $C$ is a circulant approximation to $B'B$ and $\gamma$ is small.

Example. A figure in [40], for a simplified tomography problem with $T$ as finite differences, illustrates the benefits of (7.20) over PDHG. Here $B'B = A'A + \gamma^2 T'T$ is approximately Toeplitz, so also approximately circulant.

[Figure from [40]: convergence comparison of near-circulant splitting vs. PDHG.]

7.4 Summary

This chapter gives a brief introduction to variable splitting methods for solving complex (especially non-smooth) optimization problems, focusing on the ADMM algorithm. This topic is an active research area, with numerous papers both on applications and on extending the theory, e.g., by considering non-convex problems. ADMM has been used for numerous applications, including neural network training [43].

Bibliography

[1] C. Lu, J. Shi, and J. Jia. "Online robust dictionary learning". In: Proc. IEEE Conf. on Comp. Vision and Pattern Recognition. 2013, 415–22.
[2] S. Li and Y. Fu. "Robust dictionary learning". In: Robust Representation for Data Analytics: Models and Applications. Springer, 2017, pp. 95–119.
[3] T. Goldstein and S. Osher. "The split Bregman method for L1-regularized problems". In: SIAM J. Imaging Sci. 2.2 (2009), 323–43.
[4] J. Aelterman, H. Q. Luong, B. Goossens, A. Pizurica, and W. Philips. "Augmented Lagrangian based reconstruction of non-uniformly sub-Nyquist sampled MRI data". In: Signal Processing 91.12 (Jan. 2011), 2731–42.
[5] S. Ramani and J. A. Fessler. "Parallel MR image reconstruction using augmented Lagrangian methods". In: IEEE Trans. Med. Imag. 30.3 (Mar. 2011), 694–706.
[6] J. Eckstein and D. P. Bertsekas. "On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators". In: Mathematical Programming 55.1-3 (Apr. 1992), 293–318.
[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. "Distributed optimization and statistical learning via the alternating direction method of multipliers". In: Found. & Trends in Machine Learning 3.1 (2010), 1–122.
[8] D. Kim. Accelerated proximal point method for maximally monotone operators. 2019.

[9] J. Eckstein and W. Yao. Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports 32, 3. 2012.
[10] R. Glowinski, S. J. Osher, and W. Yin. Splitting methods in communication, imaging, science, and engineering. Springer, 2016.
[11] E. K. Ryu and W. Yin. Lecture Notes on ADMM-Type Methods. 2019.
[12] S. Boyd and L. Vandenberghe. Convex optimization. UK: Cambridge, 2004.
[13] Z. Xu, M. A. T. Figueiredo, and T. Goldstein. "Adaptive ADMM with spectral penalty parameter selection". In: AISTATS. 2017, 718–27.
[14] B. Wohlberg. ADMM penalty parameter selection by residual balancing. 2017.
[15] Y. Zhu. "An augmented ADMM algorithm with application to the generalized LASSO problem". In: J. Computational and Graphical Stat. 26.1 (2017), 195–204.
[16] E. K. Ryu, A. B. Taylor, C. Bergeling, and P. Giselsson. Operator splitting performance estimation: tight contraction factors and optimal parameter selection. 2018.
[17] J. Eckstein. "Parallel alternating direction multiplier decomposition of convex programs". In: J. Optim. Theory Appl. 80.1 (Jan. 1994), 39–62.
[18] W. Deng, M-J. Lai, Z. Peng, and W. Yin. "Parallel multi-block ADMM with o(1/k) convergence". In: J. of Sci. Comp. 71.2 (May 2017), 712–36.
[19] M. Le and J. A. Fessler. "Efficient, convergent SENSE MRI reconstruction for non-periodic boundary conditions via tridiagonal solvers". In: IEEE Trans. Computational Imaging 3.1 (Mar. 2017), 11–21.
[20] J. Eckstein and W. Yao. "Approximate ADMM algorithms derived from Lagrangian splitting". In: Comp. Opt. Appl. 68.2 (Nov. 2017), 363–405.
[21] J. Eckstein and W. Yao. "Relative-error approximate versions of Douglas-Rachford splitting and special cases of the ADMM". In: Mathematical Programming 170.2 (Aug. 2018), 417–44.
[22] B. He and X. Yuan. "On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers". In: Numerische Mathematik 130.3 (July 2015), 567–77.
[23] J. Eckstein and W. Yao. "Understanding the convergence of the alternating direction method of multipliers: Theoretical and computational perspectives". In: Pacific J. Optimization 11.4 (2015), 619–44.

[24] A. Themelis and P. Patrinos. "Douglas-Rachford splitting and ADMM for nonconvex optimization: tight convergence results". In: SIAM J. Optim. 30.1 (2020), 149–81.
[25] R. I. Bot and E. R. Csetnek. An inertial forward-backward-forward primal-dual splitting algorithm for solving monotone inclusion problems. 2014.
[26] D. Lorenz and T. Pock. "An inertial forward-backward algorithm for monotone inclusions". In: J. Math. Im. Vision 51.2 (Feb. 2015), 311–25.
[27] H. Attouch, A. Cabot, Z. Chbani, and H. Riahi. "Inertial forward-backward algorithms with perturbations: application to Tikhonov regularization". In: J. Optim. Theory Appl. 179.1 (Oct. 2018), 1–36.
[28] H. Attouch. Fast inertial proximal ADMM algorithms for convex structured optimization with linear constraint. 2020.
[29] C. Y. Lin and J. A. Fessler. "Efficient dynamic parallel MRI reconstruction for the low-rank plus sparse model". In: IEEE Trans. Computational Imaging 5.1 (Mar. 2019), 17–26.
[30] D. Davis and W. Yin. "Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions". In: Math. of Op. Res. 42.3 (2017), 783–805.
[31] Y. Wang, W. Yin, and J. Zeng. "Global convergence of ADMM in nonconvex nonsmooth optimization". In: J. Sci. Comp. 78.1 (2019), 29–63.
[32] H. Nien and J. A. Fessler. "Fast X-ray CT image reconstruction using a linearized augmented Lagrangian method with ordered subsets". In: IEEE Trans. Med. Imag. 34.2 (Feb. 2015), 388–99.
[33] H. Nien and J. A. Fessler. "Relaxed linearized algorithms for faster X-ray CT image reconstruction". In: IEEE Trans. Med. Imag. 35.4 (Apr. 2016), 1090–8.
[34] X. Zhang, M. Burger, X. Bresson, and S. Osher. "Bregmanized nonlocal regularization for deconvolution and sparse reconstruction". In: SIAM J. Imaging Sci. 3.3 (2010), 253–76.
[35] H. Nien and J. A. Fessler. "Fast splitting-based ordered-subsets X-ray CT image reconstruction". In: Proc. 3rd Intl. Mtg. on Image Formation in X-ray CT. 2014, 291–4.
[36] L. Donati, E. Soubies, and M. Unser. "Inner-loop-free ADMM for cryo-EM". In: Proc. IEEE Intl. Symp. Biomed. Imag. 2019, 307–11.

[37] Z. Lin, R. Liu, and Z. Su. "Linearized alternating direction method with adaptive penalty for low-rank representation". In: Adv. in Neural Info. Proc. Sys. 2011, 612–20.
[38] A. Chambolle and T. Pock. "A first-order primal-dual algorithm for convex problems with applications to imaging". In: J. Math. Im. Vision 40.1 (2011), 120–145.
[39] E. Y. Sidky, J. H. Jorgensen, and X. Pan. "Convex optimization problem prototyping for image reconstruction in computed tomography with the Chambolle-Pock algorithm". In: Phys. Med. Biol. 57.10 (May 2012), 3065–92.
[40] E. K. Ryu, S. Ko, and J-H. Won. Splitting with near-circulant linear systems: Applications to total variation CT and PET. 2018.
[41] O. Fercoq and P. Bianchi. "A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions". In: SIAM J. Optim. 29.1 (Jan. 2019), 100–34.
[42] C. Clason, S. Mazurenko, and T. Valkonen. "Acceleration and global convergence of a first-order primal-dual method for nonconvex problems". In: SIAM J. Optim. 29.1 (Jan. 2019), 933–63.
[43] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. "Training neural networks without gradients: A scalable ADMM approach". In: PMLR 48 (2016), 2722–31.