EE 546, Univ of Washington, Spring 2014
10. Multiplier methods
• proximal point algorithm
• Moreau-Yosida regularization
• augmented Lagrangian method
• alternating direction method of multipliers (ADMM)
Multiplier methods 10–1

Recall: proximal mapping

the proximal mapping (prox-operator) of a convex function h is
    prox_h(x) = argmin_u ( h(u) + (1/2) ‖u − x‖₂² )

examples
• h(x) = 0: prox_h(x) = x

• h(x) = I_C(x) (indicator function of C): prox_h is projection on C

    prox_h(x) = argmin_{u ∈ C} ‖u − x‖₂² = P_C(x)

• h(x) = t‖x‖₁: prox_h is the ‘soft-threshold’ (shrinkage) operation
    prox_h(x)_i = x_i − t    if x_i ≥ t
                = 0          if |x_i| ≤ t
                = x_i + t    if x_i ≤ −t

Multiplier methods 10–2

Recall: proximal gradient method

unconstrained problem with cost function split in two components (slightly different notation from lecture 8)

    minimize f(x) = g(x) + h(x)

• g convex, differentiable, with dom g = Rⁿ
• h convex, possibly nondifferentiable, with inexpensive prox-operator

proximal gradient algorithm

    x^(k) = prox_{t_k h} ( x^(k−1) − t_k ∇g(x^(k−1)) )

t_k > 0 is step size, constant or determined by line search
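as an illustrative sketch (not from the slides), the soft-threshold prox and the proximal gradient iteration can be written in NumPy; the lasso instance, the problem data, and all names below are my own choices:

```python
import numpy as np

def soft_threshold(v, t):
    """prox of h(x) = t*||x||_1: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_g, prox_h, x0, t, iters=500):
    """x^(k) = prox_{t h}( x^(k-1) - t grad_g(x^(k-1)) ), constant step t."""
    x = x0
    for _ in range(iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# example: lasso, g(x) = (1/2)||Ax - b||^2, h(x) = lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 1.0
t = 1.0 / np.linalg.norm(A, 2) ** 2          # step 1/L with L = ||A||_2^2
x = proximal_gradient(lambda v: A.T @ (A @ v - b),
                      lambda v, s: soft_threshold(v, lam * s),
                      np.zeros(10), t)
```

at a minimizer the optimality condition 0 ∈ Aᵀ(Ax − b) + λ∂‖x‖₁ holds, which gives a cheap numerical check of the result.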
Multiplier methods 10–3

Recall: conjugate of closed functions
for f closed and convex

• f** = f
• y ∈ ∂f(x) if and only if x ∈ ∂f*(y)
• the following three statements are equivalent:

    xᵀy = f(x) + f*(y)  ⟺  y ∈ ∂f(x)  ⟺  x ∈ ∂f*(y)
Multiplier methods 10–4

Proximal point algorithm

a (conceptual) algorithm for minimizing a closed convex function f

    x^(k) = prox_{t_k f} (x^(k−1))
          = argmin_u ( f(u) + (1/(2 t_k)) ‖u − x^(k−1)‖₂² )
• special case of the proximal gradient method with g(x) = 0
• step size t_k > 0 affects #iterations, cost of prox evaluations
• a practical algorithm if inexact prox evaluations are used
• of interest if prox evaluations are much easier than original problem

basis of the method of multipliers or augmented Lagrangian method
Multiplier methods 10–5

Convergence

assumptions
• f is closed and convex (hence, prox_{tf}(x) is uniquely defined for all x)
• optimal value f⋆ is finite and attained at x⋆
• exact evaluations of prox-operator

result

    f(x^(k)) − f⋆ ≤ ‖x^(0) − x⋆‖₂² / ( 2 ∑_{i=1}^k t_i )

• implies convergence if ∑_i t_i → ∞
• rate is 1/k if t_i is constant
• t_i is arbitrary; however, the cost of the prox evaluations depends on t_i (when there is no closed form and we evaluate inexactly)
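the algorithm and the bound can be checked on a toy instance (my illustration): for f(x) = |x| the prox is soft-thresholding, so each iteration moves x toward 0 by t, and the 1/k bound above holds at every step:

```python
import numpy as np

# proximal point algorithm on f(x) = |x|, started at x0 = 5, constant t = 1
t, x0, xstar = 1.0, 5.0, 0.0
x = x0
for k in range(1, 11):
    x = np.sign(x) * max(abs(x) - t, 0.0)      # x^(k) = prox_{tf}(x^(k-1))
    bound = (x0 - xstar) ** 2 / (2 * k * t)    # RHS with sum t_i = k*t
    assert abs(x) - 0.0 <= bound + 1e-12       # f(x^(k)) - f* <= bound
```

here the iterates reach the minimizer exactly after five steps, well inside the 1/k envelope.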
Multiplier methods 10–6

proof: follows from analysis of proximal gradient method in lecture 8, setting g = 0
• can also apply accelerated proximal method (with g = 0)
• different variants from lecture 8 can be used
Multiplier methods 10–7

Moreau-Yosida regularization
Moreau-Yosida regularization of closed convex f is defined as
    f_(µ)(x) = inf_u ( f(u) + (1/(2µ)) ‖u − x‖₂² )     (with µ > 0)

minimizer in the definition is u = prox_{µf}(x)

immediate properties
• f_(µ) is convex (infimum over u of a convex function of (x, u))
• domain of f_(µ) is Rⁿ (recall that prox_{µf}(x) is defined for all x)
Multiplier methods 10–8

Examples

indicator function (of closed convex set C)
    f(x) = I_C(x),    f_(µ)(x) = (1/(2µ)) dist(x)²

dist(x) is the Euclidean distance to C
1-norm

    f(x) = ‖x‖₁,    f_(µ)(x) = ∑_{k=1}^n φ_µ(x_k)

φ_µ is the Huber penalty
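the 1-norm case is easy to verify numerically (my sketch, with illustrative names): evaluate the infimum at its minimizer prox_{µf}(x), a soft-threshold, and compare against the Huber penalty:

```python
import numpy as np

def moreau_env_l1(x, mu):
    """Moreau-Yosida regularization of |.|, evaluated at u = prox_{mu f}(x)."""
    u = np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)   # soft-threshold
    return np.abs(u) + (u - x) ** 2 / (2 * mu)

def huber(x, mu):
    """Huber penalty: quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= mu, x ** 2 / (2 * mu), np.abs(x) - mu / 2)

xs = np.linspace(-3, 3, 101)
assert np.allclose(moreau_env_l1(xs, 0.5), huber(xs, 0.5))
```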
Multiplier methods 10–9

Conjugate of Moreau-Yosida regularization
    (f_(µ))*(y) = f*(y) + (µ/2) ‖y‖₂²

proof:

    (f_(µ))*(y) = sup_x ( yᵀx − f_(µ)(x) )
                = sup_{x,u} ( yᵀx − f(u) − (1/(2µ)) ‖u − x‖₂² )
                = sup_u ( yᵀ(u + µy) − f(u) − (µ/2) ‖y‖₂² )
                = f*(y) + (µ/2) ‖y‖₂²

(in the third line, for fixed u the supremum over x is attained at x = u + µy)
• maximizer x in definition of conjugate satisfies µy = x − prox_{µf}(x)
• note: (f_(µ))* is strongly convex with parameter µ
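the identity can also be checked by brute force (my illustration): for f = |·|, f_(µ) is the Huber penalty and f* is the indicator of [−1, 1], so for |y| < 1 the conjugate computed on a grid should equal (µ/2)y²:

```python
import numpy as np

mu = 0.5
xs = np.linspace(-50, 50, 200001)          # grid for the sup over x
huber = np.where(np.abs(xs) <= mu, xs**2 / (2*mu), np.abs(xs) - mu/2)
for y in np.linspace(-0.9, 0.9, 7):
    conj = np.max(y * xs - huber)          # (f_mu)*(y) by brute force
    assert abs(conj - mu * y**2 / 2) < 1e-4
```

for |y| > 1 the supremum is +∞, matching the f*(y) term.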
Multiplier methods 10–10

Gradient of Moreau-Yosida regularization
    f_(µ)(x) = sup_y ( xᵀy − f*(y) − (µ/2) ‖y‖₂² )
• f_(µ) is differentiable; gradient is Lipschitz continuous with constant 1/µ
• maximizer in definition satisfies

    x − µy ∈ ∂f*(y)  ⟺  y ∈ ∂f(x − µy)
• the maximizing y is the gradient of f_(µ): from p. 7-15 and p. 7-27,

    ∇f_(µ)(x) = (1/µ) ( x − prox_{µf}(x) ) = prox_{f*/µ}(x/µ)
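both gradient expressions can be compared directly for f = ‖·‖₁ (my sketch): prox_{µf} is a soft-threshold, f* is the indicator of the ℓ∞ unit ball, and the prox of an indicator (at any scaling) is projection, here entrywise clipping:

```python
import numpy as np

mu = 0.7
x = np.linspace(-3, 3, 61)
prox_mu_f = np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)  # soft-threshold
lhs = (x - prox_mu_f) / mu                                # (1/mu)(x - prox)
rhs = np.clip(x / mu, -1.0, 1.0)                          # prox_{f*/mu}(x/mu)
assert np.allclose(lhs, rhs)
```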
Multiplier methods 10–11

Interpretation of proximal point algorithm

apply gradient method to minimize Moreau-Yosida regularization:
    minimize f_(µ)(x) = inf_u ( f(u) + (1/(2µ)) ‖u − x‖₂² )

this is an exact smooth reformulation of original problem:
• solution x is minimizer of f
• f_(µ) is differentiable with Lipschitz continuous gradient (L = 1/µ)

gradient update: with fixed t_k = 1/L = µ
    x^(k) = x^(k−1) − µ ∇f_(µ)(x^(k−1)) = prox_{µf}(x^(k−1))

this is the proximal point algorithm with constant step size t_k = µ
Multiplier methods 10–12

Augmented Lagrangian method

convex problem and dual (linear constraints for simplicity)
    (primal)  minimize   f(x)             (dual)  maximize  −F(λ, ν)
              subject to Gx ⪯ h, Ax = b

where

    F(λ, ν) = hᵀλ + bᵀν + f*(−Gᵀλ − Aᵀν)   if λ ⪰ 0,   +∞ otherwise
augmented Lagrangian method: proximal point algorithm applied to the dual
Multiplier methods 10–13

Prox-operator of negative dual function

from p. 9-35,

    prox_{tF}(λ, ν) = ( λ + t(Gx̂ + ŝ − h), ν + t(Ax̂ − b) )

where (x̂, ŝ) is the solution of
    minimize   L(x, s, λ, ν)
    subject to s ⪰ 0
the cost function is the augmented Lagrangian

    L(x, s, λ, ν) = f(x) + λᵀ(Gx + s − h) + νᵀ(Ax − b)
                    + (t/2) ( ‖Gx + s − h‖₂² + ‖Ax − b‖₂² )
Multiplier methods 10–14

Algorithm

choose λ, ν, t > 0
1. minimize the augmented Lagrangian
    (x̂, ŝ) := argmin_{x, s ⪰ 0} L(x, s, λ, ν)
2. dual update
    λ := λ + t(Gx̂ + ŝ − h),    ν := ν + t(Ax̂ − b)
• this is the proximal point algorithm applied to the dual problem
• equivalently, the gradient method applied to the Moreau-Yosida regularized dual
• as a variant, can apply the fast proximal point algorithm to the dual
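as a minimal sketch of the two steps (my illustration, not from the slides), take equality constraints only (no G, h) and f(x) = (1/2)‖x‖₂², so the inner minimization has a closed form and the iterates approach the minimum-norm solution of Ax = b:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))
b = rng.standard_normal(5)
t = 10.0
nu = np.zeros(5)

for _ in range(100):
    # 1. minimize L(x, nu) = (1/2)||x||^2 + nu'(Ax - b) + (t/2)||Ax - b||^2
    x = np.linalg.solve(np.eye(8) + t * A.T @ A, t * A.T @ b - A.T @ nu)
    # 2. dual update
    nu = nu + t * (A @ x - b)

# converges to the minimum-norm solution A'(AA')^{-1} b
assert np.allclose(x, np.linalg.pinv(A) @ b, atol=1e-6)
```

the dual iteration here is a proximal point step on a quadratic, so convergence is linear with a factor that shrinks as t grows.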
Multiplier methods 10–15

Applications

augmented Lagrangian method is useful when subproblems
    minimize   f(x) + (t/2) ( ‖Gx − h + s + (1/t)λ‖₂² + ‖Ax − b + (1/t)ν‖₂² )
    subject to s ⪰ 0

are substantially easier than the original problem
(note: apply ‘completion of squares’ to the augmented Lagrangian on page 10-14)

example

    minimize   ‖x‖₁
    subject to Ax = b
• solve sequence of ℓ1-regularized least-squares problems
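a hedged sketch of this example (my code): each outer step solves the ℓ1-regularized least-squares subproblem, here with a plain ISTA inner loop (my choice; any ℓ1-LS solver works), then updates the multiplier; the problem data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 16))
x_true = np.zeros(16)
x_true[[1, 5, 11]] = [1.0, -2.0, 1.5]
b = A @ x_true

t = 5.0
s = 1.0 / (t * np.linalg.norm(A, 2) ** 2)   # ISTA step for (t/2)||Ax - c||^2
nu = np.zeros(8)
x = np.zeros(16)

for _ in range(30):                         # outer: method of multipliers
    c = b - nu / t                          # completed square: ||x||_1 + (t/2)||Ax - c||^2
    for _ in range(300):                    # inner: ISTA, warm-started at previous x
        v = x - s * t * A.T @ (A @ x - c)
        x = np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
    nu = nu + t * (A @ x - b)               # dual update
```

the constraint residual ‖Ax − b‖ tends to zero even when the inner solves are only approximate.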
Multiplier methods 10–16

Goals
robust methods for

• arbitrary-scale optimization
  – machine learning/statistics with huge data sets
  – dynamic optimization on large-scale networks
• decentralized optimization
  – devices/processors/agents coordinate to solve large problems by passing relatively small messages
• ideas go back to the 60’s; recent surge of interest
Multiplier methods 10–17

Dual decomposition

convex problem with separable objective
    minimize   f(x) + h(y)
    subject to Ax + By = b

augmented Lagrangian
    L(x, y, ν) = f(x) + h(y) + νᵀ(Ax + By − b) + (t/2) ‖Ax + By − b‖₂²
• difficulty: quadratic penalty destroys separability of Lagrangian
• solution: replace joint minimization over (x, y) by alternating minimization
Multiplier methods 10–18

Alternating direction method of multipliers

apply one cycle of alternating minimization steps (also known as Gauss-Seidel or block-coordinate descent) to the augmented Lagrangian
1. minimize augmented Lagrangian over x:
    x^(k) = argmin_x L(x, y^(k−1), ν^(k−1))
2. minimize augmented Lagrangian over y:
    y^(k) = argmin_y L(x^(k), y, ν^(k−1))
3. dual update:
    ν^(k) := ν^(k−1) + t ( Ax^(k) + By^(k) − b )

can be shown to converge under weak assumptions
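a minimal sketch of the three steps (my illustration, not from the slides), on the lasso written in this form with f(x) = (1/2)‖Dx − c‖₂², h(y) = γ‖y‖₁, and constraint x − y = 0 (i.e. A = I, B = −I, b = 0); all names and data are my own:

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((30, 10))
c = rng.standard_normal(30)
gamma, t = 1.0, 1.0
y = np.zeros(10)
nu = np.zeros(10)

for _ in range(300):
    # 1. x-update: minimize (1/2)||Dx - c||^2 + nu'x + (t/2)||x - y||^2
    x = np.linalg.solve(D.T @ D + t * np.eye(10), D.T @ c - nu + t * y)
    # 2. y-update: minimize gamma*||y||_1 - nu'y + (t/2)||x - y||^2,
    #    a soft-threshold of x + nu/t at level gamma/t
    v = x + nu / t
    y = np.sign(v) * np.maximum(np.abs(v) - gamma / t, 0.0)
    # 3. dual update
    nu = nu + t * (x - y)
```

each subproblem touches only one block of variables, which is exactly what the quadratic penalty alone would have destroyed.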
Multiplier methods 10–19

Example

    minimize f(x) + ‖Ax − b‖

f convex (not necessarily strongly)

reformulated problem

    minimize   f(x) + ‖y‖
    subject to y = Ax − b

augmented Lagrangian

    L(x, y, z) = f(x) + ‖y‖ + zᵀ(y − Ax + b) + (t/2) ‖y − Ax + b‖₂²
                = f(x) + ‖y‖ + (t/2) ‖y − Ax + b + (1/t)z‖₂² − (1/(2t)) ‖z‖₂²
Multiplier methods 10–20

alternating minimization
1. minimization over x
    argmin_x L(x, y, z) = argmin_x ( f(x) − zᵀAx + (t/2) ‖Ax − y − b‖₂² )
2. minimization over y involves projection on dual norm ball (p. 7-29)
    argmin_y L(x, y, z) = prox_{‖·‖/t} ( Ax − b − (1/t)z )
                        = (1/t) ( P_C(z − t(Ax − b)) − (z − t(Ax − b)) )

where C = {u | ‖u‖* ≤ 1}
3. dual update

    z := z + t(y − Ax + b) = P_C(z − t(Ax − b))
Multiplier methods 10–21

comparison with dual proximal gradient algorithm (page 9-33)

• ADMM does not require strong convexity of f, can use larger values of t
• dual updates are identical
• ADMM step 1 may be more expensive, e.g., for f(x) = (1/2)‖x − a‖₂²:

    x := (I + tAᵀA)⁻¹ ( a + Aᵀ(z + t(y + b)) )

as opposed to x := a + Aᵀz in the dual proximal gradient method
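the three updates can be sketched for this f with the ℓ1 norm (my choice of norm; then P_C is entrywise clipping to [−1, 1], and the y-update is a soft-threshold at 1/t); data and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((15, 6))
b = rng.standard_normal(15)
a = rng.standard_normal(6)
t = 1.0
y = np.zeros(15)
z = np.zeros(15)

for _ in range(1000):
    # step 1: x := (I + t A'A)^{-1} (a + A'(z + t(y + b)))
    x = np.linalg.solve(np.eye(6) + t * A.T @ A, a + A.T @ (z + t * (y + b)))
    # step 2: y := prox_{||.||_1 / t}(Ax - b - z/t), soft-threshold at 1/t
    v = A @ x - b - z / t
    y = np.sign(v) * np.maximum(np.abs(v) - 1.0 / t, 0.0)
    # step 3: z := z + t(y - Ax + b); this equals clip(z - t(Ax - b), -1, 1)
    z = z + t * (y - A @ x + b)

# optimality: x = a + A'z with z in the dual-norm (l_inf) unit ball
assert np.linalg.norm(x - a - A.T @ z) < 1e-5
assert np.max(np.abs(z)) <= 1 + 1e-8
```

note the dual iterate stays inside the unit ℓ∞ ball after the first update, matching the projection interpretation of step 3.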
related algorithms (see references)
• split Bregman method with linear constraints
• fast alternating minimization algorithms
Multiplier methods 10–22

example: nuclear norm approximation (problem instance of p. 9-23)

    minimize (1/2) ‖x − a‖₂² + ‖A(x) − B‖∗

‖·‖∗ is the nuclear norm; A : Rⁿ → R^{p×q} with A(x) = ∑_{i=1}^n x_i A_i
(figure: relative error versus iteration k, k = 0, …, 250, comparing ADMM and FISTA on a log scale from 10⁰ down to 10⁻⁵)
FISTA step size is 1/L = 1/‖A‖₂²; ADMM step size is t = 100/‖A‖₂² (recall FISTA is a variant of Nesterov’s 1st method covered in lecture 8)
Multiplier methods 10–23

References

proximal point algorithm and fast proximal point algorithm
• O. Güler, On the convergence of the proximal point algorithm for convex minimization, SIAM J. Control and Optimization (1991)
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992)
• O. Güler, Augmented Lagrangian algorithm for linear programming, JOTA (1992)

augmented Lagrangian algorithm
• D.P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (1982)

alternating direction method of multipliers and related algorithms
• S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers (2010)
• D. Goldfarb, S. Ma, K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, arxiv.org/abs/0912.4571 (2010)
• T. Goldstein and S. Osher, The split Bregman method for L1-regularized problems, SIAM J. Imag. Sciences (2009)
Multiplier methods 10–24