EE 546, Univ of Washington, Spring 2014

10. Multiplier methods

• proximal point

• Moreau-Yosida regularization

• augmented Lagrangian method

• alternating direction method of multipliers (ADMM)

Recall: proximal mapping

the proximal mapping (prox-operator) of a convex function h is

prox_h(x) = argmin_u ( h(u) + (1/2) ‖u − x‖₂² )

examples

• h(x) = 0: prox_h(x) = x

• h(x) = I_C(x) (indicator function of C): prox_h is projection on C

  prox_h(x) = argmin_{u∈C} ‖u − x‖₂² = P_C(x)

• h(x) = t‖x‖₁: prox_h is the 'soft-threshold' (shrinkage) operation

  prox_h(x)_i = x_i − t if x_i ≥ t,  0 if |x_i| ≤ t,  x_i + t if x_i ≤ −t

Recall: proximal gradient method

unconstrained problem with cost function split in two components (slightly different notation from lecture 8)

minimize f(x)= g(x)+ h(x)

• g convex, differentiable, with dom g = Rⁿ

• h convex, possibly nondifferentiable, with inexpensive prox-operator

proximal gradient algorithm

x^(k) = prox_{t_k h}( x^(k−1) − t_k ∇g(x^(k−1)) )

t_k > 0 is step size, constant or determined by line search
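as an illustration (not from the original slides), a minimal NumPy sketch of this update on a made-up lasso instance, with g(x) = (1/2)‖Ax − b‖₂² and h(x) = λ‖x‖₁, so prox_{t h} is soft-thresholding:

```python
# Proximal gradient sketch: g(x) = (1/2)||Ax - b||^2, h(x) = lam*||x||_1.
# The problem data (A, b, lam) are made up for illustration.
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: elementwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, num_iters=500):
    m, n = A.shape
    x = np.zeros(n)
    t = 1.0 / np.linalg.norm(A.T @ A, 2)          # constant step 1/L
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                  # gradient of g
        x = soft_threshold(x - t * grad, t * lam) # prox_{t h}
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = A @ np.array([3.0, -2.0] + [0.0] * 8) + 0.01 * rng.standard_normal(20)
x = prox_gradient(A, b, lam=0.5)
```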

Recall: conjugate of closed functions

for f closed and convex:

• f** = f

• y ∈ ∂f(x) if and only if x ∈ ∂f*(y)

• the following three statements are equivalent:

  xᵀy = f(x) + f*(y) ⇐⇒ y ∈ ∂f(x) ⇐⇒ x ∈ ∂f*(y)

Proximal point algorithm

a (conceptual) algorithm for minimizing a closed convex function f

x^(k) = prox_{t_k f}(x^(k−1))
      = argmin_u ( f(u) + (1/(2 t_k)) ‖u − x^(k−1)‖₂² )

• special case of the proximal gradient method with g(x) = 0

• step size t_k > 0 affects #iterations, cost of prox evaluations

• a practical algorithm if inexact prox evaluations are used

• of interest if prox evaluations are much easier than original problem

basis of the method of multipliers or augmented Lagrangian method
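a toy sketch (not from the original slides) of the iteration above for f(x) = ‖x‖₁, whose prox is soft-thresholding, so each prox evaluation is exact and in closed form:

```python
# Proximal point sketch: minimize f(x) = ||x||_1 by iterating
# x^(k) = prox_{t f}(x^(k-1)) with constant step t.
# For the 1-norm the prox is soft-thresholding, so each step is closed form.
import numpy as np

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_point(x0, t=0.1, num_iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_l1(x, t)   # x^(k) = prox_{t f}(x^(k-1))
    return x

x = proximal_point([1.0, -0.5])
```

each coordinate shrinks by t per iteration until it hits 0, the minimizer of ‖x‖₁.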

Convergence

assumptions

• f is closed and convex (hence, prox_{tf}(x) is uniquely defined for all x)

• optimal value f⋆ is finite and attained at x⋆

• exact evaluations of prox-operator

result

f(x^(k)) − f⋆ ≤ ‖x^(0) − x⋆‖₂² / ( 2 Σ_{i=1}^k t_i )

• implies convergence if Σ_i t_i → ∞

• rate is 1/k if t_i is constant

• t_i can be chosen arbitrarily; however, the cost of the prox evaluations may depend on t_i (when there is no closed form and they are computed inexactly)

proof: follows from the analysis of the proximal gradient method in lecture 8, setting g = 0

• can also apply accelerated proximal method (with g = 0)

• different variants from lecture 8 can be used

Outline

• proximal point algorithm

• Moreau-Yosida regularization

• augmented Lagrangian method

• alternating direction method of multipliers (ADMM)

Moreau-Yosida regularization

Moreau-Yosida regularization of closed convex f is defined as

f_(µ)(x) = inf_u ( f(u) + (1/(2µ)) ‖u − x‖₂² )    (with µ > 0)

the minimizer in the definition is u = prox_{µf}(x)

immediate properties

• f_(µ) is convex (infimum over u of a convex function of (x, u))

• domain of f_(µ) is Rⁿ (recall that prox_{µf}(x) is defined for all x)

Examples

indicator function (of closed convex set C)

f(x) = I_C(x),  f_(µ)(x) = (1/(2µ)) dist(x)²

dist(x) is the Euclidean distance to C

1-norm

f(x) = ‖x‖₁,  f_(µ)(x) = Σ_{k=1}^n φ_µ(x_k)

φ_µ is the Huber penalty: φ_µ(z) = z²/(2µ) for |z| ≤ µ, and φ_µ(z) = |z| − µ/2 otherwise
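a quick numerical check (not from the original slides), assuming the standard Huber penalty φ_µ(z) = z²/(2µ) for |z| ≤ µ and |z| − µ/2 otherwise: the envelope evaluated via the prox matches the sum-of-Huber formula:

```python
# Moreau-Yosida envelope of f(x) = ||x||_1 evaluated via the prox,
# checked against the Huber-penalty formula (standard definition assumed).
import numpy as np

def prox_l1(v, mu):
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def envelope_via_prox(x, mu):
    u = prox_l1(x, mu)   # minimizer in the definition of f_(mu)
    return np.abs(u).sum() + np.sum((u - x)**2) / (2 * mu)

def huber(z, mu):
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

x = np.array([2.0, 0.3, -1.4])
mu = 1.0
```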

Conjugate of Moreau-Yosida regularization

(f_(µ))*(y) = f*(y) + (µ/2) ‖y‖₂²

proof:

(f_(µ))*(y) = sup_x ( yᵀx − f_(µ)(x) )
           = sup_{x,u} ( yᵀx − f(u) − (1/(2µ)) ‖u − x‖₂² )
           = sup_u ( yᵀ(u + µy) − f(u) − (µ/2) ‖y‖₂² )
           = f*(y) + (µ/2) ‖y‖₂²

• the maximizer x in the definition of the conjugate satisfies µy = x − prox_{µf}(x)

• note: (f_(µ))* is strongly convex with parameter µ

Gradient of Moreau-Yosida regularization

f_(µ)(x) = sup_y ( xᵀy − f*(y) − (µ/2) ‖y‖₂² )

• f_(µ) is differentiable; its gradient is Lipschitz continuous with constant 1/µ

• maximizer in definition satisfies

x − µy ∈ ∂f*(y) ⇐⇒ y ∈ ∂f(x − µy)

• the maximizing y is the gradient of f_(µ): from p. 7-15 and p. 7-27,

∇f_(µ)(x) = (1/µ)( x − prox_{µf}(x) ) = prox_{f*/µ}(x/µ)
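a numerical sanity check of this gradient formula (not from the original slides), for f = ‖·‖₁, against central finite differences of the envelope; the test points are chosen away from the kinks at |x_i| = µ:

```python
# Check grad f_(mu)(x) = (x - prox_{mu f}(x)) / mu for f = ||.||_1.
import numpy as np

mu = 0.5

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def envelope(x):
    u = prox_l1(x, mu)
    return np.abs(u).sum() + np.sum((u - x)**2) / (2 * mu)

x = np.array([1.3, -0.2, 0.7])
grad = (x - prox_l1(x, mu)) / mu

# central finite differences of the envelope
eps = 1e-6
fd = np.array([(envelope(x + eps * e) - envelope(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```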

Interpretation of proximal point algorithm

apply the gradient method to minimize the Moreau-Yosida regularization:

minimize f_(µ)(x) = inf_u ( f(u) + (1/(2µ)) ‖u − x‖₂² )

this is an exact smooth reformulation of the original problem:

• solution x is minimizer of f

• f_(µ) is differentiable with Lipschitz continuous gradient (L = 1/µ)

gradient update: with fixed t_k = 1/L = µ

x^(k) = x^(k−1) − µ ∇f_(µ)(x^(k−1)) = prox_{µf}(x^(k−1))

this is the proximal point algorithm with constant step size t_k = µ

Outline

• proximal point algorithm

• Moreau-Yosida regularization

• augmented Lagrangian method

• alternating direction method of multipliers (ADMM)

Augmented Lagrangian method

convex problem and dual (linear constraints for simplicity)

minimize f(x) subject to Gx ⪯ h, Ax = b    (primal)

maximize −F(λ, ν)    (dual)

where

F(λ, ν) = hᵀλ + bᵀν + f*(−Gᵀλ − Aᵀν) if λ ⪰ 0, and +∞ otherwise

augmented Lagrangian method: proximal point algorithm applied to the dual

Prox-operator of negative dual function

from p. 9-35,

prox_{tF}(λ, ν) = ( λ + t(Gx̂ + ŝ − h), ν + t(Ax̂ − b) )

where (x̂, ŝ) is the solution of

minimize L(x, s, λ, ν) subject to s ⪰ 0

cost function is augmented Lagrangian

L(x, s, λ, ν) = f(x) + λᵀ(Gx + s − h) + νᵀ(Ax − b) + (t/2)( ‖Gx + s − h‖₂² + ‖Ax − b‖₂² )

Algorithm

choose λ, ν, t > 0

1. minimize the augmented Lagrangian

(x̂, ŝ) := argmin_{x, s⪰0} L(x, s, λ, ν)

2. dual update

λ := λ + t(Gx̂ + ŝ − h),  ν := ν + t(Ax̂ − b)

• this is the proximal point algorithm applied to the dual problem

• equivalently, the gradient method applied to the Moreau-Yosida regularized dual

• as a variant, can apply the fast proximal point algorithm to the dual
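the two steps above can be sketched as follows (not from the original slides; inequality constraints are dropped so step 1 has a closed form, and the data and function names are made up for illustration):

```python
# Augmented Lagrangian sketch for: minimize (1/2)||x - a||^2 s.t. Ax = b.
# Step 1 is a closed-form linear solve since there are no inequalities.
import numpy as np

def alm(A, b, a, t=1.0, num_iters=50):
    m, n = A.shape
    nu = np.zeros(m)
    M = np.eye(n) + t * A.T @ A
    for _ in range(num_iters):
        # step 1: minimize the augmented Lagrangian over x
        x = np.linalg.solve(M, a - A.T @ nu + t * A.T @ b)
        # step 2: dual update
        nu = nu + t * (A @ x - b)
    return x, nu

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
a = rng.standard_normal(6)
x, nu = alm(A, b, a)
```

at convergence x is feasible and the KKT stationarity condition x = a − Aᵀν holds.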

Applications

the augmented Lagrangian method is useful when the subproblems

minimize f(x) + (t/2)( ‖Gx − h + s + (1/t)λ‖₂² + ‖Ax − b + (1/t)ν‖₂² ) subject to s ⪰ 0

are substantially easier than the original problem

(note: apply 'completion of squares' to the augmented Lagrangian on page 10-14)

example

minimize ‖x‖₁ subject to Ax = b

• solve a sequence of ℓ1-regularized least-squares problems
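a sketch of this scheme (not from the original slides; the inner ℓ1-regularized least-squares problems are solved here by a simple proximal gradient loop, and the sparse recovery instance is made up):

```python
# ALM sketch for: minimize ||x||_1 s.t. Ax = b.  Each outer iteration is
# an l1-regularized least-squares problem, solved inexactly by ISTA.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def alm_basis_pursuit(A, b, t=1.0, outer=40, inner=300):
    m, n = A.shape
    nu = np.zeros(m)
    x = np.zeros(n)
    step = 1.0 / (t * np.linalg.norm(A.T @ A, 2))
    for _ in range(outer):
        # inner: minimize ||x||_1 + (t/2)||Ax - c||^2, c = b - nu/t
        c = b - nu / t
        for _ in range(inner):
            grad = t * A.T @ (A @ x - c)
            x = soft_threshold(x - step * grad, step)
        # dual update
        nu = nu + t * (A @ x - b)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 20))
x_true = np.zeros(20)
x_true[[3, 11]] = [2.0, -1.5]
b = A @ x_true
x = alm_basis_pursuit(A, b)
```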

Outline

• proximal point algorithm

• Moreau-Yosida regularization

• augmented Lagrangian method

• alternating direction method of multipliers (ADMM)

Goals

robust methods for

• arbitrary-scale optimization
  – machine learning/statistics with huge data sets
  – dynamic optimization on large-scale networks

• decentralized optimization
  – devices/processors/agents coordinate to solve a large problem by passing relatively small messages

• ideas go back to the 1960s; recent surge of interest

Dual decomposition

convex problem with separable objective

minimize f(x) + h(y) subject to Ax + By = b

augmented Lagrangian

L(x, y, ν) = f(x) + h(y) + νᵀ(Ax + By − b) + (t/2) ‖Ax + By − b‖₂²

• difficulty: the quadratic penalty destroys the separability of the Lagrangian

• solution: replace joint minimization over (x, y) by alternating minimization

Alternating direction method of multipliers

apply one cycle of alternating minimization steps (also known as a Gauss-Seidel or block-coordinate pass) to the augmented Lagrangian

1. minimize augmented Lagrangian over x:

x^(k) = argmin_x L(x, y^(k−1), ν^(k−1))

2. minimize augmented Lagrangian over y:

y^(k) = argmin_y L(x^(k), y, ν^(k−1))

3. dual update:

ν^(k) := ν^(k−1) + t( Ax^(k) + By^(k) − b )

can be shown to converge under weak assumptions

Example

minimize f(x) + ‖Ax − b‖

f convex (not necessarily strongly)

reformulated problem

minimize f(x) + ‖y‖ subject to y = Ax − b

augmented Lagrangian

L(x, y, z) = f(x) + ‖y‖ + zᵀ(y − Ax + b) + (t/2) ‖y − Ax + b‖₂²
           = f(x) + ‖y‖ + (t/2) ‖y − Ax + b + (1/t)z‖₂² − (1/(2t)) ‖z‖₂²

alternating minimization

1. minimization over x

argmin_x L(x, y, z) = argmin_x ( f(x) − zᵀAx + (t/2) ‖Ax − y − b‖₂² )

2. minimization over y involves projection on dual norm ball (p. 7-29)

argmin_y L(x, y, z) = prox_{‖·‖/t}( Ax − b − (1/t)z )
                    = (1/t)( P_C(z − t(Ax − b)) − (z − t(Ax − b)) )

where C = {u | ‖u‖∗ ≤ 1} is the dual norm ball

3. dual update

z := z + t(y − Ax + b) = P_C(z − t(Ax − b))
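the three steps above can be sketched concretely (not from the original slides) with f(x) = (1/2)‖x − a‖₂² and the 1-norm, so C is the ℓ∞ unit ball and P_C is a clip; the problem data are made up:

```python
# ADMM sketch for: minimize (1/2)||x - a||^2 + ||Ax - b||_1.
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def admm(A, b, a, t=1.0, num_iters=2000):
    m, n = A.shape
    x, y, z = np.zeros(n), np.zeros(m), np.zeros(m)
    M = np.eye(n) + t * A.T @ A
    for _ in range(num_iters):
        # step 1: minimize over x (closed form for quadratic f)
        x = np.linalg.solve(M, a + A.T @ (z + t * (y + b)))
        # step 2: minimize over y (prox of ||.||_1 / t)
        y = soft_threshold(A @ x - b - z / t, 1.0 / t)
        # step 3: dual update
        # (equivalently z = np.clip(z - t*(A @ x - b), -1, 1), projection onto C)
        z = z + t * (y - A @ x + b)
    return x, y, z

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
b = rng.standard_normal(5)
a = rng.standard_normal(4)
x, y, z = admm(A, b, a)
```

at convergence y ≈ Ax − b, the x-stationarity condition x = a + Aᵀz holds, and z stays in the dual ball ‖z‖∞ ≤ 1.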

comparison with the dual proximal gradient algorithm (page 9-33)

• ADMM does not require strong convexity of f and can use larger values of t

• the dual updates are identical

• ADMM step 1 may be more expensive; e.g., for f(x) = (1/2)‖x − a‖₂²,

x := (I + tAᵀA)⁻¹( a + Aᵀ(z + t(y + b)) )

as opposed to x := a + Aᵀz in the dual proximal gradient method

related (see references)

• split Bregman method with linear constraints

• fast alternating minimization algorithms

Example: nuclear norm approximation (problem instance of p. 9-23)

minimize (1/2) ‖x − a‖₂² + ‖A(x) − B‖∗

‖·‖∗ is the nuclear norm; A : Rⁿ → R^{p×q} with A(x) = Σ_{i=1}^n x_i A_i

(figure: relative error versus iteration number k for ADMM and FISTA)

FISTA step size is 1/L = 1/‖A‖₂²; ADMM step size is t = 100/‖A‖₂² (recall FISTA is a variant of Nesterov's first method, covered in lecture 8)

References

proximal point algorithm and fast proximal point algorithm

• O. Güler, On the convergence of the proximal point algorithm for convex minimization, SIAM J. Control and Optimization (1991)

• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992)

• O. Güler, Augmented Lagrangian algorithms for linear programming, JOTA (1992)

augmented Lagrangian algorithm

• D.P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (1982)

alternating direction method of multipliers and related algorithms

• S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers (2010)

• D. Goldfarb, S. Ma, K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, arxiv.org/abs/0912.4571 (2010)

• T. Goldstein and S. Osher, The split Bregman method for L1-regularized problems, SIAM J. Imag. Sciences (2009)
