EE 546, Univ of Washington, Spring 2014

12. Coordinate descent methods

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Coordinate descent methods 12–1

Notations

consider the smooth unconstrained minimization problem:

\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \quad f(x)
\]

• $n$ coordinate blocks: $x = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^{N_i}$ and $N = \sum_{i=1}^n N_i$
• more generally, partition with a permutation matrix $U = [U_1 \cdots U_n]$:
\[
x_i = U_i^T x, \qquad x = \sum_{i=1}^n U_i x_i
\]
• blocks of the gradient: $\nabla_i f(x) = U_i^T \nabla f(x)$
• coordinate update: $x^+ = x - t\, U_i \nabla_i f(x)$

Coordinate descent methods 12–2

(Block) coordinate descent

choose $x^{(0)} \in \mathbb{R}^N$, and iterate for $k = 0, 1, 2, \ldots$
1. choose coordinate $i(k)$
2. update $x^{(k+1)} = x^{(k)} - t_k\, U_{i(k)} \nabla_{i(k)} f(x^{(k)})$

• among the first schemes for solving smooth unconstrained problems
• cyclic (round-robin) order: convergence is difficult to analyze
• mostly local convergence results for particular classes of problems
• does it really work (better than the full gradient method)?
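For concreteness, here is a minimal Python sketch of the scheme above for scalar blocks ($N_i = 1$); the partial-derivative oracle grad_i, the constant step size t, and the cyclic selection rule are illustrative assumptions, not part of the slides.

    import numpy as np

    def coordinate_descent(grad_i, x0, t, iters, rule="cyclic", rng=None):
        """Sketch of (block) coordinate descent for scalar blocks (N_i = 1).
        grad_i(x, i) is an assumed oracle returning the i-th partial derivative
        of f at x; t is a constant step size t_k = t."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.array(x0, dtype=float)
        n = x.size
        for k in range(iters):
            # step 1: choose coordinate i(k) (cyclic/round-robin or random)
            i = k % n if rule == "cyclic" else rng.integers(n)
            # step 2: x(k+1) = x(k) - t_k * U_i * grad_i f(x(k))
            x[i] -= t * grad_i(x, i)
        return x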

Coordinate descent methods 12–3

Steepest coordinate descent

choose $x^{(0)} \in \mathbb{R}^N$, and iterate for $k = 0, 1, 2, \ldots$
1. choose $i(k) = \underset{i \in \{1,\ldots,n\}}{\arg\max}\, \|\nabla_i f(x^{(k)})\|_2$
2. update $x^{(k+1)} = x^{(k)} - t_k\, U_{i(k)} \nabla_{i(k)} f(x^{(k)})$

assumptions

• $\nabla f(x)$ is block-wise Lipschitz continuous:
\[
\|\nabla_i f(x + U_i v) - \nabla_i f(x)\|_2 \ \le\ L_i \|v\|_2, \qquad i = 1, \ldots, n
\]

• $f$ has a bounded sub-level set; in particular, define

\[
R(x) \ =\ \max_{y}\, \max_{x^\star \in X^\star} \big\{ \|y - x^\star\|_2 : f(y) \le f(x) \big\}
\]

Coordinate descent methods 12–4

Analysis for constant step size

quadratic upper bound due to the block coordinate-wise Lipschitz assumption:

\[
f(x + U_i v) \ \le\ f(x) + \langle \nabla_i f(x), v \rangle + \frac{L_i}{2}\|v\|_2^2, \qquad i = 1, \ldots, n
\]
assume a constant step size $0 < t \le 1/L_{\max}$, where $L_{\max} = \max_i L_i$; plugging the update $x^+ = x - t\,U_{i(k)} \nabla_{i(k)} f(x)$ into this bound and using the steepest choice of $i(k)$ gives
\[
f(x) - f(x^+) \ \ge\ \Big( t - \frac{L_{\max} t^2}{2} \Big) \|\nabla_{i(k)} f(x)\|_2^2 \ \ge\ \frac{t}{2}\|\nabla_{i(k)} f(x)\|_2^2 \ \ge\ \frac{t}{2n}\|\nabla f(x)\|_2^2
\]
by convexity,

\[
f(x) - f^\star \ \le\ \langle \nabla f(x), x - x^\star \rangle \ \le\ \|\nabla f(x)\|_2 \|x - x^\star\|_2 \ \le\ \|\nabla f(x)\|_2\, R(x^{(0)})
\]
therefore, writing $R = R(x^{(0)})$,
\[
f(x) - f(x^+) \ \ge\ \frac{t}{2nR^2}\big( f(x) - f^\star \big)^2
\]

Coordinate descent methods 12–5

let $\Delta_k = f(x^{(k)}) - f^\star$; then
\[
\Delta_k - \Delta_{k+1} \ \ge\ \frac{t}{2nR^2}\,\Delta_k^2
\]
consider their multiplicative inverses

\[
\frac{1}{\Delta_{k+1}} - \frac{1}{\Delta_k} \ =\ \frac{\Delta_k - \Delta_{k+1}}{\Delta_{k+1}\Delta_k} \ \ge\ \frac{\Delta_k - \Delta_{k+1}}{\Delta_k^2} \ \ge\ \frac{t}{2nR^2}
\]
therefore
\[
\frac{1}{\Delta_k} \ \ge\ \frac{1}{\Delta_0} + \frac{kt}{2nR^2} \ \ge\ \frac{2t}{nR^2} + \frac{kt}{2nR^2} \ =\ \frac{(k+4)t}{2nR^2}
\]
(the second inequality uses $\Delta_0 \le \frac{L_f}{2}R^2 \le \frac{nL_{\max}}{2}R^2 \le \frac{nR^2}{2t}$); finally
\[
f(x^{(k)}) - f^\star \ =\ \Delta_k \ \le\ \frac{2nR^2}{(k+4)t}
\]

Coordinate descent methods 12–6

Bounds on the full gradient Lipschitz constant

lemma: suppose $A \in \mathbb{R}^{N \times N}$ is positive semidefinite and has the partition $A = [A_{ij}]_{n \times n}$, where $A_{ij} \in \mathbb{R}^{N_i \times N_j}$ for $i, j = 1, \ldots, n$, and
\[
A_{ii} \ \preceq\ L_i I_{N_i}, \qquad i = 1, \ldots, n
\]
then
\[
A \ \preceq\ \Big( \sum_{i=1}^n L_i \Big) I_N
\]

proof:
\[
x^T A x \ =\ \sum_{i=1}^n \sum_{j=1}^n x_i^T A_{ij} x_j \ \le\ \Big( \sum_{i=1}^n \sqrt{x_i^T A_{ii} x_i} \Big)^2 \ \le\ \Big( \sum_{i=1}^n L_i^{1/2} \|x_i\|_2 \Big)^2 \ \le\ \Big( \sum_{i=1}^n L_i \Big) \sum_{i=1}^n \|x_i\|_2^2
\]
conclusion: the full gradient Lipschitz constant satisfies $L_f \le \sum_{i=1}^n L_i$
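As a quick sanity check of the lemma (not part of the original slides), the following Python snippet builds a random positive semidefinite matrix, takes $L_i = \lambda_{\max}(A_{ii})$ for arbitrarily chosen block sizes, and verifies numerically that $\lambda_{\max}(A) \le \sum_i L_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [3, 2, 4]                       # block sizes N_1, N_2, N_3 (arbitrary choice)
    N = sum(sizes)
    B = rng.standard_normal((N, N))
    A = B @ B.T                             # random positive semidefinite matrix
    offsets = np.cumsum([0] + sizes)
    # L_i = largest eigenvalue of the diagonal block A_ii, so A_ii <= L_i * I
    L = [np.linalg.eigvalsh(A[o:o + s, o:o + s]).max()
         for o, s in zip(offsets[:-1], sizes)]
    # lemma: A <= (sum_i L_i) * I, i.e. lambda_max(A) <= sum_i L_i
    assert np.linalg.eigvalsh(A).max() <= sum(L) + 1e-9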

Coordinate descent methods 12–7

Computational complexity and justifications

(steepest) coordinate descent: $O\!\left( \dfrac{nMR^2}{k} \right)$ (with $M = \max_i L_i$ from the constant-step-size analysis)

full gradient method: $O\!\left( \dfrac{L_f R^2}{k} \right)$

in general coordinate descent has a worse complexity bound
• it can happen that $M \ge O(L_f)$
• choosing $i(k)$ may rely on computing the full gradient
• too expensive to do based on function values

nevertheless, there are justifications for huge-scale problems
• even computation of a function value can require substantial effort
• limits by computer memory, distributed storage, and human patience

Coordinate descent methods 12–8

Example

\[
\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad f(x) \;\stackrel{\mathrm{def}}{=}\; \sum_{i=1}^n f_i(x_i) + \frac{1}{2}\|Ax - b\|_2^2
\]
• $f_i$ are convex differentiable univariate functions
• $A = [a_1 \cdots a_n] \in \mathbb{R}^{m \times n}$, and assume $a_i$ has $p_i$ nonzero elements

computing either the function value or the full gradient costs $O(\sum_{i=1}^n p_i)$ operations

computing coordinate directional derivatives costs $O(p_i)$ operations:

\[
\nabla_i f(x) \ =\ \nabla f_i(x_i) + a_i^T r(x), \qquad i = 1, \ldots, n, \qquad r(x) = Ax - b
\]
• given $r(x)$, computing $\nabla_i f(x)$ requires $O(p_i)$ operations
• the coordinate update $x^+ = x + \alpha e_i$ allows an efficient update of the residual: $r(x^+) = r(x) + \alpha a_i$, which also costs $O(p_i)$ operations
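A possible Python sketch of this example (assumptions: scalar coordinates, a user-supplied derivative fprime(i, xi) for each $f_i$, given coordinate Lipschitz constants L, and random coordinate choice), maintaining the residual $r = Ax - b$ so each step touches only the $p_i$ nonzeros of $a_i$:

    import numpy as np
    from scipy import sparse

    def cd_separable_least_squares(fprime, A, b, L, iters, rng=None):
        """Coordinate descent for f(x) = sum_i f_i(x_i) + 0.5*||Ax - b||^2.
        fprime(i, xi) returns f_i'(xi); L[i] is the i-th coordinate Lipschitz
        constant (e.g. an upper bound on f_i'' plus ||a_i||_2^2)."""
        rng = np.random.default_rng() if rng is None else rng
        A = sparse.csc_matrix(A)
        n = A.shape[1]
        x = np.zeros(n)
        r = -np.asarray(b, dtype=float)                # residual r = A x - b, with x = 0
        indptr, indices, data = A.indptr, A.indices, A.data
        for _ in range(iters):
            i = rng.integers(n)
            lo, hi = indptr[i], indptr[i + 1]
            rows, vals = indices[lo:hi], data[lo:hi]   # the p_i nonzeros of a_i
            g = fprime(i, x[i]) + vals @ r[rows]       # grad_i f(x) = f_i'(x_i) + a_i^T r, O(p_i)
            step = -g / L[i]
            x[i] += step
            r[rows] += step * vals                     # r(x+) = r(x) + step * a_i, O(p_i)
        return x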

Coordinate descent methods 12–9

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Randomized coordinate descent

choose $x^{(0)} \in \mathbb{R}^N$ and $\alpha \in \mathbb{R}$, and iterate for $k = 0, 1, 2, \ldots$
1. choose $i(k)$ with probability $p_i^{(\alpha)} = \dfrac{L_i^\alpha}{\sum_{j=1}^n L_j^\alpha}$, $\quad i = 1, \ldots, n$

2. update $x^{(k+1)} = x^{(k)} - \dfrac{1}{L_{i(k)}}\, U_{i(k)} \nabla_{i(k)} f(x^{(k)})$

special case: $\alpha = 0$ gives the uniform distribution $p_i^{(0)} = 1/n$ for $i = 1, \ldots, n$

assumptions
• $\nabla f(x)$ is block coordinate-wise Lipschitz continuous:
\[
\|\nabla_i f(x + U_i v_i) - \nabla_i f(x)\|_2 \ \le\ L_i \|v_i\|_2, \qquad i = 1, \ldots, n \qquad (1)
\]
• $f$ has a bounded sub-level set, and $f^\star$ is attained at some $x^\star$
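A minimal Python sketch of this method (scalar blocks, with an assumed partial-derivative oracle grad_i; the names and defaults are illustrative):

    import numpy as np

    def rcd(grad_i, L, x0, alpha=0.0, iters=1000, rng=None):
        """Randomized coordinate descent: sample i(k) with probability
        p_i proportional to L_i**alpha, then take a 1/L_i gradient step
        on that coordinate."""
        rng = np.random.default_rng() if rng is None else rng
        L = np.asarray(L, dtype=float)
        p = L**alpha / np.sum(L**alpha)          # p_i^(alpha)
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            i = rng.choice(L.size, p=p)          # step 1: sample a coordinate
            x[i] -= grad_i(x, i) / L[i]          # step 2: step size 1/L_i
        return x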

Coordinate descent methods 12–10

Solution guarantees

convergence in expectation:

\[
\mathbf{E}[f(x^{(k)})] - f^\star \ \le\ \epsilon
\]
high probability iteration complexity: number of iterations to reach

\[
\mathrm{prob}\big( f(x^{(k)}) - f^\star \le \epsilon \big) \ \ge\ 1 - \rho
\]
• confidence level $0 < \rho < 1$
• error tolerance $\epsilon > 0$

Coordinate descent methods 12–11

Convergence analysis

block coordinate-wise Lipschitz continuity of $\nabla f(x)$ implies, for $i = 1, \ldots, n$,
\[
f(x + U_i v_i) \ \le\ f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2}\|v_i\|_2^2, \qquad \forall x \in \mathbb{R}^N,\; v_i \in \mathbb{R}^{N_i}
\]
coordinate update obtained by minimizing the quadratic upper bound:

\[
x^+ = x + U_i \hat v_i, \qquad \hat v_i = \underset{v_i}{\arg\min}\, \Big\{ f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2}\|v_i\|_2^2 \Big\}
\]
objective function is non-increasing:

\[
f(x) - f(x^+) \ \ge\ \frac{1}{2L_i}\|\nabla_i f(x)\|_2^2
\]

Coordinate descent methods 12–12

A pair of conjugate norms

for any $\alpha \in \mathbb{R}$, define
\[
\|x\|_\alpha = \Big( \sum_{i=1}^n L_i^\alpha \|x_i\|_2^2 \Big)^{1/2}, \qquad \|y\|_\alpha^* = \Big( \sum_{i=1}^n L_i^{-\alpha} \|y_i\|_2^2 \Big)^{1/2}
\]
let $S_\alpha = \sum_{i=1}^n L_i^\alpha$ (note that $S_0 = n$)

lemma (Nesterov): let $f$ satisfy (1); then for any $\alpha \in \mathbb{R}$,
\[
\|\nabla f(x) - \nabla f(y)\|_{1-\alpha}^* \ \le\ S_\alpha \|x - y\|_{1-\alpha}, \qquad \forall x, y \in \mathbb{R}^N
\]
therefore

\[
f(x) \ \le\ f(y) + \langle \nabla f(y), x - y \rangle + \frac{S_\alpha}{2}\|x - y\|_{1-\alpha}^2, \qquad \forall x, y \in \mathbb{R}^N
\]

Coordinate descent methods 12–13

Convergence in expectation

theorem (Nesterov): for any $k \ge 0$,

\[
\mathbf{E}\, f(x^{(k)}) - f^\star \ \le\ \frac{2}{k+4}\, S_\alpha R_{1-\alpha}^2(x^{(0)})
\]

where
\[
R_{1-\alpha}(x^{(0)}) \ =\ \max_{y}\, \max_{x^\star \in X^\star} \big\{ \|y - x^\star\|_{1-\alpha} : f(y) \le f(x^{(0)}) \big\}
\]
proof: define the random variables $\xi_k = \{i(0), \ldots, i(k)\}$; then
\[
f(x^{(k)}) - \mathbf{E}_{i(k)}\, f(x^{(k+1)}) \ =\ \sum_{i=1}^n p_i^{(\alpha)} \big[ f(x^{(k)}) - f(x^{(k)} + U_i \hat v_i) \big]
\ \ge\ \sum_{i=1}^n \frac{p_i^{(\alpha)}}{2L_i} \|\nabla_i f(x^{(k)})\|_2^2
\ =\ \frac{1}{2S_\alpha} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2
\]

Coordinate descent methods 12–14

\[
f(x^{(k)}) - f^\star \ \le\ \min_{x^\star \in X^\star} \langle \nabla f(x^{(k)}), x^{(k)} - x^\star \rangle \ \le\ \|\nabla f(x^{(k)})\|_{1-\alpha}^*\, R_{1-\alpha}(x^{(0)})
\]

therefore, with $C = 2 S_\alpha R_{1-\alpha}^2(x^{(0)})$,

\[
f(x^{(k)}) - \mathbf{E}_{i(k)}\, f(x^{(k+1)}) \ \ge\ \frac{1}{C}\big( f(x^{(k)}) - f^\star \big)^2
\]
taking expectation of both sides with respect to $\xi_{k-1} = \{i(0), \ldots, i(k-1)\}$,

\[
\mathbf{E}\, f(x^{(k)}) - \mathbf{E}\, f(x^{(k+1)}) \ \ge\ \frac{1}{C}\, \mathbf{E}_{\xi_{k-1}}\Big[ \big( f(x^{(k)}) - f^\star \big)^2 \Big] \ \ge\ \frac{1}{C}\big( \mathbf{E}\, f(x^{(k)}) - f^\star \big)^2
\]
finally, following the steps on page 12–6 gives the desired result

Coordinate descent methods 12–15

Discussions

• $\alpha = 0$: $S_0 = n$ and
\[
\mathbf{E}\, f(x^{(k)}) - f^\star \ \le\ \frac{2n}{k+4}\, R_1^2(x^{(0)}) \ \le\ \frac{2n}{k+4} \sum_{i=1}^n L_i \|x_i^{(0)} - x_i^\star\|_2^2
\]
corresponding rate of the full gradient method: $f(x^{(k)}) - f^\star \le \frac{\gamma}{k} R_1^2(x^{(0)})$, where $\gamma$ is big enough to ensure $\nabla^2 f(x) \preceq \gamma\, \mathrm{diag}\{L_i I_{N_i}\}_{i=1}^n$

conclusion: proportional to the worst-case rate of the full gradient method

• $\alpha = 1$: $S_1 = \sum_{i=1}^n L_i$ and
\[
\mathbf{E}\, f(x^{(k)}) - f^\star \ \le\ \frac{2}{k+4} \Big( \sum_{i=1}^n L_i \Big) R_0^2(x^{(0)})
\]
corresponding rate of the full gradient method: $f(x^{(k)}) - f^\star \le \frac{L_f}{k} R_0^2(x^{(0)})$

conclusion: same as the worst-case rate of the full gradient method, but each iteration of randomized coordinate descent can be much cheaper

Coordinate descent methods 12–16

An interesting case

consider $\alpha = 1/2$, let $N_i = 1$ for $i = 1, \ldots, n$, and let

\[
D_\infty(x^{(0)}) \ =\ \max_{x}\, \max_{y \in X^\star}\, \max_{1 \le i \le n} \big\{ |x_i - y_i| : f(x) \le f(x^{(0)}) \big\}
\]
then $R_{1/2}^2(x^{(0)}) \le S_{1/2} D_\infty^2(x^{(0)})$ and hence

\[
\mathbf{E}\, f(x^{(k)}) - f^\star \ \le\ \frac{2}{k+4} \Big( \sum_{i=1}^n L_i^{1/2} \Big)^2 D_\infty^2(x^{(0)})
\]
• the worst-case dimension-independent complexity of minimizing convex functions over an $n$-dimensional box is infinite (Nemirovski & Yudin 1983)
• $S_{1/2}$ can be bounded for very big or even infinite-dimensional problems

conclusion: RCD can work in situations where full gradient methods have no theoretical justification

Coordinate descent methods 12–17

Convergence for strongly convex functions

theorem (Nesterov): if $f$ is strongly convex with respect to the norm $\|\cdot\|_{1-\alpha}$ with convexity parameter $\sigma_{1-\alpha} > 0$, then
\[
\mathbf{E}\, f(x^{(k)}) - f^\star \ \le\ \Big( 1 - \frac{\sigma_{1-\alpha}}{S_\alpha} \Big)^{\!k} \big( f(x^{(0)}) - f^\star \big)
\]
proof: combine the consequence of strong convexity

\[
f(x^{(k)}) - f^\star \ \le\ \frac{1}{2\sigma_{1-\alpha}} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2
\]
with the inequality on page 12–14 to obtain

\[
f(x^{(k)}) - \mathbf{E}_{i(k)}\, f(x^{(k+1)}) \ \ge\ \frac{1}{2S_\alpha} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2 \ \ge\ \frac{\sigma_{1-\alpha}}{S_\alpha} \big( f(x^{(k)}) - f^\star \big)
\]
it remains to take expectations over $\xi_{k-1} = \{i(0), \ldots, i(k-1)\}$

Coordinate descent methods 12–18

High probability bounds

number of iterations to guarantee

\[
\mathrm{prob}\big( f(x^{(k)}) - f^\star \le \epsilon \big) \ \ge\ 1 - \rho
\]
where $0 < \rho < 1$ is the confidence level and $\epsilon > 0$ is the error tolerance

• for smooth convex functions:
\[
O\!\left( \frac{n}{\epsilon} \Big( 1 + \log \frac{1}{\rho} \Big) \right)
\]

• for smooth strongly convex functions:
\[
O\!\left( \frac{n}{\mu} \log \frac{1}{\epsilon\rho} \right)
\]

Coordinate descent methods 12–19

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Minimizing composite objectives

\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \quad F(x) \,\triangleq\, f(x) + \Psi(x)
\]
assumptions

• $f$ differentiable and $\nabla f(x)$ block coordinate-wise Lipschitz continuous:
\[
\|\nabla_i f(x + U_i v_i) - \nabla_i f(x)\|_2 \ \le\ L_i \|v_i\|_2, \qquad i = 1, \ldots, n
\]

• $\Psi$ is block separable:
\[
\Psi(x) = \sum_{i=1}^n \Psi_i(x_i)
\]
and each $\Psi_i$ is convex and closed, and also simple

Coordinate descent methods 12–20

Coordinate update

use the quadratic upper bound on the smooth part:

\[
F(x + U_i v_i) \ =\ f(x + U_i v_i) + \Psi(x + U_i v_i)
\ \le\ f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2}\|v_i\|_2^2 + \Psi_i(x_i + v_i) + \sum_{j \ne i} \Psi_j(x_j)
\]
define
\[
V_i(x, v_i) \ =\ f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2}\|v_i\|_2^2 + \Psi_i(x_i + v_i)
\]
coordinate descent takes the form

\[
x^{(k+1)} = x^{(k)} + U_i \Delta x_i \qquad \text{where} \qquad \Delta x_i = \underset{v_i}{\arg\min}\; V_i(x^{(k)}, v_i)
\]
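For a concrete instance (not from the slides): if $\Psi_i(x_i) = \lambda |x_i|$ and the blocks are scalar, the minimizer of $V_i(x, v_i)$ reduces to a soft-thresholding step, as in this Python sketch (grad_i is an assumed oracle for $\nabla_i f$):

    import numpy as np

    def soft_threshold(z, tau):
        """prox of tau*|.|: argmin_u 0.5*(u - z)**2 + tau*|u|."""
        return np.sign(z) * max(abs(z) - tau, 0.0)

    def composite_coordinate_step(x, i, grad_i, L, lam):
        """Delta x_i = argmin_v V_i(x, v) for Psi_i(x_i) = lam*|x_i| (scalar block).
        Minimizing <g, v> + (L_i/2)*v**2 + lam*|x_i + v| over v amounts to a
        scalar prox step at the point x_i - g/L_i."""
        g = grad_i(x, i)
        new_xi = soft_threshold(x[i] - g / L[i], lam / L[i])
        return new_xi - x[i]                     # Delta x_i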

Coordinate descent methods 12–21

Randomized coordinate descent for composite functions

choose $x^{(0)} \in \mathbb{R}^N$, and iterate for $k = 0, 1, 2, \ldots$
1. choose $i(k)$ with uniform probability $1/n$

2. compute $\Delta x_i = \arg\min_{v_i} V_i(x^{(k)}, v_i)$ and update
\[
x^{(k+1)} = x^{(k)} + U_i \Delta x_i
\]

• similar convergence results as in the smooth case
• can the coordinate only be chosen with the uniform distribution?

(see references for details)
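As a concrete illustration (not from the slides), a minimal Python sketch of this scheme for the lasso instance $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $\Psi_i(x_i) = \lambda|x_i|$, where the coordinate update reduces to the soft-thresholding step shown above and $L_i = \|a_i\|_2^2$:

    import numpy as np

    def rcd_lasso(A, b, lam, iters=1000, rng=None):
        """Randomized coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1
        with scalar blocks; the residual r = Ax - b is kept up to date."""
        rng = np.random.default_rng() if rng is None else rng
        A = np.asarray(A, dtype=float)
        b = np.asarray(b, dtype=float)
        n = A.shape[1]
        L = (A**2).sum(axis=0) + 1e-12         # L_i = ||a_i||^2 (tiny guard for zero columns)
        x = np.zeros(n)
        r = -b.copy()                          # residual r = A x - b
        for _ in range(iters):
            i = rng.integers(n)                # step 1: uniform coordinate choice
            g = A[:, i] @ r                    # grad_i f(x) = a_i^T (A x - b)
            z = x[i] - g / L[i]
            new_xi = np.sign(z) * max(abs(z) - lam / L[i], 0.0)   # argmin of V_i
            r += (new_xi - x[i]) * A[:, i]     # update the residual
            x[i] = new_xi                      # step 2: coordinate update
        return x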

Coordinate descent methods 12–22

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Assumptions

restrict to the unconstrained smooth minimization problem

\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \quad f(x)
\]
assumptions

• $\nabla f(x)$ is block-wise Lipschitz continuous:
\[
\|\nabla_i f(x + U_i v) - \nabla_i f(x)\|_2 \ \le\ L_i \|v\|_2, \qquad i = 1, \ldots, n
\]

• $f$ has convexity parameter $\mu \ge 0$ with respect to the norm $\|x\|_L = \big( \sum_{i=1}^n L_i \|x_i\|_2^2 \big)^{1/2}$:
\[
f(y) \ \ge\ f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|_L^2
\]

Coordinate descent methods 12–23

Accelerated randomized coordinate descent: ARCD($x^0$)

Set $v^0 = x^0$, choose $\gamma_0 > 0$ arbitrarily, and repeat for $k = 0, 1, 2, \ldots$
1. Compute $\alpha_k \in (0, n]$ from the equation
\[
\alpha_k^2 = \Big( 1 - \frac{\alpha_k}{n} \Big) \gamma_k + \frac{\alpha_k}{n}\mu
\]
and set
\[
\gamma_{k+1} = \Big( 1 - \frac{\alpha_k}{n} \Big) \gamma_k + \frac{\alpha_k}{n}\mu
\]

2. Compute
\[
y^k = \frac{1}{\frac{\alpha_k}{n}\gamma_k + \gamma_{k+1}} \Big( \frac{\alpha_k}{n}\gamma_k v^k + \gamma_{k+1} x^k \Big)
\]
3. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random, and update

\[
x^{k+1} = y^k - \frac{1}{L_{i_k}} U_{i_k} \nabla_{i_k} f(y^k)
\]

4. Set
\[
v^{k+1} = \frac{1}{\gamma_{k+1}} \Big[ \Big( 1 - \frac{\alpha_k}{n} \Big) \gamma_k v^k + \frac{\alpha_k}{n}\mu\, y^k - \frac{\alpha_k}{L_{i_k}} U_{i_k} \nabla_{i_k} f(y^k) \Big]
\]

Coordinate descent methods 12–24

Algorithm: ARCD($x^0$)

Set $v^0 = x^0$, choose $\alpha_{-1} \in (0, n]$, and repeat for $k = 0, 1, 2, \ldots$
1. Compute $\alpha_k \in (0, n]$ from the equation

\[
\alpha_k^2 = \Big( 1 - \frac{\alpha_k}{n} \Big) \alpha_{k-1}^2 + \frac{\alpha_k}{n}\mu,
\]

and set
\[
\theta_k = \frac{n\alpha_k - \mu}{n^2 - \mu}, \qquad \beta_k = 1 - \frac{\mu}{n\alpha_k}
\]
2. Compute $y^k = \theta_k v^k + (1 - \theta_k) x^k$
3. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random, and update

\[
x^{k+1} = y^k - \frac{1}{L_{i_k}} U_{i_k} \nabla_{i_k} f(y^k)
\]

4. Set
\[
v^{k+1} = \beta_k v^k + (1 - \beta_k) y^k - \frac{1}{\alpha_k L_{i_k}} U_{i_k} \nabla_{i_k} f(y^k)
\]
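A minimal Python sketch of this second form of ARCD (assumptions: scalar blocks, an oracle grad_i for the partial derivatives, and the illustrative initialization $\alpha_{-1} = \sqrt{\mu}$ when $\mu > 0$, else $\alpha_{-1} = n$):

    import numpy as np

    def arcd(grad_i, L, x0, mu=0.0, iters=1000, rng=None):
        """Accelerated randomized coordinate descent (equivalent form above),
        for scalar blocks.  grad_i(y, i) returns the i-th partial derivative
        of f at y; L[i] is the i-th coordinate Lipschitz constant."""
        rng = np.random.default_rng() if rng is None else rng
        L = np.asarray(L, dtype=float)
        n = L.size
        x = np.array(x0, dtype=float)
        v = x.copy()
        a_prev = np.sqrt(mu) if mu > 0 else float(n)    # alpha_{-1} in (0, n]
        for _ in range(iters):
            # 1. solve alpha^2 = (1 - alpha/n)*a_prev^2 + (alpha/n)*mu for alpha > 0
            c = (a_prev**2 - mu) / n
            a = (-c + np.sqrt(c**2 + 4.0 * a_prev**2)) / 2.0
            theta = (n * a - mu) / (n**2 - mu)
            beta = 1.0 - mu / (n * a)
            # 2. extrapolation
            y = theta * v + (1.0 - theta) * x
            # 3. random coordinate gradient step
            i = rng.integers(n)
            g = grad_i(y, i)
            x = y.copy()
            x[i] -= g / L[i]
            # 4. update v
            v = beta * v + (1.0 - beta) * y
            v[i] -= g / (a * L[i])
            a_prev = a
        return x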

Coordinate descent methods 12–25

Convergence analysis

theorem: let $x^\star$ be an optimal solution of $\min_x f(x)$ and $f^\star$ be the optimal value. If $\{x^k\}$ is generated by the ARCD method, then for any $k \ge 0$,
\[
\mathbf{E}[f(x^k)] - f^\star \ \le\ \lambda_k \Big( f(x^0) - f^\star + \frac{\gamma_0}{2}\|x^0 - x^\star\|_L^2 \Big),
\]
where $\lambda_0 = 1$ and $\lambda_k = \prod_{i=0}^{k-1} \big( 1 - \frac{\alpha_i}{n} \big)$. In particular, if $\gamma_0 \ge \mu$, then

\[
\lambda_k \ \le\ \min\left\{ \Big( 1 - \frac{\sqrt{\mu}}{n} \Big)^{\!k},\ \bigg( \frac{n}{n + \frac{k}{2}\sqrt{\gamma_0}} \bigg)^{\!2} \right\}.
\]
• when $n = 1$, this recovers the results for accelerated full gradient methods
• efficient implementation is possible using a change of variables

Coordinate descent methods 12–26

Randomized estimate sequence

definition: $\{(\phi_k(x), \lambda_k)\}_{k=0}^{\infty}$ is a randomized estimate sequence of $f(x)$ if
• $\lambda_k \to 0$ (assume $\lambda_k$ is independent of $\xi_k = \{i_0, \ldots, i_k\}$)
• $\mathbf{E}_{\xi_{k-1}}[\phi_k(x)] \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x), \quad \forall x \in \mathbb{R}^N$

lemma: if $\{x^k\}$ satisfies $\mathbf{E}_{\xi_{k-1}}[f(x^k)] \le \min_x \mathbf{E}_{\xi_{k-1}}[\phi_k(x)]$, then
\[
\mathbf{E}_{\xi_{k-1}}[f(x^k)] - f^\star \ \le\ \lambda_k \big( \phi_0(x^\star) - f^\star \big) \ \to\ 0
\]
proof:
\[
\mathbf{E}_{\xi_{k-1}}[f(x^k)] \ \le\ \min_x \mathbf{E}_{\xi_{k-1}}[\phi_k(x)]
\ \le\ \min_x \big\{ (1 - \lambda_k) f(x) + \lambda_k \phi_0(x) \big\}
\ \le\ (1 - \lambda_k) f(x^\star) + \lambda_k \phi_0(x^\star)
\ =\ f^\star + \lambda_k \big( \phi_0(x^\star) - f^\star \big)
\]

Coordinate descent methods 12–27

Construction of randomized estimate sequence

lemma: if $\{\alpha_k\}_{k \ge 0}$ satisfies $\alpha_k \in (0, n)$ and $\sum_{k=0}^{\infty} \alpha_k = \infty$, then
\[
\lambda_{k+1} = \Big( 1 - \frac{\alpha_k}{n} \Big) \lambda_k, \qquad \text{with } \lambda_0 = 1
\]

\[
\phi_{k+1}(x) = \Big( 1 - \frac{\alpha_k}{n} \Big) \phi_k(x) + \frac{\alpha_k}{n} \Big[ f(y^k) + n \langle \nabla_{i_k} f(y^k), x_{i_k} - y^k_{i_k} \rangle + \frac{\mu}{2}\|x - y^k\|_L^2 \Big]
\]
defines a randomized estimate sequence

proof: for $k = 0$, $\mathbf{E}_{\xi_{-1}}[\phi_0(x)] = \phi_0(x) = (1 - \lambda_0) f(x) + \lambda_0 \phi_0(x)$; then
\[
\mathbf{E}_{\xi_k}[\phi_{k+1}(x)] \ =\ \mathbf{E}_{\xi_{k-1}}\big[ \mathbf{E}_{i_k}[\phi_{k+1}(x)] \big]
\ =\ \mathbf{E}_{\xi_{k-1}}\Big[ \Big(1 - \frac{\alpha_k}{n}\Big) \phi_k(x) + \frac{\alpha_k}{n}\Big( f(y^k) + \langle \nabla f(y^k), x - y^k \rangle + \frac{\mu}{2}\|x - y^k\|_L^2 \Big) \Big]
\]
(taking the expectation over $i_k$ turns $n\langle \nabla_{i_k} f(y^k), x_{i_k} - y^k_{i_k} \rangle$ into $\langle \nabla f(y^k), x - y^k \rangle$); using the convexity assumption on page 12–23,
\[
\mathbf{E}_{\xi_k}[\phi_{k+1}(x)] \ \le\ \mathbf{E}_{\xi_{k-1}}\Big[ \Big(1 - \frac{\alpha_k}{n}\Big) \phi_k(x) \Big] + \frac{\alpha_k}{n} f(x)
\ \le\ \Big(1 - \frac{\alpha_k}{n}\Big) \big[ (1 - \lambda_k) f(x) + \lambda_k \phi_0(x) \big] + \frac{\alpha_k}{n} f(x)
\ =\ (1 - \lambda_{k+1}) f(x) + \lambda_{k+1} \phi_0(x)
\]

Coordinate descent methods 12–28

Derivation of ARCD

• let $\phi_0(x) = \phi_0^\star + \frac{\gamma_0}{2}\|x - v^0\|_L^2$; then for all $k \ge 0$,
\[
\phi_k(x) \ =\ \phi_k^\star + \frac{\gamma_k}{2}\|x - v^k\|_L^2
\]

• can derive expressions for $\phi_k^\star$, $\gamma_k$ and $v^k$ explicitly

• follow the same steps as in deriving the accelerated full gradient method
• actually use a stronger condition
\[
\mathbf{E}_{\xi_{k-1}} f(x^k) \ \le\ \mathbf{E}_{\xi_{k-1}}\big[ \min_x \phi_k(x) \big]
\]

which implies
\[
\mathbf{E}_{\xi_{k-1}} f(x^k) \ \le\ \min_x \mathbf{E}_{\xi_{k-1}}[\phi_k(x)]
\]

Coordinate descent methods 12–29

References

• Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 2012

• P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2014

• Z. Lu and L. Xiao, On the complexity analysis of randomized block-coordinate descent methods, MSR Tech Report, 2013

• Y. T. Lee and A. Sidford, Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems, arXiv, 2013

• P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Mathematical Programming, 2009

Coordinate descent methods 12–30