EE 546, Univ of Washington, Spring 2014
12. Coordinate descent methods
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method
Coordinate descent methods 12–1

Notation

consider the smooth unconstrained minimization problem:

    minimize f(x),   x ∈ R^N

• n coordinate blocks: x = (x_1, ..., x_n) with x_i ∈ R^{N_i} and N = Σ_{i=1}^n N_i

• more generally, partition with a permutation matrix U = [U_1 ⋯ U_n]:

    x_i = U_i^T x,   x = Σ_{i=1}^n U_i x_i

• blocks of the gradient:

    ∇_i f(x) = U_i^T ∇f(x)

• coordinate update (sketched below):

    x^+ = x − t U_i ∇_i f(x)
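A minimal NumPy sketch of this block notation, not part of the original slides: contiguous blocks are assumed, so U_i^T x reduces to a slice, and the quadratic test function is hypothetical.

```python
import numpy as np

# hypothetical setup: N = 6 variables in n = 3 contiguous blocks of size 2,
# so U_i^T x is just a slice of x
blocks = [slice(0, 2), slice(2, 4), slice(4, 6)]

# toy quadratic f(x) = (1/2) x^T A x with gradient ∇f(x) = A x
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
grad = lambda x: A @ x

def block_grad(x, i):
    """∇_i f(x) = U_i^T ∇f(x): the entries of the gradient in block i."""
    return grad(x)[blocks[i]]

def coord_update(x, i, t):
    """x^+ = x − t U_i ∇_i f(x): only the coordinates in block i change."""
    x = x.copy()
    x[blocks[i]] -= t * block_grad(x, i)
    return x
```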
Coordinate descent methods 12–2

(Block) coordinate descent

choose x^(0) ∈ R^N, and iterate for k = 0, 1, 2, ...

 1. choose a coordinate block i(k)
 2. update

    x^(k+1) = x^(k) − t_k U_{i(k)} ∇_{i(k)} f(x^(k))
• among the first schemes proposed for solving smooth unconstrained problems
• cyclic (round-robin) coordinate order: convergence is difficult to analyze
• mostly local convergence results, for particular classes of problems
• does it really work (better than the full gradient method)?

(a minimal sketch of the cyclic variant follows)
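This sketch reuses the hypothetical `blocks` and `grad` from the notation example above; it is illustrative only (note that it evaluates the full gradient each iteration, which practical implementations avoid).

```python
def cyclic_cd(grad, x0, blocks, t, iters=100):
    """Cyclic (round-robin) block coordinate descent with constant step t."""
    x = x0.copy()
    n = len(blocks)
    for k in range(iters):
        i = k % n                      # round-robin rule i(k) = k mod n
        g_i = grad(x)[blocks[i]]       # block of the gradient, ∇_i f(x^(k))
        x[blocks[i]] -= t * g_i        # x^(k+1) = x^(k) − t_k U_i ∇_i f(x^(k))
    return x
```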
Coordinate descent methods 12–3

Steepest coordinate descent

choose x^(0) ∈ R^N, and iterate for k = 0, 1, 2, ...

 1. choose i(k) = argmax_{i ∈ {1,...,n}} ‖∇_i f(x^(k))‖_2
 2. update

    x^(k+1) = x^(k) − t_k U_{i(k)} ∇_{i(k)} f(x^(k))

assumptions

• ∇f(x) is block coordinate-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v) − ∇_i f(x)‖_2 ≤ L_i ‖v‖_2,   i = 1, ..., n

• f has a bounded sub-level set; in particular, define

    R(x) = max_y max_{x^⋆ ∈ X^⋆} { ‖y − x^⋆‖_2 : f(y) ≤ f(x) }

(the steepest selection rule is sketched below)
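A short sketch of the steepest (Gauss–Southwell) rule, again reusing the hypothetical `blocks`/`grad` setup; note that it needs the full gradient just to pick the block, one of the objections raised on page 12–8.

```python
import numpy as np

def steepest_cd(grad, x0, blocks, t, iters=100):
    """Steepest (Gauss-Southwell) block coordinate descent."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)                                  # full gradient needed here
        i = max(range(len(blocks)),                  # i(k) = argmax_i ‖∇_i f‖_2
                key=lambda j: np.linalg.norm(g[blocks[j]]))
        x[blocks[i]] -= t * g[blocks[i]]
    return x
```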
Coordinate descent methods 12–4

Analysis for constant step size

quadratic upper bound due to the block coordinate-wise Lipschitz assumption:

    f(x + U_i v) ≤ f(x) + ⟨∇_i f(x), v⟩ + (L_i/2) ‖v‖_2^2,   i = 1, ..., n

assume a constant step size 0 < t ≤ 1/L_max, where L_max = max_i L_i; then each step of steepest coordinate descent satisfies

    f(x) − f(x^+) ≥ (t/2) ‖∇_{i(k)} f(x)‖_2^2 ≥ (t/(2n)) ‖∇f(x)‖_2^2

by convexity,

    f(x) − f^⋆ ≤ ⟨∇f(x), x − x^⋆⟩ ≤ ‖∇f(x)‖_2 ‖x − x^⋆‖_2 ≤ ‖∇f(x)‖_2 R(x^(0))

therefore, with R = R(x^(0)),

    f(x) − f(x^+) ≥ (t/(2nR^2)) (f(x) − f^⋆)^2

Coordinate descent methods 12–5

let Δ_k = f(x^(k)) − f^⋆; then

    Δ_k − Δ_{k+1} ≥ (t/(2nR^2)) Δ_k^2

consider the multiplicative inverses:

    1/Δ_{k+1} − 1/Δ_k = (Δ_k − Δ_{k+1})/(Δ_{k+1} Δ_k) ≥ (Δ_k − Δ_{k+1})/Δ_k^2 ≥ t/(2nR^2)

therefore, using Δ_0 ≤ (n L_max/2) R^2 and t ≤ 1/L_max,

    1/Δ_k ≥ 1/Δ_0 + kt/(2nR^2) ≥ 2/(n L_max R^2) + kt/(2nR^2) ≥ 2t/(nR^2) + kt/(2nR^2) = (k+4) t/(2nR^2)

finally,

    f(x^(k)) − f^⋆ = Δ_k ≤ 2nR^2/((k+4) t)

Coordinate descent methods 12–6

Bounds on the full gradient Lipschitz constant

lemma: suppose A ∈ R^{N×N} is positive semidefinite and has the partition A = [A_ij]_{n×n}, where A_ij ∈ R^{N_i×N_j} for i, j = 1, ..., n, and

    A_ii ⪯ L_i I_{N_i},   i = 1, ..., n

then

    A ⪯ ( Σ_{i=1}^n L_i ) I_N

proof:

    x^T A x = Σ_{i=1}^n Σ_{j=1}^n x_i^T A_ij x_j ≤ ( Σ_{i=1}^n (x_i^T A_ii x_i)^{1/2} )^2
            ≤ ( Σ_{i=1}^n L_i^{1/2} ‖x_i‖_2 )^2 ≤ ( Σ_{i=1}^n L_i ) ( Σ_{i=1}^n ‖x_i‖_2^2 )

conclusion: the full gradient Lipschitz constant satisfies L_f ≤ Σ_{i=1}^n L_i

Coordinate descent methods 12–7

Computational complexity and justifications

• (steepest) coordinate descent:  O(nMR^2/k), with M = max_i L_i
• full gradient method:  O(L_f R^2/k)

in general, coordinate descent has a worse complexity bound:

• it can happen that M ≥ O(L_f)
• choosing i(k) may rely on computing the full gradient
• too expensive to do line search based on function values

nevertheless, there are justifications for huge-scale problems:

• even computing a single function value can require substantial effort
• limits imposed by computer memory, distributed storage, and human patience

Coordinate descent methods 12–8

Example

    minimize_{x ∈ R^n}  f(x) := Σ_{i=1}^n f_i(x_i) + (1/2) ‖Ax − b‖_2^2

• the f_i are convex differentiable univariate functions
• A = [a_1 ⋯ a_n] ∈ R^{m×n}, and assume a_i has p_i nonzero elements

computing either a function value or the full gradient costs O(Σ_{i=1}^n p_i) operations

computing coordinate directional derivatives costs only O(p_i) operations:

    ∇_i f(x) = f_i'(x_i) + a_i^T r(x),   i = 1, ..., n,   where r(x) = Ax − b

• given r(x), computing ∇_i f(x) requires O(p_i) operations
• a coordinate update x^+ = x + α e_i allows an efficient update of the residual, r(x^+) = r(x) + α a_i, which also costs O(p_i) operations (see the sketch below)
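A hedged sketch of this O(p_i) step, maintaining the residual r = Ax − b between iterations; the sparse-column representation (idx, val) and the name fprime_i are illustrative choices, not from the slides.

```python
import numpy as np

def cd_step(x, r, i, idx, val, fprime_i, L_i):
    """One coordinate step for f(x) = Σ f_i(x_i) + (1/2)‖Ax − b‖_2^2.

    idx, val : the p_i nonzero row indices and values of column a_i
    fprime_i : derivative of the univariate term f_i (assumed callable)
    r        : maintained residual r = Ax − b, updated in place
    """
    g_i = fprime_i(x[i]) + val @ r[idx]   # ∇_i f(x) = f_i'(x_i) + a_i^T r, O(p_i)
    alpha = -g_i / L_i                    # gradient step on coordinate i
    x[i] += alpha
    r[idx] += alpha * val                 # r(x^+) = r(x) + α a_i, also O(p_i)
    return x, r
```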
Coordinate descent methods 12–9

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Randomized coordinate descent

choose x^(0) ∈ R^N and α ∈ R, and iterate for k = 0, 1, 2, ...

 1. choose i(k) with probability

    p_i^(α) = L_i^α / Σ_{j=1}^n L_j^α,   i = 1, ..., n

 2. update

    x^(k+1) = x^(k) − (1/L_{i(k)}) U_{i(k)} ∇_{i(k)} f(x^(k))

special case: α = 0 gives the uniform distribution p_i = 1/n for i = 1, ..., n

assumptions

• ∇f(x) is block coordinate-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n     (1)

• f has a bounded sub-level set, and f^⋆ is attained at some x^⋆

(the sampling scheme is sketched below)
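A minimal sketch of the randomized scheme with p_i ∝ L_i^α, assuming single coordinates (N_i = 1), L given as an array of coordinate Lipschitz constants, and, for brevity, a full-gradient oracle (a real implementation would compute only ∇_i f, as in the example above).

```python
import numpy as np

def rcd(grad, x0, L, alpha, iters, seed=0):
    """Randomized coordinate descent: sampling p_i ∝ L_i^α, steps 1/L_i."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    p = L**alpha / np.sum(L**alpha)       # p_i^(α) = L_i^α / Σ_j L_j^α
    for _ in range(iters):
        i = rng.choice(len(x), p=p)       # draw i(k) with probability p_i^(α)
        x[i] -= grad(x)[i] / L[i]         # x^(k+1) = x^(k) − (1/L_i) ∇_i f(x^(k)) e_i
    return x
```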
Coordinate descent methods 12–10

Solution guarantees

convergence in expectation:

    E[f(x^(k))] − f^⋆ ≤ ε

high probability iteration complexity: number of iterations to reach

    prob( f(x^(k)) − f^⋆ ≤ ε ) ≥ 1 − ρ

• confidence level 0 < ρ < 1
• error tolerance ε > 0

Coordinate descent methods 12–11

Convergence analysis

block coordinate-wise Lipschitz continuity of ∇f(x) implies, for i = 1, ..., n,

    f(x + U_i v_i) ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2^2,   ∀ x ∈ R^N, v_i ∈ R^{N_i}

the coordinate update is obtained by minimizing this quadratic upper bound:

    x^+ = x + U_i v̂_i,   v̂_i = argmin_{v_i} { f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2^2 }

the objective function is non-increasing:

    f(x) − f(x^+) ≥ (1/(2L_i)) ‖∇_i f(x)‖_2^2

Coordinate descent methods 12–12

A pair of conjugate norms

for any α ∈ R, define

    ‖x‖_α = ( Σ_{i=1}^n L_i^α ‖x_i‖_2^2 )^{1/2},   ‖y‖_α^* = ( Σ_{i=1}^n L_i^{−α} ‖y_i‖_2^2 )^{1/2}

let S_α = Σ_{i=1}^n L_i^α (note that S_0 = n)

lemma (Nesterov): let f satisfy (1); then for any α ∈ R,

    ‖∇f(x) − ∇f(y)‖_{1−α}^* ≤ S_α ‖x − y‖_{1−α},   ∀ x, y ∈ R^N

therefore

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (S_α/2) ‖x − y‖_{1−α}^2,   ∀ x, y ∈ R^N

Coordinate descent methods 12–13

Convergence in expectation

theorem (Nesterov): for any k ≥ 0,

    E f(x^(k)) − f^⋆ ≤ (2/(k+4)) S_α R_{1−α}^2(x^(0))

where

    R_{1−α}(x^(0)) = max_y max_{x^⋆ ∈ X^⋆} { ‖y − x^⋆‖_{1−α} : f(y) ≤ f(x^(0)) }

proof: define the random variables ξ_k = {i(0), ..., i(k)}; then

    f(x^(k)) − E_{i(k)} f(x^(k+1)) = Σ_{i=1}^n p_i^(α) ( f(x^(k)) − f(x^(k) + U_i v̂_i) )
                                   ≥ Σ_{i=1}^n (p_i^(α)/(2L_i)) ‖∇_i f(x^(k))‖_2^2
                                   = (1/(2S_α)) ( ‖∇f(x^(k))‖_{1−α}^* )^2

Coordinate descent methods 12–14

    f(x^(k)) − f^⋆ ≤ min_{x^⋆ ∈ X^⋆} ⟨∇f(x^(k)), x^(k) − x^⋆⟩ ≤ ‖∇f(x^(k))‖_{1−α}^* R_{1−α}(x^(0))

therefore, with C = 2 S_α R_{1−α}^2(x^(0)),

    f(x^(k)) − E_{i(k)} f(x^(k+1)) ≥ (1/C) ( f(x^(k)) − f^⋆ )^2

taking expectations of both sides with respect to ξ_{k−1} = {i(0), ..., i(k−1)},

    E f(x^(k)) − E f(x^(k+1)) ≥ (1/C) E_{ξ_{k−1}} [ ( f(x^(k)) − f^⋆ )^2 ] ≥ (1/C) ( E f(x^(k)) − f^⋆ )^2

finally, follow the steps on page 12–6 to obtain the desired result

Coordinate descent methods 12–15

Discussions

• α = 0: S_0 = n and

    E f(x^(k)) − f^⋆ ≤ (2n/(k+4)) R_1^2(x^(0)) ≤ (2n/(k+4)) Σ_{i=1}^n L_i ‖x_i^(0) − x_i^⋆‖_2^2

  corresponding rate of the full gradient method: f(x^(k)) − f^⋆ ≤ (γ/k) R_1^2(x^(0)), where γ is big enough to ensure ∇^2 f(x) ⪯ γ diag{L_i I_{N_i}}_{i=1}^n

  conclusion: proportional to the worst-case rate of the full gradient method

• α = 1: S_1 = Σ_{i=1}^n L_i and

    E f(x^(k)) − f^⋆ ≤ (2/(k+4)) ( Σ_{i=1}^n L_i ) R_0^2(x^(0))

  corresponding rate of the full gradient method: f(x^(k)) − f^⋆ ≤ (L_f/k) R_0^2(x^(0))

  conclusion: same as the worst-case rate of the full gradient method, but each iteration of randomized coordinate descent can be much cheaper

Coordinate descent methods 12–16

An interesting case

consider α = 1/2, let N_i = 1 for i = 1, ..., n, and let

    D_∞(x^(0)) = max_x max_{y ∈ X^⋆} max_{1 ≤ i ≤ n} { |x_i − y_i| : f(x) ≤ f(x^(0)) }

then R_{1/2}^2(x^(0)) ≤ S_{1/2} D_∞^2(x^(0)), and hence

    E f(x^(k)) − f^⋆ ≤ (2/(k+4)) ( Σ_{i=1}^n L_i^{1/2} )^2 D_∞^2(x^(0))

• the worst-case dimension-independent complexity of minimizing convex functions over an n-dimensional box is infinite (Nemirovski & Yudin 1983)
• S_{1/2} can be bounded for very big or even infinite-dimensional problems

conclusion: RCD can work in situations where full gradient methods have no theoretical justification

Coordinate descent methods 12–17

Convergence for strongly convex functions

theorem (Nesterov): if f is strongly convex with respect to the norm ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0, then

    E f(x^(k)) − f^⋆ ≤ ( 1 − σ_{1−α}/S_α )^k ( f(x^(0)) − f^⋆ )

proof: combine a consequence of strong convexity,

    f(x^(k)) − f^⋆ ≤ (1/(2σ_{1−α})) ( ‖∇f(x^(k))‖_{1−α}^* )^2

with the inequality on page 12–14 to obtain

    f(x^(k)) − E_{i(k)} f(x^(k+1)) ≥ (1/(2S_α)) ( ‖∇f(x^(k))‖_{1−α}^* )^2 ≥ (σ_{1−α}/S_α) ( f(x^(k)) − f^⋆ )

it remains to take expectations over ξ_{k−1} = {i(0), ..., i(k−1)}

Coordinate descent methods 12–18

High probability bounds

number of iterations to guarantee

    prob( f(x^(k)) − f^⋆ ≤ ε ) ≥ 1 − ρ

where 0 < ρ < 1 is the confidence level and ε > 0 is the error tolerance:

• for smooth convex functions:

    O( (n/ε) (1 + log(1/ρ)) )

• for smooth strongly convex functions:

    O( (n/μ) log(1/(ερ)) )

Coordinate descent methods 12–19

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Minimizing composite objectives

    minimize_{x ∈ R^N}  F(x) := f(x) + Ψ(x)

assumptions

• f is differentiable and ∇f(x) is block coordinate-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n

• Ψ is block separable:

    Ψ(x) = Σ_{i=1}^n Ψ_i(x_i)

  and each Ψ_i is convex and closed, and also simple

Coordinate descent methods 12–20

Coordinate update

use the quadratic upper bound on the smooth part:

    F(x + U_i v_i) = f(x + U_i v_i) + Ψ(x + U_i v_i)
                   ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2^2 + Ψ_i(x_i + v_i) + Σ_{j≠i} Ψ_j(x_j)

define

    V(x, v_i) = f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2^2 + Ψ_i(x_i + v_i)

coordinate descent takes the form

    x^(k+1) = x^(k) + U_i Δx_i,   where Δx_i = argmin_{v_i} V(x^(k), v_i)

Coordinate descent methods 12–21

Randomized coordinate descent for composite functions

choose x^(0) ∈ R^N, and iterate for k = 0, 1, 2, ...

 1. choose i(k) with uniform probability 1/n
 2. compute Δx_i = argmin_{v_i} V(x^(k), v_i) and update

    x^(k+1) = x^(k) + U_i Δx_i

• similar convergence results as for the smooth case
• can coordinates only be chosen with the uniform distribution? (see the references for details)

Coordinate descent methods 12–22

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Assumptions

restrict to the unconstrained smooth minimization problem

    minimize f(x),   x ∈ R^N

assumptions

• ∇f(x) is block-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v) − ∇_i f(x)‖_2 ≤ L_i ‖v‖_2,   i = 1, ..., n

• f has convexity parameter μ ≥ 0 with respect to the norm ‖·‖_L:

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2) ‖y − x‖_L^2

  where ‖x‖_L^2 = Σ_{i=1}^n L_i ‖x_i‖_2^2, i.e., the norm ‖·‖_1 defined on page 12–13

Coordinate descent methods 12–23

Algorithm: ARCD(x^0)

set v^0 = x^0, choose γ_0 > 0 arbitrarily, and repeat for k = 0, 1, 2, ...

 1. compute α_k ∈ (0, n] from the equation

    α_k^2 = (1 − α_k/n) γ_k + (α_k/n) μ

    and set γ_{k+1} = (1 − α_k/n) γ_k + (α_k/n) μ

 2. compute

    y^k = [ (α_k/n) γ_k v^k + γ_{k+1} x^k ] / [ (α_k/n) γ_k + γ_{k+1} ]

 3. choose i_k ∈ {1, ..., n} uniformly at random, and update

    x^{k+1} = y^k − (1/L_{i_k}) U_{i_k} ∇_{i_k} f(y^k)

 4. set

    v^{k+1} = (1/γ_{k+1}) [ (1 − α_k/n) γ_k v^k + (α_k/n) μ y^k − (α_k/L_{i_k}) U_{i_k} ∇_{i_k} f(y^k) ]

(a sketch of this iteration follows)
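A hedged sketch of the ARCD iteration above for N_i = 1, with α_k obtained as the positive root of the quadratic in step 1; this illustrates the update formulas as reconstructed here, not a tuned implementation.

```python
import numpy as np

def arcd(grad, x0, L, mu, gamma0, iters, seed=0):
    """Accelerated randomized coordinate descent (single coordinates)."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    x, v, gamma = x0.copy(), x0.copy(), gamma0
    for _ in range(iters):
        # step 1: α_k solves α² + ((γ_k − μ)/n) α − γ_k = 0 (positive root)
        b = (gamma - mu) / n
        a = (-b + np.sqrt(b * b + 4.0 * gamma)) / 2.0
        gamma_next = (1.0 - a / n) * gamma + (a / n) * mu   # equals α_k²
        # step 2: y^k is a convex combination of v^k and x^k
        w = (a / n) * gamma
        y = (w * v + gamma_next * x) / (w + gamma_next)
        # step 3: coordinate gradient step from y^k
        i = rng.integers(n)
        g_i = grad(y)[i]
        x = y.copy()
        x[i] -= g_i / L[i]
        # step 4: update the auxiliary sequence v^k
        v = ((1.0 - a / n) * gamma * v + (a / n) * mu * y) / gamma_next
        v[i] -= a * g_i / (L[i] * gamma_next)
        gamma = gamma_next
    return x
```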
Coordinate descent methods 12–24

Algorithm: ARCD(x^0) (equivalent form)

set v^0 = x^0, choose α_{−1} ∈ (0, n], and repeat for k = 0, 1, 2, ...

 1. compute α_k ∈ (0, n] from the equation

    α_k^2 = (1 − α_k/n) α_{k−1}^2 + (α_k/n) μ

    and set

    θ_k = (n α_k − μ)/(n^2 − μ),   β_k = 1 − μ/(n α_k)

 2. compute y^k = θ_k v^k + (1 − θ_k) x^k

 3. choose i_k ∈ {1, ..., n} uniformly at random, and update

    x^{k+1} = y^k − (1/L_{i_k}) U_{i_k} ∇_{i_k} f(y^k)

 4. set

    v^{k+1} = β_k v^k + (1 − β_k) y^k − (1/(α_k L_{i_k})) U_{i_k} ∇_{i_k} f(y^k)

Coordinate descent methods 12–25

Convergence analysis

theorem: let x^⋆ be a solution of min_x f(x) and f^⋆ the optimal value; if {x^k} is generated by the ARCD method, then for any k ≥ 0,

    E[f(x^k)] − f^⋆ ≤ λ_k [ f(x^0) − f^⋆ + (γ_0/2) ‖x^0 − x^⋆‖_L^2 ]

where λ_0 = 1 and λ_k = Π_{i=0}^{k−1} (1 − α_i/n); in particular, if γ_0 ≥ μ, then

    λ_k ≤ min { ( 1 − √μ/n )^k,  ( n/(n + (k/2)√γ_0) )^2 }

• when n = 1, this recovers the results for accelerated full gradient methods
• an efficient implementation is possible using a change of variables

Coordinate descent methods 12–26

Randomized estimate sequence

definition: {(φ_k(x), λ_k)}_{k=0}^∞ is a randomized estimate sequence of f(x) if

• λ_k → 0 (assume λ_k is independent of ξ_k = {i_0, ..., i_k})
• E_{ξ_{k−1}}[φ_k(x)] ≤ (1 − λ_k) f(x) + λ_k φ_0(x),   ∀ x ∈ R^N

lemma: if {x^k} satisfies E_{ξ_{k−1}}[f(x^k)] ≤ min_x E_{ξ_{k−1}}[φ_k(x)], then

    E_{ξ_{k−1}}[f(x^k)] − f^⋆ ≤ λ_k ( φ_0(x^⋆) − f^⋆ ) → 0

proof:

    E_{ξ_{k−1}}[f(x^k)] ≤ min_x E_{ξ_{k−1}}[φ_k(x)]
                        ≤ min_x { (1 − λ_k) f(x) + λ_k φ_0(x) }
                        ≤ (1 − λ_k) f(x^⋆) + λ_k φ_0(x^⋆)
                        = f^⋆ + λ_k ( φ_0(x^⋆) − f^⋆ )

Coordinate descent methods 12–27

Construction of a randomized estimate sequence

lemma: if {α_k}_{k≥0} satisfies α_k ∈ (0, n) and Σ_{k=0}^∞ α_k = ∞, then

    λ_{k+1} = (1 − α_k/n) λ_k,   with λ_0 = 1

    φ_{k+1}(x) = (1 − α_k/n) φ_k(x)
               + (α_k/n) [ f(y^k) + n ⟨∇_{i_k} f(y^k), x_{i_k} − y_{i_k}^k⟩ + (μ/2) ‖x − y^k‖_L^2 ]

is a randomized estimate sequence

proof: for k = 0, E_{ξ_{−1}}[φ_0(x)] = φ_0(x) = (1 − λ_0) f(x) + λ_0 φ_0(x); then

    E_{ξ_k}[φ_{k+1}(x)] = E_{ξ_{k−1}} E_{i_k}[φ_{k+1}(x)]
      = E_{ξ_{k−1}} [ (1 − α_k/n) φ_k(x) + (α_k/n) ( f(y^k) + ⟨∇f(y^k), x − y^k⟩ + (μ/2) ‖x − y^k‖_L^2 ) ]
      ≤ E_{ξ_{k−1}} [ (1 − α_k/n) φ_k(x) ] + (α_k/n) f(x)
      ≤ (1 − α_k/n) [ (1 − λ_k) f(x) + λ_k φ_0(x) ] + (α_k/n) f(x)
      = (1 − λ_{k+1}) f(x) + λ_{k+1} φ_0(x)

(the second line uses E_{i_k}[ n ⟨∇_{i_k} f(y^k), x_{i_k} − y_{i_k}^k⟩ ] = ⟨∇f(y^k), x − y^k⟩, and the first inequality uses convexity of f with parameter μ)

Coordinate descent methods 12–28

Derivation of ARCD

• let φ_0(x) = φ_0^⋆ + (γ_0/2) ‖x − v^0‖_L^2; then for all k ≥ 0,

    φ_k(x) = φ_k^⋆ + (γ_k/2) ‖x − v^k‖_L^2

  and one can derive expressions for φ_k^⋆, γ_k and v^k explicitly

• follow the same steps as in deriving the accelerated full gradient method

• actually use the stronger condition

    E_{ξ_{k−1}} f(x^k) ≤ E_{ξ_{k−1}} [ min_x φ_k(x) ]

  which implies

    E_{ξ_{k−1}} f(x^k) ≤ min_x E_{ξ_{k−1}} [ φ_k(x) ]

Coordinate descent methods 12–29

References

• Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 2012
• P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2014
• Z. Lu and L. Xiao, On the complexity analysis of randomized block-coordinate descent methods, MSR Tech Report, 2013
• Y. T. Lee and A. Sidford, Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems, arXiv, 2013
• P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Mathematical Programming, 2009

Coordinate descent methods 12–30