12. Coordinate Descent Methods

EE 546, Univ of Washington, Spring 2014

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Coordinate descent methods 12–1

Notations

consider the smooth unconstrained minimization problem

    \mathrm{minimize}_{x \in \mathbf{R}^N} \; f(x)

• n coordinate blocks: x = (x_1, \dots, x_n) with x_i \in \mathbf{R}^{N_i} and N = \sum_{i=1}^n N_i

• more generally, partition with a permutation matrix U = [U_1 \cdots U_n]:

    x_i = U_i^T x, \qquad x = \sum_{i=1}^n U_i x_i

• blocks of the gradient:

    \nabla_i f(x) = U_i^T \nabla f(x)

• coordinate update:

    x^+ = x - t\, U_i \nabla_i f(x)

Coordinate descent methods 12–2

(Block) coordinate descent

choose x^{(0)} \in \mathbf{R}^N, and iterate for k = 0, 1, 2, \dots

1. choose a coordinate i(k)
2. update x^{(k+1)} = x^{(k)} - t_k U_{i(k)} \nabla_{i(k)} f(x^{(k)})

• among the first schemes for solving smooth unconstrained problems
• cyclic or round-robin order: difficult to analyze convergence
• mostly local convergence results for particular classes of problems
• does it really work (better than the full gradient method)?

Coordinate descent methods 12–3

Steepest coordinate descent

choose x^{(0)} \in \mathbf{R}^N, and iterate for k = 0, 1, 2, \dots

1. choose i(k) = \mathrm{argmax}_{i \in \{1,\dots,n\}} \|\nabla_i f(x^{(k)})\|_2
2. update x^{(k+1)} = x^{(k)} - t_k U_{i(k)} \nabla_{i(k)} f(x^{(k)})

assumptions

• \nabla f(x) is block-wise Lipschitz continuous:

    \|\nabla_i f(x + U_i v) - \nabla_i f(x)\|_2 \le L_i \|v\|_2, \qquad i = 1, \dots, n

• f has a bounded sub-level set; in particular, define

    R(x) = \max_y \max_{x^\star \in X^\star} \{ \|y - x^\star\|_2 : f(y) \le f(x) \}

Coordinate descent methods 12–4

Analysis for constant step size

quadratic upper bound due to the block coordinate-wise Lipschitz assumption:

    f(x + U_i v) \le f(x) + \langle \nabla_i f(x), v \rangle + \frac{L_i}{2} \|v\|_2^2, \qquad i = 1, \dots, n

assume a constant step size 0 < t \le 1/M with M \triangleq \max_{i \in \{1,\dots,n\}} L_i; since the steepest coordinate satisfies \|\nabla_i f(x)\|_2^2 \ge \frac{1}{n}\|\nabla f(x)\|_2^2,

    f(x^+) \le f(x) - \frac{t}{2} \|\nabla_i f(x)\|_2^2 \le f(x) - \frac{t}{2n} \|\nabla f(x)\|_2^2

by convexity and the Cauchy–Schwarz inequality,

    f(x) - f^\star \le \langle \nabla f(x), x - x^\star \rangle \le \|\nabla f(x)\|_2 \|x - x^\star\|_2 \le \|\nabla f(x)\|_2 R(x^{(0)})

therefore, with R = R(x^{(0)}),

    f(x) - f(x^+) \ge \frac{t}{2nR^2} \big( f(x) - f^\star \big)^2

Coordinate descent methods 12–5

let \Delta_k = f(x^{(k)}) - f^\star; then

    \Delta_k - \Delta_{k+1} \ge \frac{t}{2nR^2} \Delta_k^2

consider their multiplicative inverses:

    \frac{1}{\Delta_{k+1}} - \frac{1}{\Delta_k} = \frac{\Delta_k - \Delta_{k+1}}{\Delta_{k+1}\Delta_k} \ge \frac{\Delta_k - \Delta_{k+1}}{\Delta_k^2} \ge \frac{t}{2nR^2}

therefore

    \frac{1}{\Delta_k} \ge \frac{1}{\Delta_0} + \frac{kt}{2nR^2} \ge \frac{2t}{nR^2} + \frac{kt}{2nR^2} = \frac{(k+4)t}{2nR^2}

(the bound 1/\Delta_0 \ge 2t/(nR^2) holds because \Delta_0 \le \frac{L_f}{2}R^2 \le \frac{nM}{2}R^2 \le \frac{n}{2t}R^2, using the lemma on the next page and t \le 1/M)

finally

    f(x^{(k)}) - f^\star = \Delta_k \le \frac{2nR^2}{(k+4)t}

Coordinate descent methods 12–6

Bounds on the full gradient Lipschitz constant

lemma: suppose A \in \mathbf{R}^{N \times N} is positive semidefinite and has the partition A = [A_{ij}]_{n \times n}, where A_{ij} \in \mathbf{R}^{N_i \times N_j} for i, j = 1, \dots, n, and

    A_{ii} \preceq L_i I_{N_i}, \qquad i = 1, \dots, n

then

    A \preceq \Big( \sum_{i=1}^n L_i \Big) I_N

proof:

    x^T A x = \sum_{i=1}^n \sum_{j=1}^n x_i^T A_{ij} x_j \le \Big( \sum_{i=1}^n \sqrt{x_i^T A_{ii} x_i} \Big)^2 \le \Big( \sum_{i=1}^n L_i^{1/2} \|x_i\|_2 \Big)^2 \le \Big( \sum_{i=1}^n L_i \Big) \Big( \sum_{i=1}^n \|x_i\|_2^2 \Big)

(the first inequality is Cauchy–Schwarz for the positive semidefinite form, the last is Cauchy–Schwarz applied to the sum)

conclusion: the full gradient Lipschitz constant satisfies L_f \le \sum_{i=1}^n L_i

Coordinate descent methods 12–7
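The lemma is easy to sanity-check numerically. Below is a small NumPy sketch, not from the lecture: the block sizes and the random positive semidefinite matrix are arbitrary illustrative choices. It computes the block constants L_i = \lambda_{\max}(A_{ii}) and verifies \lambda_{\max}(A) \le \sum_i L_i.

```python
import numpy as np

# Sanity check of the lemma on page 12-7: for a PSD matrix A with diagonal
# blocks A_ii <= L_i * I, verify lambda_max(A) <= sum_i L_i.
rng = np.random.default_rng(0)
block_sizes = [3, 5, 2, 4]          # N_i (illustrative); N = 14
N = sum(block_sizes)

B = rng.standard_normal((N, N))
A = B.T @ B                          # A = B^T B is positive semidefinite

# L_i = largest eigenvalue of the i-th diagonal block A_ii
offsets = np.cumsum([0] + block_sizes)
L = [np.linalg.eigvalsh(A[offsets[i]:offsets[i + 1],
                           offsets[i]:offsets[i + 1]]).max()
     for i in range(len(block_sizes))]

# lambda_max(A) is the gradient Lipschitz constant L_f of f(x) = x^T A x / 2
lam_max = np.linalg.eigvalsh(A).max()
print(f"lambda_max(A) = {lam_max:.3f} <= sum L_i = {sum(L):.3f}")
assert lam_max <= sum(L) + 1e-9
```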
Computational complexity and justifications

    (steepest) coordinate descent:  O\big( nMR^2 / k \big)
    full gradient method:           O\big( L_f R^2 / k \big)

in general coordinate descent has the worse complexity bound:

• it can happen that M \ge O(L_f)
• choosing i(k) may rely on computing the full gradient
• too expensive to do a line search based on function values

nevertheless, there are justifications for huge-scale problems:

• even the computation of a single function value can require substantial effort
• limits imposed by computer memory, distributed storage, and human patience

Coordinate descent methods 12–8

Example

    \mathrm{minimize}_{x \in \mathbf{R}^n} \; f(x) \triangleq \sum_{i=1}^n f_i(x_i) + \frac{1}{2} \|Ax - b\|_2^2

• the f_i are convex differentiable univariate functions
• A = [a_1 \cdots a_n] \in \mathbf{R}^{m \times n}, and assume a_i has p_i nonzero elements

computing either a function value or the full gradient costs O(\sum_{i=1}^n p_i) operations; computing a coordinate directional derivative costs only O(p_i) operations:

    \nabla_i f(x) = \nabla f_i(x_i) + a_i^T r(x), \qquad i = 1, \dots, n, \qquad r(x) = Ax - b

• given r(x), computing \nabla_i f(x) requires O(p_i) operations
• the coordinate update x^+ = x + \alpha e_i allows an efficient update of the residual, r(x^+) = r(x) + \alpha a_i, which also costs O(p_i) operations

Coordinate descent methods 12–9

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Randomized coordinate descent

choose x^{(0)} \in \mathbf{R}^N and \alpha \in \mathbf{R}, and iterate for k = 0, 1, 2, \dots

1. choose i(k) with probability

    p_i^{(\alpha)} = \frac{L_i^\alpha}{\sum_{j=1}^n L_j^\alpha}, \qquad i = 1, \dots, n

2. update

    x^{(k+1)} = x^{(k)} - \frac{1}{L_{i(k)}} U_{i(k)} \nabla_{i(k)} f(x^{(k)})

special case: \alpha = 0 gives the uniform distribution p_i^{(0)} = 1/n for i = 1, \dots, n

assumptions

• \nabla f(x) is block-wise Lipschitz continuous:

    \|\nabla_i f(x + U_i v_i) - \nabla_i f(x)\|_2 \le L_i \|v_i\|_2, \qquad i = 1, \dots, n    (1)

• f has a bounded sub-level set, and f^\star is attained at some x^\star

Coordinate descent methods 12–10
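To make the per-iteration cost argument concrete, here is a minimal NumPy sketch, an illustration rather than the lecture's code, of the randomized method above applied to the example objective on page 12–9, taking f_i(x_i) = (c_i/2) x_i^2 as a stand-in for the univariate terms (any differentiable convex f_i with computable derivative would do). It samples i(k) with probability p_i^{(\alpha)} \propto L_i^\alpha and maintains the residual r = Ax - b so each iteration touches only one column of A.

```python
import numpy as np

def rcd_least_squares(A, b, c, alpha=0.0, iters=10000, seed=0):
    """Randomized coordinate descent (page 12-10) on
       f(x) = sum_i (c_i/2) x_i^2 + 0.5 * ||Ax - b||_2^2.
    The quadratic f_i are a stand-in for the univariate terms on page 12-9.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # coordinate Lipschitz constants: L_i = c_i + ||a_i||_2^2
    L = c + np.sum(A * A, axis=0)
    p = L**alpha / np.sum(L**alpha)      # sampling distribution p_i^(alpha)
    x = np.zeros(n)
    r = -b                                # residual r = Ax - b (x = 0 initially)
    for _ in range(iters):
        i = rng.choice(n, p=p)
        g = c[i] * x[i] + A[:, i] @ r     # coordinate derivative grad_i f(x)
        step = -g / L[i]                  # exact minimizer of the quadratic bound
        x[i] += step
        r += step * A[:, i]               # residual update; O(p_i) if a_i is sparse
    return x

# tiny usage example with random data
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
x = rcd_least_squares(A, b, c=np.ones(20), alpha=1.0)
```

With a sparse column representation of A (e.g., CSC storage), the two lines touching A[:, i] cost O(p_i) each, matching the operation count on page 12–9.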
Solution guarantees

convergence in expectation:

    \mathbf{E}[f(x^{(k)})] - f^\star \le \epsilon

high probability iteration complexity: the number of iterations to reach

    \mathrm{prob}\big( f(x^{(k)}) - f^\star \le \epsilon \big) \ge 1 - \rho

• confidence level 0 < \rho < 1
• error tolerance \epsilon > 0

Coordinate descent methods 12–11

Convergence analysis

block coordinate-wise Lipschitz continuity of \nabla f(x) implies, for i = 1, \dots, n,

    f(x + U_i v_i) \le f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2} \|v_i\|_2^2, \qquad \forall x \in \mathbf{R}^N,\; v_i \in \mathbf{R}^{N_i}

the coordinate update is obtained by minimizing this quadratic upper bound:

    x^+ = x + U_i \hat v_i, \qquad \hat v_i = \mathrm{argmin}_{v_i} \Big\{ f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2} \|v_i\|_2^2 \Big\}

the objective function is non-increasing:

    f(x) - f(x^+) \ge \frac{1}{2L_i} \|\nabla_i f(x)\|_2^2

Coordinate descent methods 12–12

A pair of conjugate norms

for any \alpha \in \mathbf{R}, define

    \|x\|_\alpha = \Big( \sum_{i=1}^n L_i^\alpha \|x_i\|_2^2 \Big)^{1/2}, \qquad \|y\|_\alpha^* = \Big( \sum_{i=1}^n L_i^{-\alpha} \|y_i\|_2^2 \Big)^{1/2}

let S_\alpha = \sum_{i=1}^n L_i^\alpha (note that S_0 = n)

lemma (Nesterov): let f satisfy (1); then for any \alpha \in \mathbf{R},

    \|\nabla f(x) - \nabla f(y)\|_{1-\alpha}^* \le S_\alpha \|x - y\|_{1-\alpha}, \qquad \forall x, y \in \mathbf{R}^N

therefore

    f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{S_\alpha}{2} \|x - y\|_{1-\alpha}^2, \qquad \forall x, y \in \mathbf{R}^N

Coordinate descent methods 12–13

Convergence in expectation

theorem (Nesterov): for any k \ge 0,

    \mathbf{E} f(x^{(k)}) - f^\star \le \frac{2}{k+4}\, S_\alpha R_{1-\alpha}^2(x^{(0)})

where

    R_{1-\alpha}(x^{(0)}) = \max_y \max_{x^\star \in X^\star} \{ \|y - x^\star\|_{1-\alpha} : f(y) \le f(x^{(0)}) \}

proof: define the random variables \xi_k = \{i(0), \dots, i(k)\}; then

    f(x^{(k)}) - \mathbf{E}_{i(k)} f(x^{(k+1)}) = \sum_{i=1}^n p_i^{(\alpha)} \big( f(x^{(k)}) - f(x^{(k)} + U_i \hat v_i) \big)
        \ge \sum_{i=1}^n \frac{p_i^{(\alpha)}}{2L_i} \|\nabla_i f(x^{(k)})\|_2^2
        = \frac{1}{2S_\alpha} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2

Coordinate descent methods 12–14

by convexity,

    f(x^{(k)}) - f^\star \le \min_{x^\star \in X^\star} \langle \nabla f(x^{(k)}), x^{(k)} - x^\star \rangle \le \|\nabla f(x^{(k)})\|_{1-\alpha}^* \, R_{1-\alpha}(x^{(0)})

therefore, with C = 2 S_\alpha R_{1-\alpha}^2(x^{(0)}),

    f(x^{(k)}) - \mathbf{E}_{i(k)} f(x^{(k+1)}) \ge \frac{1}{C} \big( f(x^{(k)}) - f^\star \big)^2

taking expectations of both sides with respect to \xi_{k-1} = \{i(0), \dots, i(k-1)\},

    \mathbf{E} f(x^{(k)}) - \mathbf{E} f(x^{(k+1)}) \ge \frac{1}{C}\, \mathbf{E}_{\xi_{k-1}} \Big[ \big( f(x^{(k)}) - f^\star \big)^2 \Big] \ge \frac{1}{C} \big( \mathbf{E} f(x^{(k)}) - f^\star \big)^2

(the last step uses Jensen's inequality); finally, follow the steps on page 12–6 to obtain the desired result

Coordinate descent methods 12–15

Discussions

• \alpha = 0: S_0 = n and

    \mathbf{E} f(x^{(k)}) - f^\star \le \frac{2n}{k+4} R_1^2(x^{(0)}) \le \frac{2n}{k+4} \sum_{i=1}^n L_i \|x_i^{(0)} - x_i^\star\|_2^2

  the corresponding rate of the full gradient method is

    f(x^{(k)}) - f^\star \le \frac{\gamma}{k} R_1^2(x^{(0)})

  where \gamma is big enough to ensure \nabla^2 f(x) \preceq \gamma\, \mathrm{diag}\{L_i I_{N_i}\}_{i=1}^n

  conclusion: proportional to the worst-case rate of the full gradient method

• \alpha = 1: S_1 = \sum_{i=1}^n L_i and

    \mathbf{E} f(x^{(k)}) - f^\star \le \frac{2}{k+4} \Big( \sum_{i=1}^n L_i \Big) R_0^2(x^{(0)})

  the corresponding rate of the full gradient method is

    f(x^{(k)}) - f^\star \le \frac{L_f}{k} R_0^2(x^{(0)})

  conclusion: same as the worst-case rate of the full gradient method, but each iteration of randomized coordinate descent can be much cheaper

Coordinate descent methods 12–16

An interesting case

consider \alpha = 1/2, let N_i = 1 for i = 1, \dots, n, and let

    D_\infty(x^{(0)}) = \max_x \max_{y \in X^\star} \max_{1 \le i \le n} \{ |x_i - y_i| : f(x) \le f(x^{(0)}) \}

then R_{1/2}^2(x^{(0)}) \le S_{1/2} D_\infty^2(x^{(0)}), and hence

    \mathbf{E} f(x^{(k)}) - f^\star \le \frac{2}{k+4} \Big( \sum_{i=1}^n L_i^{1/2} \Big)^2 D_\infty^2(x^{(0)})

• the worst-case dimension-independent complexity of minimizing convex functions over an n-dimensional box is infinite (Nemirovski & Yudin 1983)
• S_{1/2} can be bounded for very big or even infinite-dimensional problems

conclusion: RCD can work in situations where full gradient methods have no theoretical justification

Coordinate descent methods 12–17

Convergence for strongly convex functions

theorem (Nesterov): if f is strongly convex with respect to the norm \|\cdot\|_{1-\alpha} with convexity parameter \sigma_{1-\alpha} > 0, then

    \mathbf{E} f(x^{(k)}) - f^\star \le \Big( 1 - \frac{\sigma_{1-\alpha}}{S_\alpha} \Big)^k \big( f(x^{(0)}) - f^\star \big)

proof: combine the consequence of strong convexity,

    f(x^{(k)}) - f^\star \le \frac{1}{2\sigma_{1-\alpha}} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2

with the inequality on page 12–14 to obtain

    f(x^{(k)}) - \mathbf{E}_{i(k)} f(x^{(k+1)}) \ge \frac{1}{2S_\alpha} \big( \|\nabla f(x^{(k)})\|_{1-\alpha}^* \big)^2 \ge \frac{\sigma_{1-\alpha}}{S_\alpha} \big( f(x^{(k)}) - f^\star \big)

it remains to take expectations over \xi_{k-1} = \{i(0), \dots, i(k-1)\}

Coordinate descent methods 12–18

High probability bounds

number of iterations to guarantee

    \mathrm{prob}\big( f(x^{(k)}) - f^\star \le \epsilon \big) \ge 1 - \rho

where 0 < \rho < 1 is the confidence level and \epsilon > 0 is the error tolerance:

• for smooth convex functions:

    O\Big( \frac{n}{\epsilon} \Big( 1 + \log \frac{1}{\rho} \Big) \Big)

• for smooth strongly convex functions:

    O\Big( \frac{n}{\mu} \log \frac{1}{\epsilon\rho} \Big)

Coordinate descent methods 12–19

Outline

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Minimizing composite objectives

    \mathrm{minimize}_{x \in \mathbf{R}^N} \; F(x) \triangleq f(x) + \Psi(x)

assumptions

• f is differentiable and \nabla f(x) is block coordinate-wise Lipschitz continuous:

    \|\nabla_i f(x + U_i v_i) - \nabla_i f(x)\|_2 \le L_i \|v_i\|_2, \qquad i = 1, \dots, n

• \Psi is block separable,

    \Psi(x) = \sum_{i=1}^n \Psi_i(x_i)

  and each \Psi_i is convex and closed, and also simple

Coordinate descent methods 12–20

Coordinate update

use the quadratic upper bound on the smooth part:

    F(x + U_i v_i) = f(x + U_i v_i) + \Psi(x + U_i v_i)
        \le f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2} \|v_i\|_2^2 + \Psi_i(x_i + v_i) + \sum_{j \ne i} \Psi_j(x_j)

define

    V(x, v_i) = f(x) + \langle \nabla_i f(x), v_i \rangle + \frac{L_i}{2} \|v_i\|_2^2 + \Psi_i(x_i + v_i)

coordinate descent then takes the form

    x^{(k+1)} = x^{(k)} + U_i \Delta x_i, \qquad \Delta x_i = \mathrm{argmin}_{v_i} V(x, v_i)

Coordinate descent methods 12–21

Randomized coordinate descent for composite functions

choose x^{(0)} \in \mathbf{R}^N and \alpha \in \mathbf{R}, and iterate for k = 0, 1, 2, \dots
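When \Psi_i(x_i) = \lambda |x_i| with scalar blocks (an illustrative choice of a "simple" \Psi_i, not one fixed by the slides), the coordinate update \Delta x_i = \mathrm{argmin}_{v_i} V(x, v_i) from page 12–21 has a closed form: completing the square in V shows that x_i + \Delta x_i is a soft-thresholding of x_i - \nabla_i f(x)/L_i. A minimal NumPy sketch of this single step:

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau*|.|: argmin_u  tau*|u| + 0.5*(u - z)^2"""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def composite_coord_step(x, i, grad_i, L_i, lam):
    """One coordinate update for F(x) = f(x) + lam*||x||_1 (page 12-21):
       Delta x_i = argmin_v  <grad_i, v> + (L_i/2) v^2 + lam*|x_i + v|,
    whose minimizer is a soft-thresholding step on x_i - grad_i / L_i.
    """
    x_i_new = soft_threshold(x[i] - grad_i / L_i, lam / L_i)
    return x_i_new - x[i]   # Delta x_i
```

In the randomized method of page 12–10, this update would replace the plain gradient step, with i(k) still drawn from the distribution p^{(\alpha)}.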
