12. Coordinate Descent Methods
EE 546, Univ of Washington, Spring 2014

• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Coordinate descent methods 12–1

Notations

consider the smooth unconstrained minimization problem:

    minimize_{x ∈ R^N}  f(x)

• n coordinate blocks: x = (x_1, ..., x_n) with x_i ∈ R^{N_i} and N = Σ_{i=1}^n N_i
• more generally, partition with a permutation matrix U = [U_1 ⋯ U_n]:
    x_i = U_i^T x,   x = Σ_{i=1}^n U_i x_i
• blocks of the gradient:
    ∇_i f(x) = U_i^T ∇f(x)
• coordinate update:
    x^+ = x − t U_i ∇_i f(x)

Coordinate descent methods 12–2

(Block) coordinate descent

choose x^(0) ∈ R^N, and iterate for k = 0, 1, 2, ...
1. choose coordinate i(k)
2. update x^(k+1) = x^(k) − t_k U_{i(k)} ∇_{i(k)} f(x^(k))

• among the first schemes for solving smooth unconstrained problems
• cyclic or round-robin order: difficult to analyze convergence
• mostly local convergence results for particular classes of problems
• does it really work (better than the full gradient method)?

Coordinate descent methods 12–3

Steepest coordinate descent

choose x^(0) ∈ R^N, and iterate for k = 0, 1, 2, ...
1. choose i(k) = argmax_{i ∈ {1,...,n}} ‖∇_i f(x^(k))‖_2
2. update x^(k+1) = x^(k) − t_k U_{i(k)} ∇_{i(k)} f(x^(k))

assumptions
• ∇f(x) is block coordinate-wise Lipschitz continuous:
    ‖∇_i f(x + U_i v) − ∇_i f(x)‖_2 ≤ L_i ‖v‖_2,   i = 1, ..., n
• f has a bounded sub-level set; in particular, define
    R(x) = max_y max_{x⋆ ∈ X⋆} { ‖y − x⋆‖_2 : f(y) ≤ f(x) }

Coordinate descent methods 12–4

Analysis for constant step size

quadratic upper bound due to the block coordinate-wise Lipschitz assumption:
    f(x + U_i v) ≤ f(x) + ⟨∇_i f(x), v⟩ + (L_i/2) ‖v‖_2²,   i = 1, ..., n

assume constant step size 0 < t ≤ 1/M, with M ≜ max_{i ∈ {1,...,n}} L_i; then, using the steepest choice of i,
    f(x^+) ≤ f(x) − (t/2) ‖∇_i f(x)‖_2² ≤ f(x) − (t/(2n)) ‖∇f(x)‖_2²

by convexity and the Cauchy-Schwarz inequality,
    f(x) − f⋆ ≤ ⟨∇f(x), x − x⋆⟩ ≤ ‖∇f(x)‖_2 ‖x − x⋆‖_2 ≤ ‖∇f(x)‖_2 R(x^(0))

therefore, with R = R(x^(0)),
    f(x) − f(x^+) ≥ (t/(2nR²)) (f(x) − f⋆)²

Coordinate descent methods 12–5

let Δ_k = f(x^(k)) − f⋆; then
    Δ_k − Δ_{k+1} ≥ (t/(2nR²)) Δ_k²

consider the multiplicative inverses:
    1/Δ_{k+1} − 1/Δ_k = (Δ_k − Δ_{k+1}) / (Δ_{k+1} Δ_k) ≥ (Δ_k − Δ_{k+1}) / Δ_k² ≥ t/(2nR²)

therefore, since Δ_0 ≤ nMR²/2 gives 1/Δ_0 ≥ 2/(nMR²) ≥ 2t/(nR²),
    1/Δ_k ≥ 1/Δ_0 + kt/(2nR²) ≥ 2t/(nR²) + kt/(2nR²) = (k + 4)t/(2nR²)

finally
    f(x^(k)) − f⋆ = Δ_k ≤ 2nR² / ((k + 4)t)

Coordinate descent methods 12–6
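The steepest rule analyzed above is easy to prototype. Below is a minimal Python/NumPy sketch for scalar blocks (N_i = 1) and a constant step size t ≤ 1/M; the quadratic test problem and the names steepest_cd and grad are illustrative assumptions, not part of the notes.

```python
import numpy as np

def steepest_cd(grad, x0, t, iters=2000):
    """Steepest coordinate descent with scalar blocks and constant step size t,
    assuming 0 < t <= 1/M with M = max_i L_i (a sketch, not a tuned solver)."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)                    # full gradient, needed only to pick the block
        i = int(np.argmax(np.abs(g)))  # i(k) = argmax_i |grad_i f(x)|
        x[i] -= t * g[i]               # x^+ = x - t * U_i * grad_i f(x)
    return x

# illustrative quadratic: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and L_i = A_ii
rng = np.random.default_rng(0)
n = 20
Q = rng.standard_normal((n, n))
A = Q.T @ Q + np.eye(n)
b = rng.standard_normal(n)
x = steepest_cd(lambda z: A @ z - b, np.zeros(n), t=1.0 / np.diag(A).max())
print(np.linalg.norm(A @ x - b))       # residual should be small
```

Note that the argmax rule needs the full gradient at every step, which is exactly the objection raised on page 12–8; the randomized variant later in this section removes that requirement.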
Bounds on the full gradient Lipschitz constant

lemma: suppose A ∈ R^{N×N} is positive semidefinite and has the partition A = [A_ij]_{n×n}, where A_ij ∈ R^{N_i×N_j} for i, j = 1, ..., n, and
    A_ii ⪯ L_i I_{N_i},   i = 1, ..., n
then
    A ⪯ (Σ_{i=1}^n L_i) I_N

proof:
    x^T A x = Σ_{i=1}^n Σ_{j=1}^n x_i^T A_ij x_j ≤ ( Σ_{i=1}^n (x_i^T A_ii x_i)^{1/2} )²
            ≤ ( Σ_{i=1}^n L_i^{1/2} ‖x_i‖_2 )² ≤ ( Σ_{i=1}^n L_i ) ( Σ_{i=1}^n ‖x_i‖_2² )

conclusion: the full gradient Lipschitz constant satisfies L_f ≤ Σ_{i=1}^n L_i

Coordinate descent methods 12–7

Computational complexity and justifications

    (steepest) coordinate descent:  O(nMR²/k)
    full gradient method:           O(L_f R²/k)

in general coordinate descent has the worse complexity bound:
• it can happen that M ≥ O(L_f)
• choosing i(k) may rely on computing the full gradient
• too expensive to do line search based on function values

nevertheless, there are justifications for huge-scale problems:
• even computation of a function value can require substantial effort
• limits by computer memory, distributed storage, and human patience

Coordinate descent methods 12–8

Example

    minimize_{x ∈ R^n}  f(x) ≜ Σ_{i=1}^n f_i(x_i) + (1/2) ‖Ax − b‖_2²

• the f_i are convex differentiable univariate functions
• A = [a_1 ⋯ a_n] ∈ R^{m×n}, and assume a_i has p_i nonzero elements

computing either a function value or the full gradient costs O(Σ_{i=1}^n p_i) operations

computing coordinate directional derivatives costs O(p_i) operations:
    ∇_i f(x) = ∇f_i(x_i) + a_i^T r(x),   i = 1, ..., n,   where r(x) = Ax − b

• given r(x), computing ∇_i f(x) requires O(p_i) operations
• the coordinate update x^+ = x + α e_i results in an efficient update of the residual:
  r(x^+) = r(x) + α a_i, which also costs O(p_i) operations

Coordinate descent methods 12–9

Outline
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Randomized coordinate descent

choose x^(0) ∈ R^N and α ∈ R, and iterate for k = 0, 1, 2, ...
1. choose i(k) = i with probability p_i^(α) = L_i^α / Σ_{j=1}^n L_j^α,   i = 1, ..., n
2. update x^(k+1) = x^(k) − (1/L_{i(k)}) U_{i(k)} ∇_{i(k)} f(x^(k))

special case: α = 0 gives the uniform distribution p_i^(0) = 1/n for i = 1, ..., n

assumptions
• ∇f(x) is block coordinate-wise Lipschitz continuous:
    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n    (1)
• f has a bounded sub-level set, and f⋆ is attained at some x⋆

Coordinate descent methods 12–10

Solution guarantees

convergence in expectation:
    E[f(x^(k))] − f⋆ ≤ ε

high probability iteration complexity: number of iterations to reach
    prob( f(x^(k)) − f⋆ ≤ ε ) ≥ 1 − ρ

• confidence level 0 < ρ < 1
• error tolerance ε > 0

Coordinate descent methods 12–11

Convergence analysis

block coordinate-wise Lipschitz continuity of ∇f(x) implies, for i = 1, ..., n,
    f(x + U_i v_i) ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2²,   ∀ x ∈ R^N, v_i ∈ R^{N_i}

the coordinate update is obtained by minimizing this quadratic upper bound:
    x^+ = x + U_i v̂_i,   v̂_i = argmin_{v_i} { f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² }

the objective function is non-increasing:
    f(x) − f(x^+) ≥ (1/(2L_i)) ‖∇_i f(x)‖_2²

Coordinate descent methods 12–12
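As a concrete illustration of the randomized rule and the 1/L_i step, here is a minimal Python/NumPy sketch of randomized coordinate descent with scalar blocks and sampling probabilities p_i ∝ L_i^α; the function name rcd, the gradient oracle grad_i, and the test problem are assumptions made for this example, not part of the notes.

```python
import numpy as np

def rcd(grad_i, L, x0, alpha=0.0, iters=5000, seed=0):
    """Randomized coordinate descent (scalar blocks): sample i with probability
    L_i**alpha / sum_j L_j**alpha, then update x_i -= grad_i(x, i) / L_i."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    p = L**alpha / np.sum(L**alpha)      # p_i^(alpha)
    x = x0.copy()
    for _ in range(iters):
        i = int(rng.choice(len(x), p=p))  # sample coordinate i(k)
        x[i] -= grad_i(x, i) / L[i]       # x^(k+1) = x^(k) - (1/L_i) U_i grad_i f(x^(k))
    return x

# illustrative quadratic f(x) = 0.5 x^T A x - b^T x, with grad_i f(x) = A[i,:] @ x - b[i]
rng = np.random.default_rng(1)
n = 50
Q = rng.standard_normal((n, n))
A = Q.T @ Q / n + np.eye(n)
b = rng.standard_normal(n)
x = rcd(lambda z, i: A[i] @ z - b[i], L=np.diag(A), x0=np.zeros(n), alpha=1.0)
print(np.linalg.norm(A @ x - b))          # residual should be small
```

Only one block of the gradient is evaluated per iteration; in this dense example each oracle call touches a single row of A (O(n) work instead of the O(n²) cost of a full gradient), which is the per-iteration saving described on page 12–9.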
A pair of conjugate norms

for any α ∈ R, define
    ‖x‖_α = ( Σ_{i=1}^n L_i^α ‖x_i‖_2² )^{1/2},   ‖y‖_α^* = ( Σ_{i=1}^n L_i^{−α} ‖y_i‖_2² )^{1/2}

let S_α = Σ_{i=1}^n L_i^α (note that S_0 = n)

lemma (Nesterov): let f satisfy (1); then for any α ∈ R,
    ‖∇f(x) − ∇f(y)‖_{1−α}^* ≤ S_α ‖x − y‖_{1−α},   ∀ x, y ∈ R^N

therefore
    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (S_α/2) ‖x − y‖_{1−α}²,   ∀ x, y ∈ R^N

Coordinate descent methods 12–13

Convergence in expectation

theorem (Nesterov): for any k ≥ 0,
    E f(x^(k)) − f⋆ ≤ (2/(k + 4)) S_α R_{1−α}²(x^(0))
where
    R_{1−α}(x^(0)) = max_y max_{x⋆ ∈ X⋆} { ‖y − x⋆‖_{1−α} : f(y) ≤ f(x^(0)) }

proof: define the random variables ξ_k = {i(0), ..., i(k)}; then
    f(x^(k)) − E_{i(k)} f(x^(k+1)) = Σ_{i=1}^n p_i^(α) ( f(x^(k)) − f(x^(k) + U_i v̂_i) )
        ≥ Σ_{i=1}^n (p_i^(α)/(2L_i)) ‖∇_i f(x^(k))‖_2²
        = (1/(2S_α)) ( ‖∇f(x^(k))‖_{1−α}^* )²

Coordinate descent methods 12–14

also,
    f(x^(k)) − f⋆ ≤ min_{x⋆ ∈ X⋆} ⟨∇f(x^(k)), x^(k) − x⋆⟩ ≤ ‖∇f(x^(k))‖_{1−α}^* R_{1−α}(x^(0))

therefore, with C = 2 S_α R_{1−α}²(x^(0)),
    f(x^(k)) − E_{i(k)} f(x^(k+1)) ≥ (1/C) ( f(x^(k)) − f⋆ )²

taking the expectation of both sides with respect to ξ_{k−1} = {i(0), ..., i(k−1)},
    E f(x^(k)) − E f(x^(k+1)) ≥ (1/C) E_{ξ_{k−1}}[ ( f(x^(k)) − f⋆ )² ] ≥ (1/C) ( E f(x^(k)) − f⋆ )²

finally, follow the steps on page 12–6 to obtain the desired result

Coordinate descent methods 12–15

Discussions

• α = 0: S_0 = n and
      E f(x^(k)) − f⋆ ≤ (2n/(k + 4)) R_1²(x^(0)) ≤ (2n/(k + 4)) Σ_{i=1}^n L_i ‖x_i^(0) − x_i⋆‖_2²
  corresponding rate of the full gradient method: f(x^(k)) − f⋆ ≤ (γ/k) R_1²(x^(0)), where γ is big enough to ensure ∇²f(x) ⪯ γ diag{L_i I_{N_i}}_{i=1}^n
  conclusion: proportional to the worst-case rate of the full gradient method

• α = 1: S_1 = Σ_{i=1}^n L_i and
      E f(x^(k)) − f⋆ ≤ (2/(k + 4)) ( Σ_{i=1}^n L_i ) R_0²(x^(0))
  corresponding rate of the full gradient method: f(x^(k)) − f⋆ ≤ (L_f/k) R_0²(x^(0))
  conclusion: same as the worst-case rate of the full gradient method, but each iteration of randomized coordinate descent can be much cheaper

Coordinate descent methods 12–16

An interesting case

consider α = 1/2, let N_i = 1 for i = 1, ..., n, and let
    D_∞(x^(0)) = max_x max_{y ∈ X⋆} max_{1 ≤ i ≤ n} { |x_i − y_i| : f(x) ≤ f(x^(0)) }

then R_{1/2}²(x^(0)) ≤ S_{1/2} D_∞²(x^(0)) and hence
    E f(x^(k)) − f⋆ ≤ (2/(k + 4)) ( Σ_{i=1}^n L_i^{1/2} )² D_∞²(x^(0))

• the worst-case dimension-independent complexity of minimizing convex functions over an n-dimensional box is infinite (Nemirovski & Yudin 1983)
• S_{1/2} can be bounded for very big or even infinite dimensional problems

conclusion: RCD can work in situations where full gradient methods have no theoretical justification

Coordinate descent methods 12–17

Convergence for strongly convex functions

theorem (Nesterov): if f is strongly convex with respect to the norm ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0, then
    E f(x^(k)) − f⋆ ≤ ( 1 − σ_{1−α}/S_α )^k ( f(x^(0)) − f⋆ )

proof: combine a consequence of strong convexity,
    f(x^(k)) − f⋆ ≤ (1/(2σ_{1−α})) ( ‖∇f(x^(k))‖_{1−α}^* )²
with the inequality on page 12–14 to obtain
    f(x^(k)) − E_{i(k)} f(x^(k+1)) ≥ (1/(2S_α)) ( ‖∇f(x^(k))‖_{1−α}^* )² ≥ (σ_{1−α}/S_α) ( f(x^(k)) − f⋆ )

it remains to take expectations over ξ_{k−1} = {i(0), ..., i(k−1)}

Coordinate descent methods 12–18

High probability bounds

number of iterations to guarantee
    prob( f(x^(k)) − f⋆ ≤ ε ) ≥ 1 − ρ
where 0 < ρ < 1 is the confidence level and ε > 0 is the error tolerance

• for smooth convex functions:  O( (n/ε) (1 + log(1/ρ)) )
• for smooth strongly convex functions:  O( (n/µ) log(1/(ερ)) )

Coordinate descent methods 12–19

Outline
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Minimizing composite objectives

    minimize_{x ∈ R^N}  F(x) ≜ f(x) + Ψ(x)

assumptions
• f is differentiable and ∇f(x) is block coordinate-wise Lipschitz continuous:
    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n
• Ψ is block separable:
    Ψ(x) = Σ_{i=1}^n Ψ_i(x_i)
  and each Ψ_i is convex and closed, and also simple

Coordinate descent methods 12–20

Coordinate update

use the quadratic upper bound on the smooth part:
    F(x + U_i v_i) = f(x + U_i v_i) + Ψ(x + U_i v_i)
      ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² + Ψ_i(x_i + v_i) + Σ_{j≠i} Ψ_j(x_j)

define
    V(x, v_i) = f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² + Ψ_i(x_i + v_i)

coordinate descent takes the form
    x^(k+1) = x^(k) + U_i Δx_i,   where Δx_i = argmin_{v_i} V(x, v_i)

Coordinate descent methods 12–21

Randomized coordinate descent for composite functions

choose x^(0) ∈ R^N and α ∈ R, and iterate for k = 0, 1, 2, ..
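The coordinate update Δx_i = argmin_{v_i} V(x, v_i) on page 12–21 has a closed form whenever Ψ_i is simple. As a hedged illustration, the sketch below takes scalar blocks with Ψ_i(x_i) = λ|x_i| (a lasso-type choice made for this example, not prescribed by the notes), in which case the minimizer of V is given by soft thresholding; the names composite_coord_step and soft_threshold are likewise illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """argmin_w 0.5*(w - z)**2 + tau*|w|  (prox of tau*|.|)."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def composite_coord_step(x, i, grad_i, L_i, lam):
    """One coordinate update for F(x) = f(x) + lam*||x||_1 with scalar blocks:
    minimizes V(x, v) = f(x) + g*v + (L_i/2)*v**2 + lam*|x_i + v| over v,
    whose solution satisfies x_i + v = soft_threshold(x_i - g/L_i, lam/L_i)."""
    g = grad_i(x, i)                              # grad_i f(x)
    x = x.copy()
    x[i] = soft_threshold(x[i] - g / L_i, lam / L_i)
    return x

# illustrative composite problem: f(x) = 0.5*||A x - b||^2, Psi(x) = lam*||x||_1
rng = np.random.default_rng(2)
m, n, lam = 30, 10, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = np.sum(A**2, axis=0)                          # L_i = ||a_i||_2^2 for this f
grad_i = lambda z, i: A[:, i] @ (A @ z - b)       # grad_i f(z) = a_i^T (A z - b)
x = np.zeros(n)
for _ in range(3000):
    i = int(rng.integers(n))                      # uniform sampling (alpha = 0)
    x = composite_coord_step(x, i, grad_i, L[i], lam)
```

For efficiency one would maintain the residual r = Ax − b and update it in O(p_i) operations per step, exactly as on page 12–9; the sketch recomputes it at each call only for clarity.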