
Convex Optimization, Lecture 16

Today:
• Projected Gradient Descent
• Conditional Gradient Descent
• Stochastic Gradient Descent
• Random Coordinate Descent

Recall: Gradient Descent (steepest descent w.r.t. the Euclidean norm)

Gradient descent algorithm: Δx = −∇f(x)
• Init: x^(0) ∈ dom(f)
• Iterate: x^(k+1) ← x^(k) − t^(k) ∇f(x^(k))

Reminder: we are violating here the distinction between the primal space and the dual (gradient) space; we implicitly link them by matching representations w.r.t. a chosen basis.

Note: Δx is not normalized (i.e., we don't require ‖Δx‖₂ = 1). This just changes the meaning of t. How do we choose the step size?

Lower Bounds

Some upper bounds on #iterations for ε-optimality (first-order oracle), with κ = M/µ:

  ‖∇f‖₂ ≤ L:        L²‖x*‖²/ε²
  ∇²f ≼ MI:         M‖x*‖²/ε
  µI ≼ ∇²f ≼ MI:    κ log(1/ε)

• Is this the best we can do, no matter how many ops we spend per iteration?
• YES, when using only a first-order oracle (gradients).
• History:
  • Nemirovski & Yudin (1983)
  • Nesterov (2004)

Smoothness and Strong Convexity

Def: f is µ-strongly convex (left inequality) and M-smooth (right inequality) if for all x, Δx:

  f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖² ≤ f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²

Can be viewed as a condition on the directional second derivatives:

  µ ≤ (d²/dα²) f(x + αv) = v^⊤ ∇²f(x + αv) v ≤ M   (for ‖v‖ = 1)

[Figure: f(x + Δx) sandwiched between the quadratic lower bound f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖², the quadratic upper bound with M/2, and above the linear approximation f(x) + ⟨∇f(x), Δx⟩.]

What about constraints?

  min_x f(x)   s.t. x ∈ X,   where X is convex

Projected Gradient Descent

Idea: make sure that the iterates are feasible by projecting onto X.

Algorithm (Projected Subgradient Descent; Bubeck, Sec. 3.1), iterating for t ≥ 1:

  y_{t+1} = x_t − η g_t,  where g_t ∈ ∂f(x_t)    (gradient step)   (3.2)
  x_{t+1} = Π_X(y_{t+1})                          (projection)      (3.3)

where Π_X is the projection operator onto X:

  Π_X(x) = argmin_{z ∈ X} ‖x − z‖

[Fig. 3.2: illustration of the Projected Subgradient Descent method; the gradient step may leave X, and the projection brings the point back.]

Notice: subgradient instead of gradient (even for differentiable functions). First we take a (sub)gradient step; then we make sure the updated point lies in X by projecting back (if necessary) onto it.

Theorem 3.1 (Bubeck). Projected Subgradient Descent with η = R/(L√t) satisfies

  f( (1/t) Σ_{s=1}^{t} x_s ) − f(x*) ≤ RL/√t.

(The proof uses the definition of subgradients, the definition of the method, and the elementary identity 2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖².)

Projected gradient descent, convergence rates (#iterations for ε-optimality):

  ‖∇f‖₂ ≤ L:                  L²‖x₁ − x*‖²/ε²
  ‖∇f‖₂ ≤ L, µ-str. convex:   L²/(µε)
  ∇²f ≼ MI:                   (M‖x₁ − x*‖² + f(x₁) − f(x*))/ε
  µI ≼ ∇²f ≼ MI:              κ log(1/ε)

Same as the unconstrained case! But it requires a projection... how expensive is that?

Examples: Euclidean ball, PSD constraints, linear constraints Ax ≤ b.

Sometimes the projection is as expensive as solving the original optimization problem!
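To make the projection step concrete, here is a minimal sketch of projected subgradient descent in Python, with X taken to be a Euclidean ball (where the projection is just rescaling). The L1-loss objective, the radius, and the constant step size are illustrative choices, not from the slides.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Projection onto the Euclidean ball of the given radius:
    rescale x back to the boundary if it lies outside."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_subgradient_descent(subgrad, x0, eta, num_iters, project):
    """y_{t+1} = x_t - eta * g_t,  x_{t+1} = Pi_X(y_{t+1}).
    Returns the averaged iterate, which Theorem 3.1 bounds."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(num_iters):
        g = subgrad(x)      # g_t in the subdifferential of f at x_t
        y = x - eta * g     # gradient step (may leave X)
        x = project(y)      # projection step back onto X
        avg += x
    return avg / num_iters

# Example: minimize f(x) = ||Ax - b||_1 over the unit ball. The L1 loss is
# non-differentiable; a valid subgradient is A^T sign(Ax - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
subgrad = lambda x: A.T @ np.sign(A @ x - b)
x_avg = projected_subgradient_descent(subgrad, np.zeros(5), eta=0.01,
                                      num_iters=1000, project=project_ball)
```

The constant step size plays the role of η = R/(L√t) when the horizon t is fixed in advance.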
Conditional Gradient Descent

A projection-free algorithm! Introduced for QP by Marguerite Frank and Philip Wolfe (1956).

Algorithm:
• Initialize: x^(0) ∈ X
• s^(k) = argmin_{s ∈ X} ⟨∇f(x^(k)), s⟩
• x^(k+1) = x^(k) + t^(k) (s^(k) − x^(k))

Notice:
• f assumed M-smooth
• X assumed bounded
• First-order oracle
• Linear optimization (in place of projection)
• Sparse iterates (e.g., for polytope constraints)

Convergence rate: for M-smooth f with step size t^(k) = 2/(k+1), the #iterations required for ε-optimality is MR²/ε, where R = sup_{x,y ∈ X} ‖x − y‖.

Proof:

  f(x^(k+1)) ≤ f(x^(k)) + ⟨∇f(x^(k)), x^(k+1) − x^(k)⟩ + (M/2) ‖x^(k+1) − x^(k)‖²            [smoothness]
             = f(x^(k)) + t^(k) ⟨∇f(x^(k)), s^(k) − x^(k)⟩ + (M/2)(t^(k))² ‖s^(k) − x^(k)‖²   [update]
             ≤ f(x^(k)) + t^(k) ⟨∇f(x^(k)), x* − x^(k)⟩ + (M/2)(t^(k))² R²                    [s^(k) minimizes the linear objective; diameter ≤ R]
             ≤ f(x^(k)) + t^(k) (f(x*) − f(x^(k))) + (M/2)(t^(k))² R²                         [convexity]

Define δ^(k) = f(x^(k)) − f(x*); we have:

  δ^(k+1) ≤ (1 − t^(k)) δ^(k) + M(t^(k))² R²/2

A simple induction shows that for t^(k) = 2/(k+1):

  δ^(k) ≤ 2MR²/(k + 1)

Same rate as projected gradient descent, but without projection! It does need linear optimization over X.

What about strong convexity? Not helpful! It does not give a linear rate (κ log(1/ε)). [Active research]
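As an illustration of the linear-optimization oracle, here is a minimal sketch of conditional gradient descent in Python, with X taken to be an ℓ1 ball, where argmin_{s ∈ X} ⟨∇f(x), s⟩ is attained at a signed vertex. The least-squares objective and problem sizes are illustrative choices, not from the slides.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle over the l1 ball:
    argmin_{||s||_1 <= radius} <grad, s> is the signed vertex
    -radius * sign(grad_i) * e_i at the largest-magnitude coordinate."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, x0, num_iters, lmo):
    """x^{k+1} = x^k + t_k (s^k - x^k), with t_k = 2/(k+2)
    (the 2/(k+1) rule of the slides, shifted for 0-based k)."""
    x = x0.copy()
    for k in range(num_iters):
        s = lmo(grad_f(x))       # linear optimization, no projection
        t = 2.0 / (k + 2)
        x = x + t * (s - x)      # convex combination: stays feasible
    return x

# Example: least squares over the l1 ball (LASSO-style constraint).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
grad_f = lambda x: A.T @ (A @ x - b)
x_fw = frank_wolfe(grad_f, np.zeros(10), num_iters=500, lmo=lmo_l1_ball)
```

Each iterate is a convex combination of at most k vertices of the ℓ1 ball, which is exactly the sparsity property noted in the list above.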
Randomness in Convex Optimization

Insight: first-order methods are robust; inexact gradients are sufficient. As long as the gradients are correct on average, the error will vanish. Long history (Robbins & Monro, 1951).

Stochastic Gradient Descent

Motivation: many machine learning problems have the form of empirical risk minimization,

  min_{x ∈ R^n} Σ_{i=1}^{m} f_i(x) + λΩ(x)

where the f_i are convex and λ is the regularization constant.
• Classification: SVM, logistic regression
• Regression: least squares, ridge regression, LASSO

Cost of computing the gradient? m·n. What if m is VERY large? We want cheaper iterations.

Idea: use a stochastic first-order oracle, which for each point x ∈ dom(f) returns a stochastic gradient g̃(x) s.t.

  E[g̃(x)] ∈ ∂f(x)

That is, g̃ is an unbiased estimator of the subgradient.

Example: for the objective

  F(x) = (1/m) Σ_{i=1}^{m} F_i(x),   where F_i(x) = f_i(x) + λΩ(x),

select j ∈ {1, …, m} uniformly at random and return ∇F_j(x). Then

  E[g̃(x)] = (1/m) Σ_i ∇F_i(x) = ∇F(x)

SGD iterates: x^(k+1) ← x^(k) − t^(k) g̃(x^(k))

How to choose the step size t^(k)?
• Lipschitz case: t^(k) ∝ 1/√k
• µ-strongly-convex case: t^(k) ∝ 1/(µk)
Note: decaying step sizes!

Stochastic vs. deterministic methods: minimizing g(θ) = (1/n) Σ_{i=1}^{n} f_i(θ) with f_i(θ) = ℓ(y_i, θ^⊤Φ(x_i)) + µΩ(θ):
• Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^{n} f_i′(θ_{t−1})
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

Goal = best of both worlds: linear rate with O(1) iteration cost.

[Figure: log(excess cost) vs. time for the deterministic and stochastic methods; SGD makes fast initial progress with cheap iterations, while GD eventually wins with its linear rate. Borrowed from Francis Bach's slides.]

Convergence rates (#iterations for ε-optimality):

          ‖∇f‖₂ ≤ L       ∇²f ≼ MI      ‖∇f‖₂ ≤ L, µ-str. convex    µI ≼ ∇²f ≼ MI
  GD      L²‖x*‖²/ε²      M‖x*‖²/ε      L²/(µε)                     κ log(1/ε)
  SGD     B²‖x*‖²/ε²      ?             B²/(µε)                     ?

Additional assumption: E[‖g̃(x)‖²] ≤ B² for all x ∈ dom(f).

Comment: the SGD guarantees hold in expectation, for the averaged iterates:

  E[ f( (1/K) Σ_{k=1}^{K} x^(k) ) − f(x*) ] ≤ …

Similar rates as with exact gradients! Refining by the variance, where E[‖∇f(x) − g̃(x)‖²] ≤ σ²:

          ‖∇f‖₂ ≤ L       ∇²f ≼ MI                        µI ≼ ∇²f ≼ MI
  GD      L²‖x*‖²/ε²      M‖x*‖²/ε                        κ log(1/ε)
  AGD     ×               √(M‖x*‖²/ε)                     √κ log(1/ε)
  SGD     B²‖x*‖²/ε²      σ²‖x*‖²/ε² + √(M‖x*‖²/ε)        B²/(µε)

Smoothness? Not helpful! (same rate as in the non-smooth case)
Lower bounds: Nemirovski & Yudin (1983). [Active research]
Acceleration? SGD cannot be easily accelerated! Mini-batch acceleration: [Active research]

Random Coordinate Descent

Recall: the cost of computing the exact GD update is m·n. What if n is VERY large? We want cheaper iterations.

Random coordinate descent algorithm:
• Initialize: x^(0) ∈ dom(f)
• Iterate: pick i(k) ∈ {1, …, n} randomly and update

    x^(k+1) = x^(k) − t^(k) ∇_{i(k)} f(x^(k)) e_{i(k)}

where we denote ∇_i f(x) = ∂f(x)/∂x_i.

Assumption: f is convex and differentiable.

What if f is not differentiable? Is a point that is optimal along every coordinate still a global minimizer? A: No! Look at the counterexample:

[Figure: contour plot of a convex but non-differentiable f with a point that is coordinate-wise optimal yet not a global minimum. Borrowed from Ryan Tibshirani's slides.]

Q: Same question again, but now f(x) = g(x) + Σ_{i=1}^{n} h_i(x_i), with g convex and differentiable and each h_i convex (the non-smooth part here is called separable)?

Iteration cost? ∇_i f(x) + O(1); compare to ∇f(x) + O(n) for GD.

Example: quadratic

  f(x) = (1/2) x^⊤Qx − v^⊤x
  ∇f(x) = Qx − v
  ∇_i f(x) = q_i^⊤x − v_i   (q_i is the i-th row of Q)

Can view CD as SGD with the oracle g̃(x) = n ∇_i f(x) e_i (i uniform). Clearly,

  E[g̃(x)] = (1/n) Σ_{i=1}^{n} n ∇_i f(x) e_i = ∇f(x)

Can replace individual coordinates with blocks of coordinates.

Example: SVM

Primal:
  min_w (λ/2)‖w‖² + Σ_i max(1 − y_i w^⊤z_i, 0)

Dual:
  min_α (1/2) α^⊤Qα − 1^⊤α   s.t. …

Randomized coordinate ascent on this dual is the SDCA method of Shalev-Shwartz & Zhang.

[Figure (Shalev-Shwartz & Zhang): duality gap vs. #epochs for SDCA, SDCA-Perm, and SGD on the astro-ph, CCAT, and cov1 datasets, for λ = 10⁻³, 10⁻⁴, 10⁻⁵.]
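To tie the last two sections together, here is a minimal sketch in Python of SGD and random coordinate descent on the same least-squares objective f(x) = (1/2m)‖Ax − b‖²; the problem sizes and step sizes are illustrative choices, not from the slides. Note how the coordinate update touches a single column of A per iteration, matching the ∇_i f(x) + O(1) iteration cost above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 50
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

# f(x) = (1/2m)||Ax - b||^2 = (1/m) sum_i f_i(x),  f_i(x) = (1/2)(a_i^T x - b_i)^2

def sgd(num_iters, step=0.01):
    """Unbiased oracle: pick row i u.a.r.; grad f_i(x) = (a_i^T x - b_i) a_i,
    so E[g~(x)] = grad f(x). Cost per iteration: O(n), not O(mn)."""
    x = np.zeros(n)
    for k in range(num_iters):
        i = rng.integers(m)
        g = (A[i] @ x - b[i]) * A[i]
        x -= step * g
    return x

def random_coordinate_descent(num_iters, step=0.01):
    """Pick coordinate j u.a.r.; update only x_j using the partial derivative
    grad_j f(x) = (1/m) A[:, j]^T (Ax - b). Maintaining the residual r = Ax - b
    incrementally keeps the cost per iteration at O(m) (one column of A)."""
    x = np.zeros(n)
    r = A @ x - b                  # residual, updated incrementally below
    for k in range(num_iters):
        j = rng.integers(n)
        gj = A[:, j] @ r / m       # partial derivative along e_j
        x[j] -= step * gj
        r -= step * gj * A[:, j]   # keep r = Ax - b consistent
    return x

x_sgd = sgd(5000)
x_rcd = random_coordinate_descent(5000)
```

Scaling the coordinate update by n would turn it into exactly the unbiased oracle g̃(x) = n ∇_j f(x) e_j described above; here that constant is simply absorbed into the step size.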