
Convex Optimization, Lecture 16

Today:
• Projected Gradient Descent
• Conditional Gradient Descent
• Stochastic Gradient Descent
• Random Coordinate Descent

Recall: Gradient Descent (steepest descent w.r.t. the Euclidean norm)

Gradient descent algorithm: Δx = −∇f(x)
• Init: x^(0) ∈ dom(f)
• Iterate: x^(k+1) ← x^(k) − t^(k) ∇f(x^(k))

Reminder: we are violating here the distinction between the primal space and the dual (gradient) space; we implicitly link them by matching representations w.r.t. a chosen basis.

Note: Δx is not normalized (i.e., we don't require ‖Δx‖₂ = 1). This just changes the meaning of t. How do we choose the step size?

Lower Bounds

Some upper bounds on #iterations for ε-optimality (first-order oracle), with κ = M/µ:

  ‖∇f‖₂ ≤ L:        L²‖x*‖²/ε²
  ∇²f ≼ MI:         M‖x*‖²/ε
  µI ≼ ∇²f ≼ MI:    κ log(1/ε)

• Is this the best we can do, no matter how many ops we spend per iteration?
• YES, when using only a first-order oracle (gradients).
• History:
  • Nemirovski & Yudin (1983)
  • Nesterov (2004)

Smoothness and Strong Convexity

Def: f is µ-strongly convex (left inequality) and M-smooth (right inequality) if for all x, Δx:

  f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖² ≤ f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²

Can be viewed as a condition on the directional second derivatives:

  µ ≤ (d²/dα²) f(x + αv) = v^⊤ ∇²f(x + αv) v ≤ M   (for ‖v‖ = 1)

[Figure: f(x + Δx) sandwiched between the quadratic lower bound f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖², the quadratic upper bound with M/2, and above the linear approximation f(x) + ⟨∇f(x), Δx⟩.]

What about constraints?

  min_x f(x)   s.t. x ∈ X,   where X is convex

Projected Gradient Descent

Idea: make sure that the iterates are feasible by projecting onto X.

Algorithm (Projected Subgradient Descent; Bubeck, Sec. 3.1), iterating for t ≥ 1:

  y_{t+1} = x_t − η g_t,  where g_t ∈ ∂f(x_t)    (gradient step)   (3.2)
  x_{t+1} = Π_X(y_{t+1})                          (projection)      (3.3)

where Π_X is the projection operator onto X:

  Π_X(x) = argmin_{z ∈ X} ‖x − z‖

[Fig. 3.2: illustration of the Projected Subgradient Descent method; the gradient step may leave X, and the projection brings the point back.]

Notice: subgradient instead of gradient (even for differentiable functions). First we take a (sub)gradient step; then we make sure the updated point lies in X by projecting back (if necessary) onto it.

Theorem 3.1 (Bubeck). Projected Subgradient Descent with η = R/(L√t) satisfies

  f( (1/t) Σ_{s=1}^{t} x_s ) − f(x*) ≤ RL/√t.

(The proof uses the definition of subgradients, the definition of the method, and the elementary identity 2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖².)

Projected gradient descent, convergence rates (#iterations for ε-optimality):

  ‖∇f‖₂ ≤ L:                  L²‖x₁ − x*‖²/ε²
  ‖∇f‖₂ ≤ L, µ-str. convex:   L²/(µε)
  ∇²f ≼ MI:                   (M‖x₁ − x*‖² + f(x₁) − f(x*))/ε
  µI ≼ ∇²f ≼ MI:              κ log(1/ε)

Same as the unconstrained case! But it requires a projection... how expensive is that?

Examples: Euclidean ball, PSD constraints, linear constraints Ax ≤ b.

Sometimes the projection is as expensive as solving the original optimization problem!
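To make the projection step concrete, here is a minimal sketch of projected subgradient descent in Python, with X taken to be a Euclidean ball (where the projection is just rescaling). The L1-loss objective, the radius, and the constant step size are illustrative choices, not from the slides.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Projection onto the Euclidean ball of the given radius:
    rescale x back to the boundary if it lies outside."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_subgradient_descent(subgrad, x0, eta, num_iters, project):
    """y_{t+1} = x_t - eta * g_t,  x_{t+1} = Pi_X(y_{t+1}).
    Returns the averaged iterate, which Theorem 3.1 bounds."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(num_iters):
        g = subgrad(x)      # g_t in the subdifferential of f at x_t
        y = x - eta * g     # gradient step (may leave X)
        x = project(y)      # projection step back onto X
        avg += x
    return avg / num_iters

# Example: minimize f(x) = ||Ax - b||_1 over the unit ball. The L1 loss is
# non-differentiable; a valid subgradient is A^T sign(Ax - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
subgrad = lambda x: A.T @ np.sign(A @ x - b)
x_avg = projected_subgradient_descent(subgrad, np.zeros(5), eta=0.01,
                                      num_iters=1000, project=project_ball)
```

The constant step size plays the role of η = R/(L√t) when the horizon t is fixed in advance.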
Conditional Gradient Descent

A projection-free algorithm! Introduced for QP by Marguerite Frank and Philip Wolfe (1956).

Algorithm:
• Initialize: x^(0) ∈ X
• s^(k) = argmin_{s ∈ X} ⟨∇f(x^(k)), s⟩
• x^(k+1) = x^(k) + t^(k) (s^(k) − x^(k))

Notice:
• f assumed M-smooth
• X assumed bounded
• First-order oracle
• Linear optimization (in place of projection)
• Sparse iterates (e.g., for polytope constraints)

Convergence rate: for M-smooth f with step size t^(k) = 2/(k+1), the #iterations required for ε-optimality is MR²/ε, where R = sup_{x,y ∈ X} ‖x − y‖.

Proof:

  f(x^(k+1)) ≤ f(x^(k)) + ⟨∇f(x^(k)), x^(k+1) − x^(k)⟩ + (M/2) ‖x^(k+1) − x^(k)‖²            [smoothness]
             = f(x^(k)) + t^(k) ⟨∇f(x^(k)), s^(k) − x^(k)⟩ + (M/2)(t^(k))² ‖s^(k) − x^(k)‖²   [update]
             ≤ f(x^(k)) + t^(k) ⟨∇f(x^(k)), x* − x^(k)⟩ + (M/2)(t^(k))² R²                    [s^(k) minimizes the linear objective; diameter ≤ R]
             ≤ f(x^(k)) + t^(k) (f(x*) − f(x^(k))) + (M/2)(t^(k))² R²                         [convexity]

Define δ^(k) = f(x^(k)) − f(x*); we have:

  δ^(k+1) ≤ (1 − t^(k)) δ^(k) + M(t^(k))² R²/2

A simple induction shows that for t^(k) = 2/(k+1):

  δ^(k) ≤ 2MR²/(k + 1)

Same rate as projected gradient descent, but without projection! It does need linear optimization over X.

What about strong convexity? Not helpful! It does not give a linear rate (κ log(1/ε)). [Active research]
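As an illustration of the linear-optimization oracle, here is a minimal sketch of conditional gradient descent in Python, with X taken to be an ℓ1 ball, where argmin_{s ∈ X} ⟨∇f(x), s⟩ is attained at a signed vertex. The least-squares objective and problem sizes are illustrative choices, not from the slides.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle over the l1 ball:
    argmin_{||s||_1 <= radius} <grad, s> is the signed vertex
    -radius * sign(grad_i) * e_i at the largest-magnitude coordinate."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, x0, num_iters, lmo):
    """x^{k+1} = x^k + t_k (s^k - x^k), with t_k = 2/(k+2)
    (the 2/(k+1) rule of the slides, shifted for 0-based k)."""
    x = x0.copy()
    for k in range(num_iters):
        s = lmo(grad_f(x))       # linear optimization, no projection
        t = 2.0 / (k + 2)
        x = x + t * (s - x)      # convex combination: stays feasible
    return x

# Example: least squares over the l1 ball (LASSO-style constraint).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
grad_f = lambda x: A.T @ (A @ x - b)
x_fw = frank_wolfe(grad_f, np.zeros(10), num_iters=500, lmo=lmo_l1_ball)
```

Each iterate is a convex combination of at most k vertices of the ℓ1 ball, which is exactly the sparsity property noted in the list above.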
Randomness in Convex Optimization

Insight: first-order methods are robust; inexact gradients are sufficient. As long as the gradients are correct on average, the error will vanish. Long history (Robbins & Monro, 1951).

Stochastic Gradient Descent

Motivation: many machine learning problems have the form of empirical risk minimization,

  min_{x ∈ R^n} Σ_{i=1}^{m} f_i(x) + λΩ(x)

where the f_i are convex and λ is the regularization constant.
• Classification: SVM, logistic regression
• Regression: least squares, ridge regression, LASSO

Cost of computing the gradient? m·n. What if m is VERY large? We want cheaper iterations.

Idea: use a stochastic first-order oracle, which for each point x ∈ dom(f) returns a stochastic gradient g̃(x) s.t.

  E[g̃(x)] ∈ ∂f(x)

That is, g̃ is an unbiased estimator of the subgradient.

Example: for the objective

  F(x) = (1/m) Σ_{i=1}^{m} F_i(x),   where F_i(x) = f_i(x) + λΩ(x),

select j ∈ {1, …, m} uniformly at random and return ∇F_j(x). Then

  E[g̃(x)] = (1/m) Σ_i ∇F_i(x) = ∇F(x)

SGD iterates: x^(k+1) ← x^(k) − t^(k) g̃(x^(k))

How to choose the step size t^(k)?
• Lipschitz case: t^(k) ∝ 1/√k
• µ-strongly-convex case: t^(k) ∝ 1/(µk)
Note: decaying step sizes!

Stochastic vs. deterministic methods: minimizing g(θ) = (1/n) Σ_{i=1}^{n} f_i(θ) with f_i(θ) = ℓ(y_i, θ^⊤Φ(x_i)) + µΩ(θ):
• Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^{n} f_i′(θ_{t−1})
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

Goal = best of both worlds: linear rate with O(1) iteration cost.

[Figure: log(excess cost) vs. time for the deterministic and stochastic methods; SGD makes fast initial progress with cheap iterations, while GD eventually wins with its linear rate. Borrowed from Francis Bach's slides.]

Convergence rates (#iterations for ε-optimality):

          ‖∇f‖₂ ≤ L       ∇²f ≼ MI      ‖∇f‖₂ ≤ L, µ-str. convex    µI ≼ ∇²f ≼ MI
  GD      L²‖x*‖²/ε²      M‖x*‖²/ε      L²/(µε)                     κ log(1/ε)
  SGD     B²‖x*‖²/ε²      ?             B²/(µε)                     ?

Additional assumption: E[‖g̃(x)‖²] ≤ B² for all x ∈ dom(f).

Comment: the SGD guarantees hold in expectation, for the averaged iterates:

  E[ f( (1/K) Σ_{k=1}^{K} x^(k) ) − f(x*) ] ≤ …

Similar rates as with exact gradients! Refining by the variance, where E[‖∇f(x) − g̃(x)‖²] ≤ σ²:

          ‖∇f‖₂ ≤ L       ∇²f ≼ MI                        µI ≼ ∇²f ≼ MI
  GD      L²‖x*‖²/ε²      M‖x*‖²/ε                        κ log(1/ε)
  AGD     ×               √(M‖x*‖²/ε)                     √κ log(1/ε)
  SGD     B²‖x*‖²/ε²      σ²‖x*‖²/ε² + √(M‖x*‖²/ε)        B²/(µε)

Smoothness? Not helpful! (same rate as in the non-smooth case)
Lower bounds: Nemirovski & Yudin (1983). [Active research]
Acceleration? SGD cannot be easily accelerated! Mini-batch acceleration: [Active research]

Random Coordinate Descent

Recall: the cost of computing the exact GD update is m·n. What if n is VERY large? We want cheaper iterations.

Random coordinate descent algorithm:
• Initialize: x^(0) ∈ dom(f)
• Iterate: pick i(k) ∈ {1, …, n} randomly and update

    x^(k+1) = x^(k) − t^(k) ∇_{i(k)} f(x^(k)) e_{i(k)}

where we denote ∇_i f(x) = ∂f(x)/∂x_i.

Assumption: f is convex and differentiable.

What if f is not differentiable? Is a point that is optimal along every coordinate still a global minimizer? A: No! Look at the counterexample:

[Figure: contour plot of a convex but non-differentiable f with a point that is coordinate-wise optimal yet not a global minimum. Borrowed from Ryan Tibshirani's slides.]

Q: Same question again, but now f(x) = g(x) + Σ_{i=1}^{n} h_i(x_i), with g convex and differentiable and each h_i convex (the non-smooth part here is called separable)?

Iteration cost? ∇_i f(x) + O(1); compare to ∇f(x) + O(n) for GD.

Example: quadratic

  f(x) = (1/2) x^⊤Qx − v^⊤x
  ∇f(x) = Qx − v
  ∇_i f(x) = q_i^⊤x − v_i   (q_i is the i-th row of Q)

Can view CD as SGD with the oracle g̃(x) = n ∇_i f(x) e_i (i uniform). Clearly,

  E[g̃(x)] = (1/n) Σ_{i=1}^{n} n ∇_i f(x) e_i = ∇f(x)

Can replace individual coordinates with blocks of coordinates.

Example: SVM

Primal:
  min_w (λ/2)‖w‖² + Σ_i max(1 − y_i w^⊤z_i, 0)

Dual:
  min_α (1/2) α^⊤Qα − 1^⊤α   s.t. …

Randomized coordinate ascent on this dual is the SDCA method of Shalev-Shwartz & Zhang.

[Figure (Shalev-Shwartz & Zhang): duality gap vs. #epochs for SDCA, SDCA-Perm, and SGD on the astro-ph, CCAT, and cov1 datasets, for λ = 10⁻³, 10⁻⁴, 10⁻⁵.]
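To tie the last two sections together, here is a minimal sketch in Python of SGD and random coordinate descent on the same least-squares objective f(x) = (1/2m)‖Ax − b‖²; the problem sizes and step sizes are illustrative choices, not from the slides. Note how the coordinate update touches a single column of A per iteration, matching the ∇_i f(x) + O(1) iteration cost above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 50
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

# f(x) = (1/2m)||Ax - b||^2 = (1/m) sum_i f_i(x),  f_i(x) = (1/2)(a_i^T x - b_i)^2

def sgd(num_iters, step=0.01):
    """Unbiased oracle: pick row i u.a.r.; grad f_i(x) = (a_i^T x - b_i) a_i,
    so E[g~(x)] = grad f(x). Cost per iteration: O(n), not O(mn)."""
    x = np.zeros(n)
    for k in range(num_iters):
        i = rng.integers(m)
        g = (A[i] @ x - b[i]) * A[i]
        x -= step * g
    return x

def random_coordinate_descent(num_iters, step=0.01):
    """Pick coordinate j u.a.r.; update only x_j using the partial derivative
    grad_j f(x) = (1/m) A[:, j]^T (Ax - b). Maintaining the residual r = Ax - b
    incrementally keeps the cost per iteration at O(m) (one column of A)."""
    x = np.zeros(n)
    r = A @ x - b                  # residual, updated incrementally below
    for k in range(num_iters):
        j = rng.integers(n)
        gj = A[:, j] @ r / m       # partial derivative along e_j
        x[j] -= step * gj
        r -= step * gj * A[:, j]   # keep r = Ax - b consistent
    return x

x_sgd = sgd(5000)
x_rcd = random_coordinate_descent(5000)
```

Scaling the coordinate update by n would turn it into exactly the unbiased oracle g̃(x) = n ∇_j f(x) e_j described above; here that constant is simply absorbed into the step size.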