
Convex Optimization Lecture 16

Today:

• Projected Gradient Descent
• Conditional Gradient Descent
• Stochastic Gradient Descent
• Random Coordinate Descent

Recall: Gradient Descent (Steepest Descent w.r.t. the Euclidean Norm)

Gradient descent direction: Δx = −∇f(x)

Initialize: x^(0) ∈ dom(f)

Iterate: x^(k+1) ← x^(k) − t^(k) ∇f(x^(k))
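To make the iteration concrete, here is a minimal Python sketch on a small quadratic; the instance, the 1/M step size, and the iteration count are illustrative assumptions, not from the lecture.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters=100):
    """Iterate x <- x - t * grad(x) with a constant step size t."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite, so f is convex
v = np.array([1.0, -1.0])
grad = lambda x: Q @ x - v               # gradient of f(x) = 0.5 x^T Q x - v^T x
M = np.linalg.eigvalsh(Q).max()          # smoothness constant = largest eigenvalue
x_hat = gradient_descent(grad, np.zeros(2), step=1.0 / M)
print(x_hat, np.linalg.solve(Q, v))      # both approximate the minimizer Q^{-1} v
```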

Reminder: We are violating here the distinction between the primal space and the dual (gradient) space: we implicitly link them by matching representations w.r.t. a chosen basis.

Note: Δx is not normalized (i.e. we don't require ‖Δx‖₂ = 1). This just changes the meaning of t. How do we choose the stepsize?

Lower Bounds

Some upper bounds (# iterations for ε-optimality, with κ = M/µ):

         µI ≼ ∇²f ≼ MI   ∇²f ≼ MI     ‖∇f‖ ≤ L      ‖∇f‖ ≤ L, µ-str.cvx
  GD     κ log(1/ε)      M‖x*‖²/ε     L²‖x*‖²/ε²    L²/(µε)

• Is this the best we can do?
• What if we allow more expensive operations?
• YES! When using only a first-order oracle
• History:
  • Nemirovski & Yudin (1983)
  • Nesterov (2004)

Smoothness and Strong Convexity

Def: f is µ-strongly convex if for all x, Δx:
  f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖² ≤ f(x + Δx)

Def: f is M-smooth if for all x, Δx:
  f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²
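As a quick numerical sanity check (an illustrative assumption: for a quadratic f(x) = ½ x⊤Qx, the two definitions hold with µ = λ_min(Q) and M = λ_max(Q)), both inequalities can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + np.eye(3)                  # positive definite Hessian
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
mu, M = np.linalg.eigvalsh(Q)[[0, -1]]   # smallest / largest eigenvalue

x = rng.standard_normal(3)
dx = rng.standard_normal(3)
lower = f(x) + grad(x) @ dx + 0.5 * mu * (dx @ dx)   # strong convexity bound
upper = f(x) + grad(x) @ dx + 0.5 * M * (dx @ dx)    # smoothness bound
assert lower <= f(x + dx) <= upper
```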

Both can be viewed as conditions on the directional second derivatives:
  µ ≤ (d²/dα²) f(x + αv) ≤ M   for every unit direction v (‖v‖ = 1).

Equivalently, f(x + Δx) is sandwiched between the quadratics f(x) + ⟨∇f(x), Δx⟩ + (µ/2)‖Δx‖² and f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖², which both touch the linear approximation f(x) + ⟨∇f(x), Δx⟩ at Δx = 0.

What about constraints?

min_x f(x)   s.t. x ∈ X,   where X is convex

Projected Gradient Descent

Idea: make sure that the iterates stay feasible by projecting onto X.

Algorithm:
  y^(k+1) = x^(k) − t^(k) g^(k),  where g^(k) ∈ ∂f(x^(k))   [gradient step]
  x^(k+1) = Π_X(y^(k+1))                                    [projection]

The projection operator Π_X onto X:
  Π_X(x) = argmin_{z ∈ X} ‖x − z‖

Notice: a subgradient is used in place of the gradient (even for differentiable functions). First, the gradient step can use any g ∈ ∂f(x), since the gradient may not exist. Second, and more importantly, we make sure that the updated point lies in X by projecting back onto it (if necessary).

[Fig. 3.2: illustration of the Projected Subgradient Descent method: a gradient step leaving X, followed by a projection back onto it.]

We now give a rate of convergence for this method under the above assumptions.
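A minimal sketch of the two-step iteration, assuming X is a Euclidean ball (where the projection is just a rescaling) and a toy non-smooth objective; the constant step size is also an assumption for illustration:

```python
import numpy as np

def project_ball(x, r=1.0):
    """Euclidean projection onto {z : ||z||_2 <= r}: rescale if outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def projected_subgradient(subgrad, x0, step, iters=2000):
    x = x0
    avg = np.zeros_like(x0)
    for _ in range(iters):
        y = x - step * subgrad(x)        # (sub)gradient step
        x = project_ball(y)              # projection step
        avg += x / iters                 # the guarantee is for the averaged iterate
    return avg

a = np.array([2.0, -0.5])
subgrad = lambda x: np.sign(x - a)       # a subgradient of f(x) = ||x - a||_1
print(projected_subgradient(subgrad, np.zeros(2), step=0.02))
```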

Theorem 3.1. Projected Subgradient Descent with η = R/(L√t) satisfies

  f( (1/t) Σ_{s=1}^t x_s ) − f(x*) ≤ RL/√t

(here ‖x^(1) − x*‖ ≤ R and f is L-Lipschitz on X).

Proof idea: use the definition of subgradients, the definition of the method, and the elementary identity 2a⊤b = ‖a‖² + ‖b‖² − ‖a − b‖².

Projected gradient descent: convergence rate

         µI ≼ ∇²f ≼ MI   ∇²f ≼ MI                          ‖∇f‖ ≤ L      ‖∇f‖ ≤ L, µ-str.cvx
  PGD    κ log(1/ε)      (M‖x*‖² + (f(x^(1)) − f(x*)))/ε   L²‖x*‖²/ε²    L²/(µε)

Same as the unconstrained case! But it requires a projection... how expensive is that?

Examples:
• Euclidean ball
• PSD constraints
• Linear constraints Ax ≤ b

Sometimes the projection is as expensive as solving the original optimization problem!

Conditional Gradient Descent

A projection-free algorithm! Introduced for QP by Marguerite Frank and Philip Wolfe (1956)

Algorithm

• Initialize: x^(0) ∈ X
• s^(k) = argmin_{s ∈ X} ⟨∇f(x^(k)), s⟩
• x^(k+1) = x^(k) + t^(k) (s^(k) − x^(k))

Notice:

• f assumed M-smooth
• X assumed bounded
• First-order oracle
• Linear optimization (in place of projection)
• Sparse iterates (e.g., for polytope constraints); see the sketch below
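The sketch promised above: a minimal conditional gradient loop. It assumes X is the probability simplex, so the linear minimization oracle simply returns a vertex; the quadratic objective is an illustrative assumption.

```python
import numpy as np

def simplex_lmo(g):
    """argmin_{s in simplex} <g, s> is the vertex e_i with i = argmin_i g_i."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

def frank_wolfe(grad, lmo, x0, iters=500):
    x = x0
    for k in range(1, iters + 1):
        s = lmo(grad(x))                 # linear optimization (no projection!)
        t = 2.0 / (k + 1)                # step size t^(k) = 2/(k+1)
        x = x + t * (s - x)              # convex combination: stays in X
    return x

Q = np.diag([1.0, 2.0, 3.0])             # f(x) = 0.5 x^T Q x over the simplex
grad = lambda x: Q @ x
print(frank_wolfe(grad, simplex_lmo, np.ones(3) / 3))
```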

Convergence rate

For M-smooth functions with step size t^(k) = 2/(k+1):

# iterations required for ε-optimality: MR²/ε

where R = sup_{x,y ∈ X} ‖x − y‖ is the diameter of X.

Proof:
  f(x^(k+1)) ≤ f(x^(k)) + ⟨∇f(x^(k)), x^(k+1) − x^(k)⟩ + (M/2)‖x^(k+1) − x^(k)‖²           [smoothness]
             = f(x^(k)) + t^(k)⟨∇f(x^(k)), s^(k) − x^(k)⟩ + (M/2)(t^(k))²‖s^(k) − x^(k)‖²   [update]
             ≤ f(x^(k)) + t^(k)⟨∇f(x^(k)), x* − x^(k)⟩ + (M/2)(t^(k))²R²                    [s^(k) minimizes the linear term, x* ∈ X]
             ≤ f(x^(k)) + t^(k)(f(x*) − f(x^(k))) + (M/2)(t^(k))²R²                         [convexity]

Define δ^(k) = f(x^(k)) − f(x*); then:
  δ^(k+1) ≤ (1 − t^(k)) δ^(k) + M(t^(k))²R²/2

A simple induction shows that for t^(k) = 2/(k+1):
  δ^(k) ≤ 2MR²/(k + 1)

Same rate as projected gradient descent, but without projection! Does need linear optimization.

What about strong convexity? Not helpful! Does not give a linear rate (κ log(1/ε)). [? Active research]

Randomness in Convex Optimization

Insight: first-order methods are robust – inexact gradients are sufficient

As long as gradients are correct on average, the error will vanish

Long history (Robbins & Monro, 1951)

Stochastic Gradient Descent

Motivation: many machine learning problems have the form of empirical risk minimization:

  min_{x ∈ Rⁿ} Σ_{i=1}^m f_i(x) + λΩ(x)

where the f_i are convex and λ is the regularization constant.

Classification: SVM, logistic regression
Regression: least-squares, ridge regression, LASSO

Cost of computing the gradient? m·n

What if m is VERY large? We want cheaper iterations.

Idea: use a stochastic first-order oracle, which for each point x ∈ dom(f) returns a stochastic gradient g̃(x) s.t.

  E[g̃(x)] ∈ ∂f(x)

That is, g̃(x) is an unbiased estimator of a subgradient.

Example: for the objective

  min_{x ∈ Rⁿ} (1/m) Σ_{i=1}^m F_i(x),   where F_i(x) = f_i(x) + λΩ(x),

select j ∈ {1, ..., m} uniformly at random and return ∇F_j(x). Then

  E[g̃(x)] = (1/m) Σ_i ∇F_i(x) = ∇f(x)

SGD iterates: x^(k+1) ← x^(k) − t^(k) g̃(x^(k))

How to choose the step size t^(k)? Note: decaying step sizes!
• Lipschitz case: t^(k) ∝ 1/√k
• µ-strongly-convex case: t^(k) ∝ 1/(µk)

Stochastic vs. deterministic methods (from Francis Bach's slides), for minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, ⟨θ, Φ(x_i)⟩) + µΩ(θ):
• Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f_i′(θ_{t−1})
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

Goal = best of both worlds: linear rate with O(1) iteration cost.
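A minimal SGD sketch for an ERM instance of the form above (least-squares terms, no regularizer); the synthetic data, step-size constant, and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

def stochastic_grad(x):
    """Unbiased oracle: gradient of one random term f_j(x) = 0.5 (a_j^T x - b_j)^2."""
    j = rng.integers(m)
    return A[j] * (A[j] @ x - b[j])

x = np.zeros(n)
for k in range(1, 20001):
    t = 0.5 / np.sqrt(k)                 # decaying step size (Lipschitz case)
    x = x - t * stochastic_grad(x)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # relative residual, small
```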

[Figure: log(excess cost) vs. time for deterministic and stochastic methods; borrowed from Francis Bach's slides.]

Convergence rates

         µI ≼ ∇²f ≼ MI   ∇²f ≼ MI     ‖∇f‖ ≤ L      ‖∇f‖ ≤ L, µ-str.cvx
  GD     κ log(1/ε)      M‖x*‖²/ε     L²‖x*‖²/ε²    L²/(µε)
  SGD    ?               ?            B²‖x*‖²/ε²    B²/(µε)

Additional assumption: E[‖g̃(x)‖²] ≤ B² for all x ∈ dom(f)

Comment: the SGD rates hold in expectation, for the averaged iterates:

  E[ f( (1/K) Σ_{k=1}^K x^(k) ) ] − f(x*) ≤ ...

Similar rates as with exact gradients!

         µI ≼ ∇²f ≼ MI   ∇²f ≼ MI                     ‖∇f‖ ≤ L      ‖∇f‖ ≤ L, µ-str.cvx
  GD     κ log(1/ε)      M‖x*‖²/ε                     L²‖x*‖²/ε²    L²/(µε)
  AGD    √κ log(1/ε)     √(M‖x*‖²/ε)                  ×             ×
  SGD    ?               ‖x*‖²σ²/ε² + √(M‖x*‖²/ε)     B²‖x*‖²/ε²    B²/(µε)

where E[‖∇f(x) − g̃(x)‖²] ≤ σ²

Smoothness? Not helpful! (same rate as in the non-smooth case)

Lower bounds: Nemirovski & Yudin (1983). [? Active research]

Acceleration? Cannot be easily accelerated! Mini-batch acceleration: [? Active research]

Random Coordinate Descent

Recall: cost of computing the exact GD update: m·n. What if n is VERY large? We want cheaper iterations.

Random coordinate descent algorithm:

• Initialize: x^(0) ∈ dom(f)
• Iterate: pick i(k) ∈ {1, ..., n} uniformly at random, and update
    x^(k+1) = x^(k) − t^(k) ∇_{i(k)} f(x^(k)) e_{i(k)}

where we denote ∇_i f(x) = ∂f(x)/∂x_i.

Assumption: f is convex and differentiable.

What if f is not differentiable?

[Figure: contour plot of a convex, non-differentiable f(x₁, x₂) with a marked point at which no coordinate-wise move decreases f.]

A: No! Look at the counterexample above. (Figures borrowed from Ryan Tibshirani's slides)

Q: Same question again, but now f(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex ...? (The non-smooth part here is called separable.)

Iteration cost? ∇_i f(x) + O(1). Compare to ∇f(x) + O(n) for GD.

Example: quadratic
  f(x) = ½ x⊤Qx − v⊤x
  ∇f(x) = Qx − v           (O(n²) to compute)
  ∇_i f(x) = q_i⊤x − v_i   (O(n) to compute)
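A minimal random coordinate descent sketch for this quadratic example; the fixed step size is an assumption, chosen below 1/max_i Q_ii.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, -1.0])
grad_i = lambda x, i: Q[i] @ x - v[i]    # i-th partial: q_i^T x - v_i, O(n) to compute

x = np.zeros(2)
for _ in range(2000):
    i = rng.integers(2)                  # pick a coordinate uniformly at random
    x[i] -= 0.2 * grad_i(x, i)           # O(1) update of a single coordinate
print(x, np.linalg.solve(Q, v))          # both approximate the minimizer
```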

Can view CD as SGD with the oracle g̃(x) = n ∇_i f(x) e_i, i uniform in {1, ..., n}. Clearly,

  E[g̃(x)] = (1/n) Σ_i n ∇_i f(x) e_i = ∇f(x)

Can replace individual coordinates with blocks of coordinates.
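A short numerical check of this unbiasedness claim, on an assumed quadratic gradient:

```python
import numpy as np

n = 4
Q = np.eye(n) + 0.1                      # arbitrary symmetric matrix (assumption)
grad = lambda x: Q @ x                   # exact gradient of 0.5 x^T Q x
x = np.ones(n)

oracle = lambda i: n * grad(x)[i] * np.eye(n)[i]    # g~(x) for coordinate i
expectation = sum(oracle(i) for i in range(n)) / n  # average over uniform i
assert np.allclose(expectation, grad(x))
```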

Example: SVM

Primal:
  min_w (λ/2)‖w‖² + Σ_i max(1 − y_i w⊤z_i, 0)

Dual:
  min_α ½ α⊤Qα − 1⊤α
  s.t. 0 ≤ α_i ≤ 1/λ ∀i
  where Q_ij = y_i y_j z_i⊤z_j

(Shalev-Shwartz & Zhang, 2013)

[Figure 7 of Shalev-Shwartz & Zhang (2013): primal sub-optimality of SDCA, SDCA-Perm, and SGD for the non-smooth hinge loss (γ = 0) on the astro-ph, CCAT, and cov1 datasets, for λ ∈ {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶}. In all plots the horizontal axis is the number of iterations divided by the training set size (i.e., the number of epochs through the data).]
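For concreteness, a Pegasos-style SGD sketch for the primal problem above (an assumed baseline; the SGD in the figure is of this general type). It assumes the hinge term is averaged over the data and uses the 1/(λk) step size from the strongly convex case; the synthetic data and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 500, 5, 0.1
Z = rng.standard_normal((m, n))
y = np.sign(Z @ rng.standard_normal(n))  # synthetic linearly separable labels

w = np.zeros(n)
for k in range(1, 20001):
    i = rng.integers(m)                  # sample one data point uniformly
    margin = y[i] * (Z[i] @ w)
    # subgradient of (lam/2)||w||^2 + averaged hinge term at point i
    g = lam * w - (y[i] * Z[i] if margin < 1 else 0.0)
    w -= g / (lam * k)                   # 1/(lam*k) step: strongly convex case
print(np.mean(np.sign(Z @ w) == y))      # training accuracy
```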

Convergence rate

Directional smoothness for f: there exist M₁, ..., Mₙ s.t. for any i ∈ {1, ..., n}, x ∈ Rⁿ, and u ∈ R:

  |∇_i f(x + u e_i) − ∇_i f(x)| ≤ M_i |u|

Note: this implies f is M-smooth with M ≤ Σ_i M_i.

Consider the update:

  x^(k+1) = x^(k) − (1/M_{i(k)}) ∇_{i(k)} f(x^(k)) · e_{i(k)}
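The same coordinate descent loop as in the earlier quadratic sketch, now with the 1/M_i step rule. For a quadratic, ∇_i f changes linearly in x_i with slope Q_ii, so M_i = Q_ii exactly and this step minimizes f along the chosen coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, -1.0])

x = np.zeros(2)
for _ in range(500):
    i = rng.integers(2)
    x[i] -= (Q[i] @ x - v[i]) / Q[i, i]  # step 1/M_i with M_i = Q_ii
print(x, np.linalg.solve(Q, v))          # both approximate the minimizer
```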

No need to know the M_i's: they can be adjusted dynamically.

Rates (Nesterov, 2012):

         µI ≼ ∇²f ≼ MI                        ∇²f ≼ MI            ‖∇f‖ ≤ L      ‖∇f‖ ≤ L, µ-str.cvx
  GD     κ log(1/ε)                           M‖x*‖²/ε            L²‖x*‖²/ε²    L²/(µε)
  CD     nκ̂ log(1/ε),  κ̂ = (Σ_i M_i)/(nµ)     (Σ_i M_i)‖x*‖²/ε    ×             ×

Same total cost as GD, but with much cheaper iterations

Comment: holds in expectation

  E[f(x^(k))] − f* ≤ ...

Acceleration? Yes! [? Active research]