
Introduction to Convex Optimization for Machine Learning

John Duchi

University of California, Berkeley

Practical Machine Learning, Fall 2009

Outline

What is Optimization

Convex Sets

Convex Functions

Convex Optimization Problems

Lagrange Duality

Optimization Algorithms

Take Home Messages

What is Optimization (and why do we care?)

What is Optimization?

◮ Finding the minimizer of a function subject to constraints:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \le 0, \quad i = 1, \dots, k \\ & h_j(x) = 0, \quad j = 1, \dots, l \end{aligned}$$


◮ Example: Stock market. “Minimize variance of return subject to getting at least $50.”

Why do we care?

Optimization is at the heart of many (most practical?) machine learning algorithms.
◮ Linear regression: $\underset{w}{\text{minimize}}\ \|Xw - y\|^2$
◮ Classification (logistic regression or SVM):

$$\underset{w}{\text{minimize}}\ \sum_{i=1}^n \log\left(1 + \exp(-y_i x_i^T w)\right)$$
or
$$\underset{w,\,\xi}{\text{minimize}}\ \|w\|^2 + \sum_{i=1}^n \xi_i \quad \text{s.t.}\ \xi_i \ge 1 - y_i x_i^T w, \ \xi_i \ge 0.$$
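As an illustration (not from the slides), here is a minimal numpy sketch of the logistic-regression objective above, minimized with plain gradient descent; the synthetic data, stepsize, and iteration count are arbitrary choices.

```python
import numpy as np

# Hedged sketch: the logistic loss f(w) = sum_i log(1 + exp(-y_i x_i^T w))
# minimized by plain gradient descent on synthetic data.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n))

def loss(w):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))   # stable log(1 + exp(-m_i))

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))            # sigma(-m_i), m_i = y_i x_i^T w
    return -X.T @ (y * s)

w = np.zeros(d)
for _ in range(1000):
    w -= 1e-3 * grad(w)                              # small fixed stepsize
print(loss(w))
```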

We still care...

◮ Maximum likelihood estimation:
$$\underset{\theta}{\text{maximize}}\ \sum_{i=1}^n \log p_\theta(x_i)$$
◮ Collaborative filtering:

$$\underset{w}{\text{minimize}}\ \sum_{i \prec j} \log\left(1 + \exp(w^T x_i - w^T x_j)\right)$$
◮ $k$-means:
$$\underset{\mu_1, \dots, \mu_k}{\text{minimize}}\ J(\mu) = \sum_{j=1}^k \sum_{i \in C_j} \|x_i - \mu_j\|^2$$
◮ And more (graphical models, feature selection, active learning, control)

But generally speaking...

We’re screwed.
◮ Local (non-global) minima of $f_0$
◮ All kinds of constraints (even restricting to continuous functions): $h(x) = \sin(2\pi x) = 0$

[Figure: surface plot of a highly non-convex function with many local minima]


◮ Go for convex problems!

Convex Sets


Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$$\alpha x + (1 - \alpha)y \in C.$$

[Figure: a convex set containing the segment between points $x$ and $y$]

Examples

◮ All of $\mathbb{R}^n$ (obvious)


◮ Non-negative orthant $\mathbb{R}^n_+$: let $x \succeq 0$, $y \succeq 0$; clearly $\alpha x + (1 - \alpha)y \succeq 0$.


◮ Norm balls: let $\|x\| \le 1$, $\|y\| \le 1$; then
$$\|\alpha x + (1 - \alpha)y\| \le \|\alpha x\| + \|(1 - \alpha)y\| = \alpha\|x\| + (1 - \alpha)\|y\| \le 1.$$


◮ Affine subspaces $\{x : Ax = b\}$: if $Ax = b$ and $Ay = b$, then
$$A(\alpha x + (1 - \alpha)y) = \alpha Ax + (1 - \alpha)Ay = \alpha b + (1 - \alpha)b = b.$$

[Figure: an affine subspace (a plane) plotted in the coordinates $(x_1, x_2, x_3)$]

More examples

◮ Arbitrary intersections of convex sets: let $C_i$ be convex for $i \in \mathcal{I}$ and $C = \bigcap_{i \in \mathcal{I}} C_i$; then

$$x \in C, \ y \in C \ \Rightarrow\ \alpha x + (1 - \alpha)y \in C_i \ \ \forall i \in \mathcal{I},$$
so $\alpha x + (1 - \alpha)y \in C$.

More examples

◮ PSD matrices, a.k.a. the positive semidefinite cone $S^n_+ \subset \mathbb{R}^{n \times n}$. $A \in S^n_+$ means $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$. For $A, B \in S^n_+$,
$$x^T (\alpha A + (1 - \alpha)B) x = \alpha x^T A x + (1 - \alpha) x^T B x \ge 0.$$
◮ On right:
$$S^2_+ = \left\{ \begin{bmatrix} x & z \\ z & y \end{bmatrix} \succeq 0 \right\} = \left\{ x, y, z : x \ge 0, \ y \ge 0, \ xy \ge z^2 \right\}$$
[Figure: the boundary of the cone $S^2_+$ plotted in $(x, y, z)$]

Convex Functions


Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha)f(y).$$

[Figure: the chord value $\alpha f(x) + (1 - \alpha)f(y)$ lies above the graph of $f$ between $x$ and $y$]

First order convexity conditions

Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable. Then $f$ is convex if and only if for all $x, y \in \operatorname{dom} f$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x).$$

[Figure: the tangent approximation $f(x) + \nabla f(x)^T (y - x)$ at $(x, f(x))$ lies below $f(y)$]

Actually, more general than that

Definition The subgradient set, or subdifferential set, ∂f(x) of f at x is

$$\partial f(x) = \left\{ g : f(y) \ge f(x) + g^T (y - x) \text{ for all } y \right\}.$$

Theorem. $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: the affine lower bound $f(x) + g^T (y - x)$ given by a subgradient $g$ at $(x, f(x))$]

Second order convexity conditions

Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable. Then $f$ is convex if and only if for all $x \in \operatorname{dom} f$,
$$\nabla^2 f(x) \succeq 0.$$

[Figure: surface plot of a convex function on $\mathbb{R}^2$]

Convex sets and convex functions

Definition. The epigraph $\operatorname{epi} f$ of a function $f$ is the set of points

$$\operatorname{epi} f = \{ (x, t) : f(x) \le t \}.$$

◮ $\operatorname{epi} f$ is convex if and only if $f$ is convex.
◮ Sublevel sets $\{ x : f(x) \le a \}$ are convex for convex $f$.

Examples

◮ Linear/affine functions:

$$f(x) = b^T x + c.$$


◮ Quadratic functions:
$$f(x) = \frac{1}{2} x^T A x + b^T x + c$$
for $A \succeq 0$. For regression:
$$\frac{1}{2}\|Xw - y\|^2 = \frac{1}{2} w^T X^T X w - y^T X w + \frac{1}{2} y^T y.$$

More examples

◮ Norms (like ℓ1 or ℓ2 for regularization):

$$\|\alpha x + (1 - \alpha)y\| \le \|\alpha x\| + \|(1 - \alpha)y\| = \alpha\|x\| + (1 - \alpha)\|y\|.$$


◮ Composition with an affine function $f(Ax + b)$:

$$f(A(\alpha x + (1 - \alpha)y) + b) = f(\alpha(Ax + b) + (1 - \alpha)(Ay + b)) \le \alpha f(Ax + b) + (1 - \alpha) f(Ay + b)$$


◮ Log-sum-exp (via $\nabla^2 f(x)$ PSD):
$$f(x) = \log\left(\sum_{i=1}^n \exp(x_i)\right)$$
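A short numerical check (not from the slides) of the PSD-Hessian claim: a standard computation gives the Hessian of log-sum-exp as $\operatorname{diag}(p) - pp^T$ with $p$ the softmax of $x$, and its eigenvalues are non-negative.

```python
import numpy as np

# Hedged sketch: numerically check that the Hessian of log-sum-exp is PSD.
# For f(x) = log(sum_i exp(x_i)), the Hessian is diag(p) - p p^T with p = softmax(x).
def logsumexp_hessian(x):
    x = np.asarray(x, dtype=float)
    p = np.exp(x - x.max())        # shift for numerical stability
    p /= p.sum()                   # p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.random.default_rng(0).normal(size=5)
H = logsumexp_hessian(x)
print(np.linalg.eigvalsh(H).min())   # smallest eigenvalue is >= 0 (up to roundoff)
```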

Important examples in Machine Learning

◮ SVM loss:
$$f(w) = \sum_i \left[1 - y_i x_i^T w\right]_+$$
◮ Binary logistic loss:
$$f(w) = \sum_i \log\left(1 + \exp(-y_i x_i^T w)\right)$$

[Figure: plots of the hinge loss $[1 - x]_+$ and the logistic loss]

Convex Optimization Problems

Definition. An optimization problem is convex if its objective is a convex function, the inequality constraints $f_i$ are convex, and the equality constraints $h_j$ are affine:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) && \text{(convex function)} \\ \text{s.t.} \quad & f_i(x) \le 0 && \text{(convex sets)} \\ & h_j(x) = 0 && \text{(affine)} \end{aligned}$$

It’s nice to be convex

Theorem. If $\hat{x}$ is a local minimizer of a convex optimization problem, it is a global minimizer.

[Figure: a convex function over its feasible interval, with the global minimizer $x^*$ marked]

Even more reasons to be convex

Theorem. $\nabla f(x) = 0$ if and only if $x$ is a global minimizer of $f(x)$.

Proof.
◮ If $\nabla f(x) = 0$: we have
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) = f(x).$$
◮ If $\nabla f(x) \ne 0$: there is a direction of descent.


LET’S TAKE A BREAK

Lagrange Duality

Goals of Lagrange Duality

◮ Get a certificate of optimality for a problem

◮ Remove constraints

◮ Reformulate problem

Constructing the dual

◮ Start with optimization problem:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \le 0, \quad i = 1, \dots, k \\ & h_j(x) = 0, \quad j = 1, \dots, l \end{aligned}$$

◮ Form the Lagrangian using Lagrange multipliers $\lambda_i \ge 0$, $\nu_j \in \mathbb{R}$:
$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

◮ Form the dual function:
$$g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu) = \inf_x \left\{ f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x) \right\}$$

Remarks

◮ The original problem is equivalent to
$$\underset{x}{\text{minimize}} \left[ \sup_{\lambda \succeq 0,\, \nu} \mathcal{L}(x, \lambda, \nu) \right]$$
◮ The dual problem is obtained by switching the min and the max:
$$\underset{\lambda \succeq 0,\, \nu}{\text{maximize}} \left[ \inf_x \mathcal{L}(x, \lambda, \nu) \right].$$

One Great Property of the Dual

Lemma (Weak Duality). If $\lambda \succeq 0$, then $g(\lambda, \nu) \le f_0(x^*)$.

Proof. We have

$$g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu) \le \mathcal{L}(x^*, \lambda, \nu) = f_0(x^*) + \sum_{i=1}^k \lambda_i f_i(x^*) + \sum_{j=1}^l \nu_j h_j(x^*) \le f_0(x^*).$$

The Greatest Property of the Dual

Theorem. For reasonable¹ convex problems,

$$\sup_{\lambda \succeq 0,\, \nu} g(\lambda, \nu) = f_0(x^*)$$

¹There are conditions called constraint qualification for which this is true.

Geometric Look

Minimize $\frac{1}{2}(x - c - 1)^2$ subject to $x^2 \le c$.

[Figure, left: the true function (blue), the constraint (green), and $\mathcal{L}(x, \lambda)$ for different $\lambda$ (dotted). Right: the dual function $g(\lambda)$ (black) and the primal optimal value (dotted blue).]

Intuition

Can interpret duality as linear approximation.

◮ Let $I_-(a) = \infty$ if $a > 0$ and $0$ otherwise; $I_0(a) = \infty$ unless $a = 0$. Rewrite the problem as

$$\underset{x}{\text{minimize}}\ f_0(x) + \sum_{i=1}^k I_-(f_i(x)) + \sum_{j=1}^l I_0(h_j(x))$$


◮ Replace $I_-(f_i(x))$ with $\lambda_i f_i(x)$: a measure of “displeasure” when $\lambda_i \ge 0$ and $f_i(x) > 0$. Similarly, $\nu_j h_j(x)$ lower bounds $I_0(h_j(x))$:
$$\underset{x}{\text{minimize}}\ f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

Example: Linearly constrained least squares

$$\underset{x}{\text{minimize}}\ \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.}\ Bx = d.$$
Form the Lagrangian:
$$\mathcal{L}(x, \nu) = \frac{1}{2}\|Ax - b\|^2 + \nu^T (Bx - d)$$
Take the infimum:
$$\nabla_x \mathcal{L}(x, \nu) = A^T A x - A^T b + B^T \nu \ \Rightarrow\ x = (A^T A)^{-1}(A^T b - B^T \nu)$$
A simple unconstrained quadratic problem!
$$\inf_x \mathcal{L}(x, \nu) = \frac{1}{2}\left\|A(A^T A)^{-1}(A^T b - B^T \nu) - b\right\|^2 + \nu^T B (A^T A)^{-1}(A^T b - B^T \nu) - d^T \nu$$
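A small numerical sketch (not from the slides) of this derivation: from the stationarity condition $x(\nu) = (A^T A)^{-1}(A^T b - B^T \nu)$, enforcing $Bx = d$ gives a linear system for the multipliers $\nu$, after which the primal solution is recovered; the random data below is an arbitrary assumption.

```python
import numpy as np

# Hedged sketch: recover the solution of  minimize 0.5*||Ax - b||^2  s.t.  Bx = d
# through the dual. Stationarity: x(nu) = (A^T A)^{-1}(A^T b - B^T nu);
# feasibility Bx = d then determines nu.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
B = rng.normal(size=(2, 5))
d = rng.normal(size=2)

AtA_inv = np.linalg.inv(A.T @ A)
# Solve  B (A^T A)^{-1} B^T nu = B (A^T A)^{-1} A^T b - d  for the multipliers.
nu = np.linalg.solve(B @ AtA_inv @ B.T, B @ AtA_inv @ (A.T @ b) - d)
x = AtA_inv @ (A.T @ b - B.T @ nu)

print(np.allclose(B @ x, d))   # the recovered x satisfies the equality constraint
```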

Example: Quadratically constrained least squares

$$\underset{x}{\text{minimize}}\ \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.}\ \frac{1}{2}\|x\|^2 \le c.$$
Form the Lagrangian ($\lambda \ge 0$):
$$\mathcal{L}(x, \lambda) = \frac{1}{2}\|Ax - b\|^2 + \frac{\lambda}{2}\left(\|x\|^2 - 2c\right)$$
Take the infimum:

$$\nabla_x \mathcal{L}(x, \lambda) = A^T A x - A^T b + \lambda I x \ \Rightarrow\ x = (A^T A + \lambda I)^{-1} A^T b$$
$$\inf_x \mathcal{L}(x, \lambda) = \frac{1}{2}\left\|A(A^T A + \lambda I)^{-1} A^T b - b\right\|^2 + \frac{\lambda}{2}\left\|(A^T A + \lambda I)^{-1} A^T b\right\|^2 - \lambda c$$

One variable dual problem!

$$g(\lambda) = -\frac{1}{2} b^T A (A^T A + \lambda I)^{-1} A^T b - \lambda c + \frac{1}{2}\|b\|^2.$$
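To make the one-variable dual concrete, here is a hedged sketch (an illustration, not from the slides) that evaluates $g(\lambda)$ on a grid for a random instance and checks weak duality against a feasible primal point; the data, grid, and scaling are arbitrary assumptions.

```python
import numpy as np

# Hedged sketch: grid-search the one-variable dual g(lambda) for
#   minimize 0.5*||Ax - b||^2  s.t.  0.5*||x||^2 <= c,
# then verify weak duality g(lambda) <= f0(x) for a feasible x.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 8))
x0 = rng.normal(size=8); x0 *= 3.0 / np.linalg.norm(x0)   # make the constraint bind
b = A @ x0 + 0.1 * rng.normal(size=30)
c, n = 0.5, 8

def g(lam):
    M = np.linalg.inv(A.T @ A + lam * np.eye(n))
    return -0.5 * b @ A @ M @ A.T @ b - lam * c + 0.5 * b @ b

lams = np.linspace(0.0, 300.0, 3001)
lam_best = lams[np.argmax([g(l) for l in lams])]

# Minimizer of the Lagrangian at lam_best, rescaled into the feasible set if needed.
x = np.linalg.solve(A.T @ A + lam_best * np.eye(n), A.T @ b)
if 0.5 * x @ x > c:
    x *= np.sqrt(2 * c) / np.linalg.norm(x)

f0 = 0.5 * np.linalg.norm(A @ x - b) ** 2
print(g(lam_best) <= f0 + 1e-9)   # weak duality: the dual value lower-bounds the primal
```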

Uses of the Dual

◮ Main use: certificate of optimality (a.k.a. duality gap). If we have feasible x and know the dual g(λ, ν), then

$$g(\lambda, \nu) \le f_0(x^*) \le f_0(x) \ \Rightarrow\ f_0(x^*) - f_0(x) \ge g(\lambda, \nu) - f_0(x) \ \Rightarrow\ f_0(x) - f_0(x^*) \le f_0(x) - g(\lambda, \nu).$$


◮ Also used in more advanced primal-dual algorithms (we won’t talk about these).

Optimization Algorithms

Gradient Methods

The simplest algorithm in the world (almost). Goal:

$$\underset{x}{\text{minimize}}\ f(x)$$
Just iterate
$$x_{t+1} = x_t - \eta_t \nabla f(x_t)$$
where $\eta_t$ is a stepsize.
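A minimal sketch (not from the slides) of this iteration on a least-squares objective; the fixed stepsize $1/L$, with $L$ the largest eigenvalue of $A^T A$, is one standard choice, and the data is arbitrary.

```python
import numpy as np

# Hedged sketch: plain gradient descent with a fixed stepsize on
#   f(x) = 0.5 * ||Ax - b||^2,  grad f(x) = A^T (Ax - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(10)
eta = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = largest eigenvalue of A^T A
for t in range(500):
    x = x - eta * grad(x)               # x_{t+1} = x_t - eta * grad f(x_t)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star) < 1e-6)   # close to the least-squares solution
```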

Single Step Illustration

[Figure: a single gradient step on $f$, starting from $(x_t, f(x_t))$]

Full Gradient Descent

$$f(x) = \log\left(\exp(x_1 + 3x_2 - 0.1) + \exp(x_1 - 3x_2 - 0.1) + \exp(-x_1 - 0.1)\right)$$

Stepsize Selection

How do I choose a stepsize?
◮ Idea 1: exact line search

$$\eta_t = \arg\min_\eta f(x - \eta \nabla f(x))$$

Too expensive to be practical.


◮ Idea 2: backtracking (Armijo) line search. Let $\alpha \in (0, \frac{1}{2})$, $\beta \in (0, 1)$. Multiply $\eta \leftarrow \beta\eta$ until

$$f(x - \eta \nabla f(x)) \le f(x) - \alpha \eta \|\nabla f(x)\|^2$$
Works well in practice.
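A hedged sketch of this backtracking rule, applied to the log-sum-exp example from the "Full Gradient Descent" slide; the particular constants $\alpha = 0.3$, $\beta = 0.8$ and the starting point are arbitrary choices.

```python
import numpy as np

# Hedged sketch of backtracking (Armijo) line search, assuming f and grad_f are given.
def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8, eta0=1.0):
    """Shrink eta by beta until f(x - eta*g) <= f(x) - alpha*eta*||g||^2, then step."""
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - alpha * eta * (g @ g):
        eta *= beta
    return x - eta * g

# f(x) = log(exp(x1 + 3*x2 - 0.1) + exp(x1 - 3*x2 - 0.1) + exp(-x1 - 0.1))
def f(x):
    return np.log(np.sum(np.exp([x[0] + 3*x[1] - 0.1, x[0] - 3*x[1] - 0.1, -x[0] - 0.1])))

def grad_f(x):
    z = np.exp([x[0] + 3*x[1] - 0.1, x[0] - 3*x[1] - 0.1, -x[0] - 0.1])
    p = z / z.sum()
    return np.array([p[0] + p[1] - p[2], 3*p[0] - 3*p[1]])

x = np.array([-1.0, 1.0])
for _ in range(50):
    x = backtracking_step(f, grad_f, x)
print(f(x))   # decreases toward the minimum of f
```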

Illustration of Armijo/Backtracking Line Search

[Figure: $f(x - \eta \nabla f(x))$ and the line $f(x) - \eta\alpha\|\nabla f(x)\|^2$, plotted as functions of the stepsize $\eta$]

As a function of the stepsize $\eta$, there is clearly a region where $f(x - \eta \nabla f(x))$ lies below the line $f(x) - \alpha\eta\|\nabla f(x)\|^2$.

Newton’s method

Idea: use a second-order approximation to the function:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$
Choose $\Delta x$ to minimize the approximation:
$$\Delta x = -\left(\nabla^2 f(x)\right)^{-1} \nabla f(x)$$
This is a descent direction:
$$\nabla f(x)^T \Delta x = -\nabla f(x)^T \left(\nabla^2 f(x)\right)^{-1} \nabla f(x) < 0.$$
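A hedged sketch of this iteration on the log-sum-exp function used earlier, with a hand-coded gradient and Hessian; a simple backtracking line search is added for robustness, which the slides do not require (the pure Newton step uses $t = 1$).

```python
import numpy as np

# Hedged sketch of Newton's method on f(x) = log(sum_i exp(a_i^T x + b_i)).
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])

def f(x):
    return np.log(np.sum(np.exp(A @ x + b)))

def grad(x):
    z = np.exp(A @ x + b); p = z / z.sum()
    return A.T @ p

def hess(x):
    z = np.exp(A @ x + b); p = z / z.sum()
    return A.T @ (np.diag(p) - np.outer(p, p)) @ A

x = np.array([-1.0, 1.0])
for _ in range(15):
    g, H = grad(x), hess(x)
    dx = -np.linalg.solve(H, g)               # Newton direction: -(hess f)^{-1} grad f
    t = 1.0
    while f(x + t * dx) > f(x) + 0.25 * t * (g @ dx):
        t *= 0.5                              # backtrack until sufficient decrease
    x = x + t * dx
print(f(x), np.linalg.norm(grad(x)))          # gradient norm is ~0 at the minimizer
```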

Newton step picture

[Figure: the Newton step moves from $(x, f(x))$ to $(x + \Delta x, f(x + \Delta x))$, the minimizer of the quadratic model $\hat{f}$; $\hat{f}$ is the 2nd-order approximation and $f$ is the true function.]

Convergence of gradient descent and Newton’s method

◮ Strongly convex case: $\nabla^2 f(x) \succeq mI$ gives “linear convergence”: for some $\gamma \in (0, 1)$,
$$f(x_t) - f(x^*) \le \gamma^t, \quad \text{so } t \ge \log_{1/\gamma}\frac{1}{\varepsilon} \ \Rightarrow\ f(x_t) - f(x^*) \le \varepsilon.$$
◮ Smooth case: $\|\nabla f(x) - \nabla f(y)\| \le C\|x - y\|$, then
$$f(x_t) - f(x^*) \le \frac{K}{t^2}$$
◮ Newton’s method is often faster, especially when $f$ has “long valleys”

What about constraints?

◮ Linear constraints $Ax = b$ are easy. For example, in Newton’s method (assume $Ax = b$):
$$\underset{\Delta x}{\text{minimize}}\ \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.}\ A\Delta x = 0.$$
The solution $\Delta x$ satisfies $A(x + \Delta x) = Ax + A\Delta x = b$.

◮ Inequality constraints are a bit tougher...

$$\underset{\Delta x}{\text{minimize}}\ \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.}\ f_i(x + \Delta x) \le 0$$
is just as hard as the original.

Logarithmic Barrier Methods

Goal:
$$\underset{x}{\text{minimize}}\ f_0(x) \quad \text{s.t.}\ f_i(x) \le 0, \ i = 1, \dots, k$$
Convert to
$$\underset{x}{\text{minimize}}\ f_0(x) + \sum_{i=1}^k I_-(f_i(x))$$
Approximate $I_-(u) \approx -t \log(-u)$ for small $t$:
$$\underset{x}{\text{minimize}}\ f_0(x) - t \sum_{i=1}^k \log(-f_i(x))$$
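A hedged sketch of this idea (an illustration, not the slides' exact recipe): a linear objective over a box, minimizing the barrier function for a decreasing sequence of weights $t$; the inner solver here is gradient descent with backtracking, whereas Newton's method is the more typical choice.

```python
import numpy as np

# Hedged sketch of a log-barrier method: minimize c^T x s.t. Ax <= b (a box here)
# by minimizing c^T x - t*sum(log(b - Ax)) while shrinking the barrier weight t.
c = np.array([1.0, 2.0])
A = np.vstack([np.eye(2), -np.eye(2)])   # encodes  x <= 1  and  -x <= 1
b = np.ones(4)

def phi(x, t):
    s = b - A @ x
    return np.inf if np.any(s <= 0) else c @ x - t * np.sum(np.log(s))

def dphi(x, t):
    return c + t * A.T @ (1.0 / (b - A @ x))

x, t = np.zeros(2), 1.0                  # strictly feasible start
for _ in range(20):                      # outer loop: decrease t
    for _ in range(200):                 # inner loop: minimize the barrier function
        g = dphi(x, t)
        eta = 1.0
        while phi(x - eta * g, t) > phi(x, t) - 0.4 * eta * (g @ g):
            eta *= 0.5                   # backtrack (also keeps the iterate feasible)
        x = x - eta * g
    t *= 0.5

print(x)   # approaches (-1, -1), the minimizer of c^T x over the box
```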

The Logarithmic Barrier

[Figure: $I_-(u)$ (dotted line) and the approximations $-t\log(-u)$ for several values of $t$]

Illustration

Minimizing $c^T x$ subject to $Ax \le b$.

[Figure: barrier solutions for $t = 1$ and $t = 5$ inside the feasible polyhedron, with the cost vector $c$]

Subgradient Descent

Really, the simplest algorithm in the world. Goal:

$$\underset{x}{\text{minimize}}\ f(x)$$
Just iterate
$$x_{t+1} = x_t - \eta_t g_t$$
where $\eta_t$ is a stepsize and $g_t \in \partial f(x_t)$.
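A hedged sketch (not from the slides) of this iteration on a nondifferentiable lasso-style objective; the data, regularization weight, and $\eta_t \propto 1/\sqrt{t}$ stepsize are arbitrary choices, and the best iterate seen so far is tracked as in the convergence analysis later.

```python
import numpy as np

# Hedged sketch: subgradient descent on f(w) = 0.5*||Xw - y||^2 + lam*||w||_1,
# using the subgradient X^T(Xw - y) + lam*sign(w).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
lam = 1.0

def f(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

w = np.zeros(20)
w_best, f_best = w.copy(), f(w)
for t in range(1, 2001):
    g = X.T @ (X @ w - y) + lam * np.sign(w)   # a subgradient of f at w
    w = w - (0.005 / np.sqrt(t)) * g           # decreasing, non-summable stepsize
    if f(w) < f_best:                          # keep the best iterate seen so far
        w_best, f_best = w.copy(), f(w)
print(f_best)
```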

Why subgradient descent?

◮ Lots of non-differentiable convex functions used in machine learning:

$$f(x) = \left[1 - a^T x\right]_+, \qquad f(x) = \|x\|_1, \qquad f(X) = \sum_{r=1}^k \sigma_r(X)$$
where $\sigma_r$ is the $r$th singular value of $X$.
◮ Easy to analyze
◮ Do not even need a true subgradient: just need $\mathbb{E}\, g_t \in \partial f(x_t)$.

Proof of convergence for subgradient descent

Idea: bound $\|x_{t+1} - x^*\|$ using the subgradient inequality. Assume that $\|g_t\| \le G$.
$$\|x_{t+1} - x^*\|^2 = \|x_t - \eta g_t - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta g_t^T (x_t - x^*) + \eta^2 \|g_t\|^2$$

Recall that

$$f(x^*) \ge f(x_t) + g_t^T (x^* - x_t) \ \Rightarrow\ -g_t^T (x_t - x^*) \le f(x^*) - f(x_t)$$
so
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 + 2\eta\left[f(x^*) - f(x_t)\right] + \eta^2 G^2.$$
Then
$$f(x_t) - f(x^*) \le \frac{\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2}{2\eta} + \frac{\eta}{2} G^2.$$

Almost done...

Sum from t = 1 to T :

$$\sum_{t=1}^T \left[f(x_t) - f(x^*)\right] \le \sum_{t=1}^T \frac{1}{2\eta}\left[\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right] + \frac{T\eta}{2} G^2 = \frac{1}{2\eta}\|x_1 - x^*\|^2 - \frac{1}{2\eta}\|x_{T+1} - x^*\|^2 + \frac{T\eta}{2} G^2$$


Now let $D = \|x_1 - x^*\|$, and keep track of the minimum along the run:

$$f(x_{\text{best}}) - f(x^*) \le \frac{1}{2\eta T} D^2 + \frac{\eta}{2} G^2.$$

Set $\eta = \frac{D}{G\sqrt{T}}$ and
$$f(x_{\text{best}}) - f(x^*) \le \frac{DG}{\sqrt{T}}.$$

Extension: projected subgradient descent

Now we have a convex constraint set $X$. Goal:
$$\underset{x \in X}{\text{minimize}}\ f(x)$$
Idea: do subgradient steps, and project $x_t$ back into $X$ at every iteration.

$$x_{t+1} = \Pi_X(x_t - \eta g_t)$$
[Figure: a subgradient step leaving $X$ is projected back onto $X$, moving no farther from $x^*$]
Proof:

$$\|\Pi_X(x_t) - x^*\| \le \|x_t - x^*\| \quad \text{if } x^* \in X.$$
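A hedged sketch (an illustration, not the slides' exact example) of projected subgradient descent with a Euclidean-ball constraint, whose projection is a simple rescaling; the ℓ1-ball constraint used in the example on the next slide needs a slightly more involved, sorting-based projection.

```python
import numpy as np

# Hedged sketch: projected subgradient descent on f(x) = ||Ax - b||_1
# over the Euclidean ball {x : ||x||_2 <= R}.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 10))
b = rng.normal(size=40)
R = 1.0

def project_ball(x, R):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x   # Pi_X is a rescaling for the 2-ball

def f(x):
    return np.sum(np.abs(A @ x - b))

x = np.zeros(10)
x_best, f_best = x.copy(), f(x)
for t in range(1, 1001):
    g = A.T @ np.sign(A @ x - b)              # subgradient of ||Ax - b||_1
    x = project_ball(x - (0.05 / np.sqrt(t)) * g, R)
    if f(x) < f_best:
        x_best, f_best = x.copy(), f(x)
print(f_best)
```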

Projected subgradient example

$$\underset{x}{\text{minimize}}\ \frac{1}{2}\|Ax - b\| \quad \text{s.t.}\ \|x\|_1 \le 1$$


Convergence results for (projected) subgradient methods

◮ Any decreasing, non-summable stepsize ($\eta_t \to 0$, $\sum_{t=1}^\infty \eta_t = \infty$) gives

$$f\left(x_{\text{avg}(t)}\right) - f(x^*) \to 0.$$


◮ A slightly less brain-dead analysis than the earlier one shows that with $\eta_t \propto 1/\sqrt{t}$,
$$f\left(x_{\text{avg}(t)}\right) - f(x^*) \le \frac{C}{\sqrt{t}}$$


◮ Same convergence when $g_t$ is random, i.e. $\mathbb{E}\, g_t \in \partial f(x_t)$. Example:
$$f(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \left[1 - y_i x_i^T w\right]_+$$
Just pick a random training example.
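A hedged sketch (not from the slides) of this idea: stochastic subgradient descent on the SVM objective above, where each step uses one randomly chosen training example, scaled so its expectation is a subgradient of the full objective; the synthetic data and $1/t$ stepsize are arbitrary choices.

```python
import numpy as np

# Hedged sketch: stochastic subgradient descent on
#   f(w) = 0.5*||w||^2 + C*sum_i [1 - y_i x_i^T w]_+,
# using one random example per step (an unbiased subgradient of f).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))
C = 1.0

w = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(n)                         # pick a random training example
    margin = y[i] * (X[i] @ w)
    g = w + (0.0 if margin >= 1 else -C * n * y[i] * X[i])  # E[g] in subdifferential of f
    w = w - (1.0 / t) * g                       # decreasing stepsize

print(np.mean(np.sign(X @ w) == y))             # training accuracy of the linear classifier
```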

Recap

◮ Defined convex sets and functions

◮ Saw why we want optimization problems to be convex (solvable)

◮ Sketched some of Lagrange duality

◮ First order methods are easy and (often) work well

Take Home Messages

◮ Many useful problems can be formulated as convex optimization problems

◮ If it is not convex and not an eigenvalue problem, you are out of luck

◮ If it is convex, you are golden
