Introduction to Convex Optimization for Machine Learning
John Duchi
University of California, Berkeley
Practical Machine Learning, Fall 2009
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 1/53 Outline
What is Optimization
Convex Sets
Convex Functions
Convex Optimization Problems
Lagrange Duality
Optimization Algorithms
Take Home Messages
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 2/53 What is Optimization What is Optimization (and why do we care?)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 3/53 What is Optimization What is Optimization?
◮ Finding the minimizer of a function subject to constraints:
minimize f0(x) x
s.t. fi(x) 0, i = 1, . . . , k ≤ { } hj(x) = 0, j = 1,...,l { }
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 4/53 What is Optimization What is Optimization?
◮ Finding the minimizer of a function subject to constraints:
minimize f0(x) x
s.t. fi(x) 0, i = 1, . . . , k ≤ { } hj(x) = 0, j = 1,...,l { } ◮ Example: Stock market. “Minimize variance of return subject to getting at least $50.”
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 4/53 What is Optimization Why do we care?
Optimization is at the heart of many (most practical?) machine learning algorithms. ◮ Linear regression: minimize Xw y 2 w k − k ◮ Classification (logistic regresion or SVM):
n T minimize log 1 + exp( yix w) w − i i=1 X n 2 T or w + C ξi s.t. ξi 1 yix w, ξi 0. k k ≥ − i ≥ i X=1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 5/53 What is Optimization We still care...
◮ Maximum likelihood estimation: n maximize log pθ(xi) θ i X=1 ◮ Collaborative filtering:
T T minimize log 1 + exp(w xi w xj) w − i j X≺ ◮ k-means: k 2 minimize J(µ)= xi µj µ1,...,µk k − k j=1 i Cj X X∈ ◮ And more (graphical models, feature selection, active learning, control)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 6/53 What is Optimization But generally speaking...
We’re screwed. ◮ Local (non global) minima of f0 ◮ All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
250
200
150
100
50
0
−50 3 2 3 1 2 0 1 −1 0 −1 −2 −2 −3 −3
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 7/53 What is Optimization But generally speaking...
We’re screwed. ◮ Local (non global) minima of f0 ◮ All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
250
200
150
100
50
0
−50 3 2 3 1 2 0 1 −1 0 −1 −2 −2 −3 −3 ◮ Go for convex problems! Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 7/53 Convex Sets Convex Sets
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 8/53 Convex Sets
Definition A set C Rn is convex if for x, y C and any α [0, 1], ⊆ ∈ ∈ αx + (1 α)y C. − ∈
y x
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 9/53 Convex Sets Examples
◮ All of Rn (obvious)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 10/53 Convex Sets Examples
◮ All of Rn (obvious)
◮ Non-negative orthant, Rn : let x 0, y 0, clearly + αx + (1 α)y 0. −
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 10/53 Convex Sets Examples
◮ All of Rn (obvious)
◮ Non-negative orthant, Rn : let x 0, y 0, clearly + αx + (1 α)y 0. −
◮ Norm balls: let x 1, y 1, then k k≤ k k≤ αx + (1 α)y αx + (1 α)y = α x + (1 α) y 1. k − k ≤ k k k − k k k − k k≤
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 10/53 Convex Sets Examples
◮ Affine subspaces: Ax = b, Ay = b, then A(αx + (1 α)y)= αAx + (1 α)Ay = αb + (1 α)b = b. − − −
1
0.8
0.6
0.4 3
x 0.2
0
−0.2
−0.4 1
0.8 1 0.6 0.8 0.4 0.6 0.4 0.2 0.2 0 0 x2 x1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 11/53 Convex Sets More examples
◮ Arbitrary intersections of convex sets: let Ci be convex for i , ∈ I C = i Ci, then
x C, y C αx + (1 α)y Ci i T ∈ ∈ ⇒ − ∈ ∀ ∈ I so αx + (1 α)y C. − ∈
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 12/53 Convex Sets More examples
◮ PSD Matrices, a.k.a. the positive semidefinite cone Sn Rn n Sn + × . A + means T ⊂ ∈ Rn x Ax 0 for all x . For 1 ≥S+ ∈ A, B , 0.8 ∈ n 0.6
T z x (αA + (1 α)B) x 0.4 − = αxT Ax + (1 α)xT Bx 0. 0.2 0 − ≥ 1
0.5 1 0.8 ◮ 0 On right: 0.6 −0.5 0.4 0.2 y −1 0 x x z S2 = 0 = x,y,z : x 0, y 0, xy z2 + z y ≥ ≥ ≥
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 13/53 Convex Functions Convex Functions
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 14/53 Convex Functions
Definition A function f : Rn R is convex if for x, y dom f and any α [0, 1], → ∈ ∈ f(αx + (1 α)y) αf(x) + (1 α)f(y). − ≤ −
f(y) αf(x) + (1 - α)f(y)
f(x)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 15/53 Convex Functions First order convexity conditions
Theorem Suppose f : Rn R is differentiable. Then f is convex if and only if for → all x, y dom f ∈ f(y) f(x)+ f(x)T (y x) ≥ ∇ −
f(y)
f(x) + f(x)T (y - x) ∇
(x, f(x))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 16/53 Convex Functions Actually, more general than that
Definition The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x)= g : f(y) f(x)+ gT (y x) for all y . ≥ −
f(y)
Theorem f : Rn R is convex if and only if it→ has non-empty subdifferential set everywhere. (x, f(x))
f(x) + gT (y - x)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 17/53 Convex Functions Second order convexity conditions
Theorem Suppose f : Rn R is twice differentiable. Then f is convex if and only if → for all x dom f, ∈ 2f(x) 0. ∇
10
8
6
4
2
0 2
1 2 1 0 0 −1 −1 −2 −2
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 18/53 Convex Functions Convex sets and convex functions
Definition The epigraph of a function f is the epi f set of points
epi f = (x,t): f(x) t . { ≤ }
◮ epi f is convex if and only if f is convex. a ◮ Sublevel sets, x : f(x) a { ≤ } are convex for convex f.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 19/53 Convex Functions Examples Examples
◮ Linear/affine functions:
f(x)= bT x + c.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 20/53 Convex Functions Examples Examples
◮ Linear/affine functions:
f(x)= bT x + c.
◮ Quadratic functions: 1 f(x)= xT Ax + bT x + c 2 for A 0. For regression: 1 1 1 Xw y 2 = wT XT Xw yT Xw + yT y. 2 k − k 2 − 2
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 20/53 Convex Functions Examples More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
αx + (1 α)y αx + (1 α)y = α x + (1 α) y . k − k ≤ k k k − k k k − k k
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 21/53 Convex Functions Examples More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
αx + (1 α)y αx + (1 α)y = α x + (1 α) y . k − k ≤ k k k − k k k − k k ◮ Composition with an affine function f(Ax + b):
f(A(αx + (1 α)y)+ b)= f(α(Ax + b) + (1 α)(Ay + b)) − − αf(Ax + b) + (1 α)f(Ay + b) ≤ −
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 21/53 Convex Functions Examples More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
αx + (1 α)y αx + (1 α)y = α x + (1 α) y . k − k ≤ k k k − k k k − k k ◮ Composition with an affine function f(Ax + b):
f(A(αx + (1 α)y)+ b)= f(α(Ax + b) + (1 α)(Ay + b)) − − αf(Ax + b) + (1 α)f(Ay + b) ≤ − ◮ Log-sum-exp (via 2f(x) PSD): ∇ n f(x) = log exp(xi) i ! X=1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall2009 21/53 Convex Functions Examples Important examples in Machine Learning
3
◮ SVM loss:
[1 - x]+ T f(w)= 1 yix w − i + ◮ Binary logistic loss:
x T log(1 + e ) f(w) = log 1 + exp( yix w) − i