Dani Yogatama
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
February 12, 2014
Dani Yogatama (Carnegie Mellon University), Convex Optimization, February 12, 2014

Key Concepts in Convex Analysis: Convex Sets
Key Concepts in Convex Analysis: Convex Functions
Key Concepts in Convex Analysis: Minimizers
Key Concepts in Convex Analysis: Strong Convexity

Recall the definition of a convex function: for all $\lambda \in [0, 1]$,
$$f(\lambda x + (1 - \lambda)x') \le \lambda f(x) + (1 - \lambda) f(x')$$

A $\beta$-strongly convex function satisfies a stronger condition: for all $\lambda \in [0, 1]$,
$$f(\lambda x + (1 - \lambda)x') \le \lambda f(x) + (1 - \lambda) f(x') - \frac{\beta}{2} \lambda (1 - \lambda) \|x - x'\|_2^2$$

Strong convexity implies strict convexity, but strict convexity does not imply strong convexity.
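As a quick numerical sanity check (a sketch using the hypothetical example $f(x) = x^2$, which is $2$-strongly convex), the inequality can be verified on a grid of $\lambda$ values:

```python
# Numerically check the beta-strong-convexity inequality for f(x) = x**2,
# which is beta-strongly convex with beta = 2 (illustrative example).
def f(x):
    return x ** 2

def strong_convexity_holds(f, x, x2, beta, num=50):
    """Check f(lam*x + (1-lam)*x2) <= lam*f(x) + (1-lam)*f(x2)
    - (beta/2)*lam*(1-lam)*(x - x2)**2 on a grid of lambda values."""
    for k in range(num + 1):
        lam = k / num
        lhs = f(lam * x + (1 - lam) * x2)
        rhs = (lam * f(x) + (1 - lam) * f(x2)
               - 0.5 * beta * lam * (1 - lam) * (x - x2) ** 2)
        if lhs > rhs + 1e-12:  # small tolerance for float error
            return False
    return True

print(strong_convexity_holds(f, -3.0, 2.0, beta=2.0))  # True
print(strong_convexity_holds(f, -3.0, 2.0, beta=5.0))  # False: x**2 is not 5-strongly convex
```

For $f(x) = x^2$ and $\beta = 2$ the inequality holds with equality, which is why the check passes exactly; any larger $\beta$ fails at interior $\lambda$.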
Key Concepts in Convex Analysis: Subgradients

Convexity implies continuity, but not differentiability (e.g., $f(w) = \|w\|_1$).
Subgradients generalize gradients to (possibly non-differentiable) convex functions:

$v$ is a subgradient of $f$ at $x$ if
$$f(x') \ge f(x) + v^\top (x' - x)$$
(a linear lower bound on $f$).

Subdifferential: $\partial f(x) = \{v : v \text{ is a subgradient of } f \text{ at } x\}$
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

Notation: $\tilde{\nabla} f(x)$ denotes a subgradient of $f$ at $x$.
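To make the subgradient inequality concrete, here is a small sketch for the example above, $f(w) = \|w\|_1$, using the coordinate-wise sign as a subgradient (any value in $[-1, 1]$ works at coordinates where $w_i = 0$):

```python
# A subgradient of f(w) = ||w||_1: take sign(w_i) coordinate-wise
# (any value in [-1, 1] is valid where w_i == 0; we use 0 there).
def l1(w):
    return sum(abs(wi) for wi in w)

def l1_subgradient(w):
    return [1.0 if wi > 0 else (-1.0 if wi < 0 else 0.0) for wi in w]

def lower_bound_holds(w, w2):
    """Check the subgradient inequality f(w') >= f(w) + v^T (w' - w)."""
    v = l1_subgradient(w)
    bound = l1(w) + sum(vi * (a - b) for vi, a, b in zip(v, w2, w))
    return l1(w2) >= bound - 1e-12

print(lower_bound_holds([1.0, -2.0, 0.0], [3.0, 0.5, -4.0]))  # True
```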
Establishing convexity

How to check whether $f(x)$ is a convex function:
- Verify the definition of a convex function directly.
- Check whether $\partial^2 f(x) / \partial x^2 \ge 0$ (for a twice-differentiable function; in the multivariate case, whether the Hessian is positive semidefinite).
- Show that it is constructed from simple convex functions with operations that preserve convexity:
  - nonnegative weighted sum
  - composition with an affine function
  - pointwise maximum and supremum
  - composition
  - minimization
  - perspective

Reference: Boyd and Vandenberghe (2004)
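The second-derivative test can be sketched numerically in one dimension (an illustrative check only, using a finite-difference estimate on a grid; for multivariate $f$ one would check the Hessian instead):

```python
import math

# Estimate f''(x) with a central difference and check it is nonnegative
# on a grid of points (one-dimensional, illustrative sketch).
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def looks_convex(f, lo, hi, num=100):
    # Small negative tolerance absorbs floating-point noise in the estimate.
    return all(second_derivative(f, lo + (hi - lo) * k / num) >= -1e-4
               for k in range(num + 1))

print(looks_convex(math.exp, -2.0, 2.0))  # True: exp is convex
print(looks_convex(math.sin, -2.0, 2.0))  # False: sin is not convex here
```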
Unconstrained Optimization

Algorithms:
- First-order methods (gradient descent, FISTA, etc.)
- Higher-order methods (Newton's method, ellipsoid, etc.)
- ...
Gradient descent

Problem: $\min_x f(x)$

Algorithm:
- $g_t = \partial f(x_{t-1}) / \partial x$
- $x_t = x_{t-1} - \eta g_t$
- Repeat until convergence.
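The algorithm above can be sketched in a few lines. The objective $f(x) = (x - 3)^2$ and the step size $\eta$ below are illustrative choices, not part of the slides:

```python
# Minimal gradient-descent sketch for min_x f(x).
def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        g = grad(x)               # g_t = gradient of f at x_{t-1}
        x_new = x - eta * g       # x_t = x_{t-1} - eta * g_t
        if abs(x_new - x) < tol:  # repeat until convergence
            return x_new
        x = x_new
    return x

# f(x) = (x - 3)**2 has gradient 2 * (x - 3) and minimizer x* = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_star, 4))  # 3.0
```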
Newton's method

Problem: $\min_x f(x)$. Assume $f$ is twice differentiable.

Algorithm:
- $g_t = \partial f(x_{t-1}) / \partial x$
- $s_t = H^{-1} g_t$, where $H$ is the Hessian.
- $x_t = x_{t-1} - \eta s_t$
- Repeat until convergence.

Newton's method is a special case of steepest descent using the Hessian norm.
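A one-dimensional sketch of the same steps, where $H^{-1} g_t$ reduces to $g_t / f''(x)$. The objective $f(x) = x - \log x$ (minimized at $x = 1$ for $x > 0$) is an illustrative choice:

```python
# One-dimensional Newton's method sketch: s_t = g_t / H, the scalar
# analogue of H^{-1} g_t.
def newton(grad, hess, x0, eta=1.0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        s = grad(x) / hess(x)     # Newton direction H^{-1} g_t
        x_new = x - eta * s
        if abs(x_new - x) < tol:  # repeat until convergence
            return x_new
        x = x_new
    return x

# f(x) = x - log(x):  f'(x) = 1 - 1/x,  f''(x) = 1/x**2, minimizer x* = 1.
x_star = newton(grad=lambda x: 1 - 1 / x,
                hess=lambda x: 1 / x ** 2,
                x0=0.5)
print(round(x_star, 6))  # 1.0
```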
Duality

Primal problem:
$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \ldots, m; \quad h_i(x) = 0, \; i = 1, \ldots, p$$
for $x \in X$.

Lagrangian:
$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)$$

$\lambda_i$ and $\nu_i$ are Lagrange multipliers. Suppose $x$ is feasible and $\lambda \ge 0$; then we get the lower bound:
$$f(x) \ge L(x, \lambda, \nu) \quad \forall x \in X, \; \lambda \in \mathbb{R}_+^m$$
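The lower bound is easy to verify on a toy instance (a sketch with the made-up problem $\min_x x^2$ subject to $g(x) = 1 - x \le 0$):

```python
# Verify f(x) >= L(x, lambda) for feasible x and lambda >= 0
# on the toy problem min x**2 subject to g(x) = 1 - x <= 0.
def f(x):
    return x * x

def g(x):
    return 1.0 - x  # feasible iff g(x) <= 0, i.e., x >= 1

def lagrangian(x, lam):
    # L(x, lambda) = f(x) + lambda * g(x); no equality constraints here.
    return f(x) + lam * g(x)

ok = all(f(x) >= lagrangian(x, lam)
         for x in [1.0, 1.5, 2.0, 5.0]      # feasible points
         for lam in [0.0, 0.5, 1.0, 10.0])  # nonnegative multipliers
print(ok)  # True
```

The bound holds because each term $\lambda_i g_i(x)$ is nonpositive at a feasible point.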
Duality

Primal optimal: $p^* = \min_x \max_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$

Lagrange dual function: $\min_x L(x, \lambda, \nu)$
This is a concave function of $(\lambda, \nu)$, regardless of whether $f(x)$ is convex. It can be $-\infty$ for some $\lambda$ and $\nu$.

Lagrange dual problem: $\max_{\lambda, \nu} \min_x L(x, \lambda, \nu)$ subject to $\lambda \ge 0$
Dual feasible: $\lambda \ge 0$ and $(\lambda, \nu)$ in the domain of the dual function.
Dual optimal: $d^* = \max_{\lambda \ge 0, \nu} \min_x L(x, \lambda, \nu)$
Duality

Weak duality: $p^* \ge d^*$ always holds, for both convex and nonconvex problems.

Strong duality: $p^* = d^*$ does not hold in general, but usually holds for convex problems. Strong duality is guaranteed by Slater's constraint qualification: it holds if the problem is strictly feasible, i.e.,
$$\exists x \in \operatorname{int} D \;\text{ s.t. }\; g_i(x) < 0, \; i = 1, \ldots, m, \quad h_i(x) = 0, \; i = 1, \ldots, p$$

Assume strong duality holds and $p^*$ and $d^*$ are attained:
$$p^* = f(x^*) = d^* = \min_x L(x, \lambda^*, \nu^*) \le L(x^*, \lambda^*, \nu^*) \le f(x^*) = p^*$$

We have:
- $x^* \in \arg\min_x L(x, \lambda^*, \nu^*)$
- $\lambda_i^* g_i(x^*) = 0$ for $i = 1, \ldots, m$ (complementary slackness).
Karush-Kuhn-Tucker conditions

For differentiable $g(x)$ and $h(x)$, the KKT conditions are:
$$g_i(x^*) \le 0, \; h_i(x^*) = 0 \quad \text{(primal feasibility)}$$
$$\lambda_i^* \ge 0 \quad \text{(dual feasibility)}$$
$$\lambda_i^* g_i(x^*) = 0 \quad \text{(complementary slackness)}$$
$$\left. \frac{\partial L(x, \lambda^*, \nu^*)}{\partial x} \right|_{x = x^*} = 0 \quad \text{(Lagrangian stationarity)}$$

If $\hat{x}, \hat{\lambda}, \hat{\nu}$ satisfy the KKT conditions for a convex problem, they are optimal.
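The four conditions can be checked mechanically on a toy convex problem (a sketch with the made-up instance $\min_x x^2$ subject to $1 - x \le 0$, whose optimum is $x^* = 1$ with multiplier $\lambda^* = 2$ from stationarity $2x - \lambda = 0$):

```python
# Check the KKT conditions at the claimed solution of
# min x**2 subject to g(x) = 1 - x <= 0.
x_star, lam_star = 1.0, 2.0

def g(x):
    return 1.0 - x

def dL_dx(x, lam):
    # L(x, lam) = x**2 + lam * (1 - x), so dL/dx = 2x - lam
    return 2 * x - lam

checks = {
    "primal feasibility": g(x_star) <= 0,
    "dual feasibility": lam_star >= 0,
    "complementary slackness": abs(lam_star * g(x_star)) < 1e-12,
    "Lagrangian stationarity": abs(dL_dx(x_star, lam_star)) < 1e-12,
}
print(all(checks.values()))  # True
```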
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 14 / 26 Lagrangian: n 1 X L(w, λ) = kwk2 − λ (y hx , wi − 1) 2 2 i i i i=1
Minimizing with respect to w, we have:
∂L(w,λ) ∂w = 0 Pn w − i=1 λi yi xi = 0 n X w = λi yi xi i=1
Support Vector Machines Primal problem (hard constraint): 1 min kwk2 w 2 2 subject to yi hxi , wi ≥ 1, i = 1,..., n
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 15 / 26 Minimizing with respect to w, we have:
∂L(w,λ) ∂w = 0 Pn w − i=1 λi yi xi = 0 n X w = λi yi xi i=1
Support Vector Machines Primal problem (hard constraint): 1 min kwk2 w 2 2 subject to yi hxi , wi ≥ 1, i = 1,..., n
Lagrangian: n 1 X L(w, λ) = kwk2 − λ (y hx , wi − 1) 2 2 i i i i=1
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 15 / 26 Support Vector Machines Primal problem (hard constraint): 1 min kwk2 w 2 2 subject to yi hxi , wi ≥ 1, i = 1,..., n
Lagrangian: n 1 X L(w, λ) = kwk2 − λ (y hx , wi − 1) 2 2 i i i i=1
Minimizing with respect to w, we have:
∂L(w,λ) ∂w = 0 Pn w − i=1 λi yi xi = 0 n X w = λi yi xi i=1
Support Vector Machines

Plug this back into the Lagrangian:
$$L(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j$$

The Lagrange dual problem is:
$$\max_\lambda \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j$$
$$\text{subject to} \quad \lambda_i \ge 0, \; i = 1, \ldots, n, \quad \sum_{i=1}^{n} \lambda_i y_i = 0$$

Since this problem depends on the data only through the inner products $x_i^\top x_j$, we can use kernels and learn in a high-dimensional feature space without having to explicitly represent $\phi(x)$.
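A sketch of that observation: the dual objective below only ever calls a kernel function, so swapping $x_i^\top x_j$ for, say, an RBF kernel requires no other change. The data points, labels, and multiplier values are made up for illustration (the multipliers are not a solution of the dual problem):

```python
import math

# The dual objective touches the data only through kernel(x_i, x_j).
def dual_objective(lams, ys, xs, kernel):
    n = len(xs)
    quad = sum(ys[i] * ys[j] * lams[i] * lams[j] * kernel(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(lams) - 0.5 * quad

def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

xs = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
ys = [1.0, 1.0, -1.0]
lams = [0.3, 0.2, 0.5]
print(round(dual_objective(lams, ys, xs, linear_kernel), 4))  # 0.435
print(round(dual_objective(lams, ys, xs, rbf_kernel), 4))     # same code, implicit feature space
```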
Support Vector Machines

Primal problem (soft constraint):
$$\min_w \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \quad y_i \langle x_i, w \rangle \ge 1 - \xi_i, \; i = 1, \ldots, n, \quad \xi_i \ge 0, \; i = 1, \ldots, n$$
Support Vector Machines

Lagrange dual problem for the soft constraint:
$$\max_\lambda \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j \quad (1)$$
$$\text{subject to} \quad 0 \le \lambda_i \le C, \; i = 1, \ldots, n \quad (2)$$
$$\sum_{i=1}^{n} \lambda_i y_i = 0 \quad (3)$$

KKT conditions, for all $i$:
$$\lambda_i = 0 \;\Rightarrow\; y_i \langle x_i, w \rangle \ge 1 \quad (4)$$
$$0 < \lambda_i < C \;\Rightarrow\; y_i \langle x_i, w \rangle = 1 \quad (5)$$
$$\lambda_i = C \;\Rightarrow\; y_i \langle x_i, w \rangle \le 1 \quad (6)$$
Sequential Minimal Optimization (Platt, 1998)

An efficient way to solve the SVM dual problem: break the large QP into a series of the smallest possible QP subproblems, and solve these small subproblems analytically.

In a nutshell:
- Choose two Lagrange multipliers $\lambda_i$ and $\lambda_j$.
- Optimize the dual problem with respect to these two Lagrange multipliers, holding the others fixed.
- Repeat until convergence.

There are heuristics for choosing Lagrange multipliers that maximize the step size towards the global maximum: the first is chosen from examples that violate the KKT conditions; the second is chosen using an approximation based on the absolute difference in error values (see Platt (1998)).
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 19 / 26 Sequential Minimal Optimization (Platt, 1998)
Sequential Minimal Optimization (Platt, 1998)

For any two Lagrange multipliers, the constraints are:
$$0 \le \lambda_i, \lambda_j \le C \quad (7)$$
$$y_i \lambda_i + y_j \lambda_j = -\sum_{k \ne i,j} y_k \lambda_k = \gamma \quad (8)$$

Express $\lambda_i$ in terms of $\lambda_j$:
$$\lambda_i = \frac{\gamma - \lambda_j y_j}{y_i}$$

Plug this back into the original objective; we are then left with a very simple quadratic problem in the single variable $\lambda_j$.
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 21 / 26 If yi 6= yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj − λi ) (9) t−1 y−1 H = min(C, C + λj − λi ) (10)
If yi = yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj + λi − C) (11) t−1 y−1 H = min(C, λj + λi ) (12)
The solution is: H ifλj > H λj = λj ifL ≤ λj ≤ H L ifλj < L
Sequential Minimal Optimization (Platt, 1998) Solve for the second Lagrange multiplier λj .
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26 The solution is: H ifλj > H λj = λj ifL ≤ λj ≤ H L ifλj < L
Sequential Minimal Optimization (Platt, 1998) Solve for the second Lagrange multiplier λj .
If yi 6= yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj − λi ) (9) t−1 y−1 H = min(C, C + λj − λi ) (10)
If yi = yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj + λi − C) (11) t−1 y−1 H = min(C, λj + λi ) (12)
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26 Sequential Minimal Optimization (Platt, 1998) Solve for the second Lagrange multiplier λj .
If yi 6= yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj − λi ) (9) t−1 y−1 H = min(C, C + λj − λi ) (10)
If yi = yj , the following bounds apply to λj :
t−1 y−1 L = max(0, λj + λi − C) (11) t−1 y−1 H = min(C, λj + λi ) (12)
The solution is: H ifλj > H λj = λj ifL ≤ λj ≤ H L ifλj < L
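The bounds (9)-(12) and the clipping step translate directly into code (a sketch; the multiplier values below are made up):

```python
# Box bounds for the second multiplier lambda_j, per equations (9)-(12).
def smo_bounds(lam_i, lam_j, y_i, y_j, C):
    if y_i != y_j:
        L = max(0.0, lam_j - lam_i)      # (9)
        H = min(C, C + lam_j - lam_i)    # (10)
    else:
        L = max(0.0, lam_j + lam_i - C)  # (11)
        H = min(C, lam_j + lam_i)        # (12)
    return L, H

def clip(lam_j_new, L, H):
    """Clip the unconstrained solution back to the feasible interval [L, H]."""
    return min(max(lam_j_new, L), H)

L, H = smo_bounds(lam_i=0.25, lam_j=0.75, y_i=1, y_j=-1, C=1.0)
print(L, H)             # 0.5 1.0
print(clip(1.3, L, H))  # 1.0
```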
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26 For a generic problem min f (x) x subject to Ax ≤ b Cx = d
The dual function is: −f ∗(−A>λ − C >ν) − b>λ − d>ν There are many functions whose conjugate are easy to compute: Exponential Logistic Quadratic form Log determinant ...
Fenchel duality If a convex conjugate of f (x) is known, the dual function can be easily derived. The convex conjugate of a function f is: f ∗(y) = maxhy, xi − f (x) (13) x
Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 23 / 26 There are many functions whose conjugate are easy to compute: Exponential Logistic Quadratic form Log determinant ...
Fenchel duality If a convex conjugate of f (x) is known, the dual function can be easily derived. The convex conjugate of a function f is: f ∗(y) = maxhy, xi − f (x) (13) x For a generic problem min f (x) x subject to Ax ≤ b Cx = d
The dual function is: −f ∗(−A>λ − C >ν) − b>λ − d>ν
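Definition (13) can be checked numerically for a case with a known closed form. This sketch uses the quadratic $f(x) = \frac{1}{2} a x^2$, whose conjugate is $f^*(y) = y^2 / (2a)$, and a crude grid search over $x$ (illustrative only):

```python
# Compute f*(y) = max_x <y, x> - f(x) by grid search and compare with
# the closed form for f(x) = 0.5 * a * x**2.
def conjugate_numeric(f, y, lo=-100.0, hi=100.0, num=200001):
    best = float("-inf")
    for k in range(num):
        x = lo + (hi - lo) * k / (num - 1)
        best = max(best, y * x - f(x))
    return best

a = 2.0
f = lambda x: 0.5 * a * x * x
y = 3.0
print(round(conjugate_numeric(f, y), 3))  # 2.25
print(y * y / (2 * a))                    # 2.25 (closed form y**2 / (2a))
```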
Parting notes
The dual formulation is useful:
- it gives new insights into our problem,
- it allows us to develop better optimization methods and use kernel tricks.
Thank you!
Questions?
References
Some slides are from an upcoming EACL 2014 tutorial with Andre F. T. Martins, Noah A. Smith, and Mario F. T. Figueiredo.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In Scholkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.