Duality in machine learning
Mathieu Blondel
November 26, 2020

Outline
1 Conjugate functions
2 Smoothing techniques
3 Fenchel duality
4 Block coordinate ascent
5 Conclusion
Closed functions
The domain of a function is denoted $\mathrm{dom}(f) = \{x \in \mathbb{R}^d : f(x) < \infty\}$.

A function is closed if for all $\alpha \in \mathbb{R}$ the sub-level set $\{x \in \mathrm{dom}(f) : f(x) \le \alpha\}$ is closed (reminder: a set is closed if it contains its boundary).

If $f$ is continuous and $\mathrm{dom}(f)$ is closed, then $f$ is closed.

Example 1: $f(x) = x \log x$ is not closed over $\mathrm{dom}(f) = \mathbb{R}_{>0}$.

Example 2: $f(x) = x \log x$ is closed over $\mathrm{dom}(f) = \mathbb{R}_{\ge 0}$, with $f(0) = 0$.

Example 3: the indicator function $I_C$ is closed if $C$ is closed:
$$I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C. \end{cases}$$
Convex conjugate

Fix a slope $y$. What is the intercept $b$ of the tightest linear lower bound of $f$? In other words, for all $x \in \mathrm{dom}(f)$, we want
$$\langle x, y\rangle - b \le f(x) \;\Leftrightarrow\; \langle x, y\rangle - f(x) \le b \;\Leftrightarrow\; b = \sup_{x \in \mathrm{dom}(f)} \langle x, y\rangle - f(x).$$
The value of the intercept is denoted $f^*(y)$, the conjugate of $f$.
[Figure: a convex function $f(x)$ and its tightest linear lower bound with slope $y$; the intercept is $-f^*(y)$.]

Convex conjugate
Equivalent definition
$$-f^*(y) = \inf_{x \in \mathrm{dom}(f)} f(x) - \langle x, y\rangle$$

$f^*$ can take values on the extended real line $\mathbb{R} \cup \{\infty\}$. $f^*$ is closed and convex (even when $f$ is not).
Fenchel-Young inequality: for all x, y
$$f(x) + f^*(y) \ge \langle x, y\rangle$$
Convex conjugate examples
Example 1: $f(x) = I_C(x)$, the indicator function of $C$:
$$f^*(y) = \sup_{x \in \mathrm{dom}(f)} \langle x, y\rangle - f(x) = \sup_{x \in C} \langle x, y\rangle$$
$f^*$ is called the support function of $C$.

Example 2: $f(x) = \langle x, \log x\rangle$, then
$$f^*(y) = \sum_{i=1}^d e^{y_i - 1}$$

Example 3: $f(x) = \langle x, \log x\rangle + I_{\triangle^d}(x)$, then
$$f^*(y) = \log \sum_{i=1}^d \exp(y_i)$$
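The closed-form conjugates above can be sanity-checked numerically by maximizing $\langle x, y\rangle - f(x)$ over a dense grid. A small sketch (function and variable names are ours, not from the slides), for Example 2 in one dimension:

```python
import numpy as np

def conjugate_1d(f, y, xs):
    """Approximate f*(y) = sup_x x*y - f(x) by maximizing over the grid xs."""
    return np.max(xs * y - f(xs))

# f(x) = x log x over R_{>0}; its conjugate is f*(y) = e^(y-1).
f = lambda x: x * np.log(x)
xs = np.linspace(1e-9, 10.0, 200_000)

for y in [-1.0, 0.0, 0.5, 1.0]:
    approx = conjugate_1d(f, y, xs)
    exact = np.exp(y - 1.0)
    assert abs(approx - exact) < 1e-3
```

The maximizer $x^\star = e^{y-1}$ lies well inside the grid for these slopes, so the grid maximum matches the closed form to the grid resolution.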
Convex conjugate calculus
Separable sum
$$f(x) = \sum_{i=1}^d f_i(x_i) \quad\Rightarrow\quad f^*(y) = \sum_{i=1}^d f_i^*(y_i)$$

Scalar multiplication ($c > 0$)

$$f(x) = c \cdot g(x) \quad\Rightarrow\quad f^*(y) = c \cdot g^*(y/c)$$

Addition of affine function / translation of argument

$$f(x) = g(x) + \langle a, x\rangle + b \quad\Rightarrow\quad f^*(y) = g^*(y - a) - b$$

Composition with invertible linear mapping

$$f(x) = g(Ax) \quad\Rightarrow\quad f^*(y) = g^*(A^{-\top} y)$$
Biconjugates
The bi-conjugate
$$f^{**}(x) = \sup_{y \in \mathrm{dom}(f^*)} \langle x, y\rangle - f^*(y)$$
$f^{**}$ is closed and convex.

If $f$ is closed and convex, then $f^{**}(x) = f(x)$.

If $f$ is not convex, $f^{**}$ is the tightest convex lower bound of $f$.
Subgradients
Recall that a differentiable convex function always lies above its tangents, i.e., for all $x, y \in \mathrm{dom}(f)$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle$$

$g$ is a subgradient of a convex function $f$ at $x \in \mathrm{dom}(f)$ if for all $y \in \mathrm{dom}(f)$:
$$f(y) \ge f(x) + \langle g, y - x\rangle$$

[Figure: affine minorants $f(x_1) + g^\top(y - x_1)$ and $f(x_2) + g^\top(y - x_2)$ below the graph of $f$; $g_1, g_2$ are subgradients at $x_1$; $g_3$ is a subgradient at $x_2$.]
Subdifferential

The subdifferential is the set of all subgradients:
$$\partial f(x) = \{g : f(y) \ge f(x) + \langle g, y - x\rangle \;\; \forall y \in \mathrm{dom}(f)\}$$

Example 1 (absolute value): $f(x) = |x|$:
$$\partial f(0) = [-1, 1], \qquad \partial f(x) = \{\nabla f(x)\} \text{ if } x \ne 0$$

Example 2 (Euclidean norm): $f(x) = \|x\|_2$:
$$\partial f(x) = \left\{\frac{x}{\|x\|_2}\right\} \text{ if } x \ne 0, \qquad \partial f(0) = \{g : \|g\|_2 \le 1\}$$
Conjugates and subdifferentials
Alternative definition of subdifferential
$$\partial f^*(y) = \{x \in \mathrm{dom}(f) : f(x) + f^*(y) = \langle x, y\rangle\}$$
From Danskin’s theorem
$$\partial f^*(y) = \operatorname*{argmax}_{x \in \mathrm{dom}(f)} \; \langle x, y\rangle - f(x)$$

If $f$ is strictly convex,

$$\nabla f^*(y) = \operatorname*{argmax}_{x \in \mathrm{dom}(f)} \; \langle x, y\rangle - f(x)$$
And similarly for ∂f (x), ∇f (x)
Outline
1 Conjugate functions
2 Smoothing techniques
3 Fenchel duality
4 Block coordinate ascent
5 Conclusion
Bregman divergences
Let f be convex and differentiable.
The Bregman divergence generated by f between u and v is
$$D_f(u, v) = f(u) - f(v) - \langle \nabla f(v), u - v\rangle$$
It is the difference between $f(u)$ and the linearization of $f$ around $v$.

[Figure: $D_f(u, v)$ is the gap at $u$ between $f(u)$ and the tangent $f(v) + \langle \nabla f(v), u - v\rangle$.]
Bregman divergences
Recall that a differentiable convex function always lies above its tangents, i.e., for all u, v
$$f(u) \ge f(v) + \langle \nabla f(v), u - v\rangle$$

The Bregman divergence is thus non-negative: for all $u, v$,

$$D_f(u, v) \ge 0$$
Put differently, a differentiable function f is convex if and only if it generates a non-negative Bregman divergence.
The Bregman divergence is not necessarily symmetric.
Bregman divergences
Example 1: if $f(x) = \frac{1}{2}\|x\|_2^2$, then $D_f$ is the squared Euclidean distance:
$$D_f(u, v) = \frac{1}{2}\|u - v\|_2^2$$

Example 2: if $f(x) = \langle x, \log x\rangle$, then $D_f$ is the (generalized) Kullback-Leibler divergence:
$$D_f(p, q) = \sum_{i=1}^d p_i \log \frac{p_i}{q_i} - \sum_{i=1}^d p_i + \sum_{i=1}^d q_i$$
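Both examples can be checked against a generic implementation of the definition (names are ours, not from the lecture):

```python
import numpy as np

def bregman(f, grad_f, u, v):
    """D_f(u, v) = f(u) - f(v) - <grad f(v), u - v>."""
    return f(u) - f(v) - np.dot(grad_f(v), u - v)

# Example 1: f(x) = (1/2)||x||_2^2 generates the squared Euclidean distance.
f_sq = lambda x: 0.5 * np.dot(x, x)
grad_sq = lambda x: x
u = np.array([1.0, 2.0]); v = np.array([0.5, -1.0])
assert np.isclose(bregman(f_sq, grad_sq, u, v), 0.5 * np.sum((u - v) ** 2))

# Example 2: f(x) = <x, log x> generates the generalized KL divergence.
f_ent = lambda x: np.dot(x, np.log(x))
grad_ent = lambda x: np.log(x) + 1.0
p = np.array([0.2, 0.8]); q = np.array([0.5, 0.5])
gen_kl = np.sum(p * np.log(p / q)) - np.sum(p) + np.sum(q)
assert np.isclose(bregman(f_ent, grad_ent, p, q), gen_kl)
```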
Strong convexity

$f$ is said to be $\mu$-strongly convex w.r.t. a norm $\|\cdot\|$ over $C$ if
$$\frac{\mu}{2}\|u - v\|^2 \le D_f(u, v) \quad \text{for all } u, v \in C$$

[Figure: curvature comparison for big $\mu$ vs. small $\mu$.]

Example 1: $f(x) = \frac{1}{2}\|x\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$ over $\mathbb{R}^d$.

Example 2: $f(x) = \langle x, \log x\rangle$ is 1-strongly convex w.r.t. $\|\cdot\|_1$ over the probability simplex $\triangle^d = \{p \in \mathbb{R}_+^d : \|p\|_1 = 1\}$.

Strong convexity
Pinsker’s inequality
$$\mathrm{KL}(p, q) \ge \frac{1}{2}\|p - q\|_1^2$$

[Figure: $\mathrm{KL}(p, q)$ vs. $\frac{1}{2}\|p - q\|_1^2$ as functions of $\pi$, for $p = (\pi, 1 - \pi)$, $q = (0.3, 0.7)$.]
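A quick numerical illustration of Pinsker's inequality on the binary family from the figure (a sanity check, not part of the slides):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two distributions with full support (in nats)."""
    return np.sum(p * np.log(p / q))

# p = (pi, 1 - pi), q = (0.3, 0.7), sweeping pi over (0, 1).
q = np.array([0.3, 0.7])
for pi in np.linspace(0.01, 0.99, 99):
    p = np.array([pi, 1.0 - pi])
    assert kl(p, q) >= 0.5 * np.sum(np.abs(p - q)) ** 2 - 1e-12
```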
Smoothness

$f$ is said to be $L$-smooth w.r.t. a norm $\|\cdot\|$ over $C$ if
$$D_f(u, v) \le \frac{L}{2}\|u - v\|^2 \quad \text{for all } u, v \in C$$

[Figure: curvature comparison for small $L$ vs. big $L$.]

Example 1: $f(x) = \frac{1}{2}\|x\|_2^2$ is 1-smooth w.r.t. $\|\cdot\|_2$ over $\mathbb{R}^d$.

Example 2: $f(x) = \log \sum_i e^{x_i}$ is 1-smooth w.r.t. $\|\cdot\|_\infty$ over $\mathbb{R}^d$.
Hessian bounds
When $f$ is twice differentiable, this also leads to bounds on $\nabla^2 f$.

When $f$ is strongly convex, we have
$$\mu \cdot I_d \preceq \nabla^2 f$$

When $f$ is smooth, we have
$$\nabla^2 f \preceq L \cdot I_d$$
Functions can be both strongly-convex and smooth, e.g., the sum of a smooth function and a strongly-convex function.
Lipschitz functions
Given a norm kxk on C, its dual (also on C) is
$$\|y\|_* = \max_{\|x\| \le 1} \langle x, y\rangle$$

Examples: $\|\cdot\|_2$ is dual with itself; $\|\cdot\|_1$ is dual with $\|\cdot\|_\infty$.
A function $g : \mathbb{R}^d \to \mathbb{R}^p$ is said to be $L$-Lipschitz continuous w.r.t. $\|\cdot\|$ over $C \subseteq \mathbb{R}^d$ if for all $x, y \in C$
$$\|g(x) - g(y)\|_* \le L\|x - y\|$$
Choose g = ∇f . Then f is said to have Lipschitz-continuous gradients.
Fact. A function is L-smooth if and only if it has L-Lipschitz continuous gradients.
Strong convexity and smoothness duality
Theorem. $f$ is $\mu$-strongly convex w.r.t. $\|\cdot\|$ $\Leftrightarrow$ $f^*$ is $\frac{1}{\mu}$-smooth w.r.t. $\|\cdot\|_*$.

Example 1: $f(x) = \frac{\mu}{2}\|x\|^2$ is $\mu$-strongly convex w.r.t. $\|\cdot\|$; $f^*(y) = \frac{1}{2\mu}\|y\|_*^2$ is $\frac{1}{\mu}$-smooth w.r.t. $\|\cdot\|_*$.

Example 2: $f(x) = \langle x, \log x\rangle$ is 1-strongly convex w.r.t. $\|\cdot\|_1$ over $\triangle^d$; $f^*(y) = \log \sum_i e^{y_i}$ is 1-smooth w.r.t. $\|\cdot\|_\infty$ over $\mathbb{R}^d$.
Smoothing: Moreau-Yosida regularization
Suppose we have a non-smooth function $g(x)$, e.g., $g(x) = |x|$. We can create a smooth version of $g$ by
$$g_\mu(x) = \min_u \; g(u) + \frac{1}{2\mu}\|x - u\|_2^2$$
This is also called the inf-convolution of $g$ with $\frac{1}{2\mu}\|\cdot\|_2^2$.

The gradient of $g_\mu$ is obtained from the proximity operator of $\mu g$: $\nabla g_\mu(x) = \frac{1}{\mu}(x - u^\star)$, where
$$u^\star = \operatorname*{argmin}_u \; g(u) + \frac{1}{2\mu}\|x - u\|_2^2 = \operatorname*{argmin}_u \; \mu g(u) + \frac{1}{2}\|x - u\|_2^2 = \mathrm{prox}_{\mu g}(x)$$
Smoothing: Moreau-Yosida regularization
Example: g(x) = |x|
The proximity operator is the soft-thresholding operator:
$$u^\star = \operatorname*{argmin}_u \; \mu|u| + \frac{1}{2}\|x - u\|_2^2 = \begin{cases} 0 & \text{if } |x| \le \mu \\ x - \mu\,\mathrm{sign}(x) & \text{if } |x| > \mu. \end{cases}$$

Using $g_\mu(x) = |u^\star| + \frac{1}{2\mu}\|x - u^\star\|_2^2$, we get
$$g_\mu(x) = \begin{cases} \dfrac{x^2}{2\mu} & \text{if } |x| \le \mu \\ |x| - \dfrac{\mu}{2} & \text{if } |x| > \mu. \end{cases}$$
This is known as the Huber loss.
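The two routes to the Huber loss, via the soft-thresholding prox and via the closed-form piecewise expression, can be compared numerically (a sketch with our own names):

```python
import numpy as np

def prox_abs(x, mu):
    """prox_{mu |.|}(x): soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def huber_env(x, mu):
    """g_mu(x) = |u*| + (x - u*)^2 / (2 mu), with u* = prox_{mu |.|}(x)."""
    u = prox_abs(x, mu)
    return np.abs(u) + (x - u) ** 2 / (2.0 * mu)

def huber_closed(x, mu):
    """Piecewise closed form of the Huber loss."""
    return np.where(np.abs(x) <= mu, x ** 2 / (2.0 * mu), np.abs(x) - mu / 2.0)

x = np.linspace(-3.0, 3.0, 101)
assert np.allclose(huber_env(x, 1.0), huber_closed(x, 1.0))
```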
Smoothing: Moreau-Yosida regularization

[Figure: $g_\mu(x)$ for $\mu = 0$, $\mu = 1$, $\mu = 5$ over $x \in [-3, 3]$.]

Smoothing: dual approach
Suppose we want to smooth a convex function g(x)
Step 1: derive the conjugate $g^*(y)$.

Step 2: add regularization:
$$g_\mu^*(y) = g^*(y) + \frac{\mu}{2}\|y\|_2^2$$

Step 3: derive the bi-conjugate:
$$g_\mu(x) = g_\mu^{**}(x) = \max_{y \in \mathrm{dom}(g^*)} \langle x, y\rangle - g_\mu^*(y)$$

Equivalent (dual) to Moreau-Yosida regularization!

By duality, $g_\mu(x)$ is $\frac{1}{\mu}$-smooth, since $\frac{\mu}{2}\|\cdot\|_2^2$ is $\mu$-strongly convex.
Smoothing: dual approach
Example 1: g(x) = |x|
Step 1: $g^*(y) = I_{[-1,1]}(y)$.

Step 2: add regularization:
$$g_\mu^*(y) = I_{[-1,1]}(y) + \frac{\mu}{2}y^2$$

Step 3: derive the bi-conjugate:
$$g_\mu(x) = g_\mu^{**}(x) = \max_{y \in [-1,1]} \; x \cdot y - \frac{\mu}{2}y^2$$

Solution:
$$g_\mu(x) = x \cdot y^\star - \frac{\mu}{2}(y^\star)^2 \quad \text{where} \quad y^\star = \mathrm{clip}_{[-1,1]}\!\left(\frac{x}{\mu}\right)$$
Smoothing: dual approach
Example 2: $g(x) = \max(0, x)$, i.e., the ReLU function.

Step 1: $g^*(y) = I_{[0,1]}(y)$.

Step 2: add regularization:
$$g_\mu^*(y) = I_{[0,1]}(y) + \frac{\mu}{2}y^2$$

Step 3: derive the bi-conjugate:
$$g_\mu(x) = g_\mu^{**}(x) = \max_{y \in [0,1]} \; x \cdot y - \frac{\mu}{2}y^2$$

Solution:
$$g_\mu(x) = x \cdot y^\star - \frac{\mu}{2}(y^\star)^2 \quad \text{where} \quad y^\star = \mathrm{clip}_{[0,1]}\!\left(\frac{x}{\mu}\right)$$
Smoothing: dual approach

[Figure: smoothed ReLU $g_\mu(x)$ for $\mu = 0$, $\mu = 1$, $\mu = 5$ over $x \in [-3, 3]$.]

Smoothing: dual approach
Regularization is not limited to $\frac{\mu}{2}\|y\|^2$: any strongly-convex regularization can be used.

Example: softmax
$$g(x) = \max_{i \in \{1,\dots,d\}} x_i$$
$$g^*(y) = I_{\triangle^d}(y)$$
$$g_\mu^*(y) = I_{\triangle^d}(y) + \mu\langle y, \log y\rangle$$
$$g_\mu(x) = \mu \log \sum_{i=1}^d \exp(x_i/\mu)$$
$$\nabla g_\mu(x) = \frac{\exp(x/\mu)}{\sum_{i=1}^d \exp(x_i/\mu)}$$
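The softmax smoothing can be checked numerically: $g_\mu$ upper-bounds the max and approaches it as $\mu \to 0$, and its gradient is the usual softmax map. A minimal sketch (subtracting the max before exponentiating is a standard stability trick, not from the slides):

```python
import numpy as np

def smoothed_max(x, mu):
    """g_mu(x) = mu * log sum_i exp(x_i / mu), computed stably."""
    m = x.max()
    return m + mu * np.log(np.sum(np.exp((x - m) / mu)))

def smoothed_max_grad(x, mu):
    """grad g_mu(x): the softmax map, which lies in the simplex."""
    e = np.exp((x - x.max()) / mu)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
assert smoothed_max(x, 1e-3) >= 3.0              # g_mu >= max
assert smoothed_max(x, 1e-3) - 3.0 < 1e-6        # tight as mu -> 0
assert np.isclose(smoothed_max_grad(x, 1.0).sum(), 1.0)
```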
Outline
1 Conjugate functions
2 Smoothing techniques
3 Fenchel duality
4 Block coordinate ascent
5 Conclusion
Fenchel dual
Assume $F(\theta)$ convex and $G(W)$ strictly convex.

We are going to derive the Fenchel dual of
$$\min_{W \in \mathbb{R}^{d \times k}} F(XW) + G(W)$$
where $X \in \mathbb{R}^{n \times d}$ and $W \in \mathbb{R}^{d \times k}$.

Let us rewrite the problem using constraints:
$$\min_{W \in \mathbb{R}^{d \times k},\; \theta \in \mathbb{R}^{n \times k}} F(\theta) + G(W) \quad \text{s.t.} \quad \theta = XW$$
$F$ and $G$ now involve different variables (tied by equality constraints).
Fenchel dual
We now use Lagrange duality
$$\min_{W \in \mathbb{R}^{d \times k},\; \theta \in \mathbb{R}^{n \times k}} \; \max_{\alpha \in \mathbb{R}^{n \times k}} \; F(\theta) + G(W) + \langle \alpha, \theta - XW\rangle$$
Since the problem only has linear constraints and is feasible, strong duality holds (we can swap the min and max):
$$\max_{\alpha \in \mathbb{R}^{n \times k}} \; \min_{W \in \mathbb{R}^{d \times k},\; \theta \in \mathbb{R}^{n \times k}} \; F(\theta) + G(W) + \langle \alpha, \theta - XW\rangle$$
We are now going to introduce the convex conjugates of $F$ and $G$.
Fenchel dual
For the terms involving θ, we have
$$\min_{\theta \in \mathbb{R}^{n \times k}} F(\theta) + \langle \alpha, \theta\rangle = -F^*(-\alpha)$$
For the terms involving $W$, we have
$$\min_{W \in \mathbb{R}^{d \times k}} G(W) - \langle \alpha, XW\rangle = \min_{W \in \mathbb{R}^{d \times k}} G(W) - \langle W, X^\top\alpha\rangle = -G^*(X^\top\alpha)$$
To summarize, the dual consists in solving
$$\max_{\alpha \in \mathbb{R}^{n \times k}} -F^*(-\alpha) - G^*(X^\top\alpha)$$
The primal-dual link is
$$W^\star = \nabla G^*(X^\top\alpha^\star)$$
Fenchel dual for loss sums
Typically, in machine learning, F is a sum of loss terms and G is a regularization term:
$$F(\theta) = \sum_{i=1}^n L(\theta_i, y_i) \quad \text{where} \quad \theta_i = W^\top x_i$$
Since the sum is separable, we get
$$F^*(-\alpha) = \sum_{i=1}^n L^*(-\alpha_i, y_i)$$
where $L^*$ is the convex conjugate of $L$ in its first argument.
What have we gained? If G∗ is simple enough, we can solve the objective by dual block coordinate ascent.
Examples of regularizer
Squared L2 norm:
$$G(W) = \frac{\lambda}{2}\|W\|_F^2 = \frac{\lambda}{2}\langle W, W\rangle$$
$$G^*(V) = \frac{1}{2\lambda}\|V\|_F^2$$
$$\nabla G^*(V) = \frac{1}{\lambda}V$$

Elastic net:
$$G(W) = \frac{\lambda}{2}\|W\|_F^2 + \lambda\rho\|W\|_1$$
$$G^*(V) = \langle \nabla G^*(V), V\rangle - G(\nabla G^*(V))$$
$$\nabla G^*(V) = \operatorname*{argmin}_W \; \frac{1}{2}\|W - V/\lambda\|_F^2 + \rho\|W\|_1$$
The last operation is the soft-thresholding operator (element-wise).
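For the elastic net, the map $\nabla G^*$ is element-wise soft-thresholding of $V/\lambda$ at level $\rho$. A small sketch (names are ours):

```python
import numpy as np

def grad_G_star(V, lam, rho):
    """argmin_W (1/2)||W - V/lam||_F^2 + rho ||W||_1: soft-thresholding."""
    Z = V / lam
    return np.sign(Z) * np.maximum(np.abs(Z) - rho, 0.0)

# Entries of V/lam with magnitude <= rho are zeroed; others shrink by rho.
out = grad_G_star(np.array([3.0, -0.5, -2.0]), 1.0, 1.0)
assert np.allclose(out, [2.0, 0.0, -1.0])
```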
Fenchel-Young losses
The Fenchel-Young loss generated by Ω
$$L_\Omega(\theta_i, y_i) = \Omega^*(\theta_i) + \Omega(y_i) - \langle \theta_i, y_i\rangle$$
Non-negative (Fenchel-Young inequality)
Convex in θ even when Ω is not
If Ω is strictly convex, the loss is zero if and only if
$$y_i = \nabla\Omega^*(\theta_i) = \operatorname*{argmax}_{y' \in \mathrm{dom}(\Omega)} \; \langle y', \theta_i\rangle - \Omega(y')$$
Conjugate function (in the first argument)
$$L_\Omega^*(-\alpha_i, y_i) = \Omega(y_i - \alpha_i) - \Omega(y_i)$$
Fenchel-Young losses
Squared loss ($y_i \in \mathbb{R}^k$):
$$\Omega(\beta_i) = \frac{1}{2}\|\beta_i\|_2^2 \qquad L_\Omega(\theta_i, y_i) = \frac{1}{2}\|y_i - \theta_i\|_2^2$$

Multiclass perceptron loss ($y_i \in \{e_1, \dots, e_k\}$):
$$\Omega(\beta_i) = I_{\triangle^k}(\beta_i) \qquad L_\Omega(\theta_i, y_i) = \max_{j \in \{1,\dots,k\}} \theta_{i,j} - \langle \theta_i, y_i\rangle$$

Multiclass hinge loss ($y_i \in \{e_1, \dots, e_k\}$, $v_i = 1 - y_i$):
$$\Omega(\beta_i) = I_{\triangle^k}(\beta_i) - \langle \beta_i, v_i\rangle \qquad L_\Omega(\theta_i, y_i) = \max_{j \in \{1,\dots,k\}} \theta_{i,j} + v_{i,j} - \langle \theta_i, y_i\rangle$$
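For the squared loss case, the Fenchel-Young construction can be verified directly: with $\Omega = \frac{1}{2}\|\cdot\|_2^2$ we have $\Omega^* = \Omega$, and the loss reduces to the usual squared loss. A small check (names are ours):

```python
import numpy as np

def fy_squared(theta, y):
    """Fenchel-Young loss for Omega(b) = (1/2)||b||^2 (here Omega* = Omega)."""
    omega = lambda b: 0.5 * np.dot(b, b)
    return omega(theta) + omega(y) - np.dot(theta, y)

theta = np.array([0.1, 0.9]); y = np.array([0.0, 1.0])
assert np.isclose(fy_squared(theta, y), 0.5 * np.sum((y - theta) ** 2))
assert fy_squared(theta, y) >= 0.0       # Fenchel-Young inequality
assert abs(fy_squared(y, y)) < 1e-12     # zero when theta = y = grad Omega*(theta)
```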
Dual in the case of Fenchel-Young losses
Recall that the dual is
$$\max_{\alpha \in \mathbb{R}^{n \times k}} -\sum_{i=1}^n L^*(-\alpha_i, y_i) - G^*(X^\top\alpha)$$

with primal-dual link $W^\star = \nabla G^*(X^\top\alpha^\star)$.

Using the change of variable $\beta_i = y_i - \alpha_i$ and $L = L_\Omega$, we obtain

$$\max_{\beta \in \mathbb{R}^{n \times k}} -\sum_{i=1}^n \left[\Omega(\beta_i) - \Omega(y_i)\right] - G^*(X^\top(Y - \beta)) \quad \text{s.t.} \quad \beta_i \in \mathrm{dom}(\Omega)$$

with primal-dual link $W^\star = \nabla G^*(X^\top(Y - \beta^\star))$. Note that $Y \in \{0, 1\}^{n \times k}$ contains the labels in one-hot representation.
Duality gap
Let P(W ) and D(β) be the primal and dual objectives, respectively.
For all W and β we have
D(β) ≤ P(W )
At the optima, we have
D(β?) = P(W ?)
P(W ) − D(β) ≥ 0 is called the duality gap and can be used as a certificate of optimality.
Outline
1 Conjugate functions
2 Smoothing techniques
3 Fenchel duality
4 Block coordinate ascent
5 Conclusion
Block coordinate ascent
Key idea: on each iteration, pick a block of variables $\beta_i \in \mathbb{R}^k$ and update only that block.
If the block has a size of 1, this is called coordinate ascent.
Exact update:
$$\beta_i \leftarrow \operatorname*{argmin}_{\beta_i \in \mathrm{dom}(\Omega)} \; \Omega(\beta_i) - \Omega(y_i) + G^*(X^\top(Y - \beta)), \qquad i \in \{1, \dots, n\}$$
Possible schemes for picking i: random, cyclic, shuffled cyclic
Block coordinate ascent
The sub-problem can be too complicated in some cases.
Approximate update (using a quadratic approximation around the current iterate $\beta_i^t$):
$$\beta_i \leftarrow \operatorname*{argmin}_{\beta_i \in \mathrm{dom}(\Omega)} \; \Omega(\beta_i) - \langle \beta_i, u_i\rangle + \frac{\sigma_i}{2}\|\beta_i\|_2^2 = \mathrm{prox}_{\frac{1}{\sigma_i}\Omega}(u_i/\sigma_i)$$
where $\sigma_i = \|x_i\|_2^2/\lambda$ and $u_i = W^\top x_i + \sigma_i\beta_i^t$, with $W = \nabla G^*(X^\top(Y - \beta))$.

This update is exact if both $\Omega$ and $G^*$ are quadratic.
Enjoys a linear rate of convergence w.r.t. the primal objective if Ω and G are strongly-convex.
Proximal operators
Squared loss, $\Omega(\beta_i) = \frac{1}{2}\|\beta_i\|_2^2$:
$$\mathrm{prox}_{\tau\Omega}(\eta) = \operatorname*{argmin}_{\beta \in \mathbb{R}^k} \; \frac{1}{2}\|\beta - \eta\|_2^2 + \frac{\tau}{2}\|\beta\|_2^2 = \frac{\eta}{\tau + 1}$$

Perceptron loss, $\Omega(\beta_i) = I_{\triangle^k}(\beta_i)$:
$$\mathrm{prox}_{\tau\Omega}(\eta) = \operatorname*{argmin}_{p \in \triangle^k} \; \|p - \eta\|_2^2$$

Multiclass hinge loss, $\Omega(\beta_i) = I_{\triangle^k}(\beta_i) - \langle \beta_i, v_i\rangle$:
$$\mathrm{prox}_{\tau\Omega}(\eta) = \operatorname*{argmin}_{p \in \triangle^k} \; \|p - (\eta + \tau v_i)\|_2^2$$
where vi = 1 − yi and yi ∈ {e1,..., ek } is the correct label.
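The perceptron and hinge proxes both reduce to a Euclidean projection onto the probability simplex. The slides do not spell this projection out; a sketch of the standard sort-based algorithm (names are ours):

```python
import numpy as np

def project_simplex(eta):
    """argmin_{p in simplex} ||p - eta||_2^2, via sorting."""
    u = np.sort(eta)[::-1]                 # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(eta) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # largest index kept active
    tau = css[rho] / (rho + 1.0)
    return np.maximum(eta - tau, 0.0)

p = project_simplex(np.array([0.2, 1.5, -0.3]))
assert np.isclose(p.sum(), 1.0) and np.all(p >= 0)
```

Projecting a point already in the simplex returns it unchanged, and the hinge prox is obtained by first shifting the input by $\tau v_i$.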
Outline
1 Conjugate functions
2 Smoothing techniques
3 Fenchel duality
4 Block coordinate ascent
5 Conclusion
Summary
Conjugate functions are a powerful abstraction.
Smoothing techniques are enabled by the duality between smoothness and strong convexity.
The dual can often be easier to solve than the primal.
If the dual is quadratic and the constraints are decomposable, dual block coordinate ascent is very well suited.
References
Some figures and contents are borrowed from L. Vandenberghe’s lectures “Subgradients” and “Conjugate functions”.
The block coordinate ascent algorithm used is described in:

Shai Shalev-Shwartz and Tong Zhang. "Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization." 2013.

Mathieu Blondel, Andre Martins, and Vlad Niculae. "Learning with Fenchel-Young Losses." 2019.
Lab work
Implement BDCA for the squared loss and the multiclass hinge loss.

Primal objective:
$$P(W) = \sum_{i=1}^n L_\Omega(\theta_i, y_i) + G(W), \qquad \theta_i = W^\top x_i \in \mathbb{R}^k, \quad y_i \in \mathbb{R}^k$$

Dual objective:
$$D(\beta) = -\sum_{i=1}^n \left[\Omega(\beta_i) - \Omega(y_i)\right] - G^*(X^\top(Y - \beta)) \quad \text{s.t.} \quad \beta_i \in \mathrm{dom}(\Omega)$$
with primal-dual link $W^\star = \nabla G^*(X^\top(Y - \beta^\star))$. Note that $Y \in \{0, 1\}^{n \times k}$ contains the labels in one-hot representation.

Duality gap: $P(W) - D(\beta)$.
For G, use the squared L2 norm.

Lab work
Approximate block update
$$\beta_i \leftarrow \mathrm{prox}_{\frac{1}{\sigma_i}\Omega}(u_i/\sigma_i)$$
where $\sigma_i = \|x_i\|_2^2/\lambda$ and $u_i = W^\top x_i + \sigma_i\beta_i^t$, with $W = \nabla G^*(X^\top(Y - \beta))$.

Use cyclic block selection.
See “Proximal operators” slide for prox expressions
See “Fenchel-Young losses” slide for LΩ and Ω expressions
See “Examples of regularizer” slide for G∗ and ∇G∗ expressions
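As a starting point for the lab, a minimal BDCA sketch for the squared-loss case with $G(W) = \frac{\lambda}{2}\|W\|_F^2$. The approximate block update is exact here since $\Omega$ and $G^*$ are both quadratic. All data, variable names, and the value of $\lambda$ are our own choices for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, k, lam = 20, 5, 3, 100.0
X = rng.randn(n, d)
Y = np.eye(k)[rng.randint(k, size=n)]       # one-hot labels, shape (n, k)

def primal(W):
    # P(W) = sum_i (1/2)||y_i - W^T x_i||^2 + (lam/2)||W||_F^2
    return 0.5 * np.sum((Y - X @ W) ** 2) + 0.5 * lam * np.sum(W ** 2)

def dual(beta):
    # D(beta) = -sum_i [Omega(beta_i) - Omega(y_i)] - (1/(2 lam))||X^T(Y - beta)||_F^2
    V = X.T @ (Y - beta)
    return -0.5 * np.sum(beta ** 2) + 0.5 * np.sum(Y ** 2) - np.sum(V ** 2) / (2 * lam)

beta = np.zeros((n, k))
for epoch in range(100):
    for i in range(n):                      # cyclic block selection
        W = X.T @ (Y - beta) / lam          # primal-dual link: grad G*(X^T(Y - beta))
        sigma = X[i] @ X[i] / lam
        u = W.T @ X[i] + sigma * beta[i]
        beta[i] = u / (sigma + 1.0)         # prox_{(1/sigma) Omega}(u / sigma)

W = X.T @ (Y - beta) / lam
gap = primal(W) - dual(beta)                # certificate of optimality
assert -1e-9 <= gap < 1e-6
```

With both objectives quadratic, each block update exactly maximizes the dual over $\beta_i$, and the duality gap shrinks to numerical precision within a modest number of cyclic epochs.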