
Convexity Theory and Gradient Methods

Angelia Nedić ([email protected])
ISE Department and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
IMA, Minnesota, June 1–6, 2014

Outline

• Convex Functions

• Optimality Principle

• Projection Theorem

• Gradient Methods

Convex Sets

Line segment [x1, x2] ⊆ Rn is the set of all points xα = αx1 + (1 − α)x2 for α ∈ [0, 1].

A set C is convex when for all x1, x2 ∈ C, the segment [x1, x2] is contained in the set C.

Convex Function

Let f be a function from Rn to R, f : Rn → R.

Informally: f is convex when, as xα = αx1 + (1 − α)x2 varies over the line segment [x1, x2], the points (xα, f(xα)) lie on or below the segment connecting (x1, f(x1)) and (x2, f(x2)).

The domain of f is a set in Rn defined by

dom(f) = {x ∈ Rn | f(x) is well defined (finite)}

Def. A function f is convex if
(1) its domain dom(f) is a convex set in Rn, and
(2) for all x1, x2 ∈ dom(f) and α ∈ [0, 1],

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2)

The function is strictly convex if the inequality is strict whenever x1 ≠ x2 and α ∈ (0, 1)
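The defining inequality is easy to spot-check numerically. Below is a minimal sketch; the test function f(x) = ‖x‖², the sample count, and the tolerance are illustrative choices, not from the slides:

```python
# Sketch: spot check of the defining inequality on random segments; the
# test function f(x) = ||x||^2 and the tolerance are illustrative choices.
import numpy as np

f = lambda x: np.dot(x, x)          # ||x||^2, convex on R^n
rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    assert f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2) + 1e-12
print("convexity inequality holds on all sampled segments")
```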

Examples on R

Convex:

• Affine: ax + b over R, for any a, b ∈ R
• Exponential: e^{ax} over R, for any a ∈ R
• Power: x^p over (0, +∞), for p ≥ 1 or p ≤ 0
• Powers of absolute value: |x|^p over R, for p ≥ 1
• Negative entropy: x ln x over (0, +∞)

Concave:

• Affine: ax + b over R, for any a, b ∈ R
• Powers: x^p over (0, +∞), for 0 ≤ p ≤ 1
• Logarithm: ln x over (0, +∞)

Examples: Affine Functions and Norms

• Affine functions are both convex and concave

• Norms are convex

Examples on Rn

• Affine function f(x) = a'x + b with a ∈ Rn and b ∈ R

• Euclidean, l1, and l∞ norms

• General lp norms

‖x‖p = (∑_{i=1}^{n} |xi|^p)^{1/p} for p ≥ 1
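As a quick illustration, the formula can be evaluated directly and compared against a library routine; a small sketch (the vector and values of p are arbitrary):

```python
# Sketch: the general l_p norm from the formula above, compared with
# numpy's built-in routine; the vector and values of p are arbitrary.
import numpy as np

def lp_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 1.0])
for p in (1, 2, 10):
    print(p, lp_norm(x, p), np.linalg.norm(x, p))   # the two agree
```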


Examples on Rm×n

The space Rm×n is the space of m × n matrices

• Affine function

f(X) = tr(A'X) + b = ∑_{i=1}^{m} ∑_{j=1}^{n} aij xij + b

• Spectral norm (maximum singular value)

f(X) = ‖X‖2 = σmax(X) = (λmax(X'X))^{1/2}

where λmax(A) denotes the maximum eigenvalue of a matrix A

Verifying Convexity of a Function

We can verify that a given function f is convex by

• Using the definition

• Applying some special criteria:
  • Second-order conditions
  • First-order conditions
  • Reduction to a scalar function

• Showing that f is obtained through operations preserving convexity

Second-Order Conditions

Let f be twice differentiable and let dom(f) = Rn [in general, it is required that dom(f) is open]. The Hessian ∇2f(x) is a symmetric n × n matrix whose entries are the second-order partial derivatives of f at x:

[∇2f(x)]ij = ∂²f(x)/∂xi∂xj for i, j = 1, . . . , n

2nd-order conditions: For a twice differentiable f with convex domain,
• f is convex if and only if ∇2f(x) ⪰ 0 for all x ∈ dom(f)
• f is strictly convex if ∇2f(x) ≻ 0 for all x ∈ dom(f)
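When f is available only as a black box, the 2nd-order condition can be checked approximately: build a finite-difference Hessian and inspect its smallest eigenvalue. A sketch under that setup; the log-sum-exp test function and the step size h are our choices (for log-sum-exp the exact minimum eigenvalue is 0, so expect a tiny value up to rounding):

```python
# Sketch: check the 2nd-order condition numerically via a central
# finite-difference Hessian and its smallest eigenvalue.
import numpy as np

def f(x):
    # log-sum-exp: a standard convex test function (our choice of example)
    return np.log(np.sum(np.exp(x)))

def numerical_hessian(f, x, h=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                       - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h * h)
    return H

x = np.array([0.3, -1.2, 2.0])
H = numerical_hessian(f, x)
# For a convex f this should be >= 0 up to discretization error.
print(np.linalg.eigvalsh(H).min())
```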

Examples

Quadratic function: f(x) = (1/2)x'Px + q'x + r with a symmetric n × n matrix P

∇f(x) = Px + q,  ∇2f(x) = P

Convex for P ⪰ 0

Least-squares objective: f(x) = ‖Ax − b‖² with an m × n matrix A

∇f(x) = 2A'(Ax − b),  ∇2f(x) = 2A'A

Convex for any A

Quadratic-over-linear: f(x, y) = x²/y

T " y #" y # ∇2f(x, y) = 2  0 y3 −x −x Convex for y > 0


First-Order Condition

f is differentiable if dom(f) is open and the gradient

∇f(x) = (∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn)

exists at each x ∈ dom(f)

1st-order condition: a differentiable f is convex if and only if its domain is convex and

f(x) + ⟨∇f(x), z − x⟩ ≤ f(z) for all x, z ∈ dom(f)

The first-order approximation is a global underestimate of f. This is a very important property used in algorithm design and performance analysis.
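The global-underestimate property can likewise be spot-checked; here is a sketch for f(x) = ‖x‖² with ∇f(x) = 2x (our choice of example):

```python
# Sketch: the tangent plane at x globally underestimates f; checked for
# f(x) = ||x||^2 with grad f(x) = 2x (our choice of example).
import numpy as np

f = lambda x: np.dot(x, x)
grad = lambda x: 2 * x
rng = np.random.default_rng(1)
for _ in range(1000):
    x, z = rng.normal(size=4), rng.normal(size=4)
    assert f(x) + grad(x) @ (z - x) <= f(z) + 1e-12
print("first-order underestimate holds on all samples")
```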

Restriction of a Convex Function to a Line

f is convex if and only if dom(f) is convex and the function

g : R → R,  g(t) = f(x + tv),  dom(g) = {t | x + tv ∈ dom(f)}

is convex (in t) for any x ∈ dom(f) and v ∈ Rn

Checking convexity of multivariable functions can thus be done by checking convexity of functions of one variable.

Example: f : Sn → R with f(X) = − ln det X, dom(f) = Sn++

g(t) = − ln det(X + tV) = − ln det X − ln det(I + tX^{−1/2}VX^{−1/2}) = − ln det X − ∑_{i=1}^{n} ln(1 + tλi)

where λi are the eigenvalues of X^{−1/2}VX^{−1/2}

g is convex in t (for any choice of V and any X ≻ 0); hence f is convex
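The eigenvalue identity behind this example is easy to verify numerically; a sketch (the random X ≻ 0, the symmetric V, and the value of t are made up for illustration):

```python
# Sketch: verify det(X + tV) = det(X) * prod_i (1 + t*lam_i), i.e. the
# eigenvalue identity used above, on random data.
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
X = A @ A.T + n * np.eye(n)          # a positive definite X
B = rng.normal(size=(n, n))
V = (B + B.T) / 2                    # a symmetric direction V

w, Q = np.linalg.eigh(X)             # X^{-1/2} via eigendecomposition
X_mhalf = Q @ np.diag(w ** -0.5) @ Q.T
lam = np.linalg.eigvalsh(X_mhalf @ V @ X_mhalf)

t = 0.1                              # small enough that X + tV stays PD here
lhs = -np.log(np.linalg.det(X + t * V))
rhs = -np.log(np.linalg.det(X)) - np.sum(np.log(1 + t * lam))
print(lhs, rhs)                      # the two values agree
```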

Operations Preserving Convexity

• Positive Scaling

• Sum

• Composition with Affine Mapping

• Special Compositions

• Point-wise Maximum

• Point-wise Supremum

• Partial Minimization


Scaling, Sum, & Composition with Affine Function

Positive multiple: For a convex f and λ > 0, the function λf is convex

Sum: For convex f1 and f2, the sum f1 + f2 is convex (extends to infinite sums, integrals)

Composition with affine function: For a convex f and affine g [i.e., g(x) = Ax + b], the composition f ◦ g is convex, where (f ◦ g)(x) = f(Ax + b)

Examples

• Log-barrier for linear inequalities:

f(x) = − ∑_{i=1}^{m} ln(bi − ai'x),  dom(f) = {x | ai'x < bi, i = 1, . . . , m}

• (Any) norm of an affine function: f(x) = ‖Ax + b‖
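As a small illustration of the log-barrier example above, here is a sketch computing its value and gradient; the constraint data A, b and the helper names are made up:

```python
# Sketch: value and gradient of the log-barrier for a_i'x < b_i.
import numpy as np

def barrier(A, b, x):
    s = b - A @ x                    # slacks b_i - a_i'x, positive in dom(f)
    if np.any(s <= 0):
        return np.inf                # outside the domain
    return -np.sum(np.log(s))

def barrier_grad(A, b, x):
    s = b - A @ x
    return A.T @ (1.0 / s)           # sum_i a_i / (b_i - a_i'x)

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([1.0, 1.0, 1.0])
x = np.zeros(2)                      # a strictly feasible point
print(barrier(A, b, x), barrier_grad(A, b, x))   # gradient ~ 0: x is central
```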

Composition with Scalar Functions

Composition of g : Rn → R and h : R → R with dom(g) = Rn and dom(h) = R:

f(x) = h(g(x))

f is convex if

(1) g is convex, h is nondecreasing and convex

(2) g is concave, h is nonincreasing and convex

Examples

• e^{g(x)} is convex if g is convex

• 1/g(x) is convex if g is concave and positive

Composition with Vector Functions

Composition of g : Rn → Rp and h : Rp → R with dom(g) = Rn and dom(h) = Rp:

f(x) = h(g(x)) = h(g1(x), g2(x), . . . , gp(x))

f is convex if

(1) each gi is convex, h is convex and nondecreasing in each argument

(2) each gi is concave, h is convex and nonincreasing in each argument

Example

• ∑_{i=1}^{m} e^{gi(x)} is convex if the gi are convex

Pointwise Maximum

For convex functions f1, . . . , fm, the pointwise-max function

F(x) = max {f1(x), . . . , fm(x)}

is convex (What is the domain of F?)

Examples

• Piecewise-linear function: f(x) = max_{i=1,...,m} (ai'x + bi) is convex

• Sum of r largest components of a vector x ∈ Rn:

f(x) = x[1] + x[2] + ··· + x[r]

is convex (x[i] is the i-th largest component of x)

f(x) = max_{(i1, . . . , ir) ∈ Ir} {xi1 + xi2 + ··· + xir}

Ir = {(i1, . . . , ir) | i1 < · · · < ir, ij ∈ {1, . . . , n}, j = 1, . . . , r}

Pointwise supremum - later
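The sum-of-r-largest example can be checked by computing it both ways: directly by sorting, and as the pointwise max over index subsets. A sketch (data are random for illustration):

```python
# Sketch: the sum of the r largest components computed directly (sorting)
# and as the pointwise max over all r-element index subsets; random data.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=6)
r = 3
direct = np.sort(x)[::-1][:r].sum()                    # x[1] + ... + x[r]
as_max = max(x[list(idx)].sum() for idx in combinations(range(x.size), r))
print(direct, as_max)   # equal: f is a pointwise max of linear functions
```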

Extended-Value Functions

A function f is an extended-value function if f : Rn → R ∪ {−∞, +∞}

Example: consider f(x) = inf_{y≥0} xy for x ∈ R

Def. The epigraph of a function f over Rn is the following set in Rn+1:

epi f = {(x, w) ∈ Rn+1 | x ∈ Rn, f(x) ≤ w}

Theorem [Convex Function - Convex Epigraph] A function f is convex if and only if its epigraph epi f is a convex set in Rn+1

This allows us to use the convexity of the epigraph as the definition of convexity (often done); the two definitions are equivalent in view of the theorem.

For an f with domain dom(f), we associate an extended-value function f̃ defined by

f̃(x) = f(x) if x ∈ dom(f), and f̃(x) = +∞ otherwise

dom(f) is the projection of epi f on Rn; the convexity inequality for f is recovered from the convexity of epi f by letting w = f(x)

Pointwise Supremum

Let A ⊆ Rp and f : Rn × Rp → R. Let f(x, z) be convex in x for each z ∈ A. Then, the supremum function over the set A is convex:

g(x) = sup_{z∈A} f(x, z)

Examples

• The support function of a set C ⊂ Rn is convex:
  SC : Rn → R,  SC(x) = sup_{z∈C} z'x

• The farthest-distance function of a set C ⊂ Rn is convex:
  f : Rn → R,  f(x) = sup_{z∈C} ‖x − z‖

• The maximum-eigenvalue function of a symmetric matrix is convex:
  λmax : Sn → R,  λmax(X) = sup_{‖z‖=1} z'Xz
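Since λmax is a pointwise supremum of linear functions of X, it satisfies the convexity inequality on Sn; a numerical spot check (the random symmetric matrices are our construction):

```python
# Sketch: spot check of the convexity inequality for lambda_max on
# random symmetric matrices.
import numpy as np

rng = np.random.default_rng(4)
lam_max = lambda X: np.linalg.eigvalsh(X).max()
for _ in range(200):
    X = rng.normal(size=(5, 5)); X = (X + X.T) / 2
    Y = rng.normal(size=(5, 5)); Y = (Y + Y.T) / 2
    a = rng.uniform()
    assert lam_max(a*X + (1-a)*Y) <= a*lam_max(X) + (1-a)*lam_max(Y) + 1e-10
print("lambda_max satisfies the convexity inequality on all samples")
```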

Minimization

Let C ⊆ Rp be a nonempty convex set and let f : Rn × Rp → R be a convex function [in (x, z) ∈ Rn × Rp]. Then

g(x) = inf_{z∈C} f(x, z) is convex

Example
• Distance to a set: for a nonempty convex C ⊂ Rn, dist(x, C) = min_{z∈C} ‖x − z‖ is convex

Proof for the case when g is finite: Let x1, x2 ∈ Rn and α ∈ (0, 1) be arbitrary, and let ε > 0 be arbitrarily small. Then there exist z1, z2 ∈ C such that f(x1, z1) ≤ g(x1) + ε and f(x2, z2) ≤ g(x2) + ε. Consider f(αx1 + (1 − α)x2, αz1 + (1 − α)z2) and use the convexity of f and C.

Level Sets and Convex Functions

Def. Given a scalar c ∈ R and a function f, a (lower) level set of f associated with c is given by

Lc(f) = {x ∈ Rn | f(x) ≤ c}

Examples: f(x) = ‖x‖² for x ∈ Rn;  f(x1, x2) = e^{x1}

• Every level set of a convex function is convex

• The converse is false: consider f(x) = −e^x for x ∈ R

Recall the definition of a concave function: g is concave when −g is convex

• Every (upper) level set of a concave function is convex

Optimality Principle for Differentiable f

In the following, unless otherwise stated, the function f is assumed to be differentiable and convex, with dom(f) = Rn.

Let f be a differentiable convex function and let C be a nonempty closed convex set

Theorem A vector x∗ is optimal (i.e., minimizes f over C) if and only if x∗ ∈ C and

⟨∇f(x∗), z − x∗⟩ ≥ 0 for all z ∈ C

Unconstrained Optimization

minimize f(x) subject to x ∈ Rn

• A vector x∗ is optimal if and only if ∇f(x∗) = 0

• Follows from

⟨∇f(x∗), z − x∗⟩ ≥ 0 for all z ∈ Rn

Linear Equality Constrained Problem

minimize f(x) subject to Ax = b, with A ∈ Rm×n and b ∈ Rm

• When does an optimal solution exist?
• A vector x∗ is optimal if and only if

⟨∇f(x∗), y⟩ = 0 for all y ∈ NA (the null space of A)

Using NA⊥ = Im A', we have that x∗ is optimal if and only if there exists λ∗ ∈ Rm such that

∇f(x∗) + A'λ∗ = 0

This is the primal-dual (Lagrangian) optimality condition.
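For a strongly convex quadratic f(x) = (1/2)x'Px + q'x, this condition together with Ax = b is one linear system in (x∗, λ∗). A sketch solving it (the problem data are made up):

```python
# Sketch: for f(x) = (1/2)x'Px + q'x, the primal-dual condition plus
# feasibility is the linear (KKT) system  [P A'; A 0][x; lam] = [-q; b].
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 2
M = rng.normal(size=(n, n))
P = M @ M.T + np.eye(n)                 # P > 0, so f is strictly convex
q = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(KKT, np.concatenate([-q, b]))
x_star, lam_star = sol[:n], sol[n:]

print(np.linalg.norm(A @ x_star - b))                    # feasibility ~ 0
print(np.linalg.norm(P @ x_star + q + A.T @ lam_star))   # grad f + A'lam ~ 0
```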

Minimization over the Nonnegative Orthant

minimize f(x) subject to x ≥ 0, x ∈ Rn

• When does an optimal solution exist?
• A vector x∗ ≥ 0 is optimal if and only if

∇f(x∗) ≥ 0 and ⟨∇f(x∗), x∗⟩ = 0

The second relation is known as the "Complementarity Condition" in Lagrangian duality.

• Again, it follows from the optimality principle:

⟨∇f(x∗), z − x∗⟩ ≥ 0 for all z ≥ 0

Projection Theorem

Let C ⊆ Rn be a nonempty closed convex set and let x̂ ∈ Rn be arbitrary.

(a) There is a unique solution to the following problem:

minimize ‖z − x̂‖² subject to z ∈ C

(b) A vector z∗ ∈ C is the solution if and only if

⟨z∗ − x̂, z − z∗⟩ ≥ 0 for all z ∈ C

• The solution is said to be the projection of x̂ on C in the Euclidean norm, denoted by ΠC[x̂]
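Two sets with closed-form projections are a box and a Euclidean ball; the sketch below implements both and spot-checks characterization (b) at the projection (the set choices and tolerances are ours):

```python
# Sketch: closed-form projections on a box and on a Euclidean ball, with a
# spot check of characterization (b) at the projection on the ball.
import numpy as np

def proj_box(x, lo, hi):
    # C = {z : lo <= z <= hi}, projection clips coordinate-wise
    return np.clip(x, lo, hi)

def proj_ball(x, r=1.0):
    # C = {z : ||z|| <= r}, projection rescales points outside the ball
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

rng = np.random.default_rng(6)
x_hat = 3.0 * rng.normal(size=3)
z_star = proj_ball(x_hat)
for _ in range(1000):
    z = proj_ball(rng.normal(size=3))          # a random point of C
    assert (z_star - x_hat) @ (z - z_star) >= -1e-10
print("characterization (b) holds at the projection")
print(proj_box(x_hat, -1.0, 1.0))              # box projection, for comparison
```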

Proof of the Projection Theorem

(a) The objective function is strongly∗ convex since its Hessian is equal to 2I. Therefore, the optimal solution exists and it is unique.

(b) By the first-order optimality condition, z∗ ∈ C is the solution if and only if

⟨∇f(z∗), z − z∗⟩ ≥ 0 for all z ∈ C

Since ∇f(z) = 2(z − x̂), the result follows.

∗Function f has a positive definite Hessian ∇2f(x) everywhere

Projection Properties

Th. Let C ⊆ Rn be a nonempty closed convex set

(a) The projection mapping ΠC : Rn → C is non-expansive, i.e.,

‖ΠC[x] − ΠC[y]‖ ≤ ‖x − y‖ for all x, y ∈ Rn

(b) The set distance function dist : Rn → R given by

dist(x, C) = ‖ΠC[x] − x‖ is convex
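Property (a) is easy to probe numerically with the box projection from before; a minimal sketch:

```python
# Sketch: numerical probe of non-expansiveness for the box projection.
import numpy as np

proj = lambda x: np.clip(x, -1.0, 1.0)   # projection on C = [-1, 1]^n
rng = np.random.default_rng(7)
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=5), 3.0 * rng.normal(size=5)
    assert np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
print("projection is non-expansive on all sampled pairs")
```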

Proof of Projection Property (a)

(a) The relation evidently holds for any x and y with ΠC[x] = ΠC[y]. Consider now arbitrary x, y ∈ Rn with ΠC[x] ≠ ΠC[y]. By Projection Theorem (b), we have

⟨ΠC[x] − x, z − ΠC[x]⟩ ≥ 0 for all z ∈ C (1)

⟨ΠC[y] − y, z − ΠC[y]⟩ ≥ 0 for all z ∈ C (2)

Using z = ΠC[y] in Eq. (1) and z = ΠC[x] in Eq. (2), and adding the resulting inequalities, we obtain

⟨ΠC[y] − y + x − ΠC[x], ΠC[x] − ΠC[y]⟩ ≥ 0

implying that

⟨x − y, ΠC[x] − ΠC[y]⟩ ≥ ‖ΠC[x] − ΠC[y]‖²

Since ΠC[x] ≠ ΠC[y], it follows (by the Cauchy-Schwarz inequality) that

‖y − x‖ ≥ ‖ΠC[x] − ΠC[y]‖

Proof of Projection Property (b)

(b) Note that the distance function is equivalently given by

dist(x, C) = min_{z∈C} ‖x − z‖ for all x ∈ Rn

The function h(x, z) = ‖x − z‖ is convex in (x, z) over Rn × Rn. The set C is convex; hence dist(x, C) is convex (see the lecture on operations preserving convexity of functions).

Fejér Para-contraction Property

Th. Let C ⊆ Rn be a nonempty closed convex set. The projection mapping is a para-contraction with respect to the set C, i.e.,

‖ΠC[x] − y‖² ≤ ‖y − x‖² − ‖ΠC[x] − x‖² for all x ∈ Rn, y ∈ C

Optimality Property: Fixed-Point Interpretation

Let f be a differentiable convex function and let C be a nonempty closed convex set

Theorem A vector x∗ is optimal if and only if

x∗ = ΠC[x∗ − α∇f(x∗)] for all α > 0

Proof By the optimality principle,

⟨∇f(x∗), z − x∗⟩ ≥ 0 for all z ∈ C

α⟨∇f(x∗), z − x∗⟩ ≥ 0 for all z ∈ C and any α > 0

⟨x∗ − (x∗ − α∇f(x∗)), z − x∗⟩ ≥ 0 for all z ∈ C and any α > 0

By the Projection Theorem, a vector z∗ = ΠC[x̂] if and only if

⟨z∗ − x̂, z − z∗⟩ ≥ 0 for all z ∈ C

Hence, with x̂ = x∗ − α∇f(x∗), the relation ⟨x∗ − (x∗ − α∇f(x∗)), z − x∗⟩ ≥ 0 for all z ∈ C and any α > 0 is equivalent to

x∗ = ΠC[x∗ − α∇f(x∗)] for all α > 0

Gradient Methods: Solve Fixed-Point Formulation

x∗ = ΠC[x∗ − α∇f(x∗)] for all α > 0

Recursion

x(t + 1) = ΠC[x(t) − αt∇f(x(t))]

If T : Rn → Rn is a map that also has x∗ ∈ C as a fixed point, we can write another fixed-point relation (whose fixed points are optimal for f over C):

x∗ = βT(x∗) + (1 − β)ΠC[x∗ − α∇f(x∗)] for all α > 0, β ∈ [0, 1]

Since the solutions to min_{x∈C} f(x) are typically not known in advance, take T = I.

Method

x(t + 1) = βt x(t) + (1 − βt) ΠC[x(t) − αt∇f(x(t))]
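A minimal sketch of the basic recursion x(t + 1) = ΠC[x(t) − αt∇f(x(t))] (with βt = 0) on a least-squares objective over a box; the data, the constant stepsize αt = 1/L, and the iteration count are illustrative choices:

```python
# Sketch of the projected gradient recursion (beta_t = 0) for
# min ||Ax - b||^2 over the box C = [-1, 1]^n; all data are made up.
import numpy as np

rng = np.random.default_rng(8)
m, n = 20, 5
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

grad = lambda x: 2 * A.T @ (A @ x - b)       # gradient of ||Ax - b||^2
proj_C = lambda x: np.clip(x, -1.0, 1.0)     # projection Pi_C on the box

L = 2 * np.linalg.norm(A.T @ A, 2)           # Lipschitz constant of grad f
alpha = 1.0 / L                              # a standard constant stepsize
x = np.zeros(n)
for t in range(500):
    x = proj_C(x - alpha * grad(x))

# At an optimal point, x is a fixed point of the projected-gradient map
print(np.linalg.norm(x - proj_C(x - alpha * grad(x))))   # ~ 0
```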

Gradient Method: Basic Relation

Recursion/Algorithm

x(t + 1) = ΠC[x(t) − αt∇f(x(t))]

Using the para-contraction property of the projection,

‖ΠC[x] − y‖² ≤ ‖y − x‖² − ‖ΠC[x] − x‖² for all x ∈ Rn, y ∈ C

we obtain for all y ∈ C and all t:

‖x(t + 1) − y‖² ≤ ‖x(t) − αt∇f(x(t)) − y‖² − ‖x(t + 1) − x(t)‖²

Expanding the quadratic term,

‖x(t + 1) − y‖² ≤ ‖x(t) − y‖² − 2αt⟨∇f(x(t)), x(t) − y⟩ + αt²‖∇f(x(t))‖² − ‖x(t + 1) − x(t)‖²

By convexity of the function,

⟨∇f(x(t)), x(t) − y⟩ ≥ f(x(t)) − f(y)

so we have

‖x(t + 1) − y‖² ≤ ‖x(t) − y‖² − 2αt (f(x(t)) − f(y)) + αt²‖∇f(x(t))‖² − ‖x(t + 1) − x(t)‖²

Gradient Methods - Another Way

x(t + 1) = ΠC[x(t) − αt∇f(x(t))]

This is equivalent to

x(t + 1) = argmin_{y∈C} { αt⟨∇f(x(t)), y − x(t)⟩ + (1/2)‖y − x(t)‖² }

which is equivalent to

x(t + 1) = argmin_{y∈C} { ⟨∇f(x(t)), y − x(t)⟩ + (1/(2αt))‖y − x(t)‖² }

This view is suitable when norms other than the Euclidean norm are used! Even semi-norms can be used:

x(t + 1) = argmin_{y∈C} { ⟨∇f(x(t)), y − x(t)⟩ + (1/(2αt)) D(y, x(t)) }

with a Bregman-distance function D(y, x(t)). Analysis of the method's behavior starts by establishing a basic relation through the use of the optimality principle.
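One concrete instance: take C to be the probability simplex and D the Kullback-Leibler divergence (the Bregman distance generated by negative entropy); the argmin step then has a closed multiplicative "exponentiated gradient" form. A sketch under those assumptions, where the stepsize η absorbs the 1/(2αt) scaling and the linear test objective is made up:

```python
# Sketch: mirror descent on the probability simplex with D = KL divergence;
# the argmin step reduces to the multiplicative "exponentiated gradient"
# update x_i <- x_i * exp(-eta * grad_i), renormalized.
import numpy as np

rng = np.random.default_rng(9)
n = 10
c = rng.normal(size=n)
grad = lambda x: c                  # gradient of f(x) = c'x

x = np.ones(n) / n                  # start at the center of the simplex
eta = 0.5
for t in range(200):
    x = x * np.exp(-eta * grad(x))  # closed-form Bregman argmin step
    x /= x.sum()                    # renormalize back onto the simplex

# For a linear objective the minimizer over the simplex is a vertex:
print(x.argmax() == c.argmin())     # mass concentrates on the best coordinate
```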
