
STAT 544 Lecture 5: Convexity – a very short (Bayesian) tour. © Marina Meilă

Convexity is a very elegant and surprisingly powerful mathematical concept that allows us to “think geometrically” about very high-dimensional spaces, such as function spaces. In optimization, convexity is THE MOST important criterion for separating easy problems from hard ones. In statistics, convexity plays a less prominent role, but it appears in key places, as will be seen shortly.

Reading: Boyd and Vandenberghe (cited below as BV), available freely on the web (see Resources page).

1 Convex sets

A set $C$ in a vector space is convex iff for any $x, x' \in C$ the line segment $[x, x'] = \{tx + (1-t)x',\ t \in [0,1]\}$ is contained in $C$.

Basic definitions

• line segment $\{t x_1 + (1-t) x_2 \text{ with } t \in [0,1]\}$

• affine combination $\sum_i t_i x_i$ with $\sum_i t_i = 1$, for fixed $x_{1:k} \in \mathbb{R}^n$.
  Can also be written as $x_1 + \sum_{i=2:k} t_i (x_i - x_1)$ with any real $t_{2:k}$.
  Proof: $t_1 = 1 - \sum_{i=2:k} t_i$. Hence $\sum_{i=1:k} t_i x_i = x_1 - (1-t_1) x_1 + \sum_{i=2:k} t_i x_i = x_1 + \sum_{i=2:k} t_i (x_i - x_1)$, and now $t_{2:k}$ can be any real numbers.

  Hence, the affine hull of $x_{1:k}$ is a linear subspace of dimension at most $k-1$ (exactly $k-1$ when the points are affinely independent), spanned by $x_i - x_1$, $i = 2:k$, and shifted from the origin by $x_1$. Exercise: Show that the result is the same no matter which $x_i$ we choose in place of $x_1$. In particular, the affine hull of two distinct points is the line that passes through them, the affine hull of three non-collinear points is the plane that contains them, etc.

• convex combination $\sum_i t_i x_i$ with $\sum_i t_i = 1$ and $t_i \ge 0$

• conic combination $\sum_i t_i x_i$ with $t_i \ge 0$. Note that 0 is a conic combination of any set of points.

• cone C: if x ∈ C then tx ∈ C for all t > 0

• convex set, affine set, cone, convex cone

• convex hull, affine hull, conic hull

• relative interior, (relative) boundary
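The difference between convex and affine combinations can be checked numerically. The sketch below is only an illustration: the three points, the unit disc as the convex set, and the helper names `combine`, `random_convex_weights` and `in_unit_disc` are assumptions made for this example, not part of the lecture.

```python
import random

def combine(points, weights):
    """Return sum_i weights[i] * points[i] for points given as tuples."""
    dim = len(points[0])
    return tuple(sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim))

def random_convex_weights(k, rng):
    """Nonnegative weights that sum to 1 (weights of a convex combination)."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def in_unit_disc(p, tol=1e-12):
    return p[0] ** 2 + p[1] ** 2 <= 1.0 + tol

rng = random.Random(0)
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # illustrative points of the disc

# Convex combinations of points of a convex set stay inside the set.
all_inside = all(
    in_unit_disc(combine(points, random_convex_weights(3, rng)))
    for _ in range(1000)
)

# An affine combination (weights sum to 1, one weight negative) may leave it.
affine_point = combine(points, [2.0, -1.5, 0.5])
print(all_inside, affine_point, in_unit_disc(affine_point))
```

The affine combination lands outside the disc because affine weights are allowed to be negative; only the constraint $t_i \ge 0$ keeps the point inside the convex hull.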

1.1 Examples of Convex sets

1. hyperplane $a^T (x - x_0) = 0$

2. half-space $a^T x - b \ge 0$

3. ball, ellipsoid

4. polytope = (bounded) intersection of $m$ half-spaces

5. simplex = convex hull of $k+1$ affinely independent points (i.e. $x_1 - x_{k+1}, x_2 - x_{k+1}, \ldots$ are linearly independent)

6. all symmetric matrices are an affine set and a convex set (unbounded)

7. all positive definite matrices are a convex cone

8. all stochastic matrices are a convex set

9. all doubly stochastic matrices are a convex set (what are its extreme points?)
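As a quick sanity check of the last example, one can mix two doubly stochastic matrices and verify that the result is again doubly stochastic. This is a minimal sketch; the two matrices and the helper names `is_doubly_stochastic` and `mix` are made up for illustration.

```python
def is_doubly_stochastic(M, tol=1e-12):
    """Nonnegative entries, every row and every column sums to 1."""
    n = len(M)
    nonneg = all(M[i][j] >= -tol for i in range(n) for j in range(n))
    rows = all(abs(sum(M[i]) - 1.0) <= tol for i in range(n))
    cols = all(abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonneg and rows and cols

def mix(A, B, t):
    """Entrywise convex combination t*A + (1-t)*B."""
    n = len(A)
    return [[t * A[i][j] + (1 - t) * B[i][j] for j in range(n)] for i in range(n)]

# Two hypothetical doubly stochastic matrices.
A = [[0.2, 0.3, 0.5], [0.3, 0.5, 0.2], [0.5, 0.2, 0.3]]
B = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]

checks = [is_doubly_stochastic(mix(A, B, k / 10)) for k in range(11)]
print(all(checks))
```

The check succeeds for every $t \in [0,1]$ because row sums, column sums and nonnegativity are all preserved by convex combinations.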

Convex sets in probability

1. the parameter space of all normal distributions over $\mathbb{R}^d$ is a convex set

2. the (parameter) space of all discrete distributions over some countable space $X$

3. all distributions with a fixed set of marginals

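4. the conditional distributions of a discrete joint distribution. Let $(X, Y) \in \Omega_X \times \Omega_Y$ with $|\Omega_X| = m$, $|\Omega_Y| = n$ be two discrete random variables, and let $\Theta$ be the set of all probability distributions over $\Omega_X \times \Omega_Y$.

That is, we define $P_\theta(X = i, Y = j) = \theta_{ij}$; then $\Theta = \{\theta = [\theta_{ij}]_{ij} \in [0,1]^{m \times n},\ \sum_{ij} \theta_{ij} = 1\}$. Imagine $\theta$ to be rearranged as a vector of dimension $mn$:

$$\operatorname{vec}(\theta) = [\,\theta_{11}\ \theta_{12}\ \ldots\ \theta_{21}\ \theta_{22}\ \ldots\ \theta_{mn}\,]^T \tag{1}$$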

We use the linear-fractional (or projective) function (BV page 41)

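$$f(z) = \frac{Az + b}{c^T z + d}, \qquad \operatorname{dom} f = \{z \mid c^T z + d > 0\} \tag{2}$$

which maps a convex set into a convex set. Let now $z = \operatorname{vec}(\theta)$, $b = 0$, $d = 0$,

$$A_{ij,\,i'j'} = \begin{cases} 1 & \text{if } j = j' = j_0,\ i = i' \\ 0 & \text{otherwise} \end{cases} \qquad c_{ij} = \begin{cases} 1 & \text{if } j = j_0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

In other words, $c^T \operatorname{vec}(\theta) = \sum_i \theta_{i j_0} = P_\theta[Y = j_0]$, and row $(i, j_0)$ of $A$ multiplied by $\operatorname{vec}(\theta)$ gives $\sum_{i',j'} A_{i j_0,\,i'j'}\,\theta_{i'j'} = \theta_{i j_0} = P_\theta[X = i, Y = j_0]$; the rows $(i, j)$ with $j \neq j_0$ are zero.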

Hence, f(θ) = Pθ[X|Y = j0] for any θ ∈ Θ with Pθ(Y = j0) > 0. This subset of Θ is also convex (but not closed), therefore we conclude that the set of all conditional probabilities given a fixed Y is a convex set.

5. all distributions over R with E[X] in a convex set (in particular E[X] fixed, E[X] ≥ a)
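The linear-fractional argument of item 4 can be illustrated numerically: the conditional of a mixture of two joint tables lies on the segment between the two conditionals, with a reweighted mixing coefficient. The tables `theta_a`, `theta_b` and the helper names below are hypothetical choices made for this sketch.

```python
def conditional_given_y(theta, j0):
    """P(X = i | Y = j0), i.e. the nonzero block of (A z) / (c^T z), z = vec(theta)."""
    numer = [row[j0] for row in theta]   # entries theta[i][j0] = P(X = i, Y = j0)
    denom = sum(numer)                   # c^T z = P(Y = j0)
    if denom <= 0:
        raise ValueError("theta is outside dom f: P(Y = j0) must be positive")
    return [v / denom for v in numer]

def mix(theta_a, theta_b, t):
    """Entrywise convex combination of two joint probability tables."""
    return [[t * a + (1 - t) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(theta_a, theta_b)]

# Two hypothetical joint distributions over a 2 x 2 space.
theta_a = [[0.10, 0.20], [0.30, 0.40]]
theta_b = [[0.25, 0.25], [0.40, 0.10]]
t, j0 = 0.5, 0

cond_mix = conditional_given_y(mix(theta_a, theta_b, t), j0)

# The image point is a convex combination of the two conditionals,
# with weight s proportional to t * P_a(Y = j0).
p_a = sum(row[j0] for row in theta_a)
p_b = sum(row[j0] for row in theta_b)
s = t * p_a / (t * p_a + (1 - t) * p_b)
cond_a = conditional_given_y(theta_a, j0)
cond_b = conditional_given_y(theta_b, j0)
on_segment = [s * a + (1 - s) * b for a, b in zip(cond_a, cond_b)]
print(cond_mix, on_segment)
```

Note that $s \neq t$ in general: a linear-fractional map sends segments to segments but does not preserve the mixing weight.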

Convex spaces of functions

1. polynomials of degree at most $n$; all polynomials

2. {g | g ≥ f0} with f0 fixed

3. $\{g \mid \int |g|^p < a\}$ with $p \ge 1$, $a \in (0, \infty)$ (the $L_p$ balls)

4. all convex functions on a set $X$

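2 Convex functions

A function $f : C \subseteq \mathbb{R}^k \to \mathbb{R}$, with $C$ convex, is convex iff it satisfies Jensen’s inequality

$$f(tx + (1-t)x') \le t f(x) + (1-t) f(x') \quad \text{for all } t \in [0,1],\ x, x' \in C. \tag{4}$$

Equivalent definitions: assuming $f : C \subseteq \mathbb{R}^k \to \mathbb{R}$, $f$ is convex iff

• the epigraph of $f$, $\operatorname{epi} f = \{(x, y) \in \mathbb{R}^k \times \mathbb{R},\ x \in C,\ y \ge f(x)\}$, is a convex set.

• (assuming $\nabla f$ is defined on the interior of $C$)

$$f(x') \ge f(x) + \nabla f(x)^T (x' - x) \quad \text{for all } x, x' \in C. \tag{5}$$

• (assuming $\nabla^2 f$ is defined on the interior of $C$)

$$\nabla^2 f(x) \succeq 0 \quad \text{for all } x \in C, \tag{6}$$

where the notation $A \succeq 0$ means that $A$ is a positive semidefinite matrix.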

Examples of convex functions

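• $e^{a^T x}$ for any $a \in \mathbb{R}^k$, $x \in \mathbb{R}^k$.

• $-\ln x$, $x \in (0, \infty)$.

• $x \ln x$, $x \in (0, \infty)$.

• $\|x\|$ for any norm $\|\cdot\|$, $x \in \mathbb{R}^k$.

• $\ln(e^{x_1} + e^{x_2} + \ldots + e^{x_k})$ with $x = [x_1 \ldots x_k]^T \in \mathbb{R}^k$.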
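Jensen's inequality (4) can be spot-checked numerically for the last example, the log-sum-exp function. The sketch below samples random pairs of points and confirms that the gap $t f(x) + (1-t) f(x') - f(tx + (1-t)x')$ is never negative; a finite sample is of course only a sanity check, not a proof. The function names and the sampling ranges are assumptions made for this illustration.

```python
import math
import random

def log_sum_exp(x):
    """ln(sum_i exp(x_i)), computed stably by factoring out the maximum."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def jensen_gap(f, x, xp, t):
    """t f(x) + (1-t) f(x') - f(t x + (1-t) x'); nonnegative when f is convex."""
    z = [t * a + (1 - t) * b for a, b in zip(x, xp)]
    return t * f(x) + (1 - t) * f(xp) - f(z)

rng = random.Random(1)
gaps = []
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in range(4)]
    xp = [rng.uniform(-5, 5) for _ in range(4)]
    gaps.append(jensen_gap(log_sum_exp, x, xp, rng.random()))

min_gap = min(gaps)
print(min_gap >= -1e-9)
```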

Convex functions in statistics

• Any marginal of a discrete distribution. Let $(X, Y) \in \Omega_X \times \Omega_Y$ with $|\Omega_X| = m$, $|\Omega_Y| = n$ be two discrete random variables, and let $\Theta$ be the set of all probability distributions over $\Omega_X \times \Omega_Y$. That is, we define $P_\theta(X = i, Y = j) = \theta_{ij}$; then $\Theta = \{\theta = [\theta_{ij}]_{ij} \in [0,1]^{m \times n},\ \sum_{ij} \theta_{ij} = 1\}$. The marginal $P_X(i) = \sum_j \theta_{ij}$ is a linear function of the entries of $\theta$, therefore it is convex.

• The normalization constant of an exponential family, $Z(\theta)$.
  Proof: $e^{-\theta^T x}$ is convex in $\theta \in \mathbb{R}^n$ for any $x$; then $Z(\theta) = \sum_x e^{-\theta^T x}$ is convex as a sum of convex functions of $\theta$.

• $\log Z(\theta)$ is convex.
  Proof: The proof is statistical. Remember that

  $$\operatorname{Var}_\theta(x) = \nabla^2 \ln Z(\theta)$$

  The convexity follows from the positive semidefiniteness of the variance.

• The KL-divergence $KL(p\|q)$ for any two distributions $p, q$ on $\Omega$ (continuous or discrete) is jointly convex in $(p, q)$.

• An f-divergence is a score function defined as follows

  $$D_f(P\|Q) = \int_\Omega f\!\left(\frac{dP}{dQ}\right) dQ \tag{7}$$

  where $P, Q$ are distributions over $\Omega$ and $f$ is a convex function on $(0, \infty)$. Exercise: What functions do you obtain for $f(t) = t \ln t$, $(\sqrt{t} - 1)^2$, $(t - 1)^2$, $\frac{1}{2}|t - 1|$?
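The identity $\operatorname{Var}_\theta(x) = \nabla^2 \ln Z(\theta)$ used in the proof above can be verified numerically in one dimension. The sketch assumes a small finite sample space `xs` (a hypothetical choice) and compares a central finite-difference estimate of the second derivative of $\ln Z$ with the variance under $p_\theta(x) \propto e^{-\theta x}$.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical finite sample space

def log_Z(theta):
    """ln Z(theta) with Z(theta) = sum_x exp(-theta * x)."""
    return math.log(sum(math.exp(-theta * x) for x in xs))

def variance(theta):
    """Variance of x under p_theta(x) = exp(-theta * x) / Z(theta)."""
    w = [math.exp(-theta * x) for x in xs]
    Z = sum(w)
    mean = sum(wi * x for wi, x in zip(w, xs)) / Z
    return sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs)) / Z

def second_derivative(f, theta, h=1e-3):
    """Central finite-difference approximation of f''(theta)."""
    return (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h ** 2

for theta in [-1.0, 0.0, 0.5, 2.0]:
    print(theta, second_derivative(log_Z, theta), variance(theta))
```

Since a variance is never negative, the second derivative of $\ln Z$ is nonnegative everywhere, which is exactly the convexity claim in this scalar case.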

3 The conjugate of a convex function

The conjugate of the convex function $f$ is defined as

$$f^*(y) = \sup_{x \in \operatorname{dom} f} \big( y^T x - f(x) \big) \tag{8}$$

The domain of $f^*$ is the set of $y$’s for which the supremum above is finite. Note that $f^*$ is always convex in $y$, as a supremum of functions that are affine in $y$.

Let $g(x, y) = y^T x - f(x)$. If $f$ is differentiable and convex, then $\sup_x g(x, y)$ can be calculated by setting the gradient w.r.t. $x$ to zero.

$$\nabla_x g(x, y) = y - \nabla f(x) = 0 \tag{9}$$

$$y = \nabla f(x) \ \Rightarrow\ \text{solution } x^* \tag{10}$$

If $f$ is convex, then $g$ is concave in $x$, so $x^*$ is a maximum. If the solution above is unique, then we say the pair $(x^*, y) = (x^*, \nabla f(x^*))$ is a Legendre conjugate pair. If the solution is unique for every $y$, then we can write

$$f^*(y) + f(x^*) = y^T x^* = \nabla f(x^*)^T x^* \tag{11}$$

Because the supremum of $g(x, y)$ is attained at $x^*$, it follows that for every $x$ in the domain of $f$ the value $y^T x - f(x)$ is no larger than $f^*(y)$, that is

$$f^*(y) + f(x) \ge y^T x \quad \text{for all } x, y \tag{12}$$

This is called the Fenchel-Legendre inequality.
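As a worked illustration, take $f(x) = x \ln x$ on $(0, \infty)$, one of the convex examples of Section 2. Condition (10) gives $y = \ln x + 1$, so $x^* = e^{y-1}$ and, by (11), $f^*(y) = y e^{y-1} - e^{y-1}(y - 1) = e^{y-1}$. The sketch below compares this closed form with a brute-force maximization over a grid and spot-checks inequality (12); the grid bounds and function names are assumptions chosen for this example.

```python
import math

def f(x):
    """f(x) = x ln x on (0, infinity)."""
    return x * math.log(x)

def conjugate_closed_form(y):
    """f*(y) = exp(y - 1), obtained from y = f'(x) = ln x + 1."""
    return math.exp(y - 1.0)

def conjugate_on_grid(y, x_max=20.0, n=20000):
    """Approximate sup_x (y x - f(x)) by a maximum over a grid on (0, x_max]."""
    return max(y * x - f(x) for x in (x_max * k / n for k in range(1, n + 1)))

for y in [-1.0, 0.0, 1.0, 2.0]:
    print(y, conjugate_on_grid(y), conjugate_closed_form(y))

# Inequality (12): f*(y) + f(x) >= y x on a few sample pairs.
fenchel_ok = all(
    conjugate_closed_form(y) + f(x) >= y * x - 1e-12
    for x in [0.1, 0.5, 1.0, 3.0]
    for y in [-1.0, 0.0, 1.0, 2.0]
)
print(fenchel_ok)
```

The grid maximum slightly underestimates the supremum, which is why the comparison is made up to a small tolerance rather than exactly.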

Proposition: If $f$ is convex, and epi $f$ is closed, then $f^{**} = f$. In this case, we call $f, f^*$ a Legendre conjugate pair of functions.

Exercise: Calculate the conjugates of: $e^x$; $-\ln x$; $Ax + b$, $x \in \mathbb{R}^d$; $1 - x^2$, $x \in [0, 1]$; $\ln(e^{x_1} + e^{x_2} + \ldots + e^{x_m})$; $\frac{1}{x}$; $\|x\|^2$. Find the domains of the respective conjugate functions, and find examples of vectors $y$ which are not in those domains.

Remark: If $f$ is convex, and $y \in \partial f(x)$ for some $x$, then $y \in \operatorname{dom} f^*$ and $y, x$ are a conjugate pair.

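Analogy with the Fourier Transform: The Fourier transform maps $f$ into $\hat{f}(\omega) = \int_{\operatorname{dom} f} e^{-i\omega^T x} f(x)\,dx$, with $f \in L^2 \leftrightarrow \hat{f} \in L^2$. The Legendre-Fenchel transform (8) maps $f$ into $f^*$ with $e^{-f^*(y)} = \inf_{x \in \operatorname{dom} f} e^{-y^T x} e^{f(x)}$, with $f$ convex $\leftrightarrow$ $f^*$ convex.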

The next sections are for additional reading.

4 [Optional: Strictly convex and strongly convex functions]

A function is strictly convex if Jensen’s inequality is strict whenever $t \in (0, 1)$ and $x \neq x'$, i.e.

$$t f(x) + (1-t) f(x') > f(tx + (1-t)x') \quad \text{for all } t \in (0, 1),\ x \neq x' \tag{13}$$

The concept of subgradient is a generalization of the gradient for functions which are not differentiable. A subgradient of a convex function $f$ at point $x$ is any vector $g \in \mathbb{R}^n$ so that

$$f(x') \ge f(x) + g^T (x' - x) \quad \text{for all } x' \in \operatorname{dom} f \tag{14}$$

In other words, $g$ is a subgradient iff $(g, -1)$ is the normal of a supporting hyperplane of the epigraph of $f$ at $(x, f(x))$. It follows immediately that a convex function admits a subgradient at any point in the interior of its domain. [Exercise: Show that $\partial f(x) = \{g \in \mathbb{R}^n \mid g \text{ subgradient of } f \text{ at } x\}$ is a convex set.] If $\nabla f(x)$ exists, then it is the unique subgradient at $x$.

Example: Let $f(x) = |x|$, $x \in \mathbb{R}$. Then,

$$\partial f(x) = \begin{cases} -1, & x < 0 \\ 1, & x > 0 \\ [-1, 1], & x = 0 \end{cases} \tag{15}$$
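The subgradient inequality (14) for $f(x) = |x|$ can be tested on a grid of points $x'$. This is a minimal sketch with a hypothetical helper `is_subgradient`; passing on a finite grid is only a necessary check, while a single violated point is enough to rule a candidate out.

```python
def is_subgradient(f, g, x, test_points, tol=1e-12):
    """Check f(x') >= f(x) + g * (x' - x) at every test point x' (scalar case)."""
    return all(f(xp) >= f(x) + g * (xp - x) - tol for xp in test_points)

grid = [k / 10 for k in range(-50, 51)]   # test points x' in [-5, 5]

# At x = 0, every g in [-1, 1] passes and values outside the interval fail.
at_zero = {g: is_subgradient(abs, g, 0.0, grid) for g in [-1.0, -0.3, 0.0, 1.0, 1.2]}

# At x = 2, where |x| is differentiable, only g = f'(2) = 1 passes.
at_two = {g: is_subgradient(abs, g, 2.0, grid) for g in [0.9, 1.0, 1.1]}

print(at_zero)
print(at_two)
```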

A function $f$ is $\mu$-strongly convex iff there is a $\mu > 0$ so that

$$f(x') \ge f(x) + g^T (x' - x) + \frac{\mu}{2} \|x' - x\|^2 \quad \text{for all } x, x' \in \operatorname{dom} f \text{ and all } g \in \partial f(x) \tag{16}$$

The notion of strong convexity is a generalization of the condition

$$\nabla^2 f(x) \succeq \mu I \tag{17}$$

from twice differentiable functions to all convex functions.

[Exercise: Show that (17) implies (16) when the Hessian is defined every- where.] Strong convexity implies strict convexity, but the converse is not true. For example, the function f(x) = 1/x, x ∈ (0, ∞) is strictly convex but not strongly convex.

Exercise: Is $f(x) = |x|$ strongly convex?

Exercise: Show that if $0 \in \partial f(x^*)$ then $x^*$ is a global minimum of $f$.

5 [Optional: Log-concave functions]

[Reading BV 3.5]

A function $f$ is concave iff $-f$ is convex; $f$ is log-concave iff $\ln f$ is concave, which is equivalent to

$$f(tx + (1-t)x') \ge f(x)^t f(x')^{1-t} \quad \text{for all } x, x' \in \operatorname{dom} f,\ t \in [0, 1] \tag{18}$$
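Inequality (18) is easiest to test after taking logarithms. The sketch below checks it on random samples for the standard normal density and, as a contrasting example that is not part of the lecture, shows one violation for the Cauchy density, which is not log-concave. All function names are assumptions made for this illustration.

```python
import math
import random

def log_concavity_gap(log_f, x, xp, t):
    """ln f(t x + (1-t) x') - t ln f(x) - (1-t) ln f(x'); >= 0 when (18) holds."""
    return log_f(t * x + (1 - t) * xp) - t * log_f(x) - (1 - t) * log_f(xp)

def log_normal_pdf(x):
    """Log of the standard normal density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_cauchy_pdf(x):
    """Log of the standard Cauchy density (illustrative counterexample)."""
    return -math.log(math.pi * (1.0 + x * x))

rng = random.Random(2)
triples = [(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.random())
           for _ in range(1000)]

normal_min_gap = min(log_concavity_gap(log_normal_pdf, x, xp, t)
                     for x, xp, t in triples)
cauchy_gap = log_concavity_gap(log_cauchy_pdf, 2.0, 6.0, 0.5)

print(normal_min_gap >= -1e-9, cauchy_gap < 0)
```

For the normal density the gap equals $\frac{1}{2} t (1-t) (x - x')^2$, so it is nonnegative up to rounding; the single negative gap for the Cauchy density is enough to show that (18) fails there.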

Simple examples of log-concave and log-convex functions

• f(x) = aT x + b on {aT x + b > 0} is log-concave

• $f(x) = e^{a^T x}$ is log-concave and log-convex

• $\varphi(x) = \int_0^x \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du$ is log-concave

• $\Gamma(x) = \int_0^\infty e^{-u} u^{x-1}\,du$ is log-convex for $x \ge 1$

• $\det X$, with $X$ a symmetric, $\succ 0$ matrix, is log-concave

• $\det X / \operatorname{trace} X$, with $X$ a symmetric, $\succ 0$ matrix, is log-concave

5.1 Properties of log-concave functions

1. If $\nabla^2 f$ exists in the interior of the domain of $f$, then $f$ is log-concave iff

$$f(x) \nabla^2 f(x) \preceq \nabla f(x) \nabla f(x)^T \tag{19}$$

(the above is a “matrix cone inequality”, i.e. $A \preceq B \Leftrightarrow (B - A) \succeq 0$)

2. If $f, g$ are log-concave (log-convex) then their product $fg$ is log-concave (log-convex).

3. If $f, g$ are log-concave then their convolution $f * g = \int_\Omega f(u) g(x - u)\,du$ is log-concave.

4. If $f, g$ are log-convex then their sum $f + g$ is log-convex. From here it follows that $\int_{\Omega_U} w(u) f(u, x)\,du$ is log-convex in $x$ whenever $f(u, x)$ is log-convex in $x$ for given $u$ and $w(u) \ge 0$ for all $u$.

5. From the above it follows that the moment generating function $M(z) = \int_\Omega p(x) e^{z^T x}\,dx$, where $p$ is any probability density on $\Omega \subseteq \mathbb{R}^d$, is log-convex in $z$ [BV ex 3.41]. [Exercise: Prove that $\nabla M(0) = \nabla \log M(0) = E[x]$ and $\nabla^2 M(0) = E[xx^T]$, $\nabla^2 \log M(0) = \operatorname{Var} x$.]

6. Theorem (without proof): If $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is log-concave, then $g(x) = \int_{\mathbb{R}^m} f(u, x)\,du$ is log-concave.
This theorem lets us prove some of the results above (e.g. convolution) as well as

• The CDF of a log-concave density is log-concave

• The marginals of a log-concave probability density are log-concave
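The first consequence can be checked numerically for the standard normal, whose density is log-concave because its logarithm is $-x^2/2$ plus a constant. The sketch below evaluates second differences of $\ln \Phi$ on a grid; concavity requires all of them to be nonpositive. The interval, step size and function names are assumptions chosen for this example, and a grid check is only a sanity check.

```python
import math

def log_normal_cdf(x):
    """ln Phi(x) for the standard normal, using Phi(x) = erfc(-x / sqrt(2)) / 2."""
    return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))

def max_second_difference(log_f, lo, hi, h):
    """Largest value of log_f(x+h) - 2 log_f(x) + log_f(x-h) over a grid in (lo, hi)."""
    worst = -math.inf
    n = int(round((hi - lo) / h))
    for k in range(1, n):
        x = lo + k * h
        worst = max(worst, log_f(x + h) - 2.0 * log_f(x) + log_f(x - h))
    return worst

worst = max_second_difference(log_normal_cdf, -6.0, 6.0, 0.05)
print(worst <= 1e-12)
```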

5.2 Log-concave functions in statistics

Log-concave densities

• The multivariate Normal distribution

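• The uniform distribution over a convex set $C \subset \mathbb{R}^n$

• The (multivariate) exponential $f(x) = \big(\prod_i \lambda_i\big) e^{-\lambda^T x}$ for $x \in [0, \infty)^n$

• The Wishart distribution $f(X) = a (\det X)^{\frac{p-n-1}{2}} e^{-\frac{1}{2} \operatorname{trace}(\Sigma^{-1} X)}$ for $X \succ 0$ symmetric matrix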

6 [Optional: Extreme points and supporting hyperplanes]

6.1 Extreme points

$x \in C$ is an extreme point of the convex set $C$ iff whenever $x = t x_1 + (1-t) x_2$ for some $x_1, x_2 \in C$ and $t \in [0, 1]$ then $x = x_1$ or $x = x_2$. In other words, $x$ cannot be in the (relative) interior of any line segment contained in $C$.

Theorem 0.1 (Krein-Milman) A bounded closed convex set (in $\mathbb{R}^d$) is the closed convex hull of its extreme points. [This theorem extends to spaces of functions too.]

6.2 Separating and supporting hyperplanes

Theorem 0.2 Let $C, D$ be convex sets with $C \cap D = \emptyset$. Then, there exists a hyperplane $a^T x - b = 0$, with $a \neq 0$, that separates $C, D$, i.e. so that $a^T x \le b$ for any $x \in C$ and $a^T x \ge b$ for any $x \in D$.

Strict separation = one of the inequalities is strict

Proposition 1 Let $C$ be convex and closed, and $x_0 \notin C$. Then $x_0$ can be strictly separated from $C$.
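For a concrete closed convex set the separating hyperplane can be written down explicitly. The sketch below takes $C$ to be the closed unit ball (an assumption made for this example), projects $x_0$ onto it, and places the hyperplane halfway between $x_0$ and its projection; the function name is hypothetical.

```python
import math

def separate_point_from_unit_ball(x0):
    """Return (a, b) with a^T x < b on the closed unit ball and a^T x0 > b."""
    norm = math.sqrt(sum(v * v for v in x0))
    if norm <= 1.0:
        raise ValueError("x0 must lie outside the closed unit ball")
    proj = [v / norm for v in x0]                   # projection of x0 onto the ball
    a = [u - p for u, p in zip(x0, proj)]           # normal pointing from C toward x0
    mid = [(u + p) / 2.0 for u, p in zip(x0, proj)]
    b = sum(ai * mi for ai, mi in zip(a, mid))      # hyperplane through the midpoint
    return a, b

x0 = [3.0, 4.0]
a, b = separate_point_from_unit_ball(x0)

# Over the unit ball, a^T x is maximized at x = a / ||a|| with value ||a||.
max_over_ball = math.sqrt(sum(ai * ai for ai in a))
value_at_x0 = sum(ai * xi for ai, xi in zip(a, x0))
print(max_over_ball, b, value_at_x0)
```

Both inequalities are strict here, so the point is strictly separated from the ball; for a general closed convex set the same construction works with the Euclidean projection onto that set in place of `proj`.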

Theorem 1.1 (Supporting hyperplane) Let $C$ be a convex set and $x_0 \in \operatorname{bd} C$ a point on its boundary. Then, there exists a supporting hyperplane for $C$ at $x_0$, i.e. there exists a vector $a \neq 0$ so that $a^T (x - x_0) \ge 0$ for all $x \in C$.

A supporting hyperplane generalizes the notion of tangent, and the above theorem says that a convex set admits a tangent at every point on its boundary.

[Corollary: Theorem of alternatives (BV Example 2.21) The system of linear inequalities $Ax \prec b$, with $A \in \mathbb{R}^{m \times n}$, is infeasible iff the convex sets

$$C = \{b - Ax \mid x \in \mathbb{R}^n\} \quad \text{and} \quad D = \mathbb{R}^m_{++} = \{y \in \mathbb{R}^m \mid y \succ 0\} \tag{20}$$

do not intersect.]
