STAT 544 Lecture 5
Convexity – a very short (Bayesian) tour
© Marina Meilă
[email protected]

Convexity is a very elegant and surprisingly powerful mathematical concept that allows us to "think geometrically" about very high dimensional spaces, such as function spaces. In optimization, convexity is THE MOST important criterion for separating easy problems from hard ones. In statistics, convexity plays a less prominent role, but it appears in key places, as will be seen shortly.

Reading: Boyd and Vandenberghe, Convex Optimization, available freely on the web (see Resources page).

1 Convex sets

A set $C$ in a vector space is convex iff for any $x, x' \in C$ the line segment $[x, x'] = \{tx + (1-t)x',\ t \in [0,1]\}$ is contained in $C$.

Basic definitions

• line segment $\{t x_1 + (1-t) x_2 \text{ with } t \in [0,1]\}$
• affine combination $\sum_i t_i x_i$ with $\sum_i t_i = 1$, for fixed $x_{1:k} \in \mathbb{R}^n$

  Can also be written as $x_1 + \sum_{i=2:k} t_i (x_i - x_1)$ with any real $t_{2:k}$.

  Proof: $t_1 = 1 - \sum_{i=2:k} t_i$. Hence
  $$\sum_{i=1:k} t_i x_i = x_1 - (1 - t_1) x_1 + \sum_{i=2:k} t_i x_i = x_1 + \sum_{i=2:k} t_i (x_i - x_1)$$
  and now $t_{2:k}$ can be any real numbers.

  Hence, the affine hull of $x_{1:k}$ is a subspace of dimension at most $k - 1$ (exactly $k - 1$ when the differences are linearly independent), spanned by $x_i - x_1$, $i = 2:k$, and shifted from the origin by $x_1$.

  Exercise: Show that the result is the same no matter which $x_i$ we choose in place of $x_1$.

  In particular, the affine hull of two distinct points is the line that passes through them, the affine hull of three non-collinear points is the plane that contains them, etc.
• convex combination $\sum_i t_i x_i$ with $\sum_i t_i = 1$ and $t_i \geq 0$
• conic combination $\sum_i t_i x_i$ with $t_i \geq 0$. Note that 0 is a conic combination of any set of points.
• cone $C$: if $x \in C$ then $tx \in C$ for all $t > 0$
• convex set, affine set, cone, convex cone
• convex hull, affine hull, conic hull
• relative interior, (relative) boundary, (relative) closure

1.1 Examples of Convex sets

1. hyperplane $a^T (x - x_0) = 0$
2. half-space $a^T x - b \geq 0$
3. ball, ellipsoid
4. polyhedron = (bounded) intersection of $m$ half-spaces
5. simplex = convex hull of $k + 1$ affinely independent points (i.e. $x_1 - x_{k+1},\ x_2 - x_{k+1}, \ldots$ are linearly independent)
6. all symmetric matrices are an affine set and a convex set (unbounded)
7. all positive definite matrices are a convex cone
8. all stochastic matrices are a convex set
9. all doubly stochastic matrices are a convex set (what are its extreme points?)

Convex sets in probability

1. the parameter space of all normal distributions over $\mathbb{R}^d$ is a convex set
2. the (parameter) space of all discrete distributions over some countable space $X$
3. all distributions with a fixed set of marginals
4. the conditional distributions of a discrete joint

   Let $X, Y \in \Omega_X \times \Omega_Y$ with $|\Omega_X| = m$, $|\Omega_Y| = n$ be two discrete random variables, and let $\Theta$ be the set of all probability distributions over $\Omega_X \times \Omega_Y$. That is, we define $P_\theta(X = i, Y = j) = \theta_{ij}$; then $\Theta = \{\theta = [\theta_{ij}]_{ij} \in [0,1]^{m \times n},\ \sum_{ij} \theta_{ij} = 1\}$. Imagine $\theta$ to be rearranged as a vector of dimension $mn$:
   $$\mathrm{vec}(\theta) = [\,\theta_{11}\ \theta_{12}\ \ldots\ \theta_{21}\ \theta_{22}\ \ldots\ \theta_{mn}\,]^T \tag{1}$$
   We use the linear-fractional (or projective) function (BV page 41)
   $$f(z) = \frac{Az + b}{c^T z + d}, \qquad \mathrm{dom} f = \{z \mid c^T z + d > 0\} \tag{2}$$
   which maps a convex set into a convex set. Let now $z = \mathrm{vec}(\theta)$, $b = 0$, $d = 0$,
   $$A_{ij,\,i'j'} = \begin{cases} 1 & \text{if } j = j' = j_0,\ i = i' \\ 0 & \text{otherwise} \end{cases} \qquad c_{ij} = \begin{cases} 1 & \text{if } j = j_0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
   In other words, $c^T \mathrm{vec}(\theta) = \sum_i \theta_{ij_0} = P_\theta[Y = j_0]$, and row $(i, j_0)$ of $A$, multiplied by $\theta$, gives $\sum_{i',j'} A_{ij_0,\,i'j'}\, \theta_{i'j'} = \theta_{i,j_0} = P_\theta[X = i, Y = j_0]$ (the rows with $j \neq j_0$ are zero). Hence, $f(\theta) = P_\theta[X \mid Y = j_0]$ for any $\theta \in \Theta$ with $P_\theta(Y = j_0) > 0$. This subset of $\Theta$ is also convex (but not closed), and $f$ maps it to a convex set; therefore we conclude that the set of all conditional distributions of $X$ given a fixed value of $Y$ is a convex set.
5. all distributions over $\mathbb{R}$ with $E[X]$ in a convex set (in particular $E[X]$ fixed, $E[X] \geq a$)

Convex spaces of functions

1. polynomials of degree at most $n$; all polynomials
2. $\{g \mid g \geq f_0\}$ with $f_0$ fixed
3. $\{g \mid \int |g|^p < a\}$ with $p \geq 1$, $a \in (0, \infty)$ (the $L_p$ balls)
4. all convex functions on set $X$

2 Convex functions

A function $f : C \subseteq \mathbb{R}^k \to \mathbb{R}$, with $C$ a convex set, is convex iff it satisfies Jensen's inequality
$$f(tx + (1-t)x') \leq t f(x) + (1-t) f(x') \quad \text{for all } t \in [0,1],\ x, x' \in C. \tag{4}$$

Equivalent definitions: assuming $f : C \subseteq \mathbb{R}^k \to \mathbb{R}$, $f$ is convex iff

• the epigraph of $f$, $\mathrm{epi} f = \{(x, y) \in \mathbb{R}^k \times \mathbb{R},\ y \geq f(x)\}$, is a convex set.
• (assuming $\nabla f$ is defined on the interior of $C$)
  $$f(x') \geq f(x) + \nabla f(x)^T (x' - x) \quad \text{for all } x, x' \in C. \tag{5}$$
• (assuming $\nabla^2 f$ is defined on the interior of $C$)
  $$\nabla^2 f(x) \succeq 0 \quad \text{for all } x \in C, \tag{6}$$
  where the notation $A \succeq 0$ means that $A$ is a positive semidefinite matrix.

Examples of convex functions

• $e^{a^T x}$ for any $a \in \mathbb{R}^k$, $x \in \mathbb{R}^k$.
• $-\ln x$, $x > 0$.
• $x \ln x$, $x > 0$.
• $\|x\|$ for any norm $\|\cdot\|$, $x \in \mathbb{R}^k$.
• $\ln(e^{x_1} + e^{x_2} + \ldots + e^{x_k})$ with $x = [x_1 \ldots x_k]^T \in \mathbb{R}^k$.

Convex functions in statistics

• Any marginal of a discrete distribution.
  Let $X, Y \in \Omega_X \times \Omega_Y$ with $|\Omega_X| = m$, $|\Omega_Y| = n$ be two discrete random variables, and let $\Theta$ be the set of all probability distributions over $\Omega_X \times \Omega_Y$. That is, we define $P_\theta(X = i, Y = j) = \theta_{ij}$; then $\Theta = \{\theta = [\theta_{ij}]_{ij} \in [0,1]^{m \times n},\ \sum_{ij} \theta_{ij} = 1\}$. The marginal $P_X(i) = \sum_j \theta_{ij}$ is a linear function of the entries of $\theta$, therefore it is convex.
• The normalization constant of an exponential family $Z(\theta)$.
  Proof: $e^{-\theta^T x}$ is convex in $\theta \in \mathbb{R}^n$ for any $x$; then $Z(\theta) = \sum_x e^{-\theta^T x}$ is convex as a sum of convex functions of $\theta$.
• $\log Z(\theta)$ is convex.
  Proof: The proof is statistical. Remember that
  $$\mathrm{Var}_\theta(x) = \nabla^2 \ln Z(\theta).$$
  The convexity follows from the positive semidefiniteness of the variance (covariance matrix).
• The KL-divergence $KL(p\|q)$ for any two distributions $p, q$ on $\Omega$ (continuous or discrete) is jointly convex in $(p, q)$.
• An f-divergence is a score function defined as follows
  $$D_f(P\|Q) = \int_\Omega f\!\left(\frac{dP}{dQ}\right) dQ \tag{7}$$
  where $P, Q$ are distributions over $\Omega$ and $f$ is a convex function on $(0, \infty)$.
  Exercise: What functions do you obtain for $f(t) = t \ln t$, $(\sqrt{t} - 1)^2$, $(t - 1)^2$, $\frac{1}{2}|t - 1|$?
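The statistical proof that $\log Z(\theta)$ is convex, which rests on $\mathrm{Var}_\theta(x) = \nabla^2 \ln Z(\theta)$, can be checked numerically. The sketch below is only an illustration under stated assumptions: a scalar $\theta$, a small finite sample space `xs` chosen arbitrarily, and helper names (`log_Z`, `variance`, `second_derivative`) that are hypothetical and not part of the lecture. It compares a finite-difference second derivative of $\ln Z$ with the variance under $p_\theta(x) \propto e^{-\theta x}$, and tests Jensen's inequality (4) at one midpoint.

```python
import math


def log_Z(theta, xs):
    """ln Z(theta) for Z(theta) = sum_x exp(-theta * x), computed via log-sum-exp."""
    a = [-theta * x for x in xs]
    m = max(a)
    return m + math.log(sum(math.exp(v - m) for v in a))


def variance(theta, xs):
    """Variance of x under p_theta(x) = exp(-theta * x) / Z(theta)."""
    lz = log_Z(theta, xs)
    probs = [math.exp(-theta * x - lz) for x in xs]
    mean = sum(p * x for p, x in zip(probs, xs))
    return sum(p * (x - mean) ** 2 for p, x in zip(probs, xs))


def second_derivative(fn, theta, h=1e-4):
    """Central finite-difference estimate of fn''(theta)."""
    return (fn(theta + h) - 2.0 * fn(theta) + fn(theta - h)) / (h * h)


# Assumed toy setting: four scalar outcomes and one parameter value.
xs = [0.0, 1.0, 2.5, 4.0]
theta = 0.7

hess = second_derivative(lambda t: log_Z(t, xs), theta)
var = variance(theta, xs)
print("d^2/dtheta^2 ln Z:", hess)
print("Var_theta(x):     ", var)

# Jensen's inequality (4) for ln Z at the midpoint of two parameter values.
a, b = -1.0, 2.0
lhs = log_Z(0.5 * (a + b), xs)
rhs = 0.5 * log_Z(a, xs) + 0.5 * log_Z(b, xs)
print("midpoint convexity holds:", lhs <= rhs)
```

The two printed numbers should agree only up to finite-difference error, so the comparison is a sanity check rather than a proof. The log-sum-exp form is used to avoid overflow for large $|\theta|$.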
3 The conjugate of a convex function

The conjugate of the convex function $f$ is defined as
$$f^*(y) = \sup_{x \in \mathrm{dom} f} \left[\, y^T x - f(x) \,\right] \tag{8}$$
The domain of $f^*$ is the set of $y$'s for which the supremum above is finite. Note that $f^*$ is always convex in $y$, as a supremum of functions that are linear (affine) in $y$.

Let $g(x, y) = y^T x - f(x)$. If $f$ is differentiable and convex, then $\sup_x g(x, y)$ can be calculated by taking the derivative w.r.t. $x$.
$$\nabla_x g(x, y) = y - \nabla f(x) = 0 \tag{9}$$
$$y = \nabla f(x) \ \Rightarrow\ \text{solution } x^* \tag{10}$$
If $f$ is convex, then $x^*$ is a maximum. If the solution above is unique, then we say the pair $(x^*, y) = (x^*, \nabla f(x^*))$ is a Legendre conjugate pair. If the solution is unique for every $y$, then we can write
$$f^*(y) + f(x^*) = y^T x^* = \nabla f(x^*)^T x^* \tag{11}$$
Because the supremum of $g(\cdot, y)$ is attained at $x^*$, it follows that for every $x$ in the domain of $f$ the r.h.s. is no larger than the l.h.s., that is
$$f^*(y) + f(x) \geq y^T x \quad \text{for all } x, y \tag{12}$$
This is called the Fenchel-Legendre inequality.

Proposition: If $f$ is convex, and $\mathrm{epi} f$ is closed, then $f^{**} = f$. In this case, we call $f, f^*$ a Legendre conjugate pair of functions.

Exercise: Calculate the conjugates of: $e^x$, $-\ln x$, $Ax + b,\ x \in \mathbb{R}^d$, $1 - x^2,\ x \in [0,1]$, $\ln(e^{x_1} + e^{x_2} + \ldots + e^{x_m})$, $x^{1}$, $\|x\|^{2}$. Find the domains of the respective conjugate functions, and find examples of vectors $y$ which are not in those domains.

Remark: If $f$ is convex, and $y \in \partial f(x)$ for some $x$, then $y \in \mathrm{dom} f^*$ and $y, x$ are a conjugate pair.

Analogy with the Fourier Transform: The Fourier transform maps $f$ into $\hat{f}(\omega) = \int_{\mathrm{dom} f} e^{-i\omega^T x} f(x)\,dx$, with $f \in L^2 \leftrightarrow \hat{f} \in L^2$. The Legendre-Fenchel transform (8) maps $f$ into $\hat{f}$ given by $e^{\hat{f}(y)} = \inf_{x \in \mathrm{dom} f} e^{-y^T x} e^{f(x)}$, that is $\hat{f} = -f^*$, with $f$ convex $\leftrightarrow$ $f^*$ convex.

The next sections are for additional reading.

4 [Optional: Strictly convex and strongly convex functions]

A function is strictly convex if Jensen's inequality is strict whenever $t \in (0, 1)$, i.e.
