Statistics 210B, Spring 1998 Class Notes

P.B. Stark, [email protected], www.stat.berkeley.edu/~stark/index.html. May 8, 1998

Seventh Set of Notes

1 Optimization

For references, see D.G. Luenberger, 1969, Optimization by Vector Space Methods, John Wiley and Sons, Inc., NY; E.J. Anderson and P. Nash, 1987, Linear Programming in Infinite-Dimensional Spaces, Wiley, NY; M.S. Bazaraa and C.M. Shetty, 1979, Nonlinear Programming: Theory and Algorithms, Wiley, NY; Shor, 1985, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, NY.

Many questions in statistical theory can be reduced to optimization problems, sometimes in infinite-dimensional spaces (spaces of functions or measures, for example). For example, we have just seen in Donoho's work how the difficulty of certain minimax estimation problems can be related to the modulus of continuity, whose computation is an optimization problem over a convex subset of $\ell_2$.

Some unconstrained optimization problems with differentiable, convex objective functions are fairly straightforward to solve, e.g., using the calculus. Solving differentiable, unconstrained, convex problems numerically is not typically difficult (descent algorithms can be used), unless evaluating the objective or its derivative is extremely computationally intensive. Constrained problems, nondifferentiable problems, and nonconvex problems are typically much harder. Even for convex functions whose derivative exists almost everywhere, the steepest descent algorithm can converge to a nonstationary point (see Shor, Ch. 2, \S 2.1, for an example).

Nondifferentiable objective functionals arise fairly frequently in statistics. For example, the absolute value function is not differentiable at zero, so the problem of finding the median as the solution of the optimization problem of minimizing the sum of the absolute deviations (or of finding a multivariate generalization of the median) is a convex, nondifferentiable optimization problem. Similarly, the objective functionals for minimum $\ell_1$ and minimum $\ell_\infty$ regression are nondifferentiable. (A small numerical illustration follows.)
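As a concrete illustration (not part of the original notes; the data and step-size rule are arbitrary choices), the following Python sketch minimizes the sum of absolute deviations by a basic subgradient method of the kind Shor analyzes, and recovers the sample median despite the kinks in the objective:

\begin{verbatim}
import numpy as np

# Sketch: the sample median minimizes sum_i |x - a_i|, a convex but
# nondifferentiable objective.  A subgradient method with diminishing
# steps copes with the kinks where the derivative does not exist.
rng = np.random.default_rng(0)
a = rng.normal(size=101)            # data; odd n, so the median is unique

def subgrad(x):
    # A subgradient of sum_i |x - a_i| is sum_i sign(x - a_i).
    return np.sum(np.sign(x - a))

x = 0.0
for k in range(1, 20001):
    x -= (1.0 / k) * subgrad(x)     # diminishing step sizes

print(x, np.median(a))              # the iterate approaches the median
\end{verbatim}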

Some quite interesting statistical problems have convex objective functionals, but nonconvex constraints, such as signal recovery problems subject to constraints on the size of the support of the signal (sparsity constraints).

Linear equality constraints (such as $\langle x, g\rangle = 0$) are fairly straightforward to deal with; one can project the problem onto the subspace where the constraint is satisfied. Linear inequalities are somewhat harder. Two of the most useful tools for solving constrained infinite-dimensional optimization problems are Fenchel and Lagrange duality.

A cone in a real linear vector space $\mathcal{X}$ is a set $P \subset \mathcal{X}$ such that if $x \in P$, then $\alpha x \in P$ for all $\alpha > 0$. One can establish a partial order on a vector space $\mathcal{X}$ with a cone $P$ (then called the positive cone) by defining $x \ge y$ if $x - y \in P$. If $\mathcal{X}$ is a topological space (such as a normed space with topology inherited from the norm), and if the interior of the positive cone $P$ is nonempty in the topology of $\mathcal{X}$, we write $x > y$ if $x - y \in P^\circ$, the interior of $P$. (Note that this differs from the definition of $<$ we used for a totally ordered set, where $<$ meant $\le$ but not $=$; here, $<$ derives from topological properties of the positive cone that defines the order.)

If $\mathcal{X}$ is a linear vector space, $\mathcal{X}^*$ denotes the linear space of all linear functionals defined on $\mathcal{X}$, and is called the algebraic dual of $\mathcal{X}$. If $\mathcal{X}$ is a normed space, by default $\mathcal{X}^*$ is the space of bounded linear functionals on $\mathcal{X}$ (called the normed dual space of $\mathcal{X}$) unless otherwise specified. Denote by $\langle x^*, x\rangle$ the action of the linear functional $x^* \in \mathcal{X}^*$ on the element $x \in \mathcal{X}$. The natural mapping from a space $\mathcal{X}$ to its second dual $\mathcal{X}^{**}$ (the dual of its dual) is $\langle x^{**}, x^*\rangle = \langle x^*, x\rangle$. Clearly each $x \in \mathcal{X}$ gives a linear functional $x^{**}$ on $\mathcal{X}^*$ this way. If every (bounded) linear functional on $\mathcal{X}^*$ can be obtained this way (i.e., if $\mathcal{X}^{**} = \mathcal{X}$), $\mathcal{X}$ is said to be reflexive. The epigraph of a functional $f$ on a set $C \subset \mathcal{X}$ is

\[
[f,C]_+ \equiv \{(t,x) \in \mathbb{R}\times\mathcal{X} : x \in C,\ f(x) \le t\}. \tag{1}
\]

If $C$ is convex, $[f,C]_+$ is a convex set in $\mathbb{R}\times\mathcal{X}$ iff $f$ is a convex functional. Similarly, let

\[
[g,D]_- \equiv \{(r,x) \in \mathbb{R}\times\mathcal{X} : x \in D,\ g(x) \ge r\}. \tag{2}
\]

Definition. A linear variety in a vector space $\mathcal{X}$ is the translation of a subspace $S$ in $\mathcal{X}$. That is, if $S$ is a subspace, then for every $x \in \mathcal{X}$, $S + x$ is a linear variety. A hyperplane $H$ in a vector space $\mathcal{X}$ is a maximal proper linear variety; that is, a linear variety such that if $Y$ is another linear variety in $\mathcal{X}$ and $H \subset Y$, then either $Y = H$ or $Y = \mathcal{X}$. A hyperplane can be characterized by a linear functional: every hyperplane can be written as $\{x \in \mathcal{X} : \langle x^*, x\rangle + b = 0\}$ for some nonzero $x^* \in \mathcal{X}^*$ and some $b \in \mathbb{R}$. Every set of the form $\{x \in \mathcal{X} : \langle x^*, x\rangle + b = 0\}$ with $x^*$ nonzero is a hyperplane. If $x^*$ is a nonzero linear functional on a normed vector space $\mathcal{X}$, the hyperplanes $\{x : \langle x^*, x\rangle + b = 0\}$ are closed for every $b \in \mathbb{R}$ iff $x^*$ is continuous.

Theorem 1 Separating hyperplane theorem. Suppose that $C$, $D$ are convex subsets of a normed vector space $\mathcal{X}$, that $C$ contains interior points, and that $D$ contains no interior point of $C$. Then there is a hyperplane separating the sets: there is an element $x^* \in \mathcal{X}^*$ s.t.

\[
\sup_{x \in C} \langle x^*, x\rangle \le \inf_{x \in D} \langle x^*, x\rangle. \tag{3}
\]
Thus there is a number $b \in \mathbb{R}$ s.t.

\[
\langle x^*, x\rangle + b \le 0 \quad \forall x \in C, \qquad
\langle x^*, x\rangle + b \ge 0 \quad \forall x \in D. \tag{4}
\]
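The following Python sketch (ours, not from the notes) checks (3)--(4) numerically for two disjoint convex sets in $\mathbb{R}^2$; the separating functional $x^*$ is chosen by hand rather than derived:

\begin{verbatim}
import numpy as np

# Sketch: C is the unit ball at the origin, D the unit ball at (3, 0);
# x* = (1, 0) separates them.  We verify (3)-(4) on random samples.
rng = np.random.default_rng(1)

def sample_ball(center, n=20000):
    v = rng.normal(size=(n, 2))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return center + np.sqrt(rng.uniform(size=(n, 1))) * v

C = sample_ball(np.array([0.0, 0.0]))
D = sample_ball(np.array([3.0, 0.0]))
x_star = np.array([1.0, 0.0])

sup_C = (C @ x_star).max()          # approximately 1
inf_D = (D @ x_star).min()          # approximately 2
b = -0.5 * (sup_C + inf_D)          # any b in [-inf_D, -sup_C] works
print(sup_C <= inf_D, sup_C + b <= 0, inf_D + b >= 0)
\end{verbatim}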

1.1 Algebraic Duality

We always take $\inf_{x \in \emptyset} f(x) = \infty$ and $\sup_{x \in \emptyset} f(x) = -\infty$. For any functional $f$ on a real linear vector space $\mathcal{X}$ with dual $\mathcal{X}^*$ and any sets $C, D \subset \mathcal{X}$, consider the value of the primal problem

\[
v(\mathcal{P}) \equiv \inf_{x \in C \cap D} f(x). \tag{5}
\]

Clearly, for any $x^* \in \mathcal{X}^*$,

\begin{align*}
v(\mathcal{P}) &= \inf_{x \in C \cap D} \{f(x) - \langle x^*, x\rangle + \langle x^*, x\rangle\} \\
&\ge \inf_{x \in C \cap D} \langle x^*, x\rangle + \inf_{x \in C \cap D} \{f(x) - \langle x^*, x\rangle\} \\
&\ge \inf_{x \in D} \langle x^*, x\rangle + \inf_{x \in C} \{f(x) - \langle x^*, x\rangle\}. \tag{6}
\end{align*}

This is true for all $x^*$, so

\begin{align*}
\inf_{x \in C \cap D} f(x) &\ge \sup_{x^* \in \mathcal{X}^*} \left\{ \inf_{x \in D} \langle x^*, x\rangle + \inf_{x \in C} \{f(x) - \langle x^*, x\rangle\} \right\} \\
&= \sup_{x^* \in \mathcal{X}^*} \{D^*[x^*] + C^*[x^*]\}, \tag{7}
\end{align*}
where

\[
C^*[x^*] \equiv \inf_{x \in C} \{f(x) - \langle x^*, x\rangle\} \tag{8}
\]
and

\[
D^*[x^*] \equiv \inf_{x \in D} \langle x^*, x\rangle. \tag{9}
\]

The only functionals $x^*$ it is worth considering are those for which $C^*[x^*]$ and $D^*[x^*]$ are greater than $-\infty$. Let
\[
\mathcal{C}^* \equiv \left\{ x^* \in \mathcal{X}^* : \inf_{x \in C} \{f(x) - \langle x^*, x\rangle\} > -\infty \right\}, \tag{10}
\]
and

\[
\mathcal{D}^* \equiv \left\{ x^* \in \mathcal{X}^* : \inf_{x \in D} \langle x^*, x\rangle > -\infty \right\}. \tag{11}
\]
Then wlog we can restrict attention to

\[
\sup_{x^* \in \mathcal{C}^* \cap \mathcal{D}^*} \{D^*[x^*] + C^*[x^*]\}, \tag{12}
\]
with the supremum defined to be $-\infty$ if $\mathcal{C}^* \cap \mathcal{D}^* = \emptyset$.

In general, the sup over $\mathcal{X}^*$ need not equal the inf over $\mathcal{X}$. When it does not, there is said to be a ``duality gap.''

Example. Suppose $\mathcal{X}$ is a linear vector space of functions $x = x(t)$ on some fixed domain; $\{x_j^*\}_{j=1}^n \subset \mathcal{X}^*$, with the $\{x_j^*\}$ linearly independent; $d : \mathcal{X} \to \mathbb{R}^n$, $x \mapsto (\langle x_j^*, x\rangle)_{j=1}^n$; $\Xi \subset \mathbb{R}^n$ is a bounded subset of $\mathbb{R}^n$; and $D = \{x \in \mathcal{X} : d(x) \in \Xi\}$. Then one can show that $\mathcal{D}^* = \mathrm{span}\{x_j^*\}_{j=1}^n$ ($D$ is a hypercylinder constrained only in the directions ``aligned'' with an $x_j^*$; in other directions $D$ is unconstrained, so a linear functional with a component in any direction not

in $\mathrm{span}\{x_j^*\}_{j=1}^n$ is unbounded below on $D$). Thus for any $f$ and $C$, the infinite-dimensional problem
\[
\inf_{x \in C \cap D} f(x) \tag{13}
\]
is bounded from below by a finite-dimensional problem on $\mathrm{span}\{x_j^*\}_{j=1}^n$. For particular sets $\Xi$ and $C$, and particular functionals $f$, this can lead to an easy solution for $v(\mathcal{P})$. Continuing the example, suppose that $C = \mathcal{X}$, that $f(x) = \langle x_0^*, x\rangle$, and that
\[
\Xi \equiv \{\gamma \in \mathbb{R}^n : \|\gamma - \delta\| \le \epsilon\}, \tag{14}
\]

for some fixed $\delta \in \mathbb{R}^n$ (an $\epsilon$-ball in $\mathbb{R}^n$ centered at $\delta$). Then $\mathcal{C}^* = \{x_0^*\}$, and we already saw that $\mathcal{D}^* = \mathrm{span}\{x_j^*\}_{j=1}^n$, so $\mathcal{C}^* \cap \mathcal{D}^* = \emptyset$ unless $x_0^* = \sum_{j=1}^n \alpha_j x_j^*$ for some set of constants $\alpha = (\alpha_j)_{j=1}^n$. Given a linearly independent set $\{x_j^*\}_{j=1}^n \subset \mathcal{X}^*$, one can construct a linearly independent set $\{x_j\}_{j=1}^n \subset \mathcal{X}$ such that

\[
\langle x_j^*, x_k\rangle = 1_{j=k}. \tag{15}
\]

If indeed $x_0^* = \sum_{j=1}^n \alpha_j x_j^*$, then for $x^* \in \mathcal{C}^* \cap \mathcal{D}^*$,

\[
\inf_{x \in C} \{\langle x_0^*, x\rangle - \langle x^*, x\rangle\} = 0. \tag{16}
\]
For any $x \in D$, $d(x) = \delta + \nu$ with $\|\nu\| \le \epsilon$, so for $x \in D$,
\begin{align*}
\alpha \cdot d(x) &= \alpha \cdot \delta + \alpha \cdot \nu \\
&\ge \alpha \cdot \delta - |\alpha \cdot \nu| \\
&\ge \alpha \cdot \delta - \|\alpha\|\,\|\nu\| \\
&\ge \alpha \cdot \delta - \epsilon\|\alpha\|. \tag{17}
\end{align*}
This bound is in fact attained by setting
\[
\beta = \delta - \epsilon \frac{\alpha}{\|\alpha\|} \tag{18}
\]
and taking $x = \sum_j \beta_j x_j$. Thus

\[
\inf_{x \in D} \langle x^*, x\rangle = \alpha \cdot \delta - \epsilon\|\alpha\|. \tag{19}
\]
This gives us
\[
\inf_{x \in C \cap D} \langle x_0^*, x\rangle =
\begin{cases}
\alpha \cdot \delta - \epsilon\|\alpha\|, & x_0^* = \sum_{j=1}^n \alpha_j x_j^* \\
-\infty, & \text{otherwise.}
\end{cases} \tag{20}
\]
There is no duality gap in this problem. See Stark, 1992, J. Geophys. Res., 97, 14,055--14,082, for more examples.
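To make the example concrete, here is a minimal Python sketch (our finite-dimensional stand-in, not from the notes): take $\mathcal{X} = \mathbb{R}^5$, let the $x_j^*$ be the rows of a matrix $A$, so $d(x) = Ax$, and compare the closed form (20) with a direct numerical minimization:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Sketch: X = R^5, x_j* = rows of A (n = 3), D = {x : ||A x - delta|| <= eps}.
# If x0* = A^T alpha, (20) gives inf_D <x0*, x> = alpha.delta - eps ||alpha||.
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 5))
delta = rng.normal(size=3)
eps = 0.5
alpha = np.array([1.0, -2.0, 0.5])
x0_star = A.T @ alpha                # x0* lies in the span of the x_j*

closed_form = alpha @ delta - eps * np.linalg.norm(alpha)

# Direct minimization over D, written with a smooth constraint function.
res = minimize(lambda x: x0_star @ x,
               np.linalg.pinv(A) @ delta,     # a feasible starting point
               method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda x: eps**2 - np.sum((A @ x - delta)**2)}])
print(closed_form, res.fun)          # the two values agree closely
\end{verbatim}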

1.2 Fenchel Duality

Fenchel duality establishes conditions under which there is no duality gap in the algebraic duality relation given in the previous section. The conditions rely on topological considerations, and the version presented here assumes that the spaces are normed, the sets are convex, etc. Note that the definitions are changed a bit from the last section!

Let $\mathcal{X}$ be a normed linear vector space with normed dual space $\mathcal{X}^*$, and let $C$ and $D$ be convex subsets of $\mathcal{X}$. Let $f : \mathcal{X} \to \mathbb{R}$ be convex, and let $g : \mathcal{X} \to \mathbb{R}$ be concave. The conjugate set $\mathcal{C}^*$ of $C$ is

\[
\mathcal{C}^* \equiv \{x^* \in \mathcal{X}^* : \sup_{x \in C} [\langle x^*, x\rangle - f(x)] < \infty\}. \tag{21}
\]

The conjugate functional $f^*$ of $f$ is defined on $\mathcal{C}^*$ by

\[
f^*(x^*) = \sup_{x \in C} [\langle x^*, x\rangle - f(x)]. \tag{22}
\]

The conjugate set $\mathcal{D}^*$ of $D$ is

\[
\mathcal{D}^* \equiv \{x^* \in \mathcal{X}^* : \inf_{x \in D} [\langle x^*, x\rangle - g(x)] > -\infty\}. \tag{23}
\]

The conjugate functional $g^*$ of $g$ is defined on $\mathcal{D}^*$ by

\[
g^*(x^*) = \inf_{x \in D} [\langle x^*, x\rangle - g(x)]. \tag{24}
\]
One may show that $\mathcal{C}^*$ and $\mathcal{D}^*$ are convex subsets of $\mathcal{X}^*$, that $f^*$ is a convex functional, and that $g^*$ is a concave functional.

Theorem 2 Fenchel Duality (see Luenberger, 1969, \S 7.12, Theorem 1). Let $\mathcal{X}$ be a normed space, $C$ and $D$ be convex subsets of $\mathcal{X}$, $f$ a convex functional on $\mathcal{X}$, $g$ a concave functional on $\mathcal{X}$. Suppose $C \cap D$ contains points in the relative interior of $C$ and $D$, and that either $[f,C]_+$ or $[g,D]_-$ has nonempty interior. If

\[
v(\mathcal{P}) = \inf_{x \in C \cap D} \{f(x) - g(x)\} > -\infty, \tag{25}
\]
then

\[
v(\mathcal{P}) = \max_{x^* \in \mathcal{C}^* \cap \mathcal{D}^*} \{g^*(x^*) - f^*(x^*)\}. \tag{26}
\]
The max on the right is attained by some $x_0^* \in \mathcal{C}^* \cap \mathcal{D}^*$. If the infimum is attained by some $x_0 \in C \cap D$, then
\[
\max_{x \in C} [\langle x_0^*, x\rangle - f(x)] = \langle x_0^*, x_0\rangle - f(x_0) \tag{27}
\]
and

\[
\min_{x \in D} [\langle x_0^*, x\rangle - g(x)] = \langle x_0^*, x_0\rangle - g(x_0). \tag{28}
\]
Remark. I cannot see where in the proof of the Fenchel Duality theorem Luenberger uses the hypothesis that $C \cap D$ contains points in the relative interior of $C$ and $D$. Furthermore, that hypothesis is not necessarily satisfied in the example he gives to derive the min-max theorem of games; see below (a compact convex subset of a reflexive space does not necessarily have nonempty interior). Thanks to Jason Schweinsberg and Von Bing Yap for pointing out these problems! Can anyone see how the hypothesis enters the proof?

Example. The min-max theorem in game theory. Let $\mathcal{X}$ be a normed linear vector space and $\mathcal{X}^*$ its normed dual. Assume $\mathcal{X}$ is reflexive. Player A selects a strategy $x$ from $\mathcal{A} \subset \mathcal{X}$ and player B selects a strategy from $\mathcal{B}^* \subset \mathcal{X}^*$, without knowledge of player A's choice. Player A pays player B $\langle x^*, x\rangle$. Player A thus wants to minimize $\langle x^*, x\rangle$, while player B wants to maximize it. Suppose both players seek a minimax strategy: one in which they lose least (gain most) in the worst case. Thus player A seeks

\[
v_- = \inf_{x \in \mathcal{A}} \sup_{x^* \in \mathcal{B}^*} \langle x^*, x\rangle, \tag{29}
\]

\[
v_+ = \sup_{x^* \in \mathcal{B}^*} \inf_{x \in \mathcal{A}} \langle x^*, x\rangle. \tag{30}
\]

Theorem 3 (Min-max theorem; see Luenberger, \S 7.13, Theorem 1.) Suppose $\mathcal{A}$ and $\mathcal{B}^*$ are compact and convex. Then

\[
\inf_{x \in \mathcal{A}} \sup_{x^* \in \mathcal{B}^*} \langle x^*, x\rangle = \sup_{x^* \in \mathcal{B}^*} \inf_{x \in \mathcal{A}} \langle x^*, x\rangle. \tag{31}
\]
Proof. Define

\begin{align*}
f : \mathcal{X} &\to \mathbb{R} \\
x &\mapsto \max_{x^* \in \mathcal{B}^*} \langle x^*, x\rangle. \tag{32}
\end{align*}

The max is attained because $\mathcal{B}^*$ is compact (by assumption) and $x$, viewed as an element of $\mathcal{X}^{**}$, is a continuous function on $\mathcal{B}^*$. By linearity and the convexity of $\mathcal{B}^*$, for all $\alpha \in [0,1]$,

\begin{align*}
f(\alpha x + (1-\alpha)y) &= \max_{x^* \in \mathcal{B}^*} \langle x^*, \alpha x + (1-\alpha)y\rangle \\
&= \max_{x^* \in \mathcal{B}^*} [\langle x^*, \alpha x\rangle + \langle x^*, (1-\alpha)y\rangle] \\
&= \max_{x^* \in \mathcal{B}^*} [\alpha\langle x^*, x\rangle + (1-\alpha)\langle x^*, y\rangle] \\
&\le \max_{x^* \in \mathcal{B}^*} \alpha\langle x^*, x\rangle + \max_{x^* \in \mathcal{B}^*} (1-\alpha)\langle x^*, y\rangle \\
&= \alpha f(x) + (1-\alpha) f(y). \tag{33}
\end{align*}

Assignment. Show that $f$ is continuous, and that there exist points of $C \cap D$ in the relative interior of $C$ and $D$.

We seek $\min_{x \in \mathcal{A}} f(x)$. Apply Fenchel duality, identifying $C$ with $\mathcal{X}$, $g = 0$, and $D$ with $\mathcal{A}$. Because $D$ is a compact set, $\mathcal{D}^* = \mathcal{X}^*$, and

\[
g^*(x^*) = \min_{x \in \mathcal{A}} \langle x^*, x\rangle. \tag{34}
\]

Consider $\mathcal{C}^*$. Suppose $x_1^* \notin \mathcal{B}^*$. Then (by the separating hyperplane theorem) there exist $x_1 \in \mathcal{X}$ and $\alpha \in \mathbb{R}$ such that

\[
\langle x_1^*, x_1\rangle - \langle x^*, x_1\rangle > \alpha > 0, \quad \forall x^* \in \mathcal{B}^*. \tag{35}
\]

Thus, letting $x = a x_1$ with $a > 0$ sufficiently large,

\[
\langle x_1^*, x\rangle - \max_{x^* \in \mathcal{B}^*} \langle x^*, x\rangle = \langle x_1^*, x\rangle - f(x) \tag{36}
\]
can be made arbitrarily large, so $x_1^* \notin \mathcal{C}^*$. On the other hand, if $x_1^* \in \mathcal{B}^*$,

\[
\sup_{x \in \mathcal{X}} \left[ \langle x_1^*, x\rangle - \max_{x^* \in \mathcal{B}^*} \langle x^*, x\rangle \right] = 0, \tag{37}
\]
and is attained by $x = 0$. Thus $\mathcal{C}^* = \mathcal{B}^*$. The conditions of Fenchel duality apply, and hence

\[
\min_{x \in \mathcal{A}} f(x) = \max_{x^* \in \mathcal{B}^* \cap \mathcal{X}^*} g^*(x^*) = \max_{x^* \in \mathcal{B}^*} \min_{x \in \mathcal{A}} \langle x^*, x\rangle. \tag{38}
\]
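For a finite game, both sides of (31) can be computed as linear programs. Here is a minimal Python sketch (a standard LP formulation, not from the notes; the payoff matrix is arbitrary): $\mathcal{A}$ and $\mathcal{B}^*$ are the probability simplices, which are compact and convex, and the two optimal values coincide.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

# Sketch: zero-sum game in which A pays B the amount q . M p, where p and q
# are mixed strategies.  Both sides of (31) are solved as linear programs.
M = np.array([[1.0, -1.0, 3.0],
              [2.0,  4.0, -2.0],
              [0.0,  1.0,  1.0]])
m, n = M.shape

# Player A: min_{p, t} t  s.t.  M p <= t 1,  p in the simplex.
res_A = linprog(c=np.r_[np.zeros(n), 1.0],
                A_ub=np.c_[M, -np.ones(m)], b_ub=np.zeros(m),
                A_eq=[np.r_[np.ones(n), 0.0]], b_eq=[1.0],
                bounds=[(0, None)] * n + [(None, None)])

# Player B: max_{q, s} s  s.t.  M^T q >= s 1,  q in the simplex.
res_B = linprog(c=np.r_[np.zeros(m), -1.0],
                A_ub=np.c_[-M.T, np.ones(n)], b_ub=np.zeros(n),
                A_eq=[np.r_[np.ones(m), 0.0]], b_eq=[1.0],
                bounds=[(0, None)] * m + [(None, None)])

print(res_A.fun, -res_B.fun)        # equal: the game has a value, as in (31)
\end{verbatim}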

Example: minimization in a Hilbert space, and splines. Regularized least squares is an estimator sometimes used in nonparametric regression. The idea is to find the regression function of minimum Sobolev norm (a norm that involves derivatives of the function) among those that fit the data within an $\ell_2$ tolerance, or (equivalently) to find the model that fits the data best in an $\ell_2$ sense among those whose Sobolev norm does not exceed some bound.

Recall that a real inner-product space $\mathcal{X}$ is a real linear vector space with an additional operation $\langle \cdot, \cdot\rangle : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the properties that for all $x, y, z \in \mathcal{X}$ and all $\gamma \in \mathbb{R}$,
\begin{align*}
\langle x, y\rangle &= \langle y, x\rangle \\
\langle x + y, z\rangle &= \langle x, z\rangle + \langle y, z\rangle \\
\langle \gamma x, y\rangle &= \gamma \langle x, y\rangle \\
\langle x, x\rangle &\ge 0 \\
\langle x, x\rangle = 0 &\Rightarrow x = 0. \tag{39}
\end{align*}

The functional $\|x\| \equiv \sqrt{\langle x, x\rangle}$ is a norm on a (real) inner-product space. A real inner-product space that is complete in the norm topology (i.e., in which all Cauchy sequences have limits in the space) is called a Hilbert space. Hilbert spaces are their own normed dual spaces. A normed vector space $\mathcal{X}$ is separable if there is a countable set $\{x_j\}_{j=1}^\infty$ such that every element $x \in \mathcal{X}$ can be approximated arbitrarily well (in the norm) by a finite linear combination of $\{x_j\}_{j=1}^\infty$. All separable Hilbert spaces are isomorphic to $\ell_2$. Every real inner-product space can be ``completed'' to form a Hilbert space by considering elements to be equivalence classes of Cauchy sequences.

For any $x, y$ in an inner-product space $\mathcal{X}$, $|\langle x, y\rangle| \le \|x\| \cdot \|y\|$. Two elements $x$ and $y$ in an inner-product space are said to be orthogonal if $\langle x, y\rangle = 0$; then we write $x \perp y$. If $x \perp y$ then $\|x + y\|^2 = \|x\|^2 + \|y\|^2$. An element $x$ of an inner-product space is orthogonal to a subset $M$ of the same inner-product space if $x \perp y$ $\forall y \in M$; then we write $x \perp M$. Two subsets $M, N$ of an inner-product space are orthogonal if $x \perp y$ $\forall x \in M$ and $\forall y \in N$; then we write $M \perp N$. The orthogonal complement of a set $M \subset \mathcal{X}$ is

\[
M^\perp = \{x \in \mathcal{X} : x \perp M\}. \tag{40}
\]

For any $M \subset \mathcal{X}$, $M^\perp$ is a closed subspace. If $M$ is a closed subspace of $\mathcal{X}$, each $x \in \mathcal{X}$ has a unique decomposition $x = m + n$ where $m \in M$ and $n \in M^\perp$; we write $\mathcal{X} = M \oplus M^\perp$. Every finite-dimensional subspace of a Hilbert space is closed.
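A quick numerical sketch of the decomposition (ours; the subspace and vector are random):

\begin{verbatim}
import numpy as np

# Sketch: decompose x = m + n with m in M and n in M-perp, where M is
# spanned by the columns of B in R^5.  The projection onto M is a
# least-squares fit; the residual is orthogonal to M.
rng = np.random.default_rng(3)
B = rng.normal(size=(5, 2))
x = rng.normal(size=5)

coef, *_ = np.linalg.lstsq(B, x, rcond=None)
m = B @ coef                        # component in M
n = x - m                           # component in M-perp
print(np.allclose(B.T @ n, 0))      # n is orthogonal to M
print(np.isclose(x @ x, m @ m + n @ n))   # Pythagorean identity
\end{verbatim}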

Lemma 1 Let $\{x_j\}_{j=1}^n \subset \mathcal{X}$ be a linearly independent set, $M = \mathrm{span}\{x_j\}_{j=1}^n$, and $\delta \in \mathbb{R}^n$. Then

\[
\arg\min_{x \in \mathcal{X}} \{\|x\| : \langle x_j, x\rangle = \delta_j,\ j = 1, \dots, n\} \in M. \tag{41}
\]
The minimum is attained, and by a unique element of $M$.

Proof. $M$ is a closed subspace of $\mathcal{X}$, so we can decompose every $x \in \mathcal{X}$ into a sum $x = x_M + x_{M^\perp}$ with $x_M \in M$ and $x_{M^\perp} \in M^\perp$, and $\|x\|^2 = \|x_M\|^2 + \|x_{M^\perp}\|^2$. Now $\langle x_j, x_{M^\perp}\rangle = 0$ for all $j$, so $\langle x_j, x\rangle = \langle x_j, x_M\rangle$. Thus if $\langle x_j, x\rangle = \delta_j$ $\forall j$, then $\langle x_j, x_M\rangle = \delta_j$ $\forall j$, but $\|x\| \ge \|x_M\|$. Thus it suffices to consider $x \in M$. Because $\{x_j\}_{j=1}^n$ is a linearly independent set, there is a unique vector $x_0 \in M$ s.t. $\langle x_j, x_0\rangle = \delta_j$. Even if $\{x_j\}$ were not linearly independent, that the minimum is attained would follow from the fact that $M$ is a closed subspace of $\mathcal{X}$, and therefore complete. Any sequence of elements of $M$ for which the norm converges to the minimal value can be shown to be a Cauchy sequence within $M$, so its limit is in $M$. The uniqueness would then follow from the strict convexity of the norm in Hilbert space.

Remark. Essentially the same proof shows that if $M$ is a closed subspace and $x \in \mathcal{X}$,
\[
\min_{m \in M} \|x - m\| \tag{42}
\]

is attained by an element $m_0$ of $M$, and that $x - m_0 \in M^\perp$.

Remark. Let $\langle \mathbf{x}, x\rangle \equiv (\langle x_j, x\rangle)_{j=1}^n$. It follows from Lemma 1 that if $\Xi \subset \mathbb{R}^n$, and $x_0$ solves
\[
\min \{\|x\| : \langle \mathbf{x}, x\rangle \in \Xi\}, \tag{43}
\]
then $x_0 \in M = \mathrm{span}\{x_j\}_{j=1}^n$.

Let $S$ be a closed subspace of $\mathcal{X}$. Then
\[
|x|_S = \min_{s \in S} \|x - s\| \tag{44}
\]
is a seminorm.

Assignment. Verify that for a closed subspace $S$ of a Hilbert space $\mathcal{X}$, $|\cdot|_S$ is a seminorm.

Let $S = \mathrm{span}\{s_k\}_{k=1}^m$ be a finite-dimensional (and therefore closed) subspace of $\mathcal{X}$. Consider the problem of finding
\[
\min \{|x|_S : \langle \mathbf{x}, x\rangle = \delta\}, \tag{45}
\]
with $\{x_j\}_{j=1}^n$ linearly independent. Any $x$ can be decomposed into $x_S + x_{S^\perp}$, with $x_S \in S$ and $x_{S^\perp} \in S^\perp$. Let $\langle \mathbf{x}, S\rangle = \{\langle \mathbf{x}, s\rangle : s \in S\}$. Taking $\Xi = \delta + \langle \mathbf{x}, S\rangle$, the problem becomes
\[
\min \{\|x\| : \langle \mathbf{x}, x\rangle \in \Xi\}. \tag{46}
\]
The optimal $x_0$ is thus a linear combination of $\{x_j\}_{j=1}^n$, plus an element of $S$; we restrict attention to the subspace of $\mathcal{X}$ consisting of linear combinations of elements of $S$ and $\{x_j\}_{j=1}^n$. Define $\Gamma$ to be the $n$ by $n$ matrix with elements $\Gamma_{ij} = \langle x_i, x_j\rangle$. Because $\{x_j\}$ is linearly independent and $\langle \cdot, \cdot\rangle$ is symmetric, $\Gamma$ is a positive-definite symmetric matrix. Let $\Lambda$ be the $n$ by $m$ matrix with elements $\Lambda_{ij} = \langle x_i, s_j\rangle$. Write
\[
x = \gamma \cdot \mathbf{x} + \lambda \cdot \mathbf{s} = \sum_{j=1}^n \gamma_j x_j + \sum_{j=1}^m \lambda_j s_j. \tag{47}
\]

The component of $x$ in $S$ does not contribute to $|x|_S$, so $|x|_S^2 \le \gamma \cdot \Gamma \cdot \gamma$, with equality if $\gamma \cdot \mathbf{x} \perp S$. For the optimal $x$, that must hold; otherwise, the component could be absorbed into $x_S$, thereby decreasing $|x|_S$. Thus the seminorm minimization problem is to find
\begin{align*}
v(\mathcal{P}) &= \min_{\gamma \in \mathbb{R}^n,\, \lambda \in \mathbb{R}^m} \{\langle \gamma \cdot \mathbf{x}, \gamma \cdot \mathbf{x}\rangle : \langle \mathbf{x}, \lambda \cdot \mathbf{s} + \gamma \cdot \mathbf{x}\rangle = \delta\} \\
&= \min \{\gamma \cdot \Gamma \cdot \gamma : \Gamma \cdot \gamma + \Lambda \cdot \lambda = \delta\}. \tag{48}
\end{align*}
As just argued, for the optimal $x$, $\langle s_j, \gamma \cdot \mathbf{x}\rangle = 0$, $j = 1, \dots, m$, so $\Lambda^T \cdot \gamma = 0$. This gives the coefficients of $x$ through solving

\[
\begin{pmatrix} \Gamma & \Lambda \\ \Lambda^T & 0 \end{pmatrix}
\cdot
\begin{pmatrix} \gamma \\ \lambda \end{pmatrix}
=
\begin{pmatrix} \delta \\ 0 \end{pmatrix}, \tag{49}
\]
which has a unique solution if the matrix on the left is nonsingular. That holds if the projections of $\{s_j\}$ onto $\mathrm{span}\{x_j\}$ are linearly independent.

This was for the case that $\langle \mathbf{x}, x\rangle = \delta$. If instead we require only $\langle \mathbf{x}, x\rangle \in \Xi$, the same argument shows that the optimal $x_0$ is a linear combination of $\{x_j\}_{j=1}^n \cup \{s_j\}_{j=1}^m$.
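A Python sketch of the system (49) in a toy Hilbert space (our choice: $\mathcal{X} = \mathbb{R}^6$ with the Euclidean inner product, and random $x_j$, $s_k$, and $\delta$):

\begin{verbatim}
import numpy as np

# Sketch: solve (49) for the coefficients (gamma, lambda) of the seminorm
# minimizer.  Columns of Xmat are the x_j (n = 3); columns of Smat span S
# (m = 2).  Inner products are ordinary dot products.
rng = np.random.default_rng(4)
Xmat = rng.normal(size=(6, 3))
Smat = rng.normal(size=(6, 2))
delta = rng.normal(size=3)
n, m = 3, 2

Gamma = Xmat.T @ Xmat               # Gamma_ij = <x_i, x_j>
Lam = Xmat.T @ Smat                 # Lambda_ij = <x_i, s_j>

K = np.block([[Gamma, Lam], [Lam.T, np.zeros((m, m))]])
gamma, lam = np.split(np.linalg.solve(K, np.r_[delta, np.zeros(m)]), [n])
x0 = Xmat @ gamma + Smat @ lam

print(np.allclose(Xmat.T @ x0, delta))          # <x_j, x0> = delta_j
print(np.allclose(Smat.T @ (Xmat @ gamma), 0))  # gamma . x is orthogonal to S
\end{verbatim}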

1.3 Quadratic Smoothing Splines

Let $\mathcal{X}$ be the set of functions $x = x(t)$ on the interval $[0,1]$ that are absolutely continuous, have absolutely continuous first derivatives, and have Lebesgue square-integrable second

derivatives. Let $x' = x'(t) = dx/dt$, and $x'' = x''(t) = d^2x/dt^2$. Endow $\mathcal{X}$ with the norm
\[
\|x\|^2 = |x(0)|^2 + |x'(0)|^2 + \int_0^1 (x''(t))^2\, dt. \tag{50}
\]
This norm derives from the inner product

\[
\langle x, y\rangle = x(0)y(0) + x'(0)y'(0) + \int_0^1 x''(t)\, y''(t)\, dt. \tag{51}
\]
(I.e., $\|x\| = \sqrt{\langle x, x\rangle}$.)

Problem. Show that $\mathcal{X}$ is indeed a Hilbert space: verify that $\langle \cdot, \cdot\rangle$ is an inner product, and that $\mathcal{X}$ is complete w.r.t. the induced norm topology.

The space $\mathcal{X}$ is in fact a reproducing kernel Hilbert space (RKHS). A RKHS is a Hilbert space of functions $x = x(t)$ for which the functional that evaluates each function $x$ at the point $t_0$ (the point-evaluator) is a bounded linear functional; that is, evaluating any $x \in \mathcal{X}$ at the point $t_0$ can be written as the inner product of $x$ with a function $e_{t_0} \in \mathcal{X}$. (Clearly, point-evaluation is linear if the addition of functions and multiplication of functions by scalars is defined in the standard way:

\begin{align*}
(x + y)(t) &= x(t) + y(t), \quad x, y \in \mathcal{X} \\
(\alpha x)(t) &= \alpha(x(t)), \quad x \in \mathcal{X},\ \alpha \in \mathbb{R}.
\end{align*}
The issue is whether this linear functional is bounded.) In the present case,

\begin{align*}
x(t_0) &= x(0) + x'(0) t_0 + \int_0^{t_0} du \int_0^u ds\, x''(s) \\
&= x(0) + x'(0) t_0 + \int_0^1 du\, 1_{u \le t_0} \int_0^1 ds\, x''(s)\, 1_{s \le u} \\
&= x(0) + x'(0) t_0 + \int_0^1 ds\, x''(s) \int_0^1 du\, 1_{u \le t_0}\, 1_{u \ge s} \\
&= x(0) + x'(0) t_0 + \int_0^1 ds\, x''(s)\, (t_0 - s)_+. \tag{52}
\end{align*}
Formally, this looks like the inner product with an element $e_{t_0} \in \mathcal{X}$ whose value at 0 is 1, whose derivative at 0 is $t_0$, and whose second derivative is $e_{t_0}''(t) = (t_0 - t)_+$. The question is whether $e_{t_0} \in \mathcal{X}$. It is: it is absolutely continuous and has absolutely continuous first derivative, and bounded second derivative, so its $\mathcal{X}$-norm is finite. We have
\[
e_{t_0}(t) =
\begin{cases}
1 + t_0 t + t_0 \frac{t^2}{2} - \frac{t^3}{6}, & t \le t_0 \\[4pt]
1 + t_0 t + \frac{t_0^2}{2} t - \frac{t_0^3}{6}, & t > t_0.
\end{cases} \tag{53}
\]
For any collection of distinct points $\{t_j\}_{j=1}^n$, the set of functions $\{e_{t_j}\}_{j=1}^n \subset \mathcal{X}$ is linearly independent. Define the seminorm $|x|$ for $x \in \mathcal{X}$ by
\[
|x|^2 = \int_0^1 (x''(t))^2\, dt. \tag{54}
\]
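A short Python sketch (ours) verifies the reproducing property numerically, using the inner product (51), the representer of (53), and the test function $x(t) = 2 + 3t + t^3$:

\begin{verbatim}
import numpy as np

# Sketch: check that <e_{t0}, x> = x(t0) under the inner product (51),
# where e_{t0}(0) = 1, e_{t0}'(0) = t0, and e_{t0}''(t) = (t0 - t)_+.
t0 = 0.4
t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

x0, dx0 = 2.0, 3.0                  # x(0), x'(0) for x(t) = 2 + 3 t + t^3
xpp = 6.0 * t                       # x''(t)
epp = np.clip(t0 - t, 0.0, None)    # e''(t) = (t0 - t)_+

inner = x0 * 1.0 + dx0 * t0 + np.sum(epp * xpp) * dt   # <e_{t0}, x>, eq. (51)
print(inner, 2.0 + 3.0 * t0 + t0 ** 3)                 # both equal x(t0)
\end{verbatim}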

Let $S = \mathrm{span}\{1, t\}$. Then $S^\perp \equiv \{x \in \mathcal{X} : x(0) = x'(0) = 0\}$. The seminorm $|\cdot|$ can be viewed as the norm of the orthogonal projection $P_{S^\perp}$ of $x$ onto the (closed) subspace $S^\perp$; i.e.,

\[
|x| = \|P_{S^\perp} x\| = \inf_{\ell \in S} \|x - \ell\| = \|x - P_S x\|. \tag{55}
\]
This is a continuous functional on $\mathcal{X}$: consider a sequence $x_k \to x$ in the norm. The operator norm of a Hilbert-space projection is unity; that is, for any subspace $M \subset \mathcal{X}$, $\|P_M x\| \le 1 \times \|x\|$. By the triangle inequality, $\|x + (y - x)\| \le \|x\| + \|y - x\|$, and $\|y + (x - y)\| \le \|y\| + \|x - y\|$; together these imply that $\big| \|x\| - \|y\| \big| \le \|x - y\|$. Thus

\begin{align*}
\big| \|P_{S^\perp} x_k\| - \|P_{S^\perp} x_j\| \big| &\le \|P_{S^\perp} x_k - P_{S^\perp} x_j\| \\
&= \|P_{S^\perp}(x_k - x_j)\| \\
&\le \|x_k - x_j\| \to 0. \tag{56}
\end{align*}
Now $S$ is finite dimensional, and therefore closed, so we have $\mathcal{X} = S \oplus S^\perp$. For any $x \in \mathcal{X}$, let $x_S + x_{S^\perp}$ be the (unique) decomposition of $x$ into its components $x_S \in S$ and $x_{S^\perp} \in S^\perp$. Then $|x| = \|x_{S^\perp}\|$. Let $e_j = e_{t_j}$, $j = 1, \dots, n$. Let $\langle \mathbf{e}, x\rangle = (\langle e_j, x\rangle)_{j=1}^n$. Because each $e_j$ is a continuous linear functional on $\mathcal{X}$ and the two-norm on $\mathbb{R}^n$ is continuous, for any $\delta \in \mathbb{R}^n$ and any $\epsilon > 0$,
\[
D \equiv \{x \in \mathcal{X} : \|\langle \mathbf{e}, x\rangle - \delta\| \le \epsilon\} \tag{57}
\]
is a closed convex subset of $\mathcal{X}$. The smoothing spline optimization problem is to find a function $x_0$ that attains $v(\mathcal{P})$,
\[
v(\mathcal{P}) \equiv \inf_{x \in D} |x|, \tag{58}
\]
which is a seminorm minimization problem of the type just explored. It follows that $v(\mathcal{P})$ is attained, and the optimal vector $x_0$ is a linear combination of 1, $t$, and $\{e_j\}_{j=1}^n$. That is, $x_0$ is an absolutely continuous piecewise cubic function with absolutely continuous first derivative and square-integrable second derivative.

To try to add some intuition, consider what the solution would be like were $\{e_j\}_{j=1}^n \perp S$. Then there would be no advantage to including an element of $S$ in the solution, and the optimal $x_0$ would just be a linear combination of $\{e_j\}$. However, if the subspace spanned by $\{e_j\}$ is not orthogonal to $S$, a part of the data can be produced by an element of $S$. The left-over part of the data can still be produced using just a linear combination of $\{e_j\}$, and there is no advantage to going outside the subspace spanned by $S \cup \{e_j\}$. Define $\Gamma_{ij} = \langle e_i, e_j\rangle$ and $\Lambda_{ij} = \langle e_i, s_j\rangle$. The seminorm ignores any component of $S$ remaining in the ``extra'' linear combination of $\{e_j\}$, so the optimal solution is of the form $x_0 = x_{0S} + x_{0S^\perp}$, where $x_{0S^\perp} = \alpha \cdot \mathbf{e}$ with $\alpha \cdot \Lambda = 0$.

We can also apply Fenchel duality to this problem. Let $f(x) = |x|$, $C = \mathcal{X}$, and $g(x) = 0$. Because $\{e_j\}$ is a linearly independent subset of $\mathcal{X}$ and $f$ is continuous, provided $\epsilon > 0$, the epigraph $[f,D]_+$ has nonempty relative interior. The conjugate set $\mathcal{C}^*$ of $C$ is

\begin{align*}
\mathcal{C}^* &\equiv \{x^* \in \mathcal{X}^* : \sup_{x \in C} [\langle x^*, x\rangle - f(x)] < \infty\} \\
&= \{x \in \mathcal{X} : \sup_{y \in \mathcal{X}} [\langle x, y\rangle - |y|] < \infty\} \\
&= S^\perp \cap B(1), \tag{59}
\end{align*}

where $B(1) = \{x \in \mathcal{X} : \|x\| \le 1\}$. The dual functional of $f$ is

\begin{align*}
f^*(x^*) &= \sup_{x \in C} [\langle x^*, x\rangle - f(x)] \\
&= \sup_{x \in \mathcal{X}} [\langle x^*, x\rangle - |x|] \\
&= \begin{cases} 0, & x^* \in S^\perp \cap B(1) \\ \infty, & \text{otherwise.} \end{cases} \tag{60}
\end{align*}
The conjugate set $\mathcal{D}^*$ of $D$ is

\begin{align*}
\mathcal{D}^* &\equiv \{x \in \mathcal{X} : \inf_{y \in D} [\langle x, y\rangle - g(y)] > -\infty\} \\
&= \{x \in \mathcal{X} : \inf_{y \in D} \langle x, y\rangle > -\infty\} \\
&= \mathrm{span}\{e_j\}_{j=1}^n. \tag{61}
\end{align*}

The conjugate functional $g^*$ of $g$ is

\begin{align*}
g^*(x^*) &= \inf_{x \in D} [\langle x^*, x\rangle - g(x)] \\
&= \begin{cases} -\infty, & x^* \notin \mathrm{span}\{e_j\}_{j=1}^n \\ \alpha \cdot \delta - \epsilon\|\alpha\|, & x^* = \sum_j \alpha_j e_j, \end{cases} \tag{62}
\end{align*}
on substituting the result we obtained in the section on algebraic duality. This leads to the dual problem
\[
v(\mathcal{D}) = \max_{\alpha \in \mathbb{R}^n :\ \sum_j \alpha_j e_j \perp S;\ \|\alpha \cdot \mathbf{e}\| \le 1} \{\alpha \cdot \delta - \epsilon\|\alpha\|\}. \tag{63}
\]
Taking $\Lambda$ and $\Gamma$ to be defined as above, this yields

\[
v(\mathcal{D}) = \max_{\alpha \in \mathbb{R}^n :\ \alpha \cdot \Lambda = 0;\ \alpha \cdot \Gamma \cdot \alpha \le 1} \{\alpha \cdot \delta - \epsilon\|\alpha\|\}. \tag{64}
\]

This maximum is attained by some $x_0^* = \sum_j \alpha_j e_j$. Because the value of the primal problem is also attained by some $x_0 \in \mathcal{X}$, we can use the last part of the Fenchel duality theorem to characterize $x_0$:

\begin{align*}
\min_{x \in D} \langle x_0^*, x\rangle &= \alpha \cdot \delta - \epsilon\|\alpha\| \\
&= \langle x_0^*, x_0\rangle, \tag{65}
\end{align*}
and

\[
\max_{x \in C} [\langle x_0^*, x\rangle - |x|] = \langle x_0^*, x_0\rangle - |x_0| = 0. \tag{66}
\]
One can use these relations to find $x_0$ explicitly in terms of $\alpha$.
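For the $\epsilon = 0$ (pure interpolation) case, the construction can be made completely explicit. In the Python sketch below (ours; the knots and data are arbitrary), $\Gamma_{ij} = \langle e_i, e_j\rangle = 1 + t_i t_j + \int_0^{\min(t_i, t_j)} (t_i - s)(t_j - s)\, ds$ and $\Lambda = [\,1,\ t_j\,]$ follow from (51) and (53), and the coefficients come from the system (49):

\begin{verbatim}
import numpy as np

# Sketch: minimum-seminorm interpolation (the eps = 0 smoothing spline)
# via the linear system (49).
tk = np.array([0.1, 0.3, 0.55, 0.8])        # knots t_j
delta = np.array([1.0, -0.5, 0.25, 2.0])    # data to interpolate
n, m = len(tk), 2

def e(t0, t):
    # The representer (53).
    return np.where(t <= t0,
                    1 + t0 * t + t0 * t**2 / 2 - t**3 / 6,
                    1 + t0 * t + t0**2 * t / 2 - t0**3 / 6)

# Gamma_ij = <e_i, e_j>, with the integral evaluated in closed form.
a = np.minimum.outer(tk, tk)
Gamma = (1 + np.outer(tk, tk) + np.outer(tk, tk) * a
         - np.add.outer(tk, tk) * a**2 / 2 + a**3 / 3)
Lam = np.c_[np.ones(n), tk]                 # <e_i, 1> = 1, <e_i, t> = t_i

K = np.block([[Gamma, Lam], [Lam.T, np.zeros((m, m))]])
gamma, lam = np.split(np.linalg.solve(K, np.r_[delta, np.zeros(m)]), [n])

def spline(t):
    # x0(t) = sum_j gamma_j e_j(t) + lam[0] + lam[1] t, a piecewise cubic.
    return sum(g * e(t0, t) for g, t0 in zip(gamma, tk)) + lam[0] + lam[1] * t

print(spline(tk))                           # reproduces delta at the knots
\end{verbatim}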

1.4 Lagrange Duality

Theorem 4 (See D.G. Luenberger, 1969, Optimization by Vector Space Methods, Wiley, NY, \S 8.6, Theorem 1.) Let
\[
\mu = \inf \{f(x) : x \in C,\ G(x) \le 0 \in \mathcal{Z}\}, \tag{67}
\]
where $C$ is a convex subset of a linear vector space $\mathcal{X}$, $f$ is a real-valued convex functional on $C \subset \mathcal{X}$, and $G$ is a convex mapping from $C$ into a normed space $\mathcal{Z}$ with positive cone $P$, which is assumed to have non-empty interior. Let $\mathcal{Z}^*$ be the normed dual space of $\mathcal{Z}$ (the space of all bounded linear functionals on $\mathcal{Z}$), and let $P^*$ be the positive cone in $\mathcal{Z}^*$ induced by $P$ (the set of linear functionals $z^* \in \mathcal{Z}^*$ that are nonnegative for all $z \in P$). Suppose that $-\infty < \mu < \infty$, and that $\exists y \in C$ s.t. $G(y) < 0$. Define
\[
\varphi(z^*) \equiv \inf_{x \in C} \{f(x) + \langle z^*, G(x)\rangle\}. \tag{68}
\]
Then

\[
\mu = \inf_{G(x) \le 0,\ x \in C} f(x) = \max_{z^* \ge 0} \varphi(z^*). \tag{69}
\]
The maximum on the right is attained by some $z_0^* \ge 0$. If the infimum on the left is attained by some $x_0 \in C$, then
\[
\langle z_0^*, G(x_0)\rangle = 0, \tag{70}
\]

and $x_0$ minimizes

\[
f(x) + \langle z_0^*, G(x)\rangle \tag{71}
\]
over all $x \in C$.

Example: Linear Programming. Let $C$ be the positive orthant in $\mathbb{R}^n$. Define $\mathcal{Z} = \mathbb{R}^m$, endowed with the supremum norm $\|z\|_\infty = \max_{1 \le i \le m} |z_i|$, and the positive cone $P = \{z : z_j \ge 0,\ j = 1, \dots, m\}$. Note that $P$ has nonempty interior in the sup-norm topology. Let $c \in \mathbb{R}^m$. The dual space of $\mathcal{Z}$ is $\mathcal{Z}^* = \mathbb{R}^m$; its induced positive cone is the usual positive orthant. Let $A$ be an $m$ by $n$ matrix, and let $G(x) = A \cdot x - c$. Let $f(x) = f \cdot x$ for some $f \in \mathbb{R}^n$. We seek
\[
\mu = \inf_{x \ge 0 \in \mathbb{R}^n} \{f \cdot x : A \cdot x - c \le 0\}. \tag{72}
\]
This is of the canonical form given. Suppose there exists $x \ge 0 \in \mathbb{R}^n$ s.t. $A \cdot x - c < 0$, and that $\mu$ is finite. The theorem says that
\[
\mu = \max_{y \in \mathbb{R}^m : y \ge 0}\ \inf_{x \in \mathbb{R}^n : x \ge 0} \{f \cdot x + y \cdot (A \cdot x - c)\}. \tag{73}
\]
The infimum is $-\infty$ unless $f + y \cdot A \ge 0$, in which case it is $-y \cdot c$. The dual problem is thus
\[
\max_{y \ge 0 :\ -A^T \cdot y - f \le 0} -c \cdot y, \tag{74}
\]
which is another linear program.
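A minimal Python check of this primal-dual pair (the data are an arbitrary choice of ours; scipy.optimize.linprog solves both sides):

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

# Sketch: verify that (72) and (74) have the same optimal value on a
# small instance satisfying the Slater-type condition A x - c < 0.
f = np.array([2.0, 3.0, 1.0])
A = np.array([[-1.0, -1.0,  0.0],
              [ 0.0, -1.0, -2.0]])
c = np.array([-4.0, -6.0])

primal = linprog(f, A_ub=A, b_ub=c, bounds=[(0, None)] * 3)    # eq. (72)
dual = linprog(c, A_ub=-A.T, b_ub=f, bounds=[(0, None)] * 2)   # eq. (74)
print(primal.fun, -dual.fun)        # 11.0 and 11.0: no duality gap
\end{verbatim}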

1 x = Q− (b A λ). (77) − ·

1 T 1 Let R = AQ− A and d = c AQ− b.Then −

Let $R = A Q^{-1} A^T$ and $d = c - A Q^{-1} b$. Then
\[
v(\mathcal{P}) = \max_{\lambda \ge 0} \left\{ -\frac{1}{2} \lambda \cdot R \cdot \lambda - \lambda \cdot d - \frac{1}{2} b \cdot Q^{-1} \cdot b \right\}, \tag{78}
\]
which is another quadratic program. If $m < n$, the dual is a smaller problem than the primal; moreover, its only constraint is the simple positivity constraint $\lambda \ge 0$.
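A Python sketch of the pair (75)/(78) (our instance; the dual is solved by projected gradient ascent, a simple method of our choosing, and $x$ is recovered from (77)):

\begin{verbatim}
import numpy as np

# Sketch: solve the dual QP (78) by projected gradient ascent on
# lambda >= 0, then recover the primal solution from (77).
rng = np.random.default_rng(5)
n, m = 4, 2
Q = rng.normal(size=(n, n)); Q = Q @ Q.T + n * np.eye(n)   # positive definite
b = rng.normal(size=n)
A = rng.normal(size=(m, n))
c = rng.normal(size=m)

Qinv = np.linalg.inv(Q)
R = A @ Qinv @ A.T
d = c - A @ Qinv @ b

lam = np.zeros(m)
step = 0.9 / np.linalg.norm(R, 2)          # step below 1 / Lipschitz constant
for _ in range(20000):
    lam = np.maximum(lam + step * (-R @ lam - d), 0.0)   # ascent on (78)

x = Qinv @ (b - A.T @ lam)                 # primal solution, eq. (77)
primal = 0.5 * x @ Q @ x - b @ x
dual = -0.5 * lam @ R @ lam - lam @ d - 0.5 * b @ Qinv @ b
print(np.all(A @ x <= c + 1e-6), primal, dual)   # feasible; values agree
\end{verbatim}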
