Convex Analysis and Nonsmooth Optimization
Dmitriy Drusvyatskiy
October 22, 2020

Contents

1 Background
  1.1 Inner products and linear maps
  1.2 Norms
  1.3 Eigenvalue and singular value decompositions of matrices
  1.4 Set operations
  1.5 Point-set topology and existence of minimizers
  1.6 Differentiability
  1.7 Accuracy in approximation
  1.8 Optimality conditions for smooth optimization
  1.9 Rates of convergence

2 Convex geometry
  2.1 Operations preserving convexity
  2.2 Convex hull
  2.3 Affine hull and relative interior
  2.4 Separation theorem
  2.5 Cones and polarity
  2.6 Tangents and normals

3 Convex analysis
  3.1 Basic definitions and examples
  3.2 Convex functions from epigraphical operations
  3.3 The closed convex envelope
  3.4 The Fenchel conjugate
  3.5 Subgradients and subderivatives
    3.5.1 Subdifferential
    3.5.2 Subderivative
  3.6 Lipschitz continuity of convex functions
  3.7 Strong convexity, Moreau envelope, and the proximal map
  3.8 Monotone operators and the resolvent
    3.8.1 Notation and basic properties
    3.8.2 The resolvent and the Minty parametrization
    3.8.3 Proof of the surjectivity theorem

4 Subdifferential calculus and primal/dual problems
  4.1 The subdifferential of the value function
  4.2 Duality and subdifferential calculus
    4.2.1 Fenchel-Rockafellar duality
    4.2.2 Lagrangian duality
    4.2.3 Minimax duality
  4.3 Spectral functions
    4.3.1 Fenchel conjugate and the Moreau envelope
    4.3.2 Proximal map and the subdifferential
    4.3.3 Proof of the trace inequality
    4.3.4 Orthogonally invariant functions of rectangular matrices

5 First-order algorithms for black-box convex optimization
  5.1 Algorithms for smooth convex minimization
    5.1.1 Gradient descent
    5.1.2 Accelerated gradient descent
  5.2 Algorithms for nonsmooth convex minimization
    5.2.1 Subgradient method
  5.3 Model-based view of first-order methods
  5.4 Lower complexity bounds
    5.4.1 Lower-complexity bound for nonsmooth convex optimization
    5.4.2 Lower-complexity bound for smooth convex optimization
  5.5 Additional exercises

6 Algorithms for additive composite problems
  6.1 Proximal methods based on two-sided models
    6.1.1 Sublinear rate
    6.1.2 Linear rate
    6.1.3 Accelerated algorithm
  6.2 Proximal methods based on lower models

7 Smoothing and primal-dual algorithms
  7.1 Proximal (accelerated) gradient method solves the dual
  7.2 Smoothing technique
  7.3 Proximal point method
    7.3.1 Proximal point method for saddle point problems
  7.4 Preconditioned proximal point method
  7.5 Extragradient method

8 Introduction to Variational Analysis
  8.1 An introduction to variational techniques
  8.2 Variational principles
  8.3 Descent principle and stability of sublevel sets
    8.3.1 Level sets of smooth functions
    8.3.2 Sublevel sets of nonsmooth functions
  8.4 Limiting subdifferential and limiting slope
  8.5 Subdifferential calculus

Chapter 1   Background

This chapter sets the notation and reviews the background material that will be used throughout the rest of the book. The reader can safely skim this chapter during the first pass and refer back to it when necessary. The discussion is purposefully kept brief.
The comments section at the end of the chapter lists references where a more detailed treatment may be found.

Roadmap. Sections 1.1-1.3 review basic constructs of linear algebra, including inner products, norms, linear maps and their adjoints, as well as eigenvalue and singular value decompositions. Section 1.4 establishes notation for basic set operations, such as sums and images/preimages of sets. Section 1.5 focuses on topological preliminaries; the main results are the Bolzano-Weierstrass theorem and a variant of the extreme value theorem. The final Sections 1.6-1.8 formally define first- and second-order derivatives of multivariate functions, establish estimates on the error in Taylor approximations, and deduce derivative-based conditions for local optimality. The material in Sections 1.6-1.8 is often covered superficially in undergraduate courses, and therefore we provide an entirely self-contained treatment.

1.1 Inner products and linear maps

Throughout, we fix a Euclidean space $\mathbf{E}$, meaning that $\mathbf{E}$ is a finite-dimensional real vector space endowed with an inner product $\langle \cdot, \cdot \rangle$. Recall that an inner product on $\mathbf{E}$ is an assignment $\langle \cdot, \cdot \rangle \colon \mathbf{E} \times \mathbf{E} \to \mathbb{R}$ satisfying the following three properties for all $x, y, z \in \mathbf{E}$ and scalars $a, b \in \mathbb{R}$:

(Symmetry) $\langle x, y \rangle = \langle y, x \rangle$

(Bilinearity) $\langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle$

(Positive definiteness) $\langle x, x \rangle \geq 0$, and equality $\langle x, x \rangle = 0$ holds if and only if $x = 0$.

The most familiar example is the Euclidean space of $n$-dimensional column vectors $\mathbb{R}^n$, which we always equip with the dot product
$$\langle x, y \rangle := \sum_{i=1}^{n} x_i y_i.$$
One can equivalently write $\langle x, y \rangle = x^T y$. We will denote the coordinate vectors of $\mathbb{R}^n$ by $e_i$, and for any vector $x \in \mathbb{R}^n$, the symbol $x_i$ will denote the $i$'th coordinate of $x$. A basic result of linear algebra shows that every Euclidean space $\mathbf{E}$ can be identified with $\mathbb{R}^n$ for some integer $n$, once an orthonormal basis is chosen. Though such a basis-specific interpretation can be useful, it is often distracting, with the indices hiding the underlying geometry. Consequently, it is often best to think coordinate-free.

The space of real $m \times n$ matrices $\mathbb{R}^{m \times n}$ furnishes another example of a Euclidean space, which we always equip with the trace product
$$\langle X, Y \rangle := \operatorname{tr}(X^T Y).$$
Some arithmetic shows the equality $\langle X, Y \rangle = \sum_{i,j} X_{ij} Y_{ij}$. Thus the trace product on $\mathbb{R}^{m \times n}$ coincides with the usual dot product on the matrices stretched out into long vectors. An important Euclidean subspace of $\mathbb{R}^{n \times n}$ is the space of real symmetric $n \times n$ matrices $\mathbf{S}^n$, along with the trace product $\langle X, Y \rangle := \operatorname{tr}(XY)$.

For any linear mapping $\mathcal{A} \colon \mathbf{E} \to \mathbf{Y}$, there exists a unique linear mapping $\mathcal{A}^* \colon \mathbf{Y} \to \mathbf{E}$, called the adjoint, satisfying
$$\langle \mathcal{A}x, y \rangle = \langle x, \mathcal{A}^* y \rangle \qquad \text{for all points } x \in \mathbf{E},\ y \in \mathbf{Y}.$$
In the most familiar case of $\mathbf{E} = \mathbb{R}^n$ and $\mathbf{Y} = \mathbb{R}^m$, any linear map $\mathcal{A}$ can be identified with a matrix $A \in \mathbb{R}^{m \times n}$, while the adjoint $\mathcal{A}^*$ may then be identified with the transpose $A^T$.

Exercise 1.1. Given a collection of real $m \times n$ matrices $A_1, A_2, \ldots, A_l$, define the linear mapping $\mathcal{A} \colon \mathbb{R}^{m \times n} \to \mathbb{R}^l$ by setting
$$\mathcal{A}(X) := (\langle A_1, X \rangle, \langle A_2, X \rangle, \ldots, \langle A_l, X \rangle).$$
Show that the adjoint is the mapping $\mathcal{A}^* y = y_1 A_1 + y_2 A_2 + \cdots + y_l A_l$.
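As an aside, the adjoint identity in Exercise 1.1 is easy to check numerically. The following minimal NumPy sketch is not part of the original text; the helper names A_map and A_adjoint are invented for illustration, and the check is performed on random data under the trace product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 3, 4, 5

# A collection of real m x n matrices A_1, ..., A_l.
As = [rng.standard_normal((m, n)) for _ in range(l)]

def A_map(X):
    """The linear map A(X) = (<A_1, X>, ..., <A_l, X>) under the trace product."""
    return np.array([np.trace(Ai.T @ X) for Ai in As])

def A_adjoint(y):
    """The claimed adjoint A*(y) = y_1 A_1 + ... + y_l A_l."""
    return sum(yi * Ai for yi, Ai in zip(y, As))

# Check the defining identity <A(X), y> = <X, A*(y)> on random inputs.
X = rng.standard_normal((m, n))
y = rng.standard_normal(l)
lhs = A_map(X) @ y                  # dot product in R^l
rhs = np.trace(X.T @ A_adjoint(y))  # trace product in R^{m x n}
print(np.isclose(lhs, rhs))         # True
```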
Linear mappings $\mathcal{A} \colon \mathbf{E} \to \mathbf{E}$, between a Euclidean space $\mathbf{E}$ and itself, are called linear operators, and are said to be self-adjoint if the equality $\mathcal{A} = \mathcal{A}^*$ holds. Self-adjoint operators on $\mathbb{R}^n$ are precisely those operators that are representable as symmetric matrices.

A self-adjoint operator $\mathcal{A}$ is positive semi-definite, denoted $\mathcal{A} \succeq 0$, whenever
$$\langle \mathcal{A}x, x \rangle \geq 0 \qquad \text{for all } x \in \mathbf{E}.$$
Similarly, a self-adjoint operator $\mathcal{A}$ is positive definite, denoted $\mathcal{A} \succ 0$, whenever
$$\langle \mathcal{A}x, x \rangle > 0 \qquad \text{for all } 0 \neq x \in \mathbf{E}.$$
For any two linear operators $\mathcal{A}$ and $\mathcal{B}$, we will write $\mathcal{A} \succeq \mathcal{B}$ to mean $\mathcal{A} - \mathcal{B} \succeq 0$; the notation $\mathcal{A} \succ \mathcal{B}$ is defined similarly.

1.2 Norms

A norm on a vector space $\mathbf{V}$ is a function $\|\cdot\| \colon \mathbf{V} \to \mathbb{R}$ for which the following three properties hold for all points $x, y \in \mathbf{V}$ and scalars $a \in \mathbb{R}$:

(Absolute homogeneity) $\|ax\| = |a| \cdot \|x\|$

(Triangle inequality) $\|x + y\| \leq \|x\| + \|y\|$

(Positivity) Equality $\|x\| = 0$ holds if and only if $x = 0$.

The inner product in the Euclidean space $\mathbf{E}$ always induces a norm $\|x\| = \sqrt{\langle x, x \rangle}$. Unless specified otherwise, the symbol $\|x\|$ for $x \in \mathbf{E}$ will always denote this induced norm. For example, the dot product on $\mathbb{R}^n$ induces the usual 2-norm $\|x\|_2 := \sqrt{x_1^2 + \cdots + x_n^2}$, while the trace product on $\mathbb{R}^{m \times n}$ induces the Frobenius norm $\|X\|_F := \sqrt{\operatorname{tr}(X^T X)}$. The Cauchy-Schwarz inequality guarantees that the induced norm satisfies the estimate
$$|\langle x, y \rangle| \leq \|x\| \cdot \|y\| \qquad \text{for all } x, y \in \mathbf{E}. \tag{1.1}$$

Other important examples of norms are the $\ell_p$-norms on $\mathbb{R}^n$:
$$\|x\|_p = \begin{cases} \left(|x_1|^p + \cdots + |x_n|^p\right)^{1/p} & \text{for } 1 \leq p < \infty, \\ \max\{|x_1|, \ldots, |x_n|\} & \text{for } p = \infty. \end{cases}$$
The most notable of these are the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms; see Figure 1.1.

[Figure 1.1: Unit balls of the $\ell_p$-norms for $p = 1,\ 1.5,\ 2,\ 5,\ \infty$.]

For an arbitrary norm $\|\cdot\|$ on $\mathbf{E}$, the dual norm $\|\cdot\|^*$ on $\mathbf{E}$ is defined by
$$\|v\|^* := \max\{\langle v, x \rangle : \|x\| \leq 1\}.$$
Thus $\|v\|^*$ is the maximal value that the linear function $x \mapsto \langle v, x \rangle$ takes over the closed unit ball of the norm $\|\cdot\|$. For example, the $\ell_p$ and $\ell_q$ norms on $\mathbb{R}^n$ are dual to each other whenever $p^{-1} + q^{-1} = 1$ and $p, q \in [1, \infty]$. In particular, the $\ell_2$-norm on $\mathbb{R}^n$ is self-dual; the same goes for the Frobenius norm on $\mathbb{R}^{m \times n}$ (why?).
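The $\ell_p$/$\ell_q$ duality can also be illustrated numerically. The sketch below (NumPy again, not from the text, with all variable names invented for the example) exhibits feasible points of the $\ell_1$, $\ell_2$, and $\ell_\infty$ unit balls that attain the claimed dual values $\|v\|_\infty$, $\|v\|_2$, and $\|v\|_1$, and then samples random points of the $\ell_1$ unit sphere to confirm that none exceeds $\|v\|_\infty$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
v = rng.standard_normal(n)

# Closed-form dual norms: the dual of l_p is l_q with 1/p + 1/q = 1.
dual_of_l1  = np.max(np.abs(v))      # l_infinity norm of v
dual_of_l2  = np.linalg.norm(v, 2)   # l_2 is self-dual
dual_of_inf = np.sum(np.abs(v))      # l_1 norm of v

# Feasible points of the respective unit balls that attain the maximum.
j = np.argmax(np.abs(v))
x1 = np.zeros(n); x1[j] = np.sign(v[j])  # ||x1||_1 = 1
x2 = v / np.linalg.norm(v, 2)            # ||x2||_2 = 1
xi = np.sign(v)                          # ||xi||_inf <= 1

print(np.isclose(v @ x1, dual_of_l1))    # True
print(np.isclose(v @ x2, dual_of_l2))    # True
print(np.isclose(v @ xi, dual_of_inf))   # True

# Sanity check: random points of the l_1 unit sphere never exceed ||v||_inf.
samples = rng.standard_normal((10000, n))
samples /= np.sum(np.abs(samples), axis=1, keepdims=True)  # normalize ||.||_1 to 1
print(np.all(samples @ v <= dual_of_l1 + 1e-12))           # True
```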