Chapter 3 Gradient-based optimization
Contents (class version)

3.0 Introduction
3.1 Lipschitz continuity
3.2 Gradient descent for smooth convex functions
3.3 Preconditioned steepest descent
    Preconditioning: overview
    Descent direction
    Complex case
3.4 Descent direction for edge-preserving regularizer: complex case
    GD step size with preconditioning
    Finite difference implementation
    Orthogonality for steepest descent and conjugate gradients
3.5 General inverse problems
3.6 Convergence rates
    Heavy ball method
    Generalized convergence analysis of PGD
    Generalized Nesterov fast gradient method (FGM)
3.7 First-order methods
    General first-order method classes
    Optimized gradient method (OGM)
3.8 Machine learning via logistic regression for binary classification
    Adaptive restart of OGM
3.9 Summary
3.0 Introduction

To solve a problem like
$$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \Psi(x)$$
via an iterative method, we start with some initial guess $x_0$, and then the algorithm produces a sequence $\{x_t\}$ where hopefully the sequence converges to $\hat{x}$, meaning $\|x_t - \hat{x}\| \to 0$ for some norm $\|\cdot\|$ as $t \to \infty$. What algorithm we use depends greatly on the properties of the cost function $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$.

EECS 551 explored the gradient descent (GD) and preconditioned gradient descent (PGD) algorithms for solving least-squares problems in detail. Here we review the general form of gradient descent (GD) for convex minimization problems; the LS application is simply a special case.
(Venn diagram of classes of convex functions: nonsmooth, composite, differentiable, Lipschitz continuous gradient, twice differentiable with bounded curvature, quadratic (LS).)
Motivating application(s)

We focus initially on the numerous SIPML applications where the cost function is convex and smooth, meaning it has a Lipschitz continuous gradient. A concrete family of applications is edge-preserving image recovery where the measurement model is
$$y = Ax + \varepsilon$$
for some matrix $A$, and we estimate $x$ using
$$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \frac{1}{2}\|Ax - y\|_2^2 + \beta R(x),$$
where the regularizer is convex and smooth, such as
$$R(x) = \sum_k \psi([Cx]_k)$$
for a potential function $\psi$ that has a Lipschitz continuous derivative, such as the Fair potential.
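Here is a minimal Julia sketch of this regularized LS cost; the function name `cost` and the choice of passing the potential as an argument are illustrative, not from the notes.

```julia
using LinearAlgebra

# Regularized LS cost Ψ(x) = ½‖Ax − y‖₂² + β R(x) with R(x) = Σₖ ψ([Cx]ₖ),
# for any scalar potential ψ (e.g., the Fair potential defined in §3.1)
function cost(x, A, y, C, β, ψ)
    return 0.5 * norm(A*x - y)^2 + β * sum(ψ, C*x)
end
```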
Example. Here is an example of image deblurring or image restoration that was performed using such a method. The left image is the blurry noisy image $y$, and the right image is the restored image $\hat{x}$.
Step sizes and Lipschitz constant preview

For gradient-based optimization methods, a key issue is choosing an appropriate step size (aka learning rate in ML). Usually the appropriate range of step sizes is determined by the Lipschitz constant of $\nabla\Psi$, so we focus on that next.
3.1 Lipschitz continuity

The concept of Lipschitz continuity is defined for general metric spaces, but we focus on vector spaces.
Define. A function $g : \mathbb{F}^N \mapsto \mathbb{F}^M$ is Lipschitz continuous if there exists $L < \infty$, called a Lipschitz constant, such that
$$\|g(x) - g(z)\| \le L\,\|x - z\|, \quad \forall x, z \in \mathbb{F}^N.$$
In general the norms on $\mathbb{F}^N$ and $\mathbb{F}^M$ can differ, and $L$ will depend on the choice of the norms. We will focus on the Euclidean norms unless otherwise specified.

Define. The smallest such $L$ is called the best Lipschitz constant. (Often just "the" LC.)

Algebraic properties
Let $f$ and $g$ be Lipschitz continuous functions with (best) Lipschitz constants $L_f$ and $L_g$ respectively.

    h(x)              L_h
    α f(x) + β        |α| L_f            scale/shift
    f(x − x₀)         L_f                translate
    f(x) + g(x)       ≤ L_f + L_g        add
    f(g(x))           ≤ L_f L_g          compose (HW)
    A x + b           |||A|||            affine (for same norm on F^M and F^N)
    f(x) g(x)         ?                  multiply
If f and g are Lipschitz continuous functions on R, then h(x) = f(x)g(x) is a Lipschitz continuous function on R. (?) A: True B: False ??
If $f : \mathbb{F}^N \mapsto \mathbb{F}$ and $g : \mathbb{F}^N \mapsto \mathbb{F}$ are Lipschitz continuous functions and $h(x) \triangleq f(x)\,g(x)$, and $|f(x)| \le f_{\max} < \infty$ and $|g(x)| \le g_{\max} < \infty$, then $h(\cdot)$ is Lipschitz continuous on $\mathbb{F}^N$ and $L_h \le f_{\max} L_g + g_{\max} L_f$.

Proof.
$$|h(x) - h(z)| = |f(x)g(x) - f(z)g(z)| = |f(x)(g(x) - g(z)) - (f(z) - f(x))g(z)|$$
$$\le |f(x)|\,|g(x) - g(z)| + |g(z)|\,|f(x) - f(z)| \quad \text{by the triangle inequality}$$
$$\le (f_{\max} L_g + g_{\max} L_f)\,\|x - z\|_2. \qquad \square$$
Is boundedness of both f and g a necessary condition? (group)
No. Think $f(x) = \alpha$ and $g(x) = x$. Then $L_{fg} = |\alpha| = f_{\max}$, but $g(\cdot)$ is unbounded yet Lipschitz.
Think $f(x) = g(x) = \sqrt{|x|}$, both unbounded. But $h(x) = f(x)g(x) = |x|$ has $L_h = 1$.
For our purposes, we especially care about cost functions whose gradients are Lipschitz continuous. We call these smooth functions. The definition of gradient is subtle for functions on $\mathbb{C}^N$, so here we focus on $\mathbb{R}^N$.

Define. A differentiable function $f(x)$ is called smooth iff it has a Lipschitz continuous gradient, i.e., iff $\exists L < \infty$ such that
$$\|\nabla f(x) - \nabla f(z)\|_2 \le L\,\|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N.$$
Lipschitz continuity of $\nabla f$ is a stronger condition than mere continuity, so any differentiable function whose gradient is Lipschitz continuous is in fact a continuously differentiable function.
The set of differentiable functions on $\mathbb{R}^N$ having $L$-Lipschitz continuous gradients is sometimes denoted $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 20].

Example. For $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ we have
$$\|\nabla f(x) - \nabla f(z)\|_2 = \|A'(Ax - y) - A'(Az - y)\|_2 = \|A'A(x - z)\|_2 \le |||A'A|||_2\,\|x - z\|_2.$$
So the Lipschitz constant of $\nabla f$ is $L_{\nabla f} = |||A'A|||_2 = |||A|||_2^2 = \sigma_{\max}^2(A) = \rho(A'A)$.

The value $L_{\nabla f} = |||A'A|||_2$ is the best Lipschitz constant for $\nabla f(\cdot)$. (?) A: True B: False ??
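As a quick numeric illustration of this example, here is a Julia check that $|||A|||_2^2 = |||A'A|||_2$; the matrix is arbitrary and the variable names are illustrative.

```julia
using LinearAlgebra

# Lipschitz constant of ∇f for f(x) = ½‖Ax − y‖₂²:  L = ‖A‖₂² = σ_max(A)²
A = randn(100, 50)                  # any matrix, just for illustration
L = opnorm(A)^2                     # spectral norm squared
@assert isapprox(L, opnorm(A' * A); rtol = 1e-8)
```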
Here is an interesting geometric property of functions in $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 22, Lemma 1.2.3]:
$$|f(x) - f(z) - \langle\nabla f(z),\, x - z\rangle| \le \frac{L}{2}\,\|x - z\|_2^2, \quad \forall x, z \in \mathbb{R}^N.$$
In other words, for any point $z$, the function $f(x)$ is bounded between the two quadratic functions
$$q_{\pm}(x) \triangleq f(z) + \langle\nabla f(z),\, x - z\rangle \pm \frac{L}{2}\,\|x - z\|_2^2.$$
(Picture of sinusoid sin(x) with bounding upward and downward parabolas.)
Convex functions with Lipschitz continuous gradients

See [1, p. 56] for many equivalent conditions for a convex differentiable function $f$ to have a Lipschitz continuous gradient, such as the following holding for all $x, z \in \mathbb{R}^N$:
$$\underbrace{f(z) + \langle\nabla f(z),\, x - z\rangle}_{\text{tangent plane property}} \le f(x) \le \underbrace{f(z) + \langle\nabla f(z),\, x - z\rangle + \frac{L}{2}\,\|x - z\|_2^2}_{\text{quadratic majorization property}}. \quad \text{(Picture)}$$
The left inequality holds for all differentiable convex functions.
Fact. If $f(x)$ is twice differentiable and if there exists $L < \infty$ such that its Hessian matrix has a bounded spectral norm:
$$|||\nabla^2 f(x)|||_2 \le L, \quad \forall x \in \mathbb{R}^N, \tag{3.1}$$
then $f(x)$ has a Lipschitz continuous gradient with Lipschitz constant $L$.
So twice differentiability with bounded curvature is sufficient, but not necessary, for a function to have a Lipschitz continuous gradient.

Proof. Using Taylor's theorem, the triangle inequality, and the definition of the spectral norm:
$$\|\nabla f(x) - \nabla f(z)\|_2 = \left\|\left(\int_0^1 \nabla^2 f(x + \tau(z - x))\, d\tau\right)(x - z)\right\|_2
\le \int_0^1 |||\nabla^2 f(x + \tau(z - x))|||_2\, d\tau\, \|x - z\|_2
\le \int_0^1 L\, d\tau\, \|x - z\|_2 = L\,\|x - z\|_2. \qquad \square$$
Example. $f(x) = \frac{1}{2}\|Ax - y\|_2^2 \implies \nabla^2 f = A'A$, so $|||\nabla^2 f|||_2 = |||A'A|||_2 = |||A|||_2^2$.

Example. The Lipschitz constant for the gradient of $f(x) \triangleq x' \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} x$ is:
$\nabla^2 f = 2 z z'$ where $z = [1\ 2]'$, so $|||\nabla^2 f|||_2 = 2\,\|z\|_2^2 = 10$.
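A quick Julia check of the second example (the matrix is reconstructed here as $B = zz'$, which matches the stated Hessian $2zz'$):

```julia
using LinearAlgebra

# Verify: f(x) = x'Bx with B = zz' and z = [1, 2] has Hessian 2zz' of norm 10
z = [1.0, 2.0]
B = z * z'                           # = [1 2; 2 4]
H = 2 * B                            # Hessian of x'Bx
@assert opnorm(H) ≈ 2 * norm(z)^2    # both equal 10
```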
Boundedness of the 2nd derivative is not a necessary condition in general, because Lipschitz continuity of the derivative of a function does not require the function to be twice differentiable.

Example. Consider $f(x) = \frac{1}{2}([x]_+)^2$. The derivative of this function is $\dot{f}(x) = [x]_+$, which has Lipschitz constant $L = 1$, yet $f$ is not twice differentiable.

However, if a 1D function from $\mathbb{R}$ to $\mathbb{R}$ is twice differentiable, then its derivative is Lipschitz iff its second derivative is bounded.
Proof. The "if" direction follows from (3.1). For the "only if" direction, suppose $\ddot{f}$ is unbounded. Then for any $L < \infty$ there exists a point $x \in \mathbb{R}$ such that $|\ddot{f}(x)| > L$. Now consider $z = x \pm \epsilon$ and let $g(x) = \dot{f}(x)$. Then
$$\left|\frac{g(x) - g(z)}{x - z}\right| = \left|\frac{g(x) - g(x \pm \epsilon)}{\mp\epsilon}\right| \to |\ddot{f}(x)| > L \quad \text{as } \epsilon \to 0,$$
so $g$ cannot be $L$-Lipschitz continuous. This holds for every $L < \infty$. $\square$

Challenge. Generalize this partial converse of (3.1) to twice differentiable functions from $\mathbb{R}^N$ to $\mathbb{R}$, i.e., prove or disprove this conjecture: if $f : \mathbb{R}^N \mapsto \mathbb{R}$ is twice differentiable, then $\nabla f$ is Lipschitz continuous iff the bounded Hessian norm property (3.1) holds.
(Read) Example. The Fair potential used in many imaging applications [2][3] is
$$\psi(z) = \delta^2\left(|z/\delta| - \log(1 + |z/\delta|)\right), \tag{3.2}$$
for some $\delta > 0$, and has the property of being roughly quadratic for $z \approx 0$ and roughly like $\delta\,|z|$ for $|z| \gg \delta$.
When the domain of $\psi$ is $\mathbb{R}$, we can differentiate (carefully treating $z > 0$ and $z < 0$ separately):
$$\dot{\psi}(z) = \frac{z}{1 + |z/\delta|} \quad\text{and}\quad \ddot{\psi}(z) = \frac{1}{(1 + |z/\delta|)^2} \le 1,$$
so the Lipschitz constant of the derivative of $\psi(\cdot)$ is 1.
Furthermore, its second derivative is nonnegative, so it is a convex function.
(Figure: plots of $\psi$, $\dot\psi$, and $\ddot\psi$ for $\delta = 1$.)
Example. Is the Fair potential $\psi$ itself Lipschitz continuous? Yes: its derivative is bounded, $|\dot{\psi}(z)| = \frac{|z|}{1 + |z/\delta|} \le \delta$.
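Here is a minimal Julia sketch of the Fair potential and its first two derivatives; the names and the numeric check are illustrative.

```julia
# Fair potential (3.2) and its first two derivatives
fair(z, δ)       = δ^2 * (abs(z / δ) - log(1 + abs(z / δ)))
fair_deriv(z, δ) = z / (1 + abs(z / δ))          # Lipschitz constant 1
fair_curv(z, δ)  = 1 / (1 + abs(z / δ))^2        # second derivative, ≤ 1

# numerical check of the curvature bound for δ = 1
@assert all(fair_curv(z, 1.0) ≤ 1 for z in -10:0.01:10)
```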
Edge-preserving regularizer and Lipschitz continuity

Example. Determine "the" Lipschitz constant for the gradient of the edge-preserving regularizer in $\mathbb{R}^N$, when the derivative $\dot\psi$ of the potential function $\psi$ has Lipschitz constant $L_{\dot\psi}$:
$$R(x) = \sum_{k=1}^K \psi([Cx]_k) \implies \nabla R(x) = C'\,\dot\psi.(Cx) = h(g(f(x))), \tag{3.3}$$
where $f(x) = Cx$, $g(u) = \dot\psi.(u)$, $h(v) = C'v$, $L_f = |||C|||_2$, $L_h = |||C'|||_2$. (students finish it)
$$\|g(u) - g(v)\|_2^2 = \|\dot\psi.(u) - \dot\psi.(v)\|_2^2 = \sum_k |\dot\psi(u_k) - \dot\psi(v_k)|^2 \le \sum_k L_{\dot\psi}^2\,|u_k - v_k|^2 = L_{\dot\psi}^2\,\|u - v\|_2^2$$
$$\implies L_g \le L_{\dot\psi} \implies L_{\nabla R} \le L_{\dot\psi}\,|||C|||_2^2 = L_{\dot\psi}\,|||C'C|||_2. \tag{3.4}$$

Thus when $\ddot\psi \le 1$, a Lipschitz constant for the gradient of the above $R(x)$ is:
A: 1  B: $|||C|||_2$  C: $|||C'C|||_2$  D: $|||C|||_2^4$  E: None of these ??

Showing this $L_{\nabla R}$ is the best Lipschitz constant is a HW problem.
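A minimal Julia sketch of (3.3)-(3.4): the gradient of $R$ and its Lipschitz bound, assuming a 1D first-difference matrix $C$ and the Fair-potential derivative (both illustrative choices, not prescribed by the notes).

```julia
using LinearAlgebra

N = 8
C = [i == j ? -1.0 : (j == i + 1 ? 1.0 : 0.0) for i in 1:N-1, j in 1:N]  # first differences
ψdot(z; δ = 1.0) = z / (1 + abs(z / δ))      # Fair potential derivative, L_ψ̇ = 1

gradR(x) = C' * ψdot.(C * x)                 # ∇R(x) = C' ψ̇.(Cx), per (3.3)

L_gradR = opnorm(C)^2                        # bound (3.4): L_∇R ≤ L_ψ̇ ‖C‖₂²
```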
3.2 Gradient descent for smooth convex functions

If
• the convex function $\Psi(x)$ has a (not necessarily unique) minimizer $\hat{x}$ for which
$$-\infty < \Psi(\hat{x}) \le \Psi(x), \quad \forall x \in \mathbb{R}^N,$$
• $\Psi$ is smooth, i.e., the gradient of $\Psi(x)$ is Lipschitz continuous:
$$\|\nabla\Psi(x) - \nabla\Psi(z)\|_2 \le L\,\|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N,$$
• the step size $\alpha$ is chosen such that $0 < \alpha < 2/L$,
then the GD iteration
$$x_{k+1} = x_k - \alpha\,\nabla\Psi(x_k)$$
has the following convergence properties [4, p. 207] (a minimal implementation sketch follows this list).
• The cost function is non-increasing (monotone): $\Psi(x_{k+1}) \le \Psi(x_k)$, $\forall k \ge 0$.
• The distance to any minimizer $\hat{x}$ is non-increasing (monotone): $\|x_{k+1} - \hat{x}\|_2 \le \|x_k - \hat{x}\|_2$, $\forall k \ge 0$.
• The sequence $\{x_k\}$ converges to a minimizer of $\Psi(\cdot)$.
• The gradient norm converges to zero [4, p. 22]: $\|\nabla\Psi(x_k)\|_2 \to 0$.
• For $0 < \alpha \le 1/L$, the cost function decrease is bounded by [5]:
$$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L\,\|x_0 - \hat{x}\|_2^2}{2}\,\max\!\left(\frac{1}{2k\alpha L + 1},\, (1 - \alpha L)^{2k}\right).$$
This upper bound is conjectured to also hold for $1/L < \alpha < 2/L$ [6].
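The sketch referenced above: fixed-step GD for the LS special case, in Julia; the step $\alpha = 1/L$ and all names are illustrative.

```julia
using LinearAlgebra

# Fixed-step gradient descent for Ψ(x) = ½‖Ax − y‖₂², with α = 1/L and L = ‖A‖₂²
function gd_ls(A, y; iters = 200)
    L = opnorm(A)^2
    α = 1 / L
    x = zeros(size(A, 2))
    for _ in 1:iters
        x -= α * (A' * (A * x - y))    # ∇Ψ(x) = A'(Ax − y)
    end
    return x
end
```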
Optimal asymptotic step size for GD (Read)

The above step size range $0 < \alpha < 2/L$ is a wide range of values, and one might ask what is the best choice.
For a LS cost function $f(x) = \frac{1}{2}\|Ax - y\|_2^2$, the EECS 551 notes show that the asymptotically optimal choice of the step size is:
$$\alpha_* = \frac{2}{\sigma_{\max}(A'A) + \sigma_{\min}(A'A)} = \frac{2}{\sigma_{\max}(\nabla^2 f) + \sigma_{\min}(\nabla^2 f)},$$
because $\nabla^2 f = A'A$. For more general cost functions that are twice differentiable, one can apply similar analyses to show that the asymptotically optimal choice is
$$\alpha_* = \frac{2}{\sigma_{\max}(\nabla^2 f(\hat{x})) + \sigma_{\min}(\nabla^2 f(\hat{x}))}.$$
Although this formula is an interesting generalization, it is of little practical use because we do not know the minimizer $\hat{x}$, and the Hessian $\nabla^2 f$ and its SVD are infeasible for large problems. Furthermore, the asymptotically optimal choice of $\alpha_*$ may not be the best step size in the early iterations when the iterates are far from $\hat{x}$.
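For small problems one can evaluate this $\alpha_*$ directly; a Julia sketch (the matrix is illustrative, and as noted above this computation is infeasible at large scale):

```julia
using LinearAlgebra

A = randn(30, 10)                    # small illustrative problem
s = svdvals(A)                       # singular values, decreasing order
α_star = 2 / (s[1]^2 + s[end]^2)     # σ_max(A'A) = σ_max(A)², σ_min(A'A) = σ_min(A)²
```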
Convergence rates (Read)

There are many ways to assess the convergence rate of an iterative algorithm like GD. Researchers study:
• $\Psi(x_k) \to \Psi(\hat{x})$
• $\|\nabla\Psi(x_k)\| \to 0$
• $\|x_k - \hat{x}\| \to 0$
both globally and locally.
(Figure: 1D sketch of $\Psi(x)$ illustrating $\Psi(x_k) - \Psi(\hat{x})$, $\|\nabla\Psi(x_k)\|$, and $\|x_k - \hat{x}\|$.)

Quantifying bounds on the rates of decrease of these quantities is an active research area. Even classical GD has relatively recent results [5] that tighten up the traditional bounds. The tightest possible worst-case bound for GD for the decrease of the cost function (with a fixed step size $\alpha = 1/L$) is $O(1/k)$:
$$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L\,\|x_0 - \hat{x}\|_2^2}{4k + 2},$$
where $L$ is the Lipschitz constant of the gradient $\nabla\Psi(x)$. In contrast, Nesterov's fast gradient method (p. 3.40) has a worst-case cost function decrease at rate at least $O(1/k^2)$, which can be improved (and has been) by only a constant factor [7].
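Since the notes only reach FGM later (p. 3.40), here is a hedged sketch of one standard form of Nesterov's fast gradient method applied to the smooth LS cost; the momentum-factor recursion shown is the common $t_k$ update, and all names are illustrative.

```julia
using LinearAlgebra

# One standard form of Nesterov's fast gradient method (FGM) for Ψ(x) = ½‖Ax − y‖₂²
function fgm_ls(A, y; iters = 200)
    L = opnorm(A)^2
    x = zeros(size(A, 2))
    z = copy(x)
    t = 1.0
    for _ in 1:iters
        x_new = z - (1 / L) * (A' * (A * z - y))        # gradient step at momentum point
        t_new = (1 + sqrt(1 + 4t^2)) / 2                # momentum factor update
        z = x_new + ((t - 1) / t_new) * (x_new - x)     # extrapolation
        x, t = x_new, t_new
    end
    return x
end
```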
Example. The following figure illustrates how slowly GD can converge for a simple LS problem with $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$ and $y = 0$. This case used the optimal step size $\alpha_*$ for illustration. This slow convergence has been the impetus for thousands of papers on faster algorithms!
(Figure: GD iterates zig-zagging across the elliptical contours of the LS cost.)
The ellipses show the contours of the LS cost function $\|Ax - y\|$.
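A minimal Julia sketch that reproduces this experiment; the starting point is illustrative (chosen only to show the zig-zag), not taken from the figure.

```julia
using LinearAlgebra

# GD on Ψ(x) = ½‖Ax − y‖₂² with A = diag(1, 2), y = 0, and the asymptotically
# optimal fixed step α_* = 2/(σ_max(A'A) + σ_min(A'A)) = 2/5
function gd_zigzag(; iters = 10)
    A = [1.0 0.0; 0.0 2.0]
    α = 2 / (4 + 1)
    x = [-4.0, 1.0]                  # illustrative starting point
    traj = [copy(x)]
    for _ in 1:iters
        x -= α * (A' * (A * x))      # ∇Ψ(x) = A'(Ax − y) with y = 0
        push!(traj, copy(x))
    end
    return traj                      # iterates zig-zag toward x̂ = 0
end
```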
Two ways to try to accelerate convergence are to use a preconditioner and/or a line search.
3.3 Preconditioned steepest descent (Read)

Instead of using GD with a fixed step size $\alpha$, an alternative is to do a line search to find the best step size at each iteration. This variation is called steepest descent (or GD with a line search) [8]. Here is how preconditioned steepest descent for a linear LS problem works:
$$d_k = -P\,\nabla\Psi(x_k) \qquad \text{search direction (negative preconditioned gradient)}$$
$$\alpha_k = \arg\min_{\alpha}\, \Psi(x_k + \alpha d_k) \qquad \text{step size}$$
$$x_{k+1} = x_k + \alpha_k d_k \qquad \text{update.}$$
• Finding $\alpha_k$ analytically for quadratic cases is a HW problem.
• By construction, this iteration is guaranteed to decrease the cost function monotonically, with strict decrease unless $x_k$ is already a minimizer, provided the preconditioner $P$ is positive definite. Expressed mathematically: $\nabla\Psi(x_k) \ne 0 \implies \Psi(x_{k+1}) < \Psi(x_k)$.
• Computing $\alpha_k$ takes some extra work, especially for non-quadratic problems. Often Nesterov's fast gradient method or the optimized gradient method (OGM) [7] are preferable because they do not require a line search (if the Lipschitz constant is available). See the sketch below for the quadratic case, where the line search has a closed form.
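A minimal Julia sketch of preconditioned steepest descent for the LS cost, where the exact line search reduces to $\alpha_k = -d'\,\nabla\Psi(x_k)/\|Ad\|_2^2$; the function name and the zero-gradient guard are illustrative.

```julia
using LinearAlgebra

# Preconditioned steepest descent for Ψ(x) = ½‖Ax − y‖₂² with exact line search
function psd_ls(A, y, P; iters = 50)
    x = zeros(size(A, 2))
    for _ in 1:iters
        g = A' * (A * x - y)            # ∇Ψ(x_k)
        norm(g) == 0 && return x        # already at a minimizer
        d = -(P * g)                    # search direction d_k = −P ∇Ψ(x_k)
        Ad = A * d
        α = -dot(d, g) / dot(Ad, Ad)    # exact minimizer of α ↦ Ψ(x_k + α d_k)
        x += α * d
    end
    return x
end

# e.g. psd_ls(A, y, I) recovers ordinary (unpreconditioned) steepest descent
```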
Preconditioning: overview (Read)
Why use the preconditioned search direction $d_k = -P\,\nabla\Psi(x_k)$?
Consider the least-squares cost function $\Psi(x) = \frac{1}{2}\|Ax - y\|_2^2$, and define a "preconditioned" cost function using a change of coordinates:
$$f(z) \triangleq \Psi(Tz) = \frac{1}{2}\|ATz - y\|_2^2.$$
The Hessian matrix of $f(\cdot)$ is $\nabla^2 f(z) = T'A'AT$. Applying GD to $f$ yields
$$z_{k+1} = z_k - \alpha\,\nabla f(z_k) = z_k - \alpha\,T'A'(ATz_k - y)$$
$$\implies Tz_{k+1} = Tz_k - \alpha\,TT'A'(ATz_k - y)$$
$$\implies x_{k+1} = x_k - \alpha\,PA'(Ax_k - y) = x_k - \alpha\,P\,\nabla\Psi(x_k),$$
where $x_k \triangleq Tz_k$ and $P \triangleq TT'$. So ordinary GD on $f$ is the same as preconditioned GD on $\Psi$.
If $\alpha = 1$ and $T = (A'A)^{-1/2}$, then $P = (A'A)^{-1}$ and $x_1 = (A'A)^{-1}A'y$.
In this sense $P = (A'A)^{-1} = [\nabla^2\Psi(x_k)]^{-1}$ is the ideal preconditioner.
To elaborate, when $T = (A'A)^{-1/2}$, then $f(z)$ simplifies as follows:
$$f(z) = \frac{1}{2}(ATz - y)'(ATz - y) = \frac{1}{2}\left(z'T'A'ATz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2\right)$$
$$= \frac{1}{2}\left(z'Iz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2\right) = \frac{1}{2}\left(\|z - T'A'y\|_2^2 - \|T'A'y\|_2^2 + \|y\|_2^2\right).$$
The next figures illustrate this property of converging in 1 iteration for a quadratic cost with the ideal preconditioner.

Example. Effect of the ideal preconditioner on quadratic cost function contours.
(Figure: contours of $\Psi(x)$; contours of $f(z)$.)
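A small Julia check of the one-step convergence claim with the ideal preconditioner (the problem size is arbitrary):

```julia
using LinearAlgebra

# With P = (A'A)⁻¹ and α = 1, preconditioned GD reaches the LS solution in one step
A = randn(20, 5); y = randn(20)
P = inv(A' * A)
x0 = zeros(5)
x1 = x0 - P * (A' * (A * x0 - y))
@assert x1 ≈ A \ y               # x₁ = (A'A)⁻¹ A'y, the LS solution
```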
Descent direction

Define. A vector $d \in \mathbb{F}^N$ is a descent direction for a cost function $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$ at a point $x$ iff moving locally from $x$ along the direction $d$ decreases the cost, i.e., (1D Picture)
$$\exists\, c = c(x, d, \Psi) > 0 \ \text{ s.t. }\ \forall\, \epsilon \in [0, c): \ \Psi(x + \epsilon d) \le \Psi(x). \tag{3.5}$$
With this definition, $d = 0$ is always a (degenerate) descent direction.

Fact. For $\mathbb{R}^N$, if $\Psi(x)$ is differentiable at $x$ and $P$ is positive definite, then the following vector, if nonzero, is a descent direction for $\Psi$ at $x$:
$$d = -P\,\nabla\Psi(x). \tag{3.6}$$
Proof sketch. Taylor's theorem yields (Read)
$$\Psi(x) - \Psi(x + \alpha d) = -\alpha\,\langle\nabla\Psi(x),\, d\rangle + o(\alpha) = \alpha\left(d'P^{-1}d + \frac{o(\alpha)}{\alpha}\right),$$
which will be positive for sufficiently small $\alpha$, because $d'P^{-1}d > 0$ for $P \succ 0$, and $o(\alpha)/\alpha \to 0$ as $\alpha \to 0$.

From this analysis we can see that designing/selecting a preconditioner that is positive definite is crucial. The two most common choices are:
• $P$ is diagonal with positive diagonal elements,
• $P = QDQ'$ where $D$ is diagonal with positive diagonal elements and $Q$ is unitary. In this case $Q$ is often circulant so we can use FFT operations to perform $Pg$ efficiently.
Complex case (Read)
The definition of descent direction in (3.5) is perfectly appropriate for both $\mathbb{R}^N$ and $\mathbb{C}^N$. However, the direction $d$ specified in (3.6) is problematic in general on $\mathbb{F}^N$ because many cost functions of interest are not holomorphic, so they are not differentiable on $\mathbb{C}^N$. However, despite not being differentiable, we can still find a descent direction for most cases of interest.

Example. The most important case of interest here is $\Psi : \mathbb{C}^N \mapsto \mathbb{R}$ defined by $\Psi(x) = \frac{1}{2}\|Ax - y\|_2^2$ where $A \in \mathbb{C}^{M\times N}$ and $y \in \mathbb{C}^M$. This function is not holomorphic. However, one can show that $d = -PA'(Ax - y)$ is a descent direction for $\Psi$ at $x$ when $P$ is a positive definite matrix. (HW)

In the context of optimization problems, when we write $g = \nabla\Psi(x) = A'(Ax - y)$ for the complex case, we mean that $-g$ is a descent direction for $\Psi$ at $x$, not a derivative. Furthermore, one can show (HW) that the set of minimizers of $\Psi(x)$ is the same as the set of points that satisfy $\nabla\Psi(x) = A'(Ax - y) = 0$. So again, even though a derivative is not defined here, the descent direction sure walks like a duck and talks like a duck, I mean like a (negated) derivative.
3.4 Descent direction for edge-preserving regularizer: complex case
Now consider an edge-preserving regularizer defined on $\mathbb{C}^N$:
$$R(x) = \sum_{k=1}^K \psi([Cx]_k), \quad \text{where } \psi(z) = f(|z|), \tag{3.7}$$
for some potential function $\psi : \mathbb{C} \mapsto \mathbb{R}$ defined in terms of some function $f : \mathbb{R} \mapsto \mathbb{R}$.
If $f(r) = r^2$, then it follows from p. 3.22 that $-C'Cx$ is a descent direction for $R(x)$ on $\mathbb{C}^N$. But it seems unclear in general how to define a descent direction, due to the $|\cdot|$ above.
To proceed, make the following assumptions about the function $f$:
• $f : \mathbb{R} \mapsto \mathbb{R}$,
• $0 \le s \le t \implies f(s) \le f(t)$ (monotone),
• $f$ is differentiable on $\mathbb{R}$,
• $\omega_f(t) \triangleq \dot{f}(t)/t$ is well-defined for all $t \in \mathbb{R}$, including $t = 0$,
• $0 \le \omega_f(t) \le \omega_{\max} < \infty$.
Example. For the Fair potential, $\omega_f(t) = 1/(1 + |t/\delta|) \in (0, 1]$.
Claim. For these assumptions, a descent direction for the edge-preserving regularizer (3.7) for $x \in \mathbb{F}^N$ is
$$-\nabla R(x) = -C'\,\mathrm{diag}\{\omega_f.(|Cx|)\}\,Cx, \tag{3.8}$$
where the $|\cdot|$ is evaluated element-wise, like abs.() in JULIA. Intuition: $\dot{f}(t) = \omega_f(t)\,t$. (A code sketch of (3.8) follows; the proof comes after it.)
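A minimal Julia sketch of the direction (3.8); the weighting $\omega_f$ shown is the Fair-potential choice from the example above, and the names are illustrative.

```julia
using LinearAlgebra

ωf(t; δ = 1.0) = 1 / (1 + abs(t / δ))        # Fair-potential weighting ω_f(t)

# Descent direction (3.8): −C' diag{ω_f.(|Cx|)} Cx  (works for real or complex x)
function descent_dir(x, C)
    u = C * x
    return -(C' * (ωf.(abs.(u)) .* u))
end
```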
To prove this claim, we focus on just one term in the sum: (Read)
$$r(x) \triangleq \psi(v'x) = f(|v'x|), \qquad d(x) = -v\,\omega_f(|v'x|)\,v'x,$$
for some nonzero vector $v \in \mathbb{C}^N$. (A row of $C$ in (3.7) corresponds to $v'$.)
Letting $\omega = \omega_f(|v'x|) \le \omega_{\max}$, we have
$$r(x + \epsilon d) = f(|v'(x + \epsilon d)|) = f(|v'(x - \epsilon\omega v v'x)|) = f(|v'x\,(1 - \epsilon\omega v'v)|) = f\!\left(|v'x|\,\left|1 - \epsilon\omega\|v\|_2^2\right|\right).$$
Now when $0 \le \epsilon \le 1/(\omega\|v\|_2^2)$, then
$$\left|1 - \epsilon\omega\|v\|_2^2\right| = 1 - \epsilon\omega\|v\|_2^2 \in [0, 1],$$