Chapter 3 Gradient-based optimization
Contents (class version)

3.0 Introduction
3.1 Lipschitz continuity
3.2 Gradient descent for smooth convex functions
3.3 Preconditioned steepest descent
    Preconditioning: overview
    Descent direction
    Complex case
3.4 Descent direction for edge-preserving regularizer: complex case
    GD step size with preconditioning
    Finite difference implementation
    Orthogonality for steepest descent and conjugate gradients
3.5 General inverse problems
3.6 Convergence rates
    Heavy ball method
    Generalized convergence analysis of PGD
    Generalized Nesterov fast gradient method (FGM)
3.7 First-order methods
    General first-order method classes
    Optimized gradient method (OGM)
3.8 Machine learning via logistic regression for binary classification
    Adaptive restart of OGM
3.9 Summary
3.0 Introduction

To solve a problem like
$$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \Psi(x)$$
via an iterative method, we start with some initial guess $x_0$, and then the algorithm produces a sequence $\{x_t\}$ where hopefully the sequence converges to $\hat{x}$, meaning $\|x_t - \hat{x}\| \to 0$ for some norm $\|\cdot\|$ as $t \to \infty$. What algorithm we use depends greatly on the properties of the cost function $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$.

EECS 551 explored the gradient descent (GD) and preconditioned gradient descent (PGD) algorithms for solving least-squares problems in detail. Here we review the general form of gradient descent (GD) for convex minimization problems; the LS application is simply a special case.
(Venn diagram of classes of convex functions: nonsmooth, composite, differentiable, Lipschitz continuous gradient, twice differentiable with bounded curvature, quadratic (LS).)
Motivating application(s)

We focus initially on the numerous SIPML applications where the cost function is convex and smooth, meaning it has a Lipschitz continuous gradient. A concrete family of applications is edge-preserving image recovery where the measurement model is
$$y = Ax + \varepsilon$$
for some matrix $A$, and we estimate $x$ using
$$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \frac{1}{2}\|Ax - y\|_2^2 + \beta R(x),$$
where the regularizer is convex and smooth, such as
$$R(x) = \sum_k \psi([Cx]_k)$$
for a potential function $\psi$ that has a Lipschitz continuous derivative, such as the Fair potential.
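Here is a minimal Julia sketch of this regularized LS cost; the function name `cost` and the choice of passing the potential as an argument are illustrative, not from the notes.

```julia
using LinearAlgebra

# Regularized LS cost Ψ(x) = ½‖Ax − y‖₂² + β R(x) with R(x) = Σₖ ψ([Cx]ₖ),
# for any scalar potential ψ (e.g., the Fair potential defined in §3.1)
function cost(x, A, y, C, β, ψ)
    return 0.5 * norm(A*x - y)^2 + β * sum(ψ, C*x)
end
```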
Example. Here is an example of image deblurring or image restoration that was performed using such a method. The left image is the blurry noisy image $y$, and the right image is the restored image $\hat{x}$.
Step sizes and Lipschitz constant preview

For gradient-based optimization methods, a key issue is choosing an appropriate step size (aka learning rate in ML). Usually the appropriate range of step sizes is determined by the Lipschitz constant of $\nabla\Psi$, so we focus on that next.
3.1 Lipschitz continuity

The concept of Lipschitz continuity is defined for general metric spaces, but we focus on vector spaces.
Define. A function $g : \mathbb{F}^N \mapsto \mathbb{F}^M$ is Lipschitz continuous if there exists $L < \infty$, called a Lipschitz constant, such that
$$\|g(x) - g(z)\| \le L\,\|x - z\|, \quad \forall x, z \in \mathbb{F}^N.$$
In general the norms on $\mathbb{F}^N$ and $\mathbb{F}^M$ can differ, and $L$ will depend on the choice of the norms. We will focus on the Euclidean norms unless otherwise specified.

Define. The smallest such $L$ is called the best Lipschitz constant. (Often just "the" LC.)

Algebraic properties
Let $f$ and $g$ be Lipschitz continuous functions with (best) Lipschitz constants $L_f$ and $L_g$ respectively.

    h(x)              L_h
    α f(x) + β        |α| L_f            scale/shift
    f(x − x₀)         L_f                translate
    f(x) + g(x)       ≤ L_f + L_g        add
    f(g(x))           ≤ L_f L_g          compose (HW)
    A x + b           |||A|||            affine (for same norm on F^M and F^N)
    f(x) g(x)         ?                  multiply
If f and g are Lipschitz continuous functions on R, then h(x) = f(x)g(x) is a Lipschitz continuous function on R. (?) A: True B: False ??
If $f : \mathbb{F}^N \mapsto \mathbb{F}$ and $g : \mathbb{F}^N \mapsto \mathbb{F}$ are Lipschitz continuous functions and $h(x) \triangleq f(x)\,g(x)$, and $|f(x)| \le f_{\max} < \infty$ and $|g(x)| \le g_{\max} < \infty$, then $h(\cdot)$ is Lipschitz continuous on $\mathbb{F}^N$ and $L_h \le f_{\max} L_g + g_{\max} L_f$.

Proof.
$$|h(x) - h(z)| = |f(x)g(x) - f(z)g(z)| = |f(x)(g(x) - g(z)) - (f(z) - f(x))g(z)|$$
$$\le |f(x)|\,|g(x) - g(z)| + |g(z)|\,|f(x) - f(z)| \quad \text{by the triangle inequality}$$
$$\le (f_{\max} L_g + g_{\max} L_f)\,\|x - z\|_2. \qquad \square$$
Is boundedness of both f and g a necessary condition? (group)
No. Think $f(x) = \alpha$ and $g(x) = x$. Then $L_{fg} = |\alpha| = f_{\max}$, but $g(\cdot)$ is unbounded yet Lipschitz.
Think $f(x) = g(x) = \sqrt{|x|}$, both unbounded. But $h(x) = f(x)g(x) = |x|$ has $L_h = 1$.
For our purposes, we especially care about cost functions whose gradients are Lipschitz continuous. We call these smooth functions. The definition of gradient is subtle for functions on $\mathbb{C}^N$, so here we focus on $\mathbb{R}^N$.

Define. A differentiable function $f(x)$ is called smooth iff it has a Lipschitz continuous gradient, i.e., iff $\exists L < \infty$ such that
$$\|\nabla f(x) - \nabla f(z)\|_2 \le L\,\|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N.$$
Lipschitz continuity of $\nabla f$ is a stronger condition than mere continuity, so any differentiable function whose gradient is Lipschitz continuous is in fact a continuously differentiable function.
The set of differentiable functions on $\mathbb{R}^N$ having $L$-Lipschitz continuous gradients is sometimes denoted $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 20].

Example. For $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ we have
$$\|\nabla f(x) - \nabla f(z)\|_2 = \|A'(Ax - y) - A'(Az - y)\|_2 = \|A'A(x - z)\|_2 \le |||A'A|||_2\,\|x - z\|_2.$$
So the Lipschitz constant of $\nabla f$ is $L_{\nabla f} = |||A'A|||_2 = |||A|||_2^2 = \sigma_{\max}^2(A) = \rho(A'A)$.

The value $L_{\nabla f} = |||A'A|||_2$ is the best Lipschitz constant for $\nabla f(\cdot)$. (?) A: True B: False ??
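As a quick numeric illustration of this example, here is a Julia check that $|||A|||_2^2 = |||A'A|||_2$; the matrix is arbitrary and the variable names are illustrative.

```julia
using LinearAlgebra

# Lipschitz constant of ∇f for f(x) = ½‖Ax − y‖₂²:  L = ‖A‖₂² = σ_max(A)²
A = randn(100, 50)                  # any matrix, just for illustration
L = opnorm(A)^2                     # spectral norm squared
@assert isapprox(L, opnorm(A' * A); rtol = 1e-8)
```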
Here is an interesting geometric property of functions in $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 22, Lemma 1.2.3]:
$$|f(x) - f(z) - \langle\nabla f(z),\, x - z\rangle| \le \frac{L}{2}\,\|x - z\|_2^2, \quad \forall x, z \in \mathbb{R}^N.$$
In other words, for any point $z$, the function $f(x)$ is bounded between the two quadratic functions
$$q_{\pm}(x) \triangleq f(z) + \langle\nabla f(z),\, x - z\rangle \pm \frac{L}{2}\,\|x - z\|_2^2.$$
(Picture of sinusoid sin(x) with bounding upward and downward parabolas.)
Convex functions with Lipschitz continuous gradients

See [1, p. 56] for many equivalent conditions for a convex differentiable function $f$ to have a Lipschitz continuous gradient, such as the following holding for all $x, z \in \mathbb{R}^N$:
$$\underbrace{f(z) + \langle\nabla f(z),\, x - z\rangle}_{\text{tangent plane property}} \le f(x) \le \underbrace{f(z) + \langle\nabla f(z),\, x - z\rangle + \frac{L}{2}\,\|x - z\|_2^2}_{\text{quadratic majorization property}}. \quad \text{(Picture)}$$
The left inequality holds for all differentiable convex functions.
Fact. If $f(x)$ is twice differentiable and if there exists $L < \infty$ such that its Hessian matrix has a bounded spectral norm:
$$|||\nabla^2 f(x)|||_2 \le L, \quad \forall x \in \mathbb{R}^N, \tag{3.1}$$
then $f(x)$ has a Lipschitz continuous gradient with Lipschitz constant $L$.
So twice differentiability with bounded curvature is sufficient, but not necessary, for a function to have a Lipschitz continuous gradient.

Proof. Using Taylor's theorem, the triangle inequality, and the definition of the spectral norm:
$$\|\nabla f(x) - \nabla f(z)\|_2 = \left\|\left(\int_0^1 \nabla^2 f(x + \tau(z - x))\, d\tau\right)(x - z)\right\|_2
\le \int_0^1 |||\nabla^2 f(x + \tau(z - x))|||_2\, d\tau\, \|x - z\|_2
\le \int_0^1 L\, d\tau\, \|x - z\|_2 = L\,\|x - z\|_2. \qquad \square$$
Example. $f(x) = \frac{1}{2}\|Ax - y\|_2^2 \implies \nabla^2 f = A'A$, so $|||\nabla^2 f|||_2 = |||A'A|||_2 = |||A|||_2^2$.

Example. The Lipschitz constant for the gradient of $f(x) \triangleq x' \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} x$ is:
$\nabla^2 f = 2 z z'$ where $z = [1\ 2]'$, so $|||\nabla^2 f|||_2 = 2\,\|z\|_2^2 = 10$.
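A quick Julia check of the second example (the matrix is reconstructed here as $B = zz'$, which matches the stated Hessian $2zz'$):

```julia
using LinearAlgebra

# Verify: f(x) = x'Bx with B = zz' and z = [1, 2] has Hessian 2zz' of norm 10
z = [1.0, 2.0]
B = z * z'                           # = [1 2; 2 4]
H = 2 * B                            # Hessian of x'Bx
@assert opnorm(H) ≈ 2 * norm(z)^2    # both equal 10
```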
Boundedness of the 2nd derivative is not a necessary condition in general, because Lipschitz continuity of the derivative of a function does not require the function to be twice differentiable.

Example. Consider $f(x) = \frac{1}{2}([x]_+)^2$. The derivative of this function is $\dot{f}(x) = [x]_+$, which has Lipschitz constant $L = 1$, yet $f$ is not twice differentiable.

However, if a 1D function from $\mathbb{R}$ to $\mathbb{R}$ is twice differentiable, then its derivative is Lipschitz iff its second derivative is bounded.
Proof. The "if" direction follows from (3.1). For the "only if" direction, suppose $\ddot{f}$ is unbounded. Then for any $L < \infty$ there exists a point $x \in \mathbb{R}$ such that $|\ddot{f}(x)| > L$. Now consider $z = x \pm \epsilon$ and let $g(x) = \dot{f}(x)$. Then
$$\left|\frac{g(x) - g(z)}{x - z}\right| = \left|\frac{g(x) - g(x \pm \epsilon)}{\mp\epsilon}\right| \to |\ddot{f}(x)| > L \quad \text{as } \epsilon \to 0,$$
so $g$ cannot be $L$-Lipschitz continuous. This holds for every $L < \infty$. $\square$

Challenge. Generalize this partial converse of (3.1) to twice differentiable functions from $\mathbb{R}^N$ to $\mathbb{R}$, i.e., prove or disprove this conjecture: if $f : \mathbb{R}^N \mapsto \mathbb{R}$ is twice differentiable, then $\nabla f$ is Lipschitz continuous iff the bounded Hessian norm property (3.1) holds.
(Read) Example. The Fair potential used in many imaging applications [2][3] is
$$\psi(z) = \delta^2\left(|z/\delta| - \log(1 + |z/\delta|)\right), \tag{3.2}$$
for some $\delta > 0$, and has the property of being roughly quadratic for $z \approx 0$ and roughly like $\delta\,|z|$ for $|z| \gg \delta$.
When the domain of $\psi$ is $\mathbb{R}$, we can differentiate (carefully treating $z > 0$ and $z < 0$ separately):
$$\dot{\psi}(z) = \frac{z}{1 + |z/\delta|} \quad\text{and}\quad \ddot{\psi}(z) = \frac{1}{(1 + |z/\delta|)^2} \le 1,$$
so the Lipschitz constant of the derivative of $\psi(\cdot)$ is 1.
Furthermore, its second derivative is nonnegative, so it is a convex function.
(Figure: plots of $\psi$, $\dot\psi$, and $\ddot\psi$ for $\delta = 1$.)
Example. Is the Fair potential $\psi$ itself Lipschitz continuous? Yes: its derivative is bounded, $|\dot{\psi}(z)| = \frac{|z|}{1 + |z/\delta|} \le \delta$.
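Here is a minimal Julia sketch of the Fair potential and its first two derivatives; the names and the numeric check are illustrative.

```julia
# Fair potential (3.2) and its first two derivatives
fair(z, δ)       = δ^2 * (abs(z / δ) - log(1 + abs(z / δ)))
fair_deriv(z, δ) = z / (1 + abs(z / δ))          # Lipschitz constant 1
fair_curv(z, δ)  = 1 / (1 + abs(z / δ))^2        # second derivative, ≤ 1

# numerical check of the curvature bound for δ = 1
@assert all(fair_curv(z, 1.0) ≤ 1 for z in -10:0.01:10)
```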
Edge-preserving regularizer and Lipschitz continuity

Example. Determine "the" Lipschitz constant for the gradient of the edge-preserving regularizer in $\mathbb{R}^N$, when the derivative $\dot\psi$ of the potential function $\psi$ has Lipschitz constant $L_{\dot\psi}$:
$$R(x) = \sum_{k=1}^K \psi([Cx]_k) \implies \nabla R(x) = C'\,\dot\psi.(Cx) = h(g(f(x))), \tag{3.3}$$
where $f(x) = Cx$, $g(u) = \dot\psi.(u)$, $h(v) = C'v$, $L_f = |||C|||_2$, $L_h = |||C'|||_2$. (students finish it)
$$\|g(u) - g(v)\|_2^2 = \|\dot\psi.(u) - \dot\psi.(v)\|_2^2 = \sum_k |\dot\psi(u_k) - \dot\psi(v_k)|^2 \le \sum_k L_{\dot\psi}^2\,|u_k - v_k|^2 = L_{\dot\psi}^2\,\|u - v\|_2^2$$
$$\implies L_g \le L_{\dot\psi} \implies L_{\nabla R} \le L_{\dot\psi}\,|||C|||_2^2 = L_{\dot\psi}\,|||C'C|||_2. \tag{3.4}$$

Thus when $\ddot\psi \le 1$, a Lipschitz constant for the gradient of the above $R(x)$ is:
A: 1  B: $|||C|||_2$  C: $|||C'C|||_2$  D: $|||C|||_2^4$  E: None of these ??

Showing this $L_{\nabla R}$ is the best Lipschitz constant is a HW problem.
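A minimal Julia sketch of (3.3)-(3.4): the gradient of $R$ and its Lipschitz bound, assuming a 1D first-difference matrix $C$ and the Fair-potential derivative (both illustrative choices, not prescribed by the notes).

```julia
using LinearAlgebra

N = 8
C = [i == j ? -1.0 : (j == i + 1 ? 1.0 : 0.0) for i in 1:N-1, j in 1:N]  # first differences
ψdot(z; δ = 1.0) = z / (1 + abs(z / δ))      # Fair potential derivative, L_ψ̇ = 1

gradR(x) = C' * ψdot.(C * x)                 # ∇R(x) = C' ψ̇.(Cx), per (3.3)

L_gradR = opnorm(C)^2                        # bound (3.4): L_∇R ≤ L_ψ̇ ‖C‖₂²
```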
3.2 Gradient descent for smooth convex functions

If
• the convex function $\Psi(x)$ has a (not necessarily unique) minimizer $\hat{x}$ for which
$$-\infty < \Psi(\hat{x}) \le \Psi(x), \quad \forall x \in \mathbb{R}^N,$$
• $\Psi$ is smooth, i.e., the gradient of $\Psi(x)$ is Lipschitz continuous:
$$\|\nabla\Psi(x) - \nabla\Psi(z)\|_2 \le L\,\|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N,$$
• the step size $\alpha$ is chosen such that $0 < \alpha < 2/L$,
then the GD iteration
$$x_{k+1} = x_k - \alpha\,\nabla\Psi(x_k)$$
has the following convergence properties [4, p. 207] (a minimal implementation sketch follows this list).
• The cost function is non-increasing (monotone): $\Psi(x_{k+1}) \le \Psi(x_k)$, $\forall k \ge 0$.
• The distance to any minimizer $\hat{x}$ is non-increasing (monotone): $\|x_{k+1} - \hat{x}\|_2 \le \|x_k - \hat{x}\|_2$, $\forall k \ge 0$.
• The sequence $\{x_k\}$ converges to a minimizer of $\Psi(\cdot)$.
• The gradient norm converges to zero [4, p. 22]: $\|\nabla\Psi(x_k)\|_2 \to 0$.
• For $0 < \alpha \le 1/L$, the cost function decrease is bounded by [5]:
$$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L\,\|x_0 - \hat{x}\|_2^2}{2}\,\max\!\left(\frac{1}{2k\alpha L + 1},\, (1 - \alpha L)^{2k}\right).$$
This upper bound is conjectured to also hold for $1/L < \alpha < 2/L$ [6].
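The sketch referenced above: fixed-step GD for the LS special case, in Julia; the step $\alpha = 1/L$ and all names are illustrative.

```julia
using LinearAlgebra

# Fixed-step gradient descent for Ψ(x) = ½‖Ax − y‖₂², with α = 1/L and L = ‖A‖₂²
function gd_ls(A, y; iters = 200)
    L = opnorm(A)^2
    α = 1 / L
    x = zeros(size(A, 2))
    for _ in 1:iters
        x -= α * (A' * (A * x - y))    # ∇Ψ(x) = A'(Ax − y)
    end
    return x
end
```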
Optimal asymptotic step size for GD (Read)

The above step size range $0 < \alpha < 2/L$ is a wide range of values, and one might ask what is the best choice.
For a LS cost function $f(x) = \frac{1}{2}\|Ax - y\|_2^2$, the EECS 551 notes show that the asymptotically optimal choice of the step size is:
$$\alpha_* = \frac{2}{\sigma_{\max}(A'A) + \sigma_{\min}(A'A)} = \frac{2}{\sigma_{\max}(\nabla^2 f) + \sigma_{\min}(\nabla^2 f)},$$
because $\nabla^2 f = A'A$. For more general cost functions that are twice differentiable, one can apply similar analyses to show that the asymptotically optimal choice is
$$\alpha_* = \frac{2}{\sigma_{\max}(\nabla^2 f(\hat{x})) + \sigma_{\min}(\nabla^2 f(\hat{x}))}.$$
Although this formula is an interesting generalization, it is of little practical use because we do not know the minimizer $\hat{x}$, and the Hessian $\nabla^2 f$ and its SVD are infeasible for large problems. Furthermore, the asymptotically optimal choice of $\alpha_*$ may not be the best step size in the early iterations when the iterates are far from $\hat{x}$.
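For small problems one can evaluate this $\alpha_*$ directly; a Julia sketch (the matrix is illustrative, and as noted above this computation is infeasible at large scale):

```julia
using LinearAlgebra

A = randn(30, 10)                    # small illustrative problem
s = svdvals(A)                       # singular values, decreasing order
α_star = 2 / (s[1]^2 + s[end]^2)     # σ_max(A'A) = σ_max(A)², σ_min(A'A) = σ_min(A)²
```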
Convergence rates (Read)

There are many ways to assess the convergence rate of an iterative algorithm like GD. Researchers study:
• $\Psi(x_k) \to \Psi(\hat{x})$
• $\|\nabla\Psi(x_k)\| \to 0$
• $\|x_k - \hat{x}\| \to 0$
both globally and locally.
(Figure: 1D sketch of $\Psi(x)$ illustrating $\Psi(x_k) - \Psi(\hat{x})$, $\|\nabla\Psi(x_k)\|$, and $\|x_k - \hat{x}\|$.)

Quantifying bounds on the rates of decrease of these quantities is an active research area. Even classical GD has relatively recent results [5] that tighten up the traditional bounds. The tightest possible worst-case bound for GD for the decrease of the cost function (with a fixed step size $\alpha = 1/L$) is $O(1/k)$:
$$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L\,\|x_0 - \hat{x}\|_2^2}{4k + 2},$$
where $L$ is the Lipschitz constant of the gradient $\nabla\Psi(x)$. In contrast, Nesterov's fast gradient method (p. 3.40) has a worst-case cost function decrease at rate at least $O(1/k^2)$, which can be improved (and has been) by only a constant factor [7].
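Since the notes only reach FGM later (p. 3.40), here is a hedged sketch of one standard form of Nesterov's fast gradient method applied to the smooth LS cost; the momentum-factor recursion shown is the common $t_k$ update, and all names are illustrative.

```julia
using LinearAlgebra

# One standard form of Nesterov's fast gradient method (FGM) for Ψ(x) = ½‖Ax − y‖₂²
function fgm_ls(A, y; iters = 200)
    L = opnorm(A)^2
    x = zeros(size(A, 2))
    z = copy(x)
    t = 1.0
    for _ in 1:iters
        x_new = z - (1 / L) * (A' * (A * z - y))        # gradient step at momentum point
        t_new = (1 + sqrt(1 + 4t^2)) / 2                # momentum factor update
        z = x_new + ((t - 1) / t_new) * (x_new - x)     # extrapolation
        x, t = x_new, t_new
    end
    return x
end
```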
Example. The following figure illustrates how slowly GD can converge for a simple LS problem with $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$ and $y = 0$. This case used the optimal step size $\alpha_*$ for illustration. This slow convergence has been the impetus for thousands of papers on faster algorithms!
(Figure: GD iterates zig-zagging across the elliptical contours of the LS cost.)
The ellipses show the contours of the LS cost function $\|Ax - y\|$.
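A minimal Julia sketch that reproduces this experiment; the starting point is illustrative (chosen only to show the zig-zag), not taken from the figure.

```julia
using LinearAlgebra

# GD on Ψ(x) = ½‖Ax − y‖₂² with A = diag(1, 2), y = 0, and the asymptotically
# optimal fixed step α_* = 2/(σ_max(A'A) + σ_min(A'A)) = 2/5
function gd_zigzag(; iters = 10)
    A = [1.0 0.0; 0.0 2.0]
    α = 2 / (4 + 1)
    x = [-4.0, 1.0]                  # illustrative starting point
    traj = [copy(x)]
    for _ in 1:iters
        x -= α * (A' * (A * x))      # ∇Ψ(x) = A'(Ax − y) with y = 0
        push!(traj, copy(x))
    end
    return traj                      # iterates zig-zag toward x̂ = 0
end
```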
Two ways to try to accelerate convergence are to use a preconditioner and/or a line search.
3.3 Preconditioned steepest descent (Read)

Instead of using GD with a fixed step size $\alpha$, an alternative is to do a line search to find the best step size at each iteration. This variation is called steepest descent (or GD with a line search) [8]. Here is how preconditioned steepest descent for a linear LS problem works:
$$d_k = -P\,\nabla\Psi(x_k) \qquad \text{search direction (negative preconditioned gradient)}$$
$$\alpha_k = \arg\min_{\alpha}\, \Psi(x_k + \alpha d_k) \qquad \text{step size}$$
$$x_{k+1} = x_k + \alpha_k d_k \qquad \text{update.}$$
• Finding $\alpha_k$ analytically for quadratic cases is a HW problem.
• By construction, this iteration is guaranteed to decrease the cost function monotonically, with strict decrease unless $x_k$ is already a minimizer, provided the preconditioner $P$ is positive definite. Expressed mathematically: $\nabla\Psi(x_k) \ne 0 \implies \Psi(x_{k+1}) < \Psi(x_k)$.
• Computing $\alpha_k$ takes some extra work, especially for non-quadratic problems. Often Nesterov's fast gradient method or the optimized gradient method (OGM) [7] are preferable because they do not require a line search (if the Lipschitz constant is available). See the sketch below for the quadratic case, where the line search has a closed form.
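A minimal Julia sketch of preconditioned steepest descent for the LS cost, where the exact line search reduces to $\alpha_k = -d'\,\nabla\Psi(x_k)/\|Ad\|_2^2$; the function name and the zero-gradient guard are illustrative.

```julia
using LinearAlgebra

# Preconditioned steepest descent for Ψ(x) = ½‖Ax − y‖₂² with exact line search
function psd_ls(A, y, P; iters = 50)
    x = zeros(size(A, 2))
    for _ in 1:iters
        g = A' * (A * x - y)            # ∇Ψ(x_k)
        norm(g) == 0 && return x        # already at a minimizer
        d = -(P * g)                    # search direction d_k = −P ∇Ψ(x_k)
        Ad = A * d
        α = -dot(d, g) / dot(Ad, Ad)    # exact minimizer of α ↦ Ψ(x_k + α d_k)
        x += α * d
    end
    return x
end

# e.g. psd_ls(A, y, I) recovers ordinary (unpreconditioned) steepest descent
```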
Preconditioning: overview (Read)
Why use the preconditioned search direction $d_k = -P\,\nabla\Psi(x_k)$?
Consider the least-squares cost function $\Psi(x) = \frac{1}{2}\|Ax - y\|_2^2$, and define a "preconditioned" cost function using a change of coordinates:
$$f(z) \triangleq \Psi(Tz) = \frac{1}{2}\|ATz - y\|_2^2.$$
The Hessian matrix of $f(\cdot)$ is $\nabla^2 f(z) = T'A'AT$. Applying GD to $f$ yields
$$z_{k+1} = z_k - \alpha\,\nabla f(z_k) = z_k - \alpha\,T'A'(ATz_k - y)$$
$$\implies Tz_{k+1} = Tz_k - \alpha\,TT'A'(ATz_k - y)$$
$$\implies x_{k+1} = x_k - \alpha\,PA'(Ax_k - y) = x_k - \alpha\,P\,\nabla\Psi(x_k),$$
where $x_k \triangleq Tz_k$ and $P \triangleq TT'$. So ordinary GD on $f$ is the same as preconditioned GD on $\Psi$.
If $\alpha = 1$ and $T = (A'A)^{-1/2}$, then $P = (A'A)^{-1}$ and $x_1 = (A'A)^{-1}A'y$.
In this sense $P = (A'A)^{-1} = [\nabla^2\Psi(x_k)]^{-1}$ is the ideal preconditioner.
To elaborate, when $T = (A'A)^{-1/2}$, then $f(z)$ simplifies as follows:
$$f(z) = \frac{1}{2}(ATz - y)'(ATz - y) = \frac{1}{2}\left(z'T'A'ATz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2\right)$$
$$= \frac{1}{2}\left(z'Iz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2\right) = \frac{1}{2}\left(\|z - T'A'y\|_2^2 - \|T'A'y\|_2^2 + \|y\|_2^2\right).$$
The next figures illustrate this property of converging in 1 iteration for a quadratic cost with the ideal preconditioner.

Example. Effect of the ideal preconditioner on quadratic cost function contours.
(Figure: contours of $\Psi(x)$; contours of $f(z)$.)
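A small Julia check of the one-step convergence claim with the ideal preconditioner (the problem size is arbitrary):

```julia
using LinearAlgebra

# With P = (A'A)⁻¹ and α = 1, preconditioned GD reaches the LS solution in one step
A = randn(20, 5); y = randn(20)
P = inv(A' * A)
x0 = zeros(5)
x1 = x0 - P * (A' * (A * x0 - y))
@assert x1 ≈ A \ y               # x₁ = (A'A)⁻¹ A'y, the LS solution
```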
Descent direction

Define. A vector $d \in \mathbb{F}^N$ is a descent direction for a cost function $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$ at a point $x$ iff moving locally from $x$ along the direction $d$ decreases the cost, i.e., (1D Picture)
$$\exists\, c = c(x, d, \Psi) > 0 \ \text{ s.t. }\ \forall\, \epsilon \in [0, c): \ \Psi(x + \epsilon d) \le \Psi(x). \tag{3.5}$$
With this definition, $d = 0$ is always a (degenerate) descent direction.

Fact. For $\mathbb{R}^N$, if $\Psi(x)$ is differentiable at $x$ and $P$ is positive definite, then the following vector, if nonzero, is a descent direction for $\Psi$ at $x$:
$$d = -P\,\nabla\Psi(x). \tag{3.6}$$
Proof sketch. Taylor's theorem yields (Read)
$$\Psi(x) - \Psi(x + \alpha d) = -\alpha\,\langle\nabla\Psi(x),\, d\rangle + o(\alpha) = \alpha\left(d'P^{-1}d + \frac{o(\alpha)}{\alpha}\right),$$
which will be positive for sufficiently small $\alpha$, because $d'P^{-1}d > 0$ for $P \succ 0$, and $o(\alpha)/\alpha \to 0$ as $\alpha \to 0$.

From this analysis we can see that designing/selecting a preconditioner that is positive definite is crucial. The two most common choices are:
• $P$ is diagonal with positive diagonal elements,
• $P = QDQ'$ where $D$ is diagonal with positive diagonal elements and $Q$ is unitary. In this case $Q$ is often circulant so we can use FFT operations to perform $Pg$ efficiently.
Complex case (Read)
The definition of descent direction in (3.5) is perfectly appropriate for both $\mathbb{R}^N$ and $\mathbb{C}^N$. However, the direction $d$ specified in (3.6) is problematic in general on $\mathbb{F}^N$ because many cost functions of interest are not holomorphic, so they are not differentiable on $\mathbb{C}^N$. However, despite not being differentiable, we can still find a descent direction for most cases of interest.

Example. The most important case of interest here is $\Psi : \mathbb{C}^N \mapsto \mathbb{R}$ defined by $\Psi(x) = \frac{1}{2}\|Ax - y\|_2^2$ where $A \in \mathbb{C}^{M\times N}$ and $y \in \mathbb{C}^M$. This function is not holomorphic. However, one can show that $d = -PA'(Ax - y)$ is a descent direction for $\Psi$ at $x$ when $P$ is a positive definite matrix. (HW)

In the context of optimization problems, when we write $g = \nabla\Psi(x) = A'(Ax - y)$ for the complex case, we mean that $-g$ is a descent direction for $\Psi$ at $x$, not a derivative. Furthermore, one can show (HW) that the set of minimizers of $\Psi(x)$ is the same as the set of points that satisfy $\nabla\Psi(x) = A'(Ax - y) = 0$. So again, even though a derivative is not defined here, the descent direction sure walks like a duck and talks like a duck, I mean like a (negated) derivative.
3.4 Descent direction for edge-preserving regularizer: complex case
Now consider an edge-preserving regularizer defined on $\mathbb{C}^N$:
$$R(x) = \sum_{k=1}^K \psi([Cx]_k), \quad \text{where } \psi(z) = f(|z|), \tag{3.7}$$
for some potential function $\psi : \mathbb{C} \mapsto \mathbb{R}$ defined in terms of some function $f : \mathbb{R} \mapsto \mathbb{R}$.
If $f(r) = r^2$, then it follows from p. 3.22 that $-C'Cx$ is a descent direction for $R(x)$ on $\mathbb{C}^N$. But it seems unclear in general how to define a descent direction, due to the $|\cdot|$ above.
To proceed, make the following assumptions about the function $f$:
• $f : \mathbb{R} \mapsto \mathbb{R}$,
• $0 \le s \le t \implies f(s) \le f(t)$ (monotone),
• $f$ is differentiable on $\mathbb{R}$,
• $\omega_f(t) \triangleq \dot{f}(t)/t$ is well-defined for all $t \in \mathbb{R}$, including $t = 0$,
• $0 \le \omega_f(t) \le \omega_{\max} < \infty$.
Example. For the Fair potential, $\omega_f(t) = 1/(1 + |t/\delta|) \in (0, 1]$.
Claim. For these assumptions, a descent direction for the edge-preserving regularizer (3.7) for $x \in \mathbb{F}^N$ is
$$-\nabla R(x) = -C'\,\mathrm{diag}\{\omega_f.(|Cx|)\}\,Cx, \tag{3.8}$$
where the $|\cdot|$ is evaluated element-wise, like abs.() in JULIA. Intuition: $\dot{f}(t) = \omega_f(t)\,t$. (A code sketch of (3.8) follows; the proof comes after it.)
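A minimal Julia sketch of the direction (3.8); the weighting $\omega_f$ shown is the Fair-potential choice from the example above, and the names are illustrative.

```julia
using LinearAlgebra

ωf(t; δ = 1.0) = 1 / (1 + abs(t / δ))        # Fair-potential weighting ω_f(t)

# Descent direction (3.8): −C' diag{ω_f.(|Cx|)} Cx  (works for real or complex x)
function descent_dir(x, C)
    u = C * x
    return -(C' * (ωf.(abs.(u)) .* u))
end
```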
To prove this claim, we focus on just one term in the sum: (Read)
$$r(x) \triangleq \psi(v'x) = f(|v'x|), \qquad d(x) = -v\,\omega_f(|v'x|)\,v'x,$$
for some nonzero vector $v \in \mathbb{C}^N$. (A row of $C$ in (3.7) corresponds to $v'$.)
Letting $\omega = \omega_f(|v'x|) \le \omega_{\max}$, we have
$$r(x + \epsilon d) = f(|v'(x + \epsilon d)|) = f(|v'(x - \epsilon\omega v v'x)|) = f(|v'x\,(1 - \epsilon\omega v'v)|) = f\!\left(|v'x|\,\left|1 - \epsilon\omega\|v\|_2^2\right|\right).$$
Now when $0 \le \epsilon \le 1/(\omega\|v\|_2^2)$, then
$$\left|1 - \epsilon\omega\|v\|_2^2\right| = 1 - \epsilon\omega\|v\|_2^2 \in [0, 1],$$