
Chapter 3 Gradient-based optimization

Contents (class version)
3.0 Introduction (3.3)
3.1 Lipschitz continuity (3.6)
3.2 Gradient descent for smooth convex functions (3.14)
3.3 Preconditioned steepest descent (3.18)
    Preconditioning: overview (3.19)
    Descent direction (3.21)
    Complex case (3.22)
3.4 Descent direction for edge-preserving regularizer: complex case (3.23)
    GD step size with preconditioning (3.28)
    Finite difference implementation (3.29)
    Orthogonality for steepest descent and conjugate gradients (3.31)
3.5 General inverse problems (3.32)
3.6 Convergence rates (3.34)
    Heavy ball method (3.35)
    Generalized convergence analysis of PGD (3.38)


    Generalized Nesterov fast gradient method (FGM) (3.40)
3.7 First-order methods (3.41)
    General first-order method classes (3.41)
    Optimized gradient method (OGM) (3.45)
3.8 Machine learning via logistic regression for binary classification (3.52)
    Adaptive restart of OGM (3.57)
3.9 Summary (3.58)

3.0 Introduction

To solve a problem like
$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \Psi(x)$
via an iterative method, we start with some initial guess $x_0$, and then the algorithm produces a sequence $\{x_t\}$ that we hope converges to $\hat{x}$, meaning $\|x_t - \hat{x}\| \to 0$ for some norm $\|\cdot\|$ as $t \to \infty$. Which algorithm to use depends greatly on the properties of the cost $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$. EECS 551 explored the gradient descent (GD) and preconditioned gradient descent (PGD) algorithms for solving least-squares problems in detail. Here we review the general form of gradient descent (GD) for convex minimization problems; the LS application is simply a special case.

[Venn diagram: nested classes of cost functions — quadratic (LS); twice differentiable with bounded curvature; differentiable with Lipschitz continuous gradient; differentiable; nonsmooth composite convex functions.]

Motivating application(s)

We focus initially on the numerous SIPML applications where the cost function is convex and smooth, meaning it has a Lipschitz continuous gradient. A concrete family of applications is edge-preserving image recovery, where the measurement model is

$y = Ax + \varepsilon$
for some matrix $A$, and we estimate $x$ using
$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \frac{1}{2} \|Ax - y\|_2^2 + \beta R(x),$
where the regularizer is convex and smooth, such as
$R(x) = \sum_k \psi([Cx]_k)$
for a potential function $\psi$ that has a Lipschitz continuous derivative, such as the Fair potential.

Example. Here is an example of image deblurring or image restoration performed using such a method. [Figure: the left image is the blurry noisy image $y$; the right image is the restored image $\hat{x}$.]

Step sizes and Lipschitz constant preview

For gradient-based optimization methods, a key issue is choosing an appropriate step size (aka learning rate in ML). Usually the appropriate range of step sizes is determined by the Lipschitz constant of $\nabla \Psi$, so we focus on that next.

3.1 Lipschitz continuity

The concept of Lipschitz continuity is defined for general spaces, but we focus on vector spaces.

Define. A function $g : \mathbb{F}^N \mapsto \mathbb{F}^M$ is Lipschitz continuous if there exists $L < \infty$, called a Lipschitz constant, such that
$\|g(x) - g(z)\| \le L \|x - z\|, \quad \forall x, z \in \mathbb{F}^N.$
In general the norms on $\mathbb{F}^N$ and $\mathbb{F}^M$ can differ, and $L$ will depend on the choice of the norms. We focus on the Euclidean norms unless otherwise specified.

Define. The smallest such $L$ is called the best Lipschitz constant. (Often just "the" Lipschitz constant.)

Algebraic properties

Let f and g be Lipschitz continuous functions with (best) Lipschitz constants Lf and Lg respectively.

    h(x)              L_h
    αf(x) + β         |α| L_f           scale/shift
    f(x − x0)         L_f               translate
    f(x) + g(x)       ≤ L_f + L_g       add
    f(g(x))           ≤ L_f L_g         compose (HW)
    Ax + b            |||A|||           affine (for same norm on F^M and F^N)
    f(x)g(x)          ?                 multiply

If $f$ and $g$ are Lipschitz continuous functions on $\mathbb{R}$, then $h(x) = f(x)g(x)$ is Lipschitz continuous on $\mathbb{R}$. (?) A: True B: False ??

If $f : \mathbb{F}^N \mapsto \mathbb{F}$ and $g : \mathbb{F}^N \mapsto \mathbb{F}$ are Lipschitz continuous functions, $h(x) \triangleq f(x) g(x)$,
and $|f(x)| \le f_{\max} < \infty$ and $|g(x)| \le g_{\max} < \infty$,
then $h(\cdot)$ is Lipschitz continuous on $\mathbb{F}^N$ and $L_h \le f_{\max} L_g + g_{\max} L_f$.
Proof.

$|h(x) - h(z)| = |f(x) g(x) - f(z) g(z)| = |f(x)(g(x) - g(z)) - (f(z) - f(x)) g(z)|$
$\le |f(x)| \, |g(x) - g(z)| + |g(z)| \, |f(x) - f(z)|$ (triangle inequality)
$\le (f_{\max} L_g + g_{\max} L_f) \|x - z\|_2. \qquad \square$

Is boundedness of both $f$ and $g$ a necessary condition? (group)

No. Think $f(x) = \alpha$ and $g(x) = x$. Then $L_{fg} = |\alpha| = f_{\max}$, but $g(\cdot)$ is unbounded yet Lipschitz.
Or think $f(x) = g(x) = \sqrt{|x|}$, both unbounded. But $h(x) = f(x) g(x) = |x|$ has $L_h = 1$.

For our purposes, we especially care about cost functions whose gradients are Lipschitz continuous. We call these smooth functions. The definition of gradient is subtle for functions on $\mathbb{C}^N$, so here we focus on $\mathbb{R}^N$.

Define. A differentiable function $f(x)$ is called smooth iff it has a Lipschitz continuous gradient, i.e., iff $\exists L < \infty$ such that
$\|\nabla f(x) - \nabla f(z)\|_2 \le L \|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N.$
Lipschitz continuity of $\nabla f$ is a stronger condition than mere continuity, so any differentiable function whose gradient is Lipschitz continuous is in fact a continuously differentiable function. The set of differentiable functions on $\mathbb{R}^N$ having $L$-Lipschitz continuous gradients is sometimes denoted $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 20].

Example. For $f(x) = \frac{1}{2} \|Ax - y\|_2^2$ we have
$\|\nabla f(x) - \nabla f(z)\|_2 = \|A'(Ax - y) - A'(Az - y)\|_2 = \|A'A(x - z)\|_2 \le |||A'A|||_2 \|x - z\|_2.$
So the Lipschitz constant of $\nabla f$ is $L_{\nabla f} = |||A'A|||_2 = |||A|||_2^2 = \sigma_1^2(A) = \rho(A'A)$.

The value $L_{\nabla f} = |||A'A|||_2$ is the best Lipschitz constant for $\nabla f(\cdot)$. (?) A: True B: False ??
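To make the LS example concrete, here is a small numerical sanity check (our own sketch, not from the notes); opnorm from the LinearAlgebra standard library computes the spectral norm $|||\cdot|||_2$:

using LinearAlgebra: opnorm, norm
A = randn(20, 10); y = randn(20)
gradf(x) = A' * (A * x - y)       # gradient of f(x) = ½‖Ax − y‖²
L = opnorm(A)^2                   # Lipschitz constant |||A′A|||₂ = |||A|||₂²
x, z = randn(10), randn(10)
@assert norm(gradf(x) - gradf(z)) ≤ L * norm(x - z) * (1 + 1e-12)  # holds for any x, z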

Here is an interesting geometric property of functions in $C_L^{1,1}(\mathbb{R}^N)$ [1, p. 22, Lemma 1.2.3]:
$|f(x) - f(z) - \langle \nabla f(z), x - z \rangle| \le \frac{L}{2} \|x - z\|_2^2, \quad \forall x, z \in \mathbb{R}^N.$
In other words, for any point $z$, the function $f(x)$ is bounded between the two quadratic functions
$q_\pm(x) \triangleq f(z) + \langle \nabla f(z), x - z \rangle \pm \frac{L}{2} \|x - z\|_2^2.$

(Picture of the sinusoid $\sin(x)$ with bounding upward and downward parabolas.)

Convex functions with Lipschitz continuous gradients

See [1, p. 56] for many equivalent conditions for a convex differentiable function $f$ to have a Lipschitz continuous gradient, such as the following holding for all $x, z \in \mathbb{R}^N$:
$\underbrace{f(z) + \langle \nabla f(z), x - z \rangle}_{\text{tangent plane property}} \le f(x) \le \underbrace{f(z) + \langle \nabla f(z), x - z \rangle + \frac{L}{2} \|x - z\|_2^2}_{\text{quadratic majorization property}}.$ (Picture)

The left inequality holds for all differentiable convex functions.

Fact. If $f(x)$ is twice differentiable and there exists $L < \infty$ such that its Hessian matrix has a bounded spectral norm:
$|||\nabla^2 f(x)|||_2 \le L, \quad \forall x \in \mathbb{R}^N,$ (3.1)
then $f(x)$ has a Lipschitz continuous gradient with Lipschitz constant $L$. So twice differentiability with bounded curvature is sufficient, but not necessary, for a function to have a Lipschitz continuous gradient.
Proof. Using Taylor's theorem and the definition of the spectral norm:

$\|\nabla f(x) - \nabla f(z)\|_2 = \left\| \left( \int_0^1 \nabla^2 f(x + \tau(z - x)) \, d\tau \right) (x - z) \right\|_2 \le \left( \int_0^1 |||\nabla^2 f(x + \tau(z - x))|||_2 \, d\tau \right) \|x - z\|_2 \le \left( \int_0^1 L \, d\tau \right) \|x - z\|_2 = L \|x - z\|_2.$

Example. $f(x) = \frac{1}{2} \|Ax - y\|_2^2 \Longrightarrow \nabla^2 f = A'A$, so $|||\nabla^2 f|||_2 = |||A'A|||_2 = |||A|||_2^2$.

Example. The Lipschitz constant for the gradient of $f(x) \triangleq x' \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} x$ is:
$\nabla^2 f = 2 z z'$ where $z = [1\ 2]'$, so $|||\nabla^2 f|||_2 = 2 \|z\|_2^2 = 10$.

Boundedness of the 2nd derivative is not a necessary condition in general, because Lipschitz continuity of the derivative of a function does not require the function to be twice differentiable.

Example. Consider $f(x) = \frac{1}{2} ([x]_+)^2$. The derivative of this function is $\dot{f}(x) = [x]_+$, which has Lipschitz constant $L = 1$, yet $f$ is not twice differentiable.

However, if a 1D function from $\mathbb{R}$ to $\mathbb{R}$ is twice differentiable, then its derivative is Lipschitz iff its second derivative is bounded.
Proof. The "if" follows from (3.1). For the "only if" direction, suppose $\ddot{f}$ is unbounded. Then for any $L < \infty$ there exists a point $x \in \mathbb{R}$ such that $|\ddot{f}(x)| > L$. Now consider $z = x \pm \epsilon$ and let $g(x) = \dot{f}(x)$. Then
$\frac{g(x) - g(z)}{x - z} = \frac{g(x) - g(x \pm \epsilon)}{\mp \epsilon} \to \ddot{f}(x)$ as $\epsilon \to 0$, with $|\ddot{f}(x)| > L$, so $g$ cannot be $L$-Lipschitz continuous. This argument holds for every $L < \infty$. $\square$

Challenge. Generalize this partial converse of (3.1) to twice differentiable functions from $\mathbb{R}^N$ to $\mathbb{R}$, i.e., prove or disprove this conjecture: if $f : \mathbb{R}^N \mapsto \mathbb{R}$ is twice differentiable, then $\nabla f$ is Lipschitz continuous iff the bounded Hessian norm property (3.1) holds.

(Read) Example. The Fair potential used in many imaging applications [2][3] is

$\psi(z) = \delta^2 \left( |z/\delta| - \log(1 + |z/\delta|) \right),$ (3.2)
for some $\delta > 0$. It has the property of being roughly quadratic for $z \approx 0$ and roughly like $\delta |z|$ for $|z| \gg \delta$. When the domain of $\psi$ is $\mathbb{R}$, we can differentiate (carefully treating $z > 0$ and $z < 0$ separately):
$\dot{\psi}(z) = \frac{z}{1 + |z/\delta|} \quad \text{and} \quad \ddot{\psi}(z) = \frac{1}{(1 + |z/\delta|)^2} \le 1,$
so the Lipschitz constant of the derivative of $\psi(\cdot)$ is 1. Furthermore, its second derivative is nonnegative, so it is a convex function.
[Figure: $\psi$ and $\dot\psi$ for $\delta = 1$.]

Example. Is the Fair potential $\psi$ itself Lipschitz continuous? Yes, because its derivative is bounded: $|\dot{\psi}(z)| = |z| / (1 + |z/\delta|) < \delta$.
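A small Julia sketch of the Fair potential (3.2) and the related functions used later in this chapter (the function names here are our own, not from any package):

# sketch of the Fair potential (3.2); δ is the design parameter
fair(z; δ=1)        = δ^2 * (abs(z / δ) - log(1 + abs(z / δ)))  # ψ(z)
fair_deriv(z; δ=1)  = z / (1 + abs(z / δ))    # ψ̇(z); Lipschitz with L = 1
fair_weight(z; δ=1) = 1 / (1 + abs(z / δ))    # ω_ψ(z) = ψ̇(z)/z ∈ (0, 1]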

Edge-preserving regularizer and Lipschitz continuity

Example. Determine "the" Lipschitz constant for the gradient of the edge-preserving regularizer on $\mathbb{R}^N$, when the derivative $\dot{\psi}$ of the potential function $\psi$ has Lipschitz constant $L_{\dot\psi}$:
$R(x) = \sum_{k=1}^K \psi([Cx]_k) \Longrightarrow \nabla R(x) = C' \, \dot{\psi}.(Cx) = h(g(f(x))),$ (3.3)
where $f(x) = Cx$, $g(u) = \dot{\psi}.(u)$, $h(v) = C'v$, $L_f = |||C|||_2$, $L_h = |||C'|||_2$. (students finish it)

$\|g(u) - g(v)\|_2^2 = \left\| \dot{\psi}.(u) - \dot{\psi}.(v) \right\|_2^2 = \sum_k \left| \dot{\psi}(u_k) - \dot{\psi}(v_k) \right|^2 \le \sum_k L_{\dot\psi}^2 |u_k - v_k|^2 = L_{\dot\psi}^2 \|u - v\|_2^2$
$\Longrightarrow L_g \le L_{\dot\psi} \Longrightarrow L_{\nabla R} \le L_{\dot\psi} |||C|||_2^2 = L_{\dot\psi} |||C'C|||_2.$ (3.4)

Thus when $\ddot{\psi} \le 1$, a Lipschitz constant for the gradient of the above $R(x)$ is:
A: 1  B: $|||C|||_2$  C: $|||C'C|||_2$  D: $|||C|||_2^4$  E: None of these ??

Showing this $L_{\nabla R}$ is the best Lipschitz constant is a HW problem.

3.2 Gradient descent for smooth convex functions

If
• the convex function $\Psi(x)$ has a (not necessarily unique) minimizer $\hat{x}$ for which $-\infty < \Psi(\hat{x}) \le \Psi(x), \forall x \in \mathbb{R}^N$,
• $\Psi$ is smooth, i.e., the gradient of $\Psi(x)$ is Lipschitz continuous:
$\|\nabla \Psi(x) - \nabla \Psi(z)\|_2 \le L \|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N,$
• the step size $\alpha$ is chosen such that $0 < \alpha < 2/L$,
then the GD iteration $x_{k+1} = x_k - \alpha \nabla \Psi(x_k)$ has the following convergence properties [4, p. 207].
• The cost function is non-increasing (monotone): $\Psi(x_{k+1}) \le \Psi(x_k), \forall k \ge 0$.
• The distance to any minimizer $\hat{x}$ is non-increasing (monotone): $\|x_{k+1} - \hat{x}\|_2 \le \|x_k - \hat{x}\|_2, \forall k \ge 0$.
• The sequence $\{x_k\}$ converges to a minimizer of $\Psi(\cdot)$.
• The gradient norm converges to zero [4, p. 22]: $\|\nabla \Psi(x_k)\|_2 \to 0$.
• For $0 < \alpha \le 1/L$, the cost function decrease is bounded by [5]:
$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L \|x_0 - \hat{x}\|_2^2}{2} \max\left( \frac{1}{2k\alpha + 1}, (1 - \alpha)^{2k} \right).$
This upper bound is conjectured to also hold for $1/L < \alpha < 2/L$ [6]. A minimal implementation sketch follows this list.
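Here is a minimal GD sketch in Julia tying the theorem to code; grad(x) returning $\nabla \Psi(x)$ and the Lipschitz constant L are assumed given, and the step size $\alpha = 1/L$ lies in the allowed range $(0, 2/L)$:

# minimal gradient descent sketch; grad and L assumed available
function gd(x0, grad, L; niter::Int = 100)
    x = copy(x0)
    α = 1 / L                 # any 0 < α < 2/L gives monotone descent
    for _ in 1:niter
        x = x - α * grad(x)
    end
    return x
end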

Optimal asymptotic step size for GD (Read)

The above step size range $0 < \alpha < 2/L$ is a wide range of values, and one might ask: what is the best choice? For a LS cost function $f(x) = \frac{1}{2} \|Ax - y\|_2^2$, the EECS 551 notes show that the asymptotically optimal choice of the step size is
$\alpha_* = \frac{2}{\sigma_{\max}(A'A) + \sigma_{\min}(A'A)} = \frac{2}{\sigma_{\max}(\nabla^2 f) + \sigma_{\min}(\nabla^2 f)},$
because $\nabla^2 f = A'A$. For more general cost functions that are twice differentiable, one can apply similar analyses to show that the asymptotically optimal choice is
$\alpha_* = \frac{2}{\sigma_{\max}(\nabla^2 f(\hat{x})) + \sigma_{\min}(\nabla^2 f(\hat{x}))}.$
Although this formula is an interesting generalization, it is of little practical use because we do not know the minimizer $\hat{x}$, and the Hessian $\nabla^2 f$ and its SVD are infeasible for large problems. Furthermore, the asymptotically optimal choice $\alpha_*$ may not be the best step size in the early iterations, when the iterates are far from $\hat{x}$.

Convergence rates (Read)

There are many ways to assess the convergence rate of an iterative algorithm like GD. Researchers study:
• $\Psi(x_k) \to \Psi(\hat{x})$
• $\|\nabla \Psi(x_k)\| \to 0$
• $\|x_k - \hat{x}\| \to 0$
both globally and locally.
[Figure: sketch of $\Psi(x)$ vs. $x$, illustrating $\Psi(x_k)$, $\Psi(\hat{x})$, $\|\nabla \Psi(x_k)\|$, and $\|x_k - \hat{x}\|$.]
Quantifying bounds on the rates of decrease of these quantities is an active research area. Even classical GD has relatively recent results [5] that tighten up the traditional bounds. The tightest possible worst-case bound for GD for the decrease of the cost function (with a fixed step size $\alpha = 1/L$) is $O(1/k)$:
$\Psi(x_k) - \Psi(\hat{x}) \le \frac{L \|x_0 - \hat{x}\|_2^2}{4k + 2},$
where $L$ is the Lipschitz constant of the gradient $\nabla f(x)$. In contrast, Nesterov's fast gradient method (p. 3.40) has a worst-case cost function decrease at rate at least $O(1/k^2)$, which can be improved (and has been) by only a constant factor [7].

Example. The following figure illustrates how slowly GD can converge for a simple LS problem with $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$ and $y = 0$. This case used the optimal step size $\alpha_*$ for illustration. This slow convergence has been the impetus for thousands of papers on faster algorithms!

[Figure: GD iterates zig-zagging toward the minimizer. The ellipses show the contours of the LS cost function $\|Ax - y\|$.]

Two ways to try to accelerate convergence are to use a preconditioner and/or a line search.

3.3 Preconditioned steepest descent (Read)

Instead of using GD with a fixed step size $\alpha$, an alternative is to do a line search to find the best step size at each iteration. This variation is called steepest descent (or GD with a line search) [8]. Here is how preconditioned steepest descent for a linear LS problem works:

$d_k = -P \nabla \Psi(x_k)$  (search direction: negative preconditioned gradient)
$\alpha_k = \arg\min_\alpha \Psi(x_k + \alpha d_k)$  (step size)
$x_{k+1} = x_k + \alpha_k d_k$  (update)

• Finding $\alpha_k$ analytically for quadratic cases is a HW problem.
• By construction, this iteration is guaranteed to decrease the cost function monotonically, with strict decrease unless $x_k$ is already a minimizer, provided the preconditioner $P$ is positive definite. Expressed mathematically: $\nabla \Psi(x_k) \ne 0 \Longrightarrow \Psi(x_{k+1}) < \Psi(x_k)$.

• Computing $\alpha_k$ takes some extra work, especially for non-quadratic problems. Often Nesterov's fast gradient method or the optimized gradient method (OGM) [7] are preferable because they do not require a line search (if the Lipschitz constant is available). A sketch of PSD for the LS case follows this list.
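For the quadratic LS cost, the line search has a closed form, so PSD is easy to sketch (our own illustration; A, y, and a positive definite P are assumed given):

using LinearAlgebra: dot, norm
# preconditioned steepest descent sketch for Ψ(x) = ½‖Ax − y‖²;
# for this quadratic the exact line search has the closed form below
function psd_ls(A, y, x0, P; niter::Int = 50)
    x = copy(x0)
    for _ in 1:niter
        g = A' * (A * x - y)          # gradient
        d = -(P * g)                  # preconditioned descent direction
        Ad = A * d
        α = -dot(d, g) / norm(Ad)^2   # exact minimizer of α ↦ Ψ(x + α d)
        x = x + α * d
    end
    return x
end

The closed form follows from setting the derivative of $\alpha \mapsto \frac{1}{2}\|A(x + \alpha d) - y\|_2^2$ to zero.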

Preconditioning: overview (Read)

Why use the preconditioned search direction $d_k = -P \nabla \Psi(x_k)$?
Consider the least-squares cost function $\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2$, and define a "preconditioned" cost function using a change of coordinates:
$f(z) \triangleq \Psi(Tz) = \frac{1}{2} \|ATz - y\|_2^2.$
The Hessian matrix of $f(\cdot)$ is $\nabla^2 f(z) = T'A'AT$. Applying GD to $f$ yields

$z_{k+1} = z_k - \alpha \nabla f(z_k) = z_k - \alpha T'A'(ATz_k - y)$
$\Longrightarrow T z_{k+1} = T z_k - \alpha T T' A'(A T z_k - y)$
$\Longrightarrow x_{k+1} = x_k - \alpha P A'(A x_k - y) = x_k - \alpha P \nabla \Psi(x_k),$
where $x_k \triangleq T z_k$ and $P \triangleq TT'$. So ordinary GD on $f$ is the same as preconditioned GD on $\Psi$.
If $\alpha = 1$ and $T = (A'A)^{-1/2}$, then $P = (A'A)^{-1}$ and $x_1 = (A'A)^{-1} A'y$.
In this sense $P = (A'A)^{-1} = [\nabla^2 \Psi(x_k)]^{-1}$ is the ideal preconditioner.

To elaborate, when $T = (A'A)^{-1/2}$, then $f(z)$ simplifies as follows:
$f(z) = \frac{1}{2} (ATz - y)'(ATz - y) = \frac{1}{2} \left( z'T'A'ATz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2 \right)$
$= \frac{1}{2} \left( z'Iz - 2\,\mathrm{real}\{y'ATz\} + \|y\|_2^2 \right) = \frac{1}{2} \left( \|z - T'A'y\|_2^2 - \|T'A'y\|_2^2 + \|y\|_2^2 \right).$
The next figures illustrate this property of converging in one iteration for a quadratic cost with the ideal preconditioner.
Example. Effect of the ideal preconditioner on quadratic cost function contours.
[Figure: elliptical contours of $\Psi(x)$ vs. circular contours of $f(z)$, showing $z_0$ and $z_*$.]

Descent direction

Define. A vector $d \in \mathbb{F}^N$ is a descent direction for a cost function $\Psi : \mathbb{F}^N \mapsto \mathbb{R}$ at a point $x$ iff moving locally from $x$ along the direction $d$ decreases the cost, i.e., (1D Picture)
$\exists\, c = c(x, d, \Psi) > 0 \ \text{s.t.}\ \forall \alpha \in [0, c) : \Psi(x + \alpha d) \le \Psi(x).$ (3.5)
With this definition, $d = 0$ is always a (degenerate) descent direction.

Fact. On $\mathbb{R}^N$, if $\Psi(x)$ is differentiable at $x$ and $P$ is positive definite, then the following vector, if nonzero, is a descent direction for $\Psi$ at $x$:
$d = -P \nabla \Psi(x).$ (3.6)

Proof sketch. Taylor’s theorem yields (Read)  o(α) Ψ(x) − Ψ(x + αd) = −α h∇ Ψ(x), di +o(α) = α d0P −1d + , α which will be positive for sufficiently small α, because d0P −1d > 0 for P 0, and o(α)/α → 0 as α → 0. From this analysis we can see that designing/selecting a preconditioner that is positive definite is crucial. The two most common choices are: • P is diagonal with positive diagonal elements, • P = QDQ0 where D is diagonal with positive diagonal elements and Q is unitary. In this case Q is often circulant so we can use FFT operations to perform P g efficiently. © J. Fessler, January 16, 2020, 17:44 (class version) 3.22

Complex case (Read)

The definition of descent direction in (3.5) is perfectly appropriate for both $\mathbb{R}^N$ and $\mathbb{C}^N$. However, the direction $d$ specified in (3.6) is problematic in general on $\mathbb{C}^N$ because many cost functions of interest are not holomorphic, so they are not differentiable on $\mathbb{C}^N$. Despite this lack of differentiability, we can still find a descent direction in most cases of interest.

Example. The most important case of interest here is $\Psi : \mathbb{C}^N \mapsto \mathbb{R}$ defined by $\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2$ where $A \in \mathbb{C}^{M \times N}$ and $y \in \mathbb{C}^M$. This function is not holomorphic. However, one can show that $d = -PA'(Ax - y)$ is a descent direction for $\Psi$ at $x$ when $P$ is a positive definite matrix. (HW)
In the context of optimization problems, when we write $g = \nabla \Psi(x) = A'(Ax - y)$ for the complex case, we mean that $-g$ is a descent direction for $\Psi$ at $x$, not a derivative. Furthermore, one can show (HW) that the set of minimizers of $\Psi(x)$ is the same as the set of points that satisfy $\nabla \Psi(x) = A'(Ax - y) = 0$. So even though a derivative is not defined here, the descent direction sure walks like a duck and talks like a duck, I mean like a (negated) derivative.

3.4 Descent direction for edge-preserving regularizer: complex case

Now consider an edge-preserving regularizer defined on CN :

$R(x) = \sum_{k=1}^K \psi([Cx]_k), \quad \text{where } \psi(z) = f(|z|),$ (3.7)
for some potential function $\psi : \mathbb{C} \mapsto \mathbb{R}$ defined in terms of some function $f : \mathbb{R} \mapsto \mathbb{R}$. If $f(r) = r^2$, then it follows from p. 3.22 that $-C'Cx$ is a descent direction for $R(x)$ on $\mathbb{C}^N$. But it seems unclear in general how to define a descent direction, due to the $|\cdot|$ above. To proceed, make the following assumptions about the function $f$:
• $f : \mathbb{R} \mapsto \mathbb{R}$;
• $0 \le s \le t \Longrightarrow f(s) \le f(t)$ (monotone);
• $f$ is differentiable on $\mathbb{R}$;
• $\omega_f(t) \triangleq \dot{f}(t)/t$ is well-defined for all $t \in \mathbb{R}$, including $t = 0$;
• $0 \le \omega_f(t) \le \omega_{\max} < \infty$.

Example. For the Fair potential, $\omega_f(t) = 1 / (1 + |t/\delta|) \in (0, 1]$.

Claim. Under these assumptions, a descent direction for the edge-preserving regularizer (3.7) for $x \in \mathbb{C}^N$ is
$-\nabla R(x) = -C' \,\mathrm{diag}\{\omega_f.(|Cx|)\}\, Cx,$ (3.8)
where the $|\cdot|$ is evaluated element-wise, like abs.() in JULIA. Intuition: $\dot{f}(t) = \omega_f(t)\, t$.
To prove this claim, we focus on just one term in the sum: (Read)
$r(x) \triangleq \psi(v'x) = f(|v'x|), \qquad d(x) = -v\, \omega_f(|v'x|)\, v'x,$
for some nonzero vector $v \in \mathbb{C}^N$. (A row of $C$ in (3.7) corresponds to $v'$.)
Letting $\omega = \omega_f(|v'x|) \le \omega_{\max}$, we have
$r(x + \epsilon d) = f(|v'(x + \epsilon d)|) = f(|v'(x - \epsilon \omega v v' x)|) = f(|v'x| \, |1 - \epsilon \omega v'v|) = f\left( |v'x| \, \left| 1 - \epsilon \omega \|v\|_2^2 \right| \right).$
Now when $0 \le \epsilon \le 1/(\omega \|v\|_2^2)$, then
$\left| 1 - \epsilon \omega \|v\|_2^2 \right| = 1 - \epsilon \omega \|v\|_2^2 \in [0, 1]$
$\Longrightarrow r(x + \epsilon d) = f\left( |v'x| \left( 1 - \epsilon \omega \|v\|_2^2 \right) \right) \le f(|v'x|) = r(x),$
using the monotone property of $f(\cdot)$. So $d(x)$ is a descent direction for $r(x)$. The proof of (3.8) involves the sum of $K$ such terms. $\square$

So, with our usual reuse of $\nabla$ to denote a (negated) descent direction, we will not write (3.3) for $\mathbb{C}^N$; instead we define $\omega_\psi = \omega_f$ and write:
$\nabla R(x) = C' \,\mathrm{diag}\{\omega_\psi.(|Cx|)\}\, Cx.$ (3.9)
If $\psi(z) = \psi(|z|), \forall z \in \mathbb{C}$, then we can define $\omega_\psi$ in terms of $\dot\psi$ for nonnegative real arguments. The result (3.9) is shown in [9] using Wirtinger calculus.

Lipschitz constant for descent direction (Read)

Having established the descent direction (3.8) for edge-preserving regularization on $\mathbb{C}^N$, the next step is to determine a Lipschitz constant for that function. Again we can write it as a composition of three functions:
$C' \,\mathrm{diag}\{\omega_\psi.(|Cx|)\}\, Cx = h(g(f(x))), \quad f(x) = Cx, \quad g(u) = d.(u), \quad h(v) = C'v, \quad d(z) \triangleq \omega_\psi(|z|)\, z.$
For real arguments $z$, and when $\dot\psi$ and hence $\omega_\psi$ are symmetric, $d(z) = \dot\psi(z)$, so the Lipschitz constant is easy. When $z \in \mathbb{C}$, I have not yet been able to prove Lipschitz continuity of $d(z)$. My conjecture, supported by numerical experiments and [9, App. A], is that if $\omega_\psi(t)$ is a non-increasing function of $t$ on $[0, \infty)$, in addition to the other assumptions on p. 3.23, then
$L_d = \omega_\psi(0)$ (3.10)
and, akin to (3.4), the Lipschitz constant for the descent direction on $\mathbb{C}^N$ is
$L_{\nabla R} = |||C|||_2^2 L_d.$ (3.11)

Challenge. Prove (3.10). Here are some initial steps that might help:
$|d(x) - d(z)| = |\omega_\psi(|x|)\, x - \omega_\psi(|z|)\, z| = \left| \omega_\psi(|x|) |x| e^{\imath \angle x} - \omega_\psi(|z|) |z| e^{\imath \angle z} \right|$
$= \left| \dot\psi(|x|) e^{\imath \angle x} - \dot\psi(|z|) e^{\imath \angle z} \right| \le \; ? \; \le \omega_\psi(0) |x - z|.$

Practical Lipschitz constant (Read)

In general, computing $|||C|||_2^2$ in (3.4) exactly would require an SVD or the power iteration, both of which are impractical for large-scale problems. If we use finite differences with periodic boundary conditions, then $C$ is circulant and hence is a normal matrix, so $|||C|||_2 = \rho(C)$, where $\rho(\cdot)$ denotes the spectral radius. For 1D finite differences, the spectral radius is 2 for $N$ even and $1 + \cos(\pi (N-1)/N) \approx 2$ for $N$ odd. (HW)
But for nonperiodic boundary conditions we need a different approach. Recall that because $C'C$ is symmetric:
$|||C|||_2^2 = |||C'C|||_2 = \sigma_1(C'C) = \rho(C'C) \le |||C'C|||,$
for any matrix norm $|||\cdot|||$. In particular, the matrix 1-norm is convenient:
$|||C'C|||_1 \le |||C'|||_1 |||C|||_1 = |||C|||_\infty |||C|||_1$
$\Longrightarrow |||C|||_2^2 \le |||C|||_\infty |||C|||_1 = 2 \cdot 2 = 4$ for 1D finite differences,
because there is at most a single $+1$ and $-1$ in each row or column of $C$. Interestingly, this 1-norm approach gives us an upper bound on $|||C|||_2^2$ for any boundary conditions that matches the exact value when using periodic boundary conditions.

So the practical choice for 1D first-order finite differences is to use $L_{\nabla R} = 4 L_{\dot\psi}$, where often we scale the potential functions so that $L_{\dot\psi} = 1$.
Bottom line: never use opnorm() when working with finite differences!

GD step size with preconditioning

We earlier argued that the preconditioned gradient $-P \nabla \Psi(x)$ can be preferable to $-\nabla \Psi(x)$ as a descent direction. But the GD convergence theorem on p. 3.14 had no $P$ in it. So must we resort to PSD (p. 3.18), which requires a line search? No!

Suppose $\Psi$ is convex and has a Lipschitz continuous gradient with Lipschitz constant $L_{\nabla \Psi}$. Define a new function in a transformed coordinate system: $f(z) \triangleq \Psi(Tz)$. Using the properties on p. 3.6, this function also has a Lipschitz continuous gradient, and
$\nabla f(z) = T' \nabla \Psi(Tz) \Longrightarrow L_{\nabla f} \le |||T'T|||_2 L_{\nabla \Psi}.$
Choose a step size $0 < \alpha_f < 2/L_{\nabla f}$ for applying GD to $f$, yielding
$z_{k+1} = z_k - \alpha_f \nabla f(z_k) = z_k - \alpha_f T' \nabla \Psi(T z_k)$
$\Longrightarrow T z_{k+1} = T z_k - \alpha_f T T' \nabla \Psi(T z_k)$
$\Longrightarrow x_{k+1} = x_k - \alpha_f P \nabla \Psi(x_k),$
where $x_k \triangleq T z_k$ and $P = TT'$. So ordinary GD on $f$ is the same as preconditioned GD on $\Psi$. The step size should satisfy $0 < \alpha_f < 2/L_{\nabla f}$, so it suffices (but can be suboptimal; see HW) to choose
$0 < \alpha_f < \frac{2}{|||T'T|||_2 L_{\nabla \Psi}} = \frac{2}{|||P|||_2 L_{\nabla \Psi}},$
because $|||T'T|||_2 = |||TT'|||_2 = |||P|||_2$.

Finite difference implementation

For $x \in \mathbb{F}^N$ we need to compute the first-order finite differences $d_n = x_{n+1} - x_n$, $n = 1, \ldots, N-1$, which in matrix notation is $d = Cx \in \mathbb{F}^{N-1}$. Here are eight (!) different implementations in JULIA (the one-liners assume N = length(x), plus using DSP, LinearAlgebra, or SparseArrays as noted):

function loopdiff(x::AbstractVector)
    N = length(x)
    y = similar(x, N-1)
    for n in 1:(N-1)
        @inbounds y[n] = x[n+1] - x[n]
    end
    return y
end

d = diff(x)                                    # built-in
d = loopdiff(x)
d = (circshift(x, -1) - x)[1:(N-1)]
d = [x[n+1] - x[n] for n in 1:(length(x)-1)]   # comprehension
d = @views x[2:end] - x[1:(end-1)]             # indexing
d = conv(x, eltype(x).([1, -1]))[2:(end-1)]    # using DSP
d = diagm(0 => -ones(N-1), 1 => ones(N-1))[1:(N-1), :] * x    # dense: big/slow
d = spdiagm(0 => -ones(N-1), 1 => ones(N-1))[1:(N-1), :] * x  # using SparseArrays

Which is fastest? https://web.eecs.umich.edu/~fessler/course/598/demo/diff1.html
Use the notebook to discuss @inbounds, @views, spdiagm (column-wise storage...), and LinearMaps (cf. fatrix in MIRT).

Adjoint tests: $y'(Ax) = (A'y)'x$, or equivalently $\langle Ax, y \rangle = \langle x, A'y \rangle$:

x = randn(N); y = randn(M)
@assert isapprox(y' * (A * x), (A' * y)' * x)

A generalization of transpose for linear maps is called the adjoint.

Orthogonality for steepest descent and conjugate gradients

Recall that the (preconditioned) steepest descent method has three steps:
• descent direction $d_k$, e.g., $d_k = -P \nabla \Psi(x_k)$
• line search: $\alpha_k = \arg\min_\alpha f_k(\alpha)$, where $f_k(\alpha) = \Psi(x_k + \alpha d_k)$
• update: $x_{k+1} = x_k + \alpha_k d_k$.
By construction:
$0 = \dot{f}_k(\alpha_k) = \langle d_k, \nabla \Psi(x_k + \alpha_k d_k) \rangle = \langle d_k, \nabla \Psi(x_{k+1}) \rangle.$
In other words, the gradient $\nabla \Psi(x_{k+1})$ at the next iterate is perpendicular to the current search direction $d_k$:
$d_k \perp \nabla \Psi(x_{k+1}).$
This orthogonality leads to the "zig-zag" nature of PSD iterates seen on p. 3.17. The preconditioned conjugate gradient (CG) method, described in more detail in [10], replaces the standard inner product with a different inner product weighted by the Hessian of the cost function:
$\langle d_k, \nabla \Psi(x_{k+1}) \rangle_H = d_k' H \nabla \Psi(x_{k+1}), \quad H = \nabla^2 \Psi(x_k),$
leading to faster convergence. See [10] for details. Note that the CG method we want for optimization is the nonlinear conjugate gradient (NCG) method; in these notes CG means NCG.

3.5 General inverse problems

As mentioned previously, a typical cost function for solving inverse problems has the form
$\hat{x} = \arg\min_{x \in \mathbb{F}^N} \frac{1}{2} \|Ax - y\|_2^2 + \beta R(x), \quad R(x) = \sum_{k=1}^K \psi([Cx]_k) = r(Cx), \quad r(v) = \sum_{k=1}^K \psi(v_k).$ (3.12)
This is just one of many possible special cases of the following fairly general form:
$\Psi(x) = \sum_{j=1}^J f_j(B_j x) \Longrightarrow \nabla \Psi(x) = \sum_{j=1}^J B_j' \nabla f_j(B_j x),$ (3.13)
where $B_j$ is an $M_j \times N$ matrix and each $f_j : \mathbb{R}^{M_j} \mapsto \mathbb{R}$ is a (typically convex) function.
Example. For the special case (3.12), we use (3.13) with
$J = 2, \quad B_1 = A, \quad B_2 = C, \quad f_1(u) = \frac{1}{2} \|u - y\|_2^2, \quad f_2(v) = \beta r(v).$
We will implement several algorithms for minimizing cost functions of this general form; a sketch follows.

When $\psi$ is the Fair potential, the (best) Lipschitz constant of $\nabla f_2(v)$ is:
A: 1  B: $\beta$  C: $\beta \sqrt{K}$  D: $\beta K$  E: None ??

Efficient line search (Read)

A naive implementation of the line search step in the PSD algorithm on p. 3.18 minimizes
$h_k(\alpha) \triangleq \Psi(x_k + \alpha d_k).$
When applied to a cost function of the general form (3.13), this would involve repeated matrix-vector multiplications of the form $B_j(x_k + \alpha d_k)$, which is expensive. A more efficient approach is to precompute the matrix-vector products prior to performing the line search, noting that for the general form (3.13):
$h_k(\alpha) = \Psi(x_k + \alpha d_k) = \sum_{j=1}^J f_j(B_j(x_k + \alpha d_k)) = \sum_{j=1}^J f_j\left( u_j^{(k)} + \alpha v_j^{(k)} \right), \quad u_j^{(k)} \triangleq B_j x_k, \quad v_j^{(k)} \triangleq B_j d_k.$
Precomputing $u_j^{(k)}$ and $v_j^{(k)}$ prior to performing the line search avoids redundant matrix-vector products. Furthermore, algorithms with line searches like PSD and PCG have a recursive update of the form
$x_{k+1} = x_k + \alpha_k d_k.$
Multiplying both sides by $B_j$ yields an efficient recursive update for the $u_j$ vector (used in HW problems):
$B_j x_{k+1} = B_j x_k + \alpha_k B_j d_k \Longrightarrow u_j^{(k+1)} = u_j^{(k)} + \alpha_k v_j^{(k)}.$
A key simplification here is that the Lipschitz constant of $\dot{h}_k$ does not use operator norms of any $B_j$. (HW)
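A sketch of this idea (our own; the bracket $[0, \alpha_{\max}]$ and the crude golden-section search are assumptions, not part of the notes) that evaluates $h_k(\alpha)$ without any further matrix-vector products:

# line search using precomputed u_j = B_j x_k and v_j = B_j d_k,
# so each evaluation of h_k(α) = Σ_j f_j(u_j + α v_j) is matrix-free
function linesearch(fs, us, vs; αmax = 1.0, niter::Int = 30)
    h(α) = sum(f(u + α * v) for (f, u, v) in zip(fs, us, vs))
    lo, hi = 0.0, αmax            # assumes a minimizer lies in [0, αmax]
    ϕ = (sqrt(5) - 1) / 2         # golden-section search (crude sketch)
    for _ in 1:niter
        m1 = hi - ϕ * (hi - lo)
        m2 = lo + ϕ * (hi - lo)
        h(m1) < h(m2) ? (hi = m2) : (lo = m1)
    end
    return (lo + hi) / 2
end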

3.6 Convergence rates

Asymptotic convergence rates

When the cost function $\Psi$ is locally strictly convex and twice differentiable near the minimizer $\hat{x}$, one can analyze the asymptotic convergence rates of PGD, PSD, and PCG. (See Fessler book Ch. 11.) All three algorithms satisfy inequalities of the following form, for different values of $c, \rho$:
$\left\| P^{-1/2}(x_{k+1} - \hat{x}) \right\|_2 \le c \rho^k \left\| P^{-1/2}(x_0 - \hat{x}) \right\|_2 \Longrightarrow \lim_{k \to \infty} \left\| P^{-1/2}(x_k - \hat{x}) \right\|_2^{1/k} \le \rho.$

PGD and PSD produce sequences $\{x_k\}$ that converge linearly [1, p. 32] to $\hat{x}$, with
$\sup_{x_0} \lim_{k \to \infty} \left\| P^{-1/2}(x_k - \hat{x}) \right\|_2^{1/k} = \rho,$
where $\rho$ is called the root convergence factor. Define $\hat{H} = P^{1/2} \nabla^2 \Psi(\hat{x}) P^{1/2}$ and condition number $\kappa = \sigma_1(\hat{H}) / \sigma_N(\hat{H})$. This table shows the values of $\rho$:

    Method                                ρ                    κ = 10²
    PGD, standard step α = 1/σ₁(Ĥ)        (κ − 1)/κ            0.99
    PGD, α_* = 2/(σ₁(Ĥ) + σ_N(Ĥ))         (κ − 1)/(κ + 1)      0.98
    PSD with perfect line search          (κ − 1)/(κ + 1)      0.98
    PCG with perfect line search          (√κ − 1)/(√κ + 1)    0.82

PCG converges quadratically [1, p. 45], and its $\rho$ above matches a lower bound [1, p. 68].

Heavy ball method

One way to seek faster convergence is to use algorithms that have momentum. An early momentum method is the heavy ball method [4, p. 64]. One way to write it is:
$d_k = -\nabla \Psi(x_k) + \beta_k d_{k-1}$
$x_{k+1} = x_k + \alpha_k d_k,$
where $\alpha_k > 0$ and $\beta_k \ge 0$. The "search direction" $d_k$ depends on both the gradient and the previous direction. Rearranging the second equation to write $d_k = (x_{k+1} - x_k)/\alpha_k$ and then combining yields this form (sketched in code below):
$x_{k+1} = \underbrace{x_k - \alpha_k \nabla \Psi(x_k)}_{\text{usual GD}} + \underbrace{\tilde\beta_k (x_k - x_{k-1})}_{\text{momentum}}, \qquad \tilde\beta_k \triangleq \beta_k \frac{\alpha_k}{\alpha_{k-1}}.$ (3.14)
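A minimal sketch of the heavy ball iteration in the form (3.14), assuming constant step-size parameters $\alpha$ and $\beta$ and a given gradient function (our own illustration):

# heavy ball sketch, form (3.14) with constant α and β
function heavyball(x0, grad, α, β; niter::Int = 100)
    x, xprev = copy(x0), copy(x0)
    for _ in 1:niter
        xnew = x - α * grad(x) + β * (x - xprev)  # GD step + momentum
        xprev, x = x, xnew
    end
    return x
end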

Convergence rate analysis (Read)

To analyze the convergence rate of this method we make two simplifications.
• We consider the case of constant step sizes: $\alpha_k = \alpha$ and $\tilde\beta_k = \beta$.
• We focus on a quadratic cost function $\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2$ where the $M \times N$ matrix $A$ has full column rank, so there is a unique minimizer $\hat{x}$.
Note that $\nabla \Psi(x) = A'(Ax - y) = Hx - b$, where the Hessian is $H = A'A$ and $b = A'y$. The unique minimizer satisfies the normal equations $H\hat{x} = b$, so $\nabla \Psi(x) = Hx - H\hat{x} = H(x - \hat{x})$.

EECS 551 analyzed the convergence rate of GD by relating $x_{k+1} - \hat{x}$ to $x_k - \hat{x}$. Here the recursion (3.14) depends on both $x_k$ and $x_{k-1}$, so we analyze the following two-state recursion:
$\begin{bmatrix} x_{k+1} - \hat{x} \\ x_k - \hat{x} \end{bmatrix} = \begin{bmatrix} x_k - \hat{x} - \alpha H(x_k - \hat{x}) + \beta(x_k - \hat{x}) - \beta(x_{k-1} - \hat{x}) \\ x_k - \hat{x} \end{bmatrix} = G \begin{bmatrix} x_k - \hat{x} \\ x_{k-1} - \hat{x} \end{bmatrix}, \qquad G \triangleq \begin{bmatrix} (1+\beta)I - \alpha H & -\beta I \\ I & 0 \end{bmatrix}.$
Because $H$ is Hermitian, it has a unitary eigendecomposition $H = V \Lambda V'$, with eigenvalues $\lambda_i(H) = \sigma_i^2(A)$. Writing the governing matrix $G$ using this eigendecomposition:
$G = \begin{bmatrix} V & 0 \\ 0 & V \end{bmatrix} \begin{bmatrix} (1+\beta)I - \alpha\Lambda & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} V & 0 \\ 0 & V \end{bmatrix}' = \begin{bmatrix} V & 0 \\ 0 & V \end{bmatrix} \Pi \begin{bmatrix} G_1 & & \\ & \ddots & \\ & & G_N \end{bmatrix} \Pi' \begin{bmatrix} V & 0 \\ 0 & V \end{bmatrix}', \qquad G_i = \begin{bmatrix} 1 + \beta - \alpha\lambda_i & -\beta \\ 1 & 0 \end{bmatrix},$
where $\Pi$ is a $2N \times 2N$ permutation matrix. Thus $\mathrm{eig}\{G\} = \bigcup_{i=1}^N \mathrm{eig}\{G_i\}$, using several eigenvalue properties. The eigenvalues of $G_i$ are the roots of its characteristic polynomial:
$z^2 - (1 + \beta - \alpha\lambda_i) z + \beta.$

If $\beta = 0$, then the nontrivial root is at $z = 1 - \alpha\lambda_i$, an expression seen in the EECS 551 notes.

Otherwise, the roots (eigenvalues of $G_i$) are:
$z = \frac{(1 + \beta - \alpha\lambda_i) \pm \sqrt{(1 + \beta - \alpha\lambda_i)^2 - 4\beta}}{2}.$

For the fastest convergence, we would like to choose $\alpha$ and $\beta$ to minimize $\max_i \rho(G_i)$. One can show that the best choice is:
$\alpha_* = \frac{4}{(\sigma_1(A) + \sigma_N(A))^2}, \qquad \beta_* = \left( \frac{\sigma_1(A) - \sigma_N(A)}{\sigma_1(A) + \sigma_N(A)} \right)^2.$
For this choice, one can show that
$\rho(G) = \sqrt{\beta_*} = \frac{\sigma_1(A) - \sigma_N(A)}{\sigma_1(A) + \sigma_N(A)} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},$
where $\kappa = \sigma_1(H)/\sigma_N(H) = \sigma_1^2(A)/\sigma_N^2(A)$. Thus for the simple LS problem, the heavy ball method with its best choice of step-size parameters has the same rate as the conjugate gradient method with a perfect line search.

Of course, in practice it is usually too expensive to determine $\sigma_1(A)$ and $\sigma_N(A)$, so we next seek more practical momentum methods that do not require these values.

Generalized convergence analysis of PGD

Define. The gradient of $\Psi$ is $S$-Lipschitz continuous on $\mathbb{R}^N$ for an invertible matrix $S$ iff
$\left\| S^{-1} (\nabla \Psi(x) - \nabla \Psi(z)) \right\|_2 \le \|S'(x - z)\|_2, \quad \forall x, z \in \mathbb{R}^N.$ (3.15)
If $S = \sqrt{L}\, I$, then the $S$-Lipschitz condition (3.15) simplifies to the classic Lipschitz continuity condition:
$\|\nabla \Psi(x) - \nabla \Psi(z)\|_2 \le L \|x - z\|_2, \quad \forall x, z \in \mathbb{R}^N.$ (3.16)

Theorem (PGD convergence). If $\nabla \Psi$ satisfies (3.15) and for some $\alpha > 0$
$\alpha P' S S' P \prec P + P',$ (3.17)
then (i) the PGD algorithm
$x_{k+1} = x_k - \alpha P \nabla \Psi(x_k)$ (3.18)
monotonically decreases $\Psi$ [10], because for $g = \nabla \Psi(x_k)$:
$\Psi(x_k) - \Psi(x_{k+1}) \ge \frac{\alpha}{2}\, g' \left( P + P' - \alpha P' S S' P \right) g,$
and (ii) $\|\nabla \Psi(x_k)\| \to 0$.

In the usual case where $P$ is symmetric positive definite, (3.17) simplifies to $\alpha S S' \prec 2 P^{-1}$, and when that condition holds, $\left\| P^{-1/2}(x_k - \hat{x}) \right\|_2$ converges monotonically to zero [10]. If we choose $\alpha P = (SS')^{-1}$, then PGD is equivalent to a majorize-minimize (MM) method (discussed later), and the cost function decrease has the following bound:
$\Psi(x_k) - \Psi(\hat{x}) \le \frac{\|S'(x_0 - \hat{x})\|_2^2}{2k}, \quad k \ge 1.$ (3.19)
The bound above is the "classical" textbook formula. In 2014 the following tight bound was found [5]:
$\Psi(x_k) - \Psi(\hat{x}) \le \frac{\|S'(x_0 - \hat{x})\|_2^2}{4k + 2}, \quad k \ge 1.$ (3.20)
It is tight because there is a Huber-like function $\Psi$ for which GD meets that rate.

Why generalize? Units of the Lipschitz constant

Consider $\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2$ where $A$ is $2 \times 2$ diagonal, with units:

    a11: ampere    x1: ohm    y1: volt
    a22: volt/m    x2: m      y2: volt

What are the units of the Lipschitz constant of $\nabla \Psi$?
A: ampere·volt/m  B: ampere²  C: volt²/m²  D: ohm·m  E: none of these ??

Generalized Nesterov fast gradient method (FGM)

The following is a slight generalization of the fast gradient method (FGM) of Nesterov, also known as accelerated gradient descent and Nesterov accelerated gradient (NAG). Initialize $t_0 = 1$ and $z_0 = x_0$; then for $k = 0, 1, \ldots$:
$t_{k+1} = \left( 1 + \sqrt{1 + 4t_k^2} \right) / 2$
$x_{k+1} = z_k - [SS']^{-1} \nabla \Psi(z_k)$  (gradient step)
$z_{k+1} = x_{k+1} + \frac{t_k - 1}{t_{k+1}} (x_{k+1} - x_k).$  (momentum)
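A minimal sketch of this iteration, specializing $SS' = L\, I$ so that $[SS']^{-1} \nabla \Psi = (1/L) \nabla \Psi$ (the ordinary Lipschitz case; grad and L assumed given):

# Nesterov FGM sketch with SS′ = L·I
function fgm(x0, grad, L; niter::Int = 100)
    x, z, t = copy(x0), copy(x0), 1.0
    for _ in 1:niter
        tnew = (1 + sqrt(1 + 4 * t^2)) / 2
        xnew = z - (1 / L) * grad(z)               # gradient step
        z = xnew + ((t - 1) / tnew) * (xnew - x)   # momentum
        x, t = xnew, tnew
    end
    return x
end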

If $t_k = 1$ for all $k$, then FGM reverts to ordinary GD.

Theorem. If $\Psi$ is convex and has an $S$-Lipschitz gradient, then this generalized FGM satisfies:
$\Psi(x_k) - \Psi(\hat{x}) \le \frac{2 \|S'(x_0 - \hat{x})\|_2^2}{k^2}.$ (3.21)

In words, the worst-case rate of decrease of the cost function $\Psi$ is $O(1/k^2)$. This is a huge improvement over the $O(1/k)$ rate of PGD in (3.19). However, worst-case analysis may be pessimistic for your favorite application. For example, if $\Psi$ is quadratic on $\mathbb{R}^N$, then CG converges in $N$ iterations.

3.7 First-order methods

The most famous second-order method is Newton's method:
$x_{k+1} = x_k - \left( \nabla^2 \Psi(x_k) \right)^{-1} \nabla \Psi(x_k).$
This method is impractical for large-scale problems, so we focus on first-order methods [1, p. 7].

General first-order method classes
• General first-order (GFO) method:
$x_{k+1} = \mathrm{function}(x_0, \Psi(x_0), \nabla \Psi(x_0), \ldots, \Psi(x_k), \nabla \Psi(x_k)).$ (3.22)
• First-order (FO) methods with fixed step-size coefficients:
$x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^n h_{n+1,k} \nabla \Psi(x_k).$ (3.23)

Which of the algorithms discussed so far are FO (fixed-step) methods?
A: PGD, FGM, PSD, PCG  B: PGD, FGM, PSD  C: PGD, PSD  D: PGD, FGM  E: PGD ??

Example: Barzilai-Borwein gradient method

Barzilai & Borwein, 1988 [11]:
$g_k \triangleq \nabla \Psi(x_k)$
$\alpha_k = \frac{\|x_k - x_{k-1}\|_2^2}{\langle x_k - x_{k-1}, g_k - g_{k-1} \rangle}$
$x_{k+1} = x_k - \alpha_k g_k.$
• In the "general" first-order (GFO) class, but
• not in the class FO with fixed step-size coefficients.
A sketch appears below.
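A minimal Barzilai-Borwein sketch (our own; the single GD step used to initialize the difference history is an assumption, not part of the notes):

using LinearAlgebra: dot, norm
# Barzilai-Borwein gradient method sketch
function bb(x0, grad, L; niter::Int = 100)
    g0 = grad(x0)
    x, xprev, gprev = x0 - (1 / L) * g0, x0, g0   # one GD step to start
    for _ in 1:niter
        g = grad(x)
        α = norm(x - xprev)^2 / dot(x - xprev, g - gprev)  # BB step size
        xprev, gprev = x, g
        x = x - α * g
    end
    return x
end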

Recent research questions
• Analyze the convergence rate of FO for any given step-size coefficients $\{h_{n,k}\}$.
• Optimize the step-size coefficients $\{h_{n,k}\}$ for:
  ◦ fast convergence
  ◦ efficient recursive implementation
  ◦ universality (designed prior to iterating, independent of $L$).
• How much better could one do with GFO?

Nesterov's fast gradient method is FO

Nesterov's (1983) iteration [12, 13], expressed in efficient recursive form. Initialize $t_0 = 1$, $z_0 = x_0$:
$z_{n+1} = x_n - \frac{1}{L} \nabla \Psi(x_n)$  (usual GD update)
$t_{n+1} = \frac{1}{2} \left( 1 + \sqrt{1 + 4t_n^2} \right)$  (magic momentum factors)
$x_{n+1} = z_{n+1} + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n) = (1 + \gamma_n) z_{n+1} - \gamma_n z_n, \quad \gamma_n = \frac{t_n - 1}{t_{n+1}} > 0.$  (update with momentum)

FGM1 is in class FO [7] (for analysis, not implementation!):
$x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^n h_{n+1,k} \nabla \Psi(x_k),$
$h_{n+1,k} = \begin{cases} \frac{t_n - 1}{t_{n+1}} h_{n,k}, & k = 0, \ldots, n-2 \\ \frac{t_n - 1}{t_{n+1}} (h_{n,n-1} - 1), & k = n-1 \\ 1 + \frac{t_n - 1}{t_{n+1}}, & k = n, \end{cases}$
e.g.,
$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1.25 & 0 & 0 & 0 & 0 \\ 0 & 0.10 & 1.40 & 0 & 0 & 0 \\ 0 & 0.05 & 0.20 & 1.50 & 0 & 0 \\ 0 & 0.03 & 0.11 & 0.29 & 1.57 & 0 \\ 0 & 0.02 & 0.07 & 0.18 & 0.36 & 1.62 \end{bmatrix}.$

Nesterov's FGM1 optimal convergence rate

Shown by Nesterov to be $O(1/n^2)$ for the "primary" sequence $\{z_n\}$ [7, eqn. (3.5)]:
$\Psi(z_n) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{2 t_{n-1}^2} \le \frac{2 L \|x_0 - x_\star\|_2^2}{(n+1)^2}.$ (3.24)
Nesterov [1, p. 59-61] constructed "the worst function in the world," a simple quadratic function $\Psi$ with a tridiagonal Hessian matrix similar to $C'C$ for first-order differences, such that, for any general FO method:
$\frac{3}{32} \frac{L \|x_0 - x_\star\|_2^2}{(n+1)^2} \le \Psi(x_n) - \Psi(x_\star).$
Thus the $O(1/n^2)$ rate of FGM1 is "optimal" in a big-O sense.

Bound on the convergence rate of the "secondary" sequence $\{x_n\}$ [7, eqn. (5.5)]:
$\Psi(x_n) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{2 t_n^2} \le \frac{2 L \|x_0 - x_\star\|_2^2}{(n+2)^2}.$ (3.25)
The bounds (3.24) and (3.25) are asymptotically tight [6].
To reach a cost within $\epsilon$ of the minimum $\Psi(x_\star)$, how many iterations are needed?
A: $O(1)$  B: $O(1/\sqrt{\epsilon})$  C: $O(1/\epsilon)$  D: $O(1/\epsilon^2)$  E: $O(1/\epsilon^4)$ ??
The gap between 2 and 3/32 suggests we can do better, and we can, thanks to recent work from UM.

Optimized gradient method (OGM)

Recall the general family of first-order (FO) methods (3.23) with fixed step-size coefficients:
$x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^n h_{n+1,k} \nabla \Psi(x_k).$
Inspired by [5], recent work by former UM ECE PhD student Donghwan Kim [7]:
• Analyze (i.e., bound) the convergence rate as a function of:
  ◦ number of iterations $N$
  ◦ Lipschitz constant $L$
  ◦ step-size coefficients $H = \{h_{n+1,k}\}$
  ◦ initial distance to a solution: $R = \|x_0 - x_\star\|$.
• Optimize $H$ by minimizing the bound: "optimizing the optimizer" (meta-optimization?).
• Seek an equivalent recursive form for efficient implementation.

(... many pages of derivations ...)
• Optimized step-size coefficients [7]:
$H^* : \quad h_{n+1,k} = \begin{cases} \frac{\theta_n - 1}{\theta_{n+1}} h_{n,k}, & k = 0, \ldots, n-2 \\ \frac{\theta_n - 1}{\theta_{n+1}} (h_{n,n-1} - 1), & k = n-1 \\ 1 + \frac{2\theta_n - 1}{\theta_{n+1}}, & k = n, \end{cases}$
$\theta_n = \begin{cases} 1, & n = 0 \\ \frac{1}{2} \left( 1 + \sqrt{1 + 4\theta_{n-1}^2} \right), & n = 1, \ldots, N-1 \\ \frac{1}{2} \left( 1 + \sqrt{1 + 8\theta_{n-1}^2} \right), & n = N. \end{cases}$
• Analytical convergence bound for the FO method with these optimized step-size coefficients [7, eqn. (6.17)]:
$\Psi(x_N) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{2\theta_N^2} \le \frac{L \|x_0 - x_\star\|_2^2}{(N+1)(N+1+\sqrt{2})} \le \frac{L \|x_0 - x_\star\|_2^2}{(N+1)^2}.$ (3.26)
• Of course the bound is $O(1/N^2)$, but the constant is twice better than that of Nesterov's FGM in (3.24).

Optimized gradient method (OGM) recursion

Donghwan Kim also found an efficient recursive algorithm [7] (see the sketch below). Initialize $\theta_0 = 1$, $z_0 = x_0$:
$z_{n+1} = x_n - \frac{1}{L} \nabla \Psi(x_n)$
$\theta_{n+1} = \begin{cases} \frac{1}{2} \left( 1 + \sqrt{1 + 4\theta_n^2} \right), & n \le N - 2 \\ \frac{1}{2} \left( 1 + \sqrt{1 + 8\theta_n^2} \right), & n = N - 1 \end{cases}$
$x_{n+1} = z_{n+1} + \frac{\theta_n - 1}{\theta_{n+1}} (z_{n+1} - z_n) + \underbrace{\frac{\theta_n}{\theta_{n+1}} (z_{n+1} - x_n)}_{\text{new momentum}}.$
It reverts to Nesterov's FGM if one removes the new term.
• Very simple modification of existing Nesterov code.
• Factor-of-2 better bound than Nesterov's "optimal" FGM.
• Similar momentum to Güler's 1992 proximal point algorithm [14].
• Inconvenience: must pick $N$ in advance to use bound (3.26) on $\Psi(x_N)$.
• Convergence bound for every iteration of the "primary" sequence [15, eqn. (20)]:
$\Psi(z_n) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{4 t_{n-1}^2} \le \frac{L \|x_0 - x_\star\|_2^2}{(n+1)^2}.$
This bound is asymptotically tight [15, p. 198].
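A minimal sketch of this OGM recursion (our own illustration; grad and L assumed given, and the total iteration count N must be fixed in advance as noted above):

# OGM sketch per the recursion above
function ogm(x0, grad, L, N::Int)
    x, z, θ = copy(x0), copy(x0), 1.0
    for n in 0:(N-1)
        znew = x - (1 / L) * grad(x)
        c = (n ≤ N - 2) ? 4 : 8                    # factor 8 only on final step
        θnew = (1 + sqrt(1 + c * θ^2)) / 2
        x = znew + ((θ - 1) / θnew) * (znew - z) + # FGM momentum
            (θ / θnew) * (znew - x)                # new OGM momentum term
        z, θ = znew, θnew
    end
    return x
end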

Recent refinement of OGM

Newer version OGM' [15, p. 199]:
$z_{n+1} = x_n - \frac{1}{L} \nabla \Psi(x_n)$
$t_{n+1} = \frac{1}{2} \left( 1 + \sqrt{1 + 4t_n^2} \right)$  (momentum factors)
$x_{n+1} = \underbrace{x_n - \frac{1 + t_n/t_{n+1}}{L} \nabla \Psi(x_n)}_{\text{over-relaxed GD}} + \underbrace{\frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n)}_{\text{FGM momentum}}.$
• Convergence bound for every iteration of the "primary" sequence [15, eqn. (25)]:
$\Psi(z_n) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{4 t_{n-1}^2} \le \frac{L \|x_0 - x_\star\|_2^2}{(n+1)^2}.$
• Simpler and more practical implementation.
• Need not pick $N$ in advance.
One can show $t_n^2 \ge (n+1)^2/4$ for $n > 1$ [15, p. 197].

OGM' momentum factors illustrated

$x_{n+1} = x_n - \frac{1 + t_n/t_{n+1}}{L} \nabla \Psi(x_n) + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n)$
Intuition: $1 + t_n/t_{n+1} \to 2$ as $n \to \infty$.

Optimized gradient method (OGM) is an optimal GFO method (!)

Recall that within the class of first-order (FO) methods with fixed step sizes:
$x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^n h_{n+1,k} \nabla \Psi(x_k),$
OGM is based on optimized $\{h_{n,k}\}$ step sizes and provides the convergence rate upper bound:
$\Psi(x_N) - \Psi(x_\star) \le \frac{L \|x_0 - x_\star\|_2^2}{2\theta_N^2} \le \frac{L \|x_0 - x_\star\|_2^2}{N^2}.$
Recently Y. Drori [16, Thm. 3] considered the class of general FO (GFO) methods:
$x_{n+1} = F(x_0, \Psi(x_0), \nabla \Psi(x_0), \ldots, \Psi(x_n), \nabla \Psi(x_n)),$
and showed that for $d > N$ (large-scale problems), for any algorithm in this GFO class there is a function $\Psi$ such that
$\frac{L \|x_0 - x_\star\|_2^2}{2\theta_N^2} \le \Psi(x_N) - \Psi(x_\star).$
Thus OGM has optimal (worst-case) complexity among all GFO methods, not just fixed-step FO methods.

Worst-case functions for OGM

From [7, eqn. (8.1)] and [15, Thm. 5.1], the worst-case behavior of OGM occurs for a Huber function and for a quadratic function. For $R \triangleq \|x_0 - x_\star\|$, the worst-case behavior is:
$\Psi(x_N) - \Psi(x_\star) = \frac{L R^2}{2\theta_N^2} \le \frac{L R^2}{(N+1)(N+1+\sqrt{2})} \le \frac{L R^2}{(N+1)^2}.$

Monotonicity

In these examples, the cost function $\Psi(x_n)$ happens to decrease monotonically. In general, neither FGM nor OGM guarantees a non-increasing cost function, despite the bound $1/N^2$ being strictly decreasing. Nesterov [1, p. 71] states that optimal methods in general do not ensure that $\Psi(x_{k+1}) \le \Psi(x_k)$.

3.8 Machine learning via logistic regression for binary classification

To learn the weights $x \in \mathbb{R}^N$ of a binary classifier given feature vectors $\{v_m\} \subset \mathbb{R}^N$ (training data) and labels $\{y_m = \pm 1 : m = 1, \ldots, M\}$, we can minimize a cost function with a regularization parameter $\beta > 0$:
$\hat{x} = \arg\min_x \Psi(x), \qquad \Psi(x) = \sum_{m=1}^M \psi(y_m \langle x, v_m \rangle) + \frac{\beta}{2} \|x\|_2^2.$ (3.27)

Loss functions (surrogates). We want:
• $\langle x, v_m \rangle > 0$ if $y_m = +1$ and
• $\langle x, v_m \rangle < 0$ if $y_m = -1$,
• i.e., $\langle x, y_m v_m \rangle > 0$,
so that $\mathrm{sign}(\langle x, v_m \rangle)$ is a reasonable classifier.
[Figure: loss functions $\psi(z)$ vs. $z$: exponential, hinge, logistic, 0-1.]
The logistic loss function has a Lipschitz derivative:
$\psi(z) = \log\left( 1 + e^{-z} \right)$
$\dot\psi(z) = \frac{-1}{e^z + 1}$
$\ddot\psi(z) = \frac{e^z}{(e^z + 1)^2} \in \left( 0, \frac{1}{4} \right].$

The logistic regression cost function (3.27) is a special case of the "general inverse problem" (3.13). (?) A: True B: False ??

$J = 2, \quad B_1 = A', \quad B_2 = I, \quad f_1(u) = \mathbf{1}'\, \psi.(u), \quad f_2(v) = \frac{\beta}{2} \|v\|_2^2,$
$A = [y_1 v_1 \ \ldots \ y_M v_M] \in \mathbb{R}^{N \times M}.$
A regularization term like $\frac{\beta}{2} \|x\|_2^2$ is especially important in the typical case where the feature vector dimension $N$ is large relative to the sample size $M$. For gradient-based optimization, we need the cost function gradient and a Lipschitz constant:
$\nabla \Psi(x) = A\, \dot\psi.(A'x) + \beta x \Longrightarrow L_{\nabla \Psi} \le |||A|||_2^2 L_{\dot\psi} + \beta = \frac{1}{4} |||A|||_2^2 + \beta.$
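A compact sketch of these expressions (our own function names; A is assumed to have columns $y_m v_m$ as above):

using LinearAlgebra: opnorm
# logistic regression sketch for (3.27)
ψ(z)  = log1p(exp(-z))                 # logistic loss (exp may overflow for z ≪ 0)
dψ(z) = -1 / (exp(z) + 1)              # derivative; Lipschitz with L = 1/4
cost(A, x, β) = sum(ψ, A' * x) + (β / 2) * sum(abs2, x)
grad(A, x, β) = A * dψ.(A' * x) + β * x
lipschitz(A, β) = opnorm(A)^2 / 4 + β  # Lipschitz constant of ∇Ψ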

Practical implementation:
• Normalizing each column of $A$ to unit norm can help keep $e^z$ from overflowing.
• Tuning $\beta$ should use cross validation or other such tools from machine learning.
• The cost function is convex with a Lipschitz gradient, so it is well suited to FGM and OGM.
• When the feature dimension $N$ is very large, seeking a sparse weight vector $x$ may be preferable. For that, replace the Tikhonov regularizer $\|x\|_2^2$ with $\|x\|_1$ and then use FISTA (or POGM [17]) for optimization.

Numerical results: logistic regression

[Figure: labeled training data (green and blue points); initial decision boundary (red); final decision boundary (magenta); ideal boundary (yellow). $M = 100$, $N = 7$ (cf. "large scale"?).]

Numerical results: convergence rates

[Figure: cost function vs. iteration for GD, Nesterov FGM, and OGM1.]
OGM is faster than FGM in the early iterations...

Adaptive restart of accelerated GD

[Figure: cost function decrease (log scale) vs. iteration, comparing FGM restart (O'Donoghue & Candès, 2015 [19]) and OGM restart [20].]

Adaptive restart of OGM

Recall:
$x_{n+1} = x_n - \frac{1 + t_n/t_{n+1}}{L} \nabla \Psi(x_n) + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n).$
Heuristic: restart the momentum (set $t_n = 1$) if
$\langle -\nabla \Psi(x_n), z_{n+1} - z_n \rangle < 0,$
i.e., if the momentum direction forms an obtuse angle with the negative gradient.

This modified method, OGM-restart, has a better worst-case convergence bound than OGM for the class of convex cost functions with $L$-Lipschitz continuous gradients. (?) A: True B: False ??

Define. A function $f : \mathbb{R}^N \mapsto \mathbb{R}$ is strongly convex with parameter $\mu > 0$ iff it is convex and
$f(x) \ge f(z) + \langle \nabla f(z), x - z \rangle + \frac{\mu}{2} \|x - z\|_2^2, \quad \forall x, z \in \mathbb{R}^N.$

Smooth cost functions are often locally strongly convex, but rarely are the cost functions of interest in modern signal processing (globally) strongly convex. Formal analysis of OGM for strongly convex quadratic functions is in [20].

Code: https://gitlab.eecs.umich.edu/michigan-fast-optimization/ogm-adaptive-restart

3.9 Summary

This chapter summarizes some of the most important gradient-based algorithms for solving unconstrained optimization problems with differentiable cost functions.

All of the methods discussed here require computing the gradient ∇ Ψ(xk) each iteration, and often that is the most expensive operation. Some of the algorithms (PSD, PCG) also require a line search step. A line search is itself a 1D optimization problem that requires evaluating the cost function or its gradient multiple times, and those evaluations can add considerable expense for general cost functions.

For cost functions of the form (3.13), where each component function fj and its gradient are easy to evaluate, one can perform a line search quite efficiently, as described on p. 3.33.

The set of cost functions of the form (3.13), where each $f_j$ has a Lipschitz continuous gradient, is a strict subset of the set of cost functions $\Psi$ having Lipschitz continuous gradients. (?) A: True B: False ??
Recent work produced a version of OGM with a line search [21].

Bibliography

[1] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004.
[2] R. C. Fair. "On the robust estimation of econometric models." Ann. Econ. Social Measurement 2 (Oct. 1974), 667-77.
[3] K. Lange. "Convergence of EM image reconstruction algorithms with Gibbs smoothing." IEEE Trans. Med. Imag. 9.4 (Dec. 1990), 439-46. Corrections: T-MI 10.2 (June 1991), 288.
[4] B. T. Polyak. Introduction to optimization. New York: Optimization Software Inc., 1987.
[5] Y. Drori and M. Teboulle. "Performance of first-order methods for smooth convex minimization: A novel approach." Mathematical Programming 145.1-2 (June 2014), 451-82.
[6] A. B. Taylor, J. M. Hendrickx, and F. Glineur. "Smooth strongly convex interpolation and exact worst-case performance of first-order methods." Mathematical Programming 161.1 (Jan. 2017), 307-45.
[7] D. Kim and J. A. Fessler. "Optimized first-order methods for smooth convex minimization." Mathematical Programming 159.1 (Sept. 2016), 81-107.
[8] A. Cauchy. "Méthode générale pour la résolution des systèmes d'équations simultanées." Comp. Rend. Sci. Paris 25 (1847), 536-8.
[9] A. Florescu, E. Chouzenoux, J.-C. Pesquet, P. Ciuciu, and S. Ciochina. "A majorize-minimize memory gradient method for complex-valued inverse problems." Signal Processing 103 (Oct. 2014), 285-95.
[10] J. A. Fessler. Image reconstruction: Algorithms and analysis. Book in preparation, 2006.
[11] J. Barzilai and J. Borwein. "Two-point step size gradient methods." IMA J. Numerical Analysis 8.1 (1988), 141-8.
[12] Y. Nesterov. "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." Dokl. Akad. Nauk. USSR 269.3 (1983), 543-7.
[13] Y. Nesterov. "Smooth minimization of non-smooth functions." Mathematical Programming 103.1 (May 2005), 127-52.
[14] O. Güler. "New proximal point algorithms for convex minimization." SIAM J. Optim. 2.4 (1992), 649-64.
[15] D. Kim and J. A. Fessler. "On the convergence analysis of the optimized gradient methods." J. Optim. Theory Appl. 172.1 (Jan. 2017), 187-205.
[16] Y. Drori. "The exact information-based complexity of smooth convex minimization." J. Complexity 39 (Apr. 2017), 1-16.
[17] A. B. Taylor, J. M. Hendrickx, and F. Glineur. "Exact worst-case performance of first-order methods for composite convex optimization." SIAM J. Optim. 27.3 (Jan. 2017), 1283-313.
[18] D. Böhning and B. G. Lindsay. "Monotonicity of quadratic approximation algorithms." Ann. Inst. Stat. Math. 40.4 (Dec. 1988), 641-63.
[19] B. O'Donoghue and E. Candès. "Adaptive restart for accelerated gradient schemes." Found. Comp. Math. 15.3 (June 2015), 715-32.
[20] D. Kim and J. A. Fessler. "Adaptive restart of the optimized gradient method for convex optimization." J. Optim. Theory Appl. 178.1 (July 2018), 240-63.
[21] Y. Drori and A. B. Taylor. "Efficient first-order methods for convex minimization: a constructive approach." 2018.