Interpretation of backtracking [RECAP]

∆x = direction of descent = −∇f(x^k) for gradient descent. A different way of understanding the varying step size with β: multiplying t by β causes the interpolating line to tilt, as indicated in the figure.

Homework: Let f(x) = x^2 for x ∈ ℜ. Let x^0 = 2, ∆x^k = −1 for all k (a valid descent direction since x > 0), and x^k = 1 + 2^{-k}. What is the step size t^k implicitly being used? Show that while t^k satisfies the Armijo condition (determine a c_1 for which it does), it does not satisfy the second Strong Wolfe condition introduced in the following slides. Why is this choice of step size bad? (A numeric check is sketched below.)
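A quick numeric check of this homework, as a sketch (the Wolfe constants c1 and c2 below are illustrative choices, not prescribed by the notes):

```python
import numpy as np

# Homework setup: f(x) = x^2, x^0 = 2, dx^k = -1, x^k = 1 + 2^{-k}.
# The implicit step size is t^k = x^k - x^{k+1} = 2^{-(k+1)}.
f = lambda x: x ** 2
df = lambda x: 2 * x

c1, c2 = 1e-4, 0.9                      # illustrative Wolfe constants
for k in range(8):
    xk = 1 + 2.0 ** (-k)
    tk = 2.0 ** (-(k + 1))              # implicit step size
    dxk = -1.0                          # descent direction
    x_next = xk + tk * dxk              # equals 1 + 2^{-(k+1)}
    armijo = f(x_next) <= f(xk) + c1 * tk * df(xk) * dxk
    curvature = abs(df(x_next) * dxk) <= c2 * abs(df(xk) * dxk)
    print(k, tk, armijo, curvature)
```

Because x^k → 1 while the minimizer of f is at 0, |f′(x^{k+1})| / |f′(x^k)| → 1, so the curvature (second Strong Wolfe) condition eventually fails for every c_2 < 1: the steps shrink too fast for the iterates to make real progress.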

Strong Wolfe conditions: sufficient decrease BUT NOT at the cost of oscillations

Ray Search for First Order Descent: Strong Wolfe Conditions [RECAP]

Often, another condition is used for inexact line search in conjunction with the Armijo condition.

|∆x^T ∇f(x + t∆x)| ≤ c_2 |∆x^T ∇f(x)|    (28)

where 1 > c_2 > c_1 > 0. This condition ensures that the magnitude of the slope of f(x + t∆x) at t is at most c_2 times the magnitude of the slope at t = 0.

1. The conditions in (27) and (28) are together called the strong Wolfe conditions. These conditions are particularly important for non-convex problems.
2. While (27) ensures a guaranteed decrease in f(x + t∆x), (28) provides a guaranteed decrease in the magnitude of the slope and avoids steps that are too small.

3. Claim: If 1 > c_2 > c_1 > 0 and the function f(x) is convex and differentiable, there exists a t such that (27) and (28) are both satisfied for any such f. Hint: use the Mean Value Theorem. (A programmatic check of (27) and (28) is sketched below.)
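A small sketch of how (27) and (28) can be checked programmatically for a candidate step size; the quadratic test function and the constants are illustrative choices, and `wolfe_conditions` is a hypothetical helper name:

```python
import numpy as np

def wolfe_conditions(f, grad, x, dx, t, c1=1e-4, c2=0.9):
    """Return (armijo, strong_curvature) for step size t along direction dx."""
    g0 = grad(x) @ dx                      # slope of phi(t) = f(x + t dx) at t = 0
    gt = grad(x + t * dx) @ dx             # slope at t
    armijo = f(x + t * dx) <= f(x) + c1 * t * g0      # condition (27)
    strong_curvature = abs(gt) <= c2 * abs(g0)        # condition (28)
    return armijo, strong_curvature

# Illustrative use on a convex quadratic f(x) = 0.5 x^T A x
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 2.0])
dx = -grad(x)                              # a valid descent direction
for t in [1e-3, 1e-2, 1e-1, 0.3, 0.5, 1.0]:
    print(t, wolfe_conditions(f, grad, x, dx, t))
```

For very small t only the Armijo condition holds, while moderate t satisfy both, which is consistent with the claim above.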

Convexity ⇒ Strong Wolfe Conditions [RECAP]

Let ϕ(t) = f(x^k + t∆x^k) ≥ f(x^k) + t ∇^T f(x^k)∆x^k (where the second inequality is by virtue of convexity). Remember that ∇^T f(x^k)∆x^k < 0.
Since 0 < c_1 < 1, we have ϕ′(0) = ∇^T f(x^k)∆x^k < c_1 ∇^T f(x^k)∆x^k, so ϕ(t) initially drops below the line f(x^k) + t c_1 ∇^T f(x^k)∆x^k; because this line decreases without bound, it must intersect ϕ(t) at some t > 0 (assuming f is bounded below along the ray).

Let t′ > 0 be the smallest intersecting value of t, that is:

f(x^k + t′∆x^k) = f(x^k) + t′ c_1 ∇^T f(x^k)∆x^k    (29)

For all t ∈ [0, t′],

f(x^k + t∆x^k) ≤ f(x^k) + t c_1 ∇^T f(x^k)∆x^k    (30)

That is, there exists a non-empty set of t for which the first Wolfe condition is met. By the mean value theorem, ∃ t′′ ∈ (0, t′) such that

f(x^k + t′∆x^k) − f(x^k) = t′ ∇^T f(x^k + t′′∆x^k)∆x^k    (31)

Convexity ⇒ Strong Wolfe Conditions (contd.)

Combining (29) and (31), and using c_1 < c_2:

∇^T f(x^k + t′′∆x^k)∆x^k = c_1 ∇^T f(x^k)∆x^k > c_2 ∇^T f(x^k)∆x^k    (32)

Again, since ∇^T f(x^k)∆x^k < 0, we get that t^k = t′′ satisfies (28):

|∇^T f(x^k + t′′∆x^k)∆x^k| = c_1 |∇^T f(x^k)∆x^k| < c_2 |∇^T f(x^k)∆x^k|

and, since t′′ ∈ (0, t′), it also satisfies (27).

This is desirable: neither too slow, nor too wayward. Backtracking starts fast and then adaptively slows down.

Empirical Observations on Ray Search [RECAP]

A finding that is borne out by plenty of empirical evidence is that exact ray search does better than inexact (backtracking) ray search in only a few cases.

Further, the exact choice of the values of c_1 and c_2 seems to have little effect on the convergence of the overall descent method.

The trend of specific descent methods has traced an arc: starting with simple steepest descent techniques, then accommodating curvature through the more sophisticated Newton's method, then trying to simplify Newton's method through approximations to the Hessian inverse, and culminating in conjugate gradient techniques, which do away with any curvature matrix whatsoever and form the internal combustion engine of many sophisticated optimization techniques today. We start the thread by describing the steepest descent methods.

Algorithms: Steepest Descent [RECAP]

The idea of steepest descent is to determine a descent direction such that, for a unit step in that direction, the first order prediction of decrease in the objective is maximized. However, if the norm of v is left unrestricted, consider

∆x = argmin_v [5 10 15] v

whose value decreases without bound, giving ∆x → (−∞, −∞, −∞)^T, which is unacceptable. Thus it is necessary to restrict the norm of v. The choice of the descent direction can then be stated as:

∆x = argmin_v ∇^T f(x) v
s.t. ∥v∥ = 1

Algorithms: Steepest Descent [RECAP]

Let v ∈ ℜ^n be a unit vector under some norm. By the first order convexity condition for convex and differentiable f,

f(x^(k)) − f(x^(k) + v) ≤ −∇^T f(x^(k)) v

For small v, the inequality turns into approximate equality. The term −∇^T f(x^(k)) v can be thought of as (an upper bound on) the first order prediction of decrease. The idea in the steepest descent method is to choose a norm and then determine a descent direction such that, for a unit step in that norm, the first order prediction of decrease is maximized. This choice of the descent direction can be stated as

∆x = argmin { ∇^T f(x) v  |  ∥v∥ = 1 }

Empirical observation: If the norm chosen is aligned with the gross geometry of the sub-level sets³, the steepest descent method converges faster to the optimal solution. Else, it often amplifies the effect of oscillations.

³The alignment can be determined by fitting, for instance, a quadratic to a sample of the points.

Various choices of the norm result in different solutions for ∆x [RECAP]

For the 2-norm, ∆x = −∇f(x^(k)) / ∥∇f(x^(k))∥_2

For the 1-norm, ∆x = −sign(∂f(x^(k))/∂x_i) e_i, where e_i is the ith standard basis vector and i is the index of the component of ∇f(x^(k)) with the largest magnitude

For the ∞-norm, ∆x = −sign(∇f(x^(k)))

In every case, you do not expect the steepest descent direction to have an acute angle with the gradient. (The three directions are sketched in code below.)
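As a sketch, the three closed-form directions above can be computed as follows (function names and the example gradient are ours, for illustration):

```python
import numpy as np

def steepest_descent_direction(g, norm="2"):
    """Steepest descent direction for gradient g under the given norm
    (the directions listed on this slide, up to normalization)."""
    if norm == "2":                       # Euclidean norm
        return -g / np.linalg.norm(g)
    if norm == "1":                       # move along the coordinate with largest |partial|
        i = np.argmax(np.abs(g))
        e = np.zeros_like(g)
        e[i] = 1.0
        return -np.sign(g[i]) * e
    if norm == "inf":                     # move against the sign of every component
        return -np.sign(g)
    raise ValueError(norm)

g = np.array([5.0, 10.0, 15.0])           # the gradient from the earlier example
for nrm in ["2", "1", "inf"]:
    d = steepest_descent_direction(g, nrm)
    print(nrm, d, "inner product with gradient:", g @ d)   # always negative
```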

General Algorithm: Steepest Descent (contd) [RECAP]

Find a starting point x^(0) ∈ D.
repeat
  1. Set ∆x^(k) = argmin { ∇^T f(x^(k)) v | ∥v∥ = 1 }.
  2. Choose a step size t^(k) > 0 using exact or backtracking ray search.
  3. Obtain x^(k+1) = x^(k) + t^(k) ∆x^(k).
  4. Set k = k + 1.
until stopping criterion (such as ∥∇f(x^(k+1))∥ ≤ ϵ) is satisfied

Figure 8: The steepest descent algorithm.

Two examples of the steepest descent method are the gradient descent method (for the Euclidean or L_2 norm) and the coordinate-descent method (for the L_1 norm). One fact, however, is that no two norms can give exactly opposite steepest descent directions, though they may point in different directions. (A minimal implementation sketch follows.)
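A minimal sketch of the algorithm in Figure 8, specialized to the L_2 norm direction and backtracking (Armijo) ray search; all parameter values and names are illustrative choices:

```python
import numpy as np

def steepest_descent(f, grad, x0, alpha=0.3, beta=0.8, eps=1e-6, max_iter=1000):
    """Gradient descent (steepest descent under the L2 norm) with backtracking."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:          # stopping criterion ||grad f|| <= eps
            break
        dx = -g / np.linalg.norm(g)           # unit-norm steepest descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):   # Armijo backtracking
            t *= beta
        x = x + t * dx
    return x

# Illustrative use on a simple convex quadratic
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(steepest_descent(f, grad, [2.0, -1.0]))   # converges to (0, 0)
```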

Convergence of Descent Algorithm (to be revisited)

Consider the general descent algorithm (∇^T f(x^k)∆x^k < 0 for each k) with each step x^{k+1} = x^k + t^k ∆x^k.
▶ Suppose f is bounded below in ℜ^n and
▶ is continuously differentiable in an open set N containing the level set {x | f(x) ≤ f(x^0)}
▶ ∇f is Lipschitz continuous.
Then

Σ_{k=1}^∞ (∇^T f(x^k)∆x^k)² / ∥∆x^k∥² < ∞

(that is, an infinite sum that is nevertheless finite). Thus, lim_{k→∞} ∇^T f(x^k)∆x^k / ∥∆x^k∥ = 0.
If we additionally assume that the descent direction is not (asymptotically) orthogonal to the gradient, i.e., −∇^T f(x^k)∆x^k / (∥∆x^k∥ ∥∇f(x^k)∥) ≥ Γ for some Γ > 0 (for all k), then we can show that lim_{k→∞} ∥∇f(x^k)∥ = 0.
Before we try and prove this result, let us discuss Lipschitz continuity (recall from the midsem).


Lipschitz Continuity

Recall: Lipschitz Continuity of f

Formally, f(x): D ⊆ ℜ^n → ℜ is Lipschitz continuous if |f(x) − f(y)| ≤ L ∥x − y∥ for all x, y ∈ D.

A Lipschitz continuous function is limited in how fast it changes: there exists a definite positive real number L > 0 such that, for every pair of points on the graph of the function, the absolute value of the slope of the line connecting them is not greater than this real number. This bound is called the function's Lipschitz constant, L > 0.

We showed that if a function f: ℜ → ℜ is convex in (α, β), it is Lipschitz continuous in [γ, δ] where α < γ < δ < β. We did not assume that f is differentiable. (Recap from the midsem.)
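A tiny numerical illustration of the definition, as a sketch: estimate the best slope bound by sampling pairs of points (the functions and the interval below are arbitrary choices):

```python
import numpy as np

def estimate_lipschitz(f, a, b, n=2000, seed=0):
    """Estimate sup |f(x) - f(y)| / |x - y| over random pairs in [a, b]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(a, b, n)
    y = rng.uniform(a, b, n)
    mask = x != y
    return np.max(np.abs(f(x[mask]) - f(y[mask])) / np.abs(x[mask] - y[mask]))

# |x| is Lipschitz with L = 1; x^2 restricted to [-3, 3] is Lipschitz with L = 6 there
print(estimate_lipschitz(np.abs, -3, 3))        # close to (and never above) 1
print(estimate_lipschitz(np.square, -3, 3))     # approaches 6 from below
```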

Lipschitz continuity

Intuitively, a Lipschitz continuous function is limited in how fast it changes: there exists a definite real number L such that, for every pair of points on the graph of the function, the absolute value of the slope of the line connecting them is not greater than this real number.
▶ This bound is called the function's Lipschitz constant, L > 0.
▶ The sum of two Lipschitz continuous functions is also Lipschitz continuous, with Lipschitz constant the sum of the respective Lipschitz constants.
▶ The product of two Lipschitz continuous and bounded functions is also Lipschitz continuous.

Now, ∇f(x) is Lipschitz continuous if ∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥.

Interpretation of Lipschitz continuity of ∇f(x)

Consider f(x) ∈ R with ∇f(x) = df/dx = f′(x). Then

|f′(x) − f′(y)| ≤ L |x − y|
⇒ |f′(x) − f′(y)| / |x − y| ≤ L
⇒ |f′(x + h) − f′(x)| / |h| ≤ L    (putting y = x + h)

Taking the limit h → 0, we get |f′′(x)| ≤ L (assuming the limit exists).

Intuition: f′′ represents curvature (through a necessary condition only). If the curvature is high, the chances of a descent algorithm oscillating are also high; hence convergence may take longer when L is high.

Lipschitz Continuity of ∇f(x) and the Hessian

Let f(x) have continuous partial derivatives and continuous mixed partial derivatives in an open ball R containing a point x* where ∇f(x*) = 0.
Let ∇²f(x) denote the n × n matrix of mixed partial derivatives of f evaluated at the point x, such that the ijth entry of the matrix is f_{x_i x_j}. The matrix ∇²f(x) is called the Hessian matrix. The Hessian matrix is symmetric⁴ (if the second derivatives are continuous, the Hessian will be symmetric).

⁴By Clairaut's Theorem, if the partial and mixed derivatives of a function are continuous on an open region containing a point x*, then f_{x_i x_j}(x*) = f_{x_j x_i}(x*), for all i, j ∈ [1, n].

For a Lipschitz continuous ∇f: R^n → R^n, we can show that for any vector v,
▶ v^⊤ ∇²f(x) v ≤ L v^⊤ v = v^⊤ (LI) v
  ⇒ v^⊤ (∇²f(x) − LI) v ≤ 0
▶ That is, ∇²f(x) − LI is negative semi-definite.
▶ This can be written as: ∇²f(x) ⪯ LI. (A numeric check is sketched below.)
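A numeric check of ∇²f(x) ⪯ LI at a few random points, as a sketch; the test function below (a quadratic plus a smooth separable term) and the bound L are our illustrative choices:

```python
import numpy as np

# f(x) = 0.5 x^T A x + sum_i log(1 + x_i^2). Its gradient is Lipschitz with
# L <= lambda_max(A) + 2, since |d^2/dt^2 log(1 + t^2)| = |2(1 - t^2)/(1 + t^2)^2| <= 2.
A = np.array([[4.0, 1.0], [1.0, 2.0]])
L = np.max(np.linalg.eigvalsh(A)) + 2.0

def hessian(x):
    return A + np.diag(2 * (1 - x ** 2) / (1 + x ** 2) ** 2)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    lam_max = np.max(np.linalg.eigvalsh(hessian(x)))
    print(lam_max <= L + 1e-12, lam_max, L)   # Hessian(x) ⪯ L I  iff  lam_max <= L
```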

Example: f(x) = x³/3

f(x) = x³/3 ⇒ f′(x) = x²

Claim: f′(x) is locally Lipschitz continuous but not globally. (Intuition: f′′(x) = 2x is unbounded above, hence f′(x) cannot be globally Lipschitz.)

Consider x ∈ R.

sup_{y ∈ (x−1, x+1)} |f′′(y)| = sup_{y ∈ (x−1, x+1)} |2y| ≤ 2(|x| + 1)

Applying the mean value theorem: for every (y, z) ∈ (x−1, x+1)², there exists λ between y and z such that

f′′(λ) = (f′(y) − f′(z)) / (y − z)

March 19, 2018 122 / 188 f ′( y) f ′(z ) = f ′′(λ)(y z) | − | | − | 2 2 x +1 y x , (y,z) (x 1,x + 1) ≤ | | | − | ∀ ∈ − Thus,L= 2 x +1 | | Therefore,

March 19, 2018 123 / 188 f ′( y) f ′(z ) = f ′′(λ)(y z) | − | | − | 2 2 x +1 y x , (y,z) (x 1,x + 1) ≤ | | | − | ∀ ∈ − Thus,L= 2 x +1 | | Therefore,f is Lipschitz continuous in(x 1,x + 1) ′ − But asx ,L →∞ →∞ This implies thatf ′ may not be Lipschitz continuous everywhere Considery= 0, and f (y) f (0) ̸ ′ − ′ y 0 = y | − | | | y asy | |→∞ →∞ Thus,f ′ is proved to not be Lipschitz continuous globally
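A numeric sketch of the claim: on any interval (x−1, x+1) the difference quotients of f′(y) = y² stay below 2(|x| + 1), but against y = 0 the quotient |f′(y) − f′(0)|/|y| = |y| grows without bound (the sampling scheme is ours):

```python
import numpy as np

fp = lambda y: y ** 2            # f'(y) for f(y) = y^3 / 3

def max_slope(center, n=4000, seed=0):
    """Largest difference quotient of f' over random pairs in (center-1, center+1)."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(center - 1, center + 1, n)
    z = rng.uniform(center - 1, center + 1, n)
    mask = y != z
    return np.max(np.abs(fp(y[mask]) - fp(z[mask])) / np.abs(y[mask] - z[mask]))

for c in [0.0, 1.0, 10.0, 100.0]:
    print(c, max_slope(c), "<= local bound 2(|x|+1) =", 2 * (abs(c) + 1))

# Globally, the difference quotient against y = 0 is |y|, which is unbounded:
for y in [10.0, 1e3, 1e6]:
    print(y, abs(fp(y) - fp(0.0)) / abs(y))
```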

Another example

Consider

f(x) = x² sin(1/x²) if x ≠ 0, and f(x) = 0 if x = 0.

We can verify that this function is continuous and differentiable everywhere; in particular f′(0) = 0 (the limits from the left and from the right agree). However, we can show that f(x) is not Lipschitz continuous.
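A numeric sketch of why f fails to be Lipschitz near 0: along the points a_n = 1/√(2πn) and b_n = 1/√(2πn + π/2) (a choice of ours, made so that sin(1/a_n²) = 0 and sin(1/b_n²) = 1), the difference quotients grow without bound:

```python
import numpy as np

def f(x):
    return x ** 2 * np.sin(1.0 / x ** 2) if x != 0 else 0.0

for n in [1, 10, 100, 1000, 10000]:
    a = 1.0 / np.sqrt(2 * np.pi * n)              # sin(1/a^2) = sin(2*pi*n) = 0
    b = 1.0 / np.sqrt(2 * np.pi * n + np.pi / 2)  # sin(1/b^2) = 1
    slope = abs(f(a) - f(b)) / abs(a - b)
    print(n, slope)   # grows roughly like sqrt(n): no Lipschitz constant works near 0
```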

Lipschitz continuity: another example

Consider f′(x) = |x|.
Since |f′(x) − f′(y)| = ||x| − |y|| ≤ |x − y|, f′ (that is, ∇f) is Lipschitz continuous with L = 1.
However, f′ is not differentiable everywhere (not at 0).
In fact, if f is continuously differentiable everywhere (over a closed and bounded set), it is also Lipschitz continuous there.
For functions over a closed and bounded subset of the real line:
{f is continuous} ⊇ {f is differentiable (almost everywhere)} ⊇ {f is Lipschitz continuous} ⊇ {f′ is continuous} ⊇ {f′ is differentiable}
Recap from the midsem (generalized now to f: ℜ^n → ℜ): {f is Lipschitz continuous} ⊇ {f is convex}.

Considering ∇f in Lipschitz continuity

Why is the curvature, in terms of the Hessian, upper bounded for functions with a Lipschitz continuous gradient? If ∇f is Lipschitz continuous, then

∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥

Taylor's theorem states that if f and its first n derivatives f′, f′′, ..., f^(n) are continuous in the closed interval [a, b], and differentiable in (a, b), then there exists a number c ∈ (a, b) such that

f(b) = f(a) + f′(a)(b − a) + (1/2!) f′′(a)(b − a)² + ... + (1/n!) f^(n)(a)(b − a)^n + (1/(n+1)!) f^(n+1)(c)(b − a)^(n+1)

We will invoke Taylor's theorem up to the second degree:

f(y) = f(x) + f′(x)(y − x) + (1/2) f′′(c)(y − x)²

where c ∈ (x, y) and x, y ∈ R.

Let us generalize to f: R^n → R:

f(y) = f(x) + ∇^⊤f(x)(y − x) + (1/2)(y − x)^T ∇²f(c)(y − x)

where c = x + Γ(y − x), Γ ∈ (0, 1), and x, y ∈ R^n.

If ∇f is Lipschitz continuous and f is doubly differentiable,

f(y) ≤ f(x) + ∇^⊤f(x)(y − x) + (L/2) ∥y − x∥²    (34)
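A quick numeric check of inequality (34), as a sketch; the quadratic below and its gradient's Lipschitz constant L = λ_max(A) are illustrative choices:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.max(np.linalg.eigvalsh(A))        # Lipschitz constant of grad f for this quadratic

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    print(lhs <= rhs + 1e-12)            # inequality (34) holds at random points
```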

While we showed (34) assuming f is doubly differentiable, (34) holds for any Lipschitz continuous ∇f(x).

Gradient Descent and Lipschitz Continuity

1. Replacing x by x^k and y by the gradient descent update x^{k+1} = x^k − t∇f(x^k), and applying the condition for Lipschitz continuity:

f(x^{k+1}) ≤ f(x^k) + ∇^T f(x^k)(x^{k+1} − x^k) + (L/2) ∥x^{k+1} − x^k∥²    by (34)

2. For a descent algorithm, ∇^T f(x^k)(x^{k+1} − x^k) = t^k ∇^T f(x^k)∆x^k < 0 for each k.
3. Putting together steps 1 and 2 above,

f(x^{k+1}) ≤ f(x^k) + (L/2) ∥x^{k+1} − x^k∥²    (35)
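One practical consequence of (34): with the fixed step size t = 1/L, every gradient descent step decreases f by at least ∥∇f(x^k)∥²/(2L). A small sketch of this (the function, L, and names are our illustrative choices):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.max(np.linalg.eigvalsh(A))        # Lipschitz constant of grad f

x = np.array([5.0, -3.0])
for k in range(10):
    g = grad(x)
    x_next = x - g / L                   # gradient descent with t = 1/L
    # decrease guaranteed by (34): f(x^{k+1}) <= f(x^k) - ||grad f(x^k)||^2 / (2L)
    print(f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-12, f(x_next))
    x = x_next
```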

Convergence of Descent Algorithms: Generic and Specific Cases

Back to: Generic Convergence of Descent Algorithm (under strong Wolfe conditions)

Consider the general descent algorithm (∇^T f(x^k)∆x^k < 0 for each k) with each step x^{k+1} = x^k + t^k ∆x^k.
▶ Suppose f is bounded below in ℜ^n and
▶ is continuously differentiable in an open set N containing the level set {x | f(x) ≤ f(x^0)}
▶ ∇f is Lipschitz continuous.
Then

Σ_{k=1}^∞ (∇^T f(x^k)∆x^k)² / ∥∆x^k∥² < ∞

(an infinite sum that is nevertheless finite; its bound involves f(x^0) − lim_{k→∞} f(x^k)).

Proof: For any descent algorithm, ∇^T f(x^k)∆x^k < 0 for each k, with each step x^{k+1} = x^k + t^k ∆x^k. From the second Strong Wolfe condition:

|∇^T f(x^k + t^k∆x^k)∆x^k| ≤ c_2 |∇^T f(x^k)∆x^k|    (36)

Proving Convergence of Descent Algorithm

Since c_2 > 0 and ∇^T f(x^k)∆x^k < 0,

∇^T f(x^k + t^k∆x^k)∆x^k ≥ c_2 ∇^T f(x^k)∆x^k    (37)

Subtracting ∇^T f(x^k)∆x^k from both sides of (37):

[∇f(x^k + t^k∆x^k) − ∇f(x^k)]^T ∆x^k ≥ (c_2 − 1) ∇^T f(x^k)∆x^k    (38)

By the Cauchy-Schwarz inequality and from Lipschitz continuity,

[∇f(x^k + t^k∆x^k) − ∇f(x^k)]^T ∆x^k ≤ ∥∇f(x^k + t^k∆x^k) − ∇f(x^k)∥ ∥∆x^k∥ ≤ L ∥∆x^k∥² t^k    (39)

Proving Convergence of Descent Algorithm (contd.)

Combining (38) and (39),

t^k ≥ ((c_2 − 1)/L) · (∇^T f(x^k)∆x^k / ∥∆x^k∥²)    (40)

Substituting (40) into the first Wolfe condition f(x^k + t^k∆x^k) ≤ f(x^k) + c_1 t^k ∇^T f(x^k)∆x^k, and using ∇^T f(x^k)∆x^k < 0:

f(x^{k+1}) ≤ f(x^k) − (c_1(1 − c_2)/L) · (∇^T f(x^k)∆x^k)² / ∥∆x^k∥²

Summing over k and using the fact that f is bounded below:

Σ_{k=1}^∞ (∇^T f(x^k)∆x^k)² / ∥∆x^k∥² < ∞,   so that   lim_{k→∞} ∇^T f(x^k)∆x^k / ∥∆x^k∥ = 0

If we additionally assume that the descent direction never becomes orthogonal to the gradient, i.e., −∇^T f(x^k)∆x^k / (∥∆x^k∥ ∥∇f(x^k)∥) ≥ Γ for some Γ > 0 (for all k), then we can show⁵ that lim_{k→∞} ∥∇f(x^k)∥ = 0.

This shows convergence for a generic descent algorithm. What we are more interested in, however, is the rate of convergence of a descent algorithm.

⁵Making use of the Cauchy-Schwarz inequality
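As a closing sketch, the two quantities in the theorem can be tracked along a gradient descent run with backtracking; note that backtracking enforces only the Armijo part of the Wolfe conditions, so this illustrates the conclusion of the theorem rather than its proof (all names and values below are ours):

```python
import numpy as np

A = np.array([[10.0, 2.0], [2.0, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([3.0, -4.0])
partial_sum = 0.0
for k in range(50):
    g = grad(x)
    dx = -g                                   # descent direction: grad f(x)^T dx < 0
    t, alpha, beta = 1.0, 0.3, 0.8
    while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
        t *= beta                             # backtracking (Armijo) ray search
    partial_sum += (g @ dx) ** 2 / (dx @ dx)  # the summand from the theorem
    x = x + t * dx
print(partial_sum, np.linalg.norm(grad(x)))   # the sum stays bounded; ||grad f|| -> 0
```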