
L. Vandenberghe ECE236C (Spring 2020)

1. Gradient method

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

1.1

Gradient method

to minimize a convex differentiable function f : choose an initial point x0 and repeat

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

the step size t_k is constant or determined by line search

Advantages

• every iteration is inexpensive

• does not require second derivatives

Notation

• x_k can refer to the kth element of a sequence, or to the kth component of vector x

• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence
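as an illustration (not part of the original slides), the iteration can be implemented in a few lines of NumPy; the quadratic objective and step size below are illustrative choices:

```python
import numpy as np

def gradient_method(grad, x0, t=0.1, iters=100):
    """Basic gradient iteration x_{k+1} = x_k - t * grad(x_k) (constant step)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# illustrative problem: f(x) = ||x - a||^2 / 2 has gradient x - a and minimizer a
a = np.array([1.0, -2.0])
x = gradient_method(lambda x: x - a, np.zeros(2), t=0.5, iters=60)
```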

Gradient method 1.2

Quadratic example

f(x) = (1/2)(x_1² + γx_2²)   (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1)

‖x^(k) − x⋆‖_2 / ‖x^(0) − x⋆‖_2 = ((γ − 1)/(γ + 1))^k,   where x⋆ = 0

[figure: gradient method iterates on the level curves of f in the (x_1, x_2) plane]

gradient method is often slow; convergence very dependent on scaling
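this convergence ratio can be checked numerically; the sketch below (my own, not from the slides) uses the closed-form exact line search step for a quadratic, t = gᵀg/(gᵀHg):

```python
import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])                 # f(x) = (1/2) x^T H x

x = np.array([gamma, 1.0])                # starting point x^(0) = (gamma, 1)
r0 = np.linalg.norm(x)                    # ||x^(0) - xstar||_2, with xstar = 0
for k in range(1, 11):
    g = H @ x                             # gradient of f
    t = (g @ g) / (g @ H @ g)             # exact line search step for a quadratic
    x = x - t * g
    predicted = ((gamma - 1) / (gamma + 1)) ** k
    assert abs(np.linalg.norm(x) / r0 - predicted) < 1e-12
```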

Gradient method 1.3

Nondifferentiable example

f(x) = √(x_1² + γx_2²) if |x_2| ≤ x_1,   f(x) = (x_1 + γ|x_2|)/√(1 + γ) if |x_2| > x_1

with exact line search, starting point x^(0) = (γ, 1), converges to a non-optimal point

[figure: gradient method iterates in the (x_1, x_2) plane, converging to a non-optimal point]

gradient method does not handle nondifferentiable problems
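the failure can be reproduced numerically; this sketch is my own, not part of the slides: it implements the exact line search with a ternary search (valid because f is convex, hence unimodal along any line) and checks that the iterates approach the non-optimal point 0 while f is unbounded below:

```python
import numpy as np

gamma = 2.0

def f(x):
    x1, x2 = x
    if abs(x2) <= x1:
        return np.sqrt(x1**2 + gamma * x2**2)
    return (x1 + gamma * abs(x2)) / np.sqrt(1 + gamma)

def grad(x):
    # gradient on the region |x2| < x1, where f is differentiable
    x1, x2 = x
    s = np.sqrt(x1**2 + gamma * x2**2)
    return np.array([x1 / s, gamma * x2 / s])

def exact_step(x, d):
    # ternary search for the exact line search step (f convex => unimodal in t)
    lo, hi = 0.0, 10.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(x + m1 * d) < f(x + m2 * d):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x = np.array([gamma, 1.0])                # starting point (gamma, 1)
for _ in range(60):
    d = -grad(x)
    x = x + exact_step(x, d) * d

# the iterates converge to the origin ...
assert np.linalg.norm(x) < 1e-3
# ... which is not optimal: f is unbounded below
assert f(np.array([-100.0, 0.0])) < f(x) - 1.0
```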

Gradient method 1.4

First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• proximal gradient method

• smoothing methods

• cutting-plane methods

Methods with improved convergence

• conjugate gradient method

• accelerated gradient method

• quasi-Newton methods

Gradient method 1.5

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:

f (θx + (1 − θ)y) ≤ θ f (x) + (1 − θ) f (y) for all x, y ∈ dom f , θ ∈ [0,1]

First-order condition

for (continuously) differentiable f , Jensen’s inequality can be replaced with

f (y) ≥ f (x) + ∇ f (x)T(y − x) for all x, y ∈ dom f

Second-order condition

for twice differentiable f , Jensen’s inequality can be replaced with

∇²f(x) ⪰ 0 for all x ∈ dom f

Gradient method 1.6

Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y) for all x, y ∈ dom f , x ≠ y, and θ ∈ (0,1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f , strict Jensen’s inequality can be replaced with

f(y) > f(x) + ∇f(x)^T(y − x) for all x, y ∈ dom f , x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf., f(x) = x⁴)

Gradient method 1.7

Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ dom f

i.e., the gradient ∇f : R^n → R^n is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T (x − y) > 0 for all x, y ∈ dom f , x ≠ y

i.e., the gradient ∇f : R^n → R^n is a strictly monotone mapping

Gradient method 1.8

Proof

• if f is differentiable and convex, then

f (y) ≥ f (x) + ∇ f (x)T(y − x), f (x) ≥ f (y) + ∇ f (y)T(x − y)

combining the inequalities gives (∇ f (x) − ∇ f (y))T(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T (y − x)

hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T(y − x)

this is the first-order condition for convexity

Gradient method 1.9

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖ for all x, y ∈ dom f

• functions f with this property are also called L-smooth

• the definition does not assume convexity of f (and holds for − f if it holds for f )

• in the definition, ‖ · ‖ and ‖ · ‖_* are a pair of dual norms:

‖u‖_* = sup_{v ≠ 0} (u^T v)/‖v‖ = sup_{‖v‖=1} u^T v

this implies a generalized Cauchy–Schwarz inequality

|u^T v| ≤ ‖u‖_* ‖v‖ for all u, v
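as a quick numerical illustration (mine, not from the slides): for ‖ · ‖ = ‖ · ‖_∞ the dual norm is ‖ · ‖_1, and the generalized Cauchy–Schwarz inequality can be checked on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# for the pair ||.|| = ||.||_inf, the dual norm is ||.||_* = ||.||_1
for _ in range(100):
    u = rng.standard_normal(5)
    v = rng.standard_normal(5)
    # generalized Cauchy-Schwarz: |u^T v| <= ||u||_1 ||v||_inf
    assert abs(u @ v) <= np.sum(np.abs(u)) * np.max(np.abs(v)) + 1e-12

# the supremum in the dual norm definition is attained at v = sign(u)
u = rng.standard_normal(5)
v = np.sign(u)                      # ||v||_inf = 1
assert np.isclose(u @ v, np.sum(np.abs(u)))
```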

Gradient method 1.10

Choice of norm

Equivalence of norms

• for any two norms k · ka, k · kb, there exist positive constants c1, c2 such that

c_1‖x‖_b ≤ ‖x‖_a ≤ c_2‖x‖_b for all x

• constants depend on dimension; for example, for x ∈ Rn,

‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2,   (1/√n) ‖x‖_2 ≤ ‖x‖_∞ ≤ ‖x‖_2

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm ‖ · ‖ = ‖ · ‖_* = ‖ · ‖_2

• the parameter L depends on choice of norm

• in complexity bounds, choice of norm can simplify dependence on dimensions

Gradient method 1.11

Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

(∇f(x) − ∇f(y))^T (x − y) ≤ L‖x − y‖² for all x, y ∈ dom f (1)

• if dom f is convex, (1) is equivalent to

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖² for all x, y ∈ dom f (2)

[figure: graph of f and its quadratic upper bound at the point (x, f(x))]

Gradient method 1.12

Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))

• g(t) is defined for t ∈ [0,1] because dom f is convex

• if (1) holds, then

g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))^T (y − x) ≤ tL‖x − y‖²

integrating from t = 0 to t = 1 gives (2):

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖²
     = f(x) + ∇f(x)^T(y − x) + (L/2)‖x − y‖²

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

f(x) ≤ f(y) + ∇f(y)^T(x − y) + (L/2)‖x − y‖²,

can be combined to give (∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²

Gradient method 1.13

Consequence of quadratic upper bound

if dom f = R^n and f has a minimizer x⋆, then

(1/(2L)) ‖∇f(z)‖_*² ≤ f(z) − f(x⋆) ≤ (L/2) ‖z − x⋆‖² for all z

• right-hand inequality follows from upper bound property (2) at x = x⋆, y = z

• left-hand inequality follows by minimizing quadratic upper bound for x = z

inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)^T(y − z) + (L/2)‖y − z‖² )
           = inf_{‖v‖=1} inf_t ( f(z) + t ∇f(z)^T v + (L/2)t² )
           = inf_{‖v‖=1} ( f(z) − (∇f(z)^T v)²/(2L) )
           = f(z) − (1/(2L)) ‖∇f(z)‖_*²
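both inequalities can be verified numerically; the least-squares sketch below is my own illustration (Euclidean norm, which is its own dual):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()      # Lipschitz constant of grad f
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
fstar = f(xstar)

for _ in range(100):
    z = rng.standard_normal(5)
    gap = f(z) - fstar
    # (1/(2L)) ||grad f(z)||^2 <= f(z) - fstar <= (L/2) ||z - xstar||^2
    assert np.sum(grad(z) ** 2) / (2 * L) <= gap + 1e-9
    assert gap <= L / 2 * np.sum((z - xstar) ** 2) + 1e-9
```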

Gradient method 1.14

Co-coercivity of gradient

if f is convex with dom f = R^n and ∇f is L-Lipschitz continuous, then

(∇f(x) − ∇f(y))^T(x − y) ≥ (1/L) ‖∇f(x) − ∇f(y)‖_*² for all x, y

• this property is known as co-coercivity of ∇ f (with parameter 1/L)

• co-coercivity in turn implies Lipschitz continuity of ∇ f (by Cauchy–Schwarz)

• hence, for differentiable convex f with dom f = Rn

Lipschitz continuity of ∇f
⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of ∇f
⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent

Gradient method 1.15

Proof of co-coercivity: define two convex functions f_x, f_y with domain R^n

f_x(z) = f(z) − ∇f(x)^T z,   f_y(z) = f(z) − ∇f(y)^T z

• the two functions have L-Lipschitz continuous gradients

• z = x minimizes f_x(z); from the left-hand inequality on page 1.14,

f(y) − f(x) − ∇f(x)^T(y − x) = f_x(y) − f_x(x)
                             ≥ (1/(2L)) ‖∇f_x(y)‖_*²
                             = (1/(2L)) ‖∇f(y) − ∇f(x)‖_*²

• similarly, z = y minimizes f_y(z); therefore

f(x) − f(y) − ∇f(y)^T(x − y) ≥ (1/(2L)) ‖∇f(y) − ∇f(x)‖_*²

combining the two inequalities shows co-coercivity
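a numerical check of co-coercivity (my own sketch, not from the slides), using a smooth convex logistic-type function for which L = λ_max(AᵀA)/4 is a valid Lipschitz constant of the gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 4))

def f(x):                                  # convex and L-smooth
    return np.logaddexp(0.0, A @ x).sum()  # sum_i log(1 + exp(a_i^T x))

def grad(x):
    return A.T @ (1.0 / (1.0 + np.exp(-(A @ x))))

# sigmoid' <= 1/4, so L = lambda_max(A^T A) / 4 bounds the Hessian
L = np.linalg.eigvalsh(A.T @ A).max() / 4

for _ in range(100):
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)
    dg = grad(x) - grad(y)
    # co-coercivity with parameter 1/L
    assert dg @ (x - y) >= (dg @ dg) / L - 1e-9
```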

Gradient method 1.16

Lipschitz continuity with respect to Euclidean norm

suppose f is convex with dom f = R^n, and L-smooth for the Euclidean norm:

‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2 for all x, y

• the equivalent property (1) states that

(∇f(x) − ∇f(y))^T(x − y) ≤ L(x − y)^T(x − y) for all x, y

• this is monotonicity of Lx − ∇ f (x), i.e., equivalent to the property that

(L/2)‖x‖_2² − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x):

λ_max(∇²f(x)) ≤ L for all x

is an equivalent characterization of L-smoothness

Gradient method 1.17

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f , θ ∈ [0,1]

• this is a stronger version of Jensen’s inequality

• it holds if and only if it holds for f restricted to arbitrary lines:

f(x + t(y − x)) − (m/2) t²‖x − y‖² (3)

is a convex function of t, for all x, y ∈ dom f

• without loss of generality, we can take ‖ · ‖ = ‖ · ‖_2

• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18

Quadratic lower bound

if f is differentiable and m-strongly convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖² for all x, y ∈ dom f (4)

[figure: graph of f and its quadratic lower bound at the point (x, f(x))]

• follows from the first-order condition for convexity of (3)

• this implies that the sublevel sets of f are bounded

• if f is closed (has closed sublevel sets), it has a unique minimizer x⋆ and

(m/2)‖z − x⋆‖² ≤ f(z) − f(x⋆) ≤ (1/(2m))‖∇f(z)‖_*² for all z ∈ dom f

(proof as on page 1.14)

Gradient method 1.19

Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ m‖x − y‖² for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched

• for the other direction, assume ∇ f is strongly monotone and define

g(t) = f(x + t(y − x)) − (m/2) t²‖x − y‖²

then g′(t) is nondecreasing, so g is convex

Gradient method 1.20

Strong convexity with respect to Euclidean norm

suppose f is m-strongly convex for the Euclidean norm:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖_2²

for x, y ∈ dom f , θ ∈ [0,1]

• this is Jensen’s inequality for the function

h(x) = f(x) − (m/2)‖x‖_2²

• therefore f is strongly convex if and only if h is convex

• if f is twice differentiable, h is convex if and only if ∇²f(x) − mI ⪰ 0, or

λ_min(∇²f(x)) ≥ m for all x ∈ dom f

Gradient method 1.21

Extension of co-coercivity

suppose f is m-strongly convex and L-smooth for ‖ · ‖_2, and dom f = R^n

• then the function h(x) = f(x) − (m/2)‖x‖_2² is convex and (L − m)-smooth:

0 ≤ (∇h(x) − ∇h(y))^T(x − y)
  = (∇f(x) − ∇f(y))^T(x − y) − m‖x − y‖_2²
  ≤ (L − m)‖x − y‖_2²

• co-coercivity of ∇h can be written as

(∇f(x) − ∇f(y))^T (x − y) ≥ (mL/(m + L))‖x − y‖_2² + (1/(m + L))‖∇f(x) − ∇f(y)‖_2²

for all x, y ∈ dom f
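the inequality can be checked numerically on a strongly convex quadratic (an illustrative sketch of my own; for f(x) = xᵀHx/2 the constants are m = λ_min(H), L = λ_max(H)):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((10, 6))
H = B.T @ B + 0.5 * np.eye(6)       # Hessian of a strongly convex quadratic

def grad(x):                        # gradient of f(x) = x^T H x / 2
    return H @ x

eig = np.linalg.eigvalsh(H)
m, L = eig.min(), eig.max()         # strong convexity / smoothness constants

for _ in range(100):
    x = rng.standard_normal(6)
    y = rng.standard_normal(6)
    dg, dx = grad(x) - grad(y), x - y
    lhs = dg @ dx
    rhs = (m * L / (m + L)) * (dx @ dx) + (dg @ dg) / (m + L)
    assert lhs >= rhs - 1e-9
```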

Gradient method 1.22

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Analysis of gradient method

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rn

2. ∇ f (x) is L-Lipschitz continuous with respect to the Euclidean norm, with L > 0

3. the optimal value f⋆ = inf_x f(x) is finite and attained at x⋆

Gradient method 1.23

Basic gradient step

• from quadratic upper bound (page 1.12) with y = x − t∇ f (x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2) ‖∇f(x)‖_2²

• therefore, if x+ = x − t∇ f (x) and 0 < t ≤ 1/L,

f(x+) ≤ f(x) − (t/2)‖∇f(x)‖_2² (5)

• from (5) and convexity of f ,

f(x+) − f⋆ ≤ ∇f(x)^T(x − x⋆) − (t/2)‖∇f(x)‖_2²
           = (1/(2t)) ( ‖x − x⋆‖_2² − ‖x − x⋆ − t∇f(x)‖_2² )
           = (1/(2t)) ( ‖x − x⋆‖_2² − ‖x+ − x⋆‖_2² ) (6)

Gradient method 1.24

Descent properties

assume ∇f(x) ≠ 0 and 0 < t ≤ 1/L

• the inequality (5) shows that

f (x+) < f (x)

• the inequality (6) shows that

‖x+ − x⋆‖_2 < ‖x − x⋆‖_2

in the gradient method, function value and distance to the optimal set decrease

Gradient method 1.25

Gradient method with constant step size

xk+1 = xk − t∇ f (xk), k = 0,1,...

• take x = x_{i−1}, x+ = x_i in (6) and add the bounds for i = 1, ..., k:

∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2t)) ∑_{i=1}^k ( ‖x_{i−1} − x⋆‖_2² − ‖x_i − x⋆‖_2² )
                          = (1/(2t)) ( ‖x_0 − x⋆‖_2² − ‖x_k − x⋆‖_2² )
                          ≤ (1/(2t)) ‖x_0 − x⋆‖_2²

• since f (xi) is non-increasing (see (5))

f(x_k) − f⋆ ≤ (1/k) ∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2kt)) ‖x_0 − x⋆‖_2²

Conclusion: number of iterations to reach f(x_k) − f⋆ ≤ ε is O(1/ε)
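the bound can be verified numerically; the following least-squares sketch (my own, not from the slides) runs the method with t = 1/L and checks the bound at every iteration:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

L = np.linalg.eigvalsh(A.T @ A).max()      # Lipschitz constant of the gradient
t = 1.0 / L
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
fstar = f(xstar)

x0 = np.zeros(8)
x = x0.copy()
for k in range(1, 101):
    x = x - t * A.T @ (A @ x - b)          # constant step size t = 1/L
    # bound from the analysis: f(x_k) - fstar <= ||x0 - xstar||^2 / (2 k t)
    assert f(x) - fstar <= np.sum((x0 - xstar) ** 2) / (2 * k * t) + 1e-9
```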

Gradient method 1.26

Backtracking line search

initialize t_k at t̂ > 0 (for example, t̂ = 1) and take t_k := βt_k until

f(x_k − t_k∇f(x_k)) < f(x_k) − αt_k ‖∇f(x_k)‖_2²

[figure: f(x_k − t∇f(x_k)) as a function of t, together with the lines f(x_k) − αt‖∇f(x_k)‖_2² and f(x_k) − t‖∇f(x_k)‖_2²]

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
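a direct implementation sketch (mine, not from the slides) of the gradient method with this backtracking condition; the stopping test on the gradient norm is an added practical safeguard, and the quadratic test problem is illustrative:

```python
import numpy as np

def backtracking_gradient(f, grad, x0, t_hat=1.0, alpha=0.5, beta=0.5, iters=500):
    """Gradient method with backtracking: shrink t by beta until the
    sufficient-decrease condition f(x - t g) < f(x) - alpha * t * ||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if g @ g <= 1e-24:          # practical stopping test (added safeguard)
            break
        t = t_hat
        while f(x - t * g) >= f(x) - alpha * t * (g @ g):
            t *= beta               # backtrack until sufficient decrease
        x = x - t * g
    return x

# illustrative problem: ill-conditioned quadratic f(x) = x^T H x / 2
H = np.diag([1.0, 10.0])
x = backtracking_gradient(lambda z: 0.5 * z @ H @ z,
                          lambda z: H @ z,
                          np.array([10.0, 1.0]))
assert np.linalg.norm(x) < 1e-3
```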

Gradient method 1.27

Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

[figure: f(x_k − t∇f(x_k)) as a function of t, bounded above by f(x_k) − t(1 − tL/2)‖∇f(x_k)‖_2², which lies below f(x_k) − (t/2)‖∇f(x_k)‖_2² for t ≤ 1/L]

the selected step size satisfies t_k ≥ t_min = min{t̂, β/L}

Gradient method 1.28

Gradient method with backtracking line search

• from the line search condition and convexity of f ,

f(x_{i+1}) ≤ f(x_i) − (t_i/2)‖∇f(x_i)‖_2²
           ≤ f⋆ + ∇f(x_i)^T(x_i − x⋆) − (t_i/2)‖∇f(x_i)‖_2²
           = f⋆ + (1/(2t_i)) ( ‖x_i − x⋆‖_2² − ‖x_{i+1} − x⋆‖_2² )

• this implies ‖x_{i+1} − x⋆‖_2 ≤ ‖x_i − x⋆‖_2, so we can replace t_i with t_min ≤ t_i:

f(x_{i+1}) − f⋆ ≤ (1/(2t_min)) ( ‖x_i − x⋆‖_2² − ‖x_{i+1} − x⋆‖_2² )

• adding the upper bounds gives the same 1/k bound as with constant step size

f(x_k) − f⋆ ≤ (1/k) ∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2k t_min)) ‖x_0 − x⋆‖_2²

Gradient method 1.29

Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on p. 1.23

Analysis for constant step size

if x+ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

‖x+ − x⋆‖_2² = ‖x − t∇f(x) − x⋆‖_2²
            = ‖x − x⋆‖_2² − 2t ∇f(x)^T(x − x⋆) + t²‖∇f(x)‖_2²
            ≤ (1 − t(2mL/(m + L))) ‖x − x⋆‖_2² + t(t − 2/(m + L)) ‖∇f(x)‖_2²
            ≤ (1 − t(2mL/(m + L))) ‖x − x⋆‖_2²

(step 3 follows from result on page 1.22)

Gradient method 1.30

Distance to optimum

‖x_k − x⋆‖_2² ≤ c^k ‖x_0 − x⋆‖_2²,   c = 1 − t(2mL/(m + L))

• implies (linear) convergence

• for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1.14)

f(x_k) − f⋆ ≤ (L/2)‖x_k − x⋆‖_2² ≤ (c^k L/2)‖x_0 − x⋆‖_2²

Conclusion: number of iterations to reach f(x_k) − f⋆ ≤ ε is O(log(1/ε))
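the linear rate can be checked on a simple strongly convex quadratic (an illustrative sketch of my own; the diagonal Hessian makes m and L explicit):

```python
import numpy as np

m, L = 1.0, 10.0
H = np.diag([m, L])                 # f(x) = x^T H x / 2: m-strongly convex, L-smooth
t = 2.0 / (m + L)                   # step size from the analysis
c = 1 - t * 2 * m * L / (m + L)     # contraction factor, here ((L/m - 1)/(L/m + 1))^2

x0 = np.array([1.0, 1.0])
x = x0.copy()
for k in range(1, 51):
    x = x - t * (H @ x)
    # distance bound: ||x_k - xstar||^2 <= c^k ||x0 - xstar||^2  (xstar = 0)
    assert np.sum(x ** 2) <= c ** k * np.sum(x0 ** 2) + 1e-12
```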

Gradient method 1.31

Limits on convergence rate of first-order methods

First-order method: any iterative method that selects x_{k+1} in the set

x_0 + span{∇f(x_0), ∇f(x_1), ..., ∇f(x_k)}

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that for any first-order method

f(x_k) − f⋆ ≥ (3/32) L‖x_0 − x⋆‖_2² / (k + 1)²

• suggests 1/k rate for gradient method is not optimal

• more recent accelerated gradient methods have 1/k2 convergence (see later)

Gradient method 1.32

References

• A. Beck, First-Order Methods in Optimization (2017), chapter 5.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4.

• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.

Gradient method 1.33