
L. Vandenberghe, ECE236C (Spring 2020)

1. Gradient method

• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method

1.1

Gradient method

to minimize a convex differentiable function f: choose an initial point x0 and repeat

    x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

the step size t_k is constant or determined by line search

Advantages

• every iteration is inexpensive
• does not require second derivatives

Notation

• x_k can refer to the kth element of a sequence, or to the kth component of a vector x
• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence

Gradient method 1.2

Quadratic example

    f(x) = (1/2)(x1^2 + γ x2^2)   (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1),

    ‖x^(k) − x⋆‖_2 / ‖x^(0) − x⋆‖_2 = ((γ − 1)/(γ + 1))^k,   where x⋆ = 0

[figure: zig-zagging iterates x^(k) in the (x1, x2) plane]

gradient method is often slow; convergence very dependent on scaling
(a numerical sketch of this example follows page 1.8 below)

Gradient method 1.3

Nondifferentiable example

    f(x) = sqrt(x1^2 + γ x2^2)   if |x2| ≤ x1
    f(x) = (x1 + γ|x2|) / sqrt(1 + γ)   if |x2| > x1

with exact line search and starting point x^(0) = (γ, 1), the iteration converges to a non-optimal point

[figure: iterates x^(k) in the (x1, x2) plane]

gradient method does not handle nondifferentiable problems

Gradient method 1.4

First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Methods with improved convergence

• conjugate gradient method
• accelerated gradient method
• quasi-Newton methods

Gradient method 1.5

Convex function

a function f is convex if dom f is a convex set and Jensen's inequality holds:

    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)   for all x, y ∈ dom f, θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f, Jensen's inequality can be replaced with

    f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all x, y ∈ dom f

Second-order condition

for twice differentiable f, Jensen's inequality can be replaced with

    ∇^2 f(x) ⪰ 0   for all x ∈ dom f

Gradient method 1.6

Strictly convex function

f is strictly convex if dom f is a convex set and

    f(θx + (1 − θ)y) < θ f(x) + (1 − θ) f(y)   for all x, y ∈ dom f, x ≠ y, and θ ∈ (0, 1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f, strict Jensen's inequality can be replaced with

    f(y) > f(x) + ∇f(x)^T (y − x)   for all x, y ∈ dom f, x ≠ y

Second-order condition

note that ∇^2 f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x^4)

Gradient method 1.7

Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))^T (x − y) ≥ 0   for all x, y ∈ dom f

i.e., the gradient ∇f : R^n → R^n is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))^T (x − y) > 0   for all x, y ∈ dom f, x ≠ y

i.e., the gradient ∇f : R^n → R^n is a strictly monotone mapping

Gradient method 1.8
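Returning to the iteration on page 1.2 and the quadratic example on page 1.3 (referenced there): the rate formula is easy to reproduce numerically. The sketch below is not part of the original slides; it assumes NumPy and uses the closed-form exact line search step t = (g^T g)/(g^T H g) for a quadratic with Hessian H.

```python
# Minimal sketch (assumes NumPy): gradient method with exact line search on the
# quadratic example f(x) = (1/2)(x1^2 + gamma*x2^2) from page 1.3, minimizer x* = 0.
import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])            # Hessian of f, so grad f(x) = H x

x = np.array([gamma, 1.0])           # starting point x^(0) = (gamma, 1)
x0_norm = np.linalg.norm(x)
rate = (gamma - 1) / (gamma + 1)     # predicted rate ((gamma-1)/(gamma+1))^k

for k in range(1, 11):
    g = H @ x                        # gradient at the current iterate
    t = (g @ g) / (g @ H @ g)        # exact line search step for a quadratic
    x = x - t * g                    # x_{k+1} = x_k - t_k * grad f(x_k)
    # observed and predicted values of ||x^(k) - x*||_2 / ||x^(0) - x*||_2
    print(k, np.linalg.norm(x) / x0_norm, rate**k)
```

The two printed columns agree; for large γ the ratio (γ − 1)/(γ + 1) is close to 1, which is the slow, scaling-dependent behavior noted on page 1.3.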
Proof

• if f is differentiable and convex, then

    f(y) ≥ f(x) + ∇f(x)^T (y − x),   f(x) ≥ f(y) + ∇f(y)^T (x − y)

  combining the inequalities gives (∇f(x) − ∇f(y))^T (x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

    g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T (y − x)

  hence

    f(y) = g(1) = g(0) + ∫_0^1 g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T (y − x)

  this is the first-order condition for convexity

Gradient method 1.9

Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

    ‖∇f(x) − ∇f(y)‖_* ≤ L ‖x − y‖   for all x, y ∈ dom f

• functions f with this property are also called L-smooth
• the definition does not assume convexity of f (and holds for −f if it holds for f)
• in the definition, ‖·‖ and ‖·‖_* are a pair of dual norms:

    ‖u‖_* = sup_{v ≠ 0} (u^T v) / ‖v‖ = sup_{‖v‖ = 1} u^T v

  this implies a generalized Cauchy–Schwarz inequality

    |u^T v| ≤ ‖u‖_* ‖v‖   for all u, v

Gradient method 1.10

Choice of norm

Equivalence of norms

• for any two norms ‖·‖_a, ‖·‖_b, there exist positive constants c1, c2 such that

    c1 ‖x‖_b ≤ ‖x‖_a ≤ c2 ‖x‖_b   for all x

• the constants depend on the dimension; for example, for x ∈ R^n,

    ‖x‖_2 ≤ ‖x‖_1 ≤ sqrt(n) ‖x‖_2,   (1/sqrt(n)) ‖x‖_2 ≤ ‖x‖_∞ ≤ ‖x‖_2

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm ‖·‖ = ‖·‖_* = ‖·‖_2
• the parameter L depends on the choice of norm
• in complexity bounds, the choice of norm can simplify the dependence on dimensions

Gradient method 1.11

Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

    (∇f(x) − ∇f(y))^T (x − y) ≤ L ‖x − y‖^2   for all x, y ∈ dom f    (1)

• if dom f is convex, (1) is equivalent to

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖^2   for all x, y ∈ dom f    (2)

[figure: f(y) and its quadratic upper bound, touching the graph at (x, f(x))]

Gradient method 1.12

Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))
• g(t) is defined for t ∈ [0, 1] because dom f is convex
• if (1) holds, then

    g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))^T (y − x) ≤ t L ‖x − y‖^2

  integrating from t = 0 to t = 1 gives (2):

    f(y) = g(1) = g(0) + ∫_0^1 g′(t) dt ≤ g(0) + g′(0) + (L/2) ‖x − y‖^2
         = f(x) + ∇f(x)^T (y − x) + (L/2) ‖x − y‖^2

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

    f(x) ≤ f(y) + ∇f(y)^T (x − y) + (L/2) ‖x − y‖^2,

  can be combined to give (∇f(x) − ∇f(y))^T (x − y) ≤ L ‖x − y‖^2

Gradient method 1.13

Consequence of quadratic upper bound

if dom f = R^n and f has a minimizer x⋆, then

    (1/(2L)) ‖∇f(z)‖_*^2 ≤ f(z) − f(x⋆) ≤ (L/2) ‖z − x⋆‖^2   for all z

• the right-hand inequality follows from the upper bound property (2) at x = x⋆, y = z
• the left-hand inequality follows by minimizing the quadratic upper bound (2) for x = z over y
  (parametrizing y = z + t v with ‖v‖ = 1):

    inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)^T (y − z) + (L/2) ‖y − z‖^2 )
              = inf_{‖v‖ = 1} inf_t ( f(z) + t ∇f(z)^T v + (L/2) t^2 )
              = inf_{‖v‖ = 1} ( f(z) − (1/(2L)) (∇f(z)^T v)^2 )
              = f(z) − (1/(2L)) ‖∇f(z)‖_*^2

Gradient method 1.14

Co-coercivity of gradient

if f is convex with dom f = R^n and ∇f is L-Lipschitz continuous, then

    (∇f(x) − ∇f(y))^T (x − y) ≥ (1/L) ‖∇f(x) − ∇f(y)‖_*^2   for all x, y

• this property is known as co-coercivity of ∇f (with parameter 1/L)
• co-coercivity in turn implies Lipschitz continuity of ∇f (by Cauchy–Schwarz)
• hence, for differentiable convex f with dom f = R^n,

    Lipschitz continuity of ∇f ⟹ upper bound property (2) (equivalently, (1))
                               ⟹ co-coercivity of ∇f
                               ⟹ Lipschitz continuity of ∇f

  therefore the three properties are equivalent

Gradient method 1.15
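Before the proof of co-coercivity on the next page, the inequalities on pages 1.14 and 1.15 can be checked numerically. The sketch below is an added illustration, not part of the slides; it uses a convex quadratic test function, for which the gradient is L-Lipschitz in the Euclidean norm with L equal to the largest eigenvalue of the Hessian (so ‖·‖_* = ‖·‖_2).

```python
# Sketch (assumes NumPy): verify the two-sided bound on page 1.14 and co-coercivity
# on page 1.15 for the convex quadratic f(x) = 0.5*x'Ax + b'x with A positive definite.
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # positive definite Hessian
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(A).max()      # Lipschitz constant of grad f (Euclidean norm)

f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
x_star = np.linalg.solve(A, -b)      # minimizer of f

for _ in range(5):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    # (1/(2L))||grad f(x)||^2 <= f(x) - f(x*) <= (L/2)||x - x*||^2   (page 1.14)
    assert (1 / (2 * L)) * np.linalg.norm(grad(x))**2 <= f(x) - f(x_star) + 1e-9
    assert f(x) - f(x_star) <= (L / 2) * np.linalg.norm(x - x_star)**2 + 1e-9
    # co-coercivity: (grad f(x) - grad f(y))'(x - y) >= (1/L)||grad f(x) - grad f(y)||^2
    d, g = x - y, grad(x) - grad(y)
    assert g @ d >= (1 / L) * (g @ g) - 1e-9
```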
Proof of co-coercivity: define two convex functions fx, fy with domain R^n:

    fx(z) = f(z) − ∇f(x)^T z,   fy(z) = f(z) − ∇f(y)^T z

• the two functions have L-Lipschitz continuous gradients
• z = x minimizes fx(z); from the left-hand inequality on page 1.14,

    f(y) − f(x) − ∇f(x)^T (y − x) = fx(y) − fx(x)
                                  ≥ (1/(2L)) ‖∇fx(y)‖_*^2
                                  = (1/(2L)) ‖∇f(y) − ∇f(x)‖_*^2

• similarly, z = y minimizes fy(z); therefore

    f(x) − f(y) − ∇f(y)^T (x − y) ≥ (1/(2L)) ‖∇f(y) − ∇f(x)‖_*^2

combining the two inequalities shows co-coercivity

Gradient method 1.16

Lipschitz continuity with respect to Euclidean norm

suppose f is convex with dom f = R^n, and L-smooth for the Euclidean norm:

    ‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2   for all x, y

• the equivalent property (1) states that

    (∇f(x) − ∇f(y))^T (x − y) ≤ L (x − y)^T (x − y)   for all x, y

• this is monotonicity of Lx − ∇f(x), i.e., equivalent to the property that

    (L/2) ‖x‖_2^2 − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇^2 f(x):

    λ_max(∇^2 f(x)) ≤ L   for all x

  is an equivalent characterization of L-smoothness

Gradient method 1.17

Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y) − (m/2) θ(1 − θ) ‖x − y‖^2

holds for all x, y ∈ dom f, θ ∈ [0, 1]

• this is a stronger version of Jensen's inequality
• it holds if and only if it holds for f restricted to arbitrary lines, i.e.,

    f(x + t(y − x)) − (m/2) t^2 ‖x − y‖^2    (3)

  is a convex function of t, for all x, y ∈ dom f
• without loss of generality, we can take ‖·‖ = ‖·‖_2
• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18

Quadratic lower bound

if f is differentiable and m-strongly convex, then

    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2) ‖y − x‖^2   for all x, y ∈ dom f    (4)

[figure: f(y) and its quadratic lower bound, touching the graph at (x, f(x))]

• follows from the first-order condition for convexity of (3)
• this implies that the sublevel sets of f are bounded
• if f is closed (has closed sublevel sets), it has a unique minimizer x⋆ and

    (m/2) ‖z − x⋆‖^2 ≤ f(z) − f(x⋆) ≤ (1/(2m)) ‖∇f(z)‖_*^2   for all z ∈ dom f

  (proof as on page 1.14)

Gradient method 1.19

Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))^T (x − y) ≥ m ‖x − y‖^2   for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched
• for the other direction, assume ∇f is strongly monotone and define

    g(t) = f(x + t(y − x)) − (m/2) t^2 ‖x − y‖^2

  then g′(t) is nondecreasing, so g is convex

Gradient method 1.20
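In the same spirit (again an added sketch, not part of the slides), the strong convexity inequalities can be checked for a positive definite quadratic, where the strong convexity parameter in the Euclidean norm is m, the smallest eigenvalue of the Hessian.

```python
# Sketch (assumes NumPy): verify the quadratic lower bound (4), the bound on page 1.19,
# and strong monotonicity (page 1.20) for f(x) = 0.5*x'Ax + b'x with A positive definite.
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # positive definite, so f is strongly convex
b = rng.standard_normal(n)
m = np.linalg.eigvalsh(A).min()      # strong convexity parameter (Euclidean norm)

f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
x_star = np.linalg.solve(A, -b)      # unique minimizer

for _ in range(5):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    # quadratic lower bound (4)
    assert f(y) >= f(x) + grad(x) @ (y - x) + (m / 2) * np.linalg.norm(y - x)**2 - 1e-9
    # (m/2)||x - x*||^2 <= f(x) - f(x*) <= (1/(2m))||grad f(x)||^2   (page 1.19)
    assert (m / 2) * np.linalg.norm(x - x_star)**2 <= f(x) - f(x_star) + 1e-9
    assert f(x) - f(x_star) <= (1 / (2 * m)) * np.linalg.norm(grad(x))**2 + 1e-9
    # strong monotonicity of the gradient (page 1.20)
    assert (grad(x) - grad(y)) @ (x - y) >= m * np.linalg.norm(x - y)**2 - 1e-9
```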