1. Gradient Method


L. Vandenberghe, ECE236C (Spring 2020)

Outline
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method

Gradient method (page 1.2)

To minimize a convex differentiable function f: choose an initial point x₀ and repeat

    x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

The step size t_k is constant or determined by line search.

Advantages
• every iteration is inexpensive
• does not require second derivatives

Notation
• x_k can refer to the kth element of a sequence, or to the kth component of the vector x
• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence

Quadratic example (page 1.3)

    f(x) = (1/2)(x₁² + γx₂²)   (with γ > 1)

With exact line search and starting point x^(0) = (γ, 1),

    ‖x^(k) − x*‖₂ / ‖x^(0) − x*‖₂ = ((γ − 1)/(γ + 1))^k,

where x* = 0. [Figure: the iterates zig-zag toward the origin in the (x₁, x₂) plane.]

The gradient method is often slow; convergence is very dependent on scaling.

Nondifferentiable example (page 1.4)

    f(x) = √(x₁² + γx₂²)  if |x₂| ≤ x₁,    f(x) = (x₁ + γ|x₂|)/√(1 + γ)  if |x₂| > x₁

With exact line search and starting point x^(0) = (γ, 1), the iterates converge to a non-optimal point. [Figure: the iterates zig-zag along the boundary |x₂| = x₁ and stall away from the minimizer.]

The gradient method does not handle nondifferentiable problems.

First-order methods (page 1.5)

First-order methods address one or both shortcomings of the gradient method.

Methods for nondifferentiable or constrained problems
• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Methods with improved convergence
• conjugate gradient method
• accelerated gradient method
• quasi-Newton methods

Convex function (page 1.6)

A function f is convex if dom f is a convex set and Jensen's inequality holds:

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, θ ∈ [0, 1]

First-order condition: for (continuously) differentiable f, Jensen's inequality can be replaced with

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)   for all x, y ∈ dom f

Second-order condition: for twice differentiable f, Jensen's inequality can be replaced with

    ∇²f(x) ⪰ 0   for all x ∈ dom f

Strictly convex function (page 1.7)

f is strictly convex if dom f is a convex set and

    f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)   for all x, y ∈ dom f, x ≠ y, θ ∈ (0, 1)

Strict convexity implies that if a minimizer of f exists, it is unique.

First-order condition: for differentiable f, strict Jensen's inequality can be replaced with

    f(y) > f(x) + ∇f(x)ᵀ(y − x)   for all x, y ∈ dom f, x ≠ y

Second-order condition: note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴).

Monotonicity of gradient (page 1.8)

A differentiable function f is convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0   for all x, y ∈ dom f,

i.e., the gradient ∇f: Rⁿ → Rⁿ is a monotone mapping.

A differentiable function f is strictly convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))ᵀ(x − y) > 0   for all x, y ∈ dom f, x ≠ y,

i.e., the gradient ∇f: Rⁿ → Rⁿ is a strictly monotone mapping.

Proof (page 1.9)

• If f is differentiable and convex, then

      f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),    f(x) ≥ f(y) + ∇f(y)ᵀ(x − y);

  combining the inequalities gives (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0.

• If ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

      g(t) = f(x + t(y − x)),    g′(t) = ∇f(x + t(y − x))ᵀ(y − x);

  hence

      f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)ᵀ(y − x),

  which is the first-order condition for convexity.

Lipschitz continuous gradient (page 1.10)

The gradient of f is Lipschitz continuous with parameter L > 0 if

    ‖∇f(x) − ∇f(y)‖∗ ≤ L‖x − y‖   for all x, y ∈ dom f

• functions f with this property are also called L-smooth
• the definition does not assume convexity of f (and holds for −f if it holds for f)
• in the definition, ‖·‖ and ‖·‖∗ are a pair of dual norms:

      ‖u‖∗ = sup_{v≠0} uᵀv/‖v‖ = sup_{‖v‖=1} uᵀv;

  this implies a generalized Cauchy–Schwarz inequality

      |uᵀv| ≤ ‖u‖∗‖v‖   for all u, v

Choice of norm (page 1.11)

Equivalence of norms
• for any two norms ‖·‖ₐ, ‖·‖_b, there exist positive constants c₁, c₂ such that

      c₁‖x‖_b ≤ ‖x‖ₐ ≤ c₂‖x‖_b   for all x

• the constants depend on the dimension; for example, for x ∈ Rⁿ,

      ‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂,    (1/√n)‖x‖₂ ≤ ‖x‖∞ ≤ ‖x‖₂

Norm in the definition of Lipschitz continuity
• without loss of generality we can use the Euclidean norm ‖·‖ = ‖·‖∗ = ‖·‖₂
• the parameter L depends on the choice of norm
• in complexity bounds, the choice of norm can simplify the dependence on dimensions

Quadratic upper bound (page 1.12)

Suppose ∇f is Lipschitz continuous with parameter L.

• This implies (from the generalized Cauchy–Schwarz inequality) that

      (∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖²   for all x, y ∈ dom f    (1)

• If dom f is convex, (1) is equivalent to

      f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖²   for all x, y ∈ dom f    (2)

  [Figure: the graph of f lies below the quadratic upper bound through (x, f(x)).]

Proof (page 1.13; of the equivalence of (1) and (2) if dom f is convex)

• Consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x)).
• g(t) is defined for t ∈ [0, 1] because dom f is convex.
• If (1) holds, then

      g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))ᵀ(y − x) ≤ tL‖x − y‖²;

  integrating from t = 0 to t = 1 gives (2):

      f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖²
           = f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖x − y‖²

• Conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

      f(x) ≤ f(y) + ∇f(y)ᵀ(x − y) + (L/2)‖x − y‖²,

  can be combined to give (∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖².

Consequence of quadratic upper bound (page 1.14)

If dom f = Rⁿ and f has a minimizer x*, then

    (1/(2L))‖∇f(z)‖∗² ≤ f(z) − f(x*) ≤ (L/2)‖z − x*‖²   for all z

• the right-hand inequality follows from the upper bound property (2) at x = x*, y = z
• the left-hand inequality follows by minimizing the quadratic upper bound for x = z:

      inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)ᵀ(y − z) + (L/2)‖y − z‖² )
                 = inf_{‖v‖=1} inf_t ( f(z) + t ∇f(z)ᵀv + Lt²/2 )
                 = inf_{‖v‖=1} ( f(z) − (∇f(z)ᵀv)²/(2L) )
                 = f(z) − (1/(2L))‖∇f(z)‖∗²

Co-coercivity of gradient (page 1.15)

If f is convex with dom f = Rⁿ and ∇f is L-Lipschitz continuous, then

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖∗²   for all x, y

• this property is known as co-coercivity of ∇f (with parameter 1/L)
• co-coercivity in turn implies Lipschitz continuity of ∇f (by Cauchy–Schwarz)
• hence, for differentiable convex f with dom f = Rⁿ,

      Lipschitz continuity of ∇f ⇒ upper bound property (2) (equivalently, (1))
                                 ⇒ co-coercivity of ∇f
                                 ⇒ Lipschitz continuity of ∇f;

  therefore the three properties are equivalent

Proof of co-coercivity (page 1.16): define two convex functions f_x, f_y with domain Rⁿ:

    f_x(z) = f(z) − ∇f(x)ᵀz,    f_y(z) = f(z) − ∇f(y)ᵀz

• the two functions have L-Lipschitz continuous gradients
• z = x minimizes f_x(z); from the left-hand inequality on page 1.14,

      f(y) − f(x) − ∇f(x)ᵀ(y − x) = f_x(y) − f_x(x)
                                  ≥ (1/(2L))‖∇f_x(y)‖∗²
                                  = (1/(2L))‖∇f(y) − ∇f(x)‖∗²

• similarly, z = y minimizes f_y(z); therefore

      f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖∗²

Combining the two inequalities shows co-coercivity.

Lipschitz continuity with respect to the Euclidean norm (page 1.17)

Suppose f is convex with dom f = Rⁿ, and L-smooth for the Euclidean norm:

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂   for all x, y

• the equivalent property (1) states that

      (∇f(x) − ∇f(y))ᵀ(x − y) ≤ L(x − y)ᵀ(x − y)   for all x, y

• this is monotonicity of Lx − ∇f(x), i.e., equivalent to the property that

      (L/2)‖x‖₂² − f(x)   is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x), so

      λ_max(∇²f(x)) ≤ L for all x

  is an equivalent characterization of L-smoothness

Strongly convex function (page 1.18)

f is strongly convex with parameter m > 0 if dom f is convex and

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f, θ ∈ [0, 1].

• this is a stronger version of Jensen's inequality
• it holds if and only if it holds for f restricted to arbitrary lines:

      f(x + t(y − x)) − (m/2)t²‖x − y‖²    (3)

  is a convex function of t, for all x, y ∈ dom f
• without loss of generality, we can take ‖·‖ = ‖·‖₂
• however, the strong convexity parameter m depends on the norm used

Quadratic lower bound (page 1.19)

If f is differentiable and m-strongly convex, then

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖²   for all x, y ∈ dom f    (4)

[Figure: the graph of f lies above the quadratic lower bound through (x, f(x)).]

• follows from the first-order condition for convexity applied to (3)
• this implies that the sublevel sets of f are bounded
• if f is closed (has closed sublevel sets), it has a unique minimizer x* and

      (m/2)‖z − x*‖² ≤ f(z) − f(x*) ≤ (1/(2m))‖∇f(z)‖∗²   for all z ∈ dom f

  (proof as on page 1.14)

Strong monotonicity (page 1.20)

Differentiable f is strongly convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ m‖x − y‖²   for all x, y ∈ dom f;

this is called strong monotonicity (coercivity) of ∇f.

Proof
• one direction follows from (4) and the same inequality with x and y switched
• for the other direction, assume ∇f is strongly monotone and define

      g(t) = f(x + t(y − x)) − (m/2)t²‖x − y‖²;

  then g′(t) is nondecreasing, so g is convex
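Numerical illustration: the quadratic example on page 1.3 is easy to reproduce. The sketch below (Python; the code and the helper name are ours, not part of the original slides) runs the gradient method with exact line search on f(x) = (1/2)(x₁² + γx₂²) and checks the predicted rate ((γ − 1)/(γ + 1))^k. For this quadratic the exact step is available in closed form, t = gᵀg / gᵀAg with A = diag(1, γ).

```python
import math

def grad_method_exact_ls(gamma, x0, iters):
    """Gradient method with exact line search on f(x) = (x1^2 + gamma*x2^2)/2."""
    x1, x2 = x0
    for _ in range(iters):
        g1, g2 = x1, gamma * x2           # gradient of f at the current point
        gg = g1 * g1 + g2 * g2            # g^T g
        gAg = g1 * g1 + gamma * g2 * g2   # g^T A g, with A = diag(1, gamma)
        if gAg == 0.0:
            break                         # already at the minimizer x* = 0
        t = gg / gAg                      # exact line search step size
        x1, x2 = x1 - t * g1, x2 - t * g2
    return x1, x2

gamma, k = 10.0, 5
xk = grad_method_exact_ls(gamma, (gamma, 1.0), k)
ratio = math.hypot(*xk) / math.hypot(gamma, 1.0)   # ||x^(k)||_2 / ||x^(0)||_2
predicted = ((gamma - 1) / (gamma + 1)) ** k
print(ratio, predicted)  # the two agree up to rounding
```

Running this for several values of γ shows the rate approaching 1 as γ grows, which is the "convergence very dependent on scaling" point of page 1.3.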
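The sandwich inequalities on pages 1.14 and 1.19 can likewise be checked numerically. The sketch below (ours, not from the slides) uses a diagonal quadratic f(z) = (1/2) zᵀAz, for which x* = 0, L = λ_max(A) and m = λ_min(A) are just the extreme diagonal entries, and records the smallest slack observed in the four bounds:

```python
import random

# f(z) = (1/2) z^T A z with A = diag(a), so grad f(z) = A z and x* = 0.
a = [0.5, 2.0, 7.5]
L, m = max(a), min(a)   # lambda_max(A) and lambda_min(A) for a diagonal A

random.seed(0)
worst = float("inf")    # smallest slack seen across all four inequalities
for _ in range(1000):
    z = [random.uniform(-3.0, 3.0) for _ in a]
    f = 0.5 * sum(ai * zi * zi for ai, zi in zip(a, z))
    g2 = sum((ai * zi) ** 2 for ai, zi in zip(a, z))   # ||grad f(z)||_2^2
    z2 = sum(zi * zi for zi in z)                      # ||z - x*||_2^2
    # page 1.14:  g2/(2L) <= f(z) - f(x*) <= (L/2) z2   (here f(x*) = 0)
    # page 1.19:  (m/2) z2 <= f(z) - f(x*) <= g2/(2m)
    worst = min(worst, f - g2 / (2 * L), (L / 2) * z2 - f,
                f - (m / 2) * z2, g2 / (2 * m) - f)

print("smallest slack:", worst)  # nonnegative: all four bounds hold
```

Replacing L by anything smaller than λ_max(A), or m by anything larger than λ_min(A), makes the slack go negative, which matches the characterization on page 1.17.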
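Monotonicity (page 1.8) and co-coercivity (page 1.15) can be illustrated in one dimension. The sketch below (again ours, not from the slides) uses the convex function f(x) = log(1 + eˣ), whose derivative is the sigmoid and whose second derivative is bounded by 1/4, so f′ is L-Lipschitz with L = 1/4:

```python
import math
import random

def sigmoid(t):
    # derivative of the convex function f(x) = log(1 + e^x)
    return 1.0 / (1.0 + math.exp(-t))

L = 0.25  # f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4
random.seed(1)
worst = float("inf")
for _ in range(1000):
    x, y = random.uniform(-10.0, 10.0), random.uniform(-10.0, 10.0)
    d = sigmoid(x) - sigmoid(y)
    gap = d * (x - y)                    # (f'(x) - f'(y)) (x - y)
    assert gap >= 0.0                    # monotonicity of f' (page 1.8)
    worst = min(worst, gap - d * d / L)  # co-coercivity slack (page 1.15)

print("smallest co-coercivity slack:", worst)  # nonnegative
```

Note that co-coercivity needs convexity: for f(x) = −log(1 + eˣ) the derivative is still 1/4-Lipschitz, but the monotonicity assertion above fails, consistent with page 1.10's remark that Lipschitz continuity alone does not assume convexity.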