Tutorial: PART 2


Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora & Yoram Singer

Agenda
1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Accelerating gradient descent?
1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex
2. Variance reduction: uses special ERM structure, very effective for smooth & convex
3. Acceleration/momentum: smooth convex only, general-purpose optimization since the '80s

Condition number of convex functions
Defined as $\gamma = \frac{\beta}{\alpha}$, where (simplified) $0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$; $\alpha$ = strong convexity, $\beta$ = smoothness.
Non-convex smooth functions (simplified): $-\beta I \preceq \nabla^2 f(x) \preceq \beta I$.
Why do we care? Well-conditioned functions exhibit much faster optimization! (but equivalent via reductions)

Examples

Smooth gradient descent
The descent lemma for $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta \nabla_t$, with $\eta = \frac{1}{2\beta}$):
$$f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_{t+1} - x_t\|^2 = -\eta\|\nabla_t\|^2 + \beta\eta^2\|\nabla_t\|^2 = -\tfrac{1}{4\beta}\|\nabla_t\|^2$$
Thus, for M-bounded functions ($|f(x_t)| \le M$):
$$-2M \le f(x_T) - f(x_1) = \sum_t \big(f(x_{t+1}) - f(x_t)\big) \le -\tfrac{1}{4\beta}\sum_t \|\nabla_t\|^2$$
Thus, there exists a $t$ for which
$$\|\nabla_t\|^2 \le \frac{8M\beta}{T}.$$

Smooth gradient descent 2
Conclusions: for $x_{t+1} = x_t - \eta\nabla_t$ and $T = \Omega\!\left(\frac{\beta M}{\varepsilon}\right)$, finds $\|\nabla_t\|^2 \le \varepsilon$.
1. Holds even for non-convex functions.
2. For convex functions, implies $f(x_t) - f(x^*) \le O(\varepsilon)$ (faster for smooth!).

Non-convex stochastic gradient descent
The descent lemma for $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta\tilde\nabla_t$, with $\mathbb{E}[\tilde\nabla_t] = \nabla_t$):
$$\mathbb{E}\big[f(x_{t+1}) - f(x_t)\big] \le \mathbb{E}\big[\nabla_t^\top(x_{t+1}-x_t) + \beta\|x_{t+1}-x_t\|^2\big] = \mathbb{E}\big[-\eta\,\tilde\nabla_t^\top\nabla_t + \beta\eta^2\|\tilde\nabla_t\|^2\big] = -\eta\|\nabla_t\|^2 + \beta\eta^2\big(\|\nabla_t\|^2 + \mathrm{var}(\tilde\nabla_t)\big)$$
Thus, for M-bounded functions ($|f(x_t)| \le M$):
$$T = O\!\left(\frac{M\beta}{\varepsilon^2}\right) \;\Rightarrow\; \exists\, t \le T \,.\; \|\nabla_t\|^2 \le \varepsilon.$$

Controlling the variance: interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:
$$x_{t+1} = x_t - \eta\big(\tilde\nabla f(x_t) - \tilde\nabla f(x_0) + \nabla f(x_0)\big)$$
Every so often, compute a full gradient and restart at the new $x_0$.
Theorem [Schmidt, Le Roux, Bach '12; Johnson and Zhang '13; Mahdavi, Zhang, Jin '13]: variance reduction for well-conditioned functions, $0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$, $\gamma = \frac{\beta}{\alpha}$ ($\gamma$ should be interpreted as $\frac{1}{\varepsilon}$), produces an $\varepsilon$-approximate solution in time
$$O\!\left((m + \gamma)\, d \log\frac{1}{\varepsilon}\right).$$
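The variance-reduced update above is easy to state in code. Below is a minimal NumPy sketch, not part of the original slides, of an SVRG-style loop for a least-squares ERM objective; the objective, step size, epoch length, and all function names are illustrative assumptions rather than the tutorial's prescription.

```python
import numpy as np

def svrg(A, b, eta=0.1, epochs=20, m_inner=None):
    """SVRG-style variance-reduced SGD for least squares: f(x) = 1/(2n) ||Ax - b||^2.

    Illustrative sketch: each inner step combines a stochastic gradient at x_t,
    a stochastic gradient at the snapshot x0, and the full gradient at x0, as in
    x_{t+1} = x_t - eta * (g_i(x_t) - g_i(x0) + full_grad(x0)).
    """
    n, d = A.shape
    m_inner = m_inner or n                 # inner-loop length (assumption)
    x = np.zeros(d)
    full_grad = lambda z: A.T @ (A @ z - b) / n
    for _ in range(epochs):
        x0 = x.copy()                      # snapshot point
        g0 = full_grad(x0)                 # full gradient, computed once per epoch
        for _ in range(m_inner):
            i = np.random.randint(n)
            gi = lambda z: A[i] * (A[i] @ z - b[i])   # single-example gradient
            x = x - eta * (gi(x) - gi(x0) + g0)       # variance-reduced step
    return x
```

Each inner step costs the same as an SGD step, while subtracting the snapshot gradient and adding back the full gradient keeps the update unbiased with much smaller variance near the snapshot; this is what the $(m + \gamma)\, d \log\frac{1}{\varepsilon}$ bound above exploits.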
Acceleration/momentum [Nesterov '83]
• Optimal gradient complexity (smooth, convex)
• Modest practical improvements; non-convex "momentum" methods
• With variance reduction, fastest possible running time of first-order methods:
$$O\!\left(\big(m + \sqrt{\gamma m}\big)\, d \log\frac{1}{\varepsilon}\right)$$
[Woodworth, Srebro '15] – tight lower bound w. gradient oracle

Experiments w. convex losses

Improve upon GradientDescent++?
Next few slides: move from first order (GD++) to second order.

Higher Order Optimization
• Gradient descent – direction of steepest descent
• Second order methods – use local curvature

Newton's method (+ trust region)
$$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$$
For a non-convex function the step can move toward $\infty$. Solution: solve a quadratic approximation in a local area (trust region).

Newton's method (+ trust region)
$$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$$
$d^3$ time per iteration! Infeasible for ML!! Till recently…

Speed up the Newton direction computation??
• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms
• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank $\times\, d^2$

Stochastic Newton?
$$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$$
• ERM, rank-1 loss: $\arg\min_x \mathbb{E}_{i \sim [m]}\big[\ell(x^\top a_i, b_i) + \tfrac{1}{2}\|x\|^2\big]$
• Unbiased estimator of the Hessian: $\tilde\nabla^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + I$, with $i \sim U[1,\dots,m]$
• Clearly $\mathbb{E}[\tilde\nabla^2] = \nabla^2 f$, but $\mathbb{E}[\tilde\nabla^{-2}] \ne \nabla^{-2} f$

Circumvent Hessian creation and inversion! 3 steps:
• (1) Represent the Hessian inverse as an infinite series. For any distribution on the naturals $i \sim N$:
$$\nabla^{-2} = \sum_{i=0}^{\infty} (I - \nabla^2)^i$$
• (2) Sample from the infinite series (Hessian-gradient product), ONCE:
$$\nabla^{-2} f\, \nabla f = \sum_{i} (I - \nabla^2 f)^i\, \nabla f = \mathbb{E}_{i \sim N}\!\left[(I - \nabla^2 f)^i\, \nabla f \cdot \frac{1}{\Pr[i]}\right]$$
• (3) Estimate the Hessian-power by sampling i.i.d. data examples:
$$= \mathbb{E}_{i \sim N,\, k \sim [i]}\!\left[\prod_{k=1}^{i} \big(I - \nabla^2_k f\big)\, \nabla f \cdot \frac{1}{\Pr[i]}\right]$$
Single example; vector-vector products only.

Linear-time Second-order Stochastic Algorithm (LiSSA)
• Use the estimator $\widetilde{\nabla^{-2} f}$ defined previously
• Compute a full (large batch) gradient $\nabla f$
• Move in the direction $\widetilde{\nabla^{-2} f}\, \nabla f$
• Theoretical running time to produce an $\varepsilon$-approximate solution for $\gamma$ well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:
$$O\!\left(dm \log\frac{1}{\varepsilon} + \gamma\, d \log\frac{1}{\varepsilon}\right)$$
1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]

What about constraints??
Next few slides – projection-free (Frank-Wolfe) methods.

Recommendation systems
(figures: a movies × users matrix with the rating of user i for movie j; complete the missing entries; get new data; re-complete the missing entries)
Assume low rank of the "true matrix"; convex relaxation: bounded trace norm.

Bounded trace norm matrices
• Trace norm of a matrix = sum of singular values
• K = { X | X is a matrix with trace norm at most D }
• Computational bottleneck: projections onto K require an eigendecomposition, $O(n^3)$ operations
• But: linear optimization over K is easier – computing the top eigenvector, O(sparsity) time

Projections → linear optimization
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition → largest singular vector computation
2. Online routing (K = flow polytope): conic optimization over the flow polytope → shortest path computation
3. Rotations (K = rotation matrices): conic optimization over the rotations set → Wahba's algorithm, linear time
4. Matroids (K = matroid polytope): convex optimization via the ellipsoid method → greedy algorithm, nearly linear time!

Conditional Gradient algorithm [Frank, Wolfe '56]
Convex optimization problem: $\min_{x \in K} f(x)$
• f is smooth, convex
• linear optimization over K is easy
$$v_t = \arg\min_{x \in K} \nabla f(x_t)^\top x, \qquad x_{t+1} = x_t + \eta_t (v_t - x_t)$$
1. At iteration t: a convex combination of at most t vertices (sparsity).
2. No learning rate: $\eta_t \approx \frac{2}{t}$ (independent of diameter, gradients, etc.). A minimal sketch of this update appears below.
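As referenced above, here is a minimal NumPy sketch, not from the original slides, of the conditional gradient update over the bounded-trace-norm set K = { X : trace norm of X ≤ D }. The use of a full SVD for the linear-optimization oracle (a fast top-singular-pair routine would suffice), the classical step size $\eta_t = \frac{2}{t+2}$, and the helper names are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_trace_norm(grad, X0, D, T=100):
    """Conditional gradient (Frank-Wolfe) over K = {X : trace_norm(X) <= D}.

    grad(X) returns the gradient of the smooth objective f at X.
    The linear-optimization oracle arg min_{V in K} <grad, V> is -D * u v^T,
    where (u, v) is the top singular pair of the gradient -- no projection needed.
    Illustrative sketch with the classical step size eta_t = 2/(t+2).
    """
    X = X0.copy()
    for t in range(T):
        G = grad(X)
        # top singular pair of G (a power iteration would also do)
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        V = -D * np.outer(U[:, 0], Vt[0, :])   # vertex of K minimizing <G, V>
        eta = 2.0 / (t + 2)                    # "no learning rate" schedule
        X = X + eta * (V - X)                  # convex combination stays in K
    return X

# example use for matrix completion (assumption): f(X) = 0.5 * sum over observed (i, j)
# of (X_ij - M_ij)^2, so grad(X) is (X - M) on observed entries and 0 elsewhere.
```

The only way the algorithm touches K is through the arg-min over linear functions, realized here by a single singular pair of the gradient; no projection onto the trace-norm ball is ever computed, which is the projection-free property the preceding slides emphasize.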
FW theorem
$$x_{t+1} = x_t + \eta_t (v_t - x_t), \qquad v_t = \arg\min_{x \in K} \nabla_t^\top x$$
Theorem: $f(x_t) - f(x^*) = O\!\left(\frac{1}{t}\right)$.
Proof, main observation:
$$f(x_{t+1}) - f(x^*) = f\big(x_t + \eta_t(v_t - x_t)\big) - f(x^*)$$
$$\le f(x_t) - f(x^*) + \eta_t (v_t - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2 \quad (\beta\text{-smoothness of } f)$$
$$\le f(x_t) - f(x^*) + \eta_t (x^* - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2 \quad (\text{optimality of } v_t)$$
$$\le f(x_t) - f(x^*) + \eta_t \big(f(x^*) - f(x_t)\big) + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2 \quad (\text{convexity of } f)$$
$$\le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\eta_t^2 \beta}{2} D^2.$$
Thus, with $h_t = f(x_t) - f(x^*)$:
$$h_{t+1} \le (1 - \eta_t) h_t + O(\eta_t^2) \;\Rightarrow\; h_t = O\!\left(\frac{1}{t}\right).$$

Online Conditional Gradient
• Set $x_1 \in K$ arbitrarily.
• For $t = 1, 2, \dots$:
  1. Use $x_t$, obtain $f_t$.
  2. Compute $x_{t+1}$ as follows:
$$v_t = \arg\min_{x \in K} \Big(\sum_{i=1}^{t} \nabla f_i(x_i) + \beta_t x_t\Big)^{\!\top} x, \qquad x_{t+1} \leftarrow (1 - t^{-\alpha})\, x_t + t^{-\alpha}\, v_t$$
Theorem [Hazan, Kale '12]: $\mathrm{Regret} = O(T^{3/4})$.
Theorem [Garber, Hazan '13]: for polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps: $e^{-O(t)}$
2. Online: $\mathrm{Regret} = O(\sqrt{T})$

Next few slides: survey of the state of the art.

Non-convex optimization in ML
(figure: a machine mapping an input to a distribution over labels)
$$\arg\min_{x \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
But generally speaking... we're screwed:
• local (non-global) minima of $f_0$
• all kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
(figure from Duchi, "Convex Optimization for Machine Learning")

Solution concepts for non-convex optimization?
• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer '16]
• Optimization approaches:
  • Finding vanishing gradients / local minima efficiently
  • Graduated optimization / homotopy method
  • Quasi-convexity
  • Structure of local optima (probabilistic assumptions that allow alternating minimization, …)

Gradient/Hessian based methods
Goal: find a point x such that
1. $\|\nabla f(x)\| \le \varepsilon$ (approximate first-order optimality)
2. $\nabla^2 f(x) \succeq -\varepsilon I$ (approximate second-order optimality)
1. (we've proved) The GD algorithm $x_{t+1} = x_t - \eta\nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^2}\right)$ (expensive) iterations.
2. (we've proved) The SGD algorithm $x_{t+1} = x_t - \eta\tilde\nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations.
3. The SGD algorithm with noise finds a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations [Ge, Huang, Jin, Yuan '15].
4. Recent second-order methods find a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^{7/4}}\right)$ (expensive) iterations [Carmon, Duchi, Hinder, Sidford '16], [Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16].

Recap
1. Online learning and stochastic optimization
   • Regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm
Thank you!