
Noname manuscript No. (will be inserted by the editor)

The Adaptive Sampling Gradient Method: Optimizing Smooth Functions with an Inexact Oracle

Fatemeh S. Hashemi · Raghu Pasupathy · Michael R. Taaffe

Received: date / Accepted: date

Abstract Consider settings such as stochastic optimization where a smooth objective function f is unknown but can be estimated with an inexact oracle such as quasi-Monte Carlo (QMC) or numerical quadrature. The inexact oracle is assumed to yield function estimates having error that decays with increasing oracle effort. For solving such problems, we present the Adaptive Sampling Gradient Method (ASGM) in two flavors, depending on whether the step size used within ASGM is constant or determined through a backtracking line search. ASGM's salient feature is the adaptive manner in which it constructs gradient estimates (henceforth called gradient approximates): it exerts just enough oracle effort at each iterate to keep the error in the gradient approximate within a constant factor of the norm of the gradient approximate. ASGM applies to both derivative-based and derivative-free contexts, and generates iterates that converge globally to a first-order critical point. We also prove two sets of results on ASGM's work complexity with respect to the gradient norm: (i) when f is non-convex, ASGM's work complexity is arbitrarily close to $\mathcal{O}(\epsilon^{-2-\frac{1}{\mu(\alpha)}})$, where $\mu(\alpha)$ is the error decay rate of the gradient approximate expressed in terms of the error decay rate $\alpha$ of the objective function approximate; (ii) when f is strongly convex, ASGM's work complexity is arbitrarily close to $\mathcal{O}(\epsilon^{-\frac{1}{\mu(\alpha)}})$. We compare these complexities to those obtained from methods that use traditional random sampling. We also illustrate the calculation of $\alpha$ and $\mu(\alpha)$ for common choices, e.g., QMC with finite-difference derivatives.

Keywords Adaptive Sampling · Stochastic Gradient · Stochastic Optimization

F. Hashemi
The Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA. E-mail: [email protected]

R. Pasupathy
Department of Statistics, Purdue University, West Lafayette, IN 47907, USA. E-mail: [email protected]

M. R. Taaffe
The Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA. E-mail: [email protected]

1 INTRODUCTION

Consider unconstrained optimization problems having the form

minimize $f(x)$, $x \in \mathbb{R}^d$,    (P)

where the function $f : \mathbb{R}^d \to \mathbb{R}$ is bounded from below, that is, $\inf_{x \in \mathbb{R}^d} f(x) > -\infty$, and f belongs to the class $C^{1,1}_L$ of differentiable functions having a Lipschitz continuous first derivative. An important premise is that we do not have access to exact oracle values $f(x)$, $x \in \mathbb{R}^d$; instead, we have access to an inexact oracle using which an approximate value $f(n; x)$ of $f(x)$ at any chosen point $x \in \mathbb{R}^d$ can be obtained after expending a chosen amount $n$ of oracle effort.

The inexact oracle is such that the resulting approximator $f(n; \cdot)$ is consistent, that is, $|f(n; x) - f(x)| \to 0$ as $n \to \infty$. Furthermore, we assume that the error $|f(n; x) - f(x)|$ in the approximator $f(n; \cdot)$ satisfies, for $n \geq n_f$,

$n^{\alpha} \, |f(n; x) - f(x)| < \sigma_{0,f} + \sigma_{1,f} \, \Gamma_f(x)$,    (FE1)

where

1. $\sigma_{0,f}$ and $\sigma_{1,f}$ are real-valued unknown constants that do not depend on x;
2. $\Gamma_f(\cdot) := 1 + \Gamma_f^0(\cdot)$ is a known positive-valued continuous function;
3. $\alpha > 0$ is a known constant called the error decay rate of the inexact oracle $f(n; \cdot)$.

(Throughout the rest of the paper, we use the function $\Gamma_f(\cdot) := 1 + \Gamma_f^0(\cdot)$ instead of $\Gamma_f^0(\cdot)$ as a matter of convention and convenience.)

As we illustrate in Section 2 using two motivating examples, numerous stochastic optimization settings are naturally subsumed by the above problem setting.
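To make the error stipulation (FE1) concrete, the following sketch (ours, not from the paper; the integrand, the constants, and the choice of the composite midpoint rule as the inexact oracle are illustrative assumptions) builds an approximator $f(n; x)$ for $f(x) = \int_0^1 e^{-xu}\,du$ and checks numerically that $n^{\alpha}$ times the error stays bounded, with $\alpha = 2$:

```python
import math

def f_exact(x):
    # f(x) = \int_0^1 exp(-x u) du = (1 - e^{-x}) / x for x != 0
    return (1.0 - math.exp(-x)) / x

def f_oracle(n, x):
    # Inexact oracle: composite midpoint rule with n subintervals.
    # Its error is bounded by (x^2 / 24) * n^{-2}, i.e., alpha = 2 in (FE1).
    return sum(math.exp(-x * (i + 0.5) / n) for i in range(n)) / n

x = 1.0
for n in (10, 100, 1000):
    err = abs(f_oracle(n, x) - f_exact(x))
    # n^alpha * err stays below a constant, as (FE1) requires
    print(n, err, (n ** 2) * err)
```

Here more oracle effort n buys a deterministically smaller error, which is exactly the regime (FE1) describes.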
Specifically, while the objective function f in such settings is not known in closed form, f may be suitable for approximation using quasi-Monte Carlo (QMC) [27], numerical quadrature [36, 12], or other low-discrepancy point-set methods [13, Chapter 5], allowing the construction of an approximator $f(n; \cdot)$, a scaling function $\Gamma_f(\cdot)$, and a known error decay rate $\alpha$ that together satisfy (FE1).

The problem statement that we have just outlined resembles the now popular simulation optimization (SO) problem [31, 14, 42] and many modern machine learning settings [8] where the objective function f is an expectation. The difference between these settings and the context we consider here pertains primarily to the assumptions on the approximator $f(n; \cdot)$. In SO and machine learning settings, the objective function value $f(x)$ is assumed to be estimated using Monte Carlo or through random draws from a large database, implying that the resulting estimator $f(n; x)$ of $f(x)$ is random, and hence any error guarantees analogous to (FE1) can only be made in a "distributional" or an "expected value" sense. In the current setting, the approximate $f(n; x)$ is constructed using QMC or numerical integration, allowing for a deterministic error bound (FE1) and, consequently, sharper complexity results. We say more about this issue in Section 1.2 when we discuss scope and related literature.

(As an aside, there is much ongoing debate [29, 43, 22, 27] on the appropriateness of the use of QMC for high-dimensional integration. Until recently, the general consensus had been that QMC is more efficient than Monte Carlo for constructing estimators of integrals in dimensions less than about 12. However, there is little doubt that QMC is gaining in popularity and now seems to be used routinely in much higher dimensions [4, 40].)
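As a toy illustration of how such a QMC approximator might be assembled (our sketch; the integrand and the base-2 van der Corput sequence are illustrative choices, not prescribed by the paper), the deterministic low-discrepancy points replace the random draws a Monte Carlo estimator would use, so the resulting error is a deterministic quantity rather than a random one:

```python
import math

def van_der_corput(i, base=2):
    # i-th point of the base-b van der Corput low-discrepancy sequence
    q, bk = 0.0, 1.0 / base
    while i > 0:
        q += (i % base) * bk
        i //= base
        bk /= base
    return q

def f_qmc(n, x):
    # QMC approximator of f(x) = E[exp(-x U)], U ~ Uniform(0, 1),
    # averaging over the first n van der Corput points
    return sum(math.exp(-x * van_der_corput(i)) for i in range(1, n + 1)) / n

x = 1.0
exact = (1.0 - math.exp(-x)) / x
print(abs(f_qmc(1024, x) - exact))  # deterministic error, no "expected value" caveat
```

By the Koksma inequality, this error is bounded by the integrand's variation times the star discrepancy of the point set, which for the van der Corput sequence decays like $n^{-1} \log n$.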
We emphasize that neither the error decay rate $\alpha$ nor the scaling function $\Gamma_f(\cdot)$ appearing in (FE1) is unique; that is, numerous choices of $\Gamma_f(\cdot)$ and $\alpha$ may satisfy (FE1). However, such non-uniqueness will not concern us, and we make no assumptions about any optimality properties of $\Gamma_f(\cdot)$ and $\alpha$. Instead, we only assume that, in addition to the inexact oracle approximator $f(n; \cdot)$, we have at our disposal one possible scaling function $\Gamma_f(\cdot)$ and one possible error decay rate $\alpha$ that together satisfy the stipulation (FE1). All our results will accordingly be expressed in terms of $\Gamma_f(\cdot)$ and $\alpha$.

1.1 Terminology, Notation, and Convention

We emphasize that, while our primary motivating context is stochastic optimization, the approximator $f(n; \cdot)$ of the objective function $f(\cdot)$ is indeed deterministic. Since this issue tends to cause confusion, especially among readers steeped in stochastic and simulation optimization, we have refrained from using upper-case notation for function and gradient approximators throughout the paper. For the same reason, we have also limited our use of words such as "estimator" and "estimate," instead preferring the non-standard terms "approximator" and "approximate." An exception, of course, is the name of the proposed procedure, the "Adaptive Sampling Gradient Method," where we have used the word "sampling" with no connotation to statistical sampling.

The algorithms we present are derivative-free, by which we mean that we do not assume that the inexact oracle provides direct observations of the gradient of f. Instead, when seeking an approximate for the gradient $\nabla f(x)$ at the point x, we can resort to "finite differencing" of values from the inexact oracle.

We use bold font for vectors, script font for sets, lower-case font for real numbers, and upper-case font for random variables. Hence $\{X_k\}$ denotes a sequence of random vectors in $\mathbb{R}^d$ and $x = (x_1, x_2, \ldots, x_d)$ denotes a d-dimensional vector of real numbers.
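Since the oracle returns only (inexact) function values, a gradient approximate can be formed by central differences of oracle outputs. The sketch below is ours: the test function, the deterministic error model $\sigma \cos(10x)\, n^{-\alpha}$, and the step-size rule $h = n^{-\alpha/3}$ are illustrative assumptions, not the paper's prescriptions. It shows the standard trade-off: an $O(h^2)$ truncation error against an $O(n^{-\alpha}/h)$ amplification of the oracle error, balanced by $h \propto n^{-\alpha/3}$, which yields a gradient-approximate error decaying like $n^{-2\alpha/3}$ for this particular choice:

```python
import math

def f(x):
    return x ** 3            # true objective; grad f(x) = 3 x^2

def f_oracle(n, x, alpha=2.0, sigma=1.0):
    # Stand-in inexact oracle: exact value plus a deterministic
    # perturbation bounded by sigma * n^{-alpha}, consistent with (FE1)
    return f(x) + sigma * math.cos(10.0 * x) * n ** (-alpha)

def grad_approx(n, x, alpha=2.0):
    # Central difference of oracle values; h = n^{-alpha/3} balances the
    # O(h^2) truncation error against the O(n^{-alpha} / h) oracle error
    h = n ** (-alpha / 3.0)
    return (f_oracle(n, x + h) - f_oracle(n, x - h)) / (2.0 * h)

x = 1.0
for n in (10, 100, 1000, 10000):
    print(n, abs(grad_approx(n, x) - 3.0 * x ** 2))
```

More oracle effort per difference thus buys a provably more accurate gradient approximate, which is the quantity ASGM controls adaptively.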
We use $e_i \in \mathbb{R}^d$ to denote a unit vector whose ith component is 1 and whose every other component is 0, that is, $e_{ii} = 1$ and $e_{ij} = 0$ for $j \neq i$. The set $\mathcal{B}(x; r) = \{y \in \mathbb{R}^d : \|y - x\| \leq r\}$ is the closed ball of radius $r > 0$ with center x. For a point $y := (y_1, y_2, \ldots, y_q)$, the norm $\|y\|$ denotes the q-dimensional Euclidean norm $\|y\| = (\sum_{i=1}^{q} y_i^2)^{\frac{1}{2}}$. We denote $(a)_+ = \max(a, 0)$ to refer to the maximum of a number $a \in \mathbb{R}$ and zero.

A sequence $\{y_n\}$, $y_n \in \mathbb{R}^q$, is said to consistently approximate $y \in \mathbb{R}^q$ if $\|y_n - y\| \to 0$ as $n \to \infty$. For a sequence of real numbers $\{a_k\}$, we say $a_k = o(1)$ if $\lim_{k \to \infty} a_k = 0$; we say $a_k = o(b_k)$ if $a_k b_k^{-1} = o(1)$. We say $a_k = \mathcal{O}(1)$ if $\{a_k\}$ is bounded, that is, there exists a constant $M > 0$ such that $|a_k| < M$ for large enough k. For sequences of positive real numbers $\{a_k\}, \{b_k\}$, we say that $a_k \sim b_k$ if $\lim_{k \to \infty} a_k / b_k = 1$.

We say $f \in C^{1,1}_L$ if $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable and has derivative $\nabla f(\cdot)$ such that $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y \in \mathbb{R}^d$. We say $f \in \mathcal{F}^{1,1}_L$ if f is convex and $f \in C^{1,1}_L$. We say $f \in \mathcal{S}^{1,1}_{\lambda, L}$ if f is twice differentiable, $f \in \mathcal{F}^{1,1}_L$, and $\|\nabla f(x) - \nabla f(y)\| \geq \lambda \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

1.2 Related Literature

As was remarked earlier, the problem we consider is closely related to what has been called the SO problem [31, 14, 42] and, more recently, to optimization settings in machine learning [8, 9].
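As a worked example of these function classes (ours, for illustration): for the quadratic $f(x) = \frac{1}{2} x^T A x$ with symmetric positive definite A, we have $\nabla f(x) = Ax$, so $\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\|$ lies between $\lambda_{\min}(A)\|x - y\|$ and $\lambda_{\max}(A)\|x - y\|$; hence $f \in \mathcal{S}^{1,1}_{\lambda_{\min}, \lambda_{\max}}$. A quick numeric check of both bounds:

```python
import math
import random

# f(x) = 0.5 x^T A x with A = [[3, 1], [1, 2]]; grad f(x) = A x.
# Eigenvalues of A are (5 +/- sqrt(5)) / 2, so f lies in S^{1,1}_{lam, L}
# with lam = (5 - sqrt(5)) / 2 and L = (5 + sqrt(5)) / 2.
A = [[3.0, 1.0], [1.0, 2.0]]
lam = (5.0 - math.sqrt(5.0)) / 2.0
L = (5.0 + math.sqrt(5.0)) / 2.0

def grad(x):
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def norm(v):
    return math.sqrt(sum(vi ** 2 for vi in v))

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    y = [random.uniform(-5, 5), random.uniform(-5, 5)]
    dg = norm([gx - gy for gx, gy in zip(grad(x), grad(y))])
    dxy = norm([xi - yi for xi, yi in zip(x, y)])
    if dxy > 0:
        # both the Lipschitz (upper) and strong-convexity (lower) bounds hold
        assert lam * dxy - 1e-9 <= dg <= L * dxy + 1e-9
print("both bounds hold on 1000 random pairs")
```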