AN ACCELERATED METHOD FOR DERIVATIVE-FREE SMOOTH STOCHASTIC CONVEX OPTIMIZATION∗

EDUARD GORBUNOV† , PAVEL DVURECHENSKY‡ , AND ALEXANDER GASNIKOV§

Abstract. We consider an unconstrained problem of minimizing a smooth convex function which is only available through noisy observations of its values, the noise consisting of two parts. Similar to stochastic optimization problems, the first part is of stochastic nature. The second part is additive noise of unknown nature, but bounded in absolute value. In the two-point feedback setting, i.e. when pairs of function values are available, we propose an accelerated derivative-free algorithm together with its complexity√ analysis. The complexity bound of our derivative-free algorithm is only by a factor of n larger than the bound for accelerated gradient-based algorithms, where n is the dimension of the decision variable. We also propose a non-accelerated derivative-free algorithm with a complexity bound similar to the stochastic-gradient-based algorithm, that is, our bound does not have any dimension-dependent factor except logarithmic. Notably, if the difference between the starting point and the solution is a sparse vector, for both our algorithms, we obtaina better complexity bound if the algorithm uses an 1-norm proximal setup, rather than the Euclidean proximal setup, which is a standard choice for unconstrained problems

Key words. Derivative-Free Optimization, Zeroth-Order Optimization, Stochastic Convex Op- timization, Smoothness, Acceleration

AMS subject classifications. 90C15, 90C25, 90C56

1. Introduction. Derivative-free or zeroth-order optimization[58, 34, 16, 63, 24] is one of the oldest areas in optimization, which constantly attracts attention from the learning community, mostly in connection to online learning in the bandit setup [17] and [60, 23, 35, 22]. We study stochastic derivative-free optimization problems in a two-point feedback situation, considered by [1, 30, 62] in the learning community and by [55, 64, 41, 42, 40] in the optimization community. Two-point setup allows one to prove complexity bounds, which typically coincide with the complexity bounds for gradient-based algorithms up to a small-degree polynomial of n, where n is the dimension of the decision variable. On the contrary, problems with one-point feedback are harder and complexity bounds for such problems either have worse dependence on n, or worse dependence on the desired accuracy of the solution, see [52, 57, 36,2, 45, 61, 49,5, 18] and the references therein. More precisely, we consider the following optimization problem  Z  (1.1) min f(x) := ξ[F (x, ξ)] = F (x, ξ)dP (x) , n E x∈R X where ξ is a random vector with probability distribution P (ξ), ξ ∈ X , and the function f(x) is closed and convex. Note that F (x, ξ) can be non-convex in x with positive arXiv:1802.09022v3 [math.OC] 20 Sep 2020 probability. Moreover, we assume that, almost sure w.r.t. distribution P , the function

∗Submitted to the editors 30 April, 2019. Funding: The work of Eduard Gorbunov in Section 2.3. was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) No. 075-00337-20-03, project No. 0714-2020-0005. †Moscow Institute of Physics and Technology; National Research University Higher School of Economics ([email protected], https://eduardgorbunov.github.io/). ‡Weierstrass Institute for Applied Analysis and Stochastics; Institute for Information Transmission Problems RAS; National Research University Higher School of Economics ([email protected]). §Moscow Institute of Physics and Technology; Institute for Information Transmission Problems RAS; National Research University Higher School of Economics ([email protected]) 1 2 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

F (x, ξ) has gradient g(x, ξ), which is L(ξ)-Lipschitz continuous with respect to the p 2 Euclidean norm. We assume that we know a constant L2 > 0 such that EξL(ξ) ≤ L2 < +∞. Under these assumptions, Eξg(x, ξ) = ∇f(x) and f is L2-smooth, i.e. has L2-Lipschitz continuous gradient with respect to the Euclidean norm. Also we assume that, for all x,

2 2 (1.2) Eξ[kg(x, ξ) − ∇f(x)k2] 6 σ ,

where k · k2 is the Euclidean norm. We emphasize that, unlike [30], we do not as-  2 sume that Eξ kg(x, ξ)k2 is bounded since it is not the case for many unconstrained optimization problems, e.g. for deterministic quadratic optimization problems. Finally, we assume that we are in the two-point feedback setup, which is also con- nected to the common random numbers assumption, see [48] and references therein. Specifically, an optimization procedure, given a pair of points (x, y) ∈ R2n, can obtain a pair of noisy stochastic realizations (fe(x, ξ), fe(y, ξ)) of the objective value f, which we refer to as oracle call. Here

n (1.3) fe(x, ξ) = F (x, ξ) + η(x, ξ), |η(x, ξ)| 6 ∆, ∀x ∈ R , a.s. in ξ, and there is a possibility to obtain an iid sample ξ from P . This makes our problem more complicated than problems studied in the literature. Not only do we have stochastic noise in problem (1.1), but also an additional noise η(x, ξ), which can be adversarial. Our model of the two-point feedback oracle is pretty general and covers deter- ministic exact oracle or even specific types of one-point feedback oracle. For example, if the function F (x, ξ) is separable, i.e. F (x, ξ) = f(x) + h(ξ), where Eξ [h(ξ)] = 0, ∆ |h(ξ)| ≤ 2 for all ξ and the oracle gives us F (x, ξ) at a given point x, then for all ξ1, ξ2 we can define fe(x, ξ1) = F (x, ξ1) and fe(y, ξ2) = F (y, ξ2) = F (y, ξ1) + h(ξ2) − h(ξ1). Since |h(ξ2) − h(ξ1)| ≤ |h(ξ2)| + |h(ξ1)| ≤ ∆ we can use representation (1.3) omitting dependence of η(x, ξ1) on ξ2. Moreover, such an oracle can be encountered in practice as rounding errors can be modeled as a process of adding a random bit modulo 2 to the last or several last bits in machine number representation format (see [37] for details). As it is known [47, 26, 31, 38], ifa g(x, ξ) for the gradient of f is available, an accelerated gradient method has oracle complexity bound (i.e. the  np 2 2 2 2o total number of stochastic first-order oracle calls) O max L2R2/ε, σ R2/ε , where ε is the target optimization error in terms of the objective residual, the goal being to find suchx ˆ that Ef(ˆx) − f ∗ ≤ ε. Here f ∗ is the global optimal value of f, ∗ ∗ R2 is such that kx0 − x k2 ≤ R2 with x being some solution. The question, to which we give a positive answer in this paper, is as follows. Is it possible to solve a stochastic optimization problem with the same ε-dependence in the iteration and sample complexity and only noisy observations of the objective value? Many existing first- and zero-order methods are based on so-called proximal setup (see [9] and Subsection 2.1 for the precise definition). This includes a choice of some norm in Rn and a corresponding prox-function, which is strongly convex with respect to this norm. Standard gradient method for unconstrained problems such as (1.1) is obtained when one chooses the Euclidean k · k2-norm as the norm and squared Eu- clidean norm as the prox-function. We go beyond this conventional path and consider n k·k1-norm in R and corresponding prox-function given in [9]. Yet this proximal setup AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 3

is described in the textbook, we are not aware of any particular examples where it is used for unconstrained optimization problems. Notably, as we show in our analysis, this choice can lead to better complexity bounds. In what follows, we character- ize these two cases by the choice of k · kp-norm with p ∈ {1, 2} and its conjugate 1 1 q ∈ {2, ∞}, given by the identity p + q = 1. 1.1. Related Work. Online optimization with two-point bandit feedback was considered in [1], where regret bounds were obtained. Non-smooth determinis- tic and stochastic problems in the two-point derivative-free offline optimization set- ting were considered in[55]. 1 Non-smooth stochastic problems were considered in [62] and independently in[7], the latter paper considering also problems with ad- ditional noise of an unknown nature in the objective value. The authors of[30] consider smooth stochastic optimization problems, yet under additional quite restric-  2 tive assumption Eξ kg(x, ξ)kq < + ∞. Their bound was improved in[40, 39] for the problems with non-Euclidean proximal setup and noise in the objective value. Strongly convex problems with different smoothness assumptions were considered in [39,7]. Smooth stochastic convex optimization problems, without the assumption 2 that Ekg(x, ξ)k2 < +∞, were studied in[42, 41] for the Euclidean case. Accelerated and non-accelerated derivative-free method for smooth but deterministic problems were proposed in [55] and extended in [14, 33] for the case of additional bounded noise in the function value. Table1 presents a detailed comparison of our results and most close results in the literature on two-point feedback derivative-free optimization and assumptions, under which they are obtained. The first row corresponds to the non-smooth setting with the  2 2 assumption that Eξ kg(x, ξ)k2 ≤ M2 , which mostly restricts the scope to constrained optimization problems on a convex set with the diameter Rp measured by k·kp-norm. This setting is very well understood with the proposed methods being able to solve stochastic optimization problems with additional bounded noise in the objective value and to use non-Euclidean proximal setup. Importantly, non-Euclidean proximal setup corresponding to p = 1, q = ∞ allows one to obtain a complexity bound with only logarithmic dependence on the dimension n. Rows 2-6 of Table1 correspond to smooth problems with L2-Lipschitz continu- ous gradient, which makes possible to apply Nesterov’s acceleration and obtain better complexity bounds. In this case stochastic optimization problems are characterized by the variance σ2 of the stochastic gradient, see (1.2). For the smooth setting the full picture is not completely understood in the literature, and our goal is to obtain meth- ods, which provide the full picture similarly to the non-smooth setting by combining stochastic optimization setup, additional bounded noise in the objective value, accel- eration, and better complexity bounds achievable owing to the use of non-Euclidean proximal setup corresponding to p = 1, q = ∞. Previous works for the smooth case consider only Euclidean case and either deterministic problems with additional bounded noise [55, 14, 33] or stochastic problems without additional bounded noise [42, 41]. 
We also mention the works[52, 57, 56, 28, 36, 59, 25, 39,2, 49,8, 18, 61, 45, 44, 5, 45,6, 50,3] where derivative-free optimization with one-point feedback is studied in different settings, and works [54,4] on coupling non-accelerated methods to obtain acceleration, which inspired our work. After our paper appeared as a preprint, the papers [10, 15] studied derivative-free quasi-Newton methods for problems with noisy

1We list the references in the order of the date of the first appearance, but not in the order of the date of official publication. 4 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

Method Assumptions Oracle complexity, Oe (·) p = 1 σ2 ∆ 2 MD q 2 2 n M2 Rp √ √ √ [30, 40, 39, 37] bound. gr. ε2 [62,7] n 2 2 2 o √ RSGF nL2R2 nσ R2 bound. var. max , 2 × × [42, 41] ε ε RS 2 √ nL2R2 × × [55, 14] ε  2 2  q 2 q 2 2 RDFDS  n L2Rp n σ Rp  √ √ √ bound. var. max ε , ε2 [This paper]   AccRS q 2 √ n L2R2 × × [55, 33] ε  2  1 1  + q L R2 n q σ2R2  √ √ √ ARDFDS 2 q 2 p p bound. var. max n ε , ε2 [This paper]   Table 1 Comparison of oracle complexity (total number of zero-order oracle calls) of different methods ∗ ∗ with two-point feedback for convex optimization problems. Rp is such that kx0 − x kp ≤ Rp with x  2 2 being some solution. In the column “Assumptions” we use “bound. gr.” for Eξ kg(x, ξ)k2 ≤ M2 2 2 and “bound. var.” for Eξkg(x, ξ) − ∇f(x)k2 6 σ . Column “p = 1” corresponds to the support of non-Euclidean proximal setup, column “σ2” to the support of stochastic optimization problems,“∆” corresponds to the support of additional bounded noise of unknown nature. All the rows except the first one assume that f is L2-smooth. Oe(·) notation means O(·) up to logarithmic factors in n, ε.

function values, and the paper [11] reported theoretical and empirical comparison of different gradient approximations for zero-order methods. The authors of [21] combine accelerated derivative-free optimization with accelerated variance reduction technique for finite-sum convex problems. For a recent review of derivative-free optimization see [48]. We extend the proposed algorithms for a more general setting of inexact directional derivative oracle as well as for strongly convex problems in [32]. Mixed first-order/zero-order setting is considered in [12] and zero-order methods for non- smooth saddle-point problems are developed in [13]. 1.2. Our Contributions. As our main contribution, we propose an accelerated method for smooth stochastic derivative-free optimization with two-point feedback, which we call Accelerated Randomized Derivative-Free Directional Search (ARDFDS). Our method has the complexity bound

  2  1 1  q L R2 n q σ2R2  2 + q 2 p p (1.4) Oe max n ε , ε2  ,  

∗ where Oe hides logarithmic factor of the dimension, Rp is such that kx0 − x kp ≤ Rp ∗ with x being an arbitrary solution to (1.1) and x0 being the starting point of the algorithm. We underline that our bounds hold for any solution. Thus, to obtain the best possible bound, one can consider the closest solution to the starting point. In the Euclidean case p = q = 2, the first term in the above bound has better dependence on ε, L2 and R2 than the first term in the bound in [42, 41]. Unlike these papers, our bound also covers the non-Euclidean case p = 1, q = ∞ and, due to that, allows to ob- tain better complexity bounds. To illustrate this, let us consider an arbitrary solution ∗ ∗ x to (1.1), start method from a point x0 and define the sparsity s of the vector x0−x , AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 5

∗ ∗ √ i.e. kx0−x k1 ≤ s·kx0−x k2 and 1 ≤ s ≤ n. Then the complexity of our method for  q 2 ∗ 2 2 2 ∗ 2  ns L2kx0−x k2 s σ kx0−x k2 p = 1, q = ∞ is Oe max ε , ε2 , which is always no worse

 q 2 ∗ 2 2 ∗ 2  n L2kx0−x k2 nσ kx0−x k2 than the complexity for p = q = 2, which is Oe max ε , ε2 √ and allows to gain up to n if s is close to 1. Notably, this is done automatically, without any prior knowledge of s. An example of this situation can be a typical compressed sensing problem [19, 29] of recovering a sparse signal x∗ from noisy ob- servations of a linear transform of x∗ via solving an optimization problem. In this ∗ case, if x0 = 0 then x0 − x is sparse by the problem assumption. Moreover, since our bounds hold for arbitrary solution x∗, to get better complexity estimate, one can use the bound obtained using the sparsest solution. Unlike previous works, we consider additional, possibly adversarial noise η(x, ξ) in the objective value and analyze how this noise affects the convergence rate estimates. If the noise can be controlled and ∆ can be made arbitrarily small, e.g. if the objective is calculated by an auxiliary procedure, we estimate how ∆ should depend on the target accuracy ε to ensure finding an ε-solution. If the noise is uncontrolled, e.g. we only have an estimate for the noise level ∆ and we cannot make ∆ arbitrarily small, we can run our algorithms and guarantee that they generate a point with expected objective residual bounded by a quantity dependent on ∆. This is important when the objective is given as a solution to some auxiliary problem, which can not be solved exactly, e.g. in bi-level optimization or reinforcement learning. It should also 2 be mentioned that our assumption Eξ[L(ξ) ] ≤ L2 for some L2 is weaker than the assumption that there is L2 s.t. L(ξ) ≤ L2 a.s. in ξ, which is used in [42, 41]. As our second contribution, we propose a non-accelerated Randomized Derivative- Free Directional Search (RDFDS) method with the complexity bound

  2 2  q 2 q 2 2  n L2Rp n σ Rp  (1.5) Oe max ε , ε2  ,   where, unlike [42, 41], the non-Euclidean case p = 1, q = ∞ is also covered with the gain in the complexity of up to the factor of n in comparison to the case p = q = 2. Notably, in the non-Euclidean case, we obtain a nearly dimension independent( Oe hides logarithmic factor of the dimension) complexity bound despite we use only noisy function value observations. Why is it important to improve the first term under the maximum? 1. Acceleration when n is large. The first term under the maximum dom- 3 1 1 2 − q √ inates the second term when σ2 ≤ ε 2 n L2 in the accelerated case and Rp 2 when σ ≤ εL2 in the non-accelerated case, which could be met in practice if ε, L2 and n are large enough compared to Rp. For example, if p = 1, q = ∞ −3 and we would like to find an ε-solution with ε = 10 and L2 = 100, Rp = 10, n = 10000 (or larger), and the variance satisfies mild assumption σ2 ≤ 10−1, then the complexity of ARDFDS is better than that of RDFDS. 2. Better dimension dependence in the deterministic case. We underline that even in the deterministic case with σ = 0 and without additive noise, both our non-accelerated and accelerated complexity√ bounds for p = 1 are new. Moreover, disregarding ln n factors, for s ∈ [1, n], the existing bounds [55] are n/s2 and n/s times worse than our new bounds respectively in non- 6 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

accelerated and accelerated cases. Importantly, in the non-accelerated case our bound is dimension-independent up to a ln n factor. 3. Parallel computation of mini-batches makes acceleration reason- able when σ2 is not small. Even when the second term in (1.4) is dom- inating and, thus, the total computation time is proportional to the second term, using parallel computations we can force the total computation time to be proportional to the first term, underlining the importance of making it smaller via accelerating the method. The idea is to use parallel computations of mini-batches as follows. Instead of sampling one ξ in each iteration of the algorithm one can consider a mini-batch of size r, i.e. sample r iid realiza- tions of ξ and average r finite-difference approximations for the gradient to 2 σ2 reduce the variance of this approximation from σ to r . If one can have an access to at least r processors, in each iteration all processors simulta- neously in parallel can make a call to the zeroth-order oracle and calculate finite-difference approximation for the gradient. Then a processor chosen to be central calculates the average of these r approximations, which gives a mini-batch approximation of the gradient. Since this work is done in parallel, it takes nearly the same amount of time as using a mini-batch of size 1 in the standard approach. By choosing sufficiently large r, one can make the second σ2 term in (1.4) (which is now proportional to r ) smaller than the first term. Hence, the total computation time will be proportional to the first term under the maximum in (1.4). Such an acceleration can be achieved by a reasonable amount of processors. For example, if σ2 = 1, which is not small, n = 10000, −3 2.5 Rp = 10, ε = 10 and L2 = 100, then it is sufficient to have r = 10 ≈ 316 processors which is a small number compared to modern supercomputers and clusters that often have ∼ 105 − 106 processors. 2. Algorithms for Stochastic Convex Optimization.

2.1. Preliminaries. p-norm proximal setup. Let p ∈ [1, 2] and kxkp be the n n p P p k · kp-norm in R defined as kxkp = |xi| . Further, let k · kq be its dual, defined by i=1  kgkq = max hg, xi, kxkp ≤ 1 , where q ∈ [2, ∞] is the conjugate number to p, given x 1 1 by + = 1, and, for q = ∞, by definition, kxk∞ = max |xi|. We also use kxk0 p q i=1,...,n to denote the number of non-zero components of x ∈ Rn. We choose a prox-function n d(x), which is continuousand 1-strongly convex on R with respect to k · kp, i.e., for n 1 2 any x, y ∈ R , d(y) − d(x) − h∇d(x), y − xi ≥ 2 ky − xkp. Without loss of generality, we assume that min d(x) = 0. We define also the corresponding Bregman divergence n x∈R V [z](x) = d(x) − d(z) − h∇d(z), x − zi, for x, z ∈ Rn. Note that, by the 1-strong convexity of d(·), 1 2 n (2.1) V [z](x) ≥ 2 kx − zkp, ∀ x, z ∈ R .

2 (κ−1)(2−κ)/κ kxkκexp(1)n ln n For p = 1, we choose the prox-function (see [9]) d(x) = 2 , 1 where κ = 1 + ln n and, for the case p = 2, we choose the prox-function to be 1 2 d(x) = 2 kxk2. Main technical lemma. In our proofs of complexity bounds, we rely on the following lemma. The proof is rather technical and is provided in the appendix.

Lemma 2.1. Let e ∈ RS2(1), i.e. be a random vector uniformly distributed on the n 1 1 surface of the unit Euclidean sphere in R , p ∈ [1, 2] and q be given by p + q = 1. AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 7

2/q−1 Define ρn = min{q − 1, 16 ln n − 8}n . Then, for n > 8, 2 (2.2) Eekekq ≤ ρn, and 2 2 6ρn 2 n (2.3) Ee hs, ei kekq ≤ n ksk2, ∀s ∈ R . Stochastic approximation of the gradient. Based on the noisy observations (1.3) of the objective value, we form the following stochastic approximation of ∇f(x)

m 1 X fe(x + te, ξi) − fe(x, ξi) (2.4) ∇e mf t(x) = e, m t i=1

where e ∈ RS2(1), ξi, i = 1, ..., m are independent realizations of ξ, m is the mini-batch size, t is some small positive parameter, which we call smoothing parameter. 2.2. Accelerated Randomized Derivative-Free Directional Search. The method is listed as Algorithm 2.1. Following [55, 42, 41] we assume that L2 is known. The possible choice of the smoothing parameter t and mini-batch size m are discussed below. Note that at every iteration the algorithm requires to solve an auxiliary min- imization problem. As it is shown in [9], for both cases p = 1 and p = 2 this minimization can be made explicitly in O(n) arithmetic operations.

Algorithm 2.1 Accelerated Randomized Derivative-Free Directional Search (ARDFDS)

Input: x0 – starting point; N – number of iterations; L2 – smoothness constant; m ≥ 1 – mini-batch size; t > 0 – smoothing parameter; V (·, ·) – Bregman divergence. 1: y0 ← x0, z0 ← x0. 2: for k = 0,...,N − 1. do k+1 3: Generate ek+1 ∈ RS2 (1) independently from previous iterations and ξi , i = 1, ..., m – independent realizations of ξ, which are also independent from previous iterations. 2 4: τk ← k+2 , xk+1 ← τkzk + (1 − τk)yk. m t 5: Calculate ∇e f (xk+1) using(2.4) with e = ek+1 and set yk+1 ← xk+1 − 1 m t ∇ f (xk+1). 2L2 e k+1 n D m t E o 6: αk ← 96n2ρ L , zk+1 ← argmin αk+1n ∇e f (xk+1), z − zk + V [zk](z) . n 2 n z∈R 7: end for 8: return yN

Theorem 2.2. Let ARDFDS be applied to solve problem (1.1), x∗ be an arbitrary ∗ solution to (1.1) and Θp be such that V [z0](x ) ≤ Θp. Then, for all n ≥ 8, √ 2 2 12 2nΘ ∗ 384n ρnL2Θp 384N σ p L2t 2∆  E[f(yN )] − f(x ) 6 2 + + 2 + (2.5) N nL2 m N 2 t 6N  2 2 16∆2  N 2  2 2 16∆2  + L t + 2 + L t + 2 . L2 2 t 24nρnL2 2 t

Before we prove Theorem 2.2 in the next subsection, let us discuss its result. In the simple case ∆ = 0, all the terms in the r.h.s. of (2.5) can be made smaller than ε for any ε > 0 by an appropriate choice of N, m, t. Thus, we consider a more interesting 3 2 case and assume that the noise level satisfies 0 < ∆ 6 L2Θpn ρn/2. The second ∗ inequality is non-restrictive since by the L2-smoothness f(x0) − f 6 L2Θp and it is 8 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

p = 1 p = 2 q q 2 n ln nL2Θ1 n L2Θ2 N(ε) ε ε n 2 q o n 2 q o σ Θ1 ln n σ Θ2 m(ε) max 1, 3/2 · max 1, 3/2 · ε nL2 ε L2 n 3/2 2 o n 3/2 2 o ∆(ε) min √ ε , ε min √ε , ε L2Θ1n ln n nL2Θ1 n L2Θ2 nL2Θ2  3/4   3/4  t(ε) min √ ε , √ε min √ ε , √ε 4 3 L2 nΘ1 4 2 3 L2 nΘ2 L2Θ1n ln n n L2Θ2 q 2  q 2 2  n ln nL2Θ1 σ Θ1 ln n n L2Θ2 σ Θ2n N(ε)m(ε) max ε , ε2 max ε , ε2 Table 2 Summary of the values for N, m, ∆, t and the total number of function value evaluations Nm guaranteeing for the cases p = 1 and p = 2 that Algorithm 2.1 outputs yN satisfying E [f(yN )] − f(x∗) ≤ ε. Numerical constants are omitted for simplicity.

natural to assume that the oracle error is smaller than the initial objective residual. 2 2 16∆2 In order to minimize the terms with L2t + 2 in the r.h.s of (2.5), we set t as q t 2 ∆ . Substituting this into the r.h.s. of (2.5) and using that, by our assumption L2 2 p on ∆, Θpn ρnL2 > 2nL2Θp∆, we obtain the following inequality

2 2 2 ∗ 408L2Θpn ρn 384N σ N ∆ (2.6) [f(yN )] − f(x ) 2 + + 48N∆ + E 6 N nL2 m 3nρn First, we consider the situation of controlled noise level ∆ which can be made arbi- trarily small. For example, the value of f is defined as a solution of some auxiliary problem, which can be solved numerically with arbitrarily small accuracy ∆. Then we have control over parameters N, m, ∆ in the r.h.s of (2.6) and can choose these parameters to make it smaller than ε. First, we choose N to make the first term to be smaller than ε. After that we choose m to make the second term smaller than ε. Finally, we choose ∆ to make all the other terms smaller than ε. The resulting values of these parameters up to constants are given in Table2. As a summary, we have the following corollary of Theorem 2.2. Corollary 2.3. Assume that the value of ∆ can be controlled and satisfies 0 < 3 2 ∆ 6 L2Θpn ρn/2. Assume that for a given accuracy ε > 0 the values of the parame- ters N(ε), m(ε), t(ε), ∆(ε) satisfy relations stated in Table2 and ARDFDS is applied ∗ to solve problem (1.1). Then the output point yN satisfies E [f(yN )] − f(x ) 6 ε. Moreover, the overall number of oracle calls is N(ε)m(ε) given in the same table. Note that in the case of uncontrolled noise level ∆, the values of this parameter stated in Table2 can be seen as the maximum value of the noise level which can be ∗ tolerated by the method still allowing it to achieve E [f(yN )] − f(x ) 6 ε. Next, we consider the case of uncontrolled noise level ∆ and estimate the smallest expected objective residual which can be guaranteed in theory. First, we focus on the following three terms in the r.h.s. of (2.6), for simplicity disregarding the numerical constants,

2 2 L2Θpn ρn N ∆ (2.7) 2 + N∆ + , N nρn

and consider two cases a) N 6 nρn and b) N > nρn. In the case a), we have that the third term in (2.7) is dominated by the second one. Minimizing then in N the AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 9

p = 1 p = 2 p p t(∆) ∆/L2 ∆/L2  1 3 1 4 n 1/3 1/4o  2  /  3  / L2Θ1n  L2Θ1n  L2Θ2n L2Θ2n N(∆) min ∆ , ∆ min ∆ , ∆

n σ2 σ2 o n σ2 σ2 o m(∆) min nL ∆ , 4 5 3 1/4 min nL ∆ , 3 5 3 1/4 2 (n L2Θ1∆ ) 2 (n L2Θ2∆ ) 1 √ 1 √ n 2 /3 o n 2 2 /3 o ε(∆) max L2Θ1n∆ , nL2Θ1∆ max L2Θ2n ∆ , nL2Θ2∆  1   1    /3 2   /3 2 2 Θ1 σ 2 Θ2 σ Nm min σ 2 2 4 , min σ 2 4 , n L2∆ nL2∆ nL2∆ L2∆ Table 3 Summary of the values for N, m, t and the total number of function value evaluations Nm guaranteeing for the cases p = 1 and p = 2 that Algorithm 2.1 outputs yN with minimal possible expected objective residual ε if the oracle noise level ∆ is uncontrolled. Numerical constants and logarithmic factors in n are omitted for simplicity.

2 L2Θpn ρn upper bound N 2 +N∆ for (2.7), we obtain the optimal number Na) of steps and minimal possible value εa) of this upper bound. Moreover inequality Na) 6 nρn turns L2Θp out to be equivalent to ∆ 2 . In the case b) the second term in (2.7) is dominated > nρn 2 2 L2Θpn ρn N ∆ by the third one. Minimizing then in N the upper bound 2 + for (2.7), we N nρn obtain the optimal number Nb) of steps and minimal possible value εb) of this upper L2Θp bound. Moreover inequality Nb) nρn turns out to be equivalent to ∆ 2 . Now > 6 nρn 2 2 Na) σ Nb) σ we can choose ma) = and mb) = to make the second term in the r.h.s. nL2 εa) nL2 εb) of (2.6) to be of the same order as the smallest achievable error εa) or εb) in the case L2Θp a) or b) respectively. Finally, we check that ∆ 2 is equivalent to the case a) and > nρn inequalities εa) > εb), Na) 6 Nb), ma) 6 mb). This means that the smallest possible error is max{εa), εb)} and it is achieved in the number of iterations min{Na),Nb)} with batch size min{ma), mb)}. The corresponding values of the parameters are given in Table3 and we summarize the result as follows.

3 2 Corollary 2.4. Assume that ∆ is known and satisfies 0 < ∆ 6 L2Θpn ρn/2, the parameters N(∆), m(∆), t(∆) satisfy relations stated in Table3 and ARDFDS is ∗ applied to solve problem (1.1). Then the output point yN satisfies E [f(yN )] − f(x ) 6 ε(∆), where ε(∆) satisfies the corresponding relation in the same table. Moreover, the overall number of oracle calls is N(∆)m(∆) given in the same table.

2 2 Using an additional “light-tail” assumption that Eξ[exp(kg(x, ξ)−∇f(x)k2/σ )] 6 exp(1) and techniques of [43] our algorithm and analysis can be extended to obtain results in terms of probability of large deviations. For example, in the case of con- trolled noise level ∆ this means that our algorithm outputs a point yN which satisfies ∗ P{f(yN ) − f(x ) 6 ε} > 1 − δ, where δ ∈ (0, 1) is the confidence level, for the price 1 of extra ln δ factor in N and m. In the several next subsections we provide the full proof of Theorem 2.2 consisting of the four following parts. We start with the technical result providing us with inequalities relating the approximation (2.4) with the stochastic gradient g(x, ξ) and full gradient ∇f(x). The next two parts are in the spirit of Linear Coupling method of [4]. Namely, we analyze the progress of the Gradient Descent step (line 5 of ARDFDS) and estimate the progress of the Mirror Descent step (line 6 of ARDFDS). In the final fourth part, we combine all previous parts and finish the proof of the theorem. We 10 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

emphasize that in the last part we use a careful analysis of the recurrent inequalities ∗ for E[kx −zkkp] (see Lemma B.1, proved in AppendixB) in order to bound the terms related to the noise in the objective values. 2.2.1. Inequalities for Gradient Approximation. The proof of the main theorem relies on the following technical result, which connects finite-difference ap- proximation (2.4) of the stochastic gradient with the stochastic gradient itself and also with ∇f. This lemma plays a central role in our analysis providing correct dependence of the complexity bounds on the dimension. Lemma 2.5. For all x, s ∈ Rn, we have m 2 2 m t 2 12ρn m ~ 2 ρnt X 2 16ρn∆ (2.8) Eek∇e f (x)kq 6 n kg (x, ξm)k2 + m L(ξi) + t2 , i=1 m m t 2 1 m ~ 2 t2 X 2 8∆2 (2.9) Eek∇e f (x)k2 > 2n kg (x, ξm)k2 − 2m L(ξi) − t2 , i=1 m tksk X 2∆ksk m t 1 m ~ √p √ p (2.10) Eeh∇e f (x), si > n hg (x, ξm), si − 2m n L(ξi) − t n , i=1 m m t 2 2 m ~ 2 t2 X 2 16∆2 (2.11)Eekh∇f(x), eie − ∇e f (x)k2 6 n k∇f(x) − g (x, ξm)k2 + m L(ξi) + t2 , i=1

m m ~ 1 P where g (x, ξm) := m g(x, ξi), ∆ is defined in (1.3), L(ξ) is the Lipschitz constant i=1 of g(·, ξ), which is the gradient of F (·, ξ). Proof. First of all, we rewrite ∇e mf t(x) as follows

m ! m t D m ~ E 1 X ∇e f (x) = g (x, ξm), e + m θ(x, ξi, t, e) e, i=1

F (x+te,ξi)−F (x,ξi) η(x+te,ξi)−η(x,ξi) where θ(x, ξi, t, e) = t − hg(x, ξi), ei + t , i = 1, . . . , m. By the L(ξ)-smoothness of F (·, ξ) and (1.3), we have

L(ξ)t 2∆ (2.12) |θ(x, ξi, t, e)| ≤ 2 + t . Proof of (2.8).

D E m  2 k∇mf t(x)k2 = gm(x, ξ~ ), e + 1 P θ(x, ξ , t, e) e Ee e q Ee m m i i=1 q x m 2 m ~ 2 1 P 6 2Eekhg (x, ξm), eiekq + 2Ee m θ(x, ξi, t, e)e (2.13) i=1 q y m  2 12ρn m ~ 2 2ρn P L(ξi)t 2∆ 6 n kg (x, ξm)k2 + m 2 + t i=1 2 m 2 12ρn m ~ 2 ρnt P 2 16ρn∆ 6 n kg (x, ξm)k2 + m L(ξi) + t2 , i=1

2 2 2 n where x holds since kx + ykq 6 2kxkq + 2kykq, ∀x, y ∈ R ; y follows from inequal- ities (2.2), (2.3), (2.12) and the fact that, for any a1, a2, . . . , am > 0, it holds that  m 2 m P P 2 ai 6 m ai . i=1 i=1 AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 11

Proof of (2.9).

D E m  2 k∇mf t(x)k2 = gm(x, ξ~ ), e + 1 P θ(x, ξ , t, e) e Ee e 2 Ee m m i i=1 2 x m  2 1 m ~ 2 1 P L(ξi)t 2∆ (2.14) > 2 Eekhg (x, ξm), eiek2 − m 2 + t i=1 y m 1 m ~ 2 t2 P 2 8∆2 > 2n kg (x, ξm)k2 − 2m L(ξi) − t2 , i=1

2 1 2 2 n where x follows from (2.12) and inequality kx + yk2 > 2 kxk2 − kyk2, ∀x, y ∈ R ; n y follows from e ∈ RS2(1) and Lemma B.10 in [14], stating that, for any s ∈ R , 2 1 2 Ehs, ei = n ksk2. Proof of (2.10).

m m t m ~ 1 P Eeh∇e f (x), si = Eehhg (x, ξm), eie, si + Ee m θ(x, ξi, t, e)he, si i=1 x m   1 m ~ 1 P L(ξi)t 2∆ (2.15) > n hg (x, ξm), si − m 2 + t Ee|he, si| i=1 y m tksk 2∆ksk 1 m ~ √p P √ p > n hg (x, ξm), si − 2m n L(ξi) − t n i=1

n where x follows from Ee[nhg, eie] = g, ∀g ∈ R and (2.12); y follows from Lemma p 2 B.10 in [14], since E|hs, ei| ≤ Ehs, ei , and the fact that kxk2 6 kxkp for p 6 2. Proof of (2.11).

m t 2 Eekh∇f(x), eie − ∇e f (x)k2 m 2 m ~ 1 P = Ee h∇f(x), eie − hg (x, ξm), eie − m θ(x, ξi, t, e)e i=1 2 (2.16) x 2 m 2 2 h∇f(x) − gm(x, ξ~ ), eie + 2 1 P θ(x, ξ , t, e)e 6 Ee m Ee m i 2 i=1 2 y m 2 m ~ 2 t2 P 2 16∆2 6 n k∇f(x) − g (x, ξm)k2 + m L(ξi) + t2 , i=1

2 2 2 n where x holds since kx + yk2 6 2kxk2 + 2kyk2, ∀x, y ∈ R ; y follows from e ∈ S2(1) and Lemma B.10 in [14], and (2.12). 2.2.2. Progress of the Gradient Descent Step. The following lemma esti- mates the progress in step5 of ARDFDS, which is a gradient step. Lemma 2.6. Assume that y = x − 1 ∇mf t(x). Then, 2L2 e (2.17) m m ~ 2 m ~ 2 5nt2 X 2 80n∆2 kg (x, ξm)k2 ≤ 8nL2(f(x)−Eef(y))+8k∇f(x)−g (x, ξm)k2+ m L(ξi) + t2 , i=1

m ~ where g (x, ξm) is defined in Lemma 2.5, ∆ is defined in (1.3), L(ξ) is the Lipschitz constant of g(·, ξ), which is the gradient of F (·, ξ).

Proof. Since ∇e mf t(x) is collinear to e, we have that, for some γ ∈ R, y − x = γe. Then, since kek2 = 1,

h∇f(x), y − xi = h∇f(x), eiγ = h∇f(x), eihe, y − xi = hh∇f(x), eie, y − xi. 12 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

From this and L2-smoothness of f, we obtain

L2 2 f(y) 6 f(x) + hh∇f(x), eie, y − xi + 2 ||y − x||2 m t 2 m t = f(x) + h∇e f (x), y − xi + L2||y − x||2 + hh∇f(x), eie − ∇e f (x), y − xi L2 2 − 2 ||y − x||2 x m t 2 1 m t 2 f(x) + h∇ f (x), y − xi + L2||y − x|| + kh∇f(x), eie − ∇ f (x)k , 6 e 2 2L2 e 2

ζ 2 1 2 where x follows from the Fenchel inequality hs, zi − 2 kzk2 ≤ 2ζ ksk2. Using y = x − 1 ∇mf t(x), we get 2L2 e

1 k∇mf t(x)k2 f(x) − f(y) + 1 kh∇f(x), eie − ∇mf t(x)k2. 4L2 e 2 6 2L2 e 2 Taking the expectation in e we obtain

 m  (2.9) 1 1 m ~ 2 t2 P 2 8∆2 1 m t 2 kg (x, ξm)k − L(ξi) − 2 ek∇ f (x)k 4L2 2n 2 2m t 6 4L2 E e 2 i=1 1 m t 2 f(x) − ef(y) + ekh∇f(x), eie − ∇ f (x)k 6 E 2L2 E e 2 (2.11)  m  1 2 m ~ 2 t2 P 2 16∆2 f(x) − ef(y) + k∇f(x) − g (x, ξm)k + L(ξi) + 2 . 6 E 2L2 n 2 m t i=1 Rearranging the terms, we obtain the statement of the lemma. 2.2.3. Progress of the Mirror Descent Step. The following lemma estimates the progress in step6 of ARDFDS, which is a Mirror Descent step.

n D m t E o Lemma 2.7. For z+ = argmin αn ∇e f (x), u − z + V [z](u) we have n u∈R

m ~ 2 m ~ 2 αhg (x, ξm), z − ui 6 6α nρnkg (x, ξm)k2 + V [z](u) − Ee[V [z+](u) 2 2  2 m 2  α n ρn t P 2 16∆ + 2 m L(ξi) + t2 (2.18) i=1  m  √ t P 2∆ +α nkz − ukp 2m L(ξi) + t , i=1

m ~ where g (x, ξm) is defined in Lemma 2.5, ∆ is defined in (1.3), L(ξ) is the Lipschitz constant of g(·, ξ), which is the gradient of F (·, ξ). Proof. For all u ∈ Rn, we have

m t m t m t αnh∇e f (x), z − ui = αnh∇e f (x), z − z+i + αnh∇e f (x), z+ − ui x m t 6 αnh∇e f (x), z − z+i + h−∇V [z](z+), z+ − ui y m t (2.19) = αnh∇e f (x), z − z+i + V [z](u) − V [z+](u) − V [z](z+) z  m t 1 2 6 αnh∇e f (x), z − z+i − 2 kz − z+kp + V [z](u) − V [z+](u) { α2n2 m t 2 6 2 k∇e f (x)kq + V [z](u) − V [z+](u),

m t where x follows from the definition of z+, whence h∇V [z](z+) + αn∇e f (x), u − n z+i > 0 for all u ∈ R ; y follows from the “magic identity” Fact 5.3.2 in [9] for the Bregman divergence; z follows from (2.1); and { follows from the Fenchel inequality 1 2 ζ2 2 ζhs, zi − 2 kzkp ≤ 2 kskq. Taking expectation in e, applying (2.10) with s = z − u AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 13

and (2.8), we get (2.20)  m  tkz−uk 2∆kz−uk 1 m ~ √ p P √ p αn n hg (x, ξm), z − ui − 2m n L(ξi) − t n i=1 m t α2n2 m t 2 6 αnEeh∇e f (x), z − ui 6 2 Eek∇e f (x)kq + V [z](u) − Ee[V [z+](u)] 2 2  2 m 2  α n 12ρn m ~ 2 ρnt P 2 16ρn∆ 6 2 n kg (x, ξm)k2 + m L(ξi) + t2 + V [z](u) − Ee[V [z+](u)]. i=1

Rearranging the terms, we obtain the statement of the lemma.

2.2.4. Proof of Theorem 2.2. First, we prove the following lemma, which estimates the per-iteration progress of the whole algorithm.

Lemma 2.8. Let {xk, yk, zk, αk, τk}, k > 0 be generated by ARDFDS. Then, for all u ∈ Rn, (2.21) 2 2 2 2 48n ρnL2αk+1Ee,ξ[f(yk+1) | Ek, Ξk] − (48n ρnL2αk+1 − αk+1)f(yk) −V [zk](u) + Ee,ξ[V [zk+1](u) | Ek, Ξk] − Rk+1 6 αk+1f(u),

(2.22) 2 2 2 2  2  √ 48αk+1nρnσ 61αk+1n ρn 2 2 16∆ L2t 2∆  Rk+1 = m + 2 L2t + t2 + αk+1 nkzk − ukp 2 + t ,

where ∆ is defined in (1.3), Ek and Ξk denote the history of realizations of e1, . . . , ek 1 1 k k and ξ1 , . . . , ξm, . . . , ξ1 , . . . , ξm respectively, up to the step k.

Proof. Combining (2.17) and (2.18), we obtain (2.23) m ~ k+1 2 2 αhg (xk+1, ξm ), z − ui 6 48α n ρnL2(f(xk+1) − Eef(yk+1)) 2 m ~ k+1 2 +V [zk](u) − Ee[V [zk+1](u)] + 48α nρnk∇f(xk+1) − g (xk+1, ξm )k2 m m 2 2  2 2  √   61α n ρn t P k+1 2 16∆ t P k+1 2∆ + 2 m L(ξi ) + t2 + α nkzk − ukp 2m L(ξi ) + t , i=1 i=1

m ~ where g (x, ξm) is defined in Lemma 2.5 and the expectation in e is conditional on m ~ n m ~ Ek. By the definition of g (x, ξm) and (1.2), for all x ∈ R , Eξg (x, ξm) = ∇f(x) m ~ 2 σ2 and Eξk∇f(x) − g (x, ξm)k2 ≤ m . Using these two facts and taking the expectation ~ k+1 in ξm conditional on Ξk, we obtain (2.24) 2 2 αk+1h∇f(xk+1), zk − ui 6 48αk+1n ρnL2 (f(xk+1) − Ee,ξ[f(yk+1) | Ek, Ξk]) +V [zk](u) − Ee,ξ[V [zk+1](u) | Ek, Ξk] + Rk+1. 14 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

Further,

αk+1 (f(xk+1) − f(u)) 6 αk+1h∇f(xk+1), xk+1 − ui = αk+1h∇f(xk+1), xk+1 − zki + αk+1h∇f(xk+1), zk − ui x (1−τk)αk+1 = h∇f(xk+1), yk − xk+1i + αk+1h∇f(xk+1), zk − ui τk y (1−τk)αk+1 (f(yk) − f(xk+1)) + αk+1h∇f(xk+1), zk − ui 6 τk (2.24) (1−τk)αk+1 (f(yk) − f(xk+1)) 6 τk 2 2 +48αk+1n ρnL2 (f(xk+1) − Ee,ξ[f(yk+1) | Ek, Ξk]) +V [zk](u) − Ee,ξ[V [zk+1](u) | Ek, Ξk] + Rk+1 z 2 2 = (48αk+1n ρnL2 − αk+1)f(yk) 2 2 −48αk+1n ρnL2Ee,ξ[f(yk+1) | Ek, Ξk] + αk+1f(xk+1) +V [zk](u) − Ee,ξ[V [zk+1](u) | Ek, Ξk] + Rk+1.

Here x is since xk+1 := τkzk + (1 − τk)yk ⇔ τk(xk+1 − zk) = (1 − τk)(yk − xk+1), y follows from the convexity of f and inequality 1 − τk > 0, and z is since τk = 1 2 . Rearranging the terms, we obtain the statement of the lemma. 48αk+1n ρnL2 Proof of Theorem 2.2. Note that 2 2 1 2 2 48n ρnL2α − αk+1 + 2 = 48n ρnL2α since k+1 192n ρnL2 k

2 2 2 1 (k+2) k+2 1 48n ρnL2α −αk+1 + 2 = 2 − 2 + 2 k+1 192n ρnL2 192n ρnL2 96n ρnL2 192n ρnL2 2 k2+4k+4−2k−4+1 (k+1) 2 2 = 2 = 2 = 48n ρnL2α . 192n ρnL2 192n ρnL2 k

Taking, for any 1 l N, the full expectation [·] = 1 1 N N [·] in 6 6 E Ee1,...,eN ,ξ1 ,...,ξm,...,ξ1 ,...,ξm both sides of (2.21) for k = 0, . . . , l − 1 and telescoping the obtained inequalities2, we have, l−1 2 2 P 1 48n ρnL2α [f(yl)] + 2 [f(yk)] − V [z0](u) l E 192n ρnL2 E k=1 (2.25) l−1 l−1 l−1 P P 2 P +E[V [zl](u)] − ζ1 αk+1E[ku − zkkp] − ζ2 αk+1 6 αk+1f(u), k=0 k=0 k=0 where we denoted √ 2 2  2  L2t 2∆  σ 61n ρn 2 2 16∆ (2.26) ζ1 := n 2 + t , ζ2 := 48nρn m + 2 L2t + t2 . Since u in (2.25) is arbitrary, we set u = x∗, where x∗ is a solution to (1.1), use the ∗ ∗ inequality Θp > V [z0](x ), and define Rk := E[kx − zkkp]. Also, from (2.1), we have √ l−1 2Θpζ1 P 2 that ζ1α1R0 ≤ 2 . To simplify the notation, we define Bl := ζ2 α +Θp + 48n ρnL2 k+1 k=0 √ l−1 2Θpζ1 P l(l+3) ∗ 2 . Since αk+1 = 2 and, for all i = 1,...,N, f(yi) f(x ), we 48n ρnL2 192n ρnL2 6 k=0 get from (2.25) that

2 (l+1) ∗  (l+3)l l−1  2 [f(yl)] f(x ) 2 − 2 + Bl 192n ρnL2 E 6 192n ρnL2 192n ρnL2 l−1 ∗ P (2.27) −E[V [zl](x )] + ζ1 αk+1Rk, k=1 2 l−1 (l+1) ∗ ∗ P 0 2 ( [f(yl)] − f(x )) Bl − [V [zl](x )] + ζ1 αk+1Rk, 6 192n ρnL2 E 6 E k=1

2 2 1 2 2 Note that α1 = 2 = 2 and therefore 48n ρnL2α − α1 = 0. 96n ρnL2 48n ρnL2 1 AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 15

l−1 ∗ P which gives E[V [zl](x )] 6 Bl + ζ1 αk+1Rk and k=1

l−1 1 ∗ 2 1 ∗ 2 ∗ X (2.28) 2 (E[kzl − x kp]) 6 2 E[kzl − x kp] 6 E[V [zl](x )] 6 Bl + ζ1 αk+1Rk, k=1 s l−1 √ P whence, Rl 6 2 · Bl + ζ1 αk+1Rk. This recurrent sequence of Rl’s is analyzed k=1 √ 2 2Θpζ1 separately in AppendixB. Applying Lemma B.1 with a0 = ζ2α +Θp + 2 , ak = 1 48n ρnL2 2 ζ2αk+1, b = ζ1 for k = 1,...,N − 1, we obtain

l−1 √ √ 2 2 P  l  (2.29) Bl + ζ1 αk+1Rk Bl + 2ζ1 · 2 , l = 1,...,N 6 96n ρnL2 k=1

∗ Since V [z](x ) > 0, by inequality (2.27), for l = N and the definition of Bl, we have (2.30) 2 √ √ 2 2 (N+1) ∗  N  2 ( [f(yN )] − f(x )) BN + 2ζ1 · 2 192n ρnL2 E 6 96n ρnL2 l−1 √ x 4 2Θ ζ 2 4 2 N P 2 p 1 4ζ1 N 2BN + 4ζ · 2 2 = 2ζ2 α + 2Θp + 2 + 2 2 6 1 (96n ρnL2) k+1 24n ρnL2 (96n ρnL2) k=0 √ y 2Θ ζ 3 2 4 p 1 2ζ2(N+1) 4ζ1 N 2Θp + 2 + 2 2 + 2 2 6 24n ρnL2 (96n ρnL2) (96n ρnL2)

2 2 2 where x is due to the fact that, ∀a, b ∈ R, (a + b) 6 2a + 2b and y is be- N−1 N+1 P 2 1 P 2 1 (N+1)(N+2)(2N+3) 1 cause α = 2 2 k 2 2 · 2 2 · k+1 (96n ρnL2) 6 (96n ρnL2) 6 6 (96n ρnL2) k=0 k=2 (N+1)2(N+1)3(N+1) (N+1)3 (N+1)2 = 2 2 . Dividing (2.30) by 2 and substituting ζ1, ζ2 6 (96n ρnL2) 192n ρnL2 from (2.26), we obtain √ 2 12 2Θ 4 2 ∗ 384Θpn ρnL2 p 384(N+1)ζ2 N ζ1 [f(yN )] − f(x ) 2 + 2 ζ1 + 2 2 + 2 2 E 6 (N+1) (N√+1) (96n ρnL2) 12n ρnL2(N+1) 2 12 2nΘ 2 384Θpn ρnL2 p L2t 2∆  384N σ 2 + 2 + + 6 N N 2 t nL2 m 6N  2 2 16∆2  N 2  2 2 16∆2  + L t + 2 + L t + 2 . L2 2 t 24nρnL2 2 t

2.3. Randomized Derivative-Free Directional Search. Our non-accelerated method is listed as Algorithm 2.2. Following [55, 42, 41] we assume that L2 is known. The possible choice of the smoothing parameter t and mini-batch size m are discussed below. Note that at every iteration the algorithm requires to solve an auxiliary min- imization problem. As it is shown in [9], for both cases p = 1 and p = 2 this minimization can be made explicitly in O(n) arithmetic operations. Theorem 2.9. Let RDFDS be applied to solve problem (1.1), x∗ be an arbitrary ∗ solution to (1.1), and Θp be such that V [z0](x ) ≤ Θp. Then

2    2 2 2  384nρnL2Θp 2σ n N L2t 8∆ E[f(¯xN )] − f(x∗) 6 + + + + 2 (2.31) N√ L2m 6L2 3L2ρn 2 t 8 2nΘ p L2t 2∆  + N 2 + t , ∀n ≥ 8. 16 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

Algorithm 2.2 Randomized Derivative-Free Directional Search (RDFDS)

Input: x0 – starting point; N – number of iterations; L2 – smoothness constant; m 1 – mini-batch size; t > 0 – smoothing parameter; α = 1 – stepsize; > 48nρnL2 V (·, ·) – Bregman divergence. 1: for k = 0,...,N − 1. do k+1 2: Generate ek+1 ∈ RS2 (1) independently from previous iterations and ξi , i = 1, ..., m – independent realizations of ξ, which are also independent from previous iterations. m t 3: Calculate ∇e f (xk) using(2.4) with e = ek+1. n D m t E o 4: xk+1 ← argmin αn ∇e f (xk), x − xk + V [xk](x) . n x∈R 5: end for N−1 1 P 6: return x¯N ← N xk. k=0

Proof of Theorem 2.9. The proof of this result is rather similar to the proof of Theorem 2.2. First of all, (2.32) m t αnh∇e f (xk), xk − x∗i m t m t = αnh∇e f (xk), xk − xk+1i + αnh∇e f (xk), xk+1 − x∗i x m t 6 αnh∇e f (xk), xk − xk+1i + h−∇V [xk](xk+1), xk+1 − x∗i y m t = αnh∇e f (xk), xk − xk+1i + V [xk](x∗) − V [xk+1](x∗) − V [xk](xk+1) z  m t 1 2 6 αnh∇e f (xk), xk − xk+1i − 2 kxk − xk+1kp + V [xk](x∗) − V [xk+1](x∗) α2n2 m t 2 6 2 k∇e f (xkkq + V [xk](x∗) − V [xk+1](x∗),

m t n where x follows from h∇V [xk](xk+1) + αn∇e f (xk), x − xk+1i > 0 for all x ∈ R , y follows from “magic identity” Fact 5.3.2 in [9] for Bregman divergence, and z is 1 2 since V [x](y) > 2 kx − ykp. Taking conditional expectation Ee[ · | Ek] in both sides of (2.32) we get

2 2 αn [h∇mf t(x ), x − x i | E ] α n [k∇mf t(x )k2 | E ] (2.33) Ee e k k ∗ k 6 2 Ee e k q k +V [xk](x∗) − Ee[V [xk+1](x∗) | Ek]

From (2.33), (2.8) and (2.10) for s = xk − x∗, we obtain

m ~ k+1 2 hg (xk, ξm ), xk − x∗i 6 24α nρnL2(f(xk) − f(x∗)) 2 m 2 2 2 2 m ~ k+1 2 2 2 t P k+1 2 8α n ρn∆ +12α nρnk∇f(xk) − g (xk, ξm )k2 + α n ρn · 2m L2(ξi ) + t2 i=1 √ m √ t P k+1 2α∆ nkxk−x∗kp +α nkxk − x∗kp · 2m L2(ξi ) + t i=1 +V [xk](x∗) − Ee[V [xk+1](x∗) | Ek].

Taking conditional expectation Eξ[ · | Ξk] in the both sides of the previous inequality and using the convexity of f and (1.2), we have (2.34) 2  2 2 2  2 2 σ 2 2 L2t 8∆ (α − 24α nρnL2) (f(xk) − f(x∗)) 6 12α nρn m + α n ρn 2 + t2 | {z } α/√4 L2t 2∆  +α nkxk − x∗kp 2 + t + V [xk](x∗) − Ee,ξ[V [xk+1](x∗) | Ek, Ξk], AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 17

since α = 1 . Denote 48nρnL2

2 2 2 L2t 2∆ L2t 8∆ (2.35) ζ1 = 2 + t , ζ2 = 2 + t2 . Note that

2 2 2 2 2 L2t 2∆  L2t 4∆ (2.36) ζ1 = 2 + t 6 2 · 4 + 2 · t2 = ζ2.

Taking for any 1 l N, the full expectation [·] = 1 1 N N [·] in 6 6 E Ee1,...,eN ,ξ1 ,...,ξm,...,ξ1 ,...,ξm both sides of inequalities (2.34) for k = 0, . . . , l − 1 and summing them, we get (2.37) Nα 2 σ2 2 2 0 6 4 (E[f(¯xl)] − f(x∗)) 6 l · 12α nρn m + lα n ρnζ2 l−1 √ P ∗ +α nζ1 E[kxk − x∗kp]+V [x0](x∗) − E[V [xl](x )], k=0

l−1 1 P ∗ wherex ¯l = l xk. From the previous inequality, since V [z0](x ) 6 Θp, we get k=0

1 2 1 2 2 (E[kxl − x∗kp]) 6 2 E[kxl − x∗kp] 6 E[V [xl](x∗)] l−1 (2.38) 2 σ2 2 2 √ P 6 Θp + l · 12α nρn m + lα n ρnζ2 + α nδζ1 E[kxk − x∗kp], k=0

whence, ∀l 6 N, we obtain (2.39) v u l−1 √ 2 √ u 2 σ 2 2 X E[kxl − x∗kp] 6 2tΘp + l · 12α nρn m + lα n ρnζ2 + α nζ1 E[kxk − x∗kp]. k=0

∗ Denote Rk = E[kx − xkkp] for k = 0,...,N. The recurrent sequence of√Rk’s is ana- lyzed separately in AppendixB. Applying Lemma B.2 with a0 = Θp + α nζ1E[kx0 − p 2 σ2 2 2 √ x∗kp] 6 Θp + α 2nΘpζ1, ak = 12α nρn m + α n ρnζ2, b = nζ1 for k = 1,...,N − 1 we have for l = N

Nα 4 (E[f(¯xN )] − f(x∗)) q √ 2 2 σ2 2 2 p 6 Θp + N · 12α nρn m + Nα n ρnζ2 + α 2nΘpζ1 + 2nζ1αN x 2 σ2 2 2 p 2 2 2 6 2Θp + 24Nα nρn m + 2Nα n ρnζ2 + 2α 2nΘpζ1 + 4nζ1 α N , whence √ (2.36) 2 384nρnL2Θp 2σ nζ2 8 2nΘpζ1 ζ2N [f(¯xN )] − f(x∗) + + + + E 6 N L2m 6L2 N 3L2ρn (2.35) 2    2 2 2  384nρnL2Θp 2σ n N L2t 8∆ 6 + + + + 2 √ N L2m 6L2 3L2ρn 2 t 8 2nΘ p L2t 2∆  + N 2 + t , where we used also that α = 1 . 48nρnL2 Similarly to the discussion above concerning the ARDFDS and its convergence the- orem, we can formulate corollaries for the RDFDS in the case of controlled and uncontrolled noise level ∆. In the simple case ∆ = 0, all the terms in the r.h.s. 18 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

p = 1 p = 2 ln nL2Θ1 nL2Θ2 N(ε) ε ε n 2 o n 2 o m(ε) max 1, σ max 1, σ L2ε L2ε n 2 o n 2 o ∆(ε) min ε , ε min ε , ε n nL2Θ1 n nL2Θ2 q  q  t(ε) min ε , √ ε min ε , √ ε nL2 2 nL2 2 nL2Θ1 nL2Θ1 n 2 o n 2 o L2Θ1 ln n σ Θ1 ln n nL2Θ2 nσ Θ2 N(ε)m(ε) max ε , ε2 max ε , ε2 Table 4 Summary of the values for N, m, ∆, t and the total number of function value evaluations Nm guaranteeing for the cases p = 1 and p = 2 that Algorithm 2.2 outputs x¯N satisfying E [f(¯xN )] − f(x∗) ≤ ε. Numerical constants are omitted for simplicity.

of (2.31) can be made smaller than ε for any ε > 0 by an appropriate choice of N, m, t. Thus, we consider a more interesting case and assume that the noise level 2 satisfies 0 < ∆ 6 L2Θpnρn/2, the second inequality being non-restrictive. In or- 2 2 2 q L2t 8∆ ∆ der to minimize the term with + 2 in the r.h.s of (2.31), we set t = 2 . 2 t L2 Substituting this into the r.h.s. of (2.31) and using that, by our assumption on ∆, p 2nL2Θp∆ 6 nρnL2Θp, we obtain an upper bound for E[f(¯xN )] − f(x∗). Following the same steps as in the proof of Corollaries 2.3 and 2.4, we obtain the following results for RDFDS. Corollary 2.10. Assume that the value of ∆ can be controlled and satisfies 0 < 2 ∆ 6 L2Θpnρn/2. Assume that for a given accuracy ε > 0 the values of the parameters N(ε), m(ε), t(ε), ∆(ε) satisfy the relations stated in Table4 and RDFDS is applied ∗ to solve problem (1.1). Then the output point x¯N satisfies E [f(¯xN )] − f(x ) ≤ ε. Moreover, the overall number of oracle calls is N(ε)m(ε) given in the same table. Note that in the case of uncontrolled noise level ∆, the values of this parameter stated in Table4 can be seen as the maximum value of the noise level which can be tolerated ∗ by the method still allowing it to achieve E [f(¯xN )] − f(x ) 6 ε. For a more general case of uncontrolled noise level ∆, we obtain the following Corollary. 2 Corollary 2.11. Assume that ∆ is known and satisfies 0 < ∆ 6 L2Θpnρn/2, the parameters N(∆), m(∆), t(∆) satisfy relations stated in Table5 and RDFDS is ∗ applied to solve problem (1.1). Then the output point x¯N satisfies E [f(¯xN )]−f(x ) 6 ε(∆), where ε(∆) satisfies the corresponding relation in the same table. Moreover, the overall number of oracle calls is N(∆)m(∆) given in the same table. Similarly to ARDFDS, RDFDS and its analysis can be extended to obtain convergence in terms of probability of large deviations under additional “light-tail” assumption.

2.4. Role of the algorithms parameters. Role of ∆ and t. We would like to mention that there is no need to know the noise level ∆ to run our algorithms. As it can be seen from (2.5), the ARDFDS method is robust in the sense of [51] to the choice of the smoothing parameter t. Namely, if we under/overestimate∆ by a constant factor, the corresponding terms in the convergence rate will increase only by a constant factor. Similar remark holds for the assumption that L2 is known. Our Theorems 2.2 and 2.9 are applicable in two situations, the noise being a) controlled and b) uncontrolled. AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 19

p = 1 p = 2 p p t(∆) ∆/L2 ∆/L2  q   q  L2Θ1 L2Θ1 L2Θ2 nL2Θ2 N(∆) min n∆ , n∆ min ∆ , ∆

 2 2   2 2  m(∆) min σ , √ σ min σ , √ σ nL2∆ 3 nL2∆ 3 nL2Θ1∆ nL2Θ2∆  √  √ ε(∆) max n∆, nL2Θ1∆ max n∆, nL2Θ2∆ n 2 2 o n 2 2 o σ Θ1 σ σ Θ2 σ N(∆)m(∆) min 2 2 , min 2 , n ∆ nL2∆ n∆ L2∆ Table 5 Summary of the values for N, m, t and the total number of function value evaluations Nm guaranteeing for the cases p = 1 and p = 2 that Algorithm 2.2 outputs x¯N with minimal possible ex- pected objective residual ε if the oracle noise ∆ is uncontrolled. Numerical constants and logarithmic factors in n are omitted for simplicity.

a) Our assumptions on the noise level in Tables2 and4 can be met in practice. For example, in [14], the objective function is defined by some auxiliary prob- lem and its value can be calculated with accuracy ∆ at the cost proportional 1 1 to ln ∆ , which would result in only a ln ε factor in the total complexity of our methods in this paper combined with the method in [14] for approximating the function value. b) The minimum guaranteed accuracy ε(∆) in Tables3 and5 can not be arbi- trarily small, which is reasonable: one can not solve the problem with better accuracy than the accuracy of the available information. Interestingly, the minimal possible accuracy for the accelerated method could be larger than for the non-accelerated method, which means that accelerated methods are less robust to noise (cf. full gradient methods [27, 46]). To illustrate this, let us, for simplicity neglect the√ numerical constants and consider a case with σ = 0, L2 = 1, Θp = 1, t = 2 ∆, and large N  nρn. Then the main terms 2 2 n ρn N ∆ in the r.h.s. of (2.5) are 2 + . Minimizing in N, we have the minimal √ N nρn accuracy of the order n∆. Similarly, the main terms in the r.h.s. of (2.31) are nρn + N∆ . Minimizing in N, we have the minimal accuracy of the order N ρn 1 √ 1/2 1/2 /2 ∆ n ρn < n∆, which is smaller than for the accelerated method. Role of σ2. Although, all the related works, which we are aware of, assume σ2 to be known, adaptivity to the variance σ2 is a very important direction of future work. Note that similarly to the robustness to ∆, our method is robust to σ2.

3. Experiments. We performed several numerical experiments to illustrate our theoretical results. In particular, we compared our methods with the Euclidean and 1-norm proximal setups and the RSGF method from [41] applied to two problems: minimizing Nesterov’s function and logistic regression problem. For all the results reported below we tuned parameters αk and α for ARDFDS and RDFDS respectively and the stepsize parameter for RSGF. We use E and NE in the plots to refer to the methods with 2-norm and 1-norm proximal setups respectively and RSGF to refer to the method from [41]. The code was written in Python using standard libraries, see the details at https://github.com/eduardgorbunov/ardfds. 20 E. GORBUNOV, P. DVURECHENSKY, AND A. GASNIKOV

3.1. Experiments with Nesterov’s function. We tested our methods on the problem of minimizing Nesterov’s function [53] defined as:

" n−1 # ! L2 1 1 2 X i i+1 2 n 2 1 f(x) = 4 2 (x ) + (x − x ) + (x ) − x , i=1

i n where x is i-th component of vector x ∈ R . f is convex, L2-smooth w.r.t. the   ∗ L2 1 Euclidean norm and attains its minimal value f = 8 −1 + n+1 at the point ∗ ∗,1 ∗,n > ∗,i i x = (x , . . . , x ) such that x = 1 − n+1 . Moreover, the lower complexity bound for first-order methods in smooth convex optimization is attained [53] on this function. We add stochastic noise to this function and consider F (x, ξ) = f(x) + ξha, xi, where ξ is Gaussian with mean µ = 0 and variance σ2, a ∈ Rn is 2 some vector in the unit Euclidean sphere, i.e. kak2 = 1. This implies that f(x) = Eξ [F (x, ξ)] and F (x, ξ) is L2-smooth in x w.r.t. the Euclidean norm since g(x, ξ) −  2 g(y, ξ) = ∇f(x)−∇f(y). Moreover, Eξg(x, ξ) = ∇f(x) and Eξ kg(x, ξ) − ∇f(x)k2 = 2  2 2 n kak2Eξ ξ = σ for all x ∈ R . Also we introduce an additive noise η(x) = ∗ −2 n ∆ sin kx − x k2 . It is clear that |η(x)| ≤ ∆ for all x ∈ R . Overall, we are in the setting described in Introduction with fe(x, ξ) = F (x, ξ) + η(x) = f(x) + ξha, xi + −2 ∆ sin kx − x∗k2 . We compare our methods with the Euclidean and1-norm proximal setups as well as the RSGF method from [41] applied to this problem for different sparsity levels ∗ of x0 − x and different values of n, σ and ∆. For all tests we use L2 = 10, adjust ∗ 2 −8 p∆ starting point x0 such that f(x0) − f(x ) ∼ 10 and choose t = max{10 , 2 /L2}. The second term under the maximum in the definition of t corresponds to the optimal choice of t for given ∆ and L2, i.e., it minimizes the right-hand sides of (2.5) and (2.31), and the first term under the maximum is needed to prevent unstable computations when t is too small. 3.1.1. Experiments with different sparsity levels. In this set of experi- ments we considered different choices of the starting point x0 with different sparsity ∗ levels of x0 −x , i.e., for n = 100, 500, 1000 we picked such starting points x0 that vec- ∗ n n n n tor x0 −x has 1, /10, /2 non-zero components. In particular, we shift first 1, /10, /2 ∗ components of x by some constant to obtain x0. In order to isolate the effect of the sparsity from effects coming from the stochastic nature of fe(x, ξ) and noise η(x) we choose σ = ∆ = 0. Our results are reported in Figure1. As the theory predicts, our methods with p = 1 work increasingly better than our methods with p = 2 as n is ∗ growing when kx0 − x k0 is small. 3.1.2. Experiments with different variance. In this subsection we report the numerical results with different values of σ2. For each choice of the dimension n 3/2√ 2 2 ε nL2 2 2 −3 we used two values of σ : σ = ∗ and σ = 10000σ with ε = 10 . small kx0−x k1 big small 2 2 As one can see from Table2, when σ = σsmall the first term under the maximum in the complexity bound is dominating (up to logarithmic factors). This implies that ARDFDS with p = 1 is guaranteed to find an ε-solution even with the mini-batch −3 size m = 1 (up to logarithmical factors). We choose ε = 10 , ∆ = 0 and x0 such that it differs from x∗ only in the first component and run the experiments for 2 2 2 n = 100, 500, 1000 and σ = σsmall, σbig, see Figures2 and3. We see in Figure2 that 2 2 for σ = σsmall it is sufficient to use mini-batches of the size m = 1 to reach accuracy ε = 10−3 and the overall picture is very similar to the one presented in Figure1. AN ACCEL. METHOD FOR DER.-FREE SMOOTH STOCH. CONVEX OPTIMIZATION 21

[Figure 1 here: nine panels, one for each combination of $n \in \{100, 500, 1000\}$ and $\|x_0 - x^*\|_0 \in \{1, n/10, n/2\}$, all with $\sigma^2 = \Delta = 0$; each panel compares ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF, with the number of oracle calls on the horizontal axis and $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 1. Numerical results for minimizing Nesterov's function for different sparsity levels of $x_0 - x^*$ and dimensions $n$ of the problem.

In contrast, when $\sigma^2 = \sigma^2_{\mathrm{big}}$ (Figure 3) and $m = 1$, the methods fail to reach the target accuracy. In these tests accelerated methods show higher sensitivity to the noise and, as a consequence, we see that for $n = 500, 1000$ and $m = 10$ RDFDS_NE reaches the accuracy $\varepsilon = 10^{-3}$ faster than its competitors. This confirms the insight given by our theory: when the variance is large, non-accelerated methods require a smaller mini-batch size $m$ and are able to find an $\varepsilon$-solution faster than their accelerated counterparts.

3.1.3. Experiments with different noise levels of the oracle. Here we present the numerical experiments with different values of $\Delta$. To isolate the effect of the non-stochastic noise, we set $\sigma = 0$ for all tests reported in this subsection. We run the methods for problems with $n = 100, 500, 1000$ and chose the starting point in the same way as in Subsection 3.1.2. For each choice of the dimension $n$ we used three values of $\Delta$: $\Delta_{\mathrm{small}} = \min\left\{\frac{\varepsilon^{3/2}}{\sqrt{nL_2}\,\|x_0-x^*\|_1},\ \frac{2\varepsilon^2}{L_2\|x_0-x^*\|_1^2\, n\ln n}\right\}$, $\Delta_{\mathrm{medium}} = 10^{3}\cdot\Delta_{\mathrm{small}}$ and $\Delta_{\mathrm{large}} = 10^{6}\cdot\Delta_{\mathrm{small}}$ with $\varepsilon = 10^{-3}$. As one can see from Table 2, when $\Delta = \Delta_{\mathrm{small}}$, ARDFDS with $p = 1$ is guaranteed to find an $\varepsilon$-solution. The results are reported in Figure 4. We see that for larger values of $\Delta$ accelerated methods achieve worse accuracy than for small values of $\Delta$. However, in all experiments our methods succeeded in reaching an $\varepsilon$-solution with $\varepsilon = 10^{-3}$, meaning that in practice the noise level $\Delta$ can be much larger than prescribed by our theory.

3.1.4. Experiment with large dimension. In Figure 5 we report the experimental results for $n = 5000$, $\sigma^2 = \sigma^2_{\mathrm{small}} = \frac{\varepsilon^{3/2}\sqrt{nL_2}}{\|x_0-x^*\|_1}$, $\Delta = \Delta_{\mathrm{small}} = \min\left\{\frac{\varepsilon^{3/2}}{\sqrt{nL_2}\,\|x_0-x^*\|_1},\ \frac{2\varepsilon^2}{L_2\|x_0-x^*\|_1^2\, n\ln n}\right\}$, and $\varepsilon = 10^{-3}$. The obtained results are in good agreement with our theory and with the experimental results for smaller dimensions.
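For concreteness, here is a short Python helper (ours; it merely encodes the reconstructed formulas for $\sigma^2_{\mathrm{small}}$ and $\Delta_{\mathrm{small}}$ from Subsections 3.1.2--3.1.4, and the numerical values are only illustrative).

    import numpy as np

    def sigma2_small(eps, n, L2, R1):
        # sigma^2_small = eps^{3/2} * sqrt(n * L2) / ||x_0 - x^*||_1
        return eps ** 1.5 * np.sqrt(n * L2) / R1

    def delta_small(eps, n, L2, R1):
        # Delta_small = min{ eps^{3/2} / (sqrt(n*L2) * ||x_0 - x^*||_1),
        #                    2*eps^2  / (L2 * ||x_0 - x^*||_1^2 * n * ln n) }
        return min(eps ** 1.5 / (np.sqrt(n * L2) * R1),
                   2.0 * eps ** 2 / (L2 * R1 ** 2 * n * np.log(n)))

    eps, n, L2, R1 = 1e-3, 1000, 10.0, 10.0   # R1 = ||x_0 - x^*||_1, illustrative value
    print("sigma^2_small =", sigma2_small(eps, n, L2, R1), "sigma^2_big =", 1e4 * sigma2_small(eps, n, L2, R1))
    print("Delta_small   =", delta_small(eps, n, L2, R1), "Delta_medium =", 1e3 * delta_small(eps, n, L2, R1))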

[Figure 2 here: nine panels, one for each combination of $n \in \{100, 500, 1000\}$ and $m \in \{1, 10, 100\}$, all with $\sigma^2 = \sigma^2_{\mathrm{small}}$ and $\Delta = 0$; each panel compares ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF, with the number of oracle calls on the horizontal axis and $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 2. Numerical results for minimizing Nesterov's function with noisy stochastic oracle having $\sigma^2 = \sigma^2_{\mathrm{small}}$ for different sizes of mini-batch $m$ and dimensions of the problem $n$.

3.2. Experiments with logistic regression. In this subsection we report the numerical results for our methods applied to the logistic regression problem:

(3.1)    $\min_{x\in\mathbb{R}^n}\left\{ f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)\right\}, \qquad f_i(x) = \log\left(1+\exp\left(-y_i\cdot (Ax)_i\right)\right).$

Here $f_i(x)$ is the loss on the $i$-th data point, $A \in \mathbb{R}^{M\times n}$ is a matrix of instances, $y \in \{-1,1\}^M$ is a vector of labels and $x \in \mathbb{R}^n$ is a vector of parameters (or weights). It can be easily shown that $f(x)$ is convex and $L_2$-smooth w.r.t. the Euclidean norm with $L_2 = \lambda_{\max}(A^\top A)/(4M)$, where $\lambda_{\max}(A^\top A)$ denotes the maximal eigenvalue of $A^\top A$. Moreover, problem (3.1) is a special case of (1.1) with $\xi$ being a random variable with the uniform distribution on $\{1,\dots,M\}$. For our experiments we use the data from the LIBSVM library [20]; see also Table 6 summarizing the information about the datasets we used. In all tests we chose $t = 10^{-8}$

                 heart   diabetes   a9a     phishing   w8a
  Size M         270     768        32561   11055      49749
  Dimension n    13      8          123     68         300

Table 6. Summary of used datasets.

and the starting point $x_0$ such that it differs from $x^*$ only in the first component and $f(x_0) - f(x^*) \sim 10$. We use standard solvers from the scipy library to obtain a very good approximation of the solution $x^*$ and use it to measure the quality of the approximations obtained by the other algorithms.
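To illustrate this setup, here is a minimal Python sketch (ours, not the authors' code) that loads one of the LIBSVM datasets from Table 6, evaluates the objective (3.1) and its single-sample stochastic counterpart, and computes $L_2 = \lambda_{\max}(A^\top A)/(4M)$. It assumes the file "a9a" has been downloaded from the LIBSVM page and uses sklearn and scipy purely as a convenience.

    import numpy as np
    from sklearn.datasets import load_svmlight_file   # reads LIBSVM-format files
    from scipy.sparse.linalg import eigsh

    A, y = load_svmlight_file("a9a")                   # instances A (sparse, M x n), labels y in {-1, +1}
    M, n = A.shape

    def f_full(x):
        # f(x) = (1/M) * sum_i log(1 + exp(-y_i * (A x)_i)), cf. (3.1); logaddexp avoids overflow
        return np.mean(np.logaddexp(0.0, -y * (A @ x)))

    def f_stochastic(x, rng):
        # Stochastic oracle F(x, xi): the loss f_i(x) for an index i uniform on {1, ..., M}
        i = rng.integers(M)
        return np.logaddexp(0.0, -y[i] * (A[i] @ x)[0])

    # Smoothness constant L_2 = lambda_max(A^T A) / (4 M)
    lam_max = eigsh((A.T @ A).asfptype(), k=1, which="LM", return_eigenvectors=False)[0]
    L2 = lam_max / (4.0 * M)

    rng = np.random.default_rng(0)
    x0 = np.zeros(n); x0[0] = 1.0                      # a starting point differing in one component
    print(M, n, L2, f_full(x0), f_stochastic(x0, rng))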

[Figure 3 here: twelve panels, one for each combination of $n \in \{100, 500, 1000\}$ and $m \in \{1, 10, 100, 1000\}$, all with $\sigma^2 = \sigma^2_{\mathrm{big}}$ and $\Delta = 0$; each panel compares ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF, with the number of oracle calls on the horizontal axis and $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 3. Numerical results for minimizing Nesterov's function with noisy stochastic oracle having $\sigma^2 = \sigma^2_{\mathrm{big}}$ for different sizes of mini-batch $m$ and dimensions of the problem $n$.

The results for the batch (and hence deterministic) methods with $m = M$ and for the mini-batch stochastic methods are presented in Figures 6 and 7, respectively. In all cases the methods with the 1-norm proximal setup show the best results or results comparable with the best.

4. Conclusion. In this paper, we propose two new algorithms for stochastic smooth derivative-free convex optimization with two-point feedback and an inexact function values oracle. Our first algorithm is accelerated and the second one is non-accelerated. Notably, despite the traditional choice of the 2-norm proximal setup for unconstrained optimization problems, our analysis yields better complexity bounds for the method with the 1-norm proximal setup than for the one with the 2-norm proximal setup. This is also confirmed by the numerical experiments.

REFERENCES

[1] A. Agarwal, O. Dekel, and L. Xiao, Optimal algorithms for online convex optimization with multi-point bandit feedback, in COLT 2010 - The 23rd Conference on Learning Theory, 2010.
[2] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin, Stochastic convex optimization with bandit feedback, in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds., Curran Associates, Inc., 2011, pp. 1035–1043.

[Figure 4 here: nine panels, one for each combination of $n \in \{100, 500, 1000\}$ and $\Delta \in \{\Delta_{\mathrm{small}}, \Delta_{\mathrm{medium}}, \Delta_{\mathrm{large}}\}$, all with $\sigma^2 = 0$ and $m = 1$; each panel compares ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF, with the number of oracle calls on the horizontal axis and $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 4. Numerical results for minimizing Nesterov's function with noisy stochastic oracle having $\sigma^2 = 0$ for different $\Delta$ and dimensions of the problem $n$.

[Figure 5 here: a single panel with $n = 5000$, $\sigma^2 = \sigma^2_{\mathrm{small}}$, $\Delta = \Delta_{\mathrm{small}}$, $m = 1$, comparing ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF; number of oracle calls on the horizontal axis, $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 5. Numerical results for minimizing Nesterov's function with noisy stochastic oracle having $\sigma^2 = \sigma^2_{\mathrm{small}}$ and $\Delta = \Delta_{\mathrm{small}}$ for the dimension of the problem $n = 5000$.

[3] A. Akhavan, M. Pontil, and A. B. Tsybakov, Exploiting higher order smoothness in derivative-free optimization and continuous bandits, arXiv:2006.07862, (2020).
[4] Z. Allen-Zhu and L. Orecchia, Linear coupling: An ultimate unification of gradient and mirror descent, arXiv:1407.1537, (2014).
[5] F. Bach and V. Perchet, Highly-smooth zero-th order online optimization, in 29th Annual Conference on Learning Theory, V. Feldman, A. Rakhlin, and O. Shamir, eds., vol. 49 of Proceedings of Machine Learning Research, Columbia University, New York, New York, USA, 23–26 Jun 2016, PMLR, pp. 257–283.
[6] P. L. Bartlett, V. Gabillon, and M. Valko, A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption, in Proceedings of the 30th International Conference on Algorithmic Learning Theory, A. Garivier and S. Kale, eds., vol. 98 of Proceedings of Machine Learning Research, Chicago, Illinois, 22–24 Mar 2019, PMLR, pp. 184–206.

[Figure 6 here: five panels, one per dataset (heart, diabetes, a9a, phishing, w8a), all with full batch $m = M$; each panel compares ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF, with the number of oracle calls on the horizontal axis and $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 6. Numerical results for solving the logistic regression problem (3.1) for different datasets using batch methods with $m = M$.

[Figure 7 here: two panels, a9a ($M = 32561$, $n = 123$, $m = 100$) and w8a ($M = 49749$, $n = 300$, $m = 100$), each comparing ARDFDS_E, ARDFDS_NE, RDFDS_E, RDFDS_NE and RSGF; number of oracle calls on the horizontal axis, $(f(x^k)-f(x^*))/(f(x^0)-f(x^*))$ on the vertical axis.]

Fig. 7. Numerical results for solving the logistic regression problem (3.1) for different datasets using mini-batch stochastic methods.

[7] A. Bayandina, A. Gasnikov, and A. Lagunovskaya, Gradient-free two-points optimal method for non-smooth stochastic convex optimization problem with additional small noise, Automation and Remote Control, 79 (2018). arXiv:1701.03821.
[8] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin, Escaping the local minima via simulated annealing: Optimization of approximately convex functions, in Proceedings of The 28th Conference on Learning Theory, P. Grünwald, E. Hazan, and S. Kale, eds., vol. 40 of Proceedings of Machine Learning Research, Paris, France, 03–06 Jul 2015, PMLR, pp. 240–265.
[9] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization (Lecture Notes), Personal web-page of A. Nemirovski, 2020, https://www2.isye.gatech.edu/~nemirovs/LMCOLN2020WithSol.pdf.
[10] A. S. Berahas, R. H. Byrd, and J. Nocedal, Derivative-free optimization of noisy functions via quasi-Newton methods, SIAM Journal on Optimization, 29 (2019), pp. 965–993, https://doi.org/10.1137/18M1177718.
[11] A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg, A theoretical and empirical comparison of gradient approximations in derivative-free optimization, arXiv:1905.01332, (2019).
[12] A. Beznosikov, E. Gorbunov, and A. Gasnikov, Derivative-free method for composite optimization with applications to decentralized distributed optimization, IFAC-PapersOnLine, (2020). Accepted, arXiv:1911.10645.

[13] A. Beznosikov, A. Sadiev, and A. Gasnikov, Gradient-free methods for saddle-point problem, in Mathematical Optimization Theory and Operations Research 2020, A. Kononov et al., eds., Cham, 2020, Springer International Publishing. Accepted, arXiv:2005.05913.
[14] L. Bogolubsky, P. Dvurechensky, A. Gasnikov, G. Gusev, Y. Nesterov, A. M. Raigorodskii, A. Tikhonov, and M. Zhukovskii, Learning supervised pagerank with gradient-based and gradient-free optimization methods, in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds., Curran Associates, Inc., 2016, pp. 4914–4922. arXiv:1603.00717.
[15] R. Bollapragada and S. M. Wild, Adaptive sampling quasi-Newton methods for derivative-free stochastic optimization, arXiv:1910.13516, (2019).
[16] R. Brent, Algorithms for Minimization Without Derivatives, Dover Books on Mathematics, Dover Publications, 1973.
[17] S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, 5 (2012), pp. 1–122, https://doi.org/10.1561/2200000024.
[18] S. Bubeck, Y. T. Lee, and R. Eldan, Kernel-based methods for bandit convex optimization, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, New York, NY, USA, 2017, ACM, pp. 72–85. arXiv:1607.03084.
[19] E. J. Candes, J. K. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics, 59 (2006), pp. 1207–1223, https://doi.org/10.1002/cpa.20124.
[20] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2 (2011), pp. 1–27.
[21] Y. Chen, A. Orvieto, and A. Lucchi, An accelerated DFO algorithm for finite-sum convex functions, in Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, 2020. Accepted, arXiv:2007.03311.
[22] K. Choromanski, A. Iscen, V. Sindhwani, J. Tan, and E. Coumans, Optimizing simulations with noise-tolerant structured exploration, in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2970–2977.
[23] K. Choromanski, M. Rowland, V. Sindhwani, R. Turner, and A. Weller, Structured evolution with compact architectures for scalable policy optimization, in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., vol. 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018, PMLR, pp. 970–978.
[24] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, Society for Industrial and Applied Mathematics, 2009, https://doi.org/10.1137/1.9780898718768.
[25] O. Dekel, R. Eldan, and T. Koren, Bandit smooth convex optimization: Improving the bias-variance tradeoff, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., Curran Associates, Inc., 2015, pp. 2926–2934.
[26] O. Devolder, Stochastic first order methods in smooth convex optimization, CORE Discussion Paper 2011/70, (2011).
[27] O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 146 (2014), pp. 37–75.
[28] J. Dippon, Accelerated randomized stochastic optimization, Ann. Statist., 31 (2003), pp. 1260–1281, https://doi.org/10.1214/aos/1059655913.
[29] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[30] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, Optimal rates for zero-order convex optimization: The power of two function evaluations, IEEE Trans. Information Theory, 61 (2015), pp. 2788–2806. arXiv:1312.2139.
[31] P. Dvurechensky and A. Gasnikov, Stochastic intermediate gradient method for convex problems with stochastic inexact oracle, Journal of Optimization Theory and Applications, 171 (2016), pp. 121–145, https://doi.org/10.1007/s10957-016-0999-6.
[32] P. Dvurechensky, A. Gasnikov, and E. Gorbunov, An accelerated directional derivative method for smooth stochastic convex optimization, arXiv:1804.02394, (2018).
[33] P. Dvurechensky, A. Gasnikov, and A. Tiurin, Randomized similar triangles method: A unifying framework for accelerated randomized optimization methods (coordinate descent, directional search, derivative-free method), arXiv:1707.08486, (2017).

[34] V. Fabian, Stochastic approximation of minima with improved asymptotic speed, Ann. Math. Statist., 38 (1967), pp. 191–200, https://doi.org/10.1214/aoms/1177699070.
[35] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., vol. 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018, PMLR, pp. 1467–1476.
[36] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, Online convex optimization in the bandit setting: Gradient descent without a gradient, in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, Philadelphia, PA, USA, 2005, Society for Industrial and Applied Mathematics, pp. 385–394.
[37] A. Gasnikov, P. Dvurechensky, and Y. Nesterov, Stochastic gradient methods with inexact oracle, Proceedings of Moscow Institute of Physics and Technology, 8 (2016), pp. 41–91. In Russian, first appeared in arXiv:1411.4218.
[38] A. V. Gasnikov and P. E. Dvurechensky, Stochastic intermediate gradient method for convex optimization problems, Doklady Mathematics, 93 (2016), pp. 148–151.
[39] A. V. Gasnikov, E. A. Krymova, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko, Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case, Automation and Remote Control, 78 (2017), pp. 224–234, https://doi.org/10.1134/S0005117917020035. arXiv:1509.01679.
[40] A. V. Gasnikov, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko, Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex, Automation and Remote Control, 77 (2016), pp. 2018–2034, https://doi.org/10.1134/S0005117916110114. arXiv:1412.3890.
[41] S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization, 23 (2013), pp. 2341–2368. arXiv:1309.5549.
[42] S. Ghadimi, G. Lan, and H. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Mathematical Programming, 155 (2016), pp. 267–305, https://doi.org/10.1007/s10107-014-0846-1. arXiv:1308.6594.
[43] E. Gorbunov, D. Dvinskikh, and A. Gasnikov, Optimal decentralized distributed algorithms for stochastic convex optimization, arXiv preprint arXiv:1911.07363, (2019).
[44] E. Hazan and K. Levy, Bandit convex optimization: Towards tight bounds, in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Curran Associates, Inc., 2014, pp. 784–792.
[45] K. G. Jamieson, R. Nowak, and B. Recht, Query complexity of derivative-free optimization, in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds., Curran Associates, Inc., 2012, pp. 2672–2680.
[46] D. Kamzolov, P. Dvurechensky, and A. V. Gasnikov, Universal intermediate gradient method for convex problems with inexact oracle, Optimization Methods and Software, 0 (2020), pp. 1–28, https://doi.org/10.1080/10556788.2019.1711079. arXiv:1712.06036.
[47] G. Lan, An optimal method for stochastic composite optimization, Mathematical Programming, 133 (2012), pp. 365–397. First appeared in June 2008.
[48] J. Larson, M. Menickelly, and S. M. Wild, Derivative-free optimization methods, Acta Numerica, 28 (2019), pp. 287–404, https://doi.org/10.1017/S0962492919000060.
[49] T. Liang, H. Narayanan, and A. Rakhlin, On zeroth-order stochastic convex optimization via random walks, arXiv:1402.2667, (2014).
[50] A. Locatelli and A. Carpentier, Adaptivity to smoothness in X-armed bandits, in Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet, eds., vol. 75 of Proceedings of Machine Learning Research, PMLR, 06–09 Jul 2018, pp. 1463–1492.
[51] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609, https://doi.org/10.1137/070704277.
[52] A. Nemirovsky and D. Yudin, Problem Complexity and Method Efficiency in Optimization, J. Wiley & Sons, New York, 1983.
[53] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Massachusetts, 2004.
[54] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127–152, https://doi.org/10.1007/s10107-004-0552-5.
[55] Y. Nesterov and V. Spokoiny, Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), pp. 527–566, https://doi.org/10.1007/s10208-015-9296-2. First appeared in 2011 as CORE discussion paper 2011/16.
[56] B. T. Polyak and A. B. Tsybakov, Optimal order of accuracy of search algorithms in stochastic optimization, Problemy Peredachi Informatsii, 26 (1990), pp. 45–53.

[57] V. Y. Protasov, Algorithms for approximate calculation of the minimum of a convex function from its values, Mathematical Notes, 59 (1996), pp. 69–74.
[58] H. H. Rosenbrock, An automatic method for finding the greatest or least value of a function, The Computer Journal, 3 (1960), pp. 175–184, https://doi.org/10.1093/comjnl/3.3.175.
[59] A. Saha and A. Tewari, Improved regret guarantees for online smooth convex optimization with bandit feedback, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 636–642.
[60] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, Evolution strategies as a scalable alternative to reinforcement learning, arXiv:1703.03864, (2017).
[61] O. Shamir, On the complexity of bandit and derivative-free stochastic convex optimization, in Proceedings of the 26th Annual Conference on Learning Theory, S. Shalev-Shwartz and I. Steinwart, eds., vol. 30 of Proceedings of Machine Learning Research, Princeton, NJ, USA, 12–14 Jun 2013, PMLR, pp. 3–24.
[62] O. Shamir, An optimal algorithm for bandit and zero-order convex optimization with two-point feedback, Journal of Machine Learning Research, 18 (2017), pp. 52:1–52:11. First appeared in arXiv:1507.08752.
[63] J. C. Spall, Introduction to Stochastic Search and Optimization, John Wiley & Sons, Inc., New York, NY, USA, 1 ed., 2003.
[64] S. U. Stich, C. L. Müller, and B. Gärtner, Optimization of convex functions with random pursuit, SIAM Journal on Optimization, 23 (2013), pp. 1284–1309.

Appendix A. Proof of Lemma 2.1. In this appendix we prove that, for $e \in RS_2(1)$, $q \ge 2$, and $n \ge 8$,

(A.1)    $\mathbb{E}[\|e\|_q^2] \le \min\{q-1,\,16\ln n - 8\}\, n^{\frac{2}{q}-1},$

(A.2)    $\mathbb{E}[\langle s, e\rangle^2\,\|e\|_q^2] \le 6\,\|s\|_2^2\,\min\{q-1,\,16\ln n - 8\}\, n^{\frac{2}{q}-2}.$

Throughout this appendix, to simplify the notation, we denote by $\mathbb{E}$ the expectation w.r.t. the random vector $e \in RS_2(1)$. We start by proving the following inequality, which could be not tight for large $q$:

(A.3)    $\mathbb{E}[\|e\|_q^2] \le (q-1)\, n^{\frac{2}{q}-1}, \qquad 2 \le q < \infty.$

We have

(A.4)    $\mathbb{E}[\|e\|_q^2] = \mathbb{E}\left[\left(\sum_{k=1}^{n} |e_k|^q\right)^{\frac{2}{q}}\right] \overset{(i)}{\le} \left(\mathbb{E}\left[\sum_{k=1}^{n} |e_k|^q\right]\right)^{\frac{2}{q}} \overset{(ii)}{=} \left(n\,\mathbb{E}[|e_2|^q]\right)^{\frac{2}{q}},$

where (i) is due to the probabilistic version of Jensen's inequality (the function $\varphi(x) = x^{\frac{2}{q}}$ is concave because $q \ge 2$) and (ii) is because the expectation is linear and the components of the vector $e$ are identically distributed. Here and below we denote by $e_k$ the $k$-th component of $e$; in particular, $e_2$ is the second component.

By the Poincaré lemma, $e$ has the same distribution as $\frac{\xi}{\sqrt{\xi_1^2+\dots+\xi_n^2}}$, where $\xi$ is the standard Gaussian random vector with zero mean and identity covariance matrix. Then

$\mathbb{E}[|e_2|^q] = \mathbb{E}\left[\frac{|\xi_2|^q}{(\xi_1^2+\dots+\xi_n^2)^{\frac{q}{2}}}\right] = \int_{\mathbb{R}^n} |x_2|^q \left(\sum_{k=1}^{n} x_k^2\right)^{-\frac{q}{2}} \cdot \frac{1}{(2\pi)^{\frac{n}{2}}} \exp\left(-\frac{1}{2}\sum_{k=1}^{n} x_k^2\right) dx_1 \dots dx_n.$

For the transition to the spherical coordinates

$x_1 = r\cos\varphi\,\sin\theta_1\cdots\sin\theta_{n-2}, \quad x_2 = r\sin\varphi\,\sin\theta_1\cdots\sin\theta_{n-2}, \quad x_3 = r\cos\theta_1\,\sin\theta_2\cdots\sin\theta_{n-2}, \quad x_4 = r\cos\theta_2\,\sin\theta_3\cdots\sin\theta_{n-2}, \quad \dots, \quad x_n = r\cos\theta_{n-2},$
$r \ge 0, \quad \varphi\in[0,2\pi), \quad \theta_i\in[0,\pi], \quad i = 1,\dots,n-2,$

the Jacobian satisfies

$\det\left(\frac{\partial(x_1,\dots,x_n)}{\partial(r,\varphi,\theta_1,\theta_2,\dots,\theta_{n-2})}\right) = r^{n-1}\,\sin\theta_1\,(\sin\theta_2)^2\cdots(\sin\theta_{n-2})^{n-2}.$

In the new coordinates we have

$\mathbb{E}[|e_2|^q] = \int_{\substack{r\ge 0,\ \varphi\in[0,2\pi),\\ \theta_i\in[0,\pi],\ i=1,\dots,n-2}} r^{n-1}\,|\sin\varphi|^q\,|\sin\theta_1|^{q+1}\,|\sin\theta_2|^{q+2}\cdots|\sin\theta_{n-2}|^{q+n-2}\cdot\frac{\exp\left(-\frac{r^2}{2}\right)}{(2\pi)^{\frac{n}{2}}}\,dr\,d\varphi\,d\theta_1\dots d\theta_{n-2} = \frac{1}{(2\pi)^{\frac{n}{2}}}\, I_r\cdot I_\varphi\cdot I_{\theta_1}\cdot I_{\theta_2}\cdots I_{\theta_{n-2}},$

where $I_r = \int_0^{+\infty} r^{n-1}\exp\left(-\frac{r^2}{2}\right)dr$, $I_\varphi = \int_0^{2\pi}|\sin\varphi|^q\,d\varphi = 2\int_0^{\pi}|\sin\varphi|^q\,d\varphi$, and $I_{\theta_i} = \int_0^{\pi}|\sin\theta_i|^{q+i}\,d\theta_i$ for $i = 1,\dots,n-2$. Next we calculate these integrals, starting with $I_r$:

$I_r = \int_0^{+\infty} r^{n-1}\exp\left(-\frac{r^2}{2}\right)dr \overset{r=\sqrt{2t}}{=} \int_0^{+\infty}(2t)^{\frac{n}{2}-1}\exp(-t)\,dt = 2^{\frac{n}{2}-1}\,\Gamma\left(\frac{n}{2}\right).$

To compute the other integrals, we consider the following integral for α > 0:

$\int_0^{\pi}|\sin\varphi|^{\alpha}\,d\varphi = 2\int_0^{\frac{\pi}{2}}|\sin\varphi|^{\alpha}\,d\varphi = 2\int_0^{\frac{\pi}{2}}(\sin^2\varphi)^{\frac{\alpha}{2}}\,d\varphi \overset{t=\sin^2\varphi}{=} \int_0^{1} t^{\frac{\alpha-1}{2}}(1-t)^{-\frac{1}{2}}\,dt = B\left(\frac{\alpha+1}{2},\frac{1}{2}\right) = \frac{\Gamma\left(\frac{\alpha+1}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{\alpha+2}{2}\right)} = \sqrt{\pi}\,\frac{\Gamma\left(\frac{\alpha+1}{2}\right)}{\Gamma\left(\frac{\alpha+2}{2}\right)}.$

This gives

(A.5)    $\mathbb{E}[|e_2|^q] = \frac{1}{(2\pi)^{\frac{n}{2}}}\, I_r\cdot I_\varphi\cdot I_{\theta_1}\cdot I_{\theta_2}\cdots I_{\theta_{n-2}} = \frac{1}{(2\pi)^{\frac{n}{2}}}\cdot 2^{\frac{n}{2}-1}\Gamma\left(\frac{n}{2}\right)\cdot 2\sqrt{\pi}\,\frac{\Gamma\left(\frac{q+1}{2}\right)}{\Gamma\left(\frac{q+2}{2}\right)}\cdot\sqrt{\pi}\,\frac{\Gamma\left(\frac{q+2}{2}\right)}{\Gamma\left(\frac{q+3}{2}\right)}\cdots\sqrt{\pi}\,\frac{\Gamma\left(\frac{q+n-1}{2}\right)}{\Gamma\left(\frac{q+n}{2}\right)} = \frac{1}{\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{q+1}{2}\right)}{\Gamma\left(\frac{q+n}{2}\right)}.$
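As a quick sanity check (ours, not part of the original argument) of the integral formula used in the last step, take $\alpha = 2$:

$\int_0^{\pi}\sin^2\varphi\,d\varphi = \frac{\pi}{2}, \qquad \sqrt{\pi}\,\frac{\Gamma\left(\frac{2+1}{2}\right)}{\Gamma\left(\frac{2+2}{2}\right)} = \sqrt{\pi}\cdot\frac{\frac{\sqrt{\pi}}{2}}{1} = \frac{\pi}{2},$

so the two sides agree, using $\Gamma\left(\frac{3}{2}\right) = \frac{\sqrt{\pi}}{2}$ and $\Gamma(2) = 1$.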

The next step is to show that, for all $q \ge 2$,

(A.6)    $\frac{1}{\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{q+1}{2}\right)}{\Gamma\left(\frac{q+n}{2}\right)} \le \left(\frac{q-1}{n}\right)^{\frac{q}{2}}.$

First we show that (A.6) holds (with equality) for $q = 2$ and arbitrary $n$:

$\frac{1}{\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{2+1}{2}\right)}{\Gamma\left(\frac{2+n}{2}\right)} - \frac{1}{n} = \frac{1}{\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n}{2}\right)\cdot\frac{1}{2}\Gamma\left(\frac{1}{2}\right)}{\frac{n}{2}\,\Gamma\left(\frac{n}{2}\right)} - \frac{1}{n} = \frac{1}{n} - \frac{1}{n} = 0 \le 0.$

Next, we consider the function $f_n(q) = \frac{1}{\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{q+1}{2}\right)}{\Gamma\left(\frac{q+n}{2}\right)} - \left(\frac{q-1}{n}\right)^{\frac{q}{2}}$ for $q \ge 2$, and the digamma function $\psi(x) = \frac{d\,\ln\Gamma(x)}{dx}$ with scalar argument $x > 0$. For the gamma function it holds that $\Gamma(x+1) = x\Gamma(x)$, $x > 0$. Taking the natural logarithm of both sides and the derivative w.r.t. $x$, we get $\frac{d\,\ln\Gamma(x+1)}{dx} = \frac{d\,\ln\Gamma(x)}{dx} + \frac{1}{x}$, meaning that $\psi(x+1) = \psi(x) + \frac{1}{x}$. To prove that the digamma function monotonically increases for $x > 0$, we show that

(A.7)    $(\Gamma'(x))^2 < \Gamma(x)\Gamma''(x).$

Indeed,

$(\Gamma'(x))^2 = \left(\int_0^{+\infty}\exp(-t)\ln t\cdot t^{x-1}\,dt\right)^2 \overset{(i)}{<} \left(\int_0^{+\infty}\left(\exp\left(-\tfrac{t}{2}\right)t^{\frac{x-1}{2}}\right)^2 dt\right)\cdot\left(\int_0^{+\infty}\left(\exp\left(-\tfrac{t}{2}\right)t^{\frac{x-1}{2}}\ln t\right)^2 dt\right) = \int_0^{+\infty}\exp(-t)t^{x-1}\,dt\cdot\int_0^{+\infty}\exp(-t)t^{x-1}\ln^2 t\,dt = \Gamma(x)\Gamma''(x),$

where (i) follows from the Cauchy--Schwarz inequality, and the inequality is strict since the functions $\exp\left(-\frac{t}{2}\right)t^{\frac{x-1}{2}}$ and $\exp\left(-\frac{t}{2}\right)t^{\frac{x-1}{2}}\ln t$ are linearly independent. From (A.7) it follows that $\frac{d^2\,\ln\Gamma(x)}{dx^2} = \left(\frac{\Gamma'(x)}{\Gamma(x)}\right)' = \frac{\Gamma''(x)}{\Gamma(x)} - \frac{(\Gamma'(x))^2}{(\Gamma(x))^2} \overset{(A.7)}{>} 0$, i.e., the digamma function is increasing.

Now we show that $f_n(q) \le 0$ on the interval $[2,+\infty)$, which is exactly (A.6). Since both terms in the definition of $f_n$ are positive, it suffices to show that their log-ratio

$g_n(q) = \ln\left(\frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}}\right) + \ln\Gamma\left(\frac{q+1}{2}\right) - \ln\Gamma\left(\frac{q+n}{2}\right) - \frac{q}{2}\left(\ln(q-1) - \ln n\right)$

is non-positive for $q \ge 2$. By the computation above, the two terms coincide at $q = 2$, i.e., $g_n(2) = 0$, so it is enough to show that $g_n$ decreases on $[2,+\infty)$. We have

$\frac{dg_n(q)}{dq} = \frac{1}{2}\psi\left(\frac{q+1}{2}\right) - \frac{1}{2}\psi\left(\frac{q+n}{2}\right) - \frac{1}{2}\ln(q-1) - \frac{q}{2(q-1)} + \frac{1}{2}\ln n$

and show that $\frac{dg_n(q)}{dq} < 0$ for $q \ge 2$. Let $k = \left\lfloor\frac{n}{2}\right\rfloor$ (the largest integer which is no greater than $\frac{n}{2}$). Then $\psi\left(\frac{q+n}{2}\right) \ge \psi\left(k-1+\frac{q+1}{2}\right)$ and $\ln n \le \ln(2k+1)$, whence

$\frac{dg_n(q)}{dq} < \frac{1}{2}\left(\psi\left(\frac{q+1}{2}\right) - \psi\left(k-1+\frac{q+1}{2}\right)\right) - \frac{1}{2}\ln(q-1) - \frac{q}{2(q-1)} + \frac{1}{2}\ln(2k+1)$
$= \frac{1}{2}\left(\psi\left(\frac{q+1}{2}\right) - \sum_{i=1}^{k-1}\frac{1}{\frac{q+1}{2}+k-i-1} - \psi\left(\frac{q+1}{2}\right)\right) - \frac{q}{2(q-1)} + \frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right)$
$\overset{(i)}{\le} -\frac{1}{2}\sum_{i=1}^{k-1}\frac{2}{q-1+2k-2i} - \frac{1}{q-1} + \frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right)$
$= -\frac{1}{2}\left(\frac{2}{q-1} + \frac{2}{q+1} + \frac{2}{q+3} + \dots + \frac{2}{q+2k-3}\right) + \frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right)$
$\overset{(ii)}{<} -\frac{1}{2}\ln\left(\frac{q+2k-1}{q-1}\right) + \frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right) \overset{(iii)}{\le} -\frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right) + \frac{1}{2}\ln\left(\frac{2k+1}{q-1}\right) = 0,$

where (i) and (iii) hold since $q \ge 2$, and (ii) follows from estimating the sum from below by the integral of $\frac{1}{x}$: comparing $\frac{1}{x}$ with the constant functions $g_i(x) = \frac{1}{q-1+2i}$ on $x \in [q-1+2i,\ q-1+2i+2]$, $i = 0,\dots,k-1$, we get

$\frac{2}{q-1} + \frac{2}{q+1} + \frac{2}{q+3} + \dots + \frac{2}{q+2k-3} > \int_{q-1}^{q+2k-1}\frac{dx}{x} = \ln\left(\frac{q+2k-1}{q-1}\right).$

Thus, we have shown that $\frac{dg_n(q)}{dq} < 0$ for $q \ge 2$ and an arbitrary natural number $n$. Therefore, for any fixed $n$, the function $g_n(q)$ decreases as $q$ increases, which means that $g_n(q) \le g_n(2) = 0$, i.e., $f_n(q) \le 0$ and (A.6) holds. From this, (A.4), and (A.5) we obtain (A.3), i.e. that, for all $2 \le q < \infty$,

(A.8)    $\mathbb{E}[\|e\|_q^2] \overset{(A.4)}{\le} \left(n\,\mathbb{E}[|e_2|^q]\right)^{\frac{2}{q}} \overset{(A.5),(A.6)}{\le} (q-1)\,n^{\frac{2}{q}-1}.$

Next, we analyze separately the case of large $q$, in particular, $q = \infty$. We consider the r.h.s. of (A.8) as a function of $q$ and find its minimum for $q \ge 2$. Denote $h_n(q) = \ln(q-1) + \left(\frac{2}{q}-1\right)\ln n$, which is the logarithm of the r.h.s. of (A.8). The derivative of $h_n(q)$ is $\frac{dh_n(q)}{dq} = \frac{1}{q-1} - \frac{2\ln n}{q^2}$, which implies that the first-order optimality condition is $\frac{1}{q-1} - \frac{2\ln n}{q^2} = 0$, or equivalently $q^2 - 2q\ln n + 2\ln n = 0$. If $n \ge 8$, then the function $h_n(q)$ attains its minimum on the set $[2,+\infty)$ at $q_0 = \ln n\left(1+\sqrt{1-\frac{2}{\ln n}}\right)$ (for the case $n \le 7$ the optimal point is $q_0 = 2$, and without loss of generality we assume $n \ge 8$). Therefore, for all $q \ge q_0$, including $q = \infty$, we have

(A.9)    $\mathbb{E}[\|e\|_q^2] \overset{(i)}{\le} \mathbb{E}[\|e\|_{q_0}^2] \overset{(A.8)}{\le} (q_0-1)\,n^{\frac{2}{q_0}-1} \overset{(ii)}{\le} (2\ln n-1)\,n^{\frac{2}{\ln n}-1} = (2\ln n-1)\exp(2)\,\frac{1}{n} \le (16\ln n-8)\,\frac{1}{n} \le (16\ln n-8)\,n^{\frac{2}{q}-1},$

where (i) is since $\|e\|_q \le \|e\|_{q_0}$ for $q \ge q_0$ and (ii) follows from $q_0 \le 2\ln n$ and $q_0 \ge \ln n$. Combining the estimates (A.8) and (A.9), we obtain (A.1).

It remains to prove (A.2). First, we estimate $\mathbb{E}[\|e\|_q^4]$. By the probabilistic Jensen's inequality, for $q \ge 2$,

$\mathbb{E}[\|e\|_q^4] = \mathbb{E}\left[\left(\left(\sum_{k=1}^{n}|e_k|^q\right)^2\right)^{\frac{2}{q}}\right] \le \left(\mathbb{E}\left[\left(\sum_{k=1}^{n}|e_k|^q\right)^2\right]\right)^{\frac{2}{q}} \overset{(i)}{\le} \left(\mathbb{E}\left[n\sum_{k=1}^{n}|e_k|^{2q}\right]\right)^{\frac{2}{q}} \overset{(ii)}{=} \left(n^2\,\mathbb{E}[|e_2|^{2q}]\right)^{\frac{2}{q}} \overset{(A.5),(A.6)}{\le} \left(n^2\left(\frac{2q-1}{n}\right)^{q}\right)^{\frac{2}{q}} = (2q-1)^2\,n^{\frac{4}{q}-2},$

where (i) is since $\left(\sum_{k=1}^{n}x_k\right)^2 \le n\sum_{k=1}^{n}x_k^2$ for $x_1,\dots,x_n\in\mathbb{R}$ and (ii) follows from the linearity of the expectation and the fact that the components of the random vector $e$ are identically distributed. From this we obtain

(A.10)    $\sqrt{\mathbb{E}[\|e\|_q^4]} \le (2q-1)\,n^{\frac{2}{q}-1}.$

Next, we consider the r.h.s. of (A.10) as a function of $q$ and find its minimum for $q \ge 2$. The logarithm of the r.h.s. of (A.10) is $h_n(q) = \ln(2q-1) + \left(\frac{2}{q}-1\right)\ln n$ (with a slight abuse of notation, we reuse the names $h_n$ and $q_0$), with the derivative $\frac{dh_n(q)}{dq} = \frac{2}{2q-1} - \frac{2\ln n}{q^2}$, which implies the first-order optimality condition $\frac{2}{2q-1} - \frac{2\ln n}{q^2} = 0$, or equivalently $q^2 - 2q\ln n + \ln n = 0$. If $n \ge 3$, the point where the function $h_n(q)$ attains its minimum on the set $[2,+\infty)$ is $q_0 = \ln n\left(1+\sqrt{1-\frac{1}{\ln n}}\right)$ (for the case $n \le 2$ the optimal point is $q_0 = 2$, and without loss of generality we assume that $n \ge 3$). Therefore, for all $q \ge q_0$, including $q = \infty$,

(A.11)    $\sqrt{\mathbb{E}[\|e\|_q^4]} \overset{(i)}{\le} \sqrt{\mathbb{E}[\|e\|_{q_0}^4]} \overset{(A.10)}{\le} (2q_0-1)\,n^{\frac{2}{q_0}-1} \overset{(ii)}{\le} (4\ln n-1)\,n^{\frac{2}{\ln n}-1} = (4\ln n-1)\exp(2)\,\frac{1}{n} \le (32\ln n-8)\,\frac{1}{n} \le (32\ln n-8)\,n^{\frac{2}{q}-1},$

where (i) is since $\|e\|_q \le \|e\|_{q_0}$ for $q \ge q_0$ and (ii) follows from $q_0 \le 2\ln n$ and $q_0 \ge \ln n$. Combining the estimates (A.10) and (A.11), we get the inequality

(A.12)    $\sqrt{\mathbb{E}[\|e\|_q^4]} \le \min\{2q-1,\,32\ln n-8\}\,n^{\frac{2}{q}-1}.$

The next step is to estimate $\mathbb{E}[\langle s, e\rangle^4]$, where $s\in\mathbb{R}^n$ is some fixed vector. Let $S_n(r)$ be the surface area of the $n$-dimensional Euclidean sphere of radius $r$ and let $d\sigma(e)$ be the unnormalized uniform measure on the $n$-dimensional Euclidean unit sphere $S$. Then $S_n(r) = S_n(1)\,r^{n-1}$ and $\frac{S_{n-1}(1)}{S_n(1)} = \frac{n-1}{n\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n+2}{2}\right)}{\Gamma\left(\frac{n+1}{2}\right)}$. Let $\varphi$ be the angle between $s$ and $e$. Then

(A.13)    $\mathbb{E}[\langle s, e\rangle^4] = \frac{1}{S_n(1)}\int_{S}\langle s, e\rangle^4\,d\sigma(e) = \frac{1}{S_n(1)}\int_0^{\pi}\|s\|_2^4\cos^4\varphi\; S_{n-1}(\sin\varphi)\,d\varphi = \|s\|_2^4\,\frac{S_{n-1}(1)}{S_n(1)}\int_0^{\pi}\cos^4\varphi\,\sin^{n-2}\varphi\,d\varphi = \|s\|_2^4\cdot\frac{n-1}{n\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n+2}{2}\right)}{\Gamma\left(\frac{n+1}{2}\right)}\int_0^{\pi}\cos^4\varphi\,\sin^{n-2}\varphi\,d\varphi.$

Further, denoting the Beta function by $B(\cdot,\cdot)$,

$\int_0^{\pi}\cos^4\varphi\,\sin^{n-2}\varphi\,d\varphi = 2\int_0^{\frac{\pi}{2}}\cos^4\varphi\,\sin^{n-2}\varphi\,d\varphi \overset{t=\sin^2\varphi}{=} \int_0^{1} t^{\frac{n-3}{2}}(1-t)^{\frac{3}{2}}\,dt = B\left(\frac{n-1}{2},\frac{5}{2}\right) = \frac{\Gamma\left(\frac{n-1}{2}\right)\Gamma\left(\frac{5}{2}\right)}{\Gamma\left(\frac{n+4}{2}\right)} = \frac{\frac{3}{2}\cdot\frac{1}{2}\,\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{n-1}{2}\right)}{\frac{n+2}{2}\,\Gamma\left(\frac{n+2}{2}\right)} = \frac{3\sqrt{\pi}}{2(n+2)}\cdot\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n+2}{2}\right)}.$

From this and (A.13), we obtain

(A.14)    $\mathbb{E}[\langle s, e\rangle^4] = \|s\|_2^4\cdot\frac{n-1}{n\sqrt{\pi}}\cdot\frac{\Gamma\left(\frac{n+2}{2}\right)}{\Gamma\left(\frac{n+1}{2}\right)}\cdot\frac{3\sqrt{\pi}}{2(n+2)}\cdot\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n+2}{2}\right)} = \|s\|_2^4\cdot\frac{3(n-1)}{2n(n+2)}\cdot\frac{\Gamma\left(\frac{n-1}{2}\right)}{\frac{n-1}{2}\,\Gamma\left(\frac{n-1}{2}\right)} = \frac{3\|s\|_2^4}{n(n+2)} \le \frac{3\|s\|_2^4}{n^2},$

where we used $\Gamma\left(\frac{n+1}{2}\right) = \frac{n-1}{2}\,\Gamma\left(\frac{n-1}{2}\right)$.

To prove (A.2), it remains to use (A.12), (A.14) and the Cauchy--Schwarz inequality $(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\cdot\mathbb{E}[Y^2]$:

$\mathbb{E}[\langle s, e\rangle^2\,\|e\|_q^2] \le \sqrt{\mathbb{E}[\langle s, e\rangle^4]\cdot\mathbb{E}[\|e\|_q^4]} \le \sqrt{3}\,\|s\|_2^2\,\min\{2q-1,\,32\ln n-8\}\,n^{\frac{2}{q}-2} \le 6\,\|s\|_2^2\,\min\{q-1,\,16\ln n-8\}\,n^{\frac{2}{q}-2},$

where the last inequality follows from $\min\{2q-1,\,32\ln n-8\} \le 3\min\{q-1,\,16\ln n-8\}$ (valid for $q \ge 2$ and $n \ge 8$) together with $3\sqrt{3} \le 6$. This finishes the proof of (A.2).
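Before moving on to the recurrences, here is a small, self-contained Monte Carlo check of the bounds (A.1) and (A.2) that we add for convenience; it is not part of the proof, and the sampling of $e$ uses the Gaussian-normalization representation from the Poincaré lemma above.

    import numpy as np

    def check_lemma(n=1000, q=4.0, trials=20000, seed=0):
        rng = np.random.default_rng(seed)
        xi = rng.normal(size=(trials, n))
        e = xi / np.linalg.norm(xi, axis=1, keepdims=True)   # e uniform on the Euclidean unit sphere
        s = rng.normal(size=n); s /= np.linalg.norm(s)       # an arbitrary fixed unit vector s
        norm_q_sq = np.linalg.norm(e, ord=q, axis=1) ** 2
        lhs1 = norm_q_sq.mean()                              # estimate of E[||e||_q^2]
        lhs2 = ((e @ s) ** 2 * norm_q_sq).mean()             # estimate of E[<s, e>^2 ||e||_q^2]
        rho = min(q - 1, 16 * np.log(n) - 8) * n ** (2 / q - 1)
        print("(A.1):", lhs1, "<=", rho, lhs1 <= rho)
        print("(A.2):", lhs2, "<=", 6 * rho / n, lhs2 <= 6 * rho / n)

    check_lemma()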

Appendix B. Technical Results on Recurrent Sequences.

Lemma B.1. Let $a_0,\dots,a_{N-1}$, $b$, $R_1,\dots,R_{N-1}$ be non-negative numbers and

(B.1)    $R_l \le \sqrt{2}\cdot\sqrt{\sum_{k=0}^{l-1} a_k + b\sum_{k=1}^{l-1}\alpha_{k+1}R_k}, \qquad l = 1,\dots,N,$

where $\alpha_{k+1} = \frac{k+2}{96\,n^2\rho_n L_2}$ for all $k \in \mathbb{N}$. Then, for $l = 1,\dots,N$,

(B.2)    $\sum_{k=0}^{l-1} a_k + b\sum_{k=1}^{l-1}\alpha_{k+1}R_k \le \left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\cdot\frac{l^2}{96\,n^2\rho_n L_2}\right)^2.$

Proof. For l = 1 the inequality is trivial. Next we assume that (B.2) holds for some l < N and prove this inequality for l + 1. From the induction assumption and (B.1) we obtain

(B.3)    $R_l \le \sqrt{2}\left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\cdot\frac{l^2}{96\,n^2\rho_n L_2}\right),$

whence

$\sum_{k=0}^{l} a_k + b\sum_{k=1}^{l}\alpha_{k+1}R_k = \sum_{k=0}^{l-1} a_k + b\sum_{k=1}^{l-1}\alpha_{k+1}R_k + a_l + b\,\alpha_{l+1}R_l$
$\overset{(i)}{\le} \left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\cdot\frac{l^2}{96\,n^2\rho_n L_2}\right)^2 + a_l + \sqrt{2}\,b\,\alpha_{l+1}\left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\cdot\frac{l^2}{96\,n^2\rho_n L_2}\right)$
$= \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l-1} a_k}\cdot\frac{\sqrt{2}\,b\,l^2}{96\,n^2\rho_n L_2} + \frac{2 b^2 l^4}{(96\,n^2\rho_n L_2)^2} + \sqrt{2}\,b\,\alpha_{l+1}\sqrt{\sum_{k=0}^{l-1} a_k} + \frac{2 b^2\alpha_{l+1} l^2}{96\,n^2\rho_n L_2}$
$= \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l-1} a_k}\cdot\sqrt{2}\,b\left(\frac{l^2}{96\,n^2\rho_n L_2} + \frac{\alpha_{l+1}}{2}\right) + 2 b^2\left(\frac{l^4}{(96\,n^2\rho_n L_2)^2} + \frac{\alpha_{l+1}\,l^2}{96\,n^2\rho_n L_2}\right)$
$\overset{(ii)}{\le} \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l} a_k}\cdot\frac{\sqrt{2}\,b\,(l+1)^2}{96\,n^2\rho_n L_2} + \frac{2 b^2 (l+1)^4}{(96\,n^2\rho_n L_2)^2} = \left(\sqrt{\sum_{k=0}^{l} a_k} + \sqrt{2}\,b\cdot\frac{(l+1)^2}{96\,n^2\rho_n L_2}\right)^2,$

where (i) holds by the induction assumption and (B.3), and (ii) is since $\sum_{k=0}^{l-1} a_k \le \sum_{k=0}^{l} a_k$ and

$\frac{l^2}{96\,n^2\rho_n L_2} + \frac{\alpha_{l+1}}{2} = \frac{2l^2+l+2}{192\,n^2\rho_n L_2} \le \frac{(l+1)^2}{96\,n^2\rho_n L_2}, \qquad \frac{l^4}{(96\,n^2\rho_n L_2)^2} + \alpha_{l+1}\cdot\frac{l^2}{96\,n^2\rho_n L_2} = \frac{l^4+(l+2)l^2}{(96\,n^2\rho_n L_2)^2} \le \frac{(l+1)^4}{(96\,n^2\rho_n L_2)^2}.$

Lemma B.2. Let $\alpha, a_0,\dots,a_{N-1}, b, R_1,\dots,R_{N-1}$ be non-negative numbers and

(B.4)    $R_l \le \sqrt{2}\cdot\sqrt{\sum_{k=0}^{l-1} a_k + b\,\alpha\sum_{k=1}^{l-1} R_k}, \qquad l = 1,\dots,N.$

Then, for $l = 1,\dots,N$,

(B.5)    $\sum_{k=0}^{l-1} a_k + b\,\alpha\sum_{k=1}^{l-1} R_k \le \left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\,\alpha\, l\right)^2.$

Proof. For l = 1 the inequality is trivial. Next we assume that (B.5) holds for some l < N and prove it for l + 1. By the induction assumption and (B.4) we obtain

(B.6)    $R_l \le \sqrt{2}\left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\,\alpha\, l\right),$

whence

$\sum_{k=0}^{l} a_k + b\,\alpha\sum_{k=1}^{l} R_k = \sum_{k=0}^{l-1} a_k + b\,\alpha\sum_{k=1}^{l-1} R_k + a_l + b\,\alpha R_l$
$\overset{(i)}{\le} \left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\,\alpha\, l\right)^2 + a_l + \sqrt{2}\,b\,\alpha\left(\sqrt{\sum_{k=0}^{l-1} a_k} + \sqrt{2}\,b\,\alpha\, l\right)$
$= \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l-1} a_k}\cdot\sqrt{2}\,b\,\alpha\, l + 2 b^2\alpha^2 l^2 + \sqrt{2}\,b\,\alpha\sqrt{\sum_{k=0}^{l-1} a_k} + 2 b^2\alpha^2 l$
$= \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l-1} a_k}\cdot\sqrt{2}\,b\,\alpha\left(l+\frac{1}{2}\right) + 2 b^2\alpha^2\left(l^2+l\right)$
$\overset{(ii)}{\le} \sum_{k=0}^{l} a_k + 2\sqrt{\sum_{k=0}^{l} a_k}\cdot\sqrt{2}\,b\,\alpha\,(l+1) + 2\left(b\,\alpha\,(l+1)\right)^2 = \left(\sqrt{\sum_{k=0}^{l} a_k} + \sqrt{2}\,b\,\alpha\,(l+1)\right)^2,$

where (i) is by the induction assumption and (B.6), and (ii) is since $\sum_{k=0}^{l-1} a_k \le \sum_{k=0}^{l} a_k$.
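As an illustration (ours, not part of the paper), the following short Python script iterates the worst case of recursion (B.4), i.e., takes (B.4) with equality, and verifies numerically that the resulting sums never exceed the bound (B.5); the specific values of $a_k$, $b$ and $\alpha$ are arbitrary.

    import math

    def check_lemma_b2(a0=1.0, b=0.7, alpha=1.3, N=200):
        a = [a0] + [0.5] * (N - 1)          # arbitrary non-negative a_k
        R = [None]                          # R_k for k >= 1; index 0 unused
        for l in range(1, N):
            s = sum(a[:l]) + b * alpha * sum(R[1:l])
            R.append(math.sqrt(2.0 * s))    # worst case: (B.4) holds with equality
            bound = (math.sqrt(sum(a[:l])) + math.sqrt(2.0) * b * alpha * l) ** 2
            assert s <= bound + 1e-9, (l, s, bound)
        print("Lemma B.2 bound (B.5) verified for l = 1, ...,", N - 1)

    check_lemma_b2()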