Published as a conference paper at ICLR 2020

A STOCHASTIC DERIVATIVE FREE OPTIMIZATION METHOD WITH MOMENTUM

Eduard Gorbunov∗ (MIPT, Russia; IITP RAS, Russia; RANEPA, Russia), [email protected]
Adel Bibi (KAUST, Saudi Arabia), [email protected]

Ozan Sener (Intel Labs), [email protected]
El Houcine Bergou (KAUST, Saudi Arabia; MaIAGE, INRA, France), [email protected]

Peter Richtárik (KAUST, Saudi Arabia; MIPT, Russia), [email protected]

ABSTRACT

We consider the problem of unconstrained minimization of a smooth objective function in $\mathbb{R}^d$ in the setting where only function evaluations are possible. We propose and analyze a stochastic zeroth-order method with heavy ball momentum. In particular, we propose SMTP, a momentum version of the stochastic three points method (STP) of Bergou et al. (2019). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of continuous control tasks in several MuJoCo Todorov et al. (2012) environments of varying difficulty and compare against STP, other state-of-the-art derivative-free optimization algorithms, and policy gradient methods. SMTP significantly outperforms STP and all other methods considered in our numerical experiments. Our second contribution is SMTP with importance sampling, which we call SMTP_IS. We provide convergence analysis of this method for non-convex, convex and strongly convex objectives.

1 INTRODUCTION

In this paper, we consider the minimization problem
$$\min_{x\in\mathbb{R}^d} f(x), \quad (1)$$
where $f:\mathbb{R}^d\to\mathbb{R}$ is "smooth" but not necessarily convex, in the Derivative-Free Optimization (DFO) setting where only function evaluations are possible. The function $f$ is bounded from below by $f(x^*)$, where $x^*$ is a minimizer. Throughout the paper we assume that $f$ is $L$-smooth.

DFO. In the DFO setting Conn et al. (2009); Kolda et al. (2003), the derivatives of the objective function $f$ are not accessible: they are either impractical to evaluate, noisy (the function $f$ itself is noisy) (Chen, 2015), or simply not available at all. In standard applications of DFO, evaluations of $f$ are only accessible through simulations of a black-box engine or software, as in reinforcement learning and continuous control environments Todorov et al. (2012). This setting also appears in applications ranging from computational medicine Marsden et al. (2008) and fluid dynamics Allaire (2001); Haslinger & Mäckinen (2003); Mohammadi & Pironneau (2001) to localization Marsden et al. (2004; 2007) and continuous control Mania et al. (2018); Salimans et al. (2017), to name a few.

The literature on DFO for solving (1) is long and rich. The first approaches were based on deterministic direct search (DDS), and they span half a century of work Hooke & Jeeves (1961); Su (1979); Torczon

∗The research of Eduard Gorbunov was supported by RFBR, project number 18-31-20005 mol_a_ved


(1997). However, for DDS methods complexity bounds have only been established recently by the work of Vicente and coauthors Vicente (2013); Dodangeh & Vicente (2016). In particular, Vicente (2013) showed the first complexity results for non-convex $f$, and the results were extended to better complexities when $f$ is convex in Dodangeh & Vicente (2016). There have also been several randomized variants of DDS Matyas (1965); Karmanov (1974a;b); Baba (1981); Dorea (1983); Sarma (1990). Only very recently, complexity bounds have also been derived for randomized methods Diniz-Ehrhardt et al. (2008); Stich et al. (2011); Ghadimi & Lan (2013); Ghadimi et al. (2016); Gratton et al. (2015). For instance, Diniz-Ehrhardt et al. (2008); Gratton et al. (2015) impose a decrease condition on whether to accept or reject a step along a set of random directions. Moreover, Nesterov & Spokoiny (2017) derived new complexity bounds when the random directions are normally distributed vectors, for both smooth and non-smooth $f$; they proposed both accelerated and non-accelerated zeroth-order (ZO) methods. Accelerated derivative-free methods for the case of inexact oracle information were proposed in Dvurechensky et al. (2017). An extension of Nesterov & Spokoiny (2017) to the non-Euclidean proximal setup was proposed by Gorbunov et al. (2018) for the smooth stochastic case with inexact oracle. In Stich (2014a;b) the authors also consider acceleration of ZO methods and, in particular, develop a method called SARP, prove that its convergence rate is no worse than that of non-accelerated ZO methods, and show that in some cases it works even better. More recently, and most closely related to our work, Bergou et al. (2019) proposed a new randomized direct search method called Stochastic Three Points (STP).
At each iteration $k$, STP generates a direction $s^k$ according to a certain probability law and compares the objective function at three points: the current iterate $x^k$, a point in the direction of $s^k$, and a point in the direction of $-s^k$, with a certain stepsize $\alpha^k$. The method then chooses the best of these three points as the new iterate:
$$x^{k+1} = \arg\min\left\{f(x^k),\, f(x^k + \alpha^k s^k),\, f(x^k - \alpha^k s^k)\right\}.$$
The key properties of STP are its simplicity, generality and practicality. Indeed, the update rule makes STP extremely simple to implement, the proofs of its convergence results are short and clear, and the assumptions on the random search directions cover many strategies for choosing descent directions; even some first-order methods fit the STP scheme, which makes it very flexible in comparison with other zeroth-order methods (e.g. two-point evaluation methods as in Nesterov & Spokoiny (2017), Ghadimi & Lan (2013), Ghadimi et al. (2016), Gorbunov et al. (2018), which approximate directional derivatives along a random direction at each iteration). Motivated by these properties of STP, we focus on further developing this method.

Momentum. Heavy ball momentum¹ is a technique introduced by Polyak in 1964 Polyak (1964) to accelerate the convergence of first-order methods. In the original paper, Polyak proved that his method converges locally at the rate $O\left(\sqrt{L/\mu}\log(1/\varepsilon)\right)$ for twice continuously differentiable, $\mu$-strongly convex and $L$-smooth functions. Despite the long history of this approach, it is still an open question whether the heavy ball method converges to the optimum globally at an accelerated rate when the objective function is twice continuously differentiable, $L$-smooth and $\mu$-strongly convex. For this class of functions, only non-accelerated global convergence has been proved Ghadimi et al. (2015), and for the special case of quadratic strongly convex and $L$-smooth functions Lessard et al.
(2016) recently proved asymptotic accelerated global convergence. Nevertheless, the heavy ball method performs well in practice and is therefore widely used. A more detailed survey of the literature on heavy ball momentum can be found in Loizou & Richtárik (2017).

Importance Sampling. Importance sampling has been celebrated and extensively studied in stochastic gradient methods Zhao & Zhang (2015) and in coordinate descent methods Richtárik & Takáč (2016). Only very recently, Bibi et al. (2019) proposed STP_IS, the first DFO algorithm with importance sampling. In particular, for coordinate-wise smooth functions, they show that sampling coordinate directions (which can be generalized to arbitrary directions) with probabilities proportional to the coordinate smoothness constants improves the leading complexity constant by the same factor typically gained in gradient-based methods.

Contributions. Our contributions can be summarized as follows.

• First ZO method with heavy ball momentum. Motivated by the practical effectiveness of the first-order heavy ball method, we introduce momentum into the STP method and

¹We will refer to this as momentum.


Algorithm 1 SMTP: Stochastic Momentum Three Points
Require: learning rates $\{\gamma^k\}_{k\ge 0}$, starting point $x^0 \in \mathbb{R}^d$, a distribution $\mathcal{D}$ on $\mathbb{R}^d$, momentum parameter $0 \le \beta < 1$
1: Set $v^{-1} = 0$ and $z^0 = x^0$
2: for $k = 0, 1, \ldots$ do
3:   Sample $s^k \sim \mathcal{D}$
4:   Let $v_+^k = \beta v^{k-1} + s^k$ and $v_-^k = \beta v^{k-1} - s^k$
5:   Let $x_+^{k+1} = x^k - \gamma^k v_+^k$ and $x_-^{k+1} = x^k - \gamma^k v_-^k$
6:   Let $z_+^{k+1} = x_+^{k+1} - \frac{\gamma^k\beta}{1-\beta}v_+^k$ and $z_-^{k+1} = x_-^{k+1} - \frac{\gamma^k\beta}{1-\beta}v_-^k$
7:   Set $z^{k+1} = \arg\min\left\{f(z^k), f(z_+^{k+1}), f(z_-^{k+1})\right\}$
8:   Set $x^{k+1} = x_+^{k+1}$ and $v^k = v_+^k$ if $z^{k+1} = z_+^{k+1}$; $x^{k+1} = x_-^{k+1}$ and $v^k = v_-^k$ if $z^{k+1} = z_-^{k+1}$; $x^{k+1} = x^k$ and $v^k = v^{k-1}$ if $z^{k+1} = z^k$
9: end for

propose a new DFO algorithm with heavy ball momentum (SMTP). We summarize the method in Algorithm 1, with theoretical guarantees for non-convex, convex and strongly convex functions under generic sampling distributions $\mathcal{D}$. We emphasize that SMTP is not a straightforward combination of STP and Polyak's method: it requires insights from the virtual iterates analysis of Yang et al. (2016). To the best of our knowledge, this is the first analysis of a derivative-free method with heavy ball momentum, i.e. we show that the same momentum trick that works for first-order methods can be applied to zeroth-order methods as well.

• First ZO method with both heavy ball momentum and importance sampling. In order to get more gain from momentum in the case when the sampling directions are coordinate directions and the objective function is coordinate-wise smooth (see Assumption 4.1), we add importance sampling to the above method. In fact, we propose the first zeroth-order momentum method with importance sampling (SMTP_IS), summarized in Algorithm 2, with theoretical guarantees for non-convex, convex and strongly convex functions. The details and proofs are deferred to Section 4 and Appendix E.

• Practicality. We conduct extensive experiments on continuous control tasks from the MuJoCo suite Todorov et al. (2012), following the recent success of DFO compared to model-free reinforcement learning Mania et al. (2018); Salimans et al. (2017). With SMTP_IS we achieve state-of-the-art results across all tested continuous control environments, outperforming DFO Mania et al. (2018) and the policy gradient methods Schulman et al. (2015); Rajeswaran et al. (2017).
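To make Algorithm 1 concrete, here is a minimal NumPy sketch of SMTP. The choice of $\mathcal{D}$ (uniform on the unit sphere), the test function, and the values of `gamma`, `beta` and `n_iters` are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np

def smtp(f, x0, gamma=0.1, beta=0.5, n_iters=100, seed=0):
    """Minimal sketch of SMTP (Algorithm 1).

    Assumptions (illustrative): D is the uniform distribution on the unit
    sphere and gamma is a constant stepsize.
    """
    rng = np.random.default_rng(seed)
    d = len(x0)
    x = np.asarray(x0, dtype=float)
    v = np.zeros(d)          # v^{-1} = 0
    z = x.copy()             # z^0 = x^0
    for _ in range(n_iters):
        s = rng.standard_normal(d)
        s /= np.linalg.norm(s)                                     # s^k ~ D
        v_plus, v_minus = beta * v + s, beta * v - s               # line 4
        x_plus, x_minus = x - gamma * v_plus, x - gamma * v_minus  # line 5
        c = gamma * beta / (1.0 - beta)           # virtual-iterate shift, line 6
        z_plus, z_minus = x_plus - c * v_plus, x_minus - c * v_minus
        # lines 7-8: keep the best of the three candidates (ties keep z^k)
        cands = [(f(z), z, x, v),
                 (f(z_plus), z_plus, x_plus, v_plus),
                 (f(z_minus), z_minus, x_minus, v_minus)]
        _, z, x, v = min(cands, key=lambda t: t[0])
    return z
```

By construction $f(z^k)$ is monotone non-increasing, since the current $z^k$ is always among the three candidates.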

We provide a more detailed comparison of SMTP and SMTP_IS in Section E.4 of the Appendix.

2 NOTATION AND DEFINITIONS

We use $\|\cdot\|_p$ to denote the $\ell_p$-norm of a vector $x \in \mathbb{R}^d$: $\|x\|_p \overset{\text{def}}{=} \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$ for $p \ge 1$ and $\|x\|_\infty \overset{\text{def}}{=} \max_{i\in[d]} |x_i|$, where $x_i$ is the $i$-th component of $x$ and $[d] \overset{\text{def}}{=} \{1, 2, \ldots, d\}$. The operator $\mathbb{E}[\cdot]$ denotes mathematical expectation with respect to all randomness, and $\mathbb{E}_{s\sim\mathcal{D}}[\cdot]$ denotes conditional expectation w.r.t. the randomness coming from a random vector $s$ sampled from a probability distribution $\mathcal{D}$ on $\mathbb{R}^d$. The standard inner product of two vectors $x, y \in \mathbb{R}^d$ is $\langle x, y\rangle \overset{\text{def}}{=} \sum_{i=1}^d x_i y_i$, and $e_i$ denotes the $i$-th coordinate vector of the standard basis in $\mathbb{R}^d$, i.e. $x = \sum_{i=1}^d x_i e_i$. We use $\|\cdot\|^*$ to denote the conjugate norm of a norm $\|\cdot\|$: $\|x\|^* \overset{\text{def}}{=} \max\left\{\langle a, x\rangle \mid a \in \mathbb{R}^d, \|a\| \le 1\right\}$. As mentioned in the introduction, we assume throughout the paper² that the objective function $f$ is $L$-smooth.

²We will use a finer, coordinate-wise assumption in Section 4.


Assumptions on $f$ | SMTP complexity | Theorem | Importance sampling | SMTP_IS complexity | Theorem
None | $\frac{2r_0 L\gamma_{\mathcal D}}{\mu_{\mathcal D}^2\varepsilon^2}$ | 3.1 | $p_i = \frac{L_i}{\sum_{j=1}^d L_j}$ | $\frac{2r_0 d\sum_{i=1}^d L_i}{\varepsilon^2}$ | E.1
Convex, $R_0 < \infty$ | $\frac{L\gamma_{\mathcal D}R_0^2}{\mu_{\mathcal D}^2\varepsilon}\ln\frac{2r_0}{\varepsilon}$ | 3.2 | $p_i = \frac{L_i}{\sum_{j=1}^d L_j}$ | $\frac{R_0^2 d\sum_{i=1}^d L_i}{\varepsilon}\ln\frac{2r_0}{\varepsilon}$ | E.2
$\mu$-strongly convex | $\frac{L}{\mu\mu_{\mathcal D}^2}\ln\frac{2r_0}{\varepsilon}$ | 3.5 | $p_i = \frac{L_i}{\sum_{j=1}^d L_j}$ | $\frac{\sum_{i=1}^d L_i}{\mu}\ln\frac{2r_0}{\varepsilon}$ | E.5

Table 1: Summary of the new complexity results for SMTP and SMTP_IS. The complexities for SMTP hold under a generic sampling distribution $\mathcal{D}$ satisfying Assumption 3.1, while those for SMTP_IS hold for an arbitrary discrete sampling from a set of coordinate directions following Bibi et al. (2019), for which we propose an importance sampling that improves the leading constant. Here $r_0 = f(x^0) - f(x^*)$, and all listed assumptions are in addition to Assumption 2.1. Complexity means the number of iterations needed to guarantee $\mathbb{E}\|\nabla f(z^K)\|_{\mathcal D} \le \varepsilon$ in the non-convex case and $\mathbb{E}\left[f(z^K) - f(x^*)\right] \le \varepsilon$ in the convex and strongly convex cases. $R_0 < \infty$ is the radius, in the $\|\cdot\|_{\mathcal D}^*$-norm, of a bounded level set; the exact definition is given in Assumption 3.2. We note that for SMTP_IS, $\|\cdot\|_{\mathcal D} = \|\cdot\|_1$ and $\|\cdot\|_{\mathcal D}^* = \|\cdot\|_\infty$ in the non-convex and convex cases, and $\|\cdot\|_{\mathcal D} = \|\cdot\|_2$ in the strongly convex case.

Assumption 2.1 ($L$-smoothness). We say that $f$ is $L$-smooth if
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 \quad \forall x, y \in \mathbb{R}^d. \quad (2)$$

From this definition one can obtain
$$|f(y) - f(x) - \langle\nabla f(x), y - x\rangle| \le \frac{L}{2}\|y - x\|_2^2 \quad \forall x, y \in \mathbb{R}^d, \quad (3)$$
and if additionally $f$ is convex, i.e. $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle$, we have
$$\|\nabla f(x)\|_2^2 \le 2L\left(f(x) - f(x^*)\right) \quad \forall x \in \mathbb{R}^d. \quad (4)$$

3 STOCHASTIC MOMENTUM THREE POINTS (SMTP)

Our analysis of SMTP is based on the following key assumption.

Assumption 3.1. The probability distribution $\mathcal{D}$ on $\mathbb{R}^d$ satisfies the following properties:

1. The quantity $\gamma_{\mathcal D} \overset{\text{def}}{=} \mathbb{E}_{s\sim\mathcal D}\|s\|_2^2$ is finite.
2. There is a constant $\mu_{\mathcal D} > 0$ and a norm $\|\cdot\|_{\mathcal D}$ on $\mathbb{R}^d$ such that for all $g \in \mathbb{R}^d$,
$$\mathbb{E}_{s\sim\mathcal D}|\langle g, s\rangle| \ge \mu_{\mathcal D}\|g\|_{\mathcal D}. \quad (5)$$
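As a concrete check of Assumption 3.1, take $\mathcal{D}$ uniform over the coordinate directions $\{e_1, \ldots, e_d\}$: then $\gamma_{\mathcal D} = \mathbb{E}\|s\|_2^2 = 1$ and $\mathbb{E}|\langle g, s\rangle| = \|g\|_1/d$, so (5) holds with $\|\cdot\|_{\mathcal D} = \|\cdot\|_1$ and $\mu_{\mathcal D} = 1/d$. A small numerical sanity check (the vector $g$ is arbitrary):

```python
import numpy as np

# For D uniform over the coordinate directions {e_1, ..., e_d} we have
# gamma_D = E ||s||_2^2 = 1 and E |<g, s>| = ||g||_1 / d, so Assumption 3.1
# holds with ||.||_D = ||.||_1 and mu_D = 1/d.
d = 4
g = np.array([3.0, -1.0, 0.5, 2.0])
lhs = np.mean([abs(g @ np.eye(d)[i]) for i in range(d)])  # E_{s~D} |<g, s>|
rhs = np.linalg.norm(g, 1) / d                            # mu_D * ||g||_1
assert np.isclose(lhs, rhs)
```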

Some examples of distributions that satisfy this assumption are described in Lemma 3.4 of Bergou et al. (2019); for convenience, we restate the lemma in the Appendix (see Lemma F.1).

Recall that one possible view of STP Bergou et al. (2019) is as follows. If, in the gradient descent update rule $x^{k+1} = x^k - \gamma^k\nabla f(x^k)$, we substitute the gradient $\nabla f(x^k)$ by $\pm s^k$, where $s^k$ is sampled from a distribution $\mathcal{D}$ satisfying Assumption 3.1, and then select $x^{k+1}$ as the best point in terms of function value among $x^k$, $x^k - \gamma^k s^k$ and $x^k + \gamma^k s^k$, we recover the STP method. However, gradient descent is not the best algorithm for unconstrained smooth minimization, and a natural idea is to apply the same substitution trick to first-order methods that are more efficient than gradient descent. We focus on Polyak's heavy ball method, whose update rule can be written in the following form:
$$v^k = \beta v^{k-1} + \nabla f(x^k), \qquad x^{k+1} = x^k - \gamma^k v^k. \quad (6)$$
As in STP, we substitute $\nabla f(x^k)$ by $\pm s^k$ and consider the new sequences $\{v_+^k\}_{k\ge 0}$ and $\{v_-^k\}_{k\ge 0}$ defined in Algorithm 1. However, it is not straightforward how to choose the next $x^{k+1}$ and $v^k$, and


the virtual iterates analysis of Yang et al. (2016) suggests the update rule. We consider the new iterates $z_+^{k+1} = x_+^{k+1} - \frac{\gamma^k\beta}{1-\beta}v_+^k$ and $z_-^{k+1} = x_-^{k+1} - \frac{\gamma^k\beta}{1-\beta}v_-^k$ and define $z^{k+1} \overset{\text{def}}{=} \arg\min\left\{f(z^k), f(z_+^{k+1}), f(z_-^{k+1})\right\}$. Next we update $x^{k+1}$ and $v^k$ so as to preserve the same relationship between $z^{k+1}$, $x^{k+1}$ and $v^k$ as between $z_+^{k+1}$, $x_+^{k+1}$ and $v_+^k$, and between $z_-^{k+1}$, $x_-^{k+1}$ and $v_-^k$. This scheme makes it easy to apply the virtual iterates analysis and to generalize the Key Lemma of Bergou et al. (2019), which is the main tool in the analysis of STP. By the definition of $z^{k+1}$, the sequence $\{f(z^k)\}_{k\ge 0}$ is monotone:
$$f(z^{k+1}) \le f(z^k) \quad \forall k \ge 0. \quad (7)$$

Now, we establish the key result which will be used to prove the main complexity results and the remaining theorems in this section.

Lemma 3.1. Assume that $f$ is $L$-smooth and $\mathcal{D}$ satisfies Assumption 3.1. Then for the iterates of SMTP the following inequalities hold:
$$f(z^{k+1}) \le f(z^k) - \frac{\gamma^k}{1-\beta}\left|\langle\nabla f(z^k), s^k\rangle\right| + \frac{L(\gamma^k)^2}{2(1-\beta)^2}\|s^k\|_2^2 \quad (8)$$
and
$$\mathbb{E}_{s^k\sim\mathcal D}\left[f(z^{k+1})\right] \le f(z^k) - \frac{\gamma^k\mu_{\mathcal D}}{1-\beta}\|\nabla f(z^k)\|_{\mathcal D} + \frac{L(\gamma^k)^2\gamma_{\mathcal D}}{2(1-\beta)^2}. \quad (9)$$

3.1 NON-CONVEX CASE

In this section, we state our complexity results for Algorithm 1 in the case when $f$ is allowed to be non-convex. In particular, we show that SMTP guarantees complexity bounds of the same order as the classical bounds in the literature, i.e. $1/\sqrt{K}$, where $K$ is the number of iterations. We note that the query complexity (i.e. the number of oracle calls) of SMTP coincides with its iteration complexity up to a constant factor. For clarity and completeness, proofs are deferred to the appendix.

Theorem 3.1. Let Assumptions 2.1 and 3.1 be satisfied. Let SMTP with $\gamma^k \equiv \gamma > 0$ produce points $\{z^0, z^1, \ldots, z^{K-1}\}$, and let $z^K$ be chosen uniformly at random among them. Then
$$\mathbb{E}\left[\|\nabla f(z^K)\|_{\mathcal D}\right] \le \frac{(1-\beta)\left(f(x^0) - f(x^*)\right)}{K\gamma\mu_{\mathcal D}} + \frac{L\gamma\gamma_{\mathcal D}}{2\mu_{\mathcal D}(1-\beta)}. \quad (10)$$
Moreover, if we choose $\gamma = \frac{\gamma_0}{\sqrt K}$, the bound (10) reduces to
$$\mathbb{E}\left[\|\nabla f(z^K)\|_{\mathcal D}\right] \le \frac{1}{\sqrt K}\left(\frac{(1-\beta)\left(f(z^0) - f(x^*)\right)}{\gamma_0\mu_{\mathcal D}} + \frac{L\gamma_0\gamma_{\mathcal D}}{2\mu_{\mathcal D}(1-\beta)}\right). \quad (11)$$

Then $\gamma_0 = \sqrt{\frac{2(1-\beta)^2\left(f(x^0) - f(x^*)\right)}{L\gamma_{\mathcal D}}}$ minimizes the right-hand side of (11), and for this choice we have
$$\mathbb{E}\left[\|\nabla f(z^K)\|_{\mathcal D}\right] \le \frac{\sqrt{2\left(f(x^0) - f(x^*)\right)L\gamma_{\mathcal D}}}{\mu_{\mathcal D}\sqrt K}. \quad (12)$$
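The choice of $\gamma_0$ follows from a standard balancing argument applied to (11), whose right-hand side (up to the $1/\sqrt{K}$ factor) has the form $a/\gamma_0 + b\gamma_0$ with

```latex
a = \frac{(1-\beta)\left(f(x^0) - f(x^*)\right)}{\mu_{\mathcal{D}}},
\qquad
b = \frac{L\gamma_{\mathcal{D}}}{2\mu_{\mathcal{D}}(1-\beta)}.
```

This sum is minimized at $\gamma_0 = \sqrt{a/b} = \sqrt{2(1-\beta)^2\left(f(x^0)-f(x^*)\right)/(L\gamma_{\mathcal D})}$, with minimal value $2\sqrt{ab} = \sqrt{2\left(f(x^0)-f(x^*)\right)L\gamma_{\mathcal D}}/\mu_{\mathcal D}$.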

In other words, the above theorem states that for non-convex problems SMTP converges to a stationary point no slower than STP. In the next sections we show that the theoretical convergence guarantees for SMTP are also no worse than those for STP for convex and strongly convex problems. In practice, however, SMTP significantly outperforms STP. Thus, the relationship between SMTP and STP mirrors the relationship known in the literature between Polyak's heavy ball method and gradient descent.

3.2 CONVEX CASE

In this section, we present our complexity results for Algorithm 1 when $f$ is convex. In particular, we show that this method guarantees complexity bounds of the same order as the classical bounds in the literature, i.e. $1/K$. We will need the following additional assumption in the sequel.


Assumption 3.2. We assume that $f$ is convex, has a minimizer $x^*$, and has a bounded level set at $x^0$:
$$R_0 \overset{\text{def}}{=} \max\left\{\|x - x^*\|_{\mathcal D}^* \mid f(x) \le f(x^0)\right\} < +\infty, \quad (13)$$
where $\|\xi\|_{\mathcal D}^* \overset{\text{def}}{=} \max\left\{\langle\xi, x\rangle \mid \|x\|_{\mathcal D} \le 1\right\}$ denotes the dual norm to $\|\cdot\|_{\mathcal D}$.

From the above assumption and the Cauchy–Schwarz inequality we get the following implication:
$$f(x) \le f(x^0) \;\Longrightarrow\; f(x) - f(x^*) \le \langle\nabla f(x), x - x^*\rangle \le \|\nabla f(x)\|_{\mathcal D}\|x - x^*\|_{\mathcal D}^* \le R_0\|\nabla f(x)\|_{\mathcal D},$$
which implies
$$\|\nabla f(x)\|_{\mathcal D} \ge \frac{f(x) - f(x^*)}{R_0} \quad \forall x: f(x) \le f(x^0). \quad (14)$$

Theorem 3.2 (Constant stepsize). Let Assumptions 2.1, 3.1 and 3.2 be satisfied. If we set $\gamma^k \equiv \gamma < \frac{(1-\beta)R_0}{\mu_{\mathcal D}}$, then for the iterates of SMTP the following inequality holds:

$$\mathbb{E}\left[f(z^k) - f(x^*)\right] \le \left(1 - \frac{\gamma\mu_{\mathcal D}}{(1-\beta)R_0}\right)^k\left(f(x^0) - f(x^*)\right) + \frac{L\gamma\gamma_{\mathcal D}R_0}{2(1-\beta)\mu_{\mathcal D}}. \quad (15)$$

If we choose $\gamma = \frac{\varepsilon(1-\beta)\mu_{\mathcal D}}{L\gamma_{\mathcal D}R_0}$ for some $0 < \varepsilon \le \frac{L\gamma_{\mathcal D}R_0^2}{\mu_{\mathcal D}^2}$ and run SMTP for $k = K$ iterations, where
$$K = \frac{1}{\varepsilon}\frac{L\gamma_{\mathcal D}R_0^2}{\mu_{\mathcal D}^2}\ln\frac{2\left(f(x^0) - f(x^*)\right)}{\varepsilon}, \quad (16)$$
then we will get $\mathbb{E}\left[f(z^K) - f(x^*)\right] \le \varepsilon$.

In order to get rid of the factor $\ln\frac{2\left(f(x^0) - f(x^*)\right)}{\varepsilon}$ in the complexity, we consider decreasing stepsizes.

Theorem 3.3 (Decreasing stepsizes). Let Assumptions 2.1, 3.1 and 3.2 be satisfied. If we set $\gamma^k = \frac{2}{\alpha k + \theta}$, where $\alpha = \frac{\mu_{\mathcal D}}{(1-\beta)R_0}$ and $\theta \ge \frac{2}{\alpha}$, then for the iterates of SMTP the following inequality holds:
$$\mathbb{E}\left[f(z^k) - f(x^*)\right] \le \frac{1}{\eta k + 1}\max\left\{f(x^0) - f(x^*),\; \frac{2L\gamma_{\mathcal D}}{\alpha\theta(1-\beta)^2}\right\}, \quad (17)$$

where $\eta \overset{\text{def}}{=} \frac{\alpha}{\theta}$. Then, if we choose $\gamma^k = \frac{2\alpha}{\alpha^2 k + 2}$, where $\alpha = \frac{\mu_{\mathcal D}}{(1-\beta)R_0}$, and run SMTP for $k = K$ iterations, where
$$K = \frac{1}{\varepsilon}\cdot\frac{2R_0^2}{\mu_{\mathcal D}^2}\max\left\{(1-\beta)^2\left(f(x^0) - f(x^*)\right),\; L\gamma_{\mathcal D}\right\} - \frac{2(1-\beta)^2R_0^2}{\mu_{\mathcal D}^2}, \quad \varepsilon > 0, \quad (18)$$
we get $\mathbb{E}\left[f(z^K) - f(x^*)\right] \le \varepsilon$.

We notice that if we choose $\beta$ sufficiently close to 1, formula (18) gives $K \approx \frac{2R_0^2 L\gamma_{\mathcal D}}{\varepsilon\mu_{\mathcal D}^2}$.

3.3 STRONGLY CONVEX CASE

In this section we present our complexity results for Algorithm 1 when $f$ is $\mu$-strongly convex.

Assumption 3.3. We assume that $f$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|_{\mathcal D}^*$:
$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\left(\|y - x\|_{\mathcal D}^*\right)^2 \quad \forall x, y \in \mathbb{R}^d. \quad (19)$$

It is well known that strong convexity implies

$$\|\nabla f(x)\|_{\mathcal D}^2 \ge 2\mu\left(f(x) - f(x^*)\right). \quad (20)$$
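This is the standard argument: minimizing both sides of (19) over $y$, and using that the minimum of $\langle\nabla f(x), h\rangle + \frac{\mu}{2}\left(\|h\|_{\mathcal D}^*\right)^2$ over $h = y - x$ equals $-\frac{1}{2\mu}\|\nabla f(x)\|_{\mathcal D}^2$ (by the definition of the dual norm), gives

```latex
f(x^*) = \min_{y} f(y)
\;\ge\; f(x) + \min_{h}\left\{\langle\nabla f(x), h\rangle
  + \tfrac{\mu}{2}\left(\|h\|_{\mathcal{D}}^{*}\right)^{2}\right\}
\;=\; f(x) - \frac{1}{2\mu}\|\nabla f(x)\|_{\mathcal{D}}^{2},
```

which rearranges to (20).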


Theorem 3.4 (Solution-dependent stepsizes). Let Assumptions 2.1, 3.1 and 3.3 be satisfied. If we set $\gamma^k = \frac{(1-\beta)\theta_k\mu_{\mathcal D}}{L}\sqrt{2\mu\left(f(z^k) - f(x^*)\right)}$ for some $\theta_k \in (0, 2)$ such that $\theta \overset{\text{def}}{=} \inf_{k\ge 0}\left\{2\theta_k - \gamma_{\mathcal D}\theta_k^2\right\} \in \left(0, \frac{L}{\mu_{\mathcal D}^2\mu}\right)$, then for the iterates of SMTP the following inequality holds:

$$\mathbb{E}\left[f(z^k) - f(x^*)\right] \le \left(1 - \frac{\theta\mu_{\mathcal D}^2\mu}{L}\right)^k\left(f(x^0) - f(x^*)\right). \quad (21)$$
Then, if we run SMTP for $k = K$ iterations, where
$$K = \frac{\kappa}{\theta\mu_{\mathcal D}^2}\ln\frac{f(x^0) - f(x^*)}{\varepsilon}, \quad \varepsilon > 0, \quad (22)$$

where $\kappa \overset{\text{def}}{=} \frac{L}{\mu}$ is the condition number of the objective, we will get $\mathbb{E}\left[f(z^K) - f(x^*)\right] \le \varepsilon$.

Note that the previous result uses stepsizes that depend on the optimal value $f(x^*)$, which is often unknown in practice. The next theorem removes this drawback without spoiling the convergence rate; however, it requires an additional assumption on the distribution $\mathcal{D}$ and one extra function evaluation per iteration.

Assumption 3.4. We assume that for all $s \sim \mathcal{D}$ we have $\|s\|_2 = 1$.

Theorem 3.5 (Solution-free stepsizes). Let Assumptions 2.1, 3.1, 3.3 and 3.4 be satisfied. If we additionally compute $f(z^k + ts^k)$ and set $\gamma^k = \frac{(1-\beta)\left|f(z^k + ts^k) - f(z^k)\right|}{Lt}$ for some $t > 0$, and $\mathcal{D}$ is such that $\mu_{\mathcal D}^2 \le \frac{L}{\mu}$, then for the iterates of SMTP the following inequality holds:

$$\mathbb{E}\left[f(z^k) - f(x^*)\right] \le \left(1 - \frac{\mu_{\mathcal D}^2\mu}{L}\right)^k\left(f(x^0) - f(x^*)\right) + \frac{L^2t^2}{8\mu_{\mathcal D}^2\mu}. \quad (23)$$
Moreover, for any $\varepsilon > 0$, if we set $t$ such that
$$0 < t \le \sqrt{\frac{4\varepsilon\mu_{\mathcal D}^2\mu}{L^2}} \quad (24)$$
and run SMTP for $k = K$ iterations, where
$$K = \frac{\kappa}{\mu_{\mathcal D}^2}\ln\frac{2\left(f(x^0) - f(x^*)\right)}{\varepsilon}, \quad (25)$$

where $\kappa \overset{\text{def}}{=} \frac{L}{\mu}$ is the condition number of $f$, we will have $\mathbb{E}\left[f(z^K) - f(x^*)\right] \le \varepsilon$.
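The solution-free stepsize of Theorem 3.5 costs one extra function evaluation per iteration and uses the finite difference $|f(z^k + ts^k) - f(z^k)|/t$ as a surrogate for $|\langle\nabla f(z^k), s^k\rangle|$. A minimal sketch, where the test function and the values of `t` and `beta` are illustrative:

```python
import numpy as np

def solution_free_stepsize(f, z, s, L, t=1e-3, beta=0.5):
    """Stepsize rule of Theorem 3.5: gamma^k = (1-beta)|f(z + t s) - f(z)|/(L t).
    The finite difference |f(z + t s) - f(z)|/t approximates |<grad f(z), s>|.
    The values of t and beta here are illustrative."""
    return (1.0 - beta) * abs(f(z + t * s) - f(z)) / (L * t)

# toy check on f(x) = (L/2)||x||^2, where <grad f(z), s> = L <z, s>
L = 2.0
f = lambda x: 0.5 * L * float(x @ x)
z, s = np.array([1.0, 0.0]), np.array([1.0, 0.0])   # ||s||_2 = 1 (Assumption 3.4)
gamma = solution_free_stepsize(f, z, s, L)
assert abs(gamma - 0.5) < 1e-2   # (1 - 0.5) * |L<z, s>| / L = 0.5 up to O(t)
```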

4 STOCHASTIC MOMENTUM THREE POINTS WITH IMPORTANCE SAMPLING (SMTP_IS)

In this section we consider another assumption on the objective, in a spirit similar to Bibi et al. (2019).

Assumption 4.1 (Coordinate-wise $L$-smoothness). We assume that the objective $f$ has coordinate-wise Lipschitz gradient with Lipschitz constants $L_1, \ldots, L_d > 0$, i.e.
$$f(x + he_i) \le f(x) + \nabla_i f(x)h + \frac{L_i}{2}h^2 \quad \forall x \in \mathbb{R}^d,\; h \in \mathbb{R}, \quad (26)$$
where $\nabla_i f(x)$ is the $i$-th partial derivative of $f$ at the point $x$.

For this class of problems we modify SMTP and present the SMTP_IS method in Algorithm 2. In general, the idea behind methods with importance sampling, and behind SMTP_IS in particular, is to adjust the sampling probabilities in a way that yields better convergence guarantees. In the case when $f$ satisfies coordinate-wise smoothness and the Lipschitz constants $L_i$ are known, it is natural to sample the direction $s^k = e_i$ with probability depending on $L_i$ (e.g. proportional to $L_i$). A more detailed discussion of importance sampling can be found in Zhao & Zhang (2015) and Richtárik & Takáč (2016). Now, we establish the key result which will be used to prove the main complexity results of SMTP_IS.


Algorithm 2 SMTP_IS: Stochastic Momentum Three Points with Importance Sampling

Require: stepsize parameters $w_1, \ldots, w_d > 0$, probabilities $p_1, \ldots, p_d > 0$ summing to 1, starting point $x^0 \in \mathbb{R}^d$, momentum parameter $0 \le \beta < 1$
1: Set $v^{-1} = 0$ and $z^0 = x^0$
2: for $k = 0, 1, \ldots$ do
3:   Select $i_k = i$ with probability $p_i > 0$
4:   Choose stepsize $\gamma_i^k$ proportional to $\frac{1}{w_{i_k}}$
5:   Let $v_+^k = \beta v^{k-1} + e_{i_k}$ and $v_-^k = \beta v^{k-1} - e_{i_k}$
6:   Let $x_+^{k+1} = x^k - \gamma_i^k v_+^k$ and $x_-^{k+1} = x^k - \gamma_i^k v_-^k$
7:   Let $z_+^{k+1} = x_+^{k+1} - \frac{\gamma_i^k\beta}{1-\beta}v_+^k$ and $z_-^{k+1} = x_-^{k+1} - \frac{\gamma_i^k\beta}{1-\beta}v_-^k$
8:   Set $z^{k+1} = \arg\min\left\{f(z^k), f(z_+^{k+1}), f(z_-^{k+1})\right\}$
9:   Set $x^{k+1} = x_+^{k+1}$ and $v^k = v_+^k$ if $z^{k+1} = z_+^{k+1}$; $x^{k+1} = x_-^{k+1}$ and $v^k = v_-^k$ if $z^{k+1} = z_-^{k+1}$; $x^{k+1} = x^k$ and $v^k = v^{k-1}$ if $z^{k+1} = z^k$
10: end for
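As a sketch of lines 3 and 5 of Algorithm 2, the importance sampling probabilities suggested by Table 1, $p_i = L_i/\sum_j L_j$, can be formed as follows; the constants $L_i$ below are illustrative placeholders for known coordinate-wise smoothness constants:

```python
import numpy as np

# Importance sampling for SMTP_IS: sample coordinate i with probability
# p_i = L_i / sum_j L_j. The constants L_i are illustrative placeholders.
L = np.array([10.0, 1.0, 1.0, 1.0])   # L_1, ..., L_d
p = L / L.sum()                        # sampling probabilities p_i

rng = np.random.default_rng(0)
i_k = rng.choice(len(L), p=p)          # line 3: select coordinate i_k
e_ik = np.zeros(len(L))
e_ik[i_k] = 1.0                        # direction e_{i_k} used in line 5
assert np.isclose(p.sum(), 1.0)
```

With these probabilities, coordinates with larger smoothness constants are sampled more often, which is what improves the leading constant in Table 1.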

Lemma 4.1. Assume that $f$ satisfies Assumption 4.1. Then for the iterates of SMTP_IS the following inequalities hold:
$$f(z^{k+1}) \le f(z^k) - \frac{\gamma_i^k}{1-\beta}\left|\nabla_{i_k}f(z^k)\right| + \frac{L_{i_k}(\gamma_i^k)^2}{2(1-\beta)^2} \quad (27)$$
and
$$\mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{1-\beta}\mathbb{E}\left[\gamma_i^k\left|\nabla_{i_k}f(z^k)\right| \,\Big|\, z^k\right] + \frac{1}{2(1-\beta)^2}\mathbb{E}\left[L_{i_k}(\gamma_i^k)^2 \,\Big|\, z^k\right]. \quad (28)$$

Due to space limitations, we provide the complexity results for SMTP_IS in the Appendix.

5 EXPERIMENTS

Experimental Setup. We conduct extensive experiments³ on challenging non-convex continuous control tasks from the MuJoCo suite Todorov et al. (2012). In particular, we address the problem of model-free control of a dynamical system. Policy gradient methods for model-free reinforcement learning provide an off-the-shelf approach to learning how to control a dynamical system and are often benchmarked in a simulator. We compare our proposed momentum stochastic three points method SMTP and its importance sampling version SMTP_IS against state-of-the-art DFO-based methods such as STP_IS Bibi et al. (2019) and ARS Mania et al. (2018). Moreover, we also compare against classical policy gradient methods such as TRPO Schulman et al. (2015) and NG Rajeswaran et al. (2017). We conduct experiments on several environments of varying difficulty: Swimmer-v1, Hopper-v1, HalfCheetah-v1, Ant-v1, and Humanoid-v1.

Note that due to the stochastic nature of the problem, where $f$ is stochastic, we use the mean of the function values of $f(x^k)$, $f(x_+^k)$ and $f(x_-^k)$ (see Algorithm 1) over $K$ observations. Similar to the work in Bibi et al. (2019), we use $K = 2$ for Swimmer-v1, $K = 4$ for both Hopper-v1 and HalfCheetah-v1, and $K = 40$ for Ant-v1 and Humanoid-v1. Following Bibi et al. (2019), these values were chosen based on validation performance over the grid $K \in \{1, 2, 4, 8, 16\}$ for the lower-dimensional problems Swimmer-v1, Hopper-v1 and HalfCheetah-v1, and $K \in \{20, 40, 80, 120\}$ for the higher-dimensional problems Ant-v1 and Humanoid-v1. As for the momentum term, we set $\beta = 0.5$ for SMTP. For SMTP_IS, as the smoothness constants are not available for continuous control, we use the coordinate smoothness constants of a $\theta$-parameterized smooth function $\hat{f}_\theta$ (a multi-layer perceptron) that estimates $f$. In particular, consider running any DFO method for $n$ steps; with the queried samples $\{x_i, f(x_i)\}_{i=1}^n$, we estimate $f$ by solving $\theta_{n+1} = \arg\min_\theta \sum_i \left(f(x_i) - \hat{f}(x_i; \theta)\right)^2$. See Bibi et al. (2019) for further implementation details, as we follow the same experimental procedure.

³The code will be made available online upon acceptance of this work.
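The $K$-observation averaging described above can be sketched as follows; `noisy_f`, the toy objective and the variance check are illustrative stand-ins for the MuJoCo rollouts:

```python
import numpy as np

def averaged_objective(noisy_f, x, K, rng):
    """Average K noisy evaluations of f at x, as used for the function
    comparisons in Algorithm 1. noisy_f is a stand-in for an environment
    rollout returning a stochastic reward."""
    return float(np.mean([noisy_f(x, rng) for _ in range(K)]))

# toy check: the variance of the K-sample mean shrinks roughly like 1/K,
# which is what stabilizes the three-point comparison under noisy rewards
rng = np.random.default_rng(0)
noisy_f = lambda x, rng: float(x @ x) + rng.normal(scale=1.0)
vals_1 = [averaged_objective(noisy_f, np.ones(2), 1, rng) for _ in range(2000)]
vals_16 = [averaged_objective(noisy_f, np.ones(2), 16, rng) for _ in range(2000)]
assert np.var(vals_16) < np.var(vals_1)
```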


[Figure 1 here: average reward vs. number of episodes on Swimmer-v1, Hopper-v1, HalfCheetah-v1, Ant-v1 and Humanoid-v1.]

Figure 1: SMTP is far superior to STP on all 5 MuJoCo tasks, particularly on the high-dimensional Humanoid-v1 problem. The horizontal dashed lines are the thresholds used in Table 2 to demonstrate the complexity of each method.

Table 2: For each MuJoCo task, we report the average number of episodes required to achieve a predefined reward threshold. Results for our methods are averaged over five random seeds; the rest are copied from Mania et al. (2018). (N/A means the method failed to reach the threshold; UNK means the result is unknown since it is not reported in the literature.)

Environment | Threshold | STP | STP_IS | SMTP | SMTP_IS | ARS (V1-t) | ARS (V2-t) | NG-lin | TRPO-nn
Swimmer-v1 | 325 | 320 | 110 | 80 | 100 | 100 | 427 | 1450 | N/A
Hopper-v1 | 3120 | 3970 | 2400 | 1264 | 1408 | 51840 | 1973 | 13920 | 10000
HalfCheetah-v1 | 3430 | 13760 | 4420 | 1872 | 1624 | 8106 | 1707 | 11250 | 4250
Ant-v1 | 3580 | 107220 | 43860 | 19890 | 14420 | 58133 | 20800 | 39240 | 73500
Humanoid-v1 | 6000 | N/A | 530200 | 161230 | 207160 | N/A | 142600 | 130000 | UNK

In contrast to STP_IS, our method (SMTP) does not require sampling directions from the canonical basis; hence, we use directions sampled from the standard normal distribution in each iteration. For SMTP_IS, we follow a procedure similar to Bibi et al. (2019) and sample from the columns of a random matrix B. Following standard practice, we perform all experiments with 5 different initializations and measure the average reward (in continuous control we maximize the reward function $f$), as well as the best and worst run per iteration. We compare algorithms in terms of reward vs. sample complexity.

Comparison Against STP. Our method improves the sample complexity of STP and STP_IS significantly. Especially for high-dimensional problems like Ant-v1 and Humanoid-v1, the sample efficiency of SMTP is at least twice that of STP. Moreover, SMTP_IS helps in some experiments by improving over SMTP; however, this is not consistent across all environments. We believe this is largely due to the fact that SMTP_IS can only handle sampling from the canonical basis, similar to STP_IS.

Comparison Against State-of-The-Art. We compare our method with state-of-the-art DFO and policy gradient algorithms. For the environments Swimmer-v1, Hopper-v1, HalfCheetah-v1 and Ant-v1, our method outperforms the state-of-the-art results, whereas for Humanoid-v1 our method achieves a comparable sample complexity.

6 CONCLUSION

We have proposed SMTP, the first heavy ball momentum DFO-based algorithm, with convergence rates for non-convex, convex and strongly convex functions under generic sampling directions. We specialize the sampling to the set of coordinate bases and further improve the rates by proposing a momentum and importance sampling version, SMTP_IS, with new convergence rates for non-convex, convex and strongly convex functions as well. We conduct a large number of experiments on the task of


controlling dynamical systems. We outperform two different policy gradient methods and achieve comparable or better performance to the best DFO algorithm (ARS) on the respective environments.

REFERENCES

G. Allaire. Shape Optimization by the Homogenization Method. Springer, New York, USA, 2001.

N. Baba. Convergence of a random optimization method for constrained optimization problems. Journal of Optimization Theory and Applications, 33:1–11, 1981.

El Houcine Bergou, Eduard Gorbunov, and Peter Richtárik. Stochastic three points method for unconstrained smooth minimization. arXiv preprint arXiv:1902.03591, 2019.

Adel Bibi, El Houcine Bergou, Ozan Sener, Bernard Ghanem, and Peter Richtárik. Stochastic derivative-free optimization method with importance sampling. arXiv preprint arXiv:1902.01272, 2019.

Ruobing Chen. Stochastic derivative-free optimization of noisy functions. PhD thesis, Lehigh University, 2015.

A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. SIAM, Philadelphia, PA, USA, 2009.

M. A. Diniz-Ehrhardt, J. M. Martinez, and M. Raydan. A derivative-free nonmonotone line-search technique for unconstrained optimization. Journal of Optimization Theory and Applications, 219:383–397, 2008.

Mahdi Dodangeh and Luís N Vicente. Worst case complexity of direct search under convexity. Mathematical Programming, 155(1-2):307–332, 2016.

C. Dorea. Expected number of steps of a random optimization method. Journal of Optimization Theory and Applications, 39:165–171, 1983.

Pavel Dvurechensky, Alexander Gasnikov, and Alexander Tiurin. Randomized similar triangles method: A unifying framework for accelerated randomized optimization methods (coordinate descent, directional search, derivative-free method). arXiv preprint arXiv:1707.08486, 2017.

Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pp. 310–315. IEEE, 2015.

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic program- ming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

E. Gorbunov, P. Dvurechensky, and A. Gasnikov. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022, 2018.

S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.

J. Haslinger and R.A.E. Mäckinen. Introduction to Shape Optimization: Theory, Approximation, and Computa- tion. SIAM, Philadelphia, PA, USA, 2003.

R. Hooke and T.A. Jeeves. Direct search solution of numerical and statistical problems. J. Assoc. Comput. Mach, 8:212–229, 1961.

V. G. Karmanov. Convergence estimates for iterative minimization methods. USSR Computational Mathematics and Mathematical Physics, 14:1–13, 1974a.

V. G. Karmanov. On convergence of a random search method in convex minimization problems. Theory of Probability and its applications, 19:788–794, 1974b.

T. G. Kolda, R. M. Lewis, and V. J. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45:385–482, 2003.

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.


Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

A. L. Marsden, M. Wang, J. E. Dennis, and P. Moin. Optimal aeroacoustic shape design using the surrogate management framework. Optimization and Engineering, 5:235–262, 2004.

A. L. Marsden, M. Wang, J. E. Dennis, and P. Moin. Trailing-edge noise reduction using derivative-free optimization and large-eddy simulation. Journal of Fluid Mechanics, 5:235–262, 2007.

A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computational framework for derivative-free optimization of cardiovascular geometries. Computer Methods in Applied Mechanics and Engineering, 197:1890–1905, 2008.

J. Matyas. Random optimization. Automation and Remote Control, 26:246–253, 1965.

Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

B. Mohammadi and O. Pironneau. Applied Shape Optimization for Fluids. Clarendon Press, Oxford, 2001.

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M. Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561, 2017.

Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.
Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

M. Sarma. On the convergence of the Baba and Dorea random optimization methods. Journal of Optimization Theory and Applications, 66:337–343, 1990.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

S. U. Stich, C. L. Muller, and B. Gartner. Optimization of convex functions with random pursuit. arXiv preprint arXiv:1111.0194, 2011.

Sebastian U. Stich. Convex optimization with random pursuit. PhD thesis, ETH Zurich, 2014a.

Sebastian Urban Stich. On low complexity acceleration techniques for randomized optimization. In International Conference on Parallel Problem Solving from Nature, pp. 130–140. Springer, 2014b.

Yu Wen Su. Positive basis and a class of direct search techniques. Scientia Sinica (in Chinese), 9(S1):53–67, 1979.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

Virginia Torczon. On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1):1–25, 1997.

Luís Nunes Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1(1-2):143–153, 2013.

Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1–9, 2015.


A Stochastic Derivative Free Optimization Method with Momentum

(Supplementary Material)

A PRELIMINARIES

We first list the main assumptions.

Assumption A.1 (L-smoothness). We say that f is L-smooth if

    \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \forall x, y \in \mathbb{R}^d.    (29)

Assumption A.2. The probability distribution D on \mathbb{R}^d satisfies the following properties:

1. The quantity \gamma_D := \mathbb{E}_{s \sim D} \|s\|_2^2 is positive and finite.

2. There is a constant \mu_D > 0 and a norm \|\cdot\|_D on \mathbb{R}^d such that for all g \in \mathbb{R}^d

    \mathbb{E}_{s \sim D} |\langle g, s \rangle| \ge \mu_D \|g\|_D.    (30)

We now establish the key lemma which will be used to prove the theorems stated in the paper.

Lemma A.1. Assume that f is L-smooth and D satisfies Assumption A.2. Then for the iterates of SMTP the following inequalities hold:

    f(z^{k+1}) \le f(z^k) - \frac{\gamma^k}{1-\beta} \left|\langle \nabla f(z^k), s^k \rangle\right| + \frac{L(\gamma^k)^2}{2(1-\beta)^2} \|s^k\|_2^2    (31)

and

    \mathbb{E}_{s^k \sim D}\left[f(z^{k+1})\right] \le f(z^k) - \frac{\gamma^k \mu_D}{1-\beta} \|\nabla f(z^k)\|_D + \frac{L(\gamma^k)^2 \gamma_D}{2(1-\beta)^2}.    (32)

Proof. By induction one can show that

    z^k = x^k - \frac{\gamma^k \beta}{1-\beta} v^{k-1}.    (33)

Indeed, for k = 0 this recurrence holds, and the update rules for z^k, x^k and v^{k-1} do not break it. From this and v_+^k = \beta v^{k-1} + s^k we get

    z_+^{k+1} = x_+^{k+1} - \frac{\gamma^k \beta}{1-\beta} v_+^k
              = x^k - \gamma^k v_+^k - \frac{\gamma^k \beta}{1-\beta} v_+^k
              = x^k - \frac{\gamma^k}{1-\beta} v_+^k
              = x^k - \frac{\gamma^k \beta}{1-\beta} v^{k-1} - \frac{\gamma^k}{1-\beta} s^k
              \overset{(33)}{=} z^k - \frac{\gamma^k}{1-\beta} s^k.

Similarly, with v_-^k = \beta v^{k-1} - s^k,

    z_-^{k+1} = x_-^{k+1} - \frac{\gamma^k \beta}{1-\beta} v_-^k
              = x^k - \frac{\gamma^k}{1-\beta} v_-^k
              = x^k - \frac{\gamma^k \beta}{1-\beta} v^{k-1} + \frac{\gamma^k}{1-\beta} s^k
              \overset{(33)}{=} z^k + \frac{\gamma^k}{1-\beta} s^k.

It implies that

    f(z_+^{k+1}) \overset{(3)}{\le} f(z^k) + \langle \nabla f(z^k), z_+^{k+1} - z^k \rangle + \frac{L}{2} \|z_+^{k+1} - z^k\|_2^2
                 = f(z^k) - \frac{\gamma^k}{1-\beta} \langle \nabla f(z^k), s^k \rangle + \frac{L(\gamma^k)^2}{2(1-\beta)^2} \|s^k\|_2^2

and

    f(z_-^{k+1}) \le f(z^k) + \frac{\gamma^k}{1-\beta} \langle \nabla f(z^k), s^k \rangle + \frac{L(\gamma^k)^2}{2(1-\beta)^2} \|s^k\|_2^2.

Unifying these two inequalities we get

    f(z^{k+1}) \le \min\{f(z_+^{k+1}), f(z_-^{k+1})\} \le f(z^k) - \frac{\gamma^k}{1-\beta} \left|\langle \nabla f(z^k), s^k \rangle\right| + \frac{L(\gamma^k)^2}{2(1-\beta)^2} \|s^k\|_2^2,

which proves (31). Finally, taking the expectation \mathbb{E}_{s^k \sim D} of both sides of the previous inequality and invoking Assumption A.2, we obtain

    \mathbb{E}_{s^k \sim D}\left[f(z^{k+1})\right] \le f(z^k) - \frac{\gamma^k \mu_D}{1-\beta} \|\nabla f(z^k)\|_D + \frac{L(\gamma^k)^2 \gamma_D}{2(1-\beta)^2}.
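To make the updates used in this proof concrete, here is a minimal pure-Python sketch of one SMTP run. This is a sketch under our own assumptions: the variable names, the Gaussian sampling of s^k, and the behaviour when neither ±s^k improves f(z) are illustrative choices of ours, not taken from the authors' implementation.

```python
import random

def smtp(f, x0, gamma=0.1, beta=0.5, n_iters=400, seed=1):
    """Sketch of SMTP: draw s^k, form the momentum candidates
    v_+- = beta*v +- s^k, compare f at the auxiliary points
    z_+- = z -+ (gamma/(1-beta)) s^k, and keep the best of z, z_+, z_-."""
    rng = random.Random(seed)
    d = len(x0)
    x, v, z = list(x0), [0.0] * d, list(x0)
    c = gamma / (1.0 - beta)
    for _ in range(n_iters):
        s = [rng.gauss(0.0, 1.0) for _ in range(d)]
        zp = [zi - c * si for zi, si in zip(z, s)]   # z_+^{k+1} = z^k - c s^k
        zm = [zi + c * si for zi, si in zip(z, s)]   # z_-^{k+1} = z^k + c s^k
        fz, fp, fm = f(z), f(zp), f(zm)
        if fp <= fz and fp <= fm:
            v = [beta * vi + si for vi, si in zip(v, s)]   # v_+^k
            z = zp
        elif fm < fz:
            v = [beta * vi - si for vi, si in zip(v, s)]   # v_-^k
            z = zm
        else:
            continue  # neither direction improves f(z): keep z^k (our choice)
        x = [xi - gamma * vi for xi, vi in zip(x, v)]      # x^{k+1} = x^k - gamma v^k
    return z

# usage: minimize a simple quadratic
quad = lambda p: sum(t * t for t in p)
z_final = smtp(quad, [1.0, 2.0, 3.0])
```

Because the next iterate is the argmin over the three candidate points, the sequence f(z^k) produced by this sketch is monotone by construction, matching the monotonicity used throughout the analysis.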

B NON-CONVEX CASE

Theorem B.1. Let Assumptions A.1 and A.2 be satisfied. Let SMTP with \gamma^k \equiv \gamma > 0 produce points \{z^0, z^1, \ldots, z^{K-1}\}, and let z^K be chosen uniformly at random among them. Then

    \mathbb{E}\left[\|\nabla f(z^K)\|_D\right] \le \frac{(1-\beta)(f(x^0) - f(x^*))}{K \gamma \mu_D} + \frac{L \gamma \gamma_D}{2 \mu_D (1-\beta)}.    (34)

Moreover, if we choose \gamma = \frac{\gamma_0}{\sqrt{K}}, the bound (34) reduces to

    \mathbb{E}\left[\|\nabla f(z^K)\|_D\right] \le \frac{1}{\sqrt{K}} \left( \frac{(1-\beta)(f(x^0) - f(x^*))}{\gamma_0 \mu_D} + \frac{L \gamma_0 \gamma_D}{2 \mu_D (1-\beta)} \right).    (35)

Then \gamma_0 = \sqrt{\frac{2(1-\beta)^2 (f(x^0) - f(x^*))}{L \gamma_D}} minimizes the right-hand side of (35), and for this choice we have

    \mathbb{E}\left[\|\nabla f(z^K)\|_D\right] \le \frac{\sqrt{2 (f(x^0) - f(x^*)) L \gamma_D}}{\mu_D \sqrt{K}}.    (36)

Proof. Taking the full expectation of both sides of inequality (32) and rearranging, we get

    \mathbb{E}\left[\|\nabla f(z^k)\|_D\right] \le \frac{(1-\beta) \mathbb{E}\left[f(z^k) - f(z^{k+1})\right]}{\gamma \mu_D} + \frac{L \gamma \gamma_D}{2 \mu_D (1-\beta)}.

Summing up these inequalities for k = 0, 1, \ldots, K-1, dividing both sides by K and using the tower property of mathematical expectation, we get

    \mathbb{E}\left[\|\nabla f(z^K)\|_D\right] = \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[\|\nabla f(z^k)\|_D\right] \le \frac{(1-\beta)(f(z^0) - f(x^*))}{K \gamma \mu_D} + \frac{L \gamma \gamma_D}{2 \mu_D (1-\beta)}.

The last part, where \gamma = \frac{\gamma_0}{\sqrt{K}}, is straightforward.


C CONVEX CASE

Assumption C.1. We assume that f is convex, has a minimizer x^*, and has a bounded level set at x^0:

    R_0 := \max\left\{ \|x - x^*\|_D^* \,:\, f(x) \le f(x^0) \right\} < +\infty,    (37)

where \|\xi\|_D^* := \max\{ \langle \xi, x \rangle \,:\, \|x\|_D \le 1 \} defines the dual norm to \|\cdot\|_D.

Theorem C.1 (Constant stepsize). Let Assumptions A.1, A.2 and C.1 be satisfied. If we set \gamma^k \equiv \gamma < \frac{(1-\beta) R_0}{\mu_D}, then for the iterates of the SMTP method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L \gamma \gamma_D R_0}{2 (1-\beta) \mu_D}.    (38)

If we choose \gamma = \frac{\varepsilon (1-\beta) \mu_D}{L \gamma_D R_0} for some 0 < \varepsilon \le \frac{L \gamma_D R_0^2}{\mu_D^2} and run SMTP for k = K iterations, where

    K = \frac{1}{\varepsilon} \cdot \frac{L \gamma_D R_0^2}{\mu_D^2} \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon},    (39)

then we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. From (32) and the monotonicity of \{f(z^k)\}_{k \ge 0} we have

    \mathbb{E}_{s^k \sim D}\left[f(z^{k+1})\right] \le f(z^k) - \frac{\gamma \mu_D}{1-\beta} \|\nabla f(z^k)\|_D + \frac{L \gamma^2 \gamma_D}{2(1-\beta)^2}
        \overset{(14)}{\le} f(z^k) - \frac{\gamma \mu_D}{(1-\beta) R_0} \left(f(z^k) - f(x^*)\right) + \frac{L \gamma^2 \gamma_D}{2(1-\beta)^2}.

Taking the full expectation, subtracting f(x^*) from both sides of the previous inequality and using the tower property of mathematical expectation, we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{L \gamma^2 \gamma_D}{2(1-\beta)^2}.    (40)

Since \gamma < \frac{(1-\beta) R_0}{\mu_D}, the factor 1 - \frac{\gamma \mu_D}{(1-\beta) R_0} is positive and we can unroll the recurrence (40):

    \mathbb{E}\left[f(z^k)\right] - f(x^*)
        \le \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L \gamma^2 \gamma_D}{2(1-\beta)^2} \sum_{l=0}^{k-1} \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right)^l
        \le \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L \gamma^2 \gamma_D}{2(1-\beta)^2} \cdot \frac{(1-\beta) R_0}{\gamma \mu_D}
        = \left(1 - \frac{\gamma \mu_D}{(1-\beta) R_0}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L \gamma \gamma_D R_0}{2(1-\beta) \mu_D}.

Lastly, putting \gamma = \frac{\varepsilon (1-\beta) \mu_D}{L \gamma_D R_0} and k = K from (39) in (38), we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \left(1 - \frac{\varepsilon \mu_D^2}{L \gamma_D R_0^2}\right)^K \left(f(x^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \le \exp\left(-K \cdot \frac{\varepsilon \mu_D^2}{L \gamma_D R_0^2}\right) \left(f(x^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \overset{(39)}{=} \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.


Next we use a technical lemma from Mishchenko et al. (2019). We provide the original proof for completeness.

Lemma C.1 (Lemma 6 from Mishchenko et al. (2019)). Let a sequence \{a^k\}_{k \ge 0} satisfy the inequality

    a^{k+1} \le (1 - \gamma^k \alpha) a^k + (\gamma^k)^2 N

for any positive \gamma^k \le \gamma_0, with some constants \alpha > 0, N > 0, \gamma_0 > 0. Further, let \theta \ge \frac{2}{\gamma_0} and take C such that N \le \frac{\alpha \theta}{4} C and a^0 \le C. Then it holds that

    a^k \le \frac{C}{\frac{\alpha}{\theta} k + 1}

if we set \gamma^k = \frac{2}{\alpha k + \theta}.

Proof. We will show the inequality for a^k by induction. Since the inequality a^0 \le C is one of our assumptions, we have the initial step of the induction. To prove the inductive step, consider

    a^{k+1} \le (1 - \gamma^k \alpha) a^k + (\gamma^k)^2 N \le \left(1 - \frac{2\alpha}{\alpha k + \theta}\right) \frac{\theta C}{\alpha k + \theta} + \theta \alpha \frac{C}{(\alpha k + \theta)^2}.

To show that the right-hand side is upper bounded by \frac{\theta C}{\alpha(k+1) + \theta}, one needs to have, after multiplying both sides by (\alpha k + \theta)(\alpha k + \alpha + \theta)(\theta C)^{-1},

    \left(1 - \frac{2\alpha}{\alpha k + \theta}\right)(\alpha k + \alpha + \theta) + \alpha \frac{\alpha k + \alpha + \theta}{\alpha k + \theta} \le \alpha k + \theta,

which is equivalent to

    \alpha - \alpha \frac{\alpha k + \alpha + \theta}{\alpha k + \theta} \le 0.

The last inequality is trivially satisfied for all k \ge 0.
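As a quick sanity check of Lemma C.1, one can iterate the recursion with equality (the worst case the lemma allows) and verify the claimed O(1/k) bound numerically. This is a sketch: the constants alpha, N, gamma0 and a^0 below are arbitrary test values of ours.

```python
# Worst-case recursion a^{k+1} = (1 - gamma_k*alpha) a^k + gamma_k^2 * N
# with gamma_k = 2/(alpha*k + theta); check a^k <= C / ((alpha/theta)*k + 1).
alpha, N, gamma0 = 0.5, 3.0, 0.4
theta = 2.0 / gamma0                      # lemma requires theta >= 2/gamma0
C = max(4.0 * N / (alpha * theta), 1.0)   # ensures N <= alpha*theta*C/4 and a^0 <= C
a = 1.0                                   # a^0
for k in range(1000):
    bound = C / ((alpha / theta) * k + 1.0)
    assert a <= bound + 1e-12, (k, a, bound)
    gamma_k = 2.0 / (alpha * k + theta)   # note gamma_k <= 2/theta = gamma0
    a = (1.0 - gamma_k * alpha) * a + gamma_k ** 2 * N
print("Lemma C.1 bound holds for 1000 iterations")
```

Running the recursion with equality is the tightest admissible sequence, so this check exercises exactly the induction carried out in the proof above.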

Theorem C.2 (Decreasing stepsizes). Let Assumptions A.1, A.2 and C.1 be satisfied. If we set \gamma^k = \frac{2}{\alpha k + \theta}, where \alpha = \frac{\mu_D}{(1-\beta) R_0} and \theta \ge \frac{2}{\alpha}, then for the iterates of the SMTP method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \frac{1}{\eta k + 1} \max\left\{ f(x^0) - f(x^*), \frac{2 L \gamma_D}{\alpha \theta (1-\beta)^2} \right\},    (41)

where \eta := \frac{\alpha}{\theta}. In particular, if we choose \gamma^k = \frac{2\alpha}{\alpha^2 k + 2}, where \alpha = \frac{\mu_D}{(1-\beta) R_0}, and run SMTP for k = K iterations, where

    K = \frac{1}{\varepsilon} \cdot \frac{2 R_0^2}{\mu_D^2} \max\left\{ (1-\beta)^2 (f(x^0) - f(x^*)), \, L \gamma_D \right\} - \frac{2 (1-\beta)^2 R_0^2}{\mu_D^2}, \qquad \varepsilon > 0,    (42)

we get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. In (40) we proved that

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\gamma^k \mu_D}{(1-\beta) R_0}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{L (\gamma^k)^2 \gamma_D}{2(1-\beta)^2}.

Hence we can apply Lemma C.1 to the sequence \mathbb{E}\left[f(z^k)\right] - f(x^*). The constants for the lemma are N = \frac{L \gamma_D}{2(1-\beta)^2}, \alpha = \frac{\mu_D}{(1-\beta) R_0} and C = \max\left\{ f(x^0) - f(x^*), \frac{2 L \gamma_D}{\alpha \theta (1-\beta)^2} \right\}. Lastly, choosing \gamma^k = \frac{2\alpha}{\alpha^2 k + 2} is equivalent to the choice \theta = \frac{2}{\alpha}. In this case we have \alpha \theta = 2, C = \max\left\{ f(x^0) - f(x^*), \frac{L \gamma_D}{(1-\beta)^2} \right\} and \eta = \frac{\alpha}{\theta} = \frac{\alpha^2}{2} = \frac{\mu_D^2}{2 (1-\beta)^2 R_0^2}. Putting these parameters and K from (42) in (41), we get the result.


D STRONGLY CONVEX CASE

Assumption D.1. We assume that f is \mu-strongly convex with respect to the norm \|\cdot\|_D^*:

    f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \left(\|y - x\|_D^*\right)^2 \quad \forall x, y \in \mathbb{R}^d.    (43)

It is well known that strong convexity implies

    \|\nabla f(x)\|_D^2 \ge 2\mu \left(f(x) - f(x^*)\right).    (44)

Theorem D.1 (Solution-dependent stepsizes). Let Assumptions A.1, A.2 and D.1 be satisfied. If we set \gamma^k = \frac{(1-\beta) \theta_k \mu_D}{L} \sqrt{2\mu (f(z^k) - f(x^*))} for some \theta_k \in (0, 2) such that \theta = \inf_{k \ge 0} \{2\theta_k - \gamma_D \theta_k^2\} \in \left(0, \frac{L}{\mu_D^2 \mu}\right], then for the iterates of SMTP the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \frac{\theta \mu_D^2 \mu}{L}\right)^k \left(f(x^0) - f(x^*)\right).    (45)

If we run SMTP for k = K iterations, where

    K = \frac{\kappa}{\theta \mu_D^2} \ln \frac{f(x^0) - f(x^*)}{\varepsilon}, \qquad \varepsilon > 0,    (46)

where \kappa := \frac{L}{\mu} is the condition number of the objective, we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. From (32) and \gamma^k = \frac{(1-\beta) \theta_k \mu_D}{L} \sqrt{2\mu (f(z^k) - f(x^*))} we have

    \mathbb{E}_{s^k \sim D}\left[f(z^{k+1})\right] - f(x^*)
        \le f(z^k) - f(x^*) - \frac{\gamma^k \mu_D}{1-\beta} \|\nabla f(z^k)\|_D + \frac{L (\gamma^k)^2 \gamma_D}{2(1-\beta)^2}
        \overset{(44)}{\le} f(z^k) - f(x^*) - \frac{2\theta_k \mu_D^2 \mu}{L} \left(f(z^k) - f(x^*)\right) + \frac{\gamma_D \theta_k^2 \mu_D^2 \mu}{L} \left(f(z^k) - f(x^*)\right)
        = \left(1 - \left(2\theta_k - \gamma_D \theta_k^2\right) \frac{\mu_D^2 \mu}{L}\right) \left(f(z^k) - f(x^*)\right).

Using \theta = \inf_{k \ge 0} \{2\theta_k - \gamma_D \theta_k^2\} and taking the full expectation of the previous inequality, we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\theta \mu_D^2 \mu}{L}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) \le \left(1 - \frac{\theta \mu_D^2 \mu}{L}\right)^{k+1} \left(f(x^0) - f(x^*)\right).

Lastly, from (45) we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \left(1 - \frac{\theta \mu_D^2 \mu}{L}\right)^K \left(f(x^0) - f(x^*)\right) \le \exp\left(-K \frac{\theta \mu_D^2 \mu}{L}\right) \left(f(x^0) - f(x^*)\right) \overset{(46)}{\le} \varepsilon.


Assumption D.2. We assume that \|s\|_2 = 1 for all s \sim D.

Theorem D.2 (Solution-free stepsizes). Let Assumptions A.1, A.2, D.1 and D.2 be satisfied. If additionally we compute f(z^k + t s^k) and set \gamma^k = \frac{(1-\beta) \left|f(z^k + t s^k) - f(z^k)\right|}{L t} for t > 0, and assume that D is such that \mu_D^2 \le \frac{L}{\mu}, then for the iterates of SMTP the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \frac{\mu_D^2 \mu}{L}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L^2 t^2}{8 \mu_D^2 \mu}.    (47)

Moreover, for any \varepsilon > 0, if we set t such that

    0 < t \le \sqrt{\frac{4 \varepsilon \mu_D^2 \mu}{L^2}}    (48)

and run SMTP for k = K iterations, where

    K = \frac{\kappa}{\mu_D^2} \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon},    (49)

where \kappa := \frac{L}{\mu} is the condition number of f, we will have \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. Recall that from (31) and Assumption D.2 we have

    f(z^{k+1}) \le f(z^k) - \frac{\gamma^k}{1-\beta} \left|\langle \nabla f(z^k), s^k \rangle\right| + \frac{L (\gamma^k)^2}{2(1-\beta)^2}.

Minimizing the right-hand side of the previous inequality as a function of \gamma^k gives the optimal choice \gamma_{\mathrm{opt}}^k = \frac{(1-\beta) \left|\langle \nabla f(z^k), s^k \rangle\right|}{L}. However, this stepsize is impractical for derivative-free optimization, since it requires knowledge of \nabla f(z^k). The natural way to handle this is to approximate the directional derivative \langle \nabla f(z^k), s^k \rangle by the finite difference \frac{f(z^k + t s^k) - f(z^k)}{t}, and that is what we do. We choose

    \gamma^k = \frac{(1-\beta) \left|f(z^k + t s^k) - f(z^k)\right|}{L t} = \gamma_{\mathrm{opt}}^k + \underbrace{\frac{(1-\beta) \left|f(z^k + t s^k) - f(z^k)\right|}{L t} - \frac{(1-\beta) \left|\langle \nabla f(z^k), s^k \rangle\right|}{L}}_{=: \delta^k}.

Since the right-hand side of the first inequality is a quadratic in \gamma^k minimized at \gamma_{\mathrm{opt}}^k, from this we get

    f(z^{k+1}) \le f(z^k) - \frac{\left|\langle \nabla f(z^k), s^k \rangle\right|^2}{2L} + \frac{L}{2(1-\beta)^2} (\delta^k)^2.

Next we estimate |\delta^k|:

    |\delta^k| = \frac{1-\beta}{L t} \left|\, \left|f(z^k + t s^k) - f(z^k)\right| - \left|\langle \nabla f(z^k), t s^k \rangle\right| \,\right|
        \le \frac{1-\beta}{L t} \left| f(z^k + t s^k) - f(z^k) - \langle \nabla f(z^k), t s^k \rangle \right|
        \overset{(3)}{\le} \frac{1-\beta}{L t} \cdot \frac{L}{2} \|t s^k\|_2^2 = \frac{(1-\beta) t}{2}.

It implies that

    f(z^{k+1}) \le f(z^k) - \frac{\left|\langle \nabla f(z^k), s^k \rangle\right|^2}{2L} + \frac{L}{2(1-\beta)^2} \cdot \frac{(1-\beta)^2 t^2}{4} = f(z^k) - \frac{\left|\langle \nabla f(z^k), s^k \rangle\right|^2}{2L} + \frac{L t^2}{8},

and after taking the full expectation of both sides of the obtained inequality we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \mathbb{E}\left[f(z^k)\right] - f(x^*) - \frac{1}{2L} \mathbb{E}\left[\left|\langle \nabla f(z^k), s^k \rangle\right|^2\right] + \frac{L t^2}{8}.

Note that from the tower property of mathematical expectation and Jensen's inequality we have

    \mathbb{E}\left[\left|\langle \nabla f(z^k), s^k \rangle\right|^2\right]
        = \mathbb{E}\left[ \mathbb{E}_{s^k \sim D}\left[ \left|\langle \nabla f(z^k), s^k \rangle\right|^2 \mid z^k \right] \right]
        \ge \mathbb{E}\left[ \left( \mathbb{E}_{s^k \sim D}\left[ \left|\langle \nabla f(z^k), s^k \rangle\right| \mid z^k \right] \right)^2 \right]
        \overset{(30)}{\ge} \mathbb{E}\left[ \mu_D^2 \|\nabla f(z^k)\|_D^2 \right]
        \overset{(44)}{\ge} 2 \mu_D^2 \mu \, \mathbb{E}\left[f(z^k) - f(x^*)\right].

Putting all together we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\mu_D^2 \mu}{L}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{L t^2}{8}.

Due to \mu_D^2 \le \frac{L}{\mu} we can unroll this recurrence:

    \mathbb{E}\left[f(z^k)\right] - f(x^*)
        \le \left(1 - \frac{\mu_D^2 \mu}{L}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L t^2}{8} \sum_{l=0}^{\infty} \left(1 - \frac{\mu_D^2 \mu}{L}\right)^l
        = \left(1 - \frac{\mu_D^2 \mu}{L}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{L^2 t^2}{8 \mu_D^2 \mu}.

Lastly, from (47) we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*)
        \le \left(1 - \frac{\mu_D^2 \mu}{L}\right)^K \left(f(x^0) - f(x^*)\right) + \frac{L^2 t^2}{8 \mu_D^2 \mu}
        \overset{(48)}{\le} \exp\left(-K \frac{\mu_D^2 \mu}{L}\right) \left(f(x^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \overset{(49)}{\le} \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
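The proof hinges on the bound |\delta^k| \le \frac{(1-\beta)t}{2}, which in turn follows from the smoothness bound |(f(z + ts) - f(z))/t - \langle \nabla f(z), s \rangle| \le \frac{Lt}{2} for unit directions s. Here is a small numerical sketch of that underlying inequality; the test function f(x) = \|x\|_2^2 (with L = 2) and its gradient are our own illustrative choices.

```python
import math, random

rng = random.Random(0)
d, L, t = 5, 2.0, 0.05            # f(x) = ||x||_2^2 is L-smooth with L = 2

def f(x):
    return sum(xi * xi for xi in x)

def grad(x):                       # exact gradient, used only to test the bound
    return [2.0 * xi for xi in x]

for _ in range(100):
    z = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    s = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(si * si for si in s))
    s = [si / norm for si in s]    # Assumption D.2: ||s||_2 = 1
    fd = (f([zi + t * si for zi, si in zip(z, s)]) - f(z)) / t
    dderiv = sum(gi * si for gi, si in zip(grad(z), s))
    # L-smoothness gives |fd - <grad f(z), s>| <= L*t/2
    assert abs(fd - dderiv) <= L * t / 2 + 1e-12
```

For this quadratic the finite-difference error equals L t / 2 exactly, so the check is tight: it illustrates why the stepsize error \delta^k cannot be made smaller than O(t) without extra function evaluations.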


E SMTP_IS: STOCHASTIC MOMENTUM THREE POINTS WITH IMPORTANCE SAMPLING

Again, by the definition of z^{k+1} we get that the sequence \{f(z^k)\}_{k \ge 0} is monotone:

    f(z^{k+1}) \le f(z^k) \quad \forall k \ge 0.    (50)

Lemma E.1. Assume that f satisfies Assumption 4.1. Then for the iterates of SMTP_IS the following inequalities hold:

    f(z^{k+1}) \le f(z^k) - \frac{\gamma_{i_k}^k}{1-\beta} \left|\nabla_{i_k} f(z^k)\right| + \frac{L_{i_k} (\gamma_{i_k}^k)^2}{2(1-\beta)^2}    (51)

and

    \mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{1-\beta} \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] + \frac{1}{2(1-\beta)^2} \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right].    (52)

Proof. In the same way as in Lemma A.1 one can show that

    z^k = x^k - \frac{\gamma_{i_k}^k \beta}{1-\beta} v^{k-1}    (53)

and

    z_+^{k+1} = z^k - \frac{\gamma_{i_k}^k}{1-\beta} e_{i_k}, \qquad z_-^{k+1} = z^k + \frac{\gamma_{i_k}^k}{1-\beta} e_{i_k}.

It implies that

    f(z_+^{k+1}) \overset{(26)}{\le} f(z^k) - \frac{\gamma_{i_k}^k}{1-\beta} \nabla_{i_k} f(z^k) + \frac{L_{i_k} (\gamma_{i_k}^k)^2}{2(1-\beta)^2}

and

    f(z_-^{k+1}) \le f(z^k) + \frac{\gamma_{i_k}^k}{1-\beta} \nabla_{i_k} f(z^k) + \frac{L_{i_k} (\gamma_{i_k}^k)^2}{2(1-\beta)^2}.

Unifying these two inequalities we get

    f(z^{k+1}) \le \min\{f(z_+^{k+1}), f(z_-^{k+1})\} \le f(z^k) - \frac{\gamma_{i_k}^k}{1-\beta} \left|\nabla_{i_k} f(z^k)\right| + \frac{L_{i_k} (\gamma_{i_k}^k)^2}{2(1-\beta)^2},

which proves (51). Finally, taking the expectation \mathbb{E}[\cdot \mid z^k] conditioned on z^k of both sides of the previous inequality, we obtain (52).

E.1 NON-CONVEX CASE

Theorem E.1. Assume that f satisfies Assumption 4.1. Let SMTP_IS with \gamma_{i_k}^k = \frac{\gamma}{w_{i_k}} for some \gamma > 0 produce points \{z^0, z^1, \ldots, z^{K-1}\}, and let z^K be chosen uniformly at random among them. Then

    \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \frac{(1-\beta)(f(x^0) - f(x^*))}{K \gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}} + \frac{\gamma}{2(1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.    (54)

Moreover, if we choose \gamma = \frac{\gamma_0}{\sqrt{K}}, then

    \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \frac{1}{\sqrt{K}} \left( \frac{(1-\beta)(f(x^0) - f(x^*))}{\gamma_0 \min_{i=1,\ldots,d} \frac{p_i}{w_i}} + \frac{\gamma_0}{2(1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \right).    (55)

Note that if we choose \gamma_0 = \sqrt{\frac{2(1-\beta)^2 (f(x^0) - f(x^*))}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}} in order to minimize the right-hand side of (55), we will get

    \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \frac{\sqrt{2 (f(x^0) - f(x^*)) \sum_{i=1}^d \frac{L_i p_i}{w_i^2}}}{\sqrt{K} \min_{i=1,\ldots,d} \frac{p_i}{w_i}}.    (56)

Note also that for p_i = \frac{L_i}{\sum_{j=1}^d L_j} with w_i = L_i the rate improves to

    \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \frac{\sqrt{2 (f(x^0) - f(x^*)) \, d \sum_{i=1}^d L_i}}{\sqrt{K}}.    (57)

Proof. Recall that from (52) we have

    \mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{1-\beta} \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] + \frac{1}{2(1-\beta)^2} \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right].    (58)

Using our choice \gamma_{i_k}^k = \frac{\gamma}{w_{i_k}}, we derive

    \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] = \gamma \sum_{i=1}^d \frac{p_i}{w_i} \left|\nabla_i f(z^k)\right| \ge \gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i} \|\nabla f(z^k)\|_1

and

    \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right] = \gamma^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

Putting this in (58) and taking the full expectation of both sides of the obtained inequality, we get

    \mathbb{E}\left[f(z^{k+1})\right] \le \mathbb{E}\left[f(z^k)\right] - \frac{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{1-\beta} \mathbb{E}\left[\|\nabla f(z^k)\|_1\right] + \frac{\gamma^2}{2(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2},

whence

    \mathbb{E}\left[\|\nabla f(z^k)\|_1\right] \le \frac{(1-\beta) \left(\mathbb{E}\left[f(z^k)\right] - \mathbb{E}\left[f(z^{k+1})\right]\right)}{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}} + \frac{\gamma}{2(1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

Summing up the previous inequality for k = 0, 1, \ldots, K-1 and dividing both sides of the result by K, we get

    \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[\|\nabla f(z^k)\|_1\right] \le \frac{(1-\beta)(f(z^0) - f(x^*))}{K \gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}} + \frac{\gamma}{2(1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

It remains to notice that \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\left[\|\nabla f(z^k)\|_1\right] = \mathbb{E}\left[\|\nabla f(z^K)\|_1\right]. The last part, where \gamma = \frac{\gamma_0}{\sqrt{K}}, is straightforward.
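A minimal sketch of the SMTP_IS iteration with the importance probabilities p_i = L_i / \sum_j L_j and weights w_i = L_i used above. This is our own illustration, not the authors' code: for brevity we track only the auxiliary sequence z (which is what the convergence results bound) and omit the x/v momentum bookkeeping from Lemma E.1.

```python
import random

def smtp_is(f, x0, L, gamma=0.05, beta=0.5, n_iters=500, seed=0):
    """Coordinate SMTP with importance sampling (sketch).

    Samples coordinate i_k with probability p_i = L_i / sum(L), uses the
    per-coordinate stepsize gamma_i = gamma / w_i with w_i = L_i, and keeps
    the best of the three points z, z + c*e_i, z - c*e_i, so f(z) is monotone.
    """
    rng = random.Random(seed)
    total = sum(L)
    p = [Li / total for Li in L]               # importance probabilities
    z = list(x0)
    for _ in range(n_iters):
        i = rng.choices(range(len(L)), weights=p)[0]
        c = (gamma / L[i]) / (1.0 - beta)      # gamma_{i_k} / (1 - beta)
        zp, zm = list(z), list(z)
        zp[i] -= c                             # z_+^{k+1}
        zm[i] += c                             # z_-^{k+1}
        z = min((z, zp, zm), key=f)            # keep the best point
    return z

# usage: ill-conditioned quadratic with coordinate smoothness constants L = [1, 10]
fq = lambda x: 0.5 * (1.0 * x[0] ** 2 + 10.0 * x[1] ** 2)
z_best = smtp_is(fq, [1.0, 1.0], L=[1.0, 10.0])
```

Coordinates with larger L_i are sampled more often but moved with proportionally smaller steps, which is exactly the trade-off that produces the \sum_i L_i (rather than d \max_i L_i) dependence in the rates above.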


E.2 CONVEX CASE

As for SMTP, to tackle convex problems with SMTP_IS we use Assumption 3.2 with \|\cdot\|_D = \|\cdot\|_1. Note that in this case R_0 = \max\left\{ \|x - x^*\|_\infty \,:\, f(x) \le f(x^0) \right\}.

Theorem E.2 (Constant stepsize). Let Assumptions 3.2 and 4.1 be satisfied. If we set \gamma_{i_k}^k = \frac{\gamma}{w_{i_k}} such that 0 < \gamma \le \frac{(1-\beta) R_0}{\min_{i=1,\ldots,d} \frac{p_i}{w_i}}, then for the iterates of the SMTP_IS method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \frac{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0}\right)^k \left(f(z^0) - f(x^*)\right) + \frac{\gamma R_0}{2(1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.    (59)

Moreover, if we choose \gamma = \frac{\varepsilon (1-\beta) \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{R_0 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}} for some 0 < \varepsilon \le \frac{R_0^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}}{\min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}} and run SMTP_IS for k = K iterations, where

    K = \frac{1}{\varepsilon} \cdot \frac{R_0^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}}{\min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}} \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon},    (60)

we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon. Moreover, for p_i = \frac{L_i}{\sum_{j=1}^d L_j} with w_i = L_i, the rate improves to

    K = \frac{1}{\varepsilon} R_0^2 \, d \sum_{i=1}^d L_i \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon}.    (61)

Proof. Recall that from (52) we have

    \mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{1-\beta} \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] + \frac{1}{2(1-\beta)^2} \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right].    (62)

Using our choice \gamma_{i_k}^k = \frac{\gamma}{w_{i_k}}, we derive

    \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] = \gamma \sum_{i=1}^d \frac{p_i}{w_i} \left|\nabla_i f(z^k)\right| \ge \gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i} \|\nabla f(z^k)\|_1 \overset{(14)}{\ge} \frac{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{R_0} \left(f(z^k) - f(x^*)\right)

and

    \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right] = \gamma^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

Putting this in (62) and taking the full expectation of both sides of the obtained inequality, we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{\gamma^2}{2(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.    (63)

Due to our choice of \gamma \le \frac{(1-\beta) R_0}{\min_{i=1,\ldots,d} \frac{p_i}{w_i}}, the factor 1 - \frac{\gamma \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0} is non-negative and, therefore,

    \mathbb{E}\left[f(z^k)\right] - f(x^*)
        \le \left(1 - \frac{\gamma \min_i \frac{p_i}{w_i}}{(1-\beta) R_0}\right)^k \left(f(z^0) - f(x^*)\right) + \frac{\gamma^2}{2(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \sum_{l=0}^{\infty} \left(1 - \frac{\gamma \min_i \frac{p_i}{w_i}}{(1-\beta) R_0}\right)^l
        = \left(1 - \frac{\gamma \min_i \frac{p_i}{w_i}}{(1-\beta) R_0}\right)^k \left(f(z^0) - f(x^*)\right) + \frac{\gamma R_0}{2(1-\beta) \min_i \frac{p_i}{w_i}} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

Then, putting \gamma = \frac{\varepsilon (1-\beta) \min_i \frac{p_i}{w_i}}{R_0 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}} and k = K from (60) in (59), we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*)
        \le \left(1 - \frac{\varepsilon \min_i \frac{p_i^2}{w_i^2}}{R_0^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right)^K \left(f(z^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \le \exp\left(-K \cdot \frac{\varepsilon \min_i \frac{p_i^2}{w_i^2}}{R_0^2 \sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right) \left(f(z^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \overset{(60)}{=} \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.

Theorem E.3 (Decreasing stepsizes). Let Assumptions 3.2 and 4.1 be satisfied. If we set \gamma_{i_k}^k = \frac{\gamma^k}{w_{i_k}} and \gamma^k = \frac{2}{\alpha k + \theta}, where \alpha = \frac{\min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0} and \theta \ge \frac{2}{\alpha}, then for the iterates of the SMTP_IS method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \frac{1}{\eta k + 1} \max\left\{ f(x^0) - f(x^*), \frac{2}{\alpha \theta (1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \right\},    (64)

where \eta := \frac{\alpha}{\theta}. Moreover, if we choose \gamma^k = \frac{2\alpha}{\alpha^2 k + 2}, where \alpha = \frac{\min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0}, and run SMTP_IS for k = K iterations, where

    K = \frac{1}{\varepsilon} \cdot \frac{2 R_0^2}{\min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}} \max\left\{ (1-\beta)^2 (f(x^0) - f(x^*)), \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \right\} - \frac{2 (1-\beta)^2 R_0^2}{\min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}}, \qquad \varepsilon > 0,    (65)

we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. In (63) we proved that

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\gamma^k \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{(\gamma^k)^2}{2(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}.

Hence we can apply Lemma C.1 to the sequence \mathbb{E}\left[f(z^k)\right] - f(x^*). The constants for the lemma are N = \frac{1}{2(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}, \alpha = \frac{\min_{i=1,\ldots,d} \frac{p_i}{w_i}}{(1-\beta) R_0} and C = \max\left\{ f(x^0) - f(x^*), \frac{2}{\alpha \theta (1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \right\}. Lastly, note that choosing \gamma^k = \frac{2\alpha}{\alpha^2 k + 2} is equivalent to the choice \theta = \frac{2}{\alpha}. In this case we have \alpha \theta = 2, C = \max\left\{ f(x^0) - f(x^*), \frac{1}{(1-\beta)^2} \sum_{i=1}^d \frac{L_i p_i}{w_i^2} \right\} and \eta = \frac{\alpha}{\theta} = \frac{\alpha^2}{2} = \frac{\min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}}{2 (1-\beta)^2 R_0^2}. Putting these parameters and K from (65) in (64), we get the result.

E.3 STRONGLY CONVEX CASE

Theorem E.4 (Solution-dependent stepsizes). Let Assumptions 3.3 (with \|\cdot\|_D = \|\cdot\|_1) and 4.1 be satisfied. If we set

    \gamma_{i_k}^k = \frac{(1-\beta) \theta_k \min_{i=1,\ldots,d} \frac{p_i}{w_i}}{w_{i_k} \sum_{i=1}^d \frac{L_i p_i}{w_i^2}} \sqrt{2\mu \left(f(z^k) - f(x^*)\right)}

for some \theta_k \in (0, 2) such that \theta = \inf_{k \ge 0} \{2\theta_k - \theta_k^2\} \in \left(0, \frac{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}{\mu \min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}}\right], then for the iterates of the SMTP_IS method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \frac{\theta \mu \min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right)^k \left(f(x^0) - f(x^*)\right).    (66)

If we run SMTP_IS for k = K iterations, where

    K = \frac{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}{\theta \mu \min_{i=1,\ldots,d} \frac{p_i^2}{w_i^2}} \ln \frac{f(x^0) - f(x^*)}{\varepsilon}, \qquad \varepsilon > 0,    (67)

we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon.

Proof. Recall that from (52) we have

    \mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{1-\beta} \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right] + \frac{1}{2(1-\beta)^2} \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right].    (68)

Using our choice of \gamma_{i_k}^k we derive

    \mathbb{E}\left[\gamma_{i_k}^k \left|\nabla_{i_k} f(z^k)\right| \mid z^k\right]
        = \frac{(1-\beta) \theta_k \min_i \frac{p_i}{w_i}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}} \sqrt{2\mu (f(z^k) - f(x^*))} \sum_{i=1}^d \frac{p_i}{w_i} \left|\nabla_i f(z^k)\right|
        \ge \frac{(1-\beta) \theta_k \left(\min_i \frac{p_i}{w_i}\right)^2}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}} \sqrt{2\mu (f(z^k) - f(x^*))} \, \|\nabla f(z^k)\|_1
        \overset{(20)}{\ge} \frac{2 (1-\beta) \theta_k \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}} \mu \left(f(z^k) - f(x^*)\right)

and

    \mathbb{E}\left[L_{i_k} (\gamma_{i_k}^k)^2 \mid z^k\right]
        = \frac{2 (1-\beta)^2 \theta_k^2 \min_i \frac{p_i^2}{w_i^2}}{\left(\sum_{i=1}^d \frac{L_i p_i}{w_i^2}\right)^2} \mu \left(f(z^k) - f(x^*)\right) \sum_{i=1}^d \frac{L_i p_i}{w_i^2}
        = \frac{2 (1-\beta)^2 \theta_k^2 \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}} \mu \left(f(z^k) - f(x^*)\right).

Putting this in (68) and taking the full expectation of both sides of the obtained inequality, we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - (2\theta_k - \theta_k^2) \frac{\mu \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right).

Using \theta = \inf_{k \ge 0} \{2\theta_k - \theta_k^2\} we obtain

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \frac{\theta \mu \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) \le \left(1 - \frac{\theta \mu \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right)^{k+1} \left(f(x^0) - f(x^*)\right).

Lastly, from (66) we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \left(1 - \frac{\theta \mu \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right)^K \left(f(x^0) - f(x^*)\right) \le \exp\left(-K \frac{\theta \mu \min_i \frac{p_i^2}{w_i^2}}{\sum_{i=1}^d \frac{L_i p_i}{w_i^2}}\right) \left(f(x^0) - f(x^*)\right) \overset{(67)}{\le} \varepsilon.

The previous result is based on a choice of \gamma^k that depends on f(z^k) - f(x^*), a quantity which is often unknown in practice. The next theorem does not have this drawback and makes it possible to obtain the same rate of convergence using one extra function evaluation per iteration.

Theorem E.5 (Solution-free stepsizes). Let Assumptions 3.3 (with \|\cdot\|_D = \|\cdot\|_2) and 4.1 be satisfied. If additionally we compute f(z^k + t e_{i_k}) and set \gamma_{i_k}^k = \frac{(1-\beta) \left|f(z^k + t e_{i_k}) - f(z^k)\right|}{L_{i_k} t} for t > 0, then for the iterates of the SMTP_IS method the following inequality holds:

    \mathbb{E}\left[f(z^k)\right] - f(x^*) \le \left(1 - \mu \min_{i=1,\ldots,d} \frac{p_i}{L_i}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{t^2}{8 \mu \min_{i=1,\ldots,d} \frac{p_i}{L_i}} \sum_{i=1}^d p_i L_i.    (69)


Moreover, for any \varepsilon > 0, if we set t such that

    0 < t \le \sqrt{\frac{4 \varepsilon \mu \min_{i=1,\ldots,d} \frac{p_i}{L_i}}{\sum_{i=1}^d p_i L_i}}    (70)

and run SMTP_IS for k = K iterations, where

    K = \frac{1}{\mu \min_{i=1,\ldots,d} \frac{p_i}{L_i}} \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon},    (71)

we will get \mathbb{E}\left[f(z^K)\right] - f(x^*) \le \varepsilon. Moreover, note that for p_i = \frac{L_i}{\sum_{j=1}^d L_j} the rate improves to

    K = \frac{\sum_{i=1}^d L_i}{\mu} \ln \frac{2 (f(x^0) - f(x^*))}{\varepsilon}.    (72)

Proof. Recall that from (51) we have

    f(z^{k+1}) \le f(z^k) - \frac{\gamma_{i_k}^k}{1-\beta} \left|\nabla_{i_k} f(z^k)\right| + \frac{L_{i_k} (\gamma_{i_k}^k)^2}{2(1-\beta)^2}.

Minimizing the right-hand side of the previous inequality as a function of \gamma_{i_k}^k gives the optimal choice \gamma_{\mathrm{opt}}^k = \frac{(1-\beta) \left|\nabla_{i_k} f(z^k)\right|}{L_{i_k}}. However, this stepsize is impractical for derivative-free optimization, since it requires knowledge of \nabla_{i_k} f(z^k). The natural way to handle this is to approximate the partial derivative \nabla_{i_k} f(z^k) by the finite difference \frac{f(z^k + t e_{i_k}) - f(z^k)}{t}, and that is what we do. We choose

    \gamma_{i_k}^k = \frac{(1-\beta) \left|f(z^k + t e_{i_k}) - f(z^k)\right|}{L_{i_k} t} = \gamma_{\mathrm{opt}}^k + \underbrace{\frac{(1-\beta) \left|f(z^k + t e_{i_k}) - f(z^k)\right|}{L_{i_k} t} - \frac{(1-\beta) \left|\nabla_{i_k} f(z^k)\right|}{L_{i_k}}}_{=: \delta_{i_k}^k}.

Since the right-hand side of the first inequality is a quadratic in \gamma_{i_k}^k minimized at \gamma_{\mathrm{opt}}^k, from this we get

    f(z^{k+1}) \le f(z^k) - \frac{\left|\nabla_{i_k} f(z^k)\right|^2}{2 L_{i_k}} + \frac{L_{i_k}}{2(1-\beta)^2} (\delta_{i_k}^k)^2.

Next we estimate |\delta_{i_k}^k|:

    |\delta_{i_k}^k| = \frac{1-\beta}{L_{i_k} t} \left|\, \left|f(z^k + t e_{i_k}) - f(z^k)\right| - \left|\nabla_{i_k} f(z^k)\right| t \,\right|
        \le \frac{1-\beta}{L_{i_k} t} \left| f(z^k + t e_{i_k}) - f(z^k) - \nabla_{i_k} f(z^k) t \right|
        \overset{(26)}{\le} \frac{1-\beta}{L_{i_k} t} \cdot \frac{L_{i_k} t^2}{2} = \frac{(1-\beta) t}{2}.

It implies that

    f(z^{k+1}) \le f(z^k) - \frac{\left|\nabla_{i_k} f(z^k)\right|^2}{2 L_{i_k}} + \frac{L_{i_k}}{2(1-\beta)^2} \cdot \frac{(1-\beta)^2 t^2}{4} = f(z^k) - \frac{\left|\nabla_{i_k} f(z^k)\right|^2}{2 L_{i_k}} + \frac{L_{i_k} t^2}{8},

and after taking the expectation \mathbb{E}[\cdot \mid z^k] conditioned on z^k of both sides of the obtained inequality, we get

    \mathbb{E}\left[f(z^{k+1}) \mid z^k\right] \le f(z^k) - \frac{1}{2} \mathbb{E}\left[\frac{\left|\nabla_{i_k} f(z^k)\right|^2}{L_{i_k}} \mid z^k\right] + \frac{t^2}{8} \mathbb{E}\left[L_{i_k} \mid z^k\right].

Note that

    \mathbb{E}\left[\frac{\left|\nabla_{i_k} f(z^k)\right|^2}{L_{i_k}} \mid z^k\right] = \sum_{i=1}^d \frac{p_i}{L_i} \left|\nabla_i f(z^k)\right|^2 \ge \min_{i=1,\ldots,d} \frac{p_i}{L_i} \|\nabla f(z^k)\|_2^2 \overset{(44)}{\ge} 2\mu \min_{i=1,\ldots,d} \frac{p_i}{L_i} \left(f(z^k) - f(x^*)\right),

since \|\cdot\|_D = \|\cdot\|_2, and

    \mathbb{E}\left[L_{i_k} \mid z^k\right] = \sum_{i=1}^d p_i L_i.

Putting all together and taking the full expectation, we get

    \mathbb{E}\left[f(z^{k+1})\right] - f(x^*) \le \left(1 - \mu \min_{i=1,\ldots,d} \frac{p_i}{L_i}\right) \left(\mathbb{E}\left[f(z^k)\right] - f(x^*)\right) + \frac{t^2}{8} \sum_{i=1}^d p_i L_i.

Since \mu \le L_i for all i = 1, \ldots, d, the contraction factor lies in [0, 1) and we can unroll the recurrence:

    \mathbb{E}\left[f(z^k)\right] - f(x^*)
        \le \left(1 - \mu \min_i \frac{p_i}{L_i}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{t^2}{8} \sum_{i=1}^d p_i L_i \sum_{l=0}^{\infty} \left(1 - \mu \min_i \frac{p_i}{L_i}\right)^l
        = \left(1 - \mu \min_i \frac{p_i}{L_i}\right)^k \left(f(x^0) - f(x^*)\right) + \frac{t^2}{8 \mu \min_i \frac{p_i}{L_i}} \sum_{i=1}^d p_i L_i.

Lastly, from (69) we have

    \mathbb{E}\left[f(z^K)\right] - f(x^*)
        \le \left(1 - \mu \min_i \frac{p_i}{L_i}\right)^K \left(f(x^0) - f(x^*)\right) + \frac{t^2}{8 \mu \min_i \frac{p_i}{L_i}} \sum_{i=1}^d p_i L_i
        \overset{(70)}{\le} \exp\left(-K \mu \min_i \frac{p_i}{L_i}\right) \left(f(x^0) - f(x^*)\right) + \frac{\varepsilon}{2}
        \overset{(71)}{\le} \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.

E.4 COMPARISON OF SMTP AND SMTP_IS

Here we compare SMTP, where D is the normal distribution with zero mean and covariance matrix \frac{I_d}{d}, with SMTP_IS with probabilities p_i = \frac{L_i}{\sum_{j=1}^d L_j}. We choose this distribution for SMTP since it shows the best dimension dependence among the distributions considered in Lemma F.1. Note that if f satisfies Assumption 4.1, it is L-smooth with L = \max_{i=1,\ldots,d} L_i, so we always have \sum_{i=1}^d L_i \le dL. Table 3 summarizes the complexities in this case.

We note that for SMTP we have \|\cdot\|_D = \|\cdot\|_2; that is why one needs to compare SMTP with SMTP_IS carefully. At first glance, Table 3 says that in the non-convex and convex cases we get an extra d factor in the complexity of SMTP_IS when L_1 = \ldots = L_d = L. However, this is natural, since we use different norms for SMTP and SMTP_IS. In the non-convex case for SMTP we give the number of iterations needed to guarantee \mathbb{E}\left[\|\nabla f(z^K)\|_2\right] \le \varepsilon, while for SMTP_IS we provide the number of iterations needed to guarantee \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \varepsilon. From Hölder's inequality


Assumptions on f     | SMTP Complexity                                                           | Theorem | Importance Sampling           | SMTP_IS Complexity                                                                      | Theorem
---------------------|---------------------------------------------------------------------------|---------|-------------------------------|------------------------------------------------------------------------------------------|--------
None                 | \frac{\pi r_0 d L}{\varepsilon^2}                                         | 3.1     | p_i = L_i / \sum_{j=1}^d L_j  | \frac{2 r_0 d \sum_{i=1}^d L_i}{\varepsilon^2}                                           | E.1
Convex, R_0 < \infty | \frac{\pi R_{0,\ell_2}^2 d L}{2\varepsilon} \ln\frac{2 r_0}{\varepsilon}  | 3.2     | p_i = L_i / \sum_{j=1}^d L_j  | \frac{R_{0,\ell_\infty}^2 d \sum_{i=1}^d L_i}{\varepsilon} \ln\frac{2 r_0}{\varepsilon}  | E.2
\mu-strongly convex  | \frac{\pi d L}{2\mu} \ln\frac{2 r_0}{\varepsilon}                         | 3.5     | p_i = L_i / \sum_{j=1}^d L_j  | \frac{\sum_{i=1}^d L_i}{\mu} \ln\frac{2 r_0}{\varepsilon}                                | E.5

Table 3: Comparison of SMTP with D = N\left(0, \frac{I_d}{d}\right) and SMTP_IS with p_i = \frac{L_i}{\sum_{j=1}^d L_j}. Here r_0 = f(x^0) - f(x^*), R_{0,\ell_2} corresponds to R_0 from Assumption C.1 with \|\cdot\|_D = \|\cdot\|_2, and R_{0,\ell_\infty} corresponds to R_0 from Assumption C.1 with \|\cdot\|_D = \|\cdot\|_1.

\|\cdot\|_1 \le \sqrt{d} \|\cdot\|_2 and, therefore, in order to have \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \varepsilon for SMTP we need to ensure that \mathbb{E}\left[\|\nabla f(z^K)\|_2\right] \le \frac{\varepsilon}{\sqrt{d}}. That is, to guarantee \mathbb{E}\left[\|\nabla f(z^K)\|_1\right] \le \varepsilon, SMTP with the aforementioned distribution needs to perform \frac{\pi r_0 d^2 L}{\varepsilon^2} iterations. Analogously, in the convex case, using the inequality \|\cdot\|_2 \le \sqrt{d} \|\cdot\|_\infty we have R_{0,\ell_2} \le \sqrt{d} R_{0,\ell_\infty}. Typically this inequality is tight, and if we assume that R_{0,\ell_\infty} \ge C \frac{R_{0,\ell_2}}{\sqrt{d}} for some constant C, we get that the SMTP_IS complexity is \frac{R_{0,\ell_2}^2 \sum_{i=1}^d L_i}{\varepsilon} \ln\frac{2 r_0}{\varepsilon} up to a constant factor. That is, in all cases SMTP_IS shows better complexity than SMTP, up to a constant factor.

F AUXILIARY RESULTS

Lemma F.1 (Lemma 3.4 from Bergou et al. (2019)). Let g \in \mathbb{R}^d.

1. If D is the uniform distribution on the unit sphere in \mathbb{R}^d, then

    \gamma_D = 1 \quad \text{and} \quad \mathbb{E}_{s \sim D} |\langle g, s \rangle| \sim \frac{1}{\sqrt{2\pi d}} \|g\|_2.    (73)

Hence, D satisfies Assumption 3.1 with \gamma_D = 1, \|\cdot\|_D = \|\cdot\|_2 and \mu_D \sim \frac{1}{\sqrt{2\pi d}}.

2. If D is the normal distribution with zero mean and covariance matrix \frac{I_d}{d} (i.e. s \sim N\left(0, \frac{I_d}{d}\right)), then

    \gamma_D = 1 \quad \text{and} \quad \mathbb{E}_{s \sim D} |\langle g, s \rangle| = \frac{\sqrt{2}}{\sqrt{d\pi}} \|g\|_2.    (74)

Hence, D satisfies Assumption 3.1 with \gamma_D = 1, \|\cdot\|_D = \|\cdot\|_2 and \mu_D = \frac{\sqrt{2}}{\sqrt{d\pi}}.

3. If D is the uniform distribution on \{e_1, \ldots, e_d\}, then

    \gamma_D = 1 \quad \text{and} \quad \mathbb{E}_{s \sim D} |\langle g, s \rangle| = \frac{1}{d} \|g\|_1.    (75)

Hence, D satisfies Assumption 3.1 with \gamma_D = 1, \|\cdot\|_D = \|\cdot\|_1 and \mu_D = \frac{1}{d}.

4. If D is an arbitrary distribution on \{e_1, \ldots, e_d\} given by \mathbb{P}\{s = e_i\} = p_i > 0, then

    \gamma_D = 1 \quad \text{and} \quad \mathbb{E}_{s \sim D} |\langle g, s \rangle| = \sum_{i=1}^d p_i |g_i| =: \|g\|_D.    (76)

Hence, D satisfies Assumption 3.1 with \gamma_D = 1 and \mu_D = 1.

5. If D is a distribution on \{u_1, \ldots, u_d\}, where u_1, \ldots, u_d form an orthonormal basis of \mathbb{R}^d, given by \mathbb{P}\{s = u_i\} = p_i > 0, then

    \gamma_D = 1 \quad \text{and} \quad \mathbb{E}_{s \sim D} |\langle g, s \rangle| = \sum_{i=1}^d p_i |\langle g, u_i \rangle| =: \|g\|_D.    (77)

Hence, D satisfies Assumption 3.1 with \gamma_D = 1 and \mu_D = 1.
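A Monte Carlo sanity check of item 2 of Lemma F.1 (a sketch; the sample size, tolerance, and test vector g are arbitrary choices of ours): for s \sim N(0, \frac{I_d}{d}) each coordinate satisfies s_i \sim N(0, \frac{1}{d}), so \langle g, s \rangle \sim N(0, \frac{\|g\|_2^2}{d}) and \mathbb{E}|\langle g, s \rangle| = \sqrt{\frac{2}{d\pi}} \|g\|_2, while \gamma_D = \mathbb{E}\|s\|_2^2 = 1.

```python
import math, random

rng = random.Random(42)
d = 10
g = [float(i + 1) for i in range(d)]           # arbitrary test vector
g_norm = math.sqrt(sum(gi * gi for gi in g))

n = 100_000
abs_dot, sq_norm = 0.0, 0.0
sigma = 1.0 / math.sqrt(d)                     # coordinates of s ~ N(0, I_d/d)
for _ in range(n):
    s = [rng.gauss(0.0, sigma) for _ in range(d)]
    abs_dot += abs(sum(gi * si for gi, si in zip(g, s)))
    sq_norm += sum(si * si for si in s)

mu_emp = (abs_dot / n) / g_norm                # empirical E|<g,s>| / ||g||_2
mu_theory = math.sqrt(2.0 / (d * math.pi))     # Lemma F.1, item 2
gamma_emp = sq_norm / n                        # empirical gamma_D
assert abs(mu_emp - mu_theory) < 0.02 * mu_theory
assert abs(gamma_emp - 1.0) < 0.02
```

The same harness can be pointed at the other distributions in the lemma (unit sphere, coordinate directions) by swapping the sampler and the theoretical value of \mu_D.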
