
Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent

Wenqing Hu (Missouri University of Science and Technology)   Chris Junchi (Tencent AI Lab)
[email protected]   [email protected]

Xiangru · Huizhuo Yuan∗
University of Rochester · University of Rochester & Kwai Inc. · Peking University
[email protected]   [email protected]   [email protected]

Abstract

Stochastic compositional optimization arises in many important machine learning tasks such as reinforcement learning and portfolio management. The objective function is the composition of two expectations of stochastic functions, and it is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate stochastic compositional optimization in the general smooth non-convex setting. We employ the recently developed idea of Stochastic Recursive Gradient Descent to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: O((n + m)^{1/2} ε^{-2}) in the finite-sum case and O(ε^{-3}) in the online case. Such a complexity is known to be the best among existing IFO complexity results for non-convex stochastic compositional optimization. Numerical experiments validate the superior performance of our algorithm and theory.

1 Introduction

We consider the general smooth, non-convex compositional optimization problem of minimizing the composition of two expectations of stochastic functions:

    \min_{x \in \mathbb{R}^d} \left\{ \Phi(x) \equiv (f \circ g)(x) \right\},   (1)

where the outer function f : \mathbb{R}^l \to \mathbb{R} and the inner function g : \mathbb{R}^d \to \mathbb{R}^l are defined as f(y) := E_v[f_v(y)] and g(x) := E_w[g_w(x)], v and w are random variables, and each component f_v, g_w is smooth but not necessarily convex. Compositional optimization can be used to formulate many important machine learning problems, e.g. reinforcement learning (Sutton and Barto, 1998), risk management (Dentcheva et al., 2017), multi-stage stochastic programming (Shapiro et al., 2009) and deep neural nets (Yang et al., 2018). We list two specific application instances that can be written in the stochastic compositional form of (1):

• Risk management, which is formulated as

    \min_{x \in \mathbb{R}^N} \; -\frac{1}{T}\sum_{t=1}^{T} \langle r_t, x \rangle + \frac{1}{T}\sum_{t=1}^{T} \left( \langle r_t, x \rangle - \frac{1}{T}\sum_{s=1}^{T} \langle r_s, x \rangle \right)^{2},   (2)

∗ Partial work was performed when the author was an intern at Tencent AI Lab.

| Algorithm | Finite-sum | Online |
| --- | --- | --- |
| SCGD (Wang et al., 2017a) | unknown | ε^{-8} |
| Acc-SCGD (Wang et al., 2017a) | unknown | ε^{-7} |
| SCGD (Wang et al., 2017b) | unknown | ε^{-4.5} |
| SCVR / SC-SCSG (Liu et al., 2017) | (n + m)^{4/5} ε^{-2} | ε^{-3.6} |
| VRSC-PG (Huo et al., 2018) | (n + m)^{2/3} ε^{-2} | unknown |
| SARAH-Compositional (this work) | (n + m)^{1/2} ε^{-2} | ε^{-3} |

Table 1: Comparison of IFO complexities of different algorithms for the general non-convex problem.

where r_t ∈ R^N denotes the returns of the N assets at time t = 1, ..., T, and x ∈ R^N denotes the investment quantities for the N assets. The goal is to maximize the expected return while controlling the variance of the portfolio. Problem (2) can be written as a compositional optimization problem with the two functions below (a small numerical sketch of this decomposition is given right after this list):

" T #> 1 X g(x) = x , x , . . . , x , hr , xi , (3) 1 2 N T s s=1 T T 1 X 1 X 2 f(w) = − hr , w i + hr , w i − w  , (4) T t \(N+1) T t \(N+1) N+1 t=1 t=1

where w_{\setminus(N+1)} denotes the (column) subvector consisting of the first N coordinates of w and w_{N+1} denotes its (N + 1)-th coordinate.

• Value function evaluation in reinforcement learning, where the objective function of interest is

    \mathbb{E}\left( V^{\pi}(s_1) - \mathbb{E}[r_{s_1,s_2} + \gamma V^{\pi}(s_2) \mid s_1] \right)^{2},   (5)

where s_1, s_2 ∈ S are available states, r_{s_1,s_2} denotes the reward for moving from s_1 to s_2, and V^π(s) is the value function at state s corresponding to policy π.
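To make the risk-management decomposition (2)-(4) concrete, here is a minimal NumPy sketch (our own illustration with hypothetical variable names, not code from the paper) that checks that composing f and g recovers the mean-variance objective.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 8                           # toy numbers of time steps and assets
R = np.abs(rng.normal(size=(T, N)))    # R[t] plays the role of r_t
x = rng.normal(size=N)                 # investment quantities

def g(x):
    # Inner map (3): append the average return (1/T) sum_s <r_s, x> to x.
    return np.concatenate([x, [np.mean(R @ x)]])

def f(w):
    # Outer map (4): negative mean return plus empirical variance,
    # where w[:N] plays the role of x and w[N] of the average return.
    returns = R @ w[:N]
    return -np.mean(returns) + np.mean((returns - w[N]) ** 2)

# Direct mean-variance objective (2) for comparison.
returns = R @ x
direct = -np.mean(returns) + np.mean((returns - np.mean(returns)) ** 2)
assert np.isclose(f(g(x)), direct)     # Phi(x) = f(g(x)) matches (2)
```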

Compared with traditional optimization problems that allow access to (unbiased) stochastic gradients, problem (1) is more difficult to solve, and existing algorithms for it are more computationally demanding. This is mainly due to the nonlinear structure of the composite function with respect to the random index pair (v, w): treating the objective function as an expectation E_v f_v(g(x)), every gradient estimation involves recomputing g(x) = E_w g_w(x), which is either time-consuming or impractical. To tackle this weakness, Wang et al. (2017a) first introduced a two-time-scale algorithm called Stochastic Compositional Gradient Descent (SCGD), along with its accelerated (in Nesterov's sense) variant Acc-SCGD. Subsequently, Wang et al. (2017b) proposed an accelerated stochastic compositional proximal gradient algorithm which improves upon the existing complexity. Furthermore, variance-reduced gradient methods designed specifically for compositional optimization in the non-convex setting arose in Liu et al. (2017) and were later generalized to the nonsmooth setting (Huo et al., 2018). These approaches construct variance-reduced estimators of g(x), ∂g(x) and [∂g(x)]^⊤ ∇f(g(x)), respectively. Such success signals the necessity and possibility of designing a special algorithm for non-convex compositional objectives with better convergence rates.

In this paper, we propose an efficient algorithm called SARAH-Compositional for the stochastic compositional optimization problem (1). For notational simplicity, we let n, m ≥ 1 and the index pair (v, w) be uniformly distributed over the product set [n] × [m], i.e.

n  m  1 X 1 X Φ(x) = f g (x) . (6) n i m j  i=1 j=1 We use the same notation for the online case, in which case either n or m can be infinite. A fundamental theoretical question for stochastic compositional optimization is the Incremental First-order Oracle (IFO) (the number of individual gradient and function evaluations; see Defini- tion 1 below) complexity bounds for stochastic compositional optimization. Our new SARAH- Compositional algorithm is developed by integrating the iteration of Stochastic Recursive Gradient

Descent (Nguyen et al., 2017), shortened as SARAH, with the stochastic compositional optimization structure (Wang et al., 2017a). The motivation of this approach is that SARAH, with a specific choice of stepsizes, is known to be optimal in stochastic optimization and is regarded as a cutting-edge variance reduction technique, with significantly reduced oracle access complexities compared with earlier variance reduction methods. We prove that SARAH-Compositional can reach an IFO computational complexity of O(min{(n + m)^{1/2} ε^{-2}, ε^{-3}}), improving the best known result of O(min{(n + m)^{2/3} ε^{-2}, ε^{-3.6}}) in non-convex compositional optimization. See Table 1 for a detailed comparison.

Related Works Gradient descent (GD) and stochastic gradient descent (SGD) are basic algorithms in optimization, and their performance is superseded by numerous recently proposed first-order methods; Nesterov's acceleration (Nesterov, 2004) is one such example. Ghadimi and Lan (2016) proposed a Nesterov-type method and Li and Lin (2015) proposed an accelerated proximal gradient method for non-convex functions, in which case Nesterov's accelerated gradient method achieves the same convergence rate as SGD, while achieving an accelerated rate over GD when the objective function is non-stochastic. Another variant of GD, normalized gradient descent (NGD) (Nesterov, 2004), stabilizes the training process by fixing the step length, and enjoys in the non-convex setting the same IFO complexity as GD.

In recent years, variance reduction methods have received intensive attention in both convex and non-convex optimization. When the objective is strongly convex, variance reduction techniques improve the sub-linear convergence rates of GD to linear convergence rates. Variance-reduced gradient methods include SVRG (Xiao and Zhang, 2014), SAG/SAGA (Schmidt et al., 2017; Defazio et al., 2014), SDCA (Shalev-Shwartz and Zhang, 2013, 2014), etc. For non-convex objectives, stochastic variance reduced gradient (SVRG) (Xiao and Zhang, 2014) applies in a straightforward fashion (Allen-Zhu and Hazan, 2016; Reddi et al., 2016), as does an online variant of SVRG (called SCSG) later proposed by Lei et al. (2017). The IFO complexities above can be further reduced by the recently proposed SPIDER algorithm (Fang et al., 2018), which hybridizes the stochastic recursive gradient descent framework (SARAH) (Nguyen et al., 2017) with normalized gradient descent (NGD). It adopts a small stepsize that is proportional to ε, so that the variance can be effectively controlled. For the purpose of finding ε-accurate solutions, Wang et al. (2018) relaxed the small-stepsize restriction of Fang et al. (2018) while achieving the same complexity. The convergence property of both SARAH and SPIDER in the non-convex case outperforms that of SVRG, and is state-of-the-art.

It turns out that in the compositional optimization problem, gradient-descent-based methods designed for a single objective function are either not applicable or require at least O(m) queries per iteration to evaluate the inner function g. To remedy this, Wang et al. (2017a,b) proposed the SCGD method to handle the inner finite-sum more efficiently and achieved a polynomial rate that is independent of m. Later on, Lian et al. (2017); Liu et al. (2017); Huo et al. (2018) and Lin et al. (2018) merged the SVRG method into the compositional optimization framework to perform variance reduction on all three steps of the estimation. Among these, Lin et al. (2018); Huo et al. (2018) and Wang et al.
(2017b) considered the case of a compositional objective with an extra non-smooth proximal term. Due to the lack of space, we restrict ourselves to the smooth case and leave the non-smooth analysis to future work.

Contributions This work makes two contributions. First, we propose a new algorithm for stochastic compositional optimization called SARAH-Compositional, which applies SARAH/SPIDER-type variance reduction to estimate the relevant quantities. The algorithm is designed to unify both the online and finite-sum cases. Second, we conduct theoretical analysis for both cases. In the finite-sum case, we obtain a complexity of O((n + m)^{1/2} ε^{-2}), which improves over the prior best known complexity of O((n + m)^{2/3} ε^{-2}) achieved by Huo et al. (2018). In the online case, we obtain a complexity of O(ε^{-3}), which improves over the prior best known complexity of O(ε^{-3.6}) obtained in Liu et al. (2017).

Notational Conventions Throughout the paper, we treat the parameters L_g, L_f, L_Φ, M_g, M_f, ∆ and σ as global constants. Let ‖·‖ denote the Euclidean norm of a vector or the operator norm of a matrix induced by the Euclidean norm, and let ‖·‖_F denote the Frobenius norm. For fixed T ≥ t ≥ 0, let x_{t:T} denote the sequence {x_t, ..., x_T}. Let E_t[·] denote the conditional expectation E[· | x_0, x_1, ..., x_t]. Let [n] = {1, ..., n}, and let S denote the cardinality of a multi-set S ⊆ [n] of samples (a generic set that permits repeated instances). The averaged sub-sampled stochastic estimator is denoted A_S = (1/S) \sum_{i \in S} A_i, where the summation counts repeated instances. We write p_n = O(q_n) if

there exist constants 0 < c < C < ∞ such that c q_n ≤ p_n ≤ C q_n as n becomes large. Other notations are explained at their first appearance.

Organization The rest of the paper is organized as follows. §2 formally presents our algorithm and assumptions. §3 presents the convergence rate theorems, and §4 presents numerical experiments that apply our algorithm to the task of portfolio management. We conclude in §5. Proofs of the convergence results for the finite-sum and online cases, together with auxiliary lemmas, are deferred to the supplementary material.

2 SARAH for Stochastic Compositional Optimization

Recall that our goal is to solve the compositional optimization problem (1), i.e. to minimize Φ(x) = f(g(x)) where

    f(y) := \frac{1}{n}\sum_{i=1}^{n} f_i(y), \qquad g(x) := \frac{1}{m}\sum_{j=1}^{m} g_j(x).

Here for each j ∈ [m] and i ∈ [n], the component functions g_j : \mathbb{R}^d \to \mathbb{R}^l and f_i : \mathbb{R}^l \to \mathbb{R}. Formally differentiating Φ(x) via the chain rule yields the gradient descent iteration

    x_{t+1} = x_t - \eta\, [\partial g(x_t)]^{\top} \nabla f(g(x_t)),   (7)

where the ∂ operator computes the Jacobian matrix of a smooth mapping, and the gradient operator ∇ is taken only with respect to the first-level (outer) variable. As discussed in §1, it can be either impossible (online case) or time-consuming (finite-sum case) to evaluate the terms \partial g(x_t) = \frac{1}{m}\sum_{j=1}^{m} \partial g_j(x_t) and g(x_t) = \frac{1}{m}\sum_{j=1}^{m} g_j(x_t) in the above iteration scheme. In this paper, we design a novel algorithm (SARAH-Compositional) based on the Stochastic Compositional Variance Reduced Gradient method (see Lin et al. (2018)), hybridized with the stochastic recursive gradient method of Nguyen et al. (2017). As we show later, SARAH-Compositional is more efficient than all existing algorithms for non-convex compositional optimization.

We introduce some definitions and assumptions. First, we assume the algorithm has access to the Incremental First-order Oracle (IFO) in our black-box environment (Lin et al., 2018); see also (Agarwal and Bottou, 2015; Woodworth and Srebro, 2016) for the vanilla optimization case.

Definition 1 (IFO; Lin et al., 2018). The Incremental First-order Oracle (IFO) returns, when some x ∈ \mathbb{R}^d and j ∈ [m] are given as input, the vector-matrix pair [g_j(x), \partial g_j(x)]; or, when some y ∈ \mathbb{R}^l and i ∈ [n] are given as input, the scalar-vector pair [f_i(y), \nabla f_i(y)].
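To illustrate the chain-rule gradient (7) and the IFO cost of computing it exactly, the following minimal NumPy sketch (our own toy component functions; none of them come from the paper) evaluates one exact gradient, which already requires 2m inner-oracle calls and n outer-oracle calls.

```python
import numpy as np

rng = np.random.default_rng(1)
d, l, n, m = 5, 3, 10, 20
A = rng.normal(size=(m, l, d))          # toy inner components: g_j(x) = A_j x
B = rng.normal(size=(n, l))             # toy outer components: f_i(y) = <b_i, y>^2

def g_j(x, j):      return A[j] @ x                  # IFO call: value of g_j
def dg_j(x, j):     return A[j]                      # IFO call: Jacobian of g_j
def grad_f_i(y, i): return 2.0 * (B[i] @ y) * B[i]   # IFO call: gradient of f_i

def full_gradient(x):
    # Exact gradient (7): [dg(x)]^T grad f(g(x)); costs m + m + n IFO calls.
    g_val = np.mean([g_j(x, j) for j in range(m)], axis=0)        # g(x)
    G = np.mean([dg_j(x, j) for j in range(m)], axis=0)           # dg(x), an l-by-d Jacobian
    grad_f = np.mean([grad_f_i(g_val, i) for i in range(n)], axis=0)
    return G.T @ grad_f

x = rng.normal(size=d)
print(full_gradient(x))    # a plain gradient step would then be x - eta * full_gradient(x)
```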

Second, our goal in this work is to find an ε-accurate solution, defined as follows.

Definition 2 (ε-accurate solution). We call x ∈ \mathbb{R}^d an ε-accurate solution to problem (1) if

    \|\nabla \Phi(x)\| \le \varepsilon.   (8)

It is worth remarking here that the inequality (8) can be modified to \|\nabla \Phi(x)\| \le C\varepsilon for some global constant C > 0 without affecting the complexities. Let us first make some assumptions on each component of the (compositional) objective function. Analogous to Assumption 1(i) of Fang et al. (2018), we make the following finite-gap assumption.

Assumption 1 (Finite gap). We assume that the algorithm is initialized at x_0 ∈ \mathbb{R}^d with

    \Delta := \Phi(x_0) - \Phi^* < \infty,   (9)

where Φ^* denotes the global minimum value of Φ(x).

We make the following smoothness and boundedness assumptions, which are standard in the recent compositional optimization literature (e.g. Lian et al. (2017); Huo et al. (2018); Lin et al. (2018)).

Algorithm 1 SARAH-Compositional, Online Case (resp. Finite-Sum Case)

Input: T, q, x_0, η, S_1, S_2, S_3
for t = 0 to T − 1 do
    if mod(t, q) = 0 then
        Draw S_1 indices with replacement S_t^1 ⊆ [m] and let g_t = (1/S_1) \sum_{j \in S_t^1} g_j(x_t)
            (resp. g_t = g(x_t) in the finite-sum case)
        Draw S_2 indices with replacement S_t^2 ⊆ [m] and let G_t = (1/S_2) \sum_{j \in S_t^2} \partial g_j(x_t)
            (resp. G_t = \partial g(x_t) in the finite-sum case)
        Draw S_3 indices with replacement S_t^3 ⊆ [n] and let F_t = (G_t)^{\top} \left[ (1/S_3) \sum_{i \in S_t^3} \nabla f_i(g_t) \right]
            (resp. F_t = (G_t)^{\top} \nabla f(g_t) in the finite-sum case)
    else
        Draw one index j_t ∈ [m] and let
            g_t = g_{j_t}(x_t) − g_{j_t}(x_{t−1}) + g_{t−1}   and
            G_t = \partial g_{j_t}(x_t) − \partial g_{j_t}(x_{t−1}) + G_{t−1}
        Draw one index i_t ∈ [n] and let
            F_t = (G_t)^{\top} \nabla f_{i_t}(g_t) − (G_{t−1})^{\top} \nabla f_{i_t}(g_{t−1}) + F_{t−1}
    end if
    Update x_{t+1} = x_t − η F_t
end for
return Output x̃ chosen uniformly at random from {x_t}_{t=0}^{T−1}
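To make the recursive estimators concrete, below is a minimal NumPy sketch of the online variant of Algorithm 1 (our own sketch, not released code; the component oracles, batch sizes and stepsize are illustrative placeholders rather than the tuned values prescribed by the theorems).

```python
import numpy as np

def sarah_compositional(g_j, dg_j, grad_f_i, m, n, x0,
                        T=1000, q=50, eta=1e-2, S1=32, S2=32, S3=32, seed=0):
    """Sketch of Algorithm 1 (online case).

    g_j(x, j)      -> value of the j-th inner component (length-l vector)
    dg_j(x, j)     -> Jacobian of the j-th inner component (l-by-d matrix)
    grad_f_i(y, i) -> gradient of the i-th outer component (length-l vector)
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    x_prev = x
    iterates = []
    for t in range(T):
        if t % q == 0:
            # Checkpoint: large-batch estimates of g(x_t), dg(x_t) and the gradient.
            J1 = rng.integers(m, size=S1)                 # drawn with replacement
            J2 = rng.integers(m, size=S2)
            I3 = rng.integers(n, size=S3)
            g_t = np.mean([g_j(x, j) for j in J1], axis=0)
            G_t = np.mean([dg_j(x, j) for j in J2], axis=0)
            F_t = G_t.T @ np.mean([grad_f_i(g_t, i) for i in I3], axis=0)
        else:
            # Recursive single-sample corrections (SARAH-style).
            j = rng.integers(m)
            g_new = g_j(x, j) - g_j(x_prev, j) + g_t
            G_new = dg_j(x, j) - dg_j(x_prev, j) + G_t
            i = rng.integers(n)
            F_t = G_new.T @ grad_f_i(g_new, i) - G_t.T @ grad_f_i(g_t, i) + F_t
            g_t, G_t = g_new, G_new
        iterates.append(x)                 # store x_t
        x_prev, x = x, x - eta * F_t       # x_{t+1} = x_t - eta * F_t
    # Output chosen uniformly at random from {x_0, ..., x_{T-1}}.
    return iterates[rng.integers(T)]
```

In the finite-sum case, the checkpoint branch would instead compute the exact averages g(x_t), ∂g(x_t) and (G_t)^⊤∇f(g_t) over [m] and [n], as in Algorithm 1.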

Assumption 2 (Smoothness). There exist Lipschitz constants L_g, L_f, L_Φ > 0 such that for all i ∈ [n], j ∈ [m] we have

    \|\partial g_j(x) - \partial g_j(x')\|_F \le L_g \|x - x'\| \quad \text{for } x, x' \in \mathbb{R}^d,
    \|\nabla f_i(y) - \nabla f_i(y')\| \le L_f \|y - y'\| \quad \text{for } y, y' \in \mathbb{R}^l,   (10)
    \left\| [\partial g_j(x)]^{\top} \nabla f_i(g(x)) - [\partial g_j(x')]^{\top} \nabla f_i(g(x')) \right\| \le L_\Phi \|x - x'\| \quad \text{for } x, x' \in \mathbb{R}^d.

Here, for the purpose of using stochastic recursive estimation of ∂g(x), we slightly strengthen the smoothness assumption by adopting the Frobenius norm on the left-hand side of the first line of (10).

Assumption 3 (Boundedness). There exist boundedness constants M_g, M_f > 0 such that for all i ∈ [n], j ∈ [m] we have

    \|\partial g_j(x)\| \le M_g \quad \text{for } x \in \mathbb{R}^d, \qquad \|\nabla f_i(y)\| \le M_f \quad \text{for } y \in \mathbb{R}^l.   (11)

Notice that applying the mean-value theorem for vector-valued functions to (11) gives another Lipschitz condition,

    \|g_j(x) - g_j(x')\| \le M_g \|x - x'\| \quad \text{for } x, x' \in \mathbb{R}^d,   (12)

and analogously for f_i(y). It turns out that under the above two assumptions, a valid choice of L_Φ in (10) can be expressed as a polynomial in L_f, L_g, M_f, M_g. For clarity, in the rest of this paper we adopt the following typical choice of L_Φ:

    L_\Phi \equiv M_f L_g + M_g^2 L_f,   (13)

whose validity can be verified via a simple application of the chain rule (see the short derivation below). We integrate the finite-sum and online cases into one algorithm, SARAH-Compositional, stated in Algorithm 1.
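For completeness, here is a short derivation (our own, not reproduced from the paper) showing why (13) is a valid choice: by the triangle inequality and Assumptions 2-3,

    \left\| [\partial g_j(x)]^{\top} \nabla f_i(g(x)) - [\partial g_j(x')]^{\top} \nabla f_i(g(x')) \right\|
    \le \left\| [\partial g_j(x) - \partial g_j(x')]^{\top} \nabla f_i(g(x)) \right\| + \left\| [\partial g_j(x')]^{\top} \left[ \nabla f_i(g(x)) - \nabla f_i(g(x')) \right] \right\|
    \le M_f L_g \|x - x'\| + M_g L_f \|g(x) - g(x')\|
    \le \left( M_f L_g + M_g^2 L_f \right) \|x - x'\|,

where the second inequality uses \|\partial g_j(x) - \partial g_j(x')\| \le \|\partial g_j(x) - \partial g_j(x')\|_F \le L_g \|x - x'\| together with Assumption 3, and the last step uses the Lipschitz property (12) of g.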

3 Convergence Rate Analysis

In this section, we show that our proposed SARAH-Compositional algorithm attains IFO complexities of O((n + m)^{1/2} ε^{-2}) in the finite-sum case and O(ε^{-3}) in the online case, which supersede those of the concurrent and comparative algorithms (see Table 1).

Let us first analyze the convergence in the finite-sum case. In this case we have S_1 = [m], S_2 = [m], S_3 = [n]. An involved analysis leads us to conclude:

Theorem 1 (Finite-sum case). Suppose Assumptions 1, 2 and 3 in §2 hold, let S_1 = S_2 = [m], S_3 = [n], q = (2m + n)/3, and set the stepsize

    \eta = \frac{1}{\sqrt{6(2m+n)\left( M_g^4 L_f^2 + M_f^2 L_g^2 \right)}}.   (14)

Then for the finite-sum case, SARAH-Compositional (Algorithm 1) outputs an x̃ satisfying E‖∇Φ(x̃)‖² ≤ ε² within

    \sqrt{6(2m+n)\left( M_g^4 L_f^2 + M_f^2 L_g^2 \right)} \cdot \frac{2[\Phi(x_0) - \Phi^*]}{\varepsilon^2}   (15)

iterates. The IFO complexity to achieve an ε-accurate solution is bounded by

    2m + n + 9\sqrt{(2m+n)\left( M_g^4 L_f^2 + M_f^2 L_g^2 \right)} \cdot \frac{216[\Phi(x_0) - \Phi^*]}{\varepsilon^2}.   (16)
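As a concrete illustration of how the theorem's parameters scale, the helper below (our own; the formulas follow (14)-(15), but the routine itself is not part of the paper) converts problem constants into the suggested checkpoint period, stepsize, and iterate bound.

```python
import math

def finite_sum_parameters(m, n, M_g, M_f, L_g, L_f, Delta, eps):
    """Checkpoint period, stepsize, and iterate bound suggested by Theorem 1."""
    smooth = M_g**4 * L_f**2 + M_f**2 * L_g**2
    q = (2 * m + n) / 3.0                                             # checkpoint period
    eta = 1.0 / math.sqrt(6 * (2 * m + n) * smooth)                   # stepsize (14)
    iters = math.sqrt(6 * (2 * m + n) * smooth) * 2 * Delta / eps**2  # iterate bound (15)
    return q, eta, iters

print(finite_sum_parameters(m=1000, n=1000, M_g=1.0, M_f=1.0, L_g=1.0, L_f=1.0, Delta=10.0, eps=1e-2))
```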

As in Fang et al. (2018), the IFO complexity to achieve an ε-accurate solution is therefore upper bounded by O((m + n)^{1/2} ε^{-2}); in the online case below, the corresponding bound is O(ε^{-3}).² Theorem 1 bounds the expected gradient norm, and a simple application of Markov's inequality yields high-probability guarantees for achieving ε-accurate solutions.

Compared with Fang et al. (2018), one observes that Theorem 1 gives an IFO complexity upper bound of O(m + n + (m + n)^{1/2} ε^{-2}), sharing a similar form with that of SARAH/SPIDER for non-convex stochastic optimization when m + n is regarded as the number of individual functions. SPIDER-SFO (as a SARAH variant) is optimal in the finite-sum case, in the sense that it matches the theoretical lower bound, which makes it tempting to claim that our proposed SARAH-Compositional, as its extension, is also optimal. We emphasize, however, that the set of assumptions for compositional optimization differs from that of vanilla optimization, and claiming optimality of the IFO complexity requires a corresponding lower bound result, which we leave as a future direction.

Let us now analyze the convergence in the online case, where, instead of the exact quantities, we sample minibatches S_1, S_2, S_3 once every q iterates. To characterize the estimation error, we introduce one additional finite-variance assumption.

Assumption 4 (Finite variance). We assume that there exist constants H_1, H_2 and H_3 upper bounding the variances of g(x), ∂g(x) and ∇f(y), respectively, such that

    \mathbb{E}\|g_i(x) - g(x)\|^2 \le H_1 \quad \text{for } x \in \mathbb{R}^d,
    \mathbb{E}\|\partial g_i(x) - \partial g(x)\|^2 \le H_2 \quad \text{for } x \in \mathbb{R}^d,   (17)
    \mathbb{E}\|\nabla f_i(y) - \nabla f(y)\|^2 \le H_3 \quad \text{for } y \in \mathbb{R}^l.

From Assumptions 2 and 3 we can easily verify, via the triangle inequality and the convexity of the norm, that H_2 can be chosen as 4M_g^2 and H_3 can be chosen as 4M_f^2. The choice of H_1, however, cannot be expressed in terms of the boundedness and smoothness constants alone.

We know that \|G_{\lfloor t/q \rfloor q}\| = \|\partial g_{S^2}(x_{\lfloor t/q \rfloor q})\| \le M_g. From (17) we derive

    \mathbb{E}\left\| (G_{\lfloor t/q \rfloor q})^{\top} \nabla f_i(g_{\lfloor t/q \rfloor q}) - (G_{\lfloor t/q \rfloor q})^{\top} \nabla f(g_{\lfloor t/q \rfloor q}) \right\|^2 \le H_3 M_g^2.
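The bound above follows from submultiplicativity of the operator norm; for completeness (a one-line check we add, not taken from the paper):

    \mathbb{E}\left\| (G_{\lfloor t/q \rfloor q})^{\top} \left[ \nabla f_i(g_{\lfloor t/q \rfloor q}) - \nabla f(g_{\lfloor t/q \rfloor q}) \right] \right\|^2 \le \|G_{\lfloor t/q \rfloor q}\|^2 \, \mathbb{E}\left\| \nabla f_i(g_{\lfloor t/q \rfloor q}) - \nabla f(g_{\lfloor t/q \rfloor q}) \right\|^2 \le M_g^2 H_3.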

²Here and below, the smoothness and boundedness parameters and Φ(x_0) − Φ^* are treated as constants.

When S_3 = 3H_3 M_g^2/\varepsilon^2, S_2 = 3H_2 M_f^2/\varepsilon^2 and S_1 = 3H_1 M_g^2 L_f^2/\varepsilon^2, under Assumption 4 the error-control quantities ω_1, ω_2 and ω_3 appearing in the analysis can be chosen as \varepsilon^2/3, \varepsilon^2/(3M_f^2) and \varepsilon^2/(3M_g^2 L_f^2), respectively. We conclude the following theorem for the online case.

Theorem 2 (Online case). Suppose Assumptions 1, 2, 3 and 4 in §2 hold, let

    S_1 = \frac{3H_1 M_g^2 L_f^2}{\varepsilon^2}, \qquad S_2 = \frac{3H_2 M_f^2}{\varepsilon^2}, \qquad S_3 = \frac{3H_3 M_g^2}{\varepsilon^2}, \qquad q = \frac{D_0}{3\varepsilon^2},

where we denote the noise-relevant parameter

    D_0 := 3\left( H_1 M_g^2 L_f^2 + H_2 M_f^2 + H_3 M_g^2 \right),   (18)

and set the stepsize

    \eta = \frac{\varepsilon}{\sqrt{6\left( M_g^4 L_f^2 + M_f^2 L_g^2 \right) D_0}}.   (19)

Then for the online case, SARAH-Compositional (Algorithm 1) outputs an x̃ satisfying E‖∇Φ(x̃)‖² ≤ 2ε² within

    \sqrt{6\left( M_g^4 L_f^2 + M_f^2 L_g^2 \right) D_0} \cdot \frac{2[\Phi(x_0) - \Phi^*]}{\varepsilon^3}   (20)

iterates. The IFO complexity to achieve an ε-accurate solution is bounded by

    \frac{D_0}{\varepsilon^2} + 9\sqrt{D_0 \left( M_g^4 L_f^2 + M_f^2 L_g^2 \right)} \cdot \frac{216[\Phi(x_0) - \Phi^*]}{\varepsilon^3}.   (21)
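Analogously to the finite-sum case, the short helper below (our own illustration; the formulas follow (18)-(19), while the routine itself is not part of the paper) computes the online-case minibatch sizes, checkpoint period and stepsize from the problem constants.

```python
import math

def online_parameters(M_g, M_f, L_g, L_f, H1, H2, H3, eps):
    """Minibatch sizes, checkpoint period, and stepsize suggested by Theorem 2."""
    S1 = 3 * H1 * M_g**2 * L_f**2 / eps**2
    S2 = 3 * H2 * M_f**2 / eps**2
    S3 = 3 * H3 * M_g**2 / eps**2
    D0 = 3 * (H1 * M_g**2 * L_f**2 + H2 * M_f**2 + H3 * M_g**2)           # noise parameter (18)
    q = D0 / (3 * eps**2)                                                  # checkpoint period
    eta = eps / math.sqrt(6 * (M_g**4 * L_f**2 + M_f**2 * L_g**2) * D0)    # stepsize (19)
    return S1, S2, S3, q, eta

print(online_parameters(M_g=1.0, M_f=1.0, L_g=1.0, L_f=1.0, H1=1.0, H2=1.0, H3=1.0, eps=1e-2))
```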

Due to space limitations, the detailed proofs of Theorems 1 and 2 are deferred to §A of the supplementary material.

4 Experiments

In this section, we apply our algorithm to the (risk-averse) portfolio management/optimization problem and conduct numerical experiments to support our theory. We follow the setups in Huo et al. (2018); Liu et al. (2017) and compare with existing algorithms for compositional optimization. Readers are referred to Wang et al. (2017a,b) for more tasks to which our algorithm can potentially be applied.

Recall that in Section 1 we formulated portfolio management as the mean-variance optimization problem (2), which can be cast as the compositional optimization problem (1). For convenience we repeat the display here:

    \min_{x \in \mathbb{R}^N} \; -\frac{1}{T}\sum_{t=1}^{T} \langle r_t, x \rangle + \frac{1}{T}\sum_{t=1}^{T} \left( \langle r_t, x \rangle - \frac{1}{T}\sum_{s=1}^{T} \langle r_s, x \rangle \right)^{2},   (2)

where x = (x_1, x_2, ..., x_N) ∈ R^N denotes the quantities invested in each of the assets i = 1, ..., N. For illustration purposes we set T = 2000 and N = 200. Similar to the setup of Huo et al. (2018), we sample r_t from a Gaussian distribution and take absolute values: each row r_t of the T × N matrix [r_1, r_2, ..., r_T]^⊤ is (independently) generated from a zero-mean N-dimensional Gaussian distribution with covariance matrix Σ ∈ R^{N×N}. We prescribe the condition number κ of the population covariance Σ as one of our parameters and test the cases κ(Σ) = 4 and κ(Σ) = 20. When applying SARAH-Compositional we adopt the finite-sum case, i.e. S_1 = S_2 = S_3 = T. We search for the stepsize among {5×10^{-4}, 10^{-3}, 2×10^{-3}, 5×10^{-3}, 1×10^{-2}, 2×10^{-2}, 5×10^{-2}, 1×10^{-1}, 2×10^{-1}} and plot the learning curve of each algorithm at its individual best stepsize. The results are shown in Figure 1.

We compare our algorithm SARAH-Compositional with basic SCGD (Wang et al., 2017b) for compositional optimization and with VRSC-PG (Huo et al., 2018), which serves as a baseline for variance-reduced stochastic compositional optimization methods. We plot the objective function value gap and the gradient norm against the IFO complexity (measured by the number of gradient calculations) for all three algorithms in each of the two covariance settings.
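Below is a minimal NumPy sketch of the data-generation step described above (the specific construction of Σ with a prescribed condition number is our own illustrative choice; the paper does not spell it out).

```python
import numpy as np

def make_returns(T=2000, N=200, cond=4.0, seed=0):
    """Toy return matrix: rows r_t ~ |N(0, Sigma)| with condition number kappa(Sigma) = cond."""
    rng = np.random.default_rng(seed)
    # Eigenvalues spread linearly between 1 and `cond`, rotated by a random
    # orthogonal matrix, so that kappa(Sigma) = cond.
    eigvals = np.linspace(1.0, cond, N)
    Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
    Sigma = Q @ np.diag(eigvals) @ Q.T
    return np.abs(rng.multivariate_normal(np.zeros(N), Sigma, size=T))

R = make_returns(cond=4.0)             # or cond=20.0 for the second setting
x = np.zeros(R.shape[1])

def objective(x):
    # Mean-variance objective (2) evaluated on the sampled returns.
    returns = R @ x
    return -np.mean(returns) + np.mean((returns - np.mean(returns)) ** 2)

print(objective(x))
```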

(a) The risk matrix is generated from a Gaussian distribution with covariance matrix Σ, κ(Σ) = 4.

(b) The risk matrix is generated from a Gaussian distribution with covariance matrix Σ, κ(Σ) = 20.

Figure 1: Experiments on portfolio management. The x-axis is the number of gradient calculations; the y-axes are the function value gap and the norm of the gradient, respectively.

We observe that SARAH-Compositional outperforms the other two algorithms, SCGD and VRSC-PG, in our experiments. This toy experiment confirms that our proposed SARAH-based compositional optimization method achieves state-of-the-art performance on portfolio management. Moreover, we note that due to the small size of its batches, basic SCGD achieves a less satisfactory result, a phenomenon also observed by Huo et al. (2018); Lian et al. (2017). In the near future, we will provide more simulation results and release our code on GitHub.

5 Conclusion

In this paper, we propose a novel algorithm called SARAH-Compositional for solving stochastic compositional optimization problems, based on a recently proposed recursive variance-reduced gradient method. Our algorithm achieves strong theoretical and experimental results. Theoretically, we show that SARAH-Compositional achieves improved convergence rates and IFO complexities for finding an ε-accurate solution to non-convex compositional problems in both the finite-sum and online cases. Experimentally, we compare our method with several rival algorithms on the task of portfolio management. Future directions include handling the non-smooth case and developing lower bounds for stochastic compositional optimization. We hope this work provides new perspectives to the optimization and machine learning communities interested in compositional optimization.

References

Agarwal, A. and Bottou, L. (2015). A lower bound for the optimization of finite sums. In International Conference on Machine Learning, pages 78–86.

Allen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.

Dentcheva, D., Penev, S., and Ruszczyński, A. (2017). Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760.

Fang, C., Junchi Li, C., Lin, Z., and Zhang, T. (2018). SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. arXiv preprint arXiv:1807.01695.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1–2):59–99.

Huo, Z., Gu, B., Liu, J., and Huang, H. (2018). Accelerated method for stochastic composition optimization with nonsmooth regularization. In Thirty-Second AAAI Conference on Artificial Intelligence.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. (2017). Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355.

Li, H. and Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387.

Lian, X., Wang, M., and Liu, J. (2017). Finite-sum composition optimization via variance reduced gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 1159–1167.

Lin, T., Fan, C., Wang, M., and Jordan, M. I. (2018). Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458.

Liu, L., Liu, J., and Tao, D. (2017). Variance reduced methods for non-convex composition optimization. arXiv preprint arXiv:1711.04416.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017). SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323.

Schmidt, M., Le Roux, N., and Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2):83–112.

Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599.

Shalev-Shwartz, S. and Zhang, T. (2014). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. (2009). Lectures on Stochastic Programming: Modeling and Theory. SIAM.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction.

Wang, M., Fang, E. X., and Liu, H. (2017a). Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1–2):419–449.

Wang, M., Liu, J., and Fang, E. X. (2017b). Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18:1–23.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. (2018). SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690.

Woodworth, B. E. and Srebro, N. (2016). Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647.

Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075.

Yang, S., Wang, M. M., and Fang, E. (2018). Multi-level stochastic gradient methods for nested composition optimization. arXiv preprint arXiv:1801.03600.
