Randomized Block Cubic Newton Method

Nikita Doikov 1   Peter Richtárik 2 3 4

Abstract

We study the problem of minimizing the sum of three convex functions (a differentiable, a twice-differentiable and a non-smooth term) in a high dimensional setting. To this effect we propose and analyze a randomized block cubic Newton (RBCN) method, which in each iteration builds a model of the objective function formed as the sum of the natural models of its three components: a linear model with a quadratic regularizer for the differentiable term, a quadratic model with a cubic regularizer for the twice differentiable term, and a perfect (proximal) model for the nonsmooth term. In each iteration our method minimizes the model over a random subset of blocks of the search variable. RBCN is the first algorithm with these properties, generalizing several existing methods and matching the best known bounds in all special cases. We establish $O(1/\epsilon)$, $O(1/\sqrt{\epsilon})$ and $O(\log(1/\epsilon))$ rates under different assumptions on the component functions. Lastly, we show numerically that our method outperforms the state of the art on a variety of problems, including cubically regularized least-squares, logistic regression with constraints, and Poisson regression.

1. Introduction

In this paper we develop an efficient randomized algorithm for solving an optimization problem of the form

$$\min_{x\in Q} F(x) \stackrel{\text{def}}{=} g(x) + \phi(x) + \psi(x), \qquad (1)$$

where $Q \subseteq \mathbb{R}^N$ is a closed convex set, and $g$, $\phi$ and $\psi$ are convex functions with different smoothness and structural properties. Our aim is to capitalize on these different properties in the design of our algorithm. We assume that $g$ has Lipschitz gradient¹, $\phi$ has Lipschitz Hessian, while $\psi$ is allowed to be nonsmooth, albeit "simple".

1.1. Block Structure

Moreover, we assume that the $N$ coordinates of $x$ are partitioned into $n$ blocks of sizes $N_1,\dots,N_n$, with $\sum_i N_i = N$, and then write $x = (x_{(1)},\dots,x_{(n)})$, where $x_{(i)} \in \mathbb{R}^{N_i}$. This block structure is typically dictated by the particular application considered. Once the block structure is fixed, we further assume that $\phi$ and $\psi$ are block separable. That is, $\phi(x) = \sum_{i=1}^n \phi_i(x_{(i)})$ and $\psi(x) = \sum_{i=1}^n \psi_i(x_{(i)})$, where the $\phi_i$ are twice differentiable with Lipschitz Hessians, and the $\psi_i$ are closed convex (and possibly nonsmooth) functions.

Revealing this block structure, problem (1) takes the form

$$\min_{x\in Q} F(x) \stackrel{\text{def}}{=} g(x) + \sum_{i=1}^n \phi_i(x_{(i)}) + \sum_{i=1}^n \psi_i(x_{(i)}). \qquad (2)$$

We are specifically interested in the case when $n$ is big, in which case it makes sense to update only a small number of the blocks in each iteration.

1.2. Related Work

There has been a substantial and growing volume of research related to second-order and block-coordinate optimization. In this part we briefly mention some of the papers most relevant to the present work.

A major leap in second-order optimization theory was made when the cubic Newton method was proposed by Griewank (1981) and independently rediscovered by Nesterov & Polyak (2006), who also provided global complexity guarantees.

Cubic regularization was equipped with acceleration by Nesterov (2008), with adaptive stepsizes by Cartis et al. (2011a;b), and extended to a universal framework by Grapiglia & Nesterov (2017). The universal schemes can automatically adjust to the implicit smoothness level of the objective.

---
1 National Research University Higher School of Economics, Moscow, Russia. 2 King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. 3 University of Edinburgh, Edinburgh, United Kingdom. 4 Moscow Institute of Physics and Technology, Dolgoprudny, Russia. Correspondence to: Nikita Doikov, Peter Richtárik.

¹Our assumption is a bit more general than this; see Assumptions 1 and 2 for details.

Cubically regularized second-order schemes for solving systems of nonlinear equations were proposed by Nesterov (2007), and randomized variants for stochastic optimization were considered by Tripuraneni et al. (2017); Ghadimi et al. (2017); Kohler & Lucchi (2017); Cartis & Scheinberg (2018).

Despite their attractive global iteration complexity guarantees, the weakness of second-order methods in general, and of the cubic Newton method in particular, is their high computational cost per iteration. This issue remains the subject of active research. For successful theoretical results related to the approximation of the cubic step we refer to (Agarwal et al., 2016) and (Carmon & Duchi, 2016).

At the same time, there are many successful attempts to use block coordinate randomization to accelerate first-order (Tseng & Yun, 2009; Richtárik & Takáč, 2014; 2016) and second-order (Qu et al., 2016; Mutný & Richtárik, 2018) methods.

In this work we address the issue of combining block-coordinate randomization with cubic regularization, to obtain a second-order method with proven global complexity guarantees and a low cost per iteration.

A powerful advance in optimization theory was the advent of composite, or proximal, first-order methods (see (Nesterov, 2013) as a modern reference). This technique has become available as an algorithmic tool in the block coordinate setting as well (Richtárik & Takáč, 2014; Qu et al., 2016). Our aim in this work is the development of a composite cubically regularized second-order method.

1.3. Contributions

We propose a new randomized second-order proximal algorithm for solving convex optimization problems of the form (2). Our method, Randomized Block Cubic Newton (RBCN) (see Algorithm 1), treats the three functions appearing in (1) differently, according to their nature.

Our method is a randomized block method because in each iteration we update a random subset of the $n$ blocks only. This facilitates faster convergence, and is suited to problems where $n$ is very large. Our method is proximal because we keep the functions $\psi_i$ in our model, which is minimized in each iteration, without any approximation. Our method is a cubic Newton method because we approximate each $\phi_i$ using a cubically-regularized second order model.

We are not aware of any method that can solve (2) using the most appropriate models of the three functions (quadratic with a constant Hessian for $g$, cubically regularized quadratic for $\phi$ and no model for $\psi$), not even in the case $n = 1$.

Our approach generalizes several existing results:

• In the case when $n = 1$, $g = 0$ and $\psi = 0$, RBCN reduces to the cubically-regularized Newton method of Nesterov & Polyak (2006). Even when $n = 1$, RBCN can be seen as an extension of this method to composite optimization. For $n > 1$, RBCN provides an extension of the algorithm of Nesterov & Polyak (2006) to the randomized block coordinate setting, popular for high-dimensional problems.

• In the special case when $\phi = 0$ and $N_i = 1$ for all $i$, RBCN specializes to the stochastic Newton (SN) method of Qu et al. (2016). Applied to the empirical risk minimization problem (see Section 7), our method has a dual interpretation (see Algorithm 2). In this case, our method reduces to the stochastic dual Newton ascent method (SDNA), also described in (Qu et al., 2016). Hence, RBCN can be seen as an extension of SN and SDNA to blocks of arbitrary sizes, and to the inclusion of the twice differentiable term $\phi$.

• In the case when $\phi = 0$ and the simplest overapproximation of $g$ is assumed, $0 \preceq \nabla^2 g(x) \preceq LI$, the composite block coordinate method of Tseng & Yun (2009) can be applied to solve (1). Our method extends this in two directions: we add the twice-differentiable terms $\phi_i$, and we use a tighter model for $g$, exploiting all global curvature information (if available).

We prove high probability global convergence guarantees under several regimes, summarized next:

• Under no additional assumptions on $g$, $\phi$ and $\psi$ beyond convexity (and either boundedness of $Q$, or boundedness of the level sets of $F$ on $Q$), we prove the rate $O\!\left(\frac{n}{\tau\epsilon}\right)$, where $\tau$ is the mini-batch size (see Theorem 1).

• Under certain conditions combining the properties of $g$ with the way the random blocks are sampled, formalized by the assumption $\beta > 0$ (see (12) for the definition of $\beta$), we obtain the rate $O\!\left(\frac{n}{\tau\min\{1,\beta\}\sqrt{\epsilon}}\right)$ (see Theorem 2). In the special case when $n = 1$, we necessarily have $\tau = 1$ and $\beta = \mu/L$ (the reciprocal of the condition number of $g$), and we get the rate $O\!\left(\frac{L}{\mu\sqrt{\epsilon}}\right)$. If $g$ is quadratic and $\tau = n$, then $\beta = 1$ and the resulting complexity $O\!\left(\frac{1}{\sqrt{\epsilon}}\right)$ recovers the rate of the cubic Newton method established by Nesterov & Polyak (2006).

• Finally, if $g$ is strongly convex, the above result can be improved (see Theorem 3) to $O\!\left(\frac{n}{\tau\min\{1,\beta\}}\log\frac{1}{\epsilon}\right)$.

1.4. Contents

The rest of the paper is organized as follows. In Section 2 we introduce the notation and elementary identities needed to efficiently handle the block structure of our model. In Section 3 we make the various smoothness and convexity assumptions on $g$ and $\phi_i$ formal. Section 4 is devoted to the description of the block sampling process used in our method, along with some useful identities. In Section 5 we describe formally our randomized block cubic Newton (RBCN) method. Section 6 is devoted to the statement and description of our main convergence results, summarized in the introduction. Missing proofs are provided in the supplementary material. In Section 7 we show how to apply our method to the empirical risk minimization problem; applying RBCN to its dual leads to Algorithm 2. Finally, our numerical experiments on synthetic and real datasets are described in Section 8.

2. Block Structure

To model a block structure, we decompose the space $\mathbb{R}^N$ into $n$ subspaces in the following standard way. Let $U \in \mathbb{R}^{N\times N}$ be a column permutation of the $N\times N$ identity matrix $I$ and let a decomposition $U = [U_1, U_2, \dots, U_n]$ be given, where $U_i \in \mathbb{R}^{N\times N_i}$ are $n$ submatrices, $N = \sum_{i=1}^n N_i$. Subsequently, any vector $x \in \mathbb{R}^N$ can be uniquely represented as $x = \sum_{i=1}^n U_i x_{(i)}$, where $x_{(i)} \stackrel{\text{def}}{=} U_i^T x \in \mathbb{R}^{N_i}$.

In what follows we will use the standard Euclidean inner product $\langle x, y\rangle \stackrel{\text{def}}{=} \sum_i x_i y_i$, the Euclidean norm of a vector $\|x\| \stackrel{\text{def}}{=} \sqrt{\langle x, x\rangle}$, and the induced spectral norm of a matrix $\|A\| \stackrel{\text{def}}{=} \max_{\|x\|=1} \|Ax\|$. Using the block decomposition, for two vectors $x, y \in \mathbb{R}^N$ we have

$$\langle x, y\rangle = \Big\langle \sum_{i=1}^n U_i x_{(i)}, \sum_{j=1}^n U_j y_{(j)}\Big\rangle = \sum_{i=1}^n \langle x_{(i)}, y_{(i)}\rangle.$$

For a given nonempty subset $S$ of $[n] \stackrel{\text{def}}{=} \{1,\dots,n\}$ and for any vector $x \in \mathbb{R}^N$ we denote by $x_{[S]} \in \mathbb{R}^N$ the vector obtained from $x$ by retaining only the blocks $x_{(i)}$ for which $i \in S$ and zeroing all others:

$$x_{[S]} \stackrel{\text{def}}{=} \sum_{i\in S} U_i x_{(i)} = \sum_{i\in S} U_i U_i^T x.$$

Furthermore, for any matrix $A \in \mathbb{R}^{N\times N}$ we write $A_{[S]} \in \mathbb{R}^{N\times N}$ for the matrix obtained from $A$ by retaining only the elements whose indices both belong to coordinate blocks from $S$; formally,

$$A_{[S]} \stackrel{\text{def}}{=} \Big(\sum_{i\in S} U_i U_i^T\Big) A \Big(\sum_{i\in S} U_i U_i^T\Big).$$

Note that these definitions imply that

$$\langle A_{[S]} x, y\rangle = \langle A x_{[S]}, y_{[S]}\rangle, \qquad x, y \in \mathbb{R}^N.$$

Next, we define the block-diagonal operator, which, up to a permutation of coordinates, retains the diagonal blocks and nullifies the off-diagonal blocks:

$$\mathrm{blockdiag}(A) \stackrel{\text{def}}{=} \sum_{i=1}^n U_i U_i^T A U_i U_i^T = \sum_{i=1}^n A_{[\{i\}]}.$$

Finally, denote $\mathbb{R}^N_{[S]} \stackrel{\text{def}}{=} \{x_{[S]} \mid x \in \mathbb{R}^N\}$. This is a linear subspace of $\mathbb{R}^N$ composed of vectors which are zero in the blocks $i \notin S$.

3. Assumptions

In this section we formulate our main assumptions about the differentiable components of (2) and provide some examples to illustrate the concepts.

We assume that $g: \mathbb{R}^N \to \mathbb{R}$ is a differentiable function and all $\phi_i: \mathbb{R}^{N_i} \to \mathbb{R}$, $i \in [n]$, are twice differentiable. Thus, at any point $x \in \mathbb{R}^N$ we should be able to compute all the gradients $\{\nabla g(x), \nabla\phi_1(x_{(1)}), \dots, \nabla\phi_n(x_{(n)})\}$ and the Hessians $\{\nabla^2\phi_1(x_{(1)}), \dots, \nabla^2\phi_n(x_{(n)})\}$, or at least their actions on an arbitrary vector $h$ of appropriate dimension. Next, we formalize our assumptions about convexity and the level of smoothness. Speaking informally, $g$ is similar to a quadratic, and the functions $\phi_i$ are arbitrary twice-differentiable smooth functions.

Assumption 1 (Convexity) There is a positive semidefinite matrix $G \succeq 0$ such that for all $x, h \in \mathbb{R}^N$:

$$g(x+h) \geq g(x) + \langle\nabla g(x), h\rangle + \tfrac{1}{2}\langle Gh, h\rangle, \qquad (3)$$
$$\phi_i(x_{(i)} + h_{(i)}) \geq \phi_i(x_{(i)}) + \langle\nabla\phi_i(x_{(i)}), h_{(i)}\rangle, \qquad i \in [n].$$

In the special case when $G = 0$, (3) postulates convexity. For positive definite $G$, the objective will be strongly convex with the strong convexity parameter $\mu \stackrel{\text{def}}{=} \lambda_{\min}(G) > 0$.

Note that for the $\phi_i$ we only require convexity. However, if we happen to know that some $\phi_i$ is strongly convex ($\lambda_{\min}(\nabla^2\phi_i(y)) \geq \mu_i > 0$ for all $y \in \mathbb{R}^{N_i}$), we can move this strong convexity to $g$ by subtracting $\frac{\mu_i}{2}\|x_{(i)}\|^2$ from $\phi_i$ and adding it to $g$. This extra knowledge may in some particular cases improve the convergence guarantees for our algorithm, but it does not change the actual computations.

Assumption 2 (Smoothness of $g$) There is a positive semidefinite matrix $A \succeq 0$ such that for all $x, h \in \mathbb{R}^N$:

$$g(x+h) \leq g(x) + \langle\nabla g(x), h\rangle + \tfrac{1}{2}\langle Ah, h\rangle. \qquad (4)$$
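To make the block notation above concrete, here is a minimal NumPy sketch (ours, not part of the paper) of the operators $x_{[S]}$, $A_{[S]}$ and $\mathrm{blockdiag}(A)$, assuming the blocks are contiguous coordinate ranges; all helper names are illustrative.

```python
import numpy as np

# A minimal sketch of the block operators of Section 2, assuming the n blocks
# are contiguous coordinate ranges of sizes N_1, ..., N_n (helper names are ours).

def block_slices(block_sizes):
    """Return a list of index arrays, one per block."""
    offsets = np.cumsum([0] + list(block_sizes))
    return [np.arange(offsets[i], offsets[i + 1]) for i in range(len(block_sizes))]

def restrict_vector(x, S, blocks):
    """x_[S]: keep the blocks i in S, zero out the rest."""
    x_S = np.zeros_like(x)
    for i in S:
        x_S[blocks[i]] = x[blocks[i]]
    return x_S

def restrict_matrix(A, S, blocks):
    """A_[S]: keep entries whose row and column both lie in blocks from S."""
    idx = np.concatenate([blocks[i] for i in S])
    A_S = np.zeros_like(A)
    A_S[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return A_S

def blockdiag(A, blocks):
    """Keep the diagonal blocks of A and zero the off-diagonal blocks."""
    D = np.zeros_like(A)
    for idx in blocks:
        D[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return D

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block_sizes = [2, 3, 1]          # n = 3 blocks, N = 6
    blocks = block_slices(block_sizes)
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    A = rng.standard_normal((6, 6)); A = A + A.T
    S = [0, 2]
    # Identity <A_[S] x, y> = <A x_[S], y_[S]> from Section 2:
    lhs = restrict_matrix(A, S, blocks) @ x @ y
    rhs = A @ restrict_vector(x, S, blocks) @ restrict_vector(y, S, blocks)
    assert np.isclose(lhs, rhs)
```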

The main example of $g$ is a quadratic function $g(x) = \frac{1}{2}\langle Mx, x\rangle$ with a symmetric positive semidefinite $M \in \mathbb{R}^{N\times N}$, for which both (3) and (4) hold with $G = A = M$. Of course, any convex $g$ with Lipschitz-continuous gradient with a constant $L \geq 0$ satisfies (3) and (4) with $G = 0$ and $A = LI$ (Nesterov, 2004).

Assumption 3 (Smoothness of $\phi_i$) For every $i \in [n]$ there is a nonnegative constant $H_i \geq 0$ such that the Hessian of $\phi_i$ is Lipschitz-continuous:

$$\|\nabla^2\phi_i(x+h) - \nabla^2\phi_i(x)\| \leq H_i\|h\|, \qquad (5)$$

for all $x, h \in \mathbb{R}^{N_i}$.

Examples of functions which satisfy (5) with a known Lipschitz constant of the Hessian $H$ are the quadratic $\phi(t) = \|Ct - t_0\|^2$ ($H = 0$ for all parameters), the cubed norm $\phi(t) = (1/3)\|t - t_0\|^3$ ($H = 2$, see Lemma 5 in (Nesterov, 2008)), and the logistic regression loss $\phi(t) = \log(1 + \exp(t))$ ($H = 1/(6\sqrt{3})$, see Proposition 1 in the supplementary material).

For a fixed set of indices $S \subset [n]$ denote

$$\phi_S(x) \stackrel{\text{def}}{=} \sum_{i\in S}\phi_i(x_{(i)}), \qquad x \in \mathbb{R}^N.$$

Then we have:

$$\langle\nabla\phi_S(x), h\rangle = \sum_{i\in S}\langle\nabla\phi_i(x_{(i)}), h_{(i)}\rangle, \qquad x, h \in \mathbb{R}^N,$$
$$\langle\nabla^2\phi_S(x)h, h\rangle = \sum_{i\in S}\langle\nabla^2\phi_i(x_{(i)})h_{(i)}, h_{(i)}\rangle, \qquad x, h \in \mathbb{R}^N.$$

Lemma 1 If Assumption 3 holds, then for all $x, h \in \mathbb{R}^N$ we have the following second-order approximation bound:

$$\phi_S(x+h) - \phi_S(x) - \langle\nabla\phi_S(x), h\rangle - \tfrac{1}{2}\langle\nabla^2\phi_S(x)h, h\rangle \;\leq\; \frac{\max_{i\in S}\{H_i\}}{6}\cdot\|h_{[S]}\|^3. \qquad (6)$$

From now on we denote $H_F \stackrel{\text{def}}{=} \max\{H_1, H_2, \dots, H_n\}$.

4. Sampling of Blocks

In this section we summarize some basic properties of the sampling $\hat S$, which is a random set-valued mapping with values being subsets of $[n]$. For a fixed block decomposition, with each sampling $\hat S$ we associate the probability matrix $P \in \mathbb{R}^{N\times N}$ as follows: an element of $P$ is the probability of choosing a pair of blocks which contains the indices of this element. Denoting by $E \in \mathbb{R}^{N\times N}$ the matrix of all ones, we have $P = \mathbb{E}\big[E_{[\hat S]}\big]$. We restrict our analysis to uniform samplings, defined next.

Assumption 4 (Uniform sampling) The sampling $\hat S$ is uniform, i.e., $\mathbb{P}(i \in \hat S) = \mathbb{P}(j \in \hat S) \stackrel{\text{def}}{=} p$ for all $i, j \in [n]$.

The above assumption means that the diagonal of $P$ is constant: $P_{ii} = p$ for all $i \in [N]$. It is easy to see that (Corollary 3.1 in (Qu & Richtárik, 2016)):

$$\mathbb{E}\big[A_{[\hat S]}\big] = A \circ P, \qquad (7)$$

where $\circ$ denotes the Hadamard product.

Denote $\tau \stackrel{\text{def}}{=} \mathbb{E}[|\hat S|] = np$ (the expected minibatch size). The special uniform sampling defined by picking from all subsets of size $\tau$ uniformly at random is called the $\tau$-nice sampling. If $\hat S$ is $\tau$-nice, then (see Lemma 4.3 in (Qu & Richtárik, 2016))

$$P = \frac{\tau}{n}\big((1-\gamma)\,\mathrm{blockdiag}(E) + \gamma E\big), \qquad (8)$$

where $\gamma = (\tau-1)/(n-1)$. In particular, the above results in the following:

Lemma 2 For the $\tau$-nice sampling $\hat S$, we have

$$\mathbb{E}\big[A_{[\hat S]}\big] = \frac{\tau}{n}\Big(1 - \frac{\tau-1}{n-1}\Big)\,\mathrm{blockdiag}(A) + \frac{\tau(\tau-1)}{n(n-1)}\,A.$$

Proof: Combine (7) and (8).

5. Algorithm

Due to the problem structure (2), and utilizing the smoothness of the components (see (4) and (5)), for a fixed subset of indices $S \subset [n]$ it is natural to consider the following model of our objective $F$ around a point $x \in \mathbb{R}^N$:

$$M_{H,S}(x;y) \stackrel{\text{def}}{=} F(x) + \big\langle(\nabla g(x))_{[S]}, y\big\rangle + \tfrac{1}{2}\big\langle A_{[S]}y, y\big\rangle + \big\langle(\nabla\phi(x))_{[S]}, y\big\rangle + \tfrac{1}{2}\big\langle(\nabla^2\phi(x))_{[S]}y, y\big\rangle + \frac{H}{6}\|y_{[S]}\|^3 + \sum_{i\in S}\big(\psi_i(x_{(i)} + y_{(i)}) - \psi_i(x_{(i)})\big). \qquad (9)$$

The above model arises as a combination of a first-order model for $g$ with global curvature information provided by the matrix $A$, a second-order model with cubic regularization (following (Nesterov & Polyak, 2006)) for $\phi$, and a perfect model for the non-differentiable terms $\psi_i$ (i.e., we keep these terms as they are).

Combining (4) and (6), and for large enough $H$ ($H \geq \max_{i\in S} H_i$ is sufficient), we get the global upper bound

$$F(x+y) \leq M_{H,S}(x;y), \qquad x \in \mathbb{R}^N, \; y \in \mathbb{R}^N_{[S]}.$$
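To illustrate the model (9) and the upper bound above, the following sketch (ours; the toy instance and all names are illustrative) evaluates $M_{H,S}(x;y)$ for a quadratic $g$ with $A = \nabla^2 g$, cubed-norm terms $\phi_i$ (so $H_i = 2$), $\psi \equiv 0$ and $Q = \mathbb{R}^N$, and checks numerically that $F(x+y) \le M_{H,S}(x;y)$ for steps $y$ supported on a block subset $S$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance (ours): N = 6 coordinates split into n = 3 blocks of size 2.
blocks = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
M = rng.standard_normal((6, 6)); M = M @ M.T            # g(x) = 0.5 <Mx, x>, so A = G = M
t = [rng.standard_normal(2) for _ in range(3)]          # phi_i(u) = (1/3) ||u - t_i||^3, H_i = 2

def g(x):            return 0.5 * x @ M @ x
def grad_g(x):       return M @ x
def phi(i, u):       return np.linalg.norm(u - t[i]) ** 3 / 3.0
def grad_phi(i, u):  r = u - t[i]; return np.linalg.norm(r) * r
def hess_phi(i, u):
    r = u - t[i]; nr = np.linalg.norm(r)
    return nr * np.eye(2) + np.outer(r, r) / nr if nr > 0 else np.zeros((2, 2))

def F(x):
    return g(x) + sum(phi(i, x[blocks[i]]) for i in range(3))

def model(x, y, S, H):
    """M_{H,S}(x; y) from (9) for a step y supported on the blocks in S (psi = 0)."""
    # Since y = y_[S]: <(grad g)_[S], y> = <grad g, y> and <A_[S] y, y> = <A y, y>.
    val = F(x) + grad_g(x) @ y + 0.5 * y @ M @ y
    for i in S:
        yi = y[blocks[i]]
        val += grad_phi(i, x[blocks[i]]) @ yi + 0.5 * yi @ hess_phi(i, x[blocks[i]]) @ yi
    val += H / 6.0 * np.linalg.norm(y) ** 3
    return val

S, H = [0, 2], 2.0                                      # H >= max_{i in S} H_i = 2
x = rng.standard_normal(6)
for _ in range(1000):                                   # random steps supported on S
    y = np.zeros(6)
    for i in S:
        y[blocks[i]] = rng.standard_normal(2)
    assert F(x + y) <= model(x, y, S, H) + 1e-9
```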

Moreover, the value of all summands in $M_{H,S}(x;y)$ depends only on the subset of blocks $\{y_{(i)} \mid i \in S\}$, and therefore

$$T_{H,S}(x) \stackrel{\text{def}}{=} \operatorname*{argmin}_{y\in\mathbb{R}^N_{[S]},\; x+y\in Q} M_{H,S}(x;y) \qquad (10)$$

can be computed efficiently for small $|S|$, as long as $Q$ is simple (for example, affine). Denote the minimum of the cubic model by $M^*_{H,S}(x) \stackrel{\text{def}}{=} M_{H,S}(x; T_{H,S}(x))$. The RBCN method performs the update $x \leftarrow x + T_{H,S}(x)$, and is formalized as Algorithm 1.

Algorithm 1 RBCN: Randomized Block Cubic Newton
1: Parameters: sampling distribution $\hat S$
2: Initialization: choose an initial point $x^0 \in Q$
3: for $k = 0, 1, 2, \dots$ do
4:   Sample $S_k \sim \hat S$
5:   Find $H_k \in (0, 2H_F]$ such that $F\big(x^k + T_{H_k,S_k}(x^k)\big) \leq M^*_{H_k,S_k}(x^k)$
6:   Make the step $x^{k+1} \stackrel{\text{def}}{=} x^k + T_{H_k,S_k}(x^k)$
7: end for

6. Convergence Results

In this section we establish several convergence rates for Algorithm 1 under various structural assumptions: for the general class of convex problems, and for the more specific strongly convex case. We will focus on the family of uniform samplings only, but generalizations to other sampling distributions are also possible.

6.1. Convex Loss

We start from the general situation where the term $g(x)$ and all the $\phi_i(x_{(i)})$ and $\psi_i(x_{(i)})$, $i \in [n]$, are convex, but not necessarily strongly convex.

Denote by $D$ the maximum distance from an optimal point $x^*$ to the initial level set:

$$D \stackrel{\text{def}}{=} \sup_x\big\{\|x - x^*\| \;\big|\; x \in Q,\; F(x) \leq F(x^0)\big\}.$$

Theorem 1 Let Assumptions 1, 2, 3, 4 hold. Let a solution $x^* \in Q$ of problem (1) exist, and assume the level sets are bounded: $D < +\infty$. Choose a required accuracy $\varepsilon > 0$ and a confidence level $\rho \in (0,1)$. Then after

$$K \geq \frac{2n}{\varepsilon\tau}\Big(1 + \log\frac{1}{\rho}\Big)\max\big\{LD^2 + H_F D^3,\; F(x^0) - F^*\big\} \qquad (11)$$

iterations of Algorithm 1, where $L \stackrel{\text{def}}{=} \lambda_{\max}(A)$, we have

$$\mathbb{P}\big(F(x^K) - F^* \leq \varepsilon\big) \geq 1 - \rho.$$

The above theoretical result provides a global sublinear rate of convergence, with iteration complexity of the order $O(1/\varepsilon)$. Note that for the case $\phi(x) \equiv 0$ we can put $H_F = 0$, and Theorem 1 then restates the well-known result on the convergence of composite gradient-type block-coordinate methods (see, for example, (Richtárik & Takáč, 2014)).

6.2. Strongly Convex Loss

Here we study the case when the matrix $G$ from the convexity assumption (3) is strictly positive definite: $G \succ 0$, which means that the objective $F$ is strongly convex with a constant $\mu \stackrel{\text{def}}{=} \lambda_{\min}(G) > 0$.

Denote by $\beta$ a condition number for the function $g$ and the sampling distribution $\hat S$: the maximum nonnegative real number such that

$$\beta \cdot \mathbb{E}_{S\sim\hat S}\big[A_{[S]}\big] \;\preceq\; \frac{\tau}{n}\,G. \qquad (12)$$

If (12) holds for all nonnegative $\beta$, we put by definition $\beta \equiv +\infty$.

A simple lower bound exists: $\beta \geq \frac{\mu}{L} > 0$, where $L = \lambda_{\max}(A)$, as in Theorem 1. However, because (12) depends not only on $g$ but also on the sampling distribution $\hat S$, it is possible that $\beta > 0$ even if $\mu = 0$ (for example, $\beta = 1$ if $\mathbb{P}(S = [n]) = 1$ and $A = G \neq 0$).

The following theorems describe global iteration complexity guarantees of the order $O(1/\sqrt{\varepsilon})$ and $O(\log(1/\varepsilon))$ for Algorithm 1 in the cases $\beta > 0$ and $\mu > 0$ respectively, which is an improvement over the general $O(1/\varepsilon)$.

Theorem 2 Let Assumptions 1, 2, 3, 4 hold. Let a solution $x^* \in Q$ of problem (1) exist, let the level sets be bounded: $D < +\infty$, and assume that $\beta$, defined by (12), is greater than zero. Choose a required accuracy $\varepsilon > 0$ and a confidence level $\rho \in (0,1)$. Then after

$$K \geq \frac{2n}{\sqrt{\varepsilon}\,\tau\sigma}\Big(2 + \log\frac{1}{\rho}\Big)\sqrt{\max\big\{H_F D^3,\; F(x^0) - F^*\big\}} \qquad (13)$$

iterations of Algorithm 1, where $\sigma = \min\{\beta, 1\} > 0$, we have

$$\mathbb{P}\big(F(x^K) - F^* \leq \varepsilon\big) \geq 1 - \rho.$$

Theorem 3 Let Assumptions 1, 2, 3, 4 hold. Let a solution $x^* \in Q$ of problem (1) exist, and assume that $\mu \stackrel{\text{def}}{=} \lambda_{\min}(G)$ is strictly positive. Then after

$$K \geq \frac{3n}{2\tau\sigma}\,\max\Big\{\sqrt{\tfrac{H_F D}{\mu}},\, 1\Big\}\,\log\frac{F(x^0) - F^*}{\varepsilon\rho} \qquad (14)$$

iterations of Algorithm 1, we have

$$\mathbb{P}\big(F(x^K) - F^* \leq \varepsilon\big) \geq 1 - \rho.$$
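To make Algorithm 1 concrete, here is a minimal sketch of one possible implementation (ours, under simplifying assumptions: $\psi \equiv 0$, $Q = \mathbb{R}^N$, quadratic $g$ with $A = \nabla^2 g$, and the constant choice $H_k := \max_{i\in S_k} H_i$ discussed in Section 6.3). The block subproblem (10) is minimized here with a generic numerical solver rather than the exact procedure of Section 6.3; all names and the toy instance are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy instance (ours): F(x) = 0.5 <Mx, x> - <b, x> + sum_i (c_i/3) ||x_(i) - t_i||^3,
# with n blocks of size 2, psi = 0 and Q = R^N, so Assumption 2 holds with A = M.
n, bs = 5, 2
N = n * bs
blocks = [np.arange(i * bs, (i + 1) * bs) for i in range(n)]
M = rng.standard_normal((N, N)); M = M @ M.T + np.eye(N)
b = rng.standard_normal(N)
t = rng.standard_normal((n, bs))
c = 1.0 + np.abs(rng.standard_normal(n))     # the cubed norm scaled by c_i has H_i = 2 c_i

def F(x):
    val = 0.5 * x @ M @ x - b @ x
    return val + sum(c[i] / 3.0 * np.linalg.norm(x[blocks[i]] - t[i]) ** 3 for i in range(n))

def rbcn(x0, tau=2, iters=200):
    x = x0.copy()
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)       # tau-nice sampling
        H = max(2.0 * c[i] for i in S)                    # constant choice H_k = max_{i in S_k} H_i
        idx = np.concatenate([blocks[i] for i in S])
        gS = (M @ x - b)[idx]                             # grad g(x) restricted to the blocks in S
        AS = M[np.ix_(idx, idx)]                          # A_[S] restricted to the blocks in S

        def model(z):                                     # M_{H,S}(x; y) - F(x), with y_S = z
            val = gS @ z + 0.5 * z @ AS @ z + H / 6.0 * np.linalg.norm(z) ** 3
            for pos, i in enumerate(S):
                zi = z[pos * bs:(pos + 1) * bs]
                r = x[blocks[i]] - t[i]
                nr = np.linalg.norm(r)
                grad = c[i] * nr * r
                hess = c[i] * (nr * np.eye(bs) + (np.outer(r, r) / nr if nr > 0 else 0))
                val += grad @ zi + 0.5 * zi @ hess @ zi
            return val

        z = minimize(model, np.zeros(len(idx)), method="L-BFGS-B").x
        x[idx] += z                                       # the step T_{H,S}(x)
    return x

x = rbcn(np.zeros(N))
print("final objective:", F(x))
```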

The above complexity estimates show which parameters of the problem directly affect the convergence of the algorithm.

Bound (13) improves the initial estimate (11) by the factor $\sqrt{D_0/\varepsilon}$. The cost is the additional term $\sigma^{-1} = (\min\{\beta,1\})^{-1}$, which grows as the condition number $\beta$ becomes smaller.

The opposite, limiting case is when the quadratic part of the objective vanishes ($g(x) \equiv 0 \Rightarrow \sigma = 1$). Algorithm 1 then turns into a parallelized, block-independent minimization of the objective components via cubically regularized Newton steps, and the complexity estimate coincides with the known result of (Nesterov & Polyak, 2006) in the nonrandomized ($\tau = n$, $\rho \to 1$) setting.

Bound (14) guarantees a linear rate of convergence, which means a logarithmic dependence of the number of iterations on the required accuracy $\varepsilon$. The main complexity factor becomes the product of two terms: $\sigma^{-1}\cdot\max\{H_F D/\mu,\, 1\}^{1/2}$. For the case $\phi(x) \equiv 0$ we can put $H_F = 0$ and recover the stochastic Newton method from (Qu et al., 2016) with its global linear convergence guarantee.

Although a linear rate is asymptotically better than a sublinear one, and $O(1/\sqrt{\varepsilon})$ is asymptotically better than $O(1/\varepsilon)$, we need to take into account all the factors which may slow the algorithm down. Thus, as $\mu = \lambda_{\min}(G) \to 0$, estimate (13) becomes better than (14), just as (11) becomes better than (13) as $\beta \to 0$.

6.3. Implementation Issues

Let us explain how one step of the method can be performed, which requires the minimization of the cubic model (10), possibly with some simple convex constraints.

The first and classical approach was proposed in (Nesterov & Polyak, 2006) and, earlier for trust-region methods, in (Conn et al., 2000). It works in the unconstrained ($Q \equiv \mathbb{R}^N$) and differentiable ($\psi(x) \equiv 0$) case. First we need to find a root of a special one-dimensional nonlinear equation (this can be done, for example, by simple Newton iterations). After that, we just solve one linear system to produce a step of the method. The total complexity of solving the subproblem can then be estimated as $O(d^3)$ arithmetical operations, where $d$ is the dimension of the subproblem, in our case $d = |S|$. Since some matrix factorization is used, the cost of the cubically regularized Newton step is actually similar in efficiency to the classical Newton step. See also (Gould et al., 2010) for a detailed analysis. For the case of affine constraints, the same procedure can be applied. An example of using this technique is given by Lemma 3 in the next section.

Another approach is based on finding an inexact solution of the subproblem by fast approximate eigenvalue computation (Agarwal et al., 2016) or by applying gradient descent (Carmon & Duchi, 2016). Both of these schemes provide global convergence guarantees. Additionally, they are Hessian-free: we need only a procedure for multiplying the quadratic part of (9) by an arbitrary vector, without storing the full matrix. The latter approach is the most universal one and can be extended to the composite case by using the proximal gradient method or its accelerated variant (Nesterov, 2013).

There are basically two strategies for choosing the parameter $H_k$ at every iteration: a constant choice, $H_k := \max_{i\in S_k}\{H_i\}$ or $H_k := H_F$, if the Lipschitz constants of the Hessians are known; or a simple adaptive procedure, which performs a truncated binary search and has a logarithmic cost per step of the method. An example of such a procedure can be found in the primal-dual Algorithm 2 in the next section.

6.4. Extension of the Problem Class

The randomized cubic model (9), considered and analyzed above, arises naturally from the separable structure (2) and from our smoothness assumptions (4), (5). Let us discuss an interpretation of Algorithm 1 in terms of the general problem $\min_{x\in\mathbb{R}^N} F(x)$ with twice-differentiable $F$ (omitting the non-differentiable component for simplicity). One can state and minimize the model

$$m_{H,S}(x;y) \equiv F(x) + \big\langle(\nabla F(x))_{[S]}, y\big\rangle + \tfrac{1}{2}\big\langle(\nabla^2 F(x))_{[S]}y, y\big\rangle + \frac{H}{6}\|y_{[S]}\|^3,$$

which is a sketched version of the originally proposed cubic Newton method (Nesterov & Polyak, 2006). For alternative sketched variants of Newton-type methods, but without cubic regularization, see (Pilanci & Wainwright, 2015).

The latter model $m_{H,S}(x;y)$ is the same as $M_{H,S}(x;y)$ when inequality (4) from the smoothness assumption on $g$ holds with equality, i.e., when the function $g$ is a quadratic with $\nabla^2 g(x) \equiv A$. Thus, we may use $m_{H,S}(x;y)$ instead of $M_{H,S}(x;y)$, which is still computationally cheap for small $|S|$. However, to the best of our knowledge, this model does not give any convergence guarantees for general $F$, unless $S = [n]$. It can nevertheless be a workable approach when the separable structure (2) is not available.

Note also that Assumption 3 about the Lipschitz-continuity of the Hessian is not too restrictive. A recent result (Grapiglia & Nesterov, 2017) shows that Newton-type methods with cubic regularization and with a standard procedure for adaptive estimation of $H_F$ automatically fit the actual level of smoothness of the Hessian without any additional changes in the algorithm.

Moreover, step (10) of the method, being the global minimum of $M_{H,S}(x;y)$, is well-defined and computationally tractable even in nonconvex cases (Nesterov & Polyak, 2006). Thus we can try to apply the method to nonconvex objectives as well, but without known theoretical guarantees for $S \neq [n]$.

7. Empirical Risk Minimization

One of the most popular examples of optimization problems in machine learning is the empirical risk minimization problem, which in many cases can be formulated as follows:

$$\min_{w\in\mathbb{R}^d}\Big[ P(w) \equiv \frac{1}{m}\sum_{i=1}^m \phi_i(b_i^T w) + \lambda g(w)\Big], \qquad (15)$$

where the $\phi_i$ are convex loss functions, $g$ is a regularizer, the variables $w$ are the weights of a model, and $m$ is the size of the dataset.

7.1. Constrained Problem Reformulation

Let us consider the case when the dimension $d$ of problem (15) is very large and $d \gg m$. This calls for a coordinate-randomization technique. Note that formulation (15) does not directly fit our problem setup (2), but we can easily transform it into the following constrained optimization problem by introducing new variables $\alpha_i \equiv b_i^T w$:

$$\min_{w\in\mathbb{R}^d,\,\alpha\in\mathbb{R}^m}\Big[\frac{1}{m}\sum_{i=1}^m\phi_i(\alpha_i) + \lambda g(w) + \sum_{i=1}^m \mathcal{I}\{\alpha_i = b_i^T w\}\Big]. \qquad (16)$$

Following our framework, on every step we sample a random subset of coordinates $S \subset [d]$ of the weights $w$ and build the cubic model of the objective (assuming that the $\phi_i$ and $g$ satisfy (4), (5)):

$$M_{H,S}(w,\alpha;y,h) \equiv \lambda\Big(\big\langle(\nabla g(w))_{[S]}, y\big\rangle + \tfrac{1}{2}\big\langle A_{[S]}y, y\big\rangle\Big) + \frac{1}{m}\sum_{i=1}^m\Big(\phi_i'(\alpha_i)h_i + \tfrac{1}{2}\phi_i''(\alpha_i)h_i^2 + \frac{H}{6}\|h\|^3\Big) + P(w),$$

and minimize it over $y$ and $h$ on the affine set:

$$(y^*, h^*) := \operatorname*{argmin}_{y\in\mathbb{R}^d_{[S]},\,h\in\mathbb{R}^m,\; h = By} M_{H,S}(w,\alpha;y,h), \qquad (17)$$

where the rows of the matrix $B \in \mathbb{R}^{m\times d}$ are $b_i^T$. The updates of the variables are then $w^+ := w + y^*$ and $\alpha^+ := \alpha + h^*$.

The following lemma addresses how to solve (17), which is required on every step of the method. Its proof can be found in the supplementary material.

Lemma 3 Denote by $\hat B \in \mathbb{R}^{m\times|S|}$ the submatrix of $B$ formed by the columns with indices in $S$, by $\hat A \in \mathbb{R}^{|S|\times|S|}$ the submatrix of $A$ with elements whose indices are both from $S$, and by $b_1 \in \mathbb{R}^{|S|}$ the subvector of $\nabla g(w)$ with element indices from $S$. Denote the vector $b_2 \equiv (\phi_i'(\alpha_i))_{i=1}^m$ and $b \equiv m\lambda b_1 + \hat B^T b_2$. Define the family of matrices $Z(\tau): \mathbb{R}_+ \to \mathbb{R}^{|S|\times|S|}$,

$$Z(\tau) \stackrel{\text{def}}{=} m\lambda\hat A + \hat B^T\Big(\mathrm{diag}\big(\phi_i''(\alpha_i)\big) + \frac{H\tau}{2}I\Big)\hat B.$$

Then the solution $(y^*, h^*)$ of (17) can be found from the equations $Z(\tau^*)y^*_S = -b$, $h^* = \hat B y^*_S$, where $\tau^* \geq 0$ satisfies the one-dimensional nonlinear equation $\tau^* = \|\hat B(Z(\tau^*))^\dagger b\|$, and $y^*_S \in \mathbb{R}^{|S|}$ is the subvector of the solution $y^*$ with element indices from $S$.

Thus, after we find the root of the nonlinear one-dimensional equation, we need to solve an $|S|\times|S|$ linear system to compute $y^*$. Then, to find $h^*$, we do one matrix-vector multiplication with the matrix of size $m\times|S|$. The matrix $B$ usually has a sparse structure when $m$ is big, which should also be exploited in an effective implementation.

7.2. Maximization of the Dual Problem

Another approach to solving optimization problem (15) is to maximize its Fenchel dual (Rockafellar, 1997):

$$\max_{\alpha\in\mathbb{R}^m}\Big[ D(\alpha) \equiv \frac{1}{m}\sum_{i=1}^m -\phi_i^*(-\alpha_i) - \lambda g^*\Big(\frac{1}{\lambda m}B^T\alpha\Big)\Big], \qquad (18)$$

where $g^*$ and $\{\phi_i^*\}$ are the Fenchel conjugate functions of $g$ and $\{\phi_i\}$ respectively, $f^*(s) \stackrel{\text{def}}{=} \sup_x[\langle s, x\rangle - f(x)]$ for an arbitrary $f$. It is known (Bertsekas, 1978) that if $\phi_i$ is twice-differentiable in a neighborhood of $y$ and $\nabla^2\phi_i(y) \succ 0$ in this area, then its Fenchel conjugate $\phi_i^*$ is also twice-differentiable in some neighborhood of $s = \nabla\phi_i(y)$ and it holds that $\nabla^2\phi_i^*(s) = (\nabla^2\phi_i(y))^{-1}$.

Then, in the case of a smooth differentiable $g^*$ and twice-differentiable $\phi_i^*$, $i \in [m]$, we can apply our framework to (18) by doing cubic steps in random subsets of the dual variables $\alpha \in \mathbb{R}^m$. The primal $w \in \mathbb{R}^d$ corresponding to a particular $\alpha$ can be computed from the stationarity equation

$$w = \nabla g^*\Big(\frac{1}{\lambda m}B^T\alpha\Big),$$

which holds for solutions of the primal (15) and dual (18) problems in the case of strong duality.

Let us assume that the function $g$ is 1-strongly convex (which is of course true for the $\ell_2$-regularizer $\frac{1}{2}\|w\|_2^2$). Then for $G(\alpha) \equiv \lambda g^*\big(\frac{1}{\lambda m}B^T\alpha\big)$ a uniform bound for the Hessian exists: $\nabla^2 G(\alpha) \preceq \frac{1}{\lambda m^2}BB^T$.

As before, we build the randomized cubic model and compute its minimizer (setting $Q \equiv \bigcap_{i=1}^m \mathrm{dom}\,\phi_i^*$):

$$M_{H,S}(\alpha, h) \equiv -D(\alpha) + \lambda\Big\langle\nabla g^*\Big(\frac{1}{\lambda m}B^T\alpha\Big), h_{[S]}\Big\rangle + \frac{1}{2\lambda m^2}\|Bh_{[S]}\|^2 + \frac{1}{m}\sum_{i\in S}\Big(\big\langle-\nabla\phi_i^*(-\alpha_i), h_i\big\rangle + \tfrac{1}{2}\big\langle\nabla^2\phi_i^*(-\alpha_i)h_i, h_i\big\rangle\Big) + \frac{H}{6}\|h_{[S]}\|^3, \qquad S \subset [m],$$

$$T_{H,S}(\alpha) \equiv \operatorname*{argmin}_{h\in\mathbb{R}^m_{[S]},\;\alpha+h\in Q} M_{H,S}(\alpha, h), \qquad M^*_{H,S}(\alpha) \equiv M_{H,S}\big(\alpha, T_{H,S}(\alpha)\big).$$

Because in general we may not know the exact Lipschitz constants of the Hessians, we perform an adaptive search for estimating $H$. The resulting primal-dual scheme is presented in Algorithm 2. When a small subset of coordinates $S$ is used, the most expensive operations become the computation of the objective value $D(\alpha)$ at the current point and the matrix-vector product $B^T\alpha$. Both can be significantly optimized by storing already computed values in memory and updating only the changed information on every step.

Algorithm 2 Stochastic Dual Cubic Newton Ascent (SDCNA)
1: Parameters: sampling distribution $\hat S$
2: Initialization: choose initial $\alpha^0 \in Q$ and $H_0 > 0$
3: for $k = 0, 1, 2, \dots$ do
4:   Make a primal update $w^k := \nabla g^*\big(\frac{1}{\lambda m}B^T\alpha^k\big)$
5:   Sample $S_k \sim \hat S$
6:   While $M^*_{H_k,S_k}(\alpha^k) > -D\big(\alpha^k + T_{H_k,S_k}(\alpha^k)\big)$ do
7:     $H_k := \frac{1}{2}H_k$
8:   Make a dual update $\alpha^{k+1} := \alpha^k + T_{H_k,S_k}(\alpha^k)$
9:   Set $H_{k+1} := 2H_k$
10: end for

8. Numerical experiments

8.1. Synthetic

We consider the following synthetic regression task:

$$\min_{x\in\mathbb{R}^N}\;\frac{1}{2}\|Ax - b\|_2^2 + \sum_{i=1}^N\frac{c_i}{6}|x_i|^3,$$

with randomly generated parameters and for different $N$. On each problem of this type we run Algorithm 1 and evaluate the total computational time until reaching $10^{-12}$ accuracy in the function residual. Using middle-size blocks of coordinates on each step is the best choice in terms of total computational time, compared with small coordinate subsets and with the full-coordinate method. This agrees with the provided complexity estimates for the method: increasing the batch size speeds up the convergence rate linearly, but increases the cost of one iteration cubically. Therefore, the optimal block size is at a medium level.

Figure 1. Time it takes to solve a problem for different sampling block sizes. Left: synthetic problem. Center and right: logistic regression for real data. [Plots of total computational time (s) versus block size, comparing block coordinate gradient descent and block coordinate cubic Newton; synthetic problem with tolerance 1e-12 and N = 4K, 8K, 12K, 16K; leukemia and duke breast-cancer with tolerance 1e-6.]

8.2. Logistic regression

In this experiment we train an $\ell_2$-regularized logistic regression model for a two-class classification task via its constrained reformulation (16), and compare Algorithm 1 with block coordinate gradient descent (see, for example, (Tseng & Yun, 2009)) on the datasets²: leukemia ($m = 38$, $d = 7129$) and duke breast-cancer ($m = 44$, $d = 7129$). We see that using coordinate blocks of size 25–50 for the cubic Newton method outperforms all other configurations of both methods in terms of total computational time. Increasing the block size further starts to significantly slow down the method because of the high cost of every iteration.

8.3. Poisson regression

In this experiment we train a Poisson model for a regression task with integer responses by the primal-dual Algorithm 2, and compare it with the SDCA (Shalev-Shwartz & Zhang, 2013) and SDNA (Qu et al., 2016) methods on synthetic ($m = 1000$, $d = 200$) and real data³ ($m = 319$, $d = 20$). Our approach requires a smaller number of epochs to reach a given accuracy, while the computational cost of every step is the same as in the SDNA method.

In the following set of experiments with Poisson regression we take the real datasets australian ($m = 690$, $d = 14$), breast-cancer ($m = 683$, $d = 10$), splice ($m = 1000$, $d = 60$) and svmguide3 ($m = 1243$, $d = 21$), and use for the response a random vector $y \in (\mathbb{N}\cup\{0\})^m$ from the standard Poisson distribution.

²http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
³https://www.kaggle.com/pablomonleon/montreal-bike-lanes

Figure 2. Comparison of Algorithm 2 (marked as Cubic) with the SDNA and SDCA methods for minibatch sizes $\tau$ = 8, 32, 256, training Poisson regression. Left: synthetic. Right: real data. [Plots of duality gap versus epoch for the synthetic and Montreal bike lanes datasets.]

Figure 3. Comparison of Algorithm 2 (marked as Cubic) with the SDNA and SDCA methods for minibatch sizes $\tau$ = 8, 32, 256, training Poisson regression. [Plots of duality gap versus epoch for the australian, breast-cancer, splice and svmguide3 datasets.]

We see that in all the cases our method (Algorithm 2) outperforms the state-of-the-art analogues in terms of the number of data accesses.

8.4. Experimental setup

In all our experiments we use $\tau$-nice sampling, i.e., subsets of fixed size chosen from all such subsets uniformly at random (note that such subsets overlap).

In the experiments with synthetic cubically regularized regression we generate the data in the following way: first sample a matrix $U \in \mathbb{R}^{10\times N}$, each entry of which is identically distributed from $\mathcal{N}(0,1)$, then put $A := U^T U \in \mathbb{R}^{N\times N}$ and $b := -U^T\xi$, where $\xi \in \mathbb{R}^{10}$ is a normal vector, $\xi_i \sim \mathcal{N}(0,1)$ for each $1 \leq i \leq 10$, and $c_j := 1 + |v_j|$, where $v_j \sim \mathcal{N}(0,1)$ for all $1 \leq j \leq N$. Running Algorithm 1 we use for $H_k$ the known Lipschitz constants of the Hessians: $H_k := \max_{i\in S_k} c_i$.

In the experiments with $\ell_2$-regularized logistic regression⁴ we use the Armijo rule for computing the step length in block coordinate gradient descent, and the constant choice for the parameters $H_k$ in Algorithm 1 provided by Proposition 1.

In the experiments with Poisson regression, for the methods SDNA and SDCA we use the damped Newton method as a computational subroutine to compute one method step, running it until reaching $10^{-12}$ accuracy in the norm of the gradient of the corresponding subproblem, and the same accuracy for solving the inner cubic subproblem in Algorithm 2. For the synthetic experiment with Poisson regression, we generate the data matrix $B \in \mathbb{R}^{m\times d}$ as independent samples from the standard normal distribution $\mathcal{N}(0,1)$ and the corresponding vector of responses $y \in (\mathbb{N}\cup\{0\})^m$ from the standard Poisson distribution (with mean parameter equal to 1).

⁴The regularization parameter $\lambda$ in all the experiments with logistic and Poisson regression is set to $1/m$.

9. Acknowledgments

The work of the first author was partly supported by Samsung Research, Samsung Electronics, the Russian Science Foundation Grant 17-11-01027 and KAUST.

References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Bertsekas, D. P. Local convex conjugacy and Fenchel duality. In Preprints of 7th Triennial World Congress of IFAC, Helsinki, Finland, volume 2, pp. 1079–1084, 1978.

Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011a.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011b.

Conn, A. R., Gould, N. I., and Toint, P. L. Trust region methods. SIAM, 2000.

Ghadimi, S., Liu, H., and Zhang, T. Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782, 2017.

Gould, N. I., Robinson, D. P., and Thorne, H. S. On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation, 2(1):21–57, 2010.

Grapiglia, G. N. and Nesterov, Y. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478–506, 2017.

Griewank, A. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, 1981.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017.

Mutný, M. and Richtárik, P. Parallel stochastic Newton method. Journal of Computational Mathematics, 36(3), 2018.

Nesterov, Y. Introductory lectures on convex optimization. 2004.

Nesterov, Y. Modified Gauss–Newton scheme with worst case guarantees for global performance. Optimisation Methods and Software, 22(3):469–483, 2007.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Pilanci, M. and Wainwright, M. J. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.

Qu, Z., Richtárik, P., Takáč, M., and Fercoq, O. SDNA: stochastic dual Newton ascent for empirical risk minimization. In International Conference on Machine Learning, pp. 1823–1832, 2016.

Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

Rockafellar, R. T. Convex analysis. Princeton Landmarks in Mathematics, 1997.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. arXiv preprint arXiv:1711.02838, 2017.

Tseng, P. and Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–423, 2009.

Appendix

A. Proof of Lemma 1

Proof: Let us show that the Hessian of $\phi_S(x)$ is Lipschitz-continuous with the constant $\max_{i\in S}\{H_i\}$. Then (6) will be fulfilled (see Lemma 1 in (Nesterov & Polyak, 2006)). So,

$$\big\|\nabla^2\phi_S(x+h) - \nabla^2\phi_S(x)\big\|^2 = \max_{\|y\|=1}\big\|\big(\nabla^2\phi_S(x+h) - \nabla^2\phi_S(x)\big)y\big\|^2$$

$$= \max_{\|y\|=1}\sum_{i\in S}\big\|\big(\nabla^2\phi_i(x_{(i)}+h_{(i)}) - \nabla^2\phi_i(x_{(i)})\big)y_{(i)}\big\|^2$$

$$\overset{(5)}{\leq} \max_{\|y\|=1}\sum_{i\in S} H_i^2\cdot\|h_{(i)}\|^2\cdot\|y_{(i)}\|^2$$

$$= \max_{i\in S}\big\{H_i^2\cdot\|h_{(i)}\|^2\big\} \;\leq\; \max_{i\in S}\{H_i^2\}\cdot\|h_{[S]}\|^2.$$

B. Auxiliary Results

Lemma 4 Let Assumptions 1, 2, 3, 4 hold. Let a minimum of problem (1) exist, and let $x^* \in Q$ be an arbitrary solution: $F(x^*) = F^*$. Then for the random sequence $\{F(x^k)\}_{k\geq 0}$ generated by Algorithm 1 we have the bound

$$\mathbb{E}\big[F(x^{k+1})\,\big|\,x^k\big] \;\leq\; F(x^k) - \frac{\alpha\tau}{n}\big(F(x^k) - F^*\big) - \frac{\alpha}{2}\Big\langle\Big(\frac{\tau}{n}G - \alpha\,\mathbb{E}\big[A_{[S_k]}\big]\Big)(x^k - x^*),\; x^k - x^*\Big\rangle + \frac{\alpha^3\tau}{n}\cdot\frac{H_F}{2}\cdot\|x^k - x^*\|^3, \qquad (19)$$

for all $\alpha \in [0, 1]$.

Proof: From Assumption 1 we have the following bound, with $G \succeq 0$:

$$\langle\nabla g(x), z\rangle \;\leq\; g(x+z) - g(x) - \frac{1}{2}\langle Gz, z\rangle, \qquad x, z \in \mathbb{R}^N. \qquad (20)$$

Denote $h^k \equiv T_{H_k,S_k}(x^k)$. Then $x^{k+1} = x^k + h^k$. By the way of choosing $H_k$ we have:

$$F(x^{k+1}) \;\leq\; M^*_{H_k,S_k}(x^k) \;\equiv\; F(x^k) + \big\langle(\nabla g(x^k))_{[S_k]}, h^k\big\rangle + \frac{1}{2}\big\langle A_{[S_k]}h^k, h^k\big\rangle + \sum_{i\in S_k}\Big(\big\langle\nabla\phi_i(x^k_{(i)}), h^k_{(i)}\big\rangle + \frac{1}{2}\big\langle\nabla^2\phi_i(x^k_{(i)})h^k_{(i)}, h^k_{(i)}\big\rangle\Big) + \frac{H_k}{6}\big\|h^k_{[S_k]}\big\|^3 + \sum_{i\in S_k}\Big(\psi_i(x^k_{(i)} + h^k_{(i)}) - \psi_i(x^k_{(i)})\Big).$$

At the same time, because $h^k$ is a minimizer of the cubic model $M_{H_k,S_k}(x^k; y)$, we can replace $h^k$ in the previous bound by an arbitrary $y \in \mathbb{R}^N$ such that $x^k + y \in Q$:

$$F(x^{k+1}) \;\leq\; F(x^k) + \big\langle(\nabla g(x^k))_{[S_k]}, y\big\rangle + \frac{1}{2}\big\langle A_{[S_k]}y, y\big\rangle + \sum_{i\in S_k}\Big(\big\langle\nabla\phi_i(x^k_{(i)}), y_{(i)}\big\rangle + \frac{1}{2}\big\langle\nabla^2\phi_i(x^k_{(i)})y_{(i)}, y_{(i)}\big\rangle\Big) + \frac{H_k}{6}\|y_{[S_k]}\|^3 + \sum_{i\in S_k}\Big(\psi_i(x^k_{(i)} + y_{(i)}) - \psi_i(x^k_{(i)})\Big)$$

$$\leq\; F(x^k) + \big\langle(\nabla g(x^k))_{[S_k]}, y\big\rangle + \frac{1}{2}\big\langle A_{[S_k]}y, y\big\rangle + \sum_{i\in S_k}\Big(\phi_i(x^k_{(i)} + y_{(i)}) - \phi_i(x^k_{(i)})\Big) + \frac{H_k + H_F}{6}\|y_{[S_k]}\|^3 + \sum_{i\in S_k}\Big(\psi_i(x^k_{(i)} + y_{(i)}) - \psi_i(x^k_{(i)})\Big), \qquad (21)$$

where the last inequality is given by Lemma 1. Note that

$$\mathbb{E}\big[\|y_{[S_k]}\|^3\big] \;\leq\; \mathbb{E}\big[\|y\|\cdot\|y_{[S_k]}\|^2\big] \;=\; \|y\|\cdot\mathbb{E}\Big[\sum_{i\in S_k}\|y_{(i)}\|^2\Big] \;=\; \|y\|\cdot\frac{\tau}{n}\sum_{i=1}^n\|y_{(i)}\|^2 \;=\; \frac{\tau}{n}\|y\|^3.$$

Thus, by taking the conditional expectation $\mathbb{E}[\,\cdot \mid x^k]$ of (21) we have

$$\mathbb{E}\big[F(x^{k+1})\,\big|\,x^k\big] \;\leq\; F(x^k) + \frac{\tau}{n}\langle\nabla g(x^k), y\rangle + \frac{1}{2}\big\langle\mathbb{E}\big[A_{[S_k]}\big]y, y\big\rangle + \frac{\tau}{n}\sum_{i=1}^n\Big(\phi_i(x^k_{(i)} + y_{(i)}) + \psi_i(x^k_{(i)} + y_{(i)}) - \phi_i(x^k_{(i)}) - \psi_i(x^k_{(i)})\Big) + \frac{\tau}{n}\cdot\frac{H_F}{2}\|y\|^3,$$

which is valid for an arbitrary $y \in \mathbb{R}^N$ such that $x^k + y \in Q$. Restricting $y$ to the segment $y = \alpha(x^* - x^k)$, where $\alpha \in [0, 1]$, by convexity of the $\phi_i$ and $\psi_i$ we get:

$$\mathbb{E}\big[F(x^{k+1})\,\big|\,x^k\big] \;\leq\; F(x^k) + \frac{\alpha\tau}{n}\langle\nabla g(x^k), x^* - x^k\rangle + \frac{\alpha^2}{2}\big\langle\mathbb{E}\big[A_{[S_k]}\big](x^* - x^k), x^* - x^k\big\rangle + \frac{\alpha\tau}{n}\sum_{i=1}^n\Big(\phi_i(x^*_{(i)}) + \psi_i(x^*_{(i)}) - \phi_i(x^k_{(i)}) - \psi_i(x^k_{(i)})\Big) + \frac{\alpha^3\tau}{n}\cdot\frac{H_F}{2}\cdot\|x^* - x^k\|^3.$$

Finally, using inequality (20) with $x \equiv x^k$ and $z \equiv x^* - x^k$ we get the statement of the lemma.

The following technical tool is useful for analyzing the convergence of the random sequence $\xi_k \equiv F(x^k) - F^*$ with high probability. It is a generalization of a result from (Richtárik & Takáč, 2014).

Lemma 5 Let $\xi_0 > 0$ be a constant and consider a nonnegative nonincreasing sequence of random variables $\{\xi_k\}_{k\geq 0}$ with the following property for all $k \geq 0$:

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \xi_k - \frac{\xi_k^p}{c}, \qquad (22)$$

where $c > 0$ is a constant and $p \in \{3/2, 2\}$. Choose a confidence level $\rho \in (0,1)$. Then if we set $0 < \varepsilon < \min\{\xi_0, c^{1/(p-1)}\}$ and

$$K \;\geq\; \frac{c}{\varepsilon^{p-1}}\Big(\frac{1}{p-1} + \log\frac{1}{\rho}\Big) - \frac{c}{\xi_0^{p-1}(p-1)}, \qquad (23)$$

we have

$$\mathbb{P}(\xi_K \leq \varepsilon) \;\geq\; 1 - \rho. \qquad (24)$$

Proof: The technique of the proof is similar to that of Theorem 1 in (Richtárik & Takáč, 2014). For a fixed $0 < \varepsilon < \min\{\xi_0, c^{1/(p-1)}\}$ define a new sequence of random variables $\{\xi^\varepsilon_k\}_{k\geq 0}$ in the following way:

$$\xi^\varepsilon_k = \begin{cases}\xi_k, & \text{if } \xi_k > \varepsilon,\\ 0, & \text{otherwise.}\end{cases}$$

It satisfies

$$\xi^\varepsilon_k \leq \varepsilon \;\Leftrightarrow\; \xi_k \leq \varepsilon, \qquad k \geq 0,$$

therefore, by the Markov inequality,

$$\mathbb{P}(\xi_k > \varepsilon) = \mathbb{P}(\xi^\varepsilon_k > \varepsilon) \leq \frac{\mathbb{E}[\xi^\varepsilon_k]}{\varepsilon},$$

and hence it suffices to show that $\theta_K \leq \rho\varepsilon$, where $\theta_k \stackrel{\text{def}}{=} \mathbb{E}[\xi^\varepsilon_k]$. From the conditions of the lemma we get

$$\mathbb{E}\big[\xi^\varepsilon_{k+1}\,\big|\,\xi^\varepsilon_k\big] \leq \xi^\varepsilon_k - \frac{(\xi^\varepsilon_k)^p}{c}, \qquad \mathbb{E}\big[\xi^\varepsilon_{k+1}\,\big|\,\xi^\varepsilon_k\big] \leq \Big(1 - \frac{\varepsilon^{p-1}}{c}\Big)\xi^\varepsilon_k, \qquad k \geq 0,$$

and by taking expectations and using the convexity of $t \mapsto t^p$ for $p > 1$ we obtain

$$\theta_{k+1} \leq \theta_k - \frac{\theta_k^p}{c}, \qquad k \geq 0, \qquad (25)$$
$$\theta_{k+1} \leq \Big(1 - \frac{\varepsilon^{p-1}}{c}\Big)\theta_k, \qquad k \geq 0. \qquad (26)$$

Consider now the two cases $p = 2$ and $p = 3/2$, and find a number $k_1$ for which $\theta_{k_1} \leq \varepsilon$.

1. $p = 2$: then

$$\frac{1}{\theta_{k+1}} - \frac{1}{\theta_k} = \frac{\theta_k - \theta_{k+1}}{\theta_{k+1}\theta_k} \geq \frac{\theta_k - \theta_{k+1}}{\theta_k^2} \overset{(25)}{\geq} \frac{1}{c},$$

thus we have $\frac{1}{\theta_k} \geq \frac{1}{\theta_0} + \frac{k}{c} = \frac{1}{\xi_0} + \frac{k}{c}$, and choosing $k_1 \geq \frac{c}{\varepsilon} - \frac{c}{\xi_0}$ we obtain $\theta_{k_1} \leq \varepsilon$.

2. $p = 3/2$: then

$$\frac{1}{\theta_{k+1}^{1/2}} - \frac{1}{\theta_k^{1/2}} = \frac{\theta_k^{1/2} - \theta_{k+1}^{1/2}}{\theta_{k+1}^{1/2}\theta_k^{1/2}} = \frac{\theta_k - \theta_{k+1}}{(\theta_k^{1/2} + \theta_{k+1}^{1/2})\,\theta_{k+1}^{1/2}\theta_k^{1/2}} \geq \frac{\theta_k - \theta_{k+1}}{2\theta_k^{3/2}} \overset{(25)}{\geq} \frac{1}{2c},$$

thus we have $\frac{1}{\theta_k^{1/2}} \geq \frac{1}{\theta_0^{1/2}} + \frac{k}{2c}$, and choosing $k_1 \geq \frac{2c}{\varepsilon^{1/2}} - \frac{2c}{\xi_0^{1/2}}$ we get $\theta_{k_1} \leq \varepsilon$.

Therefore, for both cases $p \in \{3/2, 2\}$ it is enough to choose

$$k_1 \;\geq\; \frac{c}{p-1}\Big(\frac{1}{\varepsilon^{p-1}} - \frac{1}{\xi_0^{p-1}}\Big),$$

for which we get $\theta_{k_1} \leq \varepsilon$. Finally, letting $k_2 \geq \frac{c}{\varepsilon^{p-1}}\log\frac{1}{\rho}$ and $K \geq k_1 + k_2$, we have

$$\theta_K \leq \theta_{k_1+k_2} \overset{(26)}{\leq} \Big(1 - \frac{\varepsilon^{p-1}}{c}\Big)^{k_2}\theta_{k_1} \leq \exp\Big(-\frac{k_2\,\varepsilon^{p-1}}{c}\Big)\theta_{k_1} \leq \rho\varepsilon.$$

C. Proof of Theorem 1

Proof: From bound (19), using $G \succeq 0$ and $\mathbb{E}\big[A_{[S_k]}\big] \preceq \frac{\tau L}{n}I$, we get, for all $\alpha \in [0, 1]$:

$$\mathbb{E}\big[F(x^{k+1})\,\big|\,x^k\big] \;\leq\; F(x^k) - \frac{\alpha\tau}{n}\big(F(x^k) - F^*\big) + \frac{\alpha^2\tau}{2n}\Big(L\|x^k - x^*\|^2 + \alpha H_F\|x^k - x^*\|^3\Big).$$

Thus, for the random sequence $\xi_k \equiv F(x^k) - F^*$ we have the bound, for all $\alpha \in [0, 1]$:

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \xi_k - \frac{\alpha\tau\xi_k}{n} + \frac{\alpha^2\tau}{2n}D_0, \qquad (27)$$

where $D_0 \equiv \max\{LD^2 + H_F D^3, \xi_0\}$. The minimum of the right-hand side is attained at $\alpha^* = \frac{\xi_k}{D_0} \leq 1$, which substituted into (27) gives $\mathbb{E}[\xi_{k+1}\,|\,\xi_k] \leq \xi_k - \frac{\tau\xi_k^2}{2nD_0}$. Applying Lemma 5 completes the proof.

D. Proof of Theorem 2

Proof: From bound (19), restricting $\alpha$ to the segment $[0, \sigma]$ and using (12), we get:

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \xi_k - \frac{\alpha\tau\xi_k}{n} + \frac{\alpha^3\tau}{2n}H_F\|x^k - x^*\|^3, \qquad (28)$$

where $\xi_k \equiv F(x^k) - F^*$ as before. To get the first complexity bound, we further bound the right-hand side, denoting $D_0 \equiv \max\{H_F D^3, \xi_0\}$:

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \xi_k - \frac{\alpha\tau\xi_k}{n} + \frac{\alpha^3\tau}{2n}\cdot\frac{D_0}{\sigma^2},$$

the minimum of which is attained at $\alpha^* = \sigma\sqrt{\frac{2}{3}\cdot\frac{\xi_k}{D_0}} \leq \sigma$. Therefore we obtain

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \xi_k - \Big(\frac{2}{3}\Big)^{3/2}\frac{\tau\sigma\,\xi_k^{3/2}}{n\sqrt{D_0}},$$

and applying Lemma 5 completes the proof.

E. Proof of Theorem 3

Proof: Because of strong convexity, we know that $\beta > 0$ and all the conditions of Theorem 2 are satisfied. Using the inequality $F(x^k) - F^* \geq \frac{\mu}{2}\|x^k - x^*\|^2$ in (28), we have for every $\alpha \in [0, \sigma]$:

$$\mathbb{E}\big[\xi_{k+1}\,\big|\,\xi_k\big] \;\leq\; \Big(1 - \frac{\alpha\tau}{n} + \frac{\alpha^3\tau}{n}\cdot\frac{H_F D}{\mu}\Big)\xi_k.$$

The minimum of the right-hand side is attained at $\alpha^* = \sigma\min\big\{\sqrt{\frac{\mu}{3H_F D}}, 1\big\}$, substituting which and taking the total expectation gives the recurrence

$$\mathbb{E}\big[\xi_{k+1}\big] \;\leq\; \Big(1 - \frac{2\tau\sigma}{3n}\min\Big\{\sqrt{\frac{\mu}{H_F D}}, 1\Big\}\Big)\mathbb{E}\big[\xi_k\big] \;\leq\; \dots \;\leq\; \Big(1 - \frac{2\tau\sigma}{3n}\min\Big\{\sqrt{\frac{\mu}{H_F D}}, 1\Big\}\Big)^{k+1}\xi_0 \;\leq\; \exp\Big(-(k+1)\,\frac{2\tau\sigma}{3n}\min\Big\{\sqrt{\frac{\mu}{H_F D}}, 1\Big\}\Big)\xi_0.$$

Thus, choosing $K$ as in (14), by the Markov inequality we have $\mathbb{P}(\xi_K > \varepsilon) \leq \mathbb{E}[\xi_K]\,\varepsilon^{-1} \leq \rho$.

F. Proof of Lemma 3

Using the notation from the statement of the lemma and multiplying everything by $m$, we can formulate our target optimization subproblem as follows:

$$\min_{x\in\mathbb{R}^{|S|},\,h\in\mathbb{R}^m,\; h = \hat B x}\; m\lambda\langle b_1, x\rangle + \frac{m\lambda}{2}\langle\hat A x, x\rangle + \langle b_2, h\rangle + \frac{1}{2}\langle Dh, h\rangle + \frac{H}{6}\|h\|^3, \qquad (29)$$

where $D \in \mathbb{R}^{m\times m}$ is the diagonal matrix $D \equiv \mathrm{diag}(\phi_i''(\alpha_i))$. We also denote the relevant subvector of $y$ by $x$ to avoid confusion. The minimum of (29) satisfies the following KKT system:

$$\begin{cases} m\lambda b_1 + m\lambda\hat A x + \hat B^T\mu = 0,\\ b_2 + Dh + \frac{H}{2}\|h\|h - \mu = 0,\\ h = \hat B x,\end{cases} \qquad (30)$$

where $\mu \in \mathbb{R}^m$ is a vector of slack variables. From the second and third equations we get

$$\mu = b_2 + D\hat B x + \frac{H}{2}\|\hat B x\|\,\hat B x,$$

plugging which into the first one gives

$$\underbrace{\Big(m\lambda\hat A + \hat B^T\Big(D + \frac{H}{2}\|\hat B x\|\,I\Big)\hat B\Big)}_{\equiv\, Z(\|\hat B x\|)}\, x \;=\; -\underbrace{\big(m\lambda b_1 + \hat B^T b_2\big)}_{\equiv\, b}.$$

Thus, if we have a solution $\tau^* \geq 0$ of the one-dimensional equation $\tau^* = \|\hat B(Z(\tau^*))^\dagger b\|$, then we can set

$$x^* := -(Z(\tau^*))^\dagger b, \qquad h^* := \hat B x^*, \qquad \mu^* := b_2 + D\hat B x^* + \frac{H}{2}\|\hat B x^*\|\,\hat B x^*.$$

It is easy to check that $(x^*, h^*, \mu^*)$ is a solution of (30) and therefore of (29) as well.
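The following sketch (ours; the toy data and all names are illustrative) implements this recipe numerically: it brackets and solves the scalar equation $\tau = \|\hat B(Z(\tau))^{-1}b\|$, recovers $x^*$ and $h^*$, and verifies the KKT system (30). It assumes $Z(0)$ is invertible (e.g., $\hat A \succ 0$ and $\lambda > 0$).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
m, s = 8, 3                                   # m samples, |S| = 3 selected coordinates
lam, H = 0.1, 1.0
B_hat = rng.standard_normal((m, s))
A_hat = np.eye(s)                             # submatrix of A; identity for an l2-type regularizer
b1 = rng.standard_normal(s)                   # subvector of grad g(w)
b2 = rng.standard_normal(m)                   # (phi_i'(alpha_i))_{i=1}^m
D = np.diag(rng.random(m))                    # diag(phi_i''(alpha_i)) >= 0

b = m * lam * b1 + B_hat.T @ b2
Z = lambda t: m * lam * A_hat + B_hat.T @ (D + 0.5 * H * t * np.eye(m)) @ B_hat

def resid(t):                                 # t - ||B_hat Z(t)^{-1} b|| changes sign on [0, t_max]
    return t - np.linalg.norm(B_hat @ np.linalg.solve(Z(t), b))

t_max = np.linalg.norm(B_hat @ np.linalg.solve(Z(0.0), b)) + 1.0
t_star = brentq(resid, 0.0, t_max)

x_star = -np.linalg.solve(Z(t_star), b)
h_star = B_hat @ x_star
mu_star = b2 + D @ h_star + 0.5 * H * np.linalg.norm(h_star) * h_star

# Check the KKT system (30): both residuals should be (numerically) zero.
print(np.linalg.norm(m * lam * b1 + m * lam * A_hat @ x_star + B_hat.T @ mu_star))
print(np.linalg.norm(b2 + D @ h_star + 0.5 * H * np.linalg.norm(h_star) * h_star - mu_star))
```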

G. Lipschitz Constant of the Hessian of Logistic Loss

Proposition 1 The loss function for logistic regression, $\phi(t) := \log(1 + \exp(t))$, has Lipschitz-continuous Hessian with constant $H_\phi = 1/(6\sqrt{3})$. Thus it holds, for all $t, s \in \mathbb{R}$, that

$$|\phi''(t) - \phi''(s)| \;\leq\; H_\phi|t - s|. \qquad (31)$$

Proof:

To prove (31) it is enough to show that $|\phi'''(t)| \leq H_\phi$ for all $t \in \mathbb{R}$. Direct calculations give:

$$\phi'(t) = \frac{1}{1 + \exp(-t)}, \qquad \phi''(t) = \phi'(t)\cdot\big(1 - \phi'(t)\big), \qquad \phi'''(t) = \phi''(t)\cdot\big(1 - 2\phi'(t)\big).$$

Let us find the extreme values of the function $g(t) := \phi'''(t)$, for which we have $\lim_{t\to-\infty} g(t) = \lim_{t\to+\infty} g(t) = 0$. The stationary points of $g(t)$ are solutions of the equation

$$g'(t^*) = \phi^{(4)}(t^*) = \phi''(t^*)\cdot\Big[\big(1 - 2\phi'(t^*)\big)^2 - 2\phi'(t^*)\cdot\big(1 - \phi'(t^*)\big)\Big] = 0,$$

which consequently must satisfy $\phi'(t^*) = \frac{1}{2} \pm \frac{1}{\sqrt{12}}$, and therefore

$$g(t^*) = \phi'''(t^*) = \Big(\frac{1}{2} + \frac{1}{\sqrt{12}}\Big)\cdot\Big(\frac{1}{2} - \frac{1}{\sqrt{12}}\Big)\cdot\Big(\pm\frac{1}{\sqrt{3}}\Big) = \pm\frac{1}{6\sqrt{3}},$$

from which we get $|\phi'''(t)| \leq 1/(6\sqrt{3})$.
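A quick numerical sanity check of this bound (ours): evaluate $|\phi'''(t)|$ on a fine grid and compare it with $1/(6\sqrt{3})$.

```python
import numpy as np

def sigma(t):                               # phi'(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-20, 20, 200001)
s = sigma(t)
phi3 = s * (1.0 - s) * (1.0 - 2.0 * s)      # phi'''(t) = phi''(t) (1 - 2 phi'(t))
bound = 1.0 / (6.0 * np.sqrt(3.0))

print(np.max(np.abs(phi3)), bound)          # the maximum approaches 1/(6*sqrt(3)) ~ 0.09623
assert np.max(np.abs(phi3)) <= bound + 1e-12
```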