Randomized Block Cubic Newton Method
Nikita Doikov 1   Peter Richtárik 2 3 4
1 National Research University Higher School of Economics, Moscow, Russia. 2 King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. 3 University of Edinburgh, Edinburgh, United Kingdom. 4 Moscow Institute of Physics and Technology, Dolgoprudny, Russia. Correspondence to: Nikita Doikov.

arXiv:1802.04084v2 [math.OC] 7 Aug 2018

Abstract

We study the problem of minimizing the sum of three convex functions (a differentiable, a twice-differentiable, and a nonsmooth term) in a high-dimensional setting. To this effect we propose and analyze a randomized block cubic Newton (RBCN) method, which in each iteration builds a model of the objective function formed as the sum of the natural models of its three components: a linear model with a quadratic regularizer for the differentiable term, a quadratic model with a cubic regularizer for the twice-differentiable term, and a perfect (proximal) model for the nonsmooth term. In each iteration our method minimizes the model over a random subset of blocks of the search variable. RBCN is the first algorithm with these properties, generalizing several existing methods and matching the best known bounds in all special cases. We establish O(1/ε), O(1/√ε) and O(log(1/ε)) rates under different assumptions on the component functions. Lastly, we show numerically that our method outperforms the state of the art on a variety of machine learning problems, including cubically regularized least-squares, logistic regression with constraints, and Poisson regression.

1. Introduction

In this paper we develop an efficient randomized algorithm for solving an optimization problem of the form

    min_{x ∈ Q} F(x) := g(x) + φ(x) + ψ(x),    (1)

where Q ⊆ R^N is a closed convex set, and g, φ and ψ are convex functions with different smoothness and structural properties. Our aim is to capitalize on these different properties in the design of our algorithm. We assume that g has Lipschitz gradient, φ has Lipschitz Hessian, while ψ is allowed to be nonsmooth, albeit "simple".

1.1. Block Structure

Moreover, we assume that the N coordinates of x are partitioned into n blocks of sizes N_1, ..., N_n, with Σ_i N_i = N, and then write x = (x_(1), ..., x_(n)), where x_(i) ∈ R^{N_i}. This block structure is typically dictated by the particular application considered. Once the block structure is fixed, we further assume that φ and ψ are block separable. That is, φ(x) = Σ_{i=1}^n φ_i(x_(i)) and ψ(x) = Σ_{i=1}^n ψ_i(x_(i)), where the φ_i are twice differentiable with Lipschitz Hessians, and the ψ_i are closed convex (and possibly nonsmooth) functions.

Revealing this block structure, problem (1) takes the form

    min_{x ∈ Q} F(x) := g(x) + Σ_{i=1}^n φ_i(x_(i)) + Σ_{i=1}^n ψ_i(x_(i)).    (2)

We are specifically interested in the case when n is big, in which case it makes sense to update only a small number of the blocks in each iteration.

1.2. Related Work

There has been a substantial and growing volume of research related to second-order and block-coordinate optimization. In this part we briefly mention some of the papers most relevant to the present work.

A major leap in second-order optimization theory was made when the cubic Newton method was proposed by Griewank (1981) and independently rediscovered by Nesterov & Polyak (2006), who also provided global complexity guarantees.

Cubic regularization was equipped with acceleration by Nesterov (2008), with adaptive stepsizes by Cartis et al. (2011a;b), and extended to a universal framework by Grapiglia & Nesterov (2017). The universal schemes can automatically adjust to the implicit smoothness level of the objective. Cubically regularized second-order schemes for solving systems of nonlinear equations were developed by Nesterov
(2007), and randomized variants for stochastic optimization were considered by Tripuraneni et al. (2017); Ghadimi et al. (2017); Kohler & Lucchi (2017); Cartis & Scheinberg (2018).

Despite their attractive global iteration complexity guarantees, the weakness of second-order methods in general, and of the cubic Newton method in particular, is their high computational cost per iteration. This issue remains the subject of active research. For successful theoretical results related to the approximation of the cubic step we refer to (Agarwal et al., 2016) and (Carmon & Duchi, 2016).

At the same time, there are many successful attempts to use block coordinate randomization to accelerate first-order (Tseng & Yun, 2009; Richtárik & Takáč, 2014; 2016) and second-order (Qu et al., 2016; Mutný & Richtárik, 2018) methods.

In this work we address the issue of combining block-coordinate randomization with cubic regularization, to get a second-order method with proven global complexity guarantees and a low cost per iteration.

A powerful advance in convex optimization theory was the advent of composite or proximal first-order methods (see (Nesterov, 2013) as a modern reference). This technique has become available as an algorithmic tool in the block coordinate setting as well (Richtárik & Takáč, 2014; Qu et al., 2016). Our aim in this work is the development of a composite cubically regularized second-order method.

1.3. Contributions

We propose a new randomized second-order proximal algorithm for solving convex optimization problems of the form (2). Our method, Randomized Block Cubic Newton (RBCN) (see Algorithm 1), treats the three functions appearing in (1) differently, according to their nature.

Our method is a randomized block method because in each iteration we update only a random subset of the n blocks. This facilitates faster convergence, and is suited to problems where n is very large. Our method is proximal because we keep the functions ψ_i in our model, which is minimized in each iteration, without any approximation. Our method is a cubic Newton method because we approximate each φ_i using a cubically regularized second-order model.

We are not aware of any method that can solve (2) using the most appropriate models of the three functions (quadratic with a constant Hessian for g, cubically regularized quadratic for φ, and no model for ψ), not even in the case n = 1.

Our approach generalizes several existing results:

• In the case when n = 1, g = 0 and ψ = 0, RBCN reduces to the cubically regularized Newton method of Nesterov & Polyak (2006). Even when n = 1, RBCN can be seen as an extension of this method to composite optimization. For n > 1, RBCN provides an extension of the algorithm in Nesterov & Polyak (2006) to the randomized block coordinate setting, popular for high-dimensional problems.

• In the special case when φ = 0 and N_i = 1 for all i, RBCN specializes to the stochastic Newton (SN) method of Qu et al. (2016). Applied to the empirical risk minimization problem (see Section 7), our method has a dual interpretation (see Algorithm 2). In this case, our method reduces to the stochastic dual Newton ascent (SDNA) method, also described in (Qu et al., 2016). Hence, RBCN can be seen as an extension of SN and SDNA to blocks of arbitrary sizes, and to the inclusion of the twice-differentiable term φ.

• In the case when φ = 0 and the simplest upper approximation of g is assumed, 0 ⪯ ∇²g(x) ⪯ LI, the composite block coordinate gradient method of Tseng & Yun (2009) can be applied to solve (1). Our method extends this in two directions: we add the twice-differentiable terms φ_i, and we use a tighter model for g which exploits all global curvature information (if available).

We prove high-probability global convergence guarantees under several regimes, summarized next:

• Under no additional assumptions on g, φ and ψ beyond convexity (and either boundedness of Q, or boundedness of the level sets of F on Q), we prove the rate

    O( n / (τε) ),

where τ is the mini-batch size (see Theorem 1).

• Under certain conditions combining the properties of g with the way the random blocks are sampled, formalized by the assumption β > 0 (see (12) for the definition of β), we obtain the rate

    O( n / (τ max{1, β} √ε) )

(see Theorem 2). In the special case when n = 1, we necessarily have τ = 1 and β = µ/L (the reciprocal of the condition number of g), and we get the rate O(L/(µ√ε)). If g is quadratic and τ = n, then β = 1 and the resulting complexity O(1/√ε) recovers the rate of the cubic Newton method established by Nesterov & Polyak (2006).

• Finally, if g is strongly convex, the above result can be improved (see Theorem 3) to

    O( (n / (τ max{1, β})) log(1/ε) ).
1.4. Contents

The rest of the paper is organized as follows. In Section 2 we introduce the notation and elementary identities needed to efficiently handle the block structure of our model. In Section 3 we make the various smoothness and convexity assumptions on g and φ_i formal. Section 4 is devoted to the description of the block sampling process used in our method, along with some useful identities. In Section 5 we describe formally our randomized block cubic Newton (RBCN) method. Section 6 is devoted to the statement and description of our main convergence results, summarized in the introduction. Missing proofs are provided in the supplementary material. In Section 7 we show how to apply our method to the empirical risk minimization problem. Applying RBCN to its dual leads to Algorithm 2. Finally, our numerical experiments on synthetic and real datasets are described in Section 8.

2. Block Structure

To model a block structure, we decompose the space R^N into n subspaces in the following standard way. Let U ∈ R^{N×N} be a column permutation of the N×N identity matrix I, and let a decomposition U = [U_1, U_2, ..., U_n] be given, where U_i ∈ R^{N×N_i} are n submatrices, N = Σ_{i=1}^n N_i. Subsequently, any vector x ∈ R^N can be uniquely represented as x = Σ_{i=1}^n U_i x_(i), where x_(i) := U_i^T x ∈ R^{N_i}.

In what follows we use the standard Euclidean inner product ⟨x, y⟩ := Σ_i x_i y_i, the Euclidean norm of a vector ‖x‖ := √⟨x, x⟩, and the induced spectral norm of a matrix ‖A‖ := max_{‖x‖=1} ‖Ax‖. Using the block decomposition, for two vectors x, y ∈ R^N we have

    ⟨x, y⟩ = ⟨Σ_{i=1}^n U_i x_(i), Σ_{j=1}^n U_j y_(j)⟩ = Σ_{i=1}^n ⟨x_(i), y_(i)⟩.

For a given nonempty subset S of [n] := {1, ..., n} and for any vector x ∈ R^N, we denote by x_[S] ∈ R^N the vector obtained from x by retaining only the blocks x_(i) for which i ∈ S and zeroing all others:

    x_[S] := Σ_{i∈S} U_i x_(i) = Σ_{i∈S} U_i U_i^T x.

Furthermore, for any matrix A ∈ R^{N×N} we write A_[S] ∈ R^{N×N} for the matrix obtained from A by retaining only the elements whose indices are both in coordinate blocks from S; formally,

    A_[S] := (Σ_{i∈S} U_i U_i^T) A (Σ_{i∈S} U_i U_i^T).

Note that these definitions imply that

    ⟨A_[S] x, y⟩ = ⟨A x_[S], y_[S]⟩,   x, y ∈ R^N.

Next, we define the block-diagonal operator, which, up to a permutation of coordinates, retains the diagonal blocks and nullifies the off-diagonal blocks:

    blockdiag(A) := Σ_{i=1}^n U_i U_i^T A U_i U_i^T = Σ_{i=1}^n A_[{i}].

Finally, denote R^N_[S] := { x_[S] : x ∈ R^N }. This is a linear subspace of R^N composed of vectors which are zero in the blocks i ∉ S.

3. Assumptions

In this section we formulate our main assumptions about the differentiable components of (2) and provide some examples to illustrate the concepts.

We assume that g : R^N → R is a differentiable function and that all φ_i : R^{N_i} → R, i ∈ [n], are twice differentiable. Thus, at any point x ∈ R^N we should be able to compute all the gradients {∇g(x), ∇φ_1(x_(1)), ..., ∇φ_n(x_(n))} and the Hessians {∇²φ_1(x_(1)), ..., ∇²φ_n(x_(n))}, or at least their actions on an arbitrary vector h of appropriate dimension.

Next, we formalize our assumptions about convexity and the level of smoothness. Speaking informally, g is similar to a quadratic, and the functions φ_i are arbitrary twice-differentiable and smooth.

Assumption 1 (Convexity). There is a positive semidefinite matrix G ⪰ 0 such that for all x, h ∈ R^N:

    g(x + h) ≥ g(x) + ⟨∇g(x), h⟩ + (1/2)⟨Gh, h⟩,    (3)

and

    φ_i(x_(i) + h_(i)) ≥ φ_i(x_(i)) + ⟨∇φ_i(x_(i)), h_(i)⟩,   i ∈ [n].

In the special case when G = 0, (3) postulates convexity. For positive definite G, the objective is strongly convex with strong convexity parameter µ := λ_min(G) > 0.

Note that for all φ_i we only require convexity. However, if we happen to know that some φ_i is strongly convex (λ_min(∇²φ_i(y)) ≥ µ_i > 0 for all y ∈ R^{N_i}), we can move this strong convexity to g by subtracting (µ_i/2)‖x_(i)‖² from φ_i and adding it to g. This extra knowledge may in some particular cases improve the convergence guarantees for our algorithm, but does not change the actual computations.

Assumption 2 (Smoothness of g). There is a positive semidefinite matrix A ⪰ 0 such that for all x, h ∈ R^N:

    g(x + h) ≤ g(x) + ⟨∇g(x), h⟩ + (1/2)⟨Ah, h⟩.    (4)
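The Section 2 constructions are easy to sanity-check numerically. Below is a minimal sketch (our own code, not the authors'; it assumes contiguous blocks, i.e. U is the identity permutation, and all function names are ours) verifying the identity ⟨A_[S]x, y⟩ = ⟨Ax_[S], y_[S]⟩ and the representation of blockdiag(A) as a sum of the A_[{i}]:

```python
import numpy as np

# Sketch (ours) of the Section 2 block machinery, assuming contiguous
# blocks so that U is the identity permutation.

def block_slices(sizes):
    """One slice per block, for block sizes N_1, ..., N_n."""
    off = np.cumsum([0] + list(sizes))
    return [slice(off[i], off[i + 1]) for i in range(len(sizes))]

def projector(slices, S, N):
    """Matrix of the map x -> x_[S] (zero out blocks outside S)."""
    P = np.zeros((N, N))
    for i in S:
        idx = np.arange(slices[i].start, slices[i].stop)
        P[idx, idx] = 1.0
    return P

def blockdiag(A, slices):
    """Keep the diagonal blocks of A, zero the off-diagonal blocks."""
    B = np.zeros_like(A)
    for s in slices:
        B[s, s] = A[s, s]
    return B

rng = np.random.default_rng(0)
sizes, S = [2, 3, 1], {0, 2}                 # n = 3 blocks, N = 6
slices = block_slices(sizes)
P = projector(slices, S, 6)
A = rng.standard_normal((6, 6))
x, y = rng.standard_normal(6), rng.standard_normal(6)

A_S = P @ A @ P                              # A_[S]
# The identity <A_[S] x, y> = <A x_[S], y_[S]>:
assert np.isclose((A_S @ x) @ y, (A @ (P @ x)) @ (P @ y))
# blockdiag(A) equals the sum of the A_[{i}]:
singles = sum(projector(slices, {i}, 6) @ A @ projector(slices, {i}, 6)
              for i in range(3))
assert np.allclose(blockdiag(A, slices), singles)
```

Both assertions hold for any A, since the projector P = Σ_{i∈S} U_i U_i^T is symmetric and idempotent.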
The main example of g is a quadratic function g(x) = (1/2)⟨Mx, x⟩ with a symmetric positive semidefinite M ∈ R^{N×N}, for which both (3) and (4) hold with G = A = M. Of course, any convex g with Lipschitz-continuous gradient with constant L ≥ 0 satisfies (3) and (4) with G = 0 and A = LI (Nesterov, 2004).

Assumption 3 (Smoothness of φ_i). For every i ∈ [n] there is a nonnegative constant H_i ≥ 0 such that the Hessian of φ_i is Lipschitz-continuous:

    ‖∇²φ_i(x + h) − ∇²φ_i(x)‖ ≤ H_i ‖h‖,    (5)

for all x, h ∈ R^{N_i}.

Examples of functions which satisfy (5) with a known Lipschitz constant of the Hessian H are the quadratic φ(t) = ‖Ct − t_0‖² (H = 0 for all parameters), the cubed norm φ(t) = (1/3)‖t − t_0‖³ (H = 2, see Lemma 5 in (Nesterov, 2008)), and the logistic regression loss φ(t) = log(1 + exp(t)) (H = 1/(6√3), see Proposition 1 in the supplementary material).

For a fixed set of indices S ⊂ [n], denote

    φ_S(x) := Σ_{i∈S} φ_i(x_(i)),   x ∈ R^N.

Then we have:

    ⟨∇φ_S(x), h⟩ = Σ_{i∈S} ⟨∇φ_i(x_(i)), h_(i)⟩,   x, h ∈ R^N,
    ⟨∇²φ_S(x)h, h⟩ = Σ_{i∈S} ⟨∇²φ_i(x_(i)) h_(i), h_(i)⟩,   x, h ∈ R^N.

Lemma 1. If Assumption 3 holds, then for all x, h ∈ R^N we have the following second-order approximation bound:

    φ_S(x + h) − φ_S(x) − ⟨∇φ_S(x), h⟩ − (1/2)⟨∇²φ_S(x)h, h⟩ ≤ (max_{i∈S} H_i / 6) ‖h_[S]‖³.    (6)

From now on we denote H_F := max{H_1, H_2, ..., H_n}.

4. Sampling of Blocks

In this section we summarize some basic properties of the sampling Ŝ, which is a random set-valued mapping with values being subsets of [n]. For a fixed block decomposition, with each sampling Ŝ we associate a probability matrix P ∈ R^{N×N} as follows: an element of P is the probability of choosing a pair of blocks which contains the indices of this element. Denoting by E ∈ R^{N×N} the matrix of all ones, we have P = E[E_[Ŝ]]. We restrict our analysis to uniform samplings, defined next.

Assumption 4 (Uniform sampling). The sampling Ŝ is uniform, i.e., P(i ∈ Ŝ) = P(j ∈ Ŝ) = p for all i, j ∈ [n].

The above assumption means that the diagonal of P is constant: P_ii = p for all i ∈ [N]. It is easy to see that (Corollary 3.1 in (Qu & Richtárik, 2016)):

    E[A_[Ŝ]] = A ∘ P,    (7)

where ∘ denotes the Hadamard product.

Denote τ := E[|Ŝ|] = np (the expected minibatch size). The special uniform sampling defined by picking from all subsets of size τ uniformly at random is called τ-nice sampling. If Ŝ is τ-nice, then (see Lemma 4.3 in (Qu & Richtárik, 2016))

    P = (τ/n) ((1 − γ) blockdiag(E) + γ E),    (8)

where γ = (τ − 1)/(n − 1). In particular, the above results in the following:

Lemma 2. For the τ-nice sampling Ŝ, we have

    E[A_[Ŝ]] = (τ/n)(1 − (τ − 1)/(n − 1)) blockdiag(A) + (τ(τ − 1))/(n(n − 1)) A.

Proof: Combine (7) and (8).

5. Algorithm

Due to the problem structure (2), and utilizing the smoothness of the components (see (4) and (5)), for a fixed subset of indices S ⊂ [n] it is natural to consider the following model of our objective F around a point x ∈ R^N:

    M_{H,S}(x; y) := F(x) + ⟨(∇g(x))_[S], y⟩ + (1/2)⟨A_[S] y, y⟩
                   + ⟨(∇φ(x))_[S], y⟩ + (1/2)⟨(∇²φ(x))_[S] y, y⟩ + (H/6)‖y_[S]‖³
                   + Σ_{i∈S} [ ψ_i(x_(i) + y_(i)) − ψ_i(x_(i)) ].    (9)

The above model arises as a combination of a first-order model for g with global curvature information provided by the matrix A, a second-order model with cubic regularization (following (Nesterov & Polyak, 2006)) for φ, and a perfect model for the non-differentiable terms ψ_i (i.e., we keep these terms as they are).

Combining (4) and (6), for large enough H (H ≥ max_{i∈S} H_i is sufficient) we get the global upper bound

    F(x + y) ≤ M_{H,S}(x; y),   x ∈ R^N,  y ∈ R^N_[S].
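Before moving on, the closed form of Lemma 2 can be verified exhaustively for small n. The sketch below is our own, under the simplifying assumption of unit blocks (N_i = 1, so blockdiag reduces to the ordinary diagonal): it averages A_[S] over all subsets S of size τ and compares with the lemma.

```python
import itertools
import numpy as np

# Exhaustive check (ours) of Lemma 2 for tau-nice sampling with unit
# blocks: E[A_[S]] over all subsets S of size tau vs. the closed form.

n, tau = 5, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
A = A + A.T                                   # symmetric, like A in (4)

EA = np.zeros_like(A)
subsets = list(itertools.combinations(range(n), tau))
for S in subsets:
    mask = np.zeros(n, dtype=bool)
    mask[list(S)] = True
    EA += np.where(np.outer(mask, mask), A, 0.0)   # A_[S] for unit blocks
EA /= len(subsets)

gamma = (tau - 1) / (n - 1)
closed = (tau / n) * ((1 - gamma) * np.diag(np.diag(A)) + gamma * A)
assert np.allclose(EA, closed)                # Lemma 2 holds
```

On the diagonal both sides equal (τ/n) A_ii, and off the diagonal both equal τ(τ−1)/(n(n−1)) A_ij, the probability that blocks i and j are sampled together.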
Moreover, the value of every summand in M_{H,S}(x; y) depends only on the subset of blocks {y_(i) : i ∈ S}, and therefore

    T_{H,S}(x) := argmin_{y ∈ R^N_[S], x + y ∈ Q} M_{H,S}(x; y)    (10)

can be computed efficiently for small |S| and as long as Q is simple (for example, affine). Denote the minimum of the cubic model by M*_{H,S}(x) := M_{H,S}(x; T_{H,S}(x)). The RBCN method performs the update x ← x + T_{H,S}(x), and is formalized as Algorithm 1.

Algorithm 1 RBCN: Randomized Block Cubic Newton
1: Parameters: sampling distribution Ŝ
2: Initialization: choose initial point x^0 ∈ Q
3: for k = 0, 1, 2, ... do
4:   Sample S_k ∼ Ŝ
5:   Find H_k ∈ (0, 2H_F] such that F(x^k + T_{H_k,S_k}(x^k)) ≤ M*_{H_k,S_k}(x^k)
6:   Make the step x^{k+1} := x^k + T_{H_k,S_k}(x^k)
7: end for

6. Convergence Results

In this section we establish several convergence rates for Algorithm 1 under various structural assumptions: for the general class of convex problems, and for the more specific strongly convex case. We focus on the family of uniform samplings only, but generalizations to other sampling distributions are also possible.

6.1. Convex Loss

We start from the general situation where the term g(x) and all the φ_i(x_(i)) and ψ_i(x_(i)), i ∈ [n], are convex, but not necessarily strongly convex.

Denote by D the maximum distance from an optimal point x* to the initial level set:

    D := sup { ‖x − x*‖ : x ∈ Q, F(x) ≤ F(x^0) }.

Theorem 1. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, and assume the level sets are bounded: D < +∞. Choose required accuracy ε > 0 and confidence level ρ ∈ (0, 1). Then after

    K ≥ (2n)/(τε) (1 + log(1/ρ)) max{ LD² + H_F D³, F(x^0) − F* }    (11)

iterations of Algorithm 1, where L := λ_max(A), we have

    P( F(x^K) − F* ≤ ε ) ≥ 1 − ρ.

This theoretical result provides a global sublinear rate of convergence, with iteration complexity of the order O(1/ε). Note that in the case φ(x) ≡ 0 we can put H_F = 0, and Theorem 1 then restates the well-known result on the convergence of composite gradient-type block-coordinate methods (see, for example, (Richtárik & Takáč, 2014)).

6.2. Strongly Convex Loss

Here we study the case when the matrix G from the convexity assumption (3) is strictly positive definite: G ≻ 0, which means that the objective F is strongly convex with constant µ := λ_min(G) > 0.

Denote by β a condition number for the function g and the sampling distribution Ŝ: the maximum nonnegative real number such that

    β · E_{S∼Ŝ}[A_[S]] ⪯ (τ/n) G.    (12)

If (12) holds for all nonnegative β, we put β ≡ +∞ by definition.

A simple lower bound exists: β ≥ µ/L > 0, where L = λ_max(A), as in Theorem 1. However, because (12) depends not only on g but also on the sampling distribution Ŝ, it is possible that β > 0 even if µ = 0 (for example, β = 1 if P(S = [n]) = 1 and A = G ≠ 0).

The following theorems describe global iteration complexity guarantees of the order O(1/√ε) and O(log(1/ε)) for Algorithm 1 in the cases β > 0 and µ > 0, respectively, which is an improvement over the general O(1/ε).

Theorem 2. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, let the level sets be bounded: D < +∞, and assume that β, defined by (12), is greater than zero. Choose required accuracy ε > 0 and confidence level ρ ∈ (0, 1). Then after

    K ≥ (2n)/(τσ√ε) (2 + log(1/ρ)) √( max{ H_F D³, F(x^0) − F* } )    (13)

iterations of Algorithm 1, where σ := min{β, 1} > 0, we have

    P( F(x^K) − F* ≤ ε ) ≥ 1 − ρ.

Theorem 3. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, and assume that µ := λ_min(G) is strictly positive. Then after

    K ≥ (n/(τσ)) √( max{ H_F D/µ, 1 } ) log( 3(F(x^0) − F*) / (2ερ) )    (14)

iterations of Algorithm 1, we have

    P( F(x^K) − F* ≤ ε ) ≥ 1 − ρ.
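To make Algorithm 1 concrete in its simplest regime, consider φ = ψ = 0 with quadratic g and unit blocks: then H_F = 0, the cubic term vanishes, and step (10) is a linear solve on the sampled blocks (the SN/SDNA regime of Section 1.3). The toy instance below is our own illustration, not an experiment from the paper:

```python
import numpy as np

# Toy sketch (ours) of Algorithm 1 with phi = psi = 0, unit blocks, and
# quadratic g(x) = 0.5 <Mx, x> - <b, x>, so G = A = M and H_F = 0: each
# step exactly minimizes the model over the sampled coordinates, i.e.
# solves a tau x tau linear system (the SN/SDNA regime of Section 1.3).

rng = np.random.default_rng(2)
n, tau = 8, 3
R = rng.standard_normal((n, n))
M = R @ R.T + np.eye(n)                      # positive definite
b = rng.standard_normal(n)
F = lambda x: 0.5 * x @ M @ x - b @ x

x = np.zeros(n)
F0 = F(x)
for _ in range(500):
    S = rng.choice(n, size=tau, replace=False)          # tau-nice sampling
    grad_S = (M @ x - b)[S]                             # (grad g(x))_[S]
    x[S] += np.linalg.solve(M[np.ix_(S, S)], -grad_S)   # minimize the model

assert F(x) < F0                             # each step is non-increasing
print(F(x) - F(np.linalg.solve(M, b)))       # remaining gap to the optimum
```

Because every step exactly minimizes F over the sampled coordinates, F is monotonically non-increasing along the iterates; Theorem 3 quantifies the linear rate of the gap's decay in this strongly convex case.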
These complexity estimates show which parameters of the problem directly affect the convergence of the algorithm.

Bound (13) improves on the initial estimate (11) by a factor of order √(D_0/ε) (with D_0 the constant under the maximum in (11)). The cost is the additional term σ⁻¹ = (min{β, 1})⁻¹, which grows as the condition number β becomes smaller.

The opposite, limiting case is when the quadratic part of the objective vanishes (g(x) ≡ 0 ⇒ σ = 1). Algorithm 1 then turns into a parallelized block-independent minimization of the objective components via cubically regularized Newton steps, and the complexity estimate coincides with the known result of (Nesterov & Polyak, 2006) in the nonrandomized (τ = n, ρ → 1) setting.

Bound (14) guarantees a linear rate of convergence, which means logarithmic dependence of the number of iterations on the required accuracy ε. The main complexity factor becomes a product of two terms: σ⁻¹ · max{H_F D/µ, 1}^{1/2}. In the case φ(x) ≡ 0 we can put H_F = 0 and recover the stochastic Newton method from (Qu et al., 2016) with its global linear convergence guarantee.

Despite the fact that a linear rate is asymptotically better than a sublinear one, and O(1/√ε) is asymptotically better than O(1/ε), we need to take into account all the factors which may slow down the algorithm. Thus, as µ = λ_min(G) → 0, estimate (13) becomes better than (14), just as (11) becomes better than (13) as β → 0.

6.3. Implementation Issues

Let us explain how one step of the method can be performed; this requires the minimization of the cubic model (10), possibly with some simple convex constraints.

The first, classical approach was proposed in (Nesterov & Polyak, 2006) and, earlier for trust-region methods, in (Conn et al., 2000). It works in the unconstrained (Q ≡ R^N) and differentiable (ψ(x) ≡ 0) case. First we need to find a root of a special one-dimensional nonlinear equation (this can be done, for example, by simple Newton iterations). After that, we solve one linear system to produce a step of the method. The total complexity of solving the subproblem can then be estimated as O(d³) arithmetical operations, where d is the dimension of the subproblem, in our case d = |S|. Since some matrix factorization is used, the cost of the cubically regularized Newton step is actually similar in efficiency to the classical Newton step. See (Gould et al., 2010) for a detailed analysis. For the case of affine constraints, the same procedure can be applied. An example of using this technique is given by Lemma 3 in the next section.

Another approach is based on finding an inexact solution of the subproblem by fast approximate eigenvalue computation (Agarwal et al., 2016) or by applying gradient descent (Carmon & Duchi, 2016). Both of these schemes provide global convergence guarantees. Additionally, they are Hessian-free: we only need a procedure for multiplying the quadratic part of (9) by an arbitrary vector, without storing the full matrix. The latter approach is the most universal one and can be extended to the composite case by using the proximal gradient method or its accelerated variant (Nesterov, 2013).

There are basically two strategies for finding the parameter H_k at every iteration: a constant choice, H_k := max_{i∈S_k}{H_i} or H_k := H_F, if the Lipschitz constants of the Hessians are known; or a simple adaptive procedure which performs a truncated binary search and has a logarithmic cost per step of the method. An example of such a procedure can be found in the primal-dual Algorithm 2 in the next section.

6.4. Extension of the Problem Class

The randomized cubic model (9), which has been considered and analyzed above, arises naturally from the separable structure (2) and from our smoothness assumptions (4) and (5). Let us discuss an interpretation of Algorithm 1 in terms of the general problem min_{x∈R^N} F(x) with twice-differentiable F (omitting the non-differentiable component for simplicity). One can state and minimize the model

    m_{H,S}(x; y) ≡ F(x) + ⟨(∇F(x))_[S], y⟩ + (1/2)⟨(∇²F(x))_[S] y, y⟩ + (H/6)‖y_[S]‖³,

which is a sketched version of the originally proposed cubic Newton method (Nesterov & Polyak, 2006). For alternative sketched variants of Newton-type methods, without cubic regularization, see (Pilanci & Wainwright, 2015).

The model m_{H,S}(x; y) is the same as M_{H,S}(x; y) when inequality (4) from the smoothness assumption on g holds with equality, i.e., when g is a quadratic with Hessian matrix ∇²g(x) ≡ A. Thus, we may use m_{H,S}(x; y) instead of M_{H,S}(x; y), which is still computationally cheap for small |S|. However, to the best of our knowledge this model does not give any convergence guarantees for general F, unless S = [n]. But it can be a workable approach when the separable structure (2) is not provided.

Note also that Assumption 3 about Lipschitz continuity of the Hessian is not too restrictive. A recent result (Grapiglia & Nesterov, 2017) shows that Newton-type methods with cubic regularization and a standard procedure for adaptive estimation of H_F automatically fit the actual level of smoothness of the Hessian, without any additional changes to the algorithm. Moreover, step (10) of the method, as the global minimum of M_{H,S}(x; y), is well-defined and computationally tractable even in nonconvex cases (Nesterov & Polyak, 2006). Thus
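The classical subproblem routine of Section 6.3 (find a root of a scalar equation, then solve one linear system) can be sketched as follows for the convex unconstrained case min_y ⟨g, y⟩ + (1/2)⟨By, y⟩ + (H/6)‖y‖³. The bisection on r = ‖y‖ below stands in for the one-dimensional Newton iterations mentioned in the text, and all names are our own:

```python
import numpy as np

# Sketch (ours) of the classical cubic-subproblem solver: minimize
#   m(y) = <g, y> + 0.5 <B y, y> + (H/6) ||y||^3,  with B >= 0 (convex case).
# Stationarity gives (B + (H/2) r I) y = -g with r = ||y||, so we locate
# the scalar r by bisection, then recover y from one linear system.

def cubic_step(g, B, H, tol=1e-10):
    d = len(g)
    y_of = lambda r: np.linalg.solve(B + 0.5 * H * r * np.eye(d), -g)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(y_of(hi)) > hi:      # grow hi until it brackets the root
        hi *= 2.0
    while hi - lo > tol:                      # bisection on r = ||y||
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(y_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return y_of(hi)

rng = np.random.default_rng(3)
d = 4
R = rng.standard_normal((d, d))
B = R @ R.T                                  # positive semidefinite quadratic part
g, H = rng.standard_normal(d), 2.0
y = cubic_step(g, B, H)
# First-order optimality of the model at y:
residual = g + B @ y + 0.5 * H * np.linalg.norm(y) * y
assert np.linalg.norm(residual) < 1e-6
```

The map r ↦ ‖(B + (H/2) r I)⁻¹ g‖ is decreasing for B ⪰ 0, so the fixed point is unique and bisection converges; each trial costs one d×d solve, in line with the O(d³) estimate in the text. For the composite or constrained case, the text's Hessian-free alternative (inexact minimization by proximal gradient steps) applies instead.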