Alternating Randomized Block Coordinate Descent

Jelena Diakonikolas 1 Lorenzo Orecchia 1

arXiv:1805.09185v2 [math.OC] 1 Jul 2019

Abstract

Block-coordinate descent algorithms and alternating minimization methods are fundamental optimization algorithms and an important primitive in large-scale optimization and machine learning. While various block-coordinate-descent-type methods have been studied extensively, only alternating minimization – which applies to the setting of only two blocks – is known to have convergence time that scales independently of the least smooth block. A natural question is then: is the setting of two blocks special?

We show that the answer is "no" as long as the least smooth block can be optimized exactly – an assumption that is also needed in the setting of alternating minimization. We do so by introducing a novel algorithm AR-BCD, whose convergence time scales independently of the least smooth (possibly non-smooth) block. The basic algorithm generalizes both alternating minimization and randomized block coordinate (gradient) descent, and we also provide its accelerated version – AAR-BCD. As a special case of AAR-BCD, we obtain the first nontrivial accelerated alternating minimization algorithm.

1 Department of Computer Science, Boston University, Boston, MA, USA. Correspondence to: Jelena Diakonikolas, Lorenzo Orecchia.

1. Introduction

First-order methods for minimizing smooth convex functions are a cornerstone of large-scale optimization and machine learning. Given the size and heterogeneity of the data in these applications, there is a particular interest in designing iterative methods that, at each iteration, only optimize over a subset of the decision variables (Wright, 2015).

This paper focuses on two classes of methods that constitute important instantiations of this idea. The first class is that of block-coordinate descent methods, i.e., methods that partition the set of variables into n ≥ 2 blocks and perform a descent step on a single block at every iteration, while leaving the remaining variable blocks fixed. A paradigmatic example of this approach is the randomized Kaczmarz algorithm of (Strohmer & Vershynin, 2009) for linear systems and its generalization (Nesterov, 2012). The second class is that of alternating minimization methods, i.e., algorithms that partition the variable set into only n = 2 blocks and alternate between exactly optimizing one block or the other at each iteration (see, e.g., (Beck, 2015) and references therein).

Besides the computational advantage of only having to update a subset of variables at each iteration, methods in these two classes are also better able to exploit the structure of the problem, which, for instance, may be computationally expensive only in a small number of variables. To formalize this statement, assume that the set of variables is partitioned into n ≤ N mutually disjoint blocks, where the i-th block of a variable x is denoted by x^i, and the gradient corresponding to the i-th block is denoted by ∇_i f(x). Each block i is associated with a smoothness parameter L_i, i.e., ∀x, y ∈ R^N:

‖∇_i f(x + I_N^i y) − ∇_i f(x)‖_* ≤ L_i ‖y^i‖,   (1.1)

where I_N^i is a diagonal matrix whose diagonal entries equal one for coordinates from block i, and are zero otherwise. In this setting, the convergence time of standard randomized block-coordinate descent methods, such as those in (Nesterov, 2012), scales as O(∑_i L_i / ε), where ε is the desired additive error. By contrast, when n = 2, the convergence time of the alternating minimization method (Beck, 2015) scales as O(L_min / ε), where L_min is the minimum smoothness parameter of the two blocks. This means that one of the two blocks can have arbitrarily poor smoothness (including ∞), as long as it is easy to optimize over it. Some important examples with a non-smooth block (with smoothness parameter equal to infinity) can be found in (Beck, 2015). Additional examples of problems for which exact optimization over the least smooth block can be performed efficiently are provided in Appendix B.

In this paper, we address the following open question, which was implicitly raised by (Beck & Tetruashvili, 2013): can we design algorithms that combine the features of randomized block-coordinate descent and alternating minimization? In particular, assuming we can perform exact optimization on block n, can we construct a block-coordinate descent algorithm whose running time scales with O(∑_{i=1}^{n−1} L_i), i.e., independently of the smoothness L_n of the n-th block? This would generalize both existing block-coordinate descent methods, by allowing one block to be optimized exactly, and existing alternating minimization methods, by allowing n to be larger than 2 and requiring exact optimization only on a single block.

We answer these questions in the affirmative by presenting a novel algorithm: alternating randomized block coordinate descent (AR-BCD). The algorithm alternates between an exact optimization over a fixed, possibly non-smooth block, and a gradient step or exact optimization over a randomly selected block among the remaining blocks. For two blocks, the method reduces to standard alternating minimization, while when the non-smooth block is empty (not optimized over), we get randomized block coordinate descent (RCDM) from (Nesterov, 2012).

Our second contribution is AAR-BCD, an accelerated version of AR-BCD, which achieves the accelerated rate of 1/k² without incurring any dependence on the smoothness of block n. Furthermore, when the non-smooth block is empty, AAR-BCD recovers the fastest known convergence bounds for block-coordinate descent (Qu & Richtárik, 2016; Allen-Zhu et al., 2016; Nesterov, 2012; Lin et al., 2014; Nesterov & Stich, 2017). As a special case, AAR-BCD provides the first accelerated alternating minimization algorithm, obtained directly from AAR-BCD when the number of blocks equals two.¹ Another conceptual contribution is our extension of the approximate duality gap technique of (Diakonikolas & Orecchia, 2017), which leads to a general and more streamlined analysis.

Finally, to illustrate the results, we perform a preliminary experimental evaluation of our methods against existing block-coordinate algorithms and discuss how their performance depends on the smoothness and size of the blocks.

Related Work. Alternating minimization and cyclic block coordinate descent are old and fundamental algorithms (Ortega & Rheinboldt, 1970) whose convergence (to a stationary point) has been studied even in the non-convex setting, in which they were shown to converge asymptotically under the additional assumptions that the blocks are optimized exactly and their minimizers are unique (Bertsekas, 1999). However, even in the non-smooth convex case, methods that perform exact minimization over a fixed set of blocks may converge arbitrarily slowly. This has led scholars to focus on the case of smooth convex minimization, for which nonasymptotic convergence rates were obtained recently in (Beck & Tetruashvili, 2013; Beck, 2015; Sun & Hong, 2015; Saha & Tewari, 2013). However, prior to our work, convergence bounds that are independent of the largest smoothness parameter were only known for the setting of two blocks.

Randomized coordinate descent methods, in which steps over coordinate blocks are taken in a non-cyclic randomized order (i.e., in each iteration one block is sampled with replacement), were originally analyzed in (Nesterov, 2012). The same paper (Nesterov, 2012) also provided an accelerated version of these methods. The results of (Nesterov, 2012) were subsequently improved and generalized to various other settings (such as, e.g., composite minimization) in (Lee & Sidford, 2013; Allen-Zhu et al., 2016; Nesterov & Stich, 2017; Richtárik & Takáč, 2014; Fercoq & Richtárik, 2015; Lin et al., 2014). The analysis of the different block coordinate descent methods under various sampling probabilities (that, unlike in our setting, are non-zero over all the blocks) was unified in (Qu & Richtárik, 2016) and extended to a more general class of steps within each block in (Gower & Richtárik, 2015; Qu et al., 2016).

Our results should be carefully compared to a number of proximal block-coordinate methods that rely on different assumptions (Tseng & Yun, 2009; Richtárik & Takáč, 2014; Lin et al., 2014; Fercoq & Richtárik, 2015). In this setting, the function f is assumed to have the structure f_0(x) + Ψ(x), where f_0 is smooth, the non-smooth convex function Ψ is separable over the blocks, i.e., Ψ(x) = ∑_{i=1}^n Ψ_i(x^i), and we can efficiently compute the proximal operator of each Ψ_i. This strong assumption allows these methods to make use of the standard proximal optimization framework. By contrast, in our paper, the convex objective can take an arbitrary form, where the non-smoothness of a block need not be separable, though the function is assumed to be differentiable.

¹The remarks about accelerated alternating minimization have been added in the second version of the paper, in July 2019, partly to clarify the relationship to the methods obtained in (Guminov et al., 2019), which was posted to the arXiv for the first time in June 2019. At a technical level, nothing new is introduced compared to the first version of the paper – everything stated in Section 4.2 follows either as a special case or a simple corollary of the results that appeared in the first version of the paper in May 2018.

2. Preliminaries

We assume that we are given oracle access to the gradients of a continuously differentiable convex function f : R^N → R, where computing gradients over only a subset of coordinates is computationally much cheaper than computing the full gradient. We are interested in minimizing f(·) over R^N, and we denote x_* = argmin_{x ∈ R^N} f(x). We let ‖·‖ denote an arbitrary (but fixed) norm, and ‖·‖_* denote its dual norm, defined in the standard way: ‖z‖_* = sup_{x ∈ R^N : ‖x‖ = 1} ⟨z, x⟩.

Let I_N be the identity matrix of size N and I_N^i be a diagonal matrix whose diagonal element j equals one if variable j is in the i-th block, and zero otherwise. Notice that I_N = ∑_{i=1}^n I_N^i. Let S_i(x) = {y ∈ R^N : (I_N − I_N^i)y = (I_N − I_N^i)x}; that is, S_i(x) contains all the points from R^N whose coordinates differ from those of x only over block i.

We denote the smoothness parameter of block i by L_i, as defined in Equation (1.1).² Equivalently, ∀x, y ∈ R^N:

f(x + I_N^i y) ≤ f(x) + ⟨∇_i f(x), y^i⟩ + (L_i/2) ‖y^i‖².   (2.1)

The gradient step over block i is then defined as:

T_i(x) = argmin_{y ∈ S_i(x)} { ⟨∇f(x), y − x⟩ + (L_i/2) ‖y − x‖² }.   (2.2)

By standard arguments (see, e.g., Exercise 3.27 in (Boyd & Vandenberghe, 2004)):

f(T_i(x)) − f(x) ≤ − (1/(2L_i)) ‖∇_i f(x)‖_*².   (2.3)

Without loss of generality, we will assume that the n-th block has the largest smoothness parameter and is possibly non-smooth (i.e., it can be L_n = ∞). The standing assumption is that exact minimization over the n-th block is "easy", meaning that it is computationally inexpensive and possibly solvable in closed form; for some important examples that have this property, see Appendix B. Observe that when block n contains a small number of variables, it is often computationally inexpensive to use second-order optimization methods, such as, e.g., an interior point method.

We assume that f(·) is strongly convex with parameter µ ≥ 0, where it could be µ = 0 (in which case f(·) is not strongly convex). Namely, ∀x, y:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖².   (2.4)

When µ > 0, we take ‖·‖ = ‖·‖_2, which is customary for smooth and strongly convex minimization (Bubeck, 2014). Throughout the paper, whenever we take unconditional expectation, it is with respect to all randomness in the algorithm.

²Note that the analysis extends in a straightforward way to the case where each block is associated with a different norm (see, e.g., (Nesterov, 2012)); for simplicity of presentation, we take the same norm over all blocks.

2.1. Alternating Minimization

In (standard) alternating minimization (AM), there are only two blocks of coordinates, i.e., n = 2. The algorithm is defined as follows:

x̂_k = argmin_{x ∈ S_1(x_{k−1})} f(x),
x_k = argmin_{x ∈ S_2(x̂_k)} f(x),   (AM)
x_1 ∈ R^N is an arbitrary initial point.

We note that for the standard analysis of alternating minimization (Beck, 2015), the exact minimization step over the smoother block can be replaced by a gradient step (Equation (2.2)), while still leading to convergence that only depends on the smaller smoothness parameter.

2.2. Randomized Block Coordinate (Gradient) Descent

The simplest version of randomized block coordinate (gradient) descent (RCDM) can be stated as (Nesterov, 2012):

Select i_k ∈ {1, ..., n} w.p. p_{i_k} > 0,
x_k = T_{i_k}(x_{k−1}),   (RCDM)
x_1 ∈ R^N is an arbitrary initial point,

where ∑_{i=1}^n p_i = 1. A standard choice of the probability distribution is p_i ∼ L_i, leading to a convergence rate that depends on the sum of the block smoothness parameters.

3. AR-BCD

The basic version of alternating randomized block coordinate descent (AR-BCD) is a direct generalization of (AM) and (RCDM): when n = 2, it is equivalent to (AM), while when the size of the n-th block is zero, it reduces to (RCDM). The method is stated as follows:

Select i_k ∈ {1, ..., n − 1} w.p. p_{i_k} > 0,
x̂_k = T_{i_k}(x_{k−1}),
x_k = argmin_{x ∈ S_n(x̂_k)} f(x),   (AR-BCD)
x_1 ∈ R^N is an arbitrary initial point,

where ∑_{i=1}^{n−1} p_i = 1. We note that nothing will change in the analysis if the step x̂_k = T_{i_k}(x_{k−1}) is replaced by x̂_k = argmin_{x ∈ S_{i_k}(x_{k−1})} f(x), since min_{x ∈ S_{i_k}(x_{k−1})} f(x) ≤ f(T_{i_k}(x_{k−1})).

In the rest of the section, we show that (AR-BCD) leads to a convergence bound that interpolates between the convergence bounds of (AM) and (RCDM): it depends on the sum of the smoothness parameters of the first n − 1 blocks, while the dependence on the remaining problem parameters is the same for all these methods.
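To make the method concrete, the following is a minimal Python sketch of (AR-BCD) for a least-squares objective f(x) = ½‖Ax − b‖² with the ℓ2 norm. It is an illustration only, not the paper's reference implementation: the data sizes, the block partition, and the sampling probabilities (p_i ∝ L_i) are placeholder choices made here for the example.

    import numpy as np

    # Minimal sketch of AR-BCD for f(x) = 0.5*||Ax - b||^2 (l2 norm assumed).
    # Block structure, data sizes, and probabilities are illustrative only.
    rng = np.random.default_rng(0)
    m, N = 200, 60
    A, b = rng.standard_normal((m, N)), rng.standard_normal(m)
    blocks = [np.arange(0, 20), np.arange(20, 40), np.arange(40, 60)]  # last block = block n
    n = len(blocks)

    # Per-block smoothness: L_i = ||A_i||_2^2 (largest singular value squared).
    L = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]
    p = np.array(L[:-1]) / sum(L[:-1])          # sample i_k proportionally to L_i, i < n

    def grad_block(x, blk):
        return A[:, blk].T @ (A @ x - b)         # nabla_i f(x)

    x = np.zeros(N)
    for k in range(500):
        # Gradient step T_i over a random block i_k in {1, ..., n-1}.
        i = rng.choice(n - 1, p=p)
        blk = blocks[i]
        x[blk] -= grad_block(x, blk) / L[i]
        # Exact minimization over block n (a least-squares subproblem in the block-n variables).
        blk_n = blocks[-1]
        r = b - A @ x + A[:, blk_n] @ x[blk_n]   # residual with block n's contribution removed
        x[blk_n] = np.linalg.lstsq(A[:, blk_n], r, rcond=None)[0]

    print("f(x) =", 0.5 * np.linalg.norm(A @ x - b) ** 2)

Setting n = 2 in this sketch recovers alternating minimization (with a gradient step on the smoother block), and dropping the exact-minimization step recovers RCDM.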

3.1. Approximate Duality Gap

To analyze (AR-BCD), we extend the approximate duality gap technique (Diakonikolas & Orecchia, 2017) to the setting of randomized block coordinate descent methods. The approximate duality gap G_k is defined as the difference of an upper bound U_k and a lower bound L_k to the minimum function value f(x_*). For (AR-BCD), we choose the upper bound to simply be U_k = f(x_{k+1}).

The generic construction of the lower bound is as follows. Let x_1, x_2, ..., x_k be any sequence of points from R^N (in fact, we will choose them to be exactly the sequence constructed by (AR-BCD)). Then, by (strong) convexity of f(·), f(x_*) ≥ f(x_j) + ⟨∇f(x_j), x_* − x_j⟩ + (µ/2)‖x_* − x_j‖², ∀j ∈ {1, ..., k}. In particular, if a_j > 0 is a sequence of (deterministic, independent of i_j) positive real numbers and A_k = ∑_{j=1}^k a_j, then:

f(x_*) ≥ ( ∑_{j=1}^k a_j f(x_j) + ∑_{j=1}^k a_j ⟨∇f(x_j), x_* − x_j⟩ + (µ/2) ∑_{j=1}^k a_j ‖x_* − x_j‖² ) / A_k  =: L_k.   (3.1)

3.2. Convergence Analysis

The main idea in the analysis is to show that E[A_k G_k − A_{k−1} G_{k−1}] ≤ E_k, for some deterministic E_k. Then, using linearity of expectation, E[f(x_{k+1})] − f(x_*) ≤ E[G_k] ≤ (E[A_1 G_1] + ∑_{j=2}^k E_j)/A_k. The bound in expectation can then be turned into a bound in probability, using well-known concentration bounds. The main observation that allows us not to pay for the non-smooth block is:

Observation 3.1. For the x_k's constructed by (AR-BCD), ∇_n f(x_k) = 0, ∀k, where 0 is the vector of all zeros.

This observation is essentially what allows us to sample i_k only from the first n − 1 blocks, and it holds due to the step x_k = argmin_{x ∈ S_n(x̂_k)} f(x) from (AR-BCD).

Denote R_{x_*}^i = max_{x ∈ R^N} { ‖I_N^i (x_* − x)‖² : f(x) ≤ f(x_1) }, and let us bound the initial gap A_1 G_1.

Proposition 3.2. E[A_1 G_1] ≤ E_1, where E_1 = a_1 ∑_{i=1}^{n−1} ( L_i/(2p_i) − µ/2 ) R_{x_*}^i.

Proof. By linearity of expectation, E[A_1 G_1] = E[A_1 U_1] − E[A_1 L_1]. The initial lower bound is deterministic and, by ∇_n f(x_1) = 0 and duality of norms, it is bounded as:

E[A_1 L_1] ≥ a_1 f(x_1) − a_1 ∑_{i=1}^{n−1} ‖∇_i f(x_1)‖_* ‖x_*^i − x_1^i‖ + a_1 (µ/2) ‖x_* − x_1‖².

Using (2.3), if i_2 = i, then:

U_1 = f(x_2) ≤ f(x̂_2) ≤ f(x_1) − (1/(2L_i)) ‖∇_i f(x_1)‖_*².

Since block i is selected with probability p_i and A_1 = a_1:

E[A_1 U_1] ≤ a_1 f(x_1) − ∑_{i=1}^{n−1} (a_1 p_i/(2L_i)) ‖∇_i f(x_1)‖_*².

Since the inequality 2ab − a² ≤ b² holds ∀a, b, we have:

a_1 ‖∇_i f(x_1)‖_* ‖x_*^i − x_1^i‖ − (a_1 p_i/(2L_i)) ‖∇_i f(x_1)‖_*² ≤ (a_1 L_i/(2p_i)) ‖x_*^i − x_1^i‖²,  ∀i ∈ {1, ..., n − 1}.

Hence, when µ = 0, E[A_1 G_1] ≤ ∑_{i=1}^{n−1} (a_1 L_i/(2p_i)) ‖x_*^i − x_1^i‖². When µ > 0, since in that case we are assuming ‖·‖ = ‖·‖_2 (Section 2), ‖x_* − x_1‖² ≥ ∑_{i=1}^{n−1} ‖x_*^i − x_1^i‖², leading to E[A_1 G_1] ≤ a_1 ∑_{i=1}^{n−1} ( L_i/(2p_i) − µ/2 ) ‖x_*^i − x_1^i‖². The claim now follows since ‖x_*^i − x_1^i‖² ≤ R_{x_*}^i.

We now show how to bound the error in the decrease of the scaled gap A_k G_k.

Lemma 3.3. E[A_k G_k − A_{k−1} G_{k−1}] ≤ E_k, where E_k = a_k ∑_{i=1}^{n−1} ( a_k L_i/(2A_k p_i) − µ/2 ) R_{x_*}^i.

Proof. Let F_k denote the natural filtration up to iteration k. By linearity of expectation and A_k L_k − A_{k−1} L_{k−1} being measurable w.r.t. F_k,

E[A_k G_k − A_{k−1} G_{k−1} | F_k] = E[A_k U_k − A_{k−1} U_{k−1} | F_k] − (A_k L_k − A_{k−1} L_{k−1}).

If i_{k+1} = i (which happens with probability p_i), then, as f(x_{k+1}) ≤ f(x̂_{k+1}), the change in the upper bound is:

A_k U_k − A_{k−1} U_{k−1} ≤ A_k f(x̂_{k+1}) − A_{k−1} f(x_k) ≤ a_k f(x_k) − (A_k/(2L_i)) ‖∇_i f(x_k)‖_*²,

where the second inequality follows from x̂_{k+1} = T_{i_{k+1}}(x_k) and Equation (2.3). Hence:

E[A_k U_k − A_{k−1} U_{k−1} | F_k] ≤ a_k f(x_k) − A_k ∑_{i=1}^{n−1} (p_i/(2L_i)) ‖∇_i f(x_k)‖_*².

On the other hand, using the duality of norms, the change in the lower bound is:

A_k L_k − A_{k−1} L_{k−1} ≥ a_k f(x_k) − a_k ∑_{i=1}^{n−1} ‖∇_i f(x_k)‖_* ‖x_*^i − x_k^i‖ + a_k (µ/2) ‖x_* − x_k‖²
  ≥ a_k f(x_k) − a_k ∑_{i=1}^{n−1} ‖∇_i f(x_k)‖_* √(R_{x_*}^i) + a_k (µ/2) ‖x_* − x_k‖².

By the same argument as in the proof of Proposition 3.2, it follows that:

E[A_k G_k − A_{k−1} G_{k−1} | F_k] ≤ a_k ∑_{i=1}^{n−1} ( L_i a_k/(2A_k p_i) − µ/2 ) R_{x_*}^i = E_k.

Taking expectations on both sides, as E_k is deterministic, the proof follows.

We are now ready to prove the convergence bound for (AR-BCD), as follows.

Theorem 3.4. Let x_k evolve according to (AR-BCD). Then, ∀k ≥ 1:

1. If µ = 0: E[f(x_{k+1})] − f(x_*) ≤ 2 ∑_{i=1}^{n−1} (L_i/p_i) R_{x_*}^i / (k + 3). In particular, for p_i = L_i / ∑_{i'=1}^{n−1} L_{i'}, 1 ≤ i ≤ n − 1:

E[f(x_{k+1})] − f(x_*) ≤ 2 (∑_{i'=1}^{n−1} L_{i'}) ∑_{i=1}^{n−1} R_{x_*}^i / (k + 3).

Similarly, for p_i = 1/(n − 1), 1 ≤ i ≤ n − 1:

E[f(x_{k+1})] − f(x_*) ≤ 2(n − 1) ∑_{i=1}^{n−1} L_i R_{x_*}^i / (k + 3).

2. If µ > 0, p_i = L_i / ∑_{i'=1}^{n−1} L_{i'} and ‖·‖ = ‖·‖_2:

E[f(x_{k+1})] − f(x_*) ≤ (1 − µ / ∑_{i'=1}^{n−1} L_{i'})^k · (∑_{i'=1}^{n−1} L_{i'}) ‖(I_N − I_N^n)(x_* − x_1)‖² / 2.

Proof. From Proposition 3.2 and Lemma 3.3, by linearity of expectation and the definition of G_k:

E[f(x_{k+1})] − f(x_*) ≤ E[G_k] ≤ ∑_{j=1}^k E_j / A_k,   (3.2)

where E_j = a_j ∑_{i=1}^{n−1} ( a_j L_i/(2A_j p_i) − µ/2 ) R_{x_*}^i.

Notice that the algorithm does not depend on the sequence {a_j} and thus we can choose it arbitrarily. Suppose that µ = 0. Let a_j = (j + 1)/2. Then a_j²/A_j = (j + 1)²/(j(j + 3)) ≤ 1, and thus:

∑_{j=1}^k E_j / A_k ≤ 2 ∑_{i=1}^{n−1} (L_i/p_i) R_{x_*}^i / (k + 3),

which proves the first part of the theorem, up to the concrete choices of p_i's, which follow by simple computations.

For the second part of the theorem, as µ > 0, we are assuming that ‖·‖ = ‖·‖_2, as discussed in Section 2. From Lemma 3.3, E_j = a_j ∑_{i=1}^{n−1} ( a_j L_i/(2A_j p_i) − µ/2 ) R_{x_*}^i, ∀j ≥ 2. As p_i = L_i / ∑_{i'=1}^{n−1} L_{i'}, if we take a_j/A_j = µ / ∑_{i'=1}^{n−1} L_{i'}, it follows that E_j = 0, ∀j ≥ 2. Let a_1 = A_1 = 1 and a_j/A_j = µ / ∑_{i'=1}^{n−1} L_{i'} for j ≥ 2. Then: E[f(x_{k+1})] − f(x_*) ≤ E[G_k] ≤ E[A_1 G_1]/A_k. As 1/A_k = (A_1/A_2)·(A_2/A_3)···(A_{k−1}/A_k) and A_{j−1}/A_j = 1 − a_j/A_j:

E[f(x_{k+1})] − f(x_*) ≤ (1 − µ / ∑_{i'=1}^{n−1} L_{i'})^{k−1} E[G_1].

It remains to observe that, from Proposition 3.2, E[G_1] ≤ (1 − µ / ∑_{i'=1}^{n−1} L_{i'}) (∑_{i'=1}^{n−1} L_{i'}) ‖(I_N − I_N^n)(x_* − x_1)‖² / 2.

We note that when n = 2, the asymptotic convergence of AR-BCD coincides with the convergence of alternating minimization (Beck, 2015). When the n-th block is empty (i.e., when all blocks are sampled with non-zero probability and there is no exact minimization over a least-smooth block), we obtain the convergence bound of the standard randomized coordinate descent method (Nesterov, 2012).
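As a quick numerical sanity check of Observation 3.1 (an illustration added here, not part of the original paper), the sketch below verifies on a random least-squares instance that the block-n gradient vanishes after the exact-minimization step of (AR-BCD); the data and the block of indices are arbitrary placeholders.

    import numpy as np

    # Check Observation 3.1: after x_k = argmin_{x in S_n(x_hat_k)} f(x),
    # the block-n gradient of f(x) = 0.5*||Ax - b||^2 is zero.
    rng = np.random.default_rng(1)
    m, N = 100, 30
    A, b = rng.standard_normal((m, N)), rng.standard_normal(m)
    blk_n = np.arange(20, 30)                      # indices of the "hard" block n
    x = rng.standard_normal(N)                     # any point x_hat_k

    # Exact minimization over block n with the other coordinates fixed.
    r = b - A @ x + A[:, blk_n] @ x[blk_n]
    x[blk_n] = np.linalg.lstsq(A[:, blk_n], r, rcond=None)[0]

    grad_n = A[:, blk_n].T @ (A @ x - b)           # nabla_n f(x_k)
    print("||nabla_n f(x_k)|| =", np.linalg.norm(grad_n))   # numerically ~0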

4. Accelerated AR-BCD

In this section, we show how to accelerate (AR-BCD) when f(·) is smooth. We believe it is possible to obtain similar results in the smooth and strongly convex case, which we defer to a future version of the paper. Denote:

∆_k = I_N^{i_k} ∇f(x_k)/p_{i_k},
v_k = argmin_u { ∑_{j=1}^k a_j ⟨∆_j, u⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² },   (4.1)

where σ_i > 0, ∀i, will be specified later. Accelerated AR-BCD (AAR-BCD) is defined as follows:

Select i_k from {1, ..., n − 1} w.p. p_{i_k},
x̂_k = (A_{k−1}/A_k) y_{k−1} + (a_k/A_k) v_{k−1},
x_k = argmin_{x ∈ S_n(x̂_k)} f(x),   (AAR-BCD)
y_k = x_k + (a_k/(p_{i_k} A_k)) I_N^{i_k} (v_k − v_{k−1}),
x_1 is an arbitrary initial point,

where ∑_{i=1}^{n−1} p_i = 1, p_i > 0, ∀i ∈ {1, ..., n − 1}, and v_k is defined by (4.1). To seed the algorithm, we further assume that y_1 = x_1 + (1/p_{i_1}) I_N^{i_1} (v_1 − x_1).

Remark 4.1. The iteration complexity of (AAR-BCD) is dominated by the computation of x̂_k, which requires updating an entire vector. This type of an update is not unusual for accelerated block coordinate descent methods, and in fact appears in all such methods we are aware of (Nesterov, 2012; Lee & Sidford, 2013; Lin et al., 2014; Fercoq & Richtárik, 2015; Allen-Zhu et al., 2016). In most cases of practical interest, however, it is possible to implement this step efficiently (using that v_k changes only over block i_k in iteration k). More details are provided in Appendix B.

To analyze the convergence of AAR-BCD, we will need to construct a more sophisticated duality gap than in the previous section, as follows.

4.1. Approximate Duality Gap

We define the upper bound to be U_k = f(y_k). The constructed lower bound L_k from the previous subsection is not directly useful for the analysis of (AAR-BCD). Instead, we will construct a random variable Λ_k, which in expectation is upper bounded by f(x_*). The general idea, as in the previous subsection, is to show that some notion of approximate duality gap decreases in expectation.

Towards constructing Λ_k, we first prove the following technical proposition, whose proof is in Appendix A.

Proposition 4.2. Let x_k be as in (AAR-BCD). Then:

E[ ∑_{j=1}^k a_j ⟨∆_j, x_* − x_j⟩ ] = E[ ∑_{j=1}^k a_j ⟨∇f(x_j), x_* − x_j⟩ ].

Define the randomized lower bound as:

Λ_k = ( ∑_{j=1}^k a_j f(x_j) + min_{u ∈ R^N} { ∑_{j=1}^k a_j ⟨∆_j, u − x_j⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² } − ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖² ) / A_k,   (4.2)

and observe that (4.1) defines v_k as the argument of the minimum from Λ_k. The crucial property of Λ_k is that it lower bounds f(x_*) in expectation, as shown in the following lemma.

Lemma 4.3. Let x_k be as in (AAR-BCD). Then f(x_*) ≥ E[Λ_k].

Proof. By convexity of f(·), for any sequence {x̃_j} from R^N, f(x_*) ≥ ∑_{j=1}^k a_j ( f(x̃_j) + ⟨∇f(x̃_j), x_* − x̃_j⟩ ) / A_k. Since the statement holds for any sequence {x̃_j}, it also holds if {x̃_j} is selected according to some probability distribution. In particular, for {x̃_j} = {x_j}:

f(x_*) ≥ E[ ∑_{j=1}^k a_j ( f(x_j) + ⟨∇f(x_j), x_* − x_j⟩ ) / A_k ].

By linearity of expectation and Proposition 4.2:

f(x_*) ≥ E[ ∑_{j=1}^k a_j ( f(x_j) + ⟨∆_j, x_* − x_j⟩ ) / A_k ].   (4.3)

Adding and subtracting the (deterministic) ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖² to/from (4.3) and using that:

∑_{j=1}^k a_j ⟨∆_j, x_* − x_j⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖²
  ≥ min_u { ∑_{j=1}^k a_j ⟨∆_j, u − x_j⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² } = min_u m_k(u),

where m_k(u) = ∑_{j=1}^k a_j ⟨∆_j, u − x_j⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖², it follows that:

f(x_*) ≥ E[ ( ∑_{j=1}^k a_j f(x_j) − ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖² + min_{u ∈ R^N} m_k(u) ) / A_k ],

which is equal to E[Λ_k], and completes the proof.

Similarly as before, define the approximate gap as Γ_k = U_k − Λ_k. Then, we can bound the initial gap as follows.

Proposition 4.4. If a_1 = a_1²/A_1 ≤ σ_i p_i²/L_i, ∀i ∈ {1, ..., n − 1}, then E[A_1 Γ_1] ≤ ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖².

Proof. As a_1 = A_1 and y_1 differs from x_1 only over block i = i_1, by smoothness of f(·):

A_1 U_1 = A_1 f(y_1) ≤ a_1 f(x_1) + a_1 ⟨∇_i f(x_1), y_1^i − x_1^i⟩ + (a_1 L_i/2) ‖y_1^i − x_1^i‖².

On the other hand, the initial lower bound is:

A_1 Λ_1 = a_1 ( f(x_1) + ⟨∆_1, v_1 − x_1⟩ ) + ∑_{i=1}^{n−1} (σ_i/2) ‖v_1^i − x_1^i‖² − ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖².

Recall that y_1^i = x_1^i + (1/p_i)(v_1^i − x_1^i). Using A_1 Γ_1 = A_1 U_1 − A_1 Λ_1 and the bounds on U_1, Λ_1 from the above: A_1 Γ_1 ≤ ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖², as a_1 ≤ σ_i p_i²/L_i, and, thus, E[A_1 Γ_1] ≤ ∑_{i=1}^{n−1} (σ_i/2) ‖x_*^i − x_1^i‖².

The next part of the proof is to show that A_k Γ_k is a supermartingale. The proof is provided in Appendix A.

Lemma 4.5. If a_k²/A_k ≤ p_i² σ_i/L_i, ∀i ∈ {1, ..., n − 1}, then E[A_k Γ_k | F_{k−1}] ≤ A_{k−1} Γ_{k−1}.

Finally, we bound the convergence of (AAR-BCD).

Theorem 4.6. Let x_k, y_k evolve according to (AAR-BCD), for a_k²/A_k = min_{1≤i≤n−1} σ_i p_i²/L_i = const. Then, ∀k ≥ 1:

E[f(y_k)] − f(x_*) ≤ ∑_{i=1}^{n−1} σ_i ‖x_*^i − x_1^i‖² / (2 A_k).

In particular, if p_i = √L_i / ∑_{i'=1}^{n−1} √L_{i'}, σ_i = (∑_{i'=1}^{n−1} √L_{i'})², and a_1 = 1, then:

E[f(y_k)] − f(x_*) ≤ 2 (∑_{i'=1}^{n−1} √L_{i'})² ∑_{i=1}^{n−1} ‖x_*^i − x_1^i‖² / (k(k + 3)).

Alternatively, if p_i = 1/(n − 1), σ_i = L_i, and a_1 = 1/(n − 1)²:

E[f(y_k)] − f(x_*) ≤ 2(n − 1)² ∑_{i=1}^{n−1} L_i ‖x_*^i − x_1^i‖² / (k(k + 3)).

Proof. The first part of the proof follows immediately by applying Proposition 4.4 and Lemma 4.5. The second part follows by plugging in the particular choice of parameters and observing that a_j grows faster than (j + 1)/2 in the former case, and faster than (j + 1)/(2(n − 1)²) in the latter case.

Finally, we make a few remarks regarding Theorem 4.6. In the setting without a non-smooth block (when the n-th block is empty), (AAR-BCD) with sampling probabilities p_i ∼ √L_i has the same convergence bound as the NU ACDM algorithm (Allen-Zhu et al., 2016) and the ALPHA algorithm for smooth minimization (Qu & Richtárik, 2016). Further, when the sampling probabilities are uniform, (AAR-BCD) converges at the same rate as the ACDM algorithm (Nesterov, 2012) and the APCG algorithm applied to non-composite functions (Lin et al., 2014).

4.2. Accelerated Alternating Minimization

Before making specific remarks about accelerated alternating minimization (the case n = 2), we first note that the convergence analysis of AAR-BCD applies generically even if the step y_k is replaced by exact minimization over block i_k (namely, if, instead of the current choice of y_k in AAR-BCD, we set y_k = argmin_{x ∈ S_{i_k}(x_k)} f(x)). This change is relevant only at the beginning of the proof of Lemma 4.5, and, to see that the same analysis still applies, observe that:

f(y_k) = min_{x ∈ S_{i_k}(x_k)} f(x) ≤ f( x_k + (a_k/(p_{i_k} A_k)) I_N^{i_k} (v_k − v_{k−1}) ).

That is, replacing a particular step over block i_k with the exact minimization over the same block can only reduce the function value, which can only make the upper bound U_k (and, thus, the gap Γ_k) lower.

Accelerated alternating minimization with exact minimization over one block is immediately obtained from AAR-BCD as a special case when n = 2. To obtain a version of the method with exact minimization over both blocks, it simply suffices to replace the y_k step with the exact minimization over the first block, and, as already discussed, the same analysis applies.

To obtain a method that is symmetric over blocks 1 and 2, one only needs to swap the roles of blocks 1 and 2 in, say, even iterations.³ Again, the same analysis applies, except that in even iterations one would need to have a_k²/A_k ≤ σ_2/L_2, whereas in odd iterations it would still be a_k²/A_k ≤ σ_1/L_1; this affects the convergence bound in Theorem 4.6 by at most a factor of 2. Observe further that when L_2 → ∞, we have a_k → 0 in even iterations, and the method reduces to the non-symmetric version obtained directly from AAR-BCD.

Finally, in the case of two blocks (n = 2), it is straightforward to obtain a parameter-free version of the method. Indeed, all that is needed for the proof is that A_k Γ_k ≤ A_{k−1} Γ_{k−1} (as in Lemma 4.5). As discussed before, when n = 2, there is no randomness in the algorithm. Suppose that we want to implement a parameter-free version of the symmetric method and consider odd iterations (which are obtained as special cases of AAR-BCD for n = 2, with or without the exact minimization in the y_k-step). Then, from Lemma 4.5, all that needs to be satisfied is that a_k²/A_k ≤ σ_1/L_1. Hence, if we find a_k^* for which A_k Γ_k ≤ A_{k−1} Γ_{k−1}, then A_k Γ_k ≤ A_{k−1} Γ_{k−1} for any a_k ≤ a_k^*. This is sufficient for implementing backtracking over a_k to ensure A_k Γ_k ≤ A_{k−1} Γ_{k−1}. To do so, one can use:

A_k Γ_k − A_{k−1} Γ_{k−1} = A_k f(y_k) − A_{k−1} f(y_{k−1}) − a_k f(x_k) − a_k ⟨∇_1 f(x_k), v_k^1 − x_k^1⟩ − (σ_1/2) ‖v_k^1 − v_{k−1}^1‖²,

where we have used U_k = f(y_k) (by the definition of U_k) and the equivalent expression for A_k Λ_k − A_{k−1} Λ_{k−1} from Eq. (A.4). As a practical matter, in even iterations, if L_2 is very large and potentially approaching ∞, one can halt the backtracking line search and set a_k = 0 as soon as the search reaches some preset "sufficiently small" value of a_k.

³Observe that when n = 2, {p_i} is supported on a single element – block 1 is selected deterministically in even iterations; block 2 is selected deterministically in odd iterations.
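The following is a minimal, self-contained Python sketch of (AAR-BCD) for a least-squares objective with the ℓ2 norm and the first parameter choice of Theorem 4.6 (p_i ∝ √L_i, σ_i = (∑√L_{i'})², so that a_k²/A_k = 1). It is an illustration under these assumptions, not the paper's reference implementation, and it uses full-vector updates for clarity; Appendix B discusses how to avoid them.

    import numpy as np

    # Minimal sketch of AAR-BCD for f(x) = 0.5*||Ax - b||^2 (l2 norm), with
    # p_i ~ sqrt(L_i) and sigma_i = (sum_i sqrt(L_i))^2, so a_k^2 / A_k = 1.
    rng = np.random.default_rng(0)
    m, N = 300, 90
    A, b = rng.standard_normal((m, N)), rng.standard_normal(m)
    blocks = [np.arange(0, 30), np.arange(30, 60), np.arange(60, 90)]  # last block = block n
    n = len(blocks)
    L = np.array([np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks])
    p = np.sqrt(L[:-1]) / np.sum(np.sqrt(L[:-1]))
    sigma = np.sum(np.sqrt(L[:-1])) ** 2

    grad = lambda x: A.T @ (A @ x - b)

    def exact_min_block_n(x):
        # x_k = argmin over block n with the remaining coordinates fixed.
        blk = blocks[-1]
        r = b - A @ x + A[:, blk] @ x[blk]
        x = x.copy()
        x[blk] = np.linalg.lstsq(A[:, blk], r, rcond=None)[0]
        return x

    x1 = np.zeros(N)
    z = np.zeros(N)                # z = sum_j a_j * Delta_j, so v = x1 - z / sigma  (l2 prox)
    a, A_sum = 1.0, 1.0            # a_1 = A_1 = 1 gives a_1^2 / A_1 = 1

    # Seeding step (k = 1): v_1 and y_1 = x_1 + (1/p_{i_1}) I^{i_1} (v_1 - x_1).
    i = rng.choice(n - 1, p=p); blk = blocks[i]
    z[blk] += a * grad(x1)[blk] / p[i]
    v = x1 - z / sigma
    y = x1.copy(); y[blk] = x1[blk] + (v[blk] - x1[blk]) / p[i]

    for k in range(2, 400):
        i = rng.choice(n - 1, p=p); blk = blocks[i]      # select i_k
        a = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * A_sum))     # a_k^2 = A_{k-1} + a_k
        A_prev, A_sum = A_sum, A_sum + a
        x_hat = (A_prev / A_sum) * y + (a / A_sum) * v
        x = exact_min_block_n(x_hat)
        v_old_blk = v[blk].copy()
        z[blk] += a * grad(x)[blk] / p[i]                # Delta_k is supported on block i_k
        v = x1 - z / sigma
        y = x.copy()
        y[blk] = x[blk] + (a / (p[i] * A_sum)) * (v[blk] - v_old_blk)

    print("f(y_k) =", 0.5 * np.linalg.norm(A @ y - b) ** 2)

Setting n = 2 (one sampled block plus the exactly minimized block) turns this sketch into the accelerated alternating minimization special case discussed in Section 4.2.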

(a) N/n = 5 (b) N/n = 10 (c) N/n = 20 (d) N/n = 40

(e) N/n = 5 (f) N/n = 10 (g) N/n = 20 (h) N/n = 40

(i) N/n = 5 (j) N/n = 10 (k) N/n = 20 (l) N/n = 40

Figure 1. Comparison of different block coordinate descent methods: (a)-(d) distribution of smoothness parameters over blocks, (e)-(h) comparison of non-accelerated methods, and (i)-(l) comparison of accelerated methods. Block sizes N/n increase going left to right. data set contains 280 attributes and 52,396 data points. The {56, 28, 14, 7} coordinate blocks, respectively. attributes correspond to various metrics of crawled blog The distribution of the smoothness parameters over blocks, posts. The data is labeled, and the labels correspond to the for all chosen block sizes, is shown in Fig. 1(a)-1(d). Ob- number of comments that were posted within 24 hours from serve that as the block size increases (going from left to a fixed basetime. The goal of a regression method is to right in Fig. 1(a)-1(d)), the discrepancy between the two predict the number of comments that a blog post receives. largest smoothness parameters increases. What makes linear regression with least squares on this In all the comparisons between the different methods, we dataset particularly suitable to our setting is that the smooth- define an epoch to be equal to n iterations (this would corre- ness parameters of individual coordinates in the least squares spond to a single iteration of a full-). The problem take values from a large interval, even when the graphs plot the optimality gap of the methods over epochs, data matrix A is scaled by its maximum absolute value (the where the optimal objective value f ∗ is estimated via a values are between 0 and ∼354).4 The minimum eigenvalue higher precision method and denoted by fˆ∗. All the results of AT A is zero (i.e., AT A is not a full-rank matrix), and are shown for 50 method repetitions, with bold lines repre- thus the problem is not strongly convex. senting the median5 optimality gap over those 50 runs. The We partition the data into blocks as follows. We first sort norm used in all the experiments is `2, i.e., k · k = k · k2. the coordinates by their individual smoothness parameters. Then, we group the first N/n coordinates (from the sorted Non-accelerated methods We first compare AR-BCD list of coordinates) into the first block, the second N/n with a gradient step to RCDM(Nesterov, 2012) and stan- coordinates into the second block, and so on. The chosen dard cyclic BCD – C-BCD (see, e.g., (Beck & Tetruashvili, block sizes N/n are 5, 10, 20, 40, corresponding to n = 2013)). To make the comparison fair, as AR-BCD makes 4 We did not compare AR-BCD and AAR-BCD to other meth- 5We choose to show the median as opposed to the mean, as it is ods on problems with a non-smooth block (Ln = ∞), as no other well-known that in the presence of outliers the median is a robust methods have any known theoretical guarantees in such a setting. estimator of the true mean (Hampel et al., 2011). Alternating Randomized Block Coordinate Descent two steps per iteration, we slow it down by a factor of two tion AAR-BCD. Our work answers the open question compared to the other methods (i.e., we count one iteration of (Beck & Tetruashvili, 2013) whether the convergence of AR-BCD as two). 
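As an illustration of this experimental setup (not code from the paper), the sketch below computes per-coordinate smoothness parameters for a least-squares objective, sorts the coordinates, and groups them into blocks of size N/n; the random matrix is a stand-in for the scaled BlogFeedback data.

    import numpy as np

    # Sketch of the block-partitioning step: coordinates are sorted by their
    # individual smoothness parameters L_j = ||A_j||^2 (for f(x) = 0.5*||Ax - b||^2)
    # and grouped into consecutive blocks of size N/n.
    rng = np.random.default_rng(0)
    m, N, block_size = 1000, 280, 20
    A = rng.standard_normal((m, N)) * rng.uniform(0.01, 2.0, size=N)  # heterogeneous columns
    A = A / np.abs(A).max()                      # scale by the maximum absolute value

    coord_L = np.sum(A ** 2, axis=0)             # per-coordinate smoothness parameters
    order = np.argsort(coord_L)                  # sort coordinates by smoothness
    blocks = [order[i:i + block_size] for i in range(0, N, block_size)]
    n = len(blocks)

    # Per-block smoothness parameters (largest eigenvalue of A_i^T A_i).
    block_L = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]
    print("n =", n, "  two largest block smoothness parameters:", sorted(block_L)[-2:])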
In the comparison, we consider two cases for RCDM and C-BCD: (i) the case in which these two algorithms perform gradient steps on the first n − 1 blocks and exact minimization on the n-th block (denoted by RCDM and C-BCD in the figure), and (ii) the case in which the algorithms perform gradient steps on all blocks (denoted by RCDM-G and C-BCD-G in the figure). The sampling probabilities for RCDM and AR-BCD are proportional to the block smoothness parameters. The permutation for C-BCD is random, but fixed in each method run.

Fig. 1(e)-1(h) shows the comparison of the described non-accelerated algorithms, for block sizes N/n ∈ {5, 10, 20, 40}. The first observation to make is that adding exact minimization over the least smooth block speeds up the convergence of both C-BCD and RCDM, suggesting that the existing analysis of these two methods is not tight. Second, AR-BCD generally converges to a lower optimality gap. While RCDM makes large initial progress, it stagnates afterwards due to the highly non-uniform sampling probabilities, whereas AR-BCD keeps making progress.

Accelerated methods. Finally, we compare AAR-BCD to NU ACDM (Allen-Zhu et al., 2016), APCG (Lin et al., 2014), and accelerated C-BCD (ABCGD) from (Beck & Tetruashvili, 2013). As AAR-BCD makes three steps per iteration (as opposed to two steps normally taken by other methods), we slow it down by a factor 1.5 (i.e., we count one iteration of AAR-BCD as 1.5). We chose the sampling probabilities of NU ACDM and AAR-BCD to be proportional to √L_i, while the sampling probabilities for APCG are uniform.⁶ Similarly as before, each full run of ABCGD is performed on a random but fixed permutation of the blocks.

The results are shown in Fig. 1(i)-1(l). Compared to APCG (and ABCGD), NU ACDM and AAR-BCD converge much faster, which is expected, as the distribution of the smoothness parameters is highly non-uniform and the methods with non-uniform sampling are theoretically faster by a factor of the order √n (Allen-Zhu et al., 2016). As the block size is increased (going left to right), the discrepancy between the smoothness parameters of the least smooth block and the remaining blocks increases, and, as expected, AAR-BCD exhibits more dramatic improvements compared to the other methods.

⁶The theoretical results for APCG were only presented for uniform sampling (Lin et al., 2014).

6. Conclusion

We presented a novel block coordinate descent algorithm AR-BCD and its accelerated version for smooth minimization, AAR-BCD. Our work answers the open question of (Beck & Tetruashvili, 2013) whether the convergence of block coordinate descent methods intrinsically depends on the largest smoothness parameter over all the blocks, by showing that such a dependence is not necessary, as long as exact minimization over the least smooth block is possible. Before our work, such a result only existed for the setting of two blocks, using the alternating minimization method.

There are several research directions that merit further investigation. For example, we observed empirically that exact optimization over the non-smooth block improves the performance of RCDM and C-BCD, which is not justified by the existing analytical bounds. We expect that in both of these methods the dependence on the least smooth block can be removed, possibly at the cost of a worse dependence on the number of blocks. Further, AR-BCD and AAR-BCD are mainly useful when the discrepancy between the largest block smoothness parameter and the remaining smoothness parameters is large, while under a uniform distribution of the smoothness parameters they can be slower than other methods by a factor of 1.5-2. It is an interesting question whether there are modifications to AR-BCD and AAR-BCD that would make them uniformly better than the alternatives.

7. Acknowledgements

Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing. It was partially supported by NSF grant #CCF-1718342 and by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF grant #CCF-1740425.

References

Allen-Zhu, Zeyuan, Qu, Zheng, Richtárik, Peter, and Yuan, Yang. Even faster accelerated coordinate descent using non-uniform sampling. In Proc. ICML'16, 2016.

Beck, Amir. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optimiz., 25(1):185-209, 2015.

Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM J. Optimiz., 23(4):2037-2060, 2013.

Bertsekas, Dimitri P. Nonlinear programming. Athena Scientific, Belmont, 1999.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Bubeck, Sébastien. Theory of convex optimization for machine learning. 2014. arXiv preprint, arXiv:1405.4980v1.

Buza, Krisztian. Feedback prediction for blogs. In Data analysis, machine learning and knowledge discovery, pp. 145-152. Springer, 2014.

Diakonikolas, Jelena and Orecchia, Lorenzo. The approximate duality gap technique: A unified theory of first-order methods, 2017. arXiv preprint, arXiv:1712.02485.

Fercoq, Olivier and Richtárik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM J. Optimiz., 25(4):1997-2023, 2015.

Gower, Robert Mansel and Richtárik, Peter. Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890, 2015.

Guminov, Sergey, Dvurechensky, Pavel, and Gasnikov, Alexander. Accelerated alternating minimization. arXiv preprint arXiv:1906.03622, 2019.

Hampel, Frank R, Ronchetti, Elvezio M, Rousseeuw, Peter J, and Stahel, Werner A. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.

Lee, Yin Tat and Sidford, Aaron. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Proc. IEEE FOCS'13, 2013.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method. In Proc. NIPS'14, 2014.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimiz., 22(2):341-362, 2012.

Nesterov, Yurii and Stich, Sebastian U. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM J. Optimiz., 27(1):110-123, 2017.

Ortega, James M and Rheinboldt, Werner C. Iterative solution of nonlinear equations in several variables, volume 30. SIAM, 1970.

Qu, Zheng and Richtárik, Peter. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829-857, 2016.

Qu, Zheng, Richtárik, Peter, Takáč, Martin, and Fercoq, Olivier. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proc. ICML'16, 2016.

Richtárik, Peter and Takáč, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog., 144(1-2):1-38, 2014.

Saha, Ankan and Tewari, Ambuj. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM J. Optimiz., 23(1):576-601, 2013.

Strohmer, Thomas and Vershynin, Roman. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262-278, 2009.

Sun, Ruoyu and Hong, Mingyi. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In Proc. NIPS'15, 2015.

Tseng, Paul and Yun, Sangwoon. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog., 117(1-2):387-423, 2009.

Wright, Stephen J. Coordinate descent algorithms. Math. Prog., 151(1):3-34, 2015.

A. Omitted Proofs from Section 4

Proof of Proposition 4.2. Let F_{k−1} be the natural filtration up to iteration k − 1. Observe that, as ∇_n f(x_k) = 0:

E[∆_k | F_{k−1}] = ∇f(x_k).   (A.1)

Since x_1 is deterministic (a fixed initial point) and the only random variable ∆_1 depends on is i_1, we have:

E[a_1 ⟨∆_1, x_* − x_1⟩] = a_1 ⟨∇f(x_1), x_* − x_1⟩ = E[a_1 ⟨∇f(x_1), x_* − x_1⟩].   (A.2)

Let k > 1. Observe that a_j ⟨∆_j, x_* − x_j⟩ is measurable with respect to F_{k−1} for j ≤ k − 1. By linearity of expectation, using (A.1):

E[ ∑_{j=1}^k a_j ⟨∆_j, x_* − x_j⟩ | F_{k−1} ] = a_k ⟨∇f(x_k), x_* − x_k⟩ + ∑_{j=1}^{k−1} a_j ⟨∆_j, x_* − x_j⟩.

Taking expectations on both sides of the last equality gives a recursion on E[∑_{j=1}^k a_j ⟨∆_j, x_* − x_j⟩], which, combined with (A.2), completes the proof.

Proof of Lemma 4.5. As A_{k−1} Γ_{k−1} is measurable with respect to the natural filtration F_{k−1}, E[A_k Γ_k | F_{k−1}] ≤ A_{k−1} Γ_{k−1} is equivalent to E[A_k Γ_k − A_{k−1} Γ_{k−1} | F_{k−1}] ≤ 0. The change in the upper bound is:

A_k U_k − A_{k−1} U_{k−1} = A_k (f(y_k) − f(x_k)) + A_{k−1} (f(x_k) − f(y_{k−1})) + a_k f(x_k).

By convexity, f(x_k) − f(y_{k−1}) ≤ ⟨∇f(x_k), x_k − y_{k−1}⟩. Further, as y_k = x_k + (a_k/(p_{i_k} A_k)) I_N^{i_k} (v_k − v_{k−1}), we have, by smoothness of f(·), that f(y_k) − f(x_k) ≤ ⟨∇f(x_k), I_N^{i_k} (a_k/(p_{i_k} A_k)) (v_k − v_{k−1})⟩ + (L_{i_k} a_k²/(2 p_{i_k}² A_k²)) ‖v_k^{i_k} − v_{k−1}^{i_k}‖². Hence:

A_k U_k − A_{k−1} U_{k−1} ≤ a_k f(x_k) + ⟨∇f(x_k), A_{k−1}(x_k − y_{k−1}) + (a_k/p_{i_k}) I_N^{i_k}(v_k − v_{k−1})⟩ + (L_{i_k} a_k²/(2 p_{i_k}² A_k)) ‖v_k^{i_k} − v_{k−1}^{i_k}‖².   (A.3)

Let m_k(u) = ∑_{j=1}^k a_j ⟨∆_j, u − x_j⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² denote the function under the minimum in the definition of Λ_k. Observe that m_k(u) = m_{k−1}(u) + a_k ⟨∆_k, u − x_k⟩ and v_k = argmin_u m_k(u). Then:

m_{k−1}(v_k) = m_{k−1}(v_{k−1}) + ⟨∇m_{k−1}(v_{k−1}), v_k − v_{k−1}⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖v_k^i − v_{k−1}^i‖²
  = m_{k−1}(v_{k−1}) + (σ_{i_k}/2) ‖v_k^{i_k} − v_{k−1}^{i_k}‖²,

as v_k and v_{k−1} only differ over the block i_k and v_{k−1} = argmin_u m_{k−1}(u) (and, thus, ∇m_{k−1}(v_{k−1}) = 0).

Hence, it follows that m_k(v_k) − m_{k−1}(v_{k−1}) = a_k ⟨∆_k, v_k − x_k⟩ + (σ_{i_k}/2) ‖v_k^{i_k} − v_{k−1}^{i_k}‖², and, thus:

A_k Λ_k − A_{k−1} Λ_{k−1} = a_k f(x_k) + a_k ⟨∆_k, v_k − x_k⟩ + (σ_{i_k}/2) ‖v_k^{i_k} − v_{k−1}^{i_k}‖².   (A.4)

Combining (A.3) and (A.4):

A_k Γ_k − A_{k−1} Γ_{k−1} ≤ ⟨∇f(x_k), A_{k−1}(x_k − y_{k−1}) + (a_k/p_{i_k}) I_N^{i_k}(v_k − v_{k−1})⟩ − a_k ⟨∆_k, v_k − x_k⟩
    + (L_{i_k} a_k²/(2 p_{i_k}² A_k)) ‖v_k^{i_k} − v_{k−1}^{i_k}‖² − (σ_{i_k}/2) ‖v_k^{i_k} − v_{k−1}^{i_k}‖²
  ≤ ⟨∇f(x_k), A_{k−1}(x_k − y_{k−1}) + (a_k/p_{i_k}) I_N^{i_k}(v_k − v_{k−1})⟩ − a_k ⟨∆_k, v_k − x_k⟩,

as, by the initial assumptions, a_k² ≤ A_k p_{i_k}² σ_{i_k} / L_{i_k}.

Finally, taking expectations on both sides, and as x_k, y_{k−1}, v_{k−1} are all measurable w.r.t. F_{k−1} and by the separability of the terms in the definition of v_k:

E[A_k Γ_k − A_{k−1} Γ_{k−1} | F_{k−1}] ≤ ⟨∇f(x_k), A_k x_k − A_{k−1} y_{k−1} − a_k v_{k−1}⟩ = 0,

as, from (AAR-BCD), x_k = x̂_k = (A_{k−1}/A_k) y_{k−1} + (a_k/A_k) v_{k−1} over all the blocks except for block n, while ∇_n f(x_k) = 0, as x_k is the minimizer of f over block n, when the other blocks in x̂_k are fixed.⁷

⁷Previous version of the proof provided an incomplete justification for the last expression in the proof being equal to zero; we thank Sergey Guminov for pointing this out.

B. Efficient Implementation of AAR-BCD Iterations

Using similar ideas as in (Fercoq & Richtárik, 2015; Lin et al., 2014; Lee & Sidford, 2013), here we discuss how to efficiently implement the iterations of AAR-BCD, without requiring full-vector updates. First, due to the separability of the terms inside the minimum, between successive iterations v_k changes only over a single block. This is formalized in the following simple proposition.

Proposition B.1. In each iteration k ≥ 1, v_k^i = v_{k−1}^i, ∀i ≠ i_k, and v_k^{i_k} = v_{k−1}^{i_k} + w^{i_k}, where:

w^{i_k} = argmin_{u^{i_k}} { a_k ⟨∆_k^{i_k}, u^{i_k}⟩ + (σ_{i_k}/2) ‖u^{i_k}‖² }.

Proof. Recall the definition of v_k. We have:

v_k = argmin_u { ∑_{j=1}^k a_j ⟨∆_j, u⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² }
  = argmin_u { ∑_{j=1}^{k−1} a_j ⟨∆_j, u⟩ + a_k ⟨∆_k, u⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² }
  = argmin_u { ∑_{j=1}^{k−1} a_j ⟨∆_j, u⟩ + a_k ⟨∆_k^{i_k}, u^{i_k}⟩ + ∑_{i=1}^{n−1} (σ_i/2) ‖u^i − x_1^i‖² }
  = v_{k−1} + argmin_{u^{i_k}} { a_k ⟨∆_k^{i_k}, u^{i_k}⟩ + (σ_{i_k}/2) ‖u^{i_k}‖² },

where the third equality is by the definition of ∆_k (∆_k^i = 0 for i ≠ i_k) and the last equality follows from the block-separability of the terms under the min, after re-centering the block-i_k variable at v_{k−1}^{i_k}.
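To make Proposition B.1 concrete, here is a small sketch (assuming the ℓ2 norm, so the block subproblem has the closed form w = −(a_k/σ_{i_k}) ∆_k^{i_k}) of the single-block update of v_k; ∆_k is represented only by its nonzero block, and all names are illustrative.

    import numpy as np

    # Single-block update of v_k per Proposition B.1 (l2 norm assumed): only block i_k
    # of v changes, by w = argmin_w { a_k * <Delta_k_block, w> + (sigma_ik / 2) * ||w||^2 },
    # i.e., w = -(a_k / sigma_ik) * Delta_k_block.
    def update_v_block(v, block_indices, delta_block, a_k, sigma_ik):
        w = -(a_k / sigma_ik) * delta_block
        v = v.copy()
        v[block_indices] += w
        return v

    # Tiny usage example with made-up numbers.
    v = np.zeros(6)
    v = update_v_block(v, np.array([2, 3]), delta_block=np.array([1.0, -2.0]),
                       a_k=0.5, sigma_ik=4.0)
    print(v)   # only coordinates 2 and 3 change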

Since v_k only changes over a single block, this will imply that the changes in x_k and y_k can be localized. In particular, let us observe the patterns in the changes between successive iterations. We have that, ∀i ≠ n:

x_k^i = (A_{k−1}/A_k) y_{k−1}^i + (a_k/A_k) v_{k−1}^i = (A_{k−1}/A_k)(y_{k−1}^i − v_{k−1}^i) + v_{k−1}^i   (B.1)

and

y_k^i = x_k^i + (1/p_i)(a_k/A_k)(v_k^i − v_{k−1}^i)
  = (A_{k−1}/A_k)(y_{k−1}^i − v_{k−1}^i) + (1 − (1/p_i)(a_k/A_k))(v_{k−1}^i − v_k^i) + v_k^i.   (B.2)

Due to Proposition B.1, v_k and v_{k−1} can be computed without full-vector operations (assuming the gradients can be computed without full-vector operations, which we will show later in this section). Hence, we need to show that it is possible to replace (A_{k−1}/A_k)(y_{k−1}^i − v_{k−1}^i) with a quantity that can be computed without full-vector operations. Observe that y_0 − v_0 = 0 (from the initialization of (AAR-BCD)) and that, from (B.2):

y_k^i − v_k^i = (A_{k−1}/A_k)(y_{k−1}^i − v_{k−1}^i) + (1 − (1/p_i)(a_k/A_k))(v_{k−1}^i − v_k^i).

Dividing both sides by a_k²/A_k² and assuming that a_k²/A_k is constant over the iterations, we get:

(A_k²/a_k²)(y_k^i − v_k^i) = (A_{k−1}²/a_{k−1}²)(y_{k−1}^i − v_{k−1}^i) + (A_k²/a_k²)(1 − (1/p_i)(a_k/A_k))(v_{k−1}^i − v_k^i).   (B.3)

Let N_n denote the size of the n-th block and define the (N − N_n)-length vector u_k by u_k^i = (A_k²/a_k²)(y_k^i − v_k^i), ∀i ≠ n. Then (from (B.3)) u_k^i = u_{k−1}^i + (A_k²/a_k²)(1 − (1/p_i)(a_k/A_k))(v_{k−1}^i − v_k^i), and, hence, in iteration k, u_k changes only over block i_k. Combining with (B.1) and (B.2), we have the following lemma.

Lemma B.2. Assume that a_k²/A_k is kept constant over the iterations of AAR-BCD. Let u_k be the (N − N_n)-dimensional vector defined recursively as u_0 = 0, u_k^i = u_{k−1}^i for i ∈ {1, ..., n − 1}, i ≠ i_k, and u_k^{i_k} = u_{k−1}^{i_k} + (A_k²/a_k²)(1 − (1/p_{i_k})(a_k/A_k))(v_{k−1}^{i_k} − v_k^{i_k}). Then, ∀i ∈ {1, ..., n − 1}:

x_k^i = (a_k²/A_k²) u_{k−1}^i + v_{k−1}^i  and  y_k^i = (a_k²/A_k²) u_{k−1}^i + (1 − (1/p_i)(a_k/A_k))(v_{k−1}^i − v_k^i) + v_k^i.
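A sketch (again an illustration, not the paper's code) of the bookkeeping in Lemma B.2: only the i_k-th blocks of u and v are touched per iteration, and x_k^i, y_k^i are reconstructed on demand from (u_{k−1}, v_{k−1}, v_k); the helper names are hypothetical.

    import numpy as np

    # Bookkeeping from Lemma B.2, assuming a_k^2 / A_k is constant across iterations.
    # Only block i_k of u (and of v) is updated in iteration k; x_k^i and y_k^i for any
    # block i < n can then be reconstructed on demand without full-vector work.
    def update_u_block(u, blk, a_k, A_k, p_ik, v_prev_blk, v_new_blk):
        u = u.copy()
        u[blk] += (A_k ** 2 / a_k ** 2) * (1.0 - a_k / (p_ik * A_k)) * (v_prev_blk - v_new_blk)
        return u

    def reconstruct_x_block(u_prev_blk, v_prev_blk, a_k, A_k):
        # x_k^i = (a_k^2 / A_k^2) u_{k-1}^i + v_{k-1}^i
        return (a_k ** 2 / A_k ** 2) * u_prev_blk + v_prev_blk

    def reconstruct_y_block(u_prev_blk, v_prev_blk, v_new_blk, a_k, A_k, p_i):
        # y_k^i = (a_k^2 / A_k^2) u_{k-1}^i + (1 - a_k / (p_i A_k)) (v_{k-1}^i - v_k^i) + v_k^i
        return ((a_k ** 2 / A_k ** 2) * u_prev_blk
                + (1.0 - a_k / (p_i * A_k)) * (v_prev_blk - v_new_blk) + v_new_blk)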

Note that we will never need to explicitly compute x_k, y_k, except for the last iteration K, which outputs y_K. To formalize this claim, we need to show that we can compute the gradients ∇_i f(x_k) without explicitly computing x_k and that we can efficiently perform the exact minimization over the n-th block. This will only be possible by assuming a specific structure of the objective function, as is typical for accelerated block-coordinate descent methods (Fercoq & Richtárik, 2015; Lee & Sidford, 2013; Lin et al., 2014). In particular, we assume that for some m × N dimensional matrix M:

f(x) = ∑_{j=1}^m φ_j(e_j^T M x) + ψ(x),   (B.4)

where φ_j : R → R and ψ = ∑_{i=1}^n ψ_i : R^N → R is block-separable.

Efficient Gradient Computations. Assume for now that x_k^n can be computed efficiently (we will address this at the end of this section). Let ind denote the set of indices of the coordinates from blocks {1, 2, ..., n − 1} and denote by B the matrix obtained by selecting the columns of M that are indexed by ind. Similarly, let ind_n denote the set of indices of the coordinates from block n and let C denote the submatrix of M obtained by selecting the columns of M that are indexed by ind_n. Denote r_{u_k} = B u_k, r_{v_k} = B [v_k^1, v_k^2, ..., v_k^{n−1}]^T, r_n = C x_k^n. Let ind_{i_k} be the set of indices corresponding to the coordinates from block i_k. Then:

∇_{i_k} f(x_k) = ∑_{j=1}^m (M_{j, ind_{i_k}})^T φ_j'( (a_k²/A_k²) r_{u_{k−1}}^j + r_{v_{k−1}}^j + r_n^j ) + ∇_{i_k} ψ(x_k).   (B.5)

Hence, as long as we maintain r_{u_k}, r_{v_k}, and r_n (which does not require full-vector operations), we can efficiently compute the partial gradients ∇_{i_k} f(x_k) without ever needing to perform any full-vector operations.
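The following sketch is an illustration of this residual maintenance under the structure (B.4) with φ_j(t) = ½(t − b_j)² and ψ ≡ 0 (i.e., least squares); the class and method names are placeholders introduced here. It keeps r_u, r_v, r_n up to date and evaluates a partial gradient via (B.5) without ever forming x_k.

    import numpy as np

    # Maintaining r_u = B u, r_v = B [v^1, ..., v^{n-1}], r_n = C x^n for the structured
    # objective (B.4) with phi_j(t) = 0.5 * (t - b_j)^2 and psi = 0 (least squares).
    # Updating a single block of u or v only touches the corresponding columns of B.
    class ResidualState:
        def __init__(self, B, C, b):
            self.B, self.C, self.b = B, C, b
            self.r_u = np.zeros(B.shape[0])
            self.r_v = np.zeros(B.shape[0])
            self.r_n = np.zeros(B.shape[0])

        def add_to_u_block(self, cols, delta):      # u[cols] += delta
            self.r_u += self.B[:, cols] @ delta

        def add_to_v_block(self, cols, delta):      # v[cols] += delta
            self.r_v += self.B[:, cols] @ delta

        def set_block_n(self, x_n):                 # x^n <- x_n
            self.r_n = self.C @ x_n

        def partial_grad(self, cols, scale):        # (B.5): nabla_{i_k} f(x_k)
            Mx = scale * self.r_u + self.r_v + self.r_n     # scale = a_k^2 / A_k^2
            return self.B[:, cols].T @ (Mx - self.b)        # phi_j'(t) = t - b_j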

Efficient Exact Minimization. Suppose first that ψ(x) ≡ 0. Then:

r_n = argmin_{r ∈ R^m} { ∑_{j=1}^m φ_j( (a_k²/A_k²) r_{u_{k−1}}^j + r_{v_{k−1}}^j + r^j ) },

and r_n can be computed by solving m single-variable minimization problems, which can be done in closed form or with a very low complexity. Computing r_n is sufficient for defining all algorithm iterations, except for the last one (that outputs a solution). Hence, we only need to compute x_k^n once – in the last iteration.

More generally, x_k^n is determined by solving:

x_k^n = argmin_{x ∈ R^{N_n}} { ∑_{j=1}^m φ_j( (a_k²/A_k²) r_{u_{k−1}}^j + r_{v_{k−1}}^j + (Cx)^j ) + ψ_n(x) }.

When m and N_n are small, high-accuracy polynomial-time convex optimization algorithms are computationally inexpensive, and x_k^n can be computed efficiently.

In the special case of linear and ridge regression, x_k^n can be computed in closed form, with minor preprocessing. In particular, if b is the vector of labels, then the problem becomes:

x_k^n = argmin_{x ∈ R^{N_n}} { (1/2) ∑_{j=1}^m ( (a_k²/A_k²) r_{u_{k−1}}^j + r_{v_{k−1}}^j + (Cx)^j − b^j )² + (λ/2) ‖x‖_2² },

where λ = 0 in the case of (simple) linear regression. Let b' = b − (a_k²/A_k²) r_{u_{k−1}} − r_{v_{k−1}}. Then:

x_k^n = (C^T C + λI)^† (C^T b'),

where (·)^† denotes the matrix pseudoinverse, and I is the identity matrix. Since C^T C + λI does not change over iterations, (C^T C + λI)^† can be computed only once at initialization. Recall that C^T C + λI is an N_n × N_n matrix, where N_n is the size of the n-th block, and thus inverting C^T C + λI is computationally inexpensive as long as N_n is not too large. This reduces the overall per-iteration cost of the exact minimization to about the same cost as for performing gradient steps.
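A short sketch of this closed-form exact minimization over block n for (ridge) regression, with the pseudoinverse factor precomputed once; it is illustrative, and b_prime stands in for the shifted labels b' defined above.

    import numpy as np

    # Closed-form exact minimization over block n for (ridge) regression:
    # x^n = (C^T C + lambda I)^+ (C^T b'), with the pseudoinverse precomputed once.
    rng = np.random.default_rng(0)
    m, N_n, lam = 500, 10, 0.1
    C = rng.standard_normal((m, N_n))
    P = np.linalg.pinv(C.T @ C + lam * np.eye(N_n))   # computed once at initialization

    def exact_min_block_n(b_prime):
        return P @ (C.T @ b_prime)

    b_prime = rng.standard_normal(m)                  # stands in for b - (a_k^2/A_k^2) r_u - r_v
    x_n = exact_min_block_n(b_prime)
    print(x_n.shape)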