On Matching Pursuit and Coordinate Descent

Francesco Locatello *1 2, Anant Raj *1, Sai Praneeth Karimireddy 3, Gunnar Rätsch 2, Bernhard Schölkopf 1, Sebastian U. Stich 3, Martin Jaggi 3

Abstract

Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear O(1/t) rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate O(1/t^2) for matching pursuit and steepest coordinate descent on convex objectives.

1. Introduction

In this paper we address the following problem:

    min_{x ∈ lin(A)} f(x),    (1)

where f is a convex function. The minimization is over a linear space, which is parametrized as the set of linear combinations of elements from a given set A. These elements of A are called atoms. In the most general setting, A is assumed to be a compact but not necessarily finite subset of a Hilbert space, i.e., a linear space equipped with an inner product, complete in the corresponding norm. Problems of the form (1) are tackled by a multitude of first-order optimization methods and are of paramount interest in the machine learning community (Seber & Lee, 2012; Meir & Rätsch, 2003; Schölkopf & Smola, 2001; Menard, 2018; Tibshirani, 2015).

Traditionally, matching pursuit (MP) algorithms were introduced to solve the inverse problem of representing a measured signal by a sparse combination of atoms from an over-complete basis (Mallat & Zhang, 1993). In other words, the solution of the optimization problem (1) is formed as a linear combination of few of the elements of the atom set A, i.e., a sparse representation. At each iteration, the MP algorithm picks a direction from A according to the gradient information and takes a step. This procedure is not limited to atoms of fixed dimension. Indeed, lin(A) can be an arbitrary linear subspace of the ambient space, and we are interested in finding the minimizer of f only on this domain; see e.g. (Gillis & Luce, 2018). Conceptually, MP stands in the middle between coordinate descent (CD) and gradient descent, as the algorithm is allowed to descend the function along a prescribed set of directions which does not necessarily correspond to coordinates. This is particularly important for machine learning applications, as it translates to a sparse representation of the iterates in terms of the elements of A while maintaining convergence guarantees (Lacoste-Julien et al., 2013; Locatello et al., 2017b).

The first analysis of the MP algorithm in the optimization sense, i.e., solving the template (1) without incoherence assumptions, was done by (Locatello et al., 2017a). To prove convergence, they exploit the connection between MP and the Frank-Wolfe (FW) algorithm (Frank & Wolfe, 1956), a popular projection-free algorithm for the constrained optimization case. On the other hand, steepest coordinate descent is a special case of MP (when the atom set is the L1-ball). This is particularly important, as the CD rates can be deduced from the MP rates. Furthermore, the literature on coordinate descent is currently much richer than the one on MP. Therefore, understanding the connection between the two classes of CD and MP-type algorithms is a main goal of this paper, and it results in benefits for both sides of the spectrum.

*Equal contribution. 1 Max Planck Institute for Intelligent Systems, Tübingen, Germany. 2 Dept. of Computer Science, ETH Zurich, Zurich, Switzerland. 3 EPFL, Lausanne, Switzerland. Correspondence to: Francesco Locatello, Anant Raj.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s). arXiv:1803.09539v7 [stat.ML] 31 May 2019.

In particular, the contributions of this paper are:

• We present an affine invariant convergence analysis for matching pursuit algorithms solving (1). Our approach is tightly related to the analysis of coordinate descent

and relies on the properties of the atomic norm in order to generalize from coordinates to atoms.
• Using our analysis, we present the tightest known linear and sublinear convergence rates for steepest coordinate descent, improving the constants in the rates of (Stich et al., 2017; Nutini et al., 2015).
• We discuss the convergence guarantees of Random Pursuit (RP) methods, which we analyze through the lens of MP. In particular, we present a unified analysis of both MP and RP which allows us to carefully trade off the use of (approximate) steepest directions against random ones.
• We prove the first known accelerated rate for MP, as well as for steepest coordinate descent. As a consequence, we also demonstrate an improvement upon the accelerated random CD rate by performing a steepest coordinate update instead.

Related Work: Matching pursuit was introduced in the context of sparse recovery (Mallat & Zhang, 1993), and later, fully corrective variants similar to the one used in Frank-Wolfe (Holloway, 1974; Lacoste-Julien & Jaggi, 2015; Kerdreux et al., 2018) were introduced under the name of orthogonal matching pursuit (Chen et al., 1989; Tropp, 2004). The classical literature for MP-type methods is typically focused on recovery guarantees for sparse signals, and the convergence depends on very strong assumptions (from an optimization perspective), such as incoherence or restricted isometry properties of the atom set (Tropp, 2004; Davenport & Wakin, 2010). Convergence rates with incoherent atom sets are presented in (Gribonval & Vandergheynst, 2006; Temlyakov, 2013; 2014; Nguyen & Petrova, 2014). Boosting can also be seen as a generalized coordinate descent method over a hypothesis class (Rätsch et al., 2001; Meir & Rätsch, 2003).

The idea of following a prescribed set of directions also appears in the field of derivative-free methods. For instance, the early method of pattern search (Hooke & Jeeves, 1961; Dennis & Torczon, 1991; Torczon, 1997) explores the search space by probing function values along predescribed directions ("patterns" or atoms). This method is in some sense orthogonal to the approach here: by probing the function values along all atoms, one aims to find a direction along which the function decreases (and the absolute value of the scalar product with the gradient is potentially small). MP does not access the function value, but computes the gradient, picks the atom with the smallest scalar product with the gradient, and then moves to a point where the function value decreases.

The description of random pursuit appears already in the work of Mutseniyeks & Rastrigin (1964), and it was first analyzed by Karmanov (1974b;a) and Zieliński & Neumann (1983). More recently, random pursuit was revisited in (Stich et al., 2013; Stich, 2014).

Acceleration of first-order methods was first developed in (Nesterov, 1983). An accelerated CD method was described in (Nesterov, 2012). The method was extended in (Lee & Sidford, 2013) to non-uniform sampling, and later in (Stich, 2014) to optimization along arbitrary random directions. Recently, optimal rates have been obtained for accelerated CD (Nesterov & Stich, 2017; Allen-Zhu et al., 2016). A close setup is the accelerated algorithm presented in (El Halabi et al., 2017), which minimizes a composite problem of a convex function on R^n with a non-smooth regularizer that acts as a prior on the structure of the space. Contrary to our setting, that approach is restricted to linearly independent atoms. Simultaneously at ICML 2018, Lu et al. (2018) proposed an accelerated rate for semi-greedy coordinate descent, which is a special case of our accelerated MP algorithm.

Notation: Given a non-empty subset A of some Hilbert space, let conv(A) be the convex hull of A, and let lin(A) denote its linear span. Given a closed set A, we call its diameter diam(A) = max_{z1,z2∈A} ‖z1 − z2‖ and its radius radius(A) = max_{z∈A} ‖z‖. The atomic norm of x over a set A (also known as the gauge function of conv(A)) is ‖x‖_A := inf{c > 0 : x ∈ c · conv(A)}. We call a subset A of a Hilbert space symmetric if it is closed under negation.

2. Revisiting Matching Pursuit

Let H be a Hilbert space with associated inner product ⟨x, y⟩ for all x, y ∈ H. The inner product induces the norm ‖x‖^2 := ⟨x, x⟩ for all x ∈ H. Let A ⊂ H be a compact and symmetric set (the "set of atoms" or dictionary), and let f : H → R be convex and L-smooth (L-Lipschitz gradient in the finite-dimensional case). If H is an infinite-dimensional Hilbert space, then f is assumed to be Fréchet differentiable.

Algorithm 1 Generalized Matching Pursuit
1: init x_0 ∈ lin(A)
2: for t = 0 . . . T
3:   Find z_t := (Approx-)LMO_A(∇f(x_t))
4:   x_{t+1} := x_t − (⟨∇f(x_t), z_t⟩ / (L‖z_t‖^2)) z_t
5: end for

In each iteration, MP queries a linear minimization oracle (LMO) to find the steepest descent direction among the set A:

    LMO_A(y) := arg min_{z∈A} ⟨y, z⟩,    (2)

for a given query vector y ∈ H. This key subroutine is shared with the Frank-Wolfe method (Frank & Wolfe, 1956; Jaggi, 2013) as well as steepest coordinate descent.
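To make the template concrete, the following is a minimal sketch of Algorithm 1 with an exact LMO over a finite symmetric dictionary. The problem instance, dictionary size, and seed are our own illustrative choices (not from the paper); we use the least-squares objective f(x) = 0.5‖y − x‖^2, for which Algorithm 1 recovers classical MP with L = 1.

```python
# Sketch of Algorithm 1 (generalized matching pursuit) with an exact LMO
# over a finite symmetric atom set; illustrative parameters only.
import numpy as np

rng = np.random.default_rng(0)


def lmo(atoms, grad):
    """Exact linear minimization oracle: argmin_{z in A} <grad, z>."""
    return atoms[np.argmin(atoms @ grad)]


def matching_pursuit(grad_f, atoms, L, x0, steps):
    """x_{t+1} = x_t - (<grad, z> / (L * ||z||^2)) * z, as in Algorithm 1."""
    x = x0.copy()
    for _ in range(steps):
        g = grad_f(x)
        z = lmo(atoms, g)
        x = x - (g @ z) / (L * (z @ z)) * z
    return x


d, n_atoms = 10, 100
y = rng.standard_normal(d)                 # signal to approximate
A = rng.standard_normal((n_atoms, d))
A = np.vstack([A, -A])                     # make the dictionary symmetric

grad_f = lambda x: x - y                   # gradient of 0.5*||y - x||^2, L = 1
x_hat = matching_pursuit(grad_f, A, L=1.0, x0=np.zeros(d), steps=300)
final_gap = 0.5 * np.sum((y - x_hat) ** 2)
```

Since the symmetric dictionary spans the whole space here, the suboptimality gap shrinks at every iteration and becomes negligible after a few hundred steps.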

Indeed, finding the steepest coordinate is equivalent to minimizing Equation (2). The MP update step minimizes the quadratic upper bound g_{x_t}(x) = f(x_t) + ⟨∇f(x_t), x − x_t⟩ + (L/2)‖x − x_t‖^2 of f at x_t along the direction z returned by the LMO, where L is an upper bound on the smoothness constant of f with respect to the Hilbert norm ‖·‖. For f(x) = (1/2)‖y − x‖^2 with y ∈ H, Algorithm 1 recovers the classical MP algorithm (Mallat & Zhang, 1993).

The LMO. Greedy and projection-free optimization algorithms such as Frank-Wolfe and matching pursuit rely on the property that the result of the LMO is a descent direction, which translates to an alignment assumption between the search direction returned by the LMO (i.e., z_t in Algorithm 1) and the gradient of the objective at the current iterate (see (Locatello et al., 2017b), (Pena & Rodriguez, 2015, third premise) and (Torczon, 1997, Lemma 12 and proof of Proposition 6.4)). Specifically, for Algorithm 1, a symmetric atom set A ensures that ⟨∇f(x_t), z_t⟩ < 0 as long as x_t is not optimal yet. Indeed, we then have that min_{z∈A} ⟨∇f(x_t), z⟩ = min_{z∈conv(A)} ⟨∇f(x_t), z⟩ < 0, where the inequality comes from symmetry, as z = 0 ∈ conv(A). Note that an alternative sufficient condition instead of symmetry is that A is the atomic ball of a norm (the so-called atomic norm (Chandrasekaran et al., 2012)).

Steepest Coordinate Descent. In the case when A is the L1-ball, the MP algorithm becomes identical to steepest coordinate descent (Nesterov, 2012). Indeed, due to the symmetry of A, one can rewrite the LMO problem as i_t = arg max_i |∇_i f(x)|, where ∇_i is the i-th component of the gradient, i.e., ⟨∇f(x), e_i⟩ with e_i being one of the natural basis vectors. Then the update step can be written as:

    x_{t+1} := x_t − (1/L) ∇_{i_t} f(x_t) e_{i_t}.

Note that by assuming a symmetric atom set and solving the LMO problem as defined in (2), the steepest atom is aligned with the negative gradient; therefore, the positive stepsize −⟨∇f(x_t), z_t⟩ / (L‖z_t‖^2) decreases the objective.

2.1. Affine Invariant Algorithm

In this section, we present our new affine invariant algorithm for the optimization problem (1). We first explain in Definition 1 what it means for an optimization algorithm to be affine invariant:

Definition 1. An optimization method is called affine invariant if it is invariant under affine transformations of the input problem: if one chooses any re-parameterization of the domain Q by a surjective linear or affine map M : Q̂ → Q, then the "old" and "new" optimization problems min_{x∈Q} f(x) and min_{x̂∈Q̂} f̂(x̂) for f̂(x̂) := f(Mx̂) look the same to the algorithm.

In other words, a step of the algorithm in the original optimization problem is the same as a step in the transformed problem. We will further demonstrate in the appendix that the proposed Algorithm 2, which we discuss later in detail, is indeed affine invariant. In order to obtain an affine invariant algorithm, we define an affine invariant notion of smoothness using the atomic norm. This notion is inspired by the curvature constant employed in FW and MP; see (Jaggi, 2013; Locatello et al., 2017a). We define:

    L_A := sup_{x,y∈lin(A), y=x+γz, ‖z‖_A=1, γ∈R_{>0}} (2/γ^2) [f(y) − f(x) − ⟨∇f(x), y − x⟩].    (4)

This definition combines the complexity of the function f as well as the set A into a single number, and it is affine invariant under transformations of our input problem (1). It yields the same upper bound on the function as the one given by the traditional smoothness definition, that is, L_A-smoothness with respect to the atomic norm ‖·‖_A when x, y are constrained to the set lin(A):

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_A/2)‖y − x‖_A^2.
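The equivalence with steepest coordinate descent can be checked numerically. The snippet below is a small check we add (the gradient vector is an arbitrary example): with A the set of signed coordinate vectors, the LMO in (2) selects the coordinate with the largest absolute gradient entry, and the resulting MP update coincides with the steepest CD update.

```python
# Check: MP over the corners of the L1-ball equals steepest coordinate descent.
import numpy as np

n = 5
grad = np.array([0.3, -2.0, 0.7, 1.1, -0.4])   # example gradient at x

atoms = np.vstack([np.eye(n), -np.eye(n)])     # corners of the L1-ball
z = atoms[np.argmin(atoms @ grad)]             # LMO_A(grad) from Eq. (2)

i_star = int(np.argmax(np.abs(grad)))          # steepest coordinate
L = 1.0
x = np.zeros(n)
mp_step = x - (grad @ z) / (L * (z @ z)) * z   # MP update along z
cd_step = x.copy()
cd_step[i_star] -= grad[i_star] / L            # steepest CD update
```

Both updates modify only the coordinate of largest absolute gradient, by the same amount.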
For example, if A is the L1-ball we obtain f(x + γz) ≤ f(x) + γ⟨∇f(x), z⟩ + (L_1/2)γ^2, where ‖z‖_1 = 1. Based on the affine invariant notion of smoothness defined above, we now present the pseudocode of our affine invariant method in Algorithm 2.

Algorithm 2 Affine Invariant Generalized Matching Pursuit
1: init x_0 ∈ lin(A)
2: for t = 0 . . . T
3:   Find z_t := (Approx-)LMO_A(∇f(x_t))
4:   x_{t+1} := x_t − (⟨∇f(x_t), z_t⟩ / L_A) z_t
5: end for

The above algorithm looks very similar to the generalized MP (Algorithm 1); the main difference is that while the original algorithm is not affine invariant over the domain Q = lin(A) (Definition 1), the new Algorithm 2 is, due to the use of the generalized smoothness constant L_A.

Approximate linear oracles. Exactly solving the LMO defined in (2) can be costly in practice, both in the MP and in the CD setting, as A can contain (infinitely) many atoms. On the other hand, approximate versions can be much more efficient. Algorithm 1 allows for an approximate LMO. Different notions of such an LMO were explored for MP and OMP in (Mallat & Zhang, 1993) and (Tropp, 2004), respectively, for the Frank-Wolfe framework in (Jaggi, 2013; Lacoste-Julien et al., 2013), and for coordinate descent in (Stich et al., 2017). For a given quality parameter δ ∈ (0, 1] and a given direction d ∈ H, the approximate LMO for Algorithm 1 returns a vector z̃ ∈ A such that

    ⟨d, z̃⟩ ≤ δ⟨d, z⟩,    (3)

relative to z = LMO_A(d) being an exact solution.

Note. For the purpose of the analysis, we call x* the minimizer of problem (1). If the optimum is not unique, we pick the one with the largest atomic norm, as it represents the worst case for the analysis. All the proofs are deferred to the appendix.

2.1.1. NEW AFFINE INVARIANT SUBLINEAR RATE

In this section, we provide the theoretical justification of our proposed approach for smooth functions (sublinear rate) and its theoretical comparison with existing analyses for special cases. We define the level set radius measured with the atomic norm as:

    R_A^2 := max_{x∈lin(A), f(x)≤f(x_0)} ‖x − x*‖_A^2.    (5)

When we measure this radius with ‖·‖_2 we call it R_2^2, and when we measure it with ‖·‖_1 we call it R_1^2. Note that measuring smoothness using the atomic norm guarantees that the following holds for the Lipschitz constant L_A:

Lemma 2. Assume f is L-smooth w.r.t. a given norm ‖·‖ over lin(A), where A is symmetric. Then,

    L_A ≤ L radius_{‖·‖}(A)^2.    (6)

For example, in the coordinate descent setting we measure smoothness with the atomic norm being the L1-norm. Lemma 2 implies that L_A ≤ L_1 ≤ L_2, where L_2 is the smoothness constant measured with the L2-norm. Note that the radius of the L1-ball measured with ‖·‖_1 is 1. Therefore, we put ourselves in a more general setting than Algorithm 1, showing convergence of the affine invariant Algorithm 2. We are now ready to prove the convergence rate of Algorithm 2 for smooth functions.

Theorem 3. Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm over lin(A). Let f be convex and L_A-smooth w.r.t. the norm ‖·‖_A over lin(A), and let R_A be the radius of the level set of x_0 measured with the atomic norm. Then, Algorithm 2 converges for t ≥ 0 as

    f(x_{t+1}) − f(x*) ≤ 2 L_A R_A^2 / (δ^2 (t + 2)),

where δ ∈ (0, 1] is the relative accuracy parameter of the employed approximate LMO (3).

Discussion. The proof of Theorem 3 extends the convergence analysis of steepest coordinate descent. As opposed to the classical proof in (Nesterov, 2012), the atoms are here not orthogonal to each other, do not have the same norm, and do not correspond to the coordinates of the ambient space. Indeed, lin(A) could be a subset of the ambient space, and the only assumptions on A are that it is closed and bounded and that ‖·‖_A is a norm over lin(A). We do not make any incoherence assumption. The key element of our proof is the definition of smoothness using the atomic norm. Furthermore, we use the properties of the atomic norm to obtain a proof which shares the spirit of Nesterov's without having to rely on strong assumptions on A.

Relation to Previous MP Sublinear Rate. The sublinear convergence rate presented in Theorem 3 is fundamentally different in spirit from the one proved in (Locatello et al., 2017a). Indeed, their convergence analysis builds on top of the proof technique used for Frank-Wolfe in (Jaggi, 2013). They introduce a dependency on the atomic norm of the iterates as a way to constrain the part of the space in which the optimization takes place, which artificially induces a notion of duality gap. They do so by defining ρ := max{‖x*‖_A, ‖x_0‖_A, . . . , ‖x_T‖_A} < ∞. (Locatello et al., 2017a) also used an affine invariant notion of smoothness, thus obtaining an affine invariant rate. On the other hand, their notion of smoothness depends explicitly on ρ. While this constant can be further upper bounded by the level set radius, it is not known a priori, which makes the estimation of the smoothness constant problematic (as it is needed in the algorithm) and the proof technique more involved. We propose a much more elegant solution, which uses a different affine invariant definition of smoothness that depends explicitly on the atomic norm. Furthermore, we get rid of the dependency on the sequence of iterates by using only properties of the atomic norm, without any additional assumption (such as finiteness of ρ).

Relation to Steepest Coordinate Descent. From our analysis, we can readily recover existing rates for coordinate descent. Indeed, if A is the L1-ball in an n-dimensional space, the rate of Theorem 3 with exact oracle can be written as:

    f(x_{t+1}) − f(x*) ≤ 2 L_1 R_1^2 / (t + 2) ≤ 2 L_2 R_1^2 / (t + 2) ≤ 2 L_2 n R_2^2 / (t + 2),

where the first inequality is our rate, the second inequality is the rate of (Stich et al., 2017), and the last inequality is the rate given in (Nesterov, 2012), both with global Lipschitz constant. Therefore, by measuring smoothness with the atomic norm, we have shown a tighter dependency on the dimensionality of the space. Indeed, among the known rates, the atomic norm gives the tightest norm in which to measure the product between the smoothness of the function and the level set radius. Therefore, our rate for steepest coordinate descent is the tightest known.¹

¹Note that for coordinate-wise L our definition is equivalent to the classical one. L_A ≤ L_2 if the norm is defined over more than one dimension (i.e., blocks); otherwise there is equality. For the relationship of L1-smoothness to coordinate-wise smoothness, see also (Karimireddy et al., 2019, Theorem 4 in Appendix).
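Lemma 2 can be illustrated on a quadratic, a worked example we add here (the matrix is an arbitrary random PSD instance). For f(x) = 0.5 x^T Q x with Q positive semi-definite and A the L1-ball, the curvature in (4) is sup_{‖z‖_1 = 1} z^T Q z; since z^T Q z is convex, the supremum over the L1-ball is attained at a corner ±e_i, giving max_i Q_ii. The L2-smoothness constant is λ_max(Q), and the L1-ball has radius 1 in ‖·‖_2, so Lemma 2 predicts L_A ≤ L_2.

```python
# Numeric illustration of Lemma 2 for a quadratic f(x) = 0.5 * x^T Q x:
# L_A (w.r.t. the L1 atomic norm) = max_i Q_ii, while L_2 = lambda_max(Q).
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8))
Q = B @ B.T                              # random PSD matrix

L_A = float(np.max(np.diag(Q)))          # smoothness w.r.t. ||.||_1
L_2 = float(np.max(np.linalg.eigvalsh(Q)))  # smoothness w.r.t. ||.||_2
```

For PSD matrices, every diagonal entry is bounded by the largest eigenvalue, so L_A ≤ L_2 holds as the lemma predicts.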

Coordinate Descent and Affine Transformations. But what does it mean to have an affine invariant rate for coordinate descent? By definition, it means that if one applies an affine transformation to the L1-ball, the coordinate descent algorithm in the natural basis and on the transformed domain Q̂ are equivalent. Note that in the transformed problem, the coordinates do not correspond to the natural coordinates anymore. Indeed, in the transformed domain the coordinates are ê_i = M^{-1} e_i, where M^{-1} is the inverse of the affine map M : Q̂ → Q. If one would instead perform coordinate descent in the transformed space using the natural coordinates, one would obtain not only different atoms but also a different iterate sequence. In other words, while matching pursuit is fully affine invariant, the definition of CD is not, as the choice of the coordinates is not part of the definition of the optimization problem. The two algorithms coincide for one particular choice of basis: the canonical coordinate basis for A.

2.1.2. SUBLINEAR RATE OF RANDOM PURSUIT

There is a significant literature on optimization methods which do not require full gradient information. A notable example is random coordinate descent, where only a random component of the gradient is known. As long as the direction selected by the LMO is not orthogonal to the gradient, we have convergence guarantees due to the inexact oracle definition. We now abstract from the random coordinate descent setting and analyze a randomized variant of matching pursuit, the random pursuit algorithm, in which the atom z is randomly sampled from a distribution over A rather than picked by a linear minimization oracle. This approach is particularly interesting, as it is deeply connected to the random pursuit algorithm analyzed in (Stich et al., 2013). For now, we assume that we can compute the projection of the gradient onto a single atom, ⟨∇f, z⟩, efficiently. In order to present a general recipe for any atom set, we exploit the notion of inexact oracle and define the inexactness of the expectation of the sampled direction for a given sampling distribution:

    δ̂^2 := min_{d∈lin(A)} E_{z∈A} ⟨d, z⟩^2 / ‖d‖_{A*}^2.    (7)

This constant was already used in (Stich, 2014) to measure the convergence of random pursuit (β^2 in his notation). Note that for uniform sampling from the corners of the L1-ball, we have δ̂^2 = 1/n; indeed, E_{z∈A} ⟨d, z⟩^2 = ‖d‖_2^2 / n for any d. This definition holds for any sampling scheme as long as δ̂^2 ≠ 0. Note that by using this quantity we do not get the tightest possible rate, as at each iteration we consider how much worse a random update could be compared to the optimal (steepest) update.

We are now ready to present the sublinear convergence rate of random matching pursuit.

Theorem 4. Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm. Let f be convex and L_A-smooth w.r.t. the norm ‖·‖_A over lin(A), and let R_A be the radius of the level set of x_0 measured with the atomic norm. Then, Algorithm 2 converges for t ≥ 0 as

    E_z[f(x_{t+1})] − f(x*) ≤ 2 L_A R_A^2 / (δ̂^2 (t + 2)),

when the LMO is replaced with random sampling of z from a distribution over A.

Gradient-Free Variant. It is possible to obtain a fully gradient-free optimization scheme. In addition to replacing the LMO in Algorithm 1 by random sampling as above, one can also replace the step on the quadratic upper bound given by smoothness with an approximate line search on f. As long as the update scheme guarantees as much decrease as the above algorithm, the convergence rate of Theorem 4 holds.

Discussion. This approach is very general, as it allows us to guarantee convergence for any sampling scheme and any set A, provided that δ̂^2 ≠ 0. In the coordinate descent case, for the worst possible gradient, random sampling has δ̂^2 = 1/n. Therefore, the speed-up of the steepest direction can be up to a factor equal to the number of dimensions in the best case. Similarly, if z is sampled from a spherical distribution, δ̂^2 = 1/n (Stich et al., 2013). More examples of the computation of δ̂^2 can be found in (Stich, 2014, Section 4.2). Last but not least, note that δ̂^2 is affine invariant as long as the sampling distribution over the atoms is preserved.

2.1.3. STRONG CONVEXITY AND AFFINE INVARIANT LINEAR RATES

Similar to the affine invariant notion of smoothness, we here define the affine invariant notion of strong convexity:

    µ_A := inf_{x,y∈lin(A), x≠y} (2 / ‖y − x‖_A^2) D(y, x),

where D(y, x) := f(y) − f(x) − ⟨∇f(x), y − x⟩. We can now show the linear convergence rate of both the matching pursuit algorithm and its random pursuit variant.

Theorem 5. Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm. Let f be µ_A-strongly convex and L_A-smooth w.r.t. the norm ‖·‖_A, both over lin(A). Then, Algorithm 2 converges for t ≥ 0 as

    ε_{t+1} ≤ (1 − δ^2 µ_A / L_A) ε_t,

where ε_t := f(x_t) − f(x*). If the LMO direction is sampled randomly from A, Algorithm 2 converges for t ≥ 0 as

    E_z[ε_{t+1} | x_t] ≤ (1 − δ̂^2 µ_A / L_A) ε_t.

Relation to Previous MP Linear Rate. Again, the proof of Theorem 5 extends the convergence analysis of steepest coordinate descent using solely the affine invariant definition of strong convexity and the properties of the atomic norm. Note that again we define the strong convexity constant without relying on ρ = max{‖x*‖_A, ‖x_0‖_A, . . . , ‖x_T‖_A} < ∞ as in (Locatello et al., 2017a). We now show that our choice of the strong convexity parameter is the tightest w.r.t. any choice of the norm, and that we can precisely recover the non-affine-invariant rate of (Locatello et al., 2017a). Let us recall their notion of minimal directional width, which is the crucial constant measuring the geometry of the atom set for a fixed norm:

    mDW(A) := min_{d∈lin(A), d≠0} max_{z∈A} ⟨d/‖d‖, z⟩.

Note that for CD we have mDW(A) = 1/√n. Now, we relate the affine invariant notion of strong convexity to the minimal directional width and the strong convexity w.r.t. any chosen norm. This is important, as we want to make sure to perfectly recover the convergence rate given in (Locatello et al., 2017a).

Lemma 6. Assume f is µ-strongly convex w.r.t. a given norm ‖·‖ over lin(A) and A is symmetric. Then:

    µ_A ≥ mDW(A)^2 µ.

We then recover their non-affine-invariant rate as:

    ε_{t+1} ≤ (1 − δ^2 µ mDW(A)^2 / (L radius_{‖·‖}(A)^2)) ε_t.

Relation to Coordinate Descent. When we fix A as the L1-ball and use an exact oracle, our rate becomes:

    ε_{t+1} ≤ (1 − µ_1/L_1) ε_t ≤ (1 − µ_1/L) ε_t ≤ (1 − µ/(nL)) ε_t,

where the first is our rate, the second is the rate of steepest CD (Nutini et al., 2015), and the last is the one for randomized CD (Nesterov, 2012) (n is the dimension of the ambient space). Therefore, our linear rate for coordinate descent is the tightest known.

3. Accelerating Generalized Matching Pursuit

As we established in the previous sections, matching pursuit can be considered a generalized greedy coordinate descent where the allowed directions do not need to form an orthogonal basis. This insight allows us to generalize the analysis of accelerated coordinate descent methods and to accelerate matching pursuit (Lee & Sidford, 2013; Nesterov & Stich, 2017). However, it is not clear at the outset how to even accelerate greedy coordinate descent, let alone the matching pursuit method. Recently, Song et al. (2017) proposed an accelerated greedy coordinate descent method using the linear coupling framework of (Allen-Zhu & Orecchia, 2014). However, the updates they perform at each iteration are not guaranteed to be sparse, which is critical for our application. We instead extend the acceleration technique in (Stich et al., 2013), which in turn is based on (Lee & Sidford, 2013). They allow the updates to the two sequences of iterates x and v to be chosen from any distribution. If this distribution is chosen to be over coordinate directions, we get the familiar accelerated coordinate descent; if we instead choose the distribution to be over the set of atoms, we get an accelerated random pursuit algorithm. To obtain an accelerated matching pursuit algorithm, we need to additionally decouple the updates for x and v and allow them to be chosen from different distributions. We update x using the greedy coordinate update (or the matching pursuit update), and use a random coordinate (or atom) direction to update v.

The possibility of decoupling the updates was noted in (Stich, 2014, Corollary 6.4), though its implications for accelerating greedy coordinate descent or matching pursuit were not explored. From here on out, we shall assume that the linear space spanned by the atoms A is finite dimensional. This was not necessary for the non-accelerated matching pursuit, and it remains open whether it is necessary for accelerated MP. When sampling, we consider only a non-symmetric version of the set A with all the atoms in the same half space. Line search ensures that sampling either z or −z yields the same update. For simplicity, we focus on an exact LMO.

3.1. From Coordinates to Atoms

For the acceleration of MP we make some stronger assumptions w.r.t. the rates in the previous section. In particular, we will not obtain an affine invariant rate, which remains an open problem. The key challenges for an affine invariant accelerated rate are the strong convexity of the model, which can be addressed using arguments similar to (d'Aspremont et al., 2018), and the fact that our proof relies on defining a new norm which deforms the space in order to obtain favorable sampling properties, as we explain in this section. The main difference between working with atoms and working with coordinates is that projection along coordinate basis vectors is 'unbiased'. Let e_i represent the i-th coordinate basis vector. Then, for some vector d, if we project along a random basis vector e_i,

    E_{i∈[n]}[⟨e_i, d⟩ e_i] = (1/n) d.

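The two averaging identities used in this paper for uniform coordinate sampling can be verified numerically. The snippet below is a small check we add (dimension and test vector are arbitrary): (i) projecting onto a uniformly random coordinate basis vector is unbiased up to a 1/n factor, and (ii) for uniform sampling from the corners of the L1-ball, E_z ⟨d, z⟩^2 = ‖d‖_2^2 / n, which underlies δ̂^2 = 1/n in Section 2.1.2.

```python
# Sanity checks for uniform sampling over signed coordinate atoms.
import numpy as np

n = 6
d = np.arange(1.0, n + 1.0)                   # arbitrary test vector
basis = np.eye(n)

# (i) E_i[<e_i, d> e_i] over the n coordinate directions equals d / n
mean_proj = sum((e @ d) * e for e in basis) / n

# (ii) E_z <d, z>^2 over the 2n corners {+-e_i} equals ||d||_2^2 / n
corners = np.vstack([basis, -basis])
mean_sq = float(np.mean((corners @ d) ** 2))
```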
However if instead of coordinate basis, we choose from a by defining two different quadratic subproblems, one upper set of atoms A, then this is no longer true. We can correct bound given by smoothness, and one lower bound given by a for this by morphing the geometry of the space. Suppose model of the function. The constraints on the set of possible we sample the atoms from a distribution Z defined over A. descent direction implicitly used in MP influence both these Let us define subproblems. While the smoothness quadratic upper bound ˜ > P := Ez∼Z [zz ] . contains information about A in its definition (y = x + γz and kzkA = 1), the model of the function needs explicit We assume that the distribution Z is such that lin(A) ⊆ modeling of A. This is particularly crucial when sampling a range(P˜ ). This intuitively corresponds to assuming that direction in the model update, which can be thought as a sort there is a non-zero probability that the sampled z ∼ Z is of exploration part of the algorithm. In both the algorithms, along the direction of every atom zt ∈ A i.e. the update of the parameter v corresponds to optimizing the modeling function ψ which can be given as : Pz∼Z [hz, zti > 0] > 0, ∀zt ∈ A . ψ (x) = ψ (x)+ ˜ † ˜ t+1 t Further let P = P be the pseudo-inverse of P. Note that   ˜ > > both P and P are positive semi-definite matrices. We can αt+1 f(yt) + hz˜t ∇f(yt), z˜t P(x − yt)i , (8) equip our space with a new inner product h·, P·i and the resulting norm k·k . With this new dot product, 1 2 P where ψ0(x) = 2 kx − x0kP. > † Lemma 7. The update of v in Algorithm3 and4 minimizes Ez∼Z [hz, Pdiz] = Ez∼Z [zz ]Pd = P Pd = d . the model The last equality follows from our assumption that lin(A) ⊆ vt ∈ arg min ψt(x) . range(P˜ ). x We will be first discussing the theory for the greedy accel- Algorithm 3 Accelerated Random Pursuit erated method in detail. 
As evident from the algorithm4, 0 1: init x0 = v0 = y0, β0 = 0, and ν another important constant which is required for both the 2: for t = 0, 1 ...T analysis and to actually run the algorithm is ν for which: 2 0 3: Solve αt+1Lν = βt + αt+1 h > 2 2 i 2 4: βt+1 := βt + αt+1 E (z˜t d) kz˜tkP kz(d)k2 αt+1 5: τt := ν ≤ max > 2 , βt+1 d∈lin(A) (z(d) d) 6: Compute yt := (1 − τt)xt + τtvt 7: Sample zt ∼ Z where z(d) is defined to be h∇f(yt),zti 8: xt+1 := yt − Lkz k2 zt t 2 z(d) = LMOA(−d) . 9: vt+1 := vt − αt+1h∇f(yt), ztizt 10: end for The quantity ν relates the geometry of the atom set with the sampling procedure in a similar way as δˆ2 in Equation (7) Algorithm 4 Accelerated Matching Pursuit but instead of measuring how much worse a random update 1: init x0 = v0 = y0, β0 = 0, and ν is when compared to a steepest update. 2: for t = 0, 1 ...T Let be a convex function and be a sym- 2 Theorem 8. f A 3: Solve αt+1Lν = βt + αt+1 metric compact set. Then the output of algorithm4 for any 4: βt+1 := βt + αt+1 t ≥ 1 converges with the following rate: αt+1 5: τt := β t+1 2Lν 6: Compute yt := (1 − τt)xt + τtvt ? ? 2 E[f(xt)] − f(x ) ≤ kx − x0kP . 7: Find zt := LMOA(∇f(yt)) t(t + 1) h∇f(yt),zti 8: xt+1 := yt − 2 zt Lkztk2 9: Sample z˜t ∼ Z Proof. We extend the proof technique of (Lee & Sidford, 10: vt+1 := vt − αt+1h∇f(yt), z˜tiz˜t 2013; Stich et al., 2013) to allow for general atomic updates. 11: end for The analysis can be found in Appendix C.1

3.2. Analysis

Modeling explicitly the dependency on the structure of the atom set is crucial to accelerate MP. Once we understand the convergence of the greedy approach, the analysis of accelerated random pursuit can be derived easily. Here, we state the rate of convergence for accelerated random pursuit:

Theorem 9. Let f be a convex function and A be a symmetric set. Then the output of Algorithm 3 for any t ≥ 1 converges with the following rate:
\[ \mathbb{E}\big[f(x_t)\big] - f(x^\star) \le \frac{2 L \nu'}{t(t+1)} \|x^\star - x_0\|_P^2 , \]
where
\[ \nu' \le \max_{d \in \mathrm{lin}(A)} \frac{\mathbb{E}\big[(z_t^\top d)^2 \|z_t\|_P^2\big]}{\mathbb{E}\big[(z_t^\top d)^2 / \|z_t\|_2^2\big]} . \]

Discussion on Greedy Accelerated Coordinate Descent. The convergence rate for greedy accelerated coordinate descent can be obtained directly from the rate for accelerated matching pursuit. Let the atom set A consist of the standard basis vectors {e_i, i ∈ [n]} and let Z be a uniform distribution over this set. Then Algorithm 3 reduces to the accelerated randomized coordinate method (ACDM) of (Lee & Sidford, 2013; Nesterov & Stich, 2017) and we recover their rates. If we instead use Algorithm 4, we obtain a novel accelerated greedy coordinate method with a (potentially) better convergence rate.²

[Figure 1. Loss for synthetic data; curves: random, steepest, accelerated random, accelerated steepest (function value vs. iteration).]

Lemma 10. When A = {e_i, i ∈ [n]} and Z is a uniform distribution over A, then P = nI, ν′ = n and ν ∈ [1, n].

4. Empirical Evaluation

In this section we aim to empirically validate our theoretical findings. In both experiments we use 1 and the intrinsic dimensionality of lin(A) as ν and ν′, respectively. Note that a value of ν smaller than ν′ represents the best case for the steepest update: we implicitly assume that the worst case, in which a random update is as good as the steepest one, never happens.

Toy Data: First, we report the function value while minimizing the squared distance between a random 100-dimensional signal, with both positive and negative entries, and its sparse representation in terms of atoms. We sample a random dictionary containing 200 atoms, which we then make symmetric. The result is depicted in Figure 1. We notice that, as expected, steepest matching pursuit converges faster than random pursuit, but both converge in the same regime. On the other hand, the accelerated schemes converge much faster than the non-accelerated variants. Note that the acceleration kicks in only after a few iterations, as the accelerated rate has a worse dependency on the intrinsic dimensionality of the linear span than the non-accelerated algorithms.

Real Data: We use the under-sampled Urban hyperspectral image (HSI) dataset, from which we extract the dictionary of atoms using the hierarchical clustering approach of (Gillis et al., 2015). This dataset contains 5'929 pixels, each associated with 162 hyperspectral features. The number of dictionary elements is 6, motivated by the fact that 6 different physical materials are depicted in this HSI data (Gillis & Luce, 2018). We approximate each pixel with a linear combination of the dictionary elements by minimizing the squared distance between the observed pixel and our approximation. We report in Figure 2 the loss as an average across all the pixels:
\[ \min_{x_i \in \mathrm{lin}(A)} \frac{1}{N} \sum_{i=1}^{N} \|x_i - b_i\|^2 . \]

[Figure 2. Loss for hyperspectral data (function value vs. iteration).]

As anticipated from our analysis, the accelerated schemes converge much faster than the non-accelerated variants. Furthermore, in both cases the steepest update converges faster than the random one, due to a better dependency on the dimensionality of the space. We notice that the speedup of steepest MP is much more evident in the synthetic data. The reason is that this experiment is much more high dimensional than the hyperspectral one: the span of the dictionary is a 6-dimensional manifold in the latter and the full ambient space in the former, and the steepest update yields a better dependency on the dimensionality.

² Simultaneously (and independently), (Lu et al., 2018) derived the same accelerated greedy coordinate algorithm.
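A condensed version of the synthetic experiment can be sketched as follows (dimensions and iteration counts are scaled down from the paper's setup, and only the plain, non-accelerated pursuit variants are shown):

```python
import numpy as np

# Minimize f(x) = 0.5 * ||x - b||^2 over the span of a random symmetric
# dictionary, comparing random and steepest (greedy) matching pursuit.
rng = np.random.default_rng(2)
dim, n_atoms, T = 20, 40, 1000
D = rng.standard_normal((dim, n_atoms))
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms
D = np.hstack([D, -D])               # make the atom set symmetric
b = rng.standard_normal(dim)
L = 1.0                              # f is 1-smooth

def run(select):
    x, losses = np.zeros(dim), []
    for _ in range(T):
        g = x - b
        z = D[:, select(g)]
        x = x - (g @ z) / L * z      # MP step (exact for unit atoms)
        losses.append(0.5 * np.sum((x - b) ** 2))
    return losses

greedy = run(lambda g: int(np.argmin(D.T @ g)))        # LMO: steepest atom
random_ = run(lambda g: int(rng.integers(D.shape[1]))) # random atom
print(f"greedy {greedy[-1]:.2e}  random {random_[-1]:.2e}")
```

Since each step is an exact line search along the chosen atom, both variants decrease the objective monotonically; the greedy selection simply extracts more progress per iteration, mirroring the behavior reported in Figure 1.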

5. Conclusions

In this paper we presented a unified analysis of matching pursuit and coordinate descent algorithms. We exploit the similarity between the two to obtain the best of both worlds: tight sublinear and linear rates for steepest coordinate descent, and the first accelerated rate for matching pursuit and steepest coordinate descent. Furthermore, we discussed the relation between the steepest and the random directions by viewing the latter as an approximate version of the former. An affine invariant accelerated proof remains an open problem.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

Gillis, N. and Luce, R. A fast gradient method for nonnegative sparse regression with self-dictionary. IEEE Transactions on Image Processing, 27(1):24–37, 2018.

Gillis, N., Kuang, D., and Park, H. Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.

Acknowledgements

FL is supported by the Max Planck/ETH Center for Learning Systems and partially supported by ETH core funding (to GR). SPK and MJ acknowledge support by SNSF project 175796, and SUS by the Microsoft Research Swiss JRC.

References

Allen-Zhu, Z. and Orecchia, L. Linear coupling of gradient and mirror descent: A novel, simple interpretation of Nesterov's accelerated method. arXiv.org, July 2014.

Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In ICML 2016 - Proceedings of The 33rd International Conference on Machine Learning, volume 48 of PMLR, pp. 1110–1119, 2016.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, October 2012.

Chen, S., Billings, S. A., and Luo, W. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.

d'Aspremont, A., Guzmán, C., and Jaggi, M. Optimal affine invariant smooth minimization algorithms. SIAM Journal on Optimization (and arXiv:1301.0465), 2018.

Davenport, M. A. and Wakin, M. B. Analysis of orthogonal matching pursuit using the restricted isometry property. IEEE Transactions on Information Theory, 56(9):4395–4401, 2010.

Dennis, Jr., J. and Torczon, V. Direct search methods on parallel machines. SIAM Journal on Optimization, 1(4):448–474, 1991.

El Halabi, M., Hsieh, Y.-P., Vu, B., Nguyen, Q., and Cevher, V. General proximal gradient method: A case for non-euclidean norms. Technical report, 2017.

Gribonval, R. and Vandergheynst, P. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52(1):255–261, 2006.

Holloway, C. A. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1):14–27, 1974.

Hooke, R. and Jeeves, T. A. "Direct search" solution of numerical and statistical problems. Journal of the ACM, 8(2):212–229, 1961.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pp. 427–435, 2013.

Karimireddy, S. P., Koloskova, A., Stich, S. U., and Jaggi, M. Efficient greedy coordinate descent for composite problems. AISTATS 2019, and arXiv:1810.06999, 2019.

Karmanov, V. G. On convergence of a random search method in convex minimization problems. Theory of Probability and its Applications, 19(4):788–794, 1974a. (in Russian).

Karmanov, V. G. Convergence estimates for iterative minimization methods. USSR Computational Mathematics and Mathematical Physics, 14(1):1–13, 1974b.

Kerdreux, T., Pedregosa, F., and d'Aspremont, A. Frank-Wolfe with subsampling oracle. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, 2018.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS 2015, pp. 496–504, 2015.

Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML 2013 - Proceedings of the 30th International Conference on Machine Learning, 2013.

Lee, Y. T. and Sidford, A. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS '13 - Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 147–156, 2013.

Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., and Koepke, H. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, pp. 1632–1641, 2015.

Locatello, F., Khanna, R., Tschannen, M., and Jaggi, M. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In AISTATS - Proc. International Conference on Artificial Intelligence and Statistics, 2017a.

Locatello, F., Tschannen, M., Rätsch, G., and Jaggi, M. Greedy algorithms for cone constrained optimization with convergence guarantees. In NIPS - Advances in Neural Information Processing Systems 30, 2017b.

Lu, H., Freund, R. M., and Mirrokni, V. Accelerating greedy coordinate descent methods. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, 2018.

Mallat, S. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Pena, J. and Rodriguez, D. Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint arXiv:1512.06142, 2015.

Rätsch, G., Mika, S., Warmuth, M. K., et al. On the convergence of leveraging. In NIPS, pp. 487–494, 2001.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Seber, G. A. and Lee, A. J. Linear Regression Analysis, volume 329. John Wiley & Sons, 2012.

Song, C., Cui, S., Jiang, Y., and Xia, S.-T. Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In NIPS - Advances in Neural Information Processing Systems, pp. 4841–4850, 2017.

Stich, S. U. Convex Optimization with Random Pursuit. PhD thesis, ETH Zurich, 2014. Nr. 22111.

Meir, R. and Rätsch, G. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pp. 118–183. Springer, 2003.

Menard, S. Applied Logistic Regression Analysis, volume 106. SAGE Publications, 2018.

Mutseniyeks, V. A. and Rastrigin, L. A. Extremal control of continuous multi-parameter systems by the method of random search. Eng. Cybernetics, 1:82–90, 1964.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Springer US, Boston, MA, 2004.

Stich, S. U., Müller, C. L., and Gärtner, B. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23(2):1284–1309, 2013.

Stich, S. U., Raj, A., and Jaggi, M. Approximate steepest coordinate descent. In ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, volume 70 of PMLR, pp. 3251–3259, 2017.

Temlyakov, V. Chebyshev Greedy Algorithm in convex optimization. arXiv.org, December 2013.

Temlyakov, V. Greedy algorithms in convex optimization on Banach spaces. In 48th Asilomar Conference on Signals, Systems and Computers, pp. 1331–1335. IEEE, 2014.

Tibshirani, R. J. A general framework for fast stagewise algorithms. Journal of Machine Learning Research, 16:2543–2588, 2015.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Y. and Stich, S. U. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Nguyen, H. and Petrova, G. Greedy strategies for convex optimization. Calcolo, pp. 1–18, 2014.

Torczon, V. On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1):1–25, 1997.

Tropp, J. A. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

Zieliński, R. and Neumann, P. Stochastische Verfahren zur Suche nach dem Minimum einer Funktion. Akademie-Verlag, Berlin, Germany, 1983.

Appendix

A. Sublinear Rates

Theorem 2. Assume f is L-smooth w.r.t. a given norm ‖·‖ over lin(A), where A is symmetric. Then,
\[ L_A \le L\, \mathrm{radius}_{\|\cdot\|}(A)^2 . \tag{9} \]

Proof. Let D(y, x) := f(y) − f(x) − ⟨∇f(x), y − x⟩. By the definition of smoothness of f w.r.t. ‖·‖,
\[ D(y, x) \le \frac{L}{2}\|y - x\|^2 . \]

Hence, from the definition of LA,

\[ L_A \le \sup_{\substack{x, y \in \mathrm{lin}(A) \\ y = x + \gamma z \\ \|z\|_A = 1,\, \gamma \in \mathbb{R}_{>0}}} \frac{2}{\gamma^2} \cdot \frac{L}{2} \|y - x\|^2 = L \sup_{z \,:\, \|z\|_A = 1} \|z\|^2 = L\, \mathrm{radius}_{\|\cdot\|}(A)^2 . \]

The definition of the smoothness constant w.r.t. the atomic norm yields the following quadratic upper bound:
\[ L_A = \sup_{\substack{x, y \in \mathrm{lin}(A) \\ y = x + \gamma z \\ \|z\|_A = 1,\, \gamma \in \mathbb{R}_{>0}}} \frac{2}{\gamma^2} \big[ f(y) - f(x) - \langle \nabla f(x), y - x \rangle \big] . \tag{10} \]
Furthermore, let:
\[ R_A^2 = \max_{\substack{x \in \mathrm{lin}(A) \\ f(x) \le f(x_0)}} \|x - x^\star\|_A^2 . \tag{11} \]
Now, we show that the algorithm we presented is affine invariant. An optimization method is called affine invariant if it is invariant under affine transformations of the input problem: if one chooses any re-parameterization of the domain Q by a surjective linear or affine map M : Q̂ → Q, then the "old" and "new" optimization problems min_{x∈Q} f(x) and min_{x̂∈Q̂} f̂(x̂) for f̂(x̂) := f(Mx̂) look the same to the algorithm. Note that ∇f̂ = M^⊤∇f.

First of all, let us note that L_A is affine invariant, as it does not depend on any norm. Now:
\[
\begin{aligned}
M\hat{x}_{t+1} &= M\Big(\hat{x}_t - \frac{\langle \nabla \hat{f}(\hat{x}_t), \hat{z}_t\rangle}{L_A} \hat{z}_t\Big) \\
&= M\hat{x}_t - \frac{\langle \nabla \hat{f}(\hat{x}_t), \hat{z}_t\rangle}{L_A} M\hat{z}_t \\
&= x_t - \frac{\langle \nabla \hat{f}(\hat{x}_t), \hat{z}_t\rangle}{L_A} z_t \\
&= x_t - \frac{\langle M^\top \nabla f(x_t), \hat{z}_t\rangle}{L_A} z_t \\
&= x_t - \frac{\langle \nabla f(x_t), M\hat{z}_t\rangle}{L_A} z_t \\
&= x_t - \frac{\langle \nabla f(x_t), z_t\rangle}{L_A} z_t \\
&= x_{t+1} .
\end{aligned}
\]

Therefore the algorithm is affine invariant.
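This invariance can also be verified numerically. The sketch below runs the same update on f and on f̂(x̂) = f(Mx̂), with the atoms mapped through M⁻¹, and checks that the iterates satisfy x_t = Mx̂_t; the constant C stands in for the affine invariant L_A, and all problem data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 5, 30
G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)                   # f(x) = 0.5 x^T Q x - b^T x
b = rng.standard_normal(n)
A = rng.standard_normal((n, 2 * n))           # atoms as columns
M = rng.standard_normal((n, n)) + n * np.eye(n)
A_hat = np.linalg.solve(M, A)                 # transformed atoms M^{-1} A
C = np.einsum('ij,ij->j', A, Q @ A).max()     # bound on curvature along atoms

x, xh = np.zeros(n), np.zeros(n)
for _ in range(T):
    g = Q @ x - b                             # gradient of f
    gh = M.T @ (Q @ (M @ xh) - b)             # gradient of f_hat
    i = int(np.argmin(A.T @ g))               # LMO for f ...
    assert i == int(np.argmin(A_hat.T @ gh))  # ... selects the same atom
    x = x - (g @ A[:, i]) / C * A[:, i]
    xh = xh - (gh @ A_hat[:, i]) / C * A_hat[:, i]
assert np.allclose(M @ xh, x)
print("iterates coincide under the affine map M")
```

The key steps mirror the derivation above: the transformed gradient is M^⊤∇f, so the inner products defining the LMO and the step size are unchanged.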

A.1. Affine Invariant Sublinear Rate

Theorem 3. Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm over lin(A). Let f be convex and L_A-smooth w.r.t. the norm ‖·‖_A over lin(A), and let R_A be the radius of the level set of x_0 measured with the atomic norm. Then, Algorithm 2 converges for t ≥ 0 as

\[ f(x_{t+1}) - f(x^\star) \le \frac{2 L_A R_A^2}{\delta^2 (t+2)} , \]
where δ ∈ (0, 1] is the relative accuracy parameter of the employed approximate LMO (3).

Proof. Recall that z˜t is the atom selected in iteration t by the approximate LMO defined in (3). We start by upper-bounding f using the definition of LA as follows:

\[
\begin{aligned}
f(x_{t+1}) &\le \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), \tilde{z}_t \rangle + \frac{\gamma^2}{2} L_A \|\tilde{z}_t\|_A^2 \\
&= \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), \tilde{z}_t \rangle + \frac{\gamma^2}{2} L_A \\
&\le f(x_t) - \frac{\langle \nabla f(x_t), \tilde{z}_t \rangle^2}{2 L_A} \\
&= f(x_t) - \frac{\langle \nabla_\parallel f(x_t), \tilde{z}_t \rangle^2}{2 L_A} \\
&\le f(x_t) - \delta^2 \frac{\langle \nabla_\parallel f(x_t), z_t \rangle^2}{2 L_A} ,
\end{aligned}
\]

where ∇_∥f is the component of the gradient parallel to the linear span of A. Note that ‖d‖_{A*} := sup{⟨z, d⟩ : z ∈ A} is the dual of the atomic norm. Therefore, by definition:

\[ \langle \nabla_\parallel f(x_t), z_t \rangle^2 = \| \nabla_\parallel f(x_t) \|_{A^*}^2 , \]
which gives:

\[
\begin{aligned}
f(x_{t+1}) &\le f(x_t) - \frac{\delta^2}{2 L_A} \|\nabla_\parallel f(x_t)\|_{A^*}^2 \\
&\le f(x_t) - \frac{\delta^2}{2 L_A} \frac{\langle \nabla_\parallel f(x_t), x_t - x^\star \rangle^2}{R_A^2} \\
&\le f(x_t) - \frac{\delta^2}{2 L_A} \frac{\big(f(x_t) - f(x^\star)\big)^2}{R_A^2} ,
\end{aligned}
\]
where the second inequality is Cauchy-Schwarz and the third one is convexity. Solving the resulting recursion by induction over t gives:

\[ \varepsilon_{t+1} \le \frac{2 L_A R_A^2}{\delta^2 (t+2)} , \qquad \text{where } \varepsilon_t := f(x_t) - f(x^\star) . \]

A.2. Randomized Affine Invariant Sublinear Rate

For random sampling of z from a distribution over A, let

\[ \hat{\delta}^2 := \min_{d \in \mathrm{lin}(A)} \frac{\mathbb{E}_{z}\langle d, z \rangle^2}{\|d\|_{A^*}^2} . \tag{12} \]
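For intuition, the ratio in (12) can be evaluated for the signed coordinate basis {±e_i} with uniform sampling, a case covered by the coordinate-descent discussion in the main text: there E_z⟨d, z⟩² = ‖d‖²₂/n and ‖d‖_{A*} = ‖d‖_∞, so δ̂² = 1/n, attained at d = e_j. A small numerical check:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
# ratio(d) = E_z <d,z>^2 / ||d||_{A*}^2 for the signed coordinate basis
ratio = lambda d: np.mean(d ** 2) / np.max(np.abs(d)) ** 2

assert np.isclose(ratio(np.eye(n)[0]), 1 / n)   # the minimizing direction
for _ in range(100):                            # random d stay above 1/n
    assert ratio(rng.standard_normal(n)) >= 1 / n
print("hat{delta}^2 = 1/n =", 1 / n)
```

The lower bound ratio(d) ≥ 1/n follows since the mean of the squared entries is at least the maximum divided by n.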

Theorem 4. Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm. Let f be convex and L_A-smooth w.r.t. the norm ‖·‖_A over lin(A), and let R_A be the radius of the level set of x_0 measured with the atomic norm. Then, Algorithm 2 converges for t ≥ 0 as

\[ \mathbb{E}_z\big[f(x_{t+1})\big] - f(x^\star) \le \frac{2 L_A R_A^2}{\hat{\delta}^2 (t+2)} , \]
when the LMO is replaced with random sampling of z from a distribution over A.

Proof. Recall that z˜t is the atom selected in iteration t by the approximate LMO defined in (3). We start by upper-bounding f using the definition of LA as follows

\[
\begin{aligned}
\mathbb{E}_z f(x_{t+1}) &\le \mathbb{E}_z \Big[ \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), z \rangle + \frac{\gamma^2}{2} L_A \|z\|_A^2 \Big] \\
&= \mathbb{E}_z \Big[ \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), z \rangle + \frac{\gamma^2}{2} L_A \Big] \\
&\le f(x_t) - \frac{\mathbb{E}_z \langle \nabla f(x_t), z \rangle^2}{2 L_A} \\
&= f(x_t) - \frac{\mathbb{E}_z \langle \nabla_\parallel f(x_t), z \rangle^2}{2 L_A} \\
&\le f(x_t) - \hat{\delta}^2 \frac{\langle \nabla_\parallel f(x_t), z_t \rangle^2}{2 L_A} .
\end{aligned}
\]

The rest of the proof proceeds as in Theorem3.

B. Linear Rates

B.1. Affine Invariant Linear Rate

Let us first define the affine invariant notion of strong convexity based on the atomic norm:

\[ \mu_A := \inf_{\substack{x, y \in \mathrm{lin}(A) \\ x \neq y}} \frac{2}{\|y - x\|_A^2} \big[ f(y) - f(x) - \langle \nabla f(x), y - x \rangle \big] . \]

Let us recall the definition of minimal directional width from (Locatello et al., 2017a):

\[ \mathrm{mDW}(A) := \min_{\substack{d \in \mathrm{lin}(A) \\ d \neq 0}} \max_{z \in A} \Big\langle \frac{d}{\|d\|}, z \Big\rangle . \]
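As a worked instance of this definition (the atom set and sizes are illustrative), for the signed coordinate basis {±e_i} the minimal directional width is mDW(A) = min_d max_i |d_i|/‖d‖₂ = 1/√n, attained at the all-ones direction:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 9
# directional width of {+/- e_i} in direction d
width = lambda d: np.max(np.abs(d)) / np.linalg.norm(d)

assert np.isclose(width(np.ones(n)), 1 / np.sqrt(n))  # the minimizer
for _ in range(100):                                  # no d dips below 1/sqrt(n)
    assert width(rng.standard_normal(n)) >= 1 / np.sqrt(n)
print("mDW = 1/sqrt(n) =", 1 / np.sqrt(n))
```

The bound holds because ‖d‖₂² ≤ n·max_i d_i², so the width is never smaller than 1/√n.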

Then, we can relate our new definition of strong convexity with mDW(A) as follows.

Theorem 6. Assume f is μ-strongly convex w.r.t. a given norm ‖·‖ over lin(A) and A is symmetric. Then:

\[ \mu_A \ge \mathrm{mDW}(A)^2\, \mu . \]

Proof. First of all, note that for any x, y ∈ lin(A) with x ≠ y we have that:

\[ \langle \nabla f(x), x - y \rangle^2 \le \|\nabla f(x)\|_{A^*}^2 \|x - y\|_A^2 . \]

Therefore:
\[
\begin{aligned}
\mu_A &= \inf_{\substack{x, y \in \mathrm{lin}(A) \\ x \neq y}} \frac{2}{\|y - x\|_A^2} D(y, x) \\
&\ge \inf_{\substack{x, y, d \in \mathrm{lin}(A) \\ x \neq y,\, d \neq 0}} \frac{\|d\|_{A^*}^2}{\langle d, x - y \rangle^2}\, 2 D(y, x) \\
&\ge \inf_{\substack{x, y, d \in \mathrm{lin}(A) \\ x \neq y,\, d \neq 0}} \frac{\|d\|_{A^*}^2}{\langle d, x - y \rangle^2}\, \mu \|x - y\|^2 \\
&= \inf_{\substack{x, y, d \in \mathrm{lin}(A) \\ x \neq y,\, d \neq 0}} \frac{\|d\|_{A^*}^2}{\big\langle d, \frac{x - y}{\|x - y\|} \big\rangle^2}\, \mu \\
&\ge \inf_{\substack{d \in \mathrm{lin}(A) \\ d \neq 0}} \frac{\|d\|_{A^*}^2}{\|d\|^2}\, \mu \\
&= \inf_{\substack{d \in \mathrm{lin}(A) \\ d \neq 0}} \max_{z \in A} \frac{\langle d, z \rangle^2}{\|d\|^2}\, \mu \\
&= \mathrm{mDW}(A)^2\, \mu .
\end{aligned}
\]

Theorem 5 (Part 1). Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm. Let f be μ_A-strongly convex and L_A-smooth w.r.t. the norm ‖·‖_A, both over lin(A). Then, Algorithm 2 converges for t ≥ 0 as

\[ \varepsilon_{t+1} \le \Big( 1 - \delta^2 \frac{\mu_A}{L_A} \Big) \varepsilon_t , \]
where ε_t := f(x_t) − f(x*).

Proof. Recall that z˜t is the atom selected in iteration t by the approximate LMO defined in (3). We start by upper-bounding f using the definition of LA as follows 2 γ 2 f(xt+1) ≤ min f(xt) + γh∇f(xt), z˜ti + LAkzkA γ∈R 2 γ2 = min f(xt) + γh∇f(xt), z˜ti + LA γ∈R 2 2 h∇f(xt), z˜ti ≤ f(xt) − 2LA 2 h∇kf(xt), z˜ti = f(xt) − 2LA 2 2 h∇kf(xt), zti ≤ f(xt) − δ 2LA 2 2 h−∇kf(xt), zti = f(xt) − δ . 2LA

where ‖d‖_{A*} := sup{⟨z, d⟩ : z ∈ A} is the dual of the atomic norm. Therefore, by definition:
\[ \langle -\nabla_\parallel f(x_t), z_t \rangle^2 = \|\nabla_\parallel f(x_t)\|_{A^*}^2 , \]
which gives:

\[ f(x_{t+1}) \le f(x_t) - \frac{\delta^2}{2 L_A} \|\nabla_\parallel f(x_t)\|_{A^*}^2 . \]
From strong convexity we have that:
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_A}{2} \|y - x\|_A^2 . \]

Fixing y = x_t + γ(x* − x_t) with γ = 1 in the LHS and minimizing the RHS over γ, we obtain:
\[
\begin{aligned}
f(x^\star) &\ge f(x_t) - \frac{1}{2\mu_A} \frac{\langle \nabla f(x_t), x^\star - x_t \rangle^2}{\|x^\star - x_t\|_A^2} \\
&\ge f(x_t) - \frac{1}{2\mu_A} \|\nabla_\parallel f(x_t)\|_{A^*}^2 ,
\end{aligned}
\]
where the last inequality is obtained from the fact that ⟨∇f(x_t), x* − x_t⟩ = ⟨∇_∥f(x_t), x* − x_t⟩ together with Cauchy-Schwarz. Therefore:

\[ \|\nabla_\parallel f(x_t)\|_{A^*}^2 \ge 2 \mu_A\, \varepsilon_t , \]
which yields:

\[ \varepsilon_{t+1} \le \varepsilon_t - \delta^2 \frac{\mu_A}{L_A}\, \varepsilon_t . \]

B.2. Randomized Affine Invariant Linear Rate

Theorem 5 (Part 2). Let A ⊂ H be a closed and bounded set. We assume that ‖·‖_A is a norm. Let f be μ_A-strongly convex and L_A-smooth w.r.t. the norm ‖·‖_A, both over lin(A). Then, Algorithm 2 converges for t ≥ 0 as
\[ \mathbb{E}_z\big[\varepsilon_{t+1} \,\big|\, x_t\big] \le \Big( 1 - \hat{\delta}^2 \frac{\mu_A}{L_A} \Big) \varepsilon_t , \]
where ε_t := f(x_t) − f(x*), and the LMO direction z is sampled randomly from A, from the same distribution as used in the definition of δ̂.

Proof. We start by upper-bounding f using the definition of L_A as follows:
\[
\begin{aligned}
\mathbb{E}_z\big[f(x_{t+1})\big] &\le \mathbb{E}_z \Big[ \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), \tilde{z}_t \rangle + \frac{\gamma^2}{2} L_A \|\tilde{z}_t\|_A^2 \Big] \\
&= \mathbb{E}_z \Big[ \min_{\gamma \in \mathbb{R}} f(x_t) + \gamma \langle \nabla f(x_t), \tilde{z}_t \rangle + \frac{\gamma^2}{2} L_A \Big] \\
&\le f(x_t) - \mathbb{E}_z \Big[ \frac{\langle \nabla f(x_t), \tilde{z}_t \rangle^2}{2 L_A} \Big] \\
&\le f(x_t) - \hat{\delta}^2 \frac{\langle \nabla_\parallel f(x_t), z_t \rangle^2}{2 L_A} \\
&= f(x_t) - \hat{\delta}^2 \frac{\langle -\nabla_\parallel f(x_t), z_t \rangle^2}{2 L_A} .
\end{aligned}
\]
The rest of the proof proceeds as in Part 1 of the proof of Theorem 5.

C. Accelerated Matching Pursuit

Our proof follows the technique for acceleration given in (Lee & Sidford, 2013; Nesterov & Stich, 2017; Nesterov, 2004; Stich et al., 2013).

C.1. Proof of Convergence

We define ‖x‖²_P = x^⊤Px. We start our proof by first defining the model function ψ_t. For t = 0, we define:
\[ \psi_0(x) = \frac{1}{2}\|x - v_0\|_P^2 . \]

Then, for t ≥ 0, ψ_{t+1} is inductively defined as

\[ \psi_{t+1}(x) = \psi_t(x) + \alpha_{t+1}\big[ f(y_t) + \langle \tilde{z}_t^\top \nabla f(y_t),\, \tilde{z}_t^\top P(x - y_t) \rangle \big] . \tag{13} \]

Algorithm 5 Accelerated Matching Pursuit

1: init x_0 = v_0 = y_0, β_0 = 0, and ν
2: for t = 0, 1, ..., T
3:   Solve α²_{t+1} L ν = β_t + α_{t+1}
4:   β_{t+1} = β_t + α_{t+1}
5:   τ_t = α_{t+1}/β_{t+1}
6:   Compute y_t = (1 − τ_t) x_t + τ_t v_t
7:   Find z_t := LMO_A(∇f(y_t))
8:   x_{t+1} = y_t − (⟨∇f(y_t), z_t⟩/(L‖z_t‖²₂)) z_t
9:   Sample z̃_t ∼ Z
10:  v_{t+1} = v_t − α_{t+1}⟨∇f(y_t), z̃_t⟩ z̃_t
11: end for

Proof of Lemma 7. We will prove the statement inductively. For t = 0, ψ_0(x) = ½‖x − v_0‖²_P and so the statement holds. Suppose it holds for some t ≥ 0. Observe that the function ψ_t(x) is a quadratic with Hessian P. This means that we can reformulate ψ_t(x), with minimum at v_t, as
\[ \psi_t(x) = \psi_t(v_t) + \frac{1}{2}\|x - v_t\|_P^2 . \]
Using this reformulation,

\[
\begin{aligned}
\arg\min_x \psi_{t+1}(x) &= \arg\min_x \Big\{ \psi_t(x) + \alpha_{t+1}\big[ f(y_t) + \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle \big] \Big\} \\
&= \arg\min_x \Big\{ \psi_t(v_t) + \frac{1}{2}\|x - v_t\|_P^2 + \alpha_{t+1}\big[ f(y_t) + \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle \big] \Big\} \\
&= \arg\min_x \Big\{ \frac{1}{2}\|x - v_t\|_P^2 + \alpha_{t+1}\langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - v_t) \rangle \Big\} \\
&= v_t - \alpha_{t+1}\langle \nabla f(y_t), \tilde{z}_t \rangle \tilde{z}_t \\
&= v_{t+1} .
\end{aligned}
\]

Lemma 11 (Upper bound on ψ_t(x)).
\[ \mathbb{E}[\psi_t(x)] \le \beta_t f(x) + \psi_0(x) . \]

Proof. We will also show this through induction. The statement is trivially true for t = 0 since β0 = 0. Assuming the statement holds for some t ≥ 0,

\[
\begin{aligned}
\mathbb{E}[\psi_{t+1}(x)] &= \mathbb{E}\Big[ \psi_t(x) + \alpha_{t+1}\big( f(y_t) + \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle \big) \Big] \\
&= \mathbb{E}\big[ \psi_t(x) \big] + \alpha_{t+1} \mathbb{E}\big[ f(y_t) + \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle \big] \\
&\le \beta_t f(x) + \psi_0(x) + \alpha_{t+1}\big( f(y_t) + \nabla f(y_t)^\top \mathbb{E}\big[\tilde{z}_t \tilde{z}_t^\top\big] P (x - y_t) \big) \\
&= \beta_t f(x) + \psi_0(x) + \alpha_{t+1}\big( f(y_t) + \nabla f(y_t)^\top \tilde{P} P (x - y_t) \big) \\
&= \beta_t f(x) + \psi_0(x) + \alpha_{t+1}\big( f(y_t) + \nabla f(y_t)^\top (x - y_t) \big) \\
&\le \beta_t f(x) + \psi_0(x) + \alpha_{t+1} f(x) .
\end{aligned}
\]

In the above, we used the convexity of the function f and the definition of P.

Lemma 12 (Bound on progress). For any t ≥ 0 of Algorithm 5,

\[ f(x_{t+1}) - f(y_t) \le -\frac{1}{2L\|z_t\|_2^2}\, \nabla f(y_t)^\top z_t z_t^\top \nabla f(y_t) . \]

Proof. The update for x_{t+1}, along with the smoothness of f, guarantees that for γ_{t+1} = −⟨∇f(y_t), z_t⟩/(L‖z_t‖²₂),

\[
\begin{aligned}
f(x_{t+1}) &= f(y_t + \gamma_{t+1} z_t) \\
&\le f(y_t) + \gamma_{t+1}\langle \nabla f(y_t), z_t \rangle + \frac{L \gamma_{t+1}^2}{2}\|z_t\|^2 \\
&= f(y_t) - \frac{1}{2L\|z_t\|_2^2}\, \nabla f(y_t)^\top z_t z_t^\top \nabla f(y_t) .
\end{aligned}
\]
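The inequality in Lemma 12 is easy to verify numerically for a quadratic test function (the matrix Q and the probed directions below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
Q = rng.standard_normal((n, n)); Q = Q @ Q.T   # f(x) = 0.5 x^T Q x
L = np.linalg.eigvalsh(Q).max()                # smoothness constant
f = lambda x: 0.5 * x @ Q @ x

for _ in range(50):
    y, z = rng.standard_normal(n), rng.standard_normal(n)
    g = Q @ y                                  # gradient at y
    # the step x_{t+1} = y - <g,z>/(L ||z||^2) z ...
    x_next = y - (g @ z) / (L * z @ z) * z
    # ... decreases f by at least <g,z>^2 / (2 L ||z||^2)
    assert f(x_next) - f(y) <= -(g @ z) ** 2 / (2 * L * z @ z) + 1e-10
print("Lemma 12 progress bound verified on random directions")
```

The bound holds for any direction z, which is why the same inequality serves both the greedy and the randomly sampled update in the proof above.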

Lemma 13 (Lower bound on ψ_t(x)). Given the filtration F_t up to time step t,

\[ \mathbb{E}\big[\min_x \psi_t(x) \,\big|\, F_t\big] \ge \beta_t f(x_t) . \]

Proof. This too we show inductively. For t = 0, ψ_0(x) = ½‖x − v_0‖²_P ≥ 0 with β_0 = 0. Assume the statement holds for some t ≥ 0. Recall that ψ_t(x) has a minimum at v_t and can alternatively be written as ψ_t(v_t) + ½‖x − v_t‖²_P. Using this,

\[
\begin{aligned}
\psi_{t+1}^\star &= \min_x \Big\{ \psi_t(x) + \alpha_{t+1}\big[ \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle + f(y_t) \big] \Big\} \\
&= \min_x \Big\{ \psi_t(v_t) + \alpha_{t+1}\Big[ \langle \tilde{z}_t^\top \nabla f(y_t), \tilde{z}_t^\top P(x - y_t) \rangle + \frac{1}{2\alpha_{t+1}}\|x - v_t\|_P^2 + f(y_t) \Big] \Big\} \\
&= \psi_t^\star + \alpha_{t+1} f(y_t) + \alpha_{t+1} \min_x \Big\{ \big\langle P\tilde{z}_t \tilde{z}_t^\top \nabla f(y_t), x - y_t \big\rangle + \frac{1}{2\alpha_{t+1}}\|x - v_t\|_P^2 \Big\} .
\end{aligned}
\]

Since we defined y_t = (1 − τ_t)x_t + τ_t v_t, rearranging the terms gives us that

\[ y_t - v_t = \frac{1 - \tau_t}{\tau_t}(x_t - y_t) . \]

Let us now compute E[ψ*_{t+1} | F_t] by combining the above two equations:

\[
\begin{aligned}
\mathbb{E}[\psi_{t+1}^\star \,|\, F_t] &= \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\big\langle P\,\mathbb{E}_t[\tilde{z}_t \tilde{z}_t^\top]\nabla f(y_t),\, y_t - x_t \big\rangle \\
&\qquad + \alpha_{t+1}\mathbb{E}_t\Big[ \min_x \Big\{ \big\langle P\tilde{z}_t \tilde{z}_t^\top \nabla f(y_t), x - v_t \big\rangle + \frac{1}{2\alpha_{t+1}}\|x - v_t\|_P^2 \Big\} \Big] \\
&= \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle \\
&\qquad + \alpha_{t+1}\mathbb{E}_t\Big[ \min_x \Big\{ \big\langle P\tilde{z}_t \tilde{z}_t^\top \nabla f(y_t), x - v_t \big\rangle + \frac{1}{2\alpha_{t+1}}\|x - v_t\|_P^2 \Big\} \Big] \\
&= \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle - \frac{\alpha_{t+1}^2}{2}\,\nabla f(y_t)^\top \mathbb{E}_t\big[\tilde{z}_t \tilde{z}_t^\top P \tilde{P} P \tilde{z}_t \tilde{z}_t^\top\big]\nabla f(y_t) \\
&= \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle - \frac{\alpha_{t+1}^2}{2}\,\nabla f(y_t)^\top \mathbb{E}_t\big[\tilde{z}_t \tilde{z}_t^\top P \tilde{z}_t \tilde{z}_t^\top\big]\nabla f(y_t) .
\end{aligned}
\]

Let us define a constant ν ≥ 0 as the smallest number for which the inequality below holds for all t:

\[ \nu\, \nabla f(y_t)^\top \frac{z_t z_t^\top}{\|z_t\|_2^2}\, \nabla f(y_t) \;\ge\; \nabla f(y_t)^\top\, \mathbb{E}\big[\tilde{z}_t \tilde{z}_t^\top P \tilde{z}_t \tilde{z}_t^\top\big]\, \nabla f(y_t) . \]

Also recall from Lemma 12 that

\[ f(x_{t+1}) - f(y_t) \le -\frac{1}{2L\|z_t\|_2^2}\, \nabla f(y_t)^\top z_t z_t^\top \nabla f(y_t) . \]
Using the above two statements in our computation of ψ*_{t+1}, we get

\[
\begin{aligned}
\mathbb{E}[\psi_{t+1}^\star \,|\, F_t] &= \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle - \frac{\alpha_{t+1}^2}{2}\,\nabla f(y_t)^\top \mathbb{E}_t\big[\tilde{z}_t \tilde{z}_t^\top P \tilde{z}_t \tilde{z}_t^\top\big]\nabla f(y_t) \\
&\ge \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle - \frac{\alpha_{t+1}^2 \nu}{2}\,\nabla f(y_t)^\top \frac{z_t z_t^\top}{\|z_t\|_2^2}\, \nabla f(y_t) \\
&\ge \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\langle \nabla f(y_t), y_t - x_t \rangle + \alpha_{t+1}^2 L \nu\,\big(f(x_{t+1}) - f(y_t)\big) \\
&\ge \psi_t^\star + \alpha_{t+1} f(y_t) + \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t}\big(f(y_t) - f(x_t)\big) + \alpha_{t+1}^2 L \nu\,\big(f(x_{t+1}) - f(y_t)\big) ,
\end{aligned}
\]
where the last inequality uses the convexity of f.

Let us pick α_{t+1} such that it satisfies α²_{t+1}νL = β_{t+1}. Then the above equation simplifies to

\[
\begin{aligned}
\mathbb{E}[\psi_{t+1}^\star \,|\, F_t] &\ge \psi_t^\star + \frac{\alpha_{t+1}}{\tau_t} f(y_t) - \frac{\alpha_{t+1}(1 - \tau_t)}{\tau_t} f(x_t) + \beta_{t+1}\big(f(x_{t+1}) - f(y_t)\big) \\
&= \psi_t^\star - \beta_t f(x_t) + \beta_{t+1} f(y_t) - \beta_{t+1} f(y_t) + \beta_{t+1} f(x_{t+1}) \\
&= \psi_t^\star - \beta_t f(x_t) + \beta_{t+1} f(x_{t+1}) .
\end{aligned}
\]

We used that τt = αt+1/βt+1. Finally we use the inductive hypothesis to conclude that

\[ \mathbb{E}[\psi_{t+1}^\star \,|\, F_t] \ge \psi_t^\star - \beta_t f(x_t) + \beta_{t+1} f(x_{t+1}) \ge \beta_{t+1} f(x_{t+1}) . \]

Lemma 14 (Final convergence rate). For any t ≥ 1, the output of Algorithm 5 satisfies:
\[ \mathbb{E}\big[f(x_t)\big] - f(x^\star) \le \frac{2 L \nu}{t(t+1)} \|x^\star - x_0\|_P^2 . \]

Proof. Putting together Lemmas 11 and 13, we have that

\[ \beta_t \mathbb{E}[f(x_t)] \le \mathbb{E}[\psi_t^\star] \le \mathbb{E}[\psi_t(x^\star)] \le \beta_t f(x^\star) + \psi_0(x^\star) . \]
Rearranging the terms we get
\[ \mathbb{E}[f(x_t)] - f(x^\star) \le \frac{1}{2\beta_t}\|x^\star - x_0\|_P^2 . \]

To finish the proof of the theorem, we only have to compute the value of βt. Recall that

\[ \alpha_{t+1}^2 L \nu = \beta_t + \alpha_{t+1} . \]

We will inductively show that α_t ≥ t/(2Lν). For t = 0, β_0 = 0 and hence α_1 = 1/(Lν), which satisfies the condition. Suppose that for some t ≥ 0 the inequality holds for all iterations i ≤ t. Recall that β_t = Σ_{i=1}^t α_i, i.e. β_t ≥ t(t+1)/(4Lν). Then
\[ (\alpha_{t+1} L \nu)^2 - \alpha_{t+1} L \nu = \beta_t L \nu \ge \frac{t(t+1)}{4} . \]
The positive root of the quadratic x² − x − c = 0 for c ≥ 0 is x = ½(√(1+4c) + 1). Thus
\[ \alpha_{t+1} L \nu \ge \frac{1}{2}\Big( \sqrt{1 + t(t+1)} + 1 \Big) \ge \frac{t+1}{2} . \]
This finishes our induction and proves the final rate of convergence.

Lemma 15 (Understanding ν).
\[ \nu \le \max_{d \in \mathrm{lin}(A)} \frac{\mathbb{E}\big[(\tilde{z}_t^\top d)^2 \|\tilde{z}_t\|_P^2\big]\, \|z(d)\|_2^2}{(z(d)^\top d)^2} . \]

Proof. Recall the definition of ν as a constant which satisfies the following inequality for all iterations t

\[ \nu\, \nabla f(y_t)^\top \frac{z_t z_t^\top}{\|z_t\|_2^2}\, \nabla f(y_t) \;\ge\; \nabla f(y_t)^\top\, \mathbb{E}\big[\tilde{z}_t \tilde{z}_t^\top P \tilde{z}_t \tilde{z}_t^\top\big]\, \nabla f(y_t) , \]
which then yields the following sufficient condition for ν:
\[ \nu \le \max_{d \in \mathrm{lin}(A)} \frac{\mathbb{E}\big[(\tilde{z}_t^\top d)^2 \|\tilde{z}_t\|_P^2\big]\, \|z(d)\|_2^2}{(z(d)^\top d)^2} , \]
where z(d) is defined to be z(d) = LMO_A(−d).

Proof of Theorem 9. The proof of Theorem 9 is exactly the same as the previous one, except that now the update to v_t is also a random variable. The only change needed is the definition of ν′, for which we need the following to hold:
\[ \nu'\, \nabla f(y_t)^\top\, \mathbb{E}_t\Big[\frac{z_t z_t^\top}{\|z_t\|_2^2}\Big] \nabla f(y_t) \;\ge\; \nabla f(y_t)^\top\, \mathbb{E}_t\big[z_t z_t^\top P z_t z_t^\top\big]\, \nabla f(y_t) . \]

Proof of Lemma 10. When A = {e_i, i ∈ [n]} and Z is a uniform distribution over A, then P̃ = (1/n)I and P = nI. A simple computation shows that ν′ = n and ν ∈ [1, n]. Note that here ν could be up to n times smaller than ν′, meaning that our accelerated greedy coordinate descent algorithm could be √n times faster than the accelerated random coordinate descent. In the worst case ν = ν′, but in practice one can pick a smaller ν compared to ν′, as the worst-case gradient rarely happens. It is possible to tune ν and ν′ empirically, but we do not explore this direction.
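The computation in this proof can be replayed numerically; the sketch below checks P̃ = I/n and P = nI, evaluates the ν′ bound of Theorem 9 (which equals n for every direction d), and the ν bound of Lemma 15 (which, as d varies, ranges over [1, n]). The sampled directions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 7
atoms = np.eye(n)                          # coordinate basis, uniform Z
P_tilde = np.mean([np.outer(e, e) for e in atoms], axis=0)
P = np.linalg.pinv(P_tilde)
assert np.allclose(P_tilde, np.eye(n) / n)
assert np.allclose(P, n * np.eye(n))

def nu_bound(d):      # Lemma 15: z(d) is the steepest coordinate,
    num = np.mean([(e @ d) ** 2 * (e @ P @ e) for e in atoms])
    return num / np.max(np.abs(d)) ** 2    # ||z(d)||^2 = 1, (z(d)^T d)^2 = ||d||_inf^2

def nu_prime_bound(d):  # Theorem 9 bound
    num = np.mean([(e @ d) ** 2 * (e @ P @ e) for e in atoms])
    den = np.mean([(e @ d) ** 2 for e in atoms])
    return num / den

for _ in range(20):
    d = rng.standard_normal(n)
    assert np.isclose(nu_prime_bound(d), n)
    assert 1 - 1e-9 <= nu_bound(d) <= n + 1e-9
print("nu' bound = n; nu bound lies in [1, n]")
```

This matches the statement of Lemma 10: the random method always pays a factor n, while the greedy method pays between 1 and n depending on the gradient direction.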