Fast column generation for atomic norm regularization Marina Vinyes, Guillaume Obozinski

To cite this version:

Marina Vinyes, Guillaume Obozinski. Fast column generation for atomic norm regularization. The 20th International Conference on Artificial Intelligence and Statistics, Apr 2017, Fort Lauderdale, United States. hal-01502575

HAL Id: hal-01502575 https://hal-enpc.archives-ouvertes.fr/hal-01502575 Submitted on 9 Apr 2017


Marina Vinyes and Guillaume Obozinski
Université Paris-Est, LIGM (UMR8049), Ecole des Ponts
Marne-la-Vallée, France
[email protected]  [email protected]

Abstract

We consider optimization problems that consist in minimizing a quadratic function under an atomic norm¹ regularization or constraint. In the line of work on conditional gradient algorithms, we show that the fully corrective Frank-Wolfe (FCFW) algorithm — which is most naturally reformulated as a column generation algorithm in the regularized case — can be made particularly efficient for difficult problems in this family by solving the simplicial or conical subproblems produced by FCFW using a special instance of a classical active set algorithm for quadratic programming (Nocedal and Wright, 2006) that generalizes the min-norm point algorithm (Wolfe, 1976). Our experiments show that the algorithm takes advantage of warm-starts and of the sparsity induced by the norm, displays fast linear convergence, and clearly outperforms the state-of-the-art, for both complex and classical norms, including the standard group Lasso.

1 INTRODUCTION

A number of problems in machine learning and structured optimization involve either structured convex constraint sets that are defined as the intersection of a number of simple convex sets or, dually, norms whose unit balls are defined as the convex hull of either extreme points or of a collection of sets. A broad class of convex regularizers that can be used to encode a priori knowledge on the structure of the objects to estimate have been described as atomic norms and atomic gauges by Chandrasekaran et al. (2012). The concept of atomic norm has found several applications to the design of sparsity-inducing norms for vectors (Jacob et al., 2009; Obozinski et al., 2011), matrices (Richard et al., 2014; Foygel et al., 2012) and tensors (Tomioka and Suzuki, 2013; Liu et al., 2013; Wimalawarne et al., 2014).

A number of these norms remain difficult to use in practice because it is in general not possible to compute the associated proximal operator, or even the norm itself, at a reasonable cost. However, the dual norm, which is defined as a supremum of dot products with the atoms that define the norm, can often be computed efficiently because of the structure of the set of atoms. Also, a number of atomic norms are naturally defined as infimal convolutions of other norms (Jacob et al., 2009; Tomioka and Suzuki, 2013; Liu et al., 2013), and this structure has been used to design either block-coordinate descent approaches or dual ADMM optimization schemes (Tomioka and Suzuki, 2013) involving latent variables associated with the elementary norms convolved.

In this paper, we propose to solve problems regularized or constrained by atomic norms using a fully corrective Frank-Wolfe algorithm — which can be reformulated as a simple column generation algorithm in the regularized case — combined with a dedicated active-set algorithm for quadratic programming. Our experiments show that we achieve state-of-the-art performance. We also include a formal proof of the correspondence between the column generation algorithm and fully corrective Frank-Wolfe.

After a review of the concept of atomic norms, as well as some illustrations, we present a number of the main algorithmic approaches that have been proposed. We then present the scheme we propose and finally some experiments on synthetic and real datasets.

¹ or more generally an atomic gauge

1.1 Notations

[[p]] denotes the set {1, . . . , p}. If x ∈ R^p, x_G denotes the subvector of x whose entries are indexed by a set G ⊂ [[p]]. Given a function ψ, ψ* denotes its Fenchel conjugate ψ*(s) := max_x ⟨s, x⟩ − ψ(x). ‖M‖_tr denotes the trace norm of the matrix M, defined as the ℓ1-norm of its singular values.

2 ATOMIC NORMS

In many machine learning applications, and particularly for ill-posed problems, models are constrained structurally so that they have a simple representation in terms of some fundamental elements. Examples of such elements include sparse vectors for many sparsity-inducing norms, rank-one matrices for the trace norm, or low-rank tensors as used in the nuclear tensor norm (Liu et al., 2013). We call these elements atoms and their (possibly infinite) collection the atomic set A. Assuming A is bounded and centrally symmetric, and provided its convex hull C_A has nonempty interior, we can define an atomic norm γ_A as the norm whose unit ball is C_A. It can be shown that (in a finite dimensional space)

    γ_A(x) := inf{ Σ_{a∈A} c_a | Σ_{a∈A} c_a a = x, c_a ≥ 0, a ∈ A }.

The polar norm or dual norm is defined as γ°_A(s) := sup_{a∈A} ⟨s, a⟩. If A is not symmetric, or if the interior of C_A is empty, then as long as A contains the origin and is closed, γ_A can still be defined as a gauge instead of a norm, and the theory and algorithms presented in this paper still apply. We restrict the discussion to norms for simplicity. For a reference on gauges, see Rockafellar (1997).

We consider in this paper formulations in which an atomic norm is used as a regularizer, and which lead to an optimization problem of the form

    min_{x∈R^p} f(x) + γ_A(x),    (1)

where f is a quadratic function. The case where f is more generally twice differentiable is obviously of interest, but beyond the scope of this work.

2.1 Examples of atomic norms

Lasso. The Lasso is a natural example of an atomic norm, whose atoms are the (±e_i)_{i∈[[p]]}, where (e_i)_{i∈[[p]]} is the canonical basis of R^p. The Lasso polar norm is defined as Ω°_Lasso(s) = max_{i∈[[p]]} |s_i|.

Latent group Lasso (LGL). The norms introduced in Jacob et al. (2009) are a strong motivating example. For instance, Obozinski and Bach (2016) show that a broad family of tight relaxations for structured sparsity can be written in LGL form. Given a collection of sets B covering [[p]] and which can overlap, and fixed positive weights δ_B for each set B ∈ B, the atoms of the LGL norm are the vectors of norm δ_B^{-1} and support in B. The polar LGL norm is defined as Ω°_LGL(s) = max_{B∈B} δ_B^{-1} ‖s_B‖_2. In the particular case where B forms a partition of [[p]], we recover the group Lasso norm. Maurer and Pontil (2012) consider a generalization to a broader family of atomic norms with dual norms of the form sup_{M∈M} ‖Ms‖_2, where M is a collection of operators. Matrix counterparts of the latent group Lasso norms are the latent group trace norms (Tomioka and Suzuki, 2013; Wimalawarne et al., 2014).

Additive decompositions. There has been interest in the literature for additive matrix decompositions (Agarwal et al., 2012), the most classical example being "sparse + low rank" decompositions, which have been proposed for robust PCA and multitask learning (Candès et al., 2011; Chandrasekaran et al., 2011). This formulation leads to a problem of the form min_{L,S} f(L+S) + µ‖L‖_tr + λ‖S‖_1, which can be written in the form min_M f(M) + γ_A(M), with γ_A the atomic norm whose atomic set A ⊂ R^{p1×p2} is defined as

    A := λ A_1 ∪ µ A_tr,  where
    A_1 := { ±e_i e_j^T, (i, j) ∈ [[p1]] × [[p2]] },
    A_tr := { uv^T, ‖u‖_2 = ‖v‖_2 = 1 }.

As a consequence, C°_A = (1/λ) C°_1 ∩ (1/µ) C°_tr, with C°_1 a unit ℓ_∞ ball and C°_tr a unit spectral norm ball.

Convex sparse SVD and PCA. A third example is given by the norms introduced in Richard et al. (2014), including the (k, q)-trace norm, for which

    A := { A_{I,J} | (I, J) ⊂ [[p1]] × [[p2]], |I| = k, |J| = q },  with  A_{I,J} := { uv^T ∈ A_tr | ‖u‖_0 ≤ k, ‖v‖_0 ≤ q },

and the sparse-PCA norm², for which

    A := { A_{I,⪰} | I ⊂ [[p1]], |I| = k },  with  A_{I,⪰} := { uu^T | u ∈ A_I },

and A_I defined like A_B for LGL.

Beyond these examples, a number of structured convex optimization problems encountered in machine learning and operations research that involve combinatorial or structured tasks, such as finding permutations or alignments, convex relaxations of structured matrix factorization problems (Bach et al., 2008; Ding et al., 2010), Procrustes analysis, etc., involve difficult convex constraint sets such as the elliptope, the Birkhoff polytope, or the set of doubly nonnegative matrices, which are naturally written (themselves or their polars) as intersections of simpler sets such as the p.s.d. cone, the positive orthant, simplices, hypercubes, etc., and which lead to optimization problems whose duals are regularized by the associated atomic norms.

² In fact this is not a norm but only a gauge.
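As an illustration of how the structure of the atomic set makes the dual norm cheap even when the primal norm is not, the following sketch (in Python/NumPy; an illustrative helper, not code from the paper) evaluates the LGL polar norm Ω°_LGL(s) = max_B δ_B^{-1} ‖s_B‖_2 and returns a maximizing atom, which is exactly the quantity argmax_{a∈A} ⟨a, s⟩ needed by the Frank-Wolfe oracle discussed in the next section.

```python
import numpy as np

def lgl_polar_norm(s, groups, deltas):
    """Polar LGL norm max_B ||s_B||_2 / delta_B and a maximizing atom.

    groups: list of index arrays B; deltas: positive weight delta_B per group.
    Illustrative sketch following the definitions of Section 2.1.
    """
    best_val, best_atom = -np.inf, None
    for B, delta in zip(groups, deltas):
        nrm = np.linalg.norm(s[B])
        val = nrm / delta
        if val > best_val:
            atom = np.zeros_like(s, dtype=float)
            if nrm > 0:
                # atom supported on B, of norm 1/delta_B, aligned with s_B
                atom[B] = s[B] / (delta * nrm)
            best_val, best_atom = val, atom
    return best_val, best_atom
```

By construction ⟨atom, s⟩ equals the returned value, so a single pass over the groups provides both the dual norm and the best atom.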

2.2 Existing algorithmic approaches

2.2.1 Conditional gradient algorithms

For many³ of these norms, it is assumed that an efficient algorithm is available to compute argmax_{a∈A} ⟨a, s⟩. For the case of the constrained problem

    min_x f(x)  s.t.  x ∈ C_A,    (2)

this has motivated a number of authors to suggest variants of the conditional gradient algorithm, also known as the Frank-Wolfe algorithm when the objective is quadratic, as a tool of choice to solve problems with atomic norm constraints. Indeed, the principle of conditional gradient algorithms is to build a sequence of approximations to the solution of the problem as convex combinations of extreme points of the constraint set, which here correspond to atoms, so that the expansion takes the form x^t = Σ_{i=1}^t c_i a_i with Σ_{i=1}^t c_i = 1. This procedure guarantees a feasible sequence. At each iteration a new atom, also called the Frank-Wolfe direction or forward direction, is added to the expansion. This atom is the extreme point of the constraint set defined by a^{t+1} := argmax_{a∈A} ⟨a, −∇f(x^t)⟩. The Frank-Wolfe (FW) update writes

    x^{t+1}_FW = (1 − η^t) x^t + η^t a^{t+1},

where η^t ∈ [0, 1] is a scalar stepsize and x^0 = 0. It can be set to 1/(1+t) or found by line search.

Other variants of FW algorithms have been proposed, notably FW with away steps (which we do not describe here), pairwise FW (PWFW) and fully corrective Frank-Wolfe (FCFW). We refer the reader to Lacoste-Julien and Jaggi (2015) for a detailed presentation and summarize hereafter the form of the different updates for PWFW and FCFW. The active set of atoms A^t at time t is recursively defined by A^{t+1} = Ã^t ∪ {a^{t+1}}, with Ã^t the set of active atoms of A^t at the end of iteration t, i.e. the ones that contributed with a non-zero coefficient to the expansion of x^t.

PWFW makes use of a backward direction, also called the away atom, defined as a^{t+1}_B = argmax_{a∈Ã^t} ⟨a, ∇f(x^t)⟩, i.e. the active atom with largest projection on the gradient direction. The idea in PWFW is to move by transferring weight from the away atom a^{t+1}_B to the FW atom a^{t+1}:

    x^{t+1}_PWFW = x^t + η^t (a^{t+1} − a^{t+1}_B),

where η^t ∈ [0, c^t_B], with c^t_B ≥ 0 the weight attributed to atom a^{t+1}_B at iteration t, and η^t is found by line search. The optimal step sizes η^t for FW and PWFW are easily obtained in closed form when f is quadratic.

In FCFW, all weights are reoptimized at each iteration:

    x^{t+1}_FCFW = argmin_x f(x)  s.t.  x ∈ Convhull(A^{t+1}).

If A^t = {a_1, ..., a_{k_t}}, where k_t ≤ t is the number of atoms in A^t, the subproblem that has to be solved at each iteration t of FCFW rewrites

    min_{c≥0} f( Σ_{i=1}^{k_t} c_i a_i )  s.t.  Σ_{i=1}^{k_t} c_i = 1.    (3)

Lacoste-Julien and Jaggi (2015) show that PWFW and FCFW converge linearly for strongly convex objectives when A is finite.

Rao et al. (2015) propose a variant of FCFW to solve (2) for f smooth and specifically for atomic norm constraints, with an enhancing "backward step" which applies hard-thresholding to the coefficients c^t. To solve (3) they use a projected gradient algorithm.

Beyond constrained optimization problems, the basic conditional gradient algorithm (corresponding to plain FW when f is quadratic) has been generalized to solve problems of the form min_x f(x) + ψ(x), where the set constraint C_A is replaced by a proper convex function ψ for which the subgradient of ψ* can be computed efficiently (Bredies et al., 2009; Yu et al., 2014). Bach (2015) shows that the obtained algorithm can be interpreted as a dual mirror descent algorithm. Yu et al. (2014), Bach (2015) and Nesterov et al. (2015) prove sublinear convergence rates for these algorithms. Corresponding generalizations of PWFW and FCFW are however not obvious. As exploited in Yu et al. (2014) and Harchaoui et al. (2015), if ψ = h ∘ γ_A, with h a nondecreasing convex function and γ_A an atomic norm, and if an upper bound ρ can be specified a priori on γ_A(x*) for x* a solution of the problem, it can be reformulated as

    min_{x,τ} f(x) + h(τ)  s.t.  γ_A(x) ≤ τ, τ ≤ ρ,    (4)

and it is natural to apply the different variants of Frank-Wolfe to the variable (x, τ), because the FW direction is easy to compute (see Section 3.1).

2.2.2 Proximal block-coordinate descent

In the context where they are applicable, proximal gradient methods provide an appealing alternative to Frank-Wolfe algorithms. However, the former require to be able to compute efficiently the proximal operator of the norm γ_A appearing in the objective, which is typically more difficult to compute than the Frank-Wolfe direction.

³ This is not true for the norms introduced in Richard et al. (2014), whose duals are NP-hard to compute, but for which reasonable heuristic algorithms or relaxations are available.
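To illustrate the contrast drawn here between proximal operators and Frank-Wolfe directions: for the plain (non-overlapping) group Lasso the proximal operator is a cheap blockwise soft-thresholding, as in the sketch below (Python/NumPy, illustrative; the function name and interface are ours). For overlapping collections such as the LGL norm, no comparably simple closed form is available in general, whereas the Frank-Wolfe direction remains a single pass over the groups, as in the oracle sketch of Section 2.1.

```python
import numpy as np

def group_soft_threshold(x, groups, threshold):
    """Prox of the non-overlapping group Lasso penalty: blockwise shrinkage.

    groups: a partition of the coordinates into index arrays;
    threshold: step size times the regularization parameter.
    Illustrative sketch of the operator required by proximal / BCD methods.
    """
    out = np.zeros_like(x, dtype=float)
    for B in groups:
        nrm = np.linalg.norm(x[B])
        if nrm > threshold:
            out[B] = (1.0 - threshold / nrm) * x[B]  # shrink the whole block
    return out
```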

For a number of atomic norms, we have A = ⋃_{j=1}^J C_j, where the C_j are convex sets. As a consequence, the polar norm takes the form γ°_A(s) = max_j γ°_{C_j}(s), with γ_{C_j} the atomic norm (or gauge) associated with the set C_j, and it is a standard result that

    γ_A(x) = inf{ γ_{C_1}(z_1) + ... + γ_{C_J}(z_J) | z_1 + ... + z_J = x }.

Technically, γ_A is called the infimal convolution of the norms (γ_{C_j})_j (see Rockafellar, 1997). In fact most of the norms that we presented in Section 2.1 are of this form, including LGL norms, latent group trace norms, norms arising from additive decompositions (obviously, by construction), and the norms for sparse SVD and sparse PCA.

For all these norms, problem (1) can be reformulated as

    min_{z_1,...,z_J} f(z_1 + ... + z_J) + γ_{C_1}(z_1) + ... + γ_{C_J}(z_J).

Since the objective is then the sum of a smooth function and of a separable function, randomized proximal block-coordinate descent algorithms are typical candidates. These algorithms have attracted a lot of attention in the recent literature and have been applied successfully to a number of formulations involving convex sparsity-inducing regularizers (Shalev-Shwartz and Tewari, 2011; Friedman et al., 2010; Gu et al., 2016), where they achieve state-of-the-art performance. Such BCD algorithms were the ones proposed for the norms of Jacob et al. (2009) and Richard et al. (2014).

Unfortunately these algorithms are slow in general, even if f is strongly convex, because of the composition with the linear mapping (z_1, . . . , z_J) ↦ z_1 + ... + z_J. Intuitively, if the atoms of the different norms are similar, then the formulation is badly conditioned. If they are different or essentially decorrelated, BCD remains one of the most efficient algorithms (Shalev-Shwartz and Tewari, 2011; Gu et al., 2016).

3 PIVOTING FRANK-WOLFE

After reviewing the form of the corrective step of FCFW and reformulating FCFW in the regularized case as a column generation algorithm, we introduce active-set algorithms to solve efficiently the sequences of corrective steps.

3.1 Simplicial and conical subproblems

We focus on the sequence of subproblems that need to be solved at the corrective step of FCFW. Let k_t := |A^t| be the number of selected atoms at iteration t, and A^t ∈ R^{p×k_t} the matrix whose columns are the atoms of A^t; then, for the constrained problem (2), the subproblem is the simplicial problem

    min_c f(A^t c)  s.t.  c ∈ ∆_{k_t},    (5)

with ∆_k := { c ∈ R^k_+ | Σ_{i=1}^k c_i = 1 } the canonical simplex.

The regularized problem (1) can be reformulated as the constrained optimization problem (4) on a truncated cone, provided the truncation level ρ is an upper bound on the value of γ_A at the optimum. Actually, if ρ is sufficiently large, several Frank-Wolfe algorithms no longer depend on the value of ρ and can be interpreted as algorithms in which whole extreme rays of the cone {(x; τ) | γ_A(x) ≤ τ} enter the active set via the linear minimization oracle, and where the original cone is locally approximated from inside by the simplicial cone obtained as their conical hull. In particular, in the case of FCFW, the subproblem considered at the t-th iteration takes the form of the conical problem

    min_c f(A^t c) + Σ_i c_i  s.t.  c ≥ 0,    (6)

which is simply a Lasso problem with positivity constraints when f is quadratic. The fact that problem (1) can be solved as a sequence of problems of the form (6) is shown in Harchaoui et al. (2015, Sec. 5), who argue that this leads to an algorithm no worse and possibly better. We formally show that the simple column generation scheme presented as Algorithm 1 is in fact exactly equivalent to FCFW applied to the truncated cone formulation as soon as ρ is large enough:

Proposition 1. If f is assumed lower bounded by 0 and if ρ > f(0), or more generally if the level sets of x ↦ f(x) + γ_A(x) are bounded and ρ is sufficiently large, then the sequence (x̄^t)_t produced by the FCFW algorithm applied to the truncated cone constrained problem (4) and initialized at (x̄^0; τ^0) = (0; 0) is the same as the sequence (x^t)_t produced by Algorithm 1 initialized with x^0 = 0, with equivalent sequences of subproblems, active sets and decomposition coefficients.

See the appendix for a proof. As discussed as well in the appendix, a variant of Algorithm 1 without pruning of the atoms with zero coefficients (at step 7) is derived very naturally as the dual of a cutting plane algorithm.

3.2 Leveraging active-set algorithms for quadratic programming

Problems (5) and (6) can be solved efficiently by a number of algorithms. In particular, an appropriate variant of the LARS algorithm solves both problems in a finite number of iterations, and it is fast if the solution is sparse, in spite of the fact that it solves exactly a sequence of linear systems. Interior point algorithms can always be used, and are often considered to be a natural choice to solve this step in the literature.
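For comparison with the first-order approach discussed in this section, a projected gradient (forward-backward) loop for the conical problem (6) is easy to state when f(x) = ½‖y − Xx‖². The sketch below (Python/NumPy, illustrative only, not taken from the paper) implements such a baseline; the active-set method advocated next replaces this loop with a finitely convergent pivoting scheme that can be warm-started.

```python
import numpy as np

def conical_subproblem_pg(X, A, y, n_iter=500):
    """Projected gradient sketch for problem (6):
        min_c 0.5*||y - X A c||^2 + sum(c)   s.t.  c >= 0.
    X: design matrix, A: matrix whose columns are the active atoms, y: targets.
    Illustrative baseline only.
    """
    M = X @ A                                   # compose design with atoms
    L = max(np.linalg.norm(M, 2) ** 2, 1e-12)   # Lipschitz constant of the gradient
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ c - y) + 1.0          # smooth part + linear term of (6)
        c = np.maximum(c - grad / L, 0.0)       # projection onto the nonnegative orthant
    return c
```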

Algorithm 1 Column generation
 1: Require: f convex differentiable, tolerance ε
 2: Initialization: x^0 = 0, A^0 = ∅, k_0 = 0, t = 1
 3: repeat
 4:   a_t ← argmax_{a∈A} ⟨−∇f(x^{t−1}), a⟩
 5:   A^t ← [A^{t−1}, a_t]
 6:   c^t ← argmin_{c≥0} f(A^t c) + ‖c‖_1
 7:   I ← {i | c^t_i > 0}
 8:   c^t ← c^t_I
 9:   A^t ← A^t_{·,I}
10:   x^t ← A^t c^t
11:   t ← t + 1
12: until max_{a∈A} ⟨−∇f(x^{t−1}), a⟩ ≤ ε

For larger scale problems, and if f has Lipschitz gradients (which is obviously the case for a quadratic function), the forward-backward proximal algorithm can be used as well, since the projection on the simplex for (5) and the asymmetric soft-thresholding for (6) can be computed efficiently. For the constrained case, this is the algorithm used by Rao et al. (2015).

In our case, we need to solve a sequence of problems of the form (5) or (6), each differing from the previous one by the addition of a single atom. So being able to use warm-starts is key! If the simplicial problems remain of small size, and if the corresponding Hessians can be computed efficiently, using second order algorithms is likely to outperform first order methods. But the LARS and interior point methods cannot take advantage of warm-starts. Thus, when f is quadratic, we propose to use active set algorithms for convex quadratic programming (Nocedal and Wright, 2006; Forsgren et al., 2015). In particular, following Bach (2013, Chap. 7.12)⁴, we propose to apply the active-set algorithm of Nocedal and Wright (2006, Chap. 16.5) to iteratively solve (5) and (6). This algorithm takes the very simple form of Algorithm 2.⁵ In fact, as noted in Bach (2013, Chap. 9.2), this algorithm is a generalization of the famous min-norm point algorithm (Wolfe, 1976), the latter being recovered when the Hessian is the identity.

Algorithm 2 is illustrated in Figure 1. The obtained iterates always remain in the positive orthant (i.e. primal feasible). Each update of c in Algorithm 2 is called a pivot, which is either a full-step or a drop-step. Given a collection of active atoms indexed by a set J, the solution d of the unconstrained quadratic program restricted to this set of atoms, obtained by removing the positivity constraints, is computed (line 4). If d lies in the positive orthant, we set c = d, and we say that we perform a full-step. In that case, the index of an atom that must become active (if any), based on gradients, is added to J. If d ∉ R^{|J|}_+, a drop-step is performed: c is updated as the intersection between the segment [c_old, d] and the positive orthant, and the index i such that c_i = 0 is dropped from J (line 13). The algorithm stops if, after a full-step, no new index is added to J.

Figure 1: Illustration of Algorithm 2. Here, it converges after a drop-step (variable c_2 is dropped) leading to c^1, followed by a full-step (along c_1) leading to c^2.

Algorithm 2 [c, J] = Active-set(H, b, c_0, J_0)
 1: Solves: P := min_c ½ c^T H c + b^T c  s.t.  c ≥ 0
 2: Initialization: c = c_0, J = J_0
 3: repeat
 4:   d ← −H_{J,J}^{-1} b_J
 5:   if d ≥ 0,
 6:     c ← d    ▷ full-step
 7:     g ← Hc + b
 8:     k ← argmin_{i∈J_0∖J} g_i
 9:     if g_k ≥ 0, then break, else J ← J ∪ {k} end
10:   else
11:     i* ← argmin_i c_i / (c_i − d_i)  s.t.  c_i − d_i > 0, d_i < 0
12:     τ ← c_{i*} / (c_{i*} − d_{i*})
13:     J ← J ∖ {i*}    ▷ drop-step
14:     c ← c + τ(d − c)
15:   end
16: until g_{J_0∖J} ≥ 0
17: return c, J

⁴ Bach (2013) proposed to use this active-set algorithm to optimize convex objectives involving the Lovász extension of a submodular function.
⁵ Despite the fact that, in the context of a simplicial algorithm, the polyhedral constraint sets of (5) and (6) arise as convex hulls, the algorithm of Nocedal and Wright (2006, Chap. 16.5) actually exploits their structure as intersections of half-spaces, and thus the active constraints of the algorithm correspond, counter-intuitively, to dropped atoms.
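For concreteness, here is a compact NumPy transcription of Algorithm 2 under our reading of it (objective ½cᵀHc + bᵀc, candidate indices taken as all coordinates outside the working set, no degeneracy handling); it is an illustrative reimplementation, not the authors' code, and the function name is ours.

```python
import numpy as np

def active_set_qp(H, b, c0, J0, max_iter=1000):
    """Sketch of Algorithm 2: min_c 0.5*c'Hc + b'c  s.t.  c >= 0.

    c0 must be feasible (c0 >= 0) and supported on the working set J0.
    Returns the solution c and the final working set J.
    """
    k = len(b)
    c = np.asarray(c0, dtype=float).copy()
    J = list(J0)
    for _ in range(max_iter):
        d = np.zeros(k)
        if J:  # unconstrained minimizer restricted to the working set (line 4)
            d[J] = np.linalg.solve(H[np.ix_(J, J)], -b[J])
        if all(d[j] >= 0 for j in J):          # full-step (lines 5-9)
            c = d
            g = H @ c + b                      # gradient at the new iterate
            outside = [i for i in range(k) if i not in J]
            if not outside or min(g[i] for i in outside) >= 0:
                break                          # KKT conditions hold: optimal
            J.append(min(outside, key=lambda i: g[i]))  # add most violating index
        else:                                  # drop-step (lines 10-14)
            blocking = [j for j in J if d[j] < 0]
            j_star = min(blocking, key=lambda j: c[j] / (c[j] - d[j]))
            tau = c[j_star] / (c[j_star] - d[j_star])
            c = c + tau * (d - c)
            c[j_star] = 0.0                    # land exactly on the dropped face
            J.remove(j_star)
    return c, J
```

Within Algorithm 1, each call would be warm-started with the previous coefficients and working set, which is why only one or two pivots are typically needed per call (see Section 3.3 and Figure 4).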

3.3 Convergence and computational cost

In this section, we discuss first the convergence of the algorithm and the number of pivots needed for convergence, and then the cost of each pivot.

Algorithm 2 is an instance of the min-norm point (MNP) algorithm with a general quadratic instead of the Euclidean distance, but the algorithm is affine invariant, so the convergence is the same. MNP is known to be finitely convergent. The positive orthant in dimension k_t has at most 2^{k_t} faces, which gives a naive bound on the number of pivots in the active-set at iteration t of FCFW. But Lacoste-Julien and Jaggi (2015) prove that MNP is linearly convergent. In practice, the solution is most of the time either strictly inside the orthant or in one of the (k − 1)-dimensional faces, in which case it is in fact found in just 1 or respectively 2 iterations! The number of pivots per call is illustrated in Figure 4, upper left.

Let s = max_{a∈A} ‖a‖_0 be the sparsity of the atoms, k the number of active atoms at iteration t, and H^t = A^{tT} Q A^t the Hessian of the quadratic problem in the active set, where Q is the Hessian of the quadratic function f. The cost of one pivot is the cost of computing the Hessian H^t and its inverse, which is O(min(k²s², kps + k²s)) for building the Hessian and an extra O(k³) for the inversion. In the active-set with warm starts we only add or remove one atom at a time. We can take advantage of this to efficiently update the Hessian H^t and its inverse with rank one updates. The computational cost for updating the Hessian is O(min(ks², ps + ks)) when an atom is added and O(k) when removing an atom. The additional cost to update (H^t)^{-1} is then just O(k²) in both cases. See the appendix for more details on the rank one updates.
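The detailed bookkeeping of these updates is in the paper's appendix; as an illustration of the growing case, the sketch below (Python/NumPy, not the authors' code) uses the standard block/Schur-complement inverse formula when a single atom a is appended to A^t: once the cross terms h = A^{tT}Qa and the scalar ρ = aᵀQa are formed, the inverse is updated in O(k²) without refactorizing.

```python
import numpy as np

def grow_hessian_and_inverse(H, H_inv, A, Q, a):
    """Append atom a to A and update H = A'QA and H^{-1}.

    Uses the block (Schur complement) inverse formula; given h and rho,
    the inverse update itself costs O(k^2). Assumes at least one existing
    atom and a positive definite updated Hessian. Illustrative sketch only.
    """
    h = A.T @ (Q @ a)                      # cross terms with the existing atoms
    rho = float(a @ (Q @ a))               # new diagonal entry
    H_new = np.block([[H, h[:, None]],
                      [h[None, :], np.array([[rho]])]])

    v = H_inv @ h                          # reuse the previous inverse
    s = rho - h @ v                        # Schur complement (> 0 if H_new is PD)
    H_inv_new = np.block([[H_inv + np.outer(v, v) / s, -v[:, None] / s],
                          [-v[None, :] / s, np.array([[1.0 / s]])]])
    return H_new, H_inv_new
```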

4 EXPERIMENTS

In this section, we report experiments that illustrate the computational efficiency of the proposed algorithm. We consider linear regression problems of the form (1) with f(w) = ½‖Xw − y‖², where X is a design matrix and γ_A is the LGL or the sparse-PCA norm described in Section 2. We also consider the constrained version for LGL, min_w f(w) s.t. Ω_LGL(w) ≤ ρ, in Section 4.2.

Section 4.1 compares the performance of our proposed algorithm with state-of-the-art algorithms for the group Lasso. Section 4.2 presents comparisons with the variants of Frank-Wolfe and with CoGEnT on a problem involving the latent group Lasso. Section 4.3 provides a comparison with a version of FCFW relying on an interior-point solver on larger scale problems. Sections 4.3 and 4.4 provide comparisons with randomized block proximal coordinate descent algorithms. Most experiments are on simulated data, to control the characteristics of the experiments, except in Section 4.3.

4.1 Classical group Lasso

We consider an example with group Lasso regularization with groups of size 10, B = {{1,..., 10}, {11,..., 20}, ... }. We choose the support of the parameter w_0 ∈ R^1000 of the model to be {1,..., 50} and all non-zero coefficients are set to 2. We generate n = 200 examples (y_i)_{i=1,..,n} from y = x^T w + ε. Block Coordinate Descent (BCD) algorithms are the standard method for this problem, but they suffer slow convergence when the design matrix is highly correlated. In this experiment we choose a highly correlated design matrix (with singular values in {1, 0.92, .., 0.92^{p−2}, 0.92^{p−1}}) to highlight the advantages of our algorithm for the harder instances. We compared our algorithm to our own implementation of BCD and to an enhanced BCD from Qin et al. (2013) (hyb-BCD). Figure 2 shows that we outperform both methods.

Figure 2: Experiment for the classical group Lasso (λ = 1.5). Log-log plot of the progress of the duality gap against time for BCD, hyb-BCD and our method.

4.2 k-chain latent group Lasso

We consider a toy example involving latent group Lasso regularization where the groups are chains of contiguous indices of length k = 8, that is, where the collection of groups is B = {{1, . . . , k}, {2, . . . , k + 1}, ..., {p − k + 1, . . . , p}}. We choose the support of the parameter w_0 of the model to be {1,..., 10}. Hence, three overlapping chains are needed to retrieve the support of w_0. We generate n = 300 examples (y_i)_{i=1,..,n} from y = x^T w + ε, where x is a standard Gaussian vector and ε ∼ N(0, σ² I_p). The noise level is chosen to be σ = 0.1.

In the upper plot of Figure 3 we show a time comparison of our algorithm on the regularized problem. We implemented Algorithm 1 and three Frank-Wolfe versions: simple FW, FW with line search (FW-ls) and pairwise FW (FW-pw). We also compare with a regularized version of the forward-backward greedy algorithm from Rao et al. (2015) (CoGEnT). In the bottom plot of Figure 3 we show a comparison on the constrained problem. All codes are in Matlab and we used Rao et al.'s code for the forward-backward greedy algorithm.
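To make the group structure of this experiment concrete, the following snippet (Python/NumPy, illustrative; unit weights δ_B = 1 are an assumption, as the paper does not state them) builds the collection of chain groups and the corresponding linear minimization oracle, which here amounts to selecting the chain with the largest Euclidean norm of the gradient restricted to it.

```python
import numpy as np

# Chain groups of length k over p variables, as in the k-chain LGL experiment.
p, k = 100, 8
groups = [np.arange(start, start + k) for start in range(p - k + 1)]

# Linear minimization oracle for this collection: pick the chain B maximizing
# ||s_B||_2 (all delta_B assumed equal to 1) and return the associated atom.
s = np.random.randn(p)
best = max(groups, key=lambda B: np.linalg.norm(s[B]))
atom = np.zeros(p)
atom[best] = s[best] / np.linalg.norm(s[best])  # unit-norm atom supported on the best chain
```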

Figure 3: Experiments for the k-chain group Lasso, where X is a generated random design matrix. Log-log plot of the progress of the duality gap against computation time for FW, FW-ls, FW-pw, CoGEnT (truncation parameter η = 0.5) and column generation; top: regularized problem (k = 8, λ = 5), bottom: constrained problem (k = 10, ρ = 5).

Figure 4: Experiment for the k-chain group Lasso. Top: number of pivots, i.e., drop/full steps in the active set; the left plot shows the number of pivots per active-set call and the right plot shows the total number of pivots during iterations. Bottom: evolution of the number of active atoms in our algorithm.

Figure 4 illustrates the complexity and memory usage of our algorithm for the same experiment. The top plots show that each call to the active-set algorithm has low cost: fewer than two pivots on average, i.e. drop or full steps, are needed to converge. This is clearly due to the use of warm starts. The bottom plot shows the number of active atoms during iterations.

4.3 Hierarchical sparsity

In high-dimensional linear models that involve interaction terms, statisticians usually favor variable selection obeying certain logical hierarchical constraints. In this section we consider a quadratic model (linear + interaction terms) of the form

    y = Σ_{i=1}^p β_i x_i + Σ_{i≠j} β_{ij} x_i x_j.

Strong and weak hierarchical sparsity are usually distinguished (see Bien et al., 2013, and references therein). The Weak Hierarchical (WH) sparsity constraint requires that if an interaction is selected, then at least one of its associated main effects is selected, i.e., β_{ij} ≠ 0 ⇒ β_i ≠ 0 or β_j ≠ 0. We use the latent overlapping group Lasso formulation proposed in Yan and Bien (2015) to formulate our problem. The corresponding collection of groups B thus contains the singletons {i} and contains, for all pairs {i, j}, the sets {i, {i, j}} and {j, {i, j}} (coupling respectively the selection of β_{ij} with that of β_i or that of β_j). We focused on WH sparsity, which is more challenging here because of the group overlaps, but the approach applies also to the counterpart for strong hierarchical constraints.

Simulated data. We consider a quadratic problem with p = 50 main features, which entails that we have p × (p − 1)/2 = 1225 potential interaction terms, and we simulate n = 1000 samples. We choose the parameter β to have 10% of the interaction terms β_{ij} equal to 1 and the rest equal to zero. In order to respect the WH structure, the minimal number of unary terms β_i necessary given the WH constraints are included in the model, with β_i = 0.5. We compare our algorithm with FCFW combined with an interior point solver (FCFW-ip) instead of the active-set subroutine, and with a degraded version of our algorithm not using warm starts. Figure 5 shows that FCFW-ip becomes slower than our algorithm only beyond 200 seconds. A plausible explanation is that at the beginning the subproblems being solved are small and the time is dominated by the search for the new direction; when the size of the problem grows, the active-set with warm starts is faster, meaning that the active-set method exploits the structure of the positivity constraints better than the interior-point method, which has to invert bigger matrices. Full corrections of FCFW-ip call the quadprog function of Matlab, which is an optimized C++ routine, whereas our implementation is done in Matlab. An optimized C implementation of our active-set algorithm, in particular leveraging the rank one updates on the inverse Hessian described in Section 3.3, should provide an additional significant speedup.

Figure 5: Experiments on simulated data for WH sparsity (λ = 1000). Log-plot of the progress of the duality gap as a function of time in seconds for FCFW with interior point, our algorithm without warm starts, and our algorithm.

California housing data set. We apply the previous hierarchical model to the California housing data (Pace and Barry, 1997). The data contains 8 variables, so with interaction terms the initial model contains 36 variables. To make the selection problem more challenging, following She and Jiang (2014), we add 20 main nuisance variables, generated as standard Gaussian random variables, corresponding to 370 additional noisy interaction terms. We compare our algorithm to the greedy forward-backward algorithm with a truncation parameter η = 0.5 and with Block Coordinate Descent (BCD). Table 1 shows the running time for different levels of regularization λ; λ = 10^{-3} is the value selected by 10-fold cross validation on the validation risk. Figure 6 shows the running time for the different algorithms.

Table 1: Computation time in seconds needed to reach a duality gap of 10^{-3} on the California housing data set. Time is not reported when larger than 10^3 seconds.

    λ        10^{-5}   10^{-4}   10^{-3}   10^{-2}   10^{-1}
    BCD      -         -         585       73        5
    CoGEnT   -         -         1300      14        0.2
    ours     27        1.4       0.4       0.06      0.02

Figure 6: Experiments on the California housing data set (λ = 0.001). Log-log plot of the progress of the duality gap against computation time for cyclic BCD, randomized BCD, CoGEnT (η = 0.5) and our method.

4.4 Sparse PCA

We compare our method to the block proximal gradient descent (BCD) described in Richard et al. (2014). We generate a sparse covariance matrix Σ* of size 150×150, obtained as the sum of five overlapping rank-one blocks 11^T of size k × k with k = 10. We generate a noisy covariance with a noise level σ = 0.3. We consider an ℓ2 loss and a regularization by the sparse-PCA gauge described in Section 2.1, with k = 10. The regularization parameter is λ. Figure 7 shows a time comparison with BCD.

Figure 7: Experiments on sparse PCA (k = 10, λ = 0.5). Log-log plot of the progress of the duality gap for BCD and our method.

5 CONCLUSION

In this paper, we have shown that to minimize a quadratic function with an atomic norm regularization or constraint, the fully corrective Frank-Wolfe algorithm, which in the regularized case corresponds exactly to a very simple column generation algorithm that is not well known, is particularly efficient given that sparsity makes the computation of the reduced Hessian relatively cheap. In particular, the corrective step is solved very efficiently with a simple active-set method for quadratic programming. The proposed algorithm takes advantage of warm-starts, and empirically outperforms other Frank-Wolfe schemes, block-coordinate descent (when applicable) and the algorithm of Rao et al. (2015). Its performance could be further enhanced by low-rank updates of the inverse Hessian. In future work we intend to generalize the algorithm to smooth loss functions using sequential quadratic programming.

Acknowledgements

Marina Vinyes is funded by ANR CHORUS research grant 13-MONU-0005-10.

References

Agarwal, A., Negahban, S., Wainwright, M. J., et al. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197.

Argyriou, A., Signoretto, M., and Suykens, J. A. (2014). Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. In Regularization, Optimization, Kernels, and Support Vector Machines, pages 53–82. Chapman and Hall/CRC.

Bach, F. (2013). Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2):145–373.

Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129.

Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869.

Bien, J., Taylor, J., Tibshirani, R., et al. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141.

Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193.

Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):11.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849.

Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596.

Ding, C., Li, T., and Jordan, M. I. (2010). Convex and semi-nonnegative matrix factorizations. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(1):45–55.

Forsgren, A., Gill, P. E., and Wong, E. (2015). Primal and dual active-set methods for convex quadratic programming. Mathematical Programming, pages 1–40.

Foygel, R., Srebro, N., and Salakhutdinov, R. R. (2012). Matrix reconstruction with the local max norm. In Advances in Neural Information Processing Systems, pages 935–943.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Gu, Q., Wang, Z., and Liu, H. (2016). Low-rank and sparse structure pursuit via alternating minimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 600–609.

Harchaoui, Z., Juditsky, A., and Nemirovski, A. (2015). Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1–2):75–112.

Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In ICML.

Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 28, pages 496–504.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):208–220.

Maurer, A. and Pontil, M. (2012). Structured sparsity and generalization. The Journal of Machine Learning Research, 13(1):671–690.

Nesterov, Y. et al. (2015). Complexity bounds for primal-dual methods minimizing the model of objective function. Center for Operations Research and Econometrics, CORE Discussion Paper.

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Science & Business Media.

Obozinski, G. and Bach, F. (2016). A unified perspective on convex structured sparsity: Hierarchical, symmetric, submodular norms and beyond. HAL Id: hal-01412385, version 1.

Obozinski, G., Jacob, L., and Vert, J.-P. (2011). Group Lasso with overlaps: the Latent Group Lasso approach. Preprint HAL inria-00628498.

Pace, R. K. and Barry, R. (1997). Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291–297.

Qin, Z., Scheinberg, K., and Goldfarb, D. (2013). Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation, 5(2):143–169.

Rao, N., Shah, P., and Wright, S. (2015). Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811.

Richard, E., Obozinski, G. R., and Vert, J.-P. (2014). Tight convex relaxations for sparse matrix factorization. In Advances in Neural Information Processing Systems, pages 3284–3292.

Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ.

Shalev-Shwartz, S. and Tewari, A. (2011). Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12(Jun):1865–1892.

She, Y. and Jiang, H. (2014). Group regularized estimation under structural hierarchy. Journal of the American Statistical Association.

Tomioka, R. and Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems, pages 1331–1339.

Wimalawarne, K., Sugiyama, M., and Tomioka, R. (2014). Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems 27, pages 2825–2833.

Wolfe, P. (1976). Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149.

Yan, X. and Bien, J. (2015). Hierarchical sparse modeling: A choice of two regularizers. arXiv preprint arXiv:1512.01631.

Yu, Y., Zhang, X., and Schuurmans, D. (2014). Generalized conditional gradient for sparse estimation. arXiv preprint arXiv:1410.4828.

Zhang, X., Schuurmans, D., and Yu, Y.-L. (2012). Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems 25, pages 2906–2914. Curran Associates, Inc.