Mind the duality gap: safer rules for the Lasso

Olivier Fercoq [email protected]
Alexandre Gramfort [email protected]
Joseph Salmon [email protected]
Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, 46 rue Barrault, 75013 Paris, France

Abstract

Screening rules allow one to discard irrelevant variables early from the optimization in Lasso problems, or their derivatives, making solvers faster. In this paper, we propose new versions of the so-called safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screening out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.

1. Introduction

Since the mid 1990's, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the so-called p > n case. In such a context, least squares with ℓ₁ regularization, referred to as the Lasso (Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees (Bickel et al., 2009), as well as practical benefits: it provides sparse solutions and fast convex solvers are available. This has made the Lasso a popular method in modern data-science toolkits. Among successful fields where it has been applied, one can mention dictionary learning (Mairal, 2010), biostatistics (Haury et al., 2012) and medical imaging (Lustig et al., 2007; Gramfort et al., 2012), to name a few.

Many algorithms exist to approximate Lasso solutions, but it is still a burning issue to accelerate solvers in high dimensions. Indeed, although some other variable selection and prediction methods exist (Fan & Lv, 2008), the best performing methods usually rely on the Lasso. For stability selection methods (Meinshausen & Bühlmann, 2010; Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso problems need to be solved. For non-convex approaches such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), solving the Lasso is often a required preliminary step (Zou, 2006; Zhang & Zhang, 2012; Candès et al., 2008).

Among possible algorithmic candidates for solving the Lasso, one can mention homotopy methods (Osborne et al., 2000), LARS (Efron et al., 2004), and approximate homotopy (Mairal & Yu, 2012), which provide solutions for the full Lasso path, i.e., for all possible choices of the tuning parameter λ. More recently, particularly for p > n, coordinate descent approaches (Friedman et al., 2007) have proved to be among the best methods to tackle large scale problems.

Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coefficients makes it possible to focus on the non-zero ones (likely to represent signal) and helps reduce the computational burden. We refer to (Xiang et al., 2014) for a concise introduction to safe rules. Other alternatives have tried to screen the Lasso while relaxing the "safety": potentially, some variables are wrongly disregarded and post-processing is needed to recover them. This is for instance the strategy adopted for the strong rules (Tibshirani et al., 2012).

The original basic safe rules operate as follows: one

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).
chooses a fixed tuning parameter λ and, before launching any solver, tests whether a coordinate can be zeroed or not (equivalently, whether the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, i.e., a region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior to any optimization iteration. Two directions have emerged to improve on static strategies.

• The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a data-driven manner, e.g., by cross-validation. As two consecutive λ's in the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call such strategies sequential safe rules, also referred to as recursive safe rules in (El Ghaoui et al., 2012). This road has been pursued in (Wang et al., 2013; Xu & Ramadge, 2013; Xiang et al., 2014), and can be thought of as a "warm start" of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that generally only an approximation of the previous dual solution is computed, whereas the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue leads to "unsafe" rules: relevant variables might be wrongly disregarded.

• The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules, following (Bonnefoy et al., 2014a;b).

Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We coined such safe rules GAP SAFE rules.

The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies, and 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.

In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules, since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard data-sets, we report the time improvement compared to prior safe rules.

1.1. Model and notation

We denote by [d] the set {1, . . . , d} for any integer d ∈ ℕ. Our observation vector is y ∈ ℝⁿ and the design matrix X = [x₁, . . . , x_p] ∈ ℝ^{n×p} has p explanatory variables (or features) stored column-wise. We aim at approximating y as a linear combination of few variables x_j, hence expressing y as Xβ where β ∈ ℝ^p is a sparse vector. The standard Euclidean norm is written ‖·‖, the ℓ₁ norm ‖·‖₁, the ℓ∞ norm ‖·‖∞, and the transpose of a matrix Q is denoted by Qᵀ. We also denote (t)₊ = max(0, t).

For such a task, the Lasso is often considered (see Bühlmann & van de Geer (2011) for an introduction). For a tuning parameter λ > 0, controlling the trade-off between data fidelity and sparsity of the solutions, a Lasso estimator β̂(λ) is any solution of the primal optimization problem

    β̂(λ) ∈ arg min_{β ∈ ℝ^p}  (1/2) ‖Xβ − y‖² + λ ‖β‖₁  =: P_λ(β).   (1)

Denoting Δ_X = {θ ∈ ℝⁿ : |x_jᵀθ| ≤ 1, ∀j ∈ [p]} the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):

    θ̂(λ) = arg max_{θ ∈ Δ_X}  (1/2) ‖y‖² − (λ²/2) ‖θ − y/λ‖²  =: D_λ(θ).   (2)

We can reinterpret Eq. (2) as θ̂(λ) = Π_{Δ_X}(y/λ), where Π_C refers to the projection onto a closed convex set C. In particular, this ensures that the dual solution θ̂(λ) is always unique, contrarily to the primal β̂(λ).

1.2. A KKT detour

For the Lasso problem, a primal solution β̂(λ) ∈ ℝ^p and the dual solution θ̂(λ) ∈ ℝⁿ are linked through the relation

    y = Xβ̂(λ) + λθ̂(λ).   (3)

The Karush-Kuhn-Tucker (KKT) conditions state:

    ∀j ∈ [p],  x_jᵀθ̂(λ) ∈ {sign(β̂_j(λ))}  if β̂_j(λ) ≠ 0,   and   x_jᵀθ̂(λ) ∈ [−1, 1]  if β̂_j(λ) = 0.   (4)
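The primal and dual objectives of Eqs. (1) and (2) are cheap to evaluate, and their difference, the duality gap, is the central quantity of this paper. A minimal NumPy sketch (function names are ours, not from the paper's code):

```python
import numpy as np

def primal(X, y, beta, lam):
    """Primal Lasso objective P_lambda(beta), Eq. (1)."""
    return 0.5 * np.linalg.norm(X @ beta - y) ** 2 + lam * np.linalg.norm(beta, 1)

def dual(y, theta, lam):
    """Dual Lasso objective D_lambda(theta), Eq. (2)."""
    return 0.5 * np.linalg.norm(y) ** 2 \
        - 0.5 * lam ** 2 * np.linalg.norm(theta - y / lam) ** 2

def duality_gap(X, y, beta, theta, lam):
    """Gap P(beta) - D(theta); non-negative for any dual feasible theta."""
    return primal(X, y, beta, lam) - dual(y, theta, lam)
```

By weak duality the gap is non-negative whenever theta is feasible, and it vanishes at the optimal primal/dual pair; for λ = λmax, β = 0 and θ = y/λmax are already optimal and the gap is zero.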

Rule                                    | Center          | Radius                                         | Ingredients
Static Safe (El Ghaoui et al., 2012)    | y/λ             | R̃_λ(y/λmax)                                    | λmax = ‖Xᵀy‖∞ = |x_{j*}ᵀy|
Dynamic ST3 (Xiang et al., 2011)        | y/λ − δx_{j*}   | (R̃_λ(θ_k)² − δ²)^{1/2}                         | δ = (λmax/λ − 1)/‖x_{j*}‖
Dynamic Safe (Bonnefoy et al., 2014a)   | y/λ             | R̃_λ(θ_k)                                       | θ_k ∈ Δ_X (e.g., as in (11))
Sequential (Wang et al., 2013)          | θ̂(λ_{t−1})      | |1/λ_t − 1/λ_{t−1}| ‖y‖                         | exact θ̂(λ_{t−1}) required
GAP SAFE sphere (proposed)              | θ_k             | r_{λ_t}(β_k, θ_k) = (2 G_{λ_t}(β_k, θ_k))^{1/2}/λ_t | duality gap for (β_k, θ_k)

Table 1. Review of some common safe sphere tests.

See for instance (Xiang et al., 2014) for more details. The KKT conditions imply that for λ ≥ λmax := ‖Xᵀy‖∞, 0 ∈ ℝ^p is a primal solution. This can be considered as the mother of all safe screening rules, so from now on we assume that λ ≤ λmax for all the considered λ's.

2. Safe rules

Safe rules exploit the KKT condition (4). This equation implies that β̂_j(λ) = 0 as soon as |x_jᵀθ̂(λ)| < 1. The main challenge is that the dual optimal solution is unknown. Hence, a safe rule aims at constructing a set C ⊂ ℝⁿ containing θ̂(λ). We call such a set C a safe region. A safe region is all the more helpful when μ_C(x_j) := sup_{θ∈C} |x_jᵀθ| < 1 for many j's, since then β̂_j(λ) = 0 for those j's.

Practical benefits are obtained if one can construct a region C for which it is easy to compute its support function, denoted by σ_C and defined for any x ∈ ℝⁿ by:

    σ_C(x) = max_{θ∈C} xᵀθ.   (5)

Cast differently, for any safe region C, any j ∈ [p], and any primal optimal solution β̂(λ), the following holds true:

    If μ_C(x_j) = max(σ_C(x_j), σ_C(−x_j)) < 1, then β̂_j(λ) = 0.   (6)

We call safe test, or safe rule, a test associated to C that screens out explanatory variables thanks to Eq. (6).

Remark 1. Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemaréchal, 1993)[Proposition V.2.2.1], we restrict our search to closed convex safe regions.

Based on a safe region C, one can partition the explanatory variables into a safe active set A^(λ)(C) and a safe zero set Z^(λ)(C), where:

    A^(λ)(C) = {j ∈ [p] : μ_C(x_j) ≥ 1},   (7)
    Z^(λ)(C) = {j ∈ [p] : μ_C(x_j) < 1}.   (8)

Note that for nested safe regions C₁ ⊂ C₂, one has A^(λ)(C₁) ⊂ A^(λ)(C₂). Consequently, a natural goal is to find safe regions as small as possible: narrowing safe regions can only increase the number of screened-out variables.

Remark 2. If C = {θ̂(λ)}, the safe active set is the equicorrelation set A^(λ)(C) = E_λ := {j ∈ [p] : |x_jᵀθ̂(λ)| = 1} (in most cases (Tibshirani, 2013) it is exactly the active set of β̂(λ)). If the Lasso has a unique solution, its support is exactly the equicorrelation set. If it is not unique, the equicorrelation set contains all the solutions' supports and there exists a Lasso solution whose support is exactly this set (Tibshirani, 2013)[Lemma 12]. The other extreme case is C = Δ_X, for which A^(λ)(C) = [p]. Here, no variable is screened out: Z^(λ)(C) = ∅ and the screening is useless.

We now consider common safe regions whose support functions are easy to obtain in closed form. For simplicity we focus only on balls and domes, though more complicated regions could be investigated (Xiang et al., 2014).

2.1. Sphere tests

Following previous work on safe rules, we call sphere tests the tests relying on balls as safe regions. For a sphere test, one chooses a ball containing θ̂(λ) with center c and radius r, i.e., C = B(c, r). Due to their simplicity, safe spheres have been the most commonly investigated safe regions (see for instance Table 1 for a brief review). The corresponding test is defined as follows:

    If μ_{B(c,r)}(x_j) = |x_jᵀc| + r‖x_j‖ < 1, then β̂_j(λ) = 0.   (9)

Note that for a fixed center, the smaller the radius, the better the safe screening strategy.

Example 1. The first introduced sphere test (El Ghaoui et al., 2012) consists in using the center c = y/λ and radius r = |1/λ − 1/λmax| ‖y‖. Given that θ̂(λ) = Π_{Δ_X}(y/λ), this is a safe region since y/λmax ∈ Δ_X and ‖y/λ − Π_{Δ_X}(y/λ)‖ ≤ ‖y‖ |1/λ − 1/λmax|. However, one can check that this static safe rule is useless as soon as

    λ/λmax ≤ min_{j∈[p]}  (1 + |x_jᵀy|/(‖x_j‖‖y‖)) / (1 + λmax/(‖x_j‖‖y‖)).   (10)

2.2. Dome tests

Other popular safe regions are domes, the intersection between a ball and a half-space. This kind of safe region has
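The sphere test (9) only needs inner products with the center and the column norms of X, so it vectorizes in one line. A NumPy sketch (our naming), applied below to the static rule of Example 1:

```python
import numpy as np

def sphere_test(X, c, r):
    """Safe sphere test of Eq. (9): boolean mask of the coordinates that
    can be safely set to zero, given a ball B(c, r) containing the dual
    optimum."""
    mu = np.abs(X.T @ c) + r * np.linalg.norm(X, axis=0)
    return mu < 1.0
```

For the static rule of Example 1 one would call it with c = y/λ and r = |1/λ − 1/λmax| ‖y‖; by construction the feature attaining λmax = ‖Xᵀy‖∞ is never screened.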

been considered for instance in (El Ghaoui et al., 2012; Xiang & Ramadge, 2012; Xiang et al., 2014; Bonnefoy et al., 2014b). We denote by D(c, r, α, w) the dome with ball center c, ball radius r, oriented hyperplane with unit normal vector w, and parameter α such that c − αrw is the projection of c on the hyperplane (see Figure 1 for an illustration in the interesting case α > 0).

Figure 1. Representation of the dome D(c, r, α, w) (dark blue). In our case, note that α is positive.

Remark 3. The dome is non-trivial whenever α ∈ [−1, 1]. When α = 0, one simply gets a hemisphere.

For the dome test one needs to compute the support function for C = D(c, r, α, w). Interestingly, as for balls, it can be obtained in closed form. Due to its length though, the formula is deferred to the Appendix (see also (Xiang et al., 2014)[Lemma 3] for more details).

2.3. Dynamic safe rules

For approximating a solution β̂(λ) of the Lasso primal problem P_λ, iterative algorithms are commonly used. We denote by β_k ∈ ℝ^p the current estimate after k iterations of any such algorithm (see Section 4 for a specific study on coordinate descent). Dynamic safe rules aim at discovering safe regions that become narrower as k increases. To do so, one first needs dual feasible points θ_k ∈ Δ_X. Following El Ghaoui et al. (2012) (see also (Bonnefoy et al., 2014a)), this can be achieved by a simple rescaling of the current residual ρ_k = y − Xβ_k, defining θ_k as

    θ_k = α_k ρ_k,  where  α_k = min[ max( yᵀρ_k / (λ‖ρ_k‖²), −1/‖Xᵀρ_k‖∞ ), 1/‖Xᵀρ_k‖∞ ].   (11)

Such a dual feasible θ_k is proportional to ρ_k, and is the closest point (for the norm ‖·‖) to y/λ in Δ_X with this property, i.e., θ_k = Π_{Δ_X ∩ Span(ρ_k)}(y/λ). A reason for choosing this dual point is that the dual optimal solution θ̂(λ) is the projection of y/λ on the dual feasible set Δ_X, and the optimal θ̂(λ) is proportional to y − Xβ̂(λ), cf. Equation (3).

Remark 4. Note that if lim_{k→∞} β_k = β̂(λ) (convergence of the primal), then with the previous display and (3), one can show that lim_{k→∞} θ_k = θ̂(λ). Moreover, the convergence of the primal is unaltered by safe rules: screening out unnecessary coefficients of β_k can only decrease the distance between β_k and β̂(λ).

Example 2. Note that any dual feasible point θ ∈ Δ_X immediately provides a ball that contains θ̂(λ), since

    ‖θ̂(λ) − y/λ‖ = min_{θ′∈Δ_X} ‖θ′ − y/λ‖ ≤ ‖θ − y/λ‖ =: R̃_λ(θ).   (12)

The ball B(y/λ, R̃_λ(θ_k)) corresponds to the simplest safe region introduced in (Bonnefoy et al., 2014a;b) (cf. Figure 2 for more insights). As the algorithm proceeds, one expects θ_k to get closer to θ̂(λ), so ‖θ_k − y/λ‖ should get closer to ‖θ̂(λ) − y/λ‖. Similarly to Example 1, this dynamic rule becomes useless once λ is too small. More precisely, this occurs as soon as

    λ/λmax ≤ min_{j∈[p]}  (1 + |x_jᵀy|/(‖x_j‖‖y‖)) / (λmax‖θ̂(λ)‖/‖y‖ + λmax/(‖x_j‖‖y‖)).   (13)

Noticing that ‖θ̂(λ)‖ ≤ ‖y/λ‖ (since Π_{Δ_X} is a contraction and 0 ∈ Δ_X) and proceeding as for (10), one can show that this dynamic safe rule is inefficient when:

    λ/λmax ≤ min_{j∈[p]}  |x_jᵀy| / λmax.   (14)

This is a critical threshold, yet the screening might stop even at a larger λ because of Eq. (13). In practice the bound in Eq. (13) cannot be evaluated a priori due to the term ‖θ̂(λ)‖. Note also that the bound in Eq. (14) is close to the one in Eq. (10), explaining the similar behavior observed in our experiments (see Figure 3 for instance).

3. New contributions on safe rules

3.1. Support discovery in finite time

Let us first introduce the notions of converging safe regions and converging safe tests.

Definition 1. Let (C_k)_{k∈ℕ} be a sequence of closed convex sets in ℝⁿ containing θ̂(λ). It is a converging sequence of safe regions for the Lasso with parameter λ if the diameters of the sets converge to zero. The associated safe screening rules are referred to as converging safe tests.

Not only are converging safe regions crucial to speed up computation, they are also helpful to reach exact active set identification in a finite number of steps. More precisely, we prove that one recovers the equicorrelation set of the Lasso (cf. Remark 2) in finite time with any converging strategy: after a finite number of steps, the equicorrelation set E_λ is exactly identified. Such a property is

sometimes referred to as finite identification of the support (Liang et al., 2014). This is summarized in the following.

Figure 2. Our new GAP SAFE sphere and dome (in orange): (a) location of the dual optimal θ̂(λ) in the annulus; (b) refined location of the dual optimal θ̂(λ) (dark blue); (c) proposed GAP SAFE sphere (orange); (d) proposed GAP SAFE dome (orange). The dual optimal solution θ̂(λ) must lie in the dark blue region; β is any point in ℝ^p, and θ is any point in the dual feasible set Δ_X. Remark that the GAP SAFE dome is included in the GAP SAFE sphere, and that it is the convex hull of the dark blue region.

Theorem 1. Let (C_k)_{k∈ℕ} be a sequence of converging safe regions. The estimated support provided by C_k, A^(λ)(C_k) = {j ∈ [p] : max_{θ∈C_k} |θᵀx_j| ≥ 1}, satisfies lim_{k→∞} A^(λ)(C_k) = E_λ, and there exists k₀ ∈ ℕ such that ∀k ≥ k₀ one gets A^(λ)(C_k) = E_λ.

Proof. The main idea of the proof is to use that lim_{k→∞} C_k = {θ̂(λ)}, that lim_{k→∞} μ_{C_k}(x) = μ_{θ̂(λ)}(x) = |xᵀθ̂(λ)|, and that the set A^(λ)(C_k) is discrete. Details are deferred to the Appendix.

Remark 5. A more general result is proved for a specific algorithm (Forward-Backward) in Liang et al. (2014). Interestingly, our scheme is independent of the algorithm considered (e.g., Forward-Backward (Beck & Teboulle, 2009), Primal-Dual (Chambolle & Pock, 2011), coordinate descent (Tseng, 2001; Friedman et al., 2007)) and relies only on the convergence of a sequence of safe regions.

3.2. GAP SAFE regions: leveraging the duality gap

In this section, we provide new dynamic safe rules built on converging safe regions.

Theorem 2. Let us take any (β, θ) ∈ ℝ^p × Δ_X. Denote

    R̂_λ(β) := (1/λ) (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)₊^{1/2},   R̃_λ(θ) := ‖θ − y/λ‖,

let θ̂(λ) be the dual optimal Lasso solution, and set r̃_λ(β, θ) := (R̃_λ(θ)² − R̂_λ(β)²)^{1/2}. Then

    θ̂(λ) ∈ B(θ, r̃_λ(β, θ)).   (15)

Proof. The construction of the ball B(θ, r̃_λ(β, θ)) is based on the weak duality theorem (cf. (Rockafellar & Wets, 1998) for a reminder on weak and strong duality). Fix θ ∈ Δ_X and β ∈ ℝ^p; then it holds that

    (1/2)‖y‖² − (λ²/2)‖θ − y/λ‖² ≤ (1/2)‖Xβ − y‖² + λ‖β‖₁.

Hence,

    ‖θ − y/λ‖ ≥ (1/λ) (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)^{1/2}.   (16)

In particular, this provides ‖θ̂(λ) − y/λ‖ ≥ R̂_λ(β). Combining (12) and (16) asserts that θ̂(λ) belongs to the annulus A(y/λ, R̃_λ(θ), R̂_λ(β)) := {z ∈ ℝⁿ : R̂_λ(β) ≤ ‖z − y/λ‖ ≤ R̃_λ(θ)} (the light blue zone in Figure 2).

Remind that the dual feasible set Δ_X is convex, hence Δ_X ∩ B(y/λ, R̃_λ(θ)) is also convex. Thanks to (16), Δ_X ∩ B(y/λ, R̃_λ(θ)) = Δ_X ∩ A(y/λ, R̃_λ(θ), R̂_λ(β)), and then Δ_X ∩ A(y/λ, R̃_λ(θ), R̂_λ(β)) is convex too. Hence, θ̂(λ) is inside the annulus A(y/λ, R̃_λ(θ), R̂_λ(β)), and so is the segment [θ, θ̂(λ)] ⊆ A(y/λ, R̃_λ(θ), R̂_λ(β)) by convexity (see Figure 2, (a) and (b)). Moreover, θ̂(λ) is the point of [θ, θ̂(λ)] which is closest to y/λ. The farthest θ̂(λ) can be according to this information would be if [θ, θ̂(λ)] were tangent to the inner ball B(y/λ, R̂_λ(β)) and ‖θ̂(λ) − y/λ‖ = R̂_λ(β). Let us denote by θ_int such a point. The tangency property reads ‖θ_int − y/λ‖ = R̂_λ(β) and (θ − θ_int)ᵀ(y/λ − θ_int) = 0. Hence, with the latter and the definition of R̃_λ(θ), ‖θ − y/λ‖² = ‖θ − θ_int‖² + ‖θ_int − y/λ‖², and ‖θ − θ_int‖² = R̃_λ(θ)² − R̂_λ(β)². Since by construction θ̂(λ) cannot be farther away from θ than θ_int (again, insights can be gleaned from Figure 2), we conclude that θ̂(λ) ∈ B(θ, (R̃_λ(θ)² − R̂_λ(β)²)^{1/2}).

Remark 6. Choosing β = 0 and θ = y/λmax, one recovers the static safe rule given in Example 1.
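Theorem 2 translates into a few lines of code. The sketch below (our naming, keeping the (·)₊ safeguard of the theorem and clamping tiny negative values due to rounding) computes the GAP SAFE radius r̃_λ(β, θ) for any primal point β and dual feasible θ:

```python
import numpy as np

def gap_safe_radius(X, y, beta, theta, lam):
    """Radius of the GAP SAFE ball B(theta, r) of Theorem 2:
    r = sqrt(R_tilde^2 - R_hat^2), with R_tilde = ||theta - y/lam||
    and R_hat the lower bound of Eq. (16)."""
    R_tilde_sq = np.linalg.norm(theta - y / lam) ** 2
    R_hat_sq = max(np.linalg.norm(y) ** 2
                   - np.linalg.norm(X @ beta - y) ** 2
                   - 2 * lam * np.linalg.norm(beta, 1), 0.0) / lam ** 2
    return np.sqrt(max(R_tilde_sq - R_hat_sq, 0.0))
```

With the "oracle" pair (β̂(λ), θ̂(λ)) the radius is zero; at λ = λmax the pair (0, y/λmax) is already optimal, so the radius vanishes there too.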

With the definition of the primal (resp. dual) objective P_λ(β) (resp. D_λ(θ)), the duality gap reads G_λ(β, θ) = P_λ(β) − D_λ(θ). Remind that if G_λ(β, θ) ≤ ε, then P_λ(β) − P_λ(β̂(λ)) ≤ ε, which is a standard stopping criterion for Lasso solvers. The next proposition establishes a connection between the radius r_λ(β, θ) and the duality gap G_λ(β, θ).

Proposition 1. For any (β, θ) ∈ ℝ^p × Δ_X, the following holds:

    r̃_λ(β, θ)² ≤ r_λ(β, θ)² := (2/λ²) G_λ(β, θ).   (17)

Proof. Use the fact that R̃_λ(θ)² = ‖θ − y/λ‖² and R̂_λ(β)² ≥ (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)/λ².

If we could choose the "oracle" θ = θ̂(λ) and β = β̂(λ) in (15), then we would obtain a zero radius. Since those quantities are unknown, we rather pick dynamically the current estimates provided by an optimization algorithm: β = β_k and θ = θ_k as in Eq. (11). Introducing GAP SAFE spheres and domes as below, Proposition 2 ensures that they are converging safe regions.

GAP SAFE sphere:

    C_k = B(θ_k, r_λ(β_k, θ_k)).   (18)

GAP SAFE dome:

    C_k = D( (y/λ + θ_k)/2,  R̃_λ(θ_k)/2,  2 (R̂_λ(β_k)/R̃_λ(θ_k))² − 1,  (θ_k − y/λ)/‖θ_k − y/λ‖ ).   (19)

Proposition 2. For any converging primal sequence (β_k)_{k∈ℕ} and dual sequence (θ_k)_{k∈ℕ} defined as in Eq. (11), the GAP SAFE sphere and the GAP SAFE dome are converging safe regions.

Proof. For the GAP SAFE sphere the result follows from strong duality: Remark 4 and Proposition 1 yield lim_{k→∞} r_λ(β_k, θ_k) = 0, since lim_{k→∞} θ_k = θ̂(λ) and lim_{k→∞} β_k = β̂(λ). For the GAP SAFE dome, one can check that it is included in the GAP SAFE sphere, and therefore inherits the convergence (see also Figure 2, (c) and (d)).

Remark 7. The radius r_λ(β_k, θ_k) can be compared with the radii considered for the Dynamic Safe rule and Dynamic ST3 (Bonnefoy et al., 2014a), respectively R̃_λ(θ_k) = ‖θ_k − y/λ‖ and (R̃_λ(θ_k)² − δ²)^{1/2}, where δ = (λmax/λ − 1)/‖x_{j*}‖. We have proved that lim_{k→∞} r_λ(β_k, θ_k) = 0, but only a weaker property is satisfied by the two other radii: lim_{k→∞} R̃_λ(θ_k) = R̃_λ(θ̂(λ)) = ‖θ̂(λ) − y/λ‖ and lim_{k→∞} (R̃_λ(θ_k)² − δ²)^{1/2} = (R̃_λ(θ̂(λ))² − δ²)^{1/2} > 0.

3.3. GAP SAFE rules: sequential for free

As a byproduct, our dynamic screening tests provide a warm start strategy for the safe regions, making our GAP SAFE rules inherently sequential. The next proposition shows their efficiency when attacking a new tuning parameter, after having solved the Lasso for a previous λ, even only approximately. Handling approximate solutions is a critical issue for producing safe sequential strategies: without taking into account the approximation error, the screening might disregard relevant variables, especially those near the safe regions' boundaries. Except for λmax, it is unrealistic to assume that one can dispose of exact solutions.

Consider λ_0 = λmax and a non-increasing sequence of T − 1 tuning parameters (λ_t)_{t∈[T−1]} in (0, λmax). In practice, we choose the common grid (Bühlmann & van de Geer, 2011)[2.12.1]: λ_t = λ_0 10^{−δt/(T−1)} (for instance in Figure 3, we considered δ = 3). The next result controls how the duality gap, or equivalently the diameter of our GAP SAFE regions, evolves from λ_{t−1} to λ_t.

Proposition 3. Suppose that t ≥ 1 and (β, θ) ∈ ℝ^p × Δ_X. Reminding r_{λ_t}(β, θ)² = 2 G_{λ_t}(β, θ)/λ_t², the following holds:

    r_{λ_t}(β, θ)² = (λ_{t−1}/λ_t) r_{λ_{t−1}}(β, θ)² + (1 − λ_t/λ_{t−1}) ‖(Xβ − y)/λ_t‖² − (λ_{t−1}/λ_t − 1) ‖θ‖².   (20)

Proof. Details are given in the Appendix.

This proposition motivates screening sequentially as follows: having (β, θ) ∈ ℝ^p × Δ_X such that G_{λ_{t−1}}(β, θ) ≤ ε, we can screen using the GAP SAFE sphere with center θ and radius r_{λ_t}(β, θ). The adaptation to the GAP SAFE dome is straightforward and consists in replacing θ_k, β_k, λ by θ, β, λ_t in the GAP SAFE dome definition.

Remark 8. The basic sphere test of (Wang et al., 2013) requires the exact dual solution θ = θ̂(λ_{t−1}) for its center, and has radius |1/λ_t − 1/λ_{t−1}| ‖y‖, which is strictly larger than ours. Indeed, if one has access to dual and primal optimal solutions at λ_{t−1}, i.e., (θ, β) = (θ̂(λ_{t−1}), β̂(λ_{t−1})), then r_{λ_{t−1}}(β, θ)² = 0, θ = (y − Xβ)/λ_{t−1}, and

    r_{λ_t}(β, θ)² = [ (λ_{t−1}²/λ_t²)(1 − λ_t/λ_{t−1}) − (λ_{t−1}/λ_t − 1) ] ‖θ‖² ≤ (1/λ_t − 1/λ_{t−1})² ‖y‖²,

since ‖θ‖ ≤ ‖y‖/λ_{t−1} for θ = θ̂(λ_{t−1}). Note that contrarily to former sequential rules (Wang et al., 2013), our introduced GAP SAFE rules still work when one only has access to approximations of θ̂(λ_{t−1}).

4. Experiments

4.1. Coordinate Descent

Screening procedures can be used with any optimization algorithm. We chose coordinate descent because it is well suited for machine learning tasks, especially with sparse and/or unstructured design matrices X. Coordinate descent requires extracting columns of X efficiently, which is typically not easy in signal processing applications where X is commonly an implicit operator (e.g., Fourier or wavelets).

Algorithm 1 Coordinate descent with GAP SAFE rules
  input: X, y, ε, K, f, (λ_t)_{t∈[T−1]}
  Initialization: λ_0 = λmax, β^{λ_0} = 0
  for t ∈ [T − 1] do
    β ← β^{λ_{t−1}} (previous ε-solution)
    for k ∈ [K] do
      if k mod f = 1 then
        Compute θ and C thanks to (11) and (18) or (19)
        Get A^{λ_t}(C) = {j ∈ [p] : μ_C(x_j) ≥ 1} as in (7)
        if G_{λ_t}(β, θ) ≤ ε then
          β^{λ_t} ← β
          break
      for j ∈ A^{λ_t}(C) do
        β_j ← ST( λ_t/‖x_j‖², β_j − x_jᵀ(Xβ − y)/‖x_j‖² )
        where ST(u, x) = sign(x)(|x| − u)₊ (soft-thresholding)
  output: (β^{λ_t})_{t∈[T−1]}

Figure 3. Proportion of active variables as a function of λ and the number of iterations K on the Leukemia dataset. Better strategies have a longer range of λ with (red) small active sets.

We implemented the screening rules of Table 1 on top of the coordinate descent in Scikit-learn (Pedregosa et al., 2011). This code is written in Python and Cython to generate low level C code, offering high performance; a low level language is necessary for this algorithm to scale. Two implementations were written to work efficiently with both dense data stored as Fortran ordered arrays and sparse data stored in the compressed sparse column (CSC) format. Our pseudo-code is presented in Algorithm 1. In practice, we perform the dynamic screening tests every f = 10 passes through the entire (active) variables. Iterations are stopped when the duality gap is smaller than the target accuracy.

The naive computation of θ_k in (11) involves the computation of ‖Xᵀρ_k‖∞ (ρ_k being the current residual), which costs O(np) operations. This can be avoided, as one knows when using a safe rule that the index achieving the maximum for this norm is in A^{λ_t}(C). Indeed, by construction arg max_{j∈A^{λ_t}(C)} |x_jᵀθ_k| = arg max_{j∈[p]} |x_jᵀθ_k| = arg max_{j∈[p]} |x_jᵀρ_k|. In practice the evaluation of the dual gap is therefore not O(np) but O(nq), where q is the size of A^{λ_t}(C). In other words, using screening also speeds up the evaluation of the stopping criterion.

We did not compare our method against the strong rules of Tibshirani et al. (2012) because they are not safe and therefore need complex post-processing with parameters to tune. Also, we did not compare against the sequential rule of Wang et al. (2013) (e.g., EDPP) because it requires the exact dual optimal solution of the previous Lasso problem, which is not available in practice and can prevent the solver from actually converging: this is a phenomenon we consistently observed in our experiments.

4.2. Number of screened variables

Figure 3 presents the proportion of variables screened by several safe rules on the standard Leukemia dataset, as a function of the number of iterations K. As the SAFE screening rule of El Ghaoui et al. (2012) is sequential but not dynamic, for a given λ the proportion of screened variables does not depend on K. The rules of Bonnefoy et al. (2014a) are more efficient on this dataset, but they do not benefit much from the dynamic framework. Our proposed GAP SAFE tests screen many more variables, especially when the tuning parameter λ gets small, which is particularly relevant in practice. Moreover, even for very small λ's (notice the logarithmic scale), where no variable is screened at the beginning of the optimization procedure, the GAP SAFE rules manage to screen more variables as K increases. Finally, the figure demonstrates that the GAP SAFE dome test only brings marginal improvement over the sphere.
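Algorithm 1 can be condensed, for a single λ, into a self-contained Python sketch. We use the sphere rule (18) with the radius of Proposition 1; the naming and the simplified gap-based stopping rule below are ours, not the paper's Cython implementation:

```python
import numpy as np

def st(u, x):
    """Soft-thresholding operator ST(u, x) = sign(x)(|x| - u)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def cd_gap_safe(X, y, lam, eps=1e-8, max_iter=1000, f=10):
    """Coordinate descent with dynamic GAP SAFE sphere screening
    (single-lambda sketch of Algorithm 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    norms2 = (X ** 2).sum(axis=0)            # column norms ||x_j||^2
    active = np.arange(p)
    for k in range(max_iter):
        if k % f == 0:
            rho = y - X @ beta               # current residual
            dn = np.max(np.abs(X.T @ rho))   # ||X^T rho||_inf
            theta = rho * np.clip(y @ rho / (lam * (rho @ rho)),
                                  -1.0 / dn, 1.0 / dn)       # Eq. (11)
            gap = (0.5 * np.linalg.norm(rho) ** 2 + lam * np.abs(beta).sum()
                   - 0.5 * np.linalg.norm(y) ** 2
                   + 0.5 * lam ** 2 * np.linalg.norm(theta - y / lam) ** 2)
            if gap <= eps:                   # duality-gap stopping rule
                break
            r = np.sqrt(2 * gap) / lam       # radius of Proposition 1
            mu = np.abs(X.T @ theta) + r * np.sqrt(norms2)   # Eq. (9)
            beta[mu < 1.0] = 0.0             # safely screened coordinates
            active = np.where(mu >= 1.0)[0]  # safe active set, Eq. (7)
        for j in active:                     # one pass of coordinate descent
            bj = beta[j] + X[:, j] @ (y - X @ beta) / norms2[j]
            beta[j] = st(lam / norms2[j], bj)
    return beta
```

The inner update recomputes the full residual for clarity; the paper's implementation instead maintains it incrementally and restricts all O(np) products to the active set.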

5 7 No screening 6 No screening 4 SAFE (El Ghaoui et al.) SAFE (El Ghaoui et al.) 5 ST3 (Bonnefoy et al.) ST3 (Bonnefoy et al.) 3 4 SAFE (Bonnefoy et al.) SAFE (Bonnefoy et al.) 3 2 GAP SAFE (sphere) GAP SAFE (sphere) Time (s) Time (s) GAP SAFE (dome) 2 GAP SAFE (dome) 1 1 0 2 8 6 0 4 10 2 8 6 4 -log10(duality gap) -log10(duality gap)

Figure 4. Time to reach convergence using various screening rules Figure 5. Time to reach convergence using various screening rules on the Leukemia dataset (dense data: n “ 72, p “ 7129). on bag of words from the 20newsgroup dataset (sparse data: with n “ 961, p “ 10094).

4.3. Gains in the computation of Lasso paths

The main interest of variable screening is to reduce computation costs. Indeed, the time to compute the screening itself should not be larger than the gains it provides. Hence, we compared the time needed to compute Lasso paths to a prescribed accuracy for different safe rules. Figures 4, 5 and 6 illustrate the results on three datasets. Figure 4 presents results on the dense, small scale Leukemia dataset. Figure 5 presents results on a medium scale sparse dataset obtained with bag of words features extracted from the 20newsgroup dataset (comp.graphics vs. talk.religion.misc, with TF-IDF, removing English stop words and words occurring only once or more than 95% of the time). Text feature extraction was done using Scikit-Learn. Figure 6 focuses on the large scale sparse RCV1 (Reuters Corpus Volume 1) dataset, cf. (Schmidt et al., 2013).

[Figure: computation time (s) as a function of -log10(duality gap); curves: No screening, SAFE (El Ghaoui et al.), ST3 (Bonnefoy et al.), SAFE (Bonnefoy et al.), GAP SAFE (sphere), GAP SAFE (dome).]

Figure 6. Computation time to reach convergence using different screening strategies on the RCV1 (Reuters Corpus Volume 1) dataset (sparse data: n = 20242, p = 47236).

In all cases, Lasso paths are computed as required to estimate optimal regularization parameters in practice (when using cross-validation, one path is computed for each fold). For each Lasso path, solutions are obtained for T = 100 values of λ, as detailed in Section 3.3. Remark that the grid used is the default one in both Scikit-Learn and the glmnet R package. With our proposed GAP SAFE screening, we obtain substantial gains in computational time on all datasets. We already get up to a 3x speedup when we require a duality gap smaller than 10⁻⁴. The interest of the screening is even clearer at higher accuracies: GAP SAFE sphere is 11x faster than its competitors on the Leukemia dataset at accuracy 10⁻⁸. One can observe that, with the parameter grid used here, the larger p is compared to n, the higher the gain in computation time.

In our experiments, the other safe screening rules did not show much speed-up. As one can see in Figure 3, those screening rules keep all the active variables for a wide range of λ's. The algorithm is thus faster for large λ's but slower afterwards, since we still compute the screening tests. Even if one can avoid some of these useless computations thanks to formulas like (14) or (10), the corresponding speed-up would not be significant.

5. Conclusion

We have presented new results on safe rules for accelerating algorithms solving the Lasso problem (see the Appendix for the extension to the Elastic Net). First, we have introduced the framework of converging safe rules, a key concept independent of the implementation chosen. Our second contribution was to leverage duality gap computations to create two safer rules satisfying the aforementioned convergence properties. Finally, we demonstrated the important practical benefits of those new rules by applying them to standard dense and sparse datasets using a coordinate descent solver. Future works will extend our framework to generalized linear models and the group-Lasso.

Acknowledgment

We acknowledge the support from the Chair Machine Learning for Big Data at Télécom ParisTech and from the Orange/Télécom ParisTech think tank phi-TAB. This work benefited from the support of the "FMJH Program Gaspard Monge in optimization and operation research", and from the support to this program from EDF.
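For readers who wish to reproduce the flavor of the path benchmark of Section 4.3, the following minimal sketch times a full Lasso path at decreasing tolerances with Scikit-Learn's `lasso_path` (T = 100 values of λ on the default grid, as in the experiments). The data here are synthetic, so the timings are illustrative only and will not match the figures.

```python
# Minimal sketch of the Section 4.3 benchmark protocol: time a full Lasso
# path at decreasing solver tolerances. Synthetic toy data; timings are
# illustrative only and will not match the paper's figures.
import time

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(0)
X = rng.randn(72, 500)  # dense toy data with n << p, as for Leukemia
y = rng.randn(72)

for tol in (1e-4, 1e-6, 1e-8):
    t0 = time.time()
    # n_alphas=100 matches the T = 100 default grid; tol is forwarded to
    # the coordinate descent solver's stopping criterion.
    alphas, coefs, gaps = lasso_path(X, y, n_alphas=100, tol=tol)
    print(f"tol={tol:.0e}: {time.time() - t0:.3f} s, "
          f"max dual gap {gaps.max():.2e}")
```

The reported duality gaps shrink with `tol`, which is exactly the accuracy axis (-log10 of the duality gap) used in Figures 4-6.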

References

Bach, F. Bolasso: model consistent Lasso estimation through the bootstrap. In ICML, 2008.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705-1732, 2009.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. A dynamic screening principle for the lasso. In EUSIPCO, 2014a.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. Dynamic screening: accelerating first-order algorithms for the Lasso and Group-Lasso. ArXiv e-prints, 2014b.

Bühlmann, P. and van de Geer, S. Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics. Springer, Heidelberg, 2011.

Candès, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Applicat., 14(5-6):877-905, 2008.

Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120-145, 2011.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33-61, 1998.

Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. Least angle regression. With discussion, and a rejoinder by the authors. Ann. Statist., 32(2):407-499, 2004.

El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667-698, 2012.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348-1360, 2001.

Fan, J. and Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70(5):849-911, 2008.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302-332, 2007.

Gramfort, A., Kowalski, M., and Hämäläinen, M. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Physics in Medicine and Biology, 57(7):1937-1961, 2012.

Haury, A.-C., Mordelet, F., Vera-Licona, P., and Vert, J.-P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Systems Biology, 6(1):145, 2012.

Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex analysis and minimization algorithms. I, volume 305. Springer-Verlag, Berlin, 1993.

Kim, S.-J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE J. Sel. Topics Signal Process., 1(4):606-617, 2007.

Liang, J., Fadili, J., and Peyré, G. Local linear convergence of forward-backward under partial smoothness. In NIPS, pp. 1970-1978, 2014.

Lustig, M., Donoho, D. L., and Pauly, J. M. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182-1195, 2007.

Mairal, J. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan, 2010.

Mairal, J. and Yu, B. Complexity analysis of the lasso regularization path. In ICML, 2012.

Meinshausen, N. and Bühlmann, P. Stability selection. J. Roy. Statist. Soc. Ser. B, 72(4):417-473, 2010.

Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389-403, 2000.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 12:2825-2830, 2011.

Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1998.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267-288, 1996.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. J. Roy. Statist. Soc. Ser. B, 74(2):245-266, 2012.

Tibshirani, R. J. The lasso problem and uniqueness. Electron. J. Stat., 7:1456-1490, 2013.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.

Varoquaux, G., Gramfort, A., and Thirion, B. Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, 2012.

Wang, J., Zhou, J., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. In NIPS, pp. 1070-1078, 2013.

Xiang, Z. J. and Ramadge, P. J. Fast lasso screening tests based on correlations. In ICASSP, pp. 2137-2140, 2012.

Xiang, Z. J., Xu, H., and Ramadge, P. J. Learning sparse representations of high dimensional data on large scale dictionaries. In NIPS, pp. 900-908, 2011.

Xiang, Z. J., Wang, Y., and Ramadge, P. J. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.

Xu, P. and Ramadge, P. J. Three structural results on the lasso problem. In ICASSP, pp. 3392-3396, 2013.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894-942, 2010.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576-593, 2012.

Zou, H. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418-1429, 2006.