Mind the Duality Gap: Safer Rules for the Lasso
Olivier Fercoq, Alexandre Gramfort, Joseph Salmon
Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI
46 rue Barrault, 75013, Paris, France

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Screening rules allow one to discard irrelevant variables early from the optimization of Lasso problems, or of its variants, making solvers faster. In this paper, we propose new versions of the so-called safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screen out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.

1. Introduction

Since the mid 1990's, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the so-called $p > n$ case. In such a context, least squares with $\ell_1$ regularization, referred to as the Lasso (Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees (Bickel et al., 2009), as well as practical benefits: it provides sparse solutions, and fast convex solvers are available. This has made the Lasso a popular method in modern data-science toolkits. Among the fields where it has been applied successfully, one can mention dictionary learning (Mairal, 2010), biostatistics (Haury et al., 2012) and medical imaging (Lustig et al., 2007; Gramfort et al., 2012), to name a few.

Many algorithms exist to approximate Lasso solutions, but accelerating solvers in high dimensions remains a pressing issue. Indeed, although other variable selection and prediction methods exist (Fan & Lv, 2008), the best performing methods usually rely on the Lasso. For stability selection methods (Meinshausen & Bühlmann, 2010; Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso problems need to be solved. For non-convex approaches such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), solving the Lasso is often a required preliminary step (Zou, 2006; Zhang & Zhang, 2012; Candès et al., 2008).

Among the algorithmic candidates for solving the Lasso, one can mention homotopy methods (Osborne et al., 2000), LARS (Efron et al., 2004), and approximate homotopy (Mairal & Yu, 2012), which provide solutions for the full Lasso path, i.e., for all possible choices of the tuning parameter $\lambda$. More recently, particularly for $p > n$, coordinate descent approaches (Friedman et al., 2007) have proved to be among the best methods for tackling large scale problems.

Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coefficients allows the solver to focus on the non-zero ones (likely to represent signal) and helps reduce the computational burden. We refer to Xiang et al. (2014) for a concise introduction to safe rules. Other alternatives screen the Lasso while relaxing the "safety": potentially, some variables are wrongly disregarded and post-processing is needed to recover them. This is, for instance, the strategy adopted by the strong rules (Tibshirani et al., 2012).

The original basic safe rules operate as follows: one chooses a fixed tuning parameter $\lambda$, and before launching any solver, tests whether a coordinate can be zeroed or not (equivalently, whether the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, i.e., a region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior to any optimization iteration. Two directions have emerged to improve on static strategies.

• The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a data-driven manner, e.g., by cross-validation. As two consecutive $\lambda$'s in the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call such strategies sequential safe rules; they are also referred to as recursive safe rules in El Ghaoui et al. (2012). This road has been pursued in (Wang et al., 2013; Xu & Ramadge, 2013; Xiang et al., 2014), and can be thought of as a "warm start" of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that, generally, only an approximation of the previous dual solution is computed, whereas the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue leads to "unsafe" rules: relevant variables might be wrongly disregarded.

• The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules, following (Bonnefoy et al., 2014a;b).

Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We have coined such rules GAP SAFE rules.

The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies, and 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.

In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard datasets, we report the time improvement compared to prior safe rules.

1.1. Model and notation

We denote by $[d]$ the set $\{1, \dots, d\}$ for any integer $d \in \mathbb{N}$. Our observation vector is $y \in \mathbb{R}^n$ and the design matrix $X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$ has $p$ explanatory variables (or features) column-wise. We aim at approximating $y$ as a linear combination of few variables $x_j$'s, hence expressing $y$ as $X\beta$ where $\beta \in \mathbb{R}^p$ is a sparse vector. The standard Euclidean norm is written $\|\cdot\|$, the $\ell_1$ norm $\|\cdot\|_1$, the $\ell_\infty$ norm $\|\cdot\|_\infty$, and the transpose of a matrix $Q$ is denoted by $Q^\top$. We denote $(t)_+ = \max(0, t)$.

For such a task, the Lasso is often considered (see Bühlmann & van de Geer (2011) for an introduction). For a tuning parameter $\lambda > 0$, controlling the trade-off between data fidelity and sparsity of the solutions, a Lasso estimator $\hat\beta^{(\lambda)}$ is any solution of the primal optimization problem

$$\hat\beta^{(\lambda)} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \underbrace{\frac{1}{2}\|X\beta - y\|_2^2 + \lambda \|\beta\|_1}_{=: P_\lambda(\beta)}. \quad (1)$$

Denoting $\Delta_X = \{\theta \in \mathbb{R}^n : |x_j^\top \theta| \le 1, \ \forall j \in [p]\}$ the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):

$$\hat\theta^{(\lambda)} = \operatorname*{arg\,max}_{\theta \in \Delta_X \subset \mathbb{R}^n} \underbrace{\frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2}_{=: D_\lambda(\theta)}. \quad (2)$$

We can reinterpret Eq. (2) as $\hat\theta^{(\lambda)} = \Pi_{\Delta_X}(y/\lambda)$, where $\Pi_C$ refers to the projection onto a closed convex set $C$. In particular, this ensures that the dual solution $\hat\theta^{(\lambda)}$ is always unique, contrary to the primal $\hat\beta^{(\lambda)}$.

1.2. A KKT detour

For the Lasso problem, a primal solution $\hat\beta^{(\lambda)} \in \mathbb{R}^p$ and the dual solution $\hat\theta^{(\lambda)} \in \mathbb{R}^n$ are linked through the relation:

$$y = X\hat\beta^{(\lambda)} + \lambda\hat\theta^{(\lambda)}. \quad (3)$$

The Karush-Kuhn-Tucker (KKT) conditions state:

$$\forall j \in [p], \quad x_j^\top \hat\theta^{(\lambda)} \in \begin{cases} \{\operatorname{sign}(\hat\beta_j^{(\lambda)})\} & \text{if } \hat\beta_j^{(\lambda)} \neq 0, \\ [-1, 1] & \text{if } \hat\beta_j^{(\lambda)} = 0. \end{cases} \quad (4)$$
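To make the primal-dual relations above concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code): it evaluates $P_\lambda$ and $D_\lambda$, builds a dual feasible point by rescaling the residual $y - X\beta$ into $\Delta_X$ (a standard construction; the paper may use a slightly different one), and returns the duality gap. All function names are ours.

```python
import numpy as np

def primal_obj(X, y, beta, lam):
    """Lasso primal objective P_lam(beta) from Eq. (1)."""
    residual = y - X @ beta
    return 0.5 * residual @ residual + lam * np.abs(beta).sum()

def dual_obj(y, theta, lam):
    """Lasso dual objective D_lam(theta) from Eq. (2)."""
    return 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)

def feasible_dual_point(X, y, beta, lam):
    """Map the residual y - X beta into the dual feasible set
    Delta_X = {theta : |x_j^T theta| <= 1 for all j} by rescaling.
    This is a standard choice, assumed here for illustration."""
    residual = y - X @ beta
    return residual / max(lam, np.abs(X.T @ residual).max())

def duality_gap(X, y, beta, lam):
    """Duality gap G_lam(beta, theta) = P_lam(beta) - D_lam(theta) >= 0."""
    theta = feasible_dual_point(X, y, beta, lam)
    return primal_obj(X, y, beta, lam) - dual_obj(y, theta, lam), theta
```

At an optimum, Eq. (3) gives $\hat\theta^{(\lambda)} = (y - X\hat\beta^{(\lambda)})/\lambda$ and the gap vanishes, so the gap certifies how close a primal-dual pair is to optimality.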
Table 1. Safe sphere regions: rule, center, radius, and required ingredients.

Rule | Center | Radius | Ingredients
Static Safe (El Ghaoui et al., 2012) | $y/\lambda$ | $R_\lambda(y/\lambda_{\max})$ | $\lambda_{\max} = \|X^\top y\|_\infty = |x_{j_\star}^\top y|$
Dynamic ST3 (Xiang et al., 2011) | $y/\lambda - \delta x_{j_\star}$ | $\big(R_\lambda(\theta_k)^2 - \delta^2\big)^{1/2}$ | $\delta = (\lambda_{\max}/\lambda - 1)/\|x_{j_\star}\|$
Dynamic Safe (Bonnefoy et al., 2014a) | $y/\lambda$ | $R_\lambda(\theta_k)$ | $\theta_k \in \Delta_X$ (e.g., as in (11))
Sequential (Wang et al., 2013) | $\hat\theta^{(\lambda_{t-1})}$ | $\big|\tfrac{1}{\lambda_t} - \tfrac{1}{\lambda_{t-1}}\big| \, \|y\|$ | exact $\hat\theta^{(\lambda_{t-1})}$ required
GAP SAFE sphere (proposed) | $\theta_k$ | $r_{\lambda_t}(\beta_k, \theta_k) = \sqrt{2 G_{\lambda_t}(\beta_k, \theta_k)}/\lambda_t$ | dual gap for $(\beta_k, \theta_k)$
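The sphere regions above all feed the same screening test, which follows from the KKT condition (4): if $|x_j^\top \theta| < 1$ for every $\theta$ in a region known to contain $\hat\theta^{(\lambda)}$, then $\hat\beta_j^{(\lambda)} = 0$ and column $j$ can be dropped. Here is a hedged NumPy sketch of the test with the GAP SAFE radius (our own illustration; the function name and update schedule are assumptions, not the paper's implementation):

```python
import numpy as np

def gap_safe_screen(X, theta, gap, lam):
    """Boolean mask of columns that can be safely discarded.

    For a sphere of center `theta` and radius r = sqrt(2 * gap) / lam,
    the worst case of |x_j^T theta'| over the sphere is
    |x_j^T theta| + r * ||x_j|| (Cauchy-Schwarz, attained on the
    boundary), so feature j is provably inactive when this is < 1.
    """
    radius = np.sqrt(2.0 * max(gap, 0.0)) / lam
    col_norms = np.linalg.norm(X, axis=0)
    return np.abs(X.T @ theta) + radius * col_norms < 1.0
```

Inside a coordinate descent solver, one would typically recompute $(\theta_k, \text{gap})$ every few epochs and restrict subsequent updates to the surviving columns; as the gap shrinks, the radius shrinks and more features are screened out.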