The Group-Lasso for Generalized Linear Models: Uniqueness of Solutions and Efficient Algorithms

Volker Roth [email protected] Computer Science Department, University of Basel, Bernoullistr. 16, CH-4056 Basel, Switzerland

Bernd Fischer [email protected] Institute of Computational Science, ETH Zurich, Universitaetstrasse 6, CH-8092 Zurich, Switzerland

Abstract

The Group-Lasso method for finding important explanatory factors suffers from the potential non-uniqueness of solutions and also from high computational costs. We formulate conditions for the uniqueness of Group-Lasso solutions which lead to an easily implementable test procedure that allows us to identify all potentially active groups. These results are used to derive an efficient algorithm that can deal with input dimensions in the millions and can approximate the solution path efficiently. The derived methods are applied to large-scale learning problems where they exhibit excellent performance and where the testing procedure helps to avoid misinterpretations of the solutions.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

In many practical learning problems we are not only interested in low prediction errors but also in identifying important explanatory factors. These explanatory factors can often be represented as groups of input variables. Common examples are k-th order polynomial expansions of the inputs where the groups consist of products over combinations of variables up to degree k. Such expansions compute explicit mappings into feature spaces induced by polynomial kernel functions of the form k(x, y) = (1 + x · y)^k. Another popular example is categorical variables that are represented as groups of dummy variables.

A method for variable selection which has gained particular attention is the Lasso (Tibshirani, 1996), which exploits the idea of using ℓ1-constraints in fitting problems. The Group-Lasso (Yuan & Lin, 2006) extends the former in the sense that it finds solutions that are sparse on the level of groups of variables, which makes this method a good candidate for the situations described above. The Group-Lasso estimator, however, has several drawbacks: (i) in high-dimensional spaces, the solutions may not be unique. The potential existence of several solutions that involve different variables seriously hampers the interpretability of "identified" explanatory factors; (ii) existing algorithms can handle input dimensions up to thousands (Kim et al., 2006) or even several thousands (Meier et al., 2008), but in practical applications with high-order interactions or polynomial expansions these limits are easily exceeded; (iii) contrary to the standard Lasso, the solution path (i.e. the evolution of the individual group norms as a function of the constraint) is not piecewise linear, which precludes the application of efficient optimization methods like least angle regression (LARS) (Efron et al., 2004).

In this paper we address all these issues: (i) we derive conditions for the completeness and uniqueness of Group-Lasso estimates, where we call a solution complete if it includes all groups that might be relevant in other solutions (meaning that we cannot have "overlooked" relevant groups). Based on these conditions we develop an easily implementable test procedure. If a solution is not complete, this procedure identifies all other groups that may be included in alternative solutions with identical costs. (ii) These results allow us to formulate a highly efficient active-set algorithm that can deal with input dimensions in the millions. (iii) The solution path can be approximated on a fixed grid of constraint values with almost no additional computational costs. Large-scale applications using both synthetic and real data illustrate the excellent performance of the developed concepts and algorithms. In particular, we demonstrate that the proposed completeness test successfully detects ambiguous solutions and thus avoids the misinterpretation of "identified" explanatory factors.

2. Characterization of Group-Lasso Solutions for Generalized Linear Models

This section largely follows (Osborne et al., 2000), with the exception that here we address the Group-Lasso problem and a more general class of likelihood functions.

According to (McCullagh & Nelder, 1983), a generalized linear model (GLM) consists of three elements:
(i) a random component f(y; µ) specifying the stochastic behavior of a response variable Y;
(ii) a systematic component η = x⊤β specifying the variation in the response variable accounted for by known covariates x; and
(iii) a link function g(µ) = η specifying the relationship between the random and systematic components.

The random component f(y; µ) is typically an exponential family distribution

f(y; θ, φ) = exp(φ⁻¹(yθ − b(θ)) + c(y, φ)),   (1)

with natural parameter θ, sufficient statistic y/φ, log partition function b(θ)/φ and a scale parameter φ > 0.

Note that in the model (1) the mean of the responses µ = E_θ[y] is related to the natural parameter θ by µ = b′(θ). The link function g can be any strictly monotone differentiable function. In the following, however, we will consider only canonical link functions for which g(µ) = η = θ. We will thus use the parametrization f(y; η, φ).

From a technical perspective, an important property of this framework is that log f(y; η, φ) is strictly concave in η. This follows from the fact that the one-dimensional sufficient statistic y/φ is necessarily minimal, which implies that the log partition function b(η)/φ is strictly convex, see (Brown, 1986; Wainwright et al., 2005).

The standard linear regression model is a special case derived from the normal distribution with φ = σ², the identity link η = µ and b(η) = (1/2)η². Other popular models include logistic regression (binomial distribution), Poisson regression for count data and gamma- (or exponential-, Weibull-) models for cost- or survival analysis.

Given an i.i.d. data sample {x_1, ..., x_n}, x_i ∈ R^d, arranged as rows of the data matrix X, and a corresponding vector of responses y = (y_1, ..., y_n)⊤, we will consider the problem of minimizing the negative log-likelihood

l(y, η, φ) = − Σ_i log f(y_i; η_i, φ) = − Σ_i [φ⁻¹(y_i η_i − b(η_i)) + c(y_i, φ)].   (2)

We assume that the scale parameter is known, and for the sake of simplicity we assume φ = 1. Since η = x⊤β, the gradient of l can be viewed as a function in either η or β:

∇_η l(η) = −(y − g⁻¹(η)),   ∇_β l(β) = −X⊤ ∇_η l(η) = −X⊤(y − g⁻¹(Xβ)),   (3)

where g⁻¹(η) := (g⁻¹(η_1), ..., g⁻¹(η_n))⊤. The corresponding Hessians are

H_η = W,   H_β = X⊤WX,   (4)

where W is diagonal with elements W_ii = (g⁻¹)′(η_i) = 1/g′(µ_i) = µ′(η_i) = b″(η_i).

For the following derivation, it is convenient to partition X, β and h := ∇_β l into J subgroups: X = (X_1, ..., X_J),

β = (β_1⊤, ..., β_J⊤)⊤,   h = (h_1⊤, ..., h_J⊤)⊤ with h_j = X_j⊤ ∇_η l.   (5)

As stated above, b is strictly convex in θ = η, thus b″(η_i) > 0, which in turn implies that H_η ≻ 0 and H_β ⪰ 0. This means that l is a strictly convex function in η. For general matrices X it is convex in β, and it is strictly convex in β if X has full rank and d ≤ n.

Given X and y, the Group-Lasso minimizes the negative log-likelihood viewed as a function in β under a constraint on the sum of the ℓ2-norms of the subvectors β_j:

minimize l(β)   s.t.   g(β) ≥ 0,   (6)

where

g(β) = κ − Σ_{j=1}^J ‖β_j‖.   (7)

Here g(β) is implicitly a function of the fixed parameter κ. Considering the unconstrained problem, the solution is not unique if the dimensionality exceeds n: every β* = β⁰ + ξ with ξ being an element of the null space N(X) is also a solution. By defining the unique value

κ_0 := min_{ξ∈N(X)} Σ_{j=1}^J ‖β_j⁰ + ξ_j‖,   (8)

we will require that the constraint is active, i.e. κ < κ_0. Note that the minimum κ_0 is unique, even though there might exist several vectors ξ ∈ N(X) which attain this minimum. Enforcing the constraint to be active is essential for the following characterization of solutions. Although it might be infeasible to ensure this activeness by computing κ_0 and selecting κ accordingly, practical algorithms will not suffer from this problem: given a solution, we can always check if the constraint was active. If this was not the case, then the uniqueness question reduces to checking if d ≤ n (if X has full rank). In this case the solutions are usually not sparse, because the constraint mechanism has been switched off. To produce a sparse solution, one can then try smaller κ-values until the constraint is active. In Section 3 we propose a more elegant solution to this problem in the form of an algorithm that approximates the solution path, i.e. the evolution of the group norms when relaxing the constraint. This algorithm can be initialized with an arbitrarily small constraint value κ^(0), which typically ensures that the constraint is active in the first optimization step. Activeness of the constraint in the following steps can then be monitored by observing the decay of the Lagrange parameter when increasing κ, cf. Eq. (14) below.
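For illustration, the quantities defined above are easy to compute for a concrete model. The following minimal Python sketch (the function names and the representation of groups as column-index arrays are illustrative choices, not the implementation used in our experiments) evaluates the negative log-likelihood (2), the group-wise gradients h_j of Eqs. (3) and (5), and the constraint (7) for a logistic GLM with canonical link:

import numpy as np

# Sketch only: logistic GLM with canonical link, y in {0,1}; b(eta) = log(1 + exp(eta)).
def neg_log_lik(X, y, beta):
    eta = X @ beta                                     # systematic component, Eq. (2)
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))

def group_gradients(X, y, beta, groups):
    """groups: list of column-index arrays, one per subvector beta_j."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))             # mu = g^{-1}(eta)
    grad_eta = mu - y                                  # = -(y - g^{-1}(eta)), Eq. (3)
    return [X[:, idx].T @ grad_eta for idx in groups]  # h_j = X_j^T grad_eta, Eq. (5)

def constraint(beta, groups, kappa):
    # g(beta) = kappa - sum_j ||beta_j||, Eq. (7); beta is feasible iff g(beta) >= 0
    return kappa - sum(np.linalg.norm(beta[idx]) for idx in groups)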

Under the assumption l > −∞, a minimum of (6) is guaranteed to exist, since l is continuous and the region of feasible vectors β is compact. The assumption l > −∞ simply means that the likelihood is finite (f < +∞) for all parameter values, which is usually satisfied for models of practical importance (see (Wedderburn, 1973) for a detailed discussion), and we will restrict our further analysis to models of this kind. (Technically we require that the domain of l is R^d, which implies that Slater's condition holds.) Since we assume that the constraint is active, any solution β̂ will lie on the boundary of the constraint region. It is easily seen that Σ_{j=1}^J ‖β_j‖ is convex, which implies that g(β) is concave. Thus, the region of feasible values defined by g(β) ≥ 0 is convex. If d ≤ n, the objective function l will be strictly convex if X has full rank, which additionally implies that the minimum is unique. In summary, we can state the following theorem:

Theorem 1. If κ < κ_0 and X has maximum rank, then the following holds: (i) A solution β̂ exists and Σ_{j=1}^J ‖β̂_j‖ = κ for any such solution. (ii) If d ≤ n, the solution is unique.

The Lagrangian for problem (6) reads

L(β, λ) = l(β) − λ g(β).   (9)

For a given λ > 0, L(β, λ) is a convex function in β. Under the assumption l > −∞ a minimum is guaranteed to exist, since −λg(β), and hence L(β, λ), goes to infinity if ‖β‖ → ∞.

The vector β̂ minimizes L(β, λ) iff the d-dimensional null vector 0_d is an element of the subdifferential ∂_β L(β, λ). Let d_j denote the dimension of the j-th subvector β_j (i.e. the size of the j-th subgroup). The subdifferential is

∂_β L(β, λ) = ∇_β l(β) + λv = X⊤ ∇_η l(η) + λv,   (10)

with v = (v_1⊤, ..., v_J⊤)⊤ defined by

v_j = β_j/‖β_j‖   if β_j ≠ 0_{d_j},
v_j ∈ {a ∈ R^{d_j} : ‖a‖ ≤ 1}   else.   (11)

Thus, β̂ is a minimizer for λ fixed iff

0_d = X⊤ ∇_η l(η)|_{η=η̂} + λv   (with η̂ = Xβ̂),   (12)

for some v of the form described above. Hence, for all j with β̂_j ≠ 0_{d_j} it holds that

‖X_j⊤ ∇_η l(η)|_{η=η̂}‖ = λ.   (13)

For all other j with β̂_j = 0_{d_j} it holds that ‖X_j⊤ ∇_η l(η)|_{η=η̂}‖ ≤ λ, which implies

λ = max_j ‖X_j⊤ ∇_η l(η)|_{η=η̂}‖.   (14)

Lemma 1. Let β̂ be a solution of (6). Let λ = λ(β̂) be the associated Lagrangian multiplier. Then λ and h = ∇_β l(β)|_{β=β̂} are constant across all solutions β̂_(i) of (6).

Proof. Since the value of the objective function l(η̂_(i)) = l* is constant across all solutions and l is strictly convex in η = Xβ and convex in β, it follows that η̂ must be constant across all solutions β̂_(i), which implies that ∇_β l(β)|_{β=β̂} = X⊤ ∇_η l(η)|_{η=η̂} is constant across all solutions. Uniqueness of λ now follows from (14).

Theorem 2. Let λ be the Lagrangian parameter associated with some (any) solution β̂ of (6) and let h be the unique gradient vector at the optimum. Let B = {j_1, ..., j_p} be the unique set of indices for which ‖h_j‖ = λ. Then β̂_j = 0_{d_j} ∀j ∉ B across all solutions β̂ of (6).

Proof. A solution with β̂_j ≠ 0_{d_j} for at least one j ∉ B would contradict (13).

Assume that an algorithm has found a solution β̂ of (6) with the set of "active" groups A := {j : β̂_j ≠ 0}. If A = B = {j : ‖h_j‖ = λ}, then there cannot exist any other solution with an active set A′ with |A′| > |A|. Thus, A = B implies that all relevant groups are contained in the solution β̂. Otherwise, the additional elements in B which are not contained in A define all possible groups that potentially become active in alternative solutions.

Note that A = B guarantees that we cannot have "overlooked" relevant groups, which is typically sufficient in practical applications. We will call such a solution complete. However, A might still contain redundant groups, and we might additionally be interested in whether we have found a unique (and thus minimal) set A. The following theorem characterizes a simple test for uniqueness under a further rank assumption on the data matrix X.

Theorem 3. Assume that every n × n submatrix of X has full rank. Let A be the active set corresponding to some solution β̂ of (6) and let X_A be the n × s submatrix of X composed of all active groups. Assume further that A is complete, i.e. A = B. Then, if s ≤ n, β̂ is the unique solution of (6).

Proof. Since the set B is unique, the assumption A = B implies that the search for the optimal solution can be restricted to the space S = R^s. If s ≤ n, X_A must have full rank by assumption. Thus, l(β_S) is a strictly convex function on S which is minimized over the convex constraint set. Thus, β̂_S is the unique minimizer on S. Since all other β_j, j ∉ A, must be zero, β̂ is unique on the whole space.
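Theorems 2 and 3 translate directly into a simple numerical test. The sketch below reuses group_gradients() from the previous snippet; the tolerance eps, which anticipates the ǫ-range discussed in Section 3, and the function name are again illustrative choices. It returns the inactive groups whose gradient norms attain λ (the candidates for alternative solutions) and, if the solution is complete, checks the rank condition of Theorem 3:

import numpy as np

# Sketch of the completeness/uniqueness test (Theorems 2 and 3).
def completeness_test(X, y, beta, groups, eps=1e-6):
    h = group_gradients(X, y, beta, groups)
    norms = np.array([np.linalg.norm(hj) for hj in h])
    active = [j for j, idx in enumerate(groups) if np.linalg.norm(beta[idx]) > 0]
    lam = norms[active].max()                         # lambda = max_j ||h_j||, Eq. (14)
    B = [j for j in range(len(groups)) if norms[j] >= lam - eps]
    candidates = sorted(set(B) - set(active))         # groups that may enter alternative solutions
    complete = len(candidates) == 0                   # A = B (within eps)
    unique = False
    if complete:                                      # optional check via Theorem 3
        XA = np.hstack([X[:, groups[j]] for j in active])
        unique = XA.shape[1] <= XA.shape[0] and \
                 np.linalg.matrix_rank(XA) == XA.shape[1]
    return complete, unique, candidates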

In practice, it might be difficult to guarantee the rank condition in the above theorem. Note, however, that for a given active set A and associated matrix X_A it is sufficient to check if rank(X_A) = s via SVD or QR-decomposition.

3. An Efficient Active-Set Algorithm

The characterization of optimal solutions presented above is now used to build a highly efficient algorithm, which is a straightforward generalization of the subset algorithm for the standard Lasso problem presented in (Osborne et al., 2000). Similar ideas for the standard Lasso have also been introduced in (Shevade & Keerthi, 2003). The algorithm starts with only one active group. The selection of further active groups (or their removal) is guided by observing Lagrangian violations. Testing for completeness of the active set will then identify all groups that could have nonzero coefficients in alternative solutions.

A: Initialize. Set A = {j_0}, with β_{j_0} arbitrary such that ‖β_{j_0}‖ = κ.
B: Optimize over the current active set A. Define the set A⁺ = {j ∈ A : ‖β_j‖ > 0} (some β_j could have vanished during optimization). Define λ = max_{j∈A⁺} ‖h_j‖. Adjust the active set A = A⁺.
C: Lagrangian violation. ∀j ∉ A, check if ‖h_j‖ ≤ λ. If this is the case, we have found a global solution. Otherwise, include the group with the largest violation in A and go to B.
D: Completeness and uniqueness. ∀j ∉ A, check if ‖h_j‖ = λ. If so, there might exist other solutions with identical costs that include these groups in the active set. Otherwise, the active set is complete in the sense that it contains all relevant groups. If X_A has full rank s ≤ n, uniqueness can be checked additionally via Theorem 3. Note that step D requires (almost) no additional computations, since it is a by-product of step C.

The above algorithm is easily extended to practical optimization routines in which we stop the fitting process at a predefined tolerance level: testing for "completeness within an ǫ-range" (|‖h_j‖ − λ| < ǫ in D, with ǫ being the maximum deviation of gradient norms from λ in the active set) will then identify all potentially active groups in alternative solutions with costs close to the actual costs.
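For illustration, steps A to D can be sketched in a few lines of Python. The snippet assumes group_gradients() from Section 2 and a black-box inner solver fit_on_active_set() for step B (for instance the projected gradient method described next); it is a simplified sketch rather than the implementation used in the experiments:

import numpy as np

# Sketch of the outer active-set loop (steps A-D).
def group_lasso_active_set(X, y, groups, kappa, fit_on_active_set, eps=1e-6):
    beta = np.zeros(X.shape[1])
    active = [0]                                           # step A: one initial group
    beta[groups[0]] = kappa / np.sqrt(len(groups[0]))      # any beta_{j0} with norm kappa
    while True:
        beta = fit_on_active_set(X, y, beta, active, groups, kappa)   # step B
        active = [j for j in active if np.linalg.norm(beta[groups[j]]) > 0]
        norms = np.array([np.linalg.norm(hj)
                          for hj in group_gradients(X, y, beta, groups)])
        lam = norms[active].max()
        violating = [j for j in range(len(groups))
                     if j not in active and norms[j] > lam + eps]     # step C
        if not violating:
            candidates = [j for j in range(len(groups))               # step D
                          if j not in active and norms[j] >= lam - eps]
            return beta, active, candidates
        active.append(max(violating, key=lambda j: norms[j]))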

The minimization in step B can be performed efficiently by the projected gradient method introduced in (Kim et al., 2006), which is applicable for all continuous convex cost functions. Finding the projection is typically the computational bottleneck in methods of this kind. For our special case, however, the projection can be found very efficiently. We refer the reader to (Kim et al., 2006) for details. Iterate:

B1: Gradient. At time t−1, set b = β^{t−1} − s∇_β l(β^{t−1}) and A⁺ = A, where s is a step-size parameter.
B2: Projection. For all j ∈ A⁺ define M_j := ‖b_j‖ + (κ − Σ_{j∈A⁺} ‖b_j‖)/|A⁺|. If M_j ≥ 0 ∀j ∈ A⁺, go to B3. Else update the active set A⁺ = {j : M_j > 0} and repeat B2.
B3: New solution. For all j ∈ A⁺ set β_j^t = b_j M_j/‖b_j‖. For all other j ∈ A, j ∉ A⁺, set β_j^t = 0.

Note that during the whole algorithm, access to the full set of variables is only necessary in steps C and D, which are outside the core optimization routine. Thus, in large-scale applications where not all groups can be held in main memory, we still have a rather efficient method, even if we have to access external storage in steps C/D.
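The projection in steps B2/B3 shifts all group norms by a common offset so that they sum to κ, dropping groups whose shifted norm would become negative. A minimal sketch of such an inner solver for step B (our own function names; a fixed step size s is used here, where a practical implementation would rely on a line search or backtracking):

import numpy as np

# Sketch of the inner solver for step B (projected gradient, steps B1-B3),
# again written for the logistic model used in the earlier snippets.
def fit_on_active_set(X, y, beta, active, groups, kappa, s=0.1, iters=200):
    beta = beta.copy()
    act_groups = [groups[j] for j in active]
    cols = np.concatenate(act_groups)                # restrict to active columns
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X[:, cols] @ beta[cols])))
        grad = X[:, cols].T @ (mu - y)               # B1: gradient step
        b = np.zeros_like(beta)
        b[cols] = beta[cols] - s * grad
        beta = project_groups(b, act_groups, kappa)  # B2/B3: projection
    return beta

def project_groups(b, groups, kappa):
    beta = np.zeros_like(b)
    norms = np.array([np.linalg.norm(b[idx]) for idx in groups])
    keep = list(range(len(groups)))
    while True:
        M = norms[keep] + (kappa - norms[keep].sum()) / len(keep)   # B2
        if np.all(M >= 0):
            break
        keep = [j for j, m in zip(keep, M) if m > 0]                # drop and repeat
    for j, m in zip(keep, M):                                       # B3
        if norms[j] > 0:
            beta[groups[j]] = b[groups[j]] * (m / norms[j])
    return beta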

Computing the Solution Path. Contrary to the standard Lasso, the Group-Lasso does not exhibit a piecewise linear solution path. Algorithms like LARS (Efron et al., 2004) are therefore not applicable. Despite this problem, we can still approximate the solution path on a grid of constraint values with almost no additional costs: starting with a very small κ^(0) (which will result in a small active set), we iteratively relax the constraint, resulting in a series of increasing values κ^(i). Note that at the i-th step, the previous solution β(κ^(i−1)) is a feasible initial estimate since κ^(i) > κ^(i−1). Typically only a few further iterations are needed to find β(κ^(i)). Completeness/uniqueness can be tested efficiently at every step i. In practical applications we observed that the stepwise approximation of the solution path up to some final κ^(f) is usually faster than directly computing the solution for κ^(f), probably because the stepwise procedure allows the use of larger stepsizes.

4. Applications

As a first application example we use synthetic data generated by a script that has been used in the context of the NIPS'03 feature selection workshop (follow the link "dataset description" on the workshop webpage www.clopinet.com/isabelle/Projects/NIPS2003/#challenge). We reproduced the XOR example explained in the above cited document: there are two classes, each of which is composed of two Gaussian clusters. Two "useful" features are drawn from N(0, 1) for each cluster. Some covariance is added by multiplying by a random matrix. The clusters are placed in an XOR configuration and 3200 "useless" features are added, drawn from N(0, 1). All the features are shifted and rescaled randomly. Random noise is then added according to N(0, 0.1). Finally, 1% of the labels are randomly flipped. We construct a training set of size 2000 and a test set of size 6000.

Without feature selection, prediction becomes very difficult: an SVM with RBF kernel achieves 42% test error (on the subset of the two "useful" features the error decreases to 1.5%). Moreover, simple feature selection methods like correlation-based scoring fail badly on these data.

We expand the dataset in a polynomial basis of degree 2, i.e. each pair of features (a, b) is mapped to a 5-dimensional vector (a, b, a·b, a², b²). Given the 3202 features (2 + 3200 "useless"), this expansion yields ≈ 5·10⁶ groups of size 5, each of which contains 5 quadratic interactions. We are, thus, working in a ≈ 2.5·10⁷-dimensional space. Since the expanded feature set cannot be held in main memory, we only store the original dataset and recompute the expansions on demand. Despite this computational overhead, our active set algorithm allows us to optimize the Group-Lasso functional very efficiently, see also Figure 2. Since we are dealing with a classification problem, we choose the logistic model from the GLM family. Figure 1 shows the solution path for the logistic Group-Lasso when relaxing κ in 20 steps. Note that in the first iterations the algorithm was able to determine the one "useful" group of variables. The norm of the corresponding weight vector increases almost linearly until κ ≈ 4.5, where the minimum error rate of 1.6% on the test set is obtained.

Figure 1. Solution path for the XOR problem with 3200 noise dimensions (x-axis: kappa, y-axis: group norm). The norm of the one "useful" group grows steeply when the constraint is relaxed; the best model (prediction error 1.6%) is reached at κ ≈ 4.5. What appears as a horizontal "thickening" line is an overlay of 275 "useless" groups.

Testing both the completeness and the uniqueness of the active set gives a positive result, which guarantees that at this constraint value there are no alternative solutions. Further increasing κ leads to the selection of additional groups with spurious weights. The model obtained for κ = 10 uses 275 groups which include "useless" features and have norms < 0.2. Solutions for κ > 5 appear to be lacking completeness: our test identified a steeply increasing number of other groups that may also become active. Given that the "useless" variables are randomly drawn from a normal distribution, the observed lack of completeness might be caused by the limited numerical precision in the optimization routine: for models with κ < 7 we could indeed show by increasing the numerical precision that the solutions are complete, however at the price of drastically increasing computational costs. For larger models, however, we were not able to find complete solutions within any reasonable time limits. This result nicely shows that lacking completeness of Group-Lasso solutions is indeed a relevant issue in real-world applications, which are necessarily computed with limited numerical precision. Besides the theoretical properties of our completeness test, this test might thus also be a valuable practical tool to detect possible ambiguities that are caused by numerical problems.

To compare the efficiency of our active set algorithm with related approaches, we measured the time needed to compute ten steps of the solution path (κ = 1, 2, ..., 10) for different numbers of "useless" groups. Figure 2 shows the observed computation times of three different methods: (1) the blockwise sparse method (Kim et al., 2006), (2) the block coordinate method by (Meier et al., 2008), (3) our algorithm. The comparison with method (1) was straightforward, since the same implementation was used (note that by dropping the active set selection mechanism, our method simply reduces to method (1)). In order to guarantee a fair comparison with method (2), for which we used the R-package grplasso, a few modifications were necessary: we first trained our method on the data and recorded the sequence of Lagrange parameters λ_1, ..., λ_10 corresponding to the sequence of constraints κ = 1, 2, ..., 10, since the grplasso package needs the Lagrange parameters on input. We also recorded the achieved log-likelihood at each step. We then trained method (2) on the dataset and adjusted its tolerance parameters so as to (roughly) reproduce the recorded sequence of log-likelihoods. The double logarithmic scale in Figure 2 should make the interpretation of the plot rather insensitive against performance differences caused by using different implementations, since such differences are expected to produce additive shifts without changing the slopes.

For input instances that could be held in main memory, the log-log plot shows a relatively steep increase for methods (1) and (2), whereas method (3) increases linearly with a moderate slope. For the "out-of-core" models, we recomputed the groups whenever necessary (steps C/D in our algorithm). We again see an almost linear increase of costs up to models including ≈ 10⁶ groups. Three observations seem to be important: (i) the slope of the curve for method (3) in the "out-of-core" regime does not even exceed the slope of the corresponding curve for method (2) at the end of the "cached" region; (ii) when fixing the costs at the level of method (2) at the end of the "cached" region, method (3) was able to solve instances which are larger by at least 1 to 1.5 orders of magnitude; (iii) comparison with method (1) shows that the active set formalism leads to a speed-up of several orders of magnitude.
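For concreteness, the on-demand degree-2 expansion can be sketched as follows (illustrative function names, not the code used for the experiments); only one 5-column group block is ever materialized at a time, which is what keeps the ≈ 2.5·10⁷-dimensional problem tractable:

import numpy as np
from itertools import combinations

# Sketch of the on-demand degree-2 group expansion used in the XOR experiment:
# each feature pair (a, b) defines one group of 5 columns (a, b, a*b, a^2, b^2).
def pair_groups(n_features):
    return list(combinations(range(n_features), 2))     # ~5e6 pairs for 3202 features

def expand_group(X, pair):
    a, b = X[:, pair[0]], X[:, pair[1]]
    return np.column_stack([a, b, a * b, a ** 2, b ** 2])   # n x 5 block X_j

def group_gradient_norm(X, y, mu, pair):
    # ||h_j|| = ||X_j^T (mu - y)|| for the logistic model, computed without
    # ever storing the full expanded design matrix (cf. steps C/D).
    Xj = expand_group(X, pair)
    return np.linalg.norm(Xj.T @ (mu - y))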

Figure 2. Log-log plot of computation time (y-axis: log10 of seconds) for the XOR problem with logistic loss as a function of the number g of groups (x-axis: log10(g)), with the "cached" and "out-of-core" regimes indicated. The three different methods are: (1) (Kim et al., 2006), (2) (Meier et al., 2008), (3) our algorithm.

Splice Site Detection. The prediction of splice sites has an important role in gene finding algorithms. Splice sites are the regions between coding (exons) and non-coding (introns) DNA segments. The 5′ end of an intron is called a donor splice site and the 3′ end an acceptor splice site. The MEMset Donor dataset (http://genes.mit.edu/burgelab/maxent/ssdata/) consists of a training set of 8415 true and 179438 false human donor sites. An additional test set contains 4208 true and 89717 "false" (or decoy) donor sites. A sequence of a real splice site is modeled within a window that consists of the last 3 bases of the exon and the first 6 bases of the intron. Decoy splice sites also match the consensus sequence at positions zero and one. Removing this consensus "GT" results in sequences of length 7, i.e. sequences of 7 factors with 4 levels {A, C, G, T}, see (Yeo & Burge, 2004) for details. The goal of this experiment is to overcome the restriction to marginal probabilities (main effects) in the widely used Sequence-Logo approach (see Figure 3) by exploring all possible interactions up to order 4.

Figure 3. Sequence Logo representation of the human 5′ splice site. The consensus "GT" appears at positions 0, 1. The overall height of the stack of symbols at a certain position indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each nucleic acid, see (Crooks et al., 2004). We model the splice sites in a window over positions [−3, 5].

Following (Meier et al., 2008), the original training dataset is used to build a balanced training dataset and an unbalanced validation set which exhibits the same true/false ratio as the test set. The data are represented as a collection of all factor interactions up to degree 4. Every interaction is encoded using dummy variables and treated as a group, leading to 120 groups of sizes varying between 4 (main effects) and 4⁵ = 1024 (4th-order interactions). In total, we are working in a 33068-dimensional feature space. This dataset has also been analyzed in (Meier et al., 2008) with the Group-Lasso, but only up to 2nd-order interactions.

To correct for the unbalancedness of the classes, the validation set is used to choose the best threshold τ on the classifier output. It is further used to select κ. The performance is measured in terms of the maximum correlation coefficient ρ_max between predicted and true labels.
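As an illustration of this group structure (a sketch, not the code used for the experiments), an interaction of k positions can be encoded as a block of 4^k indicator columns, assuming the sequences are stored as integer arrays over the alphabet {A, C, G, T}:

import numpy as np
from itertools import combinations

# Sketch of dummy-coded interaction groups for sequences of categorical factors.
def interaction_group(seqs, positions, n_levels=4):
    """seqs: (n, length) integer array with entries in {0,..,3} for {A,C,G,T}."""
    n = seqs.shape[0]
    idx = np.zeros(n, dtype=int)
    for p in positions:                              # index of the observed level combination
        idx = idx * n_levels + seqs[:, p]
    block = np.zeros((n, n_levels ** len(positions)))
    block[np.arange(n), idx] = 1.0                   # one indicator column per combination
    return block                                     # group of size 4^|positions|

def all_interactions(seq_length=7, max_positions=5):
    # main effects up to 4th-order interactions, i.e. 1 to 5 positions per group
    return [c for k in range(1, max_positions + 1)
              for c in combinations(range(seq_length), k)]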

Figure 4. Left: solution path for donor splice site prediction (x-axis: kappa, y-axis: group norm). Color and thickness of curves indicate different orders of interactions. Right: correlation coefficient as a function of κ. Bold curve: correlation on the validation set that is used for model selection (the thin vertical line indicates the chosen model). Thin curve: correlation on the separate test set.

From the correlation curve in Figure 4 we conclude that the inclusion of interactions of order three and greater does not improve the predictive performance and produces some pronounced overfitting effects. The model with the highest correlation coefficient (κ = 20) contains 36 groups: all 7 main effects, 21 1st-order interactions and 8 2nd-order interactions. Among the top-scoring groups we find the main effects at positions −1, 2 and 4, the interactions at positions (4:5), (−2:−1) and (2:3) and the triplet (−3:−2:−1), which all share the property that they exclusively contain exon positions (or intron positions, respectively). One might conclude that long-range interactions between the preceding exon and the starting intron are of minor importance for splice site recognition. The completeness test reveals, however, that the solution with 36 groups is not complete, and that a complete model for κ = 20 additionally contains the four interactions (−1:4), (−2:5), (−1:3:4) and (−3:−1:2:5), all of which combine exon and intron positions. This is a nice example where the completeness test prompts us to question an initial hypothesis (about the weak exon-intron dependencies) which seemed plausible from observing the Group-Lasso solution alone. It should be noticed that the obtained correlation coefficient of ρ_max = 0.663 compares favorably with the result in the original paper (Yeo & Burge, 2004) (ρ_max = 0.659), which has been viewed as among the best methods for short motif modeling.

The next experiment shows a situation where the completeness test indicates that the interpretability of the Group-Lasso might be generally complicated if relatively complex models are required. The problem is again the discrimination between true and "false" splice sites, this time, however, at the 3′ end. Compared to the 5′ situation, 3′ (acceptor) splice site motifs are less concentrated around the consensus nucleotide pair ("AG" at positions −2, −1 in Figure 5), which requires the use of larger windows. We trained the logistic Group-Lasso model on all interactions up to order 4 using windows of length 21. In total, we have 27896 groups which span a 22,458,100-dimensional feature space. Despite this huge dimensionality, our active set algorithm was able to compute the solution path up to κ = 150 within roughly 20 hours.

Figure 5. Sequence Logo representation of the human 3′ splice site. The consensus "AG" appears at positions −2, −1. We use a window over positions [−20, 2].

Figure 6. Left: solution path for 3′ splice site prediction (x-axis: kappa, y-axis: group norm). The uppermost curve represents the most important position −3 (last position of the intron). Right: correlation coefficient as a function of κ. Bold curve: correlation on the validation set that is used for model selection. Thin curve: correlation on the separate test set.

From the correlation curve in Figure 6 we conclude that in this example the inclusion of 3rd- and 4th-order interactions does indeed increase the predictive performance. The optimal model at κ = 66 contains 386 groups. Among the 10 highest-scoring groups are the main effects at positions −3, −5 and 0, the 1st-order interactions (−9:−8), (−11:−10), (−11:−9) and (−12:−11), the triplet (−6:−5:−3), the 3rd-order interaction (−14:−9:0:1) and the 4th-order interaction (−10:−8:−6:−3:2). The latter might be of particular interest, since it couples position 2, which appears to be non-informative in the Sequence-Logo representation (Fig. 5), with positions at the end of the intron. This observation nicely emphasizes the strength of a model that is capable of exploring high-order dependencies among the positions.

A closer look at the results of the completeness tests in Figure 7 shows, however, that probably all solutions with κ > 40 are rather difficult to interpret, since a steeply increasing number of groups must be added to obtain complete models. This means that care should be taken when it comes to interpreting specific groups occurring in particular solutions (as we have done above). Since most of the models are not complete, it might well be that other groups not contained in a particular solution are of high importance or even "substitute" identified groups. For the 4th-order interaction (−10:−8:−6:−3:2) in the optimal solution with κ = 66 it might well be that there exist other groups that can take over the role of this interaction. Even though the high score of this group might indicate that a complete substitution is not very likely, the "discovery" of the coupling between position 2 and intron positions should not be accepted unquestioningly.

Figure 7. Acceptor splice site prediction: number of groups (y-axis) that must be included in the logistic Group-Lasso estimates to obtain complete models, as a function of κ (x-axis); gray values represent different orders of interactions.

5. Conclusion

The completeness and uniqueness tests presented here overcome a severe problem of the Group-Lasso estimator for generalized linear models (GLMs). Since in many practical applications the dimensionality exceeds the sample size, we cannot a priori assume that the active set of groups is unique, which somehow contradicts our goal of identifying important factors. Our testing procedure has the advantage that it identifies all groups that are potential candidates for the active set. Even if a solution is not complete, this latter property still allows us to explicitly list (and potentially investigate) the set of all candidate groups.

We have presented a highly efficient active-set algorithm that can handle extremely high-dimensional input spaces which typically arise when investigating high-order factor interactions or when using polynomial basis expansions. Our theoretical characterization of solutions is used to check both optimality and completeness/uniqueness.

The experiment on synthetic data in XOR configuration with additional noise features showed that the methods and concepts presented here can be successfully applied to problems with millions of groups. We demonstrated that non-completeness of solutions is indeed an important issue in real-world applications where round-off errors are unavoidable. Without any additional computational costs, the proposed completeness/uniqueness test easily detects such situations and additionally identifies all groups that must be included to achieve a complete model.

The splice-site prediction example confirmed these observations in a real-world context, where the inclusion of high-order factor interactions helps to increase the predictive performance but also leads to incomplete and, thus, potentially ambiguous solutions. The active set algorithm was able to approximate the solution path of the logistic Group-Lasso for feature-space dimensions up to ≈ 2·10⁷ within a reasonable time, and the completeness test helped to avoid mis- or over-interpretations of identified interactions between the nucleotide positions. In particular, for the 5′ (donor) splice sites we could show that the completeness test avoids a potentially severe misinterpretation regarding the independence of exon and intron positions.

While in the application examples we have focused on logistic classification problems, both the characterization of solutions and the algorithms proposed are valid for the much richer class of GLMs. Notable extensions include models for counting processes (e.g. Poisson or log-linear models). Details of such models for the analysis of sparse contingency tables in the spirit of the work in (Dahinden et al., 2007) will appear elsewhere. A C++ implementation of the active set algorithm with completeness test is available from the authors on request.

In the broad perspective, and in the light of recent theoretical results on the algorithmic complexity of feature selection (Nilsson et al., 2007), one might conclude that feature selection can be simpler than previously thought.

References

Brown, L. D. (1986). Fundamentals of statistical exponential families: with applications in statistical decision theory. Hayward, CA, USA: Institute of Mathematical Statistics.

Crooks, G., Hon, G., Chandonia, J., & Brenner, S. (2004). WebLogo: A sequence logo generator. Genome Research, 14.

Dahinden, C., Parmigiani, G., Emerick, M., & Bühlmann, P. (2007). Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics, 8, 476.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Ann. Stat., 32, 407–499.

Kim, Y., Kim, J., & Kim, Y. (2006). Blockwise sparse regression. Statistica Sinica, 16, 375–390.

McCullagh, P., & Nelder, J. (1983). Generalized linear models. Chapman & Hall.

Meier, L., van de Geer, S., & Bühlmann, P. (2008). The Group Lasso for Logistic Regression. J. Roy. Stat. Soc. B, 70, 53–71.

Nilsson, R., Peña, J., Björkegren, J., & Tegnér, J. (2007). Consistent feature selection for pattern recognition in polynomial time. JMLR, 8, 589–612.

Osborne, M., Presnell, B., & Turlach, B. (2000). On the LASSO and its dual. J. Comp. and Graphical Statistics, 9, 319–337.

Shevade, K., & Keerthi, S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19, 2246–2253.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B, 58, 267–288.

Wainwright, M., Jaakkola, T., & Willsky, A. (2005). A new class of upper bounds on the log partition function. IEEE Trans. Information Theory, 51.

Wedderburn, R. W. M. (1973). On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63, 27–32.

Yeo, G., & Burge, C. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comp. Biology, 11, 377–394.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68, 49–67.