Sparse Discriminant Analysis
Technometrics (2011), 53:4, 406–413. DOI: 10.1198/TECH.2011.08118

Line Clemmensen, Department of Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby 2800, Denmark ([email protected])
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA 94305-4065 ([email protected])
Daniela Witten, Department of Biostatistics, University of Washington, Seattle, WA 98195-7232 ([email protected])
Bjarne Ersbøll, Department of Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby 2800, Denmark ([email protected])

We consider the problem of performing interpretable classification in the high-dimensional setting, in which the number of features is very large and the number of observations is limited. This setting has been studied extensively in the chemometrics literature, and more recently has become commonplace in biological and medical applications. In this setting, a traditional approach involves performing feature selection before classification. We propose sparse discriminant analysis, a method for performing linear discriminant analysis with a sparseness criterion imposed such that classification and feature selection are performed simultaneously. Sparse discriminant analysis is based on the optimal scoring interpretation of linear discriminant analysis, and can be extended to perform sparse discrimination via mixtures of Gaussians if boundaries between classes are nonlinear or if subgroups are present within each class. Our proposal also provides low-dimensional views of the discriminative directions.

KEY WORDS: Classification; Dimension reduction; Feature selection; Linear discriminant analysis; Mixture discriminant analysis.

1. INTRODUCTION

Linear discriminant analysis (LDA) is a favored tool for supervised classification in many applications, due to its simplicity, robustness, and predictive accuracy (Hand 2006). LDA also provides low-dimensional projections of the data onto the most discriminative directions, which can be useful for data interpretation. There are three distinct arguments that result in the LDA classifier: the multivariate Gaussian model, Fisher's discriminant problem, and the optimal scoring problem. These are reviewed in Section 2.1.

Though LDA often performs quite well in simple, low-dimensional settings, it is known to fail in the following cases:

• When the number of predictor variables p is larger than the number of observations n. In this case, LDA cannot be applied directly because the within-class covariance matrix of the features is singular.
• When a single Gaussian distribution per class is insufficient.
• When linear boundaries cannot separate the classes.

Moreover, in some cases where p ≫ n, one may wish for a classifier that performs feature selection—that is, a classifier that involves only a subset of the p features. Such a sparse classifier ensures easier model interpretation and may reduce overfitting of the training data.

In this article, we develop a sparse version of LDA using an ℓ1 or lasso penalty (Tibshirani 1996). The use of an ℓ1 penalty to achieve sparsity has been studied extensively in the regression framework (Tibshirani 1996; Efron et al. 2004; Zou and Hastie 2005; Zou, Hastie, and Tibshirani 2006). If X is an n × p data matrix and y is an outcome vector of length n, then the lasso solves the problem

    minimize_β { ‖y − Xβ‖² + λ‖β‖₁ }    (1)

and the elastic net (Zou and Hastie 2005) solves the problem

    minimize_β { ‖y − Xβ‖² + λ‖β‖₁ + γ‖β‖₂² },    (2)

where λ and γ are nonnegative tuning parameters. When λ is large, then both the lasso and the elastic net will yield sparse coefficient vector estimates. Through the additional use of an ℓ2 penalty, the elastic net provides some advantages over the lasso: correlated features tend to be assigned similar regression coefficients, and more than min(n, p) features can be included in the model.
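As a concrete reference point for (1) and (2), the sketch below solves the elastic net problem by cyclic coordinate descent in plain NumPy; setting gam = 0 recovers the lasso (1). The function name, the parametrization (lam, gam mirroring λ, γ above), and the toy data are illustrative only; this is ordinary penalized regression, not yet the discriminant-analysis proposal developed later.

```python
import numpy as np

def elastic_net(X, y, lam, gam, n_iter=200):
    """Cyclic coordinate descent for min_b ||y - X b||^2 + lam*||b||_1 + gam*||b||_2^2.

    Setting gam = 0 gives the lasso (1); lam = 0 gives ridge regression.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # X_j^T X_j for each coordinate
    for _ in range(n_iter):
        for j in range(p):
            # partial residual that excludes feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j
            # soft-thresholding update for the elastic net objective
            beta[j] = np.sign(z_j) * max(abs(z_j) - lam / 2.0, 0.0) / (col_sq[j] + gam)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 50, 200                            # p > n, as in the settings discussed above
    X = rng.standard_normal((n, p))
    X -= X.mean(axis=0)                       # center the columns, as assumed throughout
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                       # only five features are truly relevant
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    beta_hat = elastic_net(X, y, lam=20.0, gam=1.0)
    print("nonzero coefficients:", np.flatnonzero(beta_hat))
```

With lam large, most coordinates are thresholded to exactly zero, which is the sparsity property the penalty is used for throughout this article.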
In this article, we apply an elastic net penalty to the coefficient vectors in the optimal scoring interpretation of LDA in order to develop a sparse version of discriminant analysis. This is related to proposals by Grosenick, Greer, and Knutson (2008) and Leng (2008). Since our proposal is based on the optimal scoring framework, we are able to extend it to mixtures of Gaussians (Hastie and Tibshirani 1996).

There already exist a number of proposals to extend LDA to the high-dimensional setting. Some of these proposals involve non-sparse classifiers. For instance, within the multivariate Gaussian model for LDA, Dudoit, Fridlyand, and Speed (2001) and Bickel and Levina (2004) assumed independence of the features (naive Bayes), and Friedman (1989) suggested applying a ridge penalty to the within-class covariance matrix. Other positive definite estimates of the within-class covariance matrix are considered by Krzanowski et al. (1995) and Xu, Brock, and Parrish (2009). Some proposals that lead to sparse classifiers have also been considered: Tibshirani et al. (2002) adapted the naive Bayes classifier by soft-thresholding the mean vectors, and Guo, Hastie, and Tibshirani (2007) combined a ridge-type penalty on the within-class covariance matrix with a soft-thresholding operation. Witten and Tibshirani (2011) applied ℓ1 penalties to Fisher's discriminant problem in order to obtain sparse discriminant vectors, but this approach cannot be extended to the Gaussian mixture setting and lacks the simplicity of the regression-based optimal scoring approach that we take in this article.

The rest of this article is organized as follows. In Section 2, we review LDA and we present our proposals for sparse discriminant analysis and sparse mixture discriminant analysis. Section 3 briefly describes three methods to which we will compare our proposal: shrunken centroids regularized discriminant analysis, sparse partial least squares, and elastic net regression of dummy variables. Section 4 contains experimental results, and Section 5 comprises the discussion.

2. METHODOLOGY

2.1 A Review of Linear Discriminant Analysis

Let X be an n × p data matrix, and suppose that each of the n observations falls into one of K classes. Assume that each of the p features has been centered to have mean zero, and that the features have been standardized to have equal variance if they are not measured on the same scale. Let x_i denote the ith observation, and let C_k denote the indices of the observations in the kth class. Consider a very simple multivariate Gaussian model for the data, in which we assume that an observation in class k is distributed N(μ_k, Σ_w), where μ_k ∈ R^p is the mean vector for class k and Σ_w is a p × p pooled within-class covariance matrix common to all K classes. We use (1/|C_k|) ∑_{i∈C_k} x_i as an estimate for μ_k, and we use (1/n) ∑_{k=1}^{K} ∑_{i∈C_k} (x_i − μ̂_k)(x_i − μ̂_k)^T as an estimate for Σ_w (see, e.g., Hastie, Tibshirani, and Friedman 2009). The LDA classification rule then results from applying Bayes's rule to estimate the most likely class for a test observation.
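To make this review concrete, the following NumPy sketch forms the estimates μ̂_k, the class proportions, and the pooled within-class covariance Σ̂_w described above, and classifies a point by maximizing the Gaussian discriminant score from Bayes's rule (here the class proportions are used as priors). It is plain classical LDA, and it fails in exactly the p > n regime discussed in the Introduction, where Σ̂_w is singular.

```python
import numpy as np

def lda_fit(X, labels):
    """Estimate class means and the pooled within-class covariance (Section 2.1)."""
    n, p = X.shape
    classes = np.unique(labels)
    means = np.array([X[labels == k].mean(axis=0) for k in classes])   # mu_k estimates
    priors = np.array([(labels == k).mean() for k in classes])         # class proportions
    sigma_w = np.zeros((p, p))
    for idx, k in enumerate(classes):
        resid = X[labels == k] - means[idx]
        sigma_w += resid.T @ resid
    sigma_w /= n                                                       # pooled estimate of Sigma_w
    return classes, means, priors, sigma_w

def lda_predict(x, classes, means, priors, sigma_w):
    """Bayes's rule under N(mu_k, Sigma_w): pick the class with the largest discriminant score."""
    sigma_inv = np.linalg.inv(sigma_w)             # requires Sigma_w to be nonsingular (n > p)
    scores = [x @ sigma_inv @ m - 0.5 * m @ sigma_inv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return classes[int(np.argmax(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_per, p = 100, 5
    X = np.vstack([rng.standard_normal((n_per, p)),
                   rng.standard_normal((n_per, p)) + 1.5])
    labels = np.repeat([0, 1], n_per)
    X = X - X.mean(axis=0)                         # center the features, as assumed in the text
    model = lda_fit(X, labels)
    print(lda_predict(X[0], *model))
```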
A second route to LDA is Fisher's discriminant problem, in which the discriminant vectors β_1, ..., β_{K−1} are chosen so that each projection Xβ_k has maximal between-class variance relative to its within-class variance:

    maximize_{β_k} { β_k^T Σ̂_b β_k } subject to β_k^T Σ̂_w β_k = 1, β_k^T Σ̂_w β_l = 0 for all l < k,    (3)

where Σ̂_b is the sample between-class covariance matrix and Σ̂_w is the pooled within-class covariance estimate given above. A reduced-rank classification rule can be obtained by performing nearest centroid classification on the matrix (Xβ_1 ··· Xβ_q) with q < K − 1. One can show that performing nearest centroid classification on this n × q matrix is exactly equivalent to performing full-rank LDA on this n × q matrix. We will make use of this fact later. Fisher's discriminant problem also leads to a tool for data visualization, since it can be informative to plot the vectors Xβ_1, Xβ_2, and so on.

In this article, we will make use of optimal scoring, a third formulation that yields the LDA classification rule and was discussed in detail in the article by Hastie, Buja, and Tibshirani (1995). It involves recasting the classification problem as a regression problem by turning categorical variables into quantitative variables, via a sequence of scorings. Let Y denote an n × K matrix of dummy variables for the K classes; Y_{ik} is an indicator variable for whether the ith observation belongs to the kth class. The optimal scoring criterion takes the form

    minimize_{β_k, θ_k} { ‖Yθ_k − Xβ_k‖² }
    subject to (1/n) θ_k^T Y^T Y θ_k = 1,    (4)
               θ_k^T Y^T Y θ_l = 0 for all l < k,

where θ_k is a K-vector of scores, and β_k is a p-vector of variable coefficients. Since the columns of X are centered to have mean zero, we can see that the constant score vector 1 is trivial, since Y1 = 1 is an n-vector of 1's and is orthogonal to all of the columns of X. Hence there are at most K − 1 nontrivial solutions to (4). Letting D_π = (1/n) Y^T Y be a diagonal matrix of class proportions, the constraints in (4) can be written as θ_k^T D_π θ_k = 1 and θ_k^T D_π θ_l = 0 for l < k. One can show that the p-vector β_k that solves (4) is proportional to the solution to (3), and hence we will also refer to the vector β_k that solves (4) as the kth discriminant vector. Therefore, performing full-rank LDA on the n × q matrix (Xβ_1 ··· Xβ_q) yields the rank-q classification rule obtained from Fisher's discriminant problem.

2.2 Sparse Discriminant Analysis
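Section 2.2 develops the proposal outlined in the Introduction: an elastic net penalty applied to the coefficient vectors β_k in the optimal scoring problem (4). As a rough illustration of that idea (a sketch, not necessarily the authors' exact algorithm), the code below alternates between an elastic net regression step for β_k and a score update for θ_k that is normalized and kept orthogonal, in the D_π inner product, to the trivial constant score and to previously found scores. It reuses the elastic_net function from the first sketch; q, lam, and gam are illustrative parameters.

```python
import numpy as np
# Assumes the elastic_net(X, y, lam, gam) function from the earlier sketch is in scope.

def sparse_optimal_scoring(X, labels, q=1, lam=1.0, gam=1.0, n_iter=30, seed=0):
    """Sketch: the optimal scoring criterion (4) with an elastic net penalty on each beta_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    classes = np.unique(labels)
    K = len(classes)
    Y = (labels[:, None] == classes[None, :]).astype(float)   # n x K dummy matrix
    D_pi = np.diag(Y.sum(axis=0) / n)                          # class proportions on the diagonal

    def orthonormalize(theta, previous):
        for t in previous:                                     # remove components along earlier scores
            theta = theta - t * (t @ D_pi @ theta)
        return theta / np.sqrt(theta @ D_pi @ theta)

    found = [np.ones(K)]                                       # the trivial constant score, excluded
    betas, thetas = [], []
    for _ in range(q):                                         # q <= K - 1 nontrivial directions
        theta = orthonormalize(rng.standard_normal(K), found)
        beta = np.zeros(p)
        for _ in range(n_iter):
            beta = elastic_net(X, Y @ theta, lam, gam)         # sparse regression step for beta_k
            score = np.linalg.solve(D_pi, Y.T @ (X @ beta))    # unconstrained optimal scores
            if np.allclose(score, 0):
                break                                          # penalty too strong: direction vanishes
            theta = orthonormalize(score, found)
        found.append(theta)
        betas.append(beta)
        thetas.append(theta)
    return np.column_stack(betas), np.column_stack(thetas)
```

The design point this sketch is meant to convey is that sparsity enters only through the regression step: because β_k is obtained from an elastic net fit of the scored response Yθ_k on X, classification and feature selection happen simultaneously, and in practice λ and γ would be chosen by cross-validation.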