
Sparse Discriminant Analysis


To cite this article: Line Clemmensen, Trevor Hastie, Daniela Witten & Bjarne Ersbøll (2011), Sparse Discriminant Analysis, Technometrics, 53:4, 406-413, DOI: 10.1198/TECH.2011.08118
To link to this article: https://doi.org/10.1198/TECH.2011.08118

Published online: 24 Jan 2012.



Line CLEMMENSEN
Department of Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby 2800, Denmark

Trevor HASTIE
Department of Statistics, Stanford University, Stanford, CA 94305-4065

Daniela WITTEN
Department of Biostatistics, University of Washington, Seattle, WA 98195-7232

Bjarne ERSBØLL
Department of Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby 2800, Denmark

We consider the problem of performing interpretable classification in the high-dimensional setting, in which the number of features is very large and the number of observations is limited. This setting has been studied extensively in the chemometrics literature, and more recently has become commonplace in biological and medical applications. In this setting, a traditional approach involves performing feature selection before classification. We propose sparse discriminant analysis, a method for performing linear discriminant analysis with a sparseness criterion imposed such that classification and feature selection are performed simultaneously. Sparse discriminant analysis is based on the optimal scoring interpretation of linear discriminant analysis, and can be extended to perform sparse discrimination via mixtures of Gaussians if boundaries between classes are nonlinear or if subgroups are present within each class. Our proposal also provides low-dimensional views of the discriminative directions.

KEY WORDS: Classification; Dimension reduction; Feature selection; Linear discriminant analysis; Mixture discriminant analysis.

1. INTRODUCTION

Linear discriminant analysis (LDA) is a favored tool for supervised classification in many applications, due to its simplicity, robustness, and predictive accuracy (Hand 2006). LDA also provides low-dimensional projections of the data onto the most discriminative directions, which can be useful for data interpretation. There are three distinct arguments that result in the LDA classifier: the multivariate Gaussian model, Fisher's discriminant problem, and the optimal scoring problem. These are reviewed in Section 2.1.

Though LDA often performs quite well in simple, low-dimensional settings, it is known to fail in the following cases:

• When the number of predictor variables p is larger than the number of observations n. In this case, LDA cannot be applied directly because the within-class covariance matrix of the features is singular.
• When a single Gaussian distribution per class is insufficient.
• When linear boundaries cannot separate the classes.

Moreover, in some cases where p ≫ n, one may wish for a classifier that performs feature selection, that is, a classifier that involves only a subset of the p features. Such a sparse classifier ensures easier model interpretation and may reduce overfitting of the training data.

In this article, we develop a sparse version of LDA using an ℓ1 or lasso penalty (Tibshirani 1996). The use of an ℓ1 penalty to achieve sparsity has been studied extensively in the regression framework (Tibshirani 1996; Efron et al. 2004; Zou and Hastie 2005; Zou, Hastie, and Tibshirani 2006). If X is an n × p data matrix and y is an outcome vector of length n, then the lasso solves the problem

    minimize_β  ||y − Xβ||^2 + λ ||β||_1,    (1)

and the elastic net (Zou and Hastie 2005) solves the problem

    minimize_β  ||y − Xβ||^2 + λ ||β||_1 + γ ||β||^2,    (2)

where λ and γ are nonnegative tuning parameters. When λ is large, then both the lasso and the elastic net will yield sparse coefficient vector estimates. Through the additional use of an ℓ2 penalty, the elastic net provides some advantages over the lasso: correlated features tend to be assigned similar regression coefficients, and more than min(n, p) features can be included in the model. In this article, we apply an elastic net penalty to the coefficient vectors in the optimal scoring interpretation of LDA in order to develop a sparse version of discriminant analysis. This is related to proposals by Grosenick, Greer, and Knutson (2008) and Leng (2008). Since our proposal is based on the optimal scoring framework, we are able to extend it to mixtures of Gaussians (Hastie and Tibshirani 1996).
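As a concrete illustration of criteria (1) and (2), the following Python/NumPy sketch solves the elastic net by cyclic coordinate descent; setting gam = 0 recovers the lasso. The paper's own comparisons rely on R packages, so this snippet, its function names, its synthetic data, and its tuning-parameter values are illustrative assumptions rather than the authors' code.

```python
# A minimal sketch of the lasso / elastic net objectives (1)-(2),
# minimized by cyclic coordinate descent. Not the authors' implementation.
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net(X, y, lam, gam, n_iter=200):
    """Minimize ||y - X b||^2 + lam*||b||_1 + gam*||b||^2 over b."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)                    # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding feature j
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / (col_ss[j] + gam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))                   # p >> n, as in the settings discussed
beta_true = np.zeros(200); beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(50)
beta_hat = elastic_net(X, y, lam=5.0, gam=1.0)
print("nonzero coefficients:", int(np.sum(beta_hat != 0)))
```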


There already exist a number of proposals to extend LDA to the high-dimensional setting. Some of these proposals involve non-sparse classifiers. For instance, within the multivariate Gaussian model for LDA, Dudoit, Fridlyand, and Speed (2001) and Bickel and Levina (2004) assumed independence of the features (naive Bayes), and Friedman (1989) suggested applying a ridge penalty to the within-class covariance matrix. Other positive definite estimates of the within-class covariance matrix are considered by Krzanowski et al. (1995) and Xu, Brock, and Parrish (2009). Some proposals that lead to sparse classifiers have also been considered: Tibshirani et al. (2002) adapted the naive Bayes classifier by soft-thresholding the mean vectors, and Guo, Hastie, and Tibshirani (2007) combined a ridge-type penalty on the within-class covariance matrix with a soft-thresholding operation. Witten and Tibshirani (2011) applied ℓ1 penalties to Fisher's discriminant problem in order to obtain sparse discriminant vectors, but this approach cannot be extended to the Gaussian mixture setting and lacks the simplicity of the regression-based optimal scoring approach that we take in this article.

The rest of this article is organized as follows. In Section 2, we review LDA and we present our proposals for sparse discriminant analysis and sparse mixture discriminant analysis. Section 3 briefly describes three methods to which we will compare our proposal: shrunken centroids regularized discriminant analysis, sparse partial least squares, and elastic net regression of dummy variables. Section 4 contains experimental results, and Section 5 comprises the discussion.

2. METHODOLOGY

2.1 A Review of Linear Discriminant Analysis

Let X be an n × p data matrix, and suppose that each of the n observations falls into one of K classes. Assume that each of the p features has been centered to have mean zero, and that the features have been standardized to have equal variance if they are not measured on the same scale. Let x_i denote the ith observation, and let C_k denote the indices of the observations in the kth class. Consider a very simple multivariate Gaussian model for the data, in which we assume that an observation in class k is distributed N(μ_k, Σ_w), where μ_k ∈ R^p is the mean vector for class k and Σ_w is a p × p pooled within-class covariance matrix common to all K classes. We use (1/|C_k|) Σ_{i ∈ C_k} x_i as an estimate for μ_k, and we use (1/n) Σ_{k=1}^K Σ_{i ∈ C_k} (x_i − μ_k)(x_i − μ_k)^T as an estimate for Σ_w (see, e.g., Hastie, Tibshirani, and Friedman 2009). The LDA classification rule then results from applying Bayes's rule to estimate the most likely class for a test observation.

LDA can also be seen as arising from Fisher's discriminant problem. Define the between-class covariance matrix Σ_b = Σ_{k=1}^K π_k μ_k μ_k^T, where π_k is the prior probability for class k (generally estimated as the fraction of observations belonging to class k). Fisher's discriminant problem involves seeking discriminant vectors β_1, ..., β_{K−1} that successively solve the problem

    maximize_{β_k}  β_k^T Σ_b β_k
    subject to  β_k^T Σ_w β_k = 1,    β_k^T Σ_w β_l = 0  for all l < k.    (3)

Since Σ_b has rank at most K − 1, there are at most K − 1 nontrivial solutions to the generalized eigen problem (3), and hence at most K − 1 discriminant vectors. These solutions are directions upon which the data have maximal between-class variance relative to their within-class variance. One can show that nearest centroid classification on the matrix (Xβ_1 ··· Xβ_{K−1}) yields the same LDA classification rule as the multivariate Gaussian model described previously (see, e.g., Hastie, Tibshirani, and Friedman 2009). Fisher's discriminant problem has an advantage over the multivariate Gaussian interpretation of LDA, in that one can perform reduced-rank classification by performing nearest centroid classification on the matrix (Xβ_1 ··· Xβ_q) with q < K − 1. One can show that performing nearest centroid classification on this n × q matrix is exactly equivalent to performing full-rank LDA on this n × q matrix. We will make use of this fact later. Fisher's discriminant problem also leads to a tool for data visualization, since it can be informative to plot the vectors Xβ_1, Xβ_2, and so on.

In this article, we will make use of optimal scoring, a third formulation that yields the LDA classification rule and was discussed in detail in the article by Hastie, Buja, and Tibshirani (1995). It involves recasting the classification problem as a regression problem by turning categorical variables into quantitative variables, via a sequence of scorings. Let Y denote an n × K matrix of dummy variables for the K classes; Y_ik is an indicator variable for whether the ith observation belongs to the kth class. The optimal scoring criterion takes the form

    minimize_{β_k, θ_k}  ||Yθ_k − Xβ_k||^2
    subject to  (1/n) θ_k^T Y^T Y θ_k = 1,    θ_k^T Y^T Y θ_l = 0  for all l < k,    (4)

where θ_k is a K-vector of scores, and β_k is a p-vector of variable coefficients. Since the columns of X are centered to have mean zero, we can see that the constant score vector 1 is trivial, since Y1 = 1 is an n-vector of 1's and is orthogonal to all of the columns of X. Hence there are at most K − 1 nontrivial solutions to (4). Letting D_π = (1/n) Y^T Y be a diagonal matrix of class proportions, the constraints in (4) can be written as θ_k^T D_π θ_k = 1 and θ_k^T D_π θ_l = 0 for l < k. One can show that the p-vector β_k that solves (4) is proportional to the solution to (3), and hence we will also refer to the vector β_k that solves (4) as the kth discriminant vector. Therefore, performing full-rank LDA on the n × q matrix (Xβ_1 ··· Xβ_q) yields the rank-q classification rule obtained from Fisher's discriminant problem.
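The following sketch illustrates Fisher's discriminant problem (3) on synthetic data, under the assumption that n > p so that Σ_w is invertible: the discriminant vectors are the leading generalized eigenvectors of the pair (Σ_b, Σ_w), normalized so that β^T Σ_w β = 1. The data, variable names, and use of SciPy are assumptions of this illustration, not part of the original article.

```python
# A sketch of criterion (3): discriminant vectors as generalized eigenvectors.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
K, n_per, p = 3, 40, 5
means = 3.0 * rng.standard_normal((K, p))
X = np.vstack([rng.standard_normal((n_per, p)) + means[k] for k in range(K)])
labels = np.repeat(np.arange(K), n_per)
X = X - X.mean(axis=0)                       # center the features

# Pooled within-class and between-class covariance estimates from Section 2.1.
mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
pi = np.array([(labels == k).mean() for k in range(K)])
Sw = sum((X[labels == k] - mu[k]).T @ (X[labels == k] - mu[k]) for k in range(K)) / len(X)
Sb = sum(pi[k] * np.outer(mu[k], mu[k]) for k in range(K))

# Generalized eigenproblem Sb v = lambda * Sw v; keep the top K-1 directions.
evals, evecs = eigh(Sb, Sw)                  # eigenvalues in ascending order
B = evecs[:, ::-1][:, :K - 1]                # discriminant vectors beta_1, ..., beta_{K-1}
Z = X @ B                                    # projections used for nearest-centroid / LDA
print("between/within variance ratios:", evals[::-1][:K - 1])
```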
2.2 Sparse Discriminant Analysis

Since Σ_w does not have full rank when the number of features is large relative to the number of observations, LDA cannot be performed. One approach to overcome this problem involves using a regularized estimate of the within-class covariance matrix in Fisher's discriminant problem (3). For instance, one possibility is

    maximize_{β_k}  β_k^T Σ_b β_k
    subject to  β_k^T (Σ_w + Ω) β_k = 1,    β_k^T (Σ_w + Ω) β_l = 0  for all l < k,    (5)

with Ω a positive definite matrix. This approach was taken in the article by Hastie, Buja, and Tibshirani (1995). Then Σ_w + Ω is positive definite, and so the discriminant vectors in (5) can be calculated even if p ≫ n. Moreover, for an appropriate choice of Ω, (5) can result in smooth discriminant vectors. However, in this article, we are instead interested in a technique for obtaining sparse discriminant vectors. One way to do this is by applying an ℓ1 penalty in (5), resulting in the optimization problem

    maximize_{β_k}  β_k^T Σ_b β_k − γ ||β_k||_1
    subject to  β_k^T (Σ_w + Ω) β_k = 1,    β_k^T (Σ_w + Ω) β_l = 0  for all l < k.    (6)

Algorithm 1 Sparse discriminant analysis

1. Let Y be an n × K matrix of indicator variables, with Y_ik = 1 if i ∈ C_k and 0 otherwise.
2. Let D_π = (1/n) Y^T Y.
3. Initialize k = 1, and let Q_1 be a K × 1 matrix of 1's.
4. For k = 1, ..., q, compute a new SDA direction pair (θ_k, β_k) as follows:
   (a) Initialize θ_k = (I − Q_k Q_k^T D_π) θ*, where θ* is a random K-vector, and then normalize θ_k so that θ_k^T D_π θ_k = 1.
   (b) Iterate until convergence or until a maximum number of iterations is reached:
       (i) Holding θ_k fixed, let β_k be the solution to the generalized elastic net problem
           minimize_{β_k}  (1/n) ||Yθ_k − Xβ_k||^2 + γ β_k^T Ω β_k + λ ||β_k||_1.    (10)
       (ii) Holding β_k fixed, let
           θ̃_k = (I − Q_k Q_k^T D_π) D_π^{-1} Y^T X β_k,    θ_k = θ̃_k / sqrt(θ̃_k^T D_π θ̃_k).    (11)
   (c) If k < q, set Q_{k+1} = (Q_k : θ_k).
5. The classification rule results from performing standard LDA with the n × q matrix (Xβ_1 Xβ_2 ··· Xβ_q).
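Below is a hedged Python sketch of Algorithm 1 for the special case Ω = I, in which the β-step (10) reduces to a standard elastic net. It delegates that step to scikit-learn's ElasticNet; the mapping between the paper's (λ, γ) and ElasticNet's (alpha, l1_ratio) follows ElasticNet's documented objective and is an assumption of this sketch, as are the synthetic data and all parameter values. It is meant to convey the structure of the algorithm, not to reproduce the authors' implementation.

```python
# Sketch of Algorithm 1 with Omega = I; not the authors' code.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def sda(X, labels, q, lam=0.2, gam=0.1, n_outer=30):
    """Return the p x q matrix B of sparse discriminant vectors."""
    n, p = X.shape
    classes = np.unique(labels)
    K = len(classes)
    Y = (labels[:, None] == classes[None, :]).astype(float)   # step 1: indicator matrix
    Dpi = (Y.T @ Y) / n                                        # step 2: class proportions
    Q = np.ones((K, 1))                                        # step 3: trivial score vector
    B = np.zeros((p, q))
    rng = np.random.default_rng(0)
    # ElasticNet minimizes 1/(2n)||y - Xb||^2 + a*r*||b||_1 + a*(1-r)/2*||b||^2,
    # so lam ~ 2*a*r and gam ~ a*(1-r) relative to criterion (10) with Omega = I.
    a = lam / 2.0 + gam
    r = (lam / 2.0) / a
    for k in range(q):
        theta = (np.eye(K) - Q @ Q.T @ Dpi) @ rng.standard_normal(K)   # step 4(a)
        theta /= np.sqrt(theta @ Dpi @ theta)
        beta = np.zeros(p)
        for _ in range(n_outer):                                       # step 4(b)
            enet = ElasticNet(alpha=a, l1_ratio=r, fit_intercept=False, max_iter=5000)
            beta = enet.fit(X, Y @ theta).coef_                        # beta-step (10)
            theta = (np.eye(K) - Q @ Q.T @ Dpi) @ np.linalg.solve(Dpi, Y.T @ X @ beta)
            nrm = np.sqrt(theta @ Dpi @ theta)
            if nrm < 1e-12:                                            # degenerate direction
                break
            theta /= nrm                                               # theta-step (11)
        B[:, k] = beta
        Q = np.hstack([Q, theta[:, None]])                             # step 4(c)
    return B

# Illustration on synthetic data; step 5 is ordinary LDA on the q projections.
rng = np.random.default_rng(2)
K, n_per, p = 3, 30, 100
means = np.zeros((K, p))
for k in range(K):
    means[k, :10] = 3.0 * (k + 1)          # only the first 10 features carry class signal
X = np.vstack([rng.standard_normal((n_per, p)) + means[k] for k in range(K)])
labels = np.repeat(np.arange(K), n_per)
X = X - X.mean(axis=0)
B = sda(X, labels, q=K - 1)
lda = LinearDiscriminantAnalysis().fit(X @ B, labels)
print("train accuracy:", lda.score(X @ B, labels))
print("nonzero loadings per direction:", (B != 0).sum(axis=0))
```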

The approach of directly penalizing Fisher's problem, as in (6), was indeed taken by Witten and Tibshirani (2011). Solving (6) is challenging, since it is not a convex problem, and so specialized techniques, such as the minorization–maximization approach pursued by Witten and Tibshirani (2011), must be applied. In this article, we instead apply ℓ1 penalties to the optimal scoring formulation for LDA (4).

Our sparse discriminant analysis (SDA) criterion is defined sequentially. The kth SDA solution pair (θ_k, β_k) solves the problem

    minimize_{β_k, θ_k}  ||Yθ_k − Xβ_k||^2 + γ β_k^T Ω β_k + λ ||β_k||_1
    subject to  (1/n) θ_k^T Y^T Y θ_k = 1,    θ_k^T Y^T Y θ_l = 0  for all l < k,    (7)

where Ω is a positive definite matrix as in (5), and λ and γ are nonnegative tuning parameters. The ℓ1 penalty on β_k results in sparsity when λ is large. We will refer to the β_k that solves (7) as the kth SDA discriminant vector. It was shown by Witten and Tibshirani (2011) that critical points of (7) are also critical points of (6). Since neither criterion is convex, we cannot claim these are local minima, but the result does establish an equivalence at this level.

We now consider the problem of solving (7). We propose the use of a simple iterative algorithm for finding a local optimum to (7). The algorithm involves holding θ_k fixed and optimizing with respect to β_k, and holding β_k fixed and optimizing with respect to θ_k. For fixed θ_k, we obtain

    minimize_{β_k}  ||Yθ_k − Xβ_k||^2 + γ β_k^T Ω β_k + λ ||β_k||_1,    (8)

which is an elastic net problem if Ω = I and a generalized elastic net problem for an arbitrary symmetric positive semidefinite matrix Ω. Equation (8) can be solved using the algorithm proposed in the article by Zou and Hastie (2005), or using a coordinate descent approach (Friedman et al. 2007). For fixed β_k, the optimal scores θ_k solve

    minimize_{θ_k}  ||Yθ_k − Xβ_k||^2
    subject to  θ_k^T D_π θ_k = 1,    θ_k^T D_π θ_l = 0  for all l < k,    (9)

where D_π = (1/n) Y^T Y. Let Q_k be the K × k matrix consisting of the previous solutions θ_1, ..., θ_{k−1}, as well as the trivial solution vector of all 1's. One can show that the solution to (9) is given by θ_k = s · (I − Q_k Q_k^T D_π) D_π^{-1} Y^T X β_k, where s is a proportionality constant such that θ_k^T D_π θ_k = 1. Note that D_π^{-1} Y^T X β_k is the unconstrained estimate for θ_k, and the term (I − Q_k Q_k^T D_π) is the orthogonal projector (in D_π) onto the subspace of R^K orthogonal to Q_k.

Once sparse discriminant vectors have been obtained, we can plot the vectors Xβ_1, Xβ_2, and so on in order to perform data visualization in the reduced subspace. The classification rule is obtained by performing standard LDA on the n × q reduced data matrix (Xβ_1 ··· Xβ_q) with q < K. In summary, the SDA algorithm is given in Algorithm 1.
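The closed-form θ-update above can be checked numerically. The short sketch below (synthetic Y, β, and Q; not from the paper) verifies that the projected and rescaled score vector satisfies the constraints of (9): unit D_π-norm and D_π-orthogonality to the trivial score and to a previously computed score.

```python
# Numerical check of the closed-form solution to (9).
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 60, 20, 4
labels = np.repeat(np.arange(K), n // K)
Y = (labels[:, None] == np.arange(K)[None, :]).astype(float)
X = rng.standard_normal((n, p))
Dpi = (Y.T @ Y) / n

# Q holds the trivial all-ones score and one previous, D_pi-normalized score.
theta_prev = rng.standard_normal(K)
theta_prev -= np.ones(K) * (np.ones(K) @ Dpi @ theta_prev)   # make it D_pi-orthogonal to 1
theta_prev /= np.sqrt(theta_prev @ Dpi @ theta_prev)
Q = np.column_stack([np.ones(K), theta_prev])

beta = rng.standard_normal(p)
theta = (np.eye(K) - Q @ Q.T @ Dpi) @ np.linalg.solve(Dpi, Y.T @ X @ beta)
theta /= np.sqrt(theta @ Dpi @ theta)

print("theta^T D theta      =", theta @ Dpi @ theta)        # approximately 1
print("theta^T D 1          =", theta @ Dpi @ np.ones(K))   # approximately 0
print("theta^T D theta_prev =", theta @ Dpi @ theta_prev)   # approximately 0
```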
2.3 Sparse Mixture of Gaussians

2.3.1 A Review of Mixture Discriminant Analysis. LDA will tend to perform well if there truly are K distinct classes separated by linear decision boundaries. However, if a single prototype per class is insufficient for capturing the class structure, then LDA will perform poorly. Hastie and Tibshirani (1996) proposed mixture discriminant analysis (MDA) to overcome the shortcomings of LDA in this setting. We review the MDA proposal here.

Rather than modeling the observations within each class as multivariate Gaussian with a class-specific mean vector and a common within-class covariance matrix, in MDA one instead models each class as a mixture of Gaussians in order to achieve increased flexibility. The kth class, k = 1, ..., K, is divided into R_k subclasses, and we define R = Σ_{k=1}^K R_k. It is assumed that the rth subclass in class k, r = 1, 2, ..., R_k, has a multivariate Gaussian distribution with a subclass-specific mean vector μ_kr ∈ R^p and a common p × p covariance matrix Σ_w. We let Π_k denote the prior probability for the kth class, and π_kr the mixing probability for the rth subclass, with Σ_{r=1}^{R_k} π_kr = 1. The Π_k can be easily estimated from the data, but the π_kr are unknown model parameters.

Hastie and Tibshirani (1996) suggested employing the expectation–maximization (EM) algorithm in order to estimate the subclass-specific mean vectors, the within-class covariance matrix, and the subclass mixing probabilities. In the expectation step, one estimates the probability that the ith observation belongs to the rth subclass of the kth class, given that it belongs to the kth class:

    p(c_kr | x_i, i ∈ C_k) = π_kr exp(−(x_i − μ_kr)^T Σ_w^{-1} (x_i − μ_kr)/2) / Σ_{r'=1}^{R_k} π_kr' exp(−(x_i − μ_kr')^T Σ_w^{-1} (x_i − μ_kr')/2),    r = 1, ..., R_k.    (12)

In (12), c_kr is shorthand for the event that the observation x_i is in the rth subclass of the kth class. In the maximization step, estimates are updated for the subclass mixing probabilities as well as the subclass-specific mean vectors and the pooled within-class covariance matrix:

    π_kr = Σ_{i ∈ C_k} p(c_kr | x_i, i ∈ C_k) / Σ_{r'=1}^{R_k} Σ_{i ∈ C_k} p(c_kr' | x_i, i ∈ C_k),    (13)

    μ_kr = Σ_{i ∈ C_k} x_i p(c_kr | x_i, i ∈ C_k) / Σ_{i ∈ C_k} p(c_kr | x_i, i ∈ C_k),    (14)

    Σ_w = (1/n) Σ_{k=1}^K Σ_{i ∈ C_k} Σ_{r=1}^{R_k} p(c_kr | x_i, i ∈ C_k) (x_i − μ_kr)(x_i − μ_kr)^T.    (15)

The EM algorithm proceeds by iterating between Equations (12)–(15) until convergence. Hastie and Tibshirani (1996) also presented an extension of this EM approach to accommodate a reduced-rank LDA solution via optimal scoring, which we extend in the next section.

2.3.2 The Sparse Mixture Discriminant Analysis Proposal. We now describe our sparse mixture discriminant analysis (SMDA) proposal. We define Z, an n × R blurred response matrix, which is a matrix of subclass probabilities. If the ith observation belongs to the kth class, then the ith row of Z contains the values p(c_k1 | x_i, i ∈ C_k), ..., p(c_kR_k | x_i, i ∈ C_k) in the kth block of R_k entries, and 0's elsewhere. Z is the mixture analog of the indicator response matrix Y. We extend the MDA algorithm presented in Section 2.3.1 by performing SDA using Z, rather than Y, as the indicator response matrix. Then rather than using the raw data X in performing the EM updates (12)–(15), we instead use the transformed data XB, where B = (β_1 ··· β_q) and where q < R. Details are provided in Algorithm 2. This algorithm yields a classification rule for assigning class membership to a test observation. Moreover, the matrix XB serves as a q-dimensional graphical projection of the data.
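For concreteness, the following NumPy sketch carries out one EM pass, Equations (12)–(15), on synthetic data with an equal number of subclasses per class. The initialization, array layout, and variable names are choices made for this illustration only.

```python
# One E-step and one M-step of the MDA model of Section 2.3.1 (a sketch).
import numpy as np

rng = np.random.default_rng(4)
n, p, K, Rk = 90, 5, 3, 2                       # Rk subclasses per class (equal here)
labels = np.repeat(np.arange(K), n // K)
X = rng.standard_normal((n, p))

# Current parameter estimates (initialized arbitrarily for illustration).
mu = rng.standard_normal((K, Rk, p))            # subclass means mu_kr
pi = np.full((K, Rk), 1.0 / Rk)                 # mixing probabilities pi_kr
Sw = np.eye(p)                                  # pooled within-class covariance

# E-step, Equation (12): responsibilities p(c_kr | x_i, i in C_k).
Sw_inv = np.linalg.inv(Sw)
resp = np.zeros((n, K, Rk))
for k in range(K):
    idx = np.where(labels == k)[0]
    for r in range(Rk):
        d = X[idx] - mu[k, r]
        resp[idx, k, r] = pi[k, r] * np.exp(-0.5 * np.einsum("ij,jk,ik->i", d, Sw_inv, d))
    resp[idx, k] /= resp[idx, k].sum(axis=1, keepdims=True)

# M-step, Equations (13)-(15).
for k in range(K):
    idx = np.where(labels == k)[0]
    w = resp[idx, k]                            # |C_k| x Rk responsibilities
    pi[k] = w.sum(axis=0) / w.sum()             # (13)
    mu[k] = (w.T @ X[idx]) / w.sum(axis=0)[:, None]   # (14)
Sw = np.zeros((p, p))
for k in range(K):
    idx = np.where(labels == k)[0]
    for r in range(Rk):
        d = X[idx] - mu[k, r]
        Sw += (resp[idx, k, r][:, None] * d).T @ d
Sw /= n                                          # (15)
print("updated mixing probabilities:\n", pi)
```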

Algorithm 2 Sparse mixture discriminant analysis

1. Initialize the subclass probabilities p(c_kr | x_i, i ∈ C_k), for instance by performing R_k-means clustering within the kth class.
2. Use the subclass probabilities to create the n × R blurred response matrix Z.
3. Iterate until convergence or until a maximum number of iterations is reached:
   (a) Using Z instead of Y, perform SDA in order to find a sequence of q < R pairs of score vectors and discriminant vectors, {θ_k, β_k}, k = 1, ..., q.
   (b) Compute X̃ = XB, where B = (β_1 ··· β_q).
   (c) Compute the weighted means, covariance, and mixing probabilities using Equations (13)–(15), substituting X̃ instead of X. That is,
       π_kr = Σ_{i ∈ C_k} p(c_kr | x̃_i, i ∈ C_k) / Σ_{r'=1}^{R_k} Σ_{i ∈ C_k} p(c_kr' | x̃_i, i ∈ C_k),    (16)
       μ̃_kr = Σ_{i ∈ C_k} x̃_i p(c_kr | x̃_i, i ∈ C_k) / Σ_{i ∈ C_k} p(c_kr | x̃_i, i ∈ C_k),    (17)
       Σ̃_w = (1/n) Σ_{k=1}^K Σ_{i ∈ C_k} Σ_{r=1}^{R_k} p(c_kr | x̃_i, i ∈ C_k)(x̃_i − μ̃_kr)(x̃_i − μ̃_kr)^T.    (18)
   (d) Compute the subclass probabilities using Equation (12), substituting X̃ instead of X and using the current estimates for the weighted means, covariance, and mixing probabilities, as follows:
       p(c_kr | x̃_i, i ∈ C_k) = π_kr exp(−(x̃_i − μ̃_kr)^T Σ̃_w^{-1} (x̃_i − μ̃_kr)/2) / Σ_{r'=1}^{R_k} π_kr' exp(−(x̃_i − μ̃_kr')^T Σ̃_w^{-1} (x̃_i − μ̃_kr')/2).    (19)
   (e) Using the subclass probabilities, update the blurred response matrix Z.
4. The classification rule results from assigning a test observation x_test ∈ R^p, with x̃_test = x_test B, to the class k for which

       Π_k Σ_{r=1}^{R_k} π_kr exp(−(x̃_test − μ̃_kr)^T Σ̃_w^{-1} (x̃_test − μ̃_kr)/2)    (20)

   is largest.
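The classification rule in step 4 can be written compactly once the SMDA quantities are available. The sketch below evaluates Equation (20) for a test observation; all inputs (the sparse directions B, subclass means, mixing probabilities, class priors, and covariance) are placeholder values rather than fitted ones.

```python
# A sketch of the SMDA classification rule, Equation (20).
import numpy as np

def smda_classify(x_test, B, Pi, pi, mu, Sw):
    """Return the index of the class maximizing Equation (20)."""
    x_red = x_test @ B                                   # x~_test = x_test B
    Sw_inv = np.linalg.inv(Sw)
    K = len(Pi)
    scores = np.zeros(K)
    for k in range(K):
        for r in range(mu[k].shape[0]):                  # loop over subclasses of class k
            d = x_red - mu[k][r]
            scores[k] += pi[k][r] * np.exp(-0.5 * d @ Sw_inv @ d)
        scores[k] *= Pi[k]
    return int(np.argmax(scores))

# Tiny illustration with made-up parameters (K = 2 classes, 2 subclasses each, q = 2).
rng = np.random.default_rng(5)
p, q = 30, 2
B = np.zeros((p, q)); B[:3, 0] = 1.0; B[3:6, 1] = 1.0    # sparse directions
Pi = np.array([0.5, 0.5])
pi = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
mu = [rng.standard_normal((2, q)), rng.standard_normal((2, q))]
Sw = np.eye(q)
print("assigned class:", smda_classify(rng.standard_normal(p), B, Pi, pi, mu, Sw))
```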

3. METHODS FOR COMPARISON

In this section, we describe three methods to which we will compare SDA in Section 4: shrunken centroids regularized discriminant analysis (RDA; Guo, Hastie, and Tibshirani 2007), sparse partial least squares regression (SPLS; Chun and Keles 2010), and elastic net (EN) regression of dummy variables.

3.1 Shrunken Centroids Regularized Discriminant Analysis

Shrunken centroids regularized discriminant analysis (RDA) is based on the same underlying model as LDA, that is, normally distributed data with equal dispersion (Guo, Hastie, and Tibshirani 2007). The method regularizes the within-class covariance matrix used by LDA,

    Σ̃_w = α Σ̂_w + (1 − α) I    (21)

for some α, 0 ≤ α ≤ 1, where Σ̂_w is the standard estimate of the within-class covariance matrix used in LDA. In order to perform feature selection, one can perform soft-thresholding of the quantity Σ̃_w^{-1} μ_k, where μ_k is the observed mean vector for the kth class. That is, we compute

    sgn(Σ̃_w^{-1} μ_k)(|Σ̃_w^{-1} μ_k| − Δ)_+,    (22)

where Δ ≥ 0 is the soft-thresholding parameter, and use (22) instead of Σ_w^{-1} μ_k in Bayes's classification rule arising from the multivariate Gaussian model. The R package rda is available from CRAN (2009).

3.2 Sparse Partial Least Squares

In the chemometrics literature, partial least squares (PLS) is a widely used regression method in the p ≫ n setting (see, e.g., Barker and Rayens 2003; Indahl, Martens, and Naes 2007; Indahl, Liland, and Naes 2009). Sparse PLS (SPLS) is an extension of PLS that uses the lasso to promote sparsity of a surrogate direction vector c instead of the original latent direction vector α, while keeping α and c close (Chun and Keles 2010). That is, the first SPLS direction vector solves

    minimize_{α ∈ R^p, c ∈ R^p}  −κ α^T M α + (1 − κ)(c − α)^T M (c − α) + λ ||c||_1 + γ ||c||^2
    subject to  α^T α = 1,    (23)

where M = X^T Y Y^T X, κ is a tuning parameter with 0 ≤ κ ≤ 1, and γ and λ are nonnegative tuning parameters. A simple extension of (23) allows for the computation of additional latent direction vectors. Letting c_1, ..., c_q ∈ R^p denote the sparse surrogate direction vectors resulting from the SPLS method, we obtained a classification rule by performing standard LDA on the matrix (Xc_1 ··· Xc_q). The R package spls is available from CRAN (2009).

3.3 Elastic Net Regression of Dummy Variables

As a simple alternative to SDA, we consider performing an elastic net (EN) regression of the matrix of dummy variables Y onto the data matrix X, in order to compute an n × K matrix of fitted values Ŷ. This is followed by a (possibly reduced-rank) LDA, treating the fitted value matrix Ŷ as the predictors. The resulting classification rule involves only a subset of the features if the lasso tuning parameter in the elastic net regression is sufficiently large. If the elastic net regression is replaced with standard linear regression, then this approach amounts to standard LDA (see, e.g., Indahl, Martens, and Naes 2007).
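A hedged sketch of the comparison method of Section 3.3: elastic net regressions of the dummy-variable columns of Y on X, followed by LDA on the fitted values Ŷ. scikit-learn and the particular tuning-parameter values are assumptions of this illustration; in the paper the tuning parameters are chosen by cross-validation.

```python
# EN regression of dummy variables followed by LDA on the fitted values (a sketch).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def en_dummy_classifier(X, labels, alpha=0.2, l1_ratio=0.5):
    """Fit one elastic net per dummy column of Y, then LDA on the fitted values."""
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)       # n x K dummy matrix
    coefs = []
    for k in range(Y.shape[1]):
        enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=True, max_iter=5000)
        enet.fit(X, Y[:, k])
        coefs.append(np.r_[enet.intercept_, enet.coef_])
    W = np.array(coefs).T                                          # (p+1) x K coefficients
    Yhat = np.c_[np.ones(len(X)), X] @ W                           # fitted values Y_hat
    lda = LinearDiscriminantAnalysis().fit(Yhat, labels)           # LDA on Y_hat
    selected = np.unique(np.concatenate([np.flatnonzero(c[1:]) for c in coefs]))
    return W, lda, selected

# Tiny synthetic illustration: only a handful of features carry class information.
rng = np.random.default_rng(6)
K, n_per, p = 3, 30, 100
means = np.zeros((K, p))
for k in range(K):
    means[k, 5 * k:5 * (k + 1)] = 2.5
X = np.vstack([rng.standard_normal((n_per, p)) + means[k] for k in range(K)])
labels = np.repeat(np.arange(K), n_per)
W, lda, selected = en_dummy_classifier(X, labels)
Yhat = np.c_[np.ones(len(X)), X] @ W
print("train accuracy:", lda.score(Yhat, labels), "| features used:", len(selected))
```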
4. EXPERIMENTAL RESULTS

This section illustrates results on a number of datasets. In these examples, SDA arrived at a stable solution in fewer than 30 iterations. The tuning parameters for all of the methods considered were chosen using leave-one-out cross-validation on the training data (Hastie, Tibshirani, and Friedman 2009). Subsequently, the models with the chosen parameters were evaluated on the test data. Unless otherwise specified, the features were standardized to have mean zero and standard deviation 1, and the penalty matrix Ω = I was used in the SDA formulation.

4.1 Female and Male Silhouettes

In order to illustrate the sparsity of the SDA discriminant vectors, we consider a shape-based dataset consisting of 20 male and 19 female adult face silhouettes. A minimum description length (MDL) approach was used to annotate the silhouettes (Thodberg and Ólafsdóttir 2003), and Procrustes alignment was performed on the resulting 65 MDL (x, y)-coordinates. The training set consisted of 22 silhouettes (11 female and 11 male), and there were 17 silhouettes in the test set (8 female and 9 male). Panels (a) and (b) of Figure 1 illustrate the two classes of silhouettes.

We performed SDA in order to classify the observations into male versus female. Leave-one-out cross-validation on the training data resulted in the selection of 10 nonzero features. The SDA results are illustrated in Figure 1(c). Since there are two classes in this problem, there was only one SDA discriminant vector. Note that the nonzero features included in the model were placed near high-curvature points in the silhouettes. The training and test classification rates (fraction of observations correctly classified) were both 82%. In the original article, a logistic regression was performed on a subset of the principal components of the data, where the subset was determined by backward elimination using a classical statistical test for significance. This resulted in an 85% classification rate on the test set (Thodberg and Ólafsdóttir 2003). The SDA model has an interpretational advantage, since it reveals the exact locations of the main differences between the two genders.



Figure 1. (a) and (b): The silhouettes and the 65 (x, y)-coordinates for the two classes. (c): The mean shape of the silhouettes, and the 10 (x, y)-coordinates in the SDA model. The arrows illustrate the directions of the differences between male and female observations.

4.2 Leukemia Microarray Data

We now consider a leukemia microarray dataset published by Yeoh et al. (2002) and available at http://datam.i2r.a-star.edu.sg/datasets/krbd/. The study aimed to classify subtypes of pediatric acute lymphoblastic leukemia. The data consisted of 12,558 gene expression measurements for 163 training samples and 85 test samples belonging to six cancer classes: BCR-ABL, E2A-PBX1, Hyperdiploid (>50 chromosomes), MLL rearrangement, T-ALL, and TEL-AML1. Analyses were performed on nonnormalized data for comparison with the original analysis of Yeoh et al. (2002). In the work of Yeoh et al. (2002), the data were analyzed in two steps: a feature selection step was followed by a classification step, using a decision tree structure such that one group was separated using a support vector machine at each tree node. On these data, SDA resulted in a model with only 30 nonzero features in each of the SDA discriminant vectors. The classification rates obtained by SDA were comparable to or slightly better than those of Yeoh et al. (2002). The results are summarized in Table 1. In comparison, EN resulted in overall classification rates of 98% on both the training and test sets, with 20 features in the model. Figure 2 displays scatterplots of the six groups projected onto the SDA discriminant vectors.

Table 1. Training and test classification rates using SDA with 30 nonzero features on the leukemia data

Group          Train   Test
All groups     99%     99%
BCR-ABL        89%     83%
E2A-PBX1       100%    100%
Hyperdiploid   100%    100%
T-ALL          100%    100%
TEL-AML1       100%    100%
MLL            100%    100%

4.3 Spectral Identification of Fungal Species

Next, we consider a high-dimensional dataset consisting of multispectral imaging of three Penicillium species: melanoconodium, polonicum, and venetum. The three species all have green/blue conidia (spores) and are therefore visually difficult to distinguish. For each of the three species, four strains were injected onto yeast extract sucrose agar in triplicate, resulting in 36 samples. In total, 3542 variables were extracted from multispectral images with 18 spectral bands: ten in the visual range and eight in the near-infrared range. More details can be found in the article by Clemmensen et al. (2007). The data were partitioned into a training set (24 samples) and a test set (12 samples); one of the three replicates of each strain was included in the test set. Table 2 summarizes the results. The SDA discriminant vectors are displayed in Figure 3.

4.4 Classification of Fish Species Based on Shape and Texture

Here we consider classification of three fish species (cod, haddock, and whiting) on the basis of shape and texture features. The data were taken from the work of Larsen, Olafsdottir, and Ersbøll (2009), and consist of texture and shape measurements for 20 cod, 58 haddock, and 30 whiting. The shapes of the fish are represented with coordinates based on MDL. There were 700 coordinates for the contours of the fish, 300 for the mid line, and one for the eye. The shapes were Procrustes aligned to have full correspondence. The texture features were simply the red, green, and blue intensity values from digitized color images taken with a standard camera under white-light illumination. They were annotated to the shapes using a Delaunay triangulation approach. In total, there were 103,348 shape and texture features. In the work of Larsen, Olafsdottir, and Ersbøll (2009), classification was performed via principal components analysis followed by LDA; this led to a 76% leave-one-out classification rate. Here, we split the data in two: 76 fish for training and 32 fish for testing. The results are listed in Table 3. In this case, SDA gives the sparsest solution and the best test classification rate. Only one of the whiting was misclassified as haddock.

The SDA discriminant vectors are displayed in Figure 4. The first SDA discriminant vector is mainly dominated by blue intensities, and reflects the fact that cod are in general less blue than haddock and whiting around the mid line and mid fin (Larsen, Olafsdottir, and Ersbøll 2009). The second SDA discriminant vector suggests that relative to cod and whiting, haddock tends to have more blue around the head and tail, less green around the mid line, more red around the tail, and less red around the eye, the lower part, and the mid line.


Figure 2. SDA discriminant vectors for the leukemia dataset. A color version of this figure is available online.

Table 2. Classification rates on the Penicillium data

Method   Train   Test   Nonzero loadings
RDA      100%    100%   3502
SPLS     100%    100%   3810
EN       100%    100%   3
SDA      100%    100%   2

Figure 3. The Penicillium dataset projected onto the SDA discriminant vectors.

Table 3. Classification rates for the fish data. RDA (n) and (u) indicate the procedure applied to the normalized and unnormalized data. SPLS was excluded from comparisons for computational reasons

Method    Train   Test   Nonzero loadings
RDA (n)   100%    41%    103,084
RDA (u)   100%    94%    103,348
EN        100%    94%    90
SDA       100%    97%    60

5. DISCUSSION

Linear discriminant analysis is a commonly used method for classification. However, it is known to fail if the true decision boundary between the classes is nonlinear, if more than one prototype is required in order to properly model each class, or if the number of features is large relative to the number of observations. In this article, we addressed the latter setting. We proposed an approach for extending LDA to the high-dimensional setting in such a way that the resulting discriminant vectors involve only a subset of the features. The sparsity of the discriminant vectors, together with the small number of such vectors (the number of classes less one), gives improved interpretability. Our proposal is based upon the simple optimal scoring framework, which recasts LDA as a regression problem. We are consequently able to make use of existing techniques for performing sparse regression when the number of features is very large relative to the number of observations. It is possible to set the exact number of nonzero loadings desired in each discriminative direction, and it should be noted that this number is much smaller than the number of features for the applications seen here. Furthermore, our proposal is easily extended to more complex settings, such as the case where the observations from each class are drawn from a mixture of Gaussian distributions, resulting in nonlinear separations between classes.

Sparse partial least squares failed to work in dimensions of size 10^5 or larger and tended to be conservative with respect to the number of nonzero loadings. Shrunken centroids regularized discriminant analysis performed well when data were not normalized, but likewise tended to be conservative with respect to the number of nonzero loadings. Regression on dummy variables using the elastic net performed well, although not as well as sparse discriminant analysis, and it fails to extend to nonlinear separations. However, it is faster than sparse discriminant analysis. Further investigation of these and related proposals for high-dimensional classification is required in order to develop a full understanding of their strengths and weaknesses.


Figure 4. On the left, the projection of the fish data onto the first and second SDA discriminant vectors. On the right, the selected texture features are displayed on the fish mask. The first SDA discriminant vector is mainly dominated by blue intensities whereas the second SDA discriminant vector consists of red, green, and blue intensities. Only texture features were selected by SDA.

ACKNOWLEDGMENTS

We thank Hildur Ólafsdóttir and Rasmus Larsen at Informatics and Mathematical Modelling, Technical University of Denmark, for making the silhouette and fish data available, and Karl Sjöstrand for valuable comments. Trevor Hastie was supported in part by grant DMS-1007719 from the National Science Foundation and grant RO1-EB001988-12 from the National Institutes of Health. Finally, we thank the editor, an associate editor, and two referees for valuable comments.

[Received June 2008. Revised July 2011.]

REFERENCES

Barker, M., and Rayens, W. (2003), "Partial Least Squares for Discrimination," Journal of Chemometrics, 17, 166–173.
Bickel, P., and Levina, E. (2004), "Some Theory for Fisher's Linear Discriminant Function, 'Naive Bayes,' and Some Alternatives When There Are Many More Variables Than Observations," Bernoulli, 6, 989–1010.
Chun, H., and Keles, S. (2010), "Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection," Journal of the Royal Statistical Society, Ser. B, 72 (1), 3–25.
Clemmensen, L., Hansen, M., Ersbøll, B., and Frisvad, J. (2007), "A Method for Comparison of Growth Media in Objective Identification of Penicillium Based on Multi-Spectral Imaging," Journal of Microbiological Methods, 69, 249–255.
CRAN (2009), "The Comprehensive R Archive Network," available at http://cran.r-project.org/.
Dudoit, S., Fridlyand, J., and Speed, T. (2001), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 96, 1151–1160.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.
Friedman, J. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Friedman, J., Hastie, T., Hoefling, H., and Tibshirani, R. (2007), "Pathwise Coordinate Optimization," The Annals of Applied Statistics, 1, 302–332.
Grosenick, L., Greer, S., and Knutson, B. (2008), "Interpretable Classifiers for fMRI Improve Prediction of Purchases," IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16 (6), 539–548.
Guo, Y., Hastie, T., and Tibshirani, R. (2007), "Regularized Linear Discriminant Analysis and Its Applications in Microarrays," Biostatistics, 8 (1), 86–100.
Hand, D. J. (2006), "Classifier Technology and the Illusion of Progress," Statistical Science, 21 (1), 1–15.
Hastie, T., and Tibshirani, R. (1996), "Discriminant Analysis by Gaussian Mixtures," Journal of the Royal Statistical Society, Ser. B, 58, 158–176.
Hastie, T., Buja, A., and Tibshirani, R. (1995), "Penalized Discriminant Analysis," The Annals of Statistics, 23 (1), 73–102.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning (2nd ed.), New York: Springer.
Indahl, U., Liland, K., and Naes, T. (2009), "Canonical Partial Least Squares – A Unified PLS Approach to Classification and Regression Problems," Journal of Chemometrics, 23, 495–504.
Indahl, U., Martens, H., and Naes, T. (2007), "From Dummy Regression to Prior Probabilities in PLS-DA," Journal of Chemometrics, 21, 529–536.
Krzanowski, W., Jonathan, P., McCarthy, W., and Thomas, M. (1995), "Discriminant Analysis With Singular Covariance Matrices: Methods and Applications to Spectroscopic Data," Journal of the Royal Statistical Society, Ser. C, 44, 101–115.
Larsen, R., Olafsdottir, H., and Ersbøll, B. (2009), "Shape and Texture Based Classification of Fish Species," in 16th Scandinavian Conference on Image Analysis, Lecture Notes in Computer Science, Vol. 5575, Berlin: Springer.
Leng, C. (2008), "Sparse Optimal Scoring for Multiclass Cancer Diagnosis and Biomarker Detection Using Microarray Data," Computational Biology and Chemistry, 32, 417–425.
Thodberg, H. H., and Ólafsdóttir, H. (2003), "Adding Curvature to Minimum Description Length Shape Models," in British Machine Vision Conference, BMVC, Norwich, U.K.: British Machine Vision Association.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58 (1), 267–288.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences of the United States of America, 99, 6567–6572.
Witten, D., and Tibshirani, R. (2011), "Penalized Classification Using Fisher's Linear Discriminant," Journal of the Royal Statistical Society, Ser. B, to appear.
Xu, P., Brock, G., and Parrish, R. (2009), "Modified Linear Discriminant Analysis Approaches for Classification of High-Dimensional Microarray Data," Computational Statistics and Data Analysis, 53, 1674–1687.
Yeoh, E.-J., et al. (2002), "Classification, Subtype Discovery, and Prediction of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene Expression Profiling," Cancer Cell, 1, 133–143.
Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Ser. B, 67 (2), 301–320.
Zou, H., Hastie, T., and Tibshirani, R. (2006), "Sparse Principal Component Analysis," Journal of Computational and Graphical Statistics, 15, 265–286.
