A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning

Liang Sun [email protected]
Shuiwang Ji [email protected]
Jieping Ye [email protected]
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Abstract

Many machine learning algorithms can be formulated as a generalized eigenvalue problem. One major limitation of such a formulation is that the generalized eigenvalue problem is computationally expensive to solve, especially for large-scale problems. In this paper, we show that under a mild condition, a class of generalized eigenvalue problems in machine learning can be formulated as a least squares problem. This class of problems includes classical techniques such as Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), and Linear Discriminant Analysis (LDA), as well as Hypergraph Spectral Learning (HSL). As a result, various regularization techniques can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, the least squares formulation leads to efficient and scalable implementations based on iterative conjugate gradient type algorithms. We report experimental results that confirm the established equivalence relationship. Results also demonstrate the efficiency and effectiveness of the equivalent least squares formulations on large-scale problems.

1. Introduction

A number of machine learning algorithms can be formulated as a generalized eigenvalue problem. Such techniques include Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), Linear Discriminant Analysis (LDA), and Hypergraph Spectral Learning (HSL) (Hotelling, 1936; Rosipal & Krämer, 2006; Sun et al., 2008a; Tao et al., 2009; Ye, 2007). Although well-established algorithms in numerical linear algebra have been developed to solve generalized eigenvalue problems, they are in general computationally expensive and hence may not scale to large-scale machine learning problems. In addition, it is challenging to directly incorporate the sparsity constraint into the mathematical formulation of these techniques. Sparsity often leads to easy interpretation and good generalization ability. It has been used successfully in linear regression (Tibshirani, 1996) and Principal Component Analysis (PCA) (d'Aspremont et al., 2004).

Multivariate Linear Regression (MLR) that minimizes the sum-of-squares error function, called least squares, is a classical technique for regression problems. It can also be applied to classification problems by defining an appropriate class indicator matrix (Bishop, 2006). The solution to the least squares problem can be obtained by solving a linear system of equations, and a number of algorithms, including the conjugate gradient algorithm, can be applied to solve it efficiently (Golub & Van Loan, 1996). Furthermore, the least squares formulation can be readily extended using the regularization technique. For example, the 1-norm and 2-norm regularization can be incorporated into the least squares formulation to improve sparsity and control model complexity (Bishop, 2006).

Motivated by the mathematical and numerical properties of the generalized eigenvalue problem and the least squares formulation, several researchers have attempted to connect these two approaches. In particular, it has been shown that there is a close relationship between LDA, CCA, and least squares (Hastie et al., 2001; Bishop, 2006; Ye, 2007; Sun et al., 2008b). However, the intrinsic relationship between least squares and the other techniques involving generalized eigenvalue problems mentioned above remains unclear.

In this paper, we study the relationship between the least squares formulation and a class of generalized eigenvalue problems in machine learning. In particular, we establish the equivalence relationship between these two formulations under a mild condition. As a result, various regularization techniques such as the 1-norm and 2-norm regularization can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, this equivalence relationship leads to efficient and scalable implementations for these generalized eigenvalue problems based on iterative conjugate gradient type algorithms such as LSQR (Paige & Saunders, 1982). We have conducted experiments using several benchmark data sets. The experiments confirm the equivalence relationship between these two models under the given assumption. Our results show that even when the assumption does not hold, the performance of these two models is still very close. Results also demonstrate the efficiency and effectiveness of the equivalent least squares models and their extensions.

Notations: The number of training samples, the data dimensionality, and the number of classes (or labels) are denoted by n, d, and k, respectively. x_i ∈ R^d denotes the ith observation, and y_i ∈ R^k encodes the label information for x_i. X = [x_1, x_2, ..., x_n] ∈ R^{d×n} represents the data matrix, and Y = [y_1, y_2, ..., y_n] ∈ R^{k×n} is the matrix representation of the label information. {x_i}_{i=1}^n is assumed to be centered, i.e., Σ_{i=1}^n x_i = 0. S ∈ R^{n×n} is a symmetric and positive semi-definite matrix, and e is a vector of all ones.

Organization: We present background and related work in Section 2, establish the equivalence relationship between the generalized eigenvalue problem and the least squares problem in Section 3, discuss extensions based on the established equivalence result in Section 4, present the efficient implementation in Section 5, report empirical results in Section 6, and conclude this paper in Section 7.

2. Background and Related Work

In this section, we present the class of generalized eigenvalue problems studied in this paper. The least squares formulation is also briefly reviewed.

2.1. A Class of Generalized Eigenvalue Problems

We consider a class of generalized eigenvalue problems in the following form:

    XSX^T w = λ XX^T w,    (1)

where X ∈ R^{d×n} represents the data matrix and S ∈ R^{n×n} is symmetric and positive semi-definite. The generalized eigenvalue problem in Eq. (1) is often reformulated as the following eigenvalue problem:

    (XX^T)† XSX^T w = λ w,    (2)

where (XX^T)† is the pseudoinverse of XX^T. In general, we are interested in eigenvectors corresponding to nonzero eigenvalues. It turns out that many machine learning techniques can be formulated in the form of Eqs. (1) and (2).

2.2. Examples of Generalized Eigenvalue Problems

We briefly review several algorithms that involve a generalized eigenvalue problem in the general form of Eq. (1). They include Canonical Correlation Analysis, Partial Least Squares, Linear Discriminant Analysis, and Hypergraph Spectral Learning. For supervised learning methods, the label information is encoded in the matrix Y = [y_1, y_2, ..., y_n] ∈ R^{k×n}, where y_i(j) = 1 if x_i belongs to class j and y_i(j) = 0 otherwise.

Canonical Correlation Analysis

In CCA (Hotelling, 1936), two different representations, X and Y, of the same set of objects are given, and a projection is computed for each representation such that they are maximally correlated in the dimensionality-reduced space. Denote the projection vector for X by w_x ∈ R^d, and assume that YY^T is nonsingular. It can be verified that w_x is the first principal eigenvector of the following generalized eigenvalue problem:

    XY^T (YY^T)^{-1} YX^T w_x = λ XX^T w_x.    (3)

Multiple projection vectors can be obtained simultaneously by computing the first ℓ principal eigenvectors of the generalized eigenvalue problem in Eq. (3). It can be observed that CCA is in the form of the generalized eigenvalue problem in Eq. (1) with S = Y^T (YY^T)^{-1} Y.

Partial Least Squares

In contrast to CCA, Orthonormalized PLS (OPLS), a variant of PLS (Rosipal & Krämer, 2006), computes orthogonal score vectors by maximizing the covariance between X and Y. It solves the following generalized eigenvalue problem:

    XY^T YX^T w = λ XX^T w.    (4)

It can be observed that Orthonormalized PLS involves a generalized eigenvalue problem in Eq. (1) with S = Y^T Y.

Hypergraph Spectral Learning

A hypergraph (Agarwal et al., 2006) is a generalization of the traditional graph in which the edges (a.k.a. hyperedges) are arbitrary non-empty subsets of the vertex set. HSL (Sun et al., 2008a) employs a hypergraph to capture the correlation information among different labels for improved classification performance in multi-label learning. It has been shown that, given the normalized Laplacian L_H for the constructed hypergraph, HSL involves the following generalized eigenvalue problem:

    XSX^T w = λ (XX^T) w,  where S = I − L_H.    (5)

It has been shown that for many existing definitions of the Laplacian L_H, the resulting matrix S is symmetric and positive semi-definite, and it can be decomposed as S = HH^T, where H ∈ R^{n×k}.
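When no closed form for H is available, the decomposition S = HH^T can be obtained numerically. The sketch below is our own illustration; it assumes a normalized hypergraph Laplacian L_H has already been constructed (e.g., via the star expansion), which is outside the scope of this snippet.

```python
# Form S = I - L_H as in Eq. (5) and factor it as S = H H^T.
import numpy as np

def factor_S(L_H, tol=1e-10):
    n = L_H.shape[0]
    S = np.eye(n) - L_H                              # S = I - L_H
    evals, evecs = np.linalg.eigh((S + S.T) / 2)     # S is symmetric PSD in exact arithmetic
    keep = evals > tol                               # drop numerically zero/negative eigenvalues
    H = evecs[:, keep] * np.sqrt(evals[keep])        # S ~= H H^T
    return S, H
```

Linear Discriminant Analysis

LDA is a supervised dimensionality reduction technique. The transformation in LDA is obtained by maximizing the ratio of the inter-class distance to the intra-class distance. It is known that CCA is equivalent to LDA for multi-class problems. Thus, the S matrix for LDA can be derived similarly.

2.3. Least Squares for Regression and Classification

Least squares is a classical technique for both regression and classification (Bishop, 2006). In regression, we are given a training set {(x_i, t_i)}_{i=1}^n, where x_i ∈ R^d is the observation and t_i ∈ R^k is the corresponding target. We assume that both the observations and the targets are centered; then the intercept can be eliminated. In this case, the weight matrix W ∈ R^{d×k} can be computed by minimizing the following sum-of-squares error function:

    min_W Σ_{i=1}^n ||W^T x_i − t_i||_2^2 = ||W^T X − T||_F^2,    (6)

where T = [t_1, ..., t_n] is the target matrix. It is well known that the optimal solution W_ls is given by

    W_ls = (XX^T)† X T^T.    (7)

Least squares can also be applied to classification problems. In the general multi-class case, we are given a data set consisting of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d, y_i ∈ {1, 2, ..., k} denotes the class label of the i-th sample, and k ≥ 2. To apply the least squares formulation to the multi-class case, the 1-of-k binary coding scheme is usually employed to assign a vector-valued class code to each data point (Bishop, 2006). The solution to the least squares problem depends on the choice of the class indicator matrix (Hastie et al., 2001; Ye, 2007). In contrast to the generalized eigenvalue problem, the least squares problem can be solved efficiently using iterative conjugate gradient algorithms (Golub & Van Loan, 1996; Paige & Saunders, 1982).

A minimal sketch (ours, with illustrative names) of least squares classification via Eqs. (6)-(7) with a 1-of-k indicator; the specific indicator matrix used for the equivalence result is introduced in Section 3.

```python
# Least squares classification: W_ls = (X X^T)^dagger X T^T, Eqs. (6)-(7).
import numpy as np

def one_of_k(labels, k):
    """k x n indicator matrix: T[j, i] = 1 iff sample i belongs to class j."""
    labels = np.asarray(labels)
    T = np.zeros((k, labels.size))
    T[labels, np.arange(labels.size)] = 1.0
    return T

def least_squares_fit(X, T):
    """X is d x n with centered columns, T is k x n; returns W of shape d x k."""
    return np.linalg.pinv(X @ X.T) @ X @ T.T

def predict(W, X_new):
    """Assign each column of X_new to the class with the largest score."""
    return np.argmax(W.T @ X_new, axis=0)
```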

3. Generalized Eigenvalue Problem versus Least Squares Problem

In this section, we investigate the relationship between the generalized eigenvalue problem in Eq. (1), or its equivalent formulation in Eq. (2), and the least squares formulation. In particular, we show that under a mild condition¹, the eigenvalue problem in Eq. (2) can be formulated as a least squares problem with a specific target matrix.

¹ The condition states that {x_i}_{i=1}^n are linearly independent before centering, i.e., rank(X) = n − 1 after the data is centered (of zero mean).

3.1. Matrix Orthonormality Property

For convenience of presentation, we define matrices C_X and C_S as follows:

    C_X = XX^T ∈ R^{d×d},    C_S = XSX^T ∈ R^{d×d}.

The eigenvalue problem in Eq. (2) can then be expressed as

    C_X† C_S w = λ w.    (8)

Recall that S is symmetric and positive semi-definite; thus it can be decomposed as

    S = HH^T,    (9)

where H ∈ R^{n×s} and s ≤ n. For most examples discussed in Section 2.2, the closed form of H can be obtained and s = k ≪ n. Since X is centered, i.e., Xe = 0, we have X = XC, where C = I − (1/n) ee^T is the centering matrix satisfying C^T = C and Ce = 0. It follows that

    C_S = XSX^T = (XC)S(XC)^T = X(C^T SC)X^T = X S̃ X^T,    (10)

where S̃ = C^T SC. Note that

    e^T S̃ e = e^T C^T SCe = 0.    (11)

Thus, we can assume that e^T Se = 0, that is, both the columns and the rows of S are centered.
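For concreteness, the following sketch (our own illustration) gives closed-form factors H with S = HH^T for the CCA and OPLS examples of Section 2.2 and numerically checks the centering identity in Eq. (11); the CCA factor assumes YY^T is nonsingular.

```python
# Closed-form H with S = H H^T for OPLS and CCA, plus a check of Eq. (11).
import numpy as np
from scipy.linalg import sqrtm

def H_opls(Y):
    return Y.T                                    # S = Y^T Y = H H^T

def H_cca(Y):
    return Y.T @ np.linalg.inv(sqrtm(Y @ Y.T))    # S = Y^T (Y Y^T)^{-1} Y

def center_S(S):
    n = S.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n           # centering matrix C = I - ee^T/n
    return C.T @ S @ C                            # S~ = C^T S C as in Eq. (10)

Y = np.eye(3)[:, [0, 1, 2, 0, 1, 2, 0]]           # 3 x 7 label indicator matrix
S_tilde = center_S(H_cca(Y) @ H_cca(Y).T)
e = np.ones(7)
print(abs(e @ S_tilde @ e) < 1e-8)                # Eq. (11): e^T S~ e = 0
```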

Before presenting the main results, we establish the following lemmas.

Lemma 1. Assume that e^T Se = 0. Let H be defined in Eq. (9), and let HP = QR be the QR decomposition of H with column pivoting, where Q ∈ R^{n×r} has orthonormal columns, R ∈ R^{r×s} is upper triangular, r = rank(H) ≤ s, and P is a permutation matrix. Then Q^T e = 0.

Proof. Since P is a permutation matrix, we have P^T P = I. The result follows since e^T Se = e^T HH^T e = e^T QRP^T PR^T Q^T e = e^T QRR^T Q^T e = 0 and RR^T is positive definite.

Lemma 2. Let A ∈ R^{m×(m−1)} and B ∈ R^{m×p} (p ≤ m) be two matrices satisfying A^T e = 0, A^T A = I_{m−1}, B^T B = I_p, and B^T e = 0. Let F = A^T B. Then F^T F = I_p.

Proof. Define the orthogonal matrix A_x as A_x = [A, (1/√m) e] ∈ R^{m×m}. We have

    I_m = A_x A_x^T = AA^T + (1/m) ee^T  ⇔  AA^T = I_m − (1/m) ee^T.

Since B^T e = 0 and B^T B = I_p, we obtain

    F^T F = B^T AA^T B = B^T (I_m − (1/m) ee^T) B = B^T B − (1/m)(B^T e)(B^T e)^T = I_p.

Let R = U_R Σ_R V_R^T be the thin Singular Value Decomposition (SVD) of R ∈ R^{r×s}, where U_R ∈ R^{r×r} is orthogonal, V_R ∈ R^{s×r} has orthonormal columns, and Σ_R ∈ R^{r×r} is diagonal. It follows that (QU_R)^T (QU_R) = I_r, and the SVD of S can be derived as follows:

    S = HH^T = QRR^T Q^T = QU_R Σ_R^2 U_R^T Q^T = (QU_R) Σ_R^2 (QU_R)^T.    (12)

Assume that the columns of X are centered, i.e., Xe = 0, and rank(X) = n − 1. Let

    X = UΣV^T = U_1 Σ_1 V_1^T

be the SVD of X, where U and V are orthogonal, Σ ∈ R^{d×n}, U_1 Σ_1 V_1^T is the compact SVD of X, U_1 ∈ R^{d×(n−1)}, V_1 ∈ R^{n×(n−1)}, and Σ_1 ∈ R^{(n−1)×(n−1)} is diagonal. Define M_1 as

    M_1 = V_1^T (QU_R) ∈ R^{(n−1)×r}.    (13)

We can show that the columns of M_1 are orthonormal, as summarized in the following lemma:

Lemma 3. Let M_1 be defined as above. Then M_1^T M_1 = I_r.

Proof. Since Xe = 0, we have V_1^T e = 0. Also note that V_1^T V_1 = I_{n−1}. Recall that (QU_R)^T (QU_R) = I_r and (QU_R)^T e = U_R^T Q^T e = 0 from Lemma 1. It follows from Lemma 2 that M_1^T M_1 = I_r.
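The constructions above map directly onto a few library calls. The sketch below (ours, not the authors' code) builds Q, U_R, and M_1 and verifies Lemma 3 numerically; it assumes X has centered columns with rank(X) = n − 1 and that S (equivalently H) is centered so that e^T Se = 0.

```python
# Numerical check of Lemma 3: M_1 = V_1^T (Q U_R) has orthonormal columns.
import numpy as np
from scipy.linalg import qr, svd

def check_lemma3(X, H):
    n = X.shape[1]
    Q, R, _ = qr(H, mode='economic', pivoting=True)   # pivoted QR: H P = Q R
    r = np.linalg.matrix_rank(R)
    Q, R = Q[:, :r], R[:r, :]
    U_R, _, _ = svd(R, full_matrices=False)           # thin SVD of R
    _, _, VT = svd(X, full_matrices=False)
    V1 = VT[:n - 1, :].T                               # top n-1 right singular vectors of X
    M1 = V1.T @ (Q @ U_R)                              # Eq. (13)
    return np.allclose(M1.T @ M1, np.eye(r))
```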

3.2. The Equivalence Relationship

We first derive the solution to the eigenvalue problem in Eq. (2) in the following theorem:

Theorem 1. Let U_1, Σ_1, V_1, Q, Σ_R, and U_R be defined as above. Assume that the columns of X are centered, i.e., Xe = 0, and rank(X) = n − 1. Then the nonzero eigenvalues of the problem in Eq. (2) are diag(Σ_R^2), and the corresponding eigenvectors are W_eig = U_1 Σ_1^{-1} V_1^T QU_R.

We summarize the main result of this section in the following theorem:

Theorem 2. Assume that the class indicator matrix T̃ for least squares classification is defined as

    T̃ = U_R^T Q^T ∈ R^{r×n}.    (14)

Then the solution to the least squares formulation in Eq. (6) is given by

    W_ls = U_1 Σ_1^{-1} V_1^T QU_R.    (15)

Thus, the eigenvalue problem and the least squares problem are equivalent. The proofs of the above two theorems are given in the Appendix.

Remark 1. The analysis in (Ye, 2007; Sun et al., 2008b) is based on a key assumption that the H matrix in S = HH^T as defined in Eq. (9) has orthonormal columns, which is the case for LDA and CCA. However, this is in general not true, e.g., for the H matrix in OPLS and HSL. The equivalence result established in this paper significantly improves previous work by relaxing this assumption.

The matrix W_eig is applied for dimensionality reduction (projection) for all examples of Eq. (1). The weight matrix W_ls in least squares can also be used for dimensionality reduction. If T̃ = Q^T is used as the class indicator matrix, the weight matrix becomes W̃_ls = U_1 Σ_1^{-1} V_1^T Q. Thus, the difference between W_eig and W̃_ls is the orthogonal matrix U_R. Note that the Euclidean distance is invariant to any orthogonal transformation. If a classifier based on the Euclidean distance, such as K-Nearest-Neighbor (KNN) or a linear Support Vector Machine (SVM) (Schölkopf & Smola, 2002), is applied on the dimensionality-reduced data via W_eig and W̃_ls, they will achieve the same classification performance.
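A toy numerical check (our own sketch, using the OPLS choice S = Y^T Y with a centered H) of Theorems 1 and 2: the nonzero eigenvalues of Eq. (2) match diag(Σ_R^2), and the least squares solution obtained with the target matrix T̃ spans the same subspace as the top eigenvectors.

```python
import numpy as np
from scipy.linalg import qr, svd

d, n, k = 20, 8, 3
rng = np.random.default_rng(1)
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)                 # Xe = 0, so rank(X) = n - 1
Y = np.eye(k)[:, np.arange(n) % k]                 # k x n indicator
H = Y.T - Y.T.mean(axis=0, keepdims=True)          # center H so that e^T S e = 0
S = H @ H.T

Q, R, _ = qr(H, mode='economic', pivoting=True)    # H P = Q R
r = np.linalg.matrix_rank(R)
Q, R = Q[:, :r], R[:r, :]
U_R, sig_R, _ = svd(R, full_matrices=False)

T_tilde = U_R.T @ Q.T                              # Eq. (14)
W_ls = np.linalg.pinv(X @ X.T) @ X @ T_tilde.T     # Eq. (7)

evals, evecs = np.linalg.eig(np.linalg.pinv(X @ X.T) @ X @ S @ X.T)
top = np.argsort(-evals.real)[:r]
print(np.allclose(np.sort(evals.real[top]), np.sort(sig_R**2)))            # Theorem 1
P = W_ls @ np.linalg.pinv(W_ls)                    # projector onto span(W_ls)
print(np.allclose(P @ evecs[:, top].real, evecs[:, top].real, atol=1e-8))  # Theorem 2 (same subspace)
```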

In some cases, the number of nonzero eigenvalues, i.e., r, is large (comparable to n). It is common to use the top eigenvectors corresponding to the largest ℓ < r eigenvalues.

4. Extensions Using Regularization

Based on the established equivalence relationship, regularization can be readily incorporated into the least squares formulation. The 2-norm regularized least squares formulation minimizes the objective L2 = ||W^T X − T̃||_F^2 + γ Σ_{j=1}^r ||w_j||_2^2, where w_j denotes the j-th column of W and γ > 0 is the regularization parameter. We can then apply the LSQR algorithm, a conjugate gradient type method proposed in (Paige & Saunders, 1982), for solving large-scale sparse least squares problems.

It is known that model sparsity can often be achieved by applying the L1-norm regularization (Donoho, 2006; Tibshirani, 1996). This has been introduced into the least squares formulation, and the resulting model is called the lasso (Tibshirani, 1996). Based on the established equivalence relationship between the original generalized eigenvalue problem and the least squares formulation, we derive the 1-norm regularized least squares formulation by minimizing the following objective function: L1 = ||W^T X − T̃||_F^2 + γ Σ_{j=1}^r ||w_j||_1, for some tuning parameter γ > 0 (Tibshirani, 1996). The lasso can be solved efficiently using state-of-the-art algorithms (Friedman et al., 2007; Hale et al., 2008). In addition, the entire solution path for all values of γ can be obtained by applying the Least Angle Regression algorithm (Efron et al., 2004).

5. Efficient Implementation via LSQR

Recall that we deal with the generalized eigenvalue problem in Eq. (1), although in our theoretical derivation an equivalent eigenvalue problem in Eq. (2) is used instead. Large-scale generalized eigenvalue problems are known to be much harder than regular eigenvalue problems. There are two options to transform the problem in Eq. (1) into a standard eigenvalue problem (Saad, 1992): (i) factor XX^T; and (ii) employ a shift-and-invert strategy. Both options are computationally expensive for large-scale problems. The established equivalence relationship instead allows us to solve the problem as a least squares problem, as summarized in Algorithm 1.

Algorithm 1 Efficient Implementation via LSQR
  Input: X, H
  Compute the QR decomposition of H: HP = QR.
  Compute the SVD of R: R = U_R Σ_R V_R^T.
  Compute the class indicator matrix T̃ = U_R^T Q^T.
  Regress X onto T̃ using LSQR.

The data matrix X is often sparse in many applications; however, it is no longer sparse after centering. In order to keep the sparsity of X, the vector x_i is augmented by an additional component as x̃_i^T = [1, x_i^T], and the extended X is denoted as X̃ ∈ R^{(d+1)×n}. This new component acts as the bias for least squares.

Computing the class indicator matrix via the QR decomposition and the SVD in Algorithm 1 takes O(nk^2) time. For a dense data matrix, the overall computational cost of each iteration of LSQR is O(3n + 5d + 2dn) (Paige & Saunders, 1982). Since the least squares problems are solved k times, the overall cost of LSQR is O(Nk(3n + 5d + 2dn)), where N is the total number of iterations. When the matrix X̃ is sparse, the cost is significantly reduced: let the number of nonzero elements in X̃ be z; then the overall cost of LSQR is reduced to O(Nk(3n + 5d + 2z)). In summary, the total time complexity for solving the least squares formulation via LSQR is O(nk^2 + Nk(3n + 5d + 2z)).
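A hedged sketch (ours, not the authors' Matlab implementation) of Algorithm 1 with 2-norm regularization: scipy's LSQR exposes a damp argument that minimizes ||X̃^T w − t||_2^2 + damp^2 ||w||_2^2, so damp = sqrt(γ) plays the role of the regularization parameter above; the bias augmentation described in Section 5 is omitted for brevity.

```python
import numpy as np
from scipy.linalg import qr, svd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

def regularized_ls_fit(X, H, gamma=1e-2):
    """X: d x n data matrix (dense or sparse), H: n x s with S = H H^T."""
    Q, R, _ = qr(H, mode='economic', pivoting=True)    # H P = Q R
    r = np.linalg.matrix_rank(R)
    U_R, _, _ = svd(R[:r, :], full_matrices=False)
    T_tilde = U_R.T @ Q[:, :r].T                       # r x n class indicator matrix
    A = csr_matrix(X).T                                # LSQR solves min ||A w - t|| per target
    W = np.column_stack([lsqr(A, T_tilde[j], damp=np.sqrt(gamma))[0]
                         for j in range(r)])
    return W                                           # d x r weight matrix
```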
6. Experiments

In this section, we report experimental results that validate the established equivalence relationship. We also demonstrate the efficiency of the proposed least squares extensions.

Table 1. Statistics of the test data sets: n is the number of data points, d is the data dimensionality, and k is the number of labels (classes).

    Data Set                  n     d      k
    Yeast                     2417  103    14
    Yahoo\Arts&Humanities     3712  23146  26
    USPS                      9298  256    10

Experimental Setup  The techniques involved can be divided into two categories: (1) CCA, PLS, and HSL for multi-label learning; and (2) LDA for multi-class learning. We use both multi-label [yeast (Elisseeff & Weston, 2001) and Yahoo (Kazawa et al., 2005)] and multi-class [USPS (Hull, 1994)] data sets in the experiments. For the Yahoo data set, we preprocess it using the feature selection method proposed in (Yang & Pedersen, 1997). The statistics of the data sets are summarized in Table 1.

For each data set, a transformation is learned from the training set, and it is then used to project the test data onto a lower-dimensional space. The linear Support Vector Machine (SVM) is applied for classification. The Receiver Operating Characteristic (ROC) score and the classification accuracy are used to evaluate the performance on multi-label and multi-class tasks, respectively. Ten random partitions of the data sets into training and test sets are generated in the experiments, and the mean performance over all labels and all partitions is reported. All experiments are performed on a PC with an Intel Core 2 Duo T7200 2.0GHz CPU and 2GB RAM.

For the generalized eigenvalue problem in Eq. (1), a regularization term is commonly added as (XX^T + γI) to cope with the singularity problem of XX^T. We name the resulting regularized method using the prefix "r" before the corresponding method, e.g., "rStar"². The equivalent least squares formulations are named using the prefix "LS", such as "LS-Star", and the resulting 1-norm and 2-norm regularized formulations are named by adding subscripts 1 and 2, respectively, e.g., "LS-Star1" and "LS-Star2". All algorithms were implemented in Matlab.

² rStar denotes the regularized HSL formulation when the star expansion (Agarwal et al., 2006) is used to form the hypergraph Laplacian.

Evaluation of the Equivalence Relationship  In this experiment, we show the equivalence relationship between the generalized eigenvalue problem and its corresponding least squares formulation for all techniques discussed in this paper. We observe that for all data sets, when the data dimensionality d is larger than the sample size n, rank(X) = n − 1 is likely to hold. We also observe that the generalized eigenvalue problem and its corresponding least squares formulation achieve the same performance when rank(X) = n − 1 holds. These results are consistent with the theoretical results in Theorems 1 and 2.

We also compare the performance of the two formulations when the assumption in Theorems 1 and 2 is violated. Figure 1 shows the performance of different formulations when the size of the training set varies from 100 to 900 with a step size of about 100 on the yeast data set and the USPS data set. Due to the space constraint, only the results from HSL based on the star expansion and from LDA are presented. We can observe from the figure that when n is small, the assumption in Theorem 1 holds and the two formulations achieve the same performance; when n is large, the assumption in Theorem 1 does not hold and the two formulations achieve different performance, although the difference is always very small in the experiments. We can also observe from Figure 1 that the regularized methods outperform the unregularized ones, which validates the effectiveness of regularization.

Evaluation of Scalability  In this experiment, we compare the scalability of the original generalized eigenvalue problem and the equivalent least squares formulation. Since regularization is commonly employed in practice, we compare the regularized version of the generalized eigenvalue problem and its corresponding 2-norm regularized least squares formulation. The least squares problem is solved by the LSQR algorithm (Paige & Saunders, 1982).

The computation time of the two formulations on the high-dimensional multi-label Yahoo data set is shown in Figure 2 (top row), where the data dimensionality increases and the training sample size is fixed at 1000. Only the results from CCA and HSL are presented due to the space constraint. It can be observed that the computation time for both algorithms increases steadily as the data dimensionality increases. However, the computation time of the least squares formulation is substantially less than that of the original one.

We also evaluate the scalability of the two formulations in terms of the training sample size. Figure 2 (bottom row) shows the computation time of the two formulations on the Yahoo data set as the training sample size increases, with the data dimensionality fixed at 5000. We can also observe that the least squares formulation is much more scalable than the original one.

[Figure 1: panels (A) HSL-Star and (B) LDA; x-axis: Training Size; y-axis: Mean ROC (A) / Mean Accuracy (B).]
Figure 1. Comparison of different formulations in terms of the ROC score/accuracy for different techniques on the yeast data set and the USPS data set. HSL is applied on the yeast data set, and LDA is applied on the USPS data set. For regularized algorithms, the optimal value of γ is estimated from {1e−6, 1e−4, 1e−2, 1, 10, 100, 1000} using cross validation.

[Figure 2: panels (A) CCA and (B) HSL-Star; x-axis: Dimensionality (top) / Size of Training Set (bottom); y-axis: Time (in seconds).]
Figure 2. Computation time (in seconds) of the generalized eigenvalue problem and the corresponding least squares formulation on the Yahoo\Arts&Humanities data set as the data dimensionality (top row) or the training sample size (bottom row) increases. The x-axis represents the data dimensionality (top) or the training sample size (bottom), and the y-axis represents the computation time.

7. Conclusions and Future Work

In this paper, we study the relationship between a class of generalized eigenvalue problems in machine learning and the least squares formulation. In particular, we show that a class of generalized eigenvalue problems in machine learning can be reformulated as a least squares problem under a mild condition, which generally holds for high-dimensional data. The class of problems includes CCA, PLS, HSL, and LDA. Based on the established equivalence relationship, various regularization techniques can be employed to improve the generalization ability and induce sparsity in the resulting model. In addition, the least squares formulation results in efficient implementations based on iterative conjugate gradient type algorithms such as LSQR. Our experimental results confirm the established equivalence relationship. Results also show that the performance of the least squares formulation and the original generalized eigenvalue problem is very close even when the assumption is violated. Our experiments also demonstrate the effectiveness and scalability of the least squares extensions.

Unlabeled data can be incorporated into the least squares formulation through the graph Laplacian, which captures the local geometry of the data (Belkin et al., 2006). We plan to investigate the effectiveness of this semi-supervised framework for the class of generalized eigenvalue problems studied in this paper. The equivalence relationship will in general not hold for low-dimensional data. However, it is common to map low-dimensional data into a high-dimensional feature space through a nonlinear feature mapping induced by the kernel (Schölkopf & Smola, 2002). We plan to study the relationship between these two formulations in the kernel-induced feature space.

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, and NGA HM1582-08-1-0016.

References

Agarwal, S., Branson, K., & Belongie, S. (2006). Higher order learning with graphs. International Conference on Machine Learning (pp. 17–24).

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.

d'Aspremont, A., Ghaoui, L., Jordan, M., & Lanckriet, G. (2004). A direct formulation for sparse PCA using semidefinite programming. Neural Information Processing Systems (pp. 41–48).

Donoho, D. (2006). For most large underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59, 907–934.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407.

Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. Neural Information Processing Systems (pp. 681–687).

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 302–332.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins Press.

Hale, E., Yin, W., & Zhang, Y. (2008). Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19, 1107–1130.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer.

Hotelling, H. (1936). Relations between two sets of variables. Biometrika, 28, 312–377.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.

Kazawa, H., Izumitani, T., Taira, H., & Maeda, E. (2005). Maximal margin labeling for multi-topic text categorization. Neural Information Processing Systems (pp. 649–656).

Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8, 43–71.

Rosipal, R., & Krämer, N. (2006). Overview and recent advances in partial least squares. Subspace, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science (pp. 34–51).

Saad, Y. (1992). Numerical methods for large eigenvalue problems. New York, NY: Halsted Press.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Sun, L., Ji, S., & Ye, J. (2008a). Hypergraph spectral learning for multi-label classification. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 668–676).

Sun, L., Ji, S., & Ye, J. (2008b). A least squares formulation for canonical correlation analysis. International Conference on Machine Learning (pp. 1024–1031).

Tao, D., Li, X., Wu, X., & Maybank, S. (2009). Geometric mean for subspace selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 260–274.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. International Conference on Machine Learning (pp. 412–420).

Ye, J. (2007). Least squares linear discriminant analysis. International Conference on Machine Learning (pp. 1087–1094).

Appendix

Proof of Theorem 1

Proof. It follows from Lemma 3 that the columns of M_1 ∈ R^{(n−1)×r} are orthonormal. Hence there exists M_2 ∈ R^{(n−1)×(n−1−r)} such that M = [M_1, M_2] ∈ R^{(n−1)×(n−1)} is orthogonal (Golub & Van Loan, 1996). We can derive the eigen-decomposition of C_X† C_S as follows:

    C_X† C_S = (XX^T)† (XSX^T)
             = (U_1 Σ_1^{-2} U_1^T) U_1 Σ_1 V_1^T (QU_R) Σ_R^2 (QU_R)^T V_1 Σ_1 U_1^T
             = U_1 Σ_1^{-1} V_1^T (QU_R) Σ_R^2 (QU_R)^T V_1 Σ_1 U_1^T
             = U_1 Σ_1^{-1} M_1 Σ_R^2 M_1^T Σ_1 U_1^T
             = U_1 Σ_1^{-1} [M_1, M_2] diag(Σ_R^2, 0_{n−1−r}) [M_1, M_2]^T Σ_1 U_1^T
             = U [I_{n−1}; 0] Σ_1^{-1} M diag(Σ_R^2, 0_{n−1−r}) M^T Σ_1 [I_{n−1}, 0] U^T
             = U diag(Σ_1^{-1} M diag(Σ_R^2, 0_{n−1−r}) M^T Σ_1, 0_{d−n+1}) U^T.    (16)

There are r nonzero eigenvalues, which are diag(Σ_R^2), and the corresponding eigenvectors are

    W_eig = U_1 Σ_1^{-1} M_1 = U_1 Σ_1^{-1} V_1^T QU_R.    (17)

Proof of Theorem 2

Proof. When T̃ is used as the class indicator matrix, it follows from Eq. (7) that the solution to the least squares problem is

    W_ls = (XX^T)† X T̃^T = (XX^T)† XQU_R = U_1 Σ_1^{-2} U_1^T U_1 Σ_1 V_1^T QU_R = U_1 Σ_1^{-1} V_1^T QU_R.

It follows from Eq. (17) that W_ls = W_eig.