A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning
Liang Sun [email protected] Shuiwang Ji [email protected] Jieping Ye [email protected] Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
Abstract

Many machine learning algorithms can be formulated as a generalized eigenvalue problem. One major limitation of such a formulation is that the generalized eigenvalue problem is computationally expensive to solve, especially for large-scale problems. In this paper, we show that under a mild condition, a class of generalized eigenvalue problems in machine learning can be formulated as a least squares problem. This class of problems includes classical techniques such as Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), and Linear Discriminant Analysis (LDA), as well as Hypergraph Spectral Learning (HSL). As a result, various regularization techniques can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, the least squares formulation leads to efficient and scalable implementations based on iterative conjugate gradient type algorithms. We report experimental results that confirm the established equivalence relationship. Results also demonstrate the efficiency and effectiveness of the equivalent least squares formulations on large-scale problems.

1. Introduction

A number of machine learning algorithms can be formulated as a generalized eigenvalue problem. Such techniques include Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), Linear Discriminant Analysis (LDA), and Hypergraph Spectral Learning (HSL) (Hotelling, 1936; Rosipal & Krämer, 2006; Sun et al., 2008a; Tao et al., 2009; Ye, 2007). Although well-established algorithms in numerical linear algebra have been developed to solve generalized eigenvalue problems, they are in general computationally expensive and hence may not scale to large-scale machine learning problems. In addition, it is challenging to directly incorporate the sparsity constraint into the mathematical formulation of these techniques. Sparsity often leads to easy interpretation and a good generalization ability. It has been used successfully in linear regression (Tibshirani, 1996) and Principal Component Analysis (PCA) (d'Aspremont et al., 2004).

Multivariate Linear Regression (MLR) that minimizes the sum-of-squares error function, called least squares, is a classical technique for regression problems. It can also be applied for classification problems by defining an appropriate class indicator matrix (Bishop, 2006). The solution to the least squares problem can be obtained by solving a linear system of equations, and a number of algorithms, including the conjugate gradient algorithm, can be applied to solve it efficiently (Golub & Van Loan, 1996). Furthermore, the least squares formulation can be readily extended using the regularization technique. For example, 1-norm and 2-norm regularization can be incorporated into the least squares formulation to improve sparsity and control model complexity (Bishop, 2006).

Motivated by the mathematical and numerical properties of the generalized eigenvalue problem and the least squares formulation, several researchers have attempted to connect these two approaches. In particular, it has been shown that there is a close relationship between LDA, CCA, and least squares (Hastie et al., 2001; Bishop, 2006; Ye, 2007; Sun et al., 2008b). However, the intrinsic relationship between least squares and the other techniques involving generalized eigenvalue problems mentioned above remains unclear.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).
In this paper, we study the relationship between the least squares formulation and a class of generalized eigenvalue problems in machine learning. In particular, we establish the equivalence relationship between these two formulations under a mild condition. As a result, various regularization techniques such as 1-norm and 2-norm regularization can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, this equivalence relationship leads to efficient and scalable implementations for these generalized eigenvalue problems based on iterative conjugate gradient type algorithms such as LSQR (Paige & Saunders, 1982). We have conducted experiments using several benchmark data sets. The experiments confirm the equivalence relationship between these two models under the given assumption. Our results show that even when the assumption does not hold, the performance of these two models is still very close. Results also demonstrate the efficiency and effectiveness of the equivalent least squares models and their extensions.

Notations: The number of training samples, the data dimensionality, and the number of classes (or labels) are denoted by n, d, and k, respectively. x_i ∈ R^d denotes the ith observation, and y_i ∈ R^k encodes the label information for x_i. X = [x_1, x_2, ..., x_n] ∈ R^{d×n} represents the data matrix, and Y = [y_1, y_2, ..., y_n] ∈ R^{k×n} is the matrix representation for the label information. {x_i}_{i=1}^n is assumed to be centered, i.e., Σ_{i=1}^n x_i = 0. S ∈ R^{n×n} is a symmetric and positive semi-definite matrix, and e is a vector of all ones.

Organization: We present background and related work in Section 2, establish the equivalence relationship between the generalized eigenvalue problem and the least squares problem in Section 3, discuss extensions based on the established equivalence result in Section 4, present the efficient implementation in Section 5, report empirical results in Section 6, and conclude this paper in Section 7.

2. Background and Related Work

In this section, we present the class of generalized eigenvalue problems studied in this paper. The least squares formulation is briefly reviewed.

2.1. A Class of Generalized Eigenvalue Problems

We consider a class of generalized eigenvalue problems in the following form:

    X S X^T w = λ X X^T w,    (1)

where X ∈ R^{d×n} represents the data matrix and S ∈ R^{n×n} is symmetric and positive semi-definite. The generalized eigenvalue problem in Eq. (1) is often reformulated as the following eigenvalue problem:

    (X X^T)† X S X^T w = λ w,    (2)

where (X X^T)† is the pseudoinverse of X X^T. In general, we are interested in eigenvectors corresponding to nonzero eigenvalues. It turns out that many machine learning techniques can be formulated in the form of Eqs. (1) and (2).

2.2. Examples of Generalized Eigenvalue Problems

We briefly review several algorithms that involve a generalized eigenvalue problem in the general form of Eq. (1): Canonical Correlation Analysis, Partial Least Squares, Linear Discriminant Analysis, and Hypergraph Spectral Learning. For supervised learning methods, the label information is encoded in the matrix Y = [y_1, y_2, ..., y_n] ∈ R^{k×n}, where y_i(j) = 1 if x_i belongs to class j and y_i(j) = 0 otherwise.

Canonical Correlation Analysis

In CCA (Hotelling, 1936), two different representations, X and Y, of the same set of objects are given, and a projection is computed for each representation such that they are maximally correlated in the dimensionality-reduced space. Denote the projection vector for X by w_x ∈ R^d, and assume that Y Y^T is nonsingular. It can be verified that w_x is the first principal eigenvector of the following generalized eigenvalue problem:

    X Y^T (Y Y^T)^{-1} Y X^T w_x = λ X X^T w_x.    (3)

Multiple projection vectors can be obtained simultaneously by computing the first principal eigenvectors of the generalized eigenvalue problem in Eq. (3). It can be observed that CCA is in the form of the generalized eigenvalue problem in Eq. (1) with S = Y^T (Y Y^T)^{-1} Y.

Partial Least Squares

In contrast to CCA, Orthonormalized PLS (OPLS), a variant of PLS (Rosipal & Krämer, 2006), computes orthogonal score vectors by maximizing the covariance between X and Y. It solves the following generalized eigenvalue problem:

    X Y^T Y X^T w = λ X X^T w.    (4)

It can be observed that Orthonormalized PLS involves a generalized eigenvalue problem in Eq. (1) with S = Y^T Y.
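For concreteness, the equivalence between the generalized formulation in Eq. (1) and the pseudoinverse formulation in Eq. (2) can be sanity-checked numerically for the CCA choice of S. The following sketch is not part of the original paper; the synthetic data and sizes are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, d, k = 50, 10, 3                       # illustrative sizes: n samples, d features, k labels
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)        # center the data so that X e = 0
Y = rng.standard_normal((k, n))           # a second (label) view, as in CCA

# CCA corresponds to Eq. (1) with S = Y^T (Y Y^T)^{-1} Y
S = Y.T @ np.linalg.pinv(Y @ Y.T) @ Y

A, B = X @ S @ X.T, X @ X.T               # Eq. (1): A w = lambda B w
lam_gen = eigh(A, B, eigvals_only=True)   # B is positive definite here since d < n

# Eq. (2): (X X^T)^dagger X S X^T w = lambda w
lam_eq = np.linalg.eigvals(np.linalg.pinv(B) @ A).real

# the k nonzero eigenvalues of the two formulations coincide
print(np.allclose(sorted(lam_gen)[-k:], sorted(lam_eq)[-k:]))
```

Since B = X X^T is nonsingular in this setting, the pseudoinverse reduces to the inverse and the two spectra agree exactly; the interesting case d > n is handled by the theory in Section 3.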
Hypergraph Spectral Learning

A hypergraph (Agarwal et al., 2006) is a generalization of the traditional graph in which the edges (a.k.a. hyperedges) are arbitrary non-empty subsets of the vertex set. HSL (Sun et al., 2008a) employs a hypergraph to capture the correlation information among different labels for improved classification performance in multi-label learning. It has been shown that given the normalized Laplacian L_H for the constructed hypergraph, HSL involves the following generalized eigenvalue problem:

    X S X^T w = λ (X X^T) w,  where S = I − L_H.    (5)

It has been shown that for many existing definitions of the Laplacian L_H, the resulting matrix S is symmetric and positive semi-definite, and it can be decomposed as S = H H^T, where H ∈ R^{n×k}.

Linear Discriminant Analysis

LDA is a supervised dimensionality reduction technique. The transformation in LDA is obtained by maximizing the ratio of the inter-class distance to the intra-class distance. It is known that CCA is equivalent to LDA for multi-class problems. Thus, the S matrix for LDA can be derived similarly.

2.3. Least Squares for Regression and Classification

Least squares is a classical technique for both regression and classification (Bishop, 2006). In regression, we are given a training set {(x_i, t_i)}_{i=1}^n, where x_i ∈ R^d is the observation and t_i ∈ R^k is the corresponding target. We assume that both the observations and the targets are centered; then the intercept can be eliminated. In this case, the weight matrix W ∈ R^{d×k} can be computed by minimizing the following sum-of-squares error function:

    min_W Σ_{i=1}^n ||W^T x_i − t_i||_2^2 = ||W^T X − T||_F^2,    (6)

where T = [t_1, ..., t_n] is the target matrix. It is well known that the optimal solution W_ls is given by

    W_ls = (X X^T)† X T^T.    (7)

Least squares can also be applied for classification problems. In the general multi-class case, we are given a data set consisting of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d, and y_i ∈ {1, 2, ..., k} denotes the class label of the i-th sample, with k ≥ 2. To apply the least squares formulation to the multi-class case, the 1-of-k binary coding scheme is usually employed to apply a vector-valued class code to each data point (Bishop, 2006). The solution to the least squares problem depends on the choice of the class indicator matrix (Hastie et al., 2001; Ye, 2007). In contrast to the generalized eigenvalue problem, the least squares problem can be solved efficiently using iterative conjugate gradient algorithms (Golub & Van Loan, 1996; Paige & Saunders, 1982).

3. Generalized Eigenvalue Problem versus Least Squares Problem

In this section, we investigate the relationship between the generalized eigenvalue problem in Eq. (1), or its equivalent formulation in Eq. (2), and the least squares formulation. In particular, we show that under a mild condition¹, the eigenvalue problem in Eq. (2) can be formulated as a least squares problem with a specific target matrix.

3.1. Matrix Orthonormality Property

For convenience of presentation, we define matrices C_X and C_S as follows:

    C_X = X X^T ∈ R^{d×d},  C_S = X S X^T ∈ R^{d×d}.

The eigenvalue problem in Eq. (2) can then be expressed as

    C_X† C_S w = λ w.    (8)

Recall that S is symmetric and positive semi-definite; thus it can be decomposed as

    S = H H^T,    (9)

where H ∈ R^{n×s} and s ≤ n. For most examples discussed in Section 2.2, the closed form of H can be obtained and s = k ≪ n. Since X is centered, i.e., Xe = 0, we have X = XC, where C = I − (1/n) e e^T is the centering matrix satisfying C^T = C and Ce = 0. It follows that

    C_S = X S X^T = (XC) S (XC)^T = X (C S C^T) X^T = X (C^T S C) X^T = X S̃ X^T,    (10)

where S̃ = C^T S C. Note that

    e^T S̃ e = e^T C^T S C e = 0.    (11)

Thus, we can assume that e^T S e = 0, that is, both the columns and the rows of S are centered. Before presenting the main results, we establish the following lemmas.

¹It states that {x_i}_{i=1}^n are linearly independent before centering, i.e., rank(X) = n − 1 after the data is centered (of zero mean).
Lemma 1. Assume that e^T S e = 0. Let H be defined in Eq. (9), and let HP = QR be the QR decomposition of H with column pivoting, where Q ∈ R^{n×r} has orthonormal columns, R ∈ R^{r×s} is upper triangular, r = rank(H) ≤ s, and P is a permutation matrix. Then we have Q^T e = 0.

Proof. Since P is a permutation matrix, we have P^T P = I. The result follows since 0 = e^T S e = e^T H H^T e = e^T Q R P^T P R^T Q^T e = e^T Q R R^T Q^T e and R R^T is positive definite.

Lemma 2. Let A ∈ R^{m×(m−1)} and B ∈ R^{m×p} (p ≤ m) be two matrices satisfying A^T e = 0, A^T A = I_{m−1}, B^T B = I_p, and B^T e = 0. Let F = A^T B. Then F^T F = I_p.

Proof. Define the orthogonal matrix A_x as A_x = [A, (1/√m) e] ∈ R^{m×m}. We have

    I_m = A_x A_x^T = A A^T + (1/m) e e^T  ⇔  A A^T = I_m − (1/m) e e^T.

Since B^T e = 0 and B^T B = I_p, we obtain

    F^T F = B^T A A^T B = B^T (I_m − (1/m) e e^T) B = B^T B − (1/m)(B^T e)(B^T e)^T = I_p.

Let R = U_R Σ_R V_R^T be the thin Singular Value Decomposition (SVD) of R ∈ R^{r×s}, where U_R ∈ R^{r×r} is orthogonal, V_R ∈ R^{s×r} has orthonormal columns, and Σ_R ∈ R^{r×r} is diagonal. It follows that (Q U_R)^T (Q U_R) = I_r, and the SVD of S can be derived as follows:

    S = H H^T = Q R R^T Q^T = Q U_R Σ_R² U_R^T Q^T = (Q U_R) Σ_R² (Q U_R)^T.    (12)

Assume that the columns of X are centered, i.e., Xe = 0, and rank(X) = n − 1. Let

    X = U Σ V^T = U_1 Σ_1 V_1^T

be the SVD of X, where U and V are orthogonal, Σ ∈ R^{d×n}, U_1 Σ_1 V_1^T is the compact SVD of X, U_1 ∈ R^{d×(n−1)}, V_1 ∈ R^{n×(n−1)}, and Σ_1 ∈ R^{(n−1)×(n−1)} is diagonal. Define M_1 as

    M_1 = V_1^T (Q U_R) ∈ R^{(n−1)×r}.    (13)

We can show that the columns of M_1 are orthonormal, as summarized in the following lemma:

Lemma 3. Let M_1 be defined as above. Then M_1^T M_1 = I_r.

Proof. Since Xe = 0, we have V_1^T e = 0. Also note that V_1^T V_1 = I_{n−1}. Recall that (Q U_R)^T (Q U_R) = I_r and (Q U_R)^T e = U_R^T Q^T e = 0 from Lemma 1. It follows from Lemma 2 that M_1^T M_1 = I_r.

3.2. The Equivalence Relationship

We first derive the solution to the eigenvalue problem in Eq. (2) in the following theorem:

Theorem 1. Let U_1, Σ_1, V_1, Q, Σ_R, and U_R be defined as above. Assume that the columns of X are centered, i.e., Xe = 0, and rank(X) = n − 1. Then the nonzero eigenvalues of the problem in Eq. (2) are diag(Σ_R²), and the corresponding eigenvectors are

    W_eig = U_1 Σ_1^{−1} V_1^T Q U_R.

We summarize the main result of this section in the following theorem:

Theorem 2. Assume that the class indicator matrix T̃ for least squares classification is defined as

    T̃ = U_R^T Q^T ∈ R^{r×n}.    (14)

Then the solution to the least squares formulation in Eq. (6) is given by

    W_ls = U_1 Σ_1^{−1} V_1^T Q U_R.    (15)

Thus, the eigenvalue problem and the least squares problem are equivalent. The proofs of the above two theorems are given in the Appendix.

Remark 1. The analysis in (Ye, 2007; Sun et al., 2008b) is based on a key assumption that the matrix H in S = H H^T, as defined in Eq. (9), has orthonormal columns, which is the case for LDA and CCA. However, this is in general not true, e.g., for the H matrix in OPLS and HSL. The equivalence result established in this paper significantly improves previous work by relaxing this assumption.

The matrix W_eig is applied for dimensionality reduction (projection) for all examples of Eq. (1). The weight matrix W_ls in least squares can also be used for dimensionality reduction. If T̃ = Q^T is used as the class indicator matrix, the weight matrix becomes W̃_ls = U_1 Σ_1^{−1} V_1^T Q. Thus, the difference between W_eig and W̃_ls is the orthogonal matrix U_R. Note that the Euclidean distance is invariant to any orthogonal transformation. If a classifier, such as K-Nearest-Neighbor (KNN) and linear Support Vector Machines
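Theorems 1 and 2 can be sanity-checked numerically when the mild condition rank(X) = n − 1 holds, e.g., for high-dimensional data with d > n. The sketch below (not from the paper's experiments; sizes and random data are illustrative) builds W_eig from the factorizations and W_ls from the least squares problem with the indicator matrix T̃ of Eq. (14):

```python
import numpy as np
from scipy.linalg import qr, svd

rng = np.random.default_rng(2)
n, d, k = 10, 30, 4                  # high-dimensional case: d > n, so rank(X) = n - 1
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)   # center: X e = 0

H = rng.standard_normal((n, k))
H -= H.mean(axis=0)                  # ensures e^T S e = 0 for S = H H^T

# QR decomposition of H with column pivoting: H P = Q R
Q, R, piv = qr(H, mode='economic', pivoting=True)
UR, sR, VRt = svd(R)                 # thin SVD of R: R = U_R Sigma_R V_R^T

# compact SVD of X, keeping the n-1 nonzero singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r1 = n - 1
U1, S1inv, V1 = U[:, :r1], np.diag(1.0 / s[:r1]), Vt[:r1].T

# Theorem 1: eigenvectors of (X X^T)^dagger X S X^T
W_eig = U1 @ S1inv @ V1.T @ Q @ UR

# Theorem 2: least squares with class indicator T~ = U_R^T Q^T (Eq. (14))
T_tilde = UR.T @ Q.T
W_ls = np.linalg.lstsq(X.T, T_tilde.T, rcond=None)[0]   # min-norm solution (X X^T)^dagger X T~^T

print(np.allclose(W_eig, W_ls))      # the two solutions coincide
```

Here `lstsq` returns the minimum-norm least squares solution, which matches the pseudoinverse expression in Eq. (7).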
(SVM) (Schölkopf & Smola, 2002) based on the Euclidean distance, is applied on the dimensionality-reduced data via W_eig and W̃_ls, they will achieve the same classification performance.

In some cases, the number of nonzero eigenvalues, i.e., r, is large (comparable to n). It is common to use the top eigenvectors corresponding to the largest eigenvalues.

The resulting 1-norm regularized least squares problem can be solved efficiently using state-of-the-art algorithms (Friedman et al., 2007; Hale et al., 2008). In addition, the entire solution path for all values of γ can be obtained by applying the Least Angle Regression algorithm (Efron et al., 2004).

5. Efficient Implementation via LSQR

Recall that we deal with the generalized eigenvalue problem in Eq. (1), although in our theoretical derivation an equivalent eigenvalue problem in Eq. (2) is used instead. Large-scale generalized eigenvalue problems are known to be much harder than regular eigenvalue problems. There are two options to transform the problem in Eq. (1) into a standard eigenvalue problem (Saad, 1992): (i) factor X X^T; and (ii) …

Algorithm 1: Efficient Implementation via LSQR
Input: X, H
1. Compute the QR decomposition of H with column pivoting: HP = QR.
2. Compute the thin SVD of R: R = U_R Σ_R V_R^T.
3. Compute the class indicator matrix T̃ = U_R^T Q^T.
4. Regress X onto T̃ using LSQR.
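Algorithm 1 maps directly onto standard numerical routines. A minimal sketch follows, assuming dense NumPy arrays and SciPy's `lsqr`; the function name and synthetic sizes are illustrative assumptions, not part of the paper:

```python
import numpy as np
from scipy.linalg import qr, svd
from scipy.sparse.linalg import lsqr

def least_squares_eig(X, H):
    """Sketch of Algorithm 1: solve the problem in Eq. (1) with S = H H^T
    through the equivalent least squares formulation, using LSQR."""
    Q, R, _ = qr(H, mode='economic', pivoting=True)   # step 1: H P = Q R
    UR = svd(R)[0]                                    # step 2: thin SVD of R
    T = UR.T @ Q.T                                    # step 3: class indicator T~ (Eq. (14))
    # step 4: regress X onto T~, one LSQR solve per row of T~
    return np.column_stack([lsqr(X.T, t)[0] for t in T])

# illustrative usage on synthetic centered data
rng = np.random.default_rng(3)
n, d, k = 12, 40, 3
X = rng.standard_normal((d, n)); X -= X.mean(axis=1, keepdims=True)
H = rng.standard_normal((n, k)); H -= H.mean(axis=0)
W = least_squares_eig(X, H)        # d x rank(H) weight matrix
```

Because LSQR only needs matrix-vector products with X^T and X, the sketch extends to sparse or implicitly represented data matrices, which is the point of the algorithm for large-scale problems.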
Figure 1. Comparison of different formulations in terms of the ROC score/accuracy for different techniques on the yeast data set and the USPS data set ((A) HSL-Star; (B) LDA). HSL is applied on the yeast data set, and LDA is applied on the USPS data set. For regularized algorithms, the optimal value of γ is estimated from {1e−6, 1e−4, 1e−2, 1, 10, 100, 1000} using cross validation.

Figure 2. Computation time (in seconds) of the generalized eigenvalue problem and the corresponding least squares formulation on the Yahoo\Arts&Humanities data set ((A) CCA; (B) HSL-Star) as the data dimensionality (top row) or the training sample size (bottom row) increases. The x-axis represents the data dimensionality (top) or the training sample size (bottom), and the y-axis represents the computation time.

7. Conclusions and Future Work

In this paper, we study the relationship between a class of generalized eigenvalue problems in machine learning and the least squares formulation. In particular, we show that a class of generalized eigenvalue problems in machine learning can be reformulated as a least squares problem under a mild condition, which generally holds for high-dimensional data. The class of problems includes CCA, PLS, HSL, and LDA. Based on the established equivalence relationship, various regularization techniques can be employed to improve the generalization ability and induce sparsity in the resulting model. In addition, the least squares formulation results in an efficient implementation based on iterative conjugate gradient type algorithms such as LSQR. Our experimental results confirm the established equivalence relationship. Results also show that the performance of the least squares formulation and the original generalized eigenvalue problem is very close even when the assumption is violated. Our experiments also demonstrate the effectiveness and scalability of the least squares extensions.

The equivalence relationship will in general not hold for low-dimensional data. However, it is common to map low-dimensional data into a high-dimensional feature space through a nonlinear feature mapping induced by a kernel (Schölkopf & Smola, 2002). We plan to study the relationship between these two formulations in the kernel-induced feature space.

Unlabeled data can be incorporated into the least squares formulation through the graph Laplacian, which captures the local geometry of the data (Belkin et al., 2006). We plan to investigate the effectiveness of this semi-supervised framework for the class of generalized eigenvalue problems studied in this paper.

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, and NGA HM1582-08-1-0016.

References

Agarwal, S., Branson, K., & Belongie, S. (2006). Higher order learning with graphs. International Conference on Machine Learning (pp. 17–24).

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.

d'Aspremont, A., Ghaoui, L., Jordan, M., & Lanckriet, G. (2004). A direct formulation for sparse PCA using semidefinite programming. Neural Information Processing Systems (pp. 41–48).

Donoho, D. (2006). For most large underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59, 907–934.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407.

Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. Neural Information Processing Systems (pp. 681–687).

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 302–332.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins Press.

Hale, E., Yin, W., & Zhang, Y. (2008). Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19, 1107–1130.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer.

Hotelling, H. (1936). Relations between two sets of variables. Biometrika, 28, 312–377.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.

Kazawa, H., Izumitani, T., Taira, H., & Maeda, E. (2005). Maximal margin labeling for multi-topic text categorization. Neural Information Processing Systems (pp. 649–656).

Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8, 43–71.

Rosipal, R., & Krämer, N. (2006). Overview and recent advances in partial least squares. Subspace, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science (pp. 34–51).

Saad, Y. (1992). Numerical methods for large eigenvalue problems. New York, NY: Halsted Press.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Sun, L., Ji, S., & Ye, J. (2008a). Hypergraph spectral learning for multi-label classification. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 668–676).

Sun, L., Ji, S., & Ye, J. (2008b). A least squares formulation for canonical correlation analysis. International Conference on Machine Learning (pp. 1024–1031).

Tao, D., Li, X., Wu, X., & Maybank, S. (2009). Geometric mean for subspace selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 260–274.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58, 267–288.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. International Conference on Machine Learning (pp. 412–420).

Ye, J. (2007). Least squares linear discriminant analysis. International Conference on Machine Learning (pp. 1087–1094).

Appendix

Proof of Theorem 1

Proof. It follows from Lemma 3 that the columns of M_1 ∈ R^{(n−1)×r} are orthonormal. Hence there exists M_2 ∈ R^{(n−1)×(n−1−r)} such that M = [M_1, M_2] ∈ R^{(n−1)×(n−1)} is orthogonal (Golub & Van Loan, 1996). We can derive the eigen-decomposition of C_X† C_S as follows:

    C_X† C_S = (X X^T)† (X S X^T)
             = (U_1 Σ_1^{−2} U_1^T)(U_1 Σ_1 V_1^T)(Q U_R) Σ_R² (Q U_R)^T (V_1 Σ_1 U_1^T)
             = U_1 Σ_1^{−1} V_1^T (Q U_R) Σ_R² (Q U_R)^T V_1 Σ_1 U_1^T
             = U_1 Σ_1^{−1} M_1 Σ_R² M_1^T Σ_1 U_1^T
             = U [ Σ_1^{−1} M diag(Σ_R², 0_{n−1−r}) M^T Σ_1   0 ;  0   0 ] U^T.    (16)

There are r nonzero eigenvalues, which are diag(Σ_R²), and the corresponding eigenvectors are

    W_eig = U_1 Σ_1^{−1} M_1 = U_1 Σ_1^{−1} V_1^T Q U_R.    (17)

Proof of Theorem 2

Proof. When T̃ is used as the class indicator matrix, it follows from Eq. (7) that the solution to the least squares problem is

    W_ls = (X X^T)† X T̃^T = (X X^T)† X Q U_R = U_1 Σ_1^{−2} U_1^T U_1 Σ_1 V_1^T Q U_R = U_1 Σ_1^{−1} V_1^T Q U_R.

It follows from Eq. (17) that W_ls = W_eig.