A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning
Liang Sun [email protected]    Shuiwang Ji [email protected]    Jieping Ye [email protected]
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Many machine learning algorithms can be formulated as a generalized eigenvalue problem. One major limitation of such a formulation is that the generalized eigenvalue problem is computationally expensive to solve, especially for large-scale problems. In this paper, we show that under a mild condition, a class of generalized eigenvalue problems in machine learning can be formulated as a least squares problem. This class of problems includes classical techniques such as Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), and Linear Discriminant Analysis (LDA), as well as Hypergraph Spectral Learning (HSL). As a result, various regularization techniques can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, the least squares formulation leads to efficient and scalable implementations based on iterative conjugate-gradient-type algorithms. We report experimental results that confirm the established equivalence relationship. Results also demonstrate the efficiency and effectiveness of the equivalent least squares formulations on large-scale problems.

1. Introduction

A number of machine learning algorithms can be formulated as a generalized eigenvalue problem. Such techniques include Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), Linear Discriminant Analysis (LDA), and Hypergraph Spectral Learning (HSL) (Hotelling, 1936; Rosipal & Krämer, 2006; Sun et al., 2008a; Tao et al., 2009; Ye, 2007). Although well-established algorithms in numerical linear algebra have been developed to solve generalized eigenvalue problems, they are in general computationally expensive and hence may not scale to large-scale machine learning problems. In addition, it is challenging to directly incorporate a sparsity constraint into the mathematical formulation of these techniques. Sparsity often leads to easy interpretation and good generalization ability; it has been used successfully in linear regression (Tibshirani, 1996) and Principal Component Analysis (PCA) (d'Aspremont et al., 2004).

Multivariate Linear Regression (MLR), which minimizes the sum-of-squares error function and is commonly called least squares, is a classical technique for regression problems. It can also be applied to classification problems by defining an appropriate class indicator matrix (Bishop, 2006). The solution to the least squares problem can be obtained by solving a linear system of equations, and a number of algorithms, including the conjugate gradient algorithm, can be applied to solve it efficiently (Golub & Van Loan, 1996). Furthermore, the least squares formulation can be readily extended using regularization techniques; for example, 1-norm and 2-norm regularization can be incorporated into the least squares formulation to improve sparsity and control model complexity (Bishop, 2006).
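As a minimal illustration of this route (not part of the original paper), the sketch below fits a multivariate linear regression classifier with a {0,1} class indicator matrix and solves for each output dimension with SciPy's LSQR, a conjugate-gradient-type solver; the function name, the indicator encoding, and the optional ridge term passed through `damp` are illustrative assumptions.

```python
# A minimal sketch, not the paper's implementation: least squares classification
# with a {0,1} class indicator matrix, solved by LSQR (a conjugate-gradient-type
# solver). The function name and the ridge option (`damp`) are illustrative.
import numpy as np
from scipy.sparse.linalg import lsqr

def fit_least_squares_classifier(X, labels, n_classes, damp=0.0):
    """X: d x n data matrix (columns are samples); labels: length-n integer labels.
    Returns W (d x k) minimizing ||W^T X - Y||_F^2 (+ damp^2 ||W||_F^2 if damp > 0)."""
    d, n = X.shape
    Y = np.zeros((n_classes, n))
    Y[labels, np.arange(n)] = 1.0           # class indicator matrix, one column per sample
    W = np.empty((d, n_classes))
    for j in range(n_classes):              # one LSQR run per output dimension
        W[:, j] = lsqr(X.T, Y[j], damp=damp)[0]
    return W

# Toy usage: 20-dimensional data, 100 samples, 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
X -= X.mean(axis=1, keepdims=True)          # center the data (zero mean across samples)
labels = rng.integers(0, 3, size=100)
W = fit_least_squares_classifier(X, labels, n_classes=3, damp=1e-2)
predictions = np.argmax(W.T @ X, axis=0)    # assign each sample to the largest score
```

Each LSQR call needs only matrix-vector products with X and its transpose, which is what makes conjugate-gradient-type solvers attractive for large, sparse data.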
Motivated by the mathematical and numerical properties of the generalized eigenvalue problem and the least squares formulation, several researchers have attempted to connect these two approaches. In particular, it has been shown that there is a close relationship between LDA, CCA, and least squares (Hastie et al., 2001; Bishop, 2006; Ye, 2007; Sun et al., 2008b). However, the intrinsic relationship between least squares and the other techniques involving generalized eigenvalue problems mentioned above remains unclear.

In this paper, we study the relationship between the least squares formulation and a class of generalized eigenvalue problems in machine learning. In particular, we establish the equivalence relationship between these two formulations under a mild condition. As a result, various regularization techniques such as 1-norm and 2-norm regularization can be readily incorporated into the formulation to improve model sparsity and generalization ability. In addition, this equivalence relationship leads to efficient and scalable implementations for these generalized eigenvalue problems based on iterative conjugate-gradient-type algorithms such as LSQR (Paige & Saunders, 1982). We have conducted experiments using several benchmark data sets. The experiments confirm the equivalence relationship between the two models under the given assumption. Our results show that even when the assumption does not hold, the performance of the two models is still very close. Results also demonstrate the efficiency and effectiveness of the equivalent least squares models and their extensions.

Notations: The number of training samples, the data dimensionality, and the number of classes (or labels) are denoted by n, d, and k, respectively. x_i ∈ R^d denotes the ith observation, and y_i ∈ R^k encodes the label information for x_i. X = [x_1, x_2, ..., x_n] ∈ R^{d×n} represents the data matrix, and Y = [y_1, y_2, ..., y_n] ∈ R^{k×n} is the matrix representation of the label information. {x_i}_{i=1}^n is assumed to be centered, i.e., Σ_{i=1}^n x_i = 0. S ∈ R^{n×n} is a symmetric and positive semi-definite matrix, and e is a vector of all ones.

Organization: We present background and related work in Section 2, establish the equivalence relationship between the generalized eigenvalue problem and the least squares problem in Section 3, discuss extensions based on the established equivalence result in Section 4, present the efficient implementation in Section 5, report empirical results in Section 6, and conclude this paper in Section 7.

2. Background and Related Work

In this section, we present the class of generalized eigenvalue problems studied in this paper. The least squares formulation is briefly reviewed.

2.1. A Class of Generalized Eigenvalue Problems

We consider a class of generalized eigenvalue problems of the following form:

    X S X^T w = λ X X^T w,    (1)

where X ∈ R^{d×n} represents the data matrix and S ∈ R^{n×n} is symmetric and positive semi-definite. The generalized eigenvalue problem in Eq. (1) is often reformulated as the following eigenvalue problem:

    (X X^T)^† X S X^T w = λ w,    (2)

where (X X^T)^† is the pseudoinverse of X X^T. In general, we are interested in the eigenvectors corresponding to nonzero eigenvalues. It turns out that many machine learning techniques can be formulated in the form of Eqs. (1) and (2).
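For concreteness, a naive dense solver for Eq. (2) might look like the sketch below. It is illustrative only (the function name and tolerance are assumptions), and its cubic cost in d is precisely what motivates the least squares reformulation studied in this paper.

```python
# A minimal sketch, not the paper's code: solve Eq. (2) directly by forming
# (X X^T)^+ X S X^T and keeping eigenvectors with nonzero eigenvalues.
import numpy as np

def generalized_eig_solver(X, S, tol=1e-10):
    """X: d x n centered data matrix; S: n x n symmetric positive semi-definite.
    Returns (eigenvalues, eigenvectors) of (X X^T)^+ X S X^T with eigenvalue > tol."""
    M = np.linalg.pinv(X @ X.T) @ (X @ S @ X.T)
    evals, evecs = np.linalg.eig(M)          # M is not symmetric in general
    evals, evecs = evals.real, evecs.real    # for PSD S the spectrum is real and >= 0;
                                             # this only drops numerical round-off
    order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > tol                       # discard (numerically) zero eigenvalues
    return evals[keep], evecs[:, keep]
```

Forming and decomposing the d x d matrix takes O(d^3) time and O(d^2) memory, which is impractical when d is large.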
vector for X by wx ∈ R , and assume that YY is non- singular. It can be verified that w is the first princi- Organization: We present background and related x pal eigenvector of the following generalized eigenvalue work in Section 2, establish the equivalence relation- problem: ship between the generalized eigenvalue problem and T T −1 T T the least squares problem in Section 3, discuss exten- XY (YY ) YX wx = λXX wx. (3) sions based on the established equivalence result in Multiple projection vectors can be obtained simulta- Section 4, present the efficient implementation in Sec- neously by computing the first principal eigenvec- tion 5, report empirical results in Section 6, and con- tors of the generalized eigenvalue problem in Eq. (3). clude this paper in Section 7. It can be observed that CCA is in the form of the generalized eigenvalue problem in Eq. (1) with S = 2. Background and Related Work Y T (YYT )−1Y . In this section, we present a class of generalized eigen- Partial Least Squares value problems studied in this paper. The least squares In contrast to CCA, Orthonormalized PLS (OPLS), a formulation is briefly reviewed. variant of PLS (Rosipal & Kr¨amer, 2006), computes orthogonal score vectors by maximizing the covariance 2.1. A Class of Generalized Eigenvalue between X and Y . It solves the following generalized Problems eigenvalue problem: We consider a class of generalized eigenvalue problems XY T YXT w = λXX T w. (4) in the following form: It can be observed that Orthonormalized PLS involves a generalized eigenvalue problem in Eq. (1) with S = XSXT w = λXX T w, (1) Y T Y . A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning Hypergraph Spectral Learning vector-valued class code to each data point (Bishop, 2006). The solution to the least squares problem A hypergraph (Agarwal et al., 2006) is a generalization depends on the choice of the class indicator matrix of the traditional graph in which the edges (a.k.a. hy- (Hastie et al., 2001; Ye, 2007). In contrast to the gen- peredges) are arbitrary non-empty subsets of the ver- eralized eigenvalue problem, the least squares problem tex set. HSL (Sun et al., 2008a) employs a hypergraph can be solved efficiently using iterative conjugate gra- to capture the correlation information among differ- dient algorithms (Golub & Van Loan, 1996; Paige & ent labels for improved classification performance in Saunders, 1982). multi-label learning. It has been shown that given the normalized Laplacian LH for the constructed hyper- graph, HSL invovles the following generalized eigen- 3. Generalized Eigenvalue Problem value problem: versus Least Squares Problem T T XSX w = λ(XX )w, where S = I −LH . (5) In this section, we investigate the relationship between the generalized eigenvalue problem in Eq. (1) or its It has been shown that for many existing definitions of equivalent formulation in Eq. (2) and the least squares L S the Laplacian H , the resulting matrix is symmetric formulation. In particular, we show that under a mild and positive semi-definite, and it can be decomposed condition1, the eigenvalue problem in Eq. (2) can be T n×k as S = HH ,whereH ∈ R . formulated as a least squares problem with a specific Linear Discriminant Analysis target matrix.