Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Spectral Feature Scaling Method for Supervised Dimensionality Reduction

Momo Matsuda^1, Keiichi Morikuni^1, Tetsuya Sakurai^{1,2}
^1 University of Tsukuba   ^2 JST/CREST
[email protected], [email protected], [email protected]

Abstract

Spectral dimensionality reduction methods enable linear separations of complex data with high-dimensional features in a reduced space. However, these methods do not always give the desired results due to irregularities or uncertainties of the data. Thus, we consider aggressively modifying the scales of the features to obtain the desired classification. Using prior knowledge on the labels of partial samples to specify the Fiedler vector, we formulate an eigenvalue problem of a linear matrix pencil whose eigenvector has the feature scaling factors. The resulting factors can modify the features of entire samples to form clusters in the reduced space, according to the known labels. In this study, we propose new dimensionality reduction methods supervised using the feature scaling associated with the spectral clustering. Numerical experiments show that the proposed methods outperform well-established supervised methods for toy problems with more samples than features, and are more robust regarding clustering than existing methods. Also, the proposed methods outperform existing methods regarding classification for real-world problems with more features than samples of gene expression profiles of cancer diseases. Furthermore, the feature scaling tends to improve the clustering and classification accuracies of existing unsupervised methods, as the proportion of training data increases.

1 Introduction

Consider clustering a set of data samples with high-dimensional features into mutually exclusive subsets, called clusters, and classifying these clusters when given prior knowledge on the class labels of partial samples. These kinds of problems arise in pathological diagnoses using gene expression data [Tarca et al., 2006], the analysis of chemical sensor data [Jurs et al., 2000], the community detection in social networks [Tichy et al., 1979], the analyses of neural spike sorting [Bar-Hillel et al., 2006], and so on [Sogh, 2006]. Spectral clustering is an unsupervised technique that projects the data samples to the eigenspace of the Laplacian matrix, in which the data samples are linearly separated if they are well clustered. This technique is more effective for complicated clustering than existing methods. For example, the k-means algorithm induces partitions using hyperplanes, but may fail to give satisfactory results due to irregularities and uncertainties of the features.

Here, we consider aggressively modifying the scales of the features to improve the separation in the reduced space, and obtain the desired classification. We derive the factors for scaling the features (scaling factors) using a learning machine based on spectral dimensionality reduction; namely, the inputs are unscaled and labeled samples, and the outputs are the scaling factors. To this end, by exploiting the prior knowledge on the labels of partial samples, we specify the Fiedler vector and reformulate the Laplacian eigenproblem as an eigenproblem of a linear matrix pencil whose eigenvector has the scaling factors. The obtained factors can modify the features of the entire samples to form clusters in the reduced-dimensionality space according to the known labels. Thus, we use the prior knowledge to implicitly specify the phenomenon of interest, and supervise dimensionality reduction methods through the scaling factors. These approaches yield the desired clusters, incorporated with unsupervised spectral clustering, for test data scaled by the factors obtained from training data. Numerical experiments on artificial data and real-world data from gene expression profiles show that the feature scaling improves the accuracy of the spectral clustering. In addition, the spectral dimensionality reduction methods supervised using the feature scaling outperform existing methods in some cases, and are more robust than existing methods.

We review related work on supervised spectral dimensionality reduction methods. Learning a similarity matrix from training data is effective in spectral clustering [Bach and Jordan, 2003; Kamvar et al., 2003]. The proposed classification method can be considered a supervised method, incorporated with the kernel version of the locality preserving projections (LPP) [He and Niyogi, 2003]. LPP is comparable with the linear discriminant analysis (LDA) [Belhumeur et al., 1997], the local Fisher discriminant analysis (LFDA) [Sugiyama, 2007], and the locality adaptive discriminant analysis (LADA) [Li et al., 2017]. Kernel versions of LDA, LPP, and LFDA aim at nonlinear dimensionality reduction. These methods have semi-supervised variants [Song et al., 2008; Kulis et al., 2005; Cui and Fan, 2012].


1.1 Spectral Clustering

The underlying idea of the proposed methods follows from spectral clustering, which is induced by graph partitioning, where the graph is weighted and undirected. The weight w_{i,j} of the edge between nodes i and j represents the similarity between samples i and j for i, j = 1, 2, ..., n, where n is the number of samples. By dividing the graph into mutually exclusive subgraphs, the corresponding samples form clusters. This discrete problem can be continuously relaxed to a matrix eigenvalue problem. Let

    W = {w_{i,j}} ∈ R^{n×n},   D = diag(d_1, d_2, ..., d_n) ∈ R^{n×n} with d_i = Σ_{j=1}^n w_{i,j},

let e ∈ R^n be a vector with all elements 1, and let t be an indicator vector with the ith entry t_i = 1 if sample i is in a cluster, and t_i = −1 if sample i is in another cluster. Then, the constrained minimization of the Ncut function [Shi and Malik, 2000]

    min_v  v^T (D − W) v / (v^T D v),   subject to  e^T D v = 0,                      (1)

over v_i ∈ {1, −b} with b = Σ_{t_i>0} d_i / Σ_{t_i<0} d_i, is relaxed to finding the Fiedler vector v ∈ R^n \ {0} [Fiedler, 1973] associated with the smallest eigenvalue of the constrained generalized eigenvalue problem

    L v = λ D v,   subject to  e^T D v = 0,                                           (2)

where L = D − W is a Laplacian matrix and λ ∈ R. The latter problem (2) is tractable, while the former (1) is NP-hard. Moreover, a few eigenvectors of (2) form clusters that are separable using hyperplanes if the samples form well-separated clusters. A sophisticated algorithm for (2) runs in nearly linear time [Spielman and Teng, 2014].

2 Proposed Method

Irregular scales and uncertainty of features prevent well-established dimensionality reduction methods from clustering data samples into the desired disjoint subsets. In spectral clustering, the Ncut function is popular by virtue of its nonlinear separability, but it is not versatile. To cope with these issues, we propose a remedy that aggressively modifies the scales of the features in the eigenspace where a linear separation works, based on prior knowledge of the n partial samples X = [x_1, x_2, ..., x_n]^T, x_i ∈ R^m with m features. If the classes of the partial samples are known, we can estimate the entries of the Fiedler vector v ∈ R^n \ {0} of the Laplacian eigenvalue problem

    L_s v = λ_s D_s v,   e^T D_s v = 0,   λ_s ∈ R,                                    (3)

where L_s = D_s − W_s, and W_s = {w_{i,j}^{(s)}} ∈ R^{n×n} depends on the feature scaling factors s ∈ R^m. We obtain the scaling factors s ∈ R^m in (3) by solving an eigenproblem of a linear matrix pencil. Then, applying the obtained scaling factors to the features of the entire data Y = [X^T, R^T]^T ∈ R^{N×m} with N samples, we can extend the prior knowledge on the partial samples to the overall samples, where R ∈ R^{(N−n)×m} denotes the remaining data. In statistical terms, the scaling changes the mean ȳ_j = (1/n) Σ_{i=1}^n y_{i,j} and the variance σ_j^2 = (1/n) Σ_{i=1}^n (y_{i,j} − ȳ_j)^2 of the jth feature to s_j^{1/2} ȳ_j and |s_j| σ_j^2, respectively, for j = 1, 2, ..., m. Note that we allow the scaling factors to have negative values; see [Schleif and Tino, 2015]. Finding appropriate scaling factors can also be considered metric learning [Yang, 2006]. Recall that the center of a cluster associated with a Bregman divergence is equal to the centroid of the samples of the cluster [Banerjee et al., 2005].

Now, we reformulate (3) as another eigenproblem to extract the scaling factors s_i ∈ R, i = 1, 2, ..., m, as an eigenvector. Denote the scaling matrix by S = diag(s_1, s_2, ..., s_m) ∈ R^{m×m} and denote the (i, j) entry of the similarity matrix W_s for the scaled data X S^{1/2} by

    w_{i,j}^{(s)} = 1 − (x_i − x_j)^T S (x_i − x_j) / (2σ^2) = 1 − s^T x_{i,j} / (2σ^2)
                  ≃ exp( −(x_i − x_j)^T S (x_i − x_j) / (2σ^2) )   for i ≠ j,
    w_{i,j}^{(s)} = 0   for i = j,                                                    (4)

where s = [s_1, s_2, ..., s_m]^T ∈ R^m and the kth entry of x_{i,j} ∈ R^m is (x_{i,k} − x_{j,k})^2. Here, we used the first-order approximation of the exponential function exp(−x) ≈ 1 − x for 0 < x < 1. Then, the ith row of W_s is

    w_i^{(s)T} = [1, ..., 1, 0, 1, ..., 1] − (s^T / (2σ^2)) [x_{i,1}, x_{i,2}, ..., x_{i,n}] = ẽ_i^T − s^T X_i,

where ẽ_i is the n-dimensional vector with the ith entry equal to zero and the remaining entries one, and X_i = (1/(2σ^2)) [x_{i,1}, x_{i,2}, ..., x_{i,n}] ∈ R^{m×n}. Hence, we have

    W_s v = [ẽ_1, ẽ_2, ..., ẽ_n]^T v − [X_1 v, X_2 v, ..., X_n v]^T s.

Let x̂_i = (1/(2σ^2)) (x_{i,1} + x_{i,2} + ··· + x_{i,n}). Then, the ith diagonal entry d_i^{(s)} of D_s is

    d_i^{(s)} = Σ_{j=1}^n w_{i,j}^{(s)} = (n − 1) − s^T x̂_i.

Hence, denoting the Fiedler vector by v = [v_1, v_2, ..., v_n]^T, we have

    D_s v = (n − 1) v − [v_1 x̂_1, v_2 x̂_2, ..., v_n x̂_n]^T s.

Thus, (3) is written as

    L_s v = λ_s D_s v  ⇔  W_s v = (1 − λ_s) D_s v  ⇔  [A  α] [s; −1] = μ [B  β] [s; −1],

where μ = 1 − λ_s,

    A = [X_1 v, X_2 v, ..., X_n v]^T ∈ R^{n×m},   B = [v_1 x̂_1, v_2 x̂_2, ..., v_n x̂_n]^T ∈ R^{n×m},    (5)

    α = [ẽ_1, ẽ_2, ..., ẽ_n]^T v ∈ R^n,   β = (n − 1) v ∈ R^n.                        (6)
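For concreteness, the following NumPy sketch (an illustration added here, not the authors' released code) assembles the blocks A, B, α, and β of (5)-(6) from a training matrix X ∈ R^{n×m} and an estimated Fiedler vector v; the Gaussian width σ of the similarity (4) is treated as a free parameter.

import numpy as np

def pencil_blocks(X, v, sigma=1.0):
    # Return A, B (n x m) and alpha, beta (length n) of (5)-(6).
    n, m = X.shape
    c = 1.0 / (2.0 * sigma ** 2)
    A = np.empty((n, m))
    B = np.empty((n, m))
    for i in range(n):
        # Row j of Xi is x_{i,j} / (2 sigma^2), where the k-th entry of
        # x_{i,j} is (x_{i,k} - x_{j,k})^2; Xi equals X_i^T in the text.
        Xi = c * (X[i] - X) ** 2
        A[i] = Xi.T @ v                 # i-th row of A is (X_i v)^T
        B[i] = v[i] * Xi.sum(axis=0)    # i-th row of B is v_i * xhat_i^T
    alpha = v.sum() - v    # alpha_i = e~_i^T v, e~_i = ones with a zero at entry i
    beta = (n - 1.0) * v   # beta = (n - 1) v
    return A, B, alpha, beta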


Furthermore, the feature scaling factors s have the constraint

    e^T D_s v = Σ_{i=1}^n ( (n − 1) v_i − v_i s^T x̂_i ) = 0
             ⇔  ( Σ_{i=1}^n v_i x̂_i )^T s − (n − 1) Σ_{i=1}^n v_i = 0.

Therefore, we can obtain the scaling factors s by solving the generalized eigenproblem of a linear matrix pencil

    [ A    α ] [  s ]       [ B    β ] [  s ]
    [ γ^T  ρ ] [ −1 ]  = μ  [ 0^T  0 ] [ −1 ],    s ∈ R^m,  μ ∈ R,                    (7)

where

    γ = Σ_{i=1}^n v_i x̂_i ∈ R^m,   ρ = (n − 1) Σ_{i=1}^n v_i ∈ R.                     (8)

It follows from the equality (A − B)^T e = 0 that there exists a real number μ that satisfies (7). To solve (7), we can use the contour integral method [Morikuni, 2016] for n < m; otherwise, we can also use the minimal perturbation approach [Ito and Murota, 2016]. When s is obtained as a complex vector, its real and imaginary parts are also eigenvectors of (7); we took the real part in the experiments given in Section 3. For μ = 1 − λ_s, we adopt the eigenvector of (7) associated with the eigenvalue μ closest to one as the scaling factors. Then, we apply the square roots of the feature scaling factors, S^{1/2}, to the entire data Y, such that Z = Y S^{1/2}. Finally, we perform the spectral clustering or classification on the scaled data Z ∈ R^{N×m} by solving the Laplacian eigenvalue problem

    L′ u = λ′ D′ u,   e^T D′ u = 0,   u ∈ R^N \ {0},  λ′ ∈ R,

for a few eigenvectors corresponding to the smallest eigenvalues, where L′ = D′ − W′, W′ = {w′_{i,j}} ∈ R^{N×N},

    w′_{i,j} = exp( −‖z_i − z_j‖_2^2 / (2σ^2) ),   i, j = 1, 2, ..., N,               (9)

z_i ∈ R^m is the ith row of Z, d′_i = Σ_{j=1}^N w′_{i,j}, and D′ = diag(d′_1, d′_2, ..., d′_N). We summarize the procedures of the proposed methods in Algorithm 1.

Algorithm 1 Spectral clustering/classification methods supervised using the feature scaling
Input: Training data X ∈ R^{n×m}, entire data Y ∈ R^{N×m}, estimated Fiedler vector v, dimension of the reduced space ℓ.
1: Compute A, B, α, β, γ, ρ in (5), (6), (8).
2: Solve (7) for an eigenvector s = [s_1, s_2, ..., s_m]^T.
3: Set the scaling matrix S^{1/2} = diag(s_1, s_2, ..., s_m)^{1/2}.
4: Compute the scaled data matrix Z = Y S^{1/2}.
5: Compute the similarity matrix W′ ∈ R^{N×N} in (9) and D′, using the k-nearest neighbors, and set L′ = D′ − W′.
6: Solve L′u = λ′D′u for the eigenvectors u_1, u_2, ..., u_ℓ corresponding to the ℓ smallest nonzero eigenvalues.
7: Cluster/classify the reduced data samples [u_1, u_2, ..., u_ℓ] ∈ R^{N×ℓ} row-wise by using the k-means algorithm/the one-nearest neighborhood method, respectively.

3 Numerical Experiments

Numerical experiments on test problems were performed to compare the proposed methods with existing methods in terms of accuracy. For clustering problems, we compared the proposed method (the spectral clustering method supervised using the feature scaling, SC-S) with the unsupervised spectral clustering method (SC) [Shi and Malik, 2000], LPP, its kernel version (KLPP) [He and Niyogi, 2003], and two supervised methods, LFDA and its kernel version (KLFDA) [Sugiyama, 2007]. For classification problems, we compared the proposed method (the spectral classification method supervised using the feature scaling, SC-S) with LPP, KLPP, LFDA, and KLFDA. All programs were coded and run in Matlab 2016b.

We employed the LPP code in the Matlab Toolbox for Dimensionality Reduction^1 and the LFDA and KLFDA codes from the Sugiyama-Sato-Honda Lab site^2. In the compared methods except for LFDA, we chose the optimal value of the parameter σ among 1, 10^{±1}, and 10^{±2} to achieve the best accuracy. We formed the similarity matrix in (9) by using the seven nearest neighbors and taking the symmetric part. We set the dimension of the reduced space to one for clustering problems, and to one, two, and three for classification problems. It is known that using one eigenvector is sufficient for binary clustering if the clusters are well separated [Chung, 1996]. The reduced data samples were clustered using the k-means algorithm, and classified using the one-nearest neighborhood method.

The performance measures we used were the normalized mutual information (NMI) [Strehl and Ghosh, 2002] and the Rand index (RI) [Rand, 1971] for evaluating the clustering methods, and RI for evaluating the classification methods. Let n_i be the number of samples in cluster/class i, n̂_i be the number of samples clustered/classified by a method to cluster/class i, and n_{i,j} be the number of samples in cluster/class i clustered/classified by a method to cluster/class j. Then, the NMI measure for binary clustering is defined by

    NMI = Σ_{i=1}^2 Σ_{j=1}^2 n_{i,j} log( n·n_{i,j} / (n_i n̂_j) )
          / sqrt( ( Σ_{i=1}^2 n_i log(n_i/n) ) ( Σ_{j=1}^2 n̂_j log(n̂_j/n) ) ).

Next, let TP (true positive) denote the number of samples which are correctly identified as class 1, FP (false positive) the number of samples which are incorrectly identified as class 1, TN (true negative) the number of samples which are correctly identified as class 2, and FN (false negative) the number of samples which are incorrectly identified as class 2. Then, the RI measure is defined by

    RI = (TP + TN) / (TP + TN + FP + FN).

^1 https://lvdmaaten.github.io/drtoolbox/
^2 http://www.ms.k.u-tokyo.ac.jp/software.html
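To make the pipeline concrete, the sketch below (an illustration added here, not the authors' Matlab implementation) covers steps 3-7 of Algorithm 1 under simplifying assumptions: the scaling factors s are assumed to have already been obtained from (7) in step 2 (which in general requires a solver for rectangular pencils, such as the contour integral or minimal perturbation methods cited above), negative factors are handled through |s|^{1/2}, and the similarity (9) is restricted to the k = 7 nearest neighbors and symmetrized, as in the experiments; sigma and k are free parameters.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def reduce_with_scaling(Y, s, ell=1, sigma=1.0, k=7):
    # Steps 3-4: scale the entire data, Z = Y S^{1/2} with S = diag(s).
    # The paper allows negative (and complex) factors; |s|^{1/2} is a simplification.
    Z = Y * np.sqrt(np.abs(np.real(s)))
    N = Z.shape[0]
    # Step 5: Gaussian similarity (9) restricted to the k nearest neighbors,
    # symmetrized; dense O(N^2) arrays are fine for data of moderate size.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]      # indices of the k nearest neighbors
    keep = np.zeros_like(W, dtype=bool)
    keep[np.repeat(np.arange(N), k), nn.ravel()] = True
    W = np.where(keep | keep.T, W, 0.0)          # symmetrized k-NN graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 6: eigenvectors of L'u = lambda' D'u for the ell smallest nonzero
    # eigenvalues (the first eigenvalue is the trivial, near-zero one).
    _, U = eigh(L, D)
    return U[:, 1:ell + 1]

# Step 7: cluster (k-means) or classify (one-nearest neighbor on the labeled part).
def spectral_cluster(U, n_clusters=2):
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

def spectral_classify(U, train_idx, train_labels):
    knn = KNeighborsClassifier(n_neighbors=1).fit(U[train_idx], train_labels)
    return knn.predict(U)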


As the obtained clustering and classification become more accurate, the values of NMI and RI become larger.

3.1 Artificial Data

In this subsection, we use toy problems of data sets consisting of 800 samples with 10 features. Figure 1 shows three of the 10 features, where the symbols ◦ and + denote the data samples in different classes; the remaining seven features were given uniformly distributed random numbers in the interval [0, 1]. The original data were normalized to have a mean of zero and variance one. We set the entries of the Fiedler vector v to v_i = 1 if sample i is in a cluster, and v_i = −0.2 if sample i is in another cluster (see (1)). We repeated clustering the same reduced data samples 20 times using the k-means algorithm and the one-nearest neighborhood method, starting with different random initial guesses. Moreover, we tested on different proportions of training data from 5% to 50%, and repeated each test 10 times with different random choices of training data. Thus, we report the average performance by taking the arithmetic mean of RI and NMI.

Figure 1: Visualization in three dimensions of the artificial data

Figures 2 and 3 show the means and standard deviations of RI and NMI, respectively, for the clustering problems. SC-S performed the best in terms of accuracy among the compared methods. The values of RI for SC-S, LFDA, and KLFDA tended to increase with the proportion of training data used. On the other hand, the accuracies of clustering using SC, LPP, and KLPP were similar for different proportions of training data.

Figure 2: Proportion of training data vs. RI for clustering artificial data
Figure 3: Proportion of training data vs. NMI for clustering artificial data

Figure 4 shows the means and standard deviations of RI for the classification problems, where the reduced dimensions are ℓ = 1, 2, and 3. SC-S performed the best in terms of accuracy for proportions of training data of 5–50%, with few exceptions, among the compared methods. As the proportion of training data increased, the accuracy of the compared methods tended to increase.

Figure 4: Proportion of training data vs. RI for classification of artificial data for reduced dimensions ℓ = 1, 2, and 3 ((a) one dimension, (b) two dimensions, (c) three dimensions)

Figure 5 shows the values of the scaling factors for each feature when using 50% of the entire data for training. Since the fourth to tenth features, given by random numbers, were not involved in the classification, the values of the scaling factors of these seven features were small. Thus, the values of the scaling factors indicate which features are effective for the desired classification.

Figure 5: Values of the scaling factors

Figure 6 shows the data samples reduced to a one-dimensional space for each method when using 50% of all data for training. SC-S reduced all data to one dimension and made it linearly separable into the true clusters. KLFDA failed to reduce the data samples to be linearly separable into the true clusters.

Figure 6: Values of the reduced one-dimensional data samples ((a) SC-S, (b) SC, (c) KLFDA)

Table 1 gives the means and standard deviations of RI through the leave-one-out cross-validation for the classification problems for reduced dimensions ℓ = 1, 2, and 3. Table 1 shows that SC-S was immune from overfitting in the learning process.

Table 1: RI (mean% ± std) for classification of artificial data sets
          ℓ = 1          ℓ = 2          ℓ = 3
SC-S     *100.0±0.0     *100.0±0.0     *100.0±0.0
LPP        77.0±40.1      89.6±30.5      99.9±3.5
KLPP       52.4±49.9      52.4±50.0      52.4±49.9
LFDA       78.4±41.2      83.9±36.8      81.4±38.9
KLFDA      86.8±33.9      86.9±33.8      87.0±33.6
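The paper does not spell out the generator behind Figure 1, so the short sketch below builds a hypothetical stand-in of the same shape (800 samples, three informative features plus seven uniform noise features), normalizes it, and sets the entries of the estimated Fiedler vector to the values used above (1 and −0.2 for the two classes of the training samples). Every distributional choice for the informative features here is an assumption for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n_total, n_noise = 800, 7

# Hypothetical stand-in for the two classes of Figure 1: two noisy rings in 3-D.
labels = rng.integers(0, 2, size=n_total)
theta = rng.uniform(0.0, 2.0 * np.pi, size=n_total)
radius = np.where(labels == 0, 1.0, 2.0) + 0.1 * rng.standard_normal(n_total)
informative = np.column_stack([radius * np.cos(theta),
                               radius * np.sin(theta),
                               0.1 * rng.standard_normal(n_total)])
noise = rng.uniform(0.0, 1.0, size=(n_total, n_noise))      # features 4-10
Y = np.column_stack([informative, noise])
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)                    # zero mean, unit variance

# A 50% training subset and the estimated Fiedler vector used as supervision.
train_idx = rng.choice(n_total, size=n_total // 2, replace=False)
X, y_train = Y[train_idx], labels[train_idx]
v = np.where(y_train == 0, 1.0, -0.2)                       # entries of v (see (1))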

3.2 GEO Data Sets

In this subsection, we use four data sets in the Gene Expression Omnibus^3 from real-world problems with more features than samples. Table 2 gives the specifications of the data sets, where IPF denotes Idiopathic Pulmonary Fibrosis. Each data set has two binary classes: diseased patients and healthy people. We chose 50% of the entire data for training and repeated each test 10 times with different choices of the training data. The entries of the Fiedler vector v were set to v_i = 1 if sample i is in a cluster, and v_i = −1 if sample i is in another cluster (see (1)).

^3 https://www.ncbi.nlm.nih.gov/

Tables 3 and 4 give the means and standard deviations of RI and NMI, respectively, for each data set for the clustering problems. The symbol * denotes the largest value of RI and NMI for each data set. SC-S performed the best in terms of accuracy for two data sets, and second best for two data sets, among the compared methods. LFDA performed the best in terms of accuracy for one data set, second best for one data set, but worst for one data set among the compared methods. These results showed that SC-S was more robust than the other methods.
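The following sketch (an added illustration of how the protocol in this subsection reads, not the authors' scripts) evaluates a method over repeated random choices of the training half and reports the mean and standard deviation of RI; for clustering, the arbitrary cluster labels are matched to the true classes before RI is computed.

import numpy as np

def rand_index_binary(pred, true):
    # RI = (TP + TN) / (TP + TN + FP + FN); for clustering, the two possible
    # matchings of cluster labels to classes are tried and the better one kept.
    acc = np.mean(pred == true)
    return max(acc, 1.0 - acc)

def evaluate(run_method, Y, labels, n_rep=10, train_frac=0.5, seed=0):
    # run_method(Y, train_idx, v) -> predicted binary labels for all N samples;
    # it stands for any of the compared pipelines (e.g., Algorithm 1).
    rng = np.random.default_rng(seed)
    N = len(labels)
    scores = []
    for _ in range(n_rep):
        train_idx = rng.choice(N, size=int(train_frac * N), replace=False)
        v = np.where(labels[train_idx] == 0, 1.0, -1.0)   # Fiedler prior for GEO data
        scores.append(rand_index_binary(run_method(Y, train_idx, v), labels))
    return np.mean(scores), np.std(scores)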

Table 2: Specifications of data sets
Data set     # of training samples   # of entire samples   # of features
Ovarian      11                      22                    54675
Pancreatic   12                      24                    54675
IPF          12                      23                    54613
Colorectal   11                      22                    54675

Table 3: RI (mean% ± std) for clustering GEO data sets
          Ovarian       Pancreatic    IPF           Colorectal
SC         52.6±0.05     54.2±0.00     69.6±0.00     50.0±0.00
SC-S      *60.2±1.11     56.0±1.22     76.1±0.00    *74.1±1.10
LPP        57.0±1.47    *59.5±1.20     76.7±2.12     57.8±1.33
KLPP       54.6±0.00     54.2±0.00     71.3±0.00     53.2±0.00
LFDA       57.9±1.77     55.1±1.57     70.0±4.34     54.3±2.01
KLFDA      56.4±0.05     51.0±0.08    *79.9±0.12     63.4±0.25

Table 4: NMI (mean ± std) for clustering GEO data sets
          Ovarian        Pancreatic     IPF            Colorectal
SC         0.004±0.000    0.043±0.000    0.024±0.000    0.041±0.000
SC-S      *0.082±0.019    0.042±0.010    0.093±0.000   *0.323±0.016
LPP        0.044±0.024   *0.053±0.022    0.040±0.014    0.048±0.024
KLPP       0.006±0.000    0.039±0.000    0.040±0.000    0.052±0.000
LFDA       0.057±0.021    0.040±0.021    0.077±0.041    0.030±0.013
KLFDA      0.067±0.001    0.005±0.000   *0.148±0.005    0.109±0.003

Table 5 gives the means and standard deviations of RI for the classification problems for the reduced dimensions ℓ = 1, 2, and 3. SC-S was best on average in terms of accuracy for three data sets, except for Ovarian with ℓ = 1. LFDA was best in terms of accuracy for Ovarian with ℓ = 1, and for IPF with ℓ = 3. KLFDA was best in terms of accuracy for IPF. SC-S became more accurate as the number of eigenvectors ℓ used increased for Ovarian.

Table 5: RI (mean% ± std) for classification of GEO data sets
(a) Ovarian
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S       75.0±6.82    *90.5±6.88    *90.0±6.03
LPP        76.4±4.90     75.5±7.93     75.9±8.87
KLPP       55.5±4.45     56.4±6.17     57.7±5.40
LFDA      *77.3±7.87     70.9±4.64     69.6±5.40
KLFDA      55.9±6.12     55.0±3.78     56.4±4.17
(b) Pancreatic
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *77.5±8.58    *77.9±7.69    *73.3±7.50
LPP        74.6±8.00     73.8±5.61     72.1±7.23
KLPP       55.4±4.19     56.7±3.33     54.6±3.46
LFDA       50.0±0.00     50.0±0.00     50.0±0.00
KLFDA      54.2±2.64     54.2±2.64     54.2±2.64
(c) IPF
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *90.4±7.48    *96.5±3.79    *97.4±4.43
LPP        74.4±9.01     76.5±8.06     76.5±8.30
KLPP       68.7±3.25     65.2±4.76     67.8±9.76
LFDA       75.2±8.26     84.8±6.52     87.0±0.00
KLFDA      87.0±0.00     87.0±0.00     87.0±0.00
(d) Colorectal
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *82.3±8.73    *85.5±6.98    *82.7±6.98
LPP        72.3±9.63     75.0±6.18     80.0±4.64
KLPP       57.3±7.39     60.9±6.80     64.6±7.27
LFDA       70.5±3.05     77.3±4.07     79.1±5.82
KLFDA      77.3±0.00     77.3±0.00     77.3±0.00

It is known that cross-validation is unreliable in small-sample classification [Isaksson et al., 2008]. Indeed, we observed significantly large values of the standard deviation for the compared methods for these data sets.

4 Conclusions

We considered the dimensionality reduction of high-dimensional data. To deal with irregularity or uncertainty in features, we proposed supervised dimensionality reduction methods that exploit knowledge on the labels of partial samples. With the supervision, we modify the features' variances and means. Moreover, the feature scaling can reduce those features that prevent us from obtaining the desired clusters. To obtain the factors used to scale the features, we formulated an eigenproblem of a linear matrix pencil whose eigenvector has the feature scaling factors, and described the procedures for the proposed methods. Numerical experiments showed that the feature scaling is effective when combined with spectral clustering and classification methods. For toy problems with more samples than features, the feature scaling improved the accuracy of the unsupervised spectral clustering, and the proposed methods outperformed several existing methods. For real-world problems with more features than samples from gene expression profiles, the proposed method was more robust in terms of clustering than well-established methods, and outperformed existing methods in some cases.

Acknowledgements

We would like to thank Doctor Akira Imakura for his valuable comments. The second author was supported in part by JSPS KAKENHI Grant Number 15K04768 and the Hattori Hokokai Foundation. The third author was supported by JST/CREST, Japan.


References

[Bach and Jordan, 2003] Francis R. Bach and Michael I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16, pages 305–312, Vancouver, Canada, June 2003.
[Banerjee et al., 2005] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, October 2005.
[Bar-Hillel et al., 2006] Aharon Bar-Hillel, Adam Spiro, and Eran Stark. Spike sorting: Bayesian clustering of non-stationary data. Journal of Neuroscience Methods, 157(2):303–316, October 2006.
[Belhumeur et al., 1997] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
[Chung, 1996] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, USA, 1996.
[Cui and Fan, 2012] Yan Cui and Liya Fan. A novel supervised dimensionality reduction algorithm: Graph-based Fisher analysis. Pattern Recognition, 45(4):1471–1481, April 2012.
[Fiedler, 1973] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.
[He and Niyogi, 2003] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems 16, pages 153–160, Vancouver, Canada, June 2003.
[Isaksson et al., 2008] Anders Isaksson, Mikael Wallman, Hanna Göransson, and Mats G. Gustafsson. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters, 29(14):1960–1965, July 2008.
[Ito and Murota, 2016] Shinji Ito and Kazuo Murota. An algorithm for the generalized eigenvalue problem for nonsquare matrix pencils by minimal perturbation approach. SIAM Journal on Matrix Analysis and Applications, 37(1):409–419, June 2016.
[Jurs et al., 2000] Peter C. Jurs, Gregory A. Bakke, and Heidi E. McClellands. Computational methods for the analysis of chemical sensor array data from volatile analytes. Chemical Reviews, 100(7):2649–2678, June 2000.
[Kamvar et al., 2003] Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 561–566, Acapulco, Mexico, August 2003.
[Kulis et al., 2005] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 457–464, Bonn, Germany, July 2005.
[Li et al., 2017] Xuelong Li, Mulin Chen, Feiping Nie, and Qi Wang. Locality adaptive discriminant analysis. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2201–2207, Melbourne, Australia, August 2017.
[Morikuni, 2016] Keiichi Morikuni. Contour integral-type method for eigenproblems of rectangular matrix pencils. In Annual Meeting 2016 of the Japan Society for Industrial and Applied Mathematics, pages 352–353, Kitakyushu, Japan, September 2016.
[Rand, 1971] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, December 1971.
[Schleif and Tino, 2015] Frank-Michael Schleif and Peter Tino. Indefinite proximity learning: A review. Neural Computation, 27(10):2039–2096, October 2015.
[Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.
[Sogh, 2006] Tarek S. Sogh. Wired and wireless intrusion detection system: Classifications, good characteristics and state-of-the-art. Computer Standards and Interfaces, 28(6):670–694, September 2006.
[Song et al., 2008] Yangqiu Song, Feiping Nie, Changshui Zhang, and Shiming Xiang. A unified framework for semi-supervised dimensionality reduction. Pattern Recognition, 41(9):2789–2799, September 2008.
[Spielman and Teng, 2014] Daniel A. Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):853–885, July 2014.
[Strehl and Ghosh, 2002] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, December 2002.
[Sugiyama, 2007] Masashi Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, May 2007.
[Tarca et al., 2006] Adi L. Tarca, Roberto Romero, and Sorin Draghici. Analysis of microarray experiments of gene expression profiling. American Journal of Obstetrics and Gynecology, 195(2):373–388, August 2006.
[Tichy et al., 1979] Noel M. Tichy, Michael L. Tushman, and Charles Fombrun. Social network analysis for organizations. Academy of Management Review, 4(4):507–519, October 1979.
[Yang, 2006] Liu Yang. Distance metric learning: A comprehensive survey. Michigan State University, pages 1–51, May 2006.
