Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Spectral Feature Scaling Method for Supervised Dimensionality Reduction

Momo Matsuda^1, Keiichi Morikuni^1, Tetsuya Sakurai^{1,2}
^1 University of Tsukuba   ^2 JST/CREST
[email protected], [email protected], [email protected]

Abstract

Spectral dimensionality reduction methods enable linear separations of complex data with high-dimensional features in a reduced space. However, these methods do not always give the desired results due to irregularities or uncertainties of the data. Thus, we consider aggressively modifying the scales of the features to obtain the desired classification. Using prior knowledge on the labels of partial samples to specify the Fiedler vector, we formulate an eigenvalue problem of a linear matrix pencil whose eigenvector has the feature scaling factors. The resulting factors can modify the features of entire samples to form clusters in the reduced space, according to the known labels. In this study, we propose new dimensionality reduction methods supervised using the feature scaling associated with the spectral clustering. Numerical experiments show that the proposed methods outperform well-established supervised methods for toy problems with more samples than features, and are more robust regarding clustering than existing methods. Also, the proposed methods outperform existing methods regarding classification for real-world problems with more features than samples of gene expression profiles of cancer diseases. Furthermore, the feature scaling tends to improve the clustering and classification accuracies of existing unsupervised methods, as the proportion of training data increases.

1 Introduction

Consider clustering a set of data samples with high-dimensional features into mutually exclusive subsets, called clusters, and classifying these clusters when given prior knowledge on the class labels of partial samples. These kinds of problems arise in pathological diagnoses using gene expression data [Tarca et al., 2006], the analysis of chemical sensor data [Jurs et al., 2000], the community detection in social networks [Tichy et al., 1979], the analyses of neural spike sorting [Bar-Hillel et al., 2006], and so on [Sogh, 2006]. Spectral clustering is an unsupervised technique that projects the data samples to the eigenspace of the Laplacian matrix, in which the data samples are linearly separated if they are well clustered. This technique is more effective for complicated clustering than existing methods. For example, the k-means algorithm induces partitions using hyperplanes, but may fail to give satisfactory results due to irregularities and uncertainties of the features.

Here, we consider aggressively modifying the scales of the features to improve the separation in the reduced space, and obtain the desired classification. We derive the factors for scaling the features (scaling factors) using a learning machine based on spectral dimensionality reduction; namely, the inputs are unscaled and labeled samples, and the outputs are the scaling factors. To this end, by exploiting the prior knowledge on the labels of partial samples, we specify the Fiedler vector and reformulate the Laplacian eigenproblem as an eigenproblem of a linear matrix pencil whose eigenvector has the scaling factors. The obtained factors can modify the features of the entire samples to form clusters in the reduced-dimensionality space according to the known labels. Thus, we use the prior knowledge to implicitly specify the phenomenon of interest, and supervise dimensionality reduction methods through the scaling factors. These approaches yield the desired clusters, incorporated with unsupervised spectral clustering, for test data scaled by the factors obtained from training data. Numerical experiments on artificial data and real-world data from gene expression profiles show that the feature scaling improves the accuracy of the spectral clustering. In addition, the spectral dimensionality reduction methods supervised using the feature scaling outperform existing methods in some cases, and are more robust than existing methods.

We review related work on supervised spectral dimensionality reduction methods. Learning a similarity matrix from training data is effective in spectral clustering [Bach and Jordan, 2003; Kamvar et al., 2003]. The proposed classification method can be considered a supervised method, incorporated with the kernel version of the locality preserving projections (LPP) [He and Niyogi, 2003]. LPP is comparable with the linear discriminant analysis (LDA) [Belhumeur et al., 1997], the local Fisher discriminant analysis (LFDA) [Sugiyama, 2007], and the locality adaptive discriminant analysis (LADA) [Li et al., 2017]. Kernel versions of LDA, LPP, and LFDA aim at nonlinear dimensionality reduction. These methods have semi-supervised variants [Song et al., 2008; Kulis et al., 2005; Cui and Fan, 2012].


1.1 Spectral Clustering

The underlying idea of the proposed methods follows from spectral clustering, which is induced by graph partitioning, where the graph is weighted and undirected. The weight w_{i,j} of the edge between nodes i and j represents the similarity between samples i and j for i, j = 1, 2, ..., n, where n is the number of samples. By dividing the graph into mutually exclusive subgraphs, the corresponding samples form clusters. This discrete problem can be continuously relaxed to a matrix eigenvalue problem. Let

    W = {w_{i,j}} ∈ R^{n×n},   D = diag(d_1, d_2, ..., d_n) ∈ R^{n×n} with d_i = Σ_{j=1}^n w_{i,j},

let e ∈ R^n be a vector with all elements 1, and let t be an indicator vector with the ith entry t_i = 1 if sample i is in a cluster, and t_i = −1 if sample i is in another cluster. Then, the constrained minimization of the Ncut function [Shi and Malik, 2000]

    min_v  v^T (D − W) v / (v^T D v),   subject to  e^T D v = 0,                      (1)

over v_i ∈ {1, −b} with b = Σ_{t_i>0} d_i / Σ_{t_i<0} d_i, is relaxed to finding the Fiedler vector v ∈ R^n \ {0} [Fiedler, 1973] associated with the smallest eigenvalue of the constrained generalized eigenvalue problem

    L v = λ D v,   subject to  e^T D v = 0,                                           (2)

where L = D − W is a Laplacian matrix and λ ∈ R. The latter problem (2) is tractable, while the former (1) is NP-hard. Moreover, a few eigenvectors of (2) form clusters that are separable using hyperplanes if the samples form well-separated clusters. A sophisticated algorithm for (2) runs in nearly linear time [Spielman and Teng, 2014].

2 Proposed Method

Irregular scales and uncertainty of features prevent well-established dimensionality reduction methods from clustering data samples into the desired disjoint subsets. In spectral clustering, the Ncut function is popular by virtue of its nonlinear separability, but it is not versatile. To cope with these issues, we propose a remedy that aggressively modifies the scales of the features in the eigenspace where a linear separation works, based on prior knowledge of the n partial samples X = [x_1, x_2, ..., x_n]^T, x_i ∈ R^m with m features. If the classes of the partial samples are known, we can estimate the entries of the Fiedler vector v ∈ R^n \ {0} of the Laplacian eigenvalue problem

    L_s v = λ_s D_s v,   e^T D_s v = 0,   λ_s ∈ R,                                    (3)

where L_s = D_s − W_s, and W_s = {w_{i,j}^{(s)}} ∈ R^{n×n} depends on the feature scaling factors s ∈ R^m. We obtain the scaling factors s ∈ R^m in (3) by solving an eigenproblem of a linear matrix pencil. Then, applying the obtained scaling factors to the features of the entire data Y = [X^T, R^T]^T ∈ R^{N×m} with N samples, we can extend the prior knowledge on the partial samples to the overall samples, where R ∈ R^{(N−n)×m} denotes the remaining data. In statistical terms, the scaling changes the mean ȳ_j = (1/n) Σ_{i=1}^n y_{i,j} and the variance σ_j^2 = (1/n) Σ_{i=1}^n (y_{i,j} − ȳ_j)^2 of the jth feature to s_j^{1/2} ȳ_j and |s_j| σ_j^2, respectively, for j = 1, 2, ..., m. Note that we allow the scaling factors to have negative values; see [Schleif and Tino, 2015]. Finding appropriate scaling factors can also be considered metric learning [Yang, 2006]. Recall that the center of a cluster associated with a Bregman divergence is equal to the centroid of the samples of the cluster [Banerjee et al., 2005].

Now, we reformulate (3) as another eigenproblem to extract the scaling factors s_i ∈ R, i = 1, 2, ..., m, as an eigenvector. Denote the scaling matrix by S = diag(s_1, s_2, ..., s_m) ∈ R^{m×m} and denote the (i, j) entry of the similarity matrix W_s for the scaled data X S^{1/2} by

    w_{i,j}^{(s)} = 1 − (x_i − x_j)^T S (x_i − x_j) / (2σ^2) = 1 − s^T x_{i,j} / (2σ^2)
                  ≃ exp( −(x_i − x_j)^T S (x_i − x_j) / (2σ^2) )   for i ≠ j,
    w_{i,j}^{(s)} = 0   for i = j,                                                    (4)

where s = [s_1, s_2, ..., s_m]^T ∈ R^m and the kth entry of x_{i,j} ∈ R^m is (x_{i,k} − x_{j,k})^2. Here, we used the first-order approximation of the exponential function exp(−x) ≈ 1 − x for 0 < x < 1. Then, the ith row of W_s is

    w_i^{(s)T} = [1, ..., 1, 0, 1, ..., 1] − (s^T / (2σ^2)) [x_{i,1}, x_{i,2}, ..., x_{i,n}] = ẽ_i^T − s^T X_i,

where ẽ_i is the n-dimensional vector with the ith entry equal to zero and the remaining entries one, and X_i = (1/(2σ^2)) [x_{i,1}, x_{i,2}, ..., x_{i,n}] ∈ R^{m×n}. Hence, we have

    W_s v = [ẽ_1, ẽ_2, ..., ẽ_n]^T v − [X_1 v, X_2 v, ..., X_n v]^T s.

Let x̂_i = (1/(2σ^2)) (x_{i,1} + x_{i,2} + ··· + x_{i,n}). Then, the ith diagonal entry d_i^{(s)} of D_s is

    d_i^{(s)} = Σ_{j=1}^n w_{i,j}^{(s)} = (n − 1) − s^T x̂_i.

Hence, denoting the Fiedler vector by v = [v_1, v_2, ..., v_n]^T, we have

    D_s v = (n − 1) v − [v_1 x̂_1, v_2 x̂_2, ..., v_n x̂_n]^T s.

Thus, (3) is written as

    L_s v = λ_s D_s v  ⇔  W_s v = (1 − λ_s) D_s v  ⇔  [A  α] [s; −1] = μ [B  β] [s; −1],

where μ = 1 − λ_s,

    A = [X_1 v, X_2 v, ..., X_n v]^T ∈ R^{n×m},   B = [v_1 x̂_1, v_2 x̂_2, ..., v_n x̂_n]^T ∈ R^{n×m},    (5)

    α = [ẽ_1, ẽ_2, ..., ẽ_n]^T v ∈ R^n,   β = (n − 1) v ∈ R^n.                        (6)
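For concreteness, the following NumPy sketch (an illustration added here, not the authors' released code) assembles the blocks A, B, α, and β of (5)-(6) from a training matrix X ∈ R^{n×m} and an estimated Fiedler vector v; the Gaussian width σ of the similarity (4) is treated as a free parameter.

import numpy as np

def pencil_blocks(X, v, sigma=1.0):
    # Return A, B (n x m) and alpha, beta (length n) of (5)-(6).
    n, m = X.shape
    c = 1.0 / (2.0 * sigma ** 2)
    A = np.empty((n, m))
    B = np.empty((n, m))
    for i in range(n):
        # Row j of Xi is x_{i,j} / (2 sigma^2), where the k-th entry of
        # x_{i,j} is (x_{i,k} - x_{j,k})^2; Xi equals X_i^T in the text.
        Xi = c * (X[i] - X) ** 2
        A[i] = Xi.T @ v                 # i-th row of A is (X_i v)^T
        B[i] = v[i] * Xi.sum(axis=0)    # i-th row of B is v_i * xhat_i^T
    alpha = v.sum() - v    # alpha_i = e~_i^T v, e~_i = ones with a zero at entry i
    beta = (n - 1.0) * v   # beta = (n - 1) v
    return A, B, alpha, beta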


Furthermore, the feature scaling factors s have the constraint

    e^T D_s v = Σ_{i=1}^n ( (n − 1) v_i − v_i s^T x̂_i ) = 0
             ⇔  ( Σ_{i=1}^n v_i x̂_i )^T s − (n − 1) Σ_{i=1}^n v_i = 0.

Therefore, we can obtain the scaling factors s by solving the generalized eigenproblem of a linear matrix pencil

    [ A    α ] [  s ]       [ B    β ] [  s ]
    [ γ^T  ρ ] [ −1 ]  = μ  [ 0^T  0 ] [ −1 ],    s ∈ R^m,  μ ∈ R,                    (7)

where

    γ = Σ_{i=1}^n v_i x̂_i ∈ R^m,   ρ = (n − 1) Σ_{i=1}^n v_i ∈ R.                     (8)

It follows from the equality (A − B)^T e = 0 that there exists a real number μ that satisfies (7). To solve (7), we can use the contour integral method [Morikuni, 2016] for n < m; otherwise, we can also use the minimal perturbation approach [Ito and Murota, 2016]. When s is obtained as a complex vector, its real and imaginary parts are also eigenvectors of (7); we took the real part in the experiments given in Section 3. For μ = 1 − λ_s, we adopt the eigenvector of (7) associated with the eigenvalue μ closest to one as the scaling factors. Then, we apply the square roots of the feature scaling factors, S^{1/2}, to the entire data Y, such that Z = Y S^{1/2}. Finally, we perform the spectral clustering or classification on the scaled data Z ∈ R^{N×m} by solving the Laplacian eigenvalue problem

    L′ u = λ′ D′ u,   e^T D′ u = 0,   u ∈ R^N \ {0},  λ′ ∈ R,

for a few eigenvectors corresponding to the smallest eigenvalues, where L′ = D′ − W′, W′ = {w′_{i,j}} ∈ R^{N×N},

    w′_{i,j} = exp( −‖z_i − z_j‖_2^2 / (2σ^2) ),   i, j = 1, 2, ..., N,               (9)

z_i ∈ R^m is the ith row of Z, d′_i = Σ_{j=1}^N w′_{i,j}, and D′ = diag(d′_1, d′_2, ..., d′_N). We summarize the procedures of the proposed methods in Algorithm 1.

Algorithm 1 Spectral clustering/classification methods supervised using the feature scaling
Input: Training data X ∈ R^{n×m}, entire data Y ∈ R^{N×m}, estimated Fiedler vector v, dimension of the reduced space ℓ.
1: Compute A, B, α, β, γ, ρ in (5), (6), (8).
2: Solve (7) for an eigenvector s = [s_1, s_2, ..., s_m]^T.
3: Set the scaling matrix S^{1/2} = diag(s_1, s_2, ..., s_m)^{1/2}.
4: Compute the scaled data matrix Z = Y S^{1/2}.
5: Compute the similarity matrix W′ ∈ R^{N×N} in (9) and D′, using the k-nearest neighbors, and set L′ = D′ − W′.
6: Solve L′u = λ′D′u for the eigenvectors u_1, u_2, ..., u_ℓ corresponding to the ℓ smallest nonzero eigenvalues.
7: Cluster/classify the reduced data samples [u_1, u_2, ..., u_ℓ] ∈ R^{N×ℓ} row-wise by using the k-means algorithm/the one-nearest neighborhood method, respectively.

3 Numerical Experiments

Numerical experiments on test problems were performed to compare the proposed methods with existing methods in terms of accuracy. For clustering problems, we compared the proposed method (the spectral clustering method supervised using the feature scaling, SC-S) with the unsupervised spectral clustering method (SC) [Shi and Malik, 2000], LPP, its kernel version (KLPP) [He and Niyogi, 2003], and two supervised methods, LFDA and its kernel version (KLFDA) [Sugiyama, 2007]. For classification problems, we compared the proposed method (the spectral classification method supervised using the feature scaling, SC-S) with LPP, KLPP, LFDA, and KLFDA. All programs were coded and run in Matlab 2016b.

We employed the LPP code in the Matlab Toolbox for Dimensionality Reduction^1 and the LFDA and KLFDA codes from the Sugiyama-Sato-Honda Lab site^2. In the compared methods except for LFDA, we chose the optimal value of the parameter σ among 1, 10^{±1}, and 10^{±2} to achieve the best accuracy. We formed the similarity matrix in (9) by using the seven nearest neighbors and taking the symmetric part. We set the dimension of the reduced space to one for clustering problems, and to one, two, and three for classification problems. It is known that using one eigenvector is sufficient for binary clustering if the clusters are well separated [Chung, 1996]. The reduced data samples were clustered using the k-means algorithm, and classified using the one-nearest neighborhood method.

The performance measures we used were the normalized mutual information (NMI) [Strehl and Ghosh, 2002] and the Rand index (RI) [Rand, 1971] for evaluating the clustering methods, and RI for evaluating the classification methods. Let n_i be the number of samples in cluster/class i, n̂_i be the number of samples clustered/classified by a method to cluster/class i, and n_{i,j} be the number of samples in cluster/class i clustered/classified by a method to cluster/class j. Then, the NMI measure for binary clustering is defined by

    NMI = Σ_{i=1}^2 Σ_{j=1}^2 n_{i,j} log( n·n_{i,j} / (n_i n̂_j) )
          / sqrt( ( Σ_{i=1}^2 n_i log(n_i/n) ) ( Σ_{j=1}^2 n̂_j log(n̂_j/n) ) ).

Next, let TP (true positive) denote the number of samples which are correctly identified as class 1, FP (false positive) the number of samples which are incorrectly identified as class 1, TN (true negative) the number of samples which are correctly identified as class 2, and FN (false negative) the number of samples which are incorrectly identified as class 2. Then, the RI measure is defined by

    RI = (TP + TN) / (TP + TN + FP + FN).

^1 https://lvdmaaten.github.io/drtoolbox/
^2 http://www.ms.k.u-tokyo.ac.jp/software.html
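To make the pipeline concrete, the sketch below (an illustration added here, not the authors' Matlab implementation) covers steps 3-7 of Algorithm 1 under simplifying assumptions: the scaling factors s are assumed to have already been obtained from (7) in step 2 (which in general requires a solver for rectangular pencils, such as the contour integral or minimal perturbation methods cited above), negative factors are handled through |s|^{1/2}, and the similarity (9) is restricted to the k = 7 nearest neighbors and symmetrized, as in the experiments; sigma and k are free parameters.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def reduce_with_scaling(Y, s, ell=1, sigma=1.0, k=7):
    # Steps 3-4: scale the entire data, Z = Y S^{1/2} with S = diag(s).
    # The paper allows negative (and complex) factors; |s|^{1/2} is a simplification.
    Z = Y * np.sqrt(np.abs(np.real(s)))
    N = Z.shape[0]
    # Step 5: Gaussian similarity (9) restricted to the k nearest neighbors,
    # symmetrized; dense O(N^2) arrays are fine for data of moderate size.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]      # indices of the k nearest neighbors
    keep = np.zeros_like(W, dtype=bool)
    keep[np.repeat(np.arange(N), k), nn.ravel()] = True
    W = np.where(keep | keep.T, W, 0.0)          # symmetrized k-NN graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 6: eigenvectors of L'u = lambda' D'u for the ell smallest nonzero
    # eigenvalues (the first eigenvalue is the trivial, near-zero one).
    _, U = eigh(L, D)
    return U[:, 1:ell + 1]

# Step 7: cluster (k-means) or classify (one-nearest neighbor on the labeled part).
def spectral_cluster(U, n_clusters=2):
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

def spectral_classify(U, train_idx, train_labels):
    knn = KNeighborsClassifier(n_neighbors=1).fit(U[train_idx], train_labels)
    return knn.predict(U)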


As the obtained clustering and classification become more accurate, the values of NMI and RI become larger.

3.1 Artificial Data

In this subsection, we use toy problems of data sets consisting of 800 samples with 10 features. Figure 1 shows three of the 10 features, where the symbols ◦ and + denote the data samples in different classes; the remaining seven features were given uniformly distributed random numbers in the interval [0, 1]. The original data were normalized to have a mean of zero and variance one. We set the entries of the Fiedler vector v to v_i = 1 if sample i is in a cluster, and v_i = −0.2 if sample i is in another cluster (see (1)). We repeated clustering the same reduced data samples 20 times using the k-means algorithm and the one-nearest neighborhood method, starting with different random initial guesses. Moreover, we tested on different proportions of training data from 5% to 50%, and repeated each test 10 times with different random choices of training data. Thus, we report the average performance by taking the arithmetic mean of RI and NMI.

Figure 1: Visualization in three dimensions of the artificial data

Figures 2 and 3 show the means and standard deviations of RI and NMI, respectively, for the clustering problems. SC-S performed the best in terms of accuracy among the compared methods. The values of RI for SC-S, LFDA, and KLFDA tended to increase with the proportion of training data used. On the other hand, the accuracies of clustering using SC, LPP, and KLPP were similar for different proportions of training data.

Figure 2: Proportion of training data vs. RI for clustering artificial data
Figure 3: Proportion of training data vs. NMI for clustering artificial data

Figure 4 shows the means and standard deviations of RI for the classification problems, where the reduced dimensions are ℓ = 1, 2, and 3. SC-S performed the best in terms of accuracy for proportions of training data of 5–50%, with few exceptions, among the compared methods. As the proportion of training data increased, the accuracy of the compared methods tended to increase.

Figure 4: Proportion of training data vs. RI for classification of artificial data for reduced dimensions ℓ = 1, 2, and 3 ((a) one dimension, (b) two dimensions, (c) three dimensions)

Figure 5 shows the values of the scaling factors for each feature when using 50% of the entire data for training. Since the fourth to tenth features, given by random numbers, were not involved in the classification, the values of the scaling factors of these seven features were small. Thus, the values of the scaling factors indicate which features are effective for the desired classification.

Figure 5: Values of the scaling factors

Figure 6 shows the data samples reduced to a one-dimensional space for each method when using 50% of all data for training. SC-S reduced all data to one dimension and made it linearly separable into the true clusters. KLFDA failed to reduce the data samples to be linearly separable into the true clusters.

Figure 6: Values of the reduced one-dimensional data samples ((a) SC-S, (b) SC, (c) KLFDA)

Table 1 gives the means and standard deviations of RI through the leave-one-out cross-validation for the classification problems for reduced dimensions ℓ = 1, 2, and 3. Table 1 shows that SC-S was immune from overfitting in the learning process.

Table 1: RI (mean% ± std) for classification of artificial data sets
          ℓ = 1          ℓ = 2          ℓ = 3
SC-S     *100.0±0.0     *100.0±0.0     *100.0±0.0
LPP        77.0±40.1      89.6±30.5      99.9±3.5
KLPP       52.4±49.9      52.4±50.0      52.4±49.9
LFDA       78.4±41.2      83.9±36.8      81.4±38.9
KLFDA      86.8±33.9      86.9±33.8      87.0±33.6
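The paper does not spell out the generator behind Figure 1, so the short sketch below builds a hypothetical stand-in of the same shape (800 samples, three informative features plus seven uniform noise features), normalizes it, and sets the entries of the estimated Fiedler vector to the values used above (1 and −0.2 for the two classes of the training samples). Every distributional choice for the informative features here is an assumption for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n_total, n_noise = 800, 7

# Hypothetical stand-in for the two classes of Figure 1: two noisy rings in 3-D.
labels = rng.integers(0, 2, size=n_total)
theta = rng.uniform(0.0, 2.0 * np.pi, size=n_total)
radius = np.where(labels == 0, 1.0, 2.0) + 0.1 * rng.standard_normal(n_total)
informative = np.column_stack([radius * np.cos(theta),
                               radius * np.sin(theta),
                               0.1 * rng.standard_normal(n_total)])
noise = rng.uniform(0.0, 1.0, size=(n_total, n_noise))      # features 4-10
Y = np.column_stack([informative, noise])
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)                    # zero mean, unit variance

# A 50% training subset and the estimated Fiedler vector used as supervision.
train_idx = rng.choice(n_total, size=n_total // 2, replace=False)
X, y_train = Y[train_idx], labels[train_idx]
v = np.where(y_train == 0, 1.0, -0.2)                       # entries of v (see (1))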

3.2 GEO Data Sets

In this subsection, we use four data sets in the Gene Expression Omnibus^3 from real-world problems with more features than samples. Table 2 gives the specifications of the data sets, where IPF denotes Idiopathic Pulmonary Fibrosis. Each data set has two binary classes: diseased patients and healthy people. We chose 50% of the entire data for training and repeated each test 10 times with different choices of the training data. The entries of the Fiedler vector v were set to v_i = 1 if sample i is in a cluster, and v_i = −1 if sample i is in another cluster (see (1)).

^3 https://www.ncbi.nlm.nih.gov/

Tables 3 and 4 give the means and standard deviations of RI and NMI, respectively, for each data set for the clustering problems. The symbol * denotes the largest value of RI and NMI for each data set. SC-S performed the best in terms of accuracy for two data sets, and second best for two data sets, among the compared methods. LFDA performed the best in terms of accuracy for one data set, second best for one data set, but worst for one data set among the compared methods. These results showed that SC-S was more robust than the other methods.
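The following sketch (an added illustration of how the protocol in this subsection reads, not the authors' scripts) evaluates a method over repeated random choices of the training half and reports the mean and standard deviation of RI; for clustering, the arbitrary cluster labels are matched to the true classes before RI is computed.

import numpy as np

def rand_index_binary(pred, true):
    # RI = (TP + TN) / (TP + TN + FP + FN); for clustering, the two possible
    # matchings of cluster labels to classes are tried and the better one kept.
    acc = np.mean(pred == true)
    return max(acc, 1.0 - acc)

def evaluate(run_method, Y, labels, n_rep=10, train_frac=0.5, seed=0):
    # run_method(Y, train_idx, v) -> predicted binary labels for all N samples;
    # it stands for any of the compared pipelines (e.g., Algorithm 1).
    rng = np.random.default_rng(seed)
    N = len(labels)
    scores = []
    for _ in range(n_rep):
        train_idx = rng.choice(N, size=int(train_frac * N), replace=False)
        v = np.where(labels[train_idx] == 0, 1.0, -1.0)   # Fiedler prior for GEO data
        scores.append(rand_index_binary(run_method(Y, train_idx, v), labels))
    return np.mean(scores), np.std(scores)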

Table 2: Specifications of data sets
Data set     # of training samples   # of entire samples   # of features
Ovarian      11                      22                    54675
Pancreatic   12                      24                    54675
IPF          12                      23                    54613
Colorectal   11                      22                    54675

Table 3: RI (mean% ± std) for clustering GEO data sets
          Ovarian       Pancreatic    IPF           Colorectal
SC         52.6±0.05     54.2±0.00     69.6±0.00     50.0±0.00
SC-S      *60.2±1.11     56.0±1.22     76.1±0.00    *74.1±1.10
LPP        57.0±1.47    *59.5±1.20     76.7±2.12     57.8±1.33
KLPP       54.6±0.00     54.2±0.00     71.3±0.00     53.2±0.00
LFDA       57.9±1.77     55.1±1.57     70.0±4.34     54.3±2.01
KLFDA      56.4±0.05     51.0±0.08    *79.9±0.12     63.4±0.25

Table 4: NMI (mean ± std) for clustering GEO data sets
          Ovarian        Pancreatic     IPF            Colorectal
SC         0.004±0.000    0.043±0.000    0.024±0.000    0.041±0.000
SC-S      *0.082±0.019    0.042±0.010    0.093±0.000   *0.323±0.016
LPP        0.044±0.024   *0.053±0.022    0.040±0.014    0.048±0.024
KLPP       0.006±0.000    0.039±0.000    0.040±0.000    0.052±0.000
LFDA       0.057±0.021    0.040±0.021    0.077±0.041    0.030±0.013
KLFDA      0.067±0.001    0.005±0.000   *0.148±0.005    0.109±0.003

Table 5 gives the means and standard deviations of RI for the classification problems for the reduced dimensions ℓ = 1, 2, and 3. SC-S was best on average in terms of accuracy for three data sets, except for Ovarian with ℓ = 1. LFDA was best in terms of accuracy for Ovarian with ℓ = 1, and for IPF with ℓ = 3. KLFDA was best in terms of accuracy for IPF. SC-S became more accurate as the number of eigenvectors ℓ used increased for Ovarian.

Table 5: RI (mean% ± std) for classification of GEO data sets
(a) Ovarian
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S       75.0±6.82    *90.5±6.88    *90.0±6.03
LPP        76.4±4.90     75.5±7.93     75.9±8.87
KLPP       55.5±4.45     56.4±6.17     57.7±5.40
LFDA      *77.3±7.87     70.9±4.64     69.6±5.40
KLFDA      55.9±6.12     55.0±3.78     56.4±4.17
(b) Pancreatic
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *77.5±8.58    *77.9±7.69    *73.3±7.50
LPP        74.6±8.00     73.8±5.61     72.1±7.23
KLPP       55.4±4.19     56.7±3.33     54.6±3.46
LFDA       50.0±0.00     50.0±0.00     50.0±0.00
KLFDA      54.2±2.64     54.2±2.64     54.2±2.64
(c) IPF
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *90.4±7.48    *96.5±3.79    *97.4±4.43
LPP        74.4±9.01     76.5±8.06     76.5±8.30
KLPP       68.7±3.25     65.2±4.76     67.8±9.76
LFDA       75.2±8.26     84.8±6.52     87.0±0.00
KLFDA      87.0±0.00     87.0±0.00     87.0±0.00
(d) Colorectal
          ℓ = 1         ℓ = 2         ℓ = 3
SC-S      *82.3±8.73    *85.5±6.98    *82.7±6.98
LPP        72.3±9.63     75.0±6.18     80.0±4.64
KLPP       57.3±7.39     60.9±6.80     64.6±7.27
LFDA       70.5±3.05     77.3±4.07     79.1±5.82
KLFDA      77.3±0.00     77.3±0.00     77.3±0.00

It is known that cross-validation is unreliable in small-sample classification [Isaksson et al., 2008]. Indeed, we observed significantly large values of the standard deviation for the compared methods for these data sets.

4 Conclusions

We considered the dimensionality reduction of high-dimensional data. To deal with irregularity or uncertainty in features, we proposed supervised dimensionality reduction methods that exploit knowledge on the labels of partial samples. With the supervision, we modify the features' variances and means. Moreover, the feature scaling can reduce those features that prevent us from obtaining the desired clusters. To obtain the factors used to scale the features, we formulated an eigenproblem of a linear matrix pencil whose eigenvector has the feature scaling factors, and described the procedures for the proposed methods. Numerical experiments showed that the feature scaling is effective when combined with spectral clustering and classification methods. For toy problems with more samples than features, the feature scaling improved the accuracy of the unsupervised spectral clustering, and the proposed methods outperformed several existing methods. For real-world problems with more features than samples from gene expression profiles, the proposed method was more robust in terms of clustering than well-established methods, and outperformed existing methods in some cases.

Acknowledgements

We would like to thank Doctor Akira Imakura for his valuable comments. The second author was supported in part by JSPS KAKENHI Grant Number 15K04768 and the Hattori Hokokai Foundation. The third author was supported by JST/CREST, Japan.


References

[Bach and Jordan, 2003] Francis R. Bach and Michael I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16, pages 305–312, Vancouver, Canada, June 2003.
[Banerjee et al., 2005] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, October 2005.
[Bar-Hillel et al., 2006] Aharon Bar-Hillel, Adam Spiro, and Eran Stark. Spike sorting: Bayesian clustering of non-stationary data. Journal of Neuroscience Methods, 157(2):303–316, October 2006.
[Belhumeur et al., 1997] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
[Chung, 1996] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, USA, 1996.
[Cui and Fan, 2012] Yan Cui and Liya Fan. A novel supervised dimensionality reduction algorithm: Graph-based Fisher analysis. Pattern Recognition, 45(4):1471–1481, April 2012.
[Fiedler, 1973] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.
[He and Niyogi, 2003] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems 16, pages 153–160, Vancouver, Canada, June 2003.
[Isaksson et al., 2008] Anders Isaksson, Mikael Wallman, Hanna Göransson, and Mats G. Gustafsson. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters, 29(14):1960–1965, July 2008.
[Ito and Murota, 2016] Shinji Ito and Kazuo Murota. An algorithm for the generalized eigenvalue problem for nonsquare matrix pencils by minimal perturbation approach. SIAM Journal on Matrix Analysis and Applications, 37(1):409–419, June 2016.
[Jurs et al., 2000] Peter C. Jurs, Gregory A. Bakke, and Heidi E. McClellands. Computational methods for the analysis of chemical sensor array data from volatile analytes. Chemical Reviews, 100(7):2649–2678, June 2000.
[Kamvar et al., 2003] Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 561–566, Acapulco, Mexico, August 2003.
[Kulis et al., 2005] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 457–464, Bonn, Germany, July 2005.
[Li et al., 2017] Xuelong Li, Mulin Chen, Feiping Nie, and Qi Wang. Locality adaptive discriminant analysis. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2201–2207, Melbourne, Australia, August 2017.
[Morikuni, 2016] Keiichi Morikuni. Contour integral-type method for eigenproblems of rectangular matrix pencils. In Annual Meeting 2016 of the Japan Society for Industrial and Applied Mathematics, pages 352–353, Kitakyushu, Japan, September 2016.
[Rand, 1971] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, December 1971.
[Schleif and Tino, 2015] Frank-Michael Schleif and Peter Tino. Indefinite proximity learning: A review. Neural Computation, 27(10):2039–2096, October 2015.
[Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.
[Sogh, 2006] Tarek S. Sogh. Wired and wireless intrusion detection system: Classifications, good characteristics and state-of-the-art. Computer Standards and Interfaces, 28(6):670–694, September 2006.
[Song et al., 2008] Yangqiu Song, Feiping Nie, Changshui Zhang, and Shiming Xiang. A unified framework for semi-supervised dimensionality reduction. Pattern Recognition, 41(9):2789–2799, September 2008.
[Spielman and Teng, 2014] Daniel A. Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):853–885, July 2014.
[Strehl and Ghosh, 2002] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, December 2002.
[Sugiyama, 2007] Masashi Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, May 2007.
[Tarca et al., 2006] Adi L. Tarca, Roberto Romero, and Sorin Draghici. Analysis of microarray experiments of gene expression profiling. American Journal of Obstetrics and Gynecology, 195(2):373–388, August 2006.
[Tichy et al., 1979] Noel M. Tichy, Michael L. Tushman, and Charles Fombrun. Social network analysis for organizations. Academy of Management Review, 4(4):507–519, October 1979.
[Yang, 2006] Liu Yang. Distance metric learning: A comprehensive survey. Michigan State University, pages 1–51, May 2006.
