Discriminative Unsupervised Dimensionality Reduction

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Xiaoqian Wang, Yun Liu, Feiping Nie, Heng Huang*
University of Texas at Arlington, Arlington, Texas 76019, USA
[email protected], [email protected], [email protected], [email protected]

* Corresponding Author. X. Wang and Y. Liu contribute equally to this paper. This work was partially supported by NSF IIS-1117965, IIS-1302675, IIS-1344152, DBI-1356628.

Abstract

As an important machine learning topic, dimensionality reduction has been widely studied and utilized in various areas. A multitude of dimensionality reduction methods have been developed, among which unsupervised dimensionality reduction is more desirable when obtaining label information requires onerous work. However, most previous unsupervised dimensionality reduction methods call for an affinity graph constructed beforehand, on which the subsequent dimensionality reduction steps are then performed. This separation of graph construction and dimensionality reduction makes the dimensionality reduction process highly dependent on the quality of the input graph. In this paper, we propose a novel graph embedding method for unsupervised dimensionality reduction. We simultaneously conduct dimensionality reduction along with graph construction by assigning adaptive and optimal neighbors according to the projected local distances. Our method does not need an affinity graph constructed in advance; instead, it learns the graph concurrently with dimensionality reduction, so the learned graph is optimal for dimensionality reduction. Meanwhile, the learned graph has an explicit block diagonal structure, from which the clustering results can be revealed directly without any postprocessing steps. Extensive empirical results on dimensionality reduction as well as clustering are presented to corroborate the performance of our method.

1 Introduction

Natural and social science applications are crowded with high-dimensional data in this day and age. However, in most cases these data are actually characterized by an underlying low-dimensional space. This interesting phenomenon has drawn great attention to dimensionality reduction research, which focuses on discovering the intrinsic manifold structure hidden in the high-dimensional ambient space. A multitude of supervised and unsupervised dimensionality reduction methods have been put forward, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), LLE [Roweis and Saul, 2000], LPP [Niyogi, 2004], shift invariant LPP [Nie et al., 2014a], NMMP [Nie et al., 2007], TRACK [Wang et al., 2014], etc. In circumstances where obtaining labels for the data is difficult, unsupervised dimensionality reduction methods are more favorable. Meanwhile, among the tremendous number of unsupervised dimensionality reduction methods, graph embedding methods receive particular emphasis since they exploit graph and manifold information.

However, most state-of-the-art graph based dimensionality reduction methods require an affinity graph constructed beforehand, which makes their projection ability depend heavily on the input graph. Due to the separated learning processes, the constructed graph may not be optimal for the later dimensionality reduction. To address this problem, in this paper we propose a novel graph embedding method for unsupervised dimensionality reduction which requires no input graph. Instead, graph construction in our model is conducted simultaneously with dimensionality reduction. We assign adaptive and optimal neighbors on the basis of the projected local distances. Our main assumption is that data points closer to each other usually have a larger probability of being connected, which is a common hypothesis in previous graph based methods [Nie et al., 2014b]. Also, we constrain the learned graph to an ideal structure where the graph is block diagonal with the number of connected components exactly equal to the number of clusters in the data, such that the constructed graph also uncovers the data cluster structure and enhances the graph quality.

2 Related Work

PCA is the most famous dimensionality reduction method; it seeks the projection directions along which the variance of the data is maximized. Since the variance of the projected data can be rewritten as $\sum_{i}^{n}\sum_{j}^{n}\|W^T x_i - W^T x_j\|_2^2 = \mathrm{tr}(W^T X H X^T W)$, where $H$ is the centering matrix

$$H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T, \qquad (1)$$

the goal of PCA is to solve:

$$\max_{W^T W = I} \mathrm{tr}(W^T X H X^T W).$$

However, as pointed out in [Welling, 2005], PCA becomes incapable when the distances between classes shrink while the scale of each class remains relatively large; one can imagine several cigars placed closely together, with each cigar representing the distribution of data in one class.
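To make the trace formulation above concrete, the following is a minimal NumPy sketch (not part of the paper) that builds the centering matrix $H$ of Eq. (1) and recovers the PCA projection $W$ as the top eigenvectors of $XHX^T$; all function and variable names are illustrative.

```python
import numpy as np

def pca_projection(X, m):
    """PCA as trace maximization: max_{W^T W = I} tr(W^T X H X^T W).

    X : d x n data matrix (columns are samples), m : target dimension.
    Returns a d x m projection W (top-m eigenvectors of X H X^T).
    """
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix, Eq. (1)
    C = X @ H @ X.T                       # scatter of the centered data
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    return eigvecs[:, -m:]                # eigenvectors of the m largest eigenvalues

# toy usage: project 10-dimensional points onto 2 dimensions
X = np.random.randn(10, 200)
W = pca_projection(X, 2)
Y = W.T @ X                               # m x n projected data
```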
Afterwards, LDA was proposed to accomplish a better dimensionality reduction task by minimizing the within-class scatter $S_w$ while maximizing the between-class scatter $S_b$. Nevertheless, LDA also encounters several problems. For example, the small sample size problem occurs when the number of samples is smaller than the number of data dimensions. Researchers have come up with numerous ways to overcome this obstacle. The authors of [Chen et al., 2000] put forward an approach that finds the most discriminative information in the null space of $S_w$, which evades the computational difficulty caused by the singularity of $S_w$ in Fisher LDA. Another work [Yu and Yang, 2001] indicates that by discarding the null space of $S_b$, which is non-informative, one can solve the traditional LDA problem from a better point of view.

However, all the LDA methods mentioned above necessitate knowledge of the data labels, which may not always be easily accessible, especially when labeling the data calls for mountains of work. State-of-the-art research has therefore turned to graph embedding methods, among which LPP can be seen as a representative example. Given data $X \in \mathbb{R}^{d \times n}$, suppose we are to learn a projection matrix $W \in \mathbb{R}^{d \times m}$, where $m$ is the number of dimensions we would like to reduce to. LPP works as follows: first learn an affinity graph $S$ describing the pairwise affinity between data points, and then find a "good" mapping by tackling the following problem [Belkin and Niyogi, 2001; Niyogi, 2004]:

$$\min_{W^T X D X^T W = I} \sum_{i}^{n}\sum_{j}^{n} \|W^T x_i - W^T x_j\|_2^2 \, s_{ij},$$

where $D$ is the degree matrix of $S$. There is also another variant of this model, published in [Kokiopoulou and Saad, 2007]:

$$\min_{W^T W = I} \sum_{i}^{n}\sum_{j}^{n} \|W^T x_i - W^T x_j\|_2^2 \, s_{ij}.$$
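For illustration, here is how the LPP problem above is commonly solved once an affinity graph $S$ is supplied: as a generalized eigenvalue problem $(X L X^T) w = \lambda (X D X^T) w$ with $L = D - S$. This is a minimal sketch under those standard assumptions; the Gaussian-weighted graph in the usage lines is one conventional choice of pre-constructed input, not one prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp_projection(X, S, m):
    """Solve min_{W^T X D X^T W = I} sum_ij ||W^T x_i - W^T x_j||^2 s_ij.

    X : d x n data, S : n x n symmetric affinity graph, m : target dimension.
    Standard solution: m smallest generalized eigenvectors of
    (X L X^T) w = lambda (X D X^T) w, with L = D - S.
    """
    D = np.diag(S.sum(axis=1))                    # degree matrix of S
    L = D - S                                     # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-9 * np.eye(X.shape[0])   # small ridge for numerical safety
    eigvals, eigvecs = eigh(A, B)                 # generalized problem, ascending order
    return eigvecs[:, :m]                         # d x m projection W

# usage with a pre-constructed Gaussian-weighted affinity graph
X = np.random.randn(5, 100)
dist2 = cdist(X.T, X.T, "sqeuclidean")
S = np.exp(-dist2 / dist2.mean())
np.fill_diagonal(S, 0.0)
W = lpp_projection(X, S, 2)
```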
Although these graph embedding methods no longer require label information, they still need an affinity graph constructed ahead of time to serve as input. These methods separate dimensionality reduction from graph construction, and thus depend to a large extent on the quality of the input affinity graph; a poor quality graph gives rise to erroneous dimensionality reduction.

In this paper, we instead conduct dimensionality reduction simultaneously with graph construction by assigning adaptive and optimal neighbors on the basis of the projected local distances. Our fundamental standing point is that data points closer to each other usually have a larger probability of being connected, in other words, of lying in the same cluster. Also, we constrain the learned graph to an ideal structure where the graph is block diagonal with the number of connected components exactly equal to the number of clusters in the data. A detailed description of how we implement dimensionality reduction and graph construction at the same time, and of the remarkable properties of the learned graph, is given in the next section.

3 Graph Embedding Discriminative Unsupervised Dimension Reduction

We hope to learn the optimal affinity graph $S$ together with the projection matrix $W$ for unsupervised dimensionality reduction. Ideally, we should learn both $S$ and $W$ simultaneously in a unified objective. To design a proper learning model, we emphasize the following ideas: 1) The affinity matrix $S \in \mathbb{R}^{n \times n}$ and the projection matrix $W$ are mutually learned, i.e. we optimize $S$ and $W$ simultaneously in $\min_{W,S} \sum_{i}^{n}\sum_{j}^{n}\|W^T x_i - W^T x_j\|_2^2 \, s_{ij}$ (previous methods learn $S$ separately and only optimize $W$ in dimensionality reduction). 2) The learned affinity matrix $S$ encodes the probability of each data point in $X$ connecting with its neighbors, i.e. a larger probability should be assigned to a pair with smaller distance, such that the graph structure is interpretable. 3) The data variance in the embedding space is maximized to retain most of the information.

Based on the above considerations, we can build the following objective:

$$\min_{S,W} \frac{\sum_{i,j=1}^{n} \|W^T x_i - W^T x_j\|_2^2 \, s_{ij}}{\mathrm{tr}(W^T X H X^T W)} \qquad (2)$$
$$\mathrm{s.t.}\ \ \forall i,\ s_i^T \mathbf{1} = 1,\ 0 \le s_{ij} \le 1,\ W^T W = I,$$

where $H$ is the centering matrix defined in Eq. (1).

However, Problem (2) has a trivial solution in which only the nearest data point of $W^T x_i$ is assigned probability 1 while all others are assigned 0; that is to say, $x_i$ is connected with only its nearest neighbor in the projected space. That is definitely not what we expect. Instead, we hope the learned affinity graph can maintain or enhance the data cluster relations, such that the projected data do not destroy this important structure. The desired result is that we project the data to a low-dimensional subspace where the connection probability within a cluster is nonzero and evenly distributed while the probability between clusters is zero. This ideal structure describes a block diagonal graph whose number of connected components is exactly the number of clusters in the data.
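As a concrete illustration of what objective (2) measures, the snippet below (an illustrative sketch, not the paper's optimization algorithm) evaluates the ratio for a given orthonormal $W$ and row-stochastic $S$, and constructs the trivial solution described above, in which each $s_i$ places all of its probability on the nearest projected neighbor.

```python
import numpy as np

def objective_value(X, W, S):
    """Objective (2): sum_ij ||W^T x_i - W^T x_j||^2 s_ij / tr(W^T X H X^T W)."""
    d, n = X.shape
    Y = W.T @ X                                             # m x n projected data
    dist2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(0)   # n x n squared distances
    H = np.eye(n) - np.ones((n, n)) / n                     # centering matrix, Eq. (1)
    return (dist2 * S).sum() / np.trace(W.T @ X @ H @ X.T @ W)

def trivial_S(X, W):
    """For fixed W, the numerator-minimizing S: probability 1 on the nearest neighbor."""
    Y = W.T @ X
    dist2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(0)
    np.fill_diagonal(dist2, np.inf)                          # exclude self-connections
    S = np.zeros_like(dist2)
    S[np.arange(len(dist2)), dist2.argmin(axis=1)] = 1.0     # each row sums to 1
    return S

# illustrative usage, not the paper's solver
X = np.random.randn(6, 50)
W, _ = np.linalg.qr(np.random.randn(6, 2))                   # random orthonormal W (W^T W = I)
S = trivial_S(X, W)
print(objective_value(X, W, S))
```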
