The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Similarity Learning via Kernel Preserving Embedding

Zhao Kang,1∗ Yiwei Lu,1,2 Yuanzhang Su,3 Changsheng Li,1 Zenglin Xu1∗
1School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
2Department of Computer Science, University of Manitoba, 66 Chancellors Cir, Winnipeg, MB R3T 2N2, Canada
3School of Foreign Languages, University of Electronic Science and Technology of China, Sichuan 611731, China
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatically learning similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semi-supervised learning. However, it just tries to reconstruct the original data, and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework by minimizing the reconstruction error of kernel matrices, rather than the reconstruction error of the original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel preserving opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.

Introduction

Nowadays, high-dimensional data can be collected everywhere, either by low-cost sensors or from the internet (Chen et al. 2012). Extracting useful information from massive high-dimensional data is critical in different areas like text, images, videos, and more. Data similarity is especially important since it is the input for a number of data analysis tasks, such as spectral clustering (Ng et al. 2002; Chen et al. 2018), nearest neighbor classification (Weinberger, Blitzer, and Saul 2005), image segmentation (Li et al. 2016), person re-identification (Hirzer et al. 2012), image retrieval (Hoi, Liu, and Chang 2008), dimension reduction (Passalis and Tefas 2017), and graph-based semi-supervised learning (Kang et al. 2018a). Therefore, similarity measure is crucial to the performance of many techniques and is a fundamental problem in the machine learning, pattern recognition, and data mining communities (Gao et al. 2017; Towne, Rosé, and Herbsleb 2016). A variety of similarity metrics, e.g., Cosine, Jaccard coefficient, Euclidean distance, and Gaussian function, are often used in practice for convenience. However, they are often data-dependent and sensitive to noise (Huang, Nie, and Huang 2015). Consequently, different metrics lead to a big difference in the final results. In addition, several other similarity measure strategies are popular in dimension reduction techniques. For example, in the widely used locally linear embedding (LLE) (Roweis and Saul 2000), isometric feature mapping (ISOMAP) (Tenenbaum, De Silva, and Langford 2000), and locality preserving projection (LPP) (Niyogi 2004) methods, one has to construct an adjacency graph of neighbors. Then, k-nearest-neighborhood (knn) and ε-nearest-neighborhood graph construction methods are often utilized. These approaches also have some inherent drawbacks, including 1) how to determine the neighbor number k or radius ε; 2) how to choose an appropriate similarity metric to define the neighborhood; 3) how to counteract the adverse effect of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all these factors heavily influence the subsequent tasks (Kang et al. 2018b).

Recently, automatically learning similarity information from data has drawn significant attention. In general, it can be classified into two categories. The first one is the adaptive neighbors approach. It learns similarity information by assigning a probability for each data point as the neighborhood of another data point (Nie, Wang, and Huang 2014). It has been shown to be an effective way to capture the local manifold structure.

The other one is the self-expression approach. The basic idea is to represent every data point by a linear combination of other data points. In contrast, LLE reconstructs the original data by expressing each data point as a linear combination of its k nearest neighbors only. Through minimizing this reconstruction error, we can obtain a coefficient matrix, which is also named the similarity matrix. It has been widely applied in various representation learning tasks, including sparse subspace clustering (Elhamifar and Vidal 2013; Peng et al. 2016), low-rank representation (Liu et al. 2013), multi-view learning (Tao et al. 2017), semi-supervised learning (Zhuang et al. 2017), and nonnegative matrix factorization (NMF) (Zhang et al. 2017).

∗Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

However, this approach just tries to reconstruct the original data and has no explicit mechanism to preserve manifold structure information about the data. In many applications, the data can display structures beyond simply being low-rank or sparse. It is well accepted that it is essential to take structure information into account when we perform high-dimensional data analysis. For instance, LLE preserves the local structure information.

In view of this issue with the current approaches, we propose to learn the similarity information through reconstructing the kernel matrix of the original data, which is supposed to preserve the overall relations. By doing so, we expect to obtain more accurate and complete data similarity. Considering clustering as a specific application of our proposed similarity learning method, we demonstrate that our framework provides impressive performance on several benchmark data sets. In summary, the main contributions of this paper are threefold:

• Compared to other approaches, the use of kernel-based distances allows us to preserve the sets of overall relations rather than individual pairwise similarities.

• Similarity preserving provides a fundamental building block to embed high-dimensional data into a low-dimensional latent space. It is general enough to be applied to a variety of learning problems.

• We evaluate the proposed approach on the clustering task. It shows that our algorithm enjoys superior performance compared to many state-of-the-art methods.

Notations. Given a data set $\{x_1, x_2, \cdots, x_n\}$, we denote $X \in \mathbb{R}^{m \times n}$ with $m$ features and $n$ instances. The $(i, j)$-th element of matrix $X$ is denoted by $x_{ij}$. The $\ell_2$-norm of a vector $x$ is represented by $\|x\| = \sqrt{x^T \cdot x}$, where $T$ denotes the transpose. The $\ell_1$-norm of $X$ is defined as $\|X\|_1 = \sum_{ij} |x_{ij}|$. The squared Frobenius norm is represented as $\|X\|_F^2 = \sum_{ij} x_{ij}^2$. The nuclear norm of $X$ is $\|X\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $X$. $I$ is the identity matrix with a proper size. $\vec{1}$ represents a column vector whose every element is one. $Z \ge 0$ means all the elements of $Z$ are nonnegative. The inner product is $\langle x_i, x_j \rangle = x_i^T \cdot x_j$.

Related Work

In this section, we provide a brief review of existing automatic similarity learning techniques.

Adaptive Neighbors Approach

In a similar spirit to LPP, for each data point $x_i$, all the data points $\{x_j\}_{j=1}^{n}$ can be regarded as the neighborhood of $x_i$ with probability $z_{ij}$. To some extent, $z_{ij}$ represents the similarity between $x_i$ and $x_j$ (Nie, Wang, and Huang 2014). The smaller the distance $\|x_i - x_j\|^2$ is, the greater the probability $z_{ij}$ is. Rather than prespecifying $Z$ with the deterministic neighborhood relation as LPP does, one can adaptively learn $Z$ from the data set by solving an optimization problem:

$$\min_{z_i} \sum_{j=1}^{n} \left( \|x_i - x_j\|^2 z_{ij} + \alpha z_{ij}^2 \right) \quad \text{s.t.}\ z_i^T \vec{1} = 1,\ 0 \le z_{ij} \le 1, \qquad (1)$$

where $\alpha$ is the regularization parameter. Recently, a variety of algorithms have been developed by using Eq. (1) to learn a similarity matrix. Some applications are clustering (Nie, Wang, and Huang 2014), NMF (Huang et al. 2018), and feature selection (Du and Shen 2015). This approach can effectively capture the local structure information.

Self-expression Approach

The so-called self-expression is to approximate each data point as a linear combination of other data points, i.e., $x_i = \sum_j x_j z_{ij}$. The rationale here is that if $x_i$ and $x_j$ are similar, the weight $z_{ij}$ should be big. Therefore, $Z$ also behaves like the similarity matrix. This shares a similar spirit with LLE, except that we do not predetermine the neighborhood. Its corresponding learning problem is:

$$\min_{Z} \frac{1}{2}\|X - XZ\|_F^2 + \alpha \rho(Z) \quad \text{s.t.}\ Z \ge 0, \qquad (2)$$

where $\rho(Z)$ is a regularizer of $Z$. Two commonly used assumptions about $Z$ are low-rank and sparse. Hence, in many domains, we also call $Z$ the low-dimensional representation of $X$. Through this procedure, the individual pairwise similarity information hidden in the data is explored (Nie, Wang, and Huang 2014) and the most informative "neighbors" for each data point are automatically chosen.

Moreover, this learned $Z$ can not only reveal the low-dimensional structure of the data, but also be robust to data scale (Huang, Nie, and Huang 2015). Therefore, this approach has drawn significant attention and achieved impressive performance in a number of applications, including face recognition (Zhang, Yang, and Feng 2011), subspace clustering (Liu et al. 2013; Elhamifar and Vidal 2013), and semi-supervised learning (Zhuang et al. 2017). In many real-world applications, data often present complex structures. Nevertheless, the first term in Eq. (2) simply minimizes the reconstruction error. Some important manifold structure information, such as the overall relations, could be lost during this process. Preserving relation information has been shown to be important for feature selection (Zhao et al. 2013). In (Zhao et al. 2013), a new feature vector $f$ is obtained by maximizing $f^T \hat{K} f$, where $\hat{K}$ is the refined similarity matrix derived from the original kernel matrix $K$ with element $K(x, y) = \phi(x)^T \phi(y)$. In this paper, we propose a novel model to preserve the overall relations of the original data and simultaneously learn the similarity matrix.
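To make the self-expression idea in Eq. (2) concrete, the following minimal sketch learns a coefficient matrix under simplifying assumptions: it replaces $\rho(Z)$ with a Frobenius-norm (ridge) regularizer $\frac{\alpha}{2}\|Z\|_F^2$ and drops the nonnegativity constraint, so that $Z = (X^T X + \alpha I)^{-1} X^T X$ in closed form. The function name and parameter value are illustrative, not part of the paper.

```python
import numpy as np

def self_expression_ridge(X, alpha=1.0):
    """Simplified self-expression (cf. Eq. (2)) with a ridge regularizer
    (alpha/2)*||Z||_F^2 and without the nonnegativity constraint.
    X is an m x n matrix whose columns are samples."""
    n = X.shape[1]
    G = X.T @ X                                   # n x n Gram matrix
    Z = np.linalg.solve(G + alpha * np.eye(n), G) # closed-form coefficients
    return Z
```

With the low-rank or sparse regularizers actually used in the literature (and in the model below), no such closed form exists and iterative solvers are required instead.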

Proposed Methodology

Since our goal is to obtain similarity information, it is necessary to retain the overall relations among the data samples when we build a new representation. However, Eq. (2) just tries to reconstruct the original data and does not take the overall relation information into account. Our objective is to find a new representation which preserves the overall relations as much as possible.

Given a data matrix $X$, one of the most commonly used relation measures is the inner product. Specifically, we try to minimize the inconsistency between two inner products: one for the raw data and another for the reconstructed data $XZ$. To make our model more general, we build it in a transformed space, i.e., $X$ is mapped by $\phi$ (Xu et al. 2009). We have

$$\min_{Z} \|\phi(X)^T \phi(X) - (\phi(X)Z)^T (\phi(X)Z)\|_F^2. \qquad (3)$$

Eq. (3) can be simplified as

$$\min_{Z} \|K - Z^T K Z\|_F^2. \qquad (4)$$

With a certain assumption about the structure of $Z$, our proposed Similarity Learning via Kernel preserving Embedding (SLKE) framework can be formulated as

$$\min_{Z} \frac{1}{2}\|K - Z^T K Z\|_F^2 + \gamma \rho(Z) \quad \text{s.t.}\ Z \ge 0, \qquad (5)$$

where $\gamma > 0$ is a tradeoff parameter and $\rho$ is a regularizer on $Z$. If we use the nuclear norm $\|\cdot\|_*$ as $\rho(\cdot)$, we obtain a low-rank representation. If the $\ell_1$-norm is adopted, we obtain a sparse representation. It is worth pointing out that Eq. (5) enjoys several nice properties:

1) The use of the kernel-based distance preserves the sets of overall relations, which will benefit the subsequent tasks;
2) The learned low-dimensional representation or similarity matrix $Z$ is general enough to be utilized to solve a variety of different tasks where similarity information is needed;
3) The learned representation is particularly suitable for problems that are sensitive to data similarity, such as clustering (Kang et al. 2018c), classification (Wright et al. 2009), and recommender systems (Kang, Peng, and Cheng 2017a);
4) Its input is the kernel matrix. This is also desirable, as not all types of data can be represented in numerical feature vector form (Xu et al. 2010). For instance, we may need to group proteins in bioinformatics based on their structures and to divide users in social media based on their friendship relations.

In the following section, we will show a simple strategy to solve problem (5).

Optimization

It is easy to see that Eq. (5) is a fourth-order function of $Z$, so directly solving it is not straightforward. To circumvent this problem, we first convert it to the following equivalent problem by introducing two auxiliary variables:

$$\min_{Z, J, W} \frac{1}{2}\|K - J^T K W\|_F^2 + \gamma \rho(Z) \quad \text{s.t.}\ Z \ge 0,\ Z = J,\ Z = W. \qquad (6)$$

Now we resort to the alternating direction method of multipliers (ADMM) to solve (6). The corresponding augmented Lagrangian function is:

$$\mathcal{L}(Z, J, W, Y_1, Y_2) = \frac{1}{2}\|K - J^T K W\|_F^2 + \gamma \rho(Z) + \frac{\mu}{2}\left( \left\|Z - J + \frac{Y_1}{\mu}\right\|_F^2 + \left\|Z - W + \frac{Y_2}{\mu}\right\|_F^2 \right), \qquad (7)$$

where $\mu > 0$ is a penalty parameter and $Y_1$, $Y_2$ are the Lagrangian multipliers. The variables $Z$, $W$, and $J$ can be updated alternately, one at each step, while keeping the other two fixed.

To solve for $J$, we observe that the objective function (7) is a strongly convex quadratic function in $J$, which can be solved by setting its first derivative to zero. We have:

$$J = (\mu I + K W W^T K^T)^{-1}(\mu Z + Y_1 + K W K^T), \qquad (8)$$

where $I \in \mathbb{R}^{n \times n}$ is the identity matrix. Similarly,

$$W = (\mu I + K^T J J^T K)^{-1}(\mu Z + Y_2 + K^T J K). \qquad (9)$$

For $Z$, we have the following subproblem:

$$\min_{Z} \gamma \rho(Z) + \mu \left\| Z - \frac{J + W - \frac{Y_1 + Y_2}{\mu}}{2} \right\|_F^2. \qquad (10)$$

Depending on the regularization strategy, we have different closed-form solutions for $Z$. Define $H = \frac{1}{2}\left(J + W - \frac{Y_1 + Y_2}{\mu}\right)$ and write its singular value decomposition (SVD) as $U \mathrm{diag}(\sigma) V^T$. Then, for the low-rank representation, i.e., $\rho(Z) = \|Z\|_*$, we have

$$Z = U\, \mathrm{diag}\!\left(\max\left\{\sigma - \frac{\gamma}{2\mu}, 0\right\}\right) V^T. \qquad (11)$$

For the sparse representation, i.e., $\rho(Z) = \|Z\|_1$, we can update $Z$ element by element as

$$Z_{ij} = \max\left\{|H_{ij}| - \frac{\gamma}{2\mu}, 0\right\} \cdot \mathrm{sign}(H_{ij}). \qquad (12)$$

For clarity, the complete procedure to solve problem (5) is outlined in Algorithm 1.

Algorithm 1: The algorithm of SLKE
Input: Kernel matrix $K$, parameters $\gamma > 0$, $\mu > 0$.
Initialize: Random matrices $W$ and $Z$, $Y_1 = Y_2 = 0$.
REPEAT
1: Calculate $J$ by (8).
2: Update $W$ according to (9).
3: Calculate $Z$ using (11) or (12).
4: Update the Lagrange multipliers $Y_1$ and $Y_2$ as
   $Y_1 = Y_1 + \mu(Z - J)$,
   $Y_2 = Y_2 + \mu(Z - W)$.
UNTIL stopping criterion is met.
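For concreteness, the following is a minimal NumPy sketch of Algorithm 1. It follows the updates in Eqs. (8)-(12) directly, while the random seed, default parameter values, iteration cap, and relative-change stopping rule are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def slke(K, gamma=1e-3, mu=1.0, regularizer="low-rank", n_iter=300, tol=1e-5, seed=0):
    """Sketch of Algorithm 1 (SLKE) solved by ADMM, following Eqs. (8)-(12).

    K: n x n kernel matrix.  `regularizer` selects Eq. (11) ("low-rank",
    nuclear norm) or Eq. (12) ("sparse", l1-norm).  Defaults are assumptions.
    """
    n = K.shape[0]
    I = np.eye(n)
    rng = np.random.default_rng(seed)
    Z = rng.random((n, n))            # "Initialize: Random matrices W and Z"
    W = rng.random((n, n))
    Y1 = np.zeros((n, n))
    Y2 = np.zeros((n, n))
    for _ in range(n_iter):
        # Eq. (8): J = (mu*I + K W W^T K^T)^{-1} (mu*Z + Y1 + K W K^T)
        KW = K @ W
        J = np.linalg.solve(mu * I + KW @ KW.T, mu * Z + Y1 + KW @ K.T)
        # Eq. (9): W = (mu*I + K^T J J^T K)^{-1} (mu*Z + Y2 + K^T J K)
        KJ = K.T @ J
        W = np.linalg.solve(mu * I + KJ @ KJ.T, mu * Z + Y2 + KJ @ K)
        # Eqs. (10)-(12): proximal step on H = (J + W - (Y1 + Y2)/mu) / 2
        H = (J + W - (Y1 + Y2) / mu) / 2.0
        thr = gamma / (2.0 * mu)
        if regularizer == "low-rank":
            # Eq. (11): singular value thresholding
            U, s, Vt = np.linalg.svd(H, full_matrices=False)
            Z_new = (U * np.maximum(s - thr, 0.0)) @ Vt
        else:
            # Eq. (12): element-wise soft-thresholding
            Z_new = np.sign(H) * np.maximum(np.abs(H) - thr, 0.0)
        # Step 4 of Algorithm 1: dual updates
        Y1 += mu * (Z_new - J)
        Y2 += mu * (Z_new - W)
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1e-12):
            Z = Z_new
            break
        Z = Z_new
    return Z
```

A learned $Z$ from this routine could then be symmetrized, e.g., as $(|Z| + |Z^T|)/2$ (a common choice that the paper does not spell out), and passed to spectral clustering, which is how the similarity matrix is used in the experiments below.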

Complexity Analysis

With our optimization strategy, the complexity of updating $J$ is $O(n^3)$. Updating $W$ has the same complexity as $J$. Both the $J$ and $W$ updates involve a matrix inverse. Fortunately, we can avoid it by resorting to approximation techniques when we face large-scale data sets. Depending on the choice of regularizer, we have different complexities for $Z$. The low-rank representation requires an SVD in every iteration, whose complexity is $O(n^3)$. Since we seek a low-rank matrix, we only need a few principal singular values, and a package like PROPACK can compute a rank-$k$ SVD with complexity $O(n^2 k)$ (Larsen 2004). To obtain a sparse solution of $Z$, we need $O(n^2)$ complexity. The updating of $Y_1$ and $Y_2$ costs $O(n^2)$.

Experiments

To assess the effectiveness of our proposed method, we apply the learned similarity matrix to clustering.

Data Sets

We conduct our experiments on nine benchmark data sets, which are widely used in clustering experiments. We show the statistics of these data sets in Table 1. In summary, the number of data samples varies from 165 to 9,394 and the feature number ranges from 320 to 36,771. The first five data sets are images, while the last four are text data. Specifically, the five image data sets contain three face databases (ORL, YALE, and JAFFE), a toy image database COIL20, and a binary alpha digits data set BA. For example, COIL20 consists of 20 objects and each object was taken from different angles. The BA data set contains images of digits "0" through "9" and capital letters "A" through "Z". YALE, ORL, and JAFFE consist of images of persons. Each image represents different facial expressions or configurations due to time, illumination conditions, and glasses/no glasses.

Table 1: Description of the data sets
Data set   # instances   # features   # classes
YALE       165           1024         15
JAFFE      213           676          10
ORL        400           1024         40
COIL20     1440          1024         20
BA         1404          320          36
TR11       414           6429         9
TR41       878           7454         10
TR45       690           8261         10
TDT2       9394          36771        30

Data Preparation

Since the input to our proposed method is a kernel matrix, we design 12 kernels in total to fully examine its performance. They are: seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t d_{\max}^2))$ with $t \in \{0.01, 0.05, 0.1, 1, 10, 50, 100\}$, where $d_{\max}$ denotes the maximal distance between data points; a linear kernel $K(x, y) = x^T y$; and four polynomial kernels of the form $K(x, y) = (a + x^T y)^b$ with $a \in \{0, 1\}$ and $b \in \{2, 4\}$. Besides, all kernels are rescaled to $[0, 1]$ by dividing each element by the largest element in its corresponding kernel matrix. These kernels are commonly used types in the literature, so we can well investigate the performance of our method. A code sketch of this kernel design is given after the method list below.

Comparison Methods

To fully examine the effectiveness of the proposed framework on clustering, we choose a good set of methods to compare against. In general, they can be classified into two categories: similarity-based and kernel-based clustering methods.

• Spectral Clustering (SC) (Ng et al. 2002): SC is a widely used clustering method. It enjoys the advantage of exploring the intrinsic data structures. Its input is the graph Laplacian, which is constructed from the similarity matrix. Here, we directly treat the kernel matrix as the similarity matrix for spectral clustering. For our proposed SLKE method, we employ the learned $Z$ to do spectral clustering. Thus, SC serves as a baseline method.

• Robust Kernel K-means (RKKM)¹ (Du et al. 2015): Based on the classical k-means clustering algorithm, RKKM has been developed to deal with nonlinear structures, noise, and outliers in the data. RKKM demonstrates superior performance on a number of real-world data sets.

• Simplex Sparse Representation (SSR) (Huang, Nie, and Huang 2015): SSR has been proposed recently. It is based on the adaptive neighbors idea. Another appealing property of this method is that its model parameter can be calculated by assuming a maximum number of neighbors. Therefore, we do not need to tune the parameter. In addition, it outperforms many other state-of-the-art techniques.

• Low-Rank Representation (LRR) (Liu et al. 2013): Based on self-expression, subspace clustering with a low-rank regularizer achieves great success in a number of applications, such as face clustering and motion segmentation.

• Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2013): Similar to LRR, SSC assumes a sparse solution of $Z$. Both LRR and SSC learn the similarity matrix by reconstructing the original data. In this respect, SC, LRR, and SSC are baseline methods w.r.t. our proposed algorithm.

• Clustering with Adaptive Neighbors (CAN) (Nie, Wang, and Huang 2014): Based on the idea of adaptive neighbors, i.e., Eq. (1), CAN learns a local graph from raw data for the clustering task.

• Twin Learning for Similarity and Clustering (TLSC) (Kang, Peng, and Cheng 2017b): Recently, TLSC has been proposed and has shown promising results on real-world data sets. TLSC not only learns the similarity matrix via self-expression in kernel space, but also has an optimal similarity graph guarantee. Besides, it has good theoretical properties, i.e., it is equivalent to kernel k-means and k-means under certain conditions.

• SLKE: Our proposed similarity learning method with overall-relation preserving capability. After obtaining the similarity matrix $Z$, we use spectral clustering to conduct the clustering experiments. We test both the low-rank and sparse regularizers and denote them as SLKE-R and SLKE-S, respectively².

¹https://github.com/csliangdu/RMKKM
²https://github.com/sckangz/SLKE
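The 12-kernel design described under Data Preparation above can be sketched as follows. This is one plausible reading of the formulas given there; the function name and the use of SciPy for pairwise distances are choices of this sketch, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_kernels(X):
    """Construct the 12 kernels described in Data Preparation.
    X: n x m matrix with rows as samples."""
    D = cdist(X, X)                    # pairwise Euclidean distances
    d_max = D.max()                    # maximal distance between data points
    G = X @ X.T                        # inner products x_i^T x_j
    kernels = []
    # Seven Gaussian kernels K(x, y) = exp(-||x - y||_2^2 / (t * d_max^2))
    for t in [0.01, 0.05, 0.1, 1, 10, 50, 100]:
        kernels.append(np.exp(-D ** 2 / (t * d_max ** 2)))
    # One linear kernel K(x, y) = x^T y
    kernels.append(G)
    # Four polynomial kernels K(x, y) = (a + x^T y)^b
    for a in (0, 1):
        for b in (2, 4):
            kernels.append((a + G) ** b)
    # Rescale each kernel by its largest element, per the paper's description
    return [Ki / Ki.max() for Ki in kernels]
```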

Evaluation Metrics

To quantitatively and effectively assess the clustering performance, we utilize two widely used metrics (Peng et al. 2018): accuracy (Acc) and normalized mutual information (NMI).

Acc discovers the one-to-one relationship between clusters and classes. Let $l_i$ and $\hat{l}_i$ be the clustering result and the ground truth cluster label of $x_i$, respectively. Then Acc is defined as

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta(\hat{l}_i, \mathrm{map}(l_i))}{n},$$

where $n$ is the sample size, the Kronecker delta function $\delta(x, y)$ equals one if and only if $x = y$ and zero otherwise, and $\mathrm{map}(\cdot)$ is the best permutation mapping function that maps each cluster index to a true class label based on the Kuhn-Munkres algorithm.

Given two sets of clusters $L$ and $\hat{L}$, NMI is defined as

$$\mathrm{NMI}(L, \hat{L}) = \frac{\sum_{l \in L, \hat{l} \in \hat{L}} p(l, \hat{l}) \log\!\left(\frac{p(l, \hat{l})}{p(l)\, p(\hat{l})}\right)}{\max\big(H(L), H(\hat{L})\big)},$$

where $p(l)$ and $p(\hat{l})$ represent the marginal probability distribution functions of $L$ and $\hat{L}$, respectively, $p(l, \hat{l})$ is the joint probability function of $L$ and $\hat{L}$, and $H(\cdot)$ is the entropy function. A greater NMI means better clustering performance.

Results

We report the extensive experimental results in Table 2. Except for SSC, LRR, CAN, and SSR, we run the other methods on each kernel matrix individually. As a result, we show both the best performance among the 12 kernels and the average result over the 12 kernels for them. Based on this table, we can see that our proposed SLKE achieves the best performance in most cases. To be specific, we have the following observations:

1) Compared to the classical k-means based RKKM and spectral clustering techniques, our proposed method SLKE has a big advantage in terms of accuracy and NMI. With respect to the recently proposed SSR and TLSC methods, SLKE always obtains better results.

2) SLKE-R and SLKE-S often outperform LRR and SSC, respectively. The accuracy increases by 8.92% and 8.76% on average, respectively. That is to say, the kernel-based distance approach indeed performs better than the original data reconstruction technique. This verifies the importance of retaining relation information when we learn a low-dimensional representation, especially for the sparse representation.

3) With respect to the adaptive neighbors approach CAN, we also obtain better performance on all data sets except COIL20. For COIL20, our results are quite close to CAN's. Therefore, compared to various similarity learning techniques, our method is very competitive.

4) Regarding the low-rank and sparse representations, it is hard to conclude which one is better. It totally depends on the specific data.

Furthermore, we run the t-SNE (Maaten and Hinton 2008) algorithm on the JAFFE data $X$ and on the reconstructed data $XZ$ from the best result of our SLKE-R. As shown in Figure 1, we can see that our method can well preserve the cluster structure of the data.

Figure 1: JAFFE data set visualized in two dimensions. (a) Original data. (b) Reconstructed data.

To see the significance of the improvements, we further apply the Wilcoxon signed rank test (Peng, Cheng, and Cheng 2017) to Table 2. We show the p-values in Table 3. We note that the testing results are under 0.05 in most cases when comparing SLKE-S and SLKE-R to the other methods. Therefore, SLKE-S and SLKE-R outperform SC, RKKM, SSC, and SSR with statistical significance.
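The two metrics defined under Evaluation Metrics above can be computed with standard scientific-Python tools. The sketch below is one reasonable implementation: the Hungarian matching via SciPy for the $\mathrm{map}(\cdot)$ step and scikit-learn's max-normalized NMI for the $\max(H(L), H(\hat{L}))$ normalization are my choices, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Acc with the best cluster-to-class permutation (Kuhn-Munkres)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # overlap[i, j] = number of samples in cluster i carrying class j
    overlap = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            overlap[i, j] = np.sum((y_pred == c) & (y_true == k))
    rows, cols = linear_sum_assignment(-overlap)   # maximize matched samples
    return overlap[rows, cols].sum() / len(y_true)

def clustering_nmi(y_true, y_pred):
    """NMI normalized by max(H(L), H(L_hat)), as in the definition above."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")
```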

Table 2: Clustering results obtained on benchmark data sets. The average performance over the 12 kernels is given in parentheses. The best results among those kernels are highlighted in boldface.

(a) Accuracy (%)
Data    SC            RKKM          SSC    LRR    SSR    CAN    TLSC          SLKE-S        SLKE-R
YALE    49.42(40.52)  48.09(39.71)  38.18  61.21  54.55  58.79  55.85(45.35)  61.82(38.89)  66.24(51.28)
JAFFE   74.88(54.03)  75.61(67.89)  99.53  99.53  87.32  98.12  99.83(86.64)  96.71(70.77)  99.85(90.89)
ORL     58.96(46.65)  54.96(46.88)  36.25  76.50  69.00  61.50  62.35(50.50)  77.00(45.33)  74.75(59.00)
COIL20  67.60(43.65)  61.64(51.89)  73.54  68.40  76.32  84.58  72.71(38.03)  75.42(56.83)  84.03(65.65)
BA      31.07(26.25)  42.17(34.35)  24.22  45.37  23.97  36.82  47.72(39.50)  50.74(36.35)  44.37(35.79)
TR11    50.98(43.32)  53.03(45.04)  32.61  73.67  41.06  38.89  71.26(54.79)  69.32(46.87)  74.64(55.07)
TR41    63.52(44.80)  56.76(46.80)  28.02  70.62  63.78  62.87  65.60(43.18)  71.19(47.91)  74.37(53.51)
TR45    57.39(45.96)  58.13(45.69)  24.35  78.84  71.45  48.41  74.02(53.38)  78.55(50.59)  79.89(58.37)
TDT2    52.63(45.26)  48.35(36.67)  23.45  52.03  20.86  19.74  55.74(44.82)  59.61(25.40)  74.92(33.67)

(b) NMI (%)
Data    SC            RKKM          SSC    LRR    SSR    CAN    TLSC          SLKE-S        SLKE-R
YALE    52.92(44.79)  52.29(42.87)  45.56  62.98  57.26  57.67  56.50(45.07)  59.47(40.38)  64.29(52.87)
JAFFE   82.08(59.35)  83.47(74.01)  99.17  99.16  92.93  97.31  99.35(84.67)  94.80(60.83)  99.49(81.56)
ORL     75.16(66.74)  74.23(63.91)  60.24  85.69  84.23  76.59  78.96(63.55)  86.35(58.84)  85.15(75.34)
COIL20  80.98(54.34)  74.63(63.70)  80.69  77.87  86.89  91.55  82.20(73.26)  80.61(65.40)  91.25(73.53)
BA      50.76(40.09)  57.82(46.91)  37.41  57.97  30.29  49.32  63.04(52.17)  63.58(55.06)  56.78(50.11)
TR11    43.11(31.39)  49.69(33.48)  02.14  65.61  27.60  19.17  58.60(37.58)  67.63(30.56)  70.93(45.39)
TR41    61.33(36.60)  60.77(40.86)  01.16  67.50  59.56  51.13  65.50(43.18)  70.89(34.82)  68.50(47.45)
TR45    48.03(33.22)  57.86(38.96)  01.61  77.01  67.82  49.31  74.24(44.36)  72.50(38.04)  78.12(50.37)
TDT2    52.23(27.16)  54.46(42.19)  13.09  64.36  02.44  03.97  58.35(46.37)  58.55(15.43)  68.21(28.94)

Figure 2: The effect of parameter γ on the YALE data set. (a) Acc. (b) NMI.

Table 3: Wilcoxon signed rank test on all data sets (p-values).
Method   Metric  SC     RKKM   SSC    LRR    SSR    CAN    TLSC
SLKE-S   Acc     .0039  .0039  .0117  .2500  .0078  .0391  .0391
SLKE-S   NMI     .0078  .0039  .0195  .6523  .0391  .0547  .3008
SLKE-R   Acc     .0039  .0039  .0039  .0977  .0039  .0039  .0117
SLKE-R   NMI     .0039  .0078  .0039  .0742  .0039  .0078  .0391
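As a quick illustration of how such p-values arise, the sketch below runs a paired Wilcoxon signed rank test with SciPy on the best-kernel accuracies from Table 2(a), comparing SLKE-S against SC. Pairing the two methods per data set is an assumption about the test setup, although the resulting two-sided exact p-value (about .0039 when one method wins on all nine data sets) matches the scale reported in Table 3.

```python
from scipy.stats import wilcoxon

# Best-kernel accuracies from Table 2(a), per data set, for SLKE-S and SC
# (YALE, JAFFE, ORL, COIL20, BA, TR11, TR41, TR45, TDT2).
slke_s = [61.82, 96.71, 77.00, 75.42, 50.74, 69.32, 71.19, 78.55, 59.61]
sc     = [49.42, 74.88, 58.96, 67.60, 31.07, 50.98, 63.52, 57.39, 52.63]

stat, p = wilcoxon(slke_s, sc)   # paired, two-sided by default
print(round(p, 4))               # ~0.0039: SLKE-S wins on all nine data sets
```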

Parameter Analysis

In this subsection, we investigate the influence of our model parameter $\gamma$ on the clustering results. Taking the Gaussian kernel with $t = 100$ on the YALE and JAFFE data sets as examples, we plot our algorithm's performance with $\gamma$ in the range $[10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}]$ in Figures 2 and 3, respectively. As we can see, our proposed methods work well for a wide range of $\gamma$, e.g., from $10^{-6}$ to $10^{-3}$.

Figure 3: The effect of parameter γ on the JAFFE data set. (a) Acc. (b) NMI.

Conclusion

In this paper, we present a novel similarity learning framework relying on an embedding of kernel-based distance. Our model is flexible enough to obtain either a low-rank or a sparse representation of the data. Comprehensive experimental results on real data sets well demonstrate the superiority of the proposed method on the clustering task. It has great potential to be applied in a number of applications beyond clustering. It has been shown that the performance of the proposed method is largely determined by the choice of kernel function. In the future, we plan to address this issue by developing a multiple kernel learning method, which is capable of automatically learning an appropriate kernel from a pool of input kernels.

Acknowledgment

This paper was in part supported by Grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111), a 985 Project of UESTC (No. A1098531023601041), and two Fundamental Research Funds for the Central Universities of China (Nos. A03017023701012 and ZYGX2017KYQD177).

References

Chen, X.; Ye, Y.; Xu, X.; and Huang, J. Z. 2012. A feature group weighting method for subspace clustering of high-dimensional data. Pattern Recognition 45(1):434-446.

Chen, X.; Hong, W.; Nie, F.; He, D.; Yang, M.; and Huang, J. Z. 2018. Directly minimizing normalized cut for large scale data. In SIGKDD, 1206-1215.

Du, L., and Shen, Y.-D. 2015. Unsupervised feature selection with adaptive structure learning. In SIGKDD, 209-218. ACM.

Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y.-D. 2015. Robust multiple kernel k-means using ℓ2,1-norm. In IJCAI, 3476-3482. AAAI Press.

Elhamifar, E., and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11):2765-2781.

Gao, X.; Hoi, S. C.; Zhang, Y.; Zhou, J.; Wan, J.; Chen, Z.; Li, J.; and Zhu, J. 2017. Sparse online learning of image similarity. ACM Transactions on Intelligent Systems and Technology (TIST) 8(5):64.

Hirzer, M.; Roth, P. M.; Köstinger, M.; and Bischof, H. 2012. Relaxed pairwise learned metric for person re-identification. In ECCV, 780-793. Springer.

Hoi, S. C.; Liu, W.; and Chang, S.-F. 2008. Semi-supervised distance metric learning for collaborative image retrieval. In CVPR, 1-7. IEEE.

Huang, S.; Wang, H.; Li, T.; Li, T.; and Xu, Z. 2018. Robust graph regularized nonnegative matrix factorization for clustering. Data Mining and Knowledge Discovery 32(2):483-503.

Huang, J.; Nie, F.; and Huang, H. 2015. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, 3569-3575.

Kang, Z.; Lu, X.; Yi, J.; and Xu, Z. 2018a. Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification. In IJCAI, 2312-2318.

Kang, Z.; Peng, C.; Cheng, Q.; and Xu, Z. 2018b. Unified spectral clustering with optimal graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press.

Kang, Z.; Wen, L.; Chen, W.; and Xu, Z. 2018c. Low-rank kernel learning for graph-based clustering. Knowledge-Based Systems.

Kang, Z.; Peng, C.; and Cheng, Q. 2017a. Kernel-driven similarity learning. Neurocomputing 267:210-219.

Kang, Z.; Peng, C.; and Cheng, Q. 2017b. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press.

Larsen, R. M. 2004. PROPACK: software for large and sparse SVD calculations. Available online. URL http://sun.stanford.edu/rmunk/PROPACK, 2008-2009.

Li, T.; Cheng, B.; Ni, B.; Liu, G.; and Yan, S. 2016. Multitask low-rank affinity graph for image segmentation and image annotation. ACM Transactions on Intelligent Systems and Technology (TIST) 7(4):65.

Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):171-184.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605.

Ng, A. Y.; Jordan, M. I.; Weiss, Y.; et al. 2002. On spectral clustering: Analysis and an algorithm. NIPS 2:849-856.

Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In SIGKDD, 977-986. ACM.

Niyogi, X. 2004. Locality preserving projections. In NIPS, volume 16, 153. MIT.

Passalis, N., and Tefas, A. 2017. Dimensionality reduction using similarity-induced embeddings. IEEE Transactions on Neural Networks and Learning Systems.

Peng, X.; Xiao, S.; Feng, J.; Yau, W.-Y.; and Yi, Z. 2016. Deep subspace clustering with sparsity prior. In IJCAI, 1925-1931.

Peng, C.; Kang, Z.; Cai, S.; and Cheng, Q. 2018. Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction. ACM Transactions on Intelligent Systems and Technology (TIST) 9(5):57.

Peng, C.; Cheng, J.; and Cheng, Q. 2017. A supervised learning model for high-dimensional and large-scale data. ACM Transactions on Intelligent Systems and Technology (TIST) 8(2):30.

Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323-2326.

Tao, Z.; Liu, H.; Li, S.; Ding, Z.; and Fu, Y. 2017. From ensemble clustering to multi-view clustering. In IJCAI, 2843-2849.

Tenenbaum, J. B.; De Silva, V.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319-2323.

Towne, W. B.; Rosé, C. P.; and Herbsleb, J. D. 2016. Measuring similarity similarly: LDA and human perception. ACM Transactions on Intelligent Systems and Technology 8(1).

Weinberger, K. Q.; Blitzer, J.; and Saul, L. K. 2005. Distance metric learning for large margin nearest neighbor classification. In NIPS, 1473-1480.

Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):210-227.

Xu, Z.; Jin, R.; King, I.; and Lyu, M. 2009. An extended level method for efficient multiple kernel learning. In NIPS, 1825-1832.

Xu, Z.; Jin, R.; Yang, H.; King, I.; and Lyu, M. R. 2010. Simple and efficient multiple kernel learning by group lasso. In ICML-10, 1175-1182. Citeseer.

Zhang, L.; Zhang, Q.; Du, B.; You, J.; and Tao, D. 2017. Adaptive manifold regularized matrix factorization for data clustering. In IJCAI, 3399-3405.

Zhang, L.; Yang, M.; and Feng, X. 2011. Sparse representation or collaborative representation: Which helps face recognition? In ICCV, 471-478. IEEE.

Zhao, Z.; Wang, L.; Liu, H.; and Ye, J. 2013. On similarity preserving feature selection. IEEE Transactions on Knowledge and Data Engineering 25(3):619-632.

Zhuang, L.; Zhou, Z.; Gao, S.; Yin, J.; Lin, Z.; and Ma, Y. 2017. Label information guided graph construction for semi-supervised learning. IEEE Transactions on Image Processing 26(9):4182-4192.
