A New Simplex Sparse Learning Model to Measure Data Similarity for Clustering
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Jin Huang, Feiping Nie, Heng Huang*
University of Texas at Arlington, Arlington, Texas 76019, USA
[email protected], [email protected], [email protected]

*Corresponding Author. This work was partially supported by NSF IIS-1117965, IIS-1302675, IIS-1344152, DBI-1356628.

Abstract

The Laplacian matrix of a graph is used in many areas of mathematical research and has a physical interpretation in various theories. However, there are a few open issues in Laplacian graph construction: (i) selecting the appropriate scale of analysis, (ii) selecting the appropriate number of neighbors, (iii) handling multi-scale data, and (iv) dealing with noise and outliers. In this paper, we propose that the affinity between pairs of samples can be computed using sparse representation with proper constraints. This parameter-free setting automatically produces the Laplacian graph, leads to a significant reduction in computation cost, and is robust to outliers and noise. We further provide an efficient algorithm, built on improvements of existing algorithms, to solve the difficult optimization problem. To demonstrate our motivation, we conduct spectral clustering experiments against benchmark methods. Empirical experiments on 9 data sets demonstrate the effectiveness of our method.

1 Background and Motivation

Graph theory is an important branch of mathematics and has many applications. Many algorithms are built on graphs: (1) clustering algorithms [Ng et al., 2002; Hagen and Kahng, 1992; Shi and Malik, 2000; Nie et al., 2014]; (2) manifold-based dimension reduction algorithms, such as LLE [Roweis and Saul, 2000] and Isomap [Tenenbaum et al., 2000]; (3) semi-supervised learning algorithms, such as label propagation [Zhu et al., 2003; Wang et al., 2014]; and (4) ranking algorithms, such as PageRank [Page et al., 1999] and Hyperlink-Induced Topic Search (HITS) [Kleinberg, 1999]. Many data sets also have a natural graph structure: web pages with their hyperlink structure, protein interaction networks, and social networks, just to name a few. In this paper, we limit our discussion to data sets that can be transformed into a similarity graph by simple means.

The key to many of the applications mentioned above lies in the appropriate construction of the similarity graph. There are various ways to set up such graphs, depending on both the application and the data set. However, a few issues remain open: (i) selecting the appropriate scale of analysis, (ii) selecting the appropriate number of neighbors, (iii) handling multi-scale data, and (iv) dealing with noise and outliers. A few papers [Long et al., 2006; Li et al., 2007] each solve one of these issues. In particular, the classic self-tuning spectral clustering paper [Zelnik-Manor and Perona, 2004] solves issues (i) and (iii). However, to the best of our knowledge, no single method solves all of them.

In the remainder of this section, we first provide a brief review of commonly used similarity graph construction methods. Next, we give a concise introduction to sparse representation to lay out the background of our method. In Section 2, we present our motivation and formulate the Simplex Sparse Representation (SSR) objective function, Eq. (10). We propose a novel accelerated projected gradient algorithm to optimize the objective function efficiently. Our experimental part consists of two sections: one uses a synthetic data set to highlight the difference between our method and a previous method, and the other uses real data sets to demonstrate the strong performance of SSR. We close the paper with conclusions and future work.

1.1 Different Similarity Graphs

There are many ways to transform a given set x_1, ..., x_n of data points with pairwise similarities S_ij or pairwise distances D_ij into a graph. The following are a few popular ones (a code sketch of these constructions follows the list):

1. The ε-neighborhood graph: we connect all points whose pairwise distance is smaller than ε.
2. The k-nearest neighbor graph: the vertex x_i is connected with x_j if x_j is among the k nearest neighbors of x_i.
3. The fully connected graph: all points are connected with each other as long as they have positive similarities. This construction is useful only if the similarity function models local neighborhoods. An example of such a function is the Gaussian similarity function S(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / (2\sigma^2)), where σ controls the width of the neighborhoods.
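For concreteness, the snippet below sketches one way these three constructions can be realized with NumPy; the helper name build_affinity, the column-per-sample data layout, and the symmetrization of the kNN graph are illustrative assumptions of this sketch, not details from the paper.

```python
import numpy as np

def build_affinity(X, sigma=1.0, k=None, eps=None):
    """Similarity graphs of Section 1.1 for data X of shape (d, n), one sample per column.

    eps given -> epsilon-neighborhood graph (binary weights)
    k given   -> k-nearest-neighbor graph with Gaussian weights
    neither   -> fully connected graph with Gaussian weights
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)  # pairwise squared distances
    np.fill_diagonal(D2, 0.0)

    if eps is not None:
        S = (np.sqrt(D2) < eps).astype(float)        # connect points closer than eps
        np.fill_diagonal(S, 0.0)
        return S

    S = np.exp(-D2 / (2.0 * sigma ** 2))             # Gaussian similarity function
    np.fill_diagonal(S, 0.0)
    if k is not None:                                # keep only each point's k nearest neighbors (skip self)
        nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]
        mask = np.zeros((n, n), dtype=bool)
        mask[np.repeat(np.arange(n), k), nbrs.ravel()] = True
        S = np.where(mask | mask.T, S, 0.0)          # symmetrize the kNN graph
    return S
```

For example, build_affinity(X, sigma=1.0, k=10) would return a Gaussian-weighted kNN affinity matrix whose graph Laplacian can then feed a spectral clustering routine.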
The "-neighbor graph: We connect all points whose 2000], (3) Semi-supervised learning algorithms, such as label pairwise distances are smaller than ". propagation [Zhu et al., 2003; Wang et al., 2014], (4) Rank- ing algorithms, such as Page-Rank [Page et al., 1999] and 2. k-nearest neighbor graph: The vertex xi is connected Hyperlink-Induced Topic Search (HITS) [Kleinberg, 1999]. with xj if xj is among the k-nearest neighbors of xi. Many data sets also have a natural graph structure, web-pages 3. The fully connected graph: All points are connected and hyper-link structure, protein interaction networks and so- with each other as long as they have positive similar- cial networks, just to name a few. In this paper, we limit our ities. This construction is useful only if the similar- discussion to the data sets that can be transformed to similar- ity function models local neighborhoods. An exam- ity graph by simple means. ple of such function is the Gaussian similarity function 2 −kxi−xj k ∗ S(x ; x ) = exp σ Corresponding Author. This work was partially supported by i j 2σ2 , where controls the NSF IIS-1117965, IIS-1302675, IIS-1344152, DBI-1356628. width of neighborhoods. 3569 Note that all the above three methods suffer from some of the One major disadvantage of this method is that the hyper- open issues mentioned. The "-neighbor graph method needs parameter σ in the Gaussian kernel function is very sensitive to find the appropriate " to establish the edge connections, and is difficult to tune in practice. the k-nearest neighbor graph uses the specified k to connect The sparse representation proposed in [Roweis and Saul, neighbors for each node, the σ in Gaussian similarity method 2000] can be applied here to compute the similarity matrix n−1 clearly determines the overall shape of the similarity graph. S. Specifically, the similarities αi 2 R between the i-th In short, they are not parameter free. Also, " is clearly prone feature and other features are calculated by a sparse represen- to the influence of data scale, k and σ would also be dom- tation as follows: inated by any feature scale inconsistence among these data 2 min kX−iαi − xik2 + λkαik1 (6) vectors. Therefore, a parameter free and robust similarity αi graph construction method is desired and this is our motiva- d×(n−1) tion of this paper. where X−i = [x1;:::; xi−1; xi+1;:::; xn] 2 R , i.e, the data matrix without column i. 1.2 Introduction to Sparse Representation Since the similarity matrix is usually non-negative, we can Suppose we have n data vectors with size d and arrange them impose the non-negative constraint on problem (6) to mini- into columns of a training sample matrix mize the sum of sparse representation residue error: d×n X = (x1;:::; xn) 2 : n R X 2 y min kX−iαi − xik2 + λkαik1 (7) Given a new observation , sparse representation method αi≥0 computes a representation vector i=1 T n The parameter λ is still needed to be well tuned here. It is β = (β1; ··· ; βn) 2 R (1) easy to note that Eq. (7) is the sum of n independent variables, to satisfy n as a result, we will limit the discussion to a single term in X the following context. More importantly, when the data are y ≈ xiβi = Xβ (2) T shifted by constants t = [t1; ··· ; td] , i.e., xk = xk + t for i=1 any k, the computed similarities will be changed. 
2 Similarity Matrix Construction with Simplex Representation

Assume we have the data matrix X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, where each sample is (converted into) a vector of dimension d. The most popular way to compute the similarity matrix S is the Gaussian kernel function

S_{ij} = \begin{cases} \exp\left\{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right\}, & x_i, x_j \text{ are neighbors} \\ 0, & \text{otherwise.} \end{cases}

One major disadvantage of this method is that the hyperparameter σ in the Gaussian kernel function is very sensitive and difficult to tune in practice.

The sparse representation proposed in [Roweis and Saul, 2000] can be applied here to compute the similarity matrix S. Specifically, the similarities α_i ∈ R^{n−1} between the i-th sample and the other samples are calculated by a sparse representation as follows:

\min_{\alpha_i} \|X_{-i}\alpha_i - x_i\|_2^2 + \lambda \|\alpha_i\|_1,    (6)

where X_{-i} = [x_1, ..., x_{i-1}, x_{i+1}, ..., x_n] ∈ R^{d×(n−1)}, i.e., the data matrix without column i.

Since the similarity matrix is usually non-negative, we can impose a non-negativity constraint on problem (6) and minimize the sum of the sparse representation residual errors:

\min_{\alpha_i \ge 0} \sum_{i=1}^{n} \|X_{-i}\alpha_i - x_i\|_2^2 + \lambda \|\alpha_i\|_1.    (7)

The parameter λ still needs to be carefully tuned here. Because Eq. (7) is a sum of n independent subproblems, we restrict the discussion to a single term in what follows. More importantly, when the data are shifted by a constant vector t = [t_1, ..., t_d]^T, i.e., each x_k is replaced by x_k + t, the computed similarities change. To obtain shift-invariant similarities, we need

\|(X_{-i} + t\mathbf{1}^T)\alpha_i - (x_i + t)\|_2^2 = \|X_{-i}\alpha_i - x_i\|_2^2,    (8)

which leads to the constraint α_i^T 1 = 1. Imposing the constraint α_i^T 1 = 1 on problem (7), we have

\min_{\alpha_i} \|X_{-i}\alpha_i - x_i\|_2^2 + \lambda \|\alpha_i\|_1 \quad \text{s.t.} \quad \alpha_i \ge 0, \ \alpha_i^T \mathbf{1} = 1.    (9)

Interestingly, the constraints in problem (9) make the second term constant, so problem (9) becomes

\min_{\alpha_i} \|X_{-i}\alpha_i - x_i\|_2^2 \quad \text{s.t.} \quad \alpha_i \ge 0, \ \alpha_i^T \mathbf{1} = 1.    (10)

The constraint set in problem (10) is a simplex, so we call this kind of sparse representation simplex representation. Note that these constraints (ℓ_1-ball constraints) indeed induce sparse solutions α_i and have seen empirical success in various applications. To solve Eq. (10) for high-dimensional data, it is more appropriate to apply first-order methods, i.e., methods that use function values and their (sub)gradients at each iteration. There are quite a few classic first-order methods, including gradient descent, subgradient descent, and Nesterov's optimal method [Nesterov, 2003]. In this paper, we use the accelerated projected gradient method to optimize Eq. (10). We present the details of the optimization and the algorithm in the next section.

3 Optimization Details and Algorithm

Our accelerated projected gradient method introduces an auxiliary variable to convert the objective into an easier one; meanwhile, the auxiliary variable approximates and converges to the solution α in an iterative manner.

Input: Data Matrix X
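The paper's algorithm listing is truncated after its input line above, so as a rough stand-in the following is a generic FISTA-style accelerated projected gradient sketch for problem (10), not the paper's exact algorithm: the step size 1/L with L = 2‖X_{−i}‖²_2, the standard sort-based Euclidean projection onto the simplex, and the function names are assumptions made for illustration.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, 1^T a = 1} (sort-based method)."""
    u = np.sort(v)[::-1]                                   # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def simplex_sparse_repr(X_mi, x_i, n_iter=300):
    """Sketch of accelerated projected gradient for problem (10):
       min_a ||X_mi a - x_i||_2^2  s.t.  a >= 0, 1^T a = 1."""
    m = X_mi.shape[1]
    alpha = np.full(m, 1.0 / m)                            # feasible starting point
    y, t = alpha.copy(), 1.0                               # Nesterov/FISTA momentum variables
    L = 2.0 * np.linalg.norm(X_mi, 2) ** 2                 # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X_mi.T @ (X_mi @ y - x_i)             # gradient of the quadratic objective
        alpha_new = project_simplex(y - grad / L)          # projected gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)
        alpha, t = alpha_new, t_new
    return alpha
```

Running this routine for each column x_i against X_{−i} yields the vectors α_i, which can be assembled (with a zero inserted at position i) into the similarity matrix S; symmetrizing S as (S + Sᵀ)/2 before spectral clustering is one common convention, though not necessarily the paper's.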