A High Performance Implementation of Spectral Clustering on CPU-GPU Platforms
Yu Jin and Joseph F. JaJa
Institute for Advanced Computer Studies
Department of Electrical and Computer Engineering
University of Maryland, College Park, USA
Email: [email protected], [email protected]

arXiv:1802.04450v1 [cs.DC] 13 Feb 2018

Abstract—Spectral clustering is one of the most popular graph clustering algorithms, which achieves the best performance for many scientific and engineering applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. In this paper, we present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform. Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading and SIMD capabilities of GPUs. Given the input as data points in a high dimensional space, we propose a parallel scheme to build a sparse similarity graph represented in a standard sparse representation format. Then we compute the smallest k eigenvectors of the Laplacian matrix by utilizing the reverse communication interfaces of the ARPACK software and the cuSPARSE library, where k is typically very large. Moreover, we implement a very fast parallelized k-means algorithm on GPUs. Our implementation is shown to be significantly faster than the best known Matlab and Python implementations for each step. In addition, our algorithm scales to problems with a very large number of clusters.

Index Terms—CPU-GPU platform; spectral clustering; sparse similarity graph; reverse communication interface; k-means clustering

I. INTRODUCTION

The spectral clustering algorithm has recently gained popularity in handling many graph clustering tasks such as those reported in [1, 2, 3]. Compared to traditional clustering algorithms, such as k-means clustering and hierarchical clustering, spectral clustering has a very well formulated mathematical framework and is able to discover non-convex regions which may not be detected by other clustering algorithms. Moreover, spectral clustering can be conveniently implemented by linear algebra operations using popular scientific software environments such as Matlab and Python. Most of the available software implementations are built upon CPU-optimized Basic Linear Algebra Subprograms (BLAS), usually accelerated using multi-thread programming. However, such implementations scale poorly as the problem size or the number of clusters grows very large. Recent results show that GPU-accelerated BLAS significantly outperforms multi-threaded BLAS libraries such as the Intel MKL package, LAPACK and Goto BLAS [4, 5]. Moreover, hybrid computing environments, which collaboratively combine the computational advantages of GPUs and CPUs, further boost the overall performance and are able to achieve very high performance on problems whose sizes grow up to the capacity of CPU memory [6, 7, 8, 9, 10, 11]. In this paper, we present a hybrid implementation of the spectral clustering algorithm which significantly outperforms the known implementations, most of which are purely based on multi-core CPUs.

There have been reported efforts on parallelizing the spectral clustering algorithm. Zheng et al. [12] presented both CUDA and OpenMP implementations of spectral clustering. However, their implementation was targeted for a much smaller data size than the work in this paper, and moreover, it achieved a relatively limited speedup. Matam et al. [13] implemented a special case of spectral clustering, namely the spectral bisection algorithm, which was shown to achieve high speed-ups compared to Matlab and Intel MKL implementations. Chen et al. [14, 15] implemented the spectral clustering algorithm in a distributed environment using the Message Passing Interface (MPI), targeting problems whose sizes could not fit in the memory of a single machine. Tsironis and Sozio [16] proposed an implementation of spectral clustering based on MapReduce. Both implementations were targeted for clusters, and involve frequent data communications which clearly constrain the overall performance.

In this paper, we present a hybrid implementation of spectral clustering on a CPU-GPU heterogeneous platform which significantly outperforms all the best implementations we are aware of, which are based on existing parallel platforms. We highlight the main contributions of our paper as follows:
• Our algorithm is the first work to comprehensively explore the hybrid implementation of the spectral clustering algorithm on CPU-GPU platforms.
• Our implementation makes use of a sparse representation of the corresponding graphs and can handle extremely large input sizes and generate a very large number of clusters.
• The hybrid implementation is highly efficient and is shown to make very good use of the available resources.
• Our experimental results show superior performance relative to the common scientific software implementations on multicore CPUs.

The rest of the paper is organized as follows. Section II gives an overview of the spectral clustering algorithm, while describing the important steps in some detail. Section III describes the operating environment and the necessary software dependencies. Section IV provides a description of our parallel implementation, while Section V evaluates the performance of our algorithm with a comparison with Matlab and Python implementations on both synthetic and real-world datasets. The codes are available on https://github.com/yuj-umd/fastsc.

II. OVERVIEW OF THE SPECTRAL CLUSTERING ALGORITHM

Spectral clustering was first introduced in 1973 to study the graph partition problem [17]. Later, the algorithm was extended in [18, 19], and generalized to a wide range of applications, such as computational biology [20, 21], medical image analysis [2, 3], social networks [22, 23] and information retrieval [24, 25]. A standard procedure of the spectral clustering algorithm to compute k clusters is described next [26]:
• Step 1: Given a set of data points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and some similarity measure $s(x_i, x_j)$, construct a sparse similarity matrix $W$ that captures the significant similarities between the pairs of points.
• Step 2: Compute the normalized graph Laplacian matrix $L_n = D^{-1}L$, where $L$ is the unnormalized graph Laplacian matrix defined as $L = D - W$ and $D$ is the diagonal matrix with elements $D_{i,i} = \sum_{j=1}^{n} W_{i,j}$.
• Step 3: Compute the $k$ eigenvectors of the normalized graph Laplacian matrix $L_n$ corresponding to the smallest $k$ nonzero eigenvalues.
• Step 4: Apply the k-means clustering algorithm on the rows of the matrix whose columns are the $k$ eigenvectors to obtain the final clusters.

Given the similarity graph defined by the similarity matrix $W$, the basic idea behind spectral clustering is to partition the graph into $k$ partitions such that some measure of the cut between the partitions is minimized. The traditional graph cut is defined as follows:

$$Cut(A_1, A_2, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i), \quad (1)$$

$$W(A, \bar{A}) := \sum_{i \in A,\, j \in \bar{A}} w_{ij} \quad (2)$$

To ensure that each partition represents a meaningful cluster of reasonable size, two alternative cut measures are often used, namely the RatioCut and the normalized cut Ncut. Note that below we use $|A_i|$ for the number of nodes in $A_i$ and $vol(A_i)$ for the sum of the degrees of all the nodes in $A_i$:

$$RatioCut(A_1, A_2, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|}, \quad (3)$$

$$Ncut(A_1, A_2, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{vol(A_i)}. \quad (4)$$

In our implementation, we focus on the problem of minimizing the Ncut, which has an equivalent algebraic formulation as defined next:

$$\min_{H} \; trace(H^{T} L H) \quad \text{subject to} \quad H^{T} D H = I \quad (5)$$

That is, we need to determine a matrix $H \in \mathbb{R}^{n \times k}$ whose columns are indicator vectors, and which minimizes the objective function introduced above.

Since this problem is NP-hard, we remove the discrete constraints on $H$, thereby allowing $H$ to be any matrix in $\mathbb{R}^{n \times k}$. Note that there is no theoretical guarantee on the quality of the solution of the relaxed problem compared to the exact solution of the discrete version. It turns out that the relaxed problem is a well-known trace minimization problem, which can be exactly solved by taking $H$ as the eigenvectors corresponding to the smallest $k$ eigenvalues of the matrix $L_n = D^{-1}L$, or equivalently the $k$ generalized eigenvectors corresponding to the smallest $k$ eigenvalues of $Lx = \lambda Dx$. The k-means clustering is then applied on the rows of $H$ to obtain the desired clustering.

The algorithm described above begins with a set of d-dimensional data points and builds the similarity graph explicitly from the pair-wise similarity metric. The similarity graph is usually stored in a sparse matrix representation, which often reduces the memory requirement and computational cost to linear instead of quadratic. For the general graph clustering problem whose input is specified as a graph, our spectral clustering algorithm starts directly at Step 2. Otherwise, we build our sparse graph representation from the given set of data points.

III. ENVIRONMENT SETUP

A. The Heterogeneous System

The CPU-GPU heterogeneous system used in our implementation is specified in Table I.

TABLE I. CPU and GPU specifics

CPU Model            Intel Xeon E5-2690
CPU Cores            8
DRAM Size            128GB
GPU Model            Tesla K20c
Device Memory Size   5GB GDDR5
SMs and SPs          13 and 192
Compute Capability   3.5
CUDA SDK             7.5
PCIe Bus             PCIe x16 Gen2

The CPU and the GPU communicate through the PCIe bus, whose theoretical peak bandwidth is 8 GB/s. The cost of data communication can be quite significant for large-scale problems. To achieve the best overall performance, our implementation leverages the GPU to compute the most computationally expensive parts while minimizing the data transfer between the host and the device.