Optimal Neighborhood Multiple Kernel Clustering with Adaptive Local Kernels

Jiyuan Liu, Xinwang Liu∗, Senior Member, IEEE, Jian Xiong, Qing Liao, Sihang Zhou, Siwei Wang and Yuexiang Yang∗

• J. Liu, X. Liu, S. Zhou, S. Wang and Y. Yang are with the College of Computer, National University of Defense Technology, Changsha 410073, China. E-mail: {liujiyuan13, xinwangliu, yyx}@nudt.edu.cn
• J. Xiong is with the School of Business Administration, Southwestern University of Finance and Economics, Chengdu, Sichuan 611130, China.
• Q. Liao is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China.
• Corresponding authors: Xinwang Liu and Yuexiang Yang.

Manuscript received December 8, 2019.

Abstract—Multiple kernel clustering (MKC) algorithms aim to group data into different categories by optimally integrating information from a group of pre-specified kernels. Though demonstrating superiority in various applications, existing MKC algorithms usually do not sufficiently consider the local density around individual data samples and excessively limit the representation capacity of the learned optimal kernel, leading to unsatisfying performance. In this paper, we propose an algorithm, called optimal neighborhood MKC with adaptive local kernels (ON-ALK), to address the two issues. Specifically, we construct adaptive local kernels to sufficiently consider the local density around individual data samples, where different numbers of neighbors are selected for individual samples. Further, the proposed ON-ALK algorithm boosts the representation capacity of the learned optimal kernel by relaxing it into the neighborhood area of the weighted combination of the pre-specified kernels. To solve the resultant optimization problem, a three-step iterative algorithm is designed and theoretically proven to be convergent. We further study the generalization bound of the proposed algorithm. Extensive experiments have been conducted to evaluate the clustering performance. As indicated, the algorithm significantly outperforms state-of-the-art methods in the recent literature on six challenging benchmark datasets, verifying its advantages and effectiveness.

Index Terms—Multiple kernel clustering, kernel alignment, kernel k-means.

1 INTRODUCTION

Kernel clustering has been widely explored in the current literature. It implicitly maps the original non-separable data into a high-dimensional Hilbert space where the corresponding vertices have a clear decision boundary. Then, various clustering methods, including k-means [1], [2], fuzzy c-means [3], spectral clustering [4] and the Gaussian Mixture Model (GMM) [5], are applied to group the unlabeled data into categories. Although kernel clustering algorithms have achieved great success in a large volume of applications, they are only able to handle data with a single kernel. Meanwhile, kernel functions are of different types, such as polynomial, Gaussian, linear, etc., and are parameterized manually. How to choose the right kernel function and pre-define its parameters optimally for a specific clustering task is still an open problem. Moreover, sample features are collected from different sources in most practical settings. For example, news is reported by multiple news organizations; a person can be described by fingerprint, palm veins, palm print, DNA, etc. The most common approach is to concatenate all features into one vector, but it ignores the fact that the features may not be directly comparable.

Multiple kernel clustering (MKC) algorithms, which utilize the complementary information from the pre-specified kernels, are well studied in the literature to address the aforementioned issues and can be roughly grouped into three categories. Methods in the first category intend to construct a consensus kernel for clustering by integrating low-rank optimization [6], [7], [8], [9], [10], [11]. For instance, Zhou et al. firstly recover a shared low-rank matrix from the transition probability matrices of multiple kernels, and then use it as the input to the standard Markov chain method for clustering [10]. Techniques in the second category compute their clustering results from the partition matrices generated on each individual kernel. Liu et al. firstly perform kernel k-means on each incomplete view and then explore the complementary information among all incomplete clustering results to obtain a final solution [12]. In contrast, algorithms of the third category build the consensus kernel along with the clustering process. Most of them take the basic assumption that the optimal kernel can be represented as a weighted combination of pre-specified kernels. Huang et al. extend fuzzy c-means by incorporating multiple kernels and automatically adjusting the kernel weights, which makes the clustering algorithm more immune to ineffective kernels and irrelevant features [13]. They also show multiple kernel k-means to be a special case of multiple kernel fuzzy c-means. The weighted combination assumption is also applied in spectral clustering, such as [14], [15]. Similarly, Yu et al. optimize the kernel weights based on the same Rayleigh quotient objective and claim their algorithm is of lower complexity [16]. Apart from this, various regularizations are formulated to help constrain the kernel weights and the affinity matrix. For example, Du et al. use the L2,1-norm in the original feature space to minimize the reconstruction error [17]. Liu et al. propose matrix-induced regularization to prevent highly imbalanced weight assignment so as to reduce the mutual redundancy among kernels and enhance the diversity of the selected kernels [18]. Zhao et al. assume each pre-specified kernel is constructed from the consensus matrix and a transition probability matrix [19], and regularize the two types of matrices to be low-rank and sparse respectively. Liu et al. deal with incomplete kernels and propose a mutual kernel completion term to compute the missing items in the kernels and learn the kernel weights simultaneously [20]. Instead of assuming the equality of samples in one kernel, some researches perform clustering by assigning different weights to samples, such as [21], [22].

Kernel alignment is an effective regularization in multiple kernel k-means algorithms [23], [24], [25]. However, Li et al. claim that kernel alignment forces all sample pairs to be equally aligned with the same ideal similarity, conflicting with the well-established concept that aligning two farther samples with a low similarity is less reliable in a high-dimensional space [26]. Observing that the local kernel trick in [27] can better capture sample-specific characteristics of the data, they use the neighbors of each sample to construct local kernels and maximize the sum of their alignments with the ideal similarity matrix [26]. Additionally, the local kernel is demonstrated to be capable of helping the clustering algorithm better use the information provided by closer sample pairs [28].

The aforementioned MKC algorithms suffer from two facts: not sufficiently considering the local density around individual data samples and excessively limiting the representation capacity of the learned optimal kernel. Specifically, the local kernel in [26] globally sets the number of neighbors of each sample to a constant, which cannot guarantee that all sample pairs in the local kernel are relatively close. It is known that performing alignment with farther sample pairs is less reliable. Therefore, this local kernel cannot reduce the unreliability to a minimum, since it overlooks the local density around individual data samples. At the same time, most MKC algorithms assume that the optimal kernel is a weighted combination of pre-specified kernels, but ignore some more robust kernels in the complement set of the kernel combinations. To address the two issues, we propose an MKC algorithm, called optimal neighborhood multiple kernel clustering with adaptive local kernels. Specifically, we design the adaptive local kernel, which is constructed by selecting different numbers of neighbors whose similarities to each other are lower bounded by a pre-defined threshold. The constructed adaptive local kernels are then applied in the MKC model. Meanwhile, the algorithm relaxes the rigid constraint of learning the optimal kernel from combinations of pre-specified kernels into their neighborhood areas. We also design an iterative algorithm to solve the resultant optimization problem. Our experiments show a competitive edge over state-of-the-art clustering algorithms on various datasets. The main contributions of this paper are highlighted as follows:

• In order to address the two issues in current MKC algorithms, i.e. not sufficiently considering the local density around individual data samples and excessively limiting the representation capacity of the learned optimal kernel, we design the adaptive local kernel and locate the optimal kernel in the neighborhood area of linear combinations of pre-specified kernels. The two techniques are then utilized in a single multiple kernel clustering framework.
• We derive an algorithm, named optimal neighborhood multiple kernel clustering with adaptive local kernels, and design a three-step iterative algorithm to solve the resultant optimization problem, whose convergence is proven.
• The generalization ability of the proposed algorithm is studied, and the generalization bound is proven to be $O(\sqrt{1/n})$.
• Comprehensive experiments on six challenging benchmark datasets are conducted to validate the effectiveness of the proposed algorithm. As demonstrated, the proposed algorithm outperforms state-of-the-art clustering methods in the recent literature.

The rest of this paper is organized as follows: Section 2 presents a review of related work. Section 3 is devoted to the proposed ON-ALK algorithm. Section 4 explores its generalization ability. Extensive experiments are conducted in Section 5 to support our claims. We make some discussions and introduce potential future work in Section 6, and finish the paper with a conclusion in Section 7.

2 RELATED WORK

In this section, we introduce some related work, including kernel k-means, multiple kernel k-means and regularized multiple kernel k-means.

2.1 Kernel k-means

Given a feature space $\mathcal{X}$ and a collection of $n$ samples $\{x_i\}_{i=1}^{n}$, the feature map $\varphi(\cdot): \mathcal{X} \rightarrow \mathcal{H}$ maps $\mathcal{X}$ into a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ [29], such that for any $x \in \mathcal{X}$ we have $\phi = \varphi(x)$. The k-means algorithm aims to partition the samples into $k$ disjoint clusters, each characterized by its centroid $c_j$. The sample set associated with centroid $c_j$ is defined as $\mathcal{C}_j = \{i \mid j = \arg\min_{s \in [k]} \|\phi_i - c_s\|\}$; in other words, a point $\phi_i$ belongs to the $j$-th cluster if $c_j$ is its closest centroid. The k-means objective is presented as
$$\min \frac{1}{n}\sum_{i=1}^{n} \min_{j=1,\cdots,k} \|\phi_i - c_j\|^2. \quad (1)$$
With the cluster assignment matrix $Z \in \{0,1\}^{n \times k}$, where $Z_{ij} = 1$ if $i \in \mathcal{C}_j$, the objective can be written as
$$\min_{Z} \frac{1}{n}\sum_{i=1,j=1}^{n,k} Z_{ij}\|\phi_i - c_j\|^2, \quad (2)$$
in which $\sum_{j=1}^{k} Z_{ij} = 1$ and $\sum_{i=1}^{n} Z_{ij} = |\mathcal{C}_j|$.

In most cases, $\phi \in \mathbb{R}^{D}$ with $D \gg n$ or even infinite, and the feature map $\varphi(\cdot)$ is implicit. Therefore, it is difficult to directly apply k-means in the produced RKHS. By using the kernel trick, where $K(x, x') = \phi^{\top}\phi'$ and the kernel function $K(\cdot,\cdot)$ can be explicitly given, the kernel matrix $\mathbf{K}$ is calculated. By expanding the quadratic term in Eq. (2), the objective is formulated as
$$\min_{Z} \operatorname{Tr}(\mathbf{K}) - \operatorname{Tr}\Big(\mathbf{L}^{\frac{1}{2}} Z^{\top} \mathbf{K} Z \mathbf{L}^{\frac{1}{2}}\Big), \quad (3)$$
where $\mathbf{L} = \operatorname{diag}(|\mathcal{C}_1|^{-1}, |\mathcal{C}_2|^{-1}, \cdots, |\mathcal{C}_k|^{-1})$ and $\operatorname{Tr}(\cdot)$ represents the matrix trace. Observing that the discrete $Z$ makes the optimization problem hard to solve, a common approach is to relax $Z$ to take real values. Specifically, by defining $H = Z\mathbf{L}^{\frac{1}{2}}$ and letting $H$ take real values, a relaxed version of the above problem can be obtained as
$$\min_{H} \operatorname{Tr}\big(\mathbf{K}(\mathbf{I}_n - HH^{\top})\big) \quad \text{s.t.} \quad H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k. \quad (4)$$
With the obtained $H$, k-means is applied to compute the discrete cluster assignments.

2.2 Multiple kernel k-means

In multiple kernel settings, the two circumstances mentioned in Section 1, i.e. multiple kernel functions on single-view data and multiple-view data, are considered. Given $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ as a collection of $n$ samples in a single view, and $\{\{x_i^{(p)}\}_{i=1}^{n}\}_{p=1}^{m} \subseteq \{\mathcal{X}_p\}_{p=1}^{m}$ as $m$ views' data, each consisting of $n$ samples, $\varphi_p(\cdot): x \in \mathcal{X} \mapsto \mathcal{H}_p$ is the $p$-th feature map that maps $x$ onto a Reproducing Kernel Hilbert Space $\mathcal{H}_p$ ($1 \le p \le m$). Each sample in $\mathcal{H}_p$ is represented as $\varphi_p(x_i)$ or $\varphi_p(x_i^{(p)})$, which can be written in a unified form, i.e. $\phi_i^{(p)}$. The current literature assigns the feature maps with different weights, $\phi_{\beta,i} = [\sqrt{\beta_1}\phi_i^{(1)}, \sqrt{\beta_2}\phi_i^{(2)}, \cdots, \sqrt{\beta_m}\phi_i^{(m)}]^{\top}$, where the kernel function can be expressed as
$$K_{\beta}(x_i, x_j) = \phi_{\beta,i}^{\top}\phi_{\beta,j} = \sum_{p=1}^{m} \beta_p K_p(x_i^{(p)}, x_j^{(p)}). \quad (5)$$
With imputed samples and pre-specified kernel functions, the weighted combination of kernel matrices is computed as
$$\mathbf{K}_{\beta} = \sum_{p=1}^{m} \beta_p \mathbf{K}_p. \quad (6)$$
Applying the hybrid kernel matrix $\mathbf{K}_{\beta}$ to Eq. (4), the objective of multiple kernel k-means is formulated as
$$\min_{H, \beta} \operatorname{Tr}\big(\mathbf{K}_{\beta}(\mathbf{I}_n - HH^{\top})\big) \quad \text{s.t.} \quad H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k,\; \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p, \quad (7)$$
in which $\mathbf{I}_k$ is an identity matrix of size $k \times k$.

The optimization problem can be solved by alternately updating $H$ and $\beta$. With $\beta$ fixed, $H$ can be obtained by solving the kernel k-means clustering objective shown in Eq. (4), while, with $H$ given, $\beta$ can be optimized by solving
$$\min_{\beta} \sum_{p=1}^{m} \beta_p \operatorname{Tr}\big(\mathbf{K}_p(\mathbf{I}_n - HH^{\top})\big) \quad \text{s.t.} \quad \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p. \quad (8)$$
Nevertheless, some literature also adopts an L2-norm combination to construct the optimal kernel, rather than the linear combination of the pre-specified kernels in Eq. (6) [30], [31].

2.3 Regularized multiple kernel k-means

Multiple kernel k-means in Eq. (7) achieves a promising clustering performance, but it fails to sufficiently consider the relationships among the pre-specified kernels. Several regularizations are proposed to address this issue. Given a set of kernel matrices $\{\mathbf{K}_p\}_{p=1}^{m}$, a constant matrix $\mathbf{M}$ is defined with $\mathbf{M}_{pq} = \operatorname{Tr}(\mathbf{K}_p^{\top}\mathbf{K}_q)$. Ivano et al. notice that a small set of base kernels may contain a large quantity of helpful clustering information, and design the Spectral Ratio (SR) regularization, i.e. Eq. (9), to allocate big weights to those kernels but leave the others close to zero [32]:
$$\max_{\beta} \beta^{\top}\mathbf{M}\beta \quad \text{s.t.} \quad \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p. \quad (9)$$
Meanwhile, Liu et al. find that similar kernels carry less complementary clustering information, while kernels of large differences contain more. Therefore, they propose the matrix-induced regularization, as shown in Eq. (10), to reduce the redundancy of similar kernels by assigning relatively small weights to them [18]:
$$\min_{\beta} \beta^{\top}\mathbf{M}\beta \quad \text{s.t.} \quad \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p. \quad (10)$$
Both regularizations have their own application settings, and we regularize the kernel relationships via the matrix-induced term in this paper. The regularized multiple kernel k-means model corresponding to Eq. (7) is formulated as
$$\min_{H, \beta} \operatorname{Tr}\big(\mathbf{K}_{\beta}(\mathbf{I}_n - HH^{\top})\big) + \lambda \beta^{\top}\mathbf{M}\beta \quad \text{s.t.} \quad H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k,\; \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p. \quad (11)$$
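For concreteness, the alternating scheme behind Eqs. (7)-(11) can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch rather than any reference implementation: the function names (`regularized_mkkm`, `update_H`, `update_beta`) are ours, the kernels are assumed to be given as an (m, n, n) array, and SLSQP is used as one off-the-shelf solver for the simplex-constrained quadratic program in the β-step.

```python
import numpy as np
from scipy.optimize import minimize

def update_H(K_beta, k):
    # H-step of Eq. (11): the relaxed kernel k-means of Eq. (4), whose optimum
    # is spanned by the top-k eigenvectors of the combined kernel K_beta.
    vals, vecs = np.linalg.eigh(K_beta)      # eigenvalues in ascending order
    return vecs[:, -k:]                      # n x k, satisfies H^T H = I_k

def update_beta(kernels, H, M, lam):
    # beta-step of Eq. (11): quadratic program over the simplex.
    # Linear part from Tr(K_p (I - H H^T)), quadratic part from lam * beta^T M beta.
    m = kernels.shape[0]
    c = np.array([np.trace(Kp) - np.trace(H.T @ Kp @ H) for Kp in kernels])
    obj = lambda b: c @ b + lam * b @ M @ b
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
                   constraints=cons, method='SLSQP')
    return res.x

def regularized_mkkm(kernels, k, lam=1.0, iters=20):
    # Alternate the two updates of Eq. (11); kernels is an (m, n, n) array.
    m = kernels.shape[0]
    M = np.array([[np.trace(Kp @ Kq) for Kq in kernels] for Kp in kernels])
    beta = np.full(m, 1.0 / m)
    for _ in range(iters):
        K_beta = np.tensordot(beta, kernels, axes=1)   # sum_p beta_p K_p
        H = update_H(K_beta, k)
        beta = update_beta(kernels, H, M, lam)
    return H, beta
```

The H-step is exactly the relaxed kernel k-means of Eq. (4), which is why the top-k eigenvectors of K_β solve it; any other simplex-constrained QP solver could replace SLSQP in the β-step.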

3 THE PROPOSED ALGORITHM

In this section, the proposed optimal neighborhood multiple kernel clustering with adaptive local kernels (ON-ALK) algorithm is derived. Then, a three-step iterative algorithm is designed to solve the resultant optimization problem. Finally, the convergence of the iterative algorithm is proven and its computational complexity is analyzed.

3.1 The proposed formula

To address the aforementioned issues, i.e. insufficient consideration of sample similarity variances and the limited representation capacity of the learned optimal kernel, we correspondingly propose the adaptive local kernel and utilize it together with the optimal neighborhood kernel [33] in a single MKC framework.

Current multiple kernel clustering algorithms indiscriminately force all samples to align with the ideal affinity matrix, i.e. $HH^{\top}$ in Eq. (11), without considering the non-negligible unreliability from the alignment of farther samples with low similarities. Observing this drawback, we construct the adaptive local kernel, which is a sub-matrix of the kernel and reflects the relationships between a sample and its neighbors. Firstly, a threshold $\zeta$ is defined, and the corresponding index set $\Omega^{(i)}$ for the $i$-th sample can be written as
$$\Omega^{(i)} = \{j \mid \mathbf{K}(i, j) \ge \zeta\}. \quad (12)$$
Then, the corresponding indicator matrix $S^{(i)} \in \{0,1\}^{n \times \mu^{(i)}}$ with $\mu^{(i)} = |\Omega^{(i)}|$ is defined as
$$S^{(i)}(i', j') = \begin{cases} 1 & \text{if } i' \in \Omega^{(i)} \text{ and } j' \text{ is the index of } i' \text{ in } \Omega^{(i)}, \\ 0 & \text{otherwise}. \end{cases} \quad (13)$$
The $i$-th adaptive local kernel of matrix $\mathbf{K}$ can be written as
$$\mathbf{K}^{(i)} = S^{(i)\top}\mathbf{K}S^{(i)} \in \mathbb{R}^{\mu^{(i)} \times \mu^{(i)}}. \quad (14)$$
In other words, Eq. (14) selects the neighbor samples whose kernel values with respect to the $i$-th sample are larger than $\zeta$ and removes the others. Using the constructed local kernels in multiple kernel k-means and setting the trade-off on the matrix-induced regularization $\lambda$ to 1, Eq. (11) can be rewritten as
$$\min_{H, \beta} \frac{1}{n}\sum_{i=1}^{n}\Big[\operatorname{Tr}\big(\mathbf{K}_{\beta}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big) + \beta^{\top}\mathbf{M}^{(i)}\beta\Big], \quad (15)$$
in which $\mathbf{K}_{\beta}^{(i)} = S^{(i)\top}\mathbf{K}_{\beta}S^{(i)}$, $\mathbf{M}_{pq}^{(i)} = \operatorname{Tr}(\mathbf{K}_p^{(i)}\mathbf{K}_q^{(i)})$, $H^{(i)} = S^{(i)\top}H$, $\mathbf{I}_{\mu^{(i)}}$ is the identity matrix of size $\mu^{(i)}$, and $\mu^{(i)}$ varies with the density around the samples.

The proposed adaptive local kernel extends the local kernel in [26], which indiscriminately fixes the sizes of all local kernels to a constant and therefore cannot guarantee that all sample pairs in one local kernel are of high similarity. On the contrary, we construct the $i$-th adaptive local kernel by selecting the samples whose similarities to sample $i$ are higher than a threshold $\zeta$. Fig. 1 compares the two types of local kernel. It can be seen that the local kernels generated in [26] are of the same size, while the proposed adaptive local kernels are constructed from the similarities of sample pairs. Comparing (b.1)/(b.2) and (c.1)/(c.2) in Fig. 1, we can notice that the proposed adaptive local kernel is usually smaller than the local kernel in [26], guaranteeing that all neighbors are of relatively high similarity and reducing the unreliability from aligning farther sample pairs.

Fig. 1: Local kernel comparison. The darkness of the boxes indicates the similarity degree between sample pairs: the darker a box is, the more similar the corresponding sample pair is. (a) is the original kernel matrix; (b.1) and (b.2) are the local kernels generated in [26] corresponding to the 1st/3rd sample, whose size μ is fixed to 3; (c.1) and (c.2) are the proposed adaptive local kernels corresponding to the 1st/3rd sample, in which the similarities with the neighbors are higher than ζ.

Eq. (15) can be regarded as two parts. The first term represents the sum of multiple kernel clustering losses with local kernels, while the second term can be seen as a regularization to balance the weight assignment, enhancing the diversity of the selected kernels. It assumes the linear combination of pre-specified kernels, i.e. $\mathbf{K}_{\beta} = \sum_{p=1}^{m}\beta_p\mathbf{K}_p$, as the optimal kernel. However, this excessively limits the representativity of the optimal kernel, preventing it from finding a more robust kernel in the complement set of the kernel combinations. Therefore, we assume that the optimal kernel, termed $\mathbf{J}$, resides in the neighborhood of the kernel combinations, presented as
$$\mathcal{N} = \{\mathbf{J} \mid \|\mathbf{J} - \mathbf{K}_{\beta}\|_{F}^{2} \le \theta,\; \mathbf{J} \succeq 0\}. \quad (16)$$
The assumption is further applied in Eq. (15), contributing to
$$\min_{H, \beta, \mathbf{J}} \frac{1}{n}\sum_{i=1}^{n}\Big[\operatorname{Tr}\big(\mathbf{J}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big) + \beta^{\top}\mathbf{M}^{(i)}\beta\Big] \quad \text{s.t.} \quad \mathbf{J} \in \mathcal{N}. \quad (17)$$
The objective in Eq. (17) is difficult to optimize due to the constraints on $\mathbf{J}$. Observing that $\mathbf{K}_{\beta}$ provides the prior knowledge for clustering, $\mathbf{J}$ is more likely to reach its optimality with a smaller gap to $\mathbf{K}_{\beta}$. Instead of setting the maximal gap $\theta$ explicitly, we learn the real gap along with the clustering process, which formulates our final objective as
$$\begin{aligned}\min_{H, \beta, \mathbf{J}} \quad & \frac{1}{n}\sum_{i=1}^{n}\Big[\operatorname{Tr}\big(\mathbf{J}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big) + \beta^{\top}\mathbf{M}^{(i)}\beta\Big] + \frac{\rho}{2}\|\mathbf{J} - \mathbf{K}_{\beta}\|_{F}^{2} \\ \text{s.t.} \quad & H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k,\; \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p,\; \mathbf{J} \succeq 0,\end{aligned} \quad (18)$$
where $\mathbf{M}_{pq}^{(i)} = \operatorname{Tr}(\mathbf{K}_p^{(i)}\mathbf{K}_q^{(i)})$, $\mathbf{K}^{(i)} = S^{(i)\top}\mathbf{K}S^{(i)}$, $\mathbf{J}^{(i)} = S^{(i)\top}\mathbf{J}S^{(i)}$, $S^{(i)} \in \{0,1\}^{n\times\mu^{(i)}}$ indicates the $\mu^{(i)}$ nearest neighbors of the $i$-th sample, $n$ is the number of all samples, and $\mathbf{I}_{\mu^{(i)}}$ is the identity matrix of size $\mu^{(i)}$. The optimal kernel $\mathbf{J}$ serves as a bridge connecting the clustering process, i.e. the first term of Eq. (18), with the knowledge-obtaining process, i.e. the last term of Eq. (18). In this circumstance, it explores the complementary information in the pre-specified kernels to help the clustering process, and, as feedback, uses the information from clustering to help the weight assignment of the pre-specified kernels.
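The construction of Eqs. (12)-(14) amounts to thresholding one row of the kernel matrix and slicing out the corresponding sub-kernel; the selection matrix $S^{(i)}$ never has to be formed explicitly. The following NumPy sketch (with illustrative function names and a toy RBF kernel) is one way to realize it.

```python
import numpy as np

def adaptive_local_indices(K, zeta):
    # Eq. (12): for each sample i, keep the neighbors j with K(i, j) >= zeta.
    # Returns a list of index arrays Omega^(i); their lengths mu^(i) vary
    # with the local density around each sample.
    return [np.where(K[i] >= zeta)[0] for i in range(K.shape[0])]

def adaptive_local_kernel(K, omega_i):
    # Eq. (14): K^(i) = S^(i)T K S^(i) is the sub-kernel indexed by Omega^(i);
    # the explicit 0/1 selection matrix S^(i) of Eq. (13) is not materialized.
    return K[np.ix_(omega_i, omega_i)]

if __name__ == "__main__":
    # illustrative usage on a toy RBF kernel of six 2-D points
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))
    K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
    omegas = adaptive_local_indices(K, zeta=0.5)
    K_0 = adaptive_local_kernel(K, omegas[0])   # local kernel of sample 0
    print([len(o) for o in omegas], K_0.shape)
```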

3.2 Alternate optimization

In order to solve the objective in Eq. (18), we carefully design a three-step iterative algorithm, in which two variables are fixed while optimizing the remaining one.

i) Optimizing H with fixed J and β. Given $\mathbf{J}$ and $\beta$, $\frac{1}{n}\sum_{i=1}^{n}\beta^{\top}\mathbf{M}^{(i)}\beta$ and $\|\mathbf{J}-\mathbf{K}_{\beta}\|_F^2$ are constants, and Eq. (18) reduces to
$$\min_{H} \frac{1}{n}\sum_{i=1}^{n}\operatorname{Tr}\big(\mathbf{J}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big) \quad \text{s.t.} \quad H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k,\; H^{(i)} = S^{(i)\top}H,\; \mathbf{J}^{(i)} = S^{(i)\top}\mathbf{J}S^{(i)}. \quad (19)$$
Defining $\mathbf{A}^{(i)} = S^{(i)}S^{(i)\top}$, it can be further transformed to
$$\min_{H} \operatorname{Tr}\Big[\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)}\Big)(\mathbf{I}_n - HH^{\top})\Big] \quad \text{s.t.} \quad H \in \mathbb{R}^{n\times k},\; H^{\top}H = \mathbf{I}_k. \quad (20)$$
As seen from Proposition 1, $\sum_{i=1}^{n}\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)}$ is a positive semi-definite matrix. Therefore, Eq. (20) is a standard kernel k-means problem which can be efficiently solved by off-the-shelf packages.

Proposition 1. Given a kernel matrix $\mathbf{J} \in \mathbb{R}^{n\times n}$ and a set of sparse matrices $\mathcal{S} = \{S^{(i)}\}_{i=1}^{n}$, where $S^{(i)}$ is defined in Eq. (13), the matrix $\sum_{i=1}^{n}\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)}$, in which $\mathbf{A}^{(i)} = S^{(i)}S^{(i)\top}$, is positive semi-definite.

Proof. For such $S^{(i)}$, $\mathbf{A}^{(i)}$ is a diagonal matrix of size $n$ with $\mu^{(i)}$ ones. Denoting $\mathcal{P}^{(i)} = \{i' \mid \mathbf{A}^{(i)}_{i'i'} = 1\}$, $\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)}$ is an $n\times n$ matrix $\mathbf{G}^{(i)}$ with $\mathbf{G}^{(i)}_{i'j'} = \mathbf{J}_{i'j'}$ for $\{i', j'\} \in \mathcal{P}^{(i)}$ and $\mathbf{G}^{(i)}_{i'j'} = 0$ for $\{i', j'\} \notin \mathcal{P}^{(i)}$. In such a setting, the non-zero block of $\mathbf{G}^{(i)}$ is a smaller kernel matrix of the sample pairs indexed by $\mathcal{P}^{(i)}$, whose eigenvalues are greater than or equal to 0, demonstrating that $\mathbf{G}^{(i)}$ is a positive semi-definite matrix. $\sum_{i=1}^{n}\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)}$ is the sum of $\mathbf{G}^{(i)}$ for $i = 1, 2, \cdots, n$, and therefore is also a positive semi-definite matrix.

ii) Optimizing J with fixed β and H. With $\beta$ and $H$ given, the optimization problem can be rewritten as
$$\min_{\mathbf{J}} \frac{1}{n}\sum_{i=1}^{n}\operatorname{Tr}\big(\mathbf{J}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big) + \frac{\rho}{2}\|\mathbf{J} - \mathbf{K}_{\beta}\|_F^2 \quad \text{s.t.} \quad \mathbf{J} \succeq 0, \quad (21)$$
which, by defining $\mathbf{G} = \frac{1}{n}\sum_{i=1}^{n}S^{(i)}S^{(i)\top} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{A}^{(i)}$, can be further transformed into
$$\min_{\mathbf{J}} \|\mathbf{J} - \mathbf{B}\|_F^2 \quad \text{s.t.} \quad \mathbf{B} = \mathbf{K}_{\beta} - \frac{1}{\rho}\mathbf{G}(\mathbf{I}_n - HH^{\top})\mathbf{G},\; \mathbf{J} \succeq 0. \quad (22)$$
Eq. (22) can be solved by finding the projection of $\mathbf{B}$ onto the PSD cone. It is claimed in Theorem 2 of [10] that the optimum can be written as $\mathbf{J} = \mathbf{U}_B\boldsymbol{\Sigma}_B^{+}\mathbf{V}_B^{\top}$, where $\mathbf{B} = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^{\top}$ is the singular value decomposition (SVD) and $\boldsymbol{\Sigma}_B^{+}$ is a diagonal matrix which keeps the non-negative values of $\boldsymbol{\Sigma}_B$.

iii) Optimizing β with fixed J and H. With $\mathbf{J}$ and $H$ given, the optimization problem reduces to
$$\min_{\beta} \frac{1}{n}\sum_{i=1}^{n}\beta^{\top}\mathbf{M}^{(i)}\beta + \frac{\rho}{2}\|\mathbf{J} - \mathbf{K}_{\beta}\|_F^2 \quad \text{s.t.} \quad \beta^{\top}\mathbf{1}_m = 1,\; \beta_p \ge 0, \forall p, \quad (23)$$
which can be transformed into
$$\min_{\beta} \beta^{\top}\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbf{M}^{(i)} + \frac{\rho}{2}\mathbf{M}\Big)\beta + \alpha^{\top}\beta \quad \text{s.t.} \quad \mathbf{M}_{pq}^{(i)} = \operatorname{Tr}(\mathbf{K}_p^{(i)}\mathbf{K}_q^{(i)}),\; \mathbf{M}_{pq} = \operatorname{Tr}(\mathbf{K}_p\mathbf{K}_q),\; \alpha = [\alpha_1, \cdots, \alpha_m],\; \alpha_p = -\rho\operatorname{Tr}(\mathbf{J}\mathbf{K}_p). \quad (24)$$
Obviously, $\frac{1}{n}\sum_{i=1}^{n}\mathbf{M}^{(i)} + \frac{\rho}{2}\mathbf{M}$ can be proven positive semi-definite, for it is a positive linear combination of positive semi-definite matrices, i.e. $\mathbf{M}$ and $\{\mathbf{M}^{(i)}\}_{i=1}^{n}$, as seen in Proposition 2. Therefore, Eq. (23) is a quadratic program with linear constraints and can be efficiently solved by off-the-shelf packages.

Proposition 2. Given a set of positive semi-definite matrices $\mathcal{K} = \{\mathbf{K}_p\}_{p=1}^{m}$, the matrix $\mathbf{M} \in \mathbb{R}^{m\times m}$, in which $\mathbf{M}_{pq} = \operatorname{Tr}(\mathbf{K}_p\mathbf{K}_q)$, is positive semi-definite.

Proof. For any vector $x \in \mathbb{R}^{m\times 1}$, $x^{\top}\mathbf{M}x = \operatorname{Tr}(\mathbf{K}_{\text{sum}}^{\top}\mathbf{K}_{\text{sum}}) = \|\mathbf{K}_{\text{sum}}\|_F^2 \ge 0$, where $\mathbf{K}_{\text{sum}} = x_1\mathbf{K}_1 + x_2\mathbf{K}_2 + \cdots + x_m\mathbf{K}_m$, illustrating that the matrix $\mathbf{M}$ is positive semi-definite.

In summary, the proposed iterative algorithm to solve the resultant optimization problem in Eq. (18) is outlined in Algorithm 1. ρ is the trade-off between the clustering process and the weight-assigning process, while ζ controls the similarities among sample pairs in the local kernels. The two parameters are supposed to be specified in advance. In addition, σ is the stopping gap and should be set to a relatively small value.

Algorithm 1 Optimal neighborhood multiple kernel clustering with adaptive local kernels
Input: $\{\mathbf{K}_p\}_{p=1}^{m}$
Parameters: ρ and ζ
Output: $H$ and $\beta$
1: Initialize $\beta^{(1)} = \mathbf{1}_m/m$, $\mathbf{J}^{(1)} = \mathbf{K}_{\beta^{(1)}}$ and $t = 1$.
2: Generate $S^{(i)}$ for the $i$-th sample with $\mathbf{K}_{\beta^{(1)}}$ and the similarity threshold ζ.
3: while $|\mathrm{obj}^{(t+1)} - \mathrm{obj}^{(t)}|/\mathrm{obj}^{(t)} \ge \sigma$ do
4:   $\mathbf{K}_{\beta^{(t)}} = \sum_{p=1}^{m}\beta_p^{(t-1)}\mathbf{K}_p$.
5:   Optimize $H^{(t+1)}$ with $\mathbf{J}^{(t)}$ and $\beta^{(t)}$ via Eq. (20).
6:   Optimize $\mathbf{J}^{(t+1)}$ with $\beta^{(t)}$ and $H^{(t+1)}$ via Eq. (22).
7:   Optimize $\beta^{(t+1)}$ with $\mathbf{J}^{(t+1)}$ and $H^{(t+1)}$ via Eq. (24).
8:   $t = t + 1$.
9: end while
10: return $H^{(t)}$ and $\beta^{(t)}$
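A compact sketch of one outer iteration of the three updates above is given below, assuming the neighborhoods Ω^(i) have already been generated as in Algorithm 1. The code is illustrative, not a reference implementation: the function name is ours, the PSD projection of Eq. (22) is realized here by eigenvalue clipping of the symmetrized B (rather than the SVD formulation quoted from [10]), and SLSQP stands in for any off-the-shelf quadratic programming solver for Eq. (24).

```python
import numpy as np
from scipy.optimize import minimize

def on_alk_iteration(kernels, omegas, H, J, beta, rho, k):
    """One outer iteration of the three-step scheme of Section 3.2.

    kernels: (m, n, n) array of pre-specified kernels K_p
    omegas:  list of index arrays Omega^(i) (adaptive neighborhoods, Eq. (12))
    """
    m, n, _ = kernels.shape
    K_beta = np.tensordot(beta, kernels, axes=1)

    # Membership matrix: Z[i, j] = 1 iff sample j belongs to Omega^(i).
    Z = np.zeros((n, n))
    for i, idx in enumerate(omegas):
        Z[i, idx] = 1.0
    G = np.diag(Z.sum(axis=0) / n)          # G = (1/n) sum_i A^(i), diagonal

    # step i): H-update, Eq. (20). Entry (a, b) of (1/n) sum_i A^(i) J A^(i)
    # keeps J[a, b] once for every neighborhood containing both a and b.
    W = (Z.T @ Z) * J / n
    vals, vecs = np.linalg.eigh(W)
    H = vecs[:, -k:]                        # top-k eigenvectors of W

    # step ii): J-update, Eq. (22); PSD projection by eigenvalue clipping.
    B = K_beta - (1.0 / rho) * G @ (np.eye(n) - H @ H.T) @ G
    B = 0.5 * (B + B.T)
    vals, vecs = np.linalg.eigh(B)
    J = (vecs * np.clip(vals, 0.0, None)) @ vecs.T

    # step iii): beta-update, Eq. (24): quadratic program over the simplex.
    # (M and the M^(i) would be precomputed once in practice; they are
    # recomputed here only to keep the sketch self-contained.)
    M = np.array([[np.trace(Kp @ Kq) for Kq in kernels] for Kp in kernels])
    M_loc = np.zeros((m, m))
    for idx in omegas:
        Ks = [Kp[np.ix_(idx, idx)] for Kp in kernels]
        M_loc += np.array([[np.trace(Kp @ Kq) for Kq in Ks] for Kp in Ks])
    M_loc /= n
    alpha = np.array([-rho * np.trace(J @ Kp) for Kp in kernels])
    Q = M_loc + 0.5 * rho * M
    obj = lambda b: b @ Q @ b + alpha @ b
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    res = minimize(obj, beta, bounds=[(0.0, 1.0)] * m,
                   constraints=cons, method='SLSQP')
    return H, J, res.x
```

Repeating this function until the stopping rule of Algorithm 1 is met, and finally running k-means on the rows of H, yields the discrete clustering.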

3.3 Convergence and complexity

The convergence of the proposed iterative algorithm is theoretically guaranteed and clarified as follows. For brevity, we represent the objective in Eq. (18) with $\mathrm{Obj}(H, \beta, \mathbf{J})$. When the initialization is done at the beginning of the algorithm, the value of $\mathrm{Obj}(H, \beta, \mathbf{J})$ is set to a definite value. Let $H^{(t)}$, $\beta^{(t)}$ and $\mathbf{J}^{(t)}$ be the solution at the $t$-th iteration.

i) Optimizing H with fixed J and β. With the obtained solution written as $H^{(t+1)}$, we have
$$\mathrm{Obj}(H^{(t+1)}, \beta^{(t)}, \mathbf{J}^{(t)}) \le \mathrm{Obj}(H^{(t)}, \beta^{(t)}, \mathbf{J}^{(t)}). \quad (25)$$
ii) Optimizing J with fixed β and H. With the obtained solution written as $\mathbf{J}^{(t+1)}$, we have
$$\mathrm{Obj}(H^{(t+1)}, \beta^{(t)}, \mathbf{J}^{(t+1)}) \le \mathrm{Obj}(H^{(t+1)}, \beta^{(t)}, \mathbf{J}^{(t)}). \quad (26)$$
iii) Optimizing β with fixed J and H. With the obtained solution written as $\beta^{(t+1)}$, we have
$$\mathrm{Obj}(H^{(t+1)}, \beta^{(t+1)}, \mathbf{J}^{(t+1)}) \le \mathrm{Obj}(H^{(t+1)}, \beta^{(t)}, \mathbf{J}^{(t+1)}). \quad (27)$$
Together with Eqs. (25), (26) and (27), we have
$$\mathrm{Obj}(H^{(t+1)}, \beta^{(t+1)}, \mathbf{J}^{(t+1)}) \le \mathrm{Obj}(H^{(t)}, \beta^{(t)}, \mathbf{J}^{(t)}), \quad (28)$$
which indicates that Eq. (18) monotonically decreases with the iterations. At the same time, Eq. (18) is lower bounded by zero. Therefore, the iterative algorithm is theoretically guaranteed to converge.

While optimizing $H$ in Eq. (20), $\mathbf{A}^{(i)}$ is a diagonal matrix of size $n$ with $\mu^{(i)}$ ones, resulting in $\sum_{i=1}^{n}\mathbf{A}^{(i)}\mathbf{J}\mathbf{A}^{(i)} = (\sum_{i=1}^{n}\mathbf{A}^{(i)}) \odot \mathbf{J}$, in which $\odot$ is the Hadamard product. Its complexity is $O(n^2)$, and the complexity of standard kernel k-means is $O(n^3)$, giving an overall complexity of $O(n^3 + n^2)$. Similarly, computing $\mathbf{B}$ in Eq. (22) needs $O(n^2)$ and the SVD decomposition needs $O(n^3)$; in sum, $O(n^3 + n^2)$ is required to solve $\mathbf{J}$. In Eq. (24), the calculation of $\mathbf{M}$ needs $O(m^2n^2)$, while $O(m^2\mu^{(i)2})$ is required for each $\mathbf{M}^{(i)}$. The resultant quadratic programming problem needs $O(m^3)$. Overall, the complexity of solving $\beta$ is $O(m^2n^2 + m^2\sum_{i=1}^{n}\mu^{(i)2} + m^3)$. In our implementation, $\mathbf{M}$ and $\{\mathbf{M}^{(i)}\}_{i=1}^{n}$ are calculated in advance and kept the same during the whole optimization process. Therefore, the overall computational complexity is $O(n^3)$. The proposed algorithm shares the same computational complexity with the comparative methods [7], [9], [10], [13], [17], [18], [33], but achieves state-of-the-art performance as shown in Table 2, verifying its superiority.

4 GENERALIZATION BOUND

In this section, we analyze the generalization bound of the proposed algorithm and show how our objective contributes to a relatively low bound. The proof details are provided in the supplementary material.

The generalization bound for k-means algorithms indicates how well the clustering centroids obtained in the learning process perform in the test stage [34], [35]. With the combination weights $\beta$ learned from the proposed model, the clustering centroids $C \in \mathcal{H}^k$ can be found in the corresponding Hilbert space, where a sample $x$ of $m$ views is mapped into $\phi_{\beta}(x) = [\sqrt{\beta_1}\phi_1^{\top}(x^{(1)}), \sqrt{\beta_2}\phi_2^{\top}(x^{(2)}), \cdots, \sqrt{\beta_m}\phi_m^{\top}(x^{(m)})]^{\top}$. If data samples are given in batches, i.e. $x = \{x_i\}_{i=1}^{l}$, the reconstruction error is supposed to be
$$\mathbb{E}\Big[\frac{1}{l}\sum_{i=1}^{l} a_i \min_{y\in\{e_1,\cdots,e_k\}} \|\phi_{\beta}(x_i) - Cy\|_{\mathcal{H}}^2\Big], \quad (29)$$
in which $\{e_1, \cdots, e_k\}$ are the orthogonal bases of $\mathbb{R}^k$, and $a_i$ equals the number of neighbors of the $i$-th sample; in other words, $a_i = |\mathcal{N}_i|$, where $\mathcal{N}_i = \{j \mid \phi_{\beta}(x_i)^{\top}\phi_{\beta}(x_j) \ge \zeta\}$. Eq. (29) is also compatible with testing on a single sample by setting $l = 1$, reducing to
$$\mathbb{E}\Big[\min_{y\in\{e_1,\cdots,e_k\}} \|\phi_{\beta}(x) - Cy\|_{\mathcal{H}}^2\Big]. \quad (30)$$
Removing the constraint $H^{\top}H = \mathbf{I}_k$ in Eq. (18), we can define a function class
$$\mathcal{F} = \Big\{f : x \mapsto \frac{1}{l}\sum_{i=1}^{l} a_i \min_{y\in\{e_1,\cdots,e_k\}} \|\phi_{\beta}(x_i) - Cy\|_{\mathcal{H}}^2 \;\Big|\; \beta^{\top}\mathbf{1} = 1,\; \beta_p \ge 0,\; \phi_p^{\top}(x_i^{(p)})\phi_p(x_j^{(p)}) \le b,\; \forall p \in \{1,\cdots,m\},\; \forall x_i, x_j \in \mathcal{X},\; C \in \mathcal{H}^k\Big\}. \quad (31)$$

Theorem 1. For the given samples of $n$ batches and any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f \in \mathcal{F}$:
$$\mathbb{E}[f(x)] \le \frac{1}{n}\sum_{i=1}^{n} f(x_i) + \frac{2b\sqrt{2\pi}}{n\sqrt{l}}\big(G_{1n} + 2G_{2n} + G_{3n}\big) + 4bN_{\max}\sqrt{\frac{\ln 1/\delta}{2n}}, \quad (32)$$
where
$$G_{1n} = \mathbb{E}_{\gamma}\Big[\sum_{j=1}^{n}\sum_{i=1}^{l}\gamma_{ji}N_{\max}\Big], \quad G_{2n} = \mathbb{E}_{\gamma}\Big[\sum_{j=1}^{n}\sum_{i=1}^{l}\sum_{r=1}^{k}\gamma_{jir}N_{\max}\Big], \quad G_{3n} = \mathbb{E}_{\gamma}\Big[\sum_{j=1}^{n}\sum_{i=1}^{l}\sum_{r,s=1}^{k}\gamma_{jirs}N_{\max}\Big], \quad (33)$$
and $N_{\min} = \min_{j,i=1}^{n,l} a_i^{(j)}$, $N_{\max} = \max_{j,i=1}^{n,l} a_i^{(j)}$, and $\gamma_{ji}$, $\gamma_{jir}$, $\gamma_{jirs}$, $j \in \{1,\cdots,n\}$, $i \in \{1,\cdots,l\}$, $r, s \in \{1,\cdots,k\}$, are i.i.d. Gaussian random variables with zero mean and unit standard deviation.

The detailed proof of Theorem 1 can be found in the appendix. With further relaxation on Eq. (33), we have $G_{1n} \le N_{\max}l\sqrt{n}$, $G_{2n} \le N_{\max}lk\sqrt{n}$ and $G_{3n} \le N_{\max}lk^2\sqrt{n}$, which implies that the proposed algorithm has a generalization bound of $O(\sqrt{1/n})$. From the definition of $f(x)$ in Eq. (31), we have
$$\mathbb{E}[f(x)] = \mathbb{E}\Big[\frac{1}{l}\sum_{i=1}^{l} a_i \min_{y\in\{e_1,\cdots,e_k\}} \|\phi_{\beta}(x_i) - Cy\|_{\mathcal{H}}^2\Big], \quad (34)$$
which is the reconstruction error expectation shown in Eq. (29). According to Theorem 1, $\frac{1}{n}\sum_{i=1}^{n}f(x_i)$ is supposed to be as small as possible. For the given samples $\{x_i\}_{i=1}^{l}$, the following inequality holds:
$$f(x) \le \min_{H, \beta} \frac{1}{l}\sum_{i=1}^{l} a_i \operatorname{Tr}\big(\mathbf{K}_{\beta}^{(i)}(\mathbf{I}_{\mu^{(i)}} - H^{(i)}H^{(i)\top})\big), \quad (35)$$
because the proposed algorithm takes an extra constraint, $H^{\top}H = \mathbf{I}_k$, so that the learned $C$ and $\beta$ are not optimal for $f(x)$. Therefore, minimizing our objective provides a small bound for $\frac{1}{n}\sum_{i=1}^{n}f(x_i)$ and is advantageous to model generalization.

5 EXPERIMENT

In this section, we conduct extensive experiments on various benchmark datasets to evaluate the proposed algorithm and compare its performance with typical MKC algorithms in the recent literature. In addition, we experimentally show that the performance is boosted from two aspects simultaneously, i.e. the adaptive local kernel and the optimal neighborhood kernel. Furthermore, the parameter sensitivity and convergence of the proposed algorithm are validated. Finally, we evaluate its performance on additional datasets with a new criterion, verifying its effectiveness sufficiently.

5.1 Datasets and comparative algorithms

Six benchmark datasets of various categories are employed to evaluate the effectiveness of the proposed algorithm: Flower102 [36], Digital [37], Caltech101 [38], Protein Fold [39], Cornell [40] and AR10P [41], which are publicly available (1-6). The corresponding kernel matrices are taken from the recent multiple kernel clustering literature [18], [33], [42] instead of being generated by ourselves. This prevents us from purposely generating kernel matrices on which only our algorithm achieves good performance. Specifically, the kernel matrices of Protein Fold are generated by following the technique in [43], in which the second-order polynomial kernel and the inner product (cosine) kernel are applied to the first ten feature sets and the last two feature sets, respectively. The others are generated by applying an RBF kernel on every feature. Detailed information, including kernel size, kernel number and class number, is listed in Table 1.

1. http://www.robots.ox.ac.uk/~vgg/data/flowers/102/
2. http://ss.sysu.edu.cn/~py/
3. http://files.is.tue.mpg.de/pgehler/projects/iccv09/
4. http://mkl.ucsd.edu/dataset/protein-fold-prediction
5. http://lamda.nju.edu.cn/code_PVC.ashx
6. http://featureselection.asu.edu/

TABLE 1: Specifications of the used datasets.

Dataset      | Samples | Kernels | Clusters | ML task
Flower102    | 8189    | 4       | 102      | image clustering
Digital      | 2000    | 3       | 10       | image clustering
Caltech101   | 1530    | 25      | 102      | object detection
Protein Fold | 694     | 12      | 27       | medical research
Cornell      | 195     | 2       | 5        | linguistics
AR10P        | 130     | 6       | 10       | face recognition

Along with the proposed algorithm, we run another ten typical MKC algorithms from the recent literature: Average Multiple Kernel k-means (AMKKM), which uniformly assigns the pre-specified kernels the same weight, Single Best Kernel k-means (SBKKM), Multiple Kernel k-means (MKKM) [13], Robust Multiple Kernel k-means (RMKKM) [17], Co-regularized Spectral Clustering (CRSC) [7], Robust Multi-view Spectral Clustering (RMSC) [9], Robust Multiple Kernel Clustering (RMKC) [10], Multiple Kernel k-means Clustering with Matrix-Induced Regularization (MKCMR) [18], Multiple Kernel Clustering with Local Kernel Alignment Maximization (LKAM) [26] and Optimal Neighborhood Kernel Clustering with Multiple Kernels (ONKC) [33]. We implement the codes of AMKKM, SBKKM and MKKM, while the codes of the others are publicly available on the authors' websites, and we adopt them directly.

5.2 Experiment settings

At the initialization stage, the pre-specified kernel matrices are centered, since better performance can be achieved with centered kernels than with the original ones, as claimed in [24]. Next, we perform normalization on them so as to restrict the range of similarity values between sample pairs to [−1, 1]. Kernel matrices and class numbers are assumed to be known in advance. Three well-established measurements, i.e. accuracy (ACC), normalized mutual information (NMI) and purity, are computed to evaluate the clustering performance.

There are two hyper-parameters, ρ and ζ. ρ is the trade-off between the adaptive local kernel and the optimal neighborhood kernel, and indicates which one is more important than the other or whether they are equally weighted (the special case when ρ = 2). At the same time, ζ controls the similarity values between sample pairs in the local kernel matrices by selecting closer sample vertices around the i-th sample. A grid search is performed to select the two parameters, where ρ varies in {2^{-15}, 2^{-14}, ..., 2^{15}} and ζ varies in {−0.5, −0.4, ..., 0.5}.
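The two preprocessing steps just described, kernel centering as in [24] and similarity normalization into [−1, 1], can be written as follows. This is a minimal sketch under the standard definitions of these operations, with illustrative function names; it is not the exact preprocessing code behind the reported numbers.

```python
import numpy as np

def center_kernel(K):
    # Centered kernel as in [24]: K_c = (I - 11^T/n) K (I - 11^T/n).
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return P @ K @ P

def normalize_kernel(K):
    # Cosine normalization, K(i, j) / sqrt(K(i, i) K(j, j)),
    # which restricts all similarity values to [-1, 1].
    d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    return K / np.outer(d, d)
```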

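The three external measurements used throughout this section can be computed as below: ACC uses the usual Hungarian matching between predicted and true labels, NMI is taken from scikit-learn, and purity lets each predicted cluster vote for its majority class. The helper names are ours; the snippet only makes the evaluation protocol concrete.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one matching between predicted and true cluster labels,
    # found with the Hungarian algorithm on the confusion matrix.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes_true, classes_pred = np.unique(y_true), np.unique(y_pred)
    C = np.zeros((classes_pred.size, classes_true.size), dtype=int)
    for i, p in enumerate(classes_pred):
        for j, t in enumerate(classes_true):
            C[i, j] = np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(C, maximize=True)
    return C[row, col].sum() / y_true.size

def clustering_nmi(y_true, y_pred):
    # NMI as provided by scikit-learn.
    return normalized_mutual_info_score(y_true, y_pred)

def clustering_purity(y_true, y_pred):
    # Purity: each predicted cluster contributes its largest class overlap.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for p in np.unique(y_pred):
        total += np.unique(y_true[y_pred == p], return_counts=True)[1].max()
    return total / y_true.size
```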
5.3 Experiment results

In the beginning, the proposed ON-ALK algorithm and the comparative ones are tested on the six benchmark datasets. The ACC, NMI and purity are calculated and reported in Table 2, where the best results are marked in bold. We have the following observations:

1) The proposed algorithm holds the best results among the eleven algorithms in the three measurements. Specifically, in ACC, it exceeds the second-best algorithm by 3.88% on Flower102, 0.25% on Digital, 2.13% on Caltech101, 1.35% on Protein Fold, 3.59% on Cornell and 1.54% on AR10P. In NMI, it outperforms the second-best by 1.09% on Flower102, 0.6% on Digital, 1.28% on Caltech101, 0.88% on Protein Fold and 3.91% on Cornell, while it drops back to the second place on AR10P with only 0.43% lower than SBKKM. In purity, it performs better than the other algorithms by 3.42% on Flower102, 0.25% on Digital, 2.81% on Caltech101, 3.02% on Protein Fold, 0.51% on Cornell and 1.54% on AR10P, respectively.

2) AMKKM and SBKKM, the baselines of multiple kernel clustering, outperform some other recently proposed algorithms. However, the proposed algorithm has a consistently and notably better performance than these two algorithms. For instance, it exceeds AMKKM in ACC by 18.15% on Flower102, 7.55% on Digital, 2.48% on Caltech101, 9.94% on Protein Fold, 8.2% on Cornell and 6.16% on AR10P.

We also investigate the effect of the class number on clustering performance by conducting experiments on the large datasets, such as Flower102.

Fig. 2: ACC, NMI and purity variation with the number of classes on four datasets, i.e. Flower102, Digital, Caltech101 and Protein Fold. The other two datasets, Cornell and AR10P, are not reported since they are composed of a small number of classes and samples. The comparative algorithms are consistent with Table 2, including AMKKM, SBKKM, MKKM [13], RMKKM [17], CRSC [7], RMSC [9], RMKC [10], MKCMR [18], LKAM [26] and ONKC [33].

TABLE 2: ACC, NMI and purity of eleven algorithms on six benchmark datasets.

ACC
Dataset      | AMKKM | SBKKM | MKKM [13] | CRSC [7] | RMSC [9] | RMKC [10] | RMKKM [17] | MKCMR [18] | LKAM [26] | ONKC [33] | Proposed
Flower102    | 27.29 | 33.13 | 21.96 | 36.79 | 30.66 | 33.54 | 28.17 | 39.91 | 40.84 | 41.56 | 45.44
Digital      | 88.75 | 75.40 | 47.00 | 85.60 | 80.30 | 88.90 | 41.05 | 87.45 | 96.05 | 91.00 | 96.30
Caltech101   | 35.56 | 33.14 | 34.77 | 33.33 | 31.50 | 35.56 | 29.67 | 34.84 | 33.58 | 35.91 | 38.04
Protein Fold | 30.69 | 34.58 | 27.23 | 35.59 | 30.12 | 28.82 | 26.37 | 36.02 | 39.34 | 39.19 | 40.63
Cornell      | 52.31 | 53.85 | 41.54 | 48.21 | 45.13 | 52.31 | 43.59 | 53.85 | 56.92 | 53.33 | 60.51
AR10P        | 38.46 | 43.08 | 40.00 | 36.15 | 30.00 | 38.46 | 31.54 | 40.77 | 32.31 | 41.54 | 44.62

NMI
Dataset      | AMKKM | SBKKM | MKKM [13] | CRSC [7] | RMSC [9] | RMKC [10] | RMKKM [17] | MKCMR [18] | LKAM [26] | ONKC [33] | Proposed
Flower102    | 46.32 | 48.99 | 42.30 | 53.44 | 50.90 | 49.73 | 48.17 | 57.27 | 57.60 | 59.13 | 60.22
Digital      | 80.59 | 68.38 | 48.16 | 74.95 | 79.20 | 80.88 | 46.85 | 79.51 | 91.27 | 83.95 | 91.87
Caltech101   | 59.90 | 59.07 | 59.64 | 58.20 | 58.40 | 59.90 | 55.86 | 60.38 | 58.78 | 59.40 | 61.66
Protein Fold | 40.96 | 42.33 | 37.16 | 45.66 | 41.49 | 41.39 | 32.30 | 43.85 | 46.88 | 45.78 | 47.76
Cornell      | 36.60 | 35.53 |  4.72 | 19.64 | 24.01 | 36.60 |  9.22 | 37.55 | 39.52 | 37.55 | 43.43
AR10P        | 37.27 | 42.61 | 39.53 | 36.76 | 27.40 | 37.27 | 26.15 | 37.35 | 28.76 | 37.79 | 42.18

Purity
Dataset      | AMKKM | SBKKM | MKKM [13] | CRSC [7] | RMSC [9] | RMKC [10] | RMKKM [17] | MKCMR [18] | LKAM [26] | ONKC [33] | Proposed
Flower102    | 32.28 | 38.78 | 27.61 | 42.83 | 36.62 | 38.87 | 33.86 | 46.39 | 48.21 | 47.64 | 51.63
Digital      | 88.75 | 76.10 | 49.70 | 85.60 | 82.10 | 88.90 | 44.60 | 87.45 | 96.05 | 91.00 | 96.30
Caltech101   | 37.12 | 35.10 | 37.25 | 35.75 | 33.27 | 37.12 | 31.70 | 37.19 | 35.35 | 37.14 | 40.00
Protein Fold | 37.18 | 41.21 | 33.86 | 42.07 | 37.61 | 38.33 | 30.84 | 42.07 | 46.11 | 43.95 | 46.97
Cornell      | 67.69 | 66.15 | 45.64 | 53.33 | 60.51 | 67.69 | 47.18 | 68.21 | 69.23 | 68.21 | 69.74
AR10P        | 39.23 | 43.08 | 40.00 | 36.92 | 30.77 | 39.23 | 33.08 | 42.31 | 33.08 | 42.31 | 44.62

TABLE 3: ACC, NMI and purity of LKAM, ON-LK and ON-ALK algorithms on six datasets.

Metric | Algorithm | Flower102 | Digital | Caltech101 | Protein Fold | Cornell | AR10P
ACC    | LKAM [26] | 40.84 | 96.05 | 33.58 | 39.34 | 56.92 | 32.31
ACC    | ON-LK     | 43.44 | 96.30 | 37.22 | 41.50 | 57.44 | 38.46
ACC    | Proposed  | 45.44 | 96.30 | 38.04 | 40.63 | 60.51 | 44.62
NMI    | LKAM [26] | 57.60 | 91.27 | 58.78 | 46.88 | 39.52 | 28.76
NMI    | ON-LK     | 59.95 | 91.65 | 61.52 | 48.85 | 40.82 | 31.80
NMI    | Proposed  | 60.22 | 91.87 | 61.66 | 47.76 | 43.43 | 42.18
Purity | LKAM [26] | 48.21 | 96.05 | 35.35 | 46.11 | 69.23 | 33.08
Purity | ON-LK     | 51.18 | 96.30 | 39.62 | 47.69 | 70.26 | 38.46
Purity | Proposed  | 51.63 | 96.30 | 40.00 | 46.97 | 69.74 | 44.62

Fig. 3: Objective values at each iteration with ρ = 2^{-1} and ζ = 0 on six datasets, i.e. Flower102, Caltech101, Digital, Protein Fold, Cornell and AR10P. It can be seen that the objectives monotonically decrease and converge to minimums within a small number of iterations.
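To monitor curves such as those in Fig. 3, the objective of Eq. (18) can be evaluated after every iteration. A small sketch is given below, with an illustrative function name, reusing the (m, n, n) kernel array and the neighborhoods Ω^(i) assumed in the earlier sketches.

```python
import numpy as np

def on_alk_objective(kernels, omegas, H, J, beta, rho):
    # Objective of Eq. (18): local clustering losses + local matrix-induced
    # regularization + the proximity term (rho/2) * ||J - K_beta||_F^2.
    n = kernels.shape[1]
    K_beta = np.tensordot(beta, kernels, axes=1)
    total = 0.0
    for idx in omegas:
        J_i = J[np.ix_(idx, idx)]                       # J^(i)
        H_i = H[idx, :]                                 # H^(i) = S^(i)T H
        Ks = [Kp[np.ix_(idx, idx)] for Kp in kernels]   # K_p^(i)
        M_i = np.array([[np.trace(Kp @ Kq) for Kq in Ks] for Kp in Ks])
        total += np.trace(J_i @ (np.eye(len(idx)) - H_i @ H_i.T)) + beta @ M_i @ beta
    return total / n + 0.5 * rho * np.linalg.norm(J - K_beta) ** 2
```

With these values recorded per iteration, the loop in Algorithm 1 stops once the relative change |obj^{(t+1)} − obj^{(t)}|/obj^{(t)} falls below σ.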

Figure 2 presents the ACC, NMI and purity variation with the number of classes. Flower102 and Caltech101 both have 102 classes; therefore, we calculate the three measurements while increasing the class number by 10. For Digital and Protein Fold, which have 10 and 27 classes respectively, we gradually increase the class number by 1 and 3. The results show that the proposed algorithm, together with the others, decreases gradually with an increasing class number, except on Flower102. On Flower102, the proposed algorithm also shows a trend similar to the others, in which the three measurements first drop to minimums and then increase. But, generally, the proposed algorithm has a better clustering performance than the others in the recent literature, not only on datasets with few classes but also on those with a large class number. The proposed algorithm shows relatively weaker ACC, NMI and purity than SBKKM on Protein Fold when the class number is set to 3. We think such small subsets introduce more randomness and are not rich enough to describe the sample distribution. At the same time, the proposed algorithm still outperforms most of the others, verifying its effectiveness.

The proposed ON-ALK algorithm improves the performance of the MKC algorithm from two aspects, i.e. the adaptive local kernel and the optimal neighborhood kernel. To sufficiently illustrate the effectiveness of these two components, we make a minor change to the proposed algorithm by keeping the sizes of all local kernels the same and name it ON-LK. In this case, whether the optimal kernel is located in the neighborhood areas of linear combinations becomes the only difference between ON-LK and the LKAM algorithm proposed in [26]. At the same time, whether the adaptive local kernels are employed is also the only distinction between ON-LK and ON-ALK. Experiments are carried out on the aforementioned three algorithms, and the three clustering measurements, including ACC, NMI and purity, are gathered in Table 3. We mark the biggest values in red and the middle ones in bold, while leaving the minimums unmarked.

From the first perspective, we compare the performances of the LKAM and ON-LK algorithms in Table 3. The ON-LK algorithm significantly outperforms LKAM on the six datasets in ACC, i.e. by 2.60% on Flower102, 3.64% on Caltech101, 0.25% on Digital, 2.16% on Protein Fold, 0.52% on Cornell and 6.15% on AR10P. Similar results are achieved in NMI and purity, demonstrating the effectiveness of the optimal neighborhood kernel. By extending the domain of the optimal kernel from weighted combinations of pre-specified kernels to their neighborhood areas, the proposed algorithm enlarges the kernel search range, resulting in improved clustering performance.

From the second perspective, although the results of the proposed algorithm on Protein Fold are slightly weaker, i.e. 0.87% lower than ON-LK in ACC, it achieves better performance than the ON-LK algorithm in most cases, specifically by 2.00% on Flower102, 3.07% on Cornell and 6.16% on AR10P in ACC, illustrating the superiority of the adaptive local kernel. The proposed algorithm allows the sample numbers of the local kernels to vary along with the density around the samples. This improvement removes the farther sample pairs from the local kernels, decreasing the unreliability of aligning farther sample pairs, and therefore obtains better performance.

From the above analysis, it can be concluded that the adaptive local kernel and relaxing the optimal kernel around the linearly combined kernel can both improve the performance of the MKC algorithm.

5.4 Parameter study and convergence

We conduct a parameter study on the two parameters of the proposed algorithm. Fig. 4 reports the clustering performances on Flower102 when varying one parameter with the other fixed. It can be observed that ACC, NMI and purity increase dramatically, reach the top near 2^{-1}, then slightly decrease while ρ gradually increases from 2^{-15} to 2^{15} with ζ = 0. Fixing ρ at 2^{-1}, the three metrics keep steady when ζ ≤ −0.1, reach the top at ζ = 0, and then drop dramatically. At the same time, we also present the results on Digital, as shown in Fig. 5. The same trend can be observed when varying ρ, while the curves on ζ show a slight difference, in that the results increase at first. Observing that ACC, NMI and purity keep relatively stable for ρ ∈ [2^{-5}, 2^{5}] and ζ ∈ [−0.2, 0.1] on Flower102, Digital and the other datasets (not presented for lack of space), we recommend setting the two parameters in these ranges. This parameter recommendation setting is also widely adopted in the current literature, such as [18], [33], [42]. In addition, we also explore a new approach to choose the parameters on unlabeled datasets. The Davies-Bouldin (DB) index [44] is an internal metric for evaluating clustering, which requires no ground-truth class labels; the smaller it is, the better the clustering result. Therefore, we plot −DB over ρ ∈ {2^{-15}, 2^{-14}, ..., 2^{15}} and ζ ∈ {−0.5, −0.4, ..., 0.5}, as shown in Figs. 4 and 5. It can be observed that the −DB curves share the same trend with ACC, NMI and purity, respectively. This indicates that users can also select ideal parameters according to the internal metric DB in their applications.

We also investigate the objective values of the proposed algorithm at each iteration with ρ = 2^{-1} and ζ = 0, as presented in Fig. 3. It shows that the objective value monotonically and rapidly decreases along with the clustering process, indicating the convergence of the proposed algorithm. In most cases, the proposed algorithm converges within fewer than twenty iterations.

5.5 Evaluation on additional datasets

In order to verify the effectiveness and generalization ability of the proposed algorithm on more datasets, we conduct additional experiments on six new datasets [45], including A [46], Birch [47], DIM [48], G2 [49], S [50] and Unbalance [51], which are described in Table 5. Note that since there are many subsets of G2, the subset with 2 centroids and 1024 dimensions is used. Additionally, we generate the corresponding kernel matrices with six kernel mappings, such as linear, Gaussian, etc. We also adopt a new evaluation criterion, i.e. the Centroid Index (CI) [52], which evaluates the difference between the predicted clustering centroids and the ground truth, reflecting the quality of the clustering model. It counts, at the cluster level, how many clusters are incorrectly solved, and CI = 0 indicates a perfect cluster-level solution.

We report the performances in terms of CI in Table 4, while the corresponding ACC, NMI and purity can be found in Table 1 of the Appendix. Besides, the corresponding values are marked in bold when the proposed algorithm achieves the best results; otherwise, the best results are marked. Three observations can be concluded:

1) The proposed algorithm consistently outperforms the other methods in CI, showing its effectiveness and superiority.
2) The proposed algorithm achieves over 90% ACC, NMI and purity on most datasets, and even 100% sometimes, verifying its effectiveness.
3) Although some comparative methods achieve better performance on some datasets, the gaps are very small, around 0.1 on average. This may result from the simplicity of the datasets. At the same time, the proposed algorithm is much more stable than the others, showing a better performance in the Average row over all datasets.

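Following the description of the Centroid Index above (our reading of [52]), CI can be computed from the predicted and ground-truth centroids as the larger number of "orphan" centroids over the two mapping directions. The sketch below assumes both centroid sets are given as NumPy arrays in the original feature space.

```python
import numpy as np

def orphan_count(A, B):
    # Map every centroid in A to its nearest centroid in B and count the
    # centroids of B that receive no mapping ("orphans").
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    return B.shape[0] - np.unique(nearest).size

def centroid_index(pred_centroids, true_centroids):
    # Centroid Index (CI): the larger orphan count of the two mapping
    # directions; CI = 0 indicates a cluster-level perfect solution.
    return max(orphan_count(pred_centroids, true_centroids),
               orphan_count(true_centroids, pred_centroids))
```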
Fig. 4: Parameter sensitivity study on Flower102. (a) ACC, NMI and purity when fixing ζ at 0 and varying ρ in {2^{-15}, 2^{-14}, ..., 2^{15}}; (b) ACC, NMI and purity when fixing ρ at 2^{-1} and varying ζ in {−0.5, −0.4, ..., 0.5}. The comparative algorithms, AMKKM and SBKKM, are free of parameters, so horizontal lines are presented.

Fig. 5: Parameter sensitivity study on Digital. (a) ACC, NMI and purity when fixing ζ at 0 and varying ρ in {2^{-15}, 2^{-14}, ..., 2^{15}}; (b) ACC, NMI and purity when fixing ρ at 2^{-1} and varying ζ in {−0.5, −0.4, ..., 0.5}. The comparative algorithms, AMKKM and SBKKM, are free of parameters, so horizontal lines are presented.

TABLE 4: Centroid Index of eleven algorithms on the additional datasets [45].

Dataset   | AMKKM | SBKKM | MKKM [13] | CRSC [7] | RMSC [9] | RMKC [10] | RMKKM [17] | MKCMR [18] | LKAM [26] | ONKC [33] | Proposed
A1        | 0    | 0    | 10    | 0    | 9     | 0    | 5    | 2    | 0    | 0    | 0
A2        | 1    | 1    | 17    | 1    | 22    | 1    | 5    | 1    | 1    | 1    | 1
A3        | 4    | 3    | 38    | 3    | 28    | 4    | 6    | 4    | 3    | 4    | 3
Birch1    | 16   | 8    | 73    | 11   | 47    | 16   | 19   | 17   | 16   | 15   | 12
Birch2    | 15   | 15   | 64    | 12   | 29    | 15   | 19   | 12   | 15   | 12   | 10
DIM032    | 0    | 0    | 0     | 0    | 1     | 0    | 2    | 1    | 0    | 0    | 0
DIM064    | 1    | 0    | 1     | 0    | 2     | 0    | 2    | 0    | 0    | 0    | 0
DIM128    | 1    | 0    | 1     | 0    | 2     | 0    | 2    | 1    | 0    | 1    | 0
DIM256    | 2    | 0    | 0     | 0    | 2     | 1    | 2    | 2    | 0    | 2    | 0
DIM512    | 2    | 1    | 1     | 1    | 1     | 1    | 3    | 2    | 0    | 2    | 0
DIM1024   | 2    | 1    | 0     | 0    | 4     | 2    | 3    | 1    | 0    | 2    | 0
G2        | 0    | 0    | 0     | 0    | 0     | 0    | 0    | 0    | 0    | 0    | 0
S1        | 0    | 0    | 9     | 0    | 6     | 0    | 2    | 0    | 0    | 0    | 0
S2        | 0    | 0    | 9     | 0    | 8     | 0    | 1    | 0    | 0    | 0    | 0
S3        | 0    | 0    | 7     | 0    | 6     | 0    | 2    | 0    | 0    | 0    | 0
S4        | 0    | 0    | 5     | 0    | 8     | 0    | 2    | 0    | 0    | 0    | 0
Unbalance | 4    | 4    | 4     | 3    | 4     | 4    | 4    | 4    | 4    | 4    | 4
Average   | 2.82 | 1.94 | 14.06 | 1.82 | 10.53 | 2.59 | 4.65 | 2.76 | 2.29 | 2.53 | 1.76
* Note: The Average row is computed from the results of each algorithm on all datasets.

TABLE 5: Specifications of the additional datasets [45].

Dataset   | Subset    | Samples | Clusters | Dimensions
A         | A1        | 3000    | 20       | 2
A         | A2        | 5250    | 35       | 2
A         | A3        | 7500    | 50       | 2
Birch     | Birch1    | 10000   | 100      | 2
Birch     | Birch2    | 10000   | 100      | 2
DIM       | DIM032    | 1024    | 16       | 32
DIM       | DIM064    | 1024    | 16       | 64
DIM       | DIM128    | 1024    | 16       | 128
DIM       | DIM256    | 1024    | 16       | 256
DIM       | DIM512    | 1024    | 16       | 512
DIM       | DIM1024   | 1024    | 16       | 1024
G2        | G2        | 2048    | 2        | 1024
S         | S1        | 5000    | 15       | 2
S         | S2        | 5000    | 15       | 2
S         | S3        | 5000    | 15       | 2
S         | S4        | 5000    | 15       | 2
Unbalance | Unbalance | 6500    | 8        | 2

6 DISCUSSION AND FUTURE WORK

This paper improves the performance of the MKC algorithm by constructing the adaptive local kernel according to sample density information. Meanwhile, Martin et al. also utilize density information, but use it to enhance the original kernel matrices [53]; we are going to explore its merits in future work. Moreover, it is a widely used approach to construct a consensus kernel for clustering by linearly combining a set of base kernels, and, going a step further, our method locates the optimal kernel around the kernel combinations. Meanwhile, some researchers claim that any dot-product kernel can be defined as a linear combination of polynomial kernels [54], [55], as shown in Theorem 2 of [56]. We are going to explore the properties of dot-product kernels and construct the optimal kernel with a set of polynomial kernels in future work.

7 CONCLUSION

While the recently proposed MKC algorithms are able to handle multiple kernel clustering, they usually do not sufficiently consider the local density around individual data samples and excessively limit the representation capacity of the learned optimal kernel, leading to unsatisfying performance. This paper proposes an algorithm, named optimal neighborhood multiple kernel clustering with adaptive local kernels, to address these issues. The proposed algorithm is elegantly solved, and its effectiveness and superiority are well demonstrated via comprehensive experiments on benchmark datasets.

ACKNOWLEDGEMENTS

This work was supported by the Natural Science Foundation of China (project no. 61773392 and 61922088) and the Education Ministry-China Mobile Research Funding (project no. MCM20170404).

REFERENCES

[1] K. Krishna and N. M. Murty, “Genetic k-means algorithm,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 3, pp. 433–439, 1999.
[2] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[3] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[4] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[5] Z. Zivkovic et al., “Improved adaptive Gaussian mixture model for background subtraction,” in ICPR (2). Citeseer, 2004, pp. 28–31.
[6] A. Trivedi, P. Rai, H. Daumé III, and S. L. DuVall, “Multiview clustering with incomplete views,” in NIPS Workshop, vol. 224, 2010.
[7] A. Kumar and H. Daumé, “A co-training approach for multi-view spectral clustering,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 393–400.
[8] W. Shao, X. Shi, and S. Y. Philip, “Clustering on multiple incomplete datasets via collective kernel learning,” in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1181–1186.
[9] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[10] P. Zhou, L. Du, L. Shi, H. Wang, and Y.-D. Shen, “Recovery of corrupted multiple kernels for clustering,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[11] H. Wang, Y. Yang, and B. Liu, “GMC: Graph-based multi-view clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 6, pp. 1116–1129, 2020.
[12] X. Liu, X. Zhu, M. Li, L. Wang, C. Tang, J. Yin, D. Shen, H. Wang, and W. Gao, “Late fusion incomplete multi-view clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[13] H.-C. Huang, Y.-Y. Chuang, and C.-S. Chen, “Multiple kernel fuzzy clustering,” IEEE Transactions on Fuzzy Systems, vol. 20, no. 1, pp. 120–134, 2012.
[14] G. Tzortzis and A. Likas, “Kernel-based weighted multi-view clustering,” in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 675–684.
[15] D. Guo, J. Zhang, X. Liu, Y. Cui, and C. Zhao, “Multiple kernel learning based multi-view spectral clustering,” in 2014 22nd International Conference on Pattern Recognition. IEEE, 2014, pp. 3774–3779.
[16] S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J. A. Suykens, B. De Moor, and Y. Moreau, “Optimized data fusion for kernel k-means clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 1031–1039, 2012.
[17] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, and Y.-D. Shen, “Robust multiple kernel k-means using l21-norm,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[18] X. Liu, Y. Dou, J. Yin, L. Wang, and E. Zhu, “Multiple kernel k-means clustering with matrix-induced regularization,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[19] Y. Zhao, Y. Dou, X. Liu, and T. Li, “A novel multi-view clustering method via low-rank and matrix-induced regularization,” Neurocomputing, vol. 216, pp. 342–350, 2016.
[20] X. Liu, X. Zhu, M. Li, L. Wang, E. Zhu, T. Liu, M. Kloft, D. Shen, J. Yin, and W. Gao, “Multiple kernel k-means with incomplete kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[21] P. Zhang, Y. Yang, B. Peng, and M. He, “Multi-view clustering algorithm based on variable weight and MKL,” in International Joint Conference on Rough Sets. Springer, 2017, pp. 599–610.
[22] X. Liu, L. Wang, X. Zhu, M. Li, E. Zhu, T. Liu, L. Liu, Y. Dou, and J. Yin, “Absent multiple kernel learning algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[23] G. F. Tzortzis and A. C. Likas, “Multiple view clustering using a weighted combination of exemplar-based mixture models,” IEEE Transactions on Neural Networks, vol. 21, no. 12, pp. 1925–1938, 2010.
[24] C. Cortes, M. Mohri, and A. Rostamizadeh, “Algorithms for learning kernels based on centered alignment,” Journal of Machine Learning Research, vol. 13, no. 2, pp. 795–828, 2012.
[25] Y. Lu, L. Wang, J. Lu, J. Yang, and C. Shen, “Multiple kernel clustering based on centered kernel alignment,” Pattern Recognition, vol. 47, no. 11, pp. 3656–3664, 2014.
[26] M. Li, X. Liu, W. Lei, D. Yong, J. Yin, and E. Zhu, “Multiple kernel clustering with local kernel alignment maximization,” in International Joint Conference on Artificial Intelligence, 2016.
[27] M. Gönen and A. A. Margolin, “Localized data fusion for kernel k-means clustering with application to cancer biology,” in Advances in Neural Information Processing Systems, 2014, pp. 1305–1313.
[28] Q. Wang, Y. Dou, X. Liu, F. Xia, Q. Lv, and K. Yang, “Local kernel alignment based multi-view clustering using extreme learning machine,” Neurocomputing, vol. 275, pp. 1099–1111, 2018.
[29] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[30] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “Lp-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 953–997, 2011.
[31] C. Cortes, M. Mohri, and A. Rostamizadeh, “L2 regularization for learning kernels,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009, pp. 109–116.
[32] I. Lauriola, M. Polato, and F. Aiolli, “The minimum effort maximum output principle applied to multiple kernel learning,” in 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018, 2018. [Online]. Available: http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2018-181.pdf
[33] X. Liu, S. Zhou, Y. Wang, M. Li, Y. Dou, E. Zhu, and J. Yin, “Optimal neighborhood kernel clustering with multiple kernels,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[34] T. Liu, D. Tao, and D. Xu, “Dimensionality-dependent generalization bounds for k-dimensional coding schemes,” Neural Computation, vol. 28, no. 10, pp. 2213–2249, Oct. 2016. [Online]. Available: https://doi.org/10.1162/NECO_a_00872
[35] A. Maurer and M. Pontil, “k-dimensional coding schemes in Hilbert spaces,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5839–5846, Nov. 2010.
[36] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.
[37] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog, “Handwritten digit recognition by combined classifiers,” Kybernetika, vol. 34, no. 4, pp. 381–386, 1998. [Online]. Available: http://www.kybernetika.cz/content/1998/4/381
[38] F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2004, Washington, DC, USA, June 27 - July 2, 2004. IEEE Computer Society, 2004, p. 178. [Online]. Available: https://doi.org/10.1109/CVPR.2004.383
[39] C. H. Q. Ding and I. Dubchak, “Multi-class protein fold recognition using support vector machines and neural networks,” Bioinformatics, vol. 17, no. 4, pp. 349–358, 2001. [Online]. Available: https://doi.org/10.1093/bioinformatics/17.4.349
[40] M. Craven and S. Slattery, “Relational learning with statistical predicate invention: Better models for hypertext,” Machine Learning, vol. 43, no. 1-2, pp. 97–119, 2001.
[41] L. Ding and A. M. Martínez, “Features versus context: An approach for precise and detailed detection and delineation of faces and facial features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 11, pp. 2022–2038, 2010. [Online]. Available: https://doi.org/10.1109/TPAMI.2010.28
[42] T. Wang, D. Zhao, and S. Tian, “An overview of kernel alignment and its applications,” Artificial Intelligence Review, vol. 43, no. 2, pp. 179–192, 2015.
[43] T. Damoulas and M. A. Girolami, “Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection,” Bioinformatics, vol. 24, no. 10, pp. 1264–1270, 2008.
[44] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979.
[45] P. Fränti and S. Sieranoja, “K-means properties on six clustering benchmark datasets,” Applied Intelligence, vol. 48, no. 12, pp. 4743–4759, 2018.

[46] I. Kärkkäinen and P. Fränti, Dynamic Local Search Algorithm for the Clustering Problem. University of Joensuu, 2002.
[47] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: A new data clustering algorithm and its applications,” Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.
[48] P. Fränti, O. Virmajoki, and V. Hautamäki, “Fast agglomerative clustering using a k-nearest neighbor graph,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.
[49] P. Fränti, R. Mariescu-Istodor, and C. Zhong, “XNN graph,” in Structural, Syntactic, and Statistical Pattern Recognition - Joint IAPR International Workshop, S+SSPR 2016, Mérida, Mexico, November 29 - December 2, 2016, Proceedings, 2016, pp. 207–217.
[50] P. Fränti and O. Virmajoki, “Iterative shrinking method for clustering problems,” Pattern Recognition, vol. 39, no. 5, pp. 761–775, 2006.
[51] M. Rezaei and P. Fränti, “Set matching measures for external cluster validity,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 8, pp. 2173–2186, 2016.
[52] P. Fränti, M. Rezaei, and Q. Zhao, “Centroid index: cluster level similarity measure,” Pattern Recognition, vol. 47, no. 9, pp. 3034–3045, 2014.
[53] D. Marin, M. Tang, I. B. Ayed, and Y. Boykov, “Kernel clustering: Density biases and solutions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 136–147, 2019. [Online]. Available: https://doi.org/10.1109/TPAMI.2017.2780166
[54] M. Donini and F. Aiolli, “Learning deep kernels in the space of dot product polynomials,” Machine Learning, vol. 106, no. 9-10, pp. 1245–1269, 2017. [Online]. Available: https://doi.org/10.1007/s10994-016-5590-8
[55] I. Lauriola, M. Polato, and F. Aiolli, “Radius-margin ratio optimization for dot-product boolean kernel learning,” in Artificial Neural Networks and Machine Learning - ICANN 2017 - 26th International Conference on Artificial Neural Networks, Alghero, Italy, September 11-14, 2017, Proceedings, Part II, ser. Lecture Notes in Computer Science, A. Lintas, S. Rovetta, P. F. M. J. Verschure, and A. E. P. Villa, Eds., vol. 10614. Springer, 2017, pp. 183–191. [Online]. Available: https://doi.org/10.1007/978-3-319-68612-7_21
[56] P. Kar and H. Karnick, “Random feature maps for dot product kernels,” in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, Spain, April 21-23, 2012, ser. JMLR Proceedings, N. D. Lawrence and M. A. Girolami, Eds., vol. 22. JMLR.org, 2012, pp. 583–591. [Online]. Available: http://proceedings.mlr.press/v22/kar12.html

Jian Xiong received the B.S. degree in engineering, and the M.S. and Ph.D. degrees in management from National University of Defense Technology, Changsha, China, in 2005, 2007, and 2012, respectively. He is an Associate Professor with the School of Business Administration, Southwestern University of Finance and Economics. His research interests include data mining, multiobjective evolutionary optimization, multiobjective decision making, project planning, and scheduling.

Qing Liao received her Ph.D. degree in computer science and engineering in 2016, supervised by Prof. Qian Zhang, from the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology. She is currently an assistant professor with the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), China. Her research interests include artificial intelligence and data mining.

Sihang Zhou is a Ph.D. student at National University of Defense Technology (NUDT), Changsha, PR China. He received his M.S. degree in Computer Science from the same school in 2014 and his bachelor's degree in Information and Computing Science from the University of Electronic Science and Technology of China (UESTC) in 2012. His current research interests include machine learning, pattern recognition, and medical image analysis.

Siwei Wang is a graduate student at National University of Defense Technology (NUDT), China. His current research interests include kernel learning, unsupervised multiple-view learning, scalable kernel k-means, and deep neural networks.

Jiyuan Liu is a Ph.D. student at National University of Defense Technology (NUDT), China. His current research interests include unsupervised multiple-view clustering, deep one-class classification, and deep clustering.

Xinwang Liu received his Ph.D. degree from National University of Defense Technology (NUDT), China. He is now an Assistant Researcher with the School of Computer, NUDT. His current research interests include kernel learning and unsupervised learning. Dr. Liu has published 60+ peer-reviewed papers, including those in highly regarded journals and conferences such as IEEE T-PAMI, IEEE T-KDE, IEEE T-IP, IEEE T-NNLS, IEEE T-MM, IEEE T-IFS, NeurIPS, ICCV, CVPR, AAAI, IJCAI, etc. He serves as an associate editor of the Information Fusion journal. More information can be found at https://xinwangliu.github.io/.

Yuexiang Yang received the B.S. degree in Mathematics from Xiangtan University, Xiangtan, China, in 1986, and the M.S. degree in Computer Application and the Ph.D. degree in Computer Science and Technology from National University of Defense Technology, Changsha, China, in 1989 and 2008, respectively. His research interests include information retrieval, network security, and data analysis. He is the executive director of the Informatization Branch of the China Higher Education Association. He has co-authored more than 100 papers in international journals and conference or workshop proceedings. He has been serving as a reviewer and program committee member for various conferences and journals.