A Parameter-free Affinity Based Clustering
Bhaskar Mukhoty, Ruchir Gupta, Member, IEEE, and Y. N. Singh, Senior Member, IEEE
The paper is under consideration at Pattern Recognition Letters.

Abstract—Several methods have been proposed to estimate the number of clusters in a dataset; the basic idea behind all of them has been to study an index that measures inter-cluster separation and intra-cluster cohesion over a range of cluster numbers and report the number which gives an optimum value of the index. In this paper we propose a simple, parameter-free approach that, like human cognition, forms clusters where closely lying points are easily identified to form a cluster, so that the total number of clusters is revealed. To identify closely lying points, the affinity of two points is defined as a function of distance, and a threshold affinity is identified above which two points in a dataset are likely to be in the same cluster. Well separated clusters are identified even in the presence of outliers, whereas for not so well separated datasets the final number of clusters is estimated and the detected clusters are merged to produce the final clusters. Experiments performed with several large dimensional synthetic and real datasets show good results with robustness to noise and density variation within a dataset.

Index Terms—Number of clusters, parameter-free clustering, outlier handling, affinity histogram

I. INTRODUCTION

Clustering is an unsupervised learning problem, where the objective is to suitably group n data points, X = {x_1, x_2, ..., x_n}, where x_i is taken from a d-dimensional real space R^d. Traditionally this has been considered an important statistical problem with numerous applications in various fields including image analysis, bio-informatics and market research.

Generally, if the number of clusters k is given, finding the optimal set of clusters C = {c_1, c_2, ..., c_k} which minimizes the variance, or sum of squares of distances from the points to the center of their clusters, is an NP-hard problem [1]. Here the points represent some parametric measure for each data point represented in the d-dimensional real space. The variance or sum of squares within (SSW) is given as

    SSW(C) = \sum_{j=1}^{k} \sum_{x_i \in c_j} ||x_i - \bar{c}_j||^2        (1)

Here, ||p|| denotes the magnitude of a vector p, so that ||p - q|| is the Euclidean distance between two points p and q. The point \bar{c}_j denotes the mean of the x_i s.t. x_i \in c_j, i.e., the centroid of the cluster c_j. It can be observed that if SSW is calculated by varying the total number of clusters, it reduces with the increase of the number of clusters, and ultimately goes to zero when the number of clusters becomes equal to the number of data points n. Methods like Gap Statistics [2] study the curve of SSW plotted against the number of clusters for an elbow point, after which the curve flattens, to give an optimal number of clusters.

Internal cluster evaluation indexes like Silhouette [3] and Calinski-Harabasz [4] evaluate the quality of a clustering depending upon some measures of intra-cluster cohesion and inter-cluster separation. They are often used to predict a suitable number of clusters by searching the data with a range of cluster numbers and reporting the number which gives the optimal value of the index [5]. This, when followed by a clustering algorithm parameterized by the number of clusters, gives a partition of the data [6].

Other parameterized algorithms, supplied with some form of information about the data, are also in use. DBSCAN [7], a density based clustering algorithm, requires parameters like the neighborhood radius and the minimum number of points in that radius to identify core points of the clusters. Generally, additional work is needed to estimate such parameters, which requires area specific domain knowledge [8].

In order to have a parameter-free clustering algorithm, we have taken a different approach to identify the clusters. Our method imitates the way a human recognizes the groups in the data. A human, when exposed to a representation of an intelligible dataset, at once recognizes the clusters present, because some data points appear so close that they could hardly go to different clusters. Such groups, when counted, give the number of clusters. We do not take the redundant approach of searching through the space of possible clusterings to identify optimal clusters.

Our algorithm tries to imitate human cognition. In order to identify closely grouped points we calculate an affinity threshold, followed by a sequential search within the data space for the points in the vicinity, which leads to the formation of groups. Points which remain single in such a search are identified as outliers. If the data has well separated clusters, the said process will be able to detect them, whereas if the clusters are close enough they could also be merged to form new clusters. The nature of the dataset is decided by the cost function defined in Section 3.2. Unlike equation 1, the cost will not always increase with the decrease in the number of clusters, indicating that the dataset supports merge. For such datasets we prioritize large clusters as representative of the data, and the number of final clusters is estimated using the distribution of detected cluster sizes. Clusters are then merged in order of closeness to produce the final clusters.

We have conducted experiments with standard convex datasets, and compared the performance of the algorithm with the existing algorithms, in terms of the proposed number of clusters and the quality of the obtained clusters.

The remainder of the paper is organized as follows. Section 2 introduces the previous works related to clustering algorithms. Section 3 describes the proposed parameter-free algorithm. Section 4 experimentally evaluates the relative performance of the proposed algorithm. Section 5 presents the conclusions.
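As a quick illustration of the SSW criterion in equation (1) above, the following is a minimal Python sketch (not part of the original paper); the array names X and labels are our own for the example.

    import numpy as np

    def ssw(X, labels):
        """Sum of squared distances of points to their cluster centroids (eq. 1)."""
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]          # points x_i belonging to cluster c
            centroid = members.mean(axis=0)   # the centroid \bar{c}_j
            total += ((members - centroid) ** 2).sum()
        return total

    # SSW shrinks monotonically as the number of clusters grows,
    # which is why an index or elbow rule is needed to choose k.
    X = np.random.rand(100, 2)
    print(ssw(X, np.zeros(100, dtype=int)))   # a single cluster gives the largest SSW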

II. RELATED WORK

The general problem of grouping data can be translated to different mathematical interpretations. For some datasets that do not have well separated clusters, the reported clusters may differ depending upon the way the problem is defined. Following are the four basic kinds of approach taken to address the cluster analysis problem.

A. Partition Method

This kind of method partitions the data into k Voronoi cells [9], where k is supposed to be known. Starting with k arbitrary cluster centers, k-means allocates points to the cluster which has the nearest center. After allocation, if the centroid of a cluster shifts from the previous cluster center, the centroid becomes the new cluster center and points are reallocated. The last step repeats till the process converges [6], [10].

Although k-means and its variations are quite fast, these methods require the number of clusters as a parameter. Moreover they are not deterministic, that is, they may produce different results on different runs of the algorithm, so aggregation of results often becomes necessary [11]. Outlier detection is also not possible in this format.

B. Hierarchical Method

In these methods, the clusters are formed either by agglomerating nearest points in a bottom-up approach or by partitioning the set in a divisive top-down approach, until the desired number of clusters is obtained or some parametric criterion which measures separation and cohesion of clusters is falsified [12]. The agglomeration or division typically depends upon the distance measure used to define separation between the clusters. Among the several distance measures available, single-linkage defines the cluster distance as the minimum distance between all pairs of points taken from the two clusters [13], whereas complete-linkage takes the maximum. Average-linkage defines the cluster distance as the sum of all pairs of distances normalized by the product of the cluster sizes [14]. Ward's method measures the increase in SSW when two clusters are merged [15].

Both agglomerative and divisive approaches require a stopping criterion to produce the required number of clusters. The criterion may be the number of clusters itself or some internal cluster validity index (CVI). Computing the CVI at every step of merging of the clusters is costly. Further, these methods are not robust to outliers.

C. Distribution Method

In this kind of method data points are assumed to come from a certain distribution. In the Gaussian Mixture Model, clusters are assumed to come from Gaussian distributions, and the expectation maximization algorithm [16], parameterized with the number of clusters, is used to estimate the distribution parameters. Random outliers can severely degrade the performance of such an algorithm by giving misplaced cluster centers.

D. Density Method

Density based methods such as DBSCAN [7] work by detecting points in a cluster using a neighborhood radius ε and a parameter minPts that represents the minimum number of points within the neighborhood. This method, although robust to outliers and applicable to arbitrary shapes, is susceptible to density variations. OPTICS [17] is an improvement which overcomes the requirement of ε as a parameter. Finding a suitable value for the density parameter in such algorithms requires domain knowledge.

Our method, in contrast, being parameter free, does not require any additional input from the user. Moreover, by the nature of its cluster identification, the method gives immunity to noise and identifies the outliers present in the dataset. All this happens in a time bound manner which is no more than the time required by most hierarchical algorithms, that is, of the order of O(n^2).

III. METHOD PROPOSED

In the proposed algorithm we try to find the natural clusters in a dataset. We define a notion of closeness from which we derive how close two points must be to be in the same cluster. The data space is then searched to find points that are close enough to be in the same cluster. Often the clusters so detected are close enough that an additional merge is required to club them. The whole task is performed in two parts: the findClusters method listed in Algorithm 1 detects the initial clusters, and the mergeClusters method listed in Algorithm 4 further merges these clusters if required. The following two sections describe the process in detail.

A. Finding initial clusters

Before we do the actual processing, feature scaling is often a necessary step to start with. In the proposed algorithm we assume numeric, not categorical, data as input, but the range of each dimension of such data may vary widely. Hence, while using a distance function to measure separation between two points, the dimension with the higher range would dominate the function. Thus, as preprocessing, each dimension of the input matrix X_{n×d} is normalized to give the standard z-score of the values; we denote the normalized input matrix by Z_{n×d}.

    Z^{(j)} = (X^{(j)} - \mu_j) / \sigma_j,  \forall j \in \{1..d\}        (2)

where \mu_j and \sigma_j respectively denote the mean and standard deviation of the jth column of X.

We form an n × n distance matrix D to store the Euclidean distance between all pairs of points from the normalized data matrix Z.

    D(i, j) = ||Z_i - Z_j||        (3)

where Z_i denotes the ith row of the matrix Z.
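To make equations (2) and (3) concrete, here is a small Python sketch (an illustration, not the authors' code); the variable names X, Z and D follow the paper's notation.

    import numpy as np

    def normalize(X):
        """Column-wise z-score (eq. 2): each dimension gets zero mean and unit variance."""
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / sigma

    def distance_matrix(Z):
        """All-pairs Euclidean distances (eq. 3): D[i, j] = ||Z_i - Z_j||.
        Uses O(n^2) memory, matching the paper's complexity analysis."""
        diff = Z[:, None, :] - Z[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))

    X = np.random.rand(50, 3)
    Z = normalize(X)
    D = distance_matrix(Z)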

We would now define an affinity function A such that a pair of points which are closer will have an affinity near one, and points which are farther will have an affinity near zero. In order to do so we transform the distance through a Gaussian function to get a half bell shaped curve which slowly approaches zero as the distance goes to infinity. The bell curve ensures that distances near zero have higher affinity. Thus, the affinity is

    A(i, j) = \exp( -D(i, j)^2 / (2 \sigma_D'^2) )        (4)

where \sigma_D' is the square root of \sigma_D, the standard deviation of the elements of D. In the distance matrix D there exist not only the intra-cluster distances, for which a Gaussian distribution is reasonable, but also the inter-cluster distances. This leads to a higher standard deviation for a cluster. Thus, by taking \sigma_D' = \sqrt{\sigma_D} in the affinity function, we ensure a relatively flattened bell curve.

We then take a histogram H of all affinity scores, which divides the scores into b bins of 1/b range. The bin analysis performed in Section 4.4 shows that taking the number of bins b as 10 yields the best results.

We observed that among the affinities between all pairs of points, there has to be a high frequency of strong affinities, as points within the same cluster would have affinity near 1. But how prominently they are near to unity will depend upon how condensed the clusters are. Thus we seek an affinity threshold where the maximum positive change in affinity count occurs. Above this point the affinity is considered strong. The center of the bin (say k) after which the change occurs is declared as the threshold, given by

    threshold = ((k - 1) + 1/2) / b        (5)

where k = \arg\max_i (H(i + 1) - H(i)), i \in \{1, .., b - 1\}, and H(i) is the magnitude of the histogram for bin i.

If two points have affinity above the threshold, the points are likely to belong to the same cluster. Figure 1 shows the affinity histogram taken for the d8c8N [18] data set. The threshold comes at 0.85, representing the affinity of the ninth bin, after which we can observe the increase in affinity count at the tenth bin.

Fig. 1. Affinity histogram of d8c8N

With the threshold in hand we start with an unclustered point, check all other points which have affinity above the threshold, and include them in the cluster. The affinity of an unclustered point is calculated w.r.t. the centroid of the forming cluster, which is incrementally updated every time a point is added. The centroid of the jth cluster, Cnt(j), is conventionally defined as the mean of the points in the cluster. Thus,

    Cnt(j) = (1/s_j) \sum_{\forall i | C(i) = j} Z_i        (6)

where the vector C(i) denotes the cluster id of the point i and is initialized to zero to denote that the points initially do not belong to any cluster, and s_j is the size of cluster j. The process of cluster detection continues until all points are assigned to some cluster. A point may shift from one cluster to another in the process, if it is closer to the centroid of the other cluster. This shifting helps to form more cohesive clusters.

The result of the above process could be that some single points remain as clusters; we identify them as outliers and separate them from further processing. Algorithm 1, findClusters, gives the pseudo code of the process described so far.

B. Merging clusters

The second part of the algorithm merges the clusters if necessary. The problem with SSW is that it always increases when we decrease the number of clusters, thus merging two close clusters never reduces the cost. We define the cost as a slight modification to SSW:

    W(C) = \sum_{j=1}^{p} (1/s_j) \sum_{\forall i | C(i) = j} (Z_i - Cnt(j))^2        (7)

Unlike SSW, the proposed cost function W would decrease when closely lying clusters are merged. The decrease in cost signifies that the data is favorable towards merge. For clusters which are well separated, merging would increase the cost, prompting the merge to be discarded.

Let {s_1, s_2, ..., s_p} be the sizes of the p clusters in descending order, as detected in findClusters. We observe, in most datasets that support merge, that big clusters are few in number while small clusters are frequent. In estimating the number of clusters for such a dataset, the more skewed the distribution of sizes, the fewer clusters become appropriate to represent the data. Hence we prioritize the big clusters as the representatives. To get the number of clusters after merge, we take the minimum k such that the total difference between the sizes of the bigger clusters and s_k, weighted by the respective cluster sizes, exceeds that of the smaller clusters, i.e.,

    \min_k \{ \sum_{i=1}^{k-1} (s_i - s_k) s_i > \sum_{j=k+1}^{p} (s_k - s_j) s_j \}        (8)

where k gives the number of clusters in the dataset.

Fig. 2 shows the distribution of cluster sizes as found in the thyroid dataset [19]. The number of clusters satisfying equation 8 is k = 2, coinciding with the ground truth number of clusters of the dataset.

Fig. 2. Initial cluster sizes of thyroid dataset

We then form a p × p matrix to get the distance between all clusters, using the centroid distance, which could be replaced by any other cluster distance measure like single, average or complete linkage as per requirement [13], [14]. We merge the two clusters which are closest in each step; p - k such merges produce the required k clusters, given by the vector C'. Finally we compare the cost after merge, W(C'), with that before merge, W(C), to ensure that the merge was necessary. The merge is discarded in favor of the initially detected clusters if the cost increases after merge. Algorithm 4, mergeClusters, lists the pseudo code of this process.

Fig. 3 shows datasets clustered by the algorithm before and after the merge. Dataset S2 is detected with 49 clusters with W(C) = 7.77, shown in 3(a); when merged it produces 15 clusters with W(C') = 2.95, indicating acceptance of the merge in 3(b). In the case of dataset R15, before merge there were 15 clusters with W(C) = 1.84; after merge it gives 6 clusters with W(C') = 2.58 as in 3(d), discarding the merge in favor of the initially detected clusters in 3(c).

Fig. 3. S2 and R15 before and after the merge

Fig. 4 shows, in the first two dimensions, the d8c8N data set clustered by the algorithm; here outliers, as desired, are not included in any cluster.

Fig. 4. Clusters and Outliers detected in d8c8N
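The following Python sketch illustrates equations (4) and (5), the Gaussian affinity with \sigma_D'^2 = \sigma_D and the histogram-based threshold with b = 10 bins. It is our illustration under the paper's definitions, not the authors' implementation, and the exact bin-edge handling is an assumption.

    import numpy as np

    def affinity_matrix(D):
        """Eq. (4): Gaussian affinity; sigma'^2 is taken as the std of the distance entries."""
        sigma_D = D.std()
        return np.exp(-D ** 2 / (2 * sigma_D))

    def affinity_threshold(A, b=10):
        """Eq. (5): center of the bin after which the largest jump in counts occurs."""
        H, _ = np.histogram(A, bins=b, range=(0.0, 1.0))
        k = np.argmax(np.diff(H)) + 1      # bin index k (1-based), i in {1,..,b-1}
        return ((k - 1) + 0.5) / b

    # Usage on a distance matrix D computed as in eq. (3):
    # A = affinity_matrix(D); t = affinity_threshold(A)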

Algorithm 1 findClusters
Input: X_{n×d} (n data points in d dimensions)
Output: C_{n×1} (vector giving cluster id of each point)
{Normalization of X; X^{(j)} denotes the jth column of X}
1: for j = 1 to d do
2:   μ_j = mean(X^{(j)}); σ_j = SD(X^{(j)})
3:   Z^{(j)} := (X^{(j)} - μ_j)/σ_j  {Z is the normalization of X}
4: end for
{Formation of distance matrix D_{n×n}}
5: for i = 1 to n do
6:   for j = 1 to n do
7:     D(i, j) = ||Z_i - Z_j||  {Z_i is the ith row of matrix Z}
8:   end for
9: end for
{Formation of affinity matrix A_{n×n}}
10: σ_D = SD(D)
11: for i = 1 to n do
12:   for j = 1 to n do
13:     A(i, j) = exp(-D(i, j)^2 / (2 ∗ σ_D))
14:   end for
15: end for
{Take histogram H_{10×1} of affinity values in 10 bins}
16: for i = 1 to n do
17:   for j = 1 to n do
18:     H(⌈A(i, j) ∗ 10⌉) = H(⌈A(i, j) ∗ 10⌉) + 1
19:   end for
20: end for
{Finding affinity threshold}
21: k = arg max_i (H(i + 1) - H(i)), i ∈ {1..9}
22: threshold = 0.1(k - 1) + 0.05
{Initialize vectors to zero: cluster C, size S and centroid Cnt}
23: C_{n×1} = 0; S_{n×1} = 0; Cnt_{n×d} = 0; k = 0
24: for i = 1 : n do
25:   if C(i) == 0 then
26:     k = k + 1
27:     C(i) = k; S(k) = 1; Cnt(k) = Z_i  {If the ith point does not belong to any cluster, start forming a new (kth) cluster with it}
28:   end if
29:   for j = 1 : n do
30:     if C(j) == 0 then
31:       if A(Cnt(k), Z_j) > threshold then
32:         Cnt(k) = addPoint(Cnt(k), j); C(j) = k  {Check unclustered points to include in the current cluster}
33:       end if
34:     else if D(Cnt(k), j) < D(Cnt(C(j)), j) then
35:       Cnt = removePoint(Cnt, j)
36:       Cnt(k) = addPoint(Cnt(k), j); C(j) = k  {Check clustered points to shift into the current cluster}
37:     end if
38:   end for
39: end for
{Remove outliers}
40: for i = 1 : n do
41:   if S(C(i)) == 1 then
42:     print X_i as outlier
43:   end if
44: end for
45: mergeClusters(Z, C, S)

Algorithm 2 addPoint
Input: Cnt(k), Z_j, S  {centroid of the kth cluster Cnt(k), point j, size vector S}
Output: Cnt(k)  {centroid after adding Z_j}
1: Cnt(k) = (S(k) ∗ Cnt(k) + Z_j) / (S(k) + 1)
2: S(k) = S(k) + 1  {size of the kth cluster increased by 1}
3: return Cnt(k)

Algorithm 3 removePoint
Input: Cnt, Z_j, S  {centroid vector Cnt, point j to be removed, size vector S}
Output: Cnt  {centroid after removing j}
1: k' = C(j)  {point j belongs to cluster k'}
2: Cnt(k') = (S(k') ∗ Cnt(k') - Z_j) / (S(k') - 1)
3: S(k') = S(k') - 1  {size of the k'th cluster decreased by 1}
4: return Cnt
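As a minimal sketch of the incremental centroid updates in Algorithms 2 and 3 (illustrative Python with our own function names, not the authors' code):

    import numpy as np

    def add_point(cnt, size, z):
        """Algorithm 2: running-mean update when point z joins a cluster."""
        new_cnt = (size * cnt + z) / (size + 1)
        return new_cnt, size + 1

    def remove_point(cnt, size, z):
        """Algorithm 3: inverse update when point z leaves a cluster (size must be > 1)."""
        new_cnt = (size * cnt - z) / (size - 1)
        return new_cnt, size - 1

    # Both updates are O(d), which keeps the assignment loop of Algorithm 1 at O(n^2) overall.
    cnt, size = np.zeros(2), 1
    cnt, size = add_point(cnt, size, np.array([2.0, 2.0]))   # centroid becomes [1.0, 1.0]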

Algorithm 4 mergeClusters
Input: Z_{n×d}, C_{n×1}, S  {normalized input matrix Z, cluster vector C, S(i) denotes the size of the ith cluster}
Output: C_{n×1}
{Let p be the remaining number of clusters after removing outliers}
{Find k which satisfies equation (8)}
1: for k = 2 : p do
2:   t1 = 0  {temp variable}
3:   for i = 1 : k do
4:     t1 = t1 + (S(i) - S(k)) ∗ S(i)
5:   end for
6:   t2 = 0  {temp variable}
7:   for i = k + 1 : p do
8:     t2 = t2 + (S(k) - S(i)) ∗ S(i)
9:   end for
10:  if t1 > t2 then
11:    print k  {number of clusters satisfying eqn. 8}
12:    break
13:  end if
14: end for
15: C' = C
16: for i = 1 : p - k do
17:   Merge the two clusters that are closest; update C'
18: end for
19: calculate W(C) and W(C')  {where W(C) = Σ_{j=1}^{p} (1/S(j)) Σ_{C(i)=j} (Z_i - Cnt(j))^2}
20: if W(C') > W(C) then
21:   Discard merge
22:   print number of clusters p
23:   return C
24: else
25:   print number of clusters k
26:   return C'
27: end if

C. Complexity analysis

Lines 5, 11 and 16 of findClusters consider all pairs of distances, with time and space complexity of O(n^2). The loop started at line 24 also has the same complexity, as the addition or removal of a point from a cluster is performed incrementally in constant time by the functions addPoint and removePoint. Thus, the worst case complexity of findClusters is O(n^2).

mergeClusters estimates the number of clusters in the loop of line 1, which has time complexity of O(p^2), where p (< n) is the number of clusters detected. As merging two clusters is an O(n) operation, the loop at line 16 has O(pn) time complexity. It requires O(p^2) space to store the distances between all cluster centroids.

Hence the proposed method, which invokes Algorithms 1 and 4 sequentially, has a time and space complexity of the order of O(n^2).
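A short Python sketch of the merge decision of Algorithm 4 follows; it is our illustration of equations (7) and (8) under the paper's definitions, not the reference implementation, and the helper names are ours.

    import numpy as np

    def estimate_k(sizes):
        """Eq. (8): smallest k where the size-weighted gap above s_k exceeds the gap below it."""
        s = sorted(sizes, reverse=True)
        p = len(s)
        for k in range(2, p + 1):                       # k is 1-based, as in the paper
            t1 = sum((s[i] - s[k - 1]) * s[i] for i in range(k - 1))
            t2 = sum((s[k - 1] - s[j]) * s[j] for j in range(k, p))
            if t1 > t2:
                return k
        return p

    def cost_W(Z, labels):
        """Eq. (7): within-cluster squared deviations, each cluster weighted by 1/size."""
        total = 0.0
        for c in np.unique(labels):
            members = Z[labels == c]
            total += ((members - members.mean(axis=0)) ** 2).sum() / len(members)
        return total

    # A merge from labeling C to C' is kept only if cost_W(Z, C_prime) <= cost_W(Z, C).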

IV. RESULTS

Experiments are performed to understand the relative performance of the proposed algorithm with respect to the existing algorithms. Two aspects of the algorithm, namely the estimation of the number of clusters and the quality of the obtained clusters, are used as performance metrics.

A. Datasets

Datasets used for cluster analysis generally have a known number of clusters and are called ground truth datasets. Often they also have an additional column telling which point belongs to which cluster, as the output vector C of our algorithm does. Evidently this meta information is used to verify the performance of the algorithm and is not given as input to the algorithm.

Table 1 shows the real world and synthetic datasets used. They are introduced by the respective researchers mentioned in the reference column and are available online [20]. The dimension of the datasets varies from 2 to 1024, while the maximum number of points in a dataset is 5401. The last five datasets have density varying clusters, with 10 percent noise/outliers w.r.t. the original number of points; these are the most difficult datasets used in the comparative study of cluster validity indexes [5]. Their naming is done by the authors such that 'd' and 'c' denote the dimension and the number of clusters respectively, and 'N' stands for noise.

By the nature of the centroid distance function used, the algorithm would not work very well for shape based datasets, thus all synthetic datasets taken are convex.

TABLE I: DATASETS

Dataset   Dimension  Points  Clusters  Type  Reference
Iris      4          150     3         Real  [19]
Thyroid   5          215     2         "     "
Car       6          1728    4         "     "
Breast    9          699     2         "     "
WDBC      30         569     2         "     "
S1        2          5000    15        Syn   [21]
S2        2          5000    15        "     "
S3        2          5000    15        "     "
S4        2          5000    15        "     "
D31       2          3100    31        "     [22]
R15       2          600     15        "     "
Uniform   2          2000    1         "     [20]
Diagonal  2          200     2         "     "
Dim2      2          1351    9         "     "
Dim4      2          2701    9         "     "
Dim8      2          5401    9         "     "
Dim032    32         1024    16        "     [23]
Dim064    64         1024    16        "     "
Dim128    128        1024    16        "     "
Dim256    256        1024    16        "     "
Dim512    512        1024    16        "     "
Dim1024   1024       1024    16        "     "
d8c8N     8          880     8         "     [5]
d8c4N     8          880     4         "     "
d8c2N     8          880     2         "     "
d4c4N     4          770     4         "     "
d4c2N     4          550     2         "     "

B. Estimation of cluster number

The correctness of the cluster number estimated by the algorithm is compared with other available methods. For the comparison we have taken nine indexes which have the best performance across different datasets [5], [24]. The standard implementations available for the indexes [25] are used, with k-means++ as the clustering algorithm. The cluster number reported most often in one hundred test runs is taken as the reported value; in contrast, the proposed algorithm requires a single run, as its output does not change if the order of the input data points remains the same. An exact match of the reported value with the known number of clusters in the ground truth dataset is considered a success, and we define accuracy as the percentage of datasets for which an algorithm correctly predicts the number of clusters.

Table 2 shows the performance of the algorithms in this respect. Out of the 27 datasets taken, the proposed algorithm has correctly estimated the number of clusters in 21 datasets. We show the number of clusters for the dataset car as not available (na) because the clusters detected are more than √n. The algorithm is not able to find suitable clusters in that dataset as some attributes are categorical. Nevertheless it can be seen that the algorithm drastically outperforms the other methods, with an accuracy of 77.7% as shown in Fig. 5, while among the other indexes taken, Silhouette and Davies-Bouldin gave an accuracy of 26.9%. We attribute the success of the proposed algorithm to its high dimensional data and noise handling capability.

Fig. 5. Accuracy of cluster number estimation

C. Quality of clusters

Now we compare the quality of the clusters obtained from the proposed algorithm with that of the existing algorithms. Most of the existing algorithms require some parameter; in order to compare, we have taken algorithms for which the number of clusters can be supplied as a parameter. Datasets for which the correct cluster number is predicted by the proposed algorithm in the first step are compared with other clustering algorithms by supplying them the ground truth number of clusters as a parameter. We made the comparison with k-means++ [10] from the partitioning methods, single, average and Ward linkage from the hierarchical methods, and the Gaussian mixture model with the EM algorithm.

External evaluation indexes evaluate the quality of a clustering by comparing the partition obtained from the clustering with the actual partition given in the ground truth data. The Adjusted Rand Index [32], the Jaccard index and the F1 measure are standard such indices used for this purpose; all of them increase to represent a better match, the maximum value being unity.

Table 3 shows the quality of the clusters obtained in this format. On the noised data the algorithm outperforms the others, while on the other data the performance is comparable. Among the eight datasets, 62.5% of the time the proposed algorithm has outperformed the others, followed by 50% of the time for average linkage and 37.5% of the time for Ward's method. The relative performance of the algorithms remains the same w.r.t. the different external evaluation indices.

D. Bin analysis

Here we analyze the performance of the algorithm by varying the number of bins used to take the affinity histogram in the method findClusters. We have used accuracy as defined in Section 4.2 to study the optimal number of bins. The proposed algorithm is executed on the 27 datasets mentioned in Table 1 with the number of bins varying from 2 to 30. Figure 6 shows the result of that experiment, showing that the accuracy is maximum when the number of bins is 10. The performance drops with fewer bins, as the variation in count with affinity is not properly captured; on the other hand, with too many bins the difference is subject to local fluctuations in counts. The proposed algorithm therefore takes the number of bins as 10 to report performance.

Fig. 6. Accuracy versus number of bins
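For reference, one common way to compute the Jaccard and F1 clustering indices is from pair counts; whether this matches the authors' exact implementation is an assumption. Below is a hedged Python sketch (ours, not from the paper), with the Adjusted Rand Index taken from scikit-learn.

    import numpy as np
    from itertools import combinations
    from sklearn.metrics import adjusted_rand_score

    def pair_counts(labels_true, labels_pred):
        """Count point pairs grouped together/apart in the true and predicted partitions."""
        tp = fp = fn = 0
        for i, j in combinations(range(len(labels_true)), 2):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            tp += same_true and same_pred
            fp += (not same_true) and same_pred
            fn += same_true and (not same_pred)
        return tp, fp, fn

    def jaccard_f1(labels_true, labels_pred):
        tp, fp, fn = pair_counts(labels_true, labels_pred)
        return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)

    y_true = [0, 0, 1, 1]
    y_pred = [1, 1, 0, 0]
    print(adjusted_rand_score(y_true, y_pred), jaccard_f1(y_true, y_pred))  # 1.0 (1.0, 1.0)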

TABLE II: RELATIVE PERFORMANCE TO PREDICT NUMBER OF CLUSTERS

Dataset   Given  CH [4]  DB [26]  Silhouette [3]  Gap [2]  Dunn [27]  KL [28]  RL [29]  Scott [30]  Cindex [31]  Proposed
Iris      3      3       2        2               2        4          18       2        3           2            4
Thyroid   2      3       3        3               16       2          3        3        7           6            2
Car       4      3       18       3               20       27         3        5        4           14           na
Breast    2      2       2        2               16       2          2        2        6           17           2
Wdbc      2      7       2        2               7        2          19       2        4           12           2
S1        15     16      15       15              16       4          9        3        8           17           11
S2        15     16      12       12              15       13         12       3        4           20           15
S3        15     19      10       15              17       14         10       2        4           17           15
S4        15     16      13       16              15       13         3        3        3           16           9
D31       31     40      19       36              20       14         7        3        5           31           14
R15       15     19      6        16              16       6          19       3        11          16           15
Uniform   1      na      na       20              1        na         na       na       na          na           1
Diagonal  2      6       2        2               2        2          2        2        4           3            2
Dim2      9      16      6        11              na       5          16       5        15          20           9
Dim4      9      15      6        6               11       5          15       3        5           20           9
Dim8      9      15      2        12              14       2          15       4        4           20           9
Dim32     16     17      15       15              17       5          15       5        3           17           16
Dim64     16     19      20       19              20       4          19       5        4           19           16
Dim128    16     18      18       18              na       6          18       7        na          18           16
Dim256    16     17      18       17              na       11         17       11       na          17           16
Dim512    16     17      20       17              na       5          17       8        na          17           16
Dim1024   16     16      20       16              na       13         20       14       na          16           16
d8c8N     8      16      2        9               20       9          10       3        3           17           8
d8c4N     4      6       3        8               18       5          6        3        3           10           4
d8c2N     2      2       2        3               19       6          20       2        3           18           2
d4c4N     4      4       4        4               9        4          13       4        4           14           5
d4c2N     2      2       2        3               6        2          7        2        3           20           2
Match (out of 27)  6     7        7               4        6          2        6        3           2            21

TABLE III: QUALITY OF CLUSTERS

Index    Method     R15       s2        s3        dim32     dim512  d8c8N     d8c4N     d8c2N
ARI      Proposed   0.922628  0.848518  0.599943  0.998958  1       1         1         1
ARI      k-means++  0.799796  0.836981  0.659792  0.779291  0.738598  0.758787  0.858261  0.871248
ARI      Ward       0.981996  0.90575   0.677113  1         1       0.941304  0.930893  0.927027
ARI      Average    0.98926   0.912602  0.595741  1         1       0.551328  0.297042  0.040588
ARI      Single     0.542457  0         0         1         1       0.002464  0.005564  0.008102
ARI      GM         0.827572  0.72655   0.449699  na        na      0.690421  0.722716  0.93182
Jaccard  Proposed   0.865108  0.752304  0.4603    0.998047  1       1         1         1
Jaccard  k-means++  0.691081  0.740213  0.519859  0.642106  0.612403  0.681939  0.836349  0.901123
Jaccard  Ward       0.966894  0.83831   0.53714   1         1       0.907208  0.91259   0.941397
Jaccard  Average    0.980118  0.849156  0.456899  1         1       0.480418  0.443552  0.577538
Jaccard  Single     0.4126    0.066575  0.066553  1         1       0.165506  0.329498  0.571032
Jaccard  GM         0.718604  0.596033  0.362868  na        na      0.588554  0.708168  0.94504
F1       Proposed   0.927676  0.858645  0.630418  0.999022  1       1         1         1
F1       k-means++  0.809742  0.848396  0.683411  0.776575  0.754036  0.801726  0.909299  0.94799
F1       Ward       0.983168  0.912044  0.698882  1         1       0.951347  0.954298  0.969814
F1       Average    0.989959  0.918426  0.627221  1         1       0.64903   0.614528  0.732202
F1       Single     0.584171  0.124838  0.1248    1         1       0.284007  0.495673  0.726951
F1       GM         0.836264  0.743898  0.531798  na        na      0.740993  0.829155  0.971743

V. CONCLUSION

We have proposed a parameter-free clustering algorithm which gives promising results on convex datasets. The algorithm shows robustness to outliers, density variations and large dimensional input data. The algorithm can also be used solely to predict the number of clusters, where it outperformed the existing indexes for estimating cluster numbers. The algorithm also works in a deterministic way when the order of the input data is not changed, so that a single run of the algorithm is sufficient. Future work might focus on reducing the complexity of the algorithm, which is currently O(n^2), and on catering to shape based non-convex data.

VI. ACKNOWLEDGEMENT

We profusely thank Mr. Saumik Bhattacharya, Department of Electrical Engineering, IIT Kanpur, for his scrupulous review and valuable feedback on the paper.

REFERENCES

[1] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, "The planar k-means problem is NP-hard," in Proceedings of the 3rd International Workshop on Algorithms and Computation, ser. WALCOM '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 274-285.
[2] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society Series B, 2001.
[3] P. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53-65, Nov. 1987. [Online]. Available: http://dx.doi.org/10.1016/0377-0427(87)90125-7
[4] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, vol. 3, no. 1, pp. 1-27, Jan. 1974. [Online]. Available: http://dx.doi.org/10.1080/03610927408827101
[5] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recogn., vol. 46, no. 1, pp. 243-256, Jan. 2013. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2012.07.021
[6] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theor., vol. 28, no. 2, pp. 129-137, Sep. 2006. [Online]. Available: http://dx.doi.org/10.1109/TIT.1982.1056489
[7] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise." AAAI Press, 1996, pp. 226-231.
[8] X. Chen, W. Liu, H. Qiu, and J.-H. Lai, "APSCAN: A parameter free algorithm for clustering," Pattern Recognition Letters, vol. 32, no. 7, pp. 973-986, 2011.
[9] F. Aurenhammer, "Voronoi diagrams—a survey of a fundamental geometric data structure," ACM Comput. Surv., vol. 23, no. 3, pp. 345-405, Sep. 1991. [Online]. Available: http://doi.acm.org/10.1145/116873.116880
[10] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027-1035. [Online]. Available: http://dl.acm.org/citation.cfm?id=1283383.1283494
[11] S. Vega-Pons and J. Ruiz-Shulcloper, "A survey of clustering ensemble algorithms," International Journal of Pattern Recognition and Artificial Intelligence, vol. 25, no. 03, pp. 337-372, 2011. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0218001411008683
[12] G. Karypis, E.-H. S. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68-75, Aug. 1999. [Online]. Available: http://dx.doi.org/10.1109/2.781637
[13] R. Sibson, "SLINK: An optimally efficient algorithm for the single-link cluster method," The Computer Journal, vol. 16, no. 1, pp. 30-34, 1973. [Online]. Available: http://comjnl.oxfordjournals.org/content/16/1/30.abstract
[14] R. R. Sokal and C. D. Michener, "A statistical method for evaluating systematic relationships," University of Kansas Science Bulletin, vol. 38, pp. 1409-1438, 1958.
[15] J. H. Ward, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963. [Online]. Available: http://www.jstor.org/stable/2282967
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977. [Online]. Available: http://www.jstor.org/stable/2984875
[17] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure." ACM Press, 1999, pp. 49-60.
[18] "An extensive comparative study of cluster validity indices," http://www.sc.ehu.es/acwaldap/cvi/, accessed: 2015-07-15.
[19] M. Lichman, "UCI repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[20] "Clustering datasets," http://cs.joensuu.fi/sipu/datasets/, accessed: 2015-07-15.
[21] P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recogn., vol. 39, no. 5, pp. 761-775, May 2006. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2005.09.012
[22] C. J. Veenman, M. J. T. Reinders, and E. Backer, "A maximum variance cluster algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1273-1280, Sep. 2002. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2002.1033218
[23] P. Fränti, O. Virmajoki, and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1875-1881, Nov. 2006. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2006.227
[24] E. Dimitriadou, S. Dolničar, and A. Weingessel, "An examination of indexes for determining the number of clusters in binary data sets," Psychometrika, vol. 67, no. 1, pp. 137-159, 2002. [Online]. Available: http://dx.doi.org/10.1007/BF02294713
[25] M. Charrad, N. Ghazzali, V. Boiteau, and A. Niknafs, "NbClust: An R package for determining the relevant number of clusters in a data set," Journal of Statistical Software, vol. 61, no. 6, 2014. [Online]. Available: http://www.jstatsoft.org/v61/i06
[26] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 2, pp. 224-227, Feb. 1979. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.1979.4766909
[27] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32-57, 1973. [Online]. Available: http://dx.doi.org/10.1080/01969727308546046
[28] W. J. Krzanowski and Y. T. Lai, "A criterion for determining the number of groups in a data set using sum-of-squares clustering," Biometrics, vol. 44, no. 1, pp. 23-34, 1988. [Online]. Available: http://www.jstor.org/stable/2531893
[29] D. A. Ratkowsky and G. N. Lance, "A criterion for determining the number of groups in a classification," Australian Computer Journal, vol. 10, pp. 115-117, 1978.
[30] A. J. Scott and M. J. Symons, "Clustering methods based on likelihood ratio criteria," Biometrics, vol. 27, no. 2, pp. 387-397, 1971. [Online]. Available: http://www.jstor.org/stable/2529003
[31] L. J. Hubert and J. R. Levin, A General Statistical Framework for Assessing Categorical Clustering in Free Recall. Distributed by ERIC Clearinghouse, Washington, D.C., 1975. [Online]. Available: http://www.eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=ED116162
[32] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, 1985.

Bhaskar Mukhoty is an M.Tech student in the department of Computer Science Engineering, IIITDM Jabalpur. He has done his B.Tech in CSE from Guru Nanak Institute of Technology, Kolkata. His research areas are machine learning and stochastic modeling of random processes.

Ruchir Gupta is an Assistant Professor in the department of Computer Science Engineering, IIITDM Jabalpur. He has done his B.Tech. from HBTI Kanpur. His Ph.D. is from IIT Kanpur. His research interests include overlaid multicasting, peer-to-peer networking and information diffusion in distributed networks.

Yatindra Nath Singh is a professor in the Electrical Engineering department at IIT Kanpur. He has done his B.Tech. from REC Hamirpur. His M.Tech. and Ph.D. are from IIT Delhi. He is a senior member of IEEE, Fellow of IETE and senior member of ICEIT. He has been Chairman of the IEEE UP section. He was given the AICTE Young Teacher Award in 2002-03. His research interests include optical networks (protection and restoration), packet and circuit switching, optical packet switching architectures, overlaid multicasting, technology development for e-learning and e-education, peer-to-peer networking, and optical communication systems and theory.