Single Pass Fuzzy C Means
Prodip Hore, Lawrence O. Hall, and Dmitry B. Goldgof
Prodip Hore, Lawrence O. Hall, and Dmitry B. Goldgof are with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA (Prodip Hore, phone: 813-472-3634, email: [email protected]; Lawrence O. Hall, phone: 813-974-4195, email: [email protected]; Dmitry B. Goldgof, phone: 813-974-4055, email: [email protected]).

Abstract— Recently, several algorithms for clustering large data sets or streaming data sets have been proposed. Most of them address the crisp case of clustering, which cannot be easily generalized to the fuzzy case. In this paper, we propose a simple single pass (through the data) fuzzy c means algorithm that neither uses any complicated data structures nor any complicated data compression techniques, yet produces data partitions comparable to fuzzy c means. We also show that our simple single pass fuzzy c means clustering algorithm, when compared to fuzzy c means, produces excellent speed-ups in clustering and thus can be used even if the data can be fully loaded in memory. Experimental results using five real data sets are provided.

I. INTRODUCTION

Recently, various algorithms for clustering large data sets and streaming data sets have been proposed [2], [4], [5], [6], [7], [8], [12], [13]. The focus has been primarily either on sampling [2], [7], [8], [10], [22] or on incrementally loading partial data, as much as can fit into memory at one time. The incremental approach [5], [6], [12], [13] generally keeps sufficient statistics or past knowledge of clusters from a previous run of a clustering algorithm in some data structures and uses them to improve the model in the future. Various algorithms [1], [3], [9], [15], [19] for speeding up clustering have also been proposed. While many algorithms have been proposed for large and very large data sets for the crisp case, not as much work has been done for the fuzzy case. As pointed out in [10], the crisp case may not be easily generalized for fuzzy clustering. This is due to the fact that in fuzzy methods an example does not belong to a cluster completely but has partial membership values in most clusters. More about clustering algorithms can be found in [23].
Clustering large amounts of data takes a long time. Further, new unlabeled data sets which will not fit in memory are becoming available. To cluster them, either subsampling is required to fit the data in memory, or the time taken will be greatly affected by disk accesses, making clustering an unattractive choice for data analysis. Another source of large data sets is streaming data, where you do not store all the data, but process it and delete it. There are some very large data sets for which a little labeled data is available and the rest of the data is unlabeled, for example, computer intrusion detection data. Semi-supervised clustering might be applied to this type of data [11]. We do not address this specifically in this paper, but the approach here could be adapted. In general, clustering algorithms which can process very large data sets are becoming increasingly important.

In this paper we propose a modified fuzzy c means algorithm for large or very large data sets, which will produce a final clustering in a single pass through the data with limited memory allocated. We neither keep any complicated data structures nor use any complicated compression techniques. We will show that our simple single pass fuzzy c means algorithm (SP) will provide almost the same clustering quality (by loading as little as 1% of the data) as that of clustering all the data at once using fuzzy c means (FCM). Moreover, we will also show that our single pass fuzzy c means algorithm provides a significant speed up when compared with clustering a complete data set, even one which is less than the size of the memory.

II. RELATED WORK

In [1], a multistage random sampling method was proposed to speed up fuzzy c means. There were two phases in the method. In the first phase, random sampling was used to obtain an estimate of the centroids, and then fuzzy c means (FCM) was run on the full data, initialized with these centroids. A speed-up of 2-3 times was reported. In [9], speed up is obtained by taking a random sample of the data and clustering it. The centroids obtained were then used to initialize clustering of the entire data set. This method is similar to that in [1]; the difference is that they used one random sample, where in [1] they may use multiple random samples. In [2], another method based on sampling for clustering large image data was proposed, where the samples were chosen by the chi-square or divergence hypothesis test. It was shown that they achieved an average speed-up of 4.2 times on image data while providing a good final partition using 24% of the total data.

Another speed up technique for image data was proposed in [3]. In this method, faster FCM convergence is obtained by using a data reduction method. Data reduction is done by quantization, and speed-up by aggregating similar examples, which were then represented by a single weighted exemplar. The objective function of the FCM algorithm was modified to take into account the weights of the exemplars. However, the presence of similar examples might not be common in all data sets. They showed that it performs well on image data. In summary, the above algorithms attempt to speed up fuzzy c means either by reducing the number of examples through sampling, by data reduction techniques, or by providing good initialization points to reduce the number of iterations. However, the above algorithms do not seem to address the issue of clustering large or very large data sets under the constraint of limited memory. Moreover, some of them address the speed up issues for image data only, where the range of features may be limited.

Some work on parallel/distributed approaches has been done, where multiple processors could be used in parallel to speed up fuzzy c means [17], [20]. In [21], a parallel version of the adaptive fuzzy Leader clustering algorithm has been discussed, whereas in [18] an efficient variant of the conventional Leader algorithm known as the ARFL (Adaptive rough fuzzy leader) clustering algorithm was proposed.

There has been research on clustering large or very large data sets [4], [5], [6], [7], [8]. Birch [4] is a data clustering method for large data sets. It loads the entire data into memory by building a CF (Clustering Feature) tree, which is a compact representation of the data using the available memory. Then the leaf entries of the CF tree can be clustered to produce a partition. A hierarchical clustering algorithm was used in their paper. It provides an optional cluster refining step in which quality can be further improved by additional passes over the data set. However, the accuracy of the data summarization depends on the available memory. It was pointed out in [5] that, depending on the size of the data, memory usage can increase significantly, as the implementation of Birch has no notion of an allocated memory buffer. In [5], a single pass hard c means clustering algorithm is proposed under the assumption of a limited memory buffer. They used various data compression techniques to obtain a compact representation of the data. In [6], another single pass scalable hard c means algorithm was proposed. This is a simpler implementation of Bradley's single pass algorithm [5], where no data compression techniques have been used. They showed that complicated data compression techniques do not improve cluster quality much, while the overhead and book-keeping of data compression techniques slow the algorithm down. Bradley's algorithm compresses data using primary compression and secondary compression algorithms. In primary compression, data points which are unlikely to change membership are put into a discard set. The remaining points are then over-clustered with a large number of clusters compared to the actual number of clusters. This phase is known as secondary compression, and its objective is to save more buffer space by representing closely packed points by their clusters. A tightness criterion was used to detect clusters which can be used to represent points. These sub-clusters were further joined by an agglomerative clustering algorithm, provided that after merging they are still tightly packed. So, in summary, Bradley's implementation has various data compression schemes which involve overhead and book-keeping operations. Farnstrom's single pass algorithm [6] is basically a special case of Bradley's where, after clustering, all data in memory is put into a discard set. A discard set is associated with every cluster, and it contains the sufficient statistics of the data points in it. The discard set is then treated as a weighted point in subsequent clustering.
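Concretely, a discard set of this kind can be maintained with a few running sums. The sketch below shows one way to keep such sufficient statistics and turn them into a weighted point; it is a minimal illustration of the idea under our reading of [5], [6], and the class and method names are ours, not from those papers.

```python
import numpy as np

# Minimal sketch (illustrative, not code from [5] or [6]): a discard set
# stores sufficient statistics of the points it absorbs, so the raw points
# can be freed from the memory buffer.
class DiscardSet:
    def __init__(self, dim):
        self.n = 0               # number of points absorbed
        self.ls = np.zeros(dim)  # linear sum of the points
        self.ss = np.zeros(dim)  # per-feature sum of squares

    def absorb(self, points):
        """Add a (num_points x dim) chunk; the caller may then discard it."""
        self.n += points.shape[0]
        self.ls += points.sum(axis=0)
        self.ss += (points ** 2).sum(axis=0)

    def weighted_point(self):
        """Centroid plus weight n: one point standing in for all absorbed data."""
        return self.ls / self.n, self.n
```

The sum of squares is what makes criteria such as Bradley's tightness test computable without revisiting the raw points.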
Other relevant single pass algorithms for the crisp case can be found in [12] and [13]. Compressing data was also studied in [26], where the number of templates needed by the probabilistic neural network (PNN) was reduced by compressing training examples with exemplars. This was done by estimating the probability density functions for the PNN with Gaussian models.

Recently, in [10], a sampling based method has been proposed for extending fuzzy and probabilistic clustering to large or very large data sets. The approach is based on progressive sampling and is an extension of eFFCM [2] to geFFCM, which can handle non-image data. However, the termination criteria for progressive sampling could be complicated, as they depend upon the features of the data set. They used 4 acceptance strategies to control progressive sampling based on the features of the data. The first strategy (SS1) is to accept the sampled data when a particular feature signals termination. The second one (SS2) is to accept when any one of the features signals termination. The third one (SS3) is to accept when all the features sequentially signal termination, and the last one (SS4) accepts when all the features simultaneously signal termination. However, the method could be complicated for a large number of features, and the sample size could also grow large.
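To make the four strategies concrete, here is a schematic reading of them (our own illustration, not code from [10]): each feature reports a boolean termination signal for the current sample, and SS3 additionally remembers which features have signaled in earlier rounds.

```python
def accept(signals, strategy, seen, designated=0):
    """Decide whether to stop progressive sampling.

    signals:    one boolean per feature, True if that feature's test
                says the current sample is adequate.
    seen:       set of feature indices that signaled in earlier rounds,
                threaded by the caller (only SS3 uses this history).
    designated: index of the single feature of interest for SS1.
    """
    if strategy == "SS1":   # a particular feature signals termination
        return signals[designated]
    if strategy == "SS2":   # any one feature signals termination
        return any(signals)
    if strategy == "SS3":   # every feature has signaled, possibly in different rounds
        seen.update(i for i, s in enumerate(signals) if s)
        return len(seen) == len(signals)
    if strategy == "SS4":   # every feature signals simultaneously
        return all(signals)
    raise ValueError(f"unknown strategy: {strategy}")
```

Under SS3 the sample keeps growing until each feature has signaled at least once, which is one reason the sample size can grow large when there are many features.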
In [24], two methods of scaling EM [23] to large data sets have been proposed by reducing the time spent in the E step. The first method, incremental EM, is similar to [28], in which data is partitioned into blocks and the log-likelihoods are incrementally updated. In the second method, lazy EM, at scheduled iterations the algorithm performs partial E and M steps on a subset of the data. In [25], EM was scaled in a similar way as k-means was scaled in [5]. In [27], data is incrementally loaded and modelled using Gaussians. At the first stage the Gaussian models are allowed to overfit the data, and they are later reduced at a second stage to output the final models. The methods used to scale EM may not generalize to FCM, as they are different algorithms with different objective functions.

In [13], a streaming algorithm for hard c means was proposed in which data was assumed to arrive in chunks. Each chunk was clustered using a LOCALSEARCH algorithm, and the memory was freed by summarizing the clustering result by weighted centroids. This is similar to the method of creating a discard set in the single pass hard c means algorithm [6]. Finally, the weighted centroids were clustered to obtain the clustering for the entire stream. We also summarize clustering results in a similar way. The difference between [13], [6] and our approach is the fact that in fuzzy clustering an example may not completely belong to a particular cluster. Our method of summarizing clustering results involves a fuzzy membership matrix, which does not exist for the crisp cases.

Thus, some work has been done for the crisp cases (hard c means or hierarchical clustering) and the EM algorithm, but as stated earlier the crisp case may not be easily generalized to fuzzy clustering, and to our knowledge no single pass fuzzy c means algorithm has been proposed that takes into account the membership degrees, which describe the fuzziness of each example.
III. SINGLE PASS FUZZY C MEANS ALGORITHM

Suppose we intend to cluster a large or very large data set present on a disk. We assume that for large or very large data sets the data set size exceeds the memory size. As in [6], we assume the data set is randomly scrambled on the disk. We can only load a certain percentage of the data based on the available memory allocation. If we load 1% of the data into memory at a time, then we have to do it 100 times to scan through the entire data set. We call each such data access a partial data access (PDA). The number of partial data accesses will depend on how much data we load each time.

The data loaded at each PDA is clustered together with the weighted points that condense all previously seen data; we call each such clustering a partial data clustering (PDC). When the last PDC loads its n/P examples (n examples loaded over P partial data accesses), it essentially clusters the whole data set, where n - n/P examples remain as condensed weighted points and n/P as singleton points. Thus, our simple single pass fuzzy c means will partition the whole data in a single pass through the whole data set. To speed up clustering, we initialize each PDC with the final centroids obtained from the previous PDC. This knowledge propagation allows for faster convergence.
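A sketch of this loop, under assumptions we make explicit, may help: `chunks` yields one in-memory block per PDA; each PDC runs a weighted variant of FCM seeded with the previous PDC's final centroids; and each PDC is condensed to one weighted point per cluster, with the weight taken here as the cluster's total membership mass. That condensation rule and all names are our illustration; the paper's own weighted point calculation is specified in Section III-A below.

```python
import numpy as np

def weighted_fcm(X, w, c, m=2.0, centroids=None, tol=1e-4, max_iter=100):
    """Weighted FCM on points X (num_points x dim) with per-point weights w.
    With all weights equal to 1 this is ordinary FCM."""
    if centroids is None:
        rng = np.random.default_rng(0)
        centroids = X[rng.choice(len(X), c, replace=False)]
    for _ in range(max_iter):
        # squared distances to the centroids, bounded away from zero
        d2 = np.maximum(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # memberships; each row sums to 1
        um = (u ** m) * w[:, None]                 # weight-scaled fuzzified memberships
        new_centroids = (um.T @ X) / um.sum(axis=0)[:, None]
        done = np.abs(new_centroids - centroids).max() < tol
        centroids = new_centroids
        if done:
            break
    return centroids, u

def single_pass_fcm(chunks, c, m=2.0):
    """One PDC per chunk: cluster the fresh chunk together with the weighted
    points condensing everything seen so far, then re-condense."""
    centroids, wpts, wts = None, None, None
    for chunk in chunks:                           # one chunk = one PDA
        if wpts is None:
            X, w = chunk, np.ones(len(chunk))
        else:
            X = np.vstack([wpts, chunk])
            w = np.concatenate([wts, np.ones(len(chunk))])
        # knowledge propagation: seed with the previous PDC's final centroids
        centroids, u = weighted_fcm(X, w, c, m, centroids=centroids)
        # condense this PDC into c weighted points (illustrative rule; the
        # paper's weighted point calculation follows in Section III-A)
        wts = (u * w[:, None]).sum(axis=0)         # total membership mass per cluster
        wpts = centroids.copy()
    return centroids
```

On the last chunk, `weighted_fcm` sees n/P singleton points plus the weighted points condensing the other n - n/P examples, matching the description above.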
A. Weighted point calculation

The objective function (J_m) minimized by FCM is defined as follows:
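In the usual notation, with n examples x_k, c cluster centroids v_i, memberships u_{ik}, and fuzzifier m > 1, the standard FCM objective reads:

\[
J_m \;=\; \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \lVert x_k - v_i \rVert^{2},
\qquad \text{subject to } \sum_{i=1}^{c} u_{ik} = 1 \ \text{for each } k .
\]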