A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph

Total Page:16

File Type:pdf, Size:1020Kb

A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph electronics Article A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph Jeong-Hun Kim 1, Jong-Hyeok Choi 1 , Kwan-Hee Yoo 1 , Woong-Kee Loh 2,* and Aziz Nasridinov 1,* 1 Department of Computer Science, Chungbuk National University, Cheongju 28644, Korea; [email protected] (J.-H.K.); [email protected] (J.-H.C.); [email protected] (K.-H.Y.) 2 Department of Software, Gachon University, Seongnam 13120, Korea * Correspondence: [email protected] (W.-K.L.); [email protected] (A.N.) Received: 2 August 2019; Accepted: 24 September 2019; Published: 27 September 2019 Abstract: OPTICS is a state-of-the-art algorithm for visualizing density-based clustering structures of multi-dimensional datasets. However, OPTICS requires iterative distance computations for all objects and is thus computed in O n2 time, making it unsuitable for massive datasets. In this paper, we propose constrained OPTICS (C-OPTICS) to quickly create density-based clustering structures that are identical to those by OPTICS. C-OPTICS uses a bi-directional graph structure, which we refer to as the constraint graph, to reduce unnecessary distance computations of OPTICS. Thus, C-OPTICS achieves a good running time to create density-based clustering structures. Through experimental evaluations with synthetic and real datasets, C-OPTICS significantly improves the running time in comparison to existing algorithms, such as OPTICS, DeLi-Clu, and Speedy OPTICS (SOPTICS), and guarantees the quality of the density-based clustering structures. Keywords: OPTICS; density-based clustering structure; visualization; constraint graph 1. Introduction Clustering is one of the data mining techniques that group data objects based on a similarity [1]. The groups can provide important insights that are used for a broad range of applications [2–8], such as superpixel segmentation for image clustering [2,3], brain cancer detection [4], wireless sensor networks [5,6], pattern recognition [7,8], and others. We can classify clustering algorithms into centroid, hierarchy, model, graph, density, and grid-based clustering algorithms [9]. Many algorithms address various clustering issues including scalability, noise handling, dealing with multi-dimensional datasets, the ability to discover clusters with arbitrary shapes, and the minimum dependency on domain knowledge for determining certain input parameters [10]. Among clustering algorithms, density-based clustering algorithms can discover arbitrary shaped clusters and noise from datasets. Furthermore, density-based clustering algorithms do not require the number of clusters as an input parameter. Instead, clusters are defined as dense regions separated by sparse regions and are formed by growing due to the inter-connectivity between objects. Density-based spatial clustering of applications with noise (DBSCAN) [11] is a well-known density-based clustering algorithm. To define dense regions which serve as clusters, DBSCAN requires two parameters: ", which represents the radius of the neighborhood of an observed object, and MinPts, which is the minimum number of objects in the "-neighborhood of an observed object. Let P be a set of multi-dimensional objects and let the "-neighbors of an object p P be N"(p ). Here, DBSCAN implements two rules: i 2 i An object pi is an "-core object if N"(pi) MinPts; • ≥ If p is an "-core object, all objects in N"(p ) should appear in the same cluster as p . • i i i Electronics 2019, 8, 1094; doi:10.3390/electronics8101094 www.mdpi.com/journal/electronics Electronics 2019, 8, 1094 2 of 23 The process of DBSCAN is simple. Firstly, an arbitrary "-core object pi is added to an empty cluster. Secondly, a cluster grows as follows: for every "-core object pi in the cluster, all objects of N"(pi) are added to the cluster. This process is then repeated until the size of a cluster no longer increases. However, DBSCAN cannot easily select appropriate input parameters to form suitable clusters because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of datasets. Moreover, DBSCAN cannot find clusters of differing densities. Figure1 demonstrates this limitation of DBSCAN in a two-dimensional dataset P when MinPts = 3. If " = 0.955, p2 is an "-core object and forms a cluster, which contains p1, p2, p3, and p4 because N"(p2) MinPts is satisfied. ≥ However, p9 and p13 are noise objects. Here, a noise object is an object that is not included in any cluster. In other words, a set of objects that cannot reach "-core objects in the clusters are defined as noise objects. On the other hand, if " = 1.031, p13 becomes an "-core object and forms a cluster, which contains p11, p13, and p12. However, p9 is still a noise object. As shown in the example in Figure1, input parameter selection in DBSCAN is problematic. Electronics 2019, 8, x FOR PEER REVIEW 2 of 22 Figure 1. TwoTwo-dimensional-dimensional clustering example demonstrating problems of density-baseddensity-based spatial clustering of applications with noise (DBSCAN) for the selectionselection ofof "휀( MinPts(푀푖푛푃푡푠==3).3). TheTo address process this of disadvantageDBSCAN is simple of DBSCAN,. Firstly, a an method arbitrary for ordering 휀-core object points 푝 to푖 identifyis added the to clusteringan empty structure,cluster. Secondly called OPTICS, a cluster [12 grows], was as proposed. follows: Likefor every DBSCAN, 휀-core OPTICS object requires푝푖 in the two cluster, input all parameters, objects of "푁,휀( and푝푖) MinPtsare added, and to findsthe cluster clusters. This of diprocessffering is densities then repeated by creating until the a reachability size of a cluster plot. no Here, longer the reachabilityincreases. However, plot represents DBSCAN an cannot ordering easily of a select dataset appropriate with respect input to theparameters density-based to form clustering suitable structure.clusters because To create the theinput reachability parameters plot, depend OPTICS on formsprior knowledge, a linear order such of objectsas the distribution where objects of thatobjects are spatiallyand the ranges closest of become datasets. neighbors Moreover, [13 DBSCAN]. Figure2 cannot shows find the reachabilityclusters of differing plot for adensities dataset. PFigure, when 1 "demonstrates= p2 and MinPts this limitation= 3. The of DBSCAN horizontal in axis a two of- thedimensional reachability dataset plot enumerates푃 when 푀푖푛푃푡푠 the objects= 3. If in휀 = a linear0.955, order,푝2 is while an 휀 vertical-core object bars displayand forms reachability a cluster, distance which (see contains Definition 푝1 , 2),푝2 , which 푝3 , and is the 푝4 minimum because distance|푁휀(푝2)| ≥ for푀푖푛푃푡푠 an object is tosatisfied. be included However, in a cluster. 푝9 and The 푝13 reachability are noise objects. distances Here, for some a noise objects object (e.g., is anp1, pobject6, p11 ,that and isp14 not) are included infinite. in In any this cluster. case, an In infinite other reachabilitywords, a set distance of objects results that whencannot the reach distance 휀-core to eachobjects object in the is undefinedclusters are because defined the as distancenoise objects. value On is greater the other than hand, given if "휀. OPTICS= 1.031, does푝13 becomes not provide an clustering휀-core object results and forms explicitly, a cluster, but which the reachability contains 푝 plot11, 푝1 shows3, and the푝12. clusters However, for 푝"9. is For still example, a noise object. when "As= shown0.5, a firstin the cluster exampleC1 containing in Figure 1,p1 input, p2, and parameterp3 is found. selection When in" DBSCAN= 0.943, second is problematic cluster C. 2, which contains p6, p7, and p8 is found. As " grows larger, clusters C1 and C2 continue to expand, and a third cluster C3 containing p11, p12, and p13 is found. Figure 2. A reachability plot of an example dataset (휀 = √2 and 푀푖푛푃푡푠 = 3). To address this disadvantage of DBSCAN, a method for ordering points to identify the clustering structure, called OPTICS [12], was proposed. Like DBSCAN, OPTICS requires two input parameters, 휀, and 푀푖푛푃푡푠, and finds clusters of differing densities by creating a reachability plot. Here, the reachability plot represents an ordering of a dataset with respect to the density-based clustering structure. To create the reachability plot, OPTICS forms a linear order of objects where objects that are spatially closest become neighbors [13]. Figure 2 shows the reachability plot for a dataset 푃, when 휀 = √2 and 푀푖푛푃푡푠 = 3. The horizontal axis of the reachability plot enumerates the objects in a linear order, while vertical bars display reachability distance (see Definition 2), which is the minimum distance for an object to be included in a cluster. The reachability distances for some objects (e.g., 푝1, 푝6, 푝11, and 푝14) are infinite. In this case, an infinite reachability distance results when the distance to each object is undefined because the distance value is greater than given 휀. OPTICS does not provide clustering results explicitly, but the reachability plot shows the clusters for Electronics 2019, 8, x FOR PEER REVIEW 2 of 22 Figure 1. Two-dimensional clustering example demonstrating problems of density-based spatial clustering of applications with noise (DBSCAN) for the selection of 휀 (푀푖푛푃푡푠 = 3). The process of DBSCAN is simple. Firstly, an arbitrary 휀-core object 푝푖 is added to an empty cluster. Secondly, a cluster grows as follows: for every 휀-core object 푝푖 in the cluster, all objects of 푁휀(푝푖) are added to the cluster. This process is then repeated until the size of a cluster no longer increases. However, DBSCAN cannot easily select appropriate input parameters to form suitable clusters because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of datasets. Moreover, DBSCAN cannot find clusters of differing densities. Figure 1 demonstrates this limitation of DBSCAN in a two-dimensional dataset 푃 when 푀푖푛푃푡푠 = 3. If 휀 = 0.955, 푝2 is an 휀 -core object and forms a cluster, which contains 푝1 , 푝2 , 푝3 , and 푝4 because |푁휀(푝2)| ≥ 푀푖푛푃푡푠 is satisfied.
Recommended publications
  • A Literature Review on Patent Texts Analysis Techniques Guanlin Li
    International Journal of Knowledge www.ijklp.org and Language Processing KLP International ⓒ2018 ISSN 2191-2734 Volume 9, Number 3, 2018 pp.1–-15 A Literature Review on Patent Texts Analysis Techniques Guanlin Li School of Software & Microelectronics Peking University No.5 Yiheyuan Road Haidian District Beijing, 100871, China [email protected] Received Sep 2018; revised Sep 2018 ABSTRACT. Patent data are expanding explosively nowadays with the advent of new technologies, and it’s significant to put forward the method of automatic patent analysis and use appropriate patent analysis techniques to make use of scattered, multi-source and interrelated patent text, in order to improve the efficiency of patent analyzing. Currently there are a lot of techniques being used to process patent intelligence. This literature review focuses on automatic patent text analysis techniques, which use computer to automatically analyze large scale of patent texts and find useful information in them. These techniques are divided into the following categories: semantic analysis based techniques, rule based techniques, machine learning based techniques and patent text clustering techniques. Keywords: Patent analysis, text mining, patent intelligence 1. Introduction. Patents are important sources of technology information, in which we can find great value of scientific and technological intelligence. At the same time, patents are of high commercial value. Enterprise analyzes patent information, which contains more than 90% of the world's scientific and technological
    [Show full text]
  • Enhancement of DBSCAN Algorithm and Transparency Clustering Of
    IJRECE VOL. 5 ISSUE 4 OCT.-DEC. 2017 ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE) Enhancement of DBSCAN Algorithm and Transparency Clustering of Large Datasets Kumari Silky1, Nitin Sharma2 1Research Scholar, 2Assistant Professor Institute of Engineering Technology, Alwar, Rajasthan, India Abstract: The data mining is the technique which can extract process of dividing the data into similar objects groups. A level useful information from the raw data. The clustering is the of simplification is achieved in case of less number of clusters technique of data mining which can group similar and dissimilar involved. But because of less number of clusters some of the type of information. The density based clustering is the type of fine details have been lost. With the use or help of clusters the clustering which can cluster data according to the density. The data is modeled. According to the machine learning view, the DBSCAN is the algorithm of density based clustering in which clusters search in a unsupervised manner and it is also as the EPS value is calculated which define radius of the cluster. The hidden patterns. The system that comes as an outcome defines a Euclidian distance will be calculated using neural networks data concept [3]. The clustering mechanism does not have only which calculate similarity in the more effective manner. The one step it can be analyzed from the definition of clustering. proposed algorithm is implemented in MATLAB and results are Apart from partitional and hierarchical clustering algorithms analyzed in terms of accuracy, execution time. number of new techniques has been evolved for the purpose of clustering of data.
    [Show full text]
  • Comparison of Dimensionality Reduction Techniques on Audio Signals
    Comparison of Dimensionality Reduction Techniques on Audio Signals Tamás Pál, Dániel T. Várkonyi Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering, Telekom Innovation Laboratories, Budapest, Hungary {evwolcheim, varkonyid}@inf.elte.hu WWW home page: http://t-labs.elte.hu Abstract: Analysis of audio signals is widely used and this work: car horn, dog bark, engine idling, gun shot, and very effective technique in several domains like health- street music [5]. care, transportation, and agriculture. In a general process Related work is presented in Section 2, basic mathe- the output of the feature extraction method results in huge matical notation used is described in Section 3, while the number of relevant features which may be difficult to pro- different methods of the pipeline are briefly presented in cess. The number of features heavily correlates with the Section 4. Section 5 contains data about the evaluation complexity of the following machine learning method. Di- methodology, Section 6 presents the results and conclu- mensionality reduction methods have been used success- sions are formulated in Section 7. fully in recent times in machine learning to reduce com- The goal of this paper is to find a combination of feature plexity and memory usage and improve speed of following extraction and dimensionality reduction methods which ML algorithms. This paper attempts to compare the state can be most efficiently applied to audio data visualization of the art dimensionality reduction techniques as a build- in 2D and preserve inter-class relations the most. ing block of the general process and analyze the usability of these methods in visualizing large audio datasets.
    [Show full text]
  • DBSCAN++: Towards Fast and Scalable Density Clustering
    DBSCAN++: Towards fast and scalable density clustering Jennifer Jang 1 Heinrich Jiang 2 Abstract 2, it quickly starts to exhibit quadratic behavior in high di- mensions and/or when n becomes large. In fact, we show in DBSCAN is a classical density-based clustering Figure1 that even with a simple mixture of 3-dimensional procedure with tremendous practical relevance. Gaussians, DBSCAN already starts to show quadratic be- However, DBSCAN implicitly needs to compute havior. the empirical density for each sample point, lead- ing to a quadratic worst-case time complexity, The quadratic runtime for these density-based procedures which is too slow on large datasets. We propose can be seen from the fact that they implicitly must compute DBSCAN++, a simple modification of DBSCAN density estimates for each data point, which is linear time which only requires computing the densities for a in the worst case for each query. In the case of DBSCAN, chosen subset of points. We show empirically that, such queries are proximity-based. There has been much compared to traditional DBSCAN, DBSCAN++ work done in using space-partitioning data structures such can provide not only competitive performance but as KD-Trees (Bentley, 1975) and Cover Trees (Beygelzimer also added robustness in the bandwidth hyperpa- et al., 2006) to improve query times, but these structures are rameter while taking a fraction of the runtime. all still linear in the worst-case. Another line of work that We also present statistical consistency guarantees has had practical success is in approximate nearest neigh- showing the trade-off between computational cost bor methods (e.g.
    [Show full text]
  • Density-Based Clustering of Static and Dynamic Functional MRI Connectivity
    Rangaprakash et al. Brain Inf. (2020) 7:19 https://doi.org/10.1186/s40708-020-00120-2 Brain Informatics RESEARCH Open Access Density-based clustering of static and dynamic functional MRI connectivity features obtained from subjects with cognitive impairment D. Rangaprakash1,2,3, Toluwanimi Odemuyiwa4, D. Narayana Dutt5, Gopikrishna Deshpande6,7,8,9,10,11,12,13* and Alzheimer’s Disease Neuroimaging Initiative Abstract Various machine-learning classifcation techniques have been employed previously to classify brain states in healthy and disease populations using functional magnetic resonance imaging (fMRI). These methods generally use super- vised classifers that are sensitive to outliers and require labeling of training data to generate a predictive model. Density-based clustering, which overcomes these issues, is a popular unsupervised learning approach whose util- ity for high-dimensional neuroimaging data has not been previously evaluated. Its advantages include insensitivity to outliers and ability to work with unlabeled data. Unlike the popular k-means clustering, the number of clusters need not be specifed. In this study, we compare the performance of two popular density-based clustering methods, DBSCAN and OPTICS, in accurately identifying individuals with three stages of cognitive impairment, including Alzhei- mer’s disease. We used static and dynamic functional connectivity features for clustering, which captures the strength and temporal variation of brain connectivity respectively. To assess the robustness of clustering to noise/outliers, we propose a novel method called recursive-clustering using additive-noise (R-CLAN). Results demonstrated that both clustering algorithms were efective, although OPTICS with dynamic connectivity features outperformed in terms of cluster purity (95.46%) and robustness to noise/outliers.
    [Show full text]
  • Semantic Computing
    SEMANTIC COMPUTING 10651_9789813227910_TP.indd 1 24/7/17 1:49 PM World Scientific Encyclopedia with Semantic Computing and Robotic Intelligence ISSN: 2529-7686 Published Vol. 1 Semantic Computing edited by Phillip C.-Y. Sheu 10651 - Semantic Computing.indd 1 27-07-17 5:07:03 PM World Scientific Encyclopedia with Semantic Computing and Robotic Intelligence – Vol. 1 SEMANTIC COMPUTING Editor Phillip C-Y Sheu University of California, Irvine World Scientific NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO 10651_9789813227910_TP.indd 2 24/7/17 1:49 PM World Scientific Encyclopedia with Semantic Computing and Robotic Intelligence ISSN: 2529-7686 Published Vol. 1 Semantic Computing edited by Phillip C.-Y. Sheu Catherine-D-Ong - 10651 - Semantic Computing.indd 1 22-08-17 1:34:22 PM Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE Library of Congress Cataloging-in-Publication Data Names: Sheu, Phillip C.-Y., editor. Title: Semantic computing / editor, Phillip C-Y Sheu, University of California, Irvine. Other titles: Semantic computing (World Scientific (Firm)) Description: Hackensack, New Jersey : World Scientific, 2017. | Series: World Scientific encyclopedia with semantic computing and robotic intelligence ; vol. 1 | Includes bibliographical references and index. Identifiers: LCCN 2017032765| ISBN 9789813227910 (hardcover : alk. paper) | ISBN 9813227915 (hardcover : alk. paper) Subjects: LCSH: Semantic computing. Classification: LCC QA76.5913 .S46 2017 | DDC 006--dc23 LC record available at https://lccn.loc.gov/2017032765 British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
    [Show full text]
  • An Improvement for DBSCAN Algorithm for Best Results in Varied Densities
    Computer Engineering Department Faculty of Engineering Deanery of Higher Studies The Islamic University-Gaza Palestine An Improvement for DBSCAN Algorithm for Best Results in Varied Densities Mohammad N. T. Elbatta Supervisor Dr. Wesam M. Ashour A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering Gaza, Palestine (September, 2012) 1433 H ii stnkmegnelnonkcA Apart from the efforts of myself, the success of any project depends largely on the encouragement and guidelines of many others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this project. I would like to thank my parents for providing me with the opportunity to be where I am. Without them, none of this would be even possible to do. You have always been around supporting and encouraging me and I appreciate that. I would also like to thank my brothers and sisters for their encouragement, input and constructive criticism which are really priceless. Also, special thanks goes to Dr. Homam Feqawi who did not spare any effort to review and audit my thesis linguistically. My heartiest gratitude to my wonderful wife, Nehal, for her patience and forbearance through my studying and preparing this study. I would like to express my sincere gratitude to my advisor Dr. Wesam Ashour for the continuous support of my master study and research, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me in all the time of research and writing of this thesis. I could not have imagined having a better advisor and mentor for my master study.
    [Show full text]
  • Unsupervised Novelty Detection Using Deep Autoencoders with Density Based Clustering
    applied sciences Article Unsupervised Novelty Detection Using Deep Autoencoders with Density Based Clustering Tsatsral Amarbayasgalan 1, Bilguun Jargalsaikhan 1 and Keun Ho Ryu 1,2,* 1 Database and Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea; [email protected] (T.A.); [email protected] (B.J.) 2 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam * Correspondence: [email protected] or [email protected]; Tel.: +82-43-261-2254 Received: 30 July 2018; Accepted: 22 August 2018; Published: 27 August 2018 Abstract: Novelty detection is a classification problem to identify abnormal patterns; therefore, it is an important task for applications such as fraud detection, fault diagnosis and disease detection. However, when there is no label that indicates normal and abnormal data, it will need expensive domain and professional knowledge, so an unsupervised novelty detection approach will be used. On the other hand, nowadays, using novelty detection on high dimensional data is a big challenge and previous research suggests approaches based on principal component analysis (PCA) and an autoencoder in order to reduce dimensionality. In this paper, we propose deep autoencoders with density based clustering (DAE-DBC); this approach calculates compressed data and error threshold from deep autoencoder model, sending the results to a density based cluster. Points that are not involved in any groups are not considered a novelty; the grouping points will be defined as a novelty group depending on the ratio of the points exceeding the error threshold. We have conducted the experiment by substituting components to show that the components of the proposed method together are more effective.
    [Show full text]
  • Mobility Modes Awareness from Trajectories Based on Clustering and a Convolutional Neural Network
    International Journal of Geo-Information Article Mobility Modes Awareness from Trajectories Based on Clustering and a Convolutional Neural Network Rui Chen * , Mingjian Chen, Wanli Li, Jianguang Wang and Xiang Yao Institute of Geospatial Information, Information Engineering University, Zhengzhou 450000, China; [email protected] (M.C.); [email protected] (W.L.); [email protected] (J.W.); [email protected] (X.Y.) * Correspondence: [email protected]; Tel.: +86-181-4029-5462 Received: 13 March 2019; Accepted: 5 May 2019; Published: 7 May 2019 Abstract: Massive trajectory data generated by ubiquitous position acquisition technology are valuable for knowledge discovery. The study of trajectory mining that converts knowledge into decision support becomes appealing. Mobility modes awareness is one of the most important aspects of trajectory mining. It contributes to land use planning, intelligent transportation, anomaly events prevention, etc. To achieve better comprehension of mobility modes, we propose a method to integrate the issues of mobility modes discovery and mobility modes identification together. Firstly, route patterns of trajectories were mined based on unsupervised origin and destination (OD) points clustering. After the combination of route patterns and travel activity information, different mobility modes existing in history trajectories were discovered. Then a convolutional neural network (CNN)-based method was proposed to identify the mobility modes of newly emerging trajectories. The labeled history trajectory data were utilized to train the identification model. Moreover, in this approach, we introduced a mobility-based trajectory structure as the input of the identification model. This method was evaluated with a real-world maritime trajectory dataset. The experiment results indicated the excellence of this method.
    [Show full text]
  • Representatives for Visually Analyzing Cluster Hierarchies
    Visually Mining Through Cluster Hierarchies Stefan Brecheisen Hans-Peter Kriegel Peer Kr¨oger Martin Pfeifle Institute for Computer Science University of Munich Oettingenstr. 67, 80538 Munich, Germany brecheis,kriegel,kroegerp,pfeifle @dbs.informatik.uni-muenchen.de f g Abstract providing the user with significant and quick information. Similarity search in database systems is becoming an increas- In this paper, we introduce algorithms for automatically detecting hierarchical clusters along with their correspond- ingly important task in modern application domains such as ing representatives. In order to evaluate our ideas, we de- multimedia, molecular biology, medical imaging, computer veloped a prototype called BOSS (Browsing OPTICS Plots aided engineering, marketing and purchasing assistance as for Similarity Search). BOSS is based on techniques related well as many others. In this paper, we show how visualizing to visual data mining. It helps to visually analyze cluster the hierarchical clustering structure of a database of objects hierarchies by providing meaningful cluster representatives. can aid the user in his time consuming task to find similar ob- To sum up, the main contributions of this paper are as jects. We present related work and explain its shortcomings follows: which led to the development of our new methods. Based We explain how different important application ranges on reachability plots, we introduce approaches which auto- • would benefit from a tool which allows visually mining matically extract the significant clusters in a hierarchical through cluster hierarchies. cluster representation along with suitable cluster represen- We reason why the hierarchical clustering algorithm tatives. These techniques can be used as a basis for visual • OPTICS forms a suitable foundation for such a brows- data mining.
    [Show full text]
  • Visualization of the Optics Algorithm
    Visualization of the Optics Algorithm Gregor Redinger* Markus Hunner† 01163940 01503441 VIS 2017 - Universitt Wien ABSTRACT Our Visualization not only helps in interpreting this Reachability- In our Project we have the goal to provide a visualization for the Plot, but also provides the functionality of picking a cutoff value OPTICS Clustering Algorithm. There hardly exist in-depth visual- for parameter e, that we called e’, with this cutoff it is possible to izations of this algorithm and we developed a online tool to fill this interpret one result of the OPTICS algorithm like several results of the related DBSCAN clustering algorithm. Thus our visualization gap. In this paper we will give you a deep insight in our solution. 0 In a first step we give an introduction to our visualization approach. provides a parameter space exploration for all e < e. Then we will discuss related work and introduce the different parts Users Therefore our visualization enables users without prior of our visualization. Then we discuss the software stack we used knowledge of the OPTICS algorithm and its unusual result format for our application and which challenges and problems we encoun- to easily interpret the ordered result structure as cluster assignments. tered during the development. After this, we will look at concrete Additionally it allows the user to explore the parameter space of use cases for our visualization, take a look at the performance and the eparameter in an intuitive way, without the need to educate the present the results of a evaluation in form of a field study. At last we user on the algorithmic details of OPTICS and why introducing a will discuss the strengths and weaknesses of our approach and take cutoff value for the calculated distance measures corresponds to the a closer look at the lessons we learned from our project.
    [Show full text]
  • Comparative Study of Different Clustering Algorithms for Association Rule Mining Ms
    Ms. Pooja Gupta et al./ International Journal of Computer Science & Engineering Technology (IJCSET) Comparative Study of Different Clustering Algorithms for Association Rule Mining Ms. Pooja Gupta, Ms. Monika Jena, Computer science and Engineering Department, Amity University, Noida, India [email protected] [email protected] Ms. Manisha Chowdhary, Ms. Shilpi Singh, CSE Department, United group of Institutions, Greater Noida, India [email protected] [email protected] Abstract— In data mining, association rule mining is an important research area in today’s scenario. Various association rule mining can find interesting associations and correlation relationship among a large set of data items[1]. To find association rules for single dimensional database Apriori algorithm is appropriate. For large databases lots of candidate sets are generated. Thus Apriori algorithm is not efficient for large databases. We need some extension in the existing Apriori algorithm so that it can also work for large multidimensional database or quantitative database. For this purpose to work with apriori in large multidimensional database, data is divided into multiple data sets called as clusters. In order to divide large data bases into clusters we need various clustering algorithms which can be based on Statistical methods, Hierarchical methods, Density Based method or Grid based method. Once clusters are created by these clustering algorithms, the apriori algorithm can be easily applied on clusters of our interest for mining association rules. Since overall process of finding association rules highly depends on clustering algorithms so we have to use best suited clustering algorithm according to given data base ,thus overall execution time will be reduced.
    [Show full text]