Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques


Md. Mostofa Ali Patwary¹†, Diana Palsetia¹, Ankit Agrawal¹, Wei-keng Liao¹, Fredrik Manne², Alok Choudhary¹
¹Northwestern University, Evanston, IL 60208, USA
²University of Bergen, Norway
†Corresponding author: [email protected]

ABSTRACT

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering; I.5.3 [Pattern Recognition]: Clustering - Algorithms; H.2.8 [Database Management]: Database Applications - Data Mining; G.1.0 [Mathematics of Computing]: Numerical Analysis - Parallel Algorithms

General Terms

Algorithms, Experimentation, Performance, Theory

Keywords

Density-based clustering, Minimum spanning tree, Union-Find algorithm, Disjoint-set data structure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SC 13 November 17-21, 2013, Denver, CO, USA
Copyright 2013 ACM 978-1-4503-2378-9/13/11 ...$15.00.
http://dx.doi.org/10.1145/2503210.2503255

1. INTRODUCTION

Clustering is a data mining technique that groups data into meaningful subclasses, known as clusters, such that it minimizes the intra-differences and maximizes the inter-differences of these subclasses [21]. For the purpose of knowledge discovery, it identifies dense and sparse regions and therefore discovers overall distribution patterns and correlations in the data. Based on the data properties or the task requirements, various clustering algorithms have been developed. Well-known algorithms include K-means [35], K-medoids [41], BIRCH [55], DBSCAN [20], OPTICS [5, 29], STING [52], and WaveCluster [48]. These algorithms have been used in various scientific areas such as satellite image segmentation [38], noise filtering and outlier detection [11], unsupervised document clustering [50], and clustering of bioinformatics data [36]. Existing clustering algorithms have been roughly categorized as partitional, hierarchical, grid-based, and density-based [26, 27].

OPTICS (Ordering Points To Identify the Clustering Structure) is a hierarchical density-based clustering algorithm [5]. The key idea of density-based clustering algorithms such as OPTICS and DBSCAN is that for each data point in a cluster, the neighborhood within a given radius (ε), known as the generating distance, has to contain at least a minimum number of points (minpts), i.e., the density of the neighborhood has to exceed some threshold [5, 20]. Additionally, OPTICS addresses DBSCAN's major limitation: the problem of detecting meaningful clusters in data of varying density.

OPTICS provides an overview of the cluster structure of a dataset with respect to density and contains information about every cluster level of the dataset. In order to do so, OPTICS generates a linear order of points where spatially closest points become neighbors. Additionally, for each point, a spatial distance (known as the reachability distance) is computed which represents the density. Once the order and the reachability distances are computed using ε and minpts, we can query for the clusters that a particular value of ε′ (known as the clustering distance) would give, where ε′ ≤ ε. The query is answered in linear time.

One example application of OPTICS, which requires high performance computing, is finding halos and subhalos (clusters) from massive cosmology data in astrophysics [34]. Other application domains include analyzing satellite images, X-ray crystallography, and anomaly detection [7]. However, OPTICS is challenging to parallelize as its data access pattern is inherently sequential. To the best of our knowledge, there has not been any effort yet to do so. Due to the similarities with DBSCAN, a natural choice for designing a parallel OPTICS could be one of the several master-slave based approaches [6, 13, 14, 17, 23, 54, 56]. However, in [44], we showed that these approaches incur high communication overhead between the master and slaves, and low parallel efficiency. As an alternative to these approaches we presented a parallel DBSCAN algorithm based on the disjoint-set data structure, suitable for massive data sets [44]. However, this approach is not directly applicable to OPTICS as DBSCAN produces a clustering result for a single set of density parameters, whereas OPTICS generates a linear order of the points that provides an overview of the cluster structure for a wide range of input parameters. One key difference between these two algorithms from the viewpoint of parallelization is that in DBSCAN, after processing a point one can process its neighbors in parallel, whereas in OPTICS, the processing of the neighbors follows a strict order.

To overcome this challenge, we develop a scalable parallel OPTICS algorithm (POPTICS) using graph algorithmic concepts. POPTICS exploits the similarities between OPTICS and Prim's Minimum Spanning Tree (MST) algorithm [46] to break the sequential access of data points in the classical OPTICS algorithm. The main idea is that two points should be assigned to the same cluster if they are sufficiently close (if at least one of them has sufficiently many neighbors). This relationship is transitive, so a connected component of points should also be in the same cluster. If the distance bound is set sufficiently high, all vertices will be in the same cluster. As this bound is lowered, the cluster will eventually break apart, forming sub-clusters. This is modeled by calculating a minimum distance spanning tree on the graph using an initial (high) distance bound (ε). Then, to query the dataset for the clusters that an ε′ ≤ ε would give, one has only to remove edges of weight more than ε′ from the MST, and the remaining connected components will give the clusters.

The idea of our POPTICS algorithm is as follows. Each processor computes an MST on its local dataset without incurring any communication. We then merge the local MSTs to obtain a global MST. Both steps are performed in parallel. Additionally, we extract the clusters directly from the global MST (without a linear order of the points) for any clustering distance, ε′, by simply traversing the edges of the MST once in an arbitrary order, thus also enabling the cluster generation in parallel using the parallel disjoint-set data structure [42]. POPTICS shows higher concurrency for data access while maintaining time complexity and quality comparable to those of the classical OPTICS algorithm. We note that MST-based techniques have been applied previously in cluster analysis, such as the single-linkage method that uses an MST to join clusters by the shortest distance between them [25].

[...] clusters in data of varying density by producing a linear order of points such that points which are spatially closest become neighbors in the order. OPTICS starts with adding an arbitrary point of a cluster to the order list and then iteratively expands the cluster by adding a point within the ε-neighborhood of a point in the cluster which is also closest to any of the already selected points. The process continues until the entire cluster has been added to the order. The process then moves on to the remaining clusters. Additionally, OPTICS computes the reachability distance for each point. This represents the required density in order to keep two points in the same cluster. Once the order and the reachability distances are computed, we can extract the clusters for any clustering distance, ε′ where ε′ ≤ ε, in linear time. In the following we first define the notation used throughout the paper and then present a brief description of OPTICS based on [5].

Let X be the set of data points to be clustered. The neighborhood of a point x ∈ X within a given radius ε (known as the generating distance) is called the ε-neighborhood of x, denoted by N_ε(x). More formally, N_ε(x) = { y ∈ X | DISTANCE(x, y) ≤ ε, y ≠ x }, where DISTANCE(x, y) is the distance function. A point x ∈ X is referred to as a core point if its ε-neighborhood contains at least a minimum number of points (minpts), i.e., |N_ε(x)| ≥ minpts. A point y ∈ X is directly density-reachable from x ∈ X if y is within the ε-neighborhood of x and x is a core point. A point y ∈ X is density-reachable from x ∈ X if there is a chain [...]

Figure 1: An example showing the core distance of x and the reachability distances of y and z with respect to x (MinPts = 3).
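
The definitions of the ε-neighborhood, core point, and direct density-reachability can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, DISTANCE is assumed to be Euclidean (the text leaves the distance function generic), and the neighborhood is found by a brute-force scan rather than a spatial index.

```python
import math


def distance(x, y):
    # Assumption: Euclidean distance stands in for the generic DISTANCE(x, y).
    return math.dist(x, y)


def eps_neighborhood(points, i, eps):
    # N_eps(x): indices of all points other than x itself within distance eps of x.
    return [j for j, p in enumerate(points)
            if j != i and distance(points[i], p) <= eps]


def is_core(points, i, eps, minpts):
    # x is a core point if its eps-neighborhood holds at least minpts points.
    return len(eps_neighborhood(points, i, eps)) >= minpts


def directly_density_reachable(points, y, x, eps, minpts):
    # y is directly density-reachable from x if y lies in N_eps(x) and x is a core point.
    return y in eps_neighborhood(points, x, eps) and is_core(points, x, eps, minpts)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(eps_neighborhood(points, 0, 1.0))               # neighbors of (0, 0)
print(is_core(points, 0, 1.0, 2))                     # enough neighbors for minpts = 2
print(directly_density_reachable(points, 1, 0, 1.0, 3))  # fails: (0, 0) is not core for minpts = 3
```

Note that the relation is not symmetric in general: whether y is directly density-reachable from x depends on x being a core point, not on y.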
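
The MST-based query described in the introduction (build a spanning tree under the bound ε once, then obtain the clusters for any ε′ ≤ ε by ignoring edges heavier than ε′ and taking connected components) can be sketched as follows. This is a simplified, sequential illustration and not the POPTICS implementation: edge weights are assumed to be plain Euclidean distances rather than reachability distances, minpts is ignored, and the names `prim_mst`, `DisjointSet`, and `extract_clusters` are hypothetical.

```python
import heapq
import math


def _push_edges(points, u, eps, visited, heap):
    # Offer every edge (u, j) of length <= eps to an unvisited point j.
    for j, p in enumerate(points):
        if not visited[j]:
            d = math.dist(points[u], p)
            if d <= eps:
                heapq.heappush(heap, (d, u, j))


def prim_mst(points, eps):
    # Lazy Prim's algorithm on the graph joining points at distance <= eps.
    # Restarting from each unvisited point yields a spanning forest.
    visited = [False] * len(points)
    edges = []
    for start in range(len(points)):
        if visited[start]:
            continue
        visited[start] = True
        heap = []
        _push_edges(points, start, eps, visited, heap)
        while heap:
            w, u, v = heapq.heappop(heap)
            if visited[v]:
                continue
            visited[v] = True
            edges.append((w, u, v))
            _push_edges(points, v, eps, visited, heap)
    return edges


class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        # Path halving keeps the trees shallow.
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def extract_clusters(n, mst_edges, eps_prime):
    # One pass over the MST edges, in any order: keep edges of weight <= eps_prime.
    ds = DisjointSet(n)
    for w, u, v in mst_edges:
        if w <= eps_prime:
            ds.union(u, v)
    groups = {}
    for i in range(n):
        groups.setdefault(ds.find(i), []).append(i)
    return sorted(groups.values())


points = [(0, 0), (0, 1), (0, 2), (5, 0), (5, 1), (20, 20)]
mst = prim_mst(points, eps=10.0)
print(extract_clusters(len(points), mst, eps_prime=1.5))
print(extract_clusters(len(points), mst, eps_prime=6.0))
```

Because the union operations commute, the edges can be visited in any order, which is the property the text relies on to parallelize cluster extraction. In this simplified sketch a point with no surviving edge ends up as a singleton group; a full OPTICS-style extraction would treat such points according to its own noise rules.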