My Journey to Data Mining

Ludwig-Maximilians-Universität München Institut für Informatik DATABASE Lehr- und Forschungseinheit für Datenbanksysteme SYSTEMS GROUP My Journey to Data Mining Hans-Peter Kriegel Ludwig-Maximilians-Universität München München, Germany 1 DATABASE In the beginning… SYSTEMS GROUP • spatial databases – spatial data mining height profile: Maunga Whau Volcano (Mt. Eden), Auckland, New Zealand 2 DATABASE Density-based Clustering: Intuition SYSTEMS GROUP • probability density function of the data • threshold at high probability density level • cluster of low probability density disappears to noise probability density function 3 DATABASE Density-based Clustering: Intuition SYSTEMS GROUP • low probability density level • 2 clusters are merged to 1 probability density function 4 DATABASE Density-based Clustering: Intuition SYSTEMS GROUP • medium (good) probability density level • 3 clusters are well separated probability density function 5 DATABASE DBSCAN SYSTEMS GROUP DBSCAN: Density-Based Spatial Clustering of Applications with Noise [Ester, Kriegel, Sander, Xu KDD 1996] minPts = 5 • Core points have at least minPts points in their -neighborhood • Density connectivity is defined based on core points • Clusters are transitive hulls of density-connected points 6 DATABASE DBSCAN SYSTEMS GROUP • DBSCAN received the 2014 SIGKDD Test of Time Award • DBSCAN Revisited: Mis-claim, Un-Fixability, and Approximation [Gan & Tao SIGMOD 2015] – Mis-claim according to Gan & Tao: DBSCAN terminates in O(n log n) time. DBSCAN actually runs in O(n²) worst-case time. – Our KDD 1996 paper claims: DBSCAN has an “average“ run time complexity of O(n log n) for range queries with a “small“ radius (compared to the data space size) when using an appropriate index structure (e.g. R*-tree) – The criticism should have been directed at the “average“ performance of spatial index structures such as R*-trees and not at an algorithm that uses such index structures 7 DATABASE DBSCAN SYSTEMS GROUP • Contributions of the SIGMOD 2015 paper (apply only to Euclidean distance) 1. Reduction from the USEC (Unit-Spherical Emptiness Checking) problem to the Euclidean DBSCAN problem lower bound of ( ) for the time complexity of every algorithm 4 solving the Euclidean DBSCAN⁄3 problem in 3 Ω 2. Proposal of an approximate grid-based DBSCAN ≥ algorithm for Euclidean distance running in ( ) expected time 8 DATABASE DBSCAN SYSTEMS GROUP • DBSCAN Revisited, Revisited: Why and how you should (still) use DBSCAN [E. Schubert, Sander, Ester, Kriegel, Xu, to appear in ACM TODS, 2017] – Experiments in the SIGMOD 2015 paper not of practical value – Parameter ɛ for the range queries was chosen much too large the approximate algorithm puts all objects into 1 cluster – Extensive experiments show that for adequate choice of ɛ, the original⟹ DBSCAN algorithm with an R*-tree index outperforms the SIGMOD‘15 approximate algorithm 9 DATABASE DBSCAN SYSTEMS GROUP • DBSCAN Revisited, Revisited: Why and how you should (still) use DBSCAN [E. Schubert, Sander, Ester, Kriegel, Xu, to appear in ACM TODS, 2017] – Lessons learnt from SIGMOD 2015 and ACM TODS 2017: • Lower bound of ( ) for the time complexity of any algorithm 4 solving the Euclidean⁄DBSCAN3 problem (SIGMOD 2015) • Original DBSCANΩ algorithm is still the method of choice (ACM TODS 2017) 10 DATABASE Variants of Density-based Clustering SYSTEMS GROUP • OPTICS: Ordering Points To Identify the Clustering Structure [Ankerst, Breunig, Kriegel, Sander SIGMOD 1999] • ordering of the database representing its density- based clustering structure • suitable for data of different local densities and for hierarchical clusters 11 DATABASE Variants of Density-based Clustering SYSTEMS GROUP • GDBSCAN: Generalized DBSCAN [Sander, Ester, Kriegel, Xu DMKD Journal 1998] clusters point objects as well as spatially extended objects according to spatial and non-spatial attributes and more… 12 DATABASE Survey on Density-based Clustering SYSTEMS GROUP • recent survey on density-based clustering: H.-P. Kriegel, P. Kröger, J. Sander, A. Zimek: Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3): 231–240, 2011. 13 Subspace Clustering in High- DATABASE SYSTEMS GROUP dimensional data spaces • SUBCLU: Density-Connected SUBspace CLUstering for High-Dimensional Data [Kailing, Kriegel, Kröger SDM 2004] discovers dense clusters in axis-parallel subspaces of the high-dimensional data space A A q o q p p B B p and q density-connected in {A,B}, {A} and {B} p and q not density-connected in {B} and {A,B} 14 DATABASE Outlier Detection SYSTEMS GROUP • LOF (Local Outlier Factor): Density-based, local outlier detection [Breunig, Kriegel, Ng, Sander SIGMOD 2000] • quantifies how outlying an object is in its local neighborhood C1 q C2 o2 o1 15 DATABASE High-dimensional Outlier Detection SYSTEMS GROUP • ABOD: Angle-Based Outlier Degree [Kriegel, M. Schubert, Zimek SIGKDD 2008] 73 73 72 72 71 71 70 70 o 69 69 o 68 68 67 outlier 67 no outlier 66 66 31 32 33 34 35 36 37 38 39 40 41 31 32 33 34 35 36 37 38 39 40 41 • variance of the angles of the potential "outlier" to pairs of points • angles are more stable than distances in high- dimensional spaces 16 DATABASE Subspace Outlier Detection SYSTEMS GROUP • SOD: Subspace Outlier Degree [Kriegel, Kröger, E. Schubert, Zimek PAKDD 2009] • detects outliers in subspaces of the high-dimensional data space H (kNN(p)) A 2 x x x p A3 x dist(H (kNN(p), p) x A1 17 DATABASE Approximations for Outlier Detection SYSTEMS GROUP • Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles [E. Schubert, Zimek, Kriegel DASFAA 2015] – avoids pairwise comparison of objects to compute nearest neighbors – computes nearest neighbors in near-linear time using an ensemble of space-filling curves 18 DATABASE Trend Detection SYSTEMS GROUP • SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds [E. Schubert, Weiler, Kriegel SIGKDD 2014] – introduces a new significance measure using outlier detection – tracks all keyword pairs using hash tables in a heavy-hitter type algorithm – aggregates the detected co-trends into larger topics using clustering 19 DATABASE Runtime Evaluation SYSTEMS GROUP • The (Black) Art of Runtime Evaluation: Are we comparing (data mining) algorithms or implementations? [Kriegel, E. Schubert, Zimek KAIS Journal, 1-38, 2016] – extensive study of runtime behavior of several algorithms (single-link, DBSCAN, k-means, LOF) – implementation details often dominate algorithmic merits – the same algorithm can exhibit runtime differences of two orders of magnitude and more in different implementations 20 DATABASE Runtime Evaluation SYSTEMS GROUP • For more realistic comparisons, all algorithms should be implemented – in the same framework, in the same version – at the same level of generality, modularization, and optimization – using the same backing features (DB layer, index structures) and all algorithms should be suitably parameterized. • We should – compare the behavior of algorithms in scalability experiments, not in single absolute runtime values, – demonstrate at which point (data set size, dimensionality, parameter values) the asymptotic behavior kicks in. 21 DATABASE Tutorials and Surveys SYSTEMS GROUP • Subspace clustering, clustering high-dimensional data [Kriegel, Kröger, Zimek] – Tutorials at ICDM, KDD, VLDB, PAKDD – Survey ACM TKDD 2009 • Outlier detection – Tutorials at PAKDD, KDD, SDM [Kriegel, Kröger, Zimek] • Outlier detection in high-dimensional data [Zimek, E. Schubert, Kriegel]: – Tutorials at ICDM, PAKDD – Survey Statistical Analysis and Data Mining 2012 22 DATABASE Implementations SYSTEMS GROUP • all these algorithms (and many more) are available in the ELKI framework: http://elki.dbs.ifi.lmu.de/ • ELKI is a java framework, integrating fast data management (e.g., indexing) and many data mining algorithms in a flexible way release 0.6: Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek: Interactive Data Mining with 3D-Parallel-Coordinate-Trees. Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York City, NY, 2013. (release of version 0.7.1 at VLDB 2015) 23 DATABASE SYSTEMS GROUP Thank You! 24.

My Journey to Data Mining

Outlier Detection in Graphs: a Study on the Impact of Multiple Graph Models

Comparison of Data Mining Methods on Different Applications: Clustering and Classiﬁcation Methods

A Study on the Impact of Multiple Graph Models Campos, Guilherme Oliveira; Moreira, Edré; Meira, Wagner; Zimek, Arthur

Automated Exploration for Interactive Outlier Detection

Outlier Detection with Explanations on Music Streaming Data: a Case Study with Danmark Music Group Ltd

Angle-Based Outlier Detectin in High-Dimensional Data

Density-Based Clustering Validation

Outlier Detection Techniques

47 Internal Evaluation of Unsupervised Outlier Detection

Arxiv:1901.01588V2 [Cs.LG] 10 Jun 2019 a Means to Reliably Perform Pattern Recognition (Akoglu Et Al., 2012)

Community Distribution Outlier Detection in Heterogeneous Information Networks

There and Back Again: Outlier Detection Between Statistical Reasoning and Data Mining Algorithms