Ludwig-Maximilians-Universität München, Institut für Informatik, Lehr- und Forschungseinheit für Datenbanksysteme (Database Systems Group)

Knowledge Discovery in Databases SS 2016

Chapter 4: Clustering

Lecture: Prof. Dr. Thomas Seidl

Tutorials: Julian Busch, Evgeniy Faerman, Florian Richter, Klaus Schmid

Contents

1) Introduction to Clustering
2) Partitioning Methods
   - K-Means
   - Variants: K-Medoid, K-Mode, K-Median
   - Choice of parameters: Initialization, Silhouette coefficient
3) Probabilistic Model-Based Clusters: Expectation Maximization
4) Density-based Methods: DBSCAN
5) Hierarchical Methods
   - Agglomerative and Divisive Hierarchical Clustering
   - Density-based hierarchical clustering: OPTICS
6) Evaluation of Clustering Results
7) Further Clustering Topics
   - Ensemble Clustering
   - Discussion: an alternative view on DBSCAN

What is Clustering?

Grouping a set of data objects into clusters
– Cluster: a collection of data objects that are
  1) similar to one another within the same cluster
  2) dissimilar to the objects in other clusters
– Clustering = unsupervised "classification" (no predefined classes)
Typical usage:
– As a stand-alone tool to get insight into the data distribution
– As a preprocessing step for other algorithms

Clustering Introduction 3 DATABASE General Applications of Clustering SYSTEMS GROUP

Preprocessing – as a data reduction step (instead of sampling)
– Image databases (color histograms for filtering)
– Stream clustering (handle endless data sets for offline clustering)
Image Processing
Spatial Data Analysis
– create thematic maps in Geographic Information Systems by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
Business Intelligence (especially market research)
WWW
– Documents (Web Content Mining)
– Web logs (Web Usage Mining)
Biology
– Clustering of gene expression data

Clustering Introduction 4 DATABASE An Application Example: Downsampling Images SYSTEMS GROUP

• Reassign color values to k distinct colors
• Cluster pixels using color difference, not spatial data
[Figure: the same image quantized to 65536, 256, 16, 8, 4, and 2 colors; file sizes 58483 KB, 19496 KB, 9748 KB, ...]

Major Clustering Approaches

Partitioning algorithms
– Find k partitions, minimizing some objective function
Probabilistic Model-Based Clustering (EM)
Density-based
– Find clusters based on connectivity and density functions
Hierarchical algorithms
– Create a hierarchical decomposition of the set of objects
Other methods
– Grid-based
– Neural networks (SOMs)
– Graph-theoretical methods
– Subspace Clustering
– ...

Clustering Introduction 9 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – K-Medoid – Choice of parameters: Initialization, Silhouette coefficient 3) Expectation Maximization: a statistical approach 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Ensemble Clustering – Discussion: an alternative view on DBSCAN – Outlier Detection

Partitioning Algorithms: Basic Concept

. Goal: Construct a partition of a database D of n objects into a set of k clusters C_1, ..., C_k (k < n) with C_i ⊆ D, C_i ∩ C_j = ∅ for C_i ≠ C_j, and ⋃_i C_i = D, minimizing an objective function.
– Exhaustively enumerating all possible partitions into k sets in order to find the global minimum is too expensive.

. Popular heuristic methods: – Choose k representatives for clusters, e.g., randomly – Improve these initial representatives iteratively:  Assign each object to the cluster it “fits best” in the current clustering  Compute new cluster representatives based on these assignments  Repeat until the change in the objective function from one iteration to the next drops below a threshold

. Examples of representatives for clusters – k-means: Each cluster is represented by the center of the cluster – k-medoid: Each cluster is represented by one of its objects

Clustering Partitioning Methods 11 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – Variants: K-Medoid, K-Mode, K-Median – Choice of parameters: Initialization, Silhouette coefficient 3) Probabilistic Model-Based Clusters: Expectation Maximization 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Scaling Up Clustering Algorithms – Outlier Detection

K-Means Clustering: Basic Idea

Idea of K-means: find a clustering such that the within-cluster variation of each cluster is small, and use the centroid of a cluster as its representative.
Objective: For a given k, form k groups so that the sum of the (squared) distances between the mean (centroid) of each group and its elements is minimal.
[Figure: a poor clustering (large sum of distances to the centroids μ) vs. the optimal clustering (minimal sum of distances).]

S. P. Lloyd: Least squares quantization in PCM. In IEEE Transactions on Information Theory, 1982 (original version: technical report, Bell Labs, 1957).
J. MacQueen: Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symp. on Math. Statist. and Prob., 1967.

K-Means Clustering: Basic Notions

Objects p = (p_1, ..., p_d) are points in a d-dimensional vector space (the mean μ_S of a set of points S must be defined: $\mu_S = \frac{1}{|S|}\sum_{p \in S} p$).

Measure for the compactness ("sum of squared errors") of a cluster C_j:
$SSE(C_j) = \sum_{p \in C_j} dist(p, \mu_{C_j})^2$

Measure for the compactness of a clustering 𝒞:
$SSE(\mathcal{C}) = \sum_{C_j \in \mathcal{C}} SSE(C_j) = \sum_{p \in DB} dist(p, \mu_{C(p)})^2$

Optimal partitioning: $\operatorname{argmin}_{\mathcal{C}} SSE(\mathcal{C})$

Optimizing the within-cluster variation is computationally challenging (NP-hard) ⇒ use efficient heuristic algorithms.
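A minimal NumPy sketch (not from the slides) that evaluates SSE(𝒞) for a clustering given as a label array; the names X and labels are illustrative.

import numpy as np

def sse(X, labels):
    # sum of squared Euclidean distances of each point to its cluster mean
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]          # points assigned to cluster c
        centroid = members.mean(axis=0)   # mu_C, the cluster mean
        total += ((members - centroid) ** 2).sum()
    return total

# toy example: the natural grouping has a much smaller SSE than a bad one
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(sse(X, np.array([0, 0, 1, 1])))
print(sse(X, np.array([0, 1, 0, 1])))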

Clustering Partitioning Methods K-Means 14 DATABASE K-Means Clustering: Algorithm SYSTEMS GROUP

k-Means algorithm (Lloyd's algorithm): Given k, the k-means algorithm is implemented in 2 main steps:
Initialization: Choose k arbitrary representatives.
Repeat until the representatives do not change:
1. Assign each object to the cluster with the nearest representative.
2. Compute the centroids of the clusters of the current partitioning.
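A compact NumPy sketch of Lloyd's algorithm as described above; the random initialization, tie handling, and empty-cluster handling shown here are simplifying assumptions of this illustration, not the only options.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: choose k arbitrary data points as representatives
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 1: assign each object to the cluster with the nearest representative
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: compute the centroids of the current partitioning
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j]               # keep old center for empty clusters
                                for j in range(k)])
        if np.allclose(new_centers, centers):                 # representatives unchanged -> stop
            break
        centers = new_centers
    return centers, labels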

[Figure: five snapshots of k-means – init: arbitrary representatives → assign objects (new clustering candidate) → compute new means (centroids of current partition) → assign objects → compute new means; repeat until stable.]

K-Means Clustering: Discussion

Strengths:
– Relatively efficient: O(tkn), where n = # objects, k = # clusters, t = # iterations; typically k, t ≪ n
– Easy implementation
Weaknesses:
– Applicable only when the mean is defined
– Need to specify k, the number of clusters, in advance
– Sensitive to noisy data and outliers
– Clusters are forced to convex space partitions (Voronoi cells)
– Result and runtime strongly depend on the initial partition; often terminates at a local optimum – however, methods for a good initialization exist
Several variants of the k-means method exist, e.g., ISODATA:
– extends k-means by methods to eliminate very small clusters and to merge and split clusters; the user has to specify additional parameters

Clustering Partitioning Methods K-Means 16 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – Variants: K-Medoid, K-Mode, K-Median – Choice of parameters: Initialization, Silhouette coefficient 3) Probabilistic Model-Based Clusters: Expectation Maximization 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Scaling Up Clustering Algorithms – Outlier Detection

K-Medoid, K-Mode, K-Median Clustering: Basic Idea

. Problems with K-Means:
– Applicable only when the mean is defined (vector space)
– Outliers have a strong influence on the result
. The influence of outliers is intensified by the use of the squared error ⇒ use the absolute error (total distance) instead:
$TD(C) = \sum_{p \in C} dist(p, m_C)$ and $TD(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} TD(C_i)$
. Three alternatives for using the mean as representative:
– Medoid: representative object "in the middle"
– Mode: value that appears most often
– Median: (artificial) representative value "in the middle"
. Objective as for k-Means: find k representatives so that the sum of the distances between objects and their closest representative is minimal.
[Figure: a data set with an outlier – a poor clustering (mean pulled towards the outlier) vs. the optimal clustering using a medoid.]

K-Median Clustering

Problem: Sometimes, data is not numerical

Idea: If there is an ordering on the data X = {x_1, x_2, x_3, ..., x_n}, use the median instead of the mean:
– Median({x}) = x
– Median({x, y}) ∈ {x, y}
– Median(X) = Median(X − {min X} − {max X}), if |X| > 2
• The median is computed in each dimension independently and can thus be a combination of multiple instances ⇒ the median can be computed efficiently for ordered data
• Different strategies are possible to determine the "middle" in an array of even length
[Figure: mean vs. median of an example dataset.]

K-Mode Clustering: First Approach [Huang 1997]

Given: X ⊆ Ω = A_1 × A_2 × ... × A_d is a set of n objects, each described by d categorical attributes A_i (1 ≤ i ≤ d).

Mode: a mode of X is a vector M = (m_1, m_2, ..., m_d) ∈ Ω that minimizes
$d(M, X) = \sum_{x_i \in X} d(x_i, M)$
where d is a distance function for categorical values (e.g., Hamming distance).
→ Note: M is not necessarily an element of X.

Theorem to determine a mode: let $f(c, j, X) = \frac{1}{n}\,|\{x \in X \mid x_j = c\}|$ be the relative frequency of category c of attribute A_j in the data. Then:
d(M, X) is minimal ⇔ ∀j ∈ {1, ..., d}: ∀c ∈ A_j: f(m_j, j, X) ≥ f(c, j, X)
→ This allows using the k-means paradigm to cluster categorical data without losing its efficiency.
→ Note: the mode of a dataset is not necessarily unique.
The K-Modes algorithm proceeds analogously to the k-Means algorithm.

Huang, Z.: A Fast Clustering Algorithm to Cluster very Large Categorical Data Sets in Data Mining. In DMKD, 1997.

K-Mode Clustering: Example

Employee-ID  Profession   Household Pet
#133         Technician   Cat
#134         Manager      None
#135         Cook         Cat
#136         Programmer   Dog
#137         Programmer   None
#138         Technician   Cat
#139         Programmer   Snake
#140         Cook         Cat
#141         Advisor      Dog

Profession: (Programmer: 3, Technician: 2, Cook: 2, Advisor: 1, Manager:1) Household Pet: (Cat: 4, Dog: 2, None: 2, Snake: 1)

Mode is (Programmer, Cat) Remark: (Programmer, Cat) ∉ 퐷퐵
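A small sketch (illustrative, not from the slides) that applies the theorem above to the example: the mode is obtained attribute-wise by taking the most frequent category.

from collections import Counter

def mode_vector(X):
    # component-wise most frequent category; minimizes the summed Hamming distance d(M, X)
    d = len(X[0])
    return tuple(Counter(x[j] for x in X).most_common(1)[0][0] for j in range(d))

employees = [("Technician", "Cat"), ("Manager", "None"), ("Cook", "Cat"),
             ("Programmer", "Dog"), ("Programmer", "None"), ("Technician", "Cat"),
             ("Programmer", "Snake"), ("Cook", "Cat"), ("Advisor", "Dog")]
print(mode_vector(employees))   # ('Programmer', 'Cat') -- need not be an element of X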

Clustering Partitioning Methods Variants: K-Medoid, K-Mode, K-Median 21 DATABASE K-Medoid Clustering: PAM Algorithm SYSTEMS GROUP

Partitioning Around Medoids [Kaufman and Rousseeuw, 1990]

. Given k, the k-medoid algorithm is implemented in 3 steps:
– Initialization: Select k objects arbitrarily as initial medoids (representatives)
– assign each remaining (non-medoid) object to the cluster with the nearest representative
– compute TD_current
– Iterative improvement: as long as TD can be decreased, swap the medoid/non-medoid pair that yields the largest decrease (see the pseudocode below)

. Problem of PAM: high complexity, O(tk(n − k)²)

Kaufman L., Rousseeuw P. J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.

Algorithm PAM (pseudocode)

PAM(point set D, integer k)
  initialize the k medoids;
  TD_change := −∞;
  while TD_change < 0 do
    for each pair (medoid M, non-medoid N), compute the value TD_{N↔M};
    choose the pair (M, N) for which TD_change := TD_{N↔M} − TD is minimal;
    if TD_change < 0 then
      replace medoid M by non-medoid N;
      store the current medoids as the best partitioning so far;
  return the medoids;


Algorithm CLARANS (pseudocode)

CLARANS(point set D, integer k, integer numlocal, integer maxneighbor)
  repeat numlocal times:
    choose k objects at random as medoids; i := 0;
    while i < maxneighbor do
      choose a random pair (medoid M, non-medoid N);
      compute TD_change := TD_{N↔M} − TD;
      if TD_change < 0 then
        replace M by N;
        TD := TD_{N↔M};
        i := 0;
      else i := i + 1;
    if TD < TD_best then
      TD_best := TD; store the current medoids;
  return the medoids;

K-Means/Medoid/Mode/Median Overview

Employee-ID  Profession   Age  Shoe size
#133         Technician   42   28
#134         Manager      41   45
#135         Cook         46   32
#136         Programmer   40   35
#137         Programmer   41   49
#138         Technician   43   41
#139         Programmer   39   29
#140         Cook         38   33
#141         Advisor      40   56

Mode: Profession: Programmer; Age: 40/41; Shoe size: n.a.
[Figures: scatter plots of Age vs. Shoe size marking the positions of the mean, medoid, median, and mode representatives.]

K-Means/Median/Mode/Medoid Clustering: Discussion

                         k-Means                k-Median                 k-Mode                      k-Medoid
data                     numerical data (mean)  ordered attribute data   categorical attribute data  any data with a distance function
efficiency               high, O(tkn)           high, O(tkn)             high, O(tkn)                low, O(tk(n − k)²)
sensitivity to outliers  high                   low                      low                         low

. Strengths
– Easy implementation (→ many variations and optimizations in the literature)
. Weaknesses
– Need to specify k, the number of clusters, in advance
– Clusters are forced to convex space partitions (Voronoi cells)
– Result and runtime strongly depend on the initial partition; often terminates at a local optimum – however, methods for a good initialization exist

Clustering Partitioning Methods Variants: K-Medoid, K-Mode, K-Median 27 DATABASE Voronoi Model for convex cluster regions SYSTEMS GROUP

Definition: Voronoi diagram

– For a given set of points P = {p_i | i = 1, ..., k} (here: the cluster representatives), a Voronoi diagram partitions the data space into Voronoi cells, one cell per point.
– The cell of a point p ∈ P covers all points in the data space for which p is the nearest neighbor among the points from P.

Observations:
– The Voronoi cells of two neighboring points p_i, p_j ∈ P are separated by the perpendicular bisector hyperplane („Mittelsenkrechte") between p_i and p_j.
– Since Voronoi cells are intersections of half spaces, they are convex regions.

Clustering Partitioning Methods 28 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – Variants: K-Medoid, K-Mode, K-Median – Choice of parameters: Initialization, Silhouette coefficient 3) Probabilistic Model-Based Clusters: Expectation Maximization 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Scaling Up Clustering Algorithms

Initialization of Partitioning Clustering Methods

Just two examples:

[naïve]
– Choose a sample A of the whole dataset
– Cluster the sample and use the resulting centers as initialization

[Fayyad, Reina, and Bradley 1998]
– Choose m different (small) samples A, ..., M of the dataset
– Cluster each sample to get m estimates for the k representatives:
  A = (A1, A2, ..., Ak), B = (B1, ..., Bk), ..., M = (M1, ..., Mk)
– Then cluster the set DS = A ∪ B ∪ ... ∪ M m times, each time using the centers of A, B, ..., M as the respective initial partitioning
– Use the centers of the best of these clusterings as initialization for the partitioning clustering of the whole dataset
[Figure: k = 3, m = 4 samples A, B, C, D, their estimated centers, and the true cluster centers.]

Fayyad U., Reina C., Bradley P. S.: "Initialization of Iterative Refinement Clustering Algorithms". In KDD 1998, pp. 194-198.

Clustering Partitioning Methods Choice of parameters 31 DATABASE Choice of the Parameter k SYSTEMS GROUP

. Idea for a method:
– Determine a clustering for each k = 2, ..., n−1
– Choose the "best" clustering
. But how to measure the quality of a clustering?
– A measure should not be monotonic over k.
– The compactness measures SSE and TD are monotonically decreasing with increasing k.
. Silhouette coefficient [Kaufman & Rousseeuw 1990]
– A measure for the quality of a k-means or k-medoid clustering that is not monotonic over k.

Clustering Partitioning Methods Choice of parameters 32 DATABASE The Silhouette coefficient (1) SYSTEMS GROUP

. Basic idea:
– How good is the clustering = how appropriate is the mapping of objects to clusters?
– Elements in a cluster should be "similar" to their representative ⇒ measure the average distance of objects to their representative: a(o)
– Elements in different clusters should be "dissimilar" ⇒ measure the average distance of objects to the alternative cluster (i.e., the second closest cluster): b(o)
[Figure: an object o with its average distance a(o) to its own cluster and b(o) to the second closest cluster.]

Clustering Partitioning Methods Choice of parameters 33 DATABASE The Silhouette coefficient (2) SYSTEMS GROUP

. a(o): average distance between object o and the objects in its cluster C(o):
$a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} dist(o, p)$

. b(o): for each other cluster C_i, compute the average distance between o and the objects in C_i; then take the smallest such average:
$b(o) = \min_{C_i \neq C(o)} \frac{1}{|C_i|} \sum_{p \in C_i} dist(o, p)$

. The silhouette of o is then defined as
$s(o) = \begin{cases} 0 & \text{if } a(o) = 0 \text{ (e.g., } |C(o)| = 1) \\ \frac{b(o) - a(o)}{\max\{a(o), b(o)\}} & \text{otherwise} \end{cases}$

. The values of the silhouette coefficient range from −1 to +1.
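A minimal NumPy sketch of s(o) for all objects, following the slide's definitions of a(o), b(o), s(o) literally (naive O(n²) distances; assumes at least two clusters; names are illustrative).

import numpy as np

def silhouette_values(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        a = D[i, labels == labels[i]].mean()        # includes dist(o, o) = 0, as in the slide's formula
        if a == 0:                                  # e.g. singleton cluster -> s(o) = 0
            continue
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# silhouette of the whole clustering = silhouette_values(X, labels).mean()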

Clustering Partitioning Methods Choice of parameters 34 DATABASE The Silhouette coefficient (3) SYSTEMS GROUP

. The silhouette of a cluster C_i is defined as:
$silh(C_i) = \frac{1}{|C_i|} \sum_{o \in C_i} s(o)$

. The silhouette of a clustering 𝒞 = (C_1, ..., C_k) is defined as:
$silh(\mathcal{C}) = \frac{1}{|D|} \sum_{o \in D} s(o)$,

where D denotes the whole dataset.

Clustering Partitioning Methods Choice of parameters 35 DATABASE The Silhouette coefficient (4) SYSTEMS GROUP

. "Reading" the silhouette coefficient (let a(o) ≠ 0):
– b(o) ≫ a(o) ⇒ s(o) ≈ 1: good assignment of o to its cluster A
– b(o) ≈ a(o) ⇒ s(o) ≈ 0: o is in-between A and B
– b(o) ≪ a(o) ⇒ s(o) ≈ −1: bad; on average, o is closer to members of B

. Silhouette coefficient s_𝒞 of a clustering: average silhouette of all objects
– 0.7 < s_𝒞 ≤ 1.0: strong structure
– 0.5 < s_𝒞 ≤ 0.7: medium structure
– 0.25 < s_𝒞 ≤ 0.5: weak structure
– s_𝒞 ≤ 0.25: no structure

Clustering Partitioning Methods Choice of parameters 36 DATABASE The Silhouette coefficient (5) SYSTEMS GROUP

[Figure: silhouette coefficients for points in ten clusters; from Tan, Steinbach, Kumar: Introduction to Data Mining (Pearson, 2006).]


Expectation Maximization (EM)

Statistical approach for finding maximum likelihood estimates of parameters in probabilistic models.
Here: using EM as a clustering algorithm.
Approach: observations are drawn from one of several components of a mixture distribution.

Main idea: – Define clusters as probability distributions  each object has a certain probability of belonging to each cluster – Iteratively improve the parameters of each distribution (e.g. center, “width” and “height” of a Gaussian distribution) until some quality threshold is reached

Additional literature: C. M. Bishop: "Pattern Recognition and Machine Learning". Springer, 2009.

Excursus: Gaussian Mixture Distributions

Note: EM is not restricted to Gaussian distributions, but they will serve as the example in this lecture.
Gaussian distribution:
– Univariate (single variable x ∈ ℝ):
$p(x \mid \mu, \sigma^2) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
with mean μ ∈ ℝ and variance σ² ∈ ℝ.
– Multivariate (d-dimensional vector x ∈ ℝ^d):
$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$
with mean vector μ ∈ ℝ^d and covariance matrix Σ ∈ ℝ^{d×d}.
Gaussian mixture distribution with K components (d-dimensional vector x ∈ ℝ^d):
$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
with mixing coefficients π_k ∈ ℝ, $\sum_k \pi_k = 1$ and 0 ≤ π_k ≤ 1.
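A short sketch evaluating a Gaussian mixture density p(x) with SciPy; the two component parameters below are made up purely for illustration.

import numpy as np
from scipy.stats import multivariate_normal

# illustrative 2-component mixture in 2 dimensions
pis    = [0.6, 0.4]                                        # mixing coefficients, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]      # mean vectors
sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]   # covariance matrices

def mixture_pdf(x):
    return sum(pi * multivariate_normal(mu, sigma).pdf(x)
               for pi, mu, sigma in zip(pis, mus, sigmas))

print(mixture_pdf(np.array([1.0, 1.0])))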

Clustering Expectation Maximization (EM) 40 Expectation Maximization (EM): DATABASE SYSTEMS Exemplary Application GROUP

Example taken from: C. M. Bishop, "Pattern Recognition and Machine Learning", 2009.
[Figure: EM on an example dataset after iterations 1, 2, 5, and 20.]

Clustering Expectation Maximization (EM) 41 DATABASE Expectation Maximization (EM) SYSTEMS GROUP

Note: EM is not restricted to Gaussian distributions, but they will serve as example in this lecture.

A clustering ℳ = (C_1, ..., C_K) is represented by a mixture distribution with parameters Θ = (π_1, μ_1, Σ_1, ..., π_K, μ_K, Σ_K):
$p(\mathbf{x} \mid \Theta) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Each cluster is represented by one component of the mixture distribution:
$p(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Given a dataset X = {x_1, ..., x_N} ⊆ ℝ^d, we can write the (log-)likelihood that all data points x_n ∈ X are generated independently by the mixture model with parameters Θ as:
$\log p(\mathbf{X} \mid \Theta) = \log \prod_{n=1}^{N} p(\mathbf{x}_n \mid \Theta)$

Goal: Find the parameters Θ_ML with maximal (log-)likelihood (maximum likelihood estimation, MLE):
$\Theta_{ML} = \arg\max_{\Theta} \log p(\mathbf{X} \mid \Theta)$

Clustering Expectation Maximization (EM) 42 DATABASE Expectation Maximization (EM) SYSTEMS GROUP

• Goal: Find the parameters Θ_ML with the maximal (log-)likelihood:
$\Theta_{ML} = \arg\max_{\Theta} \log p(\mathbf{X} \mid \Theta)$

$\log p(\mathbf{X} \mid \Theta) = \log \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

• Maximization with respect to the means:
$\frac{\partial}{\partial \boldsymbol{\mu}_j} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \boldsymbol{\Sigma}_j^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_j)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$

$\frac{\partial \log p(\mathbf{X} \mid \Theta)}{\partial \boldsymbol{\mu}_j} = \sum_{n=1}^{N} \frac{\partial p(\mathbf{x}_n \mid \Theta)/\partial \boldsymbol{\mu}_j}{p(\mathbf{x}_n \mid \Theta)} = \sum_{n=1}^{N} \frac{\pi_j\, \boldsymbol{\Sigma}_j^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_j)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$

$\frac{\partial \log p(\mathbf{X} \mid \Theta)}{\partial \boldsymbol{\mu}_j} = \boldsymbol{\Sigma}_j^{-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_j)\, \frac{\pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)} \overset{!}{=} \mathbf{0}$

• Define $\gamma_j(\mathbf{x}_n) := \frac{\pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$.

γ_j(x_n) is the probability (responsibility) that component j generated the object x_n.

Clustering Expectation Maximization (EM) 43 DATABASE Expectation Maximization (EM) SYSTEMS GROUP

Maximization w.r.t. the means yields:
$\boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)\, \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)}$  (weighted mean)

Maximization w.r.t. the covariances yields:
$\boldsymbol{\Sigma}_j = \frac{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)\,(\mathbf{x}_n - \boldsymbol{\mu}_j)(\mathbf{x}_n - \boldsymbol{\mu}_j)^T}{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)}$

Maximization w.r.t. the mixing coefficients yields:
$\pi_j = \frac{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)}{\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_k(\mathbf{x}_n)}$

Clustering Expectation Maximization (EM) 44 DATABASE Expectation Maximization (EM) SYSTEMS GROUP

Problem with finding the optimal parameters Θ_ML:
$\boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)\, \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)}$  and  $\gamma_j(\mathbf{x}_n) = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$
– Non-linear mutual dependencies: optimizing the Gaussian of cluster j depends on all other Gaussians.
⇒ There is no closed-form solution!
⇒ Approximation through an iterative optimization procedure
⇒ Break the mutual dependencies by optimizing μ_j and γ_j(x_n) alternately and independently.

Clustering Expectation Maximization (EM) 45 DATABASE Expectation Maximization (EM) SYSTEMS GROUP

EM-approach: iterative optimization

1. Initialize the means μ_j, covariances Σ_j, and mixing coefficients π_j and evaluate the initial log-likelihood.
2. E step: Evaluate the responsibilities using the current parameter values:
$\gamma_j^{new}(\mathbf{x}_n) = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$
3. M step: Re-estimate the parameters using the current responsibilities:
$\boldsymbol{\mu}_j^{new} = \frac{\sum_{n=1}^{N} \gamma_j^{new}(\mathbf{x}_n)\, \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_j^{new}(\mathbf{x}_n)}$
$\boldsymbol{\Sigma}_j^{new} = \frac{\sum_{n=1}^{N} \gamma_j^{new}(\mathbf{x}_n)\,(\mathbf{x}_n - \boldsymbol{\mu}_j^{new})(\mathbf{x}_n - \boldsymbol{\mu}_j^{new})^T}{\sum_{n=1}^{N} \gamma_j^{new}(\mathbf{x}_n)}$
$\pi_j^{new} = \frac{\sum_{n=1}^{N} \gamma_j^{new}(\mathbf{x}_n)}{\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_k^{new}(\mathbf{x}_n)}$
4. Evaluate the new log-likelihood log p(X | Θ^new) and check for convergence of the parameters or the log-likelihood (|log p(X | Θ^new) − log p(X | Θ)| ≤ ε). If the convergence criterion is not satisfied, set Θ = Θ^new and go to step 2.
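A compact NumPy/SciPy sketch of the EM loop for a Gaussian mixture, following steps 1–4 above; the random initialization and the small regularization term are simplifying assumptions of this illustration, not part of the original algorithm description.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. initialize means, covariances and mixing coefficients
    mus = X[rng.choice(N, K, replace=False)]
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    old_ll = -np.inf
    for _ in range(max_iter):
        # 2. E step: responsibilities gamma_j(x_n)
        dens = np.column_stack([pis[k] * multivariate_normal(mus[k], sigmas[k]).pdf(X)
                                for k in range(K)])              # shape N x K
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate means, covariances, mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N
        # 4. check convergence of the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - old_ll) <= eps:
            break
        old_ll = ll
    return pis, mus, sigmas, gamma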

Clustering Expectation Maximization (EM) 46 DATABASE EM: Turning the Soft Clustering into a Partitioning SYSTEMS GROUP

EM obtains a soft clustering (each object belongs to each cluster with a certain probability), reflecting the uncertainty of the most appropriate assignment.
Example taken from: C. M. Bishop, "Pattern Recognition and Machine Learning", 2009.
[Figure: a) input for EM, b) soft clustering result of EM, c) original data.]
Modification to obtain a partitioning variant:
– Assign each object to the cluster to which it belongs with the highest probability:
Cluster(x_n) = argmax_{k ∈ {1,...,K}} {γ_k(x_n)}

Clustering Expectation Maximization (EM) 47 DATABASE Discussion SYSTEMS GROUP

Superior to k-Means for clusters of varying size or clusters having differing variances ⇒ more accurate data representation.
Convergence to a (possibly local) maximum.
Computational effort for N objects, K derived clusters, and t iterations: O(t · N · K); the number of iterations is quite high in many cases.
Both result and runtime strongly depend on
– the initial assignment:
  ⇒ do multiple random starts and choose the final estimate with the highest likelihood
  ⇒ initialize with clustering algorithms (e.g., K-Means usually converges much faster)
  ⇒ local maxima and initialization issues have been addressed in various extensions of EM
– a proper choice of the parameter K (= desired number of clusters):
  ⇒ apply principles of model selection (see next slide)

Clustering Expectation Maximization (EM) 48 DATABASE EM: Model Selection for Determining Parameter K SYSTEMS GROUP

Classical trade-off problem for selecting the proper number of components K:
– If K is too high, the mixture may overfit the data.
– If K is too low, the mixture may not be flexible enough to approximate the data.

Idea: determine candidate models Θ_K for a range of values of K (from K_min to K_max) and select the model
$\Theta_{K^*} = \arg\max \{qual(\Theta_K) \mid K \in \{K_{min}, \dots, K_{max}\}\}$
– The silhouette coefficient (as for k-Means) only works for partitioning approaches.
– The MLE (maximum likelihood estimation) criterion is non-decreasing in K.
Solution: deterministic or stochastic model selection methods [MP'00], which try to balance the goodness of fit with simplicity:
– Deterministic: $qual(\Theta_K) = \log p(\mathbf{X} \mid \Theta_K) + \mathcal{P}(K)$, where 𝒫(K) is an increasing function penalizing higher values of K
– Stochastic: based on Markov Chain Monte Carlo (MCMC)

[MP'00] G. McLachlan and D. Peel: Finite Mixture Models. Wiley, New York, 2000.


Density-Based Clustering

. Basic idea: clusters are dense regions in the data space, separated by regions of lower object density
. Why density-based clustering?
[Figure: results of a k-medoid algorithm for k = 4.]
. Different density-based approaches exist (see textbook & papers); here we discuss the ideas underlying the DBSCAN algorithm.

Density-Based Clustering: Basic Concept

. Intuition for the formalization of the basic idea:
– For any point in a cluster, the local point density around that point has to exceed some threshold.
– The set of points from one cluster is spatially connected.
. Local point density at a point q is defined by two parameters:
– ε-radius for the neighborhood of point q: N_ε(q) := {p ∈ D | dist(p, q) ≤ ε}  (note: N_ε(q) contains q itself!)
– MinPts – minimum number of points in the given neighborhood N_ε(q)
. q is called a core object (or core point) w.r.t. ε, MinPts if |N_ε(q)| ≥ MinPts.
[Figure: a point q whose ε-neighborhood contains 5 points; for MinPts = 5, q is a core object.]

Clustering Density-based Methods: DBSCAN 52 DATABASE Density-Based Clustering: Basic Definitions SYSTEMS GROUP

. p is directly density-reachable from q w.r.t. ε, MinPts if
1) p ∈ N_ε(q) and
2) q is a core object w.r.t. ε, MinPts
. density-reachable: the transitive closure of directly density-reachable
. p is density-connected to a point q w.r.t. ε, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. ε, MinPts
[Figures: illustrations of direct density-reachability, density-reachability, and density-connectivity.]

Clustering Density-based Methods: DBSCAN 53 DATABASE Density-Based Clustering: Basic Definitions SYSTEMS GROUP

. Density-based cluster: a non-empty subset S of database D satisfying:
1) Maximality: if p is in S and q is density-reachable from p, then q is in S
2) Connectivity: each object in S is density-connected to all other objects in S

. Density-based clustering of a database D: {S_1, ..., S_n; N} where
– S_1, ..., S_n: all density-based clusters in the database D
– N = D \ (S_1 ∪ ... ∪ S_n) is called the noise (objects not in any cluster)

[Figure: core, border, and noise points for ε = 1.0, MinPts = 5.]

Clustering Density-based Methods: DBSCAN 54 DATABASE Density-Based Clustering: DBSCAN Algorithm SYSTEMS GROUP

. Density-Based Spatial Clustering of Applications with Noise
. Basic theorem:
– Each object in a density-based cluster C is density-reachable from any of its core objects
– Nothing else is density-reachable from core objects

for each o ∈ D do
  if o is not yet classified then
    if o is a core object then
      collect all objects density-reachable from o and assign them to a new cluster
    else
      assign o to NOISE

– Density-reachable objects are collected by performing successive ε-neighborhood queries.

Ester M., Kriegel H.-P., Sander J., Xu X.: „A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise“, In KDD 1996, pp. 226—231.
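A straightforward O(n²) Python sketch of the idea above (brute-force range queries; the label conventions are choices of this illustration, not the original implementation).

import numpy as np

NOISE, UNCLASSIFIED = -1, 0

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.full(n, UNCLASSIFIED)
    cluster_id = 0
    for o in range(n):
        if labels[o] != UNCLASSIFIED:
            continue
        neighbors = list(np.where(D[o] <= eps)[0])          # eps-neighborhood query (contains o)
        if len(neighbors) < min_pts:                         # o is not a core object
            labels[o] = NOISE
            continue
        cluster_id += 1                                      # start a new cluster
        labels[o] = cluster_id
        seeds = [p for p in neighbors if p != o]
        while seeds:                                         # collect all points density-reachable from o
            p = seeds.pop()
            if labels[p] == NOISE:
                labels[p] = cluster_id                       # former noise point becomes a border point
            if labels[p] != UNCLASSIFIED:
                continue
            labels[p] = cluster_id
            p_neighbors = np.where(D[p] <= eps)[0]
            if len(p_neighbors) >= min_pts:                  # p is a core object: expand further
                seeds.extend(q for q in p_neighbors if labels[q] in (UNCLASSIFIED, NOISE))
    return labels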

Clustering Density-based Methods: DBSCAN 55 DATABASE DBSCAN Algorithm: Example SYSTEMS GROUP

. Parameters: ε = 2.0, MinPts = 3
[Figures: step-by-step execution of DBSCAN on an example dataset, shown over three slides.]

Clustering Density-based Methods: DBSCAN 58 DATABASE Determining the Parameters e and MinPts SYSTEMS GROUP

. Cluster: point density higher than specified by ε and MinPts
. Idea: use the point density of the least dense cluster in the data set as parameters – but how to determine this?
. Heuristic: look at the distances to the k-nearest neighbors
[Figure: 4-distance(p) for a point p inside a cluster vs. 4-distance(q) for an outlier q.]
. Function k-distance(p): distance from p to its k-th nearest neighbor
. k-distance plot: k-distances of all objects, sorted in decreasing order

Clustering Density-based Methods: DBSCAN 59 DATABASE Determining the Parameters e and MinPts SYSTEMS GROUP

. Heuristic method:
– Fix a value for MinPts (default: 2·d − 1, where d = dimension of the data space)
– The user selects a "border object" o from the MinPts-distance plot; ε is set to MinPts-distance(o)
. Example k-distance plot (dim = 2 ⇒ MinPts = 3):
1. identify the border object at the first "kink" of the sorted 3-distance plot, 2. set ε to its 3-distance
[Figure: sorted 3-distance plot with the first "kink" and the chosen border object marked.]
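A short sketch for producing the sorted k-distance plot used by the heuristic above (brute-force neighbor search; the plotting call in the comment is optional).

import numpy as np

def k_distances(X, k):
    # distance of every point to its k-th nearest neighbor, sorted in decreasing order
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)                 # column 0 is the distance of each point to itself (0)
    kd = D[:, k]                   # k-th nearest neighbor, excluding the point itself
    return np.sort(kd)[::-1]

# choose eps at the first "kink" of the plot, e.g.:
# import matplotlib.pyplot as plt; plt.plot(k_distances(X, 3)); plt.show()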

Clustering Density-based Methods: DBSCAN 60 DATABASE Determining the Parameters e and MinPts SYSTEMS GROUP

Problematic example:
[Figure: a dataset containing clusters of strongly differing densities (A, B, C, ..., G, with sub-clusters such as B', D', D1, D2, G1, G2, G3) and its 3-distance plot; the plot shows several "kinks", one per density level, so a single global ε value does not capture all clusters.]

Clustering Density-based Methods: DBSCAN 61 DATABASE Density-Based Clustering: Discussion SYSTEMS GROUP

. Advantages
– Clusters can have arbitrary shape and size, i.e., clusters are not restricted to convex shapes
– The number of clusters is determined automatically
– Can separate clusters from surrounding noise
– Can be supported by spatial index structures
– Complexity: N_ε-query: O(n); DBSCAN: O(n²)

. Disadvantages
– Input parameters may be difficult to determine
– In some situations very sensitive to the input parameter setting

Clustering Density-based Methods: DBSCAN 62 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – K-Medoid – Choice of parameters: Initialization, Silhouette coefficient 3) Expectation Maximization: a statistical approach 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Ensemble Clustering – Discussion: an alternative view on DBSCAN – Outlier Detection

From Partitioning to Hierarchical Clustering

. Global parameters to separate all clusters with a partitioning clustering method may not exist:
– hierarchical cluster structure, and/or
– largely differing cluster densities and sizes
[Figure: example datasets with nested (hierarchical) clusters and with clusters of largely differing densities and sizes.]
⇒ Need a hierarchical clustering algorithm in these situations

Clustering Hierarchical Methods 78 DATABASE Hierarchical Clustering: Basic Notions SYSTEMS GROUP

. Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters
. The result is represented by a so-called dendrogram (Greek dendron = tree)
– Nodes in the dendrogram represent possible clusters
– It can be constructed bottom-up (agglomerative approach) or top-down (divisive approach)
[Figure: dendrogram over objects a–e, merging a,b → ab and d,e → de → cde → abcde; read left to right (steps 0–4) for the agglomerative and right to left for the divisive direction.]

Clustering Hierarchical Methods 79 DATABASE Hierarchical Clustering: Example SYSTEMS GROUP

. Interpretation of the dendrogram:
– The root represents the whole data set
– A leaf represents a single object in the data set
– An internal node represents the union of all objects in its sub-tree
– The height of an internal node represents the distance between its two child nodes
[Figure: example dataset of 9 points and the corresponding dendrogram, with the distance between clusters on the vertical axis.]


Agglomerative Hierarchical Clustering

1. Initially, each object forms its own cluster.
2. Consider all pairwise distances between the initial clusters (objects).
3. Merge the closest pair (A, B) in the set of the current clusters into a new cluster C = A ∪ B.
4. Remove A and B from the set of current clusters; insert C into the set of current clusters.
5. If the set of current clusters contains only C (i.e., if C represents all objects from the database): STOP.
6. Else: determine the distance between the new cluster C and all other clusters in the set of current clusters; go to step 3.

 Requires a distance function for clusters (sets of objects)

Clustering Hierarchical Methods Agglomerative Hierarchical Clustering 82 DATABASE Single Link Method and Variants SYSTEMS GROUP

. Given: a distance function dist(p, q) for database objects.
. The following distance functions for clusters (i.e., sets of objects) X and Y are commonly used for hierarchical clustering:

Single-Link: $dist\_sl(X, Y) = \min_{x \in X,\, y \in Y} dist(x, y)$

Complete-Link: $dist\_cl(X, Y) = \max_{x \in X,\, y \in Y} dist(x, y)$

Average-Link: $dist\_al(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{x \in X,\, y \in Y} dist(x, y)$
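Single-, complete-, and average-link agglomerative clustering are readily available in SciPy; a minimal usage sketch (the toy dataset and the cut into two clusters are illustrative choices).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
dists = pdist(X)                                  # pairwise distances dist(p, q)

Z = linkage(dists, method='single')               # 'complete' / 'average' select the other variants
labels = fcluster(Z, t=2, criterion='maxclust')   # derive a flat partition with 2 clusters
print(labels)
# dendrogram(Z) would draw the hierarchy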

Clustering Hierarchical Methods Agglomerative Hierarchical Clustering 83 DATABASE Divisive Hierarchical Clustering SYSTEMS GROUP

General approach: top-down
– Initially, all objects form one cluster
– Repeat until all objects are singletons:
  • Choose a cluster to split – how?
  • Replace the chosen cluster with the sub-clusters – how to split? (e.g., by 'reversing' the agglomerative approach and splitting into two)
Example solution: DIANA
– Select the cluster C with the largest diameter for splitting
– Search for the most disparate observation o in C (highest average dissimilarity): SplinterGroup := {o}
– Iteratively assign the o' ∈ C \ SplinterGroup with the highest D(o') > 0 to the splinter group, until D(o') ≤ 0 for all o' ∈ C \ SplinterGroup, where
$D(o') = \sum_{o_j \in C \setminus SplinterGroup} \frac{d(o', o_j)}{|C \setminus SplinterGroup|} - \sum_{o_i \in SplinterGroup} \frac{d(o', o_i)}{|SplinterGroup|}$

Clustering Hierarchical Methods Divisive Hierarchical Clustering 84 DATABASE Discussion Agglomerative vs. Divisive HC SYSTEMS GROUP

Divisive HC and agglomerative HC need n − 1 steps.
– Agglomerative HC has to consider $\binom{n}{2} = \frac{n \cdot (n-1)}{2}$ combinations in its first step.
– Divisive HC has $2^{n-1} - 1$ possibilities to split the data in its first step – not every possibility has to be considered (DIANA).
– Divisive HC is conceptually more complex since it needs a second 'flat' clustering algorithm (splitting procedure).
– Agglomerative HC decides based on local patterns; divisive HC uses complete information about the global data distribution → more accurate than agglomerative HC?!

Clustering Hierarchical Methods Divisive Hierarchical Clustering 85 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – K-Medoid – Choice of parameters: Initialization, Silhouette coefficient 3) Expectation Maximization: a statistical approach 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Ensemble Clustering – Discussion: an alternative view on DBSCAN – Outlier Detection

Density-Based Hierarchical Clustering

. Observation: dense clusters are completely contained in less dense clusters
[Figure: two dense clusters C1 and C2 nested inside a less dense cluster C, next to a cluster D.]
. Idea: process the objects in the "right" order and keep track of the point density in their neighborhood
[Figure: for MinPts = 3, the nested clusters C1, C2 ⊂ C are obtained with two different radii ε1 < ε2.]

Clustering Hierarchical Methods Density-based HC: OPTICS 87 DATABASE Core Distance and Reachability Distance SYSTEMS GROUP

. Parameters: "generating" distance ε, fixed value MinPts

. core-distance_{ε,MinPts}(o): "smallest distance such that o is a core object" (if core-distance > ε: undefined, "?")
. reachability-distance_{ε,MinPts}(p, o): "smallest distance such that p is directly density-reachable from o" (if reachability-distance > ε: ∞)
[Figure: MinPts = 5; core-distance(o), reachability-distance(p, o), reachability-distance(q, o).]

$reach\_dist(p, o) = \begin{cases} dist(p, o) & \text{if } dist(p, o) \geq core\_dist(o) \\ core\_dist(o) & \text{if } dist(p, o) < core\_dist(o) \\ \infty & \text{if } dist(p, o) > \varepsilon \end{cases}$
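The two distances translate almost directly into code; a small sketch assuming a precomputed pairwise distance matrix D (illustrative only).

import numpy as np

def core_dist(D, o, eps, min_pts):
    # smallest distance such that o is a core object, or inf if o is no core object w.r.t. eps
    neigh = np.sort(D[o][D[o] <= eps])          # distances within the eps-neighborhood (includes o itself)
    return neigh[min_pts - 1] if len(neigh) >= min_pts else np.inf

def reach_dist(D, p, o, eps, min_pts):
    # smallest distance such that p is directly density-reachable from o
    if D[p, o] > eps:
        return np.inf
    return max(core_dist(D, o, eps, min_pts), D[p, o])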

Clustering Hierarchical Methods Density-based HC: OPTICS 88 DATABASE The Algorithm OPTICS SYSTEMS GROUP

. OPTICS: Ordering Points To Identify the Clustering Structure

. Basic data structure: ControlList
– memorizes the shortest reachability distances seen so far ("distance of a jump to that point")

. Visit each point
– always make the shortest jump

. Output:
– order of the points
– core-distance of the points
– reachability-distance of the points

Ankerst M., Breunig M., Kriegel H.-P., Sander J.: "OPTICS: Ordering Points To Identify the Clustering Structure". In SIGMOD 1999, pp. 49-60.

The Algorithm OPTICS

foreach o ∈ Database
  // initially, o.processed = false for all objects o
  if o.processed = false then
    insert (o, ∞) into ControlList;                   // ControlList is ordered by reachability-distance
    while ControlList contains objects not yet processed do
      select first element (o, r_dist) from ControlList;
      retrieve N_ε(o) and determine c_dist = core-distance(o);
      set o.processed = true;
      write (o, r_dist, c_dist) to the cluster-ordered file;
      if o is a core object at any distance ≤ ε then
        foreach p ∈ N_ε(o) not yet processed do
          determine r_dist_p = reachability-distance(p, o);
          if (p, _) ∉ ControlList then
            insert (p, r_dist_p) into ControlList;
          else if (p, old_r_dist) ∈ ControlList and r_dist_p < old_r_dist then
            update (p, r_dist_p) in ControlList;

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 90 DATABASE The Algorithm OPTICS - Example SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N A I L J P R

seed list:

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 91 DATABASE The Algorithm OPTICS – Example (2) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B core- distance K M N I A L P J A e R

seed list: (B,40) (I, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 92 DATABASE The Algorithm OPTICS – Example (3) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B R

seed list: (I, 40) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 93 DATABASE The Algorithm OPTICS – Example (4) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I R

seed list: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 94 DATABASE The Algorithm OPTICS – Example (5) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J R

seed list: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 95 DATABASE The Algorithm OPTICS – Example (6) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L R

seed list: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 96 DATABASE The Algorithm OPTICS – Example (7) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M R

seed list: (K, 18) (N, 19) (R, 20) (P, 21) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 97 DATABASE The Algorithm OPTICS – Example (8) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K R

seed list: (N, 19) (R, 20) (P, 21) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 98 DATABASE The Algorithm OPTICS – Example (9) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R

seed list: (R, 20) (P, 21) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 99 DATABASE The Algorithm OPTICS – Example (10) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R R

seed list: (P, 21) (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 100 DATABASE The Algorithm OPTICS – Example (11) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P R

seed list: (C, 40)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 101 DATABASE The Algorithm OPTICS – Example (12) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C R

seed list: (D, 22) (F, 22) (E, 30) (G, 35)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 102 DATABASE The Algorithm OPTICS – Example (13) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D R

seed list: (F, 22) (E, 22) (G, 32)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 103 DATABASE The Algorithm OPTICS – Example (14) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D F R

seed list: (G, 17) (E, 22)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 104 DATABASE The Algorithm OPTICS – Example (15) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D F G R

seed list: (E, 15) (H, 43)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 105 DATABASE The Algorithm OPTICS – Example (16) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D F G E R

seed list: (H, 43)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 106 DATABASE The Algorithm OPTICS – Example (17) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D F G E H R

seed list: -

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 107 DATABASE The Algorithm OPTICS – Example (18) SYSTEMS GROUP • Example Database (2-dimensional, 16 points) • e= 44, MinPts = 3

reach 

E D G H F C 44

B

K M N I A L P J A B I J L M K N R P C D F G E H R

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 108 DATABASE OPTICS: The Reachability Plot SYSTEMS GROUP Reachability diagram

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 109 DATABASE OPTICS: The Reachability Plot SYSTEMS GROUP . plot the points together with their reachability-distances. Use the order in which they where returned by the algorithm – represents the density-based clustering structure – easy to analyze

– independent of the dimension of the data

reachability distance reachability distance cluster ordering cluster ordering

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 110 DATABASE OPTICS: Properties SYSTEMS GROUP

. "Flat" density-based clusters w.r.t. ε* ≤ ε and MinPts can be extracted afterwards:
– Start with an object o where core-dist(o) ≤ ε* and r-dist(o) > ε*
– Continue while r-dist ≤ ε*
[Figure: example point set (points 1–18) with core-distances and reachability-distances and the corresponding reachability plot.]
. Performance: approximately runtime(DBSCAN(ε, MinPts)):
– O(n · runtime(ε-neighborhood query))
  - without spatial index support (worst case): O(n²)
  - e.g. with tree-based spatial index support: O(n · log n)

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 111 DATABASE OPTICS: Parameter Sensitivity SYSTEMS GROUP

. Relatively insensitive to parameter settings
. Good results if the parameters are just "large enough"
[Figure: a dataset with three clusters (1, 2, 3) and its reachability plots for MinPts = 10, ε = 10; MinPts = 10, ε = 5; and MinPts = 2, ε = 10 – the cluster structure is visible in each plot.]

Clustering Hierarchical Methods Density-based hierarchical clustering: OPTICS 112 DATABASE Hierarchical Clustering: Discussion SYSTEMS GROUP

. Advantages
– Does not require the number of clusters to be known in advance
– No parameters (standard methods) or very robust parameters (OPTICS)
– Computes a complete hierarchy of clusters
– Good result visualizations are integrated into the methods
– A "flat" partition can be derived afterwards (e.g., via a cut through the dendrogram or the reachability plot)

. Disadvantages
– May not scale well
  - Runtime for the standard methods: O(n² log n)
  - Runtime for OPTICS without index support: O(n²)
– The user has to choose the final clustering

Clustering Hierarchical Methods 113 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – K-Medoid – Choice of parameters: Initialization, Silhouette coefficient 3) Expectation Maximization: a statistical approach 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Ensemble Clustering – Discussion: an alternative view on DBSCAN

Evaluation of Clustering Results

. Evaluation based on an expert's opinion
+ may reveal new insight into the data
− very expensive, results are not comparable

. Evaluation based on internal measures
+ no additional information needed
− approaches optimizing the evaluation criterion will always be preferred

. Evaluation based on external measures
+ objective evaluation
− needs "ground truth", e.g., for the comparison of two clusterings

Evaluation Based on Internal Measures

Given a clustering 𝒞 = (C_1, ..., C_k) for dataset DB:
– Sum of squared distances: $SSD(\mathcal{C}) = \frac{1}{|DB|} \sum_{C_i \in \mathcal{C}} \sum_{p \in C_i} dist(p, \mu_{C_i})^2$
– Cohesion: measures the similarity of objects within a cluster
– Separation: measures the dissimilarity of one cluster to another one
– Silhouette coefficient: combines cohesion and separation

Evaluation Based on External Measures (1)

Given a clustering 𝒞 = (C_1, ..., C_k) and ground truth 𝒢 = (G_1, ..., G_l) for dataset DB:

– Recall: $rec(C_i, G_j) = \frac{|C_i \cap G_j|}{|G_j|}$   Precision: $prec(C_i, G_j) = \frac{|C_i \cap G_j|}{|C_i|}$
– F-Measure: $F(C_i, G_j) = \frac{2 \cdot rec(C_i, G_j) \cdot prec(C_i, G_j)}{rec(C_i, G_j) + prec(C_i, G_j)}$
– Purity: $P(\mathcal{C}, \mathcal{G}) = \sum_{C_i \in \mathcal{C}} \frac{|C_i|}{|DB|}\, pur(C_i, \mathcal{G})$ with $pur(C_i, \mathcal{G}) = \max_{G_j \in \mathcal{G}} prec(C_i, G_j)$
– Rand Index ("normalized number of agreements"): $RI(\mathcal{C}, \mathcal{G}) = (a + b) / \binom{|DB|}{2}$ where
  $a = |\{(o_i, o_j) \in DB \times DB \mid o_i \neq o_j \wedge \exists C \in \mathcal{C}: o_i, o_j \in C \wedge \exists G \in \mathcal{G}: o_i, o_j \in G\}|$
  $b = |\{(o_i, o_j) \in DB \times DB \mid o_i \neq o_j \wedge \neg\exists C \in \mathcal{C}: o_i, o_j \in C \wedge \neg\exists G \in \mathcal{G}: o_i, o_j \in G\}|$
– Jaccard Coefficient (JC)
– ...
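A small sketch of purity and the Rand index for two label arrays (NumPy; the toy labels are illustrative).

import numpy as np
from itertools import combinations

def purity(pred, truth):
    total = 0
    for c in np.unique(pred):
        members = truth[pred == c]
        counts = np.unique(members, return_counts=True)[1]
        total += counts.max()                       # |C_i| * max_j prec(C_i, G_j)
    return total / len(pred)

def rand_index(pred, truth):
    a = b = 0
    for i, j in combinations(range(len(pred)), 2):
        same_c = pred[i] == pred[j]
        same_g = truth[i] == truth[j]
        a += same_c and same_g                      # pair grouped together in both
        b += (not same_c) and (not same_g)          # pair separated in both
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (a + b) / n_pairs

pred  = np.array([0, 0, 0, 1, 1, 1])
truth = np.array([0, 0, 1, 1, 1, 1])
print(purity(pred, truth), rand_index(pred, truth))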

Evaluation Based on External Measures (2)

Given a clustering 𝒞 = (C_1, ..., C_k) and ground truth 𝒢 = (G_1, ..., G_l) for dataset DB:

– Conditional entropy: $H(\mathcal{G} \mid \mathcal{C}) = -\sum_{C_i \in \mathcal{C}} p(C_i) \sum_{G_j \in \mathcal{G}} p(G_j \mid C_i) \log p(G_j \mid C_i) = -\sum_{C_i \in \mathcal{C}} \frac{|C_i|}{|DB|} \sum_{G_j \in \mathcal{G}} \frac{|C_i \cap G_j|}{|C_i|} \log_2 \frac{|C_i \cap G_j|}{|C_i|}$

– Mutual Information: $I(\mathcal{C}, \mathcal{G}) = H(\mathcal{C}) - H(\mathcal{C} \mid \mathcal{G}) = H(\mathcal{G}) - H(\mathcal{G} \mid \mathcal{C})$,
  where the entropy is $H(\mathcal{C}) = -\sum_{C_i \in \mathcal{C}} p(C_i) \log p(C_i) = -\sum_{C_i \in \mathcal{C}} \frac{|C_i|}{|DB|} \log \frac{|C_i|}{|DB|}$

– Normalized Mutual Information: $NMI(\mathcal{C}, \mathcal{G}) = \frac{I(\mathcal{C}, \mathcal{G})}{\sqrt{H(\mathcal{C})\, H(\mathcal{G})}}$
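A corresponding sketch of mutual information and NMI, using the √(H(𝒞)·H(𝒢)) normalization shown above (illustrative; base-2 logarithms throughout).

import numpy as np

def entropy(labels):
    p = np.unique(labels, return_counts=True)[1] / len(labels)
    return -(p * np.log2(p)).sum()

def mutual_information(pred, truth):
    mi = 0.0
    for c in np.unique(pred):
        for g in np.unique(truth):
            p_cg = np.mean((pred == c) & (truth == g))      # joint probability
            if p_cg > 0:
                p_c, p_g = np.mean(pred == c), np.mean(truth == g)
                mi += p_cg * np.log2(p_cg / (p_c * p_g))
    return mi

def nmi(pred, truth):
    return mutual_information(pred, truth) / np.sqrt(entropy(pred) * entropy(truth))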

Ambiguity of Clusterings

Different possibilities to cluster a set of objects.
[Figure: the same point set clustered in several different ways; from Tan, Steinbach, Kumar: Introduction to Data Mining (Pearson, 2006).]

Ambiguity of Clusterings

• Kind of a philosophical problem: "What is a correct clustering?"
• Most approaches find clusters in every dataset, even in uniformly distributed objects.
• Are there clusters?
  • Apply a clustering algorithm
  • Check the resulting clusters for reasonability
  • Problem: no clusters found ⇏ no clusters exist
  • Maybe clusters exist only in certain models but cannot be found by the clustering approach used
• Independent of the clustering algorithm: is there a data-given cluster tendency?

The Hopkins Statistic

[Figure: dataset (n objects), a random selection of m objects (m ≪ n), and m uniformly distributed random objects in the data space.]
– w_i: distance of each of the m selected data objects to its nearest neighbor in the dataset
– u_i: distance of each of the m uniformly distributed objects to its nearest neighbor in the dataset

$H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}, \quad 0 \leq H \leq 1$

– H ≈ 0: data are very regular (e.g., on a grid)
– H ≈ 0.5: data are uniformly distributed
– H ≈ 1: data are strongly clustered
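A minimal sketch of the Hopkins statistic as defined above; the sample size m and the bounding box used for the uniform sample are illustrative choices, and nearest neighbors are found by brute force.

import numpy as np

def hopkins(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                       # m << n
    sample = X[rng.choice(n, m, replace=False)]    # m real data objects
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))  # m uniform objects

    def nn_dist(points, exclude_self):
        D = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            D[D == 0] = np.inf                     # ignore the point itself
        return D.min(axis=1)

    w = nn_dist(sample, exclude_self=True)         # w_i: data objects -> nearest neighbor in X
    u = nn_dist(uniform, exclude_self=False)       # u_i: uniform objects -> nearest neighbor in X
    return u.sum() / (u.sum() + w.sum())           # ~0 regular, ~0.5 uniform, ~1 clustered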

Cohesion and Separation

• Suitable for globular clusters, but not for stretched clusters

Evaluating the Similarity Matrix

[Figure: a dataset with well-separated clusters and its similarity matrix, sorted by the k-means cluster labels; from Tan, Steinbach, Kumar: Introduction to Data Mining (Pearson, 2006).]

Evaluating the Similarity Matrix (cont'd)
Similarity matrices differ for different clustering approaches.
[Figure: similarity matrices of the same dataset for DBSCAN, k-means, and complete link; from Tan, Steinbach, Kumar: Introduction to Data Mining (Pearson, 2006).]

Example: Pen Digit Clustering

Example: Pendigits dataset
• Label for each digit: 0, ..., 9
• Features: temporally ordered points along the drawn lines
• Useful concepts, depending on the given classes:
  • one digit can have subgroups for varying writing styles
  • different digits can have similarities


Ensemble Clustering

Problem:
– Many differing cluster definitions
– The parameter choice usually highly influences the result
⇒ What is a 'good' clustering?
Idea: find a consensus solution (ensemble clustering) that consolidates multiple clustering solutions.
Benefits of ensemble clustering:
– Knowledge reuse: possibility to integrate the knowledge of multiple known, good clusterings
– Improved quality: ensemble clustering often leads to "better" results than its individual base solutions
– Improved robustness: combining several clustering approaches with differing data modeling assumptions leads to increased robustness across a wide range of datasets
– Model selection: a novel approach for determining the final number of clusters
– Distributed clustering: if data is inherently distributed (either feature-wise or object-wise) and each clusterer has access to only a subset of objects and/or features, ensemble methods can be used to compute a unifying result

Ensemble Clustering – Basic Notions

Given: a set of L clusterings ℭ = {𝒞_1, ..., 𝒞_L} for a dataset X = {x_1, ..., x_N} ⊆ ℝ^D.
Goal: find a consensus clustering 𝒞*.
What exactly is a consensus clustering?
We can differentiate between 2 categories of ensemble clustering:
– Approaches based on pairwise similarity
  Idea: find a consensus clustering 𝒞* for which the similarity function $\phi(\mathfrak{C}, \mathcal{C}^*) = \frac{1}{L}\sum_{l=1}^{L}\phi(\mathcal{C}_l, \mathcal{C}^*)$ is maximal
  (φ is basically one of our external evaluation measures, which compare two clusterings)
– Probabilistic approaches: assume that the L labels for the objects x_i ∈ X follow a certain distribution

⇒ We will present one exemplary approach for each of the two categories in the following.

Ensemble Clustering – Similarity-based Approaches

Given: a set of L clusterings ℭ = {𝒞_1, ..., 𝒞_L} for a dataset X = {x_1, ..., x_N} ⊆ ℝ^D.
. Goal: find a consensus clustering 𝒞* for which the similarity function $\phi(\mathfrak{C}, \mathcal{C}^*) = \frac{1}{L}\sum_{l=1}^{L}\phi(\mathcal{C}_l, \mathcal{C}^*)$ is maximal.
– Popular choices for φ in the literature:
  • Pair-counting-based measures: Rand Index (RI), Adjusted RI, Probabilistic RI
  • Information-theoretic measures: Mutual Information (I), Normalized Mutual Information (NMI), Variation of Information (VI)
Problem: the above objective is intractable.
Solutions:
– Methods based on the co-association matrix (related to RI)
– Methods using cluster labels without the co-association matrix (often related to NMI)
  • mostly graph partitioning
  • cumulative voting

Clustering  Further Clustering Topics  Ensemble Clustering 164 Ensemble Clust. – Approaches based on DATABASE SYSTEMS Co-Association GROUP

Given: a set of 퐿 clusterings ℭ = {풞_1, …, 풞_퐿} for a dataset 퐗 = {퐱_1, …, 퐱_푁} ⊂ ℝ^퐷
The co-association matrix 퐒^ℭ is an 푁 × 푁 matrix representing the label similarity of object pairs:
푠_{푖,푗}^ℭ = ∑_{푙=1}^{퐿} 훿(퐶(풞_푙, 퐱_푖), 퐶(풞_푙, 퐱_푗)), where 훿(푎, 푏) = 1 if 푎 = 푏 and 0 else, and 퐶(풞_푙, 퐱_푖) denotes the cluster label of 퐱_푖 in clustering 풞_푙
Based on the similarity matrix defined by 퐒^ℭ, traditional clustering approaches can be used (see the sketch below).
Often 퐒^ℭ is interpreted as a weighted adjacency matrix, such that methods for graph partitioning can be applied.
In [Mirkin'96] a connection between consensus clustering based on the co-association matrix and the optimization of the pairwise similarity based on the Rand Index (풞^{best} = argmax_{풞*} (1/퐿) ∑_{풞_푙 ∈ ℭ} 푅퐼(풞_푙, 풞*)) has been proven.
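A minimal sketch of this idea, assuming integer label arrays for the base clusterings: build the unnormalized co-association matrix and cluster on top of it. Treating 1 − 퐒^ℭ/퐿 as a distance matrix and cutting an average-linkage dendrogram is only one common heuristic; the function names and parameters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(base_clusterings):
    """S[i, j] = number of base clusterings in which objects i and j share a
    cluster label (unnormalized, as in the definition above)."""
    base = np.asarray(base_clusterings)            # (L, N) label matrix
    N = base.shape[1]
    S = np.zeros((N, N))
    for labels in base:
        S += (labels[:, None] == labels[None, :])
    return S

def consensus_from_co_association(base_clusterings, k):
    """One common heuristic: interpret 1 - S/L as a distance matrix and cut
    an average-linkage dendrogram into k consensus clusters."""
    S = co_association(base_clusterings)
    L = len(base_clusterings)
    D = 1.0 - S / L
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=k, criterion='maxclust')
```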

[Mirkin’96] B. Mirkin: Mathematical Classification and Clustering. Kluwer, 1996. Clustering  Further Clustering Topics  Ensemble Clustering 165 Ensemble Clust. – Approaches based on DATABASE SYSTEMS Cluster Labels GROUP

Consensus clustering 풞* for which (1/퐿) ∑_{풞_푙 ∈ ℭ} 휙(풞_푙, 풞*) is maximal
→ Information-theoretic approach: choose 휙 as mutual information (I), normalized mutual information (NMI), information bottleneck (IB), …
Problem: usually a hard optimization problem
→ Solution 1: use general-purpose optimization approaches (e.g. gradient descent) or heuristics to approximate the best clustering solution (e.g. [SG02])
→ Solution 2: use a similar but solvable objective (e.g. [TJP03])
  • Idea: use as objective 풞^{best} = argmax_{풞*} (1/퐿) ∑_{풞_푙 ∈ ℭ} 퐼^푠(풞_푙, 풞*), where 퐼^푠 is the mutual information based on the generalized entropy of degree 푠: 퐻^푠(푋) = (2^{1−푠} − 1)^{−1} ∑_{푥_푖 ∈ 푋} (푝_푖^푠 − 1)
  • For 푠 = 2, 퐼^푠(풞_푙, 풞*) is equal to the category utility function, whose maximization is proven to be equivalent to the minimization of the square-error clustering criterion
  → Thus apply a simple label transformation and use e.g. K-Means
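A rough sketch of the label-transformation idea for the 푠 = 2 case, in the spirit of [TJP03] but not their reference implementation: concatenate one-hot indicators of the 퐿 base labels per object and run K-Means on that representation. The helper name and K-Means settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_via_label_kmeans(base_clusterings, k, seed=0):
    """base_clusterings: list of L integer label arrays of length N.
    Returns a consensus labeling with k clusters."""
    Y = np.asarray(base_clusterings).T                 # (N, L) label matrix
    blocks = []
    for l in range(Y.shape[1]):
        values = np.unique(Y[:, l])
        # one-hot indicator block for the labels of base clustering l
        blocks.append((Y[:, l][:, None] == values[None, :]).astype(float))
    Y_bin = np.hstack(blocks)                          # (N, sum_l |C_l|)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y_bin)
```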

[SG02] A. Strehl, J. Ghosh: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 2002, pp. 583-617.
[TJP03] A. Topchy, A. K. Jain, W. Punch: Combining multiple weak clusterings. In ICDM, 2003, pp. 331-339.

Clustering  Further Clustering Topics  Ensemble Clustering 166 DATABASE Ensemble Clustering – A Probabilistic Approach SYSTEMS GROUP

Assumption 1: all clusterings 풞_푙 ∈ ℭ are partitionings of the dataset 퐗.
Assumption 2: there are 퐾* consensus clusters.
The dataset 퐗 is represented by the set 퐘 = {풚_푛 ∈ ℕ_0^퐿 | ∃퐱_푛 ∈ 퐗. ∀풞_푙 ∈ ℭ. 푦_푛푙 = 퐶(풞_푙, 퐱_푛)}, where 퐶(풞_푙, 퐱_푛) is the cluster label of 퐱_푛 in clustering 풞_푙
→ we have a new feature space ℕ_0^퐿, where the 푙-th feature represents the cluster labels from partition 풞_푙
Assumption 3: the dataset 퐘 (the labels of the base clusterings) follows a multivariate mixture distribution with conditional independence assumptions for the 풞_푙 ∈ ℭ:

푃(퐘|횯) = ∏_{푛=1}^{푁} ∑_{푘=1}^{퐾*} 훼_푘 푃_푘(풚_푛|훉_푘) = ∏_{푛=1}^{푁} ∑_{푘=1}^{퐾*} 훼_푘 ∏_{푙=1}^{퐿} 푃_푘푙(푦_푛푙|훉_푘푙)

푃_푘푙(푦_푛푙|훉_푘푙) ~ 푀(1, (푝_{푘,푙,1}, …, 푝_{푘,푙,|풞_푙|})) follows a |풞_푙|-dimensional multinomial distribution: 푃_푘푙(푦_푛푙|훉_푘푙) = ∏_{푘′=1}^{|풞_푙|} 푝_{푘,푙,푘′}^{훿(푦_푛푙, 푘′)}, therefore 훉_푘푙 = (푝_{푘,푙,1}, …, 푝_{푘,푙,|풞_푙|})

Goal: find the parameters 횯 = (훼_1, 훉_1, …, 훼_{퐾*}, 훉_{퐾*}) such that the likelihood 푃(퐘|횯) is maximized
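A compact EM sketch for this label mixture, assuming the base labels are collected in an 푁 × 퐿 integer matrix Y with 0-based labels; the function name, smoothing constant, and fixed iteration count are illustrative assumptions rather than the cited authors' implementation.

```python
import numpy as np

def consensus_em(Y, K, n_iter=100, seed=0):
    """EM for the mixture of per-clustering multinomials described above.
    Y: (N, L) integer matrix, Y[n, l] = 0-based label of object n in base
    clustering l. Returns hard consensus labels via argmax responsibility."""
    rng = np.random.default_rng(seed)
    N, L = Y.shape
    n_labels = [int(Y[:, l].max()) + 1 for l in range(L)]
    alpha = np.full(K, 1.0 / K)                       # mixture weights alpha_k
    # theta[l][k, c] = p_{k,l,c}: probability of label c in clustering l
    # under consensus cluster k
    theta = [rng.dirichlet(np.ones(n_labels[l]), size=K) for l in range(L)]
    for _ in range(n_iter):
        # E-step: r[n, k] proportional to alpha_k * prod_l p_{k,l,y_nl}
        log_r = np.tile(np.log(alpha), (N, 1))        # (N, K)
        for l in range(L):
            log_r += np.log(theta[l][:, Y[:, l]]).T   # (N, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture weights and multinomial parameters
        alpha = r.mean(axis=0)
        for l in range(L):
            counts = np.zeros((K, n_labels[l]))
            for c in range(n_labels[l]):
                counts[:, c] = r[Y[:, l] == c].sum(axis=0)
            theta[l] = (counts + 1e-9) / (counts + 1e-9).sum(axis=1, keepdims=True)
    return r.argmax(axis=1)
```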

Solution: optimize the parameters via the EM approach (details omitted).
Presented approach: Topchy, Jain, Punch: A mixture model for clustering ensembles. In ICDM, pp. 379-390, 2004. Later extensions: H. Wang, H. Shan, A. Banerjee: Bayesian cluster ensembles. In ICDM, pp. 211-222, 2009. P. Wang, C. Domeniconi, K. Laskey: Nonparametric Bayesian clustering ensembles. In PKDD, pp. 435-450, 2010. Clustering  Further Clustering Topics  Ensemble Clustering 167 DATABASE Contents SYSTEMS GROUP

1) Introduction to clustering 2) Partitioning Methods – K-Means – K-Medoid – Choice of parameters: Initialization, Silhouette coefficient 3) Expectation Maximization: a statistical approach 4) Density-based Methods: DBSCAN 5) Hierarchical Methods – Agglomerative and Divisive Hierarchical Clustering – Density-based hierarchical clustering: OPTICS 6) Evaluation of Clustering Results 7) Further Clustering Topics – Ensemble Clustering – Discussion: an alternative view on DBSCAN

Clustering 168 DATABASE Database Support for Density-Based Clustering SYSTEMS GROUP

Reconsider the DBSCAN algorithm
– Standard DBSCAN evaluation is based on a recursive database traversal.
– Böhm et al. (2000) observed that DBSCAN, among other clustering algorithms, may be efficiently built on top of similarity join operations.

Similarity joins
– An ε-similarity join yields all pairs of ε-similar objects from two data sets P, Q:
  푃 ⨝_ε 푄 = {(푝, 푞) | 푝 ∈ 푃 ∧ 푞 ∈ 푄 ∧ 푑푖푠푡(푝, 푞) ≤ ε}
  • SQL query: SELECT * FROM P p, Q q WHERE 푑푖푠푡(p, q) ≤ ε

– An ε-similarity self join yields all pairs of ε-similar objects from a database DB:
  퐷퐵 ⨝_ε 퐷퐵 = {(푝, 푞) | 푝 ∈ 퐷퐵 ∧ 푞 ∈ 퐷퐵 ∧ 푑푖푠푡(푝, 푞) ≤ ε}
  • SQL query: SELECT * FROM DB p, DB q WHERE 푑푖푠푡(푝, 푞) ≤ ε
Böhm C., Braunmüller B., Breunig M., Kriegel H.-P.: High performance clustering based on the similarity join. CIKM 2000: 298-305.
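As a small illustration (not part of the cited approach), the following sketch computes an ε-similarity join on in-memory point sets, with a KD-tree standing in for the disk-based spatial index; the function name and the use of scipy are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def eps_similarity_join(P, Q, eps):
    """eps-similarity join of point arrays P and Q: all index pairs (i, j)
    with dist(P[i], Q[j]) <= eps. A brute-force double loop over P x Q
    would give the same result, just more slowly."""
    tree_p, tree_q = cKDTree(P), cKDTree(Q)
    pairs = []
    # for every point of P, query_ball_tree returns the indices of all
    # points of Q within distance eps
    for i, neighbors in enumerate(tree_p.query_ball_tree(tree_q, r=eps)):
        pairs.extend((i, j) for j in neighbors)
    return pairs
```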

Clustering  Further Clustering Topics  Discussion: DBSCAN 169 DATABASE Similarity Join for Density-Based Clustering SYSTEMS GROUP

ε-similarity self join: 퐷퐵 ⨝_ε 퐷퐵 = {(푝, 푞) | 푝 ∈ 퐷퐵 ∧ 푞 ∈ 퐷퐵 ∧ 푑푖푠푡(푝, 푞) ≤ ε}

The relation "directly (ε, MinPts)-density-reachable" may be expressed in terms of an ε-similarity self join:

푑푑푟_{ε,휇} = {(푝, 푞) | 푝 is an (ε, 휇)-core point ∧ 푞 ∈ 푁_ε(푝)}

= {(푝, 푞) | 푝, 푞 ∈ 퐷퐵 ∧ 푑푖푠푡(푝, 푞) ≤ ε ∧ ∃_{≥휇} 푞′ ∈ 퐷퐵: 푑푖푠푡(푝, 푞′) ≤ ε}

= {(푝, 푞) | (푝, 푞) ∈ 퐷퐵 ⨝_ε 퐷퐵 ∧ ∃_{≥휇} 푞′: (푝, 푞′) ∈ 퐷퐵 ⨝_ε 퐷퐵}
= 휎_{|휋_푝(퐷퐵 ⨝_ε 퐷퐵)| ≥ 휇}(퐷퐵 ⨝_ε 퐷퐵) =: 퐷퐵 ⨝_{ε,휇} 퐷퐵 (i.e., keep the join pairs whose first component 푝 has at least 휇 join partners)
– SQL query: SELECT * FROM DB p, DB q WHERE 푑푖푠푡(푝, 푞) ≤ ε GROUP BY p.id HAVING count(푝.푖푑) ≥ μ
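A compact sketch of DBSCAN built on the ε-similarity self join, following the relations above: compute the join (brute force here for clarity), derive core points via the count condition, and grow connected components over the ddr relation. The function name and the brute-force join are illustrative assumptions.

```python
import numpy as np

def dbscan_via_self_join(X, eps, min_pts):
    """DBSCAN expressed via the eps-similarity self join and the ddr relation
    above. Brute-force join for clarity; an index-based join would replace
    the full distance matrix for large databases. Returns labels, -1 = noise."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    # eps-similarity self join as a boolean N x N matrix (includes pairs (p, p))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    join = D <= eps
    # core points: at least min_pts join partners (the HAVING count(...) >= mu step)
    core = join.sum(axis=1) >= min_pts
    # ddr relation: (p, q) with p an (eps, min_pts)-core point and q in N_eps(p)
    ddr = join & core[:, None]
    # connected components over ddr, expanding only through core points
    labels = np.full(N, -1)
    cluster_id = 0
    for seed in range(N):
        if not core[seed] or labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = cluster_id
        while stack:
            p = stack.pop()
            for q in np.where(ddr[p])[0]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if core[q]:
                        stack.append(q)
        cluster_id += 1
    return labels
```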

– Remark: 퐷퐵 ⨝_ε 퐷퐵 is a symmetric relation; 푑푑푟_{ε,휇} = 퐷퐵 ⨝_{ε,휇} 퐷퐵 is not. DBSCAN then computes the connected components within 퐷퐵 ⨝_{ε,휇} 퐷퐵. Clustering  Further Clustering Topics  Discussion: DBSCAN 170 DATABASE Efficient Similarity Join Processing SYSTEMS GROUP

For very large databases, efficient join techniques are available:
– Block nested loop or index-based nested loop joins exploit the secondary storage structure of large databases.
– Dedicated similarity join, distance join, or spatial join methods based on spatial indexing structures (e.g., R-Tree) apply particularly well. They may traverse their hierarchical directories in parallel (see the illustration and the sketch below).
– Other join techniques, such as sort-merge join or hash join, are not applicable.

[Illustration: parallel traversal of the spatial index directories of P and Q to compute 푃 ⨝_ε 푄]
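A minimal sketch of such a parallel directory traversal over two bounding-box hierarchies with mindist pruning; it mimics an R-tree-style spatial join but is not the algorithm of the cited paper, and the Node class and all names are assumptions.

```python
import numpy as np

class Node:
    """Minimal stand-in for a spatial-index directory node: either an inner
    node with children or a leaf holding a point array; mbr = (lower, upper)."""
    def __init__(self, points=None, children=None):
        self.children = children or []
        self.points = points
        if self.children:
            lows = np.min([c.mbr[0] for c in self.children], axis=0)
            highs = np.max([c.mbr[1] for c in self.children], axis=0)
        else:
            lows, highs = self.points.min(axis=0), self.points.max(axis=0)
        self.mbr = (lows, highs)

def mindist(a, b):
    # smallest possible distance between any two points of the boxes a and b
    gap = np.maximum(0.0, np.maximum(a[0] - b[1], b[0] - a[1]))
    return float(np.linalg.norm(gap))

def similarity_join(p, q, eps, out):
    """Parallel traversal of both directories; node pairs whose bounding
    boxes are farther apart than eps are pruned."""
    if mindist(p.mbr, q.mbr) > eps:
        return
    if not p.children and not q.children:
        # leaf/leaf: compare the actual points
        d = np.linalg.norm(p.points[:, None, :] - q.points[None, :, :], axis=-1)
        for i, j in zip(*np.where(d <= eps)):
            out.append((p.points[i], q.points[j]))
    elif p.children and q.children:
        for cp in p.children:
            for cq in q.children:
                similarity_join(cp, cq, eps, out)
    elif p.children:
        for cp in p.children:
            similarity_join(cp, q, eps, out)
    else:
        for cq in q.children:
            similarity_join(p, cq, eps, out)
```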

Clustering  Further Clustering Topics  Discussion: DBSCAN 171 DATABASE Clustering – Summary SYSTEMS GROUP

Partitioning Methods: K-Means, K-Medoid, K-Mode, K-Median
Probabilistic Model-Based Clusters: Expectation Maximization
Density-based Methods: DBSCAN
Hierarchical Methods
– Agglomerative and Divisive Hierarchical Clustering
– Density-based hierarchical clustering: OPTICS
Evaluation of Clustering Results
– Evaluation based on an expert's knowledge; internal evaluation measures; external evaluation measures
Further Clustering Topics
– Ensemble Clustering: finding a consensus clustering that agrees with multiple base clusterings, and its advantages
  • co-association matrix, information-theoretic approaches, probabilistic approaches
– Discussion of DBSCAN as a join operation

Clustering  Summary & References 172