Hierarchical and Ensemble Clustering Ke Chen

Reading: [7.8-7.10, EA], [25.5, KPM], [Fred & Jain, 2005]

Outline

• Introduction

• Cluster Measures

• Agglomerative Algorithm

• Example and Demo

• Key Concepts in Hierarchical Clustering

• Clustering Ensemble via Evidence Accumulation

• Summary

Introduction

• Hierarchical Clustering Approach
  – A typical clustering analysis approach that partitions a data set sequentially
  – Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  – Uses (generalised) cluster distance measures as clustering criteria
• Agglomerative vs. Divisive
  – Agglomerative: a bottom-up strategy
    . Initially each data object is in its own (atomic) cluster
    . Then merge these atomic clusters into larger and larger clusters
  – Divisive: a top-down strategy
    . Initially all objects are in one single cluster
    . Then the cluster is subdivided into smaller and smaller clusters
• Clustering Ensemble
  – Use multiple clustering results for robustness and to overcome the weaknesses of single clustering algorithms

Introduction: Illustration

• Illustrative Example: Agglomerative vs. Divisive
  Agglomerative and divisive clustering on the data set {a, b, c, d, e}

[Figure: nested clustering of {a}, {b}, {c}, {d}, {e}. Agglomerative clustering merges from Step 0 (five singleton clusters) to Step 4 (one cluster {a, b, c, d, e}); divisive clustering splits in the reverse direction, from Step 4 back to Step 0. Two choices govern the process: the cluster distance measure and the termination condition.]

Cluster Distance Measures

• Single link: smallest distance (min) between an element in one cluster and an element in the other, i.e.,
  d(Ci, Cj) = min{ d(xip, xjq) : xip ∈ Ci, xjq ∈ Cj }
• Complete link: largest distance (max) between an element in one cluster and an element in the other, i.e.,
  d(Ci, Cj) = max{ d(xip, xjq) : xip ∈ Ci, xjq ∈ Cj }
• Average link: average distance between elements in one cluster and elements in the other, i.e.,
  d(Ci, Cj) = avg{ d(xip, xjq) : xip ∈ Ci, xjq ∈ Cj }

In all three cases, d(C, C) = 0.
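The three measures map directly onto min, max, and mean over all pairwise distances. Below is a minimal Python sketch (the function names `single_link`, `complete_link` and `average_link` are illustrative assumptions, not lecture code), demonstrated on the five-object example worked through next.

```python
# Minimal sketch of the three cluster distance measures.
# Names and the dist() helper are illustrative assumptions, not lecture code.

def pairwise(Ci, Cj, d):
    """All distances d(x, y) for x in Ci and y in Cj."""
    return [d(x, y) for x in Ci for y in Cj]

def single_link(Ci, Cj, d):       # smallest pairwise distance (min)
    return min(pairwise(Ci, Cj, d))

def complete_link(Ci, Cj, d):     # largest pairwise distance (max)
    return max(pairwise(Ci, Cj, d))

def average_link(Ci, Cj, d):      # average pairwise distance (avg)
    ds = pairwise(Ci, Cj, d)
    return sum(ds) / len(ds)

# Five objects with a single continuous feature (see the example that follows):
# a=1, b=2, c=4, d=5, e=6, with C1 = {a, b} and C2 = {c, d, e}.
feature = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
dist = lambda x, y: abs(feature[x] - feature[y])
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

print(single_link(C1, C2, dist))    # 2
print(complete_link(C1, C2, dist))  # 5
print(average_link(C1, C2, dist))   # 3.5
```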

Example: Given a data set of five objects characterised by a single continuous feature, assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

          a   b   c   d   e
Feature   1   2   4   5   6

1. Calculate the distance matrix.
2. Calculate the three cluster distances between C1 and C2.

Distance matrix:

     a  b  c  d  e
a    0  1  3  4  5
b    1  0  2  3  4
c    3  2  0  1  2
d    4  3  1  0  1
e    5  4  2  1  0

Single link:
dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
             = min{ 3, 4, 5, 2, 3, 4 } = 2

Complete link:
dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
             = max{ 3, 4, 5, 2, 3, 4 } = 5

Average:
dist(C1, C2) = ( d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e) ) / 6
             = ( 3 + 4 + 5 + 2 + 3 + 4 ) / 6 = 21 / 6 = 3.5

Agglomerative Algorithm

• The agglomerative algorithm is carried out in three steps:

  1) Convert all object features into a distance matrix
  2) Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning)
  3) Repeat until the number of clusters is one (or a known # of clusters)
     . Merge the two closest clusters
     . Update the "distance matrix"
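As a complement, here is a naive from-scratch Python sketch of these three steps, using single link when comparing clusters. It works directly on a precomputed distance matrix and is illustrative only; the names are assumptions and no effort is made to be efficient.

```python
# Naive agglomerative clustering on a precomputed N x N distance matrix D
# (step 1 is assumed done). Single link is used as the cluster distance.

def agglomerative(D, target_k=1):
    clusters = [[i] for i in range(len(D))]       # step 2: one cluster per object

    def cluster_dist(ci, cj):                     # single link between two clusters
        return min(D[p][q] for p in ci for q in cj)

    while len(clusters) > target_k:               # step 3: repeat until termination
        # find the two closest clusters
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] += clusters[j]                # merge them; the "distance matrix
        del clusters[j]                           # update" is implicit in cluster_dist
    return clusters

# The five-object example from the previous slide (a..e mapped to 0..4):
D = [[0, 1, 3, 4, 5],
     [1, 0, 2, 3, 4],
     [3, 2, 0, 1, 2],
     [4, 3, 1, 0, 1],
     [5, 4, 2, 1, 0]]
print(agglomerative(D, target_k=2))   # [[0, 1], [2, 3, 4]], i.e. {a, b} and {c, d, e}
```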

Example

• Problem: clustering analysis with the agglomerative algorithm

[Figure: data matrix converted into a distance matrix using the Euclidean distance]

• Merge two closest clusters (iteration 1)


• Update distance matrix (iteration 1)


• Merge two closest clusters (iteration 2)


• Update distance matrix (iteration 2)


• Merge two closest clusters/update distance matrix (iteration 3)


• Merge two closest clusters/update distance matrix (iteration 4)


• Final result (meeting termination condition)
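In practice, the same walkthrough can be reproduced with library routines; for example, in Python the SciPy functions `linkage` and `dendrogram` build and draw the tree. The 2-D data matrix below is a hypothetical stand-in, since the slide's actual values appear only in the figure.

```python
# Hierarchical (agglomerative) clustering with SciPy.
# X is a hypothetical 6-object, 2-feature data matrix, not the slide's data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

Z = linkage(X, method='single', metric='euclidean')  # single-link agglomerative
dendrogram(Z, labels=list('ABCDEF'))                 # dendrogram of the merges
plt.show()
```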

Key Concepts in Hierarchical Clustering

• Dendrogram tree representation

[Figure: dendrogram over objects A, B, C, D, E, F; the vertical axis shows the merge distance (lifetime)]

1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge clusters A and B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, thus concluding the computation


• Lifetime vs K-cluster Lifetime

• Lifetime
  The distance between the point at which a cluster is created and the point at which it disappears (merges with other clusters during clustering).
  e.g. the lifetimes of A, B, C, D, E and F are 0.71, 0.71, 1.41, 0.50, 1.00 and 0.50, respectively; the lifetime of (A, B) is 2.50 - 0.71 = 1.79, ...

• K-cluster Lifetime
  The distance from the point at which K clusters emerge to the point at which K clusters vanish (due to the reduction to K-1 clusters).
  e.g.
  5-cluster lifetime is 0.71 - 0.50 = 0.21
  4-cluster lifetime is 1.00 - 0.71 = 0.29
  3-cluster lifetime is 1.41 - 1.00 = 0.41
  2-cluster lifetime is 2.50 - 1.41 = 1.09
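A short sketch of how these lifetimes can be computed from the sequence of merge distances; the distances are the ones in the dendrogram above, while the code and its names are illustrative assumptions. Choosing the K with the maximum K-cluster lifetime is the heuristic discussed on the next slide.

```python
# K-cluster lifetimes from the merge distances of a dendrogram.
# With N objects there are N-1 merges; K clusters exist between the
# (N-K)-th and (N-K+1)-th merges, so each K-cluster lifetime is the
# gap between two consecutive merge distances.

merge_distances = [0.50, 0.71, 1.00, 1.41, 2.50]   # example above (objects A..F)
N = len(merge_distances) + 1

k_lifetime = {}
for i in range(1, len(merge_distances)):
    k_lifetime[N - i] = merge_distances[i] - merge_distances[i - 1]

print(k_lifetime)   # {5: 0.21, 4: 0.29, 3: 0.41, 2: 1.09} (up to rounding)

best_k = max(k_lifetime, key=k_lifetime.get)   # heuristic: K with maximum lifetime
print(best_k)                                  # 2
```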

Demo

Agglomerative Demo

Relevant Issues

• How to determine the number of clusters
  – If the number of clusters is known, the termination condition is given!
  – The K-cluster lifetime gives the range of threshold values on the dendrogram tree that leads to the identification of K clusters
  – Heuristic rule: cut the dendrogram tree at the maximum lifetime to find a "proper" K
• Major weaknesses of agglomerative clustering methods
  – Can never undo what was done previously
  – Sensitive to cluster distance measures and noise/outliers
  – Less efficient: O(n² log n), where n is the total number of objects
• There are several variants that overcome these weaknesses
  – BIRCH: scalable to large data sets
  – ROCK: clustering categorical data
  – CHAMELEON: hierarchical clustering using dynamic modelling

Clustering Ensemble

• Motivation
  – A single clustering algorithm may be affected by various factors
    . Sensitive to initialisation and noise/outliers, e.g. K-means is sensitive to the initial centroids!
    . Sensitive to the distance metric, yet a proper one is hard to find
    . Hard to decide on a single best algorithm that can handle all types of cluster shapes and sizes
  – An effective treatment: clustering ensemble
    . Utilise the results obtained from multiple clustering analyses for robustness

• Clustering Ensemble via Evidence Accumulation (Fred & Jain, 2005)
  – A simple clustering ensemble algorithm that overcomes the main weaknesses of different clustering methods by exploiting their synergy via evidence accumulation
• Algorithm summary
  – Initial clustering analysis: use either different clustering algorithms or a single clustering algorithm run under different conditions, leading to multiple partitions, e.g. K-means with various initial centroid settings and different K, or the agglomerative algorithm with different distance metrics and forced to terminate with different numbers of clusters...
  – Convert the clustering results of the different partitions into binary "distance" matrices
  – Evidence accumulation: form a collective "distance" matrix from all the binary "distance" matrices
  – Apply a hierarchical clustering algorithm (with a proper cluster distance measure) to the collective "distance" matrix and use the maximum K-cluster lifetime to decide K

• Example: convert clustering results into a binary "distance" matrix

[Figure: four objects A, B, C, D; Cluster 1 (C1) = {A, B}, Cluster 2 (C2) = {C, D}]

Binary "distance" matrix D1 (rows and columns ordered A, B, C, D):

     A  B  C  D
A    0  0  1  1
B    0  0  1  1
C    1  1  0  0
D    1  1  0  0


• Example: convert clustering results into a binary "distance" matrix

[Figure: four objects A, B, C, D; Cluster 1 (C1) = {A, B}, Cluster 2 (C2) = {C}, Cluster 3 (C3) = {D}]

Binary "distance" matrix D2 (rows and columns ordered A, B, C, D):

     A  B  C  D
A    0  0  1  1
B    0  0  1  1
C    1  1  0  1
D    1  1  1  0


• Evidence accumulation: form the collective "distance" matrix

0 0 1 1 0 0 1 1     0 0 1 1 0 0 1 1 D =   D =   1 1 1 0 0 2 1 1 0 1     1 1 0 0 1 1 1 0

0 0 2 2   0 0 2 2 D = D + D =   C 1 2 2 2 0 1   2 2 1 0


• Application to a "non-convex" dataset
  – Data set of 400 data points
  – Initial clustering analysis: K-means (K = 2, ..., 11), 3 initial settings per K, giving 30 partitions in total
  – Convert the clustering results into binary "distance" matrices and form the collective "distance" matrix
  – Apply the agglomerative algorithm (single link) to the collective "distance" matrix
  – Cut the dendrogram tree at the maximum K-cluster lifetime to decide K
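Putting everything together, here is a hedged end-to-end sketch of this pipeline in Python, using scikit-learn's KMeans for the initial partitions and SciPy's single-link agglomerative clustering on the collective matrix. The data matrix X, the function name and the parameter defaults are placeholders chosen to mirror the setting above (K = 2..11, 3 runs per K), not the lecture's own code.

```python
# Clustering ensemble via evidence accumulation (after Fred & Jain, 2005) - a sketch.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def evidence_accumulation(X, ks=range(2, 12), runs_per_k=3, seed=0):
    n = len(X)
    collective = np.zeros((n, n))
    rng = np.random.RandomState(seed)
    # 1) initial clustering: K-means with different K and initial settings
    for k in ks:
        for _ in range(runs_per_k):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=rng.randint(10**6)).fit_predict(X)
            # 2)+3) binary "distance" matrix for this partition, accumulated
            collective += (labels[:, None] != labels[None, :]).astype(float)
    # 4) single-link agglomerative clustering on the collective matrix
    Z = linkage(squareform(collective, checks=False), method='single')
    # 5) cut at the maximum K-cluster lifetime (gap between merge distances)
    gaps = np.diff(Z[:, 2])               # gap i corresponds to n - 1 - i clusters
    best_k = n - 1 - int(np.argmax(gaps))
    return fcluster(Z, t=best_k, criterion='maxclust')
```

Single link is used here to mirror the slide above; any of the cluster distance measures from earlier in the lecture could be substituted.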

Summary

• The hierarchical algorithm is a sequential clustering algorithm
  – Uses a distance matrix to construct a tree of clusters (dendrogram)
  – Gives a hierarchical representation without the need to know the # of clusters (a termination condition can be set when the # of clusters is known)
• Major weaknesses of agglomerative clustering methods
  – Can never undo what was done previously
  – Sensitive to cluster distance measures and noise/outliers
  – Less efficient: O(n² log n), where n is the total number of objects
• Clustering ensemble based on evidence accumulation
  – Initial clustering under different conditions, e.g. K-means with different K and initialisations
  – Evidence accumulation: form the "collective" distance matrix
  – Apply the agglomerative algorithm to the "collective" distance matrix and use the maximum K-cluster lifetime to decide K

Online tutorial: how to use hierarchical clustering functions in Matlab: https://www.youtube.com/watch?v=aYzjenNNOcc
