Hierarchical and Ensemble Clustering
Ke Chen
Reading: [7.8-7.10, EA], [25.5, KPM], [Fred & Jain, 2005]
COMP24111 Machine Learning
Outline
• Introduction
• Cluster Distance Measures
• Agglomerative Algorithm
• Example and Demo
• Key Concepts in Hierarchical Clustering
• Clustering Ensemble via Evidence Accumulation
• Summary
Introduction
• Hierarchical Clustering Approach
  – A typical clustering analysis approach via partitioning the data set sequentially
  – Construct nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  – Use a (generalised) distance matrix as the clustering criterion
• Agglomerative vs. Divisive
  – Agglomerative: a bottom-up strategy
    . Initially each data object is in its own (atomic) cluster
    . Then merge these atomic clusters into larger and larger clusters
  – Divisive: a top-down strategy
    . Initially all objects are in one single cluster
    . Then the cluster is subdivided into smaller and smaller clusters
• Clustering Ensemble
  – Use multiple clustering results for robustness and to overcome the weaknesses of single clustering algorithms
Introduction: Illustration
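• Illustrative Example: Agglomerative vs. Divisive
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
[Figure: agglomerative clustering runs from Step 0 to Step 4, merging a and b into {a, b}, d and e into {d, e}, c and {d, e} into {c, d, e}, and finally everything into {a, b, c, d, e}; divisive clustering runs through the same tree in the opposite direction, again labelled Step 0 to Step 4.]
  . Cluster distance
  . Termination condition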
Cluster Distance Measures
• Single link: smallest distance (min) between an element in one cluster and an element in the other, i.e.,
  $d(C_i, C_j) = \min\{d(x_{ip}, x_{jq})\}$
• Complete link: largest distance (max) between an element in one cluster and an element in the other, i.e.,
  $d(C_i, C_j) = \max\{d(x_{ip}, x_{jq})\}$
• Average: average distance between elements in one cluster and elements in the other, i.e.,
  $d(C_i, C_j) = \operatorname{avg}\{d(x_{ip}, x_{jq})\}$
• The distance from a cluster to itself is zero: $d(C, C) = 0$
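The three measures can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the `abs_diff` object distance are assumptions made for the sketch, and the feature values are those of the five-object example in this section (C1 = {a, b} with values 1, 2 and C2 = {c, d, e} with values 4, 5, 6).

```python
from itertools import product


def pairwise(cluster_a, cluster_b, dist):
    """All distances between one element of cluster_a and one of cluster_b."""
    return [dist(x, y) for x, y in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b, dist):
    """Smallest pairwise distance between the two clusters."""
    return min(pairwise(cluster_a, cluster_b, dist))


def complete_link(cluster_a, cluster_b, dist):
    """Largest pairwise distance between the two clusters."""
    return max(pairwise(cluster_a, cluster_b, dist))


def average_link(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between the two clusters."""
    distances = pairwise(cluster_a, cluster_b, dist)
    return sum(distances) / len(distances)


def abs_diff(x, y):
    """Object distance for a single continuous feature (assumed here)."""
    return abs(x - y)


c1 = [1, 2]      # feature values of a, b
c2 = [4, 5, 6]   # feature values of c, d, e

print(single_link(c1, c2, abs_diff))    # 2
print(complete_link(c1, c2, abs_diff))  # 5
print(average_link(c1, c2, abs_diff))   # 3.5
```

Passing the object distance as a parameter keeps the cluster distance separate from the choice of metric, so the same functions would work with a Euclidean distance on vectors.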
Example: Given a data set of five objects characterised by a single continuous feature, assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

           a  b  c  d  e
  Feature  1  2  4  5  6

1. Calculate the distance matrix.
2. Calculate the three cluster distances between C1 and C2.

Distance matrix:

     a  b  c  d  e
  a  0  1  3  4  5
  b  1  0  2  3  4
  c  3  2  0  1  2
  d  4  3  1  0  1
  e  5  4  2  1  0

Single link:
$\mathrm{dist}(C_1, C_2) = \min\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \min\{3, 4, 5, 2, 3, 4\} = 2$

Complete link:
$\mathrm{dist}(C_1, C_2) = \max\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \max\{3, 4, 5, 2, 3, 4\} = 5$

Average:
$\mathrm{dist}(C_1, C_2) = \frac{d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)}{6} = \frac{3 + 4 + 5 + 2 + 3 + 4}{6} = \frac{21}{6} = 3.5$

Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps:
1) Convert all object features into a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters)
   . Merge the two closest clusters
   . Update the “distance matrix”
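The three steps can be sketched as a naive Python routine. This is an illustrative sketch under stated assumptions, not an efficient implementation: the function name `agglomerative`, its parameters and the `abs_diff` distance are hypothetical, and instead of maintaining a cluster-level distance matrix it recomputes each cluster distance from the object-level matrix, which gives the same result for min/max linkage. The data are the five one-feature objects a to e with values 1, 2, 4, 5, 6.

```python
def agglomerative(points, dist, linkage=min, num_clusters=1):
    """Naive agglomerative clustering.

    points: dict mapping object name -> feature value
    dist: distance between two feature values
    linkage: reduces a list of pairwise distances to one cluster distance
             (min = single link, max = complete link)
    Returns (clusters, merges); merges lists (cluster_a, cluster_b, distance).
    """
    names = list(points)
    # Step 1: distance matrix between all objects.
    d = {(p, q): dist(points[p], points[q]) for p in names for q in names}
    # Step 2: every object starts as its own cluster.
    clusters = [(name,) for name in names]
    merges = []
    # Step 3: repeat until the requested number of clusters remains.
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cluster_d = linkage(
                    [d[p, q] for p in clusters[i] for q in clusters[j]]
                )
                # Ties are broken in favour of the first pair found.
                if best is None or cluster_d < best[0]:
                    best = (cluster_d, i, j)
        cluster_d, i, j = best
        merges.append((clusters[i], clusters[j], cluster_d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


def abs_diff(x, y):
    return abs(x - y)


points = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
clusters, merges = agglomerative(points, abs_diff, linkage=min, num_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [['a', 'b'], ['c', 'd', 'e']]
```

Stopping at `num_clusters=2` corresponds to the "known number of clusters" termination condition; leaving the default of 1 builds the full tree.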
Example
• Problem: clustering analysis with the agglomerative algorithm
[Figure: a data matrix is converted into a distance matrix using the Euclidean distance]
• Merge two closest clusters (iteration 1)
• Update distance matrix (iteration 1)
• Merge two closest clusters (iteration 2)
• Update distance matrix (iteration 2)
• Merge two closest clusters/update distance matrix (iteration 3)
• Merge two closest clusters/update distance matrix (iteration 4)
• Final result (meeting termination condition)
Key Concepts in Hierarchical Clustering
• Dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, which concludes the computation
[Figure: dendrogram; axes labelled "object" and "lifetime"]
• Lifetime vs K-cluster Lifetime
  – Lifetime
    The difference between the distance at which a cluster is created and the distance at which it disappears (merges with another cluster during clustering).
    e.g. the lifetimes of A, B, C, D, E and F are 0.71, 0.71, 1.41, 0.50, 1.00 and 0.50, respectively; the lifetime of (A, B) is 2.50 – 0.71 = 1.79, ……
  – K-cluster Lifetime
    The difference between the distance at which K clusters emerge and the distance at which K clusters vanish (due to the reduction to K-1 clusters).
    e.g.
    5-cluster lifetime is 0.71 – 0.50 = 0.21
    4-cluster lifetime is 1.00 – 0.71 = 0.29
    3-cluster lifetime is 1.41 – 1.00 = 0.41
    2-cluster lifetime is 2.50 – 1.41 = 1.09
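The K-cluster lifetimes follow directly from the sorted merge distances of the dendrogram. The sketch below is a hypothetical helper (the name `k_cluster_lifetimes` is an assumption) applied to the five merge distances listed for the six objects A to F; like the worked figures, it covers K = 5 down to K = 2 only.

```python
def k_cluster_lifetimes(merge_distances, num_objects):
    """Map K -> K-cluster lifetime, given the merge distances of a dendrogram."""
    heights = sorted(merge_distances)
    lifetimes = {}
    # After the i-th merge (1-based), num_objects - i clusters remain.
    # They emerge at heights[i - 1] and vanish at the next merge, heights[i].
    for i in range(1, len(heights)):
        k = num_objects - i
        lifetimes[k] = heights[i] - heights[i - 1]
    return lifetimes


merge_distances = [0.50, 0.71, 1.00, 1.41, 2.50]
lifetimes = k_cluster_lifetimes(merge_distances, num_objects=6)
for k in sorted(lifetimes, reverse=True):
    print(k, round(lifetimes[k], 2))

# K with the longest lifetime.
best_k = max(lifetimes, key=lifetimes.get)
print(best_k)  # 2
```

Here the 2-cluster lifetime (1.09) is the longest, so cutting the dendrogram anywhere between 1.41 and 2.50 yields the two clusters (A, B) and (((D, F), E), C).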
Demo
Agglomerative Demo
Relevant Issues
• How to determine the number of clusters
  – If the number of clusters is known, the termination condition is given!
  – The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters
  – Heuristic rule: cut the dendrogram tree at the K with the maximum K-cluster lifetime to find a “proper” K
• Major weaknesses of agglomerative clustering methods
  – Can never undo what was done previously
  – Sensitive to cluster distance measures and noise/outliers
  – Less efficient: $O(n^2 \log n)$, where n is the total number of objects
• There are several variants to overcome its weaknesses
  – BIRCH: scalable to a large data set
  – ROCK: clustering categorical data
  – CHAMELEON: hierarchical clustering using dynamic modelling
Clustering Ensemble
• Motivation
  – A single clustering algorithm may be affected by various factors
    . Sensitive to initialisation and noise/outliers, e.g. K-means is sensitive to initial centroids!
    . Sensitive to distance metrics, yet a proper one is hard to find
    . Hard to decide on a single best algorithm that can handle all types of cluster shapes and sizes
  – An effective treatment: clustering ensemble
    . Utilise the results obtained by multiple clustering analyses for robustness
• Clustering Ensemble via Evidence Accumulation (Fred & Jain, 2005)
  – A simple clustering ensemble algorithm to overcome the main weaknesses of different clustering methods by exploiting their synergy via evidence accumulation
• Algorithm summary
  – Initial clustering analysis, using either different clustering algorithms or a single clustering algorithm run under different conditions, leading to multiple partitions
    e.g. K-means with various initial centroid settings and different K, or the agglomerative algorithm with different distance metrics and forced to terminate with different numbers of clusters…
  – Convert the clustering results of the different partitions into binary “distance” matrices
  – Evidence accumulation: form a collective “distance” matrix based on all the binary “distance” matrices
  – Apply a hierarchical clustering algorithm (with a proper cluster distance metric) to the collective “distance” matrix and use the maximum K-cluster lifetime to decide K
Example: convert clustering results into a binary “distance” matrix
Partition 1: Cluster 1 (C1) = {A, B}, Cluster 2 (C2) = {C, D}
Binary “distance” matrix (0 if two objects fall in the same cluster, 1 otherwise):

  D1 =
       A  B  C  D
    A  0  0  1  1
    B  0  0  1  1
    C  1  1  0  0
    D  1  1  0  0
Example: convert clustering results into a binary “distance” matrix
Partition 2: Cluster 1 (C1) = {A, B}, Cluster 2 (C2) = {C}, Cluster 3 (C3) = {D}
Binary “distance” matrix (0 if two objects fall in the same cluster, 1 otherwise):

  D2 =
       A  B  C  D
    A  0  0  1  1
    B  0  0  1  1
    C  1  1  0  1
    D  1  1  1  0
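The conversion can be sketched as follows. The helper name `binary_distance_matrix` and the encoding of a partition as a list of cluster labels (in the object order A, B, C, D) are assumptions made for this illustration.

```python
def binary_distance_matrix(labels):
    """Entry (i, j) is 0 if objects i and j share a cluster label, else 1."""
    n = len(labels)
    return [
        [0 if labels[i] == labels[j] else 1 for j in range(n)]
        for i in range(n)
    ]


# Cluster labels for objects A, B, C, D in the two example partitions.
partition_1 = [1, 1, 2, 2]   # {A, B}, {C, D}
partition_2 = [1, 1, 2, 3]   # {A, B}, {C}, {D}

d1 = binary_distance_matrix(partition_1)
d2 = binary_distance_matrix(partition_2)
for row in d2:
    print(row)
```

Only label equality matters, so the numbering of the clusters within a partition is arbitrary.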
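Evidence accumulation: form the collective “distance” matrix

$$D_1 = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

$$D_C = D_1 + D_2 = \begin{pmatrix} 0 & 0 & 2 & 2 \\ 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 1 \\ 2 & 2 & 1 & 0 \end{pmatrix}$$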
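Evidence accumulation is an element-wise sum of the binary matrices. A minimal sketch, with the hypothetical helper name `accumulate` and the two example matrices written out by hand:

```python
def accumulate(matrices):
    """Element-wise sum of equally sized square matrices."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) for j in range(n)]
        for i in range(n)
    ]


d1 = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
d2 = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]

d_c = accumulate([d1, d2])
for row in d_c:
    print(row)
# [0, 0, 2, 2]
# [0, 0, 2, 2]
# [2, 2, 0, 1]
# [2, 2, 1, 0]
```

Each entry counts the partitions in which two objects were separated, so a small value means the pair was grouped together in most of the initial clusterings.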
Application to a “non-convex” dataset
  – Data set of 400 data points
  – Initial clustering analysis: K-means (K=2,…,11), 3 initial settings per K, 30 partitions in total
  – Convert the clustering results to binary “distance” matrices and accumulate them into the collective “distance” matrix
  – Apply the agglomerative algorithm (single-link) to the collective “distance” matrix
  – Cut the dendrogram tree with the maximum K-cluster lifetime to decide K
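The last three steps of this pipeline can be sketched end to end on a toy input. Everything here is illustrative: the three partitions of six objects are hypothetical stand-ins for the 30 K-means partitions (the initial clustering itself is not reproduced), and the function names are assumptions.

```python
def collective_distance(partitions):
    """Entry (i, j) counts the partitions that separate objects i and j."""
    n = len(partitions[0])
    return [
        [sum(p[i] != p[j] for p in partitions) for j in range(n)]
        for i in range(n)
    ]


def single_link_merges(d, num_clusters=1):
    """Single-link agglomerative clustering on a distance matrix d."""
    clusters = [[i] for i in range(len(d))]
    heights = []
    while len(clusters) > num_clusters:
        # Closest pair of clusters; ties are broken by cluster position.
        h, a, b = min(
            (min(d[i][j] for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        heights.append(h)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, heights


# Hypothetical cluster labels for six objects in three initial partitions.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 0, 1, 1, 2],
]

d_c = collective_distance(partitions)
_, heights = single_link_merges(d_c)

# K-cluster lifetime: gap between the merge that creates K clusters
# and the merge that reduces them to K - 1 clusters.
n = len(d_c)
lifetimes = {n - i: heights[i] - heights[i - 1] for i in range(1, len(heights))}
best_k = max(lifetimes, key=lifetimes.get)

clusters, _ = single_link_merges(d_c, num_clusters=best_k)
print(best_k, clusters)  # 2 [[0, 1, 2], [3, 4, 5]]
```

Objects 2 and 5 are each placed inconsistently by one of the partitions, yet the accumulated evidence still assigns them to the group they share with most often.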
Summary
• A hierarchical algorithm is a sequential clustering algorithm
  – Uses a distance matrix to construct a tree of clusters (dendrogram)
  – Hierarchical representation without the need to know the number of clusters (a termination condition can be set when the number of clusters is known)
• Major weaknesses of agglomerative clustering methods
  – Can never undo what was done previously
  – Sensitive to cluster distance measures and noise/outliers
  – Less efficient: $O(n^2 \log n)$, where n is the total number of objects
• Clustering ensemble based on evidence accumulation
  – Initial clustering under different conditions, e.g., K-means with different K and initialisations
  – Evidence accumulation: form the “collective” distance matrix
  – Apply the agglomerative algorithm to the “collective” distance matrix and use the maximum K-cluster lifetime to decide K
Online tutorial: how to use hierarchical clustering functions in Matlab: https://www.youtube.com/watch?v=aYzjenNNOcc