
Clustering: Hierarchical Methods

Huiping Cao

Hierarchical Methods

Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

Agglomerative hierarchical clustering (bottom-up)

Divisive hierarchical clustering (top-down)

AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Start by letting each object form its own cluster

Iteratively merges clusters: find the two clusters that are closest to each other and merge them

Termination: all objects belong to the same cluster

Each iteration merges two clusters, so this method requires at most n iterations
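A minimal sketch of running AGNES in R with the cluster package (the built-in USArrests data set and the parameter choices are illustrative, not part of the slides):

library(cluster)

# AGNES: each observation starts in its own cluster; the two closest
# clusters are merged at every step until a single cluster remains.
ag <- agnes(USArrests, metric = "euclidean", method = "average")

summary(ag)               # merge order and agglomerative coefficient
plot(ag, which.plots = 2) # dendrogram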

DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)

Start by placing all objects in one cluster

Iteratively divides clusters

Termination: eventually (1) each object forms a cluster on its own, or (2) objects within a cluster are sufficiently similar to each other
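A matching sketch for DIANA, under the same illustrative assumptions:

library(cluster)

# DIANA: start with all observations in one cluster and repeatedly
# split clusters until each observation stands alone.
di <- diana(USArrests, metric = "euclidean")

di$dc                     # divisive coefficient
plot(di, which.plots = 2) # dendrogram, read top-down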


Inverse order of AGNES

Practically, heuristics are used in the partitioning step, because there are 2^(n-1) - 1 possible ways to partition a set of n objects into two exclusive subsets; for n = 4 objects there are already 2^3 - 1 = 7 possible first splits.

There are many more agglomerative methods than divisive methods.

Hierarchical Clustering

A user can specify the desired number of clusters as a termination condition.

[Figure: agglomerative (AGNES) vs. divisive (DIANA) on five objects a-e. AGNES runs from Step 0 to Step 4, merging a and b, then d and e, then c with de, and finally ab with cde; DIANA runs from Step 4 back to Step 0, producing the same hierarchy by splitting.]

Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
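A brief sketch of cutting a dendrogram in R (cutree and as.hclust are standard functions; the agnes fit and the choice of k = 4 are illustrative assumptions):

library(cluster)

ag <- agnes(USArrests, method = "average")

# Cut the dendrogram so that 4 connected components (clusters) remain.
groups <- cutree(as.hclust(ag), k = 4)
table(groups) # cluster sizes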

R: AGNES, DIANA

> library(cluster)
> help(agnes)
> help(diana)
> help(mona)

CRAN documentation: https://cran.r-project.org/web/packages/cluster/cluster.pdf

The function mona returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
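A hedged sketch of mona on an invented binary data frame (any data set with only 0/1 variables would do):

library(cluster)

# mona expects a data set whose variables are all binary (0/1).
x <- data.frame(a = c(1, 0, 1, 0),
                b = c(1, 1, 0, 0),
                c = c(0, 1, 1, 0))

m <- mona(x)
m$order # order of objects in the final hierarchy
plot(m) # banner plot of the divisive hierarchy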

Hierarchical Clustering Methods

Major weakness of agglomerative clustering methods

Do not scale well: time complexity of at least O(n^2), where n is the total number of objects

Further improvements

BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

ROCK (1999): clustering categorical data by neighbor and link analysis

CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (1996)

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996:103-114.

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering

Clustering Feature Vector in BIRCH

Clustering feature (CF): CF = (N, LS, SS)

N: number of data points

LS = \sum_{i=1}^{N} X_i, the linear sum of all the points

SS = \sum_{i=1}^{N} X_i^2, the square sum of all the points

From a CF vector it is easy to calculate the centroid, radius, and diameter:

x_0 = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{LS}{N}

R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}} = \sqrt{\frac{N \cdot SS - 2LS^2 + LS^2}{N^2}} = \sqrt{\frac{N \cdot SS - LS^2}{N^2}}

D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}} = \sqrt{\frac{2N \cdot SS - 2LS^2}{N(N-1)}}
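A small R sketch with invented one-dimensional points, computing a CF vector and the derived statistics from the formulas above:

# Invented one-dimensional points, for illustration only.
x <- c(2, 3, 4, 5, 6)

N  <- length(x)
LS <- sum(x)   # linear sum
SS <- sum(x^2) # square sum

centroid <- LS / N
radius   <- sqrt((N * SS - LS^2) / N^2)
diameter <- sqrt((2 * N * SS - 2 * LS^2) / (N * (N - 1)))

c(centroid = centroid, radius = radius, diameter = diameter)
# centroid = 4, radius = sqrt(2) ~ 1.414, diameter = sqrt(5) ~ 2.236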

Additivity theorem

CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
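A quick numeric check of the additivity theorem on two invented point sets:

cf <- function(x) c(N = length(x), LS = sum(x), SS = sum(x^2))

x1 <- c(1, 2, 3)
x2 <- c(10, 11)

# The CF of the merged cluster equals the component-wise sum of the CFs.
cf(c(x1, x2))   # N = 5, LS = 27, SS = 235
cf(x1) + cf(x2) # identical result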

BIRCH (1996)

Scales linearly: finds a good clustering with a single scan of the data and improves the quality with a few additional scans; the complexity is O(n), where n is the number of objects to be clustered.
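A loose, simplified sketch of the single-scan idea. This is not BIRCH's actual CF-tree: it keeps a flat list of sub-cluster CFs, and insert_point and the threshold are invented for illustration:

# Absorb each point into the closest sub-cluster's CF if the updated
# radius stays below a threshold; otherwise start a new sub-cluster.
insert_point <- function(cfs, p, threshold = 1.5) {
  if (length(cfs) > 0) {
    centroids <- sapply(cfs, function(cf) cf[["LS"]] / cf[["N"]])
    j <- which.min(abs(centroids - p))
    cand <- cfs[[j]] + c(N = 1, LS = p, SS = p^2)
    r <- sqrt((cand[["N"]] * cand[["SS"]] - cand[["LS"]]^2) / cand[["N"]]^2)
    if (r <= threshold) {
      cfs[[j]] <- cand
      return(cfs)
    }
  }
  c(cfs, list(c(N = 1, LS = p, SS = p^2)))
}

cfs <- list()
for (p in c(1, 1.5, 2, 10, 10.5, 11)) cfs <- insert_point(cfs, p)
length(cfs) # two sub-clusters after a single pass over the points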

Weaknesses

Handles only numeric data, and is sensitive to the order of the data records

Node size is limited by the memory size

Clustering quality suffers when clusters are not spherical
