Hierarchical Cluster Analysis

Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.

There are two types of clustering algorithms: hierarchical and non-hierarchical. We start with hierarchical clustering in these notes; we will move on to a non-hierarchical method, K-means clustering, next.

0. How it works

Given a set of N items to be clustered, and an N*N distance matrix, the basic process of hierarchical clustering is:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances between the clusters equal the distances between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N (a bare-bones R sketch of this loop follows).
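The sketch below writes this loop directly in R, using single linkage (defined in the next paragraphs) as the merge rule. The function name naive_single_linkage and its structure are ours for illustration only; hclust(), used throughout these notes, does the same job far more efficiently.

naive_single_linkage <- function(D) {
  D <- as.matrix(D)
  clusters <- as.list(1:nrow(D))        # step 1: every item starts in its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    best <- c(1, 2); best.d <- Inf
    for (i in 1:(length(clusters)-1))   # step 2: find the closest pair of clusters
      for (j in (i+1):length(clusters)) {
        d.ij <- min(D[clusters[[i]], clusters[[j]]])   # single linkage: smallest pairwise distance
        if (d.ij < best.d) { best.d <- d.ij; best <- c(i, j) }
      }
    merges[[length(merges)+1]] <- list(joined = clusters[best], height = best.d)
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])   # step 3: merge them
    clusters[[best[2]]] <- NULL                                          # step 4: repeat until one cluster remains
  }
  merges
}
# naive_single_linkage(dist(X))   # e.g. with the artificial X from Section 1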

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.

In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

Alternatively, instead of joining the two closest clusters, Ward's method joins the two clusters whose merger leads to the smallest increase in the within-cluster (error) sum of squares, ESS. All of these hierarchical methods are called agglomerative because they merge clusters iteratively, starting from N clusters (the original sample size) and ending with a single cluster.
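In symbols (our notation), for two clusters A and B with pairwise distances d(a, b):

$$d_{\text{single}}(A,B)=\min_{a\in A,\,b\in B} d(a,b),\qquad
d_{\text{complete}}(A,B)=\max_{a\in A,\,b\in B} d(a,b),\qquad
d_{\text{average}}(A,B)=\frac{1}{|A|\,|B|}\sum_{a\in A}\sum_{b\in B} d(a,b),$$

while Ward's method merges the pair of clusters whose union gives the smallest increase in

$$\mathrm{ESS}=\sum_{\text{clusters } C}\;\sum_{x\in C}\lVert x-\bar{x}_C\rVert^{2}.$$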

1. Artificial Data

x1 = c(1, 2, 3, 4, 7, 8, 10)
x2 = c(8, 2, 3, 1, 11, 8, 10)
X = cbind(x1, x2)
plot(X, pch=16)
identify(X, labels=1:7)
[1] 1 2 3 4 5 6 7

X.d = dist(X)
X.d
          1         2         3         4         5         6
2  6.082763
3  5.385165  1.414214
4  7.615773  2.236068  2.236068
5  6.708204 10.295630  8.944272 10.440307
6  7.000000  8.485281  7.071068  8.062258  3.162278
7  9.219544 11.313708  9.899495 10.816654  3.162278  2.828427

X.hc.s = hclust(X.d, method="single")
win.graph()    # Open additional graphics window
plot(X.hc.s)
dev.set(2)     # Switch back to original window
cutree(X.hc.s, 3)
[1] 1 2 2 2 3 3 3
plot(X, pch=cutree(X.hc.s, 3), col=cutree(X.hc.s, 3))

2. Iris Data

iris.s = scale(iris[,-5])
iris.d = dist(iris.s)
iris.hc.s = hclust(iris.d, method="single")
plot(iris.hc.s)
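The hclust object returned above stores the whole merge history, which can be inspected directly (output not shown in the original notes):

iris.hc.s$merge[1:5, ]    # first five merges; negative entries are original observations
head(iris.hc.s$height)    # corresponding merge heights, in increasing order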

pairs(iris[,-5], pch=cutree(iris.hc.s,6), col=cutree(iris.hc.s,6))

We can use the known species variable to determine how well the clustering method (which doesn’t use the species variable) is able to reconstruct the species. If it works well, plotting character and color should match well.
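A cross-tabulation against the species puts a number on this; the table below is our addition here, and the notes do the same thing for complete linkage later on:

table(species = unclass(iris[,5]), pred = cutree(iris.hc.s, 3))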

pairs(iris[,-5], pch=unclass(iris[,5]), col=cutree(iris.hc.s,6))

If one approach doesn't appear to work very well, we can try other linkage methods. For the iris data, complete-linkage clustering at least gives more balanced clusters.

iris.hc.c = hclust(iris.d, method="complete")
plot(iris.hc.c)

pairs(iris[,-5], pch=cutree(iris.hc.c,6), col=cutree(iris.hc.c,6))
pairs(iris[,-5], pch=unclass(iris[,5]), col=cutree(iris.hc.c,6))
pairs(iris[,-5], pch=unclass(iris[,5]), col=cutree(iris.hc.c,3))

[Pairs plot of the four iris measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) produced by the pairs() calls above.]

> table(species=unclass(iris[,5]), pred=cutree(iris.hc.c,3))
       pred
species  1  2  3
      1 49  1  0
      2  0 21 29
      3  0  2 48

> table(species=unclass(iris[,5]), pred=cutree(iris.hc.c,6))
       pred
species  1  2  3  4  5  6
      1 42  7  1  0  0  0
      2  0  0  4 29 17  0
      3  0  0  0 37  2 11

# Clusters 1 and 2 are combined and treated as species 1,
# clusters 3 and 5 as species 2,
# clusters 4 and 6 as species 3.

> misclassification.rate = (1+29+2)/150
> misclassification.rate
[1] 0.2133333
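The same rate can be computed programmatically; grp and tab below are our own shorthand for the relabeling described in the comments above:

grp = c(1, 1, 2, 3, 2, 3)[cutree(iris.hc.c, 6)]    # clusters 1,2 -> 1; 3,5 -> 2; 4,6 -> 3
tab = table(species = unclass(iris[,5]), pred = grp)
1 - sum(diag(tab))/sum(tab)                         # reproduces 0.2133333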

3. More Artificial Data – Four Clusters

X = matrix(rnorm(320), 80, 4)
X[1:20,1] = X[1:20,1]+10
X[21:40,2] = X[21:40,2]+10
X[41:60,3] = X[41:60,3]+10
X = scale(X)
pairs(X)

library(rgl)
plot3d(X[,1],X[,2],X[,3], type="s", size=.25, col="red")
plot3d(X[,1],X[,2],X[,3], type="s", size=.25, col=rep(1:4, c(20,20,20,20)))
rot = eigen(cor(iris[,-5]))$vectors   # Arbitrary rotation matrix to hide structure
Y = scale(X %*% rot)
pairs(Y)
plot3d(Y[,1],Y[,2],Y[,3], type="s", size=.25, col="red")
plot3d(Y[,1],Y[,2],Y[,3], type="s", size=.25, col=rep(1:4, c(20,20,20,20)))

Y.d = dist(Y)
Y.hc.s = hclust(Y.d, method="single")
plot(Y.hc.s)

Y.hc.c = hclust(Y.d, method="complete")
plot(Y.hc.c)
Y.hc.a = hclust(Y.d, method="average")
plot(Y.hc.a)

Y.hc.w = hclust(Y.d, method="ward")   # in recent R versions "ward" is called "ward.D"; "ward.D2" is also available
plot(Y.hc.w)
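A quick way to compare how faithfully each tree preserves the original distances (our addition, using the cophenetic() function from base R) is the cophenetic correlation; values closer to 1 indicate better agreement:

cor(Y.d, cophenetic(Y.hc.s))   # single linkage
cor(Y.d, cophenetic(Y.hc.c))   # complete linkage
cor(Y.d, cophenetic(Y.hc.a))   # average linkage
cor(Y.d, cophenetic(Y.hc.w))   # Ward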

By examining the pairs plot with color or plotting symbol chosen to show a promising cluster structure, we can try to identify the variables that separate out each cluster.

pairs(Y, pch=as.character(rep(1:4,c(20,20,20,20))), col=cutree(Y.hc.w,4))

[Pairs plot of the rotated variables (var 1 to var 4), with the true group number (1 to 4) as the plotting symbol and the Ward cluster as the color.]

> table(truth=rep(1:4,c(20,20,20,20)), pred=cutree(Y.hc.w,4))
     pred
truth  1  2  3  4
    1 20  0  0  0
    2  0 20  0  0
    3  0  0 20  0
    4  0  0  0 20

> table(truth=rep(1:4,c(20,20,20,20)), pred=cutree(Y.hc.c,4))
     pred
truth  1  2  3  4
    1 16  4  0  0
    2  0  0 20  0
    3 14  6  0  0
    4  6 13  0  1
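Beyond eyeballing these tables, a single-number summary of agreement such as the adjusted Rand index can be used. The sketch below is our addition and assumes the mclust package is installed (it is not used elsewhere in these notes):

library(mclust)
truth = rep(1:4, c(20,20,20,20))
adjustedRandIndex(truth, cutree(Y.hc.w, 4))   # equals 1 here: Ward recovers the groups exactly
adjustedRandIndex(truth, cutree(Y.hc.c, 4))   # smaller: complete linkage mixes groups 1, 3 and 4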

Principal components can be a useful supplement or even alternative to cluster analysis. Well-separated clusters in high dimensions will often appear clearly in the first few components.

Y.pca = prcomp(Y)
plot(Y.pca)
pairs(Y.pca$x[,1:3], pch=as.character(rep(1:4,c(20,20,20,20))), col=cutree(Y.hc.w,4))

4. Swiss Canton Data

pairs(swiss, pch=16)

swiss.d = dist(scale(swiss))
plot(hclust(swiss.d, method="single"))
plot(hclust(swiss.d, method="complete"))

plot(hclust(swiss.d, method="ward"))
plot(hclust(swiss.d, method="average"))
swiss.hc.w = hclust(swiss.d, method="ward")
pairs(swiss, pch=cutree(swiss.hc.w, 3), col=cutree(swiss.hc.w, 3))

swiss.pca = prcomp(swiss, scale=T)
plot(swiss.pca)
swiss.pca$rotation
                        PC1        PC2         PC3         PC4         PC5
Fertility        -0.4569876  0.3220284 -0.17376638  0.53555794 -0.38308893
Agriculture      -0.4242141 -0.4115132  0.03834472 -0.64291822 -0.37495215
Examination       0.5097327  0.1250167 -0.09123696 -0.05446158 -0.81429082
Education         0.4543119  0.1790495  0.53239316 -0.09738818  0.07144564
Catholic         -0.3501111  0.1458730  0.80680494  0.09947244 -0.18317236
Infant.Mortality -0.1496668  0.8111645 -0.16010636 -0.52677184  0.10453530

pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.w, 3), col=cutree(swiss.hc.w, 3))
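To see how much of the variance the leading components carry (output not reproduced in the original notes):

summary(swiss.pca)   # standard deviations and cumulative proportion of variance explained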

swiss.hc.c = hclust(swiss.d, method="complete")
pairs(swiss, pch=cutree(swiss.hc.c, 2), col=cutree(swiss.hc.c, 2))
pairs(swiss, pch=cutree(swiss.hc.c, 4), col=cutree(swiss.hc.c, 4))

pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.c, 4), col=cutree(swiss.hc.c,4))

5. Alternative Distance Measures

x1 = c(1, 2, 3, 4, 7, 8, 10)
x2 = c(8, 2, 3, 1, 11, 8, 10)
X = cbind(x1, x2)
plot(X, pch=16)
identify(X, labels=1:7)
[1] 1 2 3 4 5 6 7

X
     x1 x2
[1,]  1  8
[2,]  2  2
  :   :  :

dist(X)   # default - Euclidean (2 norm)
          1         2         3         4         5         6
2  6.082763
3  5.385165  1.414214
4  7.615773  2.236068  2.236068
5  6.708204 10.295630  8.944272 10.440307
6  7.000000  8.485281  7.071068  8.062258  3.162278
7  9.219544 11.313708  9.899495 10.816654  3.162278  2.828427

sqrt((1-2)^2+(8-2)^2)   # d(1,2)
[1] 6.082763

dist(X, method="manhattan")   # 1 norm
   1  2  3  4  5  6
2  7
3  7  2
4 10  3  3
5  9 14 12 13
6  7 12 10 11  4
7 11 16 14 15  4  4

abs(1-2)+abs(8-2)   # d(1,2)
[1] 7

dist(X, method="maximum")   # sup (infinity) norm
   1  2  3  4  5  6
2  6
3  5  1
4  7  2  2
5  6  9  8 10
6  7  6  5  7  3
7  9  8  7  9  3  2

max(abs(1-2),abs(8-2))   # d(1,2)
[1] 6

dist(X, method="minkowski", p = 3)   # 3 norm
          1         2         3         4         5         6
2  6.009245
3  5.104469  1.259921
4  7.179054  2.080084  2.080084
5  6.240251  9.487518  8.320335 10.089202
6  7.000000  7.559526  6.299605  7.410795  3.036589
7  9.032802 10.079368  8.819447  9.813199  3.036589  2.519842

(abs(1-2)^3+abs(8-2)^3)^(1/3)   # d(1,2)
[1] 6.009245
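All three of these metrics are special cases of the Minkowski (p-norm) distance,

$$d_p(x,y)=\Bigl(\sum_i \lvert x_i-y_i\rvert^{p}\Bigr)^{1/p},$$

with $p=1$ giving the Manhattan distance, $p=2$ the Euclidean distance, and the limit $p\to\infty$ the maximum (sup) distance used above.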

6. Distances in Clustering

pairs(swiss, pch=16)

swiss.d2 = dist(scale(swiss))
swiss.hc.c2 = hclust(swiss.d2, method="complete")
plot(swiss.hc.c2)
pairs(swiss, pch=cutree(swiss.hc.c2, 4), col=cutree(swiss.hc.c2, 4))

swiss.d1 = dist(scale(swiss), "manhattan")
swiss.hc.c1 = hclust(swiss.d1, method="complete")
plot(swiss.hc.c1)
pairs(swiss, pch=cutree(swiss.hc.c1, 4), col=cutree(swiss.hc.c1, 4))
swiss.pca = prcomp(swiss, scale=T)
pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.c2, 4), col=cutree(swiss.hc.c2,4))
pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.c1, 4), col=cutree(swiss.hc.c1, 4))

table(data.frame(c2=cutree(swiss.hc.c2,4), c1=cutree(swiss.hc.c1,4)))
   c1
c2   1  2  3  4
  1 20  0  0  0
  2  0 16  0  0
  3  0  0  8  2
  4  0  0  0  1

swiss.dsup = dist(scale(swiss), "maximum")
swiss.hc.csup = hclust(swiss.dsup, method="complete")
plot(swiss.hc.csup)
pairs(swiss, pch=cutree(swiss.hc.csup,4), col=cutree(swiss.hc.csup,4))
pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.csup, 4), col=cutree(swiss.hc.csup, 4))

7. Natural Modality – Unimodal vs. Bimodal

library(MASS)
mu1 = c(3,3)
mu2 = c(6,6)
Sigma1 = matrix(c(1,0,0,1), nrow=2)
bimodal = rbind(mvrnorm(25, mu=mu1, Sigma=Sigma1), mvrnorm(25, mu=mu2, Sigma=Sigma1))
plot(bimodal, pch=16)

mu3 = c(4.5, 4.5)
Sigma2 = matrix(c(2.25,1.5,1.5,2.25), nrow=2)
unimodal = mvrnorm(50, mu=mu3, Sigma=Sigma2)
plot(unimodal, pch=16)
bim.d2 = dist(bimodal)
unim.d2 = dist(unimodal)
plot(hclust(bim.d2, method="complete"))
plot(hclust(unim.d2, method="complete"))

plot(hclust(bim.d2, method="single"))
plot(hclust(unim.d2, method="single"))
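Since the first 25 rows of bimodal were drawn around mu1 and the last 25 around mu2, we can check how well a two-cluster cut of each tree recovers the two components (this check, and the name truth, are our addition):

truth = rep(1:2, c(25, 25))
table(truth, complete = cutree(hclust(bim.d2, method="complete"), 2))
table(truth, single   = cutree(hclust(bim.d2, method="single"), 2))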

plot(hclust(bim.d2, method="ward"))
plot(hclust(unim.d2, method="ward"))

8. Kth Nearest Neighbor Distance

The following function computes a Kth-nearest-neighbor distance between the observations in X: a pair of points that are within each other's K nearest neighbors is assigned the average of their two Kth-nearest-neighbor distances, while every other pair is assigned a large constant (twice the maximum pairwise distance), so that such pairs merge only near the top of the tree:

knn.dist <- function(X, k)
{
  X.d <- as.matrix(dist(X))
  n <- nrow(X)
  knn.d <- apply(X.d, 2, sort)[k+1, ]   # distance from each point to its kth nearest neighbor
  m <- 2*max(X.d)                       # large constant for non-neighboring pairs
  X.knn <- matrix(m, n, n)
  diag(X.knn) <- 0
  for (i in 1:(n-1))
    for (j in (i+1):n)
      if (X.d[i,j] <= knn.d[i] | X.d[i,j] <= knn.d[j]) {
        X.knn[i,j] <- (knn.d[i] + knn.d[j])/2   # average of the two kth-NN distances
        X.knn[j,i] <- X.knn[i,j]
      }
  return(as.dist(X.knn))
}

bim.dk.10 = knn.dist(bimodal, 10)
unim.dk.10 = knn.dist(unimodal, 10)

plot(hclust(bim.dk.10, method="single"))
plot(hclust(unim.dk.10, method="single"))

bim.dk.5 = knn.dist(bimodal, 5)
unim.dk.5 = knn.dist(unimodal, 5)

plot(hclust(bim.dk.5, method="single"))
plot(hclust(unim.dk.5, method="single"))

# Hierarchical Clustering

# A sample Dataset
x1 <- c(1, 2, 3, 4, 7, 8, 10)
x2 <- c(8, 2, 3, 1, 11, 8, 10)
X <- cbind(x1, x2)
plot(X, pch=16)
identify(X, labels=1:7)

X.d <- dist(X)
X.d

X.hc.s <- hclust(X.d, method="single")
win.graph()
plot(X.hc.s)
dev.set(2)
cutree(X.hc.s, 3)
plot(X, pch=cutree(X.hc.s, 3), col=cutree(X.hc.s, 3))
dev.off(3)

# Iris Data
iris.s <- scale(iris[,-5])
iris.d <- dist(iris.s)
species <- iris[,5]
iris.hc.s <- hclust(iris.d, method="single")
plot(iris.hc.s)
pairs(iris[,-5], pch=cutree(iris.hc.s,6), col=cutree(iris.hc.s,6))
pairs(iris[,-5], pch=unclass(species), col=cutree(iris.hc.s,6))
iris.hc.c <- hclust(iris.d, method="complete")
plot(iris.hc.c)
pairs(iris[,-5], pch=cutree(iris.hc.c,6), col=cutree(iris.hc.c,6))
pairs(iris[,-5], pch=unclass(species), col=cutree(iris.hc.c,6))

# Artificial data - 4 clusters

X <- matrix(rnorm(320), 80, 4)
X[1:20,1] <- X[1:20,1]+10
X[21:40,2] <- X[21:40,2]+10
X[41:60,3] <- X[41:60,3]+10
X <- scale(X)
pairs(X)
library(rgl)
plot3d(X[,1],X[,2],X[,3], type="s", size=.25, col="red")
plot3d(X[,1],X[,2],X[,3], type="s", size=.25, col=rep(1:4, c(20,20,20,20)))
rot <- eigen(cor(iris[,-5]))$vectors   # Arbitrary rotation matrix to hide structure

Y <- scale(X %*% rot)
pairs(Y)
plot3d(Y[,1],Y[,2],Y[,3], type="s", size=.25, col="red")
plot3d(Y[,1],Y[,2],Y[,3], type="s", size=.25, col=rep(1:4, c(20,20,20,20)))

Y.d <- dist(Y)

Y.hc.s <- hclust(Y.d, method="single")
plot(Y.hc.s)
Y.hc.c <- hclust(Y.d, method="complete")
plot(Y.hc.c)
Y.hc.a <- hclust(Y.d, method="average")
plot(Y.hc.a)
Y.hc.w <- hclust(Y.d, method="ward")
plot(Y.hc.w)
pairs(Y, pch=cutree(Y.hc.w,4), col=cutree(Y.hc.w,4))

Y.pca <- prcomp(Y)
plot(Y.pca)
pairs(Y.pca$x[,1:3], pch=cutree(Y.hc.w,4), col=cutree(Y.hc.w,4))

# Swiss canton data
pairs(swiss, pch=16)
swiss.d <- dist(scale(swiss))
plot(hclust(swiss.d, method="single"))
plot(hclust(swiss.d, method="complete"))
plot(hclust(swiss.d, method="ward"))
plot(hclust(swiss.d, method="average"))
swiss.hc.w <- hclust(swiss.d, method="ward")
pairs(swiss, pch=cutree(swiss.hc.w, 3), col=cutree(swiss.hc.w, 3))
swiss.pca <- prcomp(swiss, scale=T)
plot(swiss.pca)
swiss.pca$rotation
pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.w, 3), col=cutree(swiss.hc.w, 3))
swiss.hc.c <- hclust(swiss.d, method="complete")
pairs(swiss, pch=cutree(swiss.hc.c, 2), col=cutree(swiss.hc.c, 2))
pairs(swiss, pch=cutree(swiss.hc.c, 4), col=cutree(swiss.hc.c, 4))
pairs(swiss.pca$x[,1:4], pch=cutree(swiss.hc.c, 4), col=cutree(swiss.hc.c, 4))