
Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics, Week 5
Lecturer: Beate Sick
Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

Topics of today
• Clustering
• How to assess the quality of a clustering?
• K-means clustering cont'd.
• Visualization of K-means clusters
• Quality measures for cluster result
• Pros and Cons
• Hierarchical clustering
• Principles of hierarchical clustering
• Linkage methods
• Visualization and quality measures
• Density based or network analysis based clustering is skipped
• Summary on clustering

Partition clustering
A division of data objects into non-overlapping subsets (clusters).

What is optimized in K-means clustering?
The goal in K-means is to partition the observations into K homogeneous clusters such that the total within-cluster variation (WCV), summed over all K clusters C_k, is as small as possible:

    $WCV_{total} = \sum_{k=1}^{K} WCV(C_k)$

WCV of cluster C_k, based on the squared Euclidean distance between observations i and i':

    $WCV(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$

where |C_k| denotes the number of observations in the kth cluster and p is the number of features (dimensions).

How to choose the "best" number of clusters in K-means?
• Run K-means for several k
• Determine minimized sum of WCV
• Plot minimized sum of WCV vs. k
• Choose k at the last big drop of WCV

    ## find suitable number of centers
    n <- nrow(pots)                               # number of observations
    wss <- rep(0, 6)                              # initialize
    wss[1] <- (n - 1) * sum(apply(pots, 2, var))  # wss if all data in 1 cluster
    for (i in 2:6) wss[i] <- sum(kmeans(pots, centers = i)$withinss)
    plot(1:6, wss, type = "b", xlab = "Number of groups",
         ylab = "Within groups sum of squares")
    ## 3 groups is a good choice
    ## Result varies because of the random starting configuration in K-means

Quality control (QC) of clustering: Check out the silhouette widths
The silhouette width of observation i is defined as:

    $s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \in [-1, 1]$

a_i: average distance between data point i and all other points in the cluster to which i belongs.
b_i: average distance between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong.
Range:
    b_i ≫ a_i: neighboring cluster far away → s_i ≈ 1
    b_i ≈ a_i: neighboring cluster close → s_i ≈ 0
[Figure: scatter plot with axes x1 and x2 illustrating a_i and b_i]
Warning: Clustering methods optimizing average Euclidean distances tend to get better sil values.

How to visualize the result of K-means?
• Silhouette plot
• 2D: PCA or MDS

    library(cluster)            # provides silhouette()
    # visualize in PC 1 & 2
    p1 = prcomp(pots, retx = TRUE)
    # check explained variance
    summary(p1)
    ckm <- kmeans(pots, centers = 3)
    grpsKM = ckm$cluster
    plot(p1$x[, 1], p1$x[, 2], xlab = "PC1", ylab = "PC2",
         pch = grpsKM, col = grpsKM)
    # Silhouette plot
    dist.pots = dist(pots)      # Euclidean distances between observations
    plot(silhouette(grpsKM, dist.pots))

What does the average silhouette width tell us?
For each object, the silhouette plot visualizes with a bar the closeness of the object to its cluster compared to the closeness to the next neighbour cluster (with range: -1 < sil < 1). The longer the bars, the better the cluster.
The 'average width' (the cluster coefficient) gives the average sil:
    0.70 < sil < 1.00: Good structure has been found
    0.50 < sil < 0.70: Reasonable structure found
    0.25 < sil < 0.50: Weak structure, requiring confirmation
    -1 < sil < 0.25: Forget it!

What does a negative silhouette width tell us?
[Figure: example leading to negative silhouette. This clustering makes sense but would not result from a k-means (k=2) clustering.]
A negative value indicates that a point is on average closer to the points of another cluster than to the points within its own cluster.
→ This does not necessarily indicate poor clustering (only w.r.t. the mean sil criterion based on Euclidean distance).

K-means results can strongly depend on starting configuration
Image credits: ISLR

Issues with K-means (non-convex sets)
[Figure: a non-convex set] The color indicates the result of the K-means clustering. Eye-ball analysis would result in 2 clusters corresponding to the 2 bananas.
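The non-convex issue can be reproduced with a small simulation. The following is only a sketch under assumptions: the two concentric rings are hypothetical toy data (not the banana data of the slide), the helper names `ring` and `agreement` are made up for illustration, and `nstart = 10` is an added setting to reduce the influence of the random start.

```r
set.seed(1)

# Hypothetical toy data: two concentric rings (a non-convex cluster shape)
theta <- seq(0, 2 * pi, length.out = 101)[-101]  # 100 evenly spaced angles
ring <- function(r) {
  # points on a circle of radius r plus a little Gaussian noise
  cbind(r * cos(theta), r * sin(theta)) + matrix(rnorm(200, sd = 0.05), ncol = 2)
}
x <- rbind(ring(1), ring(5))
truth <- rep(1:2, each = 100)

# K-means with k = 2 (nstart = 10 is an assumption, not from the slides)
km <- kmeans(x, centers = 2, nstart = 10)

# Single-linkage hierarchical clustering, cut into 2 groups
hc <- cutree(hclust(dist(x), method = "single"), k = 2)

# Share of points assigned to the true ring, allowing for swapped labels
agreement <- function(cl) max(mean(cl == truth), mean(cl != truth))
agreement(km$cluster)  # K-means splits both rings roughly in half
agreement(hc)          # single linkage recovers the two rings
```

K-means with two centers always separates the plane by a straight line, so it cannot isolate a ring that surrounds the other one. Plotting `plot(x, col = km$cluster, asp = 1)` shows the split.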
Figure source (non-convex sets): https://pafnuty.wordpress.com/2013/08/14/non-convex-sets-with-k-means-and-hierarchical-clustering/

Comments on the K-means method
Strength
• Fast
Weakness
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
• Can terminate at a local optimum
• Is based on Euclidean distances and cannot handle other dissimilarities

The K-medoids clustering method
Find representative objects, called medoids, in clusters.
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
Pros:
- more robust against outliers
- can deal with any dissimilarity measure
- easy to find representative objects per cluster (e.g. for easy interpretation)
Cons:
- PAM is slow for large data sets (only fast for small data sets)

K-means clustering in R

    # sample from 3 normal distributions in 2D:
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1.6, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 3))
    plot(x, col = cl$cluster)
    # mark the 3 cluster centers
    points(cl$centers, col = 1:3, pch = 8, cex = 2)

PAM

    library(cluster)   # provides pam()
    cl <- pam(x, 3)
    plot(x, col = cl$clustering, main = 'PAM')
    # mark the 3 medoids
    points(cl$medoids, col = 1:3, pch = 8, cex = 2)

Center is always on an observation.

Partitioning Methods in R
Function "kmeans" in package "stats"
Function "pam" in package "cluster"

Hierarchical clustering
A set of nested clusters organized as a hierarchical tree.

How to do hierarchical clustering?
Since we cannot test all possible dendrograms, we have to search the space of possible dendrograms heuristically, either bottom-up or top-down.
Without proof: The number of possible dendrograms with n leaves is

    $\frac{(2n-3)!}{2^{n-2} \, (n-2)!}$
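The count formula for dendrograms can be checked numerically. This is a minimal sketch: the function names `fact` and `n_dendrograms` are made up for illustration, and factorials are computed as plain products so that the results stay exact integers for small n.

```r
# Exact factorial for small k as a product of integers (fact(0) is 1)
fact <- function(k) prod(seq_len(k))

# Number of possible dendrograms with n leaves:
# (2n - 3)! / (2^(n - 2) * (n - 2)!)
n_dendrograms <- function(n) {
  stopifnot(n >= 2)
  fact(2 * n - 3) / (2^(n - 2) * fact(n - 2))
}

leaves <- c(2, 3, 4, 5, 10)
counts <- sapply(leaves, n_dendrograms)
data.frame(leaves = leaves, dendrograms = counts)
# counts are 1, 3, 15, 105 and 34,459,425
```

Already for 10 leaves there are more than 34 million candidate trees, which is why an exhaustive search is not an option.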
Number of leaves    Number of possible dendrograms
2                   1
3                   3
4                   15
5                   105
...                 ...
10                  34,459,425

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the items in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

Agglomerative Hierarchical Clustering
[Figure: scatter plots (axes: Feature 1, Feature 2) next to the resulting dendrogram]
Problem: We need to generalize the distance between objects to a distance between clusters of objects.
What is the distance between those clusters? → Linkage

Dissimilarity between samples or observations
Any dissimilarity we have seen before can be used:
- Euclidean
- Manhattan
- Simple Matching Coefficient
- Jaccard dissimilarity
- Gower's dissimilarity
- etc.

Dissimilarity between clusters: Linkages
• Single link (min): smallest distance between point-pairs linking both clusters
• Complete link (max): largest distance between point-pairs linking both clusters
• Average: avg distance between point-pairs linking both clusters
• Ward's: with this method we try to minimize the variance of the merged clusters
Hint: in R always use "ward.D2".

How to read a dendrogram
The position of the join node on the distance scale indicates the distance between clusters (this distance depends on the linkage method). For example, if you see two clusters joined at a height of 22, it means that the distance between those clusters was 22.
[Figure: dendrogram; vertical axis: Distance (R: clust$height)]
When you read a dendrogram, you want to determine at which stage the distance between combined clusters is large. You look for large distances between sequential join nodes (here horizontal lines).
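The reading rule above (look for a large distance between sequential join nodes) can be sketched in R. The data below are hypothetical toy data with two well-separated groups, and picking the single largest gap in the merge heights is just one simple way to apply the rule, not a method prescribed by the lecture.

```r
set.seed(42)

# Hypothetical toy data: two well-separated groups of 10 points in 2D
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))

hc <- hclust(dist(x), method = "ward.D2")

# Heights of the n - 1 join nodes, in the order of the merges
h <- hc$height

# Largest jump between two sequential join nodes
gaps <- diff(h)
i <- which.max(gaps)

# Stopping after merge i leaves n - i clusters (n = length(h) + 1)
k <- length(h) + 1 - i
grps <- cutree(hc, k = k)
table(grps)
```

Here the largest jump occurs just before the final merge, so the dendrogram is cut into two groups. With real data the heights should still be inspected visually via `plot(hc)`, since several jumps of similar size may occur.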
Agglomerative hierarchical clustering is often visualized by a dendrogram
Figure source: https://www.researchgate.net/figure/Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the-most_fig3_273456906

Simple example
Draw a dendrogram visualizing the grouping of the following 1D data: 1, 3, 6, 6.5
Use Euclidean distances and single linkage.

Simple example [Solution]

    x = c(1, 3, 6, 6.5)
    names(x) = c('1', '3', '6', '6.5')
    d = dist(x)
    cluster = hclust(d, method = 'single')
    plot(cluster, hang = -10, axes = FALSE)
    axis(2)
    cutree(cluster, k = 3)
    #   1   3   6 6.5
    #   1   2   3   3

Cluster result depends on the used distances and linkage methods
Distances between 2 data points / observations:
• Euclidean
• Manhattan
• Simple Matching Coefficient
• Jaccard dissimilarity
• Gower's dissimilarity
• etc.
Linkage methods based on "distances" between 2 clusters:
• Single link: smallest distance between point-pairs linking both clusters
• Complete link: largest distance between point-pairs linking both clusters
• Average: avg distance between point-pairs linking both clusters
• Ward's: with this method we try to minimize the variance of the merged clusters

Agglomerative Clustering in R
Functions "hclust", "cutree" in package "stats"
Alternative: Function "agnes" in package "cluster"

    ## determine euclidean distances
    dist.pots = dist(pots)
    ## apply agglomerative clustering using Ward linkage
    hc = hclust(dist.pots, method = "ward.D2")
    # plot the dendrogram
    plot(hc)
    ## split into 3 groups
    grps = cutree(hc, k = 3)
    grps
    #  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    #  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2
    #
    # 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
    #  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3

Use a dataset about cars to do a heatmap representation

    > str(dat)
    'data.frame': 38 obs. of 8 variables:
     $ Country : Factor w/ 6 levels "France","Germany",..: 6 6 6 ...
     $ Car : Factor w/ 38 levels "AMC Concord D/L",..: 6 21..
