
Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics, Week 5

Lecturer: Beate Sick [email protected]

Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

Topics of today

• Clustering: how to assess the quality of a clustering?
• K-means clustering cntd.
• Visualization of K-means clusters
• Quality measures for the cluster result
• Pros and cons
• Principles of hierarchical clustering
• Linkage methods
• Visualization and quality measures
• Density-based or network-analysis-based clustering is skipped
• Summary on clustering

Partition clustering
A division of data objects into non-overlapping subsets (clusters)

What is optimized in K-means clustering?

The goal in K-means is to partition the observations into K homogeneous clusters such that the total within-cluster variation (WCV), summed over all K clusters $C_k$, is as small as possible:

$\mathrm{WCV}_{\mathrm{total}} = \sum_{k=1}^{K} \mathrm{WCV}(C_k)$

The WCV of cluster $C_k$ is defined via the squared Euclidean distances between observations $i$ and $i'$:

$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i,\,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$

where $|C_k|$ denotes the number of observations in the kth cluster and $p$ is the number of features (dimensions).
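A minimal R sketch (simulated data, not from the lecture) illustrating this quantity: the pairwise form of the WCV above equals exactly twice the centroid-based within-cluster sum of squares that kmeans() reports.

# sketch: pairwise WCV formula vs. the within sum of squares from kmeans()
set.seed(1)
X  <- matrix(rnorm(100), ncol = 2)          # hypothetical 2-column data matrix
km <- kmeans(X, centers = 3, nstart = 20)

wcv.pairwise <- sapply(1:3, function(k) {
  Xk <- X[km$cluster == k, , drop = FALSE]
  sum(as.matrix(dist(Xk))^2) / nrow(Xk)     # (1/|Ck|) * sum of squared pairwise distances
})
sum(wcv.pairwise)     # total WCV as defined above
2 * km$tot.withinss   # identical: pairwise WCV = 2 * centroid-based within SS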

How to choose the "best" number of clusters in K-means?

• Run K-means for several values of k
• Determine the minimized sum of WCV
• Plot the minimized sum of WCV vs. k
• Choose k at the last big drop of the WCV

## find a suitable number of centers
n = nrow(pots)                                # number of observations
wss = rep(0, 6)                               # initialize
wss[1] = (n - 1) * sum(apply(pots, 2, var))   # wss if all data are in 1 cluster
for (i in 2:6) wss[i] <- sum(kmeans(pots, centers = i)$withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")
## 3 groups is a good choice
## The result varies because of the random starting configuration in K-means

Quality check (QC) of the clustering: check out the silhouette widths

The silhouette width of observation i is defined as:

$sil_i = \frac{b_i - a_i}{\max(a_i,\, b_i)} \in [-1, 1]$

$a_i$: average distance between data point i and all other points in the same cluster to which i belongs.

$b_i$: average distance between i and its "neighbor" cluster, i.e., the nearest cluster to which it does not belong.

Range: $b_i \gg a_i$ (neighboring cluster far away) gives $sil_i \to 1$; $b_i \approx a_i$ (neighboring cluster close) gives $sil_i \to 0$.

Warning: clustering methods that optimize average Euclidean distances tend to get better sil values.
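A minimal sketch (simulated data, package "cluster") showing how the individual silhouette widths and the average silhouette width can be obtained for a k-means result:

library(cluster)                          # provides silhouette()
set.seed(2)
X  <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
            matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 20)

sil <- silhouette(km$cluster, dist(X))    # one row per observation: cluster, neighbor, sil_width
head(sil[, "sil_width"])                  # individual silhouette widths
mean(sil[, "sil_width"])                  # average silhouette width
plot(sil)                                 # silhouette plot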

How to visualize the result of K-means?

• Silhouette plot
• 2D: PCA or MDS

# K-means clustering into 3 groups
ckm <- kmeans(pots, centers = 3)
grpsKM = ckm$cluster

# visualize in PC 1 & 2
p1 = prcomp(pots, retx = TRUE)
summary(p1)                      # check explained variance
plot(p1$x[, 1], p1$x[, 2],
     xlab = "PC1", ylab = "PC2",
     pch = grpsKM, col = grpsKM)

# Silhouette plot
library(cluster)
dist.pots = dist(pots)           # Euclidean distances, needed for silhouette()
plot(silhouette(grpsKM, dist.pots))

What does the average silhouette width tell us?

For each object, the silhouette plot visualizes with a bar how close the object is to its own cluster compared to the next neighboring cluster (with range: -1 < sil < 1); the longer the bars, the better the cluster. The "average width" (cluster coefficient) gives the average sil:

0.70 < sil < 1.00: Good structure has been found

0.50 < sil < 0.70: Reasonable structure found

0.25 < sil < 0.50: Weak structure, requiring confirmation

-1 < sil < 0.25: Forget it!

What does a negative silhouette width tell us?

Example leading to negative silhouette widths: a clustering that makes sense but would not result from a k-means (k = 2) clustering.

A negative value indicates that a point is, on average, closer to the points of another cluster than to the points within its own cluster.

→ This does not necessarily indicate poor clustering; it is poor only with respect to the Euclidean-distance-based mean silhouette criterion.

K-means results can strongly depend on the starting configuration (image credits: ISLR).
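Because of this dependence on the random start, a common remedy (a sketch, not shown on the slide) is to run k-means from several random starting configurations and keep the best solution; kmeans() supports this via the nstart argument:

## run k-means with 25 random starting configurations and keep the best
## solution, i.e. the one with the smallest total within-cluster sum of squares
ckm <- kmeans(pots, centers = 3, nstart = 25)
ckm$tot.withinss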

Issues with K-means (non-convex sets)

A non-convex set

The color indicates the result of the K-means clustering.

Eye-ball analysis would result in 2 clusters corresponding to the 2 bananas.

https://pafnuty.wordpress.com/2013/08/14/non-convex-sets-with-k-means-and-hierarchical-clustering/

Comments on the K-means method

Strengths:
• Fast

Weaknesses:
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
• Can terminate at a local optimum
• Is based on Euclidean distances and cannot handle other dissimilarities

The K-medoids clustering method

Find representative objects, called medoids, in clusters.

PAM (Partitioning Around Medoids):
- starts from an initial set of medoids
- iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

Pros:
- more robust against outliers
- can deal with any dissimilarity measure
- easy to find representative objects per cluster (e.g. for easy interpretation)

Cons:
- PAM is slow for large data sets (only fast for small data sets)

K-means clustering in R

# sample from 3 bivariate normal distributions in 2D:
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1.6, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 3))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:3, pch = 8, cex = 2)   # mark the 3 cluster centers

PAM

library(cluster)                                  # provides pam()
cl <- pam(x, 3)
plot(x, col = cl$cluster, main = "PAM")
points(cl$medoids, col = 1:3, pch = 8, cex = 2)   # a medoid is always an actual observation

Partitioning Methods in R

• Function "kmeans" in package "stats"

• Function "pam" in package "cluster"

Hierarchical clustering
A set of nested clusters organized as a hierarchical tree

How to do hierarchical clustering?

Without proof: the number of possible dendrograms with n leaves is (2n - 3)! / [2^(n-2) (n - 2)!]. Since we cannot test all possible dendrograms, we have to search heuristically.

Number of leaves → number of possible dendrograms:
2 → 1
3 → 3
4 → 15
5 → 105
...
10 → 34,459,425

(See the sketch below reproducing these counts.)
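A small sketch reproducing the counts in the table above from the formula (2n - 3)! / [2^(n-2) (n - 2)!]:

# number of possible dendrograms (rooted binary trees) with n labeled leaves
n.dendro <- function(n) factorial(2 * n - 3) / (2^(n - 2) * factorial(n - 2))
sapply(c(2, 3, 4, 5, 10), n.dendro)
# 1  3  15  105  34459425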

We could do this:

Bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-down (divisive): starting with all the items in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

Agglomerative Hierarchical Clustering

[Figure: points in a Feature 1 vs. Feature 2 scatter plot and the corresponding dendrogram.]

Problem: we need a generalization of the distance between single objects to a distance between clusters of objects. What is the distance between two clusters? → Linkage

Dissimilarity between samples or observations

Any dissimilarity we have seen before can be used:
- Euclidean
- Manhattan
- Simple Matching Coefficient
- Jaccard dissimilarity
- Gower's dissimilarity
- etc.

Dissimilarity between clusters: Linkages

• Single link (min): smallest distance between point-pairs linking both clusters
• Complete link (max): largest distance between point-pairs linking both clusters
• Average: average distance between point-pairs linking both clusters
• Ward's: with this method we try to minimize the variance of the merged clusters. Hint: in R always use method = "ward.D2"

How to read a dendrogram

The position of the join node on the distance scale indicates the distance between the merged clusters (this distance depends on the linkage method). For example, if you see two clusters joined at a height of 22, it means that the distance between those clusters was 22.

Distance (R: clust$height)

When you read a dendrogram, you want to determine at which stage the distance between combined clusters is large; you look for large distances between sequential join nodes (the horizontal lines).

Agglomerative hierarchical clustering is often visualized by a dendrogram

https://www.researchgate.net/figure/Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the-most_fig3_273456906

Simple example

Draw a dendrogram visualizing the grouping of the following 1D data: 1, 3, 6, 6.5.

Use Euclidean distances and single linkage.

Simple example [Solution]

x = c(1, 3, 6, 6.5)
names(x) = c('1', '3', '6', '6.5')
d = dist(x)
cluster = hclust(d, method = 'single')
plot(cluster, hang = -10, axes = FALSE)
axis(2)
cutree(cluster, k = 3)
#   1   3   6 6.5
#   1   2   3   3

Cluster result depends on the used distances and linkage methods

Distances between 2 data points / observations:
• Euclidean
• Manhattan
• Simple Matching Coefficient
• Jaccard dissimilarity
• Gower's dissimilarity
• etc.

Linkage methods based on "distances" between 2 clusters:
• Single link: smallest distance between point-pairs linking both clusters
• Complete link: largest distance between point-pairs linking both clusters
• Average: average distance between point-pairs linking both clusters
• Ward's: with this method we try to minimize the variance of the merged clusters

Agglomerative Clustering in R

• Functions "hclust", "cutree" in package "stats"

• Alternative: function "agnes" in package "cluster"

## determine Euclidean distances
dist.pots = dist(pots)
## apply agglomerative clustering using Ward linkage
hc = hclust(dist.pots, method = "ward.D2")
# plot the dendrogram
plot(hc)

## split into 3 groups
grps = cutree(hc, k = 3)
grps
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2
#
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3
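The choice of k can also be checked numerically (a sketch using the hc object from above): large gaps between consecutive merge heights indicate natural cut points.

## merge heights of the dendrogram, largest first
heights <- rev(hc$height)
round(heights[1:6], 1)                        # look for a big gap between consecutive heights
## a clear drop after the (k-1)-th value supports cutting into k groups
plot(heights, type = "b", ylab = "Merge height")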

Use a dataset about cars to do a heatmap representation

> str(dat)
'data.frame': 38 obs. of 8 variables:
 $ Country     : Factor w/ 6 levels "France","Germany",..: 6 6 6 ...
 $ Car         : Factor w/ 38 levels "AMC Concord D/L",..: 6 21 ...
 $ MPG         : num  16.9 15.5 19.2 18.5 30 27.5 27.2 30.9 20.3 17 ...
 $ Weight      : num  4.36 4.05 3.6 3.94 2.15 ...
 $ Drive_Ratio : num  2.73 2.26 2.56 2.45 3.7 3.05 3.54 3.37 3.9 3.5 ...
 $ Horsepower  : int  155 142 125 150 68 95 97 75 103 125 ...
 $ Displacement: int  350 351 267 360 98 134 119 105 131 163 ...
 $ Cylinders   : int  8 8 8 8 4 4 4 4 5 6 ...

A pretty heatmap (pheatmap) incorporates hierarchical clustering

# prepare numeric feature matrix
my.select = dat[, 3:8]
x = t(as.matrix(my.select))

# defaults: dist = euclidean, linkage = complete;
# with scale = "row" the rows are scaled (mean = 0, sd = 1)
library(pheatmap)
pheatmap(x, scale = "row")

Heatmaps allow us to "look into" the clusters.

Here: the left cluster holds heavy, thirsty cars.

Add some metadata to the heatmap

# give matrix x column names to be used in the heatmap plot
colnames(x) = dat$Car
# let's prepare a color side bar
annot_col = data.frame(Country = dat$Country)
# give the association between the columns (= Car) and annot_col
rownames(annot_col) = colnames(x)
# plot the heatmap
pheatmap(x, scale = "row", annotation_col = annot_col)

Side color bars allow us to check hypotheses about an association between the clusters and an external categorical variable that was not used during clustering.

Cluster result depends on the data structure, distances and linkage methods

Data: we simulated 2 two-dimensional Gaussian clusters with very different sizes.

[Figure: clustering results with single, complete and average linkage.]

Ward likes to produce clusters of equal sizes.

Compare linkage methods

• Single linkage produces long and skinny clusters
• Ward's produces ten very separated clusters
• Average linkage yields more round clusters

Generally, clustering is an exploratory tool: use the linkage that produces the "best" result (see the sketch below for comparing linkages).
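A sketch (reusing the dist.pots dissimilarities from the pots example earlier in these slides) for trying several linkage methods on the same data and checking whether the resulting groups are stable:

## fit the same dissimilarities with different linkage methods
methods <- c("single", "complete", "average", "ward.D2")
fits <- lapply(methods, function(m) hclust(dist.pots, method = m))
names(fits) <- methods

## compare the 3-group solutions, e.g. single vs. Ward linkage
table(single = cutree(fits$single,  k = 3),
      ward   = cutree(fits$ward.D2, k = 3))

## plot the four dendrograms side by side
op <- par(mfrow = c(1, 4))
for (m in methods) plot(fits[[m]], main = m, xlab = "", sub = "")
par(op)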

[Figure: dendrograms of the same data under average, single and Ward's linkage.]

The user's dilemma when doing clustering

(a) What is a cluster?
(b) What features should be used?
(c) Should the data be normalized?
(d) Does the data contain any outliers?
(e) How do we define the pair-wise similarity?
(f) How many clusters are present in the data?
(g) Which clustering method should be used?
(h) Does the data have any clustering tendency?
(i) Are the discovered clusters and partitions valid?

All these questions usually have no clear answer!

A. K. Jain, "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, 2010. https://www.sciencedirect.com/science/article/pii/S0167865509002323

How good is the clustering?

How to assess the quality of a clustering?

• Internal QC criteria (without ground-truth knowledge or labels):
  - Visualization based: dendrogram, heatmap, coloring in a 2D plot (PCA, MDS, t-SNE)
  - Silhouette plot
  - WCV to quantify the within-cluster variation
  - ...

• External QC measures (require labels or ground-truth knowledge):
  - Confusion matrix
  - Mutual information
  - ...

See also: A. K. Jain, "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, 2010. https://www.sciencedirect.com/science/article/pii/S0167865509002323

Confusion Matrix / Accuracy

Evaluate the prediction accuracy on data with known class labels.

Confusion matrix (actual class vs. assigned cluster):

                 assigned cluster
 ACTUAL CLASS    black   yellow   red
   black           7       0       0
   brown           2       6       2
   blue            0       0      16

For an ideal classifier the off-diagonal entries should be zero, yielding an accuracy of 1.

Simply count the number of correct assignments and divide it by the sum of all:

Accuracy = (7 + 6 + 16) / (7 + 6 + 16 + 2 + 2) ≈ 0.88
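A small sketch reproducing the accuracy computation in R; the "correct" count per actual class is taken as the majority cell of its row, as on the slide:

## confusion matrix from the slide: actual class vs. assigned cluster
cm <- matrix(c(7, 0,  0,
               2, 6,  2,
               0, 0, 16),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual  = c("black", "brown", "blue"),
                             cluster = c("black", "yellow", "red")))
correct  <- sum(apply(cm, 1, max))   # 7 + 6 + 16 = 29
accuracy <- correct / sum(cm)        # 29 / 33
round(accuracy, 2)                   # 0.88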

Pointwise mutual information (PMI): a simple example with two binary variables

2x2 example, joint distribution p(x, y):

         x = 0   x = 1
 y = 1    0.70    0.05
 y = 0    0.10    0.15

Marginal distributions: p(x = 0) = 0.8, p(x = 1) = 0.2, p(y = 0) = 0.25, p(y = 1) = 0.75.

Mutual information measures the amount of information that one variable carries about the other. The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y is defined as:

$\mathrm{pmi}(x; y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)}$

PMI is zero if X and Y are independent, since in the case of independence it holds that p(x, y) = p(x) p(y). PMI is maximized when X and Y are perfectly associated.

For the example:
pmi(x = 0; y = 0) = log2( 0.1 / (0.8 * 0.25) ) = -1
pmi(x = 0; y = 1) = 0.22
pmi(x = 1; y = 0) = 1.58
pmi(x = 1; y = 1) = -1.58

Mutual Information (MI): a simple example with two binary variables

2x2 example (same joint distribution as on the previous slide):

         x = 0   x = 1
 y = 1    0.70    0.05
 y = 0    0.10    0.15

The mutual information (MI) of the random variables X and Y is the expected value of the pointwise mutual information (pmi) over all possible outcomes. MI = 0 if X carries no information about Y.

$\mathrm{MI}(X; Y) = E_{X,Y}\big[\mathrm{pmi}(x, y)\big] = \sum_{x, y} p(x, y)\, \log_2 \frac{p(x, y)}{p(x)\, p(y)}$

For the example:
MI(X; Y) = 0.1 * log2( 0.1 / (0.8 * 0.25) ) + ... + 0.05 * log2( 0.05 / (0.2 * 0.75) )
         = 0.1 * (-1) + 0.7 * 0.22 + 0.15 * 1.58 + 0.05 * (-1.58) ≈ 0.21

(A short R sketch reproducing these numbers follows below.)
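A short sketch reproducing the pmi and MI values of this 2x2 example in R:

## joint distribution of the 2x2 example (rows: x = 0, 1; columns: y = 0, 1)
p.xy <- matrix(c(0.10, 0.70,
                 0.15, 0.05),
               nrow = 2, byrow = TRUE,
               dimnames = list(x = c("0", "1"), y = c("0", "1")))
p.x <- rowSums(p.xy)                  # 0.8, 0.2
p.y <- colSums(p.xy)                  # 0.25, 0.75

pmi <- log2(p.xy / outer(p.x, p.y))   # pointwise mutual information
round(pmi, 2)                         # -1, 0.22, 1.58, -1.58
MI  <- sum(p.xy * pmi)                # expected pmi
round(MI, 2)                          # 0.21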

Using external information to measure cluster quality: normalized mutual information criterion

The mutual information MI(C, T) between the categorical variable T, giving the ground-truth group (1, ..., k), and the categorical variable C, giving the cluster assignment (1, ..., r), quantifies the amount of information that the clustering carries about the ground-truth group; therefore MI(C, T) quantifies the clustering quality.

$\mathrm{MI}(C; T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p(c_i, t_j)\, \log_2 \frac{p(c_i, t_j)}{p(c_i)\, p(t_j)} \in [0, \infty)$

MI(C, T) is zero if C and T are independent; however, there is no upper bound for MI. Therefore, the normalized mutual information (NMI) is often used:

$\mathrm{NMI}(C; T) = \frac{\mathrm{MI}(C; T)}{\sqrt{H(C)\, H(T)}} \in [0, 1]$

with the entropy of the clustering C and the entropy of the partitioning T:

$H(C) = -\sum_{i=1}^{r} p(c_i) \log_2 p(c_i), \qquad H(T) = -\sum_{j=1}^{k} p(t_j) \log_2 p(t_j)$

NMI ≈ 0: worst possible clustering; NMI = 1: perfect clustering.
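A minimal sketch of an NMI function following the definitions above (with the square-root normalization as written); the inputs are two categorical vectors, e.g. grps from cutree() and a hypothetical ground-truth vector truth:

## NMI between a clustering and a ground-truth grouping
## (assumes every cluster and every truth label occurs at least once)
nmi <- function(clus, truth) {
  p.ct <- table(clus, truth) / length(clus)       # joint distribution
  p.c  <- rowSums(p.ct)
  p.t  <- colSums(p.ct)
  idx  <- p.ct > 0                                # skip log2(0) terms
  MI   <- sum(p.ct[idx] * log2(p.ct[idx] / outer(p.c, p.t)[idx]))
  H.c  <- -sum(p.c * log2(p.c))
  H.t  <- -sum(p.t * log2(p.t))
  MI / sqrt(H.c * H.t)
}
## example call with hypothetical labels:
# nmi(grps, truth)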

https://www.coursera.org/learn/cluster-analysis/lecture/baJNC/6-5-external-measure-2-entropy-based-measures

Customer segmentation → exercises

Customer segmentation is used to get an understanding of your customer base.

[Figure: sketch relating some marketing campaigns, customers and customer orders.]

Summary: typical steps

Hierarchical clustering:
• Calculate distances
• Create the dendrogram; linkage? Use different linkage methods and check if the clusters are stable
• Cut the dendrogram at a big drop (a long distance between sequential splits)

Partitioning methods (k-means, PAM):
• Calculate distances (for PAM)
• Choose k at the kink in the sum-of-WCV plot (the point k where the WCV stops to decrease fast)

• Check the quality of the clustering: WCV, silhouette plot (or with external criteria)

• Visualize in 2D: PCA, MDS or t-SNE; use different colors for the different clusters

• Give meaning to the clusters:
  - generally hard in high dimensions
  - look at centers or representatives (easy in PAM)
  - look at the heatmap and interpret the colors (for numeric data)
  - perform a classification with the cluster ID as class label and look at the variable importance (needs classification, see later)

(A compact end-to-end sketch combining these steps follows below.)
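A compact end-to-end sketch combining the steps above, assuming a hypothetical numeric data frame dat.num:

library(cluster)                            # silhouette()

## 1. distances and hierarchical clustering
d  <- dist(scale(dat.num))                  # scale features first
hc <- hclust(d, method = "ward.D2")
plot(hc)                                    # inspect the dendrogram

## 2. cut at a large drop in merge height
grps <- cutree(hc, k = 3)

## 3. quality check
plot(silhouette(grps, d))

## 4. visualize the clusters in the first two principal components
pc <- prcomp(scale(dat.num))
plot(pc$x[, 1:2], col = grps, pch = grps, xlab = "PC1", ylab = "PC2")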
