Multivariate Statistics Chapter 6: Cluster Analysis
Pedro Galeano, Departamento de Estadística, Universidad Carlos III de Madrid ([email protected])
Course 2017/2018, Master in Mathematical Engineering

1 Introduction
2 The clustering problem
3 Hierarchical clustering
4 Partition clustering
5 Model-based clustering

Introduction

The purpose of cluster analysis is to group the objects in a multivariate data set into homogeneous groups. This is done by grouping individuals that are similar according to some appropriate criterion.

Once the clusters are obtained, it is generally useful to describe each group using some descriptive tools, to better understand the differences that exist among the groups.

Cluster methods are also known as unsupervised classification methods. They differ from the supervised classification methods, or Classification Analysis, that will be presented in Chapter 7.

Clustering techniques are applicable whenever a data set needs to be divided into meaningful groups. In some situations we know that the data naturally fall into a certain number of groups, but usually the number of clusters is unknown. Some clustering methods require the user to specify the number of clusters a priori. Thus, unless additional information about the number of clusters exists, it is reasonable to explore different values and look at potential interpretations of the clustering results.

Central to some clustering approaches is the notion of proximity of two random vectors. We usually measure the degree of proximity of two multivariate observations by a distance measure. The Euclidean distance is typically the first, and also the most common, distance applied in Cluster Analysis. Other distances, such as those presented in Chapter 5, can also be considered.

Some cluster procedures are based on mixtures of distributions. The underlying assumptions of these mixtures, i.e., that the data in the different groups come from certain distributions, are not easy to verify and may not hold. However, these methods have been shown to be powerful under general circumstances.

Cluster Analysis can be seen as an exploratory tool. Different cluster solutions will appear if one considers different numbers of clusters, distance measures or mixture distributions. These solutions might provide new insight into the structure of the data set. Therefore, if possible, the interpretation of cluster solutions should involve subject experts.

There is a vast number of clustering procedures. Here, we will focus on:
- Hierarchical clustering: starts with single clusters (the individual observations) and merges clusters, or starts with a single cluster (the whole data set) and splits clusters.
- Partition clustering: starts from a given group definition and proceeds by exchanging elements between groups until a certain criterion is optimized.
- Model-based clustering: the random vectors are modeled by mixtures of distributions, leading to posterior probabilities of the observation memberships.

Before presenting these methods, we define the problem.
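As a small illustration of the proximity computations mentioned above, the following sketch standardizes a data matrix and builds the pairwise Euclidean distance matrix with scipy. The toy data and variable names are assumptions for illustration only and are not part of the original slides.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix X with n = 5 observations and p = 3 variables (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Standardize each variable (column) so that no variable dominates the distance
# simply because of its unit of measurement.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Pairwise Euclidean distances between the n observations, arranged as an n x n matrix.
D = squareform(pdist(Z, metric="euclidean"))
print(np.round(D, 2))
```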
The clustering problem

Given a data matrix X of dimension n × p, we want to obtain a partition of the data set, C_1, ..., C_K, where C_k, for k = 1, ..., K, is the set containing the indices of the observations in cluster k. Therefore, i ∈ C_k means that observation x_i· belongs to cluster k.

Any partition C_1, ..., C_K verifies the following two properties:
- Each observation belongs to at least one of the K clusters, i.e., C_1 ∪ ··· ∪ C_K = {1, ..., n}.
- No observation belongs to more than one cluster, i.e., C_k ∩ C_k' = ∅, for k ≠ k'.

The problem is to find an appropriate partition, C_1, ..., C_K, for our data set.

The key interpretative point of hierarchical and partition methods is that the elements within a cluster C_k are much more similar to each other than to any element from a different cluster C_k'. This interpretation does not necessarily hold in model-based clustering, where similar observations can belong to different clusters.

Hierarchical clustering

There are two types of hierarchical clustering methods:
1. In agglomerative clustering, one starts with n single clusters and merges them into larger clusters.
2. In divisive clustering, one starts with a single cluster and divides it into smaller clusters.

Most attention has been paid to agglomerative methods. However, it has been argued that divisive methods can provide more sophisticated and robust clusterings.

The end result of all hierarchical clustering methods is a graphical output called a dendrogram, where the k-th cluster solution is obtained by merging some of the clusters from the (k + 1)-th cluster solution.

The result of hierarchical algorithms depends on the distance considered. In particular, when the variables are in different units of measurement and the distance used does not take this into account, it is better to standardize the variables.

The algorithm for agglomerative hierarchical clustering (agglomerative nesting, or agnes) is as follows:
1. Initially, each observation x_i·, for i = 1, ..., n, is a cluster.
2. Compute D = {d_ii' : i, i' = 1, ..., n}, the matrix that contains the distances between the n observations (clusters).
3. Find the smallest distance in D, say d_II', and merge clusters I and I' to form a new cluster II'.
4. Compute the distances d_II',I'' between the new cluster II' and all other clusters I'' ≠ II' (detailed below).
5. Form a new distance matrix D by deleting the rows and columns of I and I' and adding a new row and column for II' with the distances computed in step 4.
6. Repeat steps 3, 4 and 5 a total of n − 1 times, until all observations are merged into a single cluster.

Computation of the distances d_II',I'' between the new cluster II' and all other clusters I'' ≠ II' can be done with one of the following linkage methods (see the sketch below):
- Single linkage: d_II',I'' = min{d_I,I'', d_I',I''}.
- Complete linkage: d_II',I'' = max{d_I,I'', d_I',I''}.
- Average linkage: d_II',I'' = ( Σ_{i ∈ II'} Σ_{i'' ∈ I''} d_i,i'' ) / (n_II' · n_I''), where n_II' and n_I'' are the numbers of items in clusters II' and I'', respectively.
- Ward linkage: d_II',I'' is the squared Euclidean distance between the sample mean vectors of the elements in the two clusters.
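As referenced above, the sketch below runs the agglomerative procedure with each of the four linkage methods using scipy's hierarchical clustering routines. The toy data matrix is an assumption for illustration; note also that scipy's "ward" option implements Ward's minimum-variance criterion, which is closely related to, but stated differently from, the Ward linkage described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy standardized data matrix Z with n = 6 observations and p = 2 variables (illustrative).
rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 2))

# Agglomerative clustering with each linkage method. Each row of the matrix returned
# by linkage() records one merge: the two clusters joined, the distance at which they
# were joined, and the size of the resulting cluster.
for method in ["single", "complete", "average", "ward"]:
    merges = linkage(Z, method=method, metric="euclidean")
    # Cut the hierarchy into K = 2 clusters to compare the resulting memberships.
    labels = fcluster(merges, t=2, criterion="maxclust")
    print(method, labels)
```

Comparing the labels across methods shows how the choice of linkage can change the resulting partition even on the same distance matrix.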
The dendrogram is a graphical representation of the cluster solutions. In particular, the dendrogram shows the distances at which clusters are combined to form new clusters. Similar clusters are combined at low distances, whereas dissimilar clusters are combined at high distances. Consequently, the differences between these distances indicate how close the clusters are to each other.

To obtain a partition of the data into a specified number of groups, we can cut the dendrogram at an appropriate distance. The number of vertical lines, K, cut by a horizontal line drawn on the dendrogram at a given distance identifies a K-cluster solution. The items located at the ends of all branches below the horizontal line constitute the members of the corresponding cluster.

To assess whether or not a cluster solution is appropriate, we can use the silhouette. Let:
- a(x_i·) be the average distance of x_i· to all other points in its cluster.
- b(x_i·) be the lowest average distance of x_i· to any cluster of which x_i· is not a member.
- s(x_i·) be the silhouette of x_i·:
  s(x_i·) = (b(x_i·) − a(x_i·)) / max{a(x_i·), b(x_i·)}

The silhouette s(x_i·) ranges from −1 to 1, such that a positive value means that the object is well matched to its own cluster and a negative value means that the object is badly matched to its own cluster. The average silhouette gives a global measure of the quality of the assignment: the more positive, the better the configuration.

Illustrative example (I)

We are going to apply the agnes algorithm to the states data set. For that, we use the Euclidean distance after taking logarithms of the first, third and eighth variables and after standardizing all the variables.
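The slides carry out this example with the agnes function from R's cluster package; the sketch below is a rough Python analogue of the same preprocessing and clustering steps, assuming the states data is available as a CSV file. The file name, the chosen linkage and the range of cluster numbers are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Load the data set; the file name is a placeholder for however the states data is stored.
states = pd.read_csv("states.csv", index_col=0)
X = states.to_numpy(dtype=float)

# Take logarithms of the first, third and eighth variables (columns 0, 2 and 7),
# then standardize all the variables, as described above.
X[:, [0, 2, 7]] = np.log(X[:, [0, 2, 7]])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Agglomerative clustering with Euclidean distances (Ward linkage chosen for illustration).
merges = linkage(Z, method="ward", metric="euclidean")

# Cut the dendrogram into K clusters and compute the average silhouette for each K.
for k in range(2, 6):
    labels = fcluster(merges, t=k, criterion="maxclust")
    print(k, round(silhouette_score(Z, labels, metric="euclidean"), 3))
```

In line with the interpretation above, larger average silhouette values point to a better-supported number of clusters.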