
Multivariate Statistics - Chapter 6: Cluster Analysis

Pedro Galeano, Departamento de Estadística, Universidad Carlos III de Madrid, [email protected]

Course 2017/2018

Master in Mathematical Engineering

1 Introduction

2 The clustering problem

3 Hierarchical clustering

4 Partition clustering

5 Model-based clustering

Introduction

The purpose of Cluster Analysis is to group the objects in a multivariate data set into different homogeneous groups.

This is done by grouping individuals that are somehow similar according to some appropriate criterion.

Once the clusters are obtained, it is generally useful to describe each group using some descriptive tools to create a better understanding of the differences that exist among the groups.

Cluster methods are also known as unsupervised classification methods.

These are different from the supervised classification methods, or Classification Analysis, that will be presented in Chapter 7.

Introduction

Clustering techniques are applicable whenever a data set needs to be grouped into meaningful groups.

In some situations we know that the data naturally fall into a certain number of groups, but usually the number of clusters is unknown.

Some clustering methods require the user to specify the number of clusters a priori.

Thus, unless additional information exists about the number of clusters, it is reasonable to explore different values and look at potential interpretations of the clustering results.

Introduction

Central to some clustering approaches is the notion of proximity of two random vectors.

We usually measure the degree of proximity of two multivariate observations by a distance measure.

The Euclidean distance is typically the first and also the most common distance one applies in Cluster Analysis.

Other distances such as those presented in Chapter 5 can be considered.

Introduction

Some cluster procedures are based on using mixtures of distributions.

The underlying assumptions of these mixtures, i.e., that the data in the different groups come from certain distributions, are not easy to verify and may not hold.

However, these methods have been shown to be powerful under general circumstances.

Introduction

Cluster Analysis can be seen as an exploratory tool.

Different cluster solutions will appear if one considers different numbers of clusters, distance measures or mixture distributions.

These solutions might provide new understanding of the structure of the data set.

Therefore, if possible, the interpretation of cluster solutions should involve subject experts.

Introduction

There is a vast number of cluster procedures.

Here, we will focus on:

- Hierarchical clustering: starts with single clusters (individual observations) and merges clusters, or starts with a single cluster (the whole data set) and splits clusters.

- Partition clustering: starts from a given group definition and proceeds by exchanging elements between groups until a certain criterion is optimized.

- Model-based clustering: the random vectors are modeled by mixtures of distributions, leading to posterior probabilities of the observation memberships.

Before presenting these methods, we define the problem.

The clustering problem

Given a data matrix X of dimension n × p, we want to obtain a partition of the data set, C1,..., CK , where Ck , for k = 1,..., K, are sets containing the indices of the observations in each cluster.

Therefore, i ∈ Ck means that the observation xi· belongs to cluster k.

Any partition C1,..., CK verifies the following two properties:

- Each observation belongs to at least one of the K clusters, i.e., C1 ∪ · · · ∪ CK = {1,..., n}.

- No observation belongs to more than one cluster, i.e., Ck ∩ Ck′ = ∅, for k ≠ k′.

The problem is to find an appropriate partition, C1,..., CK , for our data set.

The clustering problem

The key interpretative point of hierarchical and partition methods is that elements within a Ck are much more similar to each other than to any element from a different Ck′. This interpretation does not necessarily hold in model-based clustering, where similar observations can belong to different clusters.

Hierarchical clustering

There are two types of hierarchical clustering methods:

1 In agglomerative clustering, one starts with n single clusters and merges them into larger clusters.

2 In divisive clustering, one starts with a single cluster and divides it into smaller clusters.

Most attention has been paid to agglomerative methods.

However, arguments have been made that divisive methods can provide more sophisticated and robust clusterings.

Hierarchical clustering

The end result of all hierarchical clustering methods is a graphical output called a dendrogram, where the k-th cluster solution is obtained by merging some of the clusters from the (k + 1)-th cluster solution.

The result of hierarchical algorithms depends on the distance considered.

In particular, when the variables are measured in different units and the distance used does not take this fact into account, it is better to standardize the variables.

Hierarchical clustering

The algorithm for agglomerative hierarchical clustering (agglomerative nesting or agnes) is given next:

1 Initially, each observation xi·, for i = 1,..., n, is a cluster.

2 Compute D = {dii′ : i, i′ = 1,..., n}, the matrix that contains the distances between the n observations (clusters).

3 Find the smallest distance in D, say dII′, and merge clusters I and I′ to form a new cluster II′.

4 Compute the distances, dII′,I′′, between the new cluster II′ and all other clusters I′′ ≠ II′ (detailed in the next slide).

5 Form a new distance matrix, D, by deleting the rows and columns of I and I′ and adding a new row and column for II′ with the distances computed in step 4.

6 Repeat steps 3, 4 and 5 a total of n − 1 times until all observations are merged together into a single cluster.
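As an illustration, the following is a minimal R sketch of these steps using the agnes function of the cluster package on a toy data matrix; the data and the complete-linkage choice are arbitrary here.

```r
# Minimal sketch of agglomerative nesting (agnes) on a toy data set.
library(cluster)

set.seed(1)
X_toy <- matrix(rnorm(50 * 4), nrow = 50)    # 50 observations, 4 variables (step 1)

D_toy <- dist(X_toy, method = "euclidean")   # step 2: pairwise distance matrix
ag    <- agnes(D_toy, diss = TRUE,           # steps 3-6: iterative merging
               method = "complete")

plot(ag, which.plots = 2)                    # dendrogram of the n - 1 merges
```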

Hierarchical clustering

Computation of the distances dII′,I′′ between the new cluster II′ and all other clusters I′′ ≠ II′ can be done using one of the following linkage methods:

- Single linkage: dII′,I′′ = min{dI,I′′, dI′,I′′}.

- Complete linkage: dII′,I′′ = max{dI,I′′, dI′,I′′}.

- Average linkage: dII′,I′′ = ( Σ_{i∈II′} Σ_{i′′∈I′′} d_{i,i′′} ) / (n_{II′} n_{I′′}), where n_{II′} and n_{I′′} are the number of items in clusters II′ and I′′, respectively.

- Ward linkage: dII′,I′′ is the squared Euclidean distance between the mean vectors (centroids) of the elements in the two clusters.
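As a small numerical check (with arbitrary toy data), the first three linkage distances can be computed directly from the matrix of pairwise distances:

```r
# Linkage distances between a merged cluster II' = {1, 2} and a cluster I'' = {3, 4, 5},
# computed from the pairwise distance matrix of an arbitrary toy data set.
set.seed(1)
Dm <- as.matrix(dist(matrix(rnorm(5 * 2), nrow = 5)))

II  <- c(1, 2)       # merged cluster II'
Ipp <- c(3, 4, 5)    # cluster I''

min(Dm[II, Ipp])     # single linkage
max(Dm[II, Ipp])     # complete linkage
mean(Dm[II, Ipp])    # average linkage: sum of cross-distances / (n_II' * n_I'')
```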

Hierarchical clustering

The dendrogram is a graphical representation of the cluster solutions.

In particular, the dendrogram shows the distances at which clusters are combined to form new clusters.

Similar clusters are combined at low distances, whereas dissimilar clusters are combined at high distances.

Consequently, the difference in distances defines how close clusters are to each other.

Hierarchical clustering

To obtain a partition of the data into a specified number of groups, we can cut the dendrogram at an appropriate distance.

The number of vertical lines, K, cut by a horizontal line drawn on the dendrogram at a given distance identifies a K-cluster solution.

The items located at the end of all branches below the horizontal line constitute the members of the cluster.
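For instance, assuming ag is an agnes object like the one fitted above, a K-cluster solution can be extracted in R with cutree; the value of K and the cutting height below are arbitrary.

```r
# Cut the dendrogram to obtain a K-cluster solution (here K = 3).
hc      <- as.hclust(ag)       # convert the agnes object to an hclust object
labels3 <- cutree(hc, k = 3)   # cluster membership of each observation
table(labels3)                 # cluster sizes

# Equivalently, cut at a given height (distance) instead of a given K:
labels_h <- cutree(hc, h = 2)
```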

Hierarchical clustering

To know whether or not the cluster solution is appropriate, we can use the Silhouette.

Let:

- a(xi·) be the average distance of xi· to all other points in its cluster.

- b(xi·) be the lowest average distance of xi· to the points of any other cluster of which xi· is not a member.

- s(xi·) be the silhouette of xi·:

\[
s(x_{i\cdot}) = \frac{b(x_{i\cdot}) - a(x_{i\cdot})}{\max\{a(x_{i\cdot}), b(x_{i\cdot})\}}
\]

The silhouette s(xi·) ranges from −1 to 1, such that a positive value means that the object is well matched to its own cluster and a negative value means that the object is badly matched to its own cluster.

The average silhouette gives a global measure of the assignment, such that the more positive, the better the configuration.
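In R, the silhouette of a given partition can be obtained with the silhouette function of the cluster package; the sketch below assumes the cluster labels labels3 and the distance object D_toy from the previous sketches.

```r
# Silhouette of a clustering solution (cluster package).
library(cluster)

sil <- silhouette(labels3, dist = D_toy)   # s(x_i) for every observation
summary(sil)$avg.width                     # average silhouette width
plot(sil)                                  # silhouette plot by cluster
```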

Illustrative example (I)

We are going to apply the agnes algorithm to the states data set.

For that, we make use of the Euclidean distance, after taking logarithms of the first, third and eighth variables and after standardizing all the variables.

The next slides show the dendrograms for the solutions with the four linkage methods (single, complete, average and Ward), together with scatterplot matrices, plots of the first two PCs and the silhouettes.

To compare solutions, we focus on K = 3, although different linkage methods may provide different suggestions on the number of clusters.

For K = 3, the silhouette suggests considering the solution given by the complete linkage.
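A sketch of this analysis in R is given below, under the assumption that the states data set is R's built-in state.x77 matrix (the slides only refer to it as "the states data set"):

```r
# Preprocessing and agnes fits for the states example (assuming the data are state.x77).
library(cluster)

X <- state.x77
X[, c(1, 3, 8)] <- log(X[, c(1, 3, 8)])    # logs of the 1st, 3rd and 8th variables
X.s <- scale(X)                            # standardize all the variables

D <- dist(X.s)                             # Euclidean distances
linkages <- c("single", "complete", "average", "ward")
fits <- lapply(linkages, function(m) agnes(D, diss = TRUE, method = m))

# Average silhouette width of the K = 3 solution for each linkage
sapply(fits, function(f) {
  cl <- cutree(as.hclust(f), k = 3)
  summary(silhouette(cl, dist = D))$avg.width
})
```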

Illustrative example (I)

[Figures: dendrogram of the agnes solution with single linkage on X.s (agglomerative coefficient 0.60); scatterplot matrix of Log-Population, Income, Log-Illiteracy, Life Exp, Murder, HS Grad, Frost and Log-Area with the K = 3 assignments; plot of the first two principal components, which explain 62.5% of the point variability; silhouette plot with clusters of sizes 48, 1 and 1, average silhouette widths 0.20, 0.00 and 0.00, and overall average width 0.20.]

Illustrative example (I)

[Figures: dendrogram of the agnes solution with complete linkage (agglomerative coefficient 0.79); scatterplot matrix with the K = 3 assignments; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 24, 2 and 24, average silhouette widths 0.31, 0.31 and 0.28, and overall average width 0.30.]

Illustrative example (I)

[Figures: dendrogram of the agnes solution with average linkage (agglomerative coefficient 0.74); scatterplot matrix with the K = 3 assignments; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 11, 1 and 38, average silhouette widths 0.55, 0.00 and 0.22, and overall average width 0.29.]

Illustrative example (I)

[Figures: dendrogram of the agnes solution with Ward linkage (agglomerative coefficient 0.90); scatterplot matrix with the K = 3 assignments; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 10, 19 and 21, average silhouette widths 0.55, 0.28 and 0.12, and overall average width 0.27.]

Hierarchical clustering

None of the distance/linkage procedures is uniformly best for all clustering problems.

Single linkage often leads to long, chain-like clusters, joined by singleton observations near each other, a result that does not have much appeal in practice.

Complete linkage tends to produce many small, compact clusters.

Average linkage depends upon the size of the clusters, while single and complete linkage do not.

Ward linkage also tends to produce many small, compact clusters.

Hierarchical clustering

In divisive clustering (divisive analysis or diana), the idea is that at each step, the observations are divided into a “splinter” group (say cluster A) and the “remainder” group (say cluster B).

The splinter group is initiated by extracting the observation that has the largest average distance from all other observations in the data set; that observation is set up as cluster A.

Given the separation of the data into A and B, we next compute, for each observation in cluster B, the following quantities:

1 the average distance between that observation and all other observations in cluster B, and

2 the average distance between that observation and all observations in cluster A.

Hierarchical clustering

Then, we compute the difference between (1) and (2) above for each observation in B.

There are two possibilities:

1 If all the differences are negative, we stop the algorithm.

2 If any of these differences are positive, we take the observation in B with the largest positive difference, move it to A, and repeat the procedure.

This algorithm provides a binary split of the data into two clusters A and B.

This same procedure can then be used to obtain binary splits of each of the clusters A and B separately.
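A minimal R sketch of divisive clustering with the diana function of the cluster package, reusing the standardized states matrix X.s from the previous sketch, might look as follows:

```r
# Divisive clustering (diana) of the standardized states data.
library(cluster)

di  <- diana(dist(X.s), diss = TRUE)   # full divisive hierarchy
plot(di, which.plots = 2)              # dendrogram
cl3 <- cutree(as.hclust(di), k = 3)    # 3-cluster solution
table(cl3)
```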

Illustrative example (I)

We are going to apply the diana algorithm to the states data set.

The next slides show the dendrogram for the solution, together with scatterplot matrices, plots of the first two PCs and the silhouette of the solution for K = 3.

It is not difficult to see that this algorithm points out the presence of special states.

Illustrative example (I)

[Figures: dendrogram of the diana solution on X.s (divisive coefficient 0.79); scatterplot matrix with the K = 3 assignments; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 26, 4 and 20, average silhouette widths 0.29, 0.25 and 0.24, and overall average width 0.27.]

Partition clustering

Partition methods simply split the data observations into a predetermined number K of groups or clusters, where there is no hierarchical relationship between the K-cluster solution and the (K + 1)-cluster solution.

Given K, we seek to partition the data into K clusters so that the observations within each cluster are similar to each other, whereas observations from different clusters are dissimilar.

Ideally, one could obtain all the possible partitions of the data into K clusters and select the "best" partition using some optimizing criterion.

Clearly, for medium or large data sets such a method rapidly becomes infeasible, requiring an incredible amount of computer time and storage.

As a result, all available partition methods are iterative and work on only a few possible partitions.

Partition clustering

The k-means algorithm is the most popular partition method.

As it is extremely efficient, it is often used for large-scale clustering projects.

The algorithm depends on the concept of centroid of a cluster, which is a representative point of the group (not necessarily an observation).

Usually, the centroid is taken as the sample mean vector of the observations in the cluster, although this is not always the choice.

Partition clustering

The algorithm is given next:

1 Let xi·, for i = 1,..., n be the set of observations in the data matrix X .

2 Do one of the following:

1 Form an initial partition of the observations into K clusters and, for cluster k, compute its current centroid, \bar{x}_k.

2 Pre-specify K cluster centroids, \bar{x}_k, for k = 1,..., K.

3 Compute the squared Euclidean distance of each observation to its current cluster centroid and sum all of them:

\[
SSE = \sum_{k=1}^{K} \sum_{c(i)=k} (x_{i\cdot} - \bar{x}_k)'(x_{i\cdot} - \bar{x}_k)
\]

where \bar{x}_k is the k-th cluster centroid and c(i) is the cluster containing xi·.

4 Reassign each observation to its nearest cluster centroid so that SSE is reduced in magnitude. Update the cluster centroids after each reassignment.

5 Repeat steps 3 and 4 until no further reassignment of observations takes place.
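A bare-bones R implementation of this iteration (single random start, no handling of empty clusters) could look as follows; in practice one would call the built-in kmeans function instead.

```r
# Toy implementation of the k-means iteration described above.
simple_kmeans <- function(X, K, max_iter = 100) {
  n  <- nrow(X)
  cl <- sample(rep_len(1:K, n))          # step 2.1: random initial partition
  for (it in 1:max_iter) {
    # current centroids: sample mean vector of the observations in each cluster
    centroids <- t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
    # squared Euclidean distances of every observation to every centroid (step 3)
    d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, centroids[k, ])^2))
    new_cl <- max.col(-d2)               # step 4: reassign to the nearest centroid
    if (all(new_cl == cl)) break         # step 5: stop when no reassignment occurs
    cl <- new_cl
  }
  list(cluster = cl, centers = centroids,
       SSE = sum(d2[cbind(1:n, cl)]))
}
```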

Partition clustering

The solution (a configuration of observations into K clusters) will typically not be unique.

This is because the algorithm will only find a local minimum of the SSE.

It is recommended that the algorithm be run with different initial random assignments of the observations to the K clusters (or with different randomly selected sets of K initial centroids) in order to find the lowest minimum of SSE and, hence, the best clustering solution based upon K clusters.

Illustrative example (I)

We are going to apply the k-means algorithm to the states data set.

As with the hierarchical algorithms, we use standardized variables, as the algorithm uses Euclidean distances.

The next slides show scatterplot matrices, plots of the first two PCs and the silhouette with the optimal solution for K = 3.

We run the algorithm 25 times, i.e., we form 25 initial random assignments of the observations into 3 clusters and run the algorithm.

The value of SSE attained by the algorithm is 203.2068.
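The corresponding call with R's built-in kmeans function, using 25 random starts as described above, is sketched below; the SSE value obtained may depend on the random seed.

```r
# k-means on the standardized states data with 25 random starts.
set.seed(1)
km <- kmeans(X.s, centers = 3, nstart = 25)
km$tot.withinss     # total within-cluster sum of squares (the SSE above)
table(km$cluster)   # cluster sizes
```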

Illustrative example (I)

[Figures: scatterplot matrix with the k-means assignments for K = 3; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 18, 12 and 20, average silhouette widths 0.28, 0.46 and 0.16, and overall average width 0.28.]

Partition clustering

Partition around medoids (pam) is another partition algorithm.

Essentially, pam is a modification of the k-means algorithm.

This algorithm searches for K “representative objects”, or medoids, among the observations in the data set, rather than computing centroids.

The method is therefore expected to be more robust to data anomalies such as outliers.

A disadvantage of the pam algorithm is that, although it runs well on small data sets, it is not efficient enough for clustering large data sets.

Partition clustering

The algorithm is given next:

1 Let xi·, for i = 1,..., n, be the set of observations in the data matrix.

2 Compute D = {dii′ : i, i′ = 1,..., n}, the matrix that contains the distances between the n observations.

3 Choose K observations as the medoids of K initial clusters.

4 Assign every observation to its closest medoid using the matrix D.

5 For each cluster, search for the observation xi′· in the cluster (if any) that, taken as the new medoid, gives the largest reduction in:

\[
SSE_{\mathrm{med}} = \sum_{k=1}^{K} \sum_{c(i)=k} d_{i i'}
\]

and select this observation as the medoid for this cluster (note that SSEmed only considers distances from every observation in the cluster to the medoid).

6 Repeat steps 4 and 5 until no further reduction in SSEmed takes place.
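A minimal R sketch with the pam function of the cluster package, again on the standardized states matrix X.s and with K = 3, is:

```r
# Partitioning around medoids for the standardized states data.
library(cluster)

pm <- pam(dist(X.s), k = 3, diss = TRUE)
pm$medoids              # the three representative observations (medoids)
table(pm$clustering)    # cluster sizes
plot(silhouette(pm))    # silhouette of the pam solution
```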

Illustrative example (I)

We are going to apply the pam algorithm to the states data set.

As with the previous algorithms, we use standardized variables.

The next slides show the same information as in the previous methods.

For that we consider the case of 3 groups, as previously done.

The results do not appear to be very good.

Illustrative example (I)

[Figures: scatterplot matrix with the pam assignments for K = 3; plot of the first two principal components (62.5% of the point variability); silhouette plot with clusters of sizes 15, 22 and 13, average silhouette widths 0.29, 0.13 and 0.27, and overall average width 0.22.]

Model-based clustering

In model-based clustering, it is assumed that the data have been generated by a mixture of K unknown distributions.

Maximum likelihood estimation can be carried out to estimate the parameters of the mixture model.

This is usually undertaken using the Expectation-Maximization (EM) algorithm.

Then, once the model parameters have been estimated, each observation is assigned to the mixture component (cluster) with the largest probability of having generated the observation.

Model-based clustering

Then, we assume that the data set has been generated from a mixture of distributions with pdf given by:

\[
f_x(x|\theta) = \sum_{k=1}^{K} \pi_k \, f_{x,k}(x|\theta_k)
\]

where θ is a vector with all the parameters of the model, including the weights πk and the parameters of the distributions fx,k(·|θk), denoted by θk.

Model-based clustering

Then, for a data matrix X with observations xi· = (xi1,..., xip)′, the likelihood is given by:

\[
L(\theta|X) = \prod_{i=1}^{n} f_x(x_{i\cdot}|\theta) = \prod_{i=1}^{n} \left( \sum_{k=1}^{K} \pi_k \, f_{x,k}(x_{i\cdot}|\theta_k) \right)
\]

while the log-likelihood is given by:

\[
\ell(\theta|X) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, f_{x,k}(x_{i\cdot}|\theta_k) \right)
\]

Model-based clustering

Derivation of closed form expressions of the MLE of the mixture parameters is not possible, even in the case of the multivariate Gaussian distribution.

Moreover, although it is possible to apply a Newton-Raphson type algorithm to solve the equalities provided by the MLE method, the usual approach is to use the EM algorithm to obtain the MLEs (see the references).

Model-based clustering

Then, let π̂1,..., π̂K and θ̂1,..., θ̂K be the MLEs of the weights and of the parameters of the group distributions, respectively, obtained with the EM algorithm.

The estimated posterior probabilities that observation xi· belongs to population k are obtained by applying Bayes' theorem:

\[
\widehat{\Pr}(k|x_{i\cdot}) = \frac{\hat{\pi}_k \, f_{x,k}(x_{i\cdot}|\hat{\theta}_k)}{\sum_{g=1}^{K} \hat{\pi}_g \, f_{x,g}(x_{i\cdot}|\hat{\theta}_g)}
\]

The observations are assigned to the density (cluster) k with the maximum value of this estimated posterior probability.
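As an illustration, the assignment step for a Gaussian mixture can be written as below; pi_hat, mu_hat and Sigma_hat are hypothetical names for the estimated weights, mean vectors and covariance matrices returned by an EM fit, and dmvnorm comes from the mvtnorm package.

```r
# Posterior probabilities and cluster assignment for a fitted Gaussian mixture.
# pi_hat: vector of K estimated weights; mu_hat: list of K mean vectors;
# Sigma_hat: list of K covariance matrices (all assumed to come from an EM fit).
library(mvtnorm)

posterior <- function(X, pi_hat, mu_hat, Sigma_hat) {
  K <- length(pi_hat)
  dens <- sapply(1:K, function(k)
    pi_hat[k] * dmvnorm(X, mean = mu_hat[[k]], sigma = Sigma_hat[[k]]))
  dens / rowSums(dens)               # n x K matrix of posterior probabilities
}

# post    <- posterior(X, pi_hat, mu_hat, Sigma_hat)
# cluster <- max.col(post)           # cluster with the largest posterior probability
```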

Model-based clustering

In model-based clustering, it is possible to select the number of groups, K, from the data set.

The idea is to compare solutions with different values of K = 1, 2,... and to choose the best result.

For that, we can rely on criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).

For instance, the BIC selects the number of clusters that minimizes:

\[
BIC(k) = -2 \, \ell_k(\hat{\theta}|X) + \log(n) \, q
\]

where \ell_k(\hat{\theta}|X) denotes the maximized log-likelihood assuming k groups and q is the number of parameters of the model.
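In R, this comparison can be carried out with the mclust package, as sketched below using the transformed states matrix X from the earlier sketches as an example input; note that mclust reports the BIC with the opposite sign convention (twice the log-likelihood minus the penalty), so it is maximized rather than minimized.

```r
# BIC comparison over different numbers of clusters with the mclust package.
library(mclust)

bic <- mclustBIC(X, G = 1:9)   # Gaussian mixtures with G = 1,...,9 components
summary(bic)                   # best combinations of model and number of clusters
plot(bic)
```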

Model-based clustering

M-clust is a popular method to perform model-based clustering.

M-clust assumes Gaussian densities and selects the optimal model according to BIC.

To reduce the number of parameters to fit, M-clust works with the spectral decomposition of the covariance matrices of the Gaussian densities, Σk, for k = 1,..., K, given by:

\[
\Sigma_k = \lambda_{1,k} \, V_k \, \tilde{\Lambda}_k \, V_k'
\]

where λ1,k is the largest eigenvalue of Σk, Vk is the matrix that contains the eigenvectors of Σk, and Λ̃k is the diagonal matrix of the eigenvalues of Σk divided by λ1,k.

Model-based clustering

The decomposition allows for different configurations:

1 spherical and equal volume,

2 spherical and unequal volume,

3 diagonal and equal volume and shape,

4 diagonal, varying volume and equal shape,

5 diagonal, equal volume and varying shape,

6 diagonal, varying volume and shape,

7 ellipsoidal, equal volume, shape, and orientation,

8 ellipsoidal, equal volume and equal shape,

9 ellipsoidal and equal shape, and

10 ellipsoidal, varying volume, shape, and orientation.

Here, (i) spherical, diagonal and ellipsoidal refer to the covariance matrices; (ii) equal volume means that λ1,1 = ··· = λ1,K; (iii) equal shape means that Λ̃1 = ··· = Λ̃K; and (iv) equal orientation means that V1 = ··· = VK.

Illustrative example (I)

For the states data set, Mclust selects an ellipsoidal, equal shape and orientation (VEE) model with 3 components.

After estimating the model using the EM algorithm, the procedure computes the posterior probabilities for each state and population.

The results are shown in the next two slides.

The first one shows the scatterplot matrix with the assignments made by the algorithm.

The second one shows the first two principal components with the assignments made by the algorithm.

Note how close observations can be in different clusters.
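A sketch of this fit with the Mclust function is given below, again assuming the states data are stored in the transformed (but not standardized) matrix X from the earlier sketches; the selected model and the assignments can be checked from the returned object.

```r
# Model-based clustering of the states data with Mclust.
library(mclust)

mc <- Mclust(X, G = 1:9)     # EM fits over G = 1,...,9 and all covariance structures
summary(mc)                  # selected model (reported in the slides as VEE with 3 components)
head(round(mc$z, 2))         # estimated posterior probabilities per observation
table(mc$classification)     # final cluster assignments
```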

Illustrative example (I)

[Figures: scatterplot matrix of the transformed (unstandardized) variables with the M-clust assignments; plot of the first two principal components, which explain 62.5% of the point variability, with the M-clust assignments.]

Model-based clustering

There are other alternative procedures for model-based clustering.

For instance, very appealing methodologies for estimating mixtures have been given from the Bayesian point of view.

These procedures include the number of groups as an additional parameter, and posterior probabilities are also provided for this number.

Procedures based on the use of projections (projection pursuit methods) are also very popular.

The idea is to project the data into different directions that separate the groups as much as possible and look for clusters in the projected data.

Chapter outline

1 Introduction

2 The clustering problem

3 Hierarchical clustering

4 Partition clustering

5 Model-based clustering

We are ready now for:

Chapter 7: Classification analysis
