
Cluster Analysis

Can we organize entities into discrete classes, such that within-group similarity is maximized and among-group similarity is minimized, according to some objective criterion?

Example sites-by-species data:

            Species
    Site     A    B    C    D
    1        1    9   12    1
    2        1    8   11    1
    3        1    6   10   10
    4       10    0    9   10
    5       10    2    8   10
    6       10    0    7    2
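For reference, a minimal R sketch that reproduces this example matrix and computes the Euclidean distances among the six sites (the same distances used in the fusion example later in this section):

    # example sites-by-species matrix from above
    dat <- matrix(c( 1, 9, 12,  1,
                     1, 8, 11,  1,
                     1, 6, 10, 10,
                    10, 0,  9, 10,
                    10, 2,  8, 10,
                    10, 0,  7,  2),
                  nrow = 6, byrow = TRUE,
                  dimnames = list(paste0("site", 1:6), c("A", "B", "C", "D")))
    round(dist(dat), 1)   # e.g., d(1,2) = 1.4 and d(4,5) = 2.2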


Important Characteristics of Cluster Analysis Techniques

• Family of techniques with similar goals.
• Operate on data sets for which pre-specified, well-defined groups do not exist; characteristics of the data are used to assign entities into artificial groups.
• Summarize data redundancy by reducing the information on the whole set of, say, N entities to information about, say, g groups of nearly similar entities (where hopefully g is very much smaller than N).


• Identify outliers by leaving them solitary or in small clusters, which may then be omitted from further analyses.
• Eliminate noise from a multivariate data set by clustering nearly similar entities without requiring exact similarity.
• Assess relationships within a single set of variables; no attempt is made to define the relationship between a set of independent variables and one or more dependent variables.


What’s a Cluster?

[Figure: six point configurations (panels A-F) illustrating different possible notions of a 'cluster'.]

Cluster Analysis: The Data Set

• Single set of variables; no distinction between independent and dependent variables.
• Continuous, categorical, or count variables; usually all on the same scale.
• Every sample entity must be measured on the same set of variables.
• There can be fewer samples (rows) than variables (columns) [i.e., the data matrix does not have to be of full rank].

                  Variables
    Sample   x1    x2    x3   ...   xp
    1        x11   x12   x13  ...   x1p
    2        x21   x22   x23  ...   x2p
    3        x31   x32   x33  ...   x3p
    ...
    n        xn1   xn2   xn3  ...   xnp


Cluster Analysis: The Data Set

• Common two-way ecological data matrices:
  - Sites-by-environmental parameters
  - Species-by-niche parameters
  - Species-by-behavioral characteristics
  - Samples-by-species
  - Specimens-by-characteristics

Cluster Analysis: The Data Set (example)

    1   AMRO   15.31  31.42  64.28  20.71  47.14  0.00  0.28  0.14  ...  1.45
    2   BHGR    5.76  24.77  73.18  22.95  61.59  0.00  0.00  1.09  ...  1.28
    3   BRCR    4.78  64.13  30.85  12.03  63.60  0.44  0.44  2.08  ...  1.18
    4   CBCH    3.08  58.52  39.69  15.47  62.19  0.31  0.28  1.52  ...  1.21
    5   DEJU   13.90  60.78  36.50  13.81  62.89  0.23  0.31  1.23  ...  1.23
    ...
    19  WIWR    8.05  41.09  55.00  18.62  53.77  0.09  0.18  0.81  ...  1.36


Cluster Techniques

• Exclusive (each entity in one cluster only) vs. nonexclusive (each entity in one or more clusters).
• Sequential (recursive sequence of operations) vs. simultaneous (single nonrecursive operation).
• Hierarchical (arrange clusters in a hierarchy; relationships among clusters defined) vs. nonhierarchical (achieve maximum within-cluster homogeneity).
• Agglomerative (build groups) vs. divisive (break into groups).
• Polythetic (consider all variables) vs. monothetic (consider one variable).

Nonhierarchical Clustering

• NHC techniques merely assign each entity to a cluster, placing similar entities together.
• NHC is, of all cluster techniques, conceptually the simplest. Maximizing within-cluster homogeneity is the basic property to be achieved in all NHC techniques.
• Within-cluster homogeneity makes possible inference about an entity's properties based on its cluster membership. This one property makes NHC useful for mitigating noise, summarizing redundancy, and identifying outliers.



• NHC's primary purpose is to summarize redundant entities into fewer groups for subsequent analysis (e.g., subsequent analyses to elucidate relationships among the "groups").
• Several different algorithms are available that differ in various details. In all cases, the single criterion achieved is within-cluster homogeneity, and the results are, in general, similar.


Nonhierarchical Clustering: K-Means Clustering (KMEANS)

• Specify a number of random seeds (kernels), or provide the seeds.
• Assign samples to the 'nearest' seed.
• Iteratively reassign samples to groups in order to minimize within-group variability (i.e., each sample is assigned to the group with the 'closest' centroid).
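A minimal R sketch of this procedure using the built-in kmeans() function; the data matrix `dat` and the choice of three groups are assumptions for illustration:

    set.seed(1)                                          # reproducible seeds
    km <- kmeans(scale(dat), centers = 3, nstart = 25)   # 25 random starts
    km$cluster     # group membership for each sample
    km$withinss    # within-group variability (sums of squares) per cluster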


Nonhierarchical Clustering: Composite Clustering (COMPCLUS)

• Select a seed at random.
• Assign samples to the seed if within a specified distance (radius) of the seed.
• Pick a second seed and repeat the process until all samples are classified.
• Groups smaller than a specified number are dissolved and their samples reassigned to the closest centroid, provided it is within a specified maximum distance.
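COMPCLUS itself is a standalone program, but the two phases described above are easy to sketch in R. The function below is an illustrative reimplementation only: `radius` and `min.size` are assumed tuning parameters, the final maximum-distance check is omitted for brevity, and at least one group is assumed to meet `min.size`.

    compclus_sketch <- function(x, radius, min.size = 2) {
      d   <- as.matrix(dist(x))            # distances among samples
      grp <- rep(NA_integer_, nrow(x))
      g   <- 0
      while (any(is.na(grp))) {            # phase 1: random seeds claim all
        g    <- g + 1                      # unclassified samples within radius
        left <- which(is.na(grp))
        seed <- left[sample.int(length(left), 1)]
        grp[is.na(grp) & d[seed, ] <= radius] <- g
      }
      keep <- names(which(table(grp) >= min.size))
      cent <- sapply(keep, function(k)     # centroids of the retained groups
        colMeans(x[grp == k, , drop = FALSE]))
      for (i in which(!grp %in% as.integer(keep))) {
        dc     <- sqrt(colSums((cent - x[i, ])^2))   # phase 2: dissolve small
        grp[i] <- as.integer(keep[which.min(dc)])    # groups, reassign samples
      }
      grp
    }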

Nonhierarchical Clustering: Minimum Variance Partitioning

• Compute standardized distances between each sample and the overall centroid.
• Select the sample with the largest distance as a new cluster centroid.
• Assign samples to the nearest cluster centroid.
• Select the sample with the largest distance from its cluster centroid to initiate a new cluster; reassign samples to the nearest centroid.
• Continue until the desired number of clusters is created.
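An illustrative R sketch of these steps under stated assumptions (the function name is hypothetical, and seed points are kept as fixed centroids once chosen):

    minvar_partition <- function(x, k) {
      x    <- scale(x)                      # standardized data
      cent <- rbind(colMeans(x))            # first centroid = overall centroid
      grp  <- rep(1L, nrow(x))
      while (nrow(cent) < k) {
        dcent <- sqrt(rowSums((x - cent[grp, , drop = FALSE])^2))
        cent  <- rbind(cent, x[which.max(dcent), ])  # furthest sample seeds
        d2c   <- apply(cent, 1, function(ct)         # a new cluster; then all
          sqrt(rowSums(sweep(x, 2, ct)^2)))          # samples are assigned to
        grp   <- max.col(-d2c)                       # the nearest centroid
      }
      grp
    }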


Nonhierarchical Clustering: Maximum Likelihood Clustering

• Model-based method.
• Assume the samples consist of c subpopulations, each corresponding to a cluster, and that the density function of a q-dimensional observation from the jth subpopulation is f_j(x, θ_j) for some unknown vector of parameters θ_j.
• Assume that γ = (γ_1, ..., γ_n) gives the labels of the subpopulation to which each sample belongs.
• Choose θ = (θ_1, ..., θ_c) and γ to maximize the likelihood:

    L(\theta, \gamma) = \prod_{i=1}^{n} f_{\gamma_i}(x_i, \theta_{\gamma_i})

• If f_j(x, θ_j) is taken as the multivariate normal density with mean vector μ_j and covariance matrix Σ_j, a maximum likelihood solution can be found based on varying assumptions about the covariance matrices.

Nonhierarchical Clustering: Maximum Likelihood Clustering

• Normal mixture modeling (package mclust; Fraley et al. 2012):

    L(\theta, \gamma) = \prod_{i=1}^{n} f_{\gamma_i}(x_i, \theta_{\gamma_i})
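A minimal sketch with the mclust package, assuming `dat` is the samples-by-variables matrix:

    library(mclust)               # normal mixture models (Fraley et al.)
    mc <- Mclust(dat, G = 1:5)    # fit mixtures with 1-5 clusters
    summary(mc)                   # best model/number of clusters by BIC
    mc$classification             # ML cluster labels (the estimated gamma)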



Nonhierarchical Clustering: Limitations

• NHC procedures involve various assumptions about the form of the underlying population from which the sample is drawn. These assumptions often include the typical parametric multivariate assumptions, e.g., equal covariance matrices among clusters.
• Most NHC techniques are strongly biased towards finding elliptical and spherical clusters.



• NHC is not effective for elucidating relationships, because no structure within clusters is revealed and no definition of relationships among clusters is derived.
• Regardless of the NHC procedure used, it is best to have a reasonable guess as to how many groups to expect in the data.

Nonhierarchical Clustering: Choosing the 'Right' Number of Clusters

• Scree plot of cluster properties against the number of clusters:
  - Sum of within-cluster dissimilarities to the cluster medoids.
  - Average sample silhouette width (s_i).

baii si  max(baii , )

ai = ave dist to all others in ith cluster bi = min dist to neighboring cluster
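A sketch of the silhouette-based scree using pam() from the cluster package (assuming `dat`; the candidate range k = 2..5 is arbitrary):

    library(cluster)
    d <- dist(scale(dat))
    avg.sil <- sapply(2:5, function(k) pam(d, k = k)$silinfo$avg.width)
    plot(2:5, avg.sil, type = "b",
         xlab = "Number of clusters (k)", ylab = "Average silhouette width")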


Nonhierarchical Clustering: Choosing the 'Right' Number of Clusters

• Silhouette width (s_i):

    s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

    a_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{ij}   (average distance from sample i to the other members of its own cluster)

    b_i = \min_k \frac{1}{n_k} \sum_{j=1}^{n_k} d_{ij}   (minimum, over the other clusters k, of the average distance from sample i to that cluster's members)

Interpretation:
  - s_i → 1: very well clustered
  - s_i → 0: in between clusters
  - s_i < 0: likely placed in the wrong cluster

Nonhierarchical Clustering: Testing the 'Significance' of the Clusters

• Are the groups significantly different? (How valid are the groups?)
  - Multivariate Analysis of Variance (MANOVA)
  - Multi-Response Permutation Procedures (MRPP)
  - Analysis of Group Similarities (ANOSIM)
  - Mantel's Test (MANTEL)

We will cover these procedures in the next section of the course.


Nonhierarchical Clustering: Evaluating the Clusters

[Figures: cluster plot and silhouette plot for an example solution.]
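The two plots can be produced with the cluster package (a sketch, assuming `dat` and a two-group solution):

    library(cluster)
    pm <- pam(scale(dat), k = 2)             # partitioning around medoids
    clusplot(pm, main = "Cluster plot")      # entities in 2-D ordination space
    plot(silhouette(pm), main = "Silhouette plot")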

Hierarchical Clustering

• HC combines similar entities into classes or groups and arranges these groups into a hierarchy.
• HC reveals relationships expressed among the entities classified.

Limitations:
• For large data sets, hierarchies are problematic, because a hierarchy with more than about 50 entities is difficult to display or interpret.
• HC techniques have a general disadvantage in that they contain no provision for reallocation of entities that may have been poorly classified at an early stage in the analysis.


Complementary Use of NHC and HC


• HC is ideal for small data sets and NHC for large data sets.
• HC helps reveal relationships in the data, while NHC does not.
• NHC can be used initially to summarize a large data set by producing far fewer composite samples, which then makes HC feasible and effective for depicting relationships.

Polythetic Agglomerative Hierarchical Clustering

• PAHC techniques use the information on all of the variables (i.e., polythetic).
• Each entity is initially assigned to an individual cluster; PAHC agglomerates these into a hierarchy of larger and larger clusters until finally a single cluster contains all entities.
• There are numerous different resemblance measures and fusion algorithms; consequently, there exists a profusion of PAHC techniques.


Polythetic Agglomerative Hierarchical Clustering

Assumptions:
• Basically none! Hence, the purpose of PAHC is generally purely descriptive.
• However, some fusion strategies implicitly "assume" spherically shaped clusters.
• Certain resemblance measures (e.g., Euclidean distance) assume that the variables are uncorrelated within clusters.

Sample size requirements:
• Basically none!


Polythetic Agglomerative Hierarchical Clustering: Two-Stage Process

1. Resemblance matrix
• The first step is to compute a dissimilarity (or similarity) matrix from the original data matrix.

2. Fusion strategy
• The second step is to agglomerate entities successively to build up a hierarchy of increasingly large clusters.
• The choice of a particular fusion strategy will depend almost entirely on the objectives of the investigator.
• All fusion strategies cluster the two most similar (or least dissimilar) entities first; strategies differ with respect to how they fuse subsequent entities (or clusters). A minimal R sketch of the two stages follows.
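Both stages in R, assuming the 6-site example matrix `dat` from the start of this section:

    d  <- dist(dat, method = "euclidean")   # stage 1: resemblance matrix
    hc <- hclust(d, method = "single")      # stage 2: nearest-neighbor fusion
    plot(hc, ylab = "Euclidean distance")   # resulting hierarchy (dendrogram)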


Polythetic Agglomerative Hierarchical Clustering: The Fusion Process (Nearest Neighbor, Euclidean Distance)

Step 1:
    Sites    1     2     3     4     5     6
    1        0
    2        1.4   0
    3        9.7   9.3   0
    4       15.9  15.2  10.9   0
    5       15.1  14.4  10.0   2.2   0
    6       13.7  12.7  13.8   8.2   8.3   0
Combine sites 1 and 2 (smallest distance, 1.4).

Step 2:
    Sites   1-2    3     4     5     6
    1-2      0
    3        9.3   0
    4       15.2  10.9   0
    5       14.4  10.0   2.2   0
    6       12.7  13.8   8.2   8.3   0
Combine sites 4 and 5 (smallest distance, 2.2).


Step 3:
    Sites   1-2    3    4-5    6
    1-2      0
    3        9.3   0
    4-5     14.4  10.0   0
    6       12.7  13.8   8.2   0
Combine site 6 and cluster 4-5 (smallest distance, 8.2).

Step 4:
    Sites   1-2    3   4-5-6
    1-2      0
    3        9.3   0
    4-5-6   12.7  10.0   0
Combine site 3 and cluster 1-2 (smallest distance, 9.3).

Step 5:
    Sites   1-2-3  4-5-6
    1-2-3    0
    4-5-6   10.0   0
Combine cluster 1-2-3 and cluster 4-5-6 (distance 10.0).


Polythetic Agglomerative Hierarchical Clustering: Agglomeration Table

• Shows the agglomeration sequence and the corresponding dissimilarity values at which entities and clusters combine to form new clusters.
• Note that the dissimilarity values will vary depending on the fusion strategy and resemblance measure used.

    Number of Clusters   Fusion                            Minimum Distance
    5                    Sites 1 and 2                      1.4
    4                    Sites 4 and 5                      2.2
    3                    Site 6 and Cluster 4-5             8.2
    2                    Site 3 and Cluster 1-2             9.3
    1                    Cluster 1-2-3 and Cluster 4-5-6   10.0
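The same schedule can be read from an hclust fit in R ($merge gives the fusion sequence, $height the fusion distances):

    hc <- hclust(dist(dat), method = "single")
    cbind(hc$merge, height = round(hc$height, 1))
    # heights 1.4, 2.2, 8.2, 9.3, 10.0 reproduce the table above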

Polythetic Agglomerative Hierarchical Clustering: Dendrogram

• A dendrogram is a tree-like plot depicting the agglomeration sequence, in which entities are enumerated (identified) along one axis and the dissimilarity level at which each fusion of clusters occurs is shown on the other axis.

[Figure: dendrogram of the 6-site example, with fusions at the distances listed in the agglomeration table above.]


Polythetic Agglomerative Hierarchical Clustering: Cluster Membership Table & Icicle Plot

• Cluster membership table: identifies which cluster each entity belongs to for any specified number of clusters.
• Icicle plot: a graphical plot depicting cluster membership in relation to the number of clusters, in which entities are enumerated (identified) along one axis and the number of clusters (cluster level) along the other axis. Membership at any level can be extracted as shown below.
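In R, cutree() produces the membership table for any chosen cluster level (continuing the hclust fit `hc` from above):

    cutree(hc, k = 2)     # membership at the 2-cluster level
    cutree(hc, k = 2:5)   # membership table across several cluster levels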

Polythetic Agglomerative Hierarchical Clustering: Fusion Properties

• Space conserving: the fusion process preserves the properties of the original inter-entity resemblance space.
• Space distorting: the fusion process distorts the properties of the original inter-entity resemblance space.

[Figure: distance between clusters under (A) space-conserving, (B) space-contracting, and (C) space-dilating fusion strategies.]



• Space-contracting: groups appear, on formation, to move nearer to some or all of the remaining entities; the chance that an individual entity will add to a pre-existing cluster, rather than act as the nucleus of a new group, is increased, and the system is said to "chain."
• Space-dilating: groups appear to recede on formation and growth; individual entities not yet in groups are more likely to form the nuclei of new groups.

Polythetic Agglomerative Hierarchical Clustering: Fusion Strategies

Single-linkage (nearest neighbor):
• An entity's dissimilarity to a cluster is defined to be equal to its dissimilarity to the closest entity in that cluster; when two clusters agglomerate, their dissimilarity is equal to the smallest dissimilarity for any pair of entities with one in each cluster.
• Space-contracting strategy: tends to produce straggly clusters, which quickly agglomerate very dissimilar samples.


Complete-linkage (furthest neighbor):
• An entity's dissimilarity to a cluster is defined to be equal to its dissimilarity to the furthest entity in that cluster; when two clusters agglomerate, their dissimilarity is equal to the greatest dissimilarity for any pair of entities with one in each cluster.
• Space-dilating strategy: produces clusters of very similar samples which agglomerate slowly. As clusters agglomerate, groups are moved away from each other.

Centroid-linkage (unweighted pair-group centroid):
• Valid only with Euclidean (metric) data. An entity's dissimilarity to a cluster is equal to its dissimilarity to the cluster centroid; when two clusters fuse, their dissimilarity is equal to the dissimilarity between the two cluster centroids.
• Space-conserving strategy, but 'reversals' can occur, in which a fusion takes place at a lower dissimilarity than a prior fusion; group-size distortions occur because the centroid of two fused clusters is weighted toward the larger group.


Median-linkage (weighted pair-group centroid):
• Similar to centroid-linkage, except that the "centroids" of newly fused groups are positioned at the median between the old group centroids.
• Space-conserving strategy; 'reversals' still can occur, although there is no group-size dependency.

Average-linkage (unweighted pair-group average):
• An entity's dissimilarity to a cluster is defined to be equal to the average of the distances between the entity and each point in the cluster; when two clusters fuse, their dissimilarity is equal to the average of the distances between each entity in one cluster and each entity in the other cluster.
• Space-conserving strategy; maximizes the cophenetic correlation, produces no reversals, and eliminates group-size dependency.


Ward's minimum-variance linkage:
• Agglomerates clusters provided that the increase in within-group dispersion (variance) is less than it would be if either of the two clusters were joined with any other cluster.
• Space-conserving strategy similar to average-linkage fusion, except that instead of minimizing an average distance, it minimizes a squared distance weighted by cluster size.

Fusion Strategies

[Figure slides: dendrograms comparing the fusion strategies on the same data.]
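A quick way to generate such a comparison in R (assuming `dat`; note that hclust's 'centroid' and 'median' methods expect squared Euclidean distances, so they are omitted here):

    d  <- dist(dat)
    op <- par(mfrow = c(2, 2))
    for (m in c("single", "complete", "average", "ward.D2"))
      plot(hclust(d, method = m), main = m, xlab = "", sub = "")
    par(op)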

Polythetic Agglomerative Hierarchical Clustering: Deciding on the Number of Significant Clusters

• Should each successive linkage in an agglomerative hierarchical clustering be accepted?
  - Null hypothesis: the two entities or clusters that are linked are sufficiently alike that they can be considered to represent a single cluster (i.e., accept the linkage).
  - Alternative hypothesis: the two entities or clusters are distinctly different and should be retained as separate entities or clusters (i.e., reject the linkage).
• Once a linkage has been rejected, the linkages (at greater levels of dissimilarity) that depend on the rejected linkage are no longer defined and therefore do not need to be considered.

Polythetic Agglomerative Hierarchical Clustering: Monte Carlo Test to Determine the Number of Clusters

[Figure: the observed dendrogram with fusion distance b_obs, alongside dendrograms built from permuted data, whose fusion distances b_1, b_2, ..., b_n form the permuted distance distribution under Ho.]

If b_obs > 95% of the permuted distribution, then reject Ho.
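A heavily simplified illustration of the permutation idea, not the exact procedure on the slide: columns of the data are permuted independently to destroy group structure, and the observed final fusion distance is compared against the permuted distribution. The function name and the choice of nearest-neighbor linkage are assumptions.

    perm_fusion_test <- function(x, nperm = 999) {
      top_height <- function(m) max(hclust(dist(m), "single")$height)
      b.obs  <- top_height(x)                # observed final fusion distance
      b.perm <- replicate(nperm,             # null distribution: permute
        top_height(apply(x, 2, sample)))     # each column independently
      mean(c(b.perm, b.obs) >= b.obs)        # permutation p-value
    }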


Polythetic Agglomerative Hierarchical Clustering: Examining the Dendrogram

[Figure: example dendrogram.]

Polythetic Agglomerative Hierarchical Clustering: Testing the 'Significance' of the Clusters

• Are the groups significantly different? (How valid are the groups?) As with NHC, this can be addressed with MANOVA, MRPP, ANOSIM, or Mantel's test.

We will cover these procedures in the next section of the course.


Polythetic Agglomerative Hierarchical Clustering: Evaluating the Cluster Solution (Agglomerative Coefficient)

• The agglomerative coefficient (cluster library) is a measure of the clustering structure of the data set.
• For each observation i, denote by m(i) its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm. The AC is the average of all 1 - m(i).

[Figure: example dendrogram with agglomerative coefficient = 0.57.]
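In R, agnes() in the cluster package reports this directly (a sketch assuming `dat`):

    library(cluster)
    ag <- agnes(dat, method = "average")
    ag$ac                        # agglomerative coefficient (closer to 1 =
    plot(ag, which.plots = 2)    # clearer structure); dendrogram of the fit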

Polythetic Agglomerative Hierarchical Clustering: Evaluating the Cluster Solution (Cophenetic Correlation)

• Correlation between the input dissimilarities (in the dissimilarity matrix) and the output dissimilarities implied by the resulting dendrogram (using the lowest level required to join any given entity pair in the dendrogram).
• Measures how well the final number of clusters (usually depicted by the dendrogram) portrays the original data structure.
• Values > 0.75 are considered good.
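In R (continuing the example; the source reports r_c = 0.89 for the 6-site data with Euclidean distance and nearest-neighbor linkage):

    d  <- dist(dat)
    hc <- hclust(d, method = "single")
    cor(d, cophenetic(hc))   # cophenetic correlation r_c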



r_c = 0.89 (based on Euclidean distance and nearest-neighbor linkage):

Input dissimilarities:
    Sites    1     2     3     4     5     6
    1        0
    2        1.4   0
    3        9.7   9.3   0
    4       15.9  15.2  10.9   0
    5       15.1  14.4  10.0   2.2   0
    6       13.7  12.7  13.8   8.2   8.3   0

Output (cophenetic) dissimilarities:
    Sites    1     2     3     4     5     6
    1        0
    2        1.4   0
    3        9.3   9.3   0
    4       10.0  10.0  10.0   0
    5       10.0  10.0  10.0   2.2   0
    6       10.0  10.0  10.0   8.2   8.2   0



Polythetic Agglomerative Hierarchical Clustering: Describing Clusters

• Compare the clusters with respect to their means/medians and variances on the various variables.
• Univariate ANOVA or the Kruskal-Wallis rank sum test can be employed to compare the differences between cluster means (or medians) for each variable, as sketched below.
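A sketch in R, assuming `dat` and the 2-cluster membership from cutree() above:

    grp <- cutree(hc, k = 2)
    # Kruskal-Wallis test of cluster differences, one variable at a time
    apply(dat, 2, function(v) kruskal.test(v, factor(grp))$p.value)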


• Graphical summaries can be employed to qualitatively determine which variables best describe each cluster.



• Box-and-whisker plots can be employed to compare the differences between cluster means (or medians) for each variable.


• Display clusters on an ordination plot produced from an ordination procedure (e.g., PCA, NMDS), allowing entities to be clustered while simultaneously providing an ecological interpretation of how clusters differ.



• Cluster entities using ordination scores produced from an ordination procedure (e.g., PCA, NMDS), allowing entities to be clustered while simultaneously providing an ecological interpretation.

Polythetic Agglomerative Hierarchical Clustering: Standardizing Data

• Standardization (by row or column) is often recommended for sample-by-species data when the objective is to cluster samples based on relative abundance profiles. Column standardization (z-score) is essential when the variables have different scales.
• Standardizing data can have the serious effect of diluting the differences between groups on the variables (species) that are the best discriminators.
• The choice of whether or not to standardize largely depends on the data set involved, the resemblance measure to be used, and whether or not you want each variable (species) to receive equal weight in the cluster analysis. Two common standardizations are sketched below.
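Two common options in R (decostand() is from the vegan package; `dat` is assumed):

    zdat <- scale(dat)                        # column z-scores (equal weight)
    library(vegan)
    rel  <- decostand(dat, method = "total")  # row-relativized abundances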


Polythetic Agglomerative Hierarchical Clustering: Standardizing Data (Example)

Raw data:

            Site
          A    B    C    D    E    F   Total
    1     1    1    1    3    3    1      10
    2     2    2    4    6    6    0      20
    3    10   10   20   30   30    0     100
    4     3    3    2    1    1    0      10
    5     0    0    0    0    1    0       1
    6     0    0    0    0   20    0      20
  Total  16   16   27   40   61    1     161

Percent dissimilarity:

          1     2     3     4     5     6
    1     0
    2     0.40  0
    3     0.84  0.67  0
    4     0.50  0.47  0.82  0
    5     0.82  0.90  0.98  0.82  0
    6     0.80  0.70  0.67  0.93  0.90  0

[Figures: dendrograms of the six entities (axis: average distance between clusters).]
