Cluster Analysis
Can we organize sampling entities into discrete classes, such that within-group similarity is maximized and among-group similarity is minimized according to some objective criterion?

        Species
Site    A    B    C    D
1       1    9   12    1
2       1    8   11    1
3       1    6   10   10
4      10    0    9   10
5      10    2    8   10
6      10    0    7    2

Important Characteristics of Cluster Analysis Techniques
- Family of techniques with similar goals.
- Operate on data sets for which pre-specified, well-defined groups do not exist; characteristics of the data are used to assign entities into artificial groups.
- Summarize data redundancy by reducing the information on the whole set of, say, N entities to information about, say, g groups of nearly similar entities (where hopefully g is very much smaller than N).
- Identify outliers by leaving them solitary or in small clusters, which may then be omitted from further analyses.
- Eliminate noise from a multivariate data set by clustering nearly similar entities without requiring exact similarity.
- Assess relationships within a single set of variables; no attempt is made to define the relationship between a set of independent variables and one or more dependent variables.

What’s a Cluster?
[Figure: panels A through F]

Cluster Analysis: The Data Set
- Single set of variables; no distinction between independent and dependent variables.
- Continuous, categorical, or count variables; usually all the same scale.
- Every sample entity must be measured on the same set of variables.
- There can be fewer samples (rows) than number of variables (columns) [i.e., data matrix does not have to be of full rank].

         Variables
Sample   x1    x2    x3    ...   xp
1        x11   x12   x13   ...   x1p
2        x21   x22   x23   ...   x2p
3        x31   x32   x33   ...   x3p
...      ...   ...   ...   ...   ...
n        xn1   xn2   xn3   ...   xnp
- Common 2-way ecological data:
  - Sites-by-environmental parameters
  - Species-by-niche parameters
  - Species-by-behavioral characteristics
  - Samples-by-species
  - Specimens-by-characteristics

 1  AMRO  15.31  31.42  64.28  20.71  47.14  0.00  0.28  0.14  ...  1.45
 2  BHGR   5.76  24.77  73.18  22.95  61.59  0.00  0.00  1.09  ...  1.28
 3  BRCR   4.78  64.13  30.85  12.03  63.60  0.44  0.44  2.08  ...  1.18
 4  CBCH   3.08  58.52  39.69  15.47  62.19  0.31  0.28  1.52  ...  1.21
 5  DEJU  13.90  60.78  36.50  13.81  62.89  0.23  0.31  1.23  ...  1.23
...
19  WIWR   8.05  41.09  55.00  18.62  53.77  0.09  0.18  0.81  ...  1.36

Cluster Techniques
- Exclusive: each entity in one cluster only. Nonexclusive: each entity in one or more clusters.
- Sequential: recursive sequence of operations. Simultaneous: single nonrecursive operation.
- Hierarchical: arrange clusters in hierarchy; relationships among clusters defined. Nonhierarchical: achieve maximum within-cluster homogeneity.
- Agglomerative: build groups. Divisive: break into groups.
- Polythetic: consider all variables. Monothetic: consider one variable.

Nonhierarchical Clustering
- NHC techniques merely assign each entity to a cluster, placing similar entities together.
- NHC is, of all cluster techniques, conceptually the simplest. Maximizing within-cluster homogeneity is the basic property to be achieved in all NHC techniques.
- Within-cluster homogeneity makes it possible to infer an entity's properties from its cluster membership. This one property makes NHC useful for mitigating noise, summarizing redundancy, and identifying outliers.
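
Within-cluster homogeneity can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the within-cluster sum of squared deviations from each cluster centroid as the homogeneity measure (one common choice, not one the slides prescribe), and it uses the six-site, four-species table from the opening slide as read from the source. The function name and the two candidate partitions are hypothetical.

```python
# Sites-by-species abundances (rows = sites 1-6, columns = species A-D),
# as read from the opening slide.
SITES = [
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
]


def within_cluster_ss(data, labels):
    """Sum of squared deviations of each row from its cluster centroid.

    Assumed homogeneity measure: smaller values mean more homogeneous clusters.
    """
    total = 0.0
    for cluster in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == cluster]
        # Centroid = column-wise mean of the cluster's members.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        for row in members:
            total += sum((x - c) ** 2 for x, c in zip(row, centroid))
    return total


# Two hypothetical ways of splitting the six sites into two groups.
partition_a = [0, 0, 0, 1, 1, 1]  # sites 1-3 vs. sites 4-6
partition_b = [0, 1, 0, 1, 0, 1]  # alternating sites

ss_a = within_cluster_ss(SITES, partition_a)
ss_b = within_cluster_ss(SITES, partition_b)
print(f"partition A: {ss_a:.1f}, partition B: {ss_b:.1f}")
```

Under this assumed measure, partition A gives the smaller value, so it is the more homogeneous of the two groupings; an NHC procedure searches for a partition of this kind.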
- The primary purpose of NHC is to summarize redundant entities into fewer groups for subsequent analysis (e.g., for subsequent hierarchical clustering to elucidate relationships among “groups”).
- Several different algorithms are available that differ in various details. In all cases, the single criterion achieved is within-cluster homogeneity, and the results are, in general, similar.

Nonhierarchical Clustering: K-means Clustering (KMEANS)
- Specify the number of random seeds (kernels) or provide seeds.
- Assign samples to the ‘nearest’ seed.
- Iteratively reassign samples to groups in order to minimize within-group variability (i.e., each sample is assigned to the group with the ‘closest’ centroid).
[Figure: seeds and group centroids]

Nonhierarchical Clustering: Composite Clustering (COMPCLUS)
- Select a seed at random.
- Assign samples to the seed if they fall within a specified distance (radius) of the seed.
- Pick a second seed and repeat the process until all samples are classified.
- Groups smaller than a specified number are dissolved and their samples reassigned to the closest centroid, provided it is within a specified maximum distance.

Nonhierarchical Clustering: Minimum Variance Partitioning
- Compute standardized distances between each sample and the overall centroid.
- Select the sample with the largest distance as a new cluster centroid.
- Assign samples to the nearest cluster centroid.
- Select the sample with the largest distance from its cluster centroid to initiate a new cluster.
- Assign samples to the nearest cluster centroid.
- Continue until the desired number of clusters is created.

Nonhierarchical Clustering: Maximum Likelihood Clustering
- Model-based method.
- Assume the samples consist of c subpopulations, each corresponding to a cluster, and that the density function of a q-dimensional observation from the jth subpopulation is f_j(x, θ_j) for some unknown vector of parameters, θ_j.
- Assume that γ = (γ_1, ..., γ_n) gives the labels of the subpopulation to which each sample belongs.
- Choose θ = (θ_1, ..., θ_c) and γ to maximize the likelihood:

  $$L(\theta, \gamma) = \prod_{i=1}^{n} f_{\gamma_i}(x_i, \theta_{\gamma_i})$$

- If f_j(x, θ_j) is taken as the multivariate normal density with mean vector μ_j and covariance matrix Σ_j, an ML solution can be found based on varying assumptions about the covariance matrix.
- Normal mixture modeling (package mclust; Fraley et al. 2012).

Nonhierarchical Clustering: Limitations
- NHC procedures involve various assumptions about the form of the underlying population from which the sample is drawn. These assumptions often include the typical parametric multivariate assumptions, e.g., equal covariance matrices among clusters.
- Most NHC techniques are strongly biased towards finding elliptical and spherical clusters.
- NHC is not effective for elucidating relationships because there is no interesting structure within clusters and no definition of relationships among clusters derived.
- Regardless of the NHC procedure used, it is best to have a reasonable guess on how many groups to expect in the data.

Nonhierarchical Clustering: Choosing the ‘Right’ Number of Clusters
- Scree plot of cluster properties:
  - Sum of within-cluster dissimilarities to the cluster medoids.
  - Average sample silhouette width (s_i):

    $$s_i = \frac{b_i - a_i}{\max(b_i, a_i)}$$

    a_i = average distance from sample i to all other samples in its own cluster
    b_i = minimum distance from sample i to a neighboring cluster

- Interpreting the silhouette width (s_i):
  - s_i → 1: very well clustered
  - s_i → 0: in between clusters
  - s_i < 0: placed in wrong cluster

Nonhierarchical Clustering: Testing the ‘Significance’ of the Clusters
- Are groups significantly different? (How valid are the groups?)
  - Multivariate Analysis of Variance (MANOVA)
  - Multi-Response Permutation Procedures (MRPP)
  - Analysis of Group Similarities (ANOSIM)
  - Mantel’s Test (MANTEL)
We will cover these procedures in the next section of the course.

Nonhierarchical Clustering: Evaluating the Clusters
[Figure: cluster plot and silhouette plot]

Hierarchical Clustering
- HC combines similar entities into classes or groups and arranges these groups into a hierarchy.
- HC reveals relationships expressed among the entities classified.
Limitations:
- For large data sets hierarchies are problematic, because a hierarchy with > 50 entities is difficult to display or interpret.
- HC techniques have a general disadvantage: they contain no provision for reallocating entities that may have been poorly classified at an early stage in the analysis.

Complementary Use of NHC and HC
- HC is ideal for small data sets and NHC for large data sets.
- HC helps reveal relationships in the data while NHC does not.
- NHC can be used initially to summarize a large data set by producing far fewer composite samples, which then makes HC feasible and effective for depicting relationships.

Polythetic Agglomerative Hierarchical Clustering
- PAHC techniques use the information on all the variables (i.e., polythetic).
- Each entity is initially assigned as an individual cluster.
  PAHC agglomerates these in a hierarchy of larger and larger clusters until finally a single cluster contains all entities.
- There are numerous different resemblance measures and fusion algorithms; consequently, there exists a profusion of PAHC techniques.
[Figure: dendrogram of entities 1 to 10; axis: Fusion]

Polythetic Agglomerative Hierarchical Clustering
Assumptions:
- Basically none! Hence, the purpose of PAHC is generally purely descriptive.
- However, some techniques "assume" spherical clusters.
- Certain resemblance measures (e.g., Euclidean distance) assume that the variables are uncorrelated within clusters.
Sample Size Requirements:
- Basically none!

Polythetic Agglomerative Hierarchical Clustering: Two-Stage Process
1. Resemblance Matrix
- The first step is to compute a dissimilarity/distance matrix from the original data matrix.
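
The resemblance-matrix step can be sketched in a few lines. This is a minimal illustration, not the course's own procedure: it assumes Euclidean distance as the resemblance measure (the slides leave the choice open and note that Euclidean distance assumes uncorrelated variables within clusters), and it uses the six-site, four-species table from the opening slide as read from the source. The function name `distance_matrix` is hypothetical.

```python
import math

# Sites-by-species abundances (rows = sites 1-6, columns = species A-D),
# as read from the opening slide.
SITES = [
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
]


def distance_matrix(data):
    """Return the symmetric n-by-n matrix of pairwise Euclidean distances.

    Euclidean distance is an assumed choice of resemblance measure.
    """
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(data[i], data[j])  # requires Python 3.8+
            dist[i][j] = dist[j][i] = d      # fill both triangles
    return dist


D = distance_matrix(SITES)
for row in D:
    print("  ".join(f"{d:5.2f}" for d in row))
```

The result is a square, symmetric matrix with zeros on the diagonal; the fusion stage of PAHC then operates on this matrix rather than on the raw data.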