Clustering
CS503 - Spring 2019
Narayanan C Krishnan (ckn@iitrpr.ac.in)

Supervised vs Unsupervised Learning

• Supervised learning – given $\{(x_i, y_i)\}_{i=1}^{N}$, learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$
  • Categorical output – classification
  • Continuous output – regression
• Unsupervised learning – given $\{x_i\}_{i=1}^{N}$, can we infer the structure of the data?
  • Learning without a teacher

Why Unsupervised Learning?

• Unlabeled data is cheap
• Labeled data is expensive – cumbersome to collect
• Exploratory data analysis
• Preprocessing step for supervised learning
• Analysis of data in high-dimensional spaces

Clustering

• Discover groups such that samples within a group are more similar to each other than samples across groups

Applications of Clustering (1)

(Figure: examples of unsupervised clustering applications.)

Components of Clustering

• A dissimilarity (similarity) function
  • Measures the distance/dissimilarity between examples
• A loss function
  • Evaluates the clusters
• An algorithm that optimizes this loss function

Proximity Matrices

• Data is directly represented in terms of proximities between pairs of objects
• Subjectively judged dissimilarities are seldom distances in the strict sense (they need not satisfy the properties of a distance measure)
• Replace the proximity matrix $D$ by $(D + D^{T})/2$ to enforce symmetry
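A minimal sketch of this symmetrization step, assuming the proximities sit in a NumPy array; the matrix values below are made up for illustration.

```python
import numpy as np

# Hypothetical subjectively judged dissimilarities (not symmetric).
D = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 3.0],
              [5.0, 2.0, 0.0]])

# Replace D by (D + D^T) / 2 so that d(i, j) == d(j, i).
D_sym = (D + D.T) / 2.0
print(D_sym)
```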

Dissimilarity Based on Attributes (1)

• Data point $x_i$ has $p$ features
• Attributes are real-valued
• Euclidean distance between the data points
  $d(x_i, x_{i'}) = \left( \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)^{1/2}$
• Resulting clusters are invariant to rotation and translation, but not to scaling
• If features have different scales, standardize the data
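A small sketch of the Euclidean distance and the standardization step, assuming the data is a NumPy array; the feature interpretation in the comment is illustrative.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two real-valued feature vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def standardize(X):
    """Scale each feature to zero mean and unit variance (use when features have different scales)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[170.0, 65.0], [160.0, 70.0], [180.0, 80.0]])  # e.g. height (cm), weight (kg)
Xs = standardize(X)
print(euclidean(Xs[0], Xs[1]))
```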

Dissimilarity Based on Attributes (2)

• Data point $x_i$ has $p$ features
• Attributes are real-valued
• Any $\mathcal{L}_q$ norm can be used
  $d(x_i, x_{i'}) = \left( \sum_{j=1}^{p} | x_{ij} - x_{i'j} |^{q} \right)^{1/q}$
• Cosine similarity between the data points (one minus this value gives a dissimilarity)
  $d(x_i, x_{i'}) = \frac{\sum_{j=1}^{p} x_{ij} x_{i'j}}{\sqrt{\sum_{j=1}^{p} x_{ij}^2} \, \sqrt{\sum_{j=1}^{p} x_{i'j}^2}}$
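The two measures above in code; a minimal sketch with illustrative vectors.

```python
import numpy as np

def minkowski(x, y, q=2):
    """L_q (Minkowski) dissimilarity; q=2 recovers Euclidean, q=1 Manhattan."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def cosine_similarity(x, y):
    """Cosine similarity; 1 - cosine_similarity(x, y) can serve as a dissimilarity."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.0])
print(minkowski(x, y, q=1), minkowski(x, y, q=2), cosine_similarity(x, y))
```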

Dissimilarity Based on Attributes (3)

• Data point $x_i$ has $p$ features
• Attributes are ordinal
  • Grades – A, B, C, D
  • Answers to a survey question – strongly agree, agree, neutral, disagree
• Replace the ordinal values by the quantitative representations
  $\frac{m - 1/2}{M}, \quad m = 1, \dots, M$
  where $M$ is the number of ordered values
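A tiny sketch of this mapping for the grade example on the slide.

```python
# Map ordinal attribute values to the quantitative scores (m - 1/2) / M,
# where M is the number of ordered levels.
grades = ["A", "B", "C", "D"]          # ordered levels
M = len(grades)
score = {g: (m - 0.5) / M for m, g in enumerate(grades, start=1)}
print(score)   # {'A': 0.125, 'B': 0.375, 'C': 0.625, 'D': 0.875}
```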

Dissimilarity Based on Attributes (4)

• Data point $x_i$ has $p$ features
• Attributes are categorical
  • Values of an attribute are unordered
• Define an explicit difference between every pair of values
  $\begin{pmatrix} L_{11} & \cdots & L_{1M} \\ \vdots & \ddots & \vdots \\ L_{M1} & \cdots & L_{MM} \end{pmatrix}$
• Often
  • $L_{rr'} = 0$ for identical values ($r = r'$)
  • $L_{rr'} = 1$ for different values ($r \neq r'$)
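A minimal sketch of the common 0/1 choice, for a hypothetical colour attribute.

```python
import numpy as np

# 0/1 dissimilarity matrix L for a categorical attribute with M unordered values:
# L[r, r'] = 0 if the values are identical, 1 otherwise.
values = ["red", "green", "blue"]
M = len(values)
L = np.ones((M, M)) - np.eye(M)
print(L)
```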

Loss Function for Clustering (1)

• Assign each observation to a cluster without regard to a probability model describing the data
• Let $K$ be the number of clusters and $k$ index the clusters
• Each observation is assigned to one and only one cluster
• View the assignment as a function $C(i) = k$
• Loss function
  $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = k} \sum_{C(i') = k} d(x_i, x_{i'})$
• Characterizes the extent to which observations assigned to the same cluster tend to be close to one another
• Within-cluster distance/scatter
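A direct transcription of $W(C)$; the slide leaves the dissimilarity $d$ generic, so squared Euclidean distance is used here as an assumption.

```python
import numpy as np

def within_cluster_scatter(X, assign, K):
    """W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(i')=k} d(x_i, x_i'),
    with squared Euclidean distance as the dissimilarity."""
    W = 0.0
    for k in range(K):
        Xk = X[assign == k]
        diff = Xk[:, None, :] - Xk[None, :, :]   # all pairwise differences within cluster k
        W += 0.5 * np.sum(diff ** 2)             # 1/2 corrects for counting each ordered pair
    return W

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
assign = np.array([0, 0, 1, 1])
print(within_cluster_scatter(X, assign, K=2))
```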

Loss Function for Clustering (2)

• Consider the total point scatter
  $T = \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'})$
• This can be split as
  $T = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = k} \left( \sum_{C(i') = k} d(x_i, x_{i'}) + \sum_{C(i') \neq k} d(x_i, x_{i'}) \right)$

  $T = W(C) + B(C)$

Loss Function for Clustering (3)

• The function $B(C)$
  $B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = k} \sum_{C(i') \neq k} d(x_i, x_{i'})$
• Between-cluster distance/scatter
• Since $T$ is constant for a given data set, minimizing $W(C)$ is equivalent to maximizing $B(C)$

Combinatorial Clustering

• Minimize $W$ over all possible assignments of $N$ data points to $K$ clusters
• Unfortunately feasible only for very small data sets
• The number of distinct assignments is
  $S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K - k} \binom{K}{k} k^{N}$
• $S(10, 4) = 34{,}105$
• $S(19, 4) \approx 10^{10}$
• Not a practical clustering algorithm
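This count (a Stirling number of the second kind) is easy to evaluate exactly; a small sketch.

```python
from math import comb, factorial

def num_assignments(N, K):
    """Number of distinct ways to partition N points into K non-empty clusters
    (Stirling number of the second kind)."""
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

print(num_assignments(10, 4))   # 34105
print(num_assignments(19, 4))   # on the order of 1e10, already far too many to enumerate
```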

K-Means Clustering (1)

• Most popular iterative descent clustering method
• Suppose all variables/features are real-valued and we use the squared Euclidean distance as the dissimilarity measure
  $d(x_i, x_{i'}) = \| x_i - x_{i'} \|^2$
• The within-cluster scatter can then be written as
  $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i) = k} \sum_{C(i') = k} \| x_i - x_{i'} \|^2 = \sum_{k=1}^{K} N_k \sum_{C(i) = k} \| x_i - \bar{x}_k \|^2$
  where $\bar{x}_k$ is the mean vector of cluster $k$ and $N_k$ is the number of points assigned to it

K-Means Clustering (2)

• Find
  $C^{*} = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i) = k} \| x_i - \bar{x}_k \|^2$
• Note that for any set $S$ of points
  $\bar{x}_S = \arg\min_{m} \sum_{i \in S} \| x_i - m \|^2$
• So, equivalently, find
  $C^{*}, \{ m_k \}^{*} = \min_{C, \{ m_k \}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i) = k} \| x_i - m_k \|^2$

K-Means Clustering (3)

• Find the “optimal” solution using an Expectation-Maximization-like procedure (a sketch follows this list)
• Iterative procedure consisting of two steps
  • Expectation step (E step) – fix the mean vectors $\{ m_k \}_{k=1}^{K}$ and find the optimal assignment $C^{*}$
  • Maximization step (M step) – fix the cluster assignments $C$ and find the optimal mean vectors $\{ m_k \}_{k=1}^{K}$
• Each step of this procedure reduces the loss function value
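A compact sketch of these two alternating steps in NumPy; the random initialization and the stopping rule are choices of this sketch, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment (E-like) and mean-update (M-like) steps."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # E step: assign each point to its nearest mean (squared Euclidean distance)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # M step: recompute each mean as the average of its assigned points
        new_means = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):   # stop once the means no longer move
            break
        means = new_means
    return assign, means

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
assign, means = kmeans(X, K=2)
print(means)
```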

K-Means Clustering Illustration (1)-(10)

(Figures showing successive K-means iterations; blue points mark the expectation step, red points the maximization step.)

How to Choose K?

• Similar to choosing $k$ in kNN
• The loss function generally decreases with $K$
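Because the loss keeps decreasing as $K$ grows, a common heuristic (not spelled out on the slide) is to plot the loss against $K$ and look for where the decrease flattens out. A small sketch using scikit-learn's KMeans and its inertia_ (the within-cluster sum of squares), on made-up data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])

losses = []
for K in range(1, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    losses.append(km.inertia_)   # within-cluster sum of squared distances
print(losses)  # decreases with K; the drop usually flattens once K matches the data
```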

Limitations of K-Means Clustering

• Hard assignments are susceptible to noise/outliers
• Assumes spherical (convex) clusters with a uniform prior on the clusters
• Clusters can change arbitrarily for different $K$ and initializations

K-Medoids

• K-means is suitable only when using the Euclidean distance
  • Susceptible to outliers
  • A challenge when the centroid of a cluster is not a valid data point
• Generalizing K-means to arbitrary distance measures
  • Replace the mean calculation by a medoid calculation (see the sketch below)
  • Ensures the centroid is a medoid – always a valid data point
  • Increases computation, as we now have to find the medoid
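A minimal sketch of the medoid computation for one cluster; the Manhattan dissimilarity and the data points are illustrative choices.

```python
import numpy as np

def medoid(Xk, dist):
    """Return the point in cluster Xk that minimizes the total dissimilarity
    to all other points in the cluster (the medoid)."""
    costs = [sum(dist(x, y) for y in Xk) for x in Xk]
    return Xk[int(np.argmin(costs))]

manhattan = lambda x, y: np.abs(x - y).sum()   # any dissimilarity works, not just Euclidean
Xk = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])   # the outlier barely moves the medoid
print(medoid(Xk, manhattan))   # [0. 1.], always an actual data point
```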

Soft K-Means as Gaussian Mixture Models (1)

• Probabilistic clusters
• Each cluster $k$ is associated with a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
• Each cluster also has a mixing coefficient $\pi_k$
• The likelihood of a data point drawn from the $K$ clusters is then
  $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
• where $\sum_{k=1}^{K} \pi_k = 1$
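A direct transcription of this mixture density using SciPy's multivariate normal pdf; the two-component parameters below are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) for a K-component Gaussian mixture."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_density(np.array([0.0, 0.0]), pis, mus, Sigmas))
```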

Soft K-Means as Gaussian Mixture Models (2)

• Given $N$ iid data points, the likelihood function $p(x_1, \dots, x_N)$ is
  $p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i)$

Soft K-Means as Gaussian Mixture Models (3)

• Given $N$ iid data points, the likelihood function $p(x_1, \dots, x_N)$ is
  $p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Let us take the log-likelihood

Soft K-Means as Gaussian Mixture Models (4)

• Given $N$ iid data points, the likelihood function $p(x_1, \dots, x_N)$ is
  $p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Let us take the log-likelihood
  $\log p(x_1, \dots, x_N) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$

Soft K-Means as Gaussian Mixture Models (5)

• Latent Variables

• Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$
• Given the complete data $X, Z$, we look at maximizing $p(X, Z \mid \pi_k, \mu_k, \Sigma_k)$

Soft K-Means as Gaussian Mixture Models (6)

• Latent Variables

• Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$

• Let the probability $p(z_{ik} = 1 \mid x_i)$ be denoted as $\gamma(z_{ik})$

Soft K-Means as Gaussian Mixture Models (7)

• Latent Variables

• Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$

• Let the probability $p(z_{ik} = 1 \mid x_i)$ be denoted as $\gamma(z_{ik})$
• From Bayes' theorem
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \frac{p(z_{ik} = 1) \, p(x_i \mid z_{ik} = 1)}{p(x_i)}$

Soft K-Means as Gaussian Mixture Models (8)

• Latent Variables
• Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$
• Let the probability $p(z_{ik} = 1 \mid x_i)$ be denoted as $\gamma(z_{ik})$
• From Bayes' theorem
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \frac{p(z_{ik} = 1) \, p(x_i \mid z_{ik} = 1)}{p(x_i)}$
• The marginal distribution $p(x_i) = \sum_{z_i} p(x_i, z_i) = \sum_{k=1}^{K} p(z_{ik} = 1) \, p(x_i \mid z_{ik} = 1)$

Soft K-Means as Gaussian Mixture Models (9)

• Now, $p(z_{ik} = 1) = \pi_k$ and $p(x_i \mid z_{ik} = 1) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Therefore
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
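A small sketch computing these responsibilities for a whole data set at once; the two-component parameters are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma(z_ik) = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    weighted = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                                for pi, mu, S in zip(pis, mus, Sigmas)])   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [5.0, 5.0]])
gamma = responsibilities(X, [0.5, 0.5], [np.zeros(2), np.full(2, 5.0)], [np.eye(2), np.eye(2)])
print(gamma)   # each row sums to 1
```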

Estimating the mean $\mu_k$ (1)

• Begin with the log-likelihood function
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Take the derivative with respect to $\mu_k$ and equate it to 0

Estimating the mean $\mu_k$ (2)

  $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$
• where $N_k = \sum_{i=1}^{N} \gamma(z_{ik})$
  • Effective number of points assigned to cluster $k$
• So the mean of the $k^{\text{th}}$ Gaussian component is the weighted mean of all the points in the dataset
• where the weight of the $i^{\text{th}}$ data point is the posterior probability that component $k$ was responsible for generating $x_i$

Estimating the covariance $\Sigma_k$

• Begin with the log-likelihood function
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Take the derivative with respect to $\Sigma_k$ and equate it to 0
  $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^{T}$
• Similar to the result for a single Gaussian fitted to the dataset, but each data point is weighted by the corresponding posterior probability

Estimating the mixing coefficients $\pi_k$

• Begin with the log-likelihood function
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Maximize the log-likelihood with respect to $\pi_k$
• Subject to the condition that $\sum_{k=1}^{K} \pi_k = 1$
• Use a Lagrange multiplier $\lambda$ and maximize
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$
• Solving this results in
  $\pi_k = \frac{N_k}{N}$

Soft K-Means as Gaussian Mixture Models (10)

• In summary
  • $\pi_k = \frac{N_k}{N}$
  • $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$
  • $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^{T}$
• But then, what if $\gamma(z_{ik})$ is unknown?
• Use the EM algorithm!

EM for GMM

• First choose initial values for $\pi_k, \mu_k, \Sigma_k$
• Alternate between Expectation and Maximization steps (a full sketch follows this list)
  • Expectation step (E) – given the current parameters of the mixture components, compute the posterior probabilities $\gamma(z_{ik})$
  • Maximization step (M) – given the posterior probabilities, update $\pi_k, \mu_k, \Sigma_k$
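A compact end-to-end sketch of this loop in NumPy/SciPy; the initialization strategy and the small regularization term added to the covariances are choices of this sketch, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture, following the update equations on the slides."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]                       # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_ik)
        w = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                             for k in range(K)])
        gamma = w / w.sum(axis=1, keepdims=True)
        # M step: update pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, gamma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
pis, mus, Sigmas, gamma = em_gmm(X, K=2)
print(pis, mus, sep="\n")
```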

EM for GMM Illustration (1)-(6)

(Figures showing successive EM iterations on a Gaussian mixture.)

Practical Issues with EM for GMM

• Takes many more iterations than K-means
• Each iteration requires more computation
• Run K-means first, and then EM for GMM
  • Covariances can be initialized to the covariances of the clusters obtained from K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function
• Check for convergence
  • Log-likelihood does not change significantly between two iterations

Hierarchical Clustering (1)

• Organize clusters in a hierarchical fashion
• Produces a rooted binary tree (dendrogram)

Hierarchical Clustering (2)

• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Users can choose a cut through the hierarchy to represent the most natural division into clusters

Hierarchical Clustering (3)

• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Both share a monotonicity property
  • The dissimilarity between merged clusters increases monotonically with the level of the merger

Agglomerative Clustering (1)

• Single linkage – distance between the two most similar points in $G$ and $H$
  $d_{SL}(G, H) = \min_{i \in G, \, i' \in H} d(i, i')$
• Also referred to as nearest-neighbor linkage
• Results in extended clusters through chaining
• May violate the compactness property (large diameter)

Agglomerative Clustering (2)

• Complete linkage – distance between the two most dissimilar points in $G$ and $H$
  $d_{CL}(G, H) = \max_{i \in G, \, i' \in H} d(i, i')$
• Furthest-neighbor technique
• Forces spherical clusters with consistent diameter
• May violate the closeness property

Agglomerative Clustering (3)

• Average linkage (group average) – average dissimilarity between the groups
  $d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d(i, i')$
• Less affected by outliers
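The three linkage rules map directly onto SciPy's agglomerative clustering routines; a small sketch on made-up data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs of illustrative data.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6.0])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # bottom-up merge tree (dendrogram encoding)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
```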

Agglomerative Clustering (4)

(Figure panels comparing Average Linkage, Complete Linkage and Single Linkage.)


FIGURE 14.13. Dendrogram from agglomerative hierarchical clustering of human tumor microarray data.

Observations within them are relatively close together (small dissimilarities) as compared with observations in different clusters. To the extent that this is not the case, results will differ.

Single linkage (14.41) only requires that a single dissimilarity $d_{ii'}$, $i \in G$ and $i' \in H$, be small for two groups $G$ and $H$ to be considered close together, irrespective of the other observation dissimilarities between the groups. It will therefore have a tendency to combine, at relatively low thresholds, observations linked by a series of close intermediate observations. This phenomenon, referred to as chaining, is often considered a defect of the method. The clusters produced by single linkage can violate the “compactness” property that all observations within each cluster tend to be similar to one another, based on the supplied observation dissimilarities $\{d_{ii'}\}$. If we define the diameter $D_G$ of a group of observations as the largest dissimilarity among its members,

$D_G = \max_{i \in G, \, i' \in G} d_{ii'}, \qquad (14.44)$

then single linkage can produce clusters with very large diameters. Complete linkage (14.42) represents the opposite extreme. Two groups $G$ and $H$ are considered close only if all of the observations in their union are relatively similar. It will tend to produce compact clusters with small diameters (14.44). However, it can produce clusters that violate the “closeness” property. That is, observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster.

Summary

• Unsupervised Learning
• K-means clustering
  • Expectation Maximization for discovering the clusters
• K-medoids clustering
• Gaussian Mixture Models
  • Expectation Maximization for estimating the parameters of the Gaussian mixtures
• Hierarchical Clustering
  • Agglomerative Clustering
