Clustering, CS503 - Spring 2019, Narayanan C Krishnan ([email protected])

Supervised vs Unsupervised Learning
• Supervised learning – given $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$
  • Categorical output – classification
  • Continuous output – regression
• Unsupervised learning – given $\{\mathbf{x}_i\}_{i=1}^{N}$ alone, can we infer the structure of the data?
• Learning without a teacher

Why Unsupervised Learning?
• Unlabeled data is cheap
• Labeled data is expensive – cumbersome to collect
• Exploratory data analysis
• Preprocessing step for supervised learning algorithms
• Analysis of data in high-dimensional spaces

Cluster Analysis
• Discover groups such that samples within a group are more similar to each other than samples across groups

Applications of Clustering (1)
• Unsupervised image segmentation

Components of Clustering
• A dissimilarity (similarity) function – measures the distance/dissimilarity between examples
• A loss function – evaluates the clusters
• An algorithm that optimizes this loss function

Proximity Matrices
• The data is represented directly in terms of the proximities between pairs of objects
• Subjectively judged dissimilarities are seldom distances in the strict sense (they need not satisfy the properties of a distance measure)
• A common fix is to replace the proximity matrix $D$ by $(D + D^{T})/2$

Dissimilarity Based on Attributes (1)
• Data point $\mathbf{x}_i$ has $p$ real-valued features
• Euclidean distance between data points:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2}$
• The resulting clusters are invariant to rotation and translation, but not to scaling
• If features have different scales, standardize the data first

Dissimilarity Based on Attributes (2)
• Data point $\mathbf{x}_i$ has $p$ real-valued features
• Any $\mathcal{L}_q$ norm:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \left( \sum_{j=1}^{p} |x_{ij} - x_{i'j}|^{q} \right)^{1/q}$
• Cosine distance between data points:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \dfrac{\sum_{j=1}^{p} x_{ij}\, x_{i'j}}{\sqrt{\sum_{j=1}^{p} x_{ij}^2}\, \sqrt{\sum_{j=1}^{p} x_{i'j}^2}}$

Dissimilarity Based on Attributes (3)
• Data point $\mathbf{x}_i$ has $p$ features; attributes are ordinal
  • Grades – A, B, C, D
  • Answers to a survey question – strongly agree, agree, neutral, disagree
• Replace the $M$ ordinal values by the quantitative representations
  $\dfrac{i - 1/2}{M}, \quad i = 1, \ldots, M$

Dissimilarity Based on Attributes (4)
• Data point $\mathbf{x}_i$ has $p$ features; attributes are categorical, so the values of an attribute are unordered
• Define an explicit difference between each pair of values:
  $\begin{pmatrix} L_{11} & \cdots & L_{1M} \\ \vdots & \ddots & \vdots \\ L_{M1} & \cdots & L_{MM} \end{pmatrix}$
• Often $L_{rr'} = 0$ for identical values ($r = r'$) and $L_{rr'} = 1$ for different values ($r \neq r'$)

Loss Function for Clustering (1)
• Assign each observation to a cluster without regard to a probability model describing the data
• Let $K$ be the number of clusters and $k$ index the clusters
• Each observation is assigned to one and only one cluster
• View the assignment as a function $C(i) = k$
• Loss function (sketched in code below):
  $W(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(\mathbf{x}_i, \mathbf{x}_{i'})$
• It characterizes the extent to which observations assigned to the same cluster tend to be close to one another
• The within-cluster distance/scatter
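To make the loss concrete, here is a small NumPy sketch (the function names and the toy data are illustrative choices, not code from the course) that evaluates the within-cluster scatter $W(C)$ for a given assignment and a given dissimilarity function:

```python
import numpy as np

def squared_euclidean(xi, xj):
    """Squared Euclidean dissimilarity between two feature vectors."""
    diff = xi - xj
    return float(np.dot(diff, diff))

def within_cluster_scatter(X, assignments, dissimilarity=squared_euclidean):
    """W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(i')=k} d(x_i, x_i')."""
    W = 0.0
    for k in np.unique(assignments):
        members = X[assignments == k]
        for xi in members:
            for xj in members:          # double-counts each pair, hence the 1/2 factor
                W += dissimilarity(xi, xj)
    return 0.5 * W

# Toy example: two well-separated blobs with the matching assignment.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(5, 2)), rng.normal(5.0, 0.5, size=(5, 2))])
C = np.array([0] * 5 + [1] * 5)
print(within_cluster_scatter(X, C))
```

Swapping `squared_euclidean` for any other dissimilarity (an $\mathcal{L}_q$ norm, a categorical difference matrix) leaves the loss computation untouched, which reflects the separation between dissimilarity, loss, and algorithm described in the Components of Clustering slide above.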
Loss Function for Clustering (2)
• Consider the total point scatter
  $T = \dfrac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_{ii'}$
• This can be split as
  $T = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \right)$
  so that $T = W(C) + B(C)$

Loss Function for Clustering (3)
• The second term is
  $B(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}$
• The between-cluster distance/scatter
• Since $T$ is constant for a given data set, minimizing $W(C)$ is equivalent to maximizing $B(C)$

Combinatorial Clustering
• Minimize $W$ over all possible assignments of $N$ data points to $K$ clusters
• Unfortunately this is feasible only for very small data sets
• The number of distinct assignments is
  $S(N, K) = \dfrac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^{N}$
• $S(10, 4) = 34{,}105$, while $S(19, 4) \approx 10^{10}$
• Exhaustive enumeration is therefore not a practical clustering algorithm

K-Means Clustering (1)
• The most popular iterative descent clustering method
• Suppose all variables/features are real-valued and we use squared Euclidean distance as the dissimilarity measure:
  $d(\mathbf{x}_i, \mathbf{x}_{i'}) = \|\mathbf{x}_i - \mathbf{x}_{i'}\|^{2}$
• The within-cluster scatter can then be written as
  $W(C) = \dfrac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|\mathbf{x}_i - \mathbf{x}_{i'}\|^{2} = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^{2}$

K-Means Clustering (2)
• Find
  $C^{*} = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^{2}$
• Note that for any set $S$ of points
  $\bar{\mathbf{x}}_S = \operatorname{argmin}_{\mathbf{m}} \sum_{i \in S} \|\mathbf{x}_i - \mathbf{m}\|^{2}$
• So we can equivalently find
  $C^{*} = \min_{C,\, \{\mathbf{m}_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|\mathbf{x}_i - \mathbf{m}_k\|^{2}$

K-Means Clustering (3)
• Find the "optimal" solution using Expectation Maximization
• An iterative procedure consisting of two steps (see the sketch below):
  • Expectation step (E step) – fix the mean vectors $\{\mathbf{m}_k\}_{k=1}^{K}$ and find the optimal assignment $C^{*}$
  • Maximization step (M step) – fix the cluster assignments $C$ and find the optimal mean vectors $\{\mathbf{m}_k\}_{k=1}^{K}$
• Each step of this procedure reduces the loss function value
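A minimal sketch of this alternation, assuming squared Euclidean distance and initialization with randomly chosen data points (the `kmeans` function and its toy usage are my own illustration, not code from the course):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment (E) step and the mean-update (M) step."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]    # initialize with K data points
    assignments = np.full(len(X), -1)
    for _ in range(n_iters):
        # E step: assign each point to its nearest mean under squared Euclidean distance.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                           # assignments stable: converged
        assignments = new_assignments
        # M step: recompute each mean as the centroid of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:        # a fuller implementation would handle empty clusters
                means[k] = members.mean(axis=0)
    return assignments, means

# Toy usage on two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(6.0, 1.0, size=(50, 2))])
labels, centers = kmeans(X, K=2)
```

Each E step can only lower the loss for fixed means, and each M step can only lower it for fixed assignments, so the loop stops at a local minimum; different initializations can end in different local minima, which is one of the limitations listed below.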
K-Means Clustering Illustration (1)-(10)
• [Figure slides: blue point – Expectation step; red point – Maximization step]

How to Choose K?
• Similar to choosing $k$ in kNN
• The loss function generally decreases as $K$ grows, so it cannot be used on its own to pick $K$

Limitations of K-Means Clustering
• Hard assignments are susceptible to noise/outliers
• Assumes spherical (convex) clusters with a uniform prior on the clusters
• Clusters can change arbitrarily for different $K$ and different initializations

K-Medoids
• K-Means is suitable only when using Euclidean distance, is susceptible to outliers, and runs into trouble when the centroid of a cluster is not a valid data point
• K-Medoids generalizes K-Means to arbitrary distance measures: replace the mean calculation by a medoid calculation
• This ensures the cluster representative is a medoid – always a valid data point
• It increases computation, since we now have to find the medoid of each cluster

Soft K-Means as Gaussian Mixture Models (1)
• Probabilistic clusters
• Each cluster is associated with a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
• Each cluster also has a prior probability $\pi_k$
• The likelihood of a data point drawn from the $K$ clusters is then
  $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\sum_{k=1}^{K} \pi_k = 1$

Soft K-Means as Gaussian Mixture Models (2)-(4)
• Given $N$ iid data points, the likelihood function is
  $p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Taking the log gives the log-likelihood
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$

Soft K-Means as Gaussian Mixture Models (5)-(8)
• Latent variables: each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \ldots, z_{iK})$,
  where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$ and $p(z_{ik} = 1) = \pi_k$
• Given the complete data $X, Z$, we look at maximizing $p(X, Z \mid \pi_k, \mu_k, \Sigma_k)$
• Let the probability $p(z_{ik} = 1 \mid x_i)$ be denoted by $\gamma(z_{ik})$
• From Bayes' theorem,
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \dfrac{p(z_{ik} = 1)\, p(x_i \mid z_{ik} = 1)}{p(x_i)}$
• The marginal distribution is
  $p(x_i) = \sum_{z_i} p(x_i, z_i) = \sum_{k=1}^{K} p(z_{ik} = 1)\, p(x_i \mid z_{ik} = 1)$

Soft K-Means as Gaussian Mixture Models (9)
• Now $p(z_{ik} = 1) = \pi_k$ and $p(x_i \mid z_{ik} = 1) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Therefore
  $\gamma(z_{ik}) = p(z_{ik} = 1 \mid x_i) = \dfrac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$

Estimating the Mean $\mu_k$ (1)
• Begin with the log-likelihood function
  $\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Take the derivative with respect to $\mu_k$ and equate it to zero
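Setting this derivative to zero yields the standard EM mean update $\mu_k = \sum_i \gamma(z_{ik})\, x_i \,/\, \sum_i \gamma(z_{ik})$. As a closing illustration, here is a hedged NumPy/SciPy sketch of the responsibility computation $\gamma(z_{ik})$ derived above together with that mean update (the function names, the toy parameters, and the use of `scipy.stats.multivariate_normal` for the Gaussian density are my own choices, not material from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """E step: gamma(z_ik) = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    N, K = len(X), len(pis)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # divide by the marginal p(x_i)
    return gamma

def update_means(X, gamma):
    """M-step mean update: mu_k = sum_i gamma(z_ik) x_i / sum_i gamma(z_ik)."""
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    return (gamma.T @ X) / Nk[:, None]

# Toy usage with two components and guessed parameters (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(4.0, 1.0, size=(30, 2))])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
gamma = responsibilities(X, pis, mus, Sigmas)
new_mus = update_means(X, gamma)
```

Iterating these two steps (together with the analogous updates for $\pi_k$ and $\Sigma_k$) gives the soft counterpart of the hard E/M alternation used in K-means.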