DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images
A First Glimpse on Clustering
References Introduction to Computer Science
Arthur Zimek
University of Southern Denmark
DM534, fall 2020
1 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering
References
2 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary
A First Glimpse on Clustering A First Glimpse on Clustering References
3 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary
A First Glimpse on Clustering A First Glimpse on Clustering References
4 Similarity
DM534
Arthur Zimek I Similarity (as given by some distance measure) is a Color Histograms as Feature Spaces for central concept in data mining, e.g.: Representation of Images Distances I clustering: group similar objects in the same cluster, Features for Images separate dissimilar objects to different clusters Summary
A First Glimpse on I outlier detection: identify objects that are dissimilar (by Clustering some characteristic) from most other objects References I definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task I images I CAD objects I proteins I texts ... 5 I Spaces and Distance Functions
DM534 Arthur Zimek Common distance measure for (Euclidean) feature vectors: Color Histograms as Feature Spaces for LP-norm Representation of Images 1 Distances P P P P Features for Images distP(p, q) = |p1 − q1| + |p2 − q2| + ... + |pn − qn| Summary
A First Glimpse on Clustering References Euclidean norm Manhattan norm Maximum norm (L2): (L1): (L∞, also: Lmax, supremum dist., Chebyshev dist.)
6 Spaces and Distance Functions
DM534 Arthur Zimek weighted Euclidean norm: quadratic form*: Color Histograms as 2 Feature Spaces for dist(p, q) = w1|p1 − q1| + dist(p, q) = Representation of 1 1 Images T 2 2 2 2 Distances w2|p2 − q2| + ... + wn|pn − qn| ((p − q)M(p − q) ) Features for Images Summary
A First Glimpse on Clustering
References
*note that we assume vectors to be row vectors here
7 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary
A First Glimpse on Clustering A First Glimpse on Clustering References
8 Categories of Feature Descriptors for Images
DM534
Arthur Zimek
Color Histograms as I distribution of colors Feature Spaces for Representation of Images I texture Distances Features for Images I shapes (contoures) Summary
A First Glimpse on Clustering
References
9 Color Histogram
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References I a histogram represents the distribution of colors over the pixels of an image I definition of an color histogram: I choose a color space (RGB, HSV, HLS, . . . ) I choose number of representants (sample points) in the color space I possibly normalization (to account for different image sizes)
10 Color Space Example: RGB cube
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
11 Impact of Number of Representants
DM534 Arthur Zimek original images in full RGB space (2563 = 16, 777, 216) Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
12 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
23 33
43 163 13 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
23 33
43 163 14 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
23 33
43 163 15 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
23 33
43 163 16 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
The histogram for each image is essentially a visualization of a vector: (0.77, 0, 0, 0, 0.08, 0, 0.15, 0)( 0.8, 0, 0, 0, 0.11, 0, 0.09, 0) (0.9, 0, 0, 0, 0.05, 0, 0.05, 0)( 0.955, 0, 0, 0, 0.045, 0, 0, 0)
17 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
17 Impact of Number of Representants
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary
A First Glimpse on Clustering
References
17 Distances for Color Histograms
DM534 Arthur Zimek Euclidean distance for images P and Q using the color Color Histograms as Feature Spaces for histograms hP and hQ: Representation of Images Distances q Features for Images T Summary dist(P, Q) = (hP − hQ) · (hP − hQ)
A First Glimpse on Clustering
References √ dist(RED, PINK) = 2 √ dist(RED, BLUE) = 2 √ (1, 0, 0)(0, 1, 0)(0, 0, 1) dist(BLUE, PINK) = 2
A ‘psychologic’ distance would consider that red is (in our perception) more similar to pink than to blue.
18 Example for the Distance Computation of Histograms
DM534
Arthur Zimek q Color Histograms as T Feature Spaces for dist(P, Q) = (h − h ) · (h − h ) Representation of P Q P Q Images Distances Features for Images Summary p T A First Glimpse on dist(RED, PINK) = ((1, 0, 0) − (0, 1, 0)) · ((1, 0, 0) − (0, 1, 0)) Clustering p References = (1, −1, 0) · (1, −1, 0)T p = (1 · 1 + (−1) · (−1) + 0 · 0) √ = 2
19 Distances for Color Histograms
DM534 Arthur Zimek Quadratic form with ‘psychological’ similarity matrix Color Histograms as Feature Spaces for 1 a12 ... Representation of Images a21 1 ... ? Distances A = where a (= a ) describe the Features for Images . .. . ij ji Summary . . . A First Glimpse on Clustering ... 1 References subjective similarity of the features i and j in the color histogram: q T distA(P, Q) = (hP − hQ) · A · (hP − hQ)
√ 1 0.9 0 dist(RED, PINK) = 0.2 A0 = 0.9 1 0 √ dist(RED, BLUE) = 2 0 0 1 √ dist( , ) = 20 BLUE PINK 2 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary
A First Glimpse on Clustering A First Glimpse on Clustering References
21 Your Choice of a Distance Measure
DM534 Arthur Zimek There are hundreds of distance functions [Deza and Deza, Color Histograms as Feature Spaces for 2009]. Representation of Images Distances I For time series: DTW, EDR, ERP, LCSS, . . . Features for Images Summary I For texts: Cosine and normalizations A First Glimpse on Clustering I For sets – based on intersection, union, . . . (Jaccard) References I For clusters (single-link, average-link, etc.) I For histograms: histogram intersection, “Earth movers distance”, quadratic forms with color similarity I With normalization: Canberra, . . . I Quadratic forms / bilinear forms: d(x, y) := xT My for some positive (usually symmetric) definite matrix M. Note that: Choosing the appropriate distance function can be seen as a part of “preprocessing”. 22 Summary
DM534
Arthur Zimek
Color Histograms as You learned in this section: Feature Spaces for Representation of Images I distances (Lp-norms, weighted, quadratic form) Distances Features for Images I color histograms as feature (vector) descriptors for Summary
A First Glimpse on images Clustering References I impact of the granularity of color histograms on similarity measures
23 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
24 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
25 Purpose of Clustering
DM534
Arthur Zimek
Color Histograms as I identify a finite number of categories (classes, groups: Feature Spaces for Representation of clusters) in a given dataset Images
A First Glimpse on Clustering I similar objects shall be grouped in the same cluster, General Purpose of Clustering dissimilar objects in different clusters Partitional Clustering Algorithm “similarity” is highly subjective, depending on the Visualization: I Algorithmic Differences application scenario Summary
References
26 A Dataset can be Clustered in Different Meaningful Ways
DM534
Arthur Zimek
Color Histograms as Feature Spaces for Representation of Images
A First Glimpse on Clustering General Purpose of Clustering Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary
References
(Figure from Tan et al. [2006].)
27 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
28 Criteria of Quality: Cohesion and Separation
DM534
Arthur Zimek
Color Histograms as I cohesion: how strong are the cluster objects connected Feature Spaces for Representation of (how similar, pairwise, to each other)? Images
A First Glimpse on Clustering I separation: how well is a cluster separated from other General Purpose of Clustering clusters? Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary
References
small within cluster distances large between cluster distances
29 Optimization of Cohesion
DM534 Arthur Zimek Partitional clustering algorithms partition a dataset into k Color Histograms as Feature Spaces for clusters, typically minimizing some cost function Representation of Images (compactness criterion), i.e., optimizing cohesion. A First Glimpse on Clustering General Purpose of Clustering Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary
References
30 Assumptions for Partitioning Clustering
DM534 Arthur Zimek Central assumptions for approaches in this family are Color Histograms as Feature Spaces for typically: Representation of Images
A First Glimpse on I number k of clusters known (i.e., given as input) Clustering General Purpose of Clustering I clusters are characterized by their compactness Partitional Clustering Algorithm I compactness measured by some distance function Visualization: Algorithmic Differences (e.g., distance of all objects in a cluster from some Summary cluster representative is minimal) References I criterion of compactness typically leads to convex or even spherically shaped clusters
31 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
32 Construction of Central Points: Basics
DM534
Arthur Zimek I objects are points x = (x1,..., xd) in Euclidean vector Color Histograms as d Feature Spaces for space , dist = Euclidean distance (L2) Representation of R Images
A First Glimpse on I centroid µC: mean vector of all points in cluster C Clustering General Purpose of Clustering 12 Partitional Clustering 11 Algorithm Visualization: 10 Algorithmic Differences 9 Summary
References 8 7 6 1 P µCi = · o∈C o 5 |Ci| i 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12 33 Construction of Central Points: Basics
DM534
Arthur Zimek I objects are points x = (x1,..., xd) in Euclidean vector Color Histograms as d Feature Spaces for space , dist = Euclidean distance (L2) Representation of R Images
A First Glimpse on I centroid µC: mean vector of all points in cluster C Clustering General Purpose of Clustering 12 Partitional Clustering 11 Algorithm Visualization: 10 Algorithmic Differences 9 Summary
References 8 7 6 1 P µCi = · o∈C o 5 |Ci| i 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12 33 Construction of Central Points: Basics
DM534
Arthur Zimek 12 Color Histograms as measure of compactness for Feature Spaces for I 11 Representation of Images a cluster C: 10 A First Glimpse on Clustering 9 General Purpose of 2 X 2 Clustering TD (C) = dist(p, µC) 8 Partitional Clustering 7 Algorithm p∈C Visualization: 6 Algorithmic Differences 5 Summary (a.k.a. SSQ: sum of 4 References squares) 3 I measure of compactness for 2 a clustering 1
k 1 2 3 4 5 6 7 8 9 10 11 12 2 X 2 TD (C1, C2,..., Ck) = TD (Ci) i=1
34 Construction of Central Points: Basics
DM534
Arthur Zimek 12 Color Histograms as measure of compactness for Feature Spaces for I 11 Representation of Images a cluster C: 10 A First Glimpse on Clustering 9 General Purpose of 2 X 2 Clustering TD (C) = dist(p, µC) 8 Partitional Clustering 7 Algorithm p∈C Visualization: 6 Algorithmic Differences 5 Summary (a.k.a. SSQ: sum of 4 References squares) 3 I measure of compactness for 2 a clustering 1
k 1 2 3 4 5 6 7 8 9 10 11 12 2 X 2 TD (C1, C2,..., Ck) = TD (Ci) i=1
34 Basic Algorithm [Forgy, 1965, Lloyd, 1982]
DM534
Arthur Zimek Algorithm 2.1 (Clustering by Minimization of Variance)
Color Histograms as Feature Spaces for Representation of I start with k (e.g., randomly selected) points as cluster Images
A First Glimpse on representatives (or with a random partition into k Clustering General Purpose of “clusters”) Clustering Partitional Clustering repeat: Algorithm I Visualization: Algorithmic I assign each point to the closest representative Differences Summary I compute new representatives based on the given References partitions (centroid of the assigned points) I until there is no change in assignment
35 k-means
DM534 Arthur Zimek k-means [MacQueen, 1967] is a variant of the basic Color Histograms as Feature Spaces for algorithm: Representation of Images
A First Glimpse on I a centroid is immediately updated when some point Clustering General Purpose of changes its assignment Clustering Partitional Clustering Algorithm I k-means has very similar properties, but the result now Visualization: Algorithmic depends on the order of data points in the input file Differences Summary
References
Note that: The name “k-means” is often used indifferently for any variant of the basic algorithm, in particular also for Algorithm 2.1 [Forgy, 1965, Lloyd, 1982].
36 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
37 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (6.0, 4.3) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.0, 6.4) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (5.0, 2.7) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.6, 7.4) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (4.0, 3.25) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.75, 8.0) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – Lloyd/Forgy Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign points Feature Spaces for Representation of 11 no change Images
A First Glimpse on 10 convergence! Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
38 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 Centroids Feature Spaces for Representation of 11 (e.g.: from Images
A First Glimpse on 10 previous iteration): Clustering General Purpose of 9 Clustering Partitional Clustering 8 µ ≈ (6.0, 4.3) Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 µ ≈ (5.0, 6.4) 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 assign first point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 assign second point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (5.0, 4.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.75, 7.25) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 assign third point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (4.6, 4.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.7, 8.3) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 assign fourth point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 assing fifth point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (4.0, 3.25) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.75, 8.0) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 possibly more iterations Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm
DM534
Arthur Zimek
Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 possibly more iterations Images
A First Glimpse on 10 convergence Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
39 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 Centroids Feature Spaces for Representation of 11 (e.g.: from Images
A First Glimpse on 10 previous iteration): Clustering General Purpose of 9 Clustering Partitional Clustering 8 µ ≈ (6.0, 4.3) Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 µ ≈ (5.0, 6.4) 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 assign first point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 assign second point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 assign third point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 assign fourth point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (4.0, 8.5) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (4.3, 6.2) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 assign fifth point Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images
A First Glimpse on 10 µ ≈ (10.0, 1.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (4.7, 6.3) Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 possibly more iterations Images
A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order
DM534
Arthur Zimek
Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 possibly more iterations Images
A First Glimpse on 10 convergence Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12
40 Second solution: TD2 ≈ 72.68 2 2 Optimal solution: TD = 54 5
k-means Clustering – Quality
DM534
Arthur Zimek SSQ(µ , p ) = |4 − 10|2 + |3.25 − 1|2 = 36 + 5 1 = 41 1 Color Histograms as 12 1 1 16 16 Feature Spaces for SSQ(µ , p ) = |4 − 2|2 + |3.25 − 3|2 = 4 + 1 = 4 1 Representation of 11 1 2 16 16 Images 2 2 9 9 10 SSQ(µ1, p3) = |4 − 3| + |3.25 − 4| = 1 + 16 = 1 16 A First Glimpse on 9 8 SSQ(µ , p ) = |4 − 1|2 + |3.25 − 5|2 = 9 + 3 1 = 12 1 Clustering 1 4 16 16 8 6µ 7 2 3 General Purpose of 2 TD (C1) = 58 4 Clustering 7 5 Partitional Clustering 6 2 2 1 1 SSQ(µ2, p5) = |6.75 − 7| + |8 − 7| = 16 + 1 = 1 16 Algorithm 5 4 2 2 9 9 SSQ(µ2, p6) = |6.75 − 6| + |8 − 8| = + 0 = Visualization: 3 16 16 Algorithmic 4 µ 2 2 1 1 1 SSQ(µ2, p7) = |6.75 − 7| + |8 − 8| = + 0 = Differences 3 2 16 16 2 2 1 1 Summary 2 SSQ(µ2, p8) = |6.75 − 7| + |8 − 9| = 16 + 1 = 1 16 1 2 3 References 1 TD (C2) = 2 4 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2
2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).
41 2 2 Optimal solution: TD = 54 5
k-means Clustering – Quality
DM534
Arthur Zimek SSQ(µ , p ) = |10 − 10|2 + |1 − 1|2 = 0 Color Histograms as 1 1 12 2 Feature Spaces for TD (C1) = 0 Representation of 11 Images 10 2 2 SSQ(µ2, p2) ≈ |4.7 − 2| + |6.3 − 3| ≈ 18.2 A First Glimpse on 9 8 2 2 Clustering SSQ(µ2, p3) ≈ |4.7 − 3| + |6.3 − 4| ≈ 8.2 8 6 7 2 2 General Purpose of SSQ(µ , p ) ≈ |4.7 − 1| + |6.3 − 5| ≈ 15.4 5 2 4 Clustering 7 µ2 2 2 SSQ(µ2, p5) ≈ |4.7 − 7| + |6.3 − 7| ≈ 5.7 Partitional Clustering 6 SSQ(µ , p ) ≈ |4.7 − 6|2 + |6.3 − 8|2 ≈ 4.6 Algorithm 5 4 2 6 SSQ(µ , p ) ≈ |4.7 − 7|2 + |6.3 − 8|2 ≈ 8.2 Visualization: 4 3 2 7 Algorithmic SSQ(µ , p ) ≈ |4.7 − 7|2 + |6.3 − 9|2 ≈ 12.6 Differences 2 2 7 3 2 Summary 2 TD (C2) ≈ 72.86 µ1 References 1 1 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2 Second solution: TD2 ≈ 72.68
2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).
41 k-means Clustering – Quality
DM534
Arthur Zimek SSQ(µ , p ) = |2 − 2|2 + |4 − 3|2 = 0 + 1 = 1 Color Histograms as 1 2 12 2 2 Feature Spaces for 11 SSQ(µ1, p3) = |2 − 3| + |4 − 4| = 1 + 0 = 1 Representation of 2 2 Images 10 SSQ(µ1, p4) = |2 − 1| + |4 − 5| = 1 + 1 = 2 2 A First Glimpse on 9 8 TD (C1) = 4 Clustering 8 6 7 2 2 19 9 3 General Purpose of SSQ(µ2, p1) = |7.4−10| +|6.6−1| = 6 25 +31 25 = 38 25 Clustering 7 5 SSQ(µ , p ) = |7.4 − 7|2 + |6.6 − 7|2 = 4 + 4 = 8 Partitional Clustering 6 µ2 2 5 25 25 25 2 2 24 24 23 Algorithm 5 4 SSQ(µ2, p6) = |7.4 − 6| + |6.6 − 8| = 1 25 + 1 25 = 3 25 Visualization: µ1 3 SSQ(µ , p ) = |7.4 − 7|2 + |6.6 − 8|2 = 4 + 1 24 = 2 3 Algorithmic 4 2 7 25 25 25 Differences 3 2 2 2 4 19 23 SSQ(µ2, p8) = |7.4 − 7| + |6.6 − 9| = 25 + 5 25 = 5 25 Summary 2 2 2 TD (C2) = 50 References 1 1 5 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2 Second solution: TD2 ≈ 72.68 2 2 Optimal solution: TD = 54 5
2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).
41 Outline
DM534
Arthur Zimek
Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images
A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary
References
42 Discussion
DM534 Arthur Zimek pros Color Histograms as Feature Spaces for Representation of I efficient: O(k · n) per iteration, number of iterations is Images
A First Glimpse on usually in the order of 10. Clustering General Purpose of easy to implement, thus very popular Clustering I Partitional Clustering Algorithm cons Visualization: Algorithmic Differences I k-means converges towards a local minimum Summary
References I k-means (MacQueen-variant) is order-dependent I deteriorates with noise and outliers (all points are used to compute centroids) I clusters need to be convex and of (more or less) equal extension I number k of clusters is hard to determine I strong dependency on initial partition (in result quality as well as runtime) 43 Summary
DM534
Arthur Zimek
Color Histograms as You learned in this section: Feature Spaces for Representation of Images I What is Clustering? A First Glimpse on Clustering I Basic idea for identifying “good” partitions into k clusters General Purpose of Clustering Partitional Clustering I selection of representative points Algorithm Visualization: iterative refinement Algorithmic I Differences Summary I local optimum References I k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]
44 ReferencesI
DM534 Arthur Zimek M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 3rd edition, 2009. Color Histograms as Feature Spaces for ISBN 9783662443415. Representation of Images E. W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability A First Glimpse on of classifications. Biometrics, 21:768–769, 1965. Clustering
References S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982. doi: 10.1109/TIT.1982.1056489. J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematics, Statistics, and Probabilistics, volume 1, pages 281–297, 1967. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.
45