DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images

A First Glimpse on Clustering

References Introduction to Computer Science

Arthur Zimek

University of Southern

DM534, fall 2020

1 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering

References

2 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary

A First Glimpse on Clustering A First Glimpse on Clustering References

3 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary

A First Glimpse on Clustering A First Glimpse on Clustering References

4 Similarity

DM534

Arthur Zimek I Similarity (as given by some distance measure) is a Color Histograms as Feature Spaces for central concept in , e.g.: Representation of Images Distances I clustering: group similar objects in the same cluster, Features for Images separate dissimilar objects to different clusters Summary

A First Glimpse on I outlier detection: identify objects that are dissimilar (by Clustering some characteristic) from most other objects References I definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task I images I CAD objects I proteins I texts ... 5 I Spaces and Distance Functions

DM534 Arthur Zimek Common distance measure for (Euclidean) feature vectors: Color Histograms as Feature Spaces for LP-norm Representation of Images 1 Distances P P P P Features for Images distP(p, q) = |p1 − q1| + |p2 − q2| + ... + |pn − qn| Summary

A First Glimpse on Clustering References Euclidean norm Manhattan norm Maximum norm (L2): (L1): (L∞, also: Lmax, supremum dist., Chebyshev dist.)

6 Spaces and Distance Functions

DM534 Arthur Zimek weighted Euclidean norm: quadratic form*: Color Histograms as 2 Feature Spaces for dist(p, q) = w1|p1 − q1| + dist(p, q) = Representation of 1 1 Images T 2 2 2 2 Distances w2|p2 − q2| + ... + wn|pn − qn| ((p − q)M(p − q) ) Features for Images Summary

A First Glimpse on Clustering

References

*note that we assume vectors to be row vectors here

7 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary

A First Glimpse on Clustering A First Glimpse on Clustering References

8 Categories of Feature Descriptors for Images

DM534

Arthur Zimek

Color Histograms as I distribution of colors Feature Spaces for Representation of Images I texture Distances Features for Images I shapes (contoures) Summary

A First Glimpse on Clustering

References

9 Color Histogram

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References I a histogram represents the distribution of colors over the pixels of an image I definition of an color histogram: I choose a color space (RGB, HSV, HLS, . . . ) I choose number of representants (sample points) in the color space I possibly normalization (to account for different image sizes)

10 Color Space Example: RGB cube

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

11 Impact of Number of Representants

DM534 Arthur Zimek original images in full RGB space (2563 = 16, 777, 216) Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

12 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

23 33

43 163 13 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

23 33

43 163 14 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

23 33

43 163 15 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

23 33

43 163 16 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

The histogram for each image is essentially a visualization of a vector: (0.77, 0, 0, 0, 0.08, 0, 0.15, 0)( 0.8, 0, 0, 0, 0.11, 0, 0.09, 0) (0.9, 0, 0, 0, 0.05, 0, 0.05, 0)( 0.955, 0, 0, 0, 0.045, 0, 0, 0)

17 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

17 Impact of Number of Representants

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images Distances Features for Images Summary

A First Glimpse on Clustering

References

17 Distances for Color Histograms

DM534 Arthur Zimek Euclidean distance for images P and Q using the color Color Histograms as Feature Spaces for histograms hP and hQ: Representation of Images Distances q Features for Images T Summary dist(P, Q) = (hP − hQ) · (hP − hQ)

A First Glimpse on Clustering

References √ dist(RED, PINK) = 2 √ dist(RED, BLUE) = 2 √ (1, 0, 0)(0, 1, 0)(0, 0, 1) dist(BLUE, PINK) = 2

A ‘psychologic’ distance would consider that red is (in our perception) more similar to pink than to blue.

18 Example for the Distance Computation of Histograms

DM534

Arthur Zimek q Color Histograms as T Feature Spaces for dist(P, Q) = (h − h ) · (h − h ) Representation of P Q P Q Images Distances Features for Images Summary p T A First Glimpse on dist(RED, PINK) = ((1, 0, 0) − (0, 1, 0)) · ((1, 0, 0) − (0, 1, 0)) Clustering p References = (1, −1, 0) · (1, −1, 0)T p = (1 · 1 + (−1) · (−1) + 0 · 0) √ = 2

19 Distances for Color Histograms

DM534 Arthur Zimek Quadratic form with ‘psychological’ similarity matrix Color Histograms as   Feature Spaces for 1 a12 ... Representation of Images a21 1 ...  ? Distances A =   where a (= a ) describe the Features for Images  . .. . ij ji Summary  . . . A First Glimpse on Clustering ... 1 References subjective similarity of the features i and j in the color histogram: q T distA(P, Q) = (hP − hQ) · A · (hP − hQ)

  √ 1 0.9 0 dist(RED, PINK) = 0.2 A0 = 0.9 1 0 √ dist(RED, BLUE) = 2 0 0 1 √ dist( , ) = 20 BLUE PINK 2 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Distances Images Distances Features for Images Features for Images Summary Summary

A First Glimpse on Clustering A First Glimpse on Clustering References

21 Your Choice of a Distance Measure

DM534 Arthur Zimek There are hundreds of distance functions [Deza and Deza, Color Histograms as Feature Spaces for 2009]. Representation of Images Distances I For time series: DTW, EDR, ERP, LCSS, . . . Features for Images Summary I For texts: Cosine and normalizations A First Glimpse on Clustering I For sets – based on intersection, union, . . . (Jaccard) References I For clusters (single-link, average-link, etc.) I For histograms: histogram intersection, “Earth movers distance”, quadratic forms with color similarity I With normalization: Canberra, . . . I Quadratic forms / bilinear forms: d(x, y) := xT My for some positive (usually symmetric) definite matrix M. Note that: Choosing the appropriate distance function can be seen as a part of “preprocessing”. 22 Summary

DM534

Arthur Zimek

Color Histograms as You learned in this section: Feature Spaces for Representation of Images I distances (Lp-norms, weighted, quadratic form) Distances Features for Images I color histograms as feature (vector) descriptors for Summary

A First Glimpse on images Clustering References I impact of the granularity of color histograms on similarity measures

23 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

24 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

25 Purpose of Clustering

DM534

Arthur Zimek

Color Histograms as I identify a finite number of categories (classes, groups: Feature Spaces for Representation of clusters) in a given dataset Images

A First Glimpse on Clustering I similar objects shall be grouped in the same cluster, General Purpose of Clustering dissimilar objects in different clusters Partitional Clustering Algorithm “similarity” is highly subjective, depending on the Visualization: I Algorithmic Differences application scenario Summary

References

26 A Dataset can be Clustered in Different Meaningful Ways

DM534

Arthur Zimek

Color Histograms as Feature Spaces for Representation of Images

A First Glimpse on Clustering General Purpose of Clustering Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary

References

(Figure from Tan et al. [2006].)

27 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

28 Criteria of Quality: Cohesion and Separation

DM534

Arthur Zimek

Color Histograms as I cohesion: how strong are the cluster objects connected Feature Spaces for Representation of (how similar, pairwise, to each other)? Images

A First Glimpse on Clustering I separation: how well is a cluster separated from other General Purpose of Clustering clusters? Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary

References

small within cluster distances large between cluster distances

29 Optimization of Cohesion

DM534 Arthur Zimek Partitional clustering algorithms partition a dataset into k Color Histograms as Feature Spaces for clusters, typically minimizing some cost function Representation of Images (compactness criterion), i.e., optimizing cohesion. A First Glimpse on Clustering General Purpose of Clustering Partitional Clustering Algorithm Visualization: Algorithmic Differences Summary

References

30 Assumptions for Partitioning Clustering

DM534 Arthur Zimek Central assumptions for approaches in this family are Color Histograms as Feature Spaces for typically: Representation of Images

A First Glimpse on I number k of clusters known (i.e., given as input) Clustering General Purpose of Clustering I clusters are characterized by their compactness Partitional Clustering Algorithm I compactness measured by some distance function Visualization: Algorithmic Differences (e.g., distance of all objects in a cluster from some Summary cluster representative is minimal) References I criterion of compactness typically leads to convex or even spherically shaped clusters

31 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

32 Construction of Central Points: Basics

DM534

Arthur Zimek I objects are points x = (x1,..., xd) in Euclidean vector Color Histograms as d Feature Spaces for space , dist = Euclidean distance (L2) Representation of R Images

A First Glimpse on I centroid µC: mean vector of all points in cluster C Clustering General Purpose of Clustering 12 Partitional Clustering 11 Algorithm Visualization: 10 Algorithmic Differences 9 Summary

References 8 7 6 1 P µCi = · o∈C o 5 |Ci| i 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12 33 Construction of Central Points: Basics

DM534

Arthur Zimek I objects are points x = (x1,..., xd) in Euclidean vector Color Histograms as d Feature Spaces for space , dist = Euclidean distance (L2) Representation of R Images

A First Glimpse on I centroid µC: mean vector of all points in cluster C Clustering General Purpose of Clustering 12 Partitional Clustering 11 Algorithm Visualization: 10 Algorithmic Differences 9 Summary

References 8 7 6 1 P µCi = · o∈C o 5 |Ci| i 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12 33 Construction of Central Points: Basics

DM534

Arthur Zimek 12 Color Histograms as measure of compactness for Feature Spaces for I 11 Representation of Images a cluster C: 10 A First Glimpse on Clustering 9 General Purpose of 2 X 2 Clustering TD (C) = dist(p, µC) 8 Partitional Clustering 7 Algorithm p∈C Visualization: 6 Algorithmic Differences 5 Summary (a.k.a. SSQ: sum of 4 References squares) 3 I measure of compactness for 2 a clustering 1

k 1 2 3 4 5 6 7 8 9 10 11 12 2 X 2 TD (C1, C2,..., Ck) = TD (Ci) i=1

34 Construction of Central Points: Basics

DM534

Arthur Zimek 12 Color Histograms as measure of compactness for Feature Spaces for I 11 Representation of Images a cluster C: 10 A First Glimpse on Clustering 9 General Purpose of 2 X 2 Clustering TD (C) = dist(p, µC) 8 Partitional Clustering 7 Algorithm p∈C Visualization: 6 Algorithmic Differences 5 Summary (a.k.a. SSQ: sum of 4 References squares) 3 I measure of compactness for 2 a clustering 1

k 1 2 3 4 5 6 7 8 9 10 11 12 2 X 2 TD (C1, C2,..., Ck) = TD (Ci) i=1

34 Basic Algorithm [Forgy, 1965, Lloyd, 1982]

DM534

Arthur Zimek Algorithm 2.1 (Clustering by Minimization of Variance)

Color Histograms as Feature Spaces for Representation of I start with k (e.g., randomly selected) points as cluster Images

A First Glimpse on representatives (or with a random partition into k Clustering General Purpose of “clusters”) Clustering Partitional Clustering repeat: Algorithm I Visualization: Algorithmic I assign each point to the closest representative Differences Summary I compute new representatives based on the given References partitions (centroid of the assigned points) I until there is no change in assignment

35 k-means

DM534 Arthur Zimek k-means [MacQueen, 1967] is a variant of the basic Color Histograms as Feature Spaces for algorithm: Representation of Images

A First Glimpse on I a centroid is immediately updated when some point Clustering General Purpose of changes its assignment Clustering Partitional Clustering Algorithm I k-means has very similar properties, but the result now Visualization: Algorithmic depends on the order of data points in the input file Differences Summary

References

Note that: The name “k-means” is often used indifferently for any variant of the basic algorithm, in particular also for Algorithm 2.1 [Forgy, 1965, Lloyd, 1982].

36 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

37 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (6.0, 4.3) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.0, 6.4) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (5.0, 2.7) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.6, 7.4) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (4.0, 3.25) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.75, 8.0) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign points Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – Lloyd/Forgy Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign points Feature Spaces for Representation of 11 no change Images

A First Glimpse on 10 convergence! Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

38 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 Centroids Feature Spaces for Representation of 11 (e.g.: from Images

A First Glimpse on 10 previous iteration): Clustering General Purpose of 9 Clustering Partitional Clustering 8 µ ≈ (6.0, 4.3) Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 µ ≈ (5.0, 6.4) 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 assign first point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 assign second point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (5.0, 4.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (5.75, 7.25) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 assign third point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (4.6, 4.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.7, 8.3) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 assign fourth point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 assing fifth point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (4.0, 3.25) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (6.75, 8.0) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 possibly more iterations Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm

DM534

Arthur Zimek

Color Histograms as 12 reassign more points Feature Spaces for Representation of 11 possibly more iterations Images

A First Glimpse on 10 convergence Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

39 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 Centroids Feature Spaces for Representation of 11 (e.g.: from Images

A First Glimpse on 10 previous iteration): Clustering General Purpose of 9 Clustering Partitional Clustering 8 µ ≈ (6.0, 4.3) Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 µ ≈ (5.0, 6.4) 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 assign first point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 assign second point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 assign third point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 assign fourth point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (4.0, 8.5) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (4.3, 6.2) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 assign fifth point Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 recompute centroids: Feature Spaces for Representation of 11 Images

A First Glimpse on 10 µ ≈ (10.0, 1.0) Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: µ ≈ (4.7, 6.3) Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 possibly more iterations Images

A First Glimpse on 10 Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 k-means Clustering – MacQueen Algorithm Alternative Run – Different Order

DM534

Arthur Zimek

Color Histograms as 12 reasign more points Feature Spaces for Representation of 11 possibly more iterations Images

A First Glimpse on 10 convergence Clustering General Purpose of 9 Clustering Partitional Clustering 8 Algorithm 7 Visualization: Algorithmic 6 Differences Summary 5 References 4 3 2 1

1 2 3 4 5 6 7 8 9 10 11 12

40 Second solution: TD2 ≈ 72.68 2 2 Optimal solution: TD = 54 5

k-means Clustering – Quality

DM534

Arthur Zimek SSQ(µ , p ) = |4 − 10|2 + |3.25 − 1|2 = 36 + 5 1 = 41 1 Color Histograms as 12 1 1 16 16 Feature Spaces for SSQ(µ , p ) = |4 − 2|2 + |3.25 − 3|2 = 4 + 1 = 4 1 Representation of 11 1 2 16 16 Images 2 2 9 9 10 SSQ(µ1, p3) = |4 − 3| + |3.25 − 4| = 1 + 16 = 1 16 A First Glimpse on 9 8 SSQ(µ , p ) = |4 − 1|2 + |3.25 − 5|2 = 9 + 3 1 = 12 1 Clustering 1 4 16 16 8 6µ 7 2 3 General Purpose of 2 TD (C1) = 58 4 Clustering 7 5 Partitional Clustering 6 2 2 1 1 SSQ(µ2, p5) = |6.75 − 7| + |8 − 7| = 16 + 1 = 1 16 Algorithm 5 4 2 2 9 9 SSQ(µ2, p6) = |6.75 − 6| + |8 − 8| = + 0 = Visualization: 3 16 16 Algorithmic 4 µ 2 2 1 1 1 SSQ(µ2, p7) = |6.75 − 7| + |8 − 8| = + 0 = Differences 3 2 16 16 2 2 1 1 Summary 2 SSQ(µ2, p8) = |6.75 − 7| + |8 − 9| = 16 + 1 = 1 16 1 2 3 References 1 TD (C2) = 2 4 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2

2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).

41 2 2 Optimal solution: TD = 54 5

k-means Clustering – Quality

DM534

Arthur Zimek SSQ(µ , p ) = |10 − 10|2 + |1 − 1|2 = 0 Color Histograms as 1 1 12 2 Feature Spaces for TD (C1) = 0 Representation of 11 Images 10 2 2 SSQ(µ2, p2) ≈ |4.7 − 2| + |6.3 − 3| ≈ 18.2 A First Glimpse on 9 8 2 2 Clustering SSQ(µ2, p3) ≈ |4.7 − 3| + |6.3 − 4| ≈ 8.2 8 6 7 2 2 General Purpose of SSQ(µ , p ) ≈ |4.7 − 1| + |6.3 − 5| ≈ 15.4 5 2 4 Clustering 7 µ2 2 2 SSQ(µ2, p5) ≈ |4.7 − 7| + |6.3 − 7| ≈ 5.7 Partitional Clustering 6 SSQ(µ , p ) ≈ |4.7 − 6|2 + |6.3 − 8|2 ≈ 4.6 Algorithm 5 4 2 6 SSQ(µ , p ) ≈ |4.7 − 7|2 + |6.3 − 8|2 ≈ 8.2 Visualization: 4 3 2 7 Algorithmic SSQ(µ , p ) ≈ |4.7 − 7|2 + |6.3 − 9|2 ≈ 12.6 Differences 2 2 7 3 2 Summary 2 TD (C2) ≈ 72.86 µ1 References 1 1 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2 Second solution: TD2 ≈ 72.68

2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).

41 k-means Clustering – Quality

DM534

Arthur Zimek SSQ(µ , p ) = |2 − 2|2 + |4 − 3|2 = 0 + 1 = 1 Color Histograms as 1 2 12 2 2 Feature Spaces for 11 SSQ(µ1, p3) = |2 − 3| + |4 − 4| = 1 + 0 = 1 Representation of 2 2 Images 10 SSQ(µ1, p4) = |2 − 1| + |4 − 5| = 1 + 1 = 2 2 A First Glimpse on 9 8 TD (C1) = 4 Clustering 8 6 7 2 2 19 9 3 General Purpose of SSQ(µ2, p1) = |7.4−10| +|6.6−1| = 6 25 +31 25 = 38 25 Clustering 7 5 SSQ(µ , p ) = |7.4 − 7|2 + |6.6 − 7|2 = 4 + 4 = 8 Partitional Clustering 6 µ2 2 5 25 25 25 2 2 24 24 23 Algorithm 5 4 SSQ(µ2, p6) = |7.4 − 6| + |6.6 − 8| = 1 25 + 1 25 = 3 25 Visualization: µ1 3 SSQ(µ , p ) = |7.4 − 7|2 + |6.6 − 8|2 = 4 + 1 24 = 2 3 Algorithmic 4 2 7 25 25 25 Differences 3 2 2 2 4 19 23 SSQ(µ2, p8) = |7.4 − 7| + |6.6 − 9| = 25 + 5 25 = 5 25 Summary 2 2 2 TD (C2) = 50 References 1 1 5 1 2 3 4 5 6 7 8 9 101112 2 1 First solution: TD = 61 2 Second solution: TD2 ≈ 72.68 2 2 Optimal solution: TD = 54 5

2 2 Note: SSQ(µ, p) = Euclidean(µ, p) = L2(µ, p).

41 Outline

DM534

Arthur Zimek

Color Histograms as Color Histograms as Feature Spaces for Representation of Images Feature Spaces for Representation of Images

A First Glimpse on A First Glimpse on Clustering Clustering General Purpose of General Purpose of Clustering Clustering Partitional Clustering Partitional Clustering Algorithm Algorithm Visualization: Algorithmic Visualization: Algorithmic Differences Differences Summary Summary

References

42 Discussion

DM534 Arthur Zimek pros Color Histograms as Feature Spaces for Representation of I efficient: O(k · n) per iteration, number of iterations is Images

A First Glimpse on usually in the order of 10. Clustering General Purpose of easy to implement, thus very popular Clustering I Partitional Clustering Algorithm cons Visualization: Algorithmic Differences I k-means converges towards a local minimum Summary

References I k-means (MacQueen-variant) is order-dependent I deteriorates with noise and outliers (all points are used to compute centroids) I clusters need to be convex and of (more or less) equal extension I number k of clusters is hard to determine I strong dependency on initial partition (in result quality as well as runtime) 43 Summary

DM534

Arthur Zimek

Color Histograms as You learned in this section: Feature Spaces for Representation of Images I What is Clustering? A First Glimpse on Clustering I Basic idea for identifying “good” partitions into k clusters General Purpose of Clustering Partitional Clustering I selection of representative points Algorithm Visualization: iterative refinement Algorithmic I Differences Summary I local optimum References I k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]

44 ReferencesI

DM534 Arthur Zimek M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 3rd edition, 2009. Color Histograms as Feature Spaces for ISBN 9783662443415. Representation of Images E. W. Forgy. of multivariate data: efficiency versus interpretability A First Glimpse on of classifications. Biometrics, 21:768–769, 1965. Clustering

References S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982. doi: 10.1109/TIT.1982.1056489. J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematics, Statistics, and Probabilistics, volume 1, pages 281–297, 1967. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.

45