Performance Functions and Clustering Algorithms

Wesam Barbakh and Colin Fyfe

Abstract. We investigate the effect of different performance functions for measuring the performance of clustering algorithms and derive different algorithms depending on which performance function is used. In particular, we show that two algorithms may be derived which do not exhibit the dependence on initial conditions (and hence the tendency to get stuck in local optima) that the standard K-Means algorithm exhibits.

INTRODUCTION

The K-Means algorithm is one of the most frequently used investigatory algorithms in data analysis. The algorithm attempts to locate K prototypes or means throughout a data set in such a way that the K prototypes in some way best represent the data. It is one of the first algorithms which a data analyst will use to investigate a new data set because it is algorithmically simple, relatively robust and gives `good enough' answers over a wide variety of data sets: it will often not be the single best algorithm on any individual data set, but it will be close to the optimal over a wide range of data sets.

However, the algorithm is known to suffer from the defect that the means or prototypes found depend on the initial values given to them at the start of the simulation. There are a number of heuristics in the literature which attempt to address this issue but, at heart, the fault lies in the performance function on which K-Means is based. Recently, there have been several investigations of alternative performance functions for clustering algorithms. One of the most effective updates of K-Means has been K-Harmonic Means (Zhang et al., 1999), which minimises

J_{HM} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{\| x_i - m_k \|^2}}

for data samples {x_1, ..., x_N} and prototypes {m_1, ..., m_K}. This performance function can be shown to be minimised when

m_k = \frac{\sum_{i=1}^{N} \frac{x_i}{d_{ik}^4 \left( \sum_{l=1}^{K} 1/d_{il}^2 \right)^2}}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} 1/d_{il}^2 \right)^2}}

where d_{ik} = \| x_i - m_k \|.

In this paper, we investigate two alternative performance functions and show the effect the different functions have on the effectiveness of the resulting algorithms. We are specifically interested in developing algorithms which are effective in a worst-case scenario: when the prototypes are initialised at the same position, very far from the data points. If an algorithm can cope with this scenario, it should be able to cope with a more benevolent initialisation.
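For concreteness, the K-Harmonic Means recentring step above can be written in a few lines of NumPy. The following is a minimal sketch; the function name khm_update and the eps safeguard against zero distances are illustrative additions, with the data X and prototypes M stored row-wise.

import numpy as np

def khm_update(X, M, eps=1e-12):
    # One K-Harmonic Means recentring step:
    #   m_k = sum_i w_ik x_i / sum_i w_ik,
    #   w_ik = 1 / (d_ik^4 * (sum_l 1/d_il^2)^2),  d_ik = ||x_i - m_k||
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps   # (N, K)
    s = (1.0 / d**2).sum(axis=1, keepdims=True)                       # (N, 1)
    w = 1.0 / (d**4 * s**2)                                           # (N, K)
    return (w.T @ X) / w.sum(axis=0)[:, None]                         # (K, d)

Iterating this step from any initialisation is the K-Harmonic Means algorithm used for comparison in the simulations below.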

PERFORMANCE FUNCTIONS FOR CLUSTERING

The performance function for K-Means may be written as

J_K = \sum_{i=1}^{N} \min_{k \in \{1,\dots,M\}} \| X_i - m_k \|^2     (1)

which we wish to minimise by moving the prototypes to the appropriate positions. Note that (1) detects only the centres closest to data points and then distributes them to give the minimum performance, which determines the clustering. Any prototype which is still far from the data is not utilised and does not enter any calculation that gives the minimum performance; this may result in dead prototypes, prototypes which are never appropriate for any cluster. Thus initialising the centres appropriately can have a big effect on K-Means.

We can illustrate this effect with the following toy example: assume we have 3 data points (X_1, X_2 and X_3) and 3 prototypes (m_1, m_2 and m_3). Let d_{i,k} = \| X_i - m_k \|^2 and consider the situation in which

m_1 is closest to X_1
m_1 is closest to X_2
m_2 is closest to X_3

and so m_3 is not closest to any data point:

        m_1       m_2       m_3
X_1     d_{1,1}   d_{1,2}   d_{1,3}
X_2     d_{2,1}   d_{2,2}   d_{2,3}
X_3     d_{3,1}   d_{3,2}   d_{3,3}

Then

Perf = J_K = d_{1,1} + d_{2,1} + d_{3,2}

which we minimise by changing the positions of the prototypes m_1, m_2 and m_3. Then

\frac{\partial Perf}{\partial m_1} = \frac{\partial d_{1,1}}{\partial m_1} + \frac{\partial d_{2,1}}{\partial m_1} \neq 0

\frac{\partial Perf}{\partial m_2} = 0 + 0 + \frac{\partial d_{3,2}}{\partial m_2} \neq 0

\frac{\partial Perf}{\partial m_3} = 0 + 0 + 0 = 0

So it is possible to find new locations for m_1 and m_2 which minimise the performance function and hence determine the clustering, but it is not possible to find a new location for prototype m_3: it is far from the data and is not used as a minimum for any data point.
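The dead-prototype problem is easy to reproduce numerically. The sketch below is a plain batch K-Means (Lloyd) loop on a tiny invented data set with one prototype placed far from the data; because that prototype is never the closest to any point, it is never updated. The data values and the function name kmeans are illustrative choices, not taken from the paper.

import numpy as np

def kmeans(X, M, iters=50):
    # Batch K-Means: assign each point to its closest prototype, then move
    # each prototype to the mean of its assigned points. A prototype that
    # wins no points receives no update -- a "dead" prototype.
    M = M.copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # (N, K)
        assign = d.argmin(axis=1)
        for k in range(len(M)):
            if np.any(assign == k):
                M[k] = X[assign == k].mean(axis=0)
    return M

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
M0 = np.array([[0.0, 0.1], [5.0, 5.1], [100.0, 100.0]])   # third prototype far from the data
print(kmeans(X, M0))   # the third row never moves from (100, 100)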

We might consider instead the following performance function:

J_A = \sum_{i=1}^{N} \sum_{L=1}^{M} \| X_i - m_L \|^2     (2)

which provides a relationship between all the data points and all the prototypes, but it does not provide clustering at minimum performance, since

\frac{\partial J_A}{\partial m_k} = -2 \sum_{i=1}^{N} (X_i - m_k)

and so

\frac{\partial J_A}{\partial m_k} = 0 \;\Rightarrow\; m_k = \frac{1}{N} \sum_{i=1}^{N} X_i

Minimising this performance function therefore groups all the prototypes at the centre of the data set regardless of their initial positions, which is useless for the identification of clusters.
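A quick numerical check of this collapse: gradient descent on J_A drives every prototype to the sample mean whatever its starting point. A small sketch with arbitrary data and step size:

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 5.0]])
M = np.array([[10.0, 10.0], [-7.0, 3.0], [0.0, 0.0]])   # arbitrary initial prototypes

for _ in range(200):
    grad = -2.0 * (X.sum(axis=0) - len(X) * M)   # dJ_A/dm_k = -2 sum_i (x_i - m_k)
    M = M - 0.05 * grad

print(M)               # all three rows converge to the data mean
print(X.mean(axis=0))  # [1.6667 1.6667]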

A combined performance function

We wish to form a performance function with the following properties:
- its minimum gives a good clustering;
- it creates a relationship between all data points and all prototypes.

(2) provides an attempt to reduce the sensitivity to the centres' initialisation by making a relationship between all data points and all centres, while (1) provides an attempt to cluster the data points at minimum performance. Therefore it may seem that what we want is to combine features of (1) and (2) in a performance function such as

J_1 = \sum_{i=1}^{N} \left( \sum_{j=1}^{M} \| X_i - m_j \| \right) \min_{k} \| X_i - m_k \|^2     (3)

We derive the clustering algorithm associated with this performance function by calculating the partial derivatives of (3) with respect to the prototypes. Consider the presentation of a specific data point X_a and the prototype m_k closest to X_a, i.e.

\| X_a - m_k \|^2 = \min_j \| X_a - m_j \|^2

Then

\frac{\partial Perf(i=a)}{\partial m_k} = -\frac{X_a - m_k}{\| X_a - m_k \|} \| X_a - m_k \|^2 - \left( \| X_a - m_1 \| + \dots + \| X_a - m_k \| + \dots + \| X_a - m_M \| \right) 2 (X_a - m_k)

= -(X_a - m_k) \left[ \| X_a - m_k \| + 2 \left( \| X_a - m_1 \| + \dots + \| X_a - m_k \| + \dots + \| X_a - m_M \| \right) \right]

so that

\frac{\partial Perf(i=a)}{\partial m_k} = -(X_a - m_k) A_{ak}     (4)

where

A_{ak} = \| X_a - m_k \| + 2 \left( \| X_a - m_1 \| + \dots + \| X_a - m_M \| \right)

Now consider a second data point X_b for which m_k is not the closest prototype, i.e. the min() function returns the distance to a prototype other than m_k:

\min_j \| X_b - m_j \|^2 = \| X_b - m_r \|^2, with r \neq k.

Then

\frac{\partial Perf(i=b)}{\partial m_k} = -\frac{X_b - m_k}{\| X_b - m_k \|} \| X_b - m_r \|^2

so that

\frac{\partial Perf(i=b)}{\partial m_k} = -(X_b - m_k) B_{bk}     (5)

where

B_{bk} = \frac{\| X_b - m_r \|^2}{\| X_b - m_k \|}

For the algorithm, the partial derivative with respect to m_k over all data points is therefore based on (4), or on (5), or on both. Consider the specific situation in which m_k is closest to X_2 but not closest to X_1 or X_3. Then we have

\frac{\partial Perf}{\partial m_k} = -(X_1 - m_k) B_{1k} - (X_2 - m_k) A_{2k} - (X_3 - m_k) B_{3k}

Setting this to 0 and solving for m_k gives

m_k = \frac{X_1 B_{1k} + X_2 A_{2k} + X_3 B_{3k}}{B_{1k} + A_{2k} + B_{3k}}     (6)
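Equation (6) is written for this particular three-point configuration, but the same recipe applies to every prototype: each m_k becomes a weighted mean of all the data points, with weight A_ik when m_k is the closest prototype to X_i and weight B_ik otherwise. The NumPy sketch below is one reading of this update (algorithm 1); the function name algorithm1_step and the eps safeguard are illustrative additions rather than part of the original derivation.

import numpy as np

def algorithm1_step(X, M, eps=1e-12):
    # One fixed-point update of algorithm 1. Weight of point i for prototype k:
    #   A_ik = d_ik + 2 * sum_j d_ij         if m_k is the closest prototype to x_i
    #   B_ik = (min_j d_ij)^2 / d_ik         otherwise
    # with d_ij = ||x_i - m_j||; then m_k = sum_i w_ik x_i / sum_i w_ik.
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps   # (N, K)
    closest = d.argmin(axis=1)
    W = d.min(axis=1, keepdims=True) ** 2 / d                         # B_ik everywhere
    A = d + 2.0 * d.sum(axis=1, keepdims=True)                        # A_ik everywhere
    rows = np.arange(len(X))
    W[rows, closest] = A[rows, closest]                               # A_ik where m_k is closest
    return (W.T @ X) / W.sum(axis=0)[:, None]

Iterating this step until the prototypes stop moving gives the clustering; the worked example below traces one such update by hand.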

Consider the previous example with 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3), with distances such that m_1 is closest to X_1, m_1 is closest to X_2, and m_2 is closest to X_3. Writing now d_{i,j} = \| X_i - m_j \| for the distances and min_i for the distance from X_i to its closest centre, we have

        m_1       m_2       m_3
X_1     min_1     d_{1,2}   d_{1,3}
X_2     min_2     d_{2,2}   d_{2,3}
X_3     d_{3,1}   min_3     d_{3,3}

Then after training

m_1 = \frac{X_1 A_{11} + X_2 A_{21} + X_3 B_{31}}{A_{11} + A_{21} + B_{31}}

where

A_{11} = min_1 + 2(min_1 + d_{1,2} + d_{1,3})
A_{21} = min_2 + 2(min_2 + d_{2,2} + d_{2,3})
B_{31} = \frac{min_3^2}{d_{3,1}}

m_2 = \frac{X_1 B_{12} + X_2 B_{22} + X_3 A_{32}}{B_{12} + B_{22} + A_{32}}

where

B_{12} = \frac{min_1^2}{d_{1,2}}, \quad B_{22} = \frac{min_2^2}{d_{2,2}}, \quad A_{32} = min_3 + 2(d_{3,1} + min_3 + d_{3,3})

m_3 = \frac{X_1 B_{13} + X_2 B_{23} + X_3 B_{33}}{B_{13} + B_{23} + B_{33}}

where

B_{13} = \frac{min_1^2}{d_{1,3}}, \quad B_{23} = \frac{min_2^2}{d_{2,3}}, \quad B_{33} = \frac{min_3^2}{d_{3,3}}

This algorithm will cluster the data, with the prototypes which are closest to the data points being positioned in such a way that the clusters can be identified. However, there are some prototypes (such as m_3 in the example) which are not sufficiently responsive to the data and so never move to identify a cluster. In fact, as illustrated in the example, these prototypes move to the centre of the data set (actually a weighted centre, as shown above). This may be an advantage in some cases, in that we can easily identify redundancy in the prototypes, but it wastes computational resources unnecessarily.

A second algorithm

To solve this, we need to move these unused prototypes towards the data so that they may become the closest prototype to at least one data sample and thus take advantage of the whole performance function. We do this by changing

B_{bk} = \frac{\| X_b - m_r \|^2}{\| X_b - m_k \|}

in (5) to

B_{bk} = \frac{\| X_b - m_r \|^2}{\| X_b - m_k \|^2}

which allows the centres to move continuously until they are in a position to be closest to some data points. This change allows the algorithm to work very well even in the case that all centres are initialised at the same location, very far from the data points.
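In code, the move from algorithm 1 to algorithm 2 is a one-line change to the weight of the non-closest prototypes. A sketch mirroring algorithm1_step above (again with illustrative naming):

import numpy as np

def algorithm2_step(X, M, eps=1e-12):
    # Identical to algorithm1_step except B_ik = (min_j d_ij)^2 / d_ik^2,
    # so prototypes that are closest to no point keep moving towards the data.
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
    closest = d.argmin(axis=1)
    W = d.min(axis=1, keepdims=True) ** 2 / d ** 2    # the only change: squared denominator
    A = d + 2.0 * d.sum(axis=1, keepdims=True)
    rows = np.arange(len(X))
    W[rows, closest] = A[rows, closest]
    return (W.T @ X) / W.sum(axis=0)[:, None]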

Example:

Assume we have 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3) initialised at the same location.

Note: we assume every data point has a single minimum distance to the centres; in other words, each data point is taken to be closest to one centre only, and we treat the other centres as distant even when they are at the same (minimum) distance. This tie-breaking step is optional, but it is very important if we want the algorithm to work well when all centres are initialised at the same location. Without it, in this example the centres m_1, m_2 and m_3 would all use the same update equation and hence move to the same location.

Let

        m_1   m_2   m_3
X_1     a     a     a
X_2     b     b     b
X_3     c     c     c

so that, with the tie-break, every data point is assigned to m_1. (Note that for three distinct data points, a, b and c will in general not all be equal.)

For algorithm 1, we have

m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}

and similarly, since the weights for m_2 and m_3 are B = a^2/a = a, b^2/b = b and c^2/c = c,

m_2 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}, \quad m_3 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}

Note that all centres go to the same location even though they are calculated using two different types of equation.

For algorithm 2, we have

m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}

while

m_2 = \frac{X_1 (a^2/a^2) + X_2 (b^2/b^2) + X_3 (c^2/c^2)}{a^2/a^2 + b^2/b^2 + c^2/c^2} = \frac{X_1 + X_2 + X_3}{3}

and similarly

m_3 = \frac{X_1 + X_2 + X_3}{3}

Notice that the centre m_1, which is detected by the minimum function, goes to a new location, while the other centres m_2 and m_3 are grouped together at another location. This change to B_{bk} makes separation between the centres possible even if all of them start at the same location.

For algorithm 1, if the new location of m_2 and m_3 is still very far from the data and neither of them is detected as a minimum, the algorithm stops without taking these centres into account in clustering the data. For algorithm 2, if the new location of m_2 and m_3 is still very far from the data points and neither of them is detected as a minimum, the algorithm continues to move these undetected centres towards new locations. Thus algorithm 2 provides a clustering of the data that is insensitive to the initialisation of the centres.
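The hand calculation above is easy to verify numerically. The snippet below uses an arbitrary three-point data set with all centres starting at one far-away location and evaluates the weighted means directly; under algorithm 1 the expressions for m_1 and for m_2, m_3 coincide, while under algorithm 2 the undetected centres land on the plain mean of the data.

import numpy as np

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
m0 = np.array([50.0, 50.0])               # common initial location of m1, m2 and m3
dist = np.linalg.norm(X - m0, axis=1)     # the distances a, b, c

# tie-break: every data point is assigned to m1, so m1 uses the A-weights 7a, 7b, 7c
m1 = (7 * dist[:, None] * X).sum(axis=0) / (7 * dist).sum()

# algorithm 1: m2 and m3 use B = min^2 / d = dist and land exactly where m1 does
m2_alg1 = (dist[:, None] * X).sum(axis=0) / dist.sum()
print(m1, m2_alg1)

# algorithm 2: m2 and m3 use B = min^2 / d^2 = 1, i.e. the plain mean of the data
print(X.mean(axis=0))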

Simulations

We illustrate these algorithms with a few simulations on artificial two-dimensional data sets, since the results are easily visualised. Consider first the data set in Figure 1: the prototypes have all been initialised within one of the four clusters.

Figure 1 Data set is shown as 4 clusters of red '+'s, prototypes are initialised to lie within one cluster and shown as blue '*'s.
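For readers who want to reproduce this kind of experiment, the sketch below generates an artificial data set of four 2-D Gaussian clusters (an assumed layout of cluster centres arranged around the point (2,2) to match the description in the text) and iterates the algorithm-2 update from prototypes initialised inside a single cluster. The update function repeats the earlier sketch so that this block runs on its own; the paper's figures come from the authors' own data, so this is illustrative rather than a reproduction.

import numpy as np

def algorithm2_step(X, M, eps=1e-12):
    # same update as the algorithm-2 sketch above
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
    closest = d.argmin(axis=1)
    W = d.min(axis=1, keepdims=True) ** 2 / d ** 2
    A = d + 2.0 * d.sum(axis=1, keepdims=True)
    W[np.arange(len(X)), closest] = A[np.arange(len(X)), closest]
    return (W.T @ X) / W.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
cluster_centres = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]])   # assumed layout
X = np.vstack([c + 0.15 * rng.standard_normal((10, 2)) for c in cluster_centres])

M = cluster_centres[0] + 0.05 * rng.standard_normal((4, 2))   # all prototypes inside one cluster
for it in range(1, 101):
    M_new = algorithm2_step(X, M)
    if np.max(np.abs(M_new - M)) < 1e-6:
        break
    M = M_new
print(it, np.round(M, 2))   # the prototypes should spread out, one near each cluster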

Figure 2 shows the final positions of the prototypes when K-Means is used: two clusters are not identified.

Figure 2 Prototypes' positions when using K-Means.

Figure 3 shows the final positions of the prototypes when K-Harmonic Means and algorithm 2 are used. K-Harmonic Means takes 5 iterations to find this position while algorithm 2 takes only three iterations. In both cases, all four clusters were reliably and stably identified.

Figure 3 Top: K-Harmonic Means after 5 iterations. Bottom: algorithm 2 after 3 iterations.

Even with a good initialisation (for example with all the prototypes in the centre of the data, around the point (2,2)), K-Means is not guaranteed to find all the clusters; K-Harmonic Means takes 5 iterations to move the prototypes to appropriate positions, while algorithm 2 takes only one iteration to find stable, appropriate positions for the prototypes.

Consider now the situation in which the prototypes are initialised very far from the data; this is an unlikely situation to happen in general, but it may be that all the prototypes are in fact initialised very far from a particular cluster. The question arises as to whether this cluster would be found by any algorithm. We show the initial positions of the data and a set of four prototypes in Figure 4. The four prototypes are in slightly different positions. Figure 5 shows the final positions of the prototypes for K-Harmonic Means and algorithm 2. Again K-Harmonic Means took rather longer (24 iterations) to appropriately position the prototypes than algorithm 2 (5 iterations). K-Means moved all four prototypes to a central location (approximately the point (2,2)) and did not subsequently find the 4 clusters.

Figure 4 The prototypes are positioned very far from the four clusters.

Figure 5 Top: K-Harmonic Means after 24 iterations. Bottom: algorithm 2 after 5 iterations.

Figure 6 Results when prototypes are initialised very far from the data and all in the same position. Top: K-Harmonic Means. Bottom: algorithm 2.

Of course, the above situation was somewhat unrealistic since the number of prototypes was exactly equal to the number of clusters, but we obtain similar results with e.g. 20 prototypes and the same data set. We now go to the other extreme and use the same number of prototypes as data points. We show in Figure 7 a simulation in which we had 40 data points from the same four clusters and 40 prototypes which were initialised to a single location far from the data. The bottom diagram shows that each prototype is eventually located at a data point. This took 123 iterations of the algorithm, which is rather a lot; however, neither K-Means nor K-Harmonic Means performed well: both located every prototype at the centre of the data, i.e. at approximately the point (2,2). Even with a good initialisation (random locations throughout the data set), K-Means was unable to perform as well, typically leaving 28 prototypes redundant in that they moved only to the centre of the data.

Figure 7 Top: Initial prototypes and data. Bottom: after 123 iterations all prototypes are situated on a data point.

CONCLUSION

We have developed a new algorithm for data clustering and have shown that it is clearly superior to K-Means, the standard work-horse of clustering. We have also compared our algorithm with K-Harmonic Means, a state-of-the-art clustering algorithm, and have shown that under typical conditions it is comparable, while under extreme conditions it is superior. Future work will investigate the convergence of these algorithms on real data sets.

REFERENCES

B. Zhang, M. Hsu, U. Dayal, K-Harmonic Means – a data clustering algorithm, Technical Report, HP Palo Alto Laboratories, Oct 1999.
