LETTER Communicated by Andrew Barto

A Reinforcement Learning Approach to Online Clustering

Aristidis Likas
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece

A general technique is proposed for embedding online clustering algorithms based on competitive learning in a reinforcement learning framework. The basic idea is that the clustering system can be viewed as a reinforcement learning system that learns through reinforcements to follow the clustering strategy we wish to implement. In this sense, the reinforcement guided competitive learning (RGCL) algorithm is proposed, which constitutes a reinforcement-based adaptation of learning vector quantization (LVQ) with enhanced clustering capabilities. In addition, we suggest extensions of RGCL and LVQ that are characterized by the property of sustained exploration and significantly improve the performance of those algorithms, as indicated by experimental tests on well-known data sets.

1 Introduction

Many pattern recognition and data analysis tasks assume no prior class information about the data to be used. Pattern clustering belongs to this category and aims at organizing the data into categories (clusters) so that patterns within a cluster are more similar to each other (in terms of an appropriate distance metric) than patterns belonging to different clusters. To achieve this objective, many clustering strategies are parametric and operate by defining a clustering criterion and then trying to determine the optimal allocation of patterns to clusters with respect to that criterion. In most cases such strategies are iterative and operate online; patterns are considered one at a time, and, based on the distance of the pattern from the cluster centers, the parameters of the clusters are adjusted according to the clustering strategy. In this article, we present an approach to online clustering that treats competitive learning as a reinforcement learning problem. More specifically, we consider partitional clustering (or hard clustering or vector quantization), where the objective is to organize the patterns into a small number of clusters such that each pattern belongs exclusively to one cluster.

Reinforcement learning constitutes an intermediate learning paradigm that lies between supervised learning (with complete class information available) and unsupervised learning (with no available class information). The training information provided to the learning system by the environment (external teacher) is in the form of a scalar reinforcement signal $r$ that constitutes a measure of how well the system operates. The main idea of this article is that the clustering system does not directly implement a prespecified clustering strategy (for example, competitive learning) but instead tries to learn to follow the clustering strategy using the suitably computed reinforcements provided by the environment. In other words, the external environment rewards or penalizes the learning system depending on how well it learns to apply the clustering strategy we have selected to follow. This approach will be formally defined in the following sections and leads to the development of clustering algorithms that exploit the stochasticity inherent in a reinforcement learning system and are therefore more flexible (they do not get easily trapped in local minima) than the original clustering procedures. The proposed technique can be applied with any online hard clustering strategy and suggests a novel way to implement the strategy (update equations for cluster centers).

In addition, we present an extension of the approach that is based on the sustained exploration property, which can be obtained by a minor modification to the reinforcement update equations and gives the algorithms the ability to escape from local minima.

In the next section we provide a formal definition of online hard clustering as a reinforcement learning problem and present reinforcement learning equations for the update of the cluster centers. The equations are based on the family of REINFORCE algorithms, which have been shown to exhibit stochastic hill-climbing properties (Williams, 1992). Section 3 describes the reinforcement guided competitive learning (RGCL) algorithm, which constitutes a stochastic version of the learning vector quantization (LVQ) algorithm. Section 4 discusses issues concerning sustained exploration and the adaptation of the reinforcement learning equations to achieve continuous search of the parameter space, section 5 presents experimental results and several comparisons using well-known data sets, and section 6 summarizes the article and provides future research directions.

2 Clustering as a Reinforcement Learning Problem

2.1 Online Competitive Learning. Suppose we are given a sequence $X = (x_1, \ldots, x_N)$ of unlabeled data $x_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$ and want to assign each of them to one of $L$ clusters. Each cluster $i$ is described by a prototype vector $w_i = (w_{i1}, \ldots, w_{ip})^\top$ ($i = 1, \ldots, L$), and let $W = (w_1, \ldots, w_L)$. Also let $d(x, w)$ denote the distance metric based on which the clustering is performed. In the case of hard clustering, most methods attempt to find good clusters by minimizing a suitably defined objective function $J(W)$. We restrict ourselves here to techniques based on competitive learning, where the objective function is (Kohonen, 1989; Hathaway & Bezdek, 1995)

$$J = \sum_{i=1}^{N} \min_{r} d(x_i, w_r). \tag{2.1}$$
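For concreteness, the following sketch (an illustration only, assuming squared Euclidean distance and NumPy arrays; the function name is not from the text) evaluates the objective $J$ for a given prototype matrix:

```python
import numpy as np

def objective_J(X, W):
    """Hard-clustering objective of equation 2.1: for every pattern, take the
    distance to its closest prototype and sum over all patterns.
    X is an (N, p) data matrix and W an (L, p) matrix of prototype vectors;
    squared Euclidean distance is assumed here."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (N, L) pairwise distances
    return d.min(axis=1).sum()                              # sum of per-pattern minima
```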

The clustering strategy of competitive learning techniques can be summarized as follows:

1. Randomly take a sample $x_i$ from $X$.

2. Compute the distances $d(x_i, w_j)$ for $j = 1, \ldots, L$ and locate the winning prototype $j^\star$, that is, the one with minimum distance from $x_i$.

3. Update the weights $w_{ij}$ so that the winning prototype $w_{j^\star}$ moves toward pattern $x_i$.

4. Go to step 1.

Depending on what happens in step 3 with the nonwinning prototypes, several competitive learning schemes have been proposed, such as LVQ (or adaptive k-means) (Kohonen, 1989), RPCL (rival penalized competitive learning) (Xu, Krzyzak, & Oja, 1993), the SOM network (Kohonen, 1989), the "neural-gas" network (Martinez, Berkovich, & Schulten, 1993), and others. Moreover, in step 2, some approaches, such as frequency sensitive competitive learning (FSCL), assume that the winning prototype minimizes a function of the distance $d(x, w)$ rather than the distance itself.
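To fix ideas, the following sketch implements the generic loop of steps 1 through 4 with the simplest choice in step 3, namely that only the winner moves toward the pattern (the LVQ rule discussed in section 3); the squared Euclidean distance, the fixed learning rate, and the initialization of prototypes on randomly chosen data points are illustrative assumptions rather than prescriptions from the text:

```python
import numpy as np

def online_competitive_learning(X, L, a=0.01, passes=10, rng=None):
    """Generic online competitive learning: repeatedly pick a pattern, locate the
    winning (closest) prototype, and move that prototype toward the pattern."""
    rng = np.random.default_rng() if rng is None else rng
    W = X[rng.choice(len(X), size=L, replace=False)].astype(float)  # initial prototypes
    for _ in range(passes):
        for i in rng.permutation(len(X)):        # step 1: take samples in random order
            x = X[i]
            d = ((W - x) ** 2).sum(axis=1)       # step 2: distances to all prototypes
            j_star = int(d.argmin())             # winning prototype j*
            W[j_star] += a * (x - W[j_star])     # step 3: move only the winner toward x
    return W                                     # step 4: repeat until the passes are exhausted
```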

2.2 Immediate Reinforcement Learning. In the framework of reinforcement learning, a system accepts inputs from the environment, responds by selecting appropriate actions (decisions), and the environment evaluates the decisions by sending a rewarding or penalizing scalar reinforcement signal. According to the value of the received reinforcement, the learning system updates its parameters so that good decisions become more likely to be made in the future and bad decisions become less likely to occur (Kaelbling, Littman, & Moore, 1996). A simple special case is immediate reinforcement learning, where the reinforcement signal is received at every step immediately after the decision has been made.

In order for the learning system to be able to search for the best decision corresponding to each input, a stochastic exploration mechanism is frequently necessary. For this reason many reinforcement learning algorithms apply to neural networks of stochastic units. These units draw their outputs from some probability distribution, employing either one or many parameters. These parameters depend on the inputs and the network weights and are updated at each step to achieve the learning task. A special case, which is of interest to our approach, is when the output of each unit is discrete and, more specifically, is either one or zero, depending on a single parameter $p \in [0, 1]$. This type of stochastic unit is called the Bernoulli unit (Barto & Anandan, 1985; Williams, 1988, 1992).

Several training algorithms have been developed for immediate reinforcement problems. We have used the family of REINFORCE algorithms, in which the parameters $w_{ij}$ of the stochastic unit $i$ with input $x$ are updated as

$$\Delta w_{ij} = a (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}}, \tag{2.2}$$

where $a > 0$ is the learning rate, $r$ the received reinforcement, and $b_{ij}$ a quantity called the reinforcement baseline. The quantity $\partial \ln g_i / \partial w_{ij}$ is called the characteristic eligibility of $w_{ij}$, where $g_i(y_i; w_i, x)$ is the probability mass function (in the case of a discrete distribution) or the probability density function (in the case of a continuous distribution) that determines the output $y_i$ of the unit as a function of the parameter vector $w_i$ and the input pattern $x$ to the unit.

An important result is that REINFORCE algorithms are characterized by the stochastic hill-climbing property. At each step, the average update direction $E\{\Delta W \mid W, x\}$ in the weight space lies in the direction for which the performance measure $E\{r \mid W, x\}$ is increasing, where $W$ is the matrix of all network parameters:

$$E\{\Delta w_{ij} \mid W, x\} = a \frac{\partial E\{r \mid W, x\}}{\partial w_{ij}}, \tag{2.3}$$

where $a > 0$. This means that for any REINFORCE algorithm, the expectation of the weight change follows the gradient of the performance measure $E\{r \mid W, x\}$. Therefore, REINFORCE algorithms can be used to perform stochastic maximization of the performance measure. In the case of the Bernoulli unit with $p$ inputs, the probability $p_i$ is computed as $p_i = f\left(\sum_{j=1}^{p} w_{ij} x_j\right)$, where $f(x) = 1/(1 + \exp(-x))$, and it holds that

$$\frac{\partial \ln g_i(y_i; p_i)}{\partial p_i} = \frac{y_i - p_i}{p_i (1 - p_i)}, \tag{2.4}$$

where $y_i$ is the binary output (0 or 1) (Williams, 1992).
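The following sketch illustrates one immediate reinforcement step for a single Bernoulli unit, combining the update of equation 2.2 with the eligibility of equation 2.4 and the logistic parameterization given above (so that the eligibility of $w_{ij}$ reduces to $(y_i - p_i) x_j$); the zero baseline, the learning rate, and the function names are illustrative assumptions:

```python
import numpy as np

def bernoulli_output(w, x, rng):
    """Forward pass of a Bernoulli unit: p = f(sum_j w_j x_j), then a binary draw."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # logistic squashing of the weighted sum
    y = float(rng.random() < p)               # y = 1 with probability p, else 0
    return y, p

def reinforce_update(w, x, y, p, r, a=0.1, b=0.0):
    """REINFORCE step (equation 2.2) with the Bernoulli eligibility (y - p) * x_j."""
    return w + a * (r - b) * (y - p) * x
```

In the immediate reinforcement setting, the unit first emits $y$, the environment then returns $r$, and the update is applied.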

2.3 The Reinforcement Clustering Approach. In our approach to clustering based on reinforcement learning (called the RC approach), we consider that each cluster $i$ ($i = 1, \ldots, L$) corresponds to a Bernoulli unit whose weight vector $w_i = (w_{i1}, \ldots, w_{ip})^\top$ corresponds to the prototype vector for cluster $i$. At each step, each Bernoulli unit $i$ is fed with a randomly selected pattern $x$ and performs the following operations. First, the distance $s_i = d(x, w_i)$ is computed, and then the probability $p_i$ is obtained as follows:

$$p_i = h(s_i) = 2 \left(1 - f(s_i)\right), \tag{2.5}$$

where $f(x) = 1/(1 + \exp(-x))$. The function $h$ provides values in $(0, 1)$ (since $s_i \ge 0$) and is monotonically decreasing. Therefore, the smaller the distance $s_i$ between $x$ and $w_i$, the higher the probability $p_i$ that the output $y_i$ of the unit will be 1. Thus, when a pattern is presented to the clustering units, each unit provides output 1 with a probability that decreases monotonically with the distance of the pattern from the corresponding cluster prototype. Consequently, the closer (according to some proximity measure) a unit is to the input pattern, the higher the probability that the unit will be active (i.e., $y_i = 1$). The probabilities $p_i$ provide a measure of the proximity between patterns and cluster centers. Therefore, if a unit $i$ is active, it is very probable that this unit is close to the input pattern.

According to the immediate reinforcement learning framework, after each cluster unit $i$ has computed its output $y_i$, the environment (external teacher) must evaluate the decisions by sending a separate reinforcement signal $r_i$ to each unit $i$. This evaluation is made in such a way that the units update their weights so that the desirable clustering strategy is implemented. In the next section we consider as examples the cases of some well-known clustering strategies.

Following equation 2.2, the use of the REINFORCE algorithm for updating the weights of the clustering units suggests that

$$\Delta w_{ij} = a (r_i - b_{ij}) \frac{\partial \ln g_i(y_i; p_i)}{\partial p_i} \frac{\partial p_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ij}}. \tag{2.6}$$

Using equations 2.4 and 2.5, equation 2.6 takes the form

$$\Delta w_{ij} = a (r_i - b_{ij}) (y_i - p_i) \frac{\partial s_i}{\partial w_{ij}}, \tag{2.7}$$

which is the weight update equation corresponding to the reinforcement clustering (RC) scheme.

An important characteristic of the above weight update scheme is that it operates toward maximizing the following objective function:

$$R(W) = \sum_{i=1}^{N} \hat{R}(W, x_i) = \sum_{i=1}^{N} \sum_{j=1}^{L} E\{r_j \mid W, x_i\}, \tag{2.8}$$

where $E\{r_j \mid W, x_i\}$ denotes the expected value of the reinforcement received by cluster unit $j$ when the input pattern is $x_i$.

Consequently, the reinforcement clustering scheme can be employed for problems whose objective is the online maximization of a function that can be specified in the form of $R(W)$. The maximization is achieved by performing updates that at each step (assuming input $x_i$) maximize the term $\hat{R}(W, x_i)$. The latter is valid since from equation 2.3 we have that

$$E\{\Delta w_{kl} \mid W, x_i\} = a \frac{\partial E\{r_k \mid W, x_i\}}{\partial w_{kl}}. \tag{2.9}$$

Since the weight $w_{kl}$ affects only the term $E\{r_k \mid W, x_i\}$ in the definition of $\hat{R}(W, x_i)$, we conclude that

$$E\{\Delta w_{kl} \mid W, x_i\} = a \frac{\partial \hat{R}(W, x_i)}{\partial w_{kl}}. \tag{2.10}$$

Therefore, the RC update algorithm performs online stochastic maximization of the objective function $R$, in the same sense that LVQ minimizes the objective function $J$ (see equation 2.1) or the online backpropagation algorithm minimizes the well-known mean square error function.
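The following sketch illustrates the forward computation of the RC scheme for a bank of $L$ cluster units: the distances $s_i$, the probabilities of equation 2.5, and the stochastic binary outputs $y_i$ (squared Euclidean distance and the function name are illustrative assumptions):

```python
import numpy as np

def rc_forward(x, W, rng=None):
    """Distances, activation probabilities (equation 2.5), and Bernoulli outputs
    of the cluster units for a single input pattern x; W holds one prototype per row."""
    rng = np.random.default_rng() if rng is None else rng
    s = ((W - x) ** 2).sum(axis=1)                 # s_i = d(x, w_i) >= 0
    p = 2.0 * (1.0 - 1.0 / (1.0 + np.exp(-s)))     # p_i = 2 (1 - f(s_i)), decreasing in s_i
    y = (rng.random(len(p)) < p).astype(float)     # y_i = 1 with probability p_i
    return s, p, y
```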

3 The RGCL Algorithm

In the classical LVQ algorithm, only the winning unit $i^\star$ updates its weights, which are moved toward the input pattern $x$, while the weights of the remaining units remain unchanged:

$$\Delta w_{ij} = \begin{cases} a_i (x_j - w_{ij}) & \text{if } i \text{ is the winning unit} \\ 0 & \text{otherwise.} \end{cases} \tag{3.1}$$

For simplicity, in equation 3.1 the dependence of the parameters $a_i$ on time is not stated explicitly. Usually the $a_i$ start from a reasonable initial value and are gradually reduced to zero in some way. But in many LVQ implementations (as, for example, in Xu et al., 1993) the parameter $a_i$ remains fixed at a small value.

The strategy we would like the system to learn is that when a pattern is presented to the system, only the winning unit (the closest one) becomes active (with high probability) and updates its weights, while the other units remain inactive (again with high probability). To implement this strategy, the environment identifies the unit $i^\star$ with maximum $p_i$ and returns a reward signal $r_{i^\star} = 1$ to that unit if it has decided correctly ($y_{i^\star} = 1$) and a penalty signal $r_{i^\star} = -1$ if its guess is wrong ($y_{i^\star} = 0$). The reinforcements sent to the other (nonwinning) units are $r_i = 0$ ($i \neq i^\star$), so that their weights are not affected. Therefore,

$$r_i = \begin{cases} 1 & \text{if } i = i^\star \text{ and } y_i = 1 \\ -1 & \text{if } i = i^\star \text{ and } y_i = 0 \\ 0 & \text{if } i \neq i^\star. \end{cases} \tag{3.2}$$

Following this specification of $r_i$ and setting $b_{ij} = 0$ for every $i$ and $j$, equation 2.7 takes the form

$$\Delta w_{ij} = a\, r_i (y_i - p_i) \frac{\partial s_i}{\partial w_{ij}}. \tag{3.3}$$

In the case where the Euclidean distance is used ($s_i = d^2(x, w_i) = \sum_{j=1}^{p} (x_j - w_{ij})^2$), equation 3.3 becomes

$$\Delta w_{ij} = a\, r_i (y_i - p_i)(x_j - w_{ij}), \tag{3.4}$$

which is the update equation of RGCL. Moreover, following the specification of $r_i$ (see equation 3.2), it is easy to verify that

$$\Delta w_{ij} = \begin{cases} a\, |y_i - p_i|\,(x_j - w_{ij}) & \text{if } i = i^\star \\ 0 & \text{otherwise.} \end{cases} \tag{3.5}$$

Therefore, each iteration of the RGCL clustering algorithm consists of the following steps:

1. Randomly select a sample $x$ from the data set.

2. For $i = 1, \ldots, L$ compute the probability $p_i$ and decide the output $y_i$ of cluster unit $i$.

3. Specify the winning unit $i^\star$ with $p_{i^\star} = \max_i p_i$.

4. Compute the reinforcements $r_i$ ($i = 1, \ldots, L$) using equation 3.2.

5. Update the weight vectors $w_i$ ($i = 1, \ldots, L$) using equation 3.4.

As in the case of the LVQ algorithm, we consider that the parameter $a$ does not depend on time and remains fixed at a specific small value.

The main point of our approach is that we have a learning system that operates in order to maximize the expected reward at the upcoming trial. According to the specification of the rewarding strategy, high values of $r$ are received when the system follows the clustering strategy, while low values are obtained when the system fails in this task. Therefore, the maximization of the expected value of $r$ means that the system is able to follow (on average) the clustering strategy. Since the clustering strategy aims at minimizing the objective function $J$, in essence we have obtained an indirect stochastic way to minimize $J$ through the learning of the clustering strategy, that is, through the maximization of the immediate reinforcement $r$. This intuition is made clearer in the following.

In the case of the RGCL algorithm, the reinforcements are provided by equation 3.2. Using this equation and taking into account that $y_i = 1$ with probability $p_i$ and $y_i = 0$ with probability $1 - p_i$, it is easy to derive from equation 2.8 that the objective function $R_1$ maximized by RGCL is

$$R_1(X, W) = \sum_{j=1}^{N} \left[ p_{i^\star}(x_j) - \left(1 - p_{i^\star}(x_j)\right) \right], \tag{3.6}$$

where $p_{i^\star}(x_j)$ is the maximum probability for input $x_j$. Equation 3.6 gives

$$R_1(X, W) = 2 \sum_{j=1}^{N} p_{i^\star}(x_j) - N. \tag{3.7}$$

Since $N$ is a constant and the probability $p_i$ decreases monotonically with the distance $d(x, w_i)$, we conclude that the RGCL algorithm performs updates that minimize the objective function $J$, since it operates toward the maximization of the objective function $R_1$.

Another interesting case results if we set $r_{i^\star} = 0$ when $y_{i^\star} = 0$, which yields the following objective function,

$$R_2(X, W) = \sum_{j=1}^{N} p_{i^\star}(x_j), \tag{3.8}$$

having the same properties as $R_1$.

In fact, the LVQ algorithm can be considered a special case of the RGCL algorithm. This stems from the fact that by setting

$$r_i = \begin{cases} \dfrac{1}{y_i - p_i} & \text{if } i = i^\star \text{ and } y_i = 1 \\[6pt] -\dfrac{1}{y_i - p_i} & \text{if } i = i^\star \text{ and } y_i = 0 \\[6pt] 0 & \text{if } i \neq i^\star, \end{cases} \tag{3.9}$$

the update equation 3.4 becomes exactly the LVQ update equation. Consequently, using equation 2.8, it can be verified that, besides minimizing the hard clustering objective function $J$, the LVQ algorithm operates toward maximizing the objective function

$$R_3(X, W) = \sum_{j=1}^{N} \left[ \frac{p_{i^\star}(x_j)}{1 - p_{i^\star}(x_j)} + \frac{1 - p_{i^\star}(x_j)}{p_{i^\star}(x_j)} \right]. \tag{3.10}$$

Moreover, if we compare the RGCL update equation 3.5 with the LVQ update equation, we can see that the actual difference lies in the presence of the term $|y_i - p_i|$ in the RGCL update equation. Since $y_i$ may be either one or zero (depending on the value of $p_i$), the absolute value $|y_i - p_i|$ is different (high or low) depending on the outcome $y_i$. Therefore, under the same conditions ($W$ and $x_i$), the strength of the weight updates $w_{ij}$ may differ depending on the value of $y_i$. This fact introduces a kind of noise into the weight update equations that assists the learning system in escaping from shallow local minima and makes it more effective than the LVQ algorithm. It must also be stressed that the RGCL scheme is not by any means a global optimization clustering approach. It is a local optimization clustering procedure that exploits randomness to escape from shallow local minima, but it can be trapped in steep local minimum points. In section 4 we present a modification to the RGCL weight update equation that gives the algorithm the property of sustained exploration.
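For concreteness, the following sketch implements one RGCL pass over the data set, following steps 1 through 5 above with the reinforcements of equation 3.2 and the Euclidean update of equation 3.4 (the fixed learning rate and the function name are illustrative assumptions):

```python
import numpy as np

def rgcl_pass(X, W, a=0.1, rng=None):
    """One pass of RGCL over the data set X (N x p), updating the prototype matrix
    W (L x p) in place; only the winning unit receives a nonzero reinforcement."""
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(X)):                    # step 1: random sample order
        x = X[i]
        s = ((W - x) ** 2).sum(axis=1)                   # distances s_i
        p = 2.0 * (1.0 - 1.0 / (1.0 + np.exp(-s)))       # step 2: probabilities p_i (eq. 2.5)
        y = (rng.random(len(p)) < p).astype(float)       # step 2: stochastic outputs y_i
        i_star = int(p.argmax())                         # step 3: winning unit i*
        r = np.zeros(len(p))                             # step 4: reinforcements (eq. 3.2)
        r[i_star] = 1.0 if y[i_star] == 1.0 else -1.0
        W += a * r[:, None] * (y - p)[:, None] * (x - W) # step 5: update (eq. 3.4)
    return W
```

Nonwinning units have $r_i = 0$, so their prototypes are left unchanged, in agreement with equation 3.5.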

3.1 Other Clustering Strategies. Following the above guidelines, almost every online clustering technique may be considered in the RC framework by appropriately specifying the reinforcement values $r_i$ provided to the clustering units. Such an attempt would introduce the characteristics of "noisy search" into the dynamics of the corresponding technique and would make it more effective, in the same sense that the RGCL algorithm seems to be more effective than LVQ according to the experimental tests. Moreover, any distance measure $d(x, w_i)$ may be used provided that the derivative $\partial d(x, w_i)/\partial w_{ij}$ can be computed.

We consider now the specification of the reinforcements $r_i$ to be used in the RC weight update equation 2.7 in the cases of some well-known online clustering techniques.

3.1.1 Frequency Sensitive Competitive Learning (FSCL) (Ahalt, Krishnamurty, Chen, & Melton, 1990). In the FSCL case it is considered that $d(x, w_i) = c_i |x - w_i|^2$ with $c_i = n_i / \sum_j n_j$, where $n_i$ is the number of times that unit $i$ has been the winning unit. Also,

$$r_i = \begin{cases} 1 & \text{if } i = i^\star \text{ and } y_i = 1 \\ -1 & \text{if } i = i^\star \text{ and } y_i = 0 \\ 0 & \text{if } i \neq i^\star. \end{cases}$$

3.1.2 Rival Penalized Competitive Learning (RPCL) (Xu et al., 1993). This is a modification of FSCL where the second winning unit $i_s$ moves in the opposite direction with respect to the input vector $x$. This means that $d(x, w_i) = c_i |x - w_i|^2$ and

$$r_i = \begin{cases} 1 & \text{if } i = i^\star \text{ and } y_i = 1 \\ -1 & \text{if } i = i^\star \text{ and } y_i = 0 \\ -b & \text{if } i = i_s \text{ and } y_i = 1 \\ b & \text{if } i = i_s \text{ and } y_i = 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $b \ll 1$ according to the specification of RPCL.

3.1.3 Maximum Entropy Clustering (Rose, Gurewitz, & Fox, 1990). The application of the RC technique to the maximum entropy clustering approach suggests that $d(x, w_i) = |x - w_i|^2$ and

$$r_i = \begin{cases} \dfrac{\exp(-b\,|x - w_i|^2)}{\sum_{j=1}^{L} \exp(-b\,|x - w_j|^2)} & \text{if } y_i = 1 \\[10pt] -\dfrac{\exp(-b\,|x - w_i|^2)}{\sum_{j=1}^{L} \exp(-b\,|x - w_j|^2)} & \text{if } y_i = 0, \end{cases}$$

where the parameter $b$ gradually increases with time.

3.1.4 Self-Organizing Map (SOM) (Kohonen, 1989). It is also possible to apply the RC technique to the SOM network by using $d(x, w_i) = |x - w_i|^2$ and specifying the reinforcements $r_i$ as follows:

$$r_i = \begin{cases} h_\sigma(i, i^\star) & \text{if } y_i = 1 \\ -h_\sigma(i, i^\star) & \text{if } y_i = 0, \end{cases}$$

where $h_\sigma(i, j)$ is a unimodal function that decreases monotonically with respect to the distance of the two units $i$ and $j$ in the network lattice and $\sigma$ is a characteristic decay parameter.
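As one concrete instance, the following sketch computes the reinforcements for the maximum entropy case of section 3.1.3, where the reward magnitude is a softmax weight over squared Euclidean distances (the array y is assumed to hold the binary outputs of the cluster units; the function name is illustrative):

```python
import numpy as np

def maxent_reinforcements(x, W, y, beta):
    """Reinforcements r_i for maximum entropy clustering under the RC scheme:
    exp(-beta |x - w_i|^2) normalized over all units, positive when y_i = 1
    and negative when y_i = 0; beta is gradually increased with time."""
    s = ((W - x) ** 2).sum(axis=1)          # |x - w_i|^2
    g = np.exp(-beta * s)
    soft = g / g.sum()                      # softmax weighting of the units
    return np.where(y == 1.0, soft, -soft)  # sign depends on the unit output
```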

4 Sustained Exploration

The RGCL algorithm can be easily adapted in order to obtain the property of sustained exploration. This is a mechanism that gives a search algorithm the ability to escape from local minima through the broadening of the search at certain times (Ackley, 1987; Williams & Peng, 1991). The property of sustained exploration actually emphasizes divergence: a return to global searching without completely forgetting what has been learned. The important issue is that such a divergence mechanism is not external to the learning system (as, for example, in the case of multiple restarts); it is an internal mechanism that broadens the search when the learning system tends to settle on a certain state, without any external intervention.

In the case of REINFORCE algorithms with Bernoulli units, sustained exploration is very easily obtained by adding a term $-\gamma w_{ij}$ to the weight update equation 2.2, which takes the form (Williams & Peng, 1991)

$$\Delta w_{ij} = a (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}} - \gamma w_{ij}. \tag{4.1}$$

Consequently, the update equation 3.4 of the RGCL algorithm now takes the form

$$\Delta w_{ij} = a\, r_i (y_i - p_i)(x_j - w_{ij}) - \gamma w_{ij}, \tag{4.2}$$

where $r_i$ is given by equation 3.2. The modification of RGCL that employs the above weight update scheme will be called the SRGCL algorithm (sustained RGCL). The parameter $\gamma > 0$ must be much smaller than the parameter $a$ so that the term $-\gamma w_{ij}$ does not affect the local search properties of the algorithm, that is, the movement toward local minimum states.

It is obvious that the sustained exploration term emphasizes divergence and starts to dominate in the update equations 4.1 and 4.2 when the algorithm is trapped in local minimum states. In such a case, the quantity $y_i - p_i$ becomes small, and therefore the first term has a negligible contribution. As the search broadens, the difference $y_i - p_i$ tends to become larger, and the first term again starts to dominate over the second term. It must be noted that, according to equation 4.2, not only the weights of the winning unit are updated at each step of SRGCL, but also the weights of the units with $r_i = 0$.

The sustained exploration term $-\gamma w_{ij}$ can also be added to the LVQ update equation, which then takes the form

$$\Delta w_{ij} = \begin{cases} a_i (x_j - w_{ij}) - \gamma w_{ij} & \text{if } i \text{ is the winning unit} \\ -\gamma w_{ij} & \text{otherwise.} \end{cases} \tag{4.3}$$

The modified algorithm will be called SLVQ (sustained LVQ) and improves the performance of LVQ in terms of minimizing the clustering objective function $J$.

Due to their sustained exploration property, SRGCL and SLVQ do not converge at local minima of the objective function, since their divergence mechanism allows them to escape from such minima and continue the exploration of the weight space. Therefore, a criterion must be specified in order to terminate the search, which is usually the specification of a maximum number of steps.
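The following sketch illustrates the SRGCL update of equation 4.2: the RGCL term plus the sustained exploration term $-\gamma w_{ij}$, which is applied to every unit, including those with $r_i = 0$ (the default values and the function name are illustrative assumptions):

```python
import numpy as np

def srgcl_update(W, x, r, y, p, a=0.1, gamma=1e-4):
    """SRGCL weight update (equation 4.2): a * r_i * (y_i - p_i) * (x - w_i) for each
    unit plus the weight decay term -gamma * w_i that sustains exploration."""
    return W + a * r[:, None] * (y - p)[:, None] * (x - W) - gamma * W
```

The same decay term added to equation 3.1 gives the SLVQ update of equation 4.3.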

5 Experimental Results

The proposed techniques have been tested using two well-known data sets: the IRIS data set (Anderson, 1935) and the "synthetic" data set used in Ripley (1996). In all experiments the value $a = 0.001$ was used for LVQ, while for SLVQ we set $a = 0.001$ and $\gamma = 0.00001$. For RGCL we assumed $a = 0.5$ for the first 500 iterations and $a = 0.1$ afterward, the same holding for SRGCL, where we set $\gamma = 0.0001$. These parameter values were found to lead to the best performance for all algorithms. Moreover, the RGCL and LVQ algorithms were run for 1500 iterations and the SRGCL and SLVQ algorithms for 4000 iterations, where one iteration corresponds to a single pass through all data samples in arbitrary order. In addition, in order to specify the final solution (with $J_{\min}$) in the case of SRGCL and SLVQ, which do not regularly converge to a final state, we computed the value of $J$ every 10 iterations and, if it was lower than the current minimum value of $J$, we saved the weight values of the clustering units.

Table 1: Average Value of the Objective Function $J$ Corresponding to the Solutions Obtained Using the RGCL, LVQ, RLVQ, SRGCL, and SLVQ Algorithms (IRIS Data Set).

Number of Clusters    RGCL    LVQ      RLVQ     SRGCL    SLVQ
3                     98.6    115.5    106.8    86.3     94.5
4                     75.5     94.3     87.8    62.5     70.8
5                     65.3     71.2     69.4    52.4     60.3

Table 2: Average Value of the Objective Function $J$ Corresponding to the Solutions Obtained Using the RGCL, LVQ, RLVQ, SRGCL, and SLVQ Algorithms (Synthetic Data Set).

Number of Clusters    RGCL    LVQ      RLVQ     SRGCL    SLVQ
4                     14.4    15.3     14.8     12.4     13.3
6                     12.6    13.7     13.3     10.3     12.3
8                     10.3    11.4     10.8      9.2     10.1

In previous studies (Williams & Peng, 1991) the effectiveness of stochastic search using reinforcement algorithms has been demonstrated. Nevertheless, in order to compare the effectiveness of RGCL as a randomized clustering technique, we have also implemented the following adaptation of LVQ, called randomized LVQ (RLVQ). At every step of the RLVQ process, each actual distance $d(x, w_i)$ is first modified by adding noise; that is, we compute the quantities $d'(x, w_i) = (1 - n)\, d(x, w_i)$, where $n$ is uniformly selected in the range $[-\Lambda, \Lambda]$ (with $0 < \Lambda < 1$). A new value of $n$ is drawn for every computation of $d'(x, w_i)$. Then the selection of the winning unit is done by considering the perturbed values $d'(x, w_i)$, and finally the ordinary LVQ update formula is applied.

Experimental results from the application of all methods are summarized in Tables 1 and 2. The performance of the RLVQ heuristic was very sensitive to the level of the injected noise, that is, the value of $\Lambda$. A high value of $\Lambda$ leads to pure random search, while a small value of $\Lambda$ makes the behavior of RLVQ similar to that of LVQ. Best results were obtained for $\Lambda = 0.35$. We have also tested the case where the algorithm starts with a high initial value ($\Lambda = 0.5$) that gradually decreases to a small final value $\Lambda = 0.05$, but no performance improvement was obtained. Finally, it must be stressed that RLVQ may also be cast in the reinforcement learning framework (in the spirit of section 3.1); it can be considered an extension of RGCL with additional noise injected in the evaluation of the reinforcement signal $r_i$.
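The following sketch illustrates the RLVQ winner selection described above: each distance is perturbed as $d'(x, w_i) = (1 - n)\, d(x, w_i)$ with a fresh $n$ drawn uniformly from $[-\Lambda, \Lambda]$ per distance, and the ordinary LVQ update is then applied to the selected winner (squared Euclidean distance and the function name are illustrative; the default noise level of 0.35 is the best-performing value reported above):

```python
import numpy as np

def rlvq_winner(x, W, noise_level=0.35, rng=None):
    """RLVQ winner selection: perturb each distance multiplicatively before the argmin."""
    rng = np.random.default_rng() if rng is None else rng
    d = ((W - x) ** 2).sum(axis=1)                            # actual distances d(x, w_i)
    n = rng.uniform(-noise_level, noise_level, size=len(d))   # fresh noise per distance
    return int(((1.0 - n) * d).argmin())                      # winner by perturbed distances
```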

5.1 IRIS Data Set. The IRIS data set is a set of 150 data points in $\mathbb{R}^4$. Each point belongs to one of three classes, and there are 50 points of each class in the data set. Of course, the class information is not available during training. When three clusters are considered, the minimum value of the objective function $J$ is $J_{\min} = 78.9$ (Hathaway & Bezdek, 1995) in the case where the Euclidean distance is used.

The IRIS data set contains two distinct clusters, while the third cluster is not distinctly separate from the other two. For this reason, when three clusters are considered, there is the problem of the flat local minimum (with $J \approx 150$), which corresponds to a solution with two clusters (there exists one dead unit).

Figure 1 displays the minimization of the objective function $J$ in a typical run of the RGCL and LVQ algorithms with three cluster units. Both algorithms started from the same initial weight values. The existence of the local minimum with $J \approx 150$, corresponding to the two-cluster solution mentioned previously, is apparent; this is where the LVQ algorithm is trapped. On the other hand, the RGCL algorithm manages to escape from the local minimum and oscillate near the global minimum value $J = 78.9$.

Figure 1: Minimization of the objective function $J$ (plotted against the number of iterations) using the LVQ and the RGCL algorithms for the IRIS problem with three cluster units.

We have examined the cases of three, four, and five cluster units. In each case, a series of 20 experiments was conducted. In each experiment the LVQ, RGCL, and RLVQ algorithms were tested starting from the same weight values, which were randomly specified. Table 1 presents the average values (over the 20 runs) of the objective function $J$ corresponding to the solutions obtained using each algorithm. In all cases, the RGCL algorithm is more effective than the LVQ algorithm. As expected, the RLVQ algorithm was in all experiments at least as effective as LVQ, but its average performance is inferior to that of RGCL. This means that the injection of noise during prototype selection was sometimes helpful and assisted the LVQ algorithm in achieving better solutions, while in other experiments it had no effect on LVQ performance.

Table 1 also presents results concerning the same series of experiments (using the same initial weights) for SRGCL and SLVQ. It is clear that significant improvement is obtained by using the SLVQ algorithm in place of LVQ, and that SRGCL is more effective than SLVQ and, as expected, also improves on the RGCL algorithm. On the other hand, the sustained versions require a greater number of iterations.

5.2 Synthetic Data Set. The same observations were verified in a second series of experiments where the synthetic data set is used. In this data set (Ripley, 1996), the patterns are two-dimensional, and there are two classes, each having a bimodal distribution; thus, there are four clusters with small overlaps. We have used the 250 patterns that are considered in Ripley (1996) as the training set, and we make no use of the class information.

Several experiments have been conducted on this data set concerning the RGCL, LVQ, and RLVQ algorithms first and then SRGCL and SLVQ. The experiments were performed assuming four, six, and eight cluster units. In analogy with the IRIS data set, for each number of clusters a series of 20 experiments was performed (with different initial weights). In each experiment, each of the algorithms was applied with the same initial weight values. The obtained results concerning the average value of $J$ corresponding to the solutions provided by each method are summarized in Table 2. Comparative performance results are similar to those obtained with the IRIS data set.

A typical run of the RGCL and LVQ algorithms with four cluster units starting from the same positions (far from the optimal ones) is depicted in Figures 2 and 3, respectively. These figures display the data set (represented with crosses), as well as the traces of the four cluster units until they reach their final positions (represented with squares). It is clear that RGCL provides a four-cluster optimal solution (with $J = 12.4$), while the LVQ algorithm provides a three-cluster solution with $J = 17.1$. The existence of a dead unit at position $(-1.9, -0.55)$ (the square at the lower left corner of Figure 3) in the LVQ solution and the effect of randomness on the RGCL traces, which supplies the algorithm with better exploration capabilities, are easily observed.

Figure 2: Synthetic data set and traces of the four cluster prototypes corresponding to a run of the RGCL algorithm with four cluster units (four traces).

Figure 3: Synthetic data set and traces of the cluster prototypes corresponding to a run of the LVQ algorithm with four cluster units (three traces and one dead unit).

Moreover, in order to perform a more reliable comparison between the RGCL and RLVQ algorithms, we have conducted an additional series of experiments on the synthetic data set assuming four, six, and eight cluster units. For each number of clusters we have conducted 100 runs with each algorithm in exactly the same manner as in the previous experiments. Tables 3 and 4 display the statistics of the final value of $J$ obtained with each algorithm, and Table 5 displays the percentage of runs for which the performance of RGCL was superior ($J_{RGCL} < J_{RLVQ}$), similar ($J_{RGCL} \approx J_{RLVQ}$), or inferior ($J_{RGCL} > J_{RLVQ}$) to that of RLVQ. More specifically, the performance of the RGCL algorithm with respect to RLVQ was considered superior when $J_{RGCL} < J_{RLVQ} - 0.3$, similar when $|J_{RGCL} - J_{RLVQ}| \le 0.3$, and inferior when $J_{RGCL} > J_{RLVQ} + 0.3$. The displayed results make clear the superiority of the RGCL approach, which not only provides solutions that are almost always better than or similar to those of RLVQ but also leads to solutions that are more reliable and consistent, as indicated by the significantly lower values of the standard deviation.

Table 3: Statistics of the Objective Function $J$ Corresponding to Solutions Obtained from 100 Runs with the RGCL Algorithm (Synthetic Data Set).

Number of Clusters    Average    Standard Deviation    Best    Worst
4                     14.5       2.2                   12.4    28.9
6                     12.4       1.8                    9.1    17.2
8                     10.2       2.5                    7.1    17.2

Table 4: Statistics of the Objective Function $J$ Corresponding to Solutions Obtained from 100 Runs with the RLVQ Algorithm (Synthetic Data Set).

Number of Clusters    Average    Standard Deviation    Best    Worst
4                     15.1       3.1                   12.4    28.9
6                     13.3       3.7                    9.1    28.9
8                     10.7       4.2                    7.5    28.9

Table 5: Percentage of Runs for Which the Performance of the RGCL Algorithm Was Superior, Similar, or Inferior to RLVQ.

Number of Clusters    $J_{RGCL} < J_{RLVQ}$    $J_{RGCL} \approx J_{RLVQ}$    $J_{RGCL} > J_{RLVQ}$
4                     47%                       51%                            2%
6                     52%                       45%                            3%
8                     64%                       34%                            2%

6 Conclusion

We have proposed reinforcement clustering as a reinforcement-based technique for online clustering. This approach can be combined with any online clustering algorithm based on competitive learning and introduces a degree of randomness into the weight update equations that has a positive effect on clustering performance.

Further research will be directed to the application of the approach to clustering algorithms other than LVQ, for example, the ones reported in section 3. The performance of those algorithms under the RC framework remains to be examined and assessed. Moreover, the application of the proposed technique to real-world clustering problems (for example, ) constitutes another important future objective. Another interesting direction concerns the application of reinforcement algorithms to mixture density problems. In this case, the employment of doubly stochastic units (those with a normal component followed by a Bernoulli component) seems appropriate (Kontoravdis, Likas, & Stafylopatis, 1995). Also of great interest is the possible application of the RC approach to fuzzy clustering, as well as the development of suitable criteria for inserting, deleting, splitting, and merging cluster units.

Acknowledgments

The author would like to thank the anonymous referees for their useful comments and for suggesting the RLVQ algorithm for comparison purposes.

References

Ackley, D. H. (1987). A connectionist machine for genetic hillclimbing. Norwell, MA: Kluwer.

Ahalt, S. C., Krishnamurty, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–291.

Anderson, E. (1935). The IRISes of the Gaspe Peninsula. Bull. Amer. IRIS Soc., 59, 381–406.

Barto, A. G., & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360–375.

Hathaway, R. J., & Bezdek, J. C. (1995). Optimization of clustering criteria by reformulation. IEEE Trans. on Fuzzy Systems, 3, 241–245.

Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Kontoravdis, D., Likas, A., & Stafylopatis, A. (1995). Enhancing stochasticity in reinforcement learning schemes: Application to the exploration of binary domains. Journal of Intelligent Systems, 5, 49–77.

Martinez, T., Berkovich, S., & Schulten, G. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4, 558–569.

Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Rose, K., Gurewitz, F., & Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Rev. Lett., 65, 945–948.

Williams, R. J. (1988). Toward a theory of reinforcement learning connectionist systems (Tech. Rep. NU-CCS-88-3). Boston, MA: Northeastern University.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Williams, R. J., & Peng, J. (1991). Function optimization using connectionist reinforcement learning networks. Connection Science, 3, 241–268.

Xu, L., Krzyzak, A., & Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks, 4, 636–649.

Received February 26, 1998; accepted November 24, 1998.