A Reinforcement Learning Approach to Online Clustering
LETTER    Communicated by Andrew Barto

Aristidis Likas
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece

Neural Computation 11, 1915–1932 (1999). © 1999 Massachusetts Institute of Technology.

A general technique is proposed for embedding online clustering algorithms based on competitive learning in a reinforcement learning framework. The basic idea is that the clustering system can be viewed as a reinforcement learning system that learns through reinforcements to follow the clustering strategy we wish to implement. In this sense, the reinforcement guided competitive learning (RGCL) algorithm is proposed, which constitutes a reinforcement-based adaptation of learning vector quantization (LVQ) with enhanced clustering capabilities. In addition, we suggest extensions of RGCL and LVQ that are characterized by the property of sustained exploration and significantly improve the performance of those algorithms, as indicated by experimental tests on well-known data sets.

1 Introduction

Many pattern recognition and data analysis tasks assume no prior class information about the data to be used. Pattern clustering belongs to this category and aims at organizing the data into categories (clusters) so that patterns within a cluster are more similar to each other (in terms of an appropriate distance metric) than patterns belonging to different clusters. To achieve this objective, many clustering strategies are parametric and operate by defining a clustering criterion and then trying to determine the optimal allocation of patterns to clusters with respect to that criterion. In most cases such strategies are iterative and operate online: patterns are considered one at a time, and, based on the distance of the pattern from the cluster centers, the parameters of the clusters are adjusted according to the clustering strategy. In this article, we present an approach to online clustering that treats competitive learning as a reinforcement learning problem. More specifically, we consider partitional clustering (or hard clustering, or vector quantization), where the objective is to organize patterns into a small number of clusters such that each pattern belongs exclusively to one cluster.

Reinforcement learning constitutes an intermediate learning paradigm that lies between supervised learning (with complete class information available) and unsupervised learning (with no class information available). The training information provided to the learning system by the environment (an external teacher) is in the form of a scalar reinforcement signal r that constitutes a measure of how well the system operates. The main idea of this article is that the clustering system does not directly implement a prespecified clustering strategy (for example, competitive learning) but instead tries to learn to follow the clustering strategy using suitably computed reinforcements provided by the environment. In other words, the external environment rewards or penalizes the learning system depending on how well it learns to apply the clustering strategy we have selected to follow. This approach will be formally defined in the following sections and leads to the development of clustering algorithms that exploit the stochasticity inherent in a reinforcement learning system and are therefore more flexible (they do not get easily trapped in local minima) than the original clustering procedures.

The proposed technique can be applied with any online hard clustering strategy and suggests a novel way to implement the strategy (update equations for the cluster centers). In addition, we present an extension of the approach that is based on the sustained exploration property, which can be easily obtained by a minor modification to the reinforcement update equations and gives the algorithms the ability to escape from local minima.

In the next section we provide a formal definition of online hard clustering as a reinforcement learning problem and present reinforcement learning equations for the update of the cluster centers. The equations are based on the family of REINFORCE algorithms, which have been shown to exhibit stochastic hill-climbing properties (Williams, 1992). Section 3 describes the reinforcement guided competitive learning (RGCL) algorithm, which constitutes a stochastic version of the learning vector quantization (LVQ) algorithm. Section 4 discusses issues concerning sustained exploration and the adaptation of the reinforcement learning equations to achieve continuous search of the parameter space, section 5 presents experimental results and several comparisons using well-known data sets, and section 6 summarizes the article and provides future research directions.

2 Clustering as a Reinforcement Learning Problem

2.1 Online Competitive Learning. Suppose we are given a sequence X = (x_1, ..., x_N) of unlabeled data x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p and want to assign each of them to one of L clusters. Each cluster i is described by a prototype vector w_i = (w_{i1}, ..., w_{ip})^T (i = 1, ..., L), and let W = (w_1, ..., w_L). Also let d(x, w) denote the distance metric based on which the clustering is performed. In the case of hard clustering, most methods attempt to find good clusters by minimizing a suitably defined objective function J(W). We restrict ourselves here to techniques based on competitive learning, where the objective function is (Kohonen, 1989; Hathaway & Bezdek, 1995)

    J = \sum_{i=1}^{N} \min_{r} d(x_i, w_r).    (2.1)

The clustering strategy of the competitive learning techniques can be summarized as follows:

1. Randomly take a sample x_i from X.
2. Compute the distances d(x_i, w_j) for j = 1, ..., L and locate the winning prototype j*, that is, the one with minimum distance from x_i.
3. Update the weights so that the winning prototype w_{j*} moves toward the pattern x_i.
4. Go to step 1.

Depending on what happens in step 3 with the nonwinning prototypes, several competitive learning schemes have been proposed, such as LVQ (or adaptive k-means) (Kohonen, 1989), RPCL (rival penalized competitive learning) (Xu, Krzyzak, & Oja, 1993), the SOM network (Kohonen, 1989), the "neural-gas" network (Martinez, Berkovich, & Schulten, 1993), and others. Moreover, in step 2, some approaches, such as frequency sensitive competitive learning (FSCL), assume that the winning prototype minimizes a function of the distance d(x, w) and not the distance itself.
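To make the loop above concrete, the following minimal sketch (not part of the original article) implements the plain winner-take-all scheme, that is, online LVQ / adaptive k-means, using the squared Euclidean distance for d. The function names, learning-rate value, number of epochs, and prototype initialization are illustrative assumptions.

import numpy as np

def lvq_online(X, L, alpha=0.05, epochs=10, seed=0):
    """Plain winner-take-all competitive learning (online LVQ / adaptive k-means).

    X     : (N, p) array of unlabeled patterns
    L     : number of clusters (prototypes)
    alpha : learning rate for the winner update
    Returns the prototype matrix W of shape (L, p).
    """
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Initialize the prototypes on randomly chosen patterns.
    W = X[rng.choice(N, size=L, replace=False)].copy()

    for _ in range(epochs):
        for i in rng.permutation(N):              # step 1: sample a pattern
            x = X[i]
            d = np.sum((W - x) ** 2, axis=1)      # step 2: distances d(x_i, w_j)
            j_star = np.argmin(d)                 #         winning prototype j*
            W[j_star] += alpha * (x - W[j_star])  # step 3: move the winner toward x_i
    return W

def objective(X, W):
    """Hard-clustering objective of equation 2.1: sum_i min_r d(x_i, w_r)."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

Variants such as RPCL, SOM, or FSCL differ mainly in how the winner is selected in step 2 and in how the nonwinning prototypes are treated in step 3.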
2.2 Immediate Reinforcement Learning. In the framework of reinforcement learning, a system accepts inputs from the environment, responds by selecting appropriate actions (decisions), and the environment evaluates the decisions by sending a rewarding or penalizing scalar reinforcement signal. According to the value of the received reinforcement, the learning system updates its parameters so that good decisions become more likely to be made in the future and bad decisions become less likely to occur (Kaelbling, Littman, & Moore, 1996).

A simple special case is immediate reinforcement learning, where the reinforcement signal is received at every step, immediately after the decision has been made. In order for the learning system to be able to search for the best decision corresponding to each input, a stochastic exploration mechanism is frequently necessary. For this reason, many reinforcement learning algorithms apply to neural networks of stochastic units. These units draw their outputs from some probability distribution employing one or more parameters. These parameters depend on the inputs and the network weights and are updated at each step to achieve the learning task. A special case, which is of interest to our approach, is when the output of each unit is discrete, and more specifically is either one or zero, depending on a single parameter p ∈ [0, 1]. This type of stochastic unit is called the Bernoulli unit (Barto & Anandan, 1985; Williams, 1988, 1992).

Several training algorithms have been developed for immediate reinforcement problems. We have used the family of REINFORCE algorithms, in which the parameters w_{ij} of the stochastic unit i with input x are updated as

    \Delta w_{ij} = \alpha (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}},    (2.2)

where α > 0 is the learning rate, r the received reinforcement, and b_{ij} a quantity called the reinforcement baseline. The quantity \partial \ln g_i / \partial w_{ij} is called the characteristic eligibility of w_{ij}, where g_i(y_i; w_i, x) is the probability mass function (in the case of a discrete distribution) or the probability density function (in the case of a continuous distribution) that determines the output y_i of the unit as a function of the parameter vector w_i and the input pattern x to the unit.

An important result is that REINFORCE algorithms are characterized by the stochastic hill-climbing property: at each step, the average update direction E{ΔW | W, x} in the weight space lies in the direction in which the performance measure E{r | W, x} is increasing, where W is the matrix of all network parameters,

    E\{\Delta w_{ij} \mid W, x\} = \alpha \, \frac{\partial E\{r \mid W, x\}}{\partial w_{ij}},    (2.3)

where α > 0. This means that for any REINFORCE algorithm, the expectation of the weight change follows the gradient of the performance measure E{r | W, x}. Therefore, REINFORCE algorithms can be used to perform stochastic maximization of the performance measure.

In the case of the Bernoulli unit with p inputs, the probability p_i is computed as p_i = f(\sum_{j=1}^{p} w_{ij} x_j), where f(x) = 1 / (1 + \exp(-x)), and it holds that

    \frac{\partial \ln g_i(y_i; p_i)}{\partial p_i} = \frac{y_i - p_i}{p_i (1 - p_i)},    (2.4)

where y_i is the binary output (0 or 1) (Williams, 1992).
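As an illustrative sketch of equations 2.2 and 2.4 (again, not code from the article), a single Bernoulli unit can be updated as follows. The reinforcement baseline is taken here to be a constant b, the reward function is a placeholder supplied by the caller, and the characteristic eligibility \partial \ln g_i / \partial w_{ij} is obtained from equation 2.4 via the chain rule through p_i = f(w^T x), which gives (y_i - p_i) x_j.

import numpy as np

def bernoulli_reinforce_step(w, x, reward_fn, alpha=0.1, baseline=0.0, rng=None):
    """One immediate-reinforcement step for a single Bernoulli unit.

    w         : (p,) weight vector of the unit
    x         : (p,) input pattern
    reward_fn : callable mapping the binary output y to a scalar reinforcement r
    baseline  : reinforcement baseline b (taken constant here for simplicity)
    """
    rng = rng or np.random.default_rng()
    p_i = 1.0 / (1.0 + np.exp(-np.dot(w, x)))    # Bernoulli parameter p_i = f(sum_j w_j x_j)
    y = float(rng.random() < p_i)                # stochastic binary output y in {0, 1}
    r = reward_fn(y)                             # scalar reinforcement from the environment
    w += alpha * (r - baseline) * (y - p_i) * x  # REINFORCE update of equation 2.2
    return w, y, r

With b = 0 this reduces to the plain REINFORCE estimator; a baseline that does not depend on the output y leaves the expected update direction of equation 2.3 unchanged and serves only to reduce the variance of the updates.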
2.3 The Reinforcement Clustering Approach. In our approach to clustering based on reinforcement learning (called the RC approach), we consider