
Diffusion Approximation for Bayesian Markov Chains

Michael Duff [email protected]
Gatsby Computational Neuroscience Unit, University College London, WC1N 3AR, England

Abstract

Given a Markov chain with uncertain transitions modelled in a Bayesian way, we investigate a technique for analytically approximating the mean transition frequency counts over a finite horizon. Conventional techniques for addressing this problem either require the enumeration of a set of generalized process "hyperstates" whose cardinality grows exponentially with the terminal horizon, or are limited to the two-state case and expressed in terms of hypergeometric series. Our approach makes use of a diffusion approximation technique for modelling the evolution of information state components of the hyperstate process. Interest in this problem stems from a consideration of the policy evaluation step of policy iteration algorithms applied to Markov decision processes with uncertain transition probabilities.

1. Introduction

Given a Markov decision process (MDP) with expressed prior uncertainties in the process transition probabilities, consider the problem of computing a policy that optimizes expected total (finite-horizon) reward. Such a policy would effectively address the "exploration-versus-exploitation tradeoff" faced by an agent that seeks to optimize total reinforcement obtained over the entire duration of its interaction with an uncertain world.

A policy-iteration-based approach for solving this problem would interleave "policy-improvement" steps with "policy-evaluation" steps, which compute or approximate the value of a fixed prospective policy.

If one considers an MDP with a known reward structure, then the expected total reward over some finite horizon under a fixed policy is just the inner product between the expected number of state->action->state transitions of each kind and their associated expected immediate rewards. Questions about value in this case reduce to questions about the distribution of transition frequency counts. These considerations lead to the focus of this paper: the estimation of the mean transition-frequency counts associated with a Markov chain (an MDP governed by a fixed policy), with unknown transition probabilities (whose uncertainty distributions are updated in a Bayesian way), over a finite time-horizon.

This is a challenging problem in itself with a long history,^1 and though in this paper we do not explicitly pursue the more ambitious goal of developing full-blown policy-iteration methods for computing optimal value functions and policies for uncertain MDPs, one can envision using the approximation techniques described here as a basis for constructing effective static stochastic policies or receding-horizon approaches that repeatedly compute such policies.

Our analysis begins by examining the dynamics of information-state. For each physical state, we view parameters describing uncertainty in transition probabilities associated with arcs leading from the state as that state's information state component. Information state components change when transitions from their corresponding physical state are observed, and on this time-scale, the dynamics of each information state component forms an embedded Markov chain. The dynamics of each embedded Markov chain is non-stationary but smoothly-varying. We introduce a diffusion model for the embedded dynamics, which yields a simple analytic approximation for describing the flow of information-state density.

^1 This problem was addressed by students of Ronald Howard's at MIT in the 1960's (Silver, 1963; Martin, 1967); however, their methods either entail an enumeration of process "hyperstates" (a set whose cardinality grows exponentially with the planning horizon), or else express the solution analytically in terms of hypergeometric series (for the case of two states only).

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

The analysis of embedded information state component dynamics may be carried out independently for each physical-state component. In order to account for interdependency between components and statistically model their joint behavior, we incorporate flux constraints into our analysis; i.e., we account for the fact that, roughly, the number of transitions into a physical state must be equal to the number of transitions out of that state. The introduction of flux constraints essentially links the time-scales of each embedded component information-state process together in a way that accounts for the interdependence of joint information-state components and maintains consistency with the global time-scale that governs the true, overall hyperstate process.

In slightly more concrete terms, we model each embedded information-state component density by a Gaussian process. Incorporating this model into the flux constraints leads to a linear system of equations in which the coefficient matrix has random (normal) elements, and the mean of the solution to this linear system provides an analytic approximation for the mean number of transition counts. Analysis reduces to finding an approximation for the mean value of the inverse of a matrix with random elements. In this paper we show how this can be done when the matrix possesses the special structure implied by our problem.

2. Diffusion approximation and solution

2.1. Information drift of a Bayes-Bernoulli process

Consider first a Bernoulli process with unknown parameter theta. If we use a beta distribution to model the uncertainty in theta, then two parameters, (m1, m2), suffice (the parameters are directly related to the number of successes and failures observed); we choose the beta distribution to model uncertainty because it is a conjugate family (DeGroot, 1970) for samples from a Bernoulli process. This means that if our prior uncertainty is described by a beta distribution specified by a particular pair of parameters, then our uncertainty after observing a Bernoulli process sample is also described by a beta distribution, but with parameter values that are incremented in a simple way. Figure 1 shows the state transition diagram for this elementary process. Each information state determines the transition probabilities leading out of it in a well-defined way.

Figure 1. The state transition diagram associated with a Bayes-Bernoulli process. Transition probabilities label arcs in the exploded view. Dashed axes show the affine-transformed coordinate system, whose origin coincides with (m1, m2) = (1, 1).

In preparation for what follows, we submit the information state space to a rotation and translation:

    t = m1 + m2 - 2
    x = m2 - m1.

In the new coordinate system, the probability of moving up or down Delta x = 1 in time Delta t = 1 is (1/2)(1 + x/(t+2)) and (1/2)(1 - x/(t+2)), respectively. In analogy to the passage from random walks to Wiener processes, we propose a limiting process in which the probability of moving up or down Delta x in time Delta t is (1/2)(1 + (x/(t+2)) sqrt(Delta t)) and (1/2)(1 - (x/(t+2)) sqrt(Delta t)), respectively. Taking the limit Delta t -> 0 (with Delta x = sqrt(Delta t)) results in a diffusion model with drift x/(t+2) and diffusion coefficient 1; the density function of x(t) is specified (by Kolmogorov's equations) as soon as these two local quantities have been determined. Alternatively, we may formulate the dynamics of x in terms of the stochastic differential equation (SDE):

    dx = (x/(t+2)) dt + dW_t,

where W_t is a Wiener process and the symbolic differential may be rigorously interpreted as an (Ito) integral equation for each sample path.

This is a homogeneous equation with additive noise; it is "linear in the narrow sense," and an application of Ito's formula leads to the closed-form solution:

    x_t = x_{t0} (t+2)/(t0+2) + Integral_{t0}^{t} ((t+2)/(s+2)) dW_s,

which implies that the solution is a Gaussian process. Its mean is E(x_t) = x_{t0} (t+2)/(t0+2). The second moment, E(x_t^2) = P(t), may be shown to satisfy

    dP/dt = (2/(t+2)) P + 1,

which by introducing an integrating factor can be solved: P(t) = c(t+2)^2 - (t+2), where c = (x_{t0}^2 + (t0+2)) / (t0+2)^2.

Though many steps have been omitted (see (Arnold, 1974) and (Kloeden & Platen, 1992) for relevant background material on second-order processes and Ito calculus), the idea we wish to advance is that diffusion models can be developed in this context and that they can yield useful information (namely, estimates for the mean and variance of the information state as a function of time).

2.2. Information drift of a Bayesian Markov chain

The preceding development has modeled the information state dynamics of an unknown Bernoulli process, but what we are really interested in is a similar model for Markov chains with unknown transition probabilities. Consider Figure 2, which depicts a Markov chain with two physical states and unknown transition probabilities.

Figure 2. A Markov chain with two physical states and unknown transition probabilities, along with the component information state transition diagrams (the information state trajectory shown corresponds to the following sequence of physical state transitions: 1 -> 1 -> 2 -> 1 -> 2 -> 2 -> 2 -> 1 -> 1 -> 2).

We may think of each physical state as having associated information state components that describe the uncertainty in the transition probabilities leading from that physical state. These information state components evolve only when the associated physical state experiences a transition, and on a time-scale defined by these particular physical state transitions, the evolution of the corresponding information state components exactly follows that of a Bernoulli process with an unknown parameter. (More generally, for N physical states, sampling is multinomial and the appropriate conjugate family is Dirichlet.) Moreover, the information state component transition diagrams associated with each physical state are identical; all information state components move about upon identical terrain. The only differences between the information state components' dynamics are: (1) they may have differing initial conditions (defined by their priors at starting time), and (2) their "local clocks" may tick at different time-varying rates: the matriculation rate of a given component of information state through its transition diagram is determined by the rate of physical state transitions observed from the information state component's corresponding physical state.

2.3. Flux constraints

In Section 2.1 we proposed a diffusion model for the density function associated with embedded information state components. What we desire is a model for the joint density of these components under the constraint that they correspond to physical-state trajectories of a Markov chain. The information-state trajectories must conform to certain constraints of flux (the implied concerted flow into and out of physical states) in order to be consistent with realizable physical state trajectories. Later in this section, we shall argue that these flux constraints imply that reachable information states must lie within regions defined, in part, by certain particular hyperplanes in the joint information state space.

If we are given a set of component information state trajectories, {x_i(tau_i), tau_i^0 <= tau_i <= T}, i = 1, ..., N, and the initial physical state s_{t0}, we can, in a straightforward manner, construct the corresponding physical state trajectory. We simply follow the information state diffusion x_{s_{t0}} until it "level-crosses"; i.e., it reaches a boundary line (or hyperplane in higher dimensions) that signifies a change in physical state. For example, in Figure 2, lines defined by constant integral values of m12 signify boundaries that, when crossed by the information state process, indicate a physical state transition to state 2. We then follow the information state diffusion associated with the new physical state until it level-crosses, and so on.

This procedure constructs sample paths in a left-to-right fashion. An alternative approach starts from an initial set of pre-generated component information state trajectories,^2 which may then be retracted so as to satisfy flux constraints.

^2 These information state trajectories can be generated independently; i.e., a given trajectory's shape (but not extent or progress as a function of time) is independent of other information state components.
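The embedded Bayes-Bernoulli walk of Section 2.1 is easy to check numerically. The sketch below (illustrative only; the prior pseudo-counts m1 = 2, m2 = 3 are hypothetical choices) simulates the discrete information-state chain, whose up-probability in (x, t) coordinates is (1/2)(1 + x/(t+2)), and compares the empirical moments of x against the analytic mean x_{t0}(t+2)/(t0+2) and the diffusion-model variance (t+2)(t - t0)/(t0+2). The mean should match closely; as the paper later remarks, the diffusion model overestimates the variance.

```python
import random

def simulate_x(m1, m2, horizon, rng):
    # One trajectory of the Bayes-Bernoulli information-state chain:
    # with pseudo-counts (m1, m2), a "success" (m2 += 1) occurs with
    # predictive probability m2 / (m1 + m2); otherwise m1 += 1.
    for _ in range(horizon):
        if rng.random() < m2 / (m1 + m2):
            m2 += 1
        else:
            m1 += 1
    return m2 - m1                      # x = m2 - m1

rng = random.Random(0)
m1_0, m2_0 = 2, 3                       # hypothetical prior pseudo-counts
t0 = m1_0 + m2_0 - 2                    # t = m1 + m2 - 2
x0 = m2_0 - m1_0
horizon = 50
t_end = t0 + horizon

xs = [simulate_x(m1_0, m2_0, horizon, rng) for _ in range(20000)]
emp_mean = sum(xs) / len(xs)
emp_var = sum((x - emp_mean) ** 2 for x in xs) / len(xs)

ana_mean = x0 * (t_end + 2) / (t0 + 2)            # E(x_t): matched exactly
ana_var = (t_end + 2) * (t_end - t0) / (t0 + 2)   # diffusion model (overestimate)
```

With these numbers ana_mean is 11.0, and the empirical variance comes out below ana_var, consistent with the Gaussian model's variance being an overestimate of the true (Polya-urn) spread.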

2.4. Normal approximation and interpolation

We now consider an approach that may be viewed as using the diffusion model to gauge the distribution of a given information state component if we were to follow it out to any specified horizon. The slope defined by linear interpolation to the terminal information state is a (normal) random variable, which in turn appears in the coefficient matrix of a linear system of equations describing flux constraints (and whose solution determines the joint values of information state). Using the special structure of the flux-constraint matrix we can invert the linear flux-constraint system. Finally, we use a Taylor-series estimation technique to approximate the mean values of joint information state, or equivalently, the expected values of the Markov-chain transition frequencies.

In an attempt to make this approach more concrete, we present the steps for the simple case of a Markov chain with two physical states. Let the parameters of the (matrix beta) prior for the transition matrix P be denoted by M^0, with rows (m11^0, m12^0) and (m21^0, m22^0). Begin by transforming to (x, tau) coordinates:

    tau_1 = m11 + m12 - 2,    x_1 = m12 - m11,
    tau_2 = m21 + m22 - 2,    x_2 = m21 - m22.

Introducing a diffusion approximation for each of these components as in Section 2.1 implies that each embedded information state component has a density that is Gaussian with mean

    E(x_i(tau_i)) = x_i(tau_i^0) (tau_i + 2)/(tau_i^0 + 2),    i = 1, 2,

and second moment

    E(x_i^2(tau_i)) = c_i (tau_i + 2)^2 - (tau_i + 2),    i = 1, 2,

where

    c_i = (x_i^2(tau_i^0) + (tau_i^0 + 2)) / (tau_i^0 + 2)^2.

This implies that the variance of x_i is a quadratic function of tau_i.

Suppressing subscripts for the moment, define the slope, s(Delta tau), as the slope of the linearly-interpolated trajectory of x defined by its values at tau_0 and tau_0 + Delta tau:

    s(Delta tau) = (x(tau_0 + Delta tau) - x(tau_0)) / Delta tau,

whose expected value is constant:

    E(s(Delta tau)) = x(tau_0) / (tau_0 + 2).

The variance of the slope is

    Var(s(Delta tau)) = (1/(Delta tau)^2) Var(x(tau_0 + Delta tau)).

Suppose the Markov chain starts in physical state 1. The flux constraints require that observed state transition frequency counts satisfy

    f12 - f21 = delta_{2v},

where delta is the Kronecker delta and v is the final state, together with

    f11 + f12 + f22 + f21 = n,

where n is the total number of observed transitions. (The two possible right-hand sides of the first constraint correspond to the final state being equal to state 1 or state 2, respectively.) The transition frequency matrix is just the difference of the prior and posterior matrices of matrix beta parameters.

To write the flux constraints in terms of (x, tau) parameters we first consider the transformation to matrix beta parameters, and then to transition frequencies; e.g., for the components associated with state 1,

    m11 = (tau_1 - x_1)/2 + 1,    m12 = (tau_1 + x_1)/2 + 1,

which implies that

    f11 = (Delta tau_1 - Delta x_1)/2,    f12 = (Delta tau_1 + Delta x_1)/2.

Thus, in terms of (x, tau) coordinates, the flux constraints may be written as

    (1/2)(Delta tau_1 + Delta x_1 - Delta tau_2 - Delta x_2) = delta_{2v},
    Delta tau_1 + Delta tau_2 = n.
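The flux-constraint bookkeeping can be verified exactly on any simulated trajectory: counting transition frequencies and converting them to (x, tau) increments must reproduce delta_{2v} and n. A small sketch (the "true" transition matrix P below is hypothetical and used only to generate data):

```python
import random

rng = random.Random(1)
P = [[0.7, 0.3], [0.6, 0.4]]      # hypothetical true transition probabilities
n = 40
state = 0                          # start in physical state 1 (index 0)
f = [[0, 0], [0, 0]]               # f[i][j] = count of transitions i+1 -> j+1
for _ in range(n):
    nxt = 0 if rng.random() < P[state][0] else 1
    f[state][nxt] += 1
    state = nxt
final = state                      # terminal state v - 1 (0-based)

# (x, tau) increments of each component:
# tau_i = m_i1 + m_i2 - 2;  x_1 = m12 - m11,  x_2 = m21 - m22.
d_tau1, d_x1 = f[0][0] + f[0][1], f[0][1] - f[0][0]
d_tau2, d_x2 = f[1][0] + f[1][1], f[1][0] - f[1][1]

flux1 = (d_tau1 + d_x1 - d_tau2 - d_x2) / 2   # equals f12 - f21 = delta_{2v}
flux2 = d_tau1 + d_tau2                        # equals n
```

The identities hold for every sample path, not just in expectation: each entry into state 2 is matched by an exit unless the trajectory ends there.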

Now in time Delta tau_1, Delta x_1 ~ s_1 Delta tau_1, and in time Delta tau_2, Delta x_2 ~ s_2 Delta tau_2, which makes the flux constraints

    (1/2)((1 + s_1) Delta tau_1 - (1 + s_2) Delta tau_2) = delta_{2v},
    Delta tau_1 + Delta tau_2 = n.

The full system (including equations of the form Delta x = s Delta tau) is:

    [ 1+s_1   -(1+s_2)    0    0 ] [ Delta tau_1 ]   [ 2 delta_{2v} ]
    [   1         1       0    0 ] [ Delta tau_2 ] = [      n       ]
    [  s_1        0      -1    0 ] [ Delta x_1   ]   [      0       ]
    [   0        s_2      0   -1 ] [ Delta x_2   ]   [      0       ]

Call the coefficient matrix A. It has the partitioned form

    A = [ P   0 ]
        [ D  -I ]

whose inverse is

    A^{-1} = [ P^{-1}      0 ]
             [ D P^{-1}   -I ]

with

    P^{-1} = (1/(s_1 + s_2 + 2)) [  1   1+s_2 ]
                                 [ -1   1+s_1 ]

and D = diag(s_1, s_2).

2.5. Taylor-series approximation for mean

We now make use of the fact (see Papoulis, 1991, p. 156, for example) that if a function g(x, y) is sufficiently smooth near the point (mu_x, mu_y), then the mean mu_g (and variance sigma_g^2) of g(X, Y) can be estimated in terms of the mean, variance, and covariance of X and Y:

    mu_g ~ g(mu_x, mu_y) + (1/2) [ (d^2 g/dx^2) sigma_x^2 + 2 r sigma_x sigma_y (d^2 g/dx dy) + (d^2 g/dy^2) sigma_y^2 ],

where the derivatives of g are evaluated at x = mu_x and y = mu_y, and r is the correlation coefficient. In our case, the slopes, s_1 and s_2, are independent, making r zero:

    mu_g ~ g(mu_x, mu_y) + (1/2) [ (d^2 g/dx^2) sigma_x^2 + (d^2 g/dy^2) sigma_y^2 ].

Estimates for the mean value of any desired element of A^{-1} can be derived by applying the Taylor-series formula in which the function g assumes the form of the corresponding element's functional form. For example, the second column of A^{-1} is:

    [    (s_2+1)/(s_1+s_2+2)   ]
    [    (s_1+1)/(s_1+s_2+2)   ]
    [ s_1(s_2+1)/(s_1+s_2+2)   ]
    [ s_2(s_1+1)/(s_1+s_2+2)   ]

and the Taylor-series estimate for the mean of the third element, for example, in this column is given by

    mu_1(mu_2+1)/(mu_1+mu_2+2)
      - sigma_1^2 (mu_2+1)(mu_2+2)/(mu_1+mu_2+2)^3
      - sigma_2^2 mu_1(mu_1+1)/(mu_1+mu_2+2)^3.

Finally, we may obtain an estimate for joint transition frequencies by substituting values for the slope means, mu_i, and variances, sigma_i^2, derived previously. Estimates for the mean of the joint distribution over x_i and tau_i can be derived by multiplying the estimate for the mean of A^{-1} by the appropriate right-hand side, yielding an estimate for mean x_i and tau_i. A simple linear transformation translates the result back into the original coordinate system that has an interpretation in terms of transition frequencies.

2.6. Some details

2.6.1. Slope variance

It can be shown that

    Var(x(tau)) = (tau + 2)(tau - tau_0)/(tau_0 + 2),

and hence that

    sigma_s^2 = (tau + 2) / ((tau - tau_0)(tau_0 + 2)),

whose asymptotic value is 1/(tau_0 + 2).

For computing estimates for joint transition frequencies, we suggest two possibilities: (1) use the asymptotic values for slope variances, or (2) begin by using the asymptotic slope variance values in solving the linear system for mean x and tau, then substitute the mean tau into the formula for sigma_s^2 and recompute the solution to the linear system.

2.6.2. Flux constraints

Previously, we stated that the flux constraints require (assuming state 1 is the starting state) that the transition frequency counts satisfy f12 - f21 = delta_{2v}, where v is the final state, together with f11 + f12 + f22 + f21 = n, where n is the total number of observed transitions.
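The second-order Taylor estimate can be sanity-checked against brute-force sampling for the third element of the second column, g(s1, s2) = s1(1+s2)/(s1+s2+2). The slope means and variances below are hypothetical stand-ins:

```python
import random

def g(s1, s2):
    # Third element of the second column of A^{-1}.
    return s1 * (1 + s2) / (s1 + s2 + 2)

mu1, mu2 = 0.4, -0.2          # hypothetical slope means
sig1, sig2 = 0.15, 0.15       # hypothetical slope standard deviations

d = mu1 + mu2 + 2
taylor = (g(mu1, mu2)
          - sig1 ** 2 * (1 + mu2) * (mu2 + 2) / d ** 3
          - sig2 ** 2 * mu1 * (mu1 + 1) / d ** 3)

rng = random.Random(2)
n_samples = 200000
mc = sum(g(rng.gauss(mu1, sig1), rng.gauss(mu2, sig2))
         for _ in range(n_samples)) / n_samples
```

The two correction terms come from the second derivatives d2g/ds1^2 = -2(1+s2)(s2+2)/(s1+s2+2)^3 and d2g/ds2^2 = -2 s1(s1+1)/(s1+s2+2)^3 evaluated at the means; the corrected estimate lands noticeably closer to the sampled mean than the naive plug-in g(mu1, mu2).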

Adopting the linearly-interpolated model for increments in information state components implies that f12 and f21 are linear functions of f11 and f22, respectively:

    f12 = s_1 f11,    f21 = s_2 f22,

where the s_i now denote slopes in the transition frequency planes, and we can rewrite the second flux constraint as:

    ((s_1 + 1)/s_1) f12 + ((s_2 + 1)/s_2) f21 = n.

Figure 3. Flux constraint geometry.

Figure 3 sketches the geometric relationships implied by the flux constraints. The first flux constraint defines two hyperplanes, each corresponding to a particular terminal state. The second flux constraint defines a third hyperplane whose intersection with the previously defined hyperplanes determines transition frequency counts that would satisfy the flux constraint equations. In general, the solution to the flux constraint system will be a set of transition frequencies that are real numbers rather than integers. An interpretation of interpolated trajectories in terms of incremental jumps on the information state transition diagram (Figure 1) would give rise to "stairstep" functions. Allowing transition frequencies to take on real values facilitates analysis, but introduces another source of approximation error.

3. Example

3.1. Summary of the analytic procedure

Step 1. Translate from the matrix-beta parameters, which specify prior uncertainty in the unknown transition probabilities, to (x, tau) coordinates:

    tau_1 = m11 + m12 - 2,    x_1 = m12 - m11,
    tau_2 = m21 + m22 - 2,    x_2 = m21 - m22.

Step 2. An approximation for the expected increments in the values of the information state trajectories, in (x, tau) coordinates, assuming terminal state 1, is given by the expected value of the second column of the inverse of the flux constraint matrix, A^{-1}, multiplied by the total number of transitions, n. Utilizing the Taylor-series approximation of Section 2.5, the expected values of the elements of the second column of A^{-1} are computed with

    mu_i = E(s_i) = x_i(tau_i^0)/(tau_i^0 + 2),    i = 1, 2,

and the slope variances, sigma_i^2, i = 1, 2, may be defined by their asymptotic values, sigma_i^2 = 1/(tau_i^0 + 2), or through an iterative procedure of the sort described previously in Section 2.6.1.

Step 3. Multiply the approximate mu(.)-values from Step 2 by n, the total number of transitions, to get approximate values for the expected increments in (x, tau) coordinates. Perform a simple linear transformation of the result to get an approximation for the expected increments in matrix-beta parameters, which are the expected transition frequencies:

    E[ f11 ]         [ 1  -1   0   0 ]   E[ Delta tau_1 ]
    E[ f12 ] = (1/2) [ 1   1   0   0 ]   E[ Delta x_1   ]
    E[ f22 ]         [ 0   0   1  -1 ]   E[ Delta tau_2 ]
    E[ f21 ]         [ 0   0   1   1 ]   E[ Delta x_2   ]
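Steps 1-3 can be assembled into a short numerical sketch, using the closed-form columns of A^{-1} from Section 2.5; the prior counts, horizon, and terminal state chosen below are hypothetical. The two flux identities (total count n, and f12 - f21 = delta_{2v}) are preserved exactly by the construction, which gives a useful internal check.

```python
def a_inv_col(j, s1, s2):
    # Columns 0 and 1 of A^{-1}, via the partitioned inverse
    # A^{-1} = [[P^{-1}, 0], [D P^{-1}, -I]] with D = diag(s1, s2).
    # Element order: (d_tau1, d_tau2, d_x1, d_x2).
    den = s1 + s2 + 2
    if j == 0:
        return [1 / den, -1 / den, s1 / den, -s2 / den]
    return [(1 + s2) / den, (1 + s1) / den,
            s1 * (1 + s2) / den, s2 * (1 + s1) / den]

def taylor_mean(fn, mu1, mu2, v1, v2, h=1e-3):
    # E fn(s1, s2) ~ fn(mu) + (1/2)(v1 fn_11 + v2 fn_22), numeric Hessian.
    f0 = fn(mu1, mu2)
    d11 = (fn(mu1 + h, mu2) - 2 * f0 + fn(mu1 - h, mu2)) / h ** 2
    d22 = (fn(mu1, mu2 + h) - 2 * f0 + fn(mu1, mu2 - h)) / h ** 2
    return f0 + 0.5 * (v1 * d11 + v2 * d22)

# Step 1: hypothetical matrix-beta prior -> (x, tau) coordinates.
m = [[2, 1], [1, 3]]
tau0 = [m[0][0] + m[0][1] - 2, m[1][0] + m[1][1] - 2]
x0 = [m[0][1] - m[0][0], m[1][0] - m[1][1]]
mu = [x0[i] / (tau0[i] + 2) for i in range(2)]     # slope means
var = [1 / (tau0[i] + 2) for i in range(2)]        # asymptotic slope variances

# Step 2: expected columns of A^{-1} via the Taylor formula.
n, v = 25, 2                                       # horizon, terminal state
e_col = [[taylor_mean(lambda a, b, jj=j, kk=k: a_inv_col(jj, a, b)[kk],
                      mu[0], mu[1], var[0], var[1])
          for k in range(4)] for j in range(2)]
delta = 1 if v == 2 else 0
z = [n * e_col[1][k] + 2 * delta * e_col[0][k] for k in range(4)]
d_tau1, d_tau2, d_x1, d_x2 = z

# Step 3: back to expected transition frequencies.
f11, f12 = (d_tau1 - d_x1) / 2, (d_tau1 + d_x1) / 2
f22, f21 = (d_tau2 - d_x2) / 2, (d_tau2 + d_x2) / 2
```

Because the Taylor operator is linear in the function being approximated, the estimated frequencies sum to n and satisfy f12 - f21 = delta_{2v} up to floating-point rounding, regardless of the prior.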

Finally, one may take the inner product of the approximate expected transition frequencies with the corresponding expected immediate rewards to form an approximation for the value of the Bayes-adaptive Markov reward process.

3.2. Example (2 physical states, horizon = 25)

As a simple test of our analytic approach we consider a two-state Markov chain whose transition probability uncertainty is characterized by a matrix-beta prior with parameters:

    M^0 = [ .    .  ]
          [ 1    3  ]

Over the course of 25 state transitions (starting from state 1) a Monte-Carlo estimate (which may be regarded as "ground truth") for the mean transition frequencies yields:

    f_MC = [ .    3.75 ]
           [ .   13.7  ]

while our analytic approximation technique leads to:

    f_Anal,1 = [ 2.67   3.48 ]
               [ 3.48  15.4  ]

If we adopt a right-hand side of the flux constraint equations that is consistent with the terminal state being state 2, then two columns of E(A^{-1}) enter into our approximation formulae. (The approximate expected increment in (x, tau) coordinates is given by our previous estimate plus 2 times the first column of E(A^{-1}).) This leads to an estimate equal to

    f_Anal,2 = [ 3.19   4.23 ]
               [ 3.23  14.3  ]

A reasonable approach would be to average the estimates in some way. If we form the certainty-equivalent estimate for the transition probability matrix,^3 then solve for the associated stationary distribution, pi, via pi = pi P, Sum_i pi_i = 1, we may construct a pi-weighted combination of the f_Anal,v (for our example):

    f_Anal,pi = [ 3.05   4.02 ]
                [ 3.30  14.6  ]

The diffusion model for the basic Bayes-Bernoulli process exactly matches the first moment of the true process, but systematically overestimates the second moment (tails of the propagating normal density exceed the natural boundaries of propagation). To derive a more accurate model of the second moment, one may naturally consider Fokker-Planck equations for information state densities with suitable (reflecting) boundary conditions, though the resulting forward equations are not time-homogeneous, which discourages solution via separation-of-variables. It is possible to adjust the predictions suggested by the simple model^4 by condensing the portions of the propagating distribution tails that exceed the natural boundaries onto the boundaries, and then calculating the resulting decrement in the second moment. Considerations of this kind lead to a model with reduced slope-variance estimates and improved estimates for mean transition frequencies:

    f_scaled-var = [ .    3.66 ]
                   [ .   13.8  ]

The computational complexity of our analytic approach is independent of the time horizon; it requires coordinate transformations and evaluations of the A^{-1} column formulae, which entails total work of order O(N^2).

^3 For reference, we note the certainty-equivalence estimate:

    f_CE = [ .    4.35 ]
           [ .   13.0  ]

^4 We note that there exist values of sigma_1^2 and sigma_2^2 that make the diffusion approximation prediction essentially exact.

4. Summary

Our goal has been the synthesis of an analytic procedure for estimating the mean state-transition frequencies of a Markov chain whose transition probabilities are unknown but modeled in a Bayesian way. We have noted that information state components (the parameters specifying distributions that model uncertainty in transition probabilities leading from each state) may be viewed as embedded Markov chains that are amenable to diffusion modeling. Analysis yields simple formulae for means and variances of Gaussian processes that model the flow of information state density.

Conceptually, we may imagine generating sample paths of joint information state by (see Figure 4): (1) generating a set of terminal information states consistent with the Gaussian process models, (2) constructing a corresponding set of linearly-interpolated information-state trajectories, and (3) retracting these trajectories so that they obey flux constraints, so that together they form a set of trajectories whose joint extent is consistent with a feasible physical-state trajectory. Our analysis gauges the mean result of performing these sample path constructions many times and computing the mean terminal joint information state. Our approach relies upon analytic methods, rather than simulation or brute-force computation, and its complexity scales polynomially with the number of physical states (O(N^2)), rather than exponentially with the time horizon. In (Duff, 2002), it is shown how this approach can provide estimates for the value of a stationary, stochastic policy. One next step would be to combine the policy evaluation scheme with a procedure for policy improvement.

These ideas are rather preliminary in nature, admittedly suggestive rather than definitive. Some open issues include: (1) we seek a more refined diffusion model, since the variance of the simple Gaussian model overestimates that of true information state components; and (2) we seek a more thorough accounting of terminal physical state; our method of combining estimates by weighting them by the certainty-equivalent stationary distribution is provisional, and could almost surely be improved upon.
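The pi-weighted combination described above needs only the stationary distribution of the certainty-equivalent chain, which for two states has a closed form. A small illustration (all numbers hypothetical):

```python
def stationary_2state(P):
    # Solve pi = pi P with pi1 + pi2 = 1 for a 2-state chain:
    # pi1 = P[1][0] / (P[0][1] + P[1][0]).
    p12, p21 = P[0][1], P[1][0]
    pi1 = p21 / (p12 + p21)
    return [pi1, 1 - pi1]

P_ce = [[0.4, 0.6], [0.2, 0.8]]      # hypothetical certainty-equivalent matrix
pi = stationary_2state(P_ce)

F1 = [[2.0, 3.0], [3.0, 15.0]]       # hypothetical per-terminal-state estimates
F2 = [[3.0, 4.0], [3.0, 14.0]]
F_avg = [[pi[0] * F1[i][j] + pi[1] * F2[i][j] for j in range(2)]
         for i in range(2)]
```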

Acknowledgements

The author wishes to thank Andy Barto and Peter Dayan for helpful comments. This work was supported in part by a grant from the Gatsby Charitable Foundation.

Figure 4. A pictorial summary of our approach: (a) Figure 2 repeated: information components as embedded Markov chains; (b) information density modeled as a Gaussian process; (c) sampling from the Gaussian process and construction of linearly-interpolated trajectories; (d) retraction to extents that satisfy flux constraints. Our analysis gauges the mean result of performing these sample path constructions many times and computing the mean terminal joint information state.

References

Arnold, L. (1974). Stochastic Differential Equations. New York: Wiley.

DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.

Duff, M. O. (2002). Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst.

Kloeden, P. E., & Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Berlin: Springer.

Martin, J. J. (1967). Bayesian Decision Problems and Markov Chains. New York: John Wiley & Sons.

Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill.

Silver, E. A. (1963). Markovian decision processes with uncertain transition probabilities or rewards (Technical Report No. 1). Research in the Control of Complex Systems, Operations Research Center, MIT.

Using the Triangle Inequality to Accelerate k-Means

Charles Elkan ELKAN@CS.UCSD.EDU
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0114

Abstract

The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k > 20 it is many times faster than the best previously known accelerated k-means method.

1. Introduction

The most common method for finding clusters in data used in applications is the algorithm known as k-means. k-means is considered a fast method because it is not based on computing the distances between all pairs of data points. However, the algorithm is still slow in practice for large datasets. The number of distance computations is nke where n is the number of data points, k is the number of clusters to be found, and e is the number of iterations required. Empirically, e grows sublinearly with k, n, and the dimensionality d of the data.

The main contribution of this paper is an optimized version of the standard k-means method, with which the number of distance computations is in practice closer to n than to nke.

The optimized algorithm is based on the fact that most distance calculations in standard k-means are redundant. If a point is far away from a center, it is not necessary to calculate the exact distance between the point and the center in order to know that the point should not be assigned to this center. Conversely, if a point is much closer to one center than to any other, calculating exact distances is not necessary to know that the point should be assigned to the first center. We show below how to make these intuitions concrete.

We want the accelerated k-means algorithm to be usable wherever the standard algorithm is used. Therefore, we need the accelerated algorithm to satisfy three properties. First, it should be able to start with any initial centers, so that all existing initialization methods can continue to be used. Second, given the same initial centers, it should always produce exactly the same final centers as the standard algorithm. Third, it should be able to use any black-box distance metric, so it should not rely for example on optimizations specific to Euclidean distance.

Our algorithm in fact satisfies a condition stronger than the second one above: after each iteration, it produces the same set of center locations as the standard k-means method. This stronger property means that heuristics for merging or splitting centers (and for dealing with empty clusters) can be used together with the new algorithm.

The third condition is important because many applications use a domain-specific distance metric. For example, clustering to identify duplicate alphanumeric records is sometimes based on alphanumeric edit distance (Monge & Elkan, 1996), while clustering of protein structures is often based on an expensive distance function that first rotates and translates structures to superimpose them. Even without a domain-specific metric, recent work shows that using a non-traditional Lp norm with 0 < p < 1 is beneficial when clustering in a high-dimensional space (Aggarwal et al., 2001).

This paper is organized as follows. Section 2 explains how to use the triangle inequality to avoid redundant distance calculations. Then Section 3 presents the new algorithm, and Section 4 discusses experimental results on six datasets of dimensionality 2 to 1000. Section 5 outlines possible improvements to the method, while Section 6 reviews related work, and Section 7 explains three open research issues.
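The pruning intuition above rests on a simple consequence of the triangle inequality: if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so center c' cannot be closer to x than c is and need not be examined. The sketch below (hypothetical 2-D data; the full algorithm of Section 3 additionally maintains per-point upper and lower bounds across iterations) shows this single test skipping most distance computations while reproducing exactly the brute-force assignments:

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def assign(points, centers):
    # Brute force: k distance computations per point.
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def assign_pruned(points, centers):
    # Triangle-inequality pruning: if d(c_best, c_j) >= 2 d(p, c_best),
    # then d(p, c_j) >= d(p, c_best), so c_j can be skipped.
    cc = [[dist(ci, cj) for cj in centers] for ci in centers]
    out, evals = [], 0
    for p in points:
        best = 0
        best_d = dist(p, centers[0])
        evals += 1
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * best_d:
                continue                  # pruned: no distance computed
            d = dist(p, centers[j])
            evals += 1
            if d < best_d:
                best, best_d = j, d
        out.append(best)
    return out, evals

rng = random.Random(3)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in centers for _ in range(100)]
labels_std = assign(points, centers)
labels_fast, n_evals = assign_pruned(points, centers)
```

The assignments agree exactly with the standard computation, while n_evals comes out well below the brute-force count of k distances per point; the pruning is lossless because a skipped center is provably no closer than the current best.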

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.