
Quantum Annealing for Clustering

Kenichi Kurihara, Google, Tokyo, Japan
Shu Tanaka, Institute for Solid State Physics, University of Tokyo, Chiba, Japan
Seiji Miyashita, Dept. of Physics, University of Tokyo, Tokyo, Japan; CREST, Saitama, Japan

Abstract

This paper studies quantum annealing (QA) for clustering, which can be seen as an extension of simulated annealing (SA). We derive a QA algorithm for clustering and propose an annealing schedule, which is crucial in practice. Experiments show the proposed QA algorithm finds better clustering assignments than SA. Furthermore, QA is as easy as SA to implement.

1 Introduction

Clustering is one of the most popular methods in data mining. Typically, clustering problems are formulated as optimization problems, which are solved by algorithms such as the EM algorithm or convex relaxation. However, clustering is typically NP-hard. Simulated annealing (SA) (Kirkpatrick et al., 1983) is a promising candidate for such problems. Geman and Geman (1984) proved SA was able to find the global optimum with a slow cooling schedule of temperature T. Although their schedule is in practice too slow for clustering of a large amount of data, it is well known that SA still finds a reasonably good solution even with a faster schedule than what Geman and Geman proposed.

In statistical mechanics, quantum annealing (QA) has been proposed as a novel alternative to SA (Kadowaki and Nishimori, 1998; Santoro et al., 2002; Matsuda et al., 2009). QA adds another dimension, Γ, to SA for annealing, see Fig.1. Thus, it can be seen as an extension of SA. QA has succeeded in specific problems, e.g. the Ising model in statistical mechanics, but it is still unclear whether QA works better than SA in general.

Figure 1: Quantum annealing (QA) adds another dimension to simulated annealing (SA) to control a model. QA iteratively decreases T and Γ whereas SA decreases just T.

Figure 2: Illustrative explanation of QA. The left figure shows m independent SAs, and the right one is the QA algorithm derived with the Suzuki-Trotter (ST) expansion. σ denotes a clustering assignment.

We do not actually think QA intuitively helps clustering, but we apply QA to clustering just as a procedure to derive an algorithm. A derived QA algorithm depends on the definition of the quantum effect Hq. We propose a quantum effect Hq which leads to a search strategy fit to clustering. Our contribution is,

1. to propose a QA-based optimization algorithm for clustering, in particular
   (a) a quantum effect Hq for clustering
   (b) and a good annealing schedule, which is crucial for applications,
2. and to experimentally show the proposed algorithm optimizes clustering assignments better than SA.

We also show the proposed algorithm is as easy as SA to implement.

The algorithm we propose is a Markov chain Monte Carlo (MCMC) sampler, which we call the QA-ST sampler. As we explain later, a naive QA sampler is intractable even with MCMC. Thus, we approximate QA by the Suzuki-Trotter (ST) expansion (Trotter, 1959; Suzuki, 1976) to derive a tractable sampler, which is the QA-ST sampler. QA-ST looks like m parallel SAs with interaction f (see Fig.2). At the beginning of the annealing process, QA-ST is almost the same as m SAs. Hence, QA-ST finds m (local) optima independently. As the annealing process continues, the interaction f in Fig.2 becomes stronger and moves the m states closer. QA-ST at the end picks up the state with the lowest energy among the m states as the final solution.

QA-ST with the proposed quantum effect Hq works well for clustering. Fig.3 is an example where data points are grouped into four clusters. σ1 and σ2 are locally optimal, and σ∗ is globally optimal. Suppose m is equal to two, and σ1 and σ2 in Fig.2 correspond to σ1 and σ2 in Fig.3. Although σ1 and σ2 are local optima, the interaction f in Fig.2 allows σ1 and σ2 to search for a better clustering assignment between σ1 and σ2. The quantum effect Hq defines the distance metric of clustering assignments. In this case, the proposed Hq locates σ∗ between σ1 and σ2. Thus, the interaction f gives a good chance to go to σ∗ because f makes σ1 and σ2 closer (see Fig.2). The proposed algorithm actually finds σ∗ from σ1 and σ2. Fig.3 is just an example. However, a similar situation often occurs in clustering. Clustering algorithms in most cases give "almost" globally optimal solutions like σ1 and σ2, where the majority of data points are well-clustered but some of them are not. Thus, a better clustering assignment can be constructed by picking up well-clustered data points from many sub-optimal clustering assignments. Note an assignment constructed in such a way is located between the sub-optimal ones by the proposed quantum effect Hq, so that QA-ST can find a better assignment between the sub-optimal ones.

Figure 3: Three clustering results by a mixture of four Gaussians (i.e. #clusters=4): σ1 (local optimum), σ2 (local optimum), and σ∗ (global optimum).

2 Preliminaries

First of all, we introduce the notation used in this paper. We assume we have n data points, and they are assigned to k clusters. The assignment of the i-th data point is denoted by a binary indicator vector σ̃_i. For example, when k is equal to two, we denote the i-th data point assigned to the first and the second cluster by σ̃_i = (1, 0)^T and σ̃_i = (0, 1)^T, respectively. The assignment of all data points is also denoted by an indicator vector, σ, whose length is k^n because the number of available assignments is k^n. σ is constructed from {σ̃_i}_{i=1}^n as σ = ⊗_{i=1}^n σ̃_i, where ⊗ is the Kronecker product, which is a special case of the tensor product for matrices. Let A and B be matrices where

    A = [ a11  a12          Then,   A ⊗ B = [ a11 B   a12 B
          a21  a22 ].                          a21 B   a22 B ]

(see Minka (2000) for example). When A is a matrix, e^A is the matrix exponential of A defined by e^A = Σ_{l=0}^∞ (1/l!) A^l. Only one element in σ is one, and the others are zero. For example, σ = σ̃_1 ⊗ σ̃_2 = (0, 1, 0, 0)^T when k = 2, n = 2, the first data point is assigned to the first cluster (σ̃_1 = (1, 0)^T) and the second data point is assigned to the second cluster (σ̃_2 = (0, 1)^T). We also use the k by n matrix Y to denote the assignment of all data,

    Y(σ) = (σ̃_1, σ̃_2, ..., σ̃_n).    (1)

We do not store σ, whose length is k^n, in memory; we store Y instead. We use σ only for the derivation of quantum annealing. The proposed QA algorithm is like m parallel SAs. We denote the j-th SA of the parallel SAs by σ_j. The i-th data point in σ_j is denoted by σ̃_{j,i}, s.t. σ_j = ⊗_{i=1}^n σ̃_{j,i}.

3 Simulated Annealing for Clustering

We briefly review simulated annealing (SA) (Kirkpatrick et al., 1983), particularly for clustering. SA is a stochastic optimization algorithm. An objective function is given as an energy function such that a better solution has a lower energy. In each step, SA searches for the next random solution near the current one. The next solution is chosen with a probability that depends on the temperature T and on the energy function value of the next solution. SA chooses the next solution almost randomly when T is high, and it goes down the hill of the energy function when T is low. Slower cooling of T increases the probability of finding the global optimum.

Algorithm 1 summarizes a SA algorithm for clustering.

Algorithm 1 Simulated Annealing for Clustering
1: Initialize inverse temperature β and assignment σ.
2: repeat
3:   for i = 1, ..., n do
4:     Draw the new assignment of the i-th data point, σ̃_i, with a probability given in (3).
5:   end for
6:   Increase inverse temperature β.
7: until State σ converges

Given inverse temperature β = 1/T, SA updates state σ with,

    p_SA(σ; β) = (1/Z) exp[−βE(σ)],    (2)

where E(σ) is the energy function of state σ, and Z is a normalization factor defined by Z = Σ_σ exp[−βE(σ)]. For probabilistic models, the energy function is defined by E(σ) ≡ −log p_prob-model(X, σ), where p_prob-model(X, σ) is given by a probabilistic model and X is data. Note p_SA(σ; β = 1) = p_prob-model(σ|X). For loss-function-based models (e.g. spectral clustering), which search for σ = argmin_σ loss(X, σ), the energy function is defined by E(σ) = loss(X, σ).

In many cases, the calculation of Z in (2) is intractable. Thus, Markov chain Monte Carlo (MCMC) is utilized to sample a new state from a current state. In this paper, we focus on the Gibbs sampler among MCMC methods. Each step of the Gibbs sampler draws an assignment of the i-th data point, σ̃_i, from,

    p_SA(σ̃_i | σ\σ̃_i) = exp[−βE(σ)] / Σ_{σ̃_i} exp[−βE(σ)],    (3)

where σ\σ̃_i means {σ̃_j | j ≠ i}. Note the denominator of (3) is a summation over σ̃_i, which is tractable (O(k)).
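To make the update rule (3) concrete, the following is a minimal Python/NumPy sketch of one Gibbs sweep of Algorithm 1. It is not the authors' implementation; the label vector (one cluster index per data point, a compact encoding of the indicator matrix Y(σ)) and the `energy` callback (returning E(σ), e.g. −log p(X, σ)) are assumed placeholders.

    import numpy as np

    def sa_gibbs_sweep(labels, energy, beta, k, rng):
        """One sweep of the SA Gibbs sampler, Eq. (3) / Algorithm 1.

        labels : length-n integer array, labels[i] is the cluster of point i
                 (a compact encoding of the k-by-n indicator matrix Y(sigma)).
        energy : callable mapping a label vector to E(sigma),
                 e.g. -log p(X, sigma) for a probabilistic model.
        """
        n = len(labels)
        for i in range(n):
            # Evaluate E(sigma) for every candidate cluster of point i.
            energies = np.empty(k)
            for c in range(k):
                labels[i] = c
                energies[c] = energy(labels)
            # Gibbs update: p(c) proportional to exp(-beta * E); shift for stability.
            p = np.exp(-beta * (energies - energies.min()))
            labels[i] = rng.choice(k, p=p / p.sum())
        return labels

    # Usage sketch: rng = np.random.default_rng(0); labels = rng.integers(0, k, size=n);
    # call sa_gibbs_sweep repeatedly while increasing beta (Step 6 of Algorithm 1).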
4 Quantum Annealing for Clustering

Our goal of this section is to derive a sampling algorithm based on quantum annealing (QA) for clustering. On the way to the goal, our contribution is threefold: a well-formed quantum effect in Section 4.1, an appropriate similarity measure for clustering in Section 4.2, and an annealing schedule in Section 4.3.

4.1 Introducing Quantum Effect and the Suzuki-Trotter Expansion

Before we apply QA to clustering problems, we set them up in a similar fashion to statistical mechanics. We reformulate (2) by the following equation,

    p_SA(σ; β) = (1/Z) σ^T e^{−βHc} σ,    (4)

where Hc is a k^n by k^n diagonal matrix. Hc is called a (classical) Hamiltonian in physics. For example, we have the following Hc when k = 2 and n = 2,

    Hc = diag(E(σ(1)), E(σ(2)), E(σ(3)), E(σ(4))).    (5)

In this example, σ(t) indicates the t-th assignment of the k^n available assignments, i.e. σ(1) = (1, 0, 0, 0)^T, σ(2) = (0, 1, 0, 0)^T, σ(3) = (0, 0, 1, 0)^T and σ(4) = (0, 0, 0, 1)^T. e^{−βHc} is the matrix exponential in (4). Since Hc is diagonal, e^{−βHc} is also diagonal with [e^{−βHc}]_tt = exp(−βE(σ(t))). Hence, we find σ(t)^T e^{−βHc} σ(t) = exp(−βE(σ(t))), and (4) is equal to (2). In practice, we use MCMC methods to sample σ from p_SA(σ; β) in (4) by (3). This is because we do not need to calculate Z, and it is easy to evaluate σ^T e^{−βHc} σ, which is equal to exp(−βE(σ)).

QA draws a sample from the following equation,

    p_QA(σ; β, Γ) = (1/Z) σ^T e^{−βH} σ,    (6)

where H is defined by H = Hc + Hq. Hq represents the quantum effect. We define Hq by Hq = Σ_{i=1}^n ρ_i, where

    ρ_i = (⊗_{j=1}^{i−1} E_k) ⊗ ρ ⊗ (⊗_{j=i+1}^{n} E_k),    ρ = Γ(E_k − 1_k),

E_k is the k by k identity matrix, and 1_k is the k by k matrix of ones, i.e. [1_k]_{ij} = 1 for all i and j. For example, H is,

    H = [ E(σ(1))   −Γ        −Γ        0
          −Γ        E(σ(2))   0         −Γ
          −Γ        0         E(σ(3))   −Γ
          0         −Γ        −Γ        E(σ(4)) ],    (7)

when k = 2 and n = 2. The derived algorithm depends on the quantum effect Hq. We found our definition of Hq worked well. We also tried a couple of other Hq; we explain a bad example of Hq later in this section.

QA samples σ from (6). For SA, MCMC methods are exploited for sampling. However, in quantum models, we cannot apply MCMC methods directly to (6) because it is intractable to evaluate σ^T e^{−βH} σ, unlike σ^T e^{−βHc} σ = exp(−βE(σ)). This is because e^{−βH} is not diagonal whereas e^{−βHc} is diagonal. Thus, we exploit the Trotter product formula (Trotter, 1959) to approximate (6). If A_1, ..., A_L are symmetric matrices, the Trotter product formula gives exp(Σ_{l=1}^L A_l) = (∏_{l=1}^L exp(A_l/m))^m + O(1/m). Note the residual for finite m is of the order of 1/m. Hence, this approximation becomes exact in the limit of m → ∞. Since H = Hc + Hq is symmetric, we can apply the Trotter product formula to (6). Following Suzuki (1976), (6) reads the following expression.
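The 1/m behaviour of the Trotter residual can be checked numerically. The sketch below is not from the paper; it compares exp(A+B) with (exp(A/m) exp(B/m))^m for two random symmetric matrices standing in for −βHc and −βHq.

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4)); A = (A + A.T) / 2   # stand-in for -beta*Hc
    B = rng.normal(size=(4, 4)); B = (B + B.T) / 2   # stand-in for -beta*Hq

    exact = expm(A + B)
    for m in (1, 10, 100, 1000):
        approx = np.linalg.matrix_power(expm(A / m) @ expm(B / m), m)
        # The maximum absolute error shrinks roughly like 1/m.
        print(m, np.abs(exact - approx).max())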

Theorem 4.1.

    p_QA(σ_1; β, Γ) = Σ_{σ_2} ... Σ_{σ_m} p_QA-ST(σ_1, σ_2, ..., σ_m; β, Γ) + O(1/m),    (8)

where

    p_QA-ST(σ_1, ..., σ_m; β, Γ) ≡ (1/Z) ∏_{j=1}^m p_SA(σ_j; β/m) e^{s(σ_j, σ_{j+1}) f(β,Γ)},    (9)

    s(σ_j, σ_{j+1}) = (1/n) Σ_{i=1}^n δ(σ̃_{j,i}, σ̃_{j+1,i}),    (10)

    f(β, Γ) = n log(1 + k/(e^{kβΓ/m} − 1)).    (11)

The derivation from (6) to (8) is called the Suzuki-Trotter expansion. We show the details of the derivation in Appendix A. (8) means sampling σ_1 from p_QA(σ_1; β, Γ) is approximated by sampling (σ_1, ..., σ_m) from p_QA-ST(σ_1, ..., σ_m). (9) shows p_QA-ST is similar to m parallel {p_SA(σ_j; β/m)}_{j=1}^m, but it has the quantum interaction e^{s(σ_j,σ_{j+1}) f(β,Γ)}. Note if f(β, Γ) = 0, i.e. Γ = ∞, the interaction disappears, and p_QA-ST becomes m independent SAs. s(σ_j, σ_{j+1}) takes values in [0, 1], where s(σ_j, σ_{j+1}) = 1 when σ_j = σ_{j+1} and s(σ_j, σ_{j+1}) = 0 when σ_j and σ_{j+1} are completely different. Thus, we call s(σ_j, σ_{j+1}) the similarity. Even with finite m, we can show the approximation in (8) becomes exact after enough annealing iterations have passed with our annealing schedule proposed in Section 4.3.¹

¹ The residual of the approximation in (8) is dominated by β²Γ with small β and large Γ. Using the annealing schedule proposed in Section 4.3, the residual goes to zero as annealing continues (β → ∞ and Γ → 0).

The similarity in (10) depends on the quantum effect Hq. A different Hq results in a different similarity. For example, we can derive an algorithm with quantum effect H'q = Γ(E_{k^n} − 1_{k^n}). H'q gives the similarity s'(σ_j, σ_{j+1}) = ∏_{i=1}^n δ(σ̃_{j,i}, σ̃_{j+1,i}). Going back to Fig.3, we notice s(σ_1, σ_2) > 0 but s'(σ_1, σ_2) = 0. In this case, p_QA-ST with H'q is just m independent SAs because the interaction f is canceled by s'(σ_1, σ_2) = 0, and p_QA-ST is unlikely to search for σ∗. On the other hand, p_QA-ST with Hq is more likely to search for σ∗ because the interaction f allows the states to search between σ_1 and σ_2.

Now, we can construct a Gibbs sampler based on p_QA-ST in a similar fashion to (3). Although the QA-ST sampler is tractable for statistical mechanics, it is intractable for machine learning. We give a solution to the problem in Section 4.2. We also discuss the annealing schedule of β and Γ in Section 4.3, which is a crucial point in practice.

Figure 4: σ_1 and σ_2 give the same clustering but have different cluster labels. Thus, s(σ_1, σ_2) = 0. After the cluster label permutation from σ_2 to σ_2', s(σ_1, σ_2') = 1. The purity, s̃, gives s̃(σ_1, σ_2) = 1 as well.

Figure 5: The schedules of β and f(β, Γ).
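As a concrete reading of (10) and (11), the following small sketch (ours, not the paper's code) computes the similarity s and the interaction strength f for label-vector encodings of the σ_j:

    import numpy as np

    def similarity(labels_a, labels_b):
        """s(sigma_j, sigma_{j+1}) in Eq. (10): fraction of identically labeled points."""
        return float(np.mean(np.asarray(labels_a) == np.asarray(labels_b)))

    def interaction(beta, gamma, n, k, m):
        """f(beta, Gamma) in Eq. (11); it vanishes as Gamma -> infinity."""
        return n * np.log1p(k / np.expm1(k * beta * gamma / m))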

4.2 Cluster-Label Permutation

Our goal is to make an efficient sampling algorithm. In a similar fashion to (3), we can construct a Gibbs sampler p_QA-ST(σ̃_{j,i} | {σ_j}_{j=1}^m \σ̃_{j,i}) whose computational complexity is the same as that of (3). However, the sampler can easily get stuck in local optima, such as p_QA-ST(σ_1, σ_2) in Fig.4. If we can draw σ_2' in Fig.4 from σ_2, (σ_1, σ_2') is a better state than (σ_1, σ_2), i.e. p_QA-ST(σ_1, σ_2') ≥ p_QA-ST(σ_1, σ_2), because s(σ_1, σ_2') = 1, s(σ_1, σ_2) = 0 and f(β, Γ) ≥ 0 in (9). Since the sampler p_QA-ST(σ̃_{j,i} | {σ_j}_{j=1}^m \σ̃_{j,i}) only changes the label of one data point at a time, it cannot sample σ_2' from σ_2 efficiently. In statistical mechanics, a cluster-label permutation sampler is applied to cases such as Fig.4. The label permutation sampler does not change cluster assignments but draws a cluster label permutation, e.g. σ_2' from σ_2, in one step. In other words, the sampler exchanges rows of the matrix Y(σ) defined in (1). In the case of Fig.4, k is equal to four, so we have 4! = 24 choices of label permutation. The computational complexity of the sampler is O(k!) because its normalization factor requires a summation over k! choices. The sampler is tractable for statistical mechanics due to relatively small k. However, it is intractable for machine learning, where k can be very large.

We introduce an approximation of p_QA-ST so that we do not need to sample label permutations, whose computational complexity is O(k!). In particular, we replace the similarity s(σ_j, σ_{j+1}) in (9) by the purity, s̃(σ_j, σ_{j+1}). The purity is defined by

    s̃(σ_j, σ_{j+1}) ≡ (1/n) Σ_{c=1}^k max_{c'=1,...,k} [Y(σ_j) Y(σ_{j+1})^T]_{c,c'},

where Y is defined in (1), and [A]_{c,c'} denotes the (c, c') element of matrix A. In the case of Fig.4, s̃(σ_1, σ_2) = 1 whereas s(σ_1, σ_2) = 0. In general, s(σ_1, σ_2) ≤ s̃(σ_1, σ_2).

Let σ̃_{j,i} be the i-th data point of assignment σ_j. The update probability of σ̃_{j,i} with the purity is,
On the other exp − E(σ ) +s ˜(σ − ,σ ,σ )f(β, Γ) m j j 1 j j+1 β m = , (12) hand, if m ≪ f(β, Γ), {σj }j=1 become very close to h β i exp − m E(σj) +s ˜(σj−1,σj ,σj+1)f(β, Γ) each other regardless of energy E(σj). σ˜j,i P h i From the above discussion, β/m at the beginning 2 wheres ˜(σj−1,σj ,σj+1) =s ˜(σj,σj−1) +s ˜(σj ,σj+1) . should be larger than f(β, Γ) and large enough to col- The computational complexity of (12) is O(k2), but lect suboptimal assignments, and f(β, Γ) should be- m caching statistics reduces it to O(k). Thus, Step 1 in come larger than β/m at some point to make {σj}j=1 Algorithm 2 requires O(k2), and Step 5 requires O(k), closer. The best path of β and f(β, Γ) would be like ∗ which is the same as SA. f in Fig.5. f1 in Fig.5 is stronger than β from the beginning, which does not allow QA-ST to search for Using another representation of σj and a different Hq, good assignments due to too strong quantum inter- we can develop a sampler, which does not need label action f(β, Γ)˜s(σj−1,σj ,σj+1) in (12). f2 is always permutation. However, its computational complexity m smaller than β. Hence, QA-ST never makes {σj}j=1 is O(n) in Step 5, which is much more expensive than closer. In other words, QA-ST does not search for O(k). Thus, the sampler is less efficient than the pro- m a middle (hopefully better) assignment from {σj }j=1. posed sampler even though the sampler does not need Specifically, we use the following annealing schedule in to solve label permutation. Step 8 in Algorithm 2.

4.3 Annealing Schedule of β and Γ

The annealing schedules of β and Γ significantly affect the result of QA. Thus, it is crucial to use good schedules of β and Γ. In this section, we propose an annealing schedule of Γ and β.

We address two points before proposing a schedule. One is our observation from pilot experiments, and the other is the balance of β and Γ. From our pilot experiments, we observe QA-ST works well when it can find suboptimal assignments {σ_j}_{j=1}^m by convergence. (12) shows QA-ST searches for a better assignment from suboptimal {σ_j}_{j=1}^m. On the other hand, when the current {σ_j}_{j=1}^m are far away from the global optimum or even from sub-optima, QA-ST does not necessarily work well. Comparing (12) with (3) in terms of β and Γ, if β/m ≫ f(β, Γ), {σ_j}_{j=1}^m are sampled from p_SA(σ_j), i.e. there is no interaction between σ_j and σ_{j+1}. On the other hand, if β/m ≪ f(β, Γ), {σ_j}_{j=1}^m become very close to each other regardless of the energy E(σ_j).

From the above discussion, β/m at the beginning should be larger than f(β, Γ) and large enough to collect suboptimal assignments, and f(β, Γ) should become larger than β/m at some point to make {σ_j}_{j=1}^m closer. The best path of β and f(β, Γ) would be like f∗ in Fig.5. f1 in Fig.5 is stronger than β from the beginning, which does not allow QA-ST to search for good assignments due to the too strong quantum interaction f(β, Γ) s̃(σ_{j−1}, σ_j, σ_{j+1}) in (12). f2 is always smaller than β. Hence, QA-ST never makes {σ_j}_{j=1}^m closer. In other words, QA-ST does not search for a middle (hopefully better) assignment constructed from {σ_j}_{j=1}^m. Specifically, we use the following annealing schedule in Step 8 of Algorithm 2,

    β = β_0 r_β^i,    Γ = Γ_0 exp(−r_Γ^i),    (13)

where r_β and r_Γ are constants and i denotes the i-th iteration of sampling. (13) comes from the following analysis of f(β, Γ). When kβΓ/m ≪ 1, (11) reads f(β, Γ) ≈ −n log(βΓ/m) = n r_Γ^i − n log(βΓ_0/m). Thus, the path of f(β, Γ) becomes f∗ in Fig.5 when Γ_0 is large enough and r_β < r_Γ. In this paper,³ we set Γ_0 to a large value such that f(β, Γ) ≈ 0 until β = m. This means p_QA-ST(σ_1, ..., σ_m) is just m independent instances of p_SA(σ_j; β/m) until β = m.

³ When QA-ST is applied to loss-function-based models, "until β = m" should be calibrated according to the loss function.

Note there is not much difference in difficulty between SA and QA-ST in choosing annealing schedules. In general, we should choose the schedule of Γ to be f∗ when the schedule of β is given. As shown in the next section, QA-ST works well with r_Γ ≈ r_β × 1.05. Thus, the difficulty of choosing annealing schedules for QA-ST is reduced to that of choosing the schedule of β for SA.

5 Experiments

We show experimental results in Fig.6. In the top three rows of Fig.6, we vary the schedule of Γ with a fixed schedule of β to see whether QA-ST works better than the best energy of m SAs when the schedule of Γ lets the path of f(β, Γ) be f∗ in Fig.5. In the bottom row of the figure, we compare QA-ST and SA with a slower schedule of β. This experiment shows whether QA-ST still works better than SA or not while the slower schedule of β improves SA.

We apply SA and QA-ST to two models, which are a mixture of Gaussians (MoG) with a conjugate normal-inverse-Wishart prior and the latent Dirichlet allocation (LDA) (Blei et al., 2003). For both models, parameters are marginalized out, and E(σ) ≡ −log p(X, σ) where X is data. Thus, QA-ST and SA search for the maximum a posteriori (MAP) assignment σ. MoG is applied to the MNIST data set, and LDA is applied to the NIPS corpus and Reuters. For MNIST, we randomly choose 5,000 data points and apply PCA to reduce the dimensionality to 20. The NIPS corpus has 1,684 documents, and we randomly choose 1000 words in the vocabulary. We also randomly choose 1000 documents and 2000 words in the vocabulary from Reuters. We set k to 30, 20 and 20 for MNIST, the NIPS corpus and Reuters, respectively. We use the same schedule of β for QA-ST and SA. In particular, we use the same r_β for QA-ST and SA, and we set β_0 = 0.2 for SA and β_0 = 0.2m for QA-ST. The difference of β_0 for SA and QA-ST keeps QA-ST similar to SA in terms of β-annealing for a fair comparison (see (3) and (12)). For each data set, we vary r_Γ from 1.02 to 1.20 with fixed r_β. When r_β ≤ r_Γ, the path of f(β, Γ) becomes f∗ in Fig.5. For any data set, m is set to 50 for QA-ST. We set the number of SAs so that they consume the same time as QA-ST. Thus, we can compare QA-ST and m SAs in terms of iterations in Fig.6. Consequently, m of SA was set to 51, 55 and 55 for MNIST, NIPS and Reuters, respectively.⁴ In Fig.6, we plot only after β = m for QA-ST and β = 1 for SA, which happen at the same iteration for QA-ST and SA.

⁴ QA-ST and m SAs took 21.7 and 22.0 hours for MNIST, 62.5 and 62.8 hours for NIPS and 9.9 and 10.0 hours for Reuters.

In Fig.6, the left column and the middle column show the minimum and the mean energy of {σ_j}_{j=1}^m. Since this is an optimization problem, we are interested in the minimum energy in the left column. For each data set, QA-ST with f∗ achieved better results than SA. The right column of Fig.6 shows the mean of the purity s̃. As we expect, the larger r_Γ resulted in the higher s̃. The bottom row of Fig.6 shows the result of NIPS with a slower schedule of β than the schedule in the third row of Fig.6. Although SA found better results than in the third row of Fig.6, QA-ST still worked better than SA. Our experimental results are also consistent with the claim of Matsuda et al. (2009), which is that QA works better than SA for more difficult problems. QA worked better for LDA than MoG. The right column of Fig.6 shows s̃ of LDA converged to smaller values than that of MoG. This means LDA has more local optima than MoG.

In this section, we have shown QA-ST achieves better results than SA when the schedule of Γ is f∗ in Fig.5. We have also shown that even with the slower schedule of β, QA-ST still works better than SA.
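A literal reading of (13) and of the f∗ path in Fig.5, as a small sketch of our own; the constants follow the values reported in Section 5 where available, and Γ_0 = 1e12 is an illustrative placeholder chosen so that f ≈ 0 until β ≈ m, as the text prescribes.

    import numpy as np

    def schedule(i, beta0=10.0, gamma0=1e12, r_beta=1.05, r_gamma=1.05 * 1.05):
        """Annealing schedule of Eq. (13): beta grows geometrically,
        Gamma decays super-exponentially so f(beta, Gamma) follows f* in Fig. 5."""
        beta = beta0 * r_beta ** i
        gamma = gamma0 * np.exp(-r_gamma ** i)
        return beta, gamma

    def interaction(beta, gamma, n, k, m):
        """f(beta, Gamma) in Eq. (11), guarded against overflow for large arguments."""
        x = k * beta * gamma / m
        if x > 30.0:                      # f is essentially zero in this regime
            return n * k * np.exp(-x)
        return n * np.log1p(k / np.expm1(x))

    # Crossover check: beta/m dominates early, f overtakes later (the f* path).
    n, k, m = 1000, 20, 50
    for i in (0, 10, 20, 30, 40, 50):
        beta, gamma = schedule(i)          # beta0 = 0.2*m as in Section 5
        print(i, beta / m, interaction(beta, gamma, n, k, m))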
6 Discussion & Conclusion

Many techniques to accelerate sampling have been studied. Such techniques can be applied to the proposed algorithm. For example, the split-merge sampler (Richardson and Green, 1997) and the permutation-augmented sampler (Liang et al., 2007) use a global move to escape from local minima. These techniques are available for the proposed algorithm as well. We can also apply the exchange Monte Carlo method.

We have applied quantum annealing (QA) to clustering. To the best of our knowledge, this is the first study of QA for clustering. We have proposed a quantum effect Hq fit to clustering and derived a QA-based sampling algorithm. We have also proposed a good annealing schedule for QA, which is crucial for applications. The computational complexity of QA is larger than a single simulated annealing (SA). However, we have empirically shown QA finds a better clustering assignment than the best one of multiple-run SAs that are randomly restarted until they consume the same time as QA. In other words, QA is better than SA when we run SA many times. Actually, it is typical to run SA many times because SA's fast cooling schedule of temperature T does not necessarily find the global optimum. Thus, we strongly believe QA is a novel alternative to SA for optimizing clustering. In addition, it is easy to implement the proposed algorithm because it is very similar to SA.

Unfortunately, there is no proof yet that QA is better than SA in general. Thus, we need to experimentally show QA's performance for each problem, as in this paper. However, it is worth trying to develop QA-based algorithms for different models, e.g. Bayesian networks, with different quantum effects Hq. The proposed algorithm looks like genetic algorithms in terms of running multiple instances. Studying their relationship is also interesting future work.

[Figure 6 (four rows of three panels): left column, minimum energy min_j E(σ_j); middle column, mean energy E[E(σ_j)]; right column, mean purity E[s̃(σ_j, σ_{j+1})]; all plotted versus iteration (energies on a 10^5 scale). Rows: MNIST with MoG (r_β = 1.05), Reuters with LDA (r_β = 1.05), NIPS with LDA (r_β = 1.05), and NIPS with LDA (r_β = 1.02). Legends: SA, and QA with r_Γ = 1.02, 1.05, 1.10, 1.20 (paths f2 or f∗ as in Fig.5).]

Figure 6: Comparison between SA and QA varying the annealing schedule. r_Γ, f2 and f∗ in the legends correspond to Fig.5. The left-most column shows what SA and QA found. QA with f∗ always found better results than SA.

Acknowledgements

This work was partially supported by Research on Priority Areas "Physics of new quantum phases in superclean materials" (Grant No. 17071011) from MEXT, and also by the Next Generation Super Computer Project, Nanoscience Program from MEXT. Special thanks to Taiji Suzuki, T-PRIMAL members and Sato Lab.

References

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.

Tadashi Kadowaki and Hidetoshi Nishimori. Quantum annealing in the transverse Ising model. Physical Review E, 58:5355-5363, 1998.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.

Percy Liang, Michael I. Jordan, and Ben Taskar. A permutation-augmented sampler for DP mixture models. In ICML, pages 545-552. Omnipress, 2007.

Yoshiki Matsuda, Hidetoshi Nishimori, and Helmut G. Katzgraber. Ground-state statistics from annealing algorithms: Quantum vs classical approaches. arXiv:0808.0365, 2009.

Thomas P. Minka. Old and new matrix algebra useful for statistics, 2000.

Sylvia Richardson and Peter J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59(4):731-792, 1997.

Giuseppe E. Santoro, Roman Martoňák, Erio Tosatti, and Roberto Car. Theory of quantum annealing of an Ising spin glass. Science, 295(5564):2427-2430, 2002.

Masuo Suzuki. Relationship between d-dimensional quantal spin systems and (d+1)-dimensional Ising systems - equivalence, critical exponents and systematic approximants of the partition function and spin correlations. Progress of Theoretical Physics, 56(5):1454-1469, 1976.

H. F. Trotter. On the product of semi-groups of operators. Proceedings of the American Mathematical Society, 10(4):545-551, 1959.

A The Details of the Suzuki-Trotter Expansion

Following Suzuki (1976), we give the details of the derivation of Theorem 4.1. The following Trotter product formula (Trotter, 1959) says that if A_1, ..., A_L are symmetric matrices, we have

    exp(Σ_{l=1}^L A_l) = (∏_{l=1}^L exp(A_l/m))^m + O(1/m).    (14)

Applying the Trotter product formula to (6), we have

    p_QA(σ_1; β, Γ) = (1/Z) σ_1^T e^{−β(Hc+Hq)} σ_1
                   = (1/Z) σ_1^T (e^{−(β/m)Hc} e^{−(β/m)Hq})^m σ_1 + O(1/m).    (15)

Note

    σ_1^T A² σ_1 = σ_1^T A E_{k^n} A σ_1 = σ_1^T A (Σ_{σ_2} σ_2 σ_2^T) A σ_1 = Σ_{σ_2} σ_1^T A σ_2 σ_2^T A σ_1.    (16)

Hence, we express (15) by marginalizing out auxiliary variables {σ_1', σ_2, σ_2', ..., σ_m, σ_m'},

    (1/Z) σ_1^T (e^{−(β/m)Hc} e^{−(β/m)Hq})^m σ_1    (17)
      = (1/Z) Σ_{σ_1'} Σ_{σ_2} ... Σ_{σ_m} Σ_{σ_m'} σ_1^T e^{−(β/m)Hc} σ_1' σ_1'^T e^{−(β/m)Hq} σ_2 × ...
          × σ_m^T e^{−(β/m)Hc} σ_m' σ_m'^T e^{−(β/m)Hq} σ_{m+1}    (18)
      = (1/Z) Σ_{σ_1'} Σ_{σ_2} ... Σ_{σ_m} Σ_{σ_m'} ∏_{j=1}^m σ_j^T e^{−(β/m)Hc} σ_j' σ_j'^T e^{−(β/m)Hq} σ_{j+1},    (19)

where σ_{m+1} = σ_1. To simplify (19) further, we use the following Lemma A.1 and Lemma A.2.

Lemma A.1.

    σ_j^T e^{−(β/m)Hc} σ_j' = exp[−(β/m) E(σ_j')] δ(σ_j, σ_j') ∝ p_SA(σ_j; β/m) δ(σ_j, σ_j'),    (20)

where δ(σ_j, σ_j') = 1 if σ_j = σ_j' and δ(σ_j, σ_j') = 0 otherwise.

Proof. By definition, e^{−(β/m)Hc} is diagonal with [e^{−(β/m)Hc}]_tt = exp[−(β/m) E(σ(t))], and σ_j and σ_j' are binary indicator vectors, i.e. only one element in σ_j is one and the others are zero. Thus, the above lemma holds.

Lemma A.2.

    σ_j'^T e^{−(β/m)Hq} σ_{j+1} ∝ e^{s(σ_j', σ_{j+1}) f(β, Γ)}.    (21)

Proof. Substituting (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) and e^{A_1+A_2} = e^{A_1} e^{A_2} when A_1 A_2 = A_2 A_1, we find,

    σ_j'^T e^{−(β/m)Hq} σ_{j+1} = σ_j'^T e^{−(β/m) Σ_{i=1}^n ρ_i} σ_{j+1} = ∏_{i=1}^n σ̃_{j,i}'^T e^{−(β/m)ρ} σ̃_{j+1,i},    (22)

where σ̃_{j,i} is the i-th element of the Kronecker product of σ_j, s.t. σ_j = ⊗_{i=1}^n σ̃_{j,i}. Substituting the following (23) and (24) into (22),

    e^{−(β/m)ρ} = Σ_{l=0}^∞ (1/l!) (−(β/m) ρ)^l,    (23)

    ρ^l = Γ^l (E_k − 1_k)^l = Γ^l { Σ_{i=0}^l C(l,i) E_k^i (−1_k)^{l−i} }
        = Γ^l { E_k + Σ_{i=0}^{l−1} C(l,i) k^{l−i−1} (−1)^{l−i} 1_k }
        = Γ^l { E_k + (1/k) Σ_{i=0}^{l−1} C(l,i) (−k)^{l−i} 1_k }
        = Γ^l { E_k + ((1 − k)^l − 1) (1/k) 1_k },    (24)

we have

    σ_j'^T e^{−(β/m)Hq} σ_{j+1}
      = ∏_{i=1}^n Σ_{l=0}^∞ (1/l!) (−βΓ/m)^l σ̃_{j,i}'^T { E_k + ((1 − k)^l − 1) (1/k) 1_k } σ̃_{j+1,i}
      = ∏_{i=1}^n Σ_{l=0}^∞ (1/l!) (−βΓ/m)^l { δ(σ̃_{j,i}', σ̃_{j+1,i}) + ((1 − k)^l − 1) (1/k) }
      = ∏_{i=1}^n { e^{−βΓ/m} δ(σ̃_{j,i}', σ̃_{j+1,i}) + (1/k) (e^{−βΓ(1−k)/m} − e^{−βΓ/m}) }
      ∝ e^{s(σ_j', σ_{j+1}) f(β, Γ)}.    (25)

Using Lemma A.1 and Lemma A.2 in (19), (17) becomes,

    (1/Z) σ_1^T (e^{−(β/m)Hc} e^{−(β/m)Hq})^m σ_1 = (1/Z) Σ_{σ_2} ... Σ_{σ_m} ∏_{j=1}^m p_SA(σ_j; β/m) e^{s(σ_j, σ_{j+1}) f(β, Γ)}.

From (15) and the above expression, we have shown Theorem 4.1.
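As a sanity check (ours, not part of the paper), Lemma A.2 and the form of f(β, Γ) in (11) can be verified numerically for a tiny instance (k = 2, n = 2) by building Hq explicitly and comparing its matrix exponential against exp(s·f):

    import itertools
    import numpy as np
    from scipy.linalg import expm

    k, n, m = 2, 2, 3
    beta, gamma = 0.7, 1.3

    # rho = Gamma*(E_k - 1_k);  H_q = sum_i  E (x) ... (x) rho (x) ... (x) E
    rho = gamma * (np.eye(k) - np.ones((k, k)))
    Hq = sum(np.kron(np.kron(np.eye(k ** i), rho), np.eye(k ** (n - i - 1)))
             for i in range(n))

    M = expm(-(beta / m) * Hq)

    # Enumerate assignments as label tuples in the same order as the Kronecker basis.
    assignments = list(itertools.product(range(k), repeat=n))
    f = n * np.log1p(k / np.expm1(k * beta * gamma / m))        # Eq. (11)

    ratios = [M[t, u] / np.exp(np.mean([a == b for a, b in zip(sa, sb)]) * f)
              for t, sa in enumerate(assignments)
              for u, sb in enumerate(assignments)]

    # All ratios coincide, i.e. sigma^T exp(-(beta/m) Hq) sigma' is proportional
    # to exp(s(sigma, sigma') * f(beta, Gamma)), as Lemma A.2 states.
    print(np.allclose(ratios, ratios[0]))   # -> True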