
Learning using the Born Rule

Lior Wolf∗

Abstract

In Quantum Mechanics the transition from a deterministic description to a probabilistic one is done using a simple rule termed the Born rule. This rule states that the probability of an outcome ($a$) given a state ($\Psi$) is the square of their inner product ($(a^\top \Psi)^2$).

In this paper we explore the use of Born-rule-based probabilities for clustering, feature selection, classification, and for comparison between sets. We show how these probabilities lead to existing and new algebraic algorithms for which no other complete probabilistic justification is known.

Keywords: Spectral methods, unsupervised feature selection, set-to-set similarity measures.

1 Introduction

In this work we form a connection between spectral theory and probability theory. Spectral theory is a powerful tool for studying matrices and graphs and is often used in machine learning. However, its connections to the statistical tools employed in this field are ad hoc and domain specific. In this work we connect the two by examining the basic probability rule of Quantum Mechanics: the Born rule.

Consider an affinity matrix $A = [a_{ij}]$, where $a_{ij}$ is the similarity between points $i$ and $j$ in some dataset. One can obtain an effective clustering of the data points by simply thresholding

∗L. Wolf is with The Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02138. Email: [email protected].

the first eigenvector of this affinity matrix. Most previously suggested justifications of spectral clustering fail to explain this: the view of spectral clustering as an approximation of the normalized minimal graph cut problem requires the use of the Laplacian; the view of spectral clustering as the infinite limit of a stochastic walk on a graph requires a doubly stochastic matrix; other approaches can only explain this when the affinity matrix is approximately block diagonal.

In this work we show that by modeling class membership using the Born rule one attains the above spectral clustering algorithm as the two-class clustering algorithm. Moreover, let $v_1 = [v_1(1), v_1(2), \ldots]^\top$ be the first eigenvector of $A$, with an eigenvalue of $\lambda_1$. We show that according to this model the probability of point $j$ belonging to the dominant cluster is given by $\lambda_1 v_1(j)^2$. This result is detailed in Section 4, together with a justification of one of the most popular multiple-class spectral-clustering algorithms. In that section, which is at the heart of our paper, we also justify a successful feature selection algorithm known as the Q − α algorithm, derive similarity functions between sets and related retrieval algorithms, and suggest a very simple plug-in classification algorithm.

In Quantum Mechanics the Born rule is usually taken as one of the axioms. However, this rule has well-established foundations. Gleason's theorem states that the Born rule is the only consistent probability rule for a Hilbert-space structure. Wootters was the first to note the intriguing observation that by using the Born rule as a probability rule, the natural Euclidean metric on a Hilbert space coincides with a natural notion of a statistical distance.

Recently, attempts were made to derive this rule with even fewer assumptions by showing it to be the only possible rule that allows certain invariants in the quantum model. We will briefly review these justifications of the Born rule in Section 3. Physics-based methods have been used to solve learning and optimization problems for a long time. Simulated annealing, the use of statistical mechanics models in neural networks, and heat-equation-based kernels are some examples. However, we would like to stress that everything we say in this paper has grounds that are independent of any physical model. From a statistician's point of view, our paper can be viewed as a description of learning methods that use the Born rule as a plug-in estimator.

2 The Quantum Probability Model

The quantum probability model takes place in a Hilbert space H of finite or infinite dimension¹. A state is represented by a positive definite linear mapping (a matrix $\rho$) from this space to itself, which has a trace of 1, i.e., $\forall \Psi \in H,\ \Psi^\top \rho \Psi \ge 0$ and $\mathrm{Tr}(\rho) = 1$. Such a mapping $\rho$ is self adjoint ($\rho^\top = \rho$, where throughout this paper $^\top$ denotes the complex conjugate transpose) and is called a density matrix.

Since $\rho$ is self adjoint its eigenvectors $\Phi_i$ are orthonormal ($\Phi_i^\top \Phi_j = \delta_{ij}$), and since it is positive definite its eigenvalues $p_i$ are real and positive, $p_i \ge 0$. The trace of a matrix is equal to the sum of its eigenvalues and so $\sum_i p_i = 1$.

The equality $\rho = \sum_i p_i \Phi_i \Phi_i^\top$ is interpreted as "the system is in state $\Phi_i$ with probability $p_i$".

The state $\rho$ is called pure if $\exists i$ s.t. $p_i = 1$. In this case, $\rho = \Psi\Psi^\top$ for some normalized state vector $\Psi$, and the system is said to be in state $\Psi$. Note that the representation of a mixed (not pure) state as a mixture of states with probabilities is not unique.

A measurement M with an outcome $x$ in some set $X$ is represented by a collection of positive definite matrices $\{m_x\}_{x\in X}$ such that $\sum_{x\in X} m_x = \mathbf{1}$ ($\mathbf{1}$ being the identity operator in H). Applying a measurement M to state $\rho$ produces outcome $x$ with probability

$p_x(\rho) = \mathrm{Tr}(\rho m_x).$   (1)

Eq. 1 is the Born rule. Most quantum models deal with a more restrictive type of measurement called the von Neumann measurement, which involves a set of projection operators $m_a = aa^\top$ for which $a^\top a' = \delta_{aa'}$. As before, $\sum_{a\in M} aa^\top = \mathbf{1}$. For this type of measurement the Born rule takes a simpler form: $p_a(\rho) = \mathrm{Tr}(\rho aa^\top) = \mathrm{Tr}(a^\top \rho a) = a^\top \rho a$. Assuming $\rho$ is a pure state $\rho = \Psi\Psi^\top$, this can be simplified further to:

$p_a(\rho) = (a^\top \Psi)^2.$   (2)

¹ The results described in this paper hold for Hilbert spaces (complex vector spaces) as well as for real vector spaces.

Since our algorithms will require the recovery of the parameters of unknown distributions, we will use the simpler form (Eq. 2). This keeps the number of parameters we fit minimal.
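To make Eq. 1 and Eq. 2 concrete, here is a minimal numerical sketch (our illustration, not part of the original text) that computes Born-rule probabilities for a von Neumann measurement given by an orthonormal basis; the variable names and the random data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A pure state: a random norm-1 vector Psi, with density matrix rho = Psi Psi^T.
Psi = rng.standard_normal(d)
Psi /= np.linalg.norm(Psi)
rho = np.outer(Psi, Psi)

# A von Neumann measurement: an orthonormal basis {a_i} (a resolution of the identity).
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns are the measurement vectors

# Born rule, general form (Eq. 1): p_i = Tr(rho m_i) with m_i = a_i a_i^T.
p_general = np.array([np.trace(rho @ np.outer(A[:, i], A[:, i])) for i in range(d)])

# Born rule, pure-state form (Eq. 2): p_i = (a_i^T Psi)^2.
p_pure = (A.T @ Psi) ** 2

assert np.allclose(p_general, p_pure)
assert np.isclose(p_general.sum(), 1.0)  # the m_i sum to the identity, so probabilities sum to 1
print(p_pure)
```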

3 Whence the Born rule?

The Born rule has an extremely simple form that is convenient to handle, but why should it be considered justified as a probabilistic model for a broad spectrum of data? In quantum physics the Born rule is one of the axioms, and it is essential as a link between deterministic dynamics

of the states and the probabilistic outcomes. It turns out that this rule cannot be replaced by other rules, as any other probabilistic rule would not be consistent. Next we will describe three existing approaches for deriving the Born rule², and interpret their relation to learning problems.

Assigning probabilities when learning with vectors. Gleason's theorem [12] derives the Born rule by assuming the Hilbert-space structure of observables (what we are trying to measure). In this structure, each orthonormal basis of H corresponds to a mutually exclusive set of results of a given measurement.

Theorem 1 (Gleason's theorem) Let H be a Hilbert space of dimension greater than 2. Let $f$ be a function mapping the one-dimensional projections on H to the interval [0,1] such that for each orthonormal basis $\{\Psi_k\}$ of H it holds that $\sum_k f(\Psi_k\Psi_k^\top) = 1$. Then there exists a unique density matrix $\rho$ such that for each $\Psi \in H$, $f(\Psi\Psi^\top) = \Psi^\top\rho\Psi$.

By this theorem the Born rule is the only rule that can assign probabilities to all the measurements of a state (each measurement is given by a resolution of the identity matrix), such that these probabilities rely only on the density matrix of that specific state. To be more concrete: we are given examples (data points) as vectors, and we are asked to assign them to clusters/classes. Naturally, we choose to represent each class with a single vector (in the simplest case). We do not want the clusters/classes to overlap; otherwise, in almost any optimization framework we would get the most prominent cluster repeatedly. The simplest constraint to add is an orthogonality constraint on the unknown vector representations of the clusters. Having this simple model and simple constraint, the question arises: what probability rule can be used? Gleason's theorem suggests that there is only one possible rule under the constraint that all the probabilities must sum to one.

² This will be done briefly so as not to repeat already published work. The reader is referred to the more recent references for more details.

One may wonder why it is that our simple vector-based model above and the world of quantum mechanics result in the use of the same probability rule. There is nothing mystical about this. At the beginning of the last century physicists represented both states and observations as vectors. The only probability rule suitable for models that use vectors both as points and as operators is the Born rule.

Gleason's theorem is very simple and appealing, and it is very powerful as a justification for the use of the Born rule [2]. However, its assumptions are somewhat restrictive: it assumes that the algorithm that assigns probabilities to measurements on a state has to assign probabilities to all possible von Neumann measurements. Some quantum mechanical approaches, as well as our use of the Born rule in machine learning, do not require that probabilities be defined for all resolutions of the identity.

Axiomatic approaches. Recently a new line of inquiry has emerged that tries to justify the Born rule from the axioms of decision theory. The first work in this direction was done by Deutsch [9]. An attempt was made to replace the probabilistic axioms of quantum theory with the non-probabilistic part of classical decision theory. While Barnum et al. [2] identified an ambiguity in Deutsch's notation, Wallace [28] resolved that ambiguity and suggested several alternative derivations of the Born rule from decision theories. Lately, Saunders [21] derived the Born rule from "operational assumptions."

Saunders considers multiple-channel experiments. These are experiments that can be repeated such that at each repetition any sub-group of the channels can be blocked. In the special case where all but one channel are blocked, the outcome of the experiment is deterministic in the sense that if there is an outcome it is always the same one. It is assumed that for every possible outcome of the experiment there is a channel for which it is deterministic. The details can be found in appendix A (attached as a separate file).

Statistical approach. Another approach for justifying the Born rule, which is different from the axiomatic one above, is the approach taken by Wootters [33]. Wootters defines a natural notion of statistical distance between two states that is based on statistical distinguishability. Assume a measurement with N outcomes. For some state the probability of each outcome is given by $p_i$. By the central limit theorem, two states are distinguishable if for large $n$ the quantity $\frac{\sqrt{n}}{2}\sqrt{\sum_{i=1}^N (\delta p_i)^2/p_i}$ is larger than 1, where $\delta p_i$ is the difference in their probabilities for outcome $i$. If they are indistinguishable then the frequencies of outcomes we get from the two states by repeating the measurement are not larger than the typical variation for one of the states. The distance between states $\Psi$ and $\Psi'$ is simply the length of the shortest chain of states that connects the two such that every two consecutive states in this chain are indistinguishable. Wootters shows that if we were to demand from nature that the statistical distance between states be proportional to the angle between their vector representations, then we would get the Born rule as the resulting probability rule.

The implication of this result for machine learning is far reaching. Often, given two vectors in a real or complex Hilbert space, we use the distance between the vectors as a similarity measure. This is very natural since the angle is the only Riemannian metric, up to a constant factor, which is invariant to all unitary transformations. Sometimes we would like to derive relations which are more complex than just distances, for example, measures of distances between sets. Statistics and probabilities are a natural way to represent such relations, and the Born rule is a probabilistic rule which conforms with the natural metric between vectors.

4 Application of the Born Rule to Machine Learning Problems

We consider several applications. First, we consider clustering, and show that applying the Born rule to clustering leads naturally to a spectral clustering algorithm. Interestingly, Horn and Gottlieb [15] have developed a clustering method based on intuitions derived from Quantum Mechanics. Their method, however, is based on the Schrödinger potential equation and is different from ours.

Second, we extend the scoring function used for clustering, so that we can ask: what is a good kernel function? We use this score to justify an effective feature selection algorithm.

Third, we derive set-to-set similarities that provide a measure of how distinguishable two sets of norm-1 vectors are. We then derive algorithms that use these measures for the retrieval task. As the last application of the Born rule, we derive a simple two-class classification engine. To avoid confusion, we will mainly use machine learning concepts and not physics-based concepts. The probabilities will be derived from the Born rule directly, without giving them any quantum physics interpretation.

4.1 Clustering

Two class clustering. We are given a set of $n$ norm-1 input samples $\{\Phi_j\}_{j=1}^n$ that we would like to cluster into $k$ clusters. We model the probability of a point $\Phi$ to belong to cluster $i$ as $p(i|\Phi) = a_i^\top \Phi\Phi^\top a_i$, where $a_i^\top a_j = \delta_{ij}$, i.e., we use our input vectors as state vectors and a von Neumann measurement to determine cluster membership.

We would like to maximize the empirical expectation of our input points belonging to any of the clusters. Hence, we maximize $L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k p(i|\Phi_j) = \sum_{i=1}^k a_i^\top\Gamma_\Phi a_i$, where $\Gamma_\Phi = \frac{1}{n}\sum_{j=1}^n \Phi_j\Phi_j^\top$. $L(\{a_i\}_{i=1}^k)$ is maximized when $\{a_i\}_{i=1}^k$ are taken to be the first $k$ eigenvectors of $\Gamma_\Phi$ [13]. This maximization is only defined up to multiplication by a $k \times k$ unitary matrix.

Consider the singular value decomposition of the matrix $\phi = [\Phi_1|\Phi_2|\ldots|\Phi_n] = USV^\top$, where the columns of $U$ are the eigenvectors of $\phi\phi^\top$, the matrix $S$ is a diagonal matrix constructed from the sorted square roots of the eigenvalues of $\phi\phi^\top$, and the matrix $V$ contains the eigenvectors of $\phi^\top\phi$. Let $u_i$ ($v_i$) denote the $i$th column of $U$ ($V$) and let $s_i$ denote the $i$th element along the diagonal of $S$. The following equality holds for all $i$ not exceeding the dimensions of $\phi$: $s_i v_i = \phi^\top u_i$.

If we are only interested in bipartite clustering we can use only the first eigenvector and get $p(1|\Phi_j) = (u_1^\top\Phi_j)^2 = (s_1 v_1(j))^2$, $v_i(j)$ being the $j$th element of the vector $v_i$. Hence, the probability of belonging to the first cluster is proportional to the square of the corresponding entry of the first eigenvector of the data's affinity matrix $\phi^\top\phi$. A bipartite clustering is attained by thresholding this value. This is in agreement with most spectral clustering methods for bipartite clustering.
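The following is a small numerical sketch (ours, not the paper's implementation) of the bipartite clustering just described; the threshold choice and the toy data are our own assumptions.

```python
import numpy as np

def born_bipartite_clustering(phi):
    """Sketch of the two-class clustering described above (our code, not the author's).

    phi : d x n array whose columns are norm-1 input samples Phi_j.
    Returns p(1|Phi_j) = (s_1 v_1(j))^2 and a binary labeling obtained by
    thresholding these probabilities (the threshold choice is ours).
    """
    # SVD of phi: columns of U are eigenvectors of phi phi^T, rows of Vt are
    # eigenvectors of the affinity matrix phi^T phi, s are singular values.
    U, s, Vt = np.linalg.svd(phi, full_matrices=False)
    v1 = Vt[0]
    p_cluster1 = (s[0] * v1) ** 2      # Born-rule probability of the dominant cluster
    labels = (p_cluster1 > np.median(p_cluster1)).astype(int)
    return p_cluster1, labels

# Toy usage with non-negative features, so the first eigenvector has a single sign.
rng = np.random.default_rng(1)
a = np.abs(rng.standard_normal((5, 1))) + 0.1 * np.abs(rng.standard_normal((5, 20)))
b = np.abs(rng.standard_normal((5, 1))) + 0.1 * np.abs(rng.standard_normal((5, 20)))
phi = np.concatenate([a, b], axis=1)
phi /= np.linalg.norm(phi, axis=0, keepdims=True)   # enforce norm-1 samples
probs, labels = born_bipartite_clustering(phi)
```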

Several points need to be taken into consideration:

1. We did not normalize the affinity matrix, e.g., use the Laplacian $L$ as some spectral clustering methods do instead of the original affinity matrix. We prefer to analyze this case, which is less understood but works well in practice. The case of the Laplacian is similar, with some modifications to take into account that the first eigenvector of the matrix $cI - L$ ($c \in \mathbb{R}^+$) is a multiple of the vector of all ones.

2. We assume that the norm of each input point is 1. Spectral clustering is usually done using Gaussian or heat kernels [18, 4]. For these kernels the norm of each input point in the feature space (the vector space in which the kernel is just a dot product [22]) is 1. A more complete treatment of this issue is given in Section 4.2.

3. Some spectral clustering methods [29, 20] use a threshold on the value of $v_1(j)$ and not on its square. However, by the Perron-Frobenius theorem, for kernels with non-negative values (e.g., Gaussian kernels) the first eigenvector of the affinity matrix is a same-sign vector.

4. A related remark is that in our model the cluster membership probability of the point $\Phi$ is the same as that of the point $-\Phi$. In fact, it might be more appropriate to consider rays of the form $c\Phi$, $c$ being some scalar. The norm-1 vector is just the representation of this ray. This distinction between rays and norm-1 vectors does not make much difference when considering kernels with non-negative values – all the vectors in the feature space

are aligned to have a positive dot product.

Note that the use of the Born rule to derive the clustering algorithm is more than just a justification for the same algorithm many are using. In addition to binary class memberships, we get a probability for each point to belong to each of the two clusters. However, as in many spectral clustering approaches, the situation is less appealing when considering more than two clusters. In that case, the fact that the solution of the maximization problem above is given only up to multiplication by a unitary matrix induces an inherent ambiguity.

Multiple class clustering. Having a probability model, the most natural way to perform clustering would be to apply model-based clustering. Following the spirit of [35], we base our model-based clustering on affinities between every two data points that are derived from our probability model. To do so, we will be interested not in the probability of the cluster membership given the data point, but in the probability of the data point being generated by a given cluster [35], i.e., two points are likely to belong to the same cluster if they have the same profile of cluster membership. This is given by the Bayes rule as $p(\Phi|i) = \frac{p(i|\Phi)p(\Phi)}{p(i)}$. We normalize the cluster membership probabilities such that the probability of generating each point by the $k$ clusters is one – otherwise the scale produced by the probability of generating each point would harm the clustering; for example, all points with small $p(\Phi)$ would end up in one cluster. Each data point is therefore represented by the vector $q$ of $k$ elements, where element $i$ is $q(i) = \frac{p(i|\Phi)/p(i)}{\sum_{j=1}^k p(j|\Phi)/p(j)}$.

The probability of cluster $i$ membership given the point $\Phi_j$ is given by $p(i|\Phi_j) = (u_i^\top\Phi_j)^2 = (s_i v_i(j))^2$. The prior probability of cluster $i$ is estimated by $p(i) \cong \int p(i,\Phi)\,d\sigma(\Phi) = \int p(i|\Phi)p(\Phi)\,d\sigma(\Phi) \cong \frac{1}{n}\sum_{j=1}^n p(i|\Phi_j) = u_i^\top\Gamma_\Phi u_i = s_i^2$.

Hence the elements of the vector $q_j$, which represent a normalized version of the probabilities of the point $\Phi_j$ being generated from the $k$ clusters, are estimated as:

$q_j(i) = \frac{n(s_i v_i(j))^2/s_i^2}{\sum_{l=1}^k n(s_l v_l(j))^2/s_l^2} = \frac{v_i(j)^2}{\sum_{l=1}^k v_l(j)^2}.$

Were it not for the problem of an unknown unitary transformation ("rotation"), we could compare two such vectors by using some sort of affinity between probabilities. For example, we may use the affinity related to the Hellinger distance: $\mathrm{affinity}(q_j, q_{j'}) = \sum_i \sqrt{q_j(i)}\,\sqrt{q_{j'}(i)}$. However, this is not invariant to a unitary transformation of the singular vectors of $\phi$.
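As a concrete illustration (our sketch, not the paper's code), the membership profiles $q_j$ and the Hellinger affinity above can be computed as follows; the function names and random data are our own.

```python
import numpy as np

def cluster_membership_profiles(phi, k):
    """Sketch (ours) of the normalized membership vectors q_j described above.

    phi : d x n array of norm-1 samples; k : number of clusters.
    Returns an n x k array whose j-th row is q_j(i) = v_i(j)^2 / sum_l v_l(j)^2.
    """
    _, _, Vt = np.linalg.svd(phi, full_matrices=False)
    V = Vt[:k].T                       # n x k: first k eigenvectors of phi^T phi
    q = V ** 2
    q /= q.sum(axis=1, keepdims=True)  # normalize so each row sums to one
    return q

def hellinger_affinity(q_a, q_b):
    """Affinity related to the Hellinger distance: sum_i sqrt(q_a(i)) sqrt(q_b(i))."""
    return np.sum(np.sqrt(q_a) * np.sqrt(q_b))

# Usage: affinity between the membership profiles of points 0 and 1.
rng = np.random.default_rng(2)
phi = rng.random((6, 30))
phi /= np.linalg.norm(phi, axis=0, keepdims=True)
Q = cluster_membership_profiles(phi, k=3)
print(hellinger_affinity(Q[0], Q[1]))
```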

The NJW algorithm [18] is one of the most popular spectral clustering algorithms. It is very successful in practice [27], and is proven to be optimal in some ideal cases. The NJW algorithm considers the vectors $r_j(i) = \frac{v_i(j)}{(\sum_{l=1}^k v_l(j)^2)^{1/2}}$. The difference from the point-wise square root of the vectors $q_j$ above is that the numerator can have both positive and negative values. The Hellinger affinity above would differ from the dot product $r_j^\top r_{j'}$ if for some coordinate $i$ the sign of $r_j(i)$ is different from the sign of $r_{j'}(i)$; in this case the dot product will always be lower than the Hellinger affinity.

The NJW algorithm clusters the $r_j$ representation of the data points using k-means clustering in $R^k$. NJW therefore finds clusters ($C_i$, $i = 1..k$) and cluster centers ($c_i$) so as to minimize the sum of squared distances $\sum_{i=1}^k\sum_{j\in C_i}\|c_i - r_j\|_2^2$. This clustering measure is invariant to the choice of basis of the space spanned by the first $k$ eigenvectors of $\phi^\top\phi$. To see this it is enough to consider distances of the form $\|(\sum_i\alpha_i r_i) - r_j\|_2^2$, because the $c_i$'s chosen by the k-means algorithm are linear combinations of the vectors $r_i$ in a specific cluster. Assume that the subspace spanned by the first $k$ eigenvectors of $\phi^\top\phi$ is rotated by some $k\times k$ unitary transformation $O$. Then each vector $r_j$ would be rotated by $O$ and become $Or_j$ – there is no need to renormalize since $O$ preserves norms. $O$ preserves distances as well, and the $\alpha_i$'s do not depend on $O$, so $\|(\sum_i\alpha_i r_i) - r_j\|_2^2 = \|(\sum_i\alpha_i O r_i) - O r_j\|_2^2$.

For simplicity, allow us to remove the cluster center, which serves as a proxy in the k-means algorithm, and look for the cluster assignments which minimize $\sum_{i=1}^k\sum_{j,j'\in C_i}\|r_j - r_{j'}\|_2^2$. The Hellinger distance $D^2(q_j, q_{j'}) = \sum_i(\sqrt{q_j(i)} - \sqrt{q_{j'}(i)})^2$ is always lower than the squared $L_2$ distance between $r_j$ and $r_{j'}$, and hence the above criterion bounds from above the natural clustering score $\sum_{i=1}^k\sum_{j,j'\in C_i} D^2(q_j, q_{j'})$. Note that in most cases this bound is tight, since points in the same cluster have similar vector representations. This is especially the case for kernels in which all the vectors in the feature space are aligned to have positive dot products.
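For reference, a minimal sketch (ours) of the NJW-style step discussed above, without the affinity normalization addressed in remark 3 below; the use of scikit-learn's KMeans is an assumption on our part.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

def njw_style_clustering(phi, k):
    """Sketch (ours): map each point j to r_j(i) = v_i(j) / sqrt(sum_l v_l(j)^2)
    and cluster the r_j vectors with k-means, as in the discussion above.

    phi : d x n array of norm-1 samples; k : number of clusters.
    """
    _, _, Vt = np.linalg.svd(phi, full_matrices=False)
    R = Vt[:k].T.copy()                            # n x k matrix of eigenvector entries
    R /= np.linalg.norm(R, axis=1, keepdims=True)  # row normalization
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(R)

# Usage on the toy phi from the previous sketch:
# labels = njw_style_clustering(phi, k=3)
```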

10 Finally, it is worth considering the following points regarding the proposed multiple class clustering algorithm:

1. To remove any doubt, our goal in this section is not to create a new clustering algorithm

but to give a more complete statistical explanation of the success of existing spectral clustering techniques. (This is true of the rest of this work as well: we prefer not to propose new algorithms, but to justify existing algorithms which were considered heuristic.)

2. The Kullback-Leibler (KL) divergence and the Hellinger distance are the most commonly used similarity measures between distributions. Although the KL divergence has been promoted in the machine learning community by many researchers because of its relation to entropy, the Hellinger distance plays a role just as important in the statistics community [10]. Our reasons for using this distance are our wish to derive the NJW algorithm (see the previous remark) and its other nice properties, such as its symmetry. It is also interesting to note that if two probability distributions are bounded from above, and from below by a constant larger than zero, then the Hellinger distance and the KL divergence are within a constant factor of each other [10].

3. In the original NJW algorithm [18], the affinity matrix is normalized such that instead of the kernel $A = \phi^\top\phi$, the normalized version $D^{-1/2}AD^{-1/2}$ is used, where $D$ is a diagonal matrix which holds the sums of the rows of $A$ in its diagonal. While from our practitioner's point of view this modification is a mere normalization of the affinity matrix (see Y. Weiss' tutorial on spectral clustering given at NIPS 2002), others might consider it more important. From the point of view of [4] this step is crucial because the normalized affinity matrix has the same eigenvectors as its Laplacian: $D^{-1/2}LD^{-1/2} = D^{-1/2}(D-A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}$, and the last matrix has the same eigenvectors as $D^{-1/2}AD^{-1/2}$ (in reverse order of eigenvalues). This normalization can be inserted into our framework, and in fact gives another insight into our interpretation of the NJW algorithm. The use of $D^{-1/2}AD^{-1/2}$ is similar to transforming the points from $\phi$ to $\phi D^{-1/2}$. Consider kernels with non-negative values, and assume that we build a model from the single point $\Phi_i$. Let $d_i$ be the $i$-th element along the diagonal of the matrix $D$, $d_i = \sum_j \Phi_i^\top\Phi_j$. This is exactly the Hellinger affinity between the distribution of points $j = 1..n$ being in the cluster of that one point, $(\Phi_i^\top\Phi_j)^2$, and the uniform distribution (note that the first distribution should be normalized by $\max_i d_i^2$). Hence, normalizing point $\Phi_i$ by $1/\sqrt{d_i}$ is very similar to normalizing by the norm when using least squares methods. The difference is that here, instead of the norm (distance from zero), we normalize by the Hellinger affinity to the uniform distribution over the data points. Since the Hellinger affinity is the one we use in our algorithm, and the uniform distribution is a natural baseline, this seems very reasonable.

4.2 Feature selection

We next deal with the problem of feature selection. Our goal in this section is to give probabilistic grounds to the success of the Q − α algorithm [31]. Q − α is a very effective unsupervised variable weighting algorithm (see also [23, 32]) which was designed based upon intuitions from spectral approaches. Here we will show a constructive way of deriving the score function used in that algorithm. In order to do so, we will first generalize our likelihood model by incorporating priors into it.

Incorporating the class-based likelihood into the total likelihood. Remember that in the previous section we first defined the probability of a point $\Phi$ to belong to cluster $i$ as $p(i|\Phi) = a_i^\top\Phi\Phi^\top a_i$. We maximized the empirical expectation of our input points to belong to any of the clusters, which we gave as $L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k p(i|\Phi_j) = \sum_{i=1}^k a_i^\top\Gamma_\Phi a_i$. In this simple model we did not use priors on the class membership.

For the multiple class case we estimated these priors as $p(i) = a_i^\top\Gamma_\Phi a_i$. In fact we could have used a slightly more complicated model for the two-class clustering as well, and inserted this prior into our likelihood. Thus, we can redefine the cluster membership probability: $\hat p(i|\Phi) = p(i)p(i|\Phi) = (a_i^\top\Gamma_\Phi a_i)(a_i^\top\Phi\Phi^\top a_i)$. The likelihood function now becomes: $\hat L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k \hat p(i|\Phi_j) = \sum_{i=1}^k (a_i^\top\Gamma_\Phi a_i)(a_i^\top\Gamma_\Phi a_i) \le \sum_{i=1}^k a_i^\top\Gamma_\Phi\Gamma_\Phi a_i$.

Since the eigenvectors of a matrix (e.g., $\Gamma_\Phi$) and of its square ($\Gamma_\Phi\Gamma_\Phi$) are the same, and since the inequality becomes an equality when the $a_i$'s are the eigenvectors of $\Gamma_\Phi$, this does not change much of the result of the two-class case (for the multiple class case we already used the priors)³. The major difference is that the probability of a point $\Phi_j$ belonging to the first cluster becomes $p(1|\Phi_j) = (u_1^\top\Gamma_\Phi u_1)(u_1^\top\Phi_j)^2 = s_1^4 v_1(j)^2$ (here, as before, $u_1$, $v_1$ and $s_1$ are given by the singular value decomposition of $\phi = [\Phi_1|\Phi_2|\ldots|\Phi_n]$, and are the first eigenvectors of $\phi\phi^\top$ and $\phi^\top\phi$, and the square root of the eigenvalue associated with these eigenvectors, respectively).

Dealing with points of any norm. To make our methods more flexible, let us add another prior – a prior on the points. The main goal of this prior is to allow us to deal with points that do not have a norm of one. We suggest using the prior $\hat p(\Phi) = \Phi^\top\Phi = \|\Phi\|^2$. Using this prior, each point $\Phi$ is viewed as the point $\Phi/\|\Phi\|$ with an importance proportional to the square of the norm of the point.

The point-based prior allows us to apply the Born rule, as is, to points of any norm. This is because⁴ $\bar p(a_i|\Phi) = p(\Phi)\,p(a_i\,|\,\Phi/\|\Phi\|) = (\Phi^\top\Phi)\frac{(a_i^\top\Phi)^2}{(\Phi^\top\Phi)} = (a_i^\top\Phi)^2$. Thus the form of the probability rule for points of any norm is the same as the Born rule (which is originally limited to norm-1 vectors). This means that using this prior all our results above hold for vectors of any norm. Note that the point-based prior is self consistent in the following sense: the sum, over any resolution of the identity $\{a_i\}$, of the probabilities $\bar p(a_i|\Phi)$ is just the prior $p(\Phi)$:

$\sum_i \bar p(a_i|\Phi) = \sum_i (a_i^\top\Phi)^2 = \Phi^\top\Big(\sum_i a_i a_i^\top\Big)\Phi = \Phi^\top\Phi.$
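A tiny numeric check (ours) of this self-consistency property, under the assumption of a random point and a random orthonormal resolution of the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
Phi = rng.standard_normal(d) * 3.0                # a point of arbitrary norm
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns form a resolution of the identity

p_bar = (A.T @ Phi) ** 2                          # \bar p(a_i | Phi) = (a_i^T Phi)^2
assert np.isclose(p_bar.sum(), Phi @ Phi)         # sums to the prior p(Phi) = ||Phi||^2
```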

Feature selection. While in Section 4.1 we used the likelihood to optimize for the model of each class (simply put: the vectors $a_i$, $i = 1..k$), we can use the likelihood $\hat L$ as a scoring function of the quality of our data, i.e., if $\max_{\{a_i\}} \hat L(\{a_i\}_{i=1}^k) = \sum_{i=1}^k \sigma_i^4$ is large then we know that our data can be clustered well by the Born-rule-based spectral-clustering techniques.

³ If the reader believes that we should have started with a derivation that contains the per-class prior for the two-class clustering case, please write me a note. Our aim was to make the first result as simple as possible. In QM our use of priors over our vector model seems to match well with the construction of observables from simple measurements.
⁴ We abuse the notation a bit so that the relation between p(i|Φ) and $a_i$ is more clear.

This allows us to "improve" our data⁵. In the feature selection case we would like to select a subset of the variables such that the resulting data would have a large score, i.e., if we have $l$ variables, we would like to find a binary vector $\alpha$ of $l$ entries such that the sum of the fourth powers of the singular values of $\mathrm{diag}(\alpha)\phi$ is maximized. The diagonal matrix $\mathrm{diag}(\alpha)$ determines which variables (rows of $\phi$) are selected.

Optimizing that score over all possible assignments of $\alpha$ is a problem of exponential complexity. The Q − α algorithm circumvents this problem by assigning a positive weight $\sqrt{\alpha_i}$ to each one of the variables, where $0 \le \alpha_i \le 1$. The optimization problem then becomes:

$\max_\alpha \sum_{i=1}^k \sigma_i^4 = \max_\alpha \sum_{i=1}^k \big(v_i^{\alpha\top} \phi^\top D(\alpha)\phi\, v_i^\alpha\big)^2 = \max_{\alpha,Q} \sum_{i=1}^k \big(q_i^\top \phi^\top D(\alpha)\phi\, q_i\big)^2 = \max_{\alpha,Q} \mathrm{Tr}\big(Q^\top(\phi^\top D(\alpha)\phi)(\phi^\top D(\alpha)\phi)Q\big),$

where $V_\alpha = [v_1^\alpha|\ldots|v_k^\alpha]$ is the matrix of the eigenvectors of $\phi^\top D(\alpha)\phi$, and where $Q = [q_1|\ldots|q_k]$ is constrained to be an $n$ (the number of vectors $\Phi_j$) by $k$ (the number of clusters) matrix with orthonormal columns. The above score depends on the scale of $\alpha$, and so the norm of $\alpha$ is restricted to be one. Note that no restrictions are put on $\alpha$ to make it a vector of positive entries. Still, the vector $\alpha$ which maximizes the score is very likely to have all entries of the same sign. This property, along with arguments predicting sparsity of the entries of $\alpha$, can be found in [31].

The optimization of the score $\max_{\alpha,Q}\mathrm{Tr}\big(Q^\top(\phi^\top D(\alpha)\phi)(\phi^\top D(\alpha)\phi)Q\big)$ is carried out using an iterative algorithm which interweaves the computation of $\alpha$ and improvements of the matrix $Q$. Details can be found in [31].

⁵ It would be interesting to compare the simple score of data quality we derived here to the one that can be derived using Gaussian Processes [16]. If we consider a Gaussian Process with a kernel $C$ of size $n\times n$, then the most likely orthogonal set of $k$ data points would be the first $k$ eigenvectors of $C$. This can be used to give a partial explanation for the success of spectral clustering. The likelihood of these $k$ data points (given as columns of the matrix $t$) would be $L = -\frac{1}{2}\log\det C - \frac{1}{2}\mathrm{Tr}(t^\top C^{-1}t) - \frac{n}{2}\log 2\pi = -\log\prod_{i=1}^n\sigma_i - \frac{1}{2}\sum_{i=1}^k 1/\sigma_i^2 - \frac{n}{2}\log 2\pi$. Likelihoods of this sort are often optimized in the Gaussian Process literature, but only over a few parameters. This is done in the supervised case, where $t$ is the actual data, and the missing parameter is, for example, the width of the Gaussian. It is by far less tractable to optimize than our simple score.

In case $k$ is unknown, we may choose to assume it equals $n$. This is because in our score smaller singular values contribute much less than the larger ones. In the case where $k = n$ there is no need to optimize for the eigenvectors $Q$ and the score simply becomes $\mathrm{Tr}\big(\phi^\top D(\alpha)\phi\,\phi^\top D(\alpha)\phi\big) = \sum_{i,j}\alpha_i\alpha_j (f_i^\top f_j)^2$, where $f_i$ denotes the $i$-th row of $\phi$ (the $i$-th variable). This bilinear problem is optimized when $\alpha$ is the first eigenvector of the matrix whose $(i,j)$-th element is $(f_i^\top f_j)^2$.

The effect of the choice of taking $k = n$ is demonstrated on a synthetic dataset in Fig. 1. In this experiment, the first three variables were picked from four multivariate normal distributions with diagonal covariance matrices. The remaining 200 variables are selected from the same distributions but were each permuted independently. In this way, the remaining 200 features give no information as to the underlying cluster from which the data points stem. As can be seen, the assignment of weights by taking the first principal component of the data is uninformative for feature selection purposes. When $k$ is selected appropriately, the Q − α algorithm assigns very high weights to the relevant variables, and much lower weights to the rest. When $k$ is set to its maximum possible value, Q − α still successfully detects the relevant variables, but assigns somewhat higher weights to the irrelevant variables. We use a similar setting in Fig. 2, where we take only two clusters (to remove clutter) to demonstrate that, although the Q − α algorithm is "merely" a linear weighting algorithm, and although it has relations to the PCA algorithm, applying Q − α as a preprocessing step before PCA can prove to be very important⁶.
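The non-iterative $k = n$ case just described admits a very short implementation. The following is a minimal sketch (ours); the normalization of the feature rows follows the remark later in this section, and the function name and random data are our own assumptions.

```python
import numpy as np

def q_alpha_weights_k_eq_n(phi):
    """Sketch (ours) of the non-iterative k = n variant of Q-alpha described above.

    phi : l x n array (l variables/features as rows, n examples as columns).
    Returns one weight per variable: the leading eigenvector of the matrix
    whose (i, j) entry is (f_i^T f_j)^2, where f_i is the i-th row of phi.
    """
    F = phi / np.linalg.norm(phi, axis=1, keepdims=True)  # normalize each feature row
    M = (F @ F.T) ** 2                                    # (i, j) entry: (f_i^T f_j)^2
    eigvals, eigvecs = np.linalg.eigh(M)
    alpha = eigvecs[:, -1]                                # eigenvector of the largest eigenvalue
    if alpha.sum() < 0:                                   # fix the sign ambiguity of eigenvectors
        alpha = -alpha
    return alpha

# Usage: rank the variables of a random l x n data matrix by their Q-alpha weight.
rng = np.random.default_rng(4)
weights = q_alpha_weights_k_eq_n(rng.standard_normal((30, 100)))
ranking = np.argsort(-weights)
```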

An important remark about our score function: assume we would like to use the likelihood $L$ (the one that does not contain priors) in order to derive a feature selection scheme. Our score would become $\max_{\{a_i\}} L(\{a_i\}_{i=1}^k) = \sum_{i=1}^k \sigma_i^2 = \mathrm{Tr}(\phi^\top\phi)$, where the last equality is due to the equality between the trace of a matrix and the sum of its eigenvalues. However, kernels (e.g., $\phi^\top\phi$) with a large trace are poor in terms of generalization [7]. Generalization is crucial for unsupervised feature selection, just as it is in the supervised case, e.g., we would like to get a good clustering, but we would not like to get a kernel $A_\alpha$ which supports any random clustering. This is exactly why we need to add the per-cluster prior: without it (using a score based on $L$) a

⁶ We find that the role of feature selection is not well understood, even by some very experienced researchers. For example, many people believe that SVM will always pick a hyperplane which uses only the relevant features because "it chooses the best hyperplane". If the reader shares similar views, we would like to refer him/her to [26, 6, 1].


Figure 1: Comparison of PCA, Q−α with the correct number of clusters (k = 4), and Q−α with k = n on a synthetic dataset. The first three variables are relevant. Only the first 50 variables out of the 203 are shown. The case with k = 4 produces sharper results than the case where k = n, but both succeed in selecting the relevant variables. Contrary to popular belief, PCA does not detect the relevant variables.


Figure 2: This figure demonstrates the importance of feature selection in the unsupervised setting, and that, although Q − α and PCA use similar optimization functions, they are very different in their ability to deal with irrelevant variables: (a) Three relevant dimensions out of 203. The rest of the dimensions are similar but were permuted to remove any class membership information; (b) PCA of only the relevant dimensions; (c) PCA of the whole 203 dimensions; (d) The first two columns of the Q matrix, which are similar to applying PCA after weighting according to the α weights recovered by the Q − α algorithm.

clustering where every point is a cluster by itself would be considered a good clustering. Using the per-cluster prior, each such one-point cluster would have a very low prior, and the score based on the likelihood $\hat L$ would be much lower.

Note that in the Q − α algorithm, the trace of the kernel is controlled by the constraint on the norm of $\alpha$. Let $f_i$, $i = 1..l$, be the $l$ rows of the matrix $\phi = [\Phi_1|\ldots|\Phi_n]$. The kernel matrix can be written as a sum of rank-one matrices, $\phi^\top\mathrm{diag}(\alpha)\phi = \sum_{k=1}^l \alpha_k f_k f_k^\top$, each with trace $\mathrm{Tr}(f_k f_k^\top) = f_k^\top f_k$, and so the trace of the kernel matrix is $\mathrm{Tr}(\phi^\top\mathrm{diag}(\alpha)\phi) = \sum_{k=1}^l \alpha_k f_k^\top f_k$. Before applying the Q − α algorithm we usually normalize each $f_k$ to have a norm of one, so that the results are independent of the scale of each feature. Hence the trace of the kernel matrix is just $\sum_l \alpha_l$. Under the constraint that $\alpha$ has a norm of one, this trace is larger when the $\alpha$ vector is more uniform. However, the maximization of the score function based on the squares of the eigenvalues of the kernel matrix encourages a sparse solution [31].

4.3 Set to Set similarities

Another machine learning tool that we consider is the construction of a similarity function between two sets of norm-1 vectors. Given a matrix $\phi$ whose columns are the vectors in our set $\{\Phi_i\}_{i=1}^n$, we can view it as a set of pure states, or alternatively as the mixed state $\Gamma_\Phi = (1/n)\phi\phi^\top = (1/n)\sum_{i=1}^n \Phi_i\Phi_i^\top$.

Given a new set $\{\Psi_j\}_{j=1}^m$, we would like to find a similarity between the two sets. Let $\psi$ be a matrix whose columns are the vectors of the new set. We can use, for example, either one of these similarity measures:

(a) $\frac{1}{m}\sum_{j=1}^m \Psi_j^\top\Gamma_\Phi\Psi_j = \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m(\Phi_i^\top\Psi_j)^2 = \frac{1}{nm}\|\phi^\top\psi\|_F^2$

(b) $\frac{1}{m}\,\frac{\sum_{j=1}^m \Psi_j^\top\Gamma_\Phi\Gamma_\Phi\Psi_j}{\mathrm{Tr}(\Gamma_\Phi\Gamma_\Phi)} = \frac{1}{m\,\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}\sum_{i,j=1}^n\sum_{k=1}^m(\Phi_i^\top\Phi_j)(\Phi_i^\top\Psi_k)(\Phi_j^\top\Psi_k) = \frac{1}{m}\,\frac{\|\phi\phi^\top\psi\|_F^2}{\|\phi\phi^\top\|_F^2}$

These similarities are based on applying the Born rule in its simplest form. However, they are unnatural in the sense that if we model set elements as states, then it seems that we apply the Born rule not between a state and a measurement but between two states. The underlying reason is quite simple: the measurement that maximizes the distinguishability of a state ($\Phi$) from other states has the same vector representation as $\Phi$ [33].

Method (a) therefore measures the expectation of observing, on the second set, the measurements that best identify the pure states which appear in the first set. However, this does not give the best global measure of identifying the set $\phi$ in the following sense: it does not maximize the expectation $\frac{1}{n}\sum_{i=1}^n\Phi_i^\top\rho\Phi_i$ subject to $\rho$ being a density matrix. This expectation is maximized by letting $\rho = aa^\top$, where $a$ is the first eigenvector of $\Gamma_\Phi$⁷. However, measuring the similarity of two sets by considering just one observation does not produce good results; the underlying problem is over-fitting, and is somewhat similar to the need for regularization in the Expected Risk Minimization framework. Method (b) is a trade-off between maximizing the expectation above⁸ and having a more "round" similarity measure. The measurements used in similarity (b) are the columns of the matrix $\Gamma_\Phi$, which are weighted combinations of the measurements used in (a).

Although both methods were derived in an asymmetric way, that is, treating one set as the target set and the other as a set that needs to be distinguished from it, the similarity method (a) is a positive definite similarity measure between sets. Therefore it can be used together with any kernel-based learning algorithm. To see this property it is enough to note that $(1/nm)\sum_{ij}(\Phi_i^\top\Psi_j)^2 = (1/nm)\sum_{ij}\chi(\Phi_i)^\top\chi(\Psi_j) = (1/nm)\big(\sum_i\chi(\Phi_i)\big)^\top\big(\sum_j\chi(\Psi_j)\big)$, where $\chi$ is a mapping from a vector to the vector of all its second-order monomials. This positive definite similarity function between sets is different from the one described in [30], where the determinant of the matrix $\phi^\top\psi$ and similar functions were used.
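Both similarity measures reduce to a few matrix operations. Here is a minimal sketch (ours, under the reconstruction of (a) and (b) given above); the toy sets and function names are our own assumptions.

```python
import numpy as np

def set_similarity_a(phi, psi):
    """Similarity (a) above: (1 / (n m)) * ||phi^T psi||_F^2 (our sketch)."""
    n, m = phi.shape[1], psi.shape[1]
    return np.linalg.norm(phi.T @ psi, 'fro') ** 2 / (n * m)

def set_similarity_b(phi, psi):
    """Similarity (b) above: (1 / m) * ||phi phi^T psi||_F^2 / ||phi phi^T||_F^2 (our sketch)."""
    m = psi.shape[1]
    G = phi @ phi.T
    return np.linalg.norm(G @ psi, 'fro') ** 2 / (m * np.linalg.norm(G, 'fro') ** 2)

# Usage: compare a set to a perturbed copy of part of itself and to an unrelated set.
rng = np.random.default_rng(5)
phi = rng.standard_normal((10, 40)); phi /= np.linalg.norm(phi, axis=0, keepdims=True)
psi_close = phi[:, :20] + 0.05 * rng.standard_normal((10, 20))
psi_close /= np.linalg.norm(psi_close, axis=0, keepdims=True)
psi_far = rng.standard_normal((10, 20)); psi_far /= np.linalg.norm(psi_far, axis=0, keepdims=True)
print(set_similarity_a(phi, psi_close), set_similarity_a(phi, psi_far))
print(set_similarity_b(phi, psi_close), set_similarity_b(phi, psi_far))
```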

Using the similarity between sets for retrieval. The retrieval problem we consider is the following: we are given a query set A containing at least one example from a specific category, and a set B of unlabeled examples. Our task is to rank all the points in set B from those which

⁷ $\frac{1}{n}\sum_{i=1}^n\Phi_i^\top\rho\Phi_i = \mathrm{Tr}(\Gamma_\Phi\rho) \le \max_{\text{unitary } U}\mathrm{Tr}(\Gamma_\Phi U^\top\rho U)$, which according to Theorem 3.1 of [8] is less than the dot product between the vectors that contain the eigenvalues of the matrices $\Gamma_\Phi$ and $\rho$ (i.e., it is less than $\sum_i\lambda_i(\Gamma_\Phi)\lambda_i(\rho)$). To maximize that expression under the constraint that the sum of the eigenvalues is one, we would like $\rho$ to have just one non-vanishing eigenvalue. In the case $\rho = aa^\top$ we get a strict equality.
⁸ The expectation which is the similarity of the set $\phi$ to itself by method (a) is given by $\frac{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}{\mathrm{Tr}(\phi\phi^\top)}$, and the one given by method (b) is $\frac{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top\phi\phi^\top)}{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}$. The latter is never smaller than the former, since for any set of positive eigenvalues of the matrix $\phi\phi^\top$, denoted by $\{\lambda_i\}$, we have $\big(\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)\big)^2 - \mathrm{Tr}(\phi\phi^\top)\,\mathrm{Tr}(\phi\phi^\top\phi\phi^\top\phi\phi^\top) = \big(\sum_i\lambda_i^2\big)^2 - \sum_i\lambda_i\sum_j\lambda_j^3 = -\sum_{i>j}(\lambda_i-\lambda_j)^2\lambda_i\lambda_j \le 0$.

are more likely to be of the same category as the examples in set A to those that are less likely. Below we offer two methods, RMa and RMb.

The first method (RMa) treats each point as a set of one point and ranks each point in B by

its similarity, using the set similarity method (a), to the set of labeled points A. The points in set B are then ranked from the most similar to the least similar. This method is very simple, but it does not incorporate information on the structure of set B until the last stage of sorting the per-point similarity scores.

The second method (RMb) weights each point $i$ in set B by a weight $\sqrt{\alpha_i}$, and tries to maximize the similarity between the resulting set and set A⁹. Thus we search for a vector $\alpha$ which maximizes $\frac{\|\phi\,\mathrm{diag}(\alpha)\,\phi^\top\psi\|_F^2}{\|\phi\,\mathrm{diag}(\alpha)\,\phi^\top\|_F^2}$, where the matrix $\phi$ holds the elements of set B and the matrix $\psi$ holds the elements of set A. Both the numerator and the denominator are clearly bilinear in $\alpha$, and so this can be written as $\frac{\alpha^\top G\alpha}{\alpha^\top H\alpha}$, the maximization of which is just a generalized eigenvalue problem, and is given up to scale. Following the algebra one can verify that $G_{ij} = (\Phi_i^\top\Phi_j)\sum_k(\Phi_i^\top\Psi_k)(\Phi_j^\top\Psi_k)$ and $H_{ij} = (\Phi_i^\top\Phi_j)^2$.

The ranking is obtained by sorting the values of $\alpha$. We set the sign of the value of $\alpha$ which has the largest magnitude to be positive, and sort all of the $\alpha$ values to get a ranking of all the examples in set B. Note that, although it is not guaranteed that the entries of the vector $\alpha$ will all be of the same sign, in almost all of the experiments we ran this was the case. If some of the values happened to differ in sign from the value with the largest magnitude, in our experience they were of very small magnitude.

Note that RMb examines not only the similarity of single elements in set B to set A but tries to find the subset of B which is the most similar to A. Also, note that the set-to-set similarity (b) is normalized such that sets which are very similar to themselves (highly clustered) are penalized in the denominator of this similarity, giving balance to this similarity score.
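A compact sketch (ours) of the RMb ranking, under the $G$ and $H$ matrices reconstructed above; the use of SciPy's generalized eigensolver and the small ridge term are our own assumptions.

```python
import numpy as np
from scipy.linalg import eigh  # assumption: SciPy is available for the generalized problem

def rmb_ranking(phi_B, psi_A):
    """Sketch (ours) of the RMb ranking described above.

    phi_B : d x nB matrix of candidate points (set B, columns).
    psi_A : d x nA matrix of query points (set A, columns).
    Solves max_alpha (alpha^T G alpha) / (alpha^T H alpha) as a generalized
    eigenvalue problem and ranks the points of B by the resulting weights.
    """
    K = phi_B.T @ phi_B             # K[i, j] = Phi_i^T Phi_j
    M = phi_B.T @ psi_A             # M[i, k] = Phi_i^T Psi_k
    G = K * (M @ M.T)               # G[i, j] = (Phi_i^T Phi_j) * sum_k (Phi_i^T Psi_k)(Phi_j^T Psi_k)
    H = K ** 2                      # H[i, j] = (Phi_i^T Phi_j)^2
    H += 1e-8 * np.eye(H.shape[0])  # small ridge for numerical stability (our choice)
    w, vecs = eigh(G, H)            # generalized symmetric eigenproblem G v = w H v
    alpha = vecs[:, -1]             # eigenvector of the largest generalized eigenvalue
    if alpha[np.argmax(np.abs(alpha))] < 0:   # sign convention from the text
        alpha = -alpha
    return np.argsort(-alpha)       # indices of B, most relevant first
```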

⁹ A similar method using set similarity (a) would result in method RMa. This is because the solution of $\arg\max_\alpha \|\psi^\top\phi\,\mathrm{diag}(\sqrt{\alpha})\|_F^2$, s.t. $\|\alpha\| = 1$, is proportional to the vector with elements $\|\psi^\top\Phi_i\|_F^2$.

19 4.4 Classification

There are several ways in which one can derive classification algorithms based on Born rule probabilities. Here we will derive an existing classifier which is very similar to the use of kernel density estimators [14] for classification. This is an effective classification method, and we are going to base it on the set-to-set similarity (a) given above. Our derivation is designed for norm-1 vectors only; in practice we normalize the vectors in accordance with the kernel used (e.g., normalize each vector by its norm for the linear kernel), or simply use an RBF kernel.

Given a set of $n$ positive examples, as columns of a matrix $\phi$, we can compute for any new example $\Psi$ the similarity according to method (a) above, $O^+(\Psi) = \frac{1}{n}\|\phi^\top\Psi\|^2$. This measure of similarity is a normalized expectation, and we can use a simple ratio test to classify the point $\Psi$, i.e., we classify $\Psi$ to have a positive label if $\frac{O^+(\Psi)}{1 - O^+(\Psi)} > 1$.

Given a set of negative examples we can build a second measure $O^-$ and combine the two using the naïve Bayes approach. The classification engine we use in the experiments below simply checks whether $\frac{O^+(\Psi)}{1-O^+(\Psi)}\cdot\frac{1-O^-(\Psi)}{O^-(\Psi)} > 1$. This is a very simple plug-in decision rule, which requires virtually no training. As a plug-in classifier it generalizes well [11].
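The whole decision rule fits in a few lines. A minimal sketch (ours) for the linear-kernel case follows; the class name and the normalization details are our own assumptions, and the sketch assumes the similarities stay strictly between 0 and 1.

```python
import numpy as np

class BornClassifier:
    """Sketch (ours) of the plug-in two-class rule described above, linear kernel."""

    def fit(self, X_pos, X_neg):
        # Columns are examples; each column is normalized to norm 1 (linear-kernel case).
        self.pos = X_pos / np.linalg.norm(X_pos, axis=0, keepdims=True)
        self.neg = X_neg / np.linalg.norm(X_neg, axis=0, keepdims=True)
        return self

    @staticmethod
    def _similarity(phi, psi):
        # O(Psi) = (1/n) ||phi^T Psi||^2, computed for every column of psi at once.
        return np.sum((phi.T @ psi) ** 2, axis=0) / phi.shape[1]

    def predict(self, X):
        psi = X / np.linalg.norm(X, axis=0, keepdims=True)
        o_pos = self._similarity(self.pos, psi)
        o_neg = self._similarity(self.neg, psi)
        # Positive label when the combined odds ratio exceeds one.
        ratio = (o_pos / (1 - o_pos)) * ((1 - o_neg) / o_neg)
        return (ratio > 1).astype(int)

# Usage sketch: clf = BornClassifier().fit(X_pos, X_neg); labels = clf.predict(X_test)
```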

5 Experiments

The clustering algorithm we suggest amounts to a variant of spectral clustering, which was extensively studied [27, 18]. Therefore we will not present extensive new clustering experiments.

However, there is a point we would like to make about the classification of out-of-sample points. Since we use a model-based approach, our clustering can be applied to any new point without re-clustering. This is in contrast to the original NJW algorithm and to the original Laplacian Eigenmaps [4]. To overcome this problem, a regularization term was recently added to the Laplacian Eigenmaps algorithm [5].

In our framework out-of-sample extensions come naturally, since for any new point Ψ we can compute our model-based probabilities. These probabilities can be expressed as functions

20 1.2 1.2

1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0

−0.2 −0.2

−0.4 −0.4

−0.6 −0.6

−1 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 (a) (b)

Figure 3: This figure demonstrates out-of-sample classification using the Born rule spectral clustering scheme. (a) The original training dataset of the two moons example of [5]. (b) The resulting clustering, together with an out-of-sample classification of the nearby 2D plane.

of probabilities of the form $(u_i^\top\Psi)^2$, where $u_i$ is an eigenvector of the input data's covariance matrix. Since the points in our kernel space can have a very high or perhaps infinite dimension, we cannot compute the eigenvectors $u_i$ directly. However, we can use the equality used in kernel PCA: $u_i = (1/s_i)\phi v_i$ (as before, the $v_i$ are the eigenvectors of the kernel $\phi^\top\phi$, the $s_i$ are the singular values, and the equality is rooted in the singular value decomposition). To compute the dot product $u_i^\top\Psi = (1/s_i)v_i^\top\phi^\top\Psi$ we only need to know the elements of the vector $\phi^\top\Psi$. This vector contains the value of the kernel function computed between every element in our training set (the columns of $\phi$) and the new example $\Psi$.

The above out-of-sample classification scheme is demonstrated in Fig. 3. We used the two moons example of [5], and a Gaussian kernel similar to the one used there. Unlike [5], there is no need to have any labeled examples.
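A short sketch (ours) of this kernel-based out-of-sample computation; the Gaussian-kernel helper, its bandwidth, and the random data are our own assumptions.

```python
import numpy as np

def out_of_sample_memberships(K_train, k_new, k):
    """Sketch (ours) of the out-of-sample computation described above.

    K_train : n x n kernel matrix phi^T phi of the training points.
    k_new   : length-n vector phi^T Psi of kernel values between the training
              points and a new example Psi.
    k       : number of clusters.
    Returns (u_i^T Psi)^2 for i = 1..k, via u_i^T Psi = (1/s_i) v_i^T (phi^T Psi).
    """
    eigvals, V = np.linalg.eigh(K_train)
    order = np.argsort(eigvals)[::-1][:k]           # top-k eigenpairs of the kernel
    s = np.sqrt(np.maximum(eigvals[order], 1e-12))  # singular values of phi
    proj = (V[:, order].T @ k_new) / s              # u_i^T Psi for each i
    return proj ** 2

# Usage with a Gaussian (RBF) kernel; sigma is chosen arbitrarily here.
def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
X_train = rng.standard_normal((2, 50))
x_new = rng.standard_normal((2, 1))
p = out_of_sample_memberships(rbf(X_train, X_train), rbf(X_train, x_new)[:, 0], k=2)
```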

5.1 Feature selection experiments

The feature selection method that we suggested is the Q − α method, which was shown to be effective on vision, genomics and on other datasets [31, 23, 32]. Still, as it is less known and

accepted than the spectral clustering algorithms, we will present new experiments. In these new experiments we focus on supervised tasks to check the quality of our unsupervised feature selection. As it turns out, Q − α outperforms many supervised feature selection algorithms (see also [32]). Experiments showing the performance of the Q − α algorithm in the unsupervised case are presented in the conference papers [31, 23]. Since we did not want to optimize over the free parameter used for the feature selection, k (generally referred to as the number of classes), we present results for the case where k = n (n being the number of examples at hand).

In this case the Q − α algorithm is non-iterative, so there is also a considerable saving of run time.

Face Recognition. We first applied Q − α to the task of face recognition. We used two publicly available datasets: the YALE dataset [34] and the AR dataset [17]. The YALE dataset contains 15 different persons, each photographed 11 times under different illumination, with different expressions, and with or without glasses. For the AR dataset we used only the 25 males, each having 52 images. For both datasets we created 20 instances of similar experiments, where three images per person were picked randomly to be the training set for that person. The rest served as the test set.

We compared several algorithms: the eigenface method [25], which uses PCA to reduce the dimension of the face images; the fisherface method [3], which applies multi-class Fisher discriminant analysis (FDA) to face images after they have been reduced in dimension using PCA; both methods after applying conventional feature selection algorithms; an application of eigenfaces after variable weighting using the Q − α algorithm; and an application of fisherfaces after variable weighting using the Q − α algorithm. All methods were tested on gray-level images and on wavelet features. Results in the table are the maximal score out of these two options; these results are higher for gray values only when using the variants without any feature selection.

Q − α and the other feature selection methods did not seem to work well when dealing with gray-level values directly¹⁰. In order to remove any doubt, we would like to emphasize again that the results were not biased by using different kinds of features. Each method was tested on

¹⁰ At the heart of this phenomenon lies a basic discrepancy between feature selection and dimensionality reduction algorithms. While feature selection algorithms (Fisher score, Q − α, and in a sense AdaBoost) do well with uncorrelated data, they perform much worse when many of the features are correlated. Dimensionality reduction algorithms (PCA, FDA, and in a sense SVM) thrive on correlation, but fail when dealing with a large number of uncorrelated features. Gray values are highly correlated, therefore PCA, FDA and SVM work well on them. Wavelets are not correlated, and therefore are more suitable for boosting, feature selection, and naïve Bayesian approaches. As shown in the experiments, the Q − α algorithm enables PCA, SVM and FDA to work with wavelet features (in the past, wavelets and SVM were combined for pedestrian detection [19]; for that end a correlated over-complete wavelet basis was used).

Dataset   PCA          FS + PCA     Q − α + PCA   Fisher       FS + Fisher   Q − α + Fisher
Yale      70.0 ± 2.5   67.6 ± 4.9   81.2 ± 2.4    81.8 ± 3.1   83.4 ± 2.7    88.2 ± 2.4
AR        62.9 ± 2.4   62.9 ± 4.4   70.6 ± 1.8    90.4 ± 1.8   90.9 ± 1.6    94.6 ± 0.8

Table 1: Success rates for the face recognition task on two databases. For each experiment three training images per person were used, and the rest were used for testing. Each experiment was repeated 20 times; the results in the table are mean performance and standard deviation. The algorithms are PCA (eigenfaces), PCA following a traditional supervised feature selection (best out of Pearson coefficients, Fisher criterion score, and the KS test), PCA following Q − α weighting of the variables, and three similar variants of FDA (fisherfaces).

both types, and the reported performance is the best result of the two. In Table 1, each column is a different algorithm and each row is a different dataset. The values in each cell are the average recognition rate (in percent) over 20 runs and the standard deviation of the results. Results are shown for all methods for the PCA dimension which gave the best results. For feature selection (FS) we show the best out of three supervised methods

(Pearson coefficients, Fisher criterion score, and the Kolmogorov-Smirnov test) and for the best percentile of kept features. It turns out that for these datasets, the Fisher criterion score did the best out of these three, and this performance occurred when keeping 40-50% of the features.

As can be seen, the use of wavelets together with the Q − α algorithm improves the results both for eigenfaces and for fisherfaces. For the YALE dataset, running FDA on the Q − α weighting of wavelet features was superior to the FDA results on gray levels in 75% of the tries. The methods gave the same performance in 15% of the tries, and in only 10% of the tries did the plain FDA method win. For the AR dataset, the method using wavelets + Q − α + FDA gave the best performance in all runs.

Place recognition. For this experiment we used the data collected by Torralba, Murphy, Freeman and Rubin in order to compare supervised learning algorithms. Given an image, the goal is to put it into one of ten categories: conference-room, corridor, elevator lobby, inside elevator, kitchen, lab, office, open area, plaza and street. The data consists of 50757 frames collected over 17 sequences using a wearable camera. While the method of [24] achieved good identification results by using temporal information, the task of identifying a single frame is quite difficult; for example, it is not easy to distinguish a lab from an office or a conference room. Each frame in the place recognition dataset was represented by a vector of 384 dimensions, consisting of the output of steerable filters applied to the input 120 × 160 image at several scales. This representation, together with the frame annotation, was made publicly available by the authors of [24].

We conducted repeated one-vs-all experiments. In each experiment 100 random examples of one place category served as the set of positive training examples, and 100 random examples from the rest of the frames served as negative examples. The results were then tested on a much larger set of testing examples, which contained the same number of positive and negative examples. Each such one-vs-all experiment was repeated three times.

In [24] the authors used the gentle boost algorithm on top of PCA. In our experiments, SVM

always outperforms the boosting algorithm (we used the same implementation, provided to us by Torralba). This is probably due to the relatively small number of training examples used in our experiments. Also, in all our experiments, PCA did not help at all (not even for boosting), hence we used the original 384-dimensional data. The results we got are 26.67% error for an 80-dimensional PCA followed by a linear SVM, 25.10% error for a linear SVM, and 23.17% error for Q − α weighting of the features followed by an SVM. We could improve the results a bit by applying a novel supervised version of Q − α (not reported here) and get 22.83% error.

5.2 Retrieval experiments

In our retrieval experiments we compared three methods: ranking by distance to the closest element in the query set A (NN), which is probably the most commonly used method; ranking using RMa (the method based on set-to-set similarity (a)); and RMb (the method based on set-to-set similarity (b)). To avoid any bias in favor of our methods we only used the linear kernel (NN could not be improved by a Gaussian kernel, and we wanted to avoid the appearance of tilting the results by using multiple tests with several parameters).

In the first set of experiments, we used datasets from the UCI machine learning repository.

Method               NN                  RMa                 RMb
Split ratio          5%    10%   30%     5%    10%   30%     5%    10%   30%
Dermatology          .51   .52   .52     .87   .87   .88     .83   .80   .78
Ecoli                .97   .97   .98     .92   .92   .95     .87   .91   .95
Glass                .44   .38   .36     .43   .36   .35     .62   .60   .62
Letter Recognition   .68   .68   .68     .73   .74   .74     .90   .94   .96
Segmentation         .96   .96   .96     .90   .90   .95     .84   .84   .85
Wine                 .94   .93   .95     .80   .85   .88     .05   .04   .04
Yeast                .58   .58   .60     .57   .57   .60     .53   .54   .63

Table 2: Retrieval experiments on several UCI datasets.

In each experiment only 40% of each dataset was used (this is important to avoid bias because RMb uses all the examples to determine the ranking). We split each partial (40%) dataset into two sets, A and B, for various sizes of set A. A split ratio of 20% means that 20 percent of the largest class was used as set A and the rest of all the examples as set B. Each such experiment was repeated 20 times, and the average results are shown below. As a measure of success we used the area under the ROC curve generated by correct and incorrect retrievals from set B. The results are summarized in Table 2. They show that for many datasets the methods RMa and RMb perform significantly better than NN. However, this is not the case for all datasets.

Sometimes NN does better.

Note that, in general, we should not expect monotonic performance as set A grows, since the ratio of positive examples in set B becomes worse. Also, note that sometimes method RMb can "lock" onto the wrong class. This happens if the negative examples contain a very tight cluster, while the positive ones are very loosely distributed.

Another set of retrieval experiments was conducted by Stan Bileschi. In our dataset we had 80 image patches taken from car images. These patches ("parts") were selected automatically using an interest operator (some are not specific to any car part like a wheel or mirror, and are just patches around the point given by the interest operator). For each part we got 10-50 possible matches in our image dataset, using normalized cross correlation with a predefined threshold. Our goal was to use these possible matches to train a part classifier. We were hoping to get 20

Method        NN     RMa    RMb
Gray values   2.21   2.07   2.46
Wavelets      2.16   2.69   1.44

Table 3: Retrieval results for the car parts dataset. Entries show the average, over all parts, of the ratio between the number of correct retrievals in the first 20 answers and the number of retrievals expected by chance.

We were hoping to get 20 positive training images for each part, which is enough to train a very reliable classifier. For verification purposes we manually marked all correct matches; in each candidate set there were between 5 and 20 of those. An example is shown in Fig. 4. Since every part had a different number of good and false candidates, we needed a normalized score for retrieval success, which is described next. Each algorithm is asked to provide 20 retrievals, and a high score is given to an algorithm that does better than what would be expected from a random pick. The score is the ratio between the number of good matches produced by the algorithm and the number expected by chance under a random pick; e.g., if there are 15 good examples out of 30 candidate matches, we expect to get 10 out of 20 by chance, and if we get 15 correct matches the score is 1.5. The best average score one can hope for on this scale for our specific dataset is about 3.
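As a sketch, the chance-normalized score can be computed as follows (names are ours; the arithmetic matches the example above):

```python
def retrieval_score(n_correct_in_top_k, n_good, n_candidates, k=20):
    """Ratio between the number of correct retrievals among the first k answers
    and the number expected from a random pick of k candidates."""
    expected_by_chance = k * n_good / n_candidates
    return n_correct_in_top_k / expected_by_chance

# 15 good examples out of 30 candidates: a random pick of 20 yields 10 good ones
# in expectation, so retrieving 15 correct matches gives a score of 1.5.
assert retrieval_score(15, 15, 30) == 1.5
```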

To apply the set-to-set comparison methods, we created a small query set consisting of the original example and the two closest examples; this set of three examples forms set A. In order to get reliable retrievals we tried several algorithms (for example, dimensionality reduction followed by NN, geodesic NN, and methods similar to tangent distance), but NN (using normalized cross correlation) seems to do just as well on this dataset. We tried two types of features: the gray level values of the image (stacked as one long vector), and Haar wavelet features. For both types of features NN does reasonably well, but not well enough for our purposes: it scores around 2.2 for both gray level and wavelet features. The method denoted RMa does much better for wavelets, scoring almost 2.7, while the other set-to-set comparison method, RMb, does better than NN only for gray level features, where it scores more than 2.4. The full results are summarized in Table 3.
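A sketch of how such a three-element query set might be built using normalized cross correlation (this is our illustration, not the original implementation; patches are assumed to be flattened into vectors):

```python
import numpy as np

def ncc(u, v):
    """Normalized cross correlation between two flattened patches."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def build_query_set(seed_patch, candidate_patches):
    """Set A = the original example plus the two candidates with highest NCC to it."""
    scores = np.array([ncc(seed_patch, p) for p in candidate_patches])
    two_closest = np.argsort(-scores)[:2]
    return [seed_patch] + [candidate_patches[i] for i in two_closest]
```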

Figure 4: An example of a part in our dataset. A green box marks a good detection; red boxes mark false detections. The task was to automatically find all good detections given the one example in the upper left corner.

Figure 5: A comparison of several retrieval methods (NN, mean distance to the query set, RMa, and RMb) applied to the place recognition dataset. The X axis is the number of images used as the query set A, and the Y axis shows the resulting area under the ROC curve for each of the four methods. While RMa fails to achieve good results on this dataset, RMb does best; NN and mean distance to the query set do not perform well.

A third set of retrieval experiments was conducted using the place recognition dataset of [24] described above. In each experiment one class out of the ten was picked as the query class and another as the distraction class. 400 examples from each class were picked to form a set of 800 examples (this number was limited because some classes do not have many examples). The query set A consisted of one to 18 examples of the query class, while set B contained all other examples. We compared four methods: in addition to NN, RMa, and RMb, we ranked the points in set B according to the average distance between each point and all the points in set A. The results in Figure 5 show the area under the ROC curve, averaged over 60 runs.
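A sketch of one run of this comparison for the two simple baselines, NN and mean distance (RMa and RMb are defined earlier in the paper and are not reproduced here; all names are ours, and at least 400 examples per class are assumed):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import roc_auc_score

def rank_scores(A, B, aggregate):
    # aggregate is np.min for NN and np.mean for the mean-distance criterion
    return -aggregate(cdist(B, A), axis=1)

def one_run(X_query_class, X_distraction_class, n_query, rng):
    q = rng.permutation(len(X_query_class))[:400]
    d = rng.permutation(len(X_distraction_class))[:400]
    A = X_query_class[q[:n_query]]                      # 1 to 18 query examples
    B = np.vstack([X_query_class[q[n_query:]], X_distraction_class[d]])
    labels = np.r_[np.ones(400 - n_query), np.zeros(400)]
    return {name: roc_auc_score(labels, rank_scores(A, B, agg))
            for name, agg in [("NN", np.min), ("mean distance", np.mean)]}
```

Averaging the resulting AUC values over 60 such runs would give curves of the kind shown in Figure 5.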

5.3 Classification experiments

We performed two-class classification experiments in which we compared the classification algorithm derived in Section 4 to SVM with linear and RBF kernels and to the nearest neighbor classifier (for which the kernel is irrelevant). The parameter σ of the RBF kernel was searched for each experiment over a large range using cross validation; in general, both SVM and the Born classifier showed the same behavior when this parameter was changed. For SVM we show the results obtained with the best value of the parameter C out of the set [0, 0.1, 1, 10, 100, 1000]. Table 4 reports the error rates in percent, averaged over 20 trials, together with their standard deviations. In each experiment 40% of the data was used for training and 60% for testing, except where noted, when this task seemed too easy. It is apparent from the results that the Born classification method does as well as SVM and NN on most datasets.

Method              | Linear SVM   | RBF SVM      | Linear Born  | RBF Born     | NN
Dermatology (*)     | 3.1 ± 1.0    | 3.0 ± 1.1    | 9.2 ± 1.5    | 3.1 ± 1.0    | 5.6 ± 4.7
Ecoli               | 2.3 ± 1.2    | 1.7 ± 0.8    | 4.4 ± 0.8    | 2.5 ± 1.0    | 2.9 ± 0.8
Glass               | 29.9 ± 3.6   | 22.1 ± 5.0   | 25.6 ± 5.1   | 24.3 ± 3.5   | 22.3 ± 4.6
Letter Recognition  | 1.8 ± 0.7    | 0.2 ± 0.1    | 0.3 ± 0.3    | 0.3 ± 0.2    | 0.3 ± 0.3
Segmentation (*)    | 0.3 ± 0.8    | 0.3 ± 0.8    | 5.7 ± 7.0    | 0 ± 0        | 0 ± 0
Wine                | 5.2 ± 3.0    | 8.2 ± 2.3    | 40.0 ± 26.1  | 13.5 ± 4.8   | 9.2 ± 2.4
Yeast               | 64.0 ± 24.8  | 33.8 ± 1.3   | 34.9 ± 1.6   | 34.6 ± 1.6   | 40.5 ± 1.5

Table 4: Classification experiments on several UCI datasets. The datasets were randomly divided into 40% training and 60% testing, and each experiment was repeated 20 times. (*) For these datasets only 10% of the samples were used for training (otherwise the problem is too easy).
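A minimal sketch of the comparison harness for the baseline methods, using scikit-learn (the Born classifier of Section 4 is assumed to be available as a separate plug-in and is not reimplemented here; the value C = 0 from the grid above is omitted because it is not a valid SVM parameter in this sketch):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def one_classification_trial(X, y, train_frac=0.4, seed=0):
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=seed)
    errors = {}
    # linear SVM: pick the best C from the grid by cross validation
    lin = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10, 100, 1000]}, cv=5)
    errors["Linear SVM"] = 1 - lin.fit(Xtr, ytr).score(Xte, yte)
    # RBF SVM: sigma enters through gamma = 1 / (2 * sigma^2), searched over a large range
    rbf = GridSearchCV(SVC(kernel="rbf"),
                       {"C": [0.1, 1, 10, 100, 1000], "gamma": np.logspace(-4, 2, 13)}, cv=5)
    errors["RBF SVM"] = 1 - rbf.fit(Xtr, ytr).score(Xte, yte)
    # nearest neighbor classifier (no kernel parameter)
    errors["NN"] = 1 - KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xte, yte)
    return errors
```

Repeating such trials 20 times and reporting the mean and standard deviation of the error mirrors the protocol behind Table 4.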

6 Summary

In this work we demonstrated how a simple probability rule can justify simple and effective algebraic algorithms. We explored in detail existing and new algorithms for spectral clustering, feature selection, data retrieval, and classification. The same model-based derivation applied across all of these problems, which suggests that the approach is broadly applicable.

We hope that the techniques developed in this paper will yield further insights into other successful existing algorithms, and that the proposed framework will serve as a constructive path for the derivation of future algorithms.

References

[1] H. Almuallim and T.G. Dietterich. Learning with many irrelevant features. Proc. 9th Nat. Conf. on AI, 1991.

[2] H. Barnum, C.M. Caves, J. Finkelstein, C.A. Fuchs, and R. Schack. Quantum Theory from Decision Theory? Proc. of the Roy. Soc. of London A456, 2000.

[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. PAMI 19(7), 1997.

[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 14, 2002.

[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: a Geometric Framework for Learning from Examples. The University of Chicago CS Technical Report TR-2004-06.

[6] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. AI, 97(1-2), 1997.

[7] O. Bousquet and D.J.L. Herrmann. On the Complexity of Learning the Kernel Matrix. NIPS, 2003.

[8] I.D. Coope and P.F. Renaud. Trace Inequalities with Applications to Orthogonal Regression and Matrix Nearness Problems. Research Report UCDMS2000/17, 2000.

[9] D. Deutsch. Quantum Theory of Probability and Decisions. Proc. of the Roy. Soc. A455, 1999.

[10] L. Devroye. A Course in Density Estimation. Birkhäuser, 1987.

[11] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[12] A. Gleason. Measures on the closed subspaces of a Hilbert space. Journal of Mathematics and Mechanics 6, 885-894, 1957.

[13] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[14] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, 2001.

[15] D. Horn and A. Gottlieb. Algorithms for Data Clustering in Pattern Recognition Problems based on Quantum Mechanics. Physical Review Letters 88(1), 2002.

[16] D. MacKay. Introduction to Gaussian Processes (a review paper). Available at: http://www.inference.phy.cam.ac.uk/mackay/GP/

[17] A.M. Martinez and R. Benavente. The AR face database. Tech. Rep. 24, CVC, 1998.

[18] A.Y. Ng, M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. NIPS, 2001.

[19] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna and T. Poggio. Pedestrian Detection Using Wavelet Templates. In CVPR 1997.

[20] P. Perona and W. T. Freeman. A factorization approach to grouping. In ECCV, 1998.

[21] S. Saunders. Operational Derivation of the Born Rule. In submission.

[22] B. Schölkopf and A.J. Smola. Learning with Kernels. The MIT Press, 2002.

[23] A. Shashua and L. Wolf. Kernel Feature Selection with Side Data using a Spectral Approach. Proc. of the European Conference on Computer Vision (ECCV), May 2004.

[24] A. Torralba, K.P. Murphy, W.T. Freeman, and M.A. Rubin. Context-based vision system for place and object recognition. IEEE Intl. Conference on Computer Vision (ICCV), 2003.

[25] M. Turk and A. Pentland. Face Recognition Using Eigenfaces. In CVPR, 1991.

[26] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs. NIPS, 2001.

[27] D. Verma and M. Meila. A comparison of spectral clustering algorithms. UW CSE T.R., 2003.

[28] D. Wallace. Everettian Rationality: defending Deutsch's approach to probability in the Everett interpretation. Studies in the History and Philosophy of Modern Physics 34, 2003.

[29] Y. Weiss. Segmentation using eigenvectors: a unifying view. In ICCV, 1999.

[30] L. Wolf and A. Shashua. Learning over Sets using Kernel Principal Angles. In JMLR, 4, 2003.

[31] L. Wolf and A. Shashua. Direct feature selection with implicit inference. ICCV, 2003.

[32] L. Wolf, A. Shashua, and S. Mukherjee. Selecting Relevant Genes with a Spectral Approach. AI Memo AIM-2004-002.

[33] W.K. Wootters. Statistical distance and Hilbert space. In Physical Review D, 1991.

[34] Yale Univ. Face Database. Available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html

[35] S. Zhong and J. Ghosh. A Unified Framework for Model-based Clustering. JMLR 4, 2003.
